SNN N! Ordering Space
A proof-of-concept framework for exploiting the factorial representational space of spike temporal ordering in Spiking Neural Networks.
Symbol Table
All symbols used throughout this document, with types and domains.
Formula Sheet
torch.no_grad(); gradients flow through $\log P(\pi^{(k)} \mid \hat{\mathbf{a}})$ only.
Full Pipeline
The complete forward pass, from input current to classification loss.
The gradient flows backwards through every step. The only discrete operation, the Gumbel-max argsort for Monte Carlo sampling, is placed inside torch.no_grad(); gradients pass through the log-probability computation instead.
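The split between discrete sampling and differentiable scoring can be sketched in plain Python. This is a minimal illustration, not the project's implementation: it assumes the scores $\hat{\mathbf{a}}$ act as Plackett-Luce logits with the highest score ranked first, and the names pl_log_prob and gumbel_sample_perm are hypothetical. In a PyTorch pipeline the sampler would sit inside torch.no_grad(), while the log-probability would be computed on tensors so that autograd can differentiate it.

```python
import itertools
import math
import random


def pl_log_prob(scores, perm):
    """Log-probability of ordering `perm` under a Plackett-Luce model.

    Assumption: `scores` are logits and `perm[0]` is the item ranked first.
    """
    remaining = list(perm)
    log_p = 0.0
    for idx in perm:
        # Stable log-sum-exp over the items that have not been placed yet.
        top = max(scores[j] for j in remaining)
        lse = top + math.log(sum(math.exp(scores[j] - top) for j in remaining))
        log_p += scores[idx] - lse
        remaining.remove(idx)
    return log_p


def gumbel_sample_perm(scores, rng):
    """Draw one ordering: add Gumbel(0, 1) noise to each logit, sort descending."""
    noisy = []
    for s in scores:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
        noisy.append(s - math.log(-math.log(u)))
    return tuple(sorted(range(len(scores)), key=lambda i: -noisy[i]))


scores = [0.5, -1.0, 2.0]  # illustrative values only
rng = random.Random(0)

# Discrete step: no gradient is needed (or defined) through the argsort.
samples = [gumbel_sample_perm(scores, rng) for _ in range(5)]

# Smooth step: each log-probability is a differentiable function of `scores`.
log_probs = [pl_log_prob(scores, p) for p in samples]

# Sanity check: the probabilities of all M! orderings sum to one.
total = sum(
    math.exp(pl_log_prob(scores, p))
    for p in itertools.permutations(range(len(scores)))
)
```

Sorting Gumbel-perturbed logits in descending order draws orderings from the same Plackett-Luce distribution that pl_log_prob scores, which is why the two steps can be separated in this way.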
Gradient path
Every factor is non-zero. Prior to this work, the $\partial \mathbf{p}/\partial \hat{\mathbf{a}}$ term was zero almost everywhere because the pipeline used argmax/argsort, which are piecewise constant, instead of the Plackett-Luce formulation.
Component Definitions
State Leaky — cascaded LIF
Two LIF neurons in series. LIF₁ ($\beta_1 = 0.9$) receives the input current and generates an initial sparse spike train. LIF₂ ($\beta_2 = 0.95$; a higher $\beta$ means slower membrane decay) receives LIF₁'s spikes as its input and integrates them over a longer window. The effect is temporal smoothing: spikes that arrive in a burst from LIF₁ are spread across more timesteps by LIF₂, which increases the number of distinct first-spike times and therefore the information content of the ordering.
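The wiring of the cascade can be sketched with a hand-rolled LIF update. This is a minimal sketch under stated assumptions: unit input weights, a threshold of 1.0, and reset by subtraction are choices made here for illustration and are not confirmed by the source. The helper names lif_run and first_spike are hypothetical.

```python
def lif_run(inputs, beta, threshold=1.0):
    """Simulate one LIF neuron and return its 0/1 spike train.

    Assumptions: unit input weight, reset by subtracting the threshold.
    """
    u = 0.0
    spikes = []
    for x in inputs:
        u = beta * u + x               # leaky integration of the input
        s = 1 if u > threshold else 0  # fire when the membrane crosses threshold
        u -= s * threshold             # soft reset after a spike
        spikes.append(s)
    return spikes


def first_spike(spikes):
    """Index of the first spike, or None if the neuron never fires."""
    return next((t for t, s in enumerate(spikes) if s), None)


T = 30
current = [0.3] * T               # illustrative constant input current
s1 = lif_run(current, beta=0.9)   # LIF1: driven by the input current
s2 = lif_run(s1, beta=0.95)       # LIF2: driven by LIF1's spike train
```

Because LIF₂ only sees LIF₁'s spikes, it cannot fire before LIF₁ does, and its slower decay lets it accumulate spikes that arrive several timesteps apart.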
Orthogonal layer — rotation only
Any matrix $\mathbf{W}$ can be decomposed as $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$ (SVD), where the diagonal $\mathbf{\Sigma}$ controls the scaling along each direction. A standard nn.Linear layer learns both rotation and scaling, and scaling can amplify current differences to the point where the ordering collapses into a binary on/off code. The orthogonal layer forces $\mathbf{\Sigma} = I$ by parameterising only the rotational component via a skew-symmetric generator $\mathbf{S}$ and the Cayley map. Norms and pairwise distances between current vectors are therefore preserved: if two inputs differ slightly before the layer, they differ by exactly the same amount after it, so small current differences are neither amplified nor squashed.
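A short NumPy sketch of the Cayley construction is given here for concreteness. It uses one common form of the map, $\mathbf{W} = (I + \mathbf{S})^{-1}(I - \mathbf{S})$ with $\mathbf{S} = \mathbf{A} - \mathbf{A}^\top$; the project's exact parameterisation is not shown in the source, so the function name cayley_orthogonal and the unconstrained matrix A are assumptions.

```python
import numpy as np


def cayley_orthogonal(A):
    """Map an unconstrained square matrix A to an orthogonal matrix.

    S = A - A^T is skew-symmetric, so I + S is always invertible and
    W = (I + S)^{-1} (I - S) is orthogonal.
    """
    S = A - A.T
    I = np.eye(A.shape[0])
    return np.linalg.solve(I + S, I - S)


rng = np.random.default_rng(0)
A = rng.normal(size=(8, 8))   # stands in for the learnable parameters
W = cayley_orthogonal(A)

# Orthogonality error, the same quantity reported as "orth error" in Exp B.
orth_error = np.linalg.norm(W.T @ W - np.eye(8), ord="fro")

# An orthogonal map leaves the norm of the current vector unchanged.
x = rng.normal(size=8)
norm_before, norm_after = np.linalg.norm(x), np.linalg.norm(W @ x)
```

Gradient descent would update A freely while W stays orthogonal by construction, which is the point of parameterising the generator rather than the weight matrix itself.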
Recurrent connections on LIF₂
A learnable $M \times M$ weight matrix $\mathbf{W}_\text{rec}$ is added to LIF₂'s input. When $W_{\text{rec},ij} < 0$, a spike from neuron $j$ at $t-1$ suppresses neuron $i$ at $t$. This implements soft winner-take-all dynamics: early-firing neurons inhibit later ones, pushing spike times towards the extremes of the time axis. The result is a higher first-spike-time standard deviation (fst-std) and better ordering separation under large-$C$ pressure.
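The suppression mechanism can be illustrated with two neurons and a single inhibitory weight. This is a sketch under assumptions made here (threshold 1.0, reset by subtraction, constant input currents, hand-picked weights); the names run_recurrent_lif and first_spike are hypothetical and the values are not taken from the experiments.

```python
def run_recurrent_lif(inputs, w_rec, beta=0.95, threshold=1.0):
    """Simulate M LIF neurons whose input at t includes w_rec @ spikes[t-1]."""
    m = len(w_rec)
    u = [0.0] * m
    prev = [0] * m
    train = []
    for x in inputs:
        # Recurrent drive: row i collects the previous spikes of all neurons j.
        rec = [sum(w_rec[i][j] * prev[j] for j in range(m)) for i in range(m)]
        u = [beta * u[i] + x[i] + rec[i] for i in range(m)]
        s = [1 if u[i] > threshold else 0 for i in range(m)]
        u = [u[i] - s[i] * threshold for i in range(m)]  # reset by subtraction
        train.append(s)
        prev = s
    return train


def first_spike(train, neuron):
    """First timestep at which `neuron` fires, or None if it stays silent."""
    return next((t for t, s in enumerate(train) if s[neuron]), None)


T = 20
inputs = [[0.6, 0.5]] * T               # neuron 0 gets slightly more current
no_rec = [[0.0, 0.0], [0.0, 0.0]]
inhibit = [[0.0, 0.0], [-1.0, 0.0]]     # W_rec[1][0] < 0: neuron 0 suppresses neuron 1

free = run_recurrent_lif(inputs, no_rec)
competed = run_recurrent_lif(inputs, inhibit)
```

Comparing first_spike(free, 1) with first_spike(competed, 1) shows the later neuron being pushed back (or silenced) once the earlier neuron has fired, while neuron 0 is unaffected because its row of W_rec is zero.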
Proof Chain
Four mutually reinforcing experiments establishing that spike ordering is the primary information carrier.
Training a model on direct-current inputs with soft first-spike scoring, then randomly shuffling the time axis of $S$ at evaluation, causes accuracy to drop from 1.000 to 0.198. The model cannot recover by relying on firing rate or other non-temporal features.
Across six values of $C$ from 24 to 10,000, E-drop tracks accuracy closely. The residual, acc $-$ E-drop, is the accuracy that survives the time-shuffle; it shrinks from 0.236 at $C = 24$ to 0.002 at $C = 10{,}000$, falling towards chance level ($1/C$) as $C$ grows. This confirms there is no secondary pathway the model can exploit.
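The time-shuffle test can be sketched on a toy spike matrix. The sketch assumes that $S$ has shape $T \times M$ and that one random permutation of the timesteps is applied to all neurons at once; whether the experiments shuffle jointly or per neuron is not stated in the source. All function names are hypothetical.

```python
import random


def first_spike_times(S):
    """First-spike time of each neuron in a T x M 0/1 spike matrix (T if silent)."""
    T, M = len(S), len(S[0])
    return [next((t for t in range(T) if S[t][i]), T) for i in range(M)]


def time_shuffle(S, rng):
    """Permute the rows (timesteps) of S; every neuron sees the same permutation."""
    rows = [row[:] for row in S]
    rng.shuffle(rows)
    return rows


def spike_counts(S):
    """Total number of spikes per neuron, a rate-style feature."""
    return [sum(col) for col in zip(*S)]


# Toy spike matrix: T = 6 timesteps, M = 3 neurons.
S = [
    [0, 0, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
]
S_shuffled = time_shuffle(S, random.Random(0))

times_before = first_spike_times(S)
times_after = first_spike_times(S_shuffled)
```

The shuffle leaves spike_counts unchanged by construction, so any accuracy lost after shuffling cannot be attributed to firing rate; only the temporal arrangement has been altered.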
The Ordering-Invariant Dataset holds each class's ordering fixed while randomly re-sampling absolute spike times for every example. Within-class first-spike standard deviation is 1.96 (large), yet ordering accuracy reaches 1.000. The model learns the permutation, not the time values.
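One way to generate such examples is sketched here: draw random distinct first-spike times, then assign them to neurons so that the class ordering is respected. The convention that perm[k] is the neuron firing k-th, the toy sizes, and the function names are assumptions made for illustration.

```python
import random


def sample_times_with_ordering(perm, T, rng):
    """Draw random distinct first-spike times whose rank order matches `perm`.

    Assumption: `perm[k]` is the index of the neuron that fires k-th.
    """
    slots = sorted(rng.sample(range(T), len(perm)))  # distinct, increasing times
    times = [0] * len(perm)
    for k, neuron in enumerate(perm):
        times[neuron] = slots[k]
    return times


def ordering_of(times):
    """Neuron indices sorted by first-spike time, earliest first."""
    return tuple(sorted(range(len(times)), key=lambda i: times[i]))


rng = random.Random(0)
class_perm = (2, 0, 3, 1)  # illustrative class ordering for M = 4 neurons
examples = [sample_times_with_ordering(class_perm, T=20, rng=rng) for _ in range(5)]
```

Every example shares the same ordering_of value while its absolute times vary, so a model that classifies these correctly must be reading the permutation rather than the time values.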
The Ordering-Preserving Shuffle changes all absolute spike times while holding the permutation $\pi$ constant. Without soft rank normalisation, OPS-drop = 0.228 because the PL distribution's concentration changes. With soft rank ($\tau = 0.01$), which discards magnitude information from $\mathbf{a}$, OPS-drop falls to 0.017 — a 13× reduction — while accuracy remains at 1.000.
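The effect of $\tau$ can be illustrated with a pairwise-sigmoid soft rank, $r_i = \sum_{j \ne i} \sigma\big((a_i - a_j)/\tau\big)$. This is one common soft-rank formulation, swapped in here for illustration; the source does not specify which soft-rank operator the project uses. The two score vectors below share an ordering but differ in magnitude.

```python
import math


def soft_rank(a, tau):
    """Pairwise-sigmoid soft rank: r_i = sum_{j != i} sigmoid((a_i - a_j) / tau)."""
    def sigmoid(z):
        # Numerically stable logistic function.
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        e = math.exp(z)
        return e / (1.0 + e)

    return [
        sum(sigmoid((ai - aj) / tau) for j, aj in enumerate(a) if j != i)
        for i, ai in enumerate(a)
    ]


def max_gap(u, v):
    """Largest element-wise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(u, v))


a1 = [0.2, 1.5, 0.9]  # same ordering as a2 ...
a2 = [0.1, 4.0, 3.0]  # ... but very different magnitudes

gap_sharp = max_gap(soft_rank(a1, 0.01), soft_rank(a2, 0.01))  # small tau
gap_smooth = max_gap(soft_rank(a1, 1.0), soft_rank(a2, 1.0))   # large tau
```

With a small $\tau$ the two vectors map to almost identical ranks, so downstream scores depend on the ordering only; with $\tau = 1.0$ the ranks still carry magnitude information, in line with the trend of the $\tau$ sweep.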
Experimental Results
Exp A — capacity scaling ($M=8$, $M! = 40{,}320$)
| $C$ | $M!/C$ | mean acc | mean E-drop | acc $-$ E-drop |
|---|---|---|---|---|
| 24 | 1,680 | 1.000 | 0.764 | +0.236 |
| 100 | 403 | 0.978 | 0.885 | +0.093 |
| 500 | 81 | 0.856 | 0.830 | +0.026 |
| 1,000 | 40 | 0.733 | 0.717 | +0.016 |
| 5,000 | 8 | 0.310 | 0.307 | +0.003 |
| 10,000 | 4 | 0.162 | 0.160 | +0.002 ≈ 1/C |
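The $M!/C$ column follows directly from the factorial. A small sketch reproduces it (rounded to the nearest integer, as in the table) and computes the number of bits carried by one full ordering of $M = 8$ neurons; the helper name capacity_ratio is hypothetical.

```python
import math


def capacity_ratio(M, C):
    """Distinct orderings of M neurons available per class: M! / C."""
    return math.factorial(M) / C


M = 8
n_orderings = math.factorial(M)          # 40,320 possible orderings
ordering_bits = math.log2(n_orderings)   # information in one full ordering

ratios = {
    C: round(capacity_ratio(M, C))
    for C in (24, 100, 500, 1000, 5000, 10000)
}
```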
Exp B — input layer type ($M=8$, $C=24$)
| fc type | acc | E-drop | orth error $\lVert \mathbf{W}^\top\mathbf{W} - I \rVert_F$ |
|---|---|---|---|
| None (direct) | 1.000 | 0.744 | — |
| Standard Linear | 0.860 | 0.051 | — |
| Orthogonal (Cayley) | 0.923 | 0.420 | 0.000 |
Exp Capacity — SNN-Base vs SNN-Rec ($M=8$)
| $C$ | Base acc | Rec acc | Rec $-$ Base | Base collapse | Rec collapse |
|---|---|---|---|---|---|
| 128 | 0.620 | 0.622 | +0.002 | YES | YES |
| 256 | 0.401 | 0.467 | +0.066 | YES | YES |
| 512 | 0.241 | 0.320 | +0.079 | YES | YES |
| 1,024 | 0.142 | 0.192 | +0.050 | 2/3 NO | YES |
| 2,048 | 0.078 | 0.108 | +0.030 | YES | YES |
OPS + soft rank — tau sweep
| $\tau$ | acc | OPS-drop | vs baseline |
|---|---|---|---|
| baseline (no rank) | 1.000 | 0.228 | — |
| 1.0 | 0.999 | 0.341 | worse |
| 0.1 | 1.000 | 0.113 | −0.115 |
| 0.01 | 1.000 | 0.017 | −0.211 (13×) |
Confirmed Findings
Open Questions
- Q1 Hierarchical readout. Can replacing the flat PL readout with a hierarchical one (first reading which neuron fires first, then second, etc.) allow generalisation to unseen permutations? fn-unseen = 0.670 suggests the SNN part is ready; the bottleneck is the readout interface. Status: Partial.
- Q2 Ordering embedding. Can a learned embedding $\pi \mapsto \mathbf{e}_\pi \in \mathbb{R}^d$ encode geometric structure so that similar permutations are close? This would allow $\mathbf{z} = \sum_\pi p_\pi \mathbf{e}_\pi$ to carry both ordering identity and neighbourhood information. Status: Open.
- Q3 Optimal $w$ and $\tau$. Current values ($w = 0.5$, $\tau = 0.01$) were chosen empirically. What is the theoretical relationship between $w$, $T$, and the gradient magnitude reaching early vs late timesteps? Is there a principled choice? Status: Open.
- Q4 FFN + Orthogonal + SNN on real data. The orthogonal layer enables FFN encoders to coexist with ordering. Does this hold on high-dimensional noisy inputs (e.g., time-series, genomics)? Does E-drop survive a deep encoder? Status: Open.
- Q5 Capacity threshold. Degradation begins at $M!/C \approx 8$. Is this threshold architecture-dependent or a fundamental property of the Plackett-Luce estimation with $K$ samples? Does recurrent competition shift it? Status: Partial.
- Q6 Ordering autoencoder. If a decoder can reconstruct the input $\mathbf{x}$ from the ordering alone (with no class supervision), it would prove ordering carries the full input signal. This remains the strongest possible proof of the core claim. Status: Open.
- Q7 $W_\text{rec}$ topology. Does the learned recurrent weight matrix exhibit predominantly inhibitory off-diagonal entries (winner-take-all), and does this resemble known biological competition circuits? What is the effect of initialising $W_\text{rec}$ as inhibitory? Status: Open.