§1

Symbol Table

All symbols used throughout this document, with types and domains.

M Number of neurons Scalar · ℤ₊
The number of LIF neurons in the SNN layer. Determines the size of the ordering space: $M$ neurons can produce $M!$ distinct spike orderings.
T Number of timesteps Scalar · ℤ₊
The simulation length of the SNN. Should satisfy $T \gg M$ to allow all neurons sufficient time to fire at distinct timesteps, maximising ordering information.
C Number of classes Scalar · ℤ₊
The number of output categories. The ratio $M!/C$ determines how much ordering space is available per class. Capacity degrades when $M!/C \lesssim 8$.
$\beta$ Membrane decay rate Scalar · (0,1)
The leakage coefficient of the LIF neuron. Higher $\beta$ means slower decay (longer integration window). LIF₁ uses $\beta_1 = 0.9$, LIF₂ uses $\beta_2 = 0.95$.
$\theta$ Firing threshold Scalar · ℝ₊
The membrane potential value above which the neuron fires a spike. Default $\theta = 1.0$ in snntorch.
$\mathbf{cur}$ Input current vector Vector · ℝ^M
The input driving current for each neuron. In the direct-input architecture, this is fed straight to LIF₁ without any linear transformation. The $i$-th entry $\text{cur}_i$ determines how quickly neuron $i$ reaches threshold.
$S$ Spike matrix Matrix · {0,1}^{M×T}
The binary output of Pass 1. Entry $S_{i,t} = 1$ if and only if neuron $i$ fires at timestep $t$. This is the uncompressed representation preserving all temporal information.
$t_i^*$ First spike time of neuron $i$ Scalar · {0,…,T}
The timestep at which neuron $i$ fires for the first time: $t_i^* = \min\{t : S_{i,t} = 1\}$. Assigned $T$ if the neuron never fires. This is the key quantity encoding ordering information.
$\pi$ Spike ordering (permutation) Vector · S_M
The permutation of neuron indices ranked by their first spike times: $\pi = \text{argsort}([t_0^*, t_1^*, \ldots, t_{M-1}^*])$. There are $M!$ possible orderings. This is the core representational object of this research.
$w$ Temporal decay rate Scalar · ℝ₊
Controls the exponential weighting of spike times in the soft first-spike score. Larger $w$ emphasises early spikes more strongly; the limit $w \to \infty$ approximates argmin. Default $w = 0.5$.
$M_{i,t}$ First-spike mask Scalar · {0,1}
A binary mask isolating the first spike of neuron $i$. Equals 1 at time $t$ if and only if neuron $i$ fires at $t$ and has not fired at any earlier timestep. Used to make the first-spike score differentiable.
$a_i$ Soft first-spike score Scalar · ℝ₊
A differentiable approximation of $e^{-w \cdot t_i^*}$. Computed from $S$ via the first-spike mask. Larger $a_i$ means neuron $i$ fired earlier. Gradient $\partial a_i / \partial S_{i,t} = e^{-wt} \neq 0$ everywhere.
$\hat{a}_i$ Soft rank score Scalar · [0,1]
The rank-normalised version of $a_i$. Represents how many other neurons neuron $i$ fired before. Contains only ordinal information — the absolute magnitude of $a_i$ is discarded. Invariant to ordering-preserving rescaling of spike times.
$\tau$ Soft rank temperature Scalar · ℝ₊
Controls the sharpness of the sigmoid in the soft rank formula. $\tau \to 0$ recovers exact rank (zero gradient); $\tau \to \infty$ gives uniform output (no ordinal information). Default $\tau = 0.01$ for OPS invariance.
$\mathbf{p}$ Plackett-Luce distribution Vector · Δ^{M!-1}
A probability distribution over all $M!$ permutations, parameterised by the score vector. Entry $p_\pi = P(\pi \mid \hat{\mathbf{a}})$ is the probability of observing ordering $\pi$ given the soft rank scores. For $M > 6$, approximated by Monte Carlo with $K$ samples.
$K$ Monte Carlo sample count Scalar · ℤ₊
Number of permutations sampled from the Plackett-Luce distribution via the Gumbel-max trick when $M > 6$ makes exact enumeration intractable. Default $K = 512$. The readout dimension equals $K$ in Monte Carlo mode.
$\mathbf{W}_\text{out}$ Readout weight matrix Matrix · ℝ^{C×K}
The sole learnable component outside the SNN. Maps the Plackett-Luce distribution $\mathbf{p}$ to class logits. Dimension is $C \times M!$ for exact mode and $C \times K$ for Monte Carlo mode.
$\mathbf{S}$ Skew-symmetric generator Matrix · ℝ^{M×M}, $S_{ij} = -S_{ji}$
The learnable parameter of the orthogonal layer, constrained to be skew-symmetric ($\mathbf{S} = -\mathbf{S}^T$). The orthogonal weight matrix $\mathbf{W}$ is derived from $\mathbf{S}$ via the Cayley map. The upper triangle of $\mathbf{S}$ contains the free parameters.
$\sigma$ Logistic sigmoid Function · ℝ → (0,1)
The standard sigmoid function $\sigma(x) = 1/(1+e^{-x})$. Used in the soft rank formula to approximate the step function $\mathbf{1}[a_i > a_j]$ with a smooth, differentiable surrogate.
§2

Formula Sheet

LIF Dynamics Membrane update & spike
$$\text{mem}_i[t] = \beta \cdot \text{mem}_i[t-1] + \text{input}_i[t]$$ $$S_{i,t} = \mathbf{1}\!\left[\text{mem}_i[t] > \theta\right]$$
Surrogate gradient approximates $\partial S / \partial \text{mem}$ during backpropagation.
First-Spike Mask Isolate first event
$$M_{i,t} = S_{i,t} \cdot \prod_{s=0}^{t-1}(1 - S_{i,s})$$
$M_{i,t} = 1$ iff neuron $i$ fires at $t$ and has not fired before. Gradient: $\partial M_{i,t}/\partial S_{i,s} \neq 0$.
Soft First-Spike Score Differentiable timing
$$a_i = \sum_{t=0}^{T-1} M_{i,t} \cdot e^{-w \cdot t}$$
Approximates $e^{-w \cdot t_i^*}$. Gradient: $\partial a_i / \partial S_{i,t} = e^{-wt} \cdot \prod_{s < t}(1 - S_{i,s}) \neq 0$.
Soft Rank Normalisation Discard magnitude
$$\hat{a}_i = \frac{1}{M-1}\sum_{j \neq i} \sigma\!\left(\frac{a_i - a_j}{\tau}\right)$$
$\hat{a}_i \in [0,1]$. Invariant to any order-preserving transformation of $\mathbf{a}$. As $\tau \to 0$, $\hat{a}_i \to \text{rank}(a_i)/(M-1)$.
Plackett-Luce Distribution over $S_M$
$$\log P(\pi \mid \hat{\mathbf{a}}) = \sum_{k=1}^{M} \left(\hat{a}_{\pi_k} - \log \sum_{j=k}^{M} e^{\hat{a}_{\pi_j}}\right)$$
Computed in log-space for numerical stability. Final $\mathbf{p} = \text{softmax}([\log P(\pi_1|\hat{\mathbf{a}}), \ldots])$ over all $M!$ permutations.
Gumbel-Max Sampling Monte Carlo PL
$$\pi^{(k)} = \text{argsort}\!\left(-(\hat{\mathbf{a}} + \mathbf{g}^{(k)})\right), \quad \mathbf{g}^{(k)} \sim \text{Gumbel}(0,1)^M$$
Used when $M > 6$. argsort is inside torch.no_grad(); gradients flow through log $P(\pi^{(k)} | \hat{\mathbf{a}})$ only.
Readout & Loss Classification
$$\text{logits} = \mathbf{W}_\text{out}\,\mathbf{p} + \mathbf{b}$$ $$\mathcal{L} = -\log \frac{e^{\text{logits}_y}}{\sum_{c} e^{\text{logits}_c}}$$
Cross-entropy on the true class $y$. Gradient flows back through $\mathbf{p} \to \hat{\mathbf{a}} \to \mathbf{a} \to S \to \text{mem} \to \mathbf{cur}$.
Orthogonal Layer Cayley map
$$\mathbf{W} = (I + \mathbf{S})^{-1}(I - \mathbf{S}), \quad \mathbf{S} = -\mathbf{S}^\top$$ $$\mathbf{W}^\top \mathbf{W} = I \;\Rightarrow\; \|\mathbf{W}\mathbf{x}\| = \|\mathbf{x}\|$$
Learned parameter is $\mathbf{S}$ (unconstrained upper triangle). $\mathbf{W}$ is derived — rotation only, no scaling. Preserves relative current magnitudes.
Recurrent Input LIF₂ with competition
$$\text{input\_to\_lif2}_{i,t} = S_{i,t}^{(1)} + \sum_{j} W_{\text{rec},ij} \cdot S_{j,t-1}^{(2)}$$
$W_\text{rec} \in \mathbb{R}^{M \times M}$, learnable. When off-diagonal entries are negative, earlier-firing neurons inhibit later ones (winner-take-all).
E-drop Temporal dependence metric
$$\text{E-drop} = \text{acc}(\text{normal}) - \text{acc}(\text{time-shuffled})$$
Time-shuffle permutes the $T$ axis of $S$ uniformly at random, destroying first-spike timing while preserving spike counts. High E-drop confirms temporal ordering is the primary information source.
OPS drop Ordering vs timing
$$\text{OPS-drop} = \text{acc}(\text{normal}) - \text{acc}(\text{order-preserving shuffle})$$
OPS re-samples spike times from a new random set while preserving $\pi$. Low OPS-drop means the model uses ordering, not absolute timing values.
fst-std Ordering separation
$$\text{fst-std} = \frac{1}{M}\sum_{i=1}^{M} \text{std}_{c}\!\left(\mathbb{E}\!\left[t_i^* \mid \text{class} = c\right]\right)$$
Average standard deviation of per-neuron first spike times across classes. Higher values indicate greater ordering separation between classes.
§3

Full Pipeline

The complete forward pass, from input current to classification loss.

$$\underbrace{\mathbf{cur}}_{\mathbb{R}^M} \;\xrightarrow{\text{LIF}_1,\,\text{LIF}_2}\; \underbrace{S}_{\{0,1\}^{M \times T}} \;\xrightarrow{\text{mask}} \;\underbrace{a_i}_{\mathbb{R}_+^M} \;\xrightarrow{\text{rank}}\; \underbrace{\hat{a}_i}_{[0,1]^M} \;\xrightarrow{\text{PL}}\; \underbrace{\mathbf{p}}_{\Delta^{K-1}} \;\xrightarrow{\mathbf{W}_\text{out}}\; \underbrace{\text{logits}}_{\mathbb{R}^C} \;\xrightarrow{\mathcal{L}_\text{CE}}\; \text{loss}$$

The gradient flows backwards through every step. The only discrete operation, the Gumbel-max argsort for Monte Carlo sampling, is placed inside torch.no_grad(); gradients pass through the log-probability computation instead.

Gradient path

$$\frac{\partial \mathcal{L}}{\partial \mathbf{cur}} = \frac{\partial \mathcal{L}}{\partial \mathbf{p}} \cdot \frac{\partial \mathbf{p}}{\partial \hat{\mathbf{a}}} \cdot \frac{\partial \hat{\mathbf{a}}}{\partial \mathbf{a}} \cdot \frac{\partial \mathbf{a}}{\partial S} \cdot \underbrace{\frac{\partial S}{\partial \text{mem}}}_{\text{surrogate}} \cdot \frac{\partial \text{mem}}{\partial \mathbf{cur}}$$

Every factor is non-zero. Prior to this work, the $\partial \mathbf{p}/\partial \hat{\mathbf{a}}$ term was zero because the pipeline used argmax/argsort instead of the Plackett-Luce formulation.

§4

Component Definitions

State Leaky — cascaded LIF

Two LIF neurons in series. LIF₁ ($\beta_1 = 0.9$) receives the input current and generates an initial sparse spike train. LIF₂ ($\beta_2 = 0.95$, higher $\beta$ = slower decay) receives LIF₁'s spikes as its input, integrating them over a longer window. The effect is temporal smoothing: spikes that arrive in a burst from LIF₁ are spread across more timesteps by LIF₂, increasing the number of distinct first-spike times and therefore the information content of the ordering.

Orthogonal layer — rotation only

Any matrix $\mathbf{W}$ can be decomposed as $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$ (SVD). The diagonal $\mathbf{\Sigma}$ controls scaling per direction. A standard nn.Linear layer learns both rotation and scaling; scaling can amplify current differences to the point where ordering collapses into a binary on/off code. The orthogonal layer forces $\mathbf{\Sigma} = I$ by parameterising only the rotational component via a skew-symmetric generator $\mathbf{S}$ and the Cayley map. Relative current magnitudes are preserved: if neuron 2 has a slightly higher current than neuron 0 before the layer, it will still have a slightly higher current after.

Recurrent connections on LIF₂

A learnable $M \times M$ weight matrix $\mathbf{W}_\text{rec}$ is added to LIF₂'s input. When $W_{\text{rec},ij} < 0$, a spike from neuron $j$ at $t-1$ suppresses neuron $i$ at $t$. This implements soft winner-take-all dynamics: early-firing neurons inhibit later ones, pushing spike times towards the extremes of the time axis. The result is higher fst-std and better ordering separation under large-$C$ pressure.

§5

Proof Chain

Four mutually reinforcing experiments establishing that spike ordering is the primary information carrier.

01
Temporal information is used — E-drop test

Training a model on direct-current inputs with soft first-spike scoring, then randomly shuffling the time axis of $S$ at evaluation, causes accuracy to drop from 1.000 to 0.198. The model cannot recover by relying on firing rate or other non-temporal features.

E-drop = 0.744 · ratio between/within = 18.6 · acc = 1.000
02
No other information source — acc ≈ E-drop across all $C$

Across six values of $C$ from 24 to 10,000, accuracy and E-drop remain nearly equal. After time-shuffle, accuracy falls to approximately $1/C$ (random chance). This confirms there is no secondary pathway the model can exploit.

shuffle acc ≈ 1/C = 0.0001 at C=10,000 · verified: 0.002 ≈ 1/C
03
Ordering is learned, not absolute timing — OID test

The Ordering-Invariant Dataset holds each class's ordering fixed while randomly re-sampling absolute spike times for every example. Within-class first-spike standard deviation is 1.96 (large), yet ordering accuracy reaches 1.000. The model learns the permutation, not the time values.

ordering_acc = 1.000 · within_fst_std = 1.96 · acc = 0.793
04
Only ordering matters at evaluation — OPS + soft rank test

The Ordering-Preserving Shuffle changes all absolute spike times while holding the permutation $\pi$ constant. Without soft rank normalisation, OPS-drop = 0.228 because the PL distribution's concentration changes. With soft rank ($\tau = 0.01$), which discards magnitude information from $\mathbf{a}$, OPS-drop falls to 0.017 — a 13× reduction — while accuracy remains at 1.000.

OPS-drop (baseline) = 0.228 · OPS-drop (τ=0.01) = 0.017 · acc = 1.000
§6

Experimental Results

Exp A — capacity scaling ($M=8$, $M! = 40{,}320$)

$C$ $M!/C$ mean acc mean E-drop acc $-$ E-drop
241,6801.0000.764+0.236
1004030.9780.885+0.093
500810.8560.830+0.026
1,000400.7330.717+0.016
5,00080.3100.307+0.003
10,00040.1620.160+0.002 ≈ 1/C
As $C$ grows, acc $\approx$ E-drop, confirming ordering is the sole information channel.

Exp B — input layer type ($M=8$, $C=24$)

fc typeaccE-droporth error $\|\mathbf{W}^\top\mathbf{W} - I\|_F$
None (direct)1.0000.744
Standard Linear0.8600.051
Orthogonal (Cayley)0.9230.4200.000
A standard linear layer destroys ordering (E-drop 0.744→0.051). The orthogonal layer retains it (0.420).

Exp Capacity — SNN-Base vs SNN-Rec ($M=8$)

$C$Base accRec accRec $-$ BaseBase collapseRec collapse
1280.6200.622+0.002YESYES
2560.4010.467+0.066YESYES
5120.2410.320+0.079YESYES
1,0240.1420.192+0.0502/3 NOYES
2,0480.0780.108+0.030YESYES
Recurrent advantage peaks at C=512 (+0.079). Rec maintains binary collapse and higher fst-std under pressure.

OPS + soft rank — tau sweep

$\tau$accOPS-dropvs baseline
baseline (no rank)1.0000.228
1.00.9990.341worse
0.11.0000.113−0.115
0.011.0000.017−0.211 (13×)
$\tau = 0.01$ reduces OPS-drop by 13× while maintaining perfect accuracy.
§7

Confirmed Findings

F1SNN can learn spike ordering
With direct current input and soft first-spike scoring, the SNN exploits temporal spike ordering as the primary information channel. E-drop = 0.744; time-shuffle reduces accuracy to near random chance.
F2$M!$ is the real capacity bound
Accuracy and E-drop remain coupled across all tested $C$. Capacity degrades gracefully when $M!/C \lesssim 8$, confirming the factorial space is the operative constraint, not any architecture artefact.
F3Any linear layer collapses ordering
An 8×8 standard linear layer before the SNN drops E-drop from 0.744 to 0.051. Scaling of currents — even slight — allows the model to use a binary on/off code instead of ordering.
F4Orthogonal layer preserves ordering
Constraining the pre-SNN layer to be orthogonal via Cayley parameterisation maintains E-drop at 0.420. Orthogonal accuracy (0.923) exceeds standard linear (0.860), suggesting ordering is not merely preserved but beneficial.
F5Recurrent connections improve separation
LIF₂-level recurrent connections improve accuracy at all tested $C$ values. The advantage peaks at $C=512$ (+0.079). Recurrent winner-take-all dynamics maintain higher fst-std (9.49 vs 8.84) under large-$C$ pressure.
F6Soft rank eliminates magnitude leakage
OPS-drop measures whether the model depends on absolute spike time values beyond their ordinal structure. Without soft rank: OPS-drop = 0.228. With $\tau = 0.01$: OPS-drop = 0.017 (13× reduction). Accuracy is maintained at 1.000.
F7Geometric structure is partially learned
In the geo-gen experiment, fn-unseen = 0.670 (vs chance 0.250): the model correctly identifies the first-firing neuron for orderings it has never seen. The SNN generalises the concept "neuron $k$ fires first" without having been trained on all permutations containing it.
§8

Open Questions

  • Q1 Hierarchical readout. Can replacing the flat PL readout with a hierarchical one (first reading which neuron fires first, then second, etc.) allow generalisation to unseen permutations? fn-unseen = 0.670 suggests the SNN part is ready; the bottleneck is the readout interface. Partial
  • Q2 Ordering embedding. Can a learned embedding $\pi \mapsto \mathbf{e}_\pi \in \mathbb{R}^d$ encode geometric structure so that similar permutations are close? This would allow $\mathbf{z} = \sum_\pi p_\pi \mathbf{e}_\pi$ to carry both ordering identity and neighbourhood information. Open
  • Q3 Optimal $w$ and $\tau$. Current values ($w = 0.5$, $\tau = 0.01$) were chosen empirically. What is the theoretical relationship between $w$, $T$, and the gradient magnitude reaching early vs late timesteps? Is there a principled choice? Open
  • Q4 FFN + Orthogonal + SNN on real data. The orthogonal layer enables FFN encoders to coexist with ordering. Does this hold on high-dimensional noisy inputs (e.g., time-series, genomics)? Does E-drop survive a deep encoder? Open
  • Q5 Capacity threshold. Degradation begins at $M!/C \approx 8$. Is this threshold architecture-dependent or a fundamental property of the Plackett-Luce estimation with $K$ samples? Does recurrent competition shift it? Partial
  • Q6 Ordering autoencoder. If a decoder can reconstruct the input $\mathbf{x}$ from the ordering alone (with no class supervision), it would prove ordering carries the full input signal. This remains the strongest possible proof of the core claim. Open
  • Q7 $W_\text{rec}$ topology. Does the learned recurrent weight matrix exhibit predominantly inhibitory off-diagonal entries (winner-take-all), and does this resemble known biological competition circuits? What is the effect of initialising $W_\text{rec}$ as inhibitory? Open