Research Checkpoint · June 2026

SNN N! Ordering Space

A proof-of-concept framework for exploiting the factorial representational space of spike temporal ordering in Spiking Neural Networks.

Fu-Chun Yang UCSC / Relay Labs M=8 · M!=40,320 8× H100 80GB

§1

Symbol Table

All symbols used throughout this document, with types and domains.

M Number of neurons Scalar · ℤ₊

The number of LIF neurons in the SNN layer. Determines the size of the ordering space: $M$ neurons can produce $M!$ distinct spike orderings.

T Number of timesteps Scalar · ℤ₊

The simulation length of the SNN. Should satisfy $T \gg M$ to allow all neurons sufficient time to fire at distinct timesteps, maximising ordering information.

C Number of classes Scalar · ℤ₊

The number of output categories. The ratio $M!/C$ determines how much ordering space is available per class. Capacity degrades when $M!/C \lesssim 8$.

$\beta$ Membrane decay rate Scalar · (0,1)

The leakage coefficient of the LIF neuron. Higher $\beta$ means slower decay (longer integration window). LIF₁ uses $\beta_1 = 0.9$, LIF₂ uses $\beta_2 = 0.95$.

$\theta$ Firing threshold Scalar · ℝ₊

The membrane potential value above which the neuron fires a spike. Default $\theta = 1.0$ in snntorch.

$\mathbf{cur}$ Input current vector Vector · ℝ^M

The input driving current for each neuron. In the direct-input architecture, this is fed straight to LIF₁ without any linear transformation. The $i$-th entry $\text{cur}_i$ determines how quickly neuron $i$ reaches threshold.

$S$ Spike matrix Matrix · {0,1}^{M×T}

The binary output of Pass 1. Entry $S_{i,t} = 1$ if and only if neuron $i$ fires at timestep $t$. This is the uncompressed representation preserving all temporal information.

$t_i^*$ First spike time of neuron $i$ Scalar · {0,…,T}

The timestep at which neuron $i$ fires for the first time: $t_i^* = \min\{t : S_{i,t} = 1\}$. Assigned $T$ if the neuron never fires. This is the key quantity encoding ordering information.

$\pi$ Spike ordering (permutation) Vector · S_M

The permutation of neuron indices ranked by their first spike times: $\pi = \text{argsort}([t_0^*, t_1^*, \ldots, t_{M-1}^*])$. There are $M!$ possible orderings. This is the core representational object of this research.

$w$ Temporal decay rate Scalar · ℝ₊

Controls the exponential weighting of spike times in the soft first-spike score. Larger $w$ emphasises early spikes more strongly; the limit $w \to \infty$ approximates argmin. Default $w = 0.5$.

$M_{i,t}$ First-spike mask Scalar · {0,1}

A binary mask isolating the first spike of neuron $i$. Equals 1 at time $t$ if and only if neuron $i$ fires at $t$ and has not fired at any earlier timestep. Used to make the first-spike score differentiable.

$a_i$ Soft first-spike score Scalar · ℝ₊

A differentiable approximation of $e^{-w \cdot t_i^*}$. Computed from $S$ via the first-spike mask. Larger $a_i$ means neuron $i$ fired earlier. Gradient $\partial a_i / \partial S_{i,t} = e^{-wt} \neq 0$ everywhere.

$\hat{a}_i$ Soft rank score Scalar · [0,1]

The rank-normalised version of $a_i$. Represents how many other neurons neuron $i$ fired before. Contains only ordinal information — the absolute magnitude of $a_i$ is discarded. Invariant to ordering-preserving rescaling of spike times.

$\tau$ Soft rank temperature Scalar · ℝ₊

Controls the sharpness of the sigmoid in the soft rank formula. $\tau \to 0$ recovers exact rank (zero gradient); $\tau \to \infty$ gives uniform output (no ordinal information). Default $\tau = 0.01$ for OPS invariance.

$\mathbf{p}$ Plackett-Luce distribution Vector · Δ^{M!-1}

A probability distribution over all $M!$ permutations, parameterised by the score vector. Entry $p_\pi = P(\pi \mid \hat{\mathbf{a}})$ is the probability of observing ordering $\pi$ given the soft rank scores. For $M > 6$, approximated by Monte Carlo with $K$ samples.

$K$ Monte Carlo sample count Scalar · ℤ₊

Number of permutations sampled from the Plackett-Luce distribution via the Gumbel-max trick when $M > 6$ makes exact enumeration intractable. Default $K = 512$. The readout dimension equals $K$ in Monte Carlo mode.

$\mathbf{W}_\text{out}$ Readout weight matrix Matrix · ℝ^{C×K}

The sole learnable component outside the SNN. Maps the Plackett-Luce distribution $\mathbf{p}$ to class logits. Dimension is $C \times M!$ for exact mode and $C \times K$ for Monte Carlo mode.

$\mathbf{S}$ Skew-symmetric generator Matrix · ℝ^{M×M}, $S_{ij} = -S_{ji}$

The learnable parameter of the orthogonal layer, constrained to be skew-symmetric ($\mathbf{S} = -\mathbf{S}^T$). The orthogonal weight matrix $\mathbf{W}$ is derived from $\mathbf{S}$ via the Cayley map. The upper triangle of $\mathbf{S}$ contains the free parameters.

$\sigma$ Logistic sigmoid Function · ℝ → (0,1)

The standard sigmoid function $\sigma(x) = 1/(1+e^{-x})$. Used in the soft rank formula to approximate the step function $\mathbf{1}[a_i > a_j]$ with a smooth, differentiable surrogate.

§2

Formula Sheet

LIF Dynamics Membrane update & spike

$$\text{mem}_i[t] = \beta \cdot \text{mem}_i[t-1] + \text{input}_i[t]$$ $$S_{i,t} = \mathbf{1}\!\left[\text{mem}_i[t] > \theta\right]$$

Surrogate gradient approximates $\partial S / \partial \text{mem}$ during backpropagation.

First-Spike Mask Isolate first event

$$M_{i,t} = S_{i,t} \cdot \prod_{s=0}^{t-1}(1 - S_{i,s})$$

$M_{i,t} = 1$ iff neuron $i$ fires at $t$ and has not fired before. Gradient: $\partial M_{i,t}/\partial S_{i,s} \neq 0$.

Soft First-Spike Score Differentiable timing

$$a_i = \sum_{t=0}^{T-1} M_{i,t} \cdot e^{-w \cdot t}$$

Approximates $e^{-w \cdot t_i^*}$. Gradient: $\partial a_i / \partial S_{i,t} = e^{-wt} \cdot \prod_{s < t}(1 - S_{i,s}) \neq 0$.

Soft Rank Normalisation Discard magnitude

$$\hat{a}_i = \frac{1}{M-1}\sum_{j \neq i} \sigma\!\left(\frac{a_i - a_j}{\tau}\right)$$

$\hat{a}_i \in [0,1]$. Invariant to any order-preserving transformation of $\mathbf{a}$. As $\tau \to 0$, $\hat{a}_i \to \text{rank}(a_i)/(M-1)$.

Plackett-Luce Distribution over $S_M$

$$\log P(\pi \mid \hat{\mathbf{a}}) = \sum_{k=1}^{M} \left(\hat{a}_{\pi_k} - \log \sum_{j=k}^{M} e^{\hat{a}_{\pi_j}}\right)$$

Computed in log-space for numerical stability. Final $\mathbf{p} = \text{softmax}([\log P(\pi_1|\hat{\mathbf{a}}), \ldots])$ over all $M!$ permutations.

Gumbel-Max Sampling Monte Carlo PL

$$\pi^{(k)} = \text{argsort}\!\left(-(\hat{\mathbf{a}} + \mathbf{g}^{(k)})\right), \quad \mathbf{g}^{(k)} \sim \text{Gumbel}(0,1)^M$$

Used when $M > 6$. argsort is inside torch.no_grad(); gradients flow through log $P(\pi^{(k)} | \hat{\mathbf{a}})$ only.

Readout & Loss Classification

$$\text{logits} = \mathbf{W}_\text{out}\,\mathbf{p} + \mathbf{b}$$ $$\mathcal{L} = -\log \frac{e^{\text{logits}_y}}{\sum_{c} e^{\text{logits}_c}}$$

Cross-entropy on the true class $y$. Gradient flows back through $\mathbf{p} \to \hat{\mathbf{a}} \to \mathbf{a} \to S \to \text{mem} \to \mathbf{cur}$.

Orthogonal Layer Cayley map

$$\mathbf{W} = (I + \mathbf{S})^{-1}(I - \mathbf{S}), \quad \mathbf{S} = -\mathbf{S}^\top$$ $$\mathbf{W}^\top \mathbf{W} = I \;\Rightarrow\; \|\mathbf{W}\mathbf{x}\| = \|\mathbf{x}\|$$

Learned parameter is $\mathbf{S}$ (unconstrained upper triangle). $\mathbf{W}$ is derived — rotation only, no scaling. Preserves relative current magnitudes.

Recurrent Input LIF₂ with competition

$$\text{input\_to\_lif2}_{i,t} = S_{i,t}^{(1)} + \sum_{j} W_{\text{rec},ij} \cdot S_{j,t-1}^{(2)}$$

$W_\text{rec} \in \mathbb{R}^{M \times M}$, learnable. When off-diagonal entries are negative, earlier-firing neurons inhibit later ones (winner-take-all).

E-drop Temporal dependence metric

$$\text{E-drop} = \text{acc}(\text{normal}) - \text{acc}(\text{time-shuffled})$$

Time-shuffle permutes the $T$ axis of $S$ uniformly at random, destroying first-spike timing while preserving spike counts. High E-drop confirms temporal ordering is the primary information source.

OPS drop Ordering vs timing

$$\text{OPS-drop} = \text{acc}(\text{normal}) - \text{acc}(\text{order-preserving shuffle})$$

OPS re-samples spike times from a new random set while preserving $\pi$. Low OPS-drop means the model uses ordering, not absolute timing values.

fst-std Ordering separation

$$\text{fst-std} = \frac{1}{M}\sum_{i=1}^{M} \text{std}_{c}\!\left(\mathbb{E}\!\left[t_i^* \mid \text{class} = c\right]\right)$$

Average standard deviation of per-neuron first spike times across classes. Higher values indicate greater ordering separation between classes.

§3

Full Pipeline

The complete forward pass, from input current to classification loss.

$$\underbrace{\mathbf{cur}}_{\mathbb{R}^M} \;\xrightarrow{\text{LIF}_1,\,\text{LIF}_2}\; \underbrace{S}_{\{0,1\}^{M \times T}} \;\xrightarrow{\text{mask}} \;\underbrace{a_i}_{\mathbb{R}_+^M} \;\xrightarrow{\text{rank}}\; \underbrace{\hat{a}_i}_{[0,1]^M} \;\xrightarrow{\text{PL}}\; \underbrace{\mathbf{p}}_{\Delta^{K-1}} \;\xrightarrow{\mathbf{W}_\text{out}}\; \underbrace{\text{logits}}_{\mathbb{R}^C} \;\xrightarrow{\mathcal{L}_\text{CE}}\; \text{loss}$$

The gradient flows backwards through every step. The only discrete operation, the Gumbel-max argsort for Monte Carlo sampling, is placed inside torch.no_grad(); gradients pass through the log-probability computation instead.

Gradient path

$$\frac{\partial \mathcal{L}}{\partial \mathbf{cur}} = \frac{\partial \mathcal{L}}{\partial \mathbf{p}} \cdot \frac{\partial \mathbf{p}}{\partial \hat{\mathbf{a}}} \cdot \frac{\partial \hat{\mathbf{a}}}{\partial \mathbf{a}} \cdot \frac{\partial \mathbf{a}}{\partial S} \cdot \underbrace{\frac{\partial S}{\partial \text{mem}}}_{\text{surrogate}} \cdot \frac{\partial \text{mem}}{\partial \mathbf{cur}}$$

Every factor is non-zero. Prior to this work, the $\partial \mathbf{p}/\partial \hat{\mathbf{a}}$ term was zero because the pipeline used argmax/argsort instead of the Plackett-Luce formulation.

§4

Component Definitions

State Leaky — cascaded LIF

Two LIF neurons in series. LIF₁ ($\beta_1 = 0.9$) receives the input current and generates an initial sparse spike train. LIF₂ ($\beta_2 = 0.95$, higher $\beta$ = slower decay) receives LIF₁'s spikes as its input, integrating them over a longer window. The effect is temporal smoothing: spikes that arrive in a burst from LIF₁ are spread across more timesteps by LIF₂, increasing the number of distinct first-spike times and therefore the information content of the ordering.

Orthogonal layer — rotation only

Any matrix $\mathbf{W}$ can be decomposed as $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$ (SVD). The diagonal $\mathbf{\Sigma}$ controls scaling per direction. A standard nn.Linear layer learns both rotation and scaling; scaling can amplify current differences to the point where ordering collapses into a binary on/off code. The orthogonal layer forces $\mathbf{\Sigma} = I$ by parameterising only the rotational component via a skew-symmetric generator $\mathbf{S}$ and the Cayley map. Relative current magnitudes are preserved: if neuron 2 has a slightly higher current than neuron 0 before the layer, it will still have a slightly higher current after.

Recurrent connections on LIF₂

A learnable $M \times M$ weight matrix $\mathbf{W}_\text{rec}$ is added to LIF₂'s input. When $W_{\text{rec},ij} < 0$, a spike from neuron $j$ at $t-1$ suppresses neuron $i$ at $t$. This implements soft winner-take-all dynamics: early-firing neurons inhibit later ones, pushing spike times towards the extremes of the time axis. The result is higher fst-std and better ordering separation under large-$C$ pressure.

§5

Proof Chain

Four mutually reinforcing experiments establishing that spike ordering is the primary information carrier.

Temporal information is used — E-drop test

Training a model on direct-current inputs with soft first-spike scoring, then randomly shuffling the time axis of $S$ at evaluation, causes accuracy to drop from 1.000 to 0.198. The model cannot recover by relying on firing rate or other non-temporal features.

E-drop = 0.744 · ratio between/within = 18.6 · acc = 1.000

No other information source — acc ≈ E-drop across all $C$

Across six values of $C$ from 24 to 10,000, accuracy and E-drop remain nearly equal. After time-shuffle, accuracy falls to approximately $1/C$ (random chance). This confirms there is no secondary pathway the model can exploit.

shuffle acc ≈ 1/C = 0.0001 at C=10,000 · verified: 0.002 ≈ 1/C

Ordering is learned, not absolute timing — OID test

The Ordering-Invariant Dataset holds each class's ordering fixed while randomly re-sampling absolute spike times for every example. Within-class first-spike standard deviation is 1.96 (large), yet ordering accuracy reaches 1.000. The model learns the permutation, not the time values.

ordering_acc = 1.000 · within_fst_std = 1.96 · acc = 0.793

Only ordering matters at evaluation — OPS + soft rank test

The Ordering-Preserving Shuffle changes all absolute spike times while holding the permutation $\pi$ constant. Without soft rank normalisation, OPS-drop = 0.228 because the PL distribution's concentration changes. With soft rank ($\tau = 0.01$), which discards magnitude information from $\mathbf{a}$, OPS-drop falls to 0.017 — a 13× reduction — while accuracy remains at 1.000.

OPS-drop (baseline) = 0.228 · OPS-drop (τ=0.01) = 0.017 · acc = 1.000

§6

Experimental Results

Exp A — capacity scaling ($M=8$, $M! = 40{,}320$)

$C$	$M!/C$	mean acc	mean E-drop	acc $-$ E-drop
24	1,680	1.000	0.764	+0.236
100	403	0.978	0.885	+0.093
500	81	0.856	0.830	+0.026
1,000	40	0.733	0.717	+0.016
5,000	8	0.310	0.307	+0.003
10,000	4	0.162	0.160	+0.002 ≈ 1/C

As $C$ grows, acc $\approx$ E-drop, confirming ordering is the sole information channel.

Exp B — input layer type ($M=8$, $C=24$)

fc type	acc	E-drop	orth error $\\|\mathbf{W}^\top\mathbf{W} - I\\|_F$
None (direct)	1.000	0.744	—
Standard Linear	0.860	0.051	—
Orthogonal (Cayley)	0.923	0.420	0.000

A standard linear layer destroys ordering (E-drop 0.744→0.051). The orthogonal layer retains it (0.420).

Exp Capacity — SNN-Base vs SNN-Rec ($M=8$)

$C$	Base acc	Rec acc	Rec $-$ Base	Base collapse	Rec collapse
128	0.620	0.622	+0.002	YES	YES
256	0.401	0.467	+0.066	YES	YES
512	0.241	0.320	+0.079	YES	YES
1,024	0.142	0.192	+0.050	2/3 NO	YES
2,048	0.078	0.108	+0.030	YES	YES

Recurrent advantage peaks at C=512 (+0.079). Rec maintains binary collapse and higher fst-std under pressure.

OPS + soft rank — tau sweep

$\tau$	acc	OPS-drop	vs baseline
baseline (no rank)	1.000	0.228	—
1.0	0.999	0.341	worse
0.1	1.000	0.113	−0.115
0.01	1.000	0.017	−0.211 (13×)

$\tau = 0.01$ reduces OPS-drop by 13× while maintaining perfect accuracy.

§7

Confirmed Findings

F1SNN can learn spike ordering

With direct current input and soft first-spike scoring, the SNN exploits temporal spike ordering as the primary information channel. E-drop = 0.744; time-shuffle reduces accuracy to near random chance.

F2$M!$ is the real capacity bound

Accuracy and E-drop remain coupled across all tested $C$. Capacity degrades gracefully when $M!/C \lesssim 8$, confirming the factorial space is the operative constraint, not any architecture artefact.

F3Any linear layer collapses ordering

An 8×8 standard linear layer before the SNN drops E-drop from 0.744 to 0.051. Scaling of currents — even slight — allows the model to use a binary on/off code instead of ordering.

F4Orthogonal layer preserves ordering

Constraining the pre-SNN layer to be orthogonal via Cayley parameterisation maintains E-drop at 0.420. Orthogonal accuracy (0.923) exceeds standard linear (0.860), suggesting ordering is not merely preserved but beneficial.

F5Recurrent connections improve separation

LIF₂-level recurrent connections improve accuracy at all tested $C$ values. The advantage peaks at $C=512$ (+0.079). Recurrent winner-take-all dynamics maintain higher fst-std (9.49 vs 8.84) under large-$C$ pressure.

F6Soft rank eliminates magnitude leakage

OPS-drop measures whether the model depends on absolute spike time values beyond their ordinal structure. Without soft rank: OPS-drop = 0.228. With $\tau = 0.01$: OPS-drop = 0.017 (13× reduction). Accuracy is maintained at 1.000.

F7Geometric structure is partially learned

In the geo-gen experiment, fn-unseen = 0.670 (vs chance 0.250): the model correctly identifies the first-firing neuron for orderings it has never seen. The SNN generalises the concept "neuron $k$ fires first" without having been trained on all permutations containing it.

§8

Open Questions

Q1 Hierarchical readout. Can replacing the flat PL readout with a hierarchical one (first reading which neuron fires first, then second, etc.) allow generalisation to unseen permutations? fn-unseen = 0.670 suggests the SNN part is ready; the bottleneck is the readout interface. Partial
Q2 Ordering embedding. Can a learned embedding $\pi \mapsto \mathbf{e}_\pi \in \mathbb{R}^d$ encode geometric structure so that similar permutations are close? This would allow $\mathbf{z} = \sum_\pi p_\pi \mathbf{e}_\pi$ to carry both ordering identity and neighbourhood information. Open
Q3 Optimal $w$ and $\tau$. Current values ($w = 0.5$, $\tau = 0.01$) were chosen empirically. What is the theoretical relationship between $w$, $T$, and the gradient magnitude reaching early vs late timesteps? Is there a principled choice? Open
Q4 FFN + Orthogonal + SNN on real data. The orthogonal layer enables FFN encoders to coexist with ordering. Does this hold on high-dimensional noisy inputs (e.g., time-series, genomics)? Does E-drop survive a deep encoder? Open
Q5 Capacity threshold. Degradation begins at $M!/C \approx 8$. Is this threshold architecture-dependent or a fundamental property of the Plackett-Luce estimation with $K$ samples? Does recurrent competition shift it? Partial
Q6 Ordering autoencoder. If a decoder can reconstruct the input $\mathbf{x}$ from the ordering alone (with no class supervision), it would prove ordering carries the full input signal. This remains the strongest possible proof of the core claim. Open
Q7 $W_\text{rec}$ topology. Does the learned recurrent weight matrix exhibit predominantly inhibitory off-diagonal entries (winner-take-all), and does this resemble known biological competition circuits? What is the effect of initialising $W_\text{rec}$ as inhibitory? Open