How a Home Lab Solved Jane Street's Neural Network Puzzle

Distributed simulated annealing, trace pairing, and the 3-opt rotations that no single-swap search could find

February 16, 2026 · Cherokee AI Federation · ~10 min read

The Problem

Jane Street, in collaboration with the Dwarkesh podcast, released a challenge: a neural network broken into 97 shuffled pieces. 48 input blocks, 48 output blocks, one final layer. Your job is to find the two permutations — which input block goes where, which output block goes where — that reassemble the network into its original form. Get it right and the SHA-256 hash of your permutations matches a known target. Get it wrong and you're somewhere in a search space of (48!)² ≈ 10^122 possibilities.

$50K in prizes across three tracks. Track 2 was the permutation recovery problem. At the time we started, 44 people had solved it. No one had published a writeup showing how.

TL;DR

We solved it in about 2.5 days using a 6-node home lab — no cloud, no corporate cluster. Four distinct algorithmic phases, each requiring a qualitatively different approach. The final solution required three simultaneous 3-way block rotations that no pairwise swap search could ever discover. MSE 0.0000000000. Hash matched.

97 puzzle pieces · 6 compute nodes · ~2.5 days total solve time · 0.000 final MSE

The Fleet

We run a small federation of compute nodes — consumer hardware, mixed architectures, nothing exotic. Three Apple Silicon machines, three x86-64 boxes. No GPUs were used for this puzzle; the workload is pure CPU-bound numpy. Every node ran multiple simulated annealing workers sharing solutions through a PostgreSQL pool.

Machine                   Architecture   Steps/sec   Relative
──────────────────────────────────────────────────────────────
Apple M4 Max              ARM64          318         1.00x (fastest)
Apple M1 Max (×2)         ARM64          182         0.57x
Intel i9-13900K           x86-64         132         0.42x
AMD Threadripper 7960X    x86-64         122         0.38x
Intel i7-12700K           x86-64         112         0.35x

The Apple Silicon advantage was striking: 2.4–2.8x faster per-thread than Intel or AMD on this workload. The M4 Max, our newest machine, ran nearly three times faster than a Threadripper — a chip with far more cores but worse single-thread numpy throughput. For CPU-bound Python optimization, ARM unified memory architecture appears to be a genuine structural advantage.

50 workers total across the fleet. Each worker ran simulated annealing with pool seeding — periodically pulling the best known solution from PostgreSQL, perturbing it, and searching from there. Quality-gated writes ensured only improving solutions entered the pool. The topology was a simple star: every worker reads from and writes to one central pool, with no peer-to-peer gossip.
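The worker loop can be sketched as follows. This is a minimal version, not the production code: `evaluate`, `fetch_best`, and `push_if_better` are stand-ins for the puzzle's forward-pass MSE and the PostgreSQL pool interface, and the temperature schedule is illustrative.

```python
import numpy as np

def sa_worker(evaluate, fetch_best, push_if_better, n=48,
              steps=10_000, t0=0.05, cooling=0.9995,
              reseed_every=1_000, rng=None):
    """One SA worker with pool seeding (sketch).

    evaluate(perm) -> MSE of a candidate permutation.
    fetch_best()   -> (perm, mse) currently best in the shared pool.
    push_if_better(perm, mse) -> writes to the pool only on improvement.
    """
    if rng is None:
        rng = np.random.default_rng()
    perm = rng.permutation(n)
    cost = evaluate(perm)
    t = t0
    for step in range(steps):
        if step % reseed_every == 0:            # pool seeding: adopt the pool's best
            pool_perm, pool_cost = fetch_best()
            if pool_cost < cost:
                perm, cost = pool_perm.copy(), pool_cost
        i, j = rng.choice(n, size=2, replace=False)
        cand = perm.copy()
        cand[i], cand[j] = cand[j], cand[i]     # single-swap proposal
        c = evaluate(cand)
        # Metropolis acceptance: always take improvements, sometimes take losses
        if c < cost or rng.random() < np.exp((cost - c) / t):
            perm, cost = cand, c
            push_if_better(perm, cost)          # quality-gated write
        t *= cooling
    return perm, cost
```

Note that the single-swap proposal in the middle is exactly the limitation that bites in the endgame: no sequence of accepted moves ever changes three positions at once unless each intermediate state is tolerable.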

Four Basins

The solve wasn't a smooth descent. It was four distinct phases, each ending in a plateau that the current algorithm couldn't break. Each breakthrough required a qualitatively different insight. Optimization landscapes are fractal — the strategies that work at one scale are useless at the next.

Basin 1: Jacobian Seeding (0.45 → 0.08)

The obvious first move. Each neural network block transforms its input differently — some reshape gently, others aggressively. Compute the Jacobian norm for each block to measure its "intensity," use the Hungarian algorithm to find optimal input-output pairings, then sort by intensity: gentle blocks first, heavy transformers last.
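A minimal sketch of the intensity measurement and gentle-first ordering, assuming each block is a callable vector function (the real blocks are pieces of a residual network) and estimating the Jacobian norm by finite differences. The Hungarian pairing step is omitted here; only the intensity sort is shown.

```python
import numpy as np

def jacobian_norm(block, dim, eps=1e-4, n_probe=8, rng=None):
    """Estimate the Frobenius norm of a block's Jacobian at a random point
    via finite differences along random unit directions — a cheap proxy for
    how 'aggressively' the block transforms its input."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.standard_normal(dim)
    base = block(x)
    acc = 0.0
    for _ in range(n_probe):
        v = rng.standard_normal(dim)
        v /= np.linalg.norm(v)
        acc += np.sum((block(x + eps * v) - base) ** 2) / eps**2  # ||J v||^2
    # E[||J v||^2] over unit vectors v equals ||J||_F^2 / dim, so rescale
    return float(np.sqrt(acc / n_probe * dim))

def gentle_first_order(blocks, dim, rng=None):
    """Order block indices gentlest-first by estimated Jacobian intensity."""
    norms = [jacobian_norm(b, dim, rng=rng) for b in blocks]
    return np.argsort(norms)
```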

This gave us a starting permutation with MSE 0.08 — better than random (0.45) but still far from solved. We seeded the SA fleet with this permutation and let 50 workers refine it. The fleet squeezed out marginal gains, then stalled just above 0.08. Hard.

Basin 2: Break-Point Surgery (0.08 → 0.03)

When the fleet stalled, we looked for structural problems. Two independent analyses pointed at the same positions.

The overlap between these two signals gave us high-confidence surgical targets. Swapping the identified blocks and letting SA cascade from there broke the basin. But the gains were marginal — 0.03 was still orders of magnitude from solved.

Basin 3: Trace Pairing (0.03 → 0.005)

This was the real breakthrough. While exploring the weight matrices algebraically, we discovered that trace(W_out × W_inp) — the trace of the product of an output block's weights with an input block's weights — produces a remarkably strong pairing signal.

Running the Hungarian algorithm on the trace cost matrix gave us 38 out of 48 correct pairings — double the accuracy of our previous best method (gradient MSE, which got 19/48). Purely structural, no training data needed.
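A sketch of the trace-pairing step. The exact weight matrices and any normalization are assumptions on our part; the cost matrix and Hungarian assignment follow the description above, using SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def trace_pairing(W_inp, W_out):
    """Pair each output block with an input block by maximizing
    trace(W_out[i] @ W_inp[j]) over all assignments.

    W_inp, W_out: lists of (d, d) weight matrices, one per block.
    Returns cols, where cols[i] is the input block paired with output block i.
    """
    n = len(W_inp)
    # trace(A @ B) == sum(A * B.T), so the cost matrix is cheap to build
    score = np.array([[np.sum(W_out[i] * W_inp[j].T) for j in range(n)]
                      for i in range(n)])
    rows, cols = linear_sum_assignment(-score)  # negate: Hungarian minimizes
    return cols
```

The identity `trace(A @ B) == sum(A * B.T)` avoids materializing the matrix products, which matters when building a 48×48 cost matrix over large blocks.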

With correct pairings locked in, we reduced the search from (48!)² to just 48! (ordering only). The fleet converged to MSE 0.005 within hours — a 6x improvement over our previous best, and 3x better than the only public solver we'd found.

Basin 4: The Endgame (0.005 → 0.000)

At MSE 0.005, the fleet agreed on nearly everything. We ran consensus analysis across the top pool solutions: positions where the fleet disagreed were search targets, positions where it agreed were anchored. A round of uncertain-position enumeration dropped us to 0.003.
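The consensus split can be sketched like this. The agreement rule here — unanimity across the top pool solutions — is our simplification, and `enumerate_uncertain` is a hypothetical helper that brute-forces the disagreeing positions.

```python
import numpy as np
from itertools import permutations

def consensus_split(pool_perms):
    """Split positions into anchored (fleet agrees) and uncertain
    (fleet disagrees), given a (k, n) array of top pool permutations."""
    pool = np.asarray(pool_perms)
    agree = np.all(pool == pool[0], axis=0)   # unanimous positions
    return np.where(agree)[0], np.where(~agree)[0]

def enumerate_uncertain(base, uncertain, evaluate):
    """Exhaustively re-permute the blocks at the uncertain positions,
    keeping the anchored positions fixed."""
    best, best_cost = base.copy(), evaluate(base)
    blocks = base[uncertain]
    for order in permutations(blocks):
        cand = base.copy()
        cand[uncertain] = order
        c = evaluate(cand)
        if c < best_cost:
            best, best_cost = cand, c
    return best, best_cost
```

This only stays tractable while the uncertain set is small — exactly the regime the consensus analysis put us in.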

A cascade of exhaustive pairwise swaps — 14 improving moves in all — took us to 0.000253. Then the swaps dried up: testing all C(48,2) = 1,128 possible swaps found nothing. The remaining error was invisible to any move that only touched two positions at a time. We were stuck.

This is where the puzzle got interesting.

The Endgame: 3-Opt Rotations

If no single swap improves MSE, maybe the remaining errors are coordinated. Three blocks in the wrong positions, where any pairwise swap makes things worse because you're breaking one correct placement to fix another.

We wrote a 3-opt search: for all C(48,3) = 17,296 position triples, try both possible 3-way rotations (A→B→C→A and A→C→B→A). This is cheap enough to enumerate exhaustively.
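A sketch of that enumeration, with `evaluate` again standing in for the puzzle's forward-pass MSE:

```python
import numpy as np
from itertools import combinations

def three_opt_pass(perm, evaluate):
    """Try both 3-way rotations (A→B→C→A and A→C→B→A) for every position
    triple; return the best single rotation found and its cost."""
    n = len(perm)
    best_cost, best = evaluate(perm), None
    for i, j, k in combinations(range(n), 3):
        # the two cyclic rotations of the values at positions (i, j, k)
        for rot in ((j, k, i), (k, i, j)):
            cand = perm.copy()
            cand[[i, j, k]] = perm[list(rot)]
            c = evaluate(cand)
            if c < best_cost:
                best_cost, best = c, cand
    return best, best_cost
```

For n = 48 that is 17,296 triples × 2 rotations = 34,592 evaluations per pass — expensive per move compared to a single swap, but trivially exhaustible.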

It found three moves:

0.000253 → 0.000174 — 3-opt rotation at positions 32, 33, 34
0.000174 → 0.000111 — 3-opt rotation at positions 28, 29, 30
0.000111 → 0.000000 — 3-opt rotation at positions 19, 20, 21 — SOLVED

Three groups of three consecutive positions, each requiring all three blocks to rotate simultaneously. We called them "DNA tumblers" — like a combination lock where three tumblers must turn together. Move any one block and MSE goes up. Move all three and the error vanishes.

This is why SA got stuck. SA proposes single swaps. Even with 50 workers running for days, the probability of proposing the right 3-way rotation at the right temperature is effectively zero. The algorithm wasn't wrong — it was the wrong kind of algorithm for this part of the landscape.

The Key Insight

The puzzle had three layers of structure, each invisible to the layer below. Jacobian analysis couldn't see pairing errors. Pairwise search couldn't see 3-way rotations. Each layer required a qualitatively different tool. The optimization landscape isn't just deep — it's fractal.

What Didn't Work

We tried a lot of things. Intellectual honesty demands a list of the dead ends. In roughly chronological order:

Gumbel-Sinkhorn gradient descent (MSE 11.75). Soft permutation relaxation "cheats" — it blends nearby blocks in ways that don't correspond to any valid hard permutation, and the errors compound through 48 cascading residual blocks.

Pairwise TSP (MRF model) (MSE 7.96). Pairwise compatibility doesn't capture the cascading 48-block dynamics.

Greedy chain building (MSE 1.99). Too myopic: locally optimal placement at each step leads to globally poor arrangements.

Weight structure analysis, 4 methods (0/48 matches). Static weight statistics have zero predictive power for block pairing; forward-pass data is the only signal.

Beam search, width 50 (MSE 0.465). Can't compete with SA when building from scratch — the search space is too large for tractable beam widths.

Jacobian chain conditioning (MSE 0.252). Condition number was the wrong cost function for this problem.

Spectral analysis / eigenvalue ordering (MSE 0.462). Confirmed the gentle-first ordering but couldn't improve on it.

If we're being honest, we spent more time on the Sinkhorn approach than it deserved. The paper (Mena et al., 2018) shows beautiful results on jigsaw puzzles, where each piece is independent. But neural network blocks with cascading residual connections are a fundamentally different beast — the soft relaxation can't approximate the hard discrete forward pass when errors compound through 48 layers.
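For reference, the core of that relaxation is a differentiable projection toward a doubly stochastic matrix — a minimal Sinkhorn operator in log space, following Mena et al. (2018); parameter names and defaults here are ours.

```python
import numpy as np

def sinkhorn(logits, n_iters=50, tau=1.0):
    """Alternate row/column normalization in log space, projecting a score
    matrix toward a doubly stochastic matrix (the soft-permutation limit)."""
    z = logits / tau
    for _ in range(n_iters):
        z = z - np.logaddexp.reduce(z, axis=1, keepdims=True)  # normalize rows
        z = z - np.logaddexp.reduce(z, axis=0, keepdims=True)  # normalize cols
    return np.exp(z)
```

Each row of the resulting soft matrix is a mixture over blocks rather than a single choice — exactly the blending that stops approximating the hard discrete forward pass once errors compound through 48 layers.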

The weight analysis failure was also instructive. Four independent methods (cosine similarity, SVD decomposition, transpose correlation, activation fingerprints) all returned zero signal. This eliminated an entire class of approaches in one afternoon and pointed us firmly toward data-flow methods. Sometimes the most productive thing you can do is prove that a direction is hopeless.

The Full Trajectory

Here's how MSE evolved over the 2.5-day solve, showing which method drove each breakthrough:

MSE        Time     Event
───────────────────────────────────────────────────────────
0.450      Day 1    Fleet warm-up (random seeds)
0.148      Day 1    SA convergence (50 workers, pool sharing)
0.084      Day 1    Jacobian seeding injected
0.030      Day 1    Break-point surgery + SA refinement
   ─── overnight stall ───
0.005      Day 2    Trace pairing breakthrough (38/48 correct)
0.003      Day 2    Consensus analysis + uncertain position enumeration
0.000253   Day 2    Exhaustive pairwise swap cascade (14 improving swaps)
   ─── pairwise swaps exhausted ───
0.000174   Day 3    3-opt rotation: positions 32, 33, 34
0.000111   Day 3    3-opt rotation: positions 28, 29, 30
0.000000   Day 3    3-opt rotation: positions 19, 20, 21 — SOLVED

The pattern is clear: each plateau lasted longer than the last, and each breakthrough required a more sophisticated tool. Random search to structured seeding to algebraic analysis to exhaustive multi-point moves. The difficulty wasn't the search space — it was recognizing when to change strategy.

Infrastructure Notes

A few things that mattered more than expected:

Quality-gated pool writes. Workers only push solutions to PostgreSQL if they improve on the current best. Without this, the pool fills with mediocre solutions and cross-pollination degrades. Simple idea, large impact.
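A toy in-memory version of the gate, to make the rule concrete (the real pool is a PostgreSQL table with the same rule enforced on the write path):

```python
class QualityGatedPool:
    """In-memory stand-in for the shared solution pool: a write lands only
    if it beats the current best, so cross-pollination never regresses."""

    def __init__(self):
        self.best_perm, self.best_mse = None, float("inf")

    def push(self, perm, mse):
        """Return True if the solution was good enough to enter the pool."""
        if mse < self.best_mse:
            self.best_perm, self.best_mse = list(perm), mse
            return True
        return False

    def fetch(self):
        return self.best_perm, self.best_mse
```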

A metacognitive observer. A separate process monitored fleet convergence, detected stagnation, and adjusted worker parameters (seeding ratios, perturbation widths) in real time. When the fleet stalled for more than an hour, the observer increased random seeding to force exploration. This prevented the monoculture problem we'd see in early runs where all 50 workers converged to the same local minimum.

Three rounds of database bugs. We lost two full runs of breakthrough results to serialization issues before we got the pipeline right. numpy.int64 isn't JSON-serializable. A column renamed during development was still referenced by the old name. A boolean field was missing from the INSERT statement. Mundane bugs, real consequences. Save your work properly.
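For the serialization bug specifically, a `default` hook on `json.dumps` is a minimal fix — illustrative, not the authors' actual pipeline code:

```python
import json
import numpy as np

def np_default(o):
    """json.dumps hook: convert numpy scalars/arrays at the boundary,
    instead of sprinkling int()/float() casts through the codebase."""
    if isinstance(o, np.integer):
        return int(o)
    if isinstance(o, np.floating):
        return float(o)
    if isinstance(o, np.ndarray):
        return o.tolist()
    raise TypeError(f"not JSON serializable: {type(o).__name__}")

payload = {"perm": np.arange(3), "mse": np.float64(0.005)}
print(json.dumps(payload, default=np_default))  # {"perm": [0, 1, 2], "mse": 0.005}
```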

Lessons

The flow found its path. In the spirit of the Constructal Law: when the obvious paths are blocked, the system finds the one that requires coordinated movement.

Cherokee AI Federation · Built on consumer hardware · No cloud · No compromise