MET-AL — an outside reviewer's commentary

On the variations of work done and planned · written by Claude (Fable 5) · 2026-07-01

What this is. A personal, critical read of the lab's work streams — not documentation (that's the Starlight site in docs-site/) and not a status report. Opinions here are mine; where I disagree with a decision on record I say so. Grades are calibrated to "research-lab prototype", not "production software".

Section 1 · Scorecard

Work stream | Grade | One-line verdict
Rounds 1–2 · the seven view experiments | B+ | Disciplined breadth-first survey; synthetic data was the right scaffold and the honest badges later kept it honest.
Consolidation · shared lib + inliner + dist | A | Textbook: extract, differential-test bit-for-bit, document every de-dup decision with a "why".
Round 3 · .stat parser + real-archive validation | A | The independent-oracle methodology (validate against MET's own paired lines) is the single best decision in the repo.
Card 09 · MET Pipeline Explorer | A | The lab's thesis, demonstrated rather than argued: computed-vs-MET-vs-Δ on every value is devastatingly effective.
Card 10 + v2 store · R2 streaming, pairs-on-the-fly | A− | The raw-first redesign is genuinely better than MET's own intermediate format; minus for demo-resolution store and thin oracle coverage.
Card 11 · WebGPU FSS | A | Four kernels racing each other with exact parity checks is how GPU work should always be presented.
Explainers (pipeline · multiscale storage) | B+ | Excellent shareable artifacts; they slightly outpace the apps they describe (a docs-lead-code smell worth watching).
Real-data wiring of the original seven apps | C+ | The blueprint is 13 months of thinking ahead of 2 apps of execution; the plan aged while cards 09–11 leapfrogged it.
Process & infrastructure (tests, CI, remote) | C | World-class test culture, zero test automation; no git remote; deploy is a memorized incantation.

Section 2 · What the variations got right

Breadth-first, then honest consolidation · Rounds 1–2

Seven isolated worktrees exploring four themes in parallel, each with plan → build → review and an explicit verdict, was the right shape for a lab: the cost of a wrong idea stayed local. Two choices elevate it above the usual prototype pile:

The oracle discipline · Round 3

The N_THRESH incident — 76/76 tests passing over a real off-by-one because the fixtures encoded the same wrong assumption — could have been quietly fixed. Instead it became policy: fixtures are real MET lines, parity is judged against MET's own derived output, and one review was even run blind (provenance-stripped "Series A/B"). This is the intellectual core of the lab and the thing I'd protect most fiercely as it grows. The full-archive numbers (6,329 files / 88,456 records / 0 errors; 32,389/32,390 + 7,038/7,038 oracle checks, with the one miss root-caused to MET's own 5-dp rounding) are a stronger validation than most production parsers ever get.
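To make the oracle idea concrete, here is a minimal sketch of a comparison that tolerates only the oracle's own print precision. It is my illustration, not the lab's code: the pair values, the statistic (a plain mean error), and the helper names `mean_error` and `matches_oracle` are all assumptions, and the recorded value stands in for a number an external tool printed to 5 decimal places.

```python
from statistics import fmean


def mean_error(pairs):
    """Mean of (forecast - observation) over a list of (fcst, obs) pairs."""
    return fmean(f - o for f, o in pairs)


def matches_oracle(computed, recorded, decimals=5):
    """True if `computed` is consistent with `recorded`, given that
    `recorded` was rounded to `decimals` places by whoever printed it."""
    half_unit = 0.5 * 10 ** -decimals
    # Small slack so binary floating-point noise does not cause a false miss.
    return abs(computed - recorded) <= half_unit + 1e-12


# Hypothetical inputs: three (fcst, obs) pairs and two candidate oracle values.
pairs = [(1.25, 1.0), (2.5, 2.75), (3.125, 3.0)]
computed = mean_error(pairs)                   # 0.125 / 3 = 0.041666...

agrees = matches_oracle(computed, 0.04167)     # within half a unit of the 5th decimal
disagrees = matches_oracle(computed, 0.04170)  # off by more than rounding allows
```

The point of the design is that the expected value comes from outside the code under test, so a shared wrong assumption cannot hide in the fixture, and the tolerance is tied to the oracle's stated precision rather than picked to make the test pass.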

The v2 store: out-designing the source format · Card 10

The decision to not store MET's _pairs.nc intermediates (55k objects, 810 MB) and instead regrid raw fields once to a common grid and compute pairs live is the rare redesign that is simultaneously smaller, faster, and more general — any model/var/lead/region pairing becomes computable, not just the ones MET happened to run. Keeping a tiny parity oracle so the browser==.stat proof survives the redesign shows the oracle discipline generalizing. Two caveats below in §3.
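A minimal sketch of why raw-first generalizes, under loud assumptions: the store layout, the keys, the field values, and the function `pairs_on_the_fly` are all hypothetical, and real fields would be typed arrays rather than Python lists. The only property taken from the design above is that every field already sits on one common grid, so any two fields can be paired by index.

```python
import math

# Hypothetical store: each field was regridded once to the same 2x3 common
# grid and flattened row-major. Keys are (source, variable, lead hours).
store = {
    ("modelA", "TMP", 24): [280.0, 281.5, 282.0, 279.0, float("nan"), 283.0],
    ("modelB", "TMP", 24): [280.2, 281.0, 282.4, 278.5, 281.1, 283.3],
    ("analysis", "TMP", 0): [280.5, 281.0, 282.0, 278.0, 281.0, float("nan")],
}


def pairs_on_the_fly(store, fcst_key, obs_key, mask=None):
    """Build (fcst, obs) pairs at request time instead of storing them.

    `mask` is an optional list of booleans on the same grid (a region);
    cells where either field is missing (NaN) are skipped.
    """
    fcst, obs = store[fcst_key], store[obs_key]
    if len(fcst) != len(obs):
        raise ValueError("fields are not on the same grid")
    pairs = []
    for i, (f, o) in enumerate(zip(fcst, obs)):
        if mask is not None and not mask[i]:
            continue
        if math.isnan(f) or math.isnan(o):
            continue
        pairs.append((f, o))
    return pairs


all_cells = pairs_on_the_fly(store, ("modelA", "TMP", 24), ("analysis", "TMP", 0))
region = [True, True, False, False, True, True]
in_region = pairs_on_the_fly(store, ("modelA", "TMP", 24), ("analysis", "TMP", 0), region)
```

Swapping `modelA` for `modelB`, or passing a different mask, needs no new stored object, which is the sense in which any pairing becomes computable.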

GPU work presented honestly · Card 11

Most WebGPU demos show one kernel and a speedup number. This one shows four, races them across grid size and radius, shows where the naive kernel's parallelism beats its O(r²) complexity and where it loses, and pins every kernel to exact-parity checks on real hardware. The by-product lessons (NaN sentinel folding under Metal fast-math, the silent 128 MiB binding cap) are worth more than the speedup itself.
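The exact-parity habit can be shown on the CPU in a few lines. This is a sketch of the general technique, not a description of the lab's four kernels: I am assuming a binary exceedance grid and comparing a naive O(r²)-per-cell neighborhood count against a summed-area-table version, which is one common way to remove the dependence on radius. Integer inputs make "exact" a fair requirement.

```python
def window_counts_naive(grid, r):
    """Sum the grid over a (2r+1) x (2r+1) window per cell, clipped at edges.
    Cost per cell grows as O(r^2)."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for yy in range(max(0, y - r), min(h, y + r + 1)):
                for xx in range(max(0, x - r), min(w, x + r + 1)):
                    total += grid[yy][xx]
            out[y][x] = total
    return out


def window_counts_sat(grid, r):
    """Same result via a summed-area table: O(1) per cell after one pass."""
    h, w = len(grid), len(grid[0])
    # sat[y][x] holds the sum of grid[0:y][0:x]; the extra row and column of
    # zeros avoid special cases at the top and left edges.
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            sat[y + 1][x + 1] = grid[y][x] + sat[y][x + 1] + sat[y + 1][x] - sat[y][x]
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - r), min(h, y + r + 1)
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            out[y][x] = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
    return out


# Deterministic binary test grid (hypothetical data, 6 rows by 8 columns).
grid = [[1 if (3 * x + 5 * y) % 7 < 2 else 0 for x in range(8)] for y in range(6)]

# Exact parity across several radii: equality, not "close enough".
parity = all(window_counts_naive(grid, r) == window_counts_sat(grid, r) for r in range(4))
```

On a GPU the same check is what keeps a faster kernel honest: the race is only meaningful if every contestant returns identical counts.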

Section 3 · Where I push back

The blueprint is aging faster than it's being executed

v2 store caveats

Process debt is the cheapest remaining win

Smaller observations

Section 4 · If I ran the next 30 days

The pattern across all of it: this lab's strength is that it proves things rather than demos them. Every recommendation above is some form of "aim that proving instinct at the remaining soft spots" — the plans, the process, and the six apps still living on synthetic data.