MET-AL · An Outside Reviewer's Commentary

Section 1 · Scorecard

Work stream	Grade	One-line verdict
Rounds 1–2 · the seven view experiments	B+	Disciplined breadth-first survey; synthetic data was the right scaffold and the honesty badges later kept it honest.
Consolidation · shared lib + inliner + dist	A	Textbook: extract, differential-test bit-for-bit, document every de-dup decision with a "why".
Round 3 · `.stat` parser + real-archive validation	A	The independent-oracle methodology (validate against MET's own paired lines) is the single best decision in the repo.
Card 09 · MET Pipeline Explorer	A	The lab's thesis, demonstrated rather than argued — computed-vs-MET-vs-Δ on every value is devastatingly effective.
Card 10 + v2 store · R2 streaming, pairs-on-the-fly	A−	The raw-first redesign is genuinely better than MET's own intermediate format; minus for demo-resolution store and thin oracle coverage.
Card 11 · WebGPU FSS	A	Four kernels racing each other with exact parity checks is how GPU work should always be presented.
Explainers (pipeline · multiscale storage)	B+	Excellent shareable artifacts; they slightly outpace the apps they describe (a docs-lead-code smell worth watching).
Real-data wiring of the original seven apps	C+	The blueprint is 13 months of thinking ahead of 2 apps of execution — the plan aged while cards 09–11 leapfrogged it.
Process & infrastructure (tests, CI, remote)	C	World-class test culture, zero test automation; no git remote; deploy is a memorized incantation.

Section 2 · What the variations got right

Breadth-first, then honest consolidation · Rounds 1–2

Seven isolated worktrees exploring four themes in parallel, each with plan → build → review and an explicit verdict, was the right shape for a lab: the cost of a wrong idea stayed local. Two choices elevate it above the usual prototype pile:

Duplication was budgeted, not banned. The experiments were allowed to repeat scaffolding, and the debt was paid once, deliberately, on integration — with ~34k bit-for-bit differential assertions proving the migrations changed nothing. Most teams either forbid duplication (and strangle exploration) or never repay it.
The one exception is documented. stat-interaction's categorical path stays local because its Math.max(1, denom) convention differs from the lib's NaN-safety. Right call, but note: this is now a permanent behavioral fork two keystrokes from invisibility. It belongs in a code comment and the docs, and ideally behind a named constant (DEGENERATE_DENOM_POLICY) so the fork is greppable.
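One way the fork could be made greppable is sketched below. Only the constant name DEGENERATE_DENOM_POLICY comes from the text above; the policy values, the `ratio` helper, and the exact behavior attributed to each side (clamp to 1 versus NaN for a non-positive denominator) are assumptions for illustration, not the lab's actual code.

```javascript
// Hypothetical sketch: everything except the name DEGENERATE_DENOM_POLICY is illustrative.
const DEGENERATE_DENOM_POLICY = Object.freeze({
  // Assumed stat-interaction convention: Math.max(1, denom), so 0/0 yields 0.
  CLAMP_TO_ONE: "clamp-to-one",
  // Assumed shared-lib convention: a non-positive denominator yields NaN.
  NAN: "nan",
});

// Divide numer by denom, making the degenerate-denominator choice explicit at the call site.
function ratio(numer, denom, policy) {
  switch (policy) {
    case DEGENERATE_DENOM_POLICY.CLAMP_TO_ONE:
      return numer / Math.max(1, denom);
    case DEGENERATE_DENOM_POLICY.NAN:
      return denom > 0 ? numer / denom : NaN;
    default:
      throw new Error(`unknown degenerate-denominator policy: ${policy}`);
  }
}
```

With this shape, a search for DEGENERATE_DENOM_POLICY lists every call site that has chosen a side, and the two conventions only diverge when the denominator is degenerate.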

The oracle discipline · Round 3

The N_THRESH incident — 76/76 tests passing over a real off-by-one because the fixtures encoded the same wrong assumption — could have been quietly fixed. Instead it became policy: fixtures are real MET lines, parity is judged against MET's own derived output, and one review was even run blind (provenance-stripped "Series A/B"). This is the intellectual core of the lab and the thing I'd protect most fiercely as it grows. The full-archive numbers (6,329 files / 88,456 records / 0 errors; 32,389/32,390 + 7,038/7,038 oracle checks, with the one miss root-caused to MET's own 5-dp rounding) are a stronger validation than most production parsers ever get.
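The 5-dp rounding miss suggests what a parity check against printed MET values has to account for. The sketch below compares a computed value with the text MET wrote, to within half a unit of the last printed decimal place. The function names, the fixed-point assumption, and the treatment of "NA" as a missing value are assumptions here; the lab's real oracle comparison may well be stricter or shaped differently.

```javascript
// Number of digits after the decimal point in a printed value, e.g. "0.12346" -> 5.
// Assumes fixed-point text; scientific notation is not handled in this sketch.
function printedDecimals(text) {
  const dot = text.indexOf(".");
  return dot === -1 ? 0 : text.length - dot - 1;
}

// True when `computed` would round to the value MET printed in `metText`.
// A non-numeric field such as "NA" only agrees with a NaN computed value.
function agreesAtPrintedPrecision(computed, metText) {
  const printed = Number(metText);
  if (Number.isNaN(printed) || Number.isNaN(computed)) {
    return Number.isNaN(printed) && Number.isNaN(computed);
  }
  const halfUnit = 0.5 * 10 ** -printedDecimals(metText);
  // Tiny relative slack absorbs floating-point error in the subtraction.
  return Math.abs(computed - printed) <= halfUnit * (1 + 1e-9);
}
```

A check like this passes when the only difference is MET's own rounding of its printed output, and fails on any disagreement larger than the printed precision can explain.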

The v2 store: out-designing the source format · Card 10

The decision to not store MET's _pairs.nc intermediates (55k objects, 810 MB) and instead regrid raw fields once to a common grid and compute pairs live is the rare redesign that is simultaneously smaller, faster, and more general — any model/var/lead/region pairing becomes computable, not just the ones MET happened to run. Keeping a tiny parity oracle so the browser==.stat proof survives the redesign shows the oracle discipline generalizing. Two caveats below in §3.
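To make "compute pairs live" concrete, here is a minimal sketch of deriving categorical counts directly from two fields that already share a grid. The function name, the flat-array layout, the 0/1 mask, and the `>=` threshold convention are all assumptions for illustration; the actual v2 store code and MET's configured threshold operators may differ.

```javascript
// Hypothetical sketch: build a 2x2 contingency table from forecast/observation
// fields on one common grid, skipping masked-out or missing cells.
function contingencyFromFields(fcst, obs, mask, threshold) {
  if (fcst.length !== obs.length || fcst.length !== mask.length) {
    throw new Error("forecast, observation and mask must share one grid");
  }
  const counts = { hits: 0, falseAlarms: 0, misses: 0, correctNegatives: 0, total: 0 };
  for (let i = 0; i < fcst.length; i++) {
    const f = fcst[i];
    const o = obs[i];
    // A pair exists only where the mask is on and both values are present.
    if (!mask[i] || Number.isNaN(f) || Number.isNaN(o)) continue;
    const fYes = f >= threshold; // assumed event definition
    const oYes = o >= threshold;
    if (fYes && oYes) counts.hits++;
    else if (fYes) counts.falseAlarms++;
    else if (oYes) counts.misses++;
    else counts.correctNegatives++;
    counts.total++;
  }
  return counts;
}
```

Because the pairing happens at query time, changing the model, variable, lead, region mask, or threshold only changes the arguments, which is the generality the redesign buys.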

GPU work presented honestly · Card 11

Most WebGPU demos show one kernel and a speedup number. This one shows four, races them across grid size and radius, shows where the naive kernel's parallelism beats its O(r²) complexity and where it loses, and pins every kernel to exact-parity checks on real hardware. The by-product lessons (NaN sentinel folding under Metal fast-math, the silent 128 MiB binding cap) are worth more than the speedup itself.

Section 3 · Where I push back

The blueprint is aging faster than it's being executed

Plan/reality drift: docs/REAL_DATA_INTEGRATION.md still opens with the truncated-subset inventory (1,452 files / 20,478 records / 6 cycles) that the full-archive validation superseded, and its §4 wiring order (stat-interaction first) has been overtaken by events — cards 05/09/10 went real by a completely different route (the v2 store), which the blueprint doesn't know exists. A stale plan that reads authoritatively is worse than no plan. It needs a one-hour revision pass: update the inventory block, mark Q5 (spatial-maps gridding) as answered by the v2 store, and re-sequence §4 around the Parquet-from-R2 path.
The 8 open decisions have quietly become ~4. Q5 is answered (v2 store grids), Q3 is answered in practice (MODEL-axis repurposing shipped in three apps), Q7 is half-answered (R2-first with load-a-file fallback is now the stated direction). Leaving them "open" costs decision-making attention every time someone reads the doc.
Six of eleven apps are still synthetic (01, 02, 03, 04, 06, 07) in a lab whose active direction is "real data by default". Only 06 (ensemble) has a data excuse. The queued plan (wire 02 and 04 to the R2 Parquet) is right — but it has been queued for a while. The missing keystone is lib/met-data-source.mjs: it was specified in the blueprint (§2) and everything else waits on it. Build the funnel, and apps 02/04 wire in an afternoon each.

v2 store caveats

The shipped store is demo-resolution. ×3-downsampled CONUS (533×782) is fine for a PoC, but the design target is full 2.5 km (1597×2345). Until the full-res store exists, the "regrid erases resolution" claim is untested at the resolution where it matters, and the regrid-parity bonus (matching MET's own regrid.to_grid) is still theoretical.
Oracle coverage is thin: six precip cases, one variable family. The bit-identical claim generalizes only as far as the oracle reaches; a dozen more cases across TMP/wind/PRMSL and a second mask (when one exists) would make it robust against per-variable encoding regressions. Bitround keepbits are set per variable, which is exactly the kind of knob that silently breaks categorical parity for one variable.
De-identification is enforced by scripts + vigilance, and it has already needed one full-site scrub. A tiny automated tripwire (a deploy-time grep of the upload set for the sensitive strings) would convert vigilance into mechanism.
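The tripwire in the last caveat could be as small as the sketch below. The function name, the `{ path, text }` input shape, and the case-insensitive matching are assumptions; the real list of sensitive strings is deliberately not guessed at here and would live outside the deployed tree.

```javascript
// Hypothetical sketch: scan the upload set for any sensitive string, in either
// the file path or the file contents. Returns every (path, needle) hit.
function findSensitiveHits(files, sensitiveStrings) {
  const needles = sensitiveStrings.map((s) => s.toLowerCase());
  const hits = [];
  for (const { path, text } of files) {
    const haystack = `${path}\n${text}`.toLowerCase();
    needles.forEach((needle, i) => {
      if (haystack.includes(needle)) {
        hits.push({ path, needle: sensitiveStrings[i] });
      }
    });
  }
  return hits;
}
```

A deploy script would read the files it is about to upload, call this once, and refuse to continue if the returned list is non-empty.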

Process debt is the cheapest remaining win

No CI. The repo has five node-runnable self-test suites totaling ~470 assertions and zero automation running them. One 15-line tools/check.sh (or a GitHub Action, once a remote exists) closes the gap between "world-class test culture" and "tests that actually gate changes".
No git remote. One laptop is the repository. For a lab producing publishable artifacts and a live site, this is the largest single risk on the board — larger than anything technical, and fixable in ten minutes.
The dist-staleness policy is a trap in waiting. "dist/ is not rebuilt automatically" plus "some changes are deliberately not propagated" means the offline builds silently drift from the served apps. Fine while one person holds the policy in their head; a tools/check-dist-fresh.mjs that compares module graphs would make the drift visible instead of remembered.
Deploys are memorized incantations (root-not-dist, pin --branch main, exclude .git, beware the 0-files manifest reuse). All four gotchas are documented, but documentation is where incantations go to be forgotten — a tools/deploy.sh that encodes them is ~20 lines.
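For the dist-staleness item, the comparison at the heart of a tools/check-dist-fresh.mjs might look like the sketch below. It assumes each module graph has already been extracted into a plain object mapping a module path to the list of modules it imports, and that deliberately unpropagated modules are passed in as an explicit list; the function name, that representation, and the drift labels are all hypothetical.

```javascript
// Own-property check that is safe for keys such as "toString".
const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Hypothetical sketch: report where the dist module graph has drifted from the
// served one, ignoring modules that are deliberately not propagated.
function diffModuleGraphs(served, dist, deliberatelyStale = []) {
  const skip = new Set(deliberatelyStale);
  const modules = new Set([...Object.keys(served), ...Object.keys(dist)]);
  const drift = [];
  for (const mod of [...modules].sort()) {
    if (skip.has(mod)) continue;
    if (!has(dist, mod)) {
      drift.push({ module: mod, kind: "missing-from-dist" });
    } else if (!has(served, mod)) {
      drift.push({ module: mod, kind: "only-in-dist" });
    } else {
      // Compare import lists order-independently.
      const a = [...served[mod]].sort().join("\n");
      const b = [...dist[mod]].sort().join("\n");
      if (a !== b) drift.push({ module: mod, kind: "imports-differ" });
    }
  }
  return drift;
}
```

The explicit `deliberatelyStale` list is the design point: it moves the "some changes are deliberately not propagated" policy out of one person's head and into a reviewable file, so anything not on the list shows up as drift.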

Smaller observations

The gallery lede has gone stale: it still says "the first nine share the verification math and are dependency-free" while card 05 now loads a real binned field and card 09 streams from R2 by default. The honesty badges are the source of truth; the prose should defer to them.
The injection-hardening checklist (blueprint §5) is executed only for stat-ingest. That was the agreed sequencing (harden as each app wires real data) — but the checklist's file:line references will rot as apps evolve; convert it to grep-able patterns before it's needed.
ideas.html says "nothing here has been investigated yet", which is charmingly false after over a year of work. The board deserves a status pass: ~12 of 23 cards have been prototyped, yet their statuses still read "idea".
The accessibility card has no experiment and no plan beyond Tier 3. For an NCAR-branded public artifact, CVD-safe palettes and keyboard paths on the two flagship apps (09, 10) would be a cheap, visible commitment.

Section 4 · If I ran the next 30 days

1. Build lib/met-data-source.mjs — the specified funnel everything else waits on. Then wire Stat Explorer (02) and Guided Journey (04) to the R2 Parquet. Three artifacts, all on the existing critical path.
2. Pay the process debt: git remote + tools/check.sh running all five suites + a deploy script. One afternoon, permanent dividends.
3. Revise the blueprint against post-v2 reality (see §3) and run the ideas.html status pass so the planning surfaces tell the truth again.
4. Close the METcalcpy gap that matters: bootstrap CIs. It is the recurring review follow-up, it is first-class uncertainty (an idea-board theme), and it is a pure lib+UI feature with no data dependency. A scorecard view (METviewer's signature) is the natural companion.
5. Deepen the oracle before widening anything else GPU-side: more cases, more variables, ideally one full-resolution store variable to test the regrid-parity thesis.
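As a starting point for item 4, a bootstrap confidence interval really can be a pure lib function with no data dependency. The sketch below is a generic percentile bootstrap with a seeded PRNG for reproducible tests; it is not claimed to match METcalcpy's method, and the function names, option names, defaults, and the simple quantile indexing are all assumptions.

```javascript
// Small seeded PRNG (mulberry32) so resampling is reproducible in tests.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical sketch: percentile bootstrap CI for any statistic of a sample.
function bootstrapCI(values, statistic, { replicates = 1000, alpha = 0.05, seed = 1 } = {}) {
  if (values.length === 0) throw new Error("need at least one value");
  const rand = mulberry32(seed);
  const stats = new Float64Array(replicates);
  const sample = new Array(values.length);
  for (let r = 0; r < replicates; r++) {
    // Resample with replacement, same size as the original sample.
    for (let i = 0; i < values.length; i++) {
      sample[i] = values[Math.floor(rand() * values.length)];
    }
    stats[r] = statistic(sample);
  }
  stats.sort(); // typed arrays sort numerically by default
  const at = (q) => stats[Math.min(replicates - 1, Math.floor(q * replicates))];
  return { lower: at(alpha / 2), upper: at(1 - alpha / 2) };
}

// Example statistic: the arithmetic mean.
const mean = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;
```

Passing the statistic as a function keeps the resampling loop independent of any one score, so the same helper could in principle back intervals for whichever statistics the lib already computes.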

The pattern across all of it: this lab's strength is that it proves things rather than demos them. Every recommendation above is some form of "aim that proving instinct at the remaining soft spots" — the plans, the process, and the six apps still living on synthetic data.