Section 1 · Scorecard

| Work stream | Grade | One-line verdict |
|---|---|---|
| Rounds 1–2 · the seven view experiments | B+ | Disciplined breadth-first survey; synthetic data was the right scaffold and the honest badges later kept it honest. |
| Consolidation · shared lib + inliner + dist | A | Textbook: extract, differential-test bit-for-bit, document every de-dup decision with a "why". |
| Round 3 · .stat parser + real-archive validation | A | The independent-oracle methodology (validate against MET's own paired lines) is the single best decision in the repo. |
| Card 09 · MET Pipeline Explorer | A | The lab's thesis, demonstrated rather than argued — computed-vs-MET-vs-Δ on every value is devastatingly effective. |
| Card 10 + v2 store · R2 streaming, pairs-on-the-fly | A− | The raw-first redesign is genuinely better than MET's own intermediate format; minus for demo-resolution store and thin oracle coverage. |
| Card 11 · WebGPU FSS | A | Four kernels racing each other with exact parity checks is how GPU work should always be presented. |
| Explainers (pipeline · multiscale storage) | B+ | Excellent shareable artifacts; they slightly outpace the apps they describe (a docs-lead-code smell worth watching). |
| Real-data wiring of the original seven apps | C+ | The blueprint is 13 months of thinking ahead of 2 apps of execution — the plan aged while cards 09–11 leapfrogged it. |
| Process & infrastructure (tests, CI, remote) | C | World-class test culture, zero test automation; no git remote; deploy is a memorized incantation. |

Section 2 · What the variations got right
Breadth-first, then honest consolidation · Rounds 1–2
Seven isolated worktrees exploring four themes in parallel, each with plan → build → review and an explicit verdict, was the right shape for a lab: the cost of a wrong idea stayed local. Two choices elevate it above the usual prototype pile:
- Duplication was budgeted, not banned. The experiments were allowed to repeat scaffolding, and the debt was paid once, deliberately, on integration, with ~34k bit-for-bit differential assertions proving the migrations changed nothing. Most teams either forbid duplication (and strangle exploration) or never repay it.
- The one exception is documented. stat-interaction's categorical path stays local because its `Math.max(1, denom)` convention differs from the lib's NaN-safety. Right call, but note: this is now a permanent behavioral fork two keystrokes from invisibility. It belongs in a code comment and the docs, and ideally behind a named constant (`DEGENERATE_DENOM_POLICY`) so the fork is greppable.
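
One way the named-constant suggestion could look, as a minimal sketch. Only `Math.max(1, denom)` and the constant's name come from the review; the `ratio` helper, the policy values, and the reading of "NaN-safety" as "return NaN on a zero denominator" are assumptions made for illustration, not the lab's actual code.

```javascript
// Hypothetical sketch: everything except Math.max(1, denom) and the
// constant's name is illustrative, not the repo's real API.
const DEGENERATE_DENOM_POLICY = Object.freeze({
  CLAMP_TO_ONE: "clamp-to-one", // stat-interaction's local convention
  NAN: "nan",                   // assumed reading of the shared lib's NaN-safety
});

function ratio(numer, denom, policy) {
  switch (policy) {
    case DEGENERATE_DENOM_POLICY.CLAMP_TO_ONE:
      // A zero count in the denominator is treated as 1, so the result is finite.
      return numer / Math.max(1, denom);
    case DEGENERATE_DENOM_POLICY.NAN:
      // A zero denominator is reported as "undefined" rather than as a number.
      return denom === 0 ? NaN : numer / denom;
    default:
      throw new Error(`unknown degenerate-denominator policy: ${policy}`);
  }
}

// The same degenerate input gives two different answers, which is the fork:
const clamped = ratio(0, 0, DEGENERATE_DENOM_POLICY.CLAMP_TO_ONE); // finite
const nanSafe = ratio(0, 0, DEGENERATE_DENOM_POLICY.NAN);          // NaN
```

With the policy spelled out at every call site, a grep for `DEGENERATE_DENOM_POLICY` lists each place where the two conventions can diverge.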
The oracle discipline · Round 3
The N_THRESH incident — 76/76 tests passing over a real off-by-one because
the fixtures encoded the same wrong assumption — could have been quietly fixed. Instead it
became policy: fixtures are real MET lines, parity is judged against MET's own derived
output, and one review was even run blind (provenance-stripped "Series A/B").
This is the intellectual core of the lab and the thing I'd protect most fiercely as it
grows. The full-archive numbers (6,329 files / 88,456 records / 0 errors; 32,389/32,390 +
7,038/7,038 oracle checks, with the one miss root-caused to MET's own 5-dp rounding) are
a stronger validation than most production parsers ever get.
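
The one root-caused miss suggests what a rounding-aware oracle comparison might look like. The sketch below assumes the oracle value arrives as the text MET printed with five decimals and that missing values are written as `NA`. The function names and the sample cases are invented for illustration, and this is not a description of how the lab's checks actually compare values.

```javascript
// Half of the last printed digit when a value is rounded to 5 decimals.
const HALF_STEP_5DP = 0.5e-5;

// Hypothetical comparator: true when `computed` is consistent with the
// 5-decimal text MET printed. "NA" is assumed to mean a missing value.
function matchesOracle(computed, metText) {
  if (metText === "NA") return Number.isNaN(computed);
  const met = Number(metText);
  if (!Number.isFinite(met) || !Number.isFinite(computed)) return false;
  // Small epsilon absorbs floating-point noise in the subtraction itself.
  return Math.abs(computed - met) <= HALF_STEP_5DP + 1e-12;
}

// Tally pass/fail and keep the labels of the misses for root-causing.
function tallyOracle(cases) {
  let pass = 0;
  const misses = [];
  for (const c of cases) {
    if (matchesOracle(c.computed, c.met)) pass += 1;
    else misses.push(c.label);
  }
  return { pass, total: cases.length, misses };
}

// Made-up values, only to exercise the comparator.
const report = tallyOracle([
  { label: "case-a", computed: 0.123456789, met: "0.12346" },
  { label: "case-b", computed: 2.5, met: "2.50000" },
  { label: "case-c", computed: 0.1234, met: "0.12346" },
]);
```

Keeping the miss labels rather than only a count is what makes a single failure among tens of thousands of checks traceable to its cause.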
The v2 store: out-designing the source format · Card 10
The decision to not store MET's _pairs.nc intermediates (55k objects,
810 MB) and instead regrid raw fields once to a common grid and compute pairs live is the
rare redesign that is simultaneously smaller, faster, and more general — any
model/var/lead/region pairing becomes computable, not just the ones MET happened to run.
Keeping a tiny parity oracle so the browser==.stat proof survives the
redesign shows the oracle discipline generalizing. Two caveats below in §3.
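
A minimal sketch of what "compute pairs live" could mean once both fields sit on a common grid. The flat typed-array layout, the 0/1 mask convention, and the function names are assumptions for illustration; the store's real layout is not described in this review.

```javascript
// Hypothetical sketch: yields (forecast, observation) pairs for grid cells
// that are inside the mask and valid in both fields.
function* pairsOnTheFly(fcst, obs, mask) {
  if (fcst.length !== obs.length || fcst.length !== mask.length) {
    throw new Error("fields and mask must share one grid");
  }
  for (let i = 0; i < fcst.length; i++) {
    if (mask[i] === 0) continue;                        // outside the region
    const f = fcst[i];
    const o = obs[i];
    if (Number.isNaN(f) || Number.isNaN(o)) continue;   // missing data
    yield [f, o];
  }
}

// Mean error and RMSE accumulated in one pass over the pairs.
function continuousStats(pairs) {
  let n = 0;
  let sumErr = 0;
  let sumSq = 0;
  for (const [f, o] of pairs) {
    const e = f - o;
    n += 1;
    sumErr += e;
    sumSq += e * e;
  }
  if (n === 0) return { n, me: NaN, rmse: NaN };
  return { n, me: sumErr / n, rmse: Math.sqrt(sumSq / n) };
}

// Tiny made-up grid: cell 2 is masked out, cell 3 has a missing forecast.
const fcst = Float32Array.of(1, 2, 3, NaN);
const obs = Float32Array.of(1, 4, 1, 5);
const mask = Uint8Array.of(1, 1, 0, 1);
const stats = continuousStats(pairsOnTheFly(fcst, obs, mask));
```

Because the pairs are generated rather than stored, swapping in a different model, variable, lead, or mask only changes the inputs, which is the generality the redesign buys.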
GPU work presented honestly · Card 11
Most WebGPU demos show one kernel and a speedup number. This one shows four, races them across grid size and radius, shows where the naive kernel's parallelism beats its O(r²) complexity and where it loses, and pins every kernel to exact-parity checks on real hardware. The by-product lessons (NaN sentinel folding under Metal fast-math, the silent 128 MiB binding cap) are worth more than the speedup itself.
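
The naive-versus-clever trade-off and the exact-parity idea can be illustrated on the CPU. The sketch below covers only the neighborhood-count step that FSS fractions are built from, comparing a naive window sum with a summed-area table. It is not the card's WebGPU code, and clipping the window at the grid edge is an assumed convention.

```javascript
// Naive version: O(r^2) work per cell, every cell independent.
function naiveCounts(bin, nx, ny, r) {
  const out = new Uint32Array(nx * ny);
  for (let y = 0; y < ny; y++) {
    for (let x = 0; x < nx; x++) {
      let c = 0;
      for (let yy = Math.max(0, y - r); yy <= Math.min(ny - 1, y + r); yy++) {
        for (let xx = Math.max(0, x - r); xx <= Math.min(nx - 1, x + r); xx++) {
          c += bin[yy * nx + xx];
        }
      }
      out[y * nx + x] = c;
    }
  }
  return out;
}

// Summed-area table: one prefix pass, then O(1) per cell regardless of r.
function satCounts(bin, nx, ny, r) {
  const w = nx + 1; // table has one extra row and column of zeros
  const sat = new Uint32Array(w * (ny + 1));
  for (let y = 0; y < ny; y++) {
    for (let x = 0; x < nx; x++) {
      sat[(y + 1) * w + (x + 1)] =
        bin[y * nx + x] + sat[y * w + (x + 1)] + sat[(y + 1) * w + x] - sat[y * w + x];
    }
  }
  const out = new Uint32Array(nx * ny);
  for (let y = 0; y < ny; y++) {
    for (let x = 0; x < nx; x++) {
      const x0 = Math.max(0, x - r);
      const x1 = Math.min(nx - 1, x + r) + 1;
      const y0 = Math.max(0, y - r);
      const y1 = Math.min(ny - 1, y + r) + 1;
      out[y * nx + x] = sat[y1 * w + x1] - sat[y0 * w + x1] - sat[y1 * w + x0] + sat[y0 * w + x0];
    }
  }
  return out;
}

// Exact parity: integer counts must match cell for cell, with no tolerance.
function sameCounts(a, b) {
  return a.length === b.length && a.every((v, i) => v === b[i]);
}

// Small deterministic binary field, checked across several radii.
const nx = 7;
const ny = 5;
const bin = Uint8Array.from({ length: nx * ny }, (_, i) => ((i * 7 + 3) % 5 < 2 ? 1 : 0));
const parity = [0, 1, 2, 3].map((r) =>
  sameCounts(naiveCounts(bin, nx, ny, r), satCounts(bin, nx, ny, r)));
```

Working in integer counts is what makes a bit-exact comparison meaningful; once counts are divided into fractions, parity becomes a question of floating-point behavior as well.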
Section 3 · Where I push back
The blueprint is aging faster than it's being executed
- Plan/reality drift: `docs/REAL_DATA_INTEGRATION.md` still opens with the truncated-subset inventory (1,452 files / 20,478 records / 6 cycles) that the full-archive validation superseded, and its §4 wiring order (stat-interaction first) has been overtaken by events: cards 05/09/10 went real by a completely different route (the v2 store), which the blueprint doesn't know exists. A stale plan that reads authoritatively is worse than no plan. It needs a one-hour revision pass: update the inventory block, mark Q5 (spatial-maps gridding) as answered by the v2 store, and re-sequence §4 around the Parquet-from-R2 path.
- The 8 open decisions have quietly become ~4. Q5 is answered (v2 store grids), Q3 is answered in practice (MODEL-axis repurposing shipped in three apps), Q7 is half-answered (R2-first with load-a-file fallback is now the stated direction). Leaving them "open" costs decision-making attention every time someone reads the doc.
- Six of eleven apps are still synthetic (01, 02, 03, 04, 06, 07) in a lab whose active direction is "real data by default". Only 06 (ensemble) has a data excuse. The queued plan (wire 02 and 04 to the R2 Parquet) is right, but it has been queued for a while. The missing keystone is `lib/met-data-source.mjs`: it was specified in the blueprint (§2) and everything else waits on it. Build the funnel, and apps 02/04 wire in an afternoon each.
v2 store caveats
- The shipped store is demo-resolution. ×3-downsampled CONUS (533×782) is fine for a PoC but the design target is full 2.5 km (1597×2345); until the full-res store exists, the "regrid erases resolution" claim is untested at the resolution where it matters (and the regrid-parity bonus, matching MET's own `regrid.to_grid`, is still theoretical).
- Oracle coverage is thin: six precip cases, one variable family. The bit-identical claim generalizes only as far as the oracle reaches; a dozen more cases across TMP/wind/PRMSL and a second mask (when one exists) would make it robust against per-variable encoding regressions (bitround keepbits are per-variable, exactly the kind of knob that silently breaks categorical parity for one variable).
- De-identification is enforced by scripts + vigilance, and it has already needed one full-site scrub. A tiny automated tripwire (a deploy-time grep of the upload set for the sensitive strings) would convert vigilance into mechanism.
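
The tripwire in the last bullet could be as small as the sketch below. It assumes the upload set is available as a map from path to text content and that the sensitive strings are kept in a plain list; the function name, the file names, and the placeholder string are all invented for illustration.

```javascript
// Hypothetical sketch: returns one hit per (file, sensitive string) match.
// Matching is case-insensitive so a re-cased leak is still caught.
function findSensitiveHits(uploadSet, sensitiveStrings) {
  const hits = [];
  for (const [path, text] of Object.entries(uploadSet)) {
    const haystack = text.toLowerCase();
    for (const s of sensitiveStrings) {
      if (haystack.includes(s.toLowerCase())) hits.push({ path, match: s });
    }
  }
  return hits;
}

// Made-up upload set and a placeholder sensitive string.
const hits = findSensitiveHits(
  {
    "index.html": "<h1>Series A vs Series B</h1>",
    "apps/09/data.json": '{"model":"EXAMPLE_MODEL_NAME"}',
  },
  ["example_model_name"],
);
// A deploy step would refuse to upload whenever hits.length > 0.
```

Run as the first step of a deploy, a non-empty result turns the scrub from something remembered into something enforced.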
Process debt is the cheapest remaining win
- No CI. The repo has five node-runnable self-test suites totaling ~470 assertions and zero automation running them. One 15-line `tools/check.sh` (or a GitHub Action, once a remote exists) closes the gap between "world-class test culture" and "tests that actually gate changes".
- No git remote. One laptop is the repository. For a lab producing publishable artifacts and a live site, this is the largest single risk on the board: larger than anything technical, and fixable in ten minutes.
- The dist-staleness policy is a trap in waiting. "dist/ is not rebuilt automatically" plus "some changes are deliberately not propagated" means the offline builds silently drift from the served apps. Fine while one person holds the policy in their head; a `tools/check-dist-fresh.mjs` that compares module graphs would make the drift visible instead of remembered.
- Deploys are memorized incantations (root-not-dist, pin `--branch main`, exclude `.git`, beware the 0-files manifest reuse). All four gotchas are documented, but documentation is where incantations go to be forgotten; a `tools/deploy.sh` that encodes them is ~20 lines.
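
The core of a dist-freshness check might look like the sketch below. It assumes each module graph has already been reduced to a map from module path to content hash, and that deliberately unpropagated changes are recorded in an explicit allowlist; the function name, paths, and hash values are illustrative only.

```javascript
// Hypothetical sketch: compares the served module graph with the dist one.
// Modules on the allowlist are skipped, so deliberate staleness is written
// down instead of remembered.
function diffModuleGraphs(served, dist, allowStale = []) {
  const allowed = new Set(allowStale);
  const drift = { missingInDist: [], extraInDist: [], changed: [] };
  for (const [mod, hash] of Object.entries(served)) {
    if (allowed.has(mod)) continue;
    if (!Object.hasOwn(dist, mod)) drift.missingInDist.push(mod);
    else if (dist[mod] !== hash) drift.changed.push(mod);
  }
  for (const mod of Object.keys(dist)) {
    if (!allowed.has(mod) && !Object.hasOwn(served, mod)) drift.extraInDist.push(mod);
  }
  return drift;
}

// Made-up graphs: one stale module, one leftover, one allowed omission.
const drift = diffModuleGraphs(
  { "lib/stats.mjs": "a1", "lib/parse.mjs": "b2", "apps/09/main.mjs": "c3" },
  { "lib/stats.mjs": "a1", "lib/parse.mjs": "old", "apps/old.mjs": "z9" },
  ["apps/09/main.mjs"],
);
```

The allowlist is the design point: it is where the "deliberately not propagated" half of the policy would live, in a file a reviewer can read.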
Smaller observations
- The gallery lede is drifting stale: it still says "the first nine share the verification math and are dependency-free" while card 05 now loads a real binned field and card 09 streams from R2 by default. The honesty badges are the source of truth; the prose should defer to them.
- The injection-hardening checklist (blueprint §5) is executed only for stat-ingest. That was the agreed sequencing (harden as each app wires real data) — but the checklist's file:line references will rot as apps evolve; convert it to grep-able patterns before it's needed.
- ideas.html says "nothing here has been investigated yet" (charmingly false for over a year of work). The board deserves a status pass: ~12 of 23 cards have been prototyped and their statuses still read `idea`.
- The accessibility card has no experiment and no plan beyond Tier 3. For an NCAR-branded public artifact, CVD-safe palettes and keyboard paths on the two flagship apps (09, 10) would be a cheap, visible commitment.
Section 4 · If I ran the next 30 days
1. Build `lib/met-data-source.mjs`, the specified funnel everything else waits on. Then wire Stat Explorer (02) and Guided Journey (04) to the R2 Parquet. Three artifacts, all on the existing critical path.
2. Pay the process debt: git remote + `tools/check.sh` running all five suites + a deploy script. One afternoon, permanent dividends.
3. Revise the blueprint against post-v2 reality (see §3) and run the ideas.html status pass so the planning surfaces tell the truth again.
4. Close the METcalcpy gap that matters: bootstrap CIs. It is the recurring review follow-up, it is first-class uncertainty (an idea-board theme), and it is a pure lib+UI feature with no data dependency. A scorecard view (METviewer's signature) is the natural companion.
5. Deepen the oracle before widening anything else GPU-side: more cases, more variables, ideally one full-resolution store variable to test the regrid-parity thesis.

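
To make the bootstrap-CI item concrete, here is a minimal sketch of a percentile bootstrap with a seeded generator, so results are reproducible in tests. The percentile method is one common variant and is not a claim about how METcalcpy computes its intervals; the function names, the nearest-rank quantile rule, and the sample values are assumptions for illustration.

```javascript
// Small seeded PRNG (mulberry32-style) returning values in [0, 1).
function seededRandom(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function mean(values) {
  let s = 0;
  for (const v of values) s += v;
  return s / values.length;
}

// Percentile bootstrap: resample with replacement B times, recompute the
// statistic, and read the interval off the sorted replicates.
function bootstrapCI(values, statFn, { B = 1000, alpha = 0.05, seed = 1 } = {}) {
  const rand = seededRandom(seed);
  const n = values.length;
  const replicates = new Float64Array(B);
  const resample = new Float64Array(n);
  for (let b = 0; b < B; b++) {
    for (let i = 0; i < n; i++) resample[i] = values[Math.floor(rand() * n)];
    replicates[b] = statFn(resample);
  }
  replicates.sort(); // typed arrays sort numerically by default
  const lo = replicates[Math.floor((alpha / 2) * (B - 1))];
  const hi = replicates[Math.ceil((1 - alpha / 2) * (B - 1))];
  return { estimate: statFn(values), lo, hi };
}

// Made-up forecast errors, only to exercise the function.
const errors = [0.2, -0.1, 0.4, 0.0, 0.3, -0.2, 0.1, 0.5];
const ci = bootstrapCI(errors, mean, { B: 500, seed: 42 });
const again = bootstrapCI(errors, mean, { B: 500, seed: 42 });
```

Seeding the resampler fits the lab's habits: a CI that is reproducible run to run can be pinned by a test, and later compared against an independent oracle.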
The pattern across all of it: this lab's strength is that it proves things rather than demos them. Every recommendation above is some form of "aim that proving instinct at the remaining soft spots" — the plans, the process, and the six apps still living on synthetic data.