How the lab works

MET-AL is organized as a research lab, not a product repo. The unit of work is the experiment: one idea from the idea board, prototyped in isolation, reviewed, and only then consolidated.

Each experiment gets:

  • its own branch (experiment/<name>) and worktree (.claude/worktrees/experiment-<name>/), so parallel experiments never interfere;
  • PLAN.md — the design intent written before building;
  • a runnable, dependency-free prototype (vanilla ES modules, no build step, no CDN);
  • REVIEW.md — a feasibility write-up with an explicit verdict: merge · needs work · defer · discard.

Experiments were built by 3-agent teams (architect → implementer → reviewer), then independently verified in a real browser (Playwright over a static HTTP server) by the lab orchestrator, with screenshots archived under docs/reports/screenshots/.

Duplicate scaffolding across experiments is expected during prototyping and squeezed out later on the integration branch — see Repository layout.

Several lab conventions exist to keep the feasibility signal trustworthy:

  • Ratio-of-sums, on screen. Averaging already-derived metrics (“mean-of-ratios”) weights every group equally regardless of its sample size, which is statistically wrong. Apps instead carry raw counts and partial sums, aggregate those, and derive the metric last; several also render the wrong value alongside the right one to show the gap.
  • Parity against an independent oracle. Recomputed statistics are checked against MET’s own .stat output, not against fixtures that could share the code’s assumptions. The risk is real: the N_THRESH off-by-one bug slipped past 76/76 self-tests precisely because the fixtures encoded the same wrong assumption, and only an adversarial review against the MET source caught it.
  • Provenance badges. Every gallery card declares its data provenance (real·streamed / real·baked / real·load-a-file / synthetic) and which MET component it reimagines; the genuinely novel app (the 3D metric cube) carries no component badge.
  • Blind review. For the pipeline-explorer parity claim, a fresh reviewer was handed the 84 paired values with provenance stripped (“Series A/B”) and asked to judge agreement independently.
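
The ratio-of-sums convention above can be illustrated with a minimal sketch. The group tallies and the field names (hits, total) are hypothetical, chosen only to make the gap visible; they are not MET column names or lab data.

```javascript
// Hypothetical per-group tallies: a small group with a high rate
// and a large group with a low rate.
const groups = [
  { hits: 9, total: 10 },     // ratio 0.9
  { hits: 100, total: 1000 }  // ratio 0.1
];

// Wrong: average the already-derived ratios.
// Each group counts equally, however few samples it has.
function meanOfRatios(rows) {
  const sum = rows.reduce((acc, r) => acc + r.hits / r.total, 0);
  return sum / rows.length;
}

// Right: aggregate the raw counts first, then derive the ratio once.
function ratioOfSums(rows) {
  const hits = rows.reduce((acc, r) => acc + r.hits, 0);
  const total = rows.reduce((acc, r) => acc + r.total, 0);
  return hits / total;
}

const wrong = meanOfRatios(groups); // (0.9 + 0.1) / 2
const right = ratioOfSums(groups);  // 109 / 1010
console.log({ wrong, right, gap: wrong - right });
```

Here the ten-sample group pulls the mean-of-ratios up to about 0.5, while the pooled rate is about 0.108. Rendering both numbers side by side is one way to make that gap visible on screen.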

Testing is layered:

  1. Formula self-tests: hand-worked cells and edge cases (lib/selftest.mjs, 129 checks).
  2. Differential tests: old-vs-new code on real data during migrations (~34k assertions, bit-for-bit).
  3. Real-data oracles: parser + math validated on the full archive: 6,329 .stat files, 88,456 records, 0 errors; 32,389/32,390 SL1L2→CNT and 7,038/7,038 VL1L2→VCNT cross-checks (the single “miss” is a documented catastrophic-cancellation artifact of MET’s 5-dp rounding, not a bug).
  4. Browser verification: every app exercised in a real browser before its verdict; GPU kernels verified on real Apple Metal hardware, not SwiftShader.
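
The catastrophic-cancellation artifact mentioned in item 3 can be reproduced in miniature. This is a sketch under stated assumptions: the forecast values are invented, the helper names are hypothetical, and the variance identity is the textbook one-pass form rather than necessarily the exact expression MET or the lab code uses. It assumes only that the partial sums include a mean (FBAR) and a mean of squares (FFBAR) stored to five decimal places.

```javascript
// Invented forecast values: large magnitude, tiny spread.
const forecasts = [300.001, 300.002, 300.003];

// Partial sums in the style of an SL1L2 line: mean and mean of squares.
function partialSums(values) {
  const n = values.length;
  const fbar = values.reduce((a, v) => a + v, 0) / n;
  const ffbar = values.reduce((a, v) => a + v * v, 0) / n;
  return { n, fbar, ffbar };
}

// Round to a fixed number of decimal places, as a text file would store it.
function roundDp(x, dp) {
  return Number(x.toFixed(dp));
}

// Population variance via the one-pass identity E[f^2] - E[f]^2.
// The two terms are nearly equal here, so their difference is fragile.
function varianceFromSums({ fbar, ffbar }) {
  return ffbar - fbar * fbar;
}

const full = partialSums(forecasts);
const stored = {
  n: full.n,
  fbar: roundDp(full.fbar, 5),   // 5-dp rounding, as in the stored sums
  ffbar: roundDp(full.ffbar, 5)
};

const varFull = varianceFromSums(full);     // small but positive
const varStored = varianceFromSums(stored); // rounding swamps the signal
console.log({ varFull, varStored, stdevStored: Math.sqrt(varStored) });
```

With full-precision sums the variance comes out small and positive. After rounding both sums to five decimals the difference turns negative, so the square root is NaN. A mismatch of this kind reflects precision lost in the stored inputs, not an error in the recomputation.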