How the lab works
MET-AL is organized as a research lab, not a product repo. The unit of work is the experiment: one idea from the idea board, prototyped in isolation, reviewed, and only then consolidated.
The experiment pattern
Each experiment gets:
- its own branch (experiment/<name>) and worktree (.claude/worktrees/experiment-<name>/) so parallel experiments never interfere;
- PLAN.md, the design intent written before building;
- a runnable, dependency-free prototype (vanilla ES modules, no build step, no CDN);
- REVIEW.md, a feasibility write-up with an explicit verdict: merge · needs work · defer · discard.
Experiments were built by 3-agent teams (architect → implementer → reviewer), then
independently verified in a real browser (Playwright over a static HTTP server) by the lab
orchestrator, with screenshots archived under docs/reports/screenshots/.
Duplicate scaffolding across experiments is expected during prototyping and squeezed out
later on the integration branch — see Repository layout.
Honesty as a design rule
Several lab conventions exist to keep feasibility signal trustworthy:
- Ratio-of-sums, on screen. Averaging already-derived metrics (“mean-of-ratios”) is statistically wrong; apps carry raw counts and partial sums, aggregate those, and only then derive the metric. Several apps also render the wrong value alongside the right one to show the gap.
- Parity against an independent oracle. Recomputed statistics are checked against MET’s own .stat output, not against fixtures that could share the code’s assumptions (a real bug, the N_THRESH off-by-one, slipped past 76/76 self-tests precisely because the fixtures encoded the same wrong assumption; an adversarial review against MET source caught it).
- Provenance badges. Every gallery card declares its data provenance (real·streamed / real·baked / real·load-a-file / synthetic) and which MET component it reimagines; the genuinely novel app (the 3D metric cube) carries no component badge.
- Blind review. For the pipeline-explorer parity claim, a fresh reviewer was handed the 84 paired values with provenance stripped (“Series A/B”) and asked to judge agreement independently.
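
The ratio-of-sums convention above can be illustrated with a minimal sketch. The counts, field names (`hits`, `events`), and function names here are hypothetical and not taken from any MET-AL app; the point is only that averaging per-group rates and deriving one rate from summed counts give different answers when group sizes differ.

```javascript
// Hypothetical per-group counts (illustrative only, not real MET data).
const groups = [
  { hits: 9, events: 10 },   // small group, rate 0.9
  { hits: 50, events: 100 }, // large group, rate 0.5
];

// Wrong: average the already-derived per-group rates.
// Every group gets equal weight regardless of how many events it holds.
function meanOfRatios(gs) {
  const sum = gs.reduce((acc, g) => acc + g.hits / g.events, 0);
  return sum / gs.length;
}

// Right: aggregate the raw counts first, then derive the rate once.
function ratioOfSums(gs) {
  const hits = gs.reduce((acc, g) => acc + g.hits, 0);
  const events = gs.reduce((acc, g) => acc + g.events, 0);
  return hits / events;
}

console.log(meanOfRatios(groups).toFixed(3)); // about 0.700
console.log(ratioOfSums(groups).toFixed(3));  // about 0.536 (59 / 110)
```

Rendering both numbers side by side, as several apps do, makes the size of the gap visible instead of leaving it as a footnote.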
Verification layers
- Formula self-tests: hand-worked cells and edge cases (lib/selftest.mjs, 129 checks).
- Differential tests: old-vs-new code on real data during migrations (~34k assertions, bit-for-bit).
- Real-data oracles: parser + math validated on the full archive (6,329 .stat files, 88,456 records, 0 errors); 32,389/32,390 SL1L2→CNT and 7,038/7,038 VL1L2→VCNT cross-checks (the single “miss” is a documented catastrophic-cancellation artifact of MET’s 5-dp rounding, not a bug).
- Browser verification: every app exercised in a real browser before its verdict; GPU kernels verified on real Apple Metal hardware, not SwiftShader.
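
As a hedged illustration of how a partial-sum cross-check can report a "miss" that is not a bug, the sketch below derives an RMSE from SL1L2-style partial sums and compares it with an RMSE computed directly from matched pairs. The data, the helper names, and the 1e-4 tolerance are all assumptions made for illustration; this is not the lab's actual cross-check code, and it does not reproduce the documented case. It only shows the general mechanism: when forecast and observation are large and nearly equal, rounding the partial sums to 5 decimal places can erase the small difference that the derived statistic depends on.

```javascript
// Illustrative only: field names mirror SL1L2-style partial sums; data is made up.

// Build partial sums (means of f, o, f*o, f*f, o*o) from matched pairs.
function partialSums(pairs) {
  const n = pairs.length;
  const mean = (fn) => pairs.reduce((acc, p) => acc + fn(p), 0) / n;
  return {
    total: n,
    fbar: mean((p) => p.f),
    obar: mean((p) => p.o),
    fobar: mean((p) => p.f * p.o),
    ffbar: mean((p) => p.f * p.f),
    oobar: mean((p) => p.o * p.o),
  };
}

// Derive ME and RMSE from the partial sums alone.
function deriveCnt(s) {
  const me = s.fbar - s.obar;
  const mse = s.ffbar - 2 * s.fobar + s.oobar; // subtracts nearly equal large terms
  // Rounding can push a tiny MSE below zero; clamp before the square root.
  return { me, rmse: Math.sqrt(Math.max(mse, 0)) };
}

// Reference RMSE computed directly from the pairs.
function directRmse(pairs) {
  const sq = pairs.reduce((acc, p) => acc + (p.f - p.o) ** 2, 0);
  return Math.sqrt(sq / pairs.length);
}

// Simulate a text file that stores every value with 5 decimal places.
function roundFields(s, dp = 5) {
  const out = {};
  for (const [k, v] of Object.entries(s)) out[k] = Number(v.toFixed(dp));
  return out;
}

// Hypothetical agreement tolerance.
const agrees = (a, b, tol = 1e-4) => Math.abs(a - b) <= tol;

// Values near 300 with errors near 0.001: large magnitudes, tiny differences.
const pairs = [
  { f: 300.001, o: 300 },
  { f: 300.002, o: 300 },
];
const exact = partialSums(pairs);
const rounded = roundFields(exact);
const reference = directRmse(pairs);

console.log(agrees(deriveCnt(exact).rmse, reference));   // true
console.log(agrees(deriveCnt(rounded).rmse, reference)); // false: rounding erased the signal
```

In this made-up case the formula is correct in both runs; only the rounded inputs differ, which is why such a miss is better documented than "fixed".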
