How the lab works

MET-AL is organized as a research lab, not a product repo. The unit of work is the experiment: one idea from the idea board, prototyped in isolation, reviewed, and only then consolidated.

The experiment pattern

Each experiment gets:

its own branch (experiment/<name>) and worktree (.claude/worktrees/experiment-<name>/), so parallel experiments never interfere;
PLAN.md — the design intent written before building;
a runnable, dependency-free prototype (vanilla ES modules, no build step, no CDN);
REVIEW.md — a feasibility write-up with an explicit verdict: merge · needs work · defer · discard.

Experiments were built by 3-agent teams (architect → implementer → reviewer), then independently verified in a real browser (Playwright over a static HTTP server) by the lab orchestrator, with screenshots archived under docs/reports/screenshots/.

Duplicate scaffolding across experiments is expected during prototyping and squeezed out later on the integration branch — see Repository layout.

Honesty as a design rule

Several lab conventions exist to keep feasibility signal trustworthy:

Ratio-of-sums, on screen. Averaging already-derived metrics (“mean-of-ratios”) is statistically wrong because it weights every cell equally regardless of its sample size. Apps instead carry raw counts/partial sums, aggregate those, and only then derive the metric; several also render the wrong value alongside to show the gap.
Parity against an independent oracle. Recomputed statistics are checked against MET’s own .stat output, not against fixtures that could share the code’s assumptions. A real bug, the N_THRESH off-by-one, slipped past 76/76 self-tests precisely because the fixtures encoded the same wrong assumption; an adversarial review against MET source caught it.
Provenance badges. Every gallery card declares its data provenance (real·streamed / real·baked / real·load-a-file / synthetic) and which MET component it reimagines; the genuinely novel app (the 3D metric cube) carries no component badge.
Blind review. For the pipeline-explorer parity claim, a fresh reviewer was handed the 84 paired values with provenance stripped (“Series A/B”) and asked to judge agreement independently.
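
The ratio-of-sums rule can be illustrated with a minimal sketch. The cell shape (hits, total), the sample numbers, and the function names are hypothetical, chosen for illustration rather than taken from the lab's code.

```javascript
// Hypothetical per-cell raw counts: one tiny cell and one large cell.
const cells = [
  { hits: 1, total: 2 },   // ratio 0.5 from only 2 observations
  { hits: 98, total: 98 }, // ratio 1.0 from 98 observations
];

// Wrong: average the already-derived ratios, so both cells count equally.
function meanOfRatios(cells) {
  const ratios = cells.map((c) => c.hits / c.total);
  return ratios.reduce((a, b) => a + b, 0) / ratios.length;
}

// Right: aggregate the raw counts first, then derive the ratio once.
function ratioOfSums(cells) {
  const hits = cells.reduce((a, c) => a + c.hits, 0);
  const total = cells.reduce((a, c) => a + c.total, 0);
  return hits / total;
}

console.log(meanOfRatios(cells)); // 0.75
console.log(ratioOfSums(cells));  // 0.99
```

The two-observation cell drags the mean-of-ratios down to 0.75, while 99 of the 100 pooled observations are hits. Showing both numbers side by side makes that gap visible.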

Verification layers

Formula self-tests — hand-worked cells and edge cases (lib/selftest.mjs, 129 checks).
Differential tests — old-vs-new code on real data during migrations (~34k assertions, bit-for-bit).
Real-data oracles — parser + math validated on the full archive: 6,329 .stat files, 88,456 records, 0 errors; 32,389/32,390 SL1L2→CNT and 7,038/7,038 VL1L2→VCNT cross-checks (the single “miss” is a documented catastrophic-cancellation artifact of MET’s 5-dp rounding, not a bug).
Browser verification — every app exercised in a real browser before its verdict; GPU kernels verified on real Apple Metal hardware, not SwiftShader.
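
As a rough illustration of how 5-dp rounding can produce a catastrophic-cancellation artifact, the sketch below derives a variance from rounded partial sums. The input values are invented and are not the archive record mentioned above; FBAR and FFBAR follow the SL1L2 partial-sum naming (mean of f, mean of f squared), and the sample-size correction is omitted for brevity.

```javascript
// Hypothetical forecasts with a large mean and a tiny spread (e.g. kelvin).
const f = [300.001, 300.002, 300.003];

// Round to 5 decimal places, mimicking a 5-dp text output.
const round5 = (x) => Math.round(x * 1e5) / 1e5;
const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Reference: two-pass population variance computed from the raw values.
const fbar = mean(f);
const varianceTwoPass = mean(f.map((x) => (x - fbar) ** 2));

// File-based path: only the rounded partial sums are available.
const FBAR = round5(fbar);
const FFBAR = round5(mean(f.map((x) => x * x)));
// Subtracting two nearly equal numbers near 90001.2 leaves mostly rounding error.
const varianceFromRounded = FFBAR - FBAR * FBAR;

console.log(varianceTwoPass);     // about 6.67e-7
console.log(varianceFromRounded); // about -4e-6, i.e. negative
```

The true variance is far smaller than the 5-dp resolution of FFBAR, so the subtraction returns rounding noise. A negative variance also turns any derived standard deviation into NaN, which is why a mismatch of this kind reflects the rounded inputs rather than a defect in the recomputation.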