Lessons & gotchas
A lab’s most valuable output is often what it learned. These are the non-obvious ones, kept so nobody pays for them twice.
Statistics & numerics
- Ratio-of-sums, never mean-of-ratios. Aggregate raw counts / partial sums first, then derive the metric. Averaging per-group metrics is wrong, and the error is large enough to show on screen.
- MET rounds its `.stat` output to 5 decimal places. Recomputed real-valued stats can only be expected to match to about ±5e-6; near-perfect forecasts can produce catastrophic cancellation when deriving RMSE from rounded partial sums (clamp tiny negative MSE to 0).
- `N_THRESH` counts threshold edges, not bins (MET writes `pct.nrows()+1`). The first parser cut got this wrong and 76/76 self-tests passed anyway, because the fixtures encoded the same wrong assumption. For format parsers, verify against an independent source of truth (the MET source code), not fixtures you wrote yourself.
- Pair SL1L2↔CNT lines on the full common header minus `LINE_TYPE`/`ALPHA`. SL1L2 has `ALPHA=NA` while its CNT twin has `ALPHA=0.05`; a reduced key silently collides across regions/levels.
- `FCST_LEAD` is an HHMMSS string (e.g. `060000`), zero-padded. Match leads as strings or convert once in a shared place; an int comparison silently pairs the wrong lead.
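
The ratio-of-sums and clamping advice above can be sketched together. This is a minimal sketch, assuming each SL1L2 line has already been parsed into an object keyed by MET's column names (`TOTAL`, `FBAR`, `OBAR`, `FOBAR`, `FFBAR`, `OOBAR`). The helper names `poolSl1l2` and `rmseFromSl1l2` and the two sample groups are hypothetical, not taken from the lab's own library.

```javascript
// Hypothetical helpers; field names follow MET's SL1L2 columns.
const MEAN_FIELDS = ["FBAR", "OBAR", "FOBAR", "FFBAR", "OOBAR"];

// Pool SL1L2 lines: each *BAR column is a mean over TOTAL pairs, so turn
// it back into a sum (mean * TOTAL), add the sums, and divide once.
function poolSl1l2(lines) {
  const pooled = { TOTAL: 0 };
  for (const f of MEAN_FIELDS) pooled[f] = 0;
  for (const line of lines) {
    pooled.TOTAL += line.TOTAL;
    for (const f of MEAN_FIELDS) pooled[f] += line[f] * line.TOTAL;
  }
  if (pooled.TOTAL === 0) return null; // nothing to aggregate
  for (const f of MEAN_FIELDS) pooled[f] /= pooled.TOTAL;
  return pooled;
}

// MSE = mean(f^2) - 2 * mean(f*o) + mean(o^2). With inputs rounded to
// 5 decimals this can land slightly below zero, so clamp before the sqrt.
function rmseFromSl1l2(s) {
  const mse = s.FFBAR - 2 * s.FOBAR + s.OOBAR;
  return Math.sqrt(Math.max(mse, 0));
}

// Two made-up groups: one pair with error 1, eight pairs with error 0.
const groupA = { TOTAL: 1, FBAR: 1, OBAR: 0, FOBAR: 0, FFBAR: 1, OOBAR: 0 };
const groupB = { TOTAL: 8, FBAR: 2, OBAR: 2, FOBAR: 4, FFBAR: 4, OOBAR: 4 };

const pooledRmse = rmseFromSl1l2(poolSl1l2([groupA, groupB]));
const meanOfRmses = (rmseFromSl1l2(groupA) + rmseFromSl1l2(groupB)) / 2;
console.log({ pooledRmse, meanOfRmses });
```

With these made-up inputs the pooled RMSE is about 0.333 (one unit of squared error over nine pairs), while the mean of the per-group RMSEs is 0.5, which is the kind of gap that shows up on screen.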
GPU / WebGPU
- Never use NaN as a missing-data sentinel in WGSL. Apple Metal fast-math folds the `x != x` NaN self-test to a constant, so masked cells leak into every window. Use an out-of-band ordered sentinel (`1e30`) and test `x < 1e29`.
- WebGPU’s default `maxStorageBufferBindingSize` is 128 MiB. Oversized bindings don’t error visibly; they silently bind zeros. Request the adapter’s limits at device creation.
- Backticks inside a WGSL comment inside a JS template literal terminate the string. A one-character comment edit can become a bewildering `SyntaxError`.
- Playwright’s bundled headless Chromium has no WebGPU. Use the real Chrome channel with `--enable-unsafe-webgpu --use-angle=metal` to get the actual Metal adapter, which also means GPU code can be benchmarked honestly in automation.
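
The sentinel and device-limit advice above might look like this on the JavaScript side. This is a sketch under assumptions: `packForGpu`, `storageLimitsFrom`, `requestLargeBufferDevice`, and the WGSL helper `is_valid` are hypothetical names, and `gpu` stands for `navigator.gpu` passed in by the caller. Only `requestAdapter`, `requestDevice`, `requiredLimits`, and the two limit names are standard WebGPU API.

```javascript
// Out-of-band sentinel from the text: ordered, finite, far above real data.
const MISSING = 1e30;

// Replace every non-finite value (NaN, Infinity, gaps) with the sentinel
// before upload, so the shader never relies on NaN semantics.
function packForGpu(values) {
  const out = new Float32Array(values.length);
  for (let i = 0; i < values.length; i++) {
    out[i] = Number.isFinite(values[i]) ? values[i] : MISSING;
  }
  return out;
}

// WGSL kept in a template literal; its comments must not contain backticks.
const IS_VALID_WGSL = `
// Ordered comparison instead of a NaN self-test.
fn is_valid(x: f32) -> bool {
  return x < 1e29;
}
`;

// Copy the adapter's reported limits into requiredLimits so bindings above
// the 128 MiB default are permitted on hardware that supports them.
function storageLimitsFrom(adapterLimits) {
  return {
    maxStorageBufferBindingSize: adapterLimits.maxStorageBufferBindingSize,
    maxBufferSize: adapterLimits.maxBufferSize,
  };
}

// gpu is navigator.gpu in a browser; it is a parameter to keep this testable.
async function requestLargeBufferDevice(gpu) {
  const adapter = await gpu.requestAdapter();
  if (!adapter) throw new Error("No WebGPU adapter available");
  return adapter.requestDevice({
    requiredLimits: storageLimitsFrom(adapter.limits),
  });
}
```

In a page this would be called as `await requestLargeBufferDevice(navigator.gpu)`. Even with raised limits, it is still worth checking a buffer's byte length against `device.limits.maxStorageBufferBindingSize` before binding it.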
Data engineering
- Validate archives with `gzip -t`, not `tar -tzf`. libarchive lists headers leniently and exits 0 on a truncated stream; only the gzip CRC check validates the whole file.
- GRIB2 decode dominates conversion cost (~90% of ~73 s/cycle) and GRIB2→Zarr expands on disk (GRIB2’s native packing beats zstd+bitround). Decode once, offline.
- Avoid the tiny-object anti-pattern on object stores. 55k objects for 810 MB of pairs cubes is the wrong shape for R2; consolidate before uploading.
- Grid conventions bite twice: the v2 common grid stores longitude 0–360 with row 0 at the south edge; every consumer must convert to −180..180 and flip vertically or the map renders blank/upside-down.
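
The grid conversion in the last bullet can be sketched as a single reindexing pass. This is a minimal sketch, not the consumers' actual code. It assumes a row-major array, uniform longitude spacing with the first column at 0° and no duplicated 360° column, and an even column count. The function name `toMapOrientation` is hypothetical.

```javascript
// Reorient a global grid stored with longitudes 0..360 and row 0 at the
// south edge into -180..180 with row 0 at the north edge.
// Assumes an even column count, so the 180-degree column is at nx / 2.
function toMapOrientation(src, nx, ny) {
  if (src.length !== nx * ny) {
    throw new RangeError("src length must equal nx * ny");
  }
  if (nx % 2 !== 0) {
    throw new RangeError("nx must be even for a half-width roll");
  }
  const out = new Float32Array(src.length);
  const half = nx / 2;
  for (let row = 0; row < ny; row++) {
    const srcRow = ny - 1 - row; // vertical flip: north edge becomes row 0
    for (let col = 0; col < nx; col++) {
      const srcCol = (col + half) % nx; // roll: 180 (now -180) comes first
      out[row * nx + col] = src[srcRow * nx + srcCol];
    }
  }
  return out;
}
```

Doing the roll and the flip in one shared helper, rather than in each consumer, keeps the convention in a single place.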
Web platform & deployment
- ES modules don’t load from `file://` (CORS `origin: null`). Either serve over HTTP or inline the whole module graph, which is exactly what `tools/inline.mjs` does, via an import map of `data:` URLs (relative specifiers would resolve against the `data:` base and break; bare import-map keys dodge that).
- Cloudflare Pages reuses the previous deployment manifest when “0 files uploaded”. Deleting or excluding a file does not remove it from the live site; change at least one file’s content to force a real manifest.
- Pin `--branch main` when deploying from a non-main worktree, or the deploy lands as a preview URL instead of production.
- Serve the repo root, not `dist/`. Apps import `../../lib/met-stats.mjs`, so the whole tree must share one origin.
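
The import-map approach in the first bullet can be illustrated roughly as follows. This is a sketch of the general technique, not the contents of `tools/inline.mjs`. The function names and the bare keys `lib/met-stats` and `app/main` are hypothetical, and the step that rewrites each module's relative imports to bare keys is assumed to have happened already.

```javascript
// modules maps a bare specifier to module source whose own imports have
// already been rewritten to bare specifiers (that rewrite is not shown).
function buildDataUrlImportMap(modules) {
  const imports = {};
  for (const [specifier, source] of Object.entries(modules)) {
    imports[specifier] =
      "data:text/javascript;charset=utf-8," + encodeURIComponent(source);
  }
  return { imports };
}

// Serialize for embedding in HTML; escape "<" so the JSON cannot close
// the script element early.
function importMapScriptTag(importMap) {
  const json = JSON.stringify(importMap).replace(/</g, "\\u003c");
  return `<script type="importmap">${json}</script>`;
}

// Hypothetical two-module graph using bare keys only.
const exampleMap = buildDataUrlImportMap({
  "lib/met-stats": "export const answer = 42;",
  "app/main": 'import { answer } from "lib/met-stats"; console.log(answer);',
});
const exampleTag = importMapScriptTag(exampleMap);
console.log(exampleTag);
```

The page would then load its entry point with an inline `<script type="module">import "app/main";</script>` placed after the import map, so every module resolves without touching `file://`.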
Process
- Never let parallel agents drive one browser. Verification is centralized and sequential; concurrent Chromium automation deadlocks.
- Beware circular fixtures (see `N_THRESH` above): the reviewer that catches spec bugs is the one reading the other implementation.
- De-identify before anything public. Model names, masks, coordinates, and valid times were scrubbed everywhere public; a private bucket + gated Worker keeps the real data real.
