MET in the client — a feasibility matrix

The one rule: Three destinies for a MET workload

Verification math → runs anywhere

CTC/CTS, SL1L2/CNT, VL1L2/VCNT, PCT/PSTD, aggregation, bootstrap CIs. Sums and ratios over small numbers of aggregables. Sub-millisecond to low-millisecond in plain JS — cheaper than a single network round-trip.
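To make the "sums and ratios" point concrete, here is a minimal sketch of the categorical half: threshold two fields into a 2×2 contingency table, then derive a few scores from the counts. The function names are hypothetical (not the lab's lib API), the event is assumed to be `value >= threshold`, and missing-data handling and masks are left out.

```javascript
// Hypothetical helpers for illustration; not the lab's actual API.
// Build a 2x2 contingency table for the event "value >= threshold".
function contingencyCounts(fcst, obs, threshold) {
  let fy_oy = 0, fy_on = 0, fn_oy = 0, fn_on = 0;
  for (let i = 0; i < fcst.length; i++) {
    const f = fcst[i] >= threshold;
    const o = obs[i] >= threshold;
    if (f && o) fy_oy++;        // hit
    else if (f) fy_on++;        // false alarm
    else if (o) fn_oy++;        // miss
    else fn_on++;               // correct negative
  }
  return { total: fcst.length, fy_oy, fy_on, fn_oy, fn_on };
}

// Derive a few categorical scores from the counts alone.
function categoricalScores({ fy_oy, fy_on, fn_oy }) {
  return {
    pod: fy_oy / (fy_oy + fn_oy),          // probability of detection
    far: fy_on / (fy_oy + fy_on),          // false alarm ratio
    csi: fy_oy / (fy_oy + fy_on + fn_oy),  // critical success index
  };
}

// Made-up sample values, six matched pairs.
const fcstDemo = [0.0, 1.2, 3.5, 0.4, 2.1, 0.0];
const obsDemo  = [0.0, 0.8, 2.9, 1.5, 0.2, 0.0];
const ctcDemo = contingencyCounts(fcstDemo, obsDemo, 1.0);
const scoresDemo = categoricalScores(ctcDemo);
console.log(ctcDemo, scoresDemo);
```

Because the scores depend only on four counts, moving a threshold slider means one pass over the pairs and a handful of divisions.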

Grid operators → CPU fine, GPU shines

Neighborhood methods (FSS), object identification (MODE), regridding, pairs-on-the-fly. O(cells) work: tens of ms per megacell on CPU, ~7× faster on WebGPU, rendered straight to screen with no readback.
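The O(cells) claim for FSS rests on the integral-image trick: after one prefix-sum pass, any box count costs four lookups regardless of radius. The sketch below is a plain CPU version under stated assumptions: a square window of half-width `r`, cells outside the grid counted as non-events, and the event defined as `value >= threshold`. MET's own edge and coverage handling is configurable and may differ, and the function names are illustrative only.

```javascript
// Summed-area table with a zero border:
// S[(y+1)*(w+1) + (x+1)] holds the sum of binary[0..y][0..x].
function integralImage(binary, w, h) {
  const W = w + 1;
  const S = new Float64Array(W * (h + 1));
  for (let y = 0; y < h; y++) {
    let rowSum = 0;
    for (let x = 0; x < w; x++) {
      rowSum += binary[y * w + x];
      S[(y + 1) * W + (x + 1)] = S[y * W + (x + 1)] + rowSum;
    }
  }
  return S;
}

// Event fraction in the (2r+1)^2 window centred on (x, y); four lookups.
// Assumption: off-grid cells count as non-events (zero padding).
function boxFraction(S, w, h, x, y, r) {
  const x0 = Math.max(0, x - r), y0 = Math.max(0, y - r);
  const x1 = Math.min(w, x + r + 1), y1 = Math.min(h, y + r + 1);
  const W = w + 1;
  const count = S[y1 * W + x1] - S[y0 * W + x1] - S[y1 * W + x0] + S[y0 * W + x0];
  const side = 2 * r + 1;
  return count / (side * side);
}

// FSS = 1 - sum((Pf - Po)^2) / (sum(Pf^2) + sum(Po^2)).
function fss(fcst, obs, w, h, threshold, r) {
  const fb = Uint8Array.from(fcst, v => (v >= threshold ? 1 : 0));
  const ob = Uint8Array.from(obs, v => (v >= threshold ? 1 : 0));
  const Sf = integralImage(fb, w, h);
  const So = integralImage(ob, w, h);
  let num = 0, den = 0;
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      const pf = boxFraction(Sf, w, h, x, y, r);
      const po = boxFraction(So, w, h, x, y, r);
      num += (pf - po) * (pf - po);
      den += pf * pf + po * po;
    }
  }
  return den === 0 ? NaN : 1 - num / den; // undefined when neither field has events
}
```

A radius sweep reuses the two integral images and only repeats the cheap lookup pass, which is what makes radius and threshold sliders plausible.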

Format decode → offline, once

GRIB2 / prepBUFR / NetCDF decoding is ~90% of pipeline cost, I/O-bound, and its output expands. Pre-convert to Zarr v3 + Parquet; the browser range-reads exactly what a view needs (1–2 GETs per map frame).

proven vs MET: validated against MET's own .stat output · measured: benchmarked with built-in correctness checks · partial / untested: plausible, not yet demonstrated · offline by design: deliberately kept out of the client

The matrix: MET ecosystem × client-side feasibility

Functionality	Client?	Evidence (measured)	Why · limitations · what the client buys
MET core — grid & point verification
grid_stat · categorical + continuous: threshold → CTC/CTS · SL1L2 → CNT · masks	proven vs MET	real 8k-cell case: 0.46 ms warm / 4.28 ms cold; SL1L2 reduction ~70 M cells/s; counts BIT-IDENTICAL to MET .stat, reals ±5e-6 (MET's 5-dp rounding); 88/88 oracle checks; live in app 09	This is the heart of MET, and it is decisively client-sized. A full recompute costs less than one 60 fps frame, so thresholds/masks/regions become sliders, not batch jobs. Limitation: interpolation-method parity (e.g. budget vs nearest) must match MET's config to reproduce matched pairs exactly, hence verifying on a common grid. Benefit: verification-from-a-URL; zero install; what-if in real time.
grid_stat · neighborhood (FSS/NBRCNT): the operator with no closed form over sums	proven vs CPU ref	integral-image CPU O(cells); 4 WebGPU kernels (naive/separable/prefix-scan/multi-block) parity Δ~1e-8 on real Metal; ~7.3× CPU at 2048² (4.2 M cells); GPU→screen with no readback; app 11	The stress test for "grid ops on the GPU", and it passed. No NBRCNT lines exist in this archive, so parity is proven CPU↔GPU (exact n, FSS to 1e-8), not vs a MET oracle; stated honestly. Limitations discovered and documented: NaN-sentinel folding under fast-math, the silent 128 MiB binding cap. Benefit: radius/threshold sweeps become interactive; the same integral image drives both score and on-screen field.
point_stat · matching + scoring: interpolate to obs, form pairs, score	measured	bilinear match + SL1L2 + CNT at ~13 M pts/s; 1 M points = 79 ms; pair errors ≤1e-13 on an analytic field; RMSE≈σ sanity passes (tools/bench/bench-point-match.mjs)	Given pre-decoded obs (Parquet), the inner loop is trivially fast. The archive's GDAS obs are 11.7 M rows = 45 MB Parquet; one synoptic time over a region is 10³–10⁵ points, which at ~13 M pts/s is sub-millisecond up to about 10⁴ points and under 10 ms at 10⁵. Limitations: prepBUFR decode and MET's obs QC/level-dedup logic stay offline/unported; land/sea & topography masks need the static fields shipped once. Benefit: station-level drill-down with live re-scoring.
Regridding · `regrid.to_grid`: put fcst and truth on one grid	measured	bilinear weights (417k pts) built once in 4.5 ms; 1.22 ms/frame apply (342 M pts/s); budget ×3 box-average 3.3 ms/frame; plane reproduced to 1.4e-13 (bench-regrid.mjs)	Fast enough to regrid interactively, which the v2 pipeline currently does offline in numpy; exploratory "compare any two models on any grid" needs no conversion pass. Limitations: conservative regridding of precip at large ratios needs proper area weights (budget here is ×3 box-average); Lambert/rotated-pole projections add a coordinate transform (cheap, but must match MET's). Benefit: kills the biggest constraint on which model pairs the client can score.
pcp_combine: accumulation-bucket arithmetic	measured (same op)	frame-minus-frame at 416k cells runs live in app 10's Difference mode (with running bias/RMSE/MAE); array add/subtract ≪ regrid cost	Sum/difference of accumulation fields is the cheapest grid op here. Limitation: bucket semantics (resets, missing cycles) are bookkeeping that must be encoded in store metadata. Benefit: build any accumulation window on the fly instead of pre-materializing APCP_01/03/06/24 variants.
MODE · object-based verification: smooth → threshold → objects → attributes → match	faithful core BUILT	lib/met-mode.mjs: disc convolution + attributes + MET fuzzy-interest engine (default weights) + cluster merging; 42/42 selftest (hand-worked geometry); full pipeline ~12 ms on the real case, live in the MODE Lab app (card 13); scale probe 73.5 ms at 1024²	The core loop is comfortably interactive on CPU alone, and it is FSS-shaped work, so the GPU path is proven adjacent. Limitations, honestly: this is now a faithful CORE (true disc smoothing, MET's attribute set, the MODEConfig_default weights and threshold, configurable interest maps, cluster merging), but the archive has no MODE output to use as an oracle, so verification is hand-worked geometry + invariants, not parity; curvature/percentile-thresholds/secondary merges remain unported. Benefit: convolution radius and threshold become sliders on the object map. MODE's parameters are notoriously fiddly, and interactive feedback is exactly what its users lack.
MTD · time-domain objects: MODE in 3-D (x, y, t)	measured	3-D 26-connected space-time labeling + track attributes: 24×512² in 101 ms (~107 MB working set); 24×1024² in 379 ms (~428 MB); moving-blob verification exact (bench-mtd.mjs)	The spike is done: compute is fine (~65 M cells/s), and memory is the real constraint exactly as predicted. A 24-frame 1024² stack holds ~428 MB of arrays: comfortable on desktop, tight on mobile. Streaming frames (label t against t−1 only) would cut that if it ever matters.
ensemble_stat: RHIST · CRPS · spread-skill · ensemble probs	math ready · no data	20 members × 100k cases: rank hist 5.1 ms · spread-skill 15.2 ms · CRPS 112 ms (bench-ensemble.mjs); lib math covered by 129/129 selftest	Compute is a non-issue; the blocker is that this archive is deterministic-only (no ECNT/ORANK/RHIST lines), so there is nothing real to verify against; app 06 stays synthetic with a banner saying so. The day ensemble output exists, the client is already fast enough.
Probabilistic (PCT/PSTD/PRC): reliability · Brier decomposition · ROC	proven (math)	parsed PCT bins reproduce the paired PSTD line's Brier decomposition EXACTLY (browser-verified, app 08); N_THRESH=edges lesson encoded in the parser	The math and parsing are proven end-to-end; the archive just contains no probabilistic lines to display. ROC shipped alongside the Taylor diagram in card 12 (see the METviewer row); PRC rendering is a small addition. Benefit: reliability diagrams that recompute as you re-bin.
wavelet_stat: intensity-scale decomposition	untested	no measurement; a Haar transform is O(cells) with tiny constants	Nothing about it looks client-hostile: it is less work per cell than FSS. Parked only because no one has asked for it yet; would need its own oracle case.
Statistics over archives — the analysis layer
stat_analysis · filter/aggregate .stat archives: the batch tool behind most MET Q&A	proven vs MET	FULL real archive (6,329 files / 59.8 MB / 88,456 records): parse 602 ms (99 MB/s), ratio-of-sums aggregate 14.5 ms → 1,005 groups, series 0.14 ms; cold→plot 1.4 s (bench-archive-aggregate.mjs); parser 0 errors; 32,389/32,390 + 7,038/7,038 derived-line oracle checks	The whole archive's statistics are a client-side object. After a one-time parse (or an 8.8 MB Parquet load), every re-slice (new model, threshold, grouping) costs ~15 ms. The 5-dp cancellation caveat applies when re-deriving from MET-rounded sums (one documented case in 32k). Benefit: METviewer-class questions with zero infrastructure, plus CIs the batch tool doesn't give you (next row).
Bootstrap confidence intervals: the METcalcpy capability gap	measured	SHIPPED in lib/met-stats.mjs (bootstrapCI, percentile + BCa; selftest 140/140) and live in apps 02/04/12: 0.4 ms per CI (B=1000); bands recomputed per interaction; scorecard significance = paired, event-equalized bootstrap	The blueprint asked whether an n-weighted proxy over MET's per-record _BCL/_BCU was acceptable. Wrong question: the real thing is free. Resampling cycles and re-deriving via ratio-of-sums is sub-millisecond, so every plotted point can carry a CI, recomputed on every interaction. Limitations: percentile method under-covers slightly at small n (93.3% vs 95% at n=48; BCa would tighten it); within-case pair-level bootstrap needs the pair grids (available via the v2 store). Benefit: first-class uncertainty (an idea-board theme) for ~30 lines of code.
Event equalization: only compare where all models verify	implemented	the scorecard (card 12) compares models on their COMMON init cycles only: a set intersection over the funnel's cycle keys, inside every paired-bootstrap cell	Plain set intersection over (cycle, lead, var, mask) keys: minutes of work on top of the existing grouping, or one SQL `INTERSECT` in DuckDB. Listed separately because METcalcpy treats it as a feature; here it falls out of the data model.
The ecosystem — data, viewers, orchestration
METdataio: load .stat into a database	replaced, serverless	entire 1.53 M-row .stat archive = 8.8 MB Parquet; DuckDB-WASM queries it live from R2 in app 10 ("80 rows across 4 models"); .stat/.txt/MODE parser 97/97 + 0 errors on 6,329 real files (app 08)	The database's job was random access and aggregation; Parquet + DuckDB-WASM do both without the server. Limitation: writes/curation stay offline (that's the conversion step); multi-user concurrency is N browsers reading one immutable object, which is a feature. Benefit: the entire "install MySQL, define schemas, run the loader" on-ramp disappears.
METviewer: web UI (select → aggregate → plot)	demonstrated	app 10 (SQL → RMSE-vs-lead by model, live from R2) + app 02 (Metric/Dimension/Filter/Facet explorer) + 15 ms re-aggregation at full-archive scale	Architecture proven and the signature views now exist: the scorecard (with honest paired-bootstrap significance), Taylor diagram, and ROC shipped as card 12. Limitation: METviewer's long tail of plot types and its saved-XML workflows would need deliberate porting. Benefit: selection loop goes from form-submit-wait to direct manipulation at 60 fps.
METcalcpy: aggregation, derivation, statistics in Python	mostly covered	ratio-of-sums aggregation (proven), VL1L2→VCNT 7,038/7,038, bootstrap CIs (shipped, 140/140 lib checks), event equalization (shipped in the scorecard); remaining: the long tail of specialized derivations	The shared lib is a growing JS METcalcpy with a stricter honesty contract (NaN-safety, wrong-way demonstrators). Remaining gap after this round: the long tail of specialized derivations (add per demand, each with an oracle).
METplotpy: the plot catalog	broad coverage	performance diagram, reliability+sharpness+Brier, rank histogram, spread-skill, CRPS, threshold scrubber, object views, spatial fields, all shipped & browser-verified across apps 03–07	Interactive versions of the core catalog exist; the Taylor diagram and ROC, previously the named absentees, shipped in card 12 (see the METviewer row). Benefit over static PNGs: linked brushing, hover-to-counts, and provenance on every plot.
METexpress: simplified predefined-query viewer	same class	no separate demo; it is a curated subset of the METviewer capability demonstrated above	If METviewer's loop runs serverless, METexpress's narrower loop does too. The interesting port is its curated question set, not its plumbing.
METplus wrappers: config-driven batch orchestration	different role	app 09 reconstructs the pipeline narratively (10 stages, each badged browser/pre-encode/offline) rather than executing configs	Orchestration is the one piece that shouldn't move client-side: it exists to babysit filesystems, schedulers, and 62 GiB of I/O. The browser-side analog is session state (URL-hash configs, shareable analyses), which the lab already has. Verdict: reimagine, don't port.
Kept offline — on purpose
GRIB2 / NetCDF decode (+ prepBUFR for point obs)	offline by design	measured: decode ≈ 90% of the ~73 s/cycle conversion; GRIB2→Zarr EXPANDS on disk (1,226→1,692 MB; GRIB2's packing beats zstd+bitround); full archive ≈ 25–30 min once	A WASM ecCodes port would work and still be the wrong answer: you'd ship megabytes of decoder to spend seconds of CPU producing data you then can't range-read. Decode once, store analysis-ready (Zarr v3 whole-frame chunks + Parquet), and a map frame costs 1–2 GETs (~2.5–8 MB compressed). The client's superpower is lazy access, not brute decoding.
Bulk pair materialization: MET's _pairs.nc intermediates	eliminated	v2 store: pairs computed on the fly (fcst − truth on a common grid); 55,331 objects / 810 MB of stored pairs → a few oracle objects; live pairs for any selection ≈ 4 GETs / 11–17 MB; CTC from the oracle BIT-IDENTICAL across all 5 precip thresholds (app 10, production-verified)	Client-side compute doesn't just relocate this stage; it deletes it. Any model/var/lead/region pairing becomes computable, not just the ones a batch config anticipated. A small stored oracle keeps the parity proof alive.
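The bootstrap and event-equalization rows of the matrix combine naturally into one small routine. The sketch below is an illustration, not the shipped `bootstrapCI`: it intersects cycle keys across models, resamples the common cycles with replacement, and re-derives RMSE as a ratio of sums for both models from the same resample (a paired percentile bootstrap). The record shape (`cycle`, `total`, `sse`), the function names, and the seeded PRNG are all assumptions made for the example.

```javascript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Event equalization: keep only the cycles every model has.
function commonCycles(byModel) {
  const sets = Object.values(byModel).map(recs => new Set(recs.map(r => r.cycle)));
  if (sets.length === 0) return [];
  return [...sets[0]].filter(c => sets.every(s => s.has(c))).sort();
}

// RMSE as a ratio of sums: pool squared error and counts, then divide once.
function rmseOf(records) {
  let sse = 0, n = 0;
  for (const r of records) { sse += r.sse; n += r.total; }
  return Math.sqrt(sse / n);
}

// Paired percentile bootstrap of RMSE(a) - RMSE(b) over the common cycles.
function pairedBootstrap(byModel, a, b, B = 1000, alpha = 0.05, seed = 1) {
  const cycles = commonCycles(byModel);
  const index = m => new Map(byModel[m].map(r => [r.cycle, r]));
  const ia = index(a), ib = index(b);
  const rng = mulberry32(seed);
  const diffs = new Float64Array(B);
  for (let k = 0; k < B; k++) {
    const ra = [], rb = [];
    for (let i = 0; i < cycles.length; i++) {
      const c = cycles[Math.floor(rng() * cycles.length)];
      ra.push(ia.get(c));   // the same resampled cycle feeds both models
      rb.push(ib.get(c));
    }
    diffs[k] = rmseOf(ra) - rmseOf(rb);
  }
  diffs.sort();             // typed-array sort is numeric
  const lo = diffs[Math.floor((alpha / 2) * B)];
  const hi = diffs[Math.ceil((1 - alpha / 2) * B) - 1];
  const point = rmseOf(cycles.map(c => ia.get(c))) - rmseOf(cycles.map(c => ib.get(c)));
  return { cycles, point, lo, hi, significant: lo > 0 || hi < 0 };
}

// Made-up records: modelB is missing the first cycle, modelA the last.
const demoRecords = {
  modelA: [1, 2, 3, 4, 5].map(i => ({ cycle: "c" + i, total: 100, sse: 350 + 20 * i })),
  modelB: [2, 3, 4, 5, 6].map(i => ({ cycle: "c" + i, total: 100, sse: 120 + 10 * i })),
};
const demoResult = pairedBootstrap(demoRecords, "modelA", "modelB");
console.log(demoResult);
```

Resampling whole cycles keeps the pairing between models intact, which is what makes the interval on the difference meaningful; a BCa correction would slot in where the percentiles are read off.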

Numbers: The benchmark ledger

Apple M2 Pro · Node v26 (V8, as the engine proxy) unless marked browser; browser numbers from the shipped apps on Apple Metal. New measurements are reproducible via tools/bench/ (each script carries its own correctness checks).
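As an illustration of what "carries its own correctness check" can look like, here is a hedged sketch in the spirit of the regrid benchmark: bilinear weights are built once, applied per frame, and checked by reproducing an analytic plane, which bilinear interpolation should recover to rounding error. The function names are hypothetical, target coordinates are assumed to be fractional source-grid indices inside the source domain, and projection handling is omitted.

```javascript
// Build once: four source indices and four weights per target point.
// targetX/targetY are fractional source-grid indices (an assumption of this sketch).
function buildBilinearWeights(srcW, srcH, targetX, targetY) {
  const n = targetX.length;
  const idx = new Uint32Array(n * 4);
  const wgt = new Float64Array(n * 4);
  for (let k = 0; k < n; k++) {
    // Clamp the cell origin so the 2x2 stencil stays inside the grid.
    const x0 = Math.min(Math.max(Math.floor(targetX[k]), 0), srcW - 2);
    const y0 = Math.min(Math.max(Math.floor(targetY[k]), 0), srcH - 2);
    const tx = targetX[k] - x0, ty = targetY[k] - y0;
    const base = y0 * srcW + x0;
    idx.set([base, base + 1, base + srcW, base + srcW + 1], k * 4);
    wgt.set([(1 - tx) * (1 - ty), tx * (1 - ty), (1 - tx) * ty, tx * ty], k * 4);
  }
  return { idx, wgt, n };
}

// Apply per frame: a gather and four multiply-adds per target point.
function applyWeights({ idx, wgt, n }, src, out = new Float64Array(n)) {
  for (let k = 0; k < n; k++) {
    const j = k * 4;
    out[k] = wgt[j] * src[idx[j]] + wgt[j + 1] * src[idx[j + 1]] +
             wgt[j + 2] * src[idx[j + 2]] + wgt[j + 3] * src[idx[j + 3]];
  }
  return out;
}

// Built-in correctness check: an analytic plane must be reproduced.
const plane = (x, y) => 2 * x + 3 * y + 1;
const SRC_W = 4, SRC_H = 3;
const srcField = new Float64Array(SRC_W * SRC_H);
for (let y = 0; y < SRC_H; y++)
  for (let x = 0; x < SRC_W; x++) srcField[y * SRC_W + x] = plane(x, y);

const tgtX = [0.5, 2.25, 3, 0], tgtY = [0.5, 1.75, 2, 0];
const weights = buildBilinearWeights(SRC_W, SRC_H, tgtX, tgtY);
const regridded = applyWeights(weights, srcField);
let maxPlaneError = 0;
for (let k = 0; k < tgtX.length; k++)
  maxPlaneError = Math.max(maxPlaneError, Math.abs(regridded[k] - plane(tgtX[k], tgtY[k])));
console.log("max plane error:", maxPlaneError);
```

Separating the build from the apply is what lets the per-frame cost stay small: the weights depend only on the two grids, not on the data.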

Workload	Scale	Result	Source
grid_stat full recompute (threshold→CTC→scores→sums→CNT)	real case, 8k cells	0.46 ms warm · 4.28 ms cold	explainer (browser+node)
SL1L2 reduction throughput	up to 2048² synthetic	~70 M cells/s	explainer
FSS on WebGPU vs CPU integral	2048² / 4.2 M cells	~7.3× faster · Δ ~1e-8 (browser, Metal)	app 11
.stat archive → plot-ready series (parse+index+aggregate)	6,329 files · 88,456 records	1.4 s cold · 14.5 ms per re-slice	bench-archive-aggregate
Bootstrap CI (percentile, B=1000, ratio-of-sums re-derive)	24 cycles/case set	0.4 ms per CI · 4.7 ms per 20-lead series	bench-bootstrap
MODE-lite object pipeline (smooth→label→attrs→match)	real 83×97 · 1024²	1.0 ms · 73.5 ms (≈14 M cells/s)	bench-mode-objects
Bilinear regrid apply (global 0.25° → 2.5 km-class grid)	417k target points	1.22 ms/frame (342 M pts/s)	bench-regrid
point_stat inner loop (match+SL1L2+CNT)	1 M obs points	79 ms (~13 M pts/s)	bench-point-match
Ensemble diagnostics (RHIST · spread-skill · CRPS)	20 members × 100k cases	5.1 · 15.2 · 111.8 ms	bench-ensemble
Full MODE pipeline (disc conv → objects → interest → clusters)	real case, 83×97 × 2 fields	~12 ms (browser, live sliders)	lib/met-mode + card 13
MTD 3-D space-time labeling + track attributes	24×512² · 24×1024²	101 ms · 379 ms (~65 M cells/s)	bench-mtd
Archive bundle → app-ready DataSource (funnel load)	44,228 cases · 0.8 MB gz	40 ms load · 38 ms 3-model series w/ B=1000 CIs	lib/met-data-source
DuckDB-WASM SQL over the full .stat Parquet	1.53 M rows · 8.8 MB	interactive, streamed from R2 (browser)	app 10
Offline conversion (the part kept out of the client)	1 cycle, all models	~73 s (GRIB2 decode ≈ 90%)	data-store bench
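For the ensemble diagnostics row, the per-case math is short enough to show in full. This sketch uses the empirical CRPS form (mean absolute error minus half the mean pairwise member spread) and a rank histogram with naive tie handling (ties count as "not below"); both choices are assumptions of the example and may differ from how MET's ensemble_stat treats ties or which CRPS variant it reports.

```javascript
// Rank histogram: for each case, the rank of the observation is the number
// of members strictly below it. Ties are not randomized in this sketch.
function rankHistogram(ensembles, obs) {
  const m = ensembles[0].length;
  const counts = new Array(m + 1).fill(0);
  for (let i = 0; i < ensembles.length; i++) {
    let rank = 0;
    for (const x of ensembles[i]) if (x < obs[i]) rank++;
    counts[rank]++;
  }
  return counts;
}

// Empirical CRPS for one case:
//   mean_i |x_i - y|  -  (1 / (2 m^2)) * sum_{i,j} |x_i - x_j|
function crpsEmpirical(members, y) {
  const m = members.length;
  let absErr = 0, spread = 0;
  for (let i = 0; i < m; i++) {
    absErr += Math.abs(members[i] - y);
    for (let j = 0; j < m; j++) spread += Math.abs(members[i] - members[j]);
  }
  return absErr / m - spread / (2 * m * m);
}

// Made-up three-member ensemble, three cases.
const demoEns = [[1, 2, 3], [1, 2, 3], [1, 2, 3]];
const demoObs = [2.5, 0, 10];
console.log(rankHistogram(demoEns, demoObs));
console.log(demoObs.map((y, i) => crpsEmpirical(demoEns[i], y)));
```

The pairwise loop is O(m²) per case, which is harmless at 20 members; sorting the members first gives an O(m log m) variant if ensembles grow.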

Premise: "Provided the optimal data input format"

Gridded fields: Zarr v3, one common verification grid, whole-frame inner chunks (the dominant read is a full 2-D frame → 1–2 range GETs), per-variable precision (bitround where tolerable, lossless for categorical-critical fields like APCP), byte-shuffle + zstd, 1e30 fill (GPU-safe — never NaN).
Statistics: Parquet. The full archive's 1.53 M stat rows are 8.8 MB — small enough to query whole, structured enough for DuckDB to push filters down.
Point obs: Parquet (11.7 M GDAS obs = 45 MB), sliced by time/region before matching.
Parity oracles: a few KB of MET's own .stat values stored beside the data, so every client recompute can prove itself against the reference — the lab's core habit.
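Given two frames on the common grid, forming pairs on the fly is a single pass. The sketch below assumes the 1e30 fill convention from the list above and treats any value at or above a cutoff of 1e29 as missing, since a float32-stored 1e30 is not bit-equal to the double literal. The cutoff value and the function name are assumptions made for the example.

```javascript
// Assumption: the store writes 1e30 as fill; anything >= 1e29 is treated as missing.
const FILL_CUTOFF = 1e29;

// One pass over two same-grid frames: bias, MAE and RMSE of (fcst - truth).
function pairStats(fcst, truth) {
  let n = 0, sumErr = 0, sumAbs = 0, sumSq = 0;
  for (let i = 0; i < fcst.length; i++) {
    const f = fcst[i], t = truth[i];
    // The negated comparisons also reject NaN, should one slip through.
    if (!(f < FILL_CUTOFF) || !(t < FILL_CUTOFF)) continue;
    const e = f - t;
    n++;
    sumErr += e;
    sumAbs += Math.abs(e);
    sumSq += e * e;
  }
  // With no valid pairs every statistic is NaN rather than a misleading zero.
  return { n, bias: sumErr / n, mae: sumAbs / n, rmse: Math.sqrt(sumSq / n) };
}

// Made-up frames: two valid pairs, two cells knocked out by fill.
const fcstFrame = Float32Array.of(1, 2, 1e30, 4);
const truthFrame = Float32Array.of(1, 4, 3, 1e30);
const frameStats = pairStats(fcstFrame, truthFrame);
console.log(frameStats);
```

Using a finite sentinel instead of NaN keeps the same mask test valid on GPU paths where NaN comparisons may be optimized away.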

Honesty notes. (1) New benchmarks ran in Node/V8; prior rounds showed Node≈Chrome on these kernels, and the flagship paths (apps 09/10/11) are verified in the browser on the production origin. (2) The MODE row is verified by hand-worked geometry and invariants, not by MODE parity: no MODE output exists in this archive to verify against. (3) The archive is deterministic-only and single-region; ensemble/probabilistic verdicts are math-plus-benchmarks, not end-to-end demonstrations. (4) The regional model appears here under its de-identified name (WRF-REG); all public artifacts follow the lab's de-identification policy. (5) "5-dp cancellation": statistics re-derived from MET's rounded partial sums can differ near zero error; one documented case in 32,390.
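Note (5) is easiest to see with numbers. MSE re-derived from SL1L2-style means is FFBAR − 2·FOBAR + OOBAR, a difference of large, nearly equal terms. If each mean is rounded to 5 decimal places (each off by up to 0.5e-5), the re-derived MSE can be off by up to about 2e-5, which swamps a true MSE near 1e-6 and can even turn it negative. The values below are made up for illustration and the helper names are hypothetical.

```javascript
// Round to 5 decimal places, as values read back from a 5-dp text file would be.
const round5 = v => Number(v.toFixed(5));

// SL1L2-style means over matched pairs.
function sl1l2Means(f, o) {
  const n = f.length;
  let fobar = 0, ffbar = 0, oobar = 0;
  for (let i = 0; i < n; i++) {
    fobar += f[i] * o[i];
    ffbar += f[i] * f[i];
    oobar += o[i] * o[i];
  }
  return { fobar: fobar / n, ffbar: ffbar / n, oobar: oobar / n };
}

// MSE from partial sums: mean(f^2) - 2 mean(f o) + mean(o^2).
const mseFromMeans = s => s.ffbar - 2 * s.fobar + s.oobar;

// Near-perfect forecast of ~300 K temperatures: errors of +/- 0.001 K.
const obsK = [300.10, 299.85, 301.20, 300.45];
const fcstK = obsK.map((v, i) => v + (i % 2 === 0 ? 0.001 : -0.001));

// Reference: MSE straight from the pair errors (about 1e-6).
const mseDirect = fcstK.reduce((s, f, i) => s + (f - obsK[i]) ** 2, 0) / obsK.length;

const means = sl1l2Means(fcstK, obsK);
const roundedMeans = {
  fobar: round5(means.fobar),
  ffbar: round5(means.ffbar),
  oobar: round5(means.oobar),
};

console.log("direct:            ", mseDirect);
console.log("from full means:   ", mseFromMeans(means));
console.log("from rounded means:", mseFromMeans(roundedMeans));
```

A guard such as clamping a slightly negative MSE to zero before taking the square root is one way to keep RMSE defined in such cases; whether that matches the reference output is something the parity oracle has to decide.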

Bottom line: What should move to the client, and what shouldn't

Move (proven): all stat math, archive aggregation, the analysis/viewer layer (METdataio/METviewer/METcalcpy roles), neighborhood + pairs grid ops, bootstrap CIs.
Move next (measured, needs product work): point_stat drill-down, interactive regridding, MODE parity against a real oracle plus its unported pieces (curvature, percentile thresholds, secondary merges), METviewer's long tail of plot types.
Don't move: format decode, bulk conversion, batch orchestration. They run once, offline, and make everything above possible — that division of labor is the architecture.