The MET Pipeline Explorer takes a real forecast/observation case and reproduces
MET's grid_stat math — thresholds, contingency tables, partial sums, skill scores, FSS —
live in a browser tab, then checks every number against MET's own .stat output.
This page explains how, stage by stage, with real benchmarks.
Strip away the file formats and orchestration, and MET's gridded verification is three
operators wrapped around arithmetic: interpolate forecast onto the observation geometry,
threshold + mask, then accumulate partial sums and contingency counts — from which every
published score is simple algebra. The accumulation and algebra are cheap; a browser does them instantly.
The only genuinely hard part — decoding GRIB2 / prepBUFR — is done once, offline, exactly as MET's
own pb2nc / regrid_data_plane do it before verification runs.
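The threshold, mask, and accumulate steps above can be sketched in a few lines. This is a minimal illustration, not MET's code: the `>=` threshold convention, the `MISSING` sentinel, and the names `Counts`, `accumulate`, and `scores` are assumptions made for the example. The score formulas are the standard 2x2 contingency-table definitions.

```python
from dataclasses import dataclass

MISSING = 1e30  # hypothetical missing-data sentinel for this sketch


@dataclass
class Counts:
    hits: int = 0          # forecast yes, observed yes
    false_alarms: int = 0  # forecast yes, observed no
    misses: int = 0        # forecast no, observed yes
    correct_negs: int = 0  # forecast no, observed no


def accumulate(fcst, obs, thresh):
    """Threshold both fields at >= thresh and count the 2x2 table, skipping masked pairs."""
    c = Counts()
    for f, o in zip(fcst, obs):
        if f >= MISSING or o >= MISSING:
            continue  # masked pair: contributes to no cell of the table
        f_yes, o_yes = f >= thresh, o >= thresh
        if f_yes and o_yes:
            c.hits += 1
        elif f_yes:
            c.false_alarms += 1
        elif o_yes:
            c.misses += 1
        else:
            c.correct_negs += 1
    return c


def ratio(num, den):
    """Guarded division: undefined scores come back as NaN instead of raising."""
    return num / den if den else float("nan")


def scores(c):
    """Categorical scores as plain algebra on the four counts."""
    return {
        "POD": ratio(c.hits, c.hits + c.misses),
        "FAR": ratio(c.false_alarms, c.hits + c.false_alarms),
        "CSI": ratio(c.hits, c.hits + c.misses + c.false_alarms),
        "FBIAS": ratio(c.hits + c.false_alarms, c.hits + c.misses),
    }
```

Once the four counts exist, every score is a handful of divisions, which is why the accumulation pass dominates the cost and the algebra is effectively free.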
Click any step to open its detail. Green = runs in the browser (and is proven live against MET); amber = reachable but heavy or data-gated; orange = must be pre-processed offline first.
The same kernel the explorer runs, measured on synthetic grids from the real case size (8k cells) up to 2048² (4.2 M cells). Single-threaded JavaScript, no WASM or GPU.
The numbers above are warm (steady state, after the JS engine has compiled the hot path). What a colleague actually feels on their very first click is the cold column — the first call before V8 optimizes. It settles to the warm number within a few interactions.
Step 6 (the neighborhood / Fractions Skill Score) is the one operator with no closed form over partial sums — it needs a sliding window over the whole grid, which makes it the natural GPU target. It's now built four ways as real WebGPU compute kernels, each parity-checked against the CPU integral image (scored-center counts exact, FSS within ~2×10⁻⁸). Try it live: WebGPU FSS.
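For orientation, the CPU integral-image approach that the GPU kernels are checked against can be sketched as follows. This is an assumption-laden illustration rather than the explorer's code: it ignores missing data, clips windows at the grid edge and divides by the in-bounds cell count, and uses a `>=` threshold. The function names are hypothetical.

```python
def integral_image(binary):
    """Summed-area table with a zero border: sat[i][j] = sum of binary[:i][:j]."""
    ny, nx = len(binary), len(binary[0])
    sat = [[0] * (nx + 1) for _ in range(ny + 1)]
    for i in range(ny):
        row_sum = 0
        for j in range(nx):
            row_sum += binary[i][j]
            sat[i + 1][j + 1] = sat[i][j + 1] + row_sum
    return sat


def fractions(binary, r):
    """Fraction of event cells in the (2r+1)^2 window around each cell, clipped at edges."""
    ny, nx = len(binary), len(binary[0])
    sat = integral_image(binary)
    out = [[0.0] * nx for _ in range(ny)]
    for i in range(ny):
        i0, i1 = max(i - r, 0), min(i + r + 1, ny)
        for j in range(nx):
            j0, j1 = max(j - r, 0), min(j + r + 1, nx)
            # Four lookups give the window sum, whatever the radius.
            total = sat[i1][j1] - sat[i0][j1] - sat[i1][j0] + sat[i0][j0]
            out[i][j] = total / ((i1 - i0) * (j1 - j0))
    return out


def fss(fcst, obs, thresh, r):
    """Fractions Skill Score: 1 - sum((Pf - Po)^2) / (sum(Pf^2) + sum(Po^2))."""
    fb = [[1 if v >= thresh else 0 for v in row] for row in fcst]
    ob = [[1 if v >= thresh else 0 for v in row] for row in obs]
    pf, po = fractions(fb, r), fractions(ob, r)
    num = sum((a - b) ** 2 for ra, rb in zip(pf, po) for a, b in zip(ra, rb))
    den = sum(a * a + b * b for ra, rb in zip(pf, po) for a, b in zip(ra, rb))
    return 1.0 - num / den if den > 0 else float("nan")
```

The cost per cell is constant in r, which is consistent with the CPU integral column staying nearly flat as the neighborhood grows.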
The four kernels trade simplicity for scaling, and the right choice depends on the neighborhood radius r and the grid size:
Fixed 1024² grid (1.05 M cells), growing the neighborhood. Apple M2 Pro. Warm median ms.
| neighborhood | CPU integral | GPU naive | GPU separable | GPU prefix-scan |
|---|---|---|---|---|
| r = 2 (5×5) | 32.6 | 6.4 | 7.0 | 6.3 |
| r = 8 (17×17) | 32.8 | 11.5 | 7.4 | 5.3 |
| r = 16 (33×33) | 33.7 | 30.8 | 6.9 | 5.1 |
The prefix-scan kernel scans each line with a single workgroup (256 threads × 8 elements = 2048 cells), so a line longer than 2048 cells cannot be handled. The multi-block scan removes that ceiling with the textbook three-phase parallel prefix: every 2048-cell block scans itself and emits its total; one pass exclusive-scans the per-line block totals into offsets; a final pass adds each block's offset back. Run for rows then columns, it builds the full 2-D summed-area table for arbitrarily long lines. Verified live on grids the single-block kernel can't touch, 3072² (9.4 M cells) and 8000×1024 (four blocks per line), and still in agreement with the CPU (n exact, FSS Δ ~10⁻⁹).
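The three phases can be emulated serially to show the data flow. This is a sketch of the general technique, not the WGSL kernels: each loop over blocks stands in for what would run as parallel workgroups on the GPU, and the function names are hypothetical.

```python
def blocked_inclusive_scan(line, block=2048):
    """Inclusive prefix sum of one line, computed block by block in three phases."""
    n = len(line)
    out = [0] * n
    totals = []
    # Phase 1: every block scans itself and emits its total.
    for start in range(0, n, block):
        running = 0
        for k in range(start, min(start + block, n)):
            running += line[k]
            out[k] = running
        totals.append(running)
    # Phase 2: exclusive scan of the block totals gives each block's offset.
    offsets, running = [], 0
    for t in totals:
        offsets.append(running)
        running += t
    # Phase 3: add each block's offset back to its local results.
    for b, start in enumerate(range(0, n, block)):
        for k in range(start, min(start + block, n)):
            out[k] += offsets[b]
    return out


def summed_area_table(grid, block=2048):
    """Scan rows, then columns, to build the 2-D summed-area table."""
    rows = [blocked_inclusive_scan(row, block) for row in grid]
    cols = [blocked_inclusive_scan(list(col), block) for col in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

With `block=4`, a 10-element line of ones is split into blocks of 4, 4, and 2; the offsets are 0, 4, and 8, so the result is the same running count a single pass would give.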
The GPU path uses a vec4 buffer at 16 bytes/cell, so a 3072² grid needs ~151 MB, over WebGPU's default 128 MiB storage-binding cap. It returned silent zeros (validation errors don't reach the console) until the device was created requesting the adapter's true maximum. Defaults are conservative; large GPU buffers must ask.
Masked cells were first detected with the NaN self-test x != x. GPU backends (Metal here) compile shaders with fast-math assumptions under which that NaN self-test can fold to a constant, so masked cells leaked into every window. The fix is an out-of-band sentinel (1×10³⁰) tested with an ordered comparison (x < 1×10²⁹), which fast-math can't break. A good reminder that the IEEE NaN semantics JS guarantees do not carry over to shader code.
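The sentinel convention can be sketched on the host side as below. Python follows IEEE NaN semantics, so this cannot reproduce the fast-math failure itself; it only shows the encoding and the ordered test. The helper names are hypothetical, and the two constants are the ones quoted in the text.

```python
import math

SENTINEL = 1e30     # out-of-band missing value
VALID_BELOW = 1e29  # anything under this is treated as real data


def is_valid(x):
    """Ordered comparison; does not rely on NaN being unequal to itself."""
    return x < VALID_BELOW


def encode_missing(values):
    """Swap NaN for the sentinel before upload, where NaN checks are still reliable."""
    return [SENTINEL if math.isnan(v) else v for v in values]


def masked_sum_and_count(values):
    """Accumulate only valid cells, the way a window sum would skip masked ones."""
    total, n = 0.0, 0
    for v in values:
        if is_valid(v):
            total += v
            n += 1
    return total, n
```

Both constants fit comfortably in 32-bit floats, and the gap between them leaves room for rounding without a valid value ever being mistaken for the sentinel.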
The fraction fields the FSS compares are also rendered end-to-end on the GPU, from the same integral image the scan kernel builds: a compute pass runs the multi-block SAT scan, a colorize pass box-queries it (four O(1) lookups per cell) and writes the colormap straight into a texture, and a render pass samples that texture onto the canvas — nothing is copied back to JavaScript. The scan source is literally shared — one WGSL fragment is included by both the scoring kernel and the renderer — so the map and the score are one integral image, not two implementations. The whole scan → box-query → colormap → pixels path stays on the device (~1 ms/frame), the shape of a real interactive verification map.
Five levels of the same score, each a point on the effort-vs-payoff curve. Pros and cons sit side by side; switch to benchmarks (warm-median ms at 1024² on the Apple M2 Pro, measured in the live app).
To score any model live in the browser, we put every model on one common grid and make
model a dimension. But models don't all cover the same ground: AIGFS and GFS are global,
while a mesoscale WRF or 4DWX run covers only a small region. Naïvely, a shared grid means every regional model
pays for the whole continent in empty cells. Two Zarr properties avoid that cost, and the same idea
scales the store from one region up to the planet.
First, missing data is stored as an ordered sentinel (1e30), and a chunk that's entirely
sentinel either isn't written at all or compresses to almost nothing. So a regional model in the
shared model dimension costs storage proportional to its footprint, not the full grid.
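A toy version of that accounting, assuming square chunks and the same ordered sentinel test, is sketched below. It only counts which chunks would need to be stored; it does not call any Zarr API, and `chunks_to_write` is a hypothetical helper.

```python
SENTINEL = 1e30     # missing-data value used across the store
VALID_BELOW = 1e29  # ordered test for real data


def chunks_to_write(grid, chunk):
    """Return (chunk_row, chunk_col) indices of chunks holding at least one real value."""
    ny, nx = len(grid), len(grid[0])
    keep = []
    for ci, i0 in enumerate(range(0, ny, chunk)):
        for cj, j0 in enumerate(range(0, nx, chunk)):
            has_data = any(
                grid[i][j] < VALID_BELOW
                for i in range(i0, min(i0 + chunk, ny))
                for j in range(j0, min(j0 + chunk, nx))
            )
            if has_data:
                keep.append((ci, cj))
    return keep


# A regional model on an 8x8 shared grid: real values only in a 2x2 corner patch.
grid = [[SENTINEL] * 8 for _ in range(8)]
for i in (4, 5):
    for j in (4, 5):
        grid[i][j] = 280.0

stored = chunks_to_write(grid, chunk=4)  # one of the four 4x4 chunks
```

Only the chunk that overlaps the regional footprint is kept; the other three are entirely sentinel and can be skipped or compressed away.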
Second, a multiscale pyramid lets each model live at its honest resolution — global models on a
coarse global level, regional models on a fine regional level — so a coarse 0.25° model is never blown up to
2.5 km just to sit in the same store.
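A rough back-of-envelope shows what the pyramid avoids. The sketch assumes about 111.32 km per degree (true only near the equator) and a regular latitude-longitude grid that includes both poles; both are simplifications for illustration, and the function names are hypothetical.

```python
KM_PER_DEG = 111.32  # approximate length of one degree at the equator (assumption)


def cells_global(deg):
    """Cell count of a regular global lat-lon grid that includes both poles."""
    return round(360 / deg) * (round(180 / deg) + 1)


def upsample_factor(native_deg, target_km):
    """How many target cells one native cell becomes if resampled to target_km spacing."""
    per_axis = native_deg * KM_PER_DEG / target_km
    return per_axis ** 2


native_cells = cells_global(0.25)      # 1440 x 721 points
blowup = upsample_factor(0.25, 2.5)    # roughly two orders of magnitude
```

Under these assumptions each 0.25° cell would turn into more than a hundred 2.5 km cells carrying no extra information, which is the cost the coarse global level sidesteps.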
Toggle models and switch the storage strategy — watch what each one actually costs. (Tile counts and sizes are illustrative but track the real design ratios.)
Because MET's own .stat output is the reference, every stage is checked against the operational answer as it computes: the ultimate
provenance.