MET-AL · verification lab

Running MET verification in the browser

The MET Pipeline Explorer takes a real forecast/observation case and reproduces MET's grid_stat math — thresholds, contingency tables, partial sums, skill scores, FSS — live in a browser tab, then checks every number against MET's own .stat output. This page explains how, stage by stage, with real benchmarks.

The idea in one breath

Strip away the file formats and orchestration, and MET's gridded verification is three operators wrapped around arithmetic: interpolate forecast onto the observation geometry, threshold + mask, then accumulate partial sums and contingency counts — from which every published score is simple algebra. The accumulation and algebra are cheap; a browser does them instantly. The only genuinely hard part — decoding GRIB2 / prepBUFR — is done once, offline, exactly as MET's own pb2nc / regrid_data_plane do it before verification runs.

GRIB2 / prepBUFR → Zarr v3 · Parquet → regrid → threshold → accumulate → scores → FSS → aggregate
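The threshold, accumulate, and scores stages described above can be sketched in a few lines. This is a minimal illustration, not the explorer's actual kernel: the function names `accumulate` and `scores` are hypothetical, the threshold is assumed to be applied as `>=`, and the inputs are assumed to be flat arrays already regridded onto a common grid. The field names follow MET's CTC and SL1L2 line types.

```javascript
// Hypothetical helpers. mask[i] is true where the cell should be scored.
function accumulate(fcst, obs, mask, threshold) {
  // Contingency counts, named after MET's CTC columns.
  const ctc = { fy_oy: 0, fy_on: 0, fn_oy: 0, fn_on: 0 };
  // Raw sums behind the SL1L2 partial sums.
  let n = 0, sf = 0, so = 0, sfo = 0, sff = 0, soo = 0;
  for (let i = 0; i < fcst.length; i++) {
    if (!mask[i]) continue;
    const f = fcst[i], o = obs[i];
    n++; sf += f; so += o; sfo += f * o; sff += f * f; soo += o * o;
    const fy = f >= threshold, oy = o >= threshold; // assumed ">=" threshold
    if (fy && oy) ctc.fy_oy++;
    else if (fy) ctc.fy_on++;
    else if (oy) ctc.fn_oy++;
    else ctc.fn_on++;
  }
  const sl1l2 = {
    total: n, fbar: sf / n, obar: so / n,
    fobar: sfo / n, ffbar: sff / n, oobar: soo / n,
  };
  return { ctc, sl1l2 };
}

// Every score below is plain algebra on the accumulated values.
function scores({ ctc, sl1l2 }) {
  const { fy_oy, fy_on, fn_oy } = ctc;
  const mse = sl1l2.ffbar - 2 * sl1l2.fobar + sl1l2.oobar;
  return {
    pod: fy_oy / (fy_oy + fn_oy),          // probability of detection
    far: fy_on / (fy_oy + fy_on),          // false alarm ratio
    csi: fy_oy / (fy_oy + fy_on + fn_oy),  // critical success index
    me: sl1l2.fbar - sl1l2.obar,           // mean error
    rmse: Math.sqrt(Math.max(0, mse)),     // clamp guards against rounding
  };
}
```

Because the scores depend only on the accumulated counts and sums, the per-cell loop runs once and any number of scores can be derived afterwards.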

The pipeline, step by step

Each step opens into its own detail in the interactive explorer. Green = runs in the browser (and is proven live against MET); amber = reachable but heavy or data-gated; orange = must be pre-processed offline first.

Benchmarks — does it actually scale?

The same kernel the explorer runs, measured on synthetic grids from the real case size (8k cells) up to 2048² (4.2 M cells). Single-threaded JavaScript, no WASM or GPU.

Chart: time vs. grid size (log–log) for partial sums (SL1L2), contingency (CTC), and neighborhood (FSS 5×5).

Numbers

Cold start — a fresh tab, an empty cache

The numbers above are warm (steady state, after the JS engine has compiled the hot path). What a colleague actually feels on their very first click is the cold column — the first call before V8 optimizes. It settles to the warm number within a few interactions.
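One way to separate the cold and warm numbers is to time the first call on its own and take the median of the later ones. The harness below is a sketch under that assumption: `benchmark` and `median` are hypothetical names, and the default of 21 warm runs is arbitrary. It is not the measurement code behind the figures on this page.

```javascript
// Hypothetical harness: times the first call separately from later calls.
function benchmark(fn, warmRuns = 21) {
  const t0 = performance.now();
  fn(); // first call: includes whatever the engine has not yet optimized
  const coldMs = performance.now() - t0;

  const samples = [];
  for (let i = 0; i < warmRuns; i++) {
    const t = performance.now();
    fn();
    samples.push(performance.now() - t);
  }
  return { coldMs, warmMedianMs: median(samples) };
}

// Median of a list of numbers; leaves the input untouched.
function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const mid = s.length >> 1;
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}
```

The cold figure is only meaningful if `fn` has not already run in the same tab, so a fresh page load per measurement is the safer setup.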

Taking FSS to the GPU

Step 6 (the neighborhood / Fractions Skill Score) is the one operator with no closed form over partial sums — it needs a sliding window over the whole grid, which makes it the natural GPU target. It's now built four ways as real WebGPU compute kernels, each parity-checked against the CPU integral image (scored-center counts exact, FSS within ~2×10⁻⁸). Try it live: WebGPU FSS.

4 GPU kernels: naive · separable · prefix-scan · multi-block
~7× fastest GPU vs CPU at 2048² (4.2 M cells)
exact contingency counts; FSS Δ ~2×10⁻⁸ vs CPU
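For reference, the CPU integral-image approach that the GPU kernels are checked against can be sketched as follows. This is an illustration rather than the explorer's code: the function names are hypothetical, the inputs are assumed to be 0/1 event fields with no missing cells, and windows at the grid edge are assumed to be clipped and normalized by their clipped area, which may differ from MET's edge handling.

```javascript
// Summed-area table with a one-cell zero border:
// sat[(y+1)*(w+1) + (x+1)] = sum of src over rows 0..y and columns 0..x.
function summedAreaTable(src, w, h) {
  const sat = new Float64Array((w + 1) * (h + 1));
  for (let y = 0; y < h; y++) {
    let rowSum = 0;
    for (let x = 0; x < w; x++) {
      rowSum += src[y * w + x];
      sat[(y + 1) * (w + 1) + (x + 1)] = sat[y * (w + 1) + (x + 1)] + rowSum;
    }
  }
  return sat;
}

// Sum over the inclusive box [x0..x1] x [y0..y1] using four lookups.
function boxSum(sat, w, x0, y0, x1, y1) {
  const W = w + 1;
  return sat[(y1 + 1) * W + (x1 + 1)] - sat[y0 * W + (x1 + 1)]
       - sat[(y1 + 1) * W + x0] + sat[y0 * W + x0];
}

// FSS = 1 - sum((pf - po)^2) / (sum(pf^2) + sum(po^2)).
function fss(fcstEvent, obsEvent, w, h, r) {
  const sf = summedAreaTable(fcstEvent, w, h);
  const so = summedAreaTable(obsEvent, w, h);
  let num = 0, den = 0;
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      // Assumption: clip the window at the grid edge.
      const x0 = Math.max(0, x - r), x1 = Math.min(w - 1, x + r);
      const y0 = Math.max(0, y - r), y1 = Math.min(h - 1, y + r);
      const area = (x1 - x0 + 1) * (y1 - y0 + 1);
      const pf = boxSum(sf, w, x0, y0, x1, y1) / area;
      const po = boxSum(so, w, x0, y0, x1, y1) / area;
      num += (pf - po) ** 2;
      den += pf * pf + po * po;
    }
  }
  return den > 0 ? 1 - num / den : NaN;
}
```

The cost per cell is four lookups per field whatever the radius, which is why the CPU baseline does not depend on r.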

The four kernels trade simplicity for scaling, and the right choice depends on the neighborhood radius r and the grid size:

The r-sweep — where algorithm beats brute force

Fixed 1024² grid (1.05 M cells), growing the neighborhood. Apple M2 Pro. Warm median ms.

neighborhood | CPU integral | GPU naive | GPU separable | GPU prefix-scan
r = 2 (5×5) | 32.6 | 6.4 | 7.0 | 6.3
r = 8 (17×17) | 32.8 | 11.5 | 7.4 | 5.3
r = 16 (33×33) | 33.7 | 30.8 | 6.9 | 5.1

The CPU integral image is flat (~33 ms, O(cells)). The naive GPU kernel climbs with r² until at r = 16 it barely beats the CPU; the separable and prefix-scan kernels stay flat and pull far ahead. The lesson: on the GPU you still need the right algorithm, not just the right hardware.
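The gap between the naive and separable columns comes down to how much work each cell does. The CPU sketch below illustrates the idea only and is not the page's WGSL kernels: the function names are hypothetical, and out-of-grid neighbors are assumed to contribute nothing. The naive version reads (2r+1)² cells per output; the separable version splits the window into a horizontal pass and a vertical pass, 2(2r+1) reads in total, and produces the same sums.

```javascript
// Naive 2-D window sum: (2r+1)^2 reads per cell.
function boxSumNaive(src, w, h, r) {
  const out = new Float64Array(w * h);
  for (let y = 0; y < h; y++)
    for (let x = 0; x < w; x++) {
      let s = 0;
      for (let dy = -r; dy <= r; dy++)
        for (let dx = -r; dx <= r; dx++) {
          const yy = y + dy, xx = x + dx;
          if (yy >= 0 && yy < h && xx >= 0 && xx < w) s += src[yy * w + xx];
        }
      out[y * w + x] = s;
    }
  return out;
}

// Separable: a horizontal 1-D pass, then a vertical 1-D pass over its output.
function boxSumSeparable(src, w, h, r) {
  const tmp = new Float64Array(w * h);
  const out = new Float64Array(w * h);
  for (let y = 0; y < h; y++)
    for (let x = 0; x < w; x++) {
      let s = 0;
      for (let dx = -r; dx <= r; dx++) {
        const xx = x + dx;
        if (xx >= 0 && xx < w) s += src[y * w + xx];
      }
      tmp[y * w + x] = s; // row-window sum
    }
  for (let y = 0; y < h; y++)
    for (let x = 0; x < w; x++) {
      let s = 0;
      for (let dy = -r; dy <= r; dy++) {
        const yy = y + dy;
        if (yy >= 0 && yy < h) s += tmp[yy * w + x];
      }
      out[y * w + x] = s; // column sum of row-window sums = 2-D window sum
    }
  return out;
}
```

At r = 16 that is 1,089 reads per cell for the naive form against 66 for the separable one, which is consistent with the direction of the measurements above even though GPU timings depend on much more than read counts.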

Beyond one workgroup — the multi-block scan

The prefix-scan kernel scans each line with a single workgroup (256 threads × 8 elements = 2048 cells), so a line longer than 2048 cells has nowhere to go. The multi-block scan removes that ceiling with the textbook three-phase parallel prefix: every 2048-cell block scans itself and emits its total; one pass exclusive-scans the per-line block totals into offsets; a final pass adds each block's offset back. Run for rows and then columns, it builds the full 2-D summed-area table for arbitrarily long lines. It was verified live on grids the single-block kernel can't touch, 3072² (9.4 M cells) and 8000×1024 (four blocks per line), and still agrees with the CPU (n exact, FSS Δ ~10⁻⁹).
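The three phases are easier to see in sequential form. The sketch below runs them on the CPU for a single line; the function name `multiBlockScan` is hypothetical, the loops stand in for what would be separate GPU passes, and the default block size of 2048 simply mirrors the figure quoted above.

```javascript
// Inclusive prefix sum of one line, computed in three phases.
function multiBlockScan(line, blockSize = 2048) {
  const n = line.length;
  const out = new Float64Array(n);
  const numBlocks = Math.ceil(n / blockSize);
  const blockTotals = new Float64Array(numBlocks);

  // Phase 1: each block scans itself and emits its total.
  for (let b = 0; b < numBlocks; b++) {
    let running = 0;
    const end = Math.min(n, (b + 1) * blockSize);
    for (let i = b * blockSize; i < end; i++) {
      running += line[i];
      out[i] = running;
    }
    blockTotals[b] = running;
  }

  // Phase 2: exclusive scan of the block totals gives each block's offset.
  const offsets = new Float64Array(numBlocks);
  for (let b = 1; b < numBlocks; b++) {
    offsets[b] = offsets[b - 1] + blockTotals[b - 1];
  }

  // Phase 3: add each block's offset back to its cells.
  for (let b = 0; b < numBlocks; b++) {
    const end = Math.min(n, (b + 1) * blockSize);
    for (let i = b * blockSize; i < end; i++) out[i] += offsets[b];
  }
  return out;
}
```

An 8000-cell line splits into four blocks at the default size, matching the 8000×1024 case mentioned above. Phases 1 and 3 are independent per block, which is what makes them map onto parallel workgroups.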

A real portability wrinkle surfaced here: to fit the extra block-sum scratch under WebGPU's 8-storage-buffer default limit, the table packs (F, O, count) into one vec4 buffer — 16 bytes/cell, so a 3072² grid needs ~151 MB, over WebGPU's default 128 MiB storage-binding cap. It returned silent zeros (validation errors don't reach the console) until the device was created requesting the adapter's true maximum. Defaults are conservative; large GPU buffers must ask.
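Asking for the larger limits looks roughly like the sketch below. The helper names `requiredLimitsFor` and `requestLargeBufferDevice` are hypothetical; `requiredLimits`, `maxStorageBufferBindingSize`, `maxBufferSize`, and the `uncapturederror` event are standard WebGPU. Whether the explorer requests exactly these two limits is an assumption.

```javascript
// Pure helper: choose the limits to request for a storage buffer of `bytes`.
// `adapterLimits` is the adapter's `limits` object.
function requiredLimitsFor(adapterLimits, bytes) {
  if (bytes > adapterLimits.maxStorageBufferBindingSize ||
      bytes > adapterLimits.maxBufferSize) {
    throw new RangeError(`buffer of ${bytes} bytes exceeds what this adapter supports`);
  }
  // Request the adapter's true maxima instead of the conservative defaults.
  return {
    maxStorageBufferBindingSize: adapterLimits.maxStorageBufferBindingSize,
    maxBufferSize: adapterLimits.maxBufferSize,
  };
}

// Browser-only usage; requires WebGPU support.
async function requestLargeBufferDevice(bytes) {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU adapter unavailable");
  const device = await adapter.requestDevice({
    requiredLimits: requiredLimitsFor(adapter.limits, bytes),
  });
  // One way to make validation errors visible instead of silent.
  device.addEventListener("uncapturederror", (e) => console.error(e.error.message));
  return device;
}
```

For the 3072² case, `bytes` would be 3072 × 3072 × 16 = 150,994,944, which is above the 134,217,728-byte (128 MiB) default binding size.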

A GPU portability trap worth knowing

The first GPU run silently miscounted: it scored every cell (8,051) instead of the 7,810 valid ones. The cause: masked cells were encoded as NaN and tested with x != x. GPU backends (Metal here) compile shaders with fast-math assumptions under which that NaN self-test can fold to a constant, so masked cells leaked into every window. The fix is an out-of-band sentinel (1×10³⁰) tested with an ordered comparison (x < 1×10²⁹), which fast-math can't break. A good reminder that the IEEE NaN semantics JavaScript guarantees do not carry over to shader code.
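On the JavaScript side, the fix amounts to re-encoding masked cells before upload and using the same ordered test everywhere. A minimal sketch, with hypothetical names and the assumption that real data values are always far below 10²⁹:

```javascript
const MISSING = 1e30;      // out-of-band sentinel quoted above
const VALID_BELOW = 1e29;  // cutoff for the ordered comparison quoted above

// Replace NaN-masked cells with the sentinel before uploading to the GPU.
function encodeMissing(values) {
  const out = new Float32Array(values.length);
  for (let i = 0; i < values.length; i++) {
    out[i] = Number.isNaN(values[i]) ? MISSING : values[i];
  }
  return out;
}

// The same test a shader can use: an ordered comparison, no NaN self-test.
function isValid(x) {
  return x < VALID_BELOW;
}

// Count the cells that would be scored.
function countValid(encoded) {
  let n = 0;
  for (const x of encoded) if (isValid(x)) n++;
  return n;
}
```

The sentinel survives the round trip through 32-bit floats (1e30 is well inside the float32 range), and the order-of-magnitude gap between the sentinel and the cutoff leaves room for rounding.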

GPU → screen, no readback

The fraction fields the FSS compares are also rendered end-to-end on the GPU, from the same integral image the scan kernel builds: a compute pass runs the multi-block SAT scan, a colorize pass box-queries it (four O(1) lookups per cell) and writes the colormap straight into a texture, and a render pass samples that texture onto the canvas — nothing is copied back to JavaScript. The scan source is literally shared — one WGSL fragment is included by both the scoring kernel and the renderer — so the map and the score are one integral image, not two implementations. The whole scan → box-query → colormap → pixels path stays on the device (~1 ms/frame), the shape of a real interactive verification map.

WGSL compute · workgroup shared memory · Hillis-Steele scan · multi-block scan · summed-area table · shared shader source · storage texture · render pass

The trade-offs — pros, cons, benchmarks

Five levels of the same score, each a point on the effort-vs-payoff curve. Pros and cons sit side by side; switch to benchmarks (warm-median ms at 1024² on the Apple M2 Pro, measured in the live app).

A global-capable store where most models are regional-only

To score any model live in the browser, we put every model on one common grid and make model a dimension. But models don't all cover the same ground: AIGFS and GFS are global, while a mesoscale WRF or 4DWX run covers only a small region. Naïvely, a shared grid means every regional model pays for the whole continent in empty cells. Two Zarr properties prevent that, and the same idea scales the store from one region up to the planet.

First, missing data is stored as an ordered sentinel (1e30), and a chunk that's entirely sentinel either isn't written at all or compresses to almost nothing. So a regional model in the shared model dimension costs storage proportional to its footprint, not the full grid. Second, a multiscale pyramid lets each model live at its honest resolution — global models on a coarse global level, regional models on a fine regional level — so a coarse 0.25° model is never blown up to 2.5 km just to sit in the same store.
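The footprint argument can be checked with a toy count of which chunks would need storing. This sketch is illustrative only: the function names, the rectangular footprint, and the chunk size are hypothetical, and it models the "all-sentinel chunk is not written" case described above without touching a real Zarr store.

```javascript
const FILL = 1e30; // ordered sentinel for missing data, as described above

// Count chunks holding at least one real value; all-sentinel chunks are skipped.
function countStoredChunks(grid, w, h, chunk) {
  let stored = 0, total = 0;
  for (let cy = 0; cy < h; cy += chunk) {
    for (let cx = 0; cx < w; cx += chunk) {
      total++;
      let hasData = false;
      for (let y = cy; y < Math.min(h, cy + chunk) && !hasData; y++) {
        for (let x = cx; x < Math.min(w, cx + chunk); x++) {
          // Ordered comparison against a cutoff well below the sentinel.
          if (grid[y * w + x] < FILL / 10) { hasData = true; break; }
        }
      }
      if (hasData) stored++;
    }
  }
  return { stored, total };
}

// Toy regional model: real values only inside the box [x0, x1) x [y0, y1).
function regionalGrid(w, h, x0, y0, x1, y1) {
  const g = new Float32Array(w * h).fill(FILL);
  for (let y = y0; y < y1; y++)
    for (let x = x0; x < x1; x++) g[y * w + x] = 1;
  return g;
}
```

For example, `countStoredChunks(regionalGrid(8, 8, 0, 0, 4, 4), 8, 8, 4)` reports one stored chunk out of four, so the cost follows the footprint rather than the full grid. A footprint that straddles chunk boundaries touches more chunks, which is why chunk size relative to footprint matters.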

Toggle models and switch the storage strategy — watch what each one actually costs. (Tile counts and sizes are illustrative but track the real design ratios.)

In the real v2 design on the 2.5 km CONUS grid, a regional model covering ~⅙ of the domain stores ~1.3 MB/frame vs ~8 MB for a full-cover model, roughly in proportion to its footprint, and the pyramid keeps global models at 0.25° (~1.2 MB/frame) instead of upsampling them ~150× to 2.5 km. The same move lets the store grow from one region to global: each dataset at its honest resolution, empty space free. The vocabulary behind this (COG overviews, OME-Zarr, GeoZarr) is in the pyramids & cloud-native rasters explainer.

What this means for MET users