MET-AL · verification lab

Running MET verification in the browser

The MET Pipeline Explorer takes a real forecast/observation case and reproduces MET's grid_stat math — thresholds, contingency tables, partial sums, skill scores, FSS — live in a browser tab, then checks every number against MET's own .stat output. This page explains how, stage by stage, with real benchmarks.

The idea in one breath

Strip away the file formats and orchestration, and MET's gridded verification is three operators wrapped around arithmetic: interpolate forecast onto the observation geometry, threshold + mask, then accumulate partial sums and contingency counts — from which every published score is simple algebra. The accumulation and algebra are cheap; a browser does them instantly. The only genuinely hard part — decoding GRIB2 / prepBUFR — is done once, offline, exactly as MET's own pb2nc / regrid_data_plane do it before verification runs.
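As a rough illustration of "accumulate, then algebra", the sketch below makes one pass over matched forecast/observation pairs and derives a few scores from the accumulators alone. The interface, the function names, the `>=` threshold convention and the NaN mask are all assumptions made for this example; it is not the explorer's kernel or MET's source, only the standard textbook definitions of the scores.

```typescript
// Illustrative sketch: field names and the NaN mask are assumptions for this example.
interface Accumulators {
  n: number;
  fyOy: number; fyOn: number; fnOy: number; fnOn: number; // contingency counts (CTC)
  sumF: number; sumO: number; sumFO: number; sumFF: number; sumOO: number; // raw sums behind the SL1L2 means
}

function accumulate(fcst: number[], obs: number[], thresh: number): Accumulators {
  const a: Accumulators = {
    n: 0, fyOy: 0, fyOn: 0, fnOy: 0, fnOn: 0,
    sumF: 0, sumO: 0, sumFO: 0, sumFF: 0, sumOO: 0,
  };
  for (let i = 0; i < fcst.length; i++) {
    const f = fcst[i];
    const o = obs[i];
    if (!Number.isFinite(f) || !Number.isFinite(o)) continue; // masked pair
    a.n++;
    const fEvent = f >= thresh; // assumed threshold convention
    const oEvent = o >= thresh;
    if (fEvent && oEvent) a.fyOy++;
    else if (fEvent) a.fyOn++;
    else if (oEvent) a.fnOy++;
    else a.fnOn++;
    a.sumF += f;
    a.sumO += o;
    a.sumFO += f * o;
    a.sumFF += f * f;
    a.sumOO += o * o;
  }
  return a;
}

// Every score below is plain algebra on the accumulators; no second pass over the grid.
function deriveScores(a: Accumulators) {
  const mse = Math.max(0, (a.sumFF - 2 * a.sumFO + a.sumOO) / a.n);
  return {
    pod: a.fyOy / (a.fyOy + a.fnOy),          // probability of detection
    far: a.fyOn / (a.fyOy + a.fyOn),          // false alarm ratio
    csi: a.fyOy / (a.fyOy + a.fyOn + a.fnOy), // critical success index
    me: (a.sumF - a.sumO) / a.n,              // mean error (bias)
    mse,
    rmse: Math.sqrt(mse),
  };
}
```

Because the accumulators are additive, the same `deriveScores` call works on a single tile or on sums merged across many tiles or cases.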

GRIB2 / prepBUFR → Zarr v3 · Parquet → regrid → threshold → accumulate → scores → FSS → aggregate

The pipeline, step by step

Click any step to open its detail. Green = runs in the browser (and is proven live against MET); amber = reachable but heavy or data-gated; orange = must be pre-processed offline first.

Benchmarks — does it actually scale?

The same kernel the explorer runs, measured on synthetic grids from the real case size (8k cells) up to 2048² (4.2 M cells). Single-threaded JavaScript, no WASM or GPU.

[Figure: time vs grid size, log–log. Series: partial sums (SL1L2), contingency (CTC), neighborhood (FSS 5×5).]

Numbers

Cold start — a fresh tab, an empty cache

The numbers above are warm (steady state, after the JS engine has compiled the hot path). What a colleague actually feels on their very first click is the cold column — the first call before V8 optimizes. It settles to the warm number within a few interactions.
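One way to separate the two figures is to time the first call on its own and take the median of the calls that follow. The harness below is a minimal sketch under that assumption; `coldAndWarm`, its parameters and the injectable clock are hypothetical names for this example, not the explorer's benchmark code. `Date.now` is only the default clock here; a higher-resolution timer can be passed in where one is available.

```typescript
// Median of a list of timings (returns NaN for an empty list).
function median(xs: number[]): number {
  const s = xs.slice().sort((a, b) => a - b);
  const m = Math.floor(s.length / 2);
  return s.length % 2 === 1 ? s[m] : (s[m - 1] + s[m]) / 2;
}

// Time the very first call (cold), then `repeats` further calls (warm).
// `now` is injectable so the harness can be tested with a fake clock.
function coldAndWarm(
  kernel: () => void,
  repeats: number = 20,
  now: () => number = Date.now,
): { cold: number; warmMedian: number } {
  const t0 = now();
  kernel();
  const cold = now() - t0;

  const warm: number[] = [];
  for (let i = 0; i < repeats; i++) {
    const t = now();
    kernel();
    warm.push(now() - t);
  }
  return { cold, warmMedian: median(warm) };
}
```

The median is used rather than the mean so that a single garbage-collection pause or recompilation does not skew the warm figure.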

Taking FSS to the GPU

Step 6 (the neighborhood / Fractions Skill Score) is the one operator with no closed form over partial sums — it needs a sliding window over the whole grid, which makes it the natural GPU target. It's now built four ways as real WebGPU compute kernels, each parity-checked against the CPU integral image (scored-center counts exact, FSS within ~2×10⁻⁸). Try it live: WebGPU FSS.
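For orientation, here is a minimal CPU sketch of the integral-image approach the GPU kernels are checked against. It uses the standard FSS definition, 1 − Σ(Pf − Po)² / (ΣPf² + ΣPo²). The edge handling (windows clipped at the grid boundary), the rule that only valid centres are scored, the 1e30 sentinel test and all function names are assumptions for this example and may differ from what the explorer or MET actually does.

```typescript
// Summed-area table with a zero border: (nx+1) x (ny+1), so box queries need no edge branches.
function integralImage(v: Float64Array, nx: number, ny: number): Float64Array {
  const w = nx + 1;
  const sat = new Float64Array(w * (ny + 1));
  for (let y = 0; y < ny; y++) {
    let rowSum = 0;
    for (let x = 0; x < nx; x++) {
      rowSum += v[y * nx + x];
      sat[(y + 1) * w + (x + 1)] = sat[y * w + (x + 1)] + rowSum;
    }
  }
  return sat;
}

// Sum over the (2r+1)^2 window centred on (x, y), clipped to the grid: four O(1) lookups.
function boxSum(sat: Float64Array, nx: number, ny: number, x: number, y: number, r: number): number {
  const w = nx + 1;
  const x0 = Math.max(0, x - r), x1 = Math.min(nx, x + r + 1);
  const y0 = Math.max(0, y - r), y1 = Math.min(ny, y + r + 1);
  return sat[y1 * w + x1] - sat[y0 * w + x1] - sat[y1 * w + x0] + sat[y0 * w + x0];
}

function fssIntegral(
  fcst: Float64Array, obs: Float64Array,
  nx: number, ny: number, thresh: number, r: number,
): number {
  const n = nx * ny;
  const fEvt = new Float64Array(n), oEvt = new Float64Array(n), valid = new Float64Array(n);
  for (let i = 0; i < n; i++) {
    const ok = fcst[i] < 1e29 && obs[i] < 1e29; // assumed missing-value sentinel is 1e30
    valid[i] = ok ? 1 : 0;
    fEvt[i] = ok && fcst[i] >= thresh ? 1 : 0;
    oEvt[i] = ok && obs[i] >= thresh ? 1 : 0;
  }
  const sf = integralImage(fEvt, nx, ny);
  const so = integralImage(oEvt, nx, ny);
  const sv = integralImage(valid, nx, ny);

  let num = 0, den = 0;
  for (let y = 0; y < ny; y++) {
    for (let x = 0; x < nx; x++) {
      if (valid[y * nx + x] === 0) continue; // score valid centres only (assumption)
      const cnt = boxSum(sv, nx, ny, x, y, r); // >= 1 because the centre is valid
      const pf = boxSum(sf, nx, ny, x, y, r) / cnt;
      const po = boxSum(so, nx, ny, x, y, r) / cnt;
      num += (pf - po) * (pf - po);
      den += pf * pf + po * po;
    }
  }
  return den > 0 ? 1 - num / den : NaN;
}
```

On a 3×1 grid with the forecast event at the left end and the observed event at the right end, a radius of 1 gives 4/13 ≈ 0.308: the neighborhood gives partial credit for a near miss that a point-wise score would count as a complete failure.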

GPU kernels: naive · separable · prefix-scan · multi-block

~7×: fastest GPU vs CPU at 2048² (4.2 M cells)

exact: contingency counts; FSS Δ ~2×10⁻⁸ vs CPU

The four kernels trade simplicity for scaling, and the right choice depends on the neighborhood radius r and the grid size:

Naive — one thread per cell sums its full (2r+1)² window. Dead simple, O(cells·r²): wins at small r by sheer parallelism, but its work grows with r².
Separable — factor the 2-D box into a horizontal then a vertical 1-D sliding-window pass. O(cells), independent of r — stays flat as the window grows, but each line's running sum is serial (one thread per line).
Prefix-scan — build an integral image on the GPU with a workgroup shared-memory block scan (Hillis-Steele), then answer each cell with four O(1) lookups. Also O(cells), and by removing the separable kernel's serial per-line dependency it ends up the fastest at large r — but one workgroup handles one line, so a line can't exceed 2048 cells.
Multi-block — the same integral image, but each line is split into 2048-cell blocks scanned in parallel, then a second pass scans the per-line block totals and a third adds the offsets back. This lifts the single-workgroup line limit so the scan scales to any real grid, at the cost of a more complex kernel.
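The difference between the first two kernels is easiest to see on the CPU. The sketch below computes the same clipped box sums two ways: a naive double loop over each window, and a separable pair of 1-D sliding-window passes whose cost does not depend on r. Function names and the clipped-edge convention are assumptions for this example; the real kernels are WGSL and are not reproduced here.

```typescript
// Naive: each cell sums its full window. O(cells * r^2).
function boxSumNaive(v: Float64Array, nx: number, ny: number, r: number): Float64Array {
  const out = new Float64Array(nx * ny);
  for (let y = 0; y < ny; y++) {
    for (let x = 0; x < nx; x++) {
      let s = 0;
      for (let yy = Math.max(0, y - r); yy <= Math.min(ny - 1, y + r); yy++) {
        for (let xx = Math.max(0, x - r); xx <= Math.min(nx - 1, x + r); xx++) {
          s += v[yy * nx + xx];
        }
      }
      out[y * nx + x] = s;
    }
  }
  return out;
}

// One sliding-window pass along a single line (a row or a column, chosen by `stride`).
// The running sum is inherently serial along the line, which is the limitation noted above.
function slideLine(
  src: Float64Array, dst: Float64Array,
  start: number, stride: number, len: number, r: number,
): void {
  let sum = 0;
  for (let k = 0; k <= Math.min(r, len - 1); k++) sum += src[start + k * stride];
  for (let i = 0; i < len; i++) {
    dst[start + i * stride] = sum;
    const enter = i + 1 + r; // element entering the window for position i + 1
    const leave = i - r;     // element leaving it
    if (enter < len) sum += src[start + enter * stride];
    if (leave >= 0) sum -= src[start + leave * stride];
  }
}

// Separable: horizontal pass, then vertical pass over the result. O(cells), independent of r.
function boxSumSeparable(v: Float64Array, nx: number, ny: number, r: number): Float64Array {
  const tmp = new Float64Array(nx * ny);
  const out = new Float64Array(nx * ny);
  for (let y = 0; y < ny; y++) slideLine(v, tmp, y * nx, 1, nx, r);
  for (let x = 0; x < nx; x++) slideLine(tmp, out, x, nx, ny, r);
  return out;
}
```

For 0/1 event fields the sums are small integers, so the two functions agree exactly, which is the same kind of parity check the page describes between GPU and CPU counts.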

The r-sweep — where algorithm beats brute force

Fixed 1024² grid (1.05 M cells), growing the neighborhood. Apple M2 Pro. Warm median ms.

neighborhood	CPU integral	GPU naive	GPU separable	GPU prefix-scan
r = 2 (5×5)	32.6	6.4	7.0	6.3
r = 8 (17×17)	32.8	11.5	7.4	5.3
r = 16 (33×33)	33.7	30.8	6.9	5.1

The CPU integral image is flat (~33 ms, O(cells)). The naive GPU kernel climbs with r² until at r = 16 it barely beats the CPU; the separable and prefix-scan kernels stay flat and pull far ahead. The lesson: on the GPU you still need the right algorithm, not just the right hardware.

Beyond one workgroup — the multi-block scan

The prefix-scan kernel scans each line with a single workgroup (256 threads × 8 elements = 2048 cells), so a line longer than 2048 has nowhere to go. The multi-block scan removes that ceiling with the textbook three-phase parallel prefix: every 2048-cell block scans itself and emits its total; one pass exclusive-scans the per-line block totals into offsets; a final pass adds each block's offset back. Run for rows then columns, it builds the full 2-D summed-area table for arbitrarily long lines. Verified live on grids the single-block kernel can't touch, 3072² (9.4 M cells) and 8000×1024 (four blocks per line), with the scored-cell count n still exact and FSS within ~10⁻⁹ of the CPU.
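The three phases can be emulated sequentially to see why they compose into an ordinary prefix sum. In the sketch below each block is scanned with a Hillis-Steele step loop (double-buffered, standing in for a workgroup), block totals are exclusive-scanned into offsets, and the offsets are added back. This is a CPU illustration with hypothetical function names and a small block size, not the WGSL kernel.

```typescript
// Inclusive Hillis-Steele scan of one block. On a GPU every `i` in the inner loop is a
// thread and each `step` is separated by a barrier; here the double buffer plays that role.
function hillisSteele(block: Float64Array): Float64Array {
  let cur = Float64Array.from(block);
  for (let step = 1; step < cur.length; step *= 2) {
    const next = new Float64Array(cur.length);
    for (let i = 0; i < cur.length; i++) {
      next[i] = i >= step ? cur[i] + cur[i - step] : cur[i];
    }
    cur = next;
  }
  return cur;
}

function multiBlockScan(line: Float64Array, blockSize: number): Float64Array {
  const nBlocks = Math.ceil(line.length / blockSize);
  const out = new Float64Array(line.length);
  const totals = new Float64Array(nBlocks);

  // Phase 1: every block scans itself and emits its total (blocks are independent).
  for (let b = 0; b < nBlocks; b++) {
    const lo = b * blockSize;
    const hi = Math.min(lo + blockSize, line.length);
    const scanned = hillisSteele(line.subarray(lo, hi));
    out.set(scanned, lo);
    totals[b] = scanned[scanned.length - 1];
  }

  // Phase 2: exclusive scan of the block totals gives each block's starting offset.
  const offsets = new Float64Array(nBlocks);
  for (let b = 1; b < nBlocks; b++) offsets[b] = offsets[b - 1] + totals[b - 1];

  // Phase 3: add each block's offset back to its elements.
  for (let b = 1; b < nBlocks; b++) {
    const lo = b * blockSize;
    const hi = Math.min(lo + blockSize, line.length);
    for (let i = lo; i < hi; i++) out[i] += offsets[b];
  }
  return out;
}
```

With `blockSize` set to 2048 this mirrors the block structure described above; running it over rows and then over columns of the result would yield the 2-D summed-area table.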

A real portability wrinkle surfaced here: to fit the extra block-sum scratch under WebGPU's 8-storage-buffer default limit, the table packs (F, O, count) into one vec4 buffer — 16 bytes/cell, so a 3072² grid needs ~151 MB, over WebGPU's default 128 MiB storage-binding cap. It returned silent zeros (validation errors don't reach the console) until the device was created requesting the adapter's true maximum. Defaults are conservative; large GPU buffers must ask.
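The arithmetic behind that failure, and the shape of the fix, can be sketched as follows. The default values are the ones in the WebGPU specification (128 MiB for `maxStorageBufferBindingSize`, 256 MiB for `maxBufferSize`); the helper names and the `LimitsLike` interface are hypothetical, and the actual device request is shown only as a comment because it needs a browser with WebGPU.

```typescript
// WebGPU default limits, in bytes.
const DEFAULT_MAX_STORAGE_BINDING = 128 * 1024 * 1024;
const DEFAULT_MAX_BUFFER_SIZE = 256 * 1024 * 1024;

// Bytes for a packed table: one vec4<f32> (16 bytes) per cell, as described above.
function satBytes(nx: number, ny: number, bytesPerCell: number = 16): number {
  return nx * ny * bytesPerCell;
}

// The subset of adapter.limits this sketch looks at.
interface LimitsLike {
  maxStorageBufferBindingSize: number;
  maxBufferSize: number;
}

// Build a requiredLimits object that raises only the limits the buffer actually exceeds.
function requiredLimitsFor(bytes: number, adapterLimits: LimitsLike): Record<string, number> {
  if (bytes > adapterLimits.maxStorageBufferBindingSize || bytes > adapterLimits.maxBufferSize) {
    throw new Error("buffer of " + bytes + " bytes exceeds what this adapter supports");
  }
  const limits: Record<string, number> = {};
  if (bytes > DEFAULT_MAX_STORAGE_BINDING) {
    limits.maxStorageBufferBindingSize = adapterLimits.maxStorageBufferBindingSize;
  }
  if (bytes > DEFAULT_MAX_BUFFER_SIZE) {
    limits.maxBufferSize = adapterLimits.maxBufferSize;
  }
  return limits;
}

// In a WebGPU-capable browser the result would be used roughly like this:
//   const adapter = await navigator.gpu.requestAdapter();
//   const requiredLimits = requiredLimitsFor(satBytes(3072, 3072), adapter.limits);
//   const device = await adapter.requestDevice({ requiredLimits });
```

`satBytes(3072, 3072)` is 150,994,944 bytes, about 151 MB or 144 MiB, which is over the 128 MiB default binding size but under the 256 MiB default buffer size, so only the binding limit needs raising in that case.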

A GPU portability trap worth knowing

The first GPU run silently miscounted: it scored every cell (8,051) instead of the 7,810 valid ones. The cause: masked cells were encoded as NaN and tested with x != x. GPU backends (Metal here) compile shaders with fast-math assumptions under which that NaN self-test can fold to a constant, so masked cells leaked into every window. The fix is an out-of-band sentinel (1×10³⁰) tested with an ordered comparison (x < 1×10²⁹), which fast-math can't break. A good reminder that the IEEE NaN semantics JavaScript guarantees do not carry over to shader code.
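On the host side, the sentinel convention amounts to re-encoding the mask before upload and testing validity with an ordered comparison everywhere. The sketch below shows that in TypeScript; the constant and function names are hypothetical, and the matching shader test appears only as a comment.

```typescript
const MISSING = 1e30;      // out-of-band sentinel; representable in float32
const VALID_BELOW = 1e29;  // ordered test threshold, well below the sentinel

// Re-encode a NaN-masked field as float32 with the sentinel, ready for a GPU buffer.
function encodeMask(values: number[]): Float32Array {
  const out = new Float32Array(values.length);
  for (let i = 0; i < values.length; i++) {
    const v = values[i];
    out[i] = Number.isNaN(v) ? MISSING : v;
  }
  return out;
}

// Validity is an ordinary ordered comparison, which does not rely on NaN semantics.
// The equivalent shader-side test would be along the lines of: if (x < 1.0e29) { ... }
function isValid(x: number): boolean {
  return x < VALID_BELOW;
}

function countValid(field: Float32Array): number {
  let n = 0;
  for (let i = 0; i < field.length; i++) {
    if (isValid(field[i])) n++;
  }
  return n;
}
```

Comparing `countValid` on the CPU with the count a kernel reports is a cheap guard against exactly the miscount described above.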

GPU → screen, no readback

The fraction fields the FSS compares are also rendered end-to-end on the GPU, from the same integral image the scan kernel builds: a compute pass runs the multi-block SAT scan, a colorize pass box-queries it (four O(1) lookups per cell) and writes the colormap straight into a texture, and a render pass samples that texture onto the canvas — nothing is copied back to JavaScript. The scan source is literally shared — one WGSL fragment is included by both the scoring kernel and the renderer — so the map and the score are one integral image, not two implementations. The whole scan → box-query → colormap → pixels path stays on the device (~1 ms/frame), the shape of a real interactive verification map.

WGSL compute · workgroup shared memory · Hillis-Steele scan · multi-block scan · summed-area table · shared shader source · storage texture · render pass

The trade-offs — pros, cons, benchmarks

Five levels of the same score, each a point on the effort-vs-payoff curve. Pros and cons sit side by side; switch to benchmarks (warm-median ms at 1024² on the Apple M2 Pro, measured in the live app).

A global-capable store where most models are regional-only

To score any model live in the browser, we put every model on one common grid and make model a dimension. But models don't all cover the same ground: AIGFS and GFS are global, while a mesoscale WRF or 4DWX run covers only a small region. Naïvely, a shared grid means every regional model pays for the whole continent in empty cells. Two Zarr properties avoid that cost, and the same idea scales the store from one region up to the planet.

First, missing data is stored as an ordered sentinel (1e30), and a chunk that's entirely sentinel either isn't written at all or compresses to almost nothing. So a regional model in the shared model dimension costs storage proportional to its footprint, not the full grid. Second, a multiscale pyramid lets each model live at its honest resolution — global models on a coarse global level, regional models on a fine regional level — so a coarse 0.25° model is never blown up to 2.5 km just to sit in the same store.
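The first property reduces to counting which chunks a model's footprint touches, since all-sentinel chunks need not be stored. The sketch below does that count for an axis-aligned footprint on a regular chunk grid. The grid size, chunk size and footprint are made-up numbers for illustration, and the `Box` type and function name are hypothetical; a real footprint would rarely be a rectangle.

```typescript
// Footprint as half-open cell ranges [x0, x1) x [y0, y1) on the shared grid.
interface Box { x0: number; y0: number; x1: number; y1: number; }

// Number of chunks that contain at least one footprint cell.
// Chunks outside this set are entirely sentinel and need not be written.
function chunksTouched(box: Box, chunk: number): number {
  const cx = Math.ceil(box.x1 / chunk) - Math.floor(box.x0 / chunk);
  const cy = Math.ceil(box.y1 / chunk) - Math.floor(box.y0 / chunk);
  return cx * cy;
}

// Hypothetical shared grid: 2048 x 1024 cells in 256 x 256 chunks.
const fullGrid: Box = { x0: 0, y0: 0, x1: 2048, y1: 1024 };
const regional: Box = { x0: 512, y0: 256, x1: 1024, y1: 768 };

const denseChunks = chunksTouched(fullGrid, 256);  // every chunk of the shared grid
const sparseChunks = chunksTouched(regional, 256); // only the chunks the region covers
```

In this made-up layout the regional model touches 4 of 32 chunks, so it pays for one eighth of the grid rather than all of it.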

Toggle models and switch the storage strategy — watch what each one actually costs. (Tile counts and sizes are illustrative but track the real design ratios.)

In the real v2 design on the 2.5 km CONUS grid, a regional model covering ~⅙ of the domain stores ~1.3 MB/frame versus ~8 MB for a full-cover model, so its cost tracks its footprint, and the pyramid keeps global models at 0.25° (~1.2 MB/frame) instead of upsampling them ~150× to 2.5 km. The same move lets the store grow from one region to global: each dataset at its honest resolution, empty space free. The vocabulary behind this (COG overviews, OME-Zarr, GeoZarr) is in the pyramids & cloud-native rasters explainer.

What this means for MET users

The verification math is portable. Every continuous and categorical score MET publishes is a function of a handful of accumulators — reproduced here to the 5th–6th decimal, with contingency counts bit-identical. No server round-trip, no install, no upload.
The bottleneck is I/O, not compute. GRIB2 and prepBUFR decoding stay offline (heavyweight, format-heavy). Everything downstream — regrid, threshold, accumulate, derive, FSS — is browser-native once the data is in Zarr v3 / Parquet.
It scales to operational grids. A 1024² field verifies in tens of milliseconds; 4 million cells in well under a second, single-threaded. The neighborhood/FSS operator, the one stage that needs a sliding window over the whole grid, also runs as WebGPU / WGSL compute kernels, the fastest of which is ~7× ahead of the CPU at 2048².
Validation is built in. Because the browser can hold both the input and MET's ground-truth .stat, every stage is checked against the operational answer as it computes — the ultimate provenance.

Benchmarks measured with the explorer's own kernel (Node / V8 v26, Apple M2 Pro; Chromium-based browsers run the same V8 engine, so figures are comparable there, while other engines may differ). Real case: one de-identified grid_stat case (AIGFS vs URMA, +24 h, 2 m temperature & 6 h precip). Accuracy figures from an independent blind review of 84 paired quantities. Part of the MET-AL verification lab.