Cloud-native pipeline

The pipeline’s rule of thumb, established by the pipeline explorer: verification math is cheap everywhere; grid operators are the GPU’s job; format decode stays offline.

GRIB2 / prepBUFR / .stat (the archive)
│ offline, once (cfgrib/ecCodes decode dominates: ~90% of ~73 s/cycle)
Zarr v3 grids + Parquet tables (sibling project: metplus-data-store)
│ one R2 upload (maintainer-run)
private R2 bucket ──► gated Worker (Range/206 + CORS) ──► browser
zarrita.js · DuckDB-WASM · WebGPU
| Store | Size | Content |
|---|---|---|
| metplus-input.zarr | 30 GB | Decoded model/analysis grids (bitround-12 + zstd; URMA lossless), 256² inner tiles |
| metplus-grid-output.zarr | 810 MB | 2,335 MET _pairs.nc cubes (superseded by the v2 direction) |
| metplus_stat.parquet | 8.8 MB | The full 1.53M-row .stat archive: all models/vars/line types |
| gdas_points.parquet | 45 MB | 11.7M point observations |
| web-demo-v2.zarr | 83 MB | The v2 common-grid demo store (in R2 now) |

Note the striking economics: the entire .stat statistics archive compresses to 8.8 MB of Parquet, so DuckDB-WASM can query any variable, line type, or statistic directly in the browser, with no query server behind it.
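
As a rough check on those economics, the sketch below divides each Parquet file's size by its row count, using the figures from the store table. It assumes "MB" means 10^6 bytes and that the sizes are rounded, so the results are approximate.

```python
# Sizes and row counts come from the store table above.
# Assumption: MB = 10**6 bytes; the published figures are rounded.
TABLES = {
    "metplus_stat.parquet": (8.8e6, 1.53e6),   # (bytes, rows)
    "gdas_points.parquet": (45e6, 11.7e6),
}


def bytes_per_row(size_bytes: float, rows: float) -> float:
    """Average compressed footprint of one row."""
    return size_bytes / rows


for name, (size_bytes, rows) in TABLES.items():
    print(f"{name}: {bytes_per_row(size_bytes, rows):.2f} bytes/row")
```

Under those assumptions both tables come out at only a few bytes per row, which is why shipping the whole archive to the browser is practical.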

The v2 store — raw-first, pairs on the fly

A from-scratch redesign (two independent design teams, then user decisions) replaced “store MET’s _pairs.nc intermediates” with:

  • One common grid (URMA 2.5 km CONUS; the demo ships a ×3-downsampled 533×782 version) onto which every forecast is regridded once at conversion (hand-rolled numpy bilinear, weights cached, ~0.008 s).
  • Model as a dimension, variable as separate arrays (per-variable precision: bitround t2m=10 / si10=9 bits; APCP lossless for categorical parity), lead as a dimension with valid_time as the truth join key.
  • Whole-frame chunks: the app’s dominant read is a full 2-D frame, so each frame costs 1–2 range GETs, versus roughly 10× as many with 256² tiles.
  • No stored pairs at all — the browser (or GPU) computes fcst − truth live for any model/var/lead/region: strictly more powerful than MET’s fixed pairings. A tiny de-identified oracle (6 precip cases + expected CTC counts) preserves the browser == .stat parity proof.
  • Missing data uses the 1e30 sentinel (GPU-safe), and regional models cost ≈ their footprint via omitted/fill chunks.
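
The "pairs on the fly" idea from the list above can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's browser or GPU code: frames are flattened to lists, the `>=` threshold convention and the `VALID_LIMIT` cutoff used to detect the 1e30 sentinel are assumptions, and the counter names follow MET's CTC line type (TOTAL, FY_OY, FY_ON, FN_OY, FN_ON).

```python
from typing import Iterator, NamedTuple, Sequence, Tuple

MISSING = 1e30        # sentinel named in the text
VALID_LIMIT = 1e29    # assumption: anything this large is treated as missing


class CTC(NamedTuple):
    total: int
    fy_oy: int  # forecast yes, observed yes
    fy_on: int  # forecast yes, observed no
    fn_oy: int  # forecast no, observed yes
    fn_on: int  # forecast no, observed no


def live_pairs(fcst: Sequence[float], truth: Sequence[float]) -> Iterator[Tuple[float, float]]:
    """Yield (forecast, truth) pairs where neither value is the missing sentinel."""
    for f, o in zip(fcst, truth):
        if abs(f) < VALID_LIMIT and abs(o) < VALID_LIMIT:
            yield f, o


def mean_error(fcst: Sequence[float], truth: Sequence[float]) -> float:
    """Mean of fcst - truth over valid pairs."""
    diffs = [f - o for f, o in live_pairs(fcst, truth)]
    return sum(diffs) / len(diffs)


def contingency(fcst: Sequence[float], truth: Sequence[float], threshold: float) -> CTC:
    """Count the 2x2 contingency table for the event value >= threshold."""
    fy_oy = fy_on = fn_oy = fn_on = 0
    for f, o in live_pairs(fcst, truth):
        fy, oy = f >= threshold, o >= threshold
        if fy and oy:
            fy_oy += 1
        elif fy:
            fy_on += 1
        elif oy:
            fn_oy += 1
        else:
            fn_on += 1
    return CTC(fy_oy + fy_on + fn_oy + fn_on, fy_oy, fy_on, fn_oy, fn_on)


# Made-up values; the fourth forecast cell is missing and drops out of the pairing.
fcst = [0.0, 1.2, 3.0, MISSING, 0.1]
truth = [0.0, 0.5, 2.5, 2.0, 1.5]
print(contingency(fcst, truth, threshold=1.0))
```

Because the pairing happens at read time, restricting it to a region or swapping the model or lead only changes which cells are passed in; nothing has to be re-stored.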

Result: the pairs collection (55,331 objects / 810 MB — a tiny-object anti-pattern on an object store) collapses to a few oracle objects, and live pairs for any selection cost ~4 GETs / 11–17 MB.
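
The request-count argument behind whole-frame chunks can be checked with simple arithmetic on the 533×782 demo grid. The sketch assumes float32 values and one GET per touched chunk; compression and request coalescing are ignored, so treat the numbers as order-of-magnitude only.

```python
import math

NY, NX = 533, 782          # demo grid shape from the text
BYTES_PER_VALUE = 4        # assumption: float32


def tile_count(ny: int, nx: int, tile: int = 256) -> int:
    """Number of tile x tile chunks needed to cover an ny x nx frame."""
    return math.ceil(ny / tile) * math.ceil(nx / tile)


tiles = tile_count(NY, NX)                    # chunks touched by a full-frame read
raw_frame_bytes = NY * NX * BYTES_PER_VALUE   # uncompressed size of one frame

print(f"256x256 tiles per frame: {tiles}")
print(f"uncompressed frame: {raw_frame_bytes / 1e6:.2f} MB")
```

A full-frame read touches 12 tiles on this grid, against a single whole-frame chunk, which is consistent with the roughly 10× difference in GETs quoted above.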

| Step | Cost |
|---|---|
| One full cycle, all models (v1: GRIB2→Zarr + pairs + stat→Parquet) | ~73 s (GRIB2 decode ≈ 90%) |
| Full 24-cycle archive | ≈ 25–30 min |
| v2 regrid + encode (from decoded Zarr; 160 frames + truth) | ~7.9 s |
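
The regrid step is cheap largely because the bilinear weights depend only on the two grids, so they can be built once and reused for every frame. The sketch below shows that split in plain Python for self-containment (the source describes a numpy implementation). The function names are hypothetical, target points are assumed to be already expressed as fractional (row, col) positions inside the source grid, and missing-value handling is left out.

```python
import math
from typing import List, Sequence, Tuple

# One entry per target point: four flat source indices and their four weights.
Weights = List[Tuple[Tuple[int, int, int, int], Tuple[float, float, float, float]]]


def build_weights(targets: Sequence[Tuple[float, float]], ny: int, nx: int) -> Weights:
    """Precompute bilinear neighbours and weights (done once per grid pair)."""
    out: Weights = []
    for y, x in targets:
        # Clamp so the 2x2 neighbourhood always lies inside the source grid.
        y0 = min(max(math.floor(y), 0), ny - 2)
        x0 = min(max(math.floor(x), 0), nx - 2)
        ty, tx = y - y0, x - x0
        i00 = y0 * nx + x0
        idx = (i00, i00 + 1, i00 + nx, i00 + nx + 1)
        wts = ((1 - ty) * (1 - tx), (1 - ty) * tx, ty * (1 - tx), ty * tx)
        out.append((idx, wts))
    return out


def apply_weights(weights: Weights, frame_flat: Sequence[float]) -> List[float]:
    """Regrid one row-major source frame using cached weights (done per frame)."""
    return [sum(frame_flat[i] * w for i, w in zip(idx, wts)) for idx, wts in weights]


# 2x2 source frame [[0, 1], [2, 3]] sampled at its centre and at two corners.
cache = build_weights([(0.5, 0.5), (0.0, 0.0), (1.0, 1.0)], ny=2, nx=2)
print(apply_weights(cache, [0.0, 1.0, 2.0, 3.0]))
```

With the weights cached, each additional frame costs only four multiply-adds per target cell.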

GRIB2→Zarr expands on disk (1,226 → 1,692 MB even with bitrounding) — GRIB2’s native packing is excellent; the win is random access, not size.
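
Bitrounding keeps only the leading mantissa bits of each float32 and zeroes the rest, so a general-purpose compressor such as zstd sees long runs of zero bits. The sketch below is a scalar, stdlib-only illustration of the general technique (round to nearest, ties to even, as codecs such as numcodecs' BitRound do). It is not taken from the project's encoder, and the kelvin example value is an assumption.

```python
import struct

MANTISSA_BITS = 23  # explicit mantissa bits in an IEEE 754 float32


def bitround(value: float, keepbits: int) -> float:
    """Round a float32 to `keepbits` mantissa bits (nearest, ties to even)."""
    if not 1 <= keepbits < MANTISSA_BITS:
        raise ValueError("keepbits must be in 1..22")
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    drop = MANTISSA_BITS - keepbits
    # Add just under half a quantum, plus the lowest kept bit, so exact ties go to even.
    bits += ((bits >> drop) & 1) + ((1 << (drop - 1)) - 1)
    # Zero the dropped bits; the final mask keeps the result within 32 bits.
    bits &= (0xFFFFFFFF << drop) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# With 10 kept bits, values between 256 and 512 land on multiples of 0.25.
print(bitround(288.15, 10))
```

For a 2 m temperature stored in kelvin, 10 kept bits therefore give a worst-case rounding error of about 0.125 K in that range, which illustrates the per-variable precision trade-off.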

For global-plus-regional storage the growth path is a multiscale pyramid (GeoZarr-style multiscales: coarse global level + fine regional level), which keeps per-view fetch cost flat while the dataset grows — see the interactive multiscale storage explainer.
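
One way to see why a pyramid keeps per-view cost flat: the viewer picks the coarsest level that still resolves one grid cell per screen pixel, so the number of cells fetched tracks the viewport size, not the dataset size. The sketch below is a hypothetical level picker. The level names and cell sizes are illustrative assumptions, not the project's layout, and regional coverage checks are omitted.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Level:
    name: str
    cell_km: float  # nominal grid spacing of this pyramid level


def pick_level(levels: Sequence[Level], view_width_km: float, view_width_px: int) -> Level:
    """Choose the coarsest level whose cells are no larger than one screen pixel."""
    km_per_px = view_width_km / view_width_px
    adequate = [lv for lv in levels if lv.cell_km <= km_per_px]
    if adequate:
        return max(adequate, key=lambda lv: lv.cell_km)
    # Zoomed in beyond the finest data: fall back to the finest level.
    return min(levels, key=lambda lv: lv.cell_km)


# Hypothetical three-level pyramid.
LEVELS = [Level("fine", 2.5), Level("mid", 7.5), Level("coarse", 22.5)]

print(pick_level(LEVELS, view_width_km=20000, view_width_px=1000).name)  # 20 km per pixel
print(pick_level(LEVELS, view_width_km=100, view_width_px=1000).name)    # 0.1 km per pixel
```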