Cloud-native pipeline
The pipeline’s rule of thumb, established by the pipeline explorer: verification math is cheap everywhere; grid operators are the GPU’s job; format decode stays offline.
GRIB2 / prepBUFR / .stat (the archive)
  │ offline, once (cfgrib/ecCodes decode dominates: ~90% of ~73 s/cycle)
  ▼
Zarr v3 grids + Parquet tables (sibling project: metplus-data-store)
  │ one R2 upload (maintainer-run)
  ▼
private R2 bucket ──► gated Worker (Range/206 + CORS) ──► browser
  zarrita.js · DuckDB-WASM · WebGPU

Stores
| Store | Size | Content |
|---|---|---|
| metplus-input.zarr | 30 GB | Decoded model/analysis grids (bitround-12 + zstd; URMA lossless), 256² inner tiles |
| metplus-grid-output.zarr | 810 MB | 2,335 MET _pairs.nc cubes (superseded by the v2 direction) |
| metplus_stat.parquet | 8.8 MB | The full 1.53M-row .stat archive: all models/vars/line types |
| gdas_points.parquet | 45 MB | 11.7M point observations |
| web-demo-v2.zarr | 83 MB | The v2 common-grid demo store (in R2 now) |
Note the striking economics: the entire .stat statistics archive compresses to 8.8 MB of
Parquet, so DuckDB-WASM can query any variable, line type, or statistic in the browser with no query server.
The v2 store — raw-first, pairs on the fly
A from-scratch redesign (two independent design teams, then user decisions) replaced
“store MET’s _pairs.nc intermediates” with:
- One common grid (URMA 2.5 km CONUS; the demo ships a ×3-downsampled 533×782 version) onto which every forecast is regridded once at conversion (hand-rolled numpy bilinear, weights cached, ~0.008 s).
- Model as a dimension, variable as separate arrays (per-variable precision: bitround t2m=10 / si10=9 bits; APCP lossless for categorical parity), lead as a dimension with valid_time as the truth join key.
- Whole-frame chunks (the app’s dominant read is a full 2-D frame → 1–2 range GETs per frame, vs ~10× GETs with 256² tiles).
- No stored pairs at all: the browser (or GPU) computes fcst − truth live for any model/var/lead/region, which is strictly more powerful than MET’s fixed pairings. A tiny de-identified oracle (6 precip cases + expected CTC counts) preserves the browser == .stat parity proof.
- Missing data uses the 1e30 sentinel (GPU-safe), and regional models cost ≈ their footprint via omitted/fill chunks.
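The one-time regrid with cached weights from the first bullet can be sketched as below. This is a minimal pure-Python illustration, not the project's numpy implementation. It assumes both grids have separable, strictly increasing 1-D coordinate axes (a projected CONUS grid would need 2-D coordinates and a different weight computation), and the names `axis_weights`, `build_weights`, and `regrid` are hypothetical.

```python
from bisect import bisect_right


def axis_weights(src, dst):
    """For each dst coordinate return (i0, i1, w1) on a strictly increasing src axis."""
    out = []
    for x in dst:
        # Clamp to the source extent so out-of-range points reuse the border value.
        x = min(max(x, src[0]), src[-1])
        i0 = min(bisect_right(src, x) - 1, len(src) - 2)
        w1 = (x - src[i0]) / (src[i0 + 1] - src[i0])
        out.append((i0, i0 + 1, w1))
    return out


def build_weights(src_y, src_x, dst_y, dst_x):
    """Computed once per (source grid, target grid) pair, then reused for every frame."""
    return axis_weights(src_y, dst_y), axis_weights(src_x, dst_x)


def regrid(frame, weights):
    """Apply cached bilinear weights to one 2-D frame (a list of rows)."""
    wy, wx = weights
    out = []
    for j0, j1, v in wy:
        row0, row1 = frame[j0], frame[j1]
        out.append([
            (1 - v) * ((1 - u) * row0[i0] + u * row0[i1])
            + v * ((1 - u) * row1[i0] + u * row1[i1])
            for i0, i1, u in wx
        ])
    return out


# Illustrative data only: a 2x2 source frame sampled at three target rows.
weights = build_weights([0.0, 1.0], [0.0, 1.0], [0.0, 0.5, 1.0], [0.5])
print(regrid([[0.0, 1.0], [2.0, 3.0]], weights))
```

Building the weights is the expensive part; applying them is a fixed number of multiply-adds per target cell, which is why caching them per grid pair pays off when many frames share the same source grid.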
Result: the pairs collection (55,331 objects / 810 MB — a tiny-object anti-pattern on an object store) collapses to a few oracle objects, and live pairs for any selection cost ~4 GETs / 11–17 MB.
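A rough sketch of what computing pairs live could look like, assuming two already-fetched frames on the common grid. The function names, the threshold, and the sample values are hypothetical and chosen only for illustration; the count names follow MET's CTC line type (FY_OY, FY_ON, FN_OY, FN_ON, TOTAL).

```python
FILL = 1e30  # missing-data sentinel named in the text


def is_missing(value):
    # Compare with a margin: 1e30 stored as float32 does not round-trip to exactly 1e30.
    return abs(value) >= 1e29


def live_pairs(fcst, truth):
    """Return (forecast, truth) tuples for every cell where both frames have data."""
    return [
        (f, o)
        for frow, trow in zip(fcst, truth)
        for f, o in zip(frow, trow)
        if not (is_missing(f) or is_missing(o))
    ]


def ctc(pairs, threshold):
    """2x2 contingency counts for the event 'value >= threshold'."""
    counts = {"FY_OY": 0, "FY_ON": 0, "FN_OY": 0, "FN_ON": 0}
    for f, o in pairs:
        key = ("FY" if f >= threshold else "FN") + "_" + ("OY" if o >= threshold else "ON")
        counts[key] += 1
    counts["TOTAL"] = sum(counts.values())
    return counts


# Illustrative frames only (not project data); two cells carry the sentinel.
fcst = [[0.0, 5.0, 12.0], [FILL, 1.0, 30.0]]
truth = [[0.0, 7.0, 3.0], [2.0, FILL, 26.0]]

pairs = live_pairs(fcst, truth)
errors = [f - o for f, o in pairs]  # fcst - truth per valid cell
print(errors)
print(ctc(pairs, threshold=5.0))
```

Because the pairing is just a masked elementwise pass over two frames, the same logic maps naturally onto a GPU kernel, and an oracle of expected CTC counts can be checked by comparing the returned dictionary against the stored values.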
Conversion benchmarks (M2 Pro)
| Step | Cost |
|---|---|
| One full cycle, all models (v1: GRIB2→Zarr + pairs + stat→Parquet) | ~73 s (GRIB2 decode ≈ 90%) |
| Full 24-cycle archive | ≈ 25–30 min |
| v2 regrid + encode (from decoded Zarr; 160 frames + truth) | ~7.9 s |
GRIB2→Zarr expands on disk (1,226 → 1,692 MB even with bitrounding) — GRIB2’s native packing is excellent; the win is random access, not size.
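For readers unfamiliar with bitrounding, here is a minimal sketch of the general technique: keep a fixed number of float32 mantissa bits, round to nearest, and zero the rest so the compressor sees long runs of zero bits. The source does not say which implementation the pipeline uses, so the `bitround` function below is a hypothetical illustration; it does not handle NaN or infinity.

```python
import struct


def bitround(value, keepbits):
    """Round a float32 value to `keepbits` explicit mantissa bits (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    drop = 23 - keepbits  # float32 has 23 explicit mantissa bits
    if drop > 0:
        half = 1 << (drop - 1)
        # Add (half - 1) plus the lowest kept bit, then truncate: round-half-to-even.
        bits = (bits + (half - 1) + ((bits >> drop) & 1)) & 0xFFFFFFFF
        bits &= (0xFFFFFFFF << drop) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Illustrative value only: a temperature in kelvin with 10 kept bits (the t2m setting).
rounded = bitround(283.15, 10)
print(rounded, abs(rounded - 283.15) / 283.15)
```

With `keepbits` mantissa bits retained, the relative error stays within about 2^-(keepbits+1), which is why a variable-specific bit count is a precision decision rather than a purely storage-driven one.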
Scaling direction
For global-plus-regional storage the growth path is a multiscale pyramid (GeoZarr-style multiscales: coarse global level + fine regional level), which keeps per-view fetch cost flat while the dataset grows; see the interactive multiscale storage explainer.
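The "flat per-view fetch cost" claim can be illustrated with a small sketch. It assumes a pyramid where each level halves the resolution of the one below it, which is an assumption for illustration and not a statement about the actual store layout; `pick_level`, `cells_across`, and all numbers are hypothetical.

```python
import math


def pick_level(view_width_km, screen_px, base_res_km, n_levels):
    """Coarsest level whose cells are still no larger than one screen pixel."""
    px_km = view_width_km / screen_px
    level = math.floor(math.log2(max(px_km / base_res_km, 1.0)))
    return min(level, n_levels - 1)


def cells_across(view_width_km, level, base_res_km):
    """Number of grid cells spanning the view at the given pyramid level."""
    return math.ceil(view_width_km / (base_res_km * 2 ** level))


# Zooming out by 8x moves three levels up, so the cell count per view stays the same.
for width_km in (2560, 5120, 20480):
    level = pick_level(width_km, screen_px=1024, base_res_km=2.5, n_levels=5)
    print(width_km, level, cells_across(width_km, level, base_res_km=2.5))
```

Under these assumptions the number of cells fetched per view stays below roughly twice the screen width in pixels, regardless of how large the full-resolution dataset becomes, until the coarsest level is reached.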
