11 · WebGPU FSS

Round 3 · GPU compute · data: real · baked (synthetic grids only for the throughput sweep) · reimagines MET core (the FSS operator)

The Fractions Skill Score is MET’s one common operator with no closed form over partial sums — it needs the actual neighborhood fractions over the grid. That makes it the honest test case for GPU compute. This app implements it four ways in WGSL and races them:

| Kernel | Complexity | Character |
| --- | --- | --- |
| Naive per-cell window | O(cells · r²) | Wins by parallelism at small r, blows up with radius |
| Separable 2-pass sliding window | O(cells) | Flat in r, but has a serial per-line dependency |
| Prefix-scan integral image (SAT) | O(cells) | Flat ~5 ms, fastest at large r |
| Multi-block prefix scan | O(cells) | Lifts the 2048-cell line cap (3-phase block scan), lines ≤ 524,288 |
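To make the first and third rows concrete, here is a minimal CPU sketch in TypeScript of the naive window and the integral-image (SAT) window, plus the standard FSS formula, 1 − Σ(Pf − Po)² / (Σ Pf² + Σ Po²). It is not the app's WGSL. The function names, the `Grid` type, the zero-padded edges, and the fixed (2r+1)² denominator are all assumptions made for illustration; the app and MET may treat edges differently.

```typescript
// CPU reference sketch (hypothetical names; not the app's WGSL kernels).
type Grid = { w: number; h: number; data: Float64Array };

// Binary exceedance field: 1 where the value meets the threshold, else 0.
function exceed(g: Grid, thr: number): Uint8Array {
  const out = new Uint8Array(g.w * g.h);
  for (let i = 0; i < out.length; i++) out[i] = g.data[i] >= thr ? 1 : 0;
  return out;
}

// Naive neighborhood counts, O(cells * r^2). The (2r+1)^2 window is clipped
// at the grid edge, which is equivalent to zero padding (an assumption).
function countsNaive(b: Uint8Array, w: number, h: number, r: number): Int32Array {
  const out = new Int32Array(w * h);
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      let s = 0;
      for (let yy = Math.max(0, y - r); yy <= Math.min(h - 1, y + r); yy++) {
        for (let xx = Math.max(0, x - r); xx <= Math.min(w - 1, x + r); xx++) {
          s += b[yy * w + xx];
        }
      }
      out[y * w + x] = s;
    }
  }
  return out;
}

// Integral-image counts, O(cells) regardless of r. The table is
// (w+1) x (h+1) with a zero first row and column, so every window
// sum is four lookups.
function countsSAT(b: Uint8Array, w: number, h: number, r: number): Int32Array {
  const W = w + 1;
  const sat = new Int32Array(W * (h + 1));
  for (let y = 0; y < h; y++) {
    let row = 0; // running prefix sum along the current row
    for (let x = 0; x < w; x++) {
      row += b[y * w + x];
      sat[(y + 1) * W + (x + 1)] = sat[y * W + (x + 1)] + row;
    }
  }
  const out = new Int32Array(w * h);
  for (let y = 0; y < h; y++) {
    const y0 = Math.max(0, y - r), y1 = Math.min(h, y + r + 1);
    for (let x = 0; x < w; x++) {
      const x0 = Math.max(0, x - r), x1 = Math.min(w, x + r + 1);
      out[y * w + x] =
        sat[y1 * W + x1] - sat[y0 * W + x1] - sat[y1 * W + x0] + sat[y0 * W + x0];
    }
  }
  return out;
}

// FSS from forecast and observed neighborhood counts.
// Returns NaN when both fields are empty (a CPU-side choice only).
function fss(cf: Int32Array, co: Int32Array, r: number): number {
  const n = (2 * r + 1) * (2 * r + 1);
  let num = 0, den = 0;
  for (let i = 0; i < cf.length; i++) {
    const pf = cf[i] / n, po = co[i] / n;
    num += (pf - po) * (pf - po);
    den += pf * pf + po * po;
  }
  return den === 0 ? NaN : 1 - num / den;
}
```

Because both count functions work on integers, they should agree exactly for any radius, which is the kind of parity a CPU selftest can assert before any GPU comparison.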

Plus GPU-to-screen rendering: the compute pass writes turbo-colormapped Pf/Po/|Pf−Po| into a storage texture sampled by the render pass — the on-screen field is colorized from the same integral image the score uses, with no readback (~0.2 ms).

  • Layered parity: Node selftest proves naive == separable == integral CPU (148/148, maxΔ exactly 0, including >2048 grids); the browser proves GPU == CPU on real Apple Metal (n exact, FSS Δ ~1e-8 to 1e-9).
  • ~7.3× faster than CPU at 2048² (4.2 M cells); the r-sweep at 1024² shows naive climbing O(r²) (5.4 ms at r=1 → 30.7 ms at r=16) while the O(cells) kernels hold flat.
  • §6 mirrors the radius slider with a live FSS readout so the score, the map, and the GPU render stay in lock-step.
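  • NaN is unusable as a missing-data sentinel under Metal fast-math, where the compiler may assume NaNs never occur and drop NaN checks; use a large finite sentinel (1e30) with an ordered comparison instead.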
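  • WebGPU’s default 128 MiB storage-binding limit (maxStorageBufferBindingSize) fails silently when exceeded (bindings read zeros); request the adapter's higher limits explicitly via requiredLimits when creating the device.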
  • Real-Chrome-with-flags is required for WebGPU in automation; the bundled headless shell has no GPU.
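Two of the pitfalls above, the missing-data sentinel and the storage-binding limit, can be sketched in TypeScript as follows. This is an illustration under stated assumptions, not the app's code: `MISSING`, `VALID_BELOW`, `isValid`, `encodeMissing`, `LimitsLike`, and `requiredLimitsFor` are hypothetical names, and the 1e29 cutoff assumes real data values sit far below it. The WebGPU calls in the trailing comment (`navigator.gpu.requestAdapter`, `adapter.limits`, `adapter.requestDevice` with `requiredLimits`) are the standard API.

```typescript
// Sentinel for missing data: a large finite value instead of NaN.
// Assumption: valid data are far below VALID_BELOW.
const MISSING = 1e30;
const VALID_BELOW = 1e29;

// Ordered comparison; it does not rely on NaN semantics.
function isValid(v: number): boolean {
  return v < VALID_BELOW;
}

// Replace NaN with the sentinel before uploading to an f32 GPU buffer.
function encodeMissing(src: ArrayLike<number>): Float32Array {
  const out = new Float32Array(src.length);
  for (let i = 0; i < src.length; i++) {
    const v = src[i];
    out[i] = isNaN(v) ? MISSING : v;
  }
  return out;
}

// Minimal structural type so the sketch type-checks without @webgpu/types.
interface LimitsLike {
  maxStorageBufferBindingSize: number;
  maxBufferSize: number;
}

// Build a requiredLimits object for a buffer of bytesNeeded bytes.
// Fails loudly when the adapter cannot supply it, rather than letting
// an oversized binding go unnoticed.
function requiredLimitsFor(
  adapterLimits: LimitsLike,
  bytesNeeded: number
): Record<string, number> {
  if (
    bytesNeeded > adapterLimits.maxStorageBufferBindingSize ||
    bytesNeeded > adapterLimits.maxBufferSize
  ) {
    throw new RangeError(
      "Adapter cannot bind " + bytesNeeded + " bytes as one storage buffer"
    );
  }
  // Request what the adapter reports as supported for these two limits.
  return {
    maxStorageBufferBindingSize: adapterLimits.maxStorageBufferBindingSize,
    maxBufferSize: adapterLimits.maxBufferSize,
  };
}

// In a WebGPU browser this would be used roughly as:
//   const adapter = await navigator.gpu.requestAdapter();
//   const device = await adapter.requestDevice({
//     requiredLimits: requiredLimitsFor(adapter.limits, bytes),
//   });
```

Requesting the adapter's reported maxima is the simplest choice for a sketch; a production app might request only what it needs so that it behaves the same on more devices.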

Intentionally no single-file build — it needs a WebGPU browser.