WebGPU: the browser's new compute primitive
How WebGPU changes what's possible for AI in the browser, and why it matters for local inference at scale.
WebGPU landed in Chrome 113 in May 2023, and since then, browser AI has had a new ceiling. Most developers still think of WebGPU as a graphics API — a successor to WebGL for 3D rendering. That framing misses its most important property: WebGPU is a general-purpose compute platform that happens to also do graphics.
Beyond 3D: WebGPU as a compute platform
At its core, WebGPU exposes the GPU’s ability to run massively parallel programs. Where WebGL hacked compute on top of fragment shaders, WebGPU gives you explicit compute pipelines: you write compute shaders in WGSL (WebGPU Shading Language), dispatch them with configurable workgroups, and read back results via GPU buffers.
This matters enormously for AI. Neural network inference is almost entirely matrix multiplication. A transformer doing a forward pass multiplies its activations against weight matrices to produce queries, keys, and values, then does the same again in the feed-forward layers: billions of multiply-accumulates per token. That maps directly to what GPUs are designed to do: execute the same operation on thousands of data points in parallel.
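To make that cost concrete, here is a minimal CPU reference in TypeScript. It is a sketch, not something an inference engine would ship: the name `matmulReference` and the flat row-major layout are assumptions chosen for illustration. It multiplies two square n×n matrices and counts the multiply-accumulates along the way.

```typescript
// Naive CPU matrix multiply over flat, row-major Float32Arrays.
// Illustrative only: `matmulReference` is a hypothetical helper, not a library API.
function matmulReference(
  a: Float32Array,
  b: Float32Array,
  n: number,
): { result: Float32Array; macs: number } {
  const result = new Float32Array(n * n);
  let macs = 0; // multiply-accumulate operations performed
  for (let row = 0; row < n; row++) {
    for (let col = 0; col < n; col++) {
      let sum = 0;
      for (let k = 0; k < n; k++) {
        sum += a[row * n + k] * b[k * n + col];
        macs++;
      }
      result[row * n + col] = sum;
    }
  }
  return { result, macs };
}

// 2x2 example: the product is 19, 22, 43, 50 using 8 multiply-accumulates.
const demo = matmulReference(
  new Float32Array([1, 2, 3, 4]),
  new Float32Array([5, 6, 7, 8]),
  2,
);
console.log(Array.from(demo.result), demo.macs);
```

The count grows as n³, so at n = 512 the same loop performs about 134 million multiply-accumulates. Each output element is independent of the others, which is exactly the parallelism a GPU exploits.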
Why not WebAssembly?
WASM is excellent for scalar workloads and CPU-friendly operations. A well-optimized WASM build with SIMD can run models surprisingly fast on modern CPUs. But CPUs have 8–32 physical cores. A mobile GPU has hundreds of shader units. A desktop GPU has thousands. For matrix-heavy workloads, WebGPU often wins by several times and can approach an order of magnitude.
The practical difference: running a 7B parameter model quantized to 4-bit on CPU via WASM might hit 2–5 tokens/second on a modern laptop. The same model via WebGPU hits 15–40 tokens/second on an integrated GPU, and significantly more on discrete hardware.
A minimal compute shader
Here’s what matrix multiplication looks like in WGSL:
    // Inputs and output are square N x N matrices stored in row-major order.
    @group(0) @binding(0) var<storage, read> a: array<f32>;
    @group(0) @binding(1) var<storage, read> b: array<f32>;
    @group(0) @binding(2) var<storage, read_write> result: array<f32>;

    // Matrix dimension, hardcoded for this example.
    const N: u32 = 512u;

    @compute @workgroup_size(8, 8)
    fn main(@builtin(global_invocation_id) id: vec3u) {
        let row = id.x;
        let col = id.y;
        // Skip invocations that fall outside the matrix.
        if (row >= N || col >= N) {
            return;
        }
        // Dot product of one row of a with one column of b.
        var sum: f32 = 0.0;
        for (var k: u32 = 0u; k < N; k++) {
            sum += a[row * N + k] * b[k * N + col];
        }
        result[row * N + col] = sum;
    }

Real inference kernels are more complex, since they handle quantization, attention masking, and KV caching, but the structure is the same: parallel execution across a grid of invocations, each computing one element of the output.
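The shader is only half of the picture: host code has to upload the inputs, dispatch the workgroups, and read the result back. The TypeScript below is a rough sketch of that flow using standard WebGPU calls. The helper names `workgroupsPerAxis` and `runMatmul` are hypothetical, `shaderCode` is assumed to hold the WGSL source above, and `n` must match the shader's hardcoded `N` of 512. The `gpu` argument is typed as `any` so the sketch compiles without WebGPU type definitions.

```typescript
// Number of workgroups needed to cover `size` elements along one axis.
function workgroupsPerAxis(size: number, workgroupSize: number): number {
  return Math.ceil(size / workgroupSize);
}

// Runs an n x n matmul shader and resolves with the result matrix.
// `gpu` is expected to be `navigator.gpu` (hypothetical wiring, adjust to your app).
async function runMatmul(
  gpu: any,
  shaderCode: string,
  a: Float32Array,
  b: Float32Array,
  n: number,
): Promise<Float32Array> {
  const adapter = await gpu?.requestAdapter();
  if (!adapter) throw new Error("WebGPU is not available");
  const device = await adapter.requestDevice();
  const usage = (globalThis as any).GPUBufferUsage;
  const mapMode = (globalThis as any).GPUMapMode;
  const byteLength = n * n * 4; // one f32 is 4 bytes

  // Upload each input matrix into a storage buffer.
  const makeInput = (data: Float32Array) => {
    const buffer = device.createBuffer({ size: byteLength, usage: usage.STORAGE | usage.COPY_DST });
    device.queue.writeBuffer(buffer, 0, data);
    return buffer;
  };
  const bufA = makeInput(a);
  const bufB = makeInput(b);
  const bufResult = device.createBuffer({ size: byteLength, usage: usage.STORAGE | usage.COPY_SRC });
  // Storage buffers cannot be mapped directly, so copy into a mappable buffer.
  const readback = device.createBuffer({ size: byteLength, usage: usage.MAP_READ | usage.COPY_DST });

  const pipeline = device.createComputePipeline({
    layout: "auto",
    compute: { module: device.createShaderModule({ code: shaderCode }), entryPoint: "main" },
  });
  // Bindings 0, 1, 2 match the @binding indices in the shader.
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [bufA, bufB, bufResult].map((buffer, binding) => ({ binding, resource: { buffer } })),
  });

  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  const groups = workgroupsPerAxis(n, 8); // 8 matches @workgroup_size(8, 8)
  pass.dispatchWorkgroups(groups, groups);
  pass.end();
  encoder.copyBufferToBuffer(bufResult, 0, readback, 0, byteLength);
  device.queue.submit([encoder.finish()]);

  await readback.mapAsync(mapMode.READ);
  const out = new Float32Array(readback.getMappedRange().slice(0)); // copy before unmapping
  readback.unmap();
  return out;
}
```

For the 512×512 case this dispatches a 64×64 grid of workgroups, each running 8×8 invocations, so every output element gets exactly one invocation.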
What’s available today
Three libraries matter for browser AI via WebGPU:
- Transformers.js (Hugging Face) — runs ONNX models with a WebGPU backend. Drop-in pipeline API for text, embeddings, and vision tasks.
- WebLLM (MLC AI) — purpose-built for chat LLMs. Ships quantized Llama, Phi, Mistral, and Gemma variants. Best-in-class throughput.
- MediaPipe Tasks (Google) — optimized for specific tasks like sentiment analysis, object detection, and hand pose. Very fast, well-documented.
For most use cases, start with Transformers.js. It covers the common text, embedding, and vision tasks, and it runs on WebGPU when you request that device (the device: 'webgpu' option in v3), with WASM as the CPU path otherwise.
Current limitations
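WebGPU is not yet universally available. Safari shipped it enabled by default in Safari 26 (September 2025); earlier versions, including Safari 18, had it only behind a feature flag. Firefox enabled it by default on Windows in Firefox 141 (July 2025) and is still rolling it out to other platforms. On mobile, support is improving but fragmented. Your inference code needs a fallback path: detect WebGPU, and run on WASM when it is missing.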
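The detection step can be sketched in a few lines of TypeScript. The names `Backend`, `GpuLike`, and `chooseBackend` are hypothetical. The only WebGPU call involved is `requestAdapter()`, which resolves to `null` when the API exists but no usable GPU is found.

```typescript
type Backend = "webgpu" | "wasm";

// Minimal structural type for the part of `navigator.gpu` used here.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}

// Picks WebGPU when an adapter is actually obtainable, WASM otherwise.
async function chooseBackend(gpu: GpuLike | undefined): Promise<Backend> {
  if (!gpu) return "wasm"; // the browser does not expose the API at all
  try {
    const adapter = await gpu.requestAdapter();
    return adapter ? "webgpu" : "wasm"; // null adapter: API present, no usable GPU
  } catch {
    return "wasm"; // treat any failure as "not available"
  }
}

// In a browser this reads navigator.gpu; outside one it falls through to WASM.
chooseBackend((globalThis as any).navigator?.gpu).then((backend) => {
  console.log(`Selected backend: ${backend}`);
});
```

Checking for the adapter, not just for `navigator.gpu`, matters because some browsers expose the API on hardware they cannot actually drive. The returned string should line up with the device names an inference library expects, but confirm that against the library's own documentation.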
Memory is also constrained. GPU memory is shared with the system on integrated GPUs. Large models (7B+ unquantized) will OOM on most consumer hardware. The sweet spot today is 1B–3B parameter models at 4-bit quantization.
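A back-of-the-envelope estimate shows where those limits come from. The TypeScript sketch below counts weight storage only, under the simplifying assumption that every parameter takes exactly the stated number of bits. Real quantization formats add some overhead for scales, and the KV cache and activations need memory on top. The helper name `weightBytes` is hypothetical.

```typescript
// Approximate bytes needed for model weights alone.
// Ignores KV cache, activations, and quantization metadata (assumption).
function weightBytes(parameterCount: number, bitsPerWeight: number): number {
  return (parameterCount * bitsPerWeight) / 8;
}

const cases: Array<[string, number, number]> = [
  ["7B at 16-bit", 7e9, 16], // 14.0 GB
  ["7B at 4-bit", 7e9, 4], // 3.5 GB
  ["3B at 4-bit", 3e9, 4], // 1.5 GB
  ["1B at 4-bit", 1e9, 4], // 0.5 GB
];

cases.forEach(([label, params, bits]) => {
  const gigabytes = weightBytes(params, bits) / 1e9;
  console.log(`${label}: ${gigabytes.toFixed(1)} GB`);
});
```

On an integrated GPU that shares memory with the operating system and the browser, 14 GB of weights is out of reach, while 0.5 to 1.5 GB leaves room for everything else.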
The trajectory
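WebGPU is getting compute-specific extensions. subgroups (operations that share data between invocations running together in the same subgroup) and shader-f16 (half-precision floats) are optional features: an application has to request them when it creates a device, and availability still varies by browser and hardware. They cut inference latency further and unlock quantized kernels that run even faster.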
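Because these features are optional, code that wants them should check what the adapter offers before asking for a device. Here is a small TypeScript sketch: `selectOptionalFeatures` and `FeatureSet` are hypothetical names, while `adapter.features.has(...)` and the `requiredFeatures` option of `requestDevice` are the standard WebGPU mechanisms, shown here in comments.

```typescript
// Structural type matching the `has` method of `GPUAdapter.features`.
interface FeatureSet {
  has(name: string): boolean;
}

// Keeps only the wanted features that the adapter actually supports.
function selectOptionalFeatures(available: FeatureSet, wanted: string[]): string[] {
  return wanted.filter((name) => available.has(name));
}

// In a browser (sketch):
//   const adapter = await navigator.gpu.requestAdapter();
//   const requiredFeatures = selectOptionalFeatures(adapter.features, ["shader-f16", "subgroups"]);
//   const device = await adapter.requestDevice({ requiredFeatures });

// Stand-in for an adapter that supports f16 but not subgroups.
const fakeAdapterFeatures = new Set(["shader-f16"]);
console.log(selectOptionalFeatures(fakeAdapterFeatures, ["shader-f16", "subgroups"]));
```

Requesting a feature the adapter lacks makes `requestDevice` fail, which is why the filter comes first. A shader that uses half-precision types also needs an `enable f16;` directive at the top of its WGSL source, so kernels typically ship in two variants and the host picks one based on the filtered list.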
The browser is becoming a first-class AI runtime. WebGPU is why that’s possible.