
The runtime behind browser AI

What actually happens when you load and run an AI model in the browser — from ONNX weights to WebGPU shaders.

Runtime, WebGPU, ONNX

When you call pipeline('text-generation', 'model-name') in Transformers.js, you’re three layers away from actual execution. Understanding those layers explains why things are fast when they’re fast, and why they break when they break.

Two backends, one API

In Transformers.js, inference runs on one of two backends: WebAssembly (CPU) or WebGPU (GPU). In Transformers.js v3 the browser default is WASM; you opt into WebGPU by passing device: 'webgpu' when you create the pipeline, in browsers that support it. The rest of the API stays the same. What changes is the speed.

WASM runs matrix ops on CPU cores with SIMD acceleration. WebGPU dispatches compute shaders to the GPU. For transformer workloads the gap can be around an order of magnitude, depending on the model and the hardware. Both backends are handled by the same engine underneath: ONNX Runtime Web.
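To make the backend choice explicit rather than leaving it to defaults, one option is to feature-detect WebGPU and pass the result to the pipeline. This is a minimal sketch, not the library's own selection logic: pickDevice is a hypothetical helper, and the usage comment assumes the Transformers.js v3 device option accepts 'webgpu' and 'wasm'.

```javascript
// Hypothetical helper: returns 'webgpu' only if a GPU adapter is actually
// available, otherwise 'wasm'. `nav` is injectable so the function can be
// exercised outside a browser.
async function pickDevice(nav = globalThis.navigator) {
  if (!nav || !nav.gpu) return 'wasm'; // WebGPU API not exposed at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? 'webgpu' : 'wasm'; // null adapter: no usable GPU
  } catch {
    return 'wasm'; // treat any failure as "no GPU"
  }
}

// Usage in a browser (assumes the `device` option of Transformers.js v3):
// const generator = await pipeline('text-generation', 'model-name', {
//   device: await pickDevice(),
// });
```

Checking for an adapter, not just for navigator.gpu, matters because a browser can expose the API while the machine has no usable GPU.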

The model format: ONNX

Browser AI models are not PyTorch checkpoints. They’re ONNX graphs — a vendor-neutral intermediate representation that ONNX Runtime Web can execute directly, without Python, without a conversion step at runtime.

Hugging Face pre-exports ONNX versions of most popular models. Quantization level is encoded in the filename: model.onnx (fp32), model_quantized.onnx (int8), model_q4.onnx (4-bit). Always use a quantized variant in the browser. The quality drop is small. The size drop is 4–8x.
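The 4–8x figure follows from bits per weight. The sketch below is back-of-the-envelope arithmetic, not a measurement: estimateWeightBytes is a hypothetical helper, and it ignores quantization scales, layers kept at higher precision, and graph metadata, so real files will be somewhat larger.

```javascript
// Bits per weight for the three variants named above.
const BITS_PER_WEIGHT = { fp32: 32, int8: 8, q4: 4 };

// File names as listed in the text.
const FILE_NAME = {
  fp32: 'model.onnx',
  int8: 'model_quantized.onnx',
  q4: 'model_q4.onnx',
};

// Rough weight size in bytes: parameters x bits per weight / 8.
function estimateWeightBytes(paramCount, variant) {
  return (paramCount * BITS_PER_WEIGHT[variant]) / 8;
}

const params = 1e9; // a 1B-parameter model
for (const variant of Object.keys(BITS_PER_WEIGHT)) {
  const gb = estimateWeightBytes(params, variant) / 1e9;
  console.log(`${FILE_NAME[variant]}: ~${gb} GB`);
}
// model.onnx: ~4 GB
// model_quantized.onnx: ~1 GB
// model_q4.onnx: ~0.5 GB
```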

Loading and caching

The first load is slow. Model weights download over the network, then get cached by the browser. Transformers.js uses the Cache API by default — the same one that Service Workers use — so weights persist across page reloads without re-downloading.

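For faster subsequent reads, you can replace the default with a custom cache backed by the Origin Private File System (OPFS): a sandboxed, origin-scoped file system whose fastest, synchronous access mode is available inside Web Workers. Transformers.js does not write to OPFS on its own; it accepts any cache object that implements the Cache API’s match and put methods. For large files, OPFS reads can be faster than pulling from the Cache API on cold starts.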

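import { env } from '@huggingface/transformers';

// Store models in a custom cache instead of the default Cache API.
// `opfsCache` is a placeholder for an object you provide: it must implement
// match(request) and put(request, response), here backed by OPFS.
env.useBrowserCache = false; // so the custom cache is the one consulted
env.useCustomCache = true;
env.customCache = opfsCache;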
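One way to keep model files in OPFS is a small cache object with the same match and put shape as the Cache API. The following is a sketch under assumptions, not an official Transformers.js feature: createOpfsCache and fileNameFor are hypothetical names, the cache is assumed to be plugged in through env.useCustomCache and env.customCache, and response headers are not preserved.

```javascript
// OPFS file names cannot contain "/", so flatten the URL into a single name.
function fileNameFor(request) {
  const url = typeof request === 'string' ? request : request.url;
  return encodeURIComponent(url);
}

// `getRoot` is injectable for testing; in a browser it defaults to the OPFS root.
function createOpfsCache(getRoot = () => navigator.storage.getDirectory()) {
  return {
    // Returns a Response on a hit, undefined on a miss (like Cache.match).
    async match(request) {
      try {
        const root = await getRoot();
        const handle = await root.getFileHandle(fileNameFor(request));
        return new Response(await handle.getFile());
      } catch {
        return undefined; // treat any lookup failure as "not cached"
      }
    },
    // Writes the response body to a file named after the request URL.
    async put(request, response) {
      const root = await getRoot();
      const handle = await root.getFileHandle(fileNameFor(request), { create: true });
      const writable = await handle.createWritable();
      await writable.write(await response.blob());
      await writable.close();
    },
  };
}
```

Wiring it in would then be a matter of setting env.useCustomCache = true and env.customCache = createOpfsCache(). Very long URLs may exceed file-name limits, so a production version would likely hash the key instead of encoding it.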

The progress bar you see in browser AI demos is the download, not the inference. A 4-bit 1B-parameter model is ~500MB. After the first load the download is skipped: the weights are read from the cache, and what remains is session initialization.
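A model is usually several files, so a single progress bar has to combine per-file events. The sketch below shows one way to do that. createProgressTracker is a hypothetical helper, and it assumes each event has the shape { status: 'progress', file, loaded, total } and that the pipeline accepts a progress_callback option.

```javascript
// Hypothetical helper: combines per-file progress events into one percentage.
function createProgressTracker(onPercent) {
  const files = new Map(); // file name -> { loaded, total } in bytes
  return (event) => {
    if (event.status !== 'progress') return; // ignore other event types
    files.set(event.file, { loaded: event.loaded, total: event.total });
    let loaded = 0;
    let total = 0;
    for (const f of files.values()) {
      loaded += f.loaded;
      total += f.total;
    }
    if (total > 0) onPercent(Math.round((loaded / total) * 100));
  };
}

// Usage in a browser (assumes the `progress_callback` option and a
// <progress> element referenced by `bar`):
// const generator = await pipeline('text-generation', 'model-name', {
//   progress_callback: createProgressTracker((pct) => { bar.value = pct; }),
// });
```

Because the total only covers files seen so far, the percentage can step backward when a new file starts downloading. A UI may want to clamp it so it never decreases.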

Threading

Model initialization blocks the thread. If you run it on the main thread, your UI freezes until the model is ready. Run it in a Web Worker.

Transformers.js has a worker-safe API — post messages to the worker, receive progress events and outputs back. The pattern looks like this:

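// worker.js
import { pipeline } from '@huggingface/transformers';

let generator; // created once on 'load', reused for every 'generate'

self.onmessage = async ({ data }) => {
  if (data.type === 'load') {
    // Download (or read from cache) and initialize the model off the main thread
    generator = await pipeline('text-generation', data.model, {
      // Forward download progress events to the page
      progress_callback: (progress) => self.postMessage({ type: 'progress', progress }),
    });
    self.postMessage({ type: 'ready' });
  } else if (data.type === 'generate') {
    if (!generator) {
      // Guard against 'generate' arriving before 'load' has finished
      self.postMessage({ type: 'error', message: 'Model not loaded yet' });
      return;
    }
    const output = await generator(data.prompt);
    self.postMessage({ type: 'output', output });
  }
};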
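On the page side, the message protocol can be wrapped in promises so callers do not handle raw events. This is a sketch of one possible wrapper: createModelClient is a hypothetical name, it relies only on the 'load', 'ready', 'generate', and 'output' message types used by the worker, and it assumes one request in flight at a time.

```javascript
// Hypothetical main-thread wrapper around the worker's message protocol.
function createModelClient(worker) {
  // Resolve with the next message of the given type, then stop listening.
  const waitFor = (type) =>
    new Promise((resolve) => {
      const handler = ({ data }) => {
        if (data.type !== type) return;
        worker.removeEventListener('message', handler);
        resolve(data);
      };
      worker.addEventListener('message', handler);
    });

  return {
    // Ask the worker to load a model; resolves when it reports 'ready'.
    load(model) {
      const ready = waitFor('ready');
      worker.postMessage({ type: 'load', model });
      return ready;
    },
    // Send a prompt; resolves with the worker's output.
    async generate(prompt) {
      const done = waitFor('output');
      worker.postMessage({ type: 'generate', prompt });
      return (await done).output;
    },
  };
}

// Usage in a browser with a bundler that supports module workers:
// const worker = new Worker(new URL('./worker.js', import.meta.url), { type: 'module' });
// const client = createModelClient(worker);
// await client.load('model-name');
// const output = await client.generate('Hello');
```

Overlapping generate calls would need a request id in each message so replies can be matched to callers. That is left out here to keep the sketch short.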

WebLLM takes a different path

WebLLM (by MLC AI) does not use ONNX Runtime. It builds on Apache TVM, a compiler that turns the model’s operators into optimized WebGPU compute shaders at model-preparation time, not at runtime. The result is often higher throughput than ONNX Runtime’s WebGPU backend, with a narrower scope: it’s built specifically for autoregressive text generation.

If you’re building a chat interface, WebLLM is worth benchmarking. For embeddings, classification, or vision tasks, Transformers.js is the right tool — WebLLM won’t help you there.

What you actually control

The backend selection, the model format, and the runtime are mostly handled for you. What you actually control:

  • Which model and quantization level: this is the biggest quality/speed lever. A q4 model at ~500MB loads and runs faster than the same weights in fp32 at ~4GB, usually with only a small quality drop.
  • Where weights are cached: the default Cache API for simplicity, or an OPFS-backed cache when cold-start reads are the bottleneck.
  • Thread isolation — Web Worker for every model operation, no exceptions.

The browser AI stack is less fragile than it was two years ago. ONNX Runtime Web, WebGPU, and Transformers.js are all stable enough to ship. The constraint now is model size and user patience on first load — not whether the runtime holds up.