May 25, 2026 · 5 min read

Local LLM in a Todo App — WebLLM vs Prompt API

The same todo app, two different ways to run an LLM entirely in the browser: WebLLM with model weights you download, and Chrome’s built-in Prompt API, where the browser supplies the model and your app ships no weights.

Tags: WebLLM, Prompt API, Local AI

This is a walkthrough of building a todo list app with an embedded AI chatbot — one that runs entirely in the browser using WebLLM. No server. No API key. The model downloads once, caches in the browser, and works offline from there. The source is on GitHub — the web-llm branch has the implementation covered here.

What WebLLM is

WebLLM is an in-browser inference runtime for LLMs from the Machine Learning Compilation (MLC) project. It uses WebAssembly for CPU-side work and WebGPU for GPU acceleration. Inference runs on the user’s device instead of in the cloud, which means the first load is slow (the model has to download), but later sessions skip the download, load the weights from the browser cache, and work offline.
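Because inference depends on WebGPU, it can help to check for it before starting a large download. The following is a minimal sketch, not part of the original app: `hasWebGPU` is a hypothetical helper name, and it only tests for the standard `navigator.gpu` entry point. A fuller check would also await `navigator.gpu.requestAdapter()`, which resolves to `null` when no suitable adapter is available.

```javascript
// Hypothetical helper: true when the environment exposes the WebGPU entry point.
// `nav` defaults to the global navigator so the function can be exercised with a stub.
function hasWebGPU(nav = globalThis.navigator) {
  return Boolean(nav && nav.gpu);
}

// Usage sketch (showMessage is an assumed UI helper, not from the repo):
// if (!hasWebGPU()) showMessage('The AI assistant needs a WebGPU-capable browser.');
```

Running this check first lets the app fall back to a plain todo list instead of failing partway through engine setup.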

Installing

npm install @mlc-ai/web-llm

Understanding model names

WebLLM uses models with structured names. Understanding them matters because your choice directly controls file size, quality, and whether the model fits in your users’ VRAM.

Take Llama-3.2-3B-Instruct-q4f32_1-MLC:

3B — three billion parameters. More parameters generally means better reasoning, larger file size.
Instruct — fine-tuned to follow instructions, not just complete text. Always use an instruct variant for chatbots.
q4 — 4-bit quantization. Weights are stored at reduced precision to shrink the model.
f32 — 32-bit floating point for activations. Combining 4-bit weights with f32 activations balances download size against accuracy.
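The naming scheme above can be sketched as a small parser. This is an illustration only: `parseModelId` is a hypothetical helper, and the regular expression assumes the `family-<N>B-variant-q<bits>f<bits>_<n>-MLC` shape of the example name, so it returns `null` for names that do not follow that pattern. The trailing `_1` is treated here as an identifier for the quantization scheme variant.

```javascript
// Hypothetical parser for model ids shaped like "Llama-3.2-3B-Instruct-q4f32_1-MLC".
const MODEL_ID_PATTERN =
  /^(?<family>.+?)-(?<size>\d+(?:\.\d+)?)B-(?<variant>.+)-q(?<wbits>\d+)f(?<abits>\d+)(?:_(?<scheme>\d+))?-MLC$/;

function parseModelId(modelId) {
  const match = MODEL_ID_PATTERN.exec(modelId);
  if (!match) return null; // name does not follow the assumed pattern
  const { family, size, variant, wbits, abits, scheme } = match.groups;
  return {
    family,                           // e.g. "Llama-3.2"
    paramsBillions: Number(size),     // e.g. 3
    variant,                          // e.g. "Instruct"
    weightBits: Number(wbits),        // e.g. 4
    activationBits: Number(abits),    // e.g. 32
    scheme: scheme === undefined ? null : Number(scheme),
  };
}
```

A parsed result makes it easy to show users what they are about to download, or to filter a model list down to instruct variants.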

The 3B model lands around 1.4 GB. A 7B model gives noticeably better results for translation and open-ended questions, but weighs in at 3.3 GB or more. Start with 3B for the development loop, then benchmark whether the larger model is worth the download cost for your users.
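Those sizes follow roughly from the parameter count and the quantization width. The sketch below is back-of-the-envelope arithmetic, not a measurement: it assumes every weight is stored at exactly the quantized width and ignores quantization scales, tokenizer files, and any layers kept at higher precision, so real downloads will be somewhat larger. The function names are hypothetical.

```javascript
// Rough lower bound: parameters × bits per weight ÷ 8 bits per byte.
function estimateWeightBytes(paramsBillions, weightBits) {
  return (paramsBillions * 1e9 * weightBits) / 8;
}

// Converts bytes to gibibytes (2^30 bytes).
function toGiB(bytes) {
  return bytes / 2 ** 30;
}

console.log(toGiB(estimateWeightBytes(3, 4)).toFixed(1)); // "1.4"
console.log(toGiB(estimateWeightBytes(7, 4)).toFixed(1)); // "3.3"
```

The same arithmetic shows why quantization matters: at 16 bits per weight the 3B model would be about four times larger.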

Initializing the engine

import { CreateMLCEngine } from '@mlc-ai/web-llm';

const engine = await CreateMLCEngine('Llama-3.2-3B-Instruct-q4f32_1-MLC', {
  initProgressCallback: ({ progress }) => {
    console.log(progress); // 0.0 → 1.0 as weights download
  },
});

The initProgressCallback fires as the model downloads. Use it to drive a progress bar — on a slow connection the first load takes a while and users need feedback. After the download, WebLLM caches the weights in the Cache API (the same storage Service Workers use). Subsequent loads skip the download entirely and read from cache.
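To drive a progress bar from that callback, the 0.0 to 1.0 value needs to become something displayable. A minimal sketch, assuming only the `progress` field shown above; `toPercent` and the `progressBar` element in the usage comment are hypothetical names.

```javascript
// Converts a progress report into a clamped integer percentage (0 to 100).
// Missing or non-numeric progress values are treated as 0.
function toPercent(report) {
  const value = Number(report?.progress);
  if (!Number.isFinite(value)) return 0;
  return Math.round(Math.min(1, Math.max(0, value)) * 100);
}

// Usage sketch, assuming <progress id="model-progress" max="100"> in the page:
// initProgressCallback: (report) => { progressBar.value = toPercent(report); }
```

Clamping keeps the bar stable if a report ever arrives out of range or without a progress value.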

Grounding the model on the task list

The chatbot is useful here because it knows the user’s tasks. The system prompt includes the current todo list serialized as JSON. When the user asks “how many open tasks do I have?”, the model has the data it needs to answer accurately.

const messages = [
  {
    role: 'system',
    content: `You are a helpful assistant. You will answer questions related to
the user's to-do list. Decline all other requests not related to the user's
todos. This is the to-do list in JSON: ${JSON.stringify(todos)}`,
  },
  { role: 'user', content: userMessage },
];

Rebuild messages on every turn so the system prompt always reflects the current state of the list — if the user adds a task mid-conversation, the next message will include it.
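One way to make the rebuild step hard to forget is to wrap it in a function that takes the current todos on every call. This is a sketch under assumptions: `buildMessages`, the optional `history` parameter, and the `{ title, done }` todo shape in the usage comment are illustrative and may not match the repo.

```javascript
const SYSTEM_INSTRUCTIONS =
  "You are a helpful assistant. You will answer questions related to the user's " +
  "to-do list. Decline all other requests not related to the user's todos.";

// Builds a fresh message array from the current todos.
// `history` is an optional array of earlier { role, content } turns (hypothetical).
function buildMessages(todos, userMessage, history = []) {
  return [
    {
      role: 'system',
      content: `${SYSTEM_INSTRUCTIONS} This is the to-do list in JSON: ${JSON.stringify(todos)}`,
    },
    ...history,
    { role: 'user', content: userMessage },
  ];
}

// Usage sketch:
// const messages = buildMessages([{ title: 'Buy milk', done: false }], 'What is open?');
```

Because the system message is regenerated each time, a task added between two questions shows up in the second prompt without any extra bookkeeping.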

Streaming the response

LLMs generate tokens one at a time. If you wait for the full response before rendering, the user sees nothing until generation finishes, and the app feels stalled even though the model is working. Use streaming to show output as it arrives:

const chunks = await engine.chat.completions.create({
  messages,
  stream: true,
});

let reply = '';
for await (const chunk of chunks) {
  reply += chunk.choices[0]?.delta.content ?? '';
  updateUI(reply); // re-render on each token
}

With stream: true, the call resolves to an async iterable of chunks. Each iteration gives you a delta: the new token(s) since the last chunk. Accumulate them into a string and update the DOM on each chunk. The result feels responsive even for longer answers.
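The accumulation step can be pulled into a small pure function, which also makes it easy to tolerate chunks that carry no text (for example, a final chunk that only reports a finish reason). A minimal sketch; `appendDelta` is a hypothetical helper and the chunk objects in the usage example are hand-written stand-ins for real stream output.

```javascript
// Returns `reply` extended by the text in `chunk`, or unchanged if the chunk has none.
function appendDelta(reply, chunk) {
  const piece = chunk?.choices?.[0]?.delta?.content ?? '';
  return reply + piece;
}

// Usage with stand-in chunks shaped like streaming deltas:
const sampleChunks = [
  { choices: [{ delta: { content: 'You have ' } }] },
  { choices: [{ delta: { content: '2 open tasks.' } }] },
  { choices: [{ delta: {}, finish_reason: 'stop' }] },
];
console.log(sampleChunks.reduce(appendDelta, '')); // "You have 2 open tasks."
```

Inside the streaming loop this becomes `reply = appendDelta(reply, chunk)` followed by the UI update.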

Offline after first load

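WebLLM stores downloaded model weights in the Cache API, which persists across sessions unless the user clears site data or the browser evicts the origin’s storage under disk pressure. Once cached, the app loads the model locally with no network request. The user can close the tab, go offline, and reopen the app later with the chatbot fully functional, provided the app’s own files are also available offline.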

Storage is per-origin. If you serve the app from https://yourapp.com, the cached model is only available to that origin. A different site at a different origin can’t share it.
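Since the weights count against the origin’s storage quota, it can be worth checking for free space before the first download. The sketch below is an assumption-laden illustration: `hasRoomFor` and `canCacheModel` are hypothetical helpers, the 10% headroom is an arbitrary safety margin, and the quota reported by `navigator.storage.estimate()` is only an approximation.

```javascript
// Pure check: is there room for `neededBytes` given a { usage, quota } estimate?
// `headroom` adds a safety margin because the reported quota is approximate.
function hasRoomFor(estimate, neededBytes, headroom = 0.1) {
  const free = (estimate.quota ?? 0) - (estimate.usage ?? 0);
  return free >= neededBytes * (1 + headroom);
}

// Browser usage sketch: falls back to "try anyway" when the Storage API is missing.
async function canCacheModel(neededBytes) {
  const storage = globalThis.navigator?.storage;
  if (!storage?.estimate) return true; // cannot tell, so let the download proceed
  return hasRoomFor(await storage.estimate(), neededBytes);
}
```

If the check fails, the app can warn the user or offer a smaller model instead of starting a download that cannot be cached.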

Treat LLM output as untrusted input

The model can hallucinate, and its output could contain malicious strings if the user crafts prompts to manipulate it. Never inject LLM responses directly as HTML, and never execute anything the model returns as JavaScript. Treat generated text the same way you treat user-supplied text: sanitize before rendering.
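In practice the simplest safe option is to assign the reply to an element’s `textContent`, which never parses HTML. When the text has to be embedded in an HTML string, escaping it first is the minimum. A minimal sketch with a hypothetical `escapeHtml` helper; it escapes for element content and quoted attribute values only, and does not replace a real sanitizer if you render model-generated Markdown or HTML.

```javascript
const HTML_ESCAPES = { '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;' };

// Escapes the five characters that can change meaning inside HTML text or attributes.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Usage sketch (bubble is an assumed chat element):
// bubble.textContent = reply;                    // preferred: no HTML parsing at all
// bubble.innerHTML = `<p>${escapeHtml(reply)}</p>`; // only if you must build HTML
```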

Try it

git clone https://github.com/beladevo/todo-ai
cd todo-ai
git checkout web-llm
npm i && npm start

The first load downloads the model. After that, disconnect from the network and the chatbot keeps working. The repo also has a prompt-api branch if you want to compare against Chrome’s built-in model, which needs no model download from your app, at the cost of being Chrome-only and non-configurable.