
Building browser RAG with Transformers.js

A practical walkthrough of retrieval-augmented generation running entirely client-side — embeddings, chunking, and retrieval without a server.


Retrieval-augmented generation (RAG) is the architecture that makes language models useful for specific knowledge domains: instead of relying on what the model memorized during training, you retrieve relevant context at query time and include it in the prompt. Standard RAG runs on a server — embed documents, store them in a vector database, call an LLM API at query time. Browser RAG throws out every server in that pipeline.

Why client-side RAG makes sense

Zero infrastructure. No network round trip for retrieval. Complete privacy: your documents never leave the browser. That makes it useful for corporate tools where data cannot be sent to a third party, or consumer apps where users upload personal notes, PDFs, or local files.

The constraint: both the model and the vector index must fit in browser memory. That limits useful corpus size to thousands of documents, not millions. For personal knowledge bases, documentation search, and file-aware assistants, that’s enough.
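
A back-of-the-envelope estimate gives a feel for that limit. The sketch below assumes 384-dimensional float32 vectors (the output size of all-MiniLM-L6-v2) and counts only raw embedding storage; chunk text, object overhead, and model weights come on top. estimateIndexBytes is a hypothetical helper name, not part of any library.

```javascript
// Raw embedding storage only: chunks x dimensions x bytes per float32.
// Assumes 384-d float32 vectors; ignores chunk text and model weights.
function estimateIndexBytes(numChunks, dim = 384, bytesPerFloat = 4) {
  return numChunks * dim * bytesPerFloat;
}

const bytes = estimateIndexBytes(10_000);
console.log((bytes / (1024 * 1024)).toFixed(1) + ' MiB'); // 14.6 MiB
```

At this size the embeddings themselves are small; the generator model is usually the larger share of the memory budget.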

Architecture overview

The stack has four parts:

  1. Embeddings model — converts text chunks into dense vectors. A small model like all-MiniLM-L6-v2 (~22 MB) is fast and fits comfortably in memory.
  2. Vector store — holds embedded chunks. In-memory for small corpora, IndexedDB for persistence between sessions.
  3. Retriever — cosine similarity search over the store to find the most relevant chunks for a given query.
  4. Generator — a local LLM (Phi-3-mini, Llama 3.2 1B, Gemma 2B) that reads the retrieved context and produces an answer.

Step 1: Chunk and embed documents

Chunking strategy matters more than most developers realize. Splitting on character count alone produces chunks that break mid-sentence, hurting retrieval precision. Better: split on sentence boundaries with a small sliding window of overlap.

import { pipeline } from '@xenova/transformers';

const embedder = await pipeline(
  'feature-extraction',
  'Xenova/all-MiniLM-L6-v2'
);

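// Embed an array of text chunks in one batch.
// Returns a flat array: one 384-dimensional vector per chunk, concatenated.
async function embedChunks(chunks) {
  const output = await embedder(chunks, {
    pooling: 'mean',   // average token embeddings into one vector per chunk
    normalize: true,   // unit length, so cosine similarity reduces to a dot product
  });
  return Array.from(output.data);
}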

The normalize: true flag is important: it scales vectors to unit length, making cosine similarity equivalent to a dot product. That’s faster to compute and easier to reason about.
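
The equivalence is easy to check by hand on toy vectors. The helpers below (dot, norm, normalize, fullCosine) are written just for this illustration and are not part of Transformers.js.

```javascript
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

function norm(a) {
  return Math.sqrt(dot(a, a));
}

// Scale a vector to unit length.
function normalize(a) {
  const n = norm(a);
  return a.map(x => x / n);
}

// Textbook cosine similarity: dot product divided by both magnitudes.
function fullCosine(a, b) {
  return dot(a, b) / (norm(a) * norm(b));
}

const a = [3, 4];
const b = [4, 3];

// Both values are 24/25 = 0.96, up to floating-point rounding.
console.log(fullCosine(a, b));
console.log(dot(normalize(a), normalize(b)));
```

Normalizing once at embedding time moves the two square roots and the division out of the per-query loop.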

Step 2: Build the index

For fewer than ~10,000 chunks, brute-force cosine similarity is fast enough (under 50 ms; at 384 dimensions that is roughly 3.8 million multiply-adds per query). Store embeddings alongside the original text:

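const DIM = 384; // all-MiniLM-L6-v2 output dimension

// embedChunks returns one flat array; slice it back into per-chunk vectors.
const embeddings = await embedChunks(chunks);

const index = chunks.map((text, i) => ({
  text,
  embedding: new Float32Array(
    embeddings.slice(i * DIM, (i + 1) * DIM)
  ),
}));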

Persist the index to IndexedDB between sessions — re-embedding on every reload is the fastest way to make your app feel broken.
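
A minimal persistence sketch using the standard IndexedDB API might look like the following. It stores the whole index array under a single key, which works because Float32Array values survive structured cloning; the database name, store name, key, and function names are all hypothetical choices for this example, and a larger index would likely be better split into one record per chunk.

```javascript
const DB_NAME = 'rag-index'; // hypothetical database name
const STORE = 'chunks';      // hypothetical object store name

// Open (and on first use, create) the database.
function openDb() {
  return new Promise((resolve, reject) => {
    const request = indexedDB.open(DB_NAME, 1);
    request.onupgradeneeded = () => request.result.createObjectStore(STORE);
    request.onsuccess = () => resolve(request.result);
    request.onerror = () => reject(request.error);
  });
}

// Save the full index (array of { text, embedding }) under one key.
async function saveIndex(index) {
  const db = await openDb();
  return new Promise((resolve, reject) => {
    const tx = db.transaction(STORE, 'readwrite');
    tx.objectStore(STORE).put(index, 'index');
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}

// Load the saved index, or null if nothing has been stored yet.
async function loadIndex() {
  const db = await openDb();
  return new Promise((resolve, reject) => {
    const request = db
      .transaction(STORE, 'readonly')
      .objectStore(STORE)
      .get('index');
    request.onsuccess = () => resolve(request.result ?? null);
    request.onerror = () => reject(request.error);
  });
}
```

On startup, call loadIndex() first and only chunk and embed the documents when it returns null.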

Step 3: Retrieve

At query time, embed the user’s question and rank all stored chunks by similarity:

function cosineSimilarity(a, b) {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot; // valid because embeddings are normalized
}

function retrieve(queryEmbedding, index, topK = 5) {
  return index
    .map(item => ({
      ...item,
      score: cosineSimilarity(queryEmbedding, item.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
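
Putting query embedding and ranking together, one possible glue function is sketched below. It assumes the embedder behaves like the feature-extraction pipeline used earlier, returning an object whose data property holds the pooled vector for a single input string. The search name is hypothetical, and the embedder is passed in as an argument so the function can be tried with a stub.

```javascript
// Embed the query, score every index entry by dot product, return the topK best.
// Assumes index embeddings and the query embedding are both unit length.
async function search(query, index, embedder, topK = 5) {
  const output = await embedder(query, { pooling: 'mean', normalize: true });
  const queryEmbedding = output.data; // one vector, same dimension as the index

  return index
    .map(({ text, embedding }) => {
      let score = 0;
      for (let i = 0; i < embedding.length; i++) {
        score += queryEmbedding[i] * embedding[i];
      }
      return { text, score };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

A call would then look like await search(userQuestion, index, embedder), with the result feeding the generation step.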

Step 4: Generate

Pass the retrieved chunks as context to a local LLM. Keep the prompt short — local models degrade noticeably with long contexts:

const context = retrieved.map(r => r.text).join('\n\n');
const prompt = `Context:
${context}

Question: ${query}

Answer:`;
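
To keep the prompt short in practice, one option is to cap the context at a fixed size and drop the lowest-ranked chunks that do not fit. The sketch below uses a character budget as a crude stand-in for a token count; buildPrompt and maxContextChars are hypothetical names, and the default of 2000 is illustrative, not a recommendation for any particular model.

```javascript
// Build the prompt from the best-ranked chunks that fit within maxContextChars.
// Expects `retrieved` sorted by descending score, as returned by the retriever.
function buildPrompt(retrieved, query, maxContextChars = 2000) {
  const parts = [];
  let used = 0;
  for (const { text } of retrieved) {
    if (used + text.length > maxContextChars) break; // stop at the first chunk that overflows
    parts.push(text);
    used += text.length;
  }
  const context = parts.join('\n\n');
  return `Context:\n${context}\n\nQuestion: ${query}\n\nAnswer:`;
}
```

Because the loop stops at the first chunk that overflows, the highest-scoring chunks are always the ones kept.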

A 1B parameter model handles this well when the context is clean and the question is specific. Avoid open-ended reasoning tasks: use RAG for lookup and synthesis, not multi-step problem solving.

Limitations

Browser RAG is not for every problem. Corpora over 50,000 chunks will feel slow. Long contexts cause local models to lose coherence. And the quality ceiling is the local model itself: a 1B model over retrieved context is not GPT-4. Use it where privacy, offline capability, or zero infrastructure matters more than raw capability.