March 5, 2025 · 10 min read

On-device embeddings: a practical guide

Generating, storing, and querying text embeddings without leaving the browser tab. Includes a deep dive on IndexedDB as a vector store.

Embeddings · IndexedDB

An embedding is a dense numerical vector that represents meaning. “Browser” and “Chrome” land close together in embedding space. “JavaScript” and “Python” cluster near each other and far from “giraffe.” This spatial structure enables semantic search, duplicate detection, clustering, and recommendation — all without any server, all in the browser tab.

Why on-device

The standard path — send text to an API, get embeddings back — works but comes with costs: network latency per request, per-token pricing, and a copy of your data leaving the browser. For personal tools, sensitive documents, or high-volume operations, those costs add up.

An on-device embedding model loads once (roughly 20–400 MB depending on the model), then embeds a short chunk in a few milliseconds on a laptop and in the tens of milliseconds on a phone, with no network round-trips, no API keys, and no data leaving the tab.

Choosing a model

The right model depends on the trade-off between size and quality:

Model	Size	Dimensions	Best for
`all-MiniLM-L6-v2`	~22 MB	384	Fast, good quality — default choice
`bge-small-en-v1.5`	~33 MB	384	Strong quality/size ratio
`nomic-embed-text`	~270 MB	768	Best quality at reasonable size
`all-mpnet-base-v2`	~420 MB	768	Highest quality, slower load

For most browser applications, all-MiniLM-L6-v2 is the right starting point. It loads in ~2 seconds on a typical connection, runs at ~5ms per sentence, and scores competitively on semantic similarity benchmarks.
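
The Dimensions column also drives storage and memory cost, since each dimension is one 32-bit float. A back-of-the-envelope sketch (the helper name `indexSizeBytes` is made up for illustration, and it counts only the raw vector data, not the stored text or any IndexedDB overhead):

```javascript
// Hypothetical helper: raw size of `count` float32 vectors with `dims` dimensions.
// Ignores the stored text, keys, and any IndexedDB bookkeeping overhead.
function indexSizeBytes(count, dims) {
  const BYTES_PER_FLOAT32 = 4;
  return count * dims * BYTES_PER_FLOAT32;
}

// 10,000 chunks at 384 dimensions (all-MiniLM-L6-v2, bge-small-en-v1.5)
const small = indexSizeBytes(10_000, 384); // 15,360,000 bytes
// The same corpus at 768 dimensions doubles the footprint
const large = indexSizeBytes(10_000, 768); // 30,720,000 bytes

console.log((small / 1e6).toFixed(1), 'MB vs', (large / 1e6).toFixed(1), 'MB');
```

So a 384-dimension model keeps a 10,000-chunk index at roughly 15 MB of vector data, which is one more reason to start small.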

Generating embeddings with Transformers.js

// Transformers.js v2 package name; v3 is published as '@huggingface/transformers'
import { pipeline } from '@xenova/transformers';

// Load once, reuse for all subsequent calls
const extractor = await pipeline(
  'feature-extraction',
  'Xenova/all-MiniLM-L6-v2'
);

async function embed(text) {
  const output = await extractor(text, {
    pooling: 'mean',    // average token embeddings into one vector
    normalize: true,    // scale to unit length
  });
  return Array.from(output.data); // Float32Array → plain array
}
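
For bulk indexing it can help to embed several texts in one call. This sketch assumes the extractor accepts an array of strings and returns one tensor whose `dims` are `[batchSize, dimensions]` over a flat `data` buffer; verify that shape against your Transformers.js version before relying on it. Under that assumption, a small helper (`splitRows`, a name invented here) slices the flat buffer into one vector per input:

```javascript
// Hypothetical helper: split a flat row-major buffer into one Float32Array per row.
// Assumes data.length === rows * cols, as with a [rows, cols] tensor.
function splitRows(data, rows, cols) {
  if (data.length !== rows * cols) {
    throw new RangeError(`expected ${rows * cols} values, got ${data.length}`);
  }
  const vectors = [];
  for (let r = 0; r < rows; r++) {
    // Copy each row so the vector owns its memory and can be stored on its own
    vectors.push(Float32Array.from(data.slice(r * cols, (r + 1) * cols)));
  }
  return vectors;
}

// Stand-in for `output.data` and `output.dims` from a two-text batch with 3 dimensions
const fakeOutput = { data: new Float32Array([1, 2, 3, 4, 5, 6]), dims: [2, 3] };
const [first, second] = splitRows(fakeOutput.data, ...fakeOutput.dims);
console.log(Array.from(first), Array.from(second));
```

In the real pipeline, the input would come from something like `await extractor(texts, { pooling: 'mean', normalize: true })`, followed by `splitRows(output.data, ...output.dims)`.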

Run this in a Web Worker. Downloading the model is asynchronous, but initializing it and running inference are CPU-bound and will freeze your UI if they happen on the main thread.
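
One way to structure the worker boundary is a tiny request/response protocol keyed by an id. The sketch below is an assumption about how you might wire it, not something Transformers.js provides: `createEmbedClient` and `createWorkerHandler` are invented names, and both take their dependencies as arguments so the logic can be exercised without a real Worker.

```javascript
// Main-thread side: wraps a Worker-like object (anything with postMessage and an
// assignable onmessage) in a promise-returning embed() call.
function createEmbedClient(worker) {
  const pending = new Map(); // request id -> { resolve, reject }
  let nextId = 0;

  worker.onmessage = (event) => {
    const { id, embedding, error } = event.data;
    const request = pending.get(id);
    if (!request) return; // unknown or already-settled id
    pending.delete(id);
    if (error) request.reject(new Error(error));
    else request.resolve(embedding);
  };

  return {
    pending,
    embed(text) {
      return new Promise((resolve, reject) => {
        const id = nextId++;
        pending.set(id, { resolve, reject });
        worker.postMessage({ id, text });
      });
    },
  };
}

// Worker side: builds the message handler. `embed` is an async function such as
// the one defined earlier; `post` sends a message back to the main thread.
function createWorkerHandler(embed, post) {
  return async (event) => {
    const { id, text } = event.data;
    try {
      post({ id, embedding: await embed(text) });
    } catch (err) {
      post({ id, error: String(err) });
    }
  };
}

// Wiring in a real app (file names are hypothetical):
//   worker.js: self.onmessage = createWorkerHandler(embed, (m) => self.postMessage(m));
//   main.js:   const client = createEmbedClient(new Worker('worker.js', { type: 'module' }));
//              const vector = await client.embed('hello world');
```

Matching responses to requests by id means several embeds can be in flight at once, even if the worker answers them out of order.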

Persisting to IndexedDB

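Embeddings are expensive to regenerate, so persist them between sessions. (In the browser, Transformers.js caches the downloaded model files itself by default.) The snippets below use the `idb` wrapper library, since `tx.store` is not part of the raw IndexedDB API: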

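// `db` is assumed to come from the idb library's openDB(), with an
// 'embeddings' object store created with keyPath 'id'.
async function saveEmbedding(db, id, text, embedding) {
  const tx = db.transaction('embeddings', 'readwrite');
  await tx.store.put({
    id,
    text,
    embedding: new Float32Array(embedding), // typed arrays survive structured clone
  });
  await tx.done; // resolves once the write has actually committed
}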

async function loadAll(db) {
  const tx = db.transaction('embeddings', 'readonly');
  return tx.store.getAll();
}

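Use a hash of the source content as the key so you can detect when re-embedding is necessary. IndexedDB quotas vary by browser and free disk space but are often in the gigabyte range, so typical corpora stay well below them; navigator.storage.estimate() reports the actual usage and quota. Browsers may evict this data under storage pressure unless you request persistence with navigator.storage.persist().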

Cosine similarity search

When embeddings are normalized to unit length, cosine similarity reduces to a dot product. Simple and fast:

function similarity(a, b) {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot; // range [-1, 1], higher = more similar
}

function search(queryEmbedding, store, topK = 10) {
  return store
    .map(item => ({
      ...item,
      score: similarity(queryEmbedding, item.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

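For stores under 10,000 vectors, this brute-force approach typically completes in under 50ms on laptop-class hardware. Latency grows linearly with the store, so as it climbs into the tens of thousands, approximate nearest-neighbor structures like HNSW or IVF become worth the effort. They are harder to implement client-side but achievable with the right library.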
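
Before reaching for an approximate index, the exact scan itself has some headroom. The variant below is a sketch rather than a measured optimization: instead of copying every item and sorting the whole list, it keeps only the best `topK` candidates in a small sorted array. `searchTopK` is a name invented here, and it assumes each item has `id` and `embedding` fields as in the store above.

```javascript
// Dot product of two equal-length vectors (cosine similarity for unit vectors).
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Exact top-K search that avoids sorting the full store.
// `best` stays sorted by descending score and never grows past topK entries.
function searchTopK(queryEmbedding, items, topK = 10) {
  if (topK <= 0) return [];
  const best = [];
  for (const item of items) {
    const score = dot(queryEmbedding, item.embedding);
    // Skip early when the list is full and this score cannot make the cut
    if (best.length === topK && score <= best[best.length - 1].score) continue;
    // Walk back from the end to find the insertion position
    let pos = best.length;
    while (pos > 0 && best[pos - 1].score < score) pos--;
    best.splice(pos, 0, { id: item.id, score });
    if (best.length > topK) best.pop();
  }
  return best;
}

// Toy 2-dimensional unit vectors standing in for real embeddings
const toyStore = [
  { id: 'a', embedding: [1, 0] },
  { id: 'b', embedding: [0, 1] },
  { id: 'c', embedding: [0.6, 0.8] },
];
console.log(searchTopK([1, 0], toyStore, 2).map((hit) => hit.id));
```

Results carry only `id` and `score`, so the full records can be looked up afterwards for just the handful of hits that are shown.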

Performance in practice

Benchmarks on real hardware:

MacBook Air M2: model load ~1.8s, single embedding ~4ms, search across 5,000 vectors ~8ms.
Snapdragon 8 Gen 1 (Android): model load ~4.2s, single embedding ~18ms, search across 5,000 vectors ~35ms.

Acceptable for interactive apps. The critical optimization: run embedding in a Web Worker so the main thread stays responsive during bulk indexing.
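
Numbers like these depend heavily on the device, so it is worth measuring on your own. Here is a rough harness for the search step only, using random unit vectors as stand-ins for real embeddings. The function names are invented for this sketch, and a single run is noisy, so it gives an indication rather than a benchmark.

```javascript
// Random unit-length vector, standing in for a real normalized embedding.
function randomUnitVector(dims) {
  const v = new Float32Array(dims);
  let norm = 0;
  for (let i = 0; i < dims; i++) {
    v[i] = Math.random() * 2 - 1;
    norm += v[i] * v[i];
  }
  norm = Math.sqrt(norm);
  for (let i = 0; i < dims; i++) v[i] /= norm;
  return v;
}

// Times one brute-force scan: a dot product against every vector, then a sort.
// Vector generation happens before the timer starts.
function timeBruteForce(count, dims = 384) {
  const store = Array.from({ length: count }, () => randomUnitVector(dims));
  const query = randomUnitVector(dims);
  const start = performance.now();
  const scores = store.map((vec) => {
    let dot = 0;
    for (let i = 0; i < dims; i++) dot += vec[i] * query[i];
    return dot;
  });
  scores.sort((a, b) => b - a);
  return { ms: performance.now() - start, best: scores[0] };
}

const { ms } = timeBruteForce(5000);
console.log(`5,000 vectors scanned in ${ms.toFixed(1)} ms`);
```

Run it a few times and discard the first result, since JIT warm-up tends to inflate the initial measurement.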

When not to use on-device embeddings

On-device works well for corpora up to tens of thousands of documents. Beyond that, the index grows too large and linear search becomes too slow. In addition, some of the strongest embedding models (text-embedding-3-large, Cohere embed-v3) are available only through hosted APIs and cannot run in a browser. If retrieval quality is critical and the corpus is large, use a server-side solution.

For personal tools, small corpora, and privacy-sensitive applications: on-device embeddings are production-ready today.