On-device embeddings: a practical guide
Generating, storing, and querying text embeddings without leaving the browser tab. Includes a deep dive on IndexedDB as a vector store.
An embedding is a dense numerical vector that represents meaning. “Browser” and “Chrome” land close together in embedding space. “JavaScript” and “Python” cluster near each other and far from “giraffe.” This spatial structure enables semantic search, duplicate detection, clustering, and recommendation — all without any server, all in the browser tab.
Why on-device
The standard path — send text to an API, get embeddings back — works but comes with costs: network latency per request, per-token pricing, and a copy of your data leaving the browser. For personal tools, sensitive documents, or high-volume operations, those costs add up.
An on-device embedding model loads once (roughly 20 MB to a few hundred MB, depending on the model), then processes text in a few milliseconds to a few tens of milliseconds per chunk depending on the device, with no network round-trips, no API keys, and no data leaving the tab.
Choosing a model
The right model depends on the trade-off between size and quality:
| Model | Size | Dimensions | Best for |
|---|---|---|---|
| all-MiniLM-L6-v2 | ~22 MB | 384 | Fast, good quality; default choice |
| bge-small-en-v1.5 | ~33 MB | 384 | Strong quality/size ratio |
| nomic-embed-text | ~270 MB | 768 | Best quality at reasonable size |
| all-mpnet-base-v2 | ~420 MB | 768 | Highest quality, slower load |
For most browser applications, all-MiniLM-L6-v2 is the right starting point. It loads in ~2 seconds on a typical connection, runs at ~5ms per sentence, and scores competitively on semantic similarity benchmarks.
Generating embeddings with Transformers.js
import { pipeline } from '@xenova/transformers';

// Load once, reuse for all subsequent calls
const extractor = await pipeline(
  'feature-extraction',
  'Xenova/all-MiniLM-L6-v2'
);

async function embed(text) {
  const output = await extractor(text, {
    pooling: 'mean',   // average token embeddings into one vector
    normalize: true,   // scale to unit length
  });
  return Array.from(output.data); // Float32Array → plain array
}
Run this in a Web Worker. Loading the model and running inference are both CPU-heavy and will freeze your UI if they run on the main thread.
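One way to wire up the worker is a small request/response wrapper on the main thread. The sketch below is an assumption about how such a protocol could look, not part of Transformers.js: `createEmbedClient`, the `{ id, text }` message shape, and the `embed-worker.js` file name are all hypothetical.

```javascript
// Main-thread side of a hypothetical request/response protocol with an
// embedding worker. `worker` is anything with postMessage/addEventListener,
// e.g. new Worker('embed-worker.js', { type: 'module' }) in a browser.
function createEmbedClient(worker) {
  let nextId = 0;
  const pending = new Map(); // request id -> { resolve, reject }

  worker.addEventListener('message', (event) => {
    const { id, embedding, error } = event.data;
    const request = pending.get(id);
    if (!request) return; // unknown or already-settled request
    pending.delete(id);
    if (error) request.reject(new Error(error));
    else request.resolve(embedding);
  });

  return {
    // Returns a promise that settles when the worker replies with the same id.
    embed(text) {
      return new Promise((resolve, reject) => {
        const id = nextId++;
        pending.set(id, { resolve, reject });
        worker.postMessage({ id, text });
      });
    },
  };
}

// Worker side (hypothetical embed-worker.js), shown as a comment because it
// needs the model; it would reuse the embed() function defined earlier:
//
//   self.onmessage = async ({ data: { id, text } }) => {
//     try {
//       self.postMessage({ id, embedding: await embed(text) });
//     } catch (err) {
//       self.postMessage({ id, error: String(err) });
//     }
//   };
```

Tagging each request with an id lets several `embed()` calls be in flight at once, since replies are matched by id rather than by arrival order.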
Persisting to IndexedDB
Models and embeddings are expensive to regenerate. Persist them between sessions:
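// Assumes `db` was opened with the idb library's openDB() and that the
// 'embeddings' store uses `id` as its keyPath. `tx.store` is an idb
// convenience and is not part of the raw IndexedDB API.
async function saveEmbedding(db, id, text, embedding) {
  const tx = db.transaction('embeddings', 'readwrite');
  await tx.store.put({
    id,
    text,
    embedding: new Float32Array(embedding), // compact binary storage
  });
  await tx.done; // resolves once the transaction has committed
}

async function loadAll(db) {
  const tx = db.transaction('embeddings', 'readonly');
  return tx.store.getAll();
}
Use a hash of the source content as the key so you can detect when re-embedding is necessary. Storage quotas depend on the browser and on available disk space, but on most desktop browsers they run to gigabytes, so typical corpora won't hit them.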
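To put the storage claim in numbers, the raw vector data can be estimated from the vector count and dimensionality. This is a rough sketch that assumes Float32 storage (4 bytes per dimension) and ignores the stored text and IndexedDB's own overhead; `estimateVectorBytes` and `toMiB` are illustrative helpers, not library functions.

```javascript
// Rough size of the raw vector data, assuming Float32 storage
// (4 bytes per dimension). Ignores stored text and database overhead.
function estimateVectorBytes(vectorCount, dimensions) {
  return vectorCount * dimensions * Float32Array.BYTES_PER_ELEMENT;
}

// Convert bytes to mebibytes (1 MiB = 1024 * 1024 bytes).
function toMiB(bytes) {
  return bytes / (1024 * 1024);
}

// 10,000 vectors × 384 dimensions × 4 bytes = 15,360,000 bytes
const bytes = estimateVectorBytes(10_000, 384);
console.log(`${toMiB(bytes).toFixed(1)} MiB`); // prints "14.6 MiB"
```

In a browser, `navigator.storage.estimate()` reports the current usage and quota, which gives a figure to compare this estimate against.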
Cosine similarity search
When embeddings are normalized to unit length, cosine similarity reduces to a dot product. Simple and fast:
function similarity(a, b) {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot; // range [-1, 1] for unit vectors, higher = more similar
}

function search(queryEmbedding, store, topK = 10) {
  return store
    .map(item => ({
      ...item,
      score: similarity(queryEmbedding, item.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
For stores under 10,000 vectors, this brute-force approach typically completes in tens of milliseconds. Beyond that, consider approximate nearest-neighbor structures like HNSW or IVF, which are harder to implement client-side but achievable with the right library.
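The reduction from cosine similarity to a plain dot product can be checked numerically. The sketch below uses hand-picked 2D vectors rather than real embeddings, and the helper names (`cosine`, `normalize`, `dot`) are illustrative only.

```javascript
// Full cosine similarity: dot(a, b) / (|a| * |b|).
function cosine(a, b) {
  let dotAB = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dotAB += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dotAB / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scale a vector to unit length.
function normalize(v) {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return v.map((x) => x / norm);
}

// Plain dot product, with no length correction.
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

const a = [3, 4]; // length 5
const b = [4, 3]; // length 5
console.log(cosine(a, b));                     // 24 / 25, i.e. about 0.96
console.log(dot(normalize(a), normalize(b)));  // about 0.96 as well
```

Because both vectors have length 1 after normalization, the denominator in the cosine formula is 1 and only the dot product remains.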
Performance in practice
Benchmarks on real hardware:
- MacBook Air M2: model load ~1.8s, single embedding ~4ms, search across 5,000 vectors ~8ms.
- Snapdragon 8 Gen 1 (Android): model load ~4.2s, single embedding ~18ms, search across 5,000 vectors ~35ms.
Acceptable for interactive apps. The critical optimization: run embedding in a Web Worker so the main thread stays responsive during bulk indexing.
When not to use on-device embeddings
On-device works well for corpora up to tens of thousands of documents. Beyond that, the index grows too large and linear search becomes too slow. Also: the best-performing embedding models (text-embedding-3-large, Cohere embed-v3) are available only through hosted APIs and can't run in a browser. If retrieval quality is critical and the corpus is large, use a server-side solution.
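A back-of-the-envelope projection shows why linear search stops being interactive at larger scales. The sketch assumes search cost grows linearly with the number of vectors, which is optimistic once the data no longer fits in cache; `projectScanMs` is a hypothetical helper, and the baselines are the 5,000-vector figures quoted in the benchmarks above.

```javascript
// Project brute-force search time to a larger store, ASSUMING cost grows
// linearly with vector count. Real devices may do worse than this.
function projectScanMs(measuredMs, measuredCount, targetCount) {
  return measuredMs * (targetCount / measuredCount);
}

// Scale the 5,000-vector measurements up to 100,000 vectors (20x).
console.log(projectScanMs(8, 5_000, 100_000));  // 160 (ms, laptop figure)
console.log(projectScanMs(35, 5_000, 100_000)); // 700 (ms, phone figure)
```

Under that assumption, a search that feels instant at 5,000 vectors takes most of a second on a phone at 100,000, which is where an approximate index or a server starts to make sense.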
For personal tools, small corpora, and privacy-sensitive applications: on-device embeddings are production-ready today.