Beyond bi-encoders: cross-encoder reranking and contextual retrieval

A memory layer is only as good as its retrieval. When an agent asks “what do I owe Sarah, and when?”, it does not need ten loosely related memories; it needs the one that answers the question, at the top. Getting that right is a two-part problem, and a single embedding model solves neither part well.

Bi-encoders recall; cross-encoders rank

MemWal embeds each neuron independently and compares vectors, which makes it a bi-encoder. That is exactly right for the first stage: it is fast and casts a wide net, so the neuron you want almost always lands somewhere in the top 20. But a bi-encoder encodes the query and the document separately and never sees them together, so it is notoriously weak at ordering those 20 candidates precisely. Using the first-stage similarity score as the final ranking is a documented anti-pattern, and it is why naive vector search so often buries the best answer at position 6.

A cross-encoder fixes the ordering. It reads the query and a candidate together and emits a single relevance score, capturing interactions a dot product cannot. The cost is that you cannot pre-compute it, because the score depends on the query, so you run it only over the small candidate pool, as a second stage. Neurus uses ms-marco-MiniLM-L-6-v2 locally via ONNX (no API key, ~3 ms per query), and it is the single biggest quality lever in the pipeline.

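// the wrapper name `search` is illustrative; the three statements are the pipeline
async function search(query, limit) {
  const hits = await memwal.recall(query, 20);   // stage 1: broad, bi-encoder
  const ranked = await rerank(query, hits.map(h => h.text)); // stage 2: precise, cross-encoder
  return ranked.slice(0, limit);
}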
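
To make the division of labour concrete, here is a minimal, self-contained sketch of the two stages. It is not MemWal's or Neurus's code: the types, the function names (recallStage, rerankStage) and the toy word-overlap scorer are assumptions made up for illustration, and the scorer merely stands in for a real cross-encoder.

```typescript
// Hypothetical types for illustration; not the actual MemWal or Neurus API.
type Candidate = { id: string; text: string; vector: number[] };
type PairScorer = (query: string, text: string) => number;

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Stage 1 (bi-encoder style): document vectors are computed ahead of time,
// so each comparison is cheap. Keep the poolSize most similar candidates.
function recallStage(queryVector: number[], corpus: Candidate[], poolSize: number): Candidate[] {
  return corpus
    .map(c => ({ c, s: cosine(queryVector, c.vector) }))
    .sort((x, y) => y.s - x.s)
    .slice(0, poolSize)
    .map(x => x.c);
}

// Stage 2 (cross-encoder style): every (query, candidate) pair is scored
// jointly, which cannot be precomputed, so it runs over the small pool only.
function rerankStage(query: string, pool: Candidate[], score: PairScorer, limit: number): Candidate[] {
  return pool
    .map(c => ({ c, s: score(query, c.text) }))
    .sort((x, y) => y.s - x.s)
    .slice(0, limit)
    .map(x => x.c);
}

// Toy stand-in for a cross-encoder: counts query words that occur in the text.
// A real model would return a learned relevance score instead.
const toyOverlapScorer: PairScorer = (query, text) => {
  const words = new Set(text.toLowerCase().split(/\s+/));
  return query.toLowerCase().split(/\s+/).filter(w => words.has(w)).length;
};
```

Swapping toyOverlapScorer for a real cross-encoder changes only the score argument. Model inference is normally asynchronous, so a production version would likely await the scores before sorting.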

The chunk that lost its context

The second problem is subtler. Split a document into chunks and one of them reads simply “The base rate is 0.05%.” Embedded on its own, that chunk has no idea it is about fees on the Zephyr Protocol, so a query for “trading fee” may never retrieve it. Anthropic’s contextual retrieval (Sept 2024) showed this is responsible for a large share of retrieval failures, and that prepending a short, generated context sentence to each chunk before embedding cut failed retrievals by ~35% on its own, and by up to ~67% combined with BM25 and reranking.

We implement this with one twist for cost: instead of an LLM call per chunk, a single call per document returns a context line for every chunk at once. The context is prepended only to the embedded text; the original body is preserved for display and grounding.

// one LLM call situates every chunk in the whole document
const contexts = await contextualize(docTitle, chunks);
chunks.forEach((chunk, i) => {
  // embed the contextual version; chunk.body stays original for the answer + citation
  chunk.meta.embedText = `${contexts[i]}\n\n${chunk.body}`;
});
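
What the single per-document call might look like around the model is sketched below. This is an assumption-laden illustration, not the actual contextualize implementation: the prompt wording, the "[n] sentence" response format, and the helper names (buildContextPrompt, parseContexts, attachContexts) are all hypothetical, and the LLM call itself is left out.

```typescript
// Hypothetical chunk shape, mirroring the fields used in the snippet above.
type Chunk = { body: string; meta: { embedText?: string } };

// Build one prompt that numbers every chunk, so a single LLM call can return
// one context line per chunk. The wording is illustrative only.
function buildContextPrompt(docTitle: string, chunks: Chunk[]): string {
  const numbered = chunks.map((c, i) => `[${i + 1}] ${c.body}`).join("\n");
  return [
    `Document title: ${docTitle}`,
    "For each numbered chunk, write one sentence situating it in the document.",
    'Answer with one line per chunk in the form "[n] sentence".',
    "",
    numbered,
  ].join("\n");
}

// Parse "[n] sentence" lines back into an array aligned with the chunks.
// Unparseable lines and out-of-range numbers are ignored; missing entries stay "".
function parseContexts(response: string, count: number): string[] {
  const contexts = new Array<string>(count).fill("");
  for (const line of response.split("\n")) {
    const match = /^\[(\d+)\]\s*(.+)$/.exec(line.trim());
    if (!match) continue;
    const index = Number(match[1]) - 1;
    if (index >= 0 && index < count) contexts[index] = match[2].trim();
  }
  return contexts;
}

// Prepend the context to the text that gets embedded; chunk.body is untouched.
// If the model returned nothing for a chunk, fall back to the plain body.
function attachContexts(chunks: Chunk[], contexts: string[]): void {
  chunks.forEach((chunk, i) => {
    const context = contexts[i];
    chunk.meta.embedText = context ? `${context}\n\n${chunk.body}` : chunk.body;
  });
}
```

The fallback in attachContexts is a design choice worth keeping in some form: a batched response can drop or mangle a line, and a chunk embedded without context is still better than a chunk that is not embedded at all.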

The honest part

We built an eval harness to measure this, and found something worth saying plainly: at small scale, dense recall plus a cross-encoder already saturates — hybrid BM25 and contextual chunking add little, because the gold neuron is always already in the pool. These levers earn their keep at document scale, where the first stage genuinely misses. So they ship on by default, the eval makes every future change measurable, and we do not claim a number we cannot show.
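
The harness itself is not shown here. As a rough sketch of the distinction drawn above, two standard metrics separate "the gold neuron is in the pool" from "the gold neuron is at the top": recall@k and mean reciprocal rank. The EvalCase shape and function names are assumptions for illustration, not the harness's real interface.

```typescript
// Hypothetical eval record: the id of the correct neuron and the ids the
// pipeline returned, best first.
type EvalCase = { gold: string; ranked: string[] };

// Fraction of cases whose gold id appears in the first k results.
// Measures whether the first stage found the answer at all.
function recallAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter(c => c.ranked.slice(0, k).includes(c.gold)).length;
  return hits / cases.length;
}

// Mean of 1 / (position of the gold id), counting 0 when it is absent.
// Measures how close to the top the answer lands.
function meanReciprocalRank(cases: EvalCase[]): number {
  if (cases.length === 0) return 0;
  let total = 0;
  for (const c of cases) {
    const position = c.ranked.indexOf(c.gold);
    if (position >= 0) total += 1 / (position + 1);
  }
  return total / cases.length;
}
```

Read this way, the saturation result is unsurprising: if recall@20 is already 1.0 on a small corpus, a lever that only improves the first stage has nothing left to recover, and any remaining gain has to show up in the ranking metric.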