Why Your Retrieval Stack Never Converges
Companion to Compressed Coordination · Semantic match is task-relative, not absolute
You started with vector search. It worked, then it missed in obvious ways — keyword-overlap queries that any literal match would have caught. So you added BM25, blended the scores, and recovered the surface-form cases. Then it missed differently — paraphrases, distant restatements, queries where the words don't overlap but the meaning does. So you stacked a reranker. The reranker recovered some of those, and broke others by overweighting context the earlier layers had handled fine. You fine-tuned an embedding model on your own data. The failure boundary shifted again.
You did not converge. You moved the failure around.
This is not just because retrieval is hard. It is because the question your stack is trying to answer — is this document a semantic match for this query? — does not have an answer. Not because the answer is unknown — because the question, as posed, does not pick out a well-defined object. Replace it with the question that does:
Would the next action change if you returned a different result?
The next action — generate, route, classify, cite, trigger — is observable. That makes the question measurable. It gives evaluation a target, tells you when to stop tuning, and explains, in one move, why the stack never converged.
Before you build the stack
The replacement question is also a triage. Apply it before any retrieval architecture goes in: would the next action change if you returned a different result? If you cannot describe a result that would change the action, retrieval is not the operation you need. There are two ways for the triage to fail, and most of the developers who hit the wall hit it because of one of them.
First, the task may not be a retrieval problem at all: the action's correctness depends on something that is not a property of any document.
- Aggregation or computation. "How many patients had X last quarter?" "What is the average dosage?" The right answer is a number no individual document carries. Retrieval can return documents about the underlying records; it cannot return the count. This is a query against structured data.
- Current state. "What is the patient's BP right now?" "Is the order still pending?" The indexed document is stale by definition; current state lives in a system retrieval is not connected to.
- Action selection for a specific case. "What should I do next for this patient?" Retrieval can supply general knowledge, but the right action depends on the specific case state too. Retrieval alone is insufficient; the result omits information the decision actually depends on.
- Negation or completeness. "What contraindications am I missing?" "Are there cases we have not seen?" Retrieval produces evidence of presence. It cannot produce evidence of absence.
- Exact identity. "Pull the record for patient #4291832." This is a key lookup. The right answer has exactly one member; similarity is the wrong primitive.
Second, the task may be a retrieval problem that top-k similarity cannot solve: retrieval is the right operation, but top-k by similarity is the wrong access pattern.
- Cross-corpus synthesis. "Compare how this policy evolved across these fifty documents." The right answer requires distinctions across the full corpus; top-k discards most of it.
- Authoritative-source preference. "What does our protocol say?" Similarity ranks by closeness, not by authority. The right answer is the canonical document, not "any document about the topic."
- Long-tail. Rare conditions, unusual combinations. Embeddings cluster the common; the rare lives between clusters, retrieved unreliably.
- Temporal scope. The question is about 2024; the corpus stops in 2020. The corpus does not contain the answer at all, so no ranking over it can surface one.
The diagnostic is one move. Before you tune anything, describe the result whose presence would change the next action. If you cannot, the operation you need is not retrieval. If you can but it requires the full corpus or a specific canonical record, top-k similarity is the wrong access pattern even where retrieval is the right idea.
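The triage above can be written down as a small decision function. This is a minimal sketch, not a prescribed tool: the `TaskProfile` fields and the `Route` names are hypothetical, and the yes/no answers are assumed to be filled in by hand before any retrieval architecture is built.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    NOT_RETRIEVAL = "use another operation (structured query, live lookup, key lookup)"
    RETRIEVAL_OTHER_ACCESS = "retrieval, but not top-k similarity"
    TOP_K_SIMILARITY = "top-k similarity is a candidate"


@dataclass
class TaskProfile:
    # Hypothetical fields: answers a developer writes down for one task.
    can_name_action_changing_result: bool  # can you describe a result that would change the action?
    needs_full_corpus: bool                # e.g. cross-corpus synthesis
    needs_canonical_record: bool           # e.g. authoritative-source preference


def triage(profile: TaskProfile) -> Route:
    """Apply the two checks in order: is it retrieval, then is it top-k."""
    if not profile.can_name_action_changing_result:
        return Route.NOT_RETRIEVAL
    if profile.needs_full_corpus or profile.needs_canonical_record:
        return Route.RETRIEVAL_OTHER_ACCESS
    return Route.TOP_K_SIMILARITY
```

A task such as "What does our protocol say?" would be profiled as `TaskProfile(True, False, True)` and routed away from top-k similarity, even though retrieval remains the right operation.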
The rest of this essay is for the cases that pass the triage.
Two definitions of semantic match
There are two things "semantic match" can mean, and the retrieval literature does not always separate them. The conflation is doing most of the damage.
Absolute semantic match. Two representations carry the same meaning in some task-free, lossless sense. The query and the document refer to the same underlying object, the same fact, the same idea — independent of what the system is for.
Task-relative semantic match. Two representations are close enough that substituting one for the other preserves the distinctions needed for the current objective. The query and the document land in the same equivalence class under the task the system is running.
The retrieval techniques in common use are proxies for the first. Embedding models are trained on similarity objectives that do not know your task. BM25 scores lexical overlap that does not know your task. Cross-encoder rerankers score query–document relevance against a general notion of relevance that does not know your task. Hybrid blends combine these signals; none of the signals being blended is a task partition.
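To make the point about hybrid blends concrete, here is a minimal sketch of a two-signal blend. The scores are made up for illustration and `blend` and `rank` are hypothetical helpers, not any library's API. Note what the function signatures lack: there is no argument that says what the system will do with the result.

```python
def blend(dense: float, sparse: float, alpha: float) -> float:
    """Weighted sum of two similarity scores; alpha is the weight on the dense score."""
    return alpha * dense + (1.0 - alpha) * sparse


def rank(scores: dict[str, tuple[float, float]], alpha: float) -> list[str]:
    """Order document ids by blended score, highest first."""
    return sorted(scores, key=lambda doc_id: blend(*scores[doc_id], alpha), reverse=True)


# Made-up (dense, sparse) scores for three hypothetical documents.
toy_scores = {"a": (0.9, 0.2), "b": (0.5, 0.8), "c": (0.1, 0.1)}

print(rank(toy_scores, alpha=0.5))  # ['b', 'a', 'c']
print(rank(toy_scores, alpha=0.9))  # ['a', 'b', 'c']
```

Changing `alpha` reorders the list, but the ranking is the same whether the next action is to cite, route, or summarize. The weight tunes between two implicit notions of relevance; it does not introduce the task.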
The first definition is not a coherent engineering target. The second is.
Why the stack never converges
The argument is one sentence. Similarity metrics are not task partitions.
Stacking more similarity signals — sparse, dense, cross-encoded, fine-tuned, reranked — improves performance because each signal encodes a slightly different implicit notion of relevance. The new layer catches what the old layer's implicit notion missed. But none of the layers, and no blend of them, is an explicit statement of which differences between documents matter for the action the system is going to take. The task partition lives in the application, not in the embedding.
So the stack improves locally and never bottoms out globally. There is no task-free fixed point for it to converge to. The asymptote does not exist.
The escalating stack is not evidence that retrieval is unusually difficult. It is evidence that the object the stack is optimizing toward is not well-defined. Every layer is a different guess at a partition the stack has not been told.
In framework terms
The general result the retrieval case instantiates lives in Compressed Coordination. The short version, in the vocabulary that essay develops:
A bounded sender compresses a source state into a signal. The channel admits fewer distinguishable signals than there are distinguishable source states (writing M for the number of signals and N for the number of source states, M < N), so the signal underdetermines the source. The receiver does not recover the source state; it produces a reconstruction conditioned on the signal and its priors. Coordination succeeds when the reconstruction lands in the same equivalence class as the source state under the task partition, that is, the partition that groups source states by which next action they would justify.
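A toy instance may help fix the vocabulary. The sketch below assumes six source states and three signals; the `encode` and `decode` functions and both task partitions are invented for illustration. The same lossy channel coordinates perfectly under one partition and only half the time under another.

```python
N_STATES = 6   # distinguishable source states (N)
M_SIGNALS = 3  # distinguishable signals (M), with M < N


def encode(state: int) -> int:
    """Sender: compress a source state into one of M_SIGNALS signals."""
    return state // 2


def decode(signal: int) -> int:
    """Receiver: reconstruct one representative state per signal."""
    return signal * 2


def success_rate(task_class) -> float:
    """Fraction of source states whose reconstruction lands in the same task class."""
    hits = sum(task_class(decode(encode(s))) == task_class(s) for s in range(N_STATES))
    return hits / N_STATES


def aligned(state: int) -> int:
    """Task partition that groups states the same way the channel does."""
    return state // 2


def misaligned(state: int) -> int:
    """Task partition that cuts across the channel's grouping (even vs odd)."""
    return state % 2


print(success_rate(aligned))     # 1.0
print(success_rate(misaligned))  # 0.5
```

Nothing about the channel changed between the two calls. Only the partition did, which is the sense in which success is a property of the task and not of the signal.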
Retrieval is the same architecture, with the roles renamed. The query is a compressed signal of a source state — what the user actually needs from the system, which the query underdetermines. The retrieval system produces a reconstruction: a ranked list of documents the system treats as task-equivalent to that source state. The match is "right" when the returned document lands in the same equivalence class as the source state under the partition that matters — the partition defined by what the system is going to do with the retrieved result.
A retrieval result is task-sufficient when replacing it would not change the downstream policy.
That is the criterion.
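Stated as code, the criterion is a single comparison. This is a minimal sketch assuming the downstream policy can be called as a function from a retrieved document to an action; `task_sufficient` and the toy `route` policy are hypothetical names.

```python
from typing import Callable, Hashable, Iterable, TypeVar

Doc = TypeVar("Doc")


def task_sufficient(
    returned: Doc,
    alternatives: Iterable[Doc],
    policy: Callable[[Doc], Hashable],
) -> bool:
    """True if swapping `returned` for any alternative leaves the action unchanged."""
    baseline = policy(returned)
    return all(policy(alt) == baseline for alt in alternatives)


def route(doc: dict) -> str:
    """Toy downstream policy: choose a queue from the document's topic."""
    return "billing" if doc["topic"] == "invoice" else "support"


returned_doc = {"id": 1, "topic": "invoice"}
same_class = [{"id": 2, "topic": "invoice"}]
other_class = [{"id": 3, "topic": "login"}]

print(task_sufficient(returned_doc, same_class, route))   # True
print(task_sufficient(returned_doc, other_class, route))  # False
```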
What changes when you take this seriously
Evaluation moves to the action boundary. The right test of a retrieval result is not whether a human annotator judges it semantically relevant. The right test is whether the downstream action changes when the result changes. A pipeline that produces the same answer regardless of which of several documents you returned has, in those cases, no retrieval problem — those documents are in the same equivalence class under the task. A pipeline whose answer flips when you swap a "less relevant" document for a "more relevant" one has a retrieval problem precisely there, at that swap. Label failures by whether they change downstream policy; ignore the rest.
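One way to label failures at the action boundary is sketched below. It assumes an evaluation set where each case records the result the system returned and a reference result an annotator preferred, plus a callable downstream policy. `Case`, `label_failures`, and the toy policy are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    query: str
    returned: str   # text of the result the system returned
    reference: str  # text of the result an annotator preferred


def label_failures(cases: list[Case], policy: Callable[[str, str], str]) -> list[Case]:
    """Keep only the cases where swapping in the reference changes the action."""
    return [c for c in cases if policy(c.query, c.returned) != policy(c.query, c.reference)]


def toy_policy(query: str, doc_text: str) -> str:
    """Toy downstream action: escalate when the retrieved text flags urgency."""
    return "escalate" if "urgent" in doc_text.lower() else "queue"


cases = [
    Case("ticket 1", "routine password reset", "standard reset steps"),
    Case("ticket 2", "routine outage note", "URGENT: outage affects all users"),
]

failures = label_failures(cases, toy_policy)
print([c.query for c in failures])  # ['ticket 2']
```

The first case might still be scored as a miss by a relevance annotator, but both results lead to the same action, so under this criterion it is not a failure to fix.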
Fine-tuning gets a real target. Generic semantic-similarity fine-tuning trains toward a partition you did not specify. Task-conditioned fine-tuning takes the question that matters (does the downstream action change?) as its supervision signal. The objective is no longer "make these two strings score high together." It is "score together the results that the downstream policy treats as the same, and score apart the results it treats as different."
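As a rough illustration of where such supervision could come from, the sketch below turns observed downstream actions into positive and negative pairs. It assumes you have already recorded, for one query, which action each candidate result led to; `pairs_from_actions` and the document ids are hypothetical, and how the pairs feed a training objective is left open.

```python
from itertools import combinations


def pairs_from_actions(action_by_doc: dict[str, str]) -> tuple[list, list]:
    """Split all document pairs by whether the downstream policy treated them alike."""
    positives, negatives = [], []
    for a, b in combinations(sorted(action_by_doc), 2):
        if action_by_doc[a] == action_by_doc[b]:
            positives.append((a, b))  # same action: score together
        else:
            negatives.append((a, b))  # different action: score apart
    return positives, negatives


# Made-up record of which action each candidate result produced for one query.
observed = {"d1": "refund", "d2": "refund", "d3": "escalate"}

positives, negatives = pairs_from_actions(observed)
print(positives)  # [('d1', 'd2')]
print(negatives)  # [('d1', 'd3'), ('d2', 'd3')]
```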
The partition has to live somewhere. Embedding models do not encode your task; they cannot. Cross-encoders and rerankers can, but only if you train them on the partition the application cares about, which is why a reranker for question answering differs from one for summarization, which differs again from one for citation, even on the same corpus. When the partition cannot live in the model layer, it has to live around it: filters, structured fields, retrieval policies, or explicit task-conditioned objectives. Pretending the embedding will handle the partition is what keeps the stack growing without converging.
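A minimal sketch of a partition living around the model, using the authoritative-source case: a structured field decides eligibility, and similarity only orders what remains. The `Candidate` fields, including `is_canonical`, are hypothetical, and the scores are assumed to come from whatever ranker is already in place.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    score: float        # similarity score from an upstream ranker (assumed given)
    is_canonical: bool  # hypothetical structured field maintained outside the model


def retrieve_for_protocol_question(candidates: list[Candidate], k: int = 3) -> list[str]:
    """Filter on the task-relevant field first, then rank survivors by similarity."""
    eligible = [c for c in candidates if c.is_canonical]
    ranked = sorted(eligible, key=lambda c: c.score, reverse=True)
    return [c.doc_id for c in ranked[:k]]


# Made-up candidates: the highest-scoring document is not the canonical one.
pool = [
    Candidate("blog-post", 0.95, False),
    Candidate("protocol-v3", 0.80, True),
    Candidate("protocol-v2", 0.60, True),
]

print(retrieve_for_protocol_question(pool, k=1))  # ['protocol-v3']
```

Ranking by score alone would have returned the blog post. The filter carries the part of the task that the similarity score has no way to express.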
Verification is downstream of retrieval, not part of it. Retrieval delivers a reconstruction; it cannot certify the reconstruction landed in the right equivalence class. The failure mode is familiar: retrieval returns a passable document, generation produces a confident answer, and the answer is wrong about a fact the retrieved document did not actually support. That is not a retrieval failure or a generation failure in isolation. It is the absence of a verification layer that checks the action against an external constraint after retrieval has done its work. The retrieval layer reconstructs; the verification layer checks. Conflating them — "just return the answer" — is how silent wrong-class landings become silent wrong actions.
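The separation can be sketched as two functions, one that checks and one that acts. The literal substring check below is only a stand-in for whatever external constraint a real system would apply, and `verify`, `act`, and the example strings are hypothetical.

```python
def verify(claims: list[str], evidence: str) -> list[str]:
    """Return the claims that the evidence text does not literally contain."""
    lowered = evidence.lower()
    return [claim for claim in claims if claim.lower() not in lowered]


def act(answer: str, claims: list[str], evidence: str) -> str:
    """Release the answer only if every claim it rests on passes the check."""
    unsupported = verify(claims, evidence)
    if unsupported:
        return "ABSTAIN: unsupported claim(s): " + "; ".join(unsupported)
    return answer


# Made-up retrieved text and two candidate answers.
evidence = "Our policy: the refund window is 30 days from delivery."

print(act("You have 30 days.", ["refund window is 30 days"], evidence))
# You have 30 days.
print(act("You have 60 days.", ["refund window is 60 days"], evidence))
# ABSTAIN: unsupported claim(s): refund window is 60 days
```

The retrieval step is identical in both calls. What differs is whether the action is allowed through, which is a decision the retrieval layer was never in a position to make.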
The general case
Retrieval is one instance of the architecture developed in Compressed Coordination: success is judged by equivalence under a task partition, not by resemblance. The foundation essay develops the general case — the constraint runs the same way wherever bounded senders, lossy channels, and task-dependent receivers are connected. The retrieval case is what that constraint looks like when the sender is a user, the channel is a query, and the receiver is a system about to act.
Stop scoring retrieval by resemblance. Score it by whether swapping the result changes the next action. That is the only target that exists.