Coda: The Closed-Loop Checklist
Six Essays on Compression · Coda · The closed-loop checklist
Six essays in, the picture is this. Communication between any two finite processors, human or machine, is a chain of compressions and decompressions. The chain breaks silently in three ways: the wrong thing gets dropped (an action-changing fact never reaches the receiver), the wrong thing gets filled in (the receiver reconstructs the wrong picture from a confident message), or the gap is never checked (no one verifies whether the receiver's picture matches what the sender meant). Many clinical AI systems are still built so that all three failure modes can happen at once, in production, without a reliable mechanism for noticing. The expensive error is unflagged loss: a fluent compression that decompresses cleanly into the wrong patient. Loss itself is table stakes; the harm comes from loss the receiver does not know to ask about.
The cleanest way to see this is as a picture.
The patient is at one end. The decision is at the other. In between are five codec hops: perception (the clinician sees the patient and compresses what they saw into a working model), inscription (they compress the working model into a chart entry), retrieval (the AI compresses chart entries and other data into its internal representation), generation (the AI decompresses that representation into natural-language output), and reconstruction (the clinician decompresses the AI's output back into a mental model that drives action). Each hop can drop information, distort it, or invent it. Each hop's reliability depends on the codec being shared with the next hop.
A closed-loop architecture has feedback at every gap, and it has the six properties from Essay VI in active use. Most current clinical AI implements one or two of them, partially. The integrated version is the work.
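As a rough illustration of "feedback at every gap", the sketch below wraps each hop in an audit that reports what was dropped, invented, or distorted. It rests on a strong simplifying assumption: that the state on both sides of a hop can be written as the same flat slot-to-value mapping, which real hops do not offer. The names `Facts`, `GapReport`, `audit_hop`, and `run_chain`, the two example hops, and the patient data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Facts = Dict[str, str]          # hypothetical representation: slot name -> value
Hop = Callable[[Facts], Facts]  # one compression or decompression step


@dataclass
class GapReport:
    hop: str
    dropped: List[str]    # slots present before the hop, absent after
    invented: List[str]   # slots absent before the hop, present after
    distorted: List[str]  # slots present on both sides with different values


def audit_hop(name: str, hop: Hop, before: Facts) -> Tuple[Facts, GapReport]:
    """Run one hop and diff its input against its output."""
    after = hop(before)
    report = GapReport(
        hop=name,
        dropped=sorted(k for k in before if k not in after),
        invented=sorted(k for k in after if k not in before),
        distorted=sorted(k for k in before if k in after and before[k] != after[k]),
    )
    return after, report


def run_chain(hops: List[Tuple[str, Hop]], facts: Facts) -> Tuple[Facts, List[GapReport]]:
    """Pass the facts through every hop, collecting one report per gap."""
    reports = []
    for name, hop in hops:
        facts, report = audit_hop(name, hop, facts)
        reports.append(report)
    return facts, reports


# Two toy hops: inscription silently drops the allergy, generation smooths the trend.
def inscription(f: Facts) -> Facts:
    return {k: v for k, v in f.items() if k != "allergy"}


def generation(f: Facts) -> Facts:
    out = dict(f)
    out["trend"] = "stable"
    return out


patient = {"allergy": "penicillin", "trend": "worsening", "lactate": "4.1"}
final, reports = run_chain([("inscription", inscription), ("generation", generation)], patient)
for r in reports:
    print(r)
```

In an open-loop chain only `final` would reach the clinician; the reports are what turn unflagged loss into flagged loss.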
The closed-loop clinical AI checklist:
1. Matched codec. Does the AI's output match the format the receiver is using right now: their specialty's idioms, their hospital's conventions, the level of abstraction the moment in workflow calls for? Or does it impose its own format and leave them to translate: narrative when they need a problem list, a full admission summary when they need the change since yesterday, an outside specialty's frame when they are reading from their own?
2. Externalized shared codec. When the agreement is too large to live in either head, both ends point at the same external source: the same ontology, the same knowledge base, the same record version, the same definition of "stable." Retrieval systems are part of the codec; bad retrieval is bad communication.
3. Decompression verification. Does the system surface what the clinician actually took from its output, and detect meaningful divergence between the AI's represented patient-state and the clinician's apparent working model? Without this, both ends produce fluent output and the gap between them stays invisible until something downstream goes wrong.
4. Uncertainty in the message itself. When the AI is hedging, does the message carry the hedge in a form the receiver can act on? Or has the hedge been smoothed into prose that reads as confident, leaving the receiver to fill in certainty the sender never had?
5. Iterative expansion at minimum bandwidth. Does the system start with the smallest useful pointer and expand on demand, questions before lectures, or does it dump everything at the front and trust the receiver to find what matters?
6. Active maintenance against drift. Does the team running this system notice when its sense of "the typical patient on this floor" has stopped matching what the clinicians on this floor are actually seeing? Of the six properties, this one lives outside the model itself. It is operational labor: standing evals, distribution monitors, periodic re-grounding against current practice. Drift is caught by the team running the system, not by the system. It is the failure mode no one catches reactively, and the monitoring that would catch it is almost entirely missing from current production stacks.
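One concrete form a "distribution monitor" for the last item on the checklist could take is a population stability index (PSI) computed between a reference window and the current window of some input feature. The sketch below is a minimal pure-Python version, not a description of any particular production stack. The feature (lactate), the bin edges, the sample values, and the 0.2 alert threshold are all placeholders that a deploying site would have to choose for itself.

```python
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values in each bin; `edges` are the interior cut points."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(1 for e in edges if v >= e)  # number of cut points at or below v
        counts[idx] += 1
    return [c / len(values) for c in counts]


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    edges: Sequence[float],
    eps: float = 1e-6,
) -> float:
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    expected = bin_fractions(baseline, edges)
    actual = bin_fractions(current, edges)
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        psi += (a - e) * math.log(a / e)
    return psi


# Made-up lactate values: the reference window vs. what the floor sees now.
baseline = [1.0, 1.2, 1.5, 1.8, 2.0, 2.2, 2.5, 3.0]
current = [2.0, 2.4, 2.8, 3.2, 3.6, 4.0, 4.4, 5.0]
edges = [1.5, 2.5]

ALERT_THRESHOLD = 0.2  # placeholder; a local choice, not a standard
psi = population_stability_index(baseline, current, edges)
if psi > ALERT_THRESHOLD:
    print(f"PSI={psi:.2f}: input distribution has shifted, re-ground the evals")
```

A monitor like this only says that the inputs have moved. Deciding whether the model's outputs still match current practice remains the team's job.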
What Property 3 looks like in product terms, since this is the property the other five depend on: the AI shows the patient-state it is representing as a structured object the clinician can inspect; it asks for confirmation only on the action-changing variables, and only when those carry real uncertainty; it flags divergence when the clinician's plan does not match the patient-state the AI is working from; it preserves the uncertainty markers in any downstream re-use. None of this requires a smarter model. All of it requires the system to be in conversation with the human, not just to deliver to them.
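The four behaviors in that paragraph can be sketched against a small data structure. Everything here is an assumption made for illustration: the `Slot` fields, the 0.9 confidence cut-off, and above all the `clinician_view` mapping, which is simply handed in. How a real system would obtain that mapping is the hard part and is not solved here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Slot:
    value: str
    confidence: float      # the model's own estimate in [0, 1]; hypothetical field
    action_changing: bool  # would a different value change the plan?


PatientState = Dict[str, Slot]  # the inspectable structured object


def slots_to_confirm(state: PatientState, min_confidence: float = 0.9) -> List[str]:
    """Ask only about action-changing slots that carry real uncertainty."""
    return sorted(
        name for name, s in state.items()
        if s.action_changing and s.confidence < min_confidence
    )


def divergences(state: PatientState, clinician_view: Dict[str, str]) -> List[str]:
    """Action-changing slots where the clinician's apparent value differs from the model's."""
    return sorted(
        name for name, believed in clinician_view.items()
        if name in state
        and state[name].action_changing
        and state[name].value != believed
    )


def render_for_reuse(state: PatientState, min_confidence: float = 0.9) -> List[str]:
    """Serialize the state for downstream use without dropping the uncertainty markers."""
    lines = []
    for name in sorted(state):
        s = state[name]
        marker = "" if s.confidence >= min_confidence else f" [uncertain, p={s.confidence:.2f}]"
        lines.append(f"{name}: {s.value}{marker}")
    return lines


state = {
    "anticoagulated": Slot("yes", confidence=0.60, action_changing=True),
    "renal_function": Slot("eGFR 38", confidence=0.95, action_changing=True),
    "diet": Slot("regular", confidence=0.50, action_changing=False),
}

print(slots_to_confirm(state))
print(divergences(state, {"anticoagulated": "no", "renal_function": "eGFR 38"}))
print("\n".join(render_for_reuse(state)))
```

Note that `diet` is uncertain but never triggers a confirmation request, because it is not marked action-changing. That filter is what keeps the read-back burden small.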
Underneath all of this is the deepest open question in the architecture. How does the system actually infer the clinician's reconstructed patient-state, well enough to detect divergence from its own? This is the user-state inference problem: a theory-of-mind problem for clinical AI, and one almost no current stack treats as first-class. Three routes are plausible, each with its own failure mode. Explicit read-back asks the clinician to confirm the action-changing slots before the next step; the most verifiable route, and the most exposed to confirmation fatigue, the same pattern that broke alert-based decision support. Behavioral inference watches what the clinician orders, asks, or ignores and reconstructs their working model from the trace; the lowest friction, and the slowest to catch a problem, since divergence only surfaces once it has already driven an action. A separate user-model module maintains a persistent representation of this clinician, updated continuously across interactions, the way the system maintains its patient model; architecturally the cleanest and the heaviest research lift, with a model-of-the-model problem riding along, since the user model can drift or mis-initialize in the same ways the patient model can. A serious closed-loop system probably needs some of all three: read-back on the highest-stakes slots, behavioral inference for everything else, a lightweight user model that learns which slots count as high-stakes for this clinician. Naming the problem is the prerequisite to picking a direction, and most stacks have not even named it.
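To make the hybrid at the end of that paragraph concrete, here is one very small way the three routes could be wired together: a per-clinician user model estimates how high-stakes each slot is, and a routing function sends high-stakes slots to explicit read-back and everything else to behavioral inference. This is a toy under stated assumptions, not a proposed solution. The `UserModel` class, its smoothed-frequency estimate, the 0.5 prior, and the 0.7 threshold are all invented for illustration.

```python
from collections import defaultdict
from enum import Enum


class Route(Enum):
    READ_BACK = "read_back"    # ask the clinician to confirm the slot
    BEHAVIORAL = "behavioral"  # infer the clinician's view from orders and queries


class UserModel:
    """Lightweight per-clinician model of which slots tend to be high-stakes."""

    def __init__(self, prior_stakes: float = 0.5) -> None:
        self._hits = defaultdict(int)  # times a divergence on this slot changed an action
        self._seen = defaultdict(int)  # times a divergence on this slot was observed
        self._prior = prior_stakes

    def observe(self, slot: str, divergence_changed_action: bool) -> None:
        self._seen[slot] += 1
        self._hits[slot] += int(divergence_changed_action)

    def stakes(self, slot: str) -> float:
        # Smoothed frequency: one pseudo-observation weighted at the prior.
        return (self._hits[slot] + self._prior) / (self._seen[slot] + 1)


def choose_route(slot: str, user_model: UserModel, read_back_threshold: float = 0.7) -> Route:
    """Read-back for slots the user model rates high-stakes, behavioral inference otherwise."""
    if user_model.stakes(slot) >= read_back_threshold:
        return Route.READ_BACK
    return Route.BEHAVIORAL


um = UserModel()
for _ in range(4):
    um.observe("anticoagulated", divergence_changed_action=True)
for _ in range(3):
    um.observe("diet", divergence_changed_action=False)

print(choose_route("anticoagulated", um))  # stakes = (4 + 0.5) / 5 = 0.9
print(choose_route("diet", um))            # stakes = (0 + 0.5) / 4 = 0.125
```

The model-of-the-model problem shows up even in this toy: a badly chosen prior, or a clinician whose practice changes, leaves `stakes` wrong until enough new observations accumulate.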
Many clinical AI systems in production today, especially summarization tools, ambient scribes, and chart-based copilots, appear strong on one or two of these properties and weak or silent on the rest. Larger context windows don't help with most of them. Smarter base models don't help with most of them. Better retrieval helps with one or two. The integrated architecture, all six, with explicit handoffs and a return channel from the human at the end of every chain, is what closing the loop actually requires.
The Epic Sepsis Model is one well-documented Property 6 case. The original external validation at one academic medical center found an AUROC of 0.63 with 33% sensitivity, far below the vendor's pre-deployment claims (Wong et al., JAMA Internal Medicine 2021). Five years later the same lead investigator validated the revised ESM v2 across four US health systems and found AUROC between 0.82 and 0.92, but with high institutional variability, positive predictive value of 0.13 to 0.26 at a 60% sensitivity threshold, and a number-needed-to-evaluate of 21 to 35 over a twelve-hour window (Wong et al., JAMA Network Open 2026). The model is better than v1; the deployment problem is the same. Local calibration on this population, this floor, this season is only knowable through standing local evaluation, which is operational work the deploying site has to do, not a property the model ships with.
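One reason calibration has to be checked locally is that positive predictive value depends on local prevalence, not just on the model. The arithmetic below applies Bayes' rule at a fixed operating point. Only the 60% sensitivity echoes the threshold mentioned above; the 90% specificity and the three prevalence values are made-up illustrative inputs, not figures from either cited study.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV = TP / (TP + FP), with rates expressed per unit of population."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same model, same threshold, three hypothetical sites with different sepsis prevalence.
for prevalence in (0.01, 0.03, 0.10):
    ppv = positive_predictive_value(0.60, 0.90, prevalence)
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}")
# prevalence=0.01  PPV=0.057
# prevalence=0.03  PPV=0.157
# prevalence=0.10  PPV=0.400
```

With nothing changed but the population, the share of alerts that are true positives moves from roughly one in eighteen to two in five. A site cannot read that off the vendor's numbers.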
A note on what this checklist is for. It is a list of properties to look at when you are buying, building, or trusting a clinical AI system. The checklist is diagnostic, not regulatory. If you cannot say which mechanism in the system implements each one, the system is probably open-loop somewhere it shouldn't be.
A note on what this checklist isn't. It is not yet operationalized. None of the six properties has a published metric, and turning them into a benchmark is a separate piece of work. Property 3 in particular is still an open research problem; the three-route sketch above is orientation, not a solution. The framework is centered on the human-AI pair and does not yet handle handoffs between two AI processors, a case the next twenty-four months will force. The empirical case is still asserted rather than demonstrated. An audit of named deployed systems against the six properties is the next move, and this Coda is the architecture argument that audit would test.
Where this fits among existing frameworks. The FDA's Software as a Medical Device guidance, the Coalition for Health AI (CHAI) assurance standards, the FUTURE-AI consortium's guidelines, the TRIPOD-AI reporting checklist, the Duke deployment framework from Sendak and colleagues, and the ONC's HTI-1 transparency rule all touch parts of this. Most of them put weight on Property 4 (uncertainty and reporting transparency) and Property 6 (post-deployment monitoring and bias evaluation). None of them, as far as I can find, name Property 3 (decompression verification between the model's represented patient-state and the clinician's working model) as a first-class property a system either has or lacks. That is the gap this Coda is trying to fill.
The work of clinical AI is to keep the compressions coupled across hops, across time, and especially between the model and the human who has to act on what it says. Smarter compression is downstream of that.
What this series is pointing at, in the end, is a safety architecture for high-stakes compression: the discipline of making sure every codec hop between a patient and a decision either checks itself or signals when it cannot. Read it as safety engineering more than as philosophy of AI.
The picture from inside finite agency is the picture I have been trying to draw. There are only compressions, each made by some finite thing trying to fit some piece of the world into the room it has. The hardest one to get right is the last one, the one between the model and the clinician who is about to make a decision. That is where most current systems are open-loop. That is where the work is.