
The Minimum-Information Problem: Why Triage Is So Hard
A patient walks through the door. You have ninety seconds.
A 64-year-old man, walking in unassisted. Chief complaint: "back pain since this morning." Vitals: BP 142/86, HR 88, RR 16, SpO2 98%, T 36.9. Appearance: uncomfortable but conversant, no diaphoresis, no distress. That is all you have. Behind him there are eleven other patients in the waiting room. In front of him are decisions: which acuity tier, how fast, what's the resource cost of being wrong.
This is the moment when emergency medicine asks the most of the clinician, and gives them the least to work with.
What triage actually does
Triage is not diagnosis. It is stakes-weighted action selection under irreducible uncertainty. The clinician is choosing a routing path, not inferring a state. The triage problem is underdetermined by design: the data needed to resolve the inference will not arrive in the next ninety seconds, and the clinician's job is to commit to an action before the underdetermination resolves. Treating triage as diagnosis is the most common architectural mistake in clinical decision support: the right answer to "what is this?" at minute 90 is not the right answer to "where should this patient go in the next 30 seconds?"
Given a presentation that a dozen different conditions could plausibly explain, each with a different miss-cost and a different timeframe, the question is: how quickly, by whom, through which pathway. The decision is a bet on the patient's near-future state.
The solution shape that follows is exhaustive enumeration of plausible candidates, each weighted by miss-cost and timeframe. This is a candidate-management problem as much as a perception problem. Structured systems hold that matrix explicit. Human working memory under time pressure, across twelve consecutive patients, does not.
The output of triage in most modern systems is a tier (ESI in the US, CTAS in Canada, Manchester in the UK, ATS in Australia) that maps to a time-to-physician target and a resource bundle. The tier is implicitly a forecast: how bad will this patient be in T+30 minutes, T+2 hours, T+6 hours, if I commit to this resource path.
The asymmetric error model
Triage errors are not symmetric. Under-triage, sending a sick patient to a slower path, can kill. The ESI Level 3 chest pain who turns out to be a thoracic aortic dissection waits in a hallway bed for forty minutes while a syncope being worked up next door has a normal CT. The mistake ends up on the patient's chart and on the medical examiner's report.
Over-triage, sending a well patient to a faster path, costs throughput. The ESI Level 2 chest pain who is actually GERD takes a monitored bed for three hours, blocking the next genuine ACS. The mistake ends up on the operations dashboard and in the next ambulance diversion.
Clinicians and health systems both prefer to err on the side of over-triage. The legal, ethical, and emotional cost of a miss dominates the operational cost of an over-call. That preference is rational at the individual case level and costly at the population level. A system that over-triages at scale has less capacity for the next genuine emergency. An outcome-anchored proxy analysis of more than 5 million ED visits at Kaiser Permanente, the largest such study published to date, found 24.9% over-triage, 3.3% under-triage, and 71.9% outcome-concordant assignments, with disparities by race, language, age, and neighborhood income (JAMA Network Open, 2023). The concordance figure is measured against downstream outcomes such as admission, ICU stay, and critical interventions, which are a proxy for true acuity, not a gold standard.
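The preference for over-triage can be read as an expected-cost threshold. The sketch below is illustrative only: the function names and the unit costs `COST_UNDER` and `COST_OVER` are assumptions made for this example, not values from any cited study. It shows that when a miss is priced far above an over-call, the fast path wins even at a low probability of serious illness.

```python
def fast_path_threshold(cost_under: float, cost_over: float) -> float:
    """Probability of serious illness above which the fast path is cheaper in expectation."""
    return cost_over / (cost_over + cost_under)


def choose_path(p_serious: float, cost_under: float, cost_over: float) -> str:
    """Pick the path with the lower expected cost under a two-outcome model."""
    # Slow path only costs something if the patient turns out to be sick.
    expected_cost_slow = p_serious * cost_under
    # Fast path only costs something if the patient turns out to be well.
    expected_cost_fast = (1.0 - p_serious) * cost_over
    return "fast" if expected_cost_fast < expected_cost_slow else "slow"


# Hypothetical unit costs: one miss is priced at 100 over-calls.
COST_UNDER = 100.0
COST_OVER = 1.0

print(fast_path_threshold(COST_UNDER, COST_OVER))   # just under 1%
print(choose_path(0.02, COST_UNDER, COST_OVER))     # above the threshold
print(choose_path(0.005, COST_UNDER, COST_OVER))    # below the threshold
```

The threshold falls as the miss-cost rises, which is the individual-case logic behind over-triage. The population-level cost (capacity lost to over-calls) is not modeled here.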
Why humans struggle
Three forces converge to make human triage imperfect at the door.
Anchoring on the chief complaint. "Chest pain in a 65-year-old smoker" anchors the clinician to ACS. The mind builds priors that match the modal case, and atypicals fall through. The 65-year-old smoker whose chest pain is actually a thoracic aortic dissection presents with the same complaint, the same demographic, and a normal ECG. The features that would distinguish dissection (pulse deficit, focal neurologic deficit, abrupt onset to maximal intensity) are not on the triage form unless the clinician asks. They will not ask if the anchor is already set. Pattern recognition trained on the modal case routes the catastrophic edge case to the slower path because, at the minimum-information moment, the catastrophic edge case looks like the modal case.
Atypical presentations bury the lethal cases. Women with myocardial infarction more often present with a broader symptom distribution (dyspnea, fatigue, nausea, jaw or back pain) alongside chest pain, which remains common in both sexes. Some authors argue that the "atypical" label itself is a misnomer that has slowed recognition for decades (Journal of the American Heart Association, 2020). Immunosuppressed patients with necrotizing fasciitis present with cellulitis that looks indistinguishable from the benign version at triage and is unsalvageable by the time the wound is bedside-imaged. The atypical-presentation rate is not a small correction. It is the dominant failure mode.
Time pressure across heterogeneous patients, with noisy inputs. The triage clinician sees one patient every two to four minutes across a population with wildly different baseline priors. A "compensated" 85-year-old at HR 110 and BP 110/70 may already be in shock; the same numbers in a 25-year-old are unremarkable. A patient on chronic beta-blockade who cannot mount a tachycardia in compensated shock looks vitally well on arrival. Inputs are also unreliable: triage respiratory rates are routinely miscounted, vitals are taken once and not repeated, and history is filtered through language barriers, intoxication, dementia, or psychiatric overlay.
What the reliability data says
The peer-reviewed performance of triage scales reflects this difficulty. A meta-analysis of ESI reliability found kappa coefficients ranging from 0.46 to 0.98 across studies, indicating moderate to substantial agreement with wide variability by setting, training, and audit cadence (Meta-analysis, 2015).
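For readers less familiar with kappa, here is a minimal sketch of unweighted Cohen's kappa for two raters assigning tiers to the same patients. The tier lists are invented for illustration, and reliability studies often report weighted variants that this sketch does not implement.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: agreement beyond what chance alone would produce."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of patients")
    n = len(rater_a)
    # Observed agreement: fraction of patients given the same tier by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal tier frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[tier] * count_b[tier] for tier in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical tier throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical tiers assigned by two nurses to the same ten patients.
nurse_1 = [2, 3, 3, 4, 2, 3, 5, 3, 4, 2]
nurse_2 = [2, 3, 4, 4, 3, 3, 5, 3, 3, 2]
kappa = cohen_kappa(nurse_1, nurse_2)
print(round(kappa, 2))
```

In this toy data the raters agree on seven of ten patients, yet kappa lands well below 0.7 because some of that agreement is expected by chance.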
But there is a deeper problem with what the scales are actually measuring. ESI is not pure acuity. It also predicts expected resource use. Much of the apparent disagreement in mistriage analyses is over downstream resource intensity, not over imminent danger. Current scales collapse current physiologic instability and downstream resource needs into a single tier. Merging hazard and acuity this way is part of why two reasonable raters can produce two different tiers for the same patient, and part of why the patient who "looks well" with a high-hazard latent diagnosis can be assigned a slow path without the system noticing.
Beyond the initial moment: reassessment and crowding
Triage is also not a single static decision. The dominant safety failure in modern EDs is rarely the initial acuity call alone. It is the combination of an under-recognized presentation at the door and a failed reassessment under crowding and boarding. A patient who is "safe to wait 30 minutes" is not safe to wait 6 hours in a boarded department with no monitored beds. Reassessment, protocol triggers (door ECG, stroke alert, sepsis bundle), and bedside escalation by experienced nurses are the safety net that catches what the initial call misses. Any triage architecture that ignores the waiting room, and the operational reality that boarding has converted waiting rooms into low-monitoring inpatient units, is solving the wrong problem.
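The reassessment loop can be made concrete with a small sketch. Everything here is hypothetical: the `SAFE_WAIT_MIN` windows are invented placeholders rather than targets from ESI, CTAS, or any other scale, and `WaitingPatient` is an illustrative record, not a real system's schema.

```python
from dataclasses import dataclass

# Hypothetical safe-wait windows in minutes per tier (illustrative only).
SAFE_WAIT_MIN = {1: 0, 2: 10, 3: 30, 4: 60, 5: 120}


@dataclass
class WaitingPatient:
    patient_id: str
    tier: int
    minutes_since_last_assessment: int


def overdue_for_reassessment(patients, safe_wait=SAFE_WAIT_MIN):
    """Return patients whose wait exceeds their tier's window, most overdue first."""
    overdue = [
        p for p in patients
        if p.minutes_since_last_assessment > safe_wait[p.tier]
    ]
    return sorted(
        overdue,
        key=lambda p: p.minutes_since_last_assessment - safe_wait[p.tier],
        reverse=True,
    )


waiting_room = [
    WaitingPatient("A", tier=3, minutes_since_last_assessment=360),
    WaitingPatient("B", tier=4, minutes_since_last_assessment=45),
    WaitingPatient("C", tier=2, minutes_since_last_assessment=25),
]
for patient in overdue_for_reassessment(waiting_room):
    print(patient.patient_id, patient.tier, patient.minutes_since_last_assessment)
```

The point of the sketch is that the initial tier carries an implicit expiry. Patient A was "safe to wait 30 minutes" six hours ago, and a system that tracks only the initial call never notices.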
The seasoned clinician's gestalt
It would be a mistake to read the aggregate numbers as evidence that the clinicians making the calls are interchangeable. They are not, and the same triage moment that defeats System 1 in inexperienced hands is sometimes brilliantly handled in experienced ones.
A 2024 prospective study of 2,484 patient-physician encounters in critically ill ED patients offers the most striking recent illustration. Early physician gestalt within the first 15 minutes of the encounter outperformed every standard screening tool (qSOFA, SIRS, SOFA, MEWS) and outperformed a LASSO-regularized machine-learning model trained on the same cohort. Physician gestalt AUC was 0.90 (95% CI 0.88 to 0.92); qSOFA, SIRS, and SOFA all sat at 0.67; MEWS at 0.66; the ML model reached 0.84 (Annals of Emergency Medicine, 2024).
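To read those AUC figures: an AUC is the probability that a randomly chosen patient with the outcome receives a higher score than a randomly chosen patient without it. The sketch below computes it by direct pairwise comparison on invented scores. It is not the study's method or data.

```python
def auc(scores_with_outcome, scores_without_outcome):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    if not scores_with_outcome or not scores_without_outcome:
        raise ValueError("need at least one score in each group")
    wins = 0.0
    for pos in scores_with_outcome:
        for neg in scores_without_outcome:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(scores_with_outcome) * len(scores_without_outcome))


# Hypothetical risk scores (0 to 1) for patients with and without the outcome.
septic = [0.9, 0.8, 0.4]
not_septic = [0.7, 0.3, 0.2, 0.1]
print(auc(septic, not_septic))
```

An AUC of 0.5 means the score ranks no better than chance, so the gap between 0.67 and 0.90 is larger than the raw difference suggests.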

That study is doing real work, but it is doing narrow work. It is not a doorway-triage study. It is not a nurse-triage study. It is not evidence that gestalt generally beats machine learning at undifferentiated triage. It is evidence that early physician impression in a critically ill cohort contains high-value signal not captured by standard structured inputs. That is enough.
What is the gestalt doing? It is integrating exactly the channels that get lost in narrative compression: pallor, work of breathing, the quality of speech, the way a patient holds themselves. None of this is in the structured intake form. All of it is encoded in the experienced clinician's nervous system through tens of thousands of prior encounters. The "sense of doom" is not mysticism. It is rapid pattern matching against a feature space the rest of the system cannot see.
But gestalt is also where the failure modes live. Anchoring bias, fixing on the chief complaint and not revisiting, is gestalt. Availability bias, over-weighting the last dramatic case the clinician saw, is gestalt. The same pattern matching that catches the septic patient at the doorway misses the dissection that looks like ACS because the prior dissection in the clinician's memory looked different. Gestalt is a model, and like any model it has training data, distributional limits, and out-of-distribution failures.
Why AI triage alone is harder
An AI agent doing triage instead of the human is not yet a credible option, for specific reasons worth being clear about.
The physical exam channel is missing. The triage clinician sees pallor, work of breathing, peritoneal posturing, the way a patient lifts their arm. An AI agent reading a structured intake form sees none of this. The single most diagnostic signal at triage, appearance, is the hardest to digitize. Computer-vision systems for pallor, respiratory rate, and gait estimation exist in research settings, but no deployed triage system today reliably operationalizes them at the fidelity of the experienced clinician's eye.
Narrative compression loses information. When a triage nurse converts a patient's account into structured fields, signal is lost. The patient said "it feels like my chest is being squeezed and the pain is going to my back." The structured field reads "chest pain, radiating." The interscapular radiation that screams dissection has been compressed into the same field as the substernal radiation that suggests ACS. An LLM reading the structured field cannot recover what the human conversion threw away.
Calibration on rare lethal conditions is brittle. LLMs trained on common-case distributions are confident on common diagnoses and miscalibrated on rare ones. Aortic dissection occurs in roughly 0.1% of ED chest-pain presentations. At that base rate, a general-purpose model that has seen tens of thousands of chest-pain examples has seen only tens of dissection cases. Its confidence on "this is not a dissection" will be high even when the features warrant suspicion.
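The base-rate problem is easy to see with arithmetic. The sketch below takes the roughly 0.1% prevalence quoted above and a hypothetical cohort of 50,000 chest-pain cases (the cohort size is an assumption for illustration), then scores a degenerate classifier that never predicts dissection.

```python
def always_negative_metrics(n_cases: int, prevalence: float):
    """Metrics for a classifier that answers 'not dissection' for every patient."""
    positives = round(n_cases * prevalence)
    negatives = n_cases - positives
    accuracy = negatives / n_cases            # every negative is "correct"
    sensitivity = 0.0 if positives else float("nan")  # every positive is missed
    return positives, accuracy, sensitivity


# Hypothetical cohort size; prevalence as quoted in the text.
positives, accuracy, sensitivity = always_negative_metrics(50_000, 0.001)
print(positives, accuracy, sensitivity)
```

A model can look almost perfect on accuracy while missing every lethal case, which is why headline accuracy says little about calibration on rare conditions.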
Forward simulation and the wrong output shape. Current LLMs do not naturally project "where will this patient be in two hours under each resource path." They emit a most-likely diagnosis with a confidence score. Triage is not a most-likely-diagnosis problem. It is a stakes-weighted hazard projection problem. The right output is a ranked candidate list and, for each candidate, a hazard envelope showing what happens to this patient on the next-best path versus the optimal path, and how fast.
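One way to picture a hazard envelope is as probability-weighted cumulative risk over the delay each path imposes. The sketch below assumes a constant hourly hazard, which is a simplification, and every number in it is hypothetical. The function names are illustrative, not part of any existing system.

```python
def cumulative_event_risk(hourly_hazard: float, hours: float) -> float:
    """Probability of at least one event within `hours`, assuming a constant hourly hazard."""
    return 1.0 - (1.0 - hourly_hazard) ** hours


def hazard_envelope(p_candidate: float, hourly_hazard: float, delay_hours_by_path: dict) -> dict:
    """Risk attributable to one candidate diagnosis under each routing path's delay."""
    return {
        path: p_candidate * cumulative_event_risk(hourly_hazard, delay)
        for path, delay in delay_hours_by_path.items()
    }


# Hypothetical inputs: 1% chance the candidate is present, 1% hazard per hour
# if untreated, and two paths that differ only in time to clinician.
envelope = hazard_envelope(
    p_candidate=0.01,
    hourly_hazard=0.01,
    delay_hours_by_path={"fast": 0.5, "slow": 3.0},
)
for path, risk in envelope.items():
    print(path, f"{risk:.6f}")
```

The output is a per-path number for each candidate rather than a single most-likely label, which is the change in output shape the paragraph above asks for.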
Where structure changes the call

The AI side's unique contribution is not better signal extraction. It is exhaustive parallel scoring of the plausible candidate space, with each candidate carrying its hazard signature: miss-cost, timeframe, and the de-risking action that would change the disposition. The 2021 AHA/ACC joint chest pain guideline alone enumerates more than forty distinct diagnostic entities for a single chief complaint (Circulation, 2021); multiply each by its hazard signature and de-risking action, and the load is what no clinician can sustain across twelve consecutive patients under time pressure. Structured systems can. The combination, clinician's eye for the signal plus system's exhaustiveness on the hazard math, is the architecture this piece argues for.
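A hazard signature can be held as a small structured record. The sketch below is a hypothetical data shape, not a published schema: the ordinal `miss_cost` scale, the `hours_to_harm` values, and the three-entry candidate list are invented for illustration and fall far short of the candidate space the guideline enumerates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    miss_cost: int        # hypothetical ordinal scale: 1 (low) to 5 (catastrophic)
    hours_to_harm: float  # rough timeframe to harm if missed; illustrative
    derisk_action: str    # cheap maneuver that would change the disposition


# Illustrative subset for the chief complaint "chest pain".
CHEST_PAIN = [
    Candidate("acute coronary syndrome", 5, 2.0, "door ECG"),
    Candidate("aortic dissection", 5, 1.0, "BP in both arms; ask about abrupt onset"),
    Candidate("GERD", 1, 72.0, "none at triage"),
]


def door_checks(candidates, min_miss_cost: int = 4):
    """List de-risking actions for high-stakes candidates, shortest timeframe first."""
    high_stakes = [c for c in candidates if c.miss_cost >= min_miss_cost]
    high_stakes.sort(key=lambda c: c.hours_to_harm)
    return [(c.name, c.derisk_action) for c in high_stakes]


for name, action in door_checks(CHEST_PAIN):
    print(f"{name}: {action}")
```

Nothing in `door_checks` depends on how likely a candidate is. A candidate stays on the list because of its stakes and timeframe, which is the design choice the paragraph above describes.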
Three worked examples make the failure mode and the structural fix concrete. These are not prevalence exemplars. They are geometry-of-risk exemplars. In each one, the modal candidate the human anchors on is genuinely the most likely diagnosis. The system does not have to be smarter about which is more likely. It has to be unwilling to drop the highest-stakes alternative from the active candidate list without a cheap de-risking maneuver at the door.

The three examples below sit in the upper-left quadrant of the diagram above: low current acuity, high latent hazard. That is the territory current acuity-driven scales structurally fail to surface, and the territory a structured DDX with stakes signatures is specifically designed to handle.
1. Type A aortic dissection presenting as resolved chest tightness
A 58-year-old hypertensive man arrives with "chest tightness" that started 45 minutes ago and has mostly resolved. Vitals: BP 138/84 right arm, HR 78, RR 14, SpO2 98%. He is alert and only mildly diaphoretic on arrival. Triage assigns ESI 2 for chest pain and routes him to the chest-pain section for troponin and telemetry.
This is the modal disposition for chest pain in a hypertensive man, and it is correctly anchored on ACS. The miss pattern is that the troponin-first reflex obscures the parallel candidate, Type A aortic dissection. Older teaching estimates early untreated mortality at roughly 1 to 2% per hour, but the exact hourly figure matters less than the central point: untreated dissection is rapidly time-critical, and resolving chest pain with a normal initial ECG does not rule it out. Transient pain in dissection often reflects a tear that has propagated rather than a problem that has gone away.
The de-risking action is small. A structured DDX with per-candidate hazard signatures keeps dissection on the active list and forces one nursing maneuver at triage: take the BP in both arms. A >20 mmHg differential is a specific but insensitive finding: its presence materially raises suspicion, but its absence does not rule dissection out (Klompas, JAMA, 2002; Ohle et al., Academic Emergency Medicine, 2018). The same system asks one history question ("did the pain reach its worst within the first few seconds?") that captures the abrupt-onset feature ACS does not share. Neither costs more than a minute, neither is in the standard ESI workflow, and both follow from the system's refusal to drop dissection from the candidate list given its stakes signature.
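The both-arms check reduces to a few lines. This is a sketch of the asymmetry described above, using the >20 mmHg cut-off from the text. The function names and message strings are hypothetical, and the sketch is not a clinical tool.

```python
def interarm_systolic_flag(right_sbp: int, left_sbp: int, threshold_mmhg: int = 20) -> bool:
    """True when the systolic difference between arms exceeds the threshold."""
    return abs(right_sbp - left_sbp) > threshold_mmhg


def dissection_note(right_sbp: int, left_sbp: int) -> str:
    """A positive finding escalates; a negative finding leaves the candidate in place."""
    if interarm_systolic_flag(right_sbp, left_sbp):
        return "escalate: inter-arm systolic differential present"
    # Insensitive test: absence of the finding must not clear the candidate.
    return "no differential: dissection stays on the active candidate list"


print(dissection_note(138, 112))
print(dissection_note(138, 130))
```

Neither branch removes dissection from the list. The negative branch exists to make explicit that an insensitive test cannot be used to rule out.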
2. Sentinel SAH presenting as headache improved with Tylenol
A 47-year-old woman arrives describing a "really bad headache" that started suddenly that morning and has now improved to 4/10 after two acetaminophen. Vitals: BP 138/82, HR 76, RR 16, SpO2 99%. Neurologic exam intact, no meningismus, GCS 15. She does not volunteer the onset character beyond saying it "came on fast." Triage assigns ESI 3 and places her in the headache queue behind two migraine patients.
The trap is the improvement. The human reads a headache that responded to Tylenol as migraine resolving. The system reads it as a sudden-onset headache event whose subsequent symptomatic improvement does not change the latent hazard. Subarachnoid hemorrhage carries a 30-day mortality of around 50% with substantial disability among survivors, and warning headaches have been reported in older retrospective series in advance of aneurysmal rupture. The literature is affected by recall bias and the exact fraction varies, but the rupture risk after a sentinel event is concentrated in the days that follow. Migraine has no equivalent hazard envelope.
The de-risking action is small, and already encoded in a validated ED decision aid. It is just not embedded at triage. The Ottawa SAH rule operationalizes "did the headache reach maximum intensity within sixty seconds of onset?" as a high-risk historical feature (Perry et al., JAMA, 2013). The rule is intended as a sensitive screen rather than a strong rule-in test, but a positive answer on that question rerouting a "headache now improved" patient out of the migraine queue is exactly the kind of triage-level escalation cue a structured DDX with hazard signatures would surface. This is not the AI being clever. It is a validated clinical decision rule reaching the door rather than waiting for the physician encounter ninety minutes later.
3. Spinal epidural abscess with a normal neurologic exam
A 44-year-old woman with IV drug use disclosed at registration arrives with four days of progressively worsening mid-thoracic back pain. Vitals: BP 122/76, HR 82, RR 14, SpO2 99%, T 37.3. Midline thoracic tenderness on palpation. Strength 5/5 in all limbs, sensation intact, gait normal, no saddle anesthesia, no urinary retention. Triage assigns ESI 3 (back pain, stable vitals, intact neurologic exam) and routes her to the musculoskeletal queue.
The miss pattern is the normal neurologic exam. It is exactly what reassures the human into the MSK anchor, and it is exactly what spinal epidural abscess can produce at this stage of its natural history. The deficit window opens later than the diagnostic window. A normal current neurologic exam does not safely exclude SEA, and outcomes worsen sharply once motor deficits develop and intervention is delayed. Permanent neurologic disability and bacteremic mortality are well-documented risks of missed SEA. MSK back strain has no equivalent hazard envelope.
The de-risking action this time is not a single vital or a single rule item. It is a population-context composite (IV drug use × axial spine pain) that should reroute the patient out of the musculoskeletal queue and into expedited clinician evaluation, with the IVDU-plus-spine-pain combination explicitly flagged in the handoff. The IVDU history markedly raises pretest probability relative to the average back-pain presentation. A structured DDX with hazard signatures uses that composite as the route trigger, not the current neurologic exam. The current exam is precisely the variable the disease is silent on at this stage.
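The composite trigger can be sketched as a routing function that accepts the neurologic exam but deliberately does not route on it. The function name, route labels, and handoff string are hypothetical, chosen only to mirror the paragraph above.

```python
def route_back_pain(iv_drug_use: bool, axial_spine_pain: bool, neuro_exam_normal: bool) -> dict:
    """Route on the population-context composite, not on the current neurologic exam."""
    # neuro_exam_normal is intentionally unused: at this stage the exam is the
    # variable the disease is silent on, so it must not be able to cancel the trigger.
    if iv_drug_use and axial_spine_pain:
        return {
            "route": "expedited clinician evaluation",
            "handoff_flag": "IVDU + spine pain",
        }
    return {"route": "musculoskeletal queue", "handoff_flag": None}


print(route_back_pain(iv_drug_use=True, axial_spine_pain=True, neuro_exam_normal=True))
print(route_back_pain(iv_drug_use=False, axial_spine_pain=True, neuro_exam_normal=True))
```

Keeping the exam in the signature but out of the logic documents the design choice: a reassuring exam is recorded, but it cannot downgrade the route.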
What the three examples share
In each, the human's anchor is correct as priors go, but the lethal candidate is silent on the variable the human is reading and loud on a variable the structured system keeps live: stakes signature, hazard timeframe, population-context prior. The calculation that fails is not probabilistic ranking. It is expected-utility under stakes-asymmetric loss, where Stakes(lethal) × P(lethal) can dominate Stakes(modal) × P(modal) even when P(modal) is much larger. Working memory under time pressure compresses that calculation away; structured systems keep it alive. The AI doesn't have to be smarter than the clinician. It has to be exhaustive where the clinician structurally cannot be. That is what the structural levers (differentiators with likelihood ratios, per-disease stakes with timeframes, hazard derived candidate by candidate, stage-anchored forward projection) are for.
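The difference between probabilistic ranking and stakes-weighted ranking fits in a few lines. The probabilities and stakes below are invented for illustration (they are not estimates from any study, and other candidates are omitted), with stakes expressed in arbitrary units.

```python
# Hypothetical (probability, stakes) pairs for a back-pain presentation.
CANDIDATES = {
    "musculoskeletal strain": (0.95, 1.0),
    "spinal epidural abscess": (0.01, 1000.0),
}


def top_by_probability(candidates: dict) -> str:
    """What pure likelihood ranking surfaces first."""
    return max(candidates, key=lambda name: candidates[name][0])


def top_by_expected_loss(candidates: dict) -> str:
    """What stakes-weighted ranking surfaces first: P(candidate) x Stakes(candidate)."""
    return max(candidates, key=lambda name: candidates[name][0] * candidates[name][1])


print(top_by_probability(CANDIDATES))
print(top_by_expected_loss(CANDIDATES))
```

Both rankings are computed from the same inputs. The modal diagnosis remains the most likely one, and the stakes-weighted view only changes which candidate must be de-risked before the patient is routed.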
Human plus AI is the foreseeable future
The most honest reading of the evidence is that neither pure human triage nor pure AI triage is going to be the right answer in the near term, and the most useful framing is complementarity.
Human gestalt does what no algorithm reliably does today: it integrates the visual, postural, vocal, and somatic channels at the doorway, before the patient has spoken in structured-data form. The clinician's eye is information the system should not duplicate but should depend on.
A well-constructed system could do something the human cannot reliably do under time pressure across consecutive patients: preserve an explicit alternative space, apply the same stakes-weighted check at the twelfth patient as at the first, surface the miss-cost of every candidate diagnosis with its timeframe, and keep its assumptions inspectable. The clinician brings the eye and the hands. The system brings the cognitive scaffolding that survives the twelfth consecutive patient.
This is also where decision support is most useful for the inexperienced clinician, and where the gestalt evidence is least flattering. The experienced physician's sepsis gestalt outperforms screening scores. The new graduate's does not yet. A well-calibrated assistant brings the floor up; a well-trained clinician's gestalt raises the ceiling. The combination raises both.
The case for complementarity is not a guarantee. It is a design hypothesis. Hybrid systems can fail too. Automation bias makes clinicians defer to the model when they shouldn't. Alert fatigue erodes attention. Brittle likelihood ratios can give false precision. Models trained on local outcomes can encode the very access and care-process biases the system is supposed to neutralize. UI burden at triage matters: one extra click per patient kills adoption. The combination is not magic. It is engineering, and it has to be earned locally.
The honest accounting
None of this fixes triage. Better-structured priors do not solve the signal-extraction problem, do not replace the missing physical exam, and do not compensate for narrative compression at intake. Even the best human-AI hybrid is still working in the minimum-information moment of medicine, and the patient at the door of any given emergency department is still going to be assessed in ninety seconds.
What better structure and better tooling can do is force the reasoner (human, AI, or both) to consider stakes-weighted alternatives at the moment of commitment, to make the waiting-room reassessment visible, and to use the experienced clinician's eye for what no system can yet replicate. The bar is not "AI replaces the triage nurse." The bar is "the AI-assisted triage moment produces a lower miss rate than the unaided one." That bar has not yet been cleared in any large-scale prospective deployment we are aware of, and it is not obvious it will be cleared by scaling current model architectures alone. The path forward is the seasoned clinician with the eye and the hands, supported by explicit, machine-readable disease knowledge that encodes stakes, timeframes, and differentiators as first-class structured fields, and that survives the moment when the chief complaint anchors and System 2 stops running. The work is in the knowledge base, in the interface, in the waiting-room reassessment loop, and in the partnership.
Triage will continue to be hard. It is also the moment with the largest asymmetric cost: a system that improves the catastrophic-miss rate by even a few percent saves lives.
References
- Evaluation of Version 4 of the Emergency Severity Index in US Emergency Departments for the Rate of Mistriage. JAMA Network Open, 2023.
- Reliability of the Emergency Severity Index: Meta-analysis. PMC4318610, 2015.
- Klompas M. Does this patient have an acute thoracic aortic dissection? JAMA, 2002.
- Ohle R et al. Clinical Examination for Acute Aortic Dissection: A Systematic Review and Meta-analysis. Academic Emergency Medicine, 2018.
- Typical and Atypical Symptoms of Acute Coronary Syndrome: Time to Retire the Terms? Journal of the American Heart Association, 2020.
- Early Physician Gestalt Versus Usual Screening Tools for the Prediction of Sepsis in Critically Ill Emergency Patients. Annals of Emergency Medicine, 2024.
- Perry JJ et al. Clinical decision rules to rule out subarachnoid hemorrhage for acute headache. JAMA, 2013.
- Gulati M et al. 2021 AHA/ACC/ASE/CHEST/SAEM/SCCT/SCMR Guideline for the Evaluation and Diagnosis of Chest Pain. Circulation, 2021;144:e368–e454.