Why 91.8% of clinicians have encountered medical AI hallucinations, the three structural failure modes, and the engineering layer that sits between model output and clinical workflow.
The model was not the problem. The absence of a validation layer between model output and clinical workflow was the problem.
The model produced confidence scores that were never checked against the actual chart.
There were no engineering criteria for when human review was mandatory versus advisory.
Hallucinations were discovered by clinicians, not caught by systems — and one near-miss put the entire contract at risk.
The model asserts a clinical fact that is not in the record: a lab value, a medication, a prior diagnosis. These fabrications are the most dangerous because they look like data retrieval.
64–72% of clinical hallucinations stem from causal or temporal reasoning errors. The model has the facts but cannot sequence them correctly.
The output reads as complete but drops critical information, such as a medication change or an abnormal result that context makes look normal. JAMA found these omissions as dangerous as fabrications and harder to catch.
Model outputs are not checked against structured chart data before they reach a clinician. Renal function claims are never reconciled against the creatinine value the model just read.
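A minimal sketch of what such a reconciliation step could look like, assuming the model's numeric claims have already been extracted into structured form. The `Claim` schema, the field names, and the 1% tolerance are all hypothetical choices made for illustration, not part of any particular product or standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One numeric claim extracted from model output (hypothetical schema)."""
    field: str    # e.g. "creatinine_mg_dl"
    value: float


def reconcile(claims, chart, rel_tol=0.01):
    """Compare each model claim with the structured chart value.

    `chart` maps field names to charted numeric values. Returns a list of
    (field, reason) tuples; an empty list means every claim matched.
    `rel_tol` is an illustrative tolerance, not a clinical threshold.
    """
    issues = []
    for claim in claims:
        charted = chart.get(claim.field)
        if charted is None:
            # The model asserted a value the record does not contain.
            issues.append((claim.field, "not in chart"))
        elif abs(claim.value - charted) > rel_tol * abs(charted):
            issues.append(
                (claim.field, f"model={claim.value}, chart={charted}")
            )
    return issues
```

Under these assumptions, a summary that cites a creatinine of 1.0 when the chart holds 2.4 would be returned as a discrepancy before the output reaches a clinician, and a value with no charted counterpart would be reported as "not in chart".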
Confidence scores are calibrated against a general validation set, not against high-stakes output categories. 80% confident on a scheduling time is not 80% confident on a dose.
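One way to make this gap visible is to measure calibration separately for each output category rather than once over the whole validation set. The sketch below assumes a labeled set of `(category, confidence, was_correct)` records is available; the record shape and category names are hypothetical.

```python
from collections import defaultdict


def calibration_gap_by_category(records):
    """Return {category: mean confidence - observed accuracy}.

    `records` is an iterable of (category, confidence, was_correct) tuples.
    A positive gap means the model is overconfident in that category.
    """
    # Per category: [sum of confidences, number correct, number of records]
    totals = defaultdict(lambda: [0.0, 0, 0])
    for category, confidence, was_correct in records:
        entry = totals[category]
        entry[0] += confidence
        entry[1] += int(was_correct)
        entry[2] += 1
    return {
        category: conf_sum / n - correct / n
        for category, (conf_sum, correct, n) in totals.items()
    }
```

If scheduling outputs at 0.8 confidence are right four times in five while dose outputs at the same confidence are right two times in five, the scheduling gap is near zero and the dose gap is about 0.4. A single pooled figure would hide that difference.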
Most health systems require clinician review by policy. Few have engineering criteria for when review is mandatory versus advisory — so clinicians habituate and stop reading.
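Engineering criteria of this kind can be written down as an explicit rule rather than left to policy. The following is a sketch only: the category names, the confidence floor, and the rule order are assumptions chosen for illustration, and a real deployment would set them with clinical governance.

```python
from enum import Enum


class Review(Enum):
    MANDATORY = "mandatory"
    ADVISORY = "advisory"


# Hypothetical high-stakes categories; a real list would come from clinical leads.
HIGH_STAKES = frozenset({"dose", "medication_change", "lab_interpretation"})

# Hypothetical floor below which any output must be reviewed.
CONFIDENCE_FLOOR = 0.7


def review_level(category, confidence, chart_discrepancies):
    """Decide whether clinician review is mandatory or advisory.

    `chart_discrepancies` is the number of claims that failed a check
    against structured chart data, however that check is implemented.
    """
    if chart_discrepancies > 0:
        return Review.MANDATORY   # output contradicts or goes beyond the record
    if category in HIGH_STAKES:
        return Review.MANDATORY   # stakes outweigh any confidence score
    if confidence < CONFIDENCE_FLOOR:
        return Review.MANDATORY   # the model itself is unsure
    return Review.ADVISORY
```

Reserving the mandatory tier for outputs that meet a stated criterion is one way to counter habituation: a clinician who sees the mandatory flag rarely has more reason to read it when it appears.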