A field guide to AI cost optimization for VP Engineering teams running clinical and operational LLMs in production.
What's actually going wrong and what changes once teams stop guessing.
Inference now eats 85 percent of the enterprise AI budget.
In healthcare, the problem looks different than it does at a SaaS company.
Quality matters more in healthcare than in almost any other domain.
Not every request needs the frontier model. A medication reconciliation summary does.
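A minimal sketch of what routing by request class could look like. The tier names and every request class other than the medication reconciliation summary are hypothetical, chosen only for illustration; the article does not name specific models or classes.

```python
# Hypothetical tier labels; substitute whatever models you actually run.
FRONTIER = "frontier-model"
SMALL = "small-model"

# Assumed routing table. Only the medication reconciliation entry reflects
# the text above; the other request classes are illustrative placeholders.
ROUTES = {
    "medication_reconciliation_summary": FRONTIER,
    "appointment_reminder": SMALL,
    "message_triage_label": SMALL,
}


def route(request_class: str) -> str:
    """Return the model tier for a request class.

    Unknown classes fall back to the frontier tier, so a request that
    has not been classified yet is never silently downgraded.
    """
    return ROUTES.get(request_class, FRONTIER)
```

Defaulting unknown classes to the frontier tier is a design choice in this sketch: it trades some cost for safety until a class has been reviewed.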
For any task that runs more than 50,000 times a month, distillation pays. A distilled model is 5 to 10 times smaller, runs on cheaper hardware, and typically holds 95 percent of the original's accuracy on the narrow task it was trained for.
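As a rough sketch, the volume rule above can be applied to a table of monthly run counts, with a simple payback estimate alongside it. The 50,000 threshold comes from the text; the payback formula and all cost inputs are assumptions for illustration, not figures from our audits.

```python
from typing import Dict, List

DISTILLATION_THRESHOLD = 50_000  # monthly runs, per the rule of thumb above


def distillation_candidates(monthly_runs: Dict[str, int]) -> List[str]:
    """Return task names that run more than the threshold each month."""
    return sorted(
        task for task, runs in monthly_runs.items()
        if runs > DISTILLATION_THRESHOLD
    )


def payback_months(
    monthly_runs: int,
    frontier_cost_per_run: float,   # hypothetical input
    distilled_cost_per_run: float,  # hypothetical input
    one_time_cost: float,           # hypothetical training and eval cost
) -> float:
    """Months until the one-time distillation cost is recovered.

    Returns infinity when the distilled model is not cheaper per run.
    """
    monthly_saving = monthly_runs * (frontier_cost_per_run - distilled_cost_per_run)
    if monthly_saving <= 0:
        return float("inf")
    return one_time_cost / monthly_saving
```

Running the candidate filter first and the payback estimate second keeps the decision tied to measured volume rather than to intuition about which tasks feel expensive.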
About 30 to 40 percent of clinical LLM traffic in our audits is duplicate or near-duplicate. The same patient message gets summarized for three different staff members.
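One way to catch that repeated work is a cache keyed on the task and a normalized copy of the input. The sketch below is an assumption about how such a cache might look, not a description of any particular deployment: `SummaryCache` and `cache_key` are hypothetical names, the in-memory dict stands in for a real store, and normalization only catches near-duplicates that differ in case or whitespace.

```python
import hashlib
from typing import Callable, Dict


def cache_key(task: str, text: str) -> str:
    """Hash the task plus a case- and whitespace-normalized copy of the text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(f"{task}\n{normalized}".encode("utf-8")).hexdigest()


class SummaryCache:
    """Wraps a summarize function so repeated inputs are only computed once."""

    def __init__(self, summarize: Callable[[str], str]) -> None:
        self._summarize = summarize        # the expensive model call
        self._store: Dict[str, str] = {}   # stand-in for a real cache store
        self.hits = 0
        self.misses = 0

    def get(self, task: str, text: str) -> str:
        key = cache_key(task, text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._summarize(text)
        self._store[key] = result
        return result
```

Tracking `hits` and `misses` gives a direct measurement of the duplicate rate on your own traffic.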
If your AI cost line is outpacing your value story, the answer is not 'use less AI.' The answer is a routing program with receipts.
No. Routing and caching reduce average latency. Distillation can be slower at the tail in early weeks but stabilizes once the eval loop is in place. Across our deployments, P95 latency dropped 20 to 35 percent.
An evaluation set of at least 200 labeled examples per request class, scored weekly. The proof is the score over time, not a one-shot benchmark.
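A minimal sketch of that weekly scoring loop, assuming each labeled example reduces to a pass or fail. The 200-example minimum comes from the text; the accuracy metric, the function names, and the 0.02 regression tolerance are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

MIN_EXAMPLES = 200  # minimum labeled examples per request class, per the text


def weekly_scores(results: List[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy per request class from (request_class, correct) pairs.

    Raises ValueError if any class has fewer than MIN_EXAMPLES examples.
    """
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for request_class, ok in results:
        totals[request_class] += 1
        correct[request_class] += int(ok)

    scores: Dict[str, float] = {}
    for request_class, n in totals.items():
        if n < MIN_EXAMPLES:
            raise ValueError(
                f"{request_class}: {n} examples, need at least {MIN_EXAMPLES}"
            )
        scores[request_class] = correct[request_class] / n
    return scores


def regressed(history: List[float], tolerance: float = 0.02) -> bool:
    """True if the latest weekly score is more than `tolerance` below the prior best."""
    if len(history) < 2:
        return False
    return history[-1] < max(history[:-1]) - tolerance
```

Keeping the full score history per class, rather than only the latest number, is what makes a slow decline visible.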
No. The four levers work across providers. We have run the same program on Anthropic, OpenAI, and Bedrock-hosted models without switching.