In 2017, MIT Technology Review reported that MD Anderson had shelved IBM's Watson for Oncology after the program's costs ran past $62 million. The model passed medical exams beautifully. It struggled with actual patients.
That story is one of the most expensive AI integration case studies ever published, and it has almost nothing to do with the model. The model was trained for one job. The hospital deployed it as if it could do another.
We have run AI inside HavenWizards across 8 venture lines. The lessons are smaller in dollar value than MD Anderson's, but the pattern is the same: the model rarely fails. The integration choice does.
Key Takeaway
Choose the lowest autonomy level that solves the problem. AI integration sits on a four-level spectrum from Assisted (L1) to Autonomous (L4). Most small businesses need L1 or L2. Most vendors sell L3 or L4. Buying high and integrating low is the most expensive mistake we see.
The Problem
Vendors compete on autonomy claims because autonomy sounds impressive in a pitch deck. Buyers default to "the most autonomous version we can afford" because it feels like the future. Both behaviors collide in the same place: a deployment that works in the demo, escalates in production, and either burns trust with customers or quietly gets switched off.
We have rolled back our own L3 deployments to L2 more than once. Each rollback cost us weeks. The framework below is what we now run by default before any AI gets near a customer.
The Framework
01 — The Integration Spectrum (Pick Your Level Before Picking Your Tool)
What we look for:
| Level | Name | Human role | AI role | Right for |
|---|---|---|---|---|
| L1 | Assisted | Decides + acts | Suggests | Autocomplete, smart defaults |
| L2 | Augmented | Handles exceptions | Handles routine | Email triage, ticket routing |
| L3 | Automated | Monitors + overrides | Acts within boundaries | Dynamic pricing, content moderation |
| L4 | Autonomous | Minimal oversight | Full autonomy | Algorithmic trading (rarely justified for SMBs) |
Why it matters: A misclassified problem gets deployed at the wrong level, and the cost shows up only in production. The classification rule we use: if a wrong answer can hurt a customer, you are at L2 maximum. If a wrong answer reaches a customer, you are at L1.
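The classification rule can be written down as a small function so it gets applied the same way every time. This is a minimal sketch, not a definitive implementation: the names `Level` and `max_level` are hypothetical, and it assumes the "reaches a customer" clause is the stricter one and takes precedence when both conditions hold.

```python
from enum import IntEnum


class Level(IntEnum):
    ASSISTED = 1    # L1: AI suggests, human decides and acts
    AUGMENTED = 2   # L2: AI handles routine, human handles exceptions
    AUTOMATED = 3   # L3: AI acts within boundaries, human monitors
    AUTONOMOUS = 4  # L4: full autonomy, minimal oversight


def max_level(wrong_answer_reaches_customer: bool,
              wrong_answer_can_hurt_customer: bool) -> Level:
    """Return the highest level the rule allows (hypothetical helper).

    Assumption: the "reaches a customer" clause is checked first because
    it is the stricter of the two.
    """
    if wrong_answer_reaches_customer:
        return Level.ASSISTED
    if wrong_answer_can_hurt_customer:
        return Level.AUGMENTED
    # The rule sets no cap here. "Choose the lowest level that solves
    # the problem" still applies when picking the actual level.
    return Level.AUTONOMOUS
```

The function returns a ceiling, not a recommendation. A use case that comes back as `Level.AUTONOMOUS` may still be best served at L1 or L2.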
02 — Pattern: The Copilot
What we look for:
- Human initiates every interaction
- Multiple options presented, never single answers
- Accept / modify / reject controls on every suggestion
- Fallback to blank slate always available
Why it matters: If the human cannot ignore the copilot and still do their job, it is a dependency, not a copilot. We use copilots for code review, email drafts, and caption first-passes inside our content engine. The output saves time. The judgment never leaves the human.
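The four checks above translate into a small contract between the suggestion source and the human. The sketch below is illustrative only: `Decision` and `resolve` are hypothetical names, and the empty string stands in for the blank slate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    action: str                  # "accept", "modify", or "reject"
    index: Optional[int] = None  # which suggestion, for accept/modify
    edited_text: str = ""        # the human's version, for modify


def resolve(suggestions: list[str], decision: Decision) -> str:
    """Apply the human's decision to a set of AI suggestions."""
    if len(suggestions) < 2:
        raise ValueError("present multiple options, never a single answer")
    if decision.action == "reject":
        return ""  # blank slate: the human starts from nothing
    if decision.index is None or not 0 <= decision.index < len(suggestions):
        raise ValueError("accept/modify must point at a presented suggestion")
    if decision.action == "accept":
        return suggestions[decision.index]
    if decision.action == "modify":
        return decision.edited_text
    raise ValueError(f"unknown action: {decision.action}")
```

Nothing in `resolve` produces output without a `Decision`, which is the point: the AI supplies candidates and the human supplies the result.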
03 — Pattern: The Triage Engine
What we look for:
- Confidence threshold defined per workflow
- Auto-resolve only at high confidence on low-stakes items
- Review queue for medium confidence
- Human-first below the floor
Why it matters: Triage works when AI sorts and humans decide. The AI never treats. We run triage on inbound partnership inquiries — the model classifies fit, but a human reads every message above the threshold. The model offloads the obvious noes; it does not generate yes signals.
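The routing logic in the checklist above fits in a few lines. This is a sketch under stated assumptions: `Thresholds`, `route`, the queue names, and the example values 0.9 and 0.5 are all hypothetical, and real thresholds would be set per workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    auto_resolve: float  # at or above this, low-stakes items may auto-resolve
    floor: float         # below this, the item goes to a human first


def route(confidence: float, low_stakes: bool, t: Thresholds) -> str:
    """Sort an item into a queue; the model never makes the final call."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence < t.floor:
        return "human_first"
    if confidence >= t.auto_resolve and low_stakes:
        return "auto_resolve"
    # Medium confidence, or high confidence on a high-stakes item
    return "review_queue"
```

With `Thresholds(auto_resolve=0.9, floor=0.5)`, a high-stakes item at 0.95 confidence still lands in the review queue, because auto-resolve requires both high confidence and low stakes.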
04 — Pattern: The Quality Gate
What we look for:
- Humans do the work; AI catches mistakes
- False positive rate held under 10%
- Calibration reviewed monthly
Why it matters: Quality Gates fail when the false positive rate runs high enough that humans stop looking at flags. We run a Quality Gate on social-post output (the LL-SOC-001 caption truncation guardrail): every Threads post is hard-truncated server-side at 500 characters, because the AI will quietly produce 600-character drafts and, without the cap, the entire publish fails silently. AI prompt-following is not a guardrail; programmatic enforcement is.
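A programmatic cap of this kind can be very small. The sketch below is not the LL-SOC-001 implementation, which is not shown here. `enforce_cap` is a hypothetical helper, and it assumes the limit is counted the way Python's `len` counts characters, which may differ from how a given platform counts them.

```python
THREADS_MAX_CHARS = 500  # cap taken from the paragraph above


def enforce_cap(caption: str, limit: int = THREADS_MAX_CHARS) -> str:
    """Hard-truncate a caption so it never exceeds `limit` characters."""
    if len(caption) <= limit:
        return caption
    cut = caption[:limit]
    # Prefer breaking at the last space so a word is not cut in half
    head = cut.rpartition(" ")[0].rstrip()
    # No usable space in the window: fall back to a hard cut
    return head if head else cut
```

The function runs after the model and does not depend on anything the prompt asked for, so an over-length draft is shortened rather than rejected downstream.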
Implementation Checklist
- Map each AI use case to a level (L1-L4) before evaluating tools
- Define a confidence threshold per workflow with documented stakes
- Build a fallback that works with AI completely disabled
- Set a monthly calibration review for any deployed AI
- Document the "why this output for this input" answer for every AI decision
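The fallback item in the checklist can be tested directly if the AI call is an optional dependency. This is a minimal sketch with hypothetical names (`Classifier`, `classify_with_fallback`, the `needs_human` label): passing `model=None` simulates the AI being completely disabled.

```python
from typing import Callable, Optional

# Hypothetical model interface: takes text, returns (label, confidence)
Classifier = Callable[[str], tuple[str, float]]


def classify_with_fallback(text: str,
                           model: Optional[Classifier],
                           threshold: float) -> dict:
    """Use the model when it is available and confident; otherwise hand off."""
    if model is not None:
        try:
            label, confidence = model(text)
            if confidence >= threshold:
                return {"label": label, "source": "ai",
                        "confidence": confidence}
        except Exception:
            pass  # model outage or error: fall through to the human path
    # Reduced capacity, but still functioning: route to a person
    return {"label": "needs_human", "source": "fallback", "confidence": None}
```

Recording the `source` field on every result also gives the monthly calibration review something to count: how often the system ran on the model and how often on the fallback.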
What This Produces
- Deployments that fail at L2 instead of failing at L4 (cheaper recovery)
- Customer-facing AI the customer never has to think about
- A measurable trail when the model drifts (because someone is calibrating)
Common Mistakes
- Buying L3 to solve an L1 problem. Vendors lead with autonomy. Your job is to push the integration down the spectrum, not up.
- Deploying without a fallback. When the model is unavailable or below threshold, the system must still work — at reduced capacity, but functioning.
- Trusting prompt instructions for hard rules. Token limits, prohibited terms, character caps — enforce programmatically. The model will ignore prompts you assumed were binding.
Next Steps
If you are evaluating AI integration for a venture you operate, our free training on execution systems covers the level-spectrum decision in working examples. To see the integration patterns running in production across our portfolio, explore the venture portfolio.
Arena-forged across 8 venture lines. Every pattern tested in our own operations before it reaches a partner. Source on the IBM Watson / MD Anderson reference: MIT Technology Review, "MD Anderson Benches IBM Watson In Setback For Artificial Intelligence In Medicine" (2017).
