OpenAI says its free default model now matches frontier performance on health questions. Here’s what the data shows, how it was tested, and what’s missing from the announcement.
Roughly 230 million people ask ChatGPT a health question in an average week, according to OpenAI. That is more people than live in Brazil, asking about lab results, drug interactions, or whether a headache warrants a trip to urgent care. On June 18, 2026, OpenAI published a detailed breakdown of how its latest free-tier model, GPT-5.5 Instant, is handling that volume. The numbers are worth unpacking: both the encouraging parts and the parts the announcement spends less time on.
What Actually Changed
GPT-5.5 Instant isn’t a brand-new product. It’s the model that already replaced GPT-5.3 Instant as the default for every free ChatGPT user, a switch OpenAI made back in May 2026. What’s new is the health-specific tuning layered on top: better recognition of when a symptom needs urgent attention, more willingness to ask a follow-up question instead of guessing, and clearer flagging of uncertainty rather than confident-sounding answers.
The headline figure is a 71% drop in flagged factuality issues in health responses over a two-month window, drawn from OpenAI’s own monitoring of live production traffic rather than a curated test set. That’s a meaningful claim. It also comes with an obvious caveat: it’s OpenAI measuring OpenAI’s model, using OpenAI’s own definition of what counts as a flagged issue.
Flagged Factuality Issues in Health Responses
- Two months ago: baseline
- Today (GPT-5.5 Instant): down 71%
Source: OpenAI’s privacy-preserving production traffic monitors, covering billions of weekly health-related messages. Figures are self-reported.
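A relative figure like “down 71%” says nothing about how often issues were flagged to begin with, and the figures above give only the relative change. The short Python sketch below illustrates the arithmetic; the baseline rates and the helper name `remaining_rate` are hypothetical, chosen only to show how the same percentage drop maps to very different absolute rates.

```python
def remaining_rate(baseline_rate: float, relative_drop: float) -> float:
    """Return the rate left after a relative reduction (0.71 means 'down 71%')."""
    if not 0.0 <= relative_drop <= 1.0:
        raise ValueError("relative_drop must be between 0 and 1")
    return baseline_rate * (1.0 - relative_drop)


# Hypothetical baselines (flagged issues per 1,000 health responses).
# These are illustrative assumptions, not figures from OpenAI.
for baseline in (1.0, 10.0, 50.0):
    after = remaining_rate(baseline, 0.71)
    print(f"baseline {baseline:5.1f} per 1,000 -> {after:5.2f} per 1,000 after a 71% drop")
```

Whatever the real baseline is, 29% of it remains after a 71% drop, so the absolute rate after the change depends entirely on a number the summary does not include.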
GPT-5.3 Instant vs. GPT-5.5 Instant, Side by Side
| Measure | GPT-5.3 Instant | GPT-5.5 Instant |
|---|---|---|
| Default model for free users | Was, until May 2026 | Yes |
| Performance vs. frontier “Thinking” models on HealthBench Professional | Below the frontier level | Comparable |
| Hallucinated claims on high-stakes prompts (medicine, law, finance) | Baseline | 52.5% fewer |
| Rated against physician-written answers (3,500 reviewed cases) | Not tested this way | Scored higher on accuracy, communication, completeness, and decision helpfulness |
| Flagged factuality issues in live health traffic | Baseline | Down 71% over two months |
The Physician Network Behind the Numbers
None of this happened in a vacuum. OpenAI has built a standing panel of more than 260 physicians across 60 countries, working in 49 languages and 26 specialties. Their job isn’t to write ChatGPT’s answers. It’s to grade them: deciding whether a sample response is accurate, whether it missed something a real doctor would catch, whether it explained uncertainty honestly or just sounded confident. Multiply that by more than 700,000 reviewed responses, and you get the rubrics that shape how the model gets trained and scored, measured in part against HealthBench, OpenAI’s own health evaluation suite.
The more interesting test compared the model directly against humans. OpenAI had physicians write answers to realistic health scenarios with no AI assistance, just unlimited time and a search engine. A separate panel then judged those answers against ChatGPT’s across 3,500 cases. GPT-5.5 Instant came out ahead on accuracy, communication, completeness, and what OpenAI calls “health decision helpfulness.” Doctors working with no time limit, though without AI help, still lost to the model on these specific measures.
That’s a genuinely interesting result, and it’s also OpenAI grading its own homework. As Search Engine Journal’s Matt Southern pointed out when the update landed, the claims rest on OpenAI’s internal benchmarks and physician network rather than independent, peer-reviewed testing.
What It Looks Like in Practice
In one example OpenAI published, a user asked why a doctor might order an MRI before a steroid injection for sciatica. A weaker model might have just answered the question. GPT-5.5 Instant walks through the actual reasoning: confirming what’s causing the pain, picking the correct injection site, ruling out red flags like infection or a tumor, and weighing whether an injection is even the right next step. It closes by suggesting a specific question to bring to the appointment. That’s the shape of the improvement OpenAI is going for — less “here’s an answer,” more “here’s how to think about your next conversation with a clinician.”
What the Announcement Doesn’t Dwell On
OpenAI’s post is, understandably, a highlight reel. A fuller picture includes findings from researchers studying this same category of tool over the past several months, independent of OpenAI.
A BMJ Open audit published in April 2026 tested five popular chatbots against 250 health questions and rated nearly half of the responses as problematic, either inaccurate or missing information a patient would need. One failure mode researchers flagged was “false balance”: a chatbot correctly notes that an alternative cancer treatment is unproven, then describes it in the same even tone used for chemotherapy. Dr. Michael Foote of Memorial Sloan Kettering Cancer Center, who wasn’t involved in the study, was blunt about the stakes — unproven remedies “hurt people directly” when patients lean on them instead of treatment that works.
A February 2026 Oxford study found something related but distinct. In a randomized trial of nearly 1,300 people working through realistic medical scenarios, participants often didn’t know what information the AI needed from them to give good advice, and the responses they got mixed accurate and inaccurate recommendations in ways that were hard to untangle. The researchers’ takeaway: standard benchmark scores don’t capture what actually goes wrong when a real, uncertain person tries to use one of these tools.
Then there’s ECRI, the nonprofit patient-safety organization that publishes an annual hazard list for hospitals. In January 2026, it named AI chatbot misuse the single biggest health technology hazard of the year, ranking it above cybersecurity failures and recalled medical devices. ECRI’s own testing turned up chatbots inventing body parts, recommending unnecessary procedures, and, in one case, giving instructions that risked burns from incorrect electrode placement.
One detail that’s easy to miss: ChatGPT is not HIPAA-compliant. OpenAI doesn’t sign business associate agreements for the consumer product, so anything typed into a free or Plus ChatGPT conversation isn’t covered by the same privacy protections that apply to your actual medical records.
How to Actually Use This Update
None of the caveats above mean the tool is useless. They mean it’s a tool with a specific, fairly narrow job description.
Reasonable uses:
- Translating a lab result or diagnosis into plain language
- Preparing questions for an upcoming appointment
- General wellness and habit-building questions
- Understanding what a medication is generally used for
Don’t rely on it for:
- Emergency symptoms such as chest pain, breathing trouble, or severe bleeding (call emergency services instead)
- Replacing a second medical opinion
- Dosage decisions involving controlled or prescription medication
- Deciding to stop or change a treatment your doctor prescribed
Where This Fits in OpenAI’s Bigger Healthcare Push
The health update doesn’t exist on its own. OpenAI launched ChatGPT Health in January 2026, a feature that connects the assistant to health apps and medical records, though it’s still rolling out through a waitlist. There’s also ChatGPT for Clinicians, built for documentation and research workflows, and a separate enterprise offering called OpenAI for Healthcare. Leading the research effort is Karan Singhal, who joined OpenAI in mid-2024 after helping build Med-PaLM, Google’s medical-focused model family. The free-tier improvement covered here is the visible, consumer-facing piece of a much larger bet that healthcare is where AI’s gains translate most directly into people’s daily lives.
The Bottom Line
GPT-5.5 Instant is a real, measurable improvement over its predecessor. OpenAI’s production monitoring and its 3,500-case physician review point in the same direction, though both are OpenAI’s own measurements. Whether “better than the last version” is the same thing as “good enough to trust with a health decision” is a separate question, and the ECRI and BMJ Open findings suggest the honest answer is: better, but not there yet. Treat it the way you’d treat a well-read friend with no medical license: useful for figuring out what to ask your actual doctor, not a substitute for asking them.
This article is for informational purposes and does not constitute medical advice. If you’re experiencing a medical emergency, call your local emergency services immediately.
Sources:
- OpenAI — Improving health intelligence in ChatGPT
- OpenAI — HealthBench
- TechCrunch — OpenAI releases GPT-5.5 Instant
- Search Engine Journal — OpenAI brings improved health responses to free ChatGPT
- U.S. News — Study finds AI chatbots can give misleading health advice
- University of Oxford — New study warns of risks in AI chatbots giving medical advice
- ECRI — Misuse of AI chatbots tops annual list of health technology hazards

