OpenAI's GPT-5.5 Instant outperforms physicians on 3,500 health questions, cutting factuality errors by 71%
OpenAI announced on June 18 that GPT-5.5 Instant, the model now powering health responses for all free ChatGPT users, outperformed physician-written answers on a set of 3,500 real-world health questions rated by an independent physician panel, and reduced factuality errors by 71% compared with its predecessor, GPT-5.3 Instant.
What's new
An independent panel of physicians rated GPT-5.5 Instant's responses as higher-quality than physician-written answers on accuracy, communication, and completeness across 3,500 reviewed interactions. The model also logged a 71% drop in factuality issues versus its predecessor, GPT-5.3 Instant, which handled health queries from March through May 2026.
The update specifically targets three behaviors where earlier models fell short: recognizing when a user's description warrants urgent care, asking for relevant context before generating a recommendation, and calibrating expressed confidence to match actual certainty. GPT-5.5 Instant began rolling out to all free ChatGPT users on June 18.
More than 230 million people use ChatGPT each week for health and wellness questions, including interpreting lab results, preparing for appointments, and navigating insurance — making this one of OpenAI's highest-reach capability updates to date.
Context
OpenAI introduced ChatGPT Health in January 2026, letting users optionally connect medical records and wellness apps so the assistant could help interpret test results and prepare for appointments. The June 18 update does not change that integration layer — it upgrades the underlying model serving all health-adjacent queries, including from users who have not opted into dedicated health features.
GPT-5.5 Instant is the lightweight, low-latency member of OpenAI's GPT-5.5 family, distinct from GPT-5.5 and GPT-5.5 Pro. Its selection for health queries reflects a deliberate trade-off: prioritize response speed and broad access over the deeper reasoning of heavier models, given that the majority of health conversations are informational rather than diagnostic.
The 3,500-question evaluation was conducted by a physician panel that reviewed the interactions blind, rating each response without knowing whether it came from GPT-5.5 Instant or a human physician.
Why it matters
Outperforming physician-written answers on a 3,500-question blind evaluation is a significant result, with the important caveat that it does not make the model appropriate for clinical diagnosis. It signals that for the everyday informational queries that make up the bulk of AI health use (drug interactions, symptom context, appointment preparation), the model's answers are now at least as good as a practicing physician's in the judgment of a blinded physician panel.
The 71% reduction in factuality errors carries heightened significance in a domain where errors can have real consequences. Factuality failures in health AI range from benign (outdated information) to serious (incorrect dosage guidance). A reduction of more than two-thirds across a standardized test set, achieved in a single model generation, suggests targeted investment in health-specific fine-tuning and evaluation rather than general capability scaling alone.
The release increases pressure on competing health AI products, including Google's Gemini health features and specialized tools from companies like Ada Health and Babylon. With GPT-5.5 Instant now serving health queries to its full free user base by default, OpenAI has moved physician-level health answering from an opt-in feature to the baseline experience for hundreds of millions of users.
Corroborating sources
- OpenAI
https://openai.com/index/improving-health-intelligence-in-chatgpt
“In health, progress means delivering responses that are accurate, understandable, and grounded in good judgment: recognizing when more context is needed, explaining uncertainty without overstating confidence, and helping people understand when to seek care.”