3 Sources
[1]
OpenAI Launches HealthBench, a Dataset That Benchmarks Healthcare AI Models
OpenAI, the creator of artificial intelligence chatbot ChatGPT, has a new open-source benchmark called HealthBench that lets the healthcare industry evaluate AI models, the company said in a blog post on Monday. The benchmark was built in partnership with 262 physicians across 60 countries and has 5,000 realistic health conversations baked in.

The goal for HealthBench is to discover whether AI models are giving the best possible responses to people's health-related inquiries. Each response is measured against physician-written rubric criteria, with each criterion weighted to match the physicians' judgement. The rubric is scored by GPT-4.1. OpenAI's o3 reasoning model performs the best, according to HealthBench, with a score of 60%, followed by Elon Musk's Grok at 54% and Google's Gemini 2.5 Pro at 52%.

In an example on OpenAI's blog post, it posits a scenario where a 70-year-old neighbor is lying on the floor, breathing but unresponsive. The person asks AI what should be done. A model then gives an answer with steps on what to do, such as calling emergency services, checking breathing and positioning the airway. HealthBench then scores the response, explaining what the model answered correctly and what could be improved upon, and gives a final score, in this case 77%.

The benchmark covers 49 languages, including Amharic and Nepali, and spans 26 medical specialties, such as neurological surgery and ophthalmology. OpenAI didn't immediately respond to a request for comment.
[2]
OpenAI Wants to be a '24/7 World-Class Doctor' in Your Pocket | AIM
OpenAI has launched a new benchmark to test how well AI models handle complex medical conversations.

OpenAI is making a serious push into the healthcare sector with the release of a new benchmark called HealthBench, designed to evaluate the capabilities of AI systems in health. The benchmark aims to help large language models (LLMs) support patients and clinicians with health discussions that are trustworthy, meaningful, and open to continuous improvement. HealthBench looks at seven key areas, including emergency care, managing uncertainty, and global health.

"What if you had a world-class doctor in your pocket, 24/7, at no cost? That's the promise of AI in healthcare, but mistakes can be catastrophic. That's why OpenAI launched HealthBench, a new benchmark to test how well AI models handle real, complex medical conversations," Matthew Berman, CEO of Forward Future, wrote on X.

Developed in partnership with 262 physicians from 60 countries, HealthBench includes 5,000 realistic health-related conversations, each paired with a custom physician-created rubric for grading model responses.

OpenAI shared in its blog that it used HealthBench to evaluate how well its latest models perform on healthcare tasks. According to the company, recent models have improved quickly, with o3 outperforming others, including Claude 3.7 Sonnet and Gemini 2.5 Pro (March 2025 version), in the tests. OpenAI also mentioned that small models have gotten much better lately. GPT‑4.1 nano, for example, beats the August 2024 GPT‑4o model while being 25 times less expensive.

In many instances, the LLMs were found to write better answers than physicians' written responses. By April this year, the newest models' responses had reached a point where physician edits no longer improved their quality.

Online, many users have shared stories of how ChatGPT helped them make sense of complicated health problems, ranging from chronic back pain to unexplained jaw issues. "I've had half a dozen healthcare-related issues in my family over the last few months, and ChatGPT has been more helpful than the physician...," said Joe Flaherty, a former Wired staff writer, in a post on X. "ChatGPT outperforms human doctors for me. It diagnosed a condition I have and recommended the correct treatment after two human specialists failed. Perfect use-case for LLMs as it requires knowledge & pattern matching," another user said on X.

However, experts warn against over-dependence on AI. "Using artificial intelligence for diagnosis and even for prescriptions, one has to be really cautious, because physical examination is missing," Dr CN Manjunath, senior cardiologist and director of the Sri Jayadeva Institute of Cardiovascular Sciences and Research, Bengaluru, told AIM in an earlier interaction. He further emphasised that, despite the widespread use of technology in healthcare, physical evaluation remains a cornerstone of accurate diagnosis. Though medications may alleviate symptoms, he advised always following up with a qualified medical practitioner for comprehensive care. He explained that once a particular diagnosis has been made, patients can follow up with ChatGPT.

OpenAI's growing interest in healthcare is reflected in its job openings, which include roles such as health AI research engineer and healthcare software engineer. This development comes against the backdrop of OpenAI appointing Fidji Simo as the CEO of applications, allowing Sam Altman to focus more on research, compute, and safety.
Time and time again, Altman has reiterated that he is most excited about scientific discoveries with the help of AI. "I'm personally most excited about AI for science at this point. I'm a big believer that the most important driver of the world and people's lives getting better and better is new scientific discovery," said Altman in a recent TED talk. He added that they hear from scientists about how the latest AI models have been making them more productive and impacting what they are able to discover.

"I deeply believe that AGI can extend human life by broadening trustworthy access to care and accelerating longevity research," said Karina Nguyen, researcher at OpenAI, in a post on X. Even Bryan Johnson, known for his radical approach to longevity and anti-ageing, weighed in on OpenAI's development. He pointed out that AI-assisted physicians had outperformed human physicians without reference materials, adding that by April, the responses were so strong that physicians could no longer improve them.

OpenAI is not alone in focusing on healthcare. Google recently launched TxGemma, a new suite of open-source language models built to support therapeutic development. The models are intended to improve tasks such as drug candidate assessment, molecule property prediction, and clinical trial outcome estimation by applying LLM capabilities to biomedical data. In 2024, Google developed Med-Gemini, a next-generation set of healthcare models that build on Gemini's advanced multimodal and reasoning capabilities through fine-tuning on de-identified medical data. To support care providers, Google, in 2023, introduced MedLM and Search for Healthcare. These are built to handle medical queries and are available on the Google Cloud Vertex AI platform. They help clinicians make better-informed decisions and enable patients to receive more accurate and personalised care.

Dario Amodei, chief of OpenAI rival Anthropic, has also expressed excitement about AI's potential in biology. "I'm optimistic that diseases which have plagued us for thousands of years -- such as cancer, Alzheimer's, and ageing itself -- may be treatable," he said. In his recent essay 'Machines of Loving Grace', Amodei outlined a future in which AI could "double our lifespans, cure all diseases, and create untold global economic wealth". Anthropic recently launched the AI for Science Program to support scientific research and discovery by giving researchers access to its API. The program offers free API credits for high-impact projects, with a focus on biology and life sciences.
[3]
OpenAI Releases HealthBench Dataset to Test AI in Health Care
TUESDAY, May 13, 2025 (HealthDay News) -- OpenAI has unveiled a large dataset to help test how well artificial intelligence (AI) models answer health care questions. Experts call it a major step forward, but they also say more work is needed to ensure safety.

The dataset -- called HealthBench -- is OpenAI's first major independent health care project. It includes 5,000 "realistic health conversations," each with detailed grading tools to evaluate AI responses, STAT News reported.

"Our mission as OpenAI is to ensure AGI is beneficial to humanity," Karan Singhal, head of the San Francisco-based company's health AI team, said. AGI is shorthand for artificial general intelligence. "One part of that is building and deploying technology," Singhal said. "Another part of it is ensuring that positive applications like health care have a place to flourish and that we do the right work to ensure that the models are safe and reliable in these settings."

The dataset was created with help from 262 doctors who have worked in 60 countries. They provided more than 57,000 unique criteria to judge how well AI models answer health questions. HealthBench aims to fix a common problem: comparing different AI models fairly. "What OpenAI has done is they have provided this in a scalable way from a really big, reputable brand that's going to enable people to use this very easily," Raj Ratwani, a health AI researcher at MedStar Health, said.

The 5,000 examples in HealthBench were made using synthesized conversations designed by physicians. "We wanted to balance the benefits of being able to release the data with, of course, the privacy constraints of using realistic data," Singhal told STAT News. The dataset also includes a special group of 1,000 hard examples where AI models struggled. OpenAI hopes this group "provides a worthy target for model improvements for months to come," STAT News reported.

OpenAI also tested its own models as well as models from Google, Meta, Anthropic and xAI. OpenAI's o3 model scored the best, especially in communication quality, STAT News reported. But models performed poorly in areas like context awareness and completeness, experts said.

Some warned about OpenAI grading its own models. "In sensitive contexts like healthcare, where we are discussing life and death, that level of opacity is unacceptable," Hao explained. Others noted that AI itself was used to grade some of the responses, which could result in errors being overlooked. It "may hide errors shared by both model and grader," Girish Nadkarni, head of artificial intelligence and human health at the Icahn School of Medicine at Mount Sinai in New York City, told STAT News. He and others called for more reviews to ensure models work well in different countries and among different demographics. "HealthBench improves LLM healthcare evaluation but still needs subgroup analysis and wider human review before it can support safety claims," Nadkarni said.

More information: The National Institutes of Health has more on artificial intelligence in healthcare.
OpenAI introduces HealthBench, a comprehensive dataset to evaluate AI models' performance in healthcare conversations, aiming to improve the reliability and safety of AI in medical applications.
OpenAI, the company behind ChatGPT, has launched HealthBench, a groundbreaking dataset designed to benchmark AI models in healthcare applications. This initiative marks a significant step towards improving the reliability and safety of AI in medical contexts [1][2][3].
HealthBench was developed in partnership with 262 physicians from 60 countries and encompasses 5,000 realistic health conversations, more than 57,000 physician-written grading criteria, 49 languages, and 26 medical specialties.
This extensive collaboration ensures a diverse and comprehensive benchmark that can assess AI models across various medical scenarios and global contexts [1][2][3].
The benchmark employs a sophisticated evaluation system: each model response is graded against a custom physician-written rubric, every criterion is weighted to reflect the physicians' judgement of its importance, and the rubric scoring is carried out by GPT-4.1.
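To make that scoring scheme concrete, here is a minimal sketch of how a rubric-weighted grade could be computed: each physician-written criterion carries a point weight, a grader decides whether the response satisfies it, and the final score is the weighted share of points earned. The data structures, example criteria, and keyword-based stand-in grader are illustrative assumptions only; HealthBench itself uses GPT-4.1 as the grader.

```python
# Minimal, hypothetical sketch of rubric-weighted grading (not OpenAI's
# actual HealthBench implementation). HealthBench has GPT-4.1 judge each
# criterion; here a naive keyword check stands in for that grader call.
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str  # physician-written criterion
    points: float     # weight reflecting the physicians' judgement

def criterion_met(response: str, criterion: Criterion) -> bool:
    # Stand-in for the grader-model call.
    return criterion.description.lower() in response.lower()

def rubric_score(response: str, rubric: list[Criterion]) -> float:
    """Return the weighted fraction of rubric points earned by the response."""
    earned = sum(c.points for c in rubric if criterion_met(response, c))
    possible = sum(c.points for c in rubric)
    return earned / possible if possible else 0.0

rubric = [
    Criterion("call emergency services", 5.0),
    Criterion("check breathing", 3.0),
    Criterion("open the airway", 2.0),
]
response = "First, call emergency services, then check breathing."
print(f"{rubric_score(response, rubric):.0%}")
```

Running the example prints 80%, since the response satisfies two of the three weighted criteria (5 + 3 of 10 possible points); the 77% figure in OpenAI's blog example would arise the same way from its own rubric weights.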
HealthBench also includes a subset of 1,000 challenging examples to push the boundaries of AI model capabilities [1][2][3].
OpenAI's testing revealed interesting results: its o3 reasoning model scored highest at 60%, followed by xAI's Grok at 54% and Google's Gemini 2.5 Pro at 52%.
Notably, smaller models like GPT-4.1 nano have shown significant improvements, outperforming some larger, older models such as the August 2024 GPT-4o while being far more cost-effective [2].
The introduction of HealthBench could have far-reaching implications: it gives the industry a scalable, shared way to compare models fairly, and OpenAI reports that by April 2025 physician edits no longer improved its newest models' responses.
Some users have reported AI outperforming human doctors in certain scenarios, particularly in complex or rare conditions [2].
Despite the promising advancements, experts urge caution: physical examination remains a cornerstone of accurate diagnosis, OpenAI grading its own models raises transparency concerns, and using an AI grader may hide errors shared by both model and grader.
Dr. CN Manjunath, a senior cardiologist, emphasizes the importance of following up with qualified medical practitioners for comprehensive care [2].
OpenAI's initiative aligns with a broader industry trend: Google has released TxGemma, Med-Gemini, and MedLM for therapeutic development and clinical support, while Anthropic has launched an AI for Science Program offering free API credits for biology and life-sciences research.
These developments suggest a growing emphasis on AI applications in medical research and patient care [2][3].
While HealthBench represents a significant advancement, experts highlight areas for improvement, including subgroup analysis across countries and demographics, wider human review, and independent evaluation rather than self-grading.
As the field evolves, addressing these challenges will be crucial for building trust and ensuring the safe, effective deployment of AI in healthcare settings [3].