Model Performance

Key Facts

Key Platforms: LMSYS Arena, Open LLM Leaderboard

Key Benchmarks: MMLU, HumanEval, GSM8K, ARC

Top News

AI boosts breast cancer detection by 10% in major UK screening studies, flags hidden cancers

Major UK studies show AI integration into breast cancer screening increases detection rates by 10.4% and identifies up to 27.5% of interval cancers before they become visible. The technology also reduces radiologist workload by up to 44% while maintaining accuracy across diverse populations, marking a shift in how screening programs could operate.

Health · 6 sources

Alibaba launches Qwen 3.5 small AI models for edge devices with offline capabilities

Alibaba unveiled its Qwen 3.5 series featuring compact AI models ranging from 800 million to 9 billion parameters, optimized for edge devices like smartphones and IoT systems. The models enable local computation with enhanced privacy and offline functionality, challenging the industry trend of massive cloud-based systems while delivering competitive performance on benchmarks like MMLU.

Technology · 3 sources

Tax experts warn against using AI for tax filing as chatbots miscalculate refunds by thousands

As Tax Day approaches, tax experts are urging Americans not to rely on AI chatbots like ChatGPT and Grok for tax filing. Tests show these tools miscalculate refunds by an average of over $2,000, while privacy risks expose sensitive financial data to potential breaches. Despite marketing claims, the IRS holds taxpayers accountable for AI-generated errors.

Technology · 6 sources

California colleges spend millions on faulty AI systems that frustrate students

California community colleges are investing millions in AI chatbots to help students navigate admissions and financial aid, but the systems consistently fail to deliver accurate answers. East Los Angeles College's bot couldn't even name its own president correctly. With contracts totaling $3.8 million through 2029, students report abandoning these unreliable chatbots for Reddit and Google instead.

Technology · 2 sources

Google launches Android Bench to rank AI models best suited for coding Android apps

Google unveiled Android Bench, a new benchmark and leaderboard that evaluates how well AI models handle real-world Android app development tasks. Gemini 3.1 Pro topped the rankings with a 72.4% score, followed by Claude Opus 4.6 and GPT-5.2 Codex. The benchmark tests models on tasks sourced from GitHub repositories to help developers pick the right AI tools.

Technology · 3 sources

OpenAI Releases GPT-5.4, New AI Model Built for Agents and Professional Work

OpenAI launched GPT-5.4 Thinking and Pro models on Thursday, designed specifically for AI agents and enterprise applications. The company claims the new AI model delivers 33% fewer false claims and matches human professionals 83% of the time across 44 occupations. The release comes amid growing competition with Anthropic's Claude and controversy over OpenAI's Pentagon deal.

Technology · 30 sources

AI chatbots miss most medical diagnoses as millions seek health advice from ChatGPT

More than 40 million people consult ChatGPT daily for health information, but new research reveals AI chatbots correctly identify medical conditions only 34.5% of the time. Studies show these tools undertriage 52% of emergency cases and provide correct follow-up steps just 44.2% of the time, raising urgent questions about patient safety as AI healthcare becomes mainstream.

Health · 7 sources

Microsoft releases Phi-4 multimodal model that knows when thinking wastes time

Microsoft unveiled Phi-4-reasoning-vision-15B, a compact 15-billion-parameter model trained on just 200 billion tokens—five times less than competing systems. The open-source AI model intelligently activates chain-of-thought reasoning for complex math and science problems while staying silent on simple tasks like image captioning, delivering competitive performance with a fraction of the compute cost.

Technology · 2 sources

Stanford's Merlin AI delivers consistent diagnoses across hospitals using 3D CT scans

Stanford researchers developed Merlin AI, a vision-language model that analyzes 3D abdominal CT scans with 81% accuracy in predicting diagnoses. Trained on over 15,000 scans and nearly one million diagnostic codes, the NIH-funded tool outperformed specialized models across 750 tasks and demonstrated consistent performance across multiple hospital sites, offering a potential solution to the growing radiologist shortage.

Science · 2 sources

OpenAI rolls out GPT-5.3 Instant with fewer refusals and less preachy tone across ChatGPT

OpenAI released GPT-5.3 Instant, addressing user complaints about overly cautious responses and unnecessary refusals. The update reduces hallucinations by 26.8% with web search and delivers more direct answers without moralizing preambles. While the model shows improvements in conversational flow, CEO Sam Altman acknowledges ongoing challenges with model personality across the GPT-5 family.

Technology · 9 sources

Google launches Gemini 3.1 Flash Lite with 2.5x speed boost and adjustable thinking levels

Google unveiled Gemini 3.1 Flash Lite, its fastest and most cost-efficient AI model designed for high-volume developer workloads. Priced at $0.25 per 1M input tokens, the model delivers 2.5x faster time to first token than its predecessor while introducing adjustable Thinking Levels that let developers balance speed with reasoning depth for real-time applications.

Technology · 11 sources

Humanity's Last Exam reveals the gap between AI and human intelligence despite rapid progress

Researchers created Humanity's Last Exam, a PhD-level AI benchmark with 2,500 questions designed to assess the limits of AI reasoning. Google's Gemini 3 Deep Think achieved the highest score at 48.4%, while human experts score around 90%. Despite progress, experts emphasize this doesn't signal the arrival of Artificial General Intelligence.

Science · 2 sources

Pantera, Franklin Templeton Back Sentient Arena to Test AI Agents on Enterprise Workflows

Open-source AI lab Sentient launched Arena, a production-grade platform to stress-test AI agents on complex enterprise tasks. Pantera Capital, Franklin Templeton's digital assets unit, and Founders Fund joined the first cohort, signaling institutional interest in evaluating AI agent reliability before deployment into real-world workflows.

Technology · 2 sources

Harvard Study Shows AI Can Predict 71% of Fund Manager Trades, Raising Questions About Fees

A Harvard-led study reveals that AI can predict 71% of mutual fund trading decisions, suggesting much of active management follows detectable patterns. The research, analyzing data from 1990 to 2023, found that the unpredictable 29% of trades is where genuine alpha lives. The findings challenge the justification for high active-management fees and highlight how automation could reshape the $54 trillion asset management industry.

Science · 3 sources

ChatGPT Health fails to recognize half of medical emergencies in first independent safety test

OpenAI's ChatGPT Health missed over half of medical emergencies in a Nature Medicine study, directing patients to routine appointments instead of emergency rooms. With 40 million daily users seeking health guidance, the AI tool also showed alarming inconsistencies in suicide-crisis safeguards, triggering alerts for low-risk cases while failing to respond when users described specific self-harm plans.

Health · 9 sources
