5 Sources
[1]
OpenAI says GPT-5 stacks up to humans in a wide range of jobs | TechCrunch
OpenAI released a new benchmark on Thursday that tests how its AI models perform compared to human professionals across a wide range of industries and jobs. The test, GDPval, is an early attempt at understanding how close OpenAI's systems are to outperforming humans at economically valuable work, a key part of the company's founding mission to develop artificial general intelligence, or AGI. OpenAI says it found that its GPT-5 model and Anthropic's Claude Opus 4.1 "are already approaching the quality of work produced by industry experts."

That's not to say that OpenAI's models are going to start replacing humans in their jobs immediately. Despite some CEOs' predictions that AI will take humans' jobs in just a few years, OpenAI admits that GDPval today covers a very limited share of the tasks people do in their real jobs. However, it is one of the latest ways the company is measuring AI's progress toward this milestone.

GDPval is based on the nine industries that contribute the most to America's gross domestic product, including domains such as healthcare, finance, manufacturing, and government. The benchmark tests an AI model's performance in 44 occupations across those industries, ranging from software engineers to nurses to journalists.

For OpenAI's first version of the test, GDPval-v0, the company asked experienced professionals to compare AI-generated reports with those produced by other professionals and choose the better one. For example, one task asked investment bankers to create a competitive landscape for the last-mile delivery industry; graders then compared the human report against AI-generated versions. OpenAI then averages an AI model's "win rate" against the human reports across all 44 occupations. For GPT-5-high, a souped-up version of GPT-5 that uses extra computational power, the company says the model was ranked as better than or on par with industry experts 40.6% of the time.
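The aggregation described above is easy to sketch: for each occupation, compute the fraction of tasks where graders rated the model's deliverable as better than or tied with the expert's, then average those per-occupation rates. The following is a minimal, hypothetical illustration; the occupation names, grade data, and the unweighted averaging are assumptions for demonstration, not OpenAI's published methodology.

```python
# Sketch of a GDPval-style "win rate" aggregate: the per-occupation
# fraction of tasks where the model's output was rated better than or
# on par with the human expert's, averaged across occupations.
# All data below is invented for illustration.

def win_rate(grades):
    """grades: list of 'win', 'tie', or 'loss', one per task."""
    favorable = sum(1 for g in grades if g in ("win", "tie"))
    return favorable / len(grades)

def gdpval_style_score(results_by_occupation):
    """Unweighted mean of per-occupation win-or-tie rates."""
    rates = [win_rate(g) for g in results_by_occupation.values()]
    return sum(rates) / len(rates)

results = {
    "software engineer": ["win", "tie", "loss", "loss"],   # 0.50
    "nurse":             ["loss", "loss", "tie", "loss"],  # 0.25
    "journalist":        ["win", "loss", "loss", "loss"],  # 0.25
}
print(round(gdpval_style_score(results), 3))  # prints 0.333
```

Averaging per-occupation rates (rather than pooling all tasks) keeps occupations with many tasks from dominating the headline number, which is presumably why a cross-occupation average is used.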
OpenAI also tested Anthropic's Claude Opus 4.1 model, which was ranked as better than or on par with industry experts in 49% of tasks. OpenAI says it believes Claude scored so high because of its tendency to produce pleasing graphics, rather than because of sheer performance.

It's worth noting that most working professionals do a lot more than submit research reports to their boss, which is all that GDPval-v0 tests for. OpenAI acknowledges this, and says it plans to create more robust tests in the future that can account for more industries and interactive workflows. Nonetheless, the company sees the progress on GDPval as notable.

In an interview with TechCrunch, OpenAI's chief economist Dr. Aaron Chatterji said GDPval's results suggest that people in these jobs can now use AI models to spend time on more meaningful tasks. "[Because] the model is getting good at some of these things," Chatterji says, "people in those jobs can now use the model, increasingly as capabilities get better, to offload some of their work and do potentially higher value things."

OpenAI's evaluations lead, Tejal Patwardhan, tells TechCrunch that she's encouraged by the rate of progress on GDPval. GPT-4o, released roughly 15 months ago, scored just 13.7% (wins and ties versus humans); GPT-5 now scores nearly triple that, a trend Patwardhan expects to continue.

Silicon Valley has a wide range of benchmarks it uses to measure the progress of AI models and assess whether a given model is state-of-the-art. Among the most popular are AIME 2025 (a test of competitive math problems) and GPQA Diamond (a test of PhD-level science questions). However, several AI models are nearing saturation on some of these benchmarks, and many AI researchers have cited the need for better tests that can measure AI's proficiency on real-world tasks.
Benchmarks like GDPval could become increasingly important in that conversation, as OpenAI makes the case that its AI models are valuable for a wide range of industries. But OpenAI may need a more comprehensive version of the test to definitively say its AI models can outperform humans.
[2]
OpenAI is now testing ChatGPT against humans in 44 different occupations, from lawyers and software developers to registered nurses -- here's the full list of jobs affected
OpenAI, the company behind ChatGPT, has announced a new benchmark for testing its GPT-5 model, one that pits the AI directly against human experts in a variety of occupations. The benchmark, called GDPval, assesses how close ChatGPT is getting to outperforming humans at "economically valuable, real-world tasks". That means moving beyond things like academic tests and coding competitions toward jobs that are carried out in the real world: nursing, financial management, engineering or journalism.

This is all part of OpenAI's effort to establish artificial general intelligence (AGI), and the company notes that its GPT-5 model (and Anthropic's Claude Opus 4.1) "are already approaching the quality of work produced by industry experts."

In a blog post explaining the new testing, OpenAI explained: "Unlike traditional benchmarks, GDPval tasks are not simple text prompts. They come with reference files and context, and the expected deliverables span documents, slides, diagrams, spreadsheets, and multimedia. This realism makes GDPval a more realistic test of how models might support professionals."

"The GDPval full set includes 1,320 specialized tasks (220 in the gold open-sourced set), each meticulously crafted and vetted by experienced professionals with over 14 years of experience on average from these fields. Every task is based on real work products, such as a legal brief, an engineering blueprint, a customer support conversation, or a nursing care plan."

The tasks covered 44 different jobs across nine different industries. Here's the full list:

So will AI take your job? It's the $64,000 question, and the answer, probably, is yes. Or at least AI will take some measure of your job. OpenAI itself notes GDPval is an "early step that doesn't reflect the full nuance of many economic tasks."
Additionally, while the test "spans 44 occupations and hundreds of knowledge work tasks, it is limited to one-shot evaluations, so it doesn't capture cases where a model would need to build context or improve through multiple drafts." There's still a long way to go, and a recent study claimed ChatGPT still routinely gets things wrong. But OpenAI is working hard on hitting AGI and says that future versions will extend to more interactive workflows and context-rich tasks to "better reflect the complexity of real-world knowledge work". The fact that AI will reshape our working landscape is pretty much a foregone conclusion at this point. But the way in which it's integrated into most societies is still very much in the hands of humans, business leaders and customers. There will always be work for humans to do, that's also a foregone conclusion, but the type of work is almost certain to look a lot different in the decades to come.
[3]
OpenAI tool shows AI catching up to human work
Why it matters: We're at an AI reckoning, where leaders are trying to justify investments without effective tools to measure returns.

* A recent MIT study showing that most AI projects fail launched a debate about its techniques, but also exposed the challenges in measuring returns on these massive investments.

Driving the news: On Thursday OpenAI introduced GDPval-v0, a new way to measure how well AI models perform what it calls "authentic work deliverables," like creating legal briefs, engineering blueprints and nursing care plans.

* The "GDP" in GDPval stands for Gross Domestic Product, which OpenAI says researchers used as the key economic indicator for the evaluations.
* The tasks the company tested came from occupations in the industries that contribute most to GDP.

What they did: Researchers looked at around 1,300 work tasks across 44 occupations, in nine business sectors that each make up more than 5% of U.S. GDP.

* Expert graders compared AI and human deliverables using detailed rubrics to decide which was better.
* "We finally have a way to measure how our models perform in the real world -- not just on academic tests -- which is a key way for us to measure progress towards our goal of AGI," OpenAI researcher Tejal Patwardhan told Axios.

Between the lines: OpenAI didn't just look at its own models.

* Researchers also looked at how Anthropic's Claude, Google's Gemini, and xAI's Grok compared to human workers.

What they found: Today's leading models are approaching parity with human professionals on many tasks, and the gains are accelerating.

* In blind tests of 220 tasks, Claude Opus 4.1 edged out the others, with its outputs rated as good as -- or better than -- human experts' 47.6% of the time.
* OpenAI's GPT-5 came in a close second, excelling in domain-specific knowledge.
* The research found that frontier models can complete the GDPval-v0 tasks roughly a hundred times faster and cheaper than experts.
Yes, but: The speed and cost numbers are based on model inference time and API billing rates, and don't capture the cost of human insight required in a real-world setting, per the research.

What they're saying: Just because AI models can complete these tasks better, cheaper and faster doesn't mean they're going to edge all humans out of the workforce anytime soon, OpenAI chief economist Ronnie Chatterji told Axios.

* "Your job is going to be different with a different set of tasks, maybe, than it was yesterday," Chatterji says. "It's gonna be hard to track the direct impact on the job market."
* "The data shows that AI models are increasingly capable of doing a lot of the work that humans do right now," he added. "So that's where I think the economic value is coming from -- as a complement to workers."

Stunning stat: Performance has more than doubled from GPT-4o (released spring 2024) to GPT-5 (released summer 2025).
[4]
AI Isn't Taking Your Job Yet -- But It Might Soon, OpenAI Data Suggests
The study suggests the first wave of disruption will hit office-based jobs, from coders to lawyers and journalists.

OpenAI unveiled GDPval on Thursday, a benchmark that tries to assess qualitatively whether AI can do your actual job. These are not hypothetical exam questions but real deliverables: legal briefs, engineering blueprints, nursing care plans, financial reports -- the kind of work, that is, that pays mortgages.

The researchers deliberately focused on occupations where at least 60% of tasks are computer-based, roles they describe as "predominantly digital." That scope covers professional services such as software developers, lawyers, accountants, and project managers; finance and insurance positions like analysts and customer service reps; and information-sector jobs ranging from journalists and editors to producers and AV technicians. Healthcare administration, white-collar manufacturing roles, and sales or real estate managers also feature prominently.

Within that set, the work most exposed to AI overlaps with the kinds of digital, knowledge-intensive activities that large language models already handle well:

* Software development, which represents the largest wage pool in the dataset, stands out as especially vulnerable.
* Legal and accounting work, with its heavy reliance on documents and structured reasoning, is also high on the list, as are financial analysts and customer service representatives.
* Content production roles -- editors, journalists, and other media workers -- face similar pressures given AI's growing fluency in language and multimedia generation.

The absence of manual and physical labor jobs in the study highlights its boundaries: GDPval was not designed to measure exposure in fields like construction, maintenance, or agriculture. Instead, it underscores the point that the first wave of disruption is likely to strike white-collar, office-based jobs, the very kinds of work once assumed to be most insulated from automation.
The report builds on a two-year-old OpenAI/University of Pennsylvania study which claimed that up to 80% of U.S. workers could see at least 10% of their tasks affected by LLMs, and that around 19% of workers could see at least 50% of their tasks affected. The most imperiled (or transformed) jobs are white-collar, knowledge-heavy ones, especially in law, writing, analysis, and customer interaction.

But the unsettling part isn't today's numbers; it's the trajectory. At this pace, the statistics suggest that AI could match human experts across the board by 2027. That would be close to AGI standards, and could mean that even tasks considered unsafe or too specialized for automation may soon become accessible to machines, threatening rapid workplace transformations.

OpenAI tested 1,320 tasks across 44 occupations -- not random jobs, but roles in the nine sectors that drive most of America's GDP. Software developers, lawyers, nurses, financial analysts, journalists, engineers: the people who thought their degrees would protect them from automation. Each task came from professionals with an average of 14 years of experience -- not interns or recent grads, but seasoned experts who know their craft. The tasks weren't simple either, averaging seven hours of work, with some stretching to multiple weeks of effort.

According to OpenAI, the models completed these tasks up to 100 times faster and significantly cheaper than humans, as measured by inference time and API billing -- the kind of raw speed advantage computers have held for decades. On more specialized tasks the advantage was smaller, but still noticeable. Even accounting for review time and the occasional do-over when the AI hallucinated something bizarre, the economics tilt hard toward automation.

But cheer up: just because a job is exposed doesn't mean it disappears. It may be augmented (for instance, lawyers and journalists using LLMs to write faster) rather than replaced.
And as far as AI has come, hallucinations are still a pain for businesses. The research shows AI failing most often on instruction-following: 35% of GPT-5's losses came from not fully grasping what was asked. Formatting errors plagued another 40% of failures. The models also struggled with collaboration, client interaction, and anything requiring genuine accountability -- areas OpenAI left out of the study. Nobody's suing an AI for malpractice yet.

But for solo digital deliverables -- the reports, presentations, and analyses that fill most knowledge workers' days -- the gap is closing fast. OpenAI admits that GDPval today covers a very limited share of the tasks people do in their real jobs. The benchmark can't measure interpersonal skills, physical presence, or the thousand micro-decisions that make someone valuable beyond their deliverables.

Still, when investment banks start comparing AI-generated competitor analyses to those from human analysts, when hospitals evaluate AI nursing care plans against those from experienced nurses, and when law firms test AI briefs against associate work -- that's not speculation anymore. That's measurement.
[5]
OpenAI's GPT-5 matches human performance in jobs: What it means for work and AI
On September 25, 2025, OpenAI dropped a bombshell: its latest model, GPT-5, now "stacks up to humans in a wide range of jobs." The declaration ripples far beyond the world of AI benchmarks: it raises urgent questions about the future of work, the boundary between human and machine, and how societies will adapt when tools become peers. What does it really mean, though, and more importantly, what comes next?

To make its case, OpenAI introduced GDPval, a new benchmark built to test AI against humans in economically meaningful roles. The benchmark draws on nine industries crucial to U.S. GDP (healthcare, finance, manufacturing, government, etc.) and drills down to 44 occupations, from nurses to software engineers to journalists. In the first version (GDPval-v0), human professionals compare AI-generated deliverables against those produced by other experts and judge which is better.

GPT-5, in a "high compute" configuration (GPT-5-high), achieved a "win or tie" rate of 40.6% versus expert-level human output. That is a staggering leap: GPT-4o (OpenAI's earlier multimodal model) scored 13.7% in the same setup. OpenAI also tested Claude Opus 4.1 (from Anthropic), which scored 49% in the same evaluation, though OpenAI cautions that part of that could be due to stylistic "presentation" (e.g. "pleasing graphics") rather than pure substance.

OpenAI frames GDPval not as a final arbiter but as a stepping stone: a way to push the conversation beyond narrow academic benchmarks and into real-world tasks. What sets this announcement apart is not just the raw numbers but the framing: this is a claim of task parity, not on chess puzzles or math exams, but on work that people actually do in their jobs.

Still, OpenAI is careful to spell out the limitations. GDPval-v0 tests are limited in scope: they are static, noninteractive, and focus on output artifacts (reports, analyses) rather than on the full complexity of many jobs.
Many real roles involve collaboration, stakeholder negotiation, on-the-fly adaptation, creativity, ethics, domain nuance, and interpersonal context: aspects that are hard to reduce to benchmark prompts. Thus, OpenAI acknowledges that it's not yet fielding GPT-5 to replace whole roles. Rather, the goal is augmentation: let humans offload lower-level cognitive work so they can spend more time on judgment, oversight, vision, and context. "People in those jobs can now use the model to offload some of their work and do potentially higher value things," OpenAI's chief economist, Aaron Chatterji, summarized.

Still, the gap between "assistive AI" and "competent peer AI" is narrowing. The trajectory is disquieting for many, and the question is: will workplaces and societies adapt fast enough?

For many professionals, this moment will feel existential. If machines begin writing reports, diagnosing medical cases, or drafting legal documents at near-human quality, the sense that "my job is safe" becomes shaky. But the impact is uneven. Roles built on more structured tasks (analysis, drafting, pattern recognition) are more exposed; roles grounded in human empathy, trust, high-stakes judgment, or the messy real world may resist substitution, at least for some time.

Still, even if your job stays intact, your tools may change. Expect increasing automation of daily workflows, with AI copilots becoming standard. Your value may shift toward meta-skills: oversight of AI, domain interpretation, accountability, and human relationship skills.

Firms will rush to adopt any productivity multiplier. For industries with tight margins (consulting, finance, legal, media), the pressure to incorporate GPT-5-level automation will be immense. This may accelerate restructuring: flatter teams, fewer middle layers, more emphasis on hybrid human-AI squads. Some firms might experiment with replacing junior human roles first. Others might lean into differentiation (human judgment, ethics, brand) as their edge.
But adoption won't be frictionless. Integration, reliability, auditing, legal liability, ethical guardrails: all of these will become battlegrounds. Will clients accept AI-drafted legal memos? Will regulators allow AI medical assistants to operate with minimal oversight? Those answers will vary by jurisdiction.

If AI is now approaching human-level competence in real work tasks, education systems must rethink what they teach. Memorization and routine report writing become less valuable than critical thinking, interpretive insight, collaboration, and ethical reasoning. Policymakers will also face hard questions: social safety nets, labor transitions, regulation of AI in high-stakes domains (medicine, law, defense), certification, liability, and IP. How do you audit or validate AI work when it competes with human professionals?

Further, inequality may widen. Organizations with early access to strong AI systems and the capital to deploy them could dramatically outpace smaller players and local firms, both within and across countries.

OpenAI frames this as progress toward its long-term ambition: building artificial general intelligence (AGI). GDPval is one metric, but not the ultimate one. If GPT-5 is nearing human-level performance in many domain tasks, then the next frontier is robustness, generality, safety, interactive workflows, long-term planning, object permanence, world models, and adaptability under uncertainty: in short, the qualities that humans bring to open-ended problems. Thus, GPT-5's performance is both a milestone and a challenge: can AI systems maintain trust, explainability, correctness, and alignment as they push closer to human-level agency?

To understand the real emotional texture of this shift, consider a mid-career financial analyst. In 2027, she's asked to pilot a workflow where GPT-5 drafts her weekly competitor analyses; she reviews, edits, and presents.
Some weeks, what the AI generates is better than she might have drafted herself; other weeks, it's off in subtle ways. Her role evolves: she's no longer just doing the grunt report work; she has to coach, correct, interpret, and contextualize. Her value shifts upward, but also precariously: any slip, or any superior AI, and she may become expendable.

Or take a junior lawyer at a city firm. For routine contractual clauses and first-draft memos, his firm lets GPT-5 produce baseline versions. His added value becomes spotting edge cases, tailoring empathy in client communication, and managing relationships. He gains speed, but he also competes with a tool that could someday absorb his tasks entirely.

These are not speculative-future vignettes; they are iterations of the present. AI's gains in benchmarks today portend shifts in incentives, culture, and risk tomorrow.

Benchmarks are measured settings. In the wild, AI still struggles with domain drift, nuance, ambiguity, hallucination, and adversarial prompts. For human-level tasks, users will demand explanations, provenance, and accountability; can systems provide that in a way humans accept? As AI handles more important tasks, the cost of misalignment or error increases, and guardrails, oversight, and fail-safes become essential. Who is liable for error? Should outputs carry legal disclaimers? Can AI-generated work be copyrighted or patented? These domains remain murky.

Who gets access to the strongest models? If large companies deploy GPT-5 broadly, small firms or under-resourced geographies may lag behind.

Rather than herald doom or promise messianic AI, a more balanced take is this: we're stepping into an era of hybrid intelligence, where humans and AI gradually integrate their work. Machines will handle patterns, scale, and speed; humans will bring empathy, meaning, oversight, values, and interpretation. GPT-5's claimed parity is not a final verdict; it's a loud signal.
The coming years will test whether human societies adapt fast enough, and whether AI serves as enhancement, not erasure.
OpenAI introduces GDPval, a new benchmark testing AI models against human professionals in 44 occupations. GPT-5 shows significant improvement, matching or surpassing human performance in many tasks, potentially reshaping the future of work.
OpenAI has introduced a groundbreaking benchmark called GDPval, designed to assess how artificial intelligence models stack up against human professionals in real-world tasks [1]. This new evaluation method focuses on nine industries that contribute significantly to the U.S. Gross Domestic Product, testing AI performance across 44 different occupations [2].
The latest iteration of OpenAI's language model, GPT-5, has shown remarkable progress in these tests. In the GDPval-v0 benchmark, GPT-5-high (a high-compute version) was ranked as better than or on par with industry experts 40.6% of the time [1]. This represents a significant leap from its predecessor, GPT-4o, which scored only 13.7% just 15 months earlier [3].
The rapid advancement of AI capabilities raises important questions about the future of work. While OpenAI emphasizes that these models are not yet ready to replace humans entirely, it suggests that AI could significantly augment human capabilities in various professions [4].

The study indicates that the initial wave of AI disruption is likely to impact office-based, knowledge-intensive jobs the most. Software development, legal and accounting work, financial analysis, and content production roles are among the most vulnerable [4]. However, jobs requiring manual labor or physical presence were not included in this assessment.
OpenAI acknowledges that GDPval-v0 has limitations. It doesn't capture the full complexity of many jobs, including aspects like collaboration, client interaction, and accountability [5]. Future versions of the benchmark aim to incorporate more interactive workflows and context-rich tasks to better reflect real-world scenarios [2].
The rapid progress of AI capabilities could lead to significant changes in education, policy-making, and economic structures. Educational systems may need to shift focus from memorization to critical thinking and ethical reasoning. Policymakers will face challenges in areas such as labor transitions, AI regulation, and social safety nets [5].

While the GDPval results are impressive, they don't signal an immediate replacement of human workers. Instead, they point towards a future where AI increasingly augments human capabilities, allowing professionals to focus on higher-value tasks. As AI continues to evolve, the challenge for society will be to adapt quickly and harness these technologies for the benefit of all.
Summarized by Navi