2 Sources
[1]
AI could soon tackle projects that take humans weeks
Today's artificial intelligence (AI) systems can't beat humans on long tasks, but they're improving at a rapid pace and could close the gap sooner than many anticipated, according to an analysis of leading models. METR, a non-profit organization in Berkeley, California, created nearly 170 real-world tasks in coding, cybersecurity, general reasoning and machine learning, then established a 'human baseline' by measuring how long it took expert programmers to complete them. The team then developed a metric for assessing the progress of AI models, which it calls 'task-completion time horizon'. This is the time programmers typically take to complete the tasks that AI models can complete at a certain success rate. In a preprint posted on arXiv this week, METR reports that GPT-2, an early large language model (LLM) published by OpenAI in 2019, failed on all tasks that took human experts more than one minute. Claude 3.7 Sonnet, released in February by the US-based start-up Anthropic, completed 50% of the tasks that would take people 59 minutes. Overall, the time horizon of the 13 leading AI models has doubled roughly every seven months since 2019, the paper finds. The exponential growth of AI time horizons accelerated in 2024, with the latest models doubling their horizon roughly every three months. The work has not been formally peer reviewed. At the 2019-2024 rate of progress, METR suggests that AI models will be able to handle tasks that take humans about a month at 50% reliability by 2029, possibly sooner. One month of dedicated human expertise, the paper notes, can be enough to start a new company or make scientific discoveries, for instance. But Joshua Gans, a management professor at the University of Toronto in Canada, who has written on the economics of AI, says that these sorts of predictions aren't that useful. "Extrapolations are tempting to do, but there is still so much we don't know about how AI will actually be used for these to be meaningful," he says.
Human versus AI assessment
The team chose the 50% success rate because it was the most robust to small changes in the data distribution. "If you pick very low or very high thresholds, removing or adding a single successful or single failed task, respectively, changes your estimate a lot," says co-author Lawrence Chan. Raising the reliability threshold from 50% to 80% reduced the average time horizon by a factor of five -- although the overall doubling time and trendline were similar. In the past five years, improvements in the general capabilities of LLMs have been driven largely by increases in scale -- the amount of training data, training time and number of model parameters. The paper attributes progress on the time horizon metric mainly to improvements in AI's logical reasoning, tool use, error correction and self-awareness in task execution. METR's time-horizon approach addresses some of the limitations in existing AI benchmarks, which map to real-world work only loosely and quickly 'saturate' as models improve. It provides a continuous, intuitive measure that better captures meaningful long-term progress, says co-author Ben West. Leading AI models achieve superhuman performance on many benchmarks, but they have had relatively little economic impact, says West. METR's latest research offers a partial answer to this puzzle: the best models sit at around a 40-minute time horizon, and there isn't much economically valuable work that a person can do in that time, says West. But Anton Troynikov, an AI researcher and entrepreneur in San Francisco, California, says that AI would have more economic impact if organizations were more willing to experiment and invest in leveraging the models effectively.
[2]
AI is learning to work like you and it's getting faster every day
Five years from now, AI might be completing software engineering tasks that would take a human a month. That's the prediction of a new study that introduces a metric called the 50%-task-completion time horizon -- a measure of how long humans typically take to complete tasks that AI models can solve with a 50% success rate. And if current trends hold, AI is on track to automate increasingly complex work, from debugging code to conducting full-scale machine learning research. The study, conducted by the Model Evaluation & Threat Research (METR) group, suggests that AI's ability to handle long and complex tasks has been doubling every seven months since 2019. Today's frontier models, like Claude 3.7 Sonnet, already match human performance on 50-minute-long tasks. Extrapolating this growth, AI could reach a one-month time horizon -- the ability to autonomously complete tasks that would take a human a month -- between 2028 and 2031. This isn't just about raw computational power. AI's improved logical reasoning, tool use, and ability to adapt to mistakes are fueling the trend. Early AI systems would get stuck in loops or abandon problems too soon, but modern models are learning to persist and correct errors -- critical traits for automation at scale. The research team designed a benchmark based on 170 tasks across three datasets -- HCAST, RE-Bench, and a new suite of shorter software tasks called SWAA. They timed human professionals completing these tasks and compared their performance to AI models spanning from 2019 to 2025. The results showed a clear trajectory. AI's progress has been remarkably steady, even when tested against new challenges. The study found that the increase in time horizon remains consistent across different types of tasks, meaning AI isn't just getting better at specific benchmarks -- it's improving across the board.
While the study confirms rapid AI progress, it also raises concerns. The same ability that allows AI to write complex software could also enable it to perform high-risk activities autonomously. The paper warns that as AI systems become capable of extended autonomous operation, new safety measures will be needed to prevent misuse, such as self-replicating AI or autonomous development of hazardous materials. Additionally, AI's performance drops on "messier" real-world tasks -- those requiring creativity, strategic thinking, or human collaboration. While AI excels at structured problems with clear objectives, it still struggles in unpredictable environments. If AI's progress continues at its current rate, it could reshape industries by automating work traditionally done by skilled professionals. The implications stretch beyond software development -- fields like legal research, cybersecurity, and even scientific discovery could see AI playing a much larger role. But will the trend hold? The study's authors acknowledge that external factors -- such as compute limitations or breakthroughs in AI training -- could speed up or slow down progress. One thing is clear: AI isn't just getting smarter. It's learning how to work.
A new study reveals AI models are rapidly improving their ability to handle complex tasks, potentially matching human performance on month-long projects by 2029. This progress raises both excitement and concerns about AI's future impact on various industries.
A study by the Model Evaluation & Threat Research (METR) group finds that artificial intelligence (AI) is making significant strides in handling complex, time-consuming tasks traditionally performed by human experts. The research introduces a new metric called the "task-completion time horizon," which measures how long human experts typically take to complete the tasks that an AI model can complete with a 50% success rate [1].
The study found that the time horizon of leading AI models has been doubling approximately every seven months since 2019. This growth accelerated in 2024, with the latest models doubling their horizon roughly every three months. At the 2019-2024 rate of progress, AI models could handle tasks that take humans about a month, at 50% reliability, by 2029, and possibly sooner [1].
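The projection is doubling arithmetic, and a rough sketch makes the numbers easy to check. The snippet below starts from the 59-minute horizon reported for Claude 3.7 Sonnet in February 2025 [1] and asks how long it takes to reach one month under each doubling time. The 167-hour working month is an assumption made here for illustration, not a figure from the sources, and the function name is hypothetical.

```python
import math


def months_to_reach(current_minutes, target_minutes, doubling_months):
    """Months until an exponentially growing horizon reaches the target."""
    doublings = math.log2(target_minutes / current_minutes)
    return doublings * doubling_months


CURRENT_HORIZON_MIN = 59       # Claude 3.7 Sonnet at 50% success, per source [1]
START_YEAR = 2025 + 1 / 12     # February 2025 expressed as a fractional year
WORK_MONTH_MIN = 167 * 60      # ASSUMPTION: one working month is about 167 hours

for doubling in (7, 3):        # long-run rate and the faster 2024 rate
    months = months_to_reach(CURRENT_HORIZON_MIN, WORK_MONTH_MIN, doubling)
    year = START_YEAR + months / 12
    print(f"{doubling}-month doubling: about {months:.0f} months, around {year:.1f}")
```

Under these assumptions the seven-month rate lands in 2029, and the three-month rate arrives a couple of years earlier, which is in line with the "possibly sooner" caveat. Changing what counts as a month shifts the dates only slightly, because the target enters through a logarithm.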
METR created nearly 170 real-world tasks across various domains, including coding, cybersecurity, general reasoning, and machine learning. They established a human baseline by measuring the time taken by expert programmers to complete these tasks. The research team then assessed the progress of AI models against this baseline [1].
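One way to turn per-task results into a single time horizon is to fit a logistic curve of success probability against the logarithm of human completion time, then read off where the curve crosses the chosen threshold. The sketch below assumes that approach; the sources do not spell out the fitting procedure, and the task times and outcomes are made-up illustrative data, not METR results. All names are hypothetical.

```python
import math


def fit_time_horizon(human_minutes, successes, steps=2000, lr=0.5):
    """Fit P(success) = sigmoid(a + b * (log2(minutes) - mean)) by gradient ascent.

    Returns a function that maps a success threshold (e.g. 0.5) to the
    human task length, in minutes, at which the fitted curve hits it.
    """
    xs = [math.log2(m) for m in human_minutes]
    mean_x = sum(xs) / len(xs)
    centered = [x - mean_x for x in xs]  # centering keeps the fit well conditioned
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(centered, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p              # gradient of the log-likelihood wrt a
            grad_b += (y - p) * x        # gradient of the log-likelihood wrt b
        a += lr * grad_a / len(xs)
        b += lr * grad_b / len(xs)

    def horizon(threshold):
        logit = math.log(threshold / (1.0 - threshold))
        return 2.0 ** (mean_x + (logit - a) / b)

    return horizon


# HYPOTHETICAL data: human time per task (minutes) and whether the model succeeded.
TASK_MINUTES = [1, 2, 4, 8, 16, 32, 64, 128]
MODEL_SUCCESS = [1, 1, 0, 1, 0, 1, 0, 0]

horizon = fit_time_horizon(TASK_MINUTES, MODEL_SUCCESS)
print(f"50% horizon: {horizon(0.5):.1f} min, 80% horizon: {horizon(0.8):.1f} min")
```

A stricter threshold gives a shorter horizon, because the fitted success probability falls as tasks get longer. The fit needs some overlap between successes and failures; if every short task succeeds and every long one fails, the slope grows without bound.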
The paper attributes the progress in AI's time horizon metric mainly to improvements in logical reasoning, tool use, error correction and self-awareness in task execution [1].
Modern AI models are learning to persist and correct errors, which are critical traits for automation at scale [2].
While the study confirms rapid AI progress, it also raises concerns about potential misuse. As AI systems become capable of extended autonomous operation, new safety measures will be needed to prevent risks such as self-replicating AI or autonomous development of hazardous materials [2].
The implications of this progress stretch beyond software development. Fields like legal research, cybersecurity, and scientific discovery could see AI playing a much larger role in the near future [2].
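Despite the impressive progress, AI still faces challenges in certain areas. Its performance drops on "messier" real-world tasks that require creativity, strategic thinking, or human collaboration, and it still struggles in unpredictable environments [2].
Some experts, like Joshua Gans from the University of Toronto, caution against over-reliance on these predictions, noting that there is still much uncertainty about how AI will actually be used in practice [1].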
Summarized by
Navi