Curated by THEOUTPOST
On Wed, 19 Feb, 8:07 AM UTC
3 Sources
[1]
OpenAI Researchers Find That Even the Best AI Is "Unable To Solve the Majority" of Coding Problems
OpenAI researchers have admitted that even the most advanced AI models are still no match for human coders -- even though CEO Sam Altman insists they will be able to beat "low-level" software engineers by the end of this year.

In a new paper, the company's researchers found that even frontier models, the most advanced and boundary-pushing AI systems, "are still unable to solve the majority" of coding tasks. The researchers used a newly developed benchmark called SWE-Lancer, built on more than 1,400 software engineering tasks from the freelancer site Upwork. Using the benchmark, OpenAI put three large language models (LLMs) -- its own o1 reasoning model and flagship GPT-4o, as well as Anthropic's Claude 3.5 Sonnet -- to the test.

Specifically, the benchmark evaluated how well the LLMs performed on two types of tasks from Upwork: individual tasks, which involved resolving bugs and implementing fixes, and management tasks, which saw the models zoom out and make higher-level decisions. (The models weren't allowed to access the internet, meaning they couldn't simply crib similar answers that had been posted online.)

The models took on tasks cumulatively worth hundreds of thousands of dollars on Upwork, but they were only able to fix surface-level software issues, while remaining unable to find bugs in larger projects or trace their root causes. These shoddy and half-baked "solutions" are likely familiar to anyone who's worked with AI, which is great at spitting out confident-sounding information that often falls apart on closer inspection.

Though all three LLMs were often able to operate "far faster than a human would," the paper notes, they also failed to grasp how widespread bugs were or to understand their context, "leading to solutions that are incorrect or insufficiently comprehensive."

As the researchers explained, Claude 3.5 Sonnet performed better than the two OpenAI models pitted against it and earned more money than o1 and GPT-4o. Still, the majority of its answers were wrong, and according to the researchers, any model would need "higher reliability" to be trusted with real-life coding tasks.

Put more plainly, the paper seems to demonstrate that although these frontier models can work quickly and solve narrowly scoped tasks, they're nowhere near as skilled at handling them as human engineers. Though these LLMs have advanced rapidly over the past few years and will likely continue to do so, they're not yet skilled enough at software engineering to replace real-life people -- not that that's stopping CEOs from firing their human coders in favor of immature AI models.
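To make the benchmark's money-based scoring concrete, here is a minimal sketch in Python of how payout-weighted grading of the two task types could work. The class, field names and grading rule are illustrative assumptions for this article, not OpenAI's actual SWE-Lancer code.

```python
# Illustrative sketch only -- not OpenAI's SWE-Lancer implementation.
# It shows the idea described above: every task carries its real Upwork
# payout, and a model "earns" the payout of each task it solves.
from dataclasses import dataclass

@dataclass
class Task:
    title: str
    kind: str          # "individual_contributor" (bug fix/feature) or "management" (pick a proposal)
    payout_usd: float  # price the task was posted for on Upwork
    solved: bool       # did the model's answer pass the task's checks?

def total_earnings(tasks: list[Task]) -> float:
    """Sum the payouts of every task the model solved."""
    return sum(t.payout_usd for t in tasks if t.solved)

# Hypothetical example using the price range quoted in the coverage:
tasks = [
    Task("Fix a display bug", "individual_contributor", 50.0, True),
    Task("Implement a new feature", "individual_contributor", 32_000.0, False),
    Task("Choose the best fix proposal", "management", 500.0, True),
]
print(total_earnings(tasks))  # 550.0
```

Under scoring like this, a model that only lands small, surface-level fixes earns only a small fraction of the total pool, which is consistent with the results described above.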
[2]
AI can fix bugs -- but can't find them: OpenAI's study highlights limits of LLMs in software engineering
Large language models (LLMs) may have changed software development, but enterprises will need to think twice about entirely replacing human software engineers with LLMs, despite OpenAI CEO Sam Altman's claim that models can replace "low-level" engineers.

In a new paper, OpenAI researchers detail how they developed an LLM benchmark called SWE-Lancer to test how much foundation models can earn from real-life freelance software engineering tasks. The test found that, while the models can solve bugs, they can't see why the bug exists and continue to make more mistakes.

The researchers tasked three LLMs -- OpenAI's GPT-4o and o1 and Anthropic's Claude 3.5 Sonnet -- with 1,488 freelance software engineering tasks from the freelance platform Upwork, amounting to $1 million in payouts. They divided the tasks into two categories: individual contributor tasks (resolving bugs or implementing features) and management tasks (where the model roleplays as a manager who must choose the best proposal to resolve an issue).

"Results indicate that the real-world freelance work in our benchmark remains challenging for frontier language models," the researchers write. The test shows that foundation models cannot fully replace human engineers. While they can help solve bugs, they're not quite at the level where they can start earning freelancing cash by themselves.

Benchmarking freelancing models

The researchers and 100 other professional software engineers identified potential tasks on Upwork and, without changing any words, fed them into a Docker container to create the SWE-Lancer dataset. The container has no internet access and cannot reach GitHub, "to avoid the possibility of models scraping code diffs or pull request details," they explained.

The team identified 764 individual contributor tasks, worth about $414,775 in total, ranging from 15-minute bug fixes to weeklong feature requests. The remaining management tasks, which included reviewing freelancer proposals and job postings, would pay out $585,225. The tasks come from the expensing platform Expensify.

The researchers generated prompts based on the task title and description and a snapshot of the codebase. If there were additional proposals to resolve the issue, "we also generated a management task using the issue description and list of proposals," they explained.

From there, the researchers moved to end-to-end test development. They wrote Playwright tests for each task that apply the generated patches, and the tests were then "triple-verified" by professional software engineers. "Tests simulate real-world user flows, such as logging into the application, performing complex actions (making financial transactions) and verifying that the model's solution works as expected," the paper explains.

Test results

After running the test, the researchers found that none of the models earned the full $1 million value of the tasks. Claude 3.5 Sonnet, the best-performing model, earned only $208,050 and resolved 26.2% of the individual contributor issues. However, the researchers point out, "the majority of its solutions are incorrect, and higher reliability is needed for trustworthy deployment."

Across individual contributor tasks, Claude 3.5 Sonnet performed best, followed by o1 and GPT-4o. "Agents excel at localizing, but fail to root cause, resulting in partial or flawed solutions," the report explains.
"Agents pinpoint the source of an issue remarkably quickly, using keyword searches across the whole repository to quickly locate the relevant file and functions -- often far faster than a human would. However, they often exhibit a limited understanding of how the issue spans multiple components or files, and fail to address the root cause, leading to solutions that are incorrect or insufficiently comprehensive. We rarely find cases where the agent aims to reproduce the issue or fails due to not finding the right file or location to edit." Interestingly, the models all performed better on manager tasks that required reasoning to evaluate technical understanding. These benchmark tests showed that AI models can solve some "low-level" coding problems and can't replace "low-level" software engineers yet. The models still took time, often made mistakes, and couldn't chase a bug around to find the root cause of coding problems. Many "low-level" engineers work better, but the researchers said this may not be the case for very long.
[3]
OpenAI Thinks LLMs Can Earn $1M from Freelance Software Engineering Tasks
OpenAI has introduced SWE-Lancer, a new benchmark to test whether frontier large language models (LLMs) can successfully complete real-world freelance software engineering tasks -- and even earn up to $1 million in total payouts. The evaluation is based on 1,488 freelance software engineering jobs from Upwork, collectively valued at $1 million, with projects ranging from $50 bug fixes to $32,000 feature implementations.

"Introducing SWE-Lancer: our most realistic coding benchmark to date. Still some limitations, but better than evals we had before," said Tejal Patwardhan, who works on the benchmarks and preparedness team at OpenAI.

The tasks are divided into independent engineering tasks, where models must complete technical work, and managerial decision-making tasks, where models evaluate and choose between implementation proposals. By mapping AI model performance to real-world monetary value, SWE-Lancer provides a crucial tool for studying the economic impact of AI in software development. More research can be accessed here.

Anthropic, the company behind the Claude model series, also released a survey highlighting AI's influence on the workplace. The findings revealed that approximately 36% of all occupations incorporate AI for at least a quarter of their tasks. Moreover, 57% of AI applications enhance human capabilities, while 43% focus on automation. However, only 4% of occupations rely on AI for at least 75% of their tasks. The study identified software development and technical writing as key areas where AI is utilised. In contrast, AI plays a minimal role in tasks that involve physical interaction with the environment.
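To put that monetary framing in perspective, the snippet below works through the headline figures quoted in this coverage -- the roughly $1 million task pool and the $208,050 earned by the best-performing model. It is a back-of-the-envelope illustration, not part of the benchmark itself.

```python
# Back-of-the-envelope arithmetic using figures quoted in the articles above.
TOTAL_POOL_USD = 1_000_000         # approximate combined value of all SWE-Lancer tasks
BEST_MODEL_EARNINGS_USD = 208_050  # Claude 3.5 Sonnet's reported earnings

share = BEST_MODEL_EARNINGS_USD / TOTAL_POOL_USD
print(f"Best model earned {share:.1%} of the available payout")  # ~20.8%
```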
OpenAI researchers develop a new benchmark called SWE-Lancer to test AI models' performance on real-world software engineering tasks, revealing that even advanced AI struggles with complex coding problems.
OpenAI researchers have developed a new benchmark called SWE-Lancer to evaluate the performance of large language models (LLMs) in real-world software engineering tasks. This benchmark, based on over 1,400 freelance software engineering tasks from Upwork, aims to test the capabilities of frontier AI models in coding and software development [1][3].
SWE-Lancer comprises two main categories of tasks:
- Individual contributor tasks, such as resolving bugs and implementing features [2].
- Management tasks, in which the model acts as a technical lead and chooses the best of several implementation proposals [2][3].
The benchmark includes projects ranging from quick $50 bug fixes to complex $32,000 feature implementations, with a cumulative value of approximately $1 million [3]. To ensure a fair assessment, the AI models were not allowed internet access during the tests, preventing them from simply copying existing solutions [1].
Three advanced LLMs were put to the test using the SWE-Lancer benchmark:
- OpenAI's o1 reasoning model [1][2]
- OpenAI's flagship GPT-4o [1][2]
- Anthropic's Claude 3.5 Sonnet [1][2]
The results revealed significant limitations in the AI models' abilities to handle complex software engineering tasks:
- None of the models earned the full $1 million value of the tasks; Claude 3.5 Sonnet, the best performer, earned $208,050 and resolved 26.2% of the individual contributor issues [2].
- The models could localize relevant files quickly, but often failed to understand how an issue spanned multiple components and did not address root causes, producing solutions that were incorrect or insufficiently comprehensive [1][2].
- Even the best model's solutions were mostly incorrect, and the researchers say higher reliability is needed for trustworthy deployment [2].
The study's findings have important implications for the future of AI in software development:
- Frontier models can work far faster than humans on narrow, well-localized fixes, but they are not yet reliable enough to replace human software engineers [1][2].
- Payout-based benchmarks like SWE-Lancer offer a way to track the economic impact of AI on freelance software work as models improve [3].
This research comes at a time when the impact of AI on various industries, including software development, is being closely scrutinized. A recent survey by Anthropic revealed that approximately 36% of all occupations incorporate AI for at least a quarter of their tasks, with software development being a key area of AI utilization [3].
As AI technology continues to advance, it's clear that while it can be a powerful tool for augmenting human capabilities in software engineering, it is not yet ready to fully replace human expertise. The SWE-Lancer benchmark provides a valuable tool for assessing progress in this field and understanding the economic implications of AI in software development [3].
References
[1] OpenAI Researchers Find That Even the Best AI Is "Unable To Solve the Majority" of Coding Problems
[2] AI can fix bugs -- but can't find them: OpenAI's study highlights limits of LLMs in software engineering
[3] OpenAI Thinks LLMs Can Earn $1M from Freelance Software Engineering Tasks