Curated by THEOUTPOST
On Fri, 24 Jan, 12:01 AM UTC
3 Sources
[1]
The "First AI Software Engineer" Is Bungling the Vast Majority of Tasks It's Asked to Do
Researchers have found that AI tech company Cognition's Devin, which it claims to be the "first AI software engineer," is astonishingly bad at its job. In a recent analysis, first spotted by The Register, a team of machine learning data scientists behind the independent AI research and development lab Answer.AI spent a month with the AI assistant, concluding that despite almost a year of hype, it "rarely worked." "Out of 20 tasks we attempted, we saw 14 failures, three inconclusive results, and just three successes," the researchers found -- a meager success rate of just 15 percent. Super, we've all had coworkers like that. But for tech that's supposed to represent the future, it's not inspiring confidence. "More concerning was our inability to predict which tasks would succeed," the team wrote. "Even tasks similar to our early wins would fail in complex, time-consuming ways. The autonomous nature that seemed promising became a liability -- Devin would spend days pursuing impossible solutions rather than recognizing fundamental blockers." For instance, Devin was asked to deploy multiple applications to a deployment platform called Railway, but instead of realizing it was "not actually possible to do this," Devin "marched forward and tried to do this and hallucinated some things about how to interact with Railway." The results highlight that despite Cognition AI's boisterous marketing about Devin being able to "build and deploy apps end to end" when the tool was first introduced in March 2024, the tech is still struggling with some fundamental problems. It's a pertinent topic, with Meta CEO Mark Zuckerberg recently announcing that he intends to replace "midlevel engineers" with AI as soon as this year. OpenAI is also rumored to "announce a next-level breakthrough that unleashes PhD-level super-agents to do complex human tasks," according to a recent column by Axios cofounder Mike Allen and CEO Jim VandeHei. 
But whether the tech will actually live up to the hype and be ready to start replacing human workers in such a tight time frame -- or even at all -- remains an open question. Devin is an amalgamation of several AI models that operates through the messaging platform Slack and has access to an entire computing environment, including a web browser, code editor, and terminal. Devin was only made available to a select group of users when it was first announced, but saw a much wider release last month, starting at a steep $500 a month for "engineering teams." As the Answer.AI team points out, early demos of the AI assistant were impressive. In a March video, Cognition claimed Devin could be used to "make money taking on messy tasks" on the freelancing platform Upwork. It didn't take long for researchers to call foul, with a number of software developers analyzing Cognition's video and accusing the company of "lying" about its claims. "All of this stuff makes it look like Devin did a bunch of work," said software engineer Carl Brown from the YouTube channel Internet of Bugs in an April video. "It makes it look like Devin accomplished a lot of stuff." "So it is honestly, as far as I'm concerned, kind of impressive," he added. "But in the context of what an Upwork job should have been, and especially in the context of a bunch of people saying that Devin is 'taking jobs off of Upwork and doing them,' and especially in the context of the company saying that this video will let us watch Devin get paid for doing work, which is, again, just a lie." Both Answer.AI and Brown found that Devin also took far longer than any human coder when completing tasks. "Tasks that seemed straightforward often took days rather than hours," the Answer.AI researchers wrote, "with Devin getting stuck in technical dead-ends or producing overly complex, unusable solutions." 
In short, Cognition's Devin highlights the often wide gap between AI companies' claims and reality, which has plagued the industry for years now. So whether an AI assistant will ever be able to competently replace a software engineer -- without causing any major headaches for its human coworkers, at least -- remains to be seen.
[2]
'First AI software engineer' is bad at its job
A service described as "the first AI software engineer" appears to be rather bad at its job, based on a recent evaluation. The auto-coder is called "Devin" and was introduced in March 2024. The bot's creator, an outfit called Cognition AI, has made claims such as "Devin can build and deploy apps end to end," and "can autonomously find and fix bugs in codebases." The tool reached general availability in December 2024, starting at $500 per month. "Devin is an autonomous AI software engineer that can write, run and test code, helping software engineers work on personal tasks or their team projects," Cognition's documentation declares. It "can review PRs, support code migrations, respond to on-call issues, build web applications, and even perform personal assistant tasks like ordering your lunch on DoorDash so you can stay locked in on your codebase." The service uses Slack as its main interface for commands, which are sent to its computing environment, a Docker container that hosts a terminal, browser, code editor, and planner. The AI agent supports API integration with external services. This allows it, for example, to send email messages on a user's behalf via SendGrid. Devin is a "compound AI system," meaning it relies on multiple underlying AI models, a set that has included OpenAI's GPT-4o and can be expected to evolve over time. In theory, you should be able to ask it to undertake tasks like migrating code to nbdev, a Jupyter Notebook development platform, and expect it to do so successfully. But that may be asking too much. Early assessments of Devin have found problems. Cognition AI posted a promo video that supposedly showed the AI coder autonomously completing projects on the freelancer-for-hire platform Upwork. Software developer Carl Brown analyzed that vid and debunked it on his Internet of Bugs YouTube channel. The software agent was also called out by another YouTube code pundit for allegedly including critical security issues. 
Now, three data scientists affiliated with Answer.AI, an AI research and development lab founded by Jeremy Howard and Eric Ries, have tested Devin and found it completed just three out of 20 tasks successfully. In an analysis conducted earlier this month by Hamel Husain, Isaac Flath, and Johno Whitaker, Devin started well, successfully pulling data from a Notion database into Google Sheets. The AI agent also managed to create a planet tracker for checking claims about the historical positions of Jupiter and Saturn. But as the three researchers continued their testing, they encountered problems. "Tasks that seemed straightforward often took days rather than hours, with Devin getting stuck in technical dead-ends or producing overly complex, unusable solutions," the researchers explain in their report. "Even more concerning was Devin's tendency to press forward with tasks that weren't actually possible." As an example, they cited how Devin, when asked to deploy multiple applications to the infrastructure deployment platform Railway, failed to understand this wasn't supported and spent more than a day trying approaches that didn't work and hallucinating non-existent features. Of 20 tasks presented to Devin, the AI software engineer completed just three of them satisfactorily - the two cited above and a third challenge to research how to build a Discord bot in Python. Three other tasks produced inconclusive results, and 14 projects were outright failures. The researchers said that Devin provided a polished user experience that was impressive when it worked. "But that's the problem - it rarely worked," they wrote. "More concerning was our inability to predict which tasks would succeed. Even tasks similar to our early wins would fail in complex, time-consuming ways. The autonomous nature that seemed promising became a liability - Devin would spend days pursuing impossible solutions rather than recognizing fundamental blockers." 
Cognition AI did not respond to a request for comment. ®
[3]
World's 'first AI software engineer' fails 85% of its assigned tasks
In the midst of the 'AI Revolution', there's been plenty of speculation about AI taking away jobs, and no sector has been dealing with those fears more than the software engineering industry. However, programmers can rest assured that one of the latest tools touted as a fully autonomous AI software engineer reportedly has its limitations. Devin is an AI programming tool originally introduced by Cognition AI in March of 2024. The tool, hailed as the "first AI software engineer," ignited job-security fears among programmers, particularly given that its claimed abilities included being able to "build and deploy apps end to end" and "autonomously find and fix bugs in codebases." Following its introduction, Cognition uploaded a video entitled "Devin's Upwork Side Hustle", which essentially claimed that the tool could make money by completing Upwork tasks. In April 2024, veteran software developer Carl Brown of the YouTube channel Internet of Bugs took to the platform to debunk some of the claims made for the tool, citing criticisms such as: "Devin didn't complete the advertised task. Instead, it generated errors in its own code and then fixed them." Shortly after, the original poster of the Upwork ad released a video supporting [Brown's] claim. Devin was rolled out to the general public in December 2024 with a price of $500 per month. Since then, the feedback has been somewhat similar. Three data scientists from Answer.AI, a reputable AI research and development lab, tested Devin and found that it completed only three out of 20 tasks successfully. The analysis, conducted by Hamel Husain, Isaac Flath, and Johno Whitaker, found that "tasks that seemed straightforward often took days rather than hours" and that Devin had the concerning tendency to "press forward with tasks that weren't actually possible". The researchers did credit the tool, noting that it was impressive - when it worked.
However, they concluded: "that's the problem - it rarely worked."
Cognition AI's Devin, touted as the world's first AI software engineer, has been found to succeed at only 15% of its assigned tasks, according to a recent evaluation. This revelation challenges claims about AI's readiness to replace human software engineers.
Cognition AI's Devin, marketed as the "first AI software engineer," has been found to significantly underperform expectations. A team of machine learning data scientists from Answer.AI conducted a month-long analysis of Devin, revealing a staggeringly low success rate of just 15% [1]. Out of 20 assigned tasks, Devin completed only three successfully, with 14 failures and three inconclusive results.
The researchers highlighted several key issues with Devin's performance:
Unpredictability: The team found it difficult to predict which tasks Devin would successfully complete, with even similar tasks often resulting in failure [2].
Time inefficiency: Tasks that seemed straightforward often took days rather than hours to complete, with Devin getting stuck in technical dead-ends or producing overly complex, unusable solutions [1].
Inability to recognize limitations: Devin would persistently pursue impossible solutions rather than recognizing fundamental blockers, spending excessive time on unachievable tasks [2].
Cognition AI's marketing claims about Devin's capabilities have been called into question. The company initially boasted that Devin could "build and deploy apps end to end" and "autonomously find and fix bugs in codebases" [3]. However, these claims have been challenged by multiple sources:
Software engineer Carl Brown analyzed Cognition's promotional video and accused the company of "lying" about its claims [1].
The Answer.AI team found that Devin often took far longer than any human coder to complete tasks [1].
Another YouTube code pundit pointed out critical security issues in Devin's output [2].
Devin's poor performance raises questions about the readiness of AI to replace human software engineers. This comes at a time when tech industry leaders like Mark Zuckerberg have announced intentions to replace "midlevel engineers" with AI [1]. The gap between AI companies' claims and reality continues to be a significant issue in the industry.
Despite its shortcomings, researchers noted that Devin provided a polished user experience that was impressive when it worked. However, the infrequency of successful outcomes remains a major concern [2].
As the AI industry continues to evolve, the case of Devin serves as a reminder of the challenges that lie ahead in developing truly autonomous AI systems capable of replacing human software engineers. It also highlights the importance of critical evaluation and transparency in AI development and marketing claims.
© 2025 TheOutpost.AI All rights reserved