Curated by THEOUTPOST
On Fri, 11 Apr, 4:02 PM UTC
5 Sources
[1]
Researchers find AI is pretty bad at debugging -- but they're working on it
There are few areas where AI has seen more robust deployment than the field of software development. From "vibe" coding to GitHub Copilot to startups building quick-and-dirty applications with support from LLMs, AI is already deeply integrated. However, those claiming we're mere months away from AI agents replacing most programmers should adjust their expectations, because models aren't good enough at the debugging part, and debugging occupies most of a developer's time.

That's the suggestion of Microsoft Research, which built a new tool called debug-gym to test and improve how AI models can debug software. Debug-gym (available on GitHub and detailed in a blog post) is an environment that allows AI models to try to debug any existing code repository, with access to debugging tools that aren't historically part of the process for these models. Microsoft found that without this approach, models are notably bad at debugging tasks. With it, they're better, but still a far cry from what an experienced human developer can do.

Here's how Microsoft's researchers describe debug-gym:

"Debug-gym expands an agent's action and observation space with feedback from tool usage, enabling setting breakpoints, navigating code, printing variable values, and creating test functions. Agents can interact with tools to investigate code or rewrite it, if confident. We believe interactive debugging with proper tools can empower coding agents to tackle real-world software engineering tasks and is central to LLM-based agent research. The fixes proposed by a coding agent with debugging capabilities, and then approved by a human programmer, will be grounded in the context of the relevant codebase, program execution and documentation, rather than relying solely on guesses based on previously seen training data."

In Microsoft's tests, this approach was much more successful than relying on the models as they're usually used, but when your best case is a 48.4 percent success rate, you're not ready for primetime. The limitations are likely because the models don't fully understand how to best use the tools, and because their current training data is not tailored to this use case.

"We believe this is due to the scarcity of data representing sequential decision-making behavior (e.g., debugging traces) in the current LLM training corpus," the blog post says. "However, the significant performance improvement... validates that this is a promising research direction."

This initial report is just the start of the efforts, the post claims. The next step is to "fine-tune an info-seeking model specialized in gathering the necessary information to resolve bugs." If the model is large, the best move to save inference costs may be to "build a smaller info-seeking model that can provide relevant information to the larger one."

This isn't the first time we've seen outcomes suggesting that some of the ambitious ideas about AI agents directly replacing developers are pretty far from reality. Numerous studies have already shown that even though an AI tool can sometimes create an application that seems acceptable to the user for a narrow task, the models tend to produce code laden with bugs and security vulnerabilities, and they aren't generally capable of fixing those problems.
This is an early step on the path to AI coding agents, and most researchers agree the most likely outcome is an agent that saves a human developer a substantial amount of time, not one that can do everything a developer can do.
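To make the idea of an expanded "action and observation space" concrete, here is a minimal sketch of the kind of tool-driven loop debug-gym enables. The environment interface, command strings, and llm object below are illustrative assumptions made for this article, not debug-gym's actual API.

# Illustrative sketch only: the DebugEnvironment interface and command
# strings are hypothetical stand-ins, not debug-gym's real API.
from dataclasses import dataclass

@dataclass
class Observation:
    """What the agent sees after each action: tool output plus test status."""
    tool_output: str
    tests_passing: bool

class DebugEnvironment:
    """Hypothetical wrapper around a repository and a pdb-style debugger."""

    def step(self, action: str) -> Observation:
        # A real environment would dispatch debugger-style commands, e.g.:
        #   "b utils.py:42"   -> set a breakpoint
        #   "p result"        -> print a variable's value
        #   "rewrite <patch>" -> apply a patch and rerun the test suite
        raise NotImplementedError

def debug_loop(env: DebugEnvironment, llm, max_steps: int = 20) -> bool:
    """Let an LLM-backed agent investigate and patch code until tests pass."""
    history: list[str] = []
    for _ in range(max_steps):
        # The model picks the next debugger command given the interaction so far.
        action = llm.next_action(history)
        obs = env.step(action)
        history.append(f"{action} -> {obs.tool_output}")
        if obs.tests_passing:
            return True  # the fix is grounded in observed program execution
    return False

The point of the loop is the grounding the researchers describe: each proposed edit follows from observed breakpoints and variable values rather than from training-data guesswork alone.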
[2]
AI models still struggle to debug software, Microsoft study shows | TechCrunch
AI models from OpenAI, Anthropic, and other top AI labs are increasingly being used to assist with programming tasks. Google CEO Sundar Pichai said in October that 25% of new code at the company is generated by AI, and Meta CEO Mark Zuckerberg has expressed ambitions to widely deploy AI coding models within the social media giant.

Yet even some of the best models today struggle to resolve software bugs that wouldn't trip up experienced devs. A new study from Microsoft Research, Microsoft's R&D division, reveals that models, including Anthropic's Claude 3.7 Sonnet and OpenAI's o3-mini, fail to debug many issues in a software development benchmark called SWE-bench Lite. The results are a sobering reminder that, despite bold pronouncements from companies like OpenAI, AI is still no match for human experts in domains such as coding.

The study's co-authors tested nine different models as the backbone for a "single prompt-based agent" that had access to a number of debugging tools, including a Python debugger. They tasked this agent with solving a curated set of 300 software debugging tasks from SWE-bench Lite. According to the co-authors, even when equipped with stronger and more recent models, their agent rarely completed more than half of the debugging tasks successfully. Claude 3.7 Sonnet had the highest average success rate (48.4%), followed by OpenAI's o1 (30.2%) and o3-mini (22.1%).

Why the underwhelming performance? Some models struggled to use the debugging tools available to them and to understand how different tools might help with different issues. The bigger problem, though, was data scarcity, according to the co-authors. They speculate that there's not enough data representing "sequential decision-making processes" -- that is, human debugging traces -- in current models' training data.

"We strongly believe that training or fine-tuning [models] can make them better interactive debuggers," wrote the co-authors in their study. "However, this will require specialized data to fulfill such model training, for example, trajectory data that records agents interacting with a debugger to collect necessary information before suggesting a bug fix."

The findings aren't exactly shocking. Many studies have shown that code-generating AI tends to introduce security vulnerabilities and errors, owing to weaknesses in areas like the ability to understand programming logic. One recent evaluation of Devin, a popular AI coding tool, found that it could only complete three out of 20 programming tests. But the Microsoft work is one of the more detailed looks yet at a persistent problem area for models.

It likely won't dampen investor enthusiasm for AI-powered assistive coding tools, but with any luck, it'll make developers -- and their higher-ups -- think twice about letting AI run the coding show. For what it's worth, a growing number of tech leaders have disputed the notion that AI will automate away coding jobs. Microsoft co-founder Bill Gates has said he thinks programming as a profession is here to stay. So have Replit CEO Amjad Masad, Okta CEO Todd McKinnon, and IBM CEO Arvind Krishna.
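The "trajectory data" the co-authors call for can be pictured as structured records of debugger sessions. The schema below is a hedged sketch of what such a record might contain; the field names are assumptions made for illustration, not taken from the study.

# Hypothetical schema for the debugger-interaction trajectories the
# co-authors describe; field names are illustrative assumptions,
# not from the Microsoft study.
from dataclasses import dataclass

@dataclass
class DebuggerStep:
    command: str   # e.g. a pdb command such as "p len(items)"
    output: str    # what the debugger printed back

@dataclass
class DebugTrajectory:
    task_id: str                 # e.g. a SWE-bench Lite instance id
    steps: list[DebuggerStep]    # the sequential decision-making behavior
    final_patch: str             # the fix proposed after gathering information
    resolved: bool               # whether the patch passed the task's tests

Fine-tuning on large numbers of such records is what the co-authors suggest could teach a model to interrogate a running program before editing it.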
[3]
AI Might Not Be Taking Your Programming Job Just Yet, Says Microsoft Research
If you're a programmer worried about AI taking your job, like many other members of the general public, Microsoft might have some promising news for you. Microsoft Research, Microsoft's R&D division, tested a variety of the most popular large language models (LLMs) and found many came up surprisingly short on a common programming task.

The study tested nine different models, including Anthropic's Claude 3.7 Sonnet, OpenAI's o1, and OpenAI's o3-mini. The researchers assessed the ability of these AIs to perform "debugging," the process where programmers sift through existing code to find flaws that prevent it from working as intended (something that often takes up huge chunks of programmers' time). Microsoft hooked the AIs up to a debugging environment of its own creation called debug-gym and tested them on a common software benchmark known as SWE-bench Lite.

The study had mixed results, and none of the tools achieved even a 50% success rate, even with the help of debug-gym. Anthropic's Claude 3.7 Sonnet was the best performer, managing to successfully debug the faulty code in 48.4% of cases. OpenAI's o1 achieved success 30.2% of the time, while OpenAI's o3-mini did so 22.1% of the time.

Microsoft's team reiterated that it believes AI tools like the above can become effective code debuggers, and said it plans "to fine-tune an info-seeking model specialized in gathering the necessary information to resolve bugs" in its future research.

The findings may provide some slight relief for worried programmers as more of the tech world's largest names pivot toward using AI for coding. In October 2024, Google announced during an earnings call that it is now using AI to write "a quarter of all new code." Meanwhile, AI startup Cognition Labs rolled out a new AI tool last year, dubbed Devin AI, that it claims can write code without human interference, complete engineering jobs on Upwork, and adjust its own AI models.

Meta CEO Mark Zuckerberg is another famous face making big claims about the rise of AI programmers. He told podcaster Joe Rogan that his company "are going to have an AI that can effectively be a sort of mid-level engineer that you have at your company that can write code" at some point in 2025, adding that he expected other companies to have similar capabilities.
[4]
Microsoft research shows AI coding tools fall short in key debugging tasks
In context: Some industry experts boldly claim that generative AI will soon replace human software developers. With tools like GitHub Copilot and AI-driven "vibe" coding startups, it may seem that AI has already significantly impacted software engineering. However, a new study suggests that AI still has a long way to go before replacing human programmers.

The Microsoft Research study acknowledges that while today's AI coding tools can boost productivity by suggesting examples, they are limited in actively seeking new information or interacting with code execution when those suggestions fail. Human developers routinely perform these tasks when debugging, highlighting a significant gap in AI's capabilities.

Microsoft introduced a new environment called debug-gym to explore and address these challenges. This platform allows AI models to debug real-world codebases using tools similar to those developers use, enabling the information-seeking behavior essential for effective debugging.

Microsoft tested how well a simple AI agent, built with existing language models, could debug real-world code using debug-gym. While the results were promising, they were still limited. Despite having access to interactive debugging tools, the prompt-based agents rarely solved more than half of the tasks in benchmarks. That's far from the level of competence needed to replace human engineers.

The research identifies two key issues at play. First, the training data for today's LLMs lacks sufficient examples of the decision-making behavior typical in real debugging sessions. Second, these models are not yet fully capable of utilizing debugging tools to their full potential. "We believe this is due to the scarcity of data representing sequential decision-making behavior (e.g., debugging traces) in the current LLM training corpus," the researchers said.

Of course, artificial intelligence is advancing rapidly, and Microsoft believes that language models can become much more capable debuggers with the right focused training approaches over time. One approach the researchers suggest is creating specialized training data focused on debugging processes and trajectories. For example, they propose developing an "info-seeking" model that gathers relevant debugging context and passes it on to a larger code-generation model, as sketched below.

The broader findings align with previous studies, showing that while artificial intelligence can occasionally generate seemingly functional applications for specific tasks, the resulting code often contains bugs and security vulnerabilities. Until artificial intelligence can handle this core function of software development, it will remain an assistant - not a replacement.
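The proposed split between a small info-seeking model and a larger code-generation model can be pictured as a simple two-stage pipeline. Everything below is an assumption made for illustration; the researchers propose the idea, not a concrete interface.

# A minimal sketch of the proposed two-model split; both model objects and
# the env.run_debugger call are hypothetical, not an API from the study.

def gather_context(info_seeker, env, budget: int = 10) -> str:
    """Small model drives the debugger to collect relevant context."""
    notes: list[str] = []
    for _ in range(budget):
        command = info_seeker.choose_command(notes)  # e.g. "p stack_trace"
        if command is None:  # the model decides it has gathered enough
            break
        notes.append(env.run_debugger(command))
    return "\n".join(notes)

def propose_fix(code_model, repo_snapshot: str, context: str) -> str:
    """Larger model writes the patch, grounded in the gathered context."""
    prompt = (
        f"Repository:\n{repo_snapshot}\n\n"
        f"Debugger findings:\n{context}\n\n"
        "Suggest a patch:"
    )
    return code_model.generate(prompt)

Keeping the info-seeking model small is what would save inference cost: the expensive code-generation model is called once, on a distilled summary of the debugging session, rather than at every step.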
[5]
Microsoft study claims AI is still struggling to debug software
Microsoft's researchers are open-sourcing their tools to facilitate research

Although generative AI is increasingly being integrated into programming workflows, new research from Microsoft reveals that large language models still aren't quite up to scratch when it comes to debugging. The research suggests that even advanced models still struggle with debugging tasks that are pretty simple for experienced developers, highlighting the continued importance of human programmers.

AI does appear to have a solid use case, though, with Google now claiming that around 25% of new code is AI-generated. Meta has also noted the wide deployment of AI for coding.

The report explores how 11 Microsoft researchers tested nine AI models on SWE-bench Lite - a popular debugging benchmark. Claude 3.7 Sonnet offered the highest success rate at a far-from-perfect 48.4%. OpenAI's o1 and o3-mini posted lower success rates of 30.2% and 22.1% respectively.

"Even with debugging tools, our simple prompt-based agent rarely solves more than half of the SWE-bench Lite issues," the researchers wrote, blaming the suboptimal performance on a lack of data representing sequential decision-making behavior.

All hope is not lost, though. "We believe that training or fine-tuning LLMs can enhance their interactive debugging abilities," they added. The researchers intend to fine-tune an info-seeking model specialized in gathering the necessary information to resolve bugs, but in the meantime, they promise to open-source debug-gym to make it easier for others to conduct similar research. Debug-gym is described as an "environment that allows code-repairing agents to access tools for active information-seeking behavior."

However, for now, artificial intelligence might not be bringing as much value to developers' lives as AI companies suggest. "Most developers spend the majority of their time debugging code," the researchers wrote, indicating that even if they are benefitting from code generation, it might not be saving them that much time.
A new study by Microsoft Research shows that even advanced AI models struggle with software debugging tasks, highlighting the continued importance of human programmers in the field.
A recent study by Microsoft Research has shed light on the current limitations of artificial intelligence (AI) in software debugging, a crucial aspect of programming. Despite the increasing integration of AI into various coding tasks, the research reveals that even advanced AI models struggle with debugging problems that experienced human developers can easily solve 1.
To assess and improve AI's debugging capabilities, Microsoft researchers developed a new tool called debug-gym. This environment allows AI models to debug existing code repositories using tools that are typically not part of their process. Debug-gym expands an agent's action and observation space, enabling it to set breakpoints, navigate code, print variable values, and create test functions 1.
The study tested nine different AI models, including Anthropic's Claude 3.7 Sonnet and OpenAI's o1 and o3-mini, on a curated set of 300 software debugging tasks from SWE-bench Lite. The results were underwhelming: Claude 3.7 Sonnet achieved the highest average success rate at 48.4%, followed by OpenAI's o1 at 30.2% and o3-mini at 22.1% 2.
These figures indicate that even the best-performing AI models are far from matching the capabilities of experienced human developers in debugging tasks.
The researchers identified two main factors contributing to AI's poor debugging performance: a scarcity of training data that captures sequential decision-making behavior, such as human debugging traces, and the models' difficulty in using the available debugging tools and understanding which tool helps with which kind of issue 2.
While AI has made significant inroads in code generation, with companies like Google reporting that 25% of their new code is AI-generated, the debugging limitations highlight the continued importance of human programmers 3.
Several tech leaders, including Microsoft co-founder Bill Gates, Replit CEO Amjad Masad, and IBM CEO Arvind Krishna, have disputed the notion that AI will completely automate programming jobs in the near future 2.
Microsoft researchers believe that with the right focused training approaches, AI models can become more capable debuggers over time. They propose developing specialized training data focused on debugging processes and trajectories. Additionally, they plan to fine-tune an info-seeking model specialized in gathering necessary information to resolve bugs 5.
To facilitate further research in this area, Microsoft is open-sourcing the debug-gym environment, allowing other researchers to conduct similar studies and potentially improve AI's debugging capabilities 5.
As the field of AI in software development continues to evolve, it appears that the most likely outcome in the near term is not the replacement of human developers, but rather the development of AI agents that can significantly enhance developer productivity by handling certain tasks more efficiently.