Curated by THEOUTPOST
On Wed, 4 Dec, 12:02 AM UTC
2 Sources
[1]
GitHub Copilot code quality claims challenged
We're shocked - shocked - that Microsoft's study of its own tools might not be super-rigorous

GitHub's claim that the quality of programming code written with its Copilot AI model is "significantly more functional, readable, reliable, maintainable, and concise" has been challenged by software developer Dan Cîmpianu. Cîmpianu, based in Romania, published a blog post in which he assails the statistical rigor of GitHub's Copilot code quality data.

GitHub last month cited research making several claims about the benefits seen by developers using Copilot. The first phase of the study relied on 243 developers with at least five years of Python experience who were randomly assigned to use GitHub Copilot (104) or not (98) - only 202 developer submissions ended up being valid. Each group created a web server to handle fictional restaurant reviews, supported by ten unit tests. Thereafter, each submission was reviewed by at least ten of the participants - a process that produced only 1,293 code reviews rather than the 2,020 that a 10x multiplication might lead one to expect.

GitHub declined The Register's invitation to respond to Cîmpianu's critique.

Cîmpianu takes issue with the choice of assignment, given that writing a basic Create, Read, Update, Delete (CRUD) app is the subject of endless online tutorials and therefore certain to have been included in training data used by code completion models. A more complex challenge would be better, he contends.

He then goes on to question GitHub's inadequately explained graph, which shows that 60.8 percent of developers using Copilot passed all ten unit tests, while only 39.2 percent of developers not using Copilot passed all the tests. Based on the firm's cited developer totals, that would be about 63 Copilot-using developers out of 104 and about 38 non-Copilot developers out of 98.
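The head counts behind those percentages are straightforward to check. A quick sketch of the arithmetic, using only the figures quoted above:

```python
# Group sizes and pass rates as reported in GitHub's study.
copilot_devs, no_copilot_devs = 104, 98
copilot_pass_rate, no_copilot_pass_rate = 0.608, 0.392

# Convert the percentages back into developer counts.
copilot_passed = round(copilot_devs * copilot_pass_rate)            # about 63
no_copilot_passed = round(no_copilot_devs * no_copilot_pass_rate)   # about 38
total_passed = copilot_passed + no_copilot_passed                   # about 101
print(copilot_passed, no_copilot_passed, total_passed)

# Expected review count if each of the 202 valid submissions
# was reviewed ten times, versus the 1,293 actually reported.
expected_reviews = 202 * 10
print(expected_reviews)  # 2020
```

The roughly 101 developers who passed all ten tests is the figure that makes GitHub's later reference to "the 25 developers who authored code that passed all ten unit tests" look inconsistent.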
But GitHub's post then reveals: "The 25 developers who authored code that passed all ten unit tests from the first phase of the study were randomly assigned to do a blind review of the anonymized submissions, both those written with and without GitHub Copilot." Cîmpianu observes that something doesn't add up here. One possible explanation is that GitHub misapplied the definite article "the" and simply meant that 25 developers out of the total of 101 who passed all the tests were selected to do code reviews.

More significantly, Cîmpianu takes issue with GitHub's claim that devs using Copilot produced significantly fewer code errors. As GitHub put it, "developers using GitHub Copilot wrote 18.2 lines of code per code error, but only 16.0 without. That equals 13.6 percent more lines of code with GitHub Copilot on average without a code error (p=0.002)."

Cîmpianu argues that 13.6 percent is a misleading use of statistics because it only refers to two additional lines of code. While allowing that one might argue that adds up over time, he points out that the supposed error reduction is not actual error reduction. Rather, it's coding style issues or linter warnings. As GitHub acknowledges in its definition of code errors: "This did not include functional errors that would prevent the code from operating as intended, but instead errors that represent poor coding practices."

Cîmpianu is also unhappy with GitHub's claim that Copilot-assisted code was more readable, reliable, maintainable, and concise by 1 to 3 percent. He notes that the metrics for code style and code reviews can be highly subjective, and that details about how code was assessed have not been provided.

Cîmpianu goes on to criticize GitHub's decision to use the same developers who submitted code samples for code evaluation, instead of an impartial group. "At the very least, I can appreciate they only made the developers who passed all unit tests do the reviewing," he wrote.
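The 13.6 percent figure comes from a simple ratio, which makes Cîmpianu's point about the small absolute difference easy to see:

```python
# Lines of code per code error, as reported by GitHub.
with_copilot = 18.2
without_copilot = 16.0

# Absolute and relative differences between the two groups.
absolute_gain = with_copilot - without_copilot          # about 2.2 lines
relative_gain = absolute_gain / without_copilot * 100   # about 13.7 percent
print(round(absolute_gain, 1), round(relative_gain, 1))
```

With these rounded inputs the ratio works out to roughly 13.7 percent (GitHub reports 13.6, presumably from unrounded underlying data) - either way, the headline percentage corresponds to only about two extra lines of code between style-level "errors".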
"But remember, dear reader, that you're baited with a 3 percent increase in preference from some random 25 developers, whose only credentials (at least mentioned by the study) are holding a job for five years and passing ten unit tests." Cîmpianu points to a 2023 report from GitClear that found GitHub Copilot reduced code quality. Another paper by researchers affiliated with Bilkent University in Turkey, released in April 2023 and revised in October 2023, found that ChatGPT, GitHub Copilot, and Amazon Q Developer (formerly CodeWhisperer) all produce errors. And to the extent those errors produced "code smells" - poor coding practices that can give rise to vulnerabilities - "the average time to eliminate them was 9.1 minutes for GitHub Copilot, 5.6 minutes for Amazon CodeWhisperer, and 8.9 minutes for ChatGPT." That paper concludes, "All code generation tools are capable of generating valid code nine out of ten times with mostly similar types of issues. The practitioners should expect that for 10 percent of the time the generated code by the code generation tools would be invalid. Moreover, they should test their code thoroughly to catch all possible cases that may cause the generated code to be invalid." Nonetheless, a lot of developers are using AI coding tools like GitHub Copilot as an alternative to searching for answers on the web. Often, a partially correct code suggestion is enough to help inexperienced coders make progress. And those with substantial coding experience also see value in AI code suggestion models. As veteran open source developer Simon Willison observed in a recent interview [VIDEO]: "Somebody who doesn't know how to program can use Claude 3.5 artefacts to produce something useful. Somebody who does know how to program will do it better and faster and they'll ask better questions of it and they will produce a better result." For GitHub, maybe the message is that code quality, like security, isn't top of mind for many developers. 
Cîmpianu contends it shouldn't be that way. "[I]f you can't write good code without an AI, then you shouldn't use one in the first place," he concludes. Try telling that to the authors who don't write good prose, the recording artists who aren't good musicians, the video makers who never studied filmmaking, and the visual artists who can't draw very well. ®
[2]
Code written by OpenAI and praised by GitHub may not be as good as GitHub says
The test focused on a highly repetitive task - AI's ultimate role

Software developer Dan Cîmpianu has criticized the quality of AI-generated code in a blog post targeting GitHub's claims about its Copilot AI tool. More specifically, the Romanian developer slated the statistical accuracy and experimental design of a recent GitHub study, which claimed that Copilot-assisted code was "significantly more functional, readable, reliable, maintainable, and concise."

However, the study focused on writing API endpoints for a web server - Create, Read, Update and Delete (CRUD) actions - which Cîmpianu described as "one of the most boring, repetitive, uninspired, and cognitively unchallenged aspects of development."

The study compared GitHub's OpenAI-backed AI-generated code with that of over 200 experienced developers, and found the AI-assisted code performed better across multiple metrics. However, Cîmpianu criticized GitHub for using percentages to denote differences without providing the baseline metrics for comparison, which can make the percentage values look more impressive than they are.

GitHub's study also defines errors as "inconsistent naming, unclear identifiers, excessive line length, excessive whitespace, missing documentation, repeated code, excessive branching or loop depth, insufficient separation of functionality, and variable complexity," meaning that bugs produced by the code were not included in the statistics.

Another criticism of the study is that, despite GitHub being "home to 1 billion developers," it drew on a sample of only 243 developers. Cîmpianu concluded: "This does not seem to be even attempting to [be] aimed towards developers, but rather has the perfume of marketing, catered to the C-suites with buying power."

Moreover, the developer also highlighted the skill required to write strong code, stating that AI should be seen as a supplement and an aid rather than a substitute for continued training.
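Cîmpianu's "boring, repetitive" characterization is easy to appreciate: stripped of the web framework, the study task reduces to dictionary bookkeeping. A minimal, framework-free sketch (class and field names are hypothetical, and an in-memory store stands in for the real web server the study required):

```python
import itertools

class ReviewStore:
    """In-memory store for restaurant reviews, illustrating the
    Create, Read, Update, Delete operations a study submission
    would expose as API endpoints."""

    def __init__(self):
        self._reviews = {}
        self._ids = itertools.count(1)  # auto-incrementing review IDs

    def create(self, restaurant, rating, text):
        review_id = next(self._ids)
        self._reviews[review_id] = {
            "restaurant": restaurant, "rating": rating, "text": text,
        }
        return review_id

    def read(self, review_id):
        return self._reviews[review_id]

    def update(self, review_id, **fields):
        self._reviews[review_id].update(fields)

    def delete(self, review_id):
        del self._reviews[review_id]

store = ReviewStore()
rid = store.create("Trattoria Roma", 4, "Good pasta")
store.update(rid, rating=5)
print(store.read(rid)["rating"])  # 5
store.delete(rid)
```

Because this pattern is so formulaic and so heavily represented in online tutorials, Cîmpianu argues it is almost certain to be well covered in any code completion model's training data.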
A software developer challenges GitHub's claims about the quality of code produced by its AI tool Copilot, raising questions about the study's methodology and statistical rigor.
GitHub's recent claims about the superior quality of code produced by its AI-powered Copilot tool have been challenged by software developer Dan Cîmpianu. The Romanian developer has raised significant questions about the statistical rigor and methodology of GitHub's study, which asserted that Copilot-assisted code was "significantly more functional, readable, reliable, maintainable, and concise" [1].
The study, which involved 243 developers with at least five years of Python experience, tasked participants with creating a web server for fictional restaurant reviews. Cîmpianu argues that this choice of assignment – a basic Create, Read, Update, Delete (CRUD) app – is problematic as it's likely to be well-represented in the training data for code completion models [1].
Furthermore, the developer questions the statistical presentation of the results. For instance, GitHub's claim that developers using Copilot wrote 13.6 percent more lines of code without errors is criticized as potentially misleading, as it represents only about two additional lines of code [1].
A key point of contention is GitHub's definition of 'code errors'. The study did not include functional errors that would prevent code from operating as intended, but instead focused on "poor coding practices" [1]. This definition raises questions about the practical implications of the reported error reduction.
Cîmpianu also challenges GitHub's claims of 1-3% improvements in code readability, reliability, maintainability, and conciseness. He notes that these metrics can be highly subjective, and details about the assessment process were not provided [1][2].
Despite GitHub's vast user base of "1 billion developers," the study's sample size of 243 developers is criticized as potentially inadequate [2]. Additionally, Cîmpianu questions the decision to use the same developers who submitted code samples for code evaluation, instead of an impartial group [1].
The critique points to conflicting evidence from other research. A 2023 report from GitClear found that GitHub Copilot actually reduced code quality [1]. Another study by researchers at Bilkent University in Turkey revealed that AI coding tools, including GitHub Copilot, produce errors in about 10% of generated code [1].
While many developers find value in AI coding tools like GitHub Copilot, especially for tasks like searching for answers or assisting inexperienced coders, Cîmpianu argues that these tools should be seen as supplements rather than substitutes for continued training and skill development [2].
As veteran open source developer Simon Willison noted, "Somebody who doesn't know how to program can use Claude 3.5 artefacts to produce something useful. Somebody who does know how to program will do it better and faster and they'll ask better questions of it and they will produce a better result" [1].
This debate highlights the ongoing discussions about the role of AI in software development and the importance of rigorous, transparent evaluation of AI-assisted coding tools.
© 2025 TheOutpost.AI All rights reserved