It wasn't long ago that the startup Cognition was blowing minds with its product Devin, an AI-based software engineer powered by OpenAI's GPT-4 foundation large language model (LLM) on the backend that could autonomously write and edit code when given instructions in natural language text.
But Devin emerged in March 2024 -- five months ago -- an eternity in the fast-moving generative AI space.
Now, another startup whose name begins with "C," Cosine, which was founded through the esteemed Y Combinator startup accelerator in San Francisco, has announced its own new autonomous AI-powered engineer, Genie, which it says handily outperforms Devin, scoring 30% on the third-party benchmark test SWE-Bench compared to Devin's 13.8%, and even surpassing the 19% scored by Amazon's Q and Factory's Code Droid.
"This model is so much more than a benchmark score: it was trained from the start to think and behave like a human SWE [software engineer]," wrote Cosine's co-founder and CEO Alistair Pullen in a post on his account on the social network X.
What is Genie and what can it do?
Genie is an advanced AI software engineering model designed to autonomously tackle a wide range of coding tasks, from bug fixing to feature building, code refactoring, and validation through comprehensive testing, as instructed by human engineers or managers.
It operates either fully autonomously or in collaboration with users and aims to provide the experience of working alongside a skilled colleague.
"We've been chasing the dream of building something that can genuinely automatically perform end to end programming tasks with no intervention and a high degree of reliability -- an artificial colleague. Genie is the first step in doing exactly that," wrote Pullen in the Cosine blog post announcing Genie's performance and limited, invitation-only availability.
The AI can write software in a multitude of languages -- its technical report lists 15 as sources of data.
Cosine claims Genie can emulate the cognitive processes of human engineers.
"My thesis on this is simple: make it watch how a human engineer does their job, and mimic that process," Pullen explains in the blog post.
Powered by a long context OpenAI model
Unlike many AI models that rely on foundational models supplemented with a few tools, Genie was developed through a proprietary process that involves training and fine-tuning a long token output AI model from OpenAI.
"In terms of the model we're using, it's a (currently) non-general availability GPT-4o variant that OpenAI have allowed us to train as part of the experimental access program," Pullen wrote to VentureBeat via email. "The model has performed well and we've shared our learnings with the OpenAI finetuning team and engineering leadership as a result. This was a real turning point for us as it convinced them to invest resource and attention in our novel techniques."
While Cosine doesn't specify the particular model, OpenAI just recently announced the limited availability of a new GPT-4o Long Output Context model which can spit out up to 64,000 tokens of output instead of GPT-4o's initial 4,000 -- a 16-fold increase.
The training data was key
"For its most recent training run Genie was trained on billions of tokens of data, the mix of which was chosen to make the model as competent as possible on the languages our users care about the most at the current time," wrote Pullen in Cosine's technical report on the agent.
With its extensive context window and continuous loop of improvement, Genie iterates and refines its solutions until they meet the desired outcome.
Cosine says in its blog post that it spent nearly a year curating a dataset with a wide range of software development activities from real engineers.
"In practice, however, getting such [data] and then effectively utilising that data is extremely difficult, because essentially it doesn't exist," Pullen elaborated in his blog post, adding: "Our data pipeline uses a combination of artefacts, static analysis, self-play, step-by-step verification, and fine-tuned AI models trained on a large amount of labelled data to forensically derive the detailed process that must have happened to have arrived at the final output. The impact of the data labelling can't be understated, getting hold of very high quality data from competent software engineers is difficult, but the results were worth it as it gave so much insight as to how developers implicitly think about approaching problems."
In an email to VentureBeat, Pullen clarified that: "we started with artefacts of SWEs doing their jobs like PRs, commits, issues from OSS repos (MIT licensed) and then ran that data through our pipeline to forensically derive the reasoning, to reconstruct how the humans came to the conclusions they did. This proprietary dataset is what we trained the v1 on, and then we used self-play and self-improvement to get us the rest of the way."
According to Cosine, this dataset not only represents perfect information lineage and incremental knowledge discovery but also captures the step-by-step decision-making process of human engineers.
"By actually training our models with this dataset rather than simply prompting base models which is what everyone else is doing, we have seen that we're no longer just generating random code until some works, it's tackling problems like a human," Pullen asserted.
Implications and Future Developments
Genie's launch has far-reaching implications for software development teams, particularly those looking to enhance productivity and reduce the time spent on routine tasks. With its ability to autonomously handle complex programming challenges, Genie could potentially transform the way engineering resources are allocated, allowing teams to focus on more strategic initiatives.
"We're sprinting towards a future where engineering resources are no longer a constraint," said Pullen. "The value of an AI colleague that can jump into an unknown codebase, solve unseen problems, and do so orders of magnitude faster than a human is self-evident."
Cosine has ambitious plans for Genie's future development. The company intends to expand its model portfolio to include smaller models for simpler tasks and larger models capable of handling more complex challenges. Additionally, Cosine plans to extend its work into open-source communities by context-extending one of the leading open-source models and pre-training on a vast dataset.
Availability and Next Steps
While Genie is already being rolled out to select users, broader access is still being managed.
Interested parties can apply for early access to try Genie on their projects by filling out a webform on the Cosine website.
Cosine remains committed to continuous improvement, with plans to ship regular updates to Genie's capabilities based on customer feedback.
"SWE-Bench recently changed their submission requirements to include the full working process of AI models, which poses a challenge for us as it would require revealing proprietary methodologies," noted Pullen. "For now, we've decided to keep these internal processes confidential, but we've made Genie's final outputs publicly available for independent verification on GitHub."
More on Cosine
Cosine is a human reasoning lab focused on researching and codifying how humans perform tasks, with the aim of teaching AI to mimic, excel at, and expand on these tasks.
Founded in 2022 by Pullen, Sam Stenner, and Yang Li, the company's mission is to push the boundaries of AI by applying human reasoning to solve complex problems, starting with software engineering.
With a small but highly skilled team, Cosine has already made significant strides in the AI field, and Genie is just the beginning.
"We truly believe that we're able to codify human reasoning for any job and industry," Pullen stated in the announcement blog post. "Software engineering is just the most intuitive starting point, and we can't wait to show you everything else we're working on."