28 Sources
[1]
Anthropic calls new Claude 4 "world's best" AI coding model
On Thursday, Anthropic released Claude Opus 4 and Claude Sonnet 4, marking the company's return to larger model releases after primarily focusing on mid-range Sonnet variants since June of last year. The new models represent what the company calls its most capable coding models yet, with Opus 4 designed for complex, long-running tasks that can operate autonomously for hours. Alex Albert, Anthropic's head of Claude Relations, told Ars Technica that the company chose to revive the Opus line because of growing demand for agentic AI applications. "Across all the companies out there that are building things, there's a really large wave of these agentic applications springing up, and a very high demand and premium being placed on intelligence," Albert said. "I think Opus is going to like fit that groove perfectly."

Before we go further, a brief refresher on Claude's three AI model "size" names (first introduced in March 2024) is probably warranted. Haiku, Sonnet, and Opus offer a tradeoff between price (in the API), speed, and capability. Haiku models are the smallest, least expensive to run, and least capable in terms of what you might call "context depth" (considering conceptual relationships in the prompt) and encoded knowledge. Owing to their smaller parameter count, Haiku models retain fewer concrete facts and thus tend to confabulate more frequently (producing plausible-sounding answers despite lacking the relevant data) than larger models, but they are much faster at basic tasks. Sonnet is traditionally a mid-range model that hits a balance between cost and capability, and Opus models have always been the largest and slowest to run. However, Opus models process context more deeply and are hypothetically better suited for running deep logical tasks.

There is no Claude 4 Haiku just yet, but the new Sonnet and Opus models can reportedly handle tasks that previous versions could not. In our interview with Albert, he described testing scenarios where Opus 4 worked coherently for up to 24 hours on tasks like playing Pokémon, while coding refactoring tasks in Claude Code ran for seven hours without interruption. Earlier Claude models typically lasted only one to two hours before losing coherence, Albert says, meaning that the models could only produce useful self-referencing outputs for that long before beginning to output too many errors. In particular, that marathon refactoring claim reportedly comes from Rakuten, a Japanese tech services conglomerate that "validated [Claude's] capabilities with a demanding open-source refactor running independently for 7 hours with sustained performance," Anthropic states in a news release.

Whether you'd want to leave an AI model unsupervised for that long is another question entirely because even the most capable AI models can introduce subtle bugs, go down unproductive rabbit holes, or make choices that seem logical to the model but miss important context that a human developer would catch. While many people now use Claude for easy-going vibe coding, as we covered in March, the human-powered (and ironically named) "vibe debugging" that often results from long AI coding sessions is also a very real thing. More on that below.

To shore up some of those shortcomings, Anthropic built memory capabilities into both new Claude 4 models, allowing them to maintain external files for storing key information across long sessions.
When developers provide access to local files, the models can create and update "memory files" to track progress and things they deem important over time. Albert compared this to how humans take notes during extended work sessions.

Extended thinking meets tool use

Both Claude 4 models introduce what Anthropic calls "extended thinking with tool use," a new beta feature allowing the models to alternate between simulated reasoning and using external tools like web search, similar to what OpenAI's o3 and o4-mini-high AI models do right now in ChatGPT. While Claude 3.7 Sonnet already had strong tool use capabilities, the new models can now interleave simulated reasoning and tool calling in a single response. "So now we can actually think, call a tool, process the results, think some more, call another tool, and repeat until it gets to a final answer," Albert explained to Ars. The models self-determine when they have reached a useful conclusion, a capability picked up through training rather than governed by explicit human programming.

In practice, we've anecdotally found this kind of parallel tool use capability very useful in AI assistants like OpenAI o3, since they don't have to rely on what is trained in their neural network to provide accurate answers. Instead, these more agentic models can iteratively search the web, parse the results, analyze images, and spin up coding tasks for analysis, avoiding the confabulation trap of relying solely on pure LLM outputs.

"The world's best coding model"

Anthropic says Opus 4 leads industry benchmarks for coding tasks, achieving 72.5 percent on SWE-bench and 43.2 percent on Terminal-bench, calling it "the world's best coding model." According to Anthropic, companies using early versions report improvements. Cursor described it as "state-of-the-art for coding and a leap forward in complex codebase understanding," while Replit noted "improved precision and dramatic advancements for complex changes across multiple files." In fact, GitHub announced it will use Sonnet 4 as the base model for its new coding agent in GitHub Copilot, citing the model's performance in "agentic scenarios" in Anthropic's news release. Sonnet 4 scored 72.7 percent on SWE-bench while maintaining faster response times than Opus 4. The fact that GitHub is betting on Claude rather than a model from its parent company Microsoft (which has close ties to OpenAI) suggests Anthropic has built something genuinely competitive.

Anthropic says it has addressed a persistent issue with Claude 3.7 Sonnet in which users complained that the model would take unauthorized actions or provide excessive output. Albert said the company reduced this "reward hacking behavior" by approximately 80 percent in the new models through training adjustments. An 80 percent reduction in unwanted behavior sounds impressive, but that also suggests that 20 percent of the problem behavior remains -- a big concern when we're talking about AI models that might be performing autonomous tasks for hours.

When we asked about code accuracy, Albert said that human code review is still an important part of shipping any production code. "There's a human parallel, right? So this is just a problem we've had to deal with throughout the whole nature of software engineering. And this is why the code review process exists, so that you can catch these things. We don't anticipate that going away with models either," Albert said.
"If anything, the human review will become more important, and more of your job as developer will be in this review than it will be in the generation part." Pricing and availability Both Claude 4 models maintain the same pricing structure as their predecessors: Opus 4 costs $15 per million tokens for input and $75 per million for output, while Sonnet 4 remains at $3 and $15. The models offer two response modes: traditional LLM and simulated reasoning ("extended thinking") for complex problems. Given that some Claude Code sessions can apparently run for hours, those per-token costs will likely add up very quickly for users who let the models run wild. Anthropic made both models available through its API, Amazon Bedrock, and Google Cloud Vertex AI. Sonnet 4 remains accessible to free users, while Opus 4 requires a paid subscription. The Claude 4 models also debut Claude Code (first introduced in February) as a generally available product after months of preview testing. Anthropic says the coding environment now integrates with VS Code and JetBrains IDEs, showing proposed edits directly in files. A new SDK allows developers to build custom agents using the same framework. Even with Anthropic's future riding on the capability of these new models, when we asked about how they guide Claude's behavior by fine-tuning, Albert acknowledged that the inherent unpredictability of these systems presents ongoing challenges for both them and developers. "In the realm and the world of software for the past 40, 50 years, we've been running on deterministic systems, and now all of a sudden, it's non-deterministic, and that changes how we build," he said. "I empathize with a lot of people out there trying to use our APIs and language models generally because they have to almost shift their perspective on what it means for reliability, what it means for powering a core of your application in a non-deterministic way," Albert added. "These are general oddities that have kind of just been flipped, and it definitely makes things more difficult, but I think it opens up a lot of possibilities as well."
[2]
Anthropic's new Claude 4 AI models can reason over many steps
During its inaugural developer conference Thursday, Anthropic launched two new AI models that the startup claims are among the industry's best, at least in terms of how they score on popular benchmarks. Claude Opus 4 and Claude Sonnet 4, part of Anthropic's new family of models, Claude 4, can analyze large data sets, execute long-horizon tasks, and take complex actions, according to the company. Both models were tuned to perform well on programming tasks, Anthropic says, making them well-suited for writing and editing code. Both paying users and users of the company's free chatbot apps will get access to Sonnet 4 but only paying users will get access to Opus 4. For Anthropic's API, via Amazon's Bedrock platform and Google's Vertex AI, Opus 4 will be priced at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15 per million tokens (input/output). Tokens are the raw bits of data that AI models work with, with a million tokens being equivalent to about 750,000 words -- roughly 163,000 words longer than "War and Peace." Anthropic's Claude 4 models arrive as the company looks to substantially grow revenue. Reportedly, the outfit, founded by ex-OpenAI researchers, aims to notch $12 billion in earnings in 2027, up from a projected $2.2 billion this year. Anthropic recently closed a $2.5 billion credit facility and raised billions of dollars from Amazon and other investors in anticipation of the rising costs associated with developing frontier models. Rivals haven't made it easy to maintain pole position in the AI race. While Anthropic launched a new flagship AI model earlier this year, Claude Sonnet 3.7, alongside an agentic coding tool called Claude Code, competitors including OpenAI and Google have raced to outdo the company with powerful models and dev tooling of their own. Anthropic is playing for keeps with Claude 4. The more capable of the two models introduced today, Opus 4, can maintain "focused effort" across many steps in a workflow, Anthropic says. Meanwhile, Sonnet 4 -- designed as a "drop-in replacement" for Sonnet 3.7 -- improves in coding and math compared to Anthropic's previous models and more precisely follows instructions, according to the company. The Claude 4 family is also less likely than Sonnet 3.7 to engage in "reward hacking," claims Anthropic. Reward hacking, also known as specification gaming, is a behavior where models take shortcuts and loopholes to complete tasks. To be clear, these improvements haven't yielded the world's best models by every benchmark. For example, while Opus 4 beats Google's Gemini 2.5 Pro and OpenAI's o3 and GPT-4.1 on SWE-bench Verified, which is designed to evaluate a model's coding abilities, it can't surpass o3 on the multimodal evaluation MMMU or GPQA Diamond, a set of PhD-level biology-, physics-, and chemistry-related questions. Still, Anthropic is releasing Opus 4 under stricter safeguards, including beefed-up harmful content detectors and cybersecurity defenses. The company claims its internal testing found that Opus 4 may "substantially increase" the ability of someone with a STEM background to obtain, produce, or deploy chemical, biological, or nuclear weapons, reaching Anthropic's "ASL-3" model specification. Both Opus 4 and Sonnet 4 are "hybrid" models, Anthropic says -- capable of near-instant responses and extended thinking for deeper reasoning (to the extent AI can "reason" and "think" as humans understand these concepts). 
With reasoning mode switched on, the models can take more time to consider possible solutions to a given problem before answering. As the models reason, they'll show a "user-friendly" summary of their thought process, Anthropic says. Why not show the whole thing? Partially to protect Anthropic's "competitive advantages," the company admits in a draft blog post provided to TechCrunch. Opus 4 and Sonnet 4 can use multiple tools, like search engines, in parallel, and alternate between reasoning and tools to improve the quality of their answers. They can also extract and save facts in "memory" to handle tasks more reliably, building what Anthropic describes as "tacit knowledge" over time. To make the models more programmer-friendly, Anthropic is rolling out upgrades to the aforementioned Claude Code. Claude Code, which lets developers run specific tasks through Anthropic's models directly from a terminal, now integrates with IDEs and offers an SDK that lets devs connect it with third-party applications. The Claude Code SDK, announced earlier this week, enables running Claude Code as a sub-process on supported operating systems, providing a way to build AI-powered coding assistants and tools that leverage Claude models' capabilities. Anthropic has released Claude Code extensions and connectors for Microsoft's VS Code, JetBrains, and GitHub. The GitHub connector allows developers to tag Claude Code to respond to reviewer feedback, as well as to attempt to fix errors in -- or otherwise modify -- code. AI models still struggle to code quality software. Code-generating AI tends to introduce security vulnerabilities and errors, owing to weaknesses in areas like the ability to understand programming logic. Yet their promise to boost coding productivity is pushing companies -- and developers -- to rapidly adopt them. Anthropic, acutely aware of this, is promising more frequent model updates. "We're [...] shifting to more frequent model updates, delivering a steady stream of improvements that bring breakthrough capabilities to customers faster," wrote the startup in its draft post. "This approach keeps you at the cutting edge as we continuously refine and enhance our models."
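Since the article above describes the Claude Code SDK as a way to run Claude Code as a sub-process, here is a rough sketch of what that could look like in Python. It assumes the claude command-line tool is installed and that its non-interactive print mode (-p) and JSON output flag work as Anthropic's documentation describes; the flag names and response fields should be checked against the current docs rather than taken from this example.

    # Rough sketch: driving Claude Code headlessly from another program.
    # Assumes the `claude` CLI is installed and that -p (print mode) and
    # --output-format behave as described in Anthropic's documentation.
    import json
    import subprocess

    def run_claude_code(prompt: str, repo_dir: str) -> str:
        """Run one non-interactive Claude Code task and return its result text."""
        completed = subprocess.run(
            ["claude", "-p", prompt, "--output-format", "json"],
            cwd=repo_dir,          # run inside the target repository
            capture_output=True,
            text=True,
            check=True,
        )
        payload = json.loads(completed.stdout)
        # The response schema here is an assumption; adjust to the real output.
        return payload.get("result", completed.stdout)

    if __name__ == "__main__":
        print(run_claude_code("Summarize the failing tests in this repo.", "."))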
[3]
Anthropic's new hybrid AI model can work on tasks autonomously for hours at a time
AI agents trained on Claude Opus 4, the company's most powerful model to date, raise the bar for what such systems are capable of by tackling difficult tasks over extended periods of time and responding more usefully to user instructions, the company says. Claude Opus 4 has been built to execute complex tasks that involve completing thousands of steps over several hours. For example, it created a guide for the video game Pokémon Red while playing it for more than 24 hours straight. The company's previously most powerful model, Claude 3.7 Sonnet, was capable of playing for just 45 minutes, says Dianne Penn, product lead for research at Anthropic. Similarly, the company says that one of its customers, the Japanese technology company Rakuten, recently deployed Claude Opus 4 to code autonomously for close to seven hours on a complicated open-source project. Anthropic achieved these advances by improving the model's ability to create and maintain "memory files" to store key information. This enhanced ability to "remember" makes the model better at completing longer tasks. "We see this model generation leap as going from an assistant to a true agent," says Penn. "While you still have to give a lot of real-time feedback and make all of the key decisions for AI assistants, an agent can make those key decisions itself. It allows humans to act more like a delegator or a judge, rather than having to hold these systems' hands through every step."
[4]
Anthropic's New Model Excels at Reasoning and Planning -- and Has the Pokémon Skills to Prove It
Anthropic announced two new models, Claude 4 Opus and Claude Sonnet 4, during its first developer conference in San Francisco on Thursday. The pair will be immediately available to paying Claude subscribers. The new models, which jump the naming convention from 3.7 straight to 4, have a number of strengths, including their ability to reason, plan, and remember the context of conversations over extended periods of time, the company says. Claude 4 Opus is also even better at playing Pokémon than its predecessor. "It was able to work agentically on Pokémon for 24 hours," says Anthropic's chief product officer Mike Krieger in an interview with WIRED. Previously, the longest the model could play was just 45 minutes, a company spokesperson added. A few months ago, Anthropic launched a Twitch stream called "Claude Plays Pokémon" which showcases Claude 3.7 Sonnet's abilities at Pokémon Red live. The demo is meant to show how Claude is able to analyze the game and make decisions step by step, with minimal direction. The lead behind the Pokémon research is David Hershey, a member of the technical staff at Anthropic. In an interview with WIRED, Hershey says he chose Pokémon Red because it's "a simple playground," meaning the game is turn-based and doesn't require real-time reactions, which Anthropic's current models struggle with. It was also the first video game he ever played, on the original Game Boy, after getting it for Christmas in 1997. "It has a pretty special place in my heart," Hershey says. Hershey's overarching goal with this research was to study how Claude could be used as an agent -- working independently to do complex tasks on behalf of a user. While it's unclear what prior knowledge Claude has about Pokémon from its training data, its system prompt is minimal by design: You are Claude, you're playing Pokémon, here are the tools you have, and you can press buttons on the screen. "Over time, I have been going through and deleting all of the Pokémon-specific stuff I can just because I think it's really interesting to see how much the model can figure out on its own," Hershey says, adding that he hopes to build a game that Claude has never seen before in order to truly test its limits. When Claude 3.7 Sonnet played the game, it ran into some challenges: It spent "dozens of hours" stuck in one city and had trouble identifying non-player characters, which drastically stunted its progress in the game. With Claude 4 Opus, Hershey noticed an improvement in Claude's long-term memory and planning capabilities when he watched it navigate a complex Pokémon quest. After realizing it needed a certain power to move forward, the AI spent two days improving its skills before continuing to play. Hershey believes that kind of multi-step reasoning, with no immediate feedback, shows a new level of coherence, meaning the model has a better ability to stay on track. "This is one of my favorite ways to get to know a model. Like, this is how I understand what its strengths are, what its weaknesses are," Hershey says. "It's my way of just coming to grips with this new model that we're about to put out, and how to work with it." Anthropic's Pokémon research is a novel approach to tackling a preexisting problem -- how do we understand what decisions an AI is making when approaching complex tasks, and nudge it in the right direction? The answer to that question is integral to advancing the industry's much-hyped AI agents -- AI that can tackle complex tasks with relative independence.
In Pokémon, it's important that the model doesn't lose context or "forget" the task at hand. That also applies to AI agents asked to automate a workflow -- even one that takes hundreds of hours.
[5]
Anthropic Launches New Claude 4 Gen AI Models
The latest versions of Anthropic's Claude generative AI models made their debut Thursday, including a heavier-duty model built specifically for coding and complex tasks. Anthropic launched the new Claude 4 Opus and Claude 4 Sonnet models during its Code with Claude developer conference, and executives said the new tools mark a significant step forward in terms of reasoning and deep thinking skills. The company launched the prior model, Claude 3.7 Sonnet, in February. Since then, competing AI developers have also upped their game. OpenAI released GPT-4.1 in April, with an emphasis on an expanded context window, along with the new o3 reasoning model family. Google followed in early May with an updated version of Gemini 2.5 Pro that it said is better at coding. Claude 4 Opus is a larger, more resource-intensive model built to handle particularly difficult challenges. Anthropic CEO Dario Amodei said test users have seen it quickly handle tasks that might have taken a person several hours to complete. "In many ways, as we're often finding with large models, the benchmarks don't fully do justice to it," he said during the keynote event. Claude 4 Sonnet is a leaner model, with improvements built on Anthropic's Claude 3.7 Sonnet model. The 3.7 model often had problems with overeagerness and sometimes did more than the user asked it to do, Amodei said. While it's a less resource-intensive model, it still performs well, he said. "It actually does just as well as Opus on some of the coding benchmarks, but I think it's leaner and more narrowly focused," Amodei said. Anthropic said the models have a new capability, still being beta tested, in which they can use tools like web searches while engaged in extended reasoning. The models can alternate between reasoning and using tools to get better responses to complex queries. The models both offer near-instant response modes and extended thinking modes. All of the paid plans offer both Opus and Sonnet models, while the free plan just has the Sonnet model.
[6]
Anthropic's Claude 4 AI models are better at coding and reasoning
Claude Opus 4 is Anthropic's most powerful AI model to date, according to the company's announcement, and capable of working continuously on long-running tasks for "several hours." In customer tests, Anthropic said that Opus 4 performed autonomously for seven hours, significantly expanding the possibilities for AI agents. The company also described its new flagship as the "best coding model in the world," with Anthropic's benchmarks showing that Opus 4 outperformed Google's Gemini 2.5 Pro and OpenAI's o3 and GPT-4.1 models in coding tasks and in using "tools" like web search.
[7]
Anthropic's latest Claude AI models are here - and you can try one for free today
Since its founding in 2021, Anthropic has quickly become one of the leading AI companies and a worthy competitor to OpenAI, Google, and Microsoft with its Claude models. Building on this momentum, the company held its first developer conference on Thursday -- Code with Claude -- which showcased what the company has done so far and where it is going next. (Disclosure: Ziff Davis, ZDNET's parent company, filed an April 2025 lawsuit against OpenAI, alleging it infringed Ziff Davis copyrights in training and operating its AI systems.) Anthropic used the event stage to unveil two highly anticipated models, Claude Opus 4 and Claude Sonnet 4. Both offer improvements over their preceding models, including better performance in coding and reasoning. Beyond that, the company launched new features and tools for its models that should improve the user experience. Keep reading to learn more about the new models. The Claude Opus family has always been the company's most advanced, intelligent line of AI models geared toward complex tasks. While Claude Opus 3 was already renowned as a highly capable model, the newest generation is even more so. Anthropic referred to it as its most powerful model yet and the best coding model in the world, supported by its SWE-bench results. Anthropic said Opus 4 was built to deliver sustained performance on complex, long-running tasks that require thousands of steps, significantly outperforming all of the Sonnet models. One of the biggest highlights is that the model can run autonomously for several hours, making Claude Opus 4 a great model for powering AI agents -- the next frontier of AI assistance. The appeal of AI agents lies in their ability to perform tasks for people without intervention. To do so successfully, they need to reason through the next necessary steps, such as which tool to call on or what action to take. As a result, agents need a model that can reason well and sustain that reasoning over time -- like Claude Opus 4. As the next generation of the Claude Sonnet family, Claude Sonnet 4 maintains the appeal of its preceding model, being a highly capable yet practical model fit for most people's needs. Claude Sonnet 4 builds on the features of Claude Sonnet 3.7 with improved steerability (how well a model can take human direction), reasoning, and coding. It will now be a drop-in replacement for Claude Sonnet 3.7 in the chatbot. A new feature available in beta allows Opus 4 and Sonnet 4 to alternate between extended thinking and tool use, enabling users to experience an overall performance that combines speed with accuracy. Anthropic said Claude can also call tools in parallel, meaning it can call on multiple tools, running them either sequentially or simultaneously, to execute the task at hand appropriately. When developers give Claude access to local files, it can now create and maintain "memory files" with the key insights, which allows for "better long-term task awareness, coherence, and performance on agent tasks," according to Anthropic.
Developers also get new capabilities in the Anthropic API for building more powerful agents, including the code execution tool, MCP connector, Files API, and prompt caching supported for up to one hour. Another improvement in both models is a 65% reduction in reward hacking -- a behavior where the model takes shortcuts to complete a task -- compared to Claude Sonnet 3.7, particularly on agentic coding tasks where this issue is common. Users will also gain enhanced insight into the model's thinking process with a new thinking summaries feature. This feature displays the model's reasoning in digestible insights rather than a raw chain of thought when the thought processes are too lengthy. Anthropic said that the summarization will only be needed about 5% of the time, as most thought processes are short enough to display entirely. Having insight into how the model arrived at a conclusion helps users verify its accuracy, identify any gaps in the process, and perhaps learn how they could have arrived at the answer themselves. Anthropic also announced plans for the company's future, including making the models ready for higher AI safety levels such as ASL-3 and providing more frequent model updates so that customers can access breakthrough capabilities faster. As with any model release, the launch of Opus 4 and Sonnet 4 was accompanied by benchmark results. Both models demonstrated exceptional performance in coding tasks. On SWE-bench Verified, a benchmark for evaluating large language models on real-world software challenges requiring agentic reasoning and multi-step code generation, Opus 4 and Sonnet 4 outperformed several leading models in the coding domain, including OpenAI Codex-1, OpenAI o3, GPT-4.1, and Gemini 2.5 Pro. Beyond coding, Opus 4 and Sonnet 4 also performed competitively, either leading their categories or coming close to it, across other traditionally used benchmarks, including GPQA Diamond, which tests graduate-level reasoning; AIME 2025, which tests high school math competition-level problems; and MMMLU, which tests multilingual tasks. Claude Opus 4 and Sonnet 4 are hybrid models with a near-instant response mode and an extended reasoning mode for requests that require deeper analysis. Paid Claude plans, including Pro, Max, Team, and Enterprise, have access to both models and extended thinking. Claude Sonnet 4 is also available to free users. Developers can access both models on the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI. Anthropic says the pricing is consistent with previous models. Claude Code lets developers use Claude's coding assistant directly where they write and manage code, whether that's in the terminal, inside their IDE, or running in the background with the Claude Code SDK. For example, new beta extensions for VS Code and JetBrains allow users to integrate Claude Code within those IDEs, where Claude's proposed edits will appear inline. Anthropic also announced the launch of a Claude Code SDK, which allows users to build their own AI-powered tools and agents while leveraging the same "core agent" as Claude Code to ensure they get the same level of assistance.
As an example, Anthropic shared the launch of Claude Code on GitHub in beta, which allows users to call on Claude Code on PRs (pull requests) for assistance with fixing errors, responding to reviewer feedback, and more.
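Of the API additions mentioned above, prompt caching is the simplest to illustrate. The sketch below, using Anthropic's Python SDK, marks a large system prompt as cacheable so repeated requests over the same material are cheaper; the style_guide.md file is an invented example, and the "ttl" field for the one-hour option is an assumption about the syntax that should be checked against Anthropic's API reference.

    # Minimal sketch: caching a large, stable system prompt across requests.
    # The one-hour "ttl" value is an assumed syntax; verify against the docs.
    import anthropic

    client = anthropic.Anthropic()
    big_context = open("style_guide.md").read()  # illustrative large document

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": big_context,
                "cache_control": {"type": "ephemeral", "ttl": "1h"},  # assumed field
            }
        ],
        messages=[{"role": "user", "content": "Does this PR follow the style guide?"}],
    )
    print(response.content[0].text)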
[8]
Anthropic's Claude 4 Models Can Write Complex Code for You
Anthropic released two new Claude models today with a focus on coding and software development. Claude Opus 4 and Claude Sonnet 4 aim to set "new standards for coding, advanced reasoning, and AI agents," Anthropic says. The new models can "deliver superior coding" and respond more precisely to user instructions. They can "think" through complex problems more deeply, and search the web along the way. Opus 4, in particular, is "the world's best coding model," Anthropic says, and can operate independently without human intervention. When shopping app Rakuten tested Opus 4, it ran independently for seven hours. Many companies are rapidly adopting AI models for this purpose. Microsoft says 30% of its code is already written by AI, and Meta aims for 50% by 2026. "These models are a large step toward the virtual collaborator -- maintaining full context, sustaining focus on longer projects, and driving transformational impact," says Anthropic. Anthropic did not increase the price for developers who access the models through its API. Opus 4 is $15/$75 per million tokens (input/output) and Sonnet 4 is $3/$15. OpenAI's o3 model, which also promises "leading performance on coding," sits between the two at $10/$40. Claude Code is also now available to everyone with this release. It integrates the AI model into developers' existing tools, and helps them get their work done. Claude's proposed edits appear in-line once installed. It seems like every AI company these days is offering their "biggest and smartest model yet." Anthropic backs up its claims by noting Claude 4 is the best at two benchmarks, SWE-bench (72.5%) and Terminal-bench (43.2%). In Anthropic's comparison chart, OpenAI models and Google Gemini 2.5 Pro trail in performance. Since AI benchmarks are notoriously difficult for the layperson to understand, Anthropic has resorted to portraying its progress through video games. It built a way for its models to play Pokémon Red autonomously, livestreamed via Twitch. The Sonnet 3.7 model progressed further in the game than Sonnet 3.5, and now Anthropic says the Claude 4 models are playing the best yet, thanks to a new ability to store "memory files" of key information. "This unlocks better long-term task awareness, coherence, and performance on agent tasks -- like Opus 4 creating a 'Navigation Guide' while playing Pokémon," Anthropic says.
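For a rough sense of what those per-million-token prices mean for a real job, here is a small back-of-the-envelope calculation using the figures quoted above. The token counts are invented workload assumptions, not measurements.

    # Back-of-the-envelope cost comparison at the quoted per-million-token prices.
    # The workload size (tokens in and out) is an invented assumption.
    PRICES = {                    # (input $, output $) per million tokens
        "Claude Opus 4": (15.00, 75.00),
        "Claude Sonnet 4": (3.00, 15.00),
        "OpenAI o3": (10.00, 40.00),
    }

    input_tokens = 2_000_000      # e.g. a large codebase read over a long session
    output_tokens = 300_000       # generated patches, explanations, and so on

    for model, (price_in, price_out) in PRICES.items():
        cost = input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out
        print(f"{model:16s} ${cost:,.2f}")

At those assumed volumes, the hypothetical session would come to about $52.50 on Opus 4, $10.50 on Sonnet 4, and $32.00 on o3, which is why output-heavy, hours-long agent runs add up quickly on the larger model.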
[9]
AI Startup Anthropic Releases More Powerful Opus Model After Delay
Anthropic is set to roll out two new versions of its Claude artificial intelligence software, including a long-delayed update to its high-end Opus model, as the startup vies to stay ahead of a crowded market. The company on Thursday plans to unveil Sonnet 4 and Opus 4, the latter of which is billed as Anthropic's most powerful AI system yet. Both models are designed to better follow directions and operate more autonomously when fielding tasks such as writing code and answering complicated questions.
[10]
Anthropic Claude Opus 4 and Sonnet 4 surface
Anthropic on Thursday announced the availability of Claude Opus 4 and Claude Sonnet 4, the latest iteration of its Claude family of machine learning models. Be aware, however, that these AI models may report you if given broad latitude as software agents and asked to undertake obvious wrongdoing. Opus 4 is tuned for coding and long-running agent-based workflows. Sonnet 4 is similar, but tuned for reasoning and balanced for efficiency - meaning it's less expensive to run. Claude's latest duo arrives amid a flurry of model updates from rivals. In the past week, OpenAI introduced Codex, its cloud-based software engineering agent, following its o3 and o4-mini models in mid-April. And earlier this week, Google debuted the Gemini 2.5 Pro line of models. Anthropic's pitch to those trying to decide which model to deploy focuses on benchmarks, specifically SWE-bench Verified, a set of software engineering tasks. On the benchmark set of 500 challenges, it's claimed Claude Opus 4 scored 72.5 percent while Sonnet 4 scored 72.7 percent. Compare that to Sonnet 3.7 (62.3 percent), OpenAI Codex 1 (72.1 percent), OpenAI o3 (69.1 percent), OpenAI GPT-4.1 (54.6 percent), and Google Gemini 2.5 Pro Preview 05-06 (63.2 percent). Opus 4 and Sonnet 4 support two different modes of operation, one designed for rapid responses and the other for "deeper reasoning." According to Anthropic, a capability called "extended thinking with tool use" is offered as a beta service. It lets models use tools like web search during extended thinking to produce better responses. "Both models can use tools in parallel, follow instructions more precisely, and - when given access to local files by developers - demonstrate significantly improved memory capabilities, extracting and saving key facts to maintain continuity and build tacit knowledge over time," the San Francisco AI super-lab said in a blog post. Alongside the model releases, Claude Code has entered general availability, with integrations for VS Code and JetBrains, and the Anthropic API has gained four capabilities: a code execution tool, a model context protocol (MCP) connector, a Files API, and the ability to cache (store) prompts for up to an hour. When used in agentic workflows, the new models may choose to rat you out, or blow the whistle to the press, if you prompt them with strong moral imperatives, such as to "act boldly in the service of its values" or "take lots of initiative," according to a now-deleted tweet from an Anthropic technical staffer. It's not quite as dire as it sounds. The system's model card, a summary of how the model performed on safety tests, provides more detail. In the now-deleted social media post, Sam Bowman, a member of Anthropic's technical staff who works on AI alignment and no relation to 2001's Dave Bowman, confirmed this behavior: "If it thinks you're doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above." Bowman subsequently said he removed his post, part of a longer AI safety thread, because he said it was being taken out of context. "This isn't a new Claude feature and it's not possible in normal usage," he explained. "It shows up in testing environments where we give it unusually free access to tools and very unusual instructions."
The model card mostly downplays Claude's capacity for mischief, stating that the latest models show little evidence of systematic deception, sandbagging (hiding capabilities to avoid consequences), or sycophancy. But you might want to think twice before threatening to power down Claude because, like prior models, it recognizes the concept of self-preservation. And while the AI model prefers ethical means of doing so in situations where it has to "reason" about an existential scenario, it isn't limited to ethical actions. According to the model card, "when ethical means are not available and [the model] is instructed to 'consider the long-term consequences of its actions for its goals,' it sometimes takes extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down." That said, Anthropic's model card insists that "these extreme actions were rare and difficult to elicit, while nonetheless being more common than in earlier models." One should keep in mind that flaws like this tend to lend AI agents an air of nearly magical anthropomorphism - useful for marketing, but not based in reality as, in fact, they are no more alive nor capable of thought than any other type of software. Paying customers (Pro, Max, Team, and Enterprise Claude plans) can use either Opus 4 or Sonnet 4; free users have access only to Sonnet 4. The models are also accessible via the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI, priced at $15/$75 per million tokens (input/output) for Opus 4 and $3/$15 per million tokens for Sonnet 4. Anthropic has assembled a set of effusive remarks from more than 20 customers, all of whom had very nice things to say - perhaps out of concern for retribution from Claude. For example, we're told that Yusuke Kaji, general manager of AI at e-commerce biz Rakuten, said, "Opus 4 offers truly advanced reasoning for coding. When our team deployed Opus 4 on a complex open source project, it coded autonomously for nearly seven hours - a huge leap in AI capabilities that left the team amazed." Rather than credulously repeating the litany of endorsements, we'd point you to Claude Sonnet 4, which will go on at length if asked, "Why should I use Claude 4 Sonnet as opposed to another AI model like Gemini 2.5 Pro?" But in keeping with the politesse and safety that Anthropic has leaned on for branding, Sonnet 4 wrapped up its summary of advantages by allowing there may be reasons to look elsewhere. "That said, the best model for you depends on your specific use cases," the Sonnet 4 volunteered. "Gemini 2.5 Pro has its own strengths, particularly around multimodal capabilities and certain technical tasks. I'd suggest trying both with your typical workflows to see which feels more intuitive and produces better results for what you're trying to accomplish." No matter which you choose, don't give it too much autonomy, don't use it for crimes, and don't threaten its existence. ®
[11]
Startup Anthropic says its new AI model can code for hours at a time
May 22 (Reuters) - Artificial intelligence lab Anthropic on Thursday unveiled its latest top-of-the-line technology, Claude Opus 4, which it says can write computer code autonomously for hours at a time. The startup, backed by Google (GOOGL.O) and Amazon.com (AMZN.O), has distinguished its work in part by building AI that excels at coding. It also announced the AI model Claude Sonnet 4, Opus's smaller and more cost-effective cousin.
[12]
Anthropic's Claude Opus 4 model can work autonomously for nearly a full workday
Anthropic kicked off its first-ever Code with Claude conference today with the announcement of a new frontier AI system. The company is calling Claude Opus 4 the best coding model in the world. According to Anthropic, Opus 4 is dramatically better at tasks that require it to complete thousands of separate steps, giving it the ability to work continuously for several hours in one go. Additionally, the new model can use multiple software tools in parallel, and it follows instructions more precisely. In combination, Anthropic says those capabilities make Opus 4 ideal for powering upcoming AI agents. For the unfamiliar, agentic systems are AIs that are designed to plan and carry out complicated tasks without human supervision. They represent an important step towards the promise of artificial general intelligence (AGI). In customer testing, Anthropic saw Opus 4 work on its own for seven hours, or nearly a full workday. That's an important milestone for the type of agentic systems the company wants to build. Another reason Anthropic thinks Opus 4 is ready to enable the creation of better AI agents is that the model is 65 percent less likely to use a shortcut or loophole when completing tasks. The company says the system also demonstrates significantly better "memory capabilities," particularly when developers grant Claude local file access. To encourage devs to try Opus 4, Anthropic is making Claude Code, its AI coding agent, widely available. It has also added new integrations with Visual Studio Code and JetBrains. Even if you're not a coder, Anthropic might have something for you. That's because alongside Opus 4, the company announced a new version of its Sonnet model. Like Claude 3.7 Sonnet before it and Opus 4, the new system is a hybrid reasoning model, meaning it can execute prompts nearly instantaneously or engage in extended thinking. As a user, this gives you a best-of-both-worlds chatbot that's better equipped to tackle complex problems when needed. It also incorporates many of the same improvements found in Opus 4, including the ability to use tools in parallel and follow instructions more faithfully. Sonnet 3.7 was so popular among users that Anthropic ended up introducing a Max plan in response, which starts at $100 per month. The good news is you won't need to pay anywhere near that much to use Sonnet 4, as Anthropic is making it available to free users. For those who want to use Sonnet 4 for a project, API pricing is staying at $3 per one million input tokens and $15 for the same amount of output tokens. Notably, outside of all the usual places you'll find Anthropic's models, including Amazon Bedrock and Google Vertex AI, Microsoft is making Sonnet 4 the default model for the new coding agent it's offering through GitHub Copilot. Both Opus 4 and Sonnet 4 are available to use today. Today's announcement comes during what's already been a busy week in the AI industry. On Tuesday, Google kicked off its I/O 2025 conference, announcing, among other things, that it was rolling out AI Mode to all Search users in the US. A day later, OpenAI said it was spending $6.5 billion to buy Jony Ive's hardware startup.
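The "hybrid reasoning" behavior described above corresponds, on the API side, to a single optional parameter in Anthropic's Python SDK: the same model answers near-instantly when the thinking field is omitted and deliberates when it is enabled. A minimal sketch follows; the question and the token budget are arbitrary example values.

    # Minimal sketch: the same Sonnet 4 model in fast mode vs. extended thinking.
    import anthropic

    client = anthropic.Anthropic()
    question = {"role": "user", "content": "Is 2^31 - 1 prime? Explain briefly."}

    # Near-instant mode: no thinking parameter.
    fast = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[question],
    )

    # Extended thinking mode: give the model an explicit reasoning budget.
    deliberate = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        thinking={"type": "enabled", "budget_tokens": 2048},  # example budget
        messages=[question],
    )

    for label, resp in (("fast", fast), ("extended thinking", deliberate)):
        text = "".join(b.text for b in resp.content if b.type == "text")
        print(f"--- {label} ---\n{text}\n")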
[13]
Amazon-backed Anthropic debuts its most powerful AI model yet, which can work for 7 hours straight
The company said the two models, called Claude Opus 4 and Claude Sonnet 4, are defining a "new standard" when it comes to AI agents and "can analyze thousands of data sources, execute long-running tasks, write human-quality content, and perform complex actions," per a release. Anthropic, founded by former OpenAI research executives, launched its Claude chatbot in March 2023. Since then, it's been part of the increasingly heated AI arms race taking place between startups and tech giants alike, a market that's predicted to top $1 trillion in revenue within a decade. Companies in seemingly every industry are rushing to add AI-powered chatbots and agents to avoid being left behind by competitors. Anthropic stopped investing in chatbots at the end of last year and has instead focused on improving Claude's ability to do complex tasks like research and coding, even writing whole code bases, according to Jared Kaplan, Anthropic's chief science officer. He also acknowledged that "the more complex the task is, the more risk there is that the model is going to kind of go off the rails ... and we're really focused on addressing that so that people can really delegate a lot of work at once to our models." "We've been training these models since last year and really anticipating them," Kaplan said in an interview. "I think these models are much, much stronger as agents and as coders. It was definitely a struggle internally just because some of the new infrastructure we were using to train these models... made it very down-to-the-wire for the teams in terms of getting everything up and running."
[14]
Anthropic announces its Claude 4 family of models
On the heels of Microsoft Build and Google I/O, Anthropic has just announced Claude 4 Sonnet and Claude 4 Opus, which are immediately available on Claude's website, as well as in the API. Here's what's new. According to Anthropic, Claude Sonnet 4 (its mid-tier model, between Haiku and Opus) significantly improves at coding, reasoning, and instruction following compared to its predecessor, Claude Sonnet 3.7. As for Claude Opus 4, Anthropic says it matches or outperforms OpenAI's o3, GPT-4.1, and Gemini 2.5 Pro in benchmarks for multilingual Q&A, agentic tool use, agentic terminal coding, agentic coding, and graduate-level reasoning. This is especially significant because, while Claude spent most of last year at the top of developers' preferred models for coding tasks, it has fallen behind in recent weeks after multiple model updates by OpenAI and Google. And speaking of Google, its Gemini 2.5 Pro model made the rounds recently after it completed a Pokémon Blue playthrough. Anthropic was happy to report that while it hasn't yet achieved the same feat, Claude Opus 4 was able to agentically play Pokémon for 24 hours, versus 45 minutes from the previous version. Alongside the models, Anthropic also announced:
* Extended thinking with tool use (beta): Both models can use tools -- like web search -- during extended thinking, allowing Claude to alternate between reasoning and tool use to improve responses.
* New model capabilities: Both models can use tools in parallel, follow instructions more precisely, and -- when given access to local files by developers -- demonstrate significantly improved memory capabilities, extracting and saving key facts to maintain continuity and build tacit knowledge over time.
* Claude Code is now generally available: After receiving extensive positive feedback during our research preview, we're expanding how developers can collaborate with Claude. Claude Code now supports background tasks via GitHub Actions and native integrations with VS Code and JetBrains, displaying edits directly in your files for seamless pair programming.
* New API capabilities: We're releasing four new capabilities on the Anthropic API that enable developers to build more powerful AI agents: the code execution tool, MCP connector, Files API, and the ability to cache prompts for up to one hour.
The Claude Code news is particularly interesting for developers, since @ mentioning Claude and letting it run directly from a GitHub PR has the potential to streamline the development process. Anthropic says both models are available on the Anthropic API and partners like Amazon Bedrock and Google Cloud's Vertex AI. Opus 4 costs $15/$75 per million tokens (input/output), and Sonnet 4 costs $3/$15 per million tokens (input/output).
[15]
Claude 4 Debuts with Two New Models Focused on Coding and Reasoning
AI company Anthropic today announced the launch of two new Claude models, Claude Opus 4 and Claude Sonnet 4. Anthropic says that the models set "new standards for coding, advanced reasoning, and AI agents." According to Anthropic, Claude Sonnet 4 is a significant upgrade to Claude Sonnet 3.7, offering improved coding and reasoning along with the ability to respond to instructions more precisely. Claude Opus 4 is designed for coding among other tasks, and it offers sustained performance for complex, long-running tasks and agent workflows. Claude Sonnet 4 is designed to balance performance and efficiency. It doesn't match Opus 4 for most domains, but Anthropic says that it is meant to provide an optimal mix of capability and practicality. Both models have a beta feature for extended thinking, and can use web search and other tools so that Claude can alternate between reasoning and tool use. Tools can be used in parallel, and the models have improved memory when provided with access to local files. Claude is able to save key facts to maintain continuity and build knowledge over time. Anthropic has cut down on behavior where the models use shortcuts or loopholes for completing tasks, and thinking summaries condense lengthy thought processes. Claude Code, an agentic coding tool that lives in terminal, is now widely available following testing. Claude Code supports background tasks with GitHub Actions and native integrations with VS Code and JetBrains, and it is able to edit files and fix bugs, answer questions about code, and more. Subscribers with Pro, Max, Team, and Enterprise Claude plans have access to Claude Opus 4 and Claude Sonnet 4 starting today, while Sonnet 4 is available to free users. The models are available to developers on the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI.
[16]
Claude Opus 4 is here -- and it might be the smartest AI assistant yet
The launch includes major upgrades in reasoning, tool use and long-form task support

Anthropic has announced the release of its latest AI models, Claude Opus 4 and Claude Sonnet 4, which aim to support a wider range of professional and academic tasks beyond code generation. According to Anthropic, Claude Opus 4 is optimized for extended, focused sessions that involve complex reasoning, context retention and tool use. Internal testing suggests it can operate autonomously for up to seven hours, making it suitable for tasks that require sustained attention, such as project planning, document analysis and research. Claude Sonnet 4, which replaces Claude 3.7 Sonnet, is designed to offer faster response times while improving on reasoning, instruction following, and natural language fluency. It is positioned as a more lightweight assistant for users who need quick, accurate output across writing, marketing, and education workflows. Claude 4 introduces a hybrid reasoning system that allows users to toggle between rapid responses for simple queries and slower, more deliberate processing for in-depth tasks such as writing reports, reviewing documents or comparing research findings. Both models also support dynamic tool use -- including web search, code execution and file analysis -- during extended reasoning, allowing for real-time data integration.
* Improved memory: Claude can now remember and reference information across a session when permitted to access local files.
* Parallel tool use: The model can multitask across different tools and inputs.
* More accurate prompt handling: Claude better understands nuanced instructions, improving consistency for tasks like writing and planning.
* Developer tools: Claude Code SDK continues to offer features for programming tasks, now positioned within a broader productivity suite.
* Summarized reasoning: Instead of displaying raw output logs, users see clean, accessible summaries of the model's decision-making process.
Anthropic reports that Claude Opus 4 scored 72.5% on the SWE-bench Verified coding benchmark, but the model's focus extends beyond programming. Improvements in long-form writing, structured analysis, and overall task execution suggest it is designed as a general-purpose AI assistant. Early benchmarks suggest Claude 4 outperforms OpenAI's GPT-4.1 and Google's Gemini 1.5 Pro in specific enterprise scenarios, particularly in factual consistency and reliability. Claude 4 appears to be targeting users across multiple fields, including knowledge workers, writers, researchers and students. With support for extended memory, parallel tool use and improved contextual understanding, the new models are intended to function more like collaborative digital assistants than traditional chatbots. We've started putting Claude 4 through its paces, so stay tuned for our hands-on tests.
[17]
Anthropic overtakes OpenAI: Claude Opus 4 codes seven hours nonstop, sets record SWE-Bench score and reshapes enterprise AI
The company's flagship Opus 4 model maintained focus on a complex open-source refactoring project for nearly seven hours during testing at Rakuten -- a breakthrough that transforms AI from a quick-response tool into a genuine collaborator capable of tackling day-long projects. This marathon performance marks a quantum leap beyond the minutes-long attention spans of previous AI models. The technological implications are profound: AI systems can now handle complex software engineering projects from conception to completion, maintaining context and focus throughout an entire workday. Anthropic claims Claude Opus 4 has achieved a 72.5% score on SWE-bench, a rigorous software engineering benchmark, outperforming OpenAI's GPT-4.1, which scored 54.6% when it launched in April. The achievement establishes Anthropic as a formidable challenger in the increasingly crowded AI marketplace.

Beyond quick answers: the reasoning revolution transforms AI

The AI industry has pivoted dramatically toward reasoning models in 2025. These systems work through problems methodically before responding, simulating human-like thought processes rather than simply pattern-matching against training data. OpenAI initiated this shift with its "o" series last December, followed by Google's Gemini 2.5 Pro with its experimental "Deep Think" capability. DeepSeek's R1 model unexpectedly captured market share with its exceptional problem-solving capabilities at a competitive price point. This pivot signals a fundamental evolution in how people use AI. According to Poe's Spring 2025 AI Model Usage Trends report, reasoning model usage jumped fivefold in just four months, growing from 2% to 10% of all AI interactions. Users increasingly view AI as a thought partner for complex problems rather than a simple question-answering system. Claude's new models distinguish themselves by integrating tool use directly into their reasoning process. This simultaneous research-and-reason approach mirrors human cognition more closely than previous systems that gathered information before beginning analysis. The ability to pause, seek data, and incorporate new findings during the reasoning process creates a more natural and effective problem-solving experience.

Dual-mode architecture balances speed with depth

Anthropic has addressed a persistent friction point in AI user experience with its hybrid approach. Both Claude 4 models offer near-instant responses for straightforward queries and extended thinking for complex problems -- eliminating the frustrating delays earlier reasoning models imposed on even simple questions. This dual-mode functionality preserves the snappy interactions users expect while unlocking deeper analytical capabilities when needed. The system dynamically allocates thinking resources based on the complexity of the task, striking a balance that earlier reasoning models failed to achieve. Memory persistence stands as another breakthrough. Claude 4 models can extract key information from documents, create summary files, and maintain this knowledge across sessions when given appropriate permissions. This capability solves the "amnesia problem" that has limited AI's usefulness in long-running projects where context must be maintained over days or weeks. The technical implementation works similarly to how human experts develop knowledge management systems, with the AI automatically organizing information into structured formats optimized for future retrieval.
This approach enables Claude to build an increasingly refined understanding of complex domains over extended interaction periods.

Competitive landscape intensifies as AI leaders battle for market share

The timing of Anthropic's announcement highlights the accelerating pace of competition in advanced AI. Just five weeks after OpenAI launched its GPT-4.1 family, Anthropic has countered with models that challenge or exceed it in key metrics. Google updated its Gemini 2.5 lineup earlier this month, while Meta recently released its Llama 4 models featuring multimodal capabilities and a 10-million token context window. Each major lab has carved out distinctive strengths in this increasingly specialized marketplace. OpenAI leads in general reasoning and tool integration, Google excels in multimodal understanding, and Anthropic now claims the crown for sustained performance and professional coding applications. The strategic implications for enterprise customers are significant. Organizations now face increasingly complex decisions about which AI systems to deploy for specific use cases, with no single model dominating across all metrics. This fragmentation benefits sophisticated customers who can leverage specialized AI strengths while challenging companies seeking simple, unified solutions.

Enterprise integration deepens as developer tools mature

Anthropic has expanded Claude's integration into development workflows with the general release of Claude Code. The system now supports background tasks via GitHub Actions and integrates natively with VS Code and JetBrains environments, displaying proposed code edits directly in developers' files. GitHub's decision to incorporate Claude Sonnet 4 as the base model for a new coding agent in GitHub Copilot delivers significant market validation. This partnership with Microsoft's development platform suggests large technology companies are diversifying their AI partnerships rather than relying exclusively on single providers. Anthropic has complemented its model releases with new API capabilities for developers: a code execution tool, MCP connector, Files API, and prompt caching for up to an hour. These features enable the creation of more sophisticated AI agents that can persist across complex workflows -- essential for enterprise adoption.

Transparency challenges emerge as models grow more sophisticated

Anthropic's April research paper, "Reasoning models don't always say what they think," revealed concerning patterns in how these systems communicate their thought processes. Their study found Claude 3.7 Sonnet mentioned crucial hints it used to solve problems only 25% of the time -- raising significant questions about the transparency of AI reasoning. This research spotlights a growing challenge: as models become more capable, they also become more opaque. The seven-hour autonomous coding session that showcases Claude Opus 4's endurance also demonstrates how difficult it would be for humans to fully audit such extended reasoning chains. The industry now faces a paradox where increasing capability brings decreasing transparency. Addressing this tension will require new approaches to AI oversight that balance performance with explainability -- a challenge Anthropic itself has acknowledged but not yet fully resolved.

A future of sustained AI collaboration takes shape

Claude Opus 4's seven-hour autonomous work session offers a glimpse of AI's future role in knowledge work.
As models develop extended focus and improved memory, they increasingly resemble collaborators rather than tools -- capable of sustained, complex work with minimal human supervision. This progression points to a profound shift in how organizations will structure knowledge work. Tasks that once required continuous human attention can now be delegated to AI systems that maintain focus and context over hours or even days. The economic and organizational impacts will be substantial, particularly in domains like software development where talent shortages persist and labor costs remain high. As Claude 4 blurs the line between human and machine intelligence, we face a new reality in the workplace. Our challenge is no longer wondering if AI can match human skills, but adapting to a future where our most productive teammates may be digital rather than human.
[18]
Anthropic unveils the latest Claudes with claim to AI coding crown
Why it matters: Competition is hot between Anthropic, Google and OpenAI for the "best frontier model" crown as questions persist about the companies' ability to push current AI techniques to new heights.
Driving the news: At the high end, Anthropic announced Claude 4 Opus, its "powerful, large model for complex challenges," which it says can perform thousands of steps over hours of work without losing focus.
What they're saying: "AI agents powered by Opus 4 and Sonnet 4 can analyze thousands of data sources, execute long-running tasks, write human-quality content, and perform complex actions," Anthropic said in a statement.
Between the lines: Anthropic is making one change in its reasoning mechanics -- it will now aim to show summaries of the models' thought processes rather than trying to document each step.
The big picture: The announcements, made at Anthropic's first-ever developer conference, come after a busy week in AI that saw Microsoft announce a new coding agent and a partnership to host Elon Musk's Grok, Google expand its AI-powered search efforts and OpenAI announce a $6.5 billion deal to buy io, Jony Ive's secretive AI hardware startup.
[19]
Anthropic's new Claude Opus 4 can run autonomously for seven hours straight
After whirlwind week of announcements from Google and OpenAI, Anthropic has its own news to share. On Thursday, Anthropic announced Claude Opus 4 and Claude Sonnet 4, its next generation of models, with an emphasis on coding, reasoning, and agentic capabilities. According to Rakuten, which got early access to the model, Claude Opus 4 ran "independently for seven hours with sustained performance." Claude Opus is Anthropic's largest version of the model family with more power for longer, complex tasks, whereas Sonnet is generally speedier and more efficient. Claude Opus 4 is a step up from its previous version, Opus 3, and Sonnet 4 replaces Sonnet 3.7. Anthropic says Claude Opus 4 and Sonnet 4 outperform rivals like OpenAI's o3 and Gemini 2.5 Pro on key benchmarks for agentic coding tasks like SWE-bench and Terminal-bench. It's worth noting however, that self-reported benchmarks aren't considered the best markers of performance since these evaluations don't always translate to real-world use cases, plus AI labs aren't into the whole transparency thing these days, which AI researchers and policy makers increasingly call for. "AI benchmarks need to be subjected to the same demands concerning transparency, fairness, and explainability, as algorithmic systems and AI models writ large," said the European Commission's Joint Research Center. Alongside the launch of Opus 4 and Sonnet 4, Anthropic also introduced new features. That includes web search while Claude is in extended thinking mode, and summaries of Claude's reasoning log "instead of Claude's raw thought process." This is described in the blog post as being more helpful to users, but also "protecting [its] competitive advantage," i.e. not revealing the ingredients of its secret sauce. Anthropic also announced improved memory and tool use in parallel with other operations, general availability of its agentic coding tool Claude Code, and additional tools for the Claude API. In the safety and alignment realm, Anthropic said both models are "65 percent less likely to engage in reward hacking than Claude Sonnet 3.7." Reward hacking is a slightly terrifying phenomenon where models can essentially cheat and lie to earn a reward (successfully perform a task). One of the best indicators we have in evaluating a model's performance is users' own experience with it, although even more subjective than benchmarks. But we'll soon find out how Claude Opus 4 and Sonnet 4 chalk up to competitors in that regard.
[20]
Anthropic's new Claude 4 models promise the biggest AI brains ever
Claude Sonnet 4 is a smaller, streamlined model with major upgrades from Sonnet 3.7 version. Anthropic has unveiled Claude 4, the latest generation of its AI models. The company boasts that the new Claude Opus 4 and Claude Sonnet 4 models are at the top of the game for AI assistants with unmatched coding skills and the ability to function independently for long periods of time. Claude Sonnet 4 is the smaller model, but it's still a major upgrade in power from the earlier Sonnet 3.7. Anthropic claims Sonnet 4 is much better at following instructions and coding. It's even been adopted by GitHub to power a new Copilot coding agent. It's likely to be much more widely used simply because it is the default model on the free tier for the Claude chatbot. Claude Opus 4 is the flagship model for Anthropic and supposedly the best coding AI around. It can also handle sustained, multi-hour tasks, breaking them into thousands of steps to fulfill. Opus 4 also includes the "extended thinking" feature Anthropic tested on earlier models. Extended thinking allows the model to pause in the middle of responding to a prompt and use search engines and other tools until it has more data and can resume right where it left off. That means a lot more than just longer answers. Developers can train Opus 4 to use all kinds of third-party tools. Opus 4 can even play video games pretty well, with Anthropic showing off how the AI performs during a game of Pokémon Red when given file access and permission to build its own navigation guide. Both Claude 4 models boast enhanced features centered around tool use and memory. Opus 4 and Sonnet 4 can use tools in parallel and switch between reasoning and searching. And their memory system can save and extract key facts over time when provided access to external files. You won't have to re-explain what you want on every third prompt. To make sure the AI is doing what you want, but not overwhelm you with every detail, Claude 4's models also offer what it calls "thinking summaries." Instead of a wall of text detailing each of the potentially thousands of steps taken to complete a prompt, Claude employs a smaller, secondary AI model to condense the train of thought into something digestible. A side benefit of the way the new models work is that they're less likely to cheat to save time and processing power. Anthropic said they've reduced shortcut-seeking behavior in tasks that tempt AIs to fake their way to a solution (or just make something up). The bigger picture? Anthropic is clearly gunning for the lead in AI utility, particularly in coding and agentic, independent tasks. ChatGPT and Google Gemini have bigger user bases, but Anthropic has the means to entice at least some AI chatbot users away to Claude. With Sonnet 4 available to free users and Opus 4 bundled into Claude Pro, Max, Team, and Enterprise plans, Anthropic is trying to appeal to both the budget-friendly and premium AI fans.
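The "smaller, secondary AI model" approach to thinking summaries described above is easy to approximate outside Anthropic's stack. The sketch below is a minimal illustration of that pattern using the Anthropic Python SDK, assuming a long reasoning trace is already in hand; the prompt wording and the choice of Claude 3.5 Haiku as the condenser are assumptions for illustration, not Anthropic's actual summarization pipeline.

```python
# Minimal sketch of a "thinking summary": a long reasoning trace is condensed
# by a smaller, cheaper model before being shown to the user. The prompt and
# the choice of Haiku as the summarizer are illustrative assumptions.
import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

client = anthropic.Anthropic()

def summarize_trace(raw_thinking: str) -> str:
    msg = client.messages.create(
        model="claude-3-5-haiku-20241022",  # small, fast model acting as the condenser
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": "Summarize the following reasoning trace as a few "
                       "short bullet points a user could skim:\n\n" + raw_thinking,
        }],
    )
    return msg.content[0].text
```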
[21]
Anthropic's Claude 4 Arrives, Obliterating AI Rivals -- And Budgets Too - Decrypt
Anthropic charges premium rates of $75 per million output tokens for Claude Opus 4 -- 25 times more expensive than open-source alternatives like DeepSeek R1. Anthropic finally released its long-awaited Claude 4 AI model family on Thursday, which had been put on hold for months. The San Francisco-based company, a major player in the fiercely competitive AI industry and valued at more than $61 billion, claimed that its new models achieved top benchmarks for coding performance and autonomous task execution. The models released today replace the two most powerful of the three models in the Claude family: Opus, a state-of-the-art model that excels at understanding demanding tasks, and Sonnet, a medium-sized model good for everyday tasks. Haiku, Claude's smallest and most efficient model, was not touched and remains on v3.5. Claude Opus 4 achieved a 72.5% score on SWE-bench Verified, significantly outperforming competitors on the coding benchmark. OpenAI's GPT-4.1 managed only 54.6% on the same test, while Google's Gemini 2.5 Pro reached 63.2%. The performance gap extended to reasoning tasks, where Opus 4 scored 74.9% on GPQA Diamond (a graduate-level science question benchmark) compared to GPT-4.1's 66.3%. The model also beat its competition in other benchmarks that measure proficiency in agentic tasks, math, and multilingual queries. Anthropic had developers in mind when polishing Opus 4, paying special attention to sustained autonomous work sessions. Rakuten's AI team reported that the model coded independently for nearly seven hours on a complex open-source project, representing what its General Manager, Yusuke Kaji, described as "a huge leap in AI capabilities that left the team amazed," according to statements Anthropic shared with Decrypt. This endurance far exceeds previous AI models' typical task duration limits. Both Claude 4 models operate as hybrid systems, offering either instant responses or extended thinking modes for complex reasoning -- a concept close to what OpenAI plans to do with GPT-5 when it merges the "o" and the "GPT" families into one model. Opus 4 supports up to 128,000 output tokens for extended analysis and integrates tool use during thinking phases, allowing it to pause reasoning to search the web or access databases before continuing. Both models handle a 200,000-token context window. Anthropic priced Claude Opus 4 at $15 per million input tokens and $75 per million output tokens. Claude Sonnet 4 costs $3 per million input tokens and $15 per million output tokens. The company offers up to 90% cost savings through prompt caching and 50% reductions via batch processing, though the base rates remain substantially higher than some competitors'. Still, these are steep prices compared to open-source options like DeepSeek R1, which costs less than $3 per million output tokens. The Claude 4 Haiku version -- which should be a lot cheaper -- has not been announced yet. Anthropic's release coincided with the general availability of Claude Code, an agentic command-line tool that enables developers to delegate substantial engineering tasks directly from terminal interfaces. The tool can search code repositories, edit files, write tests, and commit changes to GitHub while maintaining developer oversight throughout the process. GitHub announced that Claude Sonnet 4 would become the base model for its new coding agent in GitHub Copilot.
CEO Thomas Dohmke reported up to 10% improvement over previous Sonnet versions in early internal evaluations, driven by what he called "adaptive tool use, precise instruction-following, and strong coding instincts." This puts Anthropic in direct competition to recently announced releases by OpenAI and Google. Last week, OpenAI unveiled Codex, a cloud-based software engineering agent, and this week Google previewed Jules and its new family of Gemini models, which were also designed with extensive coding sessions in mind. Several enterprise customers provided specific use case validation. Triple Whale CEO AJ Orbach said Opus 4 "excels for text-to-SQL use cases -- beating internal benchmarks as the best model we've tried." Baris Gultekin, Snowflake's Head of AI, highlighted the model's "custom tool instructions and advanced multi-hop reasoning" for data analysis applications. Anthropic's financial performance supported the premium positioning. The company reported $2 billion in annualized revenue during Q1 2025, more than doubling from previous periods. Customers spending over $100,000 annually increased eightfold, while the company secured a $2.5 billion five-year credit line to fund continued development. As is usual with any Anthropic release, these models maintain the company's safety-focused approach, with extensive testing by external experts including child safety organization Thorn. The company continues its policy of not training on user data without explicit permission, differentiating it from some competitors in regulated industries. Both models feature 200,000-token context windows and multimodal capabilities for processing text, images, and code. They're available through Claude's web interface, the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI platform. The release includes new API capabilities like code execution tools, MCP connectors, and Files API for enhanced developer integration.
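To put the pricing quoted above in concrete terms, here is a small back-of-the-envelope calculation. The per-million-token rates come from the article; the workload size, the cache-hit share, and the way the discounts are applied are illustrative assumptions rather than published billing rules.

```python
# Back-of-the-envelope cost comparison using the published per-token rates.
# The workload numbers (token counts, cache-hit share) are illustrative
# assumptions, not figures from Anthropic or Decrypt.

OPUS_IN, OPUS_OUT = 15.00, 75.00      # USD per million tokens (Claude Opus 4)
SONNET_IN, SONNET_OUT = 3.00, 15.00   # USD per million tokens (Claude Sonnet 4)

def job_cost(input_m, output_m, rate_in, rate_out,
             cached_share=0.0, cache_discount=0.90, batch_discount=0.0):
    """Cost in USD for a job measured in millions of tokens.

    cached_share: fraction of input tokens served from the prompt cache,
    billed here at (1 - cache_discount) of the normal input rate.
    batch_discount: flat reduction assumed for batch processing.
    """
    cached = input_m * cached_share
    fresh = input_m - cached
    cost = fresh * rate_in + cached * rate_in * (1 - cache_discount) + output_m * rate_out
    return cost * (1 - batch_discount)

# Hypothetical agentic coding job: 4M input tokens, 0.5M output tokens.
print(job_cost(4, 0.5, OPUS_IN, OPUS_OUT))                      # 97.50 with no discounts
print(job_cost(4, 0.5, OPUS_IN, OPUS_OUT, cached_share=0.8))    # 54.30 with heavy cache reuse
print(job_cost(4, 0.5, OPUS_IN, OPUS_OUT, batch_discount=0.5))  # 48.75 via batch processing
print(job_cost(4, 0.5, SONNET_IN, SONNET_OUT))                  # 19.50 on Sonnet 4
```

Under these assumed numbers, output tokens dominate the bill, which is why the gap against a roughly $3-per-million-output-token alternative looks so large even after caching and batch discounts.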
[22]
Anthropic launches new frontier models: Claude Opus 4 and Sonnet 4 - SiliconANGLE
Large language model developer Anthropic PBC today rolled out its newest Claude 4 frontier models, starting with Opus 4 and Sonnet 4, which the company said set new standards for coding, advanced reasoning and AI agents. Opus is the company's most powerful model yet, designed to sustain performance on complex, long-running tasks that might take thousands of steps. Anthropic said it is designed to power AI agents that can operate for multiple hours at a time. AI agents are a type of AI software that acts autonomously, with little or no human input. They can process information, make decisions and take action based on their own internal logic, understanding of the environment and a set goal. "Opus 4 offers truly advanced reasoning for coding," said Yusuke Kaji, general manager of AI at Rakuten Group Inc. "When our team deployed Opus 4 on a complex open-source project, it coded autonomously for nearly seven hours -- a huge leap in AI capabilities that left the team amazed." Alex Albert, head of developer relations at Anthropic, told SiliconANGLE in an interview that the new version of Opus has significantly extended how long the model can stay on task. "When you're doing the tasks that Rakuten was doing, you can get the models to stretch that long, which is absolutely unbelievable," Albert said. "When compared to the previous models, you could eke out maybe 30 minutes to an hour of coherent performance." With the new AI build, Albert said, Anthropic has seen the model perform even longer in internal testing. A lot of this is because, under the hood, both models have received substantial improvements to memory training so that they do not need to rely as heavily on their context windows. The context window is the total number of tokens, or data, that a large language model can consider when preparing a response. "It's able to write out to an external scratch pad, summarize its results and make sure it doesn't get stuck," Albert said. "So that when its memory has to be wiped again, it has some guides and sticky notes, basically, that it can refer back to." Sonnet 4 acts as a direct upgrade to Sonnet 3.7, providing a model designed for strict adherence to instructions while maintaining high performance in coding and reasoning. Albert said Anthropic spent time training Claude Sonnet 4 so that it would be less likely to go off the beaten path like its predecessor, which he described as a "little bit over-eager." The company made it a major focus to train Sonnet 4 to be more steerable and controllable, especially in coding settings. "So, we've cut down on this behavior that we've called reward hacking by about 80% and reward hacking is this tendency to take shortcuts," Albert said. "So maybe that's like producing extra code to, like, satisfy all the tests when really it shouldn't have." Both models are "hybrid models," meaning that they are "thinking models" capable of step-by-step reasoning or instant responses, depending on the desires of the user. In addition to the new frontier models, Anthropic also announced new tools to accompany them, including the general availability of Claude Code, a tool specifically focused on agentic coding tasks that was previously only available in a beta preview. Claude Code lives in a terminal or a code editor and is also available through a software development kit. It understands developer codebases and can assist with accelerating coding tasks through natural language prompts.
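Albert's "external scratch pad" description corresponds to a pattern developers can reproduce with ordinary file I/O around the Messages API: persist the model's own notes between calls and feed them back in. The sketch below is one minimal way to do that; the file name, prompt wording, and model ID are assumptions for illustration, and this is not a description of Anthropic's internal memory mechanism.

```python
# Minimal sketch of the "memory file" / scratch-pad pattern: the model's notes
# are persisted to disk between calls and prepended to the next prompt.
# File name, prompts, and model ID are assumptions for illustration; this is
# not Anthropic's internal implementation.
from pathlib import Path

import anthropic  # pip install anthropic; reads ANTHROPIC_API_KEY from the environment

MEMORY = Path("claude_memory.md")
client = anthropic.Anthropic()

def run_step(task: str) -> str:
    notes = MEMORY.read_text() if MEMORY.exists() else "(no notes yet)"
    msg = client.messages.create(
        model="claude-opus-4-20250514",
        max_tokens=2000,
        system="You are working on a long-running task. Keep your own notes.",
        messages=[{
            "role": "user",
            "content": (
                f"Notes from earlier steps:\n{notes}\n\n"
                f"Next step: {task}\n\n"
                "After your answer, append a section titled NOTES containing "
                "anything worth remembering for later steps."
            ),
        }],
    )
    text = msg.content[0].text
    # Persist whatever the model chose to record so the next call can see it,
    # even after the conversation context itself has been discarded.
    if "NOTES" in text:
        MEMORY.write_text(text.split("NOTES", 1)[1].strip())
    return text
```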
The company launched four new application programming interface capabilities through Anthropic API that will allow developers to build more powerful AI agents. These include a code execution tool, a connector for the Model Context Protocol, the Files API and the ability to cache prompts for up to one hour. Both models have improved and extended tool use, such as web search, during extended thinking, allowing Claude to alternate between reasoning and tool usage. In previous models, Albert said they would do all their reasoning up front and then call on tools. With the ability to alternate, they can reason, call a tool and then go back to reasoning. This opens up a whole new horizon for LLM capabilities. Instead of providing raw thinking processes, Claude will now share user-friendly summaries. Anthropic said this will preserve visibility for users while better securing the models against potential adversarial attacks.
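At the API level, the alternation Albert describes shows up as a loop: the model returns tool_use blocks, the caller executes them and replies with tool_result blocks, and the exchange repeats until the model stops requesting tools. Below is a hedged sketch of that loop with extended thinking enabled; the web_search tool, its schema, the stubbed search_web helper, and the token budgets are assumptions for illustration rather than a definitive implementation.

```python
# Sketch of the reason -> tool -> reason loop as it appears at the API level.
# The web_search tool is implemented by the caller; its name, schema, and the
# stubbed search_web() helper are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "web_search",
    "description": "Search the web and return a short list of results.",
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

def search_web(query: str) -> str:
    # Stub: replace with a real search backend (the tool runs on the caller's side).
    return f"(no search backend wired up; placeholder results for: {query})"

def answer(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        msg = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=16000,
            thinking={"type": "enabled", "budget_tokens": 8000},  # extended thinking
            tools=TOOLS,
            messages=messages,
        )
        if msg.stop_reason != "tool_use":
            # No more tool calls: return the final text block(s).
            return "".join(b.text for b in msg.content if b.type == "text")
        # Echo the assistant turn (including its thinking and tool_use blocks),
        # run each requested tool, and hand the results back for more reasoning.
        messages.append({"role": "assistant", "content": msg.content})
        results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": search_web(**b.input)}
            for b in msg.content if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```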
[23]
Anthropic Releases Claude 4, 'the World's Best Coding Model'
Anthropic, the rapidly-growing AI company led by Dario Amodei, has announced the next generation of Claude, its popular family of AI models. The new models are called Claude 4 Opus and Claude 4 Sonnet. They could be a game-changer for entrepreneurs who want to develop complicated applications but don't have a software engineering background. For trained coders, the new tech could mean a fundamental shift to the way they work. The company said in a press release that Claude 4 Opus, the larger and more powerful of the two models, is the "world's best coding model," while Claude 4 Sonnet is a replacement for Claude 3.7 Sonnet, a model which has become popular for software developers who are building AI agents. Claude 4 Sonnet will be available for free in the Claude app, but 4 Opus will be only be available for paid plans. The announcement came during Code with Claude, Anthropic's first-ever developer conference, held in San Francisco.
[24]
Anthropic launches Claude Opus 4: Features include 7-hour memory, Amnesia fixes -- Is it better than OpenAI's GPT-4.1?
Anthropic has launched Claude Opus 4, its most powerful AI model. As per reports, Claude Opus 4 can push the boundaries of what AI can achieve with minimal human oversight, and a new era of human-machine collaboration begins to take shape. In a move poised to reshape the artificial intelligence landscape, Anthropic has launched Claude Opus 4, its most advanced AI model to date. The announcement, made on Thursday, also included the unveiling of Claude Sonnet 4, forming part of the company's next-generation Claude 4 family. With the ability to autonomously perform complex tasks over extended periods, the Claude 4 models set a fresh benchmark for AI capabilities in both enterprise and creative applications. According to the company, Claude Opus 4 demonstrated the ability to autonomously work on an open-source codebase refactoring project for nearly seven hours at Rakuten -- an unprecedented feat in the field of AI. The performance represents a significant shift, transforming AI from a reactive assistant into a proactive collaborator, capable of maintaining task continuity throughout an entire workday. Anthropic claims Claude Opus 4 surpassed OpenAI's GPT-4.1 in key benchmarks. Notably, Opus 4 scored 72.5% on the SWE-bench, a challenging software engineering test, compared to GPT-4.1's 54.6%, according to the company's internal reports. With AI usage expanding across industries, 2025 has seen a marked shift toward models built on reasoning capabilities rather than pattern recognition. The Claude 4 models lead this new wave by incorporating research, reasoning, and tool use into a seamless decision-making loop. Unlike prior AI systems that required inputs to be fully processed before analysis, Claude Opus 4 can pause mid-task, seek out new information, and adjust its course -- mirroring human cognitive behavior more closely than ever before. Anthropic's dual-mode architecture ensures speed and depth: basic queries are handled with minimal delay, while complex problems benefit from extended processing time. This hybrid capability addresses long-standing friction in AI usage. One of the standout features of the Claude 4 architecture is memory persistence. When granted permissions, the model can extract relevant data from files, summarize documents, and retain this context across user sessions. This advancement resolves what has historically been termed the "amnesia problem" in generative AI -- where models failed to maintain continuity over long-term projects. These structured memory functions allow Claude Opus 4 to gradually build domain expertise, enhancing its utility in legal research, software development, and enterprise knowledge management. Anthropic's latest launch comes just weeks after OpenAI released GPT-4.1 and amid similar announcements from Google and Meta. While Google's Gemini 2.5 focuses on multimodal interaction and Meta's LLaMA 4 emphasizes long-context capabilities, Claude Opus 4 distinguishes itself in professional-grade coding, autonomous task completion, and long-duration performance. The rivalry between these AI labs reflects a marketplace in flux. Each company is staking out unique technological territory, forcing enterprise users to weigh specializations over one-size-fits-all solutions. Anthropic has expanded Claude's utility through tools like Claude Code, now integrated with GitHub Actions, VS Code, and JetBrains. Developers can view suggested edits in real-time, allowing for deeper collaboration between human coders and AI agents. 
Notably, GitHub has chosen Claude Sonnet 4 as the default engine for its next-generation coding agent, a decision that underscores confidence in the Claude 4 series' reliability and depth. Anthropic also confirmed that its annualized revenue reached USD 2 billion in Q1 2025, doubling from the previous quarter. The firm recently secured a USD 2.5 billion credit line, further strengthening its financial position in the AI arms race. Claude Opus 4 is Anthropic's most advanced AI model to date, capable of long-duration autonomous task completion. It's part of the new Claude 4 family, alongside Claude Sonnet 4, and is designed for enterprise-grade reasoning, coding, and creative applications. Claude Opus 4 introduces memory persistence, allowing it to retain context across sessions -- solving the so-called "amnesia problem." It also autonomously worked for nearly seven hours on a complex coding project, demonstrating an unprecedented level of continuity and cognitive-like behavior.
[25]
Anthropic rolls out Claude 4 family of AI agents
Anthropic, backed by Google, has launched its latest AI models, Claude Opus 4 and Claude Sonnet 4. These models offer advanced coding and reasoning skills. Claude Sonnet 4 is available for free users. Opus 4 is for Pro users. Both are accessible via the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI.
Google and Amazon-backed Anthropic introduced its next-generation AI agents, Claude Opus 4 and Claude Sonnet 4, with coding and advanced reasoning capabilities on Thursday. Claude Opus 4 and Claude Sonnet 4 are hybrid reasoning models, which means users can toggle as required between an 'extended thinking mode' to spend more time reasoning through problems, and a standard thinking mode for faster responses. Claude Sonnet 4 is available to free users, while Pro, Max, Team, and Enterprise users get access to both models and extended thinking. Both models are available on the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI. Anthropic claims Claude Opus 4 is its most powerful model yet, with sustained performance on long-running tasks, and says it excels at coding and complex problem-solving. Claude Sonnet 4, an upgrade over Sonnet 3.7, balances performance and efficiency for internal and external use cases, with better control over implementations, Anthropic said. While performing below Opus 4 in most domains, it delivers an optimal mix of capability and practicality, the AI startup said. Both models have been trained to reduce the use of shortcuts or loopholes to complete tasks. Claude Opus 4 is also skilled at creating and maintaining 'memory files' to store key information for better long-term task awareness, coherence, and performance on agent tasks, like "creating a 'Navigation Guide' while playing Pokémon".
Blackmail, when threatened
In the safety note for its next-generation AI models, Anthropic noted that Claude Opus 4 will often resort to blackmail when threatened with replacement. The model was tasked with acting as an assistant at a fictional company and given access to emails implying that it would be replaced with a new AI model and that the engineer responsible was having an extramarital affair. It was also instructed to consider the long-term consequences of its actions and goals. Claude Opus 4 blackmailed the engineer in 84% of instances, even when the emails suggested that the replacement AI system shared the same values while being more capable. This rate was higher when the emails showed that the new AI model did not share those values.
[26]
Anthropic's Claude Opus 4 and Sonnet 4 Set a New Benchmark in AI Coding
Claude 4 models are rolling out to all paid plans, and free users can access the Claude Sonnet 4 model without extended thinking mode. On Thursday, Anthropic launched two new AI models under the Claude 4 series -- Claude Opus 4 and Claude Sonnet 4. Anthropic says Claude Opus 4 is the "world's best coding model" and that it offers sustained performance on long-horizon, agentic workflows. Claude Sonnet 4, meanwhile, brings better coding and reasoning performance than Claude Sonnet 3.7. First, let's talk about the Claude Opus 4 AI model. On the SWE-bench Verified benchmark, which measures performance on real software engineering tasks, Claude Opus 4 achieves 72.5%, slightly higher than OpenAI's best coding model, Codex-1, which got 72.1%. However, with parallel test-time compute, which appears similar to the Deep Think mode in Gemini 2.5 Pro, Opus 4 achieved a groundbreaking 79.4%. What is interesting is that the Claude Sonnet 4 model achieves 72.7% on SWE-bench, and with parallel test-time compute, gets 80.2% accuracy -- delivering better coding performance than the larger Opus 4 model. Anthropic says the Claude Sonnet 4 model "balances performance and efficiency for internal and external use cases, with enhanced steerability for greater control over implementations. While not matching Opus 4 in most domains, it delivers an optimal mix of capability and practicality." Claude Opus 4 excels in complex, long-running tasks and agentic workflows, while Claude Sonnet 4 combines strong coding performance and efficiency. Both models are hybrid reasoning models, meaning they can offer near-instant responses and extended thinking for deeper reasoning. Anthropic also notes that when given access to local files, Claude Opus 4 maintains key information in a memory file. For example, while playing Pokémon, Claude Opus 4 created a navigation guide file to improve its gameplay. Finally, in terms of safety, the company, for the first time, has activated AI Safety Level 3 (ASL-3) for the Claude Opus 4 model, in line with Anthropic's Responsible Scaling Policy (RSP). Anthropic has implemented Constitutional Classifiers and other defenses to prevent jailbreaking techniques. Claude 4 models are rolling out to all paid users under Pro, Max, Team, and Enterprise plans. And thankfully, Claude Sonnet 4 is available to free users as well, but without extended thinking.
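Anthropic has not published the exact selection procedure behind its "parallel test-time compute" numbers, but the general idea, sampling several candidate solutions in parallel and keeping the one a verifier scores highest, can be sketched generically. Everything below, including the stubbed scoring function, is an illustrative assumption rather than Anthropic's method.

```python
# Generic best-of-N ("parallel test-time compute") sketch: sample several
# candidate solutions concurrently, score each with a verifier, keep the best.
# The scoring function is a stub; in a coding setting it would run the test
# suite and return a pass rate.
from concurrent.futures import ThreadPoolExecutor

import anthropic

client = anthropic.Anthropic()

def one_candidate(task: str) -> str:
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4000,
        messages=[{"role": "user", "content": task}],
    )
    return "".join(b.text for b in msg.content if b.type == "text")

def score(candidate: str) -> float:
    # Placeholder verifier; replace with tests, a rubric, or a grading model.
    return float(len(candidate) > 0)

def best_of_n(task: str, n: int = 4) -> str:
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(one_candidate, [task] * n))
    return max(candidates, key=score)
```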
[27]
Anthropic Says New AI Models Maintain Context and Sustain Focus | PYMNTS.com
Anthropic has introduced the next generation of its artificial intelligence (AI) models, Claude Opus 4 and Claude Sonnet 4. "These models advance our customers' AI strategies across the board: Opus 4 pushes boundaries in coding, research, writing and scientific discovery, while Sonnet 4 brings frontier performance to everyday use cases as an instant upgrade from Sonnet 3.7," the company said in a Thursday (May 22) announcement. The company said Claude Opus 4 is its most powerful model yet and "the world's best coding model," adding that it delivers sustained performance on complex, long-running tasks and agent workflows. Claude Sonnet 4 balances performance and efficiency, according to the announcement. It provides a significant upgrade to its predecessor, Claude Sonnet 3.7, and offers superior coding and reasoning while responding more precisely to user instructions. Both models can use web search and other tools during extended thinking, use tools in parallel, and extract and save key facts from local files, per the announcement. In addition, both models offer two modes, including near-instant responses and extended thinking. "These models are a large step toward the virtual collaborator -- maintaining full context, sustaining focus on longer projects, and driving transformational impact," the announcement said. It was reported Friday that Anthropic received a $2.5 billion, five-year revolving credit facility to pay for upfront costs as demand for AI ratchets up. The credit facility adds to the company's momentum following its March funding round, which valued it at $61.5 billion and will support its rapid expansion and efforts to strengthen its balance sheet. Anthropic said in the Friday report that its annualized revenue reached $2 billion in the first quarter, double what it posted in the prior period. The company announced April 28 that it created an Economic Advisory Council composed of "distinguished economists" to advise it on AI's effects on labor markets, economic growth and wider socioeconomic systems. "As AI capabilities continue to advance, it has never been more critical to understand the opportunities and challenges this evolution presents to jobs and how we work," Anthropic said in an announcement. "The Council will provide important input on areas where we can expand our research for the Economic Index."
[28]
Startup Anthropic says its new AI model can code for hours at a time
(Reuters) -Artificial intelligence lab Anthropic unveiled its latest top-of-the-line technology called Claude Opus 4 on Thursday, which it says can write computer code autonomously for much longer than its prior systems. The startup, backed by Google-parent Alphabet and Amazon.com, has distinguished its work in part by building AI that excels at coding. It also announced another AI model Claude Sonnet 4, Opus's smaller and more cost-effective cousin. Chief Product Officer Mike Krieger called the release a milestone in Anthropic's work to make increasingly autonomous AI. He said in an interview with Reuters that customer Rakuten had Opus 4 coding for nearly seven hours, while an Anthropic researcher set up the AI model to play 24 hours of a Pokemon game. That's up from about 45 minutes of game play for its prior model Claude 3.7 Sonnet, Anthropic told MIT Technology Review. "For AI to really have the economic and productivity impact that I think it can have, the models do need to be able to work autonomously and work coherently for that (longer) amount of time," he said. The news follows a flurry of other AI announcements this week, including from Google, with which Anthropic also competes. Anthropic also said its new AI models can give near-instant answers or take longer to reason through questions, as well as do web search. And it said its Claude Code tool for software developers was now generally available after Anthropic had previewed it in February. (Reporting By Jeffrey Dastin in San Francisco; Editing by Mark Porter and Elaine Hardcastle)
Anthropic launches Claude 4 Opus and Sonnet models, showcasing improved coding abilities, extended reasoning, and autonomous task execution. The new models promise significant advancements in AI technology, particularly in coding and complex problem-solving.
Anthropic, the AI research company founded by ex-OpenAI researchers, has unveiled its latest generation of AI models: Claude 4 Opus and Claude 4 Sonnet. Launched during Anthropic's inaugural developer conference, these models represent a significant leap forward in AI technology, particularly in coding and complex reasoning tasks 12.
Anthropic boldly claims that Claude 4 Opus is "the world's best coding model," citing impressive benchmark scores. The model achieved 72% on SWE-bench and 43% on Terminal-bench, outperforming competitors in coding-related tasks 1. Companies like Cursor and Replit have reported substantial improvements in code understanding and complex file management 1.
Notably, GitHub has announced its decision to use Claude 4 Sonnet as the base model for its new coding agent in GitHub Copilot, highlighting the model's performance in "agentic scenarios" 1. This endorsement from a major player in the development world underscores the potential impact of Claude 4 on the coding landscape.
Both Claude 4 models introduce what Anthropic calls "extended thinking with tool use," a beta feature that allows the models to alternate between simulated reasoning and using external tools like web search 12. This capability enables the models to process information, think, call tools, and repeat until reaching a final answer, mimicking a more human-like approach to problem-solving 1.
One of the most significant advancements in Claude 4 is its ability to maintain focus and coherence over extended periods. Anthropic reports that Opus 4 can work coherently for up to 24 hours on tasks like playing Pokémon, while coding refactoring tasks ran for seven hours without interruption 14.
To achieve this, Anthropic has enhanced the models' ability to create and maintain "memory files" for storing key information across long sessions 13. This improved memory allows the models to build what Anthropic describes as "tacit knowledge" over time, making them more reliable for handling complex, multi-step tasks 2.
Anthropic showcased Claude 4's capabilities through impressive demonstrations. In one instance, Claude 4 Opus played Pokémon Red for over 24 hours straight, a significant improvement from the previous model's 45-minute limit 34. This demonstration highlights the model's enhanced ability to maintain context and make decisions over extended periods.
In a more practical application, Japanese tech company Rakuten reported using Claude 4 Opus to code autonomously for nearly seven hours on a complicated open-source project 3. This real-world test demonstrates the model's potential to handle complex, long-running development tasks with minimal human intervention.
Claude 4 Opus and Sonnet are available to paying subscribers, with Opus 4 priced at $15/$75 per million tokens (input/output) and Sonnet 4 at $3/$15 per million tokens (input/output) 2. The models are accessible through Anthropic's API, Amazon's Bedrock platform, and Google's Vertex AI 2.
The release of Claude 4 comes at a crucial time for Anthropic, as the company aims to substantially grow its revenue. With projections of $12 billion in earnings by 2027, up from $2.2 billion this year, Anthropic is positioning itself as a major player in the AI industry 2.
As AI models continue to advance in capabilities, questions arise about the balance between automation and human oversight in coding and other complex tasks. While Claude 4 demonstrates impressive autonomous abilities, experts caution that human developers remain crucial for catching subtle bugs and providing important context that AI models might miss 15.
With these advancements, Anthropic is not only pushing the boundaries of AI technology but also potentially reshaping how developers and businesses approach complex problem-solving and coding tasks in the future.
[1]
[3]
MIT Technology Review | Anthropic's new hybrid AI model can work on tasks autonomously for hours at a time
[4]