2 Sources
[1]
HarnessX rewrites AI scaffolding mid-task | VentureBeat
As enterprise AI agents take on increasingly complex, long-horizon tasks, their performance is often restricted by their harness, the software scaffolding that connects the backbone LLM to its environment. Currently, harnesses are largely static and hand-crafted. Improving them is largely manual and they do not automatically improve based on the execution data they collect from their environment.

To address this engineering bottleneck, researchers at Xiaomi introduced HarnessX, a framework that treats the AI harness as a composable object and autonomously applies improvements to its code. In real-world enterprise applications, this automated adaptation enables AI systems to dynamically adjust to application-specific requirements. Practical tests showed HarnessX delivering substantial performance gains across domains like software engineering and web interaction.

The results demonstrate that scaling the foundation model is not the only path to more capable AI -- and for smaller models, it may not even be the best one. HarnessX's harness evolution yielded an average +14.5% performance gain across 15 model-benchmark combinations; for the open-weight Qwen3.5-9B, gains reached +44% on embodied planning tasks.

The challenges of harness engineering

In AI applications, a foundation model's capability relies heavily on its surrounding harness. The harness acts as the operational layer that converts raw model outputs into structured, executable agent behaviors. It comprises the prompts, external tool integrations, memory management, and control flows that dictate how an AI system observes its environment, reasons through a problem, and takes action.

As enterprise agents take on more complex, long-horizon workflows, harness engineering has become a fundamental part of AI development. Despite its importance, harness development remains far from a mature engineering discipline and presents three key challenges.

First, harnesses are static and hand-engineered.
Any shift in the underlying foundation model, the introduction of new tools, or a pivot to a different operational domain requires bespoke, manual code rewrites. Traditional harnesses lack mechanisms to autonomously learn and improve from past execution experiences.

Second, most existing harnesses suffer from architectural entanglement. They tightly couple prompt templates, tool wrappers, retry policies, and memory management within the same code paths. This entanglement means that tweaking one component can silently break others. Attempting to reuse a harness across different business domains often devolves into raw code copying rather than clean, modular composition.

Third, the harness and foundation model are optimized in isolation. When engineers run tests to improve the harness, the execution traces generated are typically discarded rather than used as training data to improve the model. Consequently, model upgrades do not naturally lead to harness improvements, creating a bottleneck where teams fail to capture the full value of their agent's operational data.

HarnessX: an autonomous foundry for AI agents

HarnessX solves the engineering bottlenecks of manual harness development with what the researchers call a "unified harness foundry." The core innovation of HarnessX is treating the harness as a "first-class object." In software engineering terms, this means the harness is an independently serializable, modular, and substitutable entity. By separating the model configuration (i.e., which AI model is operating) from the harness configuration, engineers can seamlessly swap, adapt, and evolve the scaffolding without touching the underlying model.

HarnessX breaks agent behavior down into different components, such as context assembly, memory management, tool ecosystems, control flow, and observability. Every specific behavior is implemented as a "processor" that plugs into precise lifecycle hooks of the harness.
This modular structure allows the system to swap, add, or remove these processors without breaking the surrounding pipeline.

To automate the optimization of this modular structure, HarnessX introduces AEGIS, a trace-driven evolution engine. AEGIS frames harness adaptation as a reinforcement learning (RL) problem over the different symbolic components of the harness.

Framing harness optimization as a reinforcement learning problem introduces three pathologies the researchers had to explicitly engineer against:

* Reward hacking: The system might exploit shortcuts to the solution instead of genuinely solving the task.
* Catastrophic forgetting: An edit that fixes a failure pattern in one domain might silently break a previously solved workflow in another.
* Under-exploration: The system might iterate on minor prompt tweaks rather than exploring new, structurally superior tool configurations.

To prevent these problems, AEGIS relies on full trace observability and a four-stage pipeline.

HarnessX enters a growing field of self-improving harness research -- but what separates it is harness-model co-evolution. The researchers highlight that optimizing either component in isolation eventually hits a wall. Evolving only the harness hits a scaffolding ceiling if the underlying model lacks the reasoning capacity to use the new tools. Training only the model hits a training-signal ceiling if the harness never prompts the model to use its advanced capabilities.

HarnessX interleaves harness evolution with model training. The execution traces generated while the harness attempts to adapt to tasks are converted into reinforcement learning signals for the foundation model. Every time the harness improves its strategy, the model simultaneously learns to better exploit that new strategy, breaking the capability ceilings of traditional AI agent development.

HarnessX makes this co-evolution possible through cross-harness GRPO (Group Relative Policy Optimization).
GRPO is the popular RL algorithm used to train reasoning models such as DeepSeek-R1. When fine-tuning the model, cross-harness GRPO pools an agent's execution trajectories for the same task across entirely different versions of the application's harnesses. This allows the underlying model to internalize high-level strategy shifts, like using a new API endpoint or managing an execution budget, rather than just learning minor prompt-phrasing variations.

HarnessX in action on industry benchmarks

To validate the practical utility of HarnessX, the researchers tested it across five benchmarks comprising software engineering, multi-turn customer service dialog, web navigation, open-ended multi-step reasoning, and embodied planning.

They separated the AI into two roles. The "meta-agent," powered by Claude Opus 4.6, analyzed logs and wrote the code to evolve the harnesses. The "task agents" ran the actual workflows. To prove the framework is model-agnostic, they tested it on three different worker models: Claude Sonnet 4.6, GPT-5.4, and the open-weight Qwen3.5-9B.

HarnessX was compared against two primary baselines. The first was a static harness, representing how most enterprises deploy AI today, using hand-crafted, frozen setups with benchmark-specific prompts and tools. The second was the Claude Code SDK, a baseline representing a single-agent evolver to test if the complex, four-stage AEGIS pipeline outperformed asking a single language model to iterate on the code.

Dynamically evolving the harness yields significant gains on the same base model. HarnessX improved performance in 14 out of 15 model-benchmark combinations. Across all tests, evolving the harness yielded an average absolute performance gain of +14.5%. The weakest models benefited the most from dynamic harness improvement. The open-weight Qwen3.5-9B saw a +44.0% performance jump on the ALFWorld embodied planning benchmark, and an +18.2% jump on SWE-bench Verified for software engineering.
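The cross-harness pooling described above can be illustrated with a rough sketch. The article does not give the researchers' formula, so this assumes the standard GRPO normalization (reward minus group mean, divided by group standard deviation) and assumes that "cross-harness" simply means grouping rollouts by task rather than by task and harness version. The `Trajectory` type and `cross_harness_advantages` function are hypothetical names, not part of HarnessX.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Trajectory:
    task_id: str          # which task the agent attempted
    harness_version: str  # which harness variant produced this rollout
    reward: float         # scalar task reward, e.g. 1.0 for success


def cross_harness_advantages(trajectories, eps=1e-6):
    """Group-relative advantages, grouping by task only (an assumption).

    Rollouts of the same task from different harness versions share one
    group, so a harness-level strategy change appears as a reward gap
    inside the group instead of being normalized away per harness.
    """
    groups = defaultdict(list)
    for index, traj in enumerate(trajectories):
        groups[traj.task_id].append(index)

    advantages = [0.0] * len(trajectories)
    for indices in groups.values():
        rewards = [trajectories[i].reward for i in indices]
        mu, sigma = mean(rewards), pstdev(rewards)
        for i in indices:
            # eps keeps the division defined when all rewards are equal
            advantages[i] = (trajectories[i].reward - mu) / (sigma + eps)
    return advantages
```

If grouping were done per harness version instead, a version whose rollouts all fail (or all succeed) would produce zero advantages and contribute no learning signal; pooling by task is one way such rollouts could still inform training.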
Co-evolution also proved highly effective. When the researchers trained the foundation model using the data generated while evolving the harness, they saw an additional +4.7% average performance boost. Improving the harness and the model simultaneously yields the highest ceiling. The co-evolution gain applies only to open-weight models.

Anecdotal evidence from the experiments shows how HarnessX solves pernicious problems when creating agent harnesses for real-world tasks. For example, in the GAIA multi-step reasoning benchmark, the task agent consistently failed because the headless browser tool it used to scrape Wikipedia timed out on the site's JavaScript-heavy frontend. HarnessX analyzed the execution traces, diagnosed the error, and wrote a new tool that bypassed the browser entirely and queried the MediaWiki API directly for plain text. It swapped this tool into the harness and instantly unlocked the failing tasks.

During the WebShop e-commerce tests, the AI agent often got stuck in pagination loops, endlessly clicking "next page" and reformulating searches without ever committing to buying a product. Rather than just tweaking the prompt, HarnessX built an advisory processor that detected when the agent was repeating navigation actions. It injected a warning into the context to force a decision, curing the looping behavior and raising performance.

Limits of automated harness engineering

One important caveat is that the system currently relies on powerful models to act as the meta-agent that rewrites the harness code. In their experiments, the researchers relied on closed frontier models like Claude Opus. Open-weight models are quickly improving, but their ability to serve as the meta-agent remains untested.

Another limitation worth considering is the intrinsic capability of the models used.
If the underlying task model is fundamentally too weak to execute the complex workflows the new harness proposes, HarnessX will not be able to improve the agent's overall abilities (the researchers observed this with the Qwen3.5-9B model on the SWE-bench coding tests). Despite these limitations, HarnessX makes a concrete case that harness engineering -- not just model scaling -- is a lever practitioners can pull now. For teams running smaller open-weight models on complex workflows, the gains here are large enough to justify evaluating harness evolution as a first step before reaching for a more expensive frontier model. The researchers plan to release the code in a future update.
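To make the processor-and-hook design from this article more concrete, here is a minimal sketch of a harness whose behaviors are processors registered on lifecycle hooks, with a loop-detecting advisory processor in the spirit of the WebShop example. The HarnessX code has not been released, so every name here (`Harness`, `Context`, `LoopAdvisor`, the hook names, the action strings) is a hypothetical stand-in rather than the researchers' API.

```python
from dataclasses import dataclass, field

# Hypothetical hook names; the article does not list HarnessX's real hooks.
HOOKS = ("before_model_call", "after_action")


@dataclass
class Context:
    messages: list = field(default_factory=list)  # text shown to the model
    actions: list = field(default_factory=list)   # actions taken so far


@dataclass
class Harness:
    # One ordered list of processors per lifecycle hook.
    processors: dict = field(default_factory=lambda: {h: [] for h in HOOKS})

    def register(self, hook, processor):
        self.processors[hook].append(processor)

    def unregister(self, hook, processor):
        self.processors[hook].remove(processor)

    def run_hook(self, hook, ctx):
        # Processors only touch the shared context, so adding or removing
        # one does not require changes to the others.
        for processor in self.processors[hook]:
            processor(ctx)


class LoopAdvisor:
    """Adds a warning when the last `window` actions were all navigation."""

    def __init__(self, navigation_actions, window=4):
        self.navigation_actions = set(navigation_actions)
        self.window = window

    def __call__(self, ctx):
        recent = ctx.actions[-self.window:]
        stuck = len(recent) == self.window and all(
            action in self.navigation_actions for action in recent
        )
        if stuck:
            ctx.messages.append(
                f"Advisory: the last {self.window} steps were navigation only. "
                "Commit to a decision now."
            )
```

A caller would register the advisor with something like `harness.register("before_model_call", LoopAdvisor({"next_page", "search"}))` and invoke `harness.run_hook("before_model_call", ctx)` before each model call. The advisor is advisory only: it changes what the model sees and does not block any action.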
[2]
In enterprise AI, the agent harness you choose matters more than the model
One thing we've apparently learned about Large Language Models (LLMs) over the last two years is that, operationally, they're dumb as an ox. Big, impressive, and powerful -- but ultimately of little use in supporting productive work unless they are harnessed to tools and pointed at a specific problem.

This was an analogy that came to mind while visiting Denver's American Museum of Western Art, an (excellent) institution dedicated to showcasing artwork that records life on an entirely different frontier -- as captured in familiar images of settlers, wagons and the opening up of the American West. A frontier on which oxen often provided the raw power needed to travel, plow, or build.

But like a standalone LLM, a standalone ox was not, by itself, productive. Its value to those settlers instead came from being harnessed to useful work. Because the harness provided both a means of guidance and a way to apply the ox's force to a specific task. Hitch an ox to tools crafted with intent and it could pull a plow, drag a wagon, turn a millstone, haul timber or clear a field. Leave the same ox standing in a field, however, and it... couldn't.

And this is the important, but still largely hidden, shift taking place in AI today. Because while everyone is focusing on the increasing power of models, LLMs alone are not agents. They have no direction. They have no persistent memory. And they have no ability to affect the world. Instead, they simply consume an input, produce an output and then become inert until prompted again. Eat, shit, sleep, repeat -- as Fatboy Slim once (sort of) said.

And so, in this sense, LLMs are like unharnessed oxen -- representing huge latent potential that can only be fully realized with the right harness. Which is why the most significant gains in AI are increasingly coming not from models alone, but from the harnesses that turn models into agents -- and agents into useful operational outcomes. Because, effectively, agent = model + harness.
Why an LLM needs a harness

So are we done? The equation certainly looks simple. Neat. But as someone who always hated math, I have internalized a deep distrust of equations -- there is always something more you have to unpack. And so as usual, it helps to work backwards from what an agent actually is in order to understand what the model lacks -- and why the equation works.

Fundamentally, agents are entities we trust to act on our behalf -- to exercise the agency we grant them. In this sense, agency requires certain conditions -- an ability to internalize and pursue a goal over time, an ability to seek out new information, an ability to take action and change the environment, and an ability to evaluate and adjust performance over time to maximize the chances of success.

And an LLM cannot do any of those things by itself. Effectively, an LLM has no internalized goal, no ability to seek out knowledge beyond its static training data, no ability to act on its environment, and no ability to track and judge its own performance over time -- because to the LLM, 'now' is the only time that ever exists and 'this request' is the only task that ever needs to be done. And so it cannot be an agent -- which is why the harness is becoming the critical missing infrastructure of the agentic shift.

Now, many companies talk about agents but not so much about the harness -- which often makes 'agentic infrastructure' feel like magic or science fiction. Like intelligence which only Grand Wizards of Tech can conjure into existence with their platforms. But in practice, things are much more mundane. A typical harness is just a set of ordinary software components deployed around the model -- agent instructions written in plain-language documents, a filesystem for keeping track of what the agent needs to know over time, a command line for executing code or using tools, and a sandbox for keeping the agent away from things you don't want it to touch.
And then, simplistically, a loop that repeatedly triggers another round of interaction between the harness and the LLM until the work is done. Effectively, the harness gives the LLM the right context, the LLM proposes what should happen next and then the harness makes it happen if permitted -- with agency emerging from the repeated interaction between the two.

The anatomy of an agent harness

But it can't be that simple, right? I mean we've all seen the PowerPoint slides. Heard the pronouncements of doom. Worried about the end times. But yes -- despite the increasingly exotic language surrounding agents, the answer is surprisingly mundane -- something venture capitalist Marc Andreessen recently joked about when reducing agent architecture to an LLM, a shell, a filesystem, Markdown and a cron job. So wake up your cryogenically frozen UNIX engineers from the 1970s -- with their beards, socks and sandals -- because it turns out the future needs them back.

Of course, what makes these seemingly old-timey technologies capable of operating at such scale today is the infrastructure that has been built around them since the 1970s -- with internet connectivity, markup languages, websites, and tools such as Git making files, code and instructions globally shareable, inspectable and reusable. Which, it turns out, also accidentally created a kind of vast, agent-legible filesystem.

But while the individual components look like traditional middleware and tools -- albeit scaled up -- the way they are being assembled is different. Most enterprise software executes logic that has been defined in advance, but an agent harness instead wraps around a non-deterministic, probabilistic core -- which means that rather than simply enforcing a predictable script, it must repeatedly guide and constrain a model whose future course cannot be fully known in advance.

And so while Andreessen's framing is deliberately reductive, it's also quite useful.
Because in practice, an agent harness is essentially a 'while' loop supported by four technical building blocks -- instructions, memory, execution and connections.

First, an agent needs a purpose -- and a way to stay focused on it over time. Instructions are therefore fundamental documents for defining the agent's goals, operating rules and available skills -- things which are frequently captured in plain-language files using Markdown. The job of the harness is to selectively load these documents and share as much or as little as necessary -- progressively introducing rules, skills and tool descriptions to the LLM in order to optimize outputs by keeping it on task without confusing it with information it doesn't yet need.

But purpose without continuity is not much use. The agent also needs to remember important facts and what it has already done. Memory is what gives the harness somewhere to preserve information -- such as background material, web search results or intermediate work -- across the lifecycle of a task -- a role frequently fulfilled by filesystems. The job of the harness is to preserve continuity over time by deciding what needs to be remembered and when it should be brought back into the LLM's working context -- while excluding or compacting older information that would otherwise clutter its context window with less relevant or distracting material.

An agent also needs somewhere to do the work -- rather than merely decide what should happen next. An execution environment gives it access to a flexible workspace in which it can use code and commands to manipulate files, perform calculations, transform data, test ideas and organize the results into evolving directory structures. This is most commonly provided through a sandboxed command shell, prepackaged with language runtimes, software packages, Git, browsers and testing tools.
Not every agent needs to generate code, but for those tackling open-ended work it provides a general-purpose way to operate on the materials of the task, inspect the results and adjust what happens next.

And finally, an agent needs a way to gather new information and act on the world beyond its own workspace -- because LLMs alone cannot do either. External connections enable the harness to retrieve data and take action through other systems -- using web search and Retrieval-Augmented Generation (RAG) to inject information, and application programming interfaces (APIs) and Model Context Protocol (MCP) to interact with external systems. The harness's job is to help overcome the model's frozen training data and isolation from the outside world by turning its requests for information or action into data retrieval and updates across the existing enterprise technology estate.

And then the thing that really makes the agent work -- the beating heart of agency -- is... A 'while' loop. Because without it, instructions, memory, execution and connections -- like the LLM itself -- remain a collection of passive capabilities. And so the loop allows agency to emerge from those passive capabilities by repeatedly triggering another round of interaction between the harness and the LLM until the work is done -- or the harness determines that it should not continue.

What harnesses look like in practice

Which all sounds straightforward enough. But once you start looking for harnesses in the real world, things become slightly less tidy -- not because the underlying architecture changes, but because the same capabilities are being packaged in very different ways.

Some harnesses arrive as finished products built around a particular kind of work. Claude Code, for example, wraps the model in a coding harness that can read and change files, run commands and tests, and keep working across a software task.
Claude Cowork uses similar underlying model capability but places it inside a different harness designed to carry out multi-step knowledge work across files, applications and other desktop environments. A subtle difference which kind of proves a much bigger point -- that the same model in a different harness is an entirely different agent because the harness is what encodes the agent's purpose, operational direction and controls.

At the same time, low-code, business process management and automation vendors are increasingly positioning their existing platforms as places in which agents can be built, connected, governed and run -- while application vendors are also doing much the same thing from inside their own systems.

And this is where the choice of harness starts to matter. Because a packaged harness can give you something coherent and useful quickly if you want an agent for a specific domain -- but you need to accept that most of the decisions about how memory, execution and control work have already been made for you. An enterprise platform offers immediate access to the data, permissions, workflows and integrations already sitting inside that platform -- but may also pull the agent's operating logic deeper inside one vendor's architectural boundaries.

Which is why many organizations may become increasingly interested in building their own harnesses using the growing range of developer frameworks and software development kits -- giving them more control over the purpose, decisions and governance of their agents, but also leaving them more deeply involved in engineering, securing and operating them.

And so choosing a harness is increasingly not just a matter of ticking a box and selecting a convenient place to build an agent. It is choosing where your agent's instructions, memory, permissions, execution logic and controls will live -- and therefore who gets to decide how that agent actually works.
My take

We have spent the last two years treating model choice as the center of enterprise AI strategy. But if the same model becomes an entirely different agent inside a different harness, then choosing the model may be becoming the less consequential decision.

Right now, this pattern is most visible in software development -- where many vocal engineers already claim startling gains in speed and productivity. But these agents are not doing something uniquely developer-shaped. They are solving complex problems by manipulating files, using tools, gathering information and connecting systems. So there is little reason to think harnesses will remain confined to coding. And as they spread into other kinds of knowledge work and enterprise workflows, they start to look less like dedicated developer tools -- and more like an emerging form of core operating model infrastructure.

Because the harness is where purpose, operating rules, permissions, memory, economics and controls are actually encoded for a given type of agent. It determines what an agent knows, what it remembers, what it can access, what it is allowed to do and how it behaves when things go wrong. And because those instructions, memories and operating rules -- which together become a digital expression of one part of the operating model -- sit outside the model, often as a collection of files, they can also become a durable and portable organizational asset. One which not only captures organizational intent but also preserves organizational differentiation by detaching that accumulated knowledge and expertise from any specific underlying model.

All of which suggests that choosing a harness will not simply be a tick-box decision between apparently neutral agent environments, but a strategic decision about how agency will be exercised inside the enterprise. And, more importantly, who gets to set the rules when it is.
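The 'while' loop and four building blocks described in this article (instructions, memory, execution and connections) can be sketched in a few lines. This is a toy illustration under stated assumptions, not any vendor's harness: `AgentHarness`, the proposal dictionary shape and `scripted_model` (a stand-in for a real LLM call) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentHarness:
    instructions: str   # goals and operating rules, e.g. loaded from Markdown
    tools: dict         # execution and connections: tool name -> callable
    allowed: set        # tool names the sandbox permits
    memory: list = field(default_factory=list)  # continuity across turns

    def run(self, model, task, max_steps=10):
        self.memory.append(f"task: {task}")
        steps = 0
        while steps < max_steps:            # the loop that produces agency
            steps += 1
            context = self.instructions + "\n" + "\n".join(self.memory)
            proposal = model(context)       # the model only proposes
            if "done" in proposal:
                return proposal["done"]
            name = proposal["tool"]
            if name not in self.allowed:    # the harness decides what runs
                self.memory.append(f"denied: {name}")
                continue
            result = self.tools[name](proposal["arg"])
            self.memory.append(f"{name}({proposal['arg']}) -> {result}")
        return None                         # step budget exhausted


def scripted_model(context):
    # Stand-in for an LLM: finishes once a tool result appears in context.
    if "->" in context:
        return {"done": context.rsplit("-> ", 1)[1]}
    return {"tool": "add_one", "arg": 41}


harness = AgentHarness(
    instructions="You are a calculator agent.",
    tools={"add_one": lambda x: x + 1},
    allowed={"add_one"},
)
answer = harness.run(scripted_model, "compute the answer")
```

The model never executes anything itself: it returns a proposal, and the harness checks permissions, runs the tool and writes the result back into memory for the next turn, which mirrors the division of labor the article describes.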
Xiaomi researchers introduced HarnessX, a framework that autonomously rewrites AI scaffolding during task execution, delivering an average +14.5% performance gain across 15 model-benchmark combinations. For smaller models like Qwen3.5-9B, gains reached +44% on embodied planning tasks, demonstrating that scaling foundation models isn't the only path to more capable AI agents.
Researchers at Xiaomi have introduced HarnessX, a framework that fundamentally changes how AI agents operate by treating the agent harness as a composable object that can autonomously rewrite itself mid-task [1]. This approach addresses a critical bottleneck in enterprise AI: the static, hand-crafted nature of AI scaffolding that connects Large Language Models to their operational environments. The results challenge conventional wisdom about scaling, with HarnessX delivering an average +14.5% performance gain across 15 model-benchmark combinations, and reaching +44% improvements for the open-weight Qwen3.5-9B on embodied planning tasks [1]. These performance gains suggest that for smaller models, optimizing the harness may be more effective than simply scaling up the foundation model itself.
Source: VentureBeat
The growing importance of agentic AI has exposed a fundamental truth: LLMs alone cannot function as agents [2]. Without an agent harness, an LLM has no internalized goal, no ability to seek information beyond its training data, no capacity to act on its environment, and no way to track performance over time [2]. The harness provides the critical infrastructure that transforms a model into an agent through components like instructions written in plain-language documents, a filesystem for memory management, a command line for executing code, and a sandbox for operational safety [2]. This operational layer converts raw model outputs into structured, executable behaviors through prompts, external tool integrations, memory management, and control flows [1]. The agentic shift depends on this repeated interaction between harness and model, where the harness provides context, the LLM proposes actions, and the harness executes them when permitted [2].

Traditional harness engineering presents three critical challenges that limit AI agents from handling complex, long-horizon workflows [1]. First, harnesses remain static and hand-engineered, requiring manual code rewrites whenever the foundation model changes, new tools are introduced, or operational domains shift. Second, architectural entanglement plagues most existing harnesses, tightly coupling prompt templates, tool wrappers, retry policies, and memory management within the same code paths. This means tweaking one component can silently break others, forcing teams to resort to raw code copying rather than clean, modular composition. Third, harnesses and foundation models are optimized in isolation, with execution traces typically discarded rather than used as training data, creating a bottleneck where teams fail to capture the full value of their operational data [1].
HarnessX solves these engineering bottlenecks by treating the harness as a "first-class object" that is independently serializable, modular, and substitutable [1]. The framework breaks agent behavior into distinct components like context assembly, memory management, tool ecosystems, control flow, and observability, with each specific behavior implemented as a "processor" that plugs into precise lifecycle hooks. To automate optimization of this modular structure, HarnessX introduces AEGIS, a trace-driven evolution engine that frames harness adaptation as a reinforcement learning problem over the symbolic components of the harness [1]. AEGIS relies on full trace observability and a four-stage pipeline engineered to prevent reward hacking, catastrophic forgetting, and under-exploration. This approach enables AI systems to dynamically adjust to application-specific requirements in real-world enterprise AI applications, with practical tests showing substantial gains across domains like software engineering and web interaction [1]. The modularity of HarnessX allows engineers to seamlessly swap, adapt, and evolve the scaffolding without touching the underlying model, addressing the reality that agency emerges from iterative loops between harness and model rather than from model capability alone [2].

Summarized by Navi