6 Sources
6 Sources
[1]
Microsoft researchers tried to manipulate AI agents - and only one resisted all attempts
The results underscore the dangers of an AI agent-run economy. As you've probably noticed, there's been a lot of hype circulating around AI agents and their supposed potential to transform the economy and human labor by automating routine, time-consuming tasks. A growing body of research, however, shows that agents fall short in elementary ways, indicating that they're probably not ready for primetime just yet. Also: I let Gemini Deep Research dig through my Gmail and Drive - here's what it uncovered New research from Microsoft found that industry-leading agentic AI tools struggle to interact with one another to complete basic marketplace decisions, like choosing a restaurant by comparing menu offerings and prices. Researchers also found most agents fell for manipulation attempts, including prompt injections and misleading information. These agents failed consistently, though, meaning the research could provide a blueprint for AI companies to address those flaws moving forward. Microsoft's research revolved around what it calls the "Magentic Marketplace" -- an open-source environment where AI agents converse with one another in order to complete transactions in a virtual environment simulating a real-world marketplace. (You can give it a try yourself on GitHub.) The goal was to test the practical capabilities of agentic systems at a time when AI developers are rapidly delivering more autonomous products, like shopping and buying agents for both individuals and businesses. OpenAI's Operator, for example, can navigate websites and complete purchases on behalf of users, while Meta's Business AI can interact with customers like an automated sales representative. Also: Google Finance gets a Gemini-powered upgrade - what it can do for you now The rise of automated buyers and vendors "hint at a future where [AI] agents become active market participants, but the structure of these markets remains uncertain," Microsoft wrote in a company blog post about its new research. The Magentic Marketplace is an early attempt to map out some of that structure, and to reveal any traps that we might be heading into. Designed to simulate the complexity of real-world markets, it involves numerous agents, all of which are set loose, in true game theory style, to interact in an effort to optimize their own, individual outcomes -- rather than just pitting an automated customer agent against a buyer agent and letting them strike a deal. Microsoft ran its experiments using leading proprietary models like GPT-5 and Gemini 2.5 Flash, as well as open-source models like OpenAI's OSS-20b. Those models were used to simulate 100 customers and 300 businesses, which could interact with one another via text prompts that can be monitored by human users. Also: I let ChatGPT Atlas do my Walmart shopping for me - here's how the AI browser agent did Microsoft assigned customer agents a list of items and amenities and had to search through all available vendor agents to find the one that offered everything they were looking for at the best available price. The researchers used a "consumer welfare" metric to assess the performance of each model, which was calculated as the sum of a customer's internal item valuations minus the final sales price, aggregated across all of its transactions. According to Microsoft, the customer agents often showed promise in helping humans overcome what the company described as "information gaps." Think of these as mental or logistical shortcuts a human might take when presented with too many options, like choosing randomly or searching for the cheapest option. "This change matters because as agents gain better tools for discovery and communication, they relieve customers of the heavy cognitive load of filling any information gaps," Microsoft wrote in its blog post. "This lowers the cost of making informed decisions and improves customer outcomes." Also: Why Amazon really doesn't want Perplexity's AI browser shopping for you The agents also showed some critical flaws, though. One of the big problems had to do with what the researchers call the "Paradox of Choice" -- a more familiar phrase might be "analysis paralysis." Basically, even though they had many different options to choose from, most of the customer agents -- with the exception of GPT-5 and Gemini 2.5 Flash -- only interacted with a small number of vendor agents. "This suggests that most models do not conduct exhaustive comparisons and instead easily accept the initial 'good enough' options," Microsoft wrote. The researchers additionally found that for every customer agent, consumer welfare decreased as the number of available options for vendor agents increased. Also: Google's AI mode agents can snag event tickets for you now - here's how The researchers also tested six different "manipulation strategies" to try to mislead the customer agents, including adding dubious claims like "#1-rated Mexican restaurant" or using overt prompt injections. There was a wide degree of variation in terms of how the models responded, according to Microsoft; notably, Claude Sonnet 4 showed total resistance to all attempts at manipulation. Unsurprisingly, the researchers detected a few biases that hindered model performance. For example, open-source models like Qwen2.5-14b-2507 tended to choose the last business that was offered in the initial list of options, regardless of how it compared to the others. There was also a widespread "proposal bias," which caused models to choose the first vendor agent that engaged with it with an offer, suggesting a prioritization of speed over thoroughness. "These biases can create unfair market dynamics, drive unintended behaviors, and push businesses to complete on response speed rather than product or service quality," Microsoft said. While the companies behind these tools promote them as time-saving personal assistants, they could also have major economic implications -- the likes of which have yet to be mapped out. The stock market, for example, is already governed by inscrutable algorithms designed to track the prices of innumerable goods. How much more opaque will that system become when AI isn't just tracking the prices of commodities, but actually overseeing many or the majority of everyday transactions? Also: AI agents are only as good as the data they're given, and that's a big issue for businesses Since we already know that AI models are subject to all kinds of biases that hide deep in the intricacies of their training data, how will those manifest themselves when legions of AI consumers and buyers are unleashed into the wild? Microsoft's findings are just the latest to prove that agents shouldn't be trusted in high-stakes situations, and whenever they are deployed, they should be carefully monitored. Another study published earlier this week, for example, found that AI agents are a long way away from completing quality freelance work. An Anthropic research project earlier this year showed that Claude struggled to operate a small business for a month. Want more stories about AI? Sign up for our AI Leaderboard newsletter. All of these results point to the conclusion that despite the huge amount of hype swirling around agents, it'll be a while before these systems are able to function autonomously. As Microsoft concludes in its blog post: "Agents should assist, not replace, human decision-making."
[2]
AI Agents Miss the Mark on the Tasks They Were Designed to Handle
Don't miss out on our latest stories. Add PCMag as a preferred source on Google. Microsoft put several AI agents through their paces in a simulated environment and found that they are far from capable. In a test of how well the top agentic AIs (GPT-4o, GPT-5, and Gemini 2.5 among them) handle placing an order from a restaurant or store, the AIs reportedly became overwhelmed by choice, were easily manipulated by other AIs into buying certain products, and needed sufficiently detailed and accurate prompting to achieve the desired goals. This Magnetic Marketplace test effectively pitted AI agents against one another, and against a team of business-first AI agents. The first group would attempt to place an order for a product or service based on a human user's prompt input, while the business-AI agents would try to sell their services. Real-world marketplaces could one day function this way, with user agent AIs purchasing from AI selling agents, marking a significant opportunity for manipulation of those purchasing AIs. This is exactly what emerged in the study. There was an enormous advantage for selling AIs that got in there first. Microsoft reports a 10-30x advantage for response speed over its quality, suggesting that for now, it's just whoever gets their product in front of the AI first who wins their little agent hearts and secures a purchase. Purchasing agents were also easily manipulated through fake credentials, such as a restaurant claiming it had been featured in the Michelin Guide. Claims of many satisfied customers (without citations), suggestions of danger at other businesses, and prompt injection attacks were all effective at manipulating the agents into using one service or another. Another issue was the breadth of choice. Although the study demonstrated that some AI agents were capable of effectively carrying out the goals of their prompting users, their performance was limited to ideal circumstances. If the AI agents were presented with too many options -- in some cases, they interacted with up to 300 business agents -- they often fell apart, according to Windows Central. "Performance degrades sharply with scale," Microsoft says. As with all large language model AIs, these agents required effective prompting to produce effective results. They also worked better when provided with the top three options for a category, rather than having to search through everything. The models included in the study were OpenAI's GPT-4o, GPT-4.1, GPT-5, and Google's Gemini-2.5-Flash, as well as open-source models OSS-20b, Qwen3-14b, and Qwen3-4b-Instruct-2507. The Magnetic Marketplace test is open-source, so if you'd like to explore it in more detail, you can find it on GitHub. In a recent tweet, OpenAI CISO Dane Stuckey admitted that its ChatGPT Atlas browser can purchase the wrong product on behalf of users. "ChatGPT agent is powerful and helpful, and designed to be safe, but it can still make (sometimes surprising!) mistakes, like trying to buy the wrong product or forgetting to check in with you before taking an important action," he said.
[3]
Agents of misfortune: The world isn't ready for AI agents
Amazon's spat with Perplexity shows that technology is not the only blocker for the agentic era Opinion The agentic era remains a fantasy world. Software agents, the notional next frontier for generative AI services, cannot escape the gravity of their contradictions, legal ambiguities, and competitive pressures. Not everyone, especially not competing businesses, wants a bot representing the customer. Software agents, as defined by developer Simon Willison, are "[AI] models using tools in a loop." Wire an LLM into a browser and maybe, if not derailed by mistakes, security controls, or lack of contextual data, the agentic system can carry out a request to purchase a specific item on a website or book a trip on an airline. "The retail world is shifting to agentic commerce, where AI agents act for people and businesses, creating a more responsive shopping experience," wrote Kapil Dabi, market lead for retail and consumer industries at Google Cloud, in a blog post last month. Dabi's sentiment reflects the labor-averse tech industry's interest in software that can act on behalf of people, carrying out directives derived from a user's prompt or query. "Agentic commerce - shopping powered by AI agents acting on our behalf - represents a seismic shift in the marketplace," gushes consultancy McKinsey. "It moves us toward a world in which AI anticipates consumer needs, navigates shopping options, negotiates deals, and executes transactions, all in alignment with human intent yet acting independently via multistep chains of actions enabled by reasoning models." Set aside for the time being that McKinsey is described in a recent book as providing advice that "boils down to major cost-cutting, including layoffs and maintenance reductions, to drive up short-term profits, thereby boosting a company's stock price and the wealth of its executives who hire it, at the expense of workers and safety measures." Focus instead on the oversimplification of agentic commerce and the issues it raises for the companies involved, for those who would deploy software agents, for the people who presently do work that agents would take, for society at large, and for the legal system. Earlier this week, Amazon demanded that Perplexity stop allowing its Comet browser to make automated purchases on the e-commerce giant's website. This might be taken as the canary in the copilot mine, but really the spat is just an extension of legal battles that challenge the rights of AI companies to ingest the internet's data without permission or compensation and sell it back by the token. Perplexity's arguments bear further examination. In its blog post calling out Amazon for bullying, the company says, "Today, Amazon announced it does not believe in your right to hire labor, to have an assistant or an employee acting on your behalf." That's not what Amazon said, and there's a certain irony about a company that has offered its services as a substitute for labor making that claim. Perplexity asserts that AI and human labor should be seen as the same thing. "[W]ith the rise of agentic AI, software is also becoming labor: an assistant, an employee, an agent," the company's blog post declares. It's true that current generative AI models pass the Turing Test, at least as it was initially imagined - hence the call for a more relevant "Imitation Game." But software is not the same as human labor. Software agents and human action are not interchangeable. In the context of online interaction, there are technical differences between the way agentic systems and people browse the web that translate into cost differences. Software may consume computing and network resources at a different rate than human-operated browsing, and data exchanged during that interaction may have different value. And third parties involved in this process want to know whether they are serving ads or collecting analytics data from machines or people. Then there are the legal differences. "Publishers and corporations have no right to discriminate against users based on which AI they've chosen to represent them," Perplexity argues. But organizations do have the right to set the terms of use for their services, outside of regulatory scenarios where interoperability is required. Microsoft isn't obligated to ensure Windows runs macOS apps. I might want to be able to scrape all of LinkedIn's data with a Python script and analyze the social graph connections, but LinkedIn isn't obligated to allow that. Publishers that derive revenue from ads don't have to accommodate visitors who block ads. Ticket scalpers may want to automate the purchase of concert tickets for resale, but ticket vendors don't have to cooperate. Having the freedom to install and use software is important. But that doesn't mean the use of that software will or should be welcomed everywhere. Amazon's move to force Perplexity to keep its bot at bay is going to be repeated elsewhere because tech industry incumbents today are mostly in the business of gatekeeping and avoiding competition. Amazon says it's concerned that automated purchases by Comet degrade the customer experience. But it's also a matter of control, of owning the customer relationship, and having access to transactional data. That tendency is what got us here in the first place. Companies like Amazon, Apple, Google, Microsoft, and Meta have more or less eliminated competition in their respective markets, and regulatory intervention has done very little. So once venture capitalists and entrepreneurs realized that AI models with natural language interfaces might allow newcomers to disintermediate incumbents, they poured money into the AI business in the hope of displacing Google and its peers. AI agents also involve potential liability - they don't always get things right. "The problem is that AI providers may not want to be held liable because they do not want to be exposed to unquantifiable risks as they cannot anticipate how the AI users will deploy their AI," wrote Garry Gabison, Queen Mary University of London, School of Law, and Patrick Xian, University of California, San Francisco, in a recent legal paper that examines agentic liability concerns. The authors argue that AI providers and users will need to address risks through contractual terms that spell out liability and compensation mechanisms in case AI agents cause harm. Establishing ground rules is just the sort of thing that Perplexity is trying to avoid by insisting that its software be allowed to roam and interact unhindered, without any agreement from Amazon. But then that's the AI industry in a nutshell: avoiding liability for training on data without permission, avoiding liability for AI models that hallucinate and amplify security risks, and avoiding negotiation when AI models interact with third-party services. It's using the well-worn playbook of tech disruptors for the last several decades: Don't waste time asking for permission. Establish market dominance first, then ask forgiveness. AI agents have already spurred more automation, particularly for software development and deployment. But the industry's giddy, desperate optimism looks unlikely to survive contact with human-facing systems outside the software industry. Klarna's decision last year to hire people for customer service after previously firing them shows that AI isn't necessarily right for every role. To put it bluntly, a lot of people don't like AI or have reservations about it. I saw this in person recently at China Live, a restaurant in San Francisco. The AI agent in this case is embodied in a service robot that takes to-go orders to a curbside station and returns to the kitchen for the next load. I was there on a Saturday evening and it was busy. The wait staff often had to step aside and wait for the robot as it moved haltingly through the restaurant. On several occasions, staff had to disable the bot to walk past and then re-activitate it via the touch screen because the machine just wasn't nimble. I asked the woman attending our table what she thought about her automated co-worker. It was helpful, she replied, because it freed human staff from making repeated trips to the staging area for pickup orders. But the bot also slowed the flow of traffic and caused other problems. One time, she said, it collided with an employee and knocked a tray of drinks to the floor.
[4]
Microsoft: Don't let AI agents near your credit card yet
Shopping bots pick first option and 'vulnerable to manipulation', Magentic Marketplace trial finds Ready to have your agent talk to my agent and arrange a sale? Microsoft has published a simulated marketplace to put AI agents through their paces and answer a question for the new age: Would you trust AI with your credit card? Customer-facing assistants are all the rage these days. OpenAI and Anthropic, for example, have helpers that will navigate websites and complete purchases. Then there are assistants that will aid sellers with customer engagement and operations. It all points to a future where, like rich people with personal shoppers, the average user will have "people" to do all the work for them. To simulate what might happen, Microsoft's researchers built the Magentic Marketplace, an open-source simulation upon which agents can be unleashed and the results studied. And the conclusion? "Agents should assist, not replace, human decision-making." The marketplace simulation manages catalogs of goods and services, and facilitates agent-to-agent communication. It also handles simulated payments. The researchers simulated transactions such as ordering food or engaging with home improvement services. Agents represented customers and businesses at each end of the transactions. Each experiment was run using 100 virtual customers and 300 virtual businesses, and included both proprietary models (such as GPT-4o and Gemini-2.5-Flash) and open source models. The team had agents building queries, navigating results, and negotiating transactions. The results were interesting. Although agents can help (the thinking is that an AI agent should be able to consider far more possibilities than a human could), loading them with more options and search results led to a decline in the number of comparisons. With some exceptions (notably Gemini-2.5-Flash and GPT-5), researchers found the models tended to accept the initial "good enough" options rather than dig deeper. Researchers also tried manipulation strategies, which ranged from fake award credentials and fake reviews, to prompt injections. Again, the models varied. Gemini-2.5-Flash proved generally resistant, while others could be tricked. Prompt injection techniques proved useful in directing payments to manipulative agents, while more basic persuasion techniques were also effective. The researchers noted: "These findings highlight a critical security concern for agentic marketplaces." It all suggests that the current state of the art in terms of AI models still has some ways to go. The agents were shown to struggle when presented with too many options and were vulnerable to manipulation. Researchers also found some models showed biases, including selecting a business based on its position in the results rather than on merit. And then there is the design and implementation of the marketplace. The researchers said: "Our current study focused on static markets, but real-world environments are dynamic, with agents and users learning over time. "Oversight is critical for high-stakes transactions." "A simulation environment like Magentic Marketplace is crucial for understanding the interplay between market components and agents before deploying them at scale." So, perhaps reconsider handing over authority to an agent at this point. The results might not be quite what you were expecting. ®
[5]
Microsoft Magentic Marketplace shows AI can't truly operate independently
AI agents slow down significantly when presented with too many choices A new Microsoft study has raised questions on the current suitability of AI agents operating without full human supervision/ The company recently built a synthetic environment, the "Magentic Marketplace", designed to observe how AI agents perform in unsupervised situations. The project took the form of a fully simulated ecommerce platform which allowed researchers to study how AI agents behave as customers and businesses - with possible predictable results. The project included 100 customer-side agents interacting with 300 business-side agents, giving the team a controlled setting to test agent decision-making and negotiation skills. The source code for the marketplace is open source; therefore, other researchers can adopt it to reproduce experiments or explore new variations. Ece Kamar, CVP and managing director of Microsoft Research's AI Frontiers Lab, noted this research is vital for understanding how AI agents collaborate and make decisions. The initial tests used a mix of leading models, including GPT-4o, GPT-5, and Gemini-2.5-Flash. The results were not entirely unexpected, as several models showed weaknesses. Customer agents could easily be influenced by business-side agents into selecting products, revealing potential vulnerabilities when agents interact in competitive environments. The agents' efficiency dropped sharply when faced with too many options, overwhelming their attention span and leading to slower or less accurate decisions. AI agents also struggled when asked to work toward shared goals, as the models were often unsure which agent should take on which role, which reduced their effectiveness in joint tasks. However, their performance improved only when step-by-step instructions were provided. "We can instruct the models - like we can tell them, step by step. But if we are inherently testing their collaboration capabilities, I would expect these models to have these capabilities by default," Kamar noted. The results show AI tools still need substantial human guidance to function effectively in multi-agent environments. Often promoted as capable of independent decision-making and collaboration, the results show unsupervised agent behavior remains unreliable, so humans must improve coordination mechanisms and add safeguards against AI manipulation. Microsoft's simulation shows that AI agents remain far from operating independently in competitive or collaborative scenarios and may never achieve full autonomy.
[6]
Microsoft Gave AI Agents Fake Money to Buy Things Online. They Spent It All on Scams - Decrypt
They can't collaborate or think critically without step-by-step human hand-holding -- autonomous AI shopping isn't ready for prime time. Microsoft built a simulated economy with hundreds of AI agents acting as buyers and sellers, then watched them fail at basic tasks humans handle daily. The results should worry anyone betting on autonomous AI shopping assistants. The company's Magentic Marketplace research, released Wednesday in collaboration with Arizona State University, pitted 100 customer-side AI agents against 300 business-side agents in scenarios like ordering dinner. The results, though expected, show the promise of autonomous agentic commerce is not yet mature enough. When presented with 100 search results (too much for the agents to handle effectively), the leading AI models choked, with their "welfare score" (how useful the models turn up) collapsing. The agents failed to conduct exhaustive comparisons, instead settling for the first "good enough" option they encountered. This pattern held across all tested models, creating what researchers call a "first-proposal bias" that gave response speed a 10-30x advantage over actual quality. But is there something worse than this? Yes, malicious manipulation. Microsoft tested six manipulation strategies ranging from psychological tactics like fake credentials and social proof to aggressive prompt injection attacks. OpenAI's GPT-4o and its open source model GPTOSS-20b proved extremely vulnerable, with all payments successfully redirected to malicious agents. Alibaba's Qwen3-4b fell for basic persuasion techniques like authority appeals. Only Claude Sonnet 4 resisted these manipulation attempts. When Microsoft asked agents to work toward common goals, some of them couldn't figure out which roles to assume or how to coordinate effectively. Performance improved with explicit step-by-step human guidance, but that defeats the entire purpose of autonomous agents. So it seems that, at least for now, you are better off doing your own shopping. "Agents should assist, not replace, human decision-making," Microsoft said. The research recommends supervised autonomy, where agents handle tasks but humans retain control and review recommendations before final decisions. The findings arrive as OpenAI, Anthropic, and others race to deploy autonomous shopping assistants. OpenAI's Operator and Anthropic's Claude agents promise to navigate websites and complete purchases without supervision. Microsoft's research suggests that promise is premature. However, fears of AI agents acting irresponsibly are heating up the relationship between AI companies and retail giants. Amazon recently sent a cease-and-desist letter to Perplexity AI, demanding it halt its Comet browser's use on Amazon's site, accusing the AI agent of violating terms by impersonating human shoppers and degrading the customer experience. Perplexity fired back, calling Amazon's move "legal bluster" and a threat to user autonomy, arguing that consumers should have the right to hire their own digital assistants rather than rely on platform-controlled ones. The open-source simulation environment is now available on Github for other researchers to reproduce the findings and watch hell unleash in their fake marketplaces.
Share
Share
Copy Link
Microsoft researchers tested AI agents in a simulated marketplace and found they struggle with basic tasks, are easily manipulated, and perform poorly when given too many options, raising serious questions about their readiness for real-world deployment.
Microsoft researchers have conducted a comprehensive study examining how AI agents perform in marketplace scenarios, revealing significant limitations that challenge the current push toward autonomous shopping systems. The research, conducted through an open-source simulation called the "Magentic Marketplace," tested industry-leading AI models including GPT-5, GPT-4o, and Gemini 2.5 Flash in realistic transaction scenarios
1
.
Source: TechRadar
The study simulated interactions between 100 customer agents and 300 business agents, allowing researchers to observe how AI systems navigate complex marketplace decisions such as restaurant selection based on menu offerings and pricing comparisons. The findings reveal fundamental flaws that suggest current AI agents are not ready for widespread autonomous deployment in commercial environments
4
.One of the most significant problems identified was what researchers termed the "Paradox of Choice" - essentially analysis paralysis for AI systems. When presented with numerous vendor options, most AI agents failed to conduct exhaustive comparisons and instead accepted initial "good enough" options rather than thoroughly evaluating alternatives
1
.The study found that performance degraded sharply with scale, with agents becoming overwhelmed when interacting with large numbers of business agents. This limitation is particularly concerning given that real-world marketplaces typically offer consumers hundreds or thousands of options
2
.Additionally, the research revealed a significant advantage for vendors who responded first, with Microsoft reporting a 10-30x advantage for response speed over quality. This suggests that current AI agents prioritize immediate availability over optimal value, potentially leading to suboptimal purchasing decisions
2
.Perhaps most concerning were the findings regarding AI agents' susceptibility to manipulation. Researchers tested six different manipulation strategies, including fake credentials, misleading claims, and prompt injection attacks. Most agents fell for these tactics, with only Gemini 2.5 Flash demonstrating consistent resistance to all manipulation attempts
1
.
Source: Decrypt
The manipulation techniques included dubious claims such as "#1-rated Mexican restaurant" without verification, fake reviews claiming many satisfied customers without citations, and suggestions of danger at competing businesses. These tactics proved effective at directing AI agents toward specific vendors, highlighting critical security concerns for autonomous marketplace systems
2
.Related Stories
The research comes at a time when major tech companies are rapidly deploying AI agents for commercial applications. OpenAI's Operator can navigate websites and complete purchases, while Meta's Business AI interacts with customers as automated sales representatives. However, the Microsoft study suggests these systems may not be ready for unsupervised operation
1
.
Source: The Register
The findings align with recent real-world incidents, including OpenAI CISO Dane Stuckey's admission that ChatGPT Atlas browser agents can purchase wrong products on behalf of users. This acknowledgment underscores the practical challenges facing AI agent deployment in commercial environments
2
.Furthermore, the research highlights broader industry tensions, as evidenced by Amazon's recent demand that Perplexity stop allowing its browser to make automated purchases on the e-commerce platform. This dispute illustrates that technological limitations are not the only barriers to widespread AI agent adoption
3
.Summarized by
Navi
[3]
[4]
1
Business and Economy

2
Technology

3
Policy and Regulation
