Curated by THEOUTPOST
On Fri, 21 Feb, 4:04 PM UTC
4 Sources
[1]
Microsoft Shows Off AI That Can Control an Entire Robot
Microsoft has released a new generative model, dubbed Magma, that can autonomously control an entire robot while processing information from its sensors -- a fascinating step toward a world in which AI like ChatGPT could interact with the physical world using a robotic arm, a humanoid android, or something else entirely.

In its announcement, the tech giant claims its latest AI can process multimodal data, including text, images, and video, while also being able to "plan and act in the visual-spatial world." That means it could be used to "complete agentic tasks ranging from UI navigation to robot manipulation."

"Magma is able to formulate plans and execute actions to achieve it," Microsoft wrote in the research paper documenting the new tool. "By effectively transferring knowledge from freely available visual and language data, Magma bridges verbal and spatial intelligence to navigate complex tasks."

Magma is part of a much broader transition from simple large language models and chatbots to "AI agents," which can carry out tasks on behalf of their human overlords. But the tech still has nagging technical limitations; case in point, OpenAI's recently released AI agent, dubbed Operator, which was designed to navigate the internet to "perform tasks for you," still requires plenty of adult supervision to get anything done. And navigating the physical world, let alone manipulating objects, will likely be no easy task either.

Nonetheless, according to Microsoft's tests, Magma "creates new state-of-the-art results on UI navigation and robotic manipulation tasks, outperforming previous models that are tailored specifically to these tasks." Video samples released by the company show the AI placing a plastic mushroom in a metal bowl and pushing a dishcloth across a countertop.
Apart from manipulating a robotic arm, Microsoft also demonstrates how Magma could assist a human through a live video feed, from helping out during a real-world game of chess to suggesting how to "relax for a few hours" in a living room.

But the AI isn't perfect, as Microsoft's researchers admit in their paper. For one, the tests they devised were highly specific. "We note that the distribution of identities and activities in the instructional videos are not representative of the global human population and the diversity in society," the paper reads.

The move toward agentic AI could also have plenty of unintended consequences, such as introducing cybersecurity vulnerabilities through bad actors exploiting jailbreaks or injecting malicious code. How such a scenario would play out with an AI that's controlling a robot in the physical world remains to be seen -- but we might prefer not to find out.
[2]
Microsoft's Magma AI Can Manipulate and Control Robots
Microsoft just introduced Magma, a new AI model designed to help robots see, understand, and act more intelligently. Unlike traditional artificial intelligence models, Magma processes different types of data all at once -- an effort Microsoft is calling a big leap toward "agentic AI," or systems that can plan and execute tasks on a user's behalf.

The model, which uses a combination of vision and language processing, is trained on videos, images, robotics data, and interface interactions, making it more versatile than previous models. On its GitHub page, the Microsoft Research team outlined how Magma can perform tasks, such as manipulating robots and navigating user interfaces by clicking buttons. To develop the technology, the company partnered with researchers from the University of Maryland, the University of Wisconsin-Madison, and the University of Washington.

The launch comes as tech giants race to develop AI agents that can automate more aspects of daily life. Google has been advancing robotics-focused language models, while OpenAI's Operator tool is designed to handle mundane tasks like making reservations, ordering groceries, and filling out forms via typing, clicking, and scrolling within a specialized browser.
[3]
Microsoft's new AI agent can control software and robots
On Wednesday, Microsoft Research introduced Magma, an integrated AI foundation model that combines visual and language processing to control software interfaces and robotic systems. If the results hold up outside of Microsoft's internal testing, it could mark a meaningful step forward for an all-purpose multimodal AI that can operate interactively in both real and digital spaces.

Microsoft claims that Magma is the first AI model that not only processes multimodal data (like text, images, and video) but can also natively act upon it -- whether that's navigating a user interface or manipulating physical objects. The project is a collaboration between researchers at Microsoft, KAIST, the University of Maryland, the University of Wisconsin-Madison, and the University of Washington.

We've seen other large language model-based robotics projects, like Google's PaLM-E and RT-2 or Microsoft's ChatGPT for Robotics, that utilize LLMs as an interface. However, unlike many prior multimodal AI systems that require separate models for perception and control, Magma integrates these abilities into a single foundation model. Microsoft is positioning Magma as a step toward agentic AI, meaning a system that can autonomously craft plans and perform multistep tasks on a human's behalf rather than just answering questions about what it sees.

"Given a described goal," Microsoft writes in its research paper, "Magma is able to formulate plans and execute actions to achieve it. By effectively transferring knowledge from freely available visual and language data, Magma bridges verbal, spatial, and temporal intelligence to navigate complex tasks and settings."

Microsoft is not alone in its pursuit of agentic AI. OpenAI has been experimenting with AI agents through projects like Operator, which can perform UI tasks in a web browser, and Google has explored multiple agentic projects with Gemini 2.0.
Spatial intelligence

While Magma builds on Transformer-based LLM technology that feeds training tokens into a neural network, it differs from traditional vision-language models (like GPT-4V, for example) by going beyond what Microsoft calls "verbal intelligence" to also include "spatial intelligence" (planning and action execution). By training on a mix of images, videos, robotics data, and UI interactions, Microsoft claims that Magma is a true multimodal agent rather than just a perceptual model.

The Magma model introduces two technical components: Set-of-Mark, which identifies objects that can be manipulated in an environment by assigning numeric labels to interactive elements, such as clickable buttons in a UI or graspable objects in a robotic workspace, and Trace-of-Mark, which learns movement patterns from video data. Microsoft says those features allow the model to complete tasks like navigating user interfaces or directing robotic arms to grasp objects.

Microsoft Magma researcher Jianwei Yang wrote in a Hacker News comment that the name "Magma" stands for "M(ultimodal) Ag(entic) M(odel) at Microsoft (Rese)A(rch)," after some people noted that "Magma" already belongs to an existing matrix algebra library, which could create confusion in technical discussions.

Reported improvements over previous models

In its Magma write-up, Microsoft claims Magma-8B performs competitively across benchmarks, showing strong results in UI navigation and robot manipulation tasks. For example, it scored 80.0 on the VQAv2 visual question-answering benchmark -- higher than GPT-4V's 77.2 but lower than LLaVA-Next's 81.8. Its POPE score of 87.4 leads all models in the comparison. In robot manipulation, Magma reportedly outperforms OpenVLA, an open-source vision-language-action model, in multiple tasks.
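The Set-of-Mark idea described above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration, not Microsoft's implementation: the `Element` class, `assign_marks`, and `mark_prompt` are invented names, and a real system would overlay the numeric marks onto the image itself rather than listing them as text.

```python
from dataclasses import dataclass

@dataclass
class Element:
    """An interactive element detected in a screenshot or camera frame."""
    kind: str    # e.g. "button" or "graspable_object"
    bbox: tuple  # (x0, y0, x1, y1) pixel coordinates

def assign_marks(elements):
    """Set-of-Mark step 1: number every candidate element so the model can
    refer to an action target by mark index instead of raw coordinates."""
    return {i + 1: el for i, el in enumerate(elements)}

def mark_prompt(marks, goal):
    """Set-of-Mark step 2 (sketch): list the marked elements next to the
    goal, so the model answers with a small discrete label."""
    lines = [f"[{i}] {el.kind} at {el.bbox}" for i, el in marks.items()]
    return ("Goal: " + goal + "\nMarked elements:\n" + "\n".join(lines)
            + "\nReply with the mark number to act on.")

marks = assign_marks([
    Element("button", (10, 10, 60, 30)),
    Element("graspable_object", (120, 80, 160, 120)),
])
prompt = mark_prompt(marks, "place the mushroom in the bowl")
```

The point of the technique is the output space: instead of regressing pixel coordinates, the model only has to emit a mark number, which grounds its actions in elements that are known to be interactable.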
As always, we take AI benchmarks with a grain of salt, since many have not been scientifically validated as measuring useful properties of AI models. External verification of Microsoft's benchmark results will become possible once other researchers can access the public code release.

Like all AI models, Magma is not perfect. It still faces technical limitations in complex, multistep decision-making that unfolds over time, according to Microsoft's documentation. The company says it continues to work on improving these capabilities through ongoing research. Yang says Microsoft will release Magma's training and inference code on GitHub next week, allowing external researchers to build on the work.

If Magma delivers on its promise, it could push Microsoft's AI assistants beyond limited text interactions, enabling them to operate software autonomously and execute real-world tasks through robotics. Magma is also a sign of how quickly the culture around AI can change. Just a few years ago, this kind of agentic talk scared many people who feared it might lead to AI taking over the world. While some still fear that outcome, in 2025 AI agents are a common topic of mainstream AI research that regularly proceeds without triggering calls to pause all AI development.
[4]
Microsoft's Magma AI Model Can Automate Robotics Tasks
It features Set-of-Mark and Trace-of-Mark technical components

Microsoft researchers announced a new foundation model on Wednesday that can perform agentic functions. Dubbed Magma, the artificial intelligence (AI) model is pre-trained on a large volume of datasets spanning text, images, videos, and spatial formats. The Redmond-based tech giant said that Magma is an extension of vision-language (VL) models: it can not only understand multimodal information but also plan and act on it. The agent-enabled model can be used for a wide range of tasks, including computer vision, user interface (UI) navigation, and robot manipulation.

In a GitHub post, Microsoft researchers detailed the new Magma foundation model. Foundation models are large language models (LLMs) that are built from scratch rather than distilled from another model; they often become the baseline for other models in a series. Magma is distinctive in that it is pre-trained on an unusually wide range of datasets. The researchers stated that the base architecture behind Magma is the Llama 3 AI model.

Magma is also equipped with the ability to plan and act in the visual-spatial world. This allows the model not only to generate outputs like a chatbot but also to execute actions. It can serve as a computer-vision chatbot that offers information about the world it views when paired with camera sensors, and it can control the UI of a device. More interestingly, it can control robots to complete complex tasks using agentic capabilities. The researchers said a major reason behind these capabilities is the diverse dataset, along with two technical components -- Set-of-Mark and Trace-of-Mark. The former enables action grounding in images, videos, and spatial data by having the model predict numeric marks for buttons or robot arms in image space.
The latter feeds the model temporal video dynamics and has it predict upcoming frames before it takes action, helping the model develop a strong spatial understanding. Microsoft researchers also shared benchmark scores for the AI model based on internal testing. It achieved competitive scores across the agentic evaluation tests, outperforming models by OpenAI, Alibaba, and Google. The company has not yet released Magma publicly.
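The Trace-of-Mark idea, as described above, amounts to learning from the trajectories of tracked points in video. The sketch below is a hypothetical illustration of that supervision signal, assuming a point tracker has already produced per-frame positions; `trace_targets` is an invented helper, not part of Magma's released code.

```python
def trace_targets(trajectory, history=4, horizon=3):
    """Trace-of-Mark-style supervision (sketch): from a tracked point's
    per-frame positions, build (past positions -> future trace) pairs.
    The model is trained to predict the future trace before acting."""
    pairs = []
    for t in range(history, len(trajectory) - horizon + 1):
        past = trajectory[t - history:t]      # what the model observes
        future = trajectory[t:t + horizon]    # what it must predict
        pairs.append((past, future))
    return pairs

# Hypothetical tracked mark: a gripper tip moving right, then down.
traj = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
pairs = trace_targets(traj)
```

Predicting where a marked point will move next is a cheaper target than predicting whole future frames, which is why trajectory traces make useful training labels from unlabeled video.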
Microsoft introduces Magma, a new AI foundation model capable of controlling robots and navigating software interfaces. This multimodal AI represents a significant step towards agentic AI, processing various data types and executing complex tasks.
Microsoft has introduced Magma, a groundbreaking AI foundation model that represents a significant leap towards agentic artificial intelligence. This innovative system can process multimodal data, including text, images, and video, while also planning and executing actions in both digital and physical environments [1][2].
Magma stands out from traditional AI models due to its ability to:

- Process multiple data types -- text, images, video, robotics data, and UI interactions -- in a single model
- Plan and act in the visual-spatial world, rather than only describe what it sees
- Perform agentic tasks ranging from UI navigation to robot manipulation
The model integrates visual and language processing, allowing it to bridge the gap between verbal and spatial intelligence [3]. This integration enables Magma to perform a wide range of tasks, from manipulating robotic arms to navigating software interfaces [4].
Two key technical components contribute to Magma's advanced capabilities:

- Set-of-Mark, which assigns numeric labels to interactive elements -- such as clickable buttons in a UI or graspable objects in a robotic workspace -- so the model can ground its actions
- Trace-of-Mark, which learns movement patterns from video data by predicting how marked points move over time
These features allow Magma to complete tasks such as grasping objects with robotic arms or clicking buttons in a user interface [4].
Microsoft claims that Magma-8B performs competitively across various benchmarks:

- VQAv2 (visual question answering): 80.0, higher than GPT-4V's 77.2 but lower than LLaVA-Next's 81.8
- POPE: 87.4, the highest score among the models compared
- Robot manipulation: reportedly outperforms OpenVLA, an open-source vision-language-action model, on multiple tasks
Magma's versatility opens up a wide range of potential applications:

- Controlling robotic arms to grasp and move physical objects
- Navigating software interfaces, such as clicking buttons and filling out forms
- Assisting humans through live video feeds, from advising during a chess game to suggesting activities
The development of Magma involved collaboration between Microsoft and researchers from KAIST, the University of Maryland, the University of Wisconsin-Madison, and the University of Washington [2][3]. This collaborative effort highlights the importance of cross-institutional research in advancing AI technologies.
While Magma represents a significant advancement in AI capabilities, it also raises important considerations:

- The researchers themselves note that the identities and activities in their instructional videos are not representative of the global population's diversity
- Agentic AI could introduce cybersecurity vulnerabilities, for example through jailbreaks or injected malicious code
- The model still faces limitations in complex, multistep decision-making over time
Microsoft plans to release Magma's training and inference code on GitHub, allowing external researchers to build upon and verify the work [3]. This open approach may accelerate further developments in agentic AI and robotics integration.
As the field of AI continues to evolve rapidly, Magma represents a significant milestone in the journey towards more capable and versatile artificial intelligence systems. Its potential to bridge the gap between digital and physical interactions could have far-reaching implications for various industries and applications.
© 2025 TheOutpost.AI All rights reserved