On Tue, 29 Oct, 12:01 AM UTC
6 Sources
[1]
A faster, better way to train general-purpose robots
Caption: A figure shows how the new technique aligns data from varied domains, like simulation and real robots, and multiple modalities, including vision sensors and robotic arm position encoders, into a shared "language" that a generative AI model can process.

In the classic cartoon "The Jetsons," Rosie the robotic maid seamlessly switches from vacuuming the house to cooking dinner to taking out the trash. But in real life, training a general-purpose robot remains a major challenge. Typically, engineers collect data that are specific to a certain robot and task, which they use to train the robot in a controlled environment. However, gathering these data is costly and time-consuming, and the robot will likely struggle to adapt to environments or tasks it hasn't seen before.

To train better general-purpose robots, MIT researchers developed a versatile technique that combines a huge amount of heterogeneous data from many sources into one system that can teach any robot a wide range of tasks. Their method involves aligning data from varied domains, like simulations and real robots, and multiple modalities, including vision sensors and robotic arm position encoders, into a shared "language" that a generative AI model can process.

By combining such an enormous amount of data, this approach can be used to train a robot to perform a variety of tasks without the need to start training it from scratch each time. This method could be faster and less expensive than traditional techniques because it requires far fewer task-specific data. In addition, it outperformed training from scratch by more than 20 percent in simulation and real-world experiments.

"In robotics, people often claim that we don't have enough training data. But in my view, another big problem is that the data come from so many different domains, modalities, and robot hardware. Our work shows how you'd be able to train a robot with all of them put together," says Lirui Wang, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on this technique.

Wang's co-authors include fellow EECS graduate student Jialiang Zhao; Xinlei Chen, a research scientist at Meta; and senior author Kaiming He, an associate professor in EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). The research will be presented at the Conference on Neural Information Processing Systems.

Inspired by LLMs

A robotic "policy" takes in sensor observations, like camera images or proprioceptive measurements that track the speed and position of a robotic arm, and then tells a robot how and where to move. Policies are typically trained using imitation learning, meaning a human demonstrates actions or teleoperates a robot to generate data, which are fed into an AI model that learns the policy. Because this method uses a small amount of task-specific data, robots often fail when their environment or task changes.

To develop a better approach, Wang and his collaborators drew inspiration from large language models like GPT-4. These models are pretrained using an enormous amount of diverse language data and then fine-tuned by feeding them a small amount of task-specific data. Pretraining on so much data helps the models adapt to perform well on a variety of tasks.

"In the language domain, the data are all just sentences. In robotics, given all the heterogeneity in the data, if you want to pretrain in a similar manner, we need a different architecture," he says.
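To ground the imitation-learning setup described above, the following is a minimal behavior-cloning sketch: a policy network maps camera features and proprioceptive state to actions and is regressed toward demonstrated actions. The network sizes, feature dimensions, and random stand-in data are illustrative assumptions, not the researchers' implementation.

```python
# Minimal behavior-cloning sketch: a policy maps observations to actions and is
# trained to match demonstrated (expert) actions. All dimensions and data here
# are illustrative stand-ins, not the MIT team's code.
import torch
import torch.nn as nn

class SimplePolicy(nn.Module):
    """Maps encoded camera features and proprioceptive state to a robot action."""
    def __init__(self, image_dim=512, proprio_dim=7, action_dim=7):
        super().__init__()
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, 256), nn.ReLU())
        self.proprio_encoder = nn.Sequential(nn.Linear(proprio_dim, 256), nn.ReLU())
        self.head = nn.Linear(512, action_dim)

    def forward(self, image_features, proprio):
        fused = torch.cat([self.image_encoder(image_features),
                           self.proprio_encoder(proprio)], dim=-1)
        return self.head(fused)

policy = SimplePolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# One gradient step on a (hypothetical) batch of teleoperated demonstrations.
image_features = torch.randn(32, 512)   # stand-in for encoded camera frames
proprio = torch.randn(32, 7)            # joint positions and velocities
expert_actions = torch.randn(32, 7)     # actions recorded from the human demo

loss = loss_fn(policy(image_features, proprio), expert_actions)
loss.backward()
optimizer.step()
```

Because a policy like this sees only a narrow slice of task-specific demonstrations, it tends to break when the environment or task shifts, which is exactly the limitation the MIT approach targets.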
Robotic data take many forms, from camera images to language instructions to depth maps. At the same time, each robot is mechanically unique, with a different number and orientation of arms, grippers, and sensors. Plus, the environments where data are collected vary widely.

The MIT researchers developed a new architecture called Heterogeneous Pretrained Transformers (HPT) that unifies data from these varied modalities and domains. They put a machine-learning model known as a transformer into the middle of their architecture, which processes vision and proprioception inputs. A transformer is the same type of model that forms the backbone of large language models.

The researchers align data from vision and proprioception into the same type of input, called a token, which the transformer can process. Each input is represented with the same fixed number of tokens. Then the transformer maps all inputs into one shared space, growing into a huge, pretrained model as it processes and learns from more data. The larger the transformer becomes, the better it will perform.

A user only needs to feed HPT a small amount of data on their robot's design, setup, and the task they want it to perform. Then HPT transfers the knowledge the transformer gained during pretraining to learn the new task.

Enabling dexterous motions

One of the biggest challenges of developing HPT was building the massive dataset to pretrain the transformer, which included 52 datasets with more than 200,000 robot trajectories in four categories, including human demo videos and simulation. The researchers also needed to develop an efficient way to turn raw proprioception signals from an array of sensors into data the transformer could handle.

"Proprioception is key to enable a lot of dexterous motions. Because the number of tokens in our architecture is always the same, we place the same importance on proprioception and vision," Wang explains.

When they tested HPT, it improved robot performance by more than 20 percent on simulation and real-world tasks, compared with training from scratch each time. Even when the task was very different from the pretraining data, HPT still improved performance.

"This paper provides a novel approach to training a single policy across multiple robot embodiments. This enables training across diverse datasets, enabling robot learning methods to significantly scale up the size of datasets that they can train on. It also allows the model to quickly adapt to new robot embodiments, which is important as new robot designs are continuously being produced," says David Held, associate professor at the Carnegie Mellon University Robotics Institute, who was not involved with this work.

In the future, the researchers want to study how data diversity could boost the performance of HPT. They also want to enhance HPT so it can process unlabeled data like GPT-4 and other large language models.

"Our dream is to have a universal robot brain that you could download and use for your robot without any training at all. While we are just in the early stages, we are going to keep pushing hard and hope scaling leads to a breakthrough in robotic policies, like it did with large language models," he says.

This work was funded, in part, by the Amazon Greater Boston Tech Initiative and the Toyota Research Institute.
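As a rough illustration of the architecture described above, the sketch below shows an HPT-style model under stated assumptions: each modality is projected into the same fixed number of tokens, a shared transformer trunk processes them, and a small per-robot head decodes actions. Module names, token counts, and layer sizes are invented for the example and are not the published implementation.

```python
# Hedged sketch of the HPT idea: per-modality "stems" produce a fixed number of
# tokens, a shared transformer trunk processes them, and a per-robot head maps
# the pooled features to actions. Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class ModalityStem(nn.Module):
    """Projects one modality (e.g. vision features or proprioception) into a
    fixed number of tokens so every input looks the same to the shared trunk."""
    def __init__(self, input_dim, num_tokens=16, d_model=256):
        super().__init__()
        self.proj = nn.Linear(input_dim, num_tokens * d_model)
        self.num_tokens, self.d_model = num_tokens, d_model

    def forward(self, x):                        # x: (batch, input_dim)
        return self.proj(x).view(-1, self.num_tokens, self.d_model)

class SharedTrunk(nn.Module):
    """Transformer trunk intended to be shared across robots, datasets, and modalities."""
    def __init__(self, d_model=256, nhead=8, num_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, tokens):                   # tokens: (batch, seq, d_model)
        return self.encoder(tokens).mean(dim=1)  # pooled shared representation

class RobotHead(nn.Module):
    """Small per-embodiment head that decodes shared features into actions."""
    def __init__(self, d_model=256, action_dim=7):
        super().__init__()
        self.fc = nn.Linear(d_model, action_dim)

    def forward(self, features):
        return self.fc(features)

# Vision and proprioception each get the same token budget.
vision_stem = ModalityStem(input_dim=512)
proprio_stem = ModalityStem(input_dim=7)
trunk = SharedTrunk()
head = RobotHead()

vision = torch.randn(8, 512)
proprio = torch.randn(8, 7)
tokens = torch.cat([vision_stem(vision), proprio_stem(proprio)], dim=1)
actions = head(trunk(tokens))   # (8, 7) predicted robot actions
```

During pretraining, a trunk like this would be shared across many datasets and robots, with per-robot stems and heads handling the differing sensors and action spaces.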
[2]
A faster, better way to train general-purpose robots
In the classic cartoon "The Jetsons," Rosie the robotic maid seamlessly switches from vacuuming the house to cooking dinner to taking out the trash. But in real life, training a general-purpose robot remains a major challenge. Typically, engineers collect data that are specific to a certain robot and task, which they use to train the robot in a controlled environment. However, gathering these data is costly and time-consuming, and the robot will likely struggle to adapt to environments or tasks it hasn't seen before. To train better general-purpose robots, MIT researchers developed a versatile technique that combines a huge amount of heterogeneous data from many sources into one system that can teach any robot a wide range of tasks. Their method involves aligning data from varied domains, like simulations and real robots, and multiple modalities, including vision sensors and robotic arm position encoders, into a shared "language" that a generative AI model can process. By combining such an enormous amount of data, this approach can be used to train a robot to perform a variety of tasks without the need to start training it from scratch each time. This method could be faster and less expensive than traditional techniques because it requires far fewer task-specific data. In addition, it outperformed training from scratch by more than 20 percent in simulation and real-world experiments. "In robotics, people often claim that we don't have enough training data. But in my view, another big problem is that the data come from so many different domains, modalities, and robot hardware. Our work shows how you'd be able to train a robot with all of them put together," says Lirui Wang, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on this technique. Wang's co-authors include fellow EECS graduate student Jialiang Zhao; Xinlei Chen, a research scientist at Meta; and senior author Kaiming He, an associate professor in EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). The research will be presented at the Conference on Neural Information Processing Systems.

Inspired by LLMs

A robotic "policy" takes in sensor observations, like camera images or proprioceptive measurements that track the speed and position of a robotic arm, and then tells a robot how and where to move. Policies are typically trained using imitation learning, meaning a human demonstrates actions or teleoperates a robot to generate data, which are fed into an AI model that learns the policy. Because this method uses a small amount of task-specific data, robots often fail when their environment or task changes. To develop a better approach, Wang and his collaborators drew inspiration from large language models like GPT-4. These models are pretrained using an enormous amount of diverse language data and then fine-tuned by feeding them a small amount of task-specific data. Pretraining on so much data helps the models adapt to perform well on a variety of tasks. "In the language domain, the data are all just sentences. In robotics, given all the heterogeneity in the data, if you want to pretrain in a similar manner, we need a different architecture," he says. Robotic data take many forms, from camera images to language instructions to depth maps. At the same time, each robot is mechanically unique, with a different number and orientation of arms, grippers, and sensors. Plus, the environments where data are collected vary widely.
The MIT researchers developed a new architecture called Heterogeneous Pretrained Transformers (HPT) that unifies data from these varied modalities and domains. They put a machine-learning model known as a transformer into the middle of their architecture, which processes vision and proprioception inputs. A transformer is the same type of model that forms the backbone of large language models. The researchers align data from vision and proprioception into the same type of input, called a token, which the transformer can process. Each input is represented with the same fixed number of tokens. Then the transformer maps all inputs into one shared space, growing into a huge, pretrained model as it processes and learns from more data. The larger the transformer becomes, the better it will perform. A user only needs to feed HPT a small amount of data on their robot's design, setup, and the task they want it to perform. Then HPT transfers the knowledge the transformer gained during pretraining to learn the new task.

Enabling dexterous motions

One of the biggest challenges of developing HPT was building the massive dataset to pretrain the transformer, which included 52 datasets with more than 200,000 robot trajectories in four categories, including human demo videos and simulation. The researchers also needed to develop an efficient way to turn raw proprioception signals from an array of sensors into data the transformer could handle. "Proprioception is key to enable a lot of dexterous motions. Because the number of tokens in our architecture is always the same, we place the same importance on proprioception and vision," Wang explains. When they tested HPT, it improved robot performance by more than 20 percent on simulation and real-world tasks, compared with training from scratch each time. Even when the task was very different from the pretraining data, HPT still improved performance. In the future, the researchers want to study how data diversity could boost the performance of HPT. They also want to enhance HPT so it can process unlabeled data like GPT-4 and other large language models. "Our dream is to have a universal robot brain that you could download and use for your robot without any training at all. While we are just in the early stages, we are going to keep pushing hard and hope scaling leads to a breakthrough in robotic policies, like it did with large language models," he says.
[3]
A faster, better way to train general-purpose robots: New technique pools diverse data
In the classic cartoon "The Jetsons," Rosie the robotic maid seamlessly switches from vacuuming the house to cooking dinner to taking out the trash. But in real life, training a general-purpose robot remains a major challenge. Typically, engineers collect data that are specific to a certain robot and task, which they use to train the robot in a controlled environment. However, gathering these data is costly and time-consuming, and the robot will likely struggle to adapt to environments or tasks it hasn't seen before. To train better general-purpose robots, MIT researchers developed a versatile technique that combines a huge amount of heterogeneous data from many sources into one system that can teach any robot a wide range of tasks. Their method involves aligning data from varied domains, like simulations and real robots, and multiple modalities, including vision sensors and robotic arm position encoders, into a shared "language" that a generative AI model can process. The work is published on the arXiv preprint server. By combining such an enormous amount of data, this approach can be used to train a robot to perform a variety of tasks without the need to start training it from scratch each time. This method could be faster and less expensive than traditional techniques because it requires far fewer task-specific data. In addition, it outperformed training from scratch by more than 20% in simulation and real-world experiments. "In robotics, people often claim that we don't have enough training data. But in my view, another big problem is that the data come from so many different domains, modalities, and robot hardware. Our work shows how you'd be able to train a robot with all of them put together," says Lirui Wang, an electrical engineering and computer science (EECS) graduate student and lead author of the paper on this technique. Wang's co-authors include fellow EECS graduate student Jialiang Zhao; Xinlei Chen, a research scientist at Meta; and senior author Kaiming He, an associate professor in EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). The research will be presented at the Conference on Neural Information Processing Systems, held 10-15 December at the Vancouver Convention Center.

Inspired by LLMs

A robotic "policy" takes in sensor observations, like camera images or proprioceptive measurements that track the speed and position of a robotic arm, and then tells a robot how and where to move. Policies are typically trained using imitation learning, meaning a human demonstrates actions or teleoperates a robot to generate data, which are fed into an AI model that learns the policy. Because this method uses a small amount of task-specific data, robots often fail when their environment or task changes. To develop a better approach, Wang and his collaborators drew inspiration from large language models like GPT-4. These models are pretrained using an enormous amount of diverse language data and then fine-tuned by feeding them a small amount of task-specific data. Pretraining on so much data helps the models adapt to perform well on a variety of tasks. "In the language domain, the data are all just sentences. In robotics, given all the heterogeneity in the data, if you want to pretrain in a similar manner, we need a different architecture," he says. Robotic data take many forms, from camera images to language instructions to depth maps.
At the same time, each robot is mechanically unique, with a different number and orientation of arms, grippers, and sensors. Plus, the environments where data are collected vary widely. The MIT researchers developed a new architecture called Heterogeneous Pretrained Transformers (HPT) that unifies data from these varied modalities and domains. They put a machine-learning model known as a transformer into the middle of their architecture, which processes vision and proprioception inputs. A transformer is the same type of model that forms the backbone of large language models. The researchers align data from vision and proprioception into the same type of input, called a token, which the transformer can process. Each input is represented with the same fixed number of tokens. Then the transformer maps all inputs into one shared space, growing into a huge, pretrained model as it processes and learns from more data. The larger the transformer becomes, the better it will perform. A user only needs to feed HPT a small amount of data on their robot's design, setup, and the task they want it to perform. Then HPT transfers the knowledge the transformer gained during pretraining to learn the new task.

Enabling dexterous motions

One of the biggest challenges of developing HPT was building the massive dataset to pretrain the transformer, which included 52 datasets with more than 200,000 robot trajectories in four categories, including human demo videos and simulation. The researchers also needed to develop an efficient way to turn raw proprioception signals from an array of sensors into data the transformer could handle. "Proprioception is key to enable a lot of dexterous motions. Because the number of tokens in our architecture is always the same, we place the same importance on proprioception and vision," Wang explains. When they tested HPT, it improved robot performance by more than 20% on simulation and real-world tasks, compared with training from scratch each time. Even when the task was very different from the pretraining data, HPT still improved performance. "This paper provides a novel approach to training a single policy across multiple robot embodiments. This enables training across diverse datasets, enabling robot learning methods to significantly scale up the size of datasets that they can train on. It also allows the model to quickly adapt to new robot embodiments, which is important as new robot designs are continuously being produced," says David Held, associate professor at the Carnegie Mellon University Robotics Institute, who was not involved with this work. In the future, the researchers want to study how data diversity could boost the performance of HPT. They also want to enhance HPT so it can process unlabeled data like GPT-4 and other large language models. "Our dream is to have a universal robot brain that you could download and use for your robot without any training at all. While we are just in the early stages, we are going to keep pushing hard and hope scaling leads to a breakthrough in robotic policies, like it did with large language models," he says.
[4]
MIT researchers develop new approach for training general purpose robots
What just happened? Researchers at the Massachusetts Institute of Technology (MIT) have developed a new approach to train general-purpose robots, drawing inspiration from the success of large language models like GPT-4. Called the Heterogeneous Pretrained Transformers (HPT), this approach allows robots to learn and adapt to a wide range of tasks - something that has been difficult to date. The research could lead to a future where robots are not just specialized tools but flexible assistants that can quickly learn new skills and adapt to changing circumstances, becoming truly general-purpose robotic assistants. Traditionally, robot training has been a time-consuming and costly process, requiring engineers to collect specific data for each robot and task in controlled environments. As a result, robots would struggle to adapt to new situations or unexpected obstacles. The MIT team's new technique combines large amounts of heterogeneous data from various sources into a single system capable of teaching robots a wide array of tasks. At the heart of the HPT architecture is a transformer, a type of neural network that processes inputs from various sensors, including vision and proprioception data, and creates a shared "language" that the AI model can understand and learn from. "In robotics, people often claim that we don't have enough training data. But in my view, another big problem is that the data come from so many different domains, modalities, and robot hardware," said Lirui Wang, the lead author of the study and an electrical engineering and computer science (EECS) graduate student at MIT. "Our work shows how you'd be able to train a robot with all of them put together." Wang's co-authors include fellow EECS graduate student Jialiang Zhao, Meta research scientist Xinlei Chen, and senior author Kaiming He, an associate professor in EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). The research will be presented at the Conference on Neural Information Processing Systems. One of the key advantages of the HPT approach is its ability to leverage a massive dataset for pretraining. The researchers compiled a dataset consisting of 52 datasets with over 200,000 robot trajectories across four categories, including human demonstration videos and simulations. This pretraining allows the system to transfer knowledge effectively when learning new tasks, requiring only a small amount of task-specific data for fine-tuning. In both simulated and real-world tasks, the HPT method outperformed traditional training-from-scratch approaches by more than 20 percent. The HPT system still demonstrated improved performance even when faced with tasks significantly different from the pretraining data. "This paper provides a novel approach to training a single policy across multiple robot embodiments," said David Held, an associate professor at Carnegie Mellon University's Robotics Institute who was not involved in the study. "This enables training across diverse datasets, enabling robot learning methods to significantly scale up the size of datasets that they can train on. It also allows the model to quickly adapt to new robot embodiments, which is important as new robot designs are continuously being produced." The MIT researchers aim to enhance the HPT system by exploring how data diversity can boost its performance.
They also plan to extend the system's capabilities to process unlabeled data, similar to how large language models like GPT-4 operate. Wang and his colleagues have set an ambitious goal for the future of this technology. "Our dream is to have a universal robot brain that you could download and use for your robot without any training at all," Wang explained. "While we are just in the early stages, we are going to keep pushing hard and hope scaling leads to a breakthrough in robotic policies, like it did with large language models." The Amazon Greater Boston Tech Initiative and the Toyota Research Institute partially funded this research.
[5]
MIT to Train New Skills to Robots Using Generative AI Technology
Researchers looked into GPT-4 architecture to develop the technique

Massachusetts Institute of Technology (MIT) unveiled a new method to train robots last week that uses generative artificial intelligence (AI) models. The new technique relies on combining data across different domains and modalities and unifying them into a shared language which can then be processed by large language models (LLMs). MIT researchers claim that this method can give rise to general-purpose robots that can handle a wide range of tasks without needing to individually train each skill from scratch.

In a newsroom post, MIT detailed the novel methodology to train robots. Currently, teaching a certain task to a robot is a difficult proposition, as a large amount of simulation and real-world data is required. This is necessary because if the robot does not understand how to perform the task in a given environment, it will struggle to adapt to it. This means that for every new task, new sets of data covering every simulation and real-world scenario are needed. The robot then undergoes a training period where the actions are optimised and errors and glitches are removed. As a result, robots are generally trained on a specific task, and the multi-purpose robots seen in science fiction movies have not been seen in reality.

However, a new technique developed by researchers at MIT claims to bypass this challenge. In a paper published on the arXiv preprint server (note: it has not been peer-reviewed), the scientists highlighted that generative AI can assist with this problem. For this, data across different domains, such as simulations and real robots, and different modalities, such as vision sensors and robotic arm position encoders, were unified into a shared language that can be processed by an AI model. A new architecture dubbed Heterogeneous Pretrained Transformers (HPT) was also developed to unify the data.

Interestingly, the lead author of the study, Lirui Wang, an electrical engineering and computer science (EECS) graduate student, said that the inspiration for this technique was drawn from AI models such as OpenAI's GPT-4. The researchers placed a transformer (the same architecture that underlies GPT-style models) in the middle of their system, where it processes both vision and proprioception (sense of self-movement, force, and position) inputs.

The MIT researchers state that this new method could be faster and less expensive for training robots compared to traditional methods, largely because a smaller amount of task-specific data is required to train the robot on various tasks. Further, the study found that this method outperformed training from scratch by more than 20 percent in both simulation and real-world experiments.
[6]
MIT Develops Innovative Generative AI Techniques for Training General-Purpose Robots
MIT has introduced a groundbreaking AI-based training method that enables robots to learn versatile skills

The Massachusetts Institute of Technology (MIT) has unveiled a pioneering method for training robots that leverages generative artificial intelligence (AI) models. This approach, detailed in a recent announcement, focuses on integrating data from diverse domains and modalities, creating a shared language that large language models (LLMs) can process. The researchers assert that this technique can facilitate the development of general-purpose robots capable of performing a wide array of tasks without the need for extensive individual training for each skill.

The difficulty with current robot training is that it requires large amounts of simulated and real-world data, which takes a long time to collect. During training, a robot learns to perform a task in a specific environment, and that knowledge transfers poorly when new tasks arise. As a consequence, every new task has demanded a fresh set of data covering the likely range of simulated and real-world conditions, and training has typically involved a long period of trial and correction in which incorrect actions are gradually refined. So far, robots have largely remained single-purpose devices, never approaching the multifunctional abilities of machines from science fiction.

Nonetheless, researchers from MIT offer a new technique that can help. In a report shared on the arXiv preprint server, the scientists described how generative AI can make robot training faster and more efficient. Their method combines information from multiple sources, including simulated environments and real robots, and multiple input types, such as vision sensors and robotic arm position encoders. To support this unification, they also developed a new architecture, which they termed Heterogeneous Pretrained Transformers (HPT).

According to Lirui Wang, the primary author of the paper and a graduate student in electrical engineering and computer science (EECS), the basic idea behind this method was inspired by large language models such as OpenAI's GPT-4. Placing a transformer, the model architecture behind LLMs, at the center of their setup allowed it to process both vision and proprioception inputs; proprioception, the sense of self-movement, force, and position, is critical to how a robot moves and locates itself.

The proposed method could have substantial implications. Results of the study suggest that it is quicker to deploy and more cost-effective for training robots than traditional methods, since less task-specific data is needed, allowing robots to be trained on more tasks efficiently. Additionally, the technique outperformed training from scratch by more than 20 percent in both simulation and real-world experiments. This is a major advancement toward building advanced robots that can be deployed across a multitude of functions and conditions.
MIT researchers have created a new method called Heterogeneous Pretrained Transformers (HPT) that uses generative AI to train robots for multiple tasks more efficiently, potentially revolutionizing the field of robotics.
Researchers at the Massachusetts Institute of Technology (MIT) have developed a groundbreaking technique for training general-purpose robots, potentially revolutionizing the field of robotics. The new method, called Heterogeneous Pretrained Transformers (HPT), draws inspiration from large language models like GPT-4 and aims to create more versatile and adaptable robotic systems [1][2].
Traditionally, training robots has been a time-consuming and expensive process. Engineers typically collect data specific to a particular robot and task, which is then used to train the robot in a controlled environment. This approach has several limitations:
- Gathering task- and robot-specific data is costly and slow.
- Robots trained this way often fail when their environment or task changes.
- Training must effectively restart from scratch for each new robot and task.
MIT's new technique addresses these challenges by combining a vast amount of heterogeneous data from various sources into a single system capable of teaching robots a wide range of tasks [3]. Key aspects of the HPT approach include:
- Aligning data from varied domains, such as simulations and real robots, and multiple modalities, such as vision sensors and robotic arm position encoders.
- Converting all inputs into a shared "language" of tokens that a transformer, the same type of model behind large language models, can process.
- Growing the pretrained model as it processes and learns from more data, with performance improving as the transformer scales.
The researchers, led by Lirui Wang, drew inspiration from the success of large language models like GPT-4 [4]. These models are pretrained on enormous amounts of diverse language data and then fine-tuned for specific tasks. The HPT architecture adapts this concept to robotics by:
- Pretraining a shared transformer trunk on a large, heterogeneous corpus of robot data.
- Representing every modality with the same fixed number of tokens so the trunk can process them in a common space.
- Fine-tuning with only a small amount of data about a new robot's design, setup, and target task (a minimal sketch of this recipe follows this list).
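Here is a hedged sketch of that pretrain-then-fine-tune recipe: a trunk that has (hypothetically) been pretrained on heterogeneous robot data is frozen, and only a small robot-specific stem and head are trained on a modest batch of task-specific demonstrations. The checkpoint name, dimensions, and data below are assumptions for illustration, not the released HPT code.

```python
# Hedged sketch: keep a pretrained shared trunk frozen and adapt a new robot by
# training only a small input stem and action head on a little task-specific data.
import torch
import torch.nn as nn

# Stand-in for a trunk pretrained on heterogeneous robot data.
trunk = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
# trunk.load_state_dict(torch.load("hpt_trunk.pt"))  # hypothetical checkpoint

# New embodiment: fresh stem (its sensor layout differs) and fresh action head.
new_stem = nn.Linear(19, 256)   # e.g. the new robot exposes 19 proprioceptive dims
new_head = nn.Linear(256, 6)    # e.g. a 6-DoF arm

# Freeze the shared trunk; only the small new modules are updated.
for p in trunk.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(
    list(new_stem.parameters()) + list(new_head.parameters()), lr=3e-4)
loss_fn = nn.MSELoss()

# A handful of task-specific demonstrations is enough to adapt in this sketch.
obs = torch.randn(64, 19)
expert_actions = torch.randn(64, 6)
for _ in range(100):
    optimizer.zero_grad()
    pred = new_head(trunk(new_stem(obs)))
    loss = loss_fn(pred, expert_actions)
    loss.backward()
    optimizer.step()
```

The design mirrors the article's claim that a user only supplies a small amount of data about their robot's design, setup, and task, while the pretrained trunk carries over what it learned from the heterogeneous corpus.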
The HPT approach offers several benefits over traditional robot training techniques:
- Faster and less expensive training, since far fewer task-specific data are needed.
- More than 20 percent better performance than training from scratch in both simulation and real-world experiments.
- Improved performance even when the target task differs substantially from the pretraining data.
While developing HPT, the researchers faced several challenges:
- Building the massive pretraining dataset, which combined 52 datasets with more than 200,000 robot trajectories across four categories, including human demo videos and simulation (a minimal illustration of pooling such sources follows this list).
- Developing an efficient way to turn raw proprioception signals from an array of sensors into data the transformer could handle.
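As mentioned in the first challenge above, pretraining required pooling many disparate sources into one stream. The sketch below shows one hedged way to do that with standard PyTorch utilities: each hypothetical source is wrapped so its samples come out in a common (tokens, actions) format, then the sources are concatenated and sampled together. Dataset sizes, dimensions, and the per-source tokenizers are invented for illustration and do not describe the released HPT data.

```python
# Hedged sketch of mixing heterogeneous sources (real robot logs, simulation,
# human demo videos) into one pretraining stream with a common sample format.
import torch
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class TrajectoryDataset(Dataset):
    """Wraps one data source so every sample comes out as (tokens, actions)."""
    def __init__(self, num_samples, obs_dim, action_dim=7, d_model=256, num_tokens=32):
        self.obs = torch.randn(num_samples, obs_dim)         # stand-in observations
        self.actions = torch.randn(num_samples, action_dim)  # stand-in actions
        self.proj = torch.nn.Linear(obs_dim, num_tokens * d_model)  # per-source tokenizer
        self.num_tokens, self.d_model = num_tokens, d_model

    def __len__(self):
        return len(self.obs)

    def __getitem__(self, i):
        with torch.no_grad():  # tokenization here is just a fixed projection
            tokens = self.proj(self.obs[i]).view(self.num_tokens, self.d_model)
        return tokens, self.actions[i]

# Three hypothetical sources with different sensor dimensions but one token format.
sources = [
    TrajectoryDataset(1000, obs_dim=14),    # e.g. a real arm's joint states
    TrajectoryDataset(5000, obs_dim=32),    # e.g. simulator states
    TrajectoryDataset(2000, obs_dim=512),   # e.g. features from human demo videos
]
mixed = ConcatDataset(sources)
loader = DataLoader(mixed, batch_size=64, shuffle=True)

tokens, actions = next(iter(loader))
print(tokens.shape, actions.shape)   # (64, 32, 256), (64, 7)
```

In practice, each of the 52 datasets would need its own conversion from raw logs, videos, or simulator states into this common token format before they could be sampled together like this.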
The team aims to further enhance HPT by:
- Studying how data diversity could boost its performance.
- Enabling it to process unlabeled data, as GPT-4 and other large language models do.
The development of HPT could lead to more flexible and adaptable robots capable of quickly learning new skills and adjusting to changing circumstances. This breakthrough brings us closer to the vision of truly general-purpose robotic assistants, potentially transforming industries and everyday life [5].
As research continues, the MIT team dreams of creating a "universal robot brain" that could be downloaded and used for any robot without additional training, marking a significant step towards more intelligent and versatile robotic systems [4].
MIT researchers develop LucidSim, a novel system using generative AI and physics simulators to train robots in virtual environments, significantly improving their real-world performance in navigation and obstacle traversal.
2 Sources
Physical Intelligence, a San Francisco startup, has developed π0 (pi-zero), a generalist AI model for robotics that enables various robots to perform a wide range of household tasks with remarkable dexterity and adaptability.
2 Sources
The Genesis Project, an open-source simulation platform, is transforming robotics training by enabling ultra-fast, AI-powered virtual environments for robot learning and development.
6 Sources
Figure AI unveils Helix, an advanced Vision-Language-Action model that enables humanoid robots to perform complex tasks, understand natural language, and collaborate effectively, marking a significant leap in robotics technology.
9 Sources
NVIDIA introduces a three-computer solution to advance physical AI and robotics, combining training, simulation, and runtime systems to revolutionize industries from manufacturing to smart cities.
2 Sources