Curated by THEOUTPOST
On Sun, 23 Mar, 12:01 AM UTC
2 Sources
[1]
New AI tool generates high-quality images faster than state-of-the-art approaches
But the generative AI techniques increasingly being used to produce such images have drawbacks. One popular type of model, called a diffusion model, can create stunningly realistic images but is too slow and computationally intensive for many applications. On the other hand, the autoregressive models that power LLMs like ChatGPT are much faster, but they produce poorer-quality images that are often riddled with errors.

Researchers from MIT and NVIDIA developed a new approach that brings together the best of both methods. Their hybrid image-generation tool uses an autoregressive model to quickly capture the big picture and then a small diffusion model to refine the details of the image. Their tool, known as HART (short for Hybrid Autoregressive Transformer), can generate images that match or exceed the quality of state-of-the-art diffusion models, while doing so about nine times faster. The generation process consumes fewer computational resources than typical diffusion models, enabling HART to run locally on a commercial laptop or smartphone. A user only needs to enter one natural language prompt into the HART interface to generate an image.

HART could have a wide range of applications, such as helping researchers train robots to complete complex real-world tasks and aiding designers in producing striking scenes for video games.

"If you are painting a landscape, and you just paint the entire canvas once, it might not look very good. But if you paint the big picture and then refine the image with smaller brush strokes, your painting could look a lot better. That is the basic idea with HART," says Haotian Tang PhD '25, co-lead author of a new paper on HART.
He is joined by co-lead author Yecheng Wu, an undergraduate student at Tsinghua University; senior author Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and a distinguished scientist of NVIDIA; as well as others at MIT, Tsinghua University, and NVIDIA. The research will be presented at the International Conference on Learning Representations.

The best of both worlds

Popular diffusion models, such as Stable Diffusion and DALL-E, are known to produce highly detailed images. These models generate images through an iterative process: they predict some amount of random noise on each pixel, subtract the noise, then repeat the predicting and "de-noising" steps multiple times until they generate a new image that is completely free of noise. Because the diffusion model de-noises all pixels in an image at each step, and there may be 30 or more steps, the process is slow and computationally expensive. But because the model has multiple chances to correct details it got wrong, the images are high-quality.

Autoregressive models, commonly used for predicting text, can generate images by predicting patches of an image sequentially, a few pixels at a time. They can't go back and correct their mistakes, but the sequential prediction process is much faster than diffusion. These models use representations known as tokens to make predictions. An autoregressive model uses an autoencoder to compress raw image pixels into discrete tokens and to reconstruct the image from predicted tokens. While this boosts the model's speed, the information loss that occurs during compression causes errors when the model generates a new image.

With HART, the researchers developed a hybrid approach that uses an autoregressive model to predict compressed, discrete image tokens, then a small diffusion model to predict residual tokens.
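The iterative de-noising loop described above can be sketched in a few lines. This is a toy illustration of the step structure only, not HART's or any real sampler's implementation: the `predict_noise` callable, image shape, and step count are illustrative assumptions, and real diffusion samplers use carefully derived noise schedules.

```python
import numpy as np

def diffusion_generate(predict_noise, shape=(64, 64, 3), num_steps=30, seed=0):
    """Toy sketch of a diffusion sampling loop: start from pure noise and
    repeatedly subtract a fraction of the model's predicted noise."""
    rng = np.random.default_rng(seed)
    image = rng.standard_normal(shape)              # begin with pure random noise
    for step in range(num_steps):                   # 30 or more de-noising passes
        noise_estimate = predict_noise(image, step)
        image = image - noise_estimate / num_steps  # remove part of the noise
    return image

# Stand-in "model" that estimates the remaining noise as the image itself,
# so each pass shrinks the noise toward a clean (here, all-zero) result.
result = diffusion_generate(lambda img, step: img, num_steps=30)
```

The key point for what follows is simply that every pass touches every pixel, which is why 30-plus passes get expensive.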
Residual tokens compensate for the model's information loss by capturing details left out by discrete tokens. "We can achieve a huge boost in terms of reconstruction quality. Our residual tokens learn high-frequency details, like edges of an object, or a person's hair, eyes, or mouth. These are places where discrete tokens can make mistakes," says Tang.

Because the diffusion model only predicts the remaining details after the autoregressive model has done its job, it can accomplish the task in eight steps, instead of the 30 or more a standard diffusion model requires to generate an entire image. This minimal overhead from the additional diffusion model lets HART retain the speed advantage of the autoregressive model while significantly enhancing its ability to generate intricate image details. "The diffusion model has an easier job to do, which leads to more efficiency," he adds.

Outperforming larger models

During the development of HART, the researchers found that integrating the diffusion model effectively was a challenge: incorporating it in the early stages of the autoregressive process caused errors to accumulate. Their final design, which applies the diffusion model only to predict residual tokens as the last step, significantly improved generation quality.

Their method, which combines an autoregressive transformer with 700 million parameters and a lightweight diffusion model with 37 million parameters, can generate images of the same quality as those created by a diffusion model with 2 billion parameters, but does so about nine times faster. It uses about 31 percent less computation than state-of-the-art models.

Moreover, because HART uses an autoregressive model to do the bulk of the work -- the same type of model that powers LLMs -- it is easier to integrate with the new class of unified vision-language generative models.
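The two-stage pipeline described above can be sketched as follows. This is a minimal structural sketch, not the released code: the function names, the token grid, and the stand-in models are all hypothetical placeholders; only the pipeline order (coarse autoregressive tokens first, a short residual-diffusion refinement second) comes from the article.

```python
import numpy as np

def hart_generate(ar_model, residual_diffusion, decoder, prompt, diffusion_steps=8):
    """Sketch of HART's hybrid pipeline: an autoregressive model predicts
    discrete tokens for the big picture, then a small diffusion model runs
    a short loop (8 steps instead of 30+) to predict residual tokens that
    restore high-frequency detail lost to discretization."""
    discrete_tokens = ar_model(prompt)                       # fast, sequential, coarse
    residual = residual_diffusion(discrete_tokens, diffusion_steps)
    return decoder(discrete_tokens + residual)               # combine, decode to pixels

# Toy stand-ins so the sketch runs end to end.
tokens = lambda prompt: np.ones((16, 16))          # coarse "discrete" token grid
refine = lambda t, steps: 0.1 * np.ones_like(t)    # small residual correction
decode = lambda t: t * 255.0 / t.max()             # map tokens to pixel range

image = hart_generate(tokens, refine, decode, "a parrot playing a bass guitar")
```

The design point is that the expensive iterative model only has to fill in a small residual, which is why eight steps suffice.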
In the future, one could interact with a unified vision-language generative model, perhaps by asking it to show the intermediate steps required to assemble a piece of furniture. "LLMs are a good interface for all sorts of models, like multimodal models and models that can reason. This is a way to push the intelligence to a new frontier. An efficient image-generation model would unlock a lot of possibilities," he says.

The researchers plan to pursue this direction by building vision-language models on top of the HART architecture. Since HART is scalable and generalizable to multiple modalities, they also want to apply it to video-generation and audio-prediction tasks.

This research was funded, in part, by the MIT-IBM Watson AI Lab, the MIT and Amazon Science Hub, the MIT AI Hardware Program, and the National Science Foundation. The GPU infrastructure for training this model was donated by NVIDIA.
[2]
I tested the future of AI image generation. It's astoundingly fast.
One of the core problems with AI is the notoriously high power and computing demand, especially for tasks such as media generation. On mobile phones, only a handful of pricey devices with powerful silicon can run such features natively. Even when implemented at scale in the cloud, it's a pricey affair. Nvidia may have quietly addressed that challenge in partnership with the folks over at the Massachusetts Institute of Technology and Tsinghua University.

The team created a hybrid AI image generation tool called HART (hybrid autoregressive transformer) that essentially combines two of the most widely used AI image creation techniques. The result is a blazing fast tool with dramatically lower compute requirements.

Just to give you an idea of how fast it is, I asked it to create an image of a parrot playing a bass guitar. It returned the following picture in just about a second. I could barely even follow the progress bar. When I put the same prompt to Google's Imagen 3 model in Gemini, it took roughly 9-10 seconds on a 200 Mbps internet connection.

A massive breakthrough

When AI images first started making waves, the diffusion technique was behind it all, powering products such as OpenAI's Dall-E image generator, Google's Imagen, and Stable Diffusion. This method can produce images with an extremely high level of detail. However, it is a multi-step approach to creating AI images, and as a result it is slow and computationally expensive. The second approach that has recently gained popularity is autoregressive models, which essentially work in the same fashion as chatbots and generate images using a pixel prediction technique. It is faster, but also more error-prone.
On-device demo for HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

The team at MIT fused both methods into a single package called HART. It relies on an autoregressive model to predict compressed image content as discrete tokens, while a small diffusion model handles the rest to compensate for the quality loss. The overall approach reduces the number of steps involved from over two dozen to eight. The experts behind HART claim that it can "generate images that match or exceed the quality of state-of-the-art diffusion models, but do so about nine times faster." HART combines a 700-million-parameter autoregressive model with a small 37-million-parameter diffusion model.

Solving the cost-computing crisis

Interestingly, this hybrid tool was able to create images that matched the quality of top-shelf models with a 2-billion-parameter capacity. Most importantly, HART achieved that milestone at a nine times faster image generation rate, while requiring 31% less computation. As per the team, the low-compute approach allows HART to run locally on phones and laptops, which is a huge win. So far, the most popular mass-market products such as ChatGPT and Gemini require an internet connection for image generation because the computing happens on cloud servers.

In the test video, the team showcased HART running natively on an MSI laptop with an Intel Core series processor and an Nvidia GeForce RTX graphics card. That's a combination you can find on a majority of gaming laptops out there, without spending a fortune. HART is capable of producing 1:1 aspect ratio images at a respectable 1024 x 1024 pixels resolution. The level of detail in these images is impressive, and so is the stylistic variation and scenery accuracy. During their tests, the team noted that the hybrid AI tool was anywhere from three to six times faster and offered over seven times higher throughput.
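As a quick sanity check, the headline figures reported for HART can be restated as ratios. The numbers below are taken straight from the article; nothing here is measured or implied beyond simple arithmetic.

```python
# Back-of-the-envelope comparison: HART pairs a 700M-parameter autoregressive
# model with a 37M-parameter diffusion model, against a 2B-parameter
# standalone diffusion baseline (figures as reported in the article).
hart_params = 700e6 + 37e6
baseline_params = 2e9
param_ratio = hart_params / baseline_params   # fraction of the baseline's size

step_ratio = 8 / 30                           # 8 diffusion steps vs. a typical 30+

print(f"HART parameter count: {param_ratio:.0%} of the 2B baseline")
print(f"Diffusion steps used: {step_ratio:.0%} of a standard sampler's")
```

In other words, HART carries roughly a third of the baseline's parameters and runs roughly a quarter of the usual de-noising steps, which is consistent with the reported speed and compute savings.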
The future potential is exciting, especially when integrating HART's image capabilities with language models. "In the future, one could interact with a unified vision-language generative model, perhaps by asking it to show the intermediate steps required to assemble a piece of furniture," says the team at MIT. They are already exploring that idea, and even plan to test the HART approach on audio and video generation. You can try it out on MIT's web dashboard.

Some rough edges

Before we dive into the quality debate, do keep in mind that HART is very much a research project that is still in its early stages. On the technical side, the team highlights a few hassles, such as overheads during the inference and training process. These challenges are minor in the bigger scheme of things, and considering the sheer benefits HART delivers in terms of computing efficiency, speed, and latency, they might persist without causing any major performance issues.

In my brief time prompt-testing HART, I was astonished by the pace of image generation. I barely ran into a scenario where the free web tool took more than two seconds to create an image. Even with prompts spanning three paragraphs (roughly 200 words), HART was able to create images that adhered tightly to the description. Aside from descriptive accuracy, there was plenty of detail in the images.

However, HART suffers from the typical failings of an AI image generator. It struggles with digits, basic depictions such as people eating food, character consistency, and perspective. Photorealism in a human context is one area where I noticed glaring failures. On a few occasions, it simply got the concept of basic objects wrong, like confusing a ring with a necklace. But overall, those errors were few and far between, and fundamentally expected.
A healthy bunch of AI tools still can't get that right, despite being out there for a while now. Overall, I am particularly excited by the immense potential of HART. It would be interesting to see whether MIT and Nvidia create a product out of it, or simply adopt the hybrid AI image generation approach in an existing product. Either way, it's a glimpse into a very promising future.
Researchers from MIT and NVIDIA have developed HART, a hybrid AI tool that combines autoregressive and diffusion models to generate high-quality images nine times faster than current state-of-the-art approaches, while using fewer computational resources.
Researchers from MIT and NVIDIA have unveiled HART (Hybrid Autoregressive Transformer), a groundbreaking AI tool that promises to revolutionize image generation. This innovative approach combines the strengths of two popular AI techniques to create high-quality images faster and more efficiently than current state-of-the-art models [1].
HART ingeniously merges the speed of autoregressive models with the quality of diffusion models. The hybrid approach uses an autoregressive model to quickly capture the big picture, followed by a small diffusion model to refine the details [1]. This combination allows HART to generate images that match or exceed the quality of state-of-the-art diffusion models, while being approximately nine times faster.
The HART model, which combines a 700 million parameter autoregressive transformer with a 37 million parameter lightweight diffusion model, can produce images of comparable quality to those created by a 2 billion parameter diffusion model [1]. This remarkable feat is achieved while using about 31% less computation than current leading models.
One of HART's most significant advantages is its ability to run locally on commercial laptops and smartphones, thanks to its reduced computational requirements [1]. This on-device capability opens up new possibilities for AI image generation in various applications, from mobile apps to gaming.
In practical tests, HART has demonstrated impressive speed and quality. Users reported generation times of just about a second for complex prompts, significantly outpacing other popular models like Google's Imagen 3 [2]. The tool can produce 1024x1024 pixel images with remarkable detail and adherence to prompts.
HART's capabilities extend beyond simple image generation. Researchers envision integrating it with language models to create unified vision-language generative models. This could lead to applications such as interactive guides for complex tasks, like furniture assembly [1].
While HART represents a significant advancement, it still faces some challenges. The researchers noted minor overheads during inference and training processes. Additionally, like other AI image generators, HART occasionally struggles with certain elements such as digits, perspective, and photorealism in human contexts [2].
HART's development addresses one of the core challenges in AI: the high power and computing demands of media generation tasks. By significantly reducing the computational resources required while maintaining high-quality output, HART could pave the way for more widespread adoption of AI image generation technologies across various devices and platforms [2].
© 2025 TheOutpost.AI All rights reserved