2 Sources
[1]
World's largest open-source multimodal dataset delivers 17x training efficiency, unlocking enterprise AI that connects documents, audio and video
AI models are only as good as the data they're trained on. That data generally needs to be labeled, curated and organized before models can learn from it effectively. One of the big missing links in the AI ecosystem has been a large, high-quality open-source multimodal dataset. That changes today with the debut of the EMM-1 dataset, which comprises 1 billion data pairs and 100 million data groups across five modalities: text, image, video, audio and 3D point clouds.

Multimodal datasets combine different types of data that AI systems can process together, mirroring how humans perceive the world using multiple senses simultaneously. These datasets enable AI systems to make richer inferences by understanding relationships across data types, rather than processing each modality in isolation.

EMM-1 is developed by data labeling platform vendor Encord. The company's platform enables teams to curate, label and manage training data at scale using both automated and human-in-the-loop workflows.

Alongside the new dataset, Encord developed the EBind training methodology, which prioritizes data quality over raw computational scale. The approach enabled a compact 1.8 billion parameter model to match the performance of models up to 17 times larger while slashing training time from days to hours, on a single GPU rather than GPU clusters.

"The big trick for us was to really focus on the data and to make the data very, very high quality," Encord co-founder and CEO Eric Landau told VentureBeat in an exclusive interview. "We were able to get to the same level of performance as models 20 times larger, not because we were super clever on the architecture, but because we trained it with really good data overall."

The data quality advantage

Encord's dataset is 100 times larger than the next comparable multimodal dataset, according to Landau. It operates at petabyte scale with terabytes of raw data and over 1 million human annotations. But scale alone doesn't explain the performance gains.

The technical innovation centers on addressing what Landau calls an "under-appreciated" problem in AI training: data leakage between training and evaluation sets.

"The leakage problem was one which we spent a lot of time on," Landau explained. "In a lot of data sets, there is a kind of leakage between different subsets of the data. Leakage actually boosts your results. It makes your evaluations look better. But it's one thing that we were quite diligent about."

Data leakage occurs when information from test data inadvertently appears in training data, artificially inflating model performance metrics. Many benchmark datasets suffer from this contamination. Encord deployed hierarchical clustering techniques to ensure clean separation while maintaining representative distribution across data types. The company also used clustering to address bias and ensure diverse representation.
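To make the clustering idea concrete, here is a minimal sketch of cluster-aware splitting, assuming precomputed item embeddings and scikit-learn. It illustrates the general technique the article describes, not Encord's actual pipeline; the embeddings and cluster count are placeholders.

```python
# Minimal sketch: split by cluster, not by item, so near-duplicates
# can never straddle the train/eval boundary.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(2_000, 64))  # stand-in for real item embeddings

# Hierarchical clustering groups near-duplicate items under one cluster ID.
cluster_ids = AgglomerativeClustering(n_clusters=200).fit_predict(embeddings)

# Assign whole clusters to train or eval, never split a cluster across both.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.1, random_state=0)
train_idx, eval_idx = next(splitter.split(embeddings, groups=cluster_ids))

# Sanity check: no cluster appears on both sides of the split.
assert not set(cluster_ids[train_idx]) & set(cluster_ids[eval_idx])
```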
How EBind boosts efficiency

The data quality improvements work in tandem with an architectural approach designed for efficiency. Encord's EBind extends the CLIP (Contrastive Language-Image Pre-training) approach, originally developed by OpenAI, from two modalities to five. Where CLIP learns to associate images and text in a shared representation space, enabling tasks like searching for images using text descriptions, EBind does the same across images, text, audio, 3D point clouds and video.

The architectural choice prioritizes parameter efficiency. Rather than deploying separate specialized models for each modality pair, EBind uses a single base model with one encoder per modality.

"Other methodologies, what they do is they use a bunch of different models, and they route to the best model for embedding these pairs, so they tend to explode in the number of parameters," Landau said. "We found we could use a single base model and just train one encoder per modality, so keeping it very simple and very parameter efficient, if we fed that overall architecture really, really good data."

The resulting model rivals OmniBind, a much larger competitor in the multimodal space, but requires dramatically fewer computational resources for both training and inference. This makes EBind deployable in resource-constrained environments, including edge devices for robotics and autonomous systems.
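As a rough illustration of this style of architecture, the sketch below wires one small encoder per modality into a shared embedding space and trains it with a CLIP-style symmetric contrastive loss. The encoder shapes, dimensions and pairing here are hypothetical placeholders, not EBind's published design.

```python
# Sketch of a shared embedding space with one encoder per modality,
# trained with a CLIP-style symmetric contrastive (InfoNCE) loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 256  # shared embedding dimension (assumed for illustration)

class Encoder(nn.Module):
    """Projects one modality's features into the shared space."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, 512), nn.GELU(), nn.Linear(512, DIM)
        )

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)  # unit-length embeddings

# One lightweight encoder per modality, not one model per modality pair.
encoders = nn.ModuleDict({
    "text": Encoder(768), "image": Encoder(1024), "audio": Encoder(512),
    "video": Encoder(1024), "pointcloud": Encoder(256),
})

def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE: matched pairs attract, in-batch mismatches repel."""
    logits = (a @ b.t()) / temperature
    targets = torch.arange(len(a))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy batch of 32 paired (text, audio) examples with random features.
text = encoders["text"](torch.randn(32, 768))
audio = encoders["audio"](torch.randn(32, 512))
loss = contrastive_loss(text, audio)
loss.backward()
```

Trained this way, an embedding from any modality can be compared with any other using a simple dot product, which is what enables the cross-silo search scenarios described in the next section.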
"Some of those use cases are very high risk or high value if something goes wrong, like insurance, the image only captures part of the context and audio can be an important signal." Bax cited digital vehicle inspections as a prime example. When customers photograph vehicle damage for insurance claims, they often describe what happened verbally while capturing images. Audio context can significantly improve claim accuracy and reduce fraud. "As you're doing that, oftentimes the customer is actually describing what's happened," Bad said. "A few of our potential prospects in InsurTech have asked us if we can actually do audio as well, because then that adds this additional bit of context for the user who's submitting the claim." The challenge lies in maintaining Captur AI's core advantage: running models efficiently on-device rather than requiring cloud processing. The company plans to use Encord's dataset to train compact multimodal models that preserve real-time, offline capabilities while adding audio and sequential image context. "The most important thing you can do is try and get as much context as possible," Bax said. "Can you get LLMs to be small enough to run on a device within the next three years, or can you run multimodal models on the device? Solving data quality before image upload is the interesting frontier." What this means for enterprises Encord's results challenge fundamental assumptions about AI development and suggest that the next competitive battleground may be data operations rather than infrastructure scale. Multimodal datasets unlock new capabilities. The ability to train models that understand relationships across data types opens use cases that single-modality systems cannot address. Data operations deserve equal investment with compute infrastructure. The 17x parameter efficiency gain from better data curation represents orders of magnitude in cost savings. Organizations pouring resources into GPU clusters while treating data quality as an afterthought may be optimizing the wrong variable. For enterprises building multimodal AI systems, Landau's assessment captures the strategic shift. "We were able to get to the same level of performance as models much larger, not because we were super clever on the architecture, but because we trained it with really good data overall," he said.
[2]
Encord creates a new method for training powerful multimodal AI models on a single GPU - SiliconANGLE
Artificial intelligence data annotation startup Encord, officially known as Cord Technologies Inc., wants to break down barriers to training multimodal AI models. To do that, it has just released what it says is the world's largest open-source multimodal dataset to help developers of all shapes and sizes build more sophisticated AI systems.

Along with the dataset, Encord has created a new methodology for training multimodal AI models. It's called EBind, and the company claims it can be used to train advanced models capable of processing multiple kinds of data on a single graphics processing unit within a matter of hours, rather than weeks or days.

The startup says the new dataset and methodology can help to democratize access to multimodal AI and increase the ability of smaller startups to compete with the likes of OpenAI, Google LLC, Meta Platforms Inc. and Anthropic PBC.

Encord knows a thing or two about AI training, so it's qualified to make such a claim. The company is the creator of an automated data annotation platform that's used to label and annotate different types of data, including text files, images, videos and audio, so it can be used to train machine learning and computer vision models.

Though automated data annotation systems are not new, traditional ones have relied heavily on human supervision. Encord instead automates the entire process by using AI itself to supervise the AI that's doing the annotating, which helps companies get large datasets ready for AI training much faster than was possible before.
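The "AI supervising AI" idea can be pictured as a triage loop: one model proposes labels, a second judge model scores them, and only low-confidence items fall back to human review. The sketch below is a hedged, hypothetical illustration of that general pattern, with made-up names and thresholds; it is not a description of Encord's actual system.

```python
# Hypothetical sketch of judge-model triage for automated annotation.
from dataclasses import dataclass

@dataclass
class Annotation:
    item_id: str
    label: str          # proposed by the labeling model
    judge_score: float  # judge model's agreement score in [0, 1]

def triage(annotations: list[Annotation], threshold: float = 0.9):
    """Auto-accept confident labels; queue the rest for human review."""
    accepted = [a for a in annotations if a.judge_score >= threshold]
    review_queue = [a for a in annotations if a.judge_score < threshold]
    return accepted, review_queue

batch = [
    Annotation("img_001", "scooter", 0.97),
    Annotation("img_002", "bicycle", 0.62),  # ambiguous: routed to review
]
accepted, review_queue = triage(batch)
```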
Encord co-founder and Chief Executive Eric Landau said the company wants to democratize access to multimodal AI because of its huge potential. Multimodal AI models are uniquely able to process multiple kinds of data, unlike standard chatbots that are trained only on text or computer vision models that learn exclusively from images. By ingesting multiple kinds of data, they can solve more complex problems and generate more nuanced outputs.

"Multimodal AI is the next major leap for our industry, with the power to teach robots, self-driving cars, drones and other systems to recognize and make inferences from their physical environments using the same combination of senses that humans use," Landau explained.

The problem with multimodal AI is that, until now, it has been largely inaccessible to smaller teams. For one thing, there's a lack of multimodal data in the public domain that can be used to train these models. And existing training methodologies require vast computational resources to run efficiently, which makes them prohibitively expensive for many smaller companies.

Landau said Encord's new dataset and EBind methodology are meant to disrupt that status quo: "[They will] vastly reduce the time and compute power needed to develop, train and deploy multimodal AI systems - and will help to unleash the next wave of innovation in this space," he promised.

The EBind methodology was designed to be used with Encord's voluminous, high-quality open multimodal dataset. It relies on a single encoder per data modality, with a training process driven more by data quality than by raw compute power. So the better the data, the faster models can be trained, even when only limited compute resources are available, Landau said.

According to Encord's internal research, it was able to train a simple 1.8 billion-parameter multimodal model that outperformed rival models with up to 17 times more parameters, and it did so in just a few hours on a single GPU. The company has not yet published this research, so its claims cannot be independently verified, but Charlotte Bax, CEO of the British vision AI startup Captur Ltd., has had early access to the dataset and methodology and was mightily impressed.

"The dataset opens new possibilities for improving performance on image quality measures for our shared models across various verticals," Bax said. "We're always looking at ways to augment datasets for our on-device models to achieve better handling of edge cases, and Encord's new dataset offers a powerful pathway to accomplish that goal."

Encord President Ulrik Stig Hansen said the success of the new methodology shows that data quality, rather than computing resources, will have the biggest impact on AI innovation in the future. "The winning organizations... [will be those] that adopt new approaches to data curation and dataset construction, not just those that throw escalating levels of compute power at the problem," he predicted.
Encord introduces EMM-1, the largest open-source multimodal dataset, and EBind, a novel training methodology. This breakthrough enables efficient training of powerful multimodal AI models on a single GPU, potentially democratizing access to advanced AI technologies.
In a significant leap forward for the AI industry, data labeling platform vendor Encord has introduced EMM-1, the world's largest open-source multimodal dataset, alongside a novel training methodology called EBind. This development promises to democratize access to multimodal AI and revolutionize the way AI models are trained and deployed [1][2].

The EMM-1 dataset comprises an impressive 1 billion data pairs and 100 million data groups across five modalities: text, image, video, audio, and 3D point clouds. The dataset is a staggering 100 times larger than the next comparable multimodal dataset, operating at petabyte scale with terabytes of raw data and over 1 million human annotations [1].

Encord's EBind methodology, which prioritizes data quality over raw computational power, has achieved remarkable results. A compact 1.8 billion parameter model trained using EBind matched the performance of models up to 17 times larger, while dramatically reducing training time from days to hours on a single GPU [1][2].

Encord's success is not just about scale, but also about addressing critical issues in AI training. The company focused on solving the problem of data leakage between training and evaluation sets, which can artificially inflate model performance metrics. By employing hierarchical clustering techniques, Encord ensured clean separation while maintaining representative distribution across data types [1].

EBind builds upon OpenAI's CLIP (Contrastive Language-Image Pre-training) approach, extending it from two modalities to five. This architectural choice prioritizes parameter efficiency by using a single base model with one encoder per modality, instead of deploying separate specialized models for each modality pair [1].

The introduction of EMM-1 and EBind has significant implications for enterprise AI applications. Multimodal models enable use cases that span different data types, allowing organizations to search and retrieve across various systems simultaneously, including content management platforms, communication tools, learning management systems, and databases [1].

Encord's innovations aim to break down barriers to training multimodal AI models, making them accessible to developers and companies of all sizes. By reducing the time and computational resources required for training, Encord is leveling the playing field, allowing smaller startups to compete with tech giants in the AI space [2].

Early access to the dataset and methodology has garnered positive reactions from industry professionals. Charlotte Bax, CEO of British vision AI startup Captur Ltd., praised the dataset's potential for improving image quality measures and handling edge cases in on-device models [2].

Encord's President, Ulrik Stig Hansen, predicts that future AI innovation will be driven more by data quality than by raw computing power. This shift in focus could reshape the competitive landscape in the AI industry, favoring organizations that excel in data curation and dataset construction [2].
Summarized by Navi