5 Sources
[1]
Google's Gemini Embedding 2 arrives with native multimodal to cut costs and speed up your enterprise data stack
Yesterday, amid a flurry of enterprise AI product updates, Google announced arguably its most significant one for enterprise customers: the public preview availability of Gemini Embedding 2, its new embeddings model -- a significant evolution in how machines represent and retrieve information across different media types. While previous embedding models were largely restricted to text, this new model natively integrates text, images, video, audio, and documents into a single numerical space -- reducing latency by as much as 70% for some customers and cutting total cost for enterprises that use AI models powered by their own data to complete business tasks.

Who needs and uses an embedding model?

For those who have encountered the term "embeddings" in AI discussions but find it abstract, a useful analogy is that of a universal library. In a traditional library, books are organized by metadata: author, title, or genre. In the "embedding space" of an AI, information is organized by ideas. Imagine a library where books aren't organized by the Dewey Decimal System, but by their "vibe" or "essence". In this library, a biography of Steve Jobs would physically fly across the room to sit next to a technical manual for a Macintosh. A poem about a sunset would drift toward a photography book of the Pacific Coast, with all thematically similar content organized in hovering "clouds" of books.

This is essentially what an embedding model does. An embedding model takes complex data -- like a sentence, a photo of a sunset, or a snippet of a podcast -- and converts it into a long list of numbers called a vector. These numbers represent coordinates in a high-dimensional map. If two items are "semantically" similar (e.g., a photo of a golden retriever and the text "man's best friend"), the model places their coordinates very close to each other on this map.
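The "coordinates close together" idea can be made concrete in a few lines of plain Python. This is a toy sketch, not Google's implementation: the three-dimensional vectors below stand in for real model output (which would have thousands of dimensions), and closeness is measured with cosine similarity, the standard metric for comparing embeddings.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: near 1.0 means the vectors point the same way
    # (similar "meaning"); near 0.0 means they are unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" for illustration only.
dog_photo = [0.9, 0.1, 0.2]   # stand-in for an embedded golden retriever photo
dog_text  = [0.8, 0.2, 0.1]   # stand-in for the embedded text "man's best friend"
tax_form  = [0.1, 0.9, 0.7]   # stand-in for an unrelated document

print(cosine_similarity(dog_photo, dog_text))  # high: the model placed them close
print(cosine_similarity(dog_photo, tax_form))  # low: far apart on the map
```

In a natively multimodal space, the same comparison works regardless of whether the two vectors came from a photo, a sentence, or an audio clip.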
Today, these models are the invisible engine behind:

* Search Engines: Finding results based on what you mean, not just the specific words you typed.
* Recommendation Systems: Netflix or Spotify suggesting content because its "coordinates" are near things you already like.
* Enterprise AI: Large companies use them for Retrieval-Augmented Generation (RAG), where an AI assistant "looks up" a company's internal PDFs to answer an employee's question accurately.

The concept of mapping words to vectors dates back to the 1950s with linguists like John Rupert Firth, but the modern "vector revolution" began in the early 2000s when Yoshua Bengio's team first used the term "word embeddings". The real breakthrough for the industry was Word2Vec, released by a team at Google led by Tomas Mikolov in 2013.

Today, the market is led by a handful of major players:

* OpenAI: Known for its widely used text-embedding-3 series.
* Google: With the new Gemini and previous Gecko models.
* Anthropic and Cohere: Providing specialized models for enterprise search and developer workflows.

By moving beyond text to a natively multimodal architecture, Google is attempting to create a singular, unified map for the sum of human digital expression -- text, images, video, audio, and documents -- all residing in the same mathematical neighborhood.

Why Gemini Embedding 2 is such a big deal

Most leading models are still "text-first." If you want to search a video library, the AI usually has to transcribe the video into text first, then embed that text. Google's Gemini Embedding 2 is natively multimodal. As Logan Kilpatrick of Google DeepMind posted on X, the model allows developers to "bring text, images, video, audio, and docs into the same embedding space". It understands audio as sound waves and video as motion directly, without needing to turn them into text first. This reduces "translation" errors and captures nuances that text alone might miss.
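The search and RAG use cases above reduce, at their core, to nearest-neighbor lookup over stored vectors. A minimal sketch with toy data (the file names and three-dimensional embeddings are illustrative, not real model output):

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, corpus, k=2):
    # corpus: list of (doc_id, embedding) pairs; return the k closest by cosine.
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# In a multimodal space, audio and image files live alongside text documents.
corpus = [
    ("refund-policy.pdf", [0.9, 0.1, 0.0]),
    ("support-call.mp3",  [0.8, 0.3, 0.1]),
    ("holiday-party.jpg", [0.0, 0.2, 0.9]),
]

# A text query about refunds surfaces both the PDF and the recorded call.
print(top_k([0.85, 0.2, 0.05], corpus))
```

Production systems hand this ranking step to a vector database (Weaviate, Qdrant, ChromaDB, and the like), but the underlying operation is the same.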
For developers and enterprises, the "natively multimodal" nature of Gemini Embedding 2 represents a shift toward more efficient AI pipelines. By mapping all media into a single 3,072-dimensional space, developers no longer need separate systems for image search and text search; they can perform "cross-modal" retrieval -- using a text query to find a specific moment in a video or an image that matches a specific sound. And unlike its predecessors, Gemini Embedding 2 can process requests that mix modalities. A developer can send a request containing both an image of a vintage car and the text "What is the engine type?". The model doesn't process them separately; it treats them as a single, nuanced concept. This allows for a much deeper understanding of real-world data where the "meaning" is often found in the intersection of what we see and what we say.

One of the model's more technical features is Matryoshka Representation Learning. Named after Russian nesting dolls, this technique allows the model to "nest" the most important information in the first few numbers of the vector. An enterprise can choose to use the full 3,072 dimensions for maximum precision, or "truncate" them down to 768 or 1,536 dimensions to save on database storage costs with minimal loss in accuracy.

Benchmarking the performance gains of moving to multimodal

Gemini Embedding 2 establishes a new performance ceiling for multimodal depth, specifically outperforming previous industry leaders across text, image, and video evaluation tasks. The model's most significant lead is found in video and audio retrieval, where its native architecture allows it to bypass the performance degradation typically associated with text-based transcription pipelines. Specifically, in video-to-text and text-to-video retrieval tasks, the model demonstrates a measurable performance gap over existing industry leaders, accurately mapping motion and temporal data into a unified semantic space.
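On the consumer side, Matryoshka-style truncation is mechanically simple: keep the leading dimensions and re-normalize. The sketch below is a plain-Python illustration under the assumption, standard for MRL-trained models, that the training objective concentrates the most useful signal in the leading dimensions; it is not Google's implementation.

```python
import math

def truncate_embedding(vec, dims):
    # Keep only the leading dimensions (where MRL nests the most important
    # information), then re-normalize to unit length so cosine similarity
    # remains meaningful on the shorter vector.
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy 8-dimensional stand-in for a full 3,072-dimension vector.
full = [0.6, 0.5, 0.4, 0.3, 0.05, 0.04, 0.02, 0.01]
small = truncate_embedding(full, 4)

print(len(small))                 # 4: a quarter of the storage
print(sum(x * x for x in small))  # ~1.0: still unit length
```

An enterprise applying this at 768 of 3,072 dimensions would cut vector storage to a quarter while, per Google's claims, losing little retrieval accuracy.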
The technical results show a distinct advantage in the following standardized categories:

* Multimodal Retrieval: Gemini Embedding 2 consistently outperforms leading text and vision models in complex retrieval tasks that require understanding the relationship between visual elements and textual queries.
* Speech and Audio Depth: The model introduces a new standard for native audio embeddings, achieving higher accuracy in capturing phonetic and tonal intent compared to models that rely on intermediate text transcription.
* Contextual Scaling: In text-based benchmarks, the model maintains high precision while utilizing its expansive 8,192-token context window, ensuring that long-form documents are embedded with the same semantic density as shorter snippets.
* Dimension Flexibility: Testing across the Matryoshka Representation Learning (MRL) layers reveals that even when truncated to 768 dimensions, the model retains a significant majority of its 3,072-dimension performance, outperforming fixed-dimension models of similar size.

What it means for enterprise databases

For the modern enterprise, information is often a fragmented mess. A single customer issue might involve a recorded support call (audio), a screenshot of an error (image), a PDF of a contract (document), and a series of emails (text). In previous years, searching across these formats required four different pipelines. With Gemini Embedding 2, an enterprise can create a Unified Knowledge Base. This enables a more advanced form of RAG, wherein a company's internal AI doesn't just look up facts, but understands the relationships between them regardless of format.

Early partners are already reporting drastic efficiency gains:

* Sparkonomy, a creator economy platform, reported that the model's native multimodality slashed their latency by up to 70%. By removing the need for intermediate LLM "inference" (the step where one model explains a video to another), they nearly doubled their semantic similarity scores for matching creators with brands.
* Everlaw, a legal tech firm, is using the model to navigate the "high-stakes setting" of litigation discovery. In legal cases where millions of records must be parsed, Gemini's ability to index images and videos alongside text allows legal professionals to find "smoking gun" evidence that traditional text search would miss.

Understanding the limits

In its announcement, Google was upfront about some of the current limitations of Gemini Embedding 2. The new model can vectorize individual files of up to 8,192 text tokens, 6 images (in a single batch), 128 seconds of video (2 minutes, 8 seconds), 80 seconds of native audio (1 minute, 20 seconds), and a 6-page PDF.

It is vital to clarify that these are input limits per request, not a cap on what the system can remember or store. Think of it like a scanner: if a scanner has a limit of "one page at a time," it doesn't mean you can only ever scan one page; it means you have to feed the pages in one by one.

* Individual File Size: You cannot "embed" a 100-page PDF in a single call. You must "chunk" the document -- splitting it into segments of 6 pages or fewer -- and send each segment to the model individually.
* Cumulative Knowledge: Once those chunks are converted into vectors, they can all live together in your database. You can have a database containing ten million 6-page PDFs, and the model will be able to search across all of them simultaneously.
* Video and Audio: Similarly, if you have a 10-minute video, you would break it into 128-second segments to create a searchable "timeline" of embeddings.

Licensing, pricing, and availability

As of March 10, 2026, Gemini Embedding 2 is officially in Public Preview.
For developers and enterprise leaders, this means the model is accessible for immediate testing and production integration, though it is still subject to the iterative refinements typical of "preview" software before it reaches General Availability (GA). The model is deployed across Google's two primary AI gateways, each catering to a different scale of operation:

* Gemini API: Targeted at rapid prototyping and individual developers, this path offers a simplified pricing structure.
* Vertex AI (Google Cloud): The enterprise-grade environment designed for massive scale, offering advanced security controls and integration with the broader Google Cloud ecosystem.

It's also already integrated with the heavy hitters of AI infrastructure: LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, and ChromaDB.

In the Gemini API, Google has introduced a tiered pricing model that distinguishes between "standard" data (text, images, and video) and "native" audio:

* The Free Tier: Developers can experiment with the model at no cost, though this tier comes with rate limits (typically 60 requests per minute) and uses data to improve Google's products.
* The Paid Tier: For production-level volume, the cost is calculated per million tokens. For text, image, and video inputs, the rate is $0.25 per 1 million tokens.
* The "Audio Premium": Because the model natively ingests audio data without intermediate transcription -- a more computationally intensive task -- the rate for audio inputs is doubled to $0.50 per 1 million tokens.

For large-scale deployments on Vertex AI, the pricing follows an enterprise-centric pay-as-you-go (PayGo) model. This allows organizations to pay for exactly what they use across different processing modes:

* Flex PayGo: Best for unpredictable, bursty workloads.
* Provisioned Throughput: Designed for enterprises that require guaranteed capacity and consistent latency for high-traffic applications.
* Batch Prediction: Ideal for re-indexing massive historical archives, where time-sensitivity is lower but volume is extremely high.

By making the model available through these diverse channels and integrating it natively with libraries like LangChain, LlamaIndex, and Weaviate, Google has ensured that the "switching cost" for businesses isn't just a matter of price, but of operational ease. Whether a startup is building its first RAG-based assistant or a multinational is unifying decades of disparate media archives, the infrastructure is now live and globally accessible.

In addition, the official Gemini API and Vertex AI Colab notebooks, which contain the Python code necessary to implement these features, are licensed under the Apache License, Version 2.0. The Apache 2.0 license is highly regarded in the tech community because it is "permissive": it allows developers to take Google's implementation code, modify it, and use it in their own commercial products without having to pay royalties or open-source their own proprietary code in return.

How enterprises should respond: migrate to Gemini Embedding 2 or not?

For Chief Data Officers and technical leads, the decision to migrate to Gemini Embedding 2 hinges on the transition from a "text-plus" strategy to a "natively multimodal" one. If your organization currently relies on fragmented pipelines -- where images and videos are first transcribed or tagged by separate models before being indexed -- the upgrade is likely a strategic necessity. This model eliminates the "translation tax" of using intermediate LLMs to describe visual or auditory data, a move that partners like Sparkonomy found reduced latency by up to 70% while doubling semantic similarity scores. For businesses managing massive, diverse datasets, this isn't just a performance boost; it is a structural simplification that reduces the number of points where "meaning" can be lost or distorted.
The effort to switch from a text-only foundation is lower than one might expect due to what early users describe as excellent "API continuity". Because the model integrates with industry-standard frameworks like LangChain, LlamaIndex, and Vector Search, it can often be "dropped into" existing workflows with minimal code changes. However, the real cost and energy investment lies in re-indexing. Moving to this model requires re-embedding your existing corpus to ensure all data points exist in the same 3,072-dimensional space. While this is a one-time computational hurdle, it is the prerequisite for unlocking cross-modal search -- where a simple text query can suddenly "see" into your video archives or "hear" specific customer sentiment in call recordings.

The primary trade-off for data leaders to weigh is the balance between high-fidelity retrieval and long-term storage economics. Gemini Embedding 2 addresses this directly through Matryoshka Representation Learning (MRL), which allows you to truncate vectors from 3,072 dimensions down to 768 without a linear drop in quality. This gives CDOs a tactical lever: you can choose maximum precision for high-stakes legal or medical discovery -- as seen in Everlaw's 20% lift in recall -- while utilizing smaller, more efficient vectors for lower-priority recommendation engines to keep cloud storage costs in check. Ultimately, the ROI is found in the "lift" of accuracy; in a landscape where an AI's value is defined by its context, the ability to natively index a 6-page PDF or 128 seconds of video directly into a knowledge base provides a depth of insight that text-only models simply cannot replicate.
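The chunked re-indexing workflow described in the article (6-page PDF segments, 128-second video windows) can be sketched in a few lines of plain Python. The limits are taken from the announcement; the helper names and return format are my own illustration, not an official API.

```python
def chunk_pages(num_pages, max_pages=6):
    # Split an N-page document into page ranges of at most max_pages,
    # each small enough to embed in a single request.
    return [(start, min(start + max_pages, num_pages))
            for start in range(0, num_pages, max_pages)]

def chunk_seconds(duration, window=128):
    # Split a media file into time windows of at most `window` seconds.
    return [(t, min(t + window, duration)) for t in range(0, duration, window)]

print(len(chunk_pages(100)))  # a 100-page PDF becomes 17 embed calls
print(chunk_seconds(600))     # a 10-minute video becomes 5 windows
```

Each resulting chunk would be embedded in its own request, and all of the resulting vectors can then live side by side in one index, which is what makes the "cumulative knowledge" property possible despite the per-request caps.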
[2]
Gemini Embedding 2: Our first natively multimodal embedding model
Today we're releasing Gemini Embedding 2, our first fully multimodal embedding model built on the Gemini architecture, in Public Preview via the Gemini API and Vertex AI. Expanding on our previous text-only foundation, Gemini Embedding 2 maps text, images, videos, audio and documents into a single, unified embedding space, and captures semantic intent across over 100 languages. This simplifies complex pipelines and enhances a wide variety of multimodal downstream tasks -- from Retrieval-Augmented Generation (RAG) and semantic search to sentiment analysis and data clustering. The model is based on Gemini and leverages its best-in-class multimodal understanding capabilities to create high-quality embeddings across text, images, video, audio and documents. Beyond processing one modality at a time, this model natively understands interleaved input, so you can pass multiple modalities of input (e.g., image + text) in a single request. This allows the model to capture the complex, nuanced relationships between different media types, unlocking more accurate understanding of complex, real-world data.
[3]
Gemini Embedding 2 Is Google's First AI Model to Map Text and Media
Google released its first fully multimodal embedding model on Tuesday. Dubbed Gemini Embedding 2, the artificial intelligence (AI) model maps text, images, audio, and videos into a single, unified embedding space. This means it uses one architecture to understand concepts whether they are written as words, spoken aloud, or shown in an image or a video. The Mountain View-based tech giant says this new system will simplify the way a large language model (LLM) understands information and will allow it to perform more complex actions.

Google's First Multimodal Embedding Model Is Here

In a blog post, the tech giant detailed the new AI model. It is the successor to the text-only embedding model that was released last year, and it captures semantic intent across more than 100 languages. Gemini Embedding 2 is currently available in public preview via the Gemini application programming interface (API) and Vertex AI.

AI models typically have different digital file cabinets to store text, photos, videos, and audio files. Whenever a user requests information in a specific format, the model begins looking into that specific cabinet. Usually, an LLM treats a "cat" in a text document and a "cat" in a video as two completely different things. And to make matters more complex, the method of obtaining information differs with each format. Gemini Embedding 2 solves this problem with a new architecture that uses a single cabinet for all kinds of information. This allows it to process a document that has both text and images at the same time, as humans do. Google says this new system simplifies "complex pipelines and enhances a wide variety of multimodal downstream tasks." Some of these include Retrieval-Augmented Generation (RAG), semantic search, sentiment analysis, and data clustering.

Coming to the AI model's capabilities, it has a text context window of up to 8,192 input tokens.
It can also process up to six images per request in PNG and JPEG formats, and supports up to 120 seconds of video input in MP4 and MOV formats. Additionally, it can natively process and map audio data without needing text transcriptions. Further, it can also embed PDFs up to six pages long. Gemini Embedding 2 can also understand interleaved input, so users can send multiple modalities (such as text and image) in the same request. Google says this capability allows the model to gain a more accurate understanding of complex, real-world data.
[4]
Google rolls out Gemini Embedding 2 for multimodal AI applications
Google has released Gemini Embedding 2, a multimodal embedding model built on the Gemini architecture. The model expands beyond earlier text-only embedding systems by mapping text, images, videos, audio, and documents into a single unified embedding space. It captures semantic meaning across more than 100 languages and supports AI tasks such as Retrieval-Augmented Generation (RAG), semantic search, sentiment analysis, and data clustering.

Gemini Embedding 2

Gemini Embedding 2 uses the multimodal capabilities of the Gemini architecture to generate embeddings from different types of data. The model supports interleaved multimodal inputs, allowing developers to combine inputs such as text and images in a single request. This enables the system to capture relationships between different media types and process datasets that contain multiple formats.

Key features

Multimodal input support

* Text: Supports up to 8,192 input tokens
* Images: Processes up to six images per request, supporting PNG and JPEG formats
* Videos: Supports video input of up to 120 seconds in MP4 and MOV formats
* Audio: Directly processes audio without requiring transcription
* Documents: Supports embedding PDF files up to six pages

Interleaved multimodal inputs

The model can process multiple media types within a single request, enabling contextual understanding between inputs such as image and text.

Matryoshka Representation Learning (MRL)

Gemini Embedding 2 incorporates Matryoshka Representation Learning, which allows embedding vectors to scale across different dimensions. The default dimension is 3,072, and developers can reduce the size to manage storage and performance requirements. Recommended output dimensions:

* 3,072
* 1,536
* 768

Model capabilities

According to Google, the model introduces multimodal embedding support across text, image, video, and speech tasks, while adding native audio processing capability.
Supported use cases

* Retrieval-Augmented Generation (RAG)
* Semantic search
* Sentiment analysis
* Data clustering
* Large-scale data management

Availability

Gemini Embedding 2 is available in Public Preview through the Gemini API and Vertex AI. Developers can access the model through integrations with frameworks and vector database tools including:

* LangChain
* LlamaIndex
* Haystack
* Weaviate
* Qdrant
* ChromaDB

The model can also be used with vector search systems for multimodal data processing.
[5]
Google unveils Gemini Embedding 2, its first multimodal embedding model
The model aims to simplify complex AI pipelines for RAG and semantic search. Google has officially unveiled its first-ever multimodal embedding model, Gemini Embedding 2. While Google's earlier embedding models were limited to text, Gemini Embedding 2 maps text, images, videos, audio and documents into a single space. With the model, Google wants to simplify complex pipelines and enhance a wide variety of multimodal downstream tasks, as it supports retrieval-augmented generation (RAG) and semantic search, from sentiment analysis to data clustering. Here's a detailed look at how this new embedding model works and how you can use it.

First up, speaking of how the model works, Google explained that this new model is based on Gemini. As per the company, it leverages its best-in-class multimodal understanding capabilities to create high-quality embeddings across various media. These include text-based media that support a context of up to 8,192 input tokens. In terms of images, the model is capable of processing up to 6 images per request, supporting both the popular PNG and JPEG formats. Videos are where things get interesting, as the model supports up to 120 seconds of video input in both MP4 and MOV formats. The model can natively ingest and embed audio data without needing intermediate text transcriptions, and can even directly embed PDFs that are up to 6 pages long. The best part is that this model has been built such that it can process more than one medium at a time: it can accept multiple media, like image + text, in a single request. As per Google, this allows the model to work across different media types, unlocking a better understanding of real-world data. Google also shared the performance difference over the various multimodal models available in the space.
As per the company, with Gemini Embedding 2, Google is not only improving on its legacy models, but also establishing a new performance standard compared with other models, and it shared a table detailing the performance improvements. Using Google's new Gemini Embedding 2 multimodal embedding model is pretty simple, too. You can just head over to either the Gemini API or the Vertex AI platform and check it out from there. On its official blog post, Google has released the code required to access the model.
Google unveiled Gemini Embedding 2, marking a shift in how AI systems process information. The natively multimodal embedding model maps text, images, videos, audio, and documents into a single unified embedding space, eliminating the need for separate processing systems. Available through Gemini API and Vertex AI, the model reduces latency by up to 70% for some customers while cutting costs for enterprises using AI powered by their own data.
Google has released Gemini Embedding 2, its first fully natively multimodal embedding model built on the Gemini architecture, now available in public preview through the Gemini API and Vertex AI [1][2]. This release represents a significant evolution in how machines represent and retrieve information across different media types, moving beyond the text-only limitations of previous embedding models [1]. The new model maps text, images, videos, audio, and documents into a single unified embedding space, capturing semantic intent across more than 100 languages [2][3].
Source: Google
For enterprise customers, the impact is immediate and measurable. The model reduces latency by as much as 70% for some customers and cuts total costs for enterprises that use AI models powered by their own data to complete business tasks [1]. This efficiency gain stems from eliminating the need to convert different media types into text before processing—a step that previous systems required.

Embeddings serve as the invisible engine behind modern AI applications, from search engines to recommendation systems [1]. An embedding model takes complex data—whether a sentence, a photo, or a podcast snippet—and converts it into a long list of numbers called a vector. These numbers represent coordinates in a high-dimensional map where semantically similar items sit close together [1]. Large companies deploy these models for Retrieval-Augmented Generation (RAG), where an AI assistant searches through internal PDFs to answer employee questions accurately [1][4]. The market currently includes major players like OpenAI with its text-embedding-3 series, Google with Gemini and previous Gecko models, and Anthropic and Cohere providing specialized models for enterprise search [1].
Most leading models remain "text-first," requiring video libraries to be transcribed into text before embedding [1]. Gemini Embedding 2 breaks this pattern by understanding audio as sound waves and video as motion directly, without converting them to text first [1]. This approach reduces translation errors and captures nuances that text alone might miss.

The model can process text, images, videos, and audio into the same 3,072-dimensional space, eliminating the need for separate systems for image search and text search [1]. Developers can now perform cross-modal retrieval—using a text query to find a specific moment in a video or an image that matches a particular sound. As Logan Kilpatrick of Google DeepMind noted, developers can "bring text, images, video, audio, and docs into the same embedding space" [1].
Source: VentureBeat
Gemini Embedding 2 supports interleaved multimodal inputs, allowing developers to combine multiple media types within a single request [2][3]. A developer can send a request containing both an image of a vintage car and the text "What is the engine type?" The model processes them as a single, nuanced concept rather than treating them separately [1].

The model offers specific technical specifications: it supports up to 8,192 input tokens for text, processes up to six images per request in PNG and JPEG formats, handles up to 120 seconds of video input in MP4 and MOV formats, and can embed PDFs up to six pages long [3][4]. Notably, it can natively process and map audio data without requiring text transcriptions [3].
One technical feature that sets Gemini Embedding 2 apart is Matryoshka Representation Learning (MRL) [1][4]. Named after Russian nesting dolls, this technique nests the most important information in the first few numbers of the vector. Enterprises can choose to use the full 3,072 dimensions or reduce the size to manage storage and performance requirements [4]. Recommended output dimensions include 3,072, 1,536, and 768 [4].
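The storage stakes of that dimension choice are easy to quantify. Assuming float32 vectors at 4 bytes per dimension (a common storage format, not one stated in the announcement), a back-of-the-envelope calculation:

```python
def storage_gb(num_vectors, dims, bytes_per_value=4):
    # Raw float32 vector storage in gigabytes (ignores database index overhead).
    return num_vectors * dims * bytes_per_value / 1e9

ten_million = 10_000_000
print(storage_gb(ten_million, 3072))  # ~122.9 GB at the full 3,072 dimensions
print(storage_gb(ten_million, 768))   # ~30.7 GB truncated to 768 (4x smaller)
```

For a corpus of ten million embedded chunks, truncating from 3,072 to 768 dimensions cuts raw vector storage roughly fourfold, which is the lever the MRL design offers.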
The model enhances a wide variety of multimodal downstream tasks, from semantic search to sentiment analysis and data clustering [2][5]. By creating a singular, unified map for different types of digital content, Google aims to simplify complex pipelines that previously required multiple specialized systems [3][5].

Developers can access the model through integrations with popular frameworks and vector databases including LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, and ChromaDB [4]. This broad integration support suggests Google is positioning the model for rapid adoption across existing AI development workflows.

The shift toward native multimodal processing signals where AI infrastructure is heading. Enterprises currently managing separate systems for different media types should evaluate how consolidating into a unified embedding space might reduce complexity and costs. The 70% latency reduction reported by some customers indicates substantial performance gains are possible, particularly for organizations processing large volumes of mixed-media data [1]. As competitors like OpenAI and Anthropic respond, the embedding model landscape will likely see accelerated innovation in multimodal capabilities and efficiency improvements.

Summarized by Navi