Curated by THEOUTPOST
On Fri, 23 Aug, 12:05 AM UTC
2 Sources
[1]
Onehouse's vector embeddings support aims to cut the cost of AI training - SiliconANGLE
Onehouse Inc., a company that sells a data lakehouse based on Apache Hudi as a managed service, today said it has launched a vector embedding generator to automate embedding pipelines as part of its cloud service.

Vector embeddings are mathematical representations of objects, such as words and images, in a continuous space in which each point is defined by a vector, an ordered list of numbers that represents an object's features, coordinates in a space or a complex data type. Embeddings are typically used in machine learning and natural language processing to capture an object's semantic meaning or other relevant features in a form a computer can process.

Vector embedding pipelines continuously deliver data from streams, databases and files on cloud storage to the foundation models used in generative AI. Onehouse can now accept the embeddings those models return and store them in the data lakehouse. That can be a big money saver, since vector databases typically require powerful hardware and fast storage tightly coupled with compute.

Vector databases have been the hottest area of the database management system market since the generative AI craze began last year. Forrester Research Inc. estimates that 75% of traditional databases, including relational and NoSQL models, will incorporate vector capabilities by 2026.

Onehouse is essentially positioning its service as a clearinghouse for vector embeddings. Instead of storing data in a DBMS, customers can take advantage of the low cost of lakehouse storage, which is based on inexpensive, scalable object storage decoupled from computing resources.

"Enterprises need to store a lot of data in their vector databases on local storage, so they need a much bigger vector database instance to get the speed and scalability they need," said Vinoth Chandar, chief executive of Onehouse and co-creator of Apache Hudi. "Many companies end up running multiple vector databases for different parts of their data, so there is no single shared source of truth they can use to manage vector embedding data."

Hudi has unique capabilities around update management, late-arriving data, concurrency control and other factors needed to scale to the data volumes AI applications require. The company said Onehouse can also support low-latency vector serving for real-time use cases. The data lakehouse serves vectors in batch, with hot vectors moved dynamically to the vector database for real-time serving. The approach has scale, cost and performance advantages for applications such as large language models.

Chandar said the use of an intermediate lakehouse can also reduce the volume of application programming interface calls to LLMs such as OpenAI LLC's GPT-4 that are needed to generate vector embeddings. "Hudi is one of the only lakehouse technologies to support advanced indexing and what we call incremental queries, so it's able to drastically reduce the number of calls you need to OpenAI," or another vector embedding generator, Chandar said.

Incremental queries are a Hudi feature that lets users efficiently query only the data that has changed since the last query or a specific point in time. "Hudi can give you a single image, so you can have a job running asynchronously in every arc, and it can make one API call for n updates to an upstream data object," he said.

Low cost and flexibility are among the major features driving the growing popularity of data lakehouses. An MIT Technology Review survey of senior executives, chief architects and data scientists, sponsored by Databricks Inc., found that almost three-quarters of organizations have adopted a lakehouse architecture. Of those, 99% said the lakehouse was helping to achieve their data and AI goals.
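The cost saving described here comes from processing only changed records and batching n updates into a single embedding call. A minimal sketch of that idea in plain Python, with a toy `fake_embed_batch` standing in for a real embedding API (this is an illustration of incremental processing, not Hudi's actual interface):

```python
from dataclasses import dataclass

@dataclass
class Record:
    key: str
    text: str
    commit_time: int  # when the record was last written


def fake_embed_batch(texts):
    """Stand-in for a real embedding API call; one call
    handles a whole batch of texts."""
    return [[float(len(t))] for t in texts]  # toy one-dimensional "embeddings"


class IncrementalEmbedder:
    """Re-embed only records changed since the last run,
    mimicking a Hudi-style incremental query on commit time."""

    def __init__(self):
        self.last_instant = 0  # checkpoint: last processed commit time
        self.store = {}        # key -> embedding (the "lakehouse table")
        self.api_calls = 0

    def run(self, table):
        # Incremental query: only rows written after the checkpoint.
        changed = [r for r in table if r.commit_time > self.last_instant]
        if changed:
            vectors = fake_embed_batch([r.text for r in changed])
            self.api_calls += 1  # one API call for n updates
            for r, v in zip(changed, vectors):
                self.store[r.key] = v
            self.last_instant = max(r.commit_time for r in changed)
        return len(changed)
```

A first run embeds every record in one call; a later run after a single update re-embeds only that record, and a run with no changes makes no API call at all.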
[2]
Onehouse Launches Vector Embeddings Generator for Managing Vectors at Scale on the Data Lakehouse
SUNNYVALE, Calif., August 22, 2024 (Newswire.com) - Onehouse, the Universal Data Lakehouse company, today announced that companies that want to reduce the time and costs required to build vector embeddings for generative AI applications now have an easier, more efficient, and more scalable solution.

Onehouse is launching a vector embeddings generator to automate embeddings pipelines as a part of its managed ELT cloud service. These pipelines continuously deliver data from streams, databases, and files on cloud storage to foundation models from OpenAI, Voyage AI, and others. The models then return the embeddings to Onehouse, which stores them in highly optimized tables on the user's data lakehouse.

As AI initiatives accelerate, there is a growing pain around managing data across numerous siloed vector databases to power RAG applications, leading to excessive costs and wasteful regeneration of vectors. The data lakehouse with open data formats on top of scalable, inexpensive cloud storage is becoming the natural platform of choice for centralizing and managing the vast amounts of data used by AI models. Users are now able to carefully choose what data and embeddings need to be moved to downstream vector databases.

"Text search has evolved dramatically. The traditional tools have complications on their own, as in, ingress of data and egress when we would want to move out. Vector embeddings on data lakehouse not only avoids the ingress and egress complexities and cost but also can scale to massive volumes," said Kaushik Muniandi, engineering manager at NielsenIQ. "We found that vector embeddings on data lakehouse is the only solution that scales to support our application's data volumes while minimizing costs and delivering responses in seconds."

With the addition of the vector embeddings generator as a part of Onehouse's powerful incremental ELT platform, Onehouse customers can streamline their vector embeddings pipelines to store embeddings directly on the lakehouse.
This provides all of the lakehouse's unique capabilities around update management, late-arriving data, concurrency control and more while scaling to the data volumes needed to power large-scale AI applications. Now, organizations have a simpler way to augment the data that AI models operate on - including audio, text, and images - with vectors that capture the relationship between data to support AI use cases.

The Onehouse product integrates with vector databases to enable high-scale, low-latency serving of vectors for real-time use cases. The data lakehouse stores all of an organization's vector embeddings and serves vectors in batch, while hot vectors are moved dynamically to the vector database for real-time serving. This architecture provides scale, cost, and performance advantages for building AI applications such as large language models (LLMs) and intelligent search.

"Data processing and storage are foundational for AI projects," said Prashant Wason, Staff Software Engineer at Uber and Apache Hudi™ Project Management Committee member. "Hudi, and lakehouses more broadly, should be a key part of this journey as companies build AI applications on their large datasets. The scale, openness and extensible indexing that Hudi offers make this approach of bridging the lakehouse and operational vector databases a prime opportunity for value creation in the coming years."

The shift to the data lakehouse is now the norm. A survey of C-suite executives, chief architects, and data scientists by MIT Technology Review and Databricks found that almost three-quarters of organizations have adopted a lakehouse architecture. Of those, 99 percent said the architecture was helping to achieve their data and AI goals.

Onehouse CEO Vinoth Chandar led the creation of Apache Hudi while at Uber in 2016. Uber donated the project to the Apache Software Foundation in 2018.
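The tiered architecture described above - every vector stored cheaply in the lakehouse, with hot vectors promoted to a fast serving tier - can be sketched in a few lines of Python. The hit-count threshold and dict-based tiers here are illustrative assumptions, not Onehouse's actual implementation:

```python
class TieredVectorStore:
    """Toy sketch of batch-plus-hot-tier vector serving: all vectors
    live in cheap 'lakehouse' storage; frequently requested ('hot')
    vectors are promoted to a small fast tier standing in for an
    operational vector database."""

    def __init__(self, hot_threshold=3):
        self.lakehouse = {}   # full copy: cheap, batch-oriented storage
        self.hot_tier = {}    # small, fast: stands in for a vector DB
        self.hits = {}        # access counts per key
        self.hot_threshold = hot_threshold

    def put(self, key, vector):
        self.lakehouse[key] = vector

    def get(self, key):
        if key in self.hot_tier:           # fast path: real-time serving
            return self.hot_tier[key]
        vector = self.lakehouse[key]       # slow path: batch storage
        self.hits[key] = self.hits.get(key, 0) + 1
        if self.hits[key] >= self.hot_threshold:
            self.hot_tier[key] = vector    # promote the hot vector
        return vector
```

Real systems would promote on query patterns, recency, or explicit policy rather than a simple hit counter, but the division of labor is the same: cheap, complete storage in one tier and low-latency serving of the working set in the other.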
Onehouse works across all popular open source data lakehouse formats - Apache Hudi, Apache Iceberg, and Delta Lake - via integration with Apache XTable™ (Incubating) for interoperability.

"AI is going to be only as good as the data fed to it, so managing data for AI is going to be a key aspect of data platforms going forward," said Chandar. "Hudi's powerful incremental processing capabilities also extend to the creation and management of vector embeddings across massive volumes of data. It provides both the open source community and Onehouse customers with significant competitive advantages, such as continuously updating vectors with changing data while reducing the costs of embedding generation and vector database loading."

If you are interested in seeing vector embeddings for AI built and managed in the data lakehouse, join the upcoming webinar with NielsenIQ and Onehouse, "Vector Embeddings in the Lakehouse: Bridging AI and Data Lake Technologies."

About Onehouse

Onehouse, the pioneer in open data lakehouse technology, empowers enterprises to deploy and manage a world-class data lakehouse in minutes on Apache Hudi, Apache Iceberg, and Delta Lake. Delivered as a fully-managed cloud service in your private cloud environment, Onehouse offers high-performance ingestion pipelines for minute-level freshness and optimizes tables for maximum query performance. Thanks to its truly open data architecture, Onehouse eliminates data format, table format, compute and catalog lock-ins, guarantees interoperability with virtually any warehouse/data processing engine, and ensures exceptional ELT and query performance for all your workloads. Companies worldwide rely on Onehouse to power their analytics, reporting, data science, machine learning, and GenAI use cases from a single, unified source of data.
Built on Apache Hudi and Apache XTable (Incubating), Onehouse features advanced capabilities such as indexing, ACID transactions, and time travel, ensuring consistent data across all downstream query engines and tools. The platform's unique incremental processing capabilities deliver unmatched ELT cost and performance by minimizing data movement and optimizing resource usage. With 24/7 reliability, immediate cost savings, and open access for all major tools and query engines, Onehouse's #nolockin philosophy helps future-proof any stack.

To learn more, visit https://www.onehouse.ai.
Onehouse, a data lakehouse company, has announced the launch of vector embeddings support, a new feature designed to help organizations manage and reduce the costs associated with AI model training by streamlining how vector embeddings are created and stored at scale. The development comes as businesses increasingly seek efficient ways to handle large-scale data for machine learning and artificial intelligence applications [1].
Vector embeddings are numerical representations of data that capture semantic meaning, allowing machines to process and understand complex information more effectively. These embeddings are crucial for various AI applications, including natural language processing, image recognition, and recommendation systems [2].
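"Capturing semantic meaning" cashes out as geometry: related items get vectors that point in similar directions, typically compared with cosine similarity. A self-contained sketch with made-up three-dimensional vectors (real models emit hundreds or thousands of dimensions):

```python
import math


def cosine_similarity(u, v):
    """Similarity of two embedding vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Toy embeddings, invented for illustration: semantically related words
# are assigned nearby vectors.
king = [0.9, 0.8, 0.1]
queen = [0.9, 0.7, 0.2]
apple = [0.1, 0.2, 0.9]

# "king" is closer to "queen" than to "apple" in embedding space.
assert cosine_similarity(king, queen) > cosine_similarity(king, apple)
```

Nearest-neighbor search over exactly this kind of similarity score is what vector databases, and lakehouse-stored embeddings, exist to serve at scale.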
Onehouse's new feature aims to address the significant costs associated with generating and storing vector embeddings. By integrating this capability into its data lakehouse platform, Onehouse enables organizations to create and manage embeddings at scale without the need for separate vector databases or additional infrastructure [1].

The introduction of vector embeddings support by Onehouse is expected to have a significant impact on the AI and machine learning industry. By making it easier and more cost-effective for organizations to work with vector embeddings, Onehouse is potentially accelerating the adoption and development of AI technologies across various sectors [1].

As AI continues to evolve and become more integral to business operations, solutions like Onehouse's vector embeddings support are likely to play a crucial role in making advanced AI applications more accessible and manageable for a wider range of organizations. This development may lead to increased innovation and competitiveness in the AI space [2].
The Outpost is a comprehensive collection of curated artificial intelligence software tools that cater to the needs of small business owners, bloggers, artists, musicians, entrepreneurs, marketers, writers, and researchers.
© 2025 TheOutpost.AI All rights reserved