3 Sources
[1]
Databricks says it solved the decades-old data pipeline problem that's been slowing AI agents
For decades, data professionals have struggled with the challenge of managing both operational and analytical databases in a unified approach that doesn't introduce latency and performance degradation. Agents made the problem structural. A system that reasons continuously and acts on live data cannot tolerate a pipeline between itself and the information it needs to act on.
At the Data + AI Summit on Tuesday, Databricks announced two products aimed at collapsing that infrastructure. Lakehouse//RT delivers millisecond query latency directly on governed Delta and Iceberg tables, eliminating the dedicated real-time serving tier that enterprises have maintained alongside their lakehouses. LTAP, short for Lake Transactional/Analytical Processing, stores Postgres-native transactional data in Delta and Iceberg format from the point of write, removing the ETL pipelines that have connected operational and analytical systems for decades.
Reynold Xin, co-founder of Databricks, described a simpler data stack as "the holy grail for agents" in a briefing with VentureBeat, arguing that as users vibe code more applications, the agents reasoning analytically on top of those apps need the underlying infrastructure out of the way to move fast. "The agents really prefer a much simpler stack, because they can move way faster," he said.
LTAP bets on storage-layer unification where HTAP tried engine convergence
Many vendors have tried various approaches over the decades to unify analytical and transactional data. Back in 2014, analyst firm Gartner coined the term HTAP, an acronym that stands for Hybrid Transactional/Analytical Processing, as a way to describe vendors that attempted to unify the two types of databases. Vendors including MemSQL (now known as SingleStore), SAP HANA and Oracle's MySQL HeatWave are among many HTAP vendors in the market. LTAP is Databricks' answer to HTAP, using the Lakebase architecture to unify data at the storage layer rather than the engine level.
Lakebase is Databricks' serverless cloud-based PostgreSQL database service that became generally available in February. "HTAP to us is kind of more of a failure of the industry rather than a success," Xin said. The LTAP approach goes to the storage layer instead of the query layer. Lakebase previously stored Postgres data in Postgres format on object storage, requiring conversion before the Lakehouse's analytical engines could use it efficiently. With LTAP, transactional data lands directly in Delta or Iceberg format, sharing the same copy that analytical workloads read. Postgres remains the transactional engine. Spark and the Lakehouse remain the analytical engine. "The whole point is, hey, you use the best tool for the job at the query engine level, we just make sure underlying storage is a single copy of the data," Xin said.
The central engineering challenge is latency. Object storage carries response times in the seconds range, far too slow for OLTP workloads that require sub-millisecond performance. Lakebase handles this through a caching layer between Postgres compute instances and object storage. The key design decision is where the column conversion happens: idle CPU capacity in that caching layer performs the row-to-column conversion before data lands in object storage. "When you convert data from row to column, it compresses more than 10 times, typically, so now you substantially reduce the network cost of that basic caching layer between that caching layer and the object stores," Xin said.
Lakehouse//RT delivers millisecond query latency on live lakehouse data without a separate serving tier
Lakehouse//RT is Databricks' answer to the dedicated real-time serving tier -- the separate system enterprises have maintained alongside their lakehouses to handle low-latency queries, at the cost of data copies, split governance and pipeline complexity agents cannot work around.
Key capabilities of Lakehouse//RT include:
* Reyden compute engine: Built specifically for high-concurrency, low-latency serving, Reyden queries Delta and Iceberg tables directly without moving data out of the lakehouse.
* Latency and throughput: Lakehouse//RT delivers sub-100ms latency at 12,000 queries per second, with response times as low as 10ms on smaller datasets and up to 16x better performance than existing dedicated serving stacks.
* Governance and data access: Every query runs within Unity Catalog's governance framework with no separate permissions layer, no data copies and no ingestion pipelines.
Analysts see the agentic framing and open format approach as the real differentiators
The problem both products address is well-documented among enterprise data teams, but analysts draw a distinction between the pain point and the specific claim Databricks is making. "Enterprises have had HTAP, streaming, cloud warehouses, and operational stores for years," Stephanie Walter, Practice Leader for AI Stack at HyperFRAME Research, told VentureBeat. "What is different is the agentic AI framing." Walter noted that agents need live operational data, historical context, governance, retrieval, and write-back in the same workflow. "That is a strong architecture argument, but Lakebase still has to prove it can meet the latency, reliability, and operational maturity CIOs expect," she said.
Mike Leone, analyst at Moor Insights and Strategy, said the path to genuine differentiation is more specific than the unification concept itself. He also noted that open analytics on a data lake is table stakes now, with many vendors providing some sort of service. "The less common move is letting the transactional writes land in open formats too, so the operational database isn't sitting in a proprietary box while only the analytics half is open," Leone told VentureBeat.
He added that the open format approach, paired with Lakehouse//RT querying live data directly off the lake, is what gives the architecture a credible case for retiring a whole row of specialized systems. The technical claim that will face the most scrutiny is also the most central one. "The piece I'd still want their engineers to walk through is how both engines truly share one copy without a quiet conversion step doing the syncing in the middle," Leone said.
What this means for enterprises
For data engineers evaluating their stack for agentic workloads, the question is no longer which best-of-breed tool to run for each job -- it's whether running separate tools at all is still defensible. Enterprises that built separate operational databases, real-time serving tiers and analytical lakehouses could previously treat the gaps between them as a maintenance burden. Agents surface those gaps as an operational risk: a system reasoning across governance boundaries will find the inconsistencies faster than any human team.
The market is moving away from specialized serving layers faster than most vendor roadmaps anticipated. According to VB Pulse Q1 2026, a three-wave longitudinal survey of 100-plus employee organizations, hybrid retrieval intent tripled from 10.3% to 33.3% across the quarter while standalone vector database adoption declined across every tracked vendor. The same consolidation logic is now hitting the real-time serving tier. The traditional approach -- best-of-breed tools for each workload type, pipelines between them -- was built for human-speed analytical consumption. Agent workloads don't tolerate that architecture. "The pain they're pointing at, all the copying and syncing between operational and analytical systems, is real and expensive, and anyone running this at scale feels it," Leone said.
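The row-to-column conversion Xin describes above can be illustrated with a small sketch. This is not Databricks' implementation: the helper names (`rows_to_columns`, `run_length_encode`, `run_length_decode`) and the sample orders are hypothetical, and run-length encoding is used here only as one simple example of why values grouped by column tend to compress well. It makes no attempt to reproduce the "more than 10 times" figure quoted in the article.

```python
from itertools import groupby


def rows_to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    if not rows:
        return {}
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical transactional rows, as an OLTP engine would write them.
orders = [
    {"order_id": 1, "status": "paid", "country": "US"},
    {"order_id": 2, "status": "paid", "country": "US"},
    {"order_id": 3, "status": "paid", "country": "US"},
    {"order_id": 4, "status": "refunded", "country": "US"},
    {"order_id": 5, "status": "paid", "country": "DE"},
    {"order_id": 6, "status": "paid", "country": "DE"},
]

# Column layout: all values of one field sit next to each other.
columns = rows_to_columns(orders)

# Low-cardinality columns collapse into a handful of runs.
encoded_status = run_length_encode(columns["status"])
encoded_country = run_length_encode(columns["country"])
```

In the row layout the repeated `status` and `country` values are interleaved with unique order IDs, so the runs that the encoder exploits only become visible after the pivot.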
[2]
Databricks declares the end of pipelines with a unified platform for operational and analytical data
Databricks Inc. is using its Data + AI Summit today in San Francisco to unveil a new data architecture designed to eliminate one of enterprise computing's oldest bottlenecks: the separation between transactional databases and analytical systems. The company is also introducing a real-time analytics engine that it says removes the need for separate serving infrastructure while delivering millisecond response times.
The new architecture, called Lake Transactional/Analytical Processing, unifies operational and analytical workloads on a single copy of data stored in a data lake. Databricks said the approach enables applications, analytics systems and artificial intelligence agents to access the same data without the change data capture pipelines, extract/transform/load processes and replicated databases that have traditionally connected operational and analytical environments.
The company said conventional architectures are ill-suited to the emerging world of AI agents, which continuously read, analyze and act upon data in near-real time. "You've got more code being written than ever before, which means you've got lots more applications," said Shanku Niyogi, Databricks' vice president of product management. "Those applications are powered agents that need to both reason and act on data more quickly than humans can manage. So the data stack becomes the bottleneck."
Enterprises have long maintained separate systems for transaction and analytical processing. Operational applications typically write data to transactional databases, while analytical systems consume copies of that data through ETL and change data capture pipelines, which monitor databases for modifications and propagate them to downstream destinations.
Databricks argues that this architecture introduces latency, complexity and governance challenges that become more pronounced as AI-driven applications proliferate. Niyogi said many organizations are struggling to manage the growing number of pipelines required to synchronize operational and analytical systems. "We've been joking that CDC is 'continuous data corruption,'" he said. "Every time something changes, you've got a new pipeline." He cited a large banking customer that now maintains "hundreds of thousands of Postgres databases, each with CDC pipelines bringing data back to the lake."
Built on Lakebase
LTAP builds upon Databricks' Lakebase database platform, introduced last year. Lakebase separates database computing from storage. LTAP writes transactional data directly into open columnar formats such as Delta Lake and Apache Iceberg while maintaining PostgreSQL compatibility for applications. Niyogi said the architecture allows transactional applications to continue operating with native PostgreSQL performance while making data instantly available for analytics and machine learning workloads. "You're getting Postgres performance and Postgres semantics," he said, "but underneath the covers, when we write storage out to the lake, we're instantly writing that to columnar formats, which means any analytics engine now has access to all of your operational data. There are no pipelines and no latency."
Columnar storage is a database architecture that stores data sequentially by column instead of by row to speed up analytical queries. Databricks said LTAP relies on open formats and it plans to open-source technology that enables PostgreSQL data to be stored in the Apache Parquet format while preserving compatibility.
New analytics engine
The company today is also introducing Lakehouse//RT, an analytics engine that it said brings real-time query performance directly to lakehouse environments.
Traditionally, organizations seeking speedy access to analytical data have had to deploy specialized serving systems, caches or real-time databases alongside their data lakes. Lakehouse//RT is powered by a new execution engine called Reyden that Databricks claims can deliver response times as low as 10 milliseconds for smaller workloads and under 100 milliseconds for larger workloads while supporting tens of thousands of concurrent users and agents. The company said customers have reported up to 16 times better performance than existing real-time serving architectures.
Niyogi described the product as a major evolution of the lakehouse concept. "With Lakehouse RT, we can actually serve data now directly out of the warehouse to tens of thousands of concurrent users with very low latency," he said. The company sees both announcements as foundational technologies for AI-driven enterprises, where agents will increasingly execute business processes and make operational decisions. "Agents need the best data," Niyogi said. "If they're getting stale or wrong data, they act poorly." Traditional architectures featuring separate transactional systems, analytical systems and serving layers "are just not a platform that you can put millions of agents on," he said.
LTAP is available as an upgrade for Lakebase customers, while Lakehouse//RT is entering beta testing. Databricks said existing Lakehouse customers can adopt Lakehouse//RT as a drop-in replacement for current warehouse deployments and will receive access through their existing subscriptions, with promotional pricing planned during the first year.
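For readers unfamiliar with the change data capture pipelines this article says LTAP removes, the sketch below shows the basic mechanism under simplifying assumptions: a change log of insert, update and delete events is replayed into a separate analytical copy. The event shape and the names `apply_change` and `replay` are hypothetical and do not correspond to any particular CDC tool.

```python
def apply_change(replica, event):
    """Apply one change event to the downstream copy, in place."""
    op = event["op"]
    key = event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]
    elif op == "delete":
        replica.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {op!r}")


def replay(events):
    """Build a downstream copy by replaying events in order."""
    replica = {}
    for event in events:
        apply_change(replica, event)
    return replica


# Hypothetical change log emitted by an operational database.
change_log = [
    {"op": "insert", "key": 1, "row": {"status": "new"}},
    {"op": "insert", "key": 2, "row": {"status": "new"}},
    {"op": "update", "key": 1, "row": {"status": "paid"}},
    {"op": "delete", "key": 2},
]

# A pipeline that has consumed every event matches the source.
in_sync = replay(change_log)

# A pipeline that has stalled after two events serves stale data.
lagging = replay(change_log[:2])
```

The second copy still reports order 1 as new and order 2 as present, which is the kind of stale or wrong data Niyogi warns about. Writing one shared copy, as LTAP is described as doing, is meant to avoid having a second dataset that can fall behind.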
[3]
Why Databricks calls CDC 'continuous data corruption' - and what it built instead
Shanku Niyogi, Vice President of Product Management at Databricks, has a new name for an old acronym. CDC, the streaming pipeline technique that has shuttled operational data into the analytics warehouse for years, is - in his words - "continuous data corruption." The original meaning is change data capture, and most data engineers will recognize where Niyogi is going. CDC pipes a copy of every change in a transactional database - the live system running orders, payments and stock - over to the analytics warehouse. That way, analysts can query yesterday's data without slowing down today's customers. It is a workaround for a 40-year-old split that exists because the two kinds of database were built for incompatible jobs. The workaround, by Niyogi's account during an interview at Databricks' Data and AI Summit in San Francisco, has not aged well. Niyogi says:
"CDC was slow, and it was buggy, and it was expensive. Pipelines break down. Schemas change. So we're calling it continuous data corruption - which I think for a lot of data engineers more accurately reflects the pain."
Niyogi's description gives Databricks a natural opening to the launch of what the company calls an "agentic data foundation" - an attempt to merge transactional and analytical data into one architecture rather than continuing to bridge them with brittle pipelines. There are two pieces of news this week. One is Lake Transactional/Analytical Processing, or LTAP, the architectural concept under which Databricks is grouping the work. The other is Lakehouse//RT, a new real-time query layer on the existing lakehouse that Niyogi calls "the biggest innovation we've had since we started the lakehouse" in 2020. The argument is that the old way of doing things does not survive the next 18 months. Niyogi continues:
"This year, the amount of code being written in the world has gone up 50x. We think in the next 12 months, more code will be written than in the history of coding. More applications are being written by AI, and these applications are powered by AI. They need to get data, reason over data, act over data, all in near real time. Data teams cannot keep up building and maintaining CDC pipelines."
That underlying concern is one most enterprise data teams will recognize - the volume of new applications, agentic or otherwise, is outpacing the team headcount available to plumb them. Niyogi mentions a large bank he spoke with earlier this week. He says:
"Hundreds of thousands of Postgres databases that they're having to build CDC pipelines for. The common pattern we're seeing is that the problems always existed. It's just the scale, and the need for that data in near real time, has now reached a stage where you just can't do this inefficient process anymore."
What LTAP actually is
LTAP is the architectural label; Lakebase is the product where it lives. Databricks launched Lakebase a year ago, built on the team and technology of Neon, a Postgres database company it acquired in 2025. The signature design choice is the separation of compute from storage. Most operational databases keep both close together on local disk for speed; Lakebase splits them, so the storage layer can run on ordinary cloud object storage - the cheap, infinitely scalable kind of storage that Amazon S3, Microsoft Azure Blob and Google Cloud Storage provide. That choice is not new for analytics. The whole modern data warehouse, including Databricks' own, has been built this way for the last decade. The new part is doing it for transactional workloads, which usually need the predictable, sub-millisecond latency of local disk to keep applications responsive. Niyogi explains:
"Cloud storage is cheap but it's not always reliable and fast. We've had years of experience doing this in the data warehousing world. So we brought that experience to this."
The engineering trick is what happens when a Postgres database in Lakebase writes a row.
Postgres writes the data in row format, the way operational databases need to. At the same time, Lakebase converts that data to columnar format - Delta Lake and Apache Iceberg, two open table formats that analytics engines can read directly - and stores it on the lake. He describes the effect:
"The data within minutes shows up in analytical format where you can instantly access it. It's completely open, which means it works with any choice of analytical tool. We're using Iceberg and Delta formats."
Operational data and the analytical copy are not two physical datasets that need to be reconciled by a pipeline. They are the same data written in two formats, in one place, governed by one system. During a deeper discussion into Lakebase, and the developer experience, Niyogi singles out database branching - the ability to clone a database the way a developer clones a Git branch, run an experiment, and merge or discard the result - as particularly relevant to agent workloads, since agents need somewhere safe to test transactions that might touch production state.
The product has, by Niyogi's count, more than 3,500 customers a year in, and Databricks now puts platform-wide activity at 12 million database launches per day. Named Lakebase customers include Block, Superhuman, Zillow, and Ensemble, the healthcare revenue cycle management firm whose Chief Technology Officer (CTO), Grant Veazey, says the platform is supporting more than two petabytes of clean, harmonized data underpinning AI-led revenue recovery in live hospital operations. Niyogi also cites Afresh, a supply chain analytics company that moved off Microsoft Azure SQL: data movement that used to take days, he says, now takes minutes.
Alongside LTAP, Databricks is adding three new Lakebase capabilities: cross-cloud and cross-region disaster recovery, Git-style branching and snapshots in the database itself, and autonomous database operations - an agent that monitors database health, detects slowdowns, proposes indexes and assists with recovery. LTAP itself, the press materials note, is coming soon as part of Lakebase rather than shipping today.
Lakehouse//RT and the one-second barrier
The second announcement, Lakehouse//RT, sits on the analytics side of the same architecture. The problem Niyogi describes is one most application developers have hit. He says:
"Every data warehouse in trying to serve those application workloads has run into kind of a wall around the one-second barrier. If you write an application that's getting some data directly from a warehouse, your users wait. All those applications where you do a query, and now you have to wait."
The workaround has been to copy the relevant data out of the warehouse into a faster system - often Redis as a cache, sometimes a purpose-built analytics serving database - and serve from there. Niyogi notes:
"That meant for developers, the warehouse was not even a thing they wrote against."
Lakehouse//RT is Databricks' attempt to remove that compromise. The engine behind it, Databricks confirms in its press materials, is called Reyden - a new compute engine built for the concurrency and latency demands of agent workloads, with what Databricks describes as a "fully asynchronous execution model." Niyogi observes:
"With Lakehouse//RT we've done some really interesting engineering to break through that barrier."
The numbers Databricks puts on it are eye-catching. According to the press release, they include:
* Sub-100-millisecond latency at 12,000 queries per second on standard analytical benchmarks.
* Response times as low as 10 milliseconds on smaller datasets.
* Up to 16-times performance improvement over existing specialized real-time serving stacks.
* Tens of thousands of concurrent users and agents.
Niyogi adds:
"Now you can get essentially a 10-millisecond floor on being able to query directly against the lakehouse, with tens of thousands of concurrent users. That means the entire company can run an application that bangs on the warehouse directly and get performance very similar to a database that's cached that data. But again - no copy of data, single governance model, and a significant simplification for developers."
There are two named customers in the launch. Cisco's Chris Kopek, Head of Data Platforms, reports threat-lookup queries against Lakehouse//RT running five times faster than the company's previous setup. Magnite's Kayvon Raphael, Senior Director of Engineering, says the platform is delivering sub-200-millisecond performance on Magnite's core dashboard queries at hundreds of queries per second across its real-time client-facing performance data. Both are using Lakehouse//RT in beta, which is its current availability status.
The 16-times comparison is to "existing specialized real-time serving stacks" - but the specific systems Databricks is benchmarking against are not named. Cisco's five-times figure is a customer-reported workload result rather than a controlled comparison. Performance claims of this size carry more weight once the workloads and comparators are spelled out, and that detail will be the thing to look for in the formal documentation Databricks publishes around the launch.
The mental model for developers
Pulled together, the two products give developers a fairly clean rule of thumb. As so often in these conversations, I ask Niyogi to explain the difference in layman's terms. Lakebase is where applications write. It is Postgres-compatible, so existing application code, drivers and object-relational mappers work without modification. Lakehouse//RT is where applications read, when they need analytical context delivered at speed.
The conversion from operational data to analytical data happens automatically, in minutes, without a pipeline in the middle. Niyogi says:
"Many of those workloads can go directly against the lakehouse and get the same performance benefits. But when you are writing to the database - when an agent needs memory, or you are writing customer records - using Lakebase, all that data will be available to the analytics platform in a matter of minutes."
His prediction is conditional on the architecture holding up:
"I think maybe developers will start to work against the lakehouse directly as well. They'll be getting their data from there and then writing data into Postgres. And they won't have to worry about how data gets from one world to the other."
For a developer used to maintaining a stack of Postgres for writes, a warehouse for analytics, a Redis layer for caching and one or more Extract, Transform, Load (ETL) or CDC pipelines connecting them, that simplification is the actual product.
The CIO message
Niyogi's pitch to the Chief Information Officer (CIO) is, fundamentally, an argument that the old shape of the data stack is incompatible with the next wave of enterprise applications. He says:
"Every CIO today is looking to get value out of AI. Most CIOs are concerned about building production AI that is changing their business processes. They are doing that by building applications, and they are building hundreds of applications. Those applications are going to create data at great new volumes, and they're going to need to access and reason over that data in near real time. You can no longer afford to think of those two systems as two separate systems and then stitch them together with these inefficient processes that were built a couple of decades ago. You need to think of your data platform as a single data platform."
My take
S&P Global Market Intelligence puts the share of organizations with even one AI agent in production at 31% this year - some distance from the hundred-application picture Niyogi describes. The migration path from other platforms is also a real consideration. But plenty of enterprise architects would likely agree that a data stack that supports a thousand agentic applications cannot be assembled out of CDC pipelines and reverse ETL.
Writing Postgres data once and having it show up as analytical data on open object storage, within minutes, without a pipeline anywhere in the middle, is real engineering. Previous attempts at this died on the developer-experience step - engineers got asked to abandon Postgres, or to learn a new query language for the privilege. Building on Postgres rather than inventing a new dialect is what gives Lakebase a fighting chance, and the customer roster - Block, Zillow, Superhuman, Ensemble - suggests it is being taken seriously in production.
Lakehouse//RT depends on numbers holding. Sub-second querying straight against the lakehouse, at concurrency, would change what enterprise developers do with their warehouse data. If Databricks' 16-times figure holds up under independent benchmarking - and Cisco's reported five-times on threat lookups suggests directionally that it might - the specialized serving layer that has lived alongside the warehouse faces a competitor that does not ask data to leave the lakehouse. Reyden, the engine doing this work, now has a name. What still does not have a name is which systems specifically Databricks is benchmarking against, but the numbers will carry more weight once comparators are spelled out.
Databricks' argument rests on open formats - Delta Lake, Apache Iceberg, Postgres - being safe foundations for enterprise consolidation.
Whether "open" carries the same weight at the protocol and governance layer as it does at the storage layer is a separate question, with stakes for OpenSharing, the protocol Databricks recently donated to the Linux Foundation. Coverage of that is coming next.
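The headline figure quoted above (sub-100-millisecond latency at 12,000 queries per second) can be put in rough perspective with Little's law, which relates average in-flight work to arrival rate and time in the system. The sketch below is back-of-the-envelope arithmetic, not a benchmark: the 20,000-user figure is a hypothetical stand-in for "tens of thousands", and the 10-millisecond case assumes the same query rate even though Databricks quotes that floor only for smaller datasets.

```python
def in_flight_queries(queries_per_second, latency_ms):
    """Little's law: average in-flight work = arrival rate x time in system."""
    return queries_per_second * latency_ms / 1000


QPS = 12_000              # throughput quoted in the press release
LATENCY_CEILING_MS = 100  # "sub-100-millisecond" treated as an upper bound
LATENCY_FLOOR_MS = 10     # quoted floor; same QPS is an assumption here
ASSUMED_USERS = 20_000    # hypothetical reading of "tens of thousands"

# Upper bound on queries executing at any instant at the quoted rate.
at_ceiling = in_flight_queries(QPS, LATENCY_CEILING_MS)

# The same rate at the quoted latency floor.
at_floor = in_flight_queries(QPS, LATENCY_FLOOR_MS)

# Average query rate each user could sustain if all shared that throughput.
per_user_rate = QPS / ASSUMED_USERS

print(f"in flight at 100 ms: {at_ceiling:.0f}")
print(f"in flight at 10 ms: {at_floor:.0f}")
print(f"queries per second per user: {per_user_rate:.1f}")
```

Under these assumptions the engine would be holding at most about 1,200 queries in flight at once, and 20,000 concurrent users would each average a little over one query every two seconds. That is why the unnamed workloads and comparators matter when reading the claims.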
Databricks announced two products at its Data + AI Summit that aim to solve a decades-old infrastructure challenge. The company's LTAP architecture and Lakehouse//RT engine eliminate the need for separate data pipelines between operational and analytical systems, delivering millisecond query latency directly on Delta Lake and Apache Iceberg tables. The move addresses a critical bottleneck as enterprises scale AI agents that need to reason and act on live data without delays.

At the Data + AI Summit on Tuesday, Databricks announced a fundamental shift in how enterprises handle operational and analytical data, introducing two products designed to eliminate infrastructure that has slowed AI agents [1]. The company unveiled Lake Transactional/Analytical Processing (LTAP) and Lakehouse//RT, technologies that promise to collapse the decades-old separation between transactional databases and analytical systems [2].
Reynold Xin, co-founder of Databricks, described a simpler data stack as "the holy grail for agents," arguing that as users generate more applications, AI agents reasoning analytically need the underlying infrastructure out of the way to move fast [1]. The challenge is structural: a system that reasons continuously and acts on live data cannot tolerate a pipeline between itself and the information it needs to act on.
LTAP stores PostgreSQL-native transactional data in Delta Lake and Apache Iceberg format from the point of write, eliminating ETL pipelines that have connected operational and analytical systems for decades [1]. The architecture builds upon Lakebase, Databricks' serverless cloud-based PostgreSQL database service that became generally available in February, built on technology from the Neon acquisition [3].
Shanku Niyogi, Databricks' vice president of product management, has renamed Change Data Capture (CDC) as "continuous data corruption," reflecting widespread frustration with pipeline reliability. "CDC was slow, and it was buggy, and it was expensive. Pipelines break down. Schemas change," Niyogi said during an interview at the summit [3]. He cited a large banking customer maintaining hundreds of thousands of Postgres databases, each requiring CDC pipelines to bring data back to the lake [2].
The LTAP approach unifies data at the storage layer rather than the engine level, distinguishing it from earlier HTAP (Hybrid Transactional/Analytical Processing) attempts. "HTAP to us is kind of more of a failure of the industry rather than a success," Xin noted [1]. Instead of converging engines, LTAP maintains PostgreSQL compatibility for transactional workloads while simultaneously writing data in columnar formats like Delta Lake and Apache Iceberg that analytical engines can read directly.
Lakehouse//RT delivers sub-100ms latency at 12,000 queries per second, with response times as low as 10ms on smaller datasets and up to 16 times better performance than existing dedicated serving stacks [1]. The product is powered by a new execution engine called Reyden, built specifically for high-concurrency, low-latency serving that queries Delta Lake and Apache Iceberg tables directly without moving data out of the lakehouse.
Niyogi described Lakehouse//RT as "the biggest innovation we've had since we started the lakehouse" in 2020, noting that it removes the need for separate serving infrastructure while delivering real-time data access [3]. Every query runs within Unity Catalog's governance framework with no separate permissions layer, no data copies and no ingestion pipelines [1].
The urgency stems from explosive growth in code generation. "This year, the amount of code being written in the world has gone up 50x. We think in the next 12 months, more code will be written than in the history of coding," Niyogi said [3]. These applications, increasingly powered by AI agents, need to read, analyze and act upon data in near real-time, making traditional architectures with separate transactional systems, analytical systems and serving layers inadequate [2].
"Agents need the best data," Niyogi explained. "If they're getting stale or wrong data, they act poorly" [2]. The central engineering challenge is latency, as object storage carries response times in the seconds range, far too slow for OLTP workloads requiring sub-millisecond performance. Lakebase handles this through a caching layer between Postgres compute instances and object storage, with idle CPU capacity performing row-to-column conversion before data lands in object storage. When data converts from row to column, it compresses more than 10 times typically, substantially reducing network costs [1].
Databricks plans to open-source technology that enables PostgreSQL data to be stored in Apache Parquet format while preserving compatibility, reinforcing its commitment to open formats [2]. As enterprises grapple with scaling AI agents, the ability to eliminate pipeline complexity while maintaining governance and performance will determine which organizations can deploy autonomous systems effectively.
Summarized by Navi