2 Sources
[1]
How MassMutual and Mass General Brigham turned AI pilot sprawl into production results
Enterprise AI programs rarely fail because of bad ideas. More often, they get stuck in ungoverned pilot mode and never reach production. At a recent VentureBeat event, technology leaders from MassMutual and Mass General Brigham explained how they avoided that trap -- and what the results look like when discipline replaces sprawl. At MassMutual, the results are concrete: 30% developer productivity gains, IT help desk resolution times reduced from 11 minutes to one, and customer service calls cut from 15 minutes to just one or two. "We're always starting with why do we care about this problem?" Sears Merritt, MassMutual's head of enterprise technology and experience, said at the event. "If we solve the problem, how are we gonna know we solved it? And, how much value is associated with doing that?"

Defining metrics, establishing strong feedback loops

MassMutual, a 175-year-old company serving millions of policy owners and customers, has pushed AI into production across the business -- customer support, IT, customer acquisition, underwriting, servicing, claims, and other areas. Merritt said his team follows the scientific method, beginning with a hypothesis and testing whether it has an outcome that will tangibly drive the business forward. Some ideas are great, but they may be "intractable in the business" due to factors like lack of data or access, or regulatory constraints. "We won't go any further with an idea until we get crystal clear on how we're going to measure, and how we're going to define success." Ultimately, it's up to different departments and leaders to define what quality means: Choose a metric and define the minimum level of quality before a tool is placed into the hands of teams and partners. That starting point creates a quick feedback loop. "The things that we find slow us down is where there isn't shared clarity on what outcome we're trying to achieve," which can lead to confusion and constant re-adjusting, said Merritt.
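The gating discipline Merritt describes -- choose a metric and define the minimum level of quality before a tool ships -- can be sketched as a simple offline evaluation gate. This is an illustrative sketch only; the metric, threshold, function names, and eval set are invented for the example, not MassMutual's actual tooling.

```python
# Hypothetical sketch of a release gate: a tool ships only if its score on an
# agreed metric clears a minimum quality bar defined up front with the business.
# All names and numbers here are illustrative, not MassMutual's tooling.

def exact_match(predicted: str, expected: str) -> float:
    """A deliberately simple quality metric: 1.0 on an exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0

def release_gate(eval_set, model_fn, metric=exact_match, min_quality=0.90):
    """Run the model over a held-out eval set and gate release on the metric."""
    scores = [metric(model_fn(case["input"]), case["expected"]) for case in eval_set]
    avg = sum(scores) / len(scores)
    return {"score": avg, "ship": avg >= min_quality}

# Usage: a toy "model" (canned answers) against a tiny eval set.
eval_set = [
    {"input": "reset password", "expected": "Use the self-service portal."},
    {"input": "vpn down", "expected": "Restart the VPN client."},
]
canned = {"reset password": "Use the self-service portal.",
          "vpn down": "Restart the VPN client."}
result = release_gate(eval_set, lambda q: canned[q])
print(result)  # {'score': 1.0, 'ship': True}
```

The point of the sketch is the order of operations: the metric and threshold exist before the tool does, so the "business partner says yes" moment is a number, not an impression.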
"We don't go to production until there is a business partner that says, 'Yes, that works.'" His team is strategic about evaluating emerging tools, and "extremely rigorous" when testing and measuring what "good" means. For instance, they perform trust scoring to lower hallucination rates, establish thresholds and evaluation criteria, and monitor for feature and output drift. Merritt also operates with a no-commitment policy -- meaning the company doesn't lock itself into using a particular model. It has what he calls an "incredibly heterogeneous" technology environment combining best of breed models alongside mainframes running on COBOL. That flexibility isn't accidental. His team built common service layers, microservices and APIs that sit between the AI layer and everything underneath -- so when a better model comes along, swapping it in doesn't mean starting over. Because, Merritt explained, "the best of breed today might be the worst of breed tomorrow, and we don't want to set ourselves up to fall behind." Weeding instead of letting a thousand flowers bloom Mass General Brigham (MGB), for its part, took more of a spray and pray approach -- at first. Around 15,000 researchers in the not-for-profit health system have been using AI, ML, and deep learning for the last 10 to 15 years, CTO Nallan "Sri" Sriraman said at the same VB event. But last year, he made a bold choice: His team shut down a sprawl of non-governed AI pilots. Initially, "we did follow the thousand flowers bloom [methodology], but we didn't have a thousand flowers, we had probably a few tens of flowers trying to bloom," he said. Like Merritt's team at MassMutual, MGB pivoted to a more holistic view, examining why they were developing certain tools for specific departments of workflows. They questioned what capabilities they wanted and needed and what investment those required. 
Sriraman's team also spoke with their primary platform providers -- Epic, Workday, ServiceNow, Microsoft -- about their roadmaps. This was a "pivotal moment," he noted, as they realized they were building in-house tools that vendors were already providing (or were planning to roll out). As Sriraman put it: "Why are we building it ourselves? We are already on the platform. It is going to be in the workflow. Leverage it." That said, the marketplace is still nascent, which can make for difficult decisions. "The analogy I will give is when you ask six blind men to touch an elephant and say, what does this elephant look like?" Sriraman said. "You're gonna get six different answers." There's nothing wrong with that, he noted; it's just that everybody is discovering and experimenting as the landscape keeps shifting.

Instead of a Wild West environment, Sriraman's team distributes Microsoft Copilot to users across the business, and uses a "small landing zone" where they can safely test more sophisticated products and control token use. They also began "consciously embedding AI champions" across business groups. "This is kind of a reverse of letting a thousand flowers bloom, carefully planting and nourishing," Sriraman said.

Observability is another big consideration; he describes real-time dashboards that track model drift and safety and allow IT teams to govern AI "a little more pragmatically." Health monitoring is critical with AI systems, he noted, and his team has established principles and policies around AI use, not to mention least-privilege access. In clinical settings, the guardrails are absolute: AI systems never issue the final decision. "There's always going to be a doctor or a physician assistant in the loop to close the decision," Sriraman said. He cited radiology report generation as one area where AI is used heavily, but where a radiologist always signs off.
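The real-time drift dashboards Sriraman mentions above rest on checks like the following: compare a recent window of some model health signal against a reference window and alert when it shifts too far. This is a hedged sketch; the signal (model confidence scores), window sizes, and threshold are illustrative assumptions, not MGB's implementation.

```python
# Hypothetical sketch of a dashboard-style drift check: flag when the recent
# mean of a model health signal (here, confidence scores) drifts too far from
# a reference window, measured in reference standard deviations.
from statistics import mean, stdev

def drift_alert(reference, recent, max_shift_sigmas=3.0):
    """Return True when the recent mean has shifted beyond the allowed bound.

    Assumes the reference window has nonzero spread."""
    ref_mean, ref_sd = mean(reference), stdev(reference)
    shift = abs(mean(recent) - ref_mean) / ref_sd
    return shift > max_shift_sigmas

# Usage: a stable reference window, one healthy recent window, one drifted one.
reference = [0.91, 0.89, 0.92, 0.90, 0.88, 0.93, 0.90, 0.91]
healthy   = [0.90, 0.92, 0.89, 0.91]
drifted   = [0.55, 0.60, 0.52, 0.58]
print(drift_alert(reference, healthy))  # False
print(drift_alert(reference, drifted))  # True
```

A production monitor would track many such signals (latency, refusal rates, input distribution) and feed alerts to the IT teams governing the system; the alert-on-shift logic is the common core.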
Sriraman was clear: "Thou shall not do this: Don't show PHI [protected health information] in Perplexity. As simple as that, right?" And, importantly, there must be safety mechanisms in place. "We need a big red button, kill it," Sriraman emphasized. "We don't put anything in the operational setting without that." Ultimately, while agentic AI is a transformative technology, the enterprise approach to it doesn't have to be dramatically different. "There is nothing new about this," Sriraman said. "You can replace the word BPM [business process management] from the '90s and 2000s with AI. The same concepts apply."
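Sriraman's "big red button" is, mechanically, a centrally controlled flag that every AI call must pass through, with a non-AI fallback on the other side. A minimal sketch of that pattern, with all names invented for illustration:

```python
# Hypothetical sketch of the "big red button": every AI call is wrapped in a
# guard that checks a centrally controlled switch, so operators can halt the
# system instantly, without a redeploy. Names are illustrative only.
class KillSwitch:
    def __init__(self):
        self._enabled = True

    def press(self) -> None:
        """The big red button: disable all guarded AI calls at once."""
        self._enabled = False

    def guard(self, ai_call, fallback):
        """Route to the AI only while the switch is live; else use fallback."""
        def wrapped(*args, **kwargs):
            if self._enabled:
                return ai_call(*args, **kwargs)
            return fallback(*args, **kwargs)
        return wrapped

switch = KillSwitch()
draft_report = switch.guard(
    ai_call=lambda case: f"AI draft for {case}",
    fallback=lambda case: f"Manual queue: {case}",
)
print(draft_report("case-17"))  # AI draft for case-17
switch.press()
print(draft_report("case-17"))  # Manual queue: case-17
```

In a real operational setting the flag would live in a shared config store so one press halts every instance, but the contract is the same: no guarded call can outlive the button.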
[2]
Why most enterprise AI projects never reach production: "The model is rarely the main problem," says NTT DATA Consultant Alex Potapov
The consultant behind GenAI programs exceeding $30 million in potential revenue for industrial giants explains what separates the AI pilots that ship from the 42% that get abandoned.

According to Gartner's forecast, by 2028, more than half of enterprise AI models will be domain-specific, and by 2030, most organizations will shift from large engineering teams to smaller, AI-augmented units. The technology is clearly moving fast. The companies using it are not. S&P Global's 2025 Voice of the Enterprise survey, based on responses from over 1,000 enterprises in North America and Europe, found that 42% of companies abandoned most AI initiatives last year, more than double the 17% recorded the year before. On average, organizations scrapped 46% of their proof-of-concept projects before they ever reached production. Budgets get approved, pilots get built, demos get applauded, and then nothing ships. The reasons are not technical but organizational: unready data, unclear ownership, and architectures that were never designed to survive past the presentation.

Alex Potapov sees this pattern from the inside. At NTT DATA, he oversees the full implementation cycle of GenAI initiatives for global clients in energy and insurance, managing projects where a single engagement can represent tens of millions in revenue. Before that, he spent two years at HPE (Aruba), where he launched the Hardware-as-a-Service model and displaced the incumbent vendor at a global technology company, growing order volume from $100,000 to over $1.15 million. Earlier in his career, he held enterprise sales and key account management roles at Oracle and HPE, consistently exceeding sales targets. He holds an MBA from Vanderbilt University.
In this conversation, Potapov explains what he sees going wrong from the inside: why enterprise data is seldom ready for the models companies buy, how unclear ownership quietly kills projects that the technology itself could have saved, and what the surviving initiatives did differently from the 42% that got abandoned.

Alex, the assumption most people still make is that AI projects fail because the model didn't perform. You've participated in GenAI implementations for major industrial and insurance clients at NTT DATA. Where do these projects actually break down?

The model is rarely the main problem. Most projects break down at the intersection of three things: data readiness, integration with enterprise systems, and unclear ownership across teams. Data is often the first bottleneck. Many organizations believe they have high-quality knowledge bases, but when we actually begin building a GenAI solution, we discover that information is fragmented across SharePoint, PDFs, internal tools, and sometimes outdated repositories. Without proper data structuring and governance, even the best models cannot produce reliable outputs. Integration is the second challenge. A GenAI solution becomes valuable only when it fits into existing workflows, whether that is CRM systems or internal support platforms. That work often takes longer than the AI component itself. But the most underestimated challenge is organizational ownership. GenAI initiatives usually sit between IT, data, legal, security, and the business unit. If it is not clearly defined who owns the product after the PoC phase, the project stalls. The technology gets all the attention. The real friction comes from enterprise complexity.

So if the data is messy and the ownership is unclear, but the budget cycle still demands a working PoC in five to six weeks, what can you actually cut? And what are the early signs that a prototype was never designed for production?
You can simplify the interface, limit the number of use cases, and work with a smaller dataset. A PoC does not need a polished front-end or full enterprise coverage. Its purpose is to validate whether the core value hypothesis holds. What you cannot compromise on is the data pipeline. Even in a short PoC, the data flow should resemble the production architecture as closely as possible. If the prototype relies on manually curated or static data that won't exist in production, the results become misleading. Security and compliance are equally non-negotiable, especially in regulated industries. If the PoC bypasses enterprise security controls, it creates something that cannot realistically move forward. And the PoC must be tied to a measurable business outcome. Without a clear KPI, such as reducing research time or accelerating internal workflows, stakeholders have no basis to justify further investment.

As for warning signs, the most common one is heavy manual intervention. If the team is manually preparing prompts, curating datasets, or running processes outside the real workflow during the pilot, the solution is not operational. Another sign is no integration strategy -- a chatbot living in an isolated tab with no plan for connecting to enterprise systems. And the biggest red flag is when the PoC is driven by the innovation or IT team alone, but no business unit is prepared to adopt and operate it. That project will remain a demo.

You've just described the scenario where nobody on the business side is ready to own the result. But there's also the opposite problem -- the business side is interested but can't articulate what they actually need. You deal with early-stage thinking in a different context, too -- as an expert judge at Vanderbilt's Sullivan Family Ideator and the SEC Student Pitch Competition, where you assess whether a raw concept has real commercial viability.
When a large enterprise client walks in with "we want AI" but no concrete plan, how do you move them from that to a funded use case?

It is actually very common for enterprise clients to approach consultants without a fully defined AI use case. The dynamic is not entirely different from what I see at pitch competitions -- someone has conviction that their idea matters, but the commercial logic is not yet structured. In both cases, the question is the same: is there a real problem here, and can you build a viable path from concept to value? In enterprise settings, part of our role is to show what we call the "art of the possible" -- practical examples from prior engagements where generative AI has already delivered value. From there, we run a structured discovery process: interviewing departments and stakeholders to understand current workflows and identify where AI could improve efficiency or decision-making. Once we collect those opportunities, we prioritize them by potential ROI and implementation complexity. This helps the organization focus on use cases that deliver real business impact while being realistic to implement. We then translate the most promising use case into measurable success metrics: time spent per task, number of manual steps, cost per transaction. From there, we build a focused PoC where results can be demonstrated within weeks: reducing internal research time by 30-40%, accelerating proposal generation, or improving knowledge retrieval for support teams. That is typically how you move stakeholders from vague interest to a funded business case.

Once the PoC works, there's still the question of who runs the system after your consulting contract ends. In your experience, how decisive is this ownership question for whether a project survives past the pilot?

Ownership is one of the most underestimated factors. Many pilots demonstrate promising results, but the real question is what happens after the pilot ends.
In enterprise environments, GenAI solutions typically sit at the intersection of several teams: IT, data engineering, security, and the business unit that will ultimately use the solution. If ownership is not clearly defined from the beginning, the project often stalls after the PoC phase. In successful projects, three types of ownership need to be clearly established. First, product ownership: usually a business stakeholder who is responsible for adoption and ensuring the solution continues to deliver value. Second, data ownership: a team responsible for maintaining data quality and governance. And third, technical ownership: typically IT or platform teams that maintain the infrastructure and integrations. When those responsibilities are clearly defined early on, the transition from PoC to production becomes much smoother. Without that alignment, even technically strong solutions often remain prototypes.

Your career includes a Wi-Fi deployment at one of the region's largest hockey arenas, with over 12,000 seats, that was recognized as Project of the Year, and a communication modernization for 15,000 users at one of the world's largest gas companies. Both are environments where system failure has immediate physical consequences. How do those reliability standards carry over into how you design GenAI solutions for manufacturing and insurance clients?

The requirements change significantly. In a typical PoC, teams focus on model capability: prompt engineering, experimentation, and rapid prototyping. In enterprise environments, architecture, governance, and compliance become just as important as the model itself. Data privacy constraints often require that sensitive information never leave the organization's controlled environment. In insurance, this can involve regulatory frameworks like HIPAA. Solutions may rely on private deployments, secure APIs, or retrieval architectures that keep sensitive data within approved environments. Then there is reliability.
In industrial and insurance settings, incorrect outputs or hallucinations introduce real operational risk. GenAI systems here are designed to augment human expertise, not replace it -- the AI generates insights, summaries, or recommendations, but a human expert remains in the loop to review and approve the output before it is used in any decision or client interaction. That allows organizations to benefit from productivity gains while maintaining the control required in regulated environments. And everything has to be auditable. Enterprise clients need to understand how the model generated its response and track system behavior for compliance and governance.

Many companies think of GenAI as a way to automate a single task -- summarizing documents, answering internal queries. When does it make sense to stay narrow, and when should the architecture be designed for something broader from the start?

In most enterprise environments, starting narrow makes sense. A focused use case lets you validate the technology quickly while minimizing risk and complexity. It also helps teams build internal expertise and confidence in how GenAI behaves in real workflows. Typical starting points are tasks involving large volumes of documents, repetitive knowledge retrieval, or internal research. But once the organization begins to see value, it becomes important to think about end-to-end workflows. Many of the most impactful GenAI opportunities are not isolated tasks but chains of activities -- document analysis, summarization, and decision support within the same process. The approach is to start narrow to prove value, but design the architecture so it can expand into larger workflows over time.

Coming back to where we started -- that 42% abandonment rate. If a C-level executive asked you for the metrics that would actually tell them whether their AI investment is working, what would you point to?
The metrics that matter are the ones that directly connect AI capabilities to operational improvements. Productivity is the most tangible: when a process that previously took hours can now be completed in a fraction of that time, the impact is immediate. Cost efficiency is the second -- reduced operational workload, or the ability to scale services without proportionally increasing headcount. Decision speed matters too: if AI accelerates the synthesis of information before decisions are made, that affects overall business performance.
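The three measures Potapov names -- productivity, cost efficiency, and decision speed -- can all be derived from before-and-after timings of a single process. A sketch under invented numbers: the 11-minutes-to-one figure echoes the MassMutual help desk result cited earlier, while the monthly volume and hourly cost are assumptions for illustration.

```python
# Hypothetical sketch of the executive metrics Potapov describes, computed
# from before/after measurements of one process. Volume and cost figures
# below are invented for the example.
def ai_impact(baseline_minutes, ai_minutes, volume_per_month, cost_per_hour):
    """Summarize productivity, cost, and speed impact of AI on a process."""
    saved_per_task = baseline_minutes - ai_minutes
    hours_saved = saved_per_task * volume_per_month / 60
    return {
        "productivity_gain_pct": round(100 * saved_per_task / baseline_minutes, 1),
        "monthly_hours_saved": round(hours_saved, 1),
        "monthly_cost_saved": round(hours_saved * cost_per_hour, 2),
        "speedup": round(baseline_minutes / ai_minutes, 1),
    }

# Example: resolution time dropping from 11 minutes to 1 (as in the MassMutual
# help desk result), at an assumed 2,000 tickets/month and $50/hour.
print(ai_impact(baseline_minutes=11, ai_minutes=1,
                volume_per_month=2000, cost_per_hour=50))
```

The value of framing metrics this way is that each number maps to one of Potapov's categories: the percentage gain is productivity, the monthly figures are cost efficiency, and the speedup factor is decision speed.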
A staggering 42% of companies abandoned most of their AI initiatives in the past year, more than double the previous year's rate. The culprit isn't bad technology: it's AI pilot sprawl, unclear ownership and governance, and data that's never production-ready. MassMutual and Mass General Brigham show what discipline looks like: 30% developer productivity gains, help desk times slashed from 11 minutes to one, and a shift from ungoverned experiments to measurable business results.

Enterprise AI is stuck in a troubling pattern. According to S&P Global's 2025 Voice of the Enterprise survey, 42% of companies abandoned most AI initiatives last year, more than double the 17% recorded the year before [2]. On average, organizations scrapped 46% of their proof-of-concept (PoC) projects before they ever reached production. Budgets get approved, pilots get built, demos get applauded, and then nothing ships. The problem isn't the technology itself. AI projects rarely fail because of bad ideas or underperforming models. Instead, they collapse under AI pilot sprawl, unclear ownership and governance, and organizational hurdles that were never addressed during the planning phase.

Alex Potapov, an NTT DATA consultant who oversees GenAI implementations for global clients in energy and insurance, puts it bluntly: "The model is rarely the main problem" [2]. Most projects break down at the intersection of three things: data readiness, integration with enterprise systems, and unclear ownership across teams. Without proper data quality and governance, even the best models cannot produce reliable outputs. And when GenAI initiatives sit between IT, data, legal, security, and the business unit with no clear owner, the project stalls, no matter how impressive the demo was.

MassMutual, a 175-year-old company serving millions of policy owners, offers a counterexample. The insurer has pushed AI into production across customer support, IT, customer acquisition, underwriting, servicing, and claims, achieving concrete results: 30% developer productivity gains, IT help desk resolution times reduced from 11 minutes to one, and customer service calls cut from 15 minutes to just one or two [1].

Sears Merritt, MassMutual's head of enterprise technology and experience, credits a disciplined approach rooted in metrics and feedback loops. "We're always starting with why do we care about this problem?" Merritt explained at a recent VentureBeat event. "If we solve the problem, how are we gonna know we solved it? And, how much value is associated with doing that?" [1] His team follows the scientific method, beginning with a hypothesis and testing whether it will tangibly drive the business forward. Some ideas are great but may be "intractable in the business" due to lack of data, access, or regulatory constraints.

Crucially, MassMutual won't advance an idea until there's crystal clarity on how success will be measured. Different departments and stakeholders define what quality means, choose a metric, and set minimum quality thresholds before a tool is placed into the hands of teams. "We don't go to production until there is a business partner that says, 'Yes, that works,'" Merritt said [1]. His team performs trust scoring to lower hallucination rates, establishes evaluation criteria, and monitors for feature and model drift. They also operate with a no-commitment policy on models, building common service layers, microservices, and APIs that sit between the AI layer and underlying systems, so when a better model emerges, swapping it in doesn't mean starting over.

Mass General Brigham (MGB) took a different route, initially embracing a "spray and pray" approach before course-correcting. Around 15,000 researchers in the not-for-profit health system have been using AI, ML, and deep learning for 10 to 15 years, but last year, CTO Nallan "Sri" Sriraman made a bold choice: His team shut down a sprawl of non-governed AI pilots [1]. "We did follow the thousand flowers bloom [methodology], but we didn't have a thousand flowers, we had probably a few tens of flowers trying to bloom," he said.

Like MassMutual, MGB pivoted to a more holistic view, examining why they were developing certain tools for specific workflows. They questioned what capabilities they wanted and needed, and what investment those required. Sriraman's team also spoke with their primary platform providers (Epic, Workday, ServiceNow, Microsoft) about their roadmaps. This was a "pivotal moment," as they realized they were building in-house tools that vendors were already providing or planning to roll out [1]. Instead of a Wild West environment, Sriraman's team now distributes Microsoft Copilot to users across the business and uses a "small landing zone" where they can safely test more sophisticated products and control token use. They also began "consciously embedding AI champions" across business groups to drive adoption and accountability.

Potapov, who manages GenAI programs exceeding $30 million in potential revenue for industrial giants, sees a clear pattern in what separates the AI projects that ship from those that get abandoned. Data is often the first bottleneck. Many organizations believe they have high-quality knowledge bases, but when building a GenAI solution, information turns out to be fragmented across SharePoint, PDFs, internal tools, and outdated repositories [2]. Integration is the second challenge: a GenAI solution becomes valuable only when it fits into existing workflows, whether CRM systems or internal support platforms. That work often takes longer than the AI component itself.

The most underestimated challenge, however, is organizational ownership. Without clearly defined responsibility after the PoC phase, projects stall. Even in short pilots, certain elements cannot be compromised: the data pipeline should resemble the production architecture as closely as possible, security and compliance must be maintained, and the PoC must tie to a measurable business case with clear KPIs. Warning signs that a prototype was never designed for scalability include heavy manual intervention, no integration strategy, and workflows that exist in isolation from real enterprise systems [2].

As Gartner forecasts that by 2028, more than half of enterprise AI models will be domain-specific, and by 2030 most organizations will shift to smaller, AI-augmented units, the pressure to move AI from pilot to production will only intensify. The companies that achieve tangible business results will be those that address governance, establish strong feedback loops, and treat integration with enterprise systems as a first-class concern, not an afterthought.
Summarized by Navi