Curated by THEOUTPOST
On Fri, 24 Jan, 12:05 AM UTC
2 Sources
[1]
Galileo launches 'Agentic Evaluations' to fix AI agent errors before they cost you
Galileo, a San Francisco-based startup, is betting that the future of artificial intelligence depends on trust. Today, the company launched a new product, Agentic Evaluations, to address a growing challenge in the world of AI: making sure the increasingly complex systems known as AI agents actually work as intended.

AI agents -- autonomous systems that perform multi-step tasks like generating reports or analyzing customer data -- are gaining traction across industries. But their rapid adoption raises a crucial question: How can companies verify these systems remain reliable after deployment? Galileo's CEO, Vikram Chatterji, believes his company has found the answer.

"Over the last six to eight months, we started to see some of our customers trying to adopt agentic systems," said Chatterji in an interview. "Now LLMs can be used as a smart router to pick and choose the right API calls towards actually completing a task. Going from just generating text to actually completing a task was a very big chasm that was unlocked."

AI agents show promise, but enterprises demand accountability

Major enterprises like Cisco and Ema (founded by Coinbase's former Chief Product Officer) have already adopted Galileo's platform. These companies use AI agents to automate tasks from customer support to financial analysis, reporting significant productivity gains.

"A sales representative who's trying to do outreach and outbounds would otherwise use maybe a week of their time to do that, versus with some of these AI-enabled agents, they're doing that within two days or less," Chatterji explained, highlighting the return on investment for enterprises.

Galileo's new framework evaluates tool selection quality, detects errors in tool calls, and tracks overall session success. It also monitors essential metrics for large-scale AI deployment, including costs and latency.

$68 million in funding fuels Galileo's push into enterprise AI

The launch builds on Galileo's recent momentum. The company raised $45 million in Series B funding led by Scale Venture Partners last October, bringing its total funding to $68 million. Industry analysts project the market for AI operations tools could reach $4 billion by 2025.

The stakes are high as AI deployment accelerates. Studies show even advanced models like GPT-4 can hallucinate about 23% of the time during basic question-and-answer tasks. Galileo's tools help enterprises identify these issues before they impact operations.

"Before we launch this thing, we really, really need to know that this thing works," Chatterji said, describing customer concerns. "The bar is really high. So that's where we gave them this tool chain, such that they could just use our metrics as the basis for these tests."

Addressing AI hallucinations and enterprise-scale challenges

The company's focus on reliable, production-ready solutions positions it well in a market increasingly concerned with AI safety. For technical leaders deploying enterprise AI, Galileo's platform provides essential guardrails for ensuring AI agents perform as intended while controlling costs.

As enterprises expand their use of AI agents, performance monitoring tools become crucial infrastructure. Galileo's latest offering aims to help businesses deploy AI responsibly and effectively at scale.

"2025 will be the year of agents. It is going to be very prolific," Chatterji noted. "However, what we've also seen is a lot of companies that are just launching these agents without good testing is leading to negative implications... The need for proper testing and evaluations is more than ever before."
[2]
Galileo unleashes platform for evaluating AI agents - SiliconANGLE
Galileo Technologies Inc., which makes tools for observing and evaluating artificial intelligence models, today unveiled Agentic Evaluations, a platform aimed at evaluating the performance of AI agents powered by large language models.

The company said it's addressing the additional complexity created by agents, which are software robots imbued with decision-making capabilities that enable them to plan, reason and execute tasks across multiple steps and adapt to changing environments and contexts with little or no human oversight. Because agent behavior is situational, developers can struggle to understand when and why failures occur.

That hasn't dampened interest in the technology's workflow productivity potential. Gartner Inc. expects 33% of enterprise software applications to include agentic AI by 2028, up from less than 1% in 2024.

Agents challenge existing development and testing techniques in new ways. One is that they can choose multiple action sequences in response to a user request, making them unpredictable. Complex agentic workflows are difficult to model and require more complex evaluation. Agents may also work with multiple LLMs, making performance and costs harder to pin down. The risk of errors grows with the size and complexity of the workflow.

Galileo said Agentic Evaluations provides a full lifecycle framework for system-level and step-by-step evaluation. It gives developers a view of an entire multi-step agent process, from input to completion, with tracing and simple visualizations that help developers quickly pinpoint inefficiencies and errors.

The platform uses a set of proprietary LLM-as-a-Judge metrics (an evaluation technique that uses LLMs to check and adjudicate tasks) built specifically for developers building agents. The metrics include an assessment of whether the LLM planner selected the correct tool and arguments, an assessment of errors by individual tools, traces reflecting progress toward the ultimate goal, and how the final action aligns with the agent's original instructions. The metrics are between 93% and 97% accurate, the company wrote in a blog post.

Performance is measured using proprietary, research-based metrics at multiple levels. Developers can choose which LLMs are involved in planning and assess errors in individual tasks. Aggregate tracking of cost, latency and errors across sessions and spans helps with cost and latency measurement. Alerts and dashboards help identify systemic issues for continuous improvement, such as failed tool calls or misalignment between actions and instructions.

The platform supports the popular open-source AI frameworks LangGraph and CrewAI. Agentic Evaluations is now available to all Galileo users. The company has raised $68 million, including a $45 million funding round last October.
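The "sessions and spans" vocabulary maps onto a familiar observability pattern: record each step of an agent run as a span, then aggregate cost, latency and errors over the session and alert on thresholds. The following is a generic, self-contained sketch of that bookkeeping; the dataclasses, field names and thresholds are invented for illustration and are not Galileo's schema.

```python
# Generic session/span aggregation sketch: sum cost and latency across the
# steps of one agent run and raise simple threshold-based alerts.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One step in an agent session, e.g. a planner call or a tool call."""
    name: str
    latency_ms: float
    cost_usd: float
    error: Optional[str] = None


@dataclass
class Session:
    session_id: str
    spans: list[Span] = field(default_factory=list)

    def add(self, span: Span) -> None:
        self.spans.append(span)

    @property
    def total_cost(self) -> float:
        return sum(s.cost_usd for s in self.spans)

    @property
    def total_latency_ms(self) -> float:
        return sum(s.latency_ms for s in self.spans)

    @property
    def error_rate(self) -> float:
        return sum(1 for s in self.spans if s.error) / max(len(self.spans), 1)


def check_alerts(session: Session, max_cost: float = 0.50,
                 max_error_rate: float = 0.2) -> list[str]:
    """Return human-readable alerts for sessions that exceed simple thresholds."""
    alerts = []
    if session.total_cost > max_cost:
        alerts.append(f"{session.session_id}: cost ${session.total_cost:.3f} over budget")
    if session.error_rate > max_error_rate:
        alerts.append(f"{session.session_id}: {session.error_rate:.0%} of spans failed")
    return alerts


if __name__ == "__main__":
    run = Session("demo-session")
    run.add(Span("planner", latency_ms=820, cost_usd=0.012))
    run.add(Span("tool:sql_query", latency_ms=140, cost_usd=0.0, error="timeout"))
    run.add(Span("planner-retry", latency_ms=790, cost_usd=0.011))
    print(check_alerts(run, max_cost=0.02))
```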
Galileo introduces a new platform to evaluate and improve AI agent performance, addressing critical challenges in enterprise AI deployment and reliability.
San Francisco-based startup Galileo has launched a new product called 'Agentic Evaluations' to address the growing challenge of ensuring AI agent reliability and performance. As AI agents gain traction across industries, the need for robust evaluation tools has become paramount 1.
AI agents, autonomous systems capable of performing multi-step tasks, are being rapidly adopted by enterprises for various applications, from customer support to financial analysis. However, their complex nature poses significant challenges in terms of reliability and performance assessment. Gartner predicts that by 2028, 33% of enterprise software applications will include agentic AI, up from less than 1% in 2024 2.
Galileo's new platform provides a full lifecycle framework for system-level and step-by-step evaluation of AI agents. Key features include:
- End-to-end visibility into multi-step agent processes, from input to completion, with tracing and visualizations that help pinpoint inefficiencies and errors
- Checks on whether the LLM planner selected the correct tool and arguments, plus detection of errors raised by individual tools
- Aggregate tracking of cost, latency and errors across sessions and spans
- Alerts and dashboards that surface systemic issues such as failed tool calls or misalignment between actions and instructions
The platform utilizes proprietary LLM-as-a-Judge metrics, achieving 93% to 97% accuracy in evaluations 2.
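An accuracy figure for an LLM-as-a-Judge metric is typically established by comparing the judge's verdicts with human labels on a held-out set of examples; the sources do not detail Galileo's methodology, so the snippet below is only a generic illustration of that kind of agreement calculation.

```python
# Illustrative only: accuracy of an automated judge measured as agreement
# with human-annotated ground truth on the same examples.
def judge_accuracy(judge_verdicts: list[bool], human_labels: list[bool]) -> float:
    """Fraction of examples where the automated judge agrees with the human label."""
    assert len(judge_verdicts) == len(human_labels)
    agreements = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return agreements / len(human_labels)


# Toy data: 9 of 10 verdicts agree with the human annotation -> 0.9 accuracy.
print(judge_accuracy([True] * 9 + [False], [True] * 10))
```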
Studies show that even advanced models like GPT-4 can hallucinate about 23% of the time during basic question-and-answer tasks. Galileo's tools help enterprises identify these issues before they impact operations, providing essential guardrails for responsible AI deployment 1.
Major enterprises like Cisco and Ema have already adopted Galileo's platform, reporting significant productivity gains. The company has secured $68 million in total funding, including a recent $45 million Series B round led by Scale Venture Partners 1.
The market for AI operations tools is projected to reach $4 billion by 2025. Galileo's CEO, Vikram Chatterji, believes that "2025 will be the year of agents," emphasizing the critical need for proper testing and evaluations in AI deployment 1.
Galileo's Agentic Evaluations platform supports popular open-source AI frameworks like LangGraph and CrewAI. It provides developers with a comprehensive view of multi-step agent processes, including tracing and visualizations to quickly identify inefficiencies and errors 2.
As enterprises continue to expand their use of AI agents, Galileo's latest offering aims to help businesses deploy AI responsibly and effectively at scale, addressing the growing concerns around AI safety and performance in the enterprise sector.