On Thu, 1 May, 4:04 PM UTC
2 Sources
[1]
Salesforce research lays the foundations for more reliable enterprise AI agents
Business leaders face some big challenges if they want to use agents. These new benchmarks might help.

The value of AI agents, systems that can carry out tasks for humans, is evident, with opportunities for productivity gains, especially for businesses. However, the performance of large language models (LLMs) can hinder the effective deployment of agents. Salesforce AI Research seeks to address that issue.

On Thursday, Salesforce launched its inaugural Salesforce AI Research in Review report, highlighting the company's innovations, including new foundational developments and research papers from the past quarter. Salesforce hopes these pieces will help support the development of trustworthy and capable AI agents that can perform well in business environments.

"At Salesforce, we call these 'boring breakthroughs' -- not because they're unremarkable, but because they're quietly capable, reliably scalable, and built to endure," said Silvio Savarese, Salesforce's chief scientist and head of AI research. "They're so seamless, some might take them for granted."

Let's dive into some of the biggest breakthroughs and takeaways from the report.

If you have ever used AI models for everyday, simple tasks, you may be surprised at the rudimentary nature of some of their mistakes. Even more puzzling is that the same model that got your basic questions wrong may have performed extremely well on benchmarks testing highly complex topics, such as math, STEM, and coding. This paradox is what Salesforce refers to as "jagged intelligence."
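Salesforce has not published its internal harness, but the idea of quantifying jaggedness can be sketched as a consistency check: ask a model the same easy questions several times and compare overall accuracy with run-to-run agreement. Everything below (`consistency_report`, `toy_model`) is illustrative, not part of SIMPLE.

```python
from collections import Counter

def consistency_report(ask_model, questions, runs=5):
    """Score a model on easy questions, tracking both accuracy and
    run-to-run consistency on each one."""
    report = []
    for question, expected in questions:
        answers = [ask_model(question) for _ in range(runs)]
        accuracy = sum(a == expected for a in answers) / runs
        # A jagged model answers the same easy question differently from run to run.
        consistency = Counter(answers).most_common(1)[0][1] / runs
        report.append({"question": question,
                       "accuracy": accuracy,
                       "consistency": consistency})
    return report

# Deterministic stand-in for a real LLM call, purely for illustration.
def toy_model(question):
    return "4" if "2 + 2" in question else "unknown"

report = consistency_report(toy_model, [("What is 2 + 2?", "4")], runs=3)
```

A real harness would call a live model, so both scores would typically fall below 1.0; the gap between the two numbers is one way to read "jaggedness."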
Salesforce notes that this "jaggedness" -- the discrepancy between an LLM's raw intelligence and its consistent real-world performance -- is particularly challenging for enterprises that require consistent operational performance, especially in unpredictable environments. Addressing the problem, however, means first quantifying it, which raises another issue.

"Today's AI is jagged, so we need to work on that -- but how can we work on something without measuring it first?" said Shelby Heinecke, senior AI research manager at Salesforce.

That is exactly the issue Salesforce's new SIMPLE benchmark addresses. The SIMPLE public dataset features 225 reasoning questions that are straightforward for humans to answer but that AI models answer inconsistently, making them a useful yardstick for jaggedness. To give you an idea of just how basic the questions are, the dataset card on Hugging Face describes the problems as "solvable by at least 10% of high schoolers given a pen, unlimited paper, and an hour of time."

Despite not testing super-complex tasks, the SIMPLE benchmark should help people understand how a model reasons in real-world environments and applications, especially when developing Enterprise General Intelligence (EGI) -- competent AI systems that handle business applications reliably.

Another benefit of the benchmark is that it should build trust among business leaders considering AI systems, such as AI agents, for their organizations, as they will have a much better idea of how consistently a model performs.

Another benchmark developed by Salesforce, ContextualJudgeBench, takes a different approach, evaluating AI-enabled judges rather than the models themselves; AI model benchmarks often rely on assessments by other AI models.
ContextualJudgeBench focuses on the LLMs that evaluate other models, on the premise that if the evaluator is trustworthy, its evaluations will be too. The benchmark tests over 2,000 response pairs.

During the past quarter, Salesforce also launched an agent benchmarking framework, CRMArena. The framework evaluates how AI agents perform CRM (customer relationship management) tasks, such as summarizing sales emails and transcripts, making commerce recommendations, and more.

"These agents don't need to solve theorems, don't need to turn my prose into Shakespearean verses -- [they] need to really focus on those critical enterprise needs across different industry verticals," said Savarese.

CRMArena is meant to address the issue of organizations not knowing how well models perform at practical business tasks. Beyond comprehensive testing, the framework should help improve AI agents' development and performance. The full report includes further research aimed at improving AI model efficiency and reliability.
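ContextualJudgeBench's core idea -- grade the grader -- can be sketched as a harness that checks a judge's pick on labeled response pairs. The `toy_judge` heuristic and the tuple layout below are illustrative assumptions, not the benchmark's actual protocol.

```python
def score_judge(judge, labeled_pairs):
    """Evaluate a judge model: it picks the better of two responses,
    and we check its pick against a known-good label."""
    correct = 0
    for context, resp_a, resp_b, better in labeled_pairs:
        pick = judge(context, resp_a, resp_b)  # returns "a" or "b"
        correct += (pick == better)
    return correct / len(labeled_pairs)

# Hypothetical rule-based judge for illustration: it prefers the
# response whose words all appear in the given context.
def toy_judge(context, resp_a, resp_b):
    return "a" if all(w in context for w in resp_a.split()) else "b"

pairs = [("the sky is blue", "sky is blue", "sky is green", "a")]
accuracy = score_judge(toy_judge, pairs)
```

A trustworthy judge scores high on such labeled pairs, which is what licenses using it to evaluate other models at scale.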
[2]
Salesforce takes aim at 'jagged intelligence' in push for more reliable AI
Salesforce is tackling one of artificial intelligence's most persistent challenges for business applications: the gap between an AI system's raw intelligence and its ability to perform consistently in unpredictable enterprise environments -- what the company calls "jagged intelligence."

In a comprehensive research announcement today, Salesforce AI Research revealed several new benchmarks, models, and frameworks designed to make future AI agents more intelligent, trusted, and versatile for enterprise use. The innovations aim to improve both the capabilities and the consistency of AI systems, particularly when deployed as autonomous agents in complex business settings.

"While LLMs may excel at standardized tests, plan intricate trips, and generate sophisticated poetry, their brilliance often stumbles when faced with the need for reliable and consistent task execution in dynamic, unpredictable enterprise environments," said Silvio Savarese, Salesforce's Chief Scientist and Head of AI Research, during a press conference preceding the announcement.

The initiative represents Salesforce's push toward what Savarese calls "Enterprise General Intelligence" (EGI) -- AI designed specifically for business complexity rather than the more theoretical pursuit of Artificial General Intelligence (AGI).

"We define EGI as purpose-built AI agents for business, optimized not just for capability but for consistency, too," Savarese explained. "While AGI may conjure images of superintelligent machines surpassing human intelligence, businesses aren't waiting for that distant, illusory future. They're applying these foundational concepts now to solve real-world challenges at scale."

How Salesforce is measuring and fixing AI's inconsistency problem in enterprise settings

A central focus of the research is quantifying and addressing AI's inconsistency in performance.
Salesforce introduced the SIMPLE dataset, a public benchmark featuring 225 straightforward reasoning questions designed to measure how jagged an AI system's capabilities really are.

"Today's AI is jagged, so we need to work on that. But how can we work on something without measuring it first? That's exactly what this SIMPLE benchmark is," explained Shelby Heinecke, Senior Manager of Research at Salesforce, during the press conference.

For enterprise applications, this inconsistency isn't merely an academic concern. A single misstep from an AI agent could disrupt operations, erode customer trust, or inflict substantial financial damage. "For businesses, AI isn't a casual pastime; it's a mission-critical tool that requires unwavering predictability," Savarese noted in his commentary.

Inside CRMArena: Salesforce's virtual testing ground for enterprise AI agents

Perhaps the most significant innovation is CRMArena, a novel benchmarking framework designed to simulate realistic customer relationship management scenarios. It enables comprehensive testing of AI agents in professional contexts, addressing the gap between academic benchmarks and real-world business requirements.

"Recognizing that current AI models often fall short in reflecting the intricate demands of enterprise environments, we've introduced CRMArena: a novel benchmarking framework meticulously designed to simulate realistic, professionally grounded CRM scenarios," Savarese said.

The framework evaluates agent performance across three key personas: service agents, analysts, and managers. Early testing revealed that even with guided prompting, leading agents succeed less than 65% of the time at function-calling for these personas' use cases.

"The CRM arena essentially is a tool that's been introduced internally for improving agents," Savarese explained.
"It allows us to stress test these agents, understand when they're failing, and then use the lessons we learn from those failure cases to improve our agents."

New embedding models that understand enterprise context better than ever before

Among the technical innovations announced, Salesforce highlighted SFR-Embedding, a new model for deeper contextual understanding that leads the Massive Text Embedding Benchmark (MTEB) across 56 datasets. "SFR Embedding is not just research. It's coming to Data Cloud very, very soon," Heinecke noted.

A specialized version, SFR-Embedding-Code, was also introduced for developers, enabling high-quality code search and streamlining development. According to Salesforce, the 7B-parameter version leads the Code Information Retrieval (CoIR) benchmark, while smaller models (400M, 2B) offer efficient, cost-effective alternatives.

Why smaller, action-focused AI models may outperform larger language models for business tasks

Salesforce also announced xLAM V2 (Large Action Model), a family of models specifically designed to predict actions rather than just generate text. These models start at just 1 billion parameters -- a fraction of the size of many leading language models.

"What's special about our xLAM models is that if you look at our model sizes, we've got a 1B model all the way up to a 70B model. That 1B model, for example, is a fraction of the size of many of today's large language models," Heinecke explained. "This small model packs so much power in its ability to take the next action."

Unlike standard language models, these action models are specifically trained to predict and execute the next steps in a task sequence, making them particularly valuable for autonomous agents that need to interact with enterprise systems. "Large action models are LLMs under the hood, and the way we build them is we take an LLM and we fine-tune it on what we call action trajectories," Heinecke added.
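The "action trajectories" Heinecke describes can be pictured as fine-tuning records whose training target is a structured tool call rather than free text. The field names below are hypothetical, chosen for illustration; xLAM's actual data schema has not been described here.

```python
import json

# Hypothetical action-trajectory record: a dialogue plus the tools the
# agent may call, where the label is the next *action* to take, not a
# free-form text completion.
trajectory = {
    "messages": [
        {"role": "user", "content": "Summarize open cases for account Acme."},
    ],
    "tools": [
        {"name": "list_cases",
         "parameters": {"account": "string", "status": "string"}},
    ],
    # Fine-tuning target: the structured call the agent should emit next.
    "target_action": {"name": "list_cases",
                      "arguments": {"account": "Acme", "status": "open"}},
}

line = json.dumps(trajectory)  # one JSONL line of fine-tuning data
```

Training on many such records teaches a base LLM to emit the next executable step in a task sequence, which is what makes small action models viable for agent workloads.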
Enterprise AI safety: How Salesforce's trust layer establishes guardrails for business use

To address enterprise concerns about AI safety and reliability, Salesforce introduced SFR-Guard, a family of models trained on both publicly available data and CRM-specialized internal data. These models strengthen the company's Trust Layer, which provides guardrails for AI agent behavior.

"Agentforce's guardrails establish clear boundaries for agent behavior based on business needs, policies, and standards, ensuring agents act within predefined limits," the company stated in its announcement.

The company also launched ContextualJudgeBench, a novel benchmark for evaluating LLM-based judge models in context -- testing over 2,000 challenging response pairs for accuracy, conciseness, faithfulness, and appropriate refusal to answer.

Looking beyond text, Salesforce unveiled TACO, a multimodal action model family designed to tackle complex, multi-step problems through chains of thought-and-action (CoTA). This approach enables AI to interpret and respond to intricate queries involving multiple media types, with Salesforce claiming up to 20% improvement on the challenging MMVet benchmark.

Co-innovation in action: How customer feedback shapes Salesforce's enterprise AI roadmap

Itai Asseo, Senior Director of Incubation and Brand Strategy at AI Research, emphasized the importance of customer co-innovation in developing enterprise-ready AI solutions.

"When we're talking to customers, one of the main pain points that we have is that when dealing with enterprise data, there's a very low tolerance to actually provide answers that are not accurate and that are not relevant," Asseo explained. "We've made a lot of progress, whether it's with reasoning engines, with RAG techniques and other methods around LLMs."
Asseo cited examples of customer incubation yielding significant improvements in AI performance: "When we applied the Atlas reasoning engine, including some advanced techniques for retrieval augmented generation, coupled with our reasoning and agentic loop methodology and architecture, we were seeing accuracy that was twice as much as customers were able to do when working with kind of other major competitors of ours."

The road to Enterprise General Intelligence: What's next for Salesforce AI

Salesforce's research push comes at a critical moment in enterprise AI adoption, as businesses increasingly seek AI systems that combine advanced capabilities with dependable performance. While the entire tech industry pursues ever-larger models with impressive raw capabilities, Salesforce's focus on the consistency gap highlights a more nuanced approach to AI development -- one that prioritizes real-world business requirements over academic benchmarks.

The technologies announced Thursday will begin rolling out in the coming months, with SFR-Embedding heading to Data Cloud first, while other innovations will power future versions of Agentforce. As Savarese noted in the press conference, "It's not about replacing humans. It's about being in charge."

In the race to enterprise AI dominance, Salesforce is betting that consistency and reliability -- not just raw intelligence -- will ultimately define the winners of the business AI revolution.
Salesforce introduces new AI benchmarks and models to address the inconsistency in AI performance for enterprise applications, aiming to develop more reliable and capable AI agents for business environments.
Salesforce has launched a series of innovative AI benchmarks and models aimed at tackling the challenge of 'jagged intelligence' in artificial intelligence systems. This phenomenon refers to the discrepancy between an AI model's raw intelligence and its ability to perform consistently in real-world, unpredictable enterprise environments [1][2].
To address this issue, Salesforce has introduced several new benchmarks:
SIMPLE Benchmark: A public dataset featuring 225 straightforward reasoning questions that are easy for humans but challenging for AI. This benchmark aims to quantify the 'jaggedness' of AI models and improve their real-world performance [1].
ContextualJudgeBench: This benchmark evaluates AI-enabled judges rather than the models themselves, focusing on the reliability of AI systems that assess other models [1].
CRMArena: A framework designed to evaluate how AI agents perform in customer relationship management (CRM) tasks, such as summarizing sales emails and making commerce recommendations [1][2].
Salesforce is pushing towards what it calls "Enterprise General Intelligence" (EGI), which focuses on developing AI specifically for business complexity. This approach aims to create purpose-built AI agents optimized for both capability and consistency in business environments [2].
Salesforce has also introduced new AI models and embeddings to enhance enterprise AI capabilities:
SFR-Embedding: A new model for deeper contextual understanding, leading the Massive Text Embedding Benchmark (MTEB) across 56 datasets [2].
SFR-Embedding-Code: A specialized version for developers, enabling high-quality code search and streamlining development [2].
xLAM V2 (Large Action Model): A family of models designed to predict actions rather than just generate text, starting at just 1 billion parameters [2].
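Embedding-based code search of the kind SFR-Embedding-Code enables boils down to ranking snippets by vector similarity. The sketch below uses hand-made 3-dimensional vectors in place of a real embedding model's output; snippet names and dimensions are purely illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def search(query_vec, corpus):
    """Return the corpus entry whose embedding is closest to the query."""
    return max(corpus, key=lambda item: cosine(query_vec, item["vec"]))

# Toy embeddings standing in for a real embedding model's output.
corpus = [
    {"snippet": "def parse_csv(path): ...", "vec": [0.9, 0.1, 0.0]},
    {"snippet": "def send_email(to): ...", "vec": [0.0, 0.2, 0.9]},
]
best = search([0.8, 0.2, 0.1], corpus)  # a query embedding near "parse_csv"
```

In practice the query ("how do I read a CSV file?") and every snippet would be embedded by the model, and an index such as a vector database would replace the linear scan.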
These developments have significant implications for businesses looking to implement AI:
Improved Consistency: By addressing 'jagged intelligence', Salesforce aims to create AI systems that perform more reliably in unpredictable business environments [1][2].
Enhanced Trust: Better benchmarks and more consistent performance could lead to higher trust from business leaders in implementing AI systems [1].
Tailored Solutions: The focus on EGI and CRM-specific benchmarks suggests a move towards AI solutions tailored for specific business needs [1][2].
Efficient Models: Smaller, action-focused AI models like xLAM V2 may outperform larger language models for specific business tasks, offering more efficient solutions [2].
As AI continues to evolve, Salesforce's research lays the groundwork for more reliable, efficient, and business-focused AI agents. This could potentially revolutionize how enterprises leverage AI technology in their operations, leading to significant productivity gains and improved decision-making processes.