Aimon Labs Inc., the creator of an automated "hallucination" detection model that improves the reliability of generative artificial intelligence applications, said today it has closed a $2.3 million preseed funding round.
The money comes from Bessemer Venture Partners and Tidal Ventures, which co-led the round, along with a number of angel investors, including Thumbtack Inc. Chief Executive Marco Zappacosta and Sumo Logic Inc. co-founder Kumar Saurabh.
AIMon, as the startup likes to be known, is trying to tackle the extremely difficult problem of AI hallucinations, and to do this it's relying on generative AI itself to monitor and safeguard other generative AI applications. It has created a large language model called HDM-1, or Hallucination Detection Model-1, which it says far outperforms other LLMs at detecting hallucinations.
AI hallucinations are one of the biggest challenges faced by LLM developers. IBM Corp. defines them as a phenomenon in which LLMs "perceive patterns or objects that are nonexistent or imperceptible to human observers, creating outputs that are nonsensical or altogether inaccurate." They have drawn a lot of attention in the last couple of years, ever since ChatGPT exploded into the public consciousness, with numerous documented instances of AI models going astray.
The hallucinatory behavior of AI models can take many forms, such as leaking sensitive training data, exhibiting bias and falling victim to prompt injection attacks, in which malicious actors attempt to manipulate AI models into performing unintended actions. In some extreme cases, AI models can completely lose the plot, such as when a beta version of Microsoft Corp.'s chatbot Sydney professed its love for The Verge journalist Nathan Edwards, and later falsely confessed to murdering one of its developers.
What's alarming is how common AI hallucinations can be: Some studies estimate that they appear in anywhere from 3% to 10% of LLMs' responses to user prompts.
For businesses wanting to adopt AI, these hallucinations present a serious problem, because they cannot afford to unleash generative AI applications that are so unreliable. As AIMon co-founder and CEO Puneet Anand explains, an AI model intended to diagnose medical issues automatically from computed tomography scans simply cannot be allowed to make mistakes.
"It is critical to ensure that the AI is actually doing what it is supposed to do," Anand said. "That is exactly what AIMon brings to the table. We are laser-focused on improving reliability for LLMs."
AIMon's HDM-1 is a commercially available AI monitoring tool that's designed to automate the improvement of LLMs, helping developers to discover when their models are hallucinating, troubleshoot the cause of those hallucinations, and find ways to fix them. It's an approach that's known in the industry as "LLM-as-a-judge."
With this approach, a specialized LLM is deployed to evaluate the outputs of other LLMs, such as GPT-4. These evaluations check various aspects of the monitored model's outputs, such as tone, coherence and factual accuracy. In the case of AIMon's HDM-1, it's tasked with judging whether the content created by the LLMs it monitors meets specific criteria. Arguably, the biggest reason the technique has become so popular is that it automates evaluation work that would otherwise fall to humans, usually engineers.
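For readers unfamiliar with the pattern, here is a minimal sketch of how an LLM-as-a-judge check might look. It is not AIMon's API and does not use HDM-1; it simply uses the OpenAI Python SDK with gpt-4o-mini as a stand-in judge to show one model grading another model's answer against a source document.

```python
# Minimal LLM-as-a-judge sketch. NOT AIMon's API: the OpenAI SDK and
# gpt-4o-mini are used here only as a stand-in judge to illustrate the pattern.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are a hallucination judge.
Given a SOURCE document and an ANSWER produced by another model, reply with a
JSON object: {{"faithful": true or false, "reason": "..."}}.
Mark "faithful" false if the ANSWER states anything not supported by the SOURCE.

SOURCE:
{source}

ANSWER:
{answer}
"""

def judge_answer(source: str, answer: str) -> str:
    """Ask the judge model whether `answer` is grounded in `source`."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # any capable model can play the judge role
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(source=source, answer=answer),
        }],
        temperature=0,         # deterministic verdicts suit monitoring pipelines
    )
    return response.choices[0].message.content

# Example: flag an answer that invents a figure not present in the source.
print(judge_answer(
    source="The company raised $2.3 million in preseed funding.",
    answer="The company raised $10 million in Series A funding.",
))
```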
The startup, which participated in the Microsoft for Startups and Nvidia Inception programs, says HDM-1 has shown it can dramatically outperform OpenAI's GPT-4o mini, GPT-4 Turbo and other notable models at hallucination detection. With HDM-1, companies can gain more insight into their LLM-based applications, both in preproduction and after they have been deployed. It's useful not only for developers but also for governance, risk and compliance teams, the company said.
AIMon isn't alone in pursuing the idea of using an LLM-as-a-judge to solve AI's hallucinatory problems. Another startup, called Patronus AI Inc., debuted a similar model called Lynx during the summer, publishing benchmarks that showed it was 8.3% more accurate than GPT-4o in terms of detecting medical inaccuracies. It isn't clear how AIMon's HDM-1 stacks up against Lynx.
Nevertheless, Tidal Ventures' Nicholas Muy said AIMon's approach to increasing trust in AI is one of the most effective he has come across. "It's absolutely critical for builders of innovative generative AI applications," he said.