Curated by THEOUTPOST
On Mon, 4 Nov, 4:04 PM UTC
2 Sources
[1]
ThorV2 Puts an End to API Calling Struggles In LLMs
In collaboration with IIT Bombay and IIT Kharagpur, Floworks has released a research paper that dives deep into the sensational claims made by the YC-backed startup earlier this year.

Floworks, the cloud-based enterprise automation startup, recently released its novel ThorV2 architecture, which allows LLMs to perform function calls with better accuracy and reliability. In an interview with AIM earlier this year, Floworks claimed that its AI agent, Alisha, is 100% reliable for tasks involving API calls. Sudipta Biswas, the co-founder of Floworks, said, "Our model, which we are internally calling ThorV2, is the most accurate and the most reliable model out there in the world when it comes to using external tools right now." Further, he claimed that ThorV2 was 36% more accurate than OpenAI's GPT-4o, 4x cheaper, and almost 30% faster in terms of latency. These claims are now backed by an in-depth research paper that examines the ThorV2 architecture and how its novel features address several crucial challenges in agentic workflows with today's market-leading LLMs.

Edge-of-domain modelling, used in the ThorV2 architecture, involves providing minimal instruction upfront, letting the agent begin the task, and then supplying the remaining information through error corrections afterwards. This differs from front-loading the model with knowledge of every possible function-calling scenario, and it reduces the need for extensive instructions, which in turn shrinks the prompt and cuts costs. As the authors put it, "Function schemas can be lengthy, leading to large prompt sizes. This increases deployment costs, time consumption, and can result in decreased accuracy on reasoning tasks."

The additional instructions supplied to the LLM during error correction come from a static agent, implemented through ThorV2's Agent-Validator architecture. This architecture overcomes several limitations of agentic workflows in which the primary LLM agent performing a task receives feedback from other LLMs acting as critics; the authors argue that using an additional LLM not only increases deployment costs but also reduces accuracy. ThorV2 instead introduces a static agent written entirely in code, which includes a component called the Domain Expert Validator (DEV) that inspects the LLM's output for errors. The DEV contains all the knowledge required to perform function calls on a specific platform. While building such a validator takes significant effort, it helps reduce processing time and improve accuracy, because the DEV encodes the most common and repetitive errors that occur during function calling.

Another of ThorV2's advantages is that it can generate multiple API calls in a single step. Even when the first task needs to retrieve information from the API for use in the second, a single query suffices: a placeholder represents the unknown value, and once the first call returns its response, the value is injected into the second task. "Generating multiple API calls at once requires sophisticated planning and reasoning capabilities, which is very challenging for ordinary LLMs. Our Agent-Validator architecture simplifies this process as well by correcting errors in the planning step," the researchers added. This approach is a significant improvement over the traditional sequential handling of API calls in current LLMs, which often requires step-by-step execution.

The ThorV2 architecture was compared against OpenAI's GPT-4o and GPT-4 Turbo and Anthropic's Claude 3 Opus on a set of operations on HubSpot's CRM. The authors developed a dataset called HubBench, on which the models were evaluated for accuracy, reliability, speed, and cost. In a conversation with AIM, Sudipta mentioned that ThorV2 was paired with the Llama 3 70B model for the comparison. ThorV2 came out on top in every test, including a 100% score on the reliability test, which checks whether the model produces consistent output when asked to perform the same task ten times. On single API calls, ThorV2 scored 90% accuracy, with Claude 3 Opus coming second at 78%. The tests also showed that ThorV2 costs only $1.60 per thousand queries, roughly three times cheaper than OpenAI's models. Even with multiple API calls, ThorV2 performed better on every metric.

Reading these benchmark scores, one wonders whether they are still relevant five months after the tests were conducted, with several new and capable models like Claude 3.5 Sonnet and OpenAI's o1 having launched since. However, it is important to understand that ThorV2 is an architecture built to enhance the performance and capabilities of an existing LLM, and the integration should, in fact, work better with newer and more capable models. "We will soon come up with Thor v3, which will definitely compare with other models that have come up recently. But again, the framework is not a model-level innovation that we're doing," Sudipta said. "So even if the underlying model keeps on getting better, our framework will keep supporting even better than that."

One of ThorV2's limitations is that the DEV's knowledge is based on common, well-established error patterns, so the system may struggle when it encounters an unseen one. Moreover, the research currently tests the architecture only on tasks involving one or two API calls; the authors acknowledge this and plan to evaluate three or more function calls in future research. In the conversation with AIM, Sudipta revealed that ThorV3 is currently in the works and will challenge some of today's market-leading models, so one can also expect other limitations to be addressed in future iterations.

The authors envision ThorV2 overcoming the limitations of existing LLMs and solving problems that can truly create an impact. They note that LLMs have revolutionised NLP and AI, demonstrating remarkable capabilities across a wide range of tasks, yet their economic impact has been somewhat limited, particularly in domains requiring precise interaction with external tools and APIs. Over the last few months, we have also seen a meteoric rise in AI agents and their capabilities, and frameworks like ThorV2 can only propel them further in sectors that require large amounts of automation and knowledge transfer between different applications. "LLMs seem very cool, but to front-load them with a high amount of tokens, the cost will be prohibitively, very high. For large-scale operations where lots of automation is needed to be done, that price point will not suit enterprises, and small businesses," Sudipta said.
[2]
Stop Paying for GPT-4o -- This YC Startup Offers 4x the Savings
Floworks, in collaboration with IIT Bombay and IIT Kharagpur, introduces ThorV2, a novel architecture that enhances LLMs' API calling capabilities, offering improved accuracy, reliability, and cost-effectiveness compared to leading models.
Floworks, a YC-backed cloud-based enterprise automation startup, has introduced ThorV2, a novel architecture designed to revolutionize how Large Language Models (LLMs) handle API calls. Developed in collaboration with IIT Bombay and IIT Kharagpur, ThorV2 promises to address critical challenges in agentic workflows for market-leading LLMs [1][2].
ThorV2 incorporates several innovative features:
Edge of Domain Modeling: This approach provides minimal upfront instructions, allowing the agent to begin tasks and receive additional information through error corrections post-task. This method reduces token usage in prompts, potentially leading to cost savings [1][2].
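To make the error-correction loop concrete, here is a minimal sketch of how edge-of-domain modelling could work in practice. It illustrates the idea just described, not Floworks' actual implementation: the prompt text is invented, and `call_llm` and `validate` are hypothetical stand-ins for the underlying model and the static validator.

```python
from typing import Callable

# A minimal sketch of edge-of-domain modelling: the model receives a short
# task prompt instead of a full function schema, and missing domain knowledge
# is supplied only when the validator reports an error.
MINIMAL_PROMPT = "Produce the HubSpot API call for: {query}"  # no schema upfront

def edge_of_domain_call(
    call_llm: Callable[[str], dict],
    validate: Callable[[dict], list[str]],
    query: str,
    max_rounds: int = 3,
) -> dict:
    prompt = MINIMAL_PROMPT.format(query=query)
    draft = call_llm(prompt)
    for _ in range(max_rounds):
        errors = validate(draft)  # static checks in code, not a second LLM
        if not errors:
            return draft          # accepted API call
        # Spend extra tokens only when a correction is actually needed.
        draft = call_llm(prompt + "\nFix these issues: " + "; ".join(errors))
    raise RuntimeError("could not produce a valid call")
```

Because the full schema never enters the prompt, token usage stays low on the common path and grows only for queries that trigger corrections.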
Agent Validator Architecture: ThorV2 introduces a static agent, including a Domain Expert Validator (DEV), which inspects LLM outputs for errors. This approach overcomes limitations of traditional agentic workflows that rely on multiple LLMs for feedback [1][2].
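As a rough illustration of what such a static validator might look like, the sketch below checks a model-generated call against hand-written platform rules. The endpoint name, parameter schema, and rule set are assumptions for the example, not the paper's actual DEV.

```python
import re

# Hypothetical DEV-style validator: plain code that encodes common, repetitive
# function-calling errors for one platform. The schema below is illustrative.
REQUIRED_FIELDS = {"create_contact": {"email", "firstname"}}

def validate_call(call: dict) -> list[str]:
    """Return human-readable errors for the LLM to correct; empty means valid."""
    endpoint = call.get("endpoint")
    if endpoint not in REQUIRED_FIELDS:
        return [f"unknown endpoint: {endpoint!r}"]
    params = call.get("params", {})
    errors = []
    missing = REQUIRED_FIELDS[endpoint] - params.keys()
    if missing:
        errors.append(f"missing required fields: {sorted(missing)}")
    email = params.get("email", "")
    if email and not re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", email):
        errors.append(f"malformed email: {email!r}")
    return errors
```

A check like this runs in microseconds and costs no tokens, which is the core of the paper's argument against using a second LLM as the critic.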
Multiple API Calls in a Single Step: ThorV2 can generate multiple API calls simultaneously, using placeholders for unknown values and injecting them once retrieved. This capability significantly improves upon the sequential API call handling in current LLMs [1][2].
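The placeholder mechanism can be pictured as follows. The plan format and the `$1.id` reference syntax are invented for this sketch and may differ from ThorV2's actual representation; `call_api` stands in for the real HTTP client.

```python
from typing import Callable

def resolve(ref: str, results: list[dict]):
    """Resolve a placeholder like '$1.id' to results[0]['id'] after call 1 runs."""
    index, field = ref[1:].split(".")
    return results[int(index) - 1][field]

def execute_plan(plan: list[dict], call_api: Callable[[str, dict], dict]) -> list[dict]:
    # Both calls were generated in a single LLM step; placeholders are filled
    # in only once the earlier call's response is available.
    results = []
    for step in plan:
        params = {
            k: resolve(v, results) if isinstance(v, str) and v.startswith("$") else v
            for k, v in step["params"].items()
        }
        results.append(call_api(step["endpoint"], params))
    return results

# Example plan: the second call needs the deal ID returned by the first.
plan = [
    {"endpoint": "search_deal", "params": {"name": "Acme renewal"}},
    {"endpoint": "add_note", "params": {"deal_id": "$1.id", "body": "Follow up"}},
]
```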
Floworks claims that ThorV2 outperforms leading models like OpenAI's GPT-4o, GPT-4 Turbo, and Claude 3 Opus in several key areas: accuracy (36% higher than GPT-4o), cost (roughly 4x cheaper), latency (almost 30% lower), and reliability (a perfect score on repeated runs) [1][2].
These claims were supported by benchmarks conducted on a dataset called HubBench, focusing on operations within HubSpot's CRM. ThorV2, connected to the Llama 3 70B model, demonstrated superior performance across accuracy, reliability, speed, and cost metrics [1][2].
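The reliability metric used in these benchmarks is simple to reproduce. A toy harness might look like the following, where `generate_call` is a stand-in for whichever model or framework is under test:

```python
from collections import Counter
from typing import Callable

def reliability(generate_call: Callable[[str], dict], query: str, runs: int = 10) -> float:
    """Fraction of runs agreeing with the most common output; 1.0 = fully consistent."""
    outputs = [repr(generate_call(query)) for _ in range(runs)]
    return Counter(outputs).most_common(1)[0][1] / runs
```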
The introduction of ThorV2 could have significant implications for the AI industry:
Cost Reduction: At $1.60 per thousand queries, ThorV2 is reportedly three times cheaper than OpenAI's models, potentially disrupting the pricing structure of AI services [1][2].
Improved Efficiency: The ability to handle multiple API calls in a single step could streamline complex AI-driven processes in various industries.
Enhanced Reliability: With claims of 100% reliability for API call tasks, ThorV2 could set a new standard for dependability in AI applications [1][2].
While ThorV2 shows promise, there are considerations for its future:
Adaptability: As an architecture rather than a standalone model, ThorV2 is designed to enhance existing LLMs, potentially improving as underlying models advance [1][2].
Ongoing Development: Floworks has announced plans for Thor v3, indicating continued innovation in this space [1][2].
Current Limitations: ThorV2 relies on established error patterns and has been tested primarily on single and two API call functions, leaving room for expansion in handling more complex scenarios [1][2].
As the AI landscape continues to evolve rapidly, innovations like ThorV2 underscore the importance of architectural improvements alongside model advancements in pushing the boundaries of AI capabilities.
References
[1] ThorV2 Puts an End to API Calling Struggles In LLMs
[2] Stop Paying for GPT-4o -- This YC Startup Offers 4x the Savings