2 Sources
[1]
Snowflake AI's SwiftKV Cuts Meta Llama Inference Costs by Up to 75%
Snowflake AI Research has introduced SwiftKV, an optimisation framework integrated into vLLM that significantly reduces inference costs for Meta Llama large language models (LLMs). The SwiftKV-optimised models, Snowflake-Llama-3.3-70B and Snowflake-Llama-3.1-405B, are available for serverless inference on Cortex AI. They offer cost reductions of up to 75% compared to the baseline Meta Llama models without SwiftKV.

"SwiftKV's introduction comes at a critical moment for enterprises embracing LLM technologies. With the growth of use cases, organisations need solutions that deliver both immediate performance gains and long-term scalability," the company said.

The framework reduces computational overhead during the key-value (KV) cache generation stage by reusing hidden states from earlier transformer layers. According to Snowflake AI Research, this optimisation cuts prefill compute by up to 50% while maintaining enterprise-grade accuracy. "Our approach combines model rewiring with lightweight fine-tuning and self-distillation to preserve performance," the team explained. Accuracy loss is limited to about one percentage point across benchmarks.

SwiftKV delivers performance improvements, including up to twice the throughput for models like Llama-3.3-70B in GPU environments such as NVIDIA H100s. It also reduces time to first token by up to 50%, benefiting latency-sensitive applications such as chatbots and AI copilots. "It is designed to integrate seamlessly with vLLM, enabling additional optimisation techniques such as attention optimisation and speculative decoding," the Snowflake team said.

Beyond its integration with Cortex AI, SwiftKV is open source, with model checkpoints available on Hugging Face and optimised inference on vLLM. The team has also released the ArcticTraining Framework, a post-training library for building SwiftKV models, enabling enterprises and researchers to deploy custom solutions. "By tackling computational bottlenecks, SwiftKV allows enterprises to maximise the potential of their LLM deployments," Snowflake AI Research said.

Snowflake recently entered a multi-year deal with AI safety and research company Anthropic to use its Claude models. The partnership will make Anthropic's Claude models available to customers through Snowflake Cortex AI and help businesses worldwide get more value from their data.

More businesses are turning to Snowflake's cloud data platform to organise their data using AI. Like Salesforce and Microsoft, Snowflake is developing AI agents with its Snowflake Intelligence platform, which Snowflake chief Sridhar Ramaswamy believes will simplify how enterprises derive value from data. "Imagine asking a data agent, 'Give me a summary of this Google Doc' or 'Tell me how many deals we had in North America last quarter', and instantly following up with the next steps using that same agent. That's exactly what Snowflake Intelligence will enable - a seamless way to access and act on your data in one place," he added.
[2]
Snowflake claims breakthrough can cut AI inferencing times by more than 50% - SiliconANGLE
Snowflake Inc. today said it's integrating technology into some of its hosted large language models that it says can significantly reduce the cost and time required for artificial intelligence inferencing, the use of trained models to make predictions or generate outputs based on new input data.

The technique, called SwiftKV, is an optimization for large language models developed by Snowflake AI Research and released to open source. It improves the efficiency of the inference process by essentially recycling information called hidden states from earlier layers of an LLM to avoid repeating calculations of key-value caches for later layers.

Key-value caches are like memory shortcuts for a language model. They store important information about input text so the model doesn't have to recalculate it every time it generates or processes more text. That makes the model faster and more efficient.

Snowflake said the technique can improve LLM inference throughput by 50% and has reduced inferencing costs for the open-source Llama 3.3 70B and Llama 3.1 405B models by up to 75% compared with running without SwiftKV. The company is initially integrating the technique with vLLM, an open-source library for end-to-end LLM inference and serving, and making it available in those two Llama models. The same optimizations will be added to other model families available within Snowflake Cortex AI, a feature in Snowflake's Data Cloud platform that enables businesses to build, deploy and scale AI and machine learning models directly within Snowflake. However, Snowflake didn't specify a timeframe for supporting other models.

By avoiding redundant computations, SwiftKV reduces memory usage and computational overhead, enabling faster and more efficient decoding, particularly for autoregressive tasks in real-time AI applications. Those tasks involve generating one token, a word or part of a word, at a time, where each token is predicted based on the previously generated ones. The process is commonly used in applications such as chatbots, real-time translation and text generation, where speed is critical.

The company said SwiftKV's performance gains rest on the observation that most computational resources are consumed during the input, or prompt, stage. Many business tasks use long questions and generate short answers, which means most of the computing power goes into interpreting the prompt. A distribution chart Snowflake posted on its engineering blog shows that the typical Snowflake customer workload contains 10 times as many input tokens as output tokens.

"SwiftKV does not distinguish between inputs and outputs," said Yuxiong He, AI research team lead and distinguished software engineer at Snowflake. "When we enable SwiftKV, the model rewiring happens for both input processing as well as output generation. We achieve computation reduction on input processing only, otherwise known as prefilling computation."

SwiftKV saves time by reusing completed work instead of repeating the same calculations, cutting the redundant steps by half with minimal loss of accuracy. It also uses a technique called "self-distillation" to ensure the model retains everything it needs, so answer quality doesn't change. In benchmarks, Snowflake said it saw accuracy decline by less than one percentage point.
"There is a very small quality gap between the two," He said, "but if a customer is particularly concerned with this area, they can opt to use the base Llama models in Cortex AI instead." The technique enables performance optimizations on a range of use cases, Snowflake said. It improves throughput on unstructured text processing tasks such as summarization, translation and sentiment analysis. In latency-sensitive scenarios, such as chatbots or AI copilots, SwiftKV reduces the time-to-first token -- or the amount of time it takes for a model to generate and return the first piece of output -- by up to 50%.
Snowflake AI Research introduces SwiftKV, an optimization framework that significantly reduces inference costs and improves performance for large language models, particularly Meta's Llama models.
Snowflake AI Research has introduced SwiftKV, an optimization framework aimed at making large language model (LLM) inference markedly cheaper and more efficient. The release comes at a moment when enterprises are increasingly adopting LLM technologies and seeking solutions that offer both immediate performance gains and long-term scalability [1].
SwiftKV's core innovation lies in reducing computational overhead during the key-value (KV) cache generation stage. It achieves this by reusing hidden states from earlier transformer layers, recycling information to avoid repeating calculations [2]. This optimization can cut prefill compute by up to 50% while maintaining enterprise-grade accuracy [1].
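In rough outline, the rewiring works like the sketch below. This is a simplified illustration of the concept as the articles describe it, not Snowflake's implementation: the class name, the `cutoff` parameter, and the toy blocks are invented for the example, and real transformer layers include attention and normalization that are omitted here.

```python
import torch
import torch.nn as nn

class SwiftKVStylePrefill(nn.Module):
    """Illustrative sketch: during prefill, run only the first `cutoff`
    layers fully, then project KV-cache entries for every later layer
    from the same early hidden state instead of running those layers.
    Names and structure are assumptions for this example."""

    def __init__(self, layers, kv_projections, cutoff):
        super().__init__()
        self.layers = layers                  # stand-in transformer blocks
        self.kv_projections = kv_projections  # per-layer (W_K, W_V) pairs
        self.cutoff = cutoff                  # layer index where reuse begins

    def prefill(self, hidden):
        kv_cache = []
        for i in range(self.cutoff):
            # Normal path: KV for layer i comes from layer i's own input.
            w_k, w_v = self.kv_projections[i]
            kv_cache.append((w_k(hidden), w_v(hidden)))
            hidden = self.layers[i](hidden)
        for i in range(self.cutoff, len(self.layers)):
            # SwiftKV-style reuse: later layers' KV all come from the same
            # early hidden state, so their full forward pass is skipped.
            w_k, w_v = self.kv_projections[i]
            kv_cache.append((w_k(hidden), w_v(hidden)))
        return kv_cache

d, n_layers, cutoff = 64, 8, 4
layers = nn.ModuleList(nn.Sequential(nn.Linear(d, d), nn.ReLU())
                       for _ in range(n_layers))            # toy blocks
kv = nn.ModuleList(nn.ModuleList([nn.Linear(d, d), nn.Linear(d, d)])
                   for _ in range(n_layers))
cache = SwiftKVStylePrefill(layers, kv, cutoff).prefill(torch.randn(1, 16, d))
print(len(cache))  # 8: one (K, V) pair per layer, though only 4 layers ran
```

Because the second loop never executes the later layers' forward pass, prefill touches only the first `cutoff` blocks, which is where the reported compute reduction would come from; per the Snowflake quote in source [2], output generation still runs normally.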
The framework employs a combination of model rewiring, lightweight fine-tuning, and self-distillation to preserve performance. Snowflake AI Research reports that the accuracy loss is limited to about one percentage point across benchmarks, ensuring that answer quality remains largely unaffected [1][2].
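The self-distillation step can be pictured with a standard knowledge-distillation objective, in which the rewired model (student) learns to match the output distribution of the frozen original model (teacher). This is a generic sketch of that idea; the articles do not specify the exact loss or temperature SwiftKV uses.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Generic knowledge-distillation loss: soften both distributions and
    # minimize their KL divergence so the rewired (student) model tracks
    # the original frozen (teacher) model. The temperature value is a
    # conventional default, not a value taken from SwiftKV.
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t

loss = self_distillation_loss(torch.randn(4, 32000), torch.randn(4, 32000))
```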
SwiftKV delivers impressive performance enhancements [1][2]:
- Up to twice the throughput for models such as Llama-3.3-70B in GPU environments like NVIDIA H100s
- Prefill compute reduced by up to 50%
- Time to first token reduced by up to 50% in latency-sensitive applications such as chatbots and AI copilots
- Inference costs cut by up to 75% for the Snowflake-Llama models compared with the baseline Meta Llama models
SwiftKV is designed to integrate seamlessly with vLLM, a popular inference framework, enabling additional optimization techniques such as attention optimization and speculative decoding [1]. Snowflake has made SwiftKV-optimized models, including Snowflake-Llama-3.3-70B and Snowflake-Llama-3.1-405B, available for serverless inference on Cortex AI [1].
The company plans to extend SwiftKV support to other model families within Snowflake Cortex AI, although specific timelines have not been announced [2].
To promote wider adoption and further development, Snowflake has made SwiftKV open source. Model checkpoints are available on Hugging Face, and optimized inference is accessible through vLLM [1]. Additionally, the company has released the ArcticTraining Framework, a post-training library for building SwiftKV models, enabling enterprises and researchers to deploy custom solutions [1].
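Given checkpoints on Hugging Face and vLLM support, serving a SwiftKV model locally should look like any standard vLLM load. The repo ID below is a placeholder, not a confirmed name; the actual checkpoint identifiers are listed on Snowflake's Hugging Face organization page.

```python
from vllm import LLM, SamplingParams

# Placeholder repo ID: replace with an actual SwiftKV checkpoint name
# from Snowflake's Hugging Face organization before running.
llm = LLM(model="Snowflake/<llama-swiftkv-checkpoint>")

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Summarize this support ticket: ..."], params)
print(outputs[0].outputs[0].text)
```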
SwiftKV's introduction is particularly significant for enterprises embracing LLM technologies. By addressing computational bottlenecks, it allows businesses to maximize the potential of their LLM deployments [1]. This optimization is especially valuable for workloads typical in enterprise settings, where long questions often generate short answers and most computational resources are consumed during the input, or prompt, stage [2].
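A back-of-the-envelope calculation shows why halving prefill compute matters so much for these workloads, assuming (simplistically) that prefill and decode cost about the same per token and using the 10:1 input-to-output token ratio Snowflake reports:

```python
# Rough estimate only; treats prefill and decode as equal-cost per token,
# which ignores that decode is often memory-bound rather than compute-bound.
input_tokens, output_tokens = 1000, 100          # the reported 10:1 ratio
prefill_share = input_tokens / (input_tokens + output_tokens)
total_saved = prefill_share * 0.5                # SwiftKV halves prefill compute
print(f"prefill share of compute: {prefill_share:.0%}")  # ~91%
print(f"total compute saved: {total_saved:.0%}")         # ~45%
```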
As more businesses turn to cloud data platforms like Snowflake's to organize their data using AI, innovations like SwiftKV help make AI technologies more accessible and cost-effective. This aligns with Snowflake's broader strategy, which includes a recent partnership with the AI company Anthropic and the development of AI agents through its Snowflake Intelligence platform [1].
Summarized by Navi