3 Sources
[1]
IBM launches Granite 4.0 to cut AI infra costs with hybrid Mamba-transformer models
Built for long-context tasks and edge deployments, Granite 4.0 combines Mamba's linear scaling with transformer precision, offering enterprises lower memory usage, faster inference, and ISO 42001-certified trust.

IBM has launched Granite 4.0, a new family of open-source language models designed to slash the infrastructure costs that have become a major barrier to enterprise AI adoption. Released under the Apache 2.0 license, Granite 4.0 represents IBM's bet on a fundamentally different architectural approach to enterprise AI deployment. The models are built on what the company described as a "hybrid" architecture -- combining emerging Mamba state space models with traditional transformer layers. Mamba, developed by researchers from Carnegie Mellon University and Princeton University, processes information sequentially rather than analyzing all tokens simultaneously as transformers do.
[2]
IBM wows with Granite 4 LLM launch and hybrid Mamba/Transformer architecture
IBM today announced the release of Granite 4.0, the newest generation of its homegrown family of open source large language models (LLMs), designed to balance high performance with lower memory and cost requirements.

Despite being one of the oldest active tech companies in the U.S. (founded in 1911, 114 years ago!), "Big Blue," as it's often nicknamed, has already wowed many AI industry workers and followers with this new Granite 4.0 family of LLMs: the models offer high performance on third-party benchmarks; a permissive, business-friendly license (Apache 2.0) that allows developers and enterprises to freely take, modify, and deploy the models for their own commercial purposes; and, perhaps most importantly, they have symbolically put the U.S. back into a competitive position against the growing raft of high-performing, new-generation open source Chinese LLMs, especially from Alibaba's prolific Qwen team -- alongside OpenAI with its gpt-oss model family released earlier this summer.

Meta, the parent company of Facebook and Instagram, was once seen as the world and U.S. leader in open source LLMs with its Llama models. But after the disappointing release of the Llama 4 family in April and the absence of its planned, most powerful model, Llama 4 Behemoth, it has since pursued a different strategy: partnering with outside labs like Midjourney on AI products while continuing to build out an expensive, in-house AI "Superintelligence" team. Little wonder AI engineer Alexander Doria (aka Pierre-Carl Langlais) observed, with a hilarious Lethal Weapon meme, that "ibm suiting up again after llama 4 fumbled," and "we finally have western qwen."

Hybrid (Transformer/Mamba) theory

At the heart of IBM's Granite 4.0 release is a new hybrid design that combines two very different architectures, or underlying organizational structures, for the LLMs in question: transformers and Mamba.

Transformers, introduced in 2017 by Vaswani and colleagues in the famous Google paper "Attention Is All You Need," power most large language models in use today. In this design, every token -- essentially a small chunk of text, like a word or part of a word -- can compare itself to every other token in the input. This "all-to-all" comparison is what gives transformers their strong ability to capture context and meaning across a passage. The trade-off is efficiency: because the model must calculate relationships between every possible pair of tokens in the context window, computation and memory demands grow rapidly as the input gets longer. This quadratic scaling makes transformers costly to run on very long documents or at high volume.

Mamba, by contrast, is a newer architecture developed in late 2023 by researchers Albert Gu and Tri Dao at Carnegie Mellon University and Princeton University. Instead of comparing every token against all the others at once, it processes tokens one at a time, updating its internal state as it moves through the sequence. This design scales only linearly with input length, making it far more efficient at handling long documents or multiple requests at once. The trade-off is that transformers still tend to perform better in certain kinds of reasoning and "few-shot" learning, where it helps to hold many detailed token-to-token comparisons in memory.
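As a rough illustration of the asymptotics described above (these are toy counts, not IBM's numbers), the following Python sketch compares how attention's all-to-all comparisons grow against an SSM's one-step-per-token work as the context gets longer:

```python
# Back-of-envelope comparison of how compute grows with context length.
# Attention must score every token against every other token (~n^2 pairs),
# while a state space model does a fixed amount of work per token (~n steps).

def attention_pair_count(n_tokens: int) -> int:
    """Number of token-to-token comparisons in full self-attention."""
    return n_tokens * n_tokens

def ssm_step_count(n_tokens: int) -> int:
    """Number of sequential state updates in a Mamba-style SSM."""
    return n_tokens

for n in (1_000, 10_000, 100_000):
    ratio = attention_pair_count(n) / ssm_step_count(n)
    print(f"{n:>7} tokens: attention ~{attention_pair_count(n):.2e} pairs, "
          f"SSM ~{ssm_step_count(n):.2e} steps ({ratio:,.0f}x gap)")
```

Growing the context 10x makes the attention column grow 100x while the SSM column grows only 10x, which is the whole economic argument for the hybrid design.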
But whether the model is built on transformers, Mamba, or a hybrid of the two, the way it generates new words works the same way. At each step, the model doesn't just pick from what's already in the context window. Instead, it uses its internal weights -- built from training on trillions of text samples -- to predict the most likely next token from its entire vocabulary. That's why, when prompted with "The capital of France is...," the model can output "Paris" even if "Paris" isn't in the input text. It has learned from countless training examples that "Paris" is a highly probable continuation in that context. In other words, the context window guides the prediction, but the embedding space -- the model's learned representation of all the tokens it knows -- supplies the actual words it can generate.

By combining Mamba-2 layers with transformer blocks, Granite 4.0 seeks to offer the best of both worlds: the efficiency of Mamba and the contextual precision of transformers. This is the first official Granite release to adopt the hybrid approach. IBM previewed it earlier in 2025 with Granite-4.0-Tiny-Preview, but Granite 4.0 marks the company's first full family of models built on the Mamba-transformer combination.

Granite 4.0 is positioned as an enterprise-ready alternative to conventional transformer-based models, with particular emphasis on agentic AI tasks such as instruction following, function calling, and retrieval-augmented generation (RAG). The models are open sourced under the Apache 2.0 license, cryptographically signed for authenticity, and stand out as the first open language model family certified under ISO 42001, an international standard for AI governance and transparency.

Reducing memory needs, expanding accessibility

One of Granite 4.0's defining features is its ability to significantly reduce GPU memory consumption compared to traditional large language models. IBM reports that the hybrid Mamba-transformer design can cut RAM requirements by more than 70% in production environments, especially for workloads involving long contexts and multiple concurrent sessions.

Benchmarks released alongside the launch illustrate these improvements. Granite-4.0-H-Small, a 32B-parameter mixture-of-experts model with 9B active parameters, maintains strong throughput on a single NVIDIA H100 GPU, continuing to accelerate even under workloads that typically strain transformer-only systems. This efficiency translates directly into lower hardware costs for enterprises running intensive inference tasks.

For smaller-scale or edge deployments, Granite 4.0 offers two lighter options: Granite-4.0-H-Tiny, a 7B-parameter hybrid with 1B active parameters, and Granite-4.0-H-Micro, a 3B dense hybrid. IBM is also releasing Granite-4.0-Micro, a 3B transformer-only model intended for platforms not yet optimized for Mamba-based architectures.
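IBM's 70%-plus figure comes from its own production measurements, but the intuition is easy to reproduce. Here is an illustrative back-of-envelope calculation, with all layer counts and dimensions as hypothetical placeholders (not Granite 4.0's actual configuration), showing how a transformer's key-value cache grows with context length and concurrent sessions while a Mamba-style layer keeps a fixed-size state per sequence:

```python
# Rough, illustrative inference-memory arithmetic. All dimensions are
# hypothetical placeholders, not Granite 4.0's actual configuration.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, sessions, bytes_per_val=2):
    """Transformer KV cache: keys + values stored for every past token."""
    return 2 * layers * kv_heads * head_dim * seq_len * sessions * bytes_per_val

def ssm_state_bytes(layers, state_dim, channels, sessions, bytes_per_val=2):
    """Mamba-style layers: a fixed-size state, independent of input length."""
    return layers * state_dim * channels * sessions * bytes_per_val

GB = 1024 ** 3
for seq_len in (8_192, 131_072):
    kv = kv_cache_bytes(layers=40, kv_heads=8, head_dim=128,
                        seq_len=seq_len, sessions=16)
    ssm = ssm_state_bytes(layers=40, state_dim=128, channels=4096, sessions=16)
    print(f"{seq_len:>7} tokens x 16 sessions: "
          f"KV cache ~{kv / GB:.1f} GiB vs fixed SSM state ~{ssm / GB:.1f} GiB")
```

The KV cache scales with tokens times sessions, so long-context, multi-session serving is exactly where replacing most attention layers with state space layers pays off.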
Performance benchmarks

Performance metrics suggest that the new models not only reduce costs but also compete with larger systems on enterprise-critical tasks. According to Stanford HELM's IFEval benchmark, which measures how well LLMs follow instructions from users, Granite-4.0-H-Small surpasses nearly all open-weight models in instruction-following accuracy, ranking just behind Meta's much larger Llama 4 Maverick. The models also show strong results on the Berkeley Function Calling Leaderboard v3, where Granite-4.0-H-Small achieves a favorable trade-off between accuracy and hosted API pricing. On retrieval-augmented generation tasks, Granite 4.0 models post some of the highest mean accuracy scores among open competitors. Notably, IBM highlights that even Granite 4.0's smallest models outperform Granite 3.3 8B, despite being less than half its size, underscoring the gains achieved through both architectural changes and refined training methods.

Trust, safety, and security

Alongside technical efficiency, IBM is emphasizing governance and trust. Granite is the first open model family to achieve ISO/IEC 42001:2023 certification, demonstrating compliance with international standards for AI accountability, data privacy, and explainability. The company has also partnered with HackerOne to run a bug bounty program for Granite, offering up to $100,000 for vulnerabilities that could expose security flaws or adversarial risks. Additionally, every Granite 4.0 model checkpoint is cryptographically signed, enabling developers to verify provenance and integrity before deployment. IBM provides indemnification for customers using Granite on its watsonx.ai platform, covering third-party intellectual property claims against AI-generated content.

Training and roadmap

Granite 4.0 models were trained on a 22-trillion-token corpus drawn from enterprise-relevant datasets, including DataComp-LM, Wikipedia, and curated subsets designed to support language, code, math, multilingual tasks, and cybersecurity. Post-training is split between instruction-tuned models, released today, and reasoning-focused "Thinking" variants expected later this fall. IBM plans to expand the family by the end of 2025 with additional models, including Granite 4.0 Medium for heavier enterprise workloads and Granite 4.0 Nano for edge deployments.

Broad availability across platforms

Granite 4.0 models are available immediately on Hugging Face and IBM watsonx.ai, with distribution also through partners such as Dell Technologies, Docker Hub, Kaggle, LM Studio, NVIDIA NIM, Ollama, OPAQUE, and Replicate. Support through Amazon SageMaker JumpStart and Microsoft Azure AI Foundry is expected soon. The hybrid architecture is supported in major inference frameworks, including vLLM 0.10.2 and Hugging Face Transformers; a minimal serving sketch follows this section. Compatibility has also been extended to llama.cpp and MLX, although optimization work is ongoing. The models are also usable in Unsloth for fine-tuning and in Continue for custom AI coding assistants.

Enterprise focus

Early access testing by enterprise partners, including EY and Lockheed Martin, has guided the launch. IBM highlights that the models are tailored for real-world enterprise needs, such as supporting multi-agent workflows, customer support automation, and large-scale retrieval systems. Granite 4.0 models are available in both Base and Instruct forms, with Instruct variants optimized for enterprise instruction-following tasks. The upcoming "Thinking" series will target advanced reasoning.

Alternate hybrid Mamba/Transformer models

Besides IBM, several major efforts are already charting different designs for mixing transformers with the Mamba architecture. The Qwen family from Alibaba remains a dense, decoder-only transformer architecture, with no Mamba or SSM layers in its mainline models. However, experimental offshoots like Vamba-Qwen2-VL-7B show that hybrids derived from Qwen are possible, especially in vision-language settings. For now, though, Qwen itself is not part of the hybrid wave.
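Since vLLM is named among the supported inference frameworks, here is a minimal sketch of what offline serving with vLLM's Python API could look like. The model ID is an assumption for illustration (IBM publishes Granite checkpoints under the ibm-granite organization on Hugging Face; check there for exact names), and the article notes hybrid support requires vLLM 0.10.2 or later:

```python
from vllm import LLM, SamplingParams

# Assumed model ID for illustration; the exact Granite 4.0 checkpoint
# names are published under the ibm-granite org on Hugging Face.
# Requires vLLM >= 0.10.2 for the hybrid Mamba/transformer architecture.
llm = LLM(model="ibm-granite/granite-4.0-h-micro")

params = SamplingParams(temperature=0.0, max_tokens=128)

prompts = [
    "Summarize the trade-off between transformer attention and "
    "Mamba-style state space layers in two sentences.",
]

# vLLM batches and schedules requests internally; hybrid models keep a
# fixed-size state per sequence rather than a KV cache that grows with length.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```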
What Granite 4.0 means for enterprises and what's next

Granite 4.0 reflects IBM's strategy of combining open access with enterprise-grade safety, scalability, and efficiency. By focusing on lowering inference costs and reinforcing trust through governance standards, IBM positions the Granite family as a practical foundation for enterprises building AI applications at scale.

For the U.S., the release carries symbolic weight: with Meta stepping back from leading the open-weight frontier after the uneven reception of Llama 4, and with Alibaba's Qwen family rapidly advancing in China, IBM's move positions American enterprise once again as a competitive force in globally available models. By making Granite 4.0 Apache-licensed, cryptographically signed, and ISO 42001-certified, IBM is signaling both openness and responsibility at a moment when trust, efficiency, and affordability are top of mind. This is especially enticing to U.S. and Western-based organizations that may be interested in open source models but wary of those originating from China -- rightly or not -- over possible political ramifications and U.S. government contracts.

For practitioners inside organizations, this positioning is not abstract. Lead AI engineers tasked with managing the full lifecycle of LLMs will see Granite 4.0's smaller memory footprint as a way to deploy faster and scale with leaner teams. Senior AI engineers in orchestration roles, who must balance budget limits with the need for efficiency, can take advantage of Granite's compatibility with mainstream platforms like SageMaker and Hugging Face to streamline pipelines without locking into proprietary ecosystems. Senior data engineers, responsible for integrating AI with complex data systems, will note the hybrid models' efficiency on long-context inputs, enabling retrieval-augmented generation on large datasets at lower cost. And for IT security directors charged with managing day-to-day defense, IBM's bug bounty program, cryptographic signing, and ISO certification provide clear governance signals that align with enterprise compliance needs.

By targeting these distinct roles with a model family that is efficient, open, and hardened for enterprise use, IBM is not only courting adoption but also shaping a uniquely American answer to the open-source challenge posed by Qwen and other Chinese entrants. In doing so, Granite 4.0 places IBM at the center of a new phase in the global LLM race -- one defined not just by size and speed, but by trust, cost efficiency, and readiness for real-world deployment. With additional models scheduled for release before the end of the year and broader availability across major AI development platforms, Granite 4.0 is set to play a central role in IBM's vision of enterprise-ready, open-source AI.
[3]
IBM Granite 4.0: What you need to know about its hybrid AI models
IBM has just unveiled Granite 4.0, its latest generation of enterprise-focused AI language models, and it's turning heads in the AI community. Designed for efficiency without sacrificing performance, Granite 4.0 brings a host of innovations aimed at making high-performance AI more accessible for businesses of all sizes.

At the core of Granite 4.0 is its novel hybrid Mamba-2/transformer architecture. This design merges the speed and long-context capabilities of state-space models (SSMs) with the deep contextual understanding of transformers. The result? A model family optimized for long conversations, multi-session tasks, and complex reasoning workloads, all while significantly reducing the memory footprint.

Hybrid architectures like the one in Granite 4.0 are becoming increasingly important as AI models grow larger and more demanding. By combining SSMs and transformers, IBM has created models that are over 70% more memory-efficient than comparable solutions. This means enterprises can run these models on more affordable GPUs without compromising on throughput or accuracy - a critical advantage for businesses looking to scale AI without ballooning infrastructure costs.

Granite 4.0 also breaks new ground in transparency and governance. The models are open source under the Apache 2.0 license and form the first open language model family to achieve ISO 42001 certification, ensuring compliance with international AI governance standards. Deployment options are equally versatile, with availability across IBM watsonx.ai, Hugging Face, Docker Hub, and Replicate. Soon, Granite models will also support Amazon SageMaker JumpStart and Microsoft Azure AI Foundry.

IBM has released multiple variants tailored for different use cases:
- Granite-4.0-H-Small: a 32B-parameter mixture-of-experts hybrid with 9B active parameters
- Granite-4.0-H-Tiny: a 7B-parameter hybrid with 1B active parameters
- Granite-4.0-H-Micro: a 3B dense hybrid
- Granite-4.0-Micro: a 3B transformer-only model for platforms not yet optimized for Mamba

Granite 4.0 isn't just efficient, it's fast and capable. Early benchmarks show strong performance in instruction-following tasks and retrieval-augmented generation (RAG) scenarios, often outperforming models that are significantly larger in size. Enterprises requiring low-latency, high-throughput AI for real-time applications will find Granite 4.0 particularly compelling.

The launch of Granite 4.0 signals a shift in enterprise AI, where efficiency, accessibility, and compliance are just as important as raw performance. By offering high-quality AI models that are memory-efficient, flexible, and certified for governance, IBM is enabling a wider range of businesses to integrate AI into their workflows without the need for massive infrastructure investments. Granite 4.0 may well be a game-changer for hybrid AI models, providing a blueprint for future language models that balance speed, context, and accessibility.

For enterprises and AI enthusiasts eager to explore Granite 4.0, the models are now publicly available, marking a new chapter in scalable, high-performance AI.
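As a concrete starting point for that exploration, here is a minimal sketch of loading a checkpoint with the Hugging Face transformers library and inspecting the model's next-token predictions, the vocabulary-wide scoring described in the coverage above. The model ID is an assumption for illustration (IBM publishes Granite checkpoints under the ibm-granite organization on Hugging Face); any causal LM behaves the same way:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model ID for illustration; substitute whichever Granite 4.0
# checkpoint (or any causal LM) you actually have access to.
model_id = "ibm-granite/granite-4.0-micro"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # Logits have shape [batch, seq_len, vocab_size]: a score for every
    # token in the vocabulary, not just tokens present in the prompt.
    logits = model(**inputs).logits

next_token_scores = logits[0, -1]
top = torch.topk(next_token_scores, k=5)
for score, token_id in zip(top.values, top.indices):
    print(repr(tokenizer.decode([int(token_id)])), float(score))
```

With a well-trained model, a continuation like " Paris" should rank near the top even though it never appears in the prompt, illustrating that predictions come from the learned weights rather than the context window alone.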
IBM launches Granite 4.0, a new family of open-source language models combining Mamba and transformer architectures. These models aim to reduce AI infrastructure costs while maintaining high performance for enterprise applications.
IBM has made a significant leap in the world of artificial intelligence with the launch of Granite 4.0, a new family of open-source language models designed to revolutionize enterprise AI deployment [1]. This release marks a strategic move by IBM to address one of the most pressing challenges in AI adoption: the high infrastructure costs associated with running large language models (LLMs).

At the heart of Granite 4.0 lies a novel hybrid architecture that combines two distinct approaches to AI model design: the Mamba state space model and traditional transformer layers [2]. This innovative fusion aims to harness the strengths of both architectures: Mamba's linear scaling and efficiency on long inputs, and the transformer's precise, all-to-all contextual understanding [2]. By integrating these approaches, Granite 4.0 achieves a balance of efficiency and precision, particularly beneficial for long-context tasks and edge deployments [1].

The hybrid design of Granite 4.0 brings substantial improvements in efficiency without compromising on performance: IBM reports memory requirements reduced by more than 70% for long-context and multi-session workloads [2][3], along with strong throughput on a single GPU [2]. These improvements allow enterprises to run high-performance AI models on more affordable GPUs, significantly reducing the barrier to AI adoption [3].

IBM has positioned Granite 4.0 as an enterprise-ready solution with several key features: a permissive Apache 2.0 license, cryptographically signed checkpoints, and ISO 42001 certification for AI governance [2], plus broad availability across platforms such as watsonx.ai, Hugging Face, Docker Hub, and Replicate [3].

The release of Granite 4.0 has significant implications for the AI industry: it repositions a U.S. company as a competitive force in open-weight models alongside Chinese entrants such as Alibaba's Qwen [2], and it offers a blueprint for future hybrid models that balance speed, context, and accessibility [3]. With Granite 4.0, IBM has not only introduced a powerful new tool for enterprises but also potentially charted a new course for the future of AI model development, where hybrid architectures may become increasingly prevalent.
Summarized by Navi