17 Sources
[1]
Google announces Gemma 4 open AI models, switches to Apache 2.0 license
Google's Gemini AI models have improved by leaps and bounds over the past year, but you can only use Gemini on Google's terms. The company's Gemma open-weight models have provided more freedom, but Gemma 3, which launched over a year ago, is getting a bit long in the tooth. Starting today, developers can work with Gemma 4, which comes in four sizes optimized for local usage. Google has also acknowledged developer frustrations with AI licensing, so it's dumping the custom Gemma license.

Like past versions of its open-weight models, Google has designed Gemma 4 to be usable on local machines. That can mean plenty of things, of course. The two large Gemma variants, 26B Mixture of Experts and 31B Dense, are designed to run unquantized in bfloat16 format on a single 80GB Nvidia H100 GPU. Granted, that's a $20,000 AI accelerator, but it's still local hardware. If quantized to run at lower precision, these big models will fit on consumer GPUs.

Google also claims it has focused on reducing latency to really take advantage of Gemma's local processing. The 26B Mixture of Experts model activates only 3.8 billion of its 26 billion parameters during inference, giving it much higher tokens-per-second throughput than similarly sized models. Meanwhile, 31B Dense is more about quality than speed, but Google expects developers to fine-tune it for specific uses.

The other two Gemma 4 models, Effective 2B (E2B) and Effective 4B (E4B), are aimed at mobile devices. They were designed to maintain low memory usage during inference, running at an effective 2 billion or 4 billion parameters. Google says the Pixel team worked closely with Qualcomm and MediaTek to optimize these models for devices like smartphones, Raspberry Pi, and Jetson Nano. Not only do they use less memory and battery than Gemma 3, but Google also touts "near-zero latency" this time around.
All the new Gemma 4 models will reportedly leave Gemma 3 in the dust -- Google claims these are the most capable models you can run on your local hardware. Google says the Gemma 4 31B model will debut at number three on the Arena list of top open AI models, behind GLM-5 and Kimi 2.5. However, even the biggest Gemma 4 variant is a fraction of the size of those models, making it theoretically much cheaper to run.
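The single-GPU sizing claims above reduce to simple arithmetic: weight memory is roughly parameter count times bytes per parameter. A minimal sketch of that calculation, ignoring activation and KV-cache overhead, which real deployments must also budget for:

```python
def weight_gib(params_billions: float, bits_per_param: int) -> float:
    """Approximate weight memory in GiB at a given numeric precision."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 2**30

# 31B dense model stored in bfloat16 (16 bits per parameter)
bf16 = weight_gib(31, 16)   # roughly 58 GiB, so it fits in an 80 GB H100
# Same model quantized to 4-bit precision
q4 = weight_gib(31, 4)      # roughly 14 GiB, so it fits on a 24 GB consumer GPU

print(f"31B @ bf16: {bf16:.1f} GiB, @ 4-bit: {q4:.1f} GiB")
```

This is why 4-bit quantization is the usual route to running the big variants on cards like an RTX 4090: it cuts weight memory by a factor of four relative to bfloat16.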
[2]
Google's Gemma 4 model goes fully open-source and unlocks powerful local AI - even on phones
From servers to smartphones, deployment just got much easier. Google announced today that its DeepMind AI research division is releasing Gemma 4, its latest generation of open large language models. The models are being released under the Apache 2.0 license, making them truly open source compared to the permissive but still controlled license of earlier Gemma generations.

Gemma is an LLM like Gemini. But here, we're talking about the AI processing engine, not the chatbot interface. Both Gemma and Gemini were developed using the same research and technology. The difference is that Gemini is a subscription-based closed product, whereas Gemma is an open model that can be downloaded and run locally for free.

The ability to run an AI model locally without a fee benefits a variety of applications. There are plenty of folks who want to run AI at home, without relying on the cloud, and for free.

The ability to keep everything local is particularly important to enterprises with data sovereignty or confidentiality requirements. For example, healthcare providers might face regulatory restrictions that prevent them from sharing patient data with a public cloud provider, yet still want to benefit from AI. By running the entire system locally, no data is sent to the cloud, but the AI capability is still available.

Many devices, ranging from smartphones to a whole range of IoT and edge devices, may have only intermittent network connectivity, or none at all. Being able to run AI operations without additional costs and without the need to phone home provides considerable benefits in terms of flexibility, security, and cost control.
So, while you might run Gemini in your chat interface, you might install Gemma on a Raspberry Pi to monitor a process in a factory and make decisions in real time without the latency of a round trip to the cloud and back.

Earlier versions of Gemma were licensed under a Gemma Terms of Use statement rather than a formal open-source license. Google permitted users to download Gemma, use it locally, and make modifications, but it restricted use to approved categories and limited redistribution. This approach allowed the model family to be called "open" but not "open source." There were many freedoms associated with using Gemma, but Google still held the leash.

By contrast, the Apache 2.0 license grants nearly total freedom. Users and developers can use the software for any purpose, whether personal, commercial, or enterprise, without any royalty requirements. If you do distribute the software, you're obligated to include a copy of the Apache 2.0 license and provide required attribution. Users and developers are free to modify and redistribute the code, with the right to create derivative works and distribute both the original and modified versions.

There are also some interesting patent-related protections and penalties. In terms of protections, Apache 2.0-licensed users are granted a license to any patents covering contributions, so that patent lawsuits can't target users merely for using the software. On the other hand, if you sue someone claiming the software infringes your patent, you automatically lose your license to use the software.

Google is no longer using its own terms of use for Gemma 4. Instead, it's licensing Gemma 4 under the Apache 2.0 license, which means users and developers can use and distribute the model in any way they want without restrictions.
Since the release of Gemma two years ago, in February 2024, the open model has experienced considerable adoption. According to Clement Farabet, VP of research, and Olivier Lacombe, group product manager at Google DeepMind, "Since the launch of our first generation, developers have downloaded Gemma over 400 million times, building a vibrant Gemmaverse of more than 100,000 variants."

But as ZDNET reported back then, "Google's latest AI offering is an 'open model' but not 'open-sourced.' That difference matters." That was then, and this is now. Now, Gemma 4 is being released as pure open-source software, which means we can expect adoption rates to pick up even over what we've seen in the past 26 months. Not only can we expect to see Gemma 4 adopted in more projects, but it's also now legitimately possible to bundle the AI with products, services, and devices that can benefit from a powerful on-board model.

Gemma 4 is actually a four-model set. Two of the models are designed for higher-end servers with powerful GPUs, such as Nvidia's H100. These models, known as 26B and 31B, have large parameter footprints. The 26B version focuses on reducing latency, activating a subset of its total parameter set for inference. The 31B model is designed to maximize raw power and quality, bringing all its capabilities to any problem it's asked to work on.

The other two models are designed for the low end. Called E2B and E4B, these models are intended for mobile and IoT devices, although they'll also work well running on your home PC. They have two- and four-billion-parameter effective footprints, respectively, limiting device impact so that they can run efficiently on mobile and edge devices.
According to Google's Farabet and Lacombe, "In close collaboration with our Google Pixel team and mobile hardware leaders like Qualcomm Technologies and MediaTek, these multimodal models run completely offline with near-zero latency across edge devices like phones, Raspberry Pi, and Jetson Nano."

The company says all the models are multimodal and support more than 140 languages. There is no indication that Conversational Klingon is among them. However, given that Gemma 4 has been trained on a massive scrape of the public web, and that there is a dedicated community, a dictionary, and plenty of fan-generated content online, Klingon almost certainly appeared in the training data, which means the model should be able to perform some rudimentary translation at the least.

In their blog post, Farabet and Lacombe said, "Gemma 4 outcompetes models 20x its size. For developers, this new level of intelligence-per-parameter means achieving frontier-level capabilities with significantly less hardware overhead."

If you could deploy Gemma 4 on a local device today, what would be the first real task you would trust it to handle? Let us know in the comments below.
[3]
Google battles Chinese open weights models with Gemma 4
Now with a more permissive license, multi-modality, and support for more than 140 languages

Google on Thursday unleashed a wave of new open-weights Gemma models optimized for agentic AI and coding, under a more permissive Apache 2.0 license aimed at winning over enterprises. The launch comes amid an onslaught of open-weights Chinese large language models (LLMs) from Moonshot AI, Alibaba, and Z.AI, many of which now rival OpenAI's GPT-5 or Anthropic's Claude. With its latest release, Google is offering enterprise customers a domestic alternative, but one that won't just hoover up sensitive corporate data to train future models.

Developed by Google's DeepMind team, the fourth generation of Gemma models brings several improvements, including "advanced reasoning" to improve performance in math and instruction-following, support for more than 140 languages, native function calling, and video and audio inputs. As with prior Gemma models, Google is making them available in multiple sizes to address applications ranging from single-board computers and smartphones to laptops and enterprise datacenters.

At the top of the stack is a 31 billion-parameter LLM that, Google says, has been tuned to maximize output quality. Given its size, the model isn't at risk of cannibalizing Google's larger proprietary models, but it is small enough that enterprises won't need to run out and spend hundreds of thousands of dollars on GPU servers to run or fine-tune it. According to Google, the model can run unquantized at 16-bit precision on a single 80 GB H100. Meanwhile, at 4-bit precision, the model is small enough to fit on a 24 GB GPU like an Nvidia RTX 4090 or AMD RX 7900 XTX using frameworks such as Llama.cpp or Ollama.

For applications requiring lower latency, aka faster responses, the Gemma 4 lineup also includes a 26 billion-parameter model that uses a mixture of experts (MoE) architecture.
During inference, a subset of the model's 128 experts, totaling 3.8 billion active parameters, is used to process and generate each token. So long as you can fit the model into your VRAM, it can generate tokens far faster than a dense model of equivalent size. This higher speed does come at some cost to output quality, since only a fraction of the parameters are used to process each token. However, the trade-off may be worthwhile when running on devices with slower memory, like a notebook or a consumer graphics card.

Both of these models feature a 256,000-token context window, making them appropriate for local code assistants, a use case Google was keen to highlight in its launch announcement.

Alongside these models are a pair of LLMs optimized for low-end edge hardware like smartphones and single-board computers such as the Raspberry Pi. These models are available in two sizes, one with two billion effective parameters and another with four billion. The key word here is "effective." The models actually have 5.1 and 8 billion parameters, respectively, but by using per-layer embeddings (PLE), Google is able to reduce the effective compute footprint to between 2.3 billion and 4.5 billion parameters, making the models more efficient to run on devices with limited compute or battery capacity.

Despite their size, the two models still offer a context window of 128,000 tokens and are multimodal: in addition to text, they can accept visual inputs as well as audio, a capability exclusive to the E2B and E4B. As with all vendor-supplied benchmarks, take these claims with a grain of salt, but Google boasts significant performance improvements over Gemma 3 across a variety of AI benchmarks.

But Gemma 4's most significant change is perhaps the switch to a more permissive Apache 2.0 license, which gives enterprises much more flexibility as to how and where they can use or deploy the models.
Previously, Google's Gemma license had prohibited use of the models in certain scenarios and reserved the right to terminate a user's access if they didn't play by the rules. The move to Apache 2.0 now means enterprises can deploy the models without fear of Google pulling the rug out from under them. Gemma 4 is available in Google's AI Studio and AI Edge Gallery services, as well as popular model repos like Hugging Face, Kaggle, and Ollama. At launch, Google claims day-one support for more than a dozen inference frameworks including vLLM, SGLang, Llama.cpp, and MLX, to name a handful. ®
[4]
Google's Gemma 4 AI can run on smartphones, no Internet required
In a nutshell: Google has released the Gemma 4 open-weight AI model, designed to run locally on smartphones and other consumer devices. Built on Gemini 3, Gemma 4 comes in four versions optimized for different use cases, giving users and developers the flexibility to choose the model that best fits their needs.

The two largest Gemma 4 models - 26B Mixture of Experts and 31B Dense - require an 80GB Nvidia H100 GPU to run unquantized in bfloat16 format. Google claims these models deliver "frontier intelligence on personal computers" for students, researchers, and developers, providing advanced reasoning capabilities for IDEs, coding assistants, and agentic workflows. The 26B model activates only 3.8 billion of its 26 billion parameters during inference, resulting in higher tokens-per-second performance compared with similar models, while significantly reducing latency. In contrast, the 31B model focuses on "maximizing raw quality" and allows developers to fine-tune it for specific use cases.

The variants most relevant to end users are Effective 2B and Effective 4B. These models can run entirely offline and use minimal memory during inference, with only 2 billion and 4 billion effective parameters, respectively. Google says that reducing the number of active parameters enables these models to run on mobile and IoT devices, including smartphones, Raspberry Pi, and Jetson Nano.

Google claims that its Gemma 4 models are not only significantly faster than Gemma 3 but also the most capable AI models ever designed to run on local hardware. Independent testing appears to support this claim: the 31B model currently ranks #3 on the Arena AI leaderboard for open models, behind GLM-5 and Kimi 2.5, while the 26B sits at #6. Gemma 4 has been released under an Apache 2.0 license, allowing developers to integrate it into their apps and services without usage restrictions.
By comparison, Gemma 3 is governed by a custom Google license with strict usage policies and numerous limitations, making it less attractive for developers. It is worth noting that, despite the Apache 2.0 license, Gemma 4 is "open-weight" rather than fully open-source. According to the Open Source Initiative, an AI model can only be considered open-source if the complete dataset used for training, along with scripts, infrastructure code, and detailed methodologies, is released. Google, however, is only releasing the model parameters, not the full, reproducible training pipeline, which prevents others from recreating the model from scratch. For most developers, this limitation is unlikely to matter, as the Apache 2.0 license still permits all forms of commercial use, modification, redistribution, and deployment, with only attribution required.
[5]
Google has launched Gemma 4
Built from the same research as Gemini 3, the new family spans a 2B edge model that runs on a Raspberry Pi to a 31B dense model currently ranked third on the Arena AI open-model leaderboard. The Apache 2.0 licence is a significant shift from previous Gemma releases.

Google has released Gemma 4, the latest generation of its open-weight model family, in four sizes designed to cover everything from on-device inference on smartphones to workstation-class deployments. The models are built from the same research and technology that underpins Gemini 3, Google's proprietary frontier model, and are released under an Apache 2.0 licence, terms more permissive than those of previous Gemma generations, and a change that Hugging Face co-founder Clément Delangue described as "a huge milestone." Demis Hassabis, CEO of Google DeepMind, called the new models "the best open models in the world for their respective sizes."

The four variants are the Effective 2B (E2B) and Effective 4B (E4B) edge models, designed to run on-device on phones, Raspberry Pi, and Jetson Nano hardware and developed in collaboration with the Pixel team, Qualcomm, and MediaTek; and the 26B Mixture-of-Experts (MoE) and 31B Dense models, aimed at offline use on developer hardware and consumer GPUs. The 31B Dense model currently ranks third among all open models on the Arena AI text leaderboard; the 26B MoE sits sixth. Google claims both larger models outcompete models up to 20 times their size on that benchmark. The 31B's unquantised weights fit on a single 80GB Nvidia H100 GPU; quantised versions run on consumer hardware.

All four models are multimodal, natively processing video and images, and are trained across more than 140 languages. The E2B and E4B models additionally support native audio input for speech recognition. Context windows are 128K tokens for the edge models and 256K for the two larger variants.
On capability, Google highlights multi-step reasoning improvements, native function calling and structured JSON output for agentic workflows, and offline code generation. On performance, the Android Developers Blog notes the E2B model runs three times faster than the E4B, while the edge family overall is up to four times faster than previous Gemma versions and uses up to 60% less battery.

The E2B and E4B models are also the foundation for Gemini Nano 4, Google's next-generation on-device model for Android, which will arrive on consumer devices later this year. Gemma has accumulated more than 400 million downloads and over 100,000 community-created variants since its first release, a figure Google points to as evidence of developer adoption at scale.

Gemma 4 is available immediately on Hugging Face, Kaggle, and Ollama, with the 31B and 26B models accessible via Google AI Studio and the edge models via AI Edge Gallery. The Apache 2.0 licensing decision is the most consequential commercial signal in the launch: it removes restrictions that prevented some enterprise and commercial deployments under the previous Gemma terms, opening the ecosystem to a broader range of production use cases.
[6]
Google launches open-source model Gemma 4: How to try it
Google just released the latest version of its open AI model, Gemma 4, on Thursday. Crucially, Gemma 4 is a fully open-source model licensed under Apache 2.0, which is typically not the case with frontier models. Open models can be run locally on users' devices, and Google says Gemma 4 can be run on "billions of Android devices" and some laptop GPUs.

"This open-source license provides a foundation for complete developer flexibility and digital sovereignty; granting you complete control over your data, infrastructure, and models," a Google blog post reads. "It allows you to build freely and deploy securely across any environment, whether on-premises or in the cloud."

Most people have likely heard of Google's popular Gemini AI model, thanks to the ubiquitous AI chatbot that's been integrated into many of Google's products. Gemma is also a large language model (LLM) and was developed from the same technology and research that Google DeepMind used to build Gemini 3. Google is calling Gemma 4 its "most capable" open AI model yet.

So, how is Gemma different from Gemini? Gemini is Google's proprietary subscription AI product, and the name of Google's family of multimodal AI models. Gemini has been integrated into virtually all of Google's core products, including Google Search, Gmail, Google Docs, and Google Cloud. Gemma 4, however, is an open AI model, meaning its weights are freely available for anyone to download. Gemma AI models can run on a user's local hardware, even without an internet connection, and anyone can download Gemma 4 and run it on their device for free.

These open AI models provide a more private and secure experience, as none of the chats, uploaded files, or answers are shared with a third party. Developers could use open AI models like Gemma 4 to integrate AI into their own applications without any recurring subscription costs. Gemma 4 brings some advanced capabilities to Google's open AI model family.
According to Google's announcement, Gemma 4 is now capable of advanced reasoning, which includes multi-step planning and deep logic. Google says it has made "significant improvements in math and instruction-following benchmarks that require it" with Gemma 4. Gemma 4 also supports the structured processes that agentic workflows require, along with local AI coding assistance. In addition, Gemma 4 can process audio and video for speech recognition and for interpreting visuals such as charts.

Gemma 4 is available in four sizes based on the number of weights used to power the model: two billion, four billion, 26 billion, and 31 billion. Hugging Face reports that these open-weight models are available in pre-trained and instruction-tuned variants, offering even more flexibility for developers. The AI model has been trained on more than 140 languages and has a context window of up to 256,000 tokens, according to Google. (The smaller E2B and E4B variants have a context window of 128,000, however.)

Now, open doesn't mean open source when it comes to AI models. Previous iterations of Gemma were open-weight (meaning the model weights were publicly available) but were still bound by Google's terms, even though users were allowed to download the model onto their devices. While users could modify the local LLM, they still had to operate under Google's rules on use and redistribution.

With Gemma 4, Google has made the model both open and open source. Google is distributing Gemma 4 under the popular open-source software license Apache 2.0. Under this license, anyone can download and modify Gemma 4 and use it for any purpose, whether personal or commercial. Gemma 4 can be redistributed without any royalty requirements as well. Basically, the only requirement under the Apache 2.0 license is attribution, and the license must be distributed alongside the AI model.
[7]
Google releases Gemma 4 under Apache 2.0 -- and that license change may matter more than benchmarks
For the past two years, enterprises evaluating open-weight models have faced an awkward trade-off. Google's Gemma line consistently delivered strong performance, but its custom license -- with usage restrictions and terms Google could update at will -- pushed many teams toward Mistral or Alibaba's Qwen instead. Legal review added friction. Compliance teams flagged edge cases. And capable as Gemma 3 was, "open" with asterisks isn't the same as open.

Gemma 4 eliminates that friction entirely. Google DeepMind's newest open model family ships under a standard Apache 2.0 license -- the same permissive terms used by Qwen, Mistral, Arcee, and most of the open-weight ecosystem. No custom clauses, no "Harmful Use" carve-outs that required legal interpretation, no restrictions on redistribution or commercial deployment. For enterprise teams that had been waiting for Google to play on the same licensing terms as the rest of the field, the wait is over.

The timing is notable. As some Chinese AI labs (most notably Alibaba's latest Qwen models, Qwen3.5 Omni and Qwen 3.6 Plus) have begun pulling back from fully open releases for their latest models, Google is moving in the opposite direction -- opening up its most capable Gemma release yet while explicitly stating the architecture draws from its commercial Gemini 3 research.

Four models, two tiers: Edge to workstation in a single family

Gemma 4 arrives as four distinct models organized into two deployment tiers. The "workstation" tier includes a 31B-parameter dense model and a 26B A4B Mixture-of-Experts model -- both supporting text and image input with 256K-token context windows. The "edge" tier consists of the E2B and E4B, compact models designed for phones, embedded devices, and laptops, supporting text, image, and audio with 128K-token context windows. The naming convention takes some unpacking.
The "E" prefix denotes "effective parameters" -- the E2B has 2.3 billion effective parameters but 5.1 billion total, because each decoder layer carries its own small embedding table through a technique Google calls Per-Layer Embeddings (PLE). These tables are large on disk but cheap to compute, which is why the model runs like a 2B while technically weighing more. The "A" in 26B A4B stands for "active parameters" -- only 3.8 billion of the MoE model's 25.2 billion total parameters activate during inference, meaning it delivers roughly 26B-class intelligence with compute costs comparable to a 4B model.

For IT leaders sizing GPU requirements, this translates directly to deployment flexibility. The MoE model can run on consumer-grade GPUs and should appear quickly in tools like Ollama and LM Studio. The 31B dense model requires more headroom -- think an NVIDIA H100 or RTX 6000 Pro for unquantized inference -- but Google is also shipping Quantization-Aware Training (QAT) checkpoints to maintain quality at lower precision. On Google Cloud, both workstation models can now run in a fully serverless configuration via Cloud Run with NVIDIA RTX Pro 6000 GPUs, spinning down to zero when idle.

The MoE bet: 128 small experts to save on inference costs

The architectural choices inside the 26B A4B model deserve particular attention from teams evaluating inference economics. Rather than following the pattern of recent large MoE models that use a handful of big experts, Google went with 128 small experts, activating eight per token plus one shared always-on expert. The result is a model that benchmarks competitively with dense models in the 27B-31B range while running at roughly the speed of a 4B model during inference. This is not just a benchmark curiosity -- it directly affects serving costs. A model that delivers 27B-class reasoning at 4B-class throughput means fewer GPUs, lower latency, and cheaper per-token inference in production.
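The routing pattern described above (eight of 128 experts per token, plus one always-on shared expert) can be sketched in a few lines. The dimensions and random "experts" below are toy stand-ins, not Gemma 4's actual layer sizes or weights:

```python
import numpy as np

# Toy top-k mixture-of-experts layer: 128 small experts, 8 routed per
# token, plus 1 shared expert that fires on every token.
rng = np.random.default_rng(0)

NUM_EXPERTS, TOP_K, DIM = 128, 8, 64
router = rng.standard_normal((DIM, NUM_EXPERTS))           # gating weights
experts = [rng.standard_normal((DIM, DIM)) for _ in range(NUM_EXPERTS)]
shared_expert = rng.standard_normal((DIM, DIM))            # always-on expert

def moe_layer(token: np.ndarray) -> np.ndarray:
    logits = token @ router                                # score all 128 experts
    top = np.argsort(logits)[-TOP_K:]                      # keep the 8 best
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                               # softmax over the winners
    routed = sum(w * (token @ experts[i]) for w, i in zip(weights, top))
    return routed + token @ shared_expert                  # shared expert always runs

out = moe_layer(rng.standard_normal(DIM))
# Only 9 of 129 expert matrices touch this token; that sparsity is the
# "active parameter" saving that lets a ~26B model run with ~4B-class compute.
```

Because only the selected experts' weight matrices participate in each token's forward pass, compute scales with the active parameter count rather than the total, which is exactly the trade the 26B A4B model is making.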
For organizations running coding assistants, document processing pipelines, or multi-turn agentic workflows, the MoE variant may be the most practical choice in the family.

Both workstation models use a hybrid attention mechanism that interleaves local sliding-window attention with full global attention, with the final layer always global. This design enables the 256K context window while keeping memory consumption manageable -- an important consideration for teams processing long documents, codebases, or multi-turn agent conversations.

Native multimodality: Vision, audio, and function calling baked in from scratch

Previous generations of open models typically treated multimodality as an add-on. Vision encoders were bolted onto text backbones. Audio required an external ASR pipeline like Whisper. Function calling relied on prompt engineering and hoping the model cooperated. Gemma 4 integrates all of these capabilities at the architecture level.

All four models handle variable aspect-ratio image input with configurable visual token budgets -- a meaningful improvement over Gemma 3n's older vision encoder, which struggled with OCR and document understanding. The new encoder supports budgets from 70 to 1,120 tokens per image, letting developers trade off detail against compute depending on the task. Lower budgets work for classification and captioning; higher budgets handle OCR, document parsing, and fine-grained visual analysis. Multi-image and video input (processed as frame sequences) are supported natively, enabling visual reasoning across multiple documents or screenshots.

The two edge models add native audio processing -- automatic speech recognition and speech-to-translated-text, all on-device. The audio encoder has been compressed to 305 million parameters, down from 681 million in Gemma 3n, while the frame duration dropped from 160ms to 40ms for more responsive transcription.
For teams building voice-first applications that need to keep data local -- think healthcare, field service, or multilingual customer interaction -- running ASR, translation, reasoning, and function calling in a single model on a phone or edge device is a genuine architectural simplification.

Function calling is also native across all four models, drawing on research from Google's FunctionGemma release late last year. Unlike previous approaches that relied on instruction-following to coax models into structured tool use, Gemma 4's function calling was trained into the model from the ground up -- optimized for multi-turn agentic flows with multiple tools. This shows up in agentic benchmarks, but more importantly, it reduces the prompt engineering overhead that enterprise teams typically invest when building tool-using agents.

Benchmarks in context: Where Gemma 4 lands in a crowded field

The benchmark numbers tell a clear story of generational improvement. The 31B dense model scores 89.2% on AIME 2026 (a rigorous mathematical reasoning test), 80.0% on LiveCodeBench v6, and hits a Codeforces Elo of 2,150 -- numbers that would have been frontier-class from proprietary models not long ago. On vision, MMMU Pro reaches 76.9% and MATH-Vision hits 85.6%. For comparison, Gemma 3 27B scored 20.8% on AIME and 29.1% on LiveCodeBench without thinking mode.

The MoE model tracks closely: 88.3% on AIME 2026, 77.1% on LiveCodeBench, and 82.3% on GPQA Diamond -- a graduate-level science reasoning benchmark. The performance gap between the MoE and dense variants is modest given the significant inference cost advantage of the MoE architecture.

The edge models punch above their weight class. The E4B hits 42.5% on AIME 2026 and 52.0% on LiveCodeBench -- strong for a model that runs on a T4 GPU. The E2B, smaller still, manages 37.5% and 44.0% respectively.
Both significantly outperform Gemma 3 27B (without thinking) on most benchmarks despite being a fraction of the size, thanks to the built-in reasoning capability.

These numbers need to be read against an increasingly competitive open-weight landscape. Qwen 3.5, GLM-5, and Kimi K2.5 all compete aggressively in this parameter range, and the field moves fast. What distinguishes Gemma 4 is less any single benchmark and more the combination: strong reasoning, native multimodality across text, vision, and audio, function calling, 256K context, and a genuinely permissive license -- all in a single model family with deployment options from edge devices to cloud serverless.

What enterprise teams should watch next

Google is releasing both pre-trained base models and instruction-tuned variants, which matters for organizations planning to fine-tune for specific domains. The Gemma base models have historically been strong foundations for custom training, and the Apache 2.0 license now removes any ambiguity about whether fine-tuned derivatives can be deployed commercially.

The serverless deployment option via Cloud Run with GPU support is worth watching for teams that need inference capacity that scales to zero. Paying only for actual compute during inference -- rather than maintaining always-on GPU instances -- could meaningfully change the economics of deploying open models in production, particularly for internal tools and lower-traffic applications.

Google has hinted that this may not be the complete Gemma 4 family, with additional model sizes likely to follow. But the combination available today -- workstation-class reasoning models and edge-class multimodal models, all under Apache 2.0, all drawing from Gemini 3 research -- represents the most complete open model release Google has shipped. For enterprise teams that had been waiting for Google's open models to compete on licensing terms as well as performance, the evaluation can finally begin without a call to legal first.
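Native function calling of the kind described above still needs glue code on the application side: advertise a tool schema, hand the prompt to the model, then dispatch whatever structured call comes back. A minimal sketch of that loop, using a hard-coded stand-in for the model's JSON output; the schema shape and tool names here are illustrative assumptions, not Gemma's actual wire format:

```python
import json

# Hypothetical tool schema the application advertises to the model.
TOOLS = {
    "get_weather": {
        "description": "Look up current weather for a city.",
        "parameters": {"city": "string"},
    },
}

def get_weather(city: str) -> str:
    # Stand-in implementation; a real tool would call a weather API.
    return f"Sunny in {city}"

DISPATCH = {"get_weather": get_weather}

# Stand-in for the model's structured output. With native function
# calling, the model emits JSON like this directly, rather than the
# application coaxing it out through prompt engineering.
model_output = '{"name": "get_weather", "arguments": {"city": "Zurich"}}'

def run_tool_call(raw: str) -> str:
    call = json.loads(raw)               # parse the model's structured call
    fn = DISPATCH[call["name"]]          # route to the registered tool
    return fn(**call["arguments"])       # invoke with the model's arguments

result = run_tool_call(model_output)     # "Sunny in Zurich"
```

In a multi-turn agentic flow, the tool's return value would be appended to the conversation and sent back to the model, which decides whether to call another tool or answer the user.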
[8]
Google Jumps Back Into the Open Source AI Race With Gemma 4 - Decrypt
U.S. open-source AI gets a needed boost, as Gemma 4 -- backed by DeepMind -- positions itself as the strongest American contender against DeepSeek, Qwen, and other Chinese leaders. Google's open AI ambitions got a lot more serious today. The company released Gemma 4, a family of four open-weight models built on the same research as Gemini 3, and licensed under Apache 2.0 -- a significant departure from the more restrictive terms on previous Gemma versions. Developers have downloaded past Gemma generations over 400 million times, spawning more than 100,000 community variants. This release is the most ambitious one yet. For the past year, the open-source AI leaderboard has been largely a Chinese affair. DeepSeek, Minimax, GLM and Qwen have dominated the top spots, leaving American alternatives scrambling for relevance. As Decrypt reported last year, Chinese open models went from barely 1.2% of global open-model usage in late 2024 to roughly 30% by the end of 2025, with Alibaba's Qwen even overtaking Meta's Llama as the most-used self-hosted model worldwide. Meta's Llama used to be the default choice for developers who wanted a capable, locally runnable model. That reputation has eroded -- Llama's Meta-controlled license raised questions about its true open-source status, and its performance slipped behind the Chinese competition. The Allen Institute's OLMo family tried to fill the gap but failed to gain meaningful traction. OpenAI released its gpt-oss models in August 2025, which gave the ecosystem a breath of fresh air, but they were never designed to be frontier competitors. And yesterday, a 30-person U.S. startup called Arcee AI released Trinity, a 400 billion parameter open model that made a compelling case that the American scene wasn't completely dead. Gemma 4 follows that momentum, this time with the full weight of Google DeepMind behind it, turning it into arguably the best American model in the open-source AI scene. 
The model is "built from the same world-class research and technology as Gemini 3," Google said in its announcement. Gemma 4 ships in four sizes: Effective 2B and 4B for phones and edge devices, a 26B Mixture of Experts model focused on speed, and a 31B Dense model optimized for raw quality. The 31B Dense currently ranks third among all open models on Arena AI's text leaderboard. The 26B MoE sits sixth. Google claims both outcompete models 20 times their size -- a claim that holds up, at least against the Arena AI numbers, where Chinese models still hold the top two spots. We tested Gemma 4. It's capable, with some caveats. The model applies reasoning even to tasks that don't require it, which can make responses feel over-engineered for simple prompts. Creative writing is decent -- serviceable, not inspired -- and likely improves with more specific guidance and prompt engineering. Where it delivered most clearly was code. Asked to generate a game, the output wasn't particularly flashy or elaborate, but it ran without errors on the first try. Not bad for a 41 billion parameter model. That zero-shot reliability is arguably more valuable than a prettier result that needs debugging. You can try the (basic, yet functional) game here. The four variants cover the full hardware spectrum. The E2B and E4B models are built for Android phones, Raspberry Pi, and edge devices, running completely offline with near-zero latency, native audio input, and a 128K context window. The 26B and 31B models target workstations and cloud deployments, extending context to 256K and adding native function-calling and structured JSON output for building autonomous agents. All four models process images and video natively. The larger models' full-precision weights fit on a single 80GB NVIDIA H100 GPU; quantized versions run on consumer hardware. The Apache 2.0 license is the other headline. 
Google's previous Gemma releases used a custom license that created legal ambiguity for commercial products. Apache 2.0 removes that friction entirely -- developers can modify, redistribute, and commercialize without worrying about Google changing the terms later. Hugging Face co-founder Clement Delangue praised it, saying that "Local AI is having its moment" and calling it the future of the AI industry. Google DeepMind CEO Demis Hassabis went further, calling Gemma 4 "the best open models in the world for their respective sizes." That's a strong claim. Proprietary systems from Anthropic, OpenAI, and Google's own Gemini still lead on the hardest benchmarks. But for open-weight models you can run locally, modify freely, and deploy on your own infrastructure? The competition just got significantly thinner. You can try Gemma 4 now in Google AI Studio (31B and 26B) or Google AI Edge Gallery (E2B and E4B). Model weights are also available on Hugging Face, Kaggle, and Ollama.
[9]
Gemma 4: Byte for byte, the most capable open models
We are releasing Gemma 4 in four versatile sizes: Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts (MoE) and 31B Dense. The entire family moves beyond simple chat to handle complex logic and agentic workflows. Our larger models deliver state-of-the-art performance for their sizes, with the 31B model currently ranking as the #3 open model in the world on the industry-standard Arena AI text leaderboard, and the 26B model securing the #6 spot. There, Gemma 4 outcompetes models 20x its size. For developers, this new level of intelligence-per-parameter means achieving frontier-level capabilities with significantly less hardware overhead. At the edge, our E2B and E4B models redefine on-device utility, prioritizing multimodal capabilities, low-latency processing and seamless ecosystem integration over raw parameter count. To power the next generation of pioneering research and products, we've sized the Gemma 4 models specifically to run and fine-tune efficiently on hardware -- from billions of Android devices worldwide, to laptop GPUs, all the way up to developer workstations and accelerators. By using these highly optimized models, you can fine-tune Gemma 4 to achieve state-of-the-art performance on your specific tasks. We've already seen incredible success with this approach; for instance, INSAIT created a pioneering Bulgarian-first language model (BgGPT), and we worked with Yale University on Cell2Sentence-Scale to discover new pathways for cancer therapy, among many others. Here is what makes Gemma 4 our most capable open model family yet: We are releasing the Gemma 4 model weights in sizes tailored for specific hardware and use cases, ensuring you get frontier-class reasoning wherever you need it: Optimized to provide researchers and developers with state-of-the-art reasoning on accessible hardware, our unquantized bfloat16 weights fit efficiently on a single 80GB NVIDIA H100 GPU. 
For local setups, quantized versions run natively on consumer GPUs to power your IDEs, coding assistants and agentic workflows. Our 26B Mixture of Experts (MoE) focuses on latency, activating only 3.8 billion of its total parameters during inference to deliver exceptionally fast tokens-per-second, while our 31B Dense maximizes raw quality and provides a powerful foundation for fine-tuning. Engineered from the ground up for maximum compute and memory efficiency, these models activate an effective 2 billion and 4 billion parameter footprint during inference to preserve RAM and battery life. In close collaboration with our Google Pixel team and mobile hardware leaders like Qualcomm Technologies and MediaTek, these multimodal models run completely offline with near-zero latency across edge devices like phones, Raspberry Pi, and NVIDIA Jetson Orin Nano. Android developers can now prototype agentic flows in the AICore Developer Preview today for forward-compatibility with Gemini Nano 4. You gave us feedback, and we listened. Building the future of AI requires a collaborative approach, and we believe in empowering the developer ecosystem without restrictive barriers. That's why Gemma 4 is released under a commercially permissive Apache 2.0 license. This open-source license provides a foundation for complete developer flexibility and digital sovereignty, granting you complete control over your data, infrastructure, and models. It allows you to build freely and deploy securely across any environment, whether on-premises or in the cloud.
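The MoE latency claim above lends itself to a back-of-envelope check. A minimal sketch, assuming decode throughput scales roughly with the parameters read per token -- a simplification that ignores expert-routing overhead and memory-bandwidth effects:

```python
# Back-of-envelope: why a sparse MoE decodes faster than a dense model
# of the same total size. Figures from the announcement: 26B total
# parameters, 3.8B active per token.
TOTAL_PARAMS = 26e9
ACTIVE_PARAMS = 3.8e9

# Fraction of the weights actually touched per decoded token.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# If decode speed were inversely proportional to parameters read per
# token, the naive speedup over a 26B dense model would be:
speedup_vs_dense = TOTAL_PARAMS / ACTIVE_PARAMS

print(f"active fraction per token: {active_fraction:.1%}")      # 14.6%
print(f"naive decode speedup vs. 26B dense: ~{speedup_vs_dense:.1f}x")  # ~6.8x
```

Real gains are smaller than the naive ratio, since all 26 billion weights must still reside in memory and routing adds its own cost -- but the arithmetic shows why the MoE variant is the latency-focused option.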
[10]
Google's new Gemma 4 models bring complex reasoning skills to low-power devices - SiliconANGLE
Google LLC is upping the stakes for open-weights artificial intelligence models with the release of Gemma 4, its most advanced "open" model family so far. Built on the same architectural foundation as Gemini 3, the models are designed to handle complex reasoning tasks and support autonomous AI agents running locally on low-power devices such as workstations and smartphones. With Gemma 4, Google DeepMind researchers Clement Farabet and Olivier Lacombe said, they've managed to squeeze out more "intelligence per parameter," allowing them to punch significantly above their weight class. For instance, the 31B Dense variant currently ranks third in open models on the industry-standard Arena AI Text leaderboard. The Gemma 4 models come in four flavors: Effective 2B, Effective 4B, a 26B Mixture of Experts model and a 31B Dense model. The smaller "Effective" models are designed for edge use cases on lightweight hardware such as Android smartphones or Raspberry Pi computers, the researchers said. Meanwhile, the 26B MoE model has a clever trick in that it only activates 3.8 billion parameters on inference tasks, allowing it to perform at high speed without sacrificing the deep knowledge base of larger models. Farabet and Lacombe explained that each of the Gemma 4 models is better suited to running AI agents. Whereas earlier Gemma iterations forced developers to tweak their design so they could interact with other software tools, the Gemma 4 models have native support for function calling and structured JavaScript Object Notation outputs. This means developers can use them to power autonomous agents that interact with third-party tools and execute on multi-step plans. All four models have the ability to process images and videos, with the smaller E2B and E4B variants going further with support for native audio inputs, enabling real-time speech understanding directly on device.
Google has also increased the context window of the models, up to 128K for the smallest models and 256K for the larger two. This means developers will be able to upload an entire codebase or massive sets of documents with a single prompt. Each of the models is being made available under a permissive Apache 2.0 license, which removes many of the commercial restrictions placed on other AI models, making them a great choice for developers building enterprise applications, Google said. They can be accessed directly through Google Cloud, and they're also available along with their open weights on Hugging Face, Kaggle and Ollama. The release underscores Google's ambitions to dominate the "local AI" industry. Because even the larger Gemma 4 models are small enough to run on a single graphics processing unit, that makes them suitable for edge use cases and applications where low latency and digital sovereignty are high priorities, said Holger Mueller, an analyst with Constellation Research. "Google is building its lead in AI, not only by pushing Gemini, but also open models with the Gemma 4 family," he said. "These are important for building an ecosystem of AI developers, and will help the company to tap into functional and vertical use cases on different device form factors. Google set a high bar with its previous Gemma 3 release, and so there's a lot of expectation with this release."
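The function-calling flow described above -- the model emitting structured JSON that names a tool and its arguments, which the host application validates and dispatches -- can be sketched as follows. The tool schema, field names, and model reply here are hypothetical illustrations of the pattern, not Gemma's actual API:

```python
import json

# Hypothetical tool registry: one tool and its required arguments.
# All names here are illustrative, not part of any real Gemma API.
TOOLS = {
    "get_weather": {"required": {"city"}},
}

# What a structured-output reply might look like as raw model text.
model_reply = '{"tool": "get_weather", "arguments": {"city": "Berlin"}}'

def dispatch(reply_text: str) -> str:
    """Parse a structured-JSON tool call and validate it before use."""
    call = json.loads(reply_text)            # structured output -> dict
    tool = call["tool"]
    args = call["arguments"]
    spec = TOOLS.get(tool)
    if spec is None:
        raise ValueError(f"model requested unknown tool: {tool}")
    missing = spec["required"] - args.keys()
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    # A real host would invoke the tool here; we just echo the call.
    return f"{tool}({args})"

print(dispatch(model_reply))  # get_weather({'city': 'Berlin'})
```

The point of the pattern is that the host never trusts free-form text: the model's reply must parse as JSON and match a known tool signature before anything executes.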
[11]
Google releases open-source AI model Gemma 4 for developers
Google released Gemma 4, a new open-source AI model, licensed under Apache 2.0, allowing developers to run it locally on numerous devices, including billions of Android devices and select laptop GPUs. This release marks a significant development in the availability of open AI models, offering developers complete control over data, infrastructure, and models. Google's approach aims to provide enhanced security and privacy compared to proprietary models. Gemma 4 builds on the technology behind Google's Gemini 3 model and is touted as the company's "most capable" open AI model to date. Unlike Gemini, which is a subscription-based product integrated into Google's applications, Gemma 4 offers users the ability to download and utilize the model for free. The new model boasts advanced features, including capabilities for complex reasoning, multi-step planning, and improved performance in math and instruction-following tasks. Gemma 4 also supports AI coding assistance and can process both audio and video content for tasks like speech recognition and visual interpretation. Gemma 4 is available in four different sizes: 2 billion, 4 billion, 26 billion, and 31 billion parameters. It has been trained on over 140 languages and supports context windows of up to 256,000 tokens, with smaller variants limited to 128,000 tokens. Previously, iterations of the Gemma model were open-weight but restricted under Google's terms. Gemma 4's open-source designation allows for modifications and use without royalty requirements, aside from mandatory attribution. Google emphasized that this development encourages developer creativity and security. Gemma 4 is accessible via Google AI Studio and can also be downloaded from platforms like Hugging Face, Kaggle, and Ollama, broadening its reach within the developer community.
[12]
Google's New Open-Source Model Will Let Users Build AI Agents
The open-source model is capable of multi-step planning and deep logic. Google, on Thursday, introduced the Gemma 4 artificial intelligence (AI) model. The first in the Gemma 4 family comes with several improvements over its predecessors. While Gemma 3 focused on text and visual reasoning capabilities, the Mountain View-based tech giant says the latest iteration brings agentic capabilities and advanced reasoning to the open-source model. Available in four different sizes, the latest large language model (LLM) will be available across Google's developer platforms and can be downloaded via third-party repositories to run locally.

Google Releases Gemma 4

In a blog post, the tech giant announced and detailed the Gemma 4 AI model. The model is available in four different sizes and configurations, including Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts (MoE) and 31B Dense. The context window has also been increased to 256K tokens, up from 128K tokens in Gemma 3. Additionally, it has been trained natively on more than 140 languages. One big change from the previous generation is that Gemma 4 is now available under the permissive Apache 2.0 license, which allows usage for both academic and commercial purposes. The LLM can be directly used via Google AI Studio and Vertex AI, or can be downloaded from the company's Hugging Face, Kaggle, and Ollama listings. Three standout features in Gemma 4 are support for advanced reasoning, agentic workflows, and code generation. With advanced reasoning, it is now capable of multi-step planning and deep logic and is said to show improvements in mathematics and instruction following. The model is also capable of function calling and structured JSON output, letting users power their AI agents with the model. Additionally, Google claims that the LLM supports high-quality offline code generation, although it is unclear where it stands compared to proprietary tools, such as Claude Code and Codex.
However, the clear advantage here is the free usage and on-device privacy and security. Other notable features include native processing of videos and images with support for variable resolutions. Google claims the model supports visual tasks like OCR and chart understanding. Apart from this, the E2B and E4B models also support native audio input for speech recognition and understanding.
[13]
Google rolls out Gemma 4: How different is it from Gemini? Key difference of AI model explained
Google's new Gemma 4 AI models are set to revolutionize tech by running advanced capabilities directly on devices like laptops and smartphones. This 'open' AI, built on Gemini's research, promises faster, more private AI experiences, enabling offline features and multi-tasking without heavy computing power. Developers can freely use and adapt these models, potentially transforming everyday apps. In a move that may not look dramatic but could reshape how people use artificial intelligence, Google has rolled out Gemma 4, a new set of "open" AI models that promise advanced capabilities without needing heavy computing power. The models are designed to handle complex reasoning, coding and real-world tasks, while being light enough to run on devices like laptops and even smartphones. Gemma 4 is built using the same research that powers Gemini, but with a key difference: it is open. Developers can download it, tweak it and use it freely under an Apache 2.0 license. The models come in four sizes. Smaller versions are meant for mobile devices, while larger ones can take on more demanding workloads. The idea is simple: strong AI performance without the need for massive infrastructure. At first glance, this might sound like something only developers care about. But the shift could quietly change daily tech use. Instead of depending fully on cloud-based AI, apps can now run smarter features directly on devices. That means faster responses and better control over personal data. In some cases, users may not even need an internet connection. This could show up in subtle ways: smarter voice assistants, offline translation tools, or apps that summarise documents and images without sending anything online. Gemma 4 is built for multi-step thinking and can follow detailed instructions. It can write code, process images and videos, understand speech and work across more than 140 languages. It also supports long inputs, allowing it to analyse large documents or datasets in one go.
For developers, one key feature is support for "agentic workflows," meaning the AI can take actions, interact with tools and complete tasks with minimal human input. One of the biggest claims is efficiency. Google says the larger models can compete with much bigger systems while using fewer resources. The smaller versions are designed to run directly on smartphones, including Android devices. If this works as promised, it could bring advanced AI features into everyday apps without draining battery or relying on constant internet access. There are still practical hurdles. Running powerful AI locally is not simple and may require technical know-how. For most people, the benefits will likely come through apps built by developers rather than direct use. There is also a larger concern around open AI systems. While openness can drive innovation, it can also raise questions about misuse when powerful tools are widely available. Gemma 4 may not grab attention like flashy AI chatbots, but it signals a subtle shift: from AI living in distant servers to sitting closer to users, right inside their devices. And that could change how people interact with technology in ways that are only beginning to show.
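The long-input claim above can be put in rough numbers with the common heuristic of about four characters per token for English text (an assumption -- real token counts depend on the tokenizer):

```python
# Rough token budgeting for long-context use, using the common
# ~4 characters-per-token heuristic for English text. Approximate:
# actual counts vary by tokenizer and content.
CHARS_PER_TOKEN = 4

def approx_tokens(num_chars: int) -> int:
    return num_chars // CHARS_PER_TOKEN

# A 300-page document at roughly 2,000 characters per page:
doc_chars = 300 * 2000
doc_tokens = approx_tokens(doc_chars)

print(f"~{doc_tokens:,} tokens")                               # ~150,000 tokens
print("fits in 128K edge-model window: ", doc_tokens <= 128_000)  # False
print("fits in 256K large-model window:", doc_tokens <= 256_000)  # True
```

The example shows why the window sizes matter in practice: a long report already overflows the 128K edge models but fits comfortably within the 256K window of the larger two.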
[14]
Google's Gemma 4 Model Can Now Be Deployed on NVIDIA's RTX GPUs, Delivering Optimized Performance for a 'Personalized' Agentic AI Environment
Google's newest open-source model, Gemma 4, can now be deployed on NVIDIA's consumer-grade hardware, offering optimal performance for agentic AI workloads. [Press Release]: Open models are driving a new wave of on-device AI, extending innovation beyond the cloud to everyday devices. As these models advance, their value increasingly depends on access to local, real-time context that can turn meaningful insights into action. Designed for this shift, Google's latest additions to the Gemma 4 family introduce a class of small, fast and omni-capable models built for efficient local execution across a wide range of devices. Google and NVIDIA have collaborated to optimize Gemma 4 for NVIDIA GPUs, enabling efficient performance across a range of systems -- from data center deployments to NVIDIA RTX-powered PCs and workstations, the NVIDIA DGX Spark personal AI supercomputer and NVIDIA Jetson Orin Nano edge AI modules. The latest additions to the Gemma 4 family of open models -- spanning E2B, E4B, 26B, and 31B variants -- are designed for efficient deployment from edge devices to high-performance GPUs. This new generation of compact models supports a range of tasks. The E2B and E4B models are built for ultra-efficient, low-latency inference at the edge, running completely offline with near-zero latency across many devices, including Jetson Orin Nano modules. The 26B and 31B models are designed for high-performance reasoning and developer-centric workflows, making them well-suited for agentic AI. Optimized to deliver state-of-the-art, accessible reasoning, these models run efficiently on NVIDIA RTX GPUs and DGX Spark -- powering development environments, coding assistants, and agent-driven workflows. As local agentic AI continues to gain momentum, applications like OpenClaw are enabling always-on AI assistants on RTX PCs, workstations, and DGX Spark.
The latest Gemma 4 models are compatible with OpenClaw, allowing users to build capable local agents that draw context from personal files, applications, and workflows to automate tasks. Getting Started: Gemma 4 on RTX GPUs and DGX Spark NVIDIA has collaborated with Ollama and llama.cpp to provide the best local deployment experience for each of the Gemma 4 models. To use Gemma 4 locally, users can download Ollama to run Gemma 4 models or install llama.cpp and pair it with the Gemma 4 GGUF Hugging Face checkpoint. Additionally, Unsloth provides day-one support with optimized and quantized models for efficient local fine-tuning and deployment via Unsloth Studio. Start running and fine-tuning Gemma 4 in Unsloth Studio today. Running open models like the Gemma 4 family on NVIDIA GPUs achieves optimal performance because NVIDIA Tensor Cores accelerate AI inference workloads to deliver higher throughput and lower latency for local execution. Plus, the CUDA software stack ensures broad compatibility across leading frameworks and tools, enabling new models to run efficiently from day one. This combination allows open models like Gemma 4 to scale across a wide range of systems -- from Jetson Orin Nano at the edge to RTX PCs, workstations and DGX Spark -- without requiring extensive optimization.
[15]
Google unveils Gemma 4, expands lightweight open model lineup for developers - The Economic Times
Google has introduced a new generation of its open AI models under the Gemma family, with the launch of Gemma 4. The development was confirmed by Google DeepMind chief executive Demis Hassabis in a post on X on Thursday. Google described Gemma 4 as its "most capable open model" to date. The company has released it in four variants, each designed for different levels of performance and hardware requirements: Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts (MoE) and 31B Dense.

Gemma 4 specifications

The Effective 2B (E2B) variant is a compact model with around 2 billion parameters, Effective 4B (E4B) is a slightly larger version with improved capability, the 26B Mixture of Experts (MoE) uses a specialised architecture, and 31B Dense is the largest and most powerful version in the lineup, the company blog post read. It added that the 31B Dense model ranks among the top-performing open models on widely used industry benchmarks. Parameters refer to the number of adjustable values in a model; generally, more parameters allow for better performance but require more computing power. The Mixture of Experts (MoE) architecture is a technique where only a subset of the model's components are activated for each task. This improves efficiency by reducing the amount of computation needed compared to traditional "dense" models, where all parameters are used every time.

Gemma 4 features

The Gemma 4 model offers capabilities such as advanced reasoning, agentic workflows, coding, and support for over 140 languages. The models are also capable of solving complex mathematical problems and generating high-quality code, positioning them as potential local AI coding assistants. A key emphasis of Gemma 4 is efficiency. Smaller and optimised models allow developers to run advanced AI systems on more modest hardware, including personal workstations or edge devices, rather than requiring large data centers.
This approach is intended to make "frontier-level" AI capabilities more accessible to a broader range of developers, the company mentioned.
[16]
Meet Gemma 4: Google's New AI Model Built for Both Heavy Systems and Everyday Devices
Google Introduces Gemma 4: A New AI Model Family Designed to Power Both Data Centres and Smartphones! Google has introduced Gemma 4 on April 3, 2026. The latest AI model is designed to run on both powerful data centres and everyday smartphones. The goal is to make advanced AI tools more accessible to developers and users. With this move, Google is trying to bring high-level AI closer to daily life. The launch was led by Sundar Pichai and Demis Hassabis. Both officials have stressed the company's focus on making more responsible and widely available AI. Gemma models have already seen strong adoption, and this new version aims to expand their use across more devices.
[17]
Google launches Gemma 4 AI models: Features, capabilities and how to use
Gemma 4 comes in four different sizes: Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts (MoE) and 31B Dense. Google has announced Gemma 4, its newest family of open AI models. According to the tech giant, Gemma 4 models are its 'most intelligent open models to date' and provide an 'unprecedented level of intelligence-per-parameter.' The launch builds on the growing popularity of the Gemma ecosystem. Since the first Gemma models were introduced, developers have downloaded them more than 400 million times, creating over 100,000 variations, as per Google. Here's everything you need to know about Gemma 4 AI models. Gemma 4 comes in four different sizes: Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts (MoE) and 31B Dense. These models are designed to work across a wide range of devices, from smartphones and laptop GPUs to powerful AI servers. One of the key capabilities of Gemma 4 is better reasoning ability. The model can handle complex tasks that require step-by-step thinking. It also performs better in benchmarks related to math problems and instruction-following. It also supports agent-style workflows. With built-in features like function calling, structured JSON output, and system instructions, developers can build AI agents that interact with tools, APIs and different services more easily. Gemma 4 can also generate high-quality offline code. Another key feature is the long context window. Edge models support up to 128K tokens, while larger models can handle up to 256K tokens. Furthermore, Gemma 4 supports over 140 languages, helping developers build AI applications that can work for users globally. Developers can start experimenting with Gemma 4 through several platforms.
The 31B and 26B MoE models are available in Google AI Studio, while the E4B and E2B models can be accessed through the Google AI Edge Gallery.
Google has launched Gemma 4, its latest generation of open-weight AI models, marking a significant shift to the Apache 2.0 license from its previous restrictive terms. The release includes four model variants optimized for everything from smartphones to enterprise servers, with the 31B model ranking third on the Arena AI leaderboard. This licensing change removes commercial deployment barriers and positions Gemma 4 as a domestic alternative to Chinese open-weight models.
Google has released Gemma 4, its latest generation of open-weight AI models, under the Apache 2.0 license, a dramatic departure from the restrictive custom license that governed predecessor Gemma 3 [1]. This licensing shift grants developers and enterprises near-total freedom to use, modify, and redistribute the models for any purpose without royalty requirements, addressing long-standing frustrations with AI licensing restrictions [2]. The move enables enterprise and commercial use without fear of Google terminating access, making Gemma 4 a viable option for organizations with strict data privacy and sovereignty requirements [3].
Source: Ars Technica
Developed by Google DeepMind using the same research and technology that powers Gemini 3, Gemma 4 arrives as Chinese competitors like Moonshot AI, Alibaba, and Z.AI flood the market with open-weight models rivaling OpenAI's GPT-5 [3]. Google positions Gemma 4 as a domestic alternative that won't harvest sensitive corporate data to train future models, a critical consideration for healthcare providers and enterprises bound by regulatory restrictions.

Gemma 4 comprises four distinct variants designed to address use cases ranging from edge devices to high-performance servers. The 31B Dense model focuses on maximizing output quality and currently ranks third on the Arena AI open-model leaderboard, behind only GLM-5 and Kimi 2.5 [1]. Despite its capabilities, the 31B model is a fraction of the size of competing models, making local AI deployment significantly more cost-effective. This model can run unquantized in bfloat16 format on a single 80GB Nvidia H100 GPU, and when quantized to 4-bit precision, it fits on consumer graphics cards like the Nvidia RTX 4090 [3].
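The hardware claims above follow from simple weight-memory arithmetic. This sketch counts weight storage only; KV cache and activations add to these figures, so real headroom is smaller than the gap suggests:

```python
# Weight-memory arithmetic for a 31B-parameter model.
# bfloat16 stores 2 bytes per parameter; 4-bit quantization ~0.5 bytes.
PARAMS = 31e9
GB = 1e9  # using decimal gigabytes, matching marketing figures

bf16_gb = PARAMS * 2 / GB     # unquantized bfloat16 weights
int4_gb = PARAMS * 0.5 / GB   # 4-bit quantized weights

print(f"bf16:  {bf16_gb:.0f} GB  (fits 80 GB H100:    {bf16_gb <= 80})")
print(f"4-bit: {int4_gb:.1f} GB (fits 24 GB RTX 4090: {int4_gb <= 24})")
```

The numbers line up with the article's claims: 62 GB of bfloat16 weights fit within an 80 GB H100, and roughly 15.5 GB of 4-bit weights fit within a 24 GB RTX 4090, with room left for the KV cache.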
Source: Wccftech
The 26B Mixture-of-Experts model prioritizes low latency over raw quality, activating only 3.8 billion of its 26 billion model parameters during inference to deliver higher tokens-per-second performance [1]. This architecture proves particularly valuable for applications requiring faster responses, such as coding assistants and agentic workflows, though the reduced active parameters do impact output quality compared to dense models [3]. Both larger models feature a 256,000-token context window, making them appropriate for complex code generation tasks [3].

The Effective 2B and Effective 4B models target mobile devices and edge devices like Raspberry Pi and Jetson Nano, developed through collaboration with the Pixel team, Qualcomm, and MediaTek [4]. These models use per-layer embeddings to reduce their effective size to 2.3 billion and 4.5 billion parameters respectively, despite having actual parameter counts of 5.1 billion and 8 billion [3]. This innovation enables on-device AI that runs entirely offline, using minimal memory during inference and consuming up to 60% less battery than previous versions [5].
Source: Mashable
Google touts near-zero latency for these edge models, with the E2B running three times faster than the E4B [5]. Both support multi-modality, natively processing video, images, and audio inputs for speech recognition, with a 128,000-token context window [3]. These models will also serve as the foundation for Gemini Nano 4, Google's next-generation on-device model for Android devices launching later this year [5].
All Gemma 4 variants incorporate improved reasoning capabilities for mathematics and instruction-following, support for more than 140 languages, and native function calling for structured JSON output [3]. These enhancements position the models for agentic AI workflows where autonomous decision-making is required. Google claims significant performance improvements across AI benchmarks compared to Gemma 3, though the company advises taking vendor-supplied benchmarks with appropriate skepticism [3].

Since the first Gemma release in February 2024, developers have downloaded the models over 400 million times, creating a vibrant ecosystem of more than 100,000 community variants [2]. The shift to a permissive license is expected to accelerate adoption rates further, particularly among enterprises that can now legitimately bundle the AI with products, services, and devices [2].

Gemma 4 is immediately available through Hugging Face, Kaggle, and Ollama, with the larger models accessible via Google AI Studio and edge models through AI Edge Gallery [5]. Google claims day-one support for more than a dozen inference frameworks including vLLM, SGLang, Llama.cpp, and MLX [3]. Hugging Face co-founder Clément Delangue described the Apache 2.0 licensing decision as "a huge milestone," while Google DeepMind CEO Demis Hassabis called the new models "the best open models in the world for their respective sizes" [5].

While Gemma 4 carries the Apache 2.0 license, it remains "open-weight" rather than fully open-source, as Google has not released the complete training dataset, scripts, infrastructure code, or detailed methodologies required for full reproducibility [4]. For most developers, this distinction matters little, as the license still permits all forms of commercial use, modification, redistribution, and deployment with only attribution required [4].