Gemma 4 Released: Google's Free Open AI Models That Run on Your Phone

Google DeepMind has released Gemma 4: four Apache 2.0 open-weight models from 2B (phone-friendly) to 31B (ranked third among open models on the Arena AI text leaderboard). Up to 256K context, 140+ languages, and day-one Ollama and Hugging Face support.

By AIToolsRecap, April 2, 2026

Google DeepMind released Gemma 4 on April 2, 2026, and the open-weight AI model landscape will not look the same after today. Four models, free under Apache 2.0, built on the same research foundation as the proprietary Gemini 3, and designed to run on everything from a Raspberry Pi to a professional workstation — without a single API call to the cloud.

Google DeepMind CEO Demis Hassabis called them "the best open models in the world for their respective sizes." The benchmark data backs that up. The 31B Dense model currently ranks third among all open models globally on the Arena AI text leaderboard with an Elo score of 1452, outcompeting models with up to 20 times more parameters. The 26B Mixture of Experts model sits sixth with a score of 1441 — while activating only 3.8 billion parameters during inference. These are not incremental improvements over Gemma 3. They are a generational leap.

Four models, two deployment tiers

Gemma 4 ships in four distinct variants organized around two use cases. The workstation tier consists of the 31B Dense model and the 26B A4B Mixture of Experts model — both designed for developers, researchers, and enterprises running on laptops, gaming GPUs, and cloud infrastructure. The edge tier consists of the Effective 2B and Effective 4B models, built specifically for smartphones, IoT devices, Raspberry Pi boards, and Jetson Nano hardware.

The naming conventions are worth unpacking. The "E" prefix on the edge models stands for "effective parameters": the E2B has 2.3 billion effective parameters but 5.1 billion total, because each decoder layer carries its own small embedding table through a technique Google calls Per-Layer Embeddings. These tables are large on disk but cheap to compute, which is why the model runs like a 2B while technically weighing more. The "A" in 26B A4B stands for "active parameters": only 3.8 billion of the MoE model's 25.2 billion total parameters activate for each token during inference, delivering 26B-level reasoning quality at roughly 4B-class compute cost. The full 25.2 billion parameters still have to be loaded, so the memory footprint remains that of a 26B model.

For hardware requirements: the E2B runs on smartphones and 4GB devices; the E4B on 8GB laptops; the 26B MoE on a 24GB GPU with Q4 quantization; the 31B Dense on a single 80GB NVIDIA H100 unquantized, or consumer GPUs with quantization via Ollama or llama.cpp.
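As a rough sanity check on those hardware figures, the weight footprint can be approximated as parameter count times bits per parameter. The sketch below uses the parameter counts quoted in this article; the bit widths (16-bit for unquantized, exactly 4 bits for Q4) are simplifying assumptions, and the estimate ignores the KV cache, activations, and quantization metadata, so real usage will be higher.

```python
# Weight-only memory estimate: parameters x bits per parameter.
# Parameter counts come from the article; bit widths are assumptions.

GIB = 1024 ** 3  # bytes per GiB


def weight_memory_gib(params_billions: float, bits_per_param: float) -> float:
    """Return the approximate size of the model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / GIB


# (label, total parameters in billions, assumed bits per parameter)
configs = [
    ("E2B (5.1B total), 4-bit", 5.1, 4),
    ("26B A4B (25.2B total), 4-bit", 25.2, 4),
    ("31B Dense, 16-bit", 31.0, 16),
    ("31B Dense, 4-bit", 31.0, 4),
]

for label, params, bits in configs:
    print(f"{label}: ~{weight_memory_gib(params, bits):.1f} GiB of weights")
```

Under these assumptions the unquantized 31B comes to about 58 GiB of weights and the 4-bit 26B A4B to under 12 GiB, both in line with the 80GB and 24GB figures above once runtime overhead is added.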

What every Gemma 4 model can do

Every model in the Gemma 4 family, from the smallest edge variant to the 31B Dense, ships with a consistent set of capabilities that represent a significant upgrade over Gemma 3. All four models natively process text and images, accepting high-resolution images and video frames as input and supporting tasks such as optical character recognition and chart understanding. The two edge models go further with native audio input support, enabling real-time speech understanding directly on device with no cloud dependency.

All models support context windows that dwarf most available alternatives: 128,000 tokens for the E2B and E4B edge variants, and 256,000 tokens for the 26B and 31B models. At 256,000 tokens, the larger models can process an entire codebase, a book-length document, or hours of meeting transcripts in a single prompt. Training covered more than 140 languages from the start — not as an afterthought, but as a core design requirement.
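To get a feel for what fits in those windows, here is a minimal sketch that estimates token counts from character counts. The ratio of about 4 characters per token is a common rule of thumb for English text, not a property of Gemma 4's tokenizer, and the `reserve_for_output` budget is a hypothetical parameter; an exact count would need the model's real tokenizer.

```python
# Rough check of whether a document fits in a model's context window.
# ASSUMPTION: ~4 characters per token (rule of thumb, not Gemma 4's tokenizer).
CHARS_PER_TOKEN = 4.0

# Context window sizes as stated in the article, in tokens.
CONTEXT_WINDOWS = {
    "E2B": 128_000,
    "E4B": 128_000,
    "26B A4B": 256_000,
    "31B": 256_000,
}


def estimate_tokens(text: str) -> int:
    """Approximate the token count of `text` from its length in characters."""
    return int(len(text) / CHARS_PER_TOKEN) + 1


def fits(text: str, model: str, reserve_for_output: int = 4_000) -> bool:
    """True if the estimated prompt plus an output budget fits the window."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_WINDOWS[model]


# A 600,000-character document is roughly 150,000 tokens under this heuristic.
document = "a" * 600_000
for model in CONTEXT_WINDOWS:
    print(model, "fits" if fits(document, model) else "too long")
```

By this estimate a 600,000-character document overflows the 128,000-token edge models but fits comfortably in the 256,000-token window of the larger two.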

Native function calling and structured JSON outputs ship in all four variants. Earlier Gemma versions required extra engineering before they could interact with external tools and APIs; Gemma 4 is designed to work with agent frameworks out of the box.
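In practice, function calling means the application describes its tools to the model, the model replies with a structured JSON call, and the application executes it. The sketch below shows only the application side of that loop. The tool `get_weather`, the schema layout, and the `{"name": ..., "arguments": ...}` shape of the model output are all illustrative assumptions; the article does not specify the exact format Gemma 4 emits, which in practice depends on the runtime or agent framework in use.

```python
import json


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would query a weather service."""
    return f"Sunny in {city}"


# Registry mapping tool names to Python callables.
TOOLS = {"get_weather": get_weather}

# JSON-Schema-style description that would be passed to the model (assumed layout).
TOOL_SCHEMAS = [
    {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }
]


def dispatch(model_output: str) -> str:
    """Parse a JSON tool call produced by the model and run the matching tool."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Model requested unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


# Stand-in for a structured response from the model (assumed shape).
example_output = '{"name": "get_weather", "arguments": {"city": "Zurich"}}'
print(dispatch(example_output))  # prints "Sunny in Zurich"
```

Validating the tool name before dispatch matters even with native support, since the model's output is still untrusted input to the application.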

The benchmark numbers

The generational leap from Gemma 3 to Gemma 4 is most visible in the benchmark scores. On AIME 2026, the rigorous mathematical reasoning competition benchmark, the 31B Dense model scores 89.2%, compared to Gemma 3 27B's 20.8%. On LiveCodeBench v6, the competitive coding benchmark, the 31B scores 80.0%. On Codeforces, the Elo rating jumped from 110 for Gemma 3 27B to 2,150 for the 31B. On GPQA Diamond, the graduate-level science reasoning benchmark, the 31B scores 85.7%, the second highest result recorded for any open-weight model under 40 billion parameters, just behind Qwen3.5 27B at 85.8%.

The 26B MoE model tracks the 31B closely on most benchmarks despite activating only a fraction of its parameters: 88.3% on AIME 2026, 77.1% on LiveCodeBench, and 82.3% on GPQA Diamond. The performance gap between the two larger models is modest. The inference cost advantage of the MoE architecture is not.
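The trade-off can be put in numbers using only the figures quoted in this article. The short sketch below computes the share of the MoE model's parameters that are active per token and the score gap to the 31B Dense on each benchmark; it is simple arithmetic on the reported values, not an independent measurement.

```python
# Figures as reported in the article.
total_params_b = 25.2   # 26B A4B total parameters, in billions
active_params_b = 3.8   # parameters active during inference, in billions

active_fraction = active_params_b / total_params_b
print(f"Active share of MoE parameters: {active_fraction:.1%}")

# benchmark -> (31B Dense score, 26B MoE score), in percent
scores = {
    "AIME 2026": (89.2, 88.3),
    "LiveCodeBench": (80.0, 77.1),
    "GPQA Diamond": (85.7, 82.3),
}

# Gap in percentage points, rounded to one decimal to hide float noise.
gaps = {name: round(dense - moe, 1) for name, (dense, moe) in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: 31B leads by {gap} points")
```

Roughly 15% of the MoE model's parameters are active at a time, while the gap to the dense model stays between 0.9 and 3.4 percentage points on these three benchmarks.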

For vision tasks, the 31B model scores 76.9% on MMMU Pro and 85.6% on MATH-Vision — results that would have been frontier-class from proprietary models less than a year ago.

The Apache 2.0 license — why it matters more than benchmarks

Previous Gemma releases shipped under custom license terms that required legal interpretation before enterprise teams could deploy them commercially. That friction was real and slowed adoption in exactly the environments — regulated industries, government agencies, enterprises with strict IP policies — where on-device AI has the most value.

Gemma 4 eliminates that entirely. Apache 2.0 is the same permissive license used by many Qwen and Mistral releases and much of the open-weight ecosystem. It carries no custom clauses and no commercial deployment restrictions, and redistribution is subject only to the license's standard notice and attribution conditions. For enterprise and sovereign AI deployments, meaning organizations that need to run AI on their own infrastructure without any data leaving their environment, the license change may ultimately matter more than the benchmark scores.

The timing is pointed. As some Chinese AI labs including Alibaba have begun pulling back from fully open releases for their latest models, Google is moving in the opposite direction — opening up its most capable Gemma release yet, under truly permissive terms, while explicitly stating the architecture draws from its commercial Gemini 3 research.

How to run it right now

Gemma 4 models are available immediately via multiple paths. The fastest for most developers is Ollama — a single command pulls and runs any Gemma 4 variant locally. Hugging Face hosts all four models under the Google organization for direct download and fine-tuning. Kaggle Models also hosts the weights. For cloud deployment, both workstation models run serverlessly on Google Cloud Run with NVIDIA RTX Pro 6000 GPUs, scaling to zero when idle.
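Once a model has been pulled, a locally running Ollama server can be queried over its REST API. The sketch below is a minimal example that assumes Ollama is listening on its default port, 11434. The model tag `gemma4` is a placeholder assumption, since this article does not give the exact tags; `ollama list` shows the tags actually installed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "gemma4"  # ASSUMPTION: placeholder tag; check `ollama list` for the real one


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = MODEL_TAG) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize the Apache 2.0 license in one sentence.")` returns the model's text as a string, with the prompt and response never leaving the local machine.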

Day-one support is confirmed across the full inference ecosystem: Hugging Face Transformers, TRL, Transformers.js, Candle, vLLM, llama.cpp, MLX, LM Studio, Ollama, NVIDIA NIM and NeMo, Unsloth, SGLang, Keras, MaxText, and Docker. For Android developers specifically, the E4B and E2B models power Agent Mode in Android Studio and are production-ready for the ML Kit GenAI Prompt API. For experimenting in the browser, Google AI Studio provides immediate access to the 31B and 26B MoE models; Google AI Edge Gallery serves the two edge variants.

What Gemma 4 means for developers

Since Gemma 1 launched in February 2024, the model family has accumulated more than 400 million downloads and spawned a community ecosystem of over 100,000 fine-tuned variants. Gemma 3 produced specialized derivatives for medicine (MedGemma), marine biology (DolphinGemma for dolphin vocalization analysis), and accessibility (SignGemma for sign language translation). Gemma 4's stronger foundation and cleaner license will accelerate that pattern.

The developer buzz around Gemma 4 is focused on two specific capabilities: private, offline AI on everyday devices, and the economic case for replacing cloud API calls with local inference. A model that scores 89.2% on AIME 2026 and runs quantized on a consumer GPU with Ollama is not a research curiosity. It is a production-viable alternative to paying per-token for a proprietary model, with full data privacy and no latency from network round-trips.

For the open-weight AI ecosystem, Google has made a clear statement today: frontier-level intelligence is no longer the exclusive property of closed, cloud-hosted models. Gemma 4 puts it on your device, under a license that lets you build whatever you want with it.
