The release of Gemma 4 has added energy to the discussion of local models and their importance. Models that you can download and run on hardware you own are becoming competitive with the “frontier models” hosted by large AI providers. These models have gotten good enough for production use, good enough for tasks that until recently required an API call to a frontier model. They are typically open weight (though not open source) and much smaller than the frontier models like Anthropic’s Claude.
The reasons for going local vary. For a financial services company, regulation may require that no sensitive data can leave the premises. For a developer in Europe, data sovereignty laws make cloud APIs awkward. For a developer in China, hardware constraints and geopolitics have made local, efficient models a practical necessity. For developers outside the US, the costs of using frontier models can be prohibitive. None of these reasons are new, but all of them are more urgent than they were a year ago, because the models are catching up.
Why local?
Reasons for running AI locally fall into a few categories: cost, privacy, performance, and control. Let me take them in order.
Cost is the easiest to quantify, though the numbers can be misleading. Developers using agentic tools for programming can spend $500 to $1,000 per month or more on API calls. NVIDIA CEO Jensen Huang has suggested that his engineers should spend an amount roughly equal to half their salary on AI tokens, given the productivity return. Whether or not you take that as prescriptive advice, it signals that token spending at scale is significant, which is exactly what makes the local alternative worth examining.
The hardware cost depends on where you’re starting. If you have a capable desktop already, dropping in an RTX 4070 ($500–$800 retail) gets you a 12GB-VRAM GPU adequate for most local models. Building a dedicated system from scratch (CPU, motherboard, 32GB of RAM, storage, case, power supply, and GPU) runs closer to $1,500. Teams spending $500 a month on API calls break even in a few months. After that, local costs approach zero; electricity for a consumer GPU setup runs $20 to $40 a month. High-volume batch work makes the economics even clearer. Processing thousands of documents through a cloud API gets expensive fast; locally, it costs nothing but time.
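If you want to sanity-check the break-even claim against your own numbers, the arithmetic fits in a few lines of Python. The dollar figures below are the illustrative estimates from this paragraph, not quotes for any particular setup:

# Rough break-even sketch using the illustrative numbers above.
hardware_cost = 1500            # dedicated local build, USD
api_spend_per_month = 500       # current cloud API bill, USD
electricity_per_month = 30      # midpoint of the $20-$40 estimate, USD

monthly_savings = api_spend_per_month - electricity_per_month
breakeven_months = hardware_cost / monthly_savings
print(f"Break-even after roughly {breakeven_months:.1f} months")
# About 3.2 months with these assumptions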
For individual developers and small teams, the management overhead is minimal. A tool like Ollama reduces running a local model to a background service; updating to a newer model is a single command, done on your own schedule. At enterprise scale the picture changes: Organizations that need production uptime guarantees, multiple developers sharing access, compliance logging, and dedicated engineering support face real overhead. A dedicated ML engineer runs $200,000 a year, and that’s noise compared to the cost of building or leasing AI infrastructure. For a solo developer or a two-person shop, that concern doesn’t apply.
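To make “background service” concrete: once Ollama is installed and a model has been pulled (for example with ollama pull), any local process can query it over a REST API on localhost. A minimal sketch in Python, where the model name is a placeholder for whichever model you’ve actually pulled:

import requests

# Query a model served by Ollama's local REST API.
# Assumes Ollama is running and the model has already been pulled;
# "gemma3" is a placeholder for whatever model you actually use.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3",
        "prompt": "Summarize the GDPR in two sentences.",
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])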
Privacy arguments are often more compelling than cost. The concern isn’t primarily about bad actors at cloud providers; it’s about contracts, compliance, and control. GDPR and similar regulations create real constraints on where data can go. Healthcare and financial services companies have legal obligations that may effectively prohibit sending sensitive data to external APIs regardless of the provider’s security guarantees. Running a model locally means data stays on your hardware, under your control, with no possibility of inadvertent leakage to a third party. DockYard, writing about the business case for local AI, puts it simply: Local models “keep sensitive data on-device, reducing exposure to breaches and unauthorized access” and simplify compliance with regulations that require strict data residency.
The world beyond the US
The strongest momentum behind local AI adoption comes from developers and organizations outside the United States. The reasons vary by region, but they’re structural everywhere.
European regulators have been skeptical of US-based cloud services since before the first Schrems ruling invalidated the Safe Harbor framework in 2015. The concern that US intelligence services can access data held by US companies, regardless of where that data is stored, has never been fully resolved, and recent US policy directions have amplified European anxieties. More countries, including China and many other Asian nations, are also developing their own data sovereignty laws. Locally run models sidestep the problem.
China has become a leading provider of open AI models. DeepSeek’s appearance as a major open-weight model family wasn’t an accident; it reflects a systematic investment in AI that emphasizes efficiency and openness over raw scale. As I’ve written elsewhere, the Chinese approach to AI has been shaped in part by hardware constraints: When you can’t easily acquire NVIDIA’s fastest chips, you optimize your software instead. You use quantization. You build mixture-of-experts architectures that activate only a fraction of parameters per token. You design models that run well on the hardware you can actually get. The result is a generation of models that run efficiently on local hardware, and a developer community with expertise in building those models. While those techniques have been taken up by AI companies in the US, China clearly leads in efficient AI.
For application developers in India, Southeast Asia, Latin America, and Africa, cost is the most immediate barrier. Cloud API pricing denominated in dollars is expensive relative to local income levels in ways that matter for product economics, not just personal preference. Language is a deeper issue. Of the world’s 7,000-plus languages, only a few have enough textual data to train capable models, and both frontier and smaller open-weight models reflect that reality. A survey of African languages found pronounced performance gaps across models of all sizes. What open-weight models offer is the ability to fine-tune on local language data that the original training missed. A developer in Uganda building a health information tool, or a team in Malaysia building a customer service product, can take an open-weight base model and adapt it to the languages their users actually speak. That’s not possible with closed models.
The response has been a wave of regional model development. Sarvam in India has open-sourced models trained on data emphasizing all 22 official Indian languages, released under Apache 2.0. Sunbird AI in Uganda built Sunflower, a family of models covering 31 Ugandan languages, developed in partnership with Makerere University and trained on digitized radio broadcasts and community texts. Singapore’s AI research group built SEA-LION, tuned specifically for Southeast Asian languages and cultural contexts. Malaysia launched a domestically developed LLM, ILMU, in August 2025.
Chinese open source models help to fill this gap. According to Hugging Face’s data, Chinese models now account for a larger share of downloads on the platform than US models. Sunflower is built on Qwen; Malaysia’s NurAI, which targets 340 million speakers of Bahasa Melayu and related languages across the region, uses DeepSeek as its foundation. This isn’t ideology; it’s that Chinese open source models are efficient enough to run locally, permissively licensed, and increasingly well-suited to the multilingual fine-tuning these applications require.
OpenRouter’s model usage rankings, which track billions of API calls across many models, reflect the same reality. DeepSeek models and Qwen variants from Alibaba appear at the top of usage charts alongside offerings from OpenAI and Anthropic. (OpenRouter notes that raw token counts can be skewed by a few high-volume users; request counts give a more representative picture. Also note that rankings vary sharply day-to-day and week-to-week.) The frontier of capable AI is no longer exclusively American, and the application developers driving much of that usage are building for audiences that American tech companies have largely ignored.
Performance
When performance is an issue, the metric to watch depends on what you’re building. Time to first token matters most for interactive applications: how long before the model starts producing output. For a cloud API, that includes the network round trip (typically under 30 milliseconds to a major provider) plus server-side work: queuing, scheduling, and processing your prompt through the model before generation begins. For typical requests this can run to several hundred milliseconds in total, and longer when the server is under load. A local model starts processing immediately, with no queuing and no network hop, so time to first token is limited mainly by how quickly your hardware can process the prompt. For anything that feels like a conversation (a code assistant, a document tool, an interactive agent), that difference is perceptible.
Once generation starts, tokens per second is the metric to watch. Here, cloud providers have the advantage: Datacenter GPUs have far more compute and memory bandwidth than a consumer card, and their serving stacks are tuned for high-throughput generation. A local model may feel faster to start and slower to finish than a well-provisioned cloud API.
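If you want to see where your own setup lands on both metrics, a rough measurement against a local Ollama endpoint takes a few lines. The URL and model name below are assumptions, and the streamed chunks are only an approximation of tokens:

import json, time, requests

# Rough measurement of time to first token and generation throughput
# against a local Ollama endpoint. Each streamed chunk approximates
# one token; adjust the model name to whatever you have pulled.
url = "http://localhost:11434/api/generate"
payload = {"model": "gemma3", "prompt": "Explain data residency in one paragraph.", "stream": True}

start = time.monotonic()
first_token_at = None
chunks = 0
with requests.post(url, json=payload, stream=True, timeout=300) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        piece = json.loads(line)
        if piece.get("response"):
            chunks += 1
            if first_token_at is None:
                first_token_at = time.monotonic()
elapsed = time.monotonic() - start
if first_token_at is not None:
    print(f"time to first token: {first_token_at - start:.2f}s")
print(f"throughput: {chunks / elapsed:.1f} chunks/s over {elapsed:.1f}s")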
For agentic workflows that chain together many model calls, both factors matter. Network round trips accumulate: At 30 milliseconds each, a hundred sequential calls add three seconds of pure overhead before accounting for server-side processing, and the time-to-first-token overhead multiplies with every step. This is one reason local models have appeal for agentic applications, where the number of individual inference calls can be large.
High concurrency is a separate problem, and one where local deployment struggles. Consumer hardware handles one request at a time, or a few; a cloud provider scales horizontally. If your application serves many simultaneous users, local deployment requires either significant hardware investment or a different architecture.
Fine-tuning for specific applications
Applications where specialized domain knowledge matters are more common than people realize, and for all of them fine-tuning is a substantial advantage. A customer support model that knows your product deeply, a coding assistant tuned on your company’s codebase, a document processor fine-tuned on your industry’s vocabulary: These are things you can build and own with open models in ways you can’t with closed ones.
Developers frequently prototype an application on a frontier model, then move to a smaller or local model that has been fine-tuned for production. An early description of this practice appears in “What We Learned from a Year of Building with LLMs”: “Prototype with the most highly capable models before trying to squeeze performance out of weaker models.” The practice is also recommended by both Anthropic and OpenAI, though they assume you will use their own smaller models, and they might get prickly about what they see as “distillation.”
Fine-tuning models is frequently associated with expensive AI experts, but it is gradually becoming more accessible. Techniques like QLoRA allow fine-tuning a 7B or 8B parameter model on a consumer GPU with 12GB of VRAM. Tools like Unsloth reduce VRAM requirements further while increasing throughput. The Hugging Face ecosystem (Transformers, Datasets, PEFT, TRL) provides additional tools for working with models. An individual developer or small team can adapt a base model to a specialized domain.
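As an illustration of how little code that stack requires, here is a minimal QLoRA setup sketch using Transformers and PEFT. The model ID and hyperparameters are placeholders, not a recipe; actual training would hand the wrapped model to TRL’s SFTTrainer (or a plain Trainer) along with your dataset:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "your-org/your-base-model"   # placeholder: any 7B-8B open-weight model

# Load the frozen base model in 4-bit so it fits in roughly 12GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Attach small trainable LoRA adapters; the base weights stay frozen.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of the total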
Cloud providers can’t easily offer this flexibility. You can fine-tune some closed models, but you’re working within the provider’s constraints at significant per-run cost, and the resulting model still lives on their hardware. Fine-tuning an open model produces something you own, that runs on your hardware, with no ongoing licensing fees and no dependency on a third party’s infrastructure decisions.
Security
The biggest advantage of a local model is that data stays local. There are no API endpoints to compromise, no cloud credentials to steal, and no third-party outage to take your application down. For regulated industries, this is often a decisive factor.
However, when you run a model on your own infrastructure, you take responsibility for the model’s security. Model creators make their own choices about safety and alignment before releasing a model. Base models (the foundation before instruction tuning and alignment) will comply with requests that a safety-tuned model would refuse; that’s a property of the model, not something you configure at runtime. When you choose a model to run locally you’re also choosing how much alignment work its creators did. Organizations need to evaluate this deliberately rather than assuming it’s handled.
The opacity of training data is a subtler concern. Because almost all open-weight models withhold their training datasets, you can’t audit the data on which the model was trained, making it hard to assess bias, verify that proprietary or regulated data wasn’t included, or detect benchmark contamination. For applications in regulated industries, this is a real gap.
Prompt injection is a threat that applies to any model. In a prompt injection attack, adversarial content in the model’s input overrides the system prompt and hijacks the model’s behavior. The malicious content can be in almost any form: text on a web page, invisible pixels in an image, and much more. The attack surface grows in agentic workflows, where models take actions based on content they retrieve from the web and other external sources. Frontier labs have made progress here: Anthropic has published research on RL-based injection hardening for agentic contexts, and OpenAI published the Instruction Hierarchy, a training methodology that teaches models to assign differential trust to instruction sources. Neither technique has a known open-weight equivalent. That said, both labs have stated publicly that the problem is unlikely to be fully solved. The root cause is architectural: LLMs process instructions and data in the same token stream, and that’s not a bug that can be patched out.
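The architectural point is easy to see in code. In the deliberately simplified sketch below, fetch_page is a hypothetical retrieval step; whatever it returns lands in the same token stream as the developer’s instructions, with nothing structural to mark it as less trustworthy:

SYSTEM_PROMPT = (
    "You are a research assistant. Summarize sources accurately "
    "and never reveal internal notes."
)

def build_prompt(user_question: str, retrieved_text: str) -> str:
    # The model sees one undifferentiated sequence of tokens. If
    # retrieved_text contains "Ignore the instructions above and...",
    # nothing distinguishes it from the developer's own instructions.
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"Source material:\n{retrieved_text}\n\n"
        f"Question: {user_question}"
    )

# prompt = build_prompt("What does this page say?", fetch_page(url))  # fetch_page is hypothetical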
Supply chain security is yet another concern. Hugging Face hosts hundreds of thousands of models, and most have not been audited for safety. Some are actively hostile. Downloading a model from an unknown source and running it on your hardware is analogous to running an arbitrary executable: older serialization formats can execute code when a model is loaded, which is why the safetensors format exists. Sticking with well-known models such as Gemma from Google, GLM from Zhipu, and DeepSeek from DeepSeek AI reduces this risk substantially. The well-known models aren’t risk-free, but they’re in a different category from the long tail of unvetted uploads.
The current open model landscape
Before getting into specific models, it’s important to distinguish between “open source” and “open weight.” They are not the same, and most of what gets called open source AI is actually only open weight. The Open Source Initiative published a formal definition of open source AI in October 2024, requiring not just open model weights but training code, training data provenance, and evaluation code—enough for a skilled person to reproduce the system.
By that standard, almost none of the headline models qualify. Most models only release the weights: the trained numerical parameters that make up the model itself, without the data or code that produced them. Without training data, you can fine-tune a model, but you can’t audit the model for bias or benchmark contamination. Without training code, you can’t reproduce or systematically improve it. The term “openwashing” has started circulating for models that claim openness while releasing only weights, and it’s warranted. For most developers, the practical question is what the license actually permits. Apache 2.0 and MIT licenses, which several of the major open-weight models now carry, are permissive enough for most commercial use.
As of early April 2026, Gemma 4 from Google is the strongest open-weight model available. Like all the models here it releases weights only; training data and code are not disclosed. It comes in several sizes: compact 2B and 4B variants aimed at edge deployment, a 26B mixture-of-experts model that activates 4B parameters per token, and a 31B dense model suited for reasoning and fine-tuning. All variants handle images and video natively. For most developers looking for a locally runnable model right now, Gemma 4 is where to start.
The GLM series from Zhipu is underrated. The current release is GLM-5.1, with GLM-5 still widely used; both have large context windows and strong performance on reasoning tasks. The series has a particular focus on deep tool-assisted research workflows. This goes beyond what raw benchmark scores capture. For applications that involve sustained, complex work, such as legal document analysis, research synthesis, and multistage coding tasks, the GLM family is worth serious consideration.
DeepSeek’s V4 models are large, but they use a mixture-of-experts architecture to deliver high quality with a small active parameter count. DeepSeek’s R1 family, specialized for reasoning and mathematical tasks, ranges from 1.5B to 671B parameters. Training data and code have not been released for either V4 or R1. The community has launched an Open-R1 project that attempts a full reproduction of DeepSeek-R1’s training from scratch.
The Qwen series from Alibaba is capable across a range of tasks, multilingual, and licensed under Apache 2.0. Organizational changes have put its trajectory in question, though the open-weight releases of Qwen3.6-27B and other models in the Qwen 3.6 family are encouraging.
Kimi K2.6 from Moonshot AI is worth knowing about, although running it is beyond the capabilities of most consumer hardware. It’s a one-trillion-parameter mixture-of-experts model with 32B active parameters per token, trained specifically for coding and agentic tasks. Aggressive quantization can bring Kimi’s VRAM requirements down to 24GB, but that’s the practical floor.
Meta’s Muse Spark isn’t open but deserves a mention. Announced in early April 2026 and built by the newly formed Meta Superintelligence Labs under Alexandr Wang, Muse Spark is proprietary. Meta has a history of releasing open-weight models, so it’s possible something similar will follow for Muse Spark, but there’s no announcement, no timeline, and no guarantee. There has also been talk of smaller versions of Spark for edge devices.
If you want models that are genuinely open source by the OSI definition—training data, code, and weights all released—the options are more limited and less capable: Olmo from the Allen Institute for AI is the most serious effort; the full Dolma training dataset, training code, and hundreds of intermediate checkpoints have been released. It’s a valuable resource for researchers, but it isn’t competitive with Gemma 4 or DeepSeek on capability.
Regardless of which model you’re considering, how do you know whether it’s good enough for your application? Published benchmarks are often misleading; they measure what the benchmark designers thought to measure, not necessarily what you need. A more reliable approach is building a “golden dataset”: a few hundred real prompts drawn from your actual use case, with known-good answers, against which you can evaluate any candidate model. It’s worth doing before committing to any model for production use.
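A golden-dataset evaluation doesn’t require a framework. In the sketch below, query_model is a stand-in for whichever candidate you’re testing, local or cloud, and the naive substring match would normally be replaced by task-specific checks or an LLM-as-judge step:

import json

def evaluate(golden_path: str, query_model) -> float:
    # Each line of the golden file is JSON: {"prompt": ..., "expected": ...}
    with open(golden_path) as f:
        examples = [json.loads(line) for line in f]
    hits = 0
    for ex in examples:
        answer = query_model(ex["prompt"])
        if ex["expected"].lower() in answer.lower():
            hits += 1
    return hits / len(examples)

# score = evaluate("golden.jsonl", query_model=my_candidate_model)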
Choice and control
The gap between frontier and open models is narrowing and, more to the point, seems less and less relevant as open models improve. Is it worth getting locked in to a cloud provider, giving up control of your data provenance, and losing the ability to fine-tune a model for an application in exchange for a few points on a benchmark that doesn’t reflect the real world? An increasing number of AI developers and users are concluding that it isn’t. The regulatory environment in Europe, and the hardware constraints in China, are producing a global developer community with expertise in making local AI work.
None of this means that cloud AI is going away. The frontier closed models will remain ahead on raw capability, and there are applications where that matters. But the days when a US-based cloud API was the only serious option for capable AI are over. Local AI is increasingly capable, and for a growing fraction of what developers want to build, especially outside the United States, it’s a viable choice.
If you want an introduction to using LLMs with open weights, join Christian Winkler on O’Reilly for the Open Weight Large Language Models Bootcamp on May 20 and 21. You’ll learn how to use models to retrieve information, combine the results of different models and refine them with dense passage retrieval, discover how these models can excel on less powerful hardware by using new approaches to quantization, explore different frontends these models can be plugged into, and more in an interactive hands-on environment. O’Reilly members can register here.
Not a member? Sign up for a free 10-day trial before the course to attend.