8 Sources
[1]
Running Ollama on a 15W CPU sounded ridiculous until I got it working with decent results
Richard is the PC Hardware Lead at XDA and has been covering the technology industry for almost two decades. He's been building PCs since young, and when not creating content, you can often find him inside a chassis somewhere. Large language models (LLMs) are incredibly useful. They're not perfect, but when prompted and used effectively, they can enhance your productivity and allow you to free up some valuable time for other tasks. Most of the grunt work happens in the cloud with ChatGPT, Claude, NotebookLM, and Copilot, to name but a few. Servers in datacenters are spun up to handle incoming requests, and you've probably read a news feature or two on just how much power these vast complexes require for handling AI. If you thought Bitcoin mining was wasting resources, you'll be stunned to see AI do just the same. But that's where running your own LLMs can make a world of difference. Being able to load heavily optimized models onto free and open software, requiring just a PC to run it and some electricity each time the system ramps up to handle your requests. It's never going to be as smooth and capable as cloud-based AI, but so long as you keep expectations in check and learn how best to prompt each model, you can achieve some incredible results with nothing but a discrete GPU and basic desktop setup. Throw in an Nvidia GeForce RTX 5090, and you've suddenly got access to some seriously powerful models. Using LXC-powered Ollama and Open WebUI It's easy, quick, and what I already know I never really bothered with using the CPU or, specifically, the integrated GPU found on the chip itself. That was until I decided I had had enough of my LLM box pulling 100 watts at idle and up to 300 watts or so when handling a request. I switched it out for a compact, low-power mini PC with a fairly mediocre processor, and the results weren't as awful as I expected. I decided to keep the mini PC running as my new LLM box, excited to see how the future further refines models and improves things with some fairly strong restrictions. Should you run LLMs on a budget-friendly mini PC? Not if you expect ChatGPT levels of responsiveness, but it can be a fun project. I fired up Proxmox on the mini PC, checked that all available CPU cores and RAM were locked and loaded. Then, a quick trip to the Proxmox community scripts page to take the command for installing Open WebUI with Ollama. Once that was installed and configured with a dedicated IP address through OPNsense -- replacing the previous Open WebUI running on a beefier PC -- I was good to go. Just like many other home lab projects, there are countless ways to go about it, but I felt like Proxmox and an LXC were the best way to make the most of the available hardware. I'm not after the best possible outcome (the CPU has a TDP of just 15W), at least not yet anyway. I know I'm going to have hardware constraints before anything relating to Ollama. Using llama.cpp may provide a performance upswing, but even then, there's the question as to whether it's worth it. This is something I'll look into later. For reference, this Minisforum U850 mini PC has the Intel Core i5-10210U CPU with four cores, and there's 16 GB of DDR4-2666 RAM. That's fairly underwhelming for a local LLM setup, especially the memory, since we're going to be CPU-bound and that RAM is super-slow compared to DDR5 and a discrete GPU. 5 self-hosted LLMs I use for specific tasks My customized, self-hosted AI workflow Posts 5 By Yash Patel Low-power CPUs are surprisingly capable But it has absolutely nothing on dedicated hardware It's fairly easy to configure Open WebUI, too. After creating the first account (also with admin privileges), I downloaded qwen3:4b-14_k_m and qwen2.5coder:7b-instruct-q4_k_s, which would be my two test beds to see how capable this system is at running smaller yet highly optimized LLMs. The results were surprising, as my esteemed colleague Ayush Pande discovered when running a similar test on a mini PC with an Intel N100 CPU. For Qwen3 on my compact system, the 4B model managed around 4 tok/s with a simple question, and when asked what XDA Developers is. Not brilliant, but more than sufficient for loading queries while doing something else. The Intel Core i5-10210U was never designed with local LLMs in mind. It's a mobile chip slapped onto a compact mini PC motherboard. Getting it to do much heavy lifting will result in slow waits, but the four physical cores and upgradable RAM do provide some wiggle room for heavier tasks, such as running local models. I found anything under 10B to be entirely possible without entering swap territory and waiting an absolute age for the CPU to handle everything. The downloaded test model qwen3:4b is great for general queries, and the slightly larger qwen2.5coder:7b is solid for assisting. I did find it humorous how Qwen3 believes XDA does not cover LLMs and PC hardware, though it's interesting how the model relied heavily on the community forum. That's the thing with these more compact models with smaller parameter totals. You need to prompt them the right way to get the most out of the technology. It's no good simply asking whether XDA covers PC hardware and LLMs, especially after querying what XDA is. The LLM will base its follow-up response on the forum, but that's where people can struggle with interacting with local and cloud-based models. Here's how I get the most out of my self-hosted LLM, especially when limited by VRAM Don't have an RTX 5090? No problem! Posts 8 By Rich Edmonds It's not a great daily driver Though around 4 tok/s is perfectly fine for my needs with a local LLM, it's not an ideal setup for running models daily. If you expect prompt responses and high accuracy, you'll need the cloud or compelling hardware to run it all locally, but then you run into the costs of electricity. Sometimes, depending on where you reside and what PC parts you have available, cloud AI may be more affordable. For those of us who don't mind waiting a minute or two for a response and use LLMs for specific needs, even a low-power, budget mini PC with a 15W CPU like this can get the job done. Have the cash for a mini PC designed for running AI? Grab something like the GMKtec EVO-X2 AI. GMKtec EVO-X2 AI Mini PC $2000 $2200 Save $200 Brand GMKtec CPU AMD Ryzen AI Max+ 395 Memory LPDDR5X-8000 Operating System Windows 11 / Ubuntu Graphics AMD Radeon 8060S $2000 at GMKtec See at Amazon Expand Collapse
[2]
My favorite AI coding tool isn't OpenCode: it's something way better (and cooler)
Ayush Pande is a PC hardware and gaming writer. When he's not working on a new article, you can find him with his head stuck inside a PC or tinkering with a server operating system. Besides computing, his interests include spending hours in long RPGs, yelling at his friends in co-op games, and practicing guitar. Even if you're as averse to vibe-coding as I am, AI tools have plenty of perks for programming-oriented tasks. You can use them to troubleshoot hard-to-diagnose error logs, have them control external tools, or even reformat code into the right languages. Plus, you've got a battalion of LLM-powered programming tools to choose from, ranging from simple code editors to complex IDEs and agent harnesses. Speaking of AI harnesses, I've grown fond of CLI tools that can hook up to LLMs and use their superior reasoning capabilities to run autonomous tasks. OpenCode is one of the more popular options, and while it definitely has some neat features, it's not my harness of choice. That designation goes to Pi, which is not only extremely lightweight, but it's also capable of extending its functionality by designing custom extensions using simple text prompts. I replaced Cursor and Antigravity with a completely local VS Code setup, and I missed less than I expected My self-hosted setup holds up pretty well for my coding tasks Posts 1 By Ayush Pande Pi's low resource consumption makes it the perfect companion for local models It doesn't hog a lot of context size for tools If you've read my articles on XDA, you're probably aware that I prefer locally-hosted LLMs over their cloud counterparts. After all, I don't have to waste extra money on paid licenses to use them. Nor do I have to worry about exposing important credentials to the clankers running on an external company's servers. That said, using coding agent harnesses like OpenCode can end up hogging quite a bit of the model's context length for the built-in tools before I even start prompting the agent. Considering that I run bulky MoE LLMs on outdated GPUs, I'd rather avoid clogging the context length as much as I can during long prompting sessions. In contrast, Pi ships without the additional tools, subagents, and services you'd typically find on Claude Code or OpenCode, so I don't have to worry about the agent harness taxing the limited context length (and by extension, my VRAM-constrained graphics cards) with functionality that I won't need. On my Qwen3.6-35B-A3B llama-server, I've had the same prompt session last for hours on Pi without the context length hitting the red zone, but I'd have to manually tweak the --n-cpu-moe parameter to boost the context length past the 100K mark before I could use Claude Code for the same duration. Pi can code its own extensions with simple prompts It's essentially a self-modifying utility that evolves with my coding workloads If Pi was simply a barebones tool with limited functionality, I wouldn't hold it in such high regard. But what really makes this agent harness stand out from its rivals is the fact that it can be customized with extensions - and not just outdated tools from random community GitHub repos, either. Pi can build practically any extension on its own once I give it the right prompt, and it's an absolute game-changer. Besides the whole DIY and customizable nature of Pi, its ability to spin up fully-functional tools capable of interfacing with external apps means I don't have to look into sketchy MCP servers. For reference, one of the first extensions I asked Pi to create was something that could link it with the underlying Docker runtime (rootless, of course) on the Linux host. To my surprise, the Pi + Qwen3.6 combo built a reliable extension that could control every aspect of my Docker workstation with simple queries. I've also asked it to create dedicated extensions for Podman, Nextcloud, Home Assistant, and Proxmox. Apart from one attempt where I didn't specify the Docker extension in the prompt (which caused Pi to have a meltdown while attempting to run the Podman extension to check Docker-based containers), I've had zero issues pairing my local LLM with my home lab services. Compared to all the hassle I had to go through when I tried to use third-party MCP servers to hook my PVE node to my LLMs a few weeks ago, Pi's ability to create properly functioning extensions within a few minutes was a breath of fresh air. I ran local LLMs on Intel's cheapest iGPU, and the results were surprisingly decent It ain't no match for a dedicated GPU, but you can run some light LLMs on the N100 Posts 1 By Ayush Pande That said, I wouldn't leave Pi unsupervised I've also set up guard rails to prevent misinterpreted prompts from damaging my system Although Pi is capable of evolving with my coding needs, it also lacks the safeguards built into its rivals, making it quite a chaotic gremlin if left to its own devices. Fortunately, there are a couple of ways to prevent this beast of an agent harness from wreaking havoc indiscriminately. The pi-permission-system package, for one, is a must-have for all Pi users, as it forces the tool to ask for permission before attempting tasks involving sudo privileges or destructive operations. Even the extensions themselves can be fully customized, so I've disabled any tools that are capable of deleting, updating, or modifying critical files/virtual guests on all the platforms I've paired with my Pi + llama-server setup. Likewise, I've deployed Pi on my dev virtual machines that are regularly backed up to my NAS. Since everything runs inside virtual environments that I can recover at the press of a button, Pi's only drawback matters little for my coding needs. Pi See at Github Expand Collapse
[3]
I finally found a local coding LLM that I actually want to use
Nick Lewis is an editor at How-To Geek. He has been using computers for 20 years --- tinkering with everything from the UI to the Windows registry to device firmware. Before How-To Geek, he used Python and C++ as a freelance programmer. In college, Nick made extensive use of Fortran while pursuing a physics degree. Nick's love of tinkering with computers extends beyond work. He has been running video game servers from home for more than 10 years using Windows, Ubuntu, or Raspberry Pi OS. He also uses Proxmox to self-host a variety of services, including a Jellyfin Media Server, an Airsonic music server, a handful of game servers, NextCloud, and two Windows virtual machines. He enjoys DIY projects, especially if they involve technology. He regularly repairs and repurposes old computers and hardware for whatever new project is at hand. He has designed crossovers for homemade speakers all the way from the basic design to the PCB. Nick enjoys the outdoors. When he isn't working on a computer or DIY project, he is most likely to be found camping, backpacking, or canoeing. If you run an AI locally, you get complete privacy, no API or subscription costs, offline access, and you never have to worry about running into your usage limit right when you're in the middle of something. For a long time, AI coding assistants were janky, unreliable, and painfully slow, but newer local models can hold their own against the cloud-based models as long as you're careful about how you use them. Qwen's latest coding model release has been especially impressive, and I now use the local model about half the time. Qwen makes local vibe coding viable Pair it with VSCodium for a fully open-source experience There are a ton of local AI models that can help you code locally, and they frequently leapfrog each other as models improve over time. For the most part, I haven't found these very impressive -- they're good as very fancy autocompletes, but not much else. However, recent models make them much more appealing. I've been using some of Qwen's coding-specific models, and found they are finally in a position where they're usable on moderate hardware and practically useful. It handles code completion, refactoring, and writing tests, and it does it pretty well. You can use it to plan or directly write, though I'd strongly recommend planning first. It isn't nearly as smart as the big cloud models, and it needs the help. I run my local coding models in VSCodium via the Cline extension. The entire thing appears as a small sidebar where you type your commands, approve code snippets, and manage your context window. I mostly use my local coding AI for simple things, while I leave more complex jobs or refactoring to Claude to save tokens. Because the entire setup relies on Ollama, I can also make my local AI accessible to any device on my home network. That means I'm not stuck seated in front of my desktop -- I can take my laptop Keep in mind that AI is constantly evolving. The local LLM space moves so fast that what is the gold standard today might be superseded by next month, but even the current options make the setup worth the effort. Cloud models are better, but local is inexpensive and private Privacy, cost, and availability add up Even if a cloud model is more "intelligent," you should still consider a local setup. The most pressing issue is privacy and security. When you run a model locally, your code never leaves your machine. If you're handling proprietary company data or sensitive client information, that is very important. Cost also matters. Claude, ChatGPT, Gemini, and all of the other major players charge monthly for access. Those plans start at about $20 per month, but the costs can grow explosively if you're not careful. A viable local agent means you can stop paying monthly subscriptions or worrying about per-token fees. Claude Price $20 Claude is an AI assistant made by Anthropic. It can assist with a wide range of tasks -- writing, coding, analysis, research, and more. Unlike a search engine, Claude reasons through problems conversationally, making it useful as a thinking partner rather than just an information retrieval tool. See at Claude Expand Collapse Once you own the GPU, your only ongoing cost is the electricity. It sounds like a dubious value proposition at first, but consider that Claude Max costs at least $100, which is the minimum subscription someone doing a lot of coding will need. After a year, that is an RTX 5080. After two years, that is an RTX 5090 (if you can find one at MSRP). It is also nice not to depend on anyone else's servers. On more than one occasion, I've gone to use Claude or Codex to write some code, only to find the servers are temporarily down. With your own local setup, your downtime is mostly under your control. Running a coding LLM locally has some tradeoffs VRAM, quantization, and context are the real constraints Running a local coding LLM isn't without its drawbacks, however. The big limit is hardware. If you're running a mid-range consumer GPU, like the 5070 Ti I use, you're going to run into bottlenecks. The primary constraint is VRAM, which dictates both the size of the model you can load and the length of the context window you can maintain. This is where quantization comes in. You'll see terms like Q4, Q5, or Q8. That is basically an indicator of how compressed the model is. While a Q8 (8-bit) model is more precise, a Q4 (4-bit) model allows you to run a larger model on hardware with less VRAM with only a slight decrease to output quality. With the right quantization, I can use some 27B parameter models on my 5070Ti, though larger models are out of reach. What Is an LLM? How AI Holds Conversations LLMs are an incredibly exciting technology, but how do they work? Posts By Katie Rees You should also expect a speed difference between local models and cloud-based models. Local models will struggle to fit large, complex jobs into the context window. Local coding LLMs are finally worth using The local LLM offerings have finally crossed the line from a fun novelty to something I can actually use daily. A big part of making these models useful is integration. You should try setting this up as a supplement to your cloud tools by attaching it to VSCodium or the IDE of your choice. It might not replace the most powerful models for every single task, but having a private, free, and always-available assistant on your own hardware is a great addition to any dev environment.
[4]
I replaced Cursor with a completely local VS Code setup, and I missed less than expected
Ayush Pande is a PC hardware and gaming writer. When he's not working on a new article, you can find him with his head stuck inside a PC or tinkering with a server operating system. Besides computing, his interests include spending hours in long RPGs, yelling at his friends in co-op games, and practicing guitar. Considering that AI tools can tackle mind-numbingly bogus tasks, there's no denying that they're a godsend for productivity. But with almost every cloud-based platform charging regular subscription fees for their clanker-powered services, we're starting to get to a point where you could end up paying hundreds of bucks to avoid hitting rate limits on coding platforms. In fact, the massively restricted token usage on the free versions of Cursor and Antigravity has made me stay away from their offerings, with the latter's timeouts in particular being a major turnoff for projects where I need to query the LLMs multiple times to get something meaningful out of them. Meanwhile, I've started experimenting with MoE models, and with the right extension on VS Code, they've entirely replaced their cloud counterparts for my dev tasks. I ran local LLMs on Intel's cheapest iGPU, and the results were surprisingly decent It ain't no match for a dedicated GPU, but you can run some light LLMs on the N100 Posts By Ayush Pande The llama-vscode extension fuels my coding escapades I've paired it with the MoE models I host on my home lab nodes Back when I first dove into LLMs, I stuck with 9B and 12B models for the most part. And while they're pretty decent for generating OCR text or creating tags for my documents, hyperlinks, and notes, they're far from ideal for coding tasks - and not just vibe-coding, either. The most common use case for LLMs in my home lab is querying them about failed projects, examining terminal logs, and conducting vulnerability scans on my code. The small-sized models that'd fit in consumer-tier GPUs lack the sheer computational prowess for these tasks, especially once you pit them against the reasoning powerhouses you can leverage with Cursor and Antigravity. However, Mixture-of-Experts models flip the whole situation on its head. After all, being able to host bulky 35B models on weak 12GB VRAM GPUs without taking massive performance hits or turning down the quantization rate makes them a force to be reckoned with. And having tested GPT-OSS-20B, Gemma-4-26B-A4B, and Qwen3.6-35B-A3B with my VS Code instance over the past couple of months, I can confirm that they're perfect for dev tasks, with Qwen3.6 holding its own against its cloud-based rivals. As for my coding toolkit, VS Code - the very application that Cursor and Antigravity are forked from - serves as the centerpiece of my setup. I'd initially used Continue during my Ollama days, but after getting a taste of MoE models, I've since shifted to llama-vscode, which pairs incredibly well with the llama-server instances running on my Proxmox server and gaming workstation. Since the llama-vscode extension accepts everything from code files to random documents, the possibility of my LLMs hallucinating is reduced even further. Pair it with the right LLM, and it can generate fully functional code snippets, while its auto-completion features are just as reliable. That said, I've had better luck with Qwen 2.5 Coder (the lower parameter variants) as the auto-completion model, as Qwen3.6 and Gemma 4 would take a couple of seconds to generate code. But for simple RAG-based chat or troubleshooting assistance, these LLMs tend to produce accurate results in under a minute. Your old GPU can still run big LLMs - you just need the right tweaks There's a lot you can do with these models Posts 19 By Ayush Pande It even uses custom agents and tools Including the utilities exposed via MCP servers Another neat aspect of llama-vscode is that it supports agentic workflows, and the default agent is versatile enough to adapt to most coding situations. However, the real fun begins when you start creating agents for dedicated tasks. There's even an agent designed to create other agents (and sub-agents), and it works well as long as I give it a detailed description of what I want in the chat section. Likewise, llama-vscode also lets me fine-tune the different aspects of an agent, and I can choose the exact number of tools at its disposal. Speaking of tools, llama-vscode works with MCP servers, meaning I can use my LLMs to control additional applications, instead of just relying on them for coding tasks. The best part? I don't have to pay any subscription fees for this setup Fun fact: Burst inference tasks don't consume a lot of energy Compared to cloud LLMs that can generate entire code files in a handful of seconds, the slightly longer time taken by my MoE models to answer queries isn't bad by any means. If anything, I'd take this slight performance drawback over running out of rate limits any day, especially since my local LLMs spare me from paying extra subscription fees every month. Deals Save on Workstation Gear: Deals for Home Labs & GPUs Explore discounts on workstations, GPUs, networking gear, and home-lab essentials to cut setup costs and boost self-hosted AI performance. Shop deals on processors, GPUs, memory, storage, and peripherals to build a subscription-free dev environment. Deals Explore Computers & Work Setup Deals If you're wondering about the power consumption of my LLM-hosting workstations, then no, my self-hosted models barely contribute to my energy bills. You see, there's a huge misconception about LLM usage in the tinkering community - while AI models can siphon an ungodly amount of power during the training phase, inference tasks are a different story altogether. When I run LLM-powered tasks, my GPUs spring to life for a few seconds, process the tasks, and go back to an idle state. If anything, running the servers 24/7 drains more watts than the inference tasks, but I already use one workstation for my Proxmox experiments, while the other is my main gaming/video-editing/coding machine. Then there's the privacy advantage of hooking local LLMs up to my home lab documents, early-access codebases, and confidential projects. Really, the few performance tradeoffs are worth the private and subscription-free nature of my local VS Code setup. Visual Studio Code See at Visual Studio Code Expand Collapse
[5]
My local LLM finally understands my house better than Google ever did
Anurag is an experienced journalist and author who's been covering tech for the past 5 years, with a focus on Windows, Android, and Apple. He's written for sites like Android Police, Neowin, Dexerto, and MakeTechEasier. Anurag's always pumped about tech and loves getting his hands on the latest gadgets. When he's not procrastinating, you'll probably find him catching the newest movies in theaters or scrolling through Twitter from his bed. Smart homes have always been built around rules and commands. You told Google Assistant or Alexa to turn on a light, set a timer, play music, or run a routine. If the device names were correct and the integration worked properly, the command succeeded. If not, the assistant either failed or asked you to repeat yourself. That limitation became more obvious as smart homes became more complicated. A modern home can have dozens of connected devices, including lights, motion sensors, thermostats, cameras, smart plugs, TVs, air quality sensors, and energy monitors. Traditional assistants could control these devices, but they usually could not reason about them. But I'm increasingly noticing that local LLMs are starting to change that. Systems built around Home Assistant and local LLMs like Qwen and Llama are beginning to treat the home as a contextual environment. The assistant is no longer responding only to fixed phrases. It can look at sensor data, room states, device activity, automations, and user intent together. I've been experimenting with a local LLM in my home, and while the setup is still far from polished, I'm starting to see how it understands my home better than Google ever did. The problem with traditional smart home setups There is just not enough context The biggest problem with traditional smart home setups is that they were never actually designed to understand the home. They were designed to execute commands. Systems built around Google Assistant work well when the interaction is simple and predictable. You tell the assistant to turn on a light, play music, set a timer, or activate a routine, and it responds by matching your words to a predefined action. A modern smart home can easily have dozens of connected devices spread across multiple rooms, including smart lights, motion sensors, thermostats, and more. Traditional assistants can technically control these devices, but they rarely understand how they relate to one another. Most of these assistants depend heavily on exact commands, fixed routines, and rigid naming structures. If you say the correct phrase, everything works, but the moment you phrase something differently or ask a more contextual question, the system starts struggling. For example, asking Google Assistant to "turn off the bedroom lights" is straightforward because the request maps clearly to a device and an action. But asking something like "make the living room comfortable for watching a movie" is much harder because the assistant now has to interpret intent rather than execute a direct instruction. It needs to understand which lights should dim, whether the TV is already on, what brightness level makes sense, and whether a pre-configured scene even exists. Things get even worse when you move beyond basic commands and start asking questions about the state of the home itself. I'm not saying a local LLM fixes this completely, but it still does a much better job. The modern local AI smart home stack Ollama plus Qwen is the way to go The current local AI smart home stack is surprisingly approachable compared to what it looked like even two or three years ago. At the center of most of these setups is Home Assistant, which essentially acts as the operating system for the house. It connects devices from completely different ecosystems and exposes them through one unified dashboard and automation layer. I've been spending some time experimenting with it myself, and I think Home Assistant is the part that most people underestimate. The LLM is not the actual foundation of the setup; it's Home Assistant. If you don't have the structure, entities, sensors, and automations in place, the model simply does not have enough context to work with. On top of Home Assistant sits Ollama, which is responsible for running local language models directly on consumer hardware. Tools like Ollama dramatically reduced the friction involved in local inference. Pulling models like Qwen or Meta's Llama is now closer to installing software than building a complicated AI environment from scratch. The hardware requirements have also become much more realistic. Smaller quantized models can comfortably run on hardware many people already own, including Mac minis, older gaming PCs, mini PCs, and reasonably modern desktops. I myself run the LLM on my MacBook. I tried running it on a NAS I own, but the performance wasn't very good even with a very small model. The local model then connects back into Home Assistant through integrations and conversation agents, and this is where tool calling becomes important. Without tool calling, the model is basically just a chatbot that can talk about your devices. With tool calling, it can actually interact with the home itself by turning on lights, activating scenes, reading sensor values, adjusting thermostats, and more. A local LLM becomes significantly more useful once it has access to motion sensors, mmWave presence sensors, door sensors, temperature sensors, and more. Once those data points exist, the assistant can answer questions in a way that's connected to the actual state of the house. One of the moments that really made this click for me was asking whether everything in the house was okay before going to bed and getting a response that included the balcony door still being open. The setup still is not perfect It's evolving As impressive as these local AI smart home setups can feel, it is important to set realistic expectations because the experience is still far from seamless. One thing I realized very quickly while experimenting with this setup is that the hardest part is not running the model itself. Tools like Ollama have made local inference surprisingly accessible. The difficult part is making the entire system behave reliably once you start adding real-world complexity. The more devices, sensors, rooms, and automations you expose to the model, the easier it becomes for things to break. Subscribe to the newsletter for practical local AI home guides Get the newsletter for in-depth coverage of local AI smart homes: step-by-step setup tips, troubleshooting strategies, Home Assistant integration examples, model and hardware comparisons (Ollama, Qwen, Llama), and practical experiments to improve reliability. Get Updates By subscribing, you agree to receive newsletter and marketing emails, and accept our Terms of Use and Privacy Policy. You can unsubscribe anytime. Voice interaction is another area where Google still has a major advantage. Google Assistant is a heavily optimized product with years of work behind its microphones and wake word systems. Local setups are improving, but they still require patience and experimentation. There are moments when the system feels genuinely futuristic, especially when it correctly interprets a vague request or summarizes the state of the house in a conversational way. But there are also moments where speech recognition gets a room name wrong, the wake word fails, or the response takes several seconds longer than expected. I also think many people underestimate how important the underlying automation still are. The LLM is not replacing the actual smart home logic. Most setups that work reliably still depend heavily on traditional Home Assistant automations for important tasks. Motion-based lighting, security routines, climate control, and scheduled actions are usually handled the old-fashioned way because they behave consistently every single time. The local LLM mostly sits on top of that system and makes the interaction more conversational. I started self-hosting LLMs and absolutely loved it Who needs OpenAI when your home lab can do the thinking for you? Posts 13 By Raghav Sethi
[6]
I built a private LLM on my home PC using a USB drive -- it only knows what I put on it
Aggy is a writer and editor who has worked for many high-traffic digital publications. He's a technology and gaming fanboy who has been a writer, editor, consultant, and computer animator. When you feed sensitive code or research into a cloud-based AI, you lose control over where that information travels. It might seem like a small trade-off for the convenience of a smart assistant, but you are effectively handing your data over to servers you don't own. It's a risk that most people accept without a second thought. However, you don't have to choose between advanced language models and the safety of your own private files; just switch to a local LLM. I'll never pay for AI again AI doesn't have to cost you a dime -- local models are fast, private, and finally worth switching to. Posts 7 By Yadullah Abidi Privacy risks of cloud processing Your data is never as safe as you think it is when you use the cloud When you send personal questions and code snippets to an external server, you are sending your private text, intellectual property, and sensitive information over the internet to computers owned by someone else. It's easy to forget that because it feels like your own home computer. Running your tasks through these centralized cloud systems creates major risks for your data privacy and control. Outside companies process your private information on machines you don't manage, which is a massive security vulnerability. Since your data can be passed along to outside partners within those networks, your information is only as secure as the weakest link in that chain. Some major cloud providers keep your prompts and results even when you turn off your history tracking. This data can stay stored for about 72 hours to handle system recovery, and it can stick around on external servers for up to three years if it gets flagged for human review or training. While you might try to protect your privacy by opting out of data training, providers usually force a trade-off by disabling key features or breaking app functions. This means real privacy in the cloud usually needs you to give up these features, leaving your data exposed unless you accept a broken experience. Then there is the training that AI has to go through. Your prompts and usage are perfect for training the AI. Since generative AI is baked into how the tech works, once your data enters the system, it is impossible to guarantee it will ever be completely deleted. Running local models with external storage You can carry your entire artificial intelligence setup in your pocket Building your own private, offline AI starts by downloading software that runs open-source language models directly on your computer's processor. I like GPT4All because it has few requirements and gives you a desktop chat window that works without an internet connection. By using compressed model files GGUF to save space and memory, you can run these models on regular computer processors and graphics cards. Since everything runs locally on your own machine, your questions and data never travel to a cloud server, keeping your data private. You can even block the software from using your network entirely, making sure it never tries to connect to the internet or leak data. GPT4All also lets you set up your model with rules and training data. You have to have a storage space ready. I like using a 1TB USB drive. Just go to your Chats, and then you'll see a LocalDocs button on the right side. From there you can add anything you'd like, just make sure you've already got a folder with documents in it. I sometimes use my own documents, but I also keep one in the flash drive I mentioned earlier so I can just go between PCs without having to redo it. I like to run it against the worst prompts I can think of. While it seems dumb at first, you're training it to replicate failure. If you only train it on things it will do well, you're not really fixing it. AIs work better when given constraints, and you can use the corrections you'll need to give as documents that build up those constraints. Make sure to update this over time; you're not really training it. This part takes a while, but you'll start to notice how much better it works as you add more documents. This is limited to how powerful your PC is, because it takes a lot more processing power to read through all of your documents. The number of documents doesn't matter as much as the size and contents. I like to separate them because it makes it easy to find the ones you want to modify or delete, but you can keep them all together if that is easiest. You can still use your USB for other things. I use mine for many other things, but that one folder is just for the AI. I recommend having a spare gigabyte, just in case you want to make a character or a really complicated AI. Secure and capable local performance You don't have to sacrifice speed to stay private One of the most persistent myths about AI is that you lose all performance if you refuse to connect to an external cloud server. We've gone far past the early days of AI needing expensive setups. A USB is fairly inexpensive, and you likely have one lying around. Subscribe to the newsletter for private AI how-tos Deepen your privacy-first AI practice: subscribe to the newsletter for step-by-step local LLM guides, troubleshooting tips, and curated model recommendations to help you run and maintain secure, self-hosted language models. Get Updates By subscribing, you agree to receive newsletter and marketing emails, and accept our Terms of Use and Privacy Policy. You can unsubscribe anytime. My workhorse PC is older, and it still types things out about as fast as I can read them. So it's not lengthy or time-consuming once it starts typing. I'd say the longest time is getting all the information before it starts. Even then, it's not a long wait. Just make sure the model you pick is a GGUF or an AWQ. The file sizes on these are shrunk up to 75%, while keeping 95% to 99% of the original model's accuracy and logic. Even with all it is doing, you don't need the internet at all. Train your own AI Moving your AI setup to a local drive on your home computer is a big change. It needs a bit of technical setup to keep file paths static, and you are responsible for maintaining your own hardware and backups. If you like using the massive scale of top-tier cloud providers for complex tasks, a local model might feel like a different tool altogether. However, for anyone who wants to analyze proprietary code or private documents without letting a corporation harvest their data, this is a better way to go. GPT4All OS Windows, macOS, Linux Developer Nomic AI Price model Free, Open-source A free, open-source local AI platform that runs large language models on your own PC without cloud dependency. See at GitHub See at Nomic AI Expand Collapse
[7]
I ran local LLMs on Intel's cheapest iGPU, and the results were surprisingly decent
Ayush Pande is a PC hardware and gaming writer. When he's not working on a new article, you can find him with his head stuck inside a PC or tinkering with a server operating system. Besides computing, his interests include spending hours in long RPGs, yelling at his friends in co-op games, and practicing guitar. Unlike cloud-based AI models, locally-hosted large language models are infamous for their sky-high system requirements, with the more powerful ones requiring plenty of tensor cores and ample VRAM. Although I'd argue that with MoE offloading, Mixture of Experts models can run even on ancient systems, you'll still need a discrete graphics card to run these bulky LLMs. But what if I ditched the dedicated GPU altogether and tried running LLMs on weak hardware - preferably a device that features an iGPU but doesn't cost an arm and a leg? Considering the Intel N100 is one of the cheapest x86 processors on the market, it seemed like the perfect option for this wacky experiment. And now that I've run a handful of models on my N100 board, I have to admit that it's a pretty decent option for light LLM tasks. Ollama is still the easiest way to start local LLMs, but it's the worst way to keep running them Ollama is great for getting you started... just don't stick around. Posts 12 By Adam Conway I went with an LXC-powered setup for my LLM experiments Passing the iGPU to the container didn't take too much effort Just like every other home lab project, I had a bunch of ways (and devices) to get my N100-powered LLM setup up and running. I initially wanted to opt for an ultralight Arch or DietPi setup, but I ended up pivoting to an LXC running on a Proxmox machine in the end. That's mostly because I didn't want to use snapshots to quickly restore my setup if the inference engine began throwing errors mid-compilation. For reference, the system in question is the LattePanda Mu, an affordable N100 compute module with 8GB of RAM. As for the inference engine, I really didn't want to opt for Ollama, even though it's the most beginner-friendly option for hosting local LLMs. Its heavy performance overhead already makes it a terrible option for such weak hardware, and it just isn't flexible enough to accommodate all the extra parameters I use when serving up my LLMs. So, good ol' llama.cpp was my primary choice, and I had to start by deploying an LXC specifically for this inference engine. Once I'd got the container up and running, it was time to pass the integrated graphics to the LXC. Fortunately, this process was as straightforward as entering /dev/dri/renderD128 in the Device Passthrough section of the LXC's Resources tab and entering 0666 as its Access Mode. After launching the LXC, I entered the following commands to install the necessary drivers alongside the vainfo utility, which confirmed that LXC was capable of harnessing the iGPU. apt update apt install -y intel-media-va-driver vainfo Compiling llama.cpp server required a couple of extra tweaks Having faced some issues when I tried to compile the Vulkan version of llama.cpp on my GTX 1080, I was prepared to reload to an older snapshot a couple of times to get everything working properly. Fortunately, I only had to reload twice, though the error was a bit of a pain to diagnose. Running the apt install git cmake curl glslc glslang-tools libvulkan1 vulkan-tools libvulkan-dev spirv-tools spirv-headers build-essential command pulled all the preliminary packages I needed for llama.cpp. Once they'd finished installing, I ran git clone https://github.com/ggml-org/llama.cpp to grab the inference engine's files and executed cd llama.cpp to switch to its directory. Then, I ran cmake -B build -DGGML_VULKAN=ON to configure the build environment, which surprisingly worked without any issues. However, the cmake -B build cmake --build build -- -j1 command would end up failing around the 18% mark every time I tried to compile llama.cpp. Not only that, the LXC would require me to sign in every time the process failed. After digging into some forums, I eventually realized the RAM (or the lack thereof) was the culprit. My system only had 8GB of memory, and I'd assigned 5GB to the LXC, which would end up starving it for RAM, and the 512MB of swap file didn't help, either. So, I upped the RAM to a whopping 7GB before tossing an additional 3GB swap allocation. And sure enough, the compilation process worked well without any errors, and I removed the swap file after llama.cpp was done installing to avoid throttling my LLM tasks with the slower inference speeds of my SSD. The N100 can handle decently-sized models It's definitely faster than a Raspberry Pi Considering my Raspberry Pi had some trouble running Gemma 3 (4B), I figured I could start my LLM-hosting workloads from there. So, I spun up a llama-server instance via the ./llama-server -m "/root/llama.cpp/models/gemma-3-4b-it-Q4_K_M.gguf" --host 0.0.0.0 --port 8082 command and began prompting it from its web UI. Unlike my Raspberry Pi, the LLM ran at decent speeds, which is far more than I was expecting. Upping the context window to 16K didn't max out its memory, either, which was a good sign. I ran this bulky LLM on an SBC cluster, and it's the most unhinged setup I've ever built My SBC cluster runs bigger models than a single Raspberry Pi, but the trade-offs are brutal Posts 1 By Ayush Pande Qwen3 (4B) also had similar results, and for a non-GPU setup without any dedicated VRAM and just 24 execution units, my LattePanda Mu seemed like a decent option for running tinier LLMs. However, I wanted to see how far I could push it, so I transferred the bulky DeepSeek R1 (specifically, DeepSeek R1-Distill-Qwen-7B) from my main PC to the N100-powered LXC, and ran ./llama-server -m "/root/llama.cpp/models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf" --host 0.0.0.0 --port 8082. To my surprise, it spun up the llama-server instance, and just to see how far I could push it, I copied a long chain of logs from its LXC into the web UI and asked the LLM to read them. While the token inference speeds stayed around the 2.9 t/s margin, the DeepSeek R1-Distill-Qwen-7B was able to generate surprisingly correct results, though I'd end up choking the context window if I began extending the chats by tossing more logs into the prompts. It ain't perfect, but it's a decent secondary LLM server I've got a Gemma4-26B-A4B instance that runs on my GTX 1080 24/7, and I use it for the majority of my inference tasks, while Qwen3.6-35B-A3B serves as my coding companion on my RTX 3080 Ti system. So, I doubt I'd be using the N100 compute module for 7B models at a fraction of the speeds. But if I were to need a secondary LLM for certain inference tasks, or require an embedding model to work in tandem with my bulky clankers, I'll probably end up using my LattePanda Mu. After all, this Proxmox host houses essential LXCs, so tossing an LLM server on it wouldn't be that much of a problem, since I already plan to run it all the time. LattePanda Mu Storage 64GB eMMC, M.2 M-key slot CPU Intel N100 (upgradable to Intel i3-N305) Memory 8GB LPDDR5 (upgradable to 16GB) Operating System Windows 11, Linux Ports 4x USB Type-A, 1x HDMI 2.0, 1x 1GbE RJ45, 1x PCIe 3.0 x4 GPU Intel UHD Graphics $198 at DFRobot Expand Collapse
[8]
I replaced cloud LLMs with local models running off a Proxmox LXC, and the performance trade-off was worth it
Ayush Pande is a PC hardware and gaming writer. When he's not working on a new article, you can find him with his head stuck inside a PC or tinkering with a server operating system. Besides computing, his interests include spending hours in long RPGs, yelling at his friends in co-op games, and practicing guitar. Whether it's Perplexity's reliable and transparent nature or Claude Code's programming capabilities, there's no denying that cloud-based large language models can be a godsend for productivity. Most cloud LLMs ship with beginner-friendly UIs, and the fact that you don't have to put in extra work just to get them up and running makes them pretty convenient for the average user. But I've spent the last couple of months moving away from cloud LLMs for my everyday tasks, partly since I don't want external servers gaining access to my data, and also because I'd rather avoid the extra charges incurred by paid API usage. After migrating through a bunch of setups, I've honed in on a local LLM server running on my old Proxmox workstation, and it works surprisingly well for everything from simple prompting to OCR analysis, voice assistant inference backend, and automation pipelines. Your old GPU can still run big LLMs - you just need the right tweaks There's a lot you can do with these models Posts 18 By Ayush Pande Proxmox LXCs are incredible for hosting llama.cpp With some GPU passthrough wizardry, I can put my old graphics cards to good use Like most LLM-hosting enthusiasts, I started my journey by hosting local models on Ollama, and it served me well for the first couple of weeks. After all, pulling LLMs and deploying them is a piece of cake on Ollama, with a bunch of self-hosted apps supporting this inference engine natively. However, its extra performance overhead and lack of advanced tools became pretty apparent once I started looking into ways to maximize the efficiency on my local models. Once I started wanting to run bulky models (and I'll go over them in a bit), it became clear that Ollama won't work well for my needs, so I switched to llama.cpp instead. Rather, I began using the llama-server functionality to create an LLM server that remains operational 24/7 and hooks up to the rest of my FOSS arsenal thanks to its OpenAI-compatible API. I also went with a Proxmox LXC, as I can still share my old graphics card with Immich, Frigate, and other apps that need its computational prowess when my LLMs are inactive. Thanks to GPU passthrough, my llama-server LXC gets native-level performance, and I've upped its RAM resources all the way to 24GB (out of 32GB) to ensure it can fit MoE models (and I'll go over them in a bit). On my aged system, I simply ran the ls -l /dev/nvidia* command to get the device IDs (195, 235, and 237 for my GPU), pasted the following syntax into the LXC's config file, and installed the graphics card drivers inside the LXC to configure GPU passthrough, before compiling llama.cpp's Vulkan variant. lxc.cgroup2.devices.allow: c 195:* rwm lxc.cgroup2.devices.allow: c 235:* rwm lxc.cgroup2.devices.allow: c 237:* rwm lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file Certain local models have terrific reasoning capabilities And their token generation rates are a lot better than you'd expect During my Ollama days, I was starting to get frustrated by the accuracy (or rather, the lack thereof) of local models. Sure, 4B, 7B, and even 9B models could handle simple inference requests, but anything requiring detailed troubleshooting or complex reasoning would be too much for them to handle - and in some cases, they'd end up spouting complete nonsense. That's when I started looking into bulkier models - LLMs that could crunch 20B+ parameters. But considering that my broke self only has a Pascal card (specifically, a GTX 1080), I couldn't run conventional models without using the --ngl flag to offload entire layers from my GPU and causing the performance to plummet. However, Mixture of Experts models let me offload the less frequently accessed resources onto my CPU and RAM, with the attention weights and other demanding units still remaining on my GPU. As such, I can host models like GPT-OSS-20B and Gemma4-26B-A4B on my VRAM-starved card at respectable token rates, with the latter even managing 15+ t/s with a fairly large context window. As for their reasoning capabilities, I'd say they're solid competitors to cloud models. While I still prefer the Qwen3.6-35B-A3B for hardcore coding tasks, Gemma4 is pretty effective at rewriting code, providing autosuggestions, and aiding my troubleshooting needs. Likewise, it has yet to hallucinate or provide irrelevant information when I use it for RAG analysis in Paperless AI, Open Notebook, and Blinko. While we're on this subject... The llama-server web UI is pretty neat for my inference tasks While Open WebUI is better for a ChatGPT-like layout Besides its terrific performance, llama-server also deploys an interface for accessing LLMs via a web browser - and it's fairly useful for simple prompts and queries. It even supports MCP servers, and as long as I set the context window fairly high (and run the --webui-mcp-proxy flag), I have no issues controlling Obsidian, Home Assistant, TrueNAS, and a bunch of other apps via MCP tools on llama-server's web interface. However, I prefer Open WebUI for the majority of my tasks, and its ChatGPT-like interface makes it fairly accessible. But the real draw of Open WebUI is the sheer number of customization options and integrations that I can pair it (and by extension, my llama-server LLMs) with. There's the open terminal facility, which lets me execute Python code on the browser, and connecting it with SearXNG lets my Gemma4 instance access websites on the Internet instead of relying solely on its trained knowledge base. It even supports ComfyUI, and I often use Open WebUI to trigger the upscaling workflows I've configured on the app. I ditched Copilot on VS Code for this free extension, and it's miles ahead It's completely self-hosted, too! Posts 2 By Ayush Pande You shouldn't underestimate local LLMs I've been building my LLM pipelines for a couple of months, and it's really mind-boggling how much you can accomplish with them. Once you venture past the 20B mark, the reasoning capabilities of self-hosted models skyrocket to the point where they're good enough to replace their cloud counterparts for coding workloads. And with MoE models becoming more popular, it's possible to run competent clankers without dealing with slow token generation rates on an old GPU or throwing thousands of bucks on a new system. llama.cpp See at Official Website Expand Collapse
Share
Copy Link
Tech enthusiasts are replacing expensive cloud-based AI coding tools like Cursor with local LLM configurations powered by Ollama and Qwen models. These setups run on surprisingly modest hardware, including 15W CPUs and consumer GPUs, delivering privacy, cost savings, and offline access. From coding assistants to smart home control, local AI is proving capable enough to challenge cloud alternatives.
A growing number of developers and tech enthusiasts are moving away from subscription-based cloud AI services toward running large language models locally on their own hardware. Using tools like Ollama paired with optimized models such as Qwen and Llama, users are discovering that local LLM implementations can deliver surprisingly capable results even on low-power hardware
1
. This shift addresses mounting concerns about privacy, escalating subscription costs, and dependency on external servers.The appeal centers on three core advantages: complete data privacy since code never leaves the machine, elimination of recurring fees that can reach $100 per month or more for heavy users, and offline availability that isn't subject to server outages
3
. One developer noted that after just two years, the cumulative cost of a Claude Max subscription equals the price of an RTX 5090 GPU, making the initial hardware investment increasingly attractive for long-term use.
Source: XDA-Developers
What makes this transition particularly notable is the surprisingly modest hardware requirements. One experimenter successfully ran Ollama on a Minisforum U850 mini PC equipped with an Intel Core i5-10210U CPU—a 15W processor with just four cores and 16 GB of DDR4-2666 RAM
1
. Using heavily optimized models like qwen3:4b and qwen2.5coder:7b, the setup achieved around 4 tokens per second, sufficient for practical use when multitasking.The key to making local coding LLM work on consumer hardware lies in quantization and model selection. Mixture-of-Experts models have proven particularly effective, enabling users to host bulky 35B parameter models on GPUs with just 12GB VRAM without significant performance degradation
4
. Models like Qwen3.6-35B-A3B have demonstrated performance competitive with cloud alternatives for coding tasks including code completion, refactoring, and troubleshooting.
Source: XDA-Developers
Developers are building complete local AI coding environments using VS Code or VS Codium paired with extensions like llama-vscode and Cline
3
4
. These configurations provide capabilities previously available only through paid platforms like Cursor and Antigravity, which charge regular subscription fees and impose restrictive token limits on free tiers.The llama-vscode extension supports agentic workflows and integrates with MCP servers, allowing local LLM to control external applications beyond just coding tasks
4
. Users report that while cloud models generate code faster, the performance difference isn't substantial enough to justify ongoing subscription costs, especially when local setups eliminate rate limits entirely.One particularly innovative approach involves Pi, a lightweight CLI tool that can create custom extensions on demand through simple text prompts
2
. Unlike tools such as OpenCode that consume significant context length with pre-loaded tools, Pi ships minimal and allows users to build exactly the functionality they need, from Docker runtime control to Proxmox integration.Related Stories

Source: XDA-Developers
Beyond coding applications, local LLM implementations are transforming smart home control through integration with Home Assistant
5
. Unlike traditional voice assistants like Google Assistant or Alexa that rely on rigid commands and fixed routines, local AI can reason about contextual environments by analyzing sensor data, room states, and device relationships simultaneously.This contextual understanding enables more natural interactions. Instead of requiring exact device names and specific phrases, users can make requests like "make the living room comfortable for watching a movie," and the local AI smart home system interprets intent across multiple devices
5
. The setup typically combines Home Assistant as the foundational layer with Ollama running models like Qwen on consumer hardware including Mac minis, older gaming PCs, or modern desktops.The movement toward running large language models locally reflects broader concerns about data exposure and subscription fatigue. When using cloud-based AI coding tools, proprietary code and sensitive client information pass through external company servers—a significant security consideration for professional developers
3
.Cost calculations favor local implementations for frequent users. Cloud platforms typically start at $20 per month for basic access, with costs escalating rapidly for heavy usage. A local VS Code setup requires only the initial GPU investment and electricity costs during inference tasks
4
. For home lab enthusiasts, the energy efficiency proves notable—one user switched from a system drawing 300 watts under load to a low-power mini PC configuration1
.While local setups require managing VRAM constraints and context length limitations, the trade-offs appear acceptable to users prioritizing privacy and cost savings over marginal performance advantages. The rapid evolution of optimized models and tools like Open WebUI suggests the capability gap between local and cloud AI continues narrowing, making self-hosted solutions increasingly viable for practical applications.
Summarized by
Navi
[1]
[4]
[5]
17 Apr 2026•Technology

08 Jun 2026•Technology

02 May 2026•Technology

1
Startups

2
Policy and Regulation

3
Policy and Regulation
