3 Sources
[1]
I ran local LLMs on Intel's cheapest iGPU, and the results were surprisingly decent
Ayush Pande is a PC hardware and gaming writer. When he's not working on a new article, you can find him with his head stuck inside a PC or tinkering with a server operating system. Besides computing, his interests include spending hours in long RPGs, yelling at his friends in co-op games, and practicing guitar. Unlike cloud-based AI models, locally-hosted large language models are infamous for their sky-high system requirements, with the more powerful ones requiring plenty of tensor cores and ample VRAM. Although I'd argue that with MoE offloading, Mixture of Experts models can run even on ancient systems, you'll still need a discrete graphics card to run these bulky LLMs. But what if I ditched the dedicated GPU altogether and tried running LLMs on weak hardware - preferably a device that features an iGPU but doesn't cost an arm and a leg? Considering the Intel N100 is one of the cheapest x86 processors on the market, it seemed like the perfect option for this wacky experiment. And now that I've run a handful of models on my N100 board, I have to admit that it's a pretty decent option for light LLM tasks. Ollama is still the easiest way to start local LLMs, but it's the worst way to keep running them Ollama is great for getting you started... just don't stick around. Posts 12 By Adam Conway I went with an LXC-powered setup for my LLM experiments Passing the iGPU to the container didn't take too much effort Just like every other home lab project, I had a bunch of ways (and devices) to get my N100-powered LLM setup up and running. I initially wanted to opt for an ultralight Arch or DietPi setup, but I ended up pivoting to an LXC running on a Proxmox machine in the end. That's mostly because I didn't want to use snapshots to quickly restore my setup if the inference engine began throwing errors mid-compilation. For reference, the system in question is the LattePanda Mu, an affordable N100 compute module with 8GB of RAM. As for the inference engine, I really didn't want to opt for Ollama, even though it's the most beginner-friendly option for hosting local LLMs. Its heavy performance overhead already makes it a terrible option for such weak hardware, and it just isn't flexible enough to accommodate all the extra parameters I use when serving up my LLMs. So, good ol' llama.cpp was my primary choice, and I had to start by deploying an LXC specifically for this inference engine. Once I'd got the container up and running, it was time to pass the integrated graphics to the LXC. Fortunately, this process was as straightforward as entering /dev/dri/renderD128 in the Device Passthrough section of the LXC's Resources tab and entering 0666 as its Access Mode. After launching the LXC, I entered the following commands to install the necessary drivers alongside the vainfo utility, which confirmed that LXC was capable of harnessing the iGPU. apt update apt install -y intel-media-va-driver vainfo Compiling llama.cpp server required a couple of extra tweaks Having faced some issues when I tried to compile the Vulkan version of llama.cpp on my GTX 1080, I was prepared to reload to an older snapshot a couple of times to get everything working properly. Fortunately, I only had to reload twice, though the error was a bit of a pain to diagnose. Running the apt install git cmake curl glslc glslang-tools libvulkan1 vulkan-tools libvulkan-dev spirv-tools spirv-headers build-essential command pulled all the preliminary packages I needed for llama.cpp. Once they'd finished installing, I ran git clone https://github.com/ggml-org/llama.cpp to grab the inference engine's files and executed cd llama.cpp to switch to its directory. Then, I ran cmake -B build -DGGML_VULKAN=ON to configure the build environment, which surprisingly worked without any issues. However, the cmake -B build cmake --build build -- -j1 command would end up failing around the 18% mark every time I tried to compile llama.cpp. Not only that, the LXC would require me to sign in every time the process failed. After digging into some forums, I eventually realized the RAM (or the lack thereof) was the culprit. My system only had 8GB of memory, and I'd assigned 5GB to the LXC, which would end up starving it for RAM, and the 512MB of swap file didn't help, either. So, I upped the RAM to a whopping 7GB before tossing an additional 3GB swap allocation. And sure enough, the compilation process worked well without any errors, and I removed the swap file after llama.cpp was done installing to avoid throttling my LLM tasks with the slower inference speeds of my SSD. The N100 can handle decently-sized models It's definitely faster than a Raspberry Pi Considering my Raspberry Pi had some trouble running Gemma 3 (4B), I figured I could start my LLM-hosting workloads from there. So, I spun up a llama-server instance via the ./llama-server -m "/root/llama.cpp/models/gemma-3-4b-it-Q4_K_M.gguf" --host 0.0.0.0 --port 8082 command and began prompting it from its web UI. Unlike my Raspberry Pi, the LLM ran at decent speeds, which is far more than I was expecting. Upping the context window to 16K didn't max out its memory, either, which was a good sign. I ran this bulky LLM on an SBC cluster, and it's the most unhinged setup I've ever built My SBC cluster runs bigger models than a single Raspberry Pi, but the trade-offs are brutal Posts 1 By Ayush Pande Qwen3 (4B) also had similar results, and for a non-GPU setup without any dedicated VRAM and just 24 execution units, my LattePanda Mu seemed like a decent option for running tinier LLMs. However, I wanted to see how far I could push it, so I transferred the bulky DeepSeek R1 (specifically, DeepSeek R1-Distill-Qwen-7B) from my main PC to the N100-powered LXC, and ran ./llama-server -m "/root/llama.cpp/models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf" --host 0.0.0.0 --port 8082. To my surprise, it spun up the llama-server instance, and just to see how far I could push it, I copied a long chain of logs from its LXC into the web UI and asked the LLM to read them. While the token inference speeds stayed around the 2.9 t/s margin, the DeepSeek R1-Distill-Qwen-7B was able to generate surprisingly correct results, though I'd end up choking the context window if I began extending the chats by tossing more logs into the prompts. It ain't perfect, but it's a decent secondary LLM server I've got a Gemma4-26B-A4B instance that runs on my GTX 1080 24/7, and I use it for the majority of my inference tasks, while Qwen3.6-35B-A3B serves as my coding companion on my RTX 3080 Ti system. So, I doubt I'd be using the N100 compute module for 7B models at a fraction of the speeds. But if I were to need a secondary LLM for certain inference tasks, or require an embedding model to work in tandem with my bulky clankers, I'll probably end up using my LattePanda Mu. After all, this Proxmox host houses essential LXCs, so tossing an LLM server on it wouldn't be that much of a problem, since I already plan to run it all the time. LattePanda Mu Storage 64GB eMMC, M.2 M-key slot CPU Intel N100 (upgradable to Intel i3-N305) Memory 8GB LPDDR5 (upgradable to 16GB) Operating System Windows 11, Linux Ports 4x USB Type-A, 1x HDMI 2.0, 1x 1GbE RJ45, 1x PCIe 3.0 x4 GPU Intel UHD Graphics $198 at DFRobot Expand Collapse
[2]
I replaced cloud LLMs with local models running off a Proxmox LXC, and the performance trade-off was worth it
Ayush Pande is a PC hardware and gaming writer. When he's not working on a new article, you can find him with his head stuck inside a PC or tinkering with a server operating system. Besides computing, his interests include spending hours in long RPGs, yelling at his friends in co-op games, and practicing guitar. Whether it's Perplexity's reliable and transparent nature or Claude Code's programming capabilities, there's no denying that cloud-based large language models can be a godsend for productivity. Most cloud LLMs ship with beginner-friendly UIs, and the fact that you don't have to put in extra work just to get them up and running makes them pretty convenient for the average user. But I've spent the last couple of months moving away from cloud LLMs for my everyday tasks, partly since I don't want external servers gaining access to my data, and also because I'd rather avoid the extra charges incurred by paid API usage. After migrating through a bunch of setups, I've honed in on a local LLM server running on my old Proxmox workstation, and it works surprisingly well for everything from simple prompting to OCR analysis, voice assistant inference backend, and automation pipelines. Your old GPU can still run big LLMs - you just need the right tweaks There's a lot you can do with these models Posts 18 By Ayush Pande Proxmox LXCs are incredible for hosting llama.cpp With some GPU passthrough wizardry, I can put my old graphics cards to good use Like most LLM-hosting enthusiasts, I started my journey by hosting local models on Ollama, and it served me well for the first couple of weeks. After all, pulling LLMs and deploying them is a piece of cake on Ollama, with a bunch of self-hosted apps supporting this inference engine natively. However, its extra performance overhead and lack of advanced tools became pretty apparent once I started looking into ways to maximize the efficiency on my local models. Once I started wanting to run bulky models (and I'll go over them in a bit), it became clear that Ollama won't work well for my needs, so I switched to llama.cpp instead. Rather, I began using the llama-server functionality to create an LLM server that remains operational 24/7 and hooks up to the rest of my FOSS arsenal thanks to its OpenAI-compatible API. I also went with a Proxmox LXC, as I can still share my old graphics card with Immich, Frigate, and other apps that need its computational prowess when my LLMs are inactive. Thanks to GPU passthrough, my llama-server LXC gets native-level performance, and I've upped its RAM resources all the way to 24GB (out of 32GB) to ensure it can fit MoE models (and I'll go over them in a bit). On my aged system, I simply ran the ls -l /dev/nvidia* command to get the device IDs (195, 235, and 237 for my GPU), pasted the following syntax into the LXC's config file, and installed the graphics card drivers inside the LXC to configure GPU passthrough, before compiling llama.cpp's Vulkan variant. lxc.cgroup2.devices.allow: c 195:* rwm lxc.cgroup2.devices.allow: c 235:* rwm lxc.cgroup2.devices.allow: c 237:* rwm lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file Certain local models have terrific reasoning capabilities And their token generation rates are a lot better than you'd expect During my Ollama days, I was starting to get frustrated by the accuracy (or rather, the lack thereof) of local models. Sure, 4B, 7B, and even 9B models could handle simple inference requests, but anything requiring detailed troubleshooting or complex reasoning would be too much for them to handle - and in some cases, they'd end up spouting complete nonsense. That's when I started looking into bulkier models - LLMs that could crunch 20B+ parameters. But considering that my broke self only has a Pascal card (specifically, a GTX 1080), I couldn't run conventional models without using the --ngl flag to offload entire layers from my GPU and causing the performance to plummet. However, Mixture of Experts models let me offload the less frequently accessed resources onto my CPU and RAM, with the attention weights and other demanding units still remaining on my GPU. As such, I can host models like GPT-OSS-20B and Gemma4-26B-A4B on my VRAM-starved card at respectable token rates, with the latter even managing 15+ t/s with a fairly large context window. As for their reasoning capabilities, I'd say they're solid competitors to cloud models. While I still prefer the Qwen3.6-35B-A3B for hardcore coding tasks, Gemma4 is pretty effective at rewriting code, providing autosuggestions, and aiding my troubleshooting needs. Likewise, it has yet to hallucinate or provide irrelevant information when I use it for RAG analysis in Paperless AI, Open Notebook, and Blinko. While we're on this subject... The llama-server web UI is pretty neat for my inference tasks While Open WebUI is better for a ChatGPT-like layout Besides its terrific performance, llama-server also deploys an interface for accessing LLMs via a web browser - and it's fairly useful for simple prompts and queries. It even supports MCP servers, and as long as I set the context window fairly high (and run the --webui-mcp-proxy flag), I have no issues controlling Obsidian, Home Assistant, TrueNAS, and a bunch of other apps via MCP tools on llama-server's web interface. However, I prefer Open WebUI for the majority of my tasks, and its ChatGPT-like interface makes it fairly accessible. But the real draw of Open WebUI is the sheer number of customization options and integrations that I can pair it (and by extension, my llama-server LLMs) with. There's the open terminal facility, which lets me execute Python code on the browser, and connecting it with SearXNG lets my Gemma4 instance access websites on the Internet instead of relying solely on its trained knowledge base. It even supports ComfyUI, and I often use Open WebUI to trigger the upscaling workflows I've configured on the app. I ditched Copilot on VS Code for this free extension, and it's miles ahead It's completely self-hosted, too! Posts 2 By Ayush Pande You shouldn't underestimate local LLMs I've been building my LLM pipelines for a couple of months, and it's really mind-boggling how much you can accomplish with them. Once you venture past the 20B mark, the reasoning capabilities of self-hosted models skyrocket to the point where they're good enough to replace their cloud counterparts for coding workloads. And with MoE models becoming more popular, it's possible to run competent clankers without dealing with slow token generation rates on an old GPU or throwing thousands of bucks on a new system. llama.cpp See at Official Website Expand Collapse
[3]
I built a private LLM on my home PC using a USB drive -- it only knows what I put on it
Aggy is a writer and editor who has worked for many high-traffic digital publications. He's a technology and gaming fanboy who has been a writer, editor, consultant, and computer animator. When you feed sensitive code or research into a cloud-based AI, you lose control over where that information travels. It might seem like a small trade-off for the convenience of a smart assistant, but you are effectively handing your data over to servers you don't own. It's a risk that most people accept without a second thought. However, you don't have to choose between advanced language models and the safety of your own private files; just switch to a local LLM. I'll never pay for AI again AI doesn't have to cost you a dime -- local models are fast, private, and finally worth switching to. Posts 7 By Yadullah Abidi Privacy risks of cloud processing Your data is never as safe as you think it is when you use the cloud When you send personal questions and code snippets to an external server, you are sending your private text, intellectual property, and sensitive information over the internet to computers owned by someone else. It's easy to forget that because it feels like your own home computer. Running your tasks through these centralized cloud systems creates major risks for your data privacy and control. Outside companies process your private information on machines you don't manage, which is a massive security vulnerability. Since your data can be passed along to outside partners within those networks, your information is only as secure as the weakest link in that chain. Some major cloud providers keep your prompts and results even when you turn off your history tracking. This data can stay stored for about 72 hours to handle system recovery, and it can stick around on external servers for up to three years if it gets flagged for human review or training. While you might try to protect your privacy by opting out of data training, providers usually force a trade-off by disabling key features or breaking app functions. This means real privacy in the cloud usually needs you to give up these features, leaving your data exposed unless you accept a broken experience. Then there is the training that AI has to go through. Your prompts and usage are perfect for training the AI. Since generative AI is baked into how the tech works, once your data enters the system, it is impossible to guarantee it will ever be completely deleted. Running local models with external storage You can carry your entire artificial intelligence setup in your pocket Building your own private, offline AI starts by downloading software that runs open-source language models directly on your computer's processor. I like GPT4All because it has few requirements and gives you a desktop chat window that works without an internet connection. By using compressed model files GGUF to save space and memory, you can run these models on regular computer processors and graphics cards. Since everything runs locally on your own machine, your questions and data never travel to a cloud server, keeping your data private. You can even block the software from using your network entirely, making sure it never tries to connect to the internet or leak data. GPT4All also lets you set up your model with rules and training data. You have to have a storage space ready. I like using a 1TB USB drive. Just go to your Chats, and then you'll see a LocalDocs button on the right side. From there you can add anything you'd like, just make sure you've already got a folder with documents in it. I sometimes use my own documents, but I also keep one in the flash drive I mentioned earlier so I can just go between PCs without having to redo it. I like to run it against the worst prompts I can think of. While it seems dumb at first, you're training it to replicate failure. If you only train it on things it will do well, you're not really fixing it. AIs work better when given constraints, and you can use the corrections you'll need to give as documents that build up those constraints. Make sure to update this over time; you're not really training it. This part takes a while, but you'll start to notice how much better it works as you add more documents. This is limited to how powerful your PC is, because it takes a lot more processing power to read through all of your documents. The number of documents doesn't matter as much as the size and contents. I like to separate them because it makes it easy to find the ones you want to modify or delete, but you can keep them all together if that is easiest. You can still use your USB for other things. I use mine for many other things, but that one folder is just for the AI. I recommend having a spare gigabyte, just in case you want to make a character or a really complicated AI. Secure and capable local performance You don't have to sacrifice speed to stay private One of the most persistent myths about AI is that you lose all performance if you refuse to connect to an external cloud server. We've gone far past the early days of AI needing expensive setups. A USB is fairly inexpensive, and you likely have one lying around. Subscribe to the newsletter for private AI how-tos Deepen your privacy-first AI practice: subscribe to the newsletter for step-by-step local LLM guides, troubleshooting tips, and curated model recommendations to help you run and maintain secure, self-hosted language models. Get Updates By subscribing, you agree to receive newsletter and marketing emails, and accept our Terms of Use and Privacy Policy. You can unsubscribe anytime. My workhorse PC is older, and it still types things out about as fast as I can read them. So it's not lengthy or time-consuming once it starts typing. I'd say the longest time is getting all the information before it starts. Even then, it's not a long wait. Just make sure the model you pick is a GGUF or an AWQ. The file sizes on these are shrunk up to 75%, while keeping 95% to 99% of the original model's accuracy and logic. Even with all it is doing, you don't need the internet at all. Train your own AI Moving your AI setup to a local drive on your home computer is a big change. It needs a bit of technical setup to keep file paths static, and you are responsible for maintaining your own hardware and backups. If you like using the massive scale of top-tier cloud providers for complex tasks, a local model might feel like a different tool altogether. However, for anyone who wants to analyze proprietary code or private documents without letting a corporation harvest their data, this is a better way to go. GPT4All OS Windows, macOS, Linux Developer Nomic AI Price model Free, Open-source A free, open-source local AI platform that runs large language models on your own PC without cloud dependency. See at GitHub See at Nomic AI Expand Collapse
Share
Copy Link
A growing movement shows that large language models don't require expensive GPUs or cloud services. Experiments with Intel's N100 processor, Proxmox LXC containers, and USB-based setups reveal that local LLMs can deliver decent performance on budget hardware while maintaining complete data privacy. These developments challenge the assumption that powerful AI requires costly infrastructure or cloud subscriptions.

The barrier to entry for running large language models locally has dropped significantly, as recent experiments demonstrate that even Intel's cheapest processor can handle AI workloads. Using an Intel N100 processor with integrated graphics, one enthusiast successfully ran multiple LLMs on hardware costing a fraction of typical AI setups
1
. The LattePanda Mu compute module, featuring the N100 with just 8GB of RAM, proved capable of handling models like Gemma 3 (4B) at respectable speeds, outperforming even Raspberry Pi configurations.The setup relied on llama.cpp rather than Ollama, specifically to avoid performance overhead on such constrained hardware. Compiling llama.cpp with Vulkan support required careful memory management, with the process initially failing around the 18% mark due to RAM limitations. Allocating 7GB of the system's 8GB memory to the LXC container, plus an additional 3GB swap file during compilation, resolved the issue
1
. This demonstrates that running LLMs locally on budget hardware demands technical knowledge but remains achievable for those willing to optimize their configurations.For users with slightly more resources, Proxmox LXC containers combined with GPU passthrough offer a compelling alternative to cloud-based AI services. One user migrated entirely from cloud LLMs to a local setup running on aging hardware, specifically a GTX 1080 graphics card
2
. The configuration allows the same GPU to serve multiple applications, including Immich and Frigate, when LLM tasks aren't active.Mixture of Experts models proved particularly effective for this setup, allowing larger parameter counts without overwhelming limited VRAM. Models like GPT-OSS-20B and Gemma4-26B-A4B achieved token generation rates exceeding 15 tokens per second with substantial context windows
2
. The llama-server functionality provides an OpenAI-compatible API, enabling integration with various open-source applications while maintaining 24/7 availability. This approach demonstrates that older hardware, when properly configured, can deliver performance rivaling cloud services for many use cases.Concerns about data privacy have motivated users to build completely offline AI systems using tools like GPT4All and external USB storage. When users send code snippets or sensitive research to cloud providers, that information travels to servers beyond their control, potentially remaining stored for up to three years if flagged for review
3
. Major cloud providers retain prompts and results for approximately 72 hours even when history tracking is disabled, creating security vulnerabilities throughout the data chain.GPT4All enables users to run open-source models entirely on local processors using compressed GGUF files, eliminating internet connectivity requirements. One implementation uses a 1TB USB drive containing custom training documents and constraints, creating a portable AI system that works across multiple PCs
3
. The LocalDocs feature allows users to feed specific documents into the model, training it on proprietary information without exposing data to external servers. This approach addresses the fundamental trade-off between convenience and control that characterizes cloud-based AI.Related Stories
The computational demands of local LLMs have decreased as open-source models and optimization techniques improve. While cloud services offer seamless interfaces, the performance gap has narrowed considerably for common tasks like code rewriting, autosuggestions, and troubleshooting
2
. Users report that models with 20B+ parameters deliver reasoning capabilities competitive with commercial cloud offerings when running on properly configured local hardware.The experiments with integrated graphics on the Intel N100 required passing the iGPU to containers through device passthrough, a straightforward process involving entering /dev/dri/renderD128 in the LXC's Resources tab
1
. For those prioritizing data security and avoiding subscription costs, these performance trade-offs prove worthwhile. The ability to maintain complete control over sensitive information while achieving adequate inference speeds represents a significant shift in how individuals and small teams can deploy AI capabilities without relying on external infrastructure or incurring ongoing API charges.Summarized by
Navi
[1]
[2]
17 Apr 2026•Technology

02 May 2026•Technology

29 Jan 2025•Technology
1
Business and Economy

2
Technology

3
Policy and Regulation
