PrivateGPT features scripts to ingest data files, split them into chunks, create "embeddings" (numerical representations of the meaning of the text), and store those embeddings in a local Chroma vector store. When you ask a question, the app searches for relevant documents and sends just those to the LLM to generate an answer.
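This isn't PrivateGPT's actual code, but the general store-and-query pattern looks roughly like the following sketch using the chromadb Python package; the collection name, sample chunks, and question are made up for illustration.

```python
import chromadb

# Store document chunks in a local, persistent Chroma database.
# Chroma computes embeddings with its default embedding model.
client = chromadb.PersistentClient(path="my_vectorstore")
collection = client.get_or_create_collection("my_docs")
collection.add(
    documents=["First chunk of a document...", "Second chunk of a document..."],
    ids=["doc1-chunk1", "doc1-chunk2"],
)

# At question time, retrieve the most relevant chunks...
results = collection.query(
    query_texts=["What does the document say about pricing?"],
    n_results=2,
)

# ...and pass only those chunks to the LLM as context for its answer.
context = "\n".join(results["documents"][0])
```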
If you're familiar with Python and how to set up Python projects, you can clone the full PrivateGPT repository and run it locally. If you're less familiar with Python, you may want to check out a simplified version of the project that author Iván Martínez created for a conference workshop, which is considerably easier to set up.
That version's README file includes detailed instructions that don't assume Python sysadmin expertise. The repo comes with a folder full of Penpot documentation, but you can delete those files and add your own.
PrivateGPT includes the features you'd likely most want in a "chat with your own documents" app in the terminal, but the documentation warns it's not meant for production. And once you run it, you may see why: Even the small model option ran very slowly on my home PC. But just remember, the early days of home internet were painfully slow, too. I expect these types of individual projects will speed up.
There are more ways to run LLMs locally than just these five, ranging from other desktop applications to writing scripts from scratch, all with varying degrees of setup complexity.
Jan is a relatively new open-source project that aims to "democratize AI access" with "open, local-first products." The app is simple to download and install, and the interface is a nice balance between customizability and ease of use. It's an enjoyable app to use.
Choosing models to use in Jan is pretty painless. Within the application's hub, shown below, there are descriptions of more than 30 models available for one-click download, including some with vision, which I didn't test. You can also import others in the GGUF format. Models listed in Jan's hub show up with "Not enough RAM" tags if your system is unlikely to be able to run them.
Jan's chat interface includes a right-side panel that lets you set system instructions for the LLM and tweak parameters. On my work Mac, a model I had downloaded was tagged as "slow on your device" when I started it, and I was advised to close some applications to try to free up RAM. Whether or not you're new to LLMs, it's easy to forget to free up as much RAM as possible when launching genAI applications, so that is a useful alert. (Chrome with a lot of tabs open can be a RAM hog; closing it solved the issue.)
Once I freed up the RAM, streamed responses within the app were pretty snappy.
Jan also lets you use OpenAI models from the cloud in addition to running LLMs locally. And, you can set up Jan to work with remote or local API servers.
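Jan can also expose an OpenAI-compatible local API server of its own; once it's switched on in the app, you can call it with ordinary HTTP requests. The port and model id below are assumptions for illustration, so check the address Jan shows in its local API server settings and use the id of a model you've downloaded.

```python
import requests

# POST a chat request to Jan's OpenAI-compatible local server.
# The address and model id are placeholders -- take both from Jan's settings.
resp = requests.post(
    "http://localhost:1337/v1/chat/completions",
    json={
        "model": "your-downloaded-model-id",
        "messages": [
            {"role": "user", "content": "Summarize retrieval-augmented generation in one sentence."}
        ],
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```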
Jan's project documentation was still a bit sparse when I tested the app in March 2024, although the good news is that much of the application is fairly intuitive to use -- but not all of it. One thing I missed in Jan was the ability to upload files and chat with a document. After searching on GitHub, I discovered you can indeed do this by turning on "Retrieval" in the model settings to upload files. However, I couldn't upload either a .csv or a .txt file. Neither was supported, although that wasn't obvious until I tried it. A PDF worked, though. It's also notable, although not Jan's fault, that the small models I was testing did not do a great job of retrieval-augmented generation.
A key advantage of Jan over LM Studio (see below) is that Jan is open source under the AGPLv3 license, a copyleft license that allows commercial use as long as any derivative works are also open source. LM Studio is free for personal use, but the site says you should fill out the LM Studio @ Work request form to use it on the job. Jan is available for Windows, macOS, and Linux.
If all you want is a super easy way to chat with a local model from your current web workflow, the developer version of Opera is a possibility. It doesn't offer features like chat with your files. You also need to be logged into an Opera account to use it, even for local models, so I'm not confident it's as private as most other options reviewed here. However, it's a convenient way to test and use local LLMs in your workflow.
Local LLMs are available on the developer stream of Opera One, which you can download from its website.
To start, open the Aria Chat side panel -- that's the top button at the bottom left of your screen. That defaults to using OpenAI's models and Google Search.
To opt for a local model instead, click Start as if you were using the default, and then look for the option near the top of the screen to "Choose local AI model."
Select that, then click "Go to settings" to browse or search for models, such as Llama 3 in 8B or 70B.
For those with very limited hardware, Opera suggests one of the smaller models.
After your model downloads, it's a bit unclear how to go back and start a chat. Click the menu at the top left of your screen and you'll see a button for "New chat." Make sure to once again click "Choose local AI model" and then select the model you downloaded; otherwise, you'll be chatting with the default OpenAI model.
What's most attractive about chatting in Opera is using a local model within the now-familiar copilot-in-your-side-panel generative AI workflow. Opera is based in Norway and says it's GDPR compliant for all users. Still, I'd think twice about using this setup for anything highly sensitive as long as logging into a cloud account is required.
Nvidia's Chat with RTX demo application is designed to answer questions about a directory of documents. As of its February launch, Chat with RTX can use either a Mistral or Llama 2 LLM running locally. You'll need a Windows PC with an Nvidia GeForce RTX 30 Series or higher GPU with at least 8GB of video RAM to run the application. You'll also want a robust internet connection. The download was a hefty 35GB zipped.
Chat with RTX presents a simple interface that's extremely easy to use. Clicking the icon opens a Windows terminal, which runs a script that launches the application in your default browser.
Select an LLM and the path to your files, wait for the app to create embeddings for your files -- you can follow that progress in the terminal window -- and then ask your question. The response includes links to documents used by the LLM to generate its answer, which is helpful if you want to make sure the information is accurate, since the model may answer based on other information it knows and not only your specific documents. The application currently supports .txt, .pdf, and .doc files as well as YouTube videos via a URL.
Note that Chat with RTX doesn't look for documents in subdirectories, so you'll need to put all your files in a single folder. If you want to add more documents to the folder, click the refresh button to the right of the data set to re-generate embeddings.
Mozilla's llamafile, unveiled in late November 2023, allows developers to turn large language models into single executable files. It also comes with software that can download LLM files in the GGUF format, import them, and run them in a local in-browser chat interface.
To run llamafile, the project's README suggests downloading the current server version with a single download command (copy the exact URL from the README, since it changes with each release).
Then, download a model of your choice. I've read good things about Zephyr, so I found and downloaded a version from Hugging Face.
Enter your query at the bottom of the screen, and you'll get a basic chatbot interface.
You can test out running a single executable with one of the sample files linked in the project's GitHub repository.
On the day that llamafile was released, Simon Willison, author of the LLM project profiled in this article, said in a blog post, "I think it's now the single best way to get started running large language models (think your own local copy of ChatGPT) on your own computer."
While llamafile was extremely easy to get up and running on my Mac, I ran into some issues on Windows. For now, like Ollama, llamafile may not be the top choice for plug-and-play Windows software.
A PrivateGPT spinoff, LocalGPT, includes more options for models and has detailed instructions as well as three how-to videos, including a 17-minute detailed code walk-through. Opinions may differ on whether this installation and setup is "easy," but it does look promising. As with PrivateGPT, though, documentation warns that running LocalGPT on a CPU alone will be slow.
Another desktop app I tried, LM Studio, has an easy-to-use interface for running chats, but you're more on your own with picking models. If you know what model you want to download and run, this could be a good choice. If you're just coming from using ChatGPT and you have limited knowledge of how best to balance precision with size, all the choices may be a bit overwhelming at first. Hugging Face Hub is the main source of model downloads inside LM Studio, and it has a lot of models.
Unlike the other LLM options, which all downloaded the models I chose on the first try, I had problems downloading one of the models within LM Studio. Another didn't run well, which was my fault for maxing out my Mac's hardware, but I didn't immediately see a suggested minimum amount of non-GPU RAM for each model choice. If you don't mind being patient about selecting and downloading models, though, LM Studio has a nice, clean interface once you're running the chat. As of this writing, the UI didn't have a built-in option for running the LLM over your own data.
LM Studio does have a built-in server that can be used "as a drop-in replacement for the OpenAI API," as the documentation notes, so code that was written to use an OpenAI model via the API will run instead on the local model you've selected.
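Here's a minimal sketch of that pattern using the openai Python package (1.x). The http://localhost:1234/v1 address is LM Studio's commonly documented default, but confirm the address in the app's server tab; the API key can be any placeholder string, and the server responds with whichever model you've loaded in the app.

```python
from openai import OpenAI

# Point the OpenAI client at LM Studio's local server instead of OpenAI's cloud.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio serves the model currently loaded in the app
    messages=[{"role": "user", "content": "Write a haiku about running LLMs locally."}],
)
print(response.choices[0].message.content)
```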
Like h2oGPT, LM Studio throws a warning on Windows that it's an unverified app. However, LM Studio's code is not available on GitHub and the app isn't from a long-established organization, so not everyone will be comfortable installing it.
In addition to using a pre-built model download interface through apps like h2oGPT, you can also download and run some models directly from Hugging Face, a platform and community for artificial intelligence that includes many LLMs. (Not all models there include download options.) Mark Needham, developer advocate at StarTree, has a nice explainer on how to do this, including a YouTube video. He also provides some related code in a GitHub repo, including sentiment analysis with a local LLM.
Hugging Face provides some documentation of its own about how to install and run available models locally.
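For example, here's a minimal sketch using the transformers library's pipeline API to run the Zephyr model mentioned earlier. A 7B model needs roughly 16GB of RAM (or a capable GPU) and will run slowly on CPU; substitute a smaller model ID if your hardware is limited.

```python
from transformers import pipeline

# Download (on first run) and run HuggingFaceH4/zephyr-7b-beta locally.
pipe = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

output = pipe(
    "Explain what an embedding is in two sentences.",
    max_new_tokens=100,
    do_sample=False,
)
print(output[0]["generated_text"])
```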
Another popular option is to download and use LLMs locally in LangChain, a framework for creating end-to-end generative AI applications. That does require getting up to speed with writing code using the LangChain ecosystem. If you know LangChain basics, you may want to check out the documentation on Hugging Face Local Pipelines, Titan Takeoff (requires Docker as well as Python), and OpenLLM for running LangChain with local models. OpenLLM is another robust, standalone platform designed for deploying LLM-based applications into production.
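As a sketch of the Hugging Face Local Pipelines route -- LangChain's package layout changes frequently, so the import path and options below may differ in your version:

```python
from langchain_community.llms import HuggingFacePipeline

# Load a Hugging Face model locally and wrap it as a LangChain LLM.
llm = HuggingFacePipeline.from_model_id(
    model_id="HuggingFaceH4/zephyr-7b-beta",  # any locally runnable model id works here
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 100},
)

print(llm.invoke("What is retrieval-augmented generation?"))
```

From there, the same llm object can be dropped into a LangChain chain or retrieval pipeline like any other model.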