Let me start by confessing my dislike for AI note-takers like Otter.ai and Fireflies.ai. They are undeniably useful, but also a privacy nightmare. All it takes is one teammate bringing one of these note-takers into a meeting, and suddenly everyone else is forced to deal with email notifications they never consented to. On top of that, they flood in-call chats with ongoing transcriptions. Reports have even alleged that Otter.ai secretly records private work conversations.
Instead of relying on such tools, I use OpenAI's Whisper locally. It's a general-purpose speech recognition model released in 2022 under an open-source MIT license.
Why do I use Whisper?
It's free, open-source, and can be self-hosted
Whisper was trained on 680,000 hours of multilingual audio and delivers state-of-the-art accuracy across many benchmarks. Unlike cloud services, its models can be downloaded and run locally, which allows you to transcribe audio completely offline. Your recordings never leave your machine, which is a major win for privacy.
This makes it well-suited for sensitive meetings, confidential interviews, or situations where compliance with privacy regulations is important. By contrast, tools like Otter require uploading your audio to their servers. Even if they promise encryption, you are still placing trust in a third party without knowing how or where your data is stored. Whisper removes that risk.
Language support is another strength. Whisper works with 99 different languages, far more than most cloud services. Otter.ai, for example, is focused mainly on English. In addition to transcription, Whisper can also translate speech from other languages into English and automatically detect the language being spoken. For most of us, though, the main benefit is accurate transcription of meetings and interviews.
Whisper uses a Transformer-based neural network and comes in multiple sizes, from tiny with 39 million parameters to large with 1.5 billion. Smaller models are faster and lighter on memory, while the large model delivers the best accuracy but requires about 10GB of VRAM. There are also English-only versions, such as base.en, which perform slightly better on English audio than the multilingual ones. Even the smaller models are capable enough for clear recordings, so you can choose the balance of speed and accuracy that best matches your computer's hardware.
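To make that trade-off concrete, here is a minimal sketch of picking a model size for a given VRAM budget. The parameter counts and approximate VRAM figures come from OpenAI's published model table; the helper itself (`pick_model`) is my own illustration, not part of the Whisper library:

```python
# Approximate specs from OpenAI's Whisper model table:
# parameters in millions, required VRAM in GB.
MODEL_SPECS = {
    "tiny":   {"params_m": 39,   "vram_gb": 1},
    "base":   {"params_m": 74,   "vram_gb": 1},
    "small":  {"params_m": 244,  "vram_gb": 2},
    "medium": {"params_m": 769,  "vram_gb": 5},
    "large":  {"params_m": 1550, "vram_gb": 10},
}

def pick_model(available_vram_gb: float, english_only: bool = False) -> str:
    """Return the largest model name that fits in the given VRAM budget."""
    best = "tiny"  # fall back to the smallest model if nothing else fits
    for name, spec in MODEL_SPECS.items():  # dict preserves smallest-to-largest order
        if spec["vram_gb"] <= available_vram_gb:
            best = name
    # English-only variants (e.g. "base.en") exist for every size except large.
    if english_only and best != "large":
        best += ".en"
    return best

print(pick_model(10))                      # a 10GB GPU can run "large"
print(pick_model(4, english_only=True))    # 4GB fits up to "small.en"
```

On a typical laptop without a dedicated GPU, `base` or `small` is usually the sweet spot.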
Setting up Whisper locally on a PC
It's a lot easier than you'd think
I didn't run into many difficulties getting Whisper running locally on my Windows PC and MacBook, though it does require decent hardware. To get started, you first need to install Whisper along with its dependencies. You can do that by opening a terminal or command prompt and running:
pip install git+https://github.com/openai/whisper.git
Whisper also depends on ffmpeg, so make sure that's installed on your system. On Ubuntu or Debian, you can install it by running:
sudo apt update && sudo apt install ffmpeg
If you're on macOS and have Homebrew, just use brew install ffmpeg. Windows users can rely on Chocolatey with the command choco install ffmpeg. If you don't already have Chocolatey, you may need to install it first by following the instructions on its official website.
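A missing ffmpeg is the most common setup snag, so it's worth verifying the binary is actually on your PATH before running Whisper. This small check uses only the standard library; the `ffmpeg_available` helper is my own, not part of Whisper:

```python
# Pre-flight check: Whisper shells out to ffmpeg to decode audio,
# so the binary must be discoverable on PATH.
import shutil

def ffmpeg_available() -> bool:
    """Return True if an ffmpeg binary is discoverable on PATH."""
    return shutil.which("ffmpeg") is not None

if __name__ == "__main__":
    print("ffmpeg found" if ffmpeg_available() else "ffmpeg missing: install it first")
```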
Since Whisper is a Python library, you'll also want to make sure both Python and pip are installed on your machine. Once that's set up, you're ready to transcribe audio files locally. From the command line, navigate to the folder where your audio file is located and run something like:
whisper --model base --language en --task transcribe your_audio_file.mp3
Just replace your_audio_file.mp3 with the actual path to your file, and Whisper will handle the transcription. If you'd rather use Whisper in a Python script, that works too. Here's a simple example:
import whisper

model = whisper.load_model("base")  # multilingual "base" model, downloaded on first use
result = model.transcribe("your_audio_file.mp3")  # audio is decoded via ffmpeg
print(result["text"])  # full transcript; per-segment details live in result["segments"]
Do keep in mind that Whisper can be resource-intensive, particularly if you're using one of the larger models. They require more memory and processing power, so make sure your system can handle it. Even so, the flexibility of running it locally means your audio never leaves your device, giving you both control and privacy.
Transcribing audio with Whisper
Whisper supports a wide range of audio formats
Once you've installed Whisper, you can begin transcribing recordings directly from the command line. Whisper supports a wide range of audio formats, including WAV, MP3, M4A, and FLAC, as well as video formats like MP4 and MKV. Thanks to FFmpeg integration, you don't need to extract audio manually; Whisper can process these files as they are.
Running the tool is straightforward. You simply point Whisper to the file you want to transcribe, and it will automatically detect the spoken language and begin generating text. As it works, the terminal displays the transcription in segments with timestamps, giving you both structure and context. Once the process finishes, you'll have a complete transcript saved on your machine.
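The Python API exposes those same timestamped segments as a list of dicts under result["segments"], each carrying start, end, and text keys. Here is a self-contained sketch that formats such segments into readable lines; the sample data is made up purely for illustration:

```python
# Format Whisper-style segments (the shape found in result["segments"])
# into "[HH:MM:SS -> HH:MM:SS] text" lines.
def fmt_ts(seconds: float) -> str:
    """Convert a float number of seconds into an HH:MM:SS string."""
    m, s = divmod(int(seconds), 60)
    h, m = divmod(m, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

def format_segments(segments) -> list[str]:
    return [f"[{fmt_ts(seg['start'])} -> {fmt_ts(seg['end'])}] {seg['text'].strip()}"
            for seg in segments]

# Illustrative sample data, not real transcription output.
sample = [{"start": 0.0, "end": 4.2, "text": " Welcome to the meeting."},
          {"start": 4.2, "end": 9.8, "text": " Let's review the agenda."}]
print("\n".join(format_segments(sample)))
```

This is handy when you want meeting notes with rough timestamps rather than one unbroken wall of text.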
If you don't specify a model, Whisper falls back to a general-purpose default that balances speed and accuracy. You can adjust options such as model size, language settings, and output formats depending on your needs.
Self-hosted tools are the way to go
Data is often called the new oil, and I have no intention of handing it over to companies that already control so much of it. Avoiding that completely is difficult, especially if you still rely on Google services or spend time on social media. But there are areas where you can take back control. Instead of Google Drive, you can switch to a self-hosted option like Nextcloud. Instead of Notion, there are privacy-friendly alternatives worth exploring. And if you're willing to put in a little effort, you can even host your own AI models.