Whisper (Local)

Local speech-to-text with the Whisper CLI. No API key needed, runs entirely on-device.

Whisper (Local) is OpenAI's open-source speech-to-text model running entirely on your machine — no API key, no cloud, no per-minute billing. Released as open source in September 2022, it quickly became the gold standard for local transcription. The model was trained on 680,000 hours of multilingual audio and supports 99 languages with automatic language detection.

The local Whisper CLI is a Python-based tool that downloads model weights on first run and processes audio files through PyTorch inference. It comes in five model sizes — tiny (39M), base (74M), small (244M), medium (769M), and large-v3 (1.55B parameters) — letting you trade speed for accuracy based on your hardware. On a modern Mac with Apple Silicon, the medium model transcribes at roughly real-time speed; the large model is slower but approaches commercial-grade accuracy.

For privacy-conscious users and organizations, local Whisper is transformative: your audio never leaves your machine. Medical transcription, legal depositions, confidential meetings — all can be transcribed without data leaving the network. This is the primary reason to choose local over the API, even though the API is faster and cheaper for high-volume use.

The ecosystem has evolved significantly since release. whisper.cpp (by Georgi Gerganov) is a C/C++ port that runs 2-4x faster, especially on Apple Silicon with Metal acceleration. faster-whisper uses CTranslate2 for even better performance. These alternatives use the same model weights but with optimized inference engines.

For OpenClaw users, the local Whisper skill means fully offline transcription capability. Process voice memos, meeting recordings, or podcast audio without any API costs. Pair it with cron for automated transcription pipelines that run entirely on your hardware. Best suited for: privacy-sensitive transcription, offline environments, developers wanting unlimited free transcription, and anyone who prefers keeping audio data on-device.
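The speed/accuracy trade-off across the five model sizes can be sketched as a small helper: given the machine's free RAM, pick the most accurate model expected to fit. This helper is not part of the whisper package, and the per-model RAM figures are rough approximations (the large model's ~10 GB is noted below under Known Issues).

```python
# Hypothetical helper (not part of the whisper package): pick the largest
# Whisper model that should fit in the available memory. RAM figures are
# approximate peak-usage estimates, not exact requirements.
MODEL_RAM_GB = {
    "tiny": 1.0,
    "base": 1.0,
    "small": 2.0,
    "medium": 5.0,
    "large-v3": 10.0,
}

# Ordered from most to least accurate.
PREFERENCE = ["large-v3", "medium", "small", "base", "tiny"]

def pick_model(free_ram_gb: float) -> str:
    """Return the most accurate model expected to fit in free_ram_gb."""
    for name in PREFERENCE:
        if MODEL_RAM_GB[name] <= free_ram_gb:
            return name
    return "tiny"  # fall back to the smallest model

print(pick_model(16.0))  # large-v3 fits comfortably
print(pick_model(6.0))   # medium is the best fit under 10 GB
```

The chosen name can be passed straight to `whisper audio.mp3 --model <name>`.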

Tags: transcription, speech-to-text, whisper, local, offline

Category: AI

Use Cases

  • Privacy-sensitive transcription: medical, legal, confidential meetings
  • Offline transcription in air-gapped or low-connectivity environments
  • Unlimited free transcription for high-volume audio processing
  • Voice memo pipeline: record → transcribe → summarize → notes
  • Podcast processing without API costs
  • Multilingual transcription across 99 languages
  • Automated transcription pipeline via cron for new audio files
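The watched-folder pipeline in the last use case needs one non-obvious step: deciding which files are new. A minimal sketch, assuming the cron job writes each transcript as a `.txt` sibling of its audio file (the folder layout and extension list are assumptions, not part of whisper):

```python
# Find audio files in a watched folder that don't yet have a transcript.
# A cron job would run whisper on each returned path, producing the .txt
# sibling that marks the file as processed.
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac"}

def pending_audio(folder: Path) -> list[Path]:
    """Audio files in `folder` with no sibling .txt transcript yet."""
    return sorted(
        p for p in folder.iterdir()
        if p.suffix.lower() in AUDIO_EXTS
        and not p.with_suffix(".txt").exists()
    )
```

Writing the transcript next to the source file keeps the pipeline idempotent: re-running the job skips everything already done.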

Tips

  • Start with the 'medium' model for the best speed/accuracy tradeoff on Apple Silicon
  • Use whisper.cpp instead of the Python CLI for 2-4x faster inference on macOS
  • Use faster-whisper (CTranslate2) for the best performance on NVIDIA GPUs
  • Add `--language en` to skip auto-detection and improve speed when you know the language
  • Use `--output_format all` to get txt, srt, vtt, tsv, and json simultaneously
  • For long recordings, split the audio into segments with ffmpeg first — transcribing shorter chunks reduces memory pressure
  • Combine with OpenClaw cron for automated transcription of new audio files in a watched folder
  • Pre-process noisy audio: `ffmpeg -i input.mp3 -af 'highpass=f=200,lowpass=f=3000' clean.wav`
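The split-long-recordings tip can be expressed as a command builder: a function that assembles an ffmpeg argv using the segment muxer, to be run with `subprocess.run`. The 10-minute default chunk length and the output filename pattern are assumptions.

```python
# Build an ffmpeg command that cuts `src` into fixed-length chunks using
# the segment muxer. Each chunk is re-encoded to WAV, which whisper
# accepts directly. Run the returned argv with subprocess.run(cmd).
def split_cmd(src: str, chunk_seconds: int = 600) -> list[str]:
    """ffmpeg argv splitting `src` into chunk_seconds-long WAV segments."""
    return [
        "ffmpeg", "-i", src,
        "-f", "segment",                      # segment muxer
        "-segment_time", str(chunk_seconds),  # chunk length in seconds
        "chunk_%03d.wav",                     # chunk_000.wav, chunk_001.wav, ...
    ]
```

Transcribe the chunks in order and concatenate the `.txt` outputs to recover the full transcript.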

Known Issues & Gotchas

  • First run downloads model weights (large-v3 is ~3GB) — needs internet once, then fully offline
  • Python-based — requires Python 3.8+ and PyTorch, which can be heavy on disk
  • CPU-only transcription is 5-10x slower than real-time on the large model — GPU/MPS acceleration recommended
  • The 'tiny' and 'base' models hallucinate more on noisy audio — use 'medium' or 'large' for reliability
  • No streaming/real-time transcription — batch processing only (file in, text out)
  • Whisper can hallucinate repeated phrases on silent segments — especially with the large model
  • PyTorch's MPS backend doesn't fully cover Whisper's operations, so the Python CLI often falls back to CPU on Apple Silicon — whisper.cpp with Metal is the reliably fast option there
  • Memory usage scales with model size: large-v3 needs ~10GB RAM during inference
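A common post-processing mitigation for the repeated-phrase hallucination above is to collapse consecutive segments with identical text. A minimal sketch, assuming segments shaped like whisper's JSON output (the exact-match comparison is a simplification; real pipelines sometimes use fuzzier heuristics):

```python
# Drop likely hallucinated repeats: keep only the first of any run of
# consecutive segments whose text is identical after trimming whitespace.
def drop_repeats(segments: list[dict]) -> list[dict]:
    """Collapse runs of identical consecutive transcript segments."""
    out: list[dict] = []
    for seg in segments:
        if out and out[-1]["text"].strip() == seg["text"].strip():
            continue  # likely a hallucinated repeat over silence
        out.append(seg)
    return out
```

This is safe for normal speech, since genuinely repeated sentences back-to-back are rare compared to hallucinated loops.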

Alternatives

  • OpenAI Whisper API (cloud)
  • whisper.cpp
  • faster-whisper
  • Deepgram Nova-2
  • sherpa-onnx (supports on-device STT as well as TTS)

Community Feedback

My self-hosted app uses local Whisper for transcription. The whole pipeline is self-hosted. It uses a locally-hosted Whisper or ASR model for the transcription, and all the smart features run locally too.

— Reddit r/LocalLLaMA

My experience with whisper.cpp — local no-dependency transcription. On Apple Silicon it's remarkably fast. The large-v3 model gives accuracy that rivals cloud APIs.

— Reddit r/LocalLLaMA

The open-source Whisper model downloaded and run locally from the GitHub repository is safe in the sense that your audio data is not sent anywhere.

— OpenAI Community

Users find OpenAI Whisper's ease of use exceptional, with simple integration and support for various platforms. The local version is free and unlimited.

— G2 Reviews

Configuration Examples

Install and first transcription

# Install via Homebrew (includes Python dependencies)
brew install openai-whisper

# Or via pip
pip install openai-whisper

# Transcribe with medium model (good balance)
whisper audio.mp3 --model medium --language en

# First run downloads the model (~1.5GB for medium)

Batch transcription with multiple outputs

# Generate all output formats
whisper meeting.wav --model large-v3 --output_format all --output_dir ./transcripts/

# Produces: meeting.txt, meeting.srt, meeting.vtt, meeting.tsv, meeting.json
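The JSON output above carries the full transcript plus timestamped segments, which is the most useful format for downstream tooling. A sketch that turns those segments into simple `[MM:SS] text` lines (the sample dict is illustrative; a real pipeline would `json.load` the `meeting.json` file):

```python
# Convert whisper's JSON result (a "segments" list with start/end times
# in seconds and a "text" field) into timestamped transcript lines.
def to_timestamped_lines(result: dict) -> list[str]:
    """Render each segment as '[MM:SS] text'."""
    lines = []
    for seg in result["segments"]:
        m, s = divmod(int(seg["start"]), 60)
        lines.append(f"[{m:02d}:{s:02d}] {seg['text'].strip()}")
    return lines

# Illustrative sample mirroring the shape of whisper's JSON output.
sample = {"segments": [
    {"start": 0.0, "end": 4.2, "text": " Welcome, everyone."},
    {"start": 65.5, "end": 70.1, "text": " First agenda item."},
]}
print("\n".join(to_timestamped_lines(sample)))
# [00:00] Welcome, everyone.
# [01:05] First agenda item.
```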

whisper.cpp alternative (faster on Mac)

# Install whisper.cpp
brew install whisper-cpp

# Download a model (the helper script name varies by whisper.cpp version;
# ggml models are also published on Hugging Face)
whisper-cpp-download-ggml-model large-v3

# Transcribe (Metal accelerated on Apple Silicon) — point -m at wherever
# the model was saved; recent builds name the binary whisper-cli
whisper-cpp -m ~/.whisper/ggml-large-v3.bin -f audio.wav -otxt -osrt

Installation

brew install openai-whisper

Homepage: https://openai.com/research/whisper

Source: bundled