Whisper (Local)

Local speech-to-text with the Whisper CLI. No API key needed, runs entirely on-device.

Whisper (Local) is OpenAI's open-source speech-to-text model running entirely on your machine — no API key, no cloud, no per-minute billing. Released as open source in September 2022, it quickly became the gold standard for local transcription. The model was trained on 680,000 hours of multilingual audio and supports 99 languages with automatic language detection.

The local Whisper CLI is a Python-based tool that downloads model weights on first run and processes audio files through PyTorch inference. It comes in five model sizes — tiny (39M), base (74M), small (244M), medium (769M), and large-v3 (1.55B parameters) — letting you trade speed for accuracy based on your hardware. On a modern Mac with Apple Silicon, the medium model transcribes at roughly real-time speed; the large model is slower but approaches commercial-grade accuracy.

For privacy-conscious users and organizations, local Whisper is transformative: your audio never leaves your machine. Medical transcription, legal depositions, confidential meetings — all can be transcribed without data leaving the network. This is the primary reason to choose local over the API, even though the API is faster and cheaper for high-volume use.

The ecosystem has evolved significantly since release. whisper.cpp (by Georgi Gerganov) is a C/C++ port that runs 2-4x faster, especially on Apple Silicon with Metal acceleration. faster-whisper uses CTranslate2 for even better performance. These alternatives use the same model weights but with optimized inference engines.

For OpenClaw users, the local Whisper skill means fully offline transcription capability. Process voice memos, meeting recordings, or podcast audio without any API costs. Pair it with cron for automated transcription pipelines that run entirely on your hardware. Best suited for: privacy-sensitive transcription, offline environments, developers wanting unlimited free transcription, and anyone who prefers keeping audio data on-device.
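The speed/accuracy trade-off across the five model sizes can be sketched as a small helper: given the machine's free RAM, pick the most accurate model expected to fit. This helper is not part of the whisper package, and the per-model RAM figures are rough approximations (the large model's ~10 GB is noted below under Known Issues).

```python
# Hypothetical helper (not part of the whisper package): pick the largest
# Whisper model that should fit in the available memory. RAM figures are
# approximate peak-usage estimates, not exact requirements.
MODEL_RAM_GB = {
    "tiny": 1.0,
    "base": 1.0,
    "small": 2.0,
    "medium": 5.0,
    "large-v3": 10.0,
}

# Ordered from most to least accurate.
PREFERENCE = ["large-v3", "medium", "small", "base", "tiny"]

def pick_model(free_ram_gb: float) -> str:
    """Return the most accurate model expected to fit in free_ram_gb."""
    for name in PREFERENCE:
        if MODEL_RAM_GB[name] <= free_ram_gb:
            return name
    return "tiny"  # fall back to the smallest model

print(pick_model(16.0))  # large-v3 fits comfortably
print(pick_model(6.0))   # medium is the best fit under 10 GB
```

The chosen name can be passed straight to `whisper audio.mp3 --model <name>`.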

Tags: transcription, speech-to-text, whisper, local, offline

Category: AI

Use Cases

  • Privacy-sensitive transcription: medical, legal, confidential meetings
  • Offline transcription in air-gapped or low-connectivity environments
  • Unlimited free transcription for high-volume audio processing
  • Voice memo pipeline: record → transcribe → summarize → notes
  • Podcast processing without API costs
  • Multilingual transcription across 99 languages
  • Automated transcription pipeline via cron for new audio files
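The watched-folder pipeline in the last use case needs one non-obvious step: deciding which files are new. A minimal sketch, assuming the cron job writes each transcript as a `.txt` sibling of its audio file (the folder layout and extension list are assumptions, not part of whisper):

```python
# Find audio files in a watched folder that don't yet have a transcript.
# A cron job would run whisper on each returned path, producing the .txt
# sibling that marks the file as processed.
from pathlib import Path

AUDIO_EXTS = {".mp3", ".wav", ".m4a", ".flac"}

def pending_audio(folder: Path) -> list[Path]:
    """Audio files in `folder` with no sibling .txt transcript yet."""
    return sorted(
        p for p in folder.iterdir()
        if p.suffix.lower() in AUDIO_EXTS
        and not p.with_suffix(".txt").exists()
    )
```

Writing the transcript next to the source file keeps the pipeline idempotent: re-running the job skips everything already done.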

Tips

  • Start with the 'medium' model for the best speed/accuracy tradeoff on Apple Silicon
  • Use whisper.cpp instead of the Python CLI for 2-4x faster inference on macOS
  • Use faster-whisper (CTranslate2) for the best performance on NVIDIA GPUs
  • Add `--language en` to skip auto-detection and improve speed when you know the language
  • Use `--output_format all` to get txt, srt, vtt, tsv, and json simultaneously
  • For long recordings, split the audio into segments with ffmpeg first — transcribing shorter chunks reduces memory pressure
  • Combine with OpenClaw cron for automated transcription of new audio files in a watched folder
  • Pre-process noisy audio: `ffmpeg -i input.mp3 -af 'highpass=f=200,lowpass=f=3000' clean.wav`
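The split-long-recordings tip can be expressed as a command builder: a function that assembles an ffmpeg argv using the segment muxer, to be run with `subprocess.run`. The 10-minute default chunk length and the output filename pattern are assumptions.

```python
# Build an ffmpeg command that cuts `src` into fixed-length chunks using
# the segment muxer. Each chunk is re-encoded to WAV, which whisper
# accepts directly. Run the returned argv with subprocess.run(cmd).
def split_cmd(src: str, chunk_seconds: int = 600) -> list[str]:
    """ffmpeg argv splitting `src` into chunk_seconds-long WAV segments."""
    return [
        "ffmpeg", "-i", src,
        "-f", "segment",                      # segment muxer
        "-segment_time", str(chunk_seconds),  # chunk length in seconds
        "chunk_%03d.wav",                     # chunk_000.wav, chunk_001.wav, ...
    ]
```

Transcribe the chunks in order and concatenate the `.txt` outputs to recover the full transcript.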

Known Issues & Gotchas

  • First run downloads model weights (large-v3 is ~3GB) — needs internet once, then fully offline
  • Python-based — requires Python 3.8+ and PyTorch, which can be heavy on disk
  • CPU-only transcription is 5-10x slower than real-time on the large model — GPU/MPS acceleration recommended
  • The 'tiny' and 'base' models hallucinate more on noisy audio — use 'medium' or 'large' for reliability
  • No streaming/real-time transcription — batch processing only (file in, text out)
  • Whisper can hallucinate repeated phrases on silent segments — especially with the large model
  • PyTorch's MPS backend doesn't fully cover Whisper's operations, so the Python CLI often falls back to CPU on Apple Silicon — whisper.cpp with Metal is the reliably fast option there
  • Memory usage scales with model size: large-v3 needs ~10GB RAM during inference
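A common post-processing mitigation for the repeated-phrase hallucination above is to collapse consecutive segments with identical text. A minimal sketch, assuming segments shaped like whisper's JSON output (the exact-match comparison is a simplification; real pipelines sometimes use fuzzier heuristics):

```python
# Drop likely hallucinated repeats: keep only the first of any run of
# consecutive segments whose text is identical after trimming whitespace.
def drop_repeats(segments: list[dict]) -> list[dict]:
    """Collapse runs of identical consecutive transcript segments."""
    out: list[dict] = []
    for seg in segments:
        if out and out[-1]["text"].strip() == seg["text"].strip():
            continue  # likely a hallucinated repeat over silence
        out.append(seg)
    return out
```

This is safe for normal speech, since genuinely repeated sentences back-to-back are rare compared to hallucinated loops.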

Alternatives

  • OpenAI Whisper API (cloud)
  • whisper.cpp
  • faster-whisper
  • Deepgram Nova-2
  • sherpa-onnx (supports on-device STT as well as TTS)

Community Feedback

My self-hosted app uses local Whisper for transcription. The whole pipeline is self-hosted. It uses a locally-hosted Whisper or ASR model for the transcription, and all the smart features run locally too.

— Reddit r/LocalLLaMA

My experience with whisper.cpp — local no-dependency transcription. On Apple Silicon it's remarkably fast. The large-v3 model gives accuracy that rivals cloud APIs.

— Reddit r/LocalLLaMA

The open-source Whisper model downloaded and run locally from the GitHub repository is safe in the sense that your audio data is not sent anywhere.

— OpenAI Community

Users find OpenAI Whisper's ease of use exceptional, with simple integration and support for various platforms. The local version is free and unlimited.

— G2 Reviews

Configuration Examples

Install and first transcription

# Install via Homebrew (includes Python dependencies)
brew install openai-whisper

# Or via pip
pip install openai-whisper

# Transcribe with medium model (good balance)
whisper audio.mp3 --model medium --language en

# First run downloads the model (~1.5GB for medium)

Batch transcription with multiple outputs

# Generate all output formats
whisper meeting.wav --model large-v3 --output_format all --output_dir ./transcripts/

# Produces: meeting.txt, meeting.srt, meeting.vtt, meeting.tsv, meeting.json
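The JSON output above carries the full transcript plus timestamped segments, which is the most useful format for downstream tooling. A sketch that turns those segments into simple `[MM:SS] text` lines (the sample dict is illustrative; a real pipeline would `json.load` the `meeting.json` file):

```python
# Convert whisper's JSON result (a "segments" list with start/end times
# in seconds and a "text" field) into timestamped transcript lines.
def to_timestamped_lines(result: dict) -> list[str]:
    """Render each segment as '[MM:SS] text'."""
    lines = []
    for seg in result["segments"]:
        m, s = divmod(int(seg["start"]), 60)
        lines.append(f"[{m:02d}:{s:02d}] {seg['text'].strip()}")
    return lines

# Illustrative sample mirroring the shape of whisper's JSON output.
sample = {"segments": [
    {"start": 0.0, "end": 4.2, "text": " Welcome, everyone."},
    {"start": 65.5, "end": 70.1, "text": " First agenda item."},
]}
print("\n".join(to_timestamped_lines(sample)))
# [00:00] Welcome, everyone.
# [01:05] First agenda item.
```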

whisper.cpp alternative (faster on Mac)

# Install whisper.cpp
brew install whisper-cpp

# Download a model (the helper script name varies by whisper.cpp version;
# ggml models are also published on Hugging Face)
whisper-cpp-download-ggml-model large-v3

# Transcribe (Metal accelerated on Apple Silicon) — point -m at wherever
# the model was saved; recent builds name the binary whisper-cli
whisper-cpp -m ~/.whisper/ggml-large-v3.bin -f audio.wav -otxt -osrt

Installation

brew install openai-whisper

Homepage: https://openai.com/research/whisper

Source: bundled