OpenAI Whisper API
Transcribe audio via the OpenAI Audio Transcriptions API (hosted Whisper). Fast, accurate speech-to-text.
Tags: transcription, speech-to-text, whisper, openai, audio
Category: AI
Use Cases
- Meeting transcription: record → transcribe → extract action items
- Podcast processing: transcribe episodes for show notes and search
- Voice memo workflow: speak → transcribe → save to notes
- Video subtitle generation: transcribe audio track → SRT/VTT output
- Content repurposing: transcribe talks/lectures into blog posts
- Multilingual transcription: handle audio in 50+ languages
- Automated pipeline: watch folder → transcribe new audio → notify
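The automated-pipeline use case can be sketched as a simple polling loop. This is a hedged sketch, not a production watcher: the directory name, poll interval, and file handling are illustrative assumptions; it assumes `curl` and the standard transcription endpoint, and a notify step (e.g. via OpenClaw) would follow the transcription call.

```shell
# Minimal watch-folder sketch: poll a directory, transcribe new .mp3 files,
# and rename them aside once done so they are not re-sent.
WATCH_DIR=./inbox
while true; do
  for f in "$WATCH_DIR"/*.mp3; do
    [ -e "$f" ] || continue   # glob matched nothing
    curl -s https://api.openai.com/v1/audio/transcriptions \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -F file=@"$f" -F model=whisper-1 -F response_format=text \
      > "${f%.mp3}.txt"
    mv -- "$f" "$f.done"
  done
  sleep 30
done
```

For event-driven (rather than polled) pickup, `inotifywait` on Linux is a common substitute for the `sleep` loop.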
Tips
- Split large files into segments: `ffmpeg -i input.mp3 -f segment -segment_time 600 -c copy out_%03d.mp3`
- Use `response_format: 'verbose_json'` to get segment-level timestamps; add `timestamp_granularities[]=word` for word-level timestamps
- Set the `language` parameter explicitly when you know the language — improves accuracy significantly
- For meetings with multiple speakers, pair with speaker diarization (not built into Whisper API)
- Use SRT/VTT output format for subtitle generation
- Combine with OpenClaw's summarization for a full pipeline: transcribe → summarize → action items
- For privacy-sensitive audio, self-host Whisper locally instead of using the API
- Pre-process noisy audio with ffmpeg noise reduction before transcribing
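The noise-reduction tip above can be done with ffmpeg's `afftdn` (FFT denoise) filter. The noise-floor value below is a starting point, not a tuned setting; the 16 kHz resample is optional (Whisper resamples internally) but shrinks the upload.

```shell
# Denoise and downsample before uploading; adjust nf (noise floor, dB) to taste.
ffmpeg -i noisy_input.mp3 -af "afftdn=nf=-25" -ar 16000 clean_input.mp3
```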
Known Issues & Gotchas
- Maximum file size is 25MB — split larger files before uploading
- Audio is sent to OpenAI's servers — not suitable for highly confidential recordings
- No real-time/streaming transcription — batch only (send file, get result)
- Whisper can 'hallucinate' text on silent or very noisy segments — review critical transcriptions
- The API uses the large-v2 model — you can't select smaller/faster models via API
- Cost is per minute of audio, not per API call — a 60-minute file costs ~$0.36
- Language auto-detection works, but an explicit `language` parameter improves accuracy
- GPT-4o Transcribe is newer and potentially better but at a higher price point
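The per-minute pricing above translates into a quick cost estimate: at $0.006/min, 60 minutes is $0.36. A small sketch (the file name in the `ffprobe` comment is illustrative):

```shell
# Estimate Whisper API cost from audio duration at $0.006 per minute.
# Get the duration in seconds with e.g.:
#   ffprobe -v error -show_entries format=duration -of csv=p=0 meeting.mp3
duration_seconds=3600   # e.g. a 60-minute meeting
cost=$(awk -v s="$duration_seconds" 'BEGIN { printf "%.2f", s / 60 * 0.006 }')
echo "Estimated cost: \$${cost}"   # prints "Estimated cost: $0.36"
```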
Alternatives
- Deepgram
- AssemblyAI
- Whisper (self-hosted)
- Google Speech-to-Text
- GPT-4o Transcribe (OpenAI)
Community Feedback
Comparative Review of Speech-to-Text APIs (2025): GPT-4o Transcribe is the new contender, but Whisper remains the reliable workhorse at $0.006/min.
— Reddit r/speechtech
What we like most about OpenAI Whisper is its high accuracy and strong multilingual support. It performs well with different accents and noisy audio.
— G2 Reviews
Unbiased test results show Whisper scoring 98% accuracy with low hallucination rates. Real-world speed and cost make it competitive with enterprise alternatives.
— DIY AI
Deepgram is 36% more accurate, up to 5x faster, and has lower TCO than OpenAI Whisper for real-time streaming use cases.
— Deepgram Blog
Configuration Examples
Basic transcription
curl -s https://api.openai.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-F file=@meeting.mp3 \
-F model=whisper-1 \
-F response_format=text
Transcription with timestamps
curl -s https://api.openai.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-F file=@podcast.mp3 \
-F model=whisper-1 \
-F response_format=verbose_json \
-F language=en | jq '.segments[] | {start, end, text}'
Generate SRT subtitles
curl -s https://api.openai.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-F file=@video_audio.mp3 \
-F model=whisper-1 \
-F response_format=srt > subtitles.srt
Split large files first
# Split into 10-minute segments
ffmpeg -i long_meeting.mp3 -f segment -segment_time 600 -c copy /tmp/seg_%03d.mp3
# Transcribe each segment
for f in /tmp/seg_*.mp3; do
curl -s https://api.openai.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-F file=@"$f" -F model=whisper-1 -F response_format=text
done
Installation
# Requires OPENAI_API_KEY env var
Homepage: https://platform.openai.com/docs/guides/speech-to-text
Source: bundled