OpenAI Whisper API

Transcribe audio via OpenAI Audio Transcriptions API (cloud Whisper). Fast, accurate speech-to-text.

The OpenAI Whisper API skill enables OpenClaw to transcribe audio files using OpenAI's cloud-hosted Whisper model. At $0.006 per minute ($0.36/hour), it's one of the most affordable commercial speech-to-text APIs available, with accuracy benchmarks reaching 95-98% on clean English audio and strong multilingual support across 50+ languages.

The skill is refreshingly simple: send an audio file via curl to OpenAI's `/v1/audio/transcriptions` endpoint, get back text. It supports multiple audio formats (mp3, mp4, m4a, wav, webm, mpeg, mpga), files up to 25MB, and optional parameters for language hints, timestamps, and output format (json, text, srt, vtt, verbose_json).

Whisper's strength is its robustness: it handles background noise, accents, and multiple speakers reasonably well without any preprocessing. The large-v2 model (which the API uses) was trained on 680,000 hours of multilingual audio, making it one of the most broadly capable transcription models available. For most use cases — meeting transcription, podcast notes, voice memo processing — it just works.

For OpenClaw users, Whisper API transforms voice into actionable text. Record a voice memo, transcribe it, then have your AI summarize, extract action items, or create structured notes. Combined with cron, you can build automated pipelines: monitor a folder for audio files → transcribe → summarize → save to Notion.

The API-based approach means no local GPU is needed, but it does mean your audio is sent to OpenAI's servers. For privacy-sensitive use cases, consider self-hosting Whisper locally (which is free but requires compute).

Best suited for: meeting transcription, podcast and video content processing, voice memo workflows, multilingual transcription needs, and anyone wanting accurate speech-to-text without self-hosting infrastructure.

Tags: transcription, speech-to-text, whisper, openai, audio

Category: AI

Use Cases

  • Meeting transcription: record → transcribe → extract action items
  • Podcast processing: transcribe episodes for show notes and search
  • Voice memo workflow: speak → transcribe → save to notes
  • Video subtitle generation: transcribe audio track → SRT/VTT output
  • Content repurposing: transcribe talks/lectures into blog posts
  • Multilingual transcription: handle audio in 50+ languages
  • Automated pipeline: watch folder → transcribe new audio → notify
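
The "watch folder" use case above can be sketched as a small polling script run from cron. This is a minimal sketch under assumptions, not part of the skill itself: the `WATCH_DIR` path, the `.mp3`-only glob, and the move-to-`done` bookkeeping are all placeholders, and `OPENAI_API_KEY` must be set in the environment:

```shell
#!/bin/sh
# Poll a folder, transcribe any new .mp3, and park the original in done/.
# Run from cron, e.g.:  */5 * * * * /path/to/transcribe-inbox.sh
WATCH_DIR="${WATCH_DIR:-$HOME/audio-inbox}"
DONE_DIR="$WATCH_DIR/done"

transcribe() {  # $1 = audio file; prints plain-text transcript to stdout
  curl -s https://api.openai.com/v1/audio/transcriptions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F file=@"$1" -F model=whisper-1 -F response_format=text
}

mkdir -p "$DONE_DIR"
for f in "$WATCH_DIR"/*.mp3; do
  [ -e "$f" ] || continue            # glob matched nothing
  transcribe "$f" > "${f%.mp3}.txt"  # memo.mp3 -> memo.txt
  mv "$f" "$DONE_DIR/"               # mark as processed
done
```

From there, the `.txt` files can be fed into a summarization or notification step with whatever tooling you already use.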

Tips

  • Split large files into segments: `ffmpeg -i input.mp3 -f segment -segment_time 600 -c copy out_%03d.mp3`
  • Use `response_format: 'verbose_json'` to get segment-level timestamps; add `timestamp_granularities[]=word` if you need word-level timing
  • Set the `language` parameter explicitly when you know the language — improves accuracy significantly
  • For meetings with multiple speakers, pair with speaker diarization (not built into Whisper API)
  • Use SRT/VTT output format for subtitle generation
  • Combine with OpenClaw's summarization for a full pipeline: transcribe → summarize → action items
  • For privacy-sensitive audio, self-host Whisper locally instead of using the API
  • Pre-process noisy audio with ffmpeg noise reduction before transcribing
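
The noise-reduction tip above can be done with ffmpeg before upload. The filter values below are an assumption to tune per recording, not a universal recipe: a high-pass at 100 Hz cuts low-frequency rumble and `afftdn` reduces broadband noise, while 16 kHz mono output shrinks the upload (Whisper processes 16 kHz audio internally, so nothing is lost):

```shell
# Clean up a noisy voice memo before sending it to the API.
# Filter values are a starting point; adjust per recording.
FILTERS="highpass=f=100,afftdn=nf=-25"
if [ -f noisy_memo.m4a ]; then
  ffmpeg -y -i noisy_memo.m4a -af "$FILTERS" -ar 16000 -ac 1 clean_memo.mp3
fi
```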

Known Issues & Gotchas

  • Maximum file size is 25MB — split larger files before uploading
  • Audio is sent to OpenAI's servers — not suitable for highly confidential recordings
  • No real-time/streaming transcription — batch only (send file, get result)
  • Whisper can 'hallucinate' text on silent or very noisy segments — review critical transcriptions
  • The API uses the large-v2 model — you can't select smaller/faster models via API
  • Cost is per minute of audio, not per API call — a 60-minute file costs ~$0.36
  • Language auto-detection works but explicit `language` parameter improves accuracy
  • GPT-4o Transcribe is newer and potentially better but at a higher price point
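
Since billing is per minute of audio, cost is easy to estimate up front. A quick sketch using the $0.006/min rate from this doc (the durations are just examples):

```shell
# Estimate transcription cost from audio duration at $0.006/min.
rate=0.006
cost_for() { awk -v r="$rate" -v m="$1" 'BEGIN { printf "%.2f\n", r * m }'; }

cost_for 60    # one-hour meeting        -> 0.36
cost_for 600   # 10-hour podcast backlog -> 3.60
```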

Alternatives

  • Deepgram
  • AssemblyAI
  • Whisper (self-hosted)
  • Google Speech-to-Text
  • GPT-4o Transcribe (OpenAI)

Community Feedback

Comparative Review of Speech-to-Text APIs (2025): GPT-4o Transcribe is the new contender, but Whisper remains the reliable workhorse at $0.006/min.

— Reddit r/speechtech

What we like most about OpenAI Whisper is its high accuracy and strong multilingual support. It performs well with different accents and noisy audio.

— G2 Reviews

Unbiased test results show Whisper scoring 98% accuracy with low hallucination rates. Real-world speed and cost make it competitive with enterprise alternatives.

— DIY AI

Deepgram is 36% more accurate, up to 5x faster, and has lower TCO than OpenAI Whisper for real-time streaming use cases.

— Deepgram Blog

Configuration Examples

Basic transcription

curl -s https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@meeting.mp3 \
  -F model=whisper-1 \
  -F response_format=text

Transcription with timestamps

curl -s https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@podcast.mp3 \
  -F model=whisper-1 \
  -F response_format=verbose_json \
  -F language=en | jq '.segments[] | {start, end, text}'

Generate SRT subtitles

curl -s https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@video_audio.mp3 \
  -F model=whisper-1 \
  -F response_format=srt > subtitles.srt

Split large files first

# Split into 10-minute segments
ffmpeg -i long_meeting.mp3 -f segment -segment_time 600 -c copy /tmp/seg_%03d.mp3

# Transcribe each segment and concatenate into one transcript
# (lexical glob order matches the %03d segment numbering)
for f in /tmp/seg_*.mp3; do
  curl -s https://api.openai.com/v1/audio/transcriptions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F file=@"$f" -F model=whisper-1 -F response_format=text
done > transcript.txt

Installation

# Requires OPENAI_API_KEY env var

Homepage: https://platform.openai.com/docs/guides/speech-to-text

Source: bundled