OpenAI Whisper API

Transcribe audio via OpenAI Audio Transcriptions API (cloud Whisper). Fast, accurate speech-to-text.

The OpenAI Whisper API skill enables OpenClaw to transcribe audio files using OpenAI's cloud-hosted Whisper model. At $0.006 per minute ($0.36/hour), it's one of the most affordable commercial speech-to-text APIs available, with accuracy benchmarks reaching 95-98% on clean English audio and strong multilingual support across 50+ languages.

The skill is refreshingly simple: send an audio file via curl to OpenAI's `/v1/audio/transcriptions` endpoint, get back text. It supports multiple audio formats (mp3, mp4, m4a, wav, webm, mpeg, mpga), files up to 25MB, and optional parameters for language hints, timestamps, and output format (json, text, srt, vtt, verbose_json).

Whisper's strength is its robustness: it handles background noise, accents, and multiple speakers reasonably well without any preprocessing. The large-v2 model (which the API uses) was trained on 680,000 hours of multilingual audio, making it one of the most broadly capable transcription models available. For most use cases — meeting transcription, podcast notes, voice memo processing — it just works.

For OpenClaw users, Whisper API transforms voice into actionable text. Record a voice memo, transcribe it, then have your AI summarize, extract action items, or create structured notes. Combined with cron, you can build automated pipelines: monitor a folder for audio files → transcribe → summarize → save to Notion.

The API-based approach means no local GPU is needed, but it does mean your audio is sent to OpenAI's servers. For privacy-sensitive use cases, consider self-hosting Whisper locally (which is free but requires compute).

Best suited for: meeting transcription, podcast and video content processing, voice memo workflows, multilingual transcription needs, and anyone wanting accurate speech-to-text without self-hosting infrastructure.

Tags: transcription, speech-to-text, whisper, openai, audio

Category: AI

Use Cases

  • Meeting transcription: record → transcribe → extract action items
  • Podcast processing: transcribe episodes for show notes and search
  • Voice memo workflow: speak → transcribe → save to notes
  • Video subtitle generation: transcribe audio track → SRT/VTT output
  • Content repurposing: transcribe talks/lectures into blog posts
  • Multilingual transcription: handle audio in 50+ languages
  • Automated pipeline: watch folder → transcribe new audio → notify
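
The "watch folder" use case above can be sketched as a small polling script run from cron. This is a minimal sketch under assumptions, not part of the skill itself: the `WATCH_DIR` path, the `.mp3`-only glob, and the move-to-`done` bookkeeping are all placeholders, and `OPENAI_API_KEY` must be set in the environment:

```shell
#!/bin/sh
# Poll a folder, transcribe any new .mp3, and park the original in done/.
# Run from cron, e.g.:  */5 * * * * /path/to/transcribe-inbox.sh
WATCH_DIR="${WATCH_DIR:-$HOME/audio-inbox}"
DONE_DIR="$WATCH_DIR/done"

transcribe() {  # $1 = audio file; prints plain-text transcript to stdout
  curl -s https://api.openai.com/v1/audio/transcriptions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F file=@"$1" -F model=whisper-1 -F response_format=text
}

mkdir -p "$DONE_DIR"
for f in "$WATCH_DIR"/*.mp3; do
  [ -e "$f" ] || continue            # glob matched nothing
  transcribe "$f" > "${f%.mp3}.txt"  # memo.mp3 -> memo.txt
  mv "$f" "$DONE_DIR/"               # mark as processed
done
```

From there, the `.txt` files can be fed into a summarization or notification step with whatever tooling you already use.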

Tips

  • Split large files into segments: `ffmpeg -i input.mp3 -f segment -segment_time 600 -c copy out_%03d.mp3`
  • Use `response_format: 'verbose_json'` to get segment-level timestamps; add `timestamp_granularities[]=word` if you need word-level timing
  • Set the `language` parameter explicitly when you know the language — improves accuracy significantly
  • For meetings with multiple speakers, pair with speaker diarization (not built into Whisper API)
  • Use SRT/VTT output format for subtitle generation
  • Combine with OpenClaw's summarization for a full pipeline: transcribe → summarize → action items
  • For privacy-sensitive audio, self-host Whisper locally instead of using the API
  • Pre-process noisy audio with ffmpeg noise reduction before transcribing
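
The noise-reduction tip above can be done with ffmpeg before upload. The filter values below are an assumption to tune per recording, not a universal recipe: a high-pass at 100 Hz cuts low-frequency rumble and `afftdn` reduces broadband noise, while 16 kHz mono output shrinks the upload (Whisper processes 16 kHz audio internally, so nothing is lost):

```shell
# Clean up a noisy voice memo before sending it to the API.
# Filter values are a starting point; adjust per recording.
FILTERS="highpass=f=100,afftdn=nf=-25"
if [ -f noisy_memo.m4a ]; then
  ffmpeg -y -i noisy_memo.m4a -af "$FILTERS" -ar 16000 -ac 1 clean_memo.mp3
fi
```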

Known Issues & Gotchas

  • Maximum file size is 25MB — split larger files before uploading
  • Audio is sent to OpenAI's servers — not suitable for highly confidential recordings
  • No real-time/streaming transcription — batch only (send file, get result)
  • Whisper can 'hallucinate' text on silent or very noisy segments — review critical transcriptions
  • The API uses the large-v2 model — you can't select smaller/faster models via API
  • Cost is per minute of audio, not per API call — a 60-minute file costs ~$0.36
  • Language auto-detection works but explicit `language` parameter improves accuracy
  • GPT-4o Transcribe is newer and potentially better but at a higher price point
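
Since billing is per minute of audio, cost is easy to estimate up front. A quick sketch using the $0.006/min rate from this doc (the durations are just examples):

```shell
# Estimate transcription cost from audio duration at $0.006/min.
rate=0.006
cost_for() { awk -v r="$rate" -v m="$1" 'BEGIN { printf "%.2f\n", r * m }'; }

cost_for 60    # one-hour meeting        -> 0.36
cost_for 600   # 10-hour podcast backlog -> 3.60
```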

Alternatives

  • Deepgram
  • AssemblyAI
  • Whisper (self-hosted)
  • Google Speech-to-Text
  • GPT-4o Transcribe (OpenAI)

Community Feedback

Comparative Review of Speech-to-Text APIs (2025): GPT-4o Transcribe is the new contender, but Whisper remains the reliable workhorse at $0.006/min.

— Reddit r/speechtech

What we like most about OpenAI Whisper is its high accuracy and strong multilingual support. It performs well with different accents and noisy audio.

— G2 Reviews

Unbiased test results show Whisper scoring 98% accuracy with low hallucination rates. Real-world speed and cost make it competitive with enterprise alternatives.

— DIY AI

Deepgram is 36% more accurate, up to 5x faster, and has lower TCO than OpenAI Whisper for real-time streaming use cases.

— Deepgram Blog

Configuration Examples

Basic transcription

curl -s https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@meeting.mp3 \
  -F model=whisper-1 \
  -F response_format=text

Transcription with timestamps

curl -s https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@podcast.mp3 \
  -F model=whisper-1 \
  -F response_format=verbose_json \
  -F language=en | jq '.segments[] | {start, end, text}'

Generate SRT subtitles

curl -s https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@video_audio.mp3 \
  -F model=whisper-1 \
  -F response_format=srt > subtitles.srt

Split large files first

# Split into 10-minute segments
ffmpeg -i long_meeting.mp3 -f segment -segment_time 600 -c copy /tmp/seg_%03d.mp3

# Transcribe each segment and concatenate into one transcript
# (lexical glob order matches the %03d segment numbering)
for f in /tmp/seg_*.mp3; do
  curl -s https://api.openai.com/v1/audio/transcriptions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -F file=@"$f" -F model=whisper-1 -F response_format=text
done > transcript.txt

Installation

# Requires OPENAI_API_KEY env var

Homepage: https://platform.openai.com/docs/guides/speech-to-text

Source: bundled