Sherpa ONNX TTS (Local)

Local text-to-speech via sherpa-onnx. Fully offline, no cloud required. Supports multiple voice models.

Sherpa ONNX TTS is a fully offline, local text-to-speech engine built on the sherpa-onnx framework by k2-fsa (the Next-gen Kaldi team). It runs entirely on-device with no internet connection, no API keys, and no per-character billing, making it a zero-cost, privacy-first alternative to cloud TTS services like ElevenLabs or Google Cloud TTS.

The sherpa-onnx project is a comprehensive speech processing toolkit covering speech-to-text, text-to-speech, speaker diarization, voice activity detection, and more. The TTS component specifically wraps ONNX Runtime-based neural voice models (VITS, VITS2, MeloTTS, Kokoro, Matcha-TTS, and others) that run on CPU with no GPU requirement. Models are compact enough for embedded systems, Raspberry Pi, and mobile devices.

The OpenClaw skill wraps the sherpa-onnx TTS binary in a simple interface: point it at a runtime directory and a model directory, pass text in, get WAV audio out. Setup involves downloading the runtime binary for your platform and choosing a voice model from the extensive model zoo, which offers hundreds of voices across 30+ languages, from English and Chinese to Arabic and Swahili.

Voice quality varies significantly by model. The newer Kokoro and MeloTTS models approach near-natural quality for English, while some older VITS models sound more robotic. The sweet spot is the Kokoro English model: it's fast, sounds good, and runs efficiently on Apple Silicon. For other languages, quality depends on the available model training data.

For OpenClaw users, Sherpa ONNX TTS is the fallback when ElevenLabs (SAG) isn't available or when privacy matters. No API key means it works on air-gapped systems, during internet outages, and without any recurring costs. The trade-off is voice quality: ElevenLabs sounds more natural, but Sherpa is free and private.

The architecture supports 12 programming languages (C++, Python, Go, Swift, Rust, etc.) and runs on macOS, Linux, Windows, Android, iOS, HarmonyOS, Raspberry Pi, and RISC-V boards, making it one of the most broadly supported local TTS solutions available. Best suited for: privacy-sensitive environments, offline/air-gapped systems, cost-conscious users wanting unlimited TTS, embedded systems and IoT devices, and OpenClaw users who want voice output without API costs.

Tags: tts, voice, offline, local, speech

Category: Voice

Use Cases

  • Offline voice output for OpenClaw on air-gapped or privacy-sensitive systems
  • Free unlimited TTS for notifications, reminders, and alerts
  • Embedded systems: Raspberry Pi, home automation voice announcements
  • Multilingual TTS without cloud API costs
  • Fallback TTS when internet is unavailable
  • IoT and smart home voice announcements
  • Accessibility: screen reader alternative for custom applications

Tips

  • Start with the Kokoro English model for the best quality-to-speed ratio on macOS/Linux
  • Use MeloTTS for multilingual needs — it handles Chinese, Japanese, Korean, and more
  • Convert output to mp3 for smaller files: `ffmpeg -i output.wav -codec:a libmp3lame output.mp3`
  • Pre-download models during setup — they're needed only once, then fully offline
  • On Apple Silicon, the runtime uses accelerated ONNX inference — performance is excellent
  • For OpenClaw, set env vars in openclaw.json so the skill auto-discovers the runtime
  • Pair with cron for scheduled spoken reminders without any API costs
  • Test multiple voice models — quality varies dramatically between them
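The cron tip above can be sketched as a small wrapper script. The script name, install paths, and the kokoro-en model directory are illustrative assumptions, not fixed by the skill; `afplay` is the macOS player (substitute `aplay` or `paplay` on Linux).

```shell
# Hypothetical reminder wrapper; adjust SHERPA_ONNX_* paths to your install.
mkdir -p "$HOME/bin"
cat > "$HOME/bin/speak-reminder.sh" <<'EOF'
#!/bin/sh
# Speak a short reminder fully offline via the sherpa-onnx-tts binary.
export SHERPA_ONNX_RUNTIME_DIR="$HOME/.openclaw/tools/sherpa-onnx-tts/runtime"
export SHERPA_ONNX_MODEL_DIR="$HOME/.openclaw/tools/sherpa-onnx-tts/models/kokoro-en"
wav=$(mktemp /tmp/reminder-XXXXXX)
"$SHERPA_ONNX_RUNTIME_DIR/bin/sherpa-onnx-tts" -o "$wav" "$1" && afplay "$wav"
rm -f "$wav"
EOF
chmod +x "$HOME/bin/speak-reminder.sh"

# Example crontab entry (crontab -e): weekday standup reminder at 09:55
# 55 9 * * 1-5 $HOME/bin/speak-reminder.sh "Standup in five minutes"
```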

Known Issues & Gotchas

  • Setup is more involved than cloud TTS — you need to download both the runtime binary AND a voice model separately
  • Voice quality is noticeably below ElevenLabs or Google Cloud TTS — good enough for notifications, not for production audio
  • The runtime directory and model directory must be configured via environment variables (SHERPA_ONNX_RUNTIME_DIR, SHERPA_ONNX_MODEL_DIR)
  • Model files can be large (100MB-1GB) — choose carefully for storage-constrained devices
  • Not all voice models support all languages — check model compatibility before downloading
  • Output is WAV only by default — you'll need ffmpeg to convert to mp3 or other formats
  • CPU inference speed varies: fast on Apple Silicon, slower on older x86 hardware or Raspberry Pi
  • No streaming playback built-in — generates the full WAV file before you can play it
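Since playback can't start until the full WAV exists, one workaround is to synthesize and play sentence by sentence, so the first chunk plays while later chunks are still being generated. A minimal sketch, assuming the `bin/sherpa-onnx-tts` invocation shown under Configuration Examples and macOS `afplay`; the function name is made up for illustration.

```shell
# Pseudo-streaming workaround: chunk the text at sentence boundaries and
# synthesize/play each chunk in turn instead of waiting for one big WAV.
speak_chunked() {
  printf '%s\n' "$1" | tr '.' '\n' | while IFS= read -r sentence; do
    [ -n "$sentence" ] || continue
    tmp=$(mktemp /tmp/tts-chunk-XXXXXX)
    # Generate this sentence, then play it while the loop moves on.
    bin/sherpa-onnx-tts -o "$tmp.wav" "$sentence" && afplay "$tmp.wav"
    rm -f "$tmp" "$tmp.wav"
  done
}
```

Naive period-splitting mangles abbreviations and decimals, but for notification-length text it keeps perceived latency close to the first sentence's synthesis time.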

Alternatives

  • SAG (ElevenLabs TTS)
  • Piper TTS
  • macOS say
  • Coqui TTS
  • OpenClaw built-in tts tool

Community Feedback

There are local neural TTS engines for android that work pretty well and have flawless intonations. Two projects which work amazingly: Piper and sherpa-onnx. Things are going really great for on-device TTS.

— Reddit r/androidapps

whisper.cpp vs sherpa-onnx vs something else for speech? I'm looking to run my own endpoint on my server for my apps. Sherpa-onnx supports both STT and TTS in one framework.

— Reddit r/LocalLLaMA

sherpa-onnx-tts provides a local, offline command-line wrapper around the sherpa-onnx TTS runtime to synthesize speech without cloud services. Produces WAV output from text input.

— LobeHub Skills Marketplace

Squawk uses sherpa-onnx for real-time local text-to-speech with AI. The engine handles synthesis without any cloud API dependency.

— OBS Forum

Configuration Examples

Download runtime and model

# Download runtime for macOS ARM64
mkdir -p ~/.openclaw/tools/sherpa-onnx-tts
cd ~/.openclaw/tools/sherpa-onnx-tts

# Download from GitHub releases
# https://github.com/k2-fsa/sherpa-onnx/releases

# Download a voice model (e.g., Kokoro English)
# https://k2-fsa.github.io/sherpa/onnx/tts/all-in-one.html

Configure environment

# Set environment variables
export SHERPA_ONNX_RUNTIME_DIR=~/.openclaw/tools/sherpa-onnx-tts/runtime
export SHERPA_ONNX_MODEL_DIR=~/.openclaw/tools/sherpa-onnx-tts/models/kokoro-en

# Or configure in openclaw.json env section

Generate speech

# Basic text-to-speech
bin/sherpa-onnx-tts -o output.wav "Hello from OpenClaw, running fully offline!"

# Convert to mp3
ffmpeg -i output.wav -codec:a libmp3lame -qscale:a 2 output.mp3

# Play on macOS
afplay output.wav

Installation

# Download runtime + model (see SKILL.md)

Source: bundled