Sherpa ONNX TTS (Local)

Local text-to-speech via sherpa-onnx. Fully offline, no cloud required. Supports multiple voice models.

Sherpa ONNX TTS is a fully offline, local text-to-speech engine built on the sherpa-onnx framework by k2-fsa (the Next-gen Kaldi team). It runs entirely on-device with no internet connection, no API keys, and no per-character billing, making it a zero-cost, privacy-first alternative to cloud TTS services like ElevenLabs or Google Cloud TTS.

The sherpa-onnx project is a comprehensive speech processing toolkit covering speech-to-text, text-to-speech, speaker diarization, voice activity detection, and more. The TTS component specifically wraps ONNX Runtime-based neural voice models (VITS, VITS2, MeloTTS, Kokoro, Matcha-TTS, and others) that run on CPU with no GPU requirement. Models are compact enough for embedded systems, Raspberry Pi, and mobile devices.

The OpenClaw skill wraps the sherpa-onnx TTS binary in a simple interface: point it at a runtime directory and a model directory, pass text in, get WAV audio out. Setup involves downloading the runtime binary for your platform and choosing a voice model from the extensive model zoo, which offers hundreds of voices across 30+ languages, from English and Chinese to Arabic and Swahili.

Voice quality varies significantly by model. The newer Kokoro and MeloTTS models approach near-natural quality for English, while some older VITS models sound more robotic. The sweet spot is the Kokoro English model: it's fast, sounds good, and runs efficiently on Apple Silicon. For other languages, quality depends on the available model training data.

For OpenClaw users, Sherpa ONNX TTS is the fallback when ElevenLabs (SAG) isn't available or when privacy matters. No API key means it works on air-gapped systems, during internet outages, and without any recurring costs. The trade-off is voice quality: ElevenLabs sounds more natural, but Sherpa is free and private.

The architecture supports 12 programming languages (C++, Python, Go, Swift, Rust, etc.) and runs on macOS, Linux, Windows, Android, iOS, HarmonyOS, Raspberry Pi, and RISC-V boards, making it one of the most broadly supported local TTS solutions available. Best suited for: privacy-sensitive environments, offline/air-gapped systems, cost-conscious users wanting unlimited TTS, embedded systems and IoT devices, and OpenClaw users who want voice output without API costs.

Tags: tts, voice, offline, local, speech

Category: Voice

Use Cases

  • Offline voice output for OpenClaw on air-gapped or privacy-sensitive systems
  • Free unlimited TTS for notifications, reminders, and alerts
  • Embedded systems: Raspberry Pi, home automation voice announcements
  • Multilingual TTS without cloud API costs
  • Fallback TTS when internet is unavailable
  • IoT and smart home voice announcements
  • Accessibility: screen reader alternative for custom applications

Tips

  • Start with the Kokoro English model for the best quality-to-speed ratio on macOS/Linux
  • Use MeloTTS for multilingual needs — it handles Chinese, Japanese, Korean, and more
  • Convert output to mp3 for smaller files: `ffmpeg -i output.wav -codec:a libmp3lame output.mp3`
  • Pre-download models during setup — they're needed only once, then fully offline
  • On Apple Silicon, the runtime uses accelerated ONNX inference — performance is excellent
  • For OpenClaw, set env vars in openclaw.json so the skill auto-discovers the runtime
  • Pair with cron for scheduled spoken reminders without any API costs
  • Test multiple voice models — quality varies dramatically between them
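The cron tip above can be sketched as a small wrapper script. The script name, install paths, and the kokoro-en model directory are illustrative assumptions, not fixed by the skill; `afplay` is the macOS player (substitute `aplay` or `paplay` on Linux).

```shell
# Hypothetical reminder wrapper; adjust SHERPA_ONNX_* paths to your install.
mkdir -p "$HOME/bin"
cat > "$HOME/bin/speak-reminder.sh" <<'EOF'
#!/bin/sh
# Speak a short reminder fully offline via the sherpa-onnx-tts binary.
export SHERPA_ONNX_RUNTIME_DIR="$HOME/.openclaw/tools/sherpa-onnx-tts/runtime"
export SHERPA_ONNX_MODEL_DIR="$HOME/.openclaw/tools/sherpa-onnx-tts/models/kokoro-en"
wav=$(mktemp /tmp/reminder-XXXXXX)
"$SHERPA_ONNX_RUNTIME_DIR/bin/sherpa-onnx-tts" -o "$wav" "$1" && afplay "$wav"
rm -f "$wav"
EOF
chmod +x "$HOME/bin/speak-reminder.sh"

# Example crontab entry (crontab -e): weekday standup reminder at 09:55
# 55 9 * * 1-5 $HOME/bin/speak-reminder.sh "Standup in five minutes"
```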

Known Issues & Gotchas

  • Setup is more involved than cloud TTS — you need to download both the runtime binary AND a voice model separately
  • Voice quality is noticeably below ElevenLabs or Google Cloud TTS — good enough for notifications, not for production audio
  • The runtime directory and model directory must be configured via environment variables (SHERPA_ONNX_RUNTIME_DIR, SHERPA_ONNX_MODEL_DIR)
  • Model files can be large (100MB-1GB) — choose carefully for storage-constrained devices
  • Not all voice models support all languages — check model compatibility before downloading
  • Output is WAV only by default — you'll need ffmpeg to convert to mp3 or other formats
  • CPU inference speed varies: fast on Apple Silicon, slower on older x86 hardware or Raspberry Pi
  • No streaming playback built-in — generates the full WAV file before you can play it
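Since playback can't start until the full WAV exists, one workaround is to synthesize and play sentence by sentence, so the first chunk plays while later chunks are still being generated. A minimal sketch, assuming the `bin/sherpa-onnx-tts` invocation shown under Configuration Examples and macOS `afplay`; the function name is made up for illustration.

```shell
# Pseudo-streaming workaround: chunk the text at sentence boundaries and
# synthesize/play each chunk in turn instead of waiting for one big WAV.
speak_chunked() {
  printf '%s\n' "$1" | tr '.' '\n' | while IFS= read -r sentence; do
    [ -n "$sentence" ] || continue
    tmp=$(mktemp /tmp/tts-chunk-XXXXXX)
    # Generate this sentence, then play it while the loop moves on.
    bin/sherpa-onnx-tts -o "$tmp.wav" "$sentence" && afplay "$tmp.wav"
    rm -f "$tmp" "$tmp.wav"
  done
}
```

Naive period-splitting mangles abbreviations and decimals, but for notification-length text it keeps perceived latency close to the first sentence's synthesis time.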

Alternatives

  • SAG (ElevenLabs TTS)
  • Piper TTS
  • macOS say
  • Coqui TTS
  • OpenClaw built-in tts tool

Community Feedback

There are local neural TTS engines for android that work pretty well and have flawless intonations. Two projects which work amazingly: Piper and sherpa-onnx. Things are going really great for on-device TTS.

— Reddit r/androidapps

whisper.cpp vs sherpa-onnx vs something else for speech? I'm looking to run my own endpoint on my server for my apps. Sherpa-onnx supports both STT and TTS in one framework.

— Reddit r/LocalLLaMA

sherpa-onnx-tts provides a local, offline command-line wrapper around the sherpa-onnx TTS runtime to synthesize speech without cloud services. Produces WAV output from text input.

— LobeHub Skills Marketplace

Squawk uses sherpa-onnx for real-time local text-to-speech with AI. The engine handles synthesis without any cloud API dependency.

— OBS Forum

Configuration Examples

Download runtime and model

# Download runtime for macOS ARM64
mkdir -p ~/.openclaw/tools/sherpa-onnx-tts
cd ~/.openclaw/tools/sherpa-onnx-tts

# Download from GitHub releases
# https://github.com/k2-fsa/sherpa-onnx/releases

# Download a voice model (e.g., Kokoro English)
# https://k2-fsa.github.io/sherpa/onnx/tts/all-in-one.html

Configure environment

# Set environment variables
export SHERPA_ONNX_RUNTIME_DIR=~/.openclaw/tools/sherpa-onnx-tts/runtime
export SHERPA_ONNX_MODEL_DIR=~/.openclaw/tools/sherpa-onnx-tts/models/kokoro-en

# Or configure in openclaw.json env section

Generate speech

# Basic text-to-speech
bin/sherpa-onnx-tts -o output.wav "Hello from OpenClaw, running fully offline!"

# Convert to mp3
ffmpeg -i output.wav -codec:a libmp3lame -qscale:a 2 output.mp3

# Play on macOS
afplay output.wav

Installation

# Download runtime + model (see SKILL.md)

Source: bundled