speak-tts
npx skills add https://github.com/emzod/speak --skill speak-tts
speak – Talk to your Claude!
Give your agent the ability to speak to you in real time. Local text-to-speech, voice cloning, and audio generation on Apple Silicon.
Prerequisites
| Requirement | Check | Install |
|---|---|---|
| Apple Silicon Mac | uname -m → arm64 | Intel not supported |
| macOS 12.0+ | sw_vers | – |
| sox | which sox | brew install sox |
| ffmpeg | which ffmpeg | brew install ffmpeg |
| poppler (PDF) | which pdftotext | brew install poppler |
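The checks in the table can be run in one pass. The sketch below is one possible way to do that; `missing_tools` is a hypothetical helper name, not part of speak.

```shell
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools from the table above; pdftotext is provided by poppler.
missing_tools sox ffmpeg pdftotext | sed 's/^/missing: /'

# Architecture check: Apple Silicon reports arm64.
[ "$(uname -m)" = "arm64" ] || echo "Not Apple Silicon: $(uname -m)"
```

Anything printed as missing can be installed with the matching brew command from the table.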
Input Sources
| Source | Example |
|---|---|
| Text file | speak article.txt |
| Markdown | speak doc.md |
| Direct string | speak "Hello" |
| Clipboard | pbpaste \| speak |
| Stdin | cat file.txt \| speak |
Web Articles
lynx -dump -nolist "https://example.com/article" | speak --output article.wav
Converting Formats
| Format | Convert Command |
|---|---|
| PDF | pdftotext doc.pdf doc.txt |
| DOCX | textutil -convert txt doc.docx |
| HTML | pandoc -f html -t plain doc.html > doc.txt |
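When the input format varies, the table can be folded into a small lookup. This is only a sketch: `text_convert_cmd` is a hypothetical helper that prints the command from the table rather than running it, and it assumes file names without spaces.

```shell
# Print the conversion command from the table for a given input file.
# Only the three formats listed above are handled; anything else is an error.
text_convert_cmd() {
  file=$1
  base=${file%.*}   # file name without its extension
  case "$file" in
    *.pdf)  printf 'pdftotext %s %s.txt\n' "$file" "$base" ;;
    *.docx) printf 'textutil -convert txt %s\n' "$file" ;;
    *.html) printf 'pandoc -f html -t plain %s > %s.txt\n' "$file" "$base" ;;
    *)      echo "no conversion rule for $file" >&2; return 1 ;;
  esac
}
```

Printing the command first makes it easy to review before piping it to a shell.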
Output Modes
| Goal | Command |
|---|---|
| Save for later | speak text.txt --output file.wav |
| Listen now (streaming) | speak text.txt --stream |
| Listen now (complete) | speak text.txt --play |
| Both | speak text.txt --stream --output file.wav |
Default Behavior
speak article.txt # → ~/Audio/speak/article.wav (no playback)
speak "Hello" # → ~/Audio/speak/speak_<timestamp>.wav
Directory Auto-Creation
| Directory | Auto-Created? |
|---|---|
| ~/Audio/speak/ | Yes |
| ~/.chatter/voices/ | No |
| Custom directories | No |
Always create custom directories first:
mkdir -p ~/.chatter/voices/
mkdir -p ~/Audio/custom/
Voice Cloning
Voice cloning generates speech that matches your vocal characteristics (pitch, tone, cadence) from a short recording.
Quality Expectations
- Output captures general voice characteristics but is not a perfect replica
- Quality depends heavily on sample quality
- 15-25 seconds is optimal (10s minimum, 30s maximum)
Recording Your Voice
Using QuickTime:
- Open QuickTime Player → File → New Audio Recording
- Record 20 seconds of clear speech
- File → Export As → Audio Only (.m4a)
- Convert to WAV (see below)
Using sox (command line):
# -d = use default microphone
# Recording starts immediately and stops after 25 seconds
sox -d -r 24000 -c 1 ~/.chatter/voices/my_voice.wav trim 0 25
Converting to Required Format
Voice samples MUST be: WAV, 24000 Hz, mono, 10-30 seconds.
# From MP3
ffmpeg -i voice.mp3 -ar 24000 -ac 1 voice.wav
# From M4A (QuickTime)
ffmpeg -i voice.m4a -ar 24000 -ac 1 voice.wav
# Trim to 25 seconds
ffmpeg -i long.wav -t 25 -ar 24000 -ac 1 trimmed.wav
# Check sample properties
ffprobe -i voice.wav 2>&1 | grep -E "Duration|Stream"
# Should show: Duration ~15-25s, 24000 Hz, mono
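The three requirements can also be checked mechanically instead of by eye. The following is a minimal sketch; `voice_specs_ok` is a hypothetical helper, and the ffprobe calls in the comments are one way to obtain the values it expects.

```shell
# Return 0 when the given properties meet the stated requirements:
# 24000 Hz, mono (1 channel), 10-30 seconds.
voice_specs_ok() {
  rate=$1 channels=$2 duration=$3
  [ "$rate" = "24000" ] || { echo "sample rate is $rate, expected 24000"; return 1; }
  [ "$channels" = "1" ] || { echo "channels is $channels, expected 1 (mono)"; return 1; }
  # awk handles the fractional seconds that ffprobe reports (e.g. 21.480000).
  awk -v d="$duration" 'BEGIN { exit !(d >= 10 && d <= 30) }' \
    || { echo "duration is ${duration}s, expected 10-30s"; return 1; }
  echo "ok"
}

# Reading the values with ffprobe (requires ffmpeg to be installed):
#   rate=$(ffprobe -v error -select_streams a:0 -show_entries stream=sample_rate -of csv=p=0 voice.wav)
#   channels=$(ffprobe -v error -select_streams a:0 -show_entries stream=channels -of csv=p=0 voice.wav)
#   duration=$(ffprobe -v error -show_entries format=duration -of csv=p=0 voice.wav)
#   voice_specs_ok "$rate" "$channels" "$duration"
```

The function stops at the first failed requirement and prints which one it was.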
Using Your Voice
# Create directory
mkdir -p ~/.chatter/voices/
# Move sample
mv voice.wav ~/.chatter/voices/my_voice.wav
# Test
speak "Testing my voice" --voice ~/.chatter/voices/my_voice.wav --stream
# Use for content
speak notes.txt --voice ~/.chatter/voices/my_voice.wav --output presentation.wav
Path requirements:
- Works: ~/.chatter/voices/my_voice.wav (tilde expanded by shell)
- Works: /Users/name/.chatter/voices/my_voice.wav
- Fails: my_voice.wav (relative path)
- Fails: ./voices/my_voice.wav (relative path)
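Since relative paths fail, a script can normalize the path before handing it to --voice. The sketch below shows one way; `abs_path` is a hypothetical helper. It also handles a tilde that was quoted and therefore never expanded by the shell.

```shell
# Turn a possibly relative path into an absolute one, based on the current directory.
abs_path() {
  case "$1" in
    /*)    printf '%s\n' "$1" ;;                    # already absolute
    '~/'*) printf '%s/%s\n' "$HOME" "${1#"~/"}" ;;  # literal, unexpanded tilde
    ./*)   printf '%s/%s\n' "$PWD" "${1#./}" ;;     # drop the leading ./
    *)     printf '%s/%s\n' "$PWD" "$1" ;;
  esac
}
```

Usage would look like: speak "Testing" --voice "$(abs_path my_voice.wav)" --stream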
Voice Sample Tips
| Good Sample | Bad Sample |
|---|---|
| Quiet room | Background noise |
| Natural pace | Rushed or monotone |
| Clear diction | Mumbling |
| Varied content | Repetitive phrases |
Default Voice
When --voice is omitted, a built-in default voice is used:
speak "Hello world" --stream # Uses default voice
Emotion Tags
Tags produce audible effects (actual sounds), not spoken words:
speak "[sigh] Monday again." --stream
# Output: (sigh sound) "Monday again."
| Tag | Effect |
|---|---|
| [laugh] | Laughter |
| [chuckle] | Light chuckle |
| [sigh] | Sighing |
| [gasp] | Gasping |
| [groan] | Groaning |
| [clear throat] | Throat clearing |
| [cough] | Coughing |
| [crying] | Crying |
| [singing] | Sung speech |
Not supported: [pause] and [whisper] (both are ignored).
For pauses, use punctuation: "Wait... let me think."
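Because unsupported tags are ignored silently, it can help to scan a script for them before generating audio. The sketch below is one way to do that; `unsupported_tags` is a hypothetical helper, and it treats any bracketed text as a tag, so ordinary bracketed words are flagged too.

```shell
# Print each bracketed tag in the text that is not in the supported list above.
unsupported_tags() {
  printf '%s\n' "$1" | grep -o '\[[^]]*]' | while IFS= read -r tag; do
    case "$tag" in
      '[laugh]'|'[chuckle]'|'[sigh]'|'[gasp]'|'[groan]') ;;
      '[clear throat]'|'[cough]'|'[crying]'|'[singing]') ;;
      *) printf '%s\n' "$tag" ;;
    esac
  done
}
```

For a file, pass its contents: unsupported_tags "$(cat script.txt)"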
Batch Processing
mkdir -p ~/Audio/book/
speak ch01.txt ch02.txt ch03.txt --output-dir ~/Audio/book/
# Creates: ch01.wav, ch02.wav, ch03.wav
# With auto-chunking (for long files)
speak chapters/*.txt --output-dir ~/Audio/book/ --auto-chunk
# Skip completed files
speak chapters/*.txt --output-dir ~/Audio/book/ --skip-existing
Auto-Chunk Behavior
When using --auto-chunk with batch processing:
- Each input file is chunked independently
- Chunks are generated and automatically concatenated per file
- Final output: one .wav per input file (e.g., ch01.wav)
- Intermediate chunks deleted (unless --keep-chunks)
You don't need to manually concatenate chunks; only concatenate final chapter files.
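To get a feel for how many chunks a file will produce, the character count can be divided by the chunk size. This is a rough sketch under a stated assumption: it treats chunks as fixed-size, while the tool may split at sentence boundaries instead. `chunk_count` is a hypothetical helper; 6000 is the --chunk-size default listed in the options reference.

```shell
# Estimate how many chunks a text of the given length yields, rounding up.
# Assumption: fixed-size splitting, so treat the result as a rough figure.
chunk_count() {
  chars=$1
  size=${2:-6000}   # documented --chunk-size default
  echo $(( (chars + size - 1) / size ))
}

# Usage with a real file:
#   chunk_count $(wc -m < ch01.txt)
```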
Concatenating Audio
# Explicit order (recommended)
speak concat ch01.wav ch02.wav ch03.wav --output book.wav
# Glob pattern (REQUIRES zero-padded filenames)
speak concat audiobook/*.wav --output book.wav
Zero-Padding Rules
Critical for correct concatenation order:
| Files | Correct | Wrong |
|---|---|---|
| 1-9 | 01, 02, …, 09 | 1, 2, …, 9 |
| 10-99 | 01, 02, …, 99 | 1, 10, 2, … |
| 100+ | 001, 002, …, 999 | 1, 100, 2, … |
Why: Shell glob expansion sorts names lexicographically, so unpadded names come out as 1, 10, 2, while zero-padded names come out as 01, 02, 10.
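Existing files with unpadded numbers can be renamed before concatenation. The sketch below is one possible approach; `padded_name` is a hypothetical helper that pads the first run of digits in a name, and it assumes every name contains a number.

```shell
# Print a file name with its first run of digits zero-padded to the given width.
padded_name() {
  name=$1 width=$2
  prefix=$(printf '%s\n' "$name" | sed 's/^\([^0-9]*\).*/\1/')
  num=$(printf '%s\n' "$name" | sed 's/^[^0-9]*\([0-9][0-9]*\).*/\1/')
  suffix=$(printf '%s\n' "$name" | sed 's/^[^0-9]*[0-9][0-9]*//')
  # Strip existing leading zeros so printf does not read "08" as octal.
  num=$(printf '%s\n' "$num" | sed 's/^0*//')
  printf "%s%0${width}d%s\n" "$prefix" "${num:-0}" "$suffix"
}

# Renaming in place (skips names that are already padded):
#   for f in ch*.wav; do
#     new=$(padded_name "$f" 2)
#     [ "$f" = "$new" ] || mv "$f" "$new"
#   done
```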
PDF to Audiobook (Complete Workflow)
Step 1: Find Chapter Boundaries
# Preview table of contents
pdftotext -f 1 -l 5 textbook.pdf toc.txt
cat toc.txt # Note chapter page numbers
# Or search for "Chapter" markers
pdftotext textbook.pdf - | grep -n "Chapter"
Step 2: Extract Chapters (Zero-Padded!)
# For 100-page book with ~10 chapters
pdftotext -f 1 -l 12 -layout textbook.pdf ch01.txt
pdftotext -f 13 -l 25 -layout textbook.pdf ch02.txt
pdftotext -f 26 -l 38 -layout textbook.pdf ch03.txt
# ... continue for all chapters
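For books with many chapters, the extraction commands can be generated from a list of page ranges. This is a sketch only; `chapter_commands` is a hypothetical helper that prints one pdftotext command per range with zero-padded output names, using the page ranges from the example above.

```shell
# Print one pdftotext command per chapter.
# Arguments: the PDF, then one FIRST-LAST page range per chapter, in order.
chapter_commands() {
  pdf=$1; shift
  n=1
  for range in "$@"; do
    first=${range%-*}   # text before the dash
    last=${range#*-}    # text after the dash
    printf 'pdftotext -f %s -l %s -layout %s ch%02d.txt\n' "$first" "$last" "$pdf" "$n"
    n=$((n + 1))
  done
}

# Review the output, then run it:
#   chapter_commands textbook.pdf 1-12 13-25 26-38 | sh
```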
Step 3: Estimate Time
speak --estimate ch*.txt
# Shows: total audio duration, generation time, storage needed
# Quick estimates:
# 1 page ≈ 2 min audio ≈ 1 min generation
# 100 pages ≈ 200 min audio ≈ 100 min generation ≈ 500 MB
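The quick estimates are simple arithmetic and can be reproduced for any page count. The sketch below applies the rules of thumb above plus the ~2.5 MB per minute storage figure from the performance table; `estimate_pages` is a hypothetical helper and does not reflect how speak --estimate computes its numbers.

```shell
# Rough estimate from the rules of thumb:
# 2 min of audio and 1 min of generation per page; 2.5 MB per minute of audio.
estimate_pages() {
  pages=$1
  audio_min=$((pages * 2))
  gen_min=$pages
  storage_mb=$((audio_min * 5 / 2))   # 2.5 MB/min in integer arithmetic
  echo "${audio_min} min audio, ${gen_min} min generation, ~${storage_mb} MB"
}
```

For example, estimate_pages 100 reproduces the 100-page line above.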
Step 4: Generate Audio
mkdir -p audiobook/
speak ch01.txt ch02.txt ch03.txt --output-dir audiobook/ --auto-chunk
# Creates: audiobook/ch01.wav, audiobook/ch02.wav, audiobook/ch03.wav
Step 5: Concatenate
speak concat audiobook/ch01.wav audiobook/ch02.wav audiobook/ch03.wav --output complete_audiobook.wav
# Or with glob (only if zero-padded):
speak concat audiobook/ch*.wav --output complete_audiobook.wav
PDF Troubleshooting
| Issue | Solution |
|---|---|
| Empty/garbled text | Scanned PDF → use OCR: brew install tesseract |
| Wrong encoding | Try: pdftotext -enc UTF-8 doc.pdf |
| Check word count | pdftotext doc.pdf - \| wc -w (should be >100) |
Multi-Voice Content
mkdir -p podcast/scripts podcast/wav
echo "Welcome to the show." > podcast/scripts/01_host.txt
echo "Thanks for having me." > podcast/scripts/02_guest.txt
speak podcast/scripts/01_host.txt --voice ~/.chatter/voices/host.wav --output podcast/wav/01.wav
speak podcast/scripts/02_guest.txt --voice ~/.chatter/voices/guest.wav --output podcast/wav/02.wav
speak concat podcast/wav/01.wav podcast/wav/02.wav --output podcast.wav
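With more than a few lines of dialogue, the per-file commands can be derived from the script names. The sketch below assumes the NN_speaker.txt naming used above and a voice sample named after each speaker in ~/.chatter/voices/; `speaker_of` and `voice_commands` are hypothetical helpers that only print the commands.

```shell
# Derive the speaker name from a script file named NN_speaker.txt.
speaker_of() {
  script_base=${1##*/}               # strip directories: 01_host.txt
  script_base=${script_base%.txt}    # strip extension:   01_host
  printf '%s\n' "${script_base#*_}"  # strip the prefix:  host
}

# Print the speak command for each script, in argument order.
voice_commands() {
  for script in "$@"; do
    file=${script##*/}
    printf 'speak %s --voice %s/.chatter/voices/%s.wav --output podcast/wav/%s.wav\n' \
      "$script" "$HOME" "$(speaker_of "$script")" "${file%%_*}"
  done
}

# Review the output, then run it:
#   voice_commands podcast/scripts/*.txt | sh
```

Zero-padded script names keep the glob, and therefore the generated segments, in speaking order.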
Options Reference
| Option | Description | Default |
|---|---|---|
| --stream | Stream as it generates | false |
| --play | Play after complete | false |
| --output <path> | Output file | ~/Audio/speak/ |
| --output-dir <dir> | Batch output directory | – |
| --voice <path> | Voice sample (full path) | default |
| --timeout <sec> | Timeout per file | 300 |
| --auto-chunk | Split long documents | false |
| --chunk-size <n> | Chars per chunk | 6000 |
| --resume <file> | Resume from manifest | – |
| --keep-chunks | Keep intermediate files | false |
| --skip-existing | Skip if output exists | false |
| --estimate | Show duration estimate | false |
| --dry-run | Preview only | false |
| --quiet | Suppress output | false |
Commands
| Command | Description |
|---|---|
| speak setup | Set up environment |
| speak health | Check system status |
| speak models | List TTS models |
| speak concat | Concatenate audio |
| speak daemon kill | Stop TTS server |
| speak config | Show configuration |
Performance
| Metric | Value |
|---|---|
| Cold start | ~4-8s |
| Warm start | ~3-8s |
| Speed | 0.3-0.5x RTF (faster than real-time) |
| Storage | ~2.5 MB/min, ~150 MB/hour |
Resume Capability
For interrupted long generations:
# Single file with auto-chunk: use --resume
speak long.txt --auto-chunk --output book.wav
# If interrupted, manifest saved at ~/Audio/speak/manifest.json
speak --resume ~/Audio/speak/manifest.json
# Batch processing: use --skip-existing
speak ch*.txt --output-dir audiobook/ --auto-chunk
# If interrupted, re-run same command:
speak ch*.txt --output-dir audiobook/ --auto-chunk --skip-existing
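Before re-running an interrupted batch, it can be useful to see which inputs still lack output. The sketch below approximates that check from the outside, assuming the name.txt to name.wav mapping shown in the batch examples. `pending_inputs` is a hypothetical helper, not how --skip-existing is implemented.

```shell
# Print the input files that have no matching .wav in the output directory yet.
pending_inputs() {
  outdir=$1; shift
  for input in "$@"; do
    name=${input##*/}   # strip directories
    [ -e "$outdir/${name%.*}.wav" ] || printf '%s\n' "$input"
  done
}

# Usage:
#   pending_inputs audiobook ch*.txt
```

A partially written .wav from the interrupted run would still count as existing, so it is worth checking the most recent output file by hand.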
Common Errors
| Error | Cause | Solution |
|---|---|---|
| “Voice file not found” | Relative path | Use full path: ~/.chatter/voices/x.wav |
| “Invalid WAV format” | Wrong specs | Convert: ffmpeg -i in.wav -ar 24000 -ac 1 out.wav |
| “Voice sample too short” | <10 seconds | Record 15-25 seconds |
| “Output directory doesn’t exist” | Not created | mkdir -p dirname/ |
| “sox not found” | Not installed | brew install sox |
| Scrambled concat order | Non-zero-padded | Use 01, 02, not 1, 2 |
| Timeout | >5 min generation | Use --auto-chunk or --timeout 600 |
| “Server not running” | Stale daemon | speak daemon kill && speak health |
Setup
speak "test" # Auto-setup on first run (downloads model ~500MB)
speak setup # Or manual setup
speak health # Verify everything works
Server Management
The server starts automatically and shuts down after 1 hour of idle time.
speak health # Check status
speak daemon kill # Stop manually