speech-to-text
Total installs: 1
Weekly installs: 1
Site rank: #45116
Install command:
```shell
npx skills add https://github.com/sarvamai/skills --skill speech-to-text
```
Agent install distribution: amp: 1, openclaw: 1, opencode: 1, continue: 1, codex: 1
Skill documentation
Speech-to-Text with Saarika
Saarika is Sarvam AI’s speech recognition model optimized for Indian languages with support for code-mixing (Hindi-English etc.) and multi-speaker scenarios.
Installation
```shell
pip install sarvamai
```
Quick Start
```python
from sarvamai import SarvamAI

client = SarvamAI()

response = client.speech_to_text.transcribe(
    file=open("audio.wav", "rb"),
    model="saarika:v2.5",
    language_code="hi-IN"
)
print(response.transcript)
```
Supported Languages
| Code | Language | Code | Language |
|---|---|---|---|
| hi-IN | Hindi | ta-IN | Tamil |
| bn-IN | Bengali | te-IN | Telugu |
| kn-IN | Kannada | ml-IN | Malayalam |
| mr-IN | Marathi | gu-IN | Gujarati |
| pa-IN | Punjabi | or-IN | Odia |
| en-IN | English (Indian) | auto | Auto-detect |
API Options
REST API (≤30 seconds)
For short audio clips:
```python
response = client.speech_to_text.transcribe(
    file=open("short_clip.wav", "rb"),
    model="saarika:v2.5",
    language_code="auto",    # Auto-detect language
    with_timestamps=True,    # Word-level timestamps
    with_diarisation=True    # Speaker identification
)
print(response.transcript)
print(response.language_code)     # Detected language
print(response.words)             # Timestamped words
print(response.speaker_segments)  # Speaker turns
```
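When timestamps and diarisation are both enabled, the two result lists can be joined into a speaker-attributed transcript. A minimal sketch, assuming the field shapes shown in the Response section (words with word/start/end, speaker_segments with speaker/start/end); the midpoint-matching rule is an illustrative choice, not part of the API:

```python
def attribute_speakers(words, speaker_segments):
    """Assign each timestamped word to the speaker segment containing
    its midpoint, and emit one text line per speaker turn."""
    lines = []
    for seg in speaker_segments:
        spoken = [
            w["word"] for w in words
            if seg["start"] <= (w["start"] + w["end"]) / 2 <= seg["end"]
        ]
        if spoken:
            lines.append(f"{seg['speaker']}: {' '.join(spoken)}")
    return lines

# Example with response-shaped data (not a real API call):
words = [
    {"word": "hello", "start": 0.0, "end": 0.5},
    {"word": "there", "start": 0.6, "end": 0.9},
    {"word": "hi", "start": 2.0, "end": 2.3},
]
segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.0},
    {"speaker": "SPEAKER_01", "start": 1.5, "end": 3.0},
]
print(attribute_speakers(words, segments))
# → ['SPEAKER_00: hello there', 'SPEAKER_01: hi']
```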
Batch API (≤1 hour)
For long recordings:
```python
response = client.speech_to_text.transcribe_batch(
    file=open("long_recording.mp3", "rb"),
    model="saarika:v2.5",
    language_code="hi-IN"
)
```
WebSocket Streaming (Real-time)
For live transcription. Audio must be sent as base64-encoded strings.
```python
import asyncio
import base64

from sarvamai import AsyncSarvamAI

async def stream_audio():
    client = AsyncSarvamAI()
    async with client.speech_to_text_streaming.connect(
        language_code="hi-IN",
        model="saarika:v2.5",
        high_vad_sensitivity=True
    ) as ws:
        # Read and encode audio to base64
        with open("audio.wav", "rb") as f:
            audio_base64 = base64.b64encode(f.read()).decode("utf-8")

        # Send base64-encoded audio
        await ws.transcribe(
            audio=audio_base64,
            encoding="audio/wav",
            sample_rate=16000
        )

        # Receive transcription
        response = await ws.recv()
        print(response)

asyncio.run(stream_audio())
```
WebSocket streaming supports only wav, pcm_s16le, pcm_l16, and pcm_raw; MP3, AAC, and OGG are not supported for streaming.
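For live sources it is usually better to send audio incrementally rather than in one message, as the example above does. A minimal sketch of slicing raw 16 kHz, 16-bit mono PCM into fixed-duration base64 chunks suitable for the audio field; the 100 ms chunk length is an arbitrary illustration, not an API requirement:

```python
import base64

def pcm_chunks_b64(pcm_bytes, sample_rate=16000, bytes_per_sample=2, chunk_ms=100):
    """Slice raw mono PCM into fixed-duration chunks and base64-encode
    each one for incremental sending over the streaming connection."""
    chunk_size = sample_rate * bytes_per_sample * chunk_ms // 1000
    for i in range(0, len(pcm_bytes), chunk_size):
        yield base64.b64encode(pcm_bytes[i:i + chunk_size]).decode("utf-8")

# One second of silence at 16 kHz, 16-bit mono -> ten 100 ms chunks
silence = bytes(16000 * 2)
chunks = list(pcm_chunks_b64(silence))
print(len(chunks))  # → 10
```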
JavaScript
```javascript
import { SarvamAI } from "sarvamai";
import fs from "fs";

const client = new SarvamAI();

const response = await client.speechToText.transcribe({
  file: fs.createReadStream("audio.wav"),
  model: "saarika:v2.5",
  languageCode: "hi-IN",
  withTimestamps: true
});
console.log(response.transcript);
```
cURL
```shell
curl -X POST "https://api.sarvam.ai/speech-to-text" \
  -H "api-subscription-key: $SARVAM_API_KEY" \
  -F "file=@audio.wav" \
  -F "model=saarika:v2.5" \
  -F "language_code=hi-IN"
```
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| file | File | Yes | Audio file (wav, mp3, flac, ogg, webm) |
| model | string | Yes | saarika:v2.5 or saarika:v2 |
| language_code | string | Yes | BCP-47 code or auto |
| with_timestamps | bool | No | Return word timestamps |
| with_diarisation | bool | No | Enable speaker identification |
Response
```json
{
  "request_id": "abc123",
  "transcript": "नमस्ते, आप कैसे हैं?",
  "language_code": "hi-IN",
  "words": [
    { "word": "नमस्ते", "start": 0.0, "end": 0.5 },
    { "word": "आप", "start": 0.6, "end": 0.8 }
  ],
  "speaker_segments": [
    { "speaker": "SPEAKER_00", "start": 0.0, "end": 2.5 }
  ]
}
```
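A common use of the words array above is generating subtitles. A minimal sketch that groups timestamped words into SubRip (SRT) cues; the five-words-per-cue grouping is an arbitrary illustration:

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def words_to_srt(words, group_size=5):
    """Group timestamped words (word/start/end dicts, as in the
    response above) into numbered SRT cues of group_size words."""
    cues = []
    for n, i in enumerate(range(0, len(words), group_size), start=1):
        group = words[i:i + group_size]
        text = " ".join(w["word"] for w in group)
        cues.append(f"{n}\n{srt_timestamp(group[0]['start'])} --> "
                    f"{srt_timestamp(group[-1]['end'])}\n{text}")
    return "\n\n".join(cues)

words = [
    {"word": "hello", "start": 0.0, "end": 0.5},
    {"word": "world", "start": 0.6, "end": 0.8},
]
print(words_to_srt(words))
# → 1
#   00:00:00,000 --> 00:00:00,800
#   hello world
```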
See references/streaming.md for detailed WebSocket documentation.