speech-to-text

📁 sarvamai/skills 📅 1 day ago
Install command
npx skills add https://github.com/sarvamai/skills --skill speech-to-text

Speech-to-Text with Saarika

Saarika is Sarvam AI’s speech recognition model, optimized for Indian languages. It supports code-mixed speech (for example, Hindi-English) and multi-speaker audio.

Installation

pip install sarvamai

Quick Start

from sarvamai import SarvamAI

client = SarvamAI()

response = client.speech_to_text.transcribe(
    file=open("audio.wav", "rb"),
    model="saarika:v2.5",
    language_code="hi-IN"
)

print(response.transcript)

Supported Languages

Code    Language            Code    Language
hi-IN   Hindi               ta-IN   Tamil
bn-IN   Bengali             te-IN   Telugu
kn-IN   Kannada             ml-IN   Malayalam
mr-IN   Marathi             gu-IN   Gujarati
pa-IN   Punjabi             or-IN   Odia
en-IN   English (Indian)    auto    Auto-detect

API Options

REST API (≤30 seconds)

For short audio clips:

response = client.speech_to_text.transcribe(
    file=open("short_clip.wav", "rb"),
    model="saarika:v2.5",
    language_code="auto",           # Auto-detect language
    with_timestamps=True,           # Word-level timestamps
    with_diarisation=True           # Speaker identification
)

print(response.transcript)
print(response.language_code)       # Detected language
print(response.words)               # Timestamped words
print(response.speaker_segments)    # Speaker turns

Batch API (≤1 hour)

For long recordings:

response = client.speech_to_text.transcribe_batch(
    file=open("long_recording.mp3", "rb"),
    model="saarika:v2.5",
    language_code="hi-IN"
)
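
The two limits above (30 seconds for REST, 1 hour for Batch) suggest a simple routing step before upload. The following is a minimal sketch, not part of the SDK: `choose_api` and `wav_duration_seconds` are hypothetical helper names, and the duration check uses Python's standard `wave` module, so it only works for WAV input.

```python
import wave

REST_MAX_SECONDS = 30          # REST limit stated in this document
BATCH_MAX_SECONDS = 60 * 60    # Batch limit stated in this document (1 hour)


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds (WAV files only)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def choose_api(duration_seconds):
    """Pick "rest" or "batch" from the audio duration (hypothetical helper)."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    if duration_seconds <= REST_MAX_SECONDS:
        return "rest"
    if duration_seconds <= BATCH_MAX_SECONDS:
        return "batch"
    raise ValueError("audio longer than 1 hour must be split before upload")
```

A caller could then dispatch to `transcribe` or `transcribe_batch` based on the returned string, for example `choose_api(wav_duration_seconds("audio.wav"))`.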

WebSocket Streaming (Real-time)

Use streaming for live transcription. Audio must be sent as base64-encoded strings.

import asyncio
import base64
from sarvamai import AsyncSarvamAI

async def stream_audio():
    client = AsyncSarvamAI()

    async with client.speech_to_text_streaming.connect(
        language_code="hi-IN",
        model="saarika:v2.5",
        high_vad_sensitivity=True
    ) as ws:
        # Read and encode audio to base64
        with open("audio.wav", "rb") as f:
            audio_base64 = base64.b64encode(f.read()).decode("utf-8")

        # Send base64 encoded audio
        await ws.transcribe(
            audio=audio_base64,
            encoding="audio/wav",
            sample_rate=16000
        )

        # Receive transcription
        response = await ws.recv()
        print(response)

asyncio.run(stream_audio())

Streaming over WebSocket supports only wav, pcm_s16le, pcm_l16, and pcm_raw. MP3, AAC, and OGG are not supported for streaming.
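
Because unsupported formats are rejected for streaming, it can help to check the format locally before encoding. This is a minimal sketch under the assumption that the caller knows the audio format as a short string; `encode_for_streaming` is a hypothetical helper, not an SDK function, and the format names are copied from the list above.

```python
import base64

# Formats listed above as supported for WebSocket streaming.
STREAMING_FORMATS = {"wav", "pcm_s16le", "pcm_l16", "pcm_raw"}


def encode_for_streaming(audio_bytes, audio_format):
    """Validate the format, then return the audio as a base64 string."""
    fmt = audio_format.lower()
    if fmt not in STREAMING_FORMATS:
        raise ValueError(
            f"{audio_format!r} is not supported for streaming; "
            f"use one of {sorted(STREAMING_FORMATS)}"
        )
    return base64.b64encode(audio_bytes).decode("ascii")
```

The returned string can be passed as the `audio` argument in the streaming example above.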

JavaScript

import { SarvamAI } from "sarvamai";
import fs from "fs";

const client = new SarvamAI();

const response = await client.speechToText.transcribe({
  file: fs.createReadStream("audio.wav"),
  model: "saarika:v2.5",
  languageCode: "hi-IN",
  withTimestamps: true
});

console.log(response.transcript);

cURL

curl -X POST "https://api.sarvam.ai/speech-to-text" \
  -H "api-subscription-key: $SARVAM_API_KEY" \
  -F "file=@audio.wav" \
  -F "model=saarika:v2.5" \
  -F "language_code=hi-IN"

Parameters

Parameter          Type     Required   Description
file               File     Yes        Audio file (wav, mp3, flac, ogg, webm)
model              string   Yes        saarika:v2.5 or saarika:v2
language_code      string   Yes        BCP-47 code or auto
with_timestamps    bool     No         Return word timestamps
with_diarisation   bool     No         Enable speaker identification
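
The parameter table can be turned into a small client-side check that runs before a request is sent. The sketch below is illustrative only: `validate_request` is a hypothetical helper, and the allowed values are copied from the tables in this document rather than queried from the API, so they may drift from what the service accepts.

```python
# Values copied from the Parameters and Supported Languages tables above.
SUPPORTED_MODELS = {"saarika:v2.5", "saarika:v2"}
SUPPORTED_LANGUAGES = {
    "hi-IN", "ta-IN", "bn-IN", "te-IN", "kn-IN", "ml-IN",
    "mr-IN", "gu-IN", "pa-IN", "or-IN", "en-IN", "auto",
}


def validate_request(model, language_code,
                     with_timestamps=False, with_diarisation=False):
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    if model not in SUPPORTED_MODELS:
        errors.append(f"unknown model: {model!r}")
    if language_code not in SUPPORTED_LANGUAGES:
        errors.append(f"unsupported language_code: {language_code!r}")
    for name, value in (("with_timestamps", with_timestamps),
                        ("with_diarisation", with_diarisation)):
        if not isinstance(value, bool):
            errors.append(f"{name} must be a bool, got {type(value).__name__}")
    return errors
```

Returning a list instead of raising on the first problem lets a caller report every issue at once.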

Response

{
    "request_id": "abc123",
    "transcript": "नमस्ते, आप कैसे हैं?",
    "language_code": "hi-IN",
    "words": [
        {
            "word": "नमस्ते",
            "start": 0.0,
            "end": 0.5
        },
        {
            "word": "आप",
            "start": 0.6,
            "end": 0.8
        }
    ],
    "speaker_segments": [
        {
            "speaker": "SPEAKER_00",
            "start": 0.0,
            "end": 2.5
        }
    ]
}
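
As a rough illustration of how the `words` and `speaker_segments` fields relate, the sketch below assigns each word to the speaker segment that contains its start time. It assumes the response is available as a plain dict with the fields shown above (the SDK's response object may expose attributes instead), and `words_by_speaker` is a hypothetical helper.

```python
def words_by_speaker(response):
    """Group words under the speaker whose segment contains the word start."""
    segments = response.get("speaker_segments", [])
    grouped = {segment["speaker"]: [] for segment in segments}
    for word in response.get("words", []):
        for segment in segments:
            # A word belongs to the first segment covering its start time.
            if segment["start"] <= word["start"] < segment["end"]:
                grouped[segment["speaker"]].append(word["word"])
                break
    return grouped


# Sample data copied from the response shown above.
sample = {
    "transcript": "नमस्ते, आप कैसे हैं?",
    "words": [
        {"word": "नमस्ते", "start": 0.0, "end": 0.5},
        {"word": "आप", "start": 0.6, "end": 0.8},
    ],
    "speaker_segments": [
        {"speaker": "SPEAKER_00", "start": 0.0, "end": 2.5},
    ],
}

print(words_by_speaker(sample))
```

Words whose start time falls outside every segment are skipped in this sketch; a real integration would need to decide how to handle them.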

See references/streaming.md for detailed WebSocket documentation.