495 lines
13 KiB
Plaintext
495 lines
13 KiB
Plaintext
---
|
|
title: "ElevenLabs"
|
|
description: "ElevenLabs API conversion guide - text-to-speech, speech-to-text, voice settings, and model management"
|
|
icon: "pause"
|
|
---
|
|
|
|
## Overview
|
|
|
|
ElevenLabs is a specialized audio provider for text-to-speech and speech-to-text operations. Bifrost performs conversions including:
|
|
- **Model ID mapping** - Uses provider model identifier directly
|
|
- **Voice configuration** - Maps voice settings (stability, similarity, boost, speed, style)
|
|
- **Response format conversion** - Speech format handling (MP3, Opus, PCM/WAV)
|
|
- **Timestamp support** - Character-level timing alignment for TTS
|
|
- **Transcription with alignment** - Word and character-level timing, diarization, and additional formats
|
|
- **Pronunciation dictionaries** - Support for custom pronunciation rules
|
|
- **Voice quality parameters** - Stability, similarity boost, and speaker boost controls
|
|
|
|
### Supported Operations
|
|
|
|
| Operation | Non-Streaming | Streaming | Endpoint |
|
|
|-----------|---------------|-----------|----------|
|
|
| Speech (TTS) | ✅ | ✅ | `/v1/text-to-speech/{voice_id}` |
|
|
| Transcriptions (STT) | ✅ | - | `/v1/speech-to-text` |
|
|
| List Models | ✅ | - | `/v1/models` |
|
|
| Chat Completions | ❌ | ❌ | - |
|
|
| Responses API | ❌ | ❌ | - |
|
|
| Text Completions | ❌ | ❌ | - |
|
|
| Embeddings | ❌ | ❌ | - |
|
|
| Image Generation | ❌ | ❌ | - |
|
|
|
|
<Note>
|
|
**Unsupported Operations** (❌): Chat Completions, Responses API, Text Completions, and Embeddings are not supported by ElevenLabs (audio-focused provider). These return `UnsupportedOperationError`.
|
|
|
|
**Note**: ElevenLabs also supports a "Speech with Timestamps" endpoint at `/v1/text-to-speech/{voice_id}/with-timestamps` (non-streaming only) for enhanced timestamp information.
|
|
</Note>
|
|
|
|
---
|
|
|
|
# 1. Speech (Text-to-Speech)
|
|
|
|
## Request Parameters
|
|
|
|
### Core Parameters
|
|
|
|
| Parameter | Mapping | Notes |
|
|
|-----------|---------|-------|
|
|
| `input.input` | `text` | The text to convert to speech (required) |
|
|
| `model` | `model_id` | Model identifier (e.g., `"eleven_multilingual_v2"`) |
|
|
| `response_format` | Query param `output_format` | Speech format (see [Response Format](#response-format)) |
|
|
|
|
### Voice Configuration
|
|
|
|
Voice settings are optional and controlled via `params`:
|
|
|
|
| Parameter | ElevenLabs Mapping | Default | Range |
|
|
|-----------|-------------------|---------|-------|
|
|
| `speed` | `voice_settings.speed` | 1.0 | 0.5-2.0 |
|
|
| `extra_params.stability` | `voice_settings.stability` | 0.5 | 0-1.0 |
|
|
| `extra_params.similarity_boost` | `voice_settings.similarity_boost` | 0.75 | 0-1.0 |
|
|
| `extra_params.use_speaker_boost` | `voice_settings.use_speaker_boost` | true | boolean |
|
|
| `extra_params.style` | `voice_settings.style` | 0 | 0-1.0 |
|
|
|
|
### Advanced Parameters
|
|
|
|
Use `extra_params` for ElevenLabs-specific TTS features:
|
|
|
|
<Tabs>
|
|
<Tab title="Gateway">
|
|
|
|
```bash
|
|
curl -X POST http://localhost:8080/v1/audio/speech \
|
|
-H "Content-Type: application/json" \
|
|
-d '{
|
|
"model": "eleven_multilingual_v2",
|
|
"input": {"input": "Hello, how are you?"},
|
|
"voice": "21m00Tcm4TlvDq8ikWAM",
|
|
"response_format": "mp3",
|
|
"stability": 0.5,
|
|
"similarity_boost": 0.75,
|
|
"use_speaker_boost": true,
|
|
"style": 0,
|
|
"speed": 1.0,
|
|
"language_code": "en",
|
|
"seed": 42,
|
|
"previous_text": "Context text",
|
|
"next_text": "Future context",
|
|
"apply_text_normalization": "auto"
|
|
}'
|
|
```
|
|
|
|
</Tab>
|
|
<Tab title="Go SDK">
|
|
|
|
```go
|
|
resp, err := client.SpeechRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostSpeechRequest{
|
|
Provider: schemas.Elevenlabs,
|
|
Model: "eleven_multilingual_v2",
|
|
Input: &schemas.SpeechInput{
|
|
Input: "Hello, how are you?",
|
|
},
|
|
Params: &schemas.SpeechParameters{
|
|
VoiceConfig: &schemas.VoiceConfig{
|
|
Voice: schemas.Ptr("21m00Tcm4TlvDq8ikWAM"),
|
|
},
|
|
Speed: schemas.Ptr(1.0),
|
|
ResponseFormat: schemas.Ptr("mp3"),
|
|
ExtraParams: map[string]interface{}{
|
|
"stability": 0.5,
|
|
"similarity_boost": 0.75,
|
|
"use_speaker_boost": true,
|
|
"style": 0.0,
|
|
"language_code": "en",
|
|
"seed": 42,
|
|
"previous_text": "Context text",
|
|
"next_text": "Future context",
|
|
"apply_text_normalization": "auto",
|
|
},
|
|
},
|
|
})
|
|
```
|
|
|
|
</Tab>
|
|
</Tabs>
|
|
|
|
#### Advanced TTS Parameters
|
|
|
|
| Parameter | Type | Description |
|
|
|-----------|------|-------------|
|
|
| `language_code` | string | Language code (e.g., "en", "es") |
|
|
| `seed` | integer | Reproducible output (0-4294967295) |
|
|
| `previous_text` | string | Previous text context for consistency |
|
|
| `next_text` | string | Next text context for consistency |
|
|
| `previous_request_ids` | string[] | Previous request IDs for continuity |
|
|
| `next_request_ids` | string[] | Next request IDs for continuity |
|
|
| `apply_text_normalization` | string | Text normalization mode: `"auto"`, `"on"`, `"off"` |
|
|
| `apply_language_text_normalization` | boolean | Apply language-specific text normalization |
|
|
|
|
### Response Format
|
|
|
|
| Format | Output | Quality | Bitrate |
|
|
|--------|--------|---------|---------|
|
|
| `mp3` | MP3 | High | 128 kbps @ 44100 Hz |
|
|
| `opus` | Opus | High | 128 kbps @ 48000 Hz |
|
|
| `wav` / `pcm` | PCM WAV | Lossless | 16-bit @ 44100 Hz |
|
|
|
|
<Note>
|
|
Defaults to MP3 format if not specified. Format is passed via query parameter `output_format`.
|
|
</Note>
|
|
|
|
### Timestamps Support
|
|
|
|
To get character-level timing alignment, enable `with_timestamps`:
|
|
|
|
```json
|
|
{
|
|
"with_timestamps": true
|
|
}
|
|
```
|
|
|
|
When enabled, the endpoint `/v1/text-to-speech/{voice_id}/with-timestamps` is used and the response includes:
|
|
- `audio_base64` - Audio data as base64-encoded string
|
|
- `alignment.char_start_times_ms` - Character start times in milliseconds
|
|
- `alignment.char_end_times_ms` - Character end times in milliseconds
|
|
- `alignment.characters` - Array of characters
|
|
- `normalized_alignment` - Same as alignment but for normalized text
|
|
|
|
## Response Conversion
|
|
|
|
### Non-Timestamp Response
|
|
|
|
```json
|
|
{
|
|
"audio": "<binary audio data>"
|
|
}
|
|
```
|
|
|
|
### Timestamp Response
|
|
|
|
```json
|
|
{
|
|
"audio_base64": "<base64 encoded audio>",
|
|
"alignment": {
|
|
"char_start_times_ms": [0, 150, 280, ...],
|
|
"char_end_times_ms": [150, 280, 420, ...],
|
|
"characters": ["H", "e", "l", "l", "o", ...]
|
|
},
|
|
"normalized_alignment": {
|
|
"char_start_times_ms": [...],
|
|
"char_end_times_ms": [...],
|
|
"characters": [...]
|
|
}
|
|
}
|
|
```
|
|
|
|
## Streaming
|
|
|
|
Streaming speech returns audio in chunks as they are generated:
|
|
|
|
```json
|
|
{
|
|
"type": "audio.delta",
|
|
"audio": "<binary audio chunk>"
|
|
}
|
|
```
|
|
|
|
Final chunk:
|
|
```json
|
|
{
|
|
"type": "audio.done"
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
# 2. Transcription (Speech-to-Text)
|
|
|
|
## Request Parameters
|
|
|
|
### Input Source
|
|
|
|
Choose one of the following (mutually exclusive):
|
|
|
|
| Parameter | Type | Description |
|
|
|-----------|------|-------------|
|
|
| `input.file` | bytes | Audio file content (WAV, MP3, etc.) |
|
|
| `extra_params.cloud_storage_url` | string | URL to cloud-hosted audio file |
|
|
|
|
**Error**: Providing both or neither will result in error.
|
|
|
|
### Core Parameters
|
|
|
|
| Parameter | Mapping | Description |
|
|
|-----------|---------|-------------|
|
|
| `model` | `model_id` | Model identifier (required) |
|
|
| `params.language` | `language_code` | Language code (ISO 639-1, e.g., "en") |
|
|
|
|
### Advanced Parameters
|
|
|
|
Use `extra_params` for transcription-specific features:
|
|
|
|
<Tabs>
|
|
<Tab title="Gateway">
|
|
|
|
```bash
|
|
curl -X POST http://localhost:8080/v1/audio/transcriptions \
|
|
-F "file=@audio.wav" \
|
|
-F "model=eleven_latest" \
|
|
-F "language_code=en" \
|
|
-F "tag_audio_events=true" \
|
|
-F "num_speakers=2" \
|
|
-F "timestamps_granularity=word" \
|
|
-F "diarize=true" \
|
|
-F "diarization_threshold=0.5" \
|
|
-F "temperature=0.1" \
|
|
-F "seed=42" \
|
|
-F "use_multi_channel=true" \
|
|
-F "webhook=true" \
|
|
-F "webhook_id=webhook-123"
|
|
```
|
|
|
|
</Tab>
|
|
<Tab title="Go SDK">
|
|
|
|
```go
|
|
resp, err := client.TranscriptionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostTranscriptionRequest{
|
|
Provider: schemas.Elevenlabs,
|
|
Model: "eleven_latest",
|
|
Input: &schemas.TranscriptionInput{
|
|
File: audioBytes,
|
|
},
|
|
Params: &schemas.TranscriptionParameters{
|
|
Language: schemas.Ptr("en"),
|
|
ExtraParams: map[string]interface{}{
|
|
"tag_audio_events": true,
|
|
"num_speakers": 2,
|
|
"timestamps_granularity": "word",
|
|
"diarize": true,
|
|
"diarization_threshold": 0.5,
|
|
"temperature": 0.1,
|
|
"seed": 42,
|
|
"use_multi_channel": true,
|
|
"webhook": true,
|
|
"webhook_id": "webhook-123",
|
|
},
|
|
},
|
|
})
|
|
```
|
|
|
|
</Tab>
|
|
</Tabs>
|
|
|
|
#### Transcription Options
|
|
|
|
| Parameter | Type | Description |
|
|
|-----------|------|-------------|
|
|
| `tag_audio_events` | boolean | Tag audio events (background noise, music, etc.) |
|
|
| `num_speakers` | integer | Expected number of speakers (for diarization) |
|
|
| `timestamps_granularity` | string | Timestamp level: `"none"`, `"word"`, `"character"` |
|
|
| `diarize` | boolean | Identify different speakers |
|
|
| `diarization_threshold` | float | Speaker diarization sensitivity (0.0-1.0) |
|
|
| `file_format` | string | Input format: `"pcm_s16le_16"`, `"other"` |
|
|
| `temperature` | float | Transcription temperature (0.0-1.0) |
|
|
| `seed` | integer | Reproducible transcription |
|
|
| `use_multi_channel` | boolean | Process multi-channel audio separately |
|
|
| `webhook` | boolean | Enable webhook for async processing |
|
|
| `webhook_id` | string | Webhook endpoint ID |
|
|
| `webhook_metadata` | object/string | Additional webhook metadata |
|
|
| `cloud_storage_url` | string | URL to cloud-hosted audio (alternative to file) |
|
|
|
|
#### Additional Formats
|
|
|
|
Request multiple output formats simultaneously:
|
|
|
|
```json
|
|
{
|
|
"additional_formats": [
|
|
{
|
|
"format": "segmented_json",
|
|
"include_speakers": true,
|
|
"include_timestamps": true,
|
|
"segment_on_silence_longer_than_s": 1.0,
|
|
"max_segment_duration_s": 30.0
|
|
},
|
|
{
|
|
"format": "srt",
|
|
"max_segment_duration_s": 30.0
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
**Supported formats**: `segmented_json`, `docx`, `pdf`, `txt`, `html`, `srt`
|
|
|
|
## Response Conversion
|
|
|
|
### Basic Transcription
|
|
|
|
```json
|
|
{
|
|
"transcript": {
|
|
"language_code": "en",
|
|
"language_probability": 0.95,
|
|
"text": "Full transcribed text...",
|
|
"words": [
|
|
{
|
|
"text": "Hello",
|
|
"start": 0.0,
|
|
"end": 0.5,
|
|
"type": "word",
|
|
"speaker_id": "speaker_1",
|
|
"logprob": -0.05
|
|
}
|
|
]
|
|
}
|
|
}
|
|
```
|
|
|
|
### With Diarization
|
|
|
|
When `diarize: true`, the response includes speaker identification:
|
|
|
|
```json
|
|
{
|
|
"transcript": {
|
|
"text": "Hello how are you?",
|
|
"words": [
|
|
{
|
|
"text": "Hello",
|
|
"speaker_id": "speaker_1"
|
|
},
|
|
{
|
|
"text": "how",
|
|
"speaker_id": "speaker_2"
|
|
}
|
|
]
|
|
}
|
|
}
|
|
```
|
|
|
|
### With Timestamps
|
|
|
|
Character-level timing when `timestamps_granularity: "character"`:
|
|
|
|
```json
|
|
{
|
|
"words": [
|
|
{
|
|
"text": "Hello",
|
|
"characters": [
|
|
{"text": "H", "start": 0.0, "end": 0.1},
|
|
{"text": "e", "start": 0.1, "end": 0.2}
|
|
]
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
### With Additional Formats
|
|
|
|
```json
|
|
{
|
|
"transcript": { ... },
|
|
"additional_formats": [
|
|
{
|
|
"requested_format": "srt",
|
|
"file_extension": "srt",
|
|
"content_type": "text/plain",
|
|
"is_base64_encoded": false,
|
|
"content": "1\n00:00:00,000 --> 00:00:01,000\nHello\n\n2\n..."
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Caveats
|
|
|
|
<Accordion title="Voice ID Required">
|
|
**Severity**: High
|
|
**Behavior**: Voice ID must be provided for TTS requests
|
|
**Impact**: Request fails without voice configuration
|
|
**Code**: `elevenlabs.go:198-208`
|
|
</Accordion>
|
|
|
|
<Accordion title="File or URL Required for Transcription">
|
|
**Severity**: High
|
|
**Behavior**: Either `file` or `cloud_storage_url` must be provided (not both)
|
|
**Impact**: Request fails with ambiguous input
|
|
**Code**: `elevenlabs.go:471-478`
|
|
</Accordion>
|
|
|
|
<Accordion title="Audio Format Conversion">
|
|
**Severity**: Low
|
|
**Behavior**: Response formats (MP3, Opus, WAV) mapped via format string
|
|
**Impact**: Format parameter passed as query string to endpoint
|
|
**Code**: `elevenlabs.go:712-715`, `utils.go:5-35`
|
|
</Accordion>
|
|
|
|
<Accordion title="Timestamps as Separate Endpoint">
|
|
**Severity**: Low
|
|
**Behavior**: Timestamp requests use `/with-timestamps` endpoint variant
|
|
**Impact**: Switches endpoint based on `with_timestamps` flag
|
|
**Code**: `elevenlabs.go:195-205`
|
|
</Accordion>
|
|
|
|
<Accordion title="Multipart Form Data for Transcription">
|
|
**Severity**: Low
|
|
**Behavior**: Transcription uses multipart/form-data, not JSON
|
|
**Impact**: File and parameters sent as form fields
|
|
**Code**: `elevenlabs.go:480-690`
|
|
</Accordion>
|
|
|
|
---
|
|
|
|
# 3. List Models
|
|
|
|
## Request Parameters
|
|
|
|
| Parameter | Type | Description |
|
|
|-----------|------|-------------|
|
|
| (none) | - | No parameters required |
|
|
|
|
Returns available models with their capabilities and language support.
|
|
|
|
## Response Conversion
|
|
|
|
```json
|
|
{
|
|
"models": [
|
|
{
|
|
"model_id": "eleven_multilingual_v2",
|
|
"name": "Eleven Multilingual v2",
|
|
"description": "Multilingual speech synthesis",
|
|
"serves_pro_voices": true,
|
|
"token_cost_factor": 1.0,
|
|
"can_do_text_to_speech": true,
|
|
"can_do_voice_conversion": true,
|
|
"can_use_style": true,
|
|
"can_use_speaker_boost": true,
|
|
"languages": [
|
|
{"language_id": "en", "name": "English"},
|
|
{"language_id": "es", "name": "Spanish"}
|
|
],
|
|
"requires_alpha_access": false,
|
|
"max_characters_request_free_user": 1000,
|
|
"max_characters_request_subscribed_user": 100000,
|
|
"maximum_text_length_per_request": 5000,
|
|
"model_rates": {
|
|
"character_cost_multiplier": 1.0
|
|
}
|
|
}
|
|
]
|
|
}
|
|
``` |