first commit

2026-04-26 21:52:23 +03:00
commit 880f412e2c
2662 changed files with 866266 additions and 0 deletions
--- a/docs/providers/supported-providers/anthropic.mdx
+++ b/docs/providers/supported-providers/anthropic.mdx
@@ -0,0 +1,513 @@
+---
+title: "Anthropic"
+description: "Anthropic API conversion guide - structural differences, message handling, thinking/reasoning, and tool conversion"
+icon: "asterisk"
+---
+
+## Overview
+
+Anthropic's API differs significantly in structure from OpenAI's format. Bifrost performs extensive conversion, including:
+- **System message extraction** - Removed from the messages array and placed in a separate `system` field
+- **Tool message grouping** - Consecutive tool messages merged into a single user message
+- **Thinking block transformation** - `reasoning` parameters mapped to Anthropic's `thinking` structure
+- **Parameter renaming** - e.g., `max_completion_tokens` → `max_tokens`, `stop` → `stop_sequences`
+- **Content format conversion** - Images, files, and other content types adapted to Anthropic's schema
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v1/messages` |
+| Responses API | ✅ | ✅ | `/v1/messages` |
+| Text Completions | ✅ | ❌ | `/v1/complete` |
+| Embeddings | ❌ | ❌ | - |
+| Speech (TTS) | ❌ | ❌ | - |
+| Transcriptions (STT) | ❌ | ❌ | - |
+| Image Generation | ❌ | ❌ | - |
+| Files | ✅ | - | `/v1/files` |
+| Batch | ✅ | - | `/v1/messages/batches` |
+| List Models | ✅ | - | `/v1/models` |
+
+<Note>
+**Unsupported Operations** (❌): Embeddings, Speech, Transcriptions, and Image Generation are not supported by the upstream Anthropic API. These return `UnsupportedOperationError`.
+</Note>
+
+## Beta Headers
+
+Bifrost automatically manages Anthropic beta headers — detecting required headers from request features and injecting them. Headers are validated per provider to prevent unsupported headers from reaching the upstream API.
+
+| Beta Header | Anthropic | Azure | Vertex | Bedrock | Auto-Injected |
+|---|---|---|---|---|---|
+| `computer-use-2025-01-24` / `computer-use-2025-11-24` | ✅ | ✅ | ✅ | ✅ | ✅ (tool type detection) |
+| `structured-outputs-2025-11-13` | ✅ | ✅ | ❌ | ✅ | ✅ (strict/output_format) |
+| `advanced-tool-use-2025-11-20` | ✅ | ✅ | ❌ | ❌ | ✅ (defer_loading/input_examples/allowed_callers) |
+| `mcp-client-2025-11-20` | ✅ | ✅ | ❌ | ❌ | ✅ (mcp_servers detection) |
+| `prompt-caching-scope-2026-01-05` | ✅ | ✅ | ❌ | ❌ | ✅ (cache_control.scope) |
+| `compact-2026-01-12` | ✅ | ✅ | ✅ | ✅ | ✅ (compaction edit) |
+| `context-management-2025-06-27` | ✅ | ✅ | ✅ | ✅ | ✅ (clear edits) |
+| `files-api-2025-04-14` | ✅ | ✅ | ❌ | ❌ | ✅ (files endpoint) |
+| `interleaved-thinking-2025-05-14` | ✅ | ✅ | ✅ | ✅ | ✅ (thinking enabled/adaptive) |
+| `skills-2025-10-02` | ✅ | ✅ | ❌ | ❌ | Passthrough |
+| `context-1m-2025-08-07` | ✅ | ✅ | ✅ | ✅ | Passthrough |
+| `fast-mode-2026-02-01` | ✅ | ❌ | ❌ | ❌ | ✅ (speed=fast) |
+| `redact-thinking-2026-02-12` | ✅ | ✅ | ❌ | ❌ | Passthrough |
+
+<Note>
+**Passthrough headers** are not auto-injected but are validated and forwarded when set manually via the `anthropic-beta` request header. Unknown headers are forwarded to Anthropic only; for other providers (Vertex, Bedrock, Azure), unknown headers are silently dropped by default to prevent upstream errors.
+
+**Beta header overrides**: You can override the default support per provider via the Beta Headers tab in provider configuration, or by setting `beta_header_overrides` in the provider's `network_config`. See [Beta Header Overrides](/quickstart/gateway/provider-configuration#beta-header-overrides) for details.
+</Note>
+
+---
+
+# 1. Chat Completions
+
+## Request Parameters
+
+### Parameter Mapping
+
+| Parameter | Transformation |
+|-----------|----------------|
+| `max_completion_tokens` | Renamed to `max_tokens` |
+| `temperature`, `top_p` | Direct pass-through |
+| `stop` | Renamed to `stop_sequences` |
+| `response_format` | Converted to `output_format` |
+| `tools` | Schema restructured (see [Tool Conversion](#tool-conversion)) |
+| `tool_choice` | Type mapped (see [Tool Conversion](#tool-conversion)) |
+| `reasoning` | Mapped to `thinking` (see [Reasoning / Thinking](#reasoning--thinking)) |
+| `user` | Wrapped in `metadata.user_id` |
+| `top_k` | Via `extra_params` (Anthropic-specific) |
+
+### Dropped Parameters
+
+The following parameters are silently ignored: `frequency_penalty`, `presence_penalty`, `logit_bias`, `logprobs`, `top_logprobs`, `seed`, `parallel_tool_calls`, `service_tier`
+
+### Extra Parameters
+
+Use `extra_params` (SDK) or pass directly in request body (Gateway) for Anthropic-specific fields:
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "anthropic/claude-3-5-sonnet",
+    "messages": [{"role": "user", "content": "Hello"}],
+    "top_k": 40
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.ChatCompletionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostChatRequest{
+    Provider: schemas.Anthropic,
+    Model:    "claude-3-5-sonnet",
+    Input:    messages,
+    Params: &schemas.ChatParameters{
+        ExtraParams: map[string]interface{}{
+            "top_k": 40,
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+Anthropic also accepts a top-level `"cache_control": {"type": "ephemeral"}` object on `/anthropic/v1/messages` requests to enable automatic prompt caching. Bifrost forwards that directive unchanged.
+
+### Cache Control
+
+Cache directives can be added to system messages, user messages, and tool definitions to enable prompt caching:
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "anthropic/claude-3-5-sonnet",
+    "messages": [
+      {
+        "role": "user",
+        "content": [
+          {
+            "type": "text",
+            "text": "This is cached context",
+            "cache_control": {"type": "ephemeral"}
+          }
+        ]
+      }
+    ],
+    "system": [
+      {
+        "type": "text",
+        "text": "You are a helpful assistant",
+        "cache_control": {"type": "ephemeral"}
+      }
+    ]
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.ChatCompletionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostChatRequest{
+    Provider: schemas.Anthropic,
+    Model:    "claude-3-5-sonnet",
+    Input: []schemas.ChatMessage{
+        {
+            Role: schemas.ChatMessageRoleUser,
+            Content: &schemas.ChatMessageContent{
+                ContentBlocks: []schemas.ChatContentBlock{
+                    {
+                        Text: schemas.Ptr("This is cached context"),
+                        CacheControl: &schemas.CacheControl{
+                            Type: schemas.Ptr("ephemeral"),
+                        },
+                    },
+                },
+            },
+        },
+    },
+    SystemMessages: []schemas.ChatMessage{
+        {
+            Role: schemas.ChatMessageRoleSystem,
+            Content: &schemas.ChatMessageContent{
+                ContentBlocks: []schemas.ChatContentBlock{
+                    {
+                        Text: schemas.Ptr("You are a helpful assistant"),
+                        CacheControl: &schemas.CacheControl{
+                            Type: schemas.Ptr("ephemeral"),
+                        },
+                    },
+                },
+            },
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+## Reasoning / Thinking
+
+**Documentation**: See [Bifrost Reasoning Reference](/providers/reasoning)
+
+### Parameter Mapping
+
+- `reasoning.effort` → `thinking.type` (always mapped to `"enabled"`)
+- `reasoning.max_tokens` → `thinking.budget_tokens` (token budget for thinking)
+
+### Critical Constraints
+
+- **Minimum budget**: 1024 tokens required; requests below this **fail with error**
+- **Dynamic budget**: `-1` is converted to `1024` automatically
+
+### Example
+
+```json
+// Request
+{"reasoning": {"effort": "high", "max_tokens": 2048}}
+
+// Anthropic conversion
+{"thinking": {"type": "enabled", "budget_tokens": 2048}}
+```
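+
+A dynamic budget follows the same path. As a sketch based only on the constraints listed above, a request that passes `-1` would be sent with the 1024-token minimum:
+
+```json
+// Request
+{"reasoning": {"effort": "high", "max_tokens": -1}}
+
+// Anthropic conversion (-1 is replaced with the minimum budget)
+{"thinking": {"type": "enabled", "budget_tokens": 1024}}
+```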
+
+## Message Conversion
+
+### Critical Caveats
+
+- **System message extraction**: System messages are **removed from the messages array** and placed in a separate `system` field. Multiple system messages become separate text blocks in the `system` array.
+- **Tool message grouping**: Consecutive tool messages are **merged into a single user message** with `tool_result` content blocks.
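+
+The two caveats can be illustrated with a before/after pair. This is a hedged sketch, not captured output: the tool name, IDs, and message text are invented for illustration, and details such as whether plain user content stays a string are assumptions.
+
+```json
+// Request (OpenAI format)
+{
+  "messages": [
+    {"role": "system", "content": "You are a helpful assistant"},
+    {"role": "user", "content": "Weather in Paris and Rome?"},
+    {"role": "assistant", "tool_calls": [
+      {"id": "call_1", "type": "function", "function": {"name": "get_weather", "arguments": "{\"city\":\"Paris\"}"}},
+      {"id": "call_2", "type": "function", "function": {"name": "get_weather", "arguments": "{\"city\":\"Rome\"}"}}
+    ]},
+    {"role": "tool", "tool_call_id": "call_1", "content": "18C, cloudy"},
+    {"role": "tool", "tool_call_id": "call_2", "content": "24C, sunny"}
+  ]
+}
+
+// Anthropic conversion (system extracted, two tool messages merged into one user message)
+{
+  "system": [{"type": "text", "text": "You are a helpful assistant"}],
+  "messages": [
+    {"role": "user", "content": "Weather in Paris and Rome?"},
+    {"role": "assistant", "content": [
+      {"type": "tool_use", "id": "call_1", "name": "get_weather", "input": {"city": "Paris"}},
+      {"type": "tool_use", "id": "call_2", "name": "get_weather", "input": {"city": "Rome"}}
+    ]},
+    {"role": "user", "content": [
+      {"type": "tool_result", "tool_use_id": "call_1", "content": "18C, cloudy"},
+      {"type": "tool_result", "tool_use_id": "call_2", "content": "24C, sunny"}
+    ]}
+  ]
+}
+```
+
+The input has five messages and the converted request has three, which is why message counts and indices cannot be compared directly across the two formats.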
+
+### Image Conversion
+
+- **URL images**: `{"type": "image_url", "image_url": {}}` → `{"type": "image", "source": {"type": "url", ...}}`
+- **Base64 images**: Data URL → `{"type": "image", "source": {"type": "base64", "media_type": "image/png", ...}}`
+
+### Cache Control Locations
+
+Cache directives supported on: system content blocks, user message content blocks, tool definitions (see [Cache Control](#cache-control) examples above)
+
+## Tool Conversion
+
+Tool definitions are restructured: `function.name` → `name`, `function.parameters` → `input_schema`, `function.strict` is dropped.
+
+Tool choice mapping: `"auto"` → `auto` | `"none"` → `none` | `"required"` → `any` | Specific tool → `{"type": "tool", "name": "X"}`
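+
+As an illustration of both mappings, the sketch below converts one function tool and `tool_choice: "required"`. The tool itself is hypothetical, and carrying `description` over unchanged is an assumption; the renames and the dropped `strict` flag follow the rules stated above.
+
+```json
+// Request
+{
+  "tools": [{
+    "type": "function",
+    "function": {
+      "name": "get_weather",
+      "description": "Get the current weather for a city",
+      "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
+      "strict": true
+    }
+  }],
+  "tool_choice": "required"
+}
+
+// Anthropic conversion
+{
+  "tools": [{
+    "name": "get_weather",
+    "description": "Get the current weather for a city",
+    "input_schema": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]}
+  }],
+  "tool_choice": {"type": "any"}
+}
+```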
+
+## Response Conversion
+
+### Field Mapping
+
+- `stop_reason` → `finish_reason`: `end_turn`/`stop_sequence` → `stop`, `max_tokens` → `length`, `tool_use` → `tool_calls`
+- `input_tokens + cache_read_input_tokens + cache_creation_input_tokens` → `prompt_tokens` (all cache counts rolled into the total)
+- Cache token breakdown surfaced in `prompt_tokens_details`:
+  - `cache_read_input_tokens` → `prompt_tokens_details.cached_read_tokens`
+  - `cache_creation_input_tokens` → `prompt_tokens_details.cached_write_tokens`
+- `output_tokens` → `completion_tokens`
+- `thinking` blocks → `reasoning_details` with index, type, text, and signature fields
+- Tool call arguments converted from JSON object → JSON string
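+
+A worked example of the usage rollup, with invented token counts. The placement of `finish_reason` under `choices` assumes the standard OpenAI response layout, and both objects are excerpts rather than full responses:
+
+```json
+// Anthropic response (excerpt)
+{
+  "stop_reason": "end_turn",
+  "usage": {
+    "input_tokens": 100,
+    "cache_read_input_tokens": 400,
+    "cache_creation_input_tokens": 50,
+    "output_tokens": 20
+  }
+}
+
+// Bifrost response (excerpt): prompt_tokens = 100 + 400 + 50
+{
+  "choices": [{"finish_reason": "stop"}],
+  "usage": {
+    "prompt_tokens": 550,
+    "completion_tokens": 20,
+    "prompt_tokens_details": {"cached_read_tokens": 400, "cached_write_tokens": 50}
+  }
+}
+```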
+
+## Streaming
+
+Event sequence: `message_start` → `content_block_start` → `content_block_delta` → `content_block_stop` → `message_delta` → `message_stop`
+
+Delta types: `text_delta` → content | `input_json_delta` → tool arguments | `thinking_delta` → reasoning text | `signature_delta` → reasoning signature
+
+---
+
+## Caveats
+
+<Accordion title="System Message Extraction">
+**Severity**: High
+**Behavior**: System messages removed from array, placed in separate `system` field
+**Impact**: Message array structure differs from input
+**Code**: `chat.go:145-167`
+</Accordion>
+
+<Accordion title="Tool Message Grouping">
+**Severity**: High
+**Behavior**: Consecutive tool messages merged into single user message
+**Impact**: Message count and structure changes
+**Code**: `chat.go:169-216`
+</Accordion>
+
+<Accordion title="Minimum Reasoning Budget">
+**Severity**: High
+**Behavior**: `reasoning.max_tokens` must be >= 1024
+**Impact**: Requests with lower values **fail with error**
+**Code**: `chat.go:113-115`
+</Accordion>
+
+<Accordion title="Dynamic Budget Conversion">
+**Severity**: Medium
+**Behavior**: `reasoning.max_tokens = -1` converted to `1024`
+**Impact**: Dynamic budgeting not supported
+**Code**: `chat.go:107-111`
+</Accordion>
+
+<Accordion title="Strict Tool Mode Dropped">
+**Severity**: Medium
+**Behavior**: `strict: true` in tool definitions silently dropped
+**Impact**: No schema validation enforcement
+**Code**: `chat.go:43-72`
+</Accordion>
+
+<Accordion title="Arguments Serialization">
+**Severity**: Low
+**Behavior**: Tool call `input` (object) serialized to `arguments` (JSON string)
+**Code**: `chat.go:341-350`
+</Accordion>
+
+---
+
+# 2. Responses API
+
+The Responses API uses the same underlying `/v1/messages` endpoint but converts between OpenAI's Responses format and Anthropic's Messages format.
+
+## Request Parameters
+
+### Parameter Mapping
+
+| Parameter | Transformation |
+|-----------|----------------|
+| `max_output_tokens` | Renamed to `max_tokens` |
+| `temperature`, `top_p` | Direct pass-through |
+| `instructions` | Becomes system message |
+| `tools` | Schema restructured (see [Chat Completions](#1-chat-completions)) |
+| `tool_choice` | Type mapped (see [Chat Completions](#1-chat-completions)) |
+| `reasoning` | Mapped to `thinking` (see [Reasoning / Thinking](#reasoning--thinking)) |
+| `user` | Wrapped in `metadata.user_id` |
+| `text` | Converted to `output_format` |
+| `include` | Via `extra_params` (Anthropic-specific) |
+| `stop` | Via `extra_params`, renamed to `stop_sequences` |
+| `top_k` | Via `extra_params` (Anthropic-specific) |
+| `truncation` | Auto-set to `"auto"` for computer tools |
+
+### Extra Parameters
+
+Use `extra_params` (SDK) or pass directly in request body (Gateway):
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/responses \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "anthropic/claude-3-5-sonnet",
+    "input": "Hello, how are you?",
+    "top_k": 40
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.ResponsesRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostResponsesRequest{
+    Provider: schemas.Anthropic,
+    Model:    "claude-3-5-sonnet",
+    Input:    messages,
+    Params: &schemas.ResponsesParameters{
+        ExtraParams: map[string]interface{}{
+            "top_k": 40,
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+### Cache Control
+
+Cache directives can be added to instructions (system) and input messages to enable prompt caching:
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/responses \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "anthropic/claude-3-5-sonnet",
+    "instructions": "You are a helpful assistant. This instruction is cached.",
+    "instructions_cache_control": {"type": "ephemeral"},
+    "input": [
+      {
+        "type": "text",
+        "text": "Answer this question",
+        "cache_control": {"type": "ephemeral"}
+      }
+    ]
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.ResponsesRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostResponsesRequest{
+    Provider: schemas.Anthropic,
+    Model:    "claude-3-5-sonnet",
+    Input: []schemas.ChatMessage{
+        {
+            Role: schemas.ChatMessageRoleUser,
+            Content: &schemas.ChatMessageContent{
+                ContentBlocks: []schemas.ChatContentBlock{
+                    {
+                        Text: schemas.Ptr("Answer this question"),
+                        CacheControl: &schemas.CacheControl{
+                            Type: schemas.Ptr("ephemeral"),
+                        },
+                    },
+                },
+            },
+        },
+    },
+    Params: &schemas.ResponsesParameters{
+        Instructions: schemas.Ptr("You are a helpful assistant. This instruction is cached."),
+        InstructionsCacheControl: &schemas.CacheControl{
+            Type: schemas.Ptr("ephemeral"),
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+## Input & Instructions
+
+- **Input**: String wrapped as user message or array converted to messages
+- **Instructions**: Becomes system message (same extraction as [Chat Completions](#1-chat-completions))
+
+## Tool Support
+
+Supported types: `function`, `computer_use_preview`, `web_search`, `mcp`
+
+Tool conversions same as [Chat Completions](#1-chat-completions) with: MCP tools mapped to `mcp_servers` (server_label → name, server_url → url) and computer tools auto-set with `truncation: "auto"`
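+
+For the MCP mapping, a minimal sketch showing only the two renamed fields. The label and URL are placeholders, and any other fields Bifrost sets on the `mcp_servers` entry are omitted here rather than guessed:
+
+```json
+// Request tool (Responses format)
+{"type": "mcp", "server_label": "docs", "server_url": "https://example.com/mcp"}
+
+// Anthropic conversion (entry in mcp_servers, other fields omitted)
+{"name": "docs", "url": "https://example.com/mcp"}
+```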
+
+Cache control supported on instructions and input blocks (see [Cache Control](#cache-control) examples)
+
+## Response Conversion
+
+- `stop_reason` → `status`: `end_turn`/`stop_sequence` → `completed`, `max_tokens` → `incomplete`
+- Top-level `input_tokens` and `output_tokens` are rollups that include cache-related usage; they map as `input_tokens` → `input_tokens` | `output_tokens` → `output_tokens`.
+- Cache-specific counts are exposed in details: `cache_read_input_tokens` → `input_tokens_details.cached_read_tokens` | `cache_creation_input_tokens` → `input_tokens_details.cached_write_tokens`
+- Output items: `text` → `message` | `tool_use` → `function_call` | `thinking` → `reasoning`
+
+## Streaming
+
+Event sequence: `message_start` → `content_block_start` → `content_block_delta` → `content_block_stop` → `message_delta` → `message_stop`
+
+Special handling:
+- Computer tool arguments accumulated across chunks (emitted on `content_block_stop`)
+- Synthetic `content_part.added` events emitted for text/reasoning
+- MCP calls use `mcp_call_arguments_delta`
+- Item IDs generated as `msg_{messageID}_item_{outputIndex}`
+
+---
+
+# 3. Text Completions (Legacy)
+
+<Warning>
+Legacy API using `/v1/complete` endpoint. Streaming not supported.
+</Warning>
+
+**Request**: `prompt` auto-wrapped with `\n\nHuman: {prompt}\n\nAssistant:` | `max_tokens` → `max_tokens_to_sample` | `temperature`, `top_p` direct pass-through | `top_k`, `stop` via `extra_params` (→ `stop_sequences`)
+
+**Response**: `completion` → `choices[0].text` | `stop_reason` → `finish_reason`
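+
+A sketch of the request-side wrapping described above, with the `model` field and other parameters omitted for brevity. The prompt text is illustrative:
+
+```json
+// Request
+{"prompt": "Say hello", "max_tokens": 100}
+
+// Anthropic conversion
+{"prompt": "\n\nHuman: Say hello\n\nAssistant:", "max_tokens_to_sample": 100}
+```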
+
+---
+
+# 4. Batch API
+
+**Request formats**: `requests` array (CustomID + Params) or `input_file_id`
+
+**Pagination**: Cursor-based with `after_id`, `before_id`, `limit`
+
+**Endpoints**:
+- POST `/v1/messages/batches` - Create
+- GET `/v1/messages/batches` - List
+- GET `/v1/messages/batches/{batch_id}` - Retrieve
+- POST `/v1/messages/batches/{batch_id}/cancel` - Cancel
+
+**Response**: JSONL format with `{custom_id, result: {type, message}}`
+
+**Status mapping**: `in_progress` → `InProgress`, `canceling` → `Cancelling`, `ended` → `Ended`
+
+**Note**: RFC3339Nano timestamps converted to Unix, multi-key retry supported
+
+---
+
+# 5. Files API
+
+<Note>
+Requires beta header: `anthropic-beta: files-api-2025-04-14`
+</Note>
+
+**Upload**: Multipart/form-data with `file` (required) and `filename` (optional)
+
+**Field mapping**: `id` | `filename` | `size_bytes` → `bytes` | `created_at` (Unix) | `mime_type` → `content_type`
+
+**Endpoints**: POST `/v1/files`, GET `/v1/files` (cursor pagination), GET `/v1/files/{file_id}`, DELETE `/v1/files/{file_id}`, GET `/v1/files/{file_id}/content`
+
+**Note**: File purpose always `"batch"`, status always `"processed"`
+
+---
+
+# 6. List Models
+
+**Request**: GET `/v1/models?limit={defaultPageSize}` (no body)
+
+**Field mapping**: `id` (prefixed `anthropic/`) | `display_name` → `name` | `created_at` (Unix timestamp)
+
+**Pagination**: Token-based with `NextPageToken`, `FirstID`, `LastID`
+
+**Multi-key support**: Results aggregated from all keys, filtered by `allowed_models` if configured
--- a/docs/providers/supported-providers/azure.mdx
+++ b/docs/providers/supported-providers/azure.mdx
--- a/docs/providers/supported-providers/bedrock.mdx
+++ b/docs/providers/supported-providers/bedrock.mdx
--- a/docs/providers/supported-providers/cerebras.mdx
+++ b/docs/providers/supported-providers/cerebras.mdx
@@ -0,0 +1,122 @@
+---
+title: "Cerebras"
+description: "Cerebras API conversion guide - OpenAI-compatible format, full feature support, streaming, tool calling, and parameter handling"
+icon: "c"
+---
+
+## Overview
+
+Cerebras is a **fully OpenAI-compatible provider** leveraging the complete set of OpenAI API features. Bifrost delegates all functionality to the OpenAI provider implementation with standard parameter filtering. Key characteristics:
+- **Complete OpenAI compatibility** - All chat, text, and streaming features supported
+- **Full tool calling** - Function definitions and parallel tool execution
+- **Streaming support** - Server-Sent Events with token usage tracking
+- **Parameter preservation** - Passes through all standard OpenAI parameters
+- **Responses API** - Full support with format conversion
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v1/chat/completions` |
+| Responses API | ✅ | ✅ | `/v1/chat/completions` |
+| Text Completions | ✅ | ✅ | `/v1/completions` |
+| List Models | ✅ | - | `/v1/models` |
+| Embeddings | ❌ | ❌ | - |
+| Image Generation | ❌ | ❌ | - |
+| Speech (TTS) | ❌ | ❌ | - |
+| Transcriptions (STT) | ❌ | ❌ | - |
+| Files | ❌ | ❌ | - |
+| Batch | ❌ | ❌ | - |
+
+<Note>
+**Unsupported Operations** (❌): Embeddings, Image Generation, Speech, Transcriptions, Files, and Batch are not supported by the upstream Cerebras API. These return `UnsupportedOperationError`.
+</Note>
+
+---
+
+# 1. Chat Completions
+
+## Request Parameters
+
+Cerebras supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+### Filtered Parameters
+
+Removed for Cerebras compatibility:
+- `prompt_cache_key` - Not supported
+- `verbosity` - Anthropic-specific
+- `store` - Not supported
+- `service_tier` - OpenAI-specific
+
+### Reasoning Parameter
+
+Cerebras delegates to OpenAI via `ToOpenAIChatRequest`, so reasoning parameters are transformed: `reasoning.effort` values (e.g., `minimal` → `low`) are mapped per the OpenAI-compatible providers convention, and `reasoning.max_tokens` is cleared/omitted (removed during conversion).
+
+Cerebras supports all standard OpenAI message types, tools, responses, and streaming formats. For details on message handling, tool conversion, responses, and streaming, refer to [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+---
+
+# 2. Responses API
+
+Bifrost converts the Responses API format to Chat Completions internally, then converts the response back:
+
+```
+BifrostResponsesRequest
+  → ToChatRequest()
+  → ChatCompletion
+  → ToBifrostResponsesResponse()
+```
+
+Same parameter support as Chat Completions with response format differences (output items instead of message content).
+
+---
+
+# 3. Text Completions
+
+Cerebras supports the legacy text completion API:
+
+| Parameter | Mapping |
+|-----------|---------|
+| `prompt` | Sent as-is |
+| `max_tokens` | max_tokens |
+| `temperature` | temperature |
+| `top_p` | top_p |
+| `stop` | stop sequences |
+
+Response returns `choices[].text` with completion text.
+
+---
+
+# 4. Text Completions Streaming
+
+Streaming text completions use the same SSE format as chat streaming.
+
+---
+
+# 5. List Models
+
+Lists available models from Cerebras with capabilities and context length information.
+
+---
+
+## Unsupported Features
+
+| Feature | Reason |
+|---------|--------|
+| Embedding | Not offered by Cerebras API |
+| Image Generation | Not offered by Cerebras API |
+| Speech/TTS | Not offered by Cerebras API |
+| Transcription/STT | Not offered by Cerebras API |
+| Batch Operations | Not offered by Cerebras API |
+| File Management | Not offered by Cerebras API |
+
+---
+
+## Caveats
+
+<Accordion title="User Field Size Limit">
+**Severity**: Low
+**Behavior**: User field > 64 characters is silently dropped
+**Impact**: Longer user identifiers are lost
+**Code**: SanitizeUserField enforces 64-char max
+</Accordion>
--- a/docs/providers/supported-providers/cohere.mdx
+++ b/docs/providers/supported-providers/cohere.mdx
@@ -0,0 +1,376 @@
+---
+title: "Cohere"
+description: "Cohere API conversion guide - parameter mapping, message handling, reasoning/thinking, and tool conversion"
+icon: "c"
+---
+
+## Overview
+
+Cohere's API structure differs from OpenAI's format. Bifrost performs conversions including:
+- **Parameter renaming** - e.g., `max_completion_tokens` → `max_tokens`, `top_p` → `p`, `stop` → `stop_sequences`
+- **Message content conversion** - String and content block formats handled
+- **Tool conversion** - Tool definitions and tool choice mapped to Cohere format
+- **Thinking/Reasoning transformation** - `reasoning` parameters mapped to Cohere's `thinking` structure
+- **Response format conversion** - JSON schema handling adapted to Cohere's format
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v2/chat` |
+| Responses API | ✅ | ✅ | `/v2/chat` |
+| Embeddings | ✅ | - | `/v2/embed` |
+| List Models | ✅ | - | `/v1/models` |
+| Text Completions | ❌ | ❌ | - |
+| Image Generation | ❌ | ❌ | - |
+| Speech (TTS) | ❌ | ❌ | - |
+| Transcriptions (STT) | ❌ | ❌ | - |
+| Files | ❌ | ❌ | - |
+| Batch | ❌ | ❌ | - |
+
+<Note>
+**Unsupported Operations** (❌): Text Completions, Image Generation, Speech, Transcriptions, Files, and Batch are not supported by the upstream Cohere API. These return `UnsupportedOperationError`.
+</Note>
+
+---
+
+# 1. Chat Completions
+
+## Request Parameters
+
+### Parameter Mapping
+
+| Parameter | Transformation |
+|-----------|----------------|
+| `max_completion_tokens` | Renamed to `max_tokens` |
+| `temperature`, `top_p` → `p` | Direct pass-through for temperature; `top_p` renamed to `p` |
+| `stop` | Renamed to `stop_sequences` |
+| `frequency_penalty`, `presence_penalty` | Direct pass-through |
+| `response_format` | Converted to structured format (see [Response Format](#response-format)) |
+| `tools` | Schema structure adapted (see [Tool Conversion](#tool-conversion)) |
+| `tool_choice` | Type mapped (see [Tool Conversion](#tool-conversion)) |
+| `reasoning` | Mapped to `thinking` (see [Reasoning / Thinking](#reasoning--thinking)) |
+| `user` | Via `extra_params` (not directly supported in Cohere v2 API) |
+| `top_k` | Via `extra_params` (Cohere-specific) |
+
+### Dropped Parameters
+
+The following parameters are silently ignored: `logit_bias`, `logprobs`, `top_logprobs`, `seed`, `parallel_tool_calls`, `service_tier`
+
+### Extra Parameters
+
+Use `extra_params` (SDK) or pass directly in request body (Gateway) for Cohere-specific fields:
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "cohere/command-r-plus",
+    "messages": [{"role": "user", "content": "Hello"}],
+    "top_k": 40,
+    "safety_mode": "STRICT",
+    "log_probs": true,
+    "strict_tool_choice": false
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.ChatCompletionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostChatRequest{
+    Provider: schemas.Cohere,
+    Model:    "command-r-plus",
+    Input:    messages,
+    Params: &schemas.ChatParameters{
+        ExtraParams: map[string]interface{}{
+            "top_k": 40,
+            "safety_mode": "STRICT",
+            "log_probs": true,
+            "strict_tool_choice": false,
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+## Reasoning / Thinking
+
+**Documentation**: See [Bifrost Reasoning Reference](/providers/reasoning)
+
+### Parameter Mapping
+
+- `reasoning.effort` → `thinking.type` (mapped to `"enabled"` or `"disabled"`)
+- `reasoning.max_tokens` → `thinking.token_budget` (token budget for thinking)
+
+### Critical Constraints
+
+- **Minimum budget**: 1 token required; requests with 0 tokens will be converted to disabled
+- **Dynamic budget**: `-1` is converted to `1` automatically
+
+### Example
+
+```json
+// Request
+{"reasoning": {"effort": "high", "max_tokens": 2048}}
+
+// Cohere conversion
+{"thinking": {"type": "enabled", "token_budget": 2048}}
+```
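+
+The two edge cases in the constraints above can be sketched the same way. The exact shape of a disabled `thinking` object is an assumption:
+
+```json
+// Request with a dynamic budget
+{"reasoning": {"effort": "high", "max_tokens": -1}}
+
+// Cohere conversion (-1 becomes the 1-token minimum)
+{"thinking": {"type": "enabled", "token_budget": 1}}
+
+// Request with a zero budget
+{"reasoning": {"max_tokens": 0}}
+
+// Cohere conversion (thinking is disabled)
+{"thinking": {"type": "disabled"}}
+```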
+
+## Message Conversion
+
+### Content Handling
+
+- **String content**: Messages can have simple string content
+- **Content blocks**: Messages can have arrays of content blocks (text, images, thinking)
+- **Image conversion**: `image_url` blocks with URL are supported
+- **Tool calls**: Assistant message tool calls are converted to Cohere format
+- **Tool messages**: Tool call results are passed with `tool_call_id`
+
+## Tool Conversion
+
+Tool definitions are adapted to Cohere format with the following mappings:
+- Function `name` → `name` (unchanged)
+- Function `parameters` → `parameters` (flexible JSON format)
+- Strict mode (`strict: true`) is silently dropped (not supported)
+
+Tool choice mapping:
+- `"none"` → `"NONE"`
+- `"auto"` → `"AUTO"`
+- `"required"` → `"REQUIRED"`
+- Specific tool selection → `"REQUIRED"` (Cohere uses function-level selection)
+
+## Response Format
+
+Supported formats:
+- `text` - Plain text response
+- `json_object` - Structured JSON response
+- `json_schema` - JSON with schema validation (converted to `json_object`)
+
+Schema is passed through `response_format.json_schema` field.
+
+## Response Conversion
+
+### Field Mapping
+
+- `finish_reason`: `COMPLETE` / `STOP_SEQUENCE` → `stop`, `MAX_TOKENS` → `length`, `TOOL_CALL` → `tool_calls`
+- `input_tokens` → `prompt_tokens` | `output_tokens` → `completion_tokens`
+- `cached_tokens` → `prompt_tokens_details.cached_tokens` (if present)
+- Tool call arguments are passed through unchanged (Cohere already returns them as strings, so no conversion is needed)
+
+## Streaming
+
+Event sequence: `message-start` → `content-start` → `content-delta` → `content-end` → `message-end`
+
+Delta types:
+- `content-delta` with text → message content
+- `content-delta` with thinking → reasoning text
+- `tool-call-start/delta/end` → tool call events
+- `tool-plan-delta` → tool planning output
+
+---
+
+## Caveats
+
+<Accordion title="Minimum Thinking Budget">
+**Severity**: Low
+**Behavior**: `reasoning.max_tokens` must be >= 1
+**Impact**: Very low impact, conversion happens automatically
+**Code**: `chat.go:104-130`
+</Accordion>
+
+<Accordion title="Top P Renamed">
+**Severity**: Low
+**Behavior**: `top_p` parameter renamed to `p`
+**Impact**: Parameter name changes internally
+**Code**: `chat.go:99`
+</Accordion>
+
+<Accordion title="Strict Tool Mode Dropped">
+**Severity**: Low
+**Behavior**: `strict: true` in tool definitions silently dropped
+**Impact**: No schema validation enforcement
+**Code**: `chat.go:168-185`
+</Accordion>
+
+<Accordion title="Tool Arguments Format">
+**Severity**: Low
+**Behavior**: Tool arguments are already strings, no JSON serialization needed
+**Impact**: Minimal - Cohere v2 API expects string format
+**Code**: `chat.go:70-78`
+</Accordion>
+
+---
+
+# 2. Responses API
+
+The Responses API uses the same underlying `/v2/chat` endpoint but converts between OpenAI's Responses format and Cohere's format.
+
+## Request Parameters
+
+### Parameter Mapping
+
+| Parameter | Transformation |
+|-----------|----------------|
+| `max_output_tokens` | Renamed to `max_tokens` |
+| `temperature`, `top_p` → `p` | Direct pass-through for temperature; `top_p` renamed to `p` |
+| `instructions` | Becomes system message |
+| `text.format` | Converted to `response_format` |
+| `tools` | Schema restructured (see [Chat Completions](#1-chat-completions)) |
+| `tool_choice` | Type mapped (see [Chat Completions](#1-chat-completions)) |
+| `reasoning` | Mapped to `thinking` (see [Reasoning / Thinking](#reasoning--thinking)) |
+| `stop` | Via `extra_params`, renamed to `stop_sequences` |
+| `top_k` | Via `extra_params` (Cohere-specific) |
+| `frequency_penalty`, `presence_penalty` | Via `extra_params` |
+
+### Extra Parameters
+
+Use `extra_params` (SDK) or pass directly in request body (Gateway):
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/responses \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "cohere/command-r-plus",
+    "input": "Hello, how are you?",
+    "top_k": 40,
+    "stop": [".", "!"]
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.ResponsesRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostResponsesRequest{
+    Provider: schemas.Cohere,
+    Model:    "cohere/command-r-plus",
+    Input:    messages,
+    Params: &schemas.ResponsesParameters{
+        ExtraParams: map[string]interface{}{
+            "top_k": 40,
+            "stop": []string{".", "!"},
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+## Input & Instructions
+
+- **Input**: A string is converted to a single user message; an array is converted to a list of messages
+- **Instructions**: Becomes system message (prepended to messages)
+
+## Tool Support
+
+Supported types: `function`
+
+Tool conversions same as [Chat Completions](#1-chat-completions).
+
+## Response Conversion
+
+- `text` → `message`
+- `tool_use` → `function_call`
+- `input_tokens` / `output_tokens` preserved
+- Token details with cached tokens support
+
+## Streaming
+
+Event sequence: `message-start` → `content-start` → `content-delta` → `content-end` → `message-end`
+
+Special handling:
+- Tool call arguments accumulated across chunks
+- Synthetic `output_item.added` events emitted for text/reasoning
+- Stable item IDs generated as `msg_{messageID}_item_{outputIndex}`
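+
+The special handling above can be sketched as follows. This is a minimal illustration, not Bifrost's internal code: `itemID` and `argAccumulator` are hypothetical names, and the only detail taken from this page is the `msg_{messageID}_item_{outputIndex}` ID shape and the fact that argument fragments are concatenated across chunks.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// itemID builds an ID in the msg_{messageID}_item_{outputIndex} form.
+// Hypothetical helper for illustration only.
+func itemID(messageID string, outputIndex int) string {
+	return fmt.Sprintf("msg_%s_item_%d", messageID, outputIndex)
+}
+
+// argAccumulator collects streamed tool call argument fragments,
+// keyed by output index. Hypothetical type, not a Bifrost API.
+type argAccumulator struct {
+	parts map[int]*strings.Builder
+}
+
+func newArgAccumulator() *argAccumulator {
+	return &argAccumulator{parts: make(map[int]*strings.Builder)}
+}
+
+// add appends one fragment for the tool call at outputIndex.
+func (a *argAccumulator) add(outputIndex int, fragment string) {
+	b, ok := a.parts[outputIndex]
+	if !ok {
+		b = &strings.Builder{}
+		a.parts[outputIndex] = b
+	}
+	b.WriteString(fragment)
+}
+
+// arguments returns everything accumulated so far for outputIndex.
+func (a *argAccumulator) arguments(outputIndex int) string {
+	if b, ok := a.parts[outputIndex]; ok {
+		return b.String()
+	}
+	return ""
+}
+
+func main() {
+	acc := newArgAccumulator()
+	for _, chunk := range []string{`{"city":`, `"Paris"}`} {
+		acc.add(1, chunk)
+	}
+	fmt.Println(itemID("abc123", 1), acc.arguments(1))
+}
+```
+
+Because the ID depends only on the message ID and the output index, every chunk belonging to the same item resolves to the same identifier without extra state.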
+
+---
+
+# 3. Embeddings
+
+## Request Parameters
+
+### Parameter Mapping
+
+| Parameter | Transformation |
+|-----------|----------------|
+| `input` (text or array) | Converted to `texts` array |
+| `dimensions` | Renamed to `output_dimension` |
+| `input_type` | Via `extra_params` (required by Cohere v3+ models; defaults to `"search_document"` when omitted) |
+| `embedding_types` | Via `extra_params` (array of embedding types) |
+| `truncate` | Via `extra_params` (how to handle long inputs) |
+| `max_tokens` | Via `extra_params` (max tokens to embed per input) |
+
+### Extra Parameters
+
+Use `extra_params` for Cohere-specific embedding options:
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/embeddings \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "cohere/embed-english-v3.0",
+    "input": ["text to embed"],
+    "input_type": "search_query",
+    "embedding_types": ["float"],
+    "truncate": "START"
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.EmbeddingRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostEmbeddingRequest{
+    Provider: schemas.Cohere,
+    Model:    "cohere/embed-english-v3.0",
+    Input: &schemas.EmbeddingInput{
+        Texts: []string{"text to embed"},
+    },
+    Params: &schemas.EmbeddingParameters{
+        Dimensions: schemas.Ptr(1024),
+        ExtraParams: map[string]interface{}{
+            "input_type": "search_query",
+            "embedding_types": []string{"float"},
+            "truncate": "START",
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+### Critical Notes
+
+- **Input Type Required**: Cohere v3+ models require the `input_type` parameter; when it is omitted, it defaults to `"search_document"`
+- **Embedding Types**: Specify which embedding types to return (e.g., `"float"`, `"int8"`)
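+
+As a rough sketch of the mapping described in this section, the snippet below turns an OpenAI-style embedding input into a Cohere-style body: `input` becomes `texts`, `dimensions` becomes `output_dimension`, and `input_type` falls back to `"search_document"`. The struct and function names are hypothetical, and the field set is limited to what the table above lists; it is not Bifrost's actual converter.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// cohereEmbedRequest mirrors the mapping table above.
+// Hypothetical struct for illustration; the real request may carry more fields.
+type cohereEmbedRequest struct {
+	Model           string   `json:"model"`
+	Texts           []string `json:"texts"`
+	InputType       string   `json:"input_type"`
+	OutputDimension *int     `json:"output_dimension,omitempty"`
+}
+
+// toCohereEmbed applies the renames and the input_type default.
+func toCohereEmbed(model string, input []string, dimensions *int, extra map[string]interface{}) cohereEmbedRequest {
+	req := cohereEmbedRequest{
+		Model:           model,
+		Texts:           input,
+		InputType:       "search_document", // default when the caller sets nothing
+		OutputDimension: dimensions,
+	}
+	// An explicit input_type in extra params overrides the default.
+	if v, ok := extra["input_type"].(string); ok && v != "" {
+		req.InputType = v
+	}
+	return req
+}
+
+func main() {
+	req := toCohereEmbed("embed-english-v3.0", []string{"text to embed"}, nil, nil)
+	body, err := json.Marshal(req)
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println(string(body))
+}
+```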
+
+## Response Conversion
+
+- `embeddings.float` → `data[].embedding`
+- `meta.tokens` → usage information
+- Multiple embedding types handled
+
+---
+
+# 4. List Models
+
+**Request**: GET `/v1/models?page_size={defaultPageSize}`
+
+**Field mapping**: Model data converted to standard format
+
+**Pagination**: Cursor-based with `next_page_token`
+
+**Note**: `endpoint` and `default_only` filters are available via `extra_params`
--- a/docs/providers/supported-providers/databricks.mdx
+++ b/docs/providers/supported-providers/databricks.mdx
@@ -0,0 +1,257 @@
+---
+title: "Databricks AI Gateway"
+description: "Route requests through Databricks AI Gateway using Unified (MLflow) or Native (Anthropic Messages) APIs as custom providers in Bifrost"
+icon: "database"
+---
+
+## Overview
+
+[Databricks AI Gateway](https://docs.databricks.com/en/ai-gateway/index.html) (Beta) is a governance layer on top of Databricks Model Serving that adds rate limiting, usage tracking, and inference logging to your LLM endpoints. Bifrost connects to AI Gateway endpoints as custom providers.
+
+### Unified vs Native APIs
+
+AI Gateway exposes two categories of APIs on every endpoint:
+
+- **Unified APIs** — Provider-agnostic, OpenAI-compatible interfaces powered by MLflow. You can swap the underlying model without changing client code. Path: `/mlflow/v1/chat/completions`.
+- **Native APIs** — Provider-specific interfaces that give full access to a provider's latest features. For Anthropic, the path is `/anthropic/v1/messages`.
+
+In Bifrost, each API category maps to a different custom provider base format:
+
+| API Category | Bifrost Base Format | Chat | Chat (stream) | Responses | Responses (stream) | Text | Coding Agents | List Models |
+|--------------|---------------------|------|---------------|-----------|---------------------|------|---------------|-------------|
+| **Unified** (MLflow) | `openai` | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ (via Unity Catalog) |
+| **Native** (Anthropic Messages) | `anthropic` | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
+
+<Warning>
+The **Unified (MLflow) API** is a pure chat completions interface — it does **not** support the Responses API. Coding agents like Claude Code, Cursor, or Codex CLI that depend on the Responses API will **not** work through the Unified API.
+
+Use the **Native Anthropic Messages API** if you need Responses API support, coding agent compatibility, or text completions.
+</Warning>
+
+### Prerequisites
+
+Before configuring Bifrost, you need:
+
+1. A Databricks workspace with **Unity Catalog enabled** and AI Gateway access turned on by an account admin via **Account Console > Previews**
+2. An AI Gateway endpoint with at least one model destination — create one from the **AI Gateway** page in the Databricks sidebar
+3. The endpoint's AI Gateway URL — visible at the top of the endpoint overview page, in the format:
+   ```
+   https://<workspace-id>.ai-gateway.cloud.databricks.com
+   ```
+4. A Databricks Personal Access Token (PAT) — generate one from **Settings > Developer > Access tokens** in your Databricks workspace, or click **Generate Access Token** at the bottom of the endpoint page
+
+<img src="/media/databricks-endpoint-overview.png" alt="Databricks AI Gateway endpoint overview page showing the API format dropdown and Generate Access Token button" />
+
+---
+
+# 1. Unified API (MLflow Chat Completions)
+
+The Unified API exposes an OpenAI-compatible chat completions interface through AI Gateway's MLflow layer. Use this when you only need chat completions.
+
+### How it works
+
+AI Gateway exposes every endpoint at a `/mlflow/v1/chat/completions` path. Because this follows the OpenAI spec, Bifrost treats it as an OpenAI-compatible custom provider. The full endpoint URL looks like:
+
+```
+https://<workspace-id>.ai-gateway.cloud.databricks.com/mlflow/v1/chat/completions
+```
+
+You register only the portion of this URL up to and including `/mlflow` as the custom provider's Base URL; Bifrost appends the standard `/v1/chat/completions` path automatically.
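+
+To make the Base URL and appended path concrete, here is a small sketch using Go's standard `net/url` package. The host `example.ai-gateway.cloud.databricks.com` is a placeholder workspace, and `chatCompletionsURL` is a hypothetical helper; how Bifrost joins the path internally is not shown on this page.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/url"
+)
+
+// chatCompletionsURL appends the OpenAI-style chat path to a Base URL.
+// Hypothetical helper; url.JoinPath (Go 1.19+) normalizes duplicate slashes.
+func chatCompletionsURL(baseURL string) (string, error) {
+	return url.JoinPath(baseURL, "v1", "chat", "completions")
+}
+
+func main() {
+	// Placeholder workspace host; replace with your AI Gateway URL.
+	full, err := chatCompletionsURL("https://example.ai-gateway.cloud.databricks.com/mlflow")
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println(full)
+}
+```
+
+If the Base URL already ended in `/v1/chat/completions`, the joined result would repeat that suffix, which is why only the part through `/mlflow` is registered.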
+
+## Step 1: Create the Custom Provider
+
+In Bifrost, go to **Models > Model Providers** in the sidebar. Click **Add New Provider** and select **Custom provider...** at the bottom of the dropdown.
+
+In the **Add Custom Provider** dialog, fill in:
+
+| Field | Value |
+|-------|-------|
+| **Name** | Your choice (e.g., `databricks-mlflow`) |
+| **Base Format** | Select `OpenAI` from the dropdown |
+| **Base URL** | `https://<workspace-id>.ai-gateway.cloud.databricks.com/mlflow` |
+| **Is Keyless?** | Toggle on |
+
+## Step 2: Configure List Models (Optional)
+
+The default `/v1/models` path does not work against the AI Gateway URL. To enable model listing, point it at the **Unity Catalog** API on your Databricks workspace instead.
+
+In the **Allowed Request Types** section of the dialog:
+
+1. Find the **List Models** toggle (make sure it's enabled)
+2. Click the **settings icon** (gear) next to List Models — this opens the **Custom Path or URL** popover
+3. Enter your workspace's Unity Catalog models endpoint:
+
+```
+https://<your-databricks-workspace-url>/api/2.1/unity-catalog/models
+```
+
+<Note>
+The Unity Catalog URL uses your **Databricks workspace URL** (e.g., `https://adb-1234567890.azuredatabricks.net`), which is a different host from the AI Gateway URL (`*.ai-gateway.cloud.databricks.com`).
+</Note>
+
+<img src="/media/databricks-add-provider-mlflow.png" alt="Add Custom Provider dialog configured for MLflow with the List Models custom path popover showing the Unity Catalog URL" />
+
+Click **Add** to save the custom provider.
+
+## Step 3: Add the Authorization Header
+
+After saving, your new provider appears in the **Configured Providers** list on the left. Select it, then click **Edit Provider Config** (the settings icon in the top-right corner) to open the provider configuration panel.
+
+1. Switch to the **Network** tab
+2. Scroll down to the **Extra Headers** table
+3. Add a new row:
+   - **Name** column: `Authorization`
+   - **Value** column: `Bearer <your-databricks-pat>`
+4. Click **Save Network Configuration**
+
+<img src="/media/databricks-network-mlflow.png" alt="Provider configuration Network tab showing the Authorization header in Extra Headers for the MLflow provider" />
+
+## Step 4: Send Requests
+
+Use your custom provider prefix with any model name registered as a destination on your AI Gateway endpoint:
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "databricks-mlflow/<your-endpoint-model>",
+    "messages": [{"role": "user", "content": "Hello!"}]
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+response, err := client.ChatCompletionRequest(
+    schemas.NewBifrostContext(ctx, schemas.NoDeadline),
+    &schemas.BifrostChatRequest{
+        Provider: "databricks-mlflow",
+        Model:    "<your-endpoint-model>",
+        Input:    messages,
+    },
+)
+```
+
+</Tab>
+</Tabs>
+
+---
+
+# 2. Native API (Anthropic Messages)
+
+The Native API exposes an Anthropic-compatible messages interface through AI Gateway. Use this when you need the Responses API, text completions, or coding agent support (Claude Code, Cursor, Codex CLI).
+
+### How it works
+
+AI Gateway exposes every endpoint at an `/anthropic/v1/messages` path that follows the Anthropic API spec. Bifrost treats this as an Anthropic-compatible custom provider. The full endpoint URL looks like:
+
+```
+https://<workspace-id>.ai-gateway.cloud.databricks.com/anthropic/v1/messages
+```
+
+You register only the portion of this URL up to and including `/anthropic` as the custom provider's Base URL; Bifrost appends the standard Anthropic paths automatically.
+
+## Step 1: Create the Custom Provider
+
+In Bifrost, go to **Models > Model Providers** in the sidebar. Click **Add New Provider** and select **Custom provider...** at the bottom of the dropdown.
+
+In the **Add Custom Provider** dialog, fill in:
+
+| Field | Value |
+|-------|-------|
+| **Name** | Your choice (e.g., `databricks-anthropic`) |
+| **Base Format** | Select `Anthropic` from the dropdown |
+| **Base URL** | `https://<workspace-id>.ai-gateway.cloud.databricks.com/anthropic` |
+| **Is Keyless?** | Toggle on |
+
+## Step 2: Disable List Models
+
+AI Gateway's model listing endpoint uses an OpenAI-compatible format, which is incompatible with the Anthropic base format. You must disable it.
+
+In the **Allowed Request Types** section of the dialog, find the **List Models** toggle and turn it **off**.
+
+<img src="/media/databricks-add-provider-anthropic.png" alt="Add Custom Provider dialog configured for Anthropic Messages with List Models toggled off" />
+
+Click **Add** to save the custom provider.
+
+## Step 3: Add the Authorization Header
+
+After saving, select your new provider from the **Configured Providers** list and click **Edit Provider Config** to open the configuration panel.
+
+1. Switch to the **Network** tab
+2. Scroll down to the **Extra Headers** table
+3. Add a new row:
+   - **Name** column: `Authorization`
+   - **Value** column: `Bearer <your-databricks-pat>`
+4. Click **Save Network Configuration**
+
+<img src="/media/databricks-network-anthropic.png" alt="Provider configuration Network tab showing the Authorization header in Extra Headers for the Anthropic provider" />
+
+## Step 4: Send Requests
+
+Use your custom provider prefix with any model name registered as a destination on your AI Gateway endpoint:
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "databricks-anthropic/<your-endpoint-model>",
+    "messages": [{"role": "user", "content": "Hello!"}]
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+response, err := client.ChatCompletionRequest(
+    schemas.NewBifrostContext(ctx, schemas.NoDeadline),
+    &schemas.BifrostChatRequest{
+        Provider: "databricks-anthropic",
+        Model:    "<your-endpoint-model>",
+        Input:    messages,
+    },
+)
+```
+
+</Tab>
+</Tabs>
+
+### Coding Agent Compatibility
+
+The Native Anthropic Messages API works with Claude Code and other coding agents that depend on the Responses API. Point your coding agent at your Bifrost instance and use the `databricks-anthropic/<model>` prefix to route through your AI Gateway endpoint.
+
+---
+
+## Choosing the Right API
+
+| Consideration | Unified (MLflow) | Native (Anthropic Messages) |
+|---------------|-------------------|-----------------------------|
+| **Chat Completions** | ✅ | ✅ |
+| **Streaming** | ✅ | ✅ |
+| **Responses API** | ❌ | ✅ |
+| **Text Completions** | ❌ | ✅ |
+| **Coding Agents** (Claude Code, Cursor, Codex) | ❌ | ✅ |
+| **List Models** | ✅ (via Unity Catalog) | ❌ |
+| **Provider-agnostic** (swap models without code changes) | ✅ | ❌ |
+| **Bifrost Base Format** | `openai` | `anthropic` |
+
+<Note>
+You can create **two separate custom providers** — one per API category — pointing to the same AI Gateway endpoint. Use the Unified provider for chat completions with model listing, and the Native Anthropic provider for Responses API or coding agents.
+</Note>
+
+---
+
+## Reference Links
+
+- [Databricks AI Gateway Documentation](https://docs.databricks.com/en/ai-gateway/index.html)
+- [Create an AI Gateway Endpoint](https://docs.databricks.com/en/ai-gateway/create-endpoint.html)
+- [Databricks Personal Access Tokens](https://docs.databricks.com/en/dev-tools/auth/pat.html)
+- [Custom Providers in Bifrost](/providers/custom-providers)
--- a/docs/providers/supported-providers/elevenlabs.mdx
+++ b/docs/providers/supported-providers/elevenlabs.mdx
@@ -0,0 +1,495 @@
+---
+title: "ElevenLabs"
+description: "ElevenLabs API conversion guide - text-to-speech, speech-to-text, voice settings, and model management"
+icon: "pause"
+---
+
+## Overview
+
+ElevenLabs is a specialized audio provider for text-to-speech and speech-to-text operations. Bifrost performs conversions including:
+- **Model ID mapping** - Uses provider model identifier directly
+- **Voice configuration** - Maps voice settings (stability, similarity boost, speaker boost, speed, style)
+- **Response format conversion** - Speech format handling (MP3, Opus, PCM/WAV)
+- **Timestamp support** - Character-level timing alignment for TTS
+- **Transcription with alignment** - Word and character-level timing, diarization, and additional formats
+- **Pronunciation dictionaries** - Support for custom pronunciation rules
+- **Voice quality parameters** - Stability, similarity boost, and speaker boost controls
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Speech (TTS) | ✅ | ✅ | `/v1/text-to-speech/{voice_id}` |
+| Transcriptions (STT) | ✅ | - | `/v1/speech-to-text` |
+| List Models | ✅ | - | `/v1/models` |
+| Chat Completions | ❌ | ❌ | - |
+| Responses API | ❌ | ❌ | - |
+| Text Completions | ❌ | ❌ | - |
+| Embeddings | ❌ | ❌ | - |
+| Image Generation | ❌ | ❌ | - |
+
+<Note>
+**Unsupported Operations** (❌): Chat Completions, Responses API, Text Completions, Embeddings, and Image Generation are not supported by ElevenLabs (audio-focused provider). These return `UnsupportedOperationError`.
+
+ElevenLabs also supports a "Speech with Timestamps" endpoint at `/v1/text-to-speech/{voice_id}/with-timestamps` (non-streaming only) for enhanced timestamp information.
+</Note>
+
+---
+
+# 1. Speech (Text-to-Speech)
+
+## Request Parameters
+
+### Core Parameters
+
+| Parameter | Mapping | Notes |
+|-----------|---------|-------|
+| `input.input` | `text` | The text to convert to speech (required) |
+| `model` | `model_id` | Model identifier (e.g., `"eleven_multilingual_v2"`) |
+| `response_format` | Query param `output_format` | Speech format (see [Response Format](#response-format)) |
+
+### Voice Configuration
+
+Voice settings are optional and controlled via `params`:
+
+| Parameter | ElevenLabs Mapping | Default | Range |
+|-----------|-------------------|---------|-------|
+| `speed` | `voice_settings.speed` | 1.0 | 0.5-2.0 |
+| `extra_params.stability` | `voice_settings.stability` | 0.5 | 0-1.0 |
+| `extra_params.similarity_boost` | `voice_settings.similarity_boost` | 0.75 | 0-1.0 |
+| `extra_params.use_speaker_boost` | `voice_settings.use_speaker_boost` | true | boolean |
+| `extra_params.style` | `voice_settings.style` | 0 | 0-1.0 |
+
+### Advanced Parameters
+
+Use `extra_params` for ElevenLabs-specific TTS features:
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/audio/speech \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "eleven_multilingual_v2",
+    "input": {"input": "Hello, how are you?"},
+    "voice": "21m00Tcm4TlvDq8ikWAM",
+    "response_format": "mp3",
+    "stability": 0.5,
+    "similarity_boost": 0.75,
+    "use_speaker_boost": true,
+    "style": 0,
+    "speed": 1.0,
+    "language_code": "en",
+    "seed": 42,
+    "previous_text": "Context text",
+    "next_text": "Future context",
+    "apply_text_normalization": "auto"
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.SpeechRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostSpeechRequest{
+    Provider: schemas.Elevenlabs,
+    Model:    "eleven_multilingual_v2",
+    Input: &schemas.SpeechInput{
+        Input: "Hello, how are you?",
+    },
+    Params: &schemas.SpeechParameters{
+        VoiceConfig: &schemas.VoiceConfig{
+            Voice: schemas.Ptr("21m00Tcm4TlvDq8ikWAM"),
+        },
+        Speed: schemas.Ptr(1.0),
+        ResponseFormat: schemas.Ptr("mp3"),
+        ExtraParams: map[string]interface{}{
+            "stability": 0.5,
+            "similarity_boost": 0.75,
+            "use_speaker_boost": true,
+            "style": 0.0,
+            "language_code": "en",
+            "seed": 42,
+            "previous_text": "Context text",
+            "next_text": "Future context",
+            "apply_text_normalization": "auto",
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+#### Advanced TTS Parameters
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `language_code` | string | Language code (e.g., "en", "es") |
+| `seed` | integer | Reproducible output (0-4294967295) |
+| `previous_text` | string | Previous text context for consistency |
+| `next_text` | string | Next text context for consistency |
+| `previous_request_ids` | string[] | Previous request IDs for continuity |
+| `next_request_ids` | string[] | Next request IDs for continuity |
+| `apply_text_normalization` | string | Text normalization mode: `"auto"`, `"on"`, `"off"` |
+| `apply_language_text_normalization` | boolean | Apply language-specific text normalization |
+
+### Response Format
+
+| Format | Output | Quality | Bitrate / Bit Depth |
+|--------|--------|---------|---------|
+| `mp3` | MP3 | High | 128 kbps @ 44100 Hz |
+| `opus` | Opus | High | 128 kbps @ 48000 Hz |
+| `wav` / `pcm` | PCM WAV | Lossless | 16-bit @ 44100 Hz |
+
+<Note>
+Defaults to MP3 format if not specified. Format is passed via query parameter `output_format`.
+</Note>
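+
+One way to picture the format handling is a lookup from the short `response_format` name to an ElevenLabs `output_format` value. The sketch below is an assumption built from the table above: the specific strings (`mp3_44100_128`, `opus_48000_128`, `pcm_44100`) are ElevenLabs format identifiers chosen to match the listed sample rates and bitrates, and the exact values Bifrost sends may differ. `outputFormat` is a hypothetical helper.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/url"
+)
+
+// outputFormats maps short names to ElevenLabs output_format values.
+// Assumed mapping for illustration, derived from the table above.
+var outputFormats = map[string]string{
+	"mp3":  "mp3_44100_128",
+	"opus": "opus_48000_128",
+	"wav":  "pcm_44100",
+	"pcm":  "pcm_44100",
+}
+
+// outputFormat resolves a response_format, defaulting to MP3 when empty.
+func outputFormat(responseFormat string) (string, error) {
+	if responseFormat == "" {
+		return outputFormats["mp3"], nil
+	}
+	if f, ok := outputFormats[responseFormat]; ok {
+		return f, nil
+	}
+	return "", fmt.Errorf("unsupported response_format %q", responseFormat)
+}
+
+func main() {
+	f, err := outputFormat("opus")
+	if err != nil {
+		panic(err)
+	}
+	// The format travels as a query parameter, not in the JSON body.
+	q := url.Values{}
+	q.Set("output_format", f)
+	fmt.Println("/v1/text-to-speech/{voice_id}?" + q.Encode())
+}
+```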
+
+### Timestamps Support
+
+To get character-level timing alignment, enable `with_timestamps`:
+
+```json
+{
+  "with_timestamps": true
+}
+```
+
+When enabled, the endpoint `/v1/text-to-speech/{voice_id}/with-timestamps` is used and the response includes:
+- `audio_base64` - Audio data as base64-encoded string
+- `alignment.char_start_times_ms` - Character start times in milliseconds
+- `alignment.char_end_times_ms` - Character end times in milliseconds
+- `alignment.characters` - Array of characters
+- `normalized_alignment` - Same as alignment but for normalized text
+
+## Response Conversion
+
+### Non-Timestamp Response
+
+```json
+{
+  "audio": "<binary audio data>"
+}
+```
+
+### Timestamp Response
+
+```json
+{
+  "audio_base64": "<base64 encoded audio>",
+  "alignment": {
+    "char_start_times_ms": [0, 150, 280, ...],
+    "char_end_times_ms": [150, 280, 420, ...],
+    "characters": ["H", "e", "l", "l", "o", ...]
+  },
+  "normalized_alignment": {
+    "char_start_times_ms": [...],
+    "char_end_times_ms": [...],
+    "characters": [...]
+  }
+}
+```
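+
+A client consuming the timestamp response might decode it along these lines. This is a sketch under the assumption that the JSON shape is exactly the one shown above; the struct and function names are hypothetical, and times are read as `float64` so that both integer and fractional milliseconds parse.
+
+```go
+package main
+
+import (
+	"encoding/base64"
+	"encoding/json"
+	"fmt"
+)
+
+// alignment and timestampResponse follow the JSON shape shown above.
+type alignment struct {
+	CharStartTimesMs []float64 `json:"char_start_times_ms"`
+	CharEndTimesMs   []float64 `json:"char_end_times_ms"`
+	Characters       []string  `json:"characters"`
+}
+
+type timestampResponse struct {
+	AudioBase64 string    `json:"audio_base64"`
+	Alignment   alignment `json:"alignment"`
+}
+
+// charSpan pairs one character with its start and end time.
+type charSpan struct {
+	Char    string
+	StartMs float64
+	EndMs   float64
+}
+
+// parseTimestamps decodes the audio and zips the three alignment arrays.
+func parseTimestamps(raw []byte) ([]byte, []charSpan, error) {
+	var resp timestampResponse
+	if err := json.Unmarshal(raw, &resp); err != nil {
+		return nil, nil, err
+	}
+	audio, err := base64.StdEncoding.DecodeString(resp.AudioBase64)
+	if err != nil {
+		return nil, nil, err
+	}
+	a := resp.Alignment
+	if len(a.Characters) != len(a.CharStartTimesMs) || len(a.Characters) != len(a.CharEndTimesMs) {
+		return nil, nil, fmt.Errorf("alignment arrays differ in length")
+	}
+	spans := make([]charSpan, len(a.Characters))
+	for i, c := range a.Characters {
+		spans[i] = charSpan{Char: c, StartMs: a.CharStartTimesMs[i], EndMs: a.CharEndTimesMs[i]}
+	}
+	return audio, spans, nil
+}
+
+func main() {
+	// Tiny synthetic payload; "AAEC" is three placeholder bytes, not real audio.
+	raw := []byte(`{"audio_base64":"AAEC","alignment":{"char_start_times_ms":[0,150],"char_end_times_ms":[150,280],"characters":["H","i"]}}`)
+	audio, spans, err := parseTimestamps(raw)
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println(len(audio), spans)
+}
+```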
+
+## Streaming
+
+Streaming speech returns audio in chunks as they are generated:
+
+```json
+{
+  "type": "audio.delta",
+  "audio": "<binary audio chunk>"
+}
+```
+
+Final chunk:
+```json
+{
+  "type": "audio.done"
+}
+```
+
+---
+
+# 2. Transcription (Speech-to-Text)
+
+## Request Parameters
+
+### Input Source
+
+Choose one of the following (mutually exclusive):
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `input.file` | bytes | Audio file content (WAV, MP3, etc.) |
+| `extra_params.cloud_storage_url` | string | URL to cloud-hosted audio file |
+
+**Error**: Providing both or neither results in an error.
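+
+The mutual-exclusion rule can be checked before a request is sent. The function below is a hypothetical client-side guard, not the provider's own validation code, and its error messages are illustrative.
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+)
+
+// validateTranscriptionInput enforces exactly one input source.
+// Hypothetical helper; error text is illustrative.
+func validateTranscriptionInput(file []byte, cloudStorageURL string) error {
+	hasFile := len(file) > 0
+	hasURL := cloudStorageURL != ""
+	switch {
+	case hasFile && hasURL:
+		return errors.New("provide either file or cloud_storage_url, not both")
+	case !hasFile && !hasURL:
+		return errors.New("either file or cloud_storage_url is required")
+	}
+	return nil
+}
+
+func main() {
+	// Both sources set: rejected.
+	fmt.Println(validateTranscriptionInput([]byte{0x01}, "https://example.com/audio.wav"))
+	// Only the file set: accepted.
+	fmt.Println(validateTranscriptionInput([]byte{0x01}, ""))
+}
+```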
+
+### Core Parameters
+
+| Parameter | Mapping | Description |
+|-----------|---------|-------------|
+| `model` | `model_id` | Model identifier (required) |
+| `params.language` | `language_code` | Language code (ISO 639-1, e.g., "en") |
+
+### Advanced Parameters
+
+Use `extra_params` for transcription-specific features:
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/audio/transcriptions \
+  -F "file=@audio.wav" \
+  -F "model=eleven_latest" \
+  -F "language_code=en" \
+  -F "tag_audio_events=true" \
+  -F "num_speakers=2" \
+  -F "timestamps_granularity=word" \
+  -F "diarize=true" \
+  -F "diarization_threshold=0.5" \
+  -F "temperature=0.1" \
+  -F "seed=42" \
+  -F "use_multi_channel=true" \
+  -F "webhook=true" \
+  -F "webhook_id=webhook-123"
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.TranscriptionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostTranscriptionRequest{
+    Provider: schemas.Elevenlabs,
+    Model:    "eleven_latest",
+    Input: &schemas.TranscriptionInput{
+        File: audioBytes,
+    },
+    Params: &schemas.TranscriptionParameters{
+        Language: schemas.Ptr("en"),
+        ExtraParams: map[string]interface{}{
+            "tag_audio_events": true,
+            "num_speakers": 2,
+            "timestamps_granularity": "word",
+            "diarize": true,
+            "diarization_threshold": 0.5,
+            "temperature": 0.1,
+            "seed": 42,
+            "use_multi_channel": true,
+            "webhook": true,
+            "webhook_id": "webhook-123",
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+#### Transcription Options
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `tag_audio_events` | boolean | Tag audio events (background noise, music, etc.) |
+| `num_speakers` | integer | Expected number of speakers (for diarization) |
+| `timestamps_granularity` | string | Timestamp level: `"none"`, `"word"`, `"character"` |
+| `diarize` | boolean | Identify different speakers |
+| `diarization_threshold` | float | Speaker diarization sensitivity (0.0-1.0) |
+| `file_format` | string | Input format: `"pcm_s16le_16"`, `"other"` |
+| `temperature` | float | Transcription temperature (0.0-1.0) |
+| `seed` | integer | Reproducible transcription |
+| `use_multi_channel` | boolean | Process multi-channel audio separately |
+| `webhook` | boolean | Enable webhook for async processing |
+| `webhook_id` | string | Webhook endpoint ID |
+| `webhook_metadata` | object/string | Additional webhook metadata |
+| `cloud_storage_url` | string | URL to cloud-hosted audio (alternative to file) |
+
+#### Additional Formats
+
+Request multiple output formats simultaneously:
+
+```json
+{
+  "additional_formats": [
+    {
+      "format": "segmented_json",
+      "include_speakers": true,
+      "include_timestamps": true,
+      "segment_on_silence_longer_than_s": 1.0,
+      "max_segment_duration_s": 30.0
+    },
+    {
+      "format": "srt",
+      "max_segment_duration_s": 30.0
+    }
+  ]
+}
+```
+
+**Supported formats**: `segmented_json`, `docx`, `pdf`, `txt`, `html`, `srt`
+
+## Response Conversion
+
+### Basic Transcription
+
+```json
+{
+  "transcript": {
+    "language_code": "en",
+    "language_probability": 0.95,
+    "text": "Full transcribed text...",
+    "words": [
+      {
+        "text": "Hello",
+        "start": 0.0,
+        "end": 0.5,
+        "type": "word",
+        "speaker_id": "speaker_1",
+        "logprob": -0.05
+      }
+    ]
+  }
+}
+```
+
+### With Diarization
+
+When `diarize: true`, the response includes speaker identification:
+
+```json
+{
+  "transcript": {
+    "text": "Hello how are you?",
+    "words": [
+      {
+        "text": "Hello",
+        "speaker_id": "speaker_1"
+      },
+      {
+        "text": "how",
+        "speaker_id": "speaker_2"
+      }
+    ]
+  }
+}
+```
+
+### With Timestamps
+
+Character-level timing when `timestamps_granularity: "character"`:
+
+```json
+{
+  "words": [
+    {
+      "text": "Hello",
+      "characters": [
+        {"text": "H", "start": 0.0, "end": 0.1},
+        {"text": "e", "start": 0.1, "end": 0.2}
+      ]
+    }
+  ]
+}
+```
+
+### With Additional Formats
+
+```json
+{
+  "transcript": { ... },
+  "additional_formats": [
+    {
+      "requested_format": "srt",
+      "file_extension": "srt",
+      "content_type": "text/plain",
+      "is_base64_encoded": false,
+      "content": "1\n00:00:00,000 --> 00:00:01,000\nHello\n\n2\n..."
+    }
+  ]
+}
+```
+
+---
+
+## Caveats
+
+<Accordion title="Voice ID Required">
+**Severity**: High
+**Behavior**: Voice ID must be provided for TTS requests
+**Impact**: Request fails without voice configuration
+**Code**: `elevenlabs.go:198-208`
+</Accordion>
+
+<Accordion title="File or URL Required for Transcription">
+**Severity**: High
+**Behavior**: Either `file` or `cloud_storage_url` must be provided (not both)
+**Impact**: Request fails if both or neither is provided
+**Code**: `elevenlabs.go:471-478`
+</Accordion>
+
+<Accordion title="Audio Format Conversion">
+**Severity**: Low
+**Behavior**: Response formats (MP3, Opus, WAV) mapped via format string
+**Impact**: Format parameter passed as query string to endpoint
+**Code**: `elevenlabs.go:712-715`, `utils.go:5-35`
+</Accordion>
+
+<Accordion title="Timestamps as Separate Endpoint">
+**Severity**: Low
+**Behavior**: Timestamp requests use `/with-timestamps` endpoint variant
+**Impact**: Switches endpoint based on `with_timestamps` flag
+**Code**: `elevenlabs.go:195-205`
+</Accordion>
+
+<Accordion title="Multipart Form Data for Transcription">
+**Severity**: Low
+**Behavior**: Transcription uses multipart/form-data, not JSON
+**Impact**: File and parameters sent as form fields
+**Code**: `elevenlabs.go:480-690`
+</Accordion>
+
+---
+
+# 3. List Models
+
+## Request Parameters
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| (none) | - | No parameters required |
+
+Returns available models with their capabilities and language support.
+
+## Response Conversion
+
+```json
+{
+  "models": [
+    {
+      "model_id": "eleven_multilingual_v2",
+      "name": "Eleven Multilingual v2",
+      "description": "Multilingual speech synthesis",
+      "serves_pro_voices": true,
+      "token_cost_factor": 1.0,
+      "can_do_text_to_speech": true,
+      "can_do_voice_conversion": true,
+      "can_use_style": true,
+      "can_use_speaker_boost": true,
+      "languages": [
+        {"language_id": "en", "name": "English"},
+        {"language_id": "es", "name": "Spanish"}
+      ],
+      "requires_alpha_access": false,
+      "max_characters_request_free_user": 1000,
+      "max_characters_request_subscribed_user": 100000,
+      "maximum_text_length_per_request": 5000,
+      "model_rates": {
+        "character_cost_multiplier": 1.0
+      }
+    }
+  ]
+}
+```
--- a/docs/providers/supported-providers/fireworks.mdx
+++ b/docs/providers/supported-providers/fireworks.mdx
@@ -0,0 +1,179 @@
+---
+title: "Fireworks"
+description: "Fireworks API conversion guide covering native chat, responses, completions, embeddings, streaming, and Fireworks-specific parameter handling"
+icon: "sparkles"
+---
+
+## Overview
+
+Fireworks is an **OpenAI-compatible provider** in Bifrost with native support for:
+- **Chat Completions** via `/v1/chat/completions`
+- **Responses API** via `/v1/responses`
+- **Text Completions** via `/v1/completions`
+- **Embeddings** via `/v1/embeddings`
+- **Streaming** for chat, responses, and completions
+- **Tool calling** for chat and responses
+
+Unless noted below, Fireworks follows the standard OpenAI-compatible request and response behavior described in [OpenAI](./openai).
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v1/chat/completions` |
+| Responses API | ✅ | ✅ | `/v1/responses` |
+| Text Completions | ✅ | ✅ | `/v1/completions` |
+| Embeddings | ✅ | ❌ | `/v1/embeddings` |
+| List Models | ✅ | - | `/v1/models` |
+| Images | ❌ | ❌ | - |
+| Speech / Transcription | ❌ | ❌ | - |
+| Files | ❌ | ❌ | - |
+| Batch | ❌ | ❌ | - |
+| Count Tokens | ❌ | ❌ | - |
+
+<Note>
+Fireworks Responses support is **native** in Bifrost. Requests are sent to Fireworks’ `/v1/responses` endpoint directly, so fields such as `previous_response_id`, `max_tool_calls`, and `store` are preserved.
+</Note>
+
+---
+
+# 1. Chat Completions
+
+Fireworks chat completions use the standard OpenAI-compatible wire format.
+
+## Fireworks-specific handling
+
+- `prediction` is preserved and forwarded.
+- Bifrost maps `prompt_cache_key` to Fireworks `prompt_cache_isolation_key` for chat-completion cache isolation.
+- Assistant `reasoning_content` is preserved for Fireworks chat-completion models that support reasoning history.
+
+## Filtered Parameters
+
+For Fireworks chat completions, Bifrost removes or rewrites a small set of OpenAI-specific fields before sending the request upstream:
+
+- `prompt_cache_key` is mapped to Fireworks `prompt_cache_isolation_key`
+- `prompt_cache_retention` is removed
+- `verbosity` is removed
+- `store` is removed
+- `web_search_options` is removed
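+
+The filtering rules above amount to one rename and four deletions. The sketch below applies them to a generic request map; `filterFireworksChatParams` is a hypothetical name, and the real implementation presumably works on typed request structs rather than a map.
+
+```go
+package main
+
+import "fmt"
+
+// filterFireworksChatParams returns a copy of body with the
+// OpenAI-specific fields listed above renamed or dropped.
+// Hypothetical helper for illustration only.
+func filterFireworksChatParams(body map[string]interface{}) map[string]interface{} {
+	out := make(map[string]interface{}, len(body))
+	for k, v := range body {
+		switch k {
+		case "prompt_cache_key":
+			// Renamed to the Fireworks cache isolation field.
+			out["prompt_cache_isolation_key"] = v
+		case "prompt_cache_retention", "verbosity", "store", "web_search_options":
+			// Dropped before the request goes upstream.
+		default:
+			out[k] = v
+		}
+	}
+	return out
+}
+
+func main() {
+	in := map[string]interface{}{
+		"model":            "accounts/fireworks/models/deepseek-v3p2",
+		"prompt_cache_key": "tenant-42",
+		"store":            true,
+		"verbosity":        "low",
+	}
+	fmt.Println(filterFireworksChatParams(in))
+}
+```
+
+Returning a copy rather than mutating the input keeps the caller's original request intact, which matters if the same request is retried against a different provider.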
+
+## Example
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "fireworks/accounts/fireworks/models/deepseek-v3p2",
+    "messages": [
+      {"role": "user", "content": "Reply with exactly: fireworks ok"}
+    ]
+  }'
+```
+
+---
+
+# 2. Responses API
+
+Fireworks Responses use the native Fireworks endpoint:
+
+```text
+/v1/responses
+```
+
+This preserves Responses-only fields and semantics, including:
+- `previous_response_id`
+- `max_tool_calls`
+- `store`
+- native responses streaming
+
+## Example
+
+```bash
+curl -X POST http://localhost:8080/v1/responses \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "fireworks/accounts/fireworks/models/deepseek-v3p2",
+    "input": [
+      {"role": "user", "content": "Reply with exactly: responses ok"}
+    ],
+    "max_tool_calls": 2
+  }'
+```
+
+For continuation requests, Fireworks also supports `previous_response_id`.
+
+---
+
+# 3. Text Completions
+
+Fireworks text completions are sent to the native completions endpoint:
+
+```text
+/v1/completions
+```
+
+## Example
+
+```bash
+curl -X POST http://localhost:8080/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "fireworks/accounts/fireworks/models/deepseek-v3p2",
+    "prompt": "In fruits, A is for apple and B is for"
+  }'
+```
+
+For Fireworks text completions, Bifrost extracts `prompt_cache_key` from `extra_params` and maps it to Fireworks `prompt_cache_isolation_key`.
+
+---
+
+# 4. Embeddings
+
+Fireworks embeddings are sent to:
+
+```text
+/v1/embeddings
+```
+
+Embedding-capable models may differ from the models used for chat and text completions.
+
+## Example
+
+```bash
+curl -X POST http://localhost:8080/v1/embeddings \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "fireworks/nomic-ai/nomic-embed-text-v1.5",
+    "input": "embedding test"
+  }'
+```
+
+Fireworks documents additional embedding-specific fields such as `prompt_template`, `return_logits`, and `normalize`. This page describes the standard embeddings flow currently covered by Bifrost.
+
+---
+
+# 5. Unsupported Features
+
+The following operations are still unsupported by the Fireworks provider in Bifrost:
+
+| Feature | Status |
+|---------|--------|
+| Image generation / editing / variations | ❌ |
+| Speech / TTS | ❌ |
+| Transcription / STT | ❌ |
+| Files | ❌ |
+| Batch | ❌ |
+| Count tokens | ❌ |
+| Rerank | ❌ |
+
+---
+
+# 6. Caveats
+
+<Accordion title="Prompt Caching Semantics">
+For Fireworks chat completions, Bifrost maps `prompt_cache_key` to Fireworks `prompt_cache_isolation_key`, which is the Fireworks body field for cache isolation. Fireworks also accepts the header form `x-prompt-cache-isolation-key`. For text completions, Bifrost extracts `prompt_cache_key` from `extra_params` and maps it to the same Fireworks body field. If you need Fireworks session-affinity behavior, pass `user`, configure `x-session-affinity` in provider extra headers, or send it through the HTTP gateway via `x-bf-eh-x-session-affinity`. Live cache-hit behavior remains model and deployment dependent.
+</Accordion>
+
+<Accordion title="Reasoning History">
+Bifrost preserves assistant `reasoning_content` for Fireworks chat models that support reasoning history. Bifrost does not give Fireworks-specific reasoning controls such as `reasoning_history` any special typed handling.
+</Accordion>
--- a/docs/providers/supported-providers/gemini.mdx
+++ b/docs/providers/supported-providers/gemini.mdx
@@ -0,0 +1,868 @@
+---
+title: "Google Gemini"
+description: "Google Gemini API conversion guide - request/response transformation, message conversion, tool handling, and streaming behavior"
+icon: "diamond"
+---
+
+## Overview
+
+Google Gemini's API is structured differently from OpenAI's. Bifrost performs extensive conversion including:
+- **Role remapping** - "assistant" → "model", system messages integrated into main flow
+- **Message grouping** - Consecutive tool responses merged into single user message
+- **Parameter renaming** - e.g., `max_completion_tokens` → `maxOutputTokens`, `stop` → `stopSequences`
+- **Function call handling** - Tool call ID preservation and thought signature support
+- **Content modality** - Support for text, images, video, code execution, and thought content
+- **Thinking/Reasoning** - `reasoning` parameters mapped to Gemini's `thinkingConfig` structure
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v1beta/models/{model}:generateContent` |
+| Responses API | ✅ | ✅ | `/v1beta/models/{model}:generateContent` |
+| Speech (TTS) | ✅ | ✅ | `/v1beta/models/{model}:generateContent` |
+| Transcriptions (STT) | ✅ | ✅ | `/v1beta/models/{model}:generateContent` |
+| Image Generation | ✅ | - | `/v1beta/models/{model}:generateContent` or `/v1beta/models/{model}:predict` (Imagen) |
+| Image Edit | ✅ | - | `/v1beta/models/{model}:generateContent` or `/v1beta/models/{model}:predict` (Imagen) |
+| Video Generation | ✅ | - | `/v1beta/models/{model}:predictLongRunning` |
+| Image Variation | ❌ | - | Not supported |
+| Embeddings | ✅ | - | `/v1beta/models/{model}:embedContent` |
+| Files | ✅ | - | `/upload/storage/v1beta/files` |
+| Batch | ✅ | - | `/v1beta/batchJobs` |
+| List Models | ✅ | - | `/v1beta/models` |
+
+---
+
+## Authentication
+
+Gemini supports API key authentication in addition to OAuth2 Bearer token authentication. The implementation conditionally uses the appropriate method based on the endpoint type.
+
+### API Key Authentication
+
+API key authentication is supported via two methods:
+
+1. **Header Method** (standard Gemini endpoints):
+   - Format: `x-goog-api-key: YOUR_API_KEY` header
+   - Used for: Standard Gemini endpoints (e.g., `/v1beta/models/{model}:generateContent`)
+
+2. **Query Parameter Method** (Imagen and custom endpoints):
+   - Format: `?key=YOUR_API_KEY` appended to request URLs
+   - Used for: Imagen models and custom endpoints
+   - Example: `https://generativelanguage.googleapis.com/v1beta/models/imagen-4.0-generate-001:predict?key=YOUR_API_KEY`
+
+Bifrost automatically selects the appropriate authentication method based on the endpoint type.
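+
+As a rough illustration of the two methods, the sketch below attaches a key either as a header or as a query parameter. `applyAPIKey` and its `useQueryParam` flag are hypothetical names standing in for Bifrost's own endpoint-type check; no request is actually sent.
+
+```go
+package main
+
+import (
+    "fmt"
+    "net/http"
+)
+
+// applyAPIKey attaches a Gemini API key to req using one of the two methods
+// described above. useQueryParam is a hypothetical flag, not a Bifrost option.
+func applyAPIKey(req *http.Request, apiKey string, useQueryParam bool) {
+    if useQueryParam {
+        // Imagen and custom endpoints: ?key=YOUR_API_KEY
+        q := req.URL.Query()
+        q.Set("key", apiKey)
+        req.URL.RawQuery = q.Encode()
+        return
+    }
+    // Standard Gemini endpoints: x-goog-api-key header
+    req.Header.Set("x-goog-api-key", apiKey)
+}
+
+func main() {
+    base := "https://generativelanguage.googleapis.com/v1beta/models/"
+
+    std, err := http.NewRequest(http.MethodPost, base+"gemini-2.0-flash:generateContent", nil)
+    if err != nil {
+        panic(err)
+    }
+    applyAPIKey(std, "YOUR_API_KEY", false)
+    fmt.Println(std.Header.Get("x-goog-api-key"))
+
+    img, err := http.NewRequest(http.MethodPost, base+"imagen-4.0-generate-001:predict", nil)
+    if err != nil {
+        panic(err)
+    }
+    applyAPIKey(img, "YOUR_API_KEY", true)
+    fmt.Println(img.URL.String())
+}
+```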
+
+---
+
+# 1. Chat Completions
+
+## Request Parameters
+
+### Parameter Mapping
+
+| Parameter | Transformation |
+|-----------|----------------|
+| `max_completion_tokens` | Renamed to `maxOutputTokens` |
+| `temperature`, `top_p` | Direct pass-through |
+| `stop` | Renamed to `stopSequences` |
+| `response_format` | Converted to `responseMimeType` and `responseJsonSchema` |
+| `tools` | Schema restructured (see [Tool Conversion](#tool-conversion)) |
+| `tool_choice` | Mapped to `functionCallingConfig` (see [Tool Conversion](#tool-conversion)) |
+| `reasoning` | Mapped to `thinkingConfig` (see [Reasoning / Thinking](#reasoning--thinking)) |
+| `top_k` | Via `extra_params` (Gemini-specific) |
+| `presence_penalty`, `frequency_penalty` | Via `extra_params` |
+| `seed` | Via `extra_params` |
+
+### Dropped Parameters
+
+The following parameters are silently ignored: `logit_bias`, `logprobs`, `top_logprobs`, `parallel_tool_calls`, `service_tier`
+
+### Extra Parameters
+
+Use `extra_params` (SDK) or pass directly in request body (Gateway) for Gemini-specific fields:
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gemini/gemini-2.0-flash",
+    "messages": [{"role": "user", "content": "Hello"}],
+    "top_k": 40,
+    "stop_sequences": ["###"]
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.ChatCompletionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostChatRequest{
+    Provider: schemas.Gemini,
+    Model:    "gemini-2.0-flash",
+    Input:    messages,
+    Params: &schemas.ChatParameters{
+        ExtraParams: map[string]interface{}{
+            "top_k": 40,
+            "stop_sequences": []string{"###"},
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+## Reasoning / Thinking
+
+**Documentation**: See [Bifrost Reasoning Reference](/providers/reasoning)
+
+### Parameter Mapping
+
+- `reasoning.effort` → `thinkingConfig.thinkingLevel` ("low" → `LOW`, "high" → `HIGH`)
+- `reasoning.max_tokens` → `thinkingConfig.thinkingBudget` (token budget for thinking)
+- `reasoning` parameter triggers `thinkingConfig.includeThoughts = true`
+
+### Supported Thinking Levels
+
+- `"low"` / `"minimal"` → `LOW`
+- `"medium"` / `"high"` → `HIGH`
+- `null` or unspecified → Based on `max_tokens`: -1 (dynamic), 0 (disabled), or specific budget
+
+### Example
+
+```json
+// Request
+{"reasoning": {"effort": "high", "max_tokens": 10000}}
+
+// Gemini conversion
+{"thinkingConfig": {"includeThoughts": true, "thinkingLevel": "HIGH", "thinkingBudget": 10000}}
+```
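+
+The mapping rules in this section can be approximated in a few lines of Go. The `Reasoning` and `ThinkingConfig` types and the `toThinkingConfig` function below are illustrative names, not Bifrost's own types, and the sketch assumes an unset `max_tokens` simply leaves `thinkingBudget` out of the payload.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+)
+
+// Reasoning mirrors the `reasoning` request field; nil pointers mean "unset".
+type Reasoning struct {
+    Effort    *string
+    MaxTokens *int
+}
+
+// ThinkingConfig mirrors the Gemini-side fields named in this section.
+type ThinkingConfig struct {
+    IncludeThoughts bool   `json:"includeThoughts"`
+    ThinkingLevel   string `json:"thinkingLevel,omitempty"`
+    ThinkingBudget  *int   `json:"thinkingBudget,omitempty"`
+}
+
+// toThinkingConfig applies the effort and budget rules described above.
+func toThinkingConfig(r Reasoning) ThinkingConfig {
+    cfg := ThinkingConfig{IncludeThoughts: true} // any reasoning param enables thoughts
+    if r.Effort != nil {
+        switch *r.Effort {
+        case "low", "minimal":
+            cfg.ThinkingLevel = "LOW"
+        case "medium", "high":
+            cfg.ThinkingLevel = "HIGH"
+        }
+    }
+    // -1 means dynamic, 0 disables thinking, anything else is a token budget.
+    cfg.ThinkingBudget = r.MaxTokens
+    return cfg
+}
+
+func main() {
+    effort, budget := "high", 10000
+    cfg := toThinkingConfig(Reasoning{Effort: &effort, MaxTokens: &budget})
+    out, err := json.Marshal(map[string]ThinkingConfig{"thinkingConfig": cfg})
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println(string(out))
+}
+```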
+
+## Message Conversion
+
+### Critical Caveats
+
+- **Role remapping**: "assistant" → "model", "system" → part of user/model content flow
+- **Consecutive tool responses**: Tool response messages merged into single user message with function response parts
+- **Content flattening**: Multi-part content in single message preserved as parts array
+
+### Image Conversion
+
+- **URL images**: `{type: "image_url", image_url: {url: "..."}}` → `{type: "image", source: {type: "url", url: "..."}}`
+- **Base64 images**: Data URL → `{type: "image", source: {type: "base64", media_type: "image/png", ...}}`
+- **Video content**: Preserved with metadata (fps, start/end offset)
+
+## Tool Conversion
+
+Tool definitions are restructured with these mappings:
+- `function.name` → `functionDeclarations.name` (preserved)
+- `function.parameters` → `functionDeclarations.parameters` (Schema format)
+- `function.description` → `functionDeclarations.description`
+- `function.strict` → Dropped (not supported by Gemini)
+
+### Tool Choice Mapping
+
+| OpenAI | Gemini |
+|--------|--------|
+| `"auto"` | `AUTO` (default) |
+| `"none"` | `NONE` |
+| `"required"` | `ANY` |
+| Specific tool | `ANY` with `allowedFunctionNames` |
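+
+A minimal sketch of the table above, assuming the tool choice has already been reduced to a mode string plus an optional tool name. `FunctionCallingConfig` and `mapToolChoice` are hypothetical names used only for illustration.
+
+```go
+package main
+
+import "fmt"
+
+// FunctionCallingConfig holds the two Gemini-side values named in the table.
+type FunctionCallingConfig struct {
+    Mode                 string
+    AllowedFunctionNames []string
+}
+
+// mapToolChoice converts an OpenAI-style tool_choice. choice is "auto",
+// "none", or "required"; a non-empty toolName requests one specific tool.
+func mapToolChoice(choice, toolName string) FunctionCallingConfig {
+    if toolName != "" {
+        return FunctionCallingConfig{Mode: "ANY", AllowedFunctionNames: []string{toolName}}
+    }
+    switch choice {
+    case "none":
+        return FunctionCallingConfig{Mode: "NONE"}
+    case "required":
+        return FunctionCallingConfig{Mode: "ANY"}
+    default:
+        // "auto" is the default mode.
+        return FunctionCallingConfig{Mode: "AUTO"}
+    }
+}
+
+func main() {
+    fmt.Println(mapToolChoice("required", ""))
+    fmt.Println(mapToolChoice("", "get_weather"))
+}
+```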
+
+## Response Conversion
+
+### Field Mapping
+
+- `finishReason` → `finish_reason`:
+  - `STOP` → `stop`
+  - `MAX_TOKENS` → `length`
+  - `SAFETY`, `RECITATION`, `LANGUAGE`, `BLOCKLIST`, `PROHIBITED_CONTENT`, `SPII`, `IMAGE_SAFETY` → `content_filter`
+  - `MALFORMED_FUNCTION_CALL`, `UNEXPECTED_TOOL_CALL` → `tool_calls`
+
+- `candidates[0].content.parts[0].text` → `choices[0].message.content` (if single text block)
+- `candidates[0].content.parts[].functionCall` → `choices[0].message.tool_calls`
+- `promptTokenCount` → `usage.prompt_tokens`
+- `candidatesTokenCount` → `usage.completion_tokens`
+- `totalTokenCount` → `usage.total_tokens`
+- `cachedContentTokenCount` → `usage.prompt_tokens_details.cached_tokens`
+- `thoughtsTokenCount` → `usage.completion_tokens_details.reasoning_tokens`
+- Thought content (from `text` parts with `thought: true`) → `reasoning` field in stream deltas
+- Function call `args` (map) → JSON string `arguments`
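+
+Two of the conversions above, the `finishReason` table and the `args` serialization, are easy to show in isolation. The sketch below uses hypothetical helper names and reports unlisted finish reasons as "not mapped" rather than guessing how Bifrost treats them.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+)
+
+// finishReasons encodes the finishReason mapping listed above.
+var finishReasons = map[string]string{
+    "STOP":                    "stop",
+    "MAX_TOKENS":              "length",
+    "SAFETY":                  "content_filter",
+    "RECITATION":              "content_filter",
+    "LANGUAGE":                "content_filter",
+    "BLOCKLIST":               "content_filter",
+    "PROHIBITED_CONTENT":      "content_filter",
+    "SPII":                    "content_filter",
+    "IMAGE_SAFETY":            "content_filter",
+    "MALFORMED_FUNCTION_CALL": "tool_calls",
+    "UNEXPECTED_TOOL_CALL":    "tool_calls",
+}
+
+// mapFinishReason returns the OpenAI-style value and whether it was listed.
+func mapFinishReason(reason string) (string, bool) {
+    v, ok := finishReasons[reason]
+    return v, ok
+}
+
+// argsToJSON turns a function call's args map into the JSON string
+// carried in the `arguments` field.
+func argsToJSON(args map[string]any) (string, error) {
+    b, err := json.Marshal(args)
+    if err != nil {
+        return "", err
+    }
+    return string(b), nil
+}
+
+func main() {
+    reason, _ := mapFinishReason("MAX_TOKENS")
+    arguments, err := argsToJSON(map[string]any{"city": "Paris"})
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println(reason, arguments)
+}
+```
+
+Callers reading `tool_calls[].function.arguments` need to parse that string back into an object before using it.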
+
+## Streaming
+
+Event structure:
+- Streaming responses contain deltas in `delta.content` (text), `delta.reasoning` (thoughts), `delta.toolCalls` (function calls)
+- Function responses appear as text content in the delta
+- `finish_reason` only set on final chunk
+- Usage metadata only included in final chunk
+
+---
+
+# 2. Responses API
+
+The Responses API uses the same underlying `/generateContent` endpoint but converts between OpenAI's Responses format and Gemini's `contents` format.
+
+## Request Parameters
+
+### Parameter Mapping
+
+| Parameter | Transformation |
+|-----------|----------------|
+| `max_output_tokens` | Renamed to `maxOutputTokens` |
+| `temperature`, `top_p` | Direct pass-through |
+| `instructions` | Converted to system instruction text |
+| `input` (string or array) | Converted to messages |
+| `tools` | Schema restructured (see [Chat Completions](#1-chat-completions)) |
+| `tool_choice` | Type mapped (see [Chat Completions](#1-chat-completions)) |
+| `reasoning` | Mapped to `thinkingConfig` (see [Reasoning / Thinking](#reasoning--thinking)) |
+| `text` | Maps to `responseMimeType` and `responseJsonSchema` |
+| `stop` | Via `extra_params`, renamed to `stopSequences` |
+| `top_k` | Via `extra_params` |
+
+### Extra Parameters
+
+Use `extra_params` (SDK) or pass directly in request body (Gateway):
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/responses \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gemini/gemini-2.0-flash",
+    "input": "Hello, how are you?",
+    "instructions": "You are a helpful assistant.",
+    "top_k": 40
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.ResponsesRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostResponsesRequest{
+    Provider: schemas.Gemini,
+    Model:    "gemini-2.0-flash",
+    Input:    messages,
+    Params: &schemas.ResponsesParameters{
+        Instructions: schemas.Ptr("You are a helpful assistant."),
+        ExtraParams: map[string]interface{}{
+            "top_k": 40,
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+## Input & Instructions
+
+- **Input**: String wrapped as user message or array converted to messages
+- **Instructions**: Becomes system instruction (single text block)
+
+## Tool Support
+
+Supported types: `function`, `computer_use_preview`, `web_search`, `mcp`
+
+Tool conversion is the same as for [Chat Completions](#1-chat-completions), with:
+- Computer tools auto-configured (if specified in Bifrost request)
+- Function-based tools always enabled
+
+## Response Conversion
+
+- `finishReason` → `status`: `STOP`/`MAX_TOKENS`/other → `completed` | `SAFETY` → `incomplete`
+- Output items conversion:
+  - Text parts → `message` field
+  - Function calls → `function_call` field
+  - Thought content → `reasoning` field
+- Usage fields preserved with cache tokens mapped to `*_tokens_details.cached_tokens`
+
+## Streaming
+
+Event structure: Similar to Chat Completions streaming
+- `content_part.added` emitted for text and reasoning parts
+- Item IDs generated as `msg_{responseID}_item_{outputIndex}`
+
+---
+
+# 3. Speech (Text-to-Speech)
+
+Speech synthesis uses the underlying chat generation endpoint with audio response modality.
+
+## Request Parameters
+
+| Parameter | Transformation |
+|-----------|----------------|
+| `input` | Text to synthesize → `contents[0].parts[0].text` |
+| `voice` | Voice name → `generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName` |
+| `response_format` | Only "wav" supported (default); auto-converted from PCM |
+
+### Voice Configuration
+
+**Single Voice**:
+```json
+{
+  "generationConfig": {
+    "responseModalities": ["AUDIO"],
+    "speechConfig": {
+      "voiceConfig": {
+        "prebuiltVoiceConfig": {
+          "voiceName": "Chant-Female"
+        }
+      }
+    }
+  }
+}
+```
+
+**Multi-Speaker**:
+```json
+{
+  "generationConfig": {
+    "responseModalities": ["AUDIO"],
+    "speechConfig": {
+      "multiSpeakerVoiceConfig": {
+        "speakerVoiceConfigs": [
+          {
+            "speaker": "Character A",
+            "voiceConfig": {
+              "prebuiltVoiceConfig": {
+                "voiceName": "Chant-Female"
+              }
+            }
+          }
+        ]
+      }
+    }
+  }
+}
+```
+
+## Response Conversion
+
+- Audio data extracted from `candidates[0].content.parts[].inlineData`
+- **Format conversion**: Gemini returns PCM audio (s16le, 24kHz, mono)
+- **Auto-conversion**: PCM → WAV when `response_format: "wav"` (default)
+- Raw audio returned if `response_format` is omitted or empty string
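+
+The PCM-to-WAV step amounts to prepending a 44-byte RIFF header to the raw samples. The following sketch shows one way to do that for the format stated above (s16le, 24 kHz, mono); `pcmToWAV` is an illustrative function, not the one Bifrost uses.
+
+```go
+package main
+
+import (
+    "encoding/binary"
+    "fmt"
+)
+
+// pcmToWAV prepends a canonical 44-byte RIFF/WAVE header to raw
+// 16-bit little-endian PCM samples.
+func pcmToWAV(pcm []byte, sampleRate, channels int) []byte {
+    const bitsPerSample = 16
+    blockAlign := channels * bitsPerSample / 8
+    byteRate := sampleRate * blockAlign
+
+    h := make([]byte, 44, 44+len(pcm))
+    copy(h[0:], "RIFF")
+    binary.LittleEndian.PutUint32(h[4:], uint32(36+len(pcm))) // file size minus 8
+    copy(h[8:], "WAVEfmt ")
+    binary.LittleEndian.PutUint32(h[16:], 16) // fmt chunk size
+    binary.LittleEndian.PutUint16(h[20:], 1)  // audio format 1 = PCM
+    binary.LittleEndian.PutUint16(h[22:], uint16(channels))
+    binary.LittleEndian.PutUint32(h[24:], uint32(sampleRate))
+    binary.LittleEndian.PutUint32(h[28:], uint32(byteRate))
+    binary.LittleEndian.PutUint16(h[32:], uint16(blockAlign))
+    binary.LittleEndian.PutUint16(h[34:], bitsPerSample)
+    copy(h[36:], "data")
+    binary.LittleEndian.PutUint32(h[40:], uint32(len(pcm)))
+    return append(h, pcm...)
+}
+
+func main() {
+    pcm := make([]byte, 48000) // one second of silence at 24 kHz, mono, 16-bit
+    wav := pcmToWAV(pcm, 24000, 1)
+    fmt.Println(string(wav[:4]), len(wav))
+}
+```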
+
+### Supported Voices
+
+Common Gemini voices include:
+- `Chant-Female` - Female voice
+- `Chant-Male` - Male voice
+- Additional voices depend on model capabilities
+
+Check model documentation for complete list of supported voices.
+
+---
+
+# 4. Transcriptions (Speech-to-Text)
+
+Transcriptions are implemented as chat completions with audio content and text prompts.
+
+## Request Parameters
+
+| Parameter | Transformation |
+|-----------|----------------|
+| `file` | Audio bytes → `contents[].parts[].inlineData` |
+| `prompt` | Instructions → `contents[0].parts[0].text` (defaults to "Generate a transcript of the speech.") |
+| `language` | Via `extra_params` (if supported by model) |
+
+### Audio Input Handling
+
+Audio is sent as inline data with auto-detected MIME type:
+```json
+{
+  "contents": [
+    {
+      "parts": [
+        {
+          "text": "<prompt text>"
+        },
+        {
+          "inlineData": {
+            "mimeType": "audio/wav",
+            "data": "<base64-encoded-audio>"
+          }
+        }
+      ]
+    }
+  ]
+}
+```
+
+### Extra Parameters
+
+Safety settings and caching can be configured:
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/audio/transcriptions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gemini/gemini-2.0-flash",
+    "file": "<binary-audio-data>",
+    "prompt": "Transcribe this audio in the original language."
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.TranscriptionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostTranscriptionRequest{
+    Provider: schemas.Gemini,
+    Model:    "gemini-2.0-flash",
+    Input: &schemas.TranscriptionInput{
+        File: audioBytes,
+    },
+    Params: &schemas.TranscriptionParameters{
+        Prompt: schemas.Ptr("Transcribe this audio."),
+        ExtraParams: map[string]interface{}{
+            "safety_settings": []interface{}{ /* ... */ },
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+## Response Conversion
+
+- Transcribed text extracted from `candidates[0].content.parts[].text`
+- `task` set to `"transcribe"`
+- Usage metadata mapped:
+  - `promptTokenCount` → `input_tokens`
+  - `candidatesTokenCount` → `output_tokens`
+  - `totalTokenCount` → `total_tokens`
+
+---
+
+# 5. Embeddings
+
+<Note>
+Supports both single text and batch text embeddings via batch requests.
+</Note>
+
+**Request Parameters**:
+- `input` → `requests[0].content.parts[0].text` (single text joins arrays with space)
+- `dimensions` → `outputDimensionality`
+- Extra task type and title via `extra_params`
+
+**Response Mapping**:
+- `embeddings[].values` → Bifrost embedding array
+- `metadata.billableCharacterCount` → Usage prompt tokens (fallback)
+- Token counts extracted from usage metadata
+
+---
+
+# 6. Batch API
+
+**Request formats**: Inline requests array or file-based input
+
+**Pagination**: Token-based with `pageToken`
+
+**Endpoints**:
+- POST `/v1beta/batchJobs` - Create
+- GET `/v1beta/batchJobs?pageSize={limit}&pageToken={token}` - List
+- GET `/v1beta/batchJobs/{batch_id}` - Retrieve
+- POST `/v1beta/batchJobs/{batch_id}:cancel` - Cancel
+
+**Response Structure**:
+- Status mapping: `BATCH_STATE_PENDING`/`BATCH_STATE_RUNNING` → `in_progress`, `BATCH_STATE_SUCCEEDED` → `completed`, `BATCH_STATE_FAILED` → `failed`, `BATCH_STATE_CANCELLING` → `cancelling`, `BATCH_STATE_CANCELLED` → `cancelled`, `BATCH_STATE_EXPIRED` → `expired`
+- Inline responses: Array in `dest.inlinedResponses`
+- File-based responses: JSONL file in `dest.fileName`
+
+**Note**: RFC3339 timestamps are converted to Unix timestamps.
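+
+For illustration, the status mapping and the timestamp conversion could look like the sketch below. `mapBatchState` and `toUnix` are hypothetical helpers; how Bifrost handles a state that is not in the list is not covered here, so the sketch just reports it as unmapped.
+
+```go
+package main
+
+import (
+    "fmt"
+    "time"
+)
+
+// batchStates encodes the status mapping listed above.
+var batchStates = map[string]string{
+    "BATCH_STATE_PENDING":    "in_progress",
+    "BATCH_STATE_RUNNING":    "in_progress",
+    "BATCH_STATE_SUCCEEDED":  "completed",
+    "BATCH_STATE_FAILED":     "failed",
+    "BATCH_STATE_CANCELLING": "cancelling",
+    "BATCH_STATE_CANCELLED":  "cancelled",
+    "BATCH_STATE_EXPIRED":    "expired",
+}
+
+// mapBatchState returns the mapped status and whether the state was listed.
+func mapBatchState(state string) (string, bool) {
+    v, ok := batchStates[state]
+    return v, ok
+}
+
+// toUnix parses an RFC3339 timestamp and returns Unix seconds.
+func toUnix(ts string) (int64, error) {
+    t, err := time.Parse(time.RFC3339, ts)
+    if err != nil {
+        return 0, err
+    }
+    return t.Unix(), nil
+}
+
+func main() {
+    status, _ := mapBatchState("BATCH_STATE_RUNNING")
+    created, err := toUnix("2026-01-01T00:00:00Z")
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println(status, created)
+}
+```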
+
+---
+
+# 7. Files API
+
+<Note>
+Supports file upload for batch processing and multimodal requests.
+</Note>
+
+**Upload**: Multipart/form-data with `file` (binary) and `filename` (optional)
+
+**Field mapping**:
+- `name` → `id`
+- `displayName` → `filename`
+- `sizeBytes` → `size_bytes`
+- `mimeType` → `content_type`
+- `createTime` (RFC3339) → Converted to Unix timestamp
+
+**Endpoints**:
+- POST `/upload/storage/v1beta/files` - Upload
+- GET `/v1beta/files?limit={limit}&pageToken={token}` - List (cursor pagination)
+- GET `/v1beta/files/{file_id}` - Retrieve
+- DELETE `/v1beta/files/{file_id}` - Delete
+- GET `/v1beta/files/{file_id}/content` - Download
+
+---
+
+# 8. Image Generation
+
+Gemini supports two image generation formats depending on the model:
+
+1. **Standard Gemini Format**: Uses the `/v1beta/models/{model}:generateContent` endpoint
+2. **Imagen Format**: Uses the `/v1beta/models/{model}:predict` endpoint for Imagen models (detected automatically)
+
+### Parameter Mapping
+
+| Parameter | Transformation |
+|-----------|----------------|
+| `prompt` | Text description of the image to generate |
+| `n` | Number of images (mapped to `sampleCount` for Imagen, `candidateCount` for Gemini) |
+| `size` | Image size in WxH format (e.g., `"1024x1024"`). Converted to Imagen's `imageSize` + `aspectRatio` format |
+| `output_format` | Output format: `"png"`, `"jpeg"`, `"webp"`. Converted to MIME type for Imagen |
+| `seed` | Seed for reproducible generation (passed directly) |
+| `negative_prompt` | Negative prompt (passed directly) |
+
+### Extra Parameters
+
+Use `extra_params` (SDK) or pass directly in request body (Gateway) for Gemini-specific fields:
+
+| Parameter | Type | Notes |
+|-----------|------|-------|
+| `personGeneration` | string | Person generation setting (Imagen only) |
+| `language` | string | Language code (Imagen only) |
+| `enhancePrompt` | bool | Prompt enhancement flag (Imagen only) |
+| `safetySettings` / `safety_settings` | string/array | Safety settings configuration |
+| `cachedContent` / `cached_content` | string | Cached content ID |
+| `labels` | object | Custom labels map |
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/images/generations \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "gemini/imagen-4.0-generate-001",
+    "prompt": "A sunset over the mountains",
+    "size": "1024x1024",
+    "n": 2,
+    "output_format": "png"
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.ImageGenerationRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostImageGenerationRequest{
+    Provider: schemas.Gemini,
+    Model:    "imagen-4.0-generate-001",
+    Input: &schemas.ImageGenerationInput{
+        Prompt: "A sunset over the mountains",
+    },
+    Params: &schemas.ImageGenerationParameters{
+        Size:         schemas.Ptr("1024x1024"),
+        N:            schemas.Ptr(2),
+        OutputFormat: schemas.Ptr("png"),
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+## Request Conversion
+
+### Standard Gemini Format
+
+- **Model mapping**: `bifrostReq.Model` → `req.Model`, with `bifrostReq.Input.Prompt` → `req.Contents[0].Parts[0].Text`
+- **Response modality**: Bifrost internally sets `generationConfig.responseModalities = ["IMAGE"]` to indicate image generation
+- **Image count**: Specify number of images via `n` → `generationConfig.candidateCount`
+- **Extra parameters**: Include `safetySettings`, `cachedContent`, and `labels` mapped directly
+
+### Imagen Format
+
+- **Prompt**: `bifrostReq.Input.Prompt` → `req.Instances[0].Prompt`
+- **Number of Images**: `n` → `req.Parameters.SampleCount`
+- **Size Conversion**: `size` (WxH format) converted to:
+  - `imageSize`: `"1k"` (if dimensions ≤ 1024), `"2k"` (if dimensions ≤ 2048). Sizes larger than `"2k"` are not supported by Imagen models.
+  - `aspectRatio`: `"1:1"`, `"3:4"`, `"4:3"`, `"9:16"`, or `"16:9"` (based on width/height ratio)
+- **Output Format**: `output_format` (`"png"`, `"jpeg"`) → `parameters.outputOptions.mimeType` (`"image/png"`, `"image/jpeg"`)
+- **Seed & Negative Prompt**: Passed directly to `parameters.seed` and `parameters.negativePrompt`
+- **Extra Parameters**: `personGeneration`, `language`, `enhancePrompt`, `safetySettings` mapped to parameters
+
+## Response Conversion
+
+### Standard Gemini Format
+
+- **Image Data**: Extracts `InlineData` from `candidates[0].content.parts[]` with MIME type `image/*`
+- **Output Format**: Converts MIME type (`image/png`, `image/jpeg`, `image/webp`) → file extension (`png`, `jpeg`, `webp`)
+- **Usage**: Extracts token usage from `usageMetadata`
+- **Multiple Images**: Each image part becomes an `ImageData` entry in the response array
+
+### Imagen Format
+
+- **Image Data**: Each `prediction` in `response.predictions[]` → `ImageData` with `b64_json` from `bytesBase64Encoded`
+- **Output Format**: Converts `prediction.mimeType` → file extension for `outputFormat` field (Imagen does not support WebP)
+- **Index**: Each prediction gets an `index` (0, 1, 2, ...) in the response array
+
+## Size Conversion
+
+For Imagen format, size is converted between formats:
+
+**Supported Image Sizes**: `"1k"` (≤1024), `"2k"` (≤2048)
+
+**Supported Aspect Ratios**: `"1:1"`, `"3:4"`, `"4:3"`, `"9:16"`, `"16:9"`
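+
+As a sketch of the size conversion, the function below parses a `WxH` string and derives `imageSize` and `aspectRatio`. Two points are assumptions rather than documented behavior: the size tier is taken from the larger dimension, and the aspect ratio is the supported value closest to width/height. `convertSize` is a hypothetical name.
+
+```go
+package main
+
+import (
+    "fmt"
+    "math"
+    "strconv"
+    "strings"
+)
+
+// supportedRatios lists the aspect ratios named above with their numeric values.
+var supportedRatios = []struct {
+    name  string
+    value float64
+}{
+    {"1:1", 1.0}, {"3:4", 3.0 / 4.0}, {"4:3", 4.0 / 3.0}, {"9:16", 9.0 / 16.0}, {"16:9", 16.0 / 9.0},
+}
+
+// convertSize maps "WxH" to an Imagen-style imageSize and aspectRatio.
+func convertSize(size string) (imageSize, aspectRatio string, err error) {
+    ws, hs, ok := strings.Cut(strings.ToLower(size), "x")
+    if !ok {
+        return "", "", fmt.Errorf("size %q is not in WxH format", size)
+    }
+    w, errW := strconv.Atoi(ws)
+    h, errH := strconv.Atoi(hs)
+    if errW != nil || errH != nil || w <= 0 || h <= 0 {
+        return "", "", fmt.Errorf("size %q has invalid dimensions", size)
+    }
+    largest := w
+    if h > largest {
+        largest = h
+    }
+    switch {
+    case largest <= 1024:
+        imageSize = "1k"
+    case largest <= 2048:
+        imageSize = "2k"
+    default:
+        return "", "", fmt.Errorf("size %q exceeds the 2k limit", size)
+    }
+    // Assumption: pick the supported ratio closest to width/height.
+    ratio := float64(w) / float64(h)
+    best := math.Inf(1)
+    for _, r := range supportedRatios {
+        if d := math.Abs(ratio - r.value); d < best {
+            best, aspectRatio = d, r.name
+        }
+    }
+    return imageSize, aspectRatio, nil
+}
+
+func main() {
+    for _, s := range []string{"1024x1024", "2048x1152", "4096x4096"} {
+        imageSize, aspectRatio, err := convertSize(s)
+        fmt.Println(s, imageSize, aspectRatio, err)
+    }
+}
+```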
+
+## Endpoint Selection
+
+The provider automatically selects the endpoint based on model name:
+- **Imagen models** (detected via `schemas.IsImagenModel()`): Uses `/v1beta/models/{model}:predict` endpoint
+- **Other models**: Uses `/v1beta/models/{model}:generateContent` endpoint with image response modality
+
+## Streaming
+
+Image generation streaming is not supported by Gemini.
+
+---
+
+# 9. Image Edit
+
+<Warning>
+Requests use **multipart/form-data**, not JSON.
+</Warning>
+
+Gemini supports image editing through two different APIs depending on the model:
+
+1. **Standard Gemini Format**: Uses the `/v1beta/models/{model}:generateContent` endpoint (for Gemini models)
+2. **Imagen Format**: Uses the `/v1beta/models/{model}:predict` endpoint (for Imagen models, detected automatically)
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `model` | string | ✅ | Model identifier (Gemini or Imagen model) |
+| `prompt` | string | ✅ | Text description of the edit |
+| `image[]` | binary | ✅ | Image file(s) to edit (supports multiple images) |
+| `mask` | binary | ❌ | Mask image file |
+| `type` | string | ❌ | Edit type: `"inpainting"`, `"outpainting"`, `"inpaint_removal"`, `"bgswap"` (Imagen only) |
+| `n` | int | ❌ | Number of images to generate (1-10) |
+| `output_format` | string | ❌ | Output format: `"png"`, `"webp"`, `"jpeg"` |
+| `output_compression` | int | ❌ | Compression level (0-100%) |
+| `seed` | int | ❌ | Seed for reproducibility (via `ExtraParams["seed"]`) |
+| `negative_prompt` | string | ❌ | Negative prompt (via `ExtraParams["negativePrompt"]`) |
+| `guidanceScale` | int | ❌ | Guidance scale (via `ExtraParams["guidanceScale"]`, Imagen only) |
+| `baseSteps` | int | ❌ | Base steps (via `ExtraParams["baseSteps"]`, Imagen only) |
+| `maskMode` | string | ❌ | Mask mode (via `ExtraParams["maskMode"]`, Imagen only): `"MASK_MODE_USER_PROVIDED"`, `"MASK_MODE_BACKGROUND"`, `"MASK_MODE_FOREGROUND"`, `"MASK_MODE_SEMANTIC"` |
+| `dilation` | float | ❌ | Mask dilation (via `ExtraParams["dilation"]`, Imagen only): Range [0, 1] |
+| `maskClasses` | int[] | ❌ | Mask classes (via `ExtraParams["maskClasses"]`, Imagen only): For `MASK_MODE_SEMANTIC` |
+
+---
+
+**Request Conversion**
+
+### Standard Gemini Format (Non-Imagen Models)
+
+- **Model & Prompt**: `bifrostReq.Model` → `req.Model`, `bifrostReq.Input.Prompt` → `req.Contents[0].Parts[0].Text`
+- **Images**: Each image in `bifrostReq.Input.Images` is converted to a `Part` with:
+  - MIME type detection (`image/jpeg`, `image/webp`, `image/png`) with fallback to `image/png`
+  - Base64 encoding: `image.Image` → `Part.InlineData.Data` (base64 string)
+  - MIME type: `Part.InlineData.MIMEType`
+- **Response Modality**: `GenerationConfig.ResponseModalities` is set to `[ModalityImage]` to indicate image generation
+- **Extra Parameters**: Extracted from `ExtraParams`:
+  - `safetySettings` / `safety_settings` → `SafetySettings`
+  - `cachedContent` / `cached_content` → `CachedContent`
+  - `labels` → `Labels` (map[string]string)
+
+### Imagen Format (Imagen Models)
+
+- **Reference Images**: Each image in `bifrostReq.Input.Images` is converted to `ReferenceImage` with:
+  - `ReferenceType`: `"REFERENCE_TYPE_RAW"`
+  - `ReferenceID`: Sequential IDs starting from 1
+  - `ReferenceImage.BytesBase64Encoded`: Base64-encoded image data
+- **Mask Configuration**: If `Params.Mask` is provided or `maskMode` is specified:
+  - Default `maskMode`: `"MASK_MODE_USER_PROVIDED"` when mask data is present
+  - `maskMode` can be overridden via `ExtraParams["maskMode"]`
+  - `dilation` extracted from `ExtraParams["dilation"]` (validated to range [0, 1])
+  - `maskClasses` extracted from `ExtraParams["maskClasses"]` (for `MASK_MODE_SEMANTIC`)
+  - Mask image (if provided) is base64-encoded and added as `ReferenceType: "REFERENCE_TYPE_MASK"`
+- **Edit Mode Mapping**: `Params.Type` is mapped to `EditMode`:
+  - `"inpainting"` → `"EDIT_MODE_INPAINT_INSERTION"`
+  - `"outpainting"` → `"EDIT_MODE_OUTPAINT"`
+  - `"inpaint_removal"` → `"EDIT_MODE_INPAINT_REMOVAL"`
+  - `"bgswap"` → `"EDIT_MODE_BGSWAP"`
+  - If `Type` is not set, `editMode` can be specified directly via `ExtraParams["editMode"]`
+- **Parameters**:
+  - `n` → `Parameters.SampleCount`
+  - `output_format` → `Parameters.OutputOptions.MimeType` (converted: `"png"` → `"image/png"`, etc.)
+  - `output_compression` → `Parameters.OutputOptions.CompressionQuality`
+  - `seed` (via `ExtraParams["seed"]`) → `Parameters.Seed`
+  - `negativePrompt` (via `ExtraParams["negativePrompt"]`) → `Parameters.NegativePrompt`
+  - `guidanceScale` (via `ExtraParams["guidanceScale"]`) → `Parameters.GuidanceScale`
+  - `baseSteps` (via `ExtraParams["baseSteps"]`) → `Parameters.BaseSteps`
+  - Additional Imagen-specific parameters: `addWatermark`, `includeRaiReason`, `includeSafetyAttributes`, `personGeneration`, `safetySetting`, `language`, `storageUri`
+
+**Response Conversion**
+
+- **Standard Gemini Format**: Uses the same response conversion as image generation (see Image Generation section)
+- **Imagen Format**: Uses the same response conversion as Imagen image generation (see Image Generation section)
+
+**Endpoint Selection**
+
+The provider automatically selects the endpoint based on model name:
+- **Imagen models** (detected via `schemas.IsImagenModel()`): Uses `/v1beta/models/{model}:predict` endpoint
+- **Other models**: Uses `/v1beta/models/{model}:generateContent` endpoint with image response modality
+
+**Streaming**
+
+Image edit streaming is not supported by Gemini.
+
+**Image Variation**
+
+Image variation is not supported by Gemini.
+
+---
+
+# 10. List Models
+
+**Request**: GET `/v1beta/models?pageSize={limit}&pageToken={token}` (no body)
+
+**Field mapping**:
+- `name` (remove "models/" prefix) → `id` (add "gemini/" prefix)
+- `displayName` → `name`
+- `description` → `description`
+- `inputTokenLimit` → `max_input_tokens`
+- `outputTokenLimit` → `max_output_tokens`
+- Context length = `inputTokenLimit + outputTokenLimit`
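+
+The field mapping above is mostly renaming plus one prefix swap and one sum. A minimal sketch, using illustrative struct and function names rather than Bifrost's actual types:
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// GeminiModel holds the upstream fields referenced in the mapping above.
+type GeminiModel struct {
+    Name             string
+    DisplayName      string
+    Description      string
+    InputTokenLimit  int
+    OutputTokenLimit int
+}
+
+// Model is a hypothetical stand-in for the converted model entry.
+type Model struct {
+    ID              string
+    Name            string
+    Description     string
+    MaxInputTokens  int
+    MaxOutputTokens int
+    ContextLength   int
+}
+
+// toModel applies the field mapping: strip "models/", add "gemini/",
+// and derive the context length from the two token limits.
+func toModel(g GeminiModel) Model {
+    return Model{
+        ID:              "gemini/" + strings.TrimPrefix(g.Name, "models/"),
+        Name:            g.DisplayName,
+        Description:     g.Description,
+        MaxInputTokens:  g.InputTokenLimit,
+        MaxOutputTokens: g.OutputTokenLimit,
+        ContextLength:   g.InputTokenLimit + g.OutputTokenLimit,
+    }
+}
+
+func main() {
+    // The token limits here are made-up sample values.
+    m := toModel(GeminiModel{
+        Name:             "models/gemini-2.0-flash",
+        DisplayName:      "Gemini 2.0 Flash",
+        InputTokenLimit:  1000,
+        OutputTokenLimit: 200,
+    })
+    fmt.Println(m.ID, m.ContextLength)
+}
+```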
+
+**Pagination**: Token-based with `nextPageToken`
+
+---
+
+# 11. Video Generation
+
+### Generate (`POST /v1/videos`)
+
+Requests use **JSON body (`application/json`)**.
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `model` | string | ✅ | Veo model (e.g., `veo-3.1-generate-preview`) |
+| `prompt` | string | ✅ | Text description of the video |
+| `input_reference` | string | ❌ | Input image for image-to-video |
+| `seconds` | string | ❌ | Duration → `durationSeconds` |
+| `size` | string | ❌ | Resolution → aspect ratio (`1280x720` → `16:9`, `720x1280` → `9:16`) |
+| `negative_prompt` | string | ❌ | What to avoid in the video |
+| `seed` | int | ❌ | Seed for reproducibility |
+| `audio` | bool | ❌ | Enable audio generation → `generateAudio` |
+| `video_uri` | string | ❌ | GCS video URI for video extension |
+
+**Extra Params** (any unrecognized JSON field is forwarded as `extra_params`)
+
+| Key | Notes |
+|-----|-------|
+| `aspectRatio` | Override the aspect ratio directly (e.g., `"16:9"`, `"9:16"`). Takes precedence over `size` |
+| `resolution` | Native Gemini resolution string |
+| `sampleCount` | Number of samples to generate |
+| `personGeneration` | Person generation policy |
+| `numberOfVideos` | Number of videos to generate |
+| `storageURI` | GCS bucket for output storage |
+| `compressionQuality` | Output compression quality |
+| `enhancePrompt` | Auto-enhance the prompt |
+| `resizeMode` | How to handle size mismatches |
+| `reference_images` | Style/asset reference image objects |
+| `lastFrame` | Last frame image object for interpolation |
+
+**Response**: [`BifrostVideoGenerationResponse`](https://github.com/maximhq/bifrost/blob/main/core/schemas/videos.go) — `id`, `status`, `videos[]`
+
+If Gemini filters content for safety, `status` is `failed` and `content_filter` describes the reason.
+
+**Job Statuses**: `in_progress` → `completed` / `failed`
+
+### Retrieve / Download
+
+| Operation | Endpoint | Notes |
+|-----------|----------|-------|
+| Get status | `GET /v1/videos/{id}` | Polls the long-running operation |
+| Download | `GET /v1/videos/{id}/content` | Downloads from GCS URI or decodes base64 video |
+
+Video Delete, List, and Remix are not supported.
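+
+A client typically polls the status endpoint until the job leaves `in_progress`. The sketch below shows one way to structure that loop. The `statusFunc` hook is a hypothetical stand-in for a `GET /v1/videos/{id}` call that returns the `status` field, and the polling interval is an arbitrary choice.
+
+```go
+package main
+
+import (
+	"context"
+	"errors"
+	"fmt"
+	"time"
+)
+
+// statusFunc is a hypothetical hook that would call GET /v1/videos/{id}
+// and return the job's `status` field.
+type statusFunc func(ctx context.Context, id string) (string, error)
+
+// waitForVideo polls until the job is completed or failed, or the context ends.
+// Any other status (such as in_progress) is treated as "keep waiting".
+func waitForVideo(ctx context.Context, id string, interval time.Duration, getStatus statusFunc) error {
+	ticker := time.NewTicker(interval)
+	defer ticker.Stop()
+	for {
+		status, err := getStatus(ctx, id)
+		if err != nil {
+			return err
+		}
+		switch status {
+		case "completed":
+			return nil
+		case "failed":
+			return errors.New("video generation failed")
+		}
+		select {
+		case <-ctx.Done():
+			return ctx.Err()
+		case <-ticker.C:
+		}
+	}
+}
+
+func main() {
+	calls := 0
+	// Fake status source: two in_progress responses, then completed.
+	fake := func(ctx context.Context, id string) (string, error) {
+		calls++
+		if calls < 3 {
+			return "in_progress", nil
+		}
+		return "completed", nil
+	}
+	err := waitForVideo(context.Background(), "video-123", 10*time.Millisecond, fake)
+	fmt.Println("polls:", calls, "error:", err)
+}
+```
+
+Once the loop returns without error, the video can be fetched from `GET /v1/videos/{id}/content`.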
+
+---
+
+## Content Type Support
+
+Bifrost supports the following content modalities through Gemini:
+
+| Content Type | Support | Notes |
+|--------------|---------|-------|
+| Text | ✅ | Full support |
+| Images (URL/Base64) | ✅ | Converted to `{type: "image", source: {...}}` |
+| Video | ✅ | With fps, start/end offset metadata |
+| Audio | ⚠️ | Via file references only |
+| PDF | ✅ | Via file references |
+| Code Execution | ✅ | Auto-executed with results returned |
+| Thinking/Reasoning | ✅ | Thought parts marked with `thought: true` |
+| Function Calls | ✅ | With optional thought signatures |
+
+---
+
+## Caveats
+
+<Accordion title="Tool Response Grouping">
+**Severity**: High
+**Behavior**: Consecutive tool response messages merged into single user message
+**Impact**: Message count and structure changes
+**Code**: `chat.go:627-678`
+</Accordion>
+
+<Accordion title="Thinking Content Handling">
+**Severity**: Medium
+**Behavior**: Thought content appears as `text` parts with `thought: true` flag
+**Impact**: Requires checking `thought` flag to distinguish from regular text
+**Code**: `chat.go:242-244, 302-304`
+</Accordion>
+
+<Accordion title="Function Call Arguments Serialization">
+**Severity**: Low
+**Behavior**: Tool call `args` (object) converted to `arguments` (JSON string)
+**Impact**: Requires JSON parsing to access arguments
+**Code**: `chat.go:101-106`
+</Accordion>
+
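+The `args` → `arguments` conversion described above amounts to a JSON round trip on the consumer side. The following sketch uses only `encoding/json`; the helper names are illustrative and are not taken from `chat.go`.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// argsToArguments serializes a Gemini-style args object into the
+// OpenAI-style `arguments` JSON string. Hypothetical helper name.
+func argsToArguments(args map[string]any) (string, error) {
+	raw, err := json.Marshal(args)
+	if err != nil {
+		return "", err
+	}
+	return string(raw), nil
+}
+
+// parseArguments is what a consumer does to read the arguments back.
+func parseArguments(arguments string) (map[string]any, error) {
+	var parsed map[string]any
+	if err := json.Unmarshal([]byte(arguments), &parsed); err != nil {
+		return nil, err
+	}
+	return parsed, nil
+}
+
+func main() {
+	arguments, err := argsToArguments(map[string]any{"city": "Paris", "days": 3})
+	if err != nil {
+		fmt.Println("marshal error:", err)
+		return
+	}
+	parsed, err := parseArguments(arguments)
+	if err != nil {
+		fmt.Println("unmarshal error:", err)
+		return
+	}
+	fmt.Println(arguments, parsed["city"])
+}
+```
+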
+<Accordion title="Thought Signature Base64 Encoding">
+**Severity**: Low
+**Behavior**: `thoughtSignature` base64 URL-safe encoded, auto-converted during unmarshal
+**Impact**: Transparent to user; handled automatically
+**Code**: `types.go:1048-1063`
+</Accordion>
+
+<Accordion title="Streaming Finish Reason Timing">
+**Severity**: Medium
+**Behavior**: `finish_reason` only present in final stream chunk with usage metadata
+**Impact**: Cannot determine completion until end of stream
+**Code**: `chat.go:206-208, 325-328`
+</Accordion>
+
+<Accordion title="Cached Content Token Reporting">
+**Severity**: Low
+**Behavior**: Cached tokens reported in `prompt_tokens_details.cached_tokens`, cannot distinguish cache creation vs read
+**Impact**: Billing estimates may be approximate
+**Code**: `utils.go:270-274`
+</Accordion>
+
+<Accordion title="System Instruction Integration">
+**Severity**: Medium
+**Behavior**: System instructions become `systemInstruction` field (separate from messages), not included in message array
+**Impact**: Structure differs from OpenAI's system message approach
+**Code**: `responses.go:34-46`
+</Accordion>
+
--- a/docs/providers/supported-providers/groq.mdx
+++ b/docs/providers/supported-providers/groq.mdx
@@ -0,0 +1,146 @@
+---
+title: "Groq"
+description: "Groq API conversion guide - OpenAI-compatible format, parameter handling, text completion fallback, streaming, and tool support"
+icon: "bolt"
+---
+
+## Overview
+
+Groq is an **OpenAI-compatible provider** offering the same API interface with identical parameter handling. Bifrost delegates most functionality to the OpenAI provider implementation with minimal modifications. Key features:
+- **Full OpenAI compatibility** - Identical request/response format
+- **Streaming support** - Server-Sent Events with delta-based updates
+- **Tool calling** - Complete function definition and execution support
+- **Text completion fallback** - Via litellm compatibility mode when enabled
+- **Parameter filtering** - Removes unsupported OpenAI-specific fields
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v1/chat/completions` |
+| Responses API | ✅ | ✅ | `/v1/chat/completions` |
+| Text Completions | ⚠️ | ⚠️ | Via internal conversion |
+| List Models | ✅ | - | `/v1/models` |
+| Embeddings | ❌ | ❌ | - |
+| Image Generation | ❌ | ❌ | - |
+| Speech (TTS) | ❌ | ❌ | - |
+| Transcriptions (STT) | ❌ | ❌ | - |
+| Files | ❌ | ❌ | - |
+| Batch | ❌ | ❌ | - |
+
+<Note>
+**Text Completions (⚠️)**: Not supported natively by Groq. When enabled via `x-litellm-fallback` context, Bifrost internally converts text completion requests to chat completion requests, processes them through Chat Completions, and converts the response back to text completion format.
+
+**Unsupported Operations** (❌): Embeddings, Image Generation, Speech, Transcriptions, Files, and Batch are not supported by the upstream Groq API. These return `UnsupportedOperationError`.
+</Note>
+
+---
+
+# 1. Chat Completions
+
+## Request Parameters
+
+Groq supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+### Dropped Parameters
+
+These parameters are silently removed before sending to Groq:
+- `prompt_cache_key` - Not supported
+- `verbosity` - Anthropic-specific
+- `store` - Not supported
+- `service_tier` - Not supported
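+
+As a rough illustration of this filtering, the sketch below removes the four listed fields from a generic parameter map. It assumes the request is available as a `map[string]any`; Bifrost's actual implementation likely operates on typed request structs, and the names used here are hypothetical.
+
+```go
+package main
+
+import "fmt"
+
+// droppedParams lists the fields named above. The variable name is illustrative.
+var droppedParams = []string{"prompt_cache_key", "verbosity", "store", "service_tier"}
+
+// stripDropped returns a copy of params without the dropped fields,
+// leaving the caller's map untouched.
+func stripDropped(params map[string]any) map[string]any {
+	out := make(map[string]any, len(params))
+	for k, v := range params {
+		out[k] = v
+	}
+	for _, k := range droppedParams {
+		delete(out, k)
+	}
+	return out
+}
+
+func main() {
+	req := map[string]any{
+		"model":        "llama-3.3-70b-versatile",
+		"temperature":  0.2,
+		"store":        true,
+		"service_tier": "auto",
+	}
+	fmt.Println(stripDropped(req))
+}
+```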
+
+### Reasoning Parameter
+
+Groq supports reasoning via the standard `reasoning_effort` field:
+
+```json
+// Request with reasoning
+{
+  "model": "llama-3.3-70b-versatile",
+  "messages": [...],
+  "reasoning_effort": "high"
+}
+```
+
+Bifrost converts from the internal `Reasoning` structure to `reasoning_effort` string.
+
+## Message Conversion
+
+Groq uses OpenAI message format with the following content type limitations:
+
+**Content Types Supported:**
+- ✅ Text content (strings)
+- ❌ Images (neither URL nor base64)
+- ❌ Audio input
+- ❌ Files
+
+For all other message handling, tools, responses, and streaming formats, refer to [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+---
+
+# 2. Responses API
+
+The Responses API is converted internally to Chat Completions:
+
+```go
+// Responses request → Chat request conversion (pseudocode, not literal Go):
+// request.ToChatRequest() → ChatCompletion → ToBifrostResponsesResponse()
+```
+
+Same parameter mapping and message conversion as Chat Completions. Response format differs slightly with `output` items instead of `message` content.
+
+---
+
+# 3. Text Completions (Litellm Fallback)
+
+<Warning>
+Text Completions are **not natively supported** by Groq. Support is only available when the `x-litellm-fallback` context flag is set.
+</Warning>
+
+When enabled, text completion requests are converted to chat completions:
+
+```go
+// Text completion → Chat completion conversion (pseudocode, not literal Go):
+//   1. Wrap the prompt in a chat message
+//   2. Call ChatCompletion
+//   3. Extract the text from the response
+//   4. Format it as a TextCompletionResponse
+```
+
+**Limitations:**
+- Uses chat API (different from native text completion)
+- Single choice only (n=1)
+- Streaming not available
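+
+The four conversion steps can be sketched in Go as follows. Every type and function here (`chatMessage`, `chatFunc`, `textViaChat`, and so on) is a hypothetical stand-in rather than a Bifrost schema, and the chat call is injected so the example stays self-contained.
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+)
+
+// Hypothetical stand-ins for the real request/response schemas.
+type chatMessage struct {
+	Role    string
+	Content string
+}
+
+type chatResponse struct {
+	Choices []chatMessage
+}
+
+type textCompletionResponse struct {
+	Text string
+}
+
+// chatFunc abstracts the chat completion call.
+type chatFunc func(messages []chatMessage) (*chatResponse, error)
+
+func textViaChat(prompt string, chat chatFunc) (*textCompletionResponse, error) {
+	// 1. Wrap the prompt in a single user message.
+	messages := []chatMessage{{Role: "user", Content: prompt}}
+	// 2. Call the chat completion path.
+	resp, err := chat(messages)
+	if err != nil {
+		return nil, err
+	}
+	// 3. Extract the text of the first (and only) choice.
+	if resp == nil || len(resp.Choices) == 0 {
+		return nil, errors.New("chat response contained no choices")
+	}
+	// 4. Format it as a text completion response.
+	return &textCompletionResponse{Text: resp.Choices[0].Content}, nil
+}
+
+func main() {
+	// Fake chat backend that echoes the prompt.
+	echo := func(messages []chatMessage) (*chatResponse, error) {
+		reply := chatMessage{Role: "assistant", Content: "echo: " + messages[0].Content}
+		return &chatResponse{Choices: []chatMessage{reply}}, nil
+	}
+	out, err := textViaChat("Say hi", echo)
+	if err != nil {
+		fmt.Println("error:", err)
+		return
+	}
+	fmt.Println(out.Text)
+}
+```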
+
+---
+
+# 4. List Models
+
+Groq's model listing endpoint returns available models with their context lengths and capabilities.
+
+---
+
+## Unsupported Features
+
+| Feature | Reason |
+|---------|--------|
+| Image URLs | Groq doesn't support image inputs |
+| Image Base64 | Groq doesn't support image inputs |
+| Multiple Images | Groq doesn't support image inputs |
+| Embedding | Not offered by Groq API |
+| Speech/TTS | Not offered by Groq API |
+| Transcription/STT | Not offered by Groq API |
+| Batch Operations | Not offered by Groq API |
+| File Management | Not offered by Groq API |
+
+---
+
+## Caveats
+
+<Accordion title="User Field Size Limit">
+**Severity**: Low
+**Behavior**: User field > 64 characters is silently dropped
+**Impact**: Longer user identifiers are lost
+**Code**: SanitizeUserField enforces 64-char max
+</Accordion>
--- a/docs/providers/supported-providers/huggingface.mdx
+++ b/docs/providers/supported-providers/huggingface.mdx
@@ -0,0 +1,570 @@
+---
+title: "Hugging Face"
+description: "Detailed guide on Hugging Face provider implementation specifics, including model aliases and unique request handling."
+icon: "face-smiling-hands"
+---
+
+The Hugging Face provider in Bifrost (`core/providers/huggingface`) implements a complex integration that supports multiple inference providers (like `hf-inference`, `fal-ai`, `cerebras`, `sambanova`, etc.) through a unified interface.
+
+## Overview
+
+The Hugging Face provider implements custom logic for:
+- **Multiple inference backends**: Routes requests to 19+ different inference providers
+- **Dynamic model aliasing**: Transforms model IDs based on provider-specific mappings
+- **Heterogeneous request formats**: Supports JSON, raw binary, and base64-encoded payloads
+- **Provider-specific constraints**: Handles varying payload limits and format restrictions
+
+## Supported Inference Providers
+
+The table below lists the inference backends that the Hugging Face provider can route to, along with their capabilities (as of December 2025):
+
+| Provider | Chat | Embedding | Speech (TTS) | Transcription (ASR) | Image Generation | Image Generation (stream) | Image Edit | Image Edit (stream) |
+|----------|------|-----------|--------------|---------------------|------------------|---------------------------|------------|---------------------|
+| `hf-inference` | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
+| `cerebras` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| `cohere` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| `fal-ai` | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| `featherless-ai` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| `fireworks` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| `groq` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| `hyperbolic` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| `nebius` | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
+| `novita` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| `nscale` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| `ovhcloud-ai-endpoints` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| `public-ai` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| `replicate` | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
+| `sambanova` | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| `scaleway` | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| `together` | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ |
+| `z-ai` | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+
+<Note>Provider capabilities may change over time. For the most up-to-date information, refer to the [Hugging Face Inference Providers documentation](https://huggingface.co/docs/inference-providers/en/index#partners). Checkmarks (✅) indicate capabilities supported by the inference provider itself.</Note>
+
+<Info>All Chat-supported models automatically support Responses (`v1/responses`) as well via Bifrost's internal conversion logic.</Info>
+
+## Model Aliases & Identification
+
+Unlike standard providers where model IDs are direct strings (e.g., `gpt-4`), Hugging Face models in Bifrost are identified by a composite key to route requests to the correct inference backend.
+
+**Format**: `huggingface/[inference_provider]/[model_id]`
+
+- **inference_provider**: The backend service (e.g., `hf-inference`, `fal-ai`, `cerebras`).
+- **model_id**: The actual model identifier on Hugging Face Hub (e.g., `meta-llama/Meta-Llama-3-8B-Instruct`).
+
+**Example**: `huggingface/hf-inference/meta-llama/Meta-Llama-3-8B-Instruct`
+
+This parsing logic is handled in `utils.go` and `models.go`, allowing Bifrost to dynamically route requests based on the model string.
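+
+A minimal sketch of this parsing, assuming the model string is split on the first two slashes so that slashes inside the Hub model ID are preserved. `parseModel` is a hypothetical helper, not the function in `utils.go`.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// parseModel splits "huggingface/[inference_provider]/[model_id]" into its
+// provider and model ID parts. The model ID may itself contain slashes.
+func parseModel(model string) (provider, modelID string, err error) {
+	parts := strings.SplitN(model, "/", 3)
+	if len(parts) != 3 || parts[0] != "huggingface" || parts[1] == "" || parts[2] == "" {
+		return "", "", fmt.Errorf("expected huggingface/[inference_provider]/[model_id], got %q", model)
+	}
+	return parts[1], parts[2], nil
+}
+
+func main() {
+	provider, modelID, err := parseModel("huggingface/hf-inference/meta-llama/Meta-Llama-3-8B-Instruct")
+	if err != nil {
+		fmt.Println("error:", err)
+		return
+	}
+	fmt.Println(provider, modelID)
+}
+```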
+
+## Request Handling Differences
+
+The Hugging Face provider handles various tasks (Chat, Speech, Transcription) which often require different request structures depending on the underlying inference provider.
+
+### Inference Provider Constraints
+
+Different inference providers have specific limitations and requirements:
+
+#### Payload Limit
+The Hugging Face API enforces a **2 MB request body limit** across all request types (Chat, Embedding, Speech, Transcription). This constraint applies to:
+- JSON request payloads
+- Raw audio bytes in transcription requests
+- Any other request body data
+
+**Impact**: Large audio files, extensive chat histories, or bulk embedding requests may need to be split or compressed before sending.
+
+#### `fal-ai` Audio Format Restrictions
+The `fal-ai` provider has strict audio format requirements:
+- **Supported Format**: Only **MP3** (`audio/mpeg`) is accepted
+- **Rejected Formats**: WAV (`audio/wav`) is explicitly rejected by the validation logic; other non-MP3 formats are not supported
+- **Encoding**: Audio must be provided as a **base64-encoded Data URI** in the `audio_url` field
+
+**Validation Logic** (from `core/providers/huggingface/transcription.go`):
+```go
+mimeType := getMimeTypeForAudioType(utils.DetectAudioMimeType(request.Input.File))
+if mimeType == "audio/wav" {
+    return nil, fmt.Errorf("fal-ai provider does not support audio/wav format; please use a different format like mp3 or ogg")
+}
+encoded = fmt.Sprintf("data:%s;base64,%s", mimeType, encoded)
+```
+
+### Speech (Text-to-Speech)
+
+For Text-to-Speech (TTS) requests, the implementation differs from a standard pipeline request:
+
+- **No Pipeline Tag**: The `HuggingFaceSpeechRequest` struct does not include a `pipeline_tag` field in the JSON body, even though the model might be tagged as `text-to-speech` on the Hub.
+- **Structure**:
+  ```go
+  type HuggingFaceSpeechRequest struct {
+      Text       string                       `json:"text"`
+      Provider   string                       `json:"provider" validate:"required"`
+      Model      string                       `json:"model" validate:"required"`
+      Parameters *HuggingFaceSpeechParameters `json:"parameters,omitempty"`
+  }
+  ```
+- **Implementation**: See `core/providers/huggingface/speech.go`.
+
+### Transcription (Automatic Speech Recognition)
+
+The Transcription implementation (`core/providers/huggingface/transcription.go`) exhibits a "pattern-breaking" behavior where the request format changes significantly based on the inference provider.
+
+#### 1. `hf-inference` (Raw Bytes)
+When using the standard `hf-inference` provider, the API expects the **raw audio bytes** directly in the request body, not a JSON object.
+
+- **Content-Type**: Audio mime type (e.g., `audio/mpeg`).
+- **Body**: Raw binary data from `request.Input.File`.
+- **Payload Limit**: **Maximum 2 MB** for the raw audio bytes.
+- **Logic**:
+  ```go
+  // core/providers/huggingface/huggingface.go
+  if inferenceProvider == hfInference {
+      jsonData = request.Input.File // Raw bytes (max 2 MB)
+      isHFInferenceAudioRequest = true
+  }
+  ```
+- **URL Pattern**: `/hf-inference/models/{model_name}` (no `/pipeline/` suffix for ASR).
+
+#### 2. `fal-ai` (JSON with Base64 Data URI)
+When using `fal-ai` through the Hugging Face provider, the API expects a **JSON body** containing the audio as a **base64-encoded Data URI**.
+
+- **Content-Type**: `application/json`.
+- **Body**: JSON object with `audio_url` field.
+- **Audio Format Restriction**: **Only MP3** (`audio/mpeg`) is supported. WAV files are rejected.
+- **Encoding**: Audio is base64-encoded and prefixed with a Data URI scheme.
+- **Logic**:
+  ```go
+  // core/providers/huggingface/transcription.go
+  encoded = base64.StdEncoding.EncodeToString(request.Input.File)
+  mimeType := getMimeTypeForAudioType(utils.DetectAudioMimeType(request.Input.File))
+  if mimeType == "audio/wav" {
+      return nil, fmt.Errorf("fal-ai provider does not support audio/wav format; please use a different format like mp3 or ogg")
+  }
+  encoded = fmt.Sprintf("data:%s;base64,%s", mimeType, encoded)
+  hfRequest = &HuggingFaceTranscriptionRequest{
+      AudioURL: encoded,
+  }
+  ```
+
+#### Dual Fields in `types.go`
+To support these divergent requirements, the `HuggingFaceTranscriptionRequest` struct in `types.go` contains fields for both scenarios; only one of them is set for any given request:
+
+```go
+type HuggingFaceTranscriptionRequest struct {
+    Inputs     []byte  `json:"inputs,omitempty"`    // For standard JSON providers (NOT hf-inference raw body)
+    AudioURL   string  `json:"audio_url,omitempty"` // For fal-ai (base64 Data URI, MP3 only)
+    Provider   *string `json:"provider,omitempty"`
+    Model      *string `json:"model,omitempty"`
+    Parameters *HuggingFaceTranscriptionRequestParameters `json:"parameters,omitempty"`
+}
+```
+
+**Key Points**:
+- `Inputs`: Used when JSON body is sent with raw bytes (most providers except `hf-inference` and `fal-ai`).
+- `AudioURL`: Used exclusively for `fal-ai`, must be a base64-encoded Data URI with MP3 format.
+- **Note**: For `hf-inference`, the entire request body is raw audio bytes—no JSON structure is used at all.
+
+## Image Generation
+
+The Hugging Face provider supports image generation through multiple inference providers, each with different request formats and capabilities.
+
+### Supported Inference Providers
+
+| Provider | Non-Streaming | Streaming | Notes |
+|----------|--------------|-----------|-------|
+| `hf-inference` | ✅ | ❌ | Simple prompt-only format, returns raw image bytes |
+| `fal-ai` | ✅ | ✅ | Full parameter support, supports streaming via Server-Sent Events |
+| `nebius` | ✅ | ❌ | Uses Nebius-specific format with width/height, LoRAs support |
+| `together` | ✅ | ❌ | OpenAI-compatible format |
+
+### Request Conversion
+
+The provider automatically routes to the appropriate inference provider based on the model string format: `huggingface/{provider}/{model_id}`.
+
+#### 1. `hf-inference`
+
+The simplest format, only requires a prompt:
+
+- **Request Structure**:
+  ```go
+  type HuggingFaceHFInferenceImageGenerationRequest struct {
+      Inputs string `json:"inputs"` // The prompt text
+  }
+  ```
+- **Response**: Raw image bytes (PNG/JPEG), automatically base64-encoded in Bifrost response
+- **Limitations**: No size, quality, or other parameter support
+
+#### 2. `fal-ai`
+
+The most feature-rich provider with extensive parameter support:
+
+- **Request Structure**:
+  ```go
+  type HuggingFaceFalAIImageGenerationRequest struct {
+      Prompt                string                `json:"prompt"`
+      NumImages             *int                  `json:"num_images,omitempty"`        // Maps from params.n
+      ResponseFormat        *string               `json:"response_format,omitempty"`   // "url" or "b64_json"
+      ImageSize             *HuggingFaceFalAISize `json:"image_size,omitempty"`        // {width, height} from size
+      NegativePrompt        *string               `json:"negative_prompt,omitempty"`
+      GuidanceScale         *float64              `json:"guidance_scale,omitempty"`    // From extra_params
+      NumInferenceSteps     *int                  `json:"num_inference_steps,omitempty"`
+      Seed                  *int                  `json:"seed,omitempty"`
+      OutputFormat          *string               `json:"output_format,omitempty"`    // "png", "jpeg", "webp" (jpg→jpeg)
+      SyncMode              *bool                 `json:"sync_mode,omitempty"`        // Auto-set if response_format="b64_json"
+      EnableSafetyChecker   *bool                 `json:"enable_safety_checker,omitempty"` // Auto-set if moderation="low"
+      Acceleration          *string               `json:"acceleration,omitempty"`      // From extra_params
+      EnablePromptExpansion *bool                 `json:"enable_prompt_expansion,omitempty"` // From extra_params
+  }
+  ```
+- **Parameter Mappings**:
+  - `n` → `num_images`
+  - `size` (e.g., `"1024x1024"`) → `image_size: {width: 1024, height: 1024}`
+  - `output_format: "jpg"` → `output_format: "jpeg"` (normalized)
+  - `response_format: "b64_json"` → `sync_mode: true`
+  - `moderation: "low"` → `enable_safety_checker: false`
+- **Response**: JSON with `images[]` array containing `url` and/or `b64_json` fields
+- **Extra Parameters**: Supports `guidance_scale`, `acceleration`, `enable_prompt_expansion`, `enable_safety_checker` via `extra_params`
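+
+The parameter mappings listed above can be sketched as a single conversion function. The `genericParams` input type and the trimmed-down `falRequest` are hypothetical simplifications of the real structs, used only to show the mapping rules.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"strconv"
+	"strings"
+)
+
+type falSize struct {
+	Width  int `json:"width"`
+	Height int `json:"height"`
+}
+
+// falRequest is a reduced, illustrative version of the fal-ai request.
+type falRequest struct {
+	Prompt              string   `json:"prompt"`
+	NumImages           *int     `json:"num_images,omitempty"`
+	ImageSize           *falSize `json:"image_size,omitempty"`
+	OutputFormat        *string  `json:"output_format,omitempty"`
+	SyncMode            *bool    `json:"sync_mode,omitempty"`
+	EnableSafetyChecker *bool    `json:"enable_safety_checker,omitempty"`
+}
+
+// genericParams is a hypothetical stand-in for the OpenAI-style parameters.
+type genericParams struct {
+	N              int
+	Size           string
+	OutputFormat   string
+	ResponseFormat string
+	Moderation     string
+}
+
+func ptr[T any](v T) *T { return &v }
+
+func toFalRequest(prompt string, p genericParams) falRequest {
+	req := falRequest{Prompt: prompt}
+	if p.N > 0 {
+		req.NumImages = ptr(p.N) // n -> num_images
+	}
+	if w, h, ok := strings.Cut(p.Size, "x"); ok { // "WxH" -> {width, height}
+		width, errW := strconv.Atoi(w)
+		height, errH := strconv.Atoi(h)
+		if errW == nil && errH == nil {
+			req.ImageSize = &falSize{Width: width, Height: height}
+		}
+	}
+	switch p.OutputFormat {
+	case "":
+	case "jpg":
+		req.OutputFormat = ptr("jpeg") // jpg is normalized to jpeg
+	default:
+		req.OutputFormat = ptr(p.OutputFormat)
+	}
+	if p.ResponseFormat == "b64_json" {
+		req.SyncMode = ptr(true)
+	}
+	if p.Moderation == "low" {
+		req.EnableSafetyChecker = ptr(false)
+	}
+	return req
+}
+
+func main() {
+	req := toFalRequest("A futuristic cityscape at sunset", genericParams{
+		N: 2, Size: "1024x1024", OutputFormat: "jpg", ResponseFormat: "b64_json", Moderation: "low",
+	})
+	body, _ := json.Marshal(req)
+	fmt.Println(string(body))
+}
+```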
+
+#### 3. `nebius`
+
+Uses Nebius-specific format with support for LoRAs:
+
+- **Request Structure**: Uses `NebiusImageGenerationRequest` (see Nebius provider docs)
+- **Parameter Mappings**:
+  - `size` (e.g., `"1024x1024"`) → `width` and `height` integers
+  - `output_format` → `response_extension` (normalized: "jpeg" → "jpg")
+  - `seed`, `negative_prompt` → Passed directly
+  - `extra_params.num_inference_steps` → `num_inference_steps`
+  - `extra_params.guidance_scale` → `guidance_scale`
+  - `extra_params.loras` → `loras[]` array (supports both map and array formats)
+- **Response**: Uses Nebius response format, converted to Bifrost format
+
+#### 4. `together`
+
+OpenAI-compatible format:
+
+- **Request Structure**:
+  ```go
+  type HuggingFaceTogetherImageGenerationRequest struct {
+      Prompt         string  `json:"prompt"`
+      Model          string  `json:"model"`
+      ResponseFormat *string `json:"response_format,omitempty"`
+      Size           *string `json:"size,omitempty"`  // Passed directly
+      N              *int    `json:"n,omitempty"`
+      Steps          *int    `json:"steps,omitempty"`  // From num_inference_steps
+  }
+  ```
+- **Parameter Mappings**:
+  - `response_format: "b64_json"` → `response_format: "base64"`
+  - `num_inference_steps` → `steps`
+- **Response**: OpenAI-compatible format with `data[]` array
+
+### Response Conversion
+
+Each provider's response is converted to Bifrost's unified `BifrostImageGenerationResponse` format:
+
+- **hf-inference**: Raw bytes → base64-encoded in `b64_json`
+- **fal-ai**: `images[]` array → `ImageData[]` with `url` and/or `b64_json`
+- **nebius**: Uses Nebius converter → Bifrost format
+- **together**: `data[]` array → `ImageData[]` with `b64_json` and/or `url`
+
+### Image Generation Streaming
+
+**Only `fal-ai` supports streaming** for HuggingFace image generation. Streaming uses Server-Sent Events (SSE) format.
+
+#### Streaming Request Format
+
+```go
+type HuggingFaceFalAIImageStreamRequest struct {
+    Prompt                string                `json:"prompt"`
+    ResponseFormat        *string               `json:"response_format,omitempty"`
+    NumImages             *int                  `json:"num_images,omitempty"`
+    ImageSize             *HuggingFaceFalAISize `json:"image_size,omitempty"`
+    // ... same parameters as non-streaming
+}
+```
+
+#### Streaming Response Format
+
+- **Event Type**: Server-Sent Events with `data:` prefix
+- **Chunk Format**: Each SSE event contains JSON with `images[]` array
+- **Stream Processing**:
+  - Each image in `images[]` becomes a separate stream chunk
+  - Chunks have `type: "partial"` until stream completion
+  - Final chunk has `type: "completed"` with the last image data
+  - Images can be delivered as `url` (public URL) or `b64_json` (base64-encoded)
+- **URL Pattern**: `/fal-ai/{model_id}/stream` (appended to base URL)
+
+#### Streaming Behavior
+
+- **Chunk Indexing**: Each chunk has an `Index` field (0, 1, 2, ...) and `ChunkIndex` for ordering
+- **Completion**: Final chunk includes all image data from the last SSE event
+- **Error Handling**: Errors in SSE format are parsed and sent as `BifrostError` chunks
+
+### Example Usage
+
+<Tabs>
+<Tab title="Gateway - fal-ai">
+
+```bash
+curl -X POST http://localhost:8080/v1/images/generations \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "huggingface/fal-ai/fal-ai/flux/dev",
+    "prompt": "A futuristic cityscape at sunset",
+    "size": "1024x1024",
+    "n": 2,
+    "output_format": "png",
+    "response_format": "url"
+  }'
+```
+
+</Tab>
+<Tab title="Gateway - Streaming (fal-ai only)">
+
+```bash
+curl -X POST http://localhost:8080/v1/images/generations \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "huggingface/fal-ai/fal-ai/flux/dev",
+    "prompt": "A futuristic cityscape at sunset",
+    "size": "1024x1024",
+    "stream": true
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.ImageGenerationRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostImageGenerationRequest{
+    Provider: schemas.HuggingFace,
+    Model:    "huggingface/fal-ai/fal-ai/flux/dev",
+    Input: &schemas.ImageGenerationInput{
+        Prompt: "A futuristic cityscape at sunset",
+    },
+    Params: &schemas.ImageGenerationParameters{
+        Size:              schemas.Ptr("1024x1024"),
+        N:                 schemas.Ptr(2),
+        OutputFormat:      schemas.Ptr("png"),
+        ResponseFormat:    schemas.Ptr("url"),
+        Seed:              schemas.Ptr(42),
+        NegativePrompt:    schemas.Ptr("blurry, low quality"),
+        NumInferenceSteps: schemas.Ptr(50),
+        ExtraParams: map[string]interface{}{
+            "guidance_scale":          7.5,
+            "acceleration":            "t4",
+            "enable_prompt_expansion": true,
+        },
+    },
+})
+```
+
+</Tab>
+<Tab title="Go SDK - Streaming">
+
+```go
+streamChan, err := client.ImageGenerationStreamRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostImageGenerationRequest{
+    Provider: schemas.HuggingFace,
+    Model:    "huggingface/fal-ai/fal-ai/flux/dev",
+    Input: &schemas.ImageGenerationInput{
+        Prompt: "A futuristic cityscape at sunset",
+    },
+    Params: &schemas.ImageGenerationParameters{
+        Size: schemas.Ptr("1024x1024"),
+        N:    schemas.Ptr(2),
+    },
+})
+if err != nil {
+    // Handle the request error before reading from the stream
+    return
+}
+
+for stream := range streamChan {
+    if stream.BifrostImageGenerationStreamResponse != nil {
+        chunk := stream.BifrostImageGenerationStreamResponse
+        if chunk.URL != "" {
+            // Handle image URL
+        } else if chunk.B64JSON != "" {
+            // Handle base64 image data
+        }
+    }
+}
+```
+
+</Tab>
+</Tabs>
+
+### Provider-Specific Notes
+
+- **fal-ai**: 
+  - When `response_format="b64_json"`, `sync_mode` is automatically set to `true`
+  - When `moderation="low"`, `enable_safety_checker` is set to `false`
+  - `output_format: "jpg"` is normalized to `"jpeg"`
+- **nebius**: 
+  - `response_extension: "jpeg"` is normalized to `"jpg"` (Nebius inconsistency)
+  - LoRAs can be provided as `{"url": scale}` map or `[{"url": "...", "scale": ...}]` array
+- **hf-inference**: 
+  - Minimal format, only prompt supported
+  - Returns raw image bytes (automatically base64-encoded)
+- **together**: 
+  - OpenAI-compatible format
+  - `response_format: "b64_json"` is converted to `"base64"`
+
+## Image Edit
+
+<Warning>
+Requests use **multipart/form-data**, not JSON.
+</Warning>
+
+**Only `fal-ai` supports image editing** through the Hugging Face provider. Image edit requests are routed to the `fal-ai` inference provider.
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `model` | string | ✅ | Model identifier (must be `huggingface/fal-ai/{model_id}`) |
+| `prompt` | string | ✅ | Text description of the edit |
+| `image[]` | binary | ✅ | Image file(s) to edit (supports multiple images for some models) |
+| `n` | int | ❌ | Number of images to generate (1-10) |
+| `size` | string | ❌ | Image size: `"WxH"` format (e.g., `"1024x1024"`) |
+| `output_format` | string | ❌ | Output format: `"png"`, `"webp"`, `"jpeg"` (note: `"jpg"` is normalized to `"jpeg"`) |
+| `seed` | int | ❌ | Seed for reproducibility (via `ExtraParams["seed"]`) |
+| `num_inference_steps` | int | ❌ | Number of inference steps (via `ExtraParams["num_inference_steps"]`) |
+| `guidance_scale` | float | ❌ | Guidance scale (via `ExtraParams["guidance_scale"]`) |
+| `acceleration` | string | ❌ | Acceleration mode (via `ExtraParams["acceleration"]`) |
+| `enable_safety_checker` | bool | ❌ | Enable safety checker (via `ExtraParams["enable_safety_checker"]`) |
+| `use_image_urls` | bool | ❌ | Override image field selection (via `ExtraParams["use_image_urls"]`) |
+
+---
+
+**Request Conversion**
+
+- **Model Validation**: Only `fal-ai` inference provider supports image edit. Other providers return `UnsupportedOperationError`.
+- **Image Conversion**: Each image in `bifrostReq.Input.Images` is converted to a base64 data URL:
+  - Format: `data:{mimeType};base64,{base64Data}`
+  - MIME type detection: `image/jpeg`, `image/webp`, `image/png` (via `http.DetectContentType`)
+- **Image Field Selection**: The provider uses different image fields based on model capabilities:
+  - **Multi-image models** (e.g., `fal-ai/flux-2/edit`, `fal-ai/flux-2-pro/edit`): Uses `image_urls` array field
+  - **Single-image models** (e.g., `fal-ai/flux-pro/kontext`, `fal-ai/flux/dev/image-to-image`): Uses `image_url` string field
+  - **Override**: `ExtraParams["use_image_urls"]` can override the automatic selection
+  - **Fallback**: For unknown models, uses `image_url` if single image, `image_urls` if multiple images
+- **Parameter Mapping**:
+  - `prompt` → `Prompt`
+  - `n` → `NumImages`
+  - `size` → `ImageSize` (converted from `"WxH"` string to `{Width, Height}` object)
+  - `output_format` → `OutputFormat` (`"jpg"` normalized to `"jpeg"`)
+  - `seed` (via `ExtraParams["seed"]`) → `Seed`
+  - `num_inference_steps` (via `ExtraParams["num_inference_steps"]`) → `NumInferenceSteps`
+  - `guidance_scale` (via `ExtraParams["guidance_scale"]`) → `GuidanceScale`
+  - `acceleration` (via `ExtraParams["acceleration"]`) → `Acceleration`
+  - `enable_safety_checker` (via `ExtraParams["enable_safety_checker"]`) → `EnableSafetyChecker`
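+
+The image conversion and the fallback field selection can be illustrated with the standard library alone. `http.DetectContentType` is the call named above; the helper names `toDataURL` and `fallbackImageField` are hypothetical, and the sketch covers only the fallback rule for unknown models, not the per-model table or the `use_image_urls` override.
+
+```go
+package main
+
+import (
+	"encoding/base64"
+	"fmt"
+	"net/http"
+)
+
+// toDataURL encodes raw image bytes as data:{mimeType};base64,{base64Data}.
+func toDataURL(image []byte) string {
+	mimeType := http.DetectContentType(image)
+	return fmt.Sprintf("data:%s;base64,%s", mimeType, base64.StdEncoding.EncodeToString(image))
+}
+
+// fallbackImageField mirrors the fallback rule for unknown models:
+// a single image uses image_url, multiple images use image_urls.
+func fallbackImageField(imageCount int) string {
+	if imageCount > 1 {
+		return "image_urls"
+	}
+	return "image_url"
+}
+
+func main() {
+	// Only the PNG file signature, which is enough for content sniffing.
+	png := []byte("\x89PNG\r\n\x1a\n")
+	fmt.Println(toDataURL(png))
+	fmt.Println(fallbackImageField(1), fallbackImageField(3))
+}
+```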
+
+**Response Conversion**
+
+- **Non-streaming**: Uses the same response conversion as image generation (see Image Generation section)
+- **Streaming**: fal-ai streaming responses use Server-Sent Events (SSE) format:
+  - **Event Type**: Server-Sent Events with `data:` prefix
+  - **Chunk Format**: Each SSE event contains JSON with `images[]` array (or `data.images[]` in API envelope format)
+  - **Stream Processing**:
+    - Each image in `images[]` becomes a separate stream chunk
+    - Chunks have `type: "image_edit.partial_image"` until stream completion
+    - Final chunk has `type: "image_edit.completed"` with the last image data
+    - Images can be delivered as `url` (public URL) or `b64_json` (base64-encoded)
+  - **Response Structure**: Handles both API envelope format (`Data.Images`) and legacy flattened format (`Images`)
+  - **URL Pattern**: `/fal-ai/{model_id}/stream` (appended to base URL)
+
+**Endpoint**: `/fal-ai/{model_id}` (non-streaming), `/fal-ai/{model_id}/stream` (streaming)
+
+**Image Variation**
+
+Image variation is not supported by HuggingFace.
+
+## Raw JSON Body Handling
+
+While most providers strictly serialize a struct to JSON, the Hugging Face provider takes a hybrid approach that depends on the inference provider. The `Transcription` method may send either raw bytes or JSON, and embedding requests use different field names per backend.
+
+### Embedding Requests
+
+For embedding requests, different providers expect different field names:
+
+- **Standard providers** (most): Use `input` field
+- **`hf-inference`**: Uses `inputs` field (plural)
+
+**Request Structure**:
+```go
+type HuggingFaceEmbeddingRequest struct {
+    Input    interface{} `json:"input,omitempty"`    // Used by all providers except hf-inference
+    Inputs   interface{} `json:"inputs,omitempty"`   // Used by hf-inference
+    Provider *string     `json:"provider,omitempty"` // Identifies the inference backend
+    Model    *string     `json:"model,omitempty"`    
+    // ... other fields
+}
+```
+
+The converter in `embedding.go` populates both fields to ensure compatibility across providers.
+
+### Differences in Inference Provider Constraints
+
+This multi-mode approach lets the provider support diverse API contracts within a single implementation, accommodating:
+1. **Legacy endpoints** that expect raw binary data
+2. **Modern JSON APIs** with different schema expectations
+3. **Third-party providers** (like `fal-ai`) with custom requirements
+4. **Performance optimizations** (raw bytes avoid JSON overhead for `hf-inference`)
+
+## Model Discovery & Caching
+
+The provider discovers available models through the Hugging Face Hub API:
+
+### List Models Flow
+1. **Parallel Queries**: Fetches models from multiple inference providers concurrently
+2. **Filter by Pipeline Tag**: Uses `pipeline_tag` (e.g., `text-to-speech`, `feature-extraction`) to determine supported methods
+3. **Aggregate Results**: Combines responses from all providers into a unified list
+4. **Model ID Format**: Returns models as `huggingface/{provider}/{model_id}`
+
+### Provider Model Mapping Cache
+The provider maintains a cache (`modelProviderMappingCache`) to map Hugging Face model IDs to provider-specific model identifiers:
+
+```json
+// Example: "meta-llama/Meta-Llama-3-8B-Instruct" -> provider mappings
+{
+    "cerebras": {
+        "ProviderTask": "chat-completion",
+        "ProviderModelID": "llama3-8b-8192"
+    },
+    "groq": {
+        "ProviderTask": "chat-completion", 
+        "ProviderModelID": "llama3-8b-instant"
+    }
+}
+```
+
+**Cache Invalidation**: On HTTP 404 errors, the cache is cleared and the mapping is re-fetched, then the request is retried with the updated model ID.
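+
+The invalidate-and-retry flow can be sketched as follows. This is an illustration under stated assumptions, not the provider's actual code: `mappingCache`, `callWithRetry`, and `errNotFound` are hypothetical names, and the sketch drops only the affected entry rather than the whole cache.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+    "sync"
+)
+
+// errNotFound stands in for an HTTP 404 from the inference provider.
+var errNotFound = errors.New("404 not found")
+
+// mappingCache is a hypothetical stand-in for modelProviderMappingCache.
+type mappingCache struct {
+    mu      sync.Mutex
+    entries map[string]string // Hugging Face model ID -> provider model ID
+}
+
+// get returns the cached mapping, fetching and storing it on a miss.
+func (c *mappingCache) get(model string, fetch func(string) string) string {
+    c.mu.Lock()
+    defer c.mu.Unlock()
+    if id, ok := c.entries[model]; ok {
+        return id
+    }
+    id := fetch(model)
+    c.entries[model] = id
+    return id
+}
+
+// invalidate removes the cached mapping for one model.
+func (c *mappingCache) invalidate(model string) {
+    c.mu.Lock()
+    defer c.mu.Unlock()
+    delete(c.entries, model)
+}
+
+// callWithRetry sends the request and, on a 404, drops the cached mapping,
+// re-fetches it, and retries exactly once with the updated model ID.
+func callWithRetry(c *mappingCache, model string, fetch func(string) string, send func(string) error) error {
+    err := send(c.get(model, fetch))
+    if errors.Is(err, errNotFound) {
+        c.invalidate(model)
+        err = send(c.get(model, fetch))
+    }
+    return err
+}
+
+func main() {
+    cache := &mappingCache{entries: map[string]string{"org/model": "stale-id"}}
+    fetch := func(string) string { return "fresh-id" }
+    send := func(id string) error {
+        if id == "stale-id" {
+            return errNotFound
+        }
+        return nil
+    }
+    fmt.Println("error after retry:", callWithRetry(cache, "org/model", fetch, send))
+}
+```
+
+Retrying only once keeps a genuinely missing model from looping forever; a second 404 is returned to the caller.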
+
+## Best Practices
+
+When working with the Hugging Face provider:
+
+1. **Check Payload Size**: Ensure request bodies are under 2 MB
+2. **Audio Format**: Use MP3 for `fal-ai`, avoid WAV files
+3. **Model Aliases**: Always specify provider in model string: `huggingface/{provider}/{model}`
+4. **Error Handling**: Implement retries for 404 errors (cache invalidation scenarios)
+5. **Provider Selection**: Use `auto` for automatic provider selection based on model capabilities
+6. **Pipeline Tags**: Verify model's `pipeline_tag` matches your use case (chat, embedding, TTS, ASR)
+
+## File Structure Reference
+
+```
+core/providers/huggingface/
+├── huggingface.go       # Main provider implementation, HTTP request handling
+├── types.go             # All provider-specific types (Request/Response DTOs)
+├── utils.go             # Helpers, constants, URL builders, model mapping
+├── chat.go              # Chat completion converters (Bifrost ↔ HF)
+├── embedding.go         # Embedding converters
+├── speech.go            # Text-to-speech converters
+├── transcription.go     # Speech-to-text converters
+├── models.go            # Model listing and capability detection
+├── images.go            # Image generation converters
+├── errors.go            # Error handling
+└── huggingface_test.go  # Comprehensive test suite
+```
+
+Each file follows strict separation of concerns as outlined in the [Adding a Provider](/contributing/adding-a-provider) guide.
--- a/docs/providers/supported-providers/mistral.mdx
+++ b/docs/providers/supported-providers/mistral.mdx
@@ -0,0 +1,270 @@
+---
+title: "Mistral"
+description: "Mistral API conversion guide - parameter mapping, message handling, tool support, transcription, and streaming behavior"
+icon: "m"
+---
+
+## Overview
+
+Mistral is an **OpenAI-compatible provider** with custom compatibility handling for specific features. Bifrost converts requests to Mistral's expected format while supporting their unique API endpoints. Key characteristics:
+- **OpenAI-compatible format** - Chat and streaming endpoints
+- **Transcription API** - Native audio transcription support
+- **Tool calling support** - Function definitions with string-based tool choice
+- **Streaming support** - Server-Sent Events for chat and transcription
+- **Parameter compatibility** - `max_completion_tokens` → `max_tokens` conversion
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v1/chat/completions` |
+| Responses API | ✅ | ✅ | `/v1/chat/completions` |
+| Transcriptions (STT) | ✅ | ✅ | `/v1/audio/transcriptions` |
+| Embeddings | ✅ | - | `/v1/embeddings` |
+| List Models | ✅ | - | `/v1/models` |
+| Image Generation | ❌ | ❌ | - |
+| Text Completions | ❌ | ❌ | - |
+| Speech (TTS) | ❌ | ❌ | - |
+| Files | ❌ | ❌ | - |
+| Batch | ❌ | ❌ | - |
+
+<Note>
+**Unsupported Operations** (❌): Text Completions, Speech (TTS), Files, and Batch are not supported by the upstream Mistral API. Image Generation is not currently supported by Bifrost's Mistral integration (Mistral API supports image generation, but Bifrost has not yet implemented this feature). These return `UnsupportedOperationError`.
+</Note>
+
+---
+
+# 1. Chat Completions
+
+## Request Parameters
+
+Mistral supports most OpenAI chat completion parameters with some conversions. For standard OpenAI parameter reference, see [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+### Parameter Mapping & Conversions
+
+| Parameter | OpenAI | Mistral | Notes |
+|-----------|--------|---------|-------|
+| `max_completion_tokens` | ✅ | `max_tokens` | **Conversion required** |
+| `temperature` | ✅ | ✅ | Direct pass-through |
+| `top_p` | ✅ | ✅ | Direct pass-through |
+| `stop` | ✅ | ✅ | Stop sequences |
+| `tools` | ✅ | ✅ | Function definitions |
+| `tool_choice` | String or object | String only | **Limitations apply** |
+| `user` | ✅ | ✅ | Max 64 characters |
+| `frequency_penalty`, `presence_penalty` | ✅ | ✅ | Direct pass-through |
+
+### Critical Conversions
+
+**max_completion_tokens → max_tokens:**
+```json
+// Bifrost request
+{"max_completion_tokens": 4096}
+
+// Mistral API
+{"max_tokens": 4096}
+```
+
+**Tool Choice Simplification:**
+Mistral only supports simple string tool choice, not structured constraints:
+```json
+// OpenAI supports specific tool forcing
+{"tool_choice": {"type": "function", "function": {"name": "specific_tool"}}}
+
+// Mistral only supports
+{"tool_choice": "any"}  // or "none", "auto"
+```
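+
+Both conversions can be sketched on a decoded request body. `toMistralParams` is a hypothetical helper written for illustration; it is not Bifrost's converter, and it assumes string tool choices are passed through unchanged:
+
+```go
+package main
+
+import "fmt"
+
+// toMistralParams (hypothetical) renames max_completion_tokens and collapses
+// a structured tool_choice to the string "any".
+func toMistralParams(in map[string]interface{}) map[string]interface{} {
+    out := make(map[string]interface{}, len(in))
+    for k, v := range in {
+        switch k {
+        case "max_completion_tokens":
+            out["max_tokens"] = v
+        case "tool_choice":
+            if s, ok := v.(string); ok {
+                out[k] = s // string values pass through unchanged
+            } else {
+                out[k] = "any" // structured choice is simplified
+            }
+        default:
+            out[k] = v
+        }
+    }
+    return out
+}
+
+func main() {
+    req := map[string]interface{}{
+        "max_completion_tokens": 4096,
+        "tool_choice": map[string]interface{}{
+            "type":     "function",
+            "function": map[string]interface{}{"name": "specific_tool"},
+        },
+    }
+    fmt.Println(toMistralParams(req))
+}
+```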
+
+### Filtered Parameters
+
+Removed for Mistral compatibility:
+- `prompt_cache_key` - Not supported
+- `cache_control` - Stripped from content blocks
+- `verbosity` - OpenAI-specific
+- `store` - Not supported
+- `service_tier` - Not supported
+
+## Message Conversion
+
+Full OpenAI message support:
+- All roles: user, assistant, system, tool, developer
+- Content types: text, images, audio, files
+
+## Tool Conversion
+
+Tool definitions supported with constraints:
+
+| Aspect | Support | Notes |
+|--------|---------|-------|
+| Function definitions | ✅ | Full parameter schema support |
+| Tool choice "auto" | ✅ | Default mode |
+| Tool choice "any" | ✅ | Requires any tool |
+| Tool choice "none" | ✅ | No tools |
+| Specific tool forcing | ❌ | Not supported - simplified to "any" |
+| Parallel tools | ✅ | Multiple tools in one turn |
+
+**Limitation Caveat:**
+```json
+// Bifrost allows specifying a specific tool
+{
+  "tool_choice": {
+    "type": "function",
+    "function": {"name": "get_weather"}  // ❌ Not supported
+  }
+}
+
+// Mistral compatibility - converted to generic "any"
+{
+  "tool_choice": "any"
+}
+```
+
+## Response Conversion
+
+Standard OpenAI-compatible response:
+- `choices[].message.content` - Response text
+- `choices[].message.tool_calls` - Function calls
+- `usage` - Token counts (`prompt_tokens`, `completion_tokens`)
+- `choices[].finish_reason` - `stop`, `tool_calls`, `length`
+
+---
+
+# 2. Responses API
+
+Converted internally to Chat Completions with format transformation:
+
+```
+ResponsesRequest → ChatRequest → ChatCompletion → ResponsesResponse
+```
+
+Same parameter support and tool handling as Chat Completions.
+
+---
+
+# 3. Transcription
+
+Mistral provides native audio transcription with streaming support.
+
+## Request Parameters
+
+### Parameter Mapping
+
+| Parameter | Bifrost | Mistral | Notes |
+|-----------|---------|---------|-------|
+| `file` | Binary audio | Multipart form | Converted to multipart |
+| `model` | Model name | model | |
+| `language` | ISO-639-1 | language | Optional language hint |
+| `prompt` | Optional | prompt | Context for recognition |
+| `response_format` | Format type | response_format | json, text, etc. |
+| `temperature` | float | temperature | Sampling temperature |
+| `timestamp_granularities` | Array | Array field | Segment/word timestamps |
+
+### Multipart Form Structure
+
+Transcription requests are sent as multipart/form-data:
+
+```
+--boundary
+Content-Disposition: form-data; name="file"; filename="audio.mp3"
+
+[binary audio data]
+--boundary
+Content-Disposition: form-data; name="model"
+
+voxtral-mini-latest
+--boundary
+Content-Disposition: form-data; name="language"
+
+en
+--boundary--
+```
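+
+A body of this shape can be produced with Go's standard `mime/multipart` package. The sketch below is illustrative only; `buildTranscriptionForm` is a hypothetical helper, not the function Bifrost uses:
+
+```go
+package main
+
+import (
+    "bytes"
+    "fmt"
+    "mime/multipart"
+)
+
+// buildTranscriptionForm (hypothetical) writes the audio file part followed
+// by plain text fields and returns the body plus its Content-Type header.
+func buildTranscriptionForm(audio []byte, filename string, fields map[string]string) (*bytes.Buffer, string, error) {
+    body := &bytes.Buffer{}
+    w := multipart.NewWriter(body)
+
+    part, err := w.CreateFormFile("file", filename)
+    if err != nil {
+        return nil, "", err
+    }
+    if _, err := part.Write(audio); err != nil {
+        return nil, "", err
+    }
+    for name, value := range fields {
+        if err := w.WriteField(name, value); err != nil {
+            return nil, "", err
+        }
+    }
+    // Close writes the terminating boundary line.
+    if err := w.Close(); err != nil {
+        return nil, "", err
+    }
+    return body, w.FormDataContentType(), nil
+}
+
+func main() {
+    fields := map[string]string{"model": "voxtral-mini-latest", "language": "en"}
+    body, contentType, err := buildTranscriptionForm([]byte("fake audio"), "audio.mp3", fields)
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println(contentType)
+    fmt.Println(body.Len(), "bytes")
+}
+```
+
+The returned content type carries the generated boundary, so it has to be sent as the request's `Content-Type` header alongside the body.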
+
+## Transcription Response
+
+```json
+{
+  "text": "transcribed text",
+  "language": "en",
+  "duration": 3.5,
+  "segments": [{
+    "id": 0,
+    "start": 0.0,
+    "end": 1.5,
+    "text": "transcribed segment",
+    "temperature": 0.0,
+    "avg_logprob": -0.45,
+    "compression_ratio": 1.2,
+    "no_speech_prob": 0.001
+  }],
+  "words": [{
+    "word": "transcribed",
+    "start": 0.0,
+    "end": 0.8
+  }]
+}
+```
+
+## Transcription Streaming
+
+Mistral supports SSE streaming for transcription with custom event types:
+
+| Event Type | Content | Notes |
+|-----------|---------|-------|
+| `transcription.language` | Language code | Language detected |
+| `transcription.text.delta` | Text delta | Incremental text |
+| `transcription.segment` | Full segment | Complete segment data |
+| `transcription.done` | Final usage | Completion with tokens |
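+
+A client can dispatch on these event types while reading the stream. The sketch below is a rough illustration: the `text` field in the delta payload is an assumed name, and `collectTranscript` is a hypothetical helper, so check the actual payload schema before relying on it.
+
+```go
+package main
+
+import (
+    "bufio"
+    "encoding/json"
+    "fmt"
+    "strings"
+)
+
+// deltaPayload models the data of a text delta event.
+type deltaPayload struct {
+    Text string `json:"text"` // assumed field name
+}
+
+// collectTranscript (hypothetical) concatenates text deltas from an SSE
+// stream and reports whether a transcription.done event was seen.
+func collectTranscript(stream string) (string, bool) {
+    var sb strings.Builder
+    var event string
+    done := false
+    sc := bufio.NewScanner(strings.NewReader(stream))
+    for sc.Scan() {
+        line := sc.Text()
+        switch {
+        case strings.HasPrefix(line, "event:"):
+            event = strings.TrimSpace(strings.TrimPrefix(line, "event:"))
+        case strings.HasPrefix(line, "data:"):
+            data := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
+            switch event {
+            case "transcription.text.delta":
+                var p deltaPayload
+                if err := json.Unmarshal([]byte(data), &p); err == nil {
+                    sb.WriteString(p.Text)
+                }
+            case "transcription.done":
+                done = true
+            }
+        }
+    }
+    return sb.String(), done
+}
+
+func main() {
+    stream := strings.Join([]string{
+        "event: transcription.language",
+        `data: {"language":"en"}`,
+        "",
+        "event: transcription.text.delta",
+        `data: {"text":"Hello"}`,
+        "",
+        "event: transcription.done",
+        "data: {}",
+        "",
+    }, "\n")
+    text, done := collectTranscript(stream)
+    fmt.Println(text, done)
+}
+```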
+
+---
+
+# 4. Embeddings
+
+Mistral supports text embeddings:
+
+| Parameter | Notes |
+|-----------|-------|
+| `input` | Text or array of texts |
+| `model` | Embedding model name |
+| `dimensions` | Custom output dimensions (optional) |
+| `encoding_format` | "float" or "base64" |
+
+Response returns embedding vectors with token usage.
+
+---
+
+# 5. List Models
+
+Lists available Mistral models with context length and capabilities.
+
+---
+
+## Unsupported Features
+
+| Feature | Reason |
+|---------|--------|
+| Text Completions | Not offered by Mistral API |
+| Image Generation | Not yet implemented in Bifrost integration (Mistral API supports this) |
+| Speech/TTS | Not offered by Mistral API |
+| File Management | Not offered by Mistral API |
+| Batch Operations | Not offered by Mistral API |
+
+---
+
+## Caveats
+
+<Accordion title="Cache Control Stripped">
+**Severity**: Medium
+**Behavior**: Cache control directives removed from messages
+**Impact**: Prompt caching features unavailable
+**Code**: Stripped during JSON marshaling
+</Accordion>
+
+<Accordion title="Parameter Filtering">
+**Severity**: Low
+**Behavior**: OpenAI-specific parameters filtered
+**Impact**: `prompt_cache_key`, `verbosity`, `store`, and `service_tier` are removed
+**Code**: filterOpenAISpecificParameters
+</Accordion>
+
+<Accordion title="User Field Size Limit">
+**Severity**: Low
+**Behavior**: User field > 64 characters silently dropped
+**Impact**: Longer user identifiers are lost
+**Code**: SanitizeUserField enforces 64-char max
+</Accordion>
--- a/docs/providers/supported-providers/nebius.mdx
+++ b/docs/providers/supported-providers/nebius.mdx
@@ -0,0 +1,230 @@
+---
+title: "Nebius"
+description: "Nebius API conversion guide - OpenAI-compatible format, parameter handling, streaming, embeddings, and special features"
+icon: "n"
+---
+
+## Overview
+
+Nebius is an **OpenAI-compatible provider** offering comprehensive API support. Bifrost delegates to the OpenAI implementation with standard parameter filtering. Key features:
+- **Full OpenAI compatibility** - Chat, text completion, embeddings, and responses
+- **Streaming support** - Server-Sent Events with delta-based updates
+- **AI Project ID** - Nebius-specific project identifier support
+- **Tool calling** - Complete function definition and execution
+- **Parameter filtering** - Removes unsupported OpenAI-specific fields
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v1/chat/completions` |
+| Responses API | ✅ | ✅ | `/v1/chat/completions` |
+| Text Completions | ✅ | ✅ | `/v1/completions` |
+| Embeddings | ✅ | - | `/v1/embeddings` |
+| Image Generation | ✅ | - | `/v1/images/generations` |
+| List Models | ✅ | - | `/v1/models` |
+| Speech (TTS) | ❌ | ❌ | - |
+| Transcriptions (STT) | ❌ | ❌ | - |
+| Files | ❌ | ❌ | - |
+| Batch | ❌ | ❌ | - |
+
+<Note>
+**Unsupported Operations** (❌): Speech, Transcriptions, Files, and Batch are not supported by the upstream Nebius API. These return `UnsupportedOperationError`.
+</Note>
+
+---
+
+# 1. Chat Completions
+
+## Request Parameters
+
+Nebius supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+### Nebius-Specific Parameters
+
+**ai_project_id (Optional):**
+
+Nebius allows specifying a project ID for resource organization:
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "nebius/model-name",
+    "messages": [...],
+    "ai_project_id": "project-123"
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+request := &schemas.BifrostChatRequest{
+    Model: "model-name",
+    Input: messages,
+    Params: &schemas.ChatParameters{
+        ExtraParams: map[string]interface{}{
+            "ai_project_id": "project-123",
+        },
+    },
+}
+```
+
+</Tab>
+</Tabs>
+
+The `ai_project_id` is appended as a query parameter to the request URL.
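+
+A minimal sketch of that URL handling, assuming the project ID arrives as a string in the extra parameters. `withProjectID` and the host name are hypothetical and only illustrate the idea:
+
+```go
+package main
+
+import (
+    "fmt"
+    "net/url"
+)
+
+// withProjectID (hypothetical) appends ai_project_id as a query parameter
+// when it is present and non-empty in the extra parameters.
+func withProjectID(endpoint string, extra map[string]interface{}) (string, error) {
+    id, ok := extra["ai_project_id"].(string)
+    if !ok || id == "" {
+        return endpoint, nil
+    }
+    u, err := url.Parse(endpoint)
+    if err != nil {
+        return "", err
+    }
+    q := u.Query()
+    q.Set("ai_project_id", id)
+    u.RawQuery = q.Encode()
+    return u.String(), nil
+}
+
+func main() {
+    extra := map[string]interface{}{"ai_project_id": "project-123"}
+    full, err := withProjectID("https://nebius.example/v1/chat/completions", extra)
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println(full)
+}
+```
+
+Going through `url.Values` rather than string concatenation keeps the ID correctly escaped and preserves any query parameters already on the endpoint.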
+
+### Filtered Parameters
+
+Removed for Nebius compatibility:
+- `prompt_cache_key` - Not supported
+- `verbosity` - OpenAI-specific
+- `store` - Not supported
+- `service_tier` - Not supported
+
+Nebius supports all standard OpenAI message types, tools, responses, and streaming formats. For details on message handling, tool conversion, responses, and streaming, refer to [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+---
+
+# 2. Responses API
+
+Converted internally to Chat Completions:
+
+```
+ResponsesRequest → ChatRequest → ChatCompletion → ResponsesResponse
+```
+
+Same parameter support and message handling as Chat Completions. Supports `ai_project_id` via `extra_params`.
+
+---
+
+# 3. Text Completions
+
+Nebius supports legacy text completion format:
+
+| Parameter | Mapping |
+|-----------|---------|
+| `prompt` | Direct pass-through |
+| `max_tokens` | max_tokens |
+| `temperature`, `top_p` | Direct pass-through |
+| `stop` | Stop sequences |
+| `frequency_penalty`, `presence_penalty` | Penalty parameters |
+
+---
+
+# 4. Embeddings
+
+Nebius supports text embeddings:
+
+| Parameter | Notes |
+|-----------|-------|
+| `input` | Text or array of texts |
+| `model` | Embedding model name |
+| `encoding_format` | "float" or "base64" |
+| `dimensions` | Custom output dimensions (optional) |
+
+Response returns embedding vectors with usage information.
+
+---
+
+# 5. Image Generation
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `model` | string | ✅ | Model identifier |
+| `prompt` | string | ✅ | Text description of the image to generate |
+| `size` | string | ❌ | Image size in WxH format (e.g., `"1024x1024"`). Converted to separate `width` and `height` integers |
+| `output_format` | string | ❌ | Output format: `"png"`, `"jpeg"`, `"webp"`. Note: `"jpeg"` is converted to `"jpg"` |
+| `response_format` | string | ❌ | Response format: `"url"` or `"b64_json"` |
+| `seed` | int | ❌ | Seed for reproducible generation |
+| `negative_prompt` | string | ❌ | Negative prompt |
+| `num_inference_steps` | int | ❌ | Number of inference steps |
+| `extra_params` | object | ❌ | Nebius-specific parameters (see below) |
+
+**Extra Parameters (via `extra_params`)**
+
+| Parameter | Type | Notes |
+|-----------|------|-------|
+| `guidance_scale` | int | Guidance scale (0-100) |
+| `ai_project_id` | string | Nebius project ID (added as query parameter) |
+
+---
+
+**Request Conversion**
+
+- **Model & Prompt**: `bifrostReq.Model` → `req.Model` (pointer), `bifrostReq.Input.Prompt` → `req.Prompt` (pointer)
+- **Size Conversion**: `params.size` (WxH format like `"1024x1024"`) is split into:
+  - `width`: Integer extracted from first part (e.g., `1024`)
+  - `height`: Integer extracted from second part (e.g., `1024`)
+- **Output Format**: 
+  - `params.output_format` → `req.ResponseExtension`
+  - Special conversion: `"jpeg"` → `"jpg"` (Nebius uses `"jpg"` not `"jpeg"`)
+- **Response Format**: `params.response_format` → `req.ResponseFormat` (passed directly: `"url"` or `"b64_json"`)
+- **Seed & Negative Prompt**: `params.seed` → `req.Seed`, `params.negative_prompt` → `req.NegativePrompt` (passed directly)
+- **Num Inference Steps**: `params.num_inference_steps` → `req.NumInferenceSteps` (passed directly)
+- **Extra Parameters**:
+  - `guidance_scale` → `req.GuidanceScale` (int pointer)
+  - `ai_project_id` → Added as query parameter `?ai_project_id={value}` to the request URL
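+
+The size and output-format conversions above can be sketched as two small functions. `splitSize` and `normalizeOutputFormat` are hypothetical names used for illustration, not the identifiers in the Nebius converter:
+
+```go
+package main
+
+import (
+    "fmt"
+    "strconv"
+    "strings"
+)
+
+// splitSize (hypothetical) parses a "WxH" string into integer dimensions.
+func splitSize(size string) (width, height int, err error) {
+    parts := strings.Split(strings.ToLower(size), "x")
+    if len(parts) != 2 {
+        return 0, 0, fmt.Errorf("size %q is not in WxH format", size)
+    }
+    if width, err = strconv.Atoi(parts[0]); err != nil {
+        return 0, 0, err
+    }
+    if height, err = strconv.Atoi(parts[1]); err != nil {
+        return 0, 0, err
+    }
+    return width, height, nil
+}
+
+// normalizeOutputFormat (hypothetical) maps "jpeg" to "jpg" and leaves
+// other formats untouched.
+func normalizeOutputFormat(format string) string {
+    if format == "jpeg" {
+        return "jpg"
+    }
+    return format
+}
+
+func main() {
+    w, h, err := splitSize("1024x1024")
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println(w, h, normalizeOutputFormat("jpeg"))
+}
+```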
+
+**Response Conversion**
+
+- **Image Data**: Each item in `response.data[]` → `ImageData` with:
+  - `url`: From `data[].url`
+  - `b64_json`: From `data[].b64_json`
+  - `revised_prompt`: From `data[].revised_prompt`
+  - `index`: Sequential index (0, 1, 2, ...)
+- **ID**: `response.id` → `response.ID`
+- **Provider**: Set to `nebius` in `ExtraFields`
+
+**Endpoint**: `/v1/images/generations`
+
+**Streaming**: Image generation streaming is not supported by Nebius.
+
+---
+
+# 6. List Models
+
+Lists available Nebius models with capabilities and context lengths.
+
+---
+
+## Unsupported Features
+
+| Feature | Reason |
+|---------|--------|
+| Speech/TTS | Not offered by Nebius API |
+| Transcription/STT | Not offered by Nebius API |
+| Batch Operations | Not offered by Nebius API |
+| File Management | Not offered by Nebius API |
+
+---
+
+## Caveats
+
+<Accordion title="Cache Control Stripped">
+**Severity**: Medium
+**Behavior**: Cache control directives are removed from messages
+**Impact**: Prompt caching features don't work
+**Code**: Stripped during JSON marshaling
+</Accordion>
+
+<Accordion title="Parameter Filtering">
+**Severity**: Low
+**Behavior**: OpenAI-specific fields filtered out
+**Impact**: `prompt_cache_key`, `verbosity`, `store`, and `service_tier` are removed
+**Code**: filterOpenAISpecificParameters
+</Accordion>
+
+<Accordion title="User Field Size Limit">
+**Severity**: Low
+**Behavior**: User field > 64 characters silently dropped
+**Impact**: Longer user identifiers are lost
+**Code**: SanitizeUserField enforces 64-char max
+</Accordion>
--- a/docs/providers/supported-providers/ollama.mdx
+++ b/docs/providers/supported-providers/ollama.mdx
@@ -0,0 +1,239 @@
+---
+title: "Ollama"
+description: "Ollama API conversion guide - local inference, OpenAI-compatible format, streaming, tool calling, and embeddings"
+icon: "o"
+---
+
+## Overview
+
+Ollama is a **local-first, OpenAI-compatible inference engine** for running large language models on personal computers or servers. Bifrost delegates to the OpenAI implementation while supporting Ollama's unique configuration requirements. Key characteristics:
+- **Local-first deployment** - Run models locally or on private infrastructure
+- **OpenAI API compatibility** - Identical request/response format
+- **Full feature support** - Chat, text, embeddings, and streaming
+- **Tool calling** - Complete function definition and execution
+- **Self-hosted** - No external API dependency required
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v1/chat/completions` |
+| Responses API | ✅ | ✅ | `/v1/chat/completions` |
+| Text Completions | ✅ | ✅ | `/v1/completions` |
+| Embeddings | ✅ | - | `/v1/embeddings` |
+| List Models | ✅ | - | `/v1/models` |
+| Image Generation | ❌ | ❌ | - |
+| Speech (TTS) | ❌ | ❌ | - |
+| Transcriptions (STT) | ❌ | ❌ | - |
+| Files | ❌ | ❌ | - |
+| Batch | ❌ | ❌ | - |
+
+<Note>
+**Unsupported Operations** (❌): Speech, Transcriptions, Files, and Batch are not supported by the upstream Ollama API, and Image Generation is not available through this integration. These return `UnsupportedOperationError`.
+
+Ollama is self-hosted. Ensure you have an Ollama instance running and configured with the correct BaseURL (e.g., `http://localhost:11434`).
+</Note>
+
+---
+
+# 1. Chat Completions
+
+## Request Parameters
+
+Ollama supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+### Filtered Parameters
+
+Removed for Ollama compatibility:
+- `prompt_cache_key` - Not supported
+- `verbosity` - OpenAI-specific
+- `store` - Not supported
+- `service_tier` - Not supported
+
+Ollama supports all standard OpenAI message types, tools, responses, and streaming formats. For details on message handling, tool conversion, responses, and streaming, refer to [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+---
+
+# 2. Responses API
+
+Converted internally to Chat Completions:
+
+```
+ResponsesRequest → ChatRequest → ChatCompletion → ResponsesResponse
+```
+
+Same parameter support as Chat Completions.
+
+---
+
+# 3. Text Completions
+
+Ollama supports legacy text completion format:
+
+| Parameter | Mapping |
+|-----------|---------|
+| `prompt` | Direct pass-through |
+| `max_tokens` | max_tokens |
+| `temperature`, `top_p` | Direct pass-through |
+| `stop` | Stop sequences |
+
+---
+
+# 4. Embeddings
+
+Ollama supports text embeddings:
+
+| Parameter | Notes |
+|-----------|-------|
+| `input` | Text or array of texts |
+| `model` | Embedding model name |
+| `encoding_format` | "float" or "base64" |
+| `dimensions` | Custom output dimensions (optional) |
+
+Response returns embedding vectors with token usage.
+
+---
+
+# 5. List Models
+
+Lists models currently loaded in Ollama with capabilities and context information.
+
+---
+
+## Unsupported Features
+
+| Feature | Reason |
+|---------|--------|
+| Speech/TTS | Not offered by Ollama API |
+| Transcription/STT | Not offered by Ollama API |
+| Batch Operations | Not offered by Ollama API |
+| File Management | Not offered by Ollama API |
+
+<Note>
+Ollama follows the OpenAI API specification for request format and error handling. Authentication is optional and depends on deployment (no authentication required for local access, optional Bearer token for protected instances).
+
+**Critical**: BaseURL must be explicitly configured pointing to your Ollama instance (e.g., `http://localhost:11434` for local, `https://ollama.example.com` for remote).
+</Note>
+
+---
+
+## Configuration
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+# Send a chat request to an Ollama model through the Bifrost gateway
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "ollama/llama3.1:latest",
+    "messages": [{"role": "user", "content": "Hello"}]
+  }'
+
+# The gateway must already be configured with the Ollama BaseURL
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+config := &schemas.ProviderConfig{
+    NetworkConfig: schemas.NetworkConfig{
+        BaseURL: "http://localhost:11434",  // Required!
+        DefaultRequestTimeoutInSeconds: 30,
+    },
+}
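+provider, err := ollama.NewOllamaProvider(config, logger)
+if err != nil {
+    // Creation fails when BaseURL is not set
+    log.Fatalf("failed to create Ollama provider: %v", err)
+}
+
+response, chatErr := provider.ChatCompletion(ctx, key, request)
+if chatErr != nil {
+    log.Fatalf("chat completion failed: %v", chatErr)
+}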
+```
+
+</Tab>
+</Tabs>
+
+**Environment Setup:**
+
+1. Install Ollama from https://ollama.ai
+2. Pull a model:
+   ```bash
+   ollama pull llama3.1
+   ollama pull mistral
+   ollama pull neural-chat
+   ```
+3. Start Ollama server:
+   ```bash
+   ollama serve
+   ```
+4. Verify it's running:
+   ```bash
+   curl http://localhost:11434/api/tags
+   ```
+
+---
+
+## Performance Considerations
+
+**Streaming for Large Models:**
+For better user experience with large models, use streaming:
+
+```json
+{
+  "model": "llama3.1:latest",
+  "messages": [...],
+  "stream": true
+}
+```
+
+**Token Context:**
+Different models have different context windows:
+- Llama 3.1 70B: 128K tokens
+- Mistral 7B: 32K tokens
+- Neural Chat 7B: 8K tokens
+
+**GPU Acceleration:**
+Ollama automatically uses a GPU if one is available. For CPU-only inference, responses are slower, so raise `DefaultRequestTimeoutInSeconds` enough to avoid request timeouts.
+
+---
+
+## Popular Models
+
+| Model | Size | Context | Speed |
+|-------|------|---------|-------|
+| llama3.1:latest | Varies | 128K | Fast |
+| mistral:latest | 7B | 32K | Very Fast |
+| neural-chat:latest | 7B | 8K | Very Fast |
+| orca-mini:latest | 3B | 3K | Very Fast |
+| openchat:latest | 7B | 8K | Very Fast |
+
+---
+
+## Caveats
+
+<Accordion title="BaseURL Configuration Required">
+**Severity**: High
+**Behavior**: BaseURL must be explicitly configured - no default
+**Impact**: Requests fail without proper configuration
+**Code**: NewOllamaProvider validates BaseURL is set
+</Accordion>
+
+<Accordion title="Cache Control Stripped">
+**Severity**: Low
+**Behavior**: Cache control directives are removed from messages
+**Impact**: Prompt caching features don't work
+**Code**: Stripped during JSON marshaling
+</Accordion>
+
+<Accordion title="Parameter Filtering">
+**Severity**: Low
+**Behavior**: OpenAI-specific parameters filtered out
+**Impact**: `prompt_cache_key`, `verbosity`, `store`, and `service_tier` are removed
+**Code**: filterOpenAISpecificParameters
+</Accordion>
+
+<Accordion title="User Field Size Limit">
+**Severity**: Low
+**Behavior**: User field > 64 characters silently dropped
+**Impact**: Longer user identifiers are lost
+**Code**: SanitizeUserField enforces 64-char max
+</Accordion>
--- a/docs/providers/supported-providers/openai.mdx
+++ b/docs/providers/supported-providers/openai.mdx
@@ -0,0 +1,504 @@
+---
+title: "OpenAI"
+description: "OpenAI API conversion guide - what to know when using OpenAI through Bifrost"
+icon: "openai"
+---
+
+## Overview
+
+OpenAI is the **baseline schema** for Bifrost. When using OpenAI directly, parameters are passed through with minimal conversion - mostly validation and filtering of OpenAI-specific features.
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v1/chat/completions` |
+| Responses API | ✅ | ✅ | `/v1/responses` |
+| Text Completions | ✅ | ✅ | `/v1/completions` |
+| Embeddings | ✅ | - | `/v1/embeddings` |
+| Speech (TTS) | ✅ | ✅ | `/v1/audio/speech` |
+| Transcriptions (STT) | ✅ | ✅ | `/v1/audio/transcriptions` |
+| Image Generation | ✅ | ✅ | `/v1/images/generations` |
+| Image Edit | ✅ | ✅ | `/v1/images/edits` |
+| Image Variation | ✅ | - | `/v1/images/variations` |
+| Files | ✅ | - | `/v1/files` |
+| Batch | ✅ | - | `/v1/batches` |
+| Video Generation | ✅ | - | `/v1/videos` |
+| List Models | ✅ | - | `/v1/models` |
+
+---
+
+# 1. Chat Completions
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `model` | string | ✅ | Model identifier |
+| `messages` | [array](https://github.com/maximhq/bifrost/blob/main/core/schemas/chatcompletions.go#L15) | ✅ | [`ChatMessage`](https://github.com/maximhq/bifrost/blob/main/core/schemas/chatcompletions.go#L370) array with roles ([docs](https://platform.openai.com/docs/api-reference/chat/create#chat-create-messages)) |
+| `temperature` | float | ❌ | Sampling temperature (0-2) |
+| `top_p` | float | ❌ | Nucleus sampling parameter |
+| `stop` | string/array | ❌ | Stop sequences |
+| `max_completion_tokens` | int | ❌ | Maximum output tokens (minimum value 16) |
+| `frequency_penalty` | float | ❌ | Frequency penalty (-2 to 2) |
+| `presence_penalty` | float | ❌ | Presence penalty (-2 to 2) |
+| `logit_bias` | [object](https://github.com/maximhq/bifrost/blob/main/core/schemas/chatcompletions.go#L25) | ❌ | Token logit adjustments |
+| `logprobs` | bool | ❌ | Include log probabilities |
+| `top_logprobs` | int | ❌ | Number of log probabilities per token |
+| `seed` | int | ❌ | Reproducibility seed |
+| `response_format` | [object](https://github.com/maximhq/bifrost/blob/main/core/schemas/chatcompletions.go#L23) | ❌ | Output format ([docs](https://platform.openai.com/docs/api-reference/chat/create#chat-create-response_format)) |
+| `tools` | [array](https://github.com/maximhq/bifrost/blob/main/core/schemas/chatcompletions.go#L21) | ❌ | [`Tool`](https://github.com/maximhq/bifrost/blob/main/core/schemas/chatcompletions.go#L600) objects ([docs](https://platform.openai.com/docs/api-reference/chat/create#chat-create-tools)) |
+| `tool_choice` | string/[object](https://github.com/maximhq/bifrost/blob/main/core/schemas/chatcompletions.go#L22) | ❌ | `"auto"`, `"none"`, `"required"`, or specific tool |
+| `parallel_tool_calls` | bool | ❌ | Allow multiple simultaneous tool calls |
+| `stream_options` | [object](https://github.com/maximhq/bifrost/blob/main/core/schemas/chatcompletions.go#L26) | ❌ | Streaming options ([docs](https://platform.openai.com/docs/api-reference/chat/create#chat-create-stream_options)) |
+| `reasoning` | [object](https://github.com/maximhq/bifrost/blob/main/core/schemas/chatcompletions.go#L24) | ❌ | Reasoning parameters ([Bifrost docs](/providers/reasoning), [OpenAI docs](https://platform.openai.com/docs/api-reference/chat/create#chat-create-reasoning)) |
+| `user` | string | ❌ | **Truncated to 64 chars** |
+| `metadata` | [object](https://github.com/maximhq/bifrost/blob/main/core/schemas/chatcompletions.go#L27) | ❌ | Custom metadata |
+| `store` | bool | ❌ | **Filtered for non-OpenAI routing** |
+| `service_tier` | string | ❌ | **Filtered for non-OpenAI routing** |
+| `prompt_cache_key` | string | ❌ | **Filtered for non-OpenAI routing** |
+| `prediction` | [object](https://github.com/maximhq/bifrost/blob/main/core/schemas/chatcompletions.go#L28) | ❌ | Predicted output for acceleration |
+| `audio` | [object](https://github.com/maximhq/bifrost/blob/main/core/schemas/chatcompletions.go#L29) | ❌ | Audio output config |
+| `modalities` | [array](https://github.com/maximhq/bifrost/blob/main/core/schemas/chatcompletions.go#L30) | ❌ | Response modalities (text, audio) |
+
+---
+
+- **Reasoning:** OpenAI supports `reasoning.effort` (`minimal`, `low`, `medium`, `high`) and `reasoning.max_tokens` - both passed through directly. When routing to other providers, `"minimal"` effort is converted to `"low"` for compatibility. See [Bifrost reasoning docs](/providers/reasoning).
+- **Messages:** All message roles are supported: `system`, `user`, `assistant`, `tool`, `developer` (treated as system). Content types: text, images via URL (`image_url`), audio input (`input_audio`). Tool messages include a `tool_call_id`.
+- **Tools:** Standard OpenAI tool format with strict mode support. Tool choice: `"auto"`, `"none"`, `"required"`, or specific tool by name.
+- **Responses:** Passed through in standard OpenAI format. Finish reasons: `stop`, `length`, `tool_calls`, `content_filter`. Usage includes token counts and optionally cached/reasoning token details.
+- **Streaming:** Server-Sent Events format with `delta.content`, `delta.tool_calls`, `finish_reason`, and `usage` (final chunk only, automatically included by Bifrost). `stream_options: { include_usage: true }` is set by default for all streaming calls.
+- **Cache Control:** `cache_control` fields are stripped from messages, their content blocks, and tools before sending.
+- **Token Enforcement:** `max_completion_tokens` is enforced to have a minimum of 16. Values below 16 are automatically set to 16.
+- **Special handling:** The `user` field is truncated to 64 characters; `prompt_cache_key`, `store`, and `service_tier` are filtered when routing to non-OpenAI providers.
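+
+The token floor and the parameter filtering described in the list above can be sketched as follows. The function names, the constant, and the `"openai"` provider string are hypothetical; this is an illustration of the rules, not Bifrost's implementation:
+
+```go
+package main
+
+import "fmt"
+
+// minCompletionTokens is the floor described under Token Enforcement.
+const minCompletionTokens = 16
+
+// clampMaxCompletionTokens (hypothetical) raises values below the floor to 16.
+func clampMaxCompletionTokens(v int) int {
+    if v < minCompletionTokens {
+        return minCompletionTokens
+    }
+    return v
+}
+
+// filterOpenAIOnly (hypothetical) removes OpenAI-only parameters in place
+// when the request is routed to a different provider.
+func filterOpenAIOnly(params map[string]interface{}, provider string) {
+    if provider == "openai" {
+        return
+    }
+    for _, key := range []string{"prompt_cache_key", "store", "service_tier"} {
+        delete(params, key)
+    }
+}
+
+func main() {
+    params := map[string]interface{}{"store": true, "temperature": 0.2}
+    filterOpenAIOnly(params, "mistral")
+    fmt.Println(clampMaxCompletionTokens(5), params)
+}
+```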
+
+---
+
+# 2. Responses API
+
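+The Responses API is OpenAI's unified interface for generating model responses, with support for built-in tools and multi-turn state via `previous_response_id`.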
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `model` | string | ✅ | Model identifier |
+| `input` | string/[array](https://github.com/maximhq/bifrost/blob/main/core/schemas/responses.go#L40) | ✅ | Text or [`ContentBlock`](https://github.com/maximhq/bifrost/blob/main/core/schemas/responses.go#L500) array ([docs](https://platform.openai.com/docs/api-reference/responses/create#responses-create-input)) |
+| `max_output_tokens` | int | ✅ | Maximum output length |
+| `background` | bool | ❌ | Run request in background mode |
+| `conversation` | string | ❌ | Conversation ID for continuing a conversation |
+| `include` | array | ❌ | Array of fields to include in response (e.g., `"web_search_call.action.sources"`) |
+| `instructions` | string | ❌ | System instructions |
+| `max_tool_calls` | int | ❌ | Maximum number of tool calls |
+| `metadata` | [object](https://github.com/maximhq/bifrost/blob/main/core/schemas/responses.go#L94) | ❌ | Custom metadata |
+| `parallel_tool_calls` | bool | ❌ | Allow multiple simultaneous tool calls |
+| `previous_response_id` | string | ❌ | ID of previous response to continue from |
+| `prompt_cache_key` | string | ❌ | Prompt caching key |
+| `reasoning` | [object](https://github.com/maximhq/bifrost/blob/main/core/schemas/responses.go#L238) | ❌ | [`ResponsesParametersReasoning`](https://github.com/maximhq/bifrost/blob/main/core/schemas/responses.go#L238) configuration ([Bifrost docs](/providers/reasoning)) |
+| `safety_identifier` | string | ❌ | Safety identifier for content filtering |
+| `service_tier` | string | ❌ | Service tier for the request |
+| `stream_options` | [object](https://github.com/maximhq/bifrost/blob/main/core/schemas/responses.go#L116) | ❌ | [`ResponsesStreamOptions`](https://github.com/maximhq/bifrost/blob/main/core/schemas/responses.go#L116) configuration |
+| `store` | bool | ❌ | Store the response for later retrieval |
+| `temperature` | float | ❌ | Sampling temperature |
+| `text` | [object](https://github.com/maximhq/bifrost/blob/main/core/schemas/responses.go#L120) | ❌ | [`ResponsesTextConfig`](https://github.com/maximhq/bifrost/blob/main/core/schemas/responses.go#L120) for output formatting |
+| `top_logprobs` | int | ❌ | Number of log probabilities to return per token |
+| `top_p` | float | ❌ | Nucleus sampling parameter |
+| `tool_choice` | string/[object](https://github.com/maximhq/bifrost/blob/main/core/schemas/responses.go#L969) | ❌ | [`ResponsesToolChoice`](https://github.com/maximhq/bifrost/blob/main/core/schemas/responses.go#L969) strategy |
+| `tools` | [array](https://github.com/maximhq/bifrost/blob/main/core/schemas/responses.go#L1050) | ❌ | [`ResponsesTool`](https://github.com/maximhq/bifrost/blob/main/core/schemas/responses.go#L1050) objects ([docs](https://platform.openai.com/docs/api-reference/responses/create#responses-create-tools)) |
+| `truncation` | string | ❌ | Truncation strategy (`auto` or `off`) |
+| `user` | string | ❌ | **Truncated to 64 chars** |
+
+---
+
+**Special Message Handling (gpt-oss vs other models):**
+
+OpenAI models handle reasoning differently depending on the model family:
+- **Non-gpt-oss models** (GPT-4o, o1, etc.): Send reasoning as **summaries**. Reasoning-only messages (with no summary and only content blocks) are filtered out since these models don't support reasoning content blocks in the request format.
+- **gpt-oss models**: Send reasoning as **content blocks**. Reasoning summaries in the request are converted to content blocks since gpt-oss expects reasoning as structured blocks, not summaries.
+
+This conversion ensures compatibility across different model architectures for the structured Responses API. See [Bifrost reasoning docs](/providers/reasoning) for detailed reasoning handling.
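+
+As a rough illustration of the two rules above, the following Go sketch uses a simplified, hypothetical `reasoningItem` type and a name-based `isGptOss` check. Both are assumptions; Bifrost's real schema types and model detection may differ.
+
+```go
+package main
+
+import (
+	"fmt"
+	"strings"
+)
+
+// reasoningItem is a simplified, hypothetical stand-in for a reasoning
+// message in a Responses API request.
+type reasoningItem struct {
+	Summary []string // summary texts
+	Content []string // reasoning content blocks
+}
+
+// isGptOss is an assumed check based on the model name.
+func isGptOss(model string) bool {
+	return strings.Contains(model, "gpt-oss")
+}
+
+// prepareReasoning applies the two documented rules to a list of items.
+func prepareReasoning(model string, items []reasoningItem) []reasoningItem {
+	out := make([]reasoningItem, 0, len(items))
+	for _, it := range items {
+		if isGptOss(model) {
+			// gpt-oss: move summaries into content blocks.
+			it.Content = append(append([]string{}, it.Content...), it.Summary...)
+			it.Summary = nil
+		} else if len(it.Summary) == 0 && len(it.Content) > 0 {
+			// Other models: drop reasoning-only items that carry no summary.
+			continue
+		}
+		out = append(out, it)
+	}
+	return out
+}
+
+func main() {
+	items := []reasoningItem{
+		{Summary: []string{"plan"}},
+		{Content: []string{"raw chain"}},
+	}
+	fmt.Println(len(prepareReasoning("gpt-4o", items)))       // prints 1
+	fmt.Println(len(prepareReasoning("gpt-oss-120b", items))) // prints 2
+}
+```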
+
+**Token & Parameter Enforcement:**
+- `max_output_tokens` has an enforced minimum of 16. Values below 16 are automatically raised to 16.
+- The `reasoning.max_tokens` field is automatically removed from the serialized request JSON, because the OpenAI Responses API does not accept it.
+
+**Other conversions:**
+- Action types `zoom` and `region` are converted to `screenshot`
+- `cache_control` fields are stripped from messages and tools
+- Unsupported tool types are silently filtered (only these are supported: `function`, `file_search`, `computer_use_preview`, `web_search`, `mcp`, `code_interpreter`, `image_generation`, `local_shell`, `custom`, `web_search_preview`)
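+
+The tool filtering and action-type conversion listed above could look roughly like this in Go. The `tool` struct and both function names are hypothetical and exist only for this sketch.
+
+```go
+package main
+
+import "fmt"
+
+// supportedToolTypes lists the tool types named in the list above.
+var supportedToolTypes = map[string]bool{
+	"function": true, "file_search": true, "computer_use_preview": true,
+	"web_search": true, "mcp": true, "code_interpreter": true,
+	"image_generation": true, "local_shell": true, "custom": true,
+	"web_search_preview": true,
+}
+
+// tool is a hypothetical minimal tool definition.
+type tool struct {
+	Type string
+	Name string
+}
+
+// filterTools keeps supported tools and silently drops the rest.
+func filterTools(tools []tool) []tool {
+	kept := make([]tool, 0, len(tools))
+	for _, t := range tools {
+		if supportedToolTypes[t.Type] {
+			kept = append(kept, t)
+		}
+	}
+	return kept
+}
+
+// normalizeAction maps the zoom and region action types to screenshot.
+func normalizeAction(action string) string {
+	switch action {
+	case "zoom", "region":
+		return "screenshot"
+	}
+	return action
+}
+
+func main() {
+	tools := []tool{{Type: "function", Name: "lookup"}, {Type: "made_up_type", Name: "x"}}
+	fmt.Println(len(filterTools(tools)), normalizeAction("zoom")) // prints "1 screenshot"
+}
+```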
+
+**Response:** Includes `id`, `status` (`completed`, `incomplete`, `pending`, `error`), `output` array with message content, and token `usage`.
+
+**Streaming:** Server-Sent Events with types: `response.created`, `response.in_progress`, `response.output_item.added`, `response.content_part.added`, `response.output_text.delta`, `response.function_call_arguments.delta`, `response.completed`, `response.incomplete`. `stream_options: { include_usage: true }` is set by default for all streaming calls.
+
+---
+
+# 3. Text Completions (Legacy)
+
+<Warning>
+Text Completions is a legacy API. Use Chat Completions for new implementations.
+</Warning>
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `model` | string | ✅ | Model identifier |
+| `prompt` | string/array | ✅ | Completion prompt(s) |
+| `max_tokens` | int | ❌ | Maximum output tokens |
+| `temperature` | float | ❌ | Sampling temperature |
+| `top_p` | float | ❌ | Nucleus sampling |
+| `stop` | string/array | ❌ | Stop sequences |
+| `user` | string | ❌ | **Truncated to 64 chars** |
+
+---
+
+- Array prompts generate multiple completions. Finish reasons: `stop` or `length`. Streaming uses SSE format. `stream_options: { include_usage: true }` is set by default for streaming calls.
+- `user` field is truncated to 64 characters or set to nil if it exceeds the limit.
+
+---
+
+# 4. Embeddings
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `model` | string | ✅ | Model identifier |
+| `input` | string/[array](https://github.com/maximhq/bifrost/blob/main/core/schemas/embedding.go#L12) | ✅ | Text(s) to embed ([docs](https://platform.openai.com/docs/api-reference/embeddings/create#embeddings-create-input)) |
+| `encoding_format` | string | ❌ | `float` or `base64` |
+| `dimensions` | int | ❌ | Output embedding dimensions |
+| `user` | string | ❌ | **NOT truncated** (unlike chat/text) |
+
+---
+
+- No streaming support. Returns [`embedding`](https://github.com/maximhq/bifrost/blob/main/core/schemas/embedding.go#L30) array with usage counts.
+
+---
+
+# 5. Speech (Text-to-Speech)
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `model` | string | ✅ | `tts-1` or `tts-1-hd` |
+| `input` | string | ✅ | Text to convert to speech |
+| `voice` | string | ✅ | alloy, echo, fable, onyx, nova, shimmer |
+| `response_format` | string | ❌ | mp3, opus, aac, flac, wav, pcm |
+| `speed` | float | ❌ | 0.25 to 4.0 (default 1.0) |
+
+---
+
+- Returns raw binary audio. Streaming supported in SSE format (base64 chunks), but not all models support streaming. `stream_options: { include_usage: true }` is set by default for streaming calls.
+
+---
+
+# 6. Transcriptions (Speech-to-Text)
+
+<Warning>
+Requests use **multipart/form-data**, not JSON.
+</Warning>
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `file` | binary | ✅ | Audio file (multipart form-data) |
+| `model` | string | ✅ | `whisper-1` |
+| `language` | string | ❌ | ISO-639-1 language code |
+| `prompt` | string | ❌ | Optional prompt for context |
+| `temperature` | float | ❌ | Sampling temperature |
+| `response_format` | string | ❌ | json, text, srt, vtt, verbose_json |
+
+---
+
+- **Supported audio formats:** mp3, mp4, mpeg, mpga, m4a, wav, webm
+- **Response:** Includes `text`, `task`, `language`, `duration`, and optionally word-level timing. Streaming supported in SSE format. `stream_options: { include_usage: true }` is set by default for streaming calls.
+
+---
+
+# 7. Image Generation
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `model` | string | ✅ | Model identifier (e.g., `dall-e-3`) |
+| `prompt` | string | ✅ | Text description of the image to generate |
+| `n` | int | ❌ | Number of images to generate (1-10) |
+| `size` | string | ❌ | Image size: `"256x256"`, `"512x512"`, `"1024x1024"`, `"1792x1024"`, `"1024x1792"`, `"1536x1024"`, `"1024x1536"`, `"auto"` |
+| `quality` | string | ❌ | Image quality: `"auto"`, `"high"`, `"medium"`, `"low"`, `"hd"`, `"standard"` |
+| `style` | string | ❌ | Image style: `"natural"`, `"vivid"` |
+| `response_format` | string | ❌ | Response format: `"url"` or `"b64_json"` |
+| `background` | string | ❌ | Background: `"transparent"`, `"opaque"`, `"auto"` |
+| `output_format` | string | ❌ | Output format: `"png"`, `"webp"`, `"jpeg"` |
+| `output_compression` | int | ❌ | Compression level (0-100%) |
+| `partial_images` | int | ❌ | Number of partial images (0-3) |
+| `moderation` | string | ❌ | Moderation level: `"low"`, `"auto"` |
+| `user` | string | ❌ | User identifier |
+
+---
+
+**Request Conversion**
+
+OpenAI is the baseline schema for image generation. Parameters are passed through with minimal conversion:
+
+- **Model & Prompt**: `bifrostReq.Model` → `req.Model`, `bifrostReq.Prompt` → `req.Prompt`
+- **Parameters**: All fields from `bifrostReq` (`ImageGenerationParameters`) are embedded directly into the OpenAI request struct via struct embedding. No field mapping or transformation is performed.
+- **Streaming**: When streaming is requested, `stream: true` is set in the request body.
+
+**Response Conversion**
+
+- **Non-streaming**: OpenAI responses are unmarshaled directly into `BifrostImageGenerationResponse` since Bifrost's response schema is a superset of OpenAI's format. All fields are passed through as-is.
+- **Streaming**: OpenAI streaming responses use Server-Sent Events (SSE) format with event types:
+  - `image_generation.partial_image`: Intermediate image chunks with `b64_json` data
+  - `image_generation.completed`: Final chunk for each image with usage information
+  - `error`: Error events
+  
+  Each chunk includes:
+  - `type`: Event type
+  - `sequence_number`: Sequence number of the chunk
+  - `partial_image_index`: Image index (0-N) for partial images
+  - `b64_json`: Base64-encoded image data (pointer, may be nil)
+  - `usage`: Token usage (only in completed events)
+  - `created_at`, `size`, `quality`, `background`, `output_format`: Additional metadata
+  
+  Bifrost converts these to `BifrostImageGenerationStreamResponse` chunks with:
+  - Per-image `chunkIndex` tracking for proper ordering within each image
+  - `Index` field indicating which image (0-N) the chunk belongs to
+  - `PartialImageIndex` set only for partial images (not completed events)
+  - Usage information attached to completed chunks
+  - Latency tracking per chunk
+
+**Endpoint**: `/v1/images/generations`
+
+---
+
+# 8. Image Edit
+
+<Warning>
+Requests use **multipart/form-data**, not JSON.
+</Warning>
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `model` | string | ✅ | Model identifier |
+| `prompt` | string | ✅ | Text description of the edit |
+| `image[]` | binary | ✅ | Image file(s) to edit (multipart form-data, supports multiple images) |
+| `mask` | binary | ❌ | Mask image file (multipart form-data) |
+| `n` | int | ❌ | Number of images to generate (1-10) |
+| `size` | string | ❌ | Image size: `"256x256"`, `"512x512"`, `"1024x1024"`, `"1536x1024"`, `"1024x1536"`, `"auto"` |
+| `quality` | string | ❌ | Image quality: `"auto"`, `"high"`, `"medium"`, `"low"`, `"standard"` |
+| `response_format` | string | ❌ | Response format: `"url"` or `"b64_json"` |
+| `background` | string | ❌ | Background: `"transparent"`, `"opaque"`, `"auto"` |
+| `input_fidelity` | string | ❌ | Input fidelity: `"low"`, `"high"` |
+| `partial_images` | int | ❌ | Number of partial images (0-3) |
+| `output_format` | string | ❌ | Output format: `"png"`, `"webp"`, `"jpeg"` |
+| `output_compression` | int | ❌ | Compression level (0-100%) |
+| `user` | string | ❌ | User identifier |
+| `stream` | bool | ❌ | Enable streaming response |
+
+---
+
+**Request Conversion**
+
+- **Model & Input**: `bifrostReq.Model` → `req.Model`, `bifrostReq.Input.Images` → `req.Input.Images`, `bifrostReq.Input.Prompt` → `req.Input.Prompt`
+- **Parameters**: All fields from `bifrostReq.Params` (`ImageEditParameters`) are embedded directly into the OpenAI request struct via struct embedding. No field mapping or transformation is performed.
+- **Multipart Form Data**: The request is serialized as `multipart/form-data`:
+  - **Model & Prompt**: Written as form fields (`model`, `prompt`)
+  - **Images**: Each image in `Input.Images` is written as a separate `image[]` field with proper MIME type detection (`image/jpeg`, `image/webp`, `image/png`) and Content-Type headers
+  - **Mask**: If present, written as a `mask` field with MIME type detection and appropriate filename (`mask.png`, `mask.jpg`, `mask.webp`)
+  - **Optional Parameters**: All optional parameters (`n`, `size`, `quality`, `response_format`, `background`, `input_fidelity`, `partial_images`, `output_format`, `output_compression`, `user`) are written as form fields
+  - **Integer Conversion**: Integer fields (`n`, `partial_images`, `output_compression`) are converted to strings using `strconv.Itoa`
+  - **Streaming**: When streaming is requested, `stream: "true"` is written as a form field
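+
+The serialization steps above can be sketched with Go's standard `mime/multipart` package. The function name `buildEditForm`, the generated filenames, and the use of `http.DetectContentType` for MIME sniffing are assumptions for illustration; the source only states that MIME types are detected, not how.
+
+```go
+package main
+
+import (
+	"bytes"
+	"fmt"
+	"mime/multipart"
+	"net/http"
+	"net/textproto"
+	"strconv"
+)
+
+// buildEditForm is a hypothetical sketch of the multipart serialization.
+func buildEditForm(model, prompt string, images [][]byte, n int) (*bytes.Buffer, string, error) {
+	body := &bytes.Buffer{}
+	w := multipart.NewWriter(body)
+
+	// Plain form fields; the integer is converted with strconv.Itoa.
+	fields := [][2]string{{"model", model}, {"prompt", prompt}, {"n", strconv.Itoa(n)}}
+	for _, f := range fields {
+		if err := w.WriteField(f[0], f[1]); err != nil {
+			return nil, "", err
+		}
+	}
+
+	// One image[] part per image, each with a sniffed Content-Type.
+	for i, img := range images {
+		h := textproto.MIMEHeader{}
+		h.Set("Content-Disposition", fmt.Sprintf(`form-data; name="image[]"; filename="image_%d"`, i))
+		h.Set("Content-Type", http.DetectContentType(img))
+		part, err := w.CreatePart(h)
+		if err != nil {
+			return nil, "", err
+		}
+		if _, err := part.Write(img); err != nil {
+			return nil, "", err
+		}
+	}
+	if err := w.Close(); err != nil {
+		return nil, "", err
+	}
+	return body, w.FormDataContentType(), nil
+}
+
+func main() {
+	png := []byte("\x89PNG\r\n\x1a\n") // PNG signature only, enough for MIME sniffing
+	body, contentType, err := buildEditForm("example-model", "add a hat", [][]byte{png}, 2)
+	if err != nil {
+		panic(err)
+	}
+	fmt.Println(contentType != "", body.Len() > 0)
+}
+```
+
+The returned content type carries the multipart boundary, so it would be set as the request's `Content-Type` header alongside the body.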
+
+**Response Conversion**
+
+- **Non-streaming**: OpenAI responses are unmarshaled directly into `BifrostImageGenerationResponse` since Bifrost's response schema is a superset of OpenAI's format. All fields are passed through as-is.
+- **Streaming**: OpenAI streaming responses use Server-Sent Events (SSE) format with event types:
+  - `image_edit.partial_image`: Intermediate image chunks with `b64_json` data
+  - `image_edit.completed`: Final chunk for each image with usage information
+  - `error`: Error events
+  
+  Each chunk includes:
+  - `type`: Event type (`image_edit.partial_image` or `image_edit.completed`)
+  - `sequence_number`: Sequence number of the chunk
+  - `partial_image_index`: Image index (0-N) for partial images
+  - `b64_json`: Base64-encoded image data (pointer, may be nil)
+  - `usage`: Token usage (only in completed events)
+  
+  Bifrost converts these to `BifrostImageGenerationStreamResponse` chunks with:
+  - Per-image `chunkIndex` tracking for proper ordering within each image
+  - `Index` field indicating which image (0-N) the chunk belongs to
+  - `PartialImageIndex` set only for partial images (not completed events)
+  - Usage information attached to completed chunks
+  - Latency tracking per chunk
+  - Robust handling of interleaved chunks using incomplete image tracking
+
+**Endpoint**: `/v1/images/edits`
+
+---
+
+# 9. Image Variation
+
+<Warning>
+Requests use **multipart/form-data**, not JSON.
+</Warning>
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `model` | string | ✅ | Model identifier |
+| `image` | binary | ✅ | Image file to create variations from (multipart form-data) |
+| `n` | int | ❌ | Number of images to generate (1-10) |
+| `size` | string | ❌ | Image size: `"256x256"`, `"512x512"`, `"1024x1024"`, `"1792x1024"`, `"1024x1792"`, `"1536x1024"`, `"1024x1536"`, `"auto"` |
+| `response_format` | string | ❌ | Response format: `"url"` or `"b64_json"` |
+| `user` | string | ❌ | User identifier |
+
+---
+
+**Request Conversion**
+
+- **Model & Input**: `bifrostReq.Model` → `req.Model`, `bifrostReq.Input.Image.Image` → `req.Input.Image.Image`
+- **Parameters**: All fields from `bifrostReq.Params` (`ImageVariationParameters`) are embedded directly into the OpenAI request struct via struct embedding. No field mapping or transformation is performed.
+- **Multipart Form Data**: The request is serialized as `multipart/form-data`:
+  - **Model**: Written as form field (`model`)
+  - **Image**: The image is written as an `image` field with proper MIME type detection (`image/jpeg`, `image/webp`, `image/png`) and Content-Type headers. If MIME type cannot be detected, defaults to `image/png`
+  - **Optional Parameters**: All optional parameters (`n`, `size`, `response_format`, `user`) are written as form fields
+  - **Integer Conversion**: Integer field (`n`) is converted to string using `strconv.Itoa`
+- **Multiple Images**: Additional images beyond the first one (if present in `ExtraParams["images"]`) are stored in `ExtraParams` but only the first image is sent to OpenAI (OpenAI API only supports single image input)
+
+**Response Conversion**
+
+- **Non-streaming**: OpenAI responses are unmarshaled directly into `BifrostImageVariationResponse` (which is a type alias for `BifrostImageGenerationResponse`). All fields are passed through as-is.
+- **Streaming**: Not supported for image variation requests.
+
+**Endpoint**: `/v1/images/variations`
+
+---
+
+# 10. Files API
+
+## Upload
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `file` | binary | ✅ | File to upload (multipart form-data) |
+| `purpose` | string | ✅ | batch, fine-tune, or assistants |
+| `filename` | string | ❌ | Custom filename (defaults to file.jsonl) |
+
+Response: [`FileObject`](https://github.com/maximhq/bifrost/blob/main/core/schemas/files.go#L40) with `id`, `bytes`, `created_at`, `filename`, `purpose`, `status` ([docs](https://platform.openai.com/docs/api-reference/files/create))
+
+## List Files
+
+**Query Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `purpose` | string | ❌ | Filter by purpose |
+| `limit` | int | ❌ | Results per page |
+| `after` | string | ❌ | Pagination cursor |
+| `order` | string | ❌ | asc or desc |
+
+Cursor-based pagination with `has_more` flag.
+
+## Retrieve / Delete / Content
+
+Operations:
+- GET `/v1/files/{file_id}` - Retrieve file metadata
+- DELETE `/v1/files/{file_id}` - Delete file
+- GET `/v1/files/{file_id}/content` - Download file content
+
+---
+
+# 11. Batch API
+
+## Create Batch
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `input_file_id` | string | Conditional | File ID OR requests array (not both) |
+| `requests` | [array](https://github.com/maximhq/bifrost/blob/main/core/schemas/batch.go#L75) | Conditional | [`BatchRequestItem`](https://github.com/maximhq/bifrost/blob/main/core/schemas/batch.go#L31) objects (converted to JSONL) |
+| `endpoint` | string | ✅ | Target endpoint (e.g., /v1/chat/completions) |
+| `completion_window` | string | ❌ | 24h (default) |
+| `metadata` | [object](https://github.com/maximhq/bifrost/blob/main/core/schemas/batch.go#L89) | ❌ | Custom metadata |
+
+**Response:** [`BifrostBatchCreateResponse`](https://github.com/maximhq/bifrost/blob/main/core/schemas/batch.go#L91) with `id`, `endpoint`, `input_file_id`, `status`, `created_at`, `request_counts` ([docs](https://platform.openai.com/docs/api-reference/batch/create)). Statuses: [`BatchStatus`](https://github.com/maximhq/bifrost/blob/main/core/schemas/batch.go#L5) (validating, failed, in_progress, finalizing, completed, expired, cancelling, cancelled)
+
+## List Batches
+
+**Query Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `limit` | int | ❌ | Results per page |
+| `after` | string | ❌ | Pagination cursor |
+
+## Retrieve / Cancel Batch
+
+Operations:
+- GET `/v1/batches/{batch_id}` - Get batch [`BifrostBatchRetrieveResponse`](https://github.com/maximhq/bifrost/blob/main/core/schemas/batch.go#L167) ([docs](https://platform.openai.com/docs/api-reference/batch/retrieve))
+- POST `/v1/batches/{batch_id}/cancel` - Cancel batch ([docs](https://platform.openai.com/docs/api-reference/batch/cancel))
+
+## Get Results
+
+1. Batch must be `completed` (has `output_file_id`)
+2. Download output file via Files API
+3. Parse JSONL - each [`BatchResultItem`](https://github.com/maximhq/bifrost/blob/main/core/schemas/batch.go#L254): `{id, custom_id, response: {status_code, body}}`
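+
+Step 3 can be sketched in Go as follows. The `resultItem` struct follows the field names in the shape shown above, but it is a simplified stand-in rather than Bifrost's `BatchResultItem`, and the sample lines are made up.
+
+```go
+package main
+
+import (
+	"bufio"
+	"encoding/json"
+	"fmt"
+	"strings"
+)
+
+// resultItem follows the documented shape
+// {id, custom_id, response: {status_code, body}}.
+type resultItem struct {
+	ID       string `json:"id"`
+	CustomID string `json:"custom_id"`
+	Response struct {
+		StatusCode int             `json:"status_code"`
+		Body       json.RawMessage `json:"body"`
+	} `json:"response"`
+}
+
+// parseResults decodes one JSON object per non-empty line.
+func parseResults(jsonl string) ([]resultItem, error) {
+	var items []resultItem
+	sc := bufio.NewScanner(strings.NewReader(jsonl))
+	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024) // allow long lines
+	for sc.Scan() {
+		line := strings.TrimSpace(sc.Text())
+		if line == "" {
+			continue
+		}
+		var it resultItem
+		if err := json.Unmarshal([]byte(line), &it); err != nil {
+			return nil, err
+		}
+		items = append(items, it)
+	}
+	return items, sc.Err()
+}
+
+func main() {
+	// Made-up sample lines in the documented shape.
+	sample := `{"id":"r1","custom_id":"a","response":{"status_code":200,"body":{"ok":true}}}
+{"id":"r2","custom_id":"b","response":{"status_code":429,"body":{}}}`
+	items, err := parseResults(sample)
+	if err != nil {
+		panic(err)
+	}
+	for _, it := range items {
+		fmt.Println(it.CustomID, it.Response.StatusCode)
+	}
+}
+```
+
+Matching on `custom_id` rather than line order is the safer way to pair results with the original requests.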
+
+---
+
+# 12. List Models
+
+GET `/v1/models` - Lists available models with metadata. Model IDs in Bifrost responses are prefixed with `openai/` (e.g., `openai/gpt-4o`). Results are aggregated from all configured API keys. No request body or parameters required.
+
+---
+
+# 13. Video Generation
+
+## Generate (`POST /v1/videos`)
+
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `model` | string | ✅ | e.g., `sora-2` |
+| `prompt` | string | ✅ | Text description of the video |
+| `input_reference` | string | ❌ | Input image for image-to-video. **Must be a base64 data URL** (e.g., `data:image/png;base64,...`). Plain URLs are not accepted. |
+| `seconds` | string | ❌ | Duration in seconds |
+| `size` | string | ❌ | Resolution: `720x1280` (default), `1280x720`, `1024x1792`, `1792x1024` |
+
+**Response**: [`BifrostVideoGenerationResponse`](https://github.com/maximhq/bifrost/blob/main/core/schemas/videos.go) — `id`, `status`, `model`, `prompt`, `created_at`
+
+**Job Statuses**: `queued` → `in_progress` → `completed` / `failed`
+
+## Retrieve / Download / Delete / List / Remix
+
+| Operation | Endpoint | Notes |
+|-----------|----------|-------|
+| Get status | `GET /v1/videos/{id}` | Poll until `status: completed` |
+| Download | `GET /v1/videos/{id}/content` | Returns raw video bytes |
+| Delete | `DELETE /v1/videos/{id}` | Removes video job |
+| List jobs | `GET /v1/videos` | Query params: `after`, `limit`, `order` |
+| Remix | `POST /v1/videos/{id}/remix` | Body: `{"prompt": "..."}` |
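+
+Polling for completion might be structured like the Go sketch below. `waitForVideo` and its `fetchStatus` callback are hypothetical; in practice the callback would wrap a `GET /v1/videos/{id}` request, and the interval and attempt limit are arbitrary choices.
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+	"time"
+)
+
+// waitForVideo polls fetchStatus until the job reaches a terminal state.
+// fetchStatus is a hypothetical callback that would wrap GET /v1/videos/{id}.
+func waitForVideo(fetchStatus func() (string, error), interval time.Duration, maxAttempts int) error {
+	for i := 0; i < maxAttempts; i++ {
+		status, err := fetchStatus()
+		if err != nil {
+			return err
+		}
+		switch status {
+		case "completed":
+			return nil
+		case "failed":
+			return errors.New("video generation failed")
+		}
+		time.Sleep(interval) // still queued or in_progress
+	}
+	return errors.New("timed out waiting for video")
+}
+
+func main() {
+	// Canned statuses stand in for real API responses.
+	statuses := []string{"queued", "in_progress", "completed"}
+	i := 0
+	err := waitForVideo(func() (string, error) {
+		s := statuses[i]
+		i++
+		return s, nil
+	}, 10*time.Millisecond, 5)
+	fmt.Println(err) // prints "<nil>"
+}
+```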
+
+---
+
+## Common Error Codes
+
+HTTP Status → Error Type mapping:
+- `400` - `invalid_request_error`
+- `401` - `authentication_error`
+- `403` - `permission_error`
+- `404` - `not_found_error`
+- `429` - `rate_limit_error`
+- `500` - `api_error`
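+
+For client code that branches on these codes, the mapping can be expressed as a lookup table. The Go sketch below is illustrative only; returning an empty string for unlisted statuses is an assumption, not documented behavior.
+
+```go
+package main
+
+import "fmt"
+
+// errorTypes mirrors the status-to-type list above.
+var errorTypes = map[int]string{
+	400: "invalid_request_error",
+	401: "authentication_error",
+	403: "permission_error",
+	404: "not_found_error",
+	429: "rate_limit_error",
+	500: "api_error",
+}
+
+// errorTypeFor returns the documented error type, or "" for an unlisted status.
+func errorTypeFor(status int) string {
+	return errorTypes[status]
+}
+
+func main() {
+	fmt.Println(errorTypeFor(429)) // prints "rate_limit_error"
+}
+```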
--- a/docs/providers/supported-providers/openrouter.mdx
+++ b/docs/providers/supported-providers/openrouter.mdx
@@ -0,0 +1,187 @@
+---
+title: "OpenRouter"
+description: "OpenRouter API conversion guide - routing to multiple providers, reasoning support, parameter handling, and streaming"
+icon: "split"
+---
+
+## Overview
+
+OpenRouter is an **OpenAI-compatible provider routing service** that accesses models from multiple providers (OpenAI, Anthropic, Google, Meta, etc.) through a unified interface. Bifrost delegates to the OpenAI implementation with special handling for reasoning models. Key features:
+- **Provider aggregation** - Access 100+ models from multiple vendors
+- **Reasoning support** - Extended thinking for supported models
+- **Parameter compatibility** - Intelligent reasoning effort conversion
+- **Streaming support** - Full SSE support with usage tracking
+- **Tool calling** - Complete function definition and execution
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v1/chat/completions` |
+| Responses API | ✅ | ✅ | `/v1/responses` |
+| Text Completions | ✅ | ✅ | `/v1/completions` |
+| List Models | ✅ | - | `/v1/models` |
+| Embeddings | ✅ | - | `/v1/embeddings` |
+| Image Generation | ❌ | ❌ | - |
+| Speech (TTS) | ❌ | ❌ | - |
+| Transcriptions (STT) | ❌ | ❌ | - |
+| Files | ❌ | ❌ | - |
+| Batch | ❌ | ❌ | - |
+
+<Note>
+**Unsupported Operations** (❌): Image Generation, Speech, Transcriptions, Files, and Batch are not supported by the upstream OpenRouter API. These return a `BifrostError` with an error code of `"unsupported_operation"`.
+
+**Note**: OpenRouter's Responses API is currently in **beta**.
+</Note>
+
+---
+
+# 1. Chat Completions
+
+## Request Parameters
+
+OpenRouter supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+### Reasoning Parameter Handling
+
+OpenRouter supports extended thinking on compatible models:
+
+```json
+// Bifrost request
+{
+  "reasoning": {
+    "effort": "high",
+    "max_tokens": 10000
+  }
+}
+
+// OpenRouter conversion
+{
+  "reasoning_effort": "high"
+}
+```
+
+**Reasoning Models:** `gpt-oss-120b` and compatible models receive special handling for reasoning content.
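+
+The conversion shown in the JSON example can be sketched in Go as follows. The struct and function names are hypothetical, and dropping `max_tokens` simply mirrors the example above rather than any confirmed rule.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// reasoning and bifrostParams are simplified, hypothetical request types.
+type reasoning struct {
+	Effort    string `json:"effort,omitempty"`
+	MaxTokens int    `json:"max_tokens,omitempty"`
+}
+
+type bifrostParams struct {
+	Reasoning *reasoning `json:"reasoning,omitempty"`
+}
+
+// openRouterParams holds the flattened field from the example above.
+type openRouterParams struct {
+	ReasoningEffort string `json:"reasoning_effort,omitempty"`
+}
+
+// toOpenRouter copies reasoning.effort into reasoning_effort.
+func toOpenRouter(in bifrostParams) openRouterParams {
+	var out openRouterParams
+	if in.Reasoning != nil {
+		out.ReasoningEffort = in.Reasoning.Effort // max_tokens is not forwarded, as in the example
+	}
+	return out
+}
+
+func main() {
+	in := bifrostParams{Reasoning: &reasoning{Effort: "high", MaxTokens: 10000}}
+	b, _ := json.Marshal(toOpenRouter(in))
+	fmt.Println(string(b)) // prints {"reasoning_effort":"high"}
+}
+```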
+
+### Cache Control Stripping
+
+Anthropic-specific cache control directives are automatically removed:
+
+```json
+// Bifrost supports cache control from Anthropic
+{
+  "messages": [{
+    "role": "user",
+    "content": [{
+      "type": "text",
+      "text": "...",
+      "cache_control": {"type": "ephemeral"}  // ← Stripped
+    }]
+  }]
+}
+
+// Sent to OpenRouter without cache directives
+```
+
+### Filtered Parameters
+
+Removed for OpenRouter compatibility:
+- `prompt_cache_key` - Not supported
+- `verbosity` - Anthropic-specific
+- `store` - Not supported
+- `service_tier` - OpenAI-specific
+
+OpenRouter supports all standard OpenAI message types, tools, responses, and streaming formats. For details on message handling, tool conversion, responses, and streaming, refer to [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+---
+
+# 2. Responses API
+
+OpenRouter's Responses API is handled as a distinct endpoint at `/v1/responses`. This API is currently in **beta** on OpenRouter.
+
+Same parameter support as Chat Completions, with requests forwarded directly to the Responses API endpoint without conversion to Chat Completions.
+
+**Special Message Handling (gpt-oss vs other models):**
+For details on how reasoning is handled differently between gpt-oss and other models (summaries vs. content blocks), see the [OpenAI Responses API documentation](/providers/supported-providers/openai).
+
+---
+
+# 3. Text Completions
+
+OpenRouter supports legacy text completion format:
+
+| Parameter | Mapping |
+|-----------|---------|
+| `prompt` | Direct pass-through |
+| `max_tokens` | Direct pass-through (`max_tokens`) |
+| `temperature`, `top_p` | Direct pass-through |
+| `stop` | Direct pass-through (stop sequences) |
+
+---
+
+# 4. List Models
+
+Lists 100+ models available through OpenRouter, including:
+- OpenAI (GPT-4, GPT-4 Turbo, etc.)
+- Anthropic (Claude 3 family)
+- Google (Gemini)
+- Meta (Llama)
+- Mistral
+- And many more
+
+---
+
+# 5. Embeddings
+
+OpenRouter supports embeddings through their OpenAI-compatible API. This allows you to generate vector embeddings for text using models from various providers.
+
+| Parameter | Mapping |
+|-----------|---------|
+| `input` | Direct pass-through (string or array of strings) |
+| `model` | Model ID (e.g., `cohere/embed-multilingual-v3.0`, `amazon/amazon-embeddings-v2`) |
+| `dimensions` | Number of dimensions for the output embedding |
+| `encoding_format` | Output format (`float` or `base64`) |
+
+**Supported Models:** OpenRouter supports various embedding models including:
+- Cohere (embed-multilingual-v3.0, embed-english-v3.0, etc.)
+- Amazon (amazon-embeddings-v2)
+- And other providers
+
+The embedding request/response follows the standard OpenAI format.
+
+---
+
+## Unsupported Features
+
+| Feature | Reason |
+|---------|--------|
+| Image Generation | Not offered by OpenRouter API |
+| Speech/TTS | Not offered by OpenRouter API |
+| Transcription/STT | Not offered by OpenRouter API |
+| Batch Operations | Not offered by OpenRouter API |
+| File Management | Not offered by OpenRouter API |
+
+---
+
+## Caveats
+
+<Accordion title="Cache Control Stripped">
+**Severity**: Medium
+**Behavior**: Anthropic cache control directives are removed
+**Impact**: Prompt caching features unavailable
+**Code**: Stripped during JSON marshaling
+</Accordion>
+
+<Accordion title="Parameter Filtering">
+**Severity**: Low
+**Behavior**: OpenAI-specific parameters filtered
+**Impact**: `prompt_cache_key`, `verbosity`, `store` removed
+**Code**: `filterOpenAISpecificParameters`
+</Accordion>
+
+<Accordion title="User Field Size Limit">
+**Severity**: Low
+**Behavior**: User field > 64 characters silently dropped
+**Impact**: Longer user identifiers are lost
+**Code**: `SanitizeUserField` enforces 64-char max
+</Accordion>
--- a/docs/providers/supported-providers/overview.mdx
+++ b/docs/providers/supported-providers/overview.mdx
@@ -0,0 +1,202 @@
+---
+title: "Overview"
+description: "Bifrost supports multiple AI providers with consistent OpenAI-compatible response formats, enabling seamless provider switching without code changes."
+icon: "circle-info"
+---
+
+## Overview
+
+Bifrost supports a wide range of AI providers, all accessible through a consistent OpenAI-compatible interface. This standardization allows you to switch between providers without modifying your application code, as all responses follow the same structure regardless of the underlying provider.
+ 
+Bifrost can also act as a provider-compatible gateway (for example, <u>[Anthropic](../../integrations/anthropic-sdk/overview)</u>, <u>[Google Gemini](../../integrations/genai-sdk/overview)</u>, <u>Cohere</u>, <u>[Bedrock](../../integrations/bedrock-sdk/overview)</u>, and others), exposing provider-specific endpoints so you can use existing provider SDKs or integrations with no code changes. See [What is an integration?](../../integrations/what-is-an-integration) for details.
+
+
+## Provider Support Matrix
+
+The following table summarizes which operations are supported by each provider via Bifrost’s unified interface.
+
+| Provider | Models | Text | Text (stream) | Chat | Chat (stream) | Responses | Responses (stream) | Images | Images (stream) | Image Edit | Image Edit (stream) | Image Variation | Embeddings | TTS | TTS (stream) | STT | STT (stream) | Files | Batch | Count tokens |
+|----------|--------|------|----------------|------|---------------|-----------|--------------------|--------|-----------------|------------|---------------------|-----------------|------------|-----|-------------|-----|--------------|-------|-------|--------------|
+| Anthropic (`anthropic/<model>`) | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ |
+| Azure (`azure/<model>`) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ |
+| Bedrock (`bedrock/<model>`) | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ |
+| Cerebras (`cerebras/<model>`) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| Cohere (`cohere/<model>`) | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
+| ElevenLabs (`elevenlabs/<model>`) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
+| Fireworks (`fireworks/<model>`) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| Gemini (`gemini/<model>`) | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| Groq (`groq/<model>`) | ✅ | 🟡 | 🟡 | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| Hugging Face (`huggingface/<model>`) | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ |
+| Mistral (`mistral/<model>`) | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
+| Nebius (`nebius/<model>`) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| Ollama (`ollama/<model>`) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| OpenAI (`openai/<model>`) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
+| OpenRouter (`openrouter/<model>`) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| Parasail (`parasail/<model>`) | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| Perplexity (`perplexity/<model>`) | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| Replicate (`replicate/<model>`) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ |
+| SGL (`sgl/<model>`) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+| Vertex AI (`vertex/<model>`) | ✅ | ❌ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
+| vLLM (`vllm/<model>`) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
+| xAI (`xai/<model>`) | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
+
+
+- 🟡 Not supported by the downstream provider, but internally implemented by Bifrost as a fallback.
+- ❌ Not supported by the downstream provider, hence not supported by Bifrost.
+- ✅ Fully supported by the downstream provider, or internally implemented by Bifrost.
+
+
+<Note>
+🟡 Some operations are not supported by the downstream provider, and their internal implementation in Bifrost is optional.
+For example, Groq does not support text completions, but Bifrost can emulate them internally using the Chat Completions API. This feature is disabled by default; enable it by setting `compat.convert_text_to_chat` to `true` in the client configuration.
+We do not recommend relying on such fallbacks, since text completions and chat completions are fundamentally different. The option exists to help users migrating from LiteLLM, which supports these fallbacks.
+</Note>
+
+
+Notes:
+- "Models" refers to the list models operation (`/v1/models`).
+- "Text" refers to the classic text completion interface (`/v1/completions`).
+- "Responses" refers to the OpenAI-style Responses API (`/v1/responses`). Depending on the provider, Bifrost either uses a native responses endpoint or maps to an equivalent chat API.
+- Reranking (`/v1/rerank`) is currently supported for Cohere, Bedrock, Vertex AI, and vLLM. See each provider page for model-specific requirements.
+- "Images" refers to the Image Generation API (`/v1/images/generations`).
+- "Image Edit" refers to the Image Edit API (`/v1/images/edits`).
+- "Image Variation" refers to the Image Variation API (`/v1/images/variations`).
+- TTS corresponds to `/v1/audio/speech` and STT to `/v1/audio/transcriptions`.
+- "Files" refers to the Files API operations (`/v1/files`) for uploading, listing, retrieving, and deleting files.
+- "Batch" refers to the Batch API operations (`/v1/batches`) for creating, listing, retrieving, canceling, and getting results of batch jobs.
+
+
+## Response Format
+
+All providers return responses in the OpenAI-compatible format. Bifrost handles the translation between different provider-specific formats automatically.
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+# Same response format regardless of provider
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openai/gpt-4o-mini",
+    "messages": [{"role": "user", "content": "Hello!"}]
+  }'
+
+# Returns OpenAI-compatible format:
+{
+  "id": "chatcmpl-123",
+  "object": "chat.completion",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "Hello! How can I help you?"
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 10,
+    "completion_tokens": 9,
+    "total_tokens": 19
+  }
+}
+```
+
+</Tab>
+
+<Tab title="Go SDK">
+
+```go
+// Same response structure regardless of provider
+type BifrostChatResponse struct {
+	ID                string                     `json:"id"`
+	Choices           []BifrostResponseChoice    `json:"choices"`
+	Created           int                        `json:"created"`
+	Model             string                     `json:"model"`
+	Object            string                     `json:"object"`
+	ServiceTier       string                     `json:"service_tier"`
+	SystemFingerprint string                     `json:"system_fingerprint"`
+	Usage             *BifrostLLMUsage           `json:"usage"`
+	ExtraFields       BifrostResponseExtraFields `json:"extra_fields"`
+}
+
+// Works with any provider
+response, err := client.ChatCompletionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostChatRequest{
+    Provider: schemas.OpenAI,    // or Anthropic, Bedrock, etc.
+    Model:    "gpt-4o-mini",     // or "claude-3-sonnet", etc.
+    Input:    messages,
+})
+// Response structure is always the same!
+```
+
+</Tab>
+</Tabs>
+
+
+## Custom Providers
+
+In addition to the built-in providers, Bifrost supports custom provider configurations. Custom providers allow you to create multiple instances of the same base provider with different configurations, request type restrictions, and access patterns. This is useful for environment-specific configurations, role-based access control, and feature testing.
+
+**Learn more:** [Custom Providers](../custom-providers)
+
+## Benefits
+
+The consistent interface across providers enables:
+
+- **Provider switching** without code modifications
+- **Fallback configurations** for improved reliability
+- **Load balancing** across multiple providers
+- **OpenAI-compatible patterns** for all providers
+
+## Provider Metadata
+
+Provider information is included in the `extra_fields` section of each response, providing transparency into which provider handled the request and any provider-specific metadata.
+
+### Raw Request/Response Access
+
+Bifrost can optionally return the raw request that was sent to the provider and the raw response received back. This is useful for debugging, auditing, and understanding how Bifrost transforms requests between different provider formats.
+
+**What's included:**
+- **`raw_request`** - The exact request body (JSON/data structure) that was sent to the provider's API endpoint
+- **`raw_response`** - The exact response body received from the provider (before Bifrost's normalization)
+- **Provider transformation details** - Shows exactly how Bifrost converted your input to provider-specific format
+
+**Example**: When you send a Chat Completions request with `max_completion_tokens` to Anthropic, Bifrost converts it to `max_tokens` in the raw request. Enabling raw request/response reveals this transformation.
+
+```json
+{
+  "extra_fields": {
+    "provider": "anthropic",
+    "raw_request": {
+      "model": "claude-3-5-sonnet",
+      "max_tokens": 4096,
+      "messages": [...]
+    },
+    "raw_response": {
+      "id": "msg_...",
+      "type": "message",
+      "content": [...],
+      "usage": {
+        "input_tokens": 123,
+        "output_tokens": 456
+      }
+    }
+  }
+}
+```
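+
+To make the transformation concrete, here is a minimal client-side sketch that decodes a response shaped like the example above and reads `max_tokens` back from `raw_request`. The `envelope` and `extraFields` types and the `rawMaxTokens` helper are hypothetical and cover only the keys shown above; they are not Bifrost SDK types.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+)
+
+// extraFields mirrors only the keys shown in the example above (hypothetical type).
+type extraFields struct {
+    Provider    string          `json:"provider"`
+    RawRequest  json.RawMessage `json:"raw_request"`
+    RawResponse json.RawMessage `json:"raw_response"`
+}
+
+// envelope is the part of the response body this sketch cares about.
+type envelope struct {
+    ExtraFields extraFields `json:"extra_fields"`
+}
+
+// rawMaxTokens returns the max_tokens value found in raw_request, if present.
+func rawMaxTokens(body []byte) (int, bool, error) {
+    var env envelope
+    if err := json.Unmarshal(body, &env); err != nil {
+        return 0, false, err
+    }
+    if len(env.ExtraFields.RawRequest) == 0 {
+        return 0, false, nil // raw request was not returned
+    }
+    var req struct {
+        MaxTokens *int `json:"max_tokens"`
+    }
+    if err := json.Unmarshal(env.ExtraFields.RawRequest, &req); err != nil {
+        return 0, false, err
+    }
+    if req.MaxTokens == nil {
+        return 0, false, nil
+    }
+    return *req.MaxTokens, true, nil
+}
+
+func main() {
+    body := []byte(`{"extra_fields":{"provider":"anthropic","raw_request":{"model":"claude-3-5-sonnet","max_tokens":4096,"messages":[]}}}`)
+    n, ok, err := rawMaxTokens(body)
+    fmt.Println(n, ok, err)
+}
+```
+
+When `raw_request` is absent, the helper reports that no value was found instead of returning an error.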
+
+**Use cases:**
+- **Debugging** - Verify how your request was transformed for the specific provider
+- **Auditing** - Track exactly what was sent to external APIs
+- **Cost analysis** - See actual token counts before Bifrost's normalization
+- **Integration testing** - Validate provider-specific transformations
+
+**Configuration options:**
+- **[Go SDK Provider Configuration](../../quickstart/go-sdk/provider-configuration)** - Configure `SendBackRawResponse` and other provider settings
+- **[Gateway Provider Configuration](../../quickstart/gateway/provider-configuration)** - Configure `send_back_raw_response` via API, UI, or config file
+
+<Note>
+Enabling raw request/response increases response payload size but has minimal performance impact. Use it selectively in debugging/testing environments or when you need audit trails.
+</Note>
--- a/docs/providers/supported-providers/parasail.mdx
+++ b/docs/providers/supported-providers/parasail.mdx
@@ -0,0 +1,120 @@
+---
+title: "Parasail"
+description: "Parasail API conversion guide - OpenAI-compatible format, streaming support, tool calling, and parameter handling"
+icon: "p"
+---
+
+## Overview
+
+Parasail is an **OpenAI-compatible provider** offering high-performance inference. Bifrost delegates to the OpenAI implementation with standard parameter handling. Key characteristics:
+- **OpenAI API compatibility** - Identical request/response format
+- **Full streaming support** - Server-Sent Events with usage tracking
+- **Tool calling** - Complete function definition and execution
+- **Parameter filtering** - Removes unsupported OpenAI-specific fields
+- **Responses API** - Fallback to Chat Completions
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v1/chat/completions` |
+| Responses API | ✅ | ✅ | `/v1/chat/completions` |
+| List Models | ✅ | - | `/v1/models` |
+| Text Completions | ❌ | ❌ | - |
+| Embeddings | ❌ | ❌ | - |
+| Image Generation | ❌ | ❌ | - |
+| Speech (TTS) | ❌ | ❌ | - |
+| Transcriptions (STT) | ❌ | ❌ | - |
+| Files | ❌ | ❌ | - |
+| Batch | ❌ | ❌ | - |
+
+<Note>
+**Unsupported Operations** (❌): Text Completions, Embeddings, Image Generation, Speech, Transcriptions, Files, and Batch are not supported by the upstream Parasail API. These return `UnsupportedOperationError`.
+</Note>
+
+---
+
+# 1. Chat Completions
+
+## Request Parameters
+
+Parasail supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+### Filtered Parameters
+
+Removed for Parasail compatibility:
+- `prompt_cache_key` - Not supported
+- `verbosity` - OpenAI-specific
+- `store` - Not supported
+- `service_tier` - Not supported
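+
+The filtering step can be pictured with a small sketch that drops the four fields listed above from a request body. `filterForParasail` is a hypothetical helper written for illustration; it is not Bifrost's internal implementation.
+
+```go
+package main
+
+import "fmt"
+
+// filteredKeys lists the fields the section above says are removed for Parasail.
+var filteredKeys = []string{"prompt_cache_key", "verbosity", "store", "service_tier"}
+
+// filterForParasail returns a copy of the request body without the filtered fields.
+// Hypothetical helper for illustration only.
+func filterForParasail(req map[string]interface{}) map[string]interface{} {
+    out := make(map[string]interface{}, len(req))
+    for k, v := range req {
+        out[k] = v
+    }
+    for _, k := range filteredKeys {
+        delete(out, k)
+    }
+    return out
+}
+
+func main() {
+    req := map[string]interface{}{
+        "model":        "parasail-llama-33-70b-fp8",
+        "store":        true,
+        "service_tier": "auto",
+        "temperature":  0.2,
+    }
+    fmt.Println(filterForParasail(req))
+}
+```
+
+Copying the map first leaves the caller's request untouched, which keeps the original available for logging or retries against another provider.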
+
+### Reasoning Parameter
+
+Reasoning via standard OpenAI format:
+
+```json
+{
+  "model": "parasail-llama-33-70b-fp8",
+  "messages": [...],
+  "reasoning_effort": "high"
+}
+```
+
+Parasail supports all standard OpenAI message types, tools, responses, and streaming formats. For details on message handling, tool conversion, responses, and streaming, refer to [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+---
+
+# 2. Responses API
+
+Converted internally to Chat Completions:
+
+```
+ResponsesRequest → ChatRequest → ChatCompletion → ResponsesResponse
+```
+
+Same parameter support as Chat Completions.
+
+---
+
+# 3. List Models
+
+Lists available Parasail models with capabilities and context information.
+
+---
+
+## Unsupported Features
+
+| Feature | Reason |
+|---------|--------|
+| Text Completions | Not offered by Parasail API |
+| Embedding | Not offered by Parasail API |
+| Image Generation | Not offered by Parasail API |
+| Speech/TTS | Not offered by Parasail API |
+| Transcription/STT | Not offered by Parasail API |
+| Batch Operations | Not offered by Parasail API |
+| File Management | Not offered by Parasail API |
+
+---
+
+## Caveats
+
+<Accordion title="Cache Control Stripped">
+**Severity**: Medium
+**Behavior**: Cache control directives are removed from messages
+**Impact**: Prompt caching features don't work
+**Code**: Stripped during JSON marshaling
+</Accordion>
+
+<Accordion title="Parameter Filtering">
+**Severity**: Low
+**Behavior**: OpenAI-specific parameters filtered out
+**Impact**: `prompt_cache_key`, `verbosity`, `store`, and `service_tier` are removed
+**Code**: filterOpenAISpecificParameters
+</Accordion>
+
+<Accordion title="User Field Size Limit">
+**Severity**: Low
+**Behavior**: User field > 64 characters silently dropped
+**Impact**: Longer user identifiers are lost
+**Code**: SanitizeUserField enforces 64-char max
+</Accordion>
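+
+A minimal sketch of the user field limit described above, assuming length is measured in bytes. `sanitizeUser` is a hypothetical stand-in, not Bifrost's `SanitizeUserField`.
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+const maxUserLen = 64 // limit stated in the caveat above
+
+// sanitizeUser returns nil when the identifier exceeds the limit, mirroring the
+// "silently dropped" behavior. Measuring in bytes is an assumption.
+func sanitizeUser(user *string) *string {
+    if user == nil || len(*user) > maxUserLen {
+        return nil
+    }
+    return user
+}
+
+func main() {
+    short := "user-123"
+    long := strings.Repeat("x", 65)
+    fmt.Println(sanitizeUser(&short) != nil, sanitizeUser(&long) != nil)
+}
+```
+
+Clients that depend on long identifiers may want to hash or truncate them before sending.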
--- a/docs/providers/supported-providers/perplexity.mdx
+++ b/docs/providers/supported-providers/perplexity.mdx
@@ -0,0 +1,340 @@
+---
+title: "Perplexity"
+description: "Perplexity API conversion guide - OpenAI-compatible with web search integration, parameter mapping, and reasoning support"
+icon: "hexagon-nodes"
+---
+
+## Overview
+
+Perplexity is an OpenAI-compatible API with built-in web search capabilities and reasoning support. Bifrost performs conversions including:
+- **OpenAI-compatible base** - Uses OpenAI's chat format as foundation
+- **Web search parameters** - Search mode, domain filters, recency filters, and location-based search
+- **Reasoning effort mapping** - `reasoning.effort` mapped to Perplexity's `reasoning_effort` with special handling for "minimal"
+- **Search results inclusion** - Citations, search results, and videos included in response
+- **Special usage tracking** - Citation tokens, search queries, and reasoning tokens tracked separately
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/chat/completions` |
+| Responses API | ✅ | ✅ | `/chat/completions` |
+| Text Completions | ❌ | ❌ | - |
+| Embeddings | ❌ | ❌ | - |
+| Image Generation | ❌ | ❌ | - |
+| Speech (TTS) | ❌ | ❌ | - |
+| Transcriptions (STT) | ❌ | ❌ | - |
+| Files | ❌ | ❌ | - |
+| Batch | ❌ | ❌ | - |
+| List Models | ❌ | ❌ | - |
+
+<Note>
+**Unsupported Operations** (❌): Text Completions, Embeddings, Image Generation, Speech, Transcriptions, Files, Batch, and List Models are not supported by the upstream Perplexity API. These return `UnsupportedOperationError`.
+</Note>
+
+---
+
+# 1. Chat Completions
+
+## Request Parameters
+
+Perplexity supports most OpenAI chat completion parameters. For standard parameter reference, see [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+### Perplexity-Specific Constraints
+
+- **No function calling**: `tools` and `tool_choice` are silently dropped
+- **Dropped parameters**: `stop`, `logit_bias`, `logprobs`, `top_logprobs`, `seed`, `parallel_tool_calls`, `service_tier`
+- **Reasoning**: Uses `reasoning_effort` instead of `reasoning` object (see [Reasoning & Effort](#reasoning--effort))
+
+### Perplexity-Specific Parameters
+
+Use `extra_params` (SDK) or pass directly in request body (Gateway) for Perplexity-specific search and configuration fields:
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "sonar",
+    "messages": [{"role": "user", "content": "What is the latest news?"}],
+    "search_mode": "web",
+    "language_preference": "en",
+    "return_images": true,
+    "return_related_questions": true,
+    "disable_search": false,
+    "search_domain_filter": ["news.example.com"],
+    "search_recency_filter": "week"
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.ChatCompletionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostChatRequest{
+    Provider: schemas.Perplexity,
+    Model:    "sonar",
+    Input:    messages,
+    Params: &schemas.ChatParameters{
+        ExtraParams: map[string]interface{}{
+            "search_mode": "web",
+            "language_preference": "en",
+            "return_images": true,
+            "return_related_questions": true,
+            "disable_search": false,
+            "search_domain_filter": []string{"news.example.com"},
+            "search_recency_filter": "week",
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+#### Search Parameters
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `search_mode` | string | Search mode: `"web"`, `"academic"`, `"news"`, etc. |
+| `language_preference` | string | Language preference (e.g., `"en"`, `"fr"`) |
+| `search_domain_filter` | string[] | Restrict search to specific domains |
+| `return_images` | boolean | Include images in search results |
+| `return_related_questions` | boolean | Return related questions |
+| `search_recency_filter` | string | Recency filter: `"hour"`, `"day"`, `"week"`, `"month"`, `"year"` |
+| `search_after_date_filter` | string | Search results after date (ISO format) |
+| `search_before_date_filter` | string | Search results before date (ISO format) |
+| `last_updated_after_filter` | string | Content last updated after date |
+| `last_updated_before_filter` | string | Content last updated before date |
+| `disable_search` | boolean | Disable web search entirely |
+| `enable_search_classifier` | boolean | Enable search classifier |
+| `top_k` | integer | Top-k results to use |
+
+#### Media Parameters
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `web_search_options` | object[] | Array of web search option configurations with user location support |
+| `media_response.overrides.return_videos` | boolean | Return videos in results |
+| `media_response.overrides.return_images` | boolean | Return images in results |
+
+### Web Search Options
+
+Configure detailed search behavior including location:
+
+```json
+{
+  "web_search_options": [
+    {
+      "search_context_size": "high",
+      "user_location": {
+        "latitude": 40.7128,
+        "longitude": -74.0060,
+        "city": "New York",
+        "country": "US",
+        "region": "NY"
+      },
+      "image_search_relevance_enhanced": true
+    }
+  ]
+}
+```
+
+## Reasoning & Effort
+
+### Parameter Mapping
+
+- `reasoning.effort` → `reasoning_effort`
+- Supported efforts: `"low"`, `"medium"`, `"high"`
+- Special conversion: `"minimal"` → `"low"` (Perplexity normalizes to low/medium/high)
+- `reasoning.max_tokens` is silently dropped (Perplexity doesn't support token budget control)
+
+### Example
+
+```json
+// Request
+{"reasoning": {"effort": "high"}}
+
+// Perplexity conversion
+{"reasoning_effort": "high"}
+
+// Special case: "minimal" effort
+{"reasoning": {"effort": "minimal"}}
+→ {"reasoning_effort": "low"}
+```
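+
+The mapping rules above can be sketched in Go as follows. The `reasoning` type and both helpers are hypothetical and exist only to illustrate the conversion; they are not Bifrost's converter.
+
+```go
+package main
+
+import "fmt"
+
+// reasoning is a hypothetical stand-in for the OpenAI-style reasoning object.
+type reasoning struct {
+    Effort    string
+    MaxTokens *int
+}
+
+// mapReasoningEffort converts "minimal" to "low" and passes other values through.
+func mapReasoningEffort(effort string) string {
+    if effort == "minimal" {
+        return "low" // Perplexity only accepts low/medium/high
+    }
+    return effort
+}
+
+// toPerplexityParams emits reasoning_effort and ignores MaxTokens, which the
+// section above says is dropped.
+func toPerplexityParams(r *reasoning) map[string]string {
+    params := map[string]string{}
+    if r == nil || r.Effort == "" {
+        return params
+    }
+    params["reasoning_effort"] = mapReasoningEffort(r.Effort)
+    return params
+}
+
+func main() {
+    budget := 2048
+    fmt.Println(toPerplexityParams(&reasoning{Effort: "minimal", MaxTokens: &budget}))
+}
+```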
+
+## Response Conversion
+
+### Search Results Inclusion
+
+Perplexity responses include additional fields for search integration:
+
+- `citations[]` - Source citations from search
+- `search_results[]` - Full search results with metadata
+- `videos[]` - Video results from search
+
+These fields are preserved in the Bifrost response for client use.
+
+### Usage Details
+
+Extended usage tracking specific to Perplexity:
+
+| Field | Source | Description |
+|-------|--------|-------------|
+| `completion_tokens_details.citation_tokens` | `usage.citation_tokens` | Tokens used for citations |
+| `completion_tokens_details.num_search_queries` | `usage.num_search_queries` | Number of web search queries performed |
+| `completion_tokens_details.reasoning_tokens` | `usage.reasoning_tokens` | Tokens consumed by reasoning process |
+| `usage.cost` | `usage.cost` | Cost of the request |
+
+### Example Response
+
+```json
+{
+  "id": "...",
+  "choices": [...],
+  "usage": {
+    "prompt_tokens": 100,
+    "completion_tokens": 150,
+    "total_tokens": 250,
+    "completion_tokens_details": {
+      "citation_tokens": 25,
+      "num_search_queries": 3,
+      "reasoning_tokens": 40
+    },
+    "cost": { "prompt_cost": 0.001, "completion_cost": 0.002 }
+  },
+  "citations": ["https://example.com/article1", "https://example.com/article2"],
+  "search_results": [
+    {
+      "title": "...",
+      "url": "...",
+      "snippet": "...",
+      "date": "2025-01-15"
+    }
+  ],
+  "videos": [
+    {
+      "title": "...",
+      "url": "...",
+      "duration": 300
+    }
+  ]
+}
+```
+
+## Streaming
+
+Perplexity uses OpenAI-compatible streaming format. Event sequence:
+- `chat.completion.chunk` events with delta updates
+- Standard OpenAI finish reason mapping
+
+<Note>
+Streaming with web search may return search results in final chunks.
+</Note>
+
+---
+
+## Caveats
+
+<Accordion title="No Tool Support">
+**Severity**: High
+**Behavior**: Tool-related parameters are silently dropped
+**Impact**: Function calling not available
+**Code**: `chat.go:8-36`
+</Accordion>
+
+<Accordion title="Reasoning Effort Mapping">
+**Severity**: Medium
+**Behavior**: `"minimal"` effort is mapped to `"low"` (Perplexity only supports low/medium/high)
+**Impact**: Requested minimal effort becomes low effort
+**Code**: `chat.go:30-36`, `responses.go:25-30`
+</Accordion>
+
+<Accordion title="Reasoning Max Tokens Dropped">
+**Severity**: Low
+**Behavior**: `reasoning.max_tokens` is silently dropped
+**Impact**: No control over reasoning token budget
+**Code**: `chat.go:29-36`
+</Accordion>
+
+<Accordion title="Stop Sequences Not Supported">
+**Severity**: Low
+**Behavior**: `stop` parameter is silently dropped
+**Impact**: Stop sequences not enforced
+**Code**: `chat.go:8-36`
+</Accordion>
+
+---
+
+# 2. Responses API
+
+The Responses API is adapted for Perplexity by converting to the Chat Completions format internally and returning results in Responses format.
+
+## Request Parameters
+
+### Parameter Mapping
+
+| Parameter | Transformation |
+|-----------|----------------|
+| `max_output_tokens` | Renamed to `max_tokens` (value unchanged) |
+| `temperature`, `top_p` | Direct pass-through |
+| `instructions` | Converted to system message (prepended) |
+| `reasoning.effort` | Mapped to `reasoning_effort` (see [Reasoning & Effort](#reasoning--effort)) |
+| `text.format` | Passed through as `response_format` |
+| `input` (string/array) | Converted to messages |
+
+### Extra Parameters
+
+Same Perplexity-specific search and configuration parameters as Chat Completions (see [Perplexity-Specific Parameters](#perplexity-specific-parameters)).
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/responses \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "sonar",
+    "instructions": "You are a helpful assistant with web search capabilities",
+    "input": "What is the latest news in technology?",
+    "search_mode": "news",
+    "return_images": true
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.ResponsesRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostResponsesRequest{
+    Provider: schemas.Perplexity,
+    Model:    "sonar",
+    Input:    messages,
+    Params: &schemas.ResponsesParameters{
+        Instructions: schemas.Ptr("You are a helpful assistant with web search capabilities"),
+        ExtraParams: map[string]interface{}{
+            "search_mode": "news",
+            "return_images": true,
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+## Conversion Details
+
+- `instructions` becomes a system message prepended to input messages
+- `input` (string or array) converted to user message(s)
+- Response converted to Responses API format with same search results and extended usage details
+
+## Response Format
+
+Same as Chat Completions with search results, citations, and extended usage tracking preserved.
+
+## Streaming
+
+Responses streaming uses the same OpenAI-compatible streaming as Chat Completions, with results adapted to Responses format.
--- a/docs/providers/supported-providers/replicate.mdx
+++ b/docs/providers/supported-providers/replicate.mdx
@@ -0,0 +1,765 @@
+---
+title: "Replicate"
+description: "Replicate API conversion guide - prediction-based architecture, model-specific parameters, and async/sync modes"
+icon: "R"
+---
+
+## Overview
+
+Replicate is architecturally different from other providers in Bifrost. It uses a **prediction-based API** where every request creates a "prediction" that runs asynchronously. Each model on Replicate defines its own input schema, making it highly flexible but requiring model-specific parameter knowledge.
+
+### Key Architectural Differences
+
+1. **Prediction-Based System**: All operations create predictions via `/v1/predictions` or deployment endpoints
+2. **Model-Specific Inputs**: Each model has its own parameter schema (use `extra_params` for model-specific fields)
+3. **Async/Sync Modes**: Predictions can run synchronously (with `Prefer: wait` header) or asynchronously (with polling)
+4. **Flexible Output**: Output can be strings, arrays, URLs, or data URIs depending on the model
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v1/predictions` |
+| Responses API | ✅ | ✅ | `/v1/predictions` |
+| Text Completions | ✅ | ✅ | `/v1/predictions` |
+| Image Generation | ✅ | ✅ | `/v1/predictions` |
+| Image Edit | ✅ | ✅ | `/v1/predictions` |
+| Video Generation | ✅ | - | `/v1/predictions` |
+| Image Variation | ❌ | ❌ | - |
+| Files | ✅ | - | `/v1/files` |
+| List Models | ✅ | - | `/v1/deployments` |
+| Embeddings | ❌ | ❌ | - |
+| Speech (TTS) | ❌ | ❌ | - |
+| Transcriptions (STT) | ❌ | ❌ | - |
+| Batch | ❌ | ❌ | - |
+
+<Note>
+**List Models** returns account-specific deployments only, not all public models on Replicate.
+</Note>
+
+---
+
+# Model Identification
+
+Replicate models can be specified in three ways:
+
+## 1. Version ID
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "replicate/5c7d5dc6dd8bf75c1acaa8565735e7986bc5b66206b55cca93cb72c9bf15ccaa",
+    "messages": [{"role": "user", "content": "Hello"}]
+  }'
+```
+
+## 2. Model Name
+
+Format: `owner/model-name`
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "replicate/meta/llama-2-7b-chat",
+    "messages": [{"role": "user", "content": "Hello"}]
+  }'
+```
+
+## 3. Deployment
+
+Configure deployed models in the Replicate key configuration. Deployments map custom model identifiers to actual deployment paths.
+
+**Configuration Example:**
+
+```json
+{
+  "provider": "replicate",
+  "value": "your-api-key",
+  "aliases": {
+    "my-model": "owner/my-deployment-name"
+  }
+}
+```
+
+**Usage:**
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "replicate/my-model",
+    "messages": [{"role": "user", "content": "Hello"}]
+  }'
+```
+
+---
+
+# Prediction Modes
+
+## Sync Mode
+
+Bifrost uses sync mode when the incoming request carries a `Prefer: wait` header. The request then blocks until the prediction completes or the wait times out (60 seconds by default).
+
+**How it works:**
+1. Creates prediction with `Prefer: wait=60` header
+2. Replicate holds connection open for up to 60 seconds
+3. If prediction completes within timeout, returns result immediately
+4. If timeout expires, falls back to polling mode
+
+## Async Mode (Polling)
+
+Async mode is the default for Replicate predictions. Bifrost automatically polls the prediction URL every 2 seconds until the prediction reaches a terminal status.
+
+**Status Flow**: `starting` → `processing` → `succeeded`/`failed`/`canceled`
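+
+The polling behavior can be sketched as a loop that stops at the first terminal status. `pollUntilDone` and the `fetch` callback are hypothetical; `fetch` stands in for a GET on the prediction URL, and the attempt budget is an assumption rather than a documented Bifrost setting.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+    "time"
+)
+
+// isTerminal reports whether a status ends the status flow shown above.
+func isTerminal(status string) bool {
+    switch status {
+    case "succeeded", "failed", "canceled":
+        return true
+    }
+    return false
+}
+
+// pollUntilDone calls fetch at the given interval until a terminal status is
+// returned or maxAttempts is exhausted.
+func pollUntilDone(fetch func() (string, error), interval time.Duration, maxAttempts int) (string, error) {
+    for i := 0; i < maxAttempts; i++ {
+        status, err := fetch()
+        if err != nil {
+            return "", err
+        }
+        if isTerminal(status) {
+            return status, nil
+        }
+        time.Sleep(interval)
+    }
+    return "", errors.New("prediction did not finish within the attempt budget")
+}
+
+func main() {
+    statuses := []string{"starting", "processing", "succeeded"}
+    i := 0
+    fetch := func() (string, error) {
+        s := statuses[i]
+        if i < len(statuses)-1 {
+            i++
+        }
+        return s, nil
+    }
+    // The text above states a 2-second interval; a short one keeps the demo fast.
+    final, err := pollUntilDone(fetch, 10*time.Millisecond, 10)
+    fmt.Println(final, err)
+}
+```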
+
+---
+
+# 1. Chat Completions
+
+### Message Conversion
+
+**System Messages**: Extracted from messages array and concatenated into `system_prompt` field.
+
+**User/Assistant Messages**: Preserved as conversation context. Text content from content blocks is concatenated with newlines.
+
+**Image Content**: Non-base64 image URLs from message content blocks are extracted and passed as `image_input` array.
+
+```json
+// Input
+{
+  "messages": [
+    {"role": "system", "content": "You are helpful"},
+    {"role": "user", "content": "Hello"}
+  ]
+}
+
+// Converted to Replicate format
+{
+  "input": {
+    "system_prompt": "You are helpful",
+    "prompt": "Hello",
+    "messages": [...] // Original messages array also included
+  }
+}
+```
+
+### System Prompt Filtering
+
+**Important**: Not all Replicate models support the `system_prompt` field. For unsupported models, the system prompt is automatically prepended to the conversation prompt.
+
+**Models without system_prompt support:**
+- `meta/meta-llama-3-8b`
+- `meta/llama-2-70b`
+- `openai/gpt-oss-20b`
+- `openai/o1-mini`
+- `xai/grok-4`
+- All `deepseek-ai/deepseek*` models (e.g., `deepseek-r1`, `deepseek-v3`)
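+
+A rough sketch of this fallback, under stated assumptions: system messages are joined with newlines, the remaining messages form the prompt, and a blank line separates a prepended system prompt from the conversation. The `message` type, `supportsSystemPrompt`, and `buildInput` are hypothetical, and the separators are not confirmed by the text above.
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// message is a hypothetical, minimal chat message.
+type message struct {
+    Role    string
+    Content string
+}
+
+// noSystemPrompt holds the models listed above; deepseek models are matched by prefix.
+var noSystemPrompt = map[string]bool{
+    "meta/meta-llama-3-8b": true,
+    "meta/llama-2-70b":     true,
+    "openai/gpt-oss-20b":   true,
+    "openai/o1-mini":       true,
+    "xai/grok-4":           true,
+}
+
+func supportsSystemPrompt(model string) bool {
+    if strings.HasPrefix(model, "deepseek-ai/deepseek") {
+        return false
+    }
+    return !noSystemPrompt[model]
+}
+
+// buildInput assembles the prediction input. Separators are assumptions.
+func buildInput(model string, msgs []message) map[string]string {
+    var system, prompt []string
+    for _, m := range msgs {
+        if m.Role == "system" {
+            system = append(system, m.Content)
+        } else {
+            prompt = append(prompt, m.Content)
+        }
+    }
+    sys := strings.Join(system, "\n")
+    p := strings.Join(prompt, "\n")
+    input := map[string]string{}
+    switch {
+    case sys == "":
+        input["prompt"] = p
+    case supportsSystemPrompt(model):
+        input["system_prompt"] = sys
+        input["prompt"] = p
+    default:
+        input["prompt"] = sys + "\n\n" + p // prepend for unsupported models
+    }
+    return input
+}
+
+func main() {
+    msgs := []message{
+        {Role: "system", Content: "You are helpful"},
+        {Role: "user", Content: "Hello"},
+    }
+    fmt.Println(buildInput("meta/llama-2-7b-chat", msgs))
+    fmt.Println(buildInput("deepseek-ai/deepseek-r1", msgs))
+}
+```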
+
+### Model-Specific Parameters
+
+Use `extra_params` to pass model-specific parameters. These are **flattened into the input object**:
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "replicate/meta/llama-2-7b-chat",
+    "messages": [{"role": "user", "content": "Hello"}],
+    "temperature": 0.7,
+    "top_k": 50,
+    "repetition_penalty": 1.1,
+    "min_new_tokens": 10
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.ChatCompletionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostChatRequest{
+    Provider: schemas.Replicate,
+    Model:    "meta/llama-2-7b-chat",
+    Input:    messages,
+    Params: &schemas.ChatParameters{
+        Temperature: schemas.Ptr(0.7),
+        ExtraParams: map[string]interface{}{
+            "top_k": 50,
+            "repetition_penalty": 1.1,
+            "min_new_tokens": 10,
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+<Warning>
+**Model Schema Discovery**: Each Replicate model has unique parameters. Check the model's documentation on replicate.com or use the OpenAPI schema from the model version to discover available parameters.
+</Warning>
+
+## Response Conversion
+
+### Field Mapping
+
+- **Output**: 
+  - String → `choices[0].message.content`
+  - Array of strings → joined and mapped to `choices[0].message.content`
+  - Object with `text` field → `text` value mapped to `choices[0].message.content`
+- **Status**: `succeeded` → `finish_reason: "stop"`, `failed` → `finish_reason: "error"`
+- **Metrics**: `input_token_count` → `prompt_tokens`, `output_token_count` → `completion_tokens`
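+
+The three output shapes above can be handled with a type switch, as in this sketch. `outputToContent` is a hypothetical helper, and joining array items with no separator is an assumption.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+    "strings"
+)
+
+// outputToContent flattens a prediction output into message content.
+// Joining array items without a separator is an assumption.
+func outputToContent(raw []byte) (string, error) {
+    var v interface{}
+    if err := json.Unmarshal(raw, &v); err != nil {
+        return "", err
+    }
+    switch out := v.(type) {
+    case string:
+        return out, nil
+    case []interface{}:
+        var sb strings.Builder
+        for _, item := range out {
+            if s, ok := item.(string); ok {
+                sb.WriteString(s)
+            }
+        }
+        return sb.String(), nil
+    case map[string]interface{}:
+        if s, ok := out["text"].(string); ok {
+            return s, nil
+        }
+    }
+    return "", fmt.Errorf("unsupported output shape: %T", v)
+}
+
+func main() {
+    for _, raw := range []string{`"Hello"`, `["Hel","lo"]`, `{"text":"Hello"}`} {
+        content, err := outputToContent([]byte(raw))
+        fmt.Println(content, err)
+    }
+}
+```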
+
+### Example Response
+
+```json
+{
+  "id": "abc123",
+  "model": "meta/llama-2-7b-chat",
+  "object": "chat.completion",
+  "created": 1234567890,
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "Hello! How can I help you?"
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 10,
+    "completion_tokens": 8,
+    "total_tokens": 18
+  }
+}
+```
+
+## Streaming
+
+Replicate streaming uses Server-Sent Events (SSE) with the following event types:
+
+| Event Type | Description | Data Format |
+|------------|-------------|-------------|
+| `output` | Content chunk | Plain text string |
+| `done` | Completion | JSON: `{"reason": ""}` (empty = success) |
+| `error` | Error occurred | JSON: `{"detail": "error message"}` |
+
+**Streaming Flow:**
+1. Bifrost sets `stream: true` in prediction input
+2. Replicate returns `urls.stream` in initial response
+3. Bifrost connects to stream URL and processes SSE events
+4. `output` events → content deltas
+5. `done` event → final chunk with `finish_reason`
+
+**Done Event Reasons:**
+- Empty or no reason = success (`finish_reason: "stop"`)
+- `"canceled"` = prediction was canceled
+- `"error"` = prediction failed
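+
+The event handling above might look like the following sketch, which folds already-parsed SSE events into content and a finish reason. `sseEvent` and `consume` are hypothetical; returning a non-empty `reason` unchanged is an assumption, since only the success mapping is stated above.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "errors"
+    "fmt"
+    "strings"
+)
+
+// sseEvent is a hypothetical, already-parsed Server-Sent Event.
+type sseEvent struct {
+    Type string
+    Data string
+}
+
+// consume accumulates output chunks and resolves the finish reason from the
+// done event. Non-empty reasons are returned unchanged (assumption).
+func consume(events []sseEvent) (string, string, error) {
+    var sb strings.Builder
+    for _, ev := range events {
+        switch ev.Type {
+        case "output":
+            sb.WriteString(ev.Data)
+        case "done":
+            var d struct {
+                Reason string `json:"reason"`
+            }
+            if ev.Data != "" {
+                if err := json.Unmarshal([]byte(ev.Data), &d); err != nil {
+                    return sb.String(), "", err
+                }
+            }
+            if d.Reason == "" {
+                return sb.String(), "stop", nil
+            }
+            return sb.String(), d.Reason, nil
+        case "error":
+            var e struct {
+                Detail string `json:"detail"`
+            }
+            if err := json.Unmarshal([]byte(ev.Data), &e); err != nil {
+                return sb.String(), "", err
+            }
+            return sb.String(), "", errors.New(e.Detail)
+        }
+    }
+    return sb.String(), "", errors.New("stream ended without a done event")
+}
+
+func main() {
+    content, reason, err := consume([]sseEvent{
+        {Type: "output", Data: "Hello"},
+        {Type: "output", Data: " world"},
+        {Type: "done", Data: `{"reason": ""}`},
+    })
+    fmt.Println(content, reason, err)
+}
+```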
+
+---
+
+# 2. Responses API
+
+The Responses API is converted internally to Chat Completions or native Replicate format depending on the model:
+
+```
+// Responses request → Replicate prediction conversion
+ResponsesRequest → ReplicatePredictionRequest → ReplicatePredictionResponse → BifrostResponsesResponse
+```
+
+**Conversion Logic:**
+
+1. **For OpenAI models with `gpt-5-structured`**: Uses native Responses format with `input_item_list`, `tools`, and `json_schema` support
+2. **For all other models**: Converted to Chat Completions format using message conversion logic
+
+Same parameter mapping and system prompt handling as [Chat Completions](#1-chat-completions).
+
+## Response Format
+
+Responses follow standard Responses API format with status mapping:
+
+| Replicate Status | Responses Status |
+|------------------|------------------|
+| `succeeded` | `completed` |
+| `failed` | `failed` |
+| `canceled` | `cancelled` |
+| `processing` | `in_progress` |
+| `starting` | `queued` |
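+
+The table above translates directly into a lookup, sketched here with a hypothetical `toResponsesStatus` helper.
+
+```go
+package main
+
+import "fmt"
+
+// statusMap encodes the Replicate-to-Responses status table above.
+var statusMap = map[string]string{
+    "succeeded":  "completed",
+    "failed":     "failed",
+    "canceled":   "cancelled",
+    "processing": "in_progress",
+    "starting":   "queued",
+}
+
+// toResponsesStatus returns the mapped status and whether the input was known.
+func toResponsesStatus(replicateStatus string) (string, bool) {
+    s, ok := statusMap[replicateStatus]
+    return s, ok
+}
+
+func main() {
+    for _, s := range []string{"starting", "processing", "succeeded", "failed", "canceled"} {
+        mapped, _ := toResponsesStatus(s)
+        fmt.Printf("%s -> %s\n", s, mapped)
+    }
+}
+```
+
+Note the spelling difference: Replicate reports `canceled`, while the Responses format uses `cancelled`.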
+
+---
+
+# 3. Text Completions (Legacy)
+
+### Conversion
+
+- **Prompt array**: Joined with newlines into single `prompt` field
+- **top_k**: Pass via `extra_params` (model-specific)
+
+### Example
+
+```bash
+curl -X POST http://localhost:8080/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "replicate/meta/llama-2-7b",
+    "prompt": "Once upon a time",
+    "max_tokens": 100,
+    "temperature": 0.8,
+    "top_k": 40
+  }'
+```
+
+## Response
+
+Same conversion as chat completions: output string/array → `choices[0].text`, with usage metrics from prediction metrics.
+
+---
+
+# 4. Image Generation
+
+### Parameter Mapping
+
+```json
+{
+  "prompt": "prompt",
+  "n": "number_of_images",
+  "aspect_ratio": "aspect_ratio",
+  "resolution": "resolution",
+  "output_format": "output_format",
+  "quality": "quality",
+  "background": "background",
+  "seed": "seed",
+  "negative_prompt": "negative_prompt",
+  "num_inference_steps": "num_inference_steps",
+  "input_images": "input_images"
+}
+```
+
+### Input Image Field Mapping
+
+**Important**: Different Replicate models expect input images in different fields. Bifrost automatically maps `input_images` to the correct field based on the model.
+
+**Field Mapping by Model:**
+
+| Field | Models |
+|-------|--------|
+| `image_prompt` | `black-forest-labs/flux-1.1-pro`<br/>`black-forest-labs/flux-1.1-pro-ultra`<br/>`black-forest-labs/flux-pro`<br/>`black-forest-labs/flux-1.1-pro-ultra-finetuned` |
+| `input_image` | `black-forest-labs/flux-kontext-pro`<br/>`black-forest-labs/flux-kontext-max`<br/>`black-forest-labs/flux-kontext-dev` |
+| `image` | `black-forest-labs/flux-dev`<br/>`black-forest-labs/flux-fill-pro`<br/>`black-forest-labs/flux-dev-lora`<br/>`black-forest-labs/flux-krea-dev` |
+| `input_images` | All other models (default) |
+
+<Note>
+For models that expect a single image field (`image_prompt`, `input_image`, `image`), only the first image from the `input_images` array is used.
+</Note>
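+
+As a rough illustration of the mapping above, the following sketch picks the input field from the model name. It is not Bifrost's implementation: `applyInputImages` and `singleImageField` are hypothetical names, and matching on the exact model name (rather than a prefix or version) is an assumption.
+
+```go
+package main
+
+import "fmt"
+
+// singleImageField copies the per-model table above. Exact-name matching
+// is an assumption of this sketch.
+var singleImageField = map[string]string{
+    "black-forest-labs/flux-1.1-pro":                 "image_prompt",
+    "black-forest-labs/flux-1.1-pro-ultra":           "image_prompt",
+    "black-forest-labs/flux-pro":                     "image_prompt",
+    "black-forest-labs/flux-1.1-pro-ultra-finetuned": "image_prompt",
+    "black-forest-labs/flux-kontext-pro":             "input_image",
+    "black-forest-labs/flux-kontext-max":             "input_image",
+    "black-forest-labs/flux-kontext-dev":             "input_image",
+    "black-forest-labs/flux-dev":                     "image",
+    "black-forest-labs/flux-fill-pro":                "image",
+    "black-forest-labs/flux-dev-lora":                "image",
+    "black-forest-labs/flux-krea-dev":                "image",
+}
+
+// applyInputImages (hypothetical) writes images into the prediction input
+// under the field the model expects.
+func applyInputImages(model string, images []string, input map[string]any) {
+    if len(images) == 0 {
+        return
+    }
+    if field, ok := singleImageField[model]; ok {
+        input[field] = images[0] // single-image fields keep only the first image
+        return
+    }
+    input["input_images"] = images // default for all other models
+}
+
+func main() {
+    input := map[string]any{"prompt": "add a red scarf"}
+    applyInputImages("black-forest-labs/flux-kontext-pro",
+        []string{"https://example.com/a.png", "https://example.com/b.png"}, input)
+    fmt.Println(input["input_image"])
+}
+```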
+
+### Example
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/images/generations \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "replicate/black-forest-labs/flux-schnell",
+    "prompt": "A serene mountain landscape at sunset",
+    "aspect_ratio": "16:9",
+    "output_format": "webp",
+    "num_inference_steps": 4,
+    "seed": 42
+  }'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.ImageGenerationRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostImageGenerationRequest{
+    Provider: schemas.Replicate,
+    Model:    "black-forest-labs/flux-schnell",
+    Input: &schemas.ImageGenerationInput{
+        Prompt: "A serene mountain landscape at sunset",
+    },
+    Params: &schemas.ImageGenerationParameters{
+        AspectRatio:       schemas.Ptr("16:9"),
+        OutputFormat:      schemas.Ptr("webp"),
+        NumInferenceSteps: schemas.Ptr(4),
+        Seed:              schemas.Ptr(42),
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+## Response Conversion
+
+Replicate output can be:
+- **Single URL**: String → `data[0].url`
+- **Multiple URLs**: Array → `data[i].url` for each image
+- **Data URIs**: Base64-encoded images in data URI format
+
+```json
+{
+  "id": "xyz789",
+  "created": 1234567890,
+  "model": "black-forest-labs/flux-schnell",
+  "data": [
+    {
+      "url": "https://replicate.delivery/pbxt/...",
+      "index": 0
+    }
+  ],
+  "usage": {
+    "input_tokens": 15,
+    "output_tokens": 0,
+    "total_tokens": 15
+  }
+}
+```
+
+## Streaming
+
+Image generation streaming provides progressive image updates as data URIs:
+
+**SSE Events:**
+- `output`: Complete data URI of an intermediate (partial) image
+- `done`: Final completion with reason
+- `error`: Error details
+
+**Flow:**
+1. Each `output` event contains a complete data URI (e.g., `data:image/webp;base64,...`)
+2. Progressive refinement shows generation progress
+3. `done` event signals completion with final image
+4. Each chunk includes `Index`, `ChunkIndex`, and `B64JSON` fields
+
+---
+
+# 5. Image Edit
+
+Image edit runs as a prediction like image generation. You send one or more input images plus a prompt; the model returns edited image(s). The same **input image field mapping** as Image Generation applies (see [Field Mapping by Model](#field-mapping-by-model) below).
+
+**Endpoint**: `/v1/images/edits` (Bifrost) → Replicate `/v1/predictions` or deployment predictions.
+
+### Parameter Mapping
+
+| Bifrost / Request | Replicate input |
+|-------------------|-----------------|
+| `input.images` | Mapped to `image_prompt`, `input_image`, `image`, or `input_images` by model |
+| `input.prompt` | `prompt` |
+| `params.n` | `number_of_images` |
+| `params.output_format` | `output_format` |
+| `params.quality` | `quality` |
+| `params.background` | `background` |
+| `params.seed` | `seed` |
+| `params.negative_prompt` | `negative_prompt` |
+| `params.num_inference_steps` | `num_inference_steps` |
+| `params.extra_params` | Merged into prediction input |
+
+### Field Mapping by Model
+
+Input images are mapped to the same fields as in [Image Generation](#input-image-field-mapping):
+
+| Field | Models |
+|-------|--------|
+| `image_prompt` | `black-forest-labs/flux-1.1-pro`, `black-forest-labs/flux-1.1-pro-ultra`, `black-forest-labs/flux-pro`, `black-forest-labs/flux-1.1-pro-ultra-finetuned` |
+| `input_image` | `black-forest-labs/flux-kontext-pro`, `black-forest-labs/flux-kontext-max`, `black-forest-labs/flux-kontext-dev` |
+| `image` | `black-forest-labs/flux-dev`, `black-forest-labs/flux-fill-pro`, `black-forest-labs/flux-dev-lora`, `black-forest-labs/flux-krea-dev` |
+| `input_images` | All other models (default) |
+
+<Note>
+For single-image fields (`image_prompt`, `input_image`, `image`), only the first image from `input.images` is used.
+</Note>
+
+### Example
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl -X POST 'http://localhost:8080/v1/images/edits' \
+--form 'model="replicate/black-forest-labs/flux-fill-pro"' \
+--form 'image[]=@"image.png"' \
+--form 'prompt="Replace the sky with a starry night"' \
+--form 'mask=@"mask.png"'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+resp, err := client.ImageEditRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostImageEditRequest{
+    Provider: schemas.Replicate,
+    Model:    "black-forest-labs/flux-fill-pro",
+    Input: &schemas.ImageEditInput{
+        Prompt: "Replace the sky with a starry night",
+        Images: []schemas.ImageInput{
+            {Image: imageBytes}, // imageBytes holds the raw source image, loaded by the caller
+        },
+    },
+})
+```
+
+</Tab>
+</Tabs>
+
+### Response
+
+Same as Image Generation: single URL → `data[0].url`, array of URLs → `data[i].url`, or data URIs. Response shape is `BifrostImageGenerationResponse` with `data[].url` or `data[].b64_json`.
+
+### Streaming
+
+Image edit streaming is supported. Events use the same prediction log stream as image generation:
+
+- **Partial chunks**: `type: "image_edit.partial_image"` with `b64_json` (or data URI) until completion.
+- **Completed**: `type: "image_edit.completed"` with final image and usage.
+
+Use `Prefer: wait` for sync behavior or rely on polling (async) like other Replicate predictions.
+
+---
+
+# 6. Files API
+
+Replicate's Files API supports uploading, listing, and managing files for use in predictions.
+
+## Upload
+
+**Request**: Multipart form-data
+
+| Field | Type | Required | Notes |
+|-------|------|----------|-------|
+| `file` | binary | ✅ | File content |
+| `filename` | string | ❌ | Custom filename |
+| `content_type` | string | ❌ | MIME type (auto-detected from extension) |
+
+**Example:**
+
+```bash
+curl -X POST http://localhost:8080/v1/files \
+  -H "Authorization: Bearer $API_KEY" \
+  -F "file=@document.pdf" \
+  -F "filename=my-document.pdf"
+```
+
+**Response:**
+
+```json
+{
+  "id": "file_abc123",
+  "object": "file",
+  "bytes": 12345,
+  "created_at": 1234567890,
+  "filename": "my-document.pdf",
+  "purpose": "batch",
+  "status": "processed"
+}
+```
+
+## List Files
+
+**Query Parameters:**
+
+| Parameter | Type | Notes |
+|-----------|------|-------|
+| `limit` | int | Results per page |
+| `after` | string | Pagination cursor |
+
+**Example:**
+
+```bash
+curl -X GET "http://localhost:8080/v1/files?limit=20" \
+  -H "Authorization: Bearer $API_KEY"
+```
+
+**Pagination**: Replicate uses cursor-based pagination and returns a `next` URL in the response. Bifrost serializes this URL into the `after` cursor.
+
+## Retrieve / Delete
+
+**Operations:**
+- GET `/v1/files/{file_id}` - Retrieve file metadata
+- DELETE `/v1/files/{file_id}` - Delete file
+
+## File Content Download
+
+<Warning>
+Replicate requires signed download URLs with `owner`, `expiry`, and `signature` parameters.
+</Warning>
+
+**Required Parameters in ExtraParams:**
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `owner` | string | File owner username |
+| `expiry` | int64 | Unix timestamp for expiration |
+| `signature` | string | Base64-encoded HMAC-SHA256 signature |
+
+**Signature Format**: HMAC-SHA256 of `"{owner} {file_id} {expiry}"`, keyed with the Files API signing secret and then Base64-encoded
+
+**Example:**
+
+```bash
+curl -X POST http://localhost:8080/v1/files/file_abc123/content \
+  -H "Content-Type: application/json" \
+  -d '{
+    "owner": "my-username",
+    "expiry": 1735689600,
+    "signature": "base64-encoded-signature"
+  }'
+```
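+
+A minimal sketch of computing the signature described above, using only the Go standard library. The helper name `signFileDownload`, the secret value, and the use of standard (not URL-safe) Base64 are assumptions; check them against your Replicate signing setup before relying on this.
+
+```go
+package main
+
+import (
+    "crypto/hmac"
+    "crypto/sha256"
+    "encoding/base64"
+    "fmt"
+)
+
+// signFileDownload (hypothetical) signs "{owner} {file_id} {expiry}" with
+// HMAC-SHA256 and returns the Base64-encoded digest.
+func signFileDownload(secret, owner, fileID string, expiry int64) string {
+    payload := fmt.Sprintf("%s %s %d", owner, fileID, expiry)
+    mac := hmac.New(sha256.New, []byte(secret))
+    mac.Write([]byte(payload))
+    // Assumption: standard Base64 alphabet with padding.
+    return base64.StdEncoding.EncodeToString(mac.Sum(nil))
+}
+
+func main() {
+    // Placeholder secret for illustration only.
+    sig := signFileDownload("example-signing-secret", "my-username", "file_abc123", 1735689600)
+    fmt.Println(sig)
+}
+```
+
+The resulting string would be passed as `signature`, alongside the same `owner` and `expiry` values used to build the payload.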
+
+---
+
+# 7. List Models
+
+**Endpoint**: `/v1/models`
+
+<Warning>
+List Models returns **account-specific deployments only**, not all public models on Replicate.
+</Warning>
+
+Deployments are private or organization models with dedicated infrastructure. The response includes:
+
+```json
+{
+  "data": [
+    {
+      "id": "replicate/my-org/my-deployment",
+      "name": "my-deployment",
+      "owner": "my-org"
+    }
+  ],
+  "has_more": false
+}
+```
+
+**Usage:**
+1. List your deployments via this endpoint
+2. Use deployment name as model identifier: `replicate/my-org/my-deployment`
+3. Predictions route to deployment-specific endpoint: `/v1/deployments/my-org/my-deployment/predictions`
+
+---
+
+# Extra Parameters
+
+## Model-Specific Parameters
+
+The most important feature for Replicate integration is **extra_params**. Parameters not in Bifrost's standard schema are flattened directly into the prediction `input` object.
+
+### How It Works
+
+```json
+// Request with extra params
+{
+  "model": "replicate/stability-ai/sdxl",
+  "prompt": "A photo of an astronaut",
+  "temperature": 0.7,          // Standard param
+  "guidance_scale": 7.5,       // Model-specific (extra param)
+  "num_inference_steps": 50,   // Model-specific (extra param)
+  "scheduler": "DPMSolverMultistep"  // Model-specific (extra param)
+}
+
+// Converted to Replicate prediction input
+{
+  "version": "...",
+  "input": {
+    "prompt": "A photo of an astronaut",
+    "temperature": 0.7,
+    "guidance_scale": 7.5,       // Flattened from extra_params
+    "num_inference_steps": 50,   // Flattened from extra_params
+    "scheduler": "DPMSolverMultistep"  // Flattened from extra_params
+  }
+}
+```
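+
+The flattening step can be sketched as a map merge. This is an illustration rather than Bifrost's code: `buildPredictionInput` is a hypothetical name, and letting standard parameters win when a key appears in both maps is an assumption.
+
+```go
+package main
+
+import "fmt"
+
+// buildPredictionInput (hypothetical) merges standard parameters and
+// extra params into a single flat prediction input.
+func buildPredictionInput(standard, extra map[string]any) map[string]any {
+    input := make(map[string]any, len(standard)+len(extra))
+    for k, v := range standard {
+        input[k] = v
+    }
+    for k, v := range extra {
+        // Assumption: a standard parameter is not overwritten by an extra param.
+        if _, exists := input[k]; !exists {
+            input[k] = v
+        }
+    }
+    return input
+}
+
+func main() {
+    input := buildPredictionInput(
+        map[string]any{"prompt": "A photo of an astronaut", "temperature": 0.7},
+        map[string]any{"guidance_scale": 7.5, "scheduler": "DPMSolverMultistep"},
+    )
+    fmt.Println(len(input), input["guidance_scale"], input["scheduler"])
+}
+```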
+
+### Discovering Model Parameters
+
+Each Replicate model has unique parameters. To find available parameters:
+
+1. **Model Page**: Visit the model on [replicate.com](https://replicate.com)
+2. **OpenAPI Schema**: Available at `/v1/models/{owner}/{name}/versions/{version_id}` (includes `openapi_schema`)
+3. **Cog Definition**: Check the model's source code (if public)
+
+---
+
+## Caveats
+
+<Accordion title="System Prompt Field Support">
+**Severity**: Medium
+**Behavior**: Not all models support the `system_prompt` field. For models that do not, the system prompt is prepended to the conversation prompt.
+**Impact**: Prompt structure differs between models
+**Models Affected**: `meta/meta-llama-3-8b`, `meta/llama-2-70b`, `openai/gpt-oss-20b`, `openai/o1-mini`, `xai/grok-4`, and all `deepseek-ai/deepseek*` models
+**Code**: `chat.go:300-318`
+</Accordion>
+
+<Accordion title="Input Image Field Mapping">
+**Severity**: Medium
+**Behavior**: Different models expect input images in different fields (`image_prompt`, `input_image`, `image`, `input_images`)
+**Impact**: Bifrost automatically maps to correct field based on model
+**Models Affected**: Flux family models (see Input Image Field Mapping table)
+**Code**: `images.go:192-209`
+</Accordion>
+
+<Accordion title="Image Content in Chat">
+**Severity**: Low
+**Behavior**: Only non-base64 image URLs from message content blocks are extracted to `image_input`
+**Impact**: Base64-encoded images in messages are ignored
+**Code**: `chat.go:58-63`
+</Accordion>
+
+<Accordion title="Model-Specific Parameters">
+**Severity**: Medium
+**Behavior**: Each model has unique input schema; standard parameters may not work for all models
+**Impact**: Requires checking model documentation for available parameters
+**Mitigation**: Use `extra_params` for model-specific fields
+</Accordion>
+
+
+
+---
+
+## Video Generation
+
+### Generate (`POST /v1/videos`)
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `model` | string | ✅ | Replicate model (owner/model or version ID) |
+| `prompt` | string | ✅ | Text description of the video |
+| `input_reference` | string | ❌ | Reference image (base64 data URL or URL) → mapped to `image` field; OpenAI-hosted models use `input_reference` |
+| `seconds` | string | ❌ | Duration → `duration` |
+| `seed` | int | ❌ | Seed for reproducibility |
+| `negative_prompt` | string | ❌ | What to avoid |
+
+**Extra Params**: Pass model-specific fields directly in the JSON body (unrecognized fields become `extra_params` and are flattened into the prediction input). `webhook` and `webhook_events_filter` are extracted automatically.
+
+
+**Response**: [`BifrostVideoGenerationResponse`](https://github.com/maximhq/bifrost/blob/main/core/schemas/videos.go) — `id`, `status`, `model`, `videos[]`
+
+**Job Statuses**: `queued` (starting) → `in_progress` (processing) → `completed` / `failed`
+
+### Retrieve / Download
+
+| Operation | Endpoint | Notes |
+|-----------|----------|-------|
+| Get status | `GET /v1/videos/{id}` | Maps to `/v1/predictions/{id}` |
+| Download | `GET /v1/videos/{id}/content` | Downloads from the prediction output URL |
+
+<Note>
+Video Delete, List, and Remix are not supported by Replicate.
+</Note>
+
+---
+
+## Reference Links
+
+- [Replicate API Documentation](https://replicate.com/docs/topics/predictions/create-a-prediction)
+- [Replicate Models](https://replicate.com/explore)
+- [Bifrost Replicate Provider Source](https://github.com/maximhq/bifrost/tree/main/core/providers/replicate)
--- a/docs/providers/supported-providers/runway.mdx
+++ b/docs/providers/supported-providers/runway.mdx
@@ -0,0 +1,116 @@
+---
+title: "Runway ML"
+description: "Runway ML API conversion guide - text-to-video, image-to-video, and video-to-video generation"
+icon: "film"
+---
+
+## Overview
+
+Runway ML provides video generation via an asynchronous task-based API. Bifrost maps the unified video schema to Runway's task API and polls until completion.
+
+### Supported Operations
+
+| Operation | Supported | Endpoint |
+|-----------|-----------|----------|
+| Video Generation | ✅ | `/v1/text_to_video`, `/v1/image_to_video`, `/v1/video_to_video` |
+| Video Retrieve | ✅ | `/v1/tasks/{task_id}` |
+| Video Download | ✅ | via Retrieve + URL download |
+| Video Delete | ✅ | `/v1/tasks/{task_id}` (cancel) |
+| Video List | ❌ | - |
+| Video Remix | ❌ | - |
+
+---
+
+# 1. Video Generation
+
+## Generate (`POST /v1/videos`)
+
+**Request Parameters**
+
+| Parameter | Type | Required | Notes |
+|-----------|------|----------|-------|
+| `model` | string | ✅ | Runway model |
+| `prompt` | string | ✅ | Text description of the video |
+| `input_reference` | string | ❌ | Input image for image-to-video |
+| `seconds` | string | ❌ | Duration in seconds (default: `"2"`) |
+| `size` | string | ❌ | Resolution as `WxH` (e.g., `1280x720`; default: `1280x720`) — converted to `W:H` ratio |
+| `seed` | int | ❌ | **Gen models only** |
+| `audio` | bool | ❌ | Enable audio generation. **Veo models only** |
+| `video_uri` | string | ❌ | Source video URL for video-to-video. **gen4_aleph only** |
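+
+The `size` conversion can be sketched as follows. This is a hedged illustration: `sizeToRatio` is a hypothetical helper, and rejecting non-numeric or non-positive dimensions is an assumption about validation that the table does not spell out.
+
+```go
+package main
+
+import (
+    "fmt"
+    "strconv"
+    "strings"
+)
+
+// sizeToRatio (hypothetical) turns a "WxH" size into a "W:H" ratio string,
+// falling back to the documented default when size is empty.
+func sizeToRatio(size string) (string, error) {
+    if size == "" {
+        size = "1280x720" // documented default
+    }
+    w, h, ok := strings.Cut(size, "x")
+    if !ok {
+        return "", fmt.Errorf("size %q is not in WxH form", size)
+    }
+    for _, part := range []string{w, h} {
+        // Assumption: both dimensions must be positive integers.
+        if n, err := strconv.Atoi(part); err != nil || n <= 0 {
+            return "", fmt.Errorf("size %q has an invalid dimension", size)
+        }
+    }
+    return w + ":" + h, nil
+}
+
+func main() {
+    ratio, err := sizeToRatio("1280x720")
+    fmt.Println(ratio, err)
+}
+```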
+
+**Extra Params**
+
+| Key | Type | Notes |
+|-----|------|-------|
+| `references` | array | Video reference objects `[{"uri": "...", "tag": "..."}]` for video-to-video |
+| `content_moderation` | object | Content moderation config |
+| `reference_images` | array | Reference image objects for style/asset guidance |
+
+**Generation Modes** (auto-detected from inputs)
+
+- **Text-to-video**: `prompt` only
+- **Image-to-video**: `prompt` + `input_reference`
+- **Video-to-video**: `prompt` + `video_uri` — **gen4_aleph only**
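+
+The mode detection above can be sketched as a simple switch over which inputs are present. `runwayEndpoint` is a hypothetical helper, and giving `video_uri` precedence when both inputs are set is an assumption; the endpoints are the ones listed under Supported Operations.
+
+```go
+package main
+
+import "fmt"
+
+// runwayEndpoint (hypothetical) picks the Runway endpoint from the inputs
+// present on the request.
+func runwayEndpoint(inputReference, videoURI string) string {
+    switch {
+    case videoURI != "":
+        // Assumption: video_uri takes precedence if both inputs are set.
+        return "/v1/video_to_video"
+    case inputReference != "":
+        return "/v1/image_to_video"
+    default:
+        return "/v1/text_to_video"
+    }
+}
+
+func main() {
+    fmt.Println(runwayEndpoint("", ""))
+    fmt.Println(runwayEndpoint("https://example.com/frame.png", ""))
+    fmt.Println(runwayEndpoint("", "https://example.com/clip.mp4"))
+}
+```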
+
+**Response**: [`BifrostVideoGenerationResponse`](https://github.com/maximhq/bifrost/blob/main/core/schemas/videos.go) with `id`, `status`, `videos[]`
+
+**Bifrost statuses** (normalized): `queued` → `in_progress` → `completed` / `failed`
+
+These values are the normalized view returned by Bifrost's API. Runway's native statuses are: `PENDING`, `THROTTLED`, `RUNNING`, `SUCCEEDED`, `FAILED`, `CANCELLED`.
+
+## Retrieve / Download / Delete
+
+| Operation | Endpoint | Notes |
+|-----------|----------|-------|
+| Get status | `GET /v1/videos/{id}` | Poll until `status: completed` |
+| Download content | `GET /v1/videos/{id}/content` | Returns raw video bytes (MP4) |
+| Cancel/Delete | `DELETE /v1/videos/{id}` | Cancels the running task |
+
+---
+
+## Configuration
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+curl --location 'http://localhost:8080/api/providers' \
+--header 'Content-Type: application/json' \
+--data '{
+    "provider": "runway",
+    "keys": [
+        {
+            "name": "runway-key-1",
+            "value": "env.RUNWAY_API_KEY",
+            "models": ["*"],
+            "weight": 1.0
+        }
+    ]
+}'
+```
+
+See **[Provider Configuration](../../quickstart/gateway/provider-configuration)** for full setup options.
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+// Fragment of the provider switch inside your Account's GetKeysForProvider implementation
+case schemas.Runway:
+    return []schemas.Key{{
+        Value:  os.Getenv("RUNWAY_API_KEY"),
+        Models: []string{"*"},
+        Weight: 1.0,
+    }}, nil
+```
+
+See **[Provider Configuration](../../quickstart/go-sdk/provider-configuration)** for full setup options.
+
+</Tab>
+</Tabs>
+
+---
+
+## Reference Links
+
+- [Runway ML API Documentation](https://docs.dev.runwayml.com/)
+- [Runway ML Models](https://runwayml.com/research/)
--- a/docs/providers/supported-providers/sgl.mdx
+++ b/docs/providers/supported-providers/sgl.mdx
@@ -0,0 +1,146 @@
+---
+title: "SGLang"
+description: "SGL/SGLang API conversion guide - OpenAI-compatible format, parameter handling, streaming, tool support"
+icon: "s"
+---
+
+## Overview
+
+SGL (SGLang) is an **OpenAI-compatible local/remote inference engine** used for serving models with high throughput. Bifrost delegates all operations to the OpenAI provider implementation. Key features:
+- **OpenAI API compatibility** - Identical request/response format
+- **Full streaming support** - Server-Sent Events with usage tracking
+- **Tool calling** - Complete function definition and execution
+- **Text embeddings** - Support for embedding models
+- **Parameter filtering** - Removes unsupported fields for compatibility
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v1/chat/completions` |
+| Responses API | ✅ | ✅ | `/v1/chat/completions` |
+| Text Completions | ✅ | ✅ | `/v1/completions` |
+| Embeddings | ✅ | - | `/v1/embeddings` |
+| List Models | ✅ | - | `/v1/models` |
+| Image Generation | ❌ | ❌ | - |
+| Speech (TTS) | ❌ | ❌ | - |
+| Transcriptions (STT) | ❌ | ❌ | - |
+| Files | ❌ | ❌ | - |
+| Batch | ❌ | ❌ | - |
+
+<Note>
+**Unsupported Operations** (❌): Image Generation, Speech, Transcriptions, Files, and Batch are not supported by the upstream SGL API. These return `UnsupportedOperationError`.
+
+SGL is typically self-hosted. Ensure BaseURL is configured correctly pointing to your SGL instance (e.g., `http://localhost:8000`).
+</Note>
+
+---
+
+# 1. Chat Completions
+
+## Request Parameters
+
+SGL supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+### Filtered Parameters
+
+Removed for SGL compatibility:
+- `prompt_cache_key` - Not supported
+- `verbosity` - OpenAI-specific
+- `store` - Not supported
+- `service_tier` - OpenAI-specific
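+
+As a rough sketch of this filtering, the four fields listed above can be deleted from a decoded request body before it is forwarded. `stripFilteredParams` and the `sgl/my-model` name are hypothetical; Bifrost's own filter operates on its typed request structs and may differ.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+)
+
+// filteredParams lists the fields removed for SGL compatibility.
+var filteredParams = []string{"prompt_cache_key", "verbosity", "store", "service_tier"}
+
+// stripFilteredParams (hypothetical) removes the filtered fields in place.
+func stripFilteredParams(body map[string]any) {
+    for _, name := range filteredParams {
+        delete(body, name)
+    }
+}
+
+func main() {
+    body := map[string]any{
+        "model":       "sgl/my-model", // placeholder model name
+        "temperature": 0.2,
+        "store":       true,
+        "verbosity":   "low",
+    }
+    stripFilteredParams(body)
+    out, _ := json.Marshal(body)
+    fmt.Println(string(out))
+}
+```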
+
+SGL supports all standard OpenAI message types, tools, responses, and streaming formats. For details on message handling, tool conversion, responses, and streaming, refer to [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+---
+
+# 2. Responses API
+
+Falls back to Chat Completions with format conversion:
+
+```
+ResponsesRequest → ChatRequest → Response conversion
+```
+
+Same parameter support as Chat Completions.
+
+---
+
+# 3. Text Completions
+
+SGL supports legacy text completion format:
+
+| Parameter | Mapping |
+|-----------|---------|
+| `prompt` | Direct pass-through |
+| `max_tokens` | max_tokens |
+| `temperature`, `top_p` | Direct pass-through |
+| `frequency_penalty`, `presence_penalty` | Supported |
+
+---
+
+# 4. Embeddings
+
+SGL supports text embeddings for vector generation:
+
+| Parameter | Notes |
+|-----------|-------|
+| `input` | Text or array of texts |
+| `model` | Embedding model name |
+| `encoding_format` | "float" or "base64" |
+| `dimensions` | Model-specific dimension count |
+
+Response returns embedding vectors with usage information.
+
+---
+
+# 5. List Models
+
+Lists the models available on the SGL server, along with their capabilities.
+
+---
+
+## Unsupported Features
+
+| Feature | Reason |
+|---------|--------|
+| Image Generation | Not offered by SGL API |
+| Speech/TTS | Not offered by SGL API |
+| Transcription/STT | Not offered by SGL API |
+| Batch Operations | Not offered by SGL API |
+| File Management | Not offered by SGL API |
+
+---
+
+<Note>
+SGL requires BaseURL configuration pointing to your SGL instance (e.g., `http://localhost:8000` for local, `https://sgl.example.com` for remote).
+</Note>
+
+## Caveats
+
+<Accordion title="BaseURL Configuration Required">
+**Severity**: High
+**Behavior**: BaseURL must be explicitly configured
+**Impact**: Requests fail without proper configuration
+**Code**: Validated in NewSGLProvider
+</Accordion>
+
+<Accordion title="Cache Control Stripped">
+**Severity**: Medium
+**Behavior**: Cache control directives are removed from messages
+**Impact**: Prompt caching features don't work
+**Code**: Stripped during JSON marshaling
+</Accordion>
+
+<Accordion title="Parameter Filtering">
+**Severity**: Low
+**Behavior**: OpenAI-specific fields filtered out
+**Impact**: `prompt_cache_key`, `verbosity`, `store`, and `service_tier` are removed
+**Code**: filterOpenAISpecificParameters
+</Accordion>
+
+<Accordion title="User Field Size Limit">
+**Severity**: Low
+**Behavior**: A `user` field longer than 64 characters is silently dropped
+**Impact**: Longer user identifiers are lost
+**Code**: SanitizeUserField enforces 64-char max
+</Accordion>
--- a/docs/providers/supported-providers/vertex.mdx
+++ b/docs/providers/supported-providers/vertex.mdx
--- a/docs/providers/supported-providers/vllm.mdx
+++ b/docs/providers/supported-providers/vllm.mdx
@@ -0,0 +1,181 @@
+---
+title: "vLLM"
+description: "vLLM API guide - OpenAI-compatible self-hosted inference, chat, text, embeddings, rerank, and streaming"
+icon: "v"
+---
+
+## Overview
+
+vLLM is an **OpenAI-compatible provider** for self-hosted inference. Bifrost delegates to the shared OpenAI provider implementation. Key characteristics:
+- **OpenAI compatibility** - Chat, text completions, embeddings, rerank, transcriptions, and streaming
+- **Self-hosted** - Typically runs at `http://localhost:8000` or your own server
+- **Optional authentication** - API key often omitted for local instances
+- **Responses API** - Supported via chat completion fallback
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v1/chat/completions` |
+| Responses API | ✅ | ✅ | `/v1/chat/completions` |
+| Text Completions | ✅ | ✅ | `/v1/completions` |
+| Embeddings | ✅ | - | `/v1/embeddings` |
+| Rerank | ✅ | - | `/v1/rerank` (fallback: `/rerank`) |
+| List Models | ✅ | - | `/v1/models` |
+| Image Generation | ❌ | ❌ | - |
+| Speech (TTS) | ❌ | ❌ | - |
+| Transcriptions (STT) | ✅ | ✅ | `/v1/audio/transcriptions` |
+| Files | ❌ | ❌ | - |
+| Batch | ❌ | ❌ | - |
+
+<Note>
+**Unsupported Operations** (❌): Image Generation, Speech, Files, and Batch are not supported and return `UnsupportedOperationError`.
+</Note>
+
+---
+
+## Authentication
+
+- **API key**: Optional. For local vLLM instances, the key is often left empty.
+- When set, the key is sent as `Authorization: Bearer <key>`.
+
+---
+
+## Configuration
+
+- **Base URL**: Default is `http://localhost:8000`. Override via provider `network_config.base_url`.
+- **Model names**: Depend on the models loaded in your vLLM instance (e.g. `meta-llama/Llama-3.2-1B-Instruct`, `BAAI/bge-m3` for embeddings).
+
+<Tabs>
+<Tab title="Gateway">
+
+```bash
+# Point to local or remote vLLM instance (default: http://localhost:8000)
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "vllm/meta-llama/Llama-3.2-1B-Instruct",
+    "messages": [{"role": "user", "content": "Hello"}]
+  }'
+
+# Gateway provider config: set base_url for remote vLLM
+# "network_config": { "base_url": "http://vllm-endpoint:8000" }
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+config := &schemas.ProviderConfig{
+    NetworkConfig: schemas.NetworkConfig{
+        BaseURL: "http://localhost:8000",  // optional; default is http://localhost:8000
+        DefaultRequestTimeoutInSeconds: 30,
+    },
+}
+provider, _ := vllm.NewVLLMProvider(config, logger)
+
+response, _ := provider.ChatCompletion(ctx, key, request)
+```
+
+</Tab>
+</Tabs>
+
+---
+
+## Getting started
+
+1. Run a vLLM server (Docker or pip). Example with Docker:
+   ```bash
+   docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Llama-3.2-1B-Instruct
+   ```
+2. Verify the server:
+   ```bash
+   curl http://localhost:8000/v1/models
+   ```
+3. Use Bifrost with model prefix `vllm/<model_id>` (e.g. `vllm/meta-llama/Llama-3.2-1B-Instruct`).
+
+---
+
+# 1. Chat Completions
+
+vLLM supports standard OpenAI chat completion parameters. For full parameter reference, see [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions). Message types, tools, and streaming follow the same behavior.
+
+---
+
+# 2. Responses API
+
+Bifrost converts Responses API requests to Chat Completions and back:
+
+```
+BifrostResponsesRequest
+  → ToChatRequest()
+  → ChatCompletion
+  → ToBifrostResponsesResponse()
+```
+
+---
+
+# 3. Text Completions
+
+| Parameter | Mapping |
+|-----------|---------|
+| `prompt` | Sent as-is |
+| `max_tokens` | max_tokens |
+| `temperature` | temperature |
+| `top_p` | top_p |
+| `stop` | stop sequences |
+
+---
+
+# 4. Embeddings
+
+vLLM supports `/v1/embeddings`. Use model IDs exposed by your vLLM server (e.g. `BAAI/bge-m3`).
+
+---
+
+# 5. List Models
+
+Lists models from your vLLM instance via `/v1/models`. Available models depend on what is loaded on the server.
+
+---
+
+# 6. Rerank
+
+vLLM supports reranking for pooling/cross-encoder reranker models. Bifrost sends requests to `/v1/rerank` and automatically falls back to `/rerank` when required by your vLLM deployment.
+
+```bash
+curl -X POST http://localhost:8080/v1/rerank \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "vllm/BAAI/bge-reranker-v2-m3",
+    "query": "What is machine learning?",
+    "documents": [
+      {"text": "Machine learning is a subset of AI."},
+      {"text": "Python is a programming language."},
+      {"text": "Deep learning uses neural networks."}
+    ],
+    "params": {
+      "return_documents": true
+    }
+  }'
+```
+
+<Note>
+Your upstream vLLM server must be started with a rerank-capable model (pooling/cross-encoder task support).
+</Note>
+
+---
+
+## Caveats
+
+<Accordion title="Default base URL is localhost">
+**Severity**: Low  
+**Behavior**: Default base URL is `http://localhost:8000`.  
+**Impact**: For remote or custom ports, set `network_config.base_url` in the provider config.  
+</Accordion>
+
+<Accordion title="Error responses with HTTP 200">
+**Severity**: Low  
+**Behavior**: vLLM may return HTTP 200 with an error payload (e.g. `{"error": {"code": 404, "message": "..."}}`) instead of 4xx/5xx.  
+**Impact**: Bifrost normalizes these into standard error responses so clients see consistent error handling.  
+</Accordion>
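+
+To illustrate the HTTP 200 caveat, the sketch below checks a response body for an embedded `error` object before treating it as a success. It is not Bifrost's normalization code: `embeddedError` is a hypothetical helper, and it assumes `code` is numeric, as in the example payload above.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+)
+
+// errorEnvelope models the error payload shape shown in the caveat above.
+type errorEnvelope struct {
+    Error *struct {
+        Code    int    `json:"code"`
+        Message string `json:"message"`
+    } `json:"error"`
+}
+
+// embeddedError (hypothetical) reports whether a 200 body carries an error.
+func embeddedError(body []byte) (code int, message string, found bool) {
+    var env errorEnvelope
+    // Assumption: a body that fails to decode into this shape is not an error payload.
+    if err := json.Unmarshal(body, &env); err != nil || env.Error == nil {
+        return 0, "", false
+    }
+    return env.Error.Code, env.Error.Message, true
+}
+
+func main() {
+    body := []byte(`{"error": {"code": 404, "message": "model not found"}}`)
+    if code, msg, found := embeddedError(body); found {
+        fmt.Printf("upstream error %d: %s\n", code, msg)
+    }
+}
+```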
--- a/docs/providers/supported-providers/xai.mdx
+++ b/docs/providers/supported-providers/xai.mdx
@@ -0,0 +1,164 @@
+---
+title: "xAI"
+description: "xAI API conversion guide - OpenAI-compatible format, Grok models, vision support, reasoning, and parameter handling"
+icon: "x"
+---
+
+## Overview
+
+xAI is an **OpenAI-compatible provider** powering the Grok family of models. Bifrost delegates to the OpenAI implementation with standard parameter filtering. Key features:
+- **Full OpenAI compatibility** - Chat, text completion, and responses
+- **Vision support** - Image URLs and base64 encoding for multimodal models
+- **Streaming support** - Server-Sent Events with delta-based updates
+- **Reasoning support** - Extended thinking for Grok reasoning models
+- **Tool calling** - Complete function definition and execution
+- **Parameter filtering** - Removes unsupported OpenAI-specific fields
+
+### Supported Operations
+
+| Operation | Non-Streaming | Streaming | Endpoint |
+|-----------|---------------|-----------|----------|
+| Chat Completions | ✅ | ✅ | `/v1/chat/completions` |
+| Responses API | ✅ | ✅ | `/v1/responses` |
+| Text Completions | ✅ | ✅ | `/v1/completions` |
+| Image Generation | ✅ | - | `/v1/images/generations` |
+| List Models | ✅ | - | `/v1/models` |
+| Embeddings | ❌ | ❌ | - |
+| Speech (TTS) | ❌ | ❌ | - |
+| Transcriptions (STT) | ❌ | ❌ | - |
+| Files | ❌ | ❌ | - |
+| Batch | ❌ | ❌ | - |
+
+<Note>
+**Unsupported Operations** (❌): Embeddings, Speech, Transcriptions, Files, and Batch are not supported by the upstream xAI API. These return `UnsupportedOperationError`.
+</Note>
+
+---
+
+# 1. Chat Completions
+
+## Request Parameters
+
+xAI supports all standard OpenAI chat completion parameters. For full parameter reference and behavior, see [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+### Filtered Parameters
+
+Removed for xAI compatibility:
+- `prompt_cache_key` - Not supported
+- `verbosity` - OpenAI-specific
+- `store` - Not supported
+- `service_tier` - Not supported
+
+### Reasoning Support
+
+xAI's `grok-3-mini` model supports extended reasoning via the standard `reasoning_effort` field:
+
+```json
+{
+  "model": "xai/grok-3-mini",
+  "messages": [...],
+  "reasoning_effort": "high"
+}
+```
+
+<Warning>
+**Model-Specific Feature**: The `reasoning_effort` parameter is only supported by `grok-3-mini`. Other Grok-3 and Grok-4 models will return an error if this parameter is specified.
+</Warning>
+
+Bifrost converts from the internal `Reasoning` structure to xAI's `reasoning_effort` string format.
+
+### Vision Support
+
+xAI vision models support both image URLs and base64-encoded images:
+
+```json
+{
+  "model": "xai/grok-2-vision-1212",
+  "messages": [{
+    "role": "user",
+    "content": [
+      {"type": "text", "text": "What is in this image?"},
+      {"type": "image_url", "image_url": {"url": "https://..."}}
+    ]
+  }]
+}
+```
+
+**Supported Image Formats:**
+- ✅ Image URLs
+- ✅ Base64-encoded images
+- ✅ Multiple images per message
+
+xAI supports all standard OpenAI message types, tools, responses, and streaming formats. For details on message handling, tool conversion, responses, and streaming, refer to [OpenAI Chat Completions](/providers/supported-providers/openai#1-chat-completions).
+
+---
+
+# 2. Responses API
+
+xAI's Responses API is forwarded directly to `/v1/responses`:
+
+```
+ResponsesRequest → /v1/responses → ResponsesResponse
+```
+
+Same parameter support and message handling as Chat Completions. Full streaming support available.
+
+---
+
+# 3. Text Completions
+
+xAI supports legacy text completion format:
+
+| Parameter | Mapping |
+|-----------|---------|
+| `prompt` | Direct pass-through |
+| `max_tokens` | max_tokens |
+| `temperature`, `top_p` | Direct pass-through |
+| `stop` | Stop sequences |
+| `frequency_penalty`, `presence_penalty` | Penalty parameters |
+
+Streaming support available via `stream: true`.
+
+---
+
+# 4. Image Generation
+
+xAI's image generation uses the OpenAI-compatible format.
+
+**Request Conversion**
+
+xAI uses the same conversion as OpenAI (see [OpenAI Image Generation](/providers/supported-providers/openai#7-image-generation)):
+
+- **Model & Prompt**: `bifrostReq.Model` → `req.Model`, `bifrostReq.Prompt` → `req.Prompt`
+- **Parameters**: All `ImageGenerationParameters` fields from `bifrostReq` are carried into the request struct via struct embedding
+- **Endpoint**: `/v1/images/generations`
+
+<Note>
+The `quality`, `size`, and `style` parameters are not currently supported by xAI's API.
+</Note>
+
+**Response Conversion**
+
+Responses are unmarshaled directly into `BifrostImageGenerationResponse`.
+
+**Streaming**: Image generation streaming is not supported by xAI.
+
+---
+
+# 5. List Models
+
+Lists available xAI models with their capabilities and context lengths.
+
+---
+
+## Unsupported Features
+
+| Feature | Reason |
+|---------|--------|
+| Embeddings | Not offered by xAI API |
+| Speech/TTS | Not offered by xAI API |
+| Transcription/STT | Not offered by xAI API |
+| Batch Operations | Not offered by xAI API |
+| File Management | Not offered by xAI API |
+
+---