first commit

2026-04-26 21:52:23 +03:00
commit 880f412e2c
2662 changed files with 866266 additions and 0 deletions
--- a/docs/features/async-inference.mdx
+++ b/docs/features/async-inference.mdx
@@ -0,0 +1,182 @@
+---
+title: "Async Inference"
+description: "Submit inference requests asynchronously and poll for results later."
+icon: "clock"
+---
+
+## Overview
+
+Async inference uses a fire-and-forget pattern for gateway requests: submit a normal inference payload to an async endpoint, receive a job ID (the `id` field of the response) immediately, and poll later for the final result.
+
+<Note>
+This is a gateway-only feature: it is not available in the Go SDK, and it requires a Logs Store to be configured.
+</Note>
+
+## How It Works
+
+```mermaid
+sequenceDiagram
+    participant Client
+    participant Gateway as Bifrost Gateway
+    participant Worker as Async Worker
+    participant Provider
+
+    Client->>Gateway: POST /v1/async/chat/completions
+    Gateway-->>Client: 202 Accepted + {id, status: "pending"}
+    Gateway->>Worker: Queue async job
+    Worker->>Provider: Execute inference request
+    Provider-->>Worker: Response or error
+
+    Client->>Gateway: GET /v1/async/chat/completions/{job_id}
+    alt Job pending or processing
+        Gateway-->>Client: 202 Accepted + status
+    else Job completed or failed
+        Gateway-->>Client: 200 OK + result/error
+    end
+```
+
+## Supported Endpoints
+
+Streaming is not supported on async endpoints.
+
+| Request Type | Submit (POST) | Poll (GET) |
+|---|---|---|
+| Text completions | `/v1/async/completions` | `/v1/async/completions/{job_id}` |
+| Chat completions | `/v1/async/chat/completions` | `/v1/async/chat/completions/{job_id}` |
+| Responses API | `/v1/async/responses` | `/v1/async/responses/{job_id}` |
+| Embeddings | `/v1/async/embeddings` | `/v1/async/embeddings/{job_id}` |
+| Speech | `/v1/async/audio/speech` | `/v1/async/audio/speech/{job_id}` |
+| Transcriptions | `/v1/async/audio/transcriptions` | `/v1/async/audio/transcriptions/{job_id}` |
+| Image generations | `/v1/async/images/generations` | `/v1/async/images/generations/{job_id}` |
+| Image edits | `/v1/async/images/edits` | `/v1/async/images/edits/{job_id}` |
+| Image variations | `/v1/async/images/variations` | `/v1/async/images/variations/{job_id}` |
+| OCR | `/v1/async/ocr` | `/v1/async/ocr/{job_id}` |
+| Rerank | `/v1/async/rerank` | `/v1/async/rerank/{job_id}` |
+
+## Submitting a Request
+
+Use the same JSON body as the synchronous endpoint, but switch to the `/v1/async/` path.
+
+```bash
+curl -X POST http://localhost:8080/v1/async/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "x-bf-vk: sk-bf-your-virtual-key" \
+  -H "x-bf-async-job-result-ttl: 3600" \
+  -d '{
+    "model": "openai/gpt-4o-mini",
+    "messages": [
+      {
+        "role": "user",
+        "content": "Summarize the latest release notes in 3 bullets"
+      }
+    ]
+  }'
+```
+
+**Response (`202 Accepted`)**
+
+```json
+{
+  "id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
+  "status": "pending",
+  "created_at": "2026-02-19T08:10:17.831Z"
+}
+```
+
+## Polling for Results
+
+Use `GET` on the matching endpoint, passing the `id` returned at submission as the `{job_id}` path segment.
+
+```bash
+curl -X GET http://localhost:8080/v1/async/chat/completions/1e89b165-d4fe-49e8-beb2-3e157f2df02f \
+  -H "x-bf-vk: sk-bf-your-virtual-key"
+```
+
+**Response codes:**
+- `202 Accepted`: job is still `pending` or `processing`
+- `200 OK`: job is `completed` or `failed`
+
+**Pending example (`202`)**
+
+```json
+{
+  "id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
+  "status": "pending",
+  "created_at": "2026-02-19T08:10:17.831Z"
+}
+```
+
+**Completed example (`200`)**
+
+```json
+{
+  "id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
+  "status": "completed",
+  "created_at": "2026-02-19T08:10:17.831Z",
+  "completed_at": "2026-02-19T08:10:19.412Z",
+  "expires_at": "2026-02-19T09:10:19.412Z",
+  "status_code": 200,
+  "result": {
+    "id": "chatcmpl-123",
+    "object": "chat.completion"
+  }
+}
+```
+
+**Failed example (`200`)**
+
+```json
+{
+  "id": "1e89b165-d4fe-49e8-beb2-3e157f2df02f",
+  "status": "failed",
+  "created_at": "2026-02-19T08:10:17.831Z",
+  "completed_at": "2026-02-19T08:10:19.412Z",
+  "expires_at": "2026-02-19T09:10:19.412Z",
+  "status_code": 429,
+  "error": {
+    "error": {
+      "message": "rate limit exceeded",
+      "type": "rate_limit_error"
+    }
+  }
+}
+```
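+
+The submit-then-poll flow above can be sketched as a small client loop. This is a minimal sketch, not an official client: `http_fetch`, `poll_job`, `BASE_URL`, the polling interval, and the attempt cap are illustrative assumptions. Only the endpoint path, the `x-bf-vk` header, and the `202`/`200`/`404` status codes come from this page.
+
+```python
+import json
+import time
+import urllib.error
+import urllib.request
+
+BASE_URL = "http://localhost:8080"  # assumed local gateway address
+
+
+def http_fetch(job_id, virtual_key):
+    """GET the chat-completions poll endpoint; return (http_status, parsed_body)."""
+    req = urllib.request.Request(
+        f"{BASE_URL}/v1/async/chat/completions/{job_id}",
+        headers={"x-bf-vk": virtual_key},
+    )
+    try:
+        with urllib.request.urlopen(req) as resp:
+            return resp.status, json.loads(resp.read())
+    except urllib.error.HTTPError as err:  # e.g. 404 for expired jobs
+        return err.code, None
+
+
+def poll_job(fetch, interval_s=1.0, max_attempts=30):
+    """Call fetch() until the job leaves pending/processing.
+
+    fetch is any zero-argument callable returning (http_status, body).
+    """
+    for _ in range(max_attempts):
+        status, body = fetch()
+        if status == 200:  # completed or failed: body holds result or error
+            return body
+        if status == 404:
+            raise LookupError("Job not found or expired")
+        if status != 202:
+            raise RuntimeError(f"unexpected HTTP status {status}")
+        time.sleep(interval_s)  # still pending or processing
+    raise TimeoutError("job did not finish within max_attempts polls")
+```
+
+A call such as `poll_job(lambda: http_fetch(job_id, "sk-bf-your-virtual-key"))` returns the final job body. Because a failed job also arrives with HTTP `200`, the caller still needs to inspect the `status` field before reading `result`.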
+
+## Job Lifecycle
+
+| Status | Meaning | Transition Trigger |
+|---|---|---|
+| `pending` | Job record is created and queued | Immediate status on submit |
+| `processing` | Background worker has picked up the job | Worker starts execution |
+| `completed` | Operation succeeded and result is stored | Provider call completes successfully |
+| `failed` | Operation failed and error is stored | Provider call returns a Bifrost error |
+
+## Result TTL and Expiration
+
+- Default TTL is **3600 seconds (1 hour)**.
+- TTL starts from **completion time**, not submission time.
+- Server default is configured in `client.async_job_result_ttl`.
+- Per-request override uses `x-bf-async-job-result-ttl`.
+- If the header is invalid or `<= 0`, Bifrost falls back to the default TTL.
+- Expired jobs return `404 Job not found or expired`.
+- Expired async jobs are cleaned up every minute.
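+
+The TTL rules above can be expressed as a short calculation. This is a sketch under stated assumptions, not the gateway's code: `effective_ttl` and `expires_at` are hypothetical helper names, and treating any non-integer header value as invalid is an assumption about how the header is parsed.
+
+```python
+from datetime import datetime, timedelta, timezone
+
+DEFAULT_TTL_SECONDS = 3600  # documented default (1 hour)
+
+
+def effective_ttl(header_value, default=DEFAULT_TTL_SECONDS):
+    """Return the TTL in seconds, falling back to the default when the
+    header is missing, not an integer, or <= 0."""
+    try:
+        ttl = int(header_value)
+    except (TypeError, ValueError):
+        return default
+    return ttl if ttl > 0 else default
+
+
+def expires_at(completed_at, header_value=None):
+    """The TTL counts from completion time, not submission time."""
+    return completed_at + timedelta(seconds=effective_ttl(header_value))
+
+
+completed = datetime(2026, 2, 19, 8, 10, 19, 412000, tzinfo=timezone.utc)
+print(expires_at(completed, "3600").isoformat())
+```
+
+With the `completed_at` value from the polling examples and a 3600-second TTL, the result lines up with the `expires_at` field shown there.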
+
+## Virtual Key Authorization
+
+- If a job is created with a virtual key, the job stores that virtual key identity.
+- Polling must use the same virtual key value.
+- Missing or mismatched virtual keys fail lookup and return `404 Job not found or expired`.
+- Jobs created without a virtual key are not virtual-key scoped, so they can be polled by any caller that passes your gateway auth/middleware checks.
+
+## Observability
+
+- Async executions are logged like synchronous requests.
+- The logging metadata includes `isAsyncRequest: true`, which appears as an **Async** badge in the Logs UI.
+- Background execution still uses Bifrost request APIs, so LLM plugin hooks (governance, logging, cost tracking, etc.) are executed for the actual inference run.
+
+## Limitations
+
+- Gateway-only feature (not available in Go SDK).
+- Streaming is not supported on async endpoints.
+- Requires Logs Store to register async routes.
+- Jobs stuck in `processing` are not auto-expired by TTL cleanup. Cleanup only deletes jobs with `expires_at` set (completed/failed).
--- a/docs/features/drop-in-replacement.mdx
+++ b/docs/features/drop-in-replacement.mdx
@@ -0,0 +1,78 @@
+---
+title: "Drop-in Replacement"
+description: "Replace your existing AI SDK connections with Bifrost by changing just the base URL. Keep your code, gain advanced features like fallbacks, load balancing, and governance."
+icon: "shuffle"
+---
+
+## Zero Code Changes
+
+The Bifrost Gateway acts as a drop-in replacement for popular AI SDKs. This means you can point your existing OpenAI, Anthropic, or Google GenAI client to Bifrost's HTTP gateway and instantly gain access to advanced features without rewriting your application.
+
+The magic happens with a single line change: update your `base_url` to point to Bifrost's gateway, and everything else stays exactly the same.
+
+## How It Works
+
+Bifrost provides **100% compatible endpoints** for popular AI SDKs by acting as a protocol adapter. Your existing SDK code continues to work unchanged, but now benefits from Bifrost's multi-provider support, automatic failovers, semantic caching, and governance features.
+
+<Tabs group="drop-in-replacement">
+
+<Tab title="OpenAI SDK">
+
+```python
+import openai
+
+# Before: Direct to OpenAI
+client = openai.OpenAI(
+    api_key="your-openai-key"
+)
+
+# After: Through Bifrost
+client = openai.OpenAI(
+    base_url="http://localhost:8080/openai",  # Only change needed
+    api_key="dummy-key"  # Keys handled by Bifrost
+)
+```
+
+</Tab>
+
+<Tab title="Anthropic SDK">
+
+```python
+import anthropic
+
+# Before: Direct to Anthropic
+client = anthropic.Anthropic(
+    api_key="your-anthropic-key"
+)
+
+# After: Through Bifrost
+client = anthropic.Anthropic(
+    base_url="http://localhost:8080/anthropic",  # Only change needed
+    api_key="dummy-key"  # Keys handled by Bifrost
+)
+```
+
+</Tab>
+
+</Tabs>
+
+## Instant Advanced Features
+
+Once your SDK points to Bifrost, you automatically get:
+
+- **Multi-provider support** with automatic failovers
+- **Load balancing** across multiple API keys
+- **Semantic caching** for faster responses
+- **Governance controls** for usage monitoring and budgets
+- **Request/response logging** and analytics
+- **Rate limiting** and circuit breakers
+
+All of this works without changing a **single line** of your application logic.
+
+## Complete Integration Support
+
+Bifrost provides drop-in compatibility for multiple popular AI SDKs and frameworks:
+
+- **[OpenAI SDK](../integrations/openai-sdk)**
+- **[Anthropic SDK](../integrations/anthropic-sdk)**
+- **[Google GenAI SDK](../integrations/genai-sdk)**
+- **[LiteLLM](../integrations/litellm-sdk)**
+- **[LangChain](../integrations/langchain-sdk)**
+
+**For detailed setup instructions and compatibility information:** [Complete Integration Guide](../integrations/what-is-an-integration)
--- a/docs/features/governance/budget-and-limits.mdx
+++ b/docs/features/governance/budget-and-limits.mdx
@@ -0,0 +1,599 @@
+---
+title: "Budget and Limits"
+description: "Enterprise-grade budget management and cost control with hierarchical budget allocation through virtual keys, teams, and customers."
+icon: "money-bills"
+---
+
+## Overview
+
+Budgeting and rate limiting are core features of Bifrost's governance system, managed through [Virtual Keys](./virtual-keys).
+
+Bifrost's budget management system provides comprehensive cost control and financial governance for enterprise AI deployments. It operates through a **hierarchical budget structure** that enables granular cost management, usage tracking, and financial oversight across your entire organization.
+
+**Core Hierarchy:**
+```
+Customer (has independent budget)
+    ↓ (one-to-many)
+Team (has independent budget) 
+    ↓ (one-to-many)
+Virtual Key (has independent budget + rate limits)
+    ↓ (one-to-many)
+Provider Config (has independent budget + rate limits)
+
+OR
+
+Customer (has independent budget)
+    ↓ (direct attachment)
+Virtual Key (has independent budget + rate limits)
+    ↓ (one-to-many)
+Provider Config (has independent budget + rate limits)
+
+OR
+
+Virtual Key (standalone - has independent budget + rate limits)
+    ↓ (one-to-many)
+Provider Config (has independent budget + rate limits)
+```
+
+**Key Capabilities:**
+- **Virtual Keys** - Primary access control via `x-bf-vk` header (exclusive team OR customer attachment)
+- **Budget Management** - Independent budget limits at each hierarchy level with cumulative checking
+- **Rate Limiting** - Request and token-based throttling at both VK and provider config levels
+- **Provider-Level Governance** - Granular budgets and rate limits per AI provider within a virtual key
+- **Model/Provider Filtering** - Granular access control per virtual key  
+- **Usage Tracking** - Real-time monitoring and audit trails
+- **Audit Headers** - Optional team and customer identification
+
+---
+
+## Budget Management
+
+### Cost Calculation
+
+Bifrost automatically calculates costs based on:
+- **Provider Pricing** - Real-time model pricing data
+- **Token Usage** - Input + output tokens from API responses
+- **Request Type** - Different pricing for chat, text, embedding, speech, transcription
+- **Cache Status** - Reduced costs for cached responses
+- **Batch Operations** - Volume discounts for batch requests
+
+All cost calculation details are covered in [Architecture > Framework > Model Catalog](../../architecture/framework/model-catalog).
+
+### Budget Checking Flow
+
+When a request is made with a virtual key, Bifrost checks **all applicable budgets independently** in the hierarchy. Each budget must have sufficient remaining balance for the request to proceed.
+
+**Checking Sequence:**
+
+**For VK → Team → Customer:**
+```
+1. ✓ Provider Config Budget (if provider config has budget)
+2. ✓ VK Budget (if VK has budget)
+3. ✓ Team Budget (if VK's team has budget)  
+4. ✓ Customer Budget (if team's customer has budget)
+```
+
+**For VK → Customer (direct):**
+```
+1. ✓ Provider Config Budget (if provider config has budget)
+2. ✓ VK Budget (if VK has budget)
+3. ✓ Customer Budget (if VK's customer has budget)
+```
+
+**For Standalone VK:**
+```
+1. ✓ Provider Config Budget (if provider config has budget)
+2. ✓ VK Budget (if VK has budget)
+```
+
+**Important Notes:**
+- **All applicable budgets must pass** - any single budget failure blocks the request
+- **Budgets are independent** - each tracks its own usage and limits
+- **Costs are deducted from all applicable budgets** - same cost applied to each level
+- **Rate limits checked at provider config and VK levels** - teams and customers have no rate limits
+- **Provider selection** - providers that exceed their budget or rate limits are excluded from [routing](./routing)
+
+**Example:**
+```
+- Provider config budget: $4 used of $5 limit ✓
+- VK budget: $9 used of $10 limit ✓
+- Team budget: $15 used of $20 limit ✓
+- Customer budget: $45 used of $50 limit ✓
+- Result: Allowed (no budget is exceeded)
+
+- After request:
+    - Request cost: $2
+    - Updated usage: Provider=$6/$5, VK=$11/$10, Team=$17/$20, Customer=$47/$50
+    - The next request will be blocked (both provider and VK budgets are exceeded).
+```
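+
+The example above can be replayed as a short simulation. This is a minimal sketch, not Bifrost's implementation: the dictionary shape and the `exceeded_budgets` / `try_request` helpers are hypothetical, and blocking a budget whose usage exactly equals its limit (`>=`) is an assumption the text does not settle.
+
+```python
+def exceeded_budgets(budgets):
+    """Return the names of budgets whose usage has reached or passed the limit."""
+    return [name for name, b in budgets.items() if b["used"] >= b["limit"]]
+
+
+def try_request(budgets, cost):
+    """Allow the request only if no applicable budget is exceeded,
+    then charge the same cost to every level."""
+    blocked_by = exceeded_budgets(budgets)
+    if blocked_by:
+        return False, blocked_by
+    for b in budgets.values():
+        b["used"] += cost
+    return True, []
+
+
+budgets = {
+    "provider_config": {"used": 4, "limit": 5},
+    "virtual_key": {"used": 9, "limit": 10},
+    "team": {"used": 15, "limit": 20},
+    "customer": {"used": 45, "limit": 50},
+}
+first = try_request(budgets, cost=2)   # allowed; usage becomes 6/5, 11/10, 17/20, 47/50
+second = try_request(budgets, cost=2)  # blocked by provider_config and virtual_key
+```
+
+The check runs before the cost is known, which is why the first request may push a budget past its limit and only the following request is rejected.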
+
+## Rate Limiting
+
+Rate limits protect your system from abuse and manage traffic by setting thresholds on request frequency and token usage over a specific time window. Rate limits can be configured at **both the Virtual Key level and Provider Config level** for granular control.
+
+Bifrost supports two types of rate limits that work in parallel:
+- **Request Limits**: Control the maximum number of API calls that can be made within a set duration (e.g., 100 requests per minute).
+- **Token Limits**: Control the maximum number of tokens (prompt + completion) that can be processed within a set duration (e.g., 50,000 tokens per hour).
+
+### Rate Limit Hierarchy
+
+Rate limits are checked in hierarchical order:
+```
+1. ✓ Provider Config Rate Limits (if provider config has rate limits)
+2. ✓ Virtual Key Rate Limits (if VK has rate limits)
+```
+
+For a request to be allowed, it must pass both the request limit and token limit checks at **all applicable levels**. If a provider config exceeds its rate limits, that provider is excluded from routing, but other providers within the same virtual key remain available.
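+
+The exclusion rule can be sketched as a filter over provider configs. This is an illustrative sketch only: `within_limits` and `routable_providers` are hypothetical names, the counter fields mirror the `*_max_limit` / `*_current_usage` names used in the API response on this page, and treating usage equal to the limit as exceeded is an assumption.
+
+```python
+def within_limits(rate_limit):
+    """True when both the request and token counters are below their limits.
+    A missing limit means that dimension is not configured."""
+    for kind in ("request", "token"):
+        limit = rate_limit.get(f"{kind}_max_limit")
+        used = rate_limit.get(f"{kind}_current_usage", 0)
+        if limit is not None and used >= limit:
+            return False
+    return True
+
+
+def routable_providers(vk_rate_limit, provider_configs):
+    """Return the providers still eligible for routing under a virtual key."""
+    if vk_rate_limit is not None and not within_limits(vk_rate_limit):
+        return []  # the VK-level limit blocks every provider
+    return [
+        pc["provider"]
+        for pc in provider_configs
+        if pc.get("rate_limit") is None or within_limits(pc["rate_limit"])
+    ]
+```
+
+A provider config that has hit its own limit drops out of the returned list, while the other providers under the same virtual key stay available.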
+
+### Provider-Level Rate Limiting
+
+Provider configs within a virtual key can have independent rate limits, enabling:
+- **Per-Provider Throttling**: Different rate limits for OpenAI vs Anthropic
+- **Provider Isolation**: Rate limit violations on one provider don't affect others
+- **Granular Control**: Fine-tune limits based on provider capabilities and costs
+
+## Reset Durations
+
+Budgets and rate limits support flexible reset durations:
+
+**Format Examples:**
+- `1m` - 1 minute
+- `5m` - 5 minutes  
+- `1h` - 1 hour
+- `1d` - 1 day
+- `1w` - 1 week
+- `1M` - 1 month
+- `1Y` - 1 year
+
+**Common Patterns:**
+- **Rate Limits**: `1m`, `1h`, `1d` for request throttling
+- **Budgets**: `1d`, `1w`, `1M`, `1Y` for cost control
+
+### Calendar-aligned budgets
+
+By default, a budget **rolls**: after `reset_duration` elapses since `last_reset`, usage resets. With **`calendar_aligned`: `true`**, the budget resets at the **start of each calendar period in UTC** instead (same instant for every customer of that configuration).
+
+**Supported `reset_duration` suffixes:** only day (`d`), week (`w`), month (`M`), and year (`Y`). Examples: `1d` → midnight UTC each day; `1w` → Monday 00:00 UTC each week; `1M` → first day of each month; `1Y` → January 1 each year. Sub-day durations (for example `1h`, `30m`) **cannot** use calendar alignment; the API rejects invalid combinations.
+
+Calendar alignment applies to budgets on **customers**, **teams**, **virtual keys**, and **per–provider-config** budgets. You can set it when creating a budget (`calendar_aligned` on create) or toggle it on update (`calendar_aligned` on the budget in `PUT` requests). Turning calendar alignment **on** for an existing budget resets **current usage to zero** and snaps **`last_reset`** to the current period start.
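+
+The period boundaries described above can be computed with the standard library. This is a minimal sketch, not Bifrost's code: `period_start` is a hypothetical helper that looks only at the duration suffix and treats the count as 1.
+
+```python
+from datetime import datetime, timedelta, timezone
+
+
+def period_start(now, reset_duration):
+    """Start of the current calendar period in UTC for 1d / 1w / 1M / 1Y.
+
+    now must be a timezone-aware datetime.
+    """
+    unit = reset_duration[-1]  # case-sensitive: "m" is minutes, "M" is months
+    if unit not in ("d", "w", "M", "Y"):
+        raise ValueError(f"{reset_duration!r} cannot be calendar aligned")
+    midnight = now.astimezone(timezone.utc).replace(
+        hour=0, minute=0, second=0, microsecond=0
+    )
+    if unit == "d":
+        return midnight
+    if unit == "w":
+        return midnight - timedelta(days=midnight.weekday())  # Monday is 0
+    if unit == "M":
+        return midnight.replace(day=1)
+    return midnight.replace(month=1, day=1)
+```
+
+Sub-day durations such as `1h` or `30m` raise `ValueError` here, mirroring the API's rejection of invalid combinations.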
+
+---
+
+## Configuration Guide
+
+Configure provider-level budgets and rate limits using any of these methods:
+
+<Tabs>
+<Tab title="Web UI">
+
+The Bifrost Web UI provides an intuitive interface for configuring provider-level governance through the Virtual Keys management page.
+
+### Creating Virtual Keys with Provider Configs
+
+1. **Navigate to Virtual Keys**: Go to the **Virtual Keys** page in the Bifrost dashboard
+2. **Create New Virtual Key**: Click the "Create Virtual Key" button
+3. **Configure Providers**: In the "Provider Configurations" section:
+   - Add multiple providers with individual weights
+   - Set provider-specific budgets and rate limits
+   - Configure allowed models per provider
+
+### Provider Configuration Interface
+
+![Virtual Key Provider Configuration Interface](../../media/ui-virtual-key-provider-config.png)
+
+**Key Features:**
+- **Visual Provider Cards**: Each provider displays as an expandable card
+- **Budget Controls**: Set spending limits with reset periods per provider
+- **Rate Limit Controls**: Configure token and request limits independently
+- **Model Filtering**: Specify allowed models for each provider
+- **Weight Distribution**: Visual indicators for load balancing weights
+- **Real-time Validation**: Immediate feedback on configuration errors
+
+### Monitoring Provider Usage
+
+![Provider Usage Sheet](../../media/ui-virtual-key-provider-usage-sheet.png)
+
+The info sheet for the virtual key provides real-time monitoring of:
+- Budget consumption per provider
+- Rate limit utilization (tokens and requests)
+- Provider availability status
+- Usage trends and forecasting
+
+</Tab>
+<Tab title="API">
+
+Use the Bifrost HTTP API to programmatically manage provider-level governance configurations.
+
+### Create Virtual Key with Provider Configs
+
+```bash
+curl -X POST "https://your-bifrost-instance.com/api/governance/virtual-keys" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "marketing-team-vk",
+    "description": "Marketing team virtual key with provider-specific limits",
+    "provider_configs": [
+      {
+        "provider": "openai",
+        "weight": 0.7,
+        "allowed_models": ["gpt-4", "gpt-3.5-turbo"],
+        "budget": {
+          "max_limit": 500.00,
+          "reset_duration": "1M",
+          "calendar_aligned": true
+        },
+        "rate_limit": {
+          "token_max_limit": 1000000,
+          "token_reset_duration": "1h",
+          "request_max_limit": 1000,
+          "request_reset_duration": "1h"
+        }
+      },
+      {
+        "provider": "anthropic",
+        "weight": 0.3,
+        "allowed_models": ["claude-3-opus", "claude-3-sonnet"],
+        "budget": {
+          "max_limit": 200.00,
+          "reset_duration": "1M"
+        },
+        "rate_limit": {
+          "token_max_limit": 500000,
+          "token_reset_duration": "1h",
+          "request_max_limit": 500,
+          "request_reset_duration": "1h"
+        }
+      }
+    ],
+    "budget": {
+      "max_limit": 1000.00,
+      "reset_duration": "1M",
+      "calendar_aligned": true
+    },
+    "is_active": true
+  }'
+```
+
+Use `calendar_aligned` only with `d` / `w` / `M` / `Y` reset durations (see [Calendar-aligned budgets](#calendar-aligned-budgets)).
+
+### Update Provider Configuration
+
+```bash
+curl -X PUT "https://your-bifrost-instance.com/api/governance/virtual-keys/{vk_id}" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "provider_configs": [
+      {
+        "id": 1,
+        "provider": "openai",
+        "weight": 0.8,
+        "budget": {
+          "max_limit": 600.00,
+          "reset_duration": "1M"
+        },
+        "rate_limit": {
+          "token_max_limit": 1200000,
+          "token_reset_duration": "1h"
+        }
+      }
+    ]
+  }'
+```
+
+### API Response Structure
+
+```json
+{
+  "message": "Virtual key created successfully",
+  "virtual_key": {
+    "id": "vk_123",
+    "name": "marketing-team-vk",
+    "value": "vk_abc123def456",
+    "provider_configs": [
+      {
+        "id": 1,
+        "provider": "openai",
+        "weight": 0.7,
+        "allowed_models": ["gpt-4", "gpt-3.5-turbo"],
+        "budget": {
+          "id": "budget_789",
+          "max_limit": 500.00,
+          "current_usage": 0.00,
+          "reset_duration": "1M",
+          "calendar_aligned": true,
+          "last_reset": "2024-01-01T00:00:00Z"
+        },
+        "rate_limit": {
+          "id": "rate_limit_456",
+          "token_max_limit": 1000000,
+          "token_current_usage": 0,
+          "token_reset_duration": "1h",
+          "token_last_reset": "2024-01-01T00:00:00Z",
+          "request_max_limit": 1000,
+          "request_current_usage": 0,
+          "request_reset_duration": "1h",
+          "request_last_reset": "2024-01-01T00:00:00Z"
+        }
+      }
+    ]
+  }
+}
+```
+
+### Field Descriptions
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `provider` | string | AI provider name (e.g., "openai", "anthropic") |
+| `weight` | float | Load balancing weight (0.0-1.0) |
+| `allowed_models` | array | Specific models allowed for this provider |
+| `budget.max_limit` | float | Maximum spend in USD |
+| `budget.reset_duration` | string | Reset period (e.g., "1h", "1d", "1M") |
+| `budget.calendar_aligned` | boolean | When true, resets at calendar boundaries in UTC (requires `d`/`w`/`M`/`Y` durations) |
+| `rate_limit.token_max_limit` | integer | Maximum tokens per period |
+| `rate_limit.request_max_limit` | integer | Maximum requests per period |
+
+</Tab>
+<Tab title="config.json">
+
+Configure provider-level governance through Bifrost's configuration file for declarative management.
+
+### Basic Configuration Structure
+
+```json
+{
+  "governance": {
+    "virtual_keys": [
+      {
+        "id": "vk-dev-001",
+        "name": "development-team-vk",
+        "description": "Development team with multi-provider setup",
+        "is_active": true,
+        "rate_limit_id": "rl-vk-dev",
+        "provider_configs": [
+          {
+            "id": 1,
+            "provider": "openai",
+            "weight": 0.6,
+            "allowed_models": ["gpt-4", "gpt-3.5-turbo"],
+            "rate_limit_id": "rl-pc-openai"
+          },
+          {
+            "id": 2,
+            "provider": "anthropic",
+            "weight": 0.4,
+            "allowed_models": ["claude-3-opus", "claude-3-sonnet"],
+            "rate_limit_id": "rl-pc-anthropic"
+          }
+        ]
+      }
+    ],
+    "budgets": [
+      {
+        "id": "budget-vk-dev",
+        "virtual_key_id": "vk-dev-001",
+        "max_limit": 2000.00,
+        "reset_duration": "1M",
+        "calendar_aligned": true
+      },
+      {
+        "id": "budget-pc-openai",
+        "provider_config_id": 1,
+        "max_limit": 1000.00,
+        "reset_duration": "1M"
+      },
+      {
+        "id": "budget-pc-anthropic",
+        "provider_config_id": 2,
+        "max_limit": 500.00,
+        "reset_duration": "1M"
+      }
+    ],
+    "rate_limits": [
+      {
+        "id": "rl-vk-dev",
+        "token_max_limit": 5000000,
+        "token_reset_duration": "1h",
+        "request_max_limit": 3000,
+        "request_reset_duration": "1h"
+      },
+      {
+        "id": "rl-pc-openai",
+        "token_max_limit": 2000000,
+        "token_reset_duration": "1h",
+        "request_max_limit": 2000,
+        "request_reset_duration": "1h"
+      },
+      {
+        "id": "rl-pc-anthropic",
+        "token_max_limit": 1000000,
+        "token_reset_duration": "1h",
+        "request_max_limit": 1000,
+        "request_reset_duration": "1h"
+      }
+    ]
+  }
+}
+```
+
+Budgets and rate limits live as **separate top-level arrays** inside `governance`. Virtual keys and provider configs reference them by id (`rate_limit_id`) or are referenced back (`virtual_key_id` / `provider_config_id` on each `budgets[]` entry). Optional `calendar_aligned` on each `budget` matches the HTTP API and [calendar-aligned behavior](#calendar-aligned-budgets).
+
+### Advanced Configuration Examples
+
+#### Cost-Optimized Setup
+```json
+{
+  "governance": {
+    "virtual_keys": [
+      {
+        "id": "vk-cost-opt",
+        "name": "cost-optimized-vk",
+        "provider_configs": [
+          {"id": 10, "provider": "openai-gpt-3.5", "weight": 0.8, "rate_limit_id": "rl-cheap"},
+          {"id": 11, "provider": "openai-gpt-4",   "weight": 0.2, "rate_limit_id": "rl-premium"}
+        ]
+      }
+    ],
+    "budgets": [
+      {"id": "b-cheap",   "provider_config_id": 10, "max_limit": 50.00,  "reset_duration": "1d"},
+      {"id": "b-premium", "provider_config_id": 11, "max_limit": 200.00, "reset_duration": "1d"}
+    ],
+    "rate_limits": [
+      {"id": "rl-cheap",   "request_max_limit": 1000, "request_reset_duration": "1h"},
+      {"id": "rl-premium", "request_max_limit": 100,  "request_reset_duration": "1h"}
+    ]
+  }
+}
+```
+
+#### High-Volume Production Setup
+```json
+{
+  "governance": {
+    "virtual_keys": [
+      {
+        "id": "vk-prod-hv",
+        "name": "production-high-volume-vk",
+        "provider_configs": [
+          {"id": 20, "provider": "openai",       "weight": 0.5, "rate_limit_id": "rl-openai"},
+          {"id": 21, "provider": "anthropic",    "weight": 0.3, "rate_limit_id": "rl-anthropic"},
+          {"id": 22, "provider": "azure-openai", "weight": 0.2, "rate_limit_id": "rl-azure"}
+        ]
+      }
+    ],
+    "budgets": [
+      {"id": "b-openai",    "provider_config_id": 20, "max_limit": 5000.00, "reset_duration": "1M"},
+      {"id": "b-anthropic", "provider_config_id": 21, "max_limit": 3000.00, "reset_duration": "1M"},
+      {"id": "b-azure",     "provider_config_id": 22, "max_limit": 2000.00, "reset_duration": "1M"}
+    ],
+    "rate_limits": [
+      {"id": "rl-openai",    "token_max_limit": 10000000, "token_reset_duration": "1h", "request_max_limit": 10000, "request_reset_duration": "1h"},
+      {"id": "rl-anthropic", "token_max_limit": 6000000,  "token_reset_duration": "1h", "request_max_limit": 6000,  "request_reset_duration": "1h"},
+      {"id": "rl-azure",     "token_max_limit": 4000000,  "token_reset_duration": "1h", "request_max_limit": 4000,  "request_reset_duration": "1h"}
+    ]
+  }
+}
+```
+
+**Validation Rules:**
+- Budget limits must be positive numbers
+- Reset durations must be valid time formats
+- Rate limits must be positive integers
+- Provider names must match configured providers
+
+</Tab>
+</Tabs>
+
+## Provider-Level Governance Examples
+
+### Example 1: Mixed Provider Budgets
+
+A virtual key configured with multiple providers and different budget allocations:
+
+```json
+{
+  "governance": {
+    "virtual_keys": [
+      {
+        "id": "vk-mkt",
+        "name": "marketing-team-vk",
+        "provider_configs": [
+          {"id": 30, "provider": "openai",    "weight": 0.7},
+          {"id": 31, "provider": "anthropic", "weight": 0.3}
+        ]
+      }
+    ],
+    "budgets": [
+      {"id": "b-vk-mkt", "virtual_key_id": "vk-mkt",      "max_limit": 100, "reset_duration": "1M"},
+      {"id": "b-openai", "provider_config_id": 30,        "max_limit": 50,  "reset_duration": "1M"},
+      {"id": "b-anth",   "provider_config_id": 31,        "max_limit": 30,  "reset_duration": "1M"}
+    ]
+  }
+}
+```
+
+**Behavior:**
+- OpenAI requests limited to 50 dollars/month at provider level + 100 dollars/month at VK level
+- Anthropic requests limited to 30 dollars/month at provider level + 100 dollars/month at VK level
+- If a provider's budget is exhausted, requests to that provider are blocked; the other provider remains available until its own budget or the VK budget is exhausted
+
+### Example 2: Provider-Specific Rate Limits
+
+Different rate limits based on provider capabilities:
+
+```json
+{
+  "governance": {
+    "virtual_keys": [
+      {
+        "id": "vk-hv",
+        "name": "high-volume-vk",
+        "provider_configs": [
+          {"id": 40, "provider": "openai",    "rate_limit_id": "rl-openai"},
+          {"id": 41, "provider": "anthropic", "rate_limit_id": "rl-anthropic"}
+        ]
+      }
+    ],
+    "rate_limits": [
+      {"id": "rl-openai",    "request_max_limit": 1000, "request_reset_duration": "1h", "token_max_limit": 1000000, "token_reset_duration": "1h"},
+      {"id": "rl-anthropic", "request_max_limit": 500,  "request_reset_duration": "1h", "token_max_limit": 500000,  "token_reset_duration": "1h"}
+    ]
+  }
+}
+```
+
+**Behavior:**
+- OpenAI: 1000 requests/hour, 1M tokens/hour
+- Anthropic: 500 requests/hour, 500K tokens/hour
+- If a provider's rate limits are exceeded, requests to that provider are blocked; the other provider remains available
+
+### Example 3: Failover Strategy
+
+Provider configurations with budget-based failover:
+
+```json
+{
+  "governance": {
+    "virtual_keys": [
+      {
+        "id": "vk-cost",
+        "name": "cost-optimized-vk",
+        "provider_configs": [
+          {"id": 50, "provider": "openai-cheap",   "weight": 1.0},
+          {"id": 51, "provider": "openai-premium", "weight": 0.0, "rate_limit_id": "rl-premium"}
+        ]
+      }
+    ],
+    "budgets": [
+      {"id": "b-cheap",   "provider_config_id": 50, "max_limit": 10, "reset_duration": "1d"},
+      {"id": "b-premium", "provider_config_id": 51, "max_limit": 50, "reset_duration": "1d"}
+    ],
+    "rate_limits": [
+      {"id": "rl-premium", "request_max_limit": 100, "request_reset_duration": "1h", "token_max_limit": 50000, "token_reset_duration": "1h"}
+    ]
+  }
+}
+```
+
+**Behavior:**
+- Primary: Use cheap provider until $10 daily budget exhausted
+- Fallback: Automatically switch to the premium provider when the cheap option is unavailable. To enable this, do not send the `provider` name in the request body; see [Routing](./routing#automatic-fallbacks) for details.
+- Cost containment: Prevent unexpected overspend on premium resources and limit the number of requests to the premium provider
+
+
+## Key Benefits of Provider-Level Governance
+
+- **Granular Control**: Set specific spending limits and rate limits per AI provider
+- **Automatic Fallback**: Route to alternative providers when budgets or rate limits are exceeded
+- **Cost Control**: Track and control spending by provider for better financial oversight
+- **Performance Testing**: A/B testing across providers with controlled budgets
+- **Multi-Provider Strategies**: Primary/backup provider configurations
+- **Cost-Tiered Access**: Cheap providers for basic tasks, premium for complex workloads  
+
+---
+
+## Next Steps
+
+- **[Routing](./routing)** - Direct requests to specific AI models, providers, and keys using Virtual Keys.
+- **[MCP Tool Filtering](./mcp-tools)** - Manage MCP clients/tools for virtual keys.
+- **[Tracing](../observability/default)** - Audit trails and request tracking
--- a/docs/features/governance/mcp-tools.mdx
+++ b/docs/features/governance/mcp-tools.mdx
@@ -0,0 +1,160 @@
+---
+title: "MCP Tool Filtering"
+description: "Control which MCP tools are available for each Virtual Key."
+icon: "grid-2"
+---
+
+## Overview
+
+MCP Tool Filtering allows you to control which tools are available to AI models on a per-request basis using Virtual Keys (VKs). By configuring a Virtual Key, you can create a strict allow-list of MCP clients and tools, ensuring that only approved tools can be executed.
+
+Make sure you have at least one MCP client set up. Read more about it [here](../../mcp/overview).
+
+## How It Works
+
+The filtering logic is determined by the Virtual Key's configuration:
+
+1.  **No MCP Configuration on Virtual Key (Default)**
+    - If a Virtual Key has no specific MCP configurations, **no MCP tools are available** (deny-by-default).
+    - You must explicitly add MCP client configurations to allow tools.
+
+2.  **With MCP Configuration on Virtual Key**
+    - When you configure MCP clients on a Virtual Key, its settings take full precedence.
+    - Bifrost automatically generates an `x-bf-mcp-include-tools` header based on your VK configuration (unless `disable_auto_tool_inject` is enabled or the caller already sent the header). This acts as a strict allow-list for the request.
+    - If the caller already includes an `x-bf-mcp-include-tools` header, auto-injection is skipped, but the VK allow-list is still enforced at inference time and again at MCP tool execution time.
+
+For each MCP client associated with a Virtual Key, you can specify the allowed tools:
+- **Select specific tools**: Only the chosen tools from that client will be available.
+- **Use `*` wildcard**: All available tools from that client will be permitted.
+- **Leave tool list empty**: All tools from that client will be **blocked**.
+- **Do not configure a client**: All tools from that client will be **blocked** (if other clients are configured).
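+
+As a rough sketch, the options above might look like this in a virtual key's `mcp_configs`, using the same field names as the configuration examples in the next section. The client name `internal-client` is hypothetical and only illustrates an empty tool list:
+
+```json
+{
+  "mcp_configs": [
+    { "mcp_client_name": "billing-client", "tools_to_execute": ["check-status"] },
+    { "mcp_client_name": "support-client", "tools_to_execute": ["*"] },
+    { "mcp_client_name": "internal-client", "tools_to_execute": [] }
+  ]
+}
+```
+
+Under these assumptions, only `check-status` is available from `billing-client`, every tool from `support-client` is available, and `internal-client` contributes no tools. Any client not listed is blocked as well.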
+
+## Setting MCP Tool Restrictions
+
+<Tabs group="mcp-tool-restrictions">
+<Tab title="Web UI">
+
+You can configure which tools a Virtual Key has access to via the UI.
+
+1.  Go to the **Virtual Keys** page.
+2.  Create or edit a virtual key.
+![Virtual Key MCP Tool Restrictions](../../media/ui-virtual-key-mcp-filter.png)
+3.  In the **MCP Client Configurations** section, add the MCP client you want to restrict the VK to.
+4.  Select the specific tools to allow, or choose **Allow All Tools** to permit all current and future tools from that client (stored as `*`). Leaving the list empty blocks all tools for that client.
+5.  Click **Save**.
+
+</Tab>
+<Tab title="API">
+
+You can configure this via the REST API when creating (`POST`) or updating (`PUT`) a virtual key.
+
+**Create Virtual Key:**
+```bash
+curl -X POST http://localhost:8080/api/governance/virtual-keys \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "vk-for-billing-support",
+    "mcp_configs": [
+      {
+        "mcp_client_name": "billing-client",
+        "tools_to_execute": ["check-status"]
+      },
+      {
+        "mcp_client_name": "support-client",
+        "tools_to_execute": ["*"]
+      }
+    ]
+  }'
+```
+
+**Update Virtual Key:**
+```bash
+curl -X PUT http://localhost:8080/api/governance/virtual-keys/{vk_id} \
+  -H "Content-Type: application/json" \
+  -d '{
+    "mcp_configs": [
+      {
+        "mcp_client_name": "billing-client",
+        "tools_to_execute": ["check-status"]
+      },
+      {
+        "mcp_client_name": "support-client",
+        "tools_to_execute": ["*"]
+      }
+    ]
+  }'
+```
+
+**Behavior:**
+- The virtual key can only access the `check-status` tool from `billing-client`.
+- It can access all tools from `support-client`.
+- Any other MCP client is implicitly blocked for this key.
+
+</Tab>
+
+<Tab title="config.json">
+
+You can also define MCP tool restrictions directly in your `config.json` file. The `mcp_configs` array under a virtual key should reference the MCP client by name.
+
+```json
+{
+  "governance": {
+    "virtual_keys": [
+      {
+        "id": "vk-billing-support-only",
+        "name": "VK for Billing and Support",
+        "mcp_configs": [
+          {
+            "mcp_client_name": "billing-client",
+            "tools_to_execute": ["check-status"]
+          },
+          {
+            "mcp_client_name": "support-client",
+            "tools_to_execute": ["*"]
+          }
+        ]
+      }
+    ]
+  }
+}
+```
+</Tab>
+</Tabs>
+
+## Example Scenario
+
+**Available MCP Clients & Tools:**
+- **`billing-client`**: with tools `[create-invoice, check-status]`
+- **`support-client`**: with tools `[create-ticket, get-faq]`
+
+<Tabs>
+<Tab title="VK with Full Access">
+**Configuration:**
+- `billing-client` -> Allowed Tools: `[*]` (wildcard)
+- `support-client` -> Allowed Tools: `[*]` (wildcard)
+
+**Result:**
+A request with this Virtual Key can access all four tools: `create-invoice`, `check-status`, `create-ticket`, and `get-faq`.
+
+</Tab>
+<Tab title="VK with Partial Access">
+**Configuration:**
+- `billing-client` -> Allowed Tools: `[check-status]`
+- `support-client` -> Not configured
+
+**Result:**
+A request with this Virtual Key can only access the `check-status` tool. All other tools are blocked.
+
+</Tab>
+<Tab title="VK with No Tools">
+**Configuration:**
+- `billing-client` -> Allowed Tools: `[]` (empty list)
+
+**Result:**
+A request with this Virtual Key cannot access any tools. All tools from all clients are blocked.
+</Tab>
+</Tabs>
+
+<Note>
+When a Virtual Key has MCP configurations, Bifrost enforces the allow-list at both inference time and MCP tool execution time. Auto-injection of the `x-bf-mcp-include-tools` header is skipped if the caller already provides it or if `disable_auto_tool_inject` is enabled, but the VK's restrictions still apply in either case. You can still use the `x-bf-mcp-include-clients` header to filter MCP clients per request.
+</Note>
--- a/docs/features/governance/required-headers.mdx
+++ b/docs/features/governance/required-headers.mdx
@@ -0,0 +1,166 @@
+---
+title: "Required Headers"
+description: "Enforce mandatory headers on every request through governance."
+icon: "shield-check"
+---
+
+## Overview
+
+Required headers let you enforce that specific HTTP headers are present on every LLM and MCP request passing through Bifrost. If a request is missing any required header, the governance plugin rejects it with a **400 Bad Request** error before it reaches the provider.
+
+This is useful for:
+- **Tenant isolation** - Require `X-Tenant-ID` to identify the calling tenant
+- **Audit trails** - Require `X-Correlation-ID` for request tracing across services
+- **Custom routing metadata** - Require headers your infrastructure depends on
+
+<Note>
+Required headers validation requires **governance to be enabled**. The check runs in both `PreLLMHook` and `PreMCPHook`, so it applies to all inference and MCP tool execution requests.
+</Note>
+
+Header matching is **case-insensitive** — configuring `X-Tenant-ID` will match `x-tenant-id`, `X-TENANT-ID`, or any other casing.
+
+---
+
+## How it works
+
+```mermaid
+graph LR
+    A[Request] --> B{All required<br/>headers present?}
+    B -->|Yes| C[Continue to<br/>governance evaluation]
+    B -->|No| D[400 Bad Request<br/>missing_required_headers]
+```
+
+When a request arrives:
+1. The HTTP transport middleware stores all request headers in the Bifrost context (lowercased keys)
+2. The governance plugin's `PreLLMHook` / `PreMCPHook` checks for each required header
+3. If any are missing, the request is rejected immediately with a `400` status and a JSON error listing the missing headers
+
+**Example error response:**
+```json
+{
+  "error": {
+    "message": "missing required headers: x-tenant-id, x-correlation-id",
+    "type": "missing_required_headers"
+  }
+}
+```
+
+---
+
+## Configuration
+
+<Tabs group="config-method">
+<Tab title="Web UI">
+
+1. Navigate to **Config** > **Security Settings**
+2. Ensure **Governance** is enabled (the required headers section only appears when governance is active)
+3. Scroll to **Required Headers**
+
+![Required Headers Configuration](../../media/ui-required-headers-setting.png)
+
+4. Enter a comma-separated list of header names (e.g., `X-Tenant-ID, X-Correlation-ID`)
+5. Click **Save Changes**
+
+Changes take effect immediately — no restart required.
+
+</Tab>
+<Tab title="API">
+
+Include `required_headers` in the `client_config` when updating the configuration:
+
+```bash
+curl -X PUT http://localhost:8080/api/config \
+  -H "Content-Type: application/json" \
+  -d '{
+    "client_config": {
+      "required_headers": ["X-Tenant-ID", "X-Correlation-ID"]
+    }
+  }'
+```
+
+To clear required headers, pass an empty array:
+
+```bash
+curl -X PUT http://localhost:8080/api/config \
+  -H "Content-Type: application/json" \
+  -d '{
+    "client_config": {
+      "required_headers": []
+    }
+  }'
+```
+
+</Tab>
+<Tab title="config.json">
+
+Add `required_headers` to the `client` section:
+
+```json
+{
+  "client": {    
+    "required_headers": ["X-Tenant-ID", "X-Correlation-ID"]
+  }
+}
+```
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `required_headers` | `string[]` | No | List of header names that must be present on every request. Case-insensitive. |
+
+</Tab>
+</Tabs>
+
+---
+
+## Examples
+
+### Requiring a tenant header
+
+Configure a single required header to enforce tenant identification:
+
+```json
+{
+  "client": {
+    "required_headers": ["X-Tenant-ID"]
+  }
+}
+```
+
+**Valid request:**
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "X-Tenant-ID: tenant-123" \
+  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}'
+```
+
+**Rejected request** (missing header):
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}'
+# → 400: missing required headers: x-tenant-id
+```
+
+### Combining with virtual keys
+
+Required headers work alongside virtual key enforcement. When both are configured, the governance plugin checks required headers first, then validates the virtual key:
+
+```json
+{
+  "client": {
+    "enforce_auth_on_inference": true,
+    "required_headers": ["X-Tenant-ID"]
+  }
+}
+```
+
+A request must include **both** the virtual key header and `X-Tenant-ID` to pass governance.
+
+---
+
+## Next steps
+
+- **[Virtual Keys](./virtual-keys)** - Set up access control with virtual keys
+- **[Budget and Limits](./budget-and-limits)** - Configure budgets and rate limits
+- **[Routing](./routing)** - Route requests based on headers and other criteria
--- a/docs/features/governance/routing.mdx
+++ b/docs/features/governance/routing.mdx
@@ -0,0 +1,337 @@
+---
+title: "Routing"
+description: "Direct requests to specific AI models, providers, and keys using Virtual Keys."
+icon: "arrow-progress"
+---
+
+<Info>
+**Looking for comprehensive provider routing documentation?**
+
+For a detailed guide covering governance-based routing, adaptive load balancing, Model Catalog, and how they interact, see the [**Provider Routing Guide**](/providers/provider-routing).
+
+This page focuses specifically on configuring governance routing via Virtual Keys.
+</Info>
+
+## Overview
+
+Bifrost's governance-based routing capabilities offer granular control over how requests are directed to different AI models and providers through Virtual Key configuration. By configuring routing rules on a Virtual Key, you can enforce which providers and models are accessible, implement weighted load balancing strategies, create automatic fallbacks, and restrict access to specific provider API keys.
+
+Typical use cases include:
+
+- **Resilience & Failover**: Automatically fall back to a secondary provider if the primary one fails.
+- **Environment Separation**: Dedicate specific virtual keys to development, testing, and production environments with different provider and key access.
+- **Cost Management**: Route traffic to cheaper models or providers based on weights to optimize costs.
+- **Fine-grained Access Control**: Ensure that different teams or applications only use the models and API keys they are explicitly permitted to.
+
+## Provider/Model Restrictions
+
+Virtual Keys can be restricted to specific providers and models. When provider/model restrictions are configured, the VK can only access the designated providers and models, giving you fine-grained control over what different users or applications can use.
+
+**How It Works:**
+- **No Provider Configs** (default): VK **blocks all providers** (deny-by-default). You must add provider configurations to allow traffic.
+- **With Provider Configs**: VK limited to only the specified provider/models. Configured providers participate in weighted load balancing only if their `weight` is set to a numeric value, while providers with `weight: null` remain configured but are opted out of weighted selection.
+
+**Model Validation:**
+When you configure provider restrictions on a Virtual Key, Bifrost validates that the requested model is allowed for the selected provider:
+- **`allowed_models: ["*"]`**: Allow all models supported by the provider (uses the Model Catalog for validation).
+- **Empty `allowed_models`**: **Deny all** models (deny-by-default).
+- **Explicit model list**: Only those specific models are permitted.
+- **Model Catalog Sync**: On startup and provider updates, Bifrost calls each provider's list models API. If this fails, you'll see a warning: `{"level":"warn","message":"failed to list models for provider <name>: failed to execute HTTP request to provider API"}`
+
+<Note>
+**Cross-provider routing does NOT happen automatically**. For example, requests for `gpt-4o` will NOT be routed to Anthropic unless you explicitly add `"gpt-4o"` to Anthropic's `allowed_models` in the Virtual Key configuration. Each provider only handles models it actually supports (determined by the Model Catalog).
+</Note>
+
+## Weighted Load Balancing
+
+When you configure multiple providers on a Virtual Key, Bifrost automatically implements weighted load balancing. Each provider can be assigned a weight, and requests are distributed proportionally. The `weight` field is optional — omitting it (or setting it to `null`) excludes the provider from weighted selection while still allowing it to be used for direct `provider/model` requests or as a fallback.
+
+**Example Configuration:**
+```
+Virtual Key: vk-prod-main
+├── OpenAI
+│   ├── Allowed Models: [gpt-4o, gpt-4o-mini]  ← Explicit whitelist
+│   └── Weight: 0.2 (20% of traffic)
+└── Azure
+    ├── Allowed Models: [gpt-4o]  ← Explicit whitelist
+    └── Weight: 0.8 (80% of traffic)
+```
+
+**Load Balancing Behavior:**
+- For `gpt-4o`: 80% Azure, 20% OpenAI (both providers have it in allowed_models)
+- For `gpt-4o-mini`: 100% OpenAI (only OpenAI has it in allowed_models)
+- For `claude-3-sonnet`: ❌ Rejected (neither provider has it in allowed_models)
+
+**Usage:**
+To trigger weighted load balancing, send requests with just the model name:
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "x-bf-vk: vk-prod-main" \
+  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello!"}]}'
+```
+
+To bypass load balancing and target a specific provider:
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "x-bf-vk: vk-prod-main" \
+  -d '{"model": "openai/gpt-4o", "messages": [{"role": "user", "content": "Hello!"}]}'
+```
+
+<Info>
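+Weights are automatically normalized to sum to 1.0 across the providers available on the VK for the requested model. In the example above, only OpenAI allows `gpt-4o-mini`, so its weight of 0.2 normalizes to 1.0 and it receives all of that traffic.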
+</Info>
+
+**Example with Wildcard `allowed_models` (allow all via Model Catalog):**
+```json
+{
+  "provider_configs": [
+    {
+      "provider": "openai",
+      "allowed_models": ["*"],
+      "weight": 0.5
+    },
+    {
+      "provider": "anthropic",
+      "allowed_models": ["*"],
+      "weight": 0.5
+    }
+  ]
+}
+```
+With this configuration:
+- Request for `gpt-4o` → Routed to OpenAI (Model Catalog shows OpenAI supports this)
+- Request for `claude-3-sonnet` → Routed to Anthropic (Model Catalog shows Anthropic supports this)
+- Request for `gpt-4o` will NOT route to Anthropic (Model Catalog shows Anthropic doesn't support OpenAI models)
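+
+A provider can also be kept out of weighted selection. The fragment below is a sketch that assumes the same `provider_configs` shape as above, with `weight` set to `null` for Azure:
+
+```json
+{
+  "provider_configs": [
+    { "provider": "openai", "allowed_models": ["gpt-4o"], "weight": 1.0 },
+    { "provider": "azure", "allowed_models": ["gpt-4o"], "weight": null }
+  ]
+}
+```
+
+With this sketch, a plain `gpt-4o` request is weighted only across providers that have a numeric weight (here, OpenAI), while Azure stays available for direct `azure/gpt-4o` requests or as a fallback.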
+
+## Automatic Fallbacks
+
+When multiple providers are configured on a Virtual Key, Bifrost automatically creates fallback chains for resilience. This feature provides automatic failover without manual intervention.
+
+**How It Works:**
+- **Only activated when**: Your request has no existing `fallbacks` array in the request body
+- **Fallback creation**: Providers are sorted by weight (highest first) and added as fallbacks
+- **Respects existing fallbacks**: If you manually specify fallbacks, they are preserved
+
+**Example Request Flow:**
+1. Primary request goes to weighted-selected provider (e.g., Azure with 80% weight)
+2. If Azure fails, automatically retry with OpenAI
+3. Continue until success or all providers exhausted
+
+**Request with automatic fallbacks:**
+```bash
+# This request will get automatic fallbacks
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "x-bf-vk: vk-prod-main" \
+  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello!"}]}'
+```
+
+**Request with manual fallbacks (no automatic fallbacks added):**
+```bash
+# This request keeps your specified fallbacks
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "x-bf-vk: vk-prod-main" \
+  -d '{
+    "model": "gpt-4o", 
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "fallbacks": ["anthropic/claude-3-sonnet-20240229"]
+  }'
+```
+
+## Setting Provider/Model Routing
+
+<Tabs group="provider-model-restrictions">
+<Tab title="Web UI">
+
+1. Go to **Virtual Keys**
+2. Create or edit a virtual key
+
+![Virtual Key Provider/Model Restrictions](../../media/ui-virtual-key-routing.png)
+
+3. In the **Provider Configurations** section, add the provider you want to restrict the VK to
+4. **Allowed Models**:
+   - **Specify models**: Enter specific models (e.g., `["gpt-4o", "gpt-4o-mini"]`) to explicitly whitelist only those models
+   - **`["*"]`**: Allow all models (uses the Model Catalog for validation).
+   - **Leave blank**: Deny all models (deny-by-default).
+5. Optionally add a weight for this provider (numeric value for weighted load balancing, or leave blank to exclude from weighted routing while keeping the provider available for direct requests and fallbacks)
+6. Click **Save**
+</Tab>
+
+<Tab title="API">
+
+```bash
+curl -X PUT http://localhost:8080/api/governance/virtual-keys/{vk_id} \
+  -H "Content-Type: application/json" \
+  -d '{
+    "provider_configs": [
+      {
+        "provider": "openai",
+        "allowed_models": ["gpt-4o", "gpt-4o-mini"],
+        "weight": 0.2
+      },
+      {
+        "provider": "azure",
+        "allowed_models": ["gpt-4o"],
+        "weight": 0.8
+      }
+    ]
+  }'
+```
+
+</Tab>
+
+<Tab title="config.json">
+
+```json
+{
+  "governance": {
+    "virtual_keys": [
+      {
+        "id": "vk-prod-main",
+        "provider_configs": [
+          {
+            "provider": "openai",
+            "allowed_models": ["gpt-4o", "gpt-4o-mini"],
+            "weight": 0.2
+          },
+          {
+            "provider": "azure",
+            "allowed_models": ["gpt-4o"],
+            "weight": 0.8
+          }
+        ]
+      }
+    ]
+  }
+}
+```
+
+</Tab>
+
+</Tabs>
+
+## API Key Restrictions
+
+Virtual Keys can be restricted to use only specific provider API keys. When key restrictions are configured, the VK can only access those designated keys, providing fine-grained control over which API keys different users or applications can utilize.
+
+**How It Works:**
+- **No Restrictions** (`key_ids: ["*"]`): VK can use any available provider keys based on load balancing
+- **With Restrictions**: VK limited to only the specified key IDs, regardless of other available keys
+- **All Blocked** (`key_ids: []` or field omitted): VK cannot use any provider keys (deny-by-default)
+
+**Example Scenario:**
+```
+Available Provider Keys:
+├── key-prod-001 → sk-prod-key... (Production OpenAI key)
+├── key-dev-002  → sk-dev-key...  (Development OpenAI key)  
+└── key-test-003 → sk-test-key... (Testing OpenAI key)
+
+Virtual Key Restrictions:
+├── vk-prod-main
+│   ├── Allowed Models: [gpt-4o]
+│   └── Restricted Keys: [key-prod-001] ← ONLY production key
+├── vk-dev-main  
+│   ├── Allowed Models: [gpt-4o-mini]
+│   └── Restricted Keys: [key-dev-002, key-test-003] ← Dev + test keys
+└── vk-unrestricted
+    ├── Allowed Models: ["*"]  ← All models via catalog
+    └── Restricted Keys: ["*"] ← Can use ANY available key
+```
+
+**Request Behavior:**
+```bash
+# Production VK - will ONLY use key-prod-001
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "x-bf-vk: vk-prod-main" \
+  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello!"}]}'
+
+# Development VK - will load balance between key-dev-002 and key-test-003
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "x-bf-vk: vk-dev-main" \
+  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello!"}]}'
+
+# VK with key_ids: ["*"] - can use any available OpenAI key
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "x-bf-vk: vk-unrestricted" \
+  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello!"}]}'
+```
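+
+For reference, the `vk-unrestricted` key from the scenario might be declared in `config.json` roughly as follows. This is a sketch that assumes `key_ids` sits inside each `provider_configs` entry, as in the `config.json` tab below:
+
+```json
+{
+  "governance": {
+    "virtual_keys": [
+      {
+        "id": "vk-unrestricted",
+        "provider_configs": [
+          {
+            "provider": "openai",
+            "allowed_models": ["*"],
+            "key_ids": ["*"]
+          }
+        ]
+      }
+    ]
+  }
+}
+```
+
+Replacing `["*"]` with `[]`, or omitting `key_ids`, would deny every provider key for that VK.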
+
+**Setting API Key Restrictions:**
+
+<Tabs group="api-key-restrictions">
+<Tab title="Web UI">
+
+1. Go to **Virtual Keys**
+2. Create or edit a virtual key
+
+![Virtual Key API Key Restrictions](../../media/ui-virtual-key-keys-filter.png)
+
+3. In the **Allowed Keys** section, select the API keys you want to restrict the VK to
+4. Click **Save**
+
+</Tab>
+
+<Tab title="API">
+
+```bash
+curl -X PUT http://localhost:8080/api/governance/virtual-keys/{vk_id} \
+  -H "Content-Type: application/json" \
+  -d '{
+    "key_ids": ["key-prod-001"]
+  }'
+```
+
+</Tab>
+
+<Tab title="config.json">
+
+```json
+{
+  "governance": {
+    "virtual_keys": [
+      {
+        "id": "vk-prod-main",
+        "provider_configs": [
+          {
+            "provider": "openai",
+            "key_ids": [
+              "key-prod-001"
+            ]
+          }
+        ]
+      }
+    ]
+  }
+}
+```
+
+</Tab>
+
+</Tabs>
+
+**Use Cases:**
+- **Environment Separation** - Production VKs use production keys, dev VKs use dev keys
+- **Cost Control** - Different teams use keys with different billing accounts
+- **Access Control** - Restrict sensitive keys to specific VKs only
+- **Compliance** - Ensure certain workloads only use compliant/audited keys
+
+<Note>Model restrictions set on individual provider keys always apply, in addition to any provider/model or API key restrictions set on the Virtual Key.</Note>
+
+## Troubleshooting
+
+### Model Catalog Sync Failures
+
+If you see warnings like this in your Bifrost logs during startup or provider updates:
+```json
+{"level":"warn","time":"2026-01-13T14:18:53+05:30","message":"failed to list models for provider ollama: failed to execute HTTP request to provider API"}
+```
+
+**What this means:**
+- Bifrost attempted to call the provider's list models API to populate the Model Catalog
+- The request failed (network issue, provider unavailable, incorrect credentials, etc.)
+- If your Virtual Key has `allowed_models: []` (empty) for this provider, **all models will be denied**. Use `["*"]` to allow all models.
+
+**How to fix:**
+1. Check that the provider is correctly configured and accessible
+2. Verify network connectivity to the provider's API
+3. Ensure API credentials are valid
+4. Use `allowed_models: ["*"]` to allow all models, or specify an explicit list for critical providers
--- a/docs/features/governance/virtual-keys.mdx
+++ b/docs/features/governance/virtual-keys.mdx
@@ -0,0 +1,698 @@
+---
+title: "Virtual Keys"
+description: "Virtual keys are a way to manage access to your AI models."
+icon: "key"
+---
+
+## Overview
+
+Virtual Keys are the primary governance entity in Bifrost. Users and applications authenticate by sending a virtual key in one of the headers below; the key determines their access permissions, budgets, and rate limits.
+
+**Allowed Headers:**
+- `x-bf-vk` - Virtual key header, e.g., `sk-bf-*`
+- `Authorization` - Authorization header, e.g., `Bearer sk-bf-*` (OpenAI style)
+- `x-api-key` - API key header, e.g., `sk-bf-*` (Anthropic style)
+- `x-goog-api-key` - API key header, e.g., `sk-bf-*` (Google Gemini style)
+
+<Note>Legacy virtual keys (without the `sk-bf-*` prefix) are supported only through the `x-bf-vk` header.</Note>
+
+<Info>You can also use `Authorization`, `x-api-key` and `x-goog-api-key` headers to pass direct keys to the provider. Read more about it in [Direct Key Bypass](../keys-management#direct-key-bypass).</Info>
+
+**Key Features:**
+- **Access Control** - Model and provider filtering
+- **Cost Management** - Independent budgets (checked along with team/customer budgets if attached)
+- **Rate Limiting** - Token and request-based throttling (VK-level only)
+- **Key Restrictions** - Limit VK to specific provider API keys (if configured, VK can only use those keys)
+- **Exclusive Attachment** - Belongs to either one team OR one customer OR neither (mutually exclusive)
+- **Active/Inactive Status** - Enable/disable access instantly
+
+## Configuration
+
+<Tabs group="config-method">
+<Tab title="Web UI">
+
+1. Go to **Virtual Keys**
+2. Click the **Add Virtual Key** button
+
+![Virtual Key Creation](../../media/ui-virtual-key.png)
+
+**Budget Settings:**
+- **Max Limit**: Dollar amount (e.g., `10.50`)
+- **Reset Duration**: `1m`, `1h`, `1d`, `1w`, `1M`, `1Y`
+- **Calendar aligned** (optional): When enabled, the budget resets at calendar boundaries in UTC (day/week/month/year) instead of on a rolling window. Only applies to day/week/month/year periods. See [Budget and Limits](./budget-and-limits#calendar-aligned-budgets).
+
+**Rate Limits:**
+- **Token Limit**: Max tokens per period
+- **Request Limit**: Max requests per period
+- **Reset Duration**: Reset frequency for each limit
+
+**Associations:**
+- **Team**: Assign to existing team (mutually exclusive with customer)
+- **Customer**: Assign to existing customer (mutually exclusive with team)
+
+3. Click **Create Virtual Key**
+
+</Tab>
+<Tab title="API">
+
+**Create Virtual Key (attached to team):**
+```bash
+curl -X POST http://localhost:8080/api/governance/virtual-keys \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "Engineering Team API",
+    "description": "Main API key for engineering team",
+    "provider_configs": [
+      {
+        "provider": "openai",
+        "weight": 0.5,
+        "allowed_models": ["gpt-4o-mini"]
+      },
+      {
+        "provider": "anthropic",
+        "weight": 0.5,
+        "allowed_models": ["claude-3-sonnet-20240229"]
+      }
+    ],
+    "team_id": "team-eng-001",
+    "budget": {
+      "max_limit": 100.00,
+      "reset_duration": "1M"
+    },
+    "rate_limit": {
+      "token_max_limit": 10000,
+      "token_reset_duration": "1h",
+      "request_max_limit": 100,
+      "request_reset_duration": "1m"
+    },
+    "key_ids": ["8c52039e-38c6-48b2-8016-0bd884b7befb"],
+    "is_active": true
+  }'
+```
+
+**Create Virtual Key (directly attached to customer):**
+```bash
+curl -X POST http://localhost:8080/api/governance/virtual-keys \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "Executive API Key",
+    "description": "Direct customer-level API access",
+    "provider_configs": [
+      {
+        "provider": "openai",
+        "weight": 0.5,
+        "allowed_models": ["gpt-4o"]
+      },
+      {
+        "provider": "anthropic",
+        "weight": 0.5,
+        "allowed_models": ["claude-3-opus-20240229"]
+      }
+    ],
+    "customer_id": "customer-acme-corp",
+    "budget": {
+      "max_limit": 500.00,
+      "reset_duration": "1M"
+    },
+    "is_active": true
+  }'
+```
+
+> **Note**: 
+> - `team_id` and `customer_id` are mutually exclusive - a VK can only belong to one team OR one customer, not both.
+> - `key_ids` restricts the VK to only use those specific provider API keys. Use `["*"]` to allow access to all available keys. An empty array `[]` or omitting the field entirely denies all keys.
+
+**Update Virtual Key:**
+```bash
+curl -X PUT http://localhost:8080/api/governance/virtual-keys/{vk_id} \
+  -H "Content-Type: application/json" \
+  -d '{
+    "description": "Updated description",
+    "budget": {
+      "max_limit": 150.00,
+      "reset_duration": "1M"
+    }
+  }'
+```
+
+**Get Virtual Keys:**
+```bash
+# List all virtual keys
+curl http://localhost:8080/api/governance/virtual-keys
+
+# Get specific virtual key
+curl http://localhost:8080/api/governance/virtual-keys/{vk_id}
+```
+
+**Delete Virtual Key:**
+```bash
+curl -X DELETE http://localhost:8080/api/governance/virtual-keys/{vk_id}
+```
+
+</Tab>
+<Tab title="config.json">
+
+```json
+{
+  "client": {
+    "enforce_auth_on_inference": true
+  },
+  "governance": {
+    "virtual_keys": [
+      {
+        "id": "vk-001",
+        "name": "Engineering Team API",
+        "value": "sk-bf-*",
+        "description": "Main API key for engineering team",
+        "is_active": true,
+        "provider_configs": [
+          {
+            "provider": "openai",
+            "weight": 0.5,
+            "allowed_models": ["gpt-4o-mini"],
+            "key_ids": ["openai-primary"]
+          },
+          {
+            "provider": "anthropic",
+            "weight": 0.5,
+            "allowed_models": ["claude-3-sonnet-20240229"]
+          }
+        ],
+        "team_id": "team-eng-001",
+        "rate_limit_id": "rate-limit-eng-vk"
+      },
+      {
+        "id": "vk-002",
+        "name": "Executive API Key", 
+        "value": "vk-executive-direct",
+        "description": "Direct customer-level API access",
+        "is_active": true,
+        "provider_configs": [
+          {
+            "provider": "openai",
+            "weight": 0.5,
+            "allowed_models": ["gpt-4o"]
+          },
+          {
+            "provider": "anthropic",
+            "weight": 0.5,
+            "allowed_models": ["claude-3-opus-20240229"]
+          }
+        ],
+        "customer_id": "customer-acme-corp"
+      }
+    ],
+    "budgets": [
+      {
+        "id": "budget-eng-vk",
+        "virtual_key_id": "vk-001",
+        "max_limit": 100.00,
+        "reset_duration": "1M",
+        "current_usage": 0.0,
+        "last_reset": "2025-01-01T00:00:00Z"
+      },
+      {
+        "id": "budget-exec-vk",
+        "virtual_key_id": "vk-002",
+        "max_limit": 500.00,
+        "reset_duration": "1M",
+        "current_usage": 0.0,
+        "last_reset": "2025-01-01T00:00:00Z"
+      }
+    ],
+    "rate_limits": [
+      {
+        "id": "rate-limit-eng-vk", 
+        "token_max_limit": 10000,
+        "token_reset_duration": "1h",
+        "token_current_usage": 0,
+        "token_last_reset": "2025-01-01T00:00:00Z",
+        "request_max_limit": 100,
+        "request_reset_duration": "1m",
+        "request_current_usage": 0,
+        "request_last_reset": "2025-01-01T00:00:00Z"
+      }
+    ]
+  }
+}
+```
+
+</Tab>
+</Tabs>
+
+## User Groups
+
+### Teams
+
+Teams provide organizational grouping for virtual keys with department-level budget management. Teams can belong to one customer and have their own independent budget allocation.
+
+**Key Features:**
+- **Organizational Structure** - Group multiple virtual keys
+- **Independent Budgets** - Department-level cost control (separate from customer budgets)
+- **Customer Association** - Can belong to one customer (optional)
+- **No Rate Limits** - Teams cannot have rate limits (VK-level only)
+
+**Configuration**
+
+<Tabs group="config-method">
+<Tab title="Web UI">
+
+1. Go to **Users & Groups** → **Teams**
+
+2. Click on **Add Team** button
+
+![Team Creation](../../media/ui-create-teams.png)
+
+Fill in the form and click the **Create Team** button.
+
+3. **Assign Virtual Keys to Team**
+   - Go to **Virtual Keys** page
+   - Edit the virtual key and assign it to the team
+   - Click on **Save** button
+
+</Tab>
+<Tab title="API">
+
+**Create Team:**
+```bash
+curl -X POST http://localhost:8080/api/governance/teams \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "Engineering Team",
+    "customer_id": "customer-acme-corp",
+    "budget": {
+      "max_limit": 500.00,
+      "reset_duration": "1M"
+    }
+  }'
+```
+
+**Update Team:**
+```bash
+curl -X PUT http://localhost:8080/api/governance/teams/{team_id} \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "Updated Engineering Team",
+    "budget": {
+      "max_limit": 750.00,
+      "reset_duration": "1M"
+    }
+  }'
+```
+
+**Get Teams:**
+```bash
+# List all teams
+curl http://localhost:8080/api/governance/teams
+
+# Get specific team
+curl http://localhost:8080/api/governance/teams/{team_id}
+```
+
+**Delete Team:**
+```bash
+curl -X DELETE http://localhost:8080/api/governance/teams/{team_id}
+```
+
+</Tab>
+<Tab title="config.json">
+
+```json
+{
+  "governance": {
+    "teams": [
+      {
+        "id": "team-eng-001",
+        "name": "Engineering Team",
+        "customer_id": "customer-acme-corp",
+        "budget_id": "budget-team-eng"
+      },
+      {
+        "id": "team-sales-001", 
+        "name": "Sales Team",
+        "customer_id": "customer-acme-corp",
+        "budget_id": "budget-team-sales"
+      }
+    ],
+    "budgets": [
+      {
+        "id": "budget-team-eng",
+        "max_limit": 500.00,
+        "reset_duration": "1M",
+        "current_usage": 0.0,
+        "last_reset": "2025-01-01T00:00:00Z"
+      },
+      {
+        "id": "budget-team-sales",
+        "max_limit": 250.00,
+        "reset_duration": "1M", 
+        "current_usage": 0.0,
+        "last_reset": "2025-01-01T00:00:00Z"
+      }
+    ]
+  }
+}
+```
+
+</Tab>
+</Tabs>
+
+### Customers
+
+Customers represent the highest level in the governance hierarchy, typically corresponding to organizations or major business units. They provide top-level budget control and organizational structure.
+
+**Key Features:**
+- **Top-Level Organization** - Highest hierarchy level
+- **Independent Budgets** - Organization-wide cost control (separate from team/VK budgets)
+- **Team Management** - Contains multiple teams and direct VKs
+- **No Rate Limits** - Customers cannot have rate limits (VK-level only)
+
+**Configuration**
+
+<Tabs group="config-method">
+<Tab title="Web UI">
+
+1. Go to **Users & Groups** → **Customers**
+
+2. Click the **Add Customer** button
+
+![Customer Creation](../../media/ui-create-customer.png)
+
+Fill in the form and click **Create Customer**.
+
+3. **Assign Teams to Customer**
+   - Go to the **Teams** page
+   - Edit the team and assign it to the customer
+   - Click **Save**
+
+4. **Assign Virtual Keys to Customer**
+   - Go to the **Virtual Keys** page
+   - Edit the virtual key and assign it to the customer
+   - Click **Save**
+
+</Tab>
+<Tab title="API">
+
+**Create Customer:**
+```bash
+curl -X POST http://localhost:8080/api/governance/customers \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "Acme Corporation",
+    "budget": {
+      "max_limit": 2000.00,
+      "reset_duration": "1M"
+    }
+  }'
+```
+
+**Update Customer:**
+```bash
+curl -X PUT http://localhost:8080/api/governance/customers/{customer_id} \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "Acme Corp (Updated)",
+    "budget": {
+      "max_limit": 2500.00,
+      "reset_duration": "1M"
+    }
+  }'
+```
+
+**Get Customers:**
+```bash
+# List all customers
+curl http://localhost:8080/api/governance/customers
+
+# Get specific customer
+curl http://localhost:8080/api/governance/customers/{customer_id}
+```
+
+**Delete Customer:**
+```bash
+curl -X DELETE http://localhost:8080/api/governance/customers/{customer_id}
+```
+
+</Tab>
+<Tab title="config.json">
+
+```json
+{
+  "governance": {
+    "customers": [
+      {
+        "id": "customer-acme-corp",
+        "name": "Acme Corporation",
+        "budget_id": "budget-customer-acme"
+      },
+      {
+        "id": "customer-beta-inc",
+        "name": "Beta Inc",
+        "budget_id": "budget-customer-beta"
+      }
+    ],
+    "budgets": [
+      {
+        "id": "budget-customer-acme",
+        "max_limit": 2000.00,
+        "reset_duration": "1M",
+        "current_usage": 0.0,
+        "last_reset": "2025-01-01T00:00:00Z"
+      },
+      {
+        "id": "budget-customer-beta",
+        "max_limit": 1500.00,
+        "reset_duration": "1M",
+        "current_usage": 0.0,
+        "last_reset": "2025-01-01T00:00:00Z"
+      }
+    ]
+  }
+}
+```
+
+</Tab>
+</Tabs>
+
+## Features
+
+- **[Budget and Limits](./budget-and-limits)** - Budget management, cost control, and rate limiting using virtual keys
+- **[Routing](./routing)** - Route requests to the appropriate providers/models and restrict API keys using virtual keys
+- **[MCP Tool Filtering](./mcp-tools)** - Manage MCP clients/tools for virtual keys
+
+
+## Usage
+
+### Making Virtual Keys Mandatory
+
+To apply governance to a request, include the virtual key header:
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "x-bf-vk: sk-bf-*" \
+  -d '{
+    "model": "gpt-4o-mini",
+    "messages": [{"role": "user", "content": "Hello!"}]
+  }'
+```
+
+By default, governance is optional: if the virtual key header is absent, the request is allowed, but no governance checks or routing rules are applied. To make governance mandatory, enforce the virtual key header.
+
+<Tabs group="enforce-governance-header">
+<Tab title="Web UI">
+
+1. Go to **Config** → **Security**
+
+2. Check the **Enforce Virtual Keys** checkbox
+
+![Enforce Virtual Keys](../../media/ui-enforce-virtual-keys.png)
+
+</Tab>
+<Tab title="API">
+
+```bash
+curl -X PUT http://localhost:8080/api/config \
+  -H "Content-Type: application/json" \
+  -d '{
+    "client_config": {
+      "enforce_auth_on_inference": true
+    }
+  }'
+```
+
+</Tab>
+<Tab title="config.json">
+
+```json
+{
+  "client": {
+    "enforce_auth_on_inference": true
+  }
+}
+```
+
+</Tab>
+</Tabs>
+
+When the governance header is enforced, requests without the `x-bf-vk` header are rejected.
+
+### Authentication and Virtual Keys
+
+Virtual keys and HTTP authentication are **independent layers** that can work together:
+
+| Layer | Purpose | Headers |
+|-------|---------|---------|
+| **Authentication** | Validates user identity | `Authorization: Basic/Bearer <credentials>` |
+| **Virtual Keys** | Request routing and governance | `x-bf-vk`, `Authorization`[^1], `x-api-key`, `x-goog-api-key` |
+
+[^1]: Authorization can carry virtual keys only when auth is disabled (`disable_auth_on_inference: true`). When auth is enabled, Authorization is consumed by authentication and cannot be used for virtual keys.
+
+**When `disable_auth_on_inference: true` (auth disabled):**
+
+Virtual keys can be passed via any supported header without additional authentication:
+
+```bash
+# Using x-bf-vk header
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "x-bf-vk: <VIRTUAL_KEY>" \
+  -H "Content-Type: application/json" \
+  -d '{"model": "gpt-4o-mini", "messages": [...]}'
+
+# Using Authorization header (OpenAI style)
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Authorization: Bearer <VIRTUAL_KEY>" \
+  -H "Content-Type: application/json" \
+  -d '{"model": "gpt-4o-mini", "messages": [...]}'
+```
+
+**When `disable_auth_on_inference: false` (auth enabled):**
+
+You must provide both authentication credentials AND the virtual key. Use `x-bf-vk` for the virtual key since the `Authorization` header is used for authentication:
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Authorization: Basic <base64-credentials>" \
+  -H "x-bf-vk: <VIRTUAL_KEY>" \
+  -H "Content-Type: application/json" \
+  -d '{"model": "gpt-4o-mini", "messages": [...]}'
+```
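+
+The two modes above can be sketched from a Go client. This is a minimal illustration, not part of any Bifrost SDK: `newInferenceRequest` and its parameters are hypothetical names, and the request is built but never sent. It always places the virtual key in `x-bf-vk`, since that header works in both modes, and adds Basic credentials only when auth is enabled.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/http"
+	"strings"
+)
+
+// newInferenceRequest is a hypothetical helper that builds (but does not send)
+// a chat completion request. user and pass are only used when authEnabled is true.
+func newInferenceRequest(baseURL, virtualKey string, authEnabled bool, user, pass string) (*http.Request, error) {
+	body := strings.NewReader(`{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello!"}]}`)
+	req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", body)
+	if err != nil {
+		return nil, err
+	}
+	req.Header.Set("Content-Type", "application/json")
+	// x-bf-vk is accepted in both modes, so the virtual key always goes there.
+	req.Header.Set("x-bf-vk", virtualKey)
+	if authEnabled {
+		// Sets "Authorization: Basic <base64(user:pass)>".
+		req.SetBasicAuth(user, pass)
+	}
+	return req, nil
+}
+
+func main() {
+	req, err := newInferenceRequest("http://localhost:8080", "<VIRTUAL_KEY>", true, "admin", "secret")
+	if err != nil {
+		fmt.Println("build failed:", err)
+		return
+	}
+	fmt.Println(req.Method, req.URL.String())
+	fmt.Println("has Authorization header:", req.Header.Get("Authorization") != "")
+}
+```
+
+Keeping the virtual key out of `Authorization` means the same client code keeps working if `disable_auth_on_inference` is later switched off.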
+
+**Configuring `disable_auth_on_inference`:**
+
+<Tabs group="config-method">
+<Tab title="Web UI">
+
+1. Go to **Config** → **Security**
+2. Toggle **Disable Auth on Inference** to enable/disable
+
+![Disable Auth on Inference](../../media/ui-disable-auth-on-inference.png)
+
+</Tab>
+<Tab title="API">
+
+```bash
+curl -X PUT http://localhost:8080/api/config \
+  -H "Content-Type: application/json" \
+  -d '{
+    "auth_config": {
+      "disable_auth_on_inference": true
+    }
+  }'
+```
+
+</Tab>
+<Tab title="config.json">
+
+```json
+{
+  "auth_config": {
+    "is_enabled": true,
+    "disable_auth_on_inference": true
+  }
+}
+```
+
+</Tab>
+</Tabs>
+
+### Error Responses
+
+- Virtual Key Missing (400)
+```json
+{
+  "error": {
+    "type": "virtual_key_required",
+    "message": "virtual key is missing in headers"
+  }
+}
+```
+
+- Virtual Key Blocked (403)
+```json
+{
+  "error": {
+    "type": "virtual_key_blocked", 
+    "message": "Virtual key is inactive"
+  }
+}
+```
+
+- Rate Limit Exceeded (429)
+```json
+{
+  "error": {
+    "type": "rate_limited",
+    "message": "Rate limits exceeded: [token limit exceeded (1500/1000, resets every 1h)]"
+  }
+}
+```
+
+- Token Limit Exceeded (429)
+```json
+{
+  "error": {
+    "type": "token_limited",
+    "message": "Rate limits exceeded: [token limit exceeded (1500/1000, resets every 1h)]"
+  }
+}
+```
+
+- Request Limit Exceeded (429)
+```json
+{
+  "error": {
+    "type": "request_limited", 
+    "message": "Rate limits exceeded: [request limit exceeded (101/100, resets every 1m)]"
+  }
+}
+```
+
+- Budget Exceeded (402)
+```json
+{
+  "error": {
+    "type": "budget_exceeded",
+    "message": "Budget exceeded: VK budget exceeded: 105.50 > 100.00 dollars"
+  }
+}
+```
+
+- Model Not Allowed (403)
+```json
+{
+  "error": {
+    "type": "model_blocked",
+    "message": "Model 'gpt-4o' is not allowed for this virtual key"
+  }
+}
+```
+
+- Provider Not Allowed (403)
+```json
+{
+  "error": {
+    "type": "provider_blocked",
+    "message": "Provider 'anthropic' is not allowed for this virtual key"
+  }
+}
+```
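+
+A client can branch on the `error.type` field shown in the payloads above. The sketch below is an assumption-laden illustration: `governanceError` and `classify` are hypothetical names, and treating the three limit types as retryable after the reset window is a client-side policy choice, not documented gateway behavior.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// governanceError mirrors the error envelope shown in the examples above.
+type governanceError struct {
+	Error struct {
+		Type    string `json:"type"`
+		Message string `json:"message"`
+	} `json:"error"`
+}
+
+// classify decodes an error body and reports whether waiting for the limit
+// to reset could help (assumed policy), as opposed to a configuration change.
+func classify(body []byte) (string, bool, error) {
+	var ge governanceError
+	if err := json.Unmarshal(body, &ge); err != nil {
+		return "", false, err
+	}
+	switch ge.Error.Type {
+	case "rate_limited", "token_limited", "request_limited":
+		return ge.Error.Type, true, nil
+	default:
+		// budget_exceeded, model_blocked, provider_blocked, and so on.
+		return ge.Error.Type, false, nil
+	}
+}
+
+func main() {
+	body := []byte(`{"error": {"type": "request_limited", "message": "Rate limits exceeded"}}`)
+	errType, retryable, err := classify(body)
+	if err != nil {
+		fmt.Println("decode failed:", err)
+		return
+	}
+	fmt.Println(errType, retryable)
+}
+```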
--- a/docs/features/keys-management.mdx
+++ b/docs/features/keys-management.mdx
@@ -0,0 +1,338 @@
+---
+title: "Load Balance"
+description: "Intelligent API key management with weighted load balancing, model-specific filtering, and automatic failover. Distribute traffic across multiple keys for optimal performance and reliability."
+icon: "scale-balanced"
+---
+
+## Smart Key Distribution
+
+Bifrost's key management system goes beyond simple API key storage. It provides intelligent load balancing, model-specific key filtering, and weighted distribution to optimize performance and manage costs across multiple API keys.
+
+When you configure multiple keys for a provider, Bifrost automatically distributes requests using sophisticated selection algorithms that consider key weights, model compatibility, and deployment mappings.
+
+## How Key Selection Works
+
+Bifrost follows a precise selection process for every request:
+
+1. **Context Override Check**: First checks if a key is explicitly provided in context (bypassing management)
+2. **Provider Key Lookup**: Retrieves all configured keys for the requested provider
+3. **Model Filtering**: Filters keys that support the requested model (respecting `models` allowlists and `blacklisted_models` denylists)
+4. **Deployment Validation**: For Azure/Bedrock, validates deployment mappings
+5. **Weighted Selection**: Uses weighted random selection among eligible keys
+
+This ensures optimal key usage while respecting your configuration constraints.
+
+## Implementation Examples
+
+<Tabs group="keys-management">
+
+<Tab title="Gateway">
+
+```bash
+# 1. Create or ensure the provider exists
+curl -X POST http://localhost:8080/api/providers \
+  -H "Content-Type: application/json" \
+  -d '{ "provider": "openai" }'
+
+# 2. Add keys individually via the dedicated keys API
+curl -X POST http://localhost:8080/api/providers/openai/keys \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "openai-key-1",
+    "value": "env.OPENAI_API_KEY_1",
+    "models": ["gpt-4o", "gpt-4o-mini"],
+    "weight": 0.7
+  }'
+
+curl -X POST http://localhost:8080/api/providers/openai/keys \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "openai-key-2",
+    "value": "env.OPENAI_API_KEY_2",
+    "models": ["*"],
+    "weight": 0.3
+  }'
+
+# Regular request (uses weighted key selection)
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openai/gpt-4o-mini",
+    "messages": [{"role": "user", "content": "Hello!"}]
+  }'
+
+# Request with direct API key (bypasses key management)
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer sk-your-direct-api-key" \
+  -d '{
+    "model": "openai/gpt-4o-mini", 
+    "messages": [{"role": "user", "content": "Hello!"}]
+  }'
+```
+
+</Tab>
+
+<Tab title="Go SDK">
+
+```go
+package main
+
+import (
+    "context"
+    "fmt"
+
+    "github.com/maximhq/bifrost/core/schemas"
+)
+
+// MyAccount is a placeholder for your own account implementation.
+type MyAccount struct{}
+
+func (a *MyAccount) GetKeysForProvider(ctx *context.Context, provider schemas.ModelProvider) ([]schemas.Key, error) {
+    switch provider {
+    case schemas.OpenAI:
+        return []schemas.Key{
+            {
+                ID:     "primary-key",
+                Value:  "env.OPENAI_API_KEY_1",
+                Models: []string{"gpt-4o", "gpt-4o-mini"}, // Model whitelist
+                Weight: 0.7,                               // 70% of traffic
+            },
+            {
+                ID:     "secondary-key",
+                Value:  "env.OPENAI_API_KEY_2",
+                Models: []string{"*"}, // ["*"] = supports all models (empty slice denies all in v1.5.0+)
+                Weight: 0.3,           // 30% of traffic
+            },
+        }, nil
+    case schemas.Anthropic:
+        return []schemas.Key{
+            {
+                Value:  "env.ANTHROPIC_API_KEY",
+                Models: []string{"claude-3-5-sonnet-20241022"},
+                Weight: 1.0,
+            },
+        }, nil
+    }
+    return nil, fmt.Errorf("provider %s not supported", provider)
+}
+
+// Using with explicit context key (bypasses key management).
+// client and messages are assumed to be initialized elsewhere in your application.
+func makeRequestWithDirectKey() {
+    ctx := context.Background()
+
+    // Direct key bypasses all key management
+    directKey := schemas.Key{
+        Value:  "sk-direct-api-key",
+        Weight: 1.0,
+    }
+    ctx = context.WithValue(ctx, schemas.BifrostContextKeyDirectKey, directKey)
+
+    response, err := client.ChatCompletionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), &schemas.BifrostChatRequest{
+        Provider: schemas.OpenAI,
+        Model:    "gpt-4o-mini",
+        Input:    messages,
+    })
+    if err != nil {
+        fmt.Println("request failed:", err)
+        return
+    }
+    fmt.Println("response:", response)
+}
+```
+
+</Tab>
+
+</Tabs>
+
+## Weighted Load Balancing
+
+Bifrost uses weighted random selection to distribute requests across multiple keys. This allows you to:
+
+**Control Traffic Distribution:**
+- Assign higher weights to premium keys with better rate limits
+- Balance between production and backup keys
+- Gradually migrate traffic during key rotation
+
+**Weight Calculation Example:**
+```
+Key 1: Weight 0.7 (70% probability)
+Key 2: Weight 0.3 (30% probability)
+Total Weight: 1.0
+
+Random selection ensures statistical distribution over time
+```
+
+**Algorithm Details:**
+1. Calculate total weight of all eligible keys
+2. Generate random number between 0 and total weight
+3. Select key based on cumulative weight ranges
+4. If the selected key fails, Bifrost automatically falls back to the next available key
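+
+Steps 1 to 3 of the algorithm can be sketched as a cumulative-weight walk. This is an illustrative sketch under stated assumptions, not Bifrost's actual implementation: the `Key` type, `pickAt`, and `selectWeighted` are hypothetical names, and the fallback in step 4 is left out.
+
+```go
+package main
+
+import (
+	"fmt"
+	"math/rand"
+)
+
+// Key is a hypothetical, simplified stand-in for a configured provider key.
+type Key struct {
+	Name   string
+	Weight float64
+}
+
+// pickAt returns the key whose cumulative weight range contains r,
+// where r is expected to lie in [0, total weight).
+func pickAt(keys []Key, r float64) (Key, bool) {
+	cumulative := 0.0
+	for _, k := range keys {
+		cumulative += k.Weight
+		if r < cumulative {
+			return k, true
+		}
+	}
+	return Key{}, false
+}
+
+// selectWeighted sums the weights, draws a random point in [0, total),
+// and maps that point to a key.
+func selectWeighted(keys []Key) (Key, bool) {
+	total := 0.0
+	for _, k := range keys {
+		total += k.Weight
+	}
+	if total <= 0 {
+		return Key{}, false
+	}
+	return pickAt(keys, rand.Float64()*total)
+}
+
+func main() {
+	keys := []Key{{"openai-key-1", 0.7}, {"openai-key-2", 0.3}}
+	counts := map[string]int{}
+	for i := 0; i < 10000; i++ {
+		if k, ok := selectWeighted(keys); ok {
+			counts[k.Name]++
+		}
+	}
+	// Roughly a 70/30 split; exact counts vary from run to run.
+	fmt.Println(counts)
+}
+```
+
+Because the random point is scaled by the total weight, the weights do not need to sum to 1.0; only their ratios matter in this sketch.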
+
+## Model Whitelisting and Filtering
+
+Keys can be restricted to specific models for access control and cost management:
+
+**Model Filtering Logic:**
+- **Empty `models` array (`[]`)**: Denies ALL models (deny-by-default, v1.5.0+) — use `["*"]` to allow all
+- **Populated `models` array**: Key only supports listed models
+- **`blacklisted_models`**: Optional per-key denylist. If non-empty and the requested model appears in it, the key is excluded—even if that model is also in `models` (denylist wins over the allowlist)
+- **Model mismatch**: Key is excluded from selection for that request
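+
+The filtering rules above can be expressed as a small predicate. This is a sketch under assumptions rather than Bifrost's code: `KeyConfig` and `supportsModel` are hypothetical names, and treating `"*"` as a wildcard even when it appears next to other entries is an assumption.
+
+```go
+package main
+
+import "fmt"
+
+// KeyConfig is a hypothetical, simplified view of a key's model settings.
+type KeyConfig struct {
+	Name              string
+	Models            []string // allowlist; ["*"] allows every model, empty denies all
+	BlacklistedModels []string // denylist; takes precedence over Models
+}
+
+// contains reports whether list includes s.
+func contains(list []string, s string) bool {
+	for _, item := range list {
+		if item == s {
+			return true
+		}
+	}
+	return false
+}
+
+// supportsModel checks the denylist first, then the allowlist.
+func supportsModel(k KeyConfig, model string) bool {
+	if contains(k.BlacklistedModels, model) {
+		return false // denylist wins over the allowlist
+	}
+	if contains(k.Models, "*") {
+		return true
+	}
+	return contains(k.Models, model) // an empty allowlist denies everything
+}
+
+func main() {
+	keys := []KeyConfig{
+		{Name: "premium", Models: []string{"gpt-4o", "o1-preview"}},
+		{Name: "shared", Models: []string{"*"}, BlacklistedModels: []string{"gpt-4o"}},
+		{Name: "empty"},
+	}
+	for _, k := range keys {
+		fmt.Printf("%s supports gpt-4o: %v\n", k.Name, supportsModel(k, "gpt-4o"))
+	}
+}
+```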
+
+**Use Cases:**
+- **Premium Models**: Dedicated keys for expensive models (GPT-4, Claude-3)
+- **Team Separation**: Different keys for different teams or projects
+- **Cost Control**: Restrict access to specific model tiers
+- **Compliance**: Separate keys for different security requirements
+- **Denylist**: Block specific models on a key
+
+**Example Model Restrictions:**
+
+Each key is created individually via `POST /api/providers/{provider}/keys`:
+
+```json
+// Premium-only key
+{
+  "name": "openai-pre-key-1",
+  "value": "premium-key",
+  "models": ["gpt-4o", "o1-preview"],
+  "weight": 1.0
+}
+
+// Standard-only key
+{
+  "name": "openai-std-key-1",
+  "value": "standard-key",
+  "models": ["gpt-4o-mini", "gpt-3.5-turbo"],
+  "weight": 1.0
+}
+
+// Shared key with denylist (allows every model except gpt-5)
+{
+  "name": "openai-shared-key",
+  "value": "env.OPENAI_API_KEY",
+  "models": ["*"],
+  "blacklisted_models": ["gpt-5"],
+  "weight": 1.0
+}
+```
+
+## Deployment Mapping (Azure & Bedrock)
+
+For cloud providers with deployment-based routing, Bifrost validates deployment availability:
+
+**Azure:**
+- Keys must have deployment mappings for specific models
+- Deployment name maps to actual Azure deployment identifier
+- Missing deployment excludes key from selection
+
+**AWS Bedrock:**
+- Supports model profiles and direct model access
+- Deployment mappings enable inference profile routing
+- ARN configuration determines URL formation
+
+**Deployment Validation Process:**
+1. Check if provider uses deployments (Azure/Bedrock)
+2. Verify deployment exists for requested model
+3. Exclude keys without proper deployment mapping
+4. Continue with standard weighted selection
+
+## Custom Key Usage (By Name or ID)
+
+Bifrost supports referencing a stored provider key by name or by ID instead of sending the raw secret. This can be useful when you want callers to reference logical key names or stable IDs and let the gateway resolve the actual secret from configured provider keys.
+
+**When both are provided, ID takes priority over name.**
+
+### By ID
+
+- Header: send `x-bf-api-key-id: <key-id>` on the request. The gateway will look up the key with that ID.
+- Context (Go SDK):
+
+```go
+ctx := context.Background()
+ctx = context.WithValue(ctx, schemas.BifrostContextKeyAPIKeyID, "key-uuid-1234")
+```
+
+### By Name
+
+- Header: send `x-bf-api-key: <key-name>` on the request. The gateway will look up the named key and use its secret for the upstream provider call.
+- Context (Go SDK):
+
+```go
+ctx := context.Background()
+ctx = context.WithValue(ctx, schemas.BifrostContextKeyAPIKeyName, "openai-key-1")
+```
+
+Note: Both mechanisms reference a stored key (not the raw secret). The gateway resolves the key against configured provider keys and applies model allowlists, denylists, and deployment mapping. When an explicit key ID or name is supplied, weighted selection is bypassed and the referenced key is used directly.
+
+```bash
+# Example: request referencing a stored key name that doesn't exist
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "x-bf-api-key: non_existent_key" \
+  -d '{
+    "model": "anthropic/claude-haiku-4-5",
+    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
+  }'
+```
+
+Response (example):
+
+```json
+{"is_bifrost_error":false,"error":{"error":"no key found with name \"non_existent_key\" for provider: anthropic","message":"no key found with name \"non_existent_key\" for provider: anthropic"},"extra_fields":{"provider":"anthropic","model_requested":"claude-haiku-4-5","request_type":"chat_completion"}}
+```
+
+```bash
+# Example: request referencing a stored key name that exists, but no configured key supports the requested model
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "x-bf-api-key: key_with_model_disabled" \
+  -d '{
+    "model": "anthropic/claude-sonnet-4-5",
+    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
+  }'
+```
+
+Response (example):
+
+```json
+{"is_bifrost_error":false,"error":{"error":"no keys found that support model: claude-sonnet-4-5","message":"no keys found that support model: claude-sonnet-4-5"},"extra_fields":{"provider":"anthropic","model_requested":"claude-sonnet-4-5","request_type":"chat_completion"}}
+```
+
+Note: This is not weighted selection. By providing a specific key name, you explicitly tell Bifrost which stored key to use, so weighted distribution is bypassed. The examples above show the errors returned when a referenced key name cannot be resolved or when the referenced key does not support the requested model.
+
+## Direct Key Bypass
+
+For scenarios requiring explicit key control, Bifrost supports bypassing the entire key management system:
+
+**Go SDK Context Override:**
+Pass a key directly in the request context using `schemas.BifrostContextKeyDirectKey`. This completely bypasses provider key lookup and selection.
+
+**Gateway Header-based Keys:**
+Send API keys in `Authorization` (Bearer), `x-api-key` or `x-goog-api-key` headers. Requires `allow_direct_keys` setting to be enabled.
+
+**Enable Direct Keys:**
+
+<Tabs group="direct-keys-config">
+
+<Tab title="Web UI">
+
+![Web UI](../media/ui-config-direct-keys.png)
+
+1. Navigate to **Configuration** page
+2. Toggle **"Allow Direct Keys"** to enabled
+3. Save configuration
+
+</Tab>
+
+<Tab title="config.json">
+
+```json
+{
+  "client": {
+    "allow_direct_keys": true
+  }
+}
+```
+
+</Tab>
+
+</Tabs>
+
+<Note>If a Bifrost virtual key (`sk-bf-*`) is attached in the auth header, direct key bypass will be skipped.</Note>
+
+**When to Use Direct Keys:**
+- Per-user API key scenarios
+- External key management systems
+- Testing with specific keys
+- Debugging key-related issues
--- a/docs/features/litellm-compat.mdx
+++ b/docs/features/litellm-compat.mdx
@@ -0,0 +1,214 @@
+---
+title: "LiteLLM Compatibility"
+description: "Request and response transformations for LiteLLM proxy/SDK compatibility."
+icon: "train"
+---
+
+## Compatibility Transformations
+
+The LiteLLM compatibility plugin provides three transformations:
+
+1. **Text-to-Chat Conversion** - Automatically converts text completion requests to chat completion format for models that only support chat APIs
+2. **Chat-to-Responses Conversion** - Automatically converts chat completion requests to responses format for models that only support responses APIs
+3. **Drop Unsupported Params** - Automatically drops unsupported parameters if the model doesn't support them
+
+When a conversion is applied, responses include `extra_fields.converted_request_type: <transformed_request_type>`. If request parameters are dropped, their keys are listed in `extra_fields.dropped_compat_plugin_params`.
+
+---
+
+## 1. Text-to-Chat Conversion
+
+Many modern AI models (such as GPT-3.5-turbo, GPT-4, and Claude) only support the chat completion API and don't have native text completion endpoints. LiteLLM compatibility mode automatically handles this by:
+
+1. Checking if the model supports text completion natively (using the model catalog)
+2. If not supported, converting your text prompt to chat message format
+3. Calling the chat completion endpoint internally
+4. Transforming the response back to text completion format
+5. Returning content in `choices[0].text` instead of `choices[0].message.content`
+
+<Note>
+**Smart Conversion**: The conversion only happens when the model doesn't support text completions natively. If a model has native text completion support (like OpenAI's davinci models), Bifrost uses the text completion endpoint directly without any conversion.
+</Note>
+
+This allows you to use a unified text completion interface across all providers, even those that only support chat completions.
+
+### How It Works
+
+When LiteLLM compatibility is enabled and you make a text completion request, Bifrost first checks if the model supports text completion:
+
+```mermaid
+flowchart LR
+A[Text Completion Request] --> B{Model Supports Text Completion?}
+B -->|Yes| C[Call Text Completion API]
+B -->|No| D[Convert to Chat Message]
+D --> E[Call Chat Completion API]
+E --> F[Transform Response]
+C --> G[Text Completion Response]
+F --> G
+```
+
+**Request Transformation:**
+- Your text prompt becomes a user message: `{"role": "user", "content": "your prompt"}`
+- Parameters like `max_tokens`, `temperature`, `top_p` are mapped to chat equivalents
+- Fallbacks are preserved
+
+**Response Transformation:**
+- `choices[0].message.content` → `choices[0].text`
+- `object: "chat.completion"` → `object: "text_completion"`
+- Usage statistics and metadata are preserved
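+
+The request and response mappings above can be illustrated with a few lines of Go. The types below are hypothetical, trimmed-down shapes for illustration only, not Bifrost's internal types, and the sketch ignores parameters, fallbacks, and usage statistics.
+
+```go
+package main
+
+import "fmt"
+
+// Hypothetical, simplified request/response shapes.
+type ChatMessage struct {
+	Role    string
+	Content string
+}
+
+type ChatChoice struct {
+	Message ChatMessage
+}
+
+type ChatResponse struct {
+	Object  string
+	Choices []ChatChoice
+}
+
+type TextChoice struct {
+	Text string
+}
+
+type TextResponse struct {
+	Object  string
+	Choices []TextChoice
+}
+
+// promptToMessages wraps a text prompt in a single user message.
+func promptToMessages(prompt string) []ChatMessage {
+	return []ChatMessage{{Role: "user", Content: prompt}}
+}
+
+// chatToText maps choices[i].message.content to choices[i].text
+// and rewrites the object type.
+func chatToText(resp ChatResponse) TextResponse {
+	out := TextResponse{Object: "text_completion"}
+	for _, c := range resp.Choices {
+		out.Choices = append(out.Choices, TextChoice{Text: c.Message.Content})
+	}
+	return out
+}
+
+func main() {
+	msgs := promptToMessages("Say hello")
+	fmt.Println(msgs[0].Role, msgs[0].Content)
+
+	chat := ChatResponse{
+		Object:  "chat.completion",
+		Choices: []ChatChoice{{Message: ChatMessage{Role: "assistant", Content: "Hello!"}}},
+	}
+	text := chatToText(chat)
+	fmt.Println(text.Object, text.Choices[0].Text)
+}
+```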
+
+## 2. Chat-to-Responses Conversion
+
+Some AI models (like OpenAI o1-pro) only support the responses API and don't support native chat completion endpoints. LiteLLM compatibility mode automatically handles this by:
+
+1. Checking if the model supports chat completion natively (using the model catalog)
+2. If not supported, converting your chat message to responses API format
+3. Calling the responses endpoint internally
+4. Transforming the response back to chat completion format
+
+<Note>
+**Smart Conversion**: The conversion only happens when the model doesn't support chat completions natively. If a model has native chat completion support (like OpenAI's gpt-4 models), Bifrost uses the chat completion endpoint directly without any conversion.
+</Note>
+
+This allows you to use a unified chat completion interface across all providers, even those that only support responses API.
+
+### How It Works
+
+When LiteLLM compatibility is enabled and you make a chat completion request, Bifrost first checks if the model supports chat completion:
+
+```mermaid
+flowchart LR
+A[Chat Completion Request] --> B{Model Supports Chat Completion?}
+B -->|Yes| C[Call Chat Completion API]
+B -->|No| D[Convert to Responses Message]
+D --> E[Call Responses API]
+E --> F[Transform Response]
+C --> G[Chat Completion Response]
+F --> G
+```
+
+## Enabling LiteLLM Compatibility
+
+<Tabs group="litellm-compat">
+
+<Tab title="Gateway UI">
+
+1. Open the Bifrost dashboard
+2. Navigate to **Settings** → **Client Configuration**
+3. Expand **LiteLLM Compat** and enable the features you need:
+   - **Convert Text to Chat** — converts text completion requests to chat for models that only support chat
+   - **Convert Chat to Responses** — converts chat completion requests to responses for models that only support responses
+   - **Drop Unsupported Params** — drops unsupported parameters based on model catalog allowlist
+4. Save your configuration
+
+</Tab>
+
+<Tab title="Configuration File">
+
+```json
+{
+  "client_config": {
+    "compat": {
+      "convert_text_to_chat": true,
+      "convert_chat_to_responses": true,
+      "should_drop_params": true
+    }
+  }
+}
+```
+
+</Tab>
+
+</Tabs>
+
+## Supported Providers
+
+Text completion to chat completion conversion works with any provider that supports chat completions but lacks native text completion support:
+
+| Provider | Native Text Completion | With Conversion |
+|----------|------------------------|-----------------|
+| OpenAI (GPT-4, GPT-3.5-turbo) | No | Yes |
+| Anthropic (Claude) | No | Yes |
+| Groq | No | Yes |
+| Gemini | No | Yes |
+| Mistral | No | Yes |
+| Bedrock | Varies by model | Yes |
+
+Chat completion to responses conversion works with any provider that supports responses but lacks native chat completion support:
+
+| Provider | Native Chat Completion | With Conversion |
+|----------|------------------------|-----------------|
+| OpenAI (o1-pro) | No | Yes |
+
+## Behavior Details
+
+**Model Capability Detection:**
+- Bifrost uses the model catalog to check if a model supports text completion
+- If the model has a "completion" mode in its pricing data, it supports text completion
+- Conversion only happens when the model lacks native text completion support
+
+## Transformations Reference
+
+### Transformation 1: Text-to-Chat Conversion
+
+**Applies to:** Text completion requests on chat-only models
+
+| Phase | Original | Transformed |
+|-------|----------|-------------|
+| Request | Text prompt (string) | Chat message with `role: "user"` |
+| Request | Array prompts | Concatenated into text content blocks |
+| Request | `text_completion` request type | `chat_completion` request type |
+| Request | `max_tokens`, `temperature`, `top_p` | Mapped to chat equivalents |
+| Response | `choices[0].message.content` | `choices[0].text` |
+| Response | `object: "chat.completion"` | `object: "text_completion"` |
+
+### Transformation 2: Chat-to-Responses Conversion
+
+**Applies to:** Chat completion requests on responses-only models
+
+| Phase | Original | Transformed |
+|-------|----------|-------------|
+| Request | Chat message with `role: "user"` | Responses input with `role: "user"` |
+| Request | `chat_completion` request type | `responses` request type |
+
+### Metadata Set on Transformed Responses
+
+When either transformation is applied:
+
+- `extra_fields.request_type`: Reflects the original request type
+- `extra_fields.original_model_requested`: The originally requested model
+- `extra_fields.resolved_model_used`: The actual provider API identifier used (equals original_model_requested when no alias mapping exists)
+
+### Error Handling
+
+When errors occur on transformed requests:
+- Original request type and model are preserved in error metadata
+- `extra_fields.converted_request_type`: The request type the original request was converted to (`chat_completion` or `responses`)
+- `extra_fields.provider`: The provider that handled the request
+- `extra_fields.original_model_requested`: The originally requested model
+- `extra_fields.dropped_compat_plugin_params`: If any unsupported parameters were dropped, the keys are added here
+
+## What's Preserved
+
+- Model selection and fallback chain
+- Temperature, top_p, max_tokens, and other generation parameters
+- Stop sequences and frequency/presence penalties
+- Usage statistics and token counts
+
+## When to Use This
+
+**Good Use Cases:**
+- Migrating from LiteLLM to Bifrost without code changes
+- Maintaining backward compatibility with text completion or chat completion interfaces
+- Using a unified API across providers with different capabilities
+
+**Consider Alternatives When:**
+- You need chat-specific features (system messages, conversation history)
+- You want explicit control over message formatting
+- Performance is critical (direct chat requests avoid conversion overhead)
+
+## Related Features
+
+- [Fallbacks](/features/fallbacks) - Automatic provider failover
+- [Drop-in Replacement](/features/drop-in-replacement) - Use existing SDKs with Bifrost
+- [LiteLLM Integration](/integrations/litellm-sdk) - Using LiteLLM SDK with Bifrost
--- a/docs/features/observability/default.mdx
+++ b/docs/features/observability/default.mdx
@@ -0,0 +1,600 @@
+---
+title: "Built-in Observability"
+description: "Monitor and analyze every AI request and response in real-time. Track performance, debug issues, and gain insights into your AI application's behavior with comprehensive request tracing."
+icon: "cube"
+---
+
+## Overview
+
+Bifrost includes **built-in observability**, a powerful feature that automatically captures and stores detailed information about every AI request and response that flows through your system. This provides structured, searchable data with real-time monitoring capabilities, making it easy to debug issues, analyze performance patterns, and understand your AI application's behavior at scale.
+
+All LLM interactions are captured with comprehensive metadata including inputs, outputs, tokens, costs, and latency. The logging plugin operates **asynchronously** with zero impact on request latency.
+
+![Live Log Stream Interface](../../media/ui-live-log-stream.gif)
+
+---
+
+## What's Captured
+
+Bifrost traces comprehensive information for every request, without any changes to your application code.
+
+![Complete Request Tracing Overview](../../media/ui-request-tracing-overview.png)
+
+### **Request Data**
+- **Input Messages**: Complete conversation history and user prompts
+- **Model Parameters**: Temperature, max tokens, tools, and all other parameters
+- **Provider Context**: Which provider and model handled the request
+- **Prompt Tracking**: When the [Prompts plugin](/features/prompt-repository/prompts-plugin) is active, the log captures the selected prompt name, version number, and ID for full traceability
+
+### **Response Data**
+- **Output Messages**: AI responses, tool calls, and function results
+- **Performance Metrics**: Latency and token usage
+- **Status Information**: Success or error details
+
+### **Retry & Key Selection** <sup>v1.5.0-prerelease4+</sup>
+
+When Bifrost retries a request (after a rate-limit or network error), the following fields are recorded:
+
+| Field | Meaning |
+|-------|---------|
+| `selected_key_id` / `selected_key_name` | The API key that **successfully** served the request. `null` when all attempts failed — use `attempt_trail` to see which keys were tried. |
+| `number_of_retries` | Total number of attempts minus one. **Does not indicate which key was used on each attempt.** |
+| `attempt_trail` | Ordered array of every attempt, with the key used and the failure reason. `fail_reason` is `null` on the attempt that succeeded. |
+
+**Example `attempt_trail`** — two rate-limit rotations then success on a third key:
+
+```json
+"attempt_trail": [
+  { "attempt": 0, "key_id": "key-a", "key_name": "Key A", "fail_reason": "rate_limit_error" },
+  { "attempt": 1, "key_id": "key-b", "key_name": "Key B", "fail_reason": "rate_limit_error" },
+  { "attempt": 2, "key_id": "key-c", "key_name": "Key C", "fail_reason": null }
+]
+```
+
+Network-error retries reuse the same key; only rate-limit errors rotate to a different key:
+
+```json
+"attempt_trail": [
+  { "attempt": 0, "key_id": "key-a", "key_name": "Key A", "fail_reason": "network_error" },
+  { "attempt": 1, "key_id": "key-a", "key_name": "Key A", "fail_reason": "rate_limit_error" },
+  { "attempt": 2, "key_id": "key-b", "key_name": "Key B", "fail_reason": null }
+]
+```
+
+`attempt_trail` is `null` or absent when the request succeeded on the first try, without retries.
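+
+The trail can also be inspected programmatically. The following is a minimal Go sketch, not part of Bifrost: the `Attempt` struct and the `summarize` helper are hypothetical names, and only the JSON field names come from the examples above. It decodes an `attempt_trail` array and reports the retry count and the key that served the request.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+)
+
+// Attempt mirrors one entry of the documented attempt_trail array.
+// The struct name is hypothetical; the JSON field names come from the docs.
+type Attempt struct {
+    Attempt    int     `json:"attempt"`
+    KeyID      string  `json:"key_id"`
+    KeyName    string  `json:"key_name"`
+    FailReason *string `json:"fail_reason"` // nil when the attempt succeeded
+}
+
+// summarize returns the retry count (attempts minus one) and the ID of the
+// key that served the request, or "" if no attempt succeeded.
+func summarize(trail []Attempt) (retries int, servedBy string) {
+    for _, a := range trail {
+        if a.FailReason == nil {
+            servedBy = a.KeyID
+        }
+    }
+    if len(trail) > 0 {
+        retries = len(trail) - 1
+    }
+    return retries, servedBy
+}
+
+func main() {
+    raw := `[
+      {"attempt": 0, "key_id": "key-a", "key_name": "Key A", "fail_reason": "network_error"},
+      {"attempt": 1, "key_id": "key-a", "key_name": "Key A", "fail_reason": "rate_limit_error"},
+      {"attempt": 2, "key_id": "key-b", "key_name": "Key B", "fail_reason": null}
+    ]`
+    var trail []Attempt
+    if err := json.Unmarshal([]byte(raw), &trail); err != nil {
+        panic(err)
+    }
+    retries, servedBy := summarize(trail)
+    fmt.Printf("retries=%d served_by=%s\n", retries, servedBy) // retries=2 served_by=key-b
+}
+```
+
+Using a `*string` for `fail_reason` keeps JSON `null` distinguishable from an empty string.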
+
+### **Custom Metadata**
+- **Logging Headers**: Capture configured request headers (e.g., `X-Tenant-ID`) into log metadata
+- **Ad-hoc Headers**: Any `x-bf-lh-*` prefixed header is automatically captured into metadata
+- See [Logging Headers](#logging-headers) below for full details
+
+### **Multimodal & Tool Support**
+- **Audio Processing**: Speech synthesis and transcription inputs/outputs
+- **Vision Analysis**: Image URLs and vision model responses
+- **Tool Execution**: Function calling arguments and results
+
+![Multimodal Request Tracing](../../media/ui-multimodal-tracing.png)
+
+---
+
+## How It Works
+
+The logging plugin intercepts every request that flows through Bifrost using the plugin architecture:
+
+1. **PreLLMHook**: Captures request metadata (provider, model, input messages, parameters).
+2. **Async Processing**: Logs are written in background goroutines with `sync.Pool` optimization.
+3. **PostLLMHook**: Updates log entry with response data (output, tokens, cost, latency, errors).
+4. **Real-time Updates**: WebSocket broadcasts keep the UI synchronized.
+
+All logging operations are non-blocking, ensuring your LLM requests maintain optimal performance.
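+
+The non-blocking pattern described above can be illustrated with a small Go sketch. This is not the plugin's actual code: `logEntry`, `asyncLogger`, and the drop-when-full policy are assumptions made for the example. It shows only the general technique of handing pooled entries to a background goroutine over a buffered channel.
+
+```go
+package main
+
+import (
+    "fmt"
+    "sync"
+)
+
+// logEntry is a hypothetical stand-in for the plugin's log record.
+type logEntry struct {
+    RequestID string
+    LatencyMs int64
+}
+
+// entryPool reuses logEntry allocations between requests.
+var entryPool = sync.Pool{New: func() any { return new(logEntry) }}
+
+// asyncLogger hands entries to a background goroutine over a buffered channel.
+type asyncLogger struct {
+    queue   chan *logEntry
+    done    sync.WaitGroup
+    written []string // stand-in for rows written to the log store
+}
+
+func newAsyncLogger(buffer int) *asyncLogger {
+    l := &asyncLogger{queue: make(chan *logEntry, buffer)}
+    l.done.Add(1)
+    go func() {
+        defer l.done.Done()
+        for e := range l.queue {
+            l.written = append(l.written, e.RequestID)
+            *e = logEntry{} // reset before returning the entry to the pool
+            entryPool.Put(e)
+        }
+    }()
+    return l
+}
+
+// Log never blocks the caller: if the buffer is full the entry is dropped
+// (an assumption of this sketch) and false is returned.
+func (l *asyncLogger) Log(requestID string, latencyMs int64) bool {
+    e := entryPool.Get().(*logEntry)
+    e.RequestID, e.LatencyMs = requestID, latencyMs
+    select {
+    case l.queue <- e:
+        return true
+    default:
+        entryPool.Put(e)
+        return false
+    }
+}
+
+// Close drains the queue and returns what the writer recorded.
+func (l *asyncLogger) Close() []string {
+    close(l.queue)
+    l.done.Wait()
+    return l.written
+}
+
+func main() {
+    logger := newAsyncLogger(8)
+    for i := 0; i < 3; i++ {
+        logger.Log(fmt.Sprintf("req-%d", i), 12)
+    }
+    fmt.Println(logger.Close()) // [req-0 req-1 req-2]
+}
+```
+
+The request path pays only for a pool lookup and a channel send; the slow write happens on the background goroutine.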
+
+---
+
+## Configuration
+
+Configure request tracing to control what gets logged and where it's stored.
+
+<Tabs group="tracing-config">
+
+<Tab title="Using Web UI">
+
+![Tracing Configuration Interface](../../media/ui-tracing-config.png)
+
+1. Navigate to **http://localhost:8080**
+2. Go to **"Settings"**
+3. Toggle **"Enable Logs"**
+
+</Tab>
+
+<Tab title="Using API">
+
+**Enable/Disable Tracing:**
+```bash
+curl --location 'http://localhost:8080/api/config' \
+--request PUT \
+--header 'Content-Type: application/json' \
+--data '{
+    "client_config": {
+        "enable_logging": true,
+        "disable_content_logging": false,
+        "drop_excess_requests": false,
+        "initial_pool_size": 300,
+        "enforce_auth_on_inference": false,
+        "allow_direct_keys": false,
+        "prometheus_labels": [],
+        "allowed_origins": []
+    }
+}'
+```
+
+**Check Current Configuration:**
+```bash
+curl --location 'http://localhost:8080/api/config'
+```
+
+**Response includes tracing status:**
+```json
+{
+    "client_config": {
+        "enable_logging": true,
+        "disable_content_logging": false,
+        "drop_excess_requests": false
+    },
+    "is_db_connected": true,
+    "is_cache_connected": true, 
+    "is_logs_connected": true
+}
+```
+
+</Tab>
+
+<Tab title="Using config.json">
+
+In your `config.json` file, you can enable logging and configure the log store:
+```json
+{
+    "client": {
+        "enable_logging": true,
+        "disable_content_logging": false,
+        "drop_excess_requests": false,
+        "initial_pool_size": 300,
+        "allow_direct_keys": false
+    },
+    "logs_store": {
+        "enabled": true,
+        "type": "sqlite",
+        "config": {
+            "path": "./logs.db"
+        }
+    }
+}
+```
+- **`enable_logging`**: Master toggle for request tracing.
+- **`disable_content_logging`**: Disable logging of request/response content, but still log usage metadata (latency, cost, token count, etc.).
+- **`logs_store`**: Check [Log Store Options](#log-store-options) for more details.
+
+</Tab>
+
+<Tab title="Using Go SDK">
+
+When using Bifrost as a Go SDK, initialize the logging plugin manually:
+
+```go
+package main
+
+import (
+    "context"
+    bifrost "github.com/maximhq/bifrost/core"
+    "github.com/maximhq/bifrost/core/schemas"
+    "github.com/maximhq/bifrost/framework/logstore"
+    "github.com/maximhq/bifrost/framework/pricing"
+    "github.com/maximhq/bifrost/plugins/logging"
+)
+
+func main() {
+    ctx := context.Background()
+    logger := schemas.NewLogger()
+    
+    // Initialize log store (SQLite)
+    store, err := logstore.NewLogStore(ctx, &logstore.Config{
+        Enabled: true,
+        Type:    logstore.LogStoreTypeSQLite,
+        Config: &logstore.SQLiteConfig{
+            Path: "./logs.db",
+        },
+    }, logger)
+    if err != nil {
+        panic(err)
+    }
+    
+    // Initialize pricing manager (required for cost calculation)
+    pricingManager := pricing.NewPricingManager(logger)
+    
+    // Initialize logging plugin
+    loggingPlugin, err := logging.Init(ctx, logger, store, pricingManager)
+    if err != nil {
+        panic(err)
+    }
+    
+    // Initialize Bifrost with logging plugin
+    client, err := bifrost.Init(ctx, schemas.BifrostConfig{
+        Account: &yourAccount,
+        LLMPlugins: []schemas.LLMPlugin{loggingPlugin},
+    })
+    if err != nil {
+        panic(err)
+    }
+    defer client.Shutdown()
+    
+    // All requests are now logged automatically
+}
+```
+
+</Tab>
+
+</Tabs>
+
+---
+
+## Accessing & Filtering Logs
+
+Retrieve and analyze logs with powerful filtering capabilities via the UI, API, and WebSockets.
+
+![Advanced Log Filtering Interface](/media/ui-log-filtering.gif)
+
+### Web UI
+
+When running the Gateway, access the built-in dashboard at `http://localhost:8080`. The UI provides:
+- Real-time log streaming
+- Advanced filtering and search
+- Detailed request/response inspection
+- Token and cost analytics
+
+### API Endpoints
+
+Query logs programmatically with a `GET` request to `/api/logs`.
+
+```bash
+curl 'http://localhost:8080/api/logs?'\
+'providers=openai,anthropic&'\
+'models=gpt-4o-mini&'\
+'status=success,error&'\
+'start_time=2024-01-15T00:00:00Z&'\
+'end_time=2024-01-15T23:59:59Z&'\
+'min_latency=1000&'\
+'max_latency=5000&'\
+'min_tokens=10&'\
+'max_tokens=1000&'\
+'min_cost=0.001&'\
+'max_cost=10&'\
+'content_search=python&'\
+'limit=100&'\
+'offset=0'
+```
+**Available Filters:**
+
+| Filter | Description | Example |
+|--------|-------------|---------|
+| `providers` | Filter by AI providers | `openai,anthropic` |
+| `models` | Filter by specific models | `gpt-4o-mini,claude-3-sonnet` |
+| `status` | Request status | `success,error,processing` |
+| `objects` | Request types | `chat.completion,embedding` |
+| `start_time` / `end_time` | Time range (RFC3339) | `2024-01-15T10:00:00Z` |
+| `min_latency` / `max_latency` | Response time (ms) | `1000` to `5000` |
+| `min_tokens` / `max_tokens` | Token usage range | `10` to `1000` |
+| `min_cost` / `max_cost` | Cost range (USD) | `0.001` to `10` |
+| `content_search` | Search in messages | `"error handling"` |
+| `limit` / `offset` | Pagination | `100`, `200` |
+
+**Response Format**
+
+```json
+{
+    "logs": [...],
+    "pagination": {
+        "limit": 100,
+        "offset": 0,
+        "sort_by": "timestamp",
+        "order": "desc"
+    },
+    "stats": {
+        "total_requests": 1234,
+        "success_rate": 0.85,
+        "average_latency": 100,
+        "total_tokens": 10000,
+        "total_cost": 100
+    }
+}
+```
+
+This endpoint is useful for analytics, debugging specific issues, or building custom monitoring dashboards.
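+
+When calling the endpoint from code, it is usually safer to build the query string with a URL library than by hand. The sketch below is a hypothetical Go helper (`LogFilter` and `buildLogsURL` are illustrative names, not a Bifrost client API) that covers a subset of the filters listed above and assumes the gateway accepts standard percent-encoded query values.
+
+```go
+package main
+
+import (
+    "fmt"
+    "net/url"
+    "strconv"
+    "strings"
+)
+
+// LogFilter holds a subset of the documented /api/logs filters.
+// The type is hypothetical; the parameter names come from the table above.
+type LogFilter struct {
+    Providers  []string
+    Status     []string
+    MinLatency int // milliseconds; 0 means "not set"
+    Limit      int // 0 means "use the server default"
+    Offset     int
+}
+
+// buildLogsURL returns the /api/logs URL for the given gateway base URL.
+func buildLogsURL(base string, f LogFilter) (string, error) {
+    u, err := url.Parse(base)
+    if err != nil {
+        return "", err
+    }
+    u.Path = "/api/logs"
+    q := url.Values{}
+    if len(f.Providers) > 0 {
+        q.Set("providers", strings.Join(f.Providers, ","))
+    }
+    if len(f.Status) > 0 {
+        q.Set("status", strings.Join(f.Status, ","))
+    }
+    if f.MinLatency > 0 {
+        q.Set("min_latency", strconv.Itoa(f.MinLatency))
+    }
+    if f.Limit > 0 {
+        q.Set("limit", strconv.Itoa(f.Limit))
+        q.Set("offset", strconv.Itoa(f.Offset))
+    }
+    u.RawQuery = q.Encode() // percent-encodes values and sorts keys
+    return u.String(), nil
+}
+
+func main() {
+    logsURL, err := buildLogsURL("http://localhost:8080", LogFilter{
+        Providers:  []string{"openai", "anthropic"},
+        Status:     []string{"error"},
+        MinLatency: 1000,
+        Limit:      100,
+    })
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println(logsURL)
+}
+```
+
+The resulting URL can be passed to `http.Get`, and the JSON body decoded into a struct matching the response format shown above.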
+
+### WebSocket
+
+Subscribe to real-time log updates for live monitoring:
+
+```javascript
+const ws = new WebSocket('ws://localhost:8080/ws')
+
+ws.onmessage = (event) => {
+  const logUpdate = JSON.parse(event.data)
+  console.log('New log entry:', logUpdate)
+}
+```
+
+---
+
+## Log Store Options
+
+Choose the right storage backend for your scale and requirements.
+
+The logging plugin is **automatically enabled** in Gateway mode with SQLite storage by default. You can configure it to use PostgreSQL by setting the `logs_store` configuration in your `config.json` file.
+
+### **Current Support**
+
+<Tabs group="log-store-types">
+<Tab title="SQLite (Default)">
+
+- **Best for**: Development, small-medium deployments
+- **Performance**: Excellent for read-heavy workloads
+- **Setup**: Zero configuration, single file storage
+- **Limits**: Single-writer, local filesystem only
+
+```json
+{
+    "logs_store": {
+        "enabled": true,
+        "type": "sqlite",
+        "config": {
+            "path": "./logs.db"
+        }
+    }
+}
+```
+
+</Tab>
+<Tab title="PostgreSQL">
+
+- **Best for**: High-volume production deployments
+- **Performance**: Excellent concurrent writes and complex queries
+- **Features**: Advanced indexing, partitioning, replication
+- **Requirement**: PostgreSQL database must be UTF8 encoded (see [PostgreSQL UTF8 Requirement](../../quickstart/gateway/setting-up#postgresql-utf8-requirement))
+
+```json
+{
+    "logs_store": {
+        "enabled": true,
+        "type": "postgres",
+        "config": {
+            "host": "localhost",
+            "port": "5432",
+            "user": "bifrost",
+            "password": "postgres",
+            "db_name": "bifrost",
+            "ssl_mode": "disable"
+        }
+    }
+}
+```
+
+</Tab>
+</Tabs>
+
+### **Planned Support**
+
+- **MySQL**: For traditional MySQL environments.
+- **ClickHouse**: For large-scale analytics and time-series workloads.
+
+---
+
+## Supported Request Types
+
+The logging plugin captures all Bifrost request types:
+
+- Text Completion (streaming and non-streaming)
+- Chat Completion (streaming and non-streaming)
+- Responses (streaming and non-streaming)
+- Embeddings
+- Speech Generation (streaming and non-streaming)
+- Transcription (streaming and non-streaming)
+- Video Generation
+
+---
+
+## Logging Headers
+
+Capture specific HTTP request headers into the **metadata** field of every LLM and MCP log entry. This enables request tracing, tenant identification, and custom debugging without modifying your application code.
+
+### How It Works
+
+There are two ways headers get captured into log metadata:
+
+**1. Configured Logging Headers** — Define a list of header names in the configuration. The logging plugin looks up each configured header (case-insensitive) and stores its value in the metadata.
+
+**2. `x-bf-lh-*` Prefix (Automatic)** — Any request header with the `x-bf-lh-` prefix is automatically captured into metadata with no configuration needed. The prefix is stripped and the remainder becomes the metadata key.
+
+| Request Header | Metadata Key | Metadata Value |
+|----------------|-------------|----------------|
+| `x-bf-lh-tenant-id: acme` | `tenant-id` | `acme` |
+| `x-bf-lh-env: production` | `env` | `production` |
+| `x-bf-lh-region: us-east-1` | `region` | `us-east-1` |
+
+Both methods can be used together — configured headers and `x-bf-lh-*` headers are merged into the same metadata map.
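+
+The merge behavior can be sketched in a few lines of Go. This is an illustration of the rules described above, not the plugin's implementation: `collectMetadata` is a hypothetical helper, and lower-casing the keys of configured headers is an assumption based on the example metadata shown later on this page.
+
+```go
+package main
+
+import (
+    "fmt"
+    "net/http"
+    "strings"
+)
+
+const adHocPrefix = "x-bf-lh-"
+
+// collectMetadata merges configured headers and x-bf-lh-* headers into one map.
+func collectMetadata(h http.Header, configured []string) map[string]string {
+    meta := make(map[string]string)
+    // Configured headers: http.Header.Get matches names case-insensitively.
+    for _, name := range configured {
+        if v := h.Get(name); v != "" {
+            meta[strings.ToLower(name)] = v
+        }
+    }
+    // Ad-hoc headers: strip the prefix and use the remainder as the key.
+    for name, values := range h {
+        lower := strings.ToLower(name)
+        if strings.HasPrefix(lower, adHocPrefix) && len(values) > 0 {
+            meta[strings.TrimPrefix(lower, adHocPrefix)] = values[0]
+        }
+    }
+    return meta
+}
+
+func main() {
+    h := http.Header{}
+    h.Set("X-Tenant-ID", "tenant-123")
+    h.Set("x-bf-lh-env", "production")
+    meta := collectMetadata(h, []string{"x-tenant-id"})
+    fmt.Println(meta) // map[env:production x-tenant-id:tenant-123]
+}
+```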
+
+### Configuring Logging Headers
+
+<Tabs group="logging-headers-config">
+<Tab title="Web UI">
+
+1. Navigate to **Config** > **Logging**
+2. Ensure **Enable Logs** is toggled on
+3. Scroll to **Logging Headers**
+
+![Logging Headers Configuration](../../media/ui-logging-headers-setting.png)
+
+4. Enter a comma-separated list of header names (e.g., `X-Tenant-ID, X-Correlation-ID`)
+5. Click **Save Changes**
+
+Changes take effect immediately — no restart required.
+
+</Tab>
+<Tab title="API">
+
+Include `logging_headers` in the `client_config` when updating the configuration:
+
+```bash
+curl -X PUT http://localhost:8080/api/config \
+  -H "Content-Type: application/json" \
+  -d '{
+    "client_config": {
+      "logging_headers": ["X-Tenant-ID", "X-Correlation-ID"]
+    }
+  }'
+```
+
+</Tab>
+<Tab title="config.json">
+
+Add `logging_headers` to the `client` section:
+
+```json
+{
+  "client": {
+    "enable_logging": true,
+    "logging_headers": ["X-Tenant-ID", "X-Correlation-ID"]
+  }
+}
+```
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `logging_headers` | `string[]` | No | List of header names to capture in log metadata. Case-insensitive. No restart required. |
+
+</Tab>
+</Tabs>
+
+### Usage Examples
+
+**Configured headers:**
+
+```bash
+# Config has: logging_headers: ["X-Tenant-ID", "X-Correlation-ID"]
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "X-Tenant-ID: tenant-123" \
+  -H "X-Correlation-ID: req-abc-456" \
+  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}'
+```
+
+Log metadata: `{"x-tenant-id": "tenant-123", "x-correlation-id": "req-abc-456"}`
+
+**Ad-hoc `x-bf-lh-*` headers (no config needed):**
+
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "x-bf-lh-env: production" \
+  -H "x-bf-lh-version: v2.1.0" \
+  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Hello"}]}'
+```
+
+Log metadata: `{"env": "production", "version": "v2.1.0"}`
+
+### Viewing Metadata in the UI
+
+Metadata is displayed in the log detail view for both LLM and MCP logs as individual key-value entries alongside other request details.
+
+![Log Entry with Metadata](../../media/ui-log-metadata-display.png)
+
+### Combining with Required Headers
+
+[Required headers](../governance/required-headers) and logging headers serve different purposes and can be used together:
+
+| Feature | Purpose | Effect on Request |
+|---------|---------|-------------------|
+| **Required Headers** | Enforce header presence | Rejects request if missing (400) |
+| **Logging Headers** | Capture header values | No effect on request — only logs metadata |
+
+A common pattern is to require a header **and** log it:
+
+```json
+{
+  "client": {
+    "required_headers": ["X-Tenant-ID"],
+    "logging_headers": ["X-Tenant-ID"]
+  }
+}
+```
+
+---
+
+## When to Use
+
+### Built-in Observability
+
+Use the built-in logging plugin for:
+
+- **Local Development**: Quick setup with SQLite, no external dependencies
+- **Self-hosted Deployments**: Full control over your data with PostgreSQL
+- **Simple Use Cases**: Basic monitoring and debugging needs
+- **Privacy-sensitive Workloads**: Keep all logs on your infrastructure
+
+### vs. Maxim Plugin
+
+Switch to the [Maxim plugin](./maxim) for:
+
+- Advanced evaluation and testing workflows
+- Prompt engineering and experimentation
+- Multi-team governance and collaboration
+- Production monitoring with alerts and SLAs
+- Dataset management and annotation pipelines
+
+### vs. OTel Plugin
+
+Switch to the [OTel plugin](./otel) for:
+
+- Integration with existing observability infrastructure
+- Correlation with application traces and metrics
+- Custom collector configurations
+- Compliance and enterprise requirements
+
+---
+
+## Performance
+
+The logging plugin is designed for **low-overhead observability**:
+
+- **Async Operations**: All database writes happen in background goroutines
+- **`sync.Pool`**: Reuses memory allocations for `LogMessage` and `UpdateLogData` structs
+- **Batch Processing**: Efficiently handles high request volumes
+- **Automatic Cleanup**: Removes stale processing logs every 30 seconds
+
+In benchmarks, the logging plugin adds **< 0.1ms overhead** to request processing time.
+
+---
+
+## Connectors
+
+<CardGroup cols={2}>
+  <Card title="Maxim AI" icon="infinity" href="/features/observability/maxim">
+    Comprehensive LLM observability and evaluation.
+  </Card>
+  <Card title="OpenTelemetry" icon="bolt" href="/features/observability/otel">
+    OTLP integration for distributed tracing.
+  </Card>
+  <Card title="Prometheus" icon="chart-line" href="/features/observability/prometheus">
+    Native Prometheus metrics.
+  </Card>
+  <Card title="Datadog" icon="dog" href="/enterprise/datadog-connector">
+    Native APM, LLM Observability, and metrics.
+  </Card>
+</CardGroup>
+
+---
+
+## Next Steps
+
+- **[Gateway Setup](../../quickstart/gateway/setting-up)** - Get Bifrost running with tracing enabled
+- **[Provider Configuration](../../quickstart/gateway/provider-configuration)** - Configure multiple providers for better insights
+- **[Telemetry](../telemetry)** - Prometheus metrics and dashboards
+- **[Governance](../governance)** - Virtual keys and usage limits
--- a/docs/features/observability/maxim.mdx
+++ b/docs/features/observability/maxim.mdx
@@ -0,0 +1,225 @@
+---
+title: "Maxim AI"
+description: "Integrate Maxim SDK for comprehensive LLM observability, tracing, and evaluation."
+icon: "infinity"
+---
+
+## Overview
+
+Bifrost provides comprehensive LLM observability through the **Maxim plugin**, enabling seamless tracking, evaluation, and analysis of AI interactions. The plugin automatically forwards all LLM requests and responses to Maxim's platform for detailed monitoring and performance insights.
+
+![Maxim Logs](https://github.com/maximhq/bifrost/blob/main/docs/media/maxim-logs.png?raw=true)
+
+---
+
+## Setup
+
+Initialize the plugin with your Maxim API key, either in code (Go SDK) or in `config.json` (Gateway):
+
+<Tabs group="setup-method">
+<Tab title="Go SDK">
+
+```go
+package main
+
+import (
+    "context"
+    bifrost "github.com/maximhq/bifrost/core"
+    "github.com/maximhq/bifrost/core/schemas"
+    maxim "github.com/maximhq/bifrost/plugins/maxim"
+)
+
+func main() {
+    // Initialize Maxim plugin
+    maximPlugin, err := maxim.Init(maxim.Config{
+        ApiKey:    "your_maxim_api_key",
+        LogRepoId: "your_default_repo_id", // Optional: fallback repository
+    })
+    if err != nil {
+        panic(err)
+    }
+
+    // Initialize Bifrost with the plugin
+    client, err := bifrost.Init(context.Background(), schemas.BifrostConfig{
+        Account: &yourAccount,
+        LLMPlugins: []schemas.LLMPlugin{maximPlugin},
+    })
+    if err != nil {
+        panic(err)
+    }
+    defer client.Shutdown()
+
+    // All requests will now be traced to Maxim
+}
+```
+
+</Tab>
+<Tab title="config.json">
+
+For Gateway (HTTP transport) deployments, configure the plugin in `config.json`:
+
+```json
+{
+  "plugins": [
+    {
+      "enabled": true,
+      "name": "maxim",
+      "config": {        
+        "api_key": "your_maxim_api_key",
+        "log_repo_id": "your_default_repo_id"
+      }
+    }
+  ]
+}
+```
+
+</Tab>
+</Tabs>
+
+## Configuration
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `ApiKey` (`api_key` in `config.json`) | `string` | ✅ Yes | Your Maxim API key for authentication |
+| `LogRepoId` (`log_repo_id` in `config.json`) | `string` | ❌ No | Default log repository ID (can be overridden per request) |
+
+## Repository Selection
+
+The plugin uses repository selection with the following priority:
+
+1. **Header/Context Repository** - Highest priority
+2. **Default Repository** (from plugin config) - Fallback
+3. **Skip Logging** - If neither is available
+
+<Tabs group="repository-selection">
+<Tab title="Go SDK">
+
+```go
+ctx := context.Background()
+
+// Use specific repository for this request
+ctx = context.WithValue(ctx, maxim.LogRepoIDKey, "project-specific-repo")
+```
+
+</Tab>
+<Tab title="Gateway">
+
+```bash
+# Use default repository (from config)
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -d '{"model": "gpt-4", "messages": [...]}'
+
+# Override with specific repository
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "x-bf-maxim-log-repo-id: project-specific-repo" \
+  -d '{"model": "gpt-4", "messages": [...]}'
+```
+
+</Tab>
+</Tabs>
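+
+The priority order amounts to a simple fallback chain. The Go sketch below illustrates it; `resolveLogRepo` is a hypothetical helper written for this page, not a function exported by the plugin.
+
+```go
+package main
+
+import "fmt"
+
+// resolveLogRepo applies the documented priority: a per-request repository
+// (header or context value) wins, then the plugin's default repository.
+// ok is false when neither is set, meaning logging is skipped.
+func resolveLogRepo(requestRepoID, defaultRepoID string) (repoID string, ok bool) {
+    if requestRepoID != "" {
+        return requestRepoID, true
+    }
+    if defaultRepoID != "" {
+        return defaultRepoID, true
+    }
+    return "", false
+}
+
+func main() {
+    fmt.Println(resolveLogRepo("project-specific-repo", "your_default_repo_id")) // project-specific-repo true
+    fmt.Println(resolveLogRepo("", "your_default_repo_id"))                      // your_default_repo_id true
+    fmt.Println(resolveLogRepo("", ""))                                          //  false
+}
+```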
+
+
+## Custom Trace Management
+
+### Trace Propagation
+
+The plugin supports custom session, trace, and generation IDs for advanced tracing scenarios:
+
+<Tabs group="trace-propagation">
+<Tab title="Go SDK">
+```go
+ctx := context.Background()
+
+// Prefer typed keys from the Maxim plugin
+ctx = context.WithValue(ctx, maxim.TraceIDKey, "custom-trace-123")
+ctx = context.WithValue(ctx, maxim.GenerationIDKey, "custom-gen-456")
+ctx = context.WithValue(ctx, maxim.SessionIDKey, "user-session-789")
+
+// Optionally set human-friendly names
+ctx = context.WithValue(ctx, maxim.TraceNameKey, "checkout-flow")
+ctx = context.WithValue(ctx, maxim.GenerationNameKey, "rerank-step")
+```
+</Tab>
+<Tab title="Gateway">
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "x-bf-maxim-trace-id: custom-trace-123" \
+  -H "x-bf-maxim-generation-id: custom-gen-456" \
+  -H "x-bf-maxim-session-id: user-session-789" \
+  -H "x-bf-maxim-trace-name: checkout-flow" \
+  -H "x-bf-maxim-generation-name: rerank-step" \
+  -d '{"model": "gpt-4", "messages": [...]}'
+```
+</Tab>
+</Tabs>
+
+### Custom Tags
+
+You can add custom tags to traces for enhanced filtering and analytics:
+
+<Tabs group="custom-tags">
+<Tab title="Go SDK">
+
+```go
+ctx := context.Background()
+
+// Pass arbitrary tag key-values via context map
+tags := map[string]string{
+    "environment":  "production",
+    "user-id":      "user-123",
+    "feature-flag": "new-ui",
+}
+ctx = context.WithValue(ctx, maxim.TagsKey, tags)
+```
+
+</Tab>
+<Tab title="Gateway">
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "x-bf-maxim-environment: production" \
+  -H "x-bf-maxim-user-id: user-123" \
+  -H "x-bf-maxim-feature-flag: new-ui" \
+  -d '{"model": "gpt-4", "messages": [...]}'
+```
+
+Reserved keys are `session-id`, `trace-id`, `trace-name`, `generation-id`, `generation-name`, `log-repo-id`. All other `x-bf-maxim-*` headers are treated as tags.
+
+</Tab>
+</Tabs>
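+
+To make the reserved-key rule concrete, here is a hedged Go sketch of how `x-bf-maxim-*` headers could be split into reserved values and tags. `splitMaximHeaders` is a hypothetical helper for illustration only; the list of reserved keys is taken from the note above.
+
+```go
+package main
+
+import (
+    "fmt"
+    "net/http"
+    "strings"
+)
+
+const maximPrefix = "x-bf-maxim-"
+
+// reservedKeys lists the documented reserved suffixes.
+var reservedKeys = map[string]bool{
+    "session-id": true, "trace-id": true, "trace-name": true,
+    "generation-id": true, "generation-name": true, "log-repo-id": true,
+}
+
+// splitMaximHeaders separates reserved x-bf-maxim-* values from custom tags.
+func splitMaximHeaders(h http.Header) (reserved, tags map[string]string) {
+    reserved, tags = map[string]string{}, map[string]string{}
+    for name, values := range h {
+        key := strings.ToLower(name)
+        if !strings.HasPrefix(key, maximPrefix) || len(values) == 0 {
+            continue
+        }
+        key = strings.TrimPrefix(key, maximPrefix)
+        if reservedKeys[key] {
+            reserved[key] = values[0]
+        } else {
+            tags[key] = values[0]
+        }
+    }
+    return reserved, tags
+}
+
+func main() {
+    h := http.Header{}
+    h.Set("x-bf-maxim-trace-id", "custom-trace-123")
+    h.Set("x-bf-maxim-environment", "production")
+    h.Set("x-bf-maxim-user-id", "user-123")
+    reserved, tags := splitMaximHeaders(h)
+    fmt.Println(reserved) // map[trace-id:custom-trace-123]
+    fmt.Println(tags)     // map[environment:production user-id:user-123]
+}
+```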
+
+## Supported Request Types
+
+The plugin supports the following Bifrost request types:
+
+- Text Completion
+- Chat Completion
+
+## Monitoring & Analytics
+
+Once configured, monitor your AI apps in the [Maxim Dashboard](https://getmaxim.ai/). Maxim is an end-to-end evaluation & observability platform built to help teams ship AI agents faster while maintaining high quality.
+
+* **Experiment / Prompt Engineering**
+  Playground++ for prompt design: versioning, comparison (A/B), visual chaining, low-code tooling.
+
+* **Simulation & Evaluation**
+  Test agents over thousands of scenarios, both automated (statistical, programmatic) and human-in-the-loop for edge cases. Custom and off-the-shelf evaluators.
+
+* **Observability / Monitoring**
+  Real-time traces, logging, debugging of multi-agent workflows, live issue tracking, alerts when quality or performance degrade.
+
+* **Data Engine & Dataset Management**
+  Support for multi-modal datasets, import & continuous curation, feedback/annotation pipelines, data splitting for experiments.
+
+* **Governance, Security & Compliance**
+  Features like SOC 2 Type II compliance, enterprise security controls, permissions, auditability.
+
+* **Alerts & SLAs**
+  Threshold-based notifications to keep quality and latency within guardrails.
+
+## Next Steps
+
+Now that you have observability set up with the Maxim plugin, explore these related topics:
+
+- **[Tracing](./default)** - Deep-dive into request/response logging and correlation
+- **[Telemetry](../telemetry)** - Prometheus metrics, dashboards, and alerting
+- **[Governance](../governance/virtual-keys)** - Virtual keys, per-team controls, and usage limits
--- a/docs/features/observability/otel.mdx
+++ b/docs/features/observability/otel.mdx
@@ -0,0 +1,978 @@
+---
+title: "OpenTelemetry (OTel)"
+description: "Integrate with OpenTelemetry collectors for enterprise observability and distributed tracing"
+icon: "bolt"
+---
+
+## Overview
+
+<Frame>
+  <img src="/media/grafana-otel-traces.png" alt="Grafana OTel traces" />
+</Frame>
+
+The **OTel plugin** enables seamless integration with OpenTelemetry Protocol (OTLP) collectors, allowing you to send LLM traces to your existing observability infrastructure. Connect Bifrost to platforms like Grafana Cloud, Datadog, New Relic, Honeycomb, or self-hosted collectors.
+
+All traces follow OpenTelemetry semantic conventions, making it easy to correlate LLM operations with your broader application telemetry.
+
+---
+
+## Supported Trace Formats
+
+The plugin supports multiple trace formats to match your observability platform:
+
+| Format | Description | Use Case | Status |
+|--------|-------------|----------|----------|
+| `genai_extension` | OpenTelemetry GenAI semantic conventions | **Recommended** - Standard OTel format with rich LLM metadata | ✅ Released |
+| `vercel` | Vercel AI SDK format | For Vercel AI SDK compatibility | 🔄 Coming soon |
+| `open_inference` | Arize OpenInference format | For Arize Phoenix and OpenInference tools | 🔄 Coming soon |
+
+---
+
+## Configuration
+
+### Required Fields
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `service_name` | `string` | ❌ No | Service name to be used for tracing, defaults to `bifrost` |
+| `collector_url` | `string` | ✅ Yes | OTLP collector endpoint URL |
+| `trace_type` | `string` | ✅ Yes | One of: `genai_extension`, `vercel`, `open_inference` |
+| `protocol` | `string` | ✅ Yes | Transport protocol: `http` or `grpc` |
+| `headers` | `object` | ❌ No | Custom headers for authentication (supports `env.VAR_NAME`) |
+| `tls_ca_cert` | `string` | ❌ No | File path to the CA certificate used for TLS. Works with both the gRPC and HTTP protocols |
+
+### Environment Variable Substitution
+
+Headers support environment variable substitution using the `env.` prefix:
+
+```json
+{
+  "headers": {
+    "Authorization": "env.OTEL_API_KEY",
+    "X-Custom-Header": "env.CUSTOM_VALUE"
+  }
+}
+```
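+
+As a rough illustration of what this substitution means, the Go sketch below resolves `env.`-prefixed values against the process environment. It is not the plugin's code: `resolveHeaders` is a hypothetical helper, and returning an error for an unset variable is an assumption of this example.
+
+```go
+package main
+
+import (
+    "fmt"
+    "os"
+    "strings"
+)
+
+const envPrefix = "env."
+
+// resolveHeaders replaces values of the form "env.NAME" with the value of
+// the environment variable NAME; other values are copied unchanged.
+func resolveHeaders(headers map[string]string) (map[string]string, error) {
+    out := make(map[string]string, len(headers))
+    for key, value := range headers {
+        if !strings.HasPrefix(value, envPrefix) {
+            out[key] = value
+            continue
+        }
+        name := strings.TrimPrefix(value, envPrefix)
+        resolved, found := os.LookupEnv(name)
+        if !found {
+            return nil, fmt.Errorf("environment variable %q is not set", name)
+        }
+        out[key] = resolved
+    }
+    return out, nil
+}
+
+func main() {
+    os.Setenv("OTEL_API_KEY", "secret-token") // set here only for the demo
+    headers, err := resolveHeaders(map[string]string{
+        "Authorization": "env.OTEL_API_KEY",
+        "X-Static":      "plain-value",
+    })
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println(headers["Authorization"], headers["X-Static"]) // secret-token plain-value
+}
+```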
+
+### Resource Attributes
+
+The plugin supports the standard `OTEL_RESOURCE_ATTRIBUTES` environment variable. Any attributes defined in this variable will be automatically attached to every span emitted by the plugin.
+
+```bash
+export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production,service.version=1.2.3,team.name=platform"
+```
+
+These attributes appear as resource-level metadata on all traces:
+
+```json
+{
+  "resource": {
+    "attributes": {
+      "service.name": "bifrost",
+      "deployment.environment": "production",
+      "service.version": "1.2.3",
+      "team.name": "platform"
+    }
+  }
+}
+```
+
+This is useful for:
+- **Environment identification** - Distinguish between production, staging, and development traces
+- **Service versioning** - Track which version of your service generated the trace
+- **Team attribution** - Tag traces with team ownership for filtering and alerting
+- **Custom metadata** - Add any key-value pairs relevant to your observability needs
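+
+The variable uses a comma-separated `key=value` format. The following Go sketch shows one simplified way to parse it; `parseResourceAttributes` is a hypothetical helper, it skips malformed entries, and it does not handle percent-encoded values, so treat it as an illustration rather than what the plugin or the OpenTelemetry SDK does internally.
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// parseResourceAttributes splits "k1=v1,k2=v2" into a map.
+// Entries without "=" or with an empty key are skipped.
+func parseResourceAttributes(raw string) map[string]string {
+    attrs := make(map[string]string)
+    for _, pair := range strings.Split(raw, ",") {
+        key, value, found := strings.Cut(pair, "=")
+        key = strings.TrimSpace(key)
+        if !found || key == "" {
+            continue // skip malformed entries
+        }
+        attrs[key] = strings.TrimSpace(value)
+    }
+    return attrs
+}
+
+func main() {
+    raw := "deployment.environment=production,service.version=1.2.3,team.name=platform"
+    attrs := parseResourceAttributes(raw)
+    fmt.Println(len(attrs), attrs["deployment.environment"], attrs["service.version"]) // 3 production 1.2.3
+}
+```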
+
+---
+
+## Setup
+
+<Tabs group="setup-method">
+<Tab title="UI">
+![Otel UI setup](../../media/otel-ui-setup.png)
+</Tab>
+<Tab title="Go SDK">
+
+```go
+package main
+
+import (
+    "context"
+    bifrost "github.com/maximhq/bifrost/core"
+    "github.com/maximhq/bifrost/core/schemas"
+    "github.com/maximhq/bifrost/framework/pricing"
+    otel "github.com/maximhq/bifrost/plugins/otel"
+)
+
+func main() {
+    ctx := context.Background()
+    logger := schemas.NewLogger()
+    
+    // Initialize pricing manager (required for cost calculation)
+    pricingManager := pricing.NewPricingManager(logger)
+    
+    // Initialize OTel plugin
+    otelPlugin, err := otel.Init(ctx, &otel.Config{
+        ServiceName:  "bifrost",
+        CollectorURL: "http://localhost:4318",
+        TraceType:    otel.TraceTypeGenAIExtension,
+        Protocol:     otel.ProtocolHTTP,
+        Headers: map[string]string{
+            "Authorization": "env.OTEL_API_KEY",
+        },
+    }, logger, pricingManager)
+    if err != nil {
+        panic(err)
+    }
+    
+    // Initialize Bifrost with the plugin
+    client, err := bifrost.Init(ctx, schemas.BifrostConfig{
+        Account: &yourAccount,
+        LLMPlugins: []schemas.LLMPlugin{otelPlugin},
+    })
+    if err != nil {
+        panic(err)
+    }
+    defer client.Shutdown()
+    
+    // All requests are now traced to OTel collector
+}
+```
+
+</Tab>
+<Tab title="config.json">
+
+For Gateway mode, configure via `config.json`:
+
+```json
+{
+  "plugins": [
+    {
+      "enabled": true,
+      "name": "otel",
+      "config": {
+        "service_name": "bifrost",
+        "collector_url": "http://localhost:4318",
+        "trace_type": "genai_extension",
+        "protocol": "http",
+        "headers": {
+          "Authorization": "env.OTEL_API_KEY"
+        }
+      }
+    }
+  ]
+}
+```
+
+If your OTel collector requires TLS, set `tls_ca_cert` to the path of the CA certificate:
+
+```json
+{
+  "plugins": [
+    {
+      "enabled": true,
+      "name": "otel",
+      "config": {
+        "service_name": "bifrost",
+        "collector_url": "localhost:4317",
+        "trace_type": "genai_extension",
+        "protocol": "grpc",
+        "tls_ca_cert": "/path/to/your/ca.cert",
+        "headers": {
+          "Authorization": "env.OTEL_API_KEY"
+        }
+      }
+    }
+  ]
+}
+```
+
+</Tab>
+</Tabs>
+
+---
+
+## Quick Start with Docker
+
+Get started quickly with a complete observability stack using the included Docker Compose configuration:
+
+```yml
+services:
+  otel-collector:
+    image: otel/opentelemetry-collector-contrib:latest
+    container_name: otel-collector
+    command: ["--config=/etc/otelcol/config.yaml"]
+    configs:
+      - source: otel-collector-config
+        target: /etc/otelcol/config.yaml
+    ports:
+      - "4317:4317"   # OTLP gRPC
+      - "4318:4318"   # OTLP HTTP
+      - "8888:8888"   # Collector /metrics
+      - "9464:9464"   # Prometheus scrape endpoint
+      - "13133:13133" # Health check
+      - "1777:1777"   # pprof
+      - "55679:55679" # zpages
+    restart: unless-stopped
+    depends_on:
+      - tempo
+
+  tempo:
+    image: grafana/tempo:latest
+    container_name: tempo
+    command: [ "-config.file=/etc/tempo.yaml" ]
+    configs:
+      - source: tempo-config
+        target: /etc/tempo.yaml
+    ports:
+      - "3200:3200"   # tempo HTTP API
+    expose:
+      - "4317"        # OTLP gRPC (internal)
+    volumes:
+      - tempo-data:/var/tempo
+    restart: unless-stopped
+
+  prometheus:
+    image: prom/prometheus:latest
+    container_name: prometheus
+    depends_on:
+      - otel-collector
+    command:
+      - "--config.file=/etc/prometheus/prometheus.yml"
+      - "--storage.tsdb.path=/prometheus"
+      - "--web.console.libraries=/usr/share/prometheus/console_libraries"
+      - "--web.console.templates=/usr/share/prometheus/consoles"
+      - "--web.enable-remote-write-receiver"
+    ports:
+      - "9090:9090"
+    volumes:
+      - prometheus-data:/prometheus
+    configs:
+      - source: prometheus-config
+        target: /etc/prometheus/prometheus.yml
+    restart: unless-stopped
+
+  grafana:
+    image: grafana/grafana:latest
+    container_name: grafana
+    depends_on:
+      - prometheus
+      - tempo
+    environment:
+      GF_SECURITY_ADMIN_USER: admin
+      GF_SECURITY_ADMIN_PASSWORD: admin
+      GF_AUTH_ANONYMOUS_ENABLED: "true"
+      GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer
+      GF_PLUGINS_ALLOW_LOADING_UNSIGNED_PLUGINS: "grafana-pyroscope-app,grafana-exploretraces-app,grafana-metricsdrilldown-app"
+      GF_PLUGINS_ENABLE_ALPHA: "true"
+      GF_INSTALL_PLUGINS: ""
+    ports:
+      - "4000:3000"
+    volumes:
+      - grafana-data:/var/lib/grafana
+    configs:
+      - source: grafana-datasources
+        target: /etc/grafana/provisioning/datasources/datasources.yml
+    restart: unless-stopped
+
+configs:
+  otel-collector-config:
+    content: |
+      receivers:
+        otlp:
+          protocols:
+            grpc:
+              endpoint: 0.0.0.0:4317
+            http:
+              endpoint: 0.0.0.0:4318
+
+      processors:
+        batch:
+
+      exporters:
+        prometheus:
+          endpoint: 0.0.0.0:9464
+          namespace: otel
+          const_labels:
+            source: otelcol
+            
+        otlp/tempo:
+          endpoint: tempo:4317
+          tls:
+            insecure: true
+            
+        debug:
+          verbosity: detailed
+
+      extensions:
+        health_check:
+          endpoint: 0.0.0.0:13133
+        pprof:
+          endpoint: 0.0.0.0:1777
+        zpages:
+          endpoint: 0.0.0.0:55679
+
+      service:
+        extensions: [health_check, pprof, zpages]
+        telemetry:
+          logs:
+            level: debug
+          metrics:
+            level: detailed
+        pipelines:
+          traces:
+            receivers: [otlp]
+            processors: [batch]
+            exporters: [debug, otlp/tempo]
+          metrics:
+            receivers: [otlp]
+            processors: [batch]
+            exporters: [debug, prometheus]
+          logs:
+            receivers: [otlp]
+            processors: [batch]
+            exporters: [debug]
+
+  tempo-config:
+    content: |
+      server:
+        http_listen_port: 3200
+        log_level: info
+
+      distributor:
+        receivers:
+          otlp:
+            protocols:
+              grpc:
+                endpoint: 0.0.0.0:4317
+
+      ingester:
+        max_block_duration: 5m
+        trace_idle_period: 10s
+
+      compactor:
+        compaction:
+          block_retention: 1h
+
+      storage:
+        trace:
+          backend: local
+          wal:
+            path: /var/tempo/wal
+          local:
+            path: /var/tempo/blocks
+
+      metrics_generator:
+        registry:
+          external_labels:
+            source: tempo
+        storage:
+          path: /var/tempo/generator/wal
+          remote_write:
+            - url: http://prometheus:9090/api/v1/write
+
+  prometheus-config:
+    content: |
+      global:
+        scrape_interval: 15s
+      scrape_configs:
+        - job_name: "otelcol-internal"
+          static_configs:
+            - targets: ["otel-collector:8888"]
+        - job_name: "otelcol-exporter"
+          static_configs:
+            - targets: ["otel-collector:9464"]
+        - job_name: "tempo"
+          static_configs:
+            - targets: ["tempo:3200"]
+
+  grafana-datasources:
+    content: |
+      apiVersion: 1
+      datasources:
+        - name: Prometheus
+          uid: prometheus
+          type: prometheus
+          access: proxy
+          orgId: 1
+          url: http://prometheus:9090
+          isDefault: true
+          editable: true
+        - name: Tempo
+          uid: tempo
+          type: tempo
+          access: proxy
+          orgId: 1
+          url: http://tempo:3200
+          editable: true
+          jsonData:
+            tracesToMetrics:
+              datasourceUid: prometheus
+            nodeGraph:
+              enabled: true
+
+volumes:
+  prometheus-data:
+  grafana-data:
+  tempo-data:
+```
+
+This launches:
+- **OTel Collector** - Receives traces on ports 4317 (gRPC) and 4318 (HTTP)
+- **Tempo** - Distributed tracing backend
+- **Prometheus** - Metrics collection
+- **Grafana** - Visualization dashboard
+
+Access Grafana at `http://localhost:4000` (the compose file maps host port 4000 to Grafana's container port 3000; default credentials: admin/admin).
+
+<Frame>
+  <img src="/media/grafana-otel-traces.png" alt="Bifrost OTel traces in Grafana" />
+</Frame>
+
+---
+
+## Popular Platform Integrations
+
+<Tabs group="platforms">
+<Tab title="Grafana Cloud">
+
+```json
+{
+  "plugins": [
+    {
+      "enabled": true,
+      "name": "otel",
+      "config": {
+        "service_name": "bifrost",
+        "collector_url": "https://otlp-gateway-prod-us-central-0.grafana.net/otlp",
+        "trace_type": "genai_extension",
+        "protocol": "http",
+        "headers": {
+          "Authorization": "env.GRAFANA_CLOUD_API_KEY"
+        }
+      }
+    }
+  ]
+}
+```
+
+Set the environment variable:
+```bash
+export GRAFANA_CLOUD_API_KEY="Basic <your-base64-encoded-token>"
+```
+
+</Tab>
+<Tab title="Datadog">
+
+```json
+{
+  "plugins": [
+    {
+      "enabled": true,
+      "name": "otel",
+      "config": {
+        "service_name": "bifrost",
+        "collector_url": "https://trace.agent.datadoghq.com",
+        "trace_type": "genai_extension",
+        "protocol": "http",
+        "headers": {
+          "DD-API-KEY": "env.DATADOG_API_KEY"
+        }
+      }
+    }
+  ]
+}
+```
+
+Set the environment variable:
+```bash
+export DATADOG_API_KEY="your-datadog-api-key"
+```
+
+</Tab>
+<Tab title="New Relic">
+
+```json
+{
+  "plugins": [
+    {
+      "enabled": true,
+      "name": "otel",
+      "config": {
+        "service_name": "bifrost",
+        "collector_url": "https://otlp.nr-data.net:4318",
+        "trace_type": "genai_extension",
+        "protocol": "http",
+        "headers": {
+          "api-key": "env.NEW_RELIC_LICENSE_KEY"
+        }
+      }
+    }
+  ]
+}
+```
+
+Set the environment variable:
+```bash
+export NEW_RELIC_LICENSE_KEY="your-license-key"
+```
+
+</Tab>
+<Tab title="Honeycomb">
+
+```json
+{
+  "plugins": [
+    {
+      "enabled": true,
+      "name": "otel",
+      "config": {
+        "service_name": "bifrost",
+        "collector_url": "https://api.honeycomb.io",
+        "trace_type": "genai_extension",
+        "protocol": "http",
+        "headers": {
+          "x-honeycomb-team": "env.HONEYCOMB_API_KEY",
+          "x-honeycomb-dataset": "bifrost-traces"
+        }
+      }
+    }
+  ]
+}
+```
+
+Set the environment variable:
+```bash
+export HONEYCOMB_API_KEY="your-api-key"
+```
+
+</Tab>
+<Tab title="Langfuse">
+
+[Langfuse](https://langfuse.com) is an open-source LLM observability platform that accepts OpenTelemetry traces via its OTLP endpoint.
+
+<Tabs>
+<Tab title="UI">
+
+Configure the OTel plugin with the following settings:
+
+| Field | Value |
+|-------|-------|
+| **Collector URL** | `https://cloud.langfuse.com/api/public/otel` (EU) or `https://us.cloud.langfuse.com/api/public/otel` (US) |
+| **Trace Type** | `genai_extension` |
+| **Protocol** | `http` (required - Langfuse does not support gRPC) |
+| **Headers** | `Authorization`: `env.LANGFUSE_AUTH` |
+
+</Tab>
+<Tab title="config.json">
+
+```json
+{
+  "plugins": [
+    {
+      "enabled": true,
+      "name": "otel",
+      "config": {
+        "service_name": "bifrost",
+        "collector_url": "https://cloud.langfuse.com/api/public/otel",
+        "trace_type": "genai_extension",
+        "protocol": "http",
+        "headers": {
+          "Authorization": "env.LANGFUSE_AUTH"
+        }
+      }
+    }
+  ]
+}
+```
+
+For US region, use `https://us.cloud.langfuse.com/api/public/otel` instead.
+
+</Tab>
+</Tabs>
+
+Set up the environment variable with your Langfuse API keys:
+
+```bash
+# Generate base64 auth string from your Langfuse API keys
+export LANGFUSE_AUTH="Basic $(echo -n 'pk-lf-xxx:sk-lf-xxx' | base64)"
+```
+
+Replace `pk-lf-xxx` and `sk-lf-xxx` with your Langfuse public and secret keys from your project settings.
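+
+If you would rather build the header value in code, the same string can be produced with Go's standard library. This is a minimal sketch, not part of Bifrost: `basicAuthHeader` is a hypothetical helper and the keys are the placeholders from above. It also avoids the line wrapping that some `base64` command-line implementations apply to long inputs.
+
+```go
+package main
+
+import (
+    "encoding/base64"
+    "fmt"
+)
+
+// basicAuthHeader is a hypothetical helper (not a Bifrost or Langfuse API).
+// It returns the value of an HTTP Basic Authorization header built from a
+// public/secret key pair, in the form "Basic base64(public:secret)".
+func basicAuthHeader(publicKey, secretKey string) string {
+    credentials := publicKey + ":" + secretKey
+    return "Basic " + base64.StdEncoding.EncodeToString([]byte(credentials))
+}
+
+func main() {
+    // Placeholder keys; substitute the values from your project settings.
+    fmt.Println(basicAuthHeader("pk-lf-xxx", "sk-lf-xxx"))
+}
+```
+
+Export the printed value as `LANGFUSE_AUTH` so the `env.LANGFUSE_AUTH` reference in the plugin config can pick it up.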
+
+<Note>
+Langfuse only supports HTTP protocol. Do not use gRPC.
+</Note>
+
+See the [Langfuse OpenTelemetry documentation](https://langfuse.com/integrations/native/opentelemetry) for more details.
+
+</Tab>
+<Tab title="Self-Hosted">
+
+Use the included Docker Compose stack or point to your own collector:
+
+```json
+{
+  "plugins": [
+    {
+      "enabled": true,
+      "name": "otel",
+      "config": {
+        "service_name": "bifrost",
+        "collector_url": "http://your-collector:4318",
+        "trace_type": "genai_extension",
+        "protocol": "http"
+      }
+    }
+  ]
+}
+```
+
+</Tab>
+</Tabs>
+
+---
+
+## Captured Data
+
+Each trace includes comprehensive LLM operation metadata following OpenTelemetry semantic conventions:
+
+### Span Attributes
+
+- **Span Name**: Based on request type (`gen_ai.chat`, `gen_ai.text`, `gen_ai.embedding`, etc.)
+- **Service Info**: `service.name=bifrost`, `service.version`
+- **Provider & Model**: `gen_ai.provider.name`, `gen_ai.request.model`
+
+### Request Parameters
+
+- Temperature, max_tokens, top_p, stop sequences
+- Presence/frequency penalties
+- Tool configurations and parallel tool calls
+- Custom parameters via `ExtraParams`
+
+### Input/Output Data
+
+- Complete chat history with role-based messages
+- Prompt text for completions
+- Response content with role attribution
+- Tool calls and results
+
+### Performance Metrics
+
+- Token usage (prompt, completion, total)
+- Cost calculations in dollars
+- Latency and timing (start/end timestamps)
+- Error details with status codes
+
+### Example Span
+
+```json
+{
+  "name": "gen_ai.chat",
+  "attributes": {
+    "gen_ai.provider.name": "openai",
+    "gen_ai.request.model": "gpt-4",
+    "gen_ai.request.temperature": 0.7,
+    "gen_ai.request.max_tokens": 1000,
+    "gen_ai.usage.prompt_tokens": 45,
+    "gen_ai.usage.completion_tokens": 128,
+    "gen_ai.usage.total_tokens": 173,
+    "gen_ai.usage.cost": 0.0052
+  }
+}
+```
+
+<Frame>
+  <img src="/media/grafana-otel-span-details.png" alt="Span details for a Bifrost trace in Grafana" />
+</Frame>
+
+---
+
+## Supported Request Types
+
+The OTel plugin captures all Bifrost request types:
+
+- **Chat Completion** (streaming and non-streaming) → `gen_ai.chat`
+- **Text Completion** (streaming and non-streaming) → `gen_ai.text`
+- **Embeddings** → `gen_ai.embedding`
+- **Speech Generation** (streaming and non-streaming) → `gen_ai.speech`
+- **Transcription** (streaming and non-streaming) → `gen_ai.transcription`
+- **Responses API** → `gen_ai.responses`
+
+---
+
+## Protocol Support
+
+### HTTP (OTLP/HTTP)
+
+Uses HTTP/1.1 or HTTP/2 with JSON or Protobuf encoding:
+
+```json
+{
+  "collector_url": "http://localhost:4318",
+  "protocol": "http"
+}
+```
+
+Default port: **4318**
+
+### gRPC (OTLP/gRPC)
+
+Uses gRPC with Protobuf encoding for lower latency:
+
+```json
+{
+  "collector_url": "localhost:4317",
+  "protocol": "grpc"
+}
+```
+
+Default port: **4317**
+
+---
+
+## Metrics Push (Cluster Mode)
+
+<Note>
+**Multi-node deployments**: If you are running multiple Bifrost nodes, use push-based metrics for accurate aggregation. Pull-based `/metrics` scraping may miss nodes behind a load balancer.
+</Note>
+
+The OTel plugin supports **push-based metrics export** via OTLP, which is essential for multi-node cluster deployments. Instead of relying on Prometheus scraping each node's `/metrics` endpoint (which can miss nodes behind a load balancer), all nodes actively push metrics to a central OTEL Collector.
+
+### Configuration
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `metrics_enabled` | `boolean` | ❌ No | Enable push-based metrics export (default: `false`) |
+| `metrics_endpoint` | `string` | ✅ Yes (if enabled) | OTLP metrics endpoint URL |
+| `metrics_push_interval` | `integer` | ❌ No | Push interval in seconds (default: `15`, range: 1-300) |
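+
+The rules in this table can be checked before a config is deployed. The sketch below is illustrative only: the `metricsConfig` struct and `validateMetricsConfig` function are hypothetical names, and treating a missing interval (zero) as "use the default of 15" is an assumption about how the default is applied.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "errors"
+    "fmt"
+)
+
+// metricsConfig mirrors the three fields in the table above.
+// The struct is hypothetical and is not Bifrost's own type.
+type metricsConfig struct {
+    MetricsEnabled      bool   `json:"metrics_enabled"`
+    MetricsEndpoint     string `json:"metrics_endpoint"`
+    MetricsPushInterval int    `json:"metrics_push_interval"`
+}
+
+// validateMetricsConfig applies the documented rules: the endpoint is
+// required when metrics are enabled, and the interval must be 1-300 seconds.
+// Assumption: an interval of 0 means "not set" and falls back to 15.
+func validateMetricsConfig(c metricsConfig) (metricsConfig, error) {
+    if !c.MetricsEnabled {
+        return c, nil // nothing to check when push metrics are off
+    }
+    if c.MetricsEndpoint == "" {
+        return c, errors.New("metrics_endpoint is required when metrics_enabled is true")
+    }
+    if c.MetricsPushInterval == 0 {
+        c.MetricsPushInterval = 15
+    }
+    if c.MetricsPushInterval < 1 || c.MetricsPushInterval > 300 {
+        return c, fmt.Errorf("metrics_push_interval must be between 1 and 300, got %d", c.MetricsPushInterval)
+    }
+    return c, nil
+}
+
+func main() {
+    raw := []byte(`{"metrics_enabled": true, "metrics_endpoint": "http://otel-collector:4318/v1/metrics"}`)
+    var cfg metricsConfig
+    if err := json.Unmarshal(raw, &cfg); err != nil {
+        fmt.Println("invalid JSON:", err)
+        return
+    }
+    cfg, err := validateMetricsConfig(cfg)
+    if err != nil {
+        fmt.Println("invalid config:", err)
+        return
+    }
+    fmt.Printf("pushing to %s every %ds\n", cfg.MetricsEndpoint, cfg.MetricsPushInterval)
+}
+```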
+
+### Example Configuration
+
+<Tabs group="metrics-config">
+<Tab title="HTTP Protocol">
+
+```json
+{
+  "plugins": [
+    {
+      "enabled": true,
+      "name": "otel",
+      "config": {
+        "service_name": "bifrost",
+        "collector_url": "http://otel-collector:4318/v1/traces",
+        "trace_type": "genai_extension",
+        "protocol": "http",
+        "metrics_enabled": true,
+        "metrics_endpoint": "http://otel-collector:4318/v1/metrics",
+        "metrics_push_interval": 15
+      }
+    }
+  ]
+}
+```
+
+</Tab>
+<Tab title="gRPC Protocol">
+
+```json
+{
+  "plugins": [
+    {
+      "enabled": true,
+      "name": "otel",
+      "config": {
+        "service_name": "bifrost",
+        "collector_url": "otel-collector:4317",
+        "trace_type": "genai_extension",
+        "protocol": "grpc",
+        "metrics_enabled": true,
+        "metrics_endpoint": "otel-collector:4317",
+        "metrics_push_interval": 15
+      }
+    }
+  ]
+}
+```
+
+</Tab>
+</Tabs>
+
+### Pushed Metrics
+
+These are the same **Prometheus-style metrics** from the telemetry plugin, pushed via OTLP protocol to a central collector:
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `bifrost_upstream_requests_total` | Counter | Total requests to upstream providers |
+| `bifrost_success_requests_total` | Counter | Successful upstream requests |
+| `bifrost_error_requests_total` | Counter | Error requests with status code labels |
+| `bifrost_input_tokens_total` | Counter | Total input tokens |
+| `bifrost_output_tokens_total` | Counter | Total output tokens |
+| `bifrost_cache_hits_total` | Counter | Cache hits |
+| `bifrost_cost_total` | Counter | Total cost in USD |
+| `bifrost_upstream_latency_seconds` | Histogram | Upstream request latency |
+| `bifrost_stream_first_token_latency_seconds` | Histogram | Time to first token |
+| `bifrost_stream_inter_token_latency_seconds` | Histogram | Inter-token latency |
+| `http_requests_total` | Counter | Total HTTP requests |
+| `http_request_duration_seconds` | Histogram | HTTP request duration |
+
+### OTEL Collector Configuration
+
+Configure your OTEL Collector to receive OTLP metrics and export to your preferred backend (Datadog, Prometheus, etc.):
+
+```yaml
+receivers:
+  otlp:
+    protocols:
+      grpc:
+        endpoint: 0.0.0.0:4317
+      http:
+        endpoint: 0.0.0.0:4318
+
+processors:
+  batch:
+    timeout: 10s
+    send_batch_size: 1000
+
+exporters:
+  # For Datadog
+  datadog:
+    api:
+      key: ${DD_API_KEY}
+  
+  # Or for Prometheus remote write
+  prometheusremotewrite:
+    endpoint: "http://prometheus:9090/api/v1/write"
+
+service:
+  pipelines:
+    metrics:
+      receivers: [otlp]
+      processors: [batch]
+      exporters: [datadog]  # or prometheusremotewrite
+```
+
+### Why Push vs Pull?
+
+| Aspect | Pull (`/metrics` scrape) | Push (OTEL metrics) |
+|--------|--------------------------|---------------------|
+| Load balancer | May miss nodes | All nodes push |
+| Service discovery | Required | Not required |
+| Scraper configuration | Per-node endpoints | Single collector |
+| Cluster aggregation | Query-side `sum()` | Collector handles it |
+
+For **single-node deployments**, pull-based `/metrics` scraping works well. For **multi-node clusters**, push-based metrics ensures all nodes are captured.
+
+---
+
+## Advanced Features
+
+### Automatic Span Management
+
+- Spans are tracked with a **20-minute TTL** in a `sync.Map`
+- Automatic cleanup prevents memory leaks for long-running processes
+- Handles streaming requests with accumulator for chunked responses
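+
+To make the TTL idea concrete, here is a minimal sketch of time-based cleanup over a `sync.Map`. It is not the plugin's code: `spanStore`, `spanEntry`, `sweep`, and `demo` are hypothetical names, and the plugin's actual data structures are not shown in this document.
+
+```go
+package main
+
+import (
+    "fmt"
+    "sync"
+    "time"
+)
+
+// spanEntry is a hypothetical record for an in-flight span.
+type spanEntry struct {
+    name      string
+    createdAt time.Time
+}
+
+// spanStore keeps in-flight spans in a sync.Map and forgets stale ones.
+type spanStore struct {
+    spans sync.Map // span ID (string) -> spanEntry
+    ttl   time.Duration
+}
+
+func (s *spanStore) put(id, name string, createdAt time.Time) {
+    s.spans.Store(id, spanEntry{name: name, createdAt: createdAt})
+}
+
+// sweep deletes entries older than the TTL and reports how many were removed.
+// Deleting inside Range is allowed by sync.Map.
+func (s *spanStore) sweep(now time.Time) int {
+    removed := 0
+    s.spans.Range(func(key, value any) bool {
+        if now.Sub(value.(spanEntry).createdAt) > s.ttl {
+            s.spans.Delete(key)
+            removed++
+        }
+        return true
+    })
+    return removed
+}
+
+// demo stores one stale and one fresh span, sweeps once, and returns
+// the number of removed and remaining entries.
+func demo() (int, int) {
+    store := &spanStore{ttl: 20 * time.Minute}
+    now := time.Now()
+    store.put("stale", "gen_ai.chat", now.Add(-25*time.Minute))
+    store.put("fresh", "gen_ai.chat", now.Add(-1*time.Minute))
+
+    removed := store.sweep(now)
+    remaining := 0
+    store.spans.Range(func(_, _ any) bool {
+        remaining++
+        return true
+    })
+    return removed, remaining
+}
+
+func main() {
+    removed, remaining := demo()
+    fmt.Printf("removed=%d remaining=%d\n", removed, remaining) // prints "removed=1 remaining=1"
+}
+```
+
+In a long-running process, `sweep` would typically be called from a background goroutine driven by a `time.Ticker`.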
+
+### Async Emission
+
+All span emissions happen asynchronously in background goroutines:
+
+```go
+// Emit in the background so the request path is not blocked
+go func() {
+    p.client.Emit(ctx, spans)
+}()
+```
+
+### Streaming Support
+
+The plugin accumulates streaming chunks and emits a single complete span when the stream finishes, providing accurate token counts and costs.
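+
+The accumulate-then-emit pattern can be sketched as follows. This is an illustration under stated assumptions, not the plugin's implementation: `streamChunk`, `spanSummary`, `accumulator`, and `collect` are hypothetical names, and the token total is assumed to arrive on the final chunk.
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// streamChunk is a hypothetical streaming delta.
+type streamChunk struct {
+    Content     string
+    TotalTokens int  // assumed to be reported on the final chunk only
+    Done        bool // true on the last chunk of the stream
+}
+
+// spanSummary is a hypothetical stand-in for the single span emitted per stream.
+type spanSummary struct {
+    Content     string
+    Chunks      int
+    TotalTokens int
+}
+
+// accumulator buffers chunk content until the stream finishes.
+type accumulator struct {
+    content strings.Builder
+    chunks  int
+}
+
+// add records one chunk and returns a summary only when the stream is done.
+func (a *accumulator) add(c streamChunk) (spanSummary, bool) {
+    a.content.WriteString(c.Content)
+    a.chunks++
+    if !c.Done {
+        return spanSummary{}, false
+    }
+    return spanSummary{
+        Content:     a.content.String(),
+        Chunks:      a.chunks,
+        TotalTokens: c.TotalTokens,
+    }, true
+}
+
+// collect feeds every chunk through one accumulator and returns the
+// summaries that were emitted (exactly one for a well-formed stream).
+func collect(chunks []streamChunk) []spanSummary {
+    var acc accumulator
+    var emitted []spanSummary
+    for _, c := range chunks {
+        if summary, done := acc.add(c); done {
+            emitted = append(emitted, summary)
+        }
+    }
+    return emitted
+}
+
+func main() {
+    emitted := collect([]streamChunk{
+        {Content: "Hello"},
+        {Content: ", "},
+        {Content: "world", TotalTokens: 7, Done: true},
+    })
+    fmt.Printf("%d span(s): %+v\n", len(emitted), emitted)
+}
+```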
+
+### Environment Variable Security
+
+Reference sensitive credentials with the `env.` prefix so the secret values stay out of config files:
+
+```json
+{
+  "headers": {
+    "Authorization": "env.OTEL_API_KEY"
+  }
+}
+```
+
+The plugin reads `OTEL_API_KEY` from the environment at runtime.
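+
+One way to picture the `env.` convention is a small resolver that swaps prefixed values for environment lookups. The sketch below is an assumption about the general pattern, not Bifrost's actual code; `resolveHeaders` and its error message are hypothetical.
+
+```go
+package main
+
+import (
+    "fmt"
+    "os"
+    "strings"
+)
+
+// envPrefix marks a header value as a reference to an environment variable.
+const envPrefix = "env."
+
+// resolveHeaders is a hypothetical helper. It copies headers, replacing any
+// value of the form "env.NAME" with the result of lookup("NAME"). The lookup
+// function is injected so the logic can be exercised without real variables.
+func resolveHeaders(headers map[string]string, lookup func(string) (string, bool)) (map[string]string, error) {
+    resolved := make(map[string]string, len(headers))
+    for name, value := range headers {
+        if !strings.HasPrefix(value, envPrefix) {
+            resolved[name] = value // literal value, keep as-is
+            continue
+        }
+        key := strings.TrimPrefix(value, envPrefix)
+        secret, ok := lookup(key)
+        if !ok {
+            return nil, fmt.Errorf("environment variable %s is not set (header %s)", key, name)
+        }
+        resolved[name] = secret
+    }
+    return resolved, nil
+}
+
+func main() {
+    headers := map[string]string{
+        "Authorization":       "env.OTEL_API_KEY",
+        "x-honeycomb-dataset": "bifrost-traces",
+    }
+    resolved, err := resolveHeaders(headers, os.LookupEnv)
+    if err != nil {
+        fmt.Println("config error:", err)
+        return
+    }
+    // Report the count only, so the secret is never printed.
+    fmt.Printf("resolved %d headers\n", len(resolved))
+}
+```
+
+Failing fast on a missing variable, as this sketch does, surfaces a misconfigured deployment at startup instead of as authentication errors later.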
+
+---
+
+## When to Use
+
+### OTel Plugin
+
+Choose the OTel plugin when you:
+
+- Have existing OpenTelemetry infrastructure
+- Need to correlate LLM traces with application traces
+- Require compliance with enterprise observability standards
+- Want vendor flexibility (switch backends without code changes)
+- Need multi-service distributed tracing
+
+### vs. Built-in Observability
+
+Use [Built-in Observability](./default) for:
+
+- Local development and testing
+- Simple self-hosted deployments
+- No external dependencies
+- Direct database access to logs
+
+### vs. Maxim Plugin
+
+Use the [Maxim Plugin](./maxim) for:
+
+- Advanced LLM evaluation and testing
+- Prompt engineering and experimentation
+- Team collaboration and governance
+- Production monitoring with alerts
+- Dataset management and curation
+
+---
+
+## Troubleshooting
+
+### Connection Issues
+
+Verify that the collector is reachable:
+
+```bash
+# Test HTTP endpoint (any HTTP response, even an error status for a plain GET, shows the port is reachable)
+curl -v http://localhost:4318/v1/traces
+
+# Test gRPC endpoint (requires grpcurl)
+grpcurl -plaintext localhost:4317 list
+```
+
+### Missing Traces
+
+Check Bifrost logs for emission errors:
+
+```bash
+# Enable debug logging
+bifrost-http --log-level debug
+```
+
+### Authentication Failures
+
+Verify environment variables are set:
+
+```bash
+echo $OTEL_API_KEY
+```
+
+---
+
+## Next Steps
+
+- **[Built-in Observability](./default)** - Local logging for development
+- **[Maxim Plugin](./maxim)** - Advanced LLM evaluation and monitoring
+- **[Telemetry](../telemetry)** - Prometheus metrics and dashboards
--- a/docs/features/observability/prometheus.mdx
+++ b/docs/features/observability/prometheus.mdx
@@ -0,0 +1,306 @@
+---
+title: "Prometheus"
+description: "Monitor Bifrost metrics with Prometheus scraping or Push Gateway for multi-node deployments"
+icon: "chart-line"
+---
+
+## Overview
+
+Bifrost exposes Prometheus metrics via two methods:
+
+1. **Pull-based (Scraping)**: Traditional `/metrics` endpoint that Prometheus can scrape
+2. **Push-based (Push Gateway)**: Push metrics to a Prometheus Push Gateway for cluster deployments
+
+<Note>
+  **For multi-node deployments**: Use the Push Gateway method to ensure accurate metric aggregation. Traditional scraping may miss nodes behind load balancers.
+</Note>
+
+---
+
+## Pull-based Scraping
+
+Bifrost automatically exposes a `/metrics` endpoint when the telemetry plugin is enabled (enabled by default). No additional configuration is needed.
+
+<Info>
+  When Bifrost's authentication is enabled (`auth_config.is_enabled = true`), the `/metrics` endpoint requires Basic auth credentials. You must include the same `admin_username` and `admin_password` from your `auth_config` in the Prometheus scrape configuration. Without this, Prometheus will receive `401 Unauthorized` responses and scraping will silently fail.
+</Info>
+
+### Prometheus Configuration
+
+Add Bifrost to your Prometheus `prometheus.yml`:
+
+```yaml
+scrape_configs:
+  - job_name: 'bifrost'
+    static_configs:
+      - targets: ['bifrost-host:8080']
+    scrape_interval: 15s
+```
+
+If Bifrost authentication is enabled, add `basic_auth` to your scrape config:
+
+```yaml
+scrape_configs:
+  - job_name: 'bifrost'
+    static_configs:
+      - targets: ['bifrost-host:8080']
+    scrape_interval: 15s
+    basic_auth:
+      username: '<admin_username>'
+      password: '<admin_password>'
+```
+
+### Endpoint
+
+```
+GET /metrics
+```
+
+Returns metrics in Prometheus exposition format.
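+
+As a quick check outside Prometheus, the endpoint can be fetched and filtered with a few lines of Go. This is a sketch under assumptions: `fetchMetrics` and `filterSamples` are hypothetical helpers, `bifrost-host:8080` is the placeholder target from the scrape config above, and the `BIFROST_ADMIN_USERNAME` / `BIFROST_ADMIN_PASSWORD` variable names are invented for this example.
+
+```go
+package main
+
+import (
+    "bufio"
+    "fmt"
+    "io"
+    "net/http"
+    "os"
+    "strings"
+    "time"
+)
+
+// fetchMetrics performs GET on the metrics URL, sending Basic auth
+// credentials only when a username is provided.
+func fetchMetrics(url, username, password string) (string, error) {
+    req, err := http.NewRequest(http.MethodGet, url, nil)
+    if err != nil {
+        return "", err
+    }
+    if username != "" {
+        req.SetBasicAuth(username, password)
+    }
+    client := &http.Client{Timeout: 5 * time.Second}
+    resp, err := client.Do(req)
+    if err != nil {
+        return "", err
+    }
+    defer resp.Body.Close()
+    if resp.StatusCode != http.StatusOK {
+        return "", fmt.Errorf("unexpected status %s", resp.Status)
+    }
+    body, err := io.ReadAll(resp.Body)
+    return string(body), err
+}
+
+// filterSamples returns the sample lines whose metric name starts with
+// prefix, skipping "# HELP" and "# TYPE" comment lines.
+func filterSamples(body, prefix string) []string {
+    var samples []string
+    scanner := bufio.NewScanner(strings.NewReader(body))
+    for scanner.Scan() {
+        line := scanner.Text()
+        if strings.HasPrefix(line, "#") {
+            continue
+        }
+        if strings.HasPrefix(line, prefix) {
+            samples = append(samples, line)
+        }
+    }
+    return samples
+}
+
+func main() {
+    body, err := fetchMetrics(
+        "http://bifrost-host:8080/metrics",
+        os.Getenv("BIFROST_ADMIN_USERNAME"), // hypothetical variable names
+        os.Getenv("BIFROST_ADMIN_PASSWORD"),
+    )
+    if err != nil {
+        fmt.Println("scrape failed:", err)
+        return
+    }
+    for _, line := range filterSamples(body, "bifrost_") {
+        fmt.Println(line)
+    }
+}
+```
+
+A `401 Unauthorized` status here points to the same credential problem that would make Prometheus scrapes fail.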
+
+---
+
+## Push-based (Push Gateway)
+
+For multi-node cluster deployments, Bifrost can push metrics to a [Prometheus Push Gateway](https://github.com/prometheus/pushgateway) through the `telemetry` plugin's `push_gateway` settings. This ensures all nodes' metrics are captured regardless of load balancer routing.
+
+### Configuration
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `push_gateway_url` | `string` | ✅ Yes | - | Push Gateway URL (e.g., `http://pushgateway:9091`) |
+| `job_name` | `string` | ❌ No | `bifrost` | Job label for pushed metrics |
+| `instance_id` | `string` | ❌ No | hostname | Instance identifier for metric grouping |
+| `push_interval` | `integer` | ❌ No | `15` | Push interval in seconds (1-300) |
+| `basic_auth` | `object` | ❌ No | - | Basic auth credentials |
+
+### Basic Auth Configuration
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `username` | `string` | ✅ Yes | Basic auth username |
+| `password` | `string` | ✅ Yes | Basic auth password |
+
+---
+
+## Setup
+
+<Tabs group="setup-method">
+<Tab title="UI">
+
+1. Navigate to **Observability** → **Prometheus** in the Bifrost UI
+2. The `/metrics` endpoint is shown at the top for scraping configuration
+3. To enable Push Gateway:
+   - Enter the **Push Gateway URL**
+   - Configure **Job Name** and **Push Interval** as needed
+   - Optionally set a custom **Instance ID**
+   - Enable **Basic Authentication** if required
+   - Toggle **Enable Push Gateway** on
+   - Click **Save Prometheus Configuration**
+
+</Tab>
+<Tab title="Config File">
+
+```json
+{
+  "plugins": [
+    {
+      "name": "telemetry",
+      "enabled": true,
+      "config": {
+        "push_gateway": {
+          "enabled": true,
+          "push_gateway_url": "http://pushgateway:9091",
+          "job_name": "bifrost",
+          "push_interval": 15
+        }
+      }
+    }
+  ]
+}
+```
+
+### With Basic Auth
+
+```json
+{
+  "plugins": [
+    {
+      "name": "telemetry",
+      "enabled": true,
+      "config": {
+        "push_gateway": {
+          "enabled": true,
+          "push_gateway_url": "http://pushgateway:9091",
+          "job_name": "bifrost",
+          "push_interval": 15,
+          "instance_id": "bifrost-node-1",
+          "basic_auth": {
+            "username": "admin",
+            "password": "secret"
+          }
+        }
+      }
+    }
+  ]
+}
+```
+
+</Tab>
+</Tabs>
+
+---
+
+## Available Metrics
+
+The following metrics are available from both the `/metrics` endpoint and Push Gateway:
+
+### HTTP Metrics
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `http_requests_total` | Counter | Total HTTP requests by path, method, status |
+| `http_request_duration_seconds` | Histogram | HTTP request latency |
+| `http_request_size_bytes` | Histogram | Request body size |
+| `http_response_size_bytes` | Histogram | Response body size |
+
+### Bifrost LLM Metrics
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `bifrost_upstream_requests_total` | Counter | Total requests to LLM providers |
+| `bifrost_upstream_latency_seconds` | Histogram | Provider request latency |
+| `bifrost_success_requests_total` | Counter | Successful provider requests |
+| `bifrost_error_requests_total` | Counter | Failed provider requests |
+| `bifrost_input_tokens_total` | Counter | Total input tokens processed |
+| `bifrost_output_tokens_total` | Counter | Total output tokens generated |
+| `bifrost_cost_total` | Counter | Total cost in USD |
+| `bifrost_cache_hits_total` | Counter | Cache hits by type |
+| `bifrost_stream_first_token_latency_seconds` | Histogram | Time to first token (streaming) |
+| `bifrost_stream_inter_token_latency_seconds` | Histogram | Inter-token latency (streaming) |
+| `bifrost_key_rotation_events_total` | Counter | Per-attempt retry/rotation events with key identifiers (see below) <sup>v1.5.0-prerelease4+</sup> |
+
+### Default Labels
+
+All Bifrost metrics include these labels:
+
+- `provider` - LLM provider name
+- `model` - Model identifier
+- `method` - Request type (chat, completion, embedding, etc.)
+- `virtual_key_id` / `virtual_key_name` - Virtual key identifiers
+- `selected_key_id` / `selected_key_name` - API key that successfully served the request (`""` when all attempts failed)
+- `number_of_retries` - Total attempts minus one (across all keys)
+- `fallback_index` - Fallback position
+- `team_id` / `team_name` - Team identifiers (if governance enabled)
+- `customer_id` / `customer_name` - Customer identifiers (if governance enabled)
+
+<Note>
+  **v1.5.0-prerelease4+**: `selected_key_id` / `selected_key_name` are only populated when the request succeeds. On final errors both are empty; use `bifrost_key_rotation_events_total` or the `attempt_trail` log field to see which keys were tried.
+</Note>
+
+### Key Rotation Events <sup>v1.5.0-prerelease4+</sup>
+
+`bifrost_key_rotation_events_total` is incremented once per **failed attempt** (not per request), giving you time-series visibility into retry pressure:
+
+| Label | Values | Description |
+|-------|--------|-------------|
+| `provider` | e.g. `openai` | LLM provider |
+| `requested_model` | e.g. `gpt-4o` | Model as requested (before any alias resolution) |
+| `key_id` | UUID | The provider API key that failed on this attempt |
+| `key_name` | string | Human-readable name of the provider API key |
+| `fail_reason` | error type string | Provider error type (e.g. `rate_limit_error`, `network_error`) |
+
+**Example queries:**
+
+```promql
+# Failed-attempt rate per provider and failure reason over time
+sum by (provider, fail_reason) (
+  rate(bifrost_key_rotation_events_total[5m])
+)
+
+# Which specific keys are hitting rate limits most often
+topk(5, sum by (provider, key_name, fail_reason) (
+  rate(bifrost_key_rotation_events_total{fail_reason="rate_limit_error"}[1h])
+))
+```
+
+---
+
+## Push Gateway Setup
+
+If you don't have a Push Gateway running, deploy one:
+
+### Docker
+
+```bash
+docker run -d -p 9091:9091 prom/pushgateway
+```
+
+### Kubernetes (Helm)
+
+```bash
+helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
+helm install pushgateway prometheus-community/prometheus-pushgateway
+```
+
+### Configure Prometheus to Scrape Push Gateway
+
+Add to your `prometheus.yml`:
+
+```yaml
+scrape_configs:
+  - job_name: 'pushgateway'
+    honor_labels: true
+    static_configs:
+      - targets: ['pushgateway:9091']
+```
+
+<Note>
+  The `honor_labels: true` setting is important - it preserves the `job` and `instance` labels pushed by Bifrost instead of overwriting them with the Push Gateway's labels.
+</Note>
+
+---
+
+## Pull vs Push: When to Use Each
+
+| Scenario | Recommended Method |
+|----------|-------------------|
+| Single Bifrost instance | Pull (scraping) |
+| Multiple instances, direct access | Pull (scraping) |
+| Multiple instances behind load balancer | **Push (Push Gateway)** |
+| Kubernetes with service mesh | Pull or Push |
+| Serverless / ephemeral instances | **Push (Push Gateway)** |
+
+### Why Push for Clusters?
+
+When multiple Bifrost instances run behind a load balancer:
+
+1. **Scraping randomness**: Each scrape may hit different nodes, missing metrics from others
+2. **Instance tracking**: Push Gateway properly tracks per-instance metrics via `instance` label
+3. **Aggregation**: Downstream tools (Grafana, Datadog) can aggregate across all instances
+
+---
+
+## Troubleshooting
+
+### Push Gateway Connection Failed
+
+```
+failed to push metrics to push gateway: connection refused
+```
+
+- Verify the Push Gateway URL is correct and reachable from Bifrost
+- Check firewall rules between Bifrost and Push Gateway
+- Ensure Push Gateway is running: `curl http://pushgateway:9091/metrics`
+
+### Metrics Not Appearing
+
+- Verify the telemetry plugin is enabled (required for metrics collection)
+- Check Bifrost logs for push errors
+- Verify Prometheus is scraping the Push Gateway with `honor_labels: true`
+
+### Authentication Failed
+
+- Double-check username and password
+- Ensure basic auth is configured on the Push Gateway side
+- Check for special characters that may need escaping
--- a/docs/features/plugins/circuit-breaker.mdx
+++ b/docs/features/plugins/circuit-breaker.mdx
--- a/docs/features/plugins/jsonparser.mdx
+++ b/docs/features/plugins/jsonparser.mdx
@@ -0,0 +1,306 @@
+---
+title: JSON Parser
+description: A simple Bifrost plugin that handles partial JSON chunks in streaming responses by making them valid JSON objects.
+icon: "code-branch"
+---
+
+## Overview
+
+When using AI providers that stream JSON responses, the individual chunks often contain incomplete JSON that cannot be parsed directly. This plugin automatically detects and fixes partial JSON chunks by adding the necessary closing braces, brackets, and quotes to make them valid JSON.
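+
+To illustrate the idea, here is a minimal sketch of partial-JSON completion in Go. It is not the plugin's implementation: `completePartialJSON` is a hypothetical function that shows the general technique of tracking string state and open brackets, then falling back to the original text when the repaired result is still invalid.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+)
+
+// completePartialJSON appends the closing quote, braces, and brackets that a
+// truncated JSON fragment is missing. It returns the input unchanged when the
+// fragment is already valid or cannot be repaired this way.
+func completePartialJSON(s string) string {
+    if json.Valid([]byte(s)) {
+        return s
+    }
+    var closers []byte // stack of pending '}' and ']'
+    inString, escaped := false, false
+    for i := 0; i < len(s); i++ {
+        c := s[i]
+        switch {
+        case escaped:
+            escaped = false // character after a backslash is taken literally
+        case inString && c == '\\':
+            escaped = true
+        case c == '"':
+            inString = !inString
+        case inString:
+            // ordinary character inside a string: nothing to track
+        case c == '{':
+            closers = append(closers, '}')
+        case c == '[':
+            closers = append(closers, ']')
+        case c == '}' || c == ']':
+            if len(closers) > 0 {
+                closers = closers[:len(closers)-1]
+            }
+        }
+    }
+    fixed := []byte(s)
+    if inString {
+        fixed = append(fixed, '"') // close the unterminated string first
+    }
+    for i := len(closers) - 1; i >= 0; i-- {
+        fixed = append(fixed, closers[i]) // innermost container closes first
+    }
+    if json.Valid(fixed) {
+        return string(fixed)
+    }
+    return s // safe fallback: leave the chunk untouched
+}
+
+func main() {
+    fmt.Println(completePartialJSON(`{"name":"Jo`))    // prints {"name":"Jo"}
+    fmt.Println(completePartialJSON(`{"items":[1,2`)) // prints {"items":[1,2]}
+    fmt.Println(completePartialJSON(`{"a":`))          // cannot be repaired; printed unchanged
+}
+```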
+
+## Features
+
+- **Automatic JSON Completion**: Detects partial JSON and adds missing closing characters
+- **Streaming Only**: Processes only streaming responses (non-streaming responses are ignored)
+- **Flexible Usage Modes**: Supports two usage types for different deployment scenarios
+- **Safe Fallback**: Returns original content if JSON cannot be fixed
+- **Memory Leak Prevention**: Automatic cleanup of stale accumulated content with configurable intervals
+- **Zero Dependencies**: Only depends on Go's standard library
+
+## Usage
+
+### Usage Types
+
+The plugin supports two usage types:
+
+1. **AllRequests**: Processes all streaming responses automatically
+2. **PerRequest**: Processes only when explicitly enabled via request context
+
+### AllRequests Mode
+
+```go
+package main
+
+import (
+    "context"
+    "time"
+
+    "github.com/maximhq/bifrost/core"
+    "github.com/maximhq/bifrost/core/schemas"
+    "github.com/maximhq/bifrost/plugins/jsonparser"
+)
+
+func main() {
+    // Create the JSON parser plugin for all requests
+    jsonPlugin := jsonparser.NewJsonParserPlugin(jsonparser.PluginConfig{
+        Usage:           jsonparser.AllRequests,
+        CleanupInterval: 2 * time.Minute,  // Cleanup every 2 minutes
+        MaxAge:          10 * time.Minute,  // Remove entries older than 10 minutes
+    })
+    
+    // Initialize Bifrost with the plugin
+    client, err := bifrost.Init(context.Background(), schemas.BifrostConfig{
+        Account: &MyAccount{},
+        LLMPlugins: []schemas.LLMPlugin{
+            jsonPlugin,
+        },
+    })
+    
+    if err != nil {
+        panic(err)
+    }
+    
+    // Use the client normally - JSON parsing happens automatically
+    // in the PostLLMHook for all streaming responses
+    _ = client // placeholder so the example compiles; replace with real requests
+}
+```
+
+### PerRequest Mode
+
+```go
+package main
+
+import (
+    "context"
+    "time"
+    "github.com/maximhq/bifrost/core"
+    "github.com/maximhq/bifrost/core/schemas"
+    "github.com/maximhq/bifrost/plugins/jsonparser"
+)
+
+func main() {
+    // Create the JSON parser plugin for per-request control
+    jsonPlugin := jsonparser.NewJsonParserPlugin(jsonparser.PluginConfig{
+        Usage:           jsonparser.PerRequest,
+        CleanupInterval: 2 * time.Minute,  // Cleanup every 2 minutes
+        MaxAge:          10 * time.Minute,  // Remove entries older than 10 minutes
+    })
+    
+    // Initialize Bifrost with the plugin
+    client, err := bifrost.Init(context.Background(), schemas.BifrostConfig{
+        Account: &MyAccount{},
+        LLMPlugins: []schemas.LLMPlugin{
+            jsonPlugin,
+        },
+    })
+    
+    if err != nil {
+        panic(err)
+    }
+
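+    // Enable JSON parsing for this specific request via the context key
+    ctx := context.WithValue(context.Background(), jsonparser.EnableStreamingJSONParser, true)
+
+    // `request` is the chat completion request, built elsewhere (omitted here)
+    stream, bifrostErr := client.ChatCompletionStreamRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), request)
+    if bifrostErr != nil {
+        // handle the error and stop before reading from the stream
+        return
+    }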
+    for chunk := range stream {
+        _ = chunk // handle each streaming chunk
+    }
+}
+```
+
+### Configuration
+
+```go
+// Custom cleanup configuration
+plugin := jsonparser.NewJsonParserPlugin(jsonparser.PluginConfig{
+    Usage:           jsonparser.AllRequests,
+    CleanupInterval: 2 * time.Minute,  // Cleanup every 2 minutes
+    MaxAge:          10 * time.Minute,  // Remove entries older than 10 minutes
+})
+```
+
+#### Default Values
+
+- **CleanupInterval**: 5 minutes (how often to run cleanup)
+- **MaxAge**: 30 minutes (how old entries can be before cleanup)
+- **Usage**: Must be specified (AllRequests or PerRequest)
+
+### Context Key for PerRequest Mode
+
+When using `PerRequest` mode, the plugin checks for the context key `jsonparser.EnableStreamingJSONParser` with a boolean value:
+
+- `true`: Enable JSON parsing for this request
+- `false`: Disable JSON parsing for this request
+- Key not present: Disable JSON parsing for this request
+
+**Example:**
+
+```go
+import (
+    "context"
+
+    "github.com/maximhq/bifrost/plugins/jsonparser"
+)
+
+// Enable JSON parsing for this request
+ctx := context.WithValue(context.Background(), jsonparser.EnableStreamingJSONParser, true)
+
+// Disable JSON parsing for this request
+ctx = context.WithValue(context.Background(), jsonparser.EnableStreamingJSONParser, false)
+
+// No context key - JSON parsing disabled (default behavior)
+ctx = context.Background()
+```
+
+## How It Works
+
+The plugin implements an optimized `parsePartialJSON` function with the following steps:
+
+1. **Usage Check**: Determines if processing should occur based on usage type and context
+2. **Input Validation**: First tries to parse the string as valid JSON
+3. **Character Analysis**: If invalid, processes the string character-by-character to track:
+   - String boundaries (inside/outside quotes)
+   - Escape sequences
+   - Opening/closing braces and brackets
+4. **Auto-Completion**: Adds missing closing characters in the correct order
+5. **Validation**: Verifies the completed JSON is valid
+6. **Fallback**: Returns original content if completion fails
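+
+The steps above can be sketched as follows. This is a minimal illustration, not the plugin's actual `parsePartialJSON`: the function name `completePartialJSON` and its exact behavior are assumptions. It only closes open strings, braces, and brackets, so inputs such as a key without a value fall through to the fallback step.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "strings"
+)
+
+// completePartialJSON is a hypothetical helper that mirrors the documented steps.
+func completePartialJSON(s string) string {
+    trimmed := strings.TrimSpace(s)
+    if trimmed == "" {
+        return "{}" // empty or whitespace-only input becomes an empty object
+    }
+    // Validate input: already-valid JSON is returned untouched.
+    if json.Valid([]byte(trimmed)) {
+        return trimmed
+    }
+
+    // Character analysis: track string boundaries, escapes, and open containers.
+    var closers []byte // closing characters still owed, innermost last
+    inString, escaped := false, false
+    for i := 0; i < len(trimmed); i++ {
+        c := trimmed[i]
+        switch {
+        case escaped:
+            escaped = false // this character is consumed by the escape
+        case inString && c == '\\':
+            escaped = true
+        case c == '"':
+            inString = !inString
+        case inString:
+            // ordinary character inside a string: nothing to track
+        case c == '{':
+            closers = append(closers, '}')
+        case c == '[':
+            closers = append(closers, ']')
+        case c == '}' || c == ']':
+            if len(closers) > 0 {
+                closers = closers[:len(closers)-1]
+            }
+        }
+    }
+
+    // Auto-completion: close an open string, then containers in reverse order.
+    var b strings.Builder
+    b.WriteString(trimmed)
+    if inString {
+        b.WriteByte('"')
+    }
+    for i := len(closers) - 1; i >= 0; i-- {
+        b.WriteByte(closers[i])
+    }
+
+    // Validation and fallback: return the original content if completion failed.
+    if out := b.String(); json.Valid([]byte(out)) {
+        return out
+    }
+    return s
+}
+```
+
+With this sketch, `completePartialJSON` turns `{"user": {"name": "Jo` into `{"user": {"name": "Jo"}}`, while an input it cannot repair, such as `{"name"`, is returned unchanged.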
+
+### Memory Management
+
+The plugin automatically manages memory by:
+
+1. **Accumulating Content**: Stores partial JSON chunks with timestamps for each request
+2. **Periodic Cleanup**: Runs a background goroutine that removes stale entries based on `MaxAge`
+3. **Request Completion**: Automatically clears accumulated content when requests complete successfully
+4. **Configurable Intervals**: Allows customization of cleanup frequency and retention periods
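+
+The accumulate-and-expire pattern described above might look roughly like this. The types and method names (`accumulator`, `add`, `cleanup`, `startCleanup`) are hypothetical and are not taken from the plugin source; the sketch only shows how timestamps, `MaxAge`, and a ticker-driven goroutine fit together.
+
+```go
+package main
+
+import (
+    "sync"
+    "time"
+)
+
+// entry holds the content accumulated so far for one request.
+type entry struct {
+    content   string
+    updatedAt time.Time
+}
+
+// accumulator is a hypothetical per-request store with age-based cleanup.
+type accumulator struct {
+    mu      sync.Mutex
+    entries map[string]*entry
+    maxAge  time.Duration
+}
+
+func newAccumulator(maxAge time.Duration) *accumulator {
+    return &accumulator{entries: make(map[string]*entry), maxAge: maxAge}
+}
+
+// add appends a chunk for the request and returns the accumulated content.
+func (a *accumulator) add(requestID, chunk string) string {
+    a.mu.Lock()
+    defer a.mu.Unlock()
+    e, ok := a.entries[requestID]
+    if !ok {
+        e = &entry{}
+        a.entries[requestID] = e
+    }
+    e.content += chunk
+    e.updatedAt = time.Now()
+    return e.content
+}
+
+// clear drops a request's content, e.g. when its stream completes.
+func (a *accumulator) clear(requestID string) {
+    a.mu.Lock()
+    defer a.mu.Unlock()
+    delete(a.entries, requestID)
+}
+
+// cleanup removes entries older than maxAge and reports how many were removed.
+func (a *accumulator) cleanup(now time.Time) int {
+    a.mu.Lock()
+    defer a.mu.Unlock()
+    removed := 0
+    for id, e := range a.entries {
+        if now.Sub(e.updatedAt) > a.maxAge {
+            delete(a.entries, id)
+            removed++
+        }
+    }
+    return removed
+}
+
+// startCleanup runs cleanup on every tick until stop is closed.
+func (a *accumulator) startCleanup(interval time.Duration, stop <-chan struct{}) {
+    go func() {
+        ticker := time.NewTicker(interval)
+        defer ticker.Stop()
+        for {
+            select {
+            case now := <-ticker.C:
+                a.cleanup(now)
+            case <-stop:
+                return
+            }
+        }
+    }()
+}
+```
+
+In this sketch, `interval` plays the role of `CleanupInterval` and `maxAge` the role of `MaxAge`; a shorter interval frees memory sooner at the cost of more frequent map scans.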
+
+### Real-Life Streaming Example
+
+Here's a practical example showing how the JSON parser plugin fixes broken JSON chunks in streaming responses:
+
+```go
+package main
+
+import (
+    "context"
+    "encoding/json"
+    "fmt"
+    "time"
+    "github.com/maximhq/bifrost/core"
+    "github.com/maximhq/bifrost/core/schemas"
+    "github.com/maximhq/bifrost/plugins/jsonparser"
+)
+
+func main() {
+    // Create JSON parser plugin
+    jsonPlugin := jsonparser.NewJsonParserPlugin(jsonparser.PluginConfig{
+        Usage:           jsonparser.AllRequests,
+        CleanupInterval: 2 * time.Minute,
+        MaxAge:          10 * time.Minute,
+    })
+    
+    // Initialize Bifrost with the plugin
+    client, err := bifrost.Init(context.Background(), schemas.BifrostConfig{
+        Account: &MyAccount{},
+        LLMPlugins: []schemas.LLMPlugin{jsonPlugin},
+    })
+    if err != nil {
+        panic(err)
+    }
+    defer client.Shutdown()
+
+    // Request structured JSON response  
+    request := &schemas.BifrostChatRequest{
+        Provider: schemas.OpenAI,
+        Model:    "gpt-4o-mini",
+        Input: []schemas.ChatMessage{
+            {
+                Role: schemas.ChatMessageRoleUser,
+                Content: schemas.ChatMessageContent{
+                    ContentStr: bifrost.Ptr("Return user profile as JSON: {\"name\": \"John Doe\", \"email\": \"john@example.com\"}"),
+                },
+            },
+        },
+    }
+
+    // Stream the response
+    stream, bifrostErr := client.ChatCompletionStreamRequest(schemas.NewBifrostContext(context.Background(), schemas.NoDeadline), request)
+    if bifrostErr != nil {
+        panic(bifrostErr)
+    }
+
+    fmt.Println("Streaming JSON response:")
+    for chunk := range stream {
+        if chunk.BifrostChatResponse != nil && len(chunk.BifrostChatResponse.Choices) > 0 {
+            choice := chunk.BifrostChatResponse.Choices[0]
+            if choice.ChatStreamResponseChoice != nil && choice.ChatStreamResponseChoice.Delta != nil && choice.ChatStreamResponseChoice.Delta.Content != nil {
+                content := *choice.ChatStreamResponseChoice.Delta.Content
+                fmt.Printf("Chunk: %s\n", content)
+
+                // With JSON parser, you can parse each chunk immediately
+                var jsonData map[string]interface{}
+                if err := json.Unmarshal([]byte(content), &jsonData); err == nil {
+                    fmt.Printf("✅ Valid JSON parsed successfully\n")
+                } else {
+                    fmt.Printf("❌ Invalid JSON: %v\n", err)
+                }
+            }
+        }
+    }
+}
+```
+
+**Without JSON Parser** (raw streaming chunks):
+```
+Chunk 1: `{`                    ❌ Invalid JSON
+Chunk 2: `{"name"`              ❌ Invalid JSON  
+Chunk 3: `{"name": "John"`      ❌ Invalid JSON
+Chunk 4: `{"name": "John Doe"`  ❌ Invalid JSON
+```
+
+**With JSON Parser** (processed chunks):
+```
+Chunk 1: `{}`                               ✅ Valid JSON
+Chunk 2: `{"name": ""}`                     ✅ Valid JSON
+Chunk 3: `{"name": "John"}`                 ✅ Valid JSON  
+Chunk 4: `{"name": "John Doe"}`             ✅ Valid JSON
+```
+
+### Use Cases
+
+- **Function Calling**: Stream tool call arguments as valid JSON throughout the response
+- **Structured Data**: Stream complex JSON objects (user profiles, product catalogs) progressively
+- **Real-time Parsing**: Enable client-side JSON parsing at each streaming step without waiting for completion
+- **API Integration**: Forward streaming JSON to downstream services that expect valid JSON
+- **Live Updates**: Update UI components with valid JSON data as it streams in
+
+### Example Transformations
+
+| Input | Output |
+|-------|--------|
+| `{"name": "John"` | `{"name": "John"}` |
+| `["apple", "banana"` | `["apple", "banana"]` |
+| `{"user": {"name": "John"` | `{"user": {"name": "John"}}` |
+| `{"message": "Hello\nWorld"` | `{"message": "Hello\nWorld"}` |
+| `""` (empty string) | `{}` |
+| `"   "` (whitespace only) | `{}` |
+
+## Testing
+
+Run the test suite:
+
+```bash
+cd plugins/jsonparser
+go test -v
+```
+
+The tests cover:
+- Plugin interface compliance
+- Both usage types (AllRequests and PerRequest)
+- Context-based enabling/disabling
+- Streaming responses only (non-streaming responses are ignored)
+- Various JSON completion scenarios
+- Edge cases and error conditions
+- Memory cleanup functionality with real and simulated requests
+- Configuration options and default values
--- a/docs/features/plugins/mocker.mdx
+++ b/docs/features/plugins/mocker.mdx
@@ -0,0 +1,566 @@
+---
+title: "Mocker"
+description: "Mock AI provider responses for testing, development, and simulation purposes."
+icon: "mask"
+---
+
+## Quick Start
+
+### Minimal Configuration
+
+The simplest way to use the Mocker plugin is with no configuration - it will create a default catch-all rule:
+
+```go
+package main
+
+import (
+    "context"
+    bifrost "github.com/maximhq/bifrost/core"
+    "github.com/maximhq/bifrost/core/schemas"
+    mocker "github.com/maximhq/bifrost/plugins/mocker"
+)
+
+func main() {
+    // Create plugin with minimal config
+    plugin, err := mocker.NewMockerPlugin(mocker.MockerConfig{
+        Enabled: true, // Default rule will be created automatically
+    })
+    if err != nil {
+        panic(err)
+    }
+
+    // Initialize Bifrost with the plugin
+    client, initErr := bifrost.Init(context.Background(), schemas.BifrostConfig{
+        Account: &yourAccount,
+        LLMPlugins: []schemas.LLMPlugin{plugin},
+    })
+    if initErr != nil {
+        panic(initErr)
+    }
+    defer client.Shutdown()
+
+    // All chat and responses requests will now return: "This is a mock response from the Mocker plugin"
+    
+    // Chat completion request
+    chatResponse, _ := client.ChatCompletionRequest(schemas.NewBifrostContext(context.Background(), schemas.NoDeadline), &schemas.BifrostChatRequest{
+        Provider: schemas.OpenAI,
+        Model:    "gpt-4",
+        Input: []schemas.ChatMessage{
+            {
+                Role: schemas.ChatMessageRoleUser,
+                Content: schemas.ChatMessageContent{
+                    ContentStr: bifrost.Ptr("Hello!"),
+                },
+            },
+        },
+    })
+    
+    // Responses request
+    responsesResponse, _ := client.ResponsesRequest(schemas.NewBifrostContext(context.Background(), schemas.NoDeadline), &schemas.BifrostResponsesRequest{
+        Provider: schemas.OpenAI,
+        Model:    "gpt-4o",
+        Input: []schemas.ResponsesMessage{
+            {
+                Role: bifrost.Ptr(schemas.ResponsesInputMessageRoleUser),
+                Content: &schemas.ResponsesMessageContent{
+                    ContentStr: bifrost.Ptr("Hello!"),
+                },
+            },
+        },
+    })
+
+    // Inspect the mocked responses here; the blank assignment keeps the example compiling
+    _, _ = chatResponse, responsesResponse
+}
+```
+
+### Custom Response
+
+```go
+plugin, err := mocker.NewMockerPlugin(mocker.MockerConfig{
+    Enabled: true,
+    Rules: []mocker.MockRule{
+        {
+            Name:        "openai-mock",
+            Enabled:     true,
+            Probability: 1.0, // Always trigger
+            Conditions: mocker.Conditions{
+                Providers: []string{"openai"},
+            },
+            Responses: []mocker.Response{
+                {
+                    Type: mocker.ResponseTypeSuccess,
+                    Content: &mocker.SuccessResponse{
+                        Message: "Hello! This is a custom mock response for OpenAI.",
+                        Usage: &mocker.Usage{
+                            PromptTokens:     15,
+                            CompletionTokens: 25,
+                            TotalTokens:      40,
+                        },
+                    },
+                },
+            },
+        },
+    },
+})
+```
+
+### Responses Request Example
+
+The mocker plugin automatically handles both chat completion and responses requests with the same configuration:
+
+```go
+// This rule will work for both ChatCompletionRequest and ResponsesRequest
+{
+    Name:        "universal-mock",
+    Enabled:     true,
+    Probability: 1.0,
+    Conditions: mocker.Conditions{
+        MessageRegex: stringPtr("(?i).*hello.*"),
+    },
+    Responses: []mocker.Response{
+        {
+            Type: mocker.ResponseTypeSuccess,
+            Content: &mocker.SuccessResponse{
+                Message: "Hello! I'm a mock response that works for both request types.",
+            },
+        },
+    },
+}
+```
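+
+The rule above calls `stringPtr`, which these snippets do not define (later examples also use `boolPtr`). A minimal sketch of such helpers, assuming Go 1.18+ generics, is shown here; the names are the ones the examples use, but the definitions are illustrative and not part of the mocker package. The `bifrost.Ptr` helper used in the Quick Start example may serve the same purpose.
+
+```go
+package main
+
+// ptr returns a pointer to a copy of v (hypothetical generic helper).
+func ptr[T any](v T) *T { return &v }
+
+// stringPtr returns a pointer to s, for optional string fields such as MessageRegex.
+func stringPtr(s string) *string { return ptr(s) }
+
+// boolPtr returns a pointer to b, for optional bool fields such as AllowFallbacks.
+func boolPtr(b bool) *bool { return ptr(b) }
+```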
+
+## Installation
+
+Add the plugin to your project:
+
+   ```bash
+   go get github.com/maximhq/bifrost/plugins/mocker
+   ```
+
+Import in your code:
+
+   ```go
+   import mocker "github.com/maximhq/bifrost/plugins/mocker"
+   ```
+
+## Basic Usage
+
+### Creating the Plugin
+
+```go
+config := mocker.MockerConfig{
+    Enabled: true,
+    DefaultBehavior: mocker.DefaultBehaviorPassthrough, // "passthrough", "success", "error"
+    Rules: []mocker.MockRule{
+        // Your rules here
+    },
+}
+
+plugin, err := mocker.NewMockerPlugin(config)
+if err != nil {
+    log.Fatal(err)
+}
+```
+
+### Adding to Bifrost
+
+```go
+client, initErr := bifrost.Init(context.Background(), schemas.BifrostConfig{
+    Account: &yourAccount,
+    LLMPlugins: []schemas.LLMPlugin{plugin},
+    Logger: bifrost.NewDefaultLogger(schemas.LogLevelInfo),
+})
+```
+
+### Disabling the Plugin
+
+```go
+config := mocker.MockerConfig{
+    Enabled: false, // All requests pass through to real providers
+}
+```
+
+## Supported Request Types
+
+The Mocker plugin supports the following Bifrost request types:
+
+- **Chat Completion Requests** (`ChatCompletionRequest`) - Standard chat-based interactions
+- **Responses Requests** (`ResponsesRequest`) - OpenAI-compatible responses API format
+
+In addition, the `"skip-mocker"` context key lets you bypass mocking for individual requests, as described in the next section.
+
+### Skip Mocker for Specific Requests
+
+You can skip the mocker plugin for specific requests by adding a context key:
+
+```go
+import "github.com/maximhq/bifrost/core/schemas"
+
+// Create context that skips mocker
+ctx := context.WithValue(context.Background(), 
+    schemas.BifrostContextKey("skip-mocker"), true)
+
+// This request will bypass the mocker and go to the real provider
+response, err := client.ChatCompletionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), request)
+```
+
+## Key Features
+
+### Template Variables
+
+Create dynamic responses using templates:
+
+```go
+Response{
+    Type: mocker.ResponseTypeSuccess,
+    Content: &mocker.SuccessResponse{
+        MessageTemplate: stringPtr("Hello from {{provider}} using model {{model}}!"),
+    },
+}
+```
+
+**Available Variables:**
+- `{{provider}}` - Provider name (e.g., "openai", "anthropic")
+- `{{model}}` - Model name (e.g., "gpt-4", "claude-3")
+- `{{faker.*}}` - Fake data generation (see Configuration Reference)
+
+### Weighted Response Selection
+
+Configure multiple responses with different probabilities:
+
+```go
+Responses: []mocker.Response{
+    {
+        Type:   mocker.ResponseTypeSuccess,
+        Weight: 0.8, // 80% chance
+        Content: &mocker.SuccessResponse{
+            Message: "Success response",
+        },
+    },
+    {
+        Type:   mocker.ResponseTypeError,
+        Weight: 0.2, // 20% chance
+        Error: &mocker.ErrorResponse{
+            Message: "Rate limit exceeded",
+            Type:    stringPtr("rate_limit"),
+            Code:    stringPtr("429"),
+        },
+    },
+}
+```
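+
+Weighted selection of this kind is commonly implemented by drawing a random number and walking the cumulative weights. The sketch below assumes weights are normalized by their sum, which may or may not match the plugin's behavior; `weighted`, `pickWeighted`, and `pickRandom` are hypothetical names.
+
+```go
+package main
+
+import "math/rand"
+
+// weighted is a stand-in for a response with a selection weight.
+type weighted struct {
+    Name   string
+    Weight float64
+}
+
+// pickWeighted selects an item using r, a number in [0, 1).
+// Taking r as a parameter keeps the function deterministic and easy to test.
+func pickWeighted(items []weighted, r float64) (weighted, bool) {
+    total := 0.0
+    for _, it := range items {
+        total += it.Weight
+    }
+    if total <= 0 {
+        return weighted{}, false // nothing selectable
+    }
+    target := r * total
+    cumulative := 0.0
+    for _, it := range items {
+        cumulative += it.Weight
+        if target < cumulative {
+            return it, true
+        }
+    }
+    return items[len(items)-1], true // guard against floating-point rounding
+}
+
+// pickRandom draws r from the standard library's random source.
+func pickRandom(items []weighted) (weighted, bool) {
+    return pickWeighted(items, rand.Float64())
+}
+```
+
+With weights 0.8 and 0.2, any `r` below 0.8 selects the first item and the rest select the second, which gives the 80/20 split shown above.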
+
+### Latency Simulation
+
+Add realistic delays to responses:
+
+```go
+// Fixed latency
+Latency: &mocker.Latency{
+    Type: mocker.LatencyTypeFixed,
+    Min:  250 * time.Millisecond,
+}
+
+// Variable latency
+Latency: &mocker.Latency{
+    Type: mocker.LatencyTypeUniform,
+    Min:  100 * time.Millisecond,
+    Max:  500 * time.Millisecond,
+}
+```
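+
+To make the difference between the two latency types concrete, here is a small sketch of how a delay could be sampled. The `latencySpec` type and both functions are hypothetical and only approximate the semantics described here: a fixed latency uses `Min` as the exact delay, and a uniform latency picks a value between `Min` and `Max`.
+
+```go
+package main
+
+import (
+    "math/rand"
+    "time"
+)
+
+// latencySpec is an illustrative stand-in for the plugin's latency settings.
+type latencySpec struct {
+    Uniform bool
+    Min     time.Duration
+    Max     time.Duration
+}
+
+// sampleLatency returns the delay for r in [0, 1).
+// Fixed latency (or an empty range) always returns Min.
+func sampleLatency(l latencySpec, r float64) time.Duration {
+    if !l.Uniform || l.Max <= l.Min {
+        return l.Min
+    }
+    return l.Min + time.Duration(r*float64(l.Max-l.Min))
+}
+
+// randomLatency samples with the standard library's random source.
+func randomLatency(l latencySpec) time.Duration {
+    return sampleLatency(l, rand.Float64())
+}
+```
+
+A caller would typically pass the result to `time.Sleep` before returning the mock response. Note that a bare integer such as `time.Duration(100)` means 100 nanoseconds, so always multiply by a unit like `time.Millisecond`.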
+
+### Advanced Matching
+
+#### Regex Message Matching
+```go
+Conditions: mocker.Conditions{
+    MessageRegex: stringPtr(`(?i).*support.*|.*help.*`),
+}
+```
+
+#### Request Size Filtering
+```go
+Conditions: mocker.Conditions{
+    RequestSize: &mocker.SizeRange{
+        Min: 100,  // bytes
+        Max: 1000, // bytes
+    },
+}
+```
+
+### Faker Data Generation
+
+Create realistic test data using faker variables:
+
+```go
+{
+    Name: "user-profile-example",
+    Responses: []mocker.Response{
+        {
+            Type: mocker.ResponseTypeSuccess,
+            Content: &mocker.SuccessResponse{
+                MessageTemplate: stringPtr(`User Profile:
+- Name: {{faker.name}}
+- Email: {{faker.email}}
+- Company: {{faker.company}}
+- Address: {{faker.address}}, {{faker.city}}
+- Phone: {{faker.phone}}
+- User ID: {{faker.uuid}}
+- Join Date: {{faker.date}}
+- Premium Account: {{faker.boolean}}`),
+            },
+        },
+    },
+}
+```
+
+### Statistics and Monitoring
+
+Get runtime statistics for monitoring:
+
+```go
+stats := plugin.GetStatistics()
+fmt.Printf("Plugin enabled: %v\n", stats.Enabled)
+fmt.Printf("Total requests: %d\n", stats.TotalRequests)
+fmt.Printf("Mocked requests: %d\n", stats.MockedRequests)
+
+// Rule-specific stats
+for ruleName, ruleStats := range stats.Rules {
+    fmt.Printf("Rule %s: %d triggers\n", ruleName, ruleStats.Triggers)
+}
+```
+
+## Configuration Reference
+
+### MockerConfig
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `Enabled` | `bool` | `false` | Enable/disable the entire plugin |
+| `DefaultBehavior` | `string` | `"passthrough"` | Action when no rules match: `"passthrough"`, `"success"`, `"error"` |
+| `GlobalLatency` | `*Latency` | `nil` | Global latency applied to all rules |
+| `Rules` | `[]MockRule` | `[]` | List of mock rules evaluated in priority order |
+
+### MockRule
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `Name` | `string` | - | Unique rule name for identification |
+| `Enabled` | `bool` | `true` | Enable/disable this specific rule |
+| `Priority` | `int` | `0` | Higher numbers = higher priority |
+| `Probability` | `float64` | `1.0` | Activation probability (0.0=never, 1.0=always) |
+| `Conditions` | `Conditions` | `{}` | Matching conditions (empty = match all) |
+| `Responses` | `[]Response` | - | Possible responses (weighted random selection) |
+| `Latency` | `*Latency` | `nil` | Rule-specific latency override |
+
+### Conditions
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `Providers` | `[]string` | Match specific providers: `["openai", "anthropic"]` |
+| `Models` | `[]string` | Match specific models: `["gpt-4", "claude-3"]` |
+| `MessageRegex` | `*string` | Regex pattern to match message content |
+| `RequestSize` | `*SizeRange` | Request size constraints in bytes |
+
+### Response
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `Type` | `string` | Response type: `"success"` or `"error"` |
+| `Weight` | `float64` | Weight for random selection (default: 1.0) |
+| `Content` | `*SuccessResponse` | Required if `Type="success"` |
+| `Error` | `*ErrorResponse` | Required if `Type="error"` |
+| `AllowFallbacks` | `*bool` | Control fallback behavior (`nil`=allow, `false`=block) |
+
+### SuccessResponse
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `Message` | `string` | Static response message |
+| `MessageTemplate` | `*string` | Template with variables: `{{provider}}`, `{{model}}`, `{{faker.*}}` |
+| `Model` | `*string` | Override model name in response |
+| `Usage` | `*Usage` | Token usage information |
+| `FinishReason` | `*string` | Completion reason (default: `"stop"`) |
+| `CustomFields` | `map[string]interface{}` | Additional metadata fields |
+
+### ErrorResponse
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `Message` | `string` | Error message to return |
+| `Type` | `*string` | Error type (e.g., `"rate_limit"`, `"auth_error"`) |
+| `Code` | `*string` | Error code (e.g., `"429"`, `"401"`) |
+| `StatusCode` | `*int` | HTTP status code |
+
+### Latency
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `Type` | `string` | Latency type: `"fixed"` or `"uniform"` |
+| `Min` | `time.Duration` | Minimum/exact latency (use `time.Millisecond`) |
+| `Max` | `time.Duration` | Maximum latency (required for `"uniform"`) |
+
+**Important**: Use Go's `time.Duration` constants:
+- ✅ Correct: `100 * time.Millisecond`
+- ❌ Wrong: `100` (nanoseconds, barely noticeable)
+
+### Faker Variables
+
+#### Personal Information
+- `{{faker.name}}` - Full name
+- `{{faker.first_name}}` - First name only
+- `{{faker.last_name}}` - Last name only
+- `{{faker.email}}` - Email address
+- `{{faker.phone}}` - Phone number
+
+#### Location
+- `{{faker.address}}` - Street address
+- `{{faker.city}}` - City name
+- `{{faker.state}}` - State/province
+- `{{faker.zip_code}}` - Postal code
+
+#### Business
+- `{{faker.company}}` - Company name
+- `{{faker.job_title}}` - Job title
+
+#### Text and Data
+- `{{faker.lorem_ipsum}}` - Lorem ipsum text
+- `{{faker.lorem_ipsum:10}}` - Lorem ipsum with 10 words
+- `{{faker.uuid}}` - UUID v4
+- `{{faker.hex_color}}` - Hex color code
+
+#### Numbers and Dates
+- `{{faker.integer}}` - Random integer (1-100)
+- `{{faker.integer:10,50}}` - Random integer between 10-50
+- `{{faker.float}}` - Random float (0-100, 2 decimals)
+- `{{faker.float:1,10}}` - Random float between 1-10
+- `{{faker.boolean}}` - Random boolean
+- `{{faker.date}}` - Date (YYYY-MM-DD format)
+- `{{faker.datetime}}` - Datetime (YYYY-MM-DD HH:MM:SS format)
+
+## Best Practices
+
+### Rule Organization
+
+```go
+// Use priority to control rule evaluation order
+rules := []mocker.MockRule{
+    {Name: "specific-error", Priority: 100},  // add specific conditions
+    {Name: "general-success", Priority: 50},  // add general conditions
+    {Name: "catch-all", Priority: 0},         // empty conditions match all requests
+}
+```
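+
+Priority-based evaluation amounts to sorting rules by descending priority before matching. The sketch below uses a simplified `rule` type and a hypothetical `orderByPriority` function; how the plugin orders rules with equal priority is not documented here, so the stable sort is an assumption.
+
+```go
+package main
+
+import "sort"
+
+// rule is a simplified stand-in for mocker.MockRule.
+type rule struct {
+    Name     string
+    Priority int
+}
+
+// orderByPriority returns a copy of rules sorted from highest to lowest priority.
+// Rules with equal priority keep their original relative order.
+func orderByPriority(rules []rule) []rule {
+    out := append([]rule(nil), rules...)
+    sort.SliceStable(out, func(i, j int) bool {
+        return out[i].Priority > out[j].Priority
+    })
+    return out
+}
+```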
+
+### Development vs Production
+
+```go
+// Development: High mock rate
+devConfig := mocker.MockerConfig{
+    Enabled: true,
+    Rules: []mocker.MockRule{
+        {Probability: 1.0}, // Always mock
+    },
+}
+
+// Production: Occasional testing
+prodConfig := mocker.MockerConfig{
+    Enabled: true,
+    Rules: []mocker.MockRule{
+        {Probability: 0.1}, // 10% mock rate
+    },
+}
+```
+
+### Performance Considerations
+
+- Place specific conditions before general ones (higher priority)
+- Use simple string matching over complex regex when possible
+- Keep response templates reasonably sized
+- Consider disabling debug logging in production
+
+### Testing Your Configuration
+
+```go
+func validateMockerConfig(config mocker.MockerConfig) error {
+    _, err := mocker.NewMockerPlugin(config)
+    return err
+}
+
+// Test before deployment
+if err := validateMockerConfig(yourConfig); err != nil {
+    log.Fatalf("Invalid mocker configuration: %v", err)
+}
+```
+
+## Common Issues
+
+### Plugin Not Triggering
+
+1. Check if plugin is enabled: `Enabled: true`
+2. Verify rule is enabled: `rule.Enabled: true`
+3. Check probability: `Probability: 1.0` for testing
+4. Verify conditions match your request
+
+### Latency Not Working
+
+Use `time.Duration` constants, not raw integers:
+
+```go
+// ❌ Wrong: 100 nanoseconds (barely noticeable)
+Min: 100
+
+// ✅ Correct: 100 milliseconds
+Min: 100 * time.Millisecond
+```
+
+### Regex Not Matching
+
+Test your regex pattern and ensure proper escaping:
+
+```go
+// Case-insensitive matching
+MessageRegex: stringPtr(`(?i).*help.*`)
+
+// Escape special characters
+MessageRegex: stringPtr(`\$\d+\.\d+`) // Match $12.34
+```
+
+### Controlling Fallbacks
+
+```go
+Response{
+    Type: mocker.ResponseTypeError,
+    AllowFallbacks: boolPtr(false), // Block fallbacks
+    Error: &mocker.ErrorResponse{
+        Message: "Authentication failed",
+    },
+}
+```
+
+### Skip Mocker Not Working
+
+Ensure you're using the correct context key format:
+
+```go
+// ✅ Correct
+ctx := context.WithValue(context.Background(), 
+    schemas.BifrostContextKey("skip-mocker"), true)
+
+// ❌ Wrong
+ctx := context.WithValue(context.Background(), "skip-mocker", true)
+```
+
+### Responses Request Issues
+
+If responses requests aren't being mocked:
+
+1. Verify the plugin supports `ResponsesRequest` (version 1.2.13+)
+2. Check that your regex patterns match the message content
+3. Ensure the request type is `schemas.ResponsesRequest`
+
+### Debug Mode
+
+Enable debug logging to troubleshoot:
+
+```go
+client, initErr := bifrost.Init(context.Background(), schemas.BifrostConfig{
+    Account: &account,
+    LLMPlugins: []schemas.LLMPlugin{plugin},
+    Logger:  bifrost.NewDefaultLogger(schemas.LogLevelDebug),
+})
+```
--- a/docs/features/prompt-repository/playground.mdx
+++ b/docs/features/prompt-repository/playground.mdx
@@ -0,0 +1,242 @@
+---
+title: "Playground"
+description: "Create, test, and version prompts in an interactive playground."
+icon: "square-terminal"
+---
+
+## Overview
+
+The **Playground** in Bifrost is an interactive workspace for building, testing, and managing prompts. It allows you to experiment with messages, switch models, adjust parameters, and iterate until the output looks right. Once you're satisfied, you can **publish a version** and use it directly in your codebase. Over time, the prompt repository becomes a centralized **CMS for all your prompts**, making it easier to manage versions, collaborate with teammates, and maintain production-ready prompts.
+
+![Prompt Repository Overview](../../media/prompt-repo-overview.png)
+
+## How it Works
+
+The playground is built around four core concepts: **Folders, Prompts, Sessions, and Versions**.
+
+### Folders
+
+Folders help organize prompts into logical groups. Teams often structure them by product area, feature, or use case.
+
+- Each folder has a **name** and optional **description**
+- Prompts can live inside folders or at the root level
+- Deleting a folder removes **all prompts, sessions, and versions inside it**
+
+### Prompts
+
+A **Prompt** is the main unit in the repository.
+
+Think of it as a container that holds the full lifecycle of a prompt, from early experiments to production-ready versions.
+
+Each prompt can have:
+
+- Multiple **sessions** for experimentation
+- Multiple **versions** for stable releases
+
+### Sessions (Working Copies)
+
+Sessions are **editable working copies** where you experiment with a prompt.
+
+You can freely:
+
+- Modify messages
+- Switch providers or models
+- Adjust parameters
+- Run the prompt repeatedly
+
+Sessions don't affect committed versions, so you can iterate safely.
+
+If your session has unsaved changes, a **red asterisk appears next to the prompt name** in the top bar.  
+You can save your progress using:
+
+- **Save Session** button
+- `Cmd + S` / `Ctrl + S`
+
+Saved sessions can be **renamed and restored** from the dropdown next to the Save button.
+
+### Versions (Immutable Snapshots)
+
+When you're happy with a prompt, you can **commit it as a version**.
+
+Versions are **immutable snapshots**; once created, they cannot be edited. When the current configuration differs from the last committed version, an **Unpublished Changes** badge appears, and you can commit the changes to create a new version.
+
+Each version stores:
+
+- The selected **message history** (system, user, assistant)
+- **Provider and model configuration**
+- **Model parameters** (temperature, max tokens, etc.)
+- A **commit message** describing the change
+
+Versions are automatically numbered:
+
+```
+v1 → v2 → v3 → ...
+```
+
+You can also **restore a previous version** from the dropdown next to the **Commit Version** button.
+
+---
+
+## Workspace Layout
+
+The playground uses a simple **three-panel layout**:
+
+| Panel | Purpose |
+|------|---------|
+| **Sidebar (left)** | Browse prompts, manage folders, and organize items |
+| **Playground (center)** | Build and test your prompt messages |
+| **Settings (right)** | Configure provider, model, API key, variables, parameters, and deployments |
+
+The settings panel is organized into collapsible sections:
+
+- **Configuration** — Provider, model, API key, variables, and model parameters
+- **Deployments** — Prompt deployment strategies and traffic routing (enterprise)
+
+![Workspace Layout](../../media/prompt-repo-layout.png)
+
+---
+
+## Getting Started
+
+<Steps>
+
+<Step title="Create a folder (optional)">
+
+Click the **"+"** button in the sidebar and select **New Folder**.
+
+Folders help organize prompts by team, feature, or use case.
+
+![Create Folder](../../media/prompt-repo-create-folder.png)
+
+</Step>
+
+<Step title="Create a prompt">
+
+Click **"+"** again and choose **New Prompt**.  
+Give it a name and optionally assign it to a folder.
+
+![Create Prompt](../../media/prompt-repo-create-prompt.png)
+
+</Step>
+
+<Step title="Build your prompt">
+
+Add messages to your prompt in the Playground:
+
+- **System messages** for instructions
+- **User messages** for input
+- **Assistant messages** for examples or few-shot responses
+
+Configure the provider, model, and parameters from the settings panel on the right.
+
+![Playground](../../media/prompt-repo-playground.png)
+
+</Step>
+
+<Step title="Run the prompt">
+
+Click **Run** or press `Cmd + S` / `Ctrl + S`.
+
+To add a message to the history without executing the prompt, use the **+ Add** button.
+
+</Step>
+
+<Step title="Save and publish a version">
+
+Once you're satisfied with the results:
+
+1. **Save Session** to preserve your work
+2. **Commit Version** to create an immutable snapshot
+
+![Commit Version](../../media/prompt-repo-commit.png)
+
+</Step>
+
+</Steps>
+
+## Key Capabilities
+
+### Version Control
+
+Each committed version creates a permanent record of your prompt.
+
+This allows teams to track changes and safely iterate without breaking production prompts.
+
+Key characteristics:
+
+- **Sequential versioning** — v1, v2, v3, ...
+- **Commit messages** explaining what changed
+- **Immutable history**
+
+### Multi-Provider Testing
+
+You can switch between providers and models directly in the Playground.
+
+Supported providers may include:
+
+- OpenAI
+- Anthropic
+- AWS Bedrock
+- Others configured in your Bifrost instance
+
+You can also choose which API key to use:
+
+- **Auto**: Uses the first available key.
+- **Specific key**: Select a particular key.
+- **Virtual key**: Uses governance-managed keys.
+
+This makes it easy to compare how different models respond to the same prompt.
+
+### Message Types
+
+The Playground supports several message roles:
+
+- **System**: Defines behavior or instructions.
+- **User**: Input to the model.
+- **Assistant**: The model's response to the user's input.
+- **Tool Calls**: Function calls made by the model.
+- **Tool Results**: Mock or real responses from called tools.
+
+These allow you to simulate complex conversations and agent workflows.
+
+
+### Attachments
+
+For models that support multimodal input, you can attach files directly to user messages.
+
+Supported attachments may include:
+
+- Images
+- PDFs
+- Other supported file types
+
+Attachments are only enabled when the selected model supports them.
+
+### Drag-and-Drop Organization
+
+Prompts can be reorganized easily using drag and drop in the sidebar.
+
+You can move prompts:
+
+- Between folders
+- Back to the root level
+
+## Session Management
+
+Sessions store the state of your prompt experiments.
+
+Each prompt maintains its **own session history**, allowing you to explore different approaches without losing previous work.
+
+With sessions you can:
+
+- Save specific conversation states
+- Rename sessions for clarity
+- Switch between past experiments
+
+![Sessions](../../media/prompt-repo-sessions.png)
+
+---
+
+## Using prompts in production
+
+To attach committed versions to **Chat Completions** or **Responses** requests through the gateway (HTTP headers, merging, and caching behavior), see the [Prompts plugin](/features/prompt-repository/prompts-plugin).
--- a/docs/features/prompt-repository/prompts-plugin.mdx
+++ b/docs/features/prompt-repository/prompts-plugin.mdx
@@ -0,0 +1,148 @@
+---
+title: "Prompts plugin"
+description: "Use committed prompt templates from the Prompt Repository on inference requests via HTTP headers or custom resolvers."
+icon: "puzzle-piece"
+---
+
+## Overview
+
+The **Prompts** plugin connects the [Prompt Repository](/features/prompt-repository/playground) to inference. It loads committed prompt versions from the config store and **prepends** their messages to **Chat Completions** and **Responses** requests. It also **merges model parameters** from the stored version with the incoming request (request values take precedence).
+
+**What it does:**
+
+- Resolves which prompt and version to apply per request (default: HTTP headers).
+- Injects the version’s message history **before** the client’s messages.
+- Applies the version’s `model` parameters as defaults, then overrides with whatever the client sent for the same parameters.
+
+---
+
+## Prerequisites
+
+- **Config store** with Prompt Repository tables (typically **PostgreSQL**). File-backed config alone does not store prompts.
+- Prompts authored and **committed as versions** in the UI or via the `/api/prompt-repo/...` HTTP API (see `docs/openapi/openapi.yaml` in the repository).
+- A **prompt ID** (UUID) for each prompt you reference at runtime. You can read it from the repository API or the playground.
+
+---
+
+## How it works
+
+```mermaid
+flowchart TB
+    Client([Client]) --> Gateway[Bifrost HTTP]
+    Gateway --> PreHook["HTTP transport pre-hook:<br/>copy x-bf-prompt-id / x-bf-prompt-version to context"]
+    PreHook --> PreLLM["PreLLM hook:<br/>resolve version, merge params,<br/>prepend template messages"]
+    PreLLM --> Provider[Provider]
+```
+
+1. **Transport (HTTP):** Incoming headers `x-bf-prompt-id` and `x-bf-prompt-version` are copied onto the Bifrost context (header name matching is case-insensitive).
+2. **Resolve:** The plugin looks up the prompt and the requested version. If **`x-bf-prompt-version` is omitted**, the prompt’s **latest committed version** is used.
+3. **Parameters:** Version `model` parameters are merged into the request; any field already set on the request wins.
+4. **Messages:** Messages from the committed version are **prepended** to `messages` (chat) or `input` (responses). Your request body adds the user turn(s) after the template.
+
+If the prompt ID is missing, the plugin does nothing and the request passes through unchanged.
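+
+The merge and prepend steps above can be sketched in plain Go. This is a minimal illustration under stated assumptions, not the plugin's code: `Message` and `applyTemplate` are hypothetical names, and parameters are modeled as a simple map.
+
+```go
+package main
+
+import "fmt"
+
+// Message is a hypothetical stand-in for a chat message.
+type Message struct {
+    Role    string
+    Content string
+}
+
+// applyTemplate prepends the template messages to the request messages and
+// fills in only those parameters that the request did not set itself.
+func applyTemplate(tmplMsgs, reqMsgs []Message, tmplParams, reqParams map[string]any) ([]Message, map[string]any) {
+    msgs := make([]Message, 0, len(tmplMsgs)+len(reqMsgs))
+    msgs = append(msgs, tmplMsgs...)
+    msgs = append(msgs, reqMsgs...)
+
+    params := make(map[string]any, len(tmplParams)+len(reqParams))
+    for k, v := range tmplParams {
+        params[k] = v // version values act as defaults
+    }
+    for k, v := range reqParams {
+        params[k] = v // request values take precedence
+    }
+    return msgs, params
+}
+
+func main() {
+    msgs, params := applyTemplate(
+        []Message{{Role: "system", Content: "You are a concise assistant."}},
+        []Message{{Role: "user", Content: "What is Pale Blue Dot?"}},
+        map[string]any{"temperature": 0.2, "max_tokens": 512},
+        map[string]any{"temperature": 0.9},
+    )
+    fmt.Println(msgs[0].Role, msgs[1].Role, params["temperature"], params["max_tokens"])
+}
+```
+
+In this sketch the request's `temperature` wins, while `max_tokens` is inherited from the committed version.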
+
+---
+
+## HTTP headers (gateway)
+
+| Header | Required | Description |
+|--------|----------|-------------|
+| `x-bf-prompt-id` | Yes, to enable injection | UUID of the prompt in the repository. |
+| `x-bf-prompt-version` | No | **Integer version number** (e.g. `3` for v3). If omitted, the **latest** committed version for that prompt is used. |
+
+Invalid or unknown IDs / versions are logged as warnings; the request is **not** failed by the plugin (it proceeds without template injection).
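+
+As a rough sketch of how a Go client might set these headers, the helper below builds (but does not send) a request. `newPromptRequest` is a hypothetical helper, and treating `version <= 0` as "omit the header" is an assumption of this example, chosen so that the gateway falls back to the latest committed version.
+
+```go
+package main
+
+import (
+    "fmt"
+    "net/http"
+    "strconv"
+    "strings"
+)
+
+// newPromptRequest builds a chat request that selects a prompt template.
+// A version <= 0 omits x-bf-prompt-version so the latest committed version applies.
+func newPromptRequest(baseURL, promptID string, version int, body string) (*http.Request, error) {
+    req, err := http.NewRequest(http.MethodPost, baseURL+"/v1/chat/completions", strings.NewReader(body))
+    if err != nil {
+        return nil, err
+    }
+    req.Header.Set("Content-Type", "application/json")
+    req.Header.Set("x-bf-prompt-id", promptID)
+    if version > 0 {
+        req.Header.Set("x-bf-prompt-version", strconv.Itoa(version))
+    }
+    return req, nil
+}
+
+func main() {
+    body := `{"model": "openai/gpt-5.4", "messages": [{"role": "user", "content": "Hello"}]}`
+    req, err := newPromptRequest("http://localhost:8080", "YOUR-PROMPT-UUID", 3, body)
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println(req.Method, req.URL.String(), req.Header.Get("x-bf-prompt-version"))
+}
+```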
+
+---
+
+## Example: Chat Completions
+
+Use the same JSON body as a normal chat request. Only the headers select the template.
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "x-bf-prompt-id: YOUR-PROMPT-UUID" \
+  -H "x-bf-vk: sk-bf-your-virtual-key" \
+  -d '{
+    "model": "openai/gpt-5.4",
+    "messages": [
+      {
+        "role": "user",
+        "content": "Tell me about Bifrost Gateway?"
+      }
+    ]
+  }'
+```
+
+![Commit Version with Stream enabled in the playground](../../media/prompt-plugin-version-commit.png)
+
+When you commit a version from the playground, the model parameters (temperature, max tokens, etc.) are saved with it. These parameters are merged into the outgoing request, with client-supplied values taking precedence.
+
+![LLM log for the same request showing Type: Chat Stream](../../media/prompt-plugin-llm-log.png)
+
+In **Logs**, that run shows the full conversation: the committed **system** template, your **user** message from the request body, and the assistant reply. The log also displays the **Selected Prompt** name and version number for easy traceability.
+
+The provider receives the merged model parameters from both the prompt version and the client request, with the messages from the committed version prepended before the client’s messages.
+
+---
+
+## Example: Responses API
+
+```bash
+curl -X POST http://localhost:8080/v1/responses \
+  -H "Content-Type: application/json" \
+  -H "x-bf-prompt-id: YOUR-PROMPT-UUID" \
+  -H "x-bf-prompt-version: 4" \
+  -H "x-bf-vk: sk-bf-your-virtual-key" \
+  -d '{
+    "model": "openai/gpt-5-nano-2025-08-07",
+    "input": "What is Pale Blue Dot?"
+  }'
+```
+
+---
+
+## Streaming
+
+Streaming is controlled entirely by the client request. If you want streaming, set `"stream": true` in the request body. The plugin merges model parameters from the committed version (request values take precedence), but does **not** override the transport-level streaming mode.
+
+---
+
+## Cache and updates
+
+The plugin keeps an in-memory cache of prompts and versions (loaded with a small number of store queries at startup). When you create, update, or delete prompts or versions through the **gateway APIs**, the server **reloads** that cache so new commits are visible without a full process restart.
+
+---
+
+## Go SDK and custom resolution
+
+For embedded Bifrost (Go SDK), register the plugin with `prompts.Init` and a **config store** that implements the prompt tables API. The default resolver reads the same logical keys from `BifrostContext`:
+
+- `prompts.PromptIDKey` (`x-bf-prompt-id`)
+- `prompts.PromptVersionKey` (`x-bf-prompt-version`)
+
+Set them on the context you pass to `ChatCompletion` / `Responses` if you are not going through the HTTP transport hooks.
+
+For advanced routing (for example, choosing a prompt from governance metadata), implement `prompts.PromptResolver` and use **`prompts.InitWithResolver`**. The interface is:
+
+```go
+type PromptResolver interface {
+    Resolve(ctx *schemas.BifrostContext, req *schemas.BifrostRequest) (promptID string, versionNumber int, err error)
+}
+```
+
+Return an empty `promptID` to skip injection for a request. Return `versionNumber == 0` to use the prompt's **latest** committed version; any positive integer selects that specific version.
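+
+The return contract can be illustrated with a small resolver that picks a prompt by model name. This is only a sketch: `BifrostContext` and `BifrostRequest` below are local stand-ins so the example compiles on its own (the real types live in `schemas` and their fields may differ), and `modelResolver` is a hypothetical name.
+
+```go
+package main
+
+import "fmt"
+
+// Stand-ins for schemas.BifrostContext and schemas.BifrostRequest (assumed shapes).
+type BifrostContext struct{}
+type BifrostRequest struct{ Model string }
+
+// modelResolver maps a requested model to a prompt UUID.
+type modelResolver struct {
+    promptByModel map[string]string
+}
+
+// Resolve follows the contract described above: an empty ID skips injection,
+// and version 0 selects the latest committed version.
+func (r modelResolver) Resolve(ctx *BifrostContext, req *BifrostRequest) (string, int, error) {
+    if req == nil {
+        return "", 0, nil
+    }
+    id, ok := r.promptByModel[req.Model]
+    if !ok {
+        return "", 0, nil // no mapping: skip injection
+    }
+    return id, 0, nil // use the latest committed version
+}
+
+func main() {
+    r := modelResolver{promptByModel: map[string]string{"gpt-4o-mini": "YOUR-PROMPT-UUID"}}
+    id, version, err := r.Resolve(&BifrostContext{}, &BifrostRequest{Model: "gpt-4o-mini"})
+    fmt.Println(id, version, err)
+}
+```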
+
+After injection, the plugin sets the following context keys (read by the logging plugin to populate log fields):
+
+- `schemas.BifrostContextKeySelectedPromptID` — UUID of the applied prompt
+- `schemas.BifrostContextKeySelectedPromptName` — Display name of the prompt
+- `schemas.BifrostContextKeySelectedPromptVersion` — Version number as a string (e.g. `"3"`)
+
+---
+
+## Related
+
+- [Playground](/features/prompt-repository/playground) — create folders, prompts, sessions, and committed versions.
+- [Writing Go plugins](/plugins/writing-go-plugin) — plugin interfaces and lifecycle.
+- Built-in plugin name in code: `prompts` (`github.com/maximhq/bifrost/plugins/prompts`).
--- a/docs/features/retries-and-fallbacks.mdx
+++ b/docs/features/retries-and-fallbacks.mdx
@@ -0,0 +1,393 @@
+---
+title: "Retries & Fallbacks"
+description: "Automatic retry with exponential backoff and provider failover. Retries handle transient errors within a provider; fallbacks switch to a different provider when all retries are exhausted."
+icon: "list-check"
+---
+
+## Overview
+
+Bifrost provides two complementary layers of resilience:
+
+- **Retries** — When a provider returns a transient error (network issue, rate limit, 5xx), Bifrost automatically retries the same request against the same provider with exponential backoff. On rate-limit errors, it can also rotate to a different API key from your pool.
+- **Fallbacks** — When the primary provider fails after exhausting all retries, Bifrost moves on to the next provider in your fallback chain. Each fallback provider gets its own full retry budget.
+
+Together, they let you build LLM-powered applications that stay up through rate limits, transient outages, and even full provider failures — with no changes required in your application code.
+
+---
+
+## Retries
+
+### How retries work
+
+When a request fails with a retryable error, Bifrost:
+
+1. Waits using **exponential backoff with jitter** before the next attempt
+2. Retries the request against the same provider
+3. On **rate-limit errors** (`429`): rotates to a different API key from the pool (if multiple keys are configured) so fresh capacity is used
+4. On **network/server errors** (`5xx`, DNS, connection refused): reuses the same key — these are transient server issues, not per-key capacity problems
+5. Continues until the request succeeds or `max_retries` is exhausted
+
+### Backoff formula
+
+```
+backoff = min(retry_backoff_initial × 2^attempt, retry_backoff_max) × jitter(0.8–1.2)
+```
+
+With the defaults of `retry_backoff_initial = 500ms` and `retry_backoff_max = 5000ms`, and `attempt` counted from 0 for the first retry:
+
+| Attempt | Base backoff | With jitter (approx.) |
+|---------|-------------|----------------------|
+| 1st retry | 500 ms | 400–600 ms |
+| 2nd retry | 1000 ms | 800 ms–1.2 s |
+| 3rd retry | 2000 ms | 1.6–2.4 s |
+| 4th retry | 4000 ms | 3.2–4.8 s |
+| 5th+ retry | 5000 ms (capped) | 4–6 s |
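+
+The formula can be written out as a short Go sketch. The function names are hypothetical, and the example assumes `attempt` is 0 for the first retry and that jitter is a uniform factor in the range 0.8 to 1.2 applied after the cap, as the formula reads.
+
+```go
+package main
+
+import (
+    "fmt"
+    "math/rand"
+    "time"
+)
+
+// baseBackoff returns min(initial * 2^attempt, maxBackoff).
+// Doubling stops once the cap is reached, which also avoids overflow.
+func baseBackoff(attempt int, initial, maxBackoff time.Duration) time.Duration {
+    d := initial
+    for i := 0; i < attempt && d < maxBackoff; i++ {
+        d *= 2
+    }
+    if d > maxBackoff {
+        d = maxBackoff
+    }
+    return d
+}
+
+// withJitter scales d by a random factor in [0.8, 1.2).
+func withJitter(d time.Duration) time.Duration {
+    return time.Duration(float64(d) * (0.8 + 0.4*rand.Float64()))
+}
+
+func main() {
+    for attempt := 0; attempt < 6; attempt++ {
+        base := baseBackoff(attempt, 500*time.Millisecond, 5*time.Second)
+        fmt.Printf("retry %d: base=%v jittered=%v\n", attempt+1, base, withJitter(base))
+    }
+}
+```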
+
+### What triggers a retry
+
+| Condition | Retried? | Key rotation? |
+|-----------|----------|---------------|
+| Network error (DNS, connection refused) | Yes | No — same key reused |
+| `5xx` server errors (500, 502, 503, 504) | Yes | No — same key reused |
+| Rate limit (`429` or rate-limit message pattern) | Yes | Yes — next key from pool |
+| Request validation error | No | — |
+| Plugin-enforced block | No | — |
+| Cancelled request | No | — |
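+
+The status-code rows of this table can be expressed as a small decision function. This is a sketch with hypothetical names (`retryDecision`, `classify`); it covers only network errors and HTTP status codes, not rate-limit message patterns, plugin blocks, or cancellation.
+
+```go
+package main
+
+import "fmt"
+
+// retryDecision says whether a failed attempt is retried and whether
+// the next attempt should use a different key from the pool.
+type retryDecision struct {
+    Retry     bool
+    RotateKey bool
+}
+
+// classify maps a failure to a decision, following the table above.
+// networkErr covers failures with no HTTP status (DNS, connection refused).
+func classify(status int, networkErr bool) retryDecision {
+    switch {
+    case networkErr:
+        return retryDecision{Retry: true} // same key reused
+    case status == 429:
+        return retryDecision{Retry: true, RotateKey: true}
+    case status == 500 || status == 502 || status == 503 || status == 504:
+        return retryDecision{Retry: true} // same key reused
+    default:
+        return retryDecision{} // e.g. validation errors are not retried
+    }
+}
+
+func main() {
+    fmt.Printf("%+v\n", classify(429, false))
+    fmt.Printf("%+v\n", classify(503, false))
+    fmt.Printf("%+v\n", classify(400, false))
+}
+```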
+
+### Configuring retries
+
+Retries are configured per-provider in `network_config`. The defaults are `max_retries: 0` (no retries), `retry_backoff_initial: 500` ms, and `retry_backoff_max: 5000` ms.
+
+<Tabs group="config-method">
+<Tab title="Web UI">
+
+<Frame>
+  <img src="/media/ui-retries-network-config.png" alt="Retries configuration in the Bifrost Web UI showing Max Retries, Retry Backoff Initial, and Retry Backoff Max fields under Network Config" />
+</Frame>
+
+Navigate to **Providers**, select a provider, and open the **Network Config** section.
+
+Set:
+- **Max Retries** — number of additional attempts after the first failure (e.g. `3`)
+- **Retry Backoff Initial** — starting backoff in milliseconds (e.g. `500`)
+- **Retry Backoff Max** — maximum backoff cap in milliseconds (e.g. `5000`)
+
+</Tab>
+<Tab title="API">
+
+```bash
+curl --location 'http://localhost:8080/api/providers' \
+--header 'Content-Type: application/json' \
+--data '{
+    "provider": "openai",
+    "keys": [
+        {
+            "name": "openai-key-1",
+            "value": "env.OPENAI_API_KEY",
+            "models": ["*"],
+            "weight": 1.0
+        }
+    ],
+    "network_config": {
+        "max_retries": 3,
+        "retry_backoff_initial": 500,
+        "retry_backoff_max": 5000
+    }
+}'
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
+    switch provider {
+    case schemas.OpenAI:
+        return &schemas.ProviderConfig{
+            NetworkConfig: schemas.NetworkConfig{
+                MaxRetries:          3,
+                RetryBackoffInitial: 500 * time.Millisecond,
+                RetryBackoffMax:     5 * time.Second,
+            },
+            ConcurrencyAndBufferSize: schemas.DefaultConcurrencyAndBufferSize,
+        }, nil
+    }
+    return nil, fmt.Errorf("provider %s not supported", provider)
+}
+```
+
+</Tab>
+<Tab title="config.json">
+
+```json
+{
+  "providers": {
+    "openai": {
+      "keys": [
+        { "name": "openai-key-1", "value": "env.OPENAI_KEY_1", "models": ["*"], "weight": 1.0 },
+        { "name": "openai-key-2", "value": "env.OPENAI_KEY_2", "models": ["*"], "weight": 1.0 },
+        { "name": "openai-key-3", "value": "env.OPENAI_KEY_3", "models": ["*"], "weight": 1.0 }
+      ],
+      "network_config": {
+        "max_retries": 3,
+        "retry_backoff_initial": 500,
+        "retry_backoff_max": 5000
+      }
+    }
+  }
+}
+```
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `max_retries` | integer | `0` | Number of additional attempts after the first failure |
+| `retry_backoff_initial` | integer (ms) | `500` | Starting backoff duration in milliseconds |
+| `retry_backoff_max` | integer (ms) | `5000` | Maximum backoff cap in milliseconds |
+
+</Tab>
+</Tabs>
+
+### Key rotation on rate limits
+
+<Note>
+Key rotation on retries requires **v1.5.0-prerelease4 or later**.
+</Note>
+
+When you configure multiple API keys for a provider, Bifrost automatically rotates to a fresh key when a rate-limit error is encountered — so retries are not wasted repeating a request with a key that has already hit its limit.
+
+```json
+{
+  "providers": {
+    "openai": {
+      "keys": [
+        { "name": "openai-key-1", "value": "env.OPENAI_KEY_1", "models": ["*"], "weight": 1.0 },
+        { "name": "openai-key-2", "value": "env.OPENAI_KEY_2", "models": ["*"], "weight": 1.0 },
+        { "name": "openai-key-3", "value": "env.OPENAI_KEY_3", "models": ["*"], "weight": 1.0 }
+      ],
+      "network_config": {
+        "max_retries": 5
+      }
+    }
+  }
+}
+```
+
+With 3 keys and `max_retries: 5`, Bifrost cycles through all three keys twice before giving up. Once all keys in the pool have been tried, it resets and starts a fresh weighted round.
+
+<Note>
+Key rotation on rate limits only applies when `max_retries > 0` and more than one key is configured for the provider. With a single key, all retries reuse that key.
+</Note>
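+
+To make the arithmetic concrete, the sketch below lists which key each attempt would use when every attempt is rate limited. It is a simplification: it uses plain round-robin rather than weighted selection, and `keysTried` is a hypothetical helper.
+
+```go
+package main
+
+import "fmt"
+
+// keysTried returns the key used on each attempt (1 initial + maxRetries),
+// assuming every attempt fails with a rate-limit error.
+// Simplification: round-robin order instead of weighted selection.
+func keysTried(keys []string, maxRetries int) []string {
+    if len(keys) == 0 {
+        return nil
+    }
+    used := make([]string, 0, maxRetries+1)
+    for attempt := 0; attempt <= maxRetries; attempt++ {
+        used = append(used, keys[attempt%len(keys)])
+    }
+    return used
+}
+
+func main() {
+    fmt.Println(keysTried([]string{"openai-key-1", "openai-key-2", "openai-key-3"}, 5))
+}
+```
+
+With 3 keys and `max_retries: 5` there are 6 attempts in total, so each key is used twice.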
+
+---
+
+## Fallbacks
+
+Fallbacks provide automatic failover to a different provider when the primary fails after exhausting all its retries. Each fallback is tried in order until one succeeds.
+
+### How fallbacks work
+
+1. **Primary attempt**: Tries your configured provider with its full retry budget
+2. **Fallback decision**: If the primary fails (and the error is retryable at the provider level), Bifrost moves to the first fallback
+3. **Sequential fallbacks**: Each fallback provider also gets its own full retry budget
+4. **First success wins**: Returns the response from the first provider that succeeds
+5. **All fail**: Returns the original error from the primary provider. Exception: if a plugin on a fallback provider sets `AllowFallbacks = false` on the error (e.g. a security or compliance plugin that should halt the chain regardless of remaining fallbacks), Bifrost stops immediately and returns that fallback's error rather than continuing to the next provider or returning the primary error.
+
+Each fallback is treated as a completely fresh request — all configured plugins (semantic caching, governance, logging) run again for the fallback provider.
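+
+The ordering and error rules above can be sketched as a loop. Everything here is illustrative: `attemptError` and `provider` are hypothetical types, and each `provider` call stands for one provider attempt including its own retries.
+
+```go
+package main
+
+import (
+    "errors"
+    "fmt"
+)
+
+// attemptError is a hypothetical error shape; AllowFallbacks mirrors the flag described above.
+type attemptError struct {
+    Err            error
+    AllowFallbacks bool
+}
+
+// provider runs one provider, including that provider's own retry budget.
+type provider func() (string, *attemptError)
+
+// runChain tries the primary, then each fallback in order.
+func runChain(primary provider, fallbacks []provider) (string, error) {
+    resp, primaryErr := primary()
+    if primaryErr == nil {
+        return resp, nil
+    }
+    if !primaryErr.AllowFallbacks {
+        return "", primaryErr.Err // chain skipped entirely
+    }
+    for _, fb := range fallbacks {
+        resp, fbErr := fb()
+        if fbErr == nil {
+            return resp, nil // first success wins
+        }
+        if !fbErr.AllowFallbacks {
+            return "", fbErr.Err // a plugin halted the chain
+        }
+    }
+    return "", primaryErr.Err // all failed: surface the primary error
+}
+
+func main() {
+    down := func() (string, *attemptError) {
+        return "", &attemptError{Err: errors.New("503 from primary"), AllowFallbacks: true}
+    }
+    ok := func() (string, *attemptError) { return "hello from fallback", nil }
+    fmt.Println(runChain(down, []provider{ok}))
+}
+```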
+
+### Implementation
+
+<Tabs group="fallbacks">
+<Tab title="Gateway">
+
+Pass a `fallbacks` array in the request body. Each entry specifies a `provider/model` string:
+
+```bash
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "openai/gpt-4o-mini",
+    "messages": [
+      {
+        "role": "user",
+        "content": "Explain quantum computing in simple terms"
+      }
+    ],
+    "fallbacks": [
+      "anthropic/claude-3-5-sonnet-20241022",
+      "bedrock/anthropic.claude-3-sonnet-20240229-v1:0"
+    ],
+    "max_tokens": 1000,
+    "temperature": 0.7
+  }'
+```
+
+The response `extra_fields.provider` tells you which provider actually served the request:
+
+```json
+{
+  "id": "chatcmpl-123",
+  "object": "chat.completion",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "Quantum computing is like having a super-powered calculator..."
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 12,
+    "completion_tokens": 150,
+    "total_tokens": 162
+  },
+  "extra_fields": {
+    "provider": "anthropic",
+    "latency": 1.2
+  }
+}
+```
+
+</Tab>
+<Tab title="Go SDK">
+
+```go
+package main
+
+import (
+    "context"
+    "fmt"
+
+    bifrost "github.com/maximhq/bifrost/core"
+    "github.com/maximhq/bifrost/core/schemas"
+)
+
+func chatWithFallbacks(client *bifrost.Bifrost) {
+    ctx := context.Background()
+
+    response, err := client.ChatCompletionRequest(
+        schemas.NewBifrostContext(ctx, schemas.NoDeadline),
+        &schemas.BifrostChatRequest{
+            Provider: schemas.OpenAI,
+            Model:    "gpt-4o-mini",
+            Input: []schemas.ChatMessage{
+                {
+                    Role: schemas.ChatMessageRoleUser,
+                    Content: &schemas.ChatMessageContent{
+                        ContentStr: bifrost.Ptr("Explain quantum computing in simple terms"),
+                    },
+                },
+            },
+            // Fallback chain: OpenAI → Anthropic → Bedrock
+            Fallbacks: []schemas.Fallback{
+                {Provider: schemas.Anthropic, Model: "claude-3-5-sonnet-20241022"},
+                {Provider: schemas.Bedrock, Model: "anthropic.claude-3-sonnet-20240229-v1:0"},
+            },
+            Params: &schemas.ChatParameters{
+                MaxCompletionTokens: bifrost.Ptr(1000),
+                Temperature:         bifrost.Ptr(0.7),
+            },
+        },
+    )
+
+    if err != nil {
+        fmt.Printf("All providers failed: %v\n", err)
+        return
+    }
+
+    fmt.Printf("Response from %s: %s\n",
+        response.ExtraFields.Provider,
+        *response.Choices[0].BifrostNonStreamResponseChoice.Message.Content.ContentStr)
+}
+```
+
+</Tab>
+</Tabs>
+
+---
+
+## How retries and fallbacks work together
+
+The two mechanisms form a nested resilience loop. Retries run inside each provider attempt; fallbacks run across providers once retries are exhausted.
+
+```mermaid
+sequenceDiagram
+    participant App
+    participant Bifrost
+    participant Primary as Primary Provider
+    participant FB1 as Fallback 1
+    participant FB2 as Fallback 2
+
+    App->>Bifrost: Request (primary + fallbacks)
+
+    rect rgb(220, 235, 250)
+        note over Bifrost,Primary: Primary provider attempt (with retries)
+        Bifrost->>Primary: Attempt 1
+        Primary-->>Bifrost: 429 Rate Limit
+        note over Bifrost: Backoff + rotate key
+        Bifrost->>Primary: Attempt 2 (different key)
+        Primary-->>Bifrost: 503 Unavailable
+        note over Bifrost: Backoff
+        Bifrost->>Primary: Attempt 3
+        Primary-->>Bifrost: 503 Unavailable
+        note over Bifrost: max_retries exhausted
+    end
+
+    rect rgb(235, 250, 220)
+        note over Bifrost,FB1: Fallback 1 attempt (with its own retries)
+        Bifrost->>FB1: Attempt 1
+        FB1-->>Bifrost: 500 Server Error
+        note over Bifrost: Backoff
+        Bifrost->>FB1: Attempt 2
+        FB1-->>Bifrost: ✓ Success
+    end
+
+    Bifrost-->>App: Response (from Fallback 1)
+```
+
+**Key point:** each provider in the chain — primary and every fallback — gets its own full `max_retries` budget. A primary configured with `max_retries: 3` and two fallbacks each also configured with `max_retries: 3` means up to 12 total attempts before giving up.
+
+<Info>
+The retry budget is set per-provider in `network_config`. If your fallback providers have different retry configurations, each will use their own settings.
+</Info>
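+
+The upper bound on attempts follows directly from the per-provider budgets. A minimal sketch (with `maxAttempts` as a hypothetical helper):
+
+```go
+package main
+
+import "fmt"
+
+// maxAttempts returns the upper bound on attempts across the chain:
+// each provider contributes 1 initial attempt plus its own max_retries.
+func maxAttempts(maxRetriesPerProvider []int) int {
+    total := 0
+    for _, r := range maxRetriesPerProvider {
+        total += 1 + r
+    }
+    return total
+}
+
+func main() {
+    // Primary plus two fallbacks, each with max_retries: 3.
+    fmt.Println(maxAttempts([]int{3, 3, 3})) // prints 12
+}
+```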
+
+---
+
+## Real-world scenarios
+
+**Scenario 1: Rate limiting with key rotation**
+
+OpenAI key 1 hits its rate limit. Bifrost rotates to key 2 on the next retry, and the request succeeds within the same provider without needing a fallback.
+
+**Scenario 2: Provider outage**
+
+OpenAI is experiencing downtime (returning `503`). Bifrost retries with the same key (transient server issue), exhausts `max_retries`, then fails over to Anthropic. Anthropic succeeds on the first attempt.
+
+**Scenario 3: Cascading failure**
+
+Both primary and first fallback are down. Bifrost works through each provider's retry budget sequentially until the second fallback succeeds.
+
+**Scenario 4: Cost-sensitive fallback**
+
+Primary: a premium model for quality. Fallback: a cost-effective alternative. Governance rules can trigger a budget-exceeded error on the primary, which cascades into the fallback chain.
+
+---
+
+## Plugin execution
+
+When a fallback is triggered, the fallback request is treated as completely new:
+
+- Semantic cache checks run again (the fallback provider may have a cached response)
+- Governance rules apply to the new provider
+- Logging captures the fallback attempt separately
+- All configured plugins execute fresh for each provider in the chain
+
+**Plugin fallback control:** Plugins can prevent fallbacks from being triggered for specific error types. For example, a security plugin might disable fallbacks for compliance reasons. When a plugin sets `AllowFallbacks = false` on the error, the fallback chain is skipped entirely and the original error is returned immediately.
+
+---
+
+## Next steps
+
+- **[Keys Management](./keys-management)** — Configure multiple API keys per provider to enable key rotation on retries
+- **[Governance](./governance/virtual-keys)** — Use virtual keys and routing rules to control which providers are used
+- **[Observability](./observability/default)** — Track retry counts and fallback usage in your logs
--- a/docs/features/semantic-caching.mdx
+++ b/docs/features/semantic-caching.mdx
@@ -0,0 +1,675 @@
+---
+title: "Semantic Caching"
+description: "Intelligent response caching based on semantic similarity. Reduce costs and latency by serving cached responses for semantically similar requests."
+icon: "database"
+---
+
+## Overview
+
+Semantic caching uses vector similarity search to cache AI responses and serve them for semantically similar requests, even when the exact wording differs. This reduces API costs and latency for repeated or similar queries.
+
+**Key Benefits:**
+- **Cost Reduction**: Avoid expensive LLM API calls for similar requests
+- **Improved Performance**: Sub-millisecond cache retrieval vs multi-second API calls
+- **Intelligent Matching**: Semantic similarity beyond exact text matching
+- **Streaming Support**: Full streaming response caching with proper chunk ordering
+
+---
+
+## Core Features
+
+- **Dual-Layer Caching**: Exact hash matching + semantic similarity search (customizable threshold)
+- **Vector-Powered Intelligence**: Uses embeddings to find semantically similar requests
+- **Dynamic Configuration**: Per-request TTL and threshold overrides via headers/context
+- **Model/Provider Isolation**: Separate caching per model and provider combination
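+
+The dual-layer lookup can be pictured as "exact hash first, then nearest neighbor above a threshold". The sketch below is an in-memory illustration only: the `entry` type and `lookup` function are hypothetical, and cosine similarity is an assumed metric (the actual search is delegated to the configured vector store).
+
+```go
+package main
+
+import (
+    "fmt"
+    "math"
+)
+
+// entry is a hypothetical cached item.
+type entry struct {
+    hash      string
+    embedding []float64
+    response  string
+}
+
+// cosine returns the cosine similarity of a and b, or 0 if it is undefined.
+func cosine(a, b []float64) float64 {
+    if len(a) != len(b) {
+        return 0
+    }
+    var dot, na, nb float64
+    for i := range a {
+        dot += a[i] * b[i]
+        na += a[i] * a[i]
+        nb += b[i] * b[i]
+    }
+    if na == 0 || nb == 0 {
+        return 0
+    }
+    return dot / (math.Sqrt(na) * math.Sqrt(nb))
+}
+
+// lookup tries an exact hash match first, then the best semantic match
+// whose similarity is at or above the threshold.
+func lookup(entries []entry, hash string, emb []float64, threshold float64) (string, bool) {
+    for _, e := range entries {
+        if e.hash == hash {
+            return e.response, true
+        }
+    }
+    best, bestScore := "", -1.0
+    for _, e := range entries {
+        if s := cosine(e.embedding, emb); s > bestScore {
+            best, bestScore = e.response, s
+        }
+    }
+    if bestScore >= threshold {
+        return best, true
+    }
+    return "", false
+}
+
+func main() {
+    entries := []entry{{hash: "h1", embedding: []float64{1, 0}, response: "cached answer"}}
+    resp, hit := lookup(entries, "h2", []float64{0.9, 0.1}, 0.8)
+    fmt.Println(resp, hit)
+}
+```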
+
+---
+
+## Vector Store Setup
+
+Semantic caching requires a configured vector store. Bifrost supports the following vector databases:
+
+<CardGroup cols={2}>
+  <Card title="Weaviate" icon="database" href="/integrations/vector-databases/weaviate">
+    Production-ready vector database with gRPC support.
+  </Card>
+  <Card title="Redis / Valkey" icon="database" href="/integrations/vector-databases/redis">
+    High-performance in-memory vector store using RediSearch-compatible APIs.
+  </Card>
+  <Card title="Qdrant" icon="database" href="/integrations/vector-databases/qdrant">
+    Rust-based vector search engine with advanced filtering.
+  </Card>
+  <Card title="Pinecone" icon="database" href="/integrations/vector-databases/pinecone">
+    Managed vector database service with serverless options.
+  </Card>
+</CardGroup>
+
+<Info>
+For detailed setup instructions and configuration options for each vector store, see the [Vector Store documentation](/architecture/framework/vector-store).
+</Info>
+
+**Quick Example (Weaviate):**
+
+<Tabs group="vector-store-setup">
+
+<Tab title="Go SDK">
+
+```go
+import (
+    "context"
+    "github.com/maximhq/bifrost/framework/vectorstore"
+)
+
+// Configure vector store (example: Weaviate)
+vectorConfig := &vectorstore.Config{
+    Enabled: true,
+    Type:    vectorstore.VectorStoreTypeWeaviate,
+    Config: vectorstore.WeaviateConfig{
+        Scheme: "http",
+        Host:   "localhost:8080",
+    },
+}
+
+// Create vector store
+store, err := vectorstore.NewVectorStore(context.Background(), vectorConfig, logger)
+if err != nil {
+    log.Fatal("Failed to create vector store:", err)
+}
+```
+
+</Tab>
+
+<Tab title="config.json">
+
+```json
+{
+  "vector_store": {
+    "enabled": true,
+    "type": "weaviate",
+    "config": {
+      "host": "localhost:8080",
+      "scheme": "http"
+    }
+  }
+}
+```
+
+</Tab>
+
+</Tabs>
+
+---
+
+## Semantic Cache Configuration
+
+> **UI Note**: The current Web UI flow configures provider-backed semantic caching. If you want direct-only mode (`dimension: 1` with no `provider`), configure it through `config.json`.
+
+<Tabs group="cache-config">
+
+<Tab title="Go SDK">
+
+```go
+import (
+    "context"
+    "log"
+    "time"
+
+    bifrost "github.com/maximhq/bifrost/core"
+    "github.com/maximhq/bifrost/core/schemas"
+    "github.com/maximhq/bifrost/plugins/semanticcache"
+)
+
+// Configure semantic cache plugin
+cacheConfig := &semanticcache.Config{
+    // Embedding model configuration (required for semantic matching)
+    Provider:       schemas.OpenAI,
+    Keys:           []schemas.Key{{Value: "sk-..."}},
+    EmbeddingModel: "text-embedding-3-small",
+    Dimension:      1536,
+
+    // Cache behavior
+    TTL:       5 * time.Minute,  // Time to live for cached responses (default: 5 minutes)
+    Threshold: 0.8,              // Similarity threshold for cache lookup (default: 0.8)
+    CleanUpOnShutdown: true,     // Clean up cache on shutdown (default: false)
+    
+    // Conversation behavior
+    ConversationHistoryThreshold: 5,    // Skip caching if conversation has > N messages (default: 3)
+    ExcludeSystemPrompt: bifrost.Ptr(false), // Exclude system messages from cache key (default: false)
+    
+    // Advanced options
+    CacheByModel:    bifrost.Ptr(true),  // Include model in cache key (default: true)
+    CacheByProvider: bifrost.Ptr(true),  // Include provider in cache key (default: true)
+}
+
+// Create plugin
+plugin, err := semanticcache.Init(context.Background(), cacheConfig, logger, store)
+if err != nil {
+    log.Fatal("Failed to create semantic cache plugin:", err)
+}
+
+// Add to Bifrost config
+bifrostConfig := schemas.BifrostConfig{
+    LLMPlugins: []schemas.LLMPlugin{plugin},
+    // ... other config
+}
+```
+
+</Tab>
+
+<Tab title="Web UI">
+
+![Semantic Cache Plugin Configuration](../media/ui-semantic-cache-config.png)
+
+**Note**: Make sure you have a vector store set up (using `config.json`) before configuring the semantic cache plugin.
+
+1. **Navigate to Settings**
+   - Open Bifrost UI at `http://localhost:8080`
+   - Go to Settings.
+
+2. **Configure Semantic Cache Plugin**
+   - Toggle the plugin switch to enable it, and fill in the required fields.
+
+**Required Fields:**
+- **Provider**: The provider used to generate embeddings for cache lookups.
+- **Embedding Model**: The embedding model used to vectorize requests for similarity search.
+- **Dimension**: The embedding dimension for the configured embedding model.
+
+**Note**: Restart the Bifrost server for changes to take effect, because the plugin is loaded only at startup.
+
+</Tab>
+
+<Tab title="config.json">
+
+```json
+{
+  "plugins": [
+    {
+      "enabled": true,
+      "name": "semantic_cache",
+      "config": {        
+        "provider": "openai",
+        "embedding_model": "text-embedding-3-small",
+        "dimension": 1536,
+        
+        "cleanup_on_shutdown": true,
+        "ttl": "5m",
+        "threshold": 0.8,
+        
+        "conversation_history_threshold": 3,
+        "exclude_system_prompt": false,
+        
+        "cache_by_model": true,
+        "cache_by_provider": true
+      }
+    }
+  ]
+}
+```
+
+> **Note**: In `config.json` setups, provider keys are taken from the provider config on initialization, so you do not need to duplicate `keys` inside the plugin config. Updates to the provider keys are not reflected until the next restart.
+
+**TTL Format Options:**
+- Duration strings: `"30s"`, `"5m"`, `"1h"`, `"24h"`
+- Numeric seconds: `300` (5 minutes), `3600` (1 hour)
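+
+For readers who generate this config programmatically, the two TTL forms can be normalized as in the sketch below. `parseTTL` is a hypothetical helper written for illustration; it is not the plugin's parser and only assumes that duration strings follow Go's `time.ParseDuration` syntax.
+
+```go
+package main
+
+import (
+    "fmt"
+    "strconv"
+    "time"
+)
+
+// parseTTL accepts a whole number of seconds ("300") or a duration string ("5m").
+func parseTTL(s string) (time.Duration, error) {
+    if secs, err := strconv.Atoi(s); err == nil {
+        return time.Duration(secs) * time.Second, nil
+    }
+    return time.ParseDuration(s)
+}
+
+func main() {
+    for _, in := range []string{"30s", "5m", "300", "24h"} {
+        d, err := parseTTL(in)
+        fmt.Println(in, d, err)
+    }
+}
+```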
+
+</Tab>
+
+</Tabs>
+
+---
+
+## Direct Hash Mode (Embedding-Free)
+
+Direct hash mode provides exact-match caching without requiring an embedding provider. Each request is hashed deterministically based on its normalized input, parameters, and stream flag. Identical requests produce cache hits; different wording is a cache miss.
+
+Exact-match direct entries are stored and retrieved using a deterministic cache ID. This keeps repeated direct cache lookups fast and consistent across retries, streaming responses, and restarts.
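+
+One way to picture a deterministic cache ID is a hash over a canonical serialization of the request. The sketch below is an assumption-laden illustration, not the plugin's algorithm: SHA-256, JSON serialization, whitespace trimming as the only normalization, and the names `cacheInput` and `directCacheID` are all choices made for this example.
+
+```go
+package main
+
+import (
+    "crypto/sha256"
+    "encoding/hex"
+    "encoding/json"
+    "fmt"
+    "strings"
+)
+
+// cacheInput is a hypothetical canonical form of a request.
+type cacheInput struct {
+    Input  string         `json:"input"`
+    Params map[string]any `json:"params"`
+    Stream bool           `json:"stream"`
+}
+
+// directCacheID hashes the trimmed input, the parameters, and the stream flag.
+// encoding/json writes map keys in sorted order, so equal parameter sets
+// serialize identically regardless of insertion order.
+func directCacheID(input string, params map[string]any, stream bool) (string, error) {
+    b, err := json.Marshal(cacheInput{Input: strings.TrimSpace(input), Params: params, Stream: stream})
+    if err != nil {
+        return "", err
+    }
+    sum := sha256.Sum256(b)
+    return hex.EncodeToString(sum[:]), nil
+}
+
+func main() {
+    id, err := directCacheID("What is Pale Blue Dot?", map[string]any{"temperature": 0.7}, false)
+    if err != nil {
+        panic(err)
+    }
+    fmt.Println(id)
+}
+```
+
+Identical requests produce the same ID, while a change in wording, parameters, or the stream flag produces a different one.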
+
+**When to use direct hash mode:**
+- You only need exact-match deduplication (no fuzzy/semantic matching)
+- You cannot or do not want to call an external embedding API
+- You want the lowest possible latency with zero embedding overhead
+- Cost-sensitive environments where embedding API calls add up
+
+### Setup
+
+To enable direct-only mode globally, set `dimension: 1` and omit the `provider` and `keys` fields from the plugin config. The plugin will automatically fall back to direct search only.
+
+> **Important**: If you specify `dimension: 1` and also provide a `provider`, Bifrost treats the config as provider-backed semantic mode, not direct-only mode. To use direct-only mode, omit the `provider` field entirely.
+
+<Warning>
+A vector store is still required as the storage backend, even in direct hash mode. See [Recommended Vector Store](#recommended-vector-store) below for the best choice.
+</Warning>
+
+<Tabs group="direct-hash-setup">
+
+<Tab title="Go SDK">
+
+```go
+import (
+    "time"
+
+    bifrost "github.com/maximhq/bifrost/core"
+    "github.com/maximhq/bifrost/plugins/semanticcache"
+)
+
+cacheConfig := &semanticcache.Config{
+    // No Provider, Keys, or EmbeddingModel: direct hash mode only.
+    // Dimension 1 is a placeholder; entries are stored as metadata-only (no embedding vectors).
+    // Change the dimension before switching to dual-layer mode to avoid mixed-dimension issues.
+    Dimension: 1,
+
+    TTL:               5 * time.Minute,
+    CleanUpOnShutdown: true,
+    CacheByModel:      bifrost.Ptr(true),
+    CacheByProvider:   bifrost.Ptr(true),
+}
+
+plugin, err := semanticcache.Init(ctx, cacheConfig, logger, store)
+```
+
+</Tab>
+
+<Tab title="Helm">
+
+```yaml
+bifrost:
+  plugins:
+    semanticCache:
+      enabled: true
+      config:
+        dimension: 1
+        ttl: "5m"
+        cleanup_on_shutdown: true
+        cache_by_model: true
+        cache_by_provider: true
+```
+
+</Tab>
+
+<Tab title="config.json">
+
+```json
+{
+  "plugins": [
+    {
+      "enabled": true,
+      "name": "semantic_cache",
+      "config": {
+        "dimension": 1,
+        "ttl": "5m",
+        "cleanup_on_shutdown": true,
+        "cache_by_model": true,
+        "cache_by_provider": true
+      }
+    }
+  ]
+}
+```
+
+</Tab>
+
+</Tabs>
+
+When initialized this way, all requests automatically use direct hash matching regardless of the `x-bf-cache-type` header. No embeddings are generated, and no embedding provider credentials are needed.
+
+### Recommended Vector Store
+
+**Redis/Valkey-compatible stores** are recommended for direct hash mode. They do not require vectors for metadata-only entries, and all cache fields are indexed as TAG fields for fast exact-match lookups.
+
+<Warning>
+Qdrant and Pinecone are not compatible with direct hash mode when no embedding provider is configured. These stores require a vector for every entry; the plugin's zero-vector placeholder codepath requires an initialised embedding client, so storage will fail if no provider is set. Weaviate requires a vector per entry as well and is therefore also not recommended for direct-only mode.
+</Warning>
+
+<Tabs group="direct-hash-redis">
+
+<Tab title="Helm">
+
+```yaml
+vectorStore:
+  enabled: true
+  type: redis
+  redis:
+    external:
+      enabled: true
+      host: "redis-or-valkey.example.com"
+      port: 6379
+      password: "your-redis-password"
+```
+
+</Tab>
+
+<Tab title="config.json">
+
+```json
+{
+  "vector_store": {
+    "enabled": true,
+    "type": "redis",
+    "config": {
+      "addr": "localhost:6379"
+    }
+  }
+}
+```
+
+<Info>
+For Valkey deployments, keep `vector_store.type` as `"redis"` and point `config.addr` to your Valkey endpoint.
+</Info>
+
+</Tab>
+
+</Tabs>
+
+### Per-Request Cache Type Override
+
+When the plugin is initialized **without** an embedding provider (direct-only mode), all requests use direct hash matching automatically. The `x-bf-cache-type` header has no effect.
+
+When the plugin is initialized **with** an embedding provider (dual-layer mode), you can force direct-only matching on specific requests using the `x-bf-cache-type: direct` header. See [Cache Type Control](#cache-type-control) for details.
+
+---
+
+## Cache Triggering
+
+<Warning>
+**Cache Key is mandatory**: Semantic caching only activates when a cache key is provided. Without a cache key, requests bypass caching entirely.
+</Warning>
+
+<Tabs group="cache-triggering">
+
+<Tab title="Go SDK">
+Set the cache key in the request context:
+
+```go
+// This request WILL be cached
+ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
+response, err := client.ChatCompletionRequest(schemas.NewBifrostContext(ctx, schemas.NoDeadline), request)
+
+// This request will NOT be cached (no context value)
+response, err = client.ChatCompletionRequest(schemas.NewBifrostContext(context.Background(), schemas.NoDeadline), request)
+```
+
+</Tab>
+
+<Tab title="HTTP API">
+Must set cache key in request header `x-bf-cache-key`:
+
+```bash
+# This request WILL be cached
+curl -H "x-bf-cache-key: session-123" ...
+
+# This request will NOT be cached (no header)
+curl ...
+```
+
+</Tab>
+
+</Tabs>
+
+## Per-Request Overrides
+
+Override default TTL and similarity threshold per request:
+
+<Tabs group="per-request-overrides">
+
+<Tab title="Go SDK">
+
+You can set TTL and threshold in the request context using the semantic cache context keys:
+
+```go
+// Go SDK: Custom TTL and threshold
+ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
+ctx = context.WithValue(ctx, semanticcache.CacheTTLKey, 30*time.Second)
+ctx = context.WithValue(ctx, semanticcache.CacheThresholdKey, 0.9)
+```
+
+</Tab>
+
+<Tab title="HTTP API">
+
+You can set TTL and threshold in the request headers `x-bf-cache-ttl` and `x-bf-cache-threshold`:
+
+```bash
+# HTTP: Custom TTL and threshold
+curl -H "x-bf-cache-key: session-123" \
+     -H "x-bf-cache-ttl: 30s" \
+     -H "x-bf-cache-threshold: 0.9" ...
+```
+
+</Tab>
+
+</Tabs>
+
+---
+
+## Advanced Cache Control
+
+### Cache Type Control
+
+Control which caching mechanism to use per request:
+
+<Tabs group="cache-type-control">
+
+<Tab title="Go SDK">
+
+```go
+// Use only direct hash matching (fastest)
+ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
+ctx = context.WithValue(ctx, semanticcache.CacheTypeKey, semanticcache.CacheTypeDirect)
+
+// Use only semantic similarity search
+ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")  
+ctx = context.WithValue(ctx, semanticcache.CacheTypeKey, semanticcache.CacheTypeSemantic)
+
+// Default behavior: Direct + semantic fallback (if not specified)
+ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
+```
+
+</Tab>
+
+<Tab title="HTTP API">
+
+```bash
+# Direct hash matching only
+curl -H "x-bf-cache-key: session-123" \
+     -H "x-bf-cache-type: direct" ...
+
+# Semantic similarity search only  
+curl -H "x-bf-cache-key: session-123" \
+     -H "x-bf-cache-type: semantic" ...
+
+# Default: Both (if header not specified)
+curl -H "x-bf-cache-key: session-123" ...
+```
+
+</Tab>
+
+</Tabs>
+
+### No-Store Control
+
+Disable response caching while still allowing cache reads:
+
+<Tabs group="no-store-control">
+
+<Tab title="Go SDK">
+
+```go
+// Read from cache but don't store the response
+ctx = context.WithValue(ctx, semanticcache.CacheKey, "session-123")
+ctx = context.WithValue(ctx, semanticcache.CacheNoStoreKey, true)
+```
+
+</Tab>
+
+<Tab title="HTTP API">
+
+```bash
+# Read from cache but don't store response
+curl -H "x-bf-cache-key: session-123" \
+     -H "x-bf-cache-no-store: true" ...
+```
+
+</Tab>
+
+</Tabs>
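+
+For HTTP clients written in Go, the cache headers described above can be grouped in one place. The following is a minimal sketch, assuming only the header names documented on this page; the `cacheOptions` type and its `apply` method are hypothetical helpers and are not part of Bifrost.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/http"
+	"strconv"
+	"time"
+)
+
+// cacheOptions is a hypothetical struct (not part of Bifrost) that groups
+// the per-request cache headers documented on this page.
+type cacheOptions struct {
+	Key       string        // x-bf-cache-key (required for caching)
+	TTL       time.Duration // x-bf-cache-ttl, omitted when zero
+	Threshold float64       // x-bf-cache-threshold, omitted when zero
+	Type      string        // x-bf-cache-type: "direct" or "semantic", omitted when empty
+	NoStore   bool          // x-bf-cache-no-store, omitted when false
+}
+
+// apply writes only the options that were set, so gateway defaults stay in effect.
+func (o cacheOptions) apply(h http.Header) {
+	h.Set("x-bf-cache-key", o.Key)
+	if o.TTL > 0 {
+		h.Set("x-bf-cache-ttl", o.TTL.String())
+	}
+	if o.Threshold > 0 {
+		h.Set("x-bf-cache-threshold", strconv.FormatFloat(o.Threshold, 'f', -1, 64))
+	}
+	if o.Type != "" {
+		h.Set("x-bf-cache-type", o.Type)
+	}
+	if o.NoStore {
+		h.Set("x-bf-cache-no-store", "true")
+	}
+}
+
+func main() {
+	headers := http.Header{}
+	opts := cacheOptions{Key: "session-123", TTL: 30 * time.Second, Threshold: 0.9, Type: "direct"}
+	opts.apply(headers)
+	for name, values := range headers {
+		fmt.Println(name, values)
+	}
+}
+```
+
+Leaving a field at its zero value omits the corresponding header, so the request falls back to the plugin's configured defaults.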
+
+---
+
+## Conversation Configuration
+
+### History Threshold Logic
+
+The `ConversationHistoryThreshold` setting skips caching for conversations with many messages to prevent false positives:
+
+**Why this matters:**
+- **Semantic False Positives**: Long conversation histories have a high probability of semantic matches with unrelated conversations due to topic overlap
+- **Direct Cache Inefficiency**: Long conversations rarely have exact hash matches, making direct caching less effective
+- **Performance**: Reduces vector store load by filtering out low-value caching scenarios
+
+```json
+{
+  "conversation_history_threshold": 3  // Skip caching if > 3 messages in conversation
+}
+```
+
+**Recommended Values:**
+- **1-2**: Very conservative (may miss valuable caching opportunities)
+- **3-5**: Balanced approach (default: 3)
+- **10+**: Cache longer conversations (higher false positive risk)  
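+
+As a rough illustration of the rule above, the following Go sketch skips caching once the message count exceeds the threshold. The function name `shouldSkipCache` is hypothetical and is not part of the plugin's API; the plugin's actual check may differ in detail.
+
+```go
+package main
+
+import "fmt"
+
+// shouldSkipCache is a hypothetical helper that mirrors the documented rule:
+// caching is skipped when the conversation has more messages than the threshold.
+func shouldSkipCache(messageCount, threshold int) bool {
+	return messageCount > threshold
+}
+
+func main() {
+	const threshold = 3 // documented default
+	for _, count := range []int{1, 3, 4} {
+		fmt.Printf("messages=%d skip=%v\n", count, shouldSkipCache(count, threshold))
+	}
+}
+```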
+
+### System Prompt Handling
+
+Control whether system messages are included in cache key generation:
+
+```json  
+{
+  "exclude_system_prompt": false  // Include system messages in cache key (default)
+}
+```
+
+**When to exclude (`true`):**
+- System prompts change frequently but content is similar
+- Multiple system prompt variations for same use case
+- Focus caching on user content similarity
+
+**When to include (`false`):**
+- System prompts significantly change response behavior  
+- Each system prompt requires distinct cached responses
+- Strict response consistency requirements
+
+---
+
+## Cache Management
+
+### Cache Metadata Location
+
+When a response is served from the semantic cache, cache metadata is automatically added to the response. Some fields are also present on cache misses, as noted below.
+
+**Location**: `response.ExtraFields.CacheDebug` (as a JSON object)
+
+**Fields**:
+- `CacheHit` (boolean): `true` if the response was served from the cache, `false` on a cache miss.
+- `HitType` (string): `"semantic"` for similarity match, `"direct"` for hash match
+- `CacheID` (string): Unique cache entry ID for management operations (present only for cache hits)
+
+
+**Semantic Cache Only**:
+- `ProviderUsed` (string): Provider used to calculate the embedding for semantic matching. (present for both cache hits and misses)
+- `ModelUsed` (string): Model used to calculate the embedding for semantic matching. (present for both cache hits and misses)
+- `InputTokens` (number): Number of tokens extracted from the request for the semantic match embedding calculation. (present for both cache hits and misses)
+- `Threshold` (number): Similarity threshold used for the match. (present only for cache hits)
+- `Similarity` (number): Similarity score for the match. (present only for cache hits)
+
+Example HTTP Response:
+
+```json
+{
+  "extra_fields": {
+    "cache_debug": {
+      "cache_hit": true,
+      "hit_type": "direct",
+      "cache_id": "550e8500-e29b-41d4-a725-446655440001"
+    }
+  }
+}
+
+{
+  "extra_fields": {
+    "cache_debug": {
+      "cache_hit": true,
+      "hit_type": "semantic",
+      "cache_id": "550e8500-e29b-41d4-a725-446655440001",
+      "threshold": 0.8,
+      "similarity": 0.95,
+      "provider_used": "openai",
+      "model_used": "gpt-4o-mini",
+      "input_tokens": 100
+    }
+  }
+}
+
+{
+  "extra_fields": {
+    "cache_debug": {
+      "cache_hit": false,
+      "provider_used": "openai",
+      "model_used": "gpt-4o-mini",
+      "input_tokens": 20
+    }
+  }
+}
+```
+
+
+These fields let you detect cached responses and obtain the cache entry ID needed for clearing specific entries.
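+
+For HTTP clients, one way to read this metadata is to decode only the `extra_fields.cache_debug` object. The sketch below is a minimal Go example based on the JSON shapes shown above; the type and function names are illustrative assumptions, not Bifrost SDK types.
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+)
+
+// cacheDebug mirrors a subset of the cache_debug fields from the examples above.
+// String and number fields keep their zero value when absent (for example, on a miss).
+type cacheDebug struct {
+	CacheHit   bool    `json:"cache_hit"`
+	HitType    string  `json:"hit_type"`
+	CacheID    string  `json:"cache_id"`
+	Similarity float64 `json:"similarity"`
+}
+
+// responseEnvelope captures only the part of the response needed here.
+type responseEnvelope struct {
+	ExtraFields struct {
+		CacheDebug *cacheDebug `json:"cache_debug"`
+	} `json:"extra_fields"`
+}
+
+// parseCacheDebug extracts cache metadata from a raw response body.
+// It returns nil when the response carries no cache_debug object.
+func parseCacheDebug(body []byte) (*cacheDebug, error) {
+	var env responseEnvelope
+	if err := json.Unmarshal(body, &env); err != nil {
+		return nil, err
+	}
+	return env.ExtraFields.CacheDebug, nil
+}
+
+func main() {
+	body := []byte(`{"extra_fields":{"cache_debug":{"cache_hit":true,"hit_type":"direct","cache_id":"550e8500-e29b-41d4-a725-446655440001"}}}`)
+	debug, err := parseCacheDebug(body)
+	if err != nil {
+		panic(err)
+	}
+	if debug != nil && debug.CacheHit {
+		fmt.Printf("cache hit (%s), entry %s\n", debug.HitType, debug.CacheID)
+	}
+}
+```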
+
+### Clear Specific Cache Entry
+
+Use the request ID from cached responses to clear specific entries:
+
+<Tabs group="cache-clear">
+
+<Tab title="Go SDK">
+
+```go
+// Clear specific entry by request ID
+err := plugin.ClearCacheForRequestID("550e8400-e29b-41d4-a716-446655440000")
+
+// Clear all entries for a cache key  
+err := plugin.ClearCacheForKey("support-session-456")
+```
+
+</Tab>
+
+<Tab title="HTTP API">
+
+```bash
+# Clear specific cached entry by request ID
+curl -X DELETE http://localhost:8080/api/cache/clear/550e8400-e29b-41d4-a716-446655440000
+
+# Clear all entries for a cache key
+curl -X DELETE http://localhost:8080/api/cache/clear-by-key/support-session-456
+```
+
+</Tab>
+
+</Tabs>
+
+### Cache Lifecycle & Cleanup
+
+The semantic cache automatically handles cleanup to prevent storage bloat:
+
+**Automatic Cleanup:**
+- **TTL Expiration**: Entries are automatically removed when TTL expires
+- **Shutdown Cleanup**: When `cleanup_on_shutdown` is `true`, all cache entries and the namespace itself are removed from the vector store when the Bifrost client shuts down
+- **Namespace Isolation**: Each Bifrost instance uses isolated vector store namespaces to prevent conflicts
+
+**Manual Cleanup Options:**
+- Clear specific entries by request ID (see examples above)
+- Clear all entries for a cache key
+- Restart Bifrost with `cleanup_on_shutdown: true` to clear all cache data
+
+<Warning>
+The semantic cache namespace and all its cache entries are deleted when the Bifrost client shuts down **only if `cleanup_on_shutdown` is set to `true`**. By default (`cleanup_on_shutdown: false`), cache data persists between restarts. Do not use the plugin's namespace for external purposes.
+</Warning>
+
+<Warning>
+**Dimension Changes**: If you update the `dimension` config, the existing namespace will contain data with mixed dimensions, causing retrieval issues. To avoid this, either use a different `vector_store_namespace` or set `cleanup_on_shutdown: true` before restarting.
+</Warning>
+
+---
+
+<Info>
+**Vector Store Requirement**: Semantic caching requires a configured vector store. Bifrost supports Weaviate, Redis/Valkey-compatible endpoints, Qdrant, and Pinecone. See the [Vector Store documentation](/architecture/framework/vector-store) for setup details.
+</Info>
--- a/docs/features/sso-with-google-github.mdx
+++ b/docs/features/sso-with-google-github.mdx
@@ -0,0 +1,6 @@
+---
+title: "SSO with Google & GitHub"
+description: "Secure single sign-on authentication with Google and GitHub OAuth providers."
+tag: "Coming soon"
+icon: "sign-in-alt"
+---
--- a/docs/features/telemetry.mdx
+++ b/docs/features/telemetry.mdx
@@ -0,0 +1,322 @@
+---
+title: "Telemetry"
+description: "Comprehensive Prometheus-based monitoring for Bifrost Gateway with custom metrics and labels."
+icon: "gauge"
+---
+
+## Overview
+
+Bifrost provides built-in telemetry and monitoring capabilities through Prometheus metrics collection. The telemetry system tracks both HTTP-level performance metrics and upstream provider interactions, giving you complete visibility into your AI gateway's performance and usage patterns.
+
+**Key Features:**
+- **Prometheus Integration** - Native metrics collection at the `/metrics` endpoint
+- **Comprehensive Tracking** - Success/error rates, token usage, costs, and cache performance
+- **Custom Labels** - Configurable dimensions for detailed analysis
+- **Dynamic Headers** - Runtime label injection via `x-bf-prom-*` headers
+- **Cost Monitoring** - Real-time tracking of AI provider costs in USD
+- **Cache Analytics** - Direct and semantic cache hit tracking
+- **Async Collection** - Zero-latency impact on request processing
+- **Multi-Level Tracking** - HTTP transport + upstream provider metrics
+
+The telemetry plugin operates asynchronously to ensure metrics collection doesn't impact request latency or connection performance.
+
+---
+
+## Default Metrics
+
+### HTTP Transport Metrics
+
+These metrics track all incoming HTTP requests to Bifrost:
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `http_requests_total` | Counter | Total number of HTTP requests |
+| `http_request_duration_seconds` | Histogram | Duration of HTTP requests |
+| `http_request_size_bytes` | Histogram | Size of incoming HTTP requests |
+| `http_response_size_bytes` | Histogram | Size of outgoing HTTP responses |
+
+Labels:
+- `path`: HTTP endpoint path
+- `method`: HTTP verb (e.g., `GET`, `POST`, `PUT`, `DELETE`)
+- `status`: HTTP status code
+- custom labels: Custom labels configured in the Bifrost configuration
+
+### Upstream Provider Metrics
+
+These metrics track requests forwarded to AI providers:
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|---------|
+| `bifrost_upstream_requests_total` | Counter | Total requests forwarded to upstream providers | Base Labels, custom labels |
+| `bifrost_success_requests_total` | Counter | Total successful requests to upstream providers | Base Labels, custom labels |
+| `bifrost_error_requests_total` | Counter | Total failed requests to upstream providers | Base Labels, `status_code`, custom labels |
+| `bifrost_upstream_latency_seconds` | Histogram | Latency of upstream provider requests | Base Labels, `is_success`, custom labels |
+| `bifrost_input_tokens_total` | Counter | Total input tokens sent to upstream providers | Base Labels, custom labels |
+| `bifrost_output_tokens_total` | Counter | Total output tokens received from upstream providers | Base Labels, custom labels |
+| `bifrost_cache_hits_total` | Counter | Total cache hits by type (direct/semantic) | Base Labels, `cache_type`, custom labels |
+| `bifrost_cost_total` | Counter | Total cost in USD for upstream provider requests | Base Labels, custom labels |
+
+Base Labels:
+- `provider`: AI provider name (e.g., `openai`, `anthropic`, `azure`)
+- `model`: Model name (e.g., `gpt-4o-mini`, `claude-3-sonnet`)
+- `method`: Request type (`chat`, `text`, `embedding`, `speech`, `transcription`)
+- `virtual_key_id`: Virtual key ID
+- `virtual_key_name`: Virtual key name
+- `routing_engines_used`: Comma-separated routing engines used (`routing-rule`, `governance`, `loadbalancing`)
+- `routing_rule_id`: Routing rule ID that matched the request
+- `routing_rule_name`: Routing rule name that matched the request
+- `selected_key_id`: ID of the key that successfully served the request (`null` on final errors)
+- `selected_key_name`: Name of the key that successfully served the request (`null` on final errors)
+- `number_of_retries`: Number of retries
+- `fallback_index`: Fallback index (0 for first attempt, 1 for second attempt, etc.)
+- custom labels: Custom labels configured in the Bifrost configuration
+
+### Streaming Metrics
+
+These metrics capture latency characteristics specific to streaming responses:
+
+| Metric | Type | Description | Labels |
+|--------|------|-------------|---------|
+| `bifrost_stream_first_token_latency_seconds` | Histogram | Time from request start to first streamed token | Base Labels |
+| `bifrost_stream_inter_token_latency_seconds` | Histogram | Latency between subsequent streamed tokens | Base Labels |
+
+---
+
+## Monitoring Examples
+
+### Success Rate Monitoring
+Track the success rate of requests to different providers:
+
+```promql
+# Success rate by provider
+sum by (provider) (rate(bifrost_success_requests_total[5m])) /
+sum by (provider) (rate(bifrost_upstream_requests_total[5m])) * 100
+```
+
+### Token Usage Analysis
+Monitor token consumption across different models:
+
+```promql
+# Input tokens per minute by model
+sum by (model) (increase(bifrost_input_tokens_total[1m]))
+
+# Output tokens per minute by model
+sum by (model) (increase(bifrost_output_tokens_total[1m]))
+
+# Token efficiency (output/input ratio) by model
+sum by (model) (rate(bifrost_output_tokens_total[5m])) /
+sum by (model) (rate(bifrost_input_tokens_total[5m]))
+```
+
+### Cost Tracking
+Monitor spending across providers and models:
+
+```promql
+# Cost per second by provider
+sum by (provider) (rate(bifrost_cost_total[1m]))
+
+# Daily cost estimate
+sum by (provider) (increase(bifrost_cost_total[1d]))
+
+# Cost per request by provider and model
+sum by (provider, model) (rate(bifrost_cost_total[5m])) / 
+sum by (provider, model) (rate(bifrost_upstream_requests_total[5m]))
+```
+
+### Cache Performance
+Track cache effectiveness:
+
+```promql
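+# Cache hit rate by type (the request total is converted to a scalar because it has no cache_type label)
+sum by (cache_type) (rate(bifrost_cache_hits_total[5m])) /
+scalar(sum(rate(bifrost_upstream_requests_total[5m]))) * 100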
+
+# Direct vs semantic cache hits
+sum by (cache_type) (rate(bifrost_cache_hits_total[5m]))
+```
+
+### Error Rate Analysis
+Monitor error patterns:
+
+```promql
+# Error rate by provider
+sum by (provider) (rate(bifrost_error_requests_total[5m])) /
+sum by (provider) (rate(bifrost_upstream_requests_total[5m])) * 100
+
+# Errors by model
+sum by (model) (rate(bifrost_error_requests_total[5m]))
+```
+
+---
+
+## Configuration
+
+Configure custom Prometheus labels to add dimensions for filtering and analysis:
+
+<Tabs group="config-method">
+<Tab title="Web UI">
+
+![Prometheus Labels](../media/ui-prometheus-labels.png)
+
+1. **Navigate to Configuration**
+   - Open Bifrost UI at `http://localhost:8080`
+   - Go to **Config** tab
+
+2. **Prometheus Labels**
+   ```
+   Custom Labels: team, environment, organization, project
+   ```
+
+</Tab>
+<Tab title="API">
+
+```bash
+# Update prometheus labels via API
+curl -X PATCH http://localhost:8080/config \
+  -H "Content-Type: application/json" \
+  -d '{
+    "client": {
+      "prometheus_labels": ["team", "environment", "organization", "project"]
+    }
+  }'
+```
+
+</Tab>
+<Tab title="config.json">
+
+```json
+{
+  "client": {
+    "prometheus_labels": ["team", "environment", "organization", "project"],
+    "drop_excess_requests": false,
+    "initial_pool_size": 300
+  }
+}
+```
+
+</Tab>
+</Tabs>
+
+### Dynamic Label Injection
+
+Add custom label values at runtime using `x-bf-prom-*` headers:
+
+```bash
+# Add custom labels to specific requests
+curl -X POST http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -H "x-bf-prom-team: engineering" \
+  -H "x-bf-prom-environment: production" \
+  -H "x-bf-prom-organization: my-org" \
+  -H "x-bf-prom-project: my-project" \
+  -d '{
+    "model": "gpt-4o-mini",
+    "messages": [{"role": "user", "content": "Hello!"}]
+  }'
+```
+
+**Header Format:**
+- Prefix: `x-bf-prom-`
+- Label name: Any string after the prefix
+- Value: String value for the label
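+
+The same headers can be attached from a Go HTTP client. This is a minimal sketch, assuming the endpoint and payload from the curl example above; `withPromLabels` is a hypothetical helper, and the request is only built here, not sent.
+
+```go
+package main
+
+import (
+	"fmt"
+	"net/http"
+	"strings"
+)
+
+// withPromLabels is a hypothetical helper that copies label values onto a
+// request as x-bf-prom-<label> headers.
+func withPromLabels(req *http.Request, labels map[string]string) {
+	for name, value := range labels {
+		req.Header.Set("x-bf-prom-"+name, value)
+	}
+}
+
+func main() {
+	body := strings.NewReader(`{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Hello!"}]}`)
+	req, err := http.NewRequest(http.MethodPost, "http://localhost:8080/v1/chat/completions", body)
+	if err != nil {
+		panic(err)
+	}
+	req.Header.Set("Content-Type", "application/json")
+	withPromLabels(req, map[string]string{
+		"team":        "engineering",
+		"environment": "production",
+	})
+	// Header names are case-insensitive, so the lookup works with the lowercase form.
+	fmt.Println(req.Header.Get("x-bf-prom-team"))
+}
+```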
+
+---
+
+## Infrastructure Setup
+
+### Development & Testing
+
+For local development and testing, use the provided Docker Compose setup:
+
+```bash
+# Navigate to telemetry plugin directory
+cd plugins/telemetry
+
+# Start Prometheus and Grafana
+docker-compose up -d
+
+# Access endpoints
+# Prometheus: http://localhost:9090
+# Grafana: http://localhost:3000 (admin/admin)
+# Bifrost metrics: http://localhost:8080/metrics
+```
+
+<Warning>
+**Development Only**: The provided Docker Compose setup is for testing purposes only. Do not use in production without proper security, scaling, and persistence configuration.
+</Warning>
+
+You can use the Prometheus scraping endpoint to build your own Grafana dashboards. Below is an example created with the Docker Compose setup.
+
+![Grafana Dashboard](../media/ui-grafana-dashboard.png)
+
+### Production Deployment
+
+For production environments:
+
+1. **Deploy Prometheus** with proper persistence, retention, and security
+2. **Configure scraping** to target your Bifrost instances at `/metrics`
+3. **Set up Grafana** with authentication and dashboards
+4. **Configure alerts** based on your SLA requirements
+
+**Prometheus Scrape Configuration:**
+```yaml
+scrape_configs:
+  - job_name: "bifrost-gateway"
+    static_configs:
+      - targets: ["bifrost-instance-1:8080", "bifrost-instance-2:8080"]
+    scrape_interval: 30s
+    metrics_path: /metrics
+    # If Bifrost auth is enabled, add:
+    # basic_auth:
+    #   username: '<admin_username>'
+    #   password: '<admin_password>'
+```
+
+<Info>
+  If you have Bifrost authentication enabled (`auth_config`), you must include `basic_auth` in the scrape config with your `admin_username` and `admin_password`. See the [Prometheus docs](/features/observability/prometheus#pull-based-scraping) for details.
+</Info>
+
+### Production Alerting Examples
+
+Configure alerts for critical scenarios using the metrics described above:
+
+**High Error Rate Alert:**
+```yaml
+- alert: BifrostHighErrorRate
+  expr: sum by (provider) (rate(bifrost_error_requests_total[5m])) / sum by (provider) (rate(bifrost_upstream_requests_total[5m])) > 0.05
+  for: 2m
+  labels:
+    severity: warning
+  annotations:
+    summary: "High error rate detected for provider {{ $labels.provider }} ({{ $value | humanizePercentage }})"
+```
+
+**High Cost Alert:**
+```yaml
+- alert: BifrostHighCosts
+  expr: sum by (provider) (increase(bifrost_cost_total[1d])) > 100  # $100/day threshold
+  for: 10m
+  labels:
+    severity: warning
+  annotations:
+    summary: "Daily cost for provider {{ $labels.provider }} exceeds $100 ({{ $value | printf \"%.2f\" }})"
+```
+
+**Cache Performance Alert:**
+```yaml
+- alert: BifrostLowCacheHitRate
+  expr: sum by (provider) (rate(bifrost_cache_hits_total[15m])) / sum by (provider) (rate(bifrost_upstream_requests_total[15m])) < 0.1
+  for: 5m
+  labels:
+    severity: info
+  annotations:
+    summary: "Cache hit rate for provider {{ $labels.provider }} below 10% ({{ $value | humanizePercentage }})"
+```
+
+---
+
+## Next Steps
+
+- **[Prometheus Documentation](https://prometheus.io/docs/)** - Official Prometheus guides
+- **[Grafana Setup](https://grafana.com/docs/)** - Dashboard creation and management
+- **[Tracing](./observability/default)** - Request/response logging for detailed analysis