---
title: "Provider Routing"
description: "Understand how Bifrost routes requests across AI providers using governance rules and adaptive load balancing."
icon: "route"
---

## Overview

Bifrost offers two methods for routing requests across AI providers, each serving different use cases:

1. **Governance-based Routing**: Explicit, user-defined routing rules configured via Virtual Keys
2. **Adaptive Load Balancing**: Automatic, performance-based routing powered by real-time metrics (Enterprise feature)

When both methods are available, **governance takes precedence** because users have explicitly defined their routing preferences through provider configurations on Virtual Keys.

**When to use which method:**

- Use **Governance** when you need explicit control, must meet compliance requirements, or want specific cost optimization strategies
- Use **Adaptive Load Balancing** for automatic performance optimization and minimal configuration overhead

---

## The Model Catalog

The Model Catalog is Bifrost's central registry that tracks which models are available from which providers. It powers both governance-based routing and adaptive load balancing by maintaining an up-to-date mapping of models to providers.

**Architecture Documentation**: For detailed technical documentation on the Model Catalog implementation, including API reference, thread safety, and advanced usage patterns, see [Model Catalog Architecture](/architecture/framework/model-catalog).

### Data Sources

The Model Catalog combines two data sources to maintain a comprehensive and up-to-date model registry:

1. **Pricing Data** (Primary source)
   - Downloaded from a remote URL (configurable, defaults to `https://getbifrost.ai/datasheet`)
   - Contains model names, pricing tiers, and provider mappings
   - Synced to database on startup and refreshed periodically (default: every 24 hours)
   - Used for cost calculation and initial model-to-provider mapping
   - **Stored as**: In-memory map `pricingData[model|provider|mode]` for O(1) lookups

2. **Provider List Models API** (Secondary source)
   - Calls each provider's `/v1/models` endpoint during startup
   - Enriches the catalog with provider-specific models and aliases
   - Re-fetched when providers are added/updated via API or dashboard
   - Adds models that may not be in pricing data yet (e.g., newly released models)
   - **Stored as**: In-memory map `modelPool[provider][]models`

**Why two sources?** Pricing data provides comprehensive model coverage with cost information, while the List Models API ensures you can use newly released models immediately without waiting for pricing data updates.

### How Model Availability is Determined

Bifrost determines model availability through the following catalog lookups:

**Purpose**: Find all models available for a specific provider

**Lookup Process**:

1. Check `modelPool[provider]` for direct matches
2. Return all models in that provider's slice

**Example**:

```go
models := GetModelsForProvider("openai")
// Returns: ["gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "gpt-3.5-turbo", ...]
```

**Used by**:

- Routing methods to validate `allowed_models`
- Dashboard model selector dropdowns
- API responses for `/v1/models?provider=openai`

**Purpose**: Find all providers that support a specific model

**Lookup Process**:

1. **Direct lookup**: Check each provider's model list in `modelPool`
2. **Cross-provider resolution**: Apply special handling for proxy providers

**Special Cross-Provider Rules**:

If model is not found directly, check if `provider/model` exists in OpenRouter

```go
// Request: claude-3-5-sonnet
// Checks: openrouter models for "anthropic/claude-3-5-sonnet"
// Result: Adds "openrouter" to providers list
```

If model is not found directly, check if `provider/model` exists in Vertex

```go
// Request: claude-3-5-sonnet
// Checks: vertex models for "anthropic/claude-3-5-sonnet"
// Result: Adds "vertex" to providers list
```

For GPT models, check if `openai/model` exists in Groq

```go
// Request: gpt-3.5-turbo
// Checks: groq models for "openai/gpt-3.5-turbo"
// Result: Adds "groq" to providers list
```

For Claude models, check Bedrock with flexible matching

```go
// Request: claude-3-5-sonnet
// Checks: bedrock models containing "claude-3-5-sonnet"
// Matches: "anthropic.claude-3-5-sonnet-20240620-v1:0"
// Result: Adds "bedrock" to providers list
```

**Example**:

```go
providers := GetProvidersForModel("claude-3-5-sonnet")
// Returns: ["anthropic", "vertex", "bedrock", "openrouter"]
// Even though the request was just "claude-3-5-sonnet"!
```

**Used by**:

- Load balancing to find candidate providers
- Fallback generation
- Model validation in requests

**Purpose**: Get pricing data for cost calculation and model validation

**Lookup Key**: `model|provider|mode` (e.g., `gpt-4o|openai|chat`)

**Fallback Chain**:

1. **Primary lookup**: `model|provider|requestType`
2. **Gemini → Vertex**: If Gemini not found, try Vertex with same model
3. **Vertex format stripping**: For `provider/model`, strip prefix and retry
4. **Bedrock prefix handling**: For Claude models, try with `anthropic.` prefix
5. **Responses → Chat**: If Responses mode not found, try Chat mode

**Example Flow**:

```go
// Request: claude-3-5-sonnet on Gemini (Responses API)
// 1. Try: claude-3-5-sonnet|gemini|responses → Not found
// 2. Try: claude-3-5-sonnet|vertex|responses → Not found
// 3. Try: claude-3-5-sonnet|vertex|chat → ✅ Found!
// Pricing returned from vertex/chat mode
```

**Used by**:

- Cost calculation for billing
- Model validation during routing
- Budget enforcement

### Syncing Behavior

When Bifrost starts, it performs a complete model catalog initialization:

**Step-by-step process** (from `server.go:Bootstrap()`):

```go
// 1. Download from URL
pricingData := loadPricingFromURL(ctx)

// 2. Store in database (if configStore available)
configStore.CreateModelPrices(ctx, pricingData)

// 3. Load into memory cache
mc.pricingData = map[string]TableModelPricing{...}
```

```go
// Build modelPool from pricing data
mc.populateModelPoolFromPricingData()
// Result: modelPool[provider] = [models from pricing]
```

```go
// Call ListAllModels for all configured providers
modelData, err := client.ListAllModels(ctx, nil)

// Add results to model pool
mc.AddModelDataToPool(modelData)
// Result: modelPool enriched with provider-specific models
```

If the list models API fails for a provider:

```json
{"level":"warn","message":"failed to list models for provider ollama: connection refused"}
```

- Logged as warning, **does not stop startup**
- Provider remains usable with models from pricing data
- Can be manually refreshed later via API

**Result**: Bifrost is ready with a comprehensive model catalog combining both sources.

While Bifrost is running, the catalog stays up-to-date through background workers:

**Pricing Data Sync**:

- Background worker runs every **1 hour** (ticker interval)
- Checks if **24 hours** have elapsed since last sync (configurable)
- If yes, downloads fresh pricing data and updates database + memory cache
- Timer resets after successful sync

**List Models API Sync**:

Triggered by these events:

1. **Provider Added**: When a new provider is configured

   ```bash
   POST /api/v1/providers
   # Automatically calls ListModels for the new provider
   ```

2. **Provider Updated**: When provider config changes (keys, endpoints, etc.)

   ```bash
   PUT /api/v1/providers/{provider}
   # Refetches models to detect changes
   ```

3. **Manual Refresh**: Via API endpoint

   ```bash
   POST /api/v1/providers/{provider}/models/refetch
   # Explicitly refetches models for a provider
   ```

4. **Manual Delete + Refetch**: Clear and reload models

   ```bash
   DELETE /api/v1/providers/{provider}/models
   POST /api/v1/providers/{provider}/models/refetch
   # Useful when models are out of sync
   ```

**Failure Handling**:

- Pricing URL fails but database has data → Use cached database records
- Pricing URL fails and no database data → Error logged, existing memory cache retained
- List models API fails → Log warning, retain existing model pool entries

Bifrost's multi-layered approach ensures high availability:

**Layer 1: Pricing Data Persistence**

```
URL fails → Database → Memory cache → Continue operation
```

**Layer 2: Model Pool Redundancy**

```
ListModels fails → Pricing data models → Continue with reduced catalog
```

**Layer 3: Runtime Validation**

```
Model not in catalog → Special cross-provider rules → May still work
```

**Example Scenario**:

```
Situation:
- Pricing URL is down
- OpenAI ListModels API is down
- User requests gpt-4o on OpenAI

Bifrost's Response:
1. ✅ Pricing data available from database (last sync 12h ago)
2. ✅ Model pool has gpt-4o from previous ListModels call
3. ✅ Request proceeds normally
4. 📊 Cost calculated from cached pricing data
```

With this design, **sync issues alone do not fail requests** as long as one data source is available.

### Allowed Models Behavior with Examples

The `allowed_models` field in provider configs controls which models can be used with that provider. Understanding its behavior is crucial for governance routing.
**Configuration**:

```json
{
  "provider_configs": [
    {
      "provider": "openai",
      "allowed_models": ["*"],
      "weight": 1.0
    }
  ]
}
```

**Behavior**:

- Bifrost calls `GetModelsForProvider("openai")`
- Returns all models in `modelPool["openai"]`
- Request validated against catalog

**Examples**:

```bash
# ✅ Allowed (in catalog)
curl -H "x-bf-vk: vk-123" -d '{"model": "gpt-4o"}'

# ✅ Allowed (in catalog)
curl -H "x-bf-vk: vk-123" -d '{"model": "gpt-3.5-turbo"}'

# ❌ Rejected (not in OpenAI catalog)
curl -H "x-bf-vk: vk-123" -d '{"model": "claude-3-5-sonnet"}'
```

**Use Cases**:

- Default behavior for most deployments
- Automatically stays up-to-date with provider's model offerings
- No manual model list maintenance required

Using `"allowed_models": []` (empty array) means **deny all models**: no requests will be served. Use `["*"]` to allow all models via the catalog.

**Configuration**:

```json
{
  "provider_configs": [
    {
      "provider": "openai",
      "allowed_models": ["gpt-4o", "gpt-4o-mini"], // Only these two
      "weight": 1.0
    },
    {
      "provider": "anthropic",
      "allowed_models": ["claude-3-5-sonnet-20241022"], // Specific version
      "weight": 1.0
    }
  ]
}
```

**Behavior**:

- Bifrost validates request model against explicit list
- Catalog is **ignored** for this provider
- Supports both direct matches and provider-prefixed entries
- Case-sensitive matching

**Examples**:

```bash
# ✅ Allowed (in explicit list)
curl -H "x-bf-vk: vk-123" -d '{"model": "gpt-4o"}'

# ❌ Rejected (not in explicit list)
curl -H "x-bf-vk: vk-123" -d '{"model": "gpt-4-turbo"}'
# Even though gpt-4-turbo is in the OpenAI catalog!

# ✅ Allowed (exact match for Anthropic)
curl -H "x-bf-vk: vk-123" -d '{"model": "claude-3-5-sonnet-20241022"}'

# ❌ Rejected (version mismatch)
curl -H "x-bf-vk: vk-123" -d '{"model": "claude-3-5-sonnet-20240620"}'
```

**Provider-Prefixed Entries**:

You can also use provider-prefixed model names in `allowed_models`. Bifrost will strip the prefix and match against the requested model:

```json
{
  "provider_configs": [
    {
      "provider": "openrouter",
      "allowed_models": ["openai/gpt-4o", "anthropic/claude-3-5-sonnet"],
      "weight": 1.0
    }
  ]
}
```

**How it works**:

```bash
# Request without prefix
curl -H "x-bf-vk: vk-123" -d '{"model": "gpt-4o"}'

# 1. Checks: "openai/gpt-4o" in allowed_models
# 2. Strips prefix: "openai/gpt-4o" → "gpt-4o"
# 3. Compares: "gpt-4o" == "gpt-4o" ✅
# 4. Result: Allowed and routed to OpenRouter
```

This is particularly useful for proxy providers (OpenRouter, Vertex) where you want to explicitly control which upstream models are accessible.

**Use Cases**:

- Compliance requirements (only approved models)
- Cost control (restrict to cheaper models)
- Version pinning (prevent automatic updates)
- Testing specific model versions
- **Explicit cross-provider routing** (e.g., only allow OpenAI models via OpenRouter)

**Key Concept**: Aliases are **key-level** mappings that allow user-friendly model names to map to provider-specific identifiers.

**How Aliases Work**:

- Defined at the **Key level**, not Virtual Key level
- Structure: `aliases: {"user-facing-name": "provider-specific-id"}`
- **Alias key** (left side): User-facing model name used in requests
- **Provider ID** (right side): Provider-specific identifier sent to the API

**Azure OpenAI Example**:

Provider configuration with alias mapping:

```json
{
  "providers": {
    "azure": {
      "keys": [
        {
          "name": "azure-prod-key",
          "value": "your-api-key",
          "aliases": {
            "gpt-4o": "my-prod-gpt4o-deployment",
            "gpt-4o-mini": "my-mini-deployment"
          },
          "azure_key_config": {
            "endpoint": "https://your-resource.openai.azure.com"
          }
        }
      ]
    }
  }
}
```

**What Happens**:

1. **Allowed models derived from aliases**: `["gpt-4o", "gpt-4o-mini"]`
2. **User requests with alias**: `{"model": "gpt-4o"}`
3. **Bifrost validates**: `gpt-4o` is in derived allowed models ✅
4. **Bifrost resolves alias**: `gpt-4o` → `my-prod-gpt4o-deployment`
5. **Sent to Azure**: Uses `my-prod-gpt4o-deployment` as the deployment name
6. **Pricing lookup**: If pricing for resolved ID not found, falls back to alias `gpt-4o`

**Bedrock Example with Inference Profiles**:

```json
{
  "providers": {
    "bedrock": {
      "keys": [
        {
          "name": "bedrock-key",
          "aliases": {
            "claude-sonnet": "us.anthropic.claude-3-5-sonnet-20241022-v2:0",
            "claude-opus": "us.anthropic.claude-3-opus-20240229-v1:0"
          },
          "bedrock_key_config": {
            "access_key": "your-access-key",
            "secret_key": "your-secret-key",
            "region": "us-east-1"
          }
        }
      ]
    }
  }
}
```

**What Happens**:

1. **Allowed models**: `["claude-sonnet", "claude-opus"]` (from alias keys)
2. **User requests**: `{"model": "claude-sonnet"}`
3. **Bifrost validates**: `claude-sonnet` in allowed models ✅
4. **Resolves alias**: `claude-sonnet` → `us.anthropic.claude-3-5-sonnet-20241022-v2:0`
5. **Sent to Bedrock**: The full inference profile ID is used in the API call

**Priority of Model Restrictions**:

When determining allowed models for a key:

```
1. If key.models is NOT empty → Use key.models
2. Else if aliases exist → Use alias keys
3. Else → All models allowed (use Model Catalog)
```

**Example with Both**:

```json
{
  "keys": [
    {
      "models": ["gpt-4o", "gpt-3.5-turbo"], // Explicit restriction
      "aliases": {
        "gpt-4o": "my-deployment",
        "gpt-4-turbo": "another-deployment" // NOT accessible!
      },
      "azure_key_config": {
        "endpoint": "https://your-resource.openai.azure.com"
      }
    }
  ]
}
```

Result: Only `["gpt-4o", "gpt-3.5-turbo"]` allowed (models field takes priority)

**Vertex Example** (similar pattern):

```json
{
  "keys": [
    {
      "aliases": {
        "claude-3-5-sonnet": "anthropic/claude-3-5-sonnet@20241022",
        "gemini-pro": "google/gemini-1.5-pro"
      },
      "vertex_key_config": {
        "project_id": "my-project",
        "region": "us-central1"
      }
    }
  ]
}
```

**Use Cases for Aliases**:

- **Azure**: Map generic model names to specific deployment names in your Azure resource
- **Bedrock**: Use short aliases for long inference profile identifiers
- **Vertex**: Map to specific model versions or regional endpoints
- **Multi-environment**: Different aliases per key (dev/staging/prod)

**Key Insight**:

```
User Request: {"model": "gpt-4o"}
    ↓
Validation: Check if "gpt-4o" in allowed models (derived from aliases)
    ↓
Mapping: aliases["gpt-4o"] → "my-prod-gpt4o-deployment"
    ↓
API Call: Uses "my-prod-gpt4o-deployment" as deployment ID
    ↓
Pricing: Falls back to "gpt-4o" if resolved ID not in pricing data
```

This allows user-friendly model names in requests while supporting provider-specific identifier patterns at the key level.

**Configuration**:

```json
{
  "provider_configs": [
    {
      "provider": "openai",
      "allowed_models": ["gpt-4o"],
      "weight": 0.5
    },
    {
      "provider": "azure",
      "allowed_models": ["gpt-4o"],
      "weight": 0.5
    }
  ]
}
```

**Request**:

```bash
curl -H "x-bf-vk: vk-123" \
  -d '{"model": "gpt-4o"}'
```

**Routing Behavior**:

1. **Model validation**: Both providers have `gpt-4o` in allowed_models ✅
2. **Weighted selection**: 50% chance each
3. **Provider selected**: Let's say Azure
4. **Model transformation**: `gpt-4o` → `azure/gpt-4o`
5. **Fallbacks**: `["openai/gpt-4o"]` (remaining providers)

**Special Cross-Provider Scenarios**:

```json
{
  "provider_configs": [
    {
      "provider": "openrouter",
      "allowed_models": ["*"]
    }
  ]
}
```

Request `claude-3-5-sonnet`:

- Bifrost checks: `GetModelsForProvider("openrouter")`
- Finds: `anthropic/claude-3-5-sonnet` in OpenRouter catalog
- ✅ Allowed, routes to OpenRouter

**Use Case**: Route 99% of OpenAI traffic through OpenRouter for cost savings, keep 1% direct for fallback

```json
{
  "provider_configs": [
    {
      "provider": "openai",
      "allowed_models": ["gpt-4o"],
      "weight": 0.01 // 1% direct to OpenAI
    },
    {
      "provider": "openrouter",
      "allowed_models": ["openai/gpt-4o"], // Provider-prefixed
      "weight": 0.99 // 99% via OpenRouter
    }
  ]
}
```

Request `gpt-4o`:

- **OpenAI check**: `"gpt-4o"` in `["gpt-4o"]` → ✅ Allowed
- **OpenRouter check**: Strips prefix from `"openai/gpt-4o"` → matches `"gpt-4o"` → ✅ Allowed
- **Weighted selection**: 99% chance → OpenRouter selected
- **Final model**: `openrouter/gpt-4o`
- **Fallbacks**: `["openai/gpt-4o"]` (1% provider as fallback)

**Why this works**: Bifrost supports provider-prefixed entries in `allowed_models`, so `"openai/gpt-4o"` matches requests for `"gpt-4o"`.
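The `allowed_models` rules described above (wildcard, empty list, explicit entries, provider-prefixed entries) can be sketched as a single check. This is a minimal illustration, not Bifrost's implementation: `isModelAllowed` and `entryMatches` are hypothetical names, and the `catalog` slice is an assumed stand-in for the provider's Model Catalog entries.

```go
package main

import (
	"fmt"
	"strings"
)

// entryMatches reports whether a list entry matches the requested model,
// either exactly (case-sensitive) or after stripping a "provider/" prefix.
// Hypothetical helper for illustration only.
func entryMatches(entry, requested string) bool {
	if entry == requested {
		return true
	}
	if i := strings.Index(entry, "/"); i >= 0 {
		return entry[i+1:] == requested
	}
	return false
}

// isModelAllowed applies the rules from the text:
//   - ["*"] defers to the catalog for this provider
//   - []    denies everything
//   - otherwise only listed (or prefix-stripped) entries pass
func isModelAllowed(allowed, catalog []string, requested string) bool {
	for _, entry := range allowed {
		if entry == "*" {
			for _, m := range catalog {
				if entryMatches(m, requested) {
					return true
				}
			}
			continue
		}
		if entryMatches(entry, requested) {
			return true
		}
	}
	return false // empty list or no match: deny
}

func main() {
	openaiCatalog := []string{"gpt-4o", "gpt-4o-mini"}
	fmt.Println(isModelAllowed([]string{"*"}, openaiCatalog, "gpt-4o"))
	fmt.Println(isModelAllowed([]string{}, openaiCatalog, "gpt-4o"))
	fmt.Println(isModelAllowed([]string{"openai/gpt-4o"}, nil, "gpt-4o"))
}
```

Sharing one matcher between the explicit list and the catalog keeps the OpenRouter wildcard case (`anthropic/claude-3-5-sonnet` in the catalog, `claude-3-5-sonnet` in the request) consistent with the explicit provider-prefixed case.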
```json
{
  "provider_configs": [
    {
      "provider": "vertex",
      "allowed_models": ["claude-3-5-sonnet", "gemini-1.5-pro"]
    }
  ]
}
```

Request `claude-3-5-sonnet`:

- Model catalog lookup: `GetProvidersForModel("claude-3-5-sonnet")`
- Finds: `["anthropic", "vertex", "bedrock"]`
- Validation: `claude-3-5-sonnet` in allowed_models ✅
- Sends to Vertex as: `anthropic/claude-3-5-sonnet`

```json
{
  "provider_configs": [
    {
      "provider": "groq",
      "allowed_models": ["gpt-3.5-turbo"]
    }
  ]
}
```

Request `gpt-3.5-turbo`:

- Special handling: Checks Groq catalog for `openai/gpt-3.5-turbo`
- ✅ Found, validation passes
- Sends to Groq as: `openai/gpt-3.5-turbo`

### How It's Used in Routing

When a Virtual Key has `provider_configs`, governance uses the model catalog for validation:

**Wildcard allowed_models Example**:

```json
{
  "provider_configs": [
    {
      "provider": "openai",
      "allowed_models": ["*"],
      "weight": 0.5
    }
  ]
}
```

**Request Flow**:

```bash
curl -H "x-bf-vk: vk-123" -d '{"model": "gpt-4o"}'

# 1. Governance checks: Is "gpt-4o" in GetModelsForProvider("openai")?
# 2. Catalog lookup: modelPool["openai"] contains "gpt-4o" ✅
# 3. Validation passes, provider selected
# 4. Model becomes: "openai/gpt-4o"
```

**Rejection Example**:

```bash
curl -H "x-bf-vk: vk-123" -d '{"model": "claude-3-5-sonnet"}'

# 1. Governance checks: Is "claude-3-5-sonnet" in GetModelsForProvider("openai")?
# 2. Catalog lookup: modelPool["openai"] does NOT contain "claude-3-5-sonnet" ❌
# 3. Validation fails, request rejected
# 4. Error: "model not allowed for any configured provider"
```

When load balancing selects providers, it queries the catalog to find candidates:

**Request Flow**:

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{"model": "gpt-4o", "messages": [...]}'

# 1. Load balancer: GetProvidersForModel("gpt-4o")
# 2. Catalog returns: ["openai", "azure", "groq"]
# 3. Filter by configured providers: ["openai", "azure"] (groq not configured)
# 4. Performance scoring: openai=0.95, azure=0.87
# 5. Select: openai (highest score)
# 6. Model becomes: "openai/gpt-4o"
# 7. Fallbacks: ["azure/gpt-4o"]
```

**Cross-Provider Discovery**:

```bash
curl -d '{"model": "claude-3-5-sonnet"}'

# 1. Load balancer: GetProvidersForModel("claude-3-5-sonnet")
# 2. Catalog checks:
#    - Direct: ["anthropic"] ✅
#    - OpenRouter: Has "anthropic/claude-3-5-sonnet" ✅
#    - Vertex: Has "anthropic/claude-3-5-sonnet" ✅
#    - Bedrock: Has "anthropic.claude-3-5-sonnet-..." ✅
# 3. Catalog returns: ["anthropic", "openrouter", "vertex", "bedrock"]
# 4. Performance scoring across all four
# 5. Best performer selected
```

This is how Bifrost achieves **intelligent cross-provider routing** without manual configuration.

**The Model Catalog is essential for cross-provider routing**. Without it, Bifrost wouldn't know that `gpt-4o` is available from OpenAI, Azure, and Groq, or that `claude-3-5-sonnet` can be routed through Anthropic, Vertex, Bedrock, and OpenRouter. This knowledge powers both governance validation and load balancing provider discovery.

---

## Governance-based Routing

Governance-based routing allows you to explicitly define which providers and models should handle requests for a specific Virtual Key. This method provides precise control over routing decisions.

### How It Works

When a Virtual Key has `provider_configs` defined:

1. **Request arrives** with a Virtual Key (e.g., `x-bf-vk: vk-prod-main`)
2. **Model validation**: Bifrost checks if the requested model is allowed for any configured provider
3. **Provider filtering**: Providers are filtered based on:
   - Model availability in `allowed_models`
   - Budget limits (current usage vs max limit)
   - Rate limits (tokens/requests per time window)
4. **Weighted selection**: A provider is selected using weighted random distribution
5. **Provider prefix added**: Model string becomes `provider/model` (e.g., `openai/gpt-4o`)
6. **Fallbacks created**: Remaining providers sorted by weight (descending) are added as fallbacks

### Configuration Example

```json
{
  "provider_configs": [
    {
      "provider": "openai",
      "allowed_models": ["gpt-4o", "gpt-4o-mini"],
      "weight": 0.3,
      "budget": {
        "max_limit": 100.0,
        "current_usage": 45.0
      }
    },
    {
      "provider": "azure",
      "allowed_models": ["gpt-4o"],
      "weight": 0.7,
      "rate_limit": {
        "token_max_limit": 100000,
        "token_reset_duration": "1m"
      }
    }
  ]
}
```

### Request Flow

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "x-bf-vk: vk-prod-main" \
  -d '{"model": "gpt-4o", "messages": [...]}'
```

- OpenAI: ✅ Has `gpt-4o` in allowed_models, budget OK, weight 0.3
- Azure: ✅ Has `gpt-4o` in allowed_models, rate limit OK, weight 0.7
- 70% chance → Azure
- 30% chance → OpenAI

```json
{
  "model": "azure/gpt-4o",
  "messages": [...],
  "fallbacks": ["openai/gpt-4o"]
}
```

### Key Features

| Feature | Description |
|---------|-------------|
| **Explicit Control** | Define exactly which providers and models are accessible |
| **Budget Enforcement** | Automatically exclude providers exceeding budget limits |
| **Rate Limit Protection** | Skip providers that have hit rate limits |
| **Weighted Distribution** | Control traffic distribution with custom weights |
| **Automatic Fallbacks** | Failed requests are automatically retried on the provider with the next-highest weight |

### Best Practices

Assign higher weights to cheaper providers for cost-sensitive workloads:

```json
{
  "provider_configs": [
    {"provider": "groq", "allowed_models": ["*"], "key_ids": ["*"], "weight": 0.7},
    {"provider": "openai", "allowed_models": ["*"], "key_ids": ["*"], "weight": 0.3}
  ]
}
```

Create different Virtual Keys for dev/staging/prod with different provider access:

```json
{
  "virtual_keys": [
    {
      "id": "vk-dev",
      "provider_configs": [{"provider": "ollama", "allowed_models": ["*"], "key_ids": ["*"]}]
    },
    {
      "id": "vk-prod",
      "provider_configs": [
        {"provider": "openai", "allowed_models": ["*"], "key_ids": ["*"]},
        {"provider": "azure", "allowed_models": ["*"], "key_ids": ["*"]}
      ]
    }
  ]
}
```

Restrict specific Virtual Keys to compliant providers:

```json
{
  "provider_configs": [
    {"provider": "azure", "allowed_models": ["gpt-4o"]},
    {"provider": "bedrock", "allowed_models": ["claude-3-sonnet-20240229"]}
  ]
}
```

**`allowed_models: ["*"]`**: Allows all models supported by the provider, validated via the Model Catalog (populated from pricing data and the provider's list models API). See the [Model Catalog section](#the-model-catalog) above for how syncing works. For configuration instructions, see [Governance Routing](/features/governance/routing).

**`allowed_models: []` (empty array)**: Denies **all** models, so no requests will be served for this provider config. This is deny-by-default behavior introduced in v1.5.0.

**Empty `provider_configs`**: When `provider_configs` is empty (no providers configured), **all providers are blocked** (deny-by-default). You must explicitly add provider configurations to allow traffic through a Virtual Key.

---

## Adaptive Load Balancing

**Enterprise Feature**: Adaptive Load Balancing is available in Bifrost Enterprise. [Contact us](https://www.getmaxim.ai/bifrost/enterprise) to enable it.

Adaptive Load Balancing automatically optimizes routing based on real-time performance metrics. It operates at **two levels** to provide both macro-level provider selection and micro-level key optimization.
### Two-Level Architecture

Separating provider selection (direction) from key selection (route) enables:

- **Provider-level optimization**: Choose the best provider for a model based on aggregate performance
- **Key-level optimization**: Within that provider, choose the best API key based on individual key performance
- **Resilience**: Even when a provider is specified (by governance or user), key-level load balancing still optimizes which API key to use

```mermaid
flowchart TB
    Request["Request: gpt-4o"]

    subgraph Level1["Level 1: Direction (Provider Selection)"]
        Cat["Model Catalog Lookup"]
        Providers["Candidate Providers:<br/>openai, azure, groq"]
        Filter["Filter by allowed_models<br/>and key availability"]
        Score["Score by performance:<br/>error rate, latency, utilization"]
        Select["Select: openai"]
    end

    subgraph Level2["Level 2: Route (Key Selection)"]
        Keys["Available OpenAI Keys:<br/>key-1, key-2, key-3"]
        KeyScore["Score each key:<br/>error rate, latency, TPM hits"]
        KeySelect["Select: key-2<br/>(best performing)"]
    end

    Request --> Cat --> Providers --> Filter --> Score --> Select
    Select --> Keys --> KeyScore --> KeySelect --> Response["Execute with<br/>openai/gpt-4o + key-2"]
```

### Level 1: Direction (Provider Selection)

**When it runs**: Only when the model string has **no** provider prefix (e.g., `gpt-4o`)

**How it works**:

1. **Model catalog lookup**: Find all configured providers that support the requested model
2. **Provider filtering**: Filter based on:
   - Allowed models from keys configuration
   - Keys availability for the provider
3. **Performance scoring**: Calculate scores for each provider based on:
   - Error rates (50% weight)
   - Latency (20% weight, using MV-TACOS algorithm)
   - Utilization (5% weight)
   - Momentum bias (recovery acceleration)
4. **Smart selection**: Choose provider using weighted random with jitter and exploration
5. **Fallbacks created**: Remaining providers sorted by performance score (descending) are added as fallbacks

### Level 2: Route (Key Selection)

**When it runs**: **Always**, even when the provider is already specified (by governance, user, or Level 1)

**How it works**:

1. **Get available keys**: Fetch all keys for the selected provider
2. **Filter by configuration**: Apply model restrictions from key configuration
3. **Performance scoring**: Calculate score for each key based on:
   - Error rates (recent failures)
   - Latency (response time)
   - TPM hits (rate limit violations)
   - Current state (Healthy, Degraded, Failed, Recovering)
4. **Weighted random selection**: Choose key with exploration (25% chance to probe recovering keys)
5. **Circuit breaker**: Skip keys with zero weight (TPM hits, repeated failures)

### Scoring Algorithm

The load balancer computes a performance score for each provider-model combination:

$$
Score = (P_{error} \times 0.5) + (P_{latency} \times 0.2) + (P_{util} \times 0.05) - M_{momentum}
$$

The formula sums penalties, so a lower value is better: lower penalties mean higher weights and more traffic. The system self-heals by quickly penalizing failing routes but enabling fast recovery once issues are resolved.
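As a worked illustration of the scoring formula above, the following sketch plugs example penalties into it. The weights come from the formula; the assumption that each penalty is normalized to the range 0..1, the function name `penaltyScore`, and the sample inputs are all hypothetical, and how Bifrost converts the result into a routing weight is not shown here.

```go
package main

import "fmt"

// Weights as stated in the scoring formula.
const (
	errorWeight   = 0.5
	latencyWeight = 0.2
	utilWeight    = 0.05
)

// penaltyScore combines the three penalties and subtracts the momentum term.
// Inputs are assumed to be normalized to 0..1 (an assumption of this sketch).
// A lower result means a healthier route.
func penaltyScore(pError, pLatency, pUtil, momentum float64) float64 {
	return pError*errorWeight + pLatency*latencyWeight + pUtil*utilWeight - momentum
}

func main() {
	// Hypothetical routes: one healthy, one failing, one failing but recovering.
	healthy := penaltyScore(0.02, 0.10, 0.30, 0.0)
	failing := penaltyScore(0.60, 0.50, 0.30, 0.0)
	recovering := penaltyScore(0.60, 0.50, 0.30, 0.10)

	fmt.Printf("healthy=%.3f failing=%.3f recovering=%.3f\n", healthy, failing, recovering)
}
```

The recovering route scores lower than the failing one with identical penalties, which is the role the momentum term plays in speeding up recovery.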
### Request Flow

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{"model": "gpt-4o", "messages": [...]}'
```

Providers supporting `gpt-4o`: [openai, azure, groq]

- OpenAI: Score 0.92 (low latency, 99% success rate)
- Azure: Score 0.85 (medium latency, 98% success rate)
- Groq: Score 0.65 (high latency recently)

OpenAI selected (highest score within jitter band)

```json
{
  "model": "openai/gpt-4o",
  "messages": [...],
  "fallbacks": ["azure/gpt-4o", "groq/gpt-4o"]
}
```

### Key Features

| Feature | Description |
|---------|-------------|
| **Automatic Optimization** | No manual weight tuning required |
| **Real-time Adaptation** | Weights recomputed every 5 seconds based on live metrics |
| **Circuit Breakers** | Failing routes automatically removed from rotation |
| **Fast Recovery** | 90% penalty reduction in 30 seconds after issues resolve |
| **Health States** | Routes transition between Healthy, Degraded, Failed, and Recovering |
| **Smart Exploration** | 25% chance to probe potentially recovered routes |

### Dashboard Visibility

Monitor load balancing performance in real time:

[Figure: Adaptive Load Balancing Dashboard]

The dashboard shows:

- Weight distribution across provider-model-key routes
- Performance metrics (error rates, latency, success rates)
- State transitions (Healthy → Degraded → Failed → Recovering)
- Actual vs expected traffic distribution

---

## How Governance and Load Balancing Interact

When both methods are available in your Bifrost deployment, they work together in a complementary way across two levels.

**Key Insight**: Load balancing has **two levels**:

- **Level 1 (Direction/Provider)**: Skipped when provider is already specified
- **Level 2 (Route/Key)**: **Always runs**, even when provider is specified

This means key-level optimization works regardless of how the provider was chosen!
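The two-level rule can be sketched as a small gate. This assumes, as the execution order in this document describes, that "provider already specified" is detected by checking whether the model string contains "/". The names `hasProviderPrefix` and `plan` are hypothetical, used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// hasProviderPrefix mirrors the "/" check described in the text:
// "azure/gpt-4o" carries a provider, "gpt-4o" does not.
func hasProviderPrefix(model string) bool {
	return strings.Contains(model, "/")
}

// plan lists which load-balancing levels would run for a model string.
// Level 1 is skipped when a provider is already present; Level 2 always runs.
func plan(model string) []string {
	var steps []string
	if !hasProviderPrefix(model) {
		steps = append(steps, "level1:provider-selection")
	}
	steps = append(steps, "level2:key-selection")
	return steps
}

func main() {
	fmt.Println(plan("gpt-4o"))       // [level1:provider-selection level2:key-selection]
	fmt.Println(plan("azure/gpt-4o")) // [level2:key-selection]
}
```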
### Execution Flow

```mermaid
flowchart TD
    Start["Request: gpt-4o"]

    subgraph Governance["Governance Plugin (HTTPTransportIntercept)"]
        HasVK{"Has VK with<br/>provider_configs?"}
        GovRoute["Provider Selection:<br/>Weighted random"]
        AddPrefix["Add prefix:<br/>azure/gpt-4o"]
    end

    subgraph LB1["Load Balancer Level 1 (Middleware)"]
        PrefixCheck{"Has provider<br/>prefix?"}
        LBProvider["Provider Selection:<br/>Performance-based"]
        AddLBPrefix["Add prefix:<br/>openai/gpt-4o"]
    end

    subgraph LB2["Load Balancer Level 2 (Key Selector)"]
        GetKeys["Get available keys<br/>for selected provider"]
        ScoreKeys["Score keys by<br/>performance metrics"]
        SelectKey["Select best key"]
    end

    Start --> HasVK
    HasVK -->|Yes| GovRoute --> AddPrefix
    HasVK -->|No| PrefixCheck
    AddPrefix --> PrefixCheck
    PrefixCheck -->|Yes, skip Level 1| GetKeys
    PrefixCheck -->|No| LBProvider --> AddLBPrefix --> GetKeys
    GetKeys --> ScoreKeys --> SelectKey --> Execute["Execute request<br/>with selected provider + key"]
```

### Execution Order

1. **HTTPTransportIntercept** (Governance Plugin - Provider Level)
   - Runs first in the request pipeline
   - Checks if Virtual Key has `provider_configs`
   - If yes: adds provider prefix (e.g., `azure/gpt-4o`)
   - **Result**: Provider is selected by governance rules

2. **Middleware** (Load Balancing Plugin - Provider Level / Direction)
   - Runs after HTTPTransportIntercept
   - Checks if model string contains "/"
   - If yes: **skips provider selection** (already determined by governance or user)
   - If no: performs performance-based provider selection
   - **Result**: Provider prefix added if not already present

3. **KeySelector** (Load Balancing - Key Level / Route)
   - **Always runs** during request execution in Bifrost core
   - Gets all keys for the selected provider
   - Filters keys based on model restrictions
   - Scores each key by performance metrics
   - Selects best key using weighted random + exploration
   - **Result**: Optimal key selected within the provider

**Important**: Even when governance specifies `azure/gpt-4o`, load balancing **still optimizes which Azure key to use** based on performance metrics. This is the power of the two-level architecture!

### Example Scenarios

**Setup:**

- Virtual Key has `provider_configs` defined
- No adaptive load balancing enabled

**Request:**

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "x-bf-vk: vk-prod-main" \
  -d '{"model": "gpt-4o", "messages": [...]}'
```

**Behavior:**

1. **Governance** applies weighted provider routing → selects Azure (70% weight)
2. Model becomes `azure/gpt-4o`
3. **Standard key selection** (non-adaptive) chooses an Azure key based on static weights
4. Request forwarded to Azure with selected key

**Setup:**

- **No Virtual Key** (do not send `x-bf-vk`) → this is the **Load Balancing-only** setup
- **Virtual Key with empty / missing `provider_configs`** → **blocks all providers** (deny-by-default) and therefore is **NOT** an LB-only setup
- Adaptive load balancing enabled

**Request:**

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{"model": "gpt-4o", "messages": [...]}'
```

**Behavior:**

1. **Load Balancing Level 1** applies performance-based provider routing → selects OpenAI (best performing)
2. Model becomes `openai/gpt-4o`
3. **Load Balancing Level 2** selects best OpenAI key based on performance metrics (error rate, latency, TPM status)
4. Request forwarded to OpenAI with optimal key

**Setup:**

- Virtual Key has `provider_configs` defined
- Adaptive load balancing enabled
- Azure has 3 keys: `azure-key-1`, `azure-key-2`, `azure-key-3`

**Request:**

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "x-bf-vk: vk-prod-main" \
  -d '{"model": "gpt-4o", "messages": [...]}'
```

**Behavior:**

1. **Governance** applies first (respects explicit user config) → selects Azure provider
2. Model becomes `azure/gpt-4o`
3. **Load Balancing Level 1** sees "/" and **skips provider selection** (already decided)
4. **Load Balancing Level 2** still runs! Selects best Azure key based on performance:
   - `azure-key-1`: 99% success rate, 150ms avg latency → score 0.95
   - `azure-key-2`: 85% success rate, 200ms avg latency → score 0.60 (degraded)
   - `azure-key-3`: Hit TPM limit → score 0.0 (circuit broken)
   - **Selects `azure-key-1`** (highest score)
5. Request forwarded to Azure with `azure-key-1`

**Why?** Governance controls provider selection (explicit user intent), but load balancing still optimizes key selection (automatic performance optimization).
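The key-selection step in the scenario above can be approximated with a weighted random pick that skips circuit-broken keys. This is a simplified sketch under stated assumptions: it treats the example scores as selection weights (higher is better), omits the exploration and jitter behavior described for Level 2, and uses hypothetical names (`keyScore`, `pickKey`). The random value is passed in so the behavior is easy to reason about.

```go
package main

import (
	"fmt"
	"math/rand"
)

// keyScore pairs a key ID with an example score; 0 means circuit-broken.
type keyScore struct {
	ID    string
	Score float64
}

// pickKey does a weighted random pick over keys with a positive score.
// r must be in [0, 1); pass rand.Float64() in normal use.
// It returns false when no key is eligible.
func pickKey(keys []keyScore, r float64) (string, bool) {
	total := 0.0
	for _, k := range keys {
		if k.Score > 0 {
			total += k.Score
		}
	}
	if total == 0 {
		return "", false
	}
	target := r * total
	last := ""
	for _, k := range keys {
		if k.Score <= 0 {
			continue // circuit breaker: zero-weight keys are never chosen
		}
		last = k.ID
		if target < k.Score {
			return k.ID, true
		}
		target -= k.Score
	}
	return last, true // guards against floating-point drift
}

func main() {
	keys := []keyScore{
		{"azure-key-1", 0.95},
		{"azure-key-2", 0.60},
		{"azure-key-3", 0.0},
	}
	id, ok := pickKey(keys, rand.Float64())
	fmt.Println(id, ok)
}
```

With these weights `azure-key-1` is chosen most often, `azure-key-2` still receives some traffic, and `azure-key-3` is excluded until its score recovers.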
#### Provider prefix in the request

**Setup:**
- Both governance and load balancing enabled
- OpenAI has 2 keys available

**Request:**

```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{"model": "openai/gpt-4o", "messages": [...]}'
```

**Behavior:**
1. **Governance** sees "/" and skips
2. **Load Balancing Level 1** sees "/" and **skips provider selection**
3. **Load Balancing Level 2** still runs and selects the best OpenAI key based on current metrics
4. Request forwarded to OpenAI with optimal key

**Why?** User explicitly specified the provider, but key-level optimization still provides value by selecting the best-performing OpenAI key.

### Provider vs Key Selection Rules

| Scenario | Provider Selection | Key Selection |
|----------|-------------------|---------------|
| VK with provider_configs | **Governance** (weighted random) | **Standard** or **Adaptive** (if enabled) |
| VK without provider_configs + LB | **Blocked** (empty = no providers allowed) | N/A |
| No VK + LB | **Load Balancing Level 1** (performance) | **Load Balancing Level 2** (performance) |
| Model with provider prefix + LB | **Skip** (already specified) | **Load Balancing Level 2** (performance) ✅ |
| No Load Balancing enabled | **Governance** or **User** or **Model Catalog** | **Standard** (static weights) |

**Critical Insight**:
- **Provider selection** respects the hierarchy: Governance → Load Balancing Level 1 → User specification
- **Key selection** runs independently and benefits from load balancing **even when provider is predetermined**

This separation lets the two-level architecture honor explicit provider choices while still optimizing key selection.

---

## Routing Rules (Dynamic Expression-Based Routing)

**Position in routing pipeline**: Routing Rules execute **before governance provider selection** and can override it. They are also evaluated before adaptive load balancing, enabling dynamic provider/model overrides based on runtime conditions like headers, parameters, capacity metrics, and organizational hierarchy.
### Overview

Routing Rules provide sophisticated, expression-based control over request routing using CEL expressions. Unlike governance routing (static weights), routing rules evaluate conditions dynamically at request time.

### When Routing Rules Execute

```mermaid
flowchart TD
    Start["Request: model + provider"]

    subgraph Rules["1. Routing Rules Layer (Evaluated First)"]
        RuleMatch{"CEL Expression<br/>Matches?"}
        RuleDecision["Override:<br/>New provider/model/fallbacks"]
        NoMatch["No match:<br/>Continue to Governance"]
    end

    subgraph Gov["2. Governance Layer (if no routing rule matched)"]
        VKValidation["Virtual Key Validation"]
        GovRouting["Provider Governance Routing<br/>(weighted random)"]
    end

    subgraph LB["3. Load Balancing Layer"]
        LB1["Level 1: Provider Selection"]
        LB2["Level 2: Key Selection"]
    end

    Start --> RuleMatch
    RuleMatch -->|Yes| RuleDecision --> LB1
    RuleMatch -->|No| NoMatch --> VKValidation --> GovRouting --> LB1
    LB1 --> LB2 --> Execute["Execute with<br/>selected provider + key"]
```

### How It Works

1. **Routing rules evaluate first** in scope precedence order (VirtualKey → Team → Customer → Global)
2. **If a routing rule matches**: provider/model/fallbacks are overridden, governance `provider_configs` are skipped
3. **If no routing rule matches**: governance provider selection runs (weighted random)
4. **Load balancing Level 1**: skipped if provider already determined (has "/" prefix)
5. **Load balancing Level 2** (key selection): always runs to select the best key within the determined provider

### Available CEL Variables

Routing rules access request context through CEL variables:

```cel
// Request context
model              // Requested model
provider           // Current provider

// Headers and parameters (case-insensitive)
headers["x-tier"]  // Request header
params["region"]   // Query parameter

// Organization context
virtual_key_id     // VirtualKey ID
team_name          // Team name
customer_id        // Customer ID

// Capacity metrics (0-100 percentage)
budget_used        // Budget usage %
tokens_used        // Token rate limit usage %
request            // Request rate limit usage %
```

### Examples

#### Route based on user tier

```cel
headers["x-tier"] == "premium"
// → openai/gpt-4o
```

#### Route to fallback when budget high

```cel
budget_used > 85
// → groq/llama-2 (cheaper)
```

#### Route by team

```cel
team_name == "ml-research"
// → anthropic/claude-3-opus
```

#### Complex multi-condition routing

```cel
headers["x-environment"] == "production" &&
tokens_used < 75 &&
team_name == "ai-platform"
// → openai/gpt-4o
```

### Scope Hierarchy

Rules are evaluated in organizational precedence order (first-match-wins):

```
1. VirtualKey scope (highest priority)
2. Team scope
3. Customer scope
4. Global scope (lowest priority)
```

Within each scope, rules are sorted by **priority** (ascending: 0 before 10).
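The evaluation order described above (scope precedence first, then ascending priority, first match wins) can be sketched in Go as follows. This is an illustrative sketch, not Bifrost's implementation: the `Rule` type, the scope constants, and `firstMatch` are hypothetical names, and a plain Go function stands in for a compiled CEL expression.

```go
package main

import (
	"fmt"
	"sort"
)

// Scope ranks follow the documented precedence: a lower rank is evaluated first.
const (
	ScopeVirtualKey = iota
	ScopeTeam
	ScopeCustomer
	ScopeGlobal
)

// Rule is a hypothetical, simplified routing rule.
// Match stands in for a compiled CEL expression.
type Rule struct {
	Name     string
	Scope    int
	Priority int
	Target   string
	Match    func(ctx map[string]float64) bool
}

// firstMatch orders a copy of the rules by scope, then by ascending
// priority, and returns the target of the first rule that matches.
func firstMatch(rules []Rule, ctx map[string]float64) (string, bool) {
	ordered := append([]Rule(nil), rules...)
	sort.SliceStable(ordered, func(i, j int) bool {
		if ordered[i].Scope != ordered[j].Scope {
			return ordered[i].Scope < ordered[j].Scope
		}
		return ordered[i].Priority < ordered[j].Priority
	})
	for _, r := range ordered {
		if r.Match(ctx) {
			return r.Target, true
		}
	}
	return "", false
}

// exampleRules returns three illustrative rules across two scopes.
func exampleRules() []Rule {
	return []Rule{
		{"global-budget", ScopeGlobal, 0, "groq/llama-2",
			func(c map[string]float64) bool { return c["budget_used"] > 85 }},
		{"team-default", ScopeTeam, 10, "openai/gpt-4o",
			func(c map[string]float64) bool { return c["tokens_used"] < 75 }},
		{"team-budget", ScopeTeam, 0, "anthropic/claude-3-opus",
			func(c map[string]float64) bool { return c["budget_used"] > 95 }},
	}
}

func main() {
	ctx := map[string]float64{"budget_used": 90, "tokens_used": 50}
	target, ok := firstMatch(exampleRules(), ctx)
	fmt.Println(target, ok) // openai/gpt-4o true
}
```

In this run the global budget rule also matches, but the team-scoped rule wins because Team scope is evaluated before Global scope.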
### Key Features

| Feature | Description |
|---------|-------------|
| **CEL Expressions** | Powerful, composable condition language with multiple operators |
| **Scope Hierarchy** | Rules at VirtualKey/Team/Customer/Global levels with proper precedence |
| **Dynamic Override** | Override provider and/or model based on runtime conditions |
| **Fallback Chains** | Define multiple fallback providers for automatic failover |
| **Priority Ordering** | Rules with lower priority values are evaluated first within the same scope |
| **Capacity Awareness** | Access real-time budget and rate limit usage percentages |

### Integration with Governance

Routing Rules execute **before** governance provider selection and can override it:

**If a routing rule matches**:

```
Routing Rules evaluate
    ↓
Rule matches: budget_used > 85
    ↓
Override: groq/llama-2 (cheaper provider)
    ↓
Governance provider_configs SKIPPED
    ↓
Load Balancing selects best key
```

**If no routing rule matches**:

```
Routing Rules evaluate
    ↓
No matching rule
    ↓
Governance decides: azure/gpt-4o (70% weight)
    ↓
Load Balancing selects best key
```

**Key Insight**: Routing rules have higher precedence than governance `provider_configs`. If a routing rule matches, governance `provider_configs` are bypassed entirely.

### Integration with Load Balancing

Routing Rules work **before** load balancing:

```
Routing Rules decide: openai/gpt-4o
    ↓
Load Balancing Level 1: Skipped (provider already determined)
    ↓
Load Balancing Level 2: Selects best OpenAI key based on performance
```

Even when routing rules determine the provider, load balancing Level 2 still optimizes which API key to use within that provider.
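As a rough illustration of how these layers hand off to one another, the Go sketch below resolves the final model string from the outcome of each layer. It is a simplification under stated assumptions: `resolveProvider` and its parameters are hypothetical, each layer's decision is passed in as a plain string (empty meaning "no decision"), and the sketch assumes a matching routing rule takes effect even when the request already carries a provider prefix. Key selection (Level 2) is out of scope here because it runs in every case.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveProvider returns the model string after provider selection and a
// label naming the layer that decided it. All parameters are hypothetical
// stand-ins for the outcome of each layer; "" means that layer made no decision.
func resolveProvider(model, ruleTarget, governanceProvider, lbProvider string) (string, string) {
	// 1. A matching routing rule overrides everything that follows (assumption:
	//    this also applies when the request already has a provider prefix).
	if ruleTarget != "" {
		return ruleTarget, "routing-rule"
	}
	// 2. A "/" in the model means the provider is already determined, so
	//    governance and load balancing Level 1 both skip provider selection.
	if strings.Contains(model, "/") {
		return model, "request"
	}
	// 3. Governance provider_configs choose the provider (weighted random).
	if governanceProvider != "" {
		return governanceProvider + "/" + model, "governance"
	}
	// 4. Load balancing Level 1 chooses the provider by performance.
	if lbProvider != "" {
		return lbProvider + "/" + model, "load-balancing"
	}
	// 5. Nothing decided here; other mechanisms would have to resolve it.
	return model, "fallback"
}

func main() {
	fmt.Println(resolveProvider("gpt-4o", "groq/llama-2", "azure", "openai"))
	fmt.Println(resolveProvider("gpt-4o", "", "azure", "openai"))
	fmt.Println(resolveProvider("openai/gpt-4o", "", "azure", "openai"))
}
```

The three calls print `groq/llama-2 routing-rule`, `azure/gpt-4o governance`, and `openai/gpt-4o request`, mirroring the two flows shown above plus the provider-prefix case.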
### Use Cases

- **Tier-based routing**: Premium users → fast providers
- **Capacity failover**: High budget usage → cheaper providers
- **Team preferences**: Different teams → different providers
- **A/B testing**: Route subset of traffic to test models
- **Regional routing**: EU users → EU providers (data residency)
- **Complex logic**: Combine multiple conditions for sophisticated routing

### Dashboard & API

Routing rules can be configured through:

- **Dashboard**: Visual rule builder with CEL expression editor
- **API**: `POST /api/governance/routing-rules` and related endpoints
- **Scope**: Create rules at global, customer, team, or virtual key levels
- **Priority**: Order rules within scope with numeric priority

For complete documentation, see [Routing Rules Documentation](/providers/routing-rules).

---

## Choosing the Right Approach

1. **Use Governance When:**
   - ✅ **Compliance requirements**: Need to ensure data stays in specific regions or providers
   - ✅ **Cost optimization**: Want explicit control over traffic distribution to cheaper providers
   - ✅ **Budget enforcement**: Need hard limits on spending per provider
   - ✅ **Environment separation**: Different teams/apps need different provider access
   - ✅ **Rate limit management**: Need to respect provider-specific rate limits

2. **Use Routing Rules When:**
   - ✅ **Dynamic routing**: Route based on runtime request context (headers, parameters)
   - ✅ **Capacity-aware routing**: Switch to fallback when budget/rate limits high
   - ✅ **Organization-based routing**: Different rules for teams/customers
   - ✅ **A/B testing**: Route subset of traffic to test new models
   - ✅ **Complex conditions**: Multiple criteria (e.g., tier + capacity + team)

3. **Use Load Balancing When:**
   - ✅ **Performance optimization**: Want automatic routing to best-performing providers
   - ✅ **Minimal configuration**: Prefer hands-off operation with intelligent defaults
   - ✅ **Dynamic workloads**: Traffic patterns change frequently
   - ✅ **Automatic failover**: Need instant adaptation to provider issues
   - ✅ **Multi-provider redundancy**: Want seamless provider switching based on availability

4. **Use All Three Together:**
   - ✅ **Complete solution**: Governance provides base routing, routing rules add dynamic override, load balancing optimizes keys
   - ✅ **Maximum flexibility**: Different Virtual Keys use different strategies (governance vs routing rules vs load balancing)
   - ✅ **Enterprise deployments**: Complex organizations with multiple requirements per layer

---

## Additional Resources

- Configuration instructions for setting up governance routing via Virtual Keys (Web UI, API, config.json)
- Dynamic, expression-based routing using CEL expressions for runtime conditions
- Technical implementation details: scoring algorithms, weight calculations, and performance characteristics
- Learn how to create and configure Virtual Keys
- Understand how automatic fallbacks work across providers