---
title: "Performance Tuning"
description: "Optimize Bifrost for high throughput with concurrency, buffer sizing, and memory pool configuration"
icon: "gauge-high"
---

## Overview

Bifrost provides three key performance configuration parameters that control throughput, memory usage, and request handling behavior:

| Parameter | Scope | Default | Description |
|-----------|-------|---------|-------------|
| **Concurrency** | Per Provider | 1000 | Number of worker goroutines processing requests simultaneously |
| **Buffer Size** | Per Provider | 5000 | Maximum requests that can be queued before blocking/dropping |
| **Initial Pool Size** | Global | 5000 | Pre-allocated objects in sync pools to reduce GC pressure |

These defaults are suitable for most production deployments handling up to ~5000 RPS. For higher throughput or constrained environments, tuning these parameters can significantly improve performance.

---

## Understanding the Parameters

### Concurrency (Per Provider)

**What it does:** Controls two aspects of provider performance:

1. **Worker Goroutines:** The number of goroutines that process requests for each provider. Each worker pulls requests from the provider's queue and executes them against the provider's API.
2. **Provider Pool Pre-warming:** Pre-allocates provider-specific response objects (e.g., `AnthropicMessageResponse`, `OpenAIResponse`) in sync pools to reduce allocations during request handling.
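The worker model described above can be sketched as a fixed set of goroutines draining one shared buffered channel. This is only an illustration of the pattern, not Bifrost's actual implementation; `request`, `runWorkers`, and `handle` are hypothetical names introduced for the example.

```go
package main

import (
    "fmt"
    "sync"
)

// request is a hypothetical stand-in for a queued provider request.
type request struct{ id int }

// runWorkers starts `concurrency` goroutines that drain `queue` and returns
// the number of requests handled once the queue has been closed and emptied.
func runWorkers(concurrency int, queue <-chan request, handle func(request)) int {
    var (
        wg      sync.WaitGroup
        mu      sync.Mutex
        handled int
    )
    for i := 0; i < concurrency; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for req := range queue { // each worker pulls from the shared queue
                handle(req)
                mu.Lock()
                handled++
                mu.Unlock()
            }
        }()
    }
    wg.Wait()
    return handled
}

func main() {
    queue := make(chan request, 8) // channel capacity plays the role of buffer_size
    for i := 0; i < 8; i++ {
        queue <- request{id: i}
    }
    close(queue)

    n := runWorkers(4, queue, func(request) {})
    fmt.Println("handled:", n) // prints "handled: 8"
}
```

Raising the first argument of `runWorkers` corresponds to raising `concurrency`: more requests are in flight at once, at the cost of more goroutines and more pressure on the provider's rate limits.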
**Impact:**

- **Higher concurrency** = More parallel requests to the provider, higher throughput, more pre-allocated response objects
- **Lower concurrency** = Fewer parallel requests, lower resource usage, respects provider rate limits

**Default:** `1000` workers per provider

```json
{
  "providers": {
    "openai": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 100,
        "buffer_size": 500
      }
    }
  }
}
```

```go
func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
    return &schemas.ProviderConfig{
        NetworkConfig: schemas.DefaultNetworkConfig,
        ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
            Concurrency: 100, // 100 concurrent workers
            BufferSize:  500, // 500 request queue capacity
        },
    }, nil
}
```

### Buffer Size (Per Provider)

**What it does:** Sets the capacity of the buffered channel (queue) for each provider. Incoming requests are queued here before being picked up by workers.

**Impact:**

- **Larger buffer** = More requests can be queued during traffic spikes, handles burst traffic better
- **Smaller buffer** = Lower memory footprint, faster backpressure signals to clients

**Default:** `5000` requests per provider queue

**Queue Full Behavior:** Controlled by `drop_excess_requests`:

- `false` (default): New requests block until queue space is available
- `true`: New requests are immediately dropped with an error when queue is full

**Constraint:** Buffer size must be greater than or equal to concurrency. If `concurrency > buffer_size`, provider setup will fail.

### Initial Pool Size (Global)

**What it does:** Controls the number of pre-allocated objects in Bifrost's internal sync pools at startup. These pools recycle objects to reduce garbage collection overhead.
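To make the idea of pre-warming concrete, here is a minimal sketch using Go's standard `sync.Pool`. It assumes a hypothetical pooled type `message` and a helper `newPrewarmedPool`; neither is taken from Bifrost's code, and the real pools may be structured differently.

```go
package main

import (
    "fmt"
    "sync"
)

// message is a hypothetical pooled object (for example, a request wrapper).
type message struct{ payload []byte }

// newMessage allocates a message with a pre-sized payload buffer.
func newMessage() *message {
    return &message{payload: make([]byte, 0, 1024)}
}

// newPrewarmedPool builds a sync.Pool and seeds it with `size` objects so
// that early requests can reuse them instead of allocating.
func newPrewarmedPool(size int) *sync.Pool {
    pool := &sync.Pool{
        // New is the fallback used when the pool has nothing to hand out.
        New: func() any { return newMessage() },
    }
    for i := 0; i < size; i++ {
        pool.Put(newMessage())
    }
    return pool
}

func main() {
    pool := newPrewarmedPool(5000)

    msg := pool.Get().(*message) // usually a pre-allocated object
    msg.payload = append(msg.payload[:0], "hello"...)
    fmt.Println(len(msg.payload), cap(msg.payload) >= 1024) // prints "5 true"

    pool.Put(msg) // return the object so a later request can reuse it
}
```

Note that `sync.Pool` may release idle objects during garbage collection, so pre-warming reduces allocations under load rather than guaranteeing that none occur.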
**Pooled Objects:**

- Channel messages (request wrappers)
- Response channels
- Error channels
- Stream channels
- Plugin pipelines
- Request objects

**Impact:**

- **Higher initial pool** = Less GC pressure during high traffic, more consistent latency, higher initial memory usage
- **Lower initial pool** = Lower initial memory footprint, may cause more allocations under load

**Default:** `5000` objects per pool

```json
{
  "config": {
    "initial_pool_size": 10000,
    "drop_excess_requests": false
  }
}
```

```go
bifrostConfig := schemas.BifrostConfig{
    Account:            myAccount,
    InitialPoolSize:    10000, // Pre-warm pools with 10,000 objects
    DropExcessRequests: false,
}

client, err := bifrost.Init(ctx, bifrostConfig)
```

---

## Sizing Guidelines

### Concurrency & Buffer Size (Per Provider)

Configure these settings **per provider** based on the expected RPS for that specific provider:

| Provider RPS | Concurrency | Buffer Size |
|--------------|-------------|-------------|
| 100 | 100 | 150 |
| 500 | 500 | 750 |
| 1000 | 1000 | 1500 |
| 2500 | 2500 | 3750 |
| 5000 | 5000 | 7500 |
| 10000 | 10000 | 15000 |

**Example:** If you expect 2000 RPS to OpenAI and 500 RPS to Anthropic, configure OpenAI with `concurrency: 2000, buffer_size: 3000` and Anthropic with `concurrency: 500, buffer_size: 750`.
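The ratio used in the table above can be expressed as a small helper. This is a sketch for planning purposes only; `sizing` and `sizeForRPS` are hypothetical names, not part of the Bifrost SDK.

```go
package main

import "fmt"

// sizing holds the per-provider values suggested by the sizing table.
type sizing struct {
    Concurrency int
    BufferSize  int
}

// sizeForRPS applies the table's ratio: concurrency equals the expected RPS
// and the buffer is 1.5x that value, rounded up to a whole request.
func sizeForRPS(expectedRPS int) sizing {
    return sizing{
        Concurrency: expectedRPS,
        BufferSize:  (expectedRPS*3 + 1) / 2, // integer ceiling of 1.5 * RPS
    }
}

func main() {
    providers := []struct {
        name string
        rps  int
    }{
        {"openai", 2000},
        {"anthropic", 500},
    }
    for _, p := range providers {
        s := sizeForRPS(p.rps)
        fmt.Printf("%s: concurrency=%d buffer_size=%d\n", p.name, s.Concurrency, s.BufferSize)
    }
    // Output:
    // openai: concurrency=2000 buffer_size=3000
    // anthropic: concurrency=500 buffer_size=750
}
```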
**Formula:**

```
concurrency = expected_rps
buffer_size = 1.5 × expected_rps
```

This ratio ensures:

- Enough queue capacity to absorb traffic bursts
- Workers are never starved for work
- Backpressure is applied before memory exhaustion

### Initial Pool Size (Global)

Configure this setting based on **total RPS across all providers combined**:

| Total RPS (All Providers) | Initial Pool Size | Memory Estimate |
|---------------------------|-------------------|-----------------|
| 100 | 150 | ~50 MB |
| 500 | 750 | ~100 MB |
| 1000 | 1500 | ~200 MB |
| 2500 | 3750 | ~400 MB |
| 5000 | 7500 | ~800 MB |
| 10000 | 15000 | ~1.5 GB |

Memory estimates are approximate and vary based on request/response sizes, number of providers, and plugins. Monitor actual memory usage in your environment.

**Formula:**

```
initial_pool_size = 1.5 × total_expected_rps
```

Additionally, ensure:

```
initial_pool_size >= max(buffer_size across all providers)
```

This ensures pools are pre-warmed to handle peak queue depths without runtime allocations.

---

## Multi-Node Deployments

When running multiple Bifrost instances behind a load balancer, size the settings for your total expected RPS, then **divide each value by the number of nodes** to get the per-node settings.
### Formula

```
Per-Node Concurrency = Total Concurrency / Number of Nodes
Per-Node Buffer Size = Total Buffer Size / Number of Nodes
Per-Node Initial Pool Size = Total Initial Pool Size / Number of Nodes
```

### Example: 10,000 RPS Across 4 Nodes

**Total capacity (aggregate across all 4 nodes):**

- Total RPS: 10,000
- Per-node RPS: ~2,500

**Single node settings for 10,000 RPS (if running on one node):**

- Concurrency: 10000
- Buffer Size: 15000
- Initial Pool Size: 15000

**Per-node settings (4 nodes, 10,000 RPS total):**

| Parameter | Total (Aggregate) | Per Node (4 nodes) |
|-----------|-------------------|-------------------|
| Concurrency | 10000 | 2500 |
| Buffer Size | 15000 | 3750 |
| Initial Pool Size | 15000 | 3750 |

```json
{
  "config": {
    "initial_pool_size": 3750,
    "drop_excess_requests": false
  },
  "providers": {
    "openai": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 2500,
        "buffer_size": 3750
      }
    },
    "anthropic": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 2500,
        "buffer_size": 3750
      }
    }
  }
}
```

```go
const numNodes = 4

func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
    // Total capacity divided by number of nodes
    // Total: 10,000 RPS across 4 nodes = 2,500 RPS per node
    return &schemas.ProviderConfig{
        NetworkConfig: schemas.DefaultNetworkConfig,
        ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
            Concurrency: 10000 / numNodes, // 2500 per node
            BufferSize:  15000 / numNodes, // 3750 per node
        },
    }, nil
}

// In main initialization
bifrostConfig := schemas.BifrostConfig{
    Account:         myAccount,
    InitialPoolSize: 15000 / numNodes, // 3750 per node
}
```

**Kubernetes Horizontal Pod Autoscaling:** When using HPA, configure settings for your minimum replica count. As pods scale up, each pod handles a smaller portion of traffic. Consider using environment variables or ConfigMaps to dynamically adjust settings based on replica count.
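One way to apply the environment-variable suggestion is to compute per-node values at startup. The sketch below assumes a hypothetical `BIFROST_NODE_COUNT` variable (for example, injected from a ConfigMap); Bifrost itself does not define this variable, and `nodeCountFromEnv` and `perNode` are illustrative helpers.

```go
package main

import (
    "fmt"
    "os"
    "strconv"
)

// nodeCountFromEnv reads the replica count from the hypothetical
// BIFROST_NODE_COUNT variable, falling back to 1 when it is unset or invalid.
func nodeCountFromEnv() int {
    n, err := strconv.Atoi(os.Getenv("BIFROST_NODE_COUNT"))
    if err != nil || n < 1 {
        return 1
    }
    return n
}

// perNode divides an aggregate setting across nodes, rounding up so the
// cluster as a whole is never sized below the aggregate target.
func perNode(total, nodes int) int {
    return (total + nodes - 1) / nodes
}

func main() {
    nodes := nodeCountFromEnv()

    // Aggregate values for 10,000 total RPS, taken from the example above.
    fmt.Println("concurrency:", perNode(10000, nodes))
    fmt.Println("buffer_size:", perNode(15000, nodes))
    fmt.Println("initial_pool_size:", perNode(15000, nodes))
}
```

With `BIFROST_NODE_COUNT=4` this prints 2500, 3750, and 3750, matching the per-node column in the table. The resulting values would then be passed into `ConcurrencyAndBufferSize` and `InitialPoolSize` as in the Go example above.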
---

## Provider-Specific Tuning

Different providers have different rate limits and latency characteristics. Tune each provider independently:

### Provider Rate Limit Considerations

| Provider | Typical Rate Limits | Recommended Concurrency | Notes |
|----------|---------------------|------------------------|-------|
| OpenAI | 500-10000 RPM (varies by tier) | 100-500 | Higher tiers support more concurrency |
| Anthropic | 1000-4000 RPM (varies by tier) | 50-200 | More conservative rate limits |
| Bedrock | Per-model limits | 100-300 | Check AWS quotas for your account |
| Azure OpenAI | Deployment-specific | 100-500 | Configure per-deployment |
| Vertex AI | Per-model quotas | 100-300 | Check GCP quotas |
| Groq | Very high throughput | 500-1000 | Designed for high concurrency |
| Ollama | Local resource bound | 10-50 | Limited by local GPU/CPU |

### Example: Mixed Provider Configuration

```json
{
  "providers": {
    "openai": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 200,
        "buffer_size": 1000
      }
    },
    "anthropic": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 100,
        "buffer_size": 500
      }
    },
    "groq": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 500,
        "buffer_size": 2500
      }
    },
    "ollama": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 20,
        "buffer_size": 100
      }
    }
  }
}
```

```go
func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
    switch provider {
    case schemas.OpenAI:
        return &schemas.ProviderConfig{
            NetworkConfig: schemas.DefaultNetworkConfig,
            ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
                Concurrency: 200,
                BufferSize:  1000,
            },
        }, nil
    case schemas.Anthropic:
        return &schemas.ProviderConfig{
            NetworkConfig: schemas.DefaultNetworkConfig,
            ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
                Concurrency: 100,
                BufferSize:  500,
            },
        }, nil
    case schemas.Groq:
        return &schemas.ProviderConfig{
            NetworkConfig: schemas.DefaultNetworkConfig,
            ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
                Concurrency: 500,
                BufferSize:  2500,
            },
        }, nil
    case schemas.Ollama:
        return &schemas.ProviderConfig{
            NetworkConfig: schemas.DefaultNetworkConfig,
            ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
                Concurrency: 20,
                BufferSize:  100,
            },
        }, nil
    default:
        return &schemas.ProviderConfig{
            NetworkConfig:            schemas.DefaultNetworkConfig,
            ConcurrencyAndBufferSize: schemas.DefaultConcurrencyAndBufferSize,
        }, nil
    }
}
```

---

## Queue Overflow Handling

When the provider queue reaches capacity, Bifrost's behavior is controlled by `drop_excess_requests`:

### Blocking Mode (Default)

```json
{
  "config": {
    "drop_excess_requests": false
  }
}
```

- New requests **wait** until queue space is available
- Ensures no requests are lost
- May increase latency during high load
- Suitable for critical workloads where every request matters

### Drop Mode

```json
{
  "config": {
    "drop_excess_requests": true
  }
}
```

- New requests are **immediately rejected** when queue is full
- Returns error: `"request dropped: queue is full"`
- Maintains consistent latency for accepted requests
- Suitable for real-time applications where stale requests are useless

**Best Practice:** Use `drop_excess_requests: true` with buffer sizes at 1.5x concurrency for production workloads. This prevents memory exhaustion while still handling reasonable traffic bursts.
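The difference between the two modes maps naturally onto a blocking channel send versus a `select` with a `default` branch. The following is a minimal sketch of that pattern, not Bifrost's internal code; `enqueue` and `errQueueFull` are hypothetical names, and only the error text is taken from the section above.

```go
package main

import (
    "errors"
    "fmt"
)

// errQueueFull mirrors the error message documented for drop mode.
var errQueueFull = errors.New("request dropped: queue is full")

// enqueue adds req to queue. With dropExcess set it fails fast when the
// queue is full; otherwise it blocks until a worker frees a slot.
func enqueue(queue chan<- int, req int, dropExcess bool) error {
    if !dropExcess {
        queue <- req // blocking mode: wait for space
        return nil
    }
    select {
    case queue <- req:
        return nil
    default: // drop mode: the queue is full, reject immediately
        return errQueueFull
    }
}

func main() {
    queue := make(chan int, 2) // tiny buffer to force an overflow

    for i := 1; i <= 3; i++ {
        if err := enqueue(queue, i, true); err != nil {
            fmt.Printf("request %d: %v\n", i, err)
            continue
        }
        fmt.Printf("request %d: queued\n", i)
    }
    // Output:
    // request 1: queued
    // request 2: queued
    // request 3: request dropped: queue is full
}
```

In blocking mode the third call would instead wait until a worker received from `queue`, which is why that mode can add latency under sustained overload.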
---

## Monitoring and Diagnostics

### Key Metrics to Monitor

| Metric | Healthy Range | Action if Exceeded |
|--------|---------------|-------------------|
| Queue depth | < 50% of buffer_size | Increase buffer or concurrency |
| Request latency (p99) | < 2x average | Check provider rate limits |
| Dropped requests | 0 | Increase buffer_size |
| Memory usage | Stable | Reduce pool/buffer sizes |
| Goroutine count | Stable | Check for goroutine leaks |

### Health Check Endpoint

The Gateway exposes health and metrics endpoints:

```bash
# Health check
curl http://localhost:8080/health

# Prometheus metrics
curl http://localhost:8080/metrics
```

---

## Best Practices Summary

- Begin with lower values and scale up based on observed performance. Over-provisioning wastes resources.
- Track queue depths, latencies, and error rates. Adjust settings based on real traffic patterns.
- Don't set concurrency higher than provider rate limits allow. You'll just get rate-limited.
- Set `buffer_size` to 1.5x concurrency to handle traffic spikes without dropping requests.

### Quick Reference

```
// Formula
concurrency = expected_rps
buffer_size = 1.5 × expected_rps
initial_pool_size = 1.5 × total_rps (across all providers)

// Example: 500 RPS per provider, 2 providers (1000 total RPS)
concurrency: 500, buffer_size: 750, initial_pool_size: 1500

// Example: 2000 RPS per provider, 3 providers (6000 total RPS)
concurrency: 2000, buffer_size: 3000, initial_pool_size: 9000

// Multi-node formula
per_node_value = total_value / number_of_nodes
```

---

## Related Documentation

- **[Provider Configuration](../quickstart/gateway/provider-configuration)** - Complete provider setup guide
- **[Custom Providers](./custom-providers)** - Creating custom provider integrations
- **[Deployment](../deployment-guides/)** - Production deployment guides