---
title: "Performance Tuning"
description: "Optimize Bifrost for high throughput with concurrency, buffer sizing, and memory pool configuration"
icon: "gauge-high"
---
## Overview
Bifrost provides three key performance configuration parameters that control throughput, memory usage, and request handling behavior:
| Parameter | Scope | Default | Description |
|-----------|-------|---------|-------------|
| **Concurrency** | Per Provider | 1000 | Number of worker goroutines processing requests simultaneously |
| **Buffer Size** | Per Provider | 5000 | Maximum requests that can be queued before blocking/dropping |
| **Initial Pool Size** | Global | 5000 | Pre-allocated objects in sync pools to reduce GC pressure |
These defaults are suitable for most production deployments handling up to ~5000 RPS. For higher throughput or constrained environments, tuning these parameters can significantly improve performance.
---
## Understanding the Parameters
### Concurrency (Per Provider)
**What it does:** Controls two aspects of provider performance:
1. **Worker Goroutines:** The number of goroutines that process requests for each provider. Each worker pulls requests from the provider's queue and executes them against the provider's API.
2. **Provider Pool Pre-warming:** Pre-allocates provider-specific response objects (e.g., `AnthropicMessageResponse`, `OpenAIResponse`) in sync pools to reduce allocations during request handling.
**Impact:**
- **Higher concurrency** = More parallel requests to the provider, higher throughput, more pre-allocated response objects
- **Lower concurrency** = Fewer parallel requests, lower resource usage, easier to stay within provider rate limits
**Default:** `1000` workers per provider
```json
{
  "providers": {
    "openai": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 100,
        "buffer_size": 500
      }
    }
  }
}
```
```go
func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
	return &schemas.ProviderConfig{
		NetworkConfig: schemas.DefaultNetworkConfig,
		ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
			Concurrency: 100, // 100 concurrent workers
			BufferSize:  500, // 500 request queue capacity
		},
	}, nil
}
```
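The worker model described above can be pictured as a plain Go worker pool. The following is a minimal sketch, not Bifrost's internal code: `request` and `runWorkers` are hypothetical names, and real workers would call the provider's API instead of incrementing a counter.

```go
package main

import "sync"

// request is a hypothetical stand-in for a queued provider request.
type request struct{ id int }

// runWorkers starts `concurrency` goroutines that drain a buffered queue of
// capacity `bufferSize`, enqueues `total` requests, and returns how many
// requests the workers processed.
func runWorkers(concurrency, bufferSize, total int) int {
	queue := make(chan request, bufferSize)

	var (
		wg        sync.WaitGroup
		mu        sync.Mutex
		processed int
	)

	for w := 0; w < concurrency; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker pulls requests until the queue is closed.
			for range queue {
				mu.Lock()
				processed++
				mu.Unlock()
			}
		}()
	}

	for i := 0; i < total; i++ {
		queue <- request{id: i} // blocks while the buffer is full
	}
	close(queue)
	wg.Wait()
	return processed
}
```

Calling `runWorkers(4, 8, 100)` starts four workers over an eight-slot queue and returns once all 100 requests have been drained.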
### Buffer Size (Per Provider)
**What it does:** Sets the capacity of the buffered channel (queue) for each provider. Incoming requests are queued here before being picked up by workers.
**Impact:**
- **Larger buffer** = More requests can be queued during traffic spikes, handles burst traffic better
- **Smaller buffer** = Lower memory footprint, faster backpressure signals to clients
**Default:** `5000` requests per provider queue
**Queue Full Behavior:** Controlled by `drop_excess_requests`:
- `false` (default): New requests block until queue space is available
- `true`: New requests are immediately dropped with an error when queue is full
**Constraint:** Buffer size must be greater than or equal to concurrency. If `concurrency > buffer_size`, provider setup will fail.
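If you generate provider settings programmatically, it can help to check this constraint before handing the values to Bifrost. The helper below is a hypothetical sketch (`validateQueueConfig` is not part of the Bifrost API) that only mirrors the rule stated above.

```go
package main

import "fmt"

// validateQueueConfig is a hypothetical pre-flight check that mirrors the
// documented rule: buffer_size must be greater than or equal to concurrency.
func validateQueueConfig(concurrency, bufferSize int) error {
	if bufferSize < concurrency {
		return fmt.Errorf("buffer_size (%d) must be >= concurrency (%d)", bufferSize, concurrency)
	}
	return nil
}
```

For example, `validateQueueConfig(100, 500)` returns `nil`, while `validateQueueConfig(500, 100)` returns an error.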
### Initial Pool Size (Global)
**What it does:** Controls the number of pre-allocated objects in Bifrost's internal sync pools at startup. These pools recycle objects to reduce garbage collection overhead.
**Pooled Objects:**
- Channel messages (request wrappers)
- Response channels
- Error channels
- Stream channels
- Plugin pipelines
- Request objects
**Impact:**
- **Higher initial pool** = Less GC pressure during high traffic, more consistent latency, higher initial memory usage
- **Lower initial pool** = Lower initial memory footprint, may cause more allocations under load
**Default:** `5000` objects per pool
```json
{
  "config": {
    "initial_pool_size": 10000,
    "drop_excess_requests": false
  }
}
```
```go
bifrostConfig := schemas.BifrostConfig{
	Account:            myAccount,
	InitialPoolSize:    10000, // Pre-warm pools with 10,000 objects
	DropExcessRequests: false,
}

client, err := bifrost.Init(ctx, bifrostConfig)
```
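To illustrate what pre-warming a pool means in Go terms, here is a minimal sketch built on the standard library's `sync.Pool`. It is an assumption about the general technique, not a copy of Bifrost's internals; `message`, `newPrewarmedPool`, and `handle` are hypothetical names. Note that `sync.Pool` may release idle objects during garbage collection, so seeding acts as a warm-up rather than a guarantee.

```go
package main

import "sync"

// message is a hypothetical pooled object, standing in for a request wrapper.
type message struct{ payload []byte }

// newPrewarmedPool creates a sync.Pool and seeds it with n objects so that
// early requests can reuse them instead of allocating.
func newPrewarmedPool(n int) *sync.Pool {
	pool := &sync.Pool{
		New: func() any { return &message{} },
	}
	for i := 0; i < n; i++ {
		pool.Put(&message{})
	}
	return pool
}

// handle borrows a message, reuses its backing buffer for the new data,
// returns the message to the pool, and reports the payload length.
func handle(pool *sync.Pool, data []byte) int {
	msg := pool.Get().(*message)
	defer pool.Put(msg)
	msg.payload = append(msg.payload[:0], data...)
	return len(msg.payload)
}
```

A caller would create the pool once at startup, for example `pool := newPrewarmedPool(10000)`, and then call `handle(pool, body)` per request.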
---
## Sizing Guidelines
### Concurrency & Buffer Size (Per Provider)
Configure these settings **per provider** based on the expected RPS for that specific provider:
| Provider RPS | Concurrency | Buffer Size |
|--------------|-------------|-------------|
| 100 | 100 | 150 |
| 500 | 500 | 750 |
| 1000 | 1000 | 1500 |
| 2500 | 2500 | 3750 |
| 5000 | 5000 | 7500 |
| 10000 | 10000 | 15000 |
**Example:** If you expect 2000 RPS to OpenAI and 500 RPS to Anthropic, configure OpenAI with `concurrency: 2000, buffer_size: 3000` and Anthropic with `concurrency: 500, buffer_size: 750`.
**Formula:**
```
concurrency = expected_rps
buffer_size = 1.5 × expected_rps
```
This ratio ensures:
- Enough queue capacity to absorb traffic bursts
- Workers are never starved for work
- Backpressure is applied before memory exhaustion
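The formula above can be sketched as a small Go helper. `providerSizing` and `sizeProvider` are hypothetical names used only for illustration, and rounding the buffer size up for odd RPS values is an assumption.

```go
package main

// providerSizing holds the per-provider values derived from expected RPS.
type providerSizing struct {
	Concurrency int
	BufferSize  int
}

// sizeProvider applies concurrency = expected_rps and
// buffer_size = 1.5 × expected_rps, rounding the buffer size up.
func sizeProvider(expectedRPS int) providerSizing {
	return providerSizing{
		Concurrency: expectedRPS,
		BufferSize:  (3*expectedRPS + 1) / 2, // ceil(1.5 × rps) in integer arithmetic
	}
}
```

With the earlier example, `sizeProvider(2000)` yields a concurrency of 2000 and a buffer size of 3000, and `sizeProvider(500)` yields 500 and 750.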
### Initial Pool Size (Global)
Configure this setting based on **total RPS across all providers combined**:
| Total RPS (All Providers) | Initial Pool Size | Memory Estimate |
|---------------------------|-------------------|-----------------|
| 100 | 150 | ~50 MB |
| 500 | 750 | ~100 MB |
| 1000 | 1500 | ~200 MB |
| 2500 | 3750 | ~400 MB |
| 5000 | 7500 | ~800 MB |
| 10000 | 15000 | ~1.5 GB |
Memory estimates are approximate and vary based on request/response sizes, number of providers, and plugins. Monitor actual memory usage in your environment.
**Formula:**
```
initial_pool_size = 1.5 × total_expected_rps
```
Additionally, ensure:
```
initial_pool_size >= max(buffer_size across all providers)
```
This ensures pools are pre-warmed to handle peak queue depths without runtime allocations.
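Both rules for the pool size can be combined in one helper. This is a sketch under the assumptions stated in this section; `initialPoolSize` is a hypothetical name, not a Bifrost function.

```go
package main

// initialPoolSize returns 1.5 × total expected RPS (rounded up), raised if
// necessary so that it is at least as large as the biggest provider buffer.
func initialPoolSize(totalRPS int, bufferSizes []int) int {
	size := (3*totalRPS + 1) / 2 // ceil(1.5 × total RPS)
	for _, b := range bufferSizes {
		if b > size {
			size = b
		}
	}
	return size
}
```

For 2000 RPS to one provider and 500 RPS to another (buffers of 3000 and 750), `initialPoolSize(2500, []int{3000, 750})` returns 3750.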
---
## Multi-Node Deployments
When running multiple Bifrost instances behind a load balancer, size the settings for your total expected RPS, then **divide each value by the number of nodes** to get the per-node configuration.
### Formula
```
Per-Node Concurrency = Total Concurrency / Number of Nodes
Per-Node Buffer Size = Total Buffer Size / Number of Nodes
Per-Node Initial Pool Size = Total Initial Pool Size / Number of Nodes
```
### Example: 10,000 RPS Across 4 Nodes
**Total capacity (aggregate across all 4 nodes):**
- Total RPS: 10,000
- Per-node RPS: ~2,500
**Single node settings for 10,000 RPS (if running on one node):**
- Concurrency: 10000
- Buffer Size: 15000
- Initial Pool Size: 15000
**Per-node settings (4 nodes, 10,000 RPS total):**
| Parameter | Total (Aggregate) | Per Node (4 nodes) |
|-----------|-------------------|-------------------|
| Concurrency | 10000 | 2500 |
| Buffer Size | 15000 | 3750 |
| Initial Pool Size | 15000 | 3750 |
```json
{
  "config": {
    "initial_pool_size": 3750,
    "drop_excess_requests": false
  },
  "providers": {
    "openai": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 2500,
        "buffer_size": 3750
      }
    },
    "anthropic": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 2500,
        "buffer_size": 3750
      }
    }
  }
}
```
```go
const numNodes = 4

func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
	// Total capacity divided by number of nodes
	// Total: 10,000 RPS across 4 nodes = 2,500 RPS per node
	return &schemas.ProviderConfig{
		NetworkConfig: schemas.DefaultNetworkConfig,
		ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
			Concurrency: 10000 / numNodes, // 2500 per node
			BufferSize:  15000 / numNodes, // 3750 per node
		},
	}, nil
}

// In main initialization
bifrostConfig := schemas.BifrostConfig{
	Account:         myAccount,
	InitialPoolSize: 15000 / numNodes, // 3750 per node
}
```
**Kubernetes Horizontal Pod Autoscaling:** When using HPA, size the settings for your minimum replica count, since that is when each pod carries the largest share of traffic. As pods scale up, each pod handles a smaller portion of traffic. Consider using environment variables or ConfigMaps to adjust settings based on replica count.
---
## Provider-Specific Tuning
Different providers have different rate limits and latency characteristics. Tune each provider independently:
### Provider Rate Limit Considerations
| Provider | Typical Rate Limits | Recommended Concurrency | Notes |
|----------|---------------------|------------------------|-------|
| OpenAI | 500-10000 RPM (varies by tier) | 100-500 | Higher tiers support more concurrency |
| Anthropic | 1000-4000 RPM (varies by tier) | 50-200 | More conservative rate limits |
| Bedrock | Per-model limits | 100-300 | Check AWS quotas for your account |
| Azure OpenAI | Deployment-specific | 100-500 | Configure per-deployment |
| Vertex AI | Per-model quotas | 100-300 | Check GCP quotas |
| Groq | Very high throughput | 500-1000 | Designed for high concurrency |
| Ollama | Local resource bound | 10-50 | Limited by local GPU/CPU |
### Example: Mixed Provider Configuration
```json
{
  "providers": {
    "openai": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 200,
        "buffer_size": 1000
      }
    },
    "anthropic": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 100,
        "buffer_size": 500
      }
    },
    "groq": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 500,
        "buffer_size": 2500
      }
    },
    "ollama": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 20,
        "buffer_size": 100
      }
    }
  }
}
```
```go
func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
	switch provider {
	case schemas.OpenAI:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 200,
				BufferSize:  1000,
			},
		}, nil
	case schemas.Anthropic:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 100,
				BufferSize:  500,
			},
		}, nil
	case schemas.Groq:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 500,
				BufferSize:  2500,
			},
		}, nil
	case schemas.Ollama:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 20,
				BufferSize:  100,
			},
		}, nil
	default:
		return &schemas.ProviderConfig{
			NetworkConfig:            schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.DefaultConcurrencyAndBufferSize,
		}, nil
	}
}
```
---
## Queue Overflow Handling
When the provider queue reaches capacity, Bifrost's behavior is controlled by `drop_excess_requests`:
### Blocking Mode (Default)
```json
{
  "config": {
    "drop_excess_requests": false
  }
}
```
- New requests **wait** until queue space is available
- Ensures no requests are lost
- May increase latency during high load
- Suitable for critical workloads where every request matters
### Drop Mode
```json
{
  "config": {
    "drop_excess_requests": true
  }
}
```
- New requests are **immediately rejected** when queue is full
- Returns error: `"request dropped: queue is full"`
- Maintains consistent latency for accepted requests
- Suitable for real-time applications where stale requests are useless
**Best Practice:** Use `drop_excess_requests: true` with buffer sizes at 1.5x concurrency for production workloads. This prevents memory exhaustion while still handling reasonable traffic bursts.
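The difference between the two modes maps directly onto how a send to a buffered Go channel is written. The sketch below is an illustration of that idea, not Bifrost's actual enqueue path; `enqueue` and `errQueueFull` are hypothetical names, and only the error text is taken from this page.

```go
package main

import "errors"

// errQueueFull carries the documented error message; the variable name is hypothetical.
var errQueueFull = errors.New("request dropped: queue is full")

// enqueue adds req to the queue. In drop mode it returns errQueueFull
// immediately when the buffer is full; otherwise it blocks until a worker
// frees a slot.
func enqueue(queue chan<- int, req int, dropExcess bool) error {
	if dropExcess {
		select {
		case queue <- req:
			return nil
		default:
			return errQueueFull // buffer full: reject without waiting
		}
	}
	queue <- req // blocking mode: wait for queue space
	return nil
}
```

With a queue created as `make(chan int, 1)`, a second `enqueue(q, 2, true)` call returns the error right away, whereas `enqueue(q, 2, false)` would wait until a worker receives from `q`.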
---
## Monitoring and Diagnostics
### Key Metrics to Monitor
| Metric | Healthy Range | Action if Outside Range |
|--------|---------------|-------------------|
| Queue depth | < 50% of buffer_size | Increase buffer or concurrency |
| Request latency (p99) | < 2x average | Check provider rate limits |
| Dropped requests | 0 | Increase buffer_size |
| Memory usage | Stable | Reduce pool/buffer sizes |
| Goroutine count | Stable | Check for goroutine leaks |
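To make the queue depth threshold concrete, the sketch below shows how the fill ratio of a buffered channel can be computed with Go's built-in `len` and `cap`. It is an illustration only: `queueUtilization` and `overHalfFull` are hypothetical helpers, and how Bifrost itself reports queue depth is not shown here.

```go
package main

// queueUtilization returns the fraction of the buffer currently in use,
// from 0.0 (empty) to 1.0 (full). An unbuffered channel reports 0.
func queueUtilization(queue chan int) float64 {
	if cap(queue) == 0 {
		return 0
	}
	return float64(len(queue)) / float64(cap(queue))
}

// overHalfFull reports whether the queue has reached the 50% mark that the
// table above treats as the upper bound of the healthy range.
func overHalfFull(queue chan int) bool {
	return queueUtilization(queue) >= 0.5
}
```

A periodic check such as `if overHalfFull(q) { /* log or alert */ }` is one simple way to notice sustained queue growth.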
### Health Check Endpoint
The Gateway exposes health and metrics endpoints:
```bash
# Health check
curl http://localhost:8080/health
# Prometheus metrics
curl http://localhost:8080/metrics
```
---
## Best Practices Summary
- Begin with lower values and scale up based on observed performance. Over-provisioning wastes resources.
- Track queue depths, latencies, and error rates. Adjust settings based on real traffic patterns.
- Don't set concurrency higher than provider rate limits allow. You'll just get rate-limited.
- Set `buffer_size` to 1.5x concurrency to handle traffic spikes without dropping requests.
### Quick Reference
```
// Formula
concurrency = expected_rps
buffer_size = 1.5 × expected_rps
initial_pool_size = 1.5 × total_rps (across all providers)
// Example: 500 RPS per provider, 2 providers (1000 total RPS)
concurrency: 500, buffer_size: 750, initial_pool_size: 1500
// Example: 2000 RPS per provider, 3 providers (6000 total RPS)
concurrency: 2000, buffer_size: 3000, initial_pool_size: 9000
// Multi-node formula
per_node_value = total_value / number_of_nodes
```
---
## Related Documentation
- **[Provider Configuration](../quickstart/gateway/provider-configuration)** - Complete provider setup guide
- **[Custom Providers](./custom-providers)** - Creating custom provider integrations
- **[Deployment](../deployment-guides/)** - Production deployment guides