bifrost/docs/providers/performance.mdx
Beyhan Oğur 880f412e2c first commit
2026-04-26 21:52:23 +03:00

---
title: "Performance Tuning"
description: "Optimize Bifrost for high throughput with concurrency, buffer sizing, and memory pool configuration"
icon: "gauge-high"
---
## Overview
Bifrost provides three key performance configuration parameters that control throughput, memory usage, and request handling behavior:
| Parameter | Scope | Default | Description |
|-----------|-------|---------|-------------|
| **Concurrency** | Per Provider | 1000 | Number of worker goroutines processing requests simultaneously |
| **Buffer Size** | Per Provider | 5000 | Maximum requests that can be queued before blocking/dropping |
| **Initial Pool Size** | Global | 5000 | Pre-allocated objects in sync pools to reduce GC pressure |
<Info>
These defaults are suitable for most production deployments handling up to ~5000 RPS. For higher throughput or constrained environments, tuning these parameters can significantly improve performance.
</Info>
---
## Understanding the Parameters
### Concurrency (Per Provider)
**What it does:** Controls two aspects of provider performance:
1. **Worker Goroutines:** The number of goroutines that process requests for each provider. Each worker pulls requests from the provider's queue and executes them against the provider's API.
2. **Provider Pool Pre-warming:** Pre-allocates provider-specific response objects (e.g., `AnthropicMessageResponse`, `OpenAIResponse`) in sync pools to reduce allocations during request handling.
**Impact:**
- **Higher concurrency** = More parallel requests to the provider, higher throughput, more pre-allocated response objects
- **Lower concurrency** = Fewer parallel requests, lower resource usage, and less risk of exceeding provider rate limits
**Default:** `1000` workers per provider
<Tabs>
<Tab title="Gateway (config.json)">
```json
{
  "providers": {
    "openai": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 100,
        "buffer_size": 500
      }
    }
  }
}
```
</Tab>
<Tab title="Go SDK">
```go
func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
	return &schemas.ProviderConfig{
		NetworkConfig: schemas.DefaultNetworkConfig,
		ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
			Concurrency: 100, // 100 concurrent workers
			BufferSize:  500, // 500 request queue capacity
		},
	}, nil
}
```
</Tab>
</Tabs>
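
The worker model described above can be sketched with plain goroutines and a buffered channel. This is a minimal illustration, not Bifrost's implementation: `runWorkers`, the `int` request type, and the `handle` callback are all hypothetical stand-ins for a provider queue and a provider API call.

```go
package main

import (
	"fmt"
	"sync"
)

// runWorkers starts `concurrency` goroutines that drain `queue` and apply
// `handle` to each request. It returns the collected results once the queue
// has been closed and emptied. Hypothetical helper for illustration only.
func runWorkers(concurrency int, queue <-chan int, handle func(int) int) []int {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		results []int
	)
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Each worker pulls requests until the queue is closed.
			for req := range queue {
				out := handle(req)
				mu.Lock()
				results = append(results, out)
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return results
}

func main() {
	queue := make(chan int, 5) // plays the role of buffer_size = 5
	for i := 1; i <= 5; i++ {
		queue <- i
	}
	close(queue)

	// Two workers play the role of concurrency = 2.
	results := runWorkers(2, queue, func(n int) int { return n * 2 })
	fmt.Println(len(results)) // 5
}
```

With more workers than queued requests, the extra goroutines simply sit idle, which is why raising concurrency beyond the real request rate mostly costs memory.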
### Buffer Size (Per Provider)
**What it does:** Sets the capacity of the buffered channel (queue) for each provider. Incoming requests are queued here before being picked up by workers.
**Impact:**
- **Larger buffer** = More requests can be queued during traffic spikes, handles burst traffic better
- **Smaller buffer** = Lower memory footprint, faster backpressure signals to clients
**Default:** `5000` requests per provider queue
**Queue Full Behavior:** Controlled by `drop_excess_requests`:
- `false` (default): New requests block until queue space is available
- `true`: New requests are dropped immediately with an error when the queue is full
<Warning>
**Constraint:** Buffer size must be greater than or equal to concurrency. If `concurrency > buffer_size`, provider setup will fail.
</Warning>
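
If you generate provider settings programmatically, it can help to check this constraint before handing the values to Bifrost. The sketch below is an assumption-laden example: the `sizing` type and `validate` function are hypothetical and not part of the Bifrost SDK, and the positive-concurrency check is an extra sanity check that the constraint above does not mention.

```go
package main

import "fmt"

// sizing mirrors the two per-provider settings discussed above.
// Hypothetical type, not part of the Bifrost SDK.
type sizing struct {
	Concurrency int
	BufferSize  int
}

// validate returns an error when buffer_size < concurrency.
// The positive-concurrency check is an added assumption.
func validate(s sizing) error {
	if s.Concurrency <= 0 {
		return fmt.Errorf("concurrency must be positive, got %d", s.Concurrency)
	}
	if s.BufferSize < s.Concurrency {
		return fmt.Errorf("buffer_size (%d) must be >= concurrency (%d)", s.BufferSize, s.Concurrency)
	}
	return nil
}

func main() {
	fmt.Println(validate(sizing{Concurrency: 100, BufferSize: 500})) // satisfies the constraint
	fmt.Println(validate(sizing{Concurrency: 500, BufferSize: 100})) // violates the constraint
}
```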
### Initial Pool Size (Global)
**What it does:** Controls the number of pre-allocated objects in Bifrost's internal sync pools at startup. These pools recycle objects to reduce garbage collection overhead.
**Pooled Objects:**
- Channel messages (request wrappers)
- Response channels
- Error channels
- Stream channels
- Plugin pipelines
- Request objects
**Impact:**
- **Higher initial pool** = Less GC pressure during high traffic, more consistent latency, higher initial memory usage
- **Lower initial pool** = Lower initial memory footprint, may cause more allocations under load
**Default:** `5000` objects per pool
<Tabs>
<Tab title="Gateway (config.json)">
```json
{
  "config": {
    "initial_pool_size": 10000,
    "drop_excess_requests": false
  }
}
```
</Tab>
<Tab title="Go SDK">
```go
bifrostConfig := schemas.BifrostConfig{
	Account:            myAccount,
	InitialPoolSize:    10000, // Pre-warm pools with 10,000 objects
	DropExcessRequests: false,
}
client, err := bifrost.Init(ctx, bifrostConfig)
```
</Tab>
</Tabs>
---
## Sizing Guidelines
### Concurrency & Buffer Size (Per Provider)
Configure these settings **per provider** based on the expected RPS for that specific provider:
| Provider RPS | Concurrency | Buffer Size |
|--------------|-------------|-------------|
| 100 | 100 | 150 |
| 500 | 500 | 750 |
| 1000 | 1000 | 1500 |
| 2500 | 2500 | 3750 |
| 5000 | 5000 | 7500 |
| 10000 | 10000 | 15000 |
<Info>
**Example:** If you expect 2000 RPS to OpenAI and 500 RPS to Anthropic, configure OpenAI with `concurrency: 2000, buffer_size: 3000` and Anthropic with `concurrency: 500, buffer_size: 750`.
</Info>
**Formula:**
```
concurrency = expected_rps
buffer_size = 1.5 × expected_rps
```
This ratio ensures:
- Enough queue capacity to absorb traffic bursts
- Workers are never starved for work
- Backpressure is applied before memory exhaustion
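
As a worked example, the formula can be wrapped in a small helper. The `providerSizing` type and `sizeForRPS` function are hypothetical names used only for illustration. Rounding the buffer up for odd RPS values is a choice made here, not something the formula specifies.

```go
package main

import "fmt"

// providerSizing holds the derived per-provider settings.
// Hypothetical type, not part of the Bifrost SDK.
type providerSizing struct {
	Concurrency int
	BufferSize  int
}

// sizeForRPS applies concurrency = rps and buffer_size = 1.5 × rps.
// The buffer is rounded up so it never falls below 1.5 × rps.
func sizeForRPS(expectedRPS int) providerSizing {
	return providerSizing{
		Concurrency: expectedRPS,
		BufferSize:  (expectedRPS*3 + 1) / 2,
	}
}

func main() {
	openai := sizeForRPS(2000)
	anthropic := sizeForRPS(500)
	fmt.Println(openai.Concurrency, openai.BufferSize)       // 2000 3000
	fmt.Println(anthropic.Concurrency, anthropic.BufferSize) // 500 750
}
```

The printed values match the OpenAI and Anthropic example given earlier in this section.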
### Initial Pool Size (Global)
Configure this setting based on **total RPS across all providers combined**:
| Total RPS (All Providers) | Initial Pool Size | Memory Estimate |
|---------------------------|-------------------|-----------------|
| 100 | 150 | ~50 MB |
| 500 | 750 | ~100 MB |
| 1000 | 1500 | ~200 MB |
| 2500 | 3750 | ~400 MB |
| 5000 | 7500 | ~800 MB |
| 10000 | 15000 | ~1.5 GB |
<Note>
Memory estimates are approximate and vary based on request/response sizes, number of providers, and plugins. Monitor actual memory usage in your environment.
</Note>
**Formula:**
```
initial_pool_size = 1.5 × total_expected_rps
```
Additionally, ensure:
```
initial_pool_size >= max(buffer_size across all providers)
```
This ensures pools are pre-warmed to handle peak queue depths without runtime allocations.
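
The two rules can be combined into one calculation: take 1.5 × total RPS, then raise the result to the largest provider buffer if that is higher. The helper below is a sketch; `initialPoolSize` is a hypothetical function name, not a Bifrost API.

```go
package main

import "fmt"

// initialPoolSize applies initial_pool_size = 1.5 × total RPS (rounded up),
// then enforces initial_pool_size >= max(buffer_size across all providers).
// Hypothetical helper, not part of the Bifrost SDK.
func initialPoolSize(totalRPS int, bufferSizes []int) int {
	size := (totalRPS*3 + 1) / 2
	for _, b := range bufferSizes {
		if b > size {
			size = b
		}
	}
	return size
}

func main() {
	// 2000 RPS to one provider plus 500 RPS to another: 2500 RPS in total,
	// with per-provider buffers of 3000 and 750.
	fmt.Println(initialPoolSize(2500, []int{3000, 750})) // 3750
}
```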
---
## Multi-Node Deployments
When running multiple Bifrost instances behind a load balancer, size each node for its share of the traffic: compute the settings for your total expected RPS, then **divide them by the number of nodes**.
### Formula
```
Per-Node Concurrency = Total Concurrency / Number of Nodes
Per-Node Buffer Size = Total Buffer Size / Number of Nodes
Per-Node Initial Pool Size = Total Initial Pool Size / Number of Nodes
```
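
When the totals do not divide evenly, plain integer division rounds down and leaves the cluster slightly under the intended total. One option, shown here as a sketch, is to round up instead. `perNode` is a hypothetical helper, and treating an invalid node count as a single node is an assumption of this example.

```go
package main

import "fmt"

// perNode splits a cluster-wide value across nodes, rounding up so the
// per-node values never sum to less than the total.
// Hypothetical helper, not part of the Bifrost SDK.
func perNode(total, nodes int) int {
	if nodes <= 0 {
		return total // assumption: treat an invalid node count as one node
	}
	return (total + nodes - 1) / nodes
}

func main() {
	const nodes = 4
	fmt.Println(perNode(10000, nodes), perNode(15000, nodes)) // 2500 3750
}
```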
### Example: 10,000 RPS Across 4 Nodes
**Total capacity (aggregate across all 4 nodes):**
- Total: 10,000 RPS
- Per node: ~2,500 RPS
**Single node settings for 10,000 RPS (if running on one node):**
- Concurrency: 10000
- Buffer Size: 15000
- Initial Pool Size: 15000
**Per-node settings (4 nodes, 10,000 RPS total):**
| Parameter | Total (Aggregate) | Per Node (4 nodes) |
|-----------|-------------------|-------------------|
| Concurrency | 10000 | 2500 |
| Buffer Size | 15000 | 3750 |
| Initial Pool Size | 15000 | 3750 |
<Tabs>
<Tab title="Gateway (config.json)">
```json
{
  "config": {
    "initial_pool_size": 3750,
    "drop_excess_requests": false
  },
  "providers": {
    "openai": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 2500,
        "buffer_size": 3750
      }
    },
    "anthropic": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 2500,
        "buffer_size": 3750
      }
    }
  }
}
```
</Tab>
<Tab title="Go SDK">
```go
const numNodes = 4

func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
	// Total capacity divided by number of nodes
	// Total: 10,000 RPS across 4 nodes = 2,500 RPS per node
	return &schemas.ProviderConfig{
		NetworkConfig: schemas.DefaultNetworkConfig,
		ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
			Concurrency: 10000 / numNodes, // 2500 per node
			BufferSize:  15000 / numNodes, // 3750 per node
		},
	}, nil
}

// In main initialization
bifrostConfig := schemas.BifrostConfig{
	Account:         myAccount,
	InitialPoolSize: 15000 / numNodes, // 3750 per node
}
```
</Tab>
</Tabs>
<Tip>
**Kubernetes Horizontal Pod Autoscaling:** When using HPA, size the settings for your minimum replica count, since that is when each pod carries the largest share of traffic. As pods scale up, each pod handles a smaller portion of traffic. Consider using environment variables or ConfigMaps to adjust settings based on replica count.
</Tip>
---
## Provider-Specific Tuning
Different providers have different rate limits and latency characteristics. Tune each provider independently:
### Provider Rate Limit Considerations
| Provider | Typical Rate Limits | Recommended Concurrency | Notes |
|----------|---------------------|------------------------|-------|
| OpenAI | 500-10000 RPM (varies by tier) | 100-500 | Higher tiers support more concurrency |
| Anthropic | 1000-4000 RPM (varies by tier) | 50-200 | More conservative rate limits |
| Bedrock | Per-model limits | 100-300 | Check AWS quotas for your account |
| Azure OpenAI | Deployment-specific | 100-500 | Configure per-deployment |
| Vertex AI | Per-model quotas | 100-300 | Check GCP quotas |
| Groq | Very high throughput | 500-1000 | Designed for high concurrency |
| Ollama | Local resource bound | 10-50 | Limited by local GPU/CPU |
### Example: Mixed Provider Configuration
<Tabs>
<Tab title="Gateway (config.json)">
```json
{
  "providers": {
    "openai": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 200,
        "buffer_size": 1000
      }
    },
    "anthropic": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 100,
        "buffer_size": 500
      }
    },
    "groq": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 500,
        "buffer_size": 2500
      }
    },
    "ollama": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 20,
        "buffer_size": 100
      }
    }
  }
}
```
</Tab>
<Tab title="Go SDK">
```go
func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
	switch provider {
	case schemas.OpenAI:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 200,
				BufferSize:  1000,
			},
		}, nil
	case schemas.Anthropic:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 100,
				BufferSize:  500,
			},
		}, nil
	case schemas.Groq:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 500,
				BufferSize:  2500,
			},
		}, nil
	case schemas.Ollama:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 20,
				BufferSize:  100,
			},
		}, nil
	default:
		return &schemas.ProviderConfig{
			NetworkConfig:            schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.DefaultConcurrencyAndBufferSize,
		}, nil
	}
}
```
</Tab>
</Tabs>
---
## Queue Overflow Handling
When the provider queue reaches capacity, Bifrost's behavior is controlled by `drop_excess_requests`:
### Blocking Mode (Default)
```json
{
  "config": {
    "drop_excess_requests": false
  }
}
```
- New requests **wait** until queue space is available
- Ensures no requests are lost
- May increase latency during high load
- Suitable for critical workloads where every request matters
### Drop Mode
```json
{
  "config": {
    "drop_excess_requests": true
  }
}
```
- New requests are **immediately rejected** when the queue is full
- Returns error: `"request dropped: queue is full"`
- Maintains consistent latency for accepted requests
- Suitable for real-time applications where stale requests are useless
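
The difference between the two modes can be modeled on an ordinary Go buffered channel. This is a sketch of the general technique, not Bifrost's source: `enqueue`, `errQueueFull`, and the `string` request type are hypothetical, and only the error text is taken from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// errQueueFull reuses the error text quoted above. The variable itself is
// illustrative and not a Bifrost identifier.
var errQueueFull = errors.New("request dropped: queue is full")

// enqueue models the two overflow modes on a buffered channel.
// With dropExcess=false the send blocks until a slot is free.
// With dropExcess=true a full queue returns errQueueFull immediately.
func enqueue(queue chan<- string, req string, dropExcess bool) error {
	if !dropExcess {
		queue <- req // blocks while the queue is full
		return nil
	}
	select {
	case queue <- req:
		return nil
	default:
		return errQueueFull
	}
}

func main() {
	queue := make(chan string, 1) // capacity 1 to make the queue fill quickly
	fmt.Println(enqueue(queue, "first", true))  // <nil>
	fmt.Println(enqueue(queue, "second", true)) // request dropped: queue is full
}
```

In blocking mode the caller's goroutine waits inside `enqueue`, so callers typically want their own timeout or context cancellation around it.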
<Tip>
**Best Practice:** Use `drop_excess_requests: true` with buffer sizes at 1.5x concurrency for production workloads. This prevents memory exhaustion while still handling reasonable traffic bursts.
</Tip>
---
## Monitoring and Diagnostics
### Key Metrics to Monitor
| Metric | Healthy Range | Action if Exceeded |
|--------|---------------|-------------------|
| Queue depth | < 50% of buffer_size | Increase buffer or concurrency |
| Request latency (p99) | < 2x average | Check provider rate limits |
| Dropped requests | 0 | Increase buffer_size |
| Memory usage | Stable | Reduce pool/buffer sizes |
| Goroutine count | Stable | Check for goroutine leaks |
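
The queue-depth guideline in the table reduces to a simple ratio check. How Bifrost exposes queue depth is not covered here, so the sketch below models the queue as a plain Go channel and reads its depth with `len` and its capacity with `cap`. `queueUtilization` and `queueHealthy` are hypothetical helpers.

```go
package main

import "fmt"

// queueUtilization returns depth / bufferSize as a fraction in [0, 1].
// A non-positive bufferSize yields 0 to avoid dividing by zero.
func queueUtilization(depth, bufferSize int) float64 {
	if bufferSize <= 0 {
		return 0
	}
	return float64(depth) / float64(bufferSize)
}

// queueHealthy applies the "< 50% of buffer_size" guideline from the table.
func queueHealthy(depth, bufferSize int) bool {
	return queueUtilization(depth, bufferSize) < 0.5
}

func main() {
	queue := make(chan int, 4) // buffer_size = 4
	queue <- 1
	fmt.Println(queueHealthy(len(queue), cap(queue))) // true: 1 of 4 slots used
	queue <- 2
	queue <- 3
	fmt.Println(queueHealthy(len(queue), cap(queue))) // false: 3 of 4 slots used
}
```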
### Health Check Endpoint
The Gateway exposes health and metrics endpoints:
```bash
# Health check
curl http://localhost:8080/health
# Prometheus metrics
curl http://localhost:8080/metrics
```
---
## Best Practices Summary
<CardGroup cols={2}>
<Card title="Start Conservative" icon="shield">
Begin with lower values and scale up based on observed performance. Over-provisioning wastes resources.
</Card>
<Card title="Monitor Continuously" icon="chart-line">
Track queue depths, latencies, and error rates. Adjust settings based on real traffic patterns.
</Card>
<Card title="Match Provider Limits" icon="scale-balanced">
Don't set concurrency higher than provider rate limits allow. You'll just get rate-limited.
</Card>
<Card title="Plan for Bursts" icon="bolt">
Set buffer_size to 1.5x concurrency to handle traffic spikes without dropping requests.
</Card>
</CardGroup>
### Quick Reference
```
// Formula
concurrency       = expected_rps
buffer_size       = 1.5 × expected_rps
initial_pool_size = 1.5 × total_rps (across all providers)

// Example: 500 RPS per provider, 2 providers (1000 total RPS)
concurrency: 500, buffer_size: 750, initial_pool_size: 1500

// Example: 2000 RPS per provider, 3 providers (6000 total RPS)
concurrency: 2000, buffer_size: 3000, initial_pool_size: 9000

// Multi-node formula
per_node_value = total_value / number_of_nodes
```
---
## Related Documentation
- **[Provider Configuration](../quickstart/gateway/provider-configuration)** - Complete provider setup guide
- **[Custom Providers](./custom-providers)** - Creating custom provider integrations
- **[Deployment](../deployment-guides/)** - Production deployment guides