---
title: "Performance Tuning"
description: "Optimize Bifrost for high throughput with concurrency, buffer sizing, and memory pool configuration"
icon: "gauge-high"
---

## Overview

Bifrost provides three key performance configuration parameters that control throughput, memory usage, and request handling behavior:

| Parameter | Scope | Default | Description |
|-----------|-------|---------|-------------|
| **Concurrency** | Per Provider | 1000 | Number of worker goroutines processing requests simultaneously |
| **Buffer Size** | Per Provider | 5000 | Maximum requests that can be queued before blocking/dropping |
| **Initial Pool Size** | Global | 5000 | Pre-allocated objects in sync pools to reduce GC pressure |

<Info>
These defaults are suitable for most production deployments handling up to ~5000 RPS. For higher throughput or constrained environments, tuning these parameters can significantly improve performance.
</Info>

---

## Understanding the Parameters

### Concurrency (Per Provider)

**What it does:** Controls two aspects of provider performance:
1. **Worker Goroutines:** The number of goroutines that process requests for each provider. Each worker pulls requests from the provider's queue and executes them against the provider's API.
2. **Provider Pool Pre-warming:** Pre-allocates provider-specific response objects (e.g., `AnthropicMessageResponse`, `OpenAIResponse`) in sync pools to reduce allocations during request handling.

**Impact:**
- **Higher concurrency** = More parallel requests to the provider, higher throughput, more pre-allocated response objects
- **Lower concurrency** = Fewer parallel requests, lower resource usage, easier to stay within provider rate limits

**Default:** `1000` workers per provider

<Tabs>
<Tab title="Gateway (config.json)">

```json
{
  "providers": {
    "openai": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 100,
        "buffer_size": 500
      }
    }
  }
}
```

</Tab>
<Tab title="Go SDK">

```go
func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
	return &schemas.ProviderConfig{
		NetworkConfig: schemas.DefaultNetworkConfig,
		ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
			Concurrency: 100, // 100 concurrent workers
			BufferSize:  500, // 500 request queue capacity
		},
	}, nil
}
```

</Tab>
</Tabs>
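
To make the worker model above concrete, here is a minimal sketch of a fixed set of goroutines draining a buffered channel. It is not Bifrost's internal implementation; the `request` type and the `runWorkers` and `processAll` helpers are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// request is a hypothetical stand-in for a queued provider request.
type request struct{ id int }

// runWorkers starts `concurrency` goroutines that drain queue and call handle
// for each request. It returns once the queue is closed and fully drained.
func runWorkers(concurrency int, queue <-chan request, handle func(request)) {
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for req := range queue {
				handle(req)
			}
		}()
	}
	wg.Wait()
}

// processAll pushes n requests through a queue of capacity bufferSize and
// returns how many requests the workers handled.
func processAll(n, concurrency, bufferSize int) int {
	queue := make(chan request, bufferSize)
	var mu sync.Mutex
	handled := 0

	done := make(chan struct{})
	go func() {
		runWorkers(concurrency, queue, func(_ request) {
			mu.Lock()
			handled++
			mu.Unlock()
		})
		close(done)
	}()

	for i := 0; i < n; i++ {
		queue <- request{id: i} // blocks while the queue is full
	}
	close(queue)
	<-done
	return handled
}

func main() {
	fmt.Println(processAll(1000, 100, 500)) // prints 1000
}
```

In this sketch, raising `concurrency` adds goroutines that can call the provider in parallel, while `bufferSize` only changes how many requests can wait before the producer blocks.
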

### Buffer Size (Per Provider)

**What it does:** Sets the capacity of the buffered channel (queue) for each provider. Incoming requests are queued here before being picked up by workers.

**Impact:**
- **Larger buffer** = More requests can wait in the queue, so traffic bursts are absorbed without blocking or dropping
- **Smaller buffer** = Lower memory footprint, faster backpressure signals to clients

**Default:** `5000` requests per provider queue

**Queue Full Behavior:** Controlled by `drop_excess_requests`:
- `false` (default): New requests block until queue space is available
- `true`: New requests are immediately dropped with an error when the queue is full

<Warning>
**Constraint:** Buffer size must be greater than or equal to concurrency. If `concurrency > buffer_size`, provider setup will fail.
</Warning>
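
The constraint above can be checked before a configuration is deployed. The following is a small sketch of such a pre-flight check; the `sizing` type, the `validate` function, and the error message are assumptions made for this example and are not part of the Bifrost API.

```go
package main

import (
	"errors"
	"fmt"
)

// sizing mirrors the two per-provider values discussed above.
// It is a hypothetical type for this example only.
type sizing struct {
	Concurrency int
	BufferSize  int
}

// errBufferTooSmall is a hypothetical error; the message Bifrost itself
// reports when provider setup fails is not specified here.
var errBufferTooSmall = errors.New("concurrency must not exceed buffer_size")

// validate enforces the documented rule: buffer_size >= concurrency.
func validate(s sizing) error {
	if s.Concurrency > s.BufferSize {
		return errBufferTooSmall
	}
	return nil
}

func main() {
	fmt.Println(validate(sizing{Concurrency: 100, BufferSize: 500})) // prints <nil>
	fmt.Println(validate(sizing{Concurrency: 500, BufferSize: 100})) // prints the error
}
```
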

### Initial Pool Size (Global)

**What it does:** Controls the number of pre-allocated objects in Bifrost's internal sync pools at startup. These pools recycle objects to reduce garbage collection overhead.

**Pooled Objects:**
- Channel messages (request wrappers)
- Response channels
- Error channels
- Stream channels
- Plugin pipelines
- Request objects

**Impact:**
- **Higher initial pool** = Less GC pressure during high traffic, more consistent latency, higher initial memory usage
- **Lower initial pool** = Lower initial memory footprint, may cause more allocations under load

**Default:** `5000` objects per pool

<Tabs>
<Tab title="Gateway (config.json)">

```json
{
  "config": {
    "initial_pool_size": 10000,
    "drop_excess_requests": false
  }
}
```

</Tab>
<Tab title="Go SDK">

```go
bifrostConfig := schemas.BifrostConfig{
	Account:            myAccount,
	InitialPoolSize:    10000, // Pre-warm pools with 10,000 objects
	DropExcessRequests: false,
}

client, err := bifrost.Init(ctx, bifrostConfig)
```

</Tab>
</Tabs>
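
As a rough illustration of what pre-warming means, the sketch below fills a standard-library `sync.Pool` with objects at startup. It is not Bifrost's pool code: the `message` type, its 1 KiB buffer, and `newPrewarmedPool` are hypothetical. Note that `sync.Pool` may release pooled objects during garbage collection, so pre-warming reduces early allocations rather than guaranteeing zero allocations.

```go
package main

import (
	"fmt"
	"sync"
)

// message is a hypothetical pooled object standing in for a request wrapper.
type message struct{ payload []byte }

// newPrewarmedPool creates a pool and puts initialPoolSize freshly
// allocated objects into it so early requests can reuse them.
func newPrewarmedPool(initialPoolSize int) *sync.Pool {
	pool := &sync.Pool{
		New: func() interface{} {
			return &message{payload: make([]byte, 0, 1024)}
		},
	}
	for i := 0; i < initialPoolSize; i++ {
		pool.Put(pool.New())
	}
	return pool
}

func main() {
	pool := newPrewarmedPool(5000)

	// Borrow an object, reset it, use it, and hand it back.
	msg := pool.Get().(*message)
	msg.payload = append(msg.payload[:0], "hello"...)
	fmt.Println(len(msg.payload)) // prints 5
	pool.Put(msg)
}
```
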

---

## Sizing Guidelines

### Concurrency & Buffer Size (Per Provider)

Configure these settings **per provider** based on the expected RPS for that specific provider:

| Provider RPS | Concurrency | Buffer Size |
|--------------|-------------|-------------|
| 100 | 100 | 150 |
| 500 | 500 | 750 |
| 1000 | 1000 | 1500 |
| 2500 | 2500 | 3750 |
| 5000 | 5000 | 7500 |
| 10000 | 10000 | 15000 |

<Info>
**Example:** If you expect 2000 RPS to OpenAI and 500 RPS to Anthropic, configure OpenAI with `concurrency: 2000, buffer_size: 3000` and Anthropic with `concurrency: 500, buffer_size: 750`.
</Info>

**Formula:**
```
concurrency = expected_rps
buffer_size = 1.5 × expected_rps
```

This ratio ensures:
- Enough queue capacity to absorb traffic bursts
- Workers are never starved for work
- Backpressure is applied before memory exhaustion

### Initial Pool Size (Global)

Configure this setting based on **total RPS across all providers combined**:

| Total RPS (All Providers) | Initial Pool Size | Memory Estimate |
|---------------------------|-------------------|-----------------|
| 100 | 150 | ~50 MB |
| 500 | 750 | ~100 MB |
| 1000 | 1500 | ~200 MB |
| 2500 | 3750 | ~400 MB |
| 5000 | 7500 | ~800 MB |
| 10000 | 15000 | ~1.5 GB |

<Note>
Memory estimates are approximate and vary based on request/response sizes, number of providers, and plugins. Monitor actual memory usage in your environment.
</Note>

**Formula:**
```
initial_pool_size = 1.5 × total_expected_rps
```

Additionally, ensure:
```
initial_pool_size >= max(buffer_size across all providers)
```

This ensures pools are pre-warmed to handle peak queue depths without runtime allocations.
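
The sizing formulas in this section can be turned into a small calculator. The sketch below is one way to do that; `providerSizing`, `sizeProvider`, and `initialPoolSize` are hypothetical helper names, and rounding up on odd inputs is a choice made for this example rather than something the formulas specify.

```go
package main

import "fmt"

// providerSizing holds the two per-provider settings.
type providerSizing struct {
	Concurrency int
	BufferSize  int
}

// oneAndAHalf returns 1.5 × n, rounded up so odd inputs are not under-sized.
func oneAndAHalf(n int) int {
	return (n*3 + 1) / 2
}

// sizeProvider applies concurrency = rps and buffer_size = 1.5 × rps.
func sizeProvider(expectedRPS int) providerSizing {
	return providerSizing{
		Concurrency: expectedRPS,
		BufferSize:  oneAndAHalf(expectedRPS),
	}
}

// initialPoolSize applies initial_pool_size = 1.5 × total RPS, then raises
// the result to the largest provider buffer if that is bigger.
func initialPoolSize(totalRPS int, sizings ...providerSizing) int {
	size := oneAndAHalf(totalRPS)
	for _, s := range sizings {
		if s.BufferSize > size {
			size = s.BufferSize
		}
	}
	return size
}

func main() {
	openai := sizeProvider(2000)
	anthropic := sizeProvider(500)
	pool := initialPoolSize(2000+500, openai, anthropic)
	fmt.Println(openai, anthropic, pool) // prints {2000 3000} {500 750} 3750
}
```

The second check in `initialPoolSize` only changes the result when a provider's buffer has been set larger than the formula suggests, for example when the default `buffer_size` of 5000 is kept for a low-traffic provider.
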

---

## Multi-Node Deployments

When running multiple Bifrost instances behind a load balancer, size the settings for your total expected RPS, then **divide each value by the number of nodes** to get the per-node settings.

### Formula

```
Per-Node Concurrency = Total Concurrency / Number of Nodes
Per-Node Buffer Size = Total Buffer Size / Number of Nodes
Per-Node Initial Pool Size = Total Initial Pool Size / Number of Nodes
```

### Example: 10,000 RPS Across 4 Nodes

**Total capacity (aggregate across all 4 nodes):**
- Total RPS: 10,000
- Per-node RPS: ~2,500

**Single node settings for 10,000 RPS (if running on one node):**
- Concurrency: 10000
- Buffer Size: 15000
- Initial Pool Size: 15000

**Per-node settings (4 nodes, 10,000 RPS total):**

| Parameter | Total (Aggregate) | Per Node (4 nodes) |
|-----------|-------------------|-------------------|
| Concurrency | 10000 | 2500 |
| Buffer Size | 15000 | 3750 |
| Initial Pool Size | 15000 | 3750 |

<Tabs>
<Tab title="Gateway (config.json)">

```json
{
  "config": {
    "initial_pool_size": 3750,
    "drop_excess_requests": false
  },
  "providers": {
    "openai": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 2500,
        "buffer_size": 3750
      }
    },
    "anthropic": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 2500,
        "buffer_size": 3750
      }
    }
  }
}
```

</Tab>
<Tab title="Go SDK">

```go
const numNodes = 4

func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
	// Total capacity divided by number of nodes
	// Total: 10,000 RPS across 4 nodes = 2,500 RPS per node
	return &schemas.ProviderConfig{
		NetworkConfig: schemas.DefaultNetworkConfig,
		ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
			Concurrency: 10000 / numNodes, // 2500 per node
			BufferSize:  15000 / numNodes, // 3750 per node
		},
	}, nil
}

// In main initialization
bifrostConfig := schemas.BifrostConfig{
	Account:         myAccount,
	InitialPoolSize: 15000 / numNodes, // 3750 per node
}
```

</Tab>
</Tabs>

<Tip>
**Kubernetes Horizontal Pod Autoscaling:** When using HPA, configure settings for your minimum replica count, which is when each pod carries its largest share of traffic. As pods scale up, each pod handles a smaller portion of traffic. Consider using environment variables or ConfigMaps to dynamically adjust settings based on replica count.
</Tip>
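
One way to adjust settings by replica count is to read the count from the environment at startup and divide the aggregate values. The sketch below assumes a hypothetical environment variable named `BIFROST_REPLICAS`; Bifrost does not define this variable, so it would have to be injected by your own deployment manifests. Division rounds up so that the per-node values never sum to less than the aggregate.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// perNode divides an aggregate setting across nodes, rounding up.
// A node count below 1 is treated as a single node.
func perNode(total, nodes int) int {
	if nodes < 1 {
		nodes = 1
	}
	return (total + nodes - 1) / nodes
}

// replicaCount reads the replica count from the named environment variable
// and falls back to 1 when it is unset or not a positive integer.
func replicaCount(envVar string) int {
	n, err := strconv.Atoi(os.Getenv(envVar))
	if err != nil || n < 1 {
		return 1
	}
	return n
}

func main() {
	// BIFROST_REPLICAS is a hypothetical variable name for this example.
	nodes := replicaCount("BIFROST_REPLICAS")

	fmt.Println("concurrency per node:", perNode(10000, nodes))
	fmt.Println("buffer size per node:", perNode(15000, nodes))
	fmt.Println("initial pool size per node:", perNode(15000, nodes))
}
```

With `BIFROST_REPLICAS=4` the three values match the per-node column in the table above (2500, 3750, 3750).
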

---

## Provider-Specific Tuning

Different providers have different rate limits and latency characteristics. Tune each provider independently:

### Provider Rate Limit Considerations

| Provider | Typical Rate Limits | Recommended Concurrency | Notes |
|----------|---------------------|------------------------|-------|
| OpenAI | 500-10000 RPM (varies by tier) | 100-500 | Higher tiers support more concurrency |
| Anthropic | 1000-4000 RPM (varies by tier) | 50-200 | More conservative rate limits |
| Bedrock | Per-model limits | 100-300 | Check AWS quotas for your account |
| Azure OpenAI | Deployment-specific | 100-500 | Configure per-deployment |
| Vertex AI | Per-model quotas | 100-300 | Check GCP quotas |
| Groq | Very high throughput | 500-1000 | Designed for high concurrency |
| Ollama | Local resource bound | 10-50 | Limited by local GPU/CPU |

### Example: Mixed Provider Configuration

<Tabs>
<Tab title="Gateway (config.json)">

```json
{
  "providers": {
    "openai": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 200,
        "buffer_size": 1000
      }
    },
    "anthropic": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 100,
        "buffer_size": 500
      }
    },
    "groq": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 500,
        "buffer_size": 2500
      }
    },
    "ollama": {
      "keys": [...],
      "concurrency_and_buffer_size": {
        "concurrency": 20,
        "buffer_size": 100
      }
    }
  }
}
```

</Tab>
<Tab title="Go SDK">

```go
func (a *MyAccount) GetConfigForProvider(provider schemas.ModelProvider) (*schemas.ProviderConfig, error) {
	switch provider {
	case schemas.OpenAI:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 200,
				BufferSize:  1000,
			},
		}, nil
	case schemas.Anthropic:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 100,
				BufferSize:  500,
			},
		}, nil
	case schemas.Groq:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 500,
				BufferSize:  2500,
			},
		}, nil
	case schemas.Ollama:
		return &schemas.ProviderConfig{
			NetworkConfig: schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.ConcurrencyAndBufferSize{
				Concurrency: 20,
				BufferSize:  100,
			},
		}, nil
	default:
		return &schemas.ProviderConfig{
			NetworkConfig:            schemas.DefaultNetworkConfig,
			ConcurrencyAndBufferSize: schemas.DefaultConcurrencyAndBufferSize,
		}, nil
	}
}
```

</Tab>
</Tabs>
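
When a provider publishes a requests-per-minute limit, Little's law gives a rough upper bound on useful concurrency: in-flight requests ≈ requests per second × average latency in seconds. The sketch below applies that general rule; it is not a Bifrost feature, the `maxInFlight` helper is hypothetical, and the latency figures are placeholders you would replace with measurements from your own traffic.

```go
package main

import (
	"fmt"
	"math"
)

// maxInFlight estimates how many concurrent requests a rate limit can
// sustain, using Little's law: in-flight = (RPM / 60) × average latency.
// The result is rounded down and never less than 1.
func maxInFlight(rpm int, avgLatencySeconds float64) int {
	inFlight := float64(rpm) / 60.0 * avgLatencySeconds
	n := int(math.Floor(inFlight))
	if n < 1 {
		return 1
	}
	return n
}

func main() {
	// Assumed example inputs: 4000 RPM limit, 2 s average response time.
	fmt.Println(maxInFlight(4000, 2.0)) // prints 133

	// Assumed example inputs: 600 RPM limit, 1 s average response time.
	fmt.Println(maxInFlight(600, 1.0)) // prints 10
}
```

Setting concurrency above this estimate mostly produces rate-limit errors rather than extra throughput, which is consistent with the conservative ranges in the table above.
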

---

## Queue Overflow Handling

When the provider queue reaches capacity, Bifrost's behavior is controlled by `drop_excess_requests`:

### Blocking Mode (Default)

```json
{
  "config": {
    "drop_excess_requests": false
  }
}
```

- New requests **wait** until queue space is available
- Ensures no requests are lost
- May increase latency during high load
- Suitable for critical workloads where every request matters

### Drop Mode

```json
{
  "config": {
    "drop_excess_requests": true
  }
}
```

- New requests are **immediately rejected** when the queue is full
- Returns error: `"request dropped: queue is full"`
- Maintains consistent latency for accepted requests
- Suitable for real-time applications where stale requests are useless

<Tip>
**Best Practice:** Use `drop_excess_requests: true` with buffer sizes at 1.5x concurrency for production workloads. This prevents memory exhaustion while still handling reasonable traffic bursts.
</Tip>
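
The difference between the two modes maps naturally onto a Go channel send with or without a `select` default case. The sketch below illustrates that pattern only; it is not taken from Bifrost's source, and the `enqueue` helper is a hypothetical name. The error text reuses the message quoted above.

```go
package main

import (
	"errors"
	"fmt"
)

var errQueueFull = errors.New("request dropped: queue is full")

// enqueue places req on the queue. With dropExcess set, a full queue
// returns an error immediately; otherwise the call blocks until a worker
// frees a slot.
func enqueue(queue chan<- int, req int, dropExcess bool) error {
	if dropExcess {
		select {
		case queue <- req:
			return nil
		default:
			return errQueueFull
		}
	}
	queue <- req // blocking mode: wait for space
	return nil
}

func main() {
	queue := make(chan int, 2) // capacity 2, no workers draining it

	for i := 0; i < 3; i++ {
		// The first two sends succeed; the third finds the queue full.
		fmt.Println(enqueue(queue, i, true))
	}
}
```

In blocking mode the third call would wait indefinitely in this example because nothing drains the queue, which is why blocking mode trades latency for completeness.
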

---

## Monitoring and Diagnostics

### Key Metrics to Monitor

| Metric | Healthy Range | Action if Outside Range |
|--------|---------------|-------------------|
| Queue depth | < 50% of buffer_size | Increase buffer or concurrency |
| Request latency (p99) | < 2x average | Check provider rate limits |
| Dropped requests | 0 | Increase buffer_size |
| Memory usage | Stable | Reduce pool/buffer sizes |
| Goroutine count | Stable | Check for goroutine leaks |

### Health Check Endpoint

The Gateway exposes health and metrics endpoints:

```bash
# Health check
curl http://localhost:8080/health

# Prometheus metrics
curl http://localhost:8080/metrics
```
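
The queue-depth guideline from the metrics table (below 50% of `buffer_size`) can be expressed as a simple ratio check. The sketch below measures a plain Go channel with `len` and `cap`; how Bifrost reports queue depth is not covered here, and `queueUtilization` and `needsAttention` are hypothetical helpers.

```go
package main

import "fmt"

// queueUtilization returns the fraction of the queue's capacity in use.
// An unbuffered channel reports 0.
func queueUtilization(queue chan int) float64 {
	if cap(queue) == 0 {
		return 0
	}
	return float64(len(queue)) / float64(cap(queue))
}

// needsAttention reports whether the queue is at or above the 50%
// threshold suggested in the metrics table.
func needsAttention(queue chan int) bool {
	return queueUtilization(queue) >= 0.5
}

func main() {
	queue := make(chan int, 10)
	for i := 0; i < 6; i++ {
		queue <- i
	}
	fmt.Printf("%.2f %v\n", queueUtilization(queue), needsAttention(queue)) // prints 0.60 true
}
```
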

---

## Best Practices Summary

<CardGroup cols={2}>
<Card title="Start Conservative" icon="shield">
Begin with lower values and scale up based on observed performance. Over-provisioning wastes resources.
</Card>
<Card title="Monitor Continuously" icon="chart-line">
Track queue depths, latencies, and error rates. Adjust settings based on real traffic patterns.
</Card>
<Card title="Match Provider Limits" icon="scale-balanced">
Don't set concurrency higher than provider rate limits allow. You'll just get rate-limited.
</Card>
<Card title="Plan for Bursts" icon="bolt">
Set buffer_size to 1.5x concurrency to handle traffic spikes without dropping requests.
</Card>
</CardGroup>

### Quick Reference

```
// Formula
concurrency = expected_rps
buffer_size = 1.5 × expected_rps
initial_pool_size = 1.5 × total_rps (across all providers)

// Example: 500 RPS per provider, 2 providers (1000 total RPS)
concurrency: 500, buffer_size: 750, initial_pool_size: 1500

// Example: 2000 RPS per provider, 3 providers (6000 total RPS)
concurrency: 2000, buffer_size: 3000, initial_pool_size: 9000

// Multi-node formula
per_node_value = total_value / number_of_nodes
```

---

## Related Documentation

- **[Provider Configuration](../quickstart/gateway/provider-configuration)** - Complete provider setup guide
- **[Custom Providers](./custom-providers)** - Creating custom provider integrations
- **[Deployment](../deployment-guides/)** - Production deployment guides