--- title: "Clustering" description: "Enterprise-grade high-availability clustering with automatic service discovery, intelligent traffic distribution, and gossip-based state synchronization for production deployments." icon: "circle-nodes" --- ## Overview **Bifrost Clustering** delivers production-ready high availability through a peer-to-peer network architecture with automatic service discovery. The clustering system uses gossip protocols to maintain consistent state across nodes while providing seamless scaling, automatic failover, and zero-downtime deployments. ### Why Clustering Matters Modern AI gateway deployments require robust infrastructure to handle production workloads: | Challenge | Impact | Clustering Solution | |-----------|--------|-------------------| | **Single Point of Failure** | Complete service outage if gateway fails | Distributed architecture with automatic failover | | **Traffic Spikes** | Performance degradation under high load | Dynamic load distribution across multiple nodes | | **Provider Rate Limits** | Request throttling and service interruption | Distributed rate limit tracking across cluster | | **Regional Latency** | Poor user experience in distant regions | Geographic distribution with local processing | | **Maintenance Windows** | Service downtime during updates | Rolling updates with zero-downtime deployment | | **Capacity Planning** | Over/under-provisioning resources | Elastic scaling based on real-time demand | ### Core Features | Feature | Description | |---------|-------------| | **Automatic Service Discovery** | 6 discovery methods for any infrastructure (K8s, Consul, etcd, DNS, UDP, mDNS) | | **Peer-to-Peer Architecture** | No single point of failure with equal node participation | | **Gossip-Based State Sync** | Real-time synchronization of traffic patterns and limits | | **Automatic Failover** | Seamless traffic redistribution when nodes fail | | **Zero-Downtime Updates** | Rolling deployments without service interruption | --- ## Architecture ### Peer-to-Peer Network Design Bifrost clustering uses a **peer-to-peer (P2P) network** where all nodes are equal participants. Each node: - Discovers peers automatically using configured discovery method - Synchronizes state via gossip protocol - Shares traffic patterns and rate limits - Handles failover automatically ### Gossip Protocol The gossip protocol ensures all nodes maintain consistent views of: - **Traffic Patterns**: Request volume, latency metrics, error rates - **Rate Limit States**: Current usage counters for each provider/model - **Node Health**: CPU, memory, network status of all peers - **Configuration Changes**: Provider updates, routing rules, policies **Convergence**: All nodes converge to the same state within seconds with eventual consistency guarantees. ### Minimum Node Requirements **Recommended: 3+ nodes minimum** for optimal fault tolerance. | Cluster Size | Fault Tolerance | Use Case | |--------------|-----------------|----------| | **3 nodes** | 1 node failure | Small production deployments | | **5 nodes** | 2 node failures | Medium production deployments | | **7+ nodes** | 3+ node failures | Large enterprise deployments | --- ## Configuration Basics ### Core Configuration Structure The new clustering configuration uses a `cluster_config` object with integrated service discovery: ```json { "cluster_config": { "enabled": true, "discovery": { "enabled": true, "type": "kubernetes", "service_name": "bifrost-cluster" }, "gossip": { "port": 10101, "config": { "timeout_seconds": 10, "success_threshold": 3, "failure_threshold": 3 } } } } ``` Discovery-specific fields (e.g. `k8s_label_selector`, `consul_address`, `etcd_endpoints`) slot into the `discovery` object alongside `type` — see each method's section below. At startup, cluster mode requires either a non-empty `peers` list or `discovery.enabled: true`. ### Common Discovery Configuration Fields All discovery methods support these common fields: | Field | Type | Required | Description | |-------|------|----------|-------------| | `enabled` | boolean | No | Enable/disable discovery (must be `true` to use discovery at runtime) | | `type` | string | Yes | Discovery type: `kubernetes`, `consul`, `etcd`, `dns`, `udp`, `mdns` | | `service_name` | string | Conditional | Required for `consul`, `etcd`, `udp`, and typically `mdns`; optional for `kubernetes` and `dns` | | `bind_port` | integer | No | Port for cluster communication (default: 10101) | | `dial_timeout` | duration | No | Discovery timeout (default: 10s) | | `allowed_address_space` | array | No | CIDR ranges to filter discovered nodes (e.g., `["10.0.0.0/8"]`) | ### Gossip Configuration | Field | Description | Default | |-------|-------------|---------| | `port` | Gossip protocol port | 10101 | | `timeout_seconds` | Health check timeout | 10 | | `success_threshold` | Successful checks to mark healthy | 3 | | `failure_threshold` | Failed checks to mark unhealthy | 3 | --- ## Service Discovery Methods Bifrost supports 6 service discovery methods to fit any infrastructure. Choose based on your deployment environment: Native K8s pod discovery via label selectors HashiCorp Consul service mesh integration etcd-based distributed discovery Traditional DNS SRV record discovery Local network broadcast discovery Multicast DNS for local development --- ## Kubernetes Discovery **Best for:** Kubernetes deployments with StatefulSets or Deployments Kubernetes discovery uses the K8s API to automatically discover pods based on label selectors. This is the most common method for cloud-native deployments. ### How It Works 1. Each Bifrost pod queries the Kubernetes API for pods matching the label selector 2. Discovers pod IPs automatically as pods scale up/down 3. Works seamlessly with StatefulSets, Deployments, and DaemonSets 4. No external dependencies required ### Configuration ```json { "cluster_config": { "enabled": true, "discovery": { "enabled": true, "type": "kubernetes", "service_name": "bifrost-cluster", "k8s_namespace": "default", "k8s_label_selector": "app=bifrost" }, "gossip": { "port": 10101, "config": { "timeout_seconds": 10, "success_threshold": 3, "failure_threshold": 3 } } } } ``` ### Configuration Parameters | Parameter | Required | Description | Example | |-----------|----------|-------------|---------| | `k8s_namespace` | No | Kubernetes namespace to search | `"default"`, `"production"` | | `k8s_label_selector` | Yes | Label selector for pod discovery | `"app=bifrost"`, `"app=bifrost,env=prod"` | ### Kubernetes Deployment Example ```yaml apiVersion: apps/v1 kind: StatefulSet metadata: name: bifrost namespace: default spec: serviceName: bifrost-cluster replicas: 3 selector: matchLabels: app: bifrost template: metadata: labels: app: bifrost spec: serviceAccountName: bifrost containers: - name: bifrost image: /bifrost:latest ports: - containerPort: 8080 name: http - containerPort: 10101 name: gossip volumeMounts: - name: config mountPath: /etc/bifrost volumes: - name: config configMap: name: bifrost-config --- apiVersion: v1 kind: ServiceAccount metadata: name: bifrost namespace: default --- apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata: name: bifrost-pod-reader namespace: default rules: - apiGroups: [""] resources: ["pods"] verbs: ["get", "list", "watch"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: name: bifrost-pod-reader namespace: default subjects: - kind: ServiceAccount name: bifrost namespace: default roleRef: kind: Role name: bifrost-pod-reader apiGroup: rbac.authorization.k8s.io ``` ```yaml apiVersion: apps/v1 kind: Deployment metadata: name: bifrost namespace: default spec: replicas: 3 selector: matchLabels: app: bifrost template: metadata: labels: app: bifrost spec: serviceAccountName: bifrost containers: - name: bifrost image: /bifrost:latest ports: - containerPort: 8080 name: http - containerPort: 10101 name: gossip volumeMounts: - name: config mountPath: /etc/bifrost volumes: - name: config configMap: name: bifrost-config --- apiVersion: v1 kind: Service metadata: name: bifrost-cluster namespace: default spec: clusterIP: None selector: app: bifrost ports: - port: 10101 name: gossip ``` ### Troubleshooting **Symptoms**: Cluster shows only 1 member, pods running in isolation **Solutions**: - Verify ServiceAccount has RBAC permissions to list pods - Check label selector matches pod labels exactly - Ensure namespace is correct (defaults to "default") - Verify gossip port (10101) is not blocked by NetworkPolicies - Check logs for "error listing pods" messages **Symptoms**: "error getting kubernetes config" or "forbidden" errors **Solutions**: - Create ServiceAccount for Bifrost pods - Create Role with `get`, `list`, `watch` permissions on pods - Create RoleBinding linking ServiceAccount to Role - Verify RBAC is enabled in cluster **Symptoms**: Nodes discovered but marked as "suspect" or "dead" **Solutions**: - Verify gossip port (10101) is accessible between pods - Check for NetworkPolicies blocking pod-to-pod communication - Increase `timeout_seconds` in gossip config if network is slow - Verify pods are in Running state with `kubectl get pods` --- ## Consul Discovery **Best for:** Consul service mesh environments, multi-datacenter deployments Consul discovery integrates with HashiCorp Consul for service registration and discovery. Ideal for environments already using Consul for service mesh or service discovery. ### How It Works 1. Each Bifrost node registers itself with Consul on startup 2. Nodes query Consul to discover other Bifrost instances 3. Consul performs health checks on each node 4. Unhealthy nodes are automatically deregistered 5. Supports multi-datacenter deployments ### Configuration ```json { "cluster_config": { "enabled": true, "discovery": { "enabled": true, "type": "consul", "service_name": "bifrost-cluster", "consul_address": "consul.service.consul:8500" }, "gossip": { "port": 10101, "config": { "timeout_seconds": 10, "success_threshold": 3, "failure_threshold": 3 } } } } ``` ### Configuration Parameters | Parameter | Required | Description | Example | |-----------|----------|-------------|---------| | `consul_address` | No | Consul agent address | `"localhost:8500"`, `"consul.service.consul:8500"` (default: `localhost:8500`) | Consul discovery automatically registers each node with health checks. The health check monitors the gossip port (TCP). ### Docker Compose with Consul ```yaml version: '3.8' services: consul: image: hashicorp/consul:latest command: agent -dev -client=0.0.0.0 ports: - "8500:8500" networks: - bifrost-net bifrost-1: image: /bifrost:latest environment: - BIFROST_CONFIG=/etc/bifrost/config.json volumes: - ./config-node1.json:/etc/bifrost/config.json ports: - "8080:8080" depends_on: - consul networks: - bifrost-net bifrost-2: image: /bifrost:latest environment: - BIFROST_CONFIG=/etc/bifrost/config.json volumes: - ./config-node2.json:/etc/bifrost/config.json ports: - "8081:8080" depends_on: - consul networks: - bifrost-net bifrost-3: image: /bifrost:latest environment: - BIFROST_CONFIG=/etc/bifrost/config.json volumes: - ./config-node3.json:/etc/bifrost/config.json ports: - "8082:8080" depends_on: - consul networks: - bifrost-net networks: bifrost-net: driver: bridge ``` ### Troubleshooting **Symptoms**: "failed to register service with Consul" errors **Solutions**: - Verify Consul agent is accessible at configured address - Check Consul agent logs for registration errors - Ensure Consul ACL token has write permissions if ACLs enabled - Verify network connectivity between Bifrost and Consul - Check firewall rules allow connections to port 8500 **Symptoms**: Consul UI shows services but nodes don't join cluster **Solutions**: - Verify `service_name` matches across all nodes - Check Consul service health checks are passing - Ensure gossip port is accessible between nodes - Verify nodes are registered in correct datacenter - Check for DNS resolution issues if using service DNS names **Symptoms**: Services show as critical in Consul UI **Solutions**: - Verify gossip port (10101) is accessible - Check Consul agent can reach node's gossip port - Increase health check timeout in Consul if needed - Review Bifrost logs for startup errors - Ensure nodes have correct IP addresses registered --- ## etcd Discovery **Best for:** etcd-based distributed systems, existing etcd infrastructure etcd discovery uses etcd's distributed key-value store for service registration and discovery. Perfect for environments already using etcd or requiring strong consistency. ### How It Works 1. Each Bifrost node registers itself in etcd with a lease 2. Nodes maintain lease through keepalive messages 3. Nodes query etcd prefix to discover other instances 4. Failed nodes' leases expire and are automatically removed 5. Provides strongly consistent service registry ### Configuration ```json { "cluster_config": { "enabled": true, "discovery": { "enabled": true, "type": "etcd", "service_name": "bifrost-cluster", "etcd_endpoints": [ "http://etcd-1:2379", "http://etcd-2:2379", "http://etcd-3:2379" ], "dial_timeout": "10s" }, "gossip": { "port": 10101, "config": { "timeout_seconds": 10, "success_threshold": 3, "failure_threshold": 3 } } } } ``` ### Configuration Parameters | Parameter | Required | Description | Example | |-----------|----------|-------------|---------| | `etcd_endpoints` | Yes | Array of etcd endpoint URLs | `["http://localhost:2379"]`, `["https://etcd1:2379", "https://etcd2:2379"]` | | `dial_timeout` | No | Connection timeout | `"10s"` (default), `"30s"` | Each node registers under `/services/{service_name}/{node_id}` with a 30-second TTL lease. ### Docker Compose with etcd ```yaml version: '3.8' services: etcd: image: quay.io/coreos/etcd:latest command: - etcd - --advertise-client-urls=http://etcd:2379 - --listen-client-urls=http://0.0.0.0:2379 - --listen-peer-urls=http://0.0.0.0:2380 - --initial-cluster=etcd=http://etcd:2380 - --initial-advertise-peer-urls=http://etcd:2380 ports: - "2379:2379" - "2380:2380" networks: - bifrost-net bifrost-1: image: /bifrost:latest environment: - BIFROST_CONFIG=/etc/bifrost/config.json volumes: - ./config.json:/etc/bifrost/config.json ports: - "8080:8080" depends_on: - etcd networks: - bifrost-net bifrost-2: image: /bifrost:latest environment: - BIFROST_CONFIG=/etc/bifrost/config.json volumes: - ./config.json:/etc/bifrost/config.json ports: - "8081:8080" depends_on: - etcd networks: - bifrost-net bifrost-3: image: /bifrost:latest environment: - BIFROST_CONFIG=/etc/bifrost/config.json volumes: - ./config.json:/etc/bifrost/config.json ports: - "8082:8080" depends_on: - etcd networks: - bifrost-net networks: bifrost-net: driver: bridge ``` ### Troubleshooting **Symptoms**: "etcd client error" on startup **Solutions**: - Verify etcd endpoints are accessible - Check URL format (http:// or https://) - Ensure etcd cluster is healthy and running - Verify network connectivity to etcd endpoints - Check firewall rules allow connections to port 2379 - Increase `dial_timeout` if network is slow **Symptoms**: "failed to register with etcd" errors **Solutions**: - Verify etcd cluster is accepting writes - Check etcd cluster has available space - Ensure authentication credentials if etcd has auth enabled - Review etcd logs for permission or quota errors - Verify node can resolve etcd hostnames **Symptoms**: Nodes repeatedly registering/deregistering **Solutions**: - Check network stability between nodes and etcd - Verify etcd cluster is not overloaded - Monitor etcd metrics for high latency - Increase lease TTL if network has high latency - Check for etcd leader election issues --- ## DNS Discovery **Best for:** Traditional infrastructure, static node addresses, cloud DNS services DNS discovery uses standard DNS resolution to discover cluster nodes. Works with any DNS server and is ideal for static deployments or cloud environments with DNS integration. ### How It Works 1. Configure DNS A records or SRV records for cluster nodes 2. Bifrost queries DNS to resolve configured names 3. All returned IP addresses are treated as potential cluster members 4. Supports multiple DNS names for different node groups 5. Works with internal DNS, cloud DNS, or public DNS ### Configuration ```json { "cluster_config": { "enabled": true, "discovery": { "enabled": true, "type": "dns", "service_name": "bifrost-cluster", "dns_names": [ "bifrost-cluster.local", "bifrost-nodes.internal.company.com" ], "bind_port": 10101 }, "gossip": { "port": 10101, "config": { "timeout_seconds": 10, "success_threshold": 3, "failure_threshold": 3 } } } } ``` ### Configuration Parameters | Parameter | Required | Description | Example | |-----------|----------|-------------|---------| | `dns_names` | Yes | Array of DNS names to resolve | `["bifrost.local"]`, `["node1.local", "node2.local", "node3.local"]` | | `bind_port` | No | Port appended to discovered IPs | `10101` (default) | DNS discovery is passive - it doesn't register nodes. You must manage DNS records externally (via DNS server, cloud DNS, or Kubernetes DNS). ### Setup Examples ```bash # Create A records for each node aws route53 change-resource-record-sets \ --hosted-zone-id Z1234567890ABC \ --change-batch '{ "Changes": [{ "Action": "CREATE", "ResourceRecordSet": { "Name": "bifrost-cluster.internal.company.com", "Type": "A", "TTL": 60, "ResourceRecords": [ {"Value": "10.0.1.10"}, {"Value": "10.0.1.11"}, {"Value": "10.0.1.12"} ] } }] }' ``` ```yaml apiVersion: v1 kind: Service metadata: name: bifrost-cluster namespace: default spec: clusterIP: None # Headless service selector: app: bifrost ports: - port: 10101 name: gossip --- # DNS will resolve bifrost-cluster.default.svc.cluster.local # to all pod IPs matching the selector ``` ```bash # /etc/dnsmasq.conf address=/bifrost-cluster.local/192.168.1.10 address=/bifrost-cluster.local/192.168.1.11 address=/bifrost-cluster.local/192.168.1.12 # Or use /etc/hosts on each node echo "192.168.1.10 node1.bifrost.local" >> /etc/hosts echo "192.168.1.11 node2.bifrost.local" >> /etc/hosts echo "192.168.1.12 node3.bifrost.local" >> /etc/hosts ``` ### Troubleshooting **Symptoms**: "dns lookup error" in logs, no nodes discovered **Solutions**: - Verify DNS names are resolvable: `nslookup bifrost-cluster.local` - Check DNS server is accessible from Bifrost nodes - Verify `/etc/resolv.conf` has correct nameserver - Test DNS resolution from inside container if using Docker - Check for DNS caching issues (try flushing DNS cache) **Symptoms**: DNS resolves but cluster has 0 members **Solutions**: - Verify DNS returns multiple A records (not CNAME) - Check that returned IPs are correct and reachable - Ensure `bind_port` matches actual gossip port on nodes - Verify nodes are listening on returned IP addresses - Use `dig` or `nslookup` to verify DNS response format **Symptoms**: IPs discovered but gossip connection fails **Solutions**: - Verify gossip port (10101) is open on all nodes - Check firewall rules between nodes - Ensure nodes are listening on correct network interface - Verify IP addresses match node's actual network addresses - Test connectivity: `telnet 10101` --- ## UDP Broadcast Discovery **Best for:** Local network deployments, on-premise infrastructure, development clusters UDP broadcast discovery automatically finds nodes on the same local network using broadcast packets. No external dependencies required. ### How It Works 1. Nodes broadcast UDP discovery beacons on configured port 2. Other nodes on the same network respond with acknowledgments 3. Nodes discover each other's IP addresses automatically 4. Limited to nodes on the same broadcast domain (subnet) 5. Requires `allowed_address_space` for security ### Configuration ```json { "cluster_config": { "enabled": true, "discovery": { "enabled": true, "type": "udp", "service_name": "bifrost-cluster", "udp_broadcast_port": 9999, "allowed_address_space": [ "192.168.1.0/24", "10.0.0.0/8" ], "dial_timeout": "10s" }, "gossip": { "port": 10101, "config": { "timeout_seconds": 10, "success_threshold": 3, "failure_threshold": 3 } } } } ``` ### Configuration Parameters | Parameter | Required | Description | Example | |-----------|----------|-------------|---------| | `udp_broadcast_port` | Yes | Port for broadcast discovery | `9999`, `8888` | | `allowed_address_space` | Yes | CIDR ranges to limit discovery scope | `["192.168.1.0/24"]`, `["10.0.0.0/8", "172.16.0.0/12"]` | | `dial_timeout` | No | Time to wait for responses | `"10s"` (default) | UDP broadcast discovery requires `allowed_address_space` to be configured. This prevents scanning arbitrary networks and limits discovery to trusted subnets. ### Docker Compose Example ```yaml version: '3.8' services: bifrost-1: image: /bifrost:latest network_mode: bridge environment: - BIFROST_CONFIG=/etc/bifrost/config.json volumes: - ./config.json:/etc/bifrost/config.json ports: - "8080:8080" - "9999:9999/udp" - "10101:10101" bifrost-2: image: /bifrost:latest network_mode: bridge environment: - BIFROST_CONFIG=/etc/bifrost/config.json volumes: - ./config.json:/etc/bifrost/config.json ports: - "8081:8080" - "9999:9999/udp" - "10101:10101" bifrost-3: image: /bifrost:latest network_mode: bridge environment: - BIFROST_CONFIG=/etc/bifrost/config.json volumes: - ./config.json:/etc/bifrost/config.json ports: - "8082:8080" - "9999:9999/udp" - "10101:10101" ``` Use `network_mode: bridge` (default) or `host` for UDP broadcast. Custom networks may not support broadcast. ### Troubleshooting **Symptoms**: Discovery runs but finds 0 nodes **Solutions**: - Verify `allowed_address_space` includes node IP addresses - Check UDP broadcast port is open (firewall/security groups) - Ensure nodes are on same subnet/broadcast domain - Verify broadcast is enabled on network interface - Test with `tcpdump -i any -n udp port 9999` - Check Docker network mode supports broadcast (use bridge or host) **Symptoms**: "not in allowed address space" warnings **Solutions**: - Verify CIDR notation is correct (e.g., `192.168.1.0/24`) - Ensure `allowed_address_space` covers all node IPs - Check node IP addresses: `ip addr` or `ifconfig` - Remember to use network address, not host address - Test CIDR match online or with ipcalc **Symptoms**: "permission denied" or "address already in use" **Solutions**: - Check if another process is using the UDP broadcast port - Verify port number is > 1024 (non-privileged) or run as root - Use `netstat -tulpn | grep 9999` to check port usage - Change `udp_broadcast_port` to different value - Ensure firewall isn't blocking UDP on that port --- ## mDNS Discovery **Best for:** Local development, testing, zero-configuration setups mDNS (Multicast DNS) provides zero-configuration service discovery on local networks. Perfect for development and testing without requiring any infrastructure setup. ### How It Works 1. Nodes advertise themselves via mDNS (Bonjour/Avahi) 2. Other nodes browse for mDNS services 3. Automatic discovery within the same local network 4. No DNS server or configuration required 5. Limited to local network segment ### Configuration ```json { "cluster_config": { "enabled": true, "discovery": { "enabled": true, "type": "mdns", "service_name": "bifrost", "mdns_service": "_bifrost._tcp", "dial_timeout": "10s" }, "gossip": { "port": 10101, "config": { "timeout_seconds": 10, "success_threshold": 3, "failure_threshold": 3 } } } } ``` ### Configuration Parameters | Parameter | Required | Description | Example | |-----------|----------|-------------|---------| | `mdns_service` | No | mDNS service type | `"_bifrost._tcp"` (default), `"_myapp._tcp"` | | `dial_timeout` | No | Time to wait for mDNS responses | `"10s"` (default) | mDNS is designed for development and testing. For production, use Kubernetes, Consul, or etcd discovery. ### Local Development Example ```bash # Start first node docker run -p 8080:8080 -p 10101:10101 \ -v $(pwd)/config-mdns.json:/etc/bifrost/config.json \ /bifrost:latest # Start second node (discovers first automatically) docker run -p 8081:8080 -p 10102:10101 \ -v $(pwd)/config-mdns.json:/etc/bifrost/config.json \ /bifrost:latest # Start third node (discovers both automatically) docker run -p 8082:8080 -p 10103:10101 \ -v $(pwd)/config-mdns.json:/etc/bifrost/config.json \ /bifrost:latest ``` ### Troubleshooting **Symptoms**: Nodes don't discover each other via mDNS **Solutions**: - Verify mDNS is enabled on network (check firewall) - Ensure multicast is enabled on network interface - Check nodes are on same local network segment - Verify mDNS port 5353 is not blocked - Test mDNS resolution: `avahi-browse -a` (Linux) or `dns-sd -B` (macOS) - Increase `dial_timeout` if discovery is slow **Symptoms**: "skipping invalid host address" warnings **Solutions**: - This is normal - mDNS returns network/broadcast addresses - mDNS automatically filters invalid addresses (127.x.x.x, *.0, *.255) - Check that nodes have valid non-loopback IP addresses - Ensure nodes are not using 127.0.0.1 for binding - Verify network interface has proper IP configuration **Symptoms**: Nodes discover then disconnect repeatedly **Solutions**: - mDNS has eventual consistency, allow time for propagation - Check gossip port accessibility between nodes - Verify network doesn't drop multicast packets - Consider using a more robust discovery method for production - Check for network congestion or packet loss --- ## Deployment Patterns ### Docker Compose Deployment Complete example using Kubernetes-style discovery with a shared config store: ```yaml version: '3.8' services: postgres: image: postgres:14 environment: POSTGRES_DB: bifrost POSTGRES_USER: bifrost POSTGRES_PASSWORD: bifrost_password volumes: - postgres_data:/var/lib/postgresql/data networks: - bifrost-net consul: image: hashicorp/consul:latest command: agent -dev -client=0.0.0.0 ports: - "8500:8500" networks: - bifrost-net bifrost-1: image: /bifrost:latest environment: - BIFROST_CONFIG=/etc/bifrost/config.json volumes: - ./config.json:/etc/bifrost/config.json ports: - "8080:8080" depends_on: - postgres - consul networks: - bifrost-net bifrost-2: image: /bifrost:latest environment: - BIFROST_CONFIG=/etc/bifrost/config.json volumes: - ./config.json:/etc/bifrost/config.json ports: - "8081:8080" depends_on: - postgres - consul networks: - bifrost-net bifrost-3: image: /bifrost:latest environment: - BIFROST_CONFIG=/etc/bifrost/config.json volumes: - ./config.json:/etc/bifrost/config.json ports: - "8082:8080" depends_on: - postgres - consul networks: - bifrost-net nginx: image: nginx:alpine ports: - "80:80" volumes: - ./nginx.conf:/etc/nginx/nginx.conf:ro depends_on: - bifrost-1 - bifrost-2 - bifrost-3 networks: - bifrost-net volumes: postgres_data: networks: bifrost-net: driver: bridge ``` **nginx.conf** for load balancing: ```nginx events { worker_connections 1024; } http { upstream bifrost_cluster { least_conn; server bifrost-1:8080 max_fails=3 fail_timeout=30s; server bifrost-2:8080 max_fails=3 fail_timeout=30s; server bifrost-3:8080 max_fails=3 fail_timeout=30s; } server { listen 80; location / { proxy_pass http://bifrost_cluster; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header X-Forwarded-Proto $scheme; # Timeouts proxy_connect_timeout 60s; proxy_send_timeout 60s; proxy_read_timeout 60s; } location /health { access_log off; return 200 "healthy\n"; add_header Content-Type text/plain; } } } ``` ### Kubernetes Production Deployment Production-ready Kubernetes deployment with StatefulSet: If you use PostgreSQL for `config_store`, ensure the target database is UTF8 encoded. See [PostgreSQL UTF8 Requirement](../quickstart/gateway/setting-up#postgresql-utf8-requirement). ```yaml apiVersion: v1 kind: ConfigMap metadata: name: bifrost-config namespace: bifrost data: config.json: | { "cluster_config": { "enabled": true, "discovery": { "enabled": true, "type": "kubernetes", "service_name": "bifrost-cluster", "k8s_namespace": "bifrost", "k8s_label_selector": "app=bifrost,component=gateway" }, "gossip": { "port": 10101, "config": { "timeout_seconds": 10, "success_threshold": 3, "failure_threshold": 3 } } }, "config_store": { "enabled": true, "type": "postgres", "config": { "host": "postgres.bifrost.svc.cluster.local", "port": "5432", "user": "bifrost", "password": "changeme", "db_name": "bifrost", "ssl_mode": "require" } } } --- apiVersion: v1 kind: ServiceAccount metadata: name: bifrost namespace: bifrost --- apiVersion: rbac.authorization.k8s.io/v1 kind: Role metadata: name: bifrost-pod-reader namespace: bifrost rules: - apiGroups: [""] resources: ["pods"] verbs: ["get", "list", "watch"] --- apiVersion: rbac.authorization.k8s.io/v1 kind: RoleBinding metadata: name: bifrost-pod-reader namespace: bifrost subjects: - kind: ServiceAccount name: bifrost namespace: bifrost roleRef: kind: Role name: bifrost-pod-reader apiGroup: rbac.authorization.k8s.io --- apiVersion: apps/v1 kind: StatefulSet metadata: name: bifrost namespace: bifrost spec: serviceName: bifrost-cluster replicas: 3 selector: matchLabels: app: bifrost component: gateway template: metadata: labels: app: bifrost component: gateway spec: serviceAccountName: bifrost containers: - name: bifrost image: /bifrost:latest ports: - containerPort: 8080 name: http protocol: TCP - containerPort: 10101 name: gossip protocol: TCP env: - name: BIFROST_CONFIG value: /etc/bifrost/config.json volumeMounts: - name: config mountPath: /etc/bifrost resources: requests: cpu: "500m" memory: "512Mi" limits: cpu: "2000m" memory: "2Gi" livenessProbe: httpGet: path: /health port: 8080 initialDelaySeconds: 30 periodSeconds: 10 readinessProbe: httpGet: path: /ready port: 8080 initialDelaySeconds: 10 periodSeconds: 5 volumes: - name: config configMap: name: bifrost-config --- apiVersion: v1 kind: Service metadata: name: bifrost-cluster namespace: bifrost spec: clusterIP: None selector: app: bifrost component: gateway ports: - port: 10101 name: gossip protocol: TCP --- apiVersion: v1 kind: Service metadata: name: bifrost namespace: bifrost spec: type: LoadBalancer selector: app: bifrost component: gateway ports: - port: 80 targetPort: 8080 protocol: TCP name: http --- apiVersion: policy/v1 kind: PodDisruptionBudget metadata: name: bifrost-pdb namespace: bifrost spec: minAvailable: 2 selector: matchLabels: app: bifrost component: gateway ``` ### Bare Metal / VM Deployment For bare metal or VM deployments using systemd: **Step 1: Install Bifrost on each node** ```bash # Download Bifrost Enterprise binary curl -O https://releases.getmaxim.ai/bifrost-enterprise/latest/bifrost-enterprise-linux-amd64 chmod +x bifrost-enterprise-linux-amd64 sudo mv bifrost-enterprise-linux-amd64 /usr/local/bin/bifrost-enterprise ``` **Step 2: Create configuration file** ```bash sudo mkdir -p /etc/bifrost sudo cat > /etc/bifrost/config.json < /etc/systemd/system/bifrost.service < **Symptoms**: Each node thinks it's the only member **Common Causes & Solutions**: - **Discovery not configured**: Verify `discovery.enabled: true` and `discovery.type` is set - **Service name mismatch**: Ensure all nodes have identical `service_name` - **Gossip port blocked**: Check firewall allows TCP port 10101 between nodes - **Discovery method issues**: See method-specific troubleshooting above - **Network isolation**: Verify nodes can reach each other on gossip port **Symptoms**: Nodes divided into separate clusters **Common Causes & Solutions**: - **Network partition**: Check network connectivity between all nodes - **Different discovery configs**: Ensure all nodes use same discovery settings - **Firewall blocking gossip**: Verify bidirectional connectivity on port 10101 - **Discovery scoped incorrectly**: Check label selectors, DNS names, or address spaces - **Restart all nodes**: Sometimes requires simultaneous restart to reform cluster **Symptoms**: Memory grows over time, especially in large clusters **Common Causes & Solutions**: - **Large gossip messages**: Check size of gossiped data - **Too many nodes**: Optimize for clusters with 3-7 nodes typically - **Message deduplication cache**: This is normal, cache TTL is 2 minutes - **Increase node resources**: Ensure adequate memory allocation **Symptoms**: Nodes repeatedly join and leave cluster **Common Causes & Solutions**: - **Network instability**: Check for packet loss or high latency - **Resource constraints**: Ensure nodes have adequate CPU/memory - **Timeout too aggressive**: Increase `timeout_seconds` in gossip config - **Health check failures**: Review liveness probe configuration - **Discovery intervals**: Check discovery isn't running too frequently **Symptoms**: Broadcast queue errors, messages not propagating **Common Causes & Solutions**: - **Queue not initialized**: Check logs for initialization errors - **No active members**: Verify cluster has multiple healthy members - **Gossip port unreachable**: Test connectivity between all nodes - **Message too large**: Check size of broadcast messages **Key log messages to look for**: ``` ✅ Successful cluster formation: - "successfully joined X peers on startup" - "cluster health: HEALTHY" - "discovered X nodes" ⚠️ Warning signs: - "no new nodes discovered" - "failed to join cluster" - "cluster health: NOT HEALTHY" - "node marked as suspect" ❌ Errors: - "discovery failed" - "failed to broadcast" - "timeout waiting for response" ``` ### Health Check Endpoints Monitor cluster health via HTTP endpoints: ```bash # Check if node is healthy curl http://localhost:8080/health # Get cluster status (if exposed) curl http://localhost:8080/cluster/status # Expected response shows all cluster members { "local_node": "bifrost-remote-10101-...", "members": 3, "healthy_members": 3, "cluster_health": "HEALTHY" } ``` --- This clustering implementation ensures Bifrost can handle enterprise-scale deployments with high availability, automatic service discovery, and intelligent traffic distribution across any infrastructure.