---
title: "Troubleshooting"
description: "Diagnose and fix common issues with Bifrost Helm deployments: pods, database, ingress, secrets, PVCs, and performance"
icon: "wrench"
---

This page covers the most common problems encountered when deploying Bifrost with Helm, along with diagnostic commands and fixes.

---

## Pod Not Starting

### Quick diagnostics

```bash
# Show pod status
kubectl get pods -l app.kubernetes.io/name=bifrost

# Show pod events (most useful first step)
kubectl describe pod -l app.kubernetes.io/name=bifrost

# Show pod logs (use --previous if the pod has already crashed)
kubectl logs -l app.kubernetes.io/name=bifrost
kubectl logs -l app.kubernetes.io/name=bifrost --previous
```

### Image pull errors (`ErrImagePull` / `ImagePullBackOff`)

```bash
# Check which image is being pulled
kubectl describe pod -l app.kubernetes.io/name=bifrost | grep "Image:"

# Verify imagePullSecrets are attached
kubectl get pod -l app.kubernetes.io/name=bifrost -o jsonpath='{.items[0].spec.imagePullSecrets}'

# Test secret manually
kubectl get secret <pull-secret-name> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .
```

Common causes:
- `image.tag` not set: the chart requires it, and the pod will not start without it
- Pull secret missing or expired (ECR tokens expire after 12 hours)
- Incorrect `image.repository` for enterprise registry

```bash
# Fix: set the correct tag
helm upgrade bifrost bifrost/bifrost --reuse-values --set image.tag=v1.4.11
```
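
For the expired ECR token case, the pull secret has to be recreated from a fresh token. The function below is a minimal sketch of one way to do that, assuming the AWS CLI is installed and authenticated; the function name `refresh_ecr_pull_secret` and its arguments (secret name, account ID, region) are hypothetical placeholders, not values the chart defines.

```bash
# Hypothetical helper: recreate a docker-registry pull secret from a fresh ECR token.
# Usage: refresh_ecr_pull_secret <secret-name> <aws-account-id> <region>
refresh_ecr_pull_secret() {
  local name="$1" account="$2" region="$3"
  local registry="${account}.dkr.ecr.${region}.amazonaws.com"
  local token
  token="$(aws ecr get-login-password --region "$region")" || return 1

  # --dry-run=client renders the manifest locally; apply then creates or updates it.
  kubectl create secret docker-registry "$name" \
    --docker-server="$registry" \
    --docker-username=AWS \
    --docker-password="$token" \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

Because the token lasts 12 hours, a one-off refresh only helps until the next expiry; running something like this on a schedule is one option.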

### PVC not binding (`Pending`)

```bash
# Check PVC status
kubectl get pvc -l app.kubernetes.io/instance=bifrost

# Show binding events
kubectl describe pvc -l app.kubernetes.io/instance=bifrost
```

Common causes:
- No Persistent Volume provisioner in the cluster
- `storageClass` set to a class that doesn't exist
- `ReadWriteOnce` access mode with multiple replicas (SQLite PVCs are single-node)

```bash
# List available storage classes
kubectl get storageclass

# Fix: pin to a valid storage class
helm upgrade bifrost bifrost/bifrost \
  --reuse-values \
  --set storage.persistence.storageClass=standard
```

### ConfigMap / Secret errors

```bash
# View the generated ConfigMap (contains rendered config.json)
kubectl get configmap bifrost-config -o yaml

# View secrets the pod depends on
kubectl get secret -l app.kubernetes.io/instance=bifrost

# Decode a specific secret value
kubectl get secret bifrost-encryption -o jsonpath='{.data.key}' | base64 -d
```

### CrashLoopBackOff

```bash
# Get last log lines before the crash
kubectl logs -l app.kubernetes.io/name=bifrost --previous --tail=50

# Common causes shown in logs:
# "encryption key is not initialized" → no key provided; optional, but data will be stored in plaintext
# "failed to connect to database" → see Database section below
# "image.tag is required" → set image.tag in values
```
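
When the previous logs are empty or cut off, the container's last exit code can narrow things down. The snippet below is a minimal sketch: `explain_exit_code` and `last_exit_code` are hypothetical helper names, and the mapping follows the general convention that codes above 128 mean "terminated by signal (code minus 128)", not anything specific to Bifrost.

```bash
# Hypothetical helper: translate a container exit code into a likely cause.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; the process stopped without an error" ;;
    1)   echo "application error; read the --previous logs" ;;
    137) echo "SIGKILL (128+9); often an OOM kill, check memory limits" ;;
    143) echo "SIGTERM (128+15); the container was asked to stop" ;;
    *)   echo "unrecognized exit code $1; inspect kubectl describe output" ;;
  esac
}

# Read the last exit code of the first container in the first matching pod.
last_exit_code() {
  kubectl get pod -l app.kubernetes.io/name=bifrost \
    -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.exitCode}'
}

# Usage: explain_exit_code "$(last_exit_code)"
```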

---

## Database Connection Issues

### Embedded PostgreSQL

```bash
# Check if the PostgreSQL pod is running
kubectl get pods -l app.kubernetes.io/name=bifrost-postgresql

# Connect directly to inspect the database
kubectl exec -it deployment/bifrost-postgresql -- psql -U bifrost -d bifrost

# Test connectivity from the Bifrost pod
kubectl exec -it deployment/bifrost -- nc -zv bifrost-postgresql 5432

# Check PostgreSQL logs
kubectl logs deployment/bifrost-postgresql --tail=50
```

### External PostgreSQL

```bash
# Test connectivity from within the cluster
kubectl run pg-test --image=postgres:16-alpine --rm -it --restart=Never -- \
  psql "host=your-db-host dbname=bifrost user=bifrost sslmode=require"

# Verify the secret value is correct
kubectl get secret postgres-credentials -o jsonpath='{.data.password}' | base64 -d

# Check that the external host/port is reachable
kubectl exec -it deployment/bifrost -- nc -zv your-db-host 5432
```

Common causes:
- `sslMode: disable` when the database requires SSL; set `sslMode: require`
- Password in secret doesn't match the database user
- Network policy blocking pod → database traffic
- Database not UTF8 encoded (see [PostgreSQL UTF8 Requirement](/quickstart/gateway/setting-up#postgresql-utf8-requirement))

```bash
# Fix: update the secret and restart
kubectl create secret generic postgres-credentials \
  --from-literal=password='correct-password' \
  --dry-run=client -o yaml | kubectl apply -f -

kubectl rollout restart deployment/bifrost
```
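
The UTF8 requirement listed above can be checked directly with `SHOW server_encoding`. The following is a sketch only: `check_pg_utf8` is a hypothetical helper that takes whatever command reaches `psql` in your setup as its arguments, so the same function can cover the embedded and the external database.

```bash
# Hypothetical helper: print the server encoding and fail unless it is UTF8.
# Arguments: the command that reaches psql, without a query.
check_pg_utf8() {
  local encoding
  # -A (unaligned) and -t (tuples only) reduce the output to the bare value.
  encoding="$("$@" -Atc 'SHOW server_encoding')" || return 2
  echo "server_encoding=${encoding}"
  [ "$encoding" = "UTF8" ]
}

# Usage (embedded PostgreSQL, names taken from the commands above):
#   check_pg_utf8 kubectl exec deployment/bifrost-postgresql -- psql -U bifrost -d bifrost
```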

---

## Ingress Not Working

```bash
# Check ingress resource status
kubectl describe ingress bifrost

# Check if the ingress controller is running
kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# View ingress controller logs for routing errors
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50

# Verify DNS resolves to the correct load balancer IP
# (some providers publish .hostname instead of .ip in the ingress status)
nslookup bifrost.yourdomain.com
kubectl get ingress bifrost -o jsonpath='{.status.loadBalancer.ingress[0].ip}'

# Test without TLS first
curl -v http://bifrost.yourdomain.com/health
```

Common causes:
- `ingress.className` not set or set to a class not installed in the cluster
- TLS certificate not issued yet (cert-manager often needs a minute or more after the ingress is created)
- Service port mismatch: Bifrost listens on `8080` by default

```bash
# Check cert-manager certificate status
kubectl get certificate -l app.kubernetes.io/instance=bifrost
kubectl describe certificate bifrost-tls
```
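
The DNS check above can be scripted so the comparison is not done by eye. This is a minimal sketch assuming `dig` is available on the workstation; `ingress_dns_matches` is a hypothetical helper name, and it only handles load balancers that publish an IP address.

```bash
# Hypothetical helper: succeed when <host> resolves to the ingress load balancer IP.
# Usage: ingress_dns_matches bifrost bifrost.yourdomain.com
ingress_dns_matches() {
  local ingress="$1" host="$2" lb_ip
  lb_ip="$(kubectl get ingress "$ingress" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}')"

  if [ -z "$lb_ip" ]; then
    echo "no load balancer IP published yet (or the provider publishes a hostname)"
    return 2
  fi

  # dig +short prints one record per line; look for an exact match on the IP.
  if dig +short "$host" | grep -Fxq "$lb_ip"; then
    echo "OK: $host resolves to $lb_ip"
  else
    echo "MISMATCH: $host does not resolve to $lb_ip"
    return 1
  fi
}
```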

---

## Secret and Credential Issues

### Provider API key not resolving

If Bifrost logs show `env.OPENAI_API_KEY: not set` or similar:

```bash
# Check the env var is present in the running pod
kubectl exec -it deployment/bifrost -- env | grep OPENAI

# Verify the providerSecrets secret exists with the right key
kubectl get secret provider-api-keys -o yaml

# Check the providerSecrets configuration rendered correctly
kubectl get configmap bifrost-config -o yaml | grep -A5 providers
```

### Encryption key issues

```bash
# Verify the secret exists and contains the right key name
kubectl get secret bifrost-encryption -o yaml

# Check the exact key name matches encryptionKeySecret.key in values
# Default key name is "encryption-key"; if you used "key", set:
#   bifrost.encryptionKeySecret.key: "key"
```
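
To confirm the key name without printing the secret's contents, the data keys can be listed on their own. The function below is a sketch using kubectl's `go-template` output; `secret_has_key` is a hypothetical helper name, and the secret and key names in the usage line are the ones mentioned above.

```bash
# Hypothetical helper: succeed if <secret> contains the data key <key>.
# Only key names are read; decoded values are never printed.
secret_has_key() {
  local secret="$1" key="$2"
  kubectl get secret "$secret" \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
    | grep -Fxq "$key"
}

# Usage: secret_has_key bifrost-encryption encryption-key || echo "key name mismatch"
```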

---

## High Memory Usage

```bash
# Check current resource usage
kubectl top pods -l app.kubernetes.io/name=bifrost

# Check if OOM kills are happening
kubectl describe pod -l app.kubernetes.io/name=bifrost | grep -A3 "OOMKilled\|Limits"

# View resource requests/limits on running pods
kubectl get pod -l app.kubernetes.io/name=bifrost \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'
```

**Increase resource limits:**

```bash
helm upgrade bifrost bifrost/bifrost \
  --reuse-values \
  --set resources.limits.memory=4Gi \
  --set resources.requests.memory=1Gi
```

**Tune Go runtime** (see [Docker Tuning](/deployment-guides/docker-tuning)):

```yaml
env:
  - name: GOGC
    value: "200"       # run GC less often (trades memory for CPU)
  - name: GOMEMLIMIT
    value: "3500MiB"   # soft memory limit for the Go runtime, slightly below the container limit
```
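
The `3500MiB` value above is roughly 85% of a 4Gi (4096 MiB) container limit. The helper below sketches that arithmetic for other limits; `gomemlimit_mib` is a hypothetical name and the 85% default is an assumption chosen to match the example, not a documented Bifrost recommendation.

```bash
# Hypothetical helper: derive a GOMEMLIMIT value from the container memory limit.
# Usage: gomemlimit_mib <limit-in-MiB> [percent]   (percent defaults to 85)
gomemlimit_mib() {
  local limit_mib="$1" percent="${2:-85}"
  # Integer arithmetic; the result is rounded down.
  echo "$(( limit_mib * percent / 100 ))MiB"
}

# A 4Gi limit is 4096 MiB; 85% of that, rounded down, is 3481 MiB.
# gomemlimit_mib 4096   # prints 3481MiB
```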

---

## High CPU Usage / Latency

```bash
# Check CPU usage
kubectl top pods -l app.kubernetes.io/name=bifrost

# Check if HPA is scaling correctly
kubectl get hpa bifrost
kubectl describe hpa bifrost
```

Common causes:
- `initialPoolSize` too small, so goroutines queue up; increase to `500`–`1000`
- `dropExcessRequests: false` with a small pool, so the queue depth grows without bound

```bash
helm upgrade bifrost bifrost/bifrost \
  --reuse-values \
  --set bifrost.client.initialPoolSize=1000 \
  --set bifrost.client.dropExcessRequests=true
```

---

## Autoscaling Issues

### HPA not scaling

```bash
# Check HPA status and current metrics
kubectl describe hpa bifrost

# Verify metrics server is installed
kubectl top nodes
kubectl top pods

# Common fix: metrics server not installed
# Install with:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```
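
Before installing anything, it can help to check whether the metrics API is already registered and healthy. This is a minimal sketch; `metrics_api_available` is a hypothetical helper name, and it assumes the standard `v1beta1.metrics.k8s.io` APIService that metrics-server registers.

```bash
# Hypothetical helper: succeed when the metrics API is registered and Available.
metrics_api_available() {
  local status
  status="$(kubectl get apiservice v1beta1.metrics.k8s.io \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}' 2>/dev/null)"
  [ "$status" = "True" ]
}
```

A call such as `metrics_api_available || echo "metrics-server missing or not ready"` separates "not installed" from other HPA problems before the install command above is run.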

### Pods scaling down too aggressively (drops active SSE streams)

The default `scaleDown.stabilizationWindowSeconds: 300` and `preStop` sleep of 15 seconds should prevent this. If streams are still being cut:

```yaml
terminationGracePeriodSeconds: 120   # increase if streams run longer than 90s (120s minus the 30s preStop sleep below)

autoscaling:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # wait 10 min before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300            # remove at most 1 pod per 5 min

lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 30"]   # give load balancer more time to drain
```

Save these overrides as `graceful-shutdown-values.yaml` and apply them:

```bash
helm upgrade bifrost bifrost/bifrost --reuse-values -f graceful-shutdown-values.yaml
```
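
Kubernetes starts the termination grace period countdown before the `preStop` hook runs, so the sleep uses up part of the budget. The helper below is a small sketch of the resulting arithmetic; `grace_period_seconds` is a hypothetical name, and the stream durations are example inputs.

```bash
# Hypothetical helper: grace period = longest expected stream + preStop sleep.
# Usage: grace_period_seconds <max-stream-seconds> <prestop-sleep-seconds>
grace_period_seconds() {
  local max_stream_seconds="$1" prestop_seconds="$2"
  echo "$(( max_stream_seconds + prestop_seconds ))"
}

# grace_period_seconds 90 30    # prints 120
# grace_period_seconds 300 30   # prints 330
```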

---

## SQLite / PVC Issues

### StatefulSet migration (upgrading from chart < v2.0.0)

Older chart versions used a Deployment with a manually managed PVC; v2.0.0 moved SQLite to a StatefulSet. When upgrading from an older version:

```bash
# 1. Scale down the old deployment
kubectl scale deployment bifrost --replicas=0

# 2. Note the existing PVC name
kubectl get pvc

# 3. Upgrade, pointing at the existing claim
helm upgrade bifrost bifrost/bifrost \
  --reuse-values \
  --set storage.persistence.existingClaim=<your-old-pvc-name> \
  --set image.tag=v1.4.11
```

### Data lost after upgrade

```bash
# Check if PVCs still exist (they persist after helm uninstall)
kubectl get pvc -l app.kubernetes.io/instance=bifrost

# Re-attach by setting existingClaim
helm upgrade bifrost bifrost/bifrost \
  --reuse-values \
  --set storage.persistence.existingClaim=<pvc-name>
```

---

## Cluster Mode Issues

### Peers not discovering each other

```bash
# Check gossip port is reachable between pods
kubectl exec -it bifrost-0 -- nc -zv bifrost-1.bifrost-headless 7946

# View gossip-related log lines
kubectl logs -l app.kubernetes.io/name=bifrost --tail=100 | grep -i gossip

# Check the headless service exists
kubectl get svc bifrost-headless
```

For Kubernetes-based discovery, verify the service account has pod list permissions:

```bash
# Replace "default" with the namespace the release is installed in
kubectl auth can-i list pods --as=system:serviceaccount:default:bifrost
```
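
The `--as` subject has the form `system:serviceaccount:<namespace>:<name>`, and the check is evaluated in whichever namespace kubectl is pointed at. The sketch below keeps the two in sync; `sa_can_list_pods` is a hypothetical helper name.

```bash
# Hypothetical helper: check pod list permission for a service account
# in its own namespace. kubectl prints "yes" or "no".
# Usage: sa_can_list_pods <namespace> <service-account>
sa_can_list_pods() {
  local ns="$1" sa="$2"
  kubectl auth can-i list pods \
    --namespace "$ns" \
    --as="system:serviceaccount:${ns}:${sa}"
}

# Usage: sa_can_list_pods default bifrost
```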

---

## Useful Diagnostic Commands

```bash
# Full state dump for a support ticket
kubectl get all -l app.kubernetes.io/instance=bifrost
kubectl describe pod -l app.kubernetes.io/name=bifrost > pod-describe.txt
kubectl logs -l app.kubernetes.io/name=bifrost --tail=200 > pod-logs.txt

# View the full rendered config.json
kubectl get configmap bifrost-config -o jsonpath='{.data.config\.json}' | jq .

# Check current Helm values (shows all overrides)
helm get values bifrost

# Check Helm release status
helm status bifrost

# View Helm release history
helm history bifrost
```
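
For a support ticket, the commands above can be gathered into one directory. This is a sketch under stated assumptions: `collect_bifrost_bundle` and the output file names are hypothetical, and the Helm values and rendered config are left out on purpose because they may contain credentials.

```bash
# Hypothetical helper: write the diagnostics above into one directory.
# Usage: collect_bifrost_bundle [release] [output-dir]
collect_bifrost_bundle() {
  local release="${1:-bifrost}" out="${2:-bifrost-support}"
  mkdir -p "$out" || return 1

  kubectl get all -l "app.kubernetes.io/instance=${release}" > "$out/resources.txt" 2>&1
  kubectl describe pod -l app.kubernetes.io/name=bifrost > "$out/pod-describe.txt" 2>&1
  kubectl logs -l app.kubernetes.io/name=bifrost --tail=200 > "$out/pod-logs.txt" 2>&1
  helm status "$release" > "$out/helm-status.txt" 2>&1
  helm history "$release" > "$out/helm-history.txt" 2>&1

  # Review the files for sensitive data before sharing them.
  echo "wrote diagnostics to $out"
}
```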

---

## Still Stuck?

- [GitHub Issues](https://github.com/maximhq/bifrost/issues): search existing issues or open a new one
- [Enterprise Support](mailto:support@getmaxim.ai): for enterprise customers with SLA