bifrost/docs/deployment-guides/helm/troubleshooting.mdx
Beyhan Oğur 880f412e2c first commit
2026-04-26 21:52:23 +03:00

---
title: "Troubleshooting"
description: "Diagnose and fix common issues with Bifrost Helm deployments — pods, database, ingress, secrets, PVCs, and performance"
icon: "wrench"
---
This page covers the most common problems encountered when deploying Bifrost with Helm, along with diagnostic commands and fixes.

---
## Pod Not Starting
### Quick diagnostics
```bash
# Show pod status
kubectl get pods -l app.kubernetes.io/name=bifrost
# Show pod events (most useful first step)
kubectl describe pod -l app.kubernetes.io/name=bifrost
# Show pod logs (use --previous if the pod has already crashed)
kubectl logs -l app.kubernetes.io/name=bifrost
kubectl logs -l app.kubernetes.io/name=bifrost --previous
```
### Image pull errors (`ErrImagePull` / `ImagePullBackOff`)
```bash
# Check which image is being pulled
kubectl describe pod -l app.kubernetes.io/name=bifrost | grep "Image:"
# Verify imagePullSecrets are attached
kubectl get pod -l app.kubernetes.io/name=bifrost -o jsonpath='{.items[0].spec.imagePullSecrets}'
# Test secret manually
kubectl get secret <pull-secret-name> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d | jq .
```
Common causes:
- `image.tag` not set — the chart requires it; the pod will not start without it
- Pull secret missing or expired (ECR tokens expire after 12 hours)
- Incorrect `image.repository` for enterprise registry
```bash
# Fix: set the correct tag
helm upgrade bifrost bifrost/bifrost --reuse-values --set image.tag=v1.4.11
```
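Because ECR tokens expire after 12 hours, an ECR-backed pull secret has to be recreated periodically. The following is a minimal sketch, assuming the AWS CLI is installed and authenticated; the function name, secret name, region, and registry host are placeholders, not values defined by the chart.

```bash
# Hypothetical helper: recreate an ECR image pull secret with a fresh token.
# $1 = secret name, $2 = AWS region,
# $3 = registry host (<account-id>.dkr.ecr.<region>.amazonaws.com)
refresh_ecr_pull_secret() {
  kubectl create secret docker-registry "$1" \
    --docker-server="$3" \
    --docker-username=AWS \
    --docker-password="$(aws ecr get-login-password --region "$2")" \
    --dry-run=client -o yaml | kubectl apply -f -
}
```

Running this on a schedule shorter than 12 hours (for example from a CronJob or CI job) keeps the secret valid. The secret name must match the entry the pod lists under `imagePullSecrets`.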
### PVC not binding (`Pending`)
```bash
# Check PVC status
kubectl get pvc -l app.kubernetes.io/instance=bifrost
# Show binding events
kubectl describe pvc -l app.kubernetes.io/instance=bifrost
```
Common causes:
- No Persistent Volume provisioner in the cluster
- `storageClass` set to a class that doesn't exist
- `ReadWriteOnce` access mode with multiple replicas (SQLite PVCs are single-node)
```bash
# List available storage classes
kubectl get storageclass
# Fix: pin to a valid storage class
helm upgrade bifrost bifrost/bifrost \
  --reuse-values \
  --set storage.persistence.storageClass=standard
```
### ConfigMap / Secret errors
```bash
# View the generated ConfigMap (contains rendered config.json)
kubectl get configmap bifrost-config -o yaml
# View secrets the pod depends on
kubectl get secret -l app.kubernetes.io/instance=bifrost
# Decode a specific secret value
kubectl get secret bifrost-encryption -o jsonpath='{.data.key}' | base64 -d
```
### CrashLoopBackOff
```bash
# Get last log lines before the crash
kubectl logs -l app.kubernetes.io/name=bifrost --previous --tail=50
# Common causes shown in logs:
# "encryption key is not initialized" → no key provided; optional, but data will be stored in plaintext
# "failed to connect to database" → see Database section below
# "image.tag is required" → set image.tag in values
```
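When the previous container's logs are empty or already rotated, the pod status still records why the last container exited. A small sketch, assuming the Bifrost container is the first container in the pod (the function name is illustrative):

```bash
# Hypothetical helper: print each Bifrost pod with its restart count and the
# reason and exit code of the last terminated container (blank if none).
last_termination() {
  kubectl get pods -l app.kubernetes.io/name=bifrost \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\t"}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}{end}'
}
```

An exit code of 137 means the container was killed with SIGKILL, which commonly points to an OOM kill rather than an application error.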
---
## Database Connection Issues
### Embedded PostgreSQL
```bash
# Check if the PostgreSQL pod is running
kubectl get pods -l app.kubernetes.io/name=bifrost-postgresql
# Connect directly to inspect the database
kubectl exec -it deployment/bifrost-postgresql -- psql -U bifrost -d bifrost
# Test connectivity from the Bifrost pod
kubectl exec -it deployment/bifrost -- nc -zv bifrost-postgresql 5432
# Check PostgreSQL logs
kubectl logs deployment/bifrost-postgresql --tail=50
```
### External PostgreSQL
```bash
# Test connectivity from within the cluster
kubectl run pg-test --image=postgres:16-alpine --rm -it --restart=Never -- \
  psql "host=your-db-host dbname=bifrost user=bifrost sslmode=require"
# Verify the secret value is correct
kubectl get secret postgres-credentials -o jsonpath='{.data.password}' | base64 -d
# Check that the external host/port is reachable
kubectl exec -it deployment/bifrost -- nc -zv your-db-host 5432
```
Common causes:
- `sslMode: disable` when the database requires SSL — set `sslMode: require`
- Password in secret doesn't match the database user
- Network policy blocking pod → database traffic
- Database not UTF8 encoded (see [PostgreSQL UTF8 Requirement](/quickstart/gateway/setting-up#postgresql-utf8-requirement))
```bash
# Fix: update the secret and restart
kubectl create secret generic postgres-credentials \
  --from-literal=password='correct-password' \
  --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/bifrost
```
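The UTF8 cause listed above can be confirmed by asking PostgreSQL for its server encoding. A minimal sketch; the helper name `is_utf8` is an assumption, and the commented usage reuses the embedded PostgreSQL deployment and `bifrost` user from the commands in this section:

```bash
# Hypothetical helper: succeeds only if the reported encoding is UTF8
# (ignores surrounding whitespace and letter case).
is_utf8() {
  [ "$(printf '%s' "$1" | tr -d '[:space:]' | tr '[:lower:]' '[:upper:]')" = "UTF8" ]
}

# Usage against the embedded PostgreSQL deployment:
# enc=$(kubectl exec deployment/bifrost-postgresql -- \
#   psql -U bifrost -d bifrost -Atc 'SHOW server_encoding;')
# is_utf8 "$enc" && echo "encoding OK" || echo "database is not UTF8: $enc"
```

For an external database, the same `SHOW server_encoding;` query can be run through the `pg-test` pod shown earlier.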
---
## Ingress Not Working
```bash
# Check ingress resource status
kubectl describe ingress bifrost
# Check if the ingress controller is running
kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx
# View ingress controller logs for routing errors
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=50
# Verify DNS resolves to the correct load balancer IP
nslookup bifrost.yourdomain.com
kubectl get ingress bifrost -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
# Test without TLS first
curl -v http://bifrost.yourdomain.com/health
```
Common causes:
- `ingress.className` not set or set to a class not installed in the cluster
- TLS certificate not issued yet (cert-manager can take up to 60 seconds)
- Service port mismatch — Bifrost listens on `8080` by default
```bash
# Check cert-manager certificate status
kubectl get certificate -l app.kubernetes.io/instance=bifrost
kubectl describe certificate bifrost-tls
```
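Since certificate issuance is asynchronous, it can help to block until the certificate reports `Ready` before testing HTTPS. A sketch, assuming cert-manager is installed and the certificate is named `bifrost-tls` as in the command above; the function name and the 120-second default are illustrative:

```bash
# Hypothetical helper: wait until a cert-manager Certificate is Ready,
# or fail when the timeout expires.
# $1 = certificate name, $2 = timeout (defaults to 120s)
wait_for_certificate() {
  kubectl wait --for=condition=Ready "certificate/$1" --timeout="${2:-120s}"
}
```

For example, `wait_for_certificate bifrost-tls 180s` returns a non-zero exit status if the certificate is still not issued after three minutes, at which point `kubectl describe certificate` usually shows the failing step.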
---
## Secret and Credential Issues
### Provider API key not resolving
If Bifrost logs show `env.OPENAI_API_KEY: not set` or similar:
```bash
# Check the env var is present in the running pod
kubectl exec -it deployment/bifrost -- env | grep OPENAI
# Verify the providerSecrets secret exists with the right key
kubectl get secret provider-api-keys -o yaml
# Check the providerSecrets configuration rendered correctly
kubectl get configmap bifrost-config -o yaml | grep -A5 providers
```
### Encryption key issues
```bash
# Verify the secret exists and contains the right key name
kubectl get secret bifrost-encryption -o yaml
# Check the exact key name matches encryptionKeySecret.key in values
# Default key name is "encryption-key" — if you used "key", set:
# bifrost.encryptionKeySecret.key: "key"
```
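A key-name mismatch can be checked directly instead of reading the YAML by eye. The sketch below lists the key names stored in a secret and tests for the expected one; the function name is an assumption, and the secret name `bifrost-encryption` follows the example above:

```bash
# Hypothetical helper: succeed only if the secret contains the given key name.
# $1 = secret name, $2 = expected key name
secret_has_key() {
  kubectl get secret "$1" \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
    | grep -qx -- "$2"
}

# Usage:
# secret_has_key bifrost-encryption encryption-key || echo "key name mismatch"
```

If the check fails, either recreate the secret with the expected key name or point `bifrost.encryptionKeySecret.key` at the name the secret actually uses.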
---
## High Memory Usage
```bash
# Check current resource usage
kubectl top pods -l app.kubernetes.io/name=bifrost
# Check if OOM kills are happening
kubectl describe pod -l app.kubernetes.io/name=bifrost | grep -A3 "OOMKilled\|Limits"
# View resource requests/limits on running pods
kubectl get pod -l app.kubernetes.io/name=bifrost \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'
```
**Increase resource limits:**
```bash
helm upgrade bifrost bifrost/bifrost \
  --reuse-values \
  --set resources.limits.memory=4Gi \
  --set resources.requests.memory=1Gi
```
**Tune Go runtime** (see [Docker Tuning](/deployment-guides/docker-tuning)):
```yaml
env:
  - name: GOGC
    value: "200"        # run GC less often
  - name: GOMEMLIMIT
    value: "3500MiB"    # soft memory limit for the Go runtime, slightly below the container limit
```
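The `3500MiB` value is roughly 85% of the 4Gi (4096MiB) container limit set above. If the limit changes, `GOMEMLIMIT` should move with it. A small sketch of that arithmetic; the function name and the 85% default are assumptions, not chart settings:

```bash
# Hypothetical helper: print a GOMEMLIMIT value as a percentage of the
# container memory limit.
# $1 = container memory limit in MiB, $2 = percentage (defaults to 85)
gomemlimit_for() {
  echo "$(( $1 * ${2:-85} / 100 ))MiB"
}
```

For a 4096MiB limit, `gomemlimit_for 4096` prints `3481MiB` and `gomemlimit_for 4096 90` prints `3686MiB`. Leaving some headroom matters because the Go runtime treats this value as a soft target and memory outside the Go heap still counts against the container limit.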
---
## High CPU Usage / Latency
```bash
# Check CPU usage
kubectl top pods -l app.kubernetes.io/name=bifrost
# Check if HPA is scaling correctly
kubectl get hpa bifrost
kubectl describe hpa bifrost
```
Common causes:
- `initialPoolSize` too small, so goroutines queue up; increase it to between `500` and `1000`
- `dropExcessRequests: false` with a small pool, so queue depth grows without bound
```bash
helm upgrade bifrost bifrost/bifrost \
  --reuse-values \
  --set bifrost.client.initialPoolSize=1000 \
  --set bifrost.client.dropExcessRequests=true
```
---
## Autoscaling Issues
### HPA not scaling
```bash
# Check HPA status and current metrics
kubectl describe hpa bifrost
# Verify metrics server is installed
kubectl top nodes
kubectl top pods
# Common fix: metrics server not installed
# Install with:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```
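Before reinstalling the metrics server, it can be worth checking whether the metrics API is registered but unhealthy. A sketch (the function name is illustrative); it prints the `Available` condition of the metrics API service:

```bash
# Hypothetical helper: print "True" when the metrics API is registered and
# available, "False" when it is registered but unhealthy.
metrics_api_available() {
  kubectl get apiservice v1beta1.metrics.k8s.io \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
}
```

A `NotFound` error means the metrics server is not installed. A result of `False` means it is installed but failing, in which case its pod logs in `kube-system` are the next place to look.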
### Pods scaling down too aggressively (drops active SSE streams)
The default `scaleDown.stabilizationWindowSeconds: 300` and `preStop` sleep of 15 seconds should prevent this. If streams are still being cut:
```yaml
terminationGracePeriodSeconds: 120   # increase if streams run longer than 105s

autoscaling:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 600   # wait 10 min before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 300            # remove at most 1 pod per 5 min

lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 30"]   # give load balancer more time to drain
```
```bash
helm upgrade bifrost bifrost/bifrost --reuse-values -f graceful-shutdown-values.yaml
```
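Kubernetes counts the `preStop` hook against the termination grace period, so the time left for in-flight streams after the hook finishes is the grace period minus the `preStop` sleep. A small sketch of that arithmetic (the function name is illustrative):

```bash
# Hypothetical helper: seconds remaining for in-flight streams once the
# preStop hook has finished.
# $1 = terminationGracePeriodSeconds, $2 = preStop sleep in seconds
stream_drain_budget() {
  echo "$(( $1 - $2 ))"
}
```

With a 120-second grace period, `stream_drain_budget 120 15` prints `105` for the default 15-second sleep, and `stream_drain_budget 120 30` prints `90` for a 30-second sleep. If streams routinely run longer than that budget, raise `terminationGracePeriodSeconds` accordingly.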
---
## SQLite / PVC Issues
### StatefulSet migration (upgrading from chart < v2.0.0)
Chart versions before v2.0.0 used a Deployment with a manually managed PVC; v2.0.0 moved SQLite to a StatefulSet. When upgrading from an older version:
```bash
# 1. Scale down the old deployment
kubectl scale deployment bifrost --replicas=0
# 2. Note the existing PVC name
kubectl get pvc
# 3. Upgrade, pointing at the existing claim
helm upgrade bifrost bifrost/bifrost \
  --reuse-values \
  --set storage.persistence.existingClaim=<your-old-pvc-name> \
  --set image.tag=v1.4.11
```
### Data lost after upgrade
```bash
# Check if PVCs still exist (they persist after helm uninstall)
kubectl get pvc -l app.kubernetes.io/instance=bifrost
# Re-attach by setting existingClaim
helm upgrade bifrost bifrost/bifrost \
  --reuse-values \
  --set storage.persistence.existingClaim=<pvc-name>
```
---
## Cluster Mode Issues
### Peers not discovering each other
```bash
# Check gossip port is reachable between pods
kubectl exec -it bifrost-0 -- nc -zv bifrost-1.bifrost-headless 7946
# View gossip-related log lines
kubectl logs -l app.kubernetes.io/name=bifrost --tail=100 | grep -i gossip
# Check the headless service exists
kubectl get svc bifrost-headless
```
For Kubernetes-based discovery, verify the service account has pod list permissions:
```bash
kubectl auth can-i list pods --as=system:serviceaccount:default:bifrost
```
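The command above assumes the release runs in the `default` namespace with a service account named `bifrost`. A sketch that parameterizes both and scopes the check to the release namespace; the function name is an assumption:

```bash
# Hypothetical helper: check whether a service account may list pods in its
# own namespace.
# $1 = namespace, $2 = service account name
can_list_pods() {
  kubectl auth can-i list pods -n "$1" --as="system:serviceaccount:$1:$2"
}
```

For example, `can_list_pods production bifrost` prints `yes` or `no`. A `no` typically means the Role or RoleBinding for the service account is missing in that namespace.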
---
## Useful Diagnostic Commands
```bash
# Full state dump for a support ticket
kubectl get all -l app.kubernetes.io/instance=bifrost
kubectl describe pod -l app.kubernetes.io/name=bifrost > pod-describe.txt
kubectl logs -l app.kubernetes.io/name=bifrost --tail=200 > pod-logs.txt
# View the full rendered config.json
kubectl get configmap bifrost-config -o jsonpath='{.data.config\.json}' | jq .
# Check current Helm values (shows all overrides)
helm get values bifrost
# Check Helm release status
helm status bifrost
# View Helm release history
helm history bifrost
```
---
## Still Stuck?
- [GitHub Issues](https://github.com/maximhq/bifrost/issues) — search existing issues or open a new one
- [Enterprise Support](mailto:support@getmaxim.ai) — for enterprise customers with SLA