The Error That Nearly Broke Our Deployment
Three hours into our Kamal deployment, we were stuck in a loop:
ERROR Failed to boot web on {ip_address}
INFO First web container is unhealthy on {ip_address}, not booting any other roles
The container would start, but Kamal’s health check kept failing. After 30 seconds, Kamal would kill the container
and retry, creating an endless loop.
We spent hours debugging deployment scripts, PostgreSQL configurations, and Rails settings. The fix turned out to
be much simpler: DNS configuration.
The Root Cause: Broken DNS Resolution
What Was Happening
When Kamal tried to verify container health, it performed this sequence:
- Container starts → my_app-web-abc123 boots
- kamal-proxy (which replaced Traefik in Kamal 2) requests the /up endpoint
- DNS lookup → Resolve my_app-web-abc123 to an IP address
- Health check fails → DNS resolution times out or fails
- Container killed → Kamal marks it as unhealthy
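To make the sequence concrete, here is a rough shell imitation of the two stages a health probe has to get through: name resolution first, then an HTTP request to /up. This is a sketch, not the proxy’s actual implementation; the function name probe_health and the default port 80 are our own assumptions.

```shell
# Hypothetical helper: imitates the two stages a health probe depends on.
# It prints which stage failed, so a DNS problem is not mistaken for an app problem.
probe_health() {
  local host="$1" port="${2:-80}"   # port 80 is an assumed default

  # Stage 1: the container name must resolve to an IP address.
  if ! getent hosts "$host" >/dev/null 2>&1; then
    echo "dns-failure"
    return 1
  fi

  # Stage 2: the app must answer /up with a success status.
  if ! curl -fsS --max-time 5 "http://$host:$port/up" >/dev/null 2>&1; then
    echo "http-failure"
    return 2
  fi

  echo "healthy"
}
```

Run from a container on the same Docker network, a result of dns-failure for my_app-web-abc123 points at the resolver rather than at Rails.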
The DNS Failure
The proxy container’s /etc/resolv.conf showed:
nameserver 127.0.0.53
search members.linode.com
options edns0 trust-ad ndots:0
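Problem: 127.0.0.53 is the host’s systemd-resolved stub resolver, which listens on the host’s loopback interface. Inside a container, 127.0.0.0/8 is the container’s own loopback, so nothing is listening at that address!
When the proxy tried to resolve my_app-web-abc123: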
- It queried 127.0.0.53 (systemd-resolved)
- The query failed with “connection refused”
- Health check failed
- Container was killed
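A quick way to spot this condition is to scan a container’s resolv.conf for loopback nameservers. The sketch below assumes a copy of the file has been saved locally; the function name unreachable_nameservers is hypothetical. It skips 127.0.0.11, because that address is Docker’s embedded DNS and is reachable from inside the container.

```shell
# Hypothetical helper: list nameservers in a resolv.conf-style file that point
# at the loopback range (127.0.0.0/8), excluding Docker's embedded DNS.
unreachable_nameservers() {
  local file="$1"
  awk '$1 == "nameserver" && $2 ~ /^127\./ && $2 != "127.0.0.11" { print $2 }' "$file"
}
```

For example, save the file with `docker exec kamal-proxy cat /etc/resolv.conf > /tmp/proxy-resolv.conf` and pass that path to the function. Any address it prints is one the container cannot query.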
The Solution: Proper Docker DNS Configuration
What We Fixed
We configured Docker’s DNS settings in /etc/docker/daemon.json:
{
"dns": ["127.0.0.11", "8.8.8.8", "1.1.1.1"]
}
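If a setup script writes this file, it is worth guarding against clobbering settings that are already there. A minimal sketch, assuming the DNS values above; the function name write_docker_dns is hypothetical, and it deliberately refuses to touch a non-empty file instead of trying to merge JSON.

```shell
# Hypothetical helper: write the DNS settings to a daemon.json path,
# but only when the file is missing or empty.
write_docker_dns() {
  local path="${1:-/etc/docker/daemon.json}"

  if [ -s "$path" ]; then
    echo "refusing to overwrite existing $path; merge the dns key by hand" >&2
    return 1
  fi

  cat > "$path" <<'EOF'
{
  "dns": ["127.0.0.11", "8.8.8.8", "1.1.1.1"]
}
EOF
}
```

Run it as root, then restart Docker (for example, `sudo systemctl restart docker`), since the daemon reads this file at startup.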
Why This Works
1. 127.0.0.11 (Docker’s Internal DNS) – First Priority
- Resolves container hostnames automatically
- Handles inter-container communication
- Available inside user-defined Docker networks, such as the kamal network
2. 8.8.8.8 (Google DNS) – Second Priority
- Resolves external domains (APIs, gems, etc.)
- Fast and reliable
- Global infrastructure
3. 1.1.1.1 (Cloudflare DNS) – Third Priority
- Privacy-focused external DNS
- Backup if 8.8.8.8 fails
- No query logging
How Docker Uses This
Docker’s DNS resolution order:
- Try 127.0.0.11 (internal) → container names
- If that fails → 8.8.8.8 (external) → domains
- If that fails → 1.1.1.1 (external) → domains
The IPv4/IPv6 Issue
While debugging, we discovered another subtle problem:
The IPv6 Trap
The server setup script used:
SERVER_IP=$(curl -s ifconfig.me || echo "ip_address_goes_here")
Problem: ifconfig.me returned an IPv6 address:
2600:3c03::...
This IPv6 address was used in PostgreSQL’s pg_hba.conf:
host my_app_production my_app_user 2600:3c03.../32 md5
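This rule could not work as intended: /32 is the single-host mask for IPv4, while a single IPv6 host needs /128, and a connection that arrives over IPv4 never matches an IPv6 entry. The result was authentication failures.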
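The mask has to follow the address family. As an illustration only (the function name single_host_cidr is hypothetical, and the check is a simple pattern match, not full address validation), a script could pick the single-host mask like this:

```shell
# Hypothetical helper: append the single-host mask that matches the
# address family (/128 for IPv6, /32 for IPv4).
single_host_cidr() {
  local addr="$1"
  case "$addr" in
    *:*) echo "$addr/128" ;;   # contains a colon: treat as IPv6
    *.*) echo "$addr/32" ;;    # dotted form: treat as IPv4
    *)   echo "unrecognized address: $addr" >&2; return 1 ;;
  esac
}
```

The addresses used to try it out below come from the documentation ranges, not from our server.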
The Fix
Force IPv4 detection:
SERVER_IP=$(curl -s -4 ifconfig.me || echo "ip_address_goes_here")
The -4 flag makes curl connect over IPv4 only, so ifconfig.me reports the server’s IPv4 address, which is what the /32 rule in pg_hba.conf expects.
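Even with -4, the fallback string or an error page could end up in pg_hba.conf, so it may be worth validating the result before using it. A minimal sketch; the function names is_ipv4 and detect_server_ip and the 10-second timeout are our assumptions, not part of the original script.

```shell
# Hypothetical helper: succeed only for a dotted-quad IPv4 address
# whose four octets are each in the range 0-255.
is_ipv4() {
  local ip="$1" octet old_ifs
  printf '%s\n' "$ip" | grep -Eq '^[0-9]{1,3}(\.[0-9]{1,3}){3}$' || return 1

  # Split on dots and range-check each octet.
  old_ifs="$IFS"
  IFS=.
  set -- $ip
  IFS="$old_ifs"
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
  return 0
}

# Hypothetical wrapper: detect the public IPv4 address, failing loudly
# instead of passing a placeholder downstream.
detect_server_ip() {
  local ip
  ip=$(curl -s -4 --max-time 10 ifconfig.me) || ip=""
  if is_ipv4 "$ip"; then
    echo "$ip"
  else
    echo "could not detect an IPv4 address (got: '$ip')" >&2
    return 1
  fi
}
```

In a setup script this would be called as `SERVER_IP=$(detect_server_ip) || exit 1`, which stops the run before a bad value reaches the PostgreSQL config.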
The PostgreSQL Network Isolation Issue
The Problem
On our server, Kamal uses a separate Docker network (172.18.0.0/16) for its containers, while PostgreSQL is on the host’s Docker bridge network (172.17.0.0/16). Docker assigns these subnets itself, so the ranges may differ on other hosts.
The firewall only allowed 172.17.0.0/16:
5432/tcp ALLOW 172.17.0.0/16
The Fix
Add the Kamal network to both the firewall and the PostgreSQL config:
Firewall (ufw):
sudo ufw allow from 172.18.0.0/16 to any port 5432
PostgreSQL (pg_hba.conf):
host my_app_production my_app_user 172.18.0.0/16 md5
PostgreSQL rereads pg_hba.conf only on reload, so reload the service after editing the file.
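Because the subnet is assigned by Docker, a script can look it up and build both rules from it. The sketch below assumes the network is named kamal, as in our setup; the function names are hypothetical, and the two rule builders only print text so the output can be reviewed before it is applied.

```shell
# Hypothetical helper: print the subnet(s) of the kamal Docker network,
# one per line. Requires Docker and an existing network named "kamal".
kamal_subnet() {
  docker network inspect kamal --format '{{range .IPAM.Config}}{{println .Subnet}}{{end}}'
}

# Print the ufw command that opens PostgreSQL's port to a subnet.
ufw_rule() {
  echo "sudo ufw allow from $1 to any port 5432"
}

# Print the pg_hba.conf line for a database, user, and subnet.
pg_hba_line() {
  echo "host $1 $2 $3 md5"
}
```

For example, `ufw_rule "$(kamal_subnet | head -n 1)"` prints the firewall command for whatever subnet Docker chose on that host.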
Complete Fix in Our Setup Script
IPv4 Fix
SERVER_IP=$(curl -s -4 ifconfig.me || echo "ip_address_goes_here")
Kamal Network Firewall Rule
sudo ufw allow from 172.18.0.0/16 to any port 5432
PostgreSQL Kamal Network Rule
host $DB_NAME $DB_USER 172.18.0.0/16 md5
Docker DNS Configuration
{
"dns": ["127.0.0.11", "8.8.8.8", "1.1.1.1"]
}
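A setup script tends to be run more than once, so config lines such as the pg_hba.conf rule should be added idempotently. One way to do that, as a sketch with a hypothetical function name append_once:

```shell
# Hypothetical helper: append a line to a file only if an identical
# line is not already present (exact, whole-line, fixed-string match).
append_once() {
  local line="$1" file="$2"
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
```

Called as `append_once "host $DB_NAME $DB_USER 172.18.0.0/16 md5" /etc/postgresql/16/main/pg_hba.conf`, a second run leaves the file unchanged.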
Key Takeaways
- DNS is Critical for Container Orchestration
- Always configure Docker’s DNS properly
- Include both internal and external DNS servers
- Test DNS resolution from containers
- Network Isolation Matters
- Docker networks are isolated by default
- PostgreSQL must allow connections from every Docker network your app containers use
- Firewall rules must match
- IPv4 vs IPv6 Can Break Things
- pg_hba.conf entries must match the address family and mask of the client connection
- Force IPv4 when detecting server IPs
- Test both IPv4 and IPv6 connectivity
- Health Checks are Essential
- The /up endpoint is critical for Kamal
- DNS must work for health checks to succeed
- Timeout settings matter (30s default)
Troubleshooting DNS Issues
If you encounter “First web container is unhealthy”:
- Check Container Logs
docker logs my_app-web-abc123
- Check Kamal Proxy Logs
docker logs kamal-proxy 2>&1 | grep -i healthcheck
- Test DNS Resolution
# From inside the proxy container
docker exec kamal-proxy getent hosts my_app-web-abc123
docker exec kamal-proxy getent hosts google.com
- Verify DNS Configuration
# Check daemon.json
cat /etc/docker/daemon.json
# Check container's resolv.conf
docker exec kamal-proxy cat /etc/resolv.conf
- Check PostgreSQL Connectivity
# From kamal network
docker run --rm --network kamal postgres:16 psql \
  -h 172.17.0.1 -U my_app_user -d my_app_production -c "SELECT 1"
Results
After implementing all fixes:
- ✅ DNS resolution works (internal and external)
- ✅ Health checks pass (the proxy can reach containers)
- ✅ PostgreSQL connections work (from both Docker networks)
- ✅ Deployments succeed (consistent, reliable)
- ✅ IPv4 detection works (no IPv6 issues)
Final Thoughts
The “First web container is unhealthy” error can be a DNS configuration issue, not a deployment or application
problem.
By understanding how Docker networks, DNS resolution, and PostgreSQL authentication interact, we can recognize this class of failure quickly and keep it from recurring.
Key files to review:
- /etc/docker/daemon.json – Docker DNS configuration
- /etc/postgresql/16/main/pg_hba.conf – PostgreSQL authentication
- /etc/ufw/user.rules – Firewall rules
The fix is now automated in our setup script, ensuring new servers have proper DNS and network configuration from
day one.