Automate TLS for dynamic domains with Traefik + Hetzner DNS.
Back when I began building apidex.dev, it was obvious from the get-go that custom domains would matter. Users should be able to publish API docs at docs.acme.com, not only at a generic subdomain. No one wants that.
This sounds simple until you get to HTTPS. Every domain needs a valid TLS certificate. On top of that, users can add domains at any time, so the solution needs to be flexible without being over-engineered or hardcoded.
Static Traefik labels do not handle this scenario well, and manual certbot scripts are no better. You either restart containers far too often or build a pile of unnecessary code.
I wanted the solution to be boring: simple, with no restarts, no manual steps, and no background jobs watching for changes.
The problem
Traefik works well with Docker labels when routes are known ahead of time. You put routers on containers as labels. Traefik reads them and wires everything up on its own.
However, labels are static. Traefik reads them when the container starts. If a user adds a domain later, Traefik does not know about it. So that's a no-go.
You can work around this by rebuilding docker-compose.yml and running docker-compose up -d.
Clearly not something you'd wanna do each time someone clicks "Add domain". It leads to reloads and is annoying to automate. Apart from that, it's hacky as hell.
Once you have more than a few custom domains, you need dynamic routing.
The solution: Traefik's HTTP provider
Traefik can read the config from more than one place. Most Docker setups use the Docker provider. However, there is also an HTTP provider, and it is more useful than it looks.
With the HTTP provider, Traefik polls an endpoint and merges the response into its live configuration.
# docker-compose.yml (traefik service)
- "--providers.http.endpoint=https://api.example.com/_dynamic-config/<unguessable-token>"
- "--providers.http.pollInterval=30s"
In my setup, the real endpoint uses a path that is not easily discoverable. Here, Traefik calls the endpoint every 30 seconds. It compares the response with the config it already has, then applies the changes.
Routers can appear and disappear while the container keeps running, with no restarts and no downtime.
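Since the token in the path is the only thing protecting the endpoint, it makes sense to compare it in constant time. Below is a minimal sketch using only the Ruby standard library. The method name `valid_config_token?` is hypothetical and not taken from the Apidex codebase.

```ruby
require "digest"

# Hypothetical helper: compares the token from the request path with the
# expected one without revealing where the first mismatch occurs.
def valid_config_token?(given, expected)
  return false if given.nil? || expected.nil?

  # Hashing first gives both sides the same length, so the loop below
  # always does the same amount of work.
  a = Digest::SHA256.digest(given).bytes
  b = Digest::SHA256.digest(expected).bytes

  diff = 0
  a.each_with_index { |byte, i| diff |= byte ^ b[i] }
  diff.zero?
end
```

A controller action could call this with the path segment and a secret from the environment, and answer with a 404 when it returns false.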
What the endpoint returns
On the backend side, one controller builds the Traefik configuration from the database.
It finds projects with a custom domain and turns each domain into routers.
# app/controllers/traefik_config_controller.rb
def build_traefik_config(domains)
  routers = {}
  domains.each do |domain|
    # Turn the domain into a name that is safe to use as a router key.
    safe_name = domain.gsub(/[^a-z0-9]/i, "-")
    # HTTPS router: serves the traffic and requests the certificate.
    routers["custom-#{safe_name}"] = {
      rule: "Host(`#{domain}`)",
      service: "frontend",
      entryPoints: ["websecure"],
      tls: { certResolver: "letsencrypt" }
    }
    # HTTP router: only redirects to HTTPS.
    routers["custom-#{safe_name}-http"] = {
      rule: "Host(`#{domain}`)",
      service: "frontend",
      entryPoints: ["web"],
      middlewares: ["https-redirect"]
    }
  end
  { http: { routers: routers, services: { ... }, middlewares: { ... } } }
end
Each domain gets two routers. The first router listens on port 443. It enables TLS through the letsencrypt resolver. This is the router that serves the traffic.
The second router listens on port 80. It is only there to redirect to HTTPS with a 301.
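The service: "frontend" part points to a service from the static Docker labels. That is where the Next.js app lives. The HTTP provider only adds the routing rules. One thing to watch: when a router from one provider references a service defined in another, Traefik expects the provider-qualified name, for example frontend@docker.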
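To make the shape of the response concrete, here is a self-contained sketch that fills the elided parts with placeholder values. The article describes the frontend service as coming from Docker labels; this sketch defines it inline instead so that the payload is complete on its own. The service URL `http://frontend:3000`, the `redirectScheme` options, and the method name `example_traefik_config` are assumptions for illustration only.

```ruby
require "json"

# Sketch only: builds a complete dynamic-config payload for the given
# custom domains. Service URL and middleware options are placeholders.
def example_traefik_config(domains)
  routers = {}

  domains.each do |domain|
    safe_name = domain.gsub(/[^a-z0-9]/i, "-")

    routers["custom-#{safe_name}"] = {
      rule: "Host(`#{domain}`)",
      service: "frontend",
      entryPoints: ["websecure"],
      tls: { certResolver: "letsencrypt" }
    }
    routers["custom-#{safe_name}-http"] = {
      rule: "Host(`#{domain}`)",
      service: "frontend",
      entryPoints: ["web"],
      middlewares: ["https-redirect"]
    }
  end

  {
    http: {
      routers: routers,
      services: {
        frontend: {
          # Assumed internal address of the frontend container.
          loadBalancer: { servers: [{ url: "http://frontend:3000" }] }
        }
      },
      middlewares: {
        "https-redirect" => {
          # permanent: true makes the redirect a 301.
          redirectScheme: { scheme: "https", permanent: true }
        }
      }
    }
  }
end

puts JSON.pretty_generate(example_traefik_config(["docs.acme.com"]))
```

Running it prints the JSON document Traefik would receive for a single domain, which is handy for checking the structure before pointing Traefik at the endpoint.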
Certificate issuance: Let's Encrypt HTTP-01
Traefik handles ACME, so the config is small.
- "--certificatesresolvers.letsencrypt.acme.httpchallenge=true"
- "--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web"
- "--certificatesresolvers.letsencrypt.acme.email=admin@example.com"
- "--certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json"
This uses the HTTP-01 challenge.
When Traefik asks for a cert, Let's Encrypt checks that Traefik can serve a token under /.well-known/acme-challenge/ over plain old HTTP.
That is why port 80 must stay open. The site redirects to HTTPS, but the challenge still starts over HTTP.
As soon as Traefik sees a new domain in a poll, it starts the ACME flow for it.
The cert is stored in acme.json. Traefik renews it before it expires.
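Traefik rejects an acme.json whose permissions are more open than 600, so it can help to check the mode before deploying. The following is a small sketch of such a check. The helper name `acme_storage_mode_ok?` is hypothetical, and the demo runs against a temporary file instead of the real storage path.

```ruby
require "tempfile"

# Hypothetical pre-deploy check: true when the permission bits are
# exactly 0600 (owner read/write, nothing for group or others).
def acme_storage_mode_ok?(path)
  (File.stat(path).mode & 0o777) == 0o600
end

# Demo against a throwaway file on a POSIX system.
Tempfile.create("acme") do |file|
  File.chmod(0o644, file.path)
  puts acme_storage_mode_ok?(file.path) # too open

  File.chmod(0o600, file.path)
  puts acme_storage_mode_ok?(file.path) # owner-only
end
```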
User flow
For the user, the flow is simple.
- User adds docs.acme.com in their Apidex dashboard.
- They point docs.acme.com to apidex.dev or to the server IP they're given.
- Traefik sees a new domain pop up on its next HTTP provider poll.
- ACME challenge runs and a certificate is issued.
- Domain goes live on HTTPS. 🚀
No one has to approve or provision anything manually!
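Because nobody reviews domains by hand, the application has to reject malformed input before it is interpolated into a Host rule. Here is a minimal sketch of such a check. The method name `valid_custom_domain?` and the exact rules are assumptions, not the validation Apidex actually uses.

```ruby
# One DNS label: letters, digits, and inner hyphens, at most 63 characters.
HOSTNAME_LABEL = /\A[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\z/i

# Hypothetical validator: accepts plain hostnames such as "docs.acme.com"
# and rejects anything that could break out of the backtick-quoted
# Host(`...`) rule, such as backticks, spaces, or parentheses.
def valid_custom_domain?(domain)
  return false unless domain.is_a?(String)
  return false if domain.empty? || domain.length > 253

  labels = domain.split(".", -1) # -1 keeps empty labels, so "a..b" fails
  return false if labels.length < 2

  labels.all? { |label| HOSTNAME_LABEL.match?(label) }
end
```

Calling this before saving the domain keeps the generated router rules predictable, since every accepted value is a plain hostname.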
What's on my Hetzner box specifically
The infrastructure is rather simple. There's one Hetzner VPS running Docker Compose, and Traefik, the Rails API, and the Next.js frontend all run on that machine. The firewall is configured to allow inbound traffic on ports 80 and 443.
There is no load balancer and no DNS API integration. HTTP-01 works fine at this scale, so I did not add more parts.
Wildcard subdomains like *.apidex.dev are handled separately with a HostRegexp rule in Docker labels.
The HTTP provider only handles custom domains users bring in.
Gotchas
A few things worth mentioning:
- acme.json must have strict permissions. If it is not set to 600, Traefik refuses to use it.
- DNS propagation can bite your ass. If Traefik asks for a cert before the domain points to your server, the challenge fails.
- HTTP provider merges with the existing config. It does not replace it. This is useful, because static routers keep working.
- Let's Encrypt has rate limits. The main one is 50 certificates per registered domain per week.
A failed challenge can be retried later, but the user may not see clear feedback in the meantime. It is worth checking DNS before accepting the domain.
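One way to check DNS before accepting a domain is to resolve it and compare the result with the server's address. The sketch below uses Ruby's standard Resolv library. The method name `points_to_server?`, the injectable `resolver` argument, and the example IP are hypothetical.

```ruby
require "resolv"

# Hypothetical pre-check: true when at least one address the domain
# resolves to matches the server's public IP. The resolver is injectable
# so the check can be exercised without network access.
def points_to_server?(domain, server_ip, resolver: ->(name) { Resolv.getaddresses(name) })
  resolver.call(domain).include?(server_ip)
rescue Resolv::ResolvError
  # Treat lookup errors as "not pointing at us yet".
  false
end

# Example with a stubbed resolver (203.0.113.10 is a documentation address).
stub = ->(name) { name == "docs.acme.com" ? ["203.0.113.10"] : [] }
puts points_to_server?("docs.acme.com", "203.0.113.10", resolver: stub)
```

With a check like this, the dashboard could show "DNS not pointing here yet" instead of leaving the user to wait for a certificate that cannot be issued.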
Conclusion
All in all, the final setup is even simpler than I thought it would be despite the various obstacles I faced.
Traefik's HTTP provider gives you dynamic routing without restarts. ACME handles the certificates. The only custom part is a small controller that returns JSON.
There are no certbot scripts, no cron jobs, and no DNS API keys. If you are building a multi-tenant app with custom domains, it seems like a pretty fine solution, as it stays out of your way as more domains get added.