Kairos on Hetzner Cloud: Seven Failures, Three Cloud-Configs, One PR

A build journal of what actually happened when I validated three Kairos cloud-configs for k3s on Hetzner Cloud: seven failure modes, one kubelet TLS nightmare, and a managed cert rabbit hole I went down so you don't have to.

I’m a Hetzner Cloud customer. I’ve been running personal workloads there for a while now, and at some point I decided I wanted to do it properly.

Not just “it runs.” Properly.

My goal is simple: the smallest possible Kubernetes footprint on a virtual server, using the best CNCF tooling I know, at minimal cost. One node. Production-grade. Fully integrated with what the cloud provider actually offers. Not bolted on top of it.

I had already done this exercise on Proxmox. Hetzner is different by design. HCCM and the Hetzner CSI driver are the point. They give you real cloud primitives: Load Balancers, Volumes, node identity, topology labels. And Kairos gives you immutable Linux as the base. That combination is exactly what I want to build on.

This is also the first step toward my first OpenTofu project. I’ve been a Terraform user for years. This is where I cross that line. And I want the infrastructure foundation to be solid before I write a single .tf file.

So I spent a few days validating three Kairos cloud-configs end-to-end on real Hetzner infrastructure. I hit seven failure modes. I documented all of them. I baked the fixes in. And I shipped the result as PR #379 to kairos-io/hadron.

This is the honest story of what happened.


Three configs, three levels of capability

| File | What it bootstraps |
| --- | --- |
| hcm-only.yaml | HCCM + Flannel + hcloud-csi. The minimum that gives you Hetzner Load Balancers, Volumes, and proper node identity. |
| hcm-traefik.yaml | Everything above + k3s-bundled Traefik v3 reconfigured for Gateway API. Gateway + HTTPRoute out of the box, fronted by a real Hetzner LB. |
| hcm-cilium.yaml | k3s with Cilium 1.18 as CNI: eBPF dataplane, kubeProxyReplacement: true, native routing over the Hetzner private network. HCCM still runs for node identity and Hetzner LBs. Cilium owns the packets. |

First boot to working cluster: ~1 minute. No kubectl apply. No manual Helm runs. That’s the standard I was holding myself to.


Before the failures: HCCM and Cilium don’t fight

Understanding this upfront would have saved me some confusion later.

HCCM and Cilium own completely different domains. They meet at exactly one place: the InternalIP field on the Node object.

HCCM initializes each Node and writes InternalIP=<private network IP>, say 10.0.0.3. Cilium reads that and uses it for autoDirectNodeRoutes: direct host routes between nodes through the Hetzner private network, no VXLAN overhead. Full MTU. No extra cost.

Two settings keep them from stepping on each other:

  • networking.enabled: false in HCCM. Otherwise HCCM runs its own route-controller that collides with Cilium’s routing.
  • l2announcements: false + externalIPs: false in Cilium. On Hetzner, real LBs come from HCCM. You don’t want Cilium trying to do that too.

HCCM owns what the cluster knows about Hetzner. Cilium owns how packets move inside it.
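
As a rough sketch, the split above maps onto two Helm values fragments. The key names follow the public HCCM and Cilium charts as I understand them; the ipv4NativeRoutingCIDR value is an assumption about a 10.0.0.0/8 private network and is not taken from the PR.

```yaml
# Values fragment for the hcloud-cloud-controller-manager chart (hcm-cilium variant)
networking:
  enabled: false              # no HCCM route-controller; Cilium programs the routes
---
# Values fragment for the Cilium chart
kubeProxyReplacement: true
routingMode: native           # no VXLAN encapsulation
autoDirectNodeRoutes: true    # host routes built from each Node's InternalIP
ipv4NativeRoutingCIDR: 10.0.0.0/8   # assumption: adjust to your private network range
l2announcements:
  enabled: false              # Hetzner LBs come from HCCM
externalIPs:
  enabled: false
```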


Seven failures, seven fixes

1. CCM crashloops with ClusterCIDRMisconfigured

First failure, first minute. The HCCM Helm chart default pod CIDR is 10.244.0.0/16, the vanilla Kubernetes default. k3s uses 10.42.0.0/16. Mismatch → route-controller crashloop on startup.

Fix baked in: networking.clusterCIDR: 10.42.0.0/16 in CCM HelmChart values for hcm-only and hcm-traefik. In hcm-cilium, set networking.enabled: false. Cilium owns routes, so this doesn’t apply.
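
A minimal sketch of what that looks like as a k3s HelmChart manifest, assuming the chart is pulled from the public Hetzner chart repository. The metadata names are illustrative, and the exact manifest in the PR may differ.

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: hcloud-cloud-controller-manager   # illustrative name
  namespace: kube-system
spec:
  repo: https://charts.hetzner.cloud
  chart: hcloud-cloud-controller-manager
  targetNamespace: kube-system
  bootstrap: true            # install before the node is initialized by the CCM
  valuesContent: |-
    networking:
      enabled: true
      # Must match the k3s pod CIDR, not the chart default of 10.244.0.0/16
      clusterCIDR: 10.42.0.0/16
```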


2. The one that hurt: kubelet TLS cert missing the private IP

This was the most frustrating failure of the whole session. And it’s subtle enough that it took me a while to understand what was happening.

Symptom: kubectl logs, kubectl exec, kubectl top all fail with an x509 error mentioning a private IP. The cluster looks healthy. It isn’t.

What’s happening: kubelet generates its TLS certificate at boot, before HCCM has had time to set InternalIP to the private network address. Once HCCM writes InternalIP=10.0.0.x, the API server tries to reach kubelet at that private IP. But kubelet’s cert SAN only covers the public IP it had at first boot.

The private IP is not in the cert. Inter-node communication through the private network is silently broken. The cluster appears fine. Until you actually try to use it.

The fix: write the private IP to the k3s config before k3s starts. A stages.boot step detects the private IP from the running interfaces and writes node-ip: 10.0.0.x to /etc/rancher/k3s/config.yaml before k3s initializes. Kubelet includes the private IP in its cert SAN from the very first registration.

Fix baked in: all three configs handle this in the boot stage automatically.
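
The boot step could look roughly like the sketch below. The detection logic is my assumption (it picks the first global IPv4 address in 10.0.0.0/8), so adapt the pattern if your Hetzner network uses a different range; the step in the PR may detect the address differently.

```yaml
#cloud-config
stages:
  boot:
    - name: "Pin k3s node-ip to the Hetzner private IP"
      commands:
        - |
          # Assumption: the private network lives in 10.0.0.0/8
          PRIVATE_IP="$(ip -4 -o addr show scope global | awk '{print $4}' | cut -d/ -f1 | grep -m1 '^10\.')"
          CONFIG=/etc/rancher/k3s/config.yaml
          if [ -n "$PRIVATE_IP" ]; then
            mkdir -p "$(dirname "$CONFIG")"
            touch "$CONFIG"
            # Only write the key once so reboots do not duplicate it
            grep -q '^node-ip:' "$CONFIG" || echo "node-ip: ${PRIVATE_IP}" >> "$CONFIG"
          fi
```

If no matching address is found, the step writes nothing and k3s falls back to its default node IP detection.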


3. Service type=LoadBalancer stuck <pending> forever

Classic HCCM failure. The Service sits at <pending> indefinitely. Nothing appears in the Hetzner console. The only clue is a buried event: neither location nor network-zone set.

HCCM needs to know where to provision the LB. Without HCLOUD_LOAD_BALANCERS_LOCATION (or a network zone), it provisions nothing, and that event is the only trace it leaves.

Important nuance: the location must match your servers’ location, because USE_PRIVATE_IP: true (fix #4) requires the LB and servers to share a Hetzner network zone.

| Network zone | Locations |
| --- | --- |
| eu-central | fsn1, nbg1, hel1 |
| us-east | ash |
| us-west | hil |
| ap-southeast | sin |

Fix baked in: HCLOUD_LOAD_BALANCERS_LOCATION: "<HETZNER_REGION>" in all three variants. Replace the placeholder with your actual location code.
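
In Helm values terms, the setting is a small fragment like this sketch. It assumes the chart's env map with value entries, and fsn1 is just an example location.

```yaml
# Values fragment for the hcloud-cloud-controller-manager chart
env:
  HCLOUD_LOAD_BALANCERS_LOCATION:
    value: "fsn1"   # example only: use the location your servers run in
```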


4. LB has EXTERNAL-IP but targets are Unhealthy

This one is easy to misdiagnose. It looks like everything is working. Until it suddenly isn’t.

By default, HCCM registers nodes as LB targets by their public IP. The LB health-checks <node-public-IP>:<NodePort>. A tight Cloud Firewall on the public NIC silently drops the probes. Targets flip to Unhealthy. Traffic stops. It looks random.

HCLOUD_LOAD_BALANCERS_USE_PRIVATE_IP: "true" fixes it. HCCM registers targets by their private IP. Probes ride the private network. They pass reliably. Your public NIC firewall stays tight without breaking anything.

Fix baked in: all three variants.
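
If you would rather not change the cluster-wide default, HCCM also reads per-Service annotations. The sketch below is a hypothetical Service (name, selector, and ports are made up) showing the per-Service equivalents of the location and private-IP settings.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web                  # hypothetical Service
  annotations:
    load-balancer.hetzner.cloud/location: "fsn1"          # example location
    load-balancer.hetzner.cloud/use-private-ip: "true"    # targets registered by private IP
spec:
  type: LoadBalancer
  selector:
    app: web                 # hypothetical label
  ports:
    - name: http
      port: 80
      targetPort: 8080
```

Private-IP targets only work when the LB can be attached to the same Hetzner network as the servers, which is why the location has to sit in the same network zone.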


5. Two default StorageClasses

Quick one. hcloud-csi registers hcloud-volumes as a default StorageClass. k3s ships local-path as a default too. Two defaults → non-deterministic PVC behavior.

Fix baked in: --disable=local-storage in k3s args. hcloud-volumes is the sole default.
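
In a Kairos cloud-config this is a one-line argument on the k3s block. A minimal sketch, with every other argument from the real configs omitted:

```yaml
#cloud-config
k3s:
  enabled: true
  args:
    # Drops the bundled local-path provisioner and its default StorageClass
    - --disable=local-storage
```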


6. Cross-namespace HTTPRoute silently returns 404 (Traefik variant)

The k3s-bundled Traefik creates a Gateway restricted to allowedRoutes.namespaces.from: Same. Only routes in kube-system are admitted. An HTTPRoute in default is accepted by the API server, so nothing looks wrong, but requests return 404 because the listener never admits the route.

Fix: HTTPRoute goes in kube-system, with a ReferenceGrant in the backend namespace for cross-namespace backend refs. Documented in the PR README as a known footgun.
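
A sketch of the working layout, under a few assumptions: the Gateway is named traefik-gateway (the Traefik chart default, so check yours), and the backend is a hypothetical whoami Service in default.

```yaml
# Route lives next to the Gateway, in kube-system
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: whoami               # hypothetical
  namespace: kube-system
spec:
  parentRefs:
    - name: traefik-gateway  # assumption: verify with `kubectl get gateway -n kube-system`
      namespace: kube-system
  hostnames:
    - whoami.example.com     # hypothetical
  rules:
    - backendRefs:
        - name: whoami
          namespace: default
          port: 80
---
# Grant lives in the backend namespace and allows the cross-namespace backendRef
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-kube-system-routes   # hypothetical
  namespace: default
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: kube-system
  to:
    - group: ""
      kind: Service
```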


7. Cilium Gateway auto-created Service has no LB location

When Cilium processes a Gateway, it auto-creates a cilium-gateway-<name> Service of type: LoadBalancer. The initial hcm-cilium config didn’t have the location env vars set. Only hcm-traefik did. So I hit failure #3 again, in a different variant.

Fix baked in: location env vars now present in all three configs, with full parity.
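
For context, this is the kind of Gateway that triggers the auto-created Service. The name and namespace are hypothetical; cilium is the GatewayClass that Cilium registers when its Gateway API support is enabled.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: demo                 # hypothetical: Cilium creates Service cilium-gateway-demo
  namespace: default
spec:
  gatewayClassName: cilium
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: Same
```

Because you never write that Service yourself, there is no obvious place to put a per-Service location annotation, so the HCCM-wide default is what keeps it from sitting at <pending>.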


The managed cert rabbit hole (don’t go down it)

After the PR was opened, I tried Hetzner’s “Managed Certificate” feature. Request a Let’s Encrypt cert via annotations, Hetzner handles the rest. Sounds great.

Here’s what actually happened.

Annotation names

The real annotation names from HCCM v1.31 source:

load-balancer.hetzner.cloud/protocol: "https"
load-balancer.hetzner.cloud/certificate-type: "managed"
load-balancer.hetzner.cloud/http-managed-certificate-domains: "<domain>"

Note the http- prefix on the domains annotation.
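
Put together on a Service, the annotations look like the sketch below. The Service name, selector, ports, and domain are hypothetical, and as the next section shows, this only works when the domain is hosted on Hetzner DNS.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-https            # hypothetical
  annotations:
    load-balancer.hetzner.cloud/protocol: "https"
    load-balancer.hetzner.cloud/certificate-type: "managed"
    load-balancer.hetzner.cloud/http-managed-certificate-domains: "app.example.com"
spec:
  type: LoadBalancer
  selector:
    app: web                 # hypothetical label
  ports:
    - name: https
      port: 443
      targetPort: 8080       # the LB terminates TLS and forwards plain HTTP
```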

DNS-01 only: Hetzner DNS required

With the correct annotations, HCCM created a Certificate via the Hetzner API. Within seconds:

{
  "code": "dns_zone_not_found",
  "message": "DNS zone not found"
}

Hetzner Managed Certificates validate domain ownership by creating a _acme-challenge.<domain> TXT record on Hetzner DNS. If your domain isn’t hosted there, it fails. Period. HTTP-01 is not involved. The LB’s port 80 is irrelevant.

This is in the Hetzner docs. But most informal sources describe the feature as “automatic Let’s Encrypt for your LB” without mentioning the DNS prerequisite. There are many issues on the HCCM GitHub project about this gap between HCCM and certificates.

Cleanup ordering

Deleting the Certificate via the API while the LB still referenced it failed with 422 certificate still in use. Strip the cert annotations from the Service first, then delete the Certificate.

The three real options

  1. Delegate the subdomain to Hetzner DNS via NS records.
  2. Move the full zone to Hetzner DNS.
  3. Use cert-manager in-cluster. Any DNS provider, HTTP-01 or DNS-01. Cert lands in a Kubernetes Secret, referenced from your Gateway via tls.certificateRefs. This is the right answer for most people. It is also what I use, with DNS-01 against OVH DNS through the cert-manager-ovh plugin.
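
A minimal sketch of option 3, assuming a ClusterIssuer already exists. Every name here is hypothetical, including the issuer, and the Gateway class should be whichever one your variant provides.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-example-com
  namespace: default
spec:
  secretName: app-example-com-tls   # cert-manager writes the key pair here
  dnsNames:
    - app.example.com
  issuerRef:
    kind: ClusterIssuer
    name: letsencrypt-dns01         # hypothetical issuer, configured separately
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: web
  namespace: default                # same namespace as the Secret
spec:
  gatewayClassName: cilium          # or the Traefik class in the Traefik variant
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: app.example.com
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: app-example-com-tls
```

TLS terminates in the cluster here, so the Hetzner LB in front only needs to pass TCP through on port 443.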

I didn’t include managed-cert support in the PR. Too DNS-specific to be a safe default.


Wrapping up

Three configs. Seven failures documented. One PR open.

If you run Kairos on Hetzner Cloud, the patterns in this article and in PR #379 cover the production-grade defaults you’d otherwise rediscover the hard way: the k3s pod CIDR in HCCM, the private IP in the kubelet cert, the LB location and private-IP targets, the single default StorageClass, the Traefik cross-namespace 404, and the managed-cert dead end that pushes you straight back to cert-manager.

The configs are merge-ready. The footguns are mapped. Pick the variant that matches your traffic story:

  • hcm-cilium.yaml for new builds that want one CNI handling routing end to end, with kube-proxy replacement and Gateway API.
  • hcm-traefik.yaml when you want the k3s-batteries-included path with the fewest moving parts.
  • hcm-only.yaml when you only need Hetzner Load Balancers, Volumes, and node identity, and will handle ingress yourself.

Next on this topic: a walk-through of cert-manager plus the OVH DNS-01 webhook to replace Hetzner’s managed certificates entirely. Same Gateway, any DNS provider, no zone migration required.


PR #379 is open. Feedback welcome. Open an issue on kairos-io/hadron or find me on LinkedIn.