0/1 nodes are available: 1 node(s) had volume node affinity conflict

We have a Tutor K8s installation (latest version) in AWS EKS. We run only the lms, cms, mfe and caddy related pods, as the databases are external. We are now running failover tests on the nodes. Everything works fine, except in this case: when we bring down the node where caddy runs and the pod is rescheduled onto a node in a different availability zone, we get the following error: 0/1 nodes are available: 1 node(s) had volume node affinity conflict.

This issue seems to be related to caddy’s persistent volume claim, whose volume cannot be detached and bound again in a different AZ.
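To illustrate where the error comes from, here is a minimal sketch of the kind of check the scheduler performs. It is a simplified model, not the actual Kubernetes code: the function name `node_matches_pv`, the zone values, and the use of the `topology.kubernetes.io/zone` label are assumptions for the example (the label key on a real PV depends on the storage driver).

```python
# Simplified, hypothetical model of the "volume node affinity" check.
# Real PVs express this through spec.nodeAffinity with matchExpressions;
# here each term is reduced to {label key: set of allowed values}.

ZONE = "topology.kubernetes.io/zone"  # assumed label key, varies by driver


def node_matches_pv(node_labels, pv_required_terms):
    """Return True if the node satisfies at least one affinity term of the PV.

    Within a term every key must match; a PV without terms fits any node.
    """
    if not pv_required_terms:
        return True
    return any(
        all(node_labels.get(key) in allowed for key, allowed in term.items())
        for term in pv_required_terms
    )


# A volume created in one zone (zone names are made up for the example).
pv_terms = [{ZONE: {"us-east-1a"}}]
old_node = {ZONE: "us-east-1a"}
new_node = {ZONE: "us-east-1b"}

print(node_matches_pv(old_node, pv_terms))  # True
print(node_matches_pv(new_node, pv_terms))  # False
```

When the only schedulable node fails this check, the pod stays pending with a "volume node affinity conflict" message like the one above.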

Has anybody run into this problem before?
What is the purpose of this caddy PV? I guess it’s only for storing the SSL certificates, so I wouldn’t mind losing this data if the node goes down. Can we get rid of this PV?

We’ve found this article that can help.


Hi @regis! Your opinion here would be much appreciated.

Do you see any issue in using an emptyDir volume for caddy’s data, instead of a PVC? I tried it in my test environment in AWS and an emptyDir solved the problem: when I kill the node and the pods are recreated on a new node in another zone, the caddy pod now restarts correctly. Of course, the certificates are lost and it takes a couple of minutes to recreate them all, but at least it gives more resiliency. What do you think?
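As a rough sketch of the change being discussed, the snippet below swaps a PVC-backed volume for an emptyDir in a Deployment manifest held as a Python dict. The helper `use_empty_dir`, the volume name `data`, the claim name `caddy`, and the `/data` mount path are assumptions for illustration; the names in an actual Tutor deployment should be checked before applying anything like this.

```python
import copy
import json


def use_empty_dir(deployment, claim_name):
    """Return a copy of the manifest where volumes backed by the given
    PVC are replaced with emptyDir volumes of the same name."""
    patched = copy.deepcopy(deployment)
    volumes = patched["spec"]["template"]["spec"].get("volumes", [])
    for i, volume in enumerate(volumes):
        pvc = volume.get("persistentVolumeClaim")
        if pvc and pvc.get("claimName") == claim_name:
            # emptyDir lives and dies with the pod, so it has no zone affinity.
            volumes[i] = {"name": volume["name"], "emptyDir": {}}
    return patched


# Trimmed, hypothetical manifest: only the fields relevant to the volume.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "caddy"},
    "spec": {"template": {"spec": {
        "containers": [{
            "name": "caddy",
            "image": "caddy",
            "volumeMounts": [{"name": "data", "mountPath": "/data"}],
        }],
        "volumes": [
            {"name": "data", "persistentVolumeClaim": {"claimName": "caddy"}},
        ],
    }}},
}

patched = use_empty_dir(deployment, "caddy")
print(json.dumps(patched["spec"]["template"]["spec"]["volumes"], indent=2))
```

The volume mounts stay untouched because they refer to the volume by name, so only the volume source changes. The trade-off is the one described above: the certificates are gone whenever the pod is recreated.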

Hi @andres,
Is your issue somewhat related to this PR from @fghaas? "fix: Enable rolling updates for the Caddy deployment in multi-node Kubernetes" (Pull Request #660, overhangio/tutor on GitHub). If yes, please comment there.

Hi @regis!
Well… in a sense they are related, in that Caddy’s PVC is causing trouble in several ways.
I’ve been talking to the AWS guys and they suggested that the only fully resilient and fully scalable way to have PVs in K8s is to use some kind of NFS, like EFS. I have run some tests, but it’s quite complicated to set up, and for our case with Caddy it’s overkill. In my setup, I have moved all other stateful pods out of the cluster.
In my opinion, changing the volume type to emptyDir will fix the problem of a whole node being recreated in another AZ. This should be a very rare event, so losing the certificates should not cause much trouble. I don’t know how this will behave during a rolling update.

Or get rid of the PV altogether by disabling HTTPS in Caddy and creating a load balancer outside the cluster to terminate SSL.

In my case, I use a Swarm cluster and I’m not using Caddy. I’m using an external load balancer with HTTPS, and inside my VPC the communication is over HTTP.

In k8s, a volume with the ReadWriteOnce access mode can be mounted only by pods running on the same node (VM). So if you want redundancy between nodes/zones/regions, you need something like NFS, as you said.

I don’t use Caddy, so I don’t know how it works, but if the data and config volumes only hold certificate and config files, you can create a ConfigMap or Secret and mount it as a volume. That way your container will be able to mount it on any node, in any zone or region.


A ConfigMap won’t work, as Caddy generates the certificates dynamically. But disabling Caddy certificates will probably be the best option.
I’m curious… do you run Tutor over Swarm?

Yeah, the infra that I have available is a Swarm cluster running on AWS. Migrating to k8s is in the plans.
