0/1 nodes are available: 1 node(s) had volume node affinity conflict

andres · March 17, 2022, 3:14pm

We have a Tutor K8s installation (latest version) in AWS EKS. We run lms, cms, mfe and caddy related pods only, as the databases are external. Now we are running failover tests on the nodes. Everything works fine, except in this case:
when we bring down the node where caddy runs and the pod is rescheduled on a node in a different availability zone, we get the following error: 0/1 nodes are available: 1 node(s) had volume node affinity conflict.

This issue seems to be related to caddy’s persistent volume claim, which cannot be unbound and then rebound in a different AZ.
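
For context, this scheduler message usually traces back to a `nodeAffinity` block on the PersistentVolume behind the claim: a zonal volume (such as EBS) records its availability zone, and the pod can then only be scheduled on nodes in that zone. Below is a sketch of what such a PV may look like. The name, volume ID, zone, and topology label key are assumptions for illustration; the real values can be inspected with `kubectl get pv -o yaml`.

```yaml
# Illustrative PersistentVolume, not taken from the cluster in this thread.
# Name, volume ID, zone, and topology key are hypothetical.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-example
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  awsElasticBlockStore:
    volumeID: vol-0123456789abcdef0
    fsType: ext4
  # The volume is pinned to one zone; a pod using it cannot be
  # scheduled on a node whose zone label does not match.
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-east-1a
```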

Has anybody run into this problem before?
What is the purpose of this caddy PV? I guess it’s only for storing the SSL certificates, so I wouldn’t mind losing this data if the node goes down. Can we get rid of this PV?

We’ve found this article that can help.

andres · March 24, 2022, 1:51pm

Hi @regis! Your opinion here will be very appreciated.

Do you see any issue in using an emptyDir volume for caddy’s data, instead of a PVC? I have tried it in my test environment in AWS and an emptyDir solved the problem: when I kill the node and the pods are recreated on a new node in another zone, the caddy pod now restarts fine. Of course, the certificates are lost and it takes a couple of minutes to recreate them all, but at least it gives more resiliency. What do you think?
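
A minimal sketch of that change, as a fragment of the caddy Deployment. The container name, the volume name `data`, and the `/data` mount path are assumptions here and may not match the manifests Tutor actually generates.

```yaml
# Fragment of a Deployment spec; names and paths are assumptions.
spec:
  template:
    spec:
      containers:
        - name: caddy
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          # Replaces a persistentVolumeClaim entry. The directory is
          # created empty on whichever node runs the pod and is deleted
          # when the pod is removed from that node.
          emptyDir: {}
```

Because an emptyDir is not tied to any zone, the pod can be scheduled on any node, at the cost of starting with no stored certificates.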

regis · May 12, 2022, 3:00pm

Hi @andres,
Is your issue somewhat related to this PR from @fghaas? "fix: Enable rolling updates for the Caddy deployment in multi-node Kubernetes" (Pull Request #660, overhangio/tutor, GitHub). If yes, please comment there.

andres · May 12, 2022, 3:39pm

Hi @regis!
Well… in a sense they are related, in that Caddy’s PVC is causing trouble in several ways.
I’ve been talking to the AWS guys and they suggested that the only fully resilient and fully scalable way to have PVs in K8s is to use some kind of NFS, like EFS. I have run some tests, but it’s quite complicated to set up, and for our case with Caddy it’s overkill. In my setup, I have moved all other stateful pods out of the cluster.
In my opinion, changing the volume type to emptyDir will fix the problem of a whole node being recreated in another AZ. This should be a very rare event, so it should not cause much trouble with certificates. I don’t know how this will work in the rolling update situation.

andres · May 12, 2022, 3:53pm

Or get rid of the PV altogether by disabling HTTPS and creating an LB outside the cluster to terminate the SSL tunnels.

erickhgm · May 12, 2022, 6:44pm

In my case, I use a Swarm Cluster and I’m not using Caddy. I’m using an External Load Balancer with HTTPS, and inside my VPC the communication is over HTTP.

In k8s, a volume with the ReadWriteOnce access mode (typical for block storage such as EBS) can be mounted only by pods running on the same node (VM). So if you want redundancy between nodes/zones/regions, you need something like NFS, as you said.

I don’t use Caddy, so I don’t know how it works, but if the data and config volumes only hold certificate and config files, you can create a ConfigMap or Secret and mount it as a volume. That way your container will be able to mount it on any node/zone/region.
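
A rough sketch of that idea, mounting a Secret as a volume. The Secret name `my-tls-secret`, the volume name, and the `/certs` path are hypothetical, and the Secret would have to be created ahead of time with the certificate files in it.

```yaml
# Fragment of a Deployment spec; all names and paths are hypothetical.
spec:
  template:
    spec:
      containers:
        - name: caddy
          volumeMounts:
            - name: tls-certs
              mountPath: /certs
              readOnly: true
      volumes:
        - name: tls-certs
          # Secret data is served by the API server, so it is not
          # bound to a node or an availability zone.
          secret:
            secretName: my-tls-secret
```

This only fits files that already exist when the pod starts, since the mounted files are read-only inside the container.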

andres · May 12, 2022, 7:33pm

A config map won’t work, as Caddy generates the certificates dynamically. But disabling Caddy certificates will probably be the best option.
I’m curious… do you run Tutor over Swarm?

erickhgm · May 12, 2022, 8:43pm

Yeah, the infra that I have available is a Swarm Cluster running on AWS. Migrating to k8s is in the plans.
