"tutor k8s" in production mode: is this a DNS catch-22?

Hi,

@Namrata and I have been doing some testing with Tutor on Kubernetes this week, and we’ve been running into a situation that looks a bit like a catch-22. Our Kubernetes cluster is managed by OpenStack Magnum.

The documentation says that in order to run tutor k8s in production mode, you must first create DNS records: presumably a single A (or AAAA) record pointing to the external IP associated with the Caddy service, plus a few CNAMEs.

Now, the external IP of the caddy service is only allocated during tutor k8s start (or quickstart), which means that you'd have to:

  1. Run tutor k8s [start|quickstart]
  2. Wait until the caddy service is up, and retrieve its external IP with kubectl (at this point other pods may be hitting the Error or CrashLoopBackOff state, because they rely on name resolution to connect to other Tutor-managed services)
  3. Create the DNS records.
  4. Recover any failed pods.
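Step 2 above can be scripted. Here's a rough Python sketch (a hypothetical helper, not part of Tutor) that polls kubectl for the caddy service's external IP; the `run` callable is injectable so the extraction logic can be tested without a live cluster:

```python
import json
import subprocess
import time


def get_caddy_external_ip(run=subprocess.check_output):
    """Return the external IP of the caddy LoadBalancer service, or None.

    By default this shells out to kubectl; `run` is injectable for testing.
    """
    out = run([
        "kubectl", "-n", "openedx", "get", "svc", "caddy", "-o", "json",
    ])
    svc = json.loads(out)
    ingress = svc.get("status", {}).get("loadBalancer", {}).get("ingress", [])
    if ingress:
        # Some providers report an IP, others a host name.
        return ingress[0].get("ip") or ingress[0].get("hostname")
    return None


def wait_for_external_ip(timeout=300, interval=5, run=subprocess.check_output):
    """Poll until the external IP is allocated, or raise after `timeout`."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            ip = get_caddy_external_ip(run=run)
        except subprocess.CalledProcessError:
            # Service may not exist yet; keep polling.
            ip = None
        if ip:
            return ip
        time.sleep(interval)
    raise TimeoutError("caddy service never received an external IP")
```

Of course, this only automates the waiting; you still have to create the DNS records by hand once the IP is known.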

This process would need to be repeated every time one runs tutor k8s stop followed by tutor k8s start, because generally Kubernetes doesn’t guarantee the persistence of external IPs. (If you throw a service away and recreate it, you can’t expect that its external IP is the same as before.)

We’re guessing that someone running EKS on AWS could work around this issue by creating DNS records automatically via Route 53, but in our case with Magnum on OpenStack (and, unfortunately, no Designate) we don’t have that option.

So, is there a more elegant way to resolve this chicken-and-egg issue than steps 1-4 above?

And perhaps relatedly, under what circumstances does it really make sense to run tutor k8s quickstart and answer ‘n’ to the “Are you configuring a production platform” question? It would appear that the non-production settings don’t play nicely with the tutor-minio plugin, which the documentation says is required for running tutor k8s. That’s why we skipped straight to the “production” configuration, where we ran into the DNS catch-22.

Thanks in advance for your thoughts!

Cheers,
Florian

Hey Florian,

Your analysis is 100% correct. I’m very modest about my Kubernetes skills, so I’ll be happy to consider any alternative option that you suggest. What would be the typical k8s way of doing things? We need a solution that is provider-agnostic. Would it be sufficient not to stop the caddy load balancer on tutor k8s stop?

Hi Régis,

Thanks for the reply! If you’re looking for suggestions on how this could be made more user-friendly in a provider-agnostic way, I have two:

  1. Perhaps remove the “Are you configuring a production platform?” question from tutor k8s quickstart altogether, and always apply the production settings.
  2. During tutor k8s {init|start|quickstart}, deploy the caddy service first, and wait until its external IP is ready (perhaps using an approach as outlined here). Then, list the external IP and all host names configured for Tutor services, and prompt the user to create DNS A and CNAME records for them (or maybe prompt only on quickstart and init, but proceed on start). Then, resolve the host names, and if the responses match the external IP address, continue with deploying the rest of the services.

As an aside, it’d be interesting to see how this behaves in Kubernetes 1.21 and later, where dual-stack (IPv4/IPv6) support is enabled by default. Unfortunately we currently don’t have a test environment where we can deploy Kubernetes 1.21 with full dual-stack support (there are “interesting” issues in our Magnum setup with that configuration).

How do those ideas sound to you?

So, a few more findings here (perhaps this is helpful to others). What we ended up doing was this:

  1. Run tutor k8s quickstart. This causes some jobs to fail for the reasons outlined in the first post in this thread.
  2. Check kubectl -n openedx get svc caddy to retrieve the external IP of the caddy service.
  3. Create the DNS records for that external IP, so that the host names set for LMS_HOST, CMS_HOST and (if using the mfe and minio plugins) MFE_HOST and MINIO_HOST resolve to that address.
  4. Wait until the Let’s Encrypt certs have been generated by the caddy pod (prior to this, the services would be unable to connect to each other via HTTPS).
  5. Run tutor k8s quickstart again. This is a no-op for the services themselves, but the jobs get recreated and are now able to complete.
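For step 4, "wait until the certs are there" boils down to: retry until a verified TLS handshake against the host succeeds. A rough Python sketch (hypothetical helpers, not part of Tutor; the `handshake` callable is injectable for testing):

```python
import socket
import ssl


def _handshake(host, port, timeout):
    """Attempt a TLS handshake against host:port, verified against the
    default trust store. Raises on failure."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host):
            pass


def tls_ready(host, port=443, timeout=5, handshake=_handshake):
    """True once `host` presents a certificate our trust store accepts.

    While Caddy is still obtaining its Let's Encrypt certificate, the
    handshake fails verification (or the connection is refused), so
    this returns False.
    """
    try:
        handshake(host, port, timeout)
        return True
    except (OSError, ssl.SSLError):
        return False
```

A readiness loop would then poll `tls_ready(LMS_HOST)` (and the other hosts) before kicking off the jobs again.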

So, for a successful clean spinup from scratch in a Kubernetes environment, tutor k8s quickstart would have to wait until the DNS records resolve, then wait some more until the Let’s Encrypt certs have been generated, and only then proceed with setting up the rest of the services. Does that sound reasonable?

I created this GitHub issue to summarize my thoughts: Kubernetes LoadBalancer race conditions and other issues · Issue #532 · overhangio/tutor. @fghaas let’s move the conversation there.