Lilac Upgrade Login Issues

Hi all,

We upgraded from Koa to Lilac today using Tutor v12 on K8s. Unfortunately, as part of the upgrade we lost our PV for Elasticsearch (we’re not sure whether that is relevant, but it seems worth mentioning!).

Since upgrading, users created before the migration (with the exception of the first admin user) cannot log in and are presented with an ‘unexpected error occurred’ message on sign-in. Newly registered users are not affected.

Looking at the LMS logs, the authentication is actually successful:

2021-06-10 18:05:58,424 INFO 6 [tracking] [user None] [ip XXX.XXX.XXX.XXX] logger.py:41 - {"name": "/api/user/v1/account/login_session/", "context": {"user_id": null, "path": "/api/user/v1/account/login_session/", "course_id": "", "org_id": ""}, "username": "", "session": "3e79b483f05620129bb1db97fd9d2d9f", "ip": "XXX.XXX.XXX.XXX", "agent": "Mozilla/5.0 (
2021-06-10 18:05:58,574 INFO 6 [audit] [user 4] [ip XXX.XXX.XXX.XXX] models.py:2590 - Login success - user.id: 4
[uwsgi-http key: SITE_URL client_addr: 10.0.2.88 client_port: 60556] hr_read(): Connection reset by peer [plugins/http/http.c line 917]
[pid: 6|app: 0|req: 482/747] 10.0.2.88 () {66 vars in 1917 bytes} [Thu Jun 10 18:05:58 2021] POST /api/user/v1/account/login_session/ => generated 45 bytes in 426 msecs (HTTP/1.0 200) 14 headers in 4193 bytes (1 switches on core 0)

From the in-browser web console, the login_session response comes back with a 502. Looking at the nginx pod logs, we see the following:

2021/06/10 18:10:45 [error] 22#22: *2512 upstream sent too big header while reading response header from upstream, client: 10.0.2.218, server: SITE_URL, request: "POST /api/user/v1/account/login_session/ HTTP/1.1", upstream: "http://172.20.199.231:8000/api/user/v1/account/login_session/", host: "SITE_URL", r
10.0.2.218 - - [10/Jun/2021:18:10:45 +0000] SITE_URL "POST /api/user/v1/account/login_session/ HTTP/1.1" 502 552 "SITE_URL/login?next=%2F" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36" "XXX.XXX.XXX.XXX"

Any help would be greatly appreciated! :slight_smile:

We ran into this with MFEs on edx dot org, and it was due to cookie sizes being too large. We just bumped the header size limit on nginx and gunicorn:


@daviesnathan thanks for the report! Can you try adding the following line to the nginx configuration for the LMS, as suggested by @fredsmith?

 large_client_header_buffers 8 16k;

If this change works for you we’ll make a 12.0.1 release very quickly.

Tutor does not use gunicorn, and we already increased the uwsgi buffer size in the past, so the other change should not be necessary. (I believe)


nm :slight_smile: - @daviesnathan worked it out.

@regis can you give a bit more info on where we set this?

Is this going into the lms config file? These are usually value : option. This looks like value: option : option.

@regis, unfortunately adding your recommendation to _tutor.conf in the nginx configmap didn’t quite do the trick.

However, I did some googling and tried setting some proxy buffer directives on the lms server config.

We now have

  location @proxy_to_lms_app {
    proxy_redirect off;
    proxy_set_header Host $http_host;
    proxy_pass http://lms-backend;
    proxy_buffer_size 8k;
    proxy_buffers 8 8k;
    proxy_busy_buffers_size 16k;
  }

This has fixed our problem and we’re now able to log in again. However, admittedly we’re now wondering if there is something more sinister going on under the hood that we can’t see in the logs, which this is simply ‘bodging’ :sweat_smile:

Can you please try adding the large_client_header_buffers 8 16k; directive in the $(tutor config printroot)/env/apps/nginx/lms.conf file? This line should be added inside the server directive. Then restart the nginx server with tutor local restart nginx (do not run tutor config save or tutor local quickstart, or your manual changes will be overwritten).
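For anyone unsure where the line goes, here is a minimal sketch of the placement being described. The listen and server_name lines are placeholders assumed for illustration, not the contents of the real Tutor-generated lms.conf; only the directive itself and the location block come from this thread. Note that, per the nginx documentation, large_client_header_buffers sizes the buffers for request headers coming from the client.

```nginx
# Sketch only: surrounding lines are placeholders, not the real Tutor template.
server {
  listen 80;                    # assumed for illustration
  server_name lms.example.com;  # hypothetical hostname

  # Server-wide: up to 8 buffers of 16k each for large client request headers
  large_client_header_buffers 8 16k;

  location @proxy_to_lms_app {
    proxy_redirect off;
    proxy_set_header Host $http_host;
    proxy_pass http://lms-backend;
  }
}
```

Putting the directive at the server level means it applies to every location in that server block without being repeated.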

I’d rather go with a server-wide fix than a fix that needs to be added to every location directive.

I understand your anxiety. Open edX is a large piece of software and there are many places where it can fail. Unfortunately, we don’t have a testing plan in place for every upgrade – although we most certainly should. So I cannot guarantee that Tutor v12 is bug-free. What I can guarantee, is that we will react swiftly to every bug report and that we will strive to fix them as quickly as possible.

Thanks for the reply. As requested I’ve just tried adding the large_client_header_buffers line to the lms server directive, but unfortunately it doesn’t work :frowning:

Ok, let’s try your solution, but slightly different: can you attempt to simply add proxy_buffers 8 8k; to the lms.conf file in the server directive – and not in the location?

So I attempted to add proxy_buffers 8 8k; to the lms server directive. However, this also didn’t work. But some tinkering later I found that simply adding proxy_buffer_size 8k; to the lms server directive did the trick :smiley:
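As a rough sketch of what that change looks like in the lms server block (the listen and server_name lines are hypothetical placeholders, and the rest of the real file is omitted):

```nginx
# Sketch only: surrounding lines are placeholders, not the real Tutor template.
server {
  listen 80;                    # assumed for illustration
  server_name lms.example.com;  # hypothetical hostname

  # Buffer for the first part of the upstream response, which holds the
  # response headers. Headers larger than this cause nginx to log
  # "upstream sent too big header" and return a 502.
  proxy_buffer_size 8k;

  location @proxy_to_lms_app {
    proxy_redirect off;
    proxy_set_header Host $http_host;
    proxy_pass http://lms-backend;
  }
}
```

Set at the server level, the value is inherited by the proxied location, so nothing needs to be repeated inside location @proxy_to_lms_app.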

That’s great! Do you want to open a PR with this change, or shall we do it?

@daviesnathan According to the docs, the nginx proxy_buffer_size should already be 8k for some users “depending on a platform”. Which makes me wonder on what OS/architecture you are running Tutor. Can you give me more details about your server?

I created the PR here: fix: "upstream sent too big header" nginx errors by regisb · Pull Request #451 · overhangio/tutor · GitHub

Thanks for raising that PR for us! :slight_smile:

Absolutely. In terms of the target infrastructure Tutor is deploying against, we’re running in AWS EKS which is Kubernetes v1.17. The nodes within the cluster are running Amazon Linux 2 with an AMI version of 1.17.9-20200904

Tutor itself is being run on our CI server which is also running AL2, and specifically is being run from a build container based on alpine3.12.

Hope this answers everything you need to know. :slight_smile: