Here are my findings:
- The
InvalidRequest
error stems from the fact that the url querystring is improperly url-encoded. The argument that is causing an issue is response-content-disposition
. response-content-disposition=attachment; filename="course.g7wg323t.tar.gz"
triggers the error. Replacing this argument by response-content-disposition=attachment%3B+filename%3D%22course.g7wg323t.tar.gz%22
fixes the error.
- The
response-content-disposition
argument is set in the import_export.export_status_handler
function: edx-platform/import_export.py at 91a4ab7c1b28904cba341d423f664d13d71f9887 · openedx/edx-platform · GitHub I would expect that the parameters
argument is properly url-encoded by boto3.
- To reproduce the incorrect url, run:
./manage.py cms shell
from django.conf import settings
from django.core.files.storage import get_storage_class
Storage = get_storage_class(settings.COURSE_IMPORT_EXPORT_STORAGE)
s = Storage()
s.url(name="pouac.xml", parameters={'ResponseContentDisposition': 'attachment; filename="pouac.xml"'}
)
We can see that the generated url does not properly encode its querystring.
- This seems to be a bug in boto3. We are using
boto3==1.4.8
. The latest version is 1.20.41. Let’s try to upgrade… The piece of code above still returns a url that contains a querystring which is not url-encoded.
- Interestingly, the bug does not occur when we call the boto3 API directly:
>> s.bucket.meta.client.generate_presigned_url('get_object', Params={'ResponseContentDisposition': 'attachment; filename="pouac.xml"', 'Bucket': s.bucket.name, "Key": "pouac.xml"})
'http://files.local.overhang.io/openedx/pouac.xml?response-content-disposition=attachment%3B%20filename%3D%22pouac.xml%22&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=openedx%2F20220124%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20220124T161250Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=e6840e6ebc2202f0aeff6180fc688a2e20511917168fb0f2f99010e2b425d846'
This means that the bug stems from django-storages.
- Indeed, when looking at the source code of the
url
function in the django-storages package, we see that it’s the _strip_signing_parameters
that un-escapes the querystring:
>>> s._strip_signing_parameters(s.bucket.meta.client.generate_presigned_url('get_object', Params={'ResponseContentDisposition': 'attachment; filename="pouac.xml"', 'Bucket': s.bucket.name, "Key": "pouac.xml"}, ExpiresIn=s.querystring_expire))
'http://files.local.overhang.io/openedx/pouac.xml?response-content-disposition=attachment; filename="pouac.xml"'
- We are using
django-storages==1.8
. The latest release is 1.12.3. Let’s try to upgrade… Nope, the issue still occurs.
- We are only going through
_strip_signing_parameters
because we are using AWS_QUERYSTRING_AUTH = False
. That’s because Open edX assets are supposed to be public (see this tutor-minio commit for instance).
- Interestingly, the
ImportExportS3Storage
class which is the default storage class for course import/export storage defines querystring_auth=True
. (source).
- Let’s try to switch to the default
ImportExportS3Storage
class by setting COURSE_IMPORT_EXPORT_STORAGE = "cms.djangoapps.contentstore.storage.ImportExportS3Storage"
in the tutor-minio plugin…
- For some reason, it looks like the course export generation task is completely bypassing the
COURSE_IMPORT_EXPORT_STORAGE
setting and is making use of the DEFAULT_STORAGE
(!!!)
- Let’s try instead to set the
USER_TASKS_ARTIFACT_STORAGE = "cms.djangoapps.contentstore.storage.ImportExportS3Storage"
setting, as it is actually done in cms/envs/production.py… bingo! exporting the course fails with a different error: {'raw_error_msg': 'S3ResponseError: 403 Forbidden\n'}
. This means that the course export does make use of the USER_TASKS_ARTIFACT_STORAGE
setting. But we probably can’t use it because, well, it uses the outdated S3BotoStorage
class.
(But WHY for god’s sake do we keep using the outdated S3BotoStorage class in edx-platform?!?)
- I attempted to make boto use sigv4 with
os.environ["S3_USE_SIGV4"] = "True"
. This changed the error from 403 to 400, so I guess it must have had some kind of effect. Here is the traceback:
cms-worker_1 | [2022-01-24 17:10:55,601: ERROR/ForkPoolWorker-1] cms.djangoapps.contentstore.tasks.export_olx[2eb9e318-2d7a-4cce-96c9-5cf2e4d42ed9]: Error exporting course course-v1:org+test101+a
lpha
cms-worker_1 | Traceback (most recent call last):
cms-worker_1 | File "/openedx/edx-platform/cms/djangoapps/contentstore/tasks.py", line 322, in export_olx
cms-worker_1 | artifact.file.save(name=os.path.basename(tarball.name), content=File(tarball))
cms-worker_1 | File "/openedx/venv/lib/python3.8/site-packages/django/db/models/fields/files.py", line 89, in save
cms-worker_1 | self.name = self.storage.save(name, content, max_length=self.field.max_length)
cms-worker_1 | File "/openedx/venv/lib/python3.8/site-packages/django/core/files/storage.py", line 54, in save
cms-worker_1 | name = self._save(name, content)
cms-worker_1 | File "/openedx/venv/lib/python3.8/site-packages/storages/backends/s3boto.py", line 425, in _save
cms-worker_1 | key = self.bucket.get_key(encoded_name)
cms-worker_1 | File "/openedx/venv/lib/python3.8/site-packages/boto/s3/bucket.py", line 193, in get_key
cms-worker_1 | key, resp = self._get_key_internal(key_name, headers, query_args_l)
cms-worker_1 | File "/openedx/venv/lib/python3.8/site-packages/boto/s3/bucket.py", line 230, in _get_key_internal
cms-worker_1 | raise self.connection.provider.storage_response_error(
cms-worker_1 | boto.exception.S3ResponseError: S3ResponseError: 400 Bad Request
- Let’s attempt to correctly configure S3BotoStorage, even though I find this utterly depressing.
- Let’s try a different approach. We would like to have a course import/export storage class with querystring_auth=True (like
ImportExportS3Storage
) but inherits from S3Boto3 Storage (unlike ImportExportS3Storage
). So let’s modify the ImportExportS3Storage
class.
- Damn, my dev environment is broken… I need to fix it (see PR).
- I modified the
ImportExportS3Storage
class to inherit from S3Boto3Storage
instead of S3BotoStorage
. URL generation works, and I am able to download the file.
Thus, there are two possible solutions:
a. Make it so that the tutor-minio plugin implements and uses its own ImportExportS3Storage class.
b. Push an upstream fix to migrate the ImportExportS3Storage to inherit from S3Boto3Storage instead of S3BotoStorage. As usual, this is probably the best but also the most difficult solution…