Utfmb4 instead of utf8mb3? (#442)

I understand the problem mentioned in #442.

Default UTF8 encoding / collation in MySQL doesn’t allow storing 4 byte multi-byte unicode characters, like emojis (and I suppose also like many asian characters like kanji).

Encoding and collation are defined in github edx/configuration in files:

[playbooks/roles/mysql/templates/default_character_sets_and_collations.cnf.j2](https://github.com/edx/configuration/blob/56b32423efc3b29309f4d6e475dd296e13c55cad/playbooks/roles/mysql/templates/default_character_sets_and_collations.cnf.j2)

and

[playbooks/roles/mysql/defaults/main.yml](https://github.com/edx/configuration/blob/61dedaa048f8d9430edbaef6005fad802f75003f/playbooks/roles/mysql/defaults/main.yml)

I’m nowadays using for software I build for my customers:

DEFAULT_MYSQL_CHARACTER_SET: utf8mb4
DEFAULT_MYSQL_COLLATION: utf8mb4_unicode_520_ci

Which permits storing of those characters.

Would be nice to be able to configure this in Tutor, though to change this for existings databases / tables, an export and a new import would be needed.

More information e.g. How Do I Use the utf8mb4 Character Set to Store Emojis in an RDS for MySQL DB Instance?_Relational Database Service_FAQs_Database Parameter Modification_HUAWEI CLOUD

Thanks for the feedback @insad.

  1. Can you confirm that switching the mysql character set and collation fixes the issues described in #442?
  2. What’s the standard procedure for migrating from one character set/collation to another?
  3. Are there any side effects with using the utf8mb4 character set?

I would not be too keen on migrating the mysql database of all tutor users, unless the benefits are substantial. Also, it seems to me that it’s quite possible for any user to lauch their own MySQL server, configure it as they see fit and simply bypass the mysql server that ships with tutor (RUN_MYSQL=false). Finally, I think that this should be achievable quite easily with a plugin. Would you be interested in creating such a plugin?

  1. Absolutely sure, even (bad software as) Wordpress is using this by default now. Also mysql 8 is now using by default a charset of utf8mb4 with a collation of utf8mb4_0900_ca_ci (this collation is not supported on mysql 5.7.32 used by tutor) (see also What is the utf8mb4_0900_ai_ci collation? - Monolune).

  2. Manually could export the database ‘openedx’ from docker, transform it from utf8 to utf8mb4, and import it into a new database on my server (didn’t try to import the newly create database into the live docker mysql). See below for a step-by-step receipt.

  3. I don’t have a good understanding still of the structure of tutor, less of the inner workings of docker… so I don’t feel confident to create a plugin. Looking at openedx / tutor for launching our e-learning platform in South America after I can travel there again after this covid hype, but still need studying django and some devops stuff…

Step-by-step receipt

Basically:

  • dump separately the schema and the data of an existing database, transforming them in the process
  • drop the existing database
  • recreate the database with the new charset and collation
  • import the schema back into this database
  • import the data back into this database

Here the process to get a copy of a database in the utf8mb4 charset on your ubuntu server (which you can dump and import into the mysql running in docker):

  1. Lookup the mysql root password of your tutor installation in config.yml:

    MYSQL_ROOT_PASSWORD: XXXXXXXX

  2. Open a bash shell (running mysql docker instance):

    docker exec -it tutor_local_mysql_1 bash

  3. Launch the mysql console:

    mysql -uroot -pXXXXXXXX

  4. Show the databases:

    mysql> show databases;
    +--------------------+
    | Database           |
    +--------------------+
    | information_schema |
    | discovery          |
    | ecommerce          |
    | mysql              |
    | notes              |
    | openedx            |
    | performance_schema |
    | sys                |
    | xqueue             |
    +--------------------+
    9 rows in set (0.10 sec)
  1. Show for instance the “openedx” database create statement:
    mysql> show create database openedx;
    +----------+------------------------------------------------------------------+
    | Database | Create Database                                                  |
    +----------+------------------------------------------------------------------+
    | openedx  | CREATE DATABASE `openedx` /*!40100 DEFAULT CHARACTER SET utf8 */ |
    +----------+------------------------------------------------------------------+
    1 row in set (0.00 sec)
  1. Show the tables in “openedx”:
    mysql> show tables from openedx;
    +------------------------------------------------------------------+
    | Tables_in_openedx                                                |
    +------------------------------------------------------------------+
    | announcements_announcement                                       |
    | api_admin_apiaccessconfig                                        |
    | api_admin_apiaccessrequest                                       |
    | assessment_assessment                                            |
    | ............................                                     |
    | xblock_django_xblockstudioconfiguration                          |
    | xblock_django_xblockstudioconfigurationflag                      |
    +------------------------------------------------------------------+
    427 rows in set (0.00 sec)
  1. Exit mysql console and docker bash terminal (exit / exit).

  2. Dump / transform the schema and the data, e.g. from the ‘openedx’ database (repeat for ‘discovery’ ‘ecommerce’, …):

    docker exec tutor_local_mysql_1 /usr/bin/mysqldump -Q -d -uroot -pXXXXXXXX --default-character-set=utf8 --skip-set-charset openedx | sed 's/utf8/utf8mb4/gi' > openedx_schema.sql

    docker exec tutor_local_mysql_1 /usr/bin/mysqldump -Q --insert-ignore -t -uroot -pXXXXXXXX --default-character-set=utf8 --skip-set-charset openedx > openedx_data.sql
  1. Open the mysql console on your server:

    sudo mysql

  2. Create a new ‘openedx’ database with charset utf8mb4 and the desired collation. To see the available collations you can enter:

    mysql> show collation;

    The default collation on mysql 5.7.32 is ‘utf8mb4_general_ci’, to use this you can create the new database with:

    mysql> create database openedx default charset utf8mb4 collate utf8mb4_general_ci;
    To use another collation:
    mysql> create database openedx default charset utf8mb4 collate utf8mb4_unicode_520_ci;
  1. Import the schema and the data into this newly created database:
    mysql> use openedx;
    mysql> source openedx_schema.sql
    mysql> source openedx_data.sql
  1. Dump your newly created database and import it into docker (to be continued…), see also Backup and restore a mysql database from a running Docker mysql container · GitHub

There sure will be other methods for transforming databases from utf8 to utf8mb4, but I found my method foolproof (used it long time ago to transform some really big databases from latin1 to utf8). E.g. here is another (untested) approach: How to support full Unicode in MySQL databases · Mathias Bynens

Some other observations:

  1. Maybe this is not needed for openedx installations that don’t cater to asian languages / users.
  2. The possibility to show emojis probably is not very important (as long as you aren’t providing courses for school kids).
  3. But the possibility to show e.g. math symbols directly in unicode instead of resorting to software as LaTEX can be very interesting for math-heavy (science / engineering) courses.

You got me right there :slight_smile: Tutor very much caters to everyone, including Asian users.

I can confirm the issue. With the “notes” plugin enabled, when I create a note with the text “:poop:” it triggers the following error:

notes_1          | Traceback (most recent call last):
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/core/handlers/exception.py", line 34, in inner
notes_1          |     response = get_response(request)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/core/handlers/base.py", line 115, in _get_response
notes_1          |     response = self.process_exception_by_middleware(e, request)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/core/handlers/base.py", line 113, in _get_response
notes_1          |     response = wrapped_callback(request, *callback_args, **callback_kwargs)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
notes_1          |     return view_func(*args, **kwargs)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/views/generic/base.py", line 71, in view
nginx_1          | 172.18.0.4 - - [14/Jun/2021:09:36:45 +0000] "PUT /api/v1/annotations/1/ HTTP/1.1" 500 27 "https://demo.openedx.overhang.io/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0" "80.67.176.14"
notes_1          |     return self.dispatch(request, *args, **kwargs)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/rest_framework/views.py", line 509, in dispatch
notes_1          |     response = self.handle_exception(exc)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/rest_framework/views.py", line 469, in handle_exception
notes_1          |     self.raise_uncaught_exception(exc)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/rest_framework/views.py", line 480, in raise_uncaught_exception
notes_1          |     raise exc
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/rest_framework/views.py", line 506, in dispatch
notes_1          |     response = handler(request, *args, **kwargs)
notes_1          |   File "/openedx/edx-notes-api/notesapi/v1/views.py", line 585, in put
notes_1          |     note.save()
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/models/base.py", line 743, in save
notes_1          |     self.save_base(using=using, force_insert=force_insert,
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/models/base.py", line 780, in save_base
notes_1          |     updated = self._save_table(
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/models/base.py", line 853, in _save_table
notes_1          |     updated = self._do_update(base_qs, using, pk_val, values, update_fields,
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/models/base.py", line 903, in _do_update
notes_1          |     return filtered._update(values) > 0
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/models/query.py", line 760, in _update
notes_1          |     return query.get_compiler(self.db).execute_sql(CURSOR)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/models/sql/compiler.py", line 1471, in execute_sql
notes_1          |     cursor = super().execute_sql(result_type)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/models/sql/compiler.py", line 1142, in execute_sql
notes_1          |     cursor.execute(sql, params)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/backends/utils.py", line 67, in execute
notes_1          |     return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/backends/utils.py", line 76, in _execute_with_wrappers
notes_1          |     return executor(sql, params, many, context)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/backends/utils.py", line 84, in _execute
notes_1          |     return self.cursor.execute(sql, params)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/utils.py", line 89, in __exit__
notes_1          |     raise dj_exc_value.with_traceback(traceback) from exc_value
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/backends/utils.py", line 84, in _execute
notes_1          |     return self.cursor.execute(sql, params)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/backends/mysql/base.py", line 71, in execute
notes_1          |     return self.cursor.execute(query, args)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/MySQLdb/cursors.py", line 206, in execute
notes_1          |     res = self._query(query)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/MySQLdb/cursors.py", line 319, in _query
notes_1          |     db.query(q)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/MySQLdb/connections.py", line 259, in query
notes_1          |     _mysql.connection.query(self, query)
notes_1          | django.db.utils.OperationalError: (1366, "Incorrect string value: '\\xF0\\x9F\\x92\\xA9' for column 'text' at row 1")

I agree that changing the collation should fix the issue. But I’m still anxious about changing the database collation. There could be far-reaching impacts that we do not foresee. In the native installation, MySQL runs on utf8_general_ci: configuration/main.yml at 1909bafde44584ec7bd04f7cb6ee6ac0bfa6b521 · edx/configuration · GitHub

I think we should try to work with edX on this specific issue. I would like to go to edX with a convincing use case. I think that correctly setting the user name is sufficiently important. @iamCristYe gave a convincing example in the Github issue. However, the "张𠇶 " does not print correctly in any of my computer applications – and not on my phone either. The first character (“zhang”) is fine, but the second prints as an empty square. What character is it exactly? (maybe give a link to a Chinese dictionary) Is that a frequent character/name?

There could be problems with the number of characters stored in the table indices. So yes, I think would be better to go to edX with this. See also MySQL :: MySQL 5.7 Reference Manual :: 14.23 InnoDB Limits

@insad @regis Thanks for your discussion! Sorry I didn’t see this days ago. What I want to add is that most used CJK (Chinese, Japanese and Korean) chars are included in utf8mb3, but utf8mb4 does include some other CJK chars (rarely used). And most used Math symbols are included in utf8mb3 as well, I think. So one major use for adding utf8mb4 supports would be supporting emojis. But imaging we’re hosting a course about ancient Chinese Literature or Unicode itself, such chars may be useful

As far as I understand, most course content is stored in Mongodb, which already supports characters from utf8mb4. Am I correct? If the biggest benefit is only to allow very rare characters or emojis in user names, then I’m afraid it will not warrant such a dangerous MySQL migration.

Yes most course contents in mongo are fine with emojis, and migrating current users to utf8mb4 could be dangerous, so initially in the github issue I proposed another choice: simply change the MySQL config to utf8mb4, in this way:

  1. New users will be able to use utf8mb4 and store all chars
  2. Existing users will experience exactly the same behavior as what they use to have because utf8mb3 is a subset of utf8mb4
  3. Can be better prepared for future MySQL 8 upgrade?

Pushing the utf8mb4 upgrade to all new and existing Tutor users is a little too experimental for my taste – but then I have little experience in these matters. What I suggest is that other platform maintainers try out the utf8mb4 collation in MySQL with the help of this little plugin I created: Tutor plugin for adding utf8mb4 support to MySQL · GitHub
If this plugin works then we will be able to add utf8mb4 support to Open edX.

What do other @maintainers think?

EDIT: I tested this plugin and I am still unable to create notes with the “:poop:” character in them. I keep getting that error from the notes container:

notes_1                       | django.db.utils.OperationalError: (1366, "Incorrect string value: '\\xF0\\x9F\\x92\\xA9' for column 'text' at row 1")

It looks like django-mysql needs extra configuration to get utf8mb4 to work: Checks — Django-MySQL 3.11.1 documentation

@insad: you said earlier that you were “absolutely sure” that switching the mysql collation would fix the issue. But did you actually test it? If yes, how?

You have to transform your database also… only setting charset and collation to utf8mb4 doesn’t work with a database created with charset utf8 !!

I would suggest you rm .local/share/tutor and run quickstart again, as mysql already there will not change its collation if you don’t export and reimport. BTW nice new avatar

@iamCristYe Exactly, I was describing how you could export / transform / reimport an existing database in utf8 encoding…

Yes, this is exactly what I did already.

Do you mean that the collation needs to be explicitely defined for newly created databases? All these steps make me wonder: did you actually get Open edX to work with an utf8mb4 database, and did it fix the notes issue?

after adding "charset": "utf8mb4", to tutor/auth.json at v12.0.1 · overhangio/tutor · GitHub I’m able to set :man_technologist: as my full name. However I don’t see a place to set this for notes


OK should be this line: tutor-notes/tutor.py at master · overhangio/tutor-notes · GitHub


Now I can add “:woman_factory_worker:🧑‍🏭:man_factory_worker::woman_technologist:🧑‍💻:man_technologist::woman_office_worker:🧑‍💼:man_office_worker::woman_mechanic:🧑‍🔧:man_mechanic::woman_scientist:🧑‍🔬:man_scientist::woman_artist:🧑‍🎨:man_artist::woman_firefighter:🧑‍🚒:man_firefighter::woman_pilot:🧑‍✈:man_pilot::woman_astronaut:🧑‍🚀:man_astronaut:” to notes


BTW, when looking into the dumped mysql, I noticed the commands at tutor/local.rst at v12.0.1 · overhangio/tutor · GitHub will dump mysql internal tables like information_schema as well, which may not be desirable. What do you think? I’d be happy to make a PR.

I can confirm that I am able to add emojis to student notes after adding DATABASES["default"]["OPTIONS"]["charset"] = "utf8mb4" to the notes settings.

I think it be would be worth it to upgrade to utf8mb4. But I’m worried about the following item (from this blog post which you linked to earlier):

When converting from utf8 to utf8mb4, the maximum length of a column or index key is unchanged in terms of bytes. Therefore, it is smaller in terms of characters, because the maximum length of a character is now four bytes instead of three.

I seem to remember that such an error already occurred in Open edX – though I can’t remember where or when. Given the size of edx-platform, it’s very likely that some TEXT or TINYTEXT columns contain values that would exceed the maximum size if we were to upgrade to utf8mb4.

Therefore, further investigation is required… I do not have time to lead that investigation. If someone else is interested, I’ll be happy to learn about the result of their enquiries. If not, I guess we will have to wait until edX (or the yet-to-be-named-non-profit-that-will-be-created-after-edX) upgrades to mysql 8, where utf8mb4 is the default.

Indeed better wait for mysql 8… for long varchar and indexes you need to set row_format to dynamic:

create table my_table (
......
) row_format=dynamic;

Ref. MySQL :: MySQL 5.7 Reference Manual :: 14.11 InnoDB Row Formats

Considering the enormous number of tables in the database, it’s not very wise trying to do that by ourselves…

Btw, should we be happy about edX and 2U to Join Together in Industry-Redefining Combination ? Will all effort put in the future in the yet-to-named-non-profit go make profit for U2 shareholders, and put off people to spend their time to this?