Utfmb4 instead of utf8mb3? (#442)

I understand the problem mentioned in #442.

Default UTF8 encoding / collation in MySQL doesn’t allow storing 4 byte multi-byte unicode characters, like emojis (and I suppose also like many asian characters like kanji).

Encoding and collation are defined in github edx/configuration in files:

[playbooks/roles/mysql/templates/default_character_sets_and_collations.cnf.j2](https://github.com/edx/configuration/blob/56b32423efc3b29309f4d6e475dd296e13c55cad/playbooks/roles/mysql/templates/default_character_sets_and_collations.cnf.j2)

and

[playbooks/roles/mysql/defaults/main.yml](https://github.com/edx/configuration/blob/61dedaa048f8d9430edbaef6005fad802f75003f/playbooks/roles/mysql/defaults/main.yml)

I’m nowadays using for software I build for my customers:

DEFAULT_MYSQL_CHARACTER_SET: utf8mb4
DEFAULT_MYSQL_COLLATION: utf8mb4_unicode_520_ci

Which permits storing of those characters.

Would be nice to be able to configure this in Tutor, though to change this for existings databases / tables, an export and a new import would be needed.

More information e.g. How Do I Use the utf8mb4 Character Set to Store Emojis in an RDS for MySQL DB Instance?_Relational Database Service_FAQs_Database Parameter Modification_HUAWEI CLOUD

Thanks for the feedback @insad.

  1. Can you confirm that switching the mysql character set and collation fixes the issues described in #442?
  2. What’s the standard procedure for migrating from one character set/collation to another?
  3. Are there any side effects with using the utf8mb4 character set?

I would not be too keen on migrating the mysql database of all tutor users, unless the benefits are substantial. Also, it seems to me that it’s quite possible for any user to lauch their own MySQL server, configure it as they see fit and simply bypass the mysql server that ships with tutor (RUN_MYSQL=false). Finally, I think that this should be achievable quite easily with a plugin. Would you be interested in creating such a plugin?

  1. Absolutely sure, even (bad software as) Wordpress is using this by default now. Also mysql 8 is now using by default a charset of utf8mb4 with a collation of utf8mb4_0900_ca_ci (this collation is not supported on mysql 5.7.32 used by tutor) (see also What is the utf8mb4_0900_ai_ci collation? - Monolune).

  2. Manually could export the database ‘openedx’ from docker, transform it from utf8 to utf8mb4, and import it into a new database on my server (didn’t try to import the newly create database into the live docker mysql). See below for a step-by-step receipt.

  3. I don’t have a good understanding still of the structure of tutor, less of the inner workings of docker… so I don’t feel confident to create a plugin. Looking at openedx / tutor for launching our e-learning platform in South America after I can travel there again after this covid hype, but still need studying django and some devops stuff…

Step-by-step receipt

Basically:

  • dump separately the schema and the data of an existing database, transforming them in the process
  • drop the existing database
  • recreate the database with the new charset and collation
  • import the schema back into this database
  • import the data back into this database

Here the process to get a copy of a database in the utf8mb4 charset on your ubuntu server (which you can dump and import into the mysql running in docker):

  1. Lookup the mysql root password of your tutor installation in config.yml:

    MYSQL_ROOT_PASSWORD: XXXXXXXX

  2. Open a bash shell (running mysql docker instance):

    docker exec -it tutor_local_mysql_1 bash

  3. Launch the mysql console:

    mysql -uroot -pXXXXXXXX

  4. Show the databases:

    mysql> show databases;
    +--------------------+
    | Database           |
    +--------------------+
    | information_schema |
    | discovery          |
    | ecommerce          |
    | mysql              |
    | notes              |
    | openedx            |
    | performance_schema |
    | sys                |
    | xqueue             |
    +--------------------+
    9 rows in set (0.10 sec)
  1. Show for instance the “openedx” database create statement:
    mysql> show create database openedx;
    +----------+------------------------------------------------------------------+
    | Database | Create Database                                                  |
    +----------+------------------------------------------------------------------+
    | openedx  | CREATE DATABASE `openedx` /*!40100 DEFAULT CHARACTER SET utf8 */ |
    +----------+------------------------------------------------------------------+
    1 row in set (0.00 sec)
  1. Show the tables in “openedx”:
    mysql> show tables from openedx;
    +------------------------------------------------------------------+
    | Tables_in_openedx                                                |
    +------------------------------------------------------------------+
    | announcements_announcement                                       |
    | api_admin_apiaccessconfig                                        |
    | api_admin_apiaccessrequest                                       |
    | assessment_assessment                                            |
    | ............................                                     |
    | xblock_django_xblockstudioconfiguration                          |
    | xblock_django_xblockstudioconfigurationflag                      |
    +------------------------------------------------------------------+
    427 rows in set (0.00 sec)
  1. Exit mysql console and docker bash terminal (exit / exit).

  2. Dump / transform the schema and the data, e.g. from the ‘openedx’ database (repeat for ‘discovery’ ‘ecommerce’, …):

    docker exec tutor_local_mysql_1 /usr/bin/mysqldump -Q -d -uroot -pXXXXXXXX --default-character-set=utf8 --skip-set-charset openedx | sed 's/utf8/utf8mb4/gi' > openedx_schema.sql

    docker exec tutor_local_mysql_1 /usr/bin/mysqldump -Q --insert-ignore -t -uroot -pXXXXXXXX --default-character-set=utf8 --skip-set-charset openedx > openedx_data.sql
  1. Open the mysql console on your server:

    sudo mysql

  2. Create a new ‘openedx’ database with charset utf8mb4 and the desired collation. To see the available collations you can enter:

    mysql> show collation;

    The default collation on mysql 5.7.32 is ‘utf8mb4_general_ci’, to use this you can create the new database with:

    mysql> create database openedx default charset utf8mb4 collate utf8mb4_general_ci;
    To use another collation:
    mysql> create database openedx default charset utf8mb4 collate utf8mb4_unicode_520_ci;
  1. Import the schema and the data into this newly created database:
    mysql> use openedx;
    mysql> source openedx_schema.sql
    mysql> source openedx_data.sql
  1. Dump your newly created database and import it into docker (to be continued…), see also Backup and restore a mysql database from a running Docker mysql container · GitHub

There sure will be other methods for transforming databases from utf8 to utf8mb4, but I found my method foolproof (used it long time ago to transform some really big databases from latin1 to utf8). E.g. here is another (untested) approach: How to support full Unicode in MySQL databases · Mathias Bynens

Some other observations:

  1. Maybe this is not needed for openedx installations that don’t cater to asian languages / users.
  2. The possibility to show emojis probably is not very important (as long as you aren’t providing courses for school kids).
  3. But the possibility to show e.g. math symbols directly in unicode instead of resorting to software as LaTEX can be very interesting for math-heavy (science / engineering) courses.

You got me right there :slight_smile: Tutor very much caters to everyone, including Asian users.

I can confirm the issue. With the “notes” plugin enabled, when I create a note with the text “:poop:” it triggers the following error:

notes_1          | Traceback (most recent call last):
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/core/handlers/exception.py", line 34, in inner
notes_1          |     response = get_response(request)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/core/handlers/base.py", line 115, in _get_response
notes_1          |     response = self.process_exception_by_middleware(e, request)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/core/handlers/base.py", line 113, in _get_response
notes_1          |     response = wrapped_callback(request, *callback_args, **callback_kwargs)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
notes_1          |     return view_func(*args, **kwargs)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/views/generic/base.py", line 71, in view
nginx_1          | 172.18.0.4 - - [14/Jun/2021:09:36:45 +0000] "PUT /api/v1/annotations/1/ HTTP/1.1" 500 27 "https://demo.openedx.overhang.io/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0" "80.67.176.14"
notes_1          |     return self.dispatch(request, *args, **kwargs)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/rest_framework/views.py", line 509, in dispatch
notes_1          |     response = self.handle_exception(exc)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/rest_framework/views.py", line 469, in handle_exception
notes_1          |     self.raise_uncaught_exception(exc)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/rest_framework/views.py", line 480, in raise_uncaught_exception
notes_1          |     raise exc
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/rest_framework/views.py", line 506, in dispatch
notes_1          |     response = handler(request, *args, **kwargs)
notes_1          |   File "/openedx/edx-notes-api/notesapi/v1/views.py", line 585, in put
notes_1          |     note.save()
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/models/base.py", line 743, in save
notes_1          |     self.save_base(using=using, force_insert=force_insert,
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/models/base.py", line 780, in save_base
notes_1          |     updated = self._save_table(
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/models/base.py", line 853, in _save_table
notes_1          |     updated = self._do_update(base_qs, using, pk_val, values, update_fields,
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/models/base.py", line 903, in _do_update
notes_1          |     return filtered._update(values) > 0
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/models/query.py", line 760, in _update
notes_1          |     return query.get_compiler(self.db).execute_sql(CURSOR)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/models/sql/compiler.py", line 1471, in execute_sql
notes_1          |     cursor = super().execute_sql(result_type)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/models/sql/compiler.py", line 1142, in execute_sql
notes_1          |     cursor.execute(sql, params)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/backends/utils.py", line 67, in execute
notes_1          |     return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/backends/utils.py", line 76, in _execute_with_wrappers
notes_1          |     return executor(sql, params, many, context)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/backends/utils.py", line 84, in _execute
notes_1          |     return self.cursor.execute(sql, params)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/utils.py", line 89, in __exit__
notes_1          |     raise dj_exc_value.with_traceback(traceback) from exc_value
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/backends/utils.py", line 84, in _execute
notes_1          |     return self.cursor.execute(sql, params)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/django/db/backends/mysql/base.py", line 71, in execute
notes_1          |     return self.cursor.execute(query, args)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/MySQLdb/cursors.py", line 206, in execute
notes_1          |     res = self._query(query)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/MySQLdb/cursors.py", line 319, in _query
notes_1          |     db.query(q)
notes_1          |   File "/usr/local/lib/python3.8/dist-packages/MySQLdb/connections.py", line 259, in query
notes_1          |     _mysql.connection.query(self, query)
notes_1          | django.db.utils.OperationalError: (1366, "Incorrect string value: '\\xF0\\x9F\\x92\\xA9' for column 'text' at row 1")

I agree that changing the collation should fix the issue. But I’m still anxious about changing the database collation. There could be far-reaching impacts that we do not foresee. In the native installation, MySQL runs on utf8_general_ci: configuration/main.yml at 1909bafde44584ec7bd04f7cb6ee6ac0bfa6b521 · edx/configuration · GitHub

I think we should try to work with edX on this specific issue. I would like to go to edX with a convincing use case. I think that correctly setting the user name is sufficiently important. @iamCristYe gave a convincing example in the Github issue. However, the "张𠇶 " does not print correctly in any of my computer applications – and not on my phone either. The first character (“zhang”) is fine, but the second prints as an empty square. What character is it exactly? (maybe give a link to a Chinese dictionary) Is that a frequent character/name?

There could be problems with the number of characters stored in the table indices. So yes, I think would be better to go to edX with this. See also MySQL :: MySQL 5.7 Reference Manual :: 14.23 InnoDB Limits

@insad @regis Thanks for your discussion! Sorry I didn’t see this days ago. What I want to add is that most used CJK (Chinese, Japanese and Korean) chars are included in utf8mb3, but utf8mb4 does include some other CJK chars (rarely used). And most used Math symbols are included in utf8mb3 as well, I think. So one major use for adding utf8mb4 supports would be supporting emojis. But imaging we’re hosting a course about ancient Chinese Literature or Unicode itself, such chars may be useful

As far as I understand, most course content is stored in Mongodb, which already supports characters from utf8mb4. Am I correct? If the biggest benefit is only to allow very rare characters or emojis in user names, then I’m afraid it will not warrant such a dangerous MySQL migration.

Yes most course contents in mongo are fine with emojis, and migrating current users to utf8mb4 could be dangerous, so initially in the github issue I proposed another choice: simply change the MySQL config to utf8mb4, in this way:

  1. New users will be able to use utf8mb4 and store all chars
  2. Existing users will experience exactly the same behavior as what they use to have because utf8mb3 is a subset of utf8mb4
  3. Can be better prepared for future MySQL 8 upgradation?