Utfmb4 instead of utf8mb3? (#442)

@insad @regis Thanks for your discussion! Sorry I didn’t see this days ago. What I want to add is that most used CJK (Chinese, Japanese and Korean) chars are included in utf8mb3, but utf8mb4 does include some other CJK chars (rarely used). And most used Math symbols are included in utf8mb3 as well, I think. So one major use for adding utf8mb4 supports would be supporting emojis. But imaging we’re hosting a course about ancient Chinese Literature or Unicode itself, such chars may be useful

As far as I understand, most course content is stored in Mongodb, which already supports characters from utf8mb4. Am I correct? If the biggest benefit is only to allow very rare characters or emojis in user names, then I’m afraid it will not warrant such a dangerous MySQL migration.

Yes most course contents in mongo are fine with emojis, and migrating current users to utf8mb4 could be dangerous, so initially in the github issue I proposed another choice: simply change the MySQL config to utf8mb4, in this way:

  1. New users will be able to use utf8mb4 and store all chars
  2. Existing users will experience exactly the same behavior as what they use to have because utf8mb3 is a subset of utf8mb4
  3. Can be better prepared for future MySQL 8 upgrade?

Pushing the utf8mb4 upgrade to all new and existing Tutor users is a little too experimental for my taste – but then I have little experience in these matters. What I suggest is that other platform maintainers try out the utf8mb4 collation in MySQL with the help of this little plugin I created: Tutor plugin for adding utf8mb4 support to MySQL · GitHub
If this plugin works then we will be able to add utf8mb4 support to Open edX.

What do other @maintainers think?

EDIT: I tested this plugin and I am still unable to create notes with the “:poop:” character in them. I keep getting that error from the notes container:

notes_1                       | django.db.utils.OperationalError: (1366, "Incorrect string value: '\\xF0\\x9F\\x92\\xA9' for column 'text' at row 1")

It looks like django-mysql needs extra configuration to get utf8mb4 to work: Checks — Django-MySQL 3.11.1 documentation

@insad: you said earlier that you were “absolutely sure” that switching the mysql collation would fix the issue. But did you actually test it? If yes, how?

You have to transform your database also… only setting charset and collation to utf8mb4 doesn’t work with a database created with charset utf8 !!

I would suggest you rm .local/share/tutor and run quickstart again, as mysql already there will not change its collation if you don’t export and reimport. BTW nice new avatar

@iamCristYe Exactly, I was describing how you could export / transform / reimport an existing database in utf8 encoding…

Yes, this is exactly what I did already.

Do you mean that the collation needs to be explicitely defined for newly created databases? All these steps make me wonder: did you actually get Open edX to work with an utf8mb4 database, and did it fix the notes issue?

after adding "charset": "utf8mb4", to tutor/auth.json at v12.0.1 · overhangio/tutor · GitHub I’m able to set :man_technologist: as my full name. However I don’t see a place to set this for notes


OK should be this line: tutor-notes/tutor.py at master · overhangio/tutor-notes · GitHub


Now I can add “:woman_factory_worker:🧑‍🏭:man_factory_worker::woman_technologist:🧑‍💻:man_technologist::woman_office_worker:🧑‍💼:man_office_worker::woman_mechanic:🧑‍🔧:man_mechanic::woman_scientist:🧑‍🔬:man_scientist::woman_artist:🧑‍🎨:man_artist::woman_firefighter:🧑‍🚒:man_firefighter::woman_pilot:🧑‍✈:man_pilot::woman_astronaut:🧑‍🚀:man_astronaut:” to notes


BTW, when looking into the dumped mysql, I noticed the commands at tutor/local.rst at v12.0.1 · overhangio/tutor · GitHub will dump mysql internal tables like information_schema as well, which may not be desirable. What do you think? I’d be happy to make a PR.

I can confirm that I am able to add emojis to student notes after adding DATABASES["default"]["OPTIONS"]["charset"] = "utf8mb4" to the notes settings.

I think it be would be worth it to upgrade to utf8mb4. But I’m worried about the following item (from this blog post which you linked to earlier):

When converting from utf8 to utf8mb4, the maximum length of a column or index key is unchanged in terms of bytes. Therefore, it is smaller in terms of characters, because the maximum length of a character is now four bytes instead of three.

I seem to remember that such an error already occurred in Open edX – though I can’t remember where or when. Given the size of edx-platform, it’s very likely that some TEXT or TINYTEXT columns contain values that would exceed the maximum size if we were to upgrade to utf8mb4.

Therefore, further investigation is required… I do not have time to lead that investigation. If someone else is interested, I’ll be happy to learn about the result of their enquiries. If not, I guess we will have to wait until edX (or the yet-to-be-named-non-profit-that-will-be-created-after-edX) upgrades to mysql 8, where utf8mb4 is the default.

Indeed better wait for mysql 8… for long varchar and indexes you need to set row_format to dynamic:

create table my_table (
......
) row_format=dynamic;

Ref. MySQL :: MySQL 5.7 Reference Manual :: 14.11 InnoDB Row Formats

Considering the enormous number of tables in the database, it’s not very wise trying to do that by ourselves…

Btw, should we be happy about edX and 2U to Join Together in Industry-Redefining Combination ? Will all effort put in the future in the yet-to-named-non-profit go make profit for U2 shareholders, and put off people to spend their time to this?

Yes, I think we should be very happy. Let’s move the conversation here: edX sold to 2U, a SaaS platform - #2 by regis

Some interesting links I found:

https://openedx.atlassian.net/browse/OSPR-5477

https://openedx.atlassian.net/browse/OSPR-5465

https://openedx.atlassian.net/browse/CRI-99

https://openedx.atlassian.net/browse/OSPR-5447

And also

2 Likes

@regis what do you think?

I have a naive question: why is that not desirable?

You have a lot of flags you can use with mysqldump, why not put simply a link to the corresponding doc page MySQL :: MySQL 5.7 Reference Manual :: 4.5.4 mysqldump — A Database Backup Program ?

Maybe because they do not actually contains data of openedx? Also not sure when importing the dumped mysql to another db, would dropping mysql database tables be OK, like

DROP TABLE IF EXISTS `db`;
DROP TABLE IF EXISTS `user`;

Migrating data to a separate MySQL database is a completely different topic than backing up – which is the topic that is covered in the docs linked above. Let’s stay on topic here; if you’d like to talk about data migration, please create a separate topic.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.