MySQL and Unicode

I used MySQL for a great many projects over the years with the assumption that a charset of utf8 and a collation of utf8_unicode_ci was going to support all of UTF-8 and that was all I need to do. I was sorely mistaken but there was no point in writing until now, because MySQL 5.5 has finally helped rectify the issue. Up until MySQL 5.5 (released December of 2010) the UTF-8 support was severely hobbled. With MySQL 5.5 the server can now support the full range of characters that UTF-8 allows but it’s not the default behavior. There are still plenty of pitfalls for the naïve developer starting out with MySQL.

The root cause and how the problems surface

If you work on internationalization issues and use MySQL you’ve probably run into this. If not then it’s only a matter of time until you run into it. This is the dirty little secret of MySQL before 5.5 … it only supports characters within the BMP (more on what that means in a moment). With Unicode 6.0 over 50% of all characters are outside of the BMP so that’s a big deal. To me it was a big deal before, but with Unicode 6.0 it’s bigger and now people are starting to take notice. Supporting half of a character set is almost worse than not supporting it since it’s difficult to figure out what’s going to happen when you send data to the server.

Most of the bugs caused by this poor UTF-8 support don’t surface in English, and that’s why many developers remain blissfully unaware of the problem. Having said that, this is not something that’s only found when user’s write in Deseret or Egyptian Hieroglyphs (PDF), it also comes into play with Japanese names and some new Unicode characters like Emoji.

The MySQL defaults also have an effect on Unicode processing that affects the majority of European languages you’re likely the try and target. This unsafe default  has a different root cause but I want to cover it here so the two issues are not confused with each other. Here are how some of the most common issue surface:

  • Attempting to update a record where a varchar column changed only by a single accent will return success but not save the change
    • This makes spelling corrections in record seem to “not save”.
  • A row can be inserted successfully but when it is re-read the varchar column is empty or missing characters.
    • This is the case with the Ruby driver at least, probably others.
  • Unique constraints on varchar columns will fail despite different text if the differences are all outside of the BMP.
  • You can’t store the new Emoji characters, or they seem to disappear.

The supplementary character issue

The crux of the UTF-8 issue is rooted in the fact that UTF-8 is a variable-width encoding. The MySQL 5.1 implementation (like many others) was built on the assumption that a UTF-8 encoded character would use between 1 and 3 bytes. The beauty of UTF-8 is that it is compatible with ASCII (and by proxy the beginning of ISO-8859-1, which is also ASCII compatible) for the most common characters in English, using only 1 byte, but that the variable-width encoding scheme lets it also support the rest of Unicode. Here’s the rub: “the rest of Unicode” is an expanding set and when it passed a certain boundary UTF-8 needed 4 bytes to handle it. Let’s have an example:


1 byte  (0000-007F): "A"  ➔ 0x41
2 bytes (0080-07FF): "Ж" ➔ 0xD0 0x96
3 bytes (0800-FFFF): "龍" ➔ 0xE9 0xBE 0x8D
4 bytes (> FFFF)   : "𐐒" ➔ 0xF0 0x90 0x90 0x92

As you can see above, that last character takes 4 bytes in UTF-8 while MySQL only expects 1-3 bytes per character. Without getting into too many Unicode details it’s important for our discussion to know that the Unicode characters are logically divided into “planes“. The first plane is the Basic Multilingual Plane (BMP) and what you probably think of as “Unicode”. This is all of the characters between U+0000 and U+FFFF and covers most of the characters for the major languages of the world. When Unicode starting using the additional planes it started addressing characters beyond U+FFFF and UTF-8 expanded to handle that.

The MySQL 5.1 CHARSETutf8” and the utf8_* collations were only able to handle 1-3 bytes so they had to do something. That something was mentioned depends on your client driver. In Ruby the insert succeeds but the data is discarded. In Java an exception is thrown, which at least alerts you to the problem. With the addition of Emoji in Unicode 6.0 this problem has a very high profile failure case when the text is Emoji-only. For Japanese mobile phone users this not an uncommon use case, and you can expect it to get more common in the US very soon.

The problem with accents

The problem I mentioned above with accent changes not taking effect is actually a configuration problem rather than a MySQL bug. This has to do with the common misunderstandings around MySQL collation and specifically the utf_general_ci/utf_uncode_ci collation sequences. This is probably best illustrated with an example:


mysql> select "bar" = "bär" COLLATE utf8_unicode_ci\G
*************************** 1. row ***************************
"bar" = "bär" COLLATE utf8_unicode_ci: 1
mysql> select "bar" = "bär" COLLATE utf8_general_ci\G
*************************** 1. row ***************************
"bar" = "bär" COLLATE utf8_general_ci: 1
mysql> select "bar" = "bär" COLLATE utf8_bin\G
*************************** 1. row ***************************
"bar" = "bär" COLLATE utf8_bin: 0

If your default collation is utf_general_ci or utf_uncode_ci then your database thinks “bar” and “bär” are the same word. If you have unique constraints on a column this is clearly problematic but this can also cause problems without that. This is the root cause of the “accent changes won’t save” problem. When updating a record it appears MySQL (or at least InnoDB) checks for equality before updating a record. Since and accent-only change is considered equal by the collation MySQL skips the write (which saves I/O overhead) and returns success since it thinks it optimized a write rather than failing.

English speaking developers very often assume that MySQL is mature and will “just work” in every language. The default character set for MySQL is latin1 (rather than utf8) and many developers have learned to dutifully change that to utf8 and utf8_general_ci when they install MySQL. This change fixes a large swath of issues but leaves behind these much more subtle bugs.

The MySQL 5.1 guidelines and work arounds

If you need to stay on MySQL 5.1 there are a few things worth keeping in mind.

  • Using utf8_bin for collation will solve the accent issues, which is probably what you thought the other utf8_* collations where doing. The downside to this is that utf8_bin and ORDER BY will not be in language-specific order. You’ll have to make the choice for yourself depending on your needs but my advice would be that if you don’t know you should go with utf8_bin.
  • If you want to fully support non-BMP characters your only recourse is to convert your char/varchar column in question to a binary/varbinary and make sure you handle the character encoding correctly on the way in and out. If you use UTF-8 be sure to allow 4 bytes per character (as discussed above). If you use UTF-16, make sure you allow 4-bytes to handle surrogate pairs. The downside to this is the subtle differences in how binary and varbinary columns are stored and you should fully understand that and the performance implications before making any changes.

The MySQL 5.5 fix

MySQL 5.5 was just released with new character sets for utf8mb4, utf16 and utf32. While utf16 and utf32 support is certainly welcomed I want to focus on utf8mb4 since my assumption is that you want ASCII compatibility. The newly added utf8mb4 character set is a superset of the utf8 character set that can store up to 4 bytes per character. With 4 bytes per character utf8mb4 should be able to store all 16 planes, including planes 4-13 which are currently empty and given the known writing systems in the world are not expected to be filled.

The addition of utf8mb4 makes me breath a little easier. You still have to know your collations to not reproduce the accent problem but I trust you can remember that small piece of Unicode/MySQL knowledge. If not, maybe you’ll google again in the future an find this handy command:

ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;

About these ads

10 thoughts on “MySQL and Unicode

  1. Kenneth says:

    Interesting. Do people really still use Egyptian Hieroglyphs? Awesome. :D

    Does the accent problem leave sites open to attack? I can imagine registering on a site with an email address which is a duplicate of an existing address, but with an accent over one character. The database might then update/read from the wrong record. It’s an edge case, sure, because it would depend on deep knowledge of the schema, but sounds possible if validation isn’t correctly applied.

    The biggest headache with UTF-8 storage for me was sorting Japanese strings. How do you sort those, when your collation is set to utf8_bin?

  2. Ram Viswanadha says:

    “The beauty of UTF-8 is that it is compatible with ASCII and ISO-8859-1 for the most common characters”
    Uhh .. a nit pick UTF-8 is not compatible with ISO-8859-1. ISO-8859-1 is from 0x00-0xFF which UTF-8 is only compatible with ASCII (0x00-0x7F). All code points in ISO-8859-1 from 0x80-0xFF need variable width encoding.

  3. mzsanford says:

    Hi Kenneth,

    On Security:

    Interesting point on the security vulnerability. I can assume that most Ruby on Rails projects are running in some variant of this configuration so it requires less schema knowledge than you would think. The accented update case I used above does the lookup by ID so you cant force an incorrect lookup, but assuming you (a) allowed people to register with a foo@éxample.com, and (b) then did a lookup by email that ordered them such that the older came first and (c) did not error in the event of multiple results this could surface as a security issue. See:


    select * from foo where text = 'foo@example.com' order by id desc;
    +----+------------------+
    | id | text |
    +----+------------------+
    | 10 | foo@éxample.com |
    | 9 | foo@example.com |
    +----+------------------+

    For point (b) I imagine an older account updating an email address would have the effect. For point (c) that is the default in ActiveRecord’s find(:first) method so again, knowing someone runs Rails let’s me know they could be open to this. On point (a), input validation saves the day. Yet again.

    On Sorting:

    You can alter the collation at select time using the COLLATE keyword. This allows you to store anything but sort for display in a context sensitive way. My Japanese collation knowledge is minimal but I pulled the utf8_general_ci, utf8_unicode_ci and utf8_bin results for some Hiragana, Katakana and Kanji samples and pasted them side-by-side for a comparison:


    Select command: select * from foo order by text collate utf8_XXX;

    utf8_general_ci utf8_unicode_ci utf8_bin
    +----+------+ +----+------+ +----+------+
    | id | text | | id | text | | id | text |
    +----+------+ +----+------+ +----+------+
    | 1 | の | | 2 | コ | | 1 | の |
    | 4 | ば | | 3 | コ | | 4 | ば |
    | 2 | コ | | 1 | の | | 2 | コ |
    | 3 | コ | | 4 | ば | | 3 | コ |
    | 5 | 国 | | 5 | 国 | | 5 | 国 |
    | 6 | 部 | | 6 | 部 | | 6 | 部 |
    +----+------+ +----+------+ +----+------+

  4. mzsanford says:

    Hi Ram:

    You are correct. I meant to say that it was compatible with the most common characters form ISO-8859-1 (which also strove for ASCII overlap). I’ll clarify that in the post … thanks for picking my nit :)

  5. Quora says:

    Which open source databases boast complete Unicode support?…

    While Unicode support is probably not the only factor to use in deciding on a database, when it comes to Unicode support in a major open source RDBMS (MySQL or Postgres) it seems that Postgres has better support if configured correctly (See the post ab…

  6. […] to 4 bytes for each character, and MySQL 5,1 uses by default 3 bytes for each character (this has changed in MySQL 5.5), then the limit for this value is 255 characters, or 255*3 bytes and (255*3) is less than […]

  7. […] versions of MySQL will throw a fit if you try to insert four-byte UTF-8 characters. If your find yourself importing a ticketing system into JIRA and run into an error that looks like […]

  8. […] versions of MySQL will throw a fit if you try to insert four-byte UTF-8 characters. If your find yourself importing a ticketing system into JIRA and run into an error that looks like […]

  9. Emanuel says:

    Hey Matt,
    I really appreciate you writing some regarding this.
    Ive had a hard time trying to get a web application to work cross-browser….. …. ..
    It’s database driven and based on php and javascript.

    Since i’m a swede we have these lovely characters å ä ö and more that keeps on mekong my life difficult. Ive rebuilt the whole db (utf8 and utf8_unicode_ci charset for better sorting) and rewritten all code to include headers with utf-8, and saved files as the same.

    But still I have problems….

    My headache now is FF and Safari that both throws me an error when trying to retrieve data from the db. ( Incorrect string value: ‘\xC4RNA’ for column ‘USTYP’ at row 1 )

    Its for the swedish letter Ä.

    tried to use the ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_bin; to see if that would help, but no.

    Do I need to whipe out all the stored data and reinsert it for this to work?

    Regards and thanks for writing this up.

    /Emanuel.

  10. mzsanford says:

    The alter table looks correct and should alter the existing data (if possible) though it can cause some truncation issues. Can you connect from the command line mysql client? If so I would be curious if `select * from table_name where id = X` (the where clause should find just the bad row) exhibits the same issue. I’ve seen both Ruby and Java have issues where the client connection is using latin1, which the command line can bypass just to give you a good baseline. If it won’t work from the command line mysql then the issue needs to be fixed on the Mysql side. If it can, the it’s between your app and Mysql.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.