Java, UTF-8, and MySQL

I ran into a little I18N problem with Java and MySQL the other day. We’re working to allow YouService to handle international characters. Our first test was to enter Kanji characters into the client and then check whether we got Kanji characters back out when other clients synced.

To do this, the client sends a request to our HTTP API on the server. The server then stores these changes as transactions in a database. When other clients that have permission to the data sync with the server, they pull the transactions out and update their own local PIM records.

This seemed pretty straightforward: what we put in, we wanted to get out. If someone enters Kanji characters, we want to hand Kanji characters back out to everyone. The problem was that after entering Kanji, you got back gibberish. So it was time to fire up the debugger and figure out where the problem occurred.

  • We determined that the client was sending properly URL-encoded UTF-8 data to the server.
  • The server was then able to decode the data properly.
  • The server then entered the data into MySQL tables with character support for UTF-8. The data displayed correctly when viewed in the database.
  • It turned out that the problem occurred when pulling the data back out of the database. Looking at the data in binary form, we saw that we were getting back ~130% of the bytes that we put in. Something wasn’t handling the character sets properly.

    I went through and made sure that every setting in MySQL was set for UTF-8. Even though the data we put in was UTF-8, the character set of the table was utf8, and we were explicitly asking for UTF-8 back, the JDBC driver appeared to be treating the data as Latin-1 and then converting it to UTF-8. This double conversion is what was causing the extra bytes and the gibberish. After two days of playing with the MySQL parameters, it looked like a bug in the JDBC driver, since it wasn’t honoring the directive to use the UTF-8 character set.
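To make the double conversion concrete, here is a minimal Java sketch of what we suspected was happening. It does not touch the database or the driver at all; the class and method names are made up for illustration, and using ISO-8859-1 to stand in for "latin" is an assumption (MySQL's own latin1 maps a few bytes differently). Every byte above 0x7F turns into a two-byte sequence on the second pass, so pure Kanji doubles in size while plain ASCII stays the same, which would fit mixed data growing by something like the ~130% we saw.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of our code base.
class DoubleEncodingDemo {

    // Simulates the suspected bug: correct UTF-8 bytes are misread as
    // Latin-1 (ISO-8859-1 here, by assumption) and the resulting string
    // is then encoded as UTF-8 a second time.
    static byte[] doubleEncode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String kanji = "\u6F22\u5B57"; // two Kanji characters, 3 bytes each in UTF-8
        byte[] original = kanji.getBytes(StandardCharsets.UTF_8);
        byte[] doubled = doubleEncode(kanji);

        System.out.println("bytes in:  " + original.length);
        System.out.println("bytes out: " + doubled.length);
        // Decoding the doubled bytes as UTF-8 yields the gibberish string.
        System.out.println("round trip intact: "
                + kanji.equals(new String(doubled, StandardCharsets.UTF_8)));
    }
}
```

If you see this kind of byte growth only on non-ASCII characters, a stray Latin-1 step somewhere in the pipeline is a likely suspect.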

    I finally had to resort to a workaround. Since I knew that what I wanted out was exactly what I put in, and this data wasn’t going to be searched or selected on, I dropped the String down to a byte[] and inserted it that way. On the way out I read the bytes from the result set and called new String(getBytes(1), "UTF-8"). This worked very well: now when I insert Kanji characters in the client, I get Kanji characters back out.
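Here is a rough sketch of the workaround with the database taken out, so it can run on its own. The class and method names are hypothetical. In the real code the byte[] came out of the result set via getBytes(1); on the insert side I am assuming something like PreparedStatement.setBytes, which is the standard JDBC call for binding a byte[], rather than quoting our actual code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class Utf8Bytes {

    // Before the insert: turn the String into raw UTF-8 bytes so the driver
    // has no character data to convert. The result would be bound with
    // something like PreparedStatement.setBytes (an assumption, see above).
    static byte[] toBytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // After the select: rebuild the String from the raw bytes, the same idea
    // as new String(getBytes(1), "UTF-8") on the result set.
    static String fromBytes(byte[] stored) {
        return new String(stored, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String kanji = "\u6F22\u5B57";
        byte[] stored = toBytes(kanji);      // what would go into the table
        String restored = fromBytes(stored); // what comes back out
        System.out.println("round trip intact: " + kanji.equals(restored));
    }
}
```

The trade-off is the one mentioned above: the database only sees opaque bytes, so this is only reasonable for data you never need to search or select on.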