Fun with character encoding and when to use ISO-8859 instead of UTF-8

What character encoding are you using? Most folks nowadays settle on UTF-8 for web-centric applications, but things can get squirrelly if you use this encoding and start working with non-Unicode systems. Recently, we had a situation where we took the string representation of data that started out as an array of 17 0xff bytes. In a Unicode-aware system decoding those bytes as UTF-8, this turns into a character sequence of 17 0xfffd values.

What happened? How did an array of 8-bit values get magically translated into an array of 16-bit values?

It requires a bit of digging, but the short version is that if your source system uses 8-bit characters (something like ISO 8859-n) and you decode its bytes as Unicode text, you will fail on certain byte values because they are invalid in UTF-8 (and in some other encodings, such as US-ASCII). The only thing the decoder can then do is substitute the Unicode replacement character, U+FFFD (0xfffd). For reference, the byte values 192, 193, and 245 through 255 never appear in valid UTF-8, and any other byte of 128 or above is only valid as part of a well-formed multi-byte sequence. Everything else is translated into 0xfffd.
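If you would rather find out about bad bytes than silently get 0xfffd back, one option is to decode strictly. The following is a minimal sketch, assuming Java and the standard `java.nio.charset` API; the class name `StrictUtf8` and the helper `isValidUtf8` are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: reports malformed UTF-8 instead of replacing it.
final class StrictUtf8 {

    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            // With REPORT, the decoder throws instead of emitting U+FFFD.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xff, (byte) 0xff};                  // never valid in UTF-8
        byte[] encoded = "\u00ff".getBytes(StandardCharsets.UTF_8); // U+00FF as 0xC3 0xBF
        System.out.println(isValidUtf8(raw));
        System.out.println(isValidUtf8(encoded));
    }
}
```

The `new String(bytes, charset)` constructor always replaces bad input, so a check like this has to happen before the bytes are turned into a `String`.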

So what's the solution? If you MUST do this because you are interacting with an embedded device or something else that is not Unicode-aware, the simplest approach is to use an ISO 8859 charset to support byte values of 0x80 (128) and above. This can be quite a challenge if you actually need the correct glyph, because there are a number of charsets in this space. Note that in Java, at least, every char is a 16-bit value, so some conversion is always necessary to switch between bytes and chars.
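To make the byte/char conversion less magical, here is a small sketch of a lossless round trip through ISO-8859-1, which maps each byte value 0 to 255 to the character with the same number. The class `Latin1RoundTrip` and its two helpers are names invented for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: carries raw 8-bit data through a Java String unchanged.
final class Latin1RoundTrip {

    // Each byte 0x00..0xff becomes the char with the same numeric value.
    static String bytesToString(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Inverse of bytesToString, as long as every char is 0xff or below.
    static byte[] stringToBytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] raw = new byte[256];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = (byte) i; // covers every possible byte value
        }
        String s = bytesToString(raw);
        System.out.println((int) s.charAt(255));                   // numeric value of the last char
        System.out.println(Arrays.equals(raw, stringToBytes(s)));  // did the round trip preserve the bytes?
    }
}
```

This only preserves the numeric values. Whether 0xff shows up as the glyph the device intended still depends on which ISO 8859 variant the device actually uses.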

Examples for folks who'd like to see the problem in code:

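import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    public static void main(String[] args) {
        // 17 values of 0xff, as in the data we started with
        char[] ba = new char[17];
        Arrays.fill(ba, (char) 0xff);

        // narrow each char to a byte; 0xff becomes the signed byte -1
        byte[] ba2 = new byte[ba.length];
        for (int i = 0; i < ba.length; i++) {
            ba2[i] = (byte) ba[i];
        }

        // the Charset overloads throw no checked exception, so no try/catch is needed
        String ascii = new String(ba2, StandardCharsets.US_ASCII);
        String iso8859 = new String(ba2, StandardCharsets.ISO_8859_1);
        String utf8 = new String(ba2, StandardCharsets.UTF_8);

        char asciichar = ascii.charAt(0);
        // this will be 65533 (0xfffd)
        char isochar = iso8859.charAt(0);
        // this will be 255 (0xff)
        char utfchar = utf8.charAt(0);
        // this will also be 65533 (0xfffd)

        System.out.println((int) asciichar + " " + (int) isochar + " " + (int) utfchar);
    }
}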

Note that there are still challenges: a char is 16 bits, and you'll need to be careful when crossing system boundaries to let everyone know your output is in an ISO-8859 character set. For example, if you generate an XML file and set the encoding to UTF-8, the file will contain �����������������, but if you use ISO-8859 it will contain ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ.
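Part of letting everyone know is making sure the encoding named in the XML declaration is the one actually used to write the bytes. Below is a rough sketch that builds a tiny document by hand. The class `XmlEncodingSketch`, the `toXml` helper, and the `<value>` element are all invented for illustration, and the payload is assumed to contain no characters that need XML escaping.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: one Charset drives both the declaration and the bytes.
final class XmlEncodingSketch {

    static byte[] toXml(String payload, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            // The declared encoding comes from the same Charset used to write.
            w.write("<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>");
            // Assumption: payload holds no '<', '>' or '&' that would need escaping.
            w.write("<value>" + payload + "</value>");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        char[] chars = new char[17];
        Arrays.fill(chars, '\u00ff');
        String payload = new String(chars); // 17 chars of U+00FF

        byte[] latin1 = toXml(payload, StandardCharsets.ISO_8859_1);
        byte[] utf8 = toXml(payload, StandardCharsets.UTF_8);
        // The UTF-8 file is larger even though its declaration is shorter.
        System.out.println(latin1.length + " vs " + utf8.length);
    }
}
```

In the ISO-8859-1 output each ÿ is the single byte 0xff, which is what a non-Unicode device would expect. In the UTF-8 output it is the two bytes 0xc3 0xbf.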

A more entertaining aspect of this: if you subsequently try to insert this string value into a Unicode-aware database and assume your characters will fit into 17 bytes, it will overflow the column width, because the ����������������� string of 17 chars is actually 34 bytes in UTF-16 (and 51 bytes in UTF-8), whereas the ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ string can be represented in only 17 bytes in ISO-8859-1.
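The byte counts are easy to check before the database does it for you. This sketch, with the made-up class `ByteWidths` and helpers `repeat` and `encodedLength`, measures both 17-char strings under a few charsets. UTF-16BE is used so that no byte order mark is counted.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: how many bytes does a string need in a given charset?
final class ByteWidths {

    // Builds a string of n copies of c.
    static String repeat(char c, int n) {
        char[] chars = new char[n];
        Arrays.fill(chars, c);
        return new String(chars);
    }

    static int encodedLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String replaced = repeat('\ufffd', 17); // what UTF-8 decoding produced
        String latin = repeat('\u00ff', 17);    // what ISO-8859-1 decoding produced

        System.out.println(encodedLength(replaced, StandardCharsets.UTF_16BE));  // 34
        System.out.println(encodedLength(replaced, StandardCharsets.UTF_8));     // 51
        System.out.println(encodedLength(latin, StandardCharsets.ISO_8859_1));   // 17
        System.out.println(encodedLength(latin, StandardCharsets.UTF_8));        // 34
    }
}
```

The last line is worth noticing: even the "good" ÿ string doubles in size once it is stored as UTF-8, so a 17-byte column is only safe if the column itself uses a single-byte charset.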

In short, the range of byte values between 128 and 255, just beyond ASCII proper, is treacherous territory. Using UTF-8 only helps if everybody uses UTF-8; otherwise, converting between code pages can be quite an adventure.
