Working In Multilingual Sources
September 8, 2015
Here at Digital Shadows we collect and store data from all across the web in different languages and formats. This post covers some of the challenges you are likely to face when handling data in different languages, and how to deal with them. Most of our code is in Java, so the examples here are all written in Java.
Not every document or web page declares its encoding (for now we’ll ignore those that lie about it). So you’re left either following a default (Microsoft Word documents, for example, default to UCS-2) or taking a guess. Let’s assume we have found a small file on the internet and want to read its content. We read it into a byte array, but we have no idea how the file was encoded. Let’s assume the content is UTF-8 and try to decode it.
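As a minimal sketch, assuming a hypothetical four-byte sample in place of the downloaded file (the class and method names are illustrative too), the obvious way to decode looks something like this:

```java
import java.nio.charset.StandardCharsets;

final class LenientDecode {
    // Hypothetical sample bytes, not the file from this post. They are not valid UTF-8.
    static final byte[] SAMPLE = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};

    // new String(bytes, charset) never fails: malformed input is silently
    // replaced with U+FFFD, the Unicode replacement character.
    static String decodeLeniently(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = decodeLeniently(SAMPLE);
        System.out.println(text);
        // The decoded string contains at least one replacement character.
        System.out.println(text.indexOf('\uFFFD') >= 0); // true
    }
}
```

The call succeeds without any warning, which is exactly what makes it risky when the encoding is a guess.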
The output we get is:
This is probably not what the content of the document is supposed to look like. Given that we guessed the encoding, we can assume we got it wrong. Perhaps we should save our string and look at it later, when we have a little more information. Let’s save our string (or write it to the console, in this case).
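A minimal sketch of that step, again using hypothetical sample bytes rather than the original file: decode leniently as UTF-8, turn the string back into bytes, and compare the result with what we started from.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class RoundTripLoss {
    // Decodes the bytes as UTF-8 (leniently), re-encodes the resulting string,
    // and reports whether the original bytes came back unchanged.
    static boolean survivesUtf8RoundTrip(byte[] original) {
        String decoded = new String(original, StandardCharsets.UTF_8);
        byte[] reEncoded = decoded.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(original, reEncoded);
    }

    public static void main(String[] args) {
        // Hypothetical bytes that are not valid UTF-8.
        byte[] sample = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};
        System.out.println(survivesUtf8RoundTrip(sample)); // false
        System.out.println(survivesUtf8RoundTrip(
                "plain ASCII".getBytes(StandardCharsets.UTF_8))); // true
    }
}
```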
Hold on, what’s this?
This isn’t even the same byte array we passed in. This default behaviour does not work for us: losing the original data is even worse than not being able to interpret it. What we really want is to report an error when we meet content we can’t handle, rather than quietly corrupting it. Thankfully, Java can take care of that for us.
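One way to get that behaviour, sketched here with illustrative names, is to use a CharsetDecoder configured to report malformed input instead of replacing it:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictDecode {
    // Decodes as UTF-8 and throws CharacterCodingException on malformed or
    // unmappable input, rather than substituting U+FFFD.
    static String decodeStrictly(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) {
        // Hypothetical bytes that are not valid UTF-8.
        byte[] sample = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};
        try {
            System.out.println(decodeStrictly(sample));
        } catch (CharacterCodingException e) {
            // The original bytes are untouched, so we can still try something else.
            System.out.println("Not valid UTF-8: " + e);
        }
    }
}
```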
Now when trying to decode our bytes we see:
Lesson – use the right functions for decoding data and don’t trust that it will be in the encoding you expect.
So we have some bytes, and we have successfully avoided mangling them into a string in the wrong encoding. Now what? We could try again with another likely encoding, but this doesn’t always make sense. We could keep the byte array and store that, but then our system needs to handle two different types of data: raw bytes and strings. Wouldn’t it be better if we could store it as a string and still not destroy anything in it?
There are two obvious ways to handle this: either encode the bytes in Base64, or use an encoding that won’t mangle our input. Here ISO_8859_1 comes to the rescue.
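Both options can be sketched in a few lines. The class and method names below are illustrative, and the sample bytes are hypothetical. ISO-8859-1 works because it maps every byte value 0x00 to 0xFF to exactly one character, so no input is ever malformed.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

final class LosslessStorage {
    // Every byte becomes the character with the same numeric value (U+0000..U+00FF).
    static String toLatin1String(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Only lossless for strings produced by toLatin1String: characters above
    // U+00FF cannot be represented in ISO-8859-1.
    static byte[] fromLatin1String(String stored) {
        return stored.getBytes(StandardCharsets.ISO_8859_1);
    }

    // The alternative: Base64 text, which is pure ASCII but not human-readable.
    static String toBase64(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    static byte[] fromBase64(String stored) {
        return Base64.getDecoder().decode(stored);
    }

    public static void main(String[] args) {
        // Hypothetical bytes of unknown encoding.
        byte[] sample = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};
        String stored = toLatin1String(sample);
        System.out.println(Arrays.equals(sample, fromLatin1String(stored))); // true
        System.out.println(Arrays.equals(sample, fromBase64(toBase64(sample)))); // true
    }
}
```

Whichever form you store, it is worth recording alongside it that the encoding is still unknown, so nobody later mistakes the placeholder string for real text.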
Our output from this is:
This is not the string that the bytes are supposed to represent, but it has the advantage that we haven’t lost anything. If we ever find the right encoding, we can turn the string back into the source byte array and decode it correctly.
Lesson – if you aren’t sure what something is you don’t need to throw it away. You can store it and come back when you do.
Storing Your Data
Here at Digital Shadows we use several different technologies for storing and processing our data, depending on where it came from and what we want to do with it. Care must be taken to ensure that these systems store the data safely and correctly. Take MySQL, for example. We are going to be consuming data from websites all over the world, so we want to store it in UTF-8.
A quick Google search leads us to a well-meaning Stack Overflow answer that tells you how to set the relevant configuration options to utf8. Problem solved?
However a closer read of the MySQL documentation will tell you:
To correctly store 4-byte characters in MySQL, you must use utf8mb4 as your encoding.
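MySQL’s legacy utf8 character set holds at most three bytes per character, so the characters at risk are those outside the Basic Multilingual Plane. A small helper like the one below (a sketch, with hypothetical names) can be used to pick edge cases for a test, or to flag values that would not survive a three-byte column:

```java
import java.nio.charset.StandardCharsets;

final class FourByteCheck {
    // True if the text contains any code point above U+FFFF. Those code
    // points are the ones that take four bytes in UTF-8.
    static boolean needsFourByteUtf8(String text) {
        return text.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        String chinese = "\u4F60\u597D";                       // two BMP characters
        String emoji = new String(Character.toChars(0x1F600)); // one supplementary character

        System.out.println(chinese.getBytes(StandardCharsets.UTF_8).length); // 6
        System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length);   // 4
        System.out.println(needsFourByteUtf8(chinese)); // false
        System.out.println(needsFourByteUtf8(emoji));   // true
    }
}
```

Writing a value such as the emoji above to the database and reading it back is a cheap way to confirm the configuration does what you think it does.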
Lesson – read the documentation carefully when choosing how to configure your databases and test the edge cases to make sure it works how you think it does.
Sadly, the RFCs for domains and URIs don’t specify the encoding. However, the World Wide Web Consortium recommends that UTF-8 be used.
This has generally been taken to mean that all percent-encoded URLs should be in UTF-8, and that everyone will build their websites to decode their characters as UTF-8. If you’re building your own website, this is great advice to follow. If you want to know whether you have seen a URL before in a decoded form, you can’t rely on ‘recommends’.
Trying to decode the following will, in most URI decoders, result in an error.
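To illustrate the failure with a hypothetical percent-encoded string (not the URI from this post), the sketch below splits decoding into two steps: first turn the %XX escapes into raw bytes, then check whether those bytes are valid UTF-8. The class and method names are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

final class PercentDecoder {
    // Turns %XX escapes into raw bytes. Any other character is assumed to be
    // ASCII and is copied through as a single byte.
    static byte[] toBytes(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c != '%') {
                out.write(c);
                continue;
            }
            if (i + 2 >= encoded.length()) {
                throw new IllegalArgumentException("Truncated escape at index " + i);
            }
            int high = Character.digit(encoded.charAt(i + 1), 16);
            int low = Character.digit(encoded.charAt(i + 2), 16);
            if (high < 0 || low < 0) {
                throw new IllegalArgumentException("Bad escape at index " + i);
            }
            out.write((high << 4) | low);
            i += 2; // skip the two hex digits just consumed
        }
        return out.toByteArray();
    }

    // A new CharsetDecoder reports malformed input by default, so decode() throws.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] raw = toBytes("%C4%E3%BA%C3"); // hypothetical input
        System.out.println(isValidUtf8(raw)); // false
    }
}
```

Keeping the raw bytes separate from the character decoding means a wrong guess about the encoding costs nothing: the bytes are still there to try again.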
However, with a bit more context:
We can see that the URI is from a Chinese company. So, for the final big reveal: the URI encoding actually represents the same byte array we were using earlier. This extra information about the location of the website is what we needed to identify the right encoding. We try decoding the byte array with GBK, an encoding for simplified Chinese characters.
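A sketch of that last step, assuming hypothetical sample bytes (the GBK encoding of 你好, not the bytes from this post) and assuming the JDK in use ships the GBK charset, which full JDK builds normally do:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class GbkDecode {
    // Strictly decodes the bytes with the named charset, throwing
    // CharacterCodingException instead of substituting replacement characters.
    static String decodeStrictly(byte[] bytes, String charsetName)
            throws CharacterCodingException {
        return Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample: these four bytes are invalid as UTF-8 but valid as GBK.
        byte[] sample = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};
        String text = decodeStrictly(sample, "GBK");
        System.out.println(text.equals("\u4F60\u597D")); // true
    }
}
```

A strict decode that succeeds is good evidence, not proof: many byte sequences are valid in more than one encoding, so context such as the site’s country still matters.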
The output we get is:
Lesson – when building your own web services you should follow the specification’s recommendations. When consuming others’ data you can’t assume they have.
To support collecting and analysing data in different languages, you have to work to the data, not to a standard or recommendation. Not every website or document will be valid, in UTF-8, or even complete. This data is still important: we need to handle it correctly, detect when we can’t, fall back to safe practices, and ensure that when we are done, we store it correctly.