2008-01-04

Unicode Strings and byte buffers

Before Unicode there was ASCII or ISO 8859-1 (except at Microsoft, which used its own encoding to lock users in), and string manipulation was not hard.

Now Unicode is the future, since everyone wants an easy way for every application to support all the characters of all the world's languages.

Many programming languages seem to have problems handling Unicode strings, mainly because they put strings and byte buffers in the same bucket.
In some cases the problems may come from programmers using the API incorrectly, but some languages have a real design flaw that makes working correctly with Unicode strings impossible.
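
To make the string versus byte buffer distinction concrete, here is a minimal sketch in Python 3. The variable names (text, data, mixed_ok) are mine, and UTF-8 is just the encoding I picked for the illustration.

```python
text = "\u00e1"               # the string "á": one character
data = text.encode("utf-8")   # its UTF-8 byte buffer: b"\xc3\xa1"

print(len(text))              # 1 (characters)
print(len(data))              # 2 (bytes)

# Mixing the two types is an error rather than a silent corruption.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)               # False

# Going back to a string requires naming the encoding explicitly.
print(data.decode("utf-8") == text)  # True
```

A language that keeps both in the same type cannot tell whether a length of 2 means two characters or two bytes.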

My test of whether a programming language has correct Unicode support is simple: uppercase the string "á" and verify that "Á" is returned.
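
As a rough sketch, the test can be written down in Python 3 like this. The helper names passes_uppercase_test and ascii_only_upper are hypothetical, made up for this illustration; the second one only imitates what I assume a byte-oriented API would do with UTF-8 data.

```python
def passes_uppercase_test(upper):
    """Return True if the given uppercasing function maps "á" to "Á"."""
    return upper("\u00e1") == "\u00c1"


def ascii_only_upper(s):
    """Imitate a byte-oriented API: uppercase only the ASCII bytes."""
    return s.encode("utf-8").upper().decode("utf-8")


print(passes_uppercase_test(str.upper))         # True
print(passes_uppercase_test(ascii_only_upper))  # False
```

The second call fails because the two bytes of "á" in UTF-8 are not ASCII letters, so a byte-wise uppercase leaves them untouched.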

Languages that have a correct API include:
  • C with GLib (using UTF-8): g_utf8_strup("á", -1) -> Á
  • Java: I cannot think of a way to design a better API. Java completely separates a string from a byte buffer. "á".toUpperCase() -> Á
  • C# has correct Unicode support: "á".ToUpper() -> Á
  • Python 3 handles Unicode just as Java has done since version 1.0, released in 1996. Python is finally catching up with Java! Take a look at What’s New in Python 3.0 to see what was fixed. print("á".upper()) -> Á
Languages with hard to use (but correct) Unicode support:
Languages that lack Unicode support:
