edpeur public mind dump: Unicode Strings and byte buffers

Before Unicode there was ASCII or ISO 8859-1 (except at Microsoft, which used its own encoding to lock users in), and string manipulation was not hard.

Now Unicode is the future, since everyone wants an easy way for every application to support all the characters of all the world's languages.

Many programming languages seem to have trouble handling Unicode strings, mainly because they put strings and byte buffers in the same bucket.
In some languages the trouble may come from programmers using the API incorrectly, but others have a real design flaw that makes working correctly with Unicode strings impossible.
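To make the string versus byte buffer distinction concrete, here is a minimal Python 3 sketch. The variable names are only illustrative, and it assumes the text is encoded as UTF-8.

```python
# Text and bytes are separate types in Python 3.
text = "á"                      # one Unicode character (U+00E1)
raw = text.encode("utf-8")      # its UTF-8 encoding: two bytes

print(len(text))                # 1 character
print(len(raw))                 # 2 bytes
print(raw)                      # b'\xc3\xa1'

# Mixing the two types is refused instead of silently guessing an encoding.
try:
    text + raw
except TypeError as exc:
    print("cannot mix str and bytes:", exc)

# Decoding with the right encoding gives the original text back.
assert raw.decode("utf-8") == text
```

A language that keeps both in the same bucket has no way to tell which of the two lengths, or which of the two meanings, the programmer intended.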

My test for correct Unicode support in a programming language is simple: uppercase the string "á" and verify that "Á" is returned.
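The test can be written down as a small Python 3 sketch. The helper names passes_uppercase_test and ascii_only_upper are hypothetical, introduced here only for illustration; ascii_only_upper stands in for an API that only knows the letters a to z.

```python
def passes_uppercase_test(upper):
    """Return True if the given uppercasing function maps "á" to "Á"."""
    return upper("á") == "Á"


def ascii_only_upper(s):
    """Hypothetical uppercaser that only handles the ASCII letters a-z."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


# str.upper works on Unicode characters, so it passes the test.
print(passes_uppercase_test(str.upper))         # True

# The ASCII-only version leaves "á" untouched, so it fails.
print(passes_uppercase_test(ascii_only_upper))  # False
```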

Languages that have a correct API include:

C with GLib (using UTF-8): g_utf8_strup("á", -1) -> Á
Java: I do not know of any way to design a better API. Java completely separates strings from byte buffers: "á".toUpperCase() -> Á
C# has correct Unicode support: "á".ToUpper() -> Á
Python3 handles Unicode the way Java has since version 1.0, released in 1996. Python is finally catching up with Java! See "What's New in Python 3.0" for what was fixed. print("á".upper()) -> Á

Languages with hard-to-use (but correct) Unicode support:

Python2: see "Why Python2 Unicode Sucks". Because the language does not clearly separate byte buffers from strings, Python2 invites problems such as the one described in "urlparse considered harmful". The slides "Unicode In Python2, Completely Demystified" will help you understand Unicode in Python2. Move on to Python3 and you will be safe. Python2 Unicode support is correct but hard to use: print u'á'.upper() -> Á
Perl: see "Unicode-processing issues in Perl and how to cope with it" and the "Perl Unicode FAQ". Perl Unicode support is hard to use: perl -e 'use utf8; print uc("á\n");' -> Á

Languages that lack Unicode support:

PHP: Unicode is not completely fixed until PHP 6; see "Unicode support in PHP 6" and the 2005 meeting. strtoupper("á") -> á
Ruby: see "Unicode handling in Ruby", "Ruby 1.9 Strings" and "JRuby Unicode". Ruby does not have correct Unicode support (#2350): print "á".upcase -> á
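One plausible way for "á" to come back unchanged is that the uppercasing routine works on bytes and only knows the ASCII letters. The Python 3 sketch below reproduces that effect with bytes.upper(). It is an illustration under that assumption, not the actual PHP or Ruby implementation.

```python
text = "á"
raw = text.encode("utf-8")          # b'\xc3\xa1'

# bytes.upper() only converts ASCII letters, so both UTF-8 bytes are left as they are.
upper_raw = raw.upper()
print(upper_raw == raw)             # True

# The reliable route: decode to text, uppercase the text, then encode again.
upper_text = raw.decode("utf-8").upper()
print(upper_text.encode("utf-8"))   # b'\xc3\x81'
```

The second half is what a correct API does for you: it uppercases characters, never the bytes that happen to encode them.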

2008-01-04
