Wednesday, August 01, 2007

Catching up with Java 5

Java 5 (a.k.a. Tiger) has been around for a while. But there are still many developers (including myself) who do not know about, or use, all its features.

So, in an effort to educate myself and help others, I have decided to spend some time every day reading Java 1.5 Tiger: A Developer's Notebook, and share my findings with others on this blog.

Something I found out today (I know this should have happened long ago, but such is the profession of programming :-) ) is that Java 1.5 added support for Unicode 4, which includes supplementary characters that go beyond 16 bits. An interesting implication is that the char data type can no longer hold every character, because code points in the supplementary range need up to 21 bits, while a char is only 16 bits wide.
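To see this for myself, here is a minimal sketch using the code point methods that Java 5 added to the Character class. The class name SupplementaryDemo and the choice of U+1D11E (the musical G clef) are my own picks for illustration, not something from the book.

```java
public class SupplementaryDemo {
    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the 16-bit range,
    // so it has to be held in an int rather than a char.
    static final int G_CLEF = 0x1D11E;

    // Converts one code point to its char representation:
    // one char for the 16-bit range, two chars for supplementary characters.
    static char[] encode(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(Character.isSupplementaryCodePoint(G_CLEF)); // true
        System.out.println(Character.charCount(G_CLEF));                // 2
        System.out.println(encode('A').length);                         // 1
    }
}
```

Note that the code point itself is stored in an int; the char type is only used for the encoded form.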

This means that a string containing such characters has to encode each of them as two char values. Such a pair of char values that represents one code point is known as a surrogate pair. So a string with n code points may no longer be n chars long, because some code points are encoded using one char, while others use a surrogate pair.
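A small sketch of the difference between the two lengths, assuming a hypothetical three-code-point test string of my own (the class and method names are also just for illustration). String.length() counts char values, while String.codePointCount() counts code points.

```java
public class SurrogatePairDemo {
    // Three code points: 'a', U+1D11E (written as its surrogate pair), 'b'.
    static final String TEXT = "a\uD834\uDD1Eb";

    // Number of code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println(TEXT.length());                              // 4 chars
        System.out.println(codePoints(TEXT));                           // 3 code points
        System.out.println(Character.isHighSurrogate(TEXT.charAt(1)));  // true
        System.out.println(Character.isLowSurrogate(TEXT.charAt(2)));   // true
    }
}
```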

A few questions have come to my mind about parsing such strings. For example, how do I determine which code point appears in the middle of a String, when charAt(length / 2) might land on one half of a surrogate pair?
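One possible answer, as far as I can tell, is to count in code points and then translate that position back to a char index with String.offsetByCodePoints(). This is only a sketch: the class MiddleCodePoint and the helper middleCodePoint are hypothetical names of mine, and "middle" here means the code point at index count / 2.

```java
public class MiddleCodePoint {
    // Returns the code point at position (count / 2), counting in code points
    // rather than in char values. Assumes a non-empty string.
    static int middleCodePoint(String s) {
        int count = s.codePointCount(0, s.length());
        if (count == 0) {
            throw new IllegalArgumentException("empty string has no middle");
        }
        // Translate "count / 2 code points from the start" into a char index.
        int charIndex = s.offsetByCodePoints(0, count / 2);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        // 'a', U+1D11E as a surrogate pair, 'b'
        String text = "a\uD834\uDD1Eb";
        System.out.println(Integer.toHexString(middleCodePoint(text))); // 1d11e
        System.out.println((char) middleCodePoint("abc"));              // b
    }
}
```

This walks the string to count code points, so it is linear in the length of the string, unlike charAt().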

I came across this article that explains support for Unicode 4 in Java. I will read it and share any interesting findings on this blog.

Meanwhile, for a more general explanation of Unicode, I strongly recommend this excellent article by Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

  • Discuss this post in the learning forum.
  • Check out my learning journal. I am learning JSF at the moment. Do you want to join an experiment in forming an ad hoc virtual study group?
Note: This text was originally posted on my earlier blog at
Here are the comments from the original post

AUTHOR: Manjari
DATE: 08/17/2007 05:33:37 AM
Thanks for the link to Joel's article on Unicode. I discovered I was blissfully ignorant in that context.
DATE: 08/18/2007 06:31:12 PM
You are very welcome Manjari, and thanks for the comment :-)

