Thursday, August 30, 2007

Changes in Java to support supplementary Unicode characters

Support for supplementary characters might need changes in the Java language as well as the API. A few questions come to mind.

  • How do we support supplementary characters at the primitive level (char is only 16 bits)?
  • How do we support supplementary characters in low-level APIs (such as the static methods of the Character class)?
  • How do we support supplementary characters in high-level APIs that deal with character sequences?
  • How do we support supplementary characters in Java literals?
  • How do we support supplementary characters in Java source files?

The expert committee that worked on JSR-204 dealt with all these questions and, I'm sure, many more. After deliberating and experimenting with how the changes would affect existing code, they came up with the following solution.

The primitive char was left unchanged. It is still 16 bits, and no other type has been added to the Java language to support the supplementary range of Unicode characters.

Low-level APIs, such as the static methods of the Character class, accepted only the char primitive type before Java supported supplementary characters. Since Java 5.0, methods such as isLetter(...) of the Character class have an overload that accepts an int representing the code point, alongside the earlier version that accepts a char.
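To make the difference between the two overloads concrete, here is a minimal sketch. The class name CodePointChecks, the helper isSupplementaryLetter, and the choice of U+10400 (DESERET CAPITAL LETTER LONG I) as the sample character are my own assumptions for illustration; the Character methods themselves are the standard ones.

```java
class CodePointChecks {
    // U+10400, a letter from the supplementary range, picked only as an example
    static final int DESERET_LONG_I = 0x10400;

    // Hypothetical helper: true for letters that do not fit in a single char
    static boolean isSupplementaryLetter(int codePoint) {
        return Character.isSupplementaryCodePoint(codePoint)
                && Character.isLetter(codePoint);
    }

    public static void main(String[] args) {
        // Split the code point into the chars Java uses to store it
        char[] units = Character.toChars(DESERET_LONG_I);
        System.out.println(units.length);                       // 2
        // The char overload only sees one half of the pair, which is not a letter
        System.out.println(Character.isLetter(units[0]));       // false
        // The int overload sees the whole code point
        System.out.println(Character.isLetter(DESERET_LONG_I)); // true
    }
}
```

The practical lesson seems to be: code that may meet supplementary characters should pass code points (int) to these methods rather than individual chars.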

 
JavaCharacterAPI.JPG 

 

High-level APIs will continue to work "as is" for most developers. They represent character sequences as UTF-16 sequences. Some classes, such as String and StringBuffer, now have parallel methods that work with code points, for example codePointAt(...), codePointBefore(...), and codePointCount(...). The codePointCount(...) method returns the number of code points in a given range of a String, which may not be the same as the number of chars in that range if some characters are from the supplementary range and are represented as surrogate pairs.
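A small sketch of these String methods in action follows. The class name CodePointCounting and the helper codePointLength are made up for this example, and the sample string (an "a", the supplementary character U+10400 written as a surrogate pair, and a "b") is just an assumption chosen to show the difference.

```java
class CodePointCounting {
    // Hypothetical helper: number of code points in the whole string
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a" + U+10400 + "b"; the middle character occupies two chars
        String s = "a\uD801\uDC00b";

        System.out.println(s.length());          // 4 (chars)
        System.out.println(codePointLength(s));  // 3 (code points)

        // codePointAt combines the surrogate pair starting at index 1
        System.out.println(Integer.toHexString(s.codePointAt(1)));     // 10400
        // codePointBefore looks backwards from index 3 and finds the same pair
        System.out.println(Integer.toHexString(s.codePointBefore(3))); // 10400
    }
}
```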

 

JavaStringMethodsForUnicode.JPG 

 

Identifiers in Java can contain any letter or digit. Many supplementary characters are letters or digits. To allow supplementary characters to be used in identifiers, the Java compiler and other tools were modified to use the int-based API methods isJavaIdentifierStart(int) and isJavaIdentifierPart(int).
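As a rough illustration of how a tool might use these two methods, here is a sketch that walks a string one code point at a time. The class IdentifierCheck and the method isValidIdentifier are hypothetical names of my own, and this is not how javac itself is written. It also does not reject reserved words, so "class" would pass.

```java
class IdentifierCheck {
    // Hypothetical helper: checks identifier syntax only, not keywords
    static boolean isValidIdentifier(String s) {
        if (s.length() == 0) {
            return false;
        }
        int first = s.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Advance by 1 or 2 chars depending on the size of each code point
        for (int i = Character.charCount(first); i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // U+10400 is a letter, so it may appear in an identifier
        System.out.println(isValidIdentifier("x\uD801\uDC00")); // true
        System.out.println(isValidIdentifier("1x"));            // false
    }
}
```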

Since we need to support supplementary characters all the way, they also need to be supported in Java source files. In the next blog post I will discuss how to include Unicode characters in Java source files and get them to compile using the Java compiler's -encoding option.

While I was reading about encoding, I came across this interesting blog post that describes a situation in which an I18N-enabled Java program ceased to work after the build machine was moved from a Windows box to a Red Hat box. The reason, of course, was an encoding-related issue.

 



Note: This text was originally posted on my earlier blog at http://www.adaptivelearningonline.net

Friday, August 17, 2007

It's been a while since I posted

It's been a week since I last posted. I am really sorry; this is the second time in succession that I have missed my target of posting at least thrice a week. By way of an excuse, all I have is a lame "it's been a bit crazy at work". I am messing around with a lot of client-side technologies, like AJAX and the plethora of libraries that accompany it, and all this without really understanding JavaScript well enough.

One of the libraries I am checking out is DWR (Direct Web Remoting). It allows JavaScript code to invoke Java objects. All this is done by creating proxy objects in JavaScript that make AJAX calls to the DWR Servlet, which in turn invokes the Java objects. I personally think it's a very nice concept, and it also supports reverse AJAX.

Would you like to know more about DWR? Please comment and let me know. I will then post a series on DWR after completing the current one on Unicode characters.

 

On a total tangent, here's a little something from Doc Searls on writing to inform readers:


 

I don’t think of what I do here as production of “information” that others “consume”. Nor do I think of it as “one-to-many” or “many-to-many”. I think of it as writing that will hopefully inform readers.

 

Informing is not the same as “delivering information”. Inform is derived from the verb to form. When you inform me, you form me. You enlarge that which makes me most human: what I know. I am, to some degree, authored by you.

  What we call “authority” is the right we give others to author us, to enlarge us.
  The human need to increase what we know, and to help each other do the same, is what the Net at its best is all about. Yeah, it’s about other things. But it needs to be respected as an accessory to our humanity. And terms like “social media”, forgive me, don’t do that. (At least not for me.)

 



Here are the comments from the original post

-----
COMMENT:
AUTHOR: kishore hariharan
URL:
DATE: 08/29/2007 05:33:38 AM
hi prof,
this concerns the simple yet powerful concept of DWR. My current assignment involves working with and exploring the same. Two of my colleagues and I actually implemented a server-side push concept using DWR, as the client demanded real-time updates for his Stock Trading Application.
The POC took a while but we managed to finish the implementation with support from JMS.
There are still a few grey areas on using this:
- The number of live connections DWR can handle at an instance.
- How DWR actually manages to keep the connections alive is still a mystery. The best we figured was that it sends heartbeats at frequent intervals. Still not very clear on this.
- Can it really manage to bypass powerful firewalls?
These are a few of the questions we are still researching. The implementation at the moment works great but still awaits the real test of volume testing.

-----
COMMENT:
AUTHOR: Parag
DATE: 08/30/2007 10:57:23 AM
Kishore,

These are interesting topics you are researching. I am sure your findings will be of interest to many people.

Do post them to our forum.

--
Regards
Parag

Friday, August 10, 2007

Supplementary character support in Java

In the last post I wrote that supplementary characters in the Unicode standard are in the range above U+FFFF, which means they need more than 16 bits to represent them. Since the char primitive type in Java is only 16 bits wide, we have to use two chars for each of them.

I just finished reading some stuff on supplementary character support in Java, and well, there are parts I understood right away and parts that are going to need further reading. I will try to share what I am learning on this blog. However, let us first clarify some terminology.

Character: Is an abstract minimal unit of text. It doesn't have a fixed shape (that would be a glyph), and it doesn't have a value. "A" is a character, and so is "€", the symbol for the common currency of Germany, France, and numerous other European countries.

Character Set: Is a collection of characters.

Unicode is a coded character set that assigns a unique number to every character defined in the Unicode standard.

A Code Point is the unique number assigned to every Unicode character. Valid Unicode code points are in the range of U+0000 to U+10FFFF. This range is capable of holding more than a million characters, out of which 96,382 have been assigned by the Unicode 4.0 standard.

Supplementary characters are those characters that have been assigned code points beyond U+FFFF. So essentially they lie in the range of U+10000 - U+10FFFF.
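To see how a code point in this range ends up as two 16-bit values, here is a sketch of the arithmetic UTF-16 uses (the two values are called a surrogate pair). The class SurrogateMath and the method toSurrogatePair are names I made up; in real code the library method Character.toChars(int) does the same job.

```java
class SurrogateMath {
    // Hypothetical helper that mirrors what Character.toChars does
    // for a supplementary code point
    static char[] toSurrogatePair(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not in U+10000..U+10FFFF");
        }
        // Subtracting 0x10000 leaves a 20-bit value
        int offset = codePoint - 0x10000;
        // Top 10 bits go into the high surrogate (0xD800..0xDBFF)
        char high = (char) (0xD800 + (offset >> 10));
        // Bottom 10 bits go into the low surrogate (0xDC00..0xDFFF)
        char low = (char) (0xDC00 + (offset & 0x3FF));
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x10400);
        System.out.println(Integer.toHexString(pair[0])); // d801
        System.out.println(Integer.toHexString(pair[1])); // dc00
    }
}
```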

When these characters are to be stored in a computer system, they have to be stored as a sequence of bits. The simplest way to store them is to give each character a 4-byte sequence capable of addressing the entire Unicode range (this is known as the UTF-32 encoding). However, this wastes a lot of space, because most of the time we deal with characters in the ASCII range of 00 - 7F. Some other mechanism is needed to make better use of the computer's memory and storage. Other encodings that exist are UTF-8 and UTF-16, which, as their names suggest, are built from 8-bit and 16-bit units.

A natural question that must have occurred to you is how we store characters that need more than 8 bits or 16 bits in UTF-8 and UTF-16. This is made possible by using multiple units for one character. Each unit also has to indicate whether it represents a single character on its own or is part of a series of units that together represent one character. UTF-8 and UTF-16 help us store typical text using less space than UTF-32. The most widely used encoding is UTF-8.
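The space trade-off is easy to observe from Java. The sketch below encodes three sample characters and prints how many bytes each encoding needs. The class EncodedSizes and the helper byteLength are hypothetical names, and the "UTF-32BE" charset is looked up by name on the assumption that the runtime provides it (it is not one of the charsets every Java platform is required to support).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSizes {
    // Assumption: the JVM ships a UTF-32BE charset (common JDKs do)
    static final Charset UTF_32BE = Charset.forName("UTF-32BE");

    // Hypothetical helper: size of the string once encoded
    static int byteLength(String s, Charset charset) {
        return s.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // "A" (U+0041), the euro sign (U+20AC), and supplementary U+10400
        String[] samples = { "A", "\u20AC", "\uD801\uDC00" };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d, UTF-16=%d, UTF-32=%d bytes%n",
                    s.codePointAt(0),
                    byteLength(s, StandardCharsets.UTF_8),
                    byteLength(s, StandardCharsets.UTF_16BE),
                    byteLength(s, UTF_32BE));
        }
        // U+0041: UTF-8=1, UTF-16=2, UTF-32=4 bytes
        // U+20AC: UTF-8=3, UTF-16=2, UTF-32=4 bytes
        // U+10400: UTF-8=4, UTF-16=4, UTF-32=4 bytes
    }
}
```

The big-endian variants are used here so that no byte order mark is added to the output and the counts reflect the characters alone.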

In the next post I will discuss how Java supports the supplementary range in its APIs and in the Virtual Machine.




Wednesday, August 01, 2007

Catching up with Java 5

Java 5 (a.k.a. Tiger) has been around for a while. But there are still many developers (including myself) who do not know about, or use, all its features.

So, in an effort to educate myself and help others, I have decided to spend some time every day reading Java 1.5 Tiger: A Developer's Notebook, and to share my findings with others on this blog.

Something I found out today (I know this should have happened long back, but such is the profession of programming :-) ) is that since Java 1.5 there is support for Unicode 4, which includes a supplementary character range that goes beyond 16 bits. An interesting implication is that the char data type can no longer hold every character, because those in the supplementary range need up to 21 bits.

This means that a string that contains certain characters may have to encode each of them as 2 char values. Such a pair of chars that represents one code point is known as a surrogate pair. Now a string with n code points may no longer be n chars long, because some code points will be encoded using one char, while others will use a surrogate pair.

A few questions have come to my mind about parsing such strings. How do I determine which code point appears in the middle of the String?
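One possible answer, sketched with methods that String has had since Java 5: count the code points, then let offsetByCodePoints translate "the k-th code point" into a char index. The class MiddleCodePoint and the method middleCodePoint are hypothetical names, and for an even number of code points this sketch arbitrarily picks the later of the two middle ones.

```java
class MiddleCodePoint {
    // Hypothetical helper: the code point at position count/2,
    // counting in code points rather than chars
    static int middleCodePoint(String s) {
        int count = s.codePointCount(0, s.length());
        if (count == 0) {
            throw new IllegalArgumentException("empty string");
        }
        // Translate a code point offset into a char index
        int index = s.offsetByCodePoints(0, count / 2);
        return s.codePointAt(index);
    }

    public static void main(String[] args) {
        // U+10400, U+10401, "x": three code points stored in five chars
        String s = "\uD801\uDC00\uD801\uDC01x";

        // Naive approach lands on half of a surrogate pair
        System.out.println(Integer.toHexString(s.charAt(s.length() / 2))); // d801
        // Code point aware approach returns the whole middle character
        System.out.println(Integer.toHexString(middleCodePoint(s)));       // 10401
    }
}
```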

I came across this article that explains support for Unicode 4 in Java. I will read it and share any interesting findings on this blog.

Meanwhile, for a more general explanation of Unicode, I strongly recommend this excellent article by Joel Spolsky: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

  • Discuss this post in the learning forum.
  • Check out my learning journal. I am learning JSF at the moment. Do you want to join an experiment in forming an adhoc virtual study group?
Here are the comments from the original post

-----
COMMENT:
AUTHOR: Manjari
URL: http://simplymanjari.blogspot.com/
DATE: 08/17/2007 05:33:37 AM
Thanks for the link to Joel's article on Unicode. I discovered I was blissfully ignorant in that context.
-----
COMMENT:
AUTHOR: Parag
DATE: 08/18/2007 06:31:12 PM
You are very welcome Manjari, and thanks for the comment :-)

--
Regards
Parag