The end of last year I made a move from an IT shop focused on supporting the U.S. side of our business to a department that provides support to our operations outside the U.S. This was the first time I’ve worked in an international context and found myself running into long time assumptions that were no longer true on a regular basis. My first project was implementing a third party, web based HR system for medium sized offices. I found myself constantly missing important issues because I had such a narrow approach to the problem space. Sure I’ve built applications and databases that supported Unicode, but I’ve never actually implemented anything with them but the same types of systems I’d built in the past with ASCII. But a large portion of the worlds population is in Asia, and ASCII is certainly not going to cut it there. Fortunately a new edition of Ken Lunde’s classic CJKV Information Processing has become available and it has really opened my eyes.
CJKV Information Processing has a long history that actually goes back into the 1980s. It began as a simple text document JAPAN.INF, available via FTP on a number of servers. This document was excerpted and refined and published as Lunde’s first book in 1993, Understanding Japanese Information Processing. Shortly after JAPAN.INF became CJK.INF and the foundation for the first edition of CJKV Information Processing was born. The first edition was published in 1999, and it is safe to say that a number of important things have changed over the last 10 years. Lunde states four major developments that prompted this second edition in the preface. They are the emergence of Unicode, OpenType and the Portable Document Format (PDF) as preferred tools and lastly the maturity of the web in general to use Unicode and deal with a wider range of languages and their character sets.
Lunde sets out not to create an exhaustive reference on the languages themselves, but rather an exhaustive guide to the considerations that come into play when processing CJKV information. As Lunde states, “..this book focuses heavily on how CJKV text is handled on computer systems in a very platform-independent way…” Taking into account the complexity of the topic, the breadth of the work and the degree to which it is independent of any specific technology, outside a heavy bias for Unicode, is extremely impressive. A glance over the table of contents show just how true this is. Chapter 9, Information Processing Techniques has sections touching on C/C++, Java, Perl, Python, Ruby, Tcl and others. These are brief, with most examples in Java but that they are all directly addressed shows a great awareness of the options out there. The sections that deal with operating system issues have the same breadth. Chapter 10, OSes, Text Editors, and Word Processors doesn’t just hit the top Mac and Windows items. It looks at FreeBSD, Linux, Mac OS X, MS Vista, MS-DOS, Plan 9, OpenSolaris, Unix and more. There are also sections for what Lunde calls hybrid environments such as Boot Camp, CrossOver Mac, Gnome, KDE, VMware Fusion, Wine and the X Window System. Interestingly the Word Processor system covers AbiWord and KWord but not OpenOffice.org The point stands that anyone looking to support CJKV, this book will probably cover your platform and give you at the very least a starting point with your chosen tool set.
That said, an extremely specific implementation is not what Lunde is out to offer up. This is the very opposite of a ‘cook book’ approach. This also makes the book extremely useful to anyone dealing with internationalization, globalization or localization issues regardless of character set or language. Lunde teaches the underlying principles of how writing systems and scripts work. He then moves to how computer systems deal with these various writing systems and scripts. The focus is always on CJKV but the principles will hold true in any setting. This continues to be the case as Lunde talks about character sets, encoding, code conversion and a host of other issues that surround handling characters. Typography is included, as well as input and output methods. In each case Lunde covers the basics as well as pointing out areas of concern and where exceptions may cause issues. The author is nothing if not thorough in this regard. His knowledge of the problem space is at times down right staggering. Lunde also touches on dictionaries as well as publishing in print and on the web.
The first three chapters set the table for the rest of the book with an overview of the issues that will be addressed, information on the history and usage of the writing systems and scripts covered and the character set standards that exist. This was a fascinating glimpse, once again into CJKV languages and how other languages are dealt with as well. I think there is even a lot here that would be extremely informative to a person who wants to learn more about CJKV, even if they are not a developer that will be working with one of the languages. That’s only the first quarter of the book, so I don’t know that it would be worth it from just that perspective, but it is definitely a nice benefit of Lunde’s approach.
The style is very readable, but I wouldn’t just hand this to someone who didn’t have some familiarity with text processing issues on computer systems. While there is no requirement to know or understand one of the CJKV languages, understanding how computer systems process data and information is important. I did not know anything about CJKV languages prior to reading the book and have learned quite a bit. What I learned was not limited to the CJKV arena. The experience I had was very similar to when I studied ancient Greek in school. Learning Greek I learned much more about English grammar than I had ever picked up prior. Reading CJKV Information Processing I learned quite a bit more about the issues involved in things like character encoding and typography for every language, not just these four. But in dealing with CJKV specifically I’ve found that Lunde’s work is indispensable. It is not just my go to reference, it’s essentially my only reference. If any other works do come my way, this is the standard against which they will be judged.
There are thirteen indexes including a nice glossary. Nine of them are character sets, which were printed out in the longer first edition. In this second edition, there is a note on each, with a url pointing to a PDF with the information. It seemed odd, but each URL gets it’s own page. This means there are nine pages with nothing but the title of the index and a url. Fortunately they are all in the same directory, which can be reached directly from the books page at the O’Reilly site. It seems it would have made sense to just list them all on a single page, but maybe it was necessary for some reason. It’s a minute flaw in what is a great book.
Title: CJKV Information Processing 2nd ed.
Author: Ken Lunde
Publisher: O’Reilly Media, Inc.
Tagline: Chinese, Japanese, Korean & Vietnamese computing.