Proposed Entry Format of Sort Key for Index Entry

ARAI Bunkichi
Koubunsha, Ltd. (www.kobu.com)
Yokohama, Japan

Draft; Nov 6, 1997 - was 'Convention for adding Yomi to Index Entries'
Update; Nov 9, 1997 - I understand yomi is a special case of sorting keys.

Comments and suggestions are welcome.
Corrections to mistakes in English are also welcome.
This file can be freely distributed if you don't modify it.

If you download this file, also download the .gif files in the same directory. They are used for displaying special characters.


Summary

This is a proposal for adding sorting keys to Index entries of online documents such as HTML files. Sorting keys are useful but not required when building an English Index. However, they are critical for correctly arranging Japanese words that contain kanji.

There are several ongoing projects for adding a Table of Contents and Index to HTML-based documents (such as the HTML-based help systems Microsoft HTML Help, Netscape NetHelp 2, Sun JavaHelp, etc.). They define TOC and Index file formats. Some of them also define TOC and Index entry formats. To assist these efforts, I would like to present a simple and general way of adding a sorting key to each index entry.

Briefly, the format of an index entry will be:

	word-spelling ~ sort-key
where the tilde (~) and sort-key are optional. When sort-key is present, it must be used to determine the entry's order; otherwise word-spelling is used, as usual.

I know there are many good authoring technologies and tools created in the United States that support double-byte character sets (DBCS), so Japanese authors can use them too. Unfortunately, these technologies and tools are not fully useful to Japanese authors if they can't create an appropriate Index. Authors need extra tools or manual labor to do the work, or they have to abandon such tools and technologies.

By supporting this or a similar sorting key entry format, authoring technologies and tools created outside Japan will become fully useful to Japanese authors as well. I believe this also applies to other East Asian countries where characters of Chinese origin are used. Writers and programmers in these countries can then become good customers for authoring tool developers worldwide.

Background - Yomi is a sorting key

Here I explain why Japanese words need extra sorting keys for building an Index.

In English, most words consist only of letters of the Latin alphabet, and they are ordered by their spelling. This is not true of the Japanese language. I don't know much about other languages, but I believe this is also the case in Korea and China. I imagine it is also the case for European languages that use accented letters.

Japanese words containing kanji are ordered by their yomi, not by their spelling. Yomi is the pronunciation of a kanji word and is traditionally used for arranging words in order.

Japanese characters are divided into two groups: kana (phonetic characters such as ) and kanji (ideographic symbols imported from ancient China such as ). Of course, we sometimes mix in Latin and Greek letters too.

Kana

Kana is a lot like the Latin alphabet. A kana character has no meaning by itself; it just represents a speech sound. Each has a single, unique pronunciation. The spelling of a word consisting only of kana characters directly represents its sound. Kana characters have a predetermined order, and dictionaries are sorted by this order. It is possible to spell every Japanese sentence only with kana, although only small children do that.

Kanji

A kanji represents meaning, not sound. It usually has two or more pronunciations determined by its context.

Here is an example. The kanji for mountain is 山, and the kanji for river is 河. Ancient Chinese people pronounced 山 like 'san' (さん in kana) and 河 like 'ga' (が in kana). The Japanese kana word for mountain is やま (pronounced 'yama'), and the word for river is かわ ('kawa'). We read 山 as 'yama' and 河 as 'kawa' when each stands alone in a sentence.

However, if 山 and 河 are concatenated to form the single word 山河, meaning a landscape with mountains and rivers, they are pronounced 'sanga' (さんが in kana).

Yomi

We represent the pronunciation of kanji with kana characters. We call the pronunciation of a kanji word in a given context its 'yomi.' We have a tradition of adding small kana beside kanji to show the pronunciation and help children read it. This is called 'yomi-gana' (where 'gana' is the same as 'kana'). Yomi-gana is also used to sort words in dictionaries and items in a list.

Yomi is a kana word associated with a kanji word for pronunciation and sorting purposes. Japanese authors need some mechanism for adding yomi to every word containing at least one kanji when they create an Index. However, we don't have to add yomi to words consisting only of kana or Latin letters.
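
The difference between ordering by spelling and ordering by yomi can be sketched in a few lines of Python. This is only an illustration: the three entries are the mountain and river words from the example above, written here on the assumption that the kanji are 山 and 河, and the yomi are supplied by hand. Plain code point comparison is used as a stand-in for kana dictionary order, which happens to work for these simple hiragana strings; a real tool would need a proper Japanese collation.

```python
# Hypothetical sample data: (spelling, yomi) pairs. The yomi values are
# typed in by hand, as an author would have to do.
entries = [
    ("山", "やま"),      # mountain, read 'yama'
    ("河", "かわ"),      # river, read 'kawa'
    ("山河", "さんが"),  # mountains and rivers, read 'sanga'
]

# Sorting by spelling compares kanji code points, which carry no
# information about pronunciation.
by_spelling = [s for s, _ in sorted(entries, key=lambda e: e[0])]

# Sorting by yomi compares the kana strings instead.
by_yomi = [s for s, _ in sorted(entries, key=lambda e: e[1])]

print(by_spelling)  # ['山', '山河', '河']
print(by_yomi)      # ['河', '山河', '山']  (kawa, sanga, yama)
```

Only the second list follows the order a Japanese reader expects from a dictionary.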

Proposal - A convention for adding sort keys to index entries

There are several specifications for adding an Index feature to HTML-based documents. Here I use the NetHelp specification as an example to demonstrate how sorting keys can be added. I believe the convention described below can be applied to any other authoring specification.

Adding sorting keys

NetHelp, an HTML-based online help technology from Netscape, specifies that index entries can be given within an anchor tag in an HTML file:

	<A NAME=... INDEXSTRING="index-entry^index-entry^...">
Multiple entries are delimited by a circumflex (^).

An example:

	<A NAME=... INDEXSTRING="mountain^river">
My suggestion is to provide an optional place for adding a sorting key to an index entry. English and kana-only words don't need sorting keys. Only words containing some kanji (or maybe accented characters) need them.
	index-entry = word-spelling
	index-entry-with-sort-key = word-spelling ~ sort-key
When you add a sorting key to an index entry, add a tilde (~) followed by the sorting key. One or more spaces can be inserted before and/or after the delimiting tilde for readability. If an index entry contains the entry delimiter (^), the sort key delimiter (~), or a backslash (\), add a backslash in front of that character.
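
As a rough sketch of how a tool might read such an attribute value, the following Python splits an INDEXSTRING on unescaped circumflexes and tildes and then removes the backslashes. The function names are hypothetical and are not part of NetHelp or any other product. The handling of a second unescaped tilde (ignored here) is my own assumption, since the convention does not cover it.

```python
def split_unescaped(text, delimiter):
    """Split text on delimiter, leaving backslash-escaped characters alone."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            # Keep the escape pair intact for now; unescape() resolves it.
            current.append(text[i:i + 2])
            i += 2
        elif ch == delimiter:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def unescape(text):
    """Remove one level of backslash escaping."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def parse_index_string(index_string):
    """Return a list of (word_spelling, sort_key) pairs."""
    entries = []
    for raw in split_unescaped(index_string, "^"):
        fields = split_unescaped(raw, "~")
        # Spaces around the tilde are allowed for readability, so trim them.
        spelling = unescape(fields[0].strip())
        # Without a tilde, the spelling doubles as the sort key.
        sort_key = unescape(fields[1].strip()) if len(fields) > 1 else spelling
        entries.append((spelling, sort_key))
    return entries


print(parse_index_string(r"mountain^#include ~ include^a\^b"))
# [('mountain', 'mountain'), ('#include', 'include'), ('a^b', 'a^b')]
```

Splitting before unescaping keeps an escaped delimiter from being mistaken for a real one.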

Although not related to yomi, I would like to point out that some mechanism is also necessary to indicate subordinate (or indented) entries. WinHelp uses the comma (,) and colon (:) for this purpose by default.
An example:
	<A NAME=... INDEXSTRING=" ~  ^  ~  ^  ~ ">
The sorting key feature is critical for Japanese authors, who need it to add yomi to entries. It should be useful to everyone else too, because it gives finer control over the arrangement of index entries. For example, the C preprocessor directive '#include' can be placed in the I section rather than the Symbols section if you prefer.
	<A NAME=... INDEXSTRING="#include~include">
The point of this convention is this: if the authoring tool sees a tilde in the middle of an index entry, the words following the tilde should be treated as the sorting key for the words in front of the tilde.
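
To illustrate the '#include' case, here is a small Python sketch that files entries under index sections according to the first character of their sort key. The section rule (a letter gives its own section, anything else goes to "Symbols") and the sample entries are assumptions made for the illustration, not part of any existing tool.

```python
from collections import defaultdict


def section_of(sort_key):
    """Pick an index section from the first character of the sort key."""
    first = sort_key[:1]
    return first.upper() if first.isalpha() else "Symbols"


# Hypothetical (word_spelling, sort_key) pairs. '#define' has no separate
# sort key, so its spelling is used as the key.
entries = [
    ("#include", "include"),
    ("#define", "#define"),
    ("index", "index"),
]

sections = defaultdict(list)
for spelling, sort_key in sorted(entries, key=lambda e: e[1]):
    sections[section_of(sort_key)].append(spelling)

print(dict(sections))
# {'Symbols': ['#define'], 'I': ['#include', 'index']}
```

The entry is still displayed as '#include'; only its position in the Index changes.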

Sorting using the sorting keys

The authoring tool must be careful when sorting index entries that have sorting keys. If an entry has a sorting key, the key must be used instead of the entry text to determine its position. If the sorting key is omitted, the entry text itself serves as the sorting key.

Note that once the index entries have been sorted, the sorting keys are no longer necessary, and you don't have to keep them. For example, Microsoft adopts the W3C Site Map format for holding sorted index entries in HTML Help. The Site Map file doesn't have to keep the sorting keys.
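
The fallback rule can be sketched as a key function in Python. The entries are hypothetical, None stands for an entry written without a tilde, and the use of casefold() to ignore letter case is my own assumption rather than part of the convention.

```python
# Hypothetical entries as (word_spelling, sort_key) pairs; None means the
# entry was written without a tilde.
entries = [
    ("river", None),
    ("#include", "include"),
    ("mountain", None),
    ("header", None),
    ("index", None),
]


def effective_key(entry):
    spelling, sort_key = entry
    # Fall back to the spelling itself when no sort key was given.
    key = sort_key if sort_key is not None else spelling
    # Assumption: letter case should not affect the order.
    return key.casefold()


ordered = [spelling for spelling, _ in sorted(entries, key=effective_key)]
print(ordered)
# ['header', '#include', 'index', 'mountain', 'river']
```

The printed list contains only the spellings, since the keys have done their job once the order is fixed.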

Conclusion

In this memo, I emphasized the importance of sorting keys in building a non-English Index. I described the situation in the Japanese language, where yomi is the sorting key for a kanji word, and I presented a convention, or format, for adding sorting keys to Index entries.

I hope the reader understands that this simple addition is very useful to authors who work in countries where a word's spelling does not determine its order. I would be glad if architects of authoring systems could spend some time designing such a feature, and if programmers of authoring tools could spend some time implementing it for us.

Thank you for reading this memo.


I am ready to answer questions if you need more information about this subject. I welcome input from writers in languages other than English about the 'yomi' of their language and their sorting requirements for an Index.

About the author

ARAI Bunkichi is a writer and programmer who works for Koubunsha. Koubunsha specializes in creating online documentation in Japanese.