UCA Default Table Criteria for New Characters
This page explains the criteria that the Unicode Technical Committee (UTC)
uses in deciding how to create initial orderings for
the large collections of new characters added to the
Default Unicode Collation Element Table (DUCET)
for each new minor or major version of the Unicode Standard. See
Unicode Collation Algorithm for information
about the algorithm itself and technical details regarding
the format and use of the DUCET.
Criteria for Ordering New Scripts
1. When a new script is added to the standard, the establishment
of its primary ordering should, as much as possible, be based
on information provided with the Summary Proposal Form and
other supporting documents for the proposed encoding.
2. Failing that, or given ambiguity in the proposal documentation,
primary ordering should be based on whatever lexicographical
evidence can be gathered for the language that is best documented
or in most widespread use for that script.
3. If a script is in multilingual use and has character extensions
provided for specific languages, then following the choice of
primary order for the first language (by criterion 2), weights
for character extensions should be interpolated so as to get
the ordering for other languages (if known) as correct as
possible without requiring tailoring.
4. If characters with accents are included, then the accents should
be given secondary weights unless overriding concerns based on
established practice for primary letter weighting dictate otherwise.
5. If characters with distinctions comparable to case are included,
then the case (or presentation form) differences should be given
tertiary weights unless overriding concerns based on
established practice for ordering dictate otherwise.
6. Weighting for digits, symbols, and punctuation in a new script
should, as much as possible, follow the established patterns in
the DUCET for other scripts, so as not to introduce idiosyncratic
treatments of such characters on a script-by-script basis.
7. In some instances—particularly for historic scripts—there
may be no established native lexicographical order, or none
documented well enough to be usable. In such cases, a primary
order based simply on code point order in the charts, or on
a well-known academic catalog order for the characters,
may be an acceptable way to place the characters in
the DUCET.
8. The impact on the overall size and complexity of the DUCET
also needs to be considered when adding collation weights for a new script.
Particularly complex approaches to the specific weighting for
a new script should be avoided if they would have a significant
impact on the table's use for all other scripts and languages,
even if that approach might produce a marginally better default
ordering for the new script.
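
The effect of criteria 4 and 5 can be sketched with a toy three-level sort key. The weight table and the function name toy_sort_key below are invented for illustration and are not actual DUCET values; the sketch also ignores variable weighting, contractions, and the other details of the real algorithm. It assumes only that each character has one collation element of the form (primary, secondary, tertiary).

```python
# Toy weights, invented for illustration; these are NOT real DUCET values.
# Each character maps to one collation element: (primary, secondary, tertiary).
TOY_WEIGHTS = {
    "a": (1, 0, 0),
    "A": (1, 0, 1),       # differs from "a" only at the tertiary level (criterion 5)
    "\u00e1": (1, 1, 0),  # a with acute: differs only at the secondary level (criterion 4)
    "\u00c1": (1, 1, 1),  # A with acute
    "b": (2, 0, 0),
    "B": (2, 0, 1),
}


def toy_sort_key(text):
    """Build a key that compares all primaries, then all secondaries, then all tertiaries."""
    elements = [TOY_WEIGHTS[ch] for ch in text]
    primary = tuple(e[0] for e in elements)
    secondary = tuple(e[1] for e in elements)
    tertiary = tuple(e[2] for e in elements)
    return (primary, secondary, tertiary)


words = ["\u00e1b", "Ab", "b", "ab", "aB"]
print(sorted(words, key=toy_sort_key))
```

Because accent and case differences sit below the primary level, all five strings group by base letters first; the accented form sorts after every unaccented variant, and case only breaks ties among strings that are otherwise equal.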
Criteria for Addition of Small Numbers of Characters to Existing Collections
9. As much as possible, when adding characters to
scripts (or other collections) already in the DUCET—as,
for example, adding small numbers of additional Latin, Cyrillic,
or Arabic characters—weights for such characters should be
interpolated in the table following the predominant principles
of ordering already established in the table for that script.
This is to minimize the chances that such characters will simply
get lost in the table by being ordered in some haphazard, ad
hoc manner for the script. (Thus if a z-like character with
some overlay diacritic is added to the table, it should be
weighted as much as possible like other z-like characters with
diacritics.)
10. In most instances, characters added after the fact for a script,
in support of some small, minority language use or specialized
orthography, will be added in full knowledge that a tailoring
of the DUCET will be necessary in order to support ordering for
that language or specialized orthography. However, in certain
limited cases, it may be appropriate to attempt to place such
an additional character at a primary position other than the one
that would be chosen by criterion 9, if it is known that the
character is used only for that language or specialized orthography.
Such exceptions should, however, be just that—exceptions
rather than usual cases.
11. When additional characters have formal decomposition mappings
in the standard, their collation weights should simply be
derived automatically from the decomposition, unless there
is a clear, overriding reason to do otherwise. This is because
overriding the decomposition always complicates, at least marginally,
the process of regenerating the DUCET, often introduces
unanticipated edge cases or interactions with other weights,
and is seldom sufficient to produce a "perfect" ordering.
12. Additional sets of punctuation or other symbols that fall
into clear classes that have been grouped together in the DUCET
should be grouped, as much as possible, with like characters
already present in the DUCET. Thus if a new quotation mark of
some sort is added, it should be grouped with the existing batch
of quotation marks in the table. This eases maintenance and will
make sense for some kinds of ordering, even though for most
lexicographical sorting, punctuation and such symbols are largely
ignored.
13. Other symbols should simply default to getting weights based
on the code point order, along with the existing collection of
otherwise unclassified symbols.
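
The automatic derivation described in criterion 11 can be illustrated with a minimal sketch. It assumes a toy table of collation elements (TOY_ELEMENTS, with invented weights rather than real DUCET values) and a hypothetical helper derive_elements; it handles canonical decompositions only and does not attempt the adjustments that the actual DUCET generation may apply, for example for compatibility decompositions. The only library call used is unicodedata.normalize from the Python standard library.

```python
import unicodedata

# Toy collation elements (primary, secondary, tertiary), invented for illustration.
# These are NOT real DUCET values.
TOY_ELEMENTS = {
    "z": [(26, 1, 1)],
    "\u030c": [(0, 5, 1)],  # COMBINING CARON: no primary weight, secondary weight only
}


def derive_elements(ch):
    """Derive collation elements for ch from its canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", ch)
    elements = []
    for piece in decomposed:
        # Reuse the elements already assigned to each decomposed piece.
        elements.extend(TOY_ELEMENTS[piece])
    return elements


# U+017E LATIN SMALL LETTER Z WITH CARON decomposes to "z" + U+030C.
print(derive_elements("\u017e"))
```

With this approach, a newly added precomposed character needs no table entry of its own: it inherits the primary weight of its base letter and the secondary weight of its diacritic, which also keeps it next to other z-like characters, in line with criterion 9.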