Unicode® 5.2.0
Released: 2009 October 1 (Announcement)
Version 5.2.0 has been superseded by the latest version of the Unicode Standard.
Version 5.2.0 of the Unicode Standard consists of the core
specification (The Unicode Standard,
Version 5.2), together with
the delta and archival code charts for this version, the 5.2.0 Unicode Standard Annexes,
and the 5.2.0 Unicode Character Database (UCD).
The core specification gives the general principles,
requirements for conformance, and guidelines for implementers. The
code charts show representative glyphs for all the Unicode
characters. The Unicode Standard Annexes supply detailed normative
information about particular aspects of the standard. The Unicode
Character Database supplies normative and informative data for
implementers to allow them to implement the Unicode Standard.
Version 5.2.0 of the Unicode Standard
should be referenced as:
The Unicode Consortium. The Unicode Standard, Version 5.2.0,
defined by: The Unicode Standard, Version 5.2 (Mountain View, CA: The
Unicode Consortium, 2009. ISBN 978-1-936213-00-9). (http://www.unicode.org/versions/Unicode5.2.0/)
A complete specification of the contributory files for Unicode
5.2.0 is found on the page
Components for 5.2.0.
That page also provides the recommended reference format for Unicode Standard Annexes.
A. Online Edition
B. Overview
C. Stability Policy Update
D. Character Additions
E. Conformance Changes
F. Unicode Character Database Changes
G. Unicode Standard Annex Changes
A. Online Edition
The text of The Unicode Standard, Version 5.2, as well as
the delta and archival code charts,
is available via the navigation links on this page.
The charts and the Unicode Standard Annexes may be printed, while
the other files may be viewed but not printed. The
Unicode 5.2 Web Bookmarks page has links to all sections of the
online text. A
zipped version of the
core specification (10 MB) is also
available for download.
B. Overview
This page summarizes important changes to the standard since
Unicode 5.1.0. The core specification and the Unicode Standard Annexes
are not delta documents; each incorporates all of the textual changes
made for Version 5.2.0.
The Unicode Standard, Version 5.2, adds 6,648 characters and significantly improves the documentation of conformance requirements for the specification of normalization forms, canonical ordering, and the status of types of properties. Version 5.2 brings improved clarity of presentation in many Unicode Standard Annexes.
Seven new contemporary scripts have been added in Version 5.2: Bamum, Javanese, Lisu, Meetei Mayek, Samaritan, Tai Tham, and Tai Viet. New character additions to existing scripts now provide greater support for Abkhaz, Canadian Aboriginal Syllabics, Coptic, Devanagari, Khamti Shan, Malayalam,
and Myanmar. Of particular note are Devanagari additions in support of Vedic Sanskrit. Encoding Vedic is significant because Sanskrit is one of the principal languages for the religious heritage of India, and because Vedic represents the earliest attested phase of the language.
The seven contemporary scripts and newly encoded individual characters expand support of language and orthographic communities in Africa, India, China, Central Asia, Southeast Asia, and the Middle East.
Other character additions include important modern use symbols and historic characters. With Unicode Version 5.2, scholars will now have access to the Gardiner set of Egyptian Hieroglyphs as well as other important historic scripts: Imperial Aramaic, Avestan, Kaithi, Old South Arabian, and Old Turkic. Several key symbol sets were added or expanded: the ARIB set of Japanese broadcasting symbols, additional number forms used in India, and currency symbols.
This latest version of the Unicode Standard has exactly the same character assignments as ISO/IEC 10646:2003 plus Amendments 1 through 6.
Unicode Version 5.2:
- Updates stability policies to add property value stability guarantees for
identifier-related properties, a guarantee of property, property alias and
property value alias stability, and a policy on alias uniqueness.
- Incorporates into Chapter 3, Conformance the formal definitions of
normalization formerly presented in Unicode Standard Annex #15, "Unicode
Normalization Forms." Sections that were modified include sections 3.6 and
3.11.
- Revises Section 3.5, Properties to better explain the status of
Normative, Informative, Provisional, and Contributory properties.
- Clarifies the definition of Deprecated and its relationship to “strongly
discouraged,” and updates the set of Deprecated characters in view of this
clearer definition.
- Updates best practices for the use of replacement characters.
- Improves the description of compatibility characters in Chapter 2,
General Structure.
- Adds standardized named sequences for Tamil.
- Contains significant changes to properties and behavioral specifications.
Errata incorporated into Unicode 5.2.0 are listed by date in
a separate table. For corrigenda and errata after the release of Unicode 5.2.0, see the list of current
Updates and Errata.
C. Stability Policy Update
The Unicode Character Encoding Stability Policy has been updated.
This update strengthens normalization stability, adds stability
policy for case pairs, and extends constraints on property values.
For the current statement of these policies, see
Unicode Character Encoding Stability Policy.
D. Character Additions
6,648 new character assignments were made to the Unicode Standard, Version 5.2.0
(over and above what was in Unicode 5.1.0). The character repertoire corresponds to
ISO/IEC 10646:2003 plus Amendments 1 through 6.
The exact list of characters added for Version 5.2.0 is documented in the file DerivedAge.txt in the Unicode Character Database. Among the characters added, there are a few notable cases that may affect existing implementations.
These cases are highlighted here so that implementers can check their code for problematic assumptions.
- There are three new characters in the newly-encoded Kaithi
script that will require changes in implementations which make
hard-coded assumptions about composition during normalization.
Most new characters added to the standard with decompositions
cannot be generated by the operations toNFC() or toNFKC(), but
these three can. Implementers should check their code carefully to
ensure that it handles these three characters correctly.
  - U+1109A KAITHI LETTER DDDHA
  - U+1109C KAITHI LETTER RHA
  - U+110AB KAITHI LETTER VA
- One of the compatibility CJK ideographs added in this version
has a decomposition mapping to a unified CJK ideograph in Extension B. The
effect of this is that for the first time a character in the BMP
normalizes to a character not in the BMP:
toNFC(U+FA6C) = U+242EE
Implementers should check their implementations of
normalization to ensure they are not assuming that no BMP
character can normalize to a non-BMP character.
- Any hard-coded range assumptions about Unified CJK Ideographs
in implementations may need fixing, because the end range for
those has changed from U+9FC3 to U+9FCB in this version. There is
also an entirely new block of CJK Unified Ideographs: CJK Unified
Ideographs Extension C (U+2A700..U+2B73F), with characters encoded
in the range U+2A700 to U+2B734.
- There is now an assigned Hangul jamo character at U+11A7. This
may interfere with some implementations' boundary testing for
Hangul decomposition.
- There are a number of new Hangul jamo characters added for support
of Old Korean. Some of these are encoded in new blocks. An
implementation may run into trouble if it assumes that the
repertoire of conjoining jamos is fixed, or that all conjoining
jamos occur only in the Hangul Jamo block, U+1100..U+11FF.
- New uppercase parenthesized symbols have been added. Unlike
the circled letter symbols, there are no uppercase/lowercase
relationships for these new characters.
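The normalization-related cases in the list above can be spot-checked with a short script. The following is a minimal sketch, assuming Python 3 and its standard `unicodedata` module, whose bundled Unicode data is later than Version 5.2 and therefore includes these characters. The helper names, including `compose_lv_t`, are hypothetical and introduced here only for illustration; the Hangul constants are the values commonly used for arithmetic syllable composition.

```python
import unicodedata

# The three Kaithi characters singled out in the list above.
KAITHI_COMPOSITES = ["\U0001109A", "\U0001109C", "\U000110AB"]

def recomposes_under_nfc(ch: str) -> bool:
    """True if ch decomposes under NFD and NFC rebuilds it from that sequence."""
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed != ch and unicodedata.normalize("NFC", decomposed) == ch

def nfc_leaves_bmp(ch: str) -> bool:
    """True if a BMP character normalizes (NFC) to something outside the BMP."""
    result = unicodedata.normalize("NFC", ch)
    return ord(ch) <= 0xFFFF and any(ord(c) > 0xFFFF for c in result)

# Constants commonly used for arithmetic Hangul syllable composition.
S_BASE, S_COUNT = 0xAC00, 11172
T_BASE, T_COUNT = 0x11A7, 28

def compose_lv_t(lv: int, t: int):
    """Hypothetical helper: combine an LV syllable with a trailing consonant.

    U+11A7 (T_BASE) is now an assigned jamo, but it is not one of the
    composing trailing consonants, so the lower bound is exclusive.
    """
    s_index = lv - S_BASE
    is_lv = 0 <= s_index < S_COUNT and s_index % T_COUNT == 0
    if is_lv and T_BASE < t < T_BASE + T_COUNT:
        return lv + (t - T_BASE)
    return None

if __name__ == "__main__":
    for ch in KAITHI_COMPOSITES:
        print(f"U+{ord(ch):04X} recomposes under NFC: {recomposes_under_nfc(ch)}")
    print("U+FA6C leaves the BMP:", nfc_leaves_bmp("\uFA6C"))
    print("LV + U+11A8 composes:", compose_lv_t(0xAC00, 0x11A8) is not None)
    print("LV + U+11A7 composes:", compose_lv_t(0xAC00, 0x11A7) is not None)
```

An implementation that stores normalization results in 16-bit units, or that tests `T_BASE <= t` on the assumption that U+11A7 is unassigned, is the kind of code these checks are meant to flag.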
Character Assignment Overview
The new character additions were to both the BMP and the SMP
(Plane 1). The following table shows the allocation of code points in Unicode
5.2.0. For more information on the specific characters, see the file
DerivedAge.txt in the
Unicode Character Database.
For more details of character counts, see
Appendix D, Changes from Previous Versions in Unicode 5.2.
Graphic | 107,154
Format | 142
Control | 65
Private Use | 137,468
Surrogate | 2,048
Noncharacter | 66
Reserved | 867,169
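As a consistency check on the table, the seven counts should add up to the full Unicode codespace of 17 planes with 65,536 code points each. The following minimal Python sketch performs that arithmetic; the dictionary and function names are illustrative only.

```python
# Code point counts by type, copied from the allocation table above.
ALLOCATION = {
    "Graphic": 107_154,
    "Format": 142,
    "Control": 65,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 867_169,
}

# The codespace runs from U+0000 to U+10FFFF: 17 planes of 0x10000 code points.
CODESPACE_SIZE = 17 * 0x10000

def total_code_points(allocation: dict) -> int:
    """Sum the per-type counts."""
    return sum(allocation.values())

if __name__ == "__main__":
    total = total_code_points(ALLOCATION)
    print(f"{total:,} of {CODESPACE_SIZE:,} code points accounted for")
```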
E. Conformance Changes
There are several changes to conformance requirements in Unicode 5.2 that
impact implementations. The most important of these are noted specifically
here.
- The formal definitions of normalization formerly presented in
Unicode Standard Annex #15, "Unicode Normalization Forms," have been moved
to Chapter 3, Conformance.
- A key conformance clause on the modification of character sequences,
C7, has been tightened to eliminate security risks resulting from deletion
of noncharacters from uninterpreted text strings. In Unicode 5.2, the
conformance requirements now disallow their removal, except where
strings are explicitly being modified.
- The status of Normative, Informative, Provisional, and Contributory
properties is clarified in Section 3.5 Properties.
- The types of code points are clarified in Chapters 2, 3, and
4, with coordinated updates in Unicode Standard Annex #44,
"Unicode Character Database."
- The PropertyAliases.txt file in the Unicode Character Database is now
designated as the normative listing of Unicode character properties and
their names.
- The BidiTest.txt file in the Unicode Character Database is a new
feature in Unicode 5.2. This file contains test cases for assessing
conformance to the Unicode Bidirectional Algorithm.
- There are additional changes in Unicode conformance requirements due
to changes in the UCD data files and the Unicode Standard Annexes listed
below.
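For implementers auditing code against the tightened clause C7, it can help to locate noncharacters in a string without removing them. The sketch below assumes the standard set of 66 noncharacters (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes); the function names are hypothetical and not taken from the standard.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the 66 noncharacter code points."""
    # U+FDD0..U+FDEF, or any code point ending in FFFE or FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def find_noncharacters(text: str):
    """Report (index, code point) pairs; the text itself is left untouched."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if is_noncharacter(ord(ch))]

if __name__ == "__main__":
    sample = "a\uFFFEb\uFDD0"
    for index, cp in find_noncharacters(sample):
        print(f"noncharacter U+{cp:04X} at index {index}")
```

Reporting positions rather than filtering keeps uninterpreted text unmodified, which is the behavior the clause is concerned with.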
F. Unicode Character Database Changes
The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 5.2.0 can be found in
UAX #44, Unicode Character Database. The most significant changes include:
- There are new case-related properties in
DerivedCoreProperties.txt and DerivedNormalizationProps.txt. The new case-related derived
properties are NFKC_Casefold, Case_Ignorable, Cased,
Changes_When_Lowercased, Changes_When_Uppercased,
Changes_When_Titlecased, Changes_When_Casemapped,
Changes_When_Casefolded, and Changes_When_NFKC_Casefolded.
- Contributory is considered to be a distinct status for
a Unicode character property. Contributory properties are neither
normative nor informative. The status of all character properties is listed in the property table in
UAX #44, Unicode Character Database.
- Two new joining groups, FARSI YEH and NYA, were added. These new joining groups may require an update to implementations of Arabic shaping rules.
- There is a new data file in the Unicode Character Database, CJKRadicals.txt, which maps the radical numbers used in the Unicode
Radical-Stroke Index to the actual Unicode code points for the corresponding radicals. Unlike other files, the
first field is not a code point number.
- The Unihan.txt file in Unihan.zip is split into 8 separate
files within the zip file, organized by category. See
UAX #38, Unicode Han Database (Unihan) for details.
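To give a feel for the new case-related derived properties, the sketch below approximates a few of them for a single character, on the understanding that Changes_When_Lowercased(X) holds when lowercasing the NFD form of X changes it, and similarly for the other mappings. It assumes Python 3, whose `str.lower()`, `str.upper()`, `str.title()`, and `str.casefold()` stand in for the standard's case mappings and reflect a later Unicode version than 5.2. The authoritative values remain those in DerivedCoreProperties.txt; the function names here are illustrative.

```python
import unicodedata

def _nfd(ch: str) -> str:
    return unicodedata.normalize("NFD", ch)

def changes_when_lowercased(ch: str) -> bool:
    nfd = _nfd(ch)
    return nfd.lower() != nfd

def changes_when_uppercased(ch: str) -> bool:
    nfd = _nfd(ch)
    return nfd.upper() != nfd

def changes_when_titlecased(ch: str) -> bool:
    # str.title() is used as an approximation of the titlecase mapping.
    nfd = _nfd(ch)
    return nfd.title() != nfd

def changes_when_casefolded(ch: str) -> bool:
    nfd = _nfd(ch)
    return nfd.casefold() != nfd

def changes_when_casemapped(ch: str) -> bool:
    """True if any of the three case mappings changes the character."""
    return (changes_when_lowercased(ch)
            or changes_when_uppercased(ch)
            or changes_when_titlecased(ch))

if __name__ == "__main__":
    for ch in ["A", "a", "\u00DF", "1"]:
        print(f"U+{ord(ch):04X}",
              "CWL" if changes_when_lowercased(ch) else "-",
              "CWU" if changes_when_uppercased(ch) else "-",
              "CWCF" if changes_when_casefolded(ch) else "-",
              "CWCM" if changes_when_casemapped(ch) else "-")
```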
G. Unicode Standard Annex Changes
In Version 5.2, many of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section
of each UAX, linked directly from the following list of UAXes.
- UAX #9: Unicode Bidirectional Algorithm
  - Added Section 4.4, Bidi Conformance Testing.
  - Added BN to Rule X6 (removing certain characters).
  - Clarified examples in Rule N1 (affecting characters next to EN or AN characters).
  - Added to HL6 the clause: Those with a resolved directionality of L and whose bidi class is R or AL.
- UAX #11: East Asian Width
  - Updated the description of the property value for unassigned code points.
- UAX #14: Unicode Line Breaking Algorithm
  - Added class CP, reintroduced rule LB30, and adjusted other rules for class CP.
  - In Section 5.1, clarified that the lists of characters for each property contain representative characters and are not necessarily complete.
  - Unassigned code points in CJK regions default to class ID.
- UAX #15: Unicode Normalization Forms
  - Moved formal specification of NFC and NFKC into Chapter 3.
  - Added general introduction to the document itself.
- UAX #24: Unicode Script Property
  - Updated short alias for Inherited from Qaai to Zinh.
  - Rewrote Section 3. Added a new subsection 3.4 to clarify the distinction between script designators and script property value aliases, their respective matching rules, and the use of underscores. Added a new subsection 3.5 to clarify ambiguity in the term script name.
- UAX #29: Unicode Text Segmentation
- UAX #31: Unicode Identifier and Pattern Syntax
  - Updated the table Candidate Characters for Inclusion in Identifiers. Updated the placement of various scripts in the two tables Candidate Characters for Exclusion from Identifiers and Recommended Scripts [for Identifiers], and marked some of the recommended scripts as limited use. Added pointer to CLDR for information about scripts in limited use.
  - Added the following to Candidate Characters for Inclusion in Identifiers: U+0F0B ( ་ ) TIBETAN MARK INTERSYLLABIC TSHEG and U+30FB ( ・ ) KATAKANA MIDDLE DOT.
  - Added the following to Candidate Characters for Exclusion from Identifiers or Recommended Scripts: Default Ignorable Code Points, Tatweel (-like) characters, and scripts Old Turkic, Old South Arabian, Imperial Aramaic, Inscriptional Parthian, Inscriptional Pahlavi, Avestan, Egyptian Hieroglyphs, Samaritan, Kaithi, Lisu, Meetei Mayek, Tai Tham, Tai Viet, Javanese, Bamum.
- UAX #34: Unicode Named Character Sequences
- UAX #38: Unicode Han Database (Unihan)
  - Reclassified kDefinition, kHanyuPinlu, and kXHC1983 fields as Readings.
  - Documented revised structure of Unihan.zip.
  - Updated regular expressions of tags.
- UAX #41: Common References for Unicode Standard Annexes
- UAX #42: Unicode Character Database in XML
  - Added attributes for new properties and values.
  - Changed types of certain elements.
  - Updated the patterns for Unihan properties.
- UAX #44: Unicode Character Database
  - Completely reorganized and rewritten, to include all the content from the obsoleted UCD.html.
  - Extensive new content added to account for all property changes and additions for Unicode 5.2.
  - Further clarifications were added regarding character properties, including the definition and contents of the Deprecated property, the nature of Contributory properties, the description of numeric properties, format issues for the Unihan database, constraints on property changes between releases, the description and exact values of defaults for property values, and many others.