Wednesday, December 21, 2022

Unicode in 2022

Hello Everyone!

As we go into the New Year, the Unicode team thought we’d share some highlights from this past year. From source-code spoofing to preserving indigenous languages, the Unicode team has had another full year, including expanding the number of characters that appear on billions of devices around the world.

Nearly 150,000 characters!

On the character side, we reached a total of just shy of 150,000 characters (149,186 to be exact). Of the 4,489 characters added in the 15.0 release, the biggest set was 4,192 ideographs for use in Chinese, Japanese, and Korean. There are also two new scripts, Nag Mundari and Kawi. Nag Mundari is a script used to write the Mundari language of India, a language with 1.1 million speakers. Kawi is an important historic script of insular Southeast Asia, found in inscriptions and on artifacts in several languages dating from the 8th to the 16th centuries — and is undergoing a revival today amongst enthusiasts.

And we can’t forget the 20 new emoji characters — we’re looking forward to seeing which are the most popular: shaking face? Goose? Maracas? Pink heart? If you’re involved in implementing emoji, you’ll also want to look at latest changes in UTS #51 Unicode Emoji.

See the Unicode15.0.0 page for more details. We’re also changing how we do releases — for more, see 2023 Release Planning.

The Launch of ICU4X

ICU is used in every major device and operating system; it’s how you see a date or number on your phone, for example. This new project, ICU4X, was created to solve the needs of clients who wish to provide client-side internationalization for their products in resource-constrained environments and across many programming languages. After 2½ years of work by Google, Mozilla, Amazon, and community partners, the Unicode Consortium has published ICU4X 1.0, its first stable release. Built from the ground up to be lightweight, portable, and secure, ICU4X learns from decades of experience to bring localized date formatting, number formatting, collation, text segmentation, and more to devices that, until now, did not have a suitable solution. For details, see Announcing ICU4X 1.0.

When does i ≠ і?

Can you tell the difference between i and і? Yeah, most people can’t. The first set of changes to help counter source-code spoofing were included in the 15.0 versions of the UAX #9 Unicode Bidirectional Algorithm, UAX #31 Unicode Identifier and Pattern Syntax, and UTS #39 Unicode Security Mechanisms.

For 2023, there is a new draft UTS #55 Unicode Source Code Handling, providing guidance for programming language designers and tooling developers, and specifying mechanisms to avoid usability and security issues arising from improper handling of Unicode. More changes are on their way for UAX #9, UAX #31, and UTS #39 as well.

Åge Møller, Πέτρος Νικόλαος Καρατζής, ராஜேந்திர சோழன்

We’re making great progress on internationalized formatting of people’s names. What does that mean? Software needs to be able to format people's names, such as John Smith or 宮崎駿. The formatting can be surprisingly complicated: for example, people may have a different number of names, depending on their culture — they might have only one name (“Zendaya”), only two (“Albert Einstein”), or three or more. So the software needs to handle missing or extra name fields gracefully.

There are many more complexities — for more details, see Formatting people’s names.

You have 2 unread messages.

Or, you have 3 items in your cart. Whenever a computer needs to construct a sentence using “placeholders” such as 3, it is formatting a message. The current industry standard is ICU’s message formatting; a project started about 3 years ago, with the goal of improving on that to build a more robust and extensible mechanism. There is now a Tech Preview in ICU — we’d urge developers to try it out!

See message-format-wg for details on the syntax and message2/package-summary.html for the API (note that the ICU’s convention for tech previews is to mark as Deprecated), and the test code in MessageFormat2Test.java for examples of usage.

(There are of course other fixes, upgrades and new features in ICU: see ICU 72 and ICU 71 for more details.)

Māori, ‎Wolof, тоҷикӣ, ‎‎کٲشُر, ‎ትግርኛ, कॉशुर‎, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ

In CLDR, we now have 95 languages at the Modern level (suitable for full UI internationalization), 6 at the Moderate level (suitable for “document content” internationalization), and 29 at the Basic level (suitable for locale selection). We added a tech preview of formatting for person names, plus additions for Unicode 15.0 (emoji names and search keywords), names for new scripts, new CJK collation, and so on. For more information, see CLDR v42.

Revitalization and Preservation of Indigenous Languages

The Nattilik language community was unable to use their language reliably for even simple, everyday digital text exchanges such as email or text messaging. The Typotheque Syllabics Project, an initiative based out of Toronto and The Hague, Netherlands, undertook research with language keepers across various Syllabics-using Indigenous communities in Canada. By collaborating with Nattilik language keepers and elders in the community, key issues the Nattilik community of Western Nunavut faced were identified, and it was discovered that there were 12 missing syllabic characters from the Unicode Standard. The Consortium worked with the Typotheque Syllabics Project to add 16 characters to the script to support Nattilik and other languages in Unicode version 14.0, and improved the glyphs in Unicode version 15.0. See this blog post from June.

The Past and Future of Flag Emoji

Despite being the largest emoji category with a strong association tied to identity, flags are by far the least used. Flag emoji have always been subject to special criteria due to their open-ended nature, infrequent use, and burden on implementations. The addition of other flags and thousands of valid sequences into the Unicode Standard has not resulted in wider adoption. They don’t stand still, are constantly evolving, and due to the open-ended nature of flags, the addition of one creates exclusivity at the expense of others. Curious to learn more? Read more about the Past and Future of Flag Emoji.

Available Now! New YouTube Playlist and Technical Quick Start Guide

On September 28th, Unicode held a webinar on the “Overview of Internationalization and Unicode Projects” for Unicode enthusiasts. Unicode technical leadership and other experts shared background on our core projects with participants from more than 30 countries. If you missed the webinar, no worries! The recorded sessions are available on this YouTube playlist. And if you are new to Unicode and internationalization or simply want a refresh, you can also check out our Technical Quick Start Guide. This handy guide explains what Unicode is, including answering the question, “What is Internationalization and Why it Matters.” There are also useful links to more detailed information and how you can get involved. Read more here.

Support Unicode 💞💕💌💯✨🌟🤠🛟🎁

Finally, if you are already a contributor to — or member of Unicode (or your company or organization is!), thank you, Danke, Děkuju, धन्यवाद, merci, 谢谢你, grazie, நன்றி, and gracias! What we have accomplished is only possible because of supporters like you.

And if you want to support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode is a US-based non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Monday, November 14, 2022

The Unicode® Standard – 2023 Release Planning

By Peter Constable, Chair of the Unicode Technical Committee

[image]

At the Q4 Unicode Technical Committee (UTC) meeting held from November 1-3, our member representatives unanimously agreed to a release plan for 2023 and tentative plan for 2024. Along with some tooling updates, our plans aim to ensure that we are more agile to meet the evolving internationalization landscape and better able to meet the needs of Unicode members and other consumers of the Standard.

More information can be found in the Release Management Group’s Recommendations for 2023-2024.

BACKGROUND

For several years now, the UTC has worked on an annual cycle for new versions of The Unicode Standard and related specifications. New versions used to be released in March of each year, but in 2021, due to COVID-19, the release was delayed until September.

MOVING FORWARD

Going forward, our plan is to continue with a new release each year in September. That annual, predictable cycle works well for Unicode's other major projects—CLDR and ICU—and helps implementers in their planning.

In 2023, we will keep up that cadence with a September release, but we also need to take some time to evaluate and update our processes for developing each new version of the Standard.

Therefore, the 2023 release will be a “dot” release: Unicode 15.1. It will include important updates to Unicode Standard Annexes and to the Unicode Character Database, and have a limited set of new characters — but new scripts and most other character additions will be held until the 2024 release. A major new area is the planned release of a Unicode Technical Standard for avoiding source-code spoofing, along with associated changes in other specifications.

Regarding emoji, if there are any new emoji in the 15.1 release, they would leverage existing code points, as was done for the 13.1 release, rather than the addition of entirely new characters.

2024 AND BEYOND

For 2024, we anticipate returning to our regular cadence, with a major release in September 2024. Unicode 16.0 will include additional new scripts, emoji and other characters, as well as other updates.

Learn more about how you can support the Unicode Consortium and our mission, including information on our Adopt-a-Character program, here!

Tuesday, November 8, 2022

Available Now! New YouTube Playlist and Technical Quick Start Guide

By Elango Cheran

On September 28th, Unicode held a webinar on the “Overview of Internationalization and Unicode Projects” for Unicode enthusiasts. More than 180 people across 30 countries joined us for this online event.

The Consortium is pleased to now make available the videos from this event. If you are new to Unicode and internationalization or want an overview of the most recent projects, check out our new YouTube playlist and Technical Quick Start Guide.

Our Technical Leadership and other experts provide a handy overview on such topics as:

Introduction to Internationalization - Addison Phillips, Internationalization Engineer
Unicode Consortium: Past, Present, and Future - Mark Davis, Cofounder and President
Scripts and Character Encoding - Deborah Anderson, Chair of the Script Ad Hoc Committee
Unicode CLDR (Common Locale Data Repository) - Mark Davis and Annemarie Apple, Chair and Vice Chair of the CLDR Committee
Unicode ICU (International Components for Unicode) - Markus Scherer, Chair of ICU Committee
Unicode ICU4X 2022 - Shane Carr, Chair of ICU4X Subcommittee

Also included in the playlist is the audience Q&A with Elango Charan, the webinar’s emcee, and Mark Davis, Cofounder and President of Unicode.

The Unicode Technical Quick Start Guide is also now available. The guide explains what Unicode is, including answering the question, “What is Internationalization and Why it Matters.” There is also an overview of the technical committees and useful links to more detailed information and how you can get involved.

Learn more about how you can support the Unicode Consortium and our mission, including information on our Adopt-a-Character program, here!

Friday, October 28, 2022

MessageFormat 2 Technical Preview Available

The MessageFormat Working Group is pleased to announce that it has released a Technical Preview implementation of the current state of the MessageFormat 2 specification in ICU4J in the recent ICU 72 release. The Working Group has been working on a specification for a successor to ICU MessageFormat, which has been an industry staple for internationalized software for more than two decades.

Owing to the prevalence of MessageFormat not just as an API for software, but also given its syntax serving as a de facto serialization format for the localized messages sent to the API, the Working Group has put careful consideration into interchange and interoperability. Some goals of the new specification include promoting best practices for internationalization, including compatibility with localization industry supported XLIFF. Another goal includes a definition of the data model of the API input separate from the syntax to allow for multiple compliant syntaxes to be compatible. Similarly, the specification includes a registry of interfaces for dependent formatting functions, in order to cleanly separate implementation from specification, allowing users to specify custom formatting functions and plug in their own implementations.

MessageFormat 2 builds on top of the experience from using and maintaining ICU MessageFormat and a number of other localization systems and workflows. It improves the placeholder syntax, improves escaping rules inside the translatable content, replaces nested selectors with top-level multiple selectors, and allows the use of custom formatters.

For example:


match {$photoCount :number} {$userGender :equals}
when 1 masculine {{$userName} added a new photo to his album.}
when 1 feminine {{$userName} added a new photo to her album.}
when 1 * {{$userName} added a new photo to their album.}
when * masculine {{$userName} added {$photoCount} photos to his album.}
when * feminine {{$userName} added {$photoCount} photos to her album.}
when * * {{$userName} added {$photoCount} photos to their album.}

More examples and the formal definition of the grammar can be found in the specification draft.

We invite you all to try the Tech Preview available now in ICU4J and provide us any and all feedback. Similarly, please note that the MessageFormat 2 is still a work in progress, and therefore we rely on your questions, suggestions, and issues to critically inform how we iterate on the specification. Real world experience will allow us to address potential shortcomings in the ways that MessageFormat 2 will get used in practice.

For information about using the Tech Preview, refer to the API docs, ICU 72 download page, and the ICU4J User Guide.

To leave feedback about MessageFormat 2 (specification, syntax, etc.) or the Tech Preview implementation, please visit the working group’s repository at github.com/unicode-org/message-format-wg, where you can open a new Discussion topic or file a new Issue.

Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

Friday, October 21, 2022

ICU 72 Released

Unicode® ICU 72 has just been released. ICU is the premier library for software internationalization, used by a wide array of companies and organizations to support the world's languages, implementing both the latest version of the Unicode Standard and of the Unicode locale data (CLDR). ICU 72 updates to Unicode 15, and to CLDR 42 locale data with various additions and corrections.

ICU 72 and CLDR 42 are major releases, including a new version of Unicode and major locale data improvements.

ICU 72 adds two technology preview implementations based on draft Unicode specifications:

Formatting of people’s names in multiple languages (CLDR background on why this feature is being added and what it does)
An enhanced version of message formatting

This release also updates to the time zone data version 2022e (2022-oct). Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release since 2021b.

For details, please see https://icu.unicode.org/download/72.

Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

Thursday, October 20, 2022

Unicode CLDR v42 available

Unicode CLDR version 42 is now available and has been integrated into version 72 of ICU. In CLDR 42, the focus was on:

Locale coverage. The following locales now have higher coverage levels:
1. Modern: Igbo (ig), Yoruba, (yo)
2. Moderate: Chuvash (cv), Xhosa (xh)
3. Basic: Bhojpuri (bho), Haryanvi (bgc), Rajasthani (raj), Tigrinya (ti)
Formatting Person Names. Added data and structure for formatting people’s names. For more information on why this feature is being added and what it does, see Background.
Emoji 15.0 Support. Added short names, keywords, and sort-order for the new Unicode 15.0 emoji.
Coverage, Phase 2. Added additional language names and other items to the Modern coverage level for more consistency (and utility) across platforms.
Unicode 15.0 additions. Made the regular additions and changes for the new release of Unicode, including names for new scripts, collation data for Han characters, etc.

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.). For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

There are many other changes: to find out more, see the draft CLDR v42 release page, which has information on accessing the data, reviewing charts of the changes, and — importantly — Migration issues.

In version 42, the following levels were reached:

Level	Languages	Locales*	Notes
Modern	95	369	Suitable for full UI internationalization
Afrikaans‎, ‎… Čeština‎, ‎… Dansk‎, ‎… Eesti‎, ‎… Filipino‎, ‎… Gaeilge‎, ‎… Hrvatski‎, ‎Indonesia‎, ‎… Jawa‎, ‎Kiswahili‎, ‎Latviešu‎, ‎… Magyar‎, ‎…Nederlands‎, ‎… O‘zbek‎, Polski‎, ‎… Română‎, ‎Slovenčina‎, ‎… Tiếng Việt‎, ‎… Ελληνικά‎, Беларуская‎, ‎… ‎ᏣᎳᎩ‎, ‎ Ქართული‎, ‎Հայերեն‎, ‎עברית‎, ‎اردو‎, … አማርኛ‎, ‎नेपाली‎, … ‎অসমীয়া‎, ‎বাংলা‎, ‎ਪੰਜਾਬੀ‎, ‎ગુજરાતી‎, ‎ଓଡ଼ିଆ‎, தமிழ்‎, ‎తెలుగు‎, ‎ಕನ್ನಡ‎, ‎മലയാളം‎, ‎සිංහල‎, ‎ไทย‎, ‎ລາວ‎, မြန်မာ‎, ‎ខ្មែរ‎, ‎한국어‎, ‎… 日本語‎, ‎…
Moderate	6	11	Suitable for full “document content” internationalization, such as formats in a spreadsheet.
Binisaya, … ‎Èdè Yorùbá, ‎Føroyskt, ‎Igbo, ‎IsiZulu, ‎Kanhgág, ‎Nheẽgatu, ‎Runasimi, ‎Sardu, ‎Shqip, ‎سنڌي, …
Basic	29	43	Suitable for locale selection, such as choice of language in mobile phone settings.
Asturianu, ‎Basa Sunda, ‎Interlingua, ‎Kabuverdianu, ‎Lea Fakatonga, ‎Rumantsch, ‎Te reo Māori, ‎Wolof, ‎Босански (Ћирилица), ‎Татар, ‎Тоҷикӣ, ‎Ўзбекча (Кирил), ‎کٲشُر, ‎कॉशुर (देवनागरी), ‎…, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ, ‎粤语 (简体)‎

* Locales are variants for different countries or scripts.

Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

Thursday, October 6, 2022

ICU 72 Release Candidate Available

We are pleased to announce the release candidate for Unicode® ICU 72. It updates to Unicode 15, and to CLDR 42 locale data with various additions and corrections.

ICU 72 adds technology preview implementations for person name formatting, as well as for a new version of message formatting based on a proposed draft Unicode specification.

ICU 72 and CLDR 42 are major releases, including a new version of Unicode and major locale data improvements.

ICU 72 updates to the time zone data version 2022b (2022-Aug) which is effectively the same as 2022c. Note that pre-1970 data for a number of time zones has been removed, as has been the case in the upstream tzdata release since 2021b.

For details, please see https://icu.unicode.org/download/72.

Please test this release candidate on your platforms and report bugs and regressions by Tuesday, 2022-Oct-18, via the icu-support mailing list, and/or please find/submit error reports.

Please do not use this release candidate in production.

The preliminary API reference documents are published on unicode-org.github.io/icu-docs/ – follow the “Dev” links there.

Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

Thursday, September 29, 2022

Unicode CLDR v42 Beta — Spec Review

The Unicode CLDR v42 Beta is now available for specification review and integration testing. The release is planned for Oct 19, but any feedback on the specification needs to be submitted well in advance of that date. The specification is available at Draft LDML Modifications. The biggest change is the new Person Names Formatting section.

The beta has already been integrated into the development version of ICU. We would especially appreciate feedback from ICU users and non-ICU consumers of CLDR data, and on Migration issues.

Feedback can be filed at CLDR Tickets.

CLDR provides key building blocks for software to support the world's languages (dates, times, numbers, sort-order, etc.) For example, all major browsers and all modern mobile phones use CLDR for language support. (See Who uses CLDR?)

Via the online Survey Tool, contributors supply data for their languages — data that is widely used to support much of the world’s software. This data is also a factor in determining which languages are supported on mobile phones and computer operating systems.

In CLDR 42, the focus is on:

Locale coverage. The following locales now have higher coverage levels:

Modern: Igbo (ig), yo (Yoruba)
Moderate: Chuvash (cv), Xhosa (xh)
Basic: Haryanvi (bgc), Bhojpuri (bho), Rajasthani (raj), Tigrinya (ti)

Formatting Person Names. Added data and structure for formatting people's names. For more information on why this feature is being added and what it does, see Background.
Emoji 15.0 Support. Added short names, keywords, and sort-order for the new Unicode 15.0 emoji.
Coverage, Phase 2. Added additional language names and other items to the Modern coverage level, for more consistency (and utility) across platforms.
Unicode 15.0 additions. Made the regular additions and changes for a new release of Unicode, including names for new scripts, collation data for Han characters, etc.

There are many other changes: to find out more, see the draft CLDR v42 release page, which has information on accessing the date, reviewing charts of the changes, and — importantly — Migration issues.

In version 42, the following levels were reached:

Level	Languages	Locales*	Notes
Modern	95	369	Suitable for full UI internationalization
Afrikaans‎, ‎… Čeština‎, ‎… Dansk‎, ‎… Eesti‎, ‎… Filipino‎, ‎… Gaeilge‎, ‎… Hrvatski‎, ‎Indonesia‎, ‎… Jawa‎, ‎Kiswahili‎, ‎Latviešu‎, ‎… Magyar‎, ‎…Nederlands‎, ‎… O‘zbek‎, ‎Polski‎, ‎… Română‎, ‎Slovenčina‎, ‎… Tiếng Việt‎, ‎… Ελληνικά‎, ‎Беларуская‎, ‎… ‎ᏣᎳᎩ‎, ‎ Ქართული‎, ‎Հայերեն‎, ‎עברית‎, ‎اردو‎, … አማርኛ‎, ‎नेपाली‎, … ‎অসমীয়া‎, ‎বাংলা‎, ‎ਪੰਜਾਬੀ‎, ‎ગુજરાતી‎, ‎ଓଡ଼ିଆ‎, ‎தமிழ்‎, ‎తెలుగు‎, ‎ಕನ್ನಡ‎, ‎മലയാളം‎, ‎සිංහල‎, ‎ไทย‎, ‎ລາວ‎, ‎မြန်မာ‎, ‎ខ្មែរ‎, ‎한국어‎, ‎… 日本語‎, ‎…
Moderate	6	11	Suitable for full “document content” internationalization, such as formats in a spreadsheet.
Binisaya, … ‎Èdè Yorùbá, ‎Føroyskt, ‎Igbo, ‎IsiZulu, ‎Kanhgág, ‎Nheẽgatu, ‎Runasimi, ‎Sardu, ‎Shqip, ‎سنڌي, …
Basic	29	43	Suitable for locale selection, such as choice of language in mobile phone settings.
Asturianu, ‎Basa Sunda, ‎Interlingua, ‎Kabuverdianu, ‎Lea Fakatonga, ‎Rumantsch, ‎Te reo Māori, ‎Wolof, ‎Босански (Ћирилица), ‎Татар, ‎Тоҷикӣ, ‎Ўзбекча (Кирил), ‎کٲشُر, ‎कॉशुर (देवनागरी), ‎…, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ, ‎粤语 (简体)‎

* Locales are variants for different countries or scripts.

Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

Announcing ICU4X 1.0

I. Introduction

Hello! Ndeewo! Molweni! Салам! Across the world, people are coming online with smartphones, smart watches, and other small, low-resource devices. The technology industry needs an internationalization solution for these environments that scales to dozens of programming languages and thousands of human languages.

Enter ICU4X. As the name suggests, ICU4X is an offshoot of the industry-standard i18n library published by the Unicode Consortium, ICU (International Components for Unicode), which is embedded in every major device and operating system.

This week, after 2½ years of work by Google, Mozilla, Amazon, and community partners, the Unicode Consortium has published ICU4X 1.0, its first stable release. Built from the ground up to be lightweight, portable, and secure, ICU4X learns from decades of experience to bring localized date formatting, number formatting, collation, text segmentation, and more to devices that, until now, did not have a suitable solution.

Lightweight: ICU4X is Unicode's first library to support static data slicing and dynamic data loading. With ICU4X, clients can inspect their compiled code to easily build small, optimized locale data packs and then load those data packs on the fly, enabling applications to scale to more languages than ever before. Even when platform i18n is available, ICU4X is suitable as a polyfill to add additional features or languages. It does this while using very little RAM and CPU, helping extend devices' battery life.

Portable: ICU4X supports multiple programming languages out of the box. ICU4X can be used in the Rust programming language natively, with official wrappers in C++ via the foreign function interface (FFI) and JavaScript via WebAssembly. More programming languages can be added by writing plugins, without needing to touch core i18n logic. ICU4X also allows data files to be updated independently of code, making it easier to roll out Unicode updates.

Secure: Rust's type system and ownership model guarantee memory-safety and thread-safety, preventing large classes of bugs and vulnerabilities.

How does ICU4X achieve these goals, and why did the team choose to write ICU4X over any number of alternatives?

II. Why ICU4X?

You may still be wondering, what led the Unicode Consortium to choose a new Rust-based library as the solution to these problems?

II.A. Why a new library?

The Unicode Consortium also publishes ICU4C and ICU4J, i18n libraries written for C/C++ and Java. Why write a new library from scratch? Wouldn’t that increase the ongoing maintenance burden? Why not focus our efforts on improving ICU4C and/or ICU4J instead?

ICU4X solves a different problem for different types of clients. ICU4X does not seek to replace ICU4C or ICU4J; rather, it seeks to replace the large number of non-Unicode, often-unmaintained, often-incomplete i18n libraries that have been written to bring i18n to new programming languages and resource-constrained environments. ICU4X is a product that has long been missing from Unicode's portfolio.

Early on, the team evaluated whether ICU4X's goals could have been achieved by refactoring ICU4C or ICU4J. We found that:

ICU4C has already gone through a period of optimization for tree shaking and data size. Despite these efforts, we continue to have stakeholders saying that ICU4C is too large for their resource-constrained environment. Getting further improvements in ICU4C would amount to rewrites of much of ICU4C's code base, which would need to be done in a way that preserves backwards compatibility. This would be a large engineering effort with an uncertain final result. Furthermore, writing a new library allows us to additionally optimize for modern UTF-8-native environments.
Except for JavaScript via j2cl, Java is not a suitable source language for portability to low-resource environments like wearables. Further, ICU4J has many interdependent parts that would require a large amount of effort to bring to a state where it could be a viable j2cl source.
Some of our stakeholders (Firefox and Fuchsia) are drawn to Rust's memory safety. Like most complex C++ projects, ICU4C has had its share of CVEs, mostly relating to memory safety. Although C++ diagnostic tools are improving, Rust has very strong guarantees that are impossible in other software stacks.

For all these reasons, we decided that a Rust-based library was the best long-term choice.

II.B. Why use ICU4X when there is i18n in the platform?

Many of the same people who work on ICU4X also work to make i18n available in the platform (browser, mobile OS, etc.) through APIs such as the ECMAScript Intl object, android.icu, and other smartphone native libraries. ICU4X complements the platform-based solutions as the ideal polyfill:

Some platform i18n features take 5 or more years to gain wide enough availability to be used in client-side applications. ICU4X can bridge the gap.
ICU4X can enable clients to add more locales than those available in the platform.
Some clients prefer identical behavior of their app across multiple devices. ICU4X can give them this level of consistency.
Eventually, we hope that ICU4X will back platform implementations in ECMAScript and elsewhere, providing a maximal amount of consistency when ICU4X is also used as a polyfill.

II.C Why pluggable data?

One of the most visible departures that ICU4X makes from ICU4C and ICU4J is an explicit data provider argument on most constructor functions. The ICU4X data provider supports the following use cases:

Data files that are readable by both older and newer versions of the code; for more detail on how this works, see ICU4X Data Versioning Design
Data files that can be swapped in and out at runtime, making it easy to upgrade Unicode, CLDR, or time zone database versions. Swapping in new data can be done at runtime without needing to restart the application or clear internal caches.
Multiple data sources. For example, some data may be baked into the app, some may come from the operating system, and some may come from an HTTP service.
Customizable data caches. We recognize that there is no "one size fits all" approach to caching, so we allow the client to configure their data pipeline with the appropriate type of cache.
Fully configurable data fallbacks and overlays. Individual fields of ICU4X data can be selectively overridden at runtime.

III. How We Made ICU4X Lightweight

There are three factors that combine to make code lightweight: small binary size, low memory usage, and deliberate performance optimizations. For all three, we have metrics that are continuously measured on GitHub Actions continuous integration (CI).

III.A. Small Binary Size

Internationalization involves a large number of components with many interdependencies. To combat this problem, ICU4X optimizes for "tree shaking" (dead code elimination) by:

Minimizing the number of dependencies of each individual component.
Using static types in ways that scope functions to the pieces of data they need.
Splitting functions and classes that pull in more data than they need into multiple, smaller pieces.

Developers can statically link ICU4X and run a tree-shaking tool like LLVM link-time optimization (LTO) to produce a very small amount of compiled code, and then they can run our static analysis tool to build an optimally small data file for it.

In addition to static analysis, ICU4X supports dynamic data loading out of the box. This is the ultimate solution for supporting hundreds of languages, because new locale data can be downloaded on the fly only when they are needed, similar to message bundles for UI strings.

III.B. Low Memory Usage

At its core, internationalization transforms inputs to human-readable outputs, using locale-specific data. ICU4X introduces novel strategies for runtime loading of data involving zero memory allocations:

Supports Postcard-format resource files for dynamically loaded, zero-copy deserialized data across all architectures.
Supports compile-time linking of required data without deserialization overhead via DataBake.
Data schema is designed so that individual components can use the immutable locale data directly with minimal post-processing, greatly reducing the need for internal caches.
Explicit "data provider" argument to each function that requires data, making it very clear when data is required.

ICU4X team member Manish Goregaokar wrote a blog post series detailing how the zero-copy deserialization works under the covers.

III.C. Deliberate Performance Optimizations

Reducing CPU usage improves latency and battery life, important to most clients. ICU4X achieves low CPU usage by:

Writing in Rust, a high-performance language.
Utilizing zero-copy deserialization.
Measuring every change against performance benchmarks.

The ICU4X team uses a benchmark-driven approach to achieve highly competitive performance numbers: newly added components should have benchmarks, and future changes to those components should avoid regressing on those benchmarks.

Although we always seek to improve performance, we do so deliberately. There are often space/time tradeoffs, and the team takes a balanced approach. For example, if improving performance requires increasing or duplicating the data requirements, we tend to favor smaller data, like we've done in the normalizer and collator components. In the segmenter components, we offer two modes: a machine learning LSTM segmenter with lower data size but heavier CPU usage, and a dictionary-based segmenter with larger data size but faster. (There is ongoing work to make the LSTM segmenter require fewer CPU resources.)

IV. How We Made ICU4X Portable

The software ecosystem continually evolves with new programming languages. The "X" in ICU4X is a nod to the second main design goal: portability to many different environments.

ICU4X is Unicode's first internationalization library to have official wrappers in more than one target language. We do this with a tool we designed called Diplomat, which generates idiomatic bindings in many programming languages that encourage i18n best practices. Thanks to Diplomat, these bindings are easy to maintain, and new programming languages can be added without needing i18n expertise.

Under the covers, ICU4X is written in no_std Rust (no system dependencies) wrapped in a stable ABI that Diplomat bindings invoke across foreign function interface (FFI) or WebAssembly (WASM). We have some basic tutorials for using ICU4X from C++ and JavaScript/TypeScript.

V. What’s next?

ICU4X represents an exciting new step in bringing internationalized software to more devices, use cases, and programming languages. A Unicode working group is hard at work on expanding ICU4X’s feature set over time so that it becomes more useful and performant; we are eager to learn about new use cases and have more people contribute to the project.

Have questions? You can contact us on the ICU4X discussion forum!

Want to try it out? See our tutorials, especially our Intro tutorial!

Interested in getting involved? See our Contribution Guide.

Want to stay posted on future ICU4X updates? Sign up for our low-traffic announcements list, [email protected]!

Over 144,000 characters are available for adoption to help the Unicode Consortium’s work on digitally disadvantaged languages

Wednesday, December 21, 2022

Nearly 150,000 characters!

The Launch of ICU4X

When does i ≠ і?

Åge Møller, Πέτρος Νικόλαος Καρατζής, ராஜேந்திர சோழன்

You have 2 unread messages.

Māori, ‎Wolof, тоҷикӣ, ‎‎کٲشُر, ‎ትግርኛ, कॉशुर‎, ‎মৈতৈলোন্, ‎ᱥᱟᱱᱛᱟᱲᱤ

Revitalization and Preservation of Indigenous Languages

The Past and Future of Flag Emoji

Available Now! New YouTube Playlist and Technical Quick Start Guide

Support Unicode 💞💕💌💯✨🌟🤠🛟🎁

Monday, November 14, 2022

BACKGROUND

MOVING FORWARD

2024 AND BEYOND

Tuesday, November 8, 2022

Friday, October 28, 2022

Friday, October 21, 2022

Thursday, October 20, 2022

Thursday, October 6, 2022

Thursday, September 29, 2022

I. Introduction

II. Why ICU4X?

II.A. Why a new library?

II.B. Why use ICU4X when there is i18n in the platform?

II.C Why pluggable data?

III. How We Made ICU4X Lightweight

III.A. Small Binary Size

III.B. Low Memory Usage

III.C. Deliberate Performance Optimizations

IV. How We Made ICU4X Portable

V. What’s next?

Links of Interest

Blog Archive

Labels

Followers

Subscribe to this blog