A Proposed Update UTS #18, Unicode Regular Expressions, is now available for
review and feedback.
Regular expressions are a key tool in software development. Back in
2000, few regular expression engines supported Unicode, even at a basic level.
UTS #18 set out to raise the bar, describing how regular expression engines
could be adapted to deal with Unicode correctly and completely. Since that time,
major programming languages and libraries have adopted Level 1 features
(supporting all Unicode literals, basic character properties, subtraction,
intersection, ...), and some have also adopted certain Level 2 features (full
character properties, grapheme clusters, ...).
A major enhancement to UTS #18 in 2020 focused on the addition of
Character Classes with strings. The initial impetus for this was to handle emoji
effectively in browsers, as most emoji consist of more than one code point.
Supporting strings directly in character classes frees programs from having
to download large amounts of data or handle complicated syntax. Using a property
like RGI_Emoji allows a regular expression to match both individual code points
such as "😁" and multi-code-point strings such as "🇫🇷". This extension to strings is
also important for internationalization. For example, the alphabets used by many
languages contain multi-code-point strings, so this extension allows them to be
handled easily.
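
To make the idea concrete, here is a minimal sketch of how a character class
containing strings can be emulated in an engine that lacks the feature. It is
not the UTS #18 syntax or any engine's actual implementation: the helper name
class_with_strings and the two small sets are hypothetical stand-ins chosen for
illustration (the real RGI_Emoji set is far larger). The sketch assumes the
usual approach of expanding the class into an alternation with longer strings
tried first, so that a multi-code-point string wins over its own prefix.

```python
import re

def class_with_strings(items):
    """Compile an alternation that behaves like a class containing strings.

    Hypothetical helper for illustration only. Items are sorted longest
    first so that "ch" is tried before "c".
    """
    ordered = sorted(set(items), key=lambda s: (-len(s), s))
    return re.compile("|".join(re.escape(s) for s in ordered))

# Tiny stand-in for an emoji property: one single-code-point emoji and
# one flag made of two regional indicator code points.
GRINNING = "\U0001F601"             # 😁
FLAG_FR = "\U0001F1EB\U0001F1F7"    # 🇫🇷
emoji_like = class_with_strings([GRINNING, FLAG_FR])

# Tiny stand-in for an alphabet in which the digraph "ch" is one letter.
letters = class_with_strings(["a", "c", "ch", "h", "t"])

if __name__ == "__main__":
    print(letters.findall("chata"))
    print([len(m) for m in emoji_like.findall("x" + GRINNING + "y" + FLAG_FR)])
```

With native support for strings in character classes, the engine does this
bookkeeping itself, and the pattern author only needs to name the property.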
Additional enhancements are in progress this year, based on work
with members of the ECMAScript committee, including more clarifications, better
guidance on implementation, and resolution of some tricky issues in
complementing (inverting) Character Classes. The end goal of all of these
enhancements in 2020 and 2021 is to significantly raise the level of Unicode
support in programming languages and libraries.
For more information, see
https://www.unicode.org/review/pri427/.