Implement new UTS46 logic from Unicode 16 #197

Open
kjd opened this issue Sep 16, 2024 · 0 comments

kjd commented Sep 16, 2024

Unicode 16 introduces changes to UTS46 that require refactoring our UTS46 implementation. Specifically:

  • Reissued for Unicode 16.0.0.
  • The handling of UseSTD3ASCIIRules has been simplified. Conditional data involving disallowed_STD3_* Status values has been replaced with simple checking for a subset of ASCII characters in the Validity Criteria. This simplifies the data format and data lookup, makes standard UseSTD3ASCIIRules=true handling consistent with custom UseSTD3ASCIIRules, and avoids unnecessarily disallowing certain labels that contain disallowed_STD3_mapped characters but which do not contain non-LDH ASCII characters when the mappings are applied.
    Behavior for UseSTD3ASCIIRules=false is unchanged.
    Examples for UseSTD3ASCIIRules=true behavior changes:
    • Example for a label which continues to fail the Validity Criteria despite the change in Processing: In Unicode 15.1, input label "⑷" was unchanged in Processing and failed the Validity Criteria. (U+2477 disallowed_STD3_mapped was resolved to disallowed, and its mapping was not applied.) In Unicode 16.0, "⑷" is Mapped to "(4)", which still fails the Validity Criteria, except if a custom set of valid ASCII characters is used that includes the parentheses.
    • Example for a label which newly passes the Validity Criteria due to the change in Processing: In Unicode 15.1, input label "\uFF1D\u0338" (fullwidth equals + combining solidus overlay) was unchanged in Processing and failed the Validity Criteria. (U+FF1D disallowed_STD3_mapped was resolved to disallowed, and its mapping was not applied.) In Unicode 16.0, "\uFF1D\u0338" is Mapped to "\u003D\u0338" and Normalized to "\u2260" (not equal to), which is valid.
  • In Section 4, Processing, if the label starts with “xn--” and the conversion from Punycode yields either an empty label or an all-ASCII label, then an error is now recorded, consistent with IDNA2008.
  • Changed Section 6 Mapping Table Derivation, Table 3. Base Valid Set, replacing \p{Block=Ideographic_Description_Characters} and \u31EF with the equivalent [\p{IDS_Unary_Operator}\p{IDS_Binary_Operator}\p{IDS_Trinary_Operator}].
  • Changed Section 6 Mapping Table Derivation, Step 3: Specify the base exclusion set, to define a small, fixed base exclusion set. Previously, the base exclusion set had been derived from differences between IDNA2003 data and UTS46 principles.
    Changes in the Unicode 15.1 version led to unexpected edge cases in processing. At the same time, transitional processing was deprecated.
    The UTC concluded that it was no longer necessary to disallow characters on the basis of differences from IDNA2003, and decided to simplify the definition of the base exclusion set.
    As a result, a number of characters that were disallowed before are now ignored, mapped, or (in the one case of U+1806 MONGOLIAN TODO SOFT HYPHEN) valid. In xn-- Punycode labels, characters with Status ignored and mapped are still not valid. The recent edge cases and processing complications are no longer present.
    For details see the proposal in document L2/24-064 item 6.2.
  • Removed the content of Section 7, IDNA Comparison, which is no longer applicable.
  • Noted in Section 8.3, Migration new syntax in the test file: "" means an empty string. There are also other test data corrections and improvements.
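
To make the UseSTD3ASCIIRules=true change concrete, here is a rough sketch of the new flow: apply mappings unconditionally, normalize to NFC, then check only the ASCII characters against an allowed set. `SAMPLE_MAPPINGS`, `LDH` and `process_label` are hypothetical names for illustration, not part of our current code, and the two-entry table stands in for the real mapping data. It only models the STD3 check, not the rest of the Validity Criteria.

```python
import unicodedata

# Hypothetical two-entry excerpt standing in for the real UTS46 mapping table.
SAMPLE_MAPPINGS = {
    "\u2477": "(4)",  # PARENTHESIZED DIGIT FOUR
    "\uff1d": "=",    # FULLWIDTH EQUALS SIGN
}

# Lowercase letters, digits and hyphen; uppercase is assumed to be mapped away earlier.
LDH = frozenset("abcdefghijklmnopqrstuvwxyz0123456789-")


def process_label(label, valid_ascii=LDH):
    """Map, normalize, then apply only the ASCII-subset check to the result."""
    mapped = "".join(SAMPLE_MAPPINGS.get(ch, ch) for ch in label)
    normalized = unicodedata.normalize("NFC", mapped)
    # Non-ASCII characters are not judged here; only ASCII outside the allowed set fails.
    ok = all(ch in valid_ascii for ch in normalized if ch.isascii())
    return normalized, ok


print(process_label("\u2477"))                   # mapped to "(4)", fails with plain LDH
print(process_label("\u2477", LDH | {"(", ")"})) # passes with a custom ASCII set
print(process_label("\uff1d\u0338"))             # composes to U+2260, no ASCII left to reject
```

Because the check now runs on the mapped and normalized label, the standard and custom ASCII sets go through the same code path, which is the consistency the spec text describes.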
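
The new “xn--” rule from Section 4 could be sketched along these lines, using the standard library `punycode` codec. `check_xn_label` and its error strings are made up for illustration; the real implementation would presumably plug into our existing error handling instead of returning a list.

```python
from typing import List


def check_xn_label(label: str) -> List[str]:
    """Return errors for an xn-- label whose decoded form is empty or all-ASCII."""
    if not label.lower().startswith("xn--"):
        return []
    try:
        # Strip the ACE prefix and decode the remainder from Punycode.
        decoded = label[4:].encode("ascii").decode("punycode")
    except UnicodeError:
        return ["punycode conversion failed"]
    if decoded == "":
        return ["xn-- label decodes to an empty label"]
    if decoded.isascii():
        return ["xn-- label decodes to an all-ASCII label"]
    return []


print(check_xn_label("xn--"))            # empty after the prefix
print(check_xn_label("xn--abc-"))        # decodes to plain "abc"
print(check_xn_label("xn--mnchen-3ya"))  # decodes to "münchen", no error from this rule
```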