CLDR-19466 Implement currency test data generator and modular validation suites #5808

Open
younies wants to merge 4 commits into unicode-org:main from younies:decimalformatter-ai-currency-final

Conversation

@younies younies commented Jun 7, 2026

Member

CLDR-19466

Description

🚀 Introduces GenerateCurrencyFormatTestData and synchronized tests in TestCurrencyFormat to establish rigorous cross-implementation benchmarks for currency formatting.

📐 Dimensions Covered

  • 🌐 Locales: Core subset (en_US, de, ja, ar_EG, etc.) + Complete modern CLDR catalog.
  • 💵 Currencies: Core representative set (USD, EUR, JPY, CHF, RUB, EGP, CNY, INR) + Dynamic extraction of all active circulating world currencies.
  • 🔢 Numbers: High-signal core baseline values + Complete boundary sets (powers of 10, fractional steps, negatives).
  • 📏 Format Lengths: Declarative metadata mapped to SHORT (symbol), LONG (full display name), and NARROW (narrow symbol).
  • 🏷️ Notations: Standard Decimal & Compact Short (Notation.compactShort()).
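
The dimensions above multiply together, which is what drives the size of the generated files. A minimal sketch of building such a matrix with an optional filter is shown here; the class name `TestCaseMatrix`, the `Combo` fields, and the method signature are assumptions for illustration and are not taken from `GenerateCurrencyFormatTestData`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch: none of these names come from the PR's generator.
class TestCaseMatrix {
    /** One combination of the test dimensions. */
    static final class Combo {
        final String locale;
        final String currency;
        final String display;
        final double number;

        Combo(String locale, String currency, String display, double number) {
            this.locale = locale;
            this.currency = currency;
            this.display = display;
            this.number = number;
        }
    }

    /** Builds every combination of the four dimensions and keeps those the filter accepts. */
    static List<Combo> generate(
            List<String> locales,
            List<String> currencies,
            List<String> displays,
            List<Double> numbers,
            Predicate<Combo> filter) {
        List<Combo> out = new ArrayList<>();
        for (String locale : locales) {
            for (String currency : currencies) {
                for (String display : displays) {
                    for (double number : numbers) {
                        Combo combo = new Combo(locale, currency, display, number);
                        if (filter.test(combo)) {
                            out.add(combo);
                        }
                    }
                }
            }
        }
        return out;
    }
}
```

With 2 locales, 3 currencies, 2 display styles, and 2 numbers, this yields 24 rows, so adding one value to any dimension scales the whole file.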

🔮 Next Steps (Upcoming PRs)

  1. ⚙️ Engine Refactoring: Evolve from relying on ICU4J to building an independent engine powered directly by raw CLDR XML data and LDML TR35 specs.
  2. 🏦 Format Expansion: Add complete suites for accounting formatting and no-currency formatting.

Diff Base #5709

  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true

@younies younies requested a review from sffc June 7, 2026 11:15
@younies younies force-pushed the decimalformatter-ai-currency-final branch from 06c8511 to 7890fa7 Compare June 7, 2026 11:19
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateCurrencyFormatTestData.java is different
  • tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestCurrencyFormat.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

younies added a commit to younies/cldr that referenced this pull request Jun 7, 2026
@younies younies force-pushed the decimalformatter-ai-currency-final branch from 7890fa7 to 2b29782 Compare June 7, 2026 11:19
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@younies younies changed the title Draft CLDR-19466 Implement currency test data generator and modular validation suites CLDR-19466 Implement currency test data generator and modular validation suites Jun 7, 2026
younies added a commit to younies/cldr that referenced this pull request Jun 7, 2026
@younies younies force-pushed the decimalformatter-ai-currency-final branch from 2b29782 to 16b5a34 Compare June 7, 2026 11:20
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

younies added a commit to younies/cldr that referenced this pull request Jun 7, 2026
@younies younies force-pushed the decimalformatter-ai-currency-final branch from 16b5a34 to 9222c97 Compare June 7, 2026 11:38
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • common/testData/currency/README.md is no longer changed in the branch
  • common/testData/decimal/decimals_all_numbers.tsv is no longer changed in the branch
  • common/testData/decimal/decimals_modern_locales.tsv is no longer changed in the branch
  • common/testData/decimal/decimals.tsv is no longer changed in the branch
  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateCurrencyFormatTestData.java is different
  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateDecimalFormatTestData.java is no longer changed in the branch
  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateTestData.java is different
  • tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestCurrencyFormat.java is different
  • tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestDecimalFormat.java is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

younies added a commit to younies/cldr that referenced this pull request Jun 7, 2026
@younies younies force-pushed the decimalformatter-ai-currency-final branch from 9222c97 to d7ed7eb Compare June 7, 2026 11:55
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateCurrencyFormatTestData.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

younies added a commit to younies/cldr that referenced this pull request Jun 9, 2026
@younies younies force-pushed the decimalformatter-ai-currency-final branch from d7ed7eb to 61de19b Compare June 9, 2026 16:02
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • common/testData/currency/currencies_2161_lines.tsv is different
  • common/testData/currency/currencies_all_modern_currencies_long_13051_lines.tsv is different
  • common/testData/currency/currencies_all_modern_currencies_narrow_13051_lines.tsv is different
  • common/testData/currency/currencies_all_modern_currencies_short_13051_lines.tsv is different
  • common/testData/currency/currencies_all_modern_locales_long_7681_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_all_modern_locales_long_7761_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_all_modern_locales_narrow_7681_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_all_modern_locales_narrow_7761_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_all_modern_locales_short_7681_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_all_modern_locales_short_7761_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_all_numbers_long_20161_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_all_numbers_narrow_20161_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_all_numbers_short_20161_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_extended_numbers_long_20161_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_extended_numbers_narrow_20161_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_extended_numbers_short_20161_lines.tsv is now changed in the branch
  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateCurrencyFormatTestData.java is different
  • tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestCurrencyFormat.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@sffc sffc left a comment

Member

Suggestion: it seems we need to split the files a bit more. There are a few ways to do this, but I suggest splitting based on the CurrencyDisplay. You end up with:

  • currencies_short.tsv
  • currencies_short_modern_locales.tsv
  • currencies_short_modern_currencies.tsv
  • currencies_short_extended_numbers.tsv
  • currencies_narrow.tsv
  • currencies_narrow_modern_locales.tsv
  • currencies_narrow_modern_currencies.tsv
  • currencies_narrow_extended_numbers.tsv
  • currencies_iso.tsv
  • currencies_iso_modern_locales.tsv
  • currencies_iso_modern_currencies.tsv
  • currencies_iso_extended_numbers.tsv
  • currencies_name.tsv
  • currencies_name_modern_locales.tsv
  • currencies_name_modern_currencies.tsv
  • currencies_name_extended_numbers.tsv

Hopefully each of those is under 10k lines.

Comment on lines +63 to +64
// South Asian currency representing lakh/crore grouping style
"INR");

Member

Question: Are there any locales that use lakh/crore for formatting INR that don't use lakh/crore for formatting other currencies?

Member Author

No. Point taken—since grouping is tied to the locale data rather than the currency, INR doesn't exercise any unique edge cases here.

For example, our core Bengali (bn) suite already exercises lakh/crore grouping and accounting parentheses when formatting USD:
bn USD standard decimal symbol 1234565.0 ১২,৩৪,৫৬৫.০০ US$
bn USD accounting decimal symbol -1230.05 (১,২৩০.০৫ US$)

Therefore, explicitly keeping INR in the core set is redundant, so I have pruned it.
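
To illustrate why the currency code does not matter here: lakh/crore grouping follows from the grouping sizes in the locale's number pattern (for example a pattern such as `#,##,##0.###`, with a primary group of 3 and a secondary group of 2). The sketch below is a simplified stand-in, not ICU or CLDR code; `GroupingSketch.group` is a hypothetical helper that only handles a plain string of integer digits.

```java
// Hypothetical helper for illustration; real formatters also handle
// minimum grouping digits, fraction digits, and locale digit sets.
class GroupingSketch {
    /**
     * Inserts separators into a string of digits. The rightmost group has
     * {@code primary} digits; every group to its left has {@code secondary} digits.
     */
    static String group(String digits, int primary, int secondary, char separator) {
        StringBuilder sb = new StringBuilder(digits);
        // Walk from right to left so earlier insert positions stay valid.
        int pos = digits.length() - primary;
        while (pos > 0) {
            sb.insert(pos, separator);
            pos -= secondary;
        }
        return sb.toString();
    }
}
```

Calling `group("1234565", 3, 2, ',')` gives `12,34,565`, while `group("1234565", 3, 3, ',')` gives `1,234,565`. The currency never appears as an input, which is the point made above.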

Comment on lines +61 to +62
// High-volume East Asian currency
"CNY",

Member

Being "high-volume" isn't enough to meet the bar for the core test set. We are testing currencies that exercise edge cases. Given the size of these files, we should be looking to prune the core set.

Member Author

Agreed. I've pruned CNY from CORE_CURRENCIES to help keep the core test set minimal and focused on edge cases.

Comment on lines +52 to +54
// Notable for custom financial formatting and separators (matches de_CH
// core locale)
"CHF",

Member

Is there an example of a locale that uses different separators for CHF than it does for other currencies?

Do any of the other currencies on this list also exercise the different separators?

Member Author

You are right: the custom separator is tied to the locale de_CH, not to the currency CHF. I will add de_CH to the core locales and remove CHF from the core currencies.

}
}

public enum FormatLength {

Member

Issue: This is NOT the same as FormatLength in decimal formatting. What this enum is doing is choosing the currency symbol style. ECMA calls this "currency display" and that's probably what we should call it here, too. There are 4 standard choices in CLDR:

https://unicode.org/reports/tr35/tr35-numbers.html#Number_Pattern_Character_Definitions

Member Author

Renamed this enum to CurrencyDisplay and updated its constants to SYMBOL ("symbol"), NARROW_SYMBOL ("narrowSymbol"), CODE ("code"), and NAME ("name") to fully align with ECMA and CLDR TR35 specifications.
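
For readers without the diff open, the renamed enum might look roughly like the sketch below. The constants and their string values come from the comment above; the `tsvValue` accessor and the `fromTsv` lookup are hypothetical additions for illustration and may not exist in the PR.

```java
import java.util.Optional;

// Sketch only: constants and string values are from the review thread,
// the accessor and lookup method names are assumptions.
enum CurrencyDisplay {
    SYMBOL("symbol"),
    NARROW_SYMBOL("narrowSymbol"),
    CODE("code"),
    NAME("name");

    private final String tsvValue;

    CurrencyDisplay(String tsvValue) {
        this.tsvValue = tsvValue;
    }

    /** The string written to the TSV column for this display style. */
    String tsvValue() {
        return tsvValue;
    }

    /** Reverse lookup from a TSV cell; empty if the value is not recognized. */
    static Optional<CurrencyDisplay> fromTsv(String value) {
        for (CurrencyDisplay display : values()) {
            if (display.tsvValue.equals(value)) {
                return Optional.of(display);
            }
        }
        return Optional.empty();
    }
}
```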

Comment on lines +111 to +113
public enum ValueRepresentation {
DECIMAL("decimal"),
COMPACT_SHORT("compact-short");

Member

Issue: CLDR calls this "format length", for better or worse, so we should do the same here. We already discussed this in your first PR.

Member Author

I agree and disagree at the same time :)

While CLDR calls compact notation sizes 'format length', it also explicitly distinguishes the length of the value from the length of the currency symbol.

Therefore, to keep them clearly distinguished here, I've named this dimension ValueFormatLength (and TSV column value_format_length).

Member

You can call it NumberFormatLength; I think that's still compatible with the spec

allModern.removeAll(getCoreCurrencies());
return allModern;
}

Member

Issue: Please add CurrencyFormatType with choices Standard and Accounting.

https://unicode.org/reports/tr35/tr35-numbers.html#Currency_Formats

Member Author

Added the CurrencyFormatType enum with choices STANDARD ("standard") and ACCOUNTING ("accounting"). Mapped ACCOUNTING to ICU's SignDisplay.ACCOUNTING in the formatting logic and expanded the TSV schema to include this dimension as a new column.
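
As a rough illustration of what the new dimension changes in the output, the sketch below applies a sign to an already formatted absolute value. It is deliberately simplified and is not how ICU does it: real accounting patterns come from locale data and are not always parentheses. The class and method names are hypothetical.

```java
// Simplified illustration; real accounting patterns are locale-specific.
class AccountingSignSketch {
    enum CurrencyFormatType { STANDARD, ACCOUNTING }

    /** Decorates a formatted absolute value with a minus sign or parentheses. */
    static String applySign(String formattedAbs, boolean negative, CurrencyFormatType type) {
        if (!negative) {
            return formattedAbs;
        }
        return type == CurrencyFormatType.ACCOUNTING
                ? "(" + formattedAbs + ")"
                : "-" + formattedAbs;
    }
}
```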

sffc commented Jun 10, 2026

Member

Here's another thought on how to split the files, only if you like it. You could add more columns, one for each currency display style. For example, instead of three lines split across 3 files:

ar	USD	decimal	short	0.0	‏0.00 US$
ar	USD	decimal	long	0.0	0.00 دولار أمريكي
ar	USD	decimal	narrow	0.0	‏0.00 US$

You could have just 1 line

ar	USD	decimal	short	0.0	‏0.00 US$	0.00 دولار أمريكي	‏0.00 US$

(and add a fourth entry for the ISO code)
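
A minimal sketch of that pivot, assuming input rows with the six tab-separated fields shown in the example (locale, currency, notation, display, input, expected). The class name `DisplayPivot` and the fixed column order are assumptions. Unlike the one-line example above, this version drops the display column from the output, since each display style gets its own column.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of merging per-display rows into one wide row.
class DisplayPivot {
    /** Output column order; the labels mirror the example rows. */
    static final List<String> DISPLAYS = List.of("short", "long", "narrow");

    /**
     * Input rows: locale, currency, notation, display, input, expected.
     * Output rows: locale, currency, notation, input, then one expected value per display.
     */
    static List<String> pivot(List<String> tsvRows) {
        Map<String, String[]> byKey = new LinkedHashMap<>();
        for (String row : tsvRows) {
            String[] f = row.split("\t", -1);
            String key = String.join("\t", f[0], f[1], f[2], f[4]);
            int column = DISPLAYS.indexOf(f[3]);
            if (column < 0) {
                throw new IllegalArgumentException("unknown display: " + f[3]);
            }
            byKey.computeIfAbsent(key, k -> new String[DISPLAYS.size()])[column] = f[5];
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String[]> entry : byKey.entrySet()) {
            StringBuilder sb = new StringBuilder(entry.getKey());
            for (String expected : entry.getValue()) {
                // A missing display style becomes an empty cell.
                sb.append('\t').append(expected == null ? "" : expected);
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```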

@younies younies force-pushed the decimalformatter-ai-currency-final branch from 61de19b to 0c62b01 Compare June 10, 2026 10:07
younies added a commit to younies/cldr that referenced this pull request Jun 10, 2026
younies added a commit to younies/cldr that referenced this pull request Jun 10, 2026
…nerator)

- Prune CHF, CNY, INR from core currencies

- Add CurrencyFormatType enum (standard, accounting)

- Rename ValueRepresentation to ValueFormatLength

- Rename FormatLength to CurrencyDisplay

- Update TSV generation schema and extended file splitting

- Update TestCurrencyFormat unit tests

TAG=agy

CONV=accc5499-4131-4578-a83d-0f700da17d6e
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • common/testData/currency/currencies_2161_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_3601_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_all_modern_currencies_code_26641_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_all_modern_currencies_long_13051_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_all_modern_currencies_name_26641_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_all_modern_currencies_narrow_13051_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_all_modern_currencies_narrowSymbol_26641_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_all_modern_currencies_short_13051_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_all_modern_currencies_symbol_26641_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_all_modern_locales_code_9601_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_all_modern_locales_long_7681_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_all_modern_locales_name_9601_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_all_modern_locales_narrow_7681_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_all_modern_locales_narrowSymbol_9601_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_all_modern_locales_short_7681_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_all_modern_locales_symbol_9601_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_extended_numbers_code_25201_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_extended_numbers_long_20161_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_extended_numbers_name_25201_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_extended_numbers_narrow_20161_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_extended_numbers_narrowSymbol_25201_lines.tsv is now changed in the branch
  • common/testData/currency/currencies_extended_numbers_short_20161_lines.tsv is no longer changed in the branch
  • common/testData/currency/currencies_extended_numbers_symbol_25201_lines.tsv is now changed in the branch
  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateCurrencyFormatTestData.java is different
  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateTestData.java is different
  • tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestCurrencyFormat.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@younies younies force-pushed the decimalformatter-ai-currency-final branch from 0c62b01 to 31d29d1 Compare June 10, 2026 10:08
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@younies younies requested a review from sffc June 10, 2026 10:09
Comment on lines +121 to +122
DECIMAL("decimal"),
COMPACT_SHORT("compact-short");

@sffc sffc Jun 10, 2026

Member

Issue: This enum should have the same names and values as the one for Decimal

}
return lnf.format(number).toString();
}

Member

Issue: The data files are still too big, so you need to either reduce the number of cases or split them up further. If you split them up further, please split them on a dimension that has a fixed number of values; for example, format length ("" or "short") and currency format type ("standard" or "accounting").
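
One way to read this suggestion is sketched below: route each test case to a file chosen only by the two fixed-size dimensions, so the set of files never grows when locales or currencies are added. The `Case` type, the `fileFor` naming scheme, and the class name are assumptions for illustration; the actual file names would follow whatever convention the PR settles on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: splits cases into at most 2 x 2 = 4 files.
class FixedDimensionSplit {
    /** Minimal stand-in for a test case; only the splitting dimensions matter here. */
    static final class Case {
        final String formatLength; // "" or "short"
        final String formatType;   // "standard" or "accounting"
        final String line;         // the TSV line to write

        Case(String formatLength, String formatType, String line) {
            this.formatLength = formatLength;
            this.formatType = formatType;
            this.line = line;
        }
    }

    /** Illustrative naming only; an empty format length adds no suffix. */
    static String fileFor(Case c) {
        String length = c.formatLength.isEmpty() ? "" : "_" + c.formatLength;
        return "currencies_" + c.formatType + length + ".tsv";
    }

    /** Groups TSV lines by target file name, preserving input order within each file. */
    static Map<String, List<String>> split(List<Case> cases) {
        Map<String, List<String>> files = new TreeMap<>();
        for (Case c : cases) {
            files.computeIfAbsent(fileFor(c), k -> new ArrayList<>()).add(c.line);
        }
        return files;
    }
}
```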

try (TempPrintWriter pw =
TempPrintWriter.openUTF8Writer(CLDRPaths.TEST_DATA + OUTPUT_SUBDIR, filename)) {
pw.println(
"locale\tcurrency\tcurrency_format\tvalue_format_length\tcurrency_display\tinput\texpected");

Member

Nit: Make the column names match the enum names
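
One possible way to keep the header and the enums in sync is to derive each column name from the enum's class name instead of hard-coding it. This is a sketch under assumptions: the nested enums are empty stand-ins, `NumberFormatLength` is the name proposed earlier in this thread rather than a confirmed final name, and `columnName` is a hypothetical helper.

```java
import java.util.Locale;

// Hypothetical sketch: derives snake_case column names from enum class names.
class HeaderNames {
    // Stand-in enums; the real ones live in the generator.
    enum CurrencyFormatType { STANDARD, ACCOUNTING }
    enum NumberFormatLength { DECIMAL, COMPACT_SHORT }
    enum CurrencyDisplay { SYMBOL, NARROW_SYMBOL, CODE, NAME }

    /** Converts a CamelCase simple class name to snake_case. */
    static String columnName(Class<? extends Enum<?>> type) {
        return type.getSimpleName()
                .replaceAll("([a-z0-9])([A-Z])", "$1_$2")
                .toLowerCase(Locale.ROOT);
    }

    /** Builds the TSV header row with enum-derived column names. */
    static String header() {
        return String.join(
                "\t",
                "locale",
                "currency",
                columnName(CurrencyFormatType.class),
                columnName(NumberFormatLength.class),
                columnName(CurrencyDisplay.class),
                "input",
                "expected");
    }
}
```

With this approach a later enum rename changes the header automatically, at the cost of making the TSV schema depend on Java class names.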


private static void writeTsv(List<TestCase> testCases, String filenamePrefix)
throws IOException {
String filename = filenamePrefix + "_" + (testCases.size() + 1) + "_lines.tsv";

Member

Nit: Before we land this, take off the number of lines from the filename, because the number of lines can change when CLDR adds or removes modern locales or currencies, but we want the filename to stay the same

displayStyles,
coreNumbers,
combo -> true);
writeTsv(extCurrCases, "currencies_all_modern_currencies" + suffix);

Member

Nit: Put the currency display first, and the "modern_currencies" after, and say "modern", not "all_modern"
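
Taken together with the earlier nit about dropping the line count, the requested naming could be sketched as follows. The helper name `fileName` and its parameters are hypothetical; the resulting names match the list suggested at the top of this review (for example `currencies_short_modern_currencies.tsv`).

```java
// Hypothetical sketch of a stable file name: display first, subset second, no line count.
class TestDataFileNames {
    /**
     * @param currencyDisplay the display style label, for example "short" or "narrow"
     * @param subset the subset label, for example "modern_currencies", or "" for the core file
     */
    static String fileName(String currencyDisplay, String subset) {
        String tail = subset.isEmpty() ? "" : "_" + subset;
        return "currencies_" + currencyDisplay + tail + ".tsv";
    }
}
```

Because the name no longer encodes the number of rows, regenerating the data after a CLDR update rewrites the same files instead of adding and deleting them.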

@younies younies requested a review from sffc June 10, 2026 18:07

@macchiati macchiati left a comment

Member

I think it is much cleaner to do this by locale, not by variations of number. We do this for some other test files, and it is easier to maintain — and use.

In addition, an important test case is a currency for the likely region for the language, which varies by language.


3 participants