CLDR-19466 Implement currency test data generator and modular validation suites#5808
CLDR-19466 Implement currency test data generator and modular validation suites#5808younies wants to merge 4 commits into
Conversation
06c8511 to
7890fa7
Compare
|
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
…alidation suites See unicode-org#5808
7890fa7 to
2b29782
Compare
|
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
2b29782 to
16b5a34
Compare
|
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
16b5a34 to
9222c97
Compare
|
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
9222c97 to
d7ed7eb
Compare
|
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
d7ed7eb to
61de19b
Compare
|
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
sffc
left a comment
There was a problem hiding this comment.
Suggestion: it seems we need to split the files a bit more. There are a few ways to do this, but I suggest splitting based on the CurrencyDisplay. You end up with:
currencies_short.tsvcurrencies_short_modern_locales.tsvcurrencies_short_modern_currencies.tsvcurrencies_short_extended_numbers.tsvcurrencies_narrow.tsvcurrencies_narrow_modern_locales.tsvcurrencies_narrow_modern_currencies.tsvcurrencies_narrow_extended_numbers.tsvcurrencies_iso.tsvcurrencies_iso_modern_locales.tsvcurrencies_iso_modern_currencies.tsvcurrencies_iso_extended_numbers.tsvcurrencies_name.tsvcurrencies_name_modern_locales.tsvcurrencies_name_modern_currencies.tsvcurrencies_name_extended_numbers.tsv
Hopefully each of those is under 10k lines.
| // South Asian currency representing lakh/crore grouping style | ||
| "INR"); |
There was a problem hiding this comment.
Question: Are there any locales that use lakh/crore for formatting INR that don't use lakh/crore for formatting other currencies?
There was a problem hiding this comment.
No. Point taken—since grouping is tied to the locale data rather than the currency, INR doesn't exercise any unique edge cases here.
For example, our core Bengali (bn) suite formatting standard USD already successfully exercises lakh/crore grouping and accounting parentheses:
bn USD standard decimal symbol 1234565.0 ১২,৩৪,৫৬৫.০০ US$
bn USD accounting decimal symbol -1230.05 (১,২৩০.০৫ US$)
Therefore, explicitly keeping INR in the core set is redundant, so I have pruned it.
| // High-volume East Asian currency | ||
| "CNY", |
There was a problem hiding this comment.
Being "high-volume" isn't enough to meet the bar for the core test set. We are testing currencies that exercise edge cases. Given the size of these files, we should be looking to prune the core set.
There was a problem hiding this comment.
Agreed. I've pruned CNY from CORE_CURRENCIES to help keep the core test set minimal and focused on edge cases.
| // Notable for custom financial formatting and separators (matches de_CH | ||
| // core locale) | ||
| "CHF", |
There was a problem hiding this comment.
Is there an example of a locale that uses different separators for CHF than it does for other currencies?
Do any of the other currencies on this list also exercise the different separators?
There was a problem hiding this comment.
you are right, the custom separator is tied to the locale de_CH and not to the currency CHF, I am going to add de_CH to the core locales then and remove CHF from the core currencies
| } | ||
| } | ||
|
|
||
| public enum FormatLength { |
There was a problem hiding this comment.
Issue: This is NOT the same as FormatLength in decimal formatting. What this enum is doing is choosing the currency symbol style. ECMA calls this "currency display" and that's probably what we should call it here, too. There are 4 standard choices in CLDR:
https://unicode.org/reports/tr35/tr35-numbers.html#Number_Pattern_Character_Definitions
There was a problem hiding this comment.
Renamed this enum to CurrencyDisplay and updated its constants to SYMBOL ("symbol"), NARROW_SYMBOL ("narrowSymbol"), CODE ("code"), and NAME ("name") to fully align with ECMA and CLDR TR35 specifications.
| public enum ValueRepresentation { | ||
| DECIMAL("decimal"), | ||
| COMPACT_SHORT("compact-short"); |
There was a problem hiding this comment.
Issue: CLDR calls this "format length", for better or worse, so we should do the same here. We already discussed this in your first PR.
There was a problem hiding this comment.
I agree and disagree at the same time :)
While CLDR calls compact notation sizes 'format length', CLDR also explicitly separates between the length of the value and the length of the currency symbol.
Therefore, to keep them clearly distinguished here, I've named this dimension ValueFormatLength (and TSV column value_format_length).
There was a problem hiding this comment.
You can call it NumberFormatLength; I think that's still compatible with the spec
| allModern.removeAll(getCoreCurrencies()); | ||
| return allModern; | ||
| } | ||
|
|
There was a problem hiding this comment.
Issue: Please add CurrencyFormatType with choices Standard and Accounting.
https://unicode.org/reports/tr35/tr35-numbers.html#Currency_Formats
There was a problem hiding this comment.
Added the CurrencyFormatType enum with choices STANDARD ("standard") and ACCOUNTING ("accounting"). Mapped ACCOUNTING to ICU's SignDisplay.ACCOUNTING in the formatting logic and expanded the TSV schema to include this dimension as a new column.
|
Here's another thought on how to split the files, only if you like it. You could add more columns, one for each currency display style. For example, instead of three lines split across 3 files You could have just 1 line (and add a fourth entry for the ISO code) |
61de19b to
0c62b01
Compare
…nerator) - Prune CHF, CNY, INR from core currencies - Add CurrencyFormatType enum (standard, accounting) - Rename ValueRepresentation to ValueFormatLength - Rename FormatLength to CurrencyDisplay - Update TSV generation schema and extended file splitting - Update TestCurrencyFormat unit tests TAG=agy CONV=accc5499-4131-4578-a83d-0f700da17d6e
|
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
0c62b01 to
31d29d1
Compare
|
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
| DECIMAL("decimal"), | ||
| COMPACT_SHORT("compact-short"); |
There was a problem hiding this comment.
Issue: This enum should have the same names and values as the one for Decimal
| } | ||
| return lnf.format(number).toString(); | ||
| } | ||
|
|
There was a problem hiding this comment.
Issue: The data files are still too big, so you need to either reduce the number of cases or split them up further. If you split them up further, please split them on a dimension that has a fixed number of values; for example, format length ("" or "short") and currency format type ("standard" or "accounting").
| try (TempPrintWriter pw = | ||
| TempPrintWriter.openUTF8Writer(CLDRPaths.TEST_DATA + OUTPUT_SUBDIR, filename)) { | ||
| pw.println( | ||
| "locale\tcurrency\tcurrency_format\tvalue_format_length\tcurrency_display\tinput\texpected"); |
There was a problem hiding this comment.
Nit: Make the column names match the enum names
|
|
||
| private static void writeTsv(List<TestCase> testCases, String filenamePrefix) | ||
| throws IOException { | ||
| String filename = filenamePrefix + "_" + (testCases.size() + 1) + "_lines.tsv"; |
There was a problem hiding this comment.
Nit: Before we land this, take off the number of lines from the filename, because the number of lines can change when CLDR adds or removes modern locales or currencies, but we want the filename to stay the same
| displayStyles, | ||
| coreNumbers, | ||
| combo -> true); | ||
| writeTsv(extCurrCases, "currencies_all_modern_currencies" + suffix); |
There was a problem hiding this comment.
Nit: Put the currency display first, and the "modern_currencies" after, and say "modern", not "all_modern"
macchiati
left a comment
There was a problem hiding this comment.
I think it is much cleaner to do this by locale, not by variations of number. We do this for some other test files, and it is easier to maintain — and use.
In addition, an important test case is a currency for the likely region for the language, which varies by language.
CLDR-19466
Description
🚀 Introduces
GenerateCurrencyFormatTestDataand synchronized tests inTestCurrencyFormatto establish rigorous cross-implementation benchmarks for currency formatting.📐 Dimensions Covered
en_US,de,ja,ar_EG, etc.) + Complete modern CLDR catalog.USD,EUR,JPY,CHF,RUB,EGP,CNY,INR) + Dynamic extraction of all active circulating world currencies.SHORT(symbol),LONG(full display name), andNARROW(narrow symbol).Notation.compactShort()).🔮 Next Steps (Upcoming PRs)
Diff Base #5709
ALLOW_MANY_COMMITS=true