[FLINK-38740][table] Introduce Welford's online algorithm for variance related functions to avoid catastrophic cancellation in naive algorithm #27325

dylanhz · 2025-12-05T09:18:49Z

What is the purpose of the change

Fix catastrophic cancellation in the naive algorithm used by the variance-related functions.
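
To illustrate the problem, here is a minimal standalone sketch (not code from this PR) comparing a naive sum-of-squares variance with Welford's online update. The class name, method names, and sample data are hypothetical. The data has a large offset and a small spread, so the exact population variance is 22.5.

```java
/** Compares the naive sum-of-squares variance with Welford's online update. */
class VarianceCancellationDemo {

    /** Naive population variance: (sum(x^2) - sum(x)^2 / n) / n. */
    static double naiveVarPop(double[] xs) {
        double sum = 0.0;
        double sumSq = 0.0;
        for (double x : xs) {
            sum += x;
            sumSq += x * x;
        }
        int n = xs.length;
        // Two nearly equal, very large numbers are subtracted here,
        // so the significant digits cancel and rounding error dominates.
        return (sumSq - sum * sum / n) / n;
    }

    /** Welford's online population variance: tracks the running mean and M2. */
    static double welfordVarPop(double[] xs) {
        long n = 0;
        double mean = 0.0;
        double m2 = 0.0; // sum of squared deviations from the running mean
        for (double x : xs) {
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }
        return m2 / n;
    }

    public static void main(String[] args) {
        // Hypothetical data: large offset, small spread.
        double[] xs = {1_000_000_004.0, 1_000_000_007.0, 1_000_000_013.0, 1_000_000_016.0};
        System.out.println("naive   = " + naiveVarPop(xs));
        System.out.println("welford = " + welfordVarPop(xs));
    }
}
```

With inputs like these, the squares are near 1e18, where adjacent doubles are far more than 22.5 apart, so the naive result is dominated by rounding and can even come out negative. Welford's update only ever works with small deviations from the running mean.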

Brief change log

Introduce Welford's online algorithm for variance related functions.
Add a new internal built-in agg function WELFORD_M2.
Add a new rewrite rule that reuses the intermediate result welford_m2 for all variance-related functions.
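
As a rough sketch of why one intermediate result is enough: the population and sample variance and both standard deviations can all be derived from the same (count, M2) pair. The class and method names below are hypothetical, and the null handling for small groups is an assumption based on usual SQL semantics. This is not the expression the planner rule actually generates.

```java
/** Derives the four variance-related results from a shared (count, M2) pair. */
class VarianceFromM2 {

    /** Population variance: M2 / n. Assumed null for an empty group. */
    static Double varPop(long n, double m2) {
        return n <= 0 ? null : m2 / n;
    }

    /** Sample variance: M2 / (n - 1). Assumed null when fewer than two rows. */
    static Double varSamp(long n, double m2) {
        return n <= 1 ? null : m2 / (n - 1);
    }

    /** Population standard deviation: sqrt of the population variance. */
    static Double stddevPop(long n, double m2) {
        Double v = varPop(n, m2);
        return v == null ? null : Math.sqrt(v);
    }

    /** Sample standard deviation: sqrt of the sample variance. */
    static Double stddevSamp(long n, double m2) {
        Double v = varSamp(n, m2);
        return v == null ? null : Math.sqrt(v);
    }

    public static void main(String[] args) {
        // Hypothetical group of 4 rows whose squared deviations sum to 90.
        long n = 4;
        double m2 = 90.0;
        System.out.println("VAR_POP     = " + varPop(n, m2));
        System.out.println("VAR_SAMP    = " + varSamp(n, m2));
        System.out.println("STDDEV_POP  = " + stddevPop(n, m2));
        System.out.println("STDDEV_SAMP = " + stddevSamp(n, m2));
    }
}
```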

Verifying this change

New test class: MathAggFunctionITCase.
Update some existing tests using variance related functions.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes)
The serializers: (no)
The runtime per-record code paths (performance sensitive): (no)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
The S3 file system connector: (no)

Documentation

Does this pull request introduce a new feature? (no)
If yes, how is the feature documented? (not documented)

dylanhz · 2025-12-05T09:20:01Z

The first commit will be removed once it is merged; see #27319.

flinkbot · 2025-12-05T09:24:48Z

CI report:

232f955 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

lincoln-lil · 2025-12-22T13:00:35Z

@flinkbot run azure

dylanhz · 2025-12-23T01:59:36Z

@flinkbot run azure

dylanhz · 2025-12-24T12:32:17Z

The @flinkbot run azure command is not working; we need some help to fix it.

lincoln-lil

@dylanhz Thank you for fixing this! This PR is of high code quality with solid test coverage, and it effectively resolves the long-standing numerical stability issues in these functions.

I checked:
The core implementation in WelfordM2AggFunction (accumulate, merge, retract, and numerical safeguards in getValue).
The related changes in AggregateReduceFunctionsRule, with corresponding plan tests.
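
For readers unfamiliar with the merge step, here is a minimal sketch of how two Welford accumulators can be combined using the standard pairwise update. The class and field names are hypothetical, and this is not the code in WelfordM2AggFunction.

```java
/** Minimal Welford accumulator with a pairwise merge step. */
class WelfordMergeSketch {
    long n;      // number of accumulated values
    double mean; // running mean
    double m2;   // sum of squared deviations from the mean

    /** Standard Welford update for one new value. */
    void accumulate(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    /** Folds another partial accumulator into this one. */
    void merge(WelfordMergeSketch other) {
        if (other.n == 0) {
            return; // nothing to add
        }
        if (n == 0) {
            // This side is empty, so simply copy the other side.
            n = other.n;
            mean = other.mean;
            m2 = other.m2;
            return;
        }
        long total = n + other.n;
        double delta = other.mean - mean;
        // Combined M2 = both partial M2 values plus a term for the gap between the means.
        m2 += other.m2 + delta * delta * n * other.n / total;
        mean += delta * other.n / total;
        n = total;
    }

    public static void main(String[] args) {
        WelfordMergeSketch a = new WelfordMergeSketch();
        a.accumulate(2);
        a.accumulate(4);
        WelfordMergeSketch b = new WelfordMergeSketch();
        b.accumulate(6);
        b.accumulate(8);
        a.merge(b);
        System.out.println("n=" + a.n + " mean=" + a.mean + " m2=" + a.m2);
    }
}
```

Merging the partial results for {2, 4} and {6, 8} should give the same count, mean, and M2 as accumulating 2, 4, 6, 8 into a single accumulator.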

Overall, LGTM! Only minor comments.

...e/src/main/java/org/apache/flink/table/runtime/functions/aggregate/WelfordM2AggFunction.java

lincoln-lil · 2025-12-25T07:35:52Z

+    @Override
+    public Double getValue(WelfordM2Accumulator acc) {
+        // Theoretically, acc.m2 should always be non-negative.
+        // But in practice it may be negative if records are out of order, which is different from


nit: IIUC, perhaps we should clarify here that being out of order doesn't by itself affect the result mathematically; the issue mainly stems from accumulated precision loss caused by floating-point arithmetic in combination with retractions?

dylanhz · 2025-12-25

Thanks for your input. Precision is actually irrelevant here. The negative value is simply the mathematical result of the algorithm when it processes unmatched retractions.

Regarding the comment, I've updated it to be more explicit and focused on the current logic. What do you think of this version?

// Theoretically, acc.m2 should always be non-negative.
// But in practice it may become negative due to unmatched retractions.
// (e.g., [+I, 1], [+I, 2] followed by [-D, 3], which results in acc.m2 = -4)
// Therefore, return null in such cases to indicate an invalid result.
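
The example in that comment can be checked by hand with a small sketch. Everything below is a hypothetical standalone class, not the PR's WelfordM2AggFunction: the retract step is simply the inverse of the standard Welford update, and the null return mirrors the behavior described in the comment. The extra check for an empty accumulator is an assumption.

```java
/** Minimal Welford accumulator with retraction, to reproduce m2 = -4. */
class WelfordRetractSketch {
    long n;      // number of values currently accumulated
    double mean; // running mean
    double m2;   // sum of squared deviations from the mean

    /** Standard Welford update for an inserted value (+I). */
    void accumulate(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
    }

    /** Inverse of accumulate for a deleted value (-D). */
    void retract(double x) {
        n--;
        if (n == 0) {
            // Nothing left: reset to the empty state (a simplification in this sketch).
            mean = 0.0;
            m2 = 0.0;
            return;
        }
        double delta = x - mean;
        mean -= delta / n;
        m2 -= delta * (x - mean);
    }

    /** Returns M2, or null when the state is invalid (empty or negative M2). */
    Double getValue() {
        return (n <= 0 || m2 < 0) ? null : m2;
    }

    public static void main(String[] args) {
        WelfordRetractSketch acc = new WelfordRetractSketch();
        acc.accumulate(1); // [+I, 1]
        acc.accumulate(2); // [+I, 2]
        acc.retract(3);    // [-D, 3], a value that was never inserted
        System.out.println("m2 = " + acc.m2 + ", getValue = " + acc.getValue());
    }
}
```

Working through it: after inserting 1 and 2, the state is n = 2, mean = 1.5, m2 = 0.5. Retracting 3 gives n = 1, delta = 1.5, mean = 0, and m2 = 0.5 - 1.5 * 3 = -4. Every step is exact in double arithmetic, so the negative value comes from the unmatched retraction rather than from rounding.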

dylanhz force-pushed the FLINK-38740 branch from c5c78d5 to c2180bc on December 16, 2025

dylanhz force-pushed the FLINK-38740 branch from c2180bc to 76d55b1 on December 23, 2025

Commit af479f8: [FLINK-38740][table] Introduce Welford's online algorithm for variance related functions to avoid catastrophic cancellation in naive algorithm

dylanhz force-pushed the FLINK-38740 branch from 76d55b1 to af479f8 on December 24, 2025

lincoln-lil approved these changes on December 25, 2025

Commit 232f955: Address comments
