[SPARK-56123][PYTHON] Refactor SQL_GROUPED_AGG_ARROW_UDF and SQL_GROUPED_AGG_ARROW_ITER_UDF #54992

Draft
Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-56123/refactor/grouped-agg-arrow-udf

@Yicong-Huang (Contributor) commented Mar 24, 2026

What changes were proposed in this pull request?

Refactor SQL_GROUPED_AGG_ARROW_UDF and SQL_GROUPED_AGG_ARROW_ITER_UDF to use ArrowStreamSerializer as a pure I/O layer, moving all processing logic into read_udfs() in worker.py.
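
To make the intended split concrete, here is a minimal, stdlib-only sketch of the idea. It is not the code in worker.py: `PassThroughSerializer`, `run_grouped_agg`, and `run_grouped_agg_iter` are hypothetical names, and plain Python lists stand in for Arrow record batches. It only illustrates a serializer that does I/O and nothing else, with the per-group processing living in the caller.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical stand-ins: a "batch" is a list of values for one column and a
# "group" is a list of such batches. Real code would use Arrow record batches.
Batch = List[float]
Group = List[Batch]


class PassThroughSerializer:
    """I/O only: yields groups from the input stream without processing them."""

    def load_stream(self, stream: Iterable[Group]) -> Iterator[Group]:
        yield from stream


def run_grouped_agg(
    serializer: PassThroughSerializer,
    stream: Iterable[Group],
    udf: Callable[[List[float]], float],
) -> Iterator[float]:
    """Processing lives here: merge each group's batches, then apply the UDF."""
    for group in serializer.load_stream(stream):
        column = [value for batch in group for value in batch]
        yield udf(column)


def run_grouped_agg_iter(
    serializer: PassThroughSerializer,
    stream: Iterable[Group],
    udf: Callable[[Iterator[Batch]], float],
) -> Iterator[float]:
    """Iterator variant: the UDF consumes the group's batches one at a time."""
    for group in serializer.load_stream(stream):
        yield udf(iter(group))


# Two groups: the first arrives as two batches, the second as one.
groups = [[[1.0, 2.0], [3.0]], [[10.0]]]
totals = list(run_grouped_agg(PassThroughSerializer(), groups, sum))
iter_totals = list(
    run_grouped_agg_iter(
        PassThroughSerializer(),
        groups,
        lambda batches: sum(sum(batch) for batch in batches),
    )
)
```

In this sketch the serializer never inspects the data, so both variants can share it and differ only in how the processing function hands batches to the UDF.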

Why are the changes needed?

Part of SPARK-55388. This makes the eval types self-contained in read_udfs(), consistent with the previously refactored SQL_SCALAR_ARROW_UDF and SQL_MAP_ARROW_ITER_UDF.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

ASV micro-benchmarks show 5-25% improvement for SQL_GROUPED_AGG_ARROW_UDF and 0-14% for SQL_GROUPED_AGG_ARROW_ITER_UDF:

| Scenario | UDF | Before | After | Change |
|---|---|---|---|---|
| few_groups_sm | sum | 0.0179s | 0.0169s | -6% |
| few_groups_sm | mean_multi | 0.0069s | 0.0058s | -16% |
| many_groups_sm | sum | 0.1996s | 0.1584s | -21% |
| many_groups_sm | mean_multi | 0.1953s | 0.1460s | -25% |
| many_groups_lg | sum | 0.0979s | 0.0870s | -11% |
| wide_cols | mean_multi | 0.0724s | 0.0540s | -25% |
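
For reference, the Change column can be reproduced from the Before and After timings as (after - before) / before, rounded to the nearest percent. The snippet below is only an arithmetic check on the numbers listed above, not part of the benchmark suite; `percent_change` is a helper name introduced here for illustration.

```python
# (scenario, udf, before_seconds, after_seconds), copied from the table above.
rows = [
    ("few_groups_sm", "sum", 0.0179, 0.0169),
    ("few_groups_sm", "mean_multi", 0.0069, 0.0058),
    ("many_groups_sm", "sum", 0.1996, 0.1584),
    ("many_groups_sm", "mean_multi", 0.1953, 0.1460),
    ("many_groups_lg", "sum", 0.0979, 0.0870),
    ("wide_cols", "mean_multi", 0.0724, 0.0540),
]


def percent_change(before: float, after: float) -> int:
    """Relative change in percent, rounded to the nearest integer."""
    return round((after - before) / before * 100)


changes = [percent_change(before, after) for _, _, before, after in rows]

for (scenario, udf, _, _), change in zip(rows, changes):
    print(f"{scenario} {udf}: {change}%")
```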

Was this patch authored or co-authored using generative AI tooling?

No.

@Yicong-Huang changed the title from "[SPARK-56123][PYTHON] Refactor SQL_GROUPED_AGG_ARROW_UDF and SQL_GROUPED_AGG_ARROW_ITER_UDF to use ArrowStreamSerializer" to "[SPARK-56123][PYTHON] Refactor SQL_GROUPED_AGG_ARROW_UDF and SQL_GROUPED_AGG_ARROW_ITER_UDF" Mar 24, 2026
@Yicong-Huang marked this pull request as draft March 24, 2026 23:21
@Yicong-Huang force-pushed the SPARK-56123/refactor/grouped-agg-arrow-udf branch 4 times, most recently from e8a35a7 to a15c5d1, March 25, 2026 17:21
@Yicong-Huang force-pushed the SPARK-56123/refactor/grouped-agg-arrow-udf branch from a15c5d1 to 827d633, March 25, 2026 17:24
@Yicong-Huang (Contributor, Author)

This PR depends on #54967 (enforce_schema). Will rebase after that one merges.
