Skip to content

Conversation

@Pajaraja
Copy link
Contributor

What changes were proposed in this pull request?

Added UnionLoop and UnionLoopRef to NormalizePlan and CteIdNormalizer.

Why are the changes needed?

To add testing for recursive CTEs for single pass analyzer.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New tests in NormalizePlanSuite.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Sonnet 4.5 via Cursor AI - Original draft of the tests.

@github-actions github-actions bot added the SQL label Dec 29, 2025
@diffray-bot
Copy link

Changes Summary

This PR adds support for normalizing UnionLoop and UnionLoopRef nodes in the plan comparison logic. It introduces new normalization methods in CteIdNormalizer and corresponding test cases to enable testing of recursive CTEs in the single-pass analyzer.

Type: feature

Components Affected: catalyst/plans/NormalizePlan, catalyst/plans/CteIdNormalizer, test suite for NormalizePlan

Files Changed
File Summary Change Impact
...he/spark/sql/catalyst/plans/NormalizePlan.scala Added handling for UnionLoop and UnionLoopRef nodes in the plan normalization transformation and added corresponding normalization methods to CteIdNormalizer class. ✏️ 🟡
...ark/sql/catalyst/plans/NormalizePlanSuite.scala Added three new test cases for UnionLoopRef ID normalization, UnionLoop ID and outputAttrIds normalization, and complete recursive CTE normalization. ✏️ 🟡
Architecture Impact
  • New Patterns: Parallel normalization patterns for recursive CTE nodes (similar to existing CTE handling)
  • Dependencies: added imports: CTERelationDef, UnionLoop, UnionLoopRef in test file
  • Coupling: Introduces tight coupling between UnionLoop/UnionLoopRef and CteIdNormalizer. The normalization logic assumes specific traversal order (UnionLoop before UnionLoopRef) similar to existing CTE handling.

Risk Areas: ID mapping correctness: The normalization relies on traversal order where UnionLoop is encountered before UnionLoopRef to build the mapping correctly, Duplicate ID handling in normalizeDef: The logic checks if a CTERelationDef ID is already mapped (new check at line 256), which could affect existing behavior for duplicate CTEs, UnionLoop ID mapping: The normalizeUnionLoop method only remaps if ID already exists, but doesn't insert new mappings. This is asymmetric compared to normalizeUnionLoopRef behavior

Suggestions
  • Add documentation explaining the assumption about traversal order in plan transformation
  • Consider adding a comment explaining why UnionLoop doesn't insert new mappings unlike UnionLoopRef
  • Verify the logic in normalizeDef change (line 256-257) doesn't introduce unintended regressions for existing CTE handling

Full review in progress... | Powered by diffray

@diffray-bot
Copy link

Changes Summary

This PR adds support for normalizing recursive CTEs (rCTEs) in Spark's query plan normalization logic by introducing handlers for UnionLoop and UnionLoopRef nodes. This enables the single-pass analyzer to correctly compare semantically identical recursive CTE queries with different internal IDs.

Type: feature

Components Affected: Catalyst Query Plan Normalization, Recursive CTE Support, Query Plan Comparison Infrastructure

Files Changed
File Summary Change Impact
...he/spark/sql/catalyst/plans/NormalizePlan.scala Added UnionLoop and UnionLoopRef normalization handlers, plus a fix to prevent duplicate CTE ID mappings ✏️ 🟡
...ark/sql/catalyst/plans/NormalizePlanSuite.scala Added three comprehensive test cases for UnionLoopRef, UnionLoop, and recursive CTE normalization ✏️ 🟢
Architecture Impact
  • New Patterns: Visitor pattern extension for rCTE nodes, ID mapping pattern for maintaining referential integrity
  • Dependencies: added: UnionLoop and UnionLoopRef logical plan classes
  • Coupling: NormalizePlan normalization logic now explicitly depends on recursive CTE node types (UnionLoop, UnionLoopRef), increasing coupling to the CTE implementation details

Risk Areas: Bug fix in normalizeDef() changes behavior - need to verify it doesn't break existing non-recursive CTE normalization, ID remapping logic complexity - UnionLoopRef mapping uses counter differently than CTERelationRef, potential for confusion, Interaction between UnionLoop ID normalization and UnionLoopRef counter-based normalization - asymmetric pattern could be error-prone

Suggestions
  • Consider adding comments explaining why UnionLoopRef uses counter-based assignment vs. mapping lookup (unlike UnionLoop)
  • Verify the bug fix in normalizeDef() doesn't have unintended side effects on regular CTEs
  • Consider adding integration tests that verify recursive CTEs normalize correctly within larger query plans

Full review in progress... | Powered by diffray

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants