Skip to content

fix: remap parent id after hashing#4245

Merged
badGarnet merged 3 commits intomainfrom
fix/remap-parent-id-when-hashing-id
Feb 19, 2026
Merged

fix: remap parent id after hashing#4245
badGarnet merged 3 commits intomainfrom
fix/remap-parent-id-when-hashing-id

Conversation

@badGarnet
Copy link
Collaborator

@badGarnet badGarnet commented Feb 19, 2026

This PR addresses an issue where hashing element id loses the reference for parent id.
This happens when calling partition_html where the partition process already assigned parent ids for elements based on html structure before apply_metadata is called, i.e., before element id hashing happens.
This fix ensures that the parent references stay unchanged after hashing.

@badGarnet
Copy link
Collaborator Author

@claude review

@claude
Copy link

claude bot commented Feb 19, 2026

Claude finished @badGarnet's task —— View job


PR Review: Fix Parent ID Remapping After Hashing

Review Tasks:

  • Read and understand the changed files
  • Analyze the core fix implementation in metadata.py
  • Review test coverage and test cases
  • Check changelog and version updates
  • Provide comprehensive feedback and recommendations

Analysis

Problem: When partition_html runs, it assigns parent IDs based on HTML structure before apply_metadata is called. Later, when element IDs get hashed (converted from UUIDs to SHA1), the parent ID references become stale since they still point to the old UUID values.

Solution: The fix adds a parent ID remapping step in _assign_hash_ids() at unstructured/partition/common/metadata.py:264-266:

for element in elements:
    if element.metadata.parent_id is not None:
        element.metadata.parent_id = id_mapping[element.metadata.parent_id]

This ensures parent IDs are updated to reference the new hashed IDs.

Code Review Feedback

✅ Strengths

  1. Clean, minimal fix: The solution is surgical and only adds what's necessary to solve the problem.

  2. Proper placement: The parent ID remapping happens after ID hashing but within the same function, ensuring consistency.

  3. Excellent test coverage: The new test test_assign_hash_ids_remaps_parent_id_to_new_hash_id() at test_unstructured/partition/common/test_metadata.py:487-503 directly validates the fix:

    • Creates a title with a child element that has parent_id pointing to the title's UUID
    • Calls _assign_hash_ids()
    • Verifies that after hashing, child.metadata.parent_id == title.id (the new hashed ID)
  4. Changelog properly updated: Version bumped to 0.20.6 with clear description.

🤔 Considerations & Questions

  1. KeyError risk: What happens if element.metadata.parent_id exists but isn't in id_mapping? This could occur if:

    • Parent elements are filtered out before hashing
    • Parent IDs are manually set to non-element values
    • Data corruption scenarios

    Consider adding a safety check:

    if element.metadata.parent_id is not None and element.metadata.parent_id in id_mapping:
        element.metadata.parent_id = id_mapping[element.metadata.parent_id]
  2. Integration test gap: While the unit test is excellent, there's no integration test showing the full partition_htmlapply_metadata → hashing flow. Consider adding a test that:

    • Calls partition_html with unique_element_ids=False
    • Verifies parent-child relationships are preserved after the full pipeline
  3. Documentation: The docstring for _assign_hash_ids() at line 246 could mention that it also remaps parent IDs, not just element IDs.

🔍 Edge Cases to Consider

  • What if there are circular references in parent IDs? (Though this shouldn't happen with proper hierarchy)
  • What about deeply nested hierarchies? (Should work fine with the current approach)
  • What if some elements don't get hashed but their children do? (Covered by the KeyError consideration above)

Overall Assessment

This is a well-implemented fix for a legitimate bug. The solution is clean, focused, and properly tested. The main concern is the potential for KeyError exceptions in edge cases, but this may be acceptable if the assumption is that parent IDs should always reference valid elements in the same collection.

Recommendation:Approve with minor suggestion to add the KeyError safety check mentioned above, though the current implementation may be acceptable depending on system guarantees.


@badGarnet badGarnet marked this pull request as ready for review February 19, 2026 14:48
@badGarnet badGarnet added this pull request to the merge queue Feb 19, 2026
Merged via the queue into main with commit e2d8b7a Feb 19, 2026
51 checks passed
@badGarnet badGarnet deleted the fix/remap-parent-id-when-hashing-id branch February 19, 2026 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants