Preserve newlines in Table and TableChunk elements during PDF partitioning #4214
Merged
badGarnet merged 3 commits into Unstructured-IO:main on Jan 30, 2026
The ingest test is failing because runs of multiple whitespace characters used to be collapsed into a single space but are now preserved, so the expected text has changed.
…rtitioning
The RE_MULTISPACE_INCLUDING_NEWLINES regex was being applied to all Text elements, including Table and TableChunk. This incorrectly removed newline characters that carry structural meaning in tables (row separation).
Fixes Unstructured-IO#3983
Co-Authored-By: Claude Opus 4.5 <[email protected]>
badGarnet approved these changes on Jan 30, 2026
Merged via the queue into Unstructured-IO:main with commit b1e4b00 on Jan 30, 2026
72 of 73 checks passed
Closes #3983
Summary
This PR fixes an issue where newline characters were incorrectly stripped from `Table` and `TableChunk` elements during PDF partitioning. The `RE_MULTISPACE_INCLUDING_NEWLINES` regex was applied indiscriminately to all `Text` elements, including tables, which removed newlines that carry structural meaning (such as row separation).
Changes
- `unstructured/partition/pdf.py`: Added conditional logic to skip whitespace normalization for `Table` and `TableChunk` elements, preserving newlines that convey tabular structure
- `CHANGELOG.md`: Added entry documenting the fix
- `unstructured/__version__.py`: Version bump to 0.18.33
Problem
When processing PDFs (especially image-based PDFs with tables), the code applied the `RE_MULTISPACE_INCLUDING_NEWLINES` regex substitution to all `Text` elements, replacing each run of whitespace with a single space. This stripped meaningful line breaks from table content, degrading the structural representation of tabular data.
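The effect can be reproduced in isolation. The following is a minimal sketch, not the code from `pdf.py`: it assumes the pattern behaves like `\s+`, and `normalize_whitespace` and the sample table text are hypothetical names introduced only for illustration.

```python
import re

# Assumption: a stand-in for the library's RE_MULTISPACE_INCLUDING_NEWLINES,
# taken here to match any run of whitespace, newlines included.
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")


def normalize_whitespace(text: str) -> str:
    """Hypothetical helper: collapse every whitespace run into one space."""
    return RE_MULTISPACE_INCLUDING_NEWLINES.sub(" ", text).strip()


# Illustrative table text in which each newline separates two rows.
table_text = "Name  Qty\nApple  3\nPear  5"

flattened = normalize_whitespace(table_text)
print(repr(flattened))  # the row boundaries are no longer recoverable
```

Once the newlines are gone, a consumer can no longer tell where one row ends and the next begins.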
Solution
Added a check to exclude `Table` and `TableChunk` elements from the whitespace normalization.
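A rough sketch of such a check is shown below. It is an illustration under stated assumptions rather than the merged diff: the `Text`, `Table`, and `TableChunk` classes are simplified stand-ins for the library's element types, the regex is assumed to behave like `\s+`, and `normalize_elements` is a hypothetical function name.

```python
import re
from dataclasses import dataclass

# Assumption: stand-in for the library's RE_MULTISPACE_INCLUDING_NEWLINES.
RE_MULTISPACE_INCLUDING_NEWLINES = re.compile(r"\s+")


@dataclass
class Text:
    """Simplified stand-in for the library's Text element."""
    text: str


class Table(Text):
    """Simplified stand-in for a table element."""


class TableChunk(Table):
    """Simplified stand-in for a chunk of a table element."""


def normalize_elements(elements: list[Text]) -> list[Text]:
    """Hypothetical helper: collapse whitespace in every element except tables."""
    for element in elements:
        # Tables keep their text untouched so newlines still mark row breaks.
        if isinstance(element, (Table, TableChunk)):
            continue
        element.text = RE_MULTISPACE_INCLUDING_NEWLINES.sub(" ", element.text).strip()
    return elements


elements = normalize_elements(
    [Text("Some  wrapped\n paragraph"), Table("r1c1  r1c2\nr2c1  r2c2"), TableChunk("a\nb")]
)
for element in elements:
    print(type(element).__name__, repr(element.text))
```

In this sketch table text is skipped entirely, so runs of spaces inside tables are preserved along with the newlines, which is consistent with the changed ingest test output noted in the review comment.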