fix: remove duplicate characters caused by fake bold rendering in PDFs #4215

Merged
badGarnet merged 25 commits into Unstructured-IO:main from
bittoby:fix/remove-pdf-bold-text-duplication
Feb 13, 2026

Conversation

Contributor

bittoby commented Jan 28, 2026

Closes #3864

Summary

  • Fixes issue where bold text in PDFs is extracted with duplicate characters (e.g., "BOLD" → "BBOOLLDD")
  • Some PDF generators simulate bold by rendering each character twice at slightly offset positions
  • Added character-level deduplication based on position proximity to detect and remove these duplicates

Problem

When extracting text from certain PDFs, bold text appears duplicated:

# Before fix
elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60>60" instead of ">60"

Solution

Added character-level deduplication that:

  • Compares consecutive characters' text content and position
  • Removes duplicates where same character appears within 3 pixels (configurable)
  • Preserves spaces and other non-character elements (LTAnno objects)

# After fix
elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60" ✓
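
The deduplication described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `Char`, `dedupe_chars`, and `fake_bold` are hypothetical names, and `Char` is a simplified stand-in for a positioned character such as pdfminer's `LTChar`. Items without a position (like `LTAnno`) are always kept.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Char:
    """Hypothetical stand-in for a positioned character (e.g. pdfminer's LTChar)."""

    text: str
    x0: float
    y0: float
    # Layout-inserted items such as spaces (like LTAnno) carry no position.
    positioned: bool = True


def dedupe_chars(chars: List[Char], threshold: float = 3.0) -> List[Char]:
    """Drop a character that repeats the last kept positioned one within `threshold`."""
    if threshold <= 0:  # a threshold of 0 disables deduplication
        return list(chars)
    kept: List[Char] = []
    last: Optional[Char] = None
    for ch in chars:
        if not ch.positioned:
            kept.append(ch)  # always preserve position-less items
            continue
        if (
            last is not None
            and ch.text == last.text
            and abs(ch.x0 - last.x0) < threshold
            and abs(ch.y0 - last.y0) < threshold
        ):
            continue  # same glyph drawn again at nearly the same spot
        kept.append(ch)
        last = ch
    return kept


def fake_bold(word: str, step: float = 10.0, offset: float = 0.5) -> List[Char]:
    """Simulate fake bold: every character is emitted twice, slightly shifted."""
    out: List[Char] = []
    for i, c in enumerate(word):
        out.append(Char(c, i * step, 0.0))
        out.append(Char(c, i * step + offset, 0.0))
    return out


chars = fake_bold("BOLD")
print("".join(c.text for c in chars))  # BBOOLLDD
print("".join(c.text for c in dedupe_chars(chars)))  # BOLD
```

Because the comparison uses position as well as text, legitimate double letters (the "LL" in "HELLO") survive as long as they are further apart than the threshold.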

Configuration

# Default: 3.0 pixels (enabled)
export PDF_CHAR_DUPLICATE_THRESHOLD=3.0

# Disable deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=0

# More aggressive deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=5.0
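
For illustration, a setting like this could be read along the following lines. This is only a sketch: the helper names `char_duplicate_threshold` and `dedup_enabled` are hypothetical, and falling back to the default on invalid input and clamping negative values to 0 are assumptions, not behavior confirmed by this PR.

```python
import os


def char_duplicate_threshold(default: float = 3.0) -> float:
    """Read PDF_CHAR_DUPLICATE_THRESHOLD, using `default` when unset or invalid."""
    raw = os.environ.get("PDF_CHAR_DUPLICATE_THRESHOLD")
    if raw is None or raw.strip() == "":
        return default
    try:
        value = float(raw)
    except ValueError:
        return default  # assumption: ignore values that are not numbers
    return max(value, 0.0)  # assumption: negative values mean "disabled"


def dedup_enabled() -> bool:
    """Deduplication is on for any positive threshold and off at 0."""
    return char_duplicate_threshold() > 0
```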

Contributor Author

bittoby commented Jan 28, 2026

@badGarnet Could you please review this PR? Thanks!

@badGarnet
Collaborator

@badGarnet Could you please review this PR? Thanks!

Thanks for contributing! I would suggest finding an example PDF that has this kind of issue and adding a test that uses it. The code reads fine to me, but it would be good to test on an actual file.

Contributor Author

bittoby commented Jan 30, 2026

@badGarnet
I added an example PDF (example-docs/pdf/fake-bold-sample.pdf) and a test script (diagnose_fake_bold.py) for diagnosing fake bold.
Please review and test again. Thank you.

Comment on lines +281 to +284
assert len(text_with_dedup) <= len(text_no_dedup), (
f"Deduplicated text ({len(text_with_dedup)} chars) should not be longer "
f"than non-deduplicated text ({len(text_no_dedup)} chars)"
)
Collaborator

a better assert would be:

  • check the exact expected text length
  • check that there are duplicated characters in text_no_dedup (like bboolldd) and normal text in text_with_dedup (like bold)
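
A rough sketch of what such stricter assertions might look like is below. The strings are placeholders, since the real contents of the sample PDF are not shown in this thread, and `has_doubled_run` is a hypothetical helper rather than part of the test suite.

```python
def has_doubled_run(text: str, word: str) -> bool:
    """True if `word` occurs in `text` with every character doubled (BOLD -> BBOOLLDD)."""
    doubled = "".join(ch * 2 for ch in word)
    return doubled in text


# Placeholder strings standing in for the two extraction results.
text_no_dedup = "This is BBOOLLDD text"
text_with_dedup = "This is BOLD text"

# Without deduplication the doubled run must be present.
assert has_doubled_run(text_no_dedup, "BOLD")
# With deduplication the doubled run is gone and the normal word is there.
assert not has_doubled_run(text_with_dedup, "BOLD")
assert "BOLD" in text_with_dedup
# Pin the exact expected length instead of only checking "not longer".
assert len(text_with_dedup) == len("This is BOLD text")
```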

@@ -0,0 +1,69 @@
"""Diagnostic script to verify fake-bold PDF deduplication is working."""
Collaborator

a test against the new file is good enough; we don't need to add a script to root dir for this case

Contributor Author

bittoby commented Feb 2, 2026

@badGarnet Thanks for your feedback. I've updated the PR. Could you please review it again and confirm that it matches your requirements? Thanks again!

@bittoby bittoby force-pushed the fix/remove-pdf-bold-text-duplication branch from 29d32e5 to 355e925 February 3, 2026 17:48

Contributor Author

bittoby commented Feb 4, 2026

Hi @badGarnet, I've updated everything. I hope you can merge this when you have a sec.

@badGarnet badGarnet enabled auto-merge February 5, 2026 16:50
auto-merge was automatically disabled February 5, 2026 17:39

Head branch was pushed to by a user without write access

Contributor Author

bittoby commented Feb 5, 2026

@badGarnet Thanks for the approval. Can you merge the PR?

Contributor Author

bittoby commented Feb 5, 2026

Sorry for tagging you again, @badGarnet. I ran into a linting error, so I updated the code and pushed a new commit. Could you please review it again and merge? Thanks.

Collaborator

badGarnet left a comment

please update the changelog and move your entry to the appropriate section; please also bump the version number

Contributor Author

bittoby commented Feb 6, 2026

I updated the changelog and bumped the version number.

@bittoby bittoby requested a review from badGarnet February 6, 2026 00:40
CHANGELOG.md Outdated
- **Add `group_elements_by_parent_id` utility function**: Groups elements by their `parent_id` metadata field for easier document hierarchy traversal (fixes #1489)

### Fixes
- **Fix duplicate characters in PDF bold text extraction**: Some PDFs render bold text by drawing each character twice at slightly offset positions, causing text like "BOLD" to be extracted as "BBOOLLDD". Added character-level deduplication based on position proximity. Configurable via `PDF_CHAR_DUPLICATE_THRESHOLD` environment variable (default: 3.0 pixels, set to 0 to disable) (fixes #3864).
Collaborator

delete?

CHANGELOG.md Outdated
Collaborator

please bump here as well

# Fake-bold duplicates typically have >70% overlap
# Legitimate consecutive letters have <30% overlap (or none)
# Use 50% as threshold to be conservative
return overlap_ratio > 0.5
Collaborator

let's make this also a config variable so it can be changed via an env variable
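
To make the overlap check concrete, here is a minimal sketch with the ratio threshold passed in as a parameter so that a caller could supply it from configuration. How the PR actually computes `overlap_ratio` is not shown in this thread. Dividing the intersection area by the smaller box's area is an assumption, and `overlap_ratio` and `is_fake_bold_duplicate` are hypothetical names.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlap_ratio(a: BBox, b: BBox) -> float:
    """Intersection area divided by the smaller box's area (0.0 when disjoint)."""
    ix = min(a[2], b[2]) - max(a[0], b[0])
    iy = min(a[3], b[3]) - max(a[1], b[1])
    if ix <= 0 or iy <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    smaller = min(area_a, area_b)
    return (ix * iy) / smaller if smaller > 0 else 0.0


def is_fake_bold_duplicate(a: BBox, b: BBox, ratio_threshold: float = 0.5) -> bool:
    """Treat two same-text glyphs as duplicates when their boxes mostly overlap."""
    return overlap_ratio(a, b) > ratio_threshold


# A glyph redrawn 0.5 units to the right overlaps almost completely.
print(is_fake_bold_duplicate((0, 0, 10, 12), (0.5, 0, 10.5, 12)))  # True
# Neighbouring letters that barely touch stay well under the threshold.
print(is_fake_bold_duplicate((0, 0, 10, 12), (9, 0, 19, 12)))  # False
```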

…CHAR_OVERLAP_RATIO_THRESHOLD environment variable

Contributor Author

bittoby commented Feb 10, 2026

@badGarnet Updated!

Contributor Author

bittoby commented Feb 11, 2026

@badGarnet Sorry for tagging you again. Could you please check again? Thanks.

@badGarnet badGarnet enabled auto-merge February 11, 2026 16:08
auto-merge was automatically disabled February 11, 2026 16:43

Head branch was pushed to by a user without write access

@badGarnet
Collaborator

please update changelog and bump version 😬 especially after merging main into this branch

Contributor Author

bittoby commented Feb 13, 2026

I updated the changelog and version. I would appreciate it if you could merge now. Thanks for your support! @badGarnet

@badGarnet badGarnet enabled auto-merge February 13, 2026 00:52
auto-merge was automatically disabled February 13, 2026 01:35

Head branch was pushed to by a user without write access

Contributor Author

bittoby commented Feb 13, 2026

Sorry @badGarnet. CI hit a few minor errors, such as an ID mismatch and some variable issues. I have fixed them. Please check again. 🙏

Contributor Author

bittoby commented Feb 13, 2026

@badGarnet please merge this pr.

@badGarnet badGarnet enabled auto-merge February 13, 2026 13:38
auto-merge was automatically disabled February 13, 2026 13:47

Head branch was pushed to by a user without write access

Contributor Author

bittoby commented Feb 13, 2026

I have updated changelog and version again. could you rerun, please? @badGarnet

@badGarnet badGarnet enabled auto-merge February 13, 2026 14:12
@badGarnet badGarnet added this pull request to the merge queue Feb 13, 2026
Merged via the queue into Unstructured-IO:main with commit 8096b5a Feb 13, 2026
50 checks passed

Contributor Author

bittoby commented Feb 13, 2026

Thank you @badGarnet 👍

Development

Successfully merging this pull request may close these issues.

bug/Bold characters get repeated while extracting

2 participants