fix: remove duplicate characters caused by fake bold rendering in PDFs by bittoby · Pull Request #4215 · Unstructured-IO/unstructured

bittoby · 2026-01-28T12:23:22Z

Summary

Fixes issue where bold text in PDFs is extracted with duplicate characters (e.g., "BOLD" → "BBOOLLDD")
Some PDF generators simulate bold by rendering each character twice at slightly offset positions
Added character-level deduplication based on position proximity to detect and remove these duplicates

Problem

When extracting text from certain PDFs, bold text appears duplicated:

# Before fix
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60>60" instead of ">60"

Solution

Added character-level deduplication that:

Compares consecutive characters' text content and position
Removes duplicates where the same character appears again within 3 pixels of the previous one (configurable)
Preserves spaces and other non-character elements (LTAnno objects)

# After fix
elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60" ✓
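
The position-based check described above can be sketched in a few lines. This is an illustration only, not the code from this PR: `Char` and `dedupe_chars` are hypothetical names, `Char` is a simplified stand-in for a positioned PDF character rather than a pdfminer type, and comparing only the lower-left corner is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    """Hypothetical stand-in for a positioned PDF character."""

    text: str
    x0: float
    y0: float


def dedupe_chars(chars: List[Char], threshold: float = 3.0) -> List[Char]:
    """Drop a character that repeats the previously kept one at nearly the same position."""
    if threshold <= 0:
        return list(chars)  # a threshold of 0 disables deduplication
    kept: List[Char] = []
    for ch in chars:
        prev = kept[-1] if kept else None
        if (
            prev is not None
            and ch.text == prev.text
            and abs(ch.x0 - prev.x0) < threshold
            and abs(ch.y0 - prev.y0) < threshold
        ):
            continue  # same glyph drawn again at a tiny offset: treat as fake bold
        kept.append(ch)
    return kept


# "BOLD" where every glyph is drawn twice, the second copy shifted by 0.5 units.
fake_bold: List[Char] = []
for i, letter in enumerate("BOLD"):
    fake_bold.append(Char(letter, 10.0 * i, 100.0))
    fake_bold.append(Char(letter, 10.0 * i + 0.5, 100.0))

raw = "".join(c.text for c in fake_bold)
clean = "".join(c.text for c in dedupe_chars(fake_bold))
print(raw, "->", clean)
```

Because the comparison requires both matching text and a small offset, a legitimate double letter (for example the "ll" in "hello", where the second glyph sits a full character width away) is kept.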

Configuration

# Default: 3.0 pixels (enabled)
export PDF_CHAR_DUPLICATE_THRESHOLD=3.0

# Disable deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=0

# More aggressive deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=5.0
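
For readers who want to see how such a setting might be consumed, here is a minimal sketch. It assumes the threshold is read from the environment as a float; the helper name `char_duplicate_threshold` and the fallback on invalid values are illustrative assumptions, not the library's actual configuration code.

```python
import os


def char_duplicate_threshold(default: float = 3.0) -> float:
    """Read PDF_CHAR_DUPLICATE_THRESHOLD, using the default when unset or not a number."""
    raw = os.environ.get("PDF_CHAR_DUPLICATE_THRESHOLD")
    if raw is None:
        return default
    try:
        return float(raw)
    except ValueError:
        return default  # assumption: ignore malformed values instead of raising


# Equivalent to `export PDF_CHAR_DUPLICATE_THRESHOLD=0` in the shell.
os.environ["PDF_CHAR_DUPLICATE_THRESHOLD"] = "0"
threshold = char_duplicate_threshold()
dedup_enabled = threshold > 0  # 0 means deduplication is switched off
print(threshold, dedup_enabled)
```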

bittoby · 2026-01-28T12:33:07Z

@badGarnet Could you please review this PR? Thanks!

badGarnet · 2026-01-30T02:53:50Z

> @badGarnet Could you please review this PR? Thanks!

Thanks for contributing! I would suggest finding an example PDF that has this kind of issue and adding a test that uses it. The code reads fine to me, but it would be good to test on an actual file.

bittoby · 2026-01-30T18:00:24Z

@badGarnet
I added an example PDF (example-docs/pdf/fake-bold-sample.pdf) and a test script (diagnose_fake_bold.py) for diagnosing fake bold text.
Please review and test again. Thank you.

badGarnet · 2026-02-02T16:23:34Z

test_unstructured/partition/pdf_image/test_pdfminer_utils.py

+        assert len(text_with_dedup) <= len(text_no_dedup), (
+            f"Deduplicated text ({len(text_with_dedup)} chars) should not be longer "
+            f"than non-deduplicated text ({len(text_no_dedup)} chars)"
+        )


A better assert would:

check the exact expected text length

check that text_no_dedup contains duplicated characters (like bboolldd) and that text_with_dedup contains normal text (like bold)
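
The stricter checks suggested here could look roughly like the following. The two strings are hypothetical placeholders, since the actual contents of the sample PDF are not shown in this thread, and `has_doubled_run` is an illustrative helper rather than part of the test suite.

```python
def has_doubled_run(text: str, word: str) -> bool:
    """Return True if `word` appears in `text` with every character doubled."""
    doubled = "".join(ch * 2 for ch in word)
    return doubled in text


# Hypothetical extraction results; the real test would get these from the sample PDF.
text_no_dedup = "BBOOLLDD text"
text_with_dedup = "BOLD text"

# The raw extraction shows the fake-bold artifact...
assert has_doubled_run(text_no_dedup, "BOLD")
# ...while the deduplicated extraction contains the clean word and no doubled run.
assert "BOLD" in text_with_dedup
assert not has_doubled_run(text_with_dedup, "BOLD")
# Pin the exact expected length instead of only checking "not longer".
assert len(text_with_dedup) == len("BOLD text")
```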

badGarnet · 2026-02-02T16:24:15Z

diagnose_fake_bold.py

@@ -0,0 +1,69 @@
+"""Diagnostic script to verify fake-bold PDF deduplication is working."""


a test against the new file is good enough; we don't need to add a script to root dir for this case

bittoby · 2026-02-02T17:20:05Z

@badGarnet Thanks for your feedback. I've updated the PR. Could you please review it again and confirm that it now matches your requests? Thanks again!

bittoby · 2026-02-04T03:49:21Z

Hi @badGarnet, I've updated everything. I hope you can merge this when you have a sec.

bittoby · 2026-02-05T17:40:58Z

@badGarnet Thanks for the approval. Can you merge the PR?

bittoby · 2026-02-05T20:19:29Z

Sorry for tagging you again, @badGarnet. I ran into a linting error, so I updated the code and pushed a new commit. Could you please review it again and merge? Thanks.

badGarnet

please update the changelog and move your entry to the appropriate section; please also bump the version number

bittoby · 2026-02-06T00:29:31Z

I updated the changelog and bumped the version number.

badGarnet · 2026-02-06T15:35:29Z

CHANGELOG.md

 - **Add `group_elements_by_parent_id` utility function**: Groups elements by their `parent_id` metadata field for easier document hierarchy traversal (fixes #1489)

 ### Fixes
+- **Fix duplicate characters in PDF bold text extraction**: Some PDFs render bold text by drawing each character twice at slightly offset positions, causing text like "BOLD" to be extracted as "BBOOLLDD". Added character-level deduplication based on position proximity. Configurable via `PDF_CHAR_DUPLICATE_THRESHOLD` environment variable (default: 3.0 pixels, set to 0 to disable)(fixes #3864).


badGarnet · 2026-02-06T15:35:44Z

CHANGELOG.md

please bump here as well

badGarnet · 2026-02-10T15:06:26Z

unstructured/partition/pdf_image/pdfminer_utils.py

+    # Fake-bold duplicates typically have >70% overlap
+    # Legitimate consecutive letters have <30% overlap (or none)
+    # Use 50% as threshold to be conservative
+    return overlap_ratio > 0.5


let's make this also a config variable so it can be changed via an env variable
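
A minimal sketch of what a configurable overlap check might look like follows. Several things here are assumptions: the ratio is taken as intersection area divided by the smaller box's area, boxes are `(x0, y0, x1, y1)` tuples, the function names are hypothetical, and the variable name `PDF_CHAR_OVERLAP_RATIO_THRESHOLD` is pieced together from the truncated commit title on this page. The PR's actual implementation is not shown in this thread.

```python
import os
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the smaller box (an assumed definition)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0  # boxes do not overlap
    smaller = min((a[2] - a[0]) * (a[3] - a[1]), (b[2] - b[0]) * (b[3] - b[1]))
    return (width * height) / smaller if smaller > 0 else 0.0


def is_fake_bold_pair(a: Box, b: Box, default_threshold: float = 0.5) -> bool:
    """Compare the overlap ratio against a threshold that the environment can override."""
    threshold = float(os.environ.get("PDF_CHAR_OVERLAP_RATIO_THRESHOLD", default_threshold))
    return overlap_ratio(a, b) > threshold


first = (0.0, 0.0, 10.0, 12.0)
shifted_copy = (0.5, 0.0, 10.5, 12.0)  # same glyph redrawn 0.5 units to the right
next_letter = (10.0, 0.0, 20.0, 12.0)  # an ordinary neighbouring glyph

print(is_fake_bold_pair(first, shifted_copy), is_fake_bold_pair(first, next_letter))
```

Reading the threshold at call time keeps the sketch short; caching it once at startup would be an equally reasonable design.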

bittoby · 2026-02-10T15:19:19Z

@badGarnet Updated!

bittoby · 2026-02-11T12:07:37Z

@badGarnet Sorry for tagging you again. Could you please check again? Thanks.

badGarnet · 2026-02-13T00:16:42Z

please update changelog and bump version 😬 especially after merging main into this branch

bittoby · 2026-02-13T00:39:19Z

I updated the changelog and version. I would appreciate it if you could merge now. Thanks for your support! @badGarnet

bittoby · 2026-02-13T01:37:30Z

Sorry @badGarnet. During CI testing, a few minor errors came up, such as an ID mismatch and some problems with variables. I have fixed them. Please check again. 🙏

bittoby · 2026-02-13T12:03:47Z

@badGarnet Please merge this PR.

bittoby · 2026-02-13T13:50:30Z

I have updated the changelog and version again. Could you rerun, please? @badGarnet

bittoby · 2026-02-13T15:10:11Z

Thank you @badGarnet 👍

fix: remove duplicate characters caused by fake bold rendering in PDFs

f8af84b

bittoby added 2 commits January 30, 2026 17:25

fix: solve merge conflict

8d80a34

fix: apply character deduplication to fast strategy for fake-bold PDFs

92c02d6

bittoby added 2 commits January 30, 2026 19:10

fix: define imports at the top

8377398

test: simplify fake-bold integration test assertions

d817d42

badGarnet reviewed Feb 2, 2026

bittoby added 2 commits February 2, 2026 18:15

Merge branch 'main' of https://github.com/bittoby/unstructured into f…

e0803a3

…ix/remove-pdf-bold-text-duplication

fix: improve fake-bold deduplication tests with specific assertions

3d11da7

bittoby added 2 commits February 3, 2026 18:47

Merge branch 'main' of https://github.com/bittoby/unstructured into f…

90a82c2

…ix/remove-pdf-bold-text-duplication

fix: remove unused pytest import to pass ruff linter

355e925

bittoby force-pushed the fix/remove-pdf-bold-text-duplication branch from 29d32e5 to 355e925 Compare February 3, 2026 17:48

badGarnet approved these changes Feb 5, 2026

badGarnet enabled auto-merge February 5, 2026 16:50

fix: black formatting violations in PDF test files for CI/CD compliance

14d1231

auto-merge was automatically disabled February 5, 2026 17:39
Head branch was pushed to by a user without write access

badGarnet requested changes Feb 5, 2026

fix: Update code formatting and element ID to match new deterministri…

80e2774

…c ID generation

bittoby requested a review from badGarnet February 6, 2026 00:40

badGarnet reviewed Feb 6, 2026

fix: Update CHANGELOG

68fc61c

fix: solve merge conflict

1894ed1

badGarnet reviewed Feb 10, 2026

fix: make pdf character overlap ratio threshold configurable via PDF_…

5f7c2e6

…CHAR_OVERLAP_RATIO_THRESHOLD environment variable

fix: resolve merge conflict

9100347

badGarnet enabled auto-merge February 11, 2026 16:08

fix: resolve Lint style error

66beb1e

auto-merge was automatically disabled February 11, 2026 16:43
Head branch was pushed to by a user without write access

bittoby added 2 commits February 12, 2026 11:38

fix:resolve conflict

d5bc389

fix: Update test lint style

a0a7c7b

fix: Update version and changelog

90584aa

badGarnet enabled auto-merge February 13, 2026 00:52

fix: solve test error

a84a5f7

auto-merge was automatically disabled February 13, 2026 01:35
Head branch was pushed to by a user without write access

bittoby added 2 commits February 13, 2026 02:41

fix: solve merge conflict

55957ba

fix: update changlog

99880cf

badGarnet enabled auto-merge February 13, 2026 13:38

fix: update log and version

f97dca5

auto-merge was automatically disabled February 13, 2026 13:47
Head branch was pushed to by a user without write access

badGarnet enabled auto-merge February 13, 2026 14:12

badGarnet added this pull request to the merge queue Feb 13, 2026

Merged via the queue into Unstructured-IO:main with commit 8096b5a Feb 13, 2026
50 checks passed
