fix: remove duplicate characters caused by fake bold rendering in PDFs#4215
Conversation
|
@badGarnet Could you please review this PR? Thanks! |
Thanks for contributing! I would suggest finding an example pdf that has this kind of issue and add a test using it. The code reads fine to me but it would be good to test on an actual file. |
|
@badGarnet |
| assert len(text_with_dedup) <= len(text_no_dedup), ( | ||
| f"Deduplicated text ({len(text_with_dedup)} chars) should not be longer " | ||
| f"than non-deduplicated text ({len(text_no_dedup)} chars)" | ||
| ) |
There was a problem hiding this comment.
a better assert would be:
- checking the exact expected text length
- check there is duplicated characters in the
text_no_dedup(likebboolldd) and normal text intext_with_dedupe(likebold)
diagnose_fake_bold.py
Outdated
| @@ -0,0 +1,69 @@ | |||
| """Diagnostic script to verify fake-bold PDF deduplication is working.""" | |||
There was a problem hiding this comment.
a test against the new file is good enough; we don't need to add a script to root dir for this case
…ix/remove-pdf-bold-text-duplication
|
@badGarnet Thanks for your feedback. I've updated. Could you please review again and confirm that it’s configured correctly according to your req? thanks again! |
…ix/remove-pdf-bold-text-duplication
29d32e5 to
355e925
Compare
|
Hi, @badGarnet . I updated all. Hope you merge this when you have a sec |
Head branch was pushed to by a user without write access
|
@badGarnet Thanks for approval. Can you merge the PR! |
|
Sorry for tagging you again, @badGarnet. I faced linting test error, so I updated the code and pushed a new commit. Could you please review it again and merge? Thanks. |
badGarnet
left a comment
There was a problem hiding this comment.
please update the changelog and move your entry to the appropriate section; please also bump the version number
|
I updated changelog and bumped version number |
CHANGELOG.md
Outdated
| - **Add `group_elements_by_parent_id` utility function**: Groups elements by their `parent_id` metadata field for easier document hierarchy traversal (fixes #1489) | ||
|
|
||
| ### Fixes | ||
| - **Fix duplicate characters in PDF bold text extraction**: Some PDFs render bold text by drawing each character twice at slightly offset positions, causing text like "BOLD" to be extracted as "BBOOLLDD". Added character-level deduplication based on position proximity. Configurable via `PDF_CHAR_DUPLICATE_THRESHOLD` environment variable (default: 3.0 pixels, set to 0 to disable)(fixes #3864). |
CHANGELOG.md
Outdated
There was a problem hiding this comment.
please bump here as well
| # Fake-bold duplicates typically have >70% overlap | ||
| # Legitimate consecutive letters have <30% overlap (or none) | ||
| # Use 50% as threshold to be conservative | ||
| return overlap_ratio > 0.5 |
There was a problem hiding this comment.
let's make this also a config variable so it can be changed via an env variable
…CHAR_OVERLAP_RATIO_THRESHOLD environment variable
|
@badGarnet Updated! |
|
@badGarnet Sorry for tagging you again. could you please check again? thanks |
Head branch was pushed to by a user without write access
|
please update changelog and bump version 😬 especially after merging main into this branch |
|
I updated changelog and version. I would appreciate to merge now. Thanks for your support! @badGarnet |
Head branch was pushed to by a user without write access
|
Sorry @badGarnet. While CI testing, It faced some mini error like ID mismatch, dismiss some variables. I have fixed. Please check again. 🙏 |
|
@badGarnet please merge this pr. |
Head branch was pushed to by a user without write access
|
I have updated changelog and version again. could you rerun, please? @badGarnet |
|
Thank you @badGarnet 👍 |
Closes #3864
Summary
Problem
When extracting text from certain PDFs, bold text appears duplicated:
Solution
Added character-level deduplication that:
Configuration