PAGE XML renderer / export #4214
Can you send a result file produced by this code for some example image?
The CI builds fail because some include statements are missing.
Thanks for the suggestions and fixes, @stweil. Here are examples for eurotext and hebrew with 'page' and 'page-poly':
PAGE XML with box coordinates:
PAGE XML with polygon:
Added a terminating linefeed, removed trailing whitespace and fixed typos (found by typos-cli).
Maybe we can merge this new feature in a new 5.4.0 pre-release.
Co-authored-by: Stefan Weil <[email protected]>
Unused variables Co-authored-by: Stefan Weil <[email protected]>
Remove unused variables Co-authored-by: Stefan Weil <[email protected]>
Signed-off-by: Stefan Weil <[email protected]>
Add PAGE XML export and documentation. To generate PAGE XML output just add 'page' to the tesseract command. The output is outputname + '.page.xml' to avoid conflicts with ALTO export. The output can be customized with the flags: tessedit_create_page_polygon and tessedit_create_page_wordlevel. Co-authored-by: Stefan Weil <[email protected]>
Should we squash long PRs? Upd.:
Fixes: 577e8a8 ("Add PAGE XML renderer / export (#4214)") Signed-off-by: Stefan Weil <[email protected]>
Fixes: 577e8a8 ("Add PAGE XML renderer / export (tesseract-ocr#4214)") Signed-off-by: Stefan Weil <[email protected]>
Use also enum names instead of numeric values where possible. Fixes: 577e8a8 ("Add PAGE XML renderer / export (tesseract-ocr#4214)") Signed-off-by: Stefan Weil <[email protected]>
Use also enum names instead of numeric values where possible. Fixes: 577e8a8 ("Add PAGE XML renderer / export (#4214)") Signed-off-by: Stefan Weil <[email protected]>
Hi everyone,
I've created a PAGE-XML renderer/export.
It's not just a simple PAGE-XML export, it can also produce a textline polygon instead of a simple bounding box, and it can output up to word level.
The output can be customised with three bool parameters
After installing tesseract, you can use the preconfigured settings file 'page':
page -> Output page.xml file with polygon and line-level
As word-level is more of a niche requirement, you need to enable it via -c:
-c page_xml_polygon=1 // 1 (true) = polygon, 0 (false) = bounding boxes
-c page_xml_level=1 // 0 line or 1 word level
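To make the invocation concrete, here is a minimal Python sketch that only assembles the command line and the expected output path. It assumes the parameter names `page_xml_polygon` and `page_xml_level` exactly as written in this description (the commit message in this thread mentions differently named flags, so check which names your build accepts) and the `outputname + '.page.xml'` naming from that commit message. The helper names `build_page_command` and `expected_output` are hypothetical and not part of tesseract.

```python
from pathlib import Path


def build_page_command(image, output_base, polygon=True, word_level=False):
    """Assemble a tesseract argument list for PAGE XML export.

    The parameter names below are copied from the PR description and are
    an assumption; they may differ in the merged version.
    """
    cmd = ["tesseract", str(image), str(output_base)]
    # 1 = polygon, 0 = bounding boxes (per the description above).
    cmd += ["-c", f"page_xml_polygon={int(polygon)}"]
    # 0 = line level, 1 = word level (per the description above).
    cmd += ["-c", f"page_xml_level={int(word_level)}"]
    # 'page' selects the preconfigured PAGE XML settings.
    cmd.append("page")
    return cmd


def expected_output(output_base):
    """Return the output path, assuming the '.page.xml' suffix."""
    return Path(f"{output_base}.page.xml")


if __name__ == "__main__":
    print(" ".join(build_page_command("eurotext.png", "eurotext", word_level=True)))
    print(expected_output("eurotext"))
```

The list returned by `build_page_command` could be passed to `subprocess.run` on a machine where tesseract is installed; the sketch itself does not call tesseract.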
The output is valid for PAGE XML version 2019-07-15.
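For readers who want to inspect such a file, the following is a rough sketch of reading text lines from a PAGE XML document with Python's standard library. The embedded sample is hand-written for illustration and is not actual output of this renderer; it only assumes the 2019-07-15 PAGE namespace and the usual `TextLine` / `Coords` / `TextEquiv` / `Unicode` structure. It also shows how a bounding box can be derived from a polygon, which is the relationship between the two coordinate modes mentioned above.

```python
import xml.etree.ElementTree as ET

# Namespace of the PAGE 2019-07-15 schema.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hand-written sample, NOT real tesseract output.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.png" imageWidth="200" imageHeight="100">
    <TextRegion id="r0">
      <Coords points="10,10 190,10 190,40 10,40"/>
      <TextLine id="r0l0">
        <Coords points="10,12 100,10 190,14 190,38 10,40"/>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def bounding_box(pts):
    """Axis-aligned box (xmin, ymin, xmax, ymax) enclosing a polygon."""
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return min(xs), min(ys), max(xs), max(ys)


def read_lines(xml_text):
    """Return (id, text, polygon, bbox) for every TextLine element."""
    root = ET.fromstring(xml_text)
    result = []
    for line in root.iterfind(".//pc:TextLine", NS):
        coords = line.find("pc:Coords", NS)
        text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
        pts = parse_points(coords.get("points"))
        result.append((line.get("id"), text, pts, bounding_box(pts)))
    return result


if __name__ == "__main__":
    for line_id, text, pts, bbox in read_lines(SAMPLE):
        print(line_id, repr(text), len(pts), "points, bbox", bbox)
```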
If a text line contains only LTR or only RTL characters, the output is correct, but for mixed-direction (BiDi) lines I am not sure that it is.
Can anyone help me here?
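As a starting point for testing, mixed-direction lines can be flagged with a small Python sketch based on the bidirectional class of each character. This is not what the renderer does and it is not an implementation of the Unicode Bidirectional Algorithm; it only looks at strong character classes (`L` for left-to-right, `R` and `AL` for right-to-left). The function name `line_direction` is hypothetical.

```python
import unicodedata

# Strong bidirectional classes as reported by unicodedata.bidirectional().
LTR_CLASSES = {"L"}
RTL_CLASSES = {"R", "AL"}


def line_direction(text):
    """Classify a line as 'ltr', 'rtl', 'mixed' or 'neutral'.

    Digits, spaces and punctuation carry weak or neutral classes and
    are ignored here.
    """
    has_ltr = any(unicodedata.bidirectional(c) in LTR_CLASSES for c in text)
    has_rtl = any(unicodedata.bidirectional(c) in RTL_CLASSES for c in text)
    if has_ltr and has_rtl:
        return "mixed"
    if has_rtl:
        return "rtl"
    if has_ltr:
        return "ltr"
    return "neutral"


if __name__ == "__main__":
    for sample in ["Hello world", "שלום", "שלום abc", "123 ..."]:
        print(repr(sample), line_direction(sample))
```

Running this over the recognized text of each line would at least identify which lines in a test image fall into the uncertain mixed case.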