PAGE XML renderer / export #4214

Merged
merged 26 commits into from
Apr 19, 2024

JKamlah
Contributor

@JKamlah JKamlah commented Mar 20, 2024

Hi everyone,

I've created a PAGE-XML renderer/export.

It's not just a simple PAGE-XML export, it can also produce a textline polygon instead of a simple bounding box, and it can output up to word level.
The output can be customised with three parameters:

   tessedit_create_page_xml -> "Write .page.xml PAGE file".
   page_xml_polygon -> "Create PAGE file with polygons instead of box values (default:True)".
   page_xml_level -> "Create PAGE with 0 - line or 1 - word level (default:0)".

After installing Tesseract, you can use the preconfigured settings file 'page':
page -> Output a page.xml file with polygons at line level

As word-level is more of a niche requirement, you need to enable it via -c:
-c page_xml_polygon=1 // True polygon or False bounding boxes
-c page_xml_level=1 // 0 line or 1 word level
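
A minimal sketch of how these options might be combined into one command line, wrapped in Python so it can be scripted. The helper name `build_page_xml_command` and the file names are hypothetical; the `page` config and the two `-c` variables are taken from the description above and may be named differently in a released build.

```python
def build_page_xml_command(image, output_base, polygon=True, word_level=False):
    """Build a tesseract argument list for PAGE XML export.

    Hypothetical helper: the 'page' config and the two -c variables are
    copied from this PR description, not from a released version.
    """
    return [
        "tesseract", image, output_base,
        # 1 = text line polygons, 0 = plain bounding boxes
        "-c", f"page_xml_polygon={int(polygon)}",
        # 0 = line level, 1 = word level
        "-c", f"page_xml_level={int(word_level)}",
        # config file that switches on the PAGE XML renderer
        "page",
    ]


if __name__ == "__main__":
    # "eurotext.png" and "eurotext" are placeholder names.
    cmd = build_page_xml_command("eurotext.png", "eurotext", word_level=True)
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run` on a machine with a Tesseract build that includes this renderer.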

The output is valid for PAGE XML version 2019-07-15.

If a text line contains only LTR or only RTL characters, the output is correct, but for mixed-direction (BiDi) lines I am not sure.
Can anyone help me here?
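
To find test lines for the BiDi question, one option is to classify recognised text by its strongly directional characters. This is only a sketch using Python's `unicodedata` module; the function name `line_direction` is hypothetical and this is not how the renderer itself decides the reading direction.

```python
import unicodedata

# Bidi classes of strongly directional characters (Unicode UAX #9).
STRONG_LTR = {"L"}
STRONG_RTL = {"R", "AL"}


def line_direction(text):
    """Classify text as 'ltr', 'rtl', 'mixed' or 'neutral' by its strong characters."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    has_ltr = bool(classes & STRONG_LTR)
    has_rtl = bool(classes & STRONG_RTL)
    if has_ltr and has_rtl:
        return "mixed"
    if has_ltr:
        return "ltr"
    if has_rtl:
        return "rtl"
    return "neutral"  # digits, punctuation and whitespace only


if __name__ == "__main__":
    for sample in ("Hello Google", "שלום", "Hello שלום"):
        print(repr(sample), line_direction(sample))
```

Lines reported as `mixed` are the ones whose PAGE output would need a closer look.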

@egorpugin
Contributor

Can you send a result file of this code for some example image?

Member

@stweil stweil left a comment

The CI builds fail because some include statements are missing.

@JKamlah
Contributor Author

JKamlah commented Mar 20, 2024

Thanks for the suggestions and fixes, @stweil.

Here are examples for eurotext and hebrew with 'page' and 'page-poly':
https://digi.bib.uni-mannheim.de/~jkamlah/PAGERenderer/

@stweil
Member

stweil commented Mar 20, 2024

> Can you send a result file of this code for some example image?

PAGE XML with box coordinates:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
        <Metadata>
                <Creator>Tesseract - 5.3.4-27-ge2f3</Creator>
                <Created>2024-03-20T14:16:51</Created>
                <LastChange>2024-03-20T14:16:51</LastChange>
        </Metadata>
        <Page imageFilename="test/testing/HelloGoogle.tif" imageWidth="400" imageHeight="200" type="content">
                <ReadingOrder>
                        <OrderedGroup id="ro6814328771458624830" caption="Regions reading order">
                                <RegionRefIndexed index="0" regionRef="r0"/>
                        </OrderedGroup>
                </ReadingOrder>
                <TextRegion id="r0" custom="readingOrder {index:0;} readingDirection {left-to-right;} orientation {0;}">
                        <Coords points="10,20 331,20 331,71 10,71"/>
                        <TextLine id="r0l0" readingDirection="left-to-right" custom="readingOrder {index:0;}">
                                <Coords points="10,20 331,20 331,71 10,71"/>
                                <Baseline points="10,59 331,63"/>
                                <TextEquiv index="1" conf="0.9585">
                                        <Unicode>Hello Google</Unicode>
                                </TextEquiv>
                        </TextLine>
                        <TextEquiv index="1" conf="0.9585">
                                <Unicode>Hello Google</Unicode>
                        </TextEquiv>
                </TextRegion>
        </Page>
</PcGts>
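
A file like this can be read back with a namespace-aware XML parser. The following is a sketch using Python's standard `xml.etree.ElementTree`; the helper names are hypothetical and the embedded sample is a trimmed copy of the output above, not a complete PAGE file.

```python
import xml.etree.ElementTree as ET

# Namespace of PAGE XML version 2019-07-15, as declared on <PcGts>.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Trimmed copy of the box-coordinate output (metadata and reading order omitted).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="test/testing/HelloGoogle.tif" imageWidth="400" imageHeight="200">
    <TextRegion id="r0">
      <Coords points="10,20 331,20 331,71 10,71"/>
      <TextLine id="r0l0">
        <Coords points="10,20 331,20 331,71 10,71"/>
        <Baseline points="10,59 331,63"/>
        <TextEquiv index="1" conf="0.9585"><Unicode>Hello Google</Unicode></TextEquiv>
      </TextLine>
      <TextEquiv index="1" conf="0.9585"><Unicode>Hello Google</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_text_lines(xml_text):
    """Yield (line id, text, confidence, outline points) for each TextLine."""
    root = ET.fromstring(xml_text)
    for line in root.iterfind(".//pc:TextLine", NS):
        # find() only looks at direct children, so this is the line-level text.
        equiv = line.find("pc:TextEquiv", NS)
        text = equiv.findtext("pc:Unicode", default="", namespaces=NS)
        conf = float(equiv.get("conf"))
        coords = parse_points(line.find("pc:Coords", NS).get("points"))
        yield line.get("id"), text, conf, coords


if __name__ == "__main__":
    for line_id, text, conf, coords in read_text_lines(SAMPLE):
        print(line_id, repr(text), conf, coords)
```

Without the namespace mapping, lookups such as `find("TextLine")` return nothing, because every element lives in the PAGE namespace.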

PAGE XML with polygon:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
        <Metadata>
                <Creator>Tesseract - 5.3.4-27-ge2f3</Creator>
                <Created>2024-03-20T14:19:36</Created>
                <LastChange>2024-03-20T14:19:36</LastChange>
        </Metadata>
        <Page imageFilename="test/testing/HelloGoogle.tif" imageWidth="400" imageHeight="200" type="content">
                <ReadingOrder>
                        <OrderedGroup id="ro6814328771458624830" caption="Regions reading order">
                                <RegionRefIndexed index="0" regionRef="r0"/>
                        </OrderedGroup>
                </ReadingOrder>
                <TextRegion id="r0" custom="readingOrder {index:0;} readingDirection {left-to-right;} orientation {0;}">
                        <Coords points="10,20 331,20 331,71 10,71"/>
                        <TextLine id="r0l0" readingDirection="left-to-right" custom="readingOrder {index:0;}">
                                <Coords points="10,21 42,21 46,30 74,30 79,20 102,20 105,30 132,30 158,22 193,22 197,32 276,32 276,22 298,22 331,23 331,63 298,63 298,71 259,71 256,62 197,62 158,61 74,60 46,60 10,58"/>
                                <Baseline points="10,59 158,60 331,63"/>
                                <TextEquiv index="1" conf="0.9585">
                                        <Unicode>Hello Google</Unicode>
                                </TextEquiv>
                        </TextLine>
                        <TextEquiv index="1" conf="0.9585">
                                <Unicode>Hello Google</Unicode>
                        </TextEquiv>
                </TextRegion>
        </Page>
</PcGts>
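
The two outputs can be cross-checked: the axis-aligned bounding box of the polygon should match the box coordinates of the same line. A small sketch of that check, with both point strings copied from the examples above and hypothetical helper names:

```python
def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) of a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# TextLine <Coords> from the polygon output.
POLYGON = (
    "10,21 42,21 46,30 74,30 79,20 102,20 105,30 132,30 158,22 193,22 "
    "197,32 276,32 276,22 298,22 331,23 331,63 298,63 298,71 259,71 "
    "256,62 197,62 158,61 74,60 46,60 10,58"
)
# TextLine <Coords> from the box output.
BOX = "10,20 331,20 331,71 10,71"

if __name__ == "__main__":
    print(bounding_box(parse_points(POLYGON)))
    print(bounding_box(parse_points(BOX)))
```

For this sample both calls give the same rectangle, so the polygon stays inside the reported bounding box while following the glyph outlines more closely.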

@JKamlah
Contributor Author

JKamlah commented Mar 20, 2024

Added a terminating linefeed, removed trailing whitespace, and fixed typos (found by typos-cli).
I need some more time for the unused functions and variables.

@stweil stweil requested review from zdenop, egorpugin, amitdo and a team and removed request for a team March 22, 2024 15:04
@stweil stweil marked this pull request as ready for review April 10, 2024 11:16
@stweil
Member

stweil commented Apr 10, 2024

Maybe we can merge this new feature in a new 5.4.0 pre-release.

@stweil stweil merged commit a789b2d into tesseract-ocr:main Apr 19, 2024
5 of 6 checks passed
stweil added a commit that referenced this pull request Apr 19, 2024
Add PAGE XML export and documentation.
To generate PAGE XML output just add 'page' to the tesseract command.

The output is outputname + '.page.xml' to avoid conflicts with ALTO export.

The output can be customized with the flags:
tessedit_create_page_polygon and tessedit_create_page_wordlevel.

Co-authored-by: Stefan Weil <[email protected]>
@egorpugin
Contributor

egorpugin commented Apr 19, 2024

Should we squash long PRs?

Upd.:
I see, seems squashed.

stweil added a commit that referenced this pull request Apr 24, 2024
Fixes: 577e8a8 ("Add PAGE XML renderer / export (#4214)")
Signed-off-by: Stefan Weil <[email protected]>
stweil added a commit that referenced this pull request Apr 24, 2024
Fixes: 577e8a8 ("Add PAGE XML renderer / export (#4214)")
Signed-off-by: Stefan Weil <[email protected]>
stweil added a commit to stweil/tesseract that referenced this pull request May 3, 2024
Fixes: 577e8a8 ("Add PAGE XML renderer / export (tesseract-ocr#4214)")
Signed-off-by: Stefan Weil <[email protected]>
stweil added a commit to stweil/tesseract that referenced this pull request May 3, 2024
Use also enum names instead of numeric values where possible.

Fixes: 577e8a8 ("Add PAGE XML renderer / export (tesseract-ocr#4214)")
Signed-off-by: Stefan Weil <[email protected]>
stweil added a commit that referenced this pull request May 3, 2024
Fixes: 577e8a8 ("Add PAGE XML renderer / export (#4214)")
Signed-off-by: Stefan Weil <[email protected]>
stweil added a commit that referenced this pull request May 3, 2024
Use also enum names instead of numeric values where possible.

Fixes: 577e8a8 ("Add PAGE XML renderer / export (#4214)")
Signed-off-by: Stefan Weil <[email protected]>
@amitdo amitdo added the output issues related output formats label Nov 11, 2024