Script for pipeline with structure recognition using table-transformer available here #116
Replies: 3 comments 2 replies
Thanks.
Thank you for the amazing repo; it makes the complicated Document AI process much simpler. When I try the code above, it gives me the error below: `Downloading (…)lve/main/config.json: 100%`
The script is out of date and will not work with the latest release. It will be integrated into the built-in analyzer in the next PR.

With the latest PR, it is now possible to perform table recognition and table structure recognition with the MS table-transformer, as provided in the repo https://github.com/microsoft/table-transformer. As usual, I integrate the model through the Transformers library.
I am very pleased with the results, especially for row and column recognition: the model seems to generalize better to domains other than medical research tables than the current model trained on PubTabNet does.
With regard to detecting spanning cells, column headers and so on, I am more cautious. There may also be a problem when generating the HTML, possibly caused by poor spanning cell predictions, but I haven't tracked it down yet.
It is of course possible to filter out spanning cells and generate a table structure from simple cells only, but I am not sure that is the right approach.
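To illustrate the fallback of ignoring spanning cells, here is a minimal sketch that keeps only the row and column detections and intersects them into a plain grid. It is not the pipeline from this repo: the `Box` class and the `simple_cells` helper are hypothetical, and the label strings `"table row"`, `"table column"` and `"table spanning cell"` are assumed to match the structure model's class names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned detection (x0, y0, x1, y1) with its predicted label."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def simple_cells(detections):
    """Ignore spanning cells and headers; build cells from rows x columns.

    Returns a list of (row_index, column_index, (x0, y0, x1, y1)).
    """
    # Assumed label names; adjust them to whatever the model config reports.
    rows = sorted((d for d in detections if d.label == "table row"),
                  key=lambda b: b.y0)
    cols = sorted((d for d in detections if d.label == "table column"),
                  key=lambda b: b.x0)
    cells = []
    for r_idx, row in enumerate(rows):
        for c_idx, col in enumerate(cols):
            # A simple cell takes its width from the column
            # and its height from the row.
            cells.append((r_idx, c_idx, (col.x0, row.y0, col.x1, row.y1)))
    return cells
```

The trade-off is that any merged cell in the original table is split back into its grid positions, so the resulting HTML is regular but loses span information.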
However, integrating the model into a robust processing pipeline requires setting several configuration values (padding, NMS thresholds, ...). Attached is a proposed pipeline with a given config, but I would like to know whether there are better options. You can copy and paste the script to get started. Feedback and suggestions are always welcome.
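As a rough illustration of what these two knobs do, the sketch below pads a table crop and applies greedy non-maximum suppression to overlapping boxes. It is plain Python with no dependency on this repo; the names `StructureConfig`, `pad_crop_box` and `nms` and the default values are hypothetical placeholders, not the settings used in the attached script, and reading "padding" as a margin around the table crop is an assumption.

```python
from dataclasses import dataclass


@dataclass
class StructureConfig:
    """Hypothetical config holder; the defaults are placeholders only."""
    padding: int = 30               # pixels added around the table crop
    nms_iou_threshold: float = 0.5  # overlap above which a box is suppressed


def pad_crop_box(box, padding, width, height):
    """Enlarge a crop box by `padding` pixels, clipped to the page size."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - padding), max(0, y0 - padding),
            min(width, x1 + padding), min(height, y1 + padding))


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

In practice one would probably run NMS per class (rows against rows, columns against columns), since a row and a column are expected to overlap.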
table-transformer will be available in the next release, which I'll try to publish by mid-February. If this approach becomes more robust, I might consider replacing the built-in pipeline with it, but not at this stage.