Description
Describe the bug
Some tables have multi-level headers, where a main header spans multiple columns that each have their own name, like the table in the attached PDF file.
LlamaParse does a very good job of parsing such tables, but the markdown (and the other output formats) is still inconsistent: some table cells are missing, or the output does NOT carry enough information about the original table structure.
Job IDs
- accurate: 47b23fba-6750-4569-8466-e9bf1b3f3057
- premium: 311480bb-bd2b-48c2-b012-5ba44f9b6c1e
- continuous: c1dd6e62-1e62-47e2-a430-6c879e1056b8
- vendor openai-gpt4o: 1a404e2c-ffde-44a7-a821-aab0d0d79804
- vendor gemini-1.5-pro: 21036a1b-290f-42fe-99a0-46edc0222390
- vendor anthropic-sonnet-3.5: d3a82652-b73a-4c16-af70-78ef0011021c
Client:
- Python Library
- Frontend (cloud.llamaindex.ai)
Additional comments
- accurate: does NOT provide extractable table data
- premium: quite precise cell extraction, but the table structure is hard to interpret
- continuous: does NOT provide extractable table data
- vendor openai-gpt4o: misses some cell data, but adds spaces that make it possible to interpret the table structure (ALMOST SOLVES THE ISSUE)
- vendor gemini-1.5-pro: misses some cell data, but combines the multi-level headers into fewer headers, repeating the names (ALMOST SOLVES THE ISSUE). The problems here are incorrect parsing of the first column and issues with the last one.
- vendor anthropic-sonnet-3.5: the most precise cell extraction, but loses the table structure (ALMOST SOLVES THE ISSUE)
P.S. I also tried various prompts, but they do NOT give consistent parsing results and usually make the output worse.
Files
PDF file, which contains a single table with a complex header
Catalog-Part-1-p18.pdf
Resulting markdown
- accurate: Catalog_LP_accurate.md
- premium: Catalog_LP_premium.md
- continuous: Catalog_LP_continous.md
- vendor openai-gpt4o: Catalog_LP_vendor_openai-gpt4o.md
- vendor gemini-1.5-pro: Catalog_LP_vendor_gemini-1.5-pro.md
- vendor anthropic-sonnet-3.5: Catalog_LP_vendor_anthropic-sonnet-3.5.md
Expected markdown (only the tables are kept, for brevity)
- v0, combining the top and bottom headers (like gemini-1.5-pro): Catalog_LP_expected_v0.md
- v1, with column leveling using spaces (like a combination of openai-gpt4o and anthropic-sonnet-3.5): Catalog_LP_expected_v1.md
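To make the v0 idea concrete, here is a minimal Python sketch of combining a top and a bottom header row into one flat markdown header. It is only an illustration of the expected output shape, not how LlamaParse works internally. The function names `flatten_headers` and `to_markdown`, the `(name, span)` representation, and the column names in the example are all hypothetical and are not taken from the attached PDF.

```python
from typing import List, Sequence, Tuple


def flatten_headers(
    top_spans: Sequence[Tuple[str, int]],
    bottom: Sequence[str],
    sep: str = " ",
) -> List[str]:
    """Combine a two-row header into a single row.

    top_spans: (name, span) pairs for the top header row, where span is
               the number of columns the top-level cell covers.
    bottom:    one entry per column; "" means the top-level cell also
               covers the bottom row (no sub-header).
    """
    flat: List[str] = []
    col = 0
    for name, span in top_spans:
        for child in bottom[col:col + span]:
            # Repeat the parent name in front of every sub-header.
            flat.append(f"{name}{sep}{child}" if child else name)
        col += span
    if col != len(bottom):
        raise ValueError("top-level spans do not match the number of columns")
    return flat


def to_markdown(header: Sequence[str], rows: Sequence[Sequence[str]]) -> str:
    """Render a flat header and data rows as a markdown table."""
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


# Hypothetical example columns (assumed for illustration only).
header = flatten_headers(
    [("Part No.", 1), ("Dimensions", 2), ("Weight", 1)],
    ["", "Width", "Height", ""],
)
table = to_markdown(header, [["A-100", "10", "20", "1.5"]])
```

With this shape every data column keeps a unique, self-describing name, so the parent/child relationship survives even though markdown tables have no native column spans.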