Problem with parsing PDFs with multi-level columns #512

Open
@VolDonets

Describe the bug
Some tables have multi-level headers, where a top-level header spans several sub-columns, each with its own name, as in the attached PDF file.
LlamaParse does a very good job parsing such tables, but the markdown (and the other output formats) is still inconsistent: some table cells are missing, or the output does NOT carry enough information to recover the original table structure.

Job IDs
accurate: 47b23fba-6750-4569-8466-e9bf1b3f3057
premium: 311480bb-bd2b-48c2-b012-5ba44f9b6c1e
continuous: c1dd6e62-1e62-47e2-a430-6c879e1056b8
vendor openai-gpt4o: 1a404e2c-ffde-44a7-a821-aab0d0d79804
vendor gemini-1.5-pro: 21036a1b-290f-42fe-99a0-46edc0222390
vendor anthropic-sonnet-3.5: d3a82652-b73a-4c16-af70-78ef0011021c

Client:

  • Python Library
  • Frontend (cloud.llamaindex.ai)

Additional comments
accurate: does NOT provide extractable table data
premium: quite precise cell extraction, but the table structure is hard to interpret
continuous: does NOT provide extractable table data
vendor openai-gpt4o: misses some cell data, but adds spaces that make the table structure interpretable (ALMOST SOLVES THE ISSUE)
vendor gemini-1.5-pro: misses some cell data, but merges the multi-level headers into single headers, repeating the top-level names (ALMOST SOLVES THE ISSUE). The problems here are incorrect parsing of the first column and issues with the last one
vendor anthropic-sonnet-3.5: the most precise cell extraction, but loses the table structure (ALMOST SOLVES THE ISSUE)
P.S. I also tried various prompts, but they do NOT give consistent parsing results and usually make them worse.

Files
PDF file containing a single table with a complex header
Catalog-Part-1-p18.pdf

Resulting markdown
accurate Catalog_LP_accurate.md
premium Catalog_LP_premium.md
continuous Catalog_LP_continous.md
vendor openai-gpt4o Catalog_LP_vendor_openai-gpt4o.md
vendor gemini-1.5-pro Catalog_LP_vendor_gemini-1.5-pro.md
vendor anthropic-sonnet-3.5 Catalog_LP_vendor_anthropic-sonnet-3.5.md

Expected markdown (only the tables are kept, for brevity)
v0: top and bottom headers combined (like gemini-1.5-pro) Catalog_LP_expected_v0.md
v1: column levels aligned with spaces (like a combination of openai-gpt4o and anthropic-sonnet-3.5) Catalog_LP_expected_v1.md
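
To make the v0 idea concrete, here is a minimal sketch of how a two-level header could be flattened into a single header row. It is only an illustration of the expected output shape, not how LlamaParse works internally. The function name `flatten_headers`, the separator, and the column names are all hypothetical and are not taken from the attached PDF. It assumes a spanning top-level header is given once and followed by empty cells for the columns it covers.

```python
from typing import List


def flatten_headers(top: List[str], bottom: List[str], sep: str = " / ") -> List[str]:
    """Combine a top and a bottom header row into one row (hypothetical helper).

    Assumption: a top-level header that spans several columns appears once,
    followed by empty strings for the remaining columns it covers.
    """
    flat = []
    current = ""  # last non-empty top-level header seen so far
    for t, b in zip(top, bottom):
        if t.strip():
            current = t.strip()
        b = b.strip()
        if current and b:
            # Sub-column under a spanning header: repeat the top-level name.
            flat.append(f"{current}{sep}{b}")
        else:
            # Column with only one header level.
            flat.append(current or b)
    return flat


def markdown_header(cells: List[str]) -> str:
    """Render the flattened header plus the markdown delimiter row."""
    head = "| " + " | ".join(cells) + " |"
    rule = "| " + " | ".join("---" for _ in cells) + " |"
    return head + "\n" + rule


# Hypothetical column names, for illustration only.
top_row = ["Model", "Dimensions", "", "Weight"]
bottom_row = ["", "Width", "Height", ""]
flat_row = flatten_headers(top_row, bottom_row)
print(markdown_header(flat_row))
```

One limitation of this sketch: a column with an empty top-level cell is always attributed to the previous spanning header, so a table that places a single-level column directly after a spanned group would need explicit span information.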

Labels: bug