document-analysis

Here are 231 public repositories matching this topic...

opendatalab / MinerU

Transforms complex documents like PDFs into LLM-ready markdown/JSON for your Agentic workflows.

python pdf parser ocr pdf-converter extract-data document-analysis pdf-parser layout-analysis ai4science pdf-extractor-rag pdf-extractor-llm pdf-extractor-pretrain

Updated Dec 16, 2025
Python

bytedance / Dolphin

The official repo for “Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting”, ACL, 2025.

python pdf parser ocr pdf-converter document-analysis pdf-parser layout-analysis vlm-ocr

Updated Dec 17, 2025
Python

ucbepic / docetl

A system for agentic LLM-powered data processing and ETL

python workflow data etl semantic-data elt data-pipelines agents document-analysis document-processing unstructured-data unstructured-data-analysis llm

Updated Dec 19, 2025
Python

UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)

pdf csharp pdfbox netstandard pdf-files pdf-document pdf-generation hocr document-analysis pdf-extractor alto-xml page-xml layout-analysis pdf-document-processor

Updated Dec 7, 2025
C#

NanoNets / docext

An on-premises, OCR-free unstructured data extraction, markdown conversion and benchmarking toolkit. (https://idp-leaderboard.org/)

Updated Aug 25, 2025
Python

AlibabaResearch / AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

Updated Apr 9, 2025
C++

tstanislawek / awesome-document-understanding

A curated list of resources for the Document Understanding (DU) topic

Updated Jun 2, 2023

DocumindHQ / documind

Open-source platform for extracting structured data from documents using AI.

open-source pdf parser ocr ai pdf-converter developer-tools extract-data document-analysis pdf-extractor document-extraction llms pdf-extractor-llm

Updated May 15, 2025
JavaScript

Deodat-Lawson / PDR_AI_v2

AI-powered document analysis platform built with Next.js, LangChain, PostgreSQL + pgvector. Upload, organize, and chat with documents. Includes predictive missing-document detection, role-based workflows, and page-level insight extraction.

typescript ocr nextjs postgresql full-stack openai document-analysis rag vector-search ai-chatbot document-ai langchain pgvector drizzle-orm rag-chatbot llm-app

Updated Dec 20, 2025
JavaScript

Yuliang-Liu / Curve-Text-Detector

This repository provides train & test code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.

deep-learning object-detection document-analysis scene-text

Updated Jul 20, 2020
Jupyter Notebook

ispras / dedoc

Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser)

html pdf ocr table-of-contents excel html-parser docx documents doc scanned-documents txt document-analysis odt pdf-parser table-recognition docx-parser document-content-extraction logical-structure-extraction

Updated Dec 16, 2025
Python

wenwenyu / PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)

document-analysis graph-convolutional-network graph-learning graph-neural-networks document-understanding key-information-extraction

Updated Jul 25, 2024
Python

CybercentreCanada / assemblyline

AssemblyLine 4: File triage and malware analysis

framework incident-response malware python3 cybersecurity cert infosec malware-analyzer malware-analysis malware-research automation-framework cyber-security file-analysis document-analysis security-automation security-tools malware-detection assemblyline security-automation-framework

Updated Dec 18, 2025
Python

jpWang / LiLT

Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)

nlp information-extraction document-analysis document-understanding multilingual-models document-ai multimodal-pre-trained-model

Updated Oct 31, 2022
Python

pandora-analysis / pandora

Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results

infosec document-analysis malware-detection document-analyzing

Updated Dec 15, 2025
Python

lazyFrogLOL / llmdocparser

A package for parsing PDFs and analyzing their content using LLMs.

nlp ocr chunking document-analysis pdf-parser pdfparser rag llm text-chunking

Updated Aug 6, 2024
Python

masyagin1998 / robin

RObust document image BINarization

python opencv ocr computer-vision deep-learning keras neural-networks document-analysis u-net document-binarization

Updated Aug 2, 2024
Python

ppaanngggg / yolo-doclaynet

YOLO models trained on DocLayNet: power your document intelligence with layout analysis

yolo document-analysis layout-analysis ultralytics yolov8 doclaynet

Updated Aug 3, 2025
Python

anisha2102 / docvqa

Document Visual Question Answering

computer-vision deep-learning document-analysis visual-question-answering

Updated Jul 30, 2020
Python

mirabdullahyaser / Retrieval-Augmented-Generation-Engine-with-LangChain-and-Streamlit

Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.

natural-language-processing artificial-intelligence question-answering chat-application document-analysis streamlit gpt-3 large-language-models generative-ai langchain openai-chatgpt retrieval-augmented-generation

Updated Jul 4, 2024
Python
