Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Module for automatic summarization of text documents and HTML pages.
Golang PDF library for creating and processing PDF files (pure Go)
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition
Heuristic-based boilerplate removal tool
This repository has moved! https://github.com/unidoc/unipdf
A self-hosted search engine for documents.
Text Extraction, Rendering and Converting of PDF Documents
A simple library and set of tools for parsing, modifying, and composing SRT files.
[UNMAINTAINED] Extract values from strings and fill your structs with NLP.
A very simple news crawler with a funny name
🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based
Benchmarking PDF libraries
Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)
hotpdf is a fast PDF parsing library, built on top of pdfminer.six, for extracting and finding text within PDF documents
Entity Disambiguation as text extraction (ACL 2022)
AWS Lambda functions to extract text from various binary formats.
CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)
The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.