Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Updated Nov 30, 2024 - Java
🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation
Dockerized Web Curator Tool with Heritrix 3 and pywb
Parse a Heritrix crawl.log into an XML sitemap
Single Docker container running Heritrix 3, picking up jobs from a directory.