Parses huge web archive files from the Common Crawl data index to fetch any required domain's data concurrently, using Python and Scrapy.
Updated Jul 14, 2021 - Python
A plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.