PGN Scraper is a command-line application written in Go, designed to scrape Portable Game Notation (PGN) files and related formats from the internet. It supports crawling websites, downloading files, and organizing the output efficiently, making it ideal for chess enthusiasts or developers working with large amounts of chess data.
- Multi-threaded Crawling: Scrape multiple URLs concurrently with customizable threading.
- Custom Headers and Proxies: Support for random user-agents, custom headers, and proxies.
- Advanced File Handling: Detects duplicate files using checksums, supports name regeneration, and organizes downloaded files by domain.
- Recursive Crawling: Optionally follow internal links up to a user-defined depth.
- Flexible Format Support: Download PGN, ChessBase, SCID, ZIP, and text files.
- Random Delays: Simulates human-like browsing behavior with random delays.
- Ensure you have Go installed on your system (version 1.18+).
- Clone the repository:
git clone https://github.com/Pythoript/PGN-Scraper.git
- Navigate to the project directory:
cd PGN-Scraper
- Install Dependancies:
go mod tidy
- Build the application:
go build -o pgn-scraper
Run the scraper using the command-line interface. The following flags are supported:
Flag | Description | Default |
---|---|---|
--url |
Single URL to scrape. | |
--url-file |
File containing a list of URLs (one per line). | |
--depth |
Maximum depth for recursive crawling. | 1 |
--threads |
Number of concurrent threads for crawling. | 1 |
--dir |
Directory to save downloaded files. | ./ |
--proxy |
Proxy server URL (e.g., http://proxy.example.com ). |
|
--port |
Proxy server port (used with --proxy ). |
|
--user-agent |
Custom user-agent string. | |
--ua-file |
File containing user-agent strings (one per line). | |
--rename |
Rename downloaded files to random strings. | false |
--log |
Enable logging to a file. | false |
--log-file |
Specify the log file name. | pgn-scraper.log |
--verbose |
Enable verbose logging. | false |
--no-header |
Disable custom headers in requests. | false |
--delay |
Base delay (ms) between requests. | 0 |
--vary |
Random variation (ms) added to base delay. | 0 |
--keep-duplicates |
Keep duplicate files (disables checksums). | false |
--cookie |
Send cookie with each request | |
--cookie-file |
File containing cookies for authentication | |
--log-failed |
File to save faild (non 404 error) download URLs | |
--save-urls |
File to save successfully downloaded URLs | |
--overwrite |
Overwrite existing files. | false |
--chessbase |
Support downloading ChessBase files | false |
--scid |
Support downloading SCID files | false |
--text |
Support downloading text files | false |
--zip |
Support downloading zip and 7z files | false |
--fide |
Support downloading PGN files on ratings.fide.com. | false |
--all |
Download all supported formats.upport downloading PGN files | false |
--crawl |
Enable crawling for internal links. | false |
-
Basic Scraping:
./pgn-scraper --url https://example.com/pgns
-
Scrape Multiple URLs from a File:
./pgn-scraper --url-file urls.txt --threads 5 --dir chess-data
-
Enable Proxies and Custom User-Agent:
./pgn-scraper --url https://example.com/pgns --proxy http://proxyserver.com --user-agent "CustomAgent/1.0"
-
Recursive Crawling:
./pgn-scraper --url https://example.com --crawl --depth 3
- Fork the repository.
- Create a new branch for your feature:
git checkout -b feature-name
- Commit your changes:
git commit -m "Add new feature"
- Push to your branch:
git push origin feature-name
- Submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.