Crawl URLs, find matches with a regex, and save the results
- Can start URL crawling either from Google Custom Search results (driven by a file of keywords) or from a file with a list of URLs
- Can define a regex to skip URLs by host, path, or query format
- Can define a regex to match content
- Can define the DOM elements to which the regex will be applied
- Outputs to a CSV file or a SQLite database, with caching and deduplication of matches (stored as key/value pairs, where the key is the match and the value is the URL it was found on)
- Can use a cache to avoid revisiting URLs that were already crawled in a previous run
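As a quick, purely illustrative sketch of how the regex options fit together (the variable names are documented in the configuration table further down; the values here are made up), a `.env` fragment could restrict the crawl to HTTPS `.org` pages and collect e-mail-like strings:

```env
# Illustrative values only
CRAWL_ALLOWED_URLS_REGEX=^https://[^/]+\.org/
CRAWL_MATCH_REGEX=[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
```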
git clone https://github.com/nicupavel/regexrover && cd regexrover
go get github.com/wailsapp/wails/v2
go mod tidy
cp .env.default .env
go run ./cmd/cli
cd cmd/gui/frontend; npm i; cd -; cd cmd/gui; wails build
The crawler supports two modes:
- Google Custom Search - searches for keywords from a file and crawls the URLs returned in the search results
- List of URLs - crawls a list of URLs from a file
Google Custom Search mode:
- Create a Programmable Search Engine
- Copy the Search Engine Code / ID to `GOOGLE_SEARCH_ID` in the `.env` file
- Get an API key for the Google Custom Search JSON API
- Put this key in `GOOGLE_SEARCH_API_KEY` in the `.env` file
- Create a file with keywords (can be multiple words per line) and put the file name in `KEYWORDS_FILE` in the `.env` file
- Run `go run ./cmd/cli` (an example `.env` sketch for this mode follows the note below)
Note: All search results from Google will be saved to a file named `<keywords>_search_links.txt`. This file can be used later with the List of URLs mode below.
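A minimal `.env` sketch for this mode, using placeholder values (substitute your own search engine ID, API key, and keywords file), might look like this; the crawl is then started with `go run ./cmd/cli`:

```env
# Placeholders only: use your own Programmable Search Engine ID and API key.
GOOGLE_SEARCH_ID=your-search-engine-id
GOOGLE_SEARCH_API_KEY=your-api-key
# keywords.txt holds one search phrase per line, e.g. "open source web crawler".
KEYWORDS_FILE=keywords.txt
```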
List of URLs mode:
- Create a file with a list of URLs (one URL per line) and put the file name in `CRAWL_URLS_FILE` in the `.env` file
- Run `go run ./cmd/cli` (see the example `.env` sketch after the note below)
Note: Both modes will output a CSV file or a SQLite database named `found_matches_<run_date_time>.csv|sqlite`.
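A minimal `.env` sketch for the List of URLs mode (the file name and output driver below are illustrative):

```env
# urls.txt holds one URL per line; when CRAWL_URLS_FILE is set,
# Google Search mode is ignored.
CRAWL_URLS_FILE=urls.txt
# Write matches to a SQLite database instead of the default CSV file.
OUTPUT_DRIVER=sqlite
```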
The following variables can be set in the `.env` file:

| Variable | Description |
| --- | --- |
| `GOOGLE_SEARCH_ID` | (optional) ID of the Google Programmable Search Engine |
| `GOOGLE_SEARCH_API_KEY` | (optional) API key for the Google Custom Search v1 API |
| `KEYWORDS_FILE` | (optional) File with the keywords to search; each line (which can contain multiple words separated by spaces) triggers a Google search |
| `MATCH_OUTPUT_CHUNKS` | Optimizes file writing and deduplicates matches: number of matches after which results are saved to the CSV file. Default: 5 |
| `CRAWL_CACHE_DIR` | Directory to store the crawler cache; cached pages are not revisited on another run. If empty, no cache is kept |
| `CRAWL_DEPTH` | How many levels deep the crawler goes from the page URL obtained from Google or from the file with links. Use 0 for infinite recursion. Default: 1 |
| `CRAWL_THREADS` | How many threads the crawler should use. Default: 20 |
| `CRAWL_IGNORE_DOMAINS` | Comma-separated list of domains to ignore while crawling |
| `CRAWL_ALLOWED_URLS_REGEX` | Only URLs matching this regex will be crawled; can be used to select certain TLDs or to ignore paths with query strings |
| `CRAWL_USER_AGENT` | Browser User-Agent to use |
| `CRAWL_TAG` | Regex matching is performed on all DOM elements with this tag; for example, `body` applies the regex to all body content |
| `CRAWL_MATCH_REGEX` | The regex to match content; matching text is saved in the output file along with the URL |
| `CRAWL_URLS_FILE` | File with the list of URLs to start crawling. If defined, Google Search mode is ignored |
| `OUTPUT_DRIVER` | Use `csv` or `sqlite` output. Default: `csv` |
| `CRAWL_LOG` | 0 disables logging and shows only stats; 1 shows visited URLs and matches. Default: 0 |
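For example, a crawl could be tuned with the variables above like this (the values are illustrative, not recommendations):

```env
# Follow links two levels deep and use 50 crawler threads.
CRAWL_DEPTH=2
CRAWL_THREADS=50
# Keep a cache of visited pages between runs and skip some domains.
CRAWL_CACHE_DIR=.crawl-cache
CRAWL_IGNORE_DOMAINS=facebook.com,twitter.com
# Save/deduplicate matches every 20 matches and log visited URLs and matches.
MATCH_OUTPUT_CHUNKS=20
CRAWL_LOG=1
```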
`CRAWL_TAG` can be specified using goquery selectors.
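Since goquery accepts CSS-style selectors, `CRAWL_TAG` is not limited to bare tag names; the selectors below are hypothetical examples:

```env
# Apply the match regex to the whole page body:
CRAWL_TAG=body
# Or, hypothetically, only to elements picked by a CSS-style selector:
#CRAWL_TAG=div.article-content
#CRAWL_TAG=main p
```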