libgh - GitHub scraping tool
libgh [--days|-d DAYS] [--force|-f] [--from] [--json|-j] [--prune|-p] [--topics] [--xml|-x] [--debug] [--help|-?] [--version] [--] account_or_repository [...]
The alias lgh is also available to shorten the command name.
The libgh command-line utility scrapes data from a list of GitHub accounts (either personal or organizational) or repositories (in account/repository form).
By default this data is returned as pretty-printed text, as JSON data if the --json|-j option is used, or as XML data if the --xml|-x option is used.
As data is retrieved in unauthenticated mode, some of it may be missing. For instance, organizational accounts will not mention the origin of forked repositories and may come with partial repository topics. The --from and --topics options enable additional repository scraping in order to provide this information.
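For example, to fill in both pieces of information when scraping an organizational account (the account name below is only a placeholder):

$ lgh --from --topics SomeOrganization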
The GitHub Web site applies rate limiting rules. To comply with its policies, no more than 60 requests per hour, with at least a 1-second interval between requests, will be performed. The tool also maintains a caching directory of request results, which it reuses for 7 days, or the number of days given with the --days|-d option. A value of 0 instructs the tool not to use caching, while the --force|-f option forces reloading the requested resources. The --debug option shows whether a resource comes from the cache or the Web, as well as the number of Web requests made to GitHub per day, hour and minute.
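As an illustration of this policy, here is a minimal Python sketch of a rate-limited fetch function. It is not the tool's actual implementation, and all names are illustrative:

```python
import time
import urllib.request

MIN_INTERVAL = 1.0   # at least 1 second between requests
MAX_PER_HOUR = 60    # unauthenticated rate limit honored by the tool

_request_times = []  # timestamps of the requests made in the last hour

def polite_get(url):
    """Fetch a URL while honoring the rate-limiting policy described above."""
    now = time.time()
    # Keep a sliding one-hour window of request timestamps
    while _request_times and now - _request_times[0] > 3600:
        _request_times.pop(0)
    if len(_request_times) >= MAX_PER_HOUR:
        time.sleep(3600 - (now - _request_times[0]))
    if _request_times and time.time() - _request_times[-1] < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - (time.time() - _request_times[-1]))
    _request_times.append(time.time())
    with urllib.request.urlopen(url) as response:
        return response.read()
```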
The cache can be trimmed to the same 7 days, or the --days|-d parameter value, with the --prune|-p option. The pages are stored as XZ compressed files in order to reduce disk usage.
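A minimal sketch of how pages could be stored compressed and pruned, using Python's standard lzma module (the file layout and names are assumptions, not the tool's actual code):

```python
import lzma
import os
import time

CACHE_DIR = os.path.expanduser("~/.cache/libgh")  # Unix default location (see below)
os.makedirs(CACHE_DIR, exist_ok=True)

def store_page(filename, content):
    """Store a fetched page as an XZ compressed file to save disk space."""
    with lzma.open(os.path.join(CACHE_DIR, filename + ".xz"), "wb") as file:
        file.write(content)

def prune_cache(days=7):
    """Remove cached files older than the given number of days."""
    cutoff = time.time() - days * 86400
    for name in os.listdir(CACHE_DIR):
        path = os.path.join(CACHE_DIR, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
```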
Options | Use |
---|---|
--days\|-d DAYS | Set number of caching days (0=don't use cache) |
--force\|-f | Force fetching URLs instead of using the cache |
--from | Load repositories when forked_from is blank |
--json\|-j | Switch to JSON output instead of plain text |
--prune\|-p | Prune cache items older than DAYS and the cache index |
--topics | Load repositories when there are missing topics |
--xml\|-x | Switch to XML output instead of plain text |
--debug | Enable debug mode |
--help\|-? | Print usage and a short help message and exit |
--version | Print version and exit |
-- | Options processing terminator |
The LOCALAPPDATA and TMP environment variables under Windows, and the HOME, TMPDIR and TMP environment variables under other operating systems, can influence the caching directory used.
The libgh utility will attempt to maintain a caching directory for the web requests it makes.
This directory will be located in one of the following places:
Unix:
- `${HOME}/.cache/libgh`
- `${TMPDIR}/.cache/libgh`
- `${TMP}/.cache/libgh`

Windows:
- `%LOCALAPPDATA%\cache\libgh`
- `%TMP%\cache\libgh`
An index.txt file maintains the correspondence between URLs and cached files.
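As an illustration of the selection logic above, here is a minimal Python sketch following the documented fallback order (the function name and details are assumptions, not the utility's actual code):

```python
import os

def cache_directory():
    """Return the first usable caching directory, in the documented order."""
    if os.name == "nt":  # Windows
        candidates = [("LOCALAPPDATA", "cache"), ("TMP", "cache")]
    else:  # Unix and other operating systems
        candidates = [("HOME", ".cache"), ("TMPDIR", ".cache"), ("TMP", ".cache")]
    for variable, subdir in candidates:
        base = os.environ.get(variable)
        if base:
            return os.path.join(base, subdir, "libgh")
    return None
```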
The libgh utility exits 0 on success, and >0 if an error occurs.
To extract data from a personal GitHub account named HubTou in all possible output formats, do:
$ lgh --debug HubTou > libgh.txt
$ lgh --debug --json HubTou > libgh.json
$ lgh --debug --xml HubTou > libgh.xml
Results for this example are available there:
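The JSON output is convenient for post-processing with standard tools. For instance, a few lines of Python are enough to inspect the top-level structure of the scraped data (the exact schema is not described here):

```python
import json

with open("libgh.json", encoding="utf-8") as file:
    data = json.load(file)

# Show the top-level keys of the scraped account data
print(list(data) if isinstance(data, dict) else type(data).__name__)
```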
The libgh utility is not a standard UNIX command.
This implementation tries to follow the PEP 8 style guide for Python code.
It remains to be tested under Windows.
This implementation was made for the PNU project.
It's intended as the scraping engine for my topgh tool.
It is available under the 3-clause BSD license.
Some information is not available in unauthenticated mode and the rate limits per hour are quite low, but this should be fine for most uses.