This is a Python CLI tool for comprehensive GitHub data backup. The architecture follows a single-module design with clear separation of concerns across functional areas.
- `bin/github-backup`: CLI entry point that orchestrates the backup workflow
- `github_backup/github_backup.py`: Single module containing all core functionality (1400+ lines)
- `github_backup/__init__.py`: Version tracking only
- Parse & Authenticate → `parse_args()` → `get_auth()` → `get_authenticated_user()`
- Discover → `retrieve_repositories()` → `filter_repositories()`
- Backup → `backup_repositories()` + `backup_account()`
- Uses `retrieve_data_gen()` for paginated API calls with automatic rate limiting
- Template-based URL construction: `"https://{host}/repos/{owner}/{name}/issues"`
- Built-in retry logic for 502 errors and incomplete reads
- Supports both classic tokens (`-t`) and fine-grained tokens (`-f`)
# Supports multiple auth methods in get_auth():
# - Fine-grained tokens (github_pat_...)
# - Classic tokens with x-oauth-basic
# - Basic username/password
# - OSX Keychain integration
# - GitHub App authentication (--as-app)

- Time-based: `--incremental` uses the API `since` parameter with the last backup timestamp
- File-based: `--incremental-by-files` compares filesystem modification times
- State stored in the `{output_dir}/last_update` file
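The time-based mode can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the helper names `read_last_update`, `write_last_update`, and `build_query`, and the ISO 8601 timestamp format, are assumptions. Only the `last_update` file name and the `since` parameter come from the description above.

```python
import os
import tempfile
from datetime import datetime, timezone


def read_last_update(output_dir):
    """Return the stored timestamp string, or None if no backup has run yet."""
    path = os.path.join(output_dir, "last_update")
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return f.read().strip() or None


def write_last_update(output_dir, when):
    """Persist a run timestamp (format assumed: ISO 8601, UTC)."""
    path = os.path.join(output_dir, "last_update")
    with open(path, "w", encoding="utf-8") as f:
        f.write(when.strftime("%Y-%m-%dT%H:%M:%SZ"))


def build_query(since):
    """Only send `since` when a previous timestamp exists."""
    return {"since": since} if since else {}


with tempfile.TemporaryDirectory() as out:
    first = build_query(read_last_update(out))   # first run: no filter
    write_last_update(out, datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc))
    second = build_query(read_last_update(out))  # later run: filtered by time
```

In a sketch like this, the first run fetches everything and later runs ask the API only for items changed since the stored timestamp.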
- Uses the `logging_subprocess()` wrapper for all git operations
- Supports both regular clones and bare/mirror clones (`--bare` → `git clone --mirror`)
- SSH vs HTTPS preference via the `--prefer-ssh` flag
- LFS support with `git lfs fetch --all --prune`
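As a rough illustration of how these flags might map onto git invocations, the sketch below builds the argument lists without running them. The function names `build_clone_command` and `pick_remote`, their parameters, and the example URLs are hypothetical; the real tool runs its commands through `logging_subprocess()`.

```python
def build_clone_command(remote_url, local_dir, bare=False):
    """Build the clone argument list; --bare maps to `git clone --mirror`."""
    cmd = ["git", "clone"]
    if bare:
        cmd.append("--mirror")
    cmd += [remote_url, local_dir]
    return cmd


def pick_remote(ssh_url, https_url, prefer_ssh=False):
    """Choose the clone URL according to the --prefer-ssh flag."""
    return ssh_url if prefer_ssh else https_url


# LFS objects are fetched in a separate step after cloning.
LFS_FETCH = ["git", "lfs", "fetch", "--all", "--prune"]

url = pick_remote(
    "git@github.com:octocat/hello.git",
    "https://github.com/octocat/hello.git",
    prefer_ssh=True,
)
mirror_cmd = build_clone_command(url, "repositories/hello/repository", bare=True)
```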
{output_dir}/
├── repositories/{repo_name}/repository/ # Git clones
├── starred/{owner}/{repo_name}/ # Starred repos
├── gists/{gist_id}/ # User gists
├── account/{starred,followers,following}.json
└── {repo}/issues/{number}.json # Per-repo data
# No unit tests exist - this is acknowledged in README
pip install flake8
flake8 --ignore=E501,E203,W503 # Same as CI

docker run --rm -v /path/to/backup:/data --name github-backup \
    ghcr.io/josegonzalez/python-github-backup -o /data $OPTIONS $USER

- Automated via GitHub Actions (`automatic-release.yml`, `tagged-release.yml`)
- Version bumping in `github_backup/__init__.py`
- Docker image publishing to ghcr.io
- Automatic throttling based on the `x-ratelimit-remaining` header
- Custom throttling via `--throttle-limit` and `--throttle-pause`
- Exponential backoff for 403 rate limit responses
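The throttling rules above can be sketched as two small pure functions. This is an illustration under assumptions, not the tool's implementation: the function names, the default values, and the exact meaning given to the limit (pause once the remaining quota drops to or below it) are assumptions made for the example.

```python
def throttle_delay(headers, throttle_limit=0, throttle_pause=30.0):
    """Seconds to wait before the next request (assumed semantics)."""
    remaining = headers.get("x-ratelimit-remaining")
    if remaining is None or not throttle_limit:
        return 0.0  # no header or no limit configured: do not throttle
    if int(remaining) <= throttle_limit:
        return throttle_pause
    return 0.0


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff: base * 2**attempt seconds, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


low_quota = throttle_delay({"x-ratelimit-remaining": "10"}, 50, 5.0)
high_quota = throttle_delay({"x-ratelimit-remaining": "4000"}, 50, 5.0)
delays = [backoff_delay(n) for n in range(4)]
```

Keeping the delay calculation separate from the actual `time.sleep()` call makes the policy easy to check without waiting.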
- Graceful degradation for missing data (404s logged but don't block)
- Blocking errors (403 auth failures) exit entirely
- Incomplete reads get 3 retry attempts with 5-second delays
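The retry rule for incomplete reads could look something like the sketch below. The helper name `read_with_retry` and its parameters are hypothetical, and the sketch treats 3 as the total number of attempts, which is an assumption about how the count is applied.

```python
import time
from http.client import IncompleteRead


def read_with_retry(read, attempts=3, delay=5.0, sleep=time.sleep):
    """Call read(); on IncompleteRead, wait `delay` seconds and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return read()
        except IncompleteRead:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(delay)


# Demonstration with a stand-in reader that fails twice, then succeeds.
calls = {"n": 0}
pauses = []


def flaky_read():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IncompleteRead(b"partial")
    return b"ok"


result = read_with_retry(flaky_read, sleep=pauses.append)
```

Passing `sleep` as a parameter lets the demonstration record the pauses instead of actually waiting.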
- Atomic writes via `.temp` files, then `os.rename()`
- UTF-8 encoding with `codecs.open()` for JSON files
- JSON formatting: `ensure_ascii=False, sort_keys=True, indent=4`
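Put together, the write pattern might look like the following minimal sketch. The function name `write_json_atomically` and the choice to append `.temp` to the final path are assumptions; the calls to `codecs.open()`, `json.dump()`, and `os.rename()` use the options listed above.

```python
import codecs
import json
import os
import tempfile


def write_json_atomically(data, path):
    """Write to `<path>.temp` first, then rename it onto the final path."""
    temp_path = path + ".temp"
    with codecs.open(temp_path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, sort_keys=True, indent=4)
    os.rename(temp_path, path)


with tempfile.TemporaryDirectory() as out:
    target = os.path.join(out, "1.json")
    write_json_atomically({"title": "Größe", "number": 1}, target)
    with codecs.open(target, "r", encoding="utf-8") as f:
        loaded = json.load(f)
    leftover = os.path.exists(target + ".temp")
```

A reader never sees a half-written JSON file, because the final name only appears once the content is complete. Note that `os.rename()` replaces an existing destination on POSIX systems but raises an error on Windows if the destination exists.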
- `--all` doesn't include everything: Missing private repos, forks, starred repos, LFS, gists
- `--bare` is actually `--mirror`: Uses `git clone --mirror`, not `git clone --bare`
- Starred gists: Stored in same directory as user gists, not separately
- Incremental risks: Failed runs can cause missing data in subsequent incremental backups
- Authentication scope: Fine-grained tokens need specific repository and user permissions
When adding new backup types, follow the pattern:
- Add a CLI argument in `parse_args()`
- Create a `backup_*()` function following existing patterns
- Call it from `backup_repositories()` or `backup_account()`
- Use `retrieve_data()` for API calls and `mkdir_p()` for directories
- Follow the atomic file writing pattern with `.temp` files