Python package that interfaces with the Internet Archive's Wayback Machine APIs. Archive pages and retrieve archived pages easily.
Project description
A Python package & CLI tool that interfaces with the Wayback Machine API
Introduction
Waybackpy is a Python package and a CLI tool that interfaces with the Wayback Machine APIs.
Wayback Machine has 3 client side APIs.
- SavePageNow or Save API
- CDX Server API
- Availability API
These three APIs can be accessed via the waybackpy either by importing it from a python file/module or from the command-line interface.
Installation
Using pip, from PyPI (recommended):
pip install waybackpy
Using conda, from conda-forge (recommended):
See also waybackpy feedstock, maintainers are @rafaelrdealmeida, @labriunesp and @akamhy.
conda install -c conda-forge waybackpy
Install directly from this git repository (NOT recommended):
pip install git+https://github.com/akamhy/waybackpy.git
Docker Image
Docker Hub: hub.docker.com/r/secsi/waybackpy
Docker image is automatically updated on every release by Regulary and Automatically Updated Docker Images (RAUDI).
RAUDI is a tool by SecSI, an Italian cybersecurity startup.
Usage
As a Python package
Save API aka SavePageNow
>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://github.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
https://web.archive.org/web/20220118125249/https://github.com/
>>> save_api.cached_save
False
>>> save_api.timestamp()
datetime.datetime(2022, 1, 18, 12, 52, 49)
CDX API aka CDXServerAPI
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://google.com"
>>> user_agent = "my new app's user agent"
>>> cdx_api = WaybackMachineCDXServerAPI(url, user_agent)
oldest
>>> cdx_api.oldest()
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest = cdx_api.oldest()
>>> oldest
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest.archive_url
'https://web.archive.org/web/19981111184551/http://google.com:80/'
>>> oldest.original
'http://google.com:80/'
>>> oldest.urlkey
'com,google)/'
>>> oldest.timestamp
'19981111184551'
>>> oldest.datetime_timestamp
datetime.datetime(1998, 11, 11, 18, 45, 51)
>>> oldest.statuscode
'200'
>>> oldest.mimetype
'text/html'
newest
>>> newest = cdx_api.newest()
>>> newest
com,google)/ 20220217234427 http://@google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 563
>>> newest.archive_url
'https://web.archive.org/web/20220217234427/http://@google.com/'
>>> newest.timestamp
'20220217234427'
near
>>> near = cdx_api.near(year=2010, month=10, day=10, hour=10, minute=10)
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.timestamp
'20101010101435'
>>> near.timestamp
'20101010101435'
>>> near = cdx_api.near(wayback_machine_timestamp=2008080808)
>>> near.archive_url
'https://web.archive.org/web/20080808051143/http://google.com/'
>>> near = cdx_api.near(unix_timestamp=1286705410)
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>>
snapshots
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://pypi.org"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=2016, end_timestamp=2017)
>>> for item in cdx.snapshots():
... print(item.archive_url)
...
https://web.archive.org/web/20160110011047/http://pypi.org/
https://web.archive.org/web/20160305104847/http://pypi.org/
.
. # URLS REDACTED FOR READABILITY
.
https://web.archive.org/web/20171127171549/https://pypi.org/
https://web.archive.org/web/20171206002737/http://pypi.org:80/
Availability API
It is recommended to not use the availability API due to performance issues. All the methods of availability API interface class, WaybackMachineAvailabilityAPI
, are also implemented in the CDX server API interface class, WaybackMachineCDXServerAPI
. Also note
that the newest()
method of WaybackMachineAvailabilityAPI
can be more recent than WaybackMachineCDXServerAPI
's same method.
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>>
>>> url = "https://google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
oldest
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
newest
>>> availability_api.newest()
https://web.archive.org/web/20220118150444/https://www.google.com/
near
>>> availability_api.near(year=2010, month=10, day=10, hour=10)
https://web.archive.org/web/20101010101708/http://www.google.com/
Documentation is at https://github.com/akamhy/waybackpy/wiki/Python-package-docs.
As a CLI tool
Demo video on asciinema.org, you can copy the text from video:
CLI documentation is at https://github.com/akamhy/waybackpy/wiki/CLI-docs.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file waybackpy-3.0.6.tar.gz
.
File metadata
- Download URL: waybackpy-3.0.6.tar.gz
- Upload date:
- Size: 29.9 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.10.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 497a371756aba7644eb7ada0ebd4edb15cb8c53bc134cc973bf023a12caff83f |
|
MD5 | a724cf6e2c5b20fde24173301d63aaab |
|
BLAKE2b-256 | 34ab90085feb81e7fad7d00c736f98e74ec315159ebef2180a77c85a06b2f0aa |
File details
Details for the file waybackpy-3.0.6-py3-none-any.whl
.
File metadata
- Download URL: waybackpy-3.0.6-py3-none-any.whl
- Upload date:
- Size: 34.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.10.2
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | c568b0db9056fbe42a1a7e56b4f1d1919bd3f76bd62da58d9ee2e577297be284 |
|
MD5 | 932d2f92943b36703493ee4b95a0b666 |
|
BLAKE2b-256 | 1055573692440ce73f08200f4b6d3e193fa6426d1cb460912a472b8c3db137e2 |