
Python package that interfaces with the Internet Archive's Wayback Machine APIs. Archive pages and retrieve archived pages easily.

Project description


A Python package & CLI tool that interfaces with the Wayback Machine API



Introduction

Waybackpy is a Python package and a CLI tool that interfaces with the Wayback Machine APIs.

The Wayback Machine has three client-side APIs:

  • SavePageNow or Save API
  • CDX Server API
  • Availability API

All three APIs can be accessed through waybackpy, either by importing it in a Python file/module or from the command-line interface.

Installation

Using pip, from PyPI (recommended):

pip install waybackpy

Using conda, from conda-forge (recommended):

See also the waybackpy feedstock; its maintainers are @rafaelrdealmeida, @labriunesp and @akamhy.

conda install -c conda-forge waybackpy

Install directly from this git repository (NOT recommended):

pip install git+https://github.com/akamhy/waybackpy.git

Docker Image

Docker Hub: hub.docker.com/r/secsi/waybackpy

The Docker image is automatically updated on every release by Regularly and Automatically Updated Docker Images (RAUDI).

RAUDI is a tool by SecSI, an Italian cybersecurity startup.

Usage

As a Python package

Save API aka SavePageNow

>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://github.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
https://web.archive.org/web/20220118125249/https://github.com/
>>> save_api.cached_save
False
>>> save_api.timestamp()
datetime.datetime(2022, 1, 18, 12, 52, 49)
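The session above reads three things from the save object: the return value of save(), the cached_save attribute, and the result of timestamp(). A minimal sketch of collecting them in one place follows. The names SaveResult and save_and_summarize are hypothetical helpers, not part of waybackpy, and the function accepts any object that exposes those three members, so no network request is made until a real WaybackMachineSaveAPI instance is passed in.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass
class SaveResult:
    """Hypothetical container for the values shown in the session above."""

    archive_url: str
    cached: bool
    saved_at: datetime


def save_and_summarize(save_api: Any) -> SaveResult:
    """Call save() once and gather the archive URL, cache flag and timestamp.

    `save_api` is assumed to behave like WaybackMachineSaveAPI: a save()
    method returning the archive URL, a cached_save attribute, and a
    timestamp() method returning a datetime. Only those three names are
    taken from the README; everything else here is illustrative.
    """
    archive_url = save_api.save()  # triggers the actual save request
    return SaveResult(
        archive_url=str(archive_url),
        cached=bool(save_api.cached_save),
        saved_at=save_api.timestamp(),
    )
```

With waybackpy installed, save_and_summarize(WaybackMachineSaveAPI(url, user_agent)) would perform a real save and return the same three values in a single object.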

CDX API aka CDXServerAPI

>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://google.com"
>>> user_agent = "my new app's user agent"
>>> cdx_api = WaybackMachineCDXServerAPI(url, user_agent)
oldest
>>> cdx_api.oldest()
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest = cdx_api.oldest()
>>> oldest
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest.archive_url
'https://web.archive.org/web/19981111184551/http://google.com:80/'
>>> oldest.original
'http://google.com:80/'
>>> oldest.urlkey
'com,google)/'
>>> oldest.timestamp
'19981111184551'
>>> oldest.datetime_timestamp
datetime.datetime(1998, 11, 11, 18, 45, 51)
>>> oldest.statuscode
'200'
>>> oldest.mimetype
'text/html'
newest
>>> newest = cdx_api.newest()
>>> newest
com,google)/ 20220217234427 http://@google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 563
>>> newest.archive_url
'https://web.archive.org/web/20220217234427/http://@google.com/'
>>> newest.timestamp
'20220217234427'
near
>>> near = cdx_api.near(year=2010, month=10, day=10, hour=10, minute=10)
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.timestamp
'20101010101435'
>>> near = cdx_api.near(wayback_machine_timestamp=2008080808)
>>> near.archive_url
'https://web.archive.org/web/20080808051143/http://google.com/'
>>> near = cdx_api.near(unix_timestamp=1286705410)
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>> 
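The near calls above name the target time in three ways: as date components, as a Wayback Machine timestamp, and as a Unix timestamp. The sketch below shows how the other two forms map onto the 14-digit Wayback Machine format using only the standard library. The helper names are hypothetical, and treating the Unix timestamp as UTC is an assumption about how waybackpy interprets it.

```python
from datetime import datetime, timezone

# The 14-digit layout used in archive URLs: YYYYMMDDhhmmss.
WAYBACK_FORMAT = "%Y%m%d%H%M%S"


def from_components(year: int, month: int = 1, day: int = 1,
                    hour: int = 0, minute: int = 0) -> str:
    """Build a 14-digit timestamp from date components (seconds are zero)."""
    return datetime(year, month, day, hour, minute).strftime(WAYBACK_FORMAT)


def from_unix(unix_timestamp: int) -> str:
    """Build a 14-digit timestamp from a Unix timestamp, assumed to be UTC."""
    moment = datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)
    return moment.strftime(WAYBACK_FORMAT)


print(from_components(2010, 10, 10, 10, 10))  # 20101010101000
print(from_unix(1286705410))                  # 20101010101010
```

Both values fall within ten seconds of each other, which is consistent with the two corresponding near calls returning the same snapshot, 20101010101435.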
snapshots
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://pypi.org"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=2016, end_timestamp=2017)
>>> for item in cdx.snapshots():
...     print(item.archive_url)
...
https://web.archive.org/web/20160110011047/http://pypi.org/
https://web.archive.org/web/20160305104847/http://pypi.org/
.
. # URLS REDACTED FOR READABILITY
.
https://web.archive.org/web/20171127171549/https://pypi.org/
https://web.archive.org/web/20171206002737/http://pypi.org:80/
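Each snapshot printed above is a single space-separated CDX record, and the attributes shown (urlkey, timestamp, original, mimetype, statuscode, archive_url, datetime_timestamp) can be derived from that line. The following is a minimal standard-library sketch of that relationship, not waybackpy's own implementation. CdxRecord and parse_cdx_line are hypothetical names, and the last two fields are labelled digest and length on the assumption that they follow the usual CDX server field order.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CdxRecord:
    """Hypothetical stand-in for one snapshot line from the CDX server."""

    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str   # assumed name for the sixth field
    length: str   # assumed name for the seventh field

    @property
    def datetime_timestamp(self) -> datetime:
        # The timestamp is a 14-digit string: YYYYMMDDhhmmss.
        return datetime.strptime(self.timestamp, "%Y%m%d%H%M%S")

    @property
    def archive_url(self) -> str:
        # Archive URLs in the README have the form /web/<timestamp>/<original>.
        return f"https://web.archive.org/web/{self.timestamp}/{self.original}"


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one space-separated CDX line into its seven fields."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}: {line!r}")
    return CdxRecord(*fields)


record = parse_cdx_line(
    "com,google)/ 19981111184551 http://google.com:80/ text/html 200 "
    "HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381"
)
print(record.archive_url)
print(record.datetime_timestamp)
```

Run on the oldest google.com record shown earlier, this should reproduce the archive_url and datetime_timestamp values from that session.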

Availability API

Using the Availability API is not recommended because of performance issues. All methods of its interface class, WaybackMachineAvailabilityAPI, are also implemented in the CDX Server API interface class, WaybackMachineCDXServerAPI. Note, however, that the snapshot returned by the newest() method of WaybackMachineAvailabilityAPI can be more recent than the one returned by the same method of WaybackMachineCDXServerAPI.

>>> from waybackpy import WaybackMachineAvailabilityAPI
>>>
>>> url = "https://google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
oldest
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
newest
>>> availability_api.newest()
https://web.archive.org/web/20220118150444/https://www.google.com/
near
>>> availability_api.near(year=2010, month=10, day=10, hour=10)
https://web.archive.org/web/20101010101708/http://www.google.com/
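For context, the methods above wrap a small JSON endpoint. The sketch below builds a request URL for it and extracts the closest snapshot from a decoded response, without making any network call. The endpoint address and the archived_snapshots/closest response layout reflect the public Availability API as generally documented and are assumptions here, since the README does not spell them out. Both function names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed address of the public Availability API endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_availability_url(url: str, timestamp: Optional[str] = None) -> str:
    """Return the request URL for `url`, optionally near a given timestamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_archive_url(payload: dict) -> Optional[str]:
    """Pick the closest snapshot URL out of a decoded JSON response.

    Returns None when the response lists no available snapshot.
    """
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("url")


print(build_availability_url("google.com", "2010101010"))
```

Fetching the printed URL and passing the decoded JSON to closest_archive_url would play roughly the role of availability_api.near(...) above. An empty archived_snapshots object yields None.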

Documentation is at https://github.com/akamhy/waybackpy/wiki/Python-package-docs.

As a CLI tool

A demo video is available on asciinema.org; the text shown in the video can be copied.

CLI documentation is at https://github.com/akamhy/waybackpy/wiki/CLI-docs.



Download files


Source Distribution

waybackpy-3.0.6.tar.gz (29.9 kB)

Uploaded Source

Built Distribution

waybackpy-3.0.6-py3-none-any.whl (34.9 kB)

Uploaded Python 3

File details

Details for the file waybackpy-3.0.6.tar.gz.

File metadata

  • Download URL: waybackpy-3.0.6.tar.gz
  • Upload date:
  • Size: 29.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.10.2

File hashes

Hashes for waybackpy-3.0.6.tar.gz
Algorithm Hash digest
SHA256 497a371756aba7644eb7ada0ebd4edb15cb8c53bc134cc973bf023a12caff83f
MD5 a724cf6e2c5b20fde24173301d63aaab
BLAKE2b-256 34ab90085feb81e7fad7d00c736f98e74ec315159ebef2180a77c85a06b2f0aa


File details

Details for the file waybackpy-3.0.6-py3-none-any.whl.

File metadata

  • Download URL: waybackpy-3.0.6-py3-none-any.whl
  • Upload date:
  • Size: 34.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.8.0 pkginfo/1.8.2 readme-renderer/34.0 requests/2.27.1 requests-toolbelt/0.9.1 urllib3/1.26.8 tqdm/4.63.0 importlib-metadata/4.11.3 keyring/23.5.0 rfc3986/2.0.0 colorama/0.4.4 CPython/3.10.2

File hashes

Hashes for waybackpy-3.0.6-py3-none-any.whl
Algorithm Hash digest
SHA256 c568b0db9056fbe42a1a7e56b4f1d1919bd3f76bd62da58d9ee2e577297be284
MD5 932d2f92943b36703493ee4b95a0b666
BLAKE2b-256 1055573692440ce73f08200f4b6d3e193fa6426d1cb460912a472b8c3db137e2
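The SHA256 digests listed for each file can be used to check a download before installing it. A minimal sketch using only the standard library follows. The function names are hypothetical, and the local file path is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path
from typing import Union


def sha256_of_file(path: Union[str, Path], chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Union[str, Path], expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, matches_published_digest("waybackpy-3.0.6.tar.gz", "497a371756aba7644eb7ada0ebd4edb15cb8c53bc134cc973bf023a12caff83f") would be expected to return True for an intact copy of the source distribution.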

