twarc v1 collected a set of utilities for working with tweet JSON in the utils directory of the git repository. This was a handy way to develop and share snippets of code, but it had drawbacks. The utilities had different dependencies that weren't managed in a uniform way, and some had slightly different interfaces. They also had to be downloaded from GitHub manually and weren't easily accessible at the command line unless you remembered where you put them.
With twarc2 these utilities are now installable as plugins, which are made available as subcommands using the same twarc2 command line. Plugins are published separately from twarc on PyPI and are installed with pip. Here is a list of some known plugins (if you write one please let us know so we can add it to this list):
- twarc-ids: a simple example of printing the ids for tweets to use as a reference for creating plugins
- twarc-csv: export tweets to CSV, which is probably the first thing a researcher will want to do
- twarc-videos: extract videos from tweets
- twarc-network: visualize tweets and users as a network graph
- twarc-timeline-archive: routinely download tweet timelines for a list of users
- twarc-hashtags: create a report of hashtags that are used in collected tweet data
- Write your own, and let us know so we can add it here!
The twarc-ids plugin provides an example of how to write plugins. This reference plugin simply reads collected tweet JSON data and writes out the tweet identifiers. First you install the plugin:
pip install twarc-ids
and then you use it:
twarc2 ids tweets.json > ids.txt
Internally twarc's command line is implemented using the click library. The click-plugins module is what manages twarc2 plugins. Basically you import click and implement your plugin as you would any other click utility, for example:
import json
import click

@click.command()
@click.argument('infile', type=click.File('r'), default='-')
@click.argument('outfile', type=click.File('w'), default='-')
def ids(infile, outfile):
    """
    Extract tweet ids from tweet JSON.
    """
    # each input line is expected to hold one JSON document
    for line in infile:
        tweet = json.loads(line)
        click.echo(tweet['data']['id'], file=outfile)
Note that the plugin reads from an input file infile and writes to an output file outfile, which default to stdin and stdout respectively. This allows plugin utilities to be used as part of pipelines. If your plugin needs options, you can add them using the standard facilities that click provides.
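As a rough illustration of adding an option, here is a sketch of the same command with a hypothetical --limit flag. The flag name and its behavior are assumptions made for this example and are not part of the real twarc-ids plugin; only standard click decorators are used.

```python
import json

import click


@click.command()
@click.option('--limit', type=int, default=0,
              help='Stop after this many ids (0 means no limit).')
@click.argument('infile', type=click.File('r'), default='-')
@click.argument('outfile', type=click.File('w'), default='-')
def ids(limit, infile, outfile):
    """
    Extract tweet ids from tweet JSON, optionally stopping early.
    """
    count = 0
    for line in infile:
        tweet = json.loads(line)
        click.echo(tweet['data']['id'], file=outfile)
        count += 1
        # --limit is a hypothetical option added for illustration only
        if limit and count >= limit:
            break
```

Because infile and outfile still default to stdin and stdout, a command like this should keep working in the middle of a pipeline; the option only changes how much of the input is consumed.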
If your plugin needs to talk to the Twitter API then just add the @click.pass_obj decorator, which will ensure that the first parameter in your function will be a Twarc2 client that is configured to use the client's keys.
import click

@click.command()
@click.argument('infile', type=click.File('r'), default='-')
@click.argument('outfile', type=click.File('w'), default='-')
@click.pass_obj
def ids(twarc_client, infile, outfile):
    # do something with the twarc client here
    pass
Finally you just need to create a setup.py file for your project that looks something like this:
import setuptools

setuptools.setup(
    name='twarc-ids',
    version='0.0.1',
    url='https://github.com/docnow/twarc-ids',
    author='Ed Summers',
    author_email='[email protected]',
    py_modules=['twarc_ids'],
    description='A twarc plugin to read Twitter data and output the tweet ids',
    install_requires=['twarc'],
    setup_requires=['pytest-runner'],
    tests_require=['pytest'],
    entry_points='''
        [twarc.plugins]
        ids=twarc_ids:ids
    '''
)
The key part here is the entry_points section, which is what allows twarc2 to discover twarc.plugins dynamically at runtime, and also defines how the subcommand maps to the plugin's function.
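To check what is registered under the twarc.plugins group in your environment, something like the following sketch may help. It uses only the standard library's importlib.metadata and is not how twarc2 loads plugins (that is handled by click-plugins); the find_plugins helper is a hypothetical name introduced here.

```python
from importlib.metadata import entry_points


def find_plugins(group='twarc.plugins'):
    """Return a dict mapping entry point names to 'module:attr' strings."""
    eps = entry_points()
    if hasattr(eps, 'select'):
        # Python 3.10 and later expose a select() method
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group name
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == '__main__':
    # Print one line per installed plugin, if any are installed
    for name, target in sorted(find_plugins().items()):
        print(f'{name} -> {target}')
```

With twarc-ids installed, the mapping would be expected to include an ids entry pointing at twarc_ids:ids, matching the entry_points section in setup.py.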
It's good practice to include a test or two for your plugin to ensure it works over time. Check out the example here for how to test command line utilities easily with click.
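As a minimal sketch of what such a test might look like, the following uses click's CliRunner to feed input to a command and inspect its output. The ids command is repeated here only to keep the example self-contained, and the sample tweets are made up; in a real project you would import the command from your plugin module instead.

```python
import json

import click
from click.testing import CliRunner


@click.command()
@click.argument('infile', type=click.File('r'), default='-')
@click.argument('outfile', type=click.File('w'), default='-')
def ids(infile, outfile):
    """Extract tweet ids from tweet JSON (stand-in for the real plugin)."""
    for line in infile:
        tweet = json.loads(line)
        click.echo(tweet['data']['id'], file=outfile)


def test_ids():
    # Two made-up tweet objects, one JSON document per line
    sample = '{"data": {"id": "100"}}\n{"data": {"id": "200"}}\n'
    runner = CliRunner()
    # No arguments are given, so the command reads stdin and writes stdout
    result = runner.invoke(ids, input=sample)
    assert result.exit_code == 0
    assert result.output.splitlines() == ['100', '200']
```

Saved as a test file in your project, a function like test_ids can be picked up by pytest, which the setup.py above already lists under tests_require.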
To publish your plugin on PyPI:
pip install twine
python setup.py sdist
twine upload dist/*
# enter pypi login details