Please see client2 docs for the full list of available functions. Here are some minimal working snippets of code that use twarc2 as a library.
The client mirrors the API as closely as possible: if the API docs expect a parameter in a certain format, so does the twarc2 library.
import datetime
from twarc.client2 import Twarc2
from twarc.expansions import ensure_flattened
# Your bearer token here
t = Twarc2(bearer_token="A...z")
# Start and end times must be in UTC
start_time = datetime.datetime(2021, 3, 21, 0, 0, 0, 0, datetime.timezone.utc)
end_time = datetime.datetime(2021, 3, 22, 0, 0, 0, 0, datetime.timezone.utc)
# search_results is a generator, max_results is max tweets per page, 100 max for full archive search with all expansions.
search_results = t.search_all(query="dogs lang:en -is:retweet", start_time=start_time, end_time=end_time, max_results=100)
# Get all results page by page:
for page in search_results:
    # Do something with the whole page of results:
    # print(page)
    # or alternatively, "flatten" results returning 1 tweet at a time, with expansions inline:
    for tweet in ensure_flattened(page):
        # Do something with the tweet
        print(tweet)
    # Stop iteration prematurely, to only get 1 page of results.
    break
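Because the start and end times must be in UTC, it can help to convert local times explicitly rather than rely on naive datetimes. The following is a minimal sketch using only the standard library. The helper name `to_utc` is hypothetical and not part of twarc, and the fixed UTC-5 offset is an illustrative assumption.

```python
from datetime import datetime, timedelta, timezone


def to_utc(dt: datetime) -> datetime:
    """Return dt expressed in UTC; reject naive datetimes to avoid silent guesses."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return dt.astimezone(timezone.utc)


# Illustrative fixed offset (UTC-5); a real script might use zoneinfo instead.
local_tz = timezone(timedelta(hours=-5))
local_start = datetime(2021, 3, 20, 19, 0, 0, tzinfo=local_tz)

start_time = to_utc(local_start)
print(start_time.isoformat())  # 2021-03-21T00:00:00+00:00
```

The resulting value is timezone-aware and matches the `start_time` built directly with `datetime.timezone.utc` in the snippet above.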
Twarc will try to retrieve all available results and handle retries and rate limits for you. This can potentially retrieve more tweets than your monthly limit allows. The command line interface has a --limit option, but the library methods return generators, and it is up to you to stop iterating once you have retrieved enough results.
For example, to get at most 2 "pages" of followers per user:
from twarc.client2 import Twarc2
# Your bearer token here
t = Twarc2(bearer_token="A...z")
user_ids = [12, 2244994945, 4503599627370241] # @jack, @twitterdev, @overflow64
# Iterate over our target users
for user_id in user_ids:
    # Iterate over pages of followers
    for i, follower_page in enumerate(t.followers(user_id)):
        # Do something with the follower_page here
        print(f"Fetched a page of {len(follower_page['data'])} followers for {user_id}")
        if i == 1: # Only retrieve the first two pages (enumerate starts from 0)
            break
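The same cap can also be expressed with `itertools.islice`, which stops pulling from a generator after a fixed number of items. This is a sketch that assumes nothing about twarc beyond its methods returning generators of pages. The names `take_pages` and `fake_pages` are hypothetical; `fake_pages` stands in for a call such as `t.followers(user_id)` so the snippet runs without credentials.

```python
from itertools import islice
from typing import Iterable, Iterator


def take_pages(pages: Iterable[dict], max_pages: int) -> Iterator[dict]:
    """Yield at most max_pages pages, then stop consuming the source."""
    return islice(pages, max_pages)


def fake_pages() -> Iterator[dict]:
    """Stand-in for a twarc generator: yields pages forever."""
    n = 0
    while True:
        n += 1
        yield {"data": [{"id": str(n)}]}


fetched = list(take_pages(fake_pages(), 2))
print(len(fetched))  # 2
```

Since generators are lazy, the source is not asked for a third page once `islice` has yielded two.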
twarc-csv is an extra plugin you can install:
pip install twarc-csv
This can also be used as a library, for example:
from twarc_csv import CSVConverter
with open("results.jsonl", "r") as infile:
    with open("results.csv", "w") as outfile:
        converter = CSVConverter(infile=infile, outfile=outfile)
        converter.process()
This assumes results.jsonl already exists and contains 1 API response per line or 1 tweet per line. The other CSVConverter parameters and their defaults, apart from infile and outfile, are:
json_encode_all=False,
json_encode_lists=True,
json_encode_text=False,
inline_referenced_tweets=True,
allow_duplicates=False,
input_tweet_columns=True,
input_users_columns=False,
extra_input_columns="",
output_columns="",
batch_size=100
These correspond to the command line options: https://github.com/DocNow/twarc-csv#extra-command-line-options
The full list of valid output_columns is at https://github.com/DocNow/twarc-csv/blob/main/twarc_csv.py#L14-L106 when using input_tweet_columns=True, and at https://github.com/DocNow/twarc-csv/blob/main/twarc_csv.py#L111-L137 when using input_users_columns=True.
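Since the converter input is expected to hold one API response or one tweet per line, a quick check of the file can catch truncated or malformed lines before conversion. The following is a standard-library sketch; `check_jsonl` is a hypothetical helper, not part of twarc-csv.

```python
import json
import os
import tempfile


def check_jsonl(path: str) -> int:
    """Return the number of JSON objects in path; raise ValueError on a bad line."""
    count = 0
    with open(path, "r", encoding="utf-8") as infile:
        for line_number, line in enumerate(infile, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as error:
                raise ValueError(f"line {line_number} is not valid JSON") from error
            if not isinstance(record, dict):
                raise ValueError(f"line {line_number} is not a JSON object")
            count += 1
    return count


# Demonstration on a throwaway file with two well-formed lines.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "results.jsonl")
    with open(sample, "w", encoding="utf-8") as f:
        f.write('{"data": [{"id": "1"}]}\n{"id": "2"}\n')
    print(check_jsonl(sample))  # 2
```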
Here is a complete working example that searches for recent tweets from the last few hours, writes a dogs_results.jsonl file with the original responses, and then converts it to CSV:
import json
from datetime import datetime, timezone, timedelta
from twarc.client2 import Twarc2
from twarc_csv import CSVConverter
# Your bearer token here
t = Twarc2(bearer_token="A...z")
# Start and end times must be in UTC
start_time = datetime.now(timezone.utc) + timedelta(hours=-3)
# end_time cannot be immediately now, has to be at least 30 seconds ago.
end_time = datetime.now(timezone.utc) + timedelta(minutes=-1)
query = "dogs lang:en -is:retweet has:media"
print(f"Searching for \"{query}\" tweets from {start_time} to {end_time}...")
# search_results is a generator, max_results is max tweets per page, not total, 100 is max when using all expansions.
search_results = t.search_recent(query=query, start_time=start_time, end_time=end_time, max_results=100)
# Get all results page by page, opening the output file once so earlier pages are not overwritten:
with open("dogs_results.jsonl", "w") as f:
    for page in search_results:
        # Do something with the page of results:
        f.write(json.dumps(page) + "\n")
        print("Wrote a page of results...")
print("Converting to CSV...")
# This assumes `dogs_results.jsonl` is finished writing.
with open("dogs_results.jsonl", "r") as infile:
    with open("dogs_output.csv", "w") as outfile:
        converter = CSVConverter(
            infile,
            outfile,
            json_encode_all=False,
            json_encode_lists=True,
            json_encode_text=False,
            inline_referenced_tweets=True,
            allow_duplicates=False,
            batch_size=1000,
        )
        converter.process()
print("Finished.")
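The write step in the example can be factored into a small helper that opens the output file once and writes one page per line. This is a sketch: `write_pages` is a hypothetical name, and the list of dictionaries below stands in for the generator returned by `search_recent`, so the snippet runs without a bearer token.

```python
import json
import os
import tempfile
from typing import Iterable


def write_pages(pages: Iterable[dict], path: str) -> int:
    """Write each page as one JSON line to path; return the number of pages written."""
    written = 0
    # Open once, outside the loop, so every page ends up in the same file.
    with open(path, "w", encoding="utf-8") as outfile:
        for page in pages:
            outfile.write(json.dumps(page) + "\n")
            written += 1
    return written


# Stand-in pages; with twarc this would be the search_results generator.
pages = [{"data": [{"id": "1"}]}, {"data": [{"id": "2"}]}]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dogs_results.jsonl")
    count = write_pages(pages, path)
    with open(path, "r", encoding="utf-8") as infile:
        lines = infile.read().splitlines()
print(count, len(lines))  # 2 2
```

Returning the page count gives the caller a simple way to report progress or to notice an empty result set before starting the CSV conversion.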