In this folder, you can find the annotation framework and information about the data used in the resource paper: "TACO -- Twitter Arguments from COnversations".
The contents of this folder comprise all data that can be shared with the public according to Twitter's developer policy. This includes reduced versions of tweets that only contain their tweet_id. Additionally, we offer the dataset_statistics.ipynb file, which we used to generate our ground truth data and gain preliminary insights. Since we cannot release all data, such as the text of tweets, dataset_statistics.ipynb is provided for comparison purposes only. Running it requires the following files:
- data/unify.py: The script used to build our ground truth data.
- data/import: The conversations and expert decisions imported from external sources (*).
- data/backup_tweets.csv: The clear text of all tweets (*).
- data/url_dict.json: A mapping from shortened (tiny) URLs to their resolved original URLs (*).
(*): These files contain sensitive user data and must not be made public. Please contact us (see the contact details below) for access to the original data. For rehydrating tweets or obtaining your own conversations, we recommend using the Twitter API v2.
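As a minimal rehydration sketch, assuming the tweets lookup endpoint of the Twitter API v2 (the bearer token is a placeholder and the selected tweet.fields are just an example; deleted or protected tweets are simply missing from the response):

```python
import requests

BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder: your own Twitter API v2 credential

def rehydrate(tweet_ids):
    """Fetch full tweet objects for a list of tweet_ids via the v2 lookup endpoint."""
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    tweets = []
    for i in range(0, len(tweet_ids), 100):  # the endpoint accepts at most 100 ids per call
        params = {
            "ids": ",".join(tweet_ids[i:i + 100]),
            "tweet.fields": "conversation_id,created_at",
        }
        response = requests.get("https://api.twitter.com/2/tweets",
                                headers=headers, params=params)
        response.raise_for_status()
        tweets.extend(response.json().get("data", []))  # deleted tweets are omitted here
    return tweets
```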
---
To reconstruct the Twitter conversations, we include the conversations.csv file, which contains all conversations and comprises the following columns:
- tweet_id: The unique identifier of a tweet in Twitter's database.
- conversation_id: The tweet_id of the very first (top-level) tweet in the conversation.
- parent_id: The tweet_id of the parent tweet that the current tweet is replying to (*).
- topic: The conversation's topic that was assigned for sampling purposes.
(*): Some tweets have a tweet_id equal to their parent_id but a different conversation_id. These are dead-end tweets whose parent was deleted, although they are not top-level tweets.
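A short sketch of how the reply structure can be rebuilt from these columns (pandas is assumed; reading the ids as strings avoids precision loss on the 64-bit tweet ids):

```python
import pandas as pd

conversations = pd.read_csv("conversations.csv", dtype=str)  # keep ids as strings, not floats

# Direct replies of each tweet: parent_id -> list of child tweet_ids.
replies = conversations.groupby("parent_id")["tweet_id"].apply(list).to_dict()

# Dead-end tweets as described in (*): tweet_id equals parent_id,
# but the tweet is not the top-level tweet of its conversation.
dead_ends = conversations[
    (conversations["tweet_id"] == conversations["parent_id"])
    & (conversations["tweet_id"] != conversations["conversation_id"])
]
```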
---
To create the ground truth labels, we used the individual decisions of our six experts, which are stored in worker_decisions.csv, and obtained the ground truth via a hard majority vote, with the results stored in majority_votes.csv. The worker_decisions.csv file includes the following columns:
- tweet_id: The unique identifier of a tweet in Twitter's database.
- information: A binary value indicating the presence (1) or absence (0) of information in the tweet.
- inference: A binary value indicating the presence (1) or absence (0) of inference in the tweet.
- confidence: The annotator's self-reported task confidence, ranging from easy (1) to hard (3); not used in the paper (*).
- worker: The identifier of the annotator (A and E both belong to the first author).
- topic: The conversation's topic that was assigned for sampling purposes.
- phase: The phase in which the tweet was annotated.
(*): This confidence value is distinct from the confidence reported in majority_votes.csv and served solely as a self-monitoring variable (reminder) for the experts during the annotation process.
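A sketch for working with these columns (the authoritative class definitions live in annotation_framework.pdf; the combined label below is purely illustrative):

```python
import pandas as pd

decisions = pd.read_csv("worker_decisions.csv", dtype={"tweet_id": str})

# Illustrative only: collapse the two binary dimensions into one label per
# decision; the actual class names are defined in annotation_framework.pdf.
def combine(row):
    if row["inference"] and row["information"]:
        return "inference+information"
    if row["inference"]:
        return "inference-only"
    if row["information"]:
        return "information-only"
    return "neither"

decisions["label"] = decisions.apply(combine, axis=1)
print(decisions.groupby(["phase", "worker"])["label"].value_counts())
```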
---
Our six experts provided individual decisions across two annotation steps. In the first annotation step, experts A, B, C, and D annotated 600 conversation-starting tweets (300 randomly selected for each of #Abortion and #Brexit) to evaluate and refine the framework. This step comprised the following phases:
- training_1 - 2: Two successive training phases, each involving 100 tweets from #Abortion and #Brexit, followed by a debriefing session.
- extension_1 - 4: Four extension phases, each comprising 100 tweets, conducted after the annotators had completed their deliberation.
In the second annotation step, three additional annotators, namely A (E), F, and G, annotated the tweets of 200 conversations. To this end, 100 conversation-starting tweets from the first step (with perfect agreement among A-D) were randomly selected for training the new annotators on the tweets and their subsequent conversations. This was followed by another 100 conversations (started by 25 randomly selected conversation-starting tweets for each of #GOT, #SquidGame, #TwitterTakeover, and #LOTRROP). In total, the second annotation step comprised the following phases:
- training_3 - 4: Two training phases involving 100 conversation-starting tweets from the first annotation step (annotated with perfect agreement among A, B, C, and D) along with their entire conversations for #Abortion and #Brexit.
- extension_5 - 8: Four extension phases, each comprising 25 new conversations for one of #GOT, #SquidGame, #TwitterTakeover, and #LOTRROP.
The individual annotation phases (including the inter-annotator agreement) are detailed in dataset_statistics.ipynb.
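The notebook contains the actual agreement figures; as one possible way to recompute agreement per phase (assuming Fleiss' kappa over the tweets-by-workers matrix, shown here for the inference dimension):

```python
import pandas as pd
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

decisions = pd.read_csv("worker_decisions.csv", dtype={"tweet_id": str})

for phase, group in decisions.groupby("phase"):
    # tweets x workers matrix of binary decisions; drop tweets that were not
    # annotated by every worker active in this phase.
    matrix = group.pivot_table(index="tweet_id", columns="worker",
                               values="inference").dropna()
    table, _ = aggregate_raters(matrix.astype(int).to_numpy())
    print(phase, "inference kappa:", round(fleiss_kappa(table), 3))
```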
Once the annotation phases were complete, the ground truth labels were assigned using a hard majority vote (more than 50% of all experts had to agree on one class). The resulting ground truth data is saved in majority_votes.csv, which contains the following columns:
- tweet_id: A unique identifier for each tweet in Twitter's database.
- topic: The topic of the conversation that was assigned for sampling purposes.
- class: The class assigned to each tweet based on the majority vote of the annotators, as specified in annotation_framework.pdf.
- confidence: The fraction of annotators who voted for the final class. For instance, if annotators A, B, and C all voted for one class while annotator D voted for a different class, the confidence value would be 3/4 (see the sketch below).
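A sketch that reproduces this voting rule from worker_decisions.csv (the released majority_votes.csv remains the authoritative ground truth; the tuple label below is again illustrative, as the real class names are specified in annotation_framework.pdf):

```python
import pandas as pd

decisions = pd.read_csv("worker_decisions.csv", dtype={"tweet_id": str})
# Encode each decision as an (information, inference) pair standing in for the class.
decisions["label"] = list(zip(decisions["information"], decisions["inference"]))

def hard_majority(group):
    counts = group["label"].value_counts()
    top, votes = counts.index[0], int(counts.iloc[0])
    if votes * 2 <= len(group):  # no class backed by more than 50% of the experts
        return pd.Series({"class": None, "confidence": None})
    return pd.Series({"class": str(top), "confidence": votes / len(group)})

majority = decisions.groupby("tweet_id").apply(hard_majority).reset_index()
```

For the 3/4 example above, votes = 3 and len(group) = 4, giving a confidence of 0.75.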
TACO -- Twitter Arguments from Conversations by Marc Feger is licensed under CC BY-NC-SA 4.0
Please contact [email protected] or [email protected].
We thank Aylin Feger, Tillmann Junk, Andreas Burbach, Talha Caliskan, and Aaron Schneider for their contributions to the annotation process in this paper.