VTuber 1B is an NLP dataset for academic use, collecting over a billion live chats, super chats, and moderation events (bans and deletions) from virtual YouTubers' live streams.
Download the dataset from Kaggle Datasets and join the #vtuber-1b channel on the holodata Discord for discussions.
We also offer ❤️‍🩹 Sensai, a live chat dataset built specifically for training ML models for spam detection and toxic chat classification.
- Source: YouTube live chat events collected by our Honeybee cluster. Holodex is a stream index provider for Honeybee which covers Hololive, Nijisanji, 774inc, etc.
- Temporal Coverage:
  - Chats: from 2021-01-15
  - Super chats: from 2021-03-16
  - Super stickers: from 2022-01-20 (N/A yet)
  - Membership joining events: from 2021-10-18 (N/A yet)
  - Membership milestones: from 2021-10-20 (N/A yet)
  - Membership gifts: N/A
  - Placeholders: from 2022-01-21 (N/A yet)
- Update Frequency:
  - At least once every 6 months
- Toxic Chat Classification
- Spam Detection
- Demographic Visualization
- Superchat Analysis
- Training neural language models
See the Kaggle public notebooks (VTuber 1B / VTuber 1B Elements) for ideas, as well as the /notebooks folder in the repo.
We employed the Honeybee cluster to collect real-time live chat events across major VTubers' live streams. All sensitive data, such as author names and author profile images, is omitted from the dataset, and author channel ids are anonymized with the SHA-1 hashing algorithm with a grain of salt.
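As a rough illustration of salted SHA-1 anonymization, here is a minimal sketch. The salt value, the function name `anonymize_channel_id`, and the choice to prepend the salt to the id are all assumptions made for this example; the dataset's real salt is undisclosed and its exact scheme is not documented here.

```python
import hashlib

# Hypothetical salt for illustration only; the dataset's real salt is undisclosed.
EXAMPLE_SALT = "example-salt"


def anonymize_channel_id(channel_id: str, salt: str = EXAMPLE_SALT) -> str:
    """Return the hex SHA-1 digest of a salted channel id."""
    # Prepending the salt is an assumption, not the dataset's documented scheme.
    return hashlib.sha1((salt + channel_id).encode("utf-8")).hexdigest()


hashed = anonymize_channel_id("UC_example_channel_id")
print(len(hashed))  # 40 hex characters
```

Because the same input always maps to the same digest, an anonymized id can still be used to group chats by author without revealing the original channel id.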
Kaggle Datasets (2 MB)
VTuber 1B Elements is most suitable for statistical visualizations and exploratory data analysis.
filename | summary |
---|---|
channels.csv | Channel index |
chat_stats.csv | Chat statistics |
superchat_stats.csv | Super Chat statistics |
Kaggle Datasets (47 GB)
VTuber 1B is most suitable for frequency analysis. This edition includes only the essential columns, which reduces the dataset size and makes it faster to load in Kaggle Kernels.
filename | summary |
---|---|
chats_%Y-%m.parquet | Live chat events (> 1,000,000,000) |
superchats_%Y-%m.parquet | Super chat events (> 4,000,000) |
deletion_events.parquet | Deletion events |
ban_events.parquet | Ban events |
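Since chats and super chats are split into one file per month (`chats_%Y-%m.parquet`, `superchats_%Y-%m.parquet`), it can help to build the list of paths for a date range before loading anything. The helper below is a sketch; `monthly_files` is a hypothetical name, and the default directory is the Kaggle input path used elsewhere in this document.

```python
def monthly_files(kind, start, end, base_dir="../input/vtuber-livechat"):
    """List '<kind>_%Y-%m.parquet' paths for each month from start to end, inclusive.

    start and end are (year, month) tuples, e.g. (2021, 11).
    """
    year, month = start
    paths = []
    while (year, month) <= tuple(end):
        paths.append(f"{base_dir}/{kind}_{year:04d}-{month:02d}.parquet")
        # Roll over to January of the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return paths


paths = monthly_files("chats", (2021, 11), (2022, 2))
print(paths[0])  # ../input/vtuber-livechat/chats_2021-11.parquet
```

Each path can then be passed to `pd.read_parquet`, optionally with a `columns=[...]` argument to load only the columns a given analysis needs.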
VTuber 1B Complete is only available to those approved by the admins. If you are interested in conducting research using this edition, please reach us at [email protected] (for organizations only).
filename | summary |
---|---|
chats_%Y-%m.parquet | Live chat messages (> 1,000,000,000) |
superchats_%Y-%m.parquet | Super chat messages (> 4,000,000) |
deletion_events.parquet | Deletion events |
ban_events.parquet | Ban events |
Ban and deletion are equivalent to `markChatItemsByAuthorAsDeletedAction` and `markChatItemAsDeletedAction` respectively.
column | type | description |
---|---|---|
channelId | string | channel id |
name | string | channel name |
englishName | nullable string | channel name (English) |
affiliation | string | channel affiliation |
group | nullable string | group |
subscriptionCount | number | subscription count |
videoCount | number | uploads count |
photo | string | channel icon |
Inactive channels have `INACTIVE` in the `group` column.
import pandas as pd

# Compact dtypes keep memory usage low when loading the channel index.
dtype_dict = {
    'channelId': 'category',
    'name': 'category',
    'englishName': 'category',
    'affiliation': 'category',
    'subscriptionCount': 'int32',
    'videoCount': 'int16',
    'photo': 'category'
}
channels = pd.read_csv('../input/vtuber-livechat-elements/channels.csv', dtype=dtype_dict)
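Once the channel index is loaded, inactive channels can be excluded by testing the `group` column. The sketch below uses a small made-up DataFrame in place of `channels.csv`; the channel ids and group values other than `INACTIVE` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for channels.csv; the rows are made up for illustration.
channels = pd.DataFrame({
    "channelId": ["UC_a", "UC_b", "UC_c"],
    "affiliation": ["Hololive", "Nijisanji", "Hololive"],
    "group": ["English", None, "INACTIVE"],
})

# `group` is nullable, so use .ne(), which treats missing values as "not equal".
active = channels[channels["group"].ne("INACTIVE")]
print(active["channelId"].tolist())  # ['UC_a', 'UC_b']
```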
column | type | description |
---|---|---|
channelId | string | channel id |
period | string | period of interest (%Y-%m) |
chats | number | number of chats |
memberChats | number | number of chats with membership status attached |
uniqueChatters | number | number of unique chatters |
uniqueMembers | number | number of unique members appeared on live chat |
bannedChatters | number | number of unique chatters marked as banned by mods |
deletedChats | number | number of chats deleted by mods |
import pandas as pd

chat_stats = pd.read_csv('../input/vtuber-livechat-elements/chat_stats.csv')
sc_stats = pd.read_csv('../input/vtuber-livechat-elements/superchat_stats.csv')
# The left join keeps every chat_stats row, even without a matching super chat row.
stats = pd.merge(chat_stats, sc_stats, on=['period', 'channelId'], how='left')
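A left join like this leaves the super chat columns empty (NaN) wherever a channel has no matching row for a period. The sketch below, on made-up numbers, fills those gaps with zero and derives a deletion rate. Treating a missing row as "no super chats" is an assumption, and `deletionRate` is a hypothetical column name.

```python
import pandas as pd

# Toy stand-ins for chat_stats.csv and superchat_stats.csv (made-up numbers).
chat_stats = pd.DataFrame({
    "channelId": ["UC_a", "UC_b"],
    "period": ["2022-01", "2022-01"],
    "chats": [1000, 500],
    "deletedChats": [10, 0],
})
sc_stats = pd.DataFrame({
    "channelId": ["UC_a"],
    "period": ["2022-01"],
    "superChats": [20],
})

stats = pd.merge(chat_stats, sc_stats, on=["period", "channelId"], how="left")
# UC_b has no row in sc_stats, so its superChats value is NaN after the join.
stats["superChats"] = stats["superChats"].fillna(0).astype("int64")
# Share of chats removed by moderators, per channel and period.
stats["deletionRate"] = stats["deletedChats"] / stats["chats"]
```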
column | type | description |
---|---|---|
channelId | string | channel id |
period | string | period of interest (%Y-%m) |
superChats | number | number of super chats |
uniqueSuperChatters | number | number of unique super chatters |
totalSC | number | total amount of super chats (JPY) |
averageSC | number | average amount of super chat (JPY) |
totalMessageLength | number | total message length |
averageMessageLength | number | average message length |
mostFrequentCurrency | string | most frequent currency |
mostFrequentColor | string | most frequent color |
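Because `period` is stored as a `%Y-%m` string, parsing it into timestamps makes time-series aggregation and plotting easier. The following sketch uses invented rows in place of `superchat_stats.csv` and sums `totalSC` per month across channels.

```python
import pandas as pd

# Toy stand-in for superchat_stats.csv (made-up numbers).
sc_stats = pd.DataFrame({
    "channelId": ["UC_a", "UC_b", "UC_a"],
    "period": ["2022-01", "2022-01", "2022-02"],
    "totalSC": [30000, 12000, 5000],
})

# Parse '%Y-%m' strings into month-start timestamps.
sc_stats["period"] = pd.to_datetime(sc_stats["period"], format="%Y-%m")
# Total super chat amount (JPY) per month over all channels.
monthly_total = sc_stats.groupby("period")["totalSC"].sum()
```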
column | type | description | in standard version |
---|---|---|---|
timestamp | string | ISO 8601 UTC timestamp | limited accuracy |
id | string | chat id | N/A |
authorName | string | author name | N/A |
authorChannelId | string | author channel id | anonymized |
body | string | chat message | N/A |
bodyLength | number | chat message length | standard version only |
membership | string | membership status | N/A |
isMember | nullable boolean | is member (null if unknown) | standard version only |
isModerator | boolean | is channel moderator | N/A |
isVerified | boolean | is verified account | N/A |
videoId | string | source video id | |
channelId | string | source channel id |
value | duration |
---|---|
unknown | Indistinguishable |
non-member | 0 |
new | < 1 month |
1 month | >= 1 month, < 2 months |
2 months | >= 2 months, < 6 months |
6 months | >= 6 months, < 12 months |
1 year | >= 12 months, < 24 months |
2 years | >= 24 months |
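The membership values in the table above have a natural order, so one option is to store them as an ordered categorical, which allows comparisons such as "at least 6 months". This is a sketch on a made-up Series; `unknown` is deliberately left out of the category list so that it becomes a missing value, which is a design choice of this example rather than something the dataset prescribes.

```python
import pandas as pd

# Membership tiers from shortest to longest, following the table above.
ORDER = ["non-member", "new", "1 month", "2 months", "6 months", "1 year", "2 years"]
tier_dtype = pd.CategoricalDtype(categories=ORDER, ordered=True)

membership = pd.Series(["new", "1 year", "unknown", "non-member", "6 months"])
# Values outside ORDER ("unknown") become missing values.
tier = membership.astype(tier_dtype)

# Ordered categoricals support comparisons; missing values compare as False.
long_term = tier >= "6 months"
print(long_term.tolist())  # [False, True, False, False, True]
```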
import pandas as pd
chats = pd.read_parquet('../input/vtuber-livechat/chats_2022-02.parquet')
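For frequency analysis, the `timestamp` strings usually need to be parsed into timezone-aware datetimes first. The sketch below uses a tiny invented DataFrame; the exact ISO 8601 layout shown (with a `+00:00` offset) is an assumption about how the strings look in the real files.

```python
import pandas as pd

# Toy stand-in for a chats file; timestamps and ids are made up.
chats = pd.DataFrame({
    "timestamp": [
        "2022-02-01T12:00:05+00:00",
        "2022-02-01T12:30:00+00:00",
        "2022-02-01T13:15:00+00:00",
    ],
    "videoId": ["vid_1", "vid_1", "vid_1"],
})

# Parse ISO 8601 strings into UTC datetimes.
chats["timestamp"] = pd.to_datetime(chats["timestamp"], utc=True)
# Number of chats per hour of day (UTC).
per_hour = chats.groupby(chats["timestamp"].dt.hour).size()
```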
column | type | description | in standard version |
---|---|---|---|
timestamp | string | ISO 8601 UTC timestamp | limited accuracy |
id | string | chat id | N/A |
authorName | string | author name | N/A |
authorChannelId | string | author channel id | anonymized |
body | nullable string | chat message | N/A |
amount | number | purchased amount | |
currency | string | three-letter currency symbol | |
color | string | color | N/A |
significance | number | significance | |
videoId | string | source video id | N/A |
channelId | string | source channel id |
color | significance | purchase amount (¥) | purchase amount ($) | max. message length |
---|---|---|---|---|
blue | 1 | ¥ 100 - 199 | $ 1.00 - 1.99 | 0 |
lightblue | 2 | ¥ 200 - 499 | $ 2.00 - 4.99 | 50 |
green | 3 | ¥ 500 - 999 | $ 5.00 - 9.99 | 150 |
yellow | 4 | ¥ 1000 - 1999 | $ 10.00 - 19.99 | 200 |
orange | 5 | ¥ 2000 - 4999 | $ 20.00 - 49.99 | 225 |
magenta | 6 | ¥ 5000 - 9999 | $ 50.00 - 99.99 | 250 |
red | 7 | ¥ 10000 - 50000 | $ 100.00 - 500.00 | 270 - 350 |
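The JPY column of the table above can be turned into a small lookup that maps an amount to its color and significance. This is a sketch: `jpy_tier` is a hypothetical helper, it only handles JPY amounts, and treating every amount of ¥10,000 or more as red (including anything above the listed ¥50,000) is an assumption.

```python
from bisect import bisect_right

# (lower bound in JPY, color, significance), transcribed from the table above.
TIERS = [
    (100, "blue", 1),
    (200, "lightblue", 2),
    (500, "green", 3),
    (1000, "yellow", 4),
    (2000, "orange", 5),
    (5000, "magenta", 6),
    (10000, "red", 7),
]
LOWER_BOUNDS = [lower for lower, _, _ in TIERS]


def jpy_tier(amount_jpy):
    """Return (color, significance) for a JPY amount, or None below ¥100."""
    # Index of the last tier whose lower bound is <= amount_jpy, plus one.
    i = bisect_right(LOWER_BOUNDS, amount_jpy)
    if i == 0:
        return None
    _, color, significance = TIERS[i - 1]
    return color, significance


print(jpy_tier(1500))  # ('yellow', 4)
```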
import pandas as pd
from glob import iglob

sc = pd.concat([
    pd.read_parquet(f)
    for f in iglob('../input/vtuber-livechat/superchats_*.parquet')
], ignore_index=False)
sc.sort_index(inplace=True)
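Super chat amounts are recorded in the purchaser's currency, so raw `amount` values are not directly comparable across rows. One simple approach, sketched here on made-up data, is to aggregate per currency. The currency codes `JPY` and `USD` are assumed examples of what the `currency` column contains.

```python
import pandas as pd

# Toy stand-in for the concatenated super chat files (made-up numbers).
sc = pd.DataFrame({
    "amount": [500.0, 10.0, 1000.0, 5.0],
    "currency": ["JPY", "USD", "JPY", "USD"],
    "significance": [3, 4, 4, 3],
})

# Count and total per currency; amounts are never summed across currencies.
by_currency = sc.groupby("currency")["amount"].agg(["count", "sum"])
```

For cross-currency comparisons, the `significance` column is the safer choice, since it encodes the price tier rather than the raw amount.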
column | type | description | in standard version |
---|---|---|---|
timestamp | string | UTC timestamp | |
id | string | chat id | |
retracted | boolean | is deleted by author oneself | |
videoId | string | source video id | |
channelId | string | source channel id |
Insert a `deleted_by_mod` column into the `chats` DataFrame:
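import pandas as pd

chats = pd.read_parquet('../input/vtuber-livechat/chats_2022-02.parquet')
delet = pd.read_parquet('../input/vtuber-livechat/deletion_events.parquet', columns=['id', 'retracted'])
# Keep moderator deletions only (retracted marks deletions by the author), one row per chat id.
delet = delet[delet['retracted'] == 0].drop_duplicates('id').copy()
delet['deleted_by_mod'] = True
chats = pd.merge(chats, delet[['id', 'deleted_by_mod']], on='id', how='left')
# Chats without a matching deletion event come back as NaN; convert to a plain boolean column.
chats['deleted_by_mod'] = chats['deleted_by_mod'].notna()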
Here, a ban means either placing a user in timeout or permanently hiding the user's comments on the channel's current and future live streams. The two are conflated because the data extracted from the `markChatItemsByAuthorAsDeletedAction` event does not distinguish between them.
column | type | description | in standard version |
---|---|---|---|
timestamp | string | UTC timestamp | |
authorChannelId | string | channel id | anonymized |
videoId | string | source video id | |
channelId | string | source channel id |
Insert a `banned` column into the `chats` DataFrame:
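import pandas as pd

chats = pd.read_parquet('../input/vtuber-livechat/chats_2022-02.parquet')
ban = pd.read_parquet('../input/vtuber-livechat/ban_events.parquet', columns=['authorChannelId', 'videoId'])
# Drop repeated bans of the same author in the same stream so the merge cannot duplicate chats.
ban = ban.drop_duplicates()
ban['banned'] = True
chats = pd.merge(chats, ban, on=['authorChannelId', 'videoId'], how='left')
# Chats without a matching ban event come back as NaN; convert to a plain boolean column.
chats['banned'] = chats['banned'].notna()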
`id` and `authorChannelId` are anonymized by the SHA-1 hashing algorithm with a pinch of undisclosed salt.
All custom emojis are replaced with the Unicode replacement character � (`U+FFFD`).
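When working with message text (the `body` column, which the schema above marks as N/A in the standard version), the replacement character can be counted as a proxy for custom emoji usage or stripped before tokenization. A minimal sketch on invented messages:

```python
import pandas as pd

REPLACEMENT = "\ufffd"  # Unicode replacement character that stands in for custom emojis

# Toy chat bodies; each replacement character marks one custom emoji.
body = pd.Series(["hello \ufffd\ufffd", "kusa", "\ufffd"])

# Number of custom emojis per message.
emoji_count = body.str.count(REPLACEMENT)
# Message text with the emoji placeholders removed.
text_only = body.str.replace(REPLACEMENT, "", regex=False).str.strip()
```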
Bans and deletions from multiple moderators for the same person or chat are logged separately. For simplicity, you can safely ignore all but the earliest event for each person or chat.
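Keeping only the earliest event per target can be sketched as a sort followed by a de-duplication. The example below uses made-up ban events. It assumes the timestamp strings share one ISO 8601 layout so that sorting them as text matches chronological order, and it treats one author in one video as the unit to de-duplicate.

```python
import pandas as pd

# Toy stand-in for ban_events.parquet; two moderators banned author "a1" a second apart.
ban = pd.DataFrame({
    "timestamp": ["2022-02-01T12:00:02Z", "2022-02-01T12:00:01Z", "2022-02-01T12:05:00Z"],
    "authorChannelId": ["a1", "a1", "b2"],
    "videoId": ["vid_1", "vid_1", "vid_1"],
})

# Sort chronologically, then keep the first event per author and video.
first_bans = (
    ban.sort_values("timestamp")
       .drop_duplicates(["authorChannelId", "videoId"], keep="first")
       .reset_index(drop=True)
)
```

The same pattern applies to deletion events by de-duplicating on `id` instead.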
@misc{vtuber-livechat-dataset,
author={Yasuaki Uechi},
title={VTuber 1B: Large-scale Live Chat and Moderation Events Dataset},
year={2022},
month={2},
version={37},
url={https://holodata.org/vtuber-1b}
}
- Code: MIT License
- Dataset: ODC Public Domain Dedication and Licence (PDDL)