VTuber 1B: Live Chat and Moderation Events

VTuber 1B is an academic purpose NLP dataset, collecting over a billion live chats, superchats, and moderation events (bans/deletions) from virtual YouTubers' live streams.

Download the dataset from Kaggle Datasets and join #vtuber-1b channel on holodata Discord for discussions.

We also offer ❤️‍🩹 Sensai, a live chat dataset specifically made for building ML models for spam detection / toxic chat classification.

Provenance

Source: YouTube live chat events collected by our Honeybee cluster. Holodex is a stream index provider for Honeybee which covers Hololive, Nijisanji, 774inc, etc.
Temporal Coverage:
- Chats: from 2021-01-15
- Super chats: from 2021-03-16
- Super stickers: from 2022-01-20 (N/A yet)
- Membership joining events: from 2021-10-18 (N/A yet)
- Membership milestones: from 2021-10-20 (N/A yet)
- Membership gifts: N/A
- Placeholders: from 2022-01-21 (N/A yet)
Update Frequency:
- At least once every 6 months

Research Ideas

Toxic Chat Classification
Spam Detection
Demographic Visualization
Superchat Analysis
Training neural language models

See Kaggle public notebooks (VTuber 1B / VTuber 1B Elements) for ideas, as well as /notebooks folder in the repo.

We employed Honeybee cluster to collect real-time live chat events across major Vtubers' live streams. All sensitive data such as author name or author profile image are omitted from the dataset, and author channel id is anonymized by SHA-1 hashing algorithm with a grain of salt.

Editions

VTuber 1B Elements

Kaggle Datasets (2 MB)

VTuber 1B Elements is most suitable for statistical visualizations and exploratory data analysis.

filename	summary
`channels.csv`	Channel index
`chat_stats.csv`	Chat statistics
`superchat_stats.csv`	Super Chat statistics

VTuber 1B

Kaggle Datasets (47 GB)

VTuber 1B is most suitable for frequency analysis. This edition includes only the essential columns in order to reduce dataset size and make it faster from Kaggle Kernels to load data in.

filename	summary
`chats_%Y-%m.parquet`	Live chat events (> 1,000,000,000)
`superchats_%Y-%m.parquet`	Super chat events (> 4,000,000)
`deletion_events.parquet`	Deletion events
`ban_events.parquet`	Ban events

VTuber 1B Complete

VTuber 1B Complete is only available to those approved by the admins. If you are interested in conducting research using this edition, please reach us at [email protected] (for organizations only).

filename	summary
`chats_%Y-%m.parquet`	Live chat messages (> 1,000,000,000)
`superchats_%Y-%m.parquet`	Super chat messages (> 4,000,000)
`deletion_events.parquet`	Deletion events
`ban_events.parquet`	Ban events

Dataset Breakdown

Ban and deletion are equivalent to markChatItemsByAuthorAsDeletedAction and markChatItemAsDeletedAction respectively.

Channels (`channels.csv`)

column	type	description
channelId	string	channel id
name	string	channel name
englishName	nullable string	channel name (English)
affiliation	string	channel affiliation
group	nullable string	group
subscriptionCount	number	subscription count
videoCount	number	uploads count
photo	string	channel icon

Inactive channels have INACTIVE in group column.

Pandas usage

import pandas as pd

dtype_dict = {
    'channelId': 'category',
    'name': 'category',
    'englishName': 'category',
    'affiliation': 'category',
    'subscriptionCount': 'int32',
    'videoCount': 'int16',
    'photo': 'category'
}
chats = pd.read_csv('../input/vtuber-livechat-elements/channels.csv', dtype=dtype_dict)

Chat Statistics (`chat_stats.csv`)

column	type	description
channelId	string	channel id
period	string	interested period (%Y-%M)
chats	number	number of chats
memberChats	number	number of chats with membership status attached
uniqueChatters	number	number of unique chatters
uniqueMembers	number	number of unique members appeared on live chat
bannedChatters	number	number of unique chatters marked as banned by mods
deletedChats	number	number of chats deleted by mods

Pandas usage

import pandas as pd

chat_stats = pd.read_csv('../input/vtuber-livechat-elements/chat_stats.csv'))
sc_stats = pd.read_csv('../input/vtuber-livechat-elements/superchat_stats.csv'))
stats = pd.merge(chat_stats, sc_stats, on=['period', 'channelId'], how='left')

Super Chat Statistics (`superchat_stats.csv`)

column	type	description
channelId	string	channel id
period	string	interested period (%Y-%M)
superChats	number	number of super chats
uniqueSuperChatters	number	number of unique super chatters
totalSC	number	total amount of super chats (JPY)
averageSC	number	average amount of super chat (JPY)
totalMessageLength	number	total message length
averageMessageLength	number	average mesage length
mostFrequentCurrency	string	most frequent currency
mostFrequentColor	string	most frequent color

Chats (`chats_%Y-%m.parquet`)

column	type	description	in standard version
timestamp	string	ISO 8601 UTC timestamp	limited accuracy
id	string	chat id	N/A
authorName	string	author name	N/A
authorChannelId	string	author channel id	anonymized
body	string	chat message	N/A
bodyLength	number	chat message length	standard version only
membership	string	membership status	N/A
isMember	nullable boolean	is member (null if unknown)	standard version only
isModerator	boolean	is channel moderator	N/A
isVerified	boolean	is verified account	N/A
videoId	string	source video id
channelId	string	source channel id

Membership status

value	duration
unknown	Indistinguishable
non-member	0
new	< 1 month
1 month	>= 1 month, < 2 months
2 months	>= 2 months, < 6 months
6 months	>= 6 months, < 12 months
1 year	>= 12 months, < 24 months
2 years	>= 24 months

Pandas usage

import pandas as pd

chats = pd.read_parquet('../input/vtuber-livechat/chats_2022-02.parquet')

Superchats (`chats_:year:-:month:.parquet`)

column	type	description	in standard version
timestamp	string	ISO 8601 UTC timestamp	limited accuracy
id	string	chat id	N/A
authorName	string	author name	N/A
authorChannelId	string	author channel id	anonymized
body	nullable string	chat message	N/A
amount	number	purchased amount
currency	string	three-letter currency symbol
color	string	color	N/A
significance	number	significance
videoId	string	source video id	N/A
channelId	string	source channel id

Color and Significance

color	significance	purchase amount (¥)	purchase amount ($)	max. message length
blue	1	¥ 100 - 199	$ 1.00 - 1.99	0
lightblue	2	¥ 200 - 499	$ 2.00 - 4.99	50
green	3	¥ 500 - 999	$ 5.00 - 9.99	150
yellow	4	¥ 1000 - 1999	$ 10.00 - 19.99	200
orange	5	¥ 2000 - 4999	$ 20.00 - 49.99	225
magenta	6	¥ 5000 - 9999	$ 50.00 - 99.99	250
red	7	¥ 10000 - 50000	$ 100.00 - 500.00	270 - 350

Pandas usage

import pandas as pd
from glob import iglob

sc = pd.concat([
    pd.read_parquet(f)
    for f in iglob('../input/vtuber-livechat/superchats_*.parquet')
], ignore_index=False)
sc.sort_index(inplace=True)

Deletion Events (`deletion_events.parquet`)

column	type	description
timestamp	string	UTC timestamp
id	string	chat id
retracted	boolean	is deleted by author oneself
videoId	string	source video id
channelId	string	source channel id

Pandas usage

Insert deleted_by_mod column to chats DataFrame:

chats = pd.read_parquet('../input/vtuber-livechat/chats_2022-02.parquet')
delet = pd.read_parquet('../input/vtuber-livechat/deletion_events.parquet', columns=['id', 'retracted'])

delet = delet[delet['retracted'] == 0]

delet['deleted_by_mod'] = True
chats = pd.merge(chats, delet[['id', 'deleted_by_mod']], how='left')
chats['deleted_by_mod'].fillna(False, inplace=True)

Ban Events (`ban_events.parquet`)

Here Ban means either to place user in time out or to permanently hide the user's comments on the channel's current and future live streams. This mixup is due to the fact that these actions are indistinguishable from others with the extracted data from markChatItemsByAuthorAsDeletedAction event.

column	type	description	in standard version
timestamp	string	UTC timestamp
authorChannelId	string	channel id	anonymized
videoId	string	source video id
channelId	string	source channel id

Pandas usage

Insert banned column to chats DataFrame:

chats = pd.read_parquet('../input/vtuber-livechat/chats_2022-02.parquet')
ban = pd.read_parquet('../input/vtuber-livechat/ban_events.parquet', columns=['authorChannelId', 'videoId'])

ban['banned'] = True
chats = pd.merge(chats, ban, on=['authorChannelId', 'videoId'], how='left')
chats['banned'].fillna(False, inplace=True)

Consideration

Anonymization

id and authorChannelId are anonymized by SHA-1 hashing algorithm with a pinch of undisclosed salt.

Handling Custom Emojis

All custom emojis are replaced with a Unicode replacement character � (U+FFFD).

Redundant Ban and Deletion Events

Bans and deletions from multiple moderators for the same person or chat will be logged separately. For simplicity, you can safely ignore all but the first line recorded in time order.

Citation

@misc{vtuber-livechat-dataset,
 author={Yasuaki Uechi},
 title={VTuber 1B: Large-scale Live Chat and Moderation Events Dataset},
 year={2022},
 month={2},
 version={37},
 url={https://holodata.org/vtuber-1b}
}

License

Code: MIT License
Dataset: ODC Public Domain Dedication and Licence (PDDL)

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
.github		.github
notebooks		notebooks
vtlc		vtlc
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
.prettierrc		.prettierrc
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VTuber 1B: Live Chat and Moderation Events

Provenance

Research Ideas

Editions

VTuber 1B Elements

VTuber 1B

VTuber 1B Complete

Dataset Breakdown

Channels (`channels.csv`)

Pandas usage

Chat Statistics (`chat_stats.csv`)

Pandas usage

Super Chat Statistics (`superchat_stats.csv`)

Chats (`chats_%Y-%m.parquet`)

Membership status

Pandas usage

Superchats (`chats_:year:-:month:.parquet`)

Color and Significance

Pandas usage

Deletion Events (`deletion_events.parquet`)

Pandas usage

Ban Events (`ban_events.parquet`)

Pandas usage

Consideration

Anonymization

Handling Custom Emojis

Redundant Ban and Deletion Events

Citation

License

About

Sponsor this project

Languages

License

sigvt/vtuber-livechat-dataset

Folders and files

Latest commit

History

Repository files navigation

VTuber 1B: Live Chat and Moderation Events

Provenance

Research Ideas

Editions

VTuber 1B Elements

VTuber 1B

VTuber 1B Complete

Dataset Breakdown

Channels (channels.csv)

Pandas usage

Chat Statistics (chat_stats.csv)

Pandas usage

Super Chat Statistics (superchat_stats.csv)

Chats (chats_%Y-%m.parquet)

Membership status

Pandas usage

Superchats (chats_:year:-:month:.parquet)

Color and Significance

Pandas usage

Deletion Events (deletion_events.parquet)

Pandas usage

Ban Events (ban_events.parquet)

Pandas usage

Consideration

Anonymization

Handling Custom Emojis

Redundant Ban and Deletion Events

Citation

License

About

Topics

Resources

License

Stars

Watchers

Forks

Sponsor this project

Languages

Channels (`channels.csv`)

Chat Statistics (`chat_stats.csv`)

Super Chat Statistics (`superchat_stats.csv`)

Chats (`chats_%Y-%m.parquet`)

Superchats (`chats_:year:-:month:.parquet`)

Deletion Events (`deletion_events.parquet`)

Ban Events (`ban_events.parquet`)