We extend STREAM and present STREAM-ZH, the first topic modeling package to fully support the Chinese language across a broad range of topic models, evaluation metrics, and preprocessing workflows.
You can install STREAM-ZH directly from PyPI:
```bash
pip install stream_topic
```

Please note that additional packages required for processing Chinese datasets may need to be installed:

```bash
pip install jieba
pip install hanlp
pip install thulac
pip install snownlp
pip install pkuseg
pip install opencc
```
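These packages provide word segmentation, which Chinese text needs because it has no whitespace word boundaries. A quick illustration of what such a segmenter does, using jieba with an arbitrary sample sentence:

```python
import jieba

# Segment a Chinese sentence into words before any topic-model preprocessing.
tokens = jieba.lcut("我们使用主题模型分析中文新闻语料")
print(tokens)  # e.g. ['我们', '使用', '主题', '模型', '分析', '中文', '新闻', '语料']
```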
STREAM-ZH inherits various neural and non-neural topic models provided by STREAM. Currently, the following models are implemented:
| Name | Implementation |
|---|---|
| LDA | Latent Dirichlet Allocation |
| NMF | Non-negative Matrix Factorization |
| WordCluTM | Tired of topic models? |
| CEDC | Topics in the Haystack |
| DCTE | Human in the Loop |
| KMeansTM | Simple Kmeans followed by c-tfidf |
| SomTM | Self organizing map followed by c-tfidf |
| CBC | Coherence based document clustering |
| TNTM | Transformer-Representation Neural Topic Model |
| ETM | Topic modeling in embedding spaces |
| CTM | Combined Topic Model |
| CTMNeg | Contextualized Topic Models with Negative Sampling |
| ProdLDA | Autoencoding Variational Inference For Topic Models |
| NeuralLDA | Autoencoding Variational Inference For Topic Models |
| NSTM | Neural Topic Model via Optimal Transport |
STREAM-ZH inherits all the evaluation metrics of STREAM, including intruder, diversity and coherence metrics.
| Name | Description |
|---|---|
| ISIM | Average cosine similarity of top words of a topic to an intruder word. |
| INT | For a given topic and a given intruder word, Intruder Accuracy is the fraction of top words to which the intruder has the least similar embedding among all top words. |
| ISH | Calculates the shift in the centroid of a topic when an intruder word is replaced. |
| Expressivity | Cosine distance of topics to the meaningless (stopword) embedding centroid. |
| Embedding Topic Diversity | Topic diversity in the embedding space. |
| Embedding Coherence | Average cosine similarity between the embeddings of the top words of a topic. |
| NPMI | Classical NPMI coherence computed on the source corpus. |
STREAM-ZH provides the following preprocessed Chinese datasets for benchmark testing:
| Name | # Docs | # Words | Avg. Length | Description |
|---|---|---|---|---|
| THUCNews | 804,656 | 395,432 | 230.5 | Preprocessed THUCNews dataset |
| THUCNews_small | 13,994 | 40,865 | 198.1 | A subset of THUCNews with 1,000 documents per category |
| FUDANCNews | 9,526 | 22,985 | 422.5 | Originally for text classification, merged from its training and test sets |
| TOUTIAO | 337,902 | 57,616 | 10.2 | Preprocessed Chinese news headline dataset |
| TOUTIAO_small | 19,399 | 12,777 | 8.1 | A subset of TOUTIAO with 1,400 documents per category |
| CMtMedQA_ten | 48,413 | 22,404 | 166.1 | Preprocessed Chinese multi-round medical conversation corpus, restricted to ten medical themes |
| CMtMedQA_small | 9,909 | 12,885 | 164.6 | A subset of CMtMedQA_ten with 1,000 documents per category |
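Each dataset can be fetched by its table name through `TMDataset`, exactly as in the walkthrough below; in this sketch the `dataset_path` for TOUTIAO_small is an assumption, chosen by analogy with the THUCNews path used later:

```python
from stream_topic.utils import TMDataset

dataset = TMDataset(language="chinese", stopwords_path="stream_topic/utils/common_stopwords.txt")
# The dataset_path below is illustrative; point it at your local copy of the preprocessed data.
dataset.fetch_dataset(
    "TOUTIAO_small",
    dataset_path="stream_ZH_topic_data/preprocessed_datasets/TOUTIAO",
    source="local",
)
```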
To use one of the available models for Chinese topic modeling, follow the simple steps below:
- Import the necessary modules:

  ```python
  from stream_topic.models import KmeansTM
  from stream_topic.utils import TMDataset
  ```
- Get the dataset and preprocess it for your model:

  ```python
  dataset = TMDataset(language="chinese", stopwords_path="stream_topic/utils/common_stopwords.txt")
  dataset.fetch_dataset(
      "THUCNews_small",
      dataset_path="stream_ZH_topic_data/preprocessed_datasets/THUCNews",
      source="local",
  )
  dataset.preprocess(model_type="KmeansTM")
  ```

  The specified `model_type` is optional, and further arguments can be passed; default preprocessing steps are predefined for all included models.
- Choose the model you want to use and train it:

  ```python
  model = KmeansTM(
      embedding_model_name="TencentBAC/Conan-embedding-v1",
      stopwords_path="stream_topic/utils/common_stopwords.txt",
  )
  model.fit(dataset, n_topics=14, language="chinese")
  ```
- Get the topics:

  ```python
  topics = model.get_topics()
  ```
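To take a quick look at the result, the topics can be printed; this sketch assumes `get_topics()` returns an iterable of top-word lists, so adapt it to the structure your STREAM version returns:

```python
# Print the top words of each topic.
for topic_id, top_words in enumerate(topics):
    print(f"Topic {topic_id}: {' '.join(top_words)}")
```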
For evaluation, specify a Chinese embedding model for the metrics:
```python
from stream_topic.metrics.metrics_config import MetricsConfig

MetricsConfig.set_PARAPHRASE_embedder("TencentBAC/Conan-embedding-v1")
MetricsConfig.set_SENTENCE_embedder("TencentBAC/Conan-embedding-v1")
```

To evaluate your model, simply use one of the metrics:
```python
from stream_topic.metrics import ISIM, INT, ISH, Expressivity, NPMI

metric = ISIM()
metric.score(topics)
```

Scores for each topic are available via:

```python
metric.score_per_topic(topics)
```

The corpus-based NPMI coherence additionally takes the dataset and the Chinese stopword list:

```python
metric = NPMI(dataset, language="chinese", stopwords="stream_topic/utils/common_stopwords.txt")
metric.score(topics)
```

Scores for each topic are again available via:

```python
metric.score_per_topic(topics)
```
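Several metrics can also be collected in one pass; this is only a convenience sketch built from the calls shown above (add or remove metrics as needed):

```python
# Evaluate the same topics with several metrics and collect the corpus-level scores.
metrics = {
    "ISIM": ISIM(),
    "NPMI": NPMI(dataset, language="chinese", stopwords="stream_topic/utils/common_stopwords.txt"),
}
results = {name: metric.score(topics) for name, metric in metrics.items()}
print(results)
```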