We extend STREAM and present STREAM-ZH, the first topic modeling package to fully support the Chinese language across a broad range of topic models, evaluation metrics, and preprocessing workflows.
You can install STREAM-ZH directly from PyPI:
```bash
pip install stream_topic
```

Please note that additional packages required for processing Chinese datasets may need to be installed:

```bash
pip install jieba
pip install hanlp
pip install thulac
pip install snownlp
pip install pkuseg
pip install opencc
```
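These packages provide word segmentation, which Chinese text needs because it has no whitespace word boundaries. A quick illustration of what such a segmenter does, using jieba with an arbitrary sample sentence:

```python
import jieba

# Segment a Chinese sentence into words before any topic-model preprocessing.
tokens = jieba.lcut("我们使用主题模型分析中文新闻语料")
print(tokens)  # e.g. ['我们', '使用', '主题', '模型', '分析', '中文', '新闻', '语料']
```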
STREAM-ZH inherits various neural and non-neural topic models provided by STREAM. Currently, the following models are implemented:
| Name | Implementation |
|---|---|
| LDA | Latent Dirichlet Allocation |
| NMF | Non-negative Matrix Factorization |
| WordCluTM | Tired of topic models? |
| CEDC | Topics in the Haystack |
| DCTE | Human in the Loop |
| KMeansTM | Simple Kmeans followed by c-tfidf |
| SomTM | Self organizing map followed by c-tfidf |
| CBC | Coherence based document clustering |
| TNTM | Transformer-Representation Neural Topic Model |
| ETM | Topic modeling in embedding spaces |
| CTM | Combined Topic Model |
| CTMNeg | Contextualized Topic Models with Negative Sampling |
| ProdLDA | Autoencoding Variational Inference For Topic Models |
| NeuralLDA | Autoencoding Variational Inference For Topic Models |
| NSTM | Neural Topic Model via Optimal Transport |
STREAM-ZH inherits all the evaluation metrics of STREAM, including intruder, diversity and coherence metrics.
| Name | Description |
|---|---|
| ISIM | Average cosine similarity of top words of a topic to an intruder word. |
| INT | For a given topic and a given intruder word, Intruder Accuracy is the fraction of top words to which the intruder has the least similar embedding among all top words. |
| ISH | Calculates the shift in the centroid of a topic when an intruder word is replaced. |
| Expressivity | Cosine distance of topics to the meaningless (stopword) embedding centroid. |
| Embedding Topic Diversity | Topic diversity in the embedding space. |
| Embedding Coherence | Average cosine similarity between the embeddings of the top words of a topic. |
| NPMI | Classical NPMI coherence computed on the source corpus. |
STREAM-ZH provides the following preprocessed Chinese datasets for benchmark testing:
| Name | # Docs | # Words | Avg. Length | Description |
|---|---|---|---|---|
| THUCNews | 804,656 | 395,432 | 230.5 | Preprocessed THUCNews dataset |
| THUCNews_small | 13,994 | 40,865 | 198.1 | A subset of THUCNews with 1,000 documents per category |
| FUDANCNews | 9,526 | 22,985 | 422.5 | Originally for text classification, merged from its training and test sets |
| TOUTIAO | 337,902 | 57,616 | 10.2 | Preprocessed Chinese news headline dataset |
| TOUTIAO_small | 19,399 | 12,777 | 8.1 | A subset of TOUTIAO with 1,400 documents per category |
| CMtMedQA_ten | 48,413 | 22,404 | 166.1 | Preprocessed Chinese multi-round medical conversation corpus, restricted to ten medical themes |
| CMtMedQA_small | 9,909 | 12,885 | 164.6 | A subset of CMtMedQA_ten with 1,000 documents per category |
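Each dataset can be fetched by its table name through `TMDataset`, exactly as in the walkthrough below; in this sketch the `dataset_path` for TOUTIAO_small is an assumption, chosen by analogy with the THUCNews path used later:

```python
from stream_topic.utils import TMDataset

dataset = TMDataset(language="chinese", stopwords_path="stream_topic/utils/common_stopwords.txt")
# The dataset_path below is illustrative; point it at your local copy of the preprocessed data.
dataset.fetch_dataset(
    "TOUTIAO_small",
    dataset_path="stream_ZH_topic_data/preprocessed_datasets/TOUTIAO",
    source="local",
)
```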
To use one of the available models for Chinese topic modeling, follow the simple steps below:
- Import the necessary modules:

  ```python
  from stream_topic.models import KmeansTM
  from stream_topic.utils import TMDataset
  ```
- Get the dataset and preprocess it for your model:

  ```python
  dataset = TMDataset(language="chinese", stopwords_path="stream_topic/utils/common_stopwords.txt")
  dataset.fetch_dataset(
      "THUCNews_small",
      dataset_path="stream_ZH_topic_data/preprocessed_datasets/THUCNews",
      source="local",
  )
  dataset.preprocess(model_type="KmeansTM")
  ```

  The specified `model_type` is optional, and further arguments can be passed; default preprocessing steps are predefined for all included models.
- Choose the model you want to use and train it:

  ```python
  model = KmeansTM(
      embedding_model_name="TencentBAC/Conan-embedding-v1",
      stopwords_path="stream_topic/utils/common_stopwords.txt",
  )
  model.fit(dataset, n_topics=14, language="chinese")
  ```
- Get the topics:

  ```python
  topics = model.get_topics()
  ```
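To take a quick look at the result, the topics can be printed; this sketch assumes `get_topics()` returns an iterable of top-word lists, so adapt it to the structure your STREAM version returns:

```python
# Print the top words of each topic.
for topic_id, top_words in enumerate(topics):
    print(f"Topic {topic_id}: {' '.join(top_words)}")
```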
For evaluation, specify a Chinese embedding model for the metrics:
```python
from stream_topic.metrics.metrics_config import MetricsConfig

MetricsConfig.set_PARAPHRASE_embedder("TencentBAC/Conan-embedding-v1")
MetricsConfig.set_SENTENCE_embedder("TencentBAC/Conan-embedding-v1")
```

To evaluate your model, simply use one of the metrics:
```python
from stream_topic.metrics import ISIM, INT, ISH, Expressivity, NPMI

metric = ISIM()
metric.score(topics)
```

Scores for each topic are available via:

```python
metric.score_per_topic(topics)
```

The corpus-based NPMI coherence additionally takes the dataset and the Chinese stopword list:

```python
metric = NPMI(dataset, language="chinese", stopwords="stream_topic/utils/common_stopwords.txt")
metric.score(topics)
```

Scores for each topic are again available via:

```python
metric.score_per_topic(topics)
```
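Several metrics can also be collected in one pass; this is only a convenience sketch built from the calls shown above (add or remove metrics as needed):

```python
# Evaluate the same topics with several metrics and collect the corpus-level scores.
metrics = {
    "ISIM": ISIM(),
    "NPMI": NPMI(dataset, language="chinese", stopwords="stream_topic/utils/common_stopwords.txt"),
}
results = {name: metric.score(topics) for name, metric in metrics.items()}
print(results)
```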