williamlhy/STREAM-ZH

STREAM-ZH: Simplified Topic Retrieval, Exploration, and Analysis Module for Chinese language

We extend STREAM and present STREAM-ZH, the first topic modeling package to fully support the Chinese language across a broad range of topic models, evaluation metrics, and preprocessing workflows.

🚀 Installation

You can install STREAM-ZH directly from PyPI:

pip install stream_topic

Note that processing Chinese datasets may require additional packages (word segmenters and a Simplified/Traditional Chinese converter), which can be installed as follows:

pip install jieba
pip install hanlp
pip install thulac
pip install snownlp
pip install pkuseg
pip install opencc
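
Before preprocessing, it can help to check which of these optional packages are actually importable. The snippet below is a minimal sketch using only the standard library. It assumes each package's import name equals the pip name listed above, and `find_missing` is a hypothetical helper, not part of STREAM-ZH.

```python
from importlib.util import find_spec

# Assumption: import names match the pip names listed above.
OPTIONAL_PACKAGES = ["jieba", "hanlp", "thulac", "snownlp", "pkuseg", "opencc"]


def find_missing(packages):
    """Return the names in `packages` that cannot be located in this environment."""
    return [name for name in packages if find_spec(name) is None]


missing = find_missing(OPTIONAL_PACKAGES)
if missing:
    print("Missing optional packages:", ", ".join(missing))
else:
    print("All optional packages are installed.")
```

Only the packages needed by the segmenter you actually use have to be present, so a non-empty result is not necessarily an error.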

📦 Available Models

STREAM-ZH inherits various neural and non-neural topic models provided by STREAM. Currently, the following models are implemented:

| Name | Implementation |
|---|---|
| LDA | Latent Dirichlet Allocation |
| NMF | Non-negative Matrix Factorization |
| WordCluTM | Tired of topic models? |
| CEDC | Topics in the Haystack |
| DCTE | Human in the Loop |
| KmeansTM | Simple Kmeans followed by c-tfidf |
| SomTM | Self organizing map followed by c-tfidf |
| CBC | Coherence based document clustering |
| TNTM | Transformer-Representation Neural Topic Model |
| ETM | Topic modeling in embedding spaces |
| CTM | Combined Topic Model |
| CTMNeg | Contextualized Topic Models with Negative Sampling |
| ProdLDA | Autoencoding Variational Inference For Topic Models |
| NeuralLDA | Autoencoding Variational Inference For Topic Models |
| NSTM | Neural Topic Model via Optimal Transport |

📊 Available Metrics

STREAM-ZH inherits all the evaluation metrics of STREAM, including intruder, diversity and coherence metrics.

| Name | Description |
|---|---|
| ISIM | Average cosine similarity of top words of a topic to an intruder word. |
| INT | For a given topic and a given intruder word, Intruder Accuracy is the fraction of top words to which the intruder has the least similar embedding among all top words. |
| ISH | Calculates the shift in the centroid of a topic when an intruder word is replaced. |
| Expressivity | Cosine distance of topics to the meaningless (stopword) embedding centroid. |
| Embedding Topic Diversity | Topic diversity in the embedding space. |
| Embedding Coherence | Cosine similarity between the centroid of the embeddings of the stopwords and the centroid of the topic. |
| NPMI | Classical NPMI coherence computed on the source corpus. |
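
To make the ISIM description concrete, here is a minimal sketch of "average cosine similarity of top words to an intruder word". It is not the library's implementation: the 2-d embeddings are made up for illustration, and `intruder_similarity` is a hypothetical name. In practice the vectors would come from the configured embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def intruder_similarity(top_word_vectors, intruder_vector):
    """Average cosine similarity of a topic's top words to one intruder word."""
    sims = [cosine(vec, intruder_vector) for vec in top_word_vectors]
    return sum(sims) / len(sims)


# Toy 2-d embeddings, invented for this example only.
embeddings = {
    "足球": [1.0, 0.0],
    "篮球": [0.8, 0.6],
    "股票": [0.0, 1.0],  # intruder taken from a different topic
}
topic = ["足球", "篮球"]
score = intruder_similarity([embeddings[w] for w in topic], embeddings["股票"])
print(round(score, 3))
```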

🗂️ Available Datasets

STREAM-ZH provides the following preprocessed Chinese datasets for benchmark testing:

| Name | # Docs | # Words | Avg Length | Description |
|---|---|---|---|---|
| THUCNews | 804,656 | 395,432 | 230.5 | Preprocessed THUCNews dataset |
| THUCNews_small | 13,994 | 40,865 | 198.1 | A subset of THUCNews with 1,000 documents per category |
| FUDANCNews | 9,526 | 22,985 | 422.5 | Originally for text classification, merged from its training and test sets |
| TOUTIAO | 337,902 | 57,616 | 10.2 | Preprocessed news headline dataset |
| TOUTIAO_small | 19,399 | 12,777 | 8.1 | A subset of TOUTIAO with 1,400 documents per category |
| CMtMedQA_ten | 48,413 | 22,404 | 166.1 | Preprocessed Chinese multi-round medical conversation corpus, restricted to ten selected medical themes |
| CMtMedQA_small | 9,909 | 12,885 | 164.6 | A subset of CMtMedQA_ten with 1,000 documents per category |
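
For a corpus of your own, comparable statistics can be computed from already-segmented documents. The sketch below assumes that "# Words" denotes the vocabulary size and "Avg Length" the mean number of tokens per document; the README does not define these columns, so treat both readings as assumptions. `corpus_statistics` is a hypothetical helper.

```python
def corpus_statistics(tokenized_docs):
    """Compute document count, vocabulary size, and mean tokens per document."""
    n_docs = len(tokenized_docs)
    vocabulary = {token for doc in tokenized_docs for token in doc}
    total_tokens = sum(len(doc) for doc in tokenized_docs)
    avg_length = total_tokens / n_docs if n_docs else 0.0
    return {"n_docs": n_docs, "n_words": len(vocabulary), "avg_length": avg_length}


# Two tiny, already-segmented documents (illustrative data only).
docs = [["股市", "上涨", "投资者"], ["球队", "赢得", "比赛", "球队"]]
stats = corpus_statistics(docs)
print(stats)
```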

🔧 Usage

To use one of the available models for Chinese topic modeling, follow the simple steps below:

  1. Import the necessary modules:

    from stream_topic.models import KmeansTM
    from stream_topic.utils import TMDataset

🛠️ Preprocessing

  1. Load the dataset and preprocess it for your model:

    dataset = TMDataset(language="chinese", stopwords_path='stream_topic/utils/common_stopwords.txt')
    dataset.fetch_dataset("THUCNews_small", dataset_path="stream_ZH_topic_data/preprocessed_datasets/THUCNews", source='local')
    dataset.preprocess(model_type="KmeansTM")

The model_type argument is optional, and further arguments can be specified. Default preprocessing steps are predefined for all included models.
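
As a rough picture of what Chinese preprocessing involves, the sketch below segments each document and drops stopwords. It is not the `TMDataset.preprocess` implementation: `preprocess_docs` and its parameters are hypothetical, and the segmenter is passed in as a callable so that, for example, `jieba.lcut` could be used when jieba is installed.

```python
def preprocess_docs(docs, segment, stopwords, min_token_len=1):
    """Segment each document and remove stopwords and empty tokens.

    `segment` is any callable mapping a string to a list of tokens,
    e.g. `jieba.lcut` (requires jieba) or `str.split` for pre-segmented text.
    """
    cleaned = []
    for text in docs:
        tokens = [token.strip() for token in segment(text)]
        tokens = [
            token
            for token in tokens
            if len(token) >= min_token_len and token not in stopwords
        ]
        cleaned.append(tokens)
    return cleaned


# Illustrative stopword set; a real run would load the stopword file instead.
stopwords = {"的", "了", "是"}
result = preprocess_docs(["我 喜欢 看 足球 的 比赛"], str.split, stopwords)
print(result)
```

Passing the segmenter as an argument keeps the sketch independent of any single segmentation package.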

🚀 Model fitting

  1. Choose the model you want to use and train it:

    # n_topics=14 matches the 14 news categories in THUCNews
    model = KmeansTM(embedding_model_name="TencentBAC/Conan-embedding-v1", stopwords_path='stream_topic/utils/common_stopwords.txt')
    model.fit(dataset, n_topics=14, language="chinese")
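
KmeansTM is described above as k-means followed by c-tfidf. The sketch below illustrates only the second step, using a common class-based TF-IDF formulation (term frequency within a cluster, weighted by log(1 + A / f_t), where A is the average number of words per cluster and f_t the total frequency of term t). Whether STREAM uses exactly this weighting is an assumption, the cluster labels are taken as given, and `ctfidf_top_words` is a hypothetical name.

```python
import math
from collections import Counter, defaultdict


def ctfidf_top_words(tokenized_docs, labels, top_k=3):
    """Rank the words of each cluster by a class-based TF-IDF score."""
    # Merge all documents of a cluster into one bag of words.
    cluster_counts = defaultdict(Counter)
    for tokens, label in zip(tokenized_docs, labels):
        cluster_counts[label].update(tokens)

    # f_t: frequency of each term over all clusters; A: average words per cluster.
    total_counts = Counter()
    for counts in cluster_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(cluster_counts)

    topics = {}
    for label, counts in cluster_counts.items():
        cluster_size = sum(counts.values())
        scores = {
            term: (tf / cluster_size) * math.log(1 + avg_words / total_counts[term])
            for term, tf in counts.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        topics[label] = ranked[:top_k]
    return topics


# Labels stand in for k-means assignments on document embeddings (toy data).
docs = [["足球", "比赛", "球队"], ["球队", "进球", "比赛"],
        ["股票", "上涨", "市场"], ["市场", "股票", "比赛"]]
labels = [0, 0, 1, 1]
topics_sketch = ctfidf_top_words(docs, labels, top_k=2)
print(topics_sketch)
```

Words shared across clusters, such as 比赛 here, are down-weighted, which is why they rank below cluster-specific words.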

To get the topics, run:

    topics = model.get_topics()

✅ Evaluation

Specify the embedding model that the metrics should use for Chinese:

from stream_topic.metrics.metrics_config import MetricsConfig
MetricsConfig.set_PARAPHRASE_embedder("TencentBAC/Conan-embedding-v1")
MetricsConfig.set_SENTENCE_embedder("TencentBAC/Conan-embedding-v1")

To evaluate your model, use one of the metrics:

from stream_topic.metrics import ISIM, INT, ISH, Expressivity, NPMI

metric = ISIM()
metric.score(topics)

Scores for each topic are available via:

metric.score_per_topic(topics)

NPMI is computed on the source corpus, so it also takes the dataset:

metric = NPMI(dataset, language="chinese", stopwords='stream_topic/utils/common_stopwords.txt')
metric.score(topics)

Scores for each topic are available via:

metric.score_per_topic(topics)
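
For intuition about what NPMI measures, the sketch below computes it for one topic from document-level co-occurrence counts. This is a simplified illustration, not the library's `NPMI` class: co-occurrence is counted per document, word pairs that never co-occur are scored -1, pairs that appear together in every document are scored 0, and `npmi_topic` is a hypothetical name.

```python
import math
from itertools import combinations


def npmi_topic(top_words, tokenized_docs):
    """Mean NPMI over all pairs of a topic's top words (document-level counts)."""
    doc_sets = [set(doc) for doc in tokenized_docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # convention of this sketch: never co-occur
            continue
        denominator = -math.log(p12)
        if denominator == 0:
            scores.append(0.0)  # both words occur in every document
            continue
        scores.append(math.log(p12 / (p1 * p2)) / denominator)
    return sum(scores) / len(scores) if scores else 0.0


# Toy corpus of already-segmented documents.
corpus = [["足球", "比赛"], ["足球", "比赛", "球队"], ["股票", "市场"], ["股票", "上涨"]]
coherent = npmi_topic(["足球", "比赛"], corpus)
incoherent = npmi_topic(["足球", "股票"], corpus)
print(coherent, incoherent)
```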
