Using word embeddings, TFIDF and text-hashing to cluster and visualise text documents

ttavni/2D_Text_Clustering

Document Clustering

Made by Timothy Avni (tavni96) & Peter Simkin (Psimkin)

We present a way to cluster text documents by stacking features from TFIDF, pretrained word embeddings and text hashing.
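To make the idea of stacking concrete, here is a minimal sketch using only the Python standard library. It is not the repository's implementation: the helper names (`tfidf_features`, `hashed_features`, `stack_features`), the whitespace tokenizer, the smoothed IDF formula and the bucket count are all assumptions chosen for illustration. Pretrained word embeddings are left out because they need an external model; a third block (for example, averaged word vectors per document) would be stacked in the same way.

```python
import hashlib
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def tfidf_features(docs):
    """Return one TF-IDF vector per document over the corpus vocabulary."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumed variant) so no term gets a zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1
        rows.append([counts[t] / total * idf[t] for t in vocab])
    return rows


def hashed_features(docs, n_buckets=8):
    """Hashing trick: count tokens into a fixed number of buckets."""
    rows = []
    for doc in docs:
        row = [0.0] * n_buckets
        for tok in tokenize(doc):
            # md5 is used only as a stable hash, not for security.
            digest = hashlib.md5(tok.encode("utf-8")).hexdigest()
            row[int(digest, 16) % n_buckets] += 1.0
        rows.append(row)
    return rows


def stack_features(*blocks):
    """Concatenate feature blocks column-wise, one row per document."""
    return [sum((list(r) for r in rows), []) for rows in zip(*blocks)]


# Hypothetical toy corpus, for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock markets fell sharply today",
]
stacked = stack_features(tfidf_features(corpus), hashed_features(corpus))
print(len(stacked), len(stacked[0]))  # rows, columns of the stacked matrix
```

Each document ends up as a single row whose columns are the TF-IDF weights followed by the hashed counts. In practice a library such as scikit-learn would typically produce sparse matrices for both blocks, but the stacking step is the same column-wise concatenation.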

We then reduce the stacked features to two dimensions with UMAP and cluster them with HDBSCAN to produce a 2-D D3.js visualisation.

from TextProcessor.features import text_features
from TextProcessor.reduction import Mapper
from TextProcessor.labeller import automatic_labelling
from TextProcessor.viz import Visualiser

corpus = [...]  # placeholder: replace with your list of documents
tf = text_features(corpus)           # build the stacked text features
data = Mapper(tf.values, corpus)     # reduction step
mapping = automatic_labelling(data)  # label the mapped documents automatically
Visualiser(mapping, folder='test')   # render the visualisation

Viz