Static Word Embeddings

This repository presents an evaluation and comparison of three popular static word embedding techniques—Singular Value Decomposition (SVD), Continuous Bag of Words (CBOW), and Skip-Gram—on the WordSim353 dataset using Spearman's Rank Correlation. The goal is to assess how well each model captures semantic similarity between words.


Assumptions

1. Window Size

  • window_size refers to the number of context words around a target word, i.e., ± window_size.
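A minimal sketch of this convention, assuming a sentence is already tokenized into a list of strings: the context of the word at position i spans positions i - window_size to i + window_size, excluding i itself.

```python
def get_context(tokens, i, window_size):
    """Return the context words within +/- window_size of position i."""
    left = tokens[max(0, i - window_size):i]
    right = tokens[i + 1:i + 1 + window_size]
    return left + right

# Example: window_size = 2 around "sat"
print(get_context(["the", "cat", "sat", "on", "the", "mat"], 2, 2))
# ['the', 'cat', 'on', 'the']
```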

2. Corpus Preprocessing

The text corpus is preprocessed before training the embedding models (a sketch of these steps follows the list):

  • All text is converted to lowercase
  • Punctuation is removed
  • Stopwords are filtered out
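A minimal sketch of these steps, assuming NLTK's English stopword list; the tokenizer and stopword list actually used in the repository may differ.

```python
import string
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, strip punctuation, and remove stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

print(preprocess("The cat, surprisingly, SAT on the mat!"))
# ['cat', 'surprisingly', 'sat', 'mat']
```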

3. Embeddings and Similarity Scores

  • Word embeddings are saved in model-specific directories.
  • Cosine similarity scores computed using the WordSim353-Crowd dataset are saved as: word_similarity.csv
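A sketch of how such a file could be produced, assuming embeddings is a dict mapping words to NumPy vectors and wordsim_pairs is an iterable of (word1, word2, human_score) tuples read from the WordSim353-Crowd file; these names are illustrative, not the repository's actual variables.

```python
import csv
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def save_similarities(embeddings, wordsim_pairs, path="word_similarity.csv"):
    """Write cosine similarities for every word pair covered by the vocabulary."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word1", "word2", "human_score", "cosine_similarity"])
        for w1, w2, human in wordsim_pairs:
            if w1 in embeddings and w2 in embeddings:
                writer.writerow([w1, w2, human, cosine(embeddings[w1], embeddings[w2])])
```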

4. Loading Pre-Trained Embeddings

  • Pre-trained embeddings can be loaded directly from the appropriate cell.
  • Training from scratch is possible but may be computationally expensive.
  • Loading embeddings will automatically run evaluation on the WordSim353 dataset.
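As a sketch, assuming the embeddings were saved with np.save as a vocabulary array plus an embedding matrix; the actual file names and formats in the model-specific directories may differ.

```python
import numpy as np

def load_embeddings(vocab_path="vocab.npy", matrix_path="embeddings.npy"):
    """Rebuild a word -> vector dict from a saved vocabulary and embedding matrix."""
    vocab = np.load(vocab_path, allow_pickle=True)
    matrix = np.load(matrix_path)
    return {word: matrix[i] for i, word in enumerate(vocab)}

embeddings = load_embeddings()
# evaluation on WordSim353 can then be run directly on `embeddings`
```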

1. Introduction

The effectiveness of SVD, CBOW, and Skip-Gram was assessed by analyzing how well each technique captures semantic similarity between word pairs. Evaluation was done using Spearman's rank correlation coefficient between model-computed similarities and the human-annotated scores in the WordSim353-Crowd dataset.
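Concretely, the evaluation reduces to Spearman's rank correlation between two aligned score lists, e.g. via SciPy. The values below are illustrative only, standing in for model cosine similarities and the corresponding human scores over the same word pairs.

```python
from scipy.stats import spearmanr

# model_scores: cosine similarities from the embedding model
# human_scores: corresponding human-annotated scores from WordSim353-Crowd
model_scores = [0.61, 0.12, 0.45, 0.80]   # illustrative values only
human_scores = [7.35, 1.62, 5.00, 8.94]

corr, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman correlation: {corr:.4f} (p = {p_value:.2e})")
```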


2. Singular Value Decomposition (SVD)

Model Configuration

  • Window Size: 4
  • Embedding Dimension: 200
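A minimal sketch of the count-based pipeline this configuration implies: build a word-word co-occurrence matrix with a ±4 window and keep the top 200 singular vectors via truncated SVD. The repository's actual implementation details (e.g. any PPMI or other weighting of the counts) may differ.

```python
import numpy as np
from scipy.sparse import dok_matrix
from scipy.sparse.linalg import svds

def svd_embeddings(sentences, vocab, window_size=4, dim=200):
    """sentences: list of token lists; vocab: word -> index dict."""
    cooc = dok_matrix((len(vocab), len(vocab)), dtype=np.float64)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            if w not in vocab:
                continue
            lo, hi = max(0, i - window_size), min(len(tokens), i + window_size + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in vocab:
                    cooc[vocab[w], vocab[tokens[j]]] += 1.0
    # truncated SVD: keep the top `dim` singular vectors as word embeddings
    U, S, _ = svds(cooc.tocsr(), k=dim)
    return U * S  # rows are word vectors, indexed by vocab
```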

Evaluation Results

  • Spearman Correlation: 0.1985
  • p-value: 0.00033
  • Words Evaluated: 323 / 353

Hyperparameter Sweep

| Window Size | Embedding Dim | Spearman Correlation | p-value |
|---|---|---|---|
| 3 | 100 | 0.1300 | 0.0193 |
| 3 | 200 | 0.1975 | 0.00036 |
| 3 | 300 | 0.2226 | 5.45e-05 |
| 3 | 400 | 0.2278 | 3.57e-05 |
| 4 | 100 | 0.1144 | 0.0399 |
| 4 | 200 | 0.1985 | 0.00033 |
| 4 | 300 | 0.2421 | 1.08e-05 |
| 4 | 400 | 0.2482 | 6.37e-06 |

Observation: Higher embedding dimensions and larger context windows (up to 4) generally yield better correlation scores, suggesting improved semantic representation.
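The sweep itself is a double loop over the two hyperparameters. A sketch using the hypothetical helpers train_svd_embeddings and evaluate_wordsim, which are illustrative names rather than functions defined in the repository:

```python
results = []
for window_size in (3, 4):
    for dim in (100, 200, 300, 400):
        # hypothetical helpers: train SVD embeddings, then score them on WordSim353
        embeddings = train_svd_embeddings(corpus, window_size=window_size, dim=dim)
        corr, p_value = evaluate_wordsim(embeddings)
        results.append((window_size, dim, corr, p_value))
```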


3. Continuous Bag of Words (CBOW)

Model Configuration

  • Window Size: 4
  • Embedding Dimension: 200
  • Learning Rate: 0.1
  • Epochs: 75
  • Batch Size: 1024
  • Negative Samples: 20
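For reference, a roughly comparable configuration expressed with gensim's Word2Vec in CBOW mode (sg=0). This is a sketch of equivalent settings, not the repository's own from-scratch implementation.

```python
from gensim.models import Word2Vec

# sentences: list of preprocessed token lists
cbow = Word2Vec(
    sentences,
    vector_size=200,   # embedding dimension
    window=4,          # +/- 4 context words
    sg=0,              # CBOW
    negative=20,       # negative samples
    alpha=0.1,         # initial learning rate
    epochs=75,
    min_count=1,
)
cbow.wv.save("cbow_vectors.kv")  # query with cbow.wv["word"]
```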

Evaluation Results

  • Spearman Correlation: 0.2212
  • p-value: 0.0000607
  • Words Evaluated: 323 / 353

Observation: CBOW outperforms SVD and captures moderate levels of semantic similarity.


4. Skip-Gram

Model Configuration

  • Window Size: 4
  • Embedding Dimension: 200
  • Negative Samples: 20
  • Learning Rate: 0.1
  • Epochs: 100
  • Batch Size: 1000
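A minimal sketch of the skip-gram with negative sampling (SGNS) objective this configuration points at, written in PyTorch. Batch construction, the unigram negative-sampling table, and the training loop are omitted, and the repository's actual implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNS(nn.Module):
    def __init__(self, vocab_size, dim=200):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, dim)   # center-word vectors
        self.out_embed = nn.Embedding(vocab_size, dim)  # context-word vectors

    def forward(self, center, context, negatives):
        # center: (B,), context: (B,), negatives: (B, K) word indices
        v = self.in_embed(center)                                 # (B, D)
        u_pos = self.out_embed(context)                           # (B, D)
        u_neg = self.out_embed(negatives)                         # (B, K, D)
        pos_score = (v * u_pos).sum(dim=1)                        # (B,)
        neg_score = torch.bmm(u_neg, v.unsqueeze(2)).squeeze(2)   # (B, K)
        # maximize log-sigmoid score of true pairs, minimize it for negatives
        return -(F.logsigmoid(pos_score).mean() + F.logsigmoid(-neg_score).mean())
```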

Evaluation Results

  • Spearman Correlation: 0.2558
  • p-value: 0.0000032
  • Words Evaluated: 323 / 353

Observation: Skip-Gram outperforms both CBOW and SVD, showing the strongest alignment with human judgments of word similarity.


5. Comparative Summary

| Model | Spearman Correlation | p-value |
|---|---|---|
| SVD | 0.1985 | 0.00033 |
| CBOW | 0.2212 | 0.0000607 |
| Skip-Gram | 0.2558 | 0.0000032 |

Conclusion

  • Skip-Gram provides the most effective word embeddings in terms of human-perceived similarity.
  • CBOW offers moderate performance with faster training time compared to Skip-Gram.
  • SVD, while simpler, performs less effectively than neural methods but still captures meaningful relationships.
