Static Word Embeddings

This repository presents an evaluation and comparison of three popular static word embedding techniques—Singular Value Decomposition (SVD), Continuous Bag of Words (CBOW), and Skip-Gram—on the WordSim353 dataset using Spearman's Rank Correlation. The goal is to assess how well each model captures semantic similarity between words.


Assumptions

1. Window Size

  • window_size refers to the number of context words around a target word, i.e., ± window_size.
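A minimal sketch of this convention, assuming a sentence is already tokenized into a list of strings: the context of the word at position i spans positions i - window_size to i + window_size, excluding i itself.

```python
def get_context(tokens, i, window_size):
    """Return the context words within +/- window_size of position i."""
    left = tokens[max(0, i - window_size):i]
    right = tokens[i + 1:i + 1 + window_size]
    return left + right

# Example: window_size = 2 around "sat"
print(get_context(["the", "cat", "sat", "on", "the", "mat"], 2, 2))
# ['the', 'cat', 'on', 'the']
```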

2. Corpus Preprocessing

The text corpus is preprocessed before training the embedding models (a sketch of these steps follows the list):

  • All text is converted to lowercase
  • Punctuation is removed
  • Stopwords are filtered out
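A minimal sketch of these steps, assuming NLTK's English stopword list; the tokenizer and stopword list actually used in the repository may differ.

```python
import string
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

STOPWORDS = set(stopwords.words("english"))

def preprocess(text):
    """Lowercase, strip punctuation, and remove stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

print(preprocess("The cat, surprisingly, SAT on the mat!"))
# ['cat', 'surprisingly', 'sat', 'mat']
```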

3. Embeddings and Similarity Scores

  • Word embeddings are saved in model-specific directories.
  • Cosine similarity scores computed using the WordSim353-Crowd dataset are saved as: word_similarity.csv
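A sketch of how such a file could be produced, assuming embeddings is a dict mapping words to NumPy vectors and wordsim_pairs is an iterable of (word1, word2, human_score) tuples read from the WordSim353-Crowd file; these names are illustrative, not the repository's actual variables.

```python
import csv
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def save_similarities(embeddings, wordsim_pairs, path="word_similarity.csv"):
    """Write cosine similarities for every word pair covered by the vocabulary."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["word1", "word2", "human_score", "cosine_similarity"])
        for w1, w2, human in wordsim_pairs:
            if w1 in embeddings and w2 in embeddings:
                writer.writerow([w1, w2, human, cosine(embeddings[w1], embeddings[w2])])
```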

4. Loading Pre-Trained Embeddings

  • Pre-trained embeddings can be loaded directly from the appropriate cell.
  • Training from scratch is possible but may be computationally expensive.
  • Loading embeddings will automatically run evaluation on the WordSim353 dataset.
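As a sketch, assuming the embeddings were saved with np.save as a vocabulary array plus an embedding matrix; the actual file names and formats in the model-specific directories may differ.

```python
import numpy as np

def load_embeddings(vocab_path="vocab.npy", matrix_path="embeddings.npy"):
    """Rebuild a word -> vector dict from a saved vocabulary and embedding matrix."""
    vocab = np.load(vocab_path, allow_pickle=True)
    matrix = np.load(matrix_path)
    return {word: matrix[i] for i, word in enumerate(vocab)}

embeddings = load_embeddings()
# evaluation on WordSim353 can then be run directly on `embeddings`
```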

1. Introduction

The effectiveness of SVD, CBOW, and Skip-Gram was assessed by analyzing how well each technique captures semantic similarity between word pairs. Evaluation was done using Spearman's rank correlation coefficient between model-computed similarities and the human-annotated scores in the WordSim353-Crowd dataset.
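Concretely, the evaluation reduces to Spearman's rank correlation between two aligned score lists, e.g. via SciPy. The values below are illustrative only, standing in for model cosine similarities and the corresponding human scores over the same word pairs.

```python
from scipy.stats import spearmanr

# model_scores: cosine similarities from the embedding model
# human_scores: corresponding human-annotated scores from WordSim353-Crowd
model_scores = [0.61, 0.12, 0.45, 0.80]   # illustrative values only
human_scores = [7.35, 1.62, 5.00, 8.94]

corr, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman correlation: {corr:.4f} (p = {p_value:.2e})")
```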


2. Singular Value Decomposition (SVD)

Model Configuration

  • Window Size: 4
  • Embedding Dimension: 200
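A minimal sketch of the count-based pipeline this configuration implies: build a word-word co-occurrence matrix with a ±4 window and keep the top 200 singular vectors via truncated SVD. The repository's actual implementation details (e.g. any PPMI or other weighting of the counts) may differ.

```python
import numpy as np
from scipy.sparse import dok_matrix
from scipy.sparse.linalg import svds

def svd_embeddings(sentences, vocab, window_size=4, dim=200):
    """sentences: list of token lists; vocab: word -> index dict."""
    cooc = dok_matrix((len(vocab), len(vocab)), dtype=np.float64)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            if w not in vocab:
                continue
            lo, hi = max(0, i - window_size), min(len(tokens), i + window_size + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in vocab:
                    cooc[vocab[w], vocab[tokens[j]]] += 1.0
    # truncated SVD: keep the top `dim` singular vectors as word embeddings
    U, S, _ = svds(cooc.tocsr(), k=dim)
    return U * S  # rows are word vectors, indexed by vocab
```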

Evaluation Results

  • Spearman Correlation: 0.1985
  • p-value: 0.00033
  • Words Evaluated: 323 / 353

Hyperparameter Sweep

| Window Size | Embedding Dim | Spearman Correlation | p-value |
|---|---|---|---|
| 3 | 100 | 0.1300 | 0.0193 |
| 3 | 200 | 0.1975 | 0.00036 |
| 3 | 300 | 0.2226 | 5.45e-05 |
| 3 | 400 | 0.2278 | 3.57e-05 |
| 4 | 100 | 0.1144 | 0.0399 |
| 4 | 200 | 0.1985 | 0.00033 |
| 4 | 300 | 0.2421 | 1.08e-05 |
| 4 | 400 | 0.2482 | 6.37e-06 |

Observation: Higher embedding dimensions and larger context windows (up to 4) generally yield better correlation scores, suggesting improved semantic representation.
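The sweep itself is a double loop over the two hyperparameters. A sketch using the hypothetical helpers train_svd_embeddings and evaluate_wordsim, which are illustrative names rather than functions defined in the repository:

```python
results = []
for window_size in (3, 4):
    for dim in (100, 200, 300, 400):
        # hypothetical helpers: train SVD embeddings, then score them on WordSim353
        embeddings = train_svd_embeddings(corpus, window_size=window_size, dim=dim)
        corr, p_value = evaluate_wordsim(embeddings)
        results.append((window_size, dim, corr, p_value))
```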


3. Continuous Bag of Words (CBOW)

Model Configuration

  • Window Size: 4
  • Embedding Dimension: 200
  • Learning Rate: 0.1
  • Epochs: 75
  • Batch Size: 1024
  • Negative Samples: 20
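For reference, a roughly comparable configuration expressed with gensim's Word2Vec in CBOW mode (sg=0). This is a sketch of equivalent settings, not the repository's own from-scratch implementation.

```python
from gensim.models import Word2Vec

# sentences: list of preprocessed token lists
cbow = Word2Vec(
    sentences,
    vector_size=200,   # embedding dimension
    window=4,          # +/- 4 context words
    sg=0,              # CBOW
    negative=20,       # negative samples
    alpha=0.1,         # initial learning rate
    epochs=75,
    min_count=1,
)
cbow.wv.save("cbow_vectors.kv")  # query with cbow.wv["word"]
```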

Evaluation Results

  • Spearman Correlation: 0.2212
  • p-value: 0.0000607
  • Words Evaluated: 323 / 353

Observation: CBOW outperforms SVD and captures moderate levels of semantic similarity.


4. Skip-Gram

Model Configuration

  • Window Size: 4
  • Embedding Dimension: 200
  • Negative Samples: 20
  • Learning Rate: 0.1
  • Epochs: 100
  • Batch Size: 1000
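A minimal sketch of the skip-gram with negative sampling (SGNS) objective this configuration points at, written in PyTorch. Batch construction, the unigram negative-sampling table, and the training loop are omitted, and the repository's actual implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNS(nn.Module):
    def __init__(self, vocab_size, dim=200):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, dim)   # center-word vectors
        self.out_embed = nn.Embedding(vocab_size, dim)  # context-word vectors

    def forward(self, center, context, negatives):
        # center: (B,), context: (B,), negatives: (B, K) word indices
        v = self.in_embed(center)                                 # (B, D)
        u_pos = self.out_embed(context)                           # (B, D)
        u_neg = self.out_embed(negatives)                         # (B, K, D)
        pos_score = (v * u_pos).sum(dim=1)                        # (B,)
        neg_score = torch.bmm(u_neg, v.unsqueeze(2)).squeeze(2)   # (B, K)
        # maximize log-sigmoid score of true pairs, minimize it for negatives
        return -(F.logsigmoid(pos_score).mean() + F.logsigmoid(-neg_score).mean())
```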

Evaluation Results

  • Spearman Correlation: 0.2558
  • p-value: 0.0000032
  • Words Evaluated: 323 / 353

Observation: Skip-Gram outperforms both CBOW and SVD, showing the strongest alignment with human judgments of word similarity.


5. Comparative Summary

| Model | Spearman Correlation | p-value |
|---|---|---|
| SVD | 0.1985 | 0.00033 |
| CBOW | 0.2212 | 0.0000607 |
| Skip-Gram | 0.2558 | 0.0000032 |

Conclusion

  • Skip-Gram provides the most effective word embeddings in terms of human-perceived similarity.
  • CBOW offers moderate performance with faster training time compared to Skip-Gram.
  • SVD, while simpler, performs less effectively than neural methods but still captures meaningful relationships.
