This code is adapted from https://github.com/facebookresearch/fastText/tree/main/alignment
This directory provides code for learning alignments between word embeddings in different languages.
The code is in Python 3 and requires NumPy.
The script example.sh shows how to use this code to learn and evaluate a bilingual alignment of word embeddings.
In this adapted version, we align the fastText embeddings that were trained on the Common Crawl.
For our language pair (English->Vietnamese) run the script like this:
./example.sh en vi FacebookAI/roberta-base phucdev/vi-bpe-culturax-4g-sample
The word embeddings used in [1] can be found on the fastText project page and the supervised bilingual lexicons on the MUSE project page.
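For orientation, here is a minimal sketch of reading embeddings in the fastText text format (a header line with the word count and dimension, then one word and its vector per line). The function name `load_vectors` and the in-memory sample are illustrative assumptions, not necessarily what the scripts in this directory use.

```python
import io
import numpy as np

def load_vectors(stream, max_words=None):
    """Read fastText text-format vectors into a word list and an (n, d) array."""
    # Header line: "<number of words> <dimension>".
    _, dim = map(int, stream.readline().split())
    words, rows = [], []
    for line in stream:
        if max_words is not None and len(words) >= max_words:
            break
        parts = line.rstrip().split(" ")
        words.append(parts[0])
        rows.append([float(v) for v in parts[1:dim + 1]])
    vectors = np.array(rows, dtype=np.float64).reshape(len(words), dim)
    # Unit-normalise rows so that dot products are cosine similarities.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return words, vectors / np.maximum(norms, 1e-8)

# Tiny in-memory stand-in for a real .vec file (hypothetical data).
sample = io.StringIO("2 3\nhello 3.0 0.0 4.0\nxin 0.0 2.0 0.0\n")
words, vectors = load_vectors(sample)
```

For a real file, the same function can be called on a handle opened with `open(path, encoding="utf-8")`; `max_words` limits how many vectors are read into memory.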
The script align.py aligns word embeddings from two languages using a bilingual lexicon as supervision.
The details of this approach can be found in [1].
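To make the idea of lexicon-supervised alignment concrete, the sketch below solves the simpler orthogonal Procrustes problem in closed form with NumPy. This is a common baseline for supervised alignment and is not the retrieval criterion described in [1]. The function names and the synthetic data are assumptions for illustration: row i of `x_src` and row i of `y_tgt` stand for the embeddings of the i-th lexicon pair.

```python
import numpy as np

def procrustes(x_src, y_tgt):
    """Return the orthogonal R minimising ||x_src @ R.T - y_tgt||_F."""
    u, _, vt = np.linalg.svd(y_tgt.T @ x_src)
    return u @ vt

def translate(x_src, y_tgt, R):
    """Index of the nearest target row (cosine similarity) for each mapped source row."""
    mapped = x_src @ R.T
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    target = y_tgt / np.linalg.norm(y_tgt, axis=1, keepdims=True)
    return (mapped @ target.T).argmax(axis=1)

# Synthetic check (hypothetical data): the target space is an exact rotation
# of the source space, so the recovered mapping should match that rotation.
rng = np.random.default_rng(0)
x_src = rng.standard_normal((50, 5))
q_true, _ = np.linalg.qr(rng.standard_normal((5, 5)))
y_tgt = x_src @ q_true.T

R = procrustes(x_src, y_tgt)
predicted = translate(x_src, y_tgt, R)
```

On real embeddings the two spaces are only approximately related by a rotation, so the fraction of source words whose nearest neighbour is a correct translation is typically used as the evaluation measure.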
Wikipedia fastText embeddings aligned with our method can be found here.
If you use the supervised alignment method, please cite:
[1] A. Joulin, P. Bojanowski, T. Mikolov, H. Jégou, E. Grave, Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion, EMNLP 2018.
@InProceedings{joulin2018loss,
title={Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion},
author={Joulin, Armand and Bojanowski, Piotr and Mikolov, Tomas and J\'egou, Herv\'e and Grave, Edouard},
year={2018},
booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
}