Adding Pretrained Ancient Greek Fasttext #1215
Hi @Zoomerhimmer, I am open-minded about changing to the fastText vectors.
I want to learn more about this. What makes them better? Is it that they use character n-grams in addition to word tokens (unlike word2vec)?
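For reference, the subword idea can be sketched in a few lines: fastText wraps each word in boundary markers and decomposes it into character n-grams (default n = 3–6); a word's vector is the sum of its n-gram vectors, so inflected forms of one stem share most of their features. This is a toy illustration of the decomposition only, not the actual fastText code:

```python
# Sketch of fastText-style subword decomposition. '<' and '>' mark word
# boundaries; the real model sums learned vectors for each n-gram.

def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return the fastText-style character n-grams of `word`."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i : i + n])
    return grams

# Two inflected forms of the same Greek stem share many n-grams, which is
# why subword models cope well with rich inflection:
shared = set(char_ngrams("λόγος")) & set(char_ngrams("λόγον"))
```

Word2vec, by contrast, treats λόγος and λόγον as unrelated atomic tokens.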
Here's our embeddings code: https://github.com/cltk/cltk/blob/c15e0b27bab2526710408d30d5ca3879964ca17c/src/cltk/embeddings/embeddings.py — would you like to work on this, @Zoomerhimmer? Important parts of the code:
There are probably a few other small things to change. Do you know where these grc fastText models are hosted online?
I actually can’t find any details about the fastText model itself beyond its name. However, I believe it outperforms the current NLPL model. Here’s an example (I got fastText to load with your helpful hints):
Another verb:
Here are some nouns:
To conclude, I suspect the character n-grams account for the quality difference, since Ancient Greek was one of the most highly inflected languages around. The 300 dimensions (versus 100) probably contribute too, though I don’t know how much. Maybe I should reach out to the researcher and ask about his corpus and training parameters.
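For what it's worth, neighbor checks like the ones above don't strictly need gensim: the `.vec` format is just a header line (vocab size, dimension) followed by one word per line. A minimal parser and cosine similarity, with made-up toy vectors rather than the real 300-dimensional model:

```python
import math

def load_vec(lines):
    """Parse word2vec/.vec text format: header 'count dim', then 'word v1 v2 ...'."""
    it = iter(lines)
    count, dim = map(int, next(it).split())
    vecs = {}
    for line in it:
        parts = line.rstrip("\n").split(" ")
        vecs[parts[0]] = [float(x) for x in parts[1 : dim + 1]]
    return vecs

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-in for the grc vectors:
toy = ["2 3", "λόγος 1.0 0.0 0.0", "λόγον 0.9 0.1 0.0"]
vecs = load_vec(toy)
sim = cosine(vecs["λόγος"], vecs["λόγον"])
```

Note that cosine similarity is scale-free, so 300 dimensions don't change the math, only how much information each vector can carry.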
They have four locations. Here is the Zenodo download link (https://zenodo.org/record/7630945/files/grc_fasttext_skipgram_nn2_xn10_dim300.vec?download=1), and this is the record page (https://zenodo.org/record/7630945); they have other formats and various sizes too. I was a bit confused by the _build_fasttext_url function. Do all fastText vectors have to be stored on dl.fbaipublicfiles.com? And how do we deal with licensing/attribution? I'm totally unfamiliar with that stuff.
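One possible shape for relaxing the dl.fbaipublicfiles.com assumption — purely a hypothetical sketch, the function and dict names below are made up and are not the actual cltk API — is a per-language override registry that falls back to the official host:

```python
# Hypothetical sketch only: route languages with community-hosted models
# (like grc on Zenodo) to their own URL, defaulting to the official
# fastText common-crawl vectors otherwise.
FASTTEXT_URL_OVERRIDES = {
    "grc": (
        "https://zenodo.org/record/7630945/files/"
        "grc_fasttext_skipgram_nn2_xn10_dim300.vec?download=1"
    ),
}

def build_fasttext_url(iso_code: str, model_name: str = "cc", dim: int = 300) -> str:
    """Return a download URL, preferring a registered override."""
    if iso_code in FASTTEXT_URL_OVERRIDES:
        return FASTTEXT_URL_OVERRIDES[iso_code]
    # Official fastText common-crawl vector naming scheme, e.g. cc.la.300.vec.gz
    return (
        f"https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/"
        f"{model_name}.{iso_code}.{dim}.vec.gz"
    )
```

Licensing could live alongside the URL in the same registry so attribution text ships with each model.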
Could these fastText embeddings be added to the models list? They're under CC BY 4.0. The 300-dimensional .vec is far superior to the NLPL word2vec model. I loaded it in gensim but didn't know how to plug it into the CLTK pipeline.