Adding Pretrained Ancient Greek Fasttext #1215
Hi @Zoomerhimmer, I am open-minded about changing to the fastText vectors.
I want to learn more about this. What makes them better? Is it that they use character n-grams in addition to word tokens (unlike word2vec)?
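For reference, the subword idea can be sketched in a few lines: fastText wraps each word in boundary markers and decomposes it into character n-grams (default n = 3–6); a word's vector is the sum of its n-gram vectors, so inflected forms of one stem share most of their features. This is a toy illustration of the decomposition only, not the actual fastText code:

```python
# Sketch of fastText-style subword decomposition. '<' and '>' mark word
# boundaries; the real model sums learned vectors for each n-gram.

def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return the fastText-style character n-grams of `word`."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i : i + n])
    return grams

# Two inflected forms of the same Greek stem share many n-grams, which is
# why subword models cope well with rich inflection:
shared = set(char_ngrams("λόγος")) & set(char_ngrams("λόγον"))
```

Word2vec, by contrast, treats λόγος and λόγον as unrelated atomic tokens.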
Here's our embeddings code: https://github.com/cltk/cltk/blob/c15e0b27bab2526710408d30d5ca3879964ca17c/src/cltk/embeddings/embeddings.py — would you like to work on this, @Zoomerhimmer? Important parts of the code:
There are probably a few other small things to change. Do you know where these grc fastText models are hosted online?
I actually can’t find any details about the fastText model itself beyond its name. However, I believe it outperforms the current NLPL model. Here’s an example (I got fastText to load with your helpful hints):
Another verb:
Here are some nouns:
To conclude, I suspect the character n-grams account for the quality difference, since Ancient Greek was one of the most highly inflected languages around. The 300 dimensions (versus 100) probably contribute too, though I don’t know how much. Maybe I should reach out to the researcher and ask about his corpus and training parameters.
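For what it's worth, neighbor checks like the ones above don't strictly need gensim: the `.vec` format is just a header line (vocab size, dimension) followed by one word per line. A minimal parser and cosine similarity, with made-up toy vectors rather than the real 300-dimensional model:

```python
import math

def load_vec(lines):
    """Parse word2vec/.vec text format: header 'count dim', then 'word v1 v2 ...'."""
    it = iter(lines)
    count, dim = map(int, next(it).split())
    vecs = {}
    for line in it:
        parts = line.rstrip("\n").split(" ")
        vecs[parts[0]] = [float(x) for x in parts[1 : dim + 1]]
    return vecs

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy stand-in for the grc vectors:
toy = ["2 3", "λόγος 1.0 0.0 0.0", "λόγον 0.9 0.1 0.0"]
vecs = load_vec(toy)
sim = cosine(vecs["λόγος"], vecs["λόγον"])
```

Note that cosine similarity is scale-free, so 300 dimensions don't change the math, only how much information each vector can carry.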
They have four locations. Here is the Zenodo download link (https://zenodo.org/record/7630945/files/grc_fasttext_skipgram_nn2_xn10_dim300.vec?download=1), and this is the record page (https://zenodo.org/record/7630945); they have other formats and various sizes too. I was a bit confused by the _build_fasttext_url function. Do all fastText vectors have to be stored on dl.fbaipublicfiles.com? And how do we deal with licensing/attribution? I'm totally unfamiliar with that stuff.
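One possible shape for relaxing the dl.fbaipublicfiles.com assumption — purely a hypothetical sketch, the function and dict names below are made up and are not the actual cltk API — is a per-language override registry that falls back to the official host:

```python
# Hypothetical sketch only: route languages with community-hosted models
# (like grc on Zenodo) to their own URL, defaulting to the official
# fastText common-crawl vectors otherwise.
FASTTEXT_URL_OVERRIDES = {
    "grc": (
        "https://zenodo.org/record/7630945/files/"
        "grc_fasttext_skipgram_nn2_xn10_dim300.vec?download=1"
    ),
}

def build_fasttext_url(iso_code: str, model_name: str = "cc", dim: int = 300) -> str:
    """Return a download URL, preferring a registered override."""
    if iso_code in FASTTEXT_URL_OVERRIDES:
        return FASTTEXT_URL_OVERRIDES[iso_code]
    # Official fastText common-crawl vector naming scheme, e.g. cc.la.300.vec.gz
    return (
        f"https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/"
        f"{model_name}.{iso_code}.{dim}.vec.gz"
    )
```

Licensing could live alongside the URL in the same registry so attribution text ships with each model.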
Could these fastText embeddings be added to the models list? They're under CC BY 4.0. The 300-dimensional .vec is far superior to the NLPL word2vec model. I loaded it in gensim but didn't know how to plug it into the CLTK pipeline.