add Spacy process for Grc #1243

pharos-alexandria · 2023-12-31T16:49:28Z

I've added a Spacy process for Ancient Greek using odyCy (small pipeline) which performs better than the Stanza process for grc.

Caveat: Installing the model currently throws an error, because odyCy requires "spacy>=3.5.0, <3.6.0". I do not know how to handle that.
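To make the conflict concrete, here is a minimal sketch of the range check that the requirement implies. The helper names `parse_version` and `satisfies` are invented for illustration and are not part of spaCy, pip, or CLTK; the parsing only handles plain numeric releases such as "3.7.2".

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '3.7.2' into (3, 7, 2); plain numeric releases only."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version: str, lower: str, upper: str) -> bool:
    """Return True when lower <= version < upper."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper)


# Bounds taken from the odyCy requirement quoted above.
print(satisfies("3.5.4", "3.5.0", "3.6.0"))  # a 3.5.x release fits the range
print(satisfies("3.7.2", "3.5.0", "3.6.0"))  # a 3.7.x release does not
```

Any spaCy 3.7.x install therefore falls outside the range the model declares, which is presumably why the installation fails.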

kylepjohnson · 2023-12-31T18:12:08Z

@pharos-alexandria I am impressed as ever by your contributions. @clemsciences and I have been debating what to do for odycy, since it relies on an older version of spacy — and this version is incompatible with the version used by the Latin model of @diyclassics .

We have two options, please tell me your opinion: (1) we write to the authors of Odycy and ask them to retrain on a recent spacy; (2) we try some kind of on-the-fly patching of spacy depending on which pipeline is chosen.

(2) would solve the problem short-term, but it is a very poor engineering solution for a Python library 😇

pharos-alexandria · 2024-01-01T09:21:45Z

@kylepjohnson, I would also go for option 1. Should I contact them, or would you like to do that?

(Another topic would be to allow users to choose which model to use in the SpaCy process.)

Happy new year to you and @clemsciences!

kylepjohnson · 2024-01-01T17:26:11Z

@pharos-alexandria Καλή Χρονιά!

> Should I contact them, or would you like to do that?

If you kindly would, please do so and cc myself and Clément, so we can help answer any cltk questions.

pharos-alexandria · 2024-01-01T17:55:02Z

...it will take some days; I'm off to a conference until the weekend.

clemsciences · 2024-01-01T22:48:34Z

Hello @pharos-alexandria, first of all happy new year!

> ...it will take some days; I'm off to a conference until the weekend.

I'll write the email tomorrow and I'll cc you and Kyle.

> (Another topic would be to allow users to choose which model to use in the SpaCy process.)

This is a very important topic. Before loading a process, you can load a spaCy model independently of CLTK and pass it to the SpacyWrapper:

SpacyWrapper("lat", spacy_model)

This code associates the spaCy model with the Latin language: when a SpacyProcess is called, it takes the model to use from the SpacyWrapper. One remaining problem is that the default Latin model is still downloaded, even though it is not used in this case.

On reflection, this almost certainly does not work as written. An else clause is missing here, and the code should become:

if not self.nlp:
    self.nlp = self._load_model()
else:
    self.nlps[language] = self.nlp
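
As a rough illustration of the idea (not the actual SpacyWrapper code), the sketch below keeps one model per language and only falls back to a default loader when no model was supplied. `ModelRegistry` and `fake_loader` are hypothetical names, and plain strings stand in for real spaCy pipelines so that the snippet runs without spaCy installed.

```python
from typing import Callable, Optional


class ModelRegistry:
    """Hypothetical stand-in for a wrapper that caches one model per language."""

    # Shared on purpose: every instance sees the same language -> model cache.
    _models: dict[str, object] = {}

    def __init__(
        self,
        language: str,
        loader: Callable[[str], object],
        model: Optional[object] = None,
    ) -> None:
        self.language = language
        if model is not None:
            # A caller-supplied model is registered; the default is never loaded.
            ModelRegistry._models[language] = model
        elif language not in ModelRegistry._models:
            # Nothing registered yet: fall back to the default loader.
            ModelRegistry._models[language] = loader(language)
        self.nlp = ModelRegistry._models[language]


calls: list[str] = []


def fake_loader(language: str) -> str:
    """Pretend to download a default model and record that it happened."""
    calls.append(language)
    return f"default-{language}"


custom = ModelRegistry("lat", fake_loader, model="my-custom-model")
later = ModelRegistry("lat", fake_loader)
print(later.nlp, calls)
```

In this sketch a later lookup for "lat" reuses the registered model, and the default loader is never called for that language, which is the behaviour the missing else clause is meant to provide.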

clemsciences · 2024-01-24T21:30:14Z

> I'll write the email tomorrow and I'll cc you and Kyle.

No news from them. Should I try again, or could someone else send the email? Maybe it would carry more weight.

x-tabdeveloping · 2024-04-23T14:50:44Z

For the record, we have updated OdyCy to work with the latest version of SpaCy a couple of weeks ago, it should just work fine :D

kylepjohnson · 2024-04-24T01:12:24Z

Thank you @x-tabdeveloping . @clemsciences and I have this at the top of our list.

x-tabdeveloping · 2024-04-24T11:43:34Z

Wonderful! Let us know if you need anything or experience any issues.

spaCy = 3.7.2 to spaCy = 3.7.4

Readded Latin language for spaCy models.

clemsciences · 2024-05-09T22:23:46Z

I think this is ready for tests!

kylepjohnson · 2024-05-09T22:28:07Z

src/cltk/dependency/spacy_wrapper.py

    "lat": "la_core_web_lg",
 }

 MAP_LANG_TO_SPACY_MODEL_URL: dict[str, str] = {
+    "grc": "https://huggingface.co/chcaa/grc_odycy_joint_sm/resolve/main/grc_odycy_joint_sm-any-py3-none-any.whl",


I'm surprised that they call their one model "sm" but there is no medium or large.

They actually have this model, which appears to give better results: https://huggingface.co/chcaa/grc_odycy_joint_trf, https://huggingface.co/chcaa/grc_odycy_joint_trf/resolve/main/grc_odycy_joint_trf-any-py3-none-any.whl. Do we prefer a smaller model with good results or a heavier model with the best results?

We always prefer the heavier model. Go with the bigger of the two and I will write to them, see what they say.

kylepjohnson · 2024-05-09T22:29:25Z

src/cltk/dependency/spacy_wrapper.py

+                    "https://huggingface.co/chcaa/grc_odycy_joint_sm/resolve/main/grc_odycy_joint_sm-any-py3-none-any.whl",
+                ]
+            )
+        else:


🤷‍♂️ shrugs in Python 🤷‍♂️

I don't know what "shrugs" means. I can also see that this duplicates the Latin code. I'm going to fix it.

I am just being silly. I think this is the best of all the possible solutions.
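
One way to avoid repeating the install logic per language, sketched under assumptions: a single lookup table plus one helper that builds the pip command. `MODEL_URLS` and `pip_install_command` are illustrative names, not the identifiers used in spacy_wrapper.py, and only the Greek URL from the diff above is filled in.

```python
import sys

# Illustrative table; only the Greek wheel URL shown in the diff is listed here.
MODEL_URLS: dict[str, str] = {
    "grc": "https://huggingface.co/chcaa/grc_odycy_joint_sm/resolve/main/grc_odycy_joint_sm-any-py3-none-any.whl",
}


def pip_install_command(language: str) -> list[str]:
    """Build the pip command that would install the model for one language."""
    try:
        url = MODEL_URLS[language]
    except KeyError:
        raise ValueError(f"No spaCy model URL registered for {language!r}") from None
    # Using the running interpreter keeps the install in the active environment.
    return [sys.executable, "-m", "pip", "install", url]


print(pip_install_command("grc"))
```

The returned list can be handed to `subprocess.check_call`; adding another language then only means adding one entry to the table.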

kylepjohnson · 2024-05-09T22:36:30Z

The code looks fine @clemsciences . A few requests, before we push to PyPI:

1. Would you please do one more sanity check: "Run all" (locally) the notebook https://github.com/cltk/cltk/blob/master/notebooks/Demo%20of%20Pipeline%20for%20all%20languages.ipynb -- if it completes without error or many warnings, then I am mostly confident. You can let me know the results here.
2. In a new notebook, which you do not need to save to the repo, run a large portion of this Greek text: https://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.01.0071:speech=18

Watch for any error or warning messages that pop up, then report them here. I ask this because in the previous Spacy model we took on (for Latin), the model made about 6 nonsensical inferences, each of which raises a warning in our morphosyntax system. (For example, it might say an adverb is plural.)
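
A check of this kind could be sketched as follows. This is not CLTK's morphosyntax system: the `IMPLAUSIBLE` table and `find_implausible` are hypothetical, the single rule only mirrors the "plural adverb" example above, and tokens are plain tuples so the snippet runs without a model. With spaCy, the same triples could be built from `token.text`, `token.pos_` and `token.morph.to_dict()`.

```python
# Hypothetical rule table: features that should not appear on a given UPOS tag.
# Only the "plural adverb" case from the comment above is encoded here.
IMPLAUSIBLE: dict[str, set[str]] = {
    "ADV": {"Number"},
}


def find_implausible(
    tokens: list[tuple[str, str, dict[str, str]]],
) -> list[tuple[str, str, str]]:
    """Return (text, upos, feature) for every feature the table rules out."""
    problems = []
    for text, upos, feats in tokens:
        # Intersect the token's feature names with the forbidden ones for its tag.
        for feature in sorted(feats.keys() & IMPLAUSIBLE.get(upos, set())):
            problems.append((text, upos, feature))
    return problems


sample = [
    ("καλῶς", "ADV", {"Number": "Plur"}),                  # suspicious analysis
    ("ἄνδρες", "NOUN", {"Case": "Nom", "Number": "Plur"}),  # plausible analysis
]
print(find_implausible(sample))
```

Counting the flagged triples over a long text gives a quick number to report back alongside any warnings.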

clemsciences · 2024-05-09T22:44:22Z

> The code looks fine @clemsciences . A few requests, before we push to PyPI:
>
> 1. Would you please do one more sanity check: "Run all" (locally) the notebook https://github.com/cltk/cltk/blob/master/notebooks/Demo%20of%20Pipeline%20for%20all%20languages.ipynb -- if it completes without error or many warnings, then I am mostly confident. You can let me know the results here.
> 2. In a new notebook, which you do not need to save to the repo, run a large portion of this Greek text: https://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.01.0071:speech=18
>
> Watch for any error or warning messages that pop up, then report them here. I ask this because in the previous Spacy model we took on (for Latin), the model made about 6 nonsensical inferences, each of which raises a warning in our morphosyntax system. (For example, it might say an adverb is plural.)

I see what you mean, I'm going to work on it later.

Removed duplicate code

clemsciences · 2024-05-09T22:46:04Z

And another thing: I'll wait for @pharos-alexandria to test it, since she is the author of this pull request.

kylepjohnson · 2024-05-10T06:02:57Z

@clemsciences I ran the Demonstration notebook and all the important stuff in the code works. The following changes still need to be made, though:

1. Update spacy in pyproject.toml to 3.7.4
2. Then run make freeze dependencies (note: I am on Python v. 3.11.8)
3. make installDev
4. make notebook and run all cells of https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20Demonstration.ipynb (Before updating to 3.7.4, this gave me a warning about possible incompatibility of the Greek model.)
5. Make this a "minor" update, which could break some people's workflow; so 1.3.0
6. If you want to push to PyPI yourself @clemsciences , run make publish after the above.
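
For reference, a "minor" bump in the MAJOR.MINOR.PATCH scheme increments the middle number and resets the patch number. The sketch below assumes a purely numeric three-part version string; `bump_minor` is an illustrative helper and the starting version "1.2.7" is made up for the example.

```python
def bump_minor(version: str) -> str:
    """Increment the minor component and reset the patch component to 0."""
    major, minor, _patch = (int(part) for part in version.split("."))
    return f"{major}.{minor + 1}.0"


# "1.2.7" is a made-up starting point; any 1.2.x release bumps to 1.3.0.
print(bump_minor("1.2.7"))
```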

pharos-alexandria · 2024-05-10T08:26:56Z

I'm off for a week, but I'll test asap. The larger model will surely be better, but of course also needs more resources...

By the way: GreCy also updated their models, and the non-trf models (i.e. lg) now also have NER, which works quite well. You can do a quick test at our new "Classical Language Dictionary": https://cld.bbaw.de/analyzer/text#grc_perseus_lg[ent_type_].

kylepjohnson · 2024-05-10T18:29:28Z

@pharos-alexandria We value your expert opinion greatly, so please do share when you are ready. Meanwhile, we will promote Odycy to master, since the results look promising.

@clemsciences The changes I asked for are in this branch. You may merge it into the original PR, or do the steps yourself. https://github.com/cltk/cltk/tree/pharos-alexandria-grcspacy-plus-kj

clemsciences · 2024-05-12T22:36:58Z

I think we can continue with this first version. Then we'll have to think of how to handle the model variants for a given language.

kylepjohnson · 2024-05-13T01:53:18Z

> how to handle the model variants for a given language.

I agree that the API needs to provide an easier way to call, and then chain, Processes.

add Spacy process for Grc

0348621

clemsciences and others added 4 commits May 9, 2024 23:54

Merge branch 'master' into grcspacy

5c305dd

Update pyproject.toml

59b3857

spaCy = 3.7.2 to spaCy = 3.7.4

Update pyproject.toml

1a87e87

Update spacy_wrapper.py

534f306

Readded Latin language for spaCy models.

kylepjohnson approved these changes May 9, 2024


Update spacy_wrapper.py

420d9b5

Removed duplicate code

clemsciences added greek feature-request labels May 9, 2024

clemsciences assigned pharos-alexandria and clemsciences May 9, 2024

kyle's updates to PR cltk#1243

27c36a7

kylepjohnson added 2 commits May 11, 2024 10:45

update print messages

fd49d2d

bump vers to 1.3.0

ff6f5d2

clemsciences merged commit 7014616 into cltk:master May 12, 2024
2 checks passed

pharos-alexandria deleted the grcspacy branch May 21, 2024 06:35
