Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Spacy process for Grc #1243

Merged
merged 9 commits into from
May 12, 2024
Merged

Conversation

pharos-alexandria
Copy link
Contributor

I've added a Spacy process for Ancient Greek using odyCy (small pipeline) which performs better than the Stanza process for grc.

Caveat: If installing the model, it throws an error at the moment, as odyCy asks for "spacy>=3.5.0, <3.6.0". I do not know how to handle that.

@kylepjohnson
Copy link
Member

kylepjohnson commented Dec 31, 2023

@pharos-alexandria I am impressed as ever by your contributions. @clemsciences and I have been debating what to do for odycy, since it relies on an older version of spacy — and this version is incompatible with the version used by the Latin model of @diyclassics .

We have two options, please tell me your opinion: (1) we write to the authors of Odycy and ask them to retrain on a recent spacy; (2) we try some kind of on-the-fly patching of spacy depending on which pipeline is chosen.

(2) would solve the problem short-term, but it is a very poor engineering solution for a Python library 😇

@pharos-alexandria
Copy link
Contributor Author

@kylepjohnson, I would also go for option 1. Shoud I contact them or do you like to do that?

(Another topic would be to allow users to choose which model to use in the SpaCy process.)

Happy new year to you and @clemsciences!

@kylepjohnson
Copy link
Member

@pharos-alexandria Καλή Χρονιά!

Shoud I contact them or do you like to do that?

If you kindly would, please do so and cc myself and Clément, so we can help answer any cltk questions.

@pharos-alexandria
Copy link
Contributor Author

...it will take some days; I'm off to a conference until the weekend.

@clemsciences
Copy link
Member

clemsciences commented Jan 1, 2024

Hello @pharos-alexandria, first of all happy new year!

...it will take some days; I'm off to a conference until the weekend.

I'll write the email tomorrow and I'll cc you and Kyle.


(Another topic would be to allow users to choose which model to use in the SpaCy process.)
This is a very important topic. Before loading a process, the coder can load a spaCy model independently from CLTK and set it to the SpacyWrapper:

SpacyWrapper("lat", spacy_model)

This code associates the spaCy model with the Latin language, because when a SpacyProcess is called, the used algorithm is loaded from the SpacyWrapper. One of the remaining problem is that it downloads the default model for Latin even if it's not used in this case.

After having thought, it surely does not work. An else clause is missing here and the code should become:

if not self.nlp:
    self.nlp = self._load_model()
else:
    self.nlps[language] = self.nlp

@clemsciences
Copy link
Member

clemsciences commented Jan 24, 2024

I'll write the email tomorrow and I'll cc you and Kyle.

No news from them. Shoud I try again or maybe someone else can send the email? Maybe it will have more weight.

@x-tabdeveloping
Copy link

For the record, we have updated OdyCy to work with the latest version of SpaCy a couple of weeks ago, it should just work fine :D

@kylepjohnson
Copy link
Member

Thank you @x-tabdeveloping . @clemsciences and have this at the top of our list.

@x-tabdeveloping
Copy link

Wonderful! Let us know if you need anything or experience any issues.

clemsciences and others added 4 commits May 9, 2024 23:54
@clemsciences
Copy link
Member

I think this is ready for tests!

"lat": "la_core_web_lg",
}

MAP_LANG_TO_SPACY_MODEL_URL: dict[str, str] = {
"grc": "https://huggingface.co/chcaa/grc_odycy_joint_sm/resolve/main/grc_odycy_joint_sm-any-py3-none-any.whl",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised that they call their one model "sm" but there is no medium or large.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They actually have this model which looks to have better results: https://huggingface.co/chcaa/grc_odycy_joint_trf, https://huggingface.co/chcaa/grc_odycy_joint_trf/resolve/main/grc_odycy_joint_trf-any-py3-none-any.whl. Do we prefer a smaller model with good results or a heavier results with the best results?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We always prefer the heavier model. Go with the bigger of the two and I will write to them, see what they say.

"https://huggingface.co/chcaa/grc_odycy_joint_sm/resolve/main/grc_odycy_joint_sm-any-py3-none-any.whl",
]
)
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷‍♂️ shrugs in Python 🤷‍♂️

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know what shrugs mean. And I can see here duplicate code with Latin. I'm going to fix it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just being silly. I think this is the best of all the possible solutions.

@kylepjohnson
Copy link
Member

The code looks fine @clemsciences . A few requests, before we push to PyPI:

  1. Would you please do one more sanity check: "Run all" (locally) the notebook https://github.com/cltk/cltk/blob/master/notebooks/Demo%20of%20Pipeline%20for%20all%20languages.ipynb -- if it completes without error or many warnings, then I am mostly confident. You can let me know the results here.

  2. In a new notebook, which you do not need to save to the repo, run a large portion of this Greek text: https://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.01.0071:speech=18

Watch for any error or warning messages that pop up, then report them here. I ask this because in the previous Spacy model we took on (for Latin), there were about ~6 nonsensical inferences made by the model, which raises a warning in our morphosyntax system. (For example, it might say a adverb is plural.)

@clemsciences
Copy link
Member

The code looks fine @clemsciences . A few requests, before we push to PyPI:

1. Would you please do one more sanity check: "Run all" (locally) the notebook https://github.com/cltk/cltk/blob/master/notebooks/Demo%20of%20Pipeline%20for%20all%20languages.ipynb -- if it completes without error or many warnings, then I am mostly confident. You can let me know the results here.

2. In a new notebook, which you do not need to save to the repo, run a large portion of this Greek text: https://www.perseus.tufts.edu/hopper/text?doc=Perseus:text:1999.01.0071:speech=18

Watch for any error or warning messages that pop up, then report them here. I ask this because in the previous Spacy model we took on (for Latin), there were about ~6 nonsensical inferences made by the model, which raises a warning in our morphosyntax system. (For example, it might say a adverb is plural.)

I see what you mean, I'm going to work on it later.

Removed duplicate code
@clemsciences
Copy link
Member

And another thing: I'll wait for @pharos-alexandria to test it, since she is the author of this pull request.

@kylepjohnson
Copy link
Member

@clemsciences I ran the Demonstration notebook and all the important stuff in the code works. Here are the following
changes that need to be made, though:

  1. Update spacy in pyproject.toml to 3.7.4
  2. Then run make freeze dependencies (note: I am on Python v. 3.11.8)
  3. make installDev
  4. make notebook and run all cells of https://github.com/cltk/cltk/blob/master/notebooks/CLTK%20Demonstration.ipynb (Before updating to 3.7.4, this gave me a warning about possible incompatibility of the Greek model.)
  5. Make this a "minor" update, which could break some people's workflow; so 1.3.0
  6. If you want to push to PyPI yourself @clemsciences , run make publish after the above.

@pharos-alexandria
Copy link
Contributor Author

pharos-alexandria commented May 10, 2024

I'm off for a week, but I'll test asap. The larger model will surely be better, but of course also needs more resources...

By the way: GreCy also made an update to their models, and the non-trf-models (i.e. lg) now also have NER which is working quite good. You could do a quick test at our new "Classical Language Dictionary": https://cld.bbaw.de/analyzer/text#grc_perseus_lg[ent_type_].

@kylepjohnson
Copy link
Member

kylepjohnson commented May 10, 2024

@pharos-alexandria We value your expert opinion greatly, so please do share when you are ready. Meanwhile, we will promote Odycy to master, since the results look promising.

@clemsciences The changes I asked for, here they are in this branch. You may merge this one into the original PR, or do the steps yourself. https://github.com/cltk/cltk/tree/pharos-alexandria-grcspacy-plus-kj

@clemsciences
Copy link
Member

I think we can continue with this first version. Then we'll have to think of how to handle the model variants for a given language.

@clemsciences clemsciences merged commit 7014616 into cltk:master May 12, 2024
2 checks passed
@kylepjohnson
Copy link
Member

kylepjohnson commented May 13, 2024 via email

@pharos-alexandria pharos-alexandria deleted the grcspacy branch May 21, 2024 06:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants