-
Notifications
You must be signed in to change notification settings - Fork 330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add Spacy process for Grc #1243
Conversation
@pharos-alexandria I am impressed as ever by your contributions. @clemsciences and I have been debating what to do for odycy, since it relies on an older version of spacy — and this version is incompatible with the version used by the Latin model of @diyclassics . We have two options, please tell me your opinion: (1) we write to the authors of Odycy and ask them to retrain on a recent spacy; (2) we try some kind of on-the-fly patching of spacy depending on which pipeline is chosen. (2) would solve the problem short-term, but it is a very poor engineering solution for a Python library 😇 |
@kylepjohnson, I would also go for option 1. Shoud I contact them or do you like to do that? (Another topic would be to allow users to choose which model to use in the SpaCy process.) Happy new year to you and @clemsciences! |
@pharos-alexandria Καλή Χρονιά!
If you kindly would, please do so and cc myself and Clément, so we can help answer any cltk questions. |
...it will take some days; I'm off to a conference until the weekend. |
Hello @pharos-alexandria, first of all happy new year!
I'll write the email tomorrow and I'll cc you and Kyle.
SpacyWrapper("lat", spacy_model) This code associates the spaCy model with the Latin language, because when a After having thought, it surely does not work. An else clause is missing here and the code should become: if not self.nlp:
self.nlp = self._load_model()
else:
self.nlps[language] = self.nlp |
No news from them. Shoud I try again or maybe someone else can send the email? Maybe it will have more weight. |
For the record, we have updated OdyCy to work with the latest version of SpaCy a couple of weeks ago, it should just work fine :D |
Thank you @x-tabdeveloping . @clemsciences and have this at the top of our list. |
Wonderful! Let us know if you need anything or experience any issues. |
spaCy = 3.7.2 to spaCy = 3.7.4
Readded Latin language for spaCy models.
I think this is ready for tests! |
"lat": "la_core_web_lg", | ||
} | ||
|
||
MAP_LANG_TO_SPACY_MODEL_URL: dict[str, str] = { | ||
"grc": "https://huggingface.co/chcaa/grc_odycy_joint_sm/resolve/main/grc_odycy_joint_sm-any-py3-none-any.whl", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised that they call their one model "sm" but there is no medium or large.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They actually have this model which looks to have better results: https://huggingface.co/chcaa/grc_odycy_joint_trf, https://huggingface.co/chcaa/grc_odycy_joint_trf/resolve/main/grc_odycy_joint_trf-any-py3-none-any.whl. Do we prefer a smaller model with good results or a heavier results with the best results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We always prefer the heavier model. Go with the bigger of the two and I will write to them, see what they say.
"https://huggingface.co/chcaa/grc_odycy_joint_sm/resolve/main/grc_odycy_joint_sm-any-py3-none-any.whl", | ||
] | ||
) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷♂️ shrugs in Python 🤷♂️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what shrugs mean. And I can see here duplicate code with Latin. I'm going to fix it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just being silly. I think this is the best of all the possible solutions.
The code looks fine @clemsciences . A few requests, before we push to PyPI:
Watch for any error or warning messages that pop up, then report them here. I ask this because in the previous Spacy model we took on (for Latin), there were about ~6 nonsensical inferences made by the model, which raises a warning in our morphosyntax system. (For example, it might say a adverb is plural.) |
I see what you mean, I'm going to work on it later. |
Removed duplicate code
And another thing: I'll wait for @pharos-alexandria to test it, since she is the author of this pull request. |
@clemsciences I ran the Demonstration notebook and all the important stuff in the code works. Here are the following
|
I'm off for a week, but I'll test asap. The larger model will surely be better, but of course also needs more resources... By the way: GreCy also made an update to their models, and the non-trf-models (i.e. lg) now also have NER which is working quite good. You could do a quick test at our new "Classical Language Dictionary": https://cld.bbaw.de/analyzer/text#grc_perseus_lg[ent_type_]. |
@pharos-alexandria We value your expert opinion greatly, so please do share when you are ready. Meanwhile, we will promote Odycy to master, since the results look promising. @clemsciences The changes I asked for, here they are in this branch. You may merge this one into the original PR, or do the steps yourself. https://github.com/cltk/cltk/tree/pharos-alexandria-grcspacy-plus-kj |
I think we can continue with this first version. Then we'll have to think of how to handle the model variants for a given language. |
how to handle the model variants for a given language.
I agree that the API needs to provide an easier way to call, and then chain, Processes.
|
I've added a Spacy process for Ancient Greek using odyCy (small pipeline) which performs better than the Stanza process for grc.
Caveat: If installing the model, it throws an error at the moment, as odyCy asks for "spacy>=3.5.0, <3.6.0". I do not know how to handle that.