Questions about ambiguous POS and cases and about extracting lines of poetry. #1218

Open
AnnapolisKen opened this issue Apr 21, 2023 · 5 comments

@AnnapolisKen

I'm trying to figure out if CLTK would be the best library for a project I'm working on.

I'd like to quickly scan many corpora of Latin poetry for golden lines, at least to assemble a greedy list of possible lines that I can easily whittle down by eyeballing them.

So after about an hour of combing through these docs, I have a few questions. These are at a basic level because I haven't worked with CLTK and I am a very basic programmer. Feel free to tell me to just dive in and work with this until I understand it. And I'm not entirely sure if this is the forum for questions like this.

Ideally a golden line is a Latin hexameter or pentameter that consists of two nouns in different cases, each accompanied by an adjective/determinative/participle/proper name/number agreeing with it, with a verb separating the modifiers from their nouns. Here are the two common types:
Lucan 1.95 Fraterno primi maduerunt sanguine muri. (chiastic abVAB form)
Lucan 1.105 Assyrias Latio maculavit sanguine Carrhas (concentric abVBA form)
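
The definition above can be sketched as a small check over hand-annotated words. This is only an illustration: the role labels (`mod`, `verb`, `noun`), the case abbreviations, and the function name `golden_pattern` are assumptions made for this example, not CLTK output, and agreement is reduced to matching case alone.

```python
from typing import List, Optional, Tuple

# One (role, case) pair per word. Roles and cases are assigned by hand here;
# in a real pipeline they would come from a tagger.
Word = Tuple[str, Optional[str]]


def golden_pattern(words: List[Word]) -> Optional[str]:
    """Return 'abVAB' or 'abVBA' for a strict five-word golden line, else None."""
    if len(words) != 5:
        return None
    roles = tuple(role for role, _ in words)
    if roles != ("mod", "mod", "verb", "noun", "noun"):
        return None
    a, b, _, x, y = (case for _, case in words)
    # The two nouns (and so the two modifiers) must be in different cases.
    if a == b or x == y:
        return None
    if (a, b) == (x, y):
        return "abVAB"
    if (a, b) == (y, x):
        return "abVBA"
    return None


# Lucan 1.95: Fraterno primi maduerunt sanguine muri
lucan_1_95 = [("mod", "abl"), ("mod", "nom"), ("verb", None),
              ("noun", "abl"), ("noun", "nom")]
# Lucan 1.105: Assyrias Latio maculavit sanguine Carrhas
lucan_1_105 = [("mod", "acc"), ("mod", "abl"), ("verb", None),
               ("noun", "abl"), ("noun", "acc")]

print(golden_pattern(lucan_1_95))
print(golden_pattern(lucan_1_105))
```

The check is deliberately strict: a line with an extra conjunction or preposition is rejected, so a greedy first pass would need to drop such words before calling it.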

  1. Does the DOC object have a built-in method for extracting a line of poetry? Or would I just declare one by extracting the contents between newlines in the raw string?
  2. It would seem that getting the POS from all the words in a line would do 90% of the work, and getting the cltk.morphology.universal_dependencies_features.Case from the declined POS would do the rest. Is there any leeway for coding ambiguity into the raw corpora? I mean aren't there texts where a form could be understood as being either of two different POS? Or either of two different cases? To pull in all possibilities I'd like to include ambiguous lines as well.

Ambiguity in whether a form is a noun, proper noun, or adjective comes up in the phrases "ductorem Varum", "Deum Christum", "Augusti Caesaris" and "Ammon Jupiter" in the following golden lines. However we classify these parts of speech, they fulfill the form:

Manilius 1.899 cum fera ductorem rapuit Germania Varum
Prudentius Psychom. 74 atque innupta Deum concepit femina Christum
Corippus In laudem 4.138 Augusti priscum renovasti Caesaris aevum;
Rafael Landívar Rusticatio Mexicana 1.123 Et Lybicas Ammon contemnat Jupiter undas.
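
One way to keep ambiguous lines in the greedy list is to store every plausible analysis of each word and accept the line if any combination fits the pattern. The sketch below assumes hand-assigned candidate analyses (not tagger output); the names `is_golden` and `is_candidate` and the role and case labels are hypothetical.

```python
from itertools import product

ROLES = ("mod", "mod", "verb", "noun", "noun")


def is_golden(reading):
    """True if one concrete reading (a (role, case) pair per word) fits the pattern."""
    roles = tuple(role for role, _ in reading)
    if roles != ROLES:
        return False
    a, b, _, x, y = (case for _, case in reading)
    # Two different cases up front, and the same two cases on the nouns, in either order.
    return a != b and {a, b} == {x, y}


def is_candidate(candidates):
    """True if any combination of the per-word analyses gives a golden reading."""
    return any(is_golden(reading) for reading in product(*candidates))


# Corippus, In laudem 4.138: Augusti priscum renovasti Caesaris aevum
# Each set lists the analyses assumed possible for that word.
corippus = [
    {("mod", "gen"), ("noun", "gen")},   # Augusti: adjective or proper noun
    {("mod", "acc"), ("mod", "nom")},    # priscum
    {("verb", None)},                    # renovasti
    {("noun", "gen")},                   # Caesaris
    {("noun", "acc"), ("noun", "nom")},  # aevum: neuter, so nom. or acc.
]

print(is_candidate(corippus))
```

Because every combination is tried, the number of readings grows multiplicatively with the ambiguity per word; for five-word lines that stays small.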

@todd-cook
Collaborator

todd-cook commented Apr 21, 2023

> Does the DOC object have a built-in method for extracting a line of poetry?

--No, though it might be a good new feature. It's problematic because a Doc corresponds to sentences, not verses.

> Or would I just declare one by extracting the contents between newlines in the raw string?

--I would create a Doc object for each sentence in the verse; otherwise POS tags may not resolve as desired.
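
A minimal sketch of that idea in plain Python: split the text into sentences for tagging, but remember which verse line each token came from so the results can be regrouped by line afterwards. The tagger is passed in as a callable (`tag_sentence`, a hypothetical name) because this sketch does not call CLTK itself, and splitting sentences on `. ; ? !` is a naive assumption.

```python
import re
from collections import defaultdict

SENTENCE_END = re.compile(r"[.;?!]")


def sentences_with_lines(text):
    """Split verse text into sentences, each a list of (line_index, token) pairs."""
    sentences, current = [], []
    for line_no, line in enumerate(text.splitlines()):
        for tok in re.findall(r"\w+|[.;?!]", line):
            if SENTENCE_END.fullmatch(tok):
                if current:
                    sentences.append(current)
                    current = []
            else:
                current.append((line_no, tok))
    if current:
        sentences.append(current)
    return sentences


def regroup_by_line(sentences, tag_sentence):
    """Tag sentence by sentence, then regroup (token, tag) pairs by verse line."""
    lines = defaultdict(list)
    for sentence in sentences:
        tokens = [tok for _, tok in sentence]
        tags = tag_sentence(tokens)  # must return one tag per token
        for (line_no, tok), tag in zip(sentence, tags):
            lines[line_no].append((tok, tag))
    return dict(lines)


text = ("Imminet exitio vir coniugis, illa mariti;\n"
        "Lurida terribiles miscent aconita novercae.")

# Stand-in tagger that labels everything "?"; a real one would go here.
by_line = regroup_by_line(sentences_with_lines(text), lambda toks: ["?"] * len(toks))
print(by_line[1])
```

A sentence that runs across several verse lines is still tagged as one unit; only the regrouping step looks at line numbers.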

> It would seem that getting the POS from all the words in a line would do 90% of the work, and getting the cltk.morphology.universal_dependencies_features.Case from the declined POS would do the rest. Is there any leeway for coding ambiguity into the raw corpora? I mean aren't there texts where a form could be understood as being either of two different POS? Or either of two different cases? To pull in all possibilities I'd like to include ambiguous lines as well.

--No, there's no built-in method to accommodate POS tag ambiguity in a corpus. With poetry one has to expect the rules of language to be broken.

You probably want to have a look at the HexameterScanner class and related ones; knowing which vowels must be accented can help narrow the cases; POS taggers are mostly trained on prose and unaccented texts.

@AnnapolisKen
Author

Thank you so much! That explains a lot.
The workflow I had envisioned was to start with the HexameterScanner to check whether each line is a hexameter or pentameter, since spot-checking the corpora showed me that some prose intros are nestled among the poems.
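
That first filtering step might look roughly like the sketch below. The metrical test is injected as a callable (`scans_ok`, a hypothetical name) so the example runs without CLTK; the word-count heuristic is only a stand-in, and the prose line is made up for illustration.

```python
def keep_metrical_lines(lines, scans_ok):
    """Return (line_number, text) for every non-empty line that passes scans_ok."""
    kept = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped and scans_ok(stripped):
            kept.append((number, stripped))
    return kept


def looks_like_verse(line):
    """Crude stand-in for a real scanner: a verse line is rarely very long."""
    return 4 <= len(line.split()) <= 9


# With CLTK installed, scans_ok could instead wrap its hexameter scanner,
# e.g. by checking the validity flag on the scan result. The exact import
# path and attribute names should be verified against the installed version.

corpus_lines = [
    "PRAEFATIO: a made-up prose introduction that is much too long to be one verse",
    "Fraterno primi maduerunt sanguine muri",
    "Assyrias Latio maculavit sanguine Carrhas",
    "",
]

kept = keep_metrical_lines(corpus_lines, looks_like_verse)
print(kept)
```

Keeping the original line numbers makes it easy to go back to the corpus when eyeballing the candidates.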

@clemsciences
Member

Designing a specific Doc for poetry would be valuable for poetry study, but I don't think there is a single solution. Maybe CLTK can propose a solution to users and show advanced users how to customize it.
I remember I tried to transform the HexameterScanner into a Process but it was harder than expected.

@AnnapolisKen
Author

Well, in the corpus poetry lines seem to be organized via line breaks, so can/could the DOC include the line break as a POS or other token? Then the workflow could go as follows:

  1. If the lines lack macrons, first a hexameter scanner could work on the corpus from line-break to line-break and create a new corpus with macrons.
  2. The new corpus can be processed into a DOC to get POS, more accurately thanks to macrons.
  3. My new script could examine the DOC, checking the POS and cases of all the words between one line break and the next to evaluate as a likely candidate for a golden line.

Again, thank you for your patience with me and my project on a bizarre backwater of Latin metrics.

@AnnapolisKen
Author

I've been tasked to learn more about Bard AI, so here's what Bard just suggested.

import cltk

# Create a new grammar
grammar = cltk.parse.LabeledTreebankGrammar()

# Add a new rule to the grammar
grammar.add_rule('line_break', '\n')

# Parse the sentence
sentence = "Imminet exitio vir coniugis, illa mariti; Lurida terribiles miscent aconita novercae"
tokens = cltk.parse.pos_tag(sentence, grammar)

# Print the POS of the line break
print(tokens[tokens.index('\n')])
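
Nothing in this thread confirms that CLTK exposes the calls used in that suggestion, so it should be treated with caution. A library-independent way to keep line breaks visible in a token stream is sketched below: a sentinel token marks each break, real words go to the tagger, and the sentinel is given its own tag afterwards. The sentinel string, the `LINE_BREAK` tag, and the `tag_words` callable are all assumptions made for this example.

```python
import re

LINE_BREAK = "<LB>"  # hypothetical sentinel marking the end of a verse line


def tokenize_keeping_breaks(text):
    """Tokenize on word characters and append a sentinel after every line."""
    tokens = []
    for line in text.splitlines():
        tokens.extend(re.findall(r"\w+", line))
        tokens.append(LINE_BREAK)
    return tokens


def tag_keeping_breaks(tokens, tag_words):
    """Tag only the real words, then put the sentinels back with their own tag."""
    words = [tok for tok in tokens if tok != LINE_BREAK]
    tags = iter(tag_words(words))  # tag_words must return one tag per word
    return [
        (tok, "LINE_BREAK") if tok == LINE_BREAK else (tok, next(tags))
        for tok in tokens
    ]


text = ("Imminet exitio vir coniugis, illa mariti;\n"
        "Lurida terribiles miscent aconita novercae")

tokens = tokenize_keeping_breaks(text)
# Stand-in tagger that labels every word "X"; a real tagger would go here.
tagged = tag_keeping_breaks(tokens, lambda words: ["X"] * len(words))
print(tagged[6])
```

The tagger never sees the sentinel, so it cannot be confused by it, yet the tagged output can still be cut into verse lines at every `LINE_BREAK` entry.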
