Questions about ambiguous POS and cases and about extracting lines of poetry. #1218

AnnapolisKen · 2023-04-21T01:58:20Z

I'm trying to figure out if CLTK would be the best library for a project I'm working on.

I'd like to quickly scan many corpora of Latin poetry for golden lines, at least to assemble a greedy list of possible lines that I can easily whittle down by eyeballing them.

So after about an hour of combing through these docs, I have a few questions. These are at a basic level because I haven't worked with CLTK and I am a very basic programmer. Feel free to tell me to just dive in and work with this until I understand it. And I'm not entirely sure if this is the forum for questions like this.

Ideally a golden line is a Latin hexameter or pentameter that consists of two nouns in different cases, each accompanied by an adjective/determinative/participle/proper name/number agreeing with it, with a verb separating the modifiers from their nouns. Here are the two common types:
Lucan 1.95 Fraterno primi maduerunt sanguine muri. (chiastic abVAB form)
Lucan 1.105 Assyrias Latio maculavit sanguine Carrhas (concentric abVBA form)
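
A minimal sketch of the two shapes above, assuming every word already carries a part of speech and a case. The `Tok` class, the tag names, and the rule that conjunctions and prepositions are skipped are assumptions made for illustration. The tags are assigned by hand here, not produced by CLTK.

```python
from dataclasses import dataclass
from typing import List, Optional

MODIFIERS = {"ADJ", "DET", "NUM", "PROPN"}  # participles would need their own handling
NOUNS = {"NOUN", "PROPN"}
SKIPPED = {"ADP", "CCONJ", "SCONJ"}         # assumption: function words do not count


@dataclass
class Tok:
    form: str
    pos: str                     # UD-style tag, hand-assigned in this sketch
    case: Optional[str] = None   # e.g. "Nom", "Acc", "Abl"


def golden_shape(toks: List[Tok]) -> Optional[str]:
    """Return 'abVAB', 'abVBA', or None for one tagged line."""
    content = [t for t in toks if t.pos not in SKIPPED]
    if len(content) != 5:
        return None
    m1, m2, verb, n1, n2 = content
    if verb.pos != "VERB":
        return None
    if not {m1.pos, m2.pos} <= MODIFIERS or not {n1.pos, n2.pos} <= NOUNS:
        return None
    # The two nouns must have known, different cases.
    if None in (n1.case, n2.case) or n1.case == n2.case:
        return None
    # Agreement is approximated by case alone.
    if (m1.case, m2.case) == (n1.case, n2.case):
        return "abVAB"
    if (m1.case, m2.case) == (n2.case, n1.case):
        return "abVBA"
    return None


lucan_1_95 = [
    Tok("Fraterno", "ADJ", "Abl"), Tok("primi", "ADJ", "Nom"),
    Tok("maduerunt", "VERB"),
    Tok("sanguine", "NOUN", "Abl"), Tok("muri", "NOUN", "Nom"),
]
lucan_1_105 = [
    Tok("Assyrias", "ADJ", "Acc"), Tok("Latio", "ADJ", "Abl"),
    Tok("maculavit", "VERB"),
    Tok("sanguine", "NOUN", "Abl"), Tok("Carrhas", "PROPN", "Acc"),
]
print(golden_shape(lucan_1_95), golden_shape(lucan_1_105))
```

Checking agreement by case alone is deliberately loose; a stricter version would also compare number and gender.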

Does the DOC object have a built-in method for extracting a line of poetry? Or would I just declare one by extracting the contents between newlines in the raw string?
It would seem that getting the POS from all the words in a line would do 90% of the work, and getting the cltk.morphology.universal_dependencies_features.Case from the declined POS would do the rest. Is there any leeway for coding ambiguity into the raw corpora? I mean aren't there texts where a form could be understood as being either of two different POS? Or either of two different cases? To pull in all possibilities I'd like to include ambiguous lines as well.

Ambiguity in whether a form is a noun, proper noun, or adjective comes up in the phrases "ductorem Varum", "Deum Christum", "Augusti Caesaris" and "Ammon Jupiter" in the following golden lines. However we classify these parts of speech, they fulfill the form:

Manilius 1.899 cum fera ductorem rapuit Germania Varum
Prudentius Psychom. 74 atque innupta Deum concepit femina Christum
Corippus In laudem 4.138 Augusti priscum renovasti Caesaris aevum;
Rafael Landívar Rusticatio Mexicana 1.123 Et Lybicas Ammon contemnat Jupiter undas.

todd-cook · 2023-04-21T05:53:45Z

Does the DOC object have a built-in method for extracting a line of poetry?

--No, but it might be a good new feature. It is problematic, though, since a Doc corresponds to sentences, not verses.

Or would I just declare one by extracting the contents between newlines in the raw string?

--I would create a doc object for each sentence in verse, otherwise POS tags may not resolve as desired.

It would seem that getting the POS from all the words in a line would do 90% of the work, and getting the cltk.morphology.universal_dependencies_features.Case from the declined POS would do the rest. Is there any leeway for coding ambiguity into the raw corpora? I mean aren't there texts where a form could be understood as being either of two different POS? Or either of two different cases? To pull in all possibilities I'd like to include ambiguous lines as well.

--No, there's no built-in method to accommodate POS tag ambiguity in a corpus. With poetry one has to expect the rules of language to be broken.
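
Since nothing built in covers this, one possible workaround is to keep the ambiguity outside the corpus: list several candidate readings per word and accept a line if any combination fits. The sketch below is only an illustration of that idea. The `Reading` tuples, the helper names, and the hand-listed readings (including treating "ductorem" as a modifier) are assumptions, not CLTK output.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple

Reading = Tuple[str, Optional[str]]  # (part of speech, case)


def first_accepted(
    candidates: Sequence[Sequence[Reading]],
    accept: Callable[[Tuple[Reading, ...]], bool],
) -> Optional[Tuple[Reading, ...]]:
    """Try every combination of readings; return the first one accepted."""
    for combo in product(*candidates):
        if accept(combo):
            return combo
    return None


def is_abVAB(combo: Tuple[Reading, ...]) -> bool:
    """Two modifiers, a verb, two nouns, with agreement approximated by case."""
    if len(combo) != 5:
        return False
    (p1, c1), (p2, c2), (pv, _), (p3, c3), (p4, c4) = combo
    modifiers, nouns = {"ADJ", "PROPN"}, {"NOUN", "PROPN"}
    return (
        p1 in modifiers and p2 in modifiers and pv == "VERB"
        and p3 in nouns and p4 in nouns
        and c1 == c3 and c2 == c4 and c1 != c2
    )


# Manilius 1.899 without the conjunction "cum"; readings are hand-listed.
manilius = [
    [("ADJ", "Nom")],                      # fera
    [("NOUN", "Acc"), ("ADJ", "Acc")],     # ductorem: noun, or read as a modifier
    [("VERB", None)],                      # rapuit
    [("PROPN", "Nom"), ("NOUN", "Nom")],   # Germania
    [("PROPN", "Acc")],                    # Varum
]
print(first_accepted(manilius, is_abVAB))
```

The number of combinations grows multiplicatively with the readings per word, which should stay manageable for lines of five or six words.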

You probably want to have a look at the HexameterScanner class and related ones; knowing which vowels must be accented can help narrow the cases; POS taggers are mostly trained on prose and unaccented texts.

AnnapolisKen · 2023-04-21T17:07:51Z

Thank you so much! That explains a lot.
The workflow I had envisioned was to start first with the Hexameter scanner to see if the line was a hexameter or pentameter. Spot checking the corpora I saw that some prose intros were nestled among the poems.
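
Before running a real scanner, a very crude prefilter can already set aside headings and long prose passages. The sketch below only counts vowel letters as a stand-in for syllables; the function names and the 12 to 20 window are assumptions (a hexameter has 12 to 17 syllables, and the extra room allows for elisions). It ignores consonantal i and u, so it is no substitute for HexameterScanner.

```python
import re


def rough_syllables(line: str) -> int:
    """Count vowel letters, with common diphthongs collapsed to one."""
    s = line.lower().replace("qu", "q")   # the u of "qu" is not a vowel
    s = re.sub(r"ae|au|oe|eu", "a", s)    # treat each diphthong as one vowel
    return len(re.findall(r"[aeiouy]", s))


def looks_like_verse(line: str, low: int = 12, high: int = 20) -> bool:
    """True if the rough syllable count is plausible for a hexameter or pentameter."""
    return low <= rough_syllables(line) <= high


print(looks_like_verse("Fraterno primi maduerunt sanguine muri."))
print(looks_like_verse("Liber primus"))
```

Lines that pass this filter would still need to go through the scanner; the point is only to skip obvious non-verse cheaply.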

clemsciences · 2023-04-22T20:52:59Z

Designing a specific Doc for poetry would be valuable for poetry study, but I don't think there is a single solution. Maybe cltk can propose a solution to users and show advanced users how to customize it.
I remember I tried to transform the HexameterScanner into a Process but it was harder than expected.

AnnapolisKen · 2023-04-22T21:10:49Z

Well, in the corpus poetry lines seem to be organized via line breaks, so can/could the DOC include the line break as a POS or other token? Then the workflow could go as follows:

1. If the lines lack macrons, first a hexameter scanner could work on the corpus from line-break to line-break and create a new corpus with macrons.
2. The new corpus can be processed into a DOC to get POS, more accurately thanks to macrons.
3. My new script could examine the DOC, checking the POS and cases of all the words between one line break and the next to evaluate the line as a likely candidate for a golden line.
Again, thank you for your patience with me and my project on a bizarre backwater of Latin metrics.
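
One way to keep line breaks without making them tokens is to record, for each word, which verse line it starts on, and then join that with whatever tags come back from sentence-level processing. The sketch below shows only the alignment step with the standard library; the function name and the simple word regex are assumptions, and a real tokenizer may split words differently.

```python
import re
from bisect import bisect_right
from typing import Iterator, Tuple


def words_with_line_numbers(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, word) pairs, 0-based, using character offsets."""
    # Offsets at which each line starts: 0, then one past every newline.
    starts = [0] + [m.end() for m in re.finditer(r"\n", text)]
    for m in re.finditer(r"[^\W\d_]+", text):   # runs of letters only
        yield bisect_right(starts, m.start()) - 1, m.group()


text = (
    "Imminet exitio vir coniugis, illa mariti;\n"
    "Lurida terribiles miscent aconita novercae"
)
pairs = list(words_with_line_numbers(text))
print(pairs[0], pairs[-1])
```

Because the mapping relies on character offsets, it still works when a sentence runs across several verse lines.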

AnnapolisKen · 2023-05-22T15:24:35Z

I've been tasked to learn more about Bard AI, so here's what Bard just suggested.

import cltk

# Create a new grammar
grammar = cltk.parse.LabeledTreebankGrammar()

# Add a new rule to the grammar
grammar.add_rule('line_break', '\n')

# Parse the sentence
sentence = "Imminet exitio vir coniugis, illa mariti; Lurida terribiles miscent aconita novercae"
tokens = cltk.parse.pos_tag(sentence, grammar)

# Print the POS of the line break
print(tokens[tokens.index('\n')])
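
The names used in the suggestion above may not correspond to anything in CLTK's actual API, and the sample string contains no newline, so looking up '\n' in it could not succeed in any case. A more modest alternative, sketched here with the standard library only, is to split the raw text on line breaks and keep the original line numbers; the function name `verse_lines` is an assumption for illustration.

```python
from typing import Iterator, Tuple


def verse_lines(raw: str) -> Iterator[Tuple[int, str]]:
    """Yield (1-based line number, stripped line) for every non-blank line."""
    for number, line in enumerate(raw.splitlines(), start=1):
        stripped = line.strip()
        if stripped:
            yield number, stripped


sample = (
    "Imminet exitio vir coniugis, illa mariti;\n"
    "\n"
    "Lurida terribiles miscent aconita novercae\n"
)
for number, line in verse_lines(sample):
    print(number, line)
```

Keeping the original numbers makes it easy to cite a candidate line back to its place in the source file.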

clemsciences assigned todd-cook and clemsciences Apr 22, 2023
