Description of method and statistics from the creation of the supersense-expanded Elexis corpus #145
Tagging ELEXIS-WSD 1.1 with supersenses

This document describes the process (and associated challenges) of adding supersenses to the Danish part of the Parallel sense-annotated corpus ELEXIS-WSD 1.1. As part of this process, supersenses were also added to the DanNet dataset (71055 synsets affected in total). Many of the Danish tokens were annotated with sense IDs from DanNet, so semantic information derived from DanNet could be used to annotate these tokens further. Specifically, the synsets/senses in DanNet are annotated with EuroWordNet ontological types.

From ontological type to supersense

Bolette et al. had already produced a partial mapping from EuroWordNet ontological types to a (slightly expanded) set of supersenses. The mapping was only partial because it did not map each ontological type to every possible part-of-speech, whereas supersenses are discrete and partitioned by part-of-speech. Nevertheless, many supersenses could be assigned based on this direct mapping.

Using hypernyms to improve supersense tagging

One issue we ran into was the orthogonal nature of the two types of categorisation. Ontological types are precise and numerous, with multiple semantic "tags" making up a composite type. Supersenses are broad categories with no clear demarcation, so words sometimes have no "obvious" home. One case where this issue manifested had to do with the set of synsets tagged with the supersense

Tagging remaining synsets

A full list of the remaining untagged synsets was produced programmatically and then partitioned according to the combination of ontological type and part-of-speech tag (94 cases). This list was put in a spreadsheet, and each of the 94 combinations was manually annotated with a supersense, which allowed any synset with that combination to be supersense-tagged. After this data was added to the DanNet dataset, every synset was finally tagged with a supersense.
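The partitioning of untagged synsets by ontological type and part-of-speech could be sketched roughly as follows. This is only an illustration and not the code that was actually used: the dictionary keys (`id`, `onto_type`, `pos`, `supersense`) and the function names are hypothetical, and the manual table stands in for the spreadsheet.

```python
from collections import defaultdict


def partition_untagged(synsets):
    """Group synsets that lack a supersense by (ontological type, POS).

    Each synset is a dict with the hypothetical keys
    'id', 'onto_type', 'pos' and 'supersense'.
    """
    groups = defaultdict(list)
    for synset in synsets:
        if synset.get("supersense") is None:
            key = (synset["onto_type"], synset["pos"])
            groups[key].append(synset["id"])
    return dict(groups)


def apply_manual_mapping(synsets, manual):
    """Fill in missing supersenses from a manually annotated table.

    `manual` maps (ontological type, POS) to a supersense label, playing
    the role of the spreadsheet. Synsets are updated in place; the IDs of
    synsets that still have no supersense are returned.
    """
    remaining = []
    for synset in synsets:
        if synset.get("supersense") is not None:
            continue
        label = manual.get((synset["onto_type"], synset["pos"]))
        if label is None:
            remaining.append(synset["id"])
        else:
            synset["supersense"] = label
    return remaining
```

The keys returned by `partition_untagged` correspond to the rows that need a manual decision. Once every key has an entry in the manual table, `apply_manual_mapping` returns an empty list.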
Producing an updated ELEXIS-WSD 1.1 CoNLL-U file

Once the entirety of DanNet had been annotated with supersenses, this data could be used to tag every token in the ELEXIS dataset that has a DanNet sense ID (3939 unique senses in total, spread over 11085 instances in the dataset). Unfortunately, the remaining tokens that should be tagged with supersenses still amount to a total of 3141 unique IDs. These IDs are not linked to DanNet in any way, so they will either have to be manually annotated with supersenses (a gargantuan task), or some other way of (semi-)automatically assigning supersenses will have to be devised.
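The lookup step described above could look something like the sketch below. It assumes that the sense ID sits in the MISC column (the tenth CoNLL-U column) under a key named `SenseID` and that the supersense is written back under a key named `Supersense`. Both key names are hypothetical and may not match the actual ELEXIS-WSD layout.

```python
def tag_conllu(lines, sense_to_supersense,
               sense_key="SenseID", out_key="Supersense"):
    """Add supersenses to CoNLL-U token lines via a sense-ID lookup.

    Returns (new_lines, tagged_instances, unresolved_ids). The MISC key
    names are assumptions made for this sketch.
    """
    out = []
    tagged_instances = 0
    unresolved = set()
    for raw in lines:
        line = raw.rstrip("\n")
        cols = line.split("\t")
        # Comment lines, blank lines and malformed lines pass through as-is.
        if line.startswith("#") or len(cols) != 10:
            out.append(line)
            continue
        misc = cols[9]
        pairs = [] if misc == "_" else misc.split("|")
        attrs = dict(p.split("=", 1) for p in pairs if "=" in p)
        sense = attrs.get(sense_key)
        if sense is not None:
            label = sense_to_supersense.get(sense)
            if label is None:
                # Sense ID with no known supersense, e.g. not linked to DanNet.
                unresolved.add(sense)
            else:
                pairs.append(f"{out_key}={label}")
                cols[9] = "|".join(pairs)
                tagged_instances += 1
        out.append("\t".join(cols))
    return out, tagged_instances, unresolved
```

Here `tagged_instances` counts tagged tokens, comparable to the instance count above, while `unresolved` collects the unique IDs that would still need manual or (semi-)automatic annotation.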
The key figures and the key algorithms used to create the new dataset.
Also, challenges, e.g. underdescribed IDs.