Feature Request: Token level output #386

atakanokan · 2019-10-03T19:50:53Z

Feature description

doccano currently outputs only character-level annotations. However, some NLP workflows require input as a list of tokens and a parallel list of token labels:

Sample sentence: 
['Two', ',', 'Samsung', 'based', ',', 'electronic', 'cash', 'registers', 'were', 'reconstructed', 'in', 'order', 'to', 'expand', 'their', 'functions', 'and', 'adapt', 'them', 'for', 'networking', '.']

Sample sentence labels: 
['O', 'O', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

An earlier issue (#7) explains that the output is character-level because some languages are not space-separated. I think it would be good to have token-level output as an option, so that users who annotate space-separated languages can feed the export straight into their workflow.

Example taken from: https://github.com/microsoft/nlp/blob/master/examples/named_entity_recognition/ner_wikigold_bert.ipynb
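
A minimal sketch of the requested conversion, for illustration only. It assumes character-level annotations arrive as `(start, end, label)` tuples with an exclusive end offset, and it uses a simple regex tokenizer that only suits space-separated languages. The function name `spans_to_token_labels` and the sample span are hypothetical and not part of doccano.

```python
import re

def spans_to_token_labels(text, spans, outside="O"):
    """Map character-level (start, end, label) spans onto tokens.

    Tokens are runs of word characters or single punctuation marks.
    A token receives a span's label only if it lies fully inside the span.
    """
    tokens, labels = [], []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tokens.append(match.group())
        label = outside
        for start, end, name in spans:
            if start <= match.start() and match.end() <= end:
                label = name
                break
        labels.append(label)
    return tokens, labels

# Hypothetical annotation: "Samsung" occupies characters 5..12.
text = "Two, Samsung based, electronic cash registers."
spans = [(5, 12, "ORG")]
tokens, labels = spans_to_token_labels(text, spans)
for token, label in zip(tokens, labels):
    print(token, label)
```

The two returned lists are parallel, so they can be passed directly to code that expects one list of words and one list of labels per sentence.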

prokotg · 2019-10-30T11:27:17Z

Hi there! I wrote something similar for myself and would love to contribute a PR :) However, I am not sure how to handle mislabeled tokens. Namely, what if a token was only partially marked? For my own purposes, I print the mislabeled token as a warning to the script's user and drop the token's annotation, but in production this is not the way to go.
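
The warn-and-drop behaviour described above could look roughly like the sketch below. It is not the commenter's actual script. It assumes token offsets and spans are `(start, end)` character pairs with an exclusive end, and the names `find_partial_tokens` and `check_span` are hypothetical.

```python
import warnings

def find_partial_tokens(token_offsets, span):
    """Return indices of tokens that a span overlaps but does not fully cover."""
    start, end = span
    partial = []
    for i, (tok_start, tok_end) in enumerate(token_offsets):
        overlaps = tok_start < end and start < tok_end
        covered = start <= tok_start and tok_end <= end
        if overlaps and not covered:
            partial.append(i)
    return partial

def check_span(text, token_offsets, span):
    """Warn about each partially marked token; return True if the span is clean."""
    partial = find_partial_tokens(token_offsets, span)
    for i in partial:
        tok_start, tok_end = token_offsets[i]
        warnings.warn(
            f"token {text[tok_start:tok_end]!r} is only partially covered by span {span}"
        )
    return not partial

text = "Sam Houston stayed home."
offsets = [(0, 3), (4, 11), (12, 18), (19, 23), (23, 24)]
# The span (0, 8) covers "Sam Hous", so "Houston" is only partially marked.
print(find_partial_tokens(offsets, (0, 8)))
```

A caller can then decide what to do with a span that is not clean: drop it, snap it to the token boundaries, or stop the export.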

The other thing is, what should the "non-entity" token be? 'O'? Then we should prevent users from adding such a label, which might be misleading for some people. Or maybe we should create a form where users can provide the token themselves? Or leave it blank?
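
One way to combine these options, sketched under assumptions: the non-entity label is a parameter that defaults to 'O', may be blank, and is rejected if it collides with a label the user has defined. The function `resolve_outside_label` is hypothetical and does not exist in doccano.

```python
def resolve_outside_label(project_labels, outside="O"):
    """Return the non-entity label, refusing one that clashes with a user label.

    project_labels: the label names defined by the user in the project.
    outside: the label for non-entity tokens; an empty string leaves it blank.
    """
    if outside in project_labels:
        raise ValueError(
            f"{outside!r} is already an entity label; choose another non-entity label"
        )
    return outside

print(resolve_outside_label({"PER", "LOC"}))               # default 'O'
print(repr(resolve_outside_label({"PER", "LOC"}, "")))     # blank
```

Checking at export time avoids forbidding 'O' as a project label up front, at the cost of reporting the conflict later.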

I think this feature request is great, but we should agree on how exactly to tackle this 😄 I would love to read your suggestions.

Hironsan · 2019-11-26T03:04:42Z

I have a plan to create a new Python package named doccano-transformer.
It will transform annotated documents into other formats such as #362, #454 and so on. So, the token level output should be included in doccano-transformer.

NielsRogge · 2020-02-29T12:49:00Z

Any updates on this? I'd like to import a dataset simply as .txt (in which each line is a sentence):

George Washington went to Washington.
Sam Houston stayed home.

... and export it (after annotating) as follows, also in a .txt:

George B-PER
Washington I-PER
went O
to O
Washington B-LOC

Sam B-PER
Houston I-PER
stayed O
home O

In other words, export in the well-known IOB annotation format. For this, doccano should automatically know that an annotated entity comprising more than one token should be annotated with B (beginning) and I (inside) labels. There are also more sophisticated annotation schemes besides IOB, such as BIOES, where E (end) marks the last token of a multi-token chunk and S (single) represents a chunk containing a single token. The BIOES annotation scheme would result in the following:

George B-PER
Washington E-PER
went O
to O
Washington S-LOC

Sam B-PER
Houston E-PER
stayed O
home O
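
The relationship between the two schemes can be sketched as a conversion from IOB tags to BIOES tags. This is an illustration, not doccano code. It assumes the input is in IOB2 style, where every entity starts with a B tag, and the function name `iob_to_bioes` is hypothetical.

```python
def iob_to_bioes(tags):
    """Convert IOB2 tags (every entity starts with B-) to BIOES tags."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is an I- tag with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            bioes.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            bioes.append(f"I-{label}" if continues else f"E-{label}")
    return bioes

print(iob_to_bioes(["B-PER", "I-PER", "O", "O", "B-LOC"]))
```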

It would be awesome if I could export annotated datasets in the IOB or BIOES (or other) formats. Many state-of-the-art libraries for NER require token-level annotation in order to train models (Flair from Zalando, Transformers from HuggingFace,...).
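
An export along these lines could be sketched as follows. The sketch assumes annotations are `(start, end, label)` character spans with an exclusive end, and it uses a regex tokenizer that treats each punctuation mark as its own token, so unlike the sample above the final period also gets a line. The names `to_iob_lines` and `export_iob` are hypothetical.

```python
import re

def to_iob_lines(text, spans):
    """Render one sentence as 'token tag' lines in IOB2 format."""
    lines = []
    previous = None  # span that covered the previous token, if any
    for match in re.finditer(r"\w+|[^\w\s]", text):
        hit = None
        for span in spans:
            if span[0] <= match.start() and match.end() <= span[1]:
                hit = span
                break
        if hit is None:
            tag = "O"
        else:
            # The first token of a span gets B-, later tokens of the same span get I-.
            tag = ("I-" if hit == previous else "B-") + hit[2]
        previous = hit
        lines.append(f"{match.group()} {tag}")
    return lines

def export_iob(sentences):
    """Join (text, spans) pairs into one string, one blank line between sentences."""
    return "\n\n".join("\n".join(to_iob_lines(text, spans)) for text, spans in sentences)

sentences = [
    ("George Washington went to Washington.", [(0, 17, "PER"), (26, 36, "LOC")]),
    ("Sam Houston stayed home.", [(0, 11, "PER")]),
]
print(export_iob(sentences))
```

Tokens that a span covers only partially fall through to 'O' here. A real exporter would need an explicit policy for that case.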

Hironsan · 2020-05-12T21:58:00Z

We released doccano-transformer, which supports data transformation. Currently, the only supported task is named entity recognition, and the supported formats are CoNLL2003 and spaCy.

We have a plan to extend tasks and formats.
Please look forward to it.

icoxfog417 added the feature request label Oct 4, 2019

Hironsan closed this as completed May 12, 2020

nk-alex mentioned this issue Aug 1, 2022

Token level output doccano/doccano-transformer#34

Closed
