Feature Request: Token level output #386

Closed
atakanokan opened this issue Oct 3, 2019 · 4 comments
Labels
feature request feature request for doccano

Comments

@atakanokan

Feature description

doccano currently outputs only character-level annotations. However, some NLP workflows require the input as a list of words and a list of token labels:

Sample sentence: 
['Two', ',', 'Samsung', 'based', ',', 'electronic', 'cash', 'registers', 'were', 'reconstructed', 'in', 'order', 'to', 'expand', 'their', 'functions', 'and', 'adapt', 'them', 'for', 'networking', '.']

Sample sentence labels: 
['O', 'O', 'I-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']

An earlier issue (#7) mentions that annotation is character-level because some languages are not space-separated. I think it would be good to offer token-level output as an option, so that users annotating space-separated languages can feed it directly into their workflows.

Example taken from: https://github.com/microsoft/nlp/blob/master/examples/named_entity_recognition/ner_wikigold_bert.ipynb
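
To make the request concrete, here is a minimal sketch of the conversion from character-level spans to token-level tags. It is not doccano code: the function name `spans_to_token_labels` and the `(start, end, label)` triple layout with an exclusive end offset are assumptions for illustration, and tokens are simply runs of non-whitespace, so it only suits space-separated languages. A token is tagged only when it lies entirely inside a span, which means punctuation glued to a word (such as "Washington.") would need pre-tokenised text.

```python
import re


def spans_to_token_labels(text, spans):
    """Convert character-level spans to token-level BIO tags.

    text:  the raw sentence
    spans: list of (start, end, label) triples, end exclusive (assumed layout)
    Tokens are maximal runs of non-whitespace characters.
    """
    tokens, tags = [], []
    prev = None  # index of the span the previous token belonged to
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.span()
        current = None
        for i, (start, end, _label) in enumerate(spans):
            # the token must lie entirely inside the annotated span
            if start <= tok_start and tok_end <= end:
                current = i
                break
        if current is None:
            tags.append("O")
        else:
            # first token of a span gets B-, later tokens of the same span get I-
            prefix = "I-" if current == prev else "B-"
            tags.append(prefix + spans[current][2])
        tokens.append(match.group())
        prev = current
    return tokens, tags


tokens, tags = spans_to_token_labels("Sam Houston stayed home", [(0, 11, "PER")])
print(tokens)
print(tags)
```

For the sentence above this yields the tokens `Sam`, `Houston`, `stayed`, `home` with the tags `B-PER`, `I-PER`, `O`, `O`.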

@icoxfog417 icoxfog417 added the feature request feature request for doccano label Oct 4, 2019
@prokotg

prokotg commented Oct 30, 2019

Hi there! I wrote something similar for myself and I would love to contribute a PR :) However, I am not sure how to handle mislabeled tokens. Namely, what if a token was only partially marked? For my own purposes, I print the mislabeled token as a warning to the script's user and drop its annotation, but that is not acceptable in production.

The other thing is: what should the "non-entity" token be? 'O'? Then we should prevent users from adding a label with that name, which might be confusing for some people. Or maybe we should add a form where users can provide the non-entity token themselves? Or leave it blank?

I think this feature request is great, but we should agree on how exactly to tackle this 😄 I would love to read your suggestions.
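
The partially marked token question can be framed as a choice of alignment policy. The sketch below is one way to lay out the options, not a decision reached in this thread: the function `align_span` and the policy names `expand`, `drop` and `shrink` are hypothetical, and tokens are again assumed to be whitespace-separated.

```python
import re


def align_span(text, start, end, policy="expand"):
    """Snap a character span [start, end) to whitespace-token boundaries.

    Returns (new_start, new_end), or None when the span is discarded.
    policy="expand": grow the span to cover every token it touches.
    policy="drop":   discard the span if it cuts through any token.
    policy="shrink": keep only the tokens that lie fully inside the span.
    """
    tokens = [m.span() for m in re.finditer(r"\S+", text)]
    # tokens that overlap the span at all
    touched = [(s, e) for s, e in tokens if s < end and start < e]
    if not touched:
        return None
    # tokens that are completely covered by the span
    inside = [(s, e) for s, e in touched if start <= s and e <= end]
    if len(inside) == len(touched) or policy == "expand":
        return touched[0][0], touched[-1][1]
    if policy == "drop":
        return None
    if policy == "shrink":
        return (inside[0][0], inside[-1][1]) if inside else None
    raise ValueError(f"unknown policy: {policy}")


text = "Sam Houston stayed home"
# the annotator selected "Sam Hous", cutting through "Houston"
print(align_span(text, 0, 8, "expand"))
print(align_span(text, 0, 8, "drop"))
print(align_span(text, 0, 8, "shrink"))
```

With the span covering "Sam Hous", `expand` returns `(0, 11)`, `drop` returns `None`, and `shrink` returns `(0, 3)`. Whichever policy is chosen, reporting the affected spans to the user, as the script described above does, still seems worthwhile.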

@Hironsan
Member

I have a plan to create a new Python package named doccano-transformer.
It will transform annotated documents into other formats such as #362, #454 and so on. So, the token level output should be included in doccano-transformer.

@NielsRogge

NielsRogge commented Feb 29, 2020

Any updates on this? I'd like to import a dataset simply as .txt (in which each line is a sentence):

George Washington went to Washington.
Sam Houston stayed home.

... and export it (after annotating) as follows, also in a .txt:

George B-PER
Washington I-PER
went O
to O
Washington B-LOC

Sam B-PER
Houston I-PER
stayed O
home O

In other words, export in the well-known IOB annotation format. For this, Doccano should automatically know that an annotated entity comprising more than one token must be tagged with B (beginning) on its first token and I (inside) on the rest. There are also more sophisticated annotation schemes than IOB, such as BIOES, where E (end) marks the last token of a multi-token chunk and S (single) marks a chunk containing a single token. The BIOES annotation scheme would result in the following:

George B-PER
Washington E-PER
went O
to O
Washington S-LOC

Sam B-PER
Houston E-PER
stayed O
home O

It would be awesome if I could export annotated datasets in the IOB or BIOES (or other) formats. Many state-of-the-art libraries for NER require token-level annotation in order to train models (Flair from Zalando, Transformers from HuggingFace,...).
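
The relationship between the two schemes shown above is mechanical, so one export path could produce both. The following is a minimal sketch under stated assumptions: the input tags follow the variant used in the examples above, where every entity starts with B-, and the helper names `iob2_to_bioes` and `to_columns` are made up for illustration and are not part of doccano.

```python
def iob2_to_bioes(tags):
    """Convert one sentence of B-/I-/O tags to the BIOES scheme."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # the entity continues only if the next tag is I- with the same label
        continues = next_tag == "I-" + label
        if prefix == "B":
            bioes.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            bioes.append(("I-" if continues else "E-") + label)
    return bioes


def to_columns(sentences):
    """Render (tokens, tags) pairs as 'token tag' lines, one blank line between sentences."""
    blocks = [
        "\n".join(f"{tok} {tag}" for tok, tag in zip(tokens, tags))
        for tokens, tags in sentences
    ]
    return "\n\n".join(blocks) + "\n"


tokens = ["George", "Washington", "went", "to", "Washington"]
iob = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(to_columns([(tokens, iob2_to_bioes(iob))]))
```

Run on the first example sentence, this prints the same BIOES column layout as listed above.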

@Hironsan
Member

We released doccano-transformer. It supports data transformation. Currently, the supported task is named entity recognition, and the supported formats are CoNLL2003 and spaCy.

We have a plan to extend tasks and formats.
Please look forward to it.

5 participants