Feature Request: Token level output #386
Comments
Hi there! I wrote something similar for myself and would love to contribute a PR :) However, I am not sure how to handle mislabeled tokens. Namely, what if a token was only partially marked? For my own purposes, I print the mislabeled token as a warning to the script's user and drop its annotation, but that is not the way to go in production. The other question is: what should a "non-entity" token be labeled? I think this feature request is great, but we should agree on how exactly to tackle it 😄 I would love to read your suggestions.
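To make the "partially marked token" case concrete, here is a minimal sketch that lists tokens a span only partly covers. It assumes whitespace tokenization and annotations given as (start, end, label) tuples with an exclusive end offset; the names `whitespace_tokens` and `partially_covered` are hypothetical and not part of doccano.

```python
import re
from typing import List, Tuple

# Assumed annotation layout: (start, end, label), end offset exclusive.
Span = Tuple[int, int, str]


def whitespace_tokens(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def partially_covered(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """List (token, label) pairs where a span covers only part of the token."""
    problems = []
    for tok_start, tok_end in whitespace_tokens(text):
        for start, end, label in spans:
            overlaps = start < tok_end and tok_start < end
            covers = start <= tok_start and tok_end <= end
            if overlaps and not covers:
                problems.append((text[tok_start:tok_end], label))
    return problems


text = "Barack Obama visited Paris."
# The first span stops inside "Obama"; the second omits the trailing period.
spans = [(0, 10, "PER"), (21, 26, "LOC")]
problems = partially_covered(text, spans)
```

The second span shows that, with plain whitespace tokenization, trailing punctuation alone is enough to produce a partial match, so a real exporter would probably need either a smarter tokenizer or an explicit policy (warn, drop, or expand the span to the token boundary).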
Any updates on this? I'd like to import a dataset simply as .txt (in which each line is a sentence):
... and export it (after annotating) as follows, also in a .txt:
In other words, export in the well-known IOB annotation format. For this, Doccano should automatically know that an annotated entity comprising more than one token should be annotated with B (beginning) and I (inside) labels. There are also more sophisticated annotation schemes than IOB, such as BIOES, where E (end) marks the last token of a multi-token chunk and S (single) represents a chunk containing a single token. The BIOES annotation scheme would result in the following:
It would be awesome if I could export annotated datasets in the IOB or BIOES (or other) formats. Many state-of-the-art libraries for NER require token-level annotation in order to train models (Flair from Zalando, Transformers from HuggingFace, ...).
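As an illustration of how the two schemes relate, the following sketch converts a sequence of IOB tags (in the B-/I-/O form described above) to BIOES. The function name `iob_to_bioes` is hypothetical, and the code assumes well-formed input in which every I- tag follows a B- or I- tag with the same label.

```python
from typing import List


def iob_to_bioes(tags: List[str]) -> List[str]:
    """Convert IOB tags (B-X, I-X, O) to BIOES tags."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        # Look ahead: does the same entity continue on the next token?
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{label}"
        if prefix == "B":
            out.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            out.append(f"I-{label}" if continues else f"E-{label}")
    return out


converted = iob_to_bioes(["B-PER", "I-PER", "O", "B-LOC"])
```

Because BIOES can be derived from IOB this way, an exporter would only need to produce one token-level scheme natively and could offer the others as a post-processing step.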
We released doccano-transformer, which supports data transformation. Currently, the only supported task is named entity recognition, and the supported formats are CoNLL2003 and spaCy. We plan to extend both the tasks and the formats.
Feature description
doccano currently outputs only character-level annotations. However, some NLP workflows require input as a list of words and a matching list of token labels:
An earlier issue (#7) mentions that this design was chosen because some languages are not space-separated. I think it would be good to offer token-level output as an option, so that users who annotate space-separated languages can feed it directly into their workflows.
Example taken from: https://github.com/microsoft/nlp/blob/master/examples/named_entity_recognition/ner_wikigold_bert.ipynb
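Separately from the linked notebook, here is a rough sketch of the conversion this request asks for: turning character-level spans into parallel lists of words and IOB labels. It is not doccano code. It assumes whitespace tokenization, non-overlapping spans given as (start, end, label) tuples with an exclusive end offset, and a policy where any overlap between a token and a span labels the whole token. The name `spans_to_iob` is hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Assumed annotation layout: (start, end, label), end offset exclusive.
Span = Tuple[int, int, str]


def spans_to_iob(text: str, spans: List[Span]) -> Tuple[List[str], List[str]]:
    """Return (words, labels) with one IOB label per whitespace token."""
    words: List[str] = []
    labels: List[str] = []
    prev: Optional[int] = None  # index of the span that matched the previous token
    for m in re.finditer(r"\S+", text):
        hit: Optional[int] = None
        for idx, (start, end, _label) in enumerate(spans):
            if start < m.end() and m.start() < end:  # any character overlap
                hit = idx
                break
        if hit is None:
            tag = "O"
        else:
            # Same span as the previous token: inside; otherwise: beginning.
            tag = ("I-" if hit == prev else "B-") + spans[hit][2]
        prev = hit
        words.append(m.group())
        labels.append(tag)
    return words, labels


words, labels = spans_to_iob(
    "Barack Obama visited Paris",
    [(0, 12, "PER"), (21, 26, "LOC")],
)
```

For languages that are not space-separated, the regular expression would have to be replaced by a language-specific tokenizer that reports character offsets; the rest of the logic would stay the same.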