Text Analysis serializations
Version: 16 January 2014
Overview
The ITS 2.0 specification defines a normative way to represent Text Analysis information locally in XML and HTML. Text Analysis information can also be represented in other formats, e.g. JSON. This page describes such alternative serializations. Please edit this page or provide comments on the ITS IG mailing list.
Comparison to NERD API output
The NERD API returns its output as JSON. Here is an example of the output of an API call.
[
  {
    (1) idEntity: 120,
    (2) label: "BBC",
    (3) startChar: 138, endChar: 141,
    (4) extractorType: "Company",
    (5) nerdType: "http://nerd.eurecom.fr/ontology#Organization",
    (6) uri: "http://dbpedia.org/resource/BBC",
    (7) confidence: 0.0582796,
    (8) relevance: 0.5,
    (9) extractor: "dbspotlight"
  },
  ...
]
The NERD API fields correspond to Text Analysis information pieces as follows:
- idEntity: no correspondence
- label: content of the annotated element in XML or HTML
- startChar, endChar: not represented as part of the Text Analysis information pieces, but generated in a NIF workflow, see conversion to NIF
- extractorType: no correspondence
- nerdType: entity type / concept class, e.g. in HTML its-ta-class-ref="http://nerd.eurecom.fr/ontology#Organization"
- uri: entity / concept identifier, e.g. in HTML its-ta-ident-ref="http://dbpedia.org/resource/BBC"
- confidence: Text Analysis confidence, e.g. in HTML its-ta-confidence="0.0582796"
- relevance: no correspondence
- extractor: its-annotators-ref (in HTML) or annotatorsRef (in XML) attribute, e.g. its-annotators-ref="text-analysis|dbspotlight"
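The correspondences above can be sketched as a small conversion routine. The following Python example is only an illustration, not part of ITS 2.0 or the NERD API: the function name nerd_entity_to_its_span and the choice of a span element are assumptions made for this page. It copies the four fields that have an ITS counterpart into HTML attributes and drops the fields that have none (idEntity, extractorType, relevance) as well as startChar and endChar, which are only generated in a NIF workflow.

```python
from html import escape


def nerd_entity_to_its_span(entity: dict) -> str:
    """Render one NERD entity as an HTML span with ITS 2.0 Text Analysis attributes.

    Hypothetical helper for illustration; field names follow the NERD example above.
    """
    attributes = {
        # nerdType -> entity type / concept class
        "its-ta-class-ref": entity["nerdType"],
        # uri -> entity / concept identifier
        "its-ta-ident-ref": entity["uri"],
        # confidence -> Text Analysis confidence
        "its-ta-confidence": str(entity["confidence"]),
        # extractor -> annotators reference, prefixed with the data category name
        "its-annotators-ref": "text-analysis|" + entity["extractor"],
    }
    attribute_text = " ".join(
        f'{name}="{escape(value, quote=True)}"' for name, value in attributes.items()
    )
    # label -> content of the annotated element
    return f"<span {attribute_text}>{escape(entity['label'])}</span>"


example_entity = {
    "idEntity": 120,
    "label": "BBC",
    "startChar": 138,
    "endChar": 141,
    "extractorType": "Company",
    "nerdType": "http://nerd.eurecom.fr/ontology#Organization",
    "uri": "http://dbpedia.org/resource/BBC",
    "confidence": 0.0582796,
    "relevance": 0.5,
    "extractor": "dbspotlight",
}

print(nerd_entity_to_its_span(example_entity))
```

The resulting span carries the its-ta-class-ref, its-ta-ident-ref, its-ta-confidence and its-annotators-ref attributes with the values from the example, and "BBC" as its content. A real converter would also need to place the element at the right position in the source text, which this sketch leaves out.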