Attributes
Token attributes are specified using internal IDs in many places including:
- Matcher patterns
- Doc.to_array and Doc.from_array
- Doc.has_annotation
- the attrs argument of the MultiHashEmbed Tok2Vec architecture
All methods automatically convert between the string version of an ID ("DEP") and the internal integer symbol (DEP). The internal IDs can be imported from spacy.attrs or retrieved from the StringStore. A map from string attribute names to internal attribute IDs is stored in spacy.attrs.IDS.
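As a minimal sketch of this conversion, the snippet below passes the same attributes to Doc.to_array once as string names and once as symbols imported from spacy.attrs. It assumes spaCy v3 is installed and uses a blank English pipeline so that no trained model is required; the example sentence and variable names are illustrative only.

```python
import spacy
from spacy.attrs import IDS, LOWER, ORTH

# A blank pipeline only tokenizes, so no trained model has to be downloaded.
nlp = spacy.blank("en")
doc = nlp("Hello world")

# The string name and the imported symbol refer to the same internal ID.
same_id = IDS["ORTH"] == ORTH

# Doc.to_array accepts either form; both calls export the same values.
by_name = doc.to_array(["ORTH", "LOWER"])
by_id = doc.to_array([ORTH, LOWER])
same_values = bool((by_name == by_id).all())

print(same_id, same_values, by_name.shape)
```

Each row of the exported array corresponds to one token and each column to one requested attribute.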
The corresponding Token object attributes can be accessed using the same names in lowercase, e.g. token.orth or token.length.
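The relationship between the uppercase attribute IDs and the lowercase Token attributes can be checked with a short sketch. It assumes spaCy v3 and a blank English pipeline; ORTH and IS_ALPHA are chosen only because they are available without a trained model.

```python
import spacy
from spacy.attrs import IS_ALPHA, ORTH

nlp = spacy.blank("en")
doc = nlp("Hello world")
token = doc[0]

# Export two attributes for every token and take the row of the first token.
row = doc.to_array([ORTH, IS_ALPHA])[0]

# The exported values match the lowercase attributes on the Token object.
orth_matches = int(row[0]) == token.orth
alpha_matches = bool(row[1]) == token.is_alpha

print(orth_matches, alpha_matches)
```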
For attributes that represent string values, the internal integer ID is accessed as Token.attr, e.g. token.dep, while the string value can be retrieved by appending _ as in token.dep_.
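The following sketch illustrates the integer and string forms of one attribute. It uses token.lower rather than token.dep, because a blank pipeline (assumed here to avoid a model download) has no parser and would leave the dependency label unset. The lookups go through nlp.vocab.strings, the pipeline's StringStore.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("Hello world")
token = doc[0]

int_id = token.lower   # internal integer ID (a hash of the string)
text = token.lower_    # string value, here "hello"

# The StringStore maps between the two forms in both directions.
id_to_text = nlp.vocab.strings[int_id]
text_to_id = nlp.vocab.strings[text]

print(text, id_to_text == text, text_to_id == int_id)
```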
Attribute | Description | Type |
---|---|---|
DEP | The token’s dependency label. | str |
ENT_ID | The token’s entity ID (ent_id). | str |
ENT_IOB | The IOB part of the token’s entity tag. Uses custom integer values rather than the string store: unset is 0, I is 1, O is 2, and B is 3. | str |
ENT_KB_ID | The token’s entity knowledge base ID. | str |
ENT_TYPE | The token’s entity label. | str |
IS_ALPHA | Token text consists of alphabetic characters. | bool |
IS_ASCII | Token text consists of ASCII characters. | bool |
IS_DIGIT | Token text consists of digits. | bool |
IS_LOWER | Token text is in lowercase. | bool |
IS_PUNCT | Token is punctuation. | bool |
IS_SPACE | Token is whitespace. | bool |
IS_STOP | Token is a stop word. | bool |
IS_TITLE | Token text is in titlecase. | bool |
IS_UPPER | Token text is in uppercase. | bool |
LEMMA | The token’s lemma. | str |
LENGTH | The length of the token text. | int |
LIKE_EMAIL | Token text resembles an email address. | bool |
LIKE_NUM | Token text resembles a number. | bool |
LIKE_URL | Token text resembles a URL. | bool |
LOWER | The lowercase form of the token text. | str |
MORPH | The token’s morphological analysis. | MorphAnalysis |
NORM | The normalized form of the token text. | str |
ORTH | The exact verbatim text of a token. | str |
POS | The token’s universal part of speech (UPOS). | str |
SENT_START | Token is start of sentence. | bool |
SHAPE | The token’s shape. | str |
SPACY | Token has a trailing space. | bool |
TAG | The token’s fine-grained part of speech. | str |
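As an illustration of how the attribute names in the table are used in Matcher patterns, here is a minimal sketch that combines a string-valued attribute (LOWER) with a boolean one (IS_PUNCT). It assumes spaCy v3 and a blank English pipeline; the pattern name HELLO_WORLD and the example text are made up for this example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Each dict describes one token; the keys are attribute names from the table.
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
matcher.add("HELLO_WORLD", [pattern])

doc = nlp("Hello, world!")
matches = matcher(doc)

# Each match is a (match_id, start, end) triple of token indices.
spans = [doc[start:end].text for _, start, end in matches]
print(spans)
```

Attributes that depend on trained components, such as DEP, POS or TAG, can be used in patterns the same way, but they only match when the pipeline includes a component that sets them.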