# Lexicon
Lexicon is a schema definition language used to describe atproto records, HTTP endpoints (XRPC), and event stream messages. It builds on top of the atproto Data Model.
The schema language is similar to JSON Schema and OpenAPI, but includes some atproto-specific features and semantics.
This specification describes version 1 of the Lexicon definition language.
## Overview of Types

| Lexicon Type | Data Model Type | Category |
|---|---|---|
| `null` | Null | concrete |
| `boolean` | Boolean | concrete |
| `integer` | Integer | concrete |
| `string` | String | concrete |
| `bytes` | Bytes | concrete |
| `cid-link` | Link | concrete |
| `blob` | Blob | concrete |
| `array` | Array | container |
| `object` | Object | container |
| `params` | | container |
| `token` | | meta |
| `ref` | | meta |
| `union` | | meta |
| `unknown` | | meta |
| `record` | | primary |
| `query` | | primary |
| `procedure` | | primary |
| `subscription` | | primary |
## Lexicon Files

Lexicons are JSON files associated with a single NSID. A file contains one or more definitions, each with a distinct short name. A definition with the name `main` optionally describes the "primary" definition for the entire file. A Lexicon with zero definitions is invalid.

A Lexicon JSON file is an object with the following fields:

- `lexicon` (integer, required): indicates Lexicon language version. In this version, a fixed value of `1`
- `id` (string, required): the NSID of the Lexicon
- `revision` (integer, optional): indicates the version of this Lexicon, if changes have occurred
- `description` (string, optional): short overview of the Lexicon, usually one or two sentences
- `defs` (map of strings-to-objects, required): set of definitions, each with a distinct name (key)
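The file structure above can be sketched as a Python dict, along with a minimal check of the required top-level fields. The NSID `com.example.getProfile` and the definition contents are hypothetical, for illustration only.

```python
# A minimal Lexicon file represented as a Python dict; the NSID and
# definition contents are hypothetical examples, not a published Lexicon.
lexicon_file = {
    "lexicon": 1,                       # fixed value for this language version
    "id": "com.example.getProfile",     # the NSID of the Lexicon
    "description": "Example Lexicon with a single primary definition.",
    "defs": {
        "main": {
            "type": "query",
            "description": "Fetch a profile view.",
        },
    },
}

def check_lexicon_file(doc: dict) -> None:
    """Check the required top-level fields described above."""
    assert doc.get("lexicon") == 1, "lexicon version must be the fixed value 1"
    assert isinstance(doc.get("id"), str), "id (an NSID string) is required"
    defs = doc.get("defs")
    assert isinstance(defs, dict) and len(defs) > 0, "at least one definition required"

check_lexicon_file(lexicon_file)
```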
Schema definitions under `defs` all have a `type` field to distinguish their type. A file can have at most one definition with one of the "primary" types. Primary types should always have the name `main`. It is possible for `main` to describe a non-primary type.

References to specific definitions within a Lexicon use fragment syntax, like `com.example.defs#someView`. If a `main` definition exists, it can be referenced without a fragment, just using the NSID. For references in the `$type` fields in data objects themselves (eg, records or contents of a union), this is a "must" (use of a `#main` suffix is invalid). For example, `com.example.record`, not `com.example.record#main`.
The semantics of the `revision` field have not been worked out yet, but are intended to help third parties identify the most recent among multiple versions or copies of a Lexicon.

Related Lexicons are often grouped together in the NSID hierarchy. As a convention, any definitions used by multiple Lexicons are defined in a dedicated `*.defs` Lexicon (eg, `com.atproto.server.defs`) within the group. A `*.defs` Lexicon should generally not include a definition named `main`, though it is not strictly invalid to do so.
## Primary Type Definitions

The primary types are:

- `query`: describes an XRPC Query (HTTP GET)
- `procedure`: describes an XRPC Procedure (HTTP POST)
- `subscription`: describes an Event Stream (WebSocket)
- `record`: describes an object that can be stored in a repository record

Each primary definition schema object includes these fields:

- `type` (string, required): the type value (eg, `record` for records)
- `description` (string, optional): short, usually only a sentence or two
### Record

Type-specific fields:

- `key` (string, required): specifies the Record Key type
- `record` (object, required): a schema definition with type `object`, which specifies this type of record
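A `record` definition combining the fields above might look like the following sketch, written as a Python dict. The NSID `com.example.post` and its properties are hypothetical.

```python
# Sketch of a "record" primary definition as a Python dict. The NSID
# com.example.post and its fields are hypothetical, not a published Lexicon.
post_record_def = {
    "type": "record",
    "description": "A short text post.",
    "key": "tid",  # records keyed by Timestamp Identifier (TID)
    "record": {
        "type": "object",  # the record schema must have type "object"
        "required": ["text", "createdAt"],
        "properties": {
            "text": {"type": "string", "maxLength": 3000},
            "createdAt": {"type": "string", "format": "datetime"},
        },
    },
}
```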
### Query and Procedure (HTTP API)

Type-specific fields:

- `parameters` (object, optional): a schema definition with type `params`, describing the HTTP query parameters for this endpoint
- `output` (object, optional): describes the HTTP response body
  - `description` (string, optional): short description
  - `encoding` (string, required): MIME type for body contents. Use `application/json` for JSON responses.
  - `schema` (object, optional): schema definition, either an `object`, a `ref`, or a `union` of refs. Used to describe JSON encoded responses, though schema is optional even for JSON responses.
- `input` (object, optional, only for `procedure`): describes the HTTP request body schema, with the same format as the `output` field
- `errors` (array of objects, optional): set of string error codes which might be returned
  - `name` (string, required): short name for the error type, with no whitespace
  - `description` (string, optional): short description, one or two sentences
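Putting the fields above together, a `query` definition might look like this sketch (as a Python dict; the NSID, parameter names, and error name are hypothetical):

```python
# Sketch of a "query" (HTTP GET) definition with parameters, output, and
# errors, as described above. All names here are hypothetical examples.
get_profile_def = {
    "type": "query",
    "description": "Fetch a profile view for an account.",
    "parameters": {
        "type": "params",
        "required": ["actor"],
        "properties": {
            "actor": {"type": "string", "format": "at-identifier"},
        },
    },
    "output": {
        "encoding": "application/json",  # MIME type is required on output
        "schema": {"type": "ref", "ref": "com.example.defs#profileView"},
    },
    "errors": [
        {"name": "AccountNotFound", "description": "No account matched the actor."},
    ],
}
```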
### Subscription (Event Stream)

Type-specific fields:

- `parameters` (object, optional): same as Query and Procedure
- `message` (object, optional): specifies what messages can be sent
  - `description` (string, optional): short description
  - `schema` (object, required): schema definition, which must be a `union` of refs
- `errors` (array of objects, optional): same as Query and Procedure

Subscription schemas (referenced by the `schema` field under `message`) must be a `union` of refs, not an `object` type.
## Field Type Definitions

As with the primary definitions, every schema object includes these fields:

- `type` (string, required): fixed value for each type
- `description` (string, optional): short, usually only a sentence or two
### null

No additional fields.
### boolean

Type-specific fields:

- `default` (boolean, optional): a default value for this field
- `const` (boolean, optional): a fixed (constant) value for this field

When included as an HTTP query parameter, should be rendered as `true` or `false` (no quotes).
### integer

A signed integer number.

Type-specific fields:

- `minimum` (integer, optional): minimum acceptable value
- `maximum` (integer, optional): maximum acceptable value
- `enum` (array of integers, optional): a closed set of allowed values
- `default` (integer, optional): a default value for this field
- `const` (integer, optional): a fixed (constant) value for this field
### string

Type-specific fields:

- `format` (string, optional): string format restriction
- `maxLength` (integer, optional): maximum length of value, in UTF-8 bytes
- `minLength` (integer, optional): minimum length of value, in UTF-8 bytes
- `maxGraphemes` (integer, optional): maximum length of value, counted as Unicode Grapheme Clusters
- `minGraphemes` (integer, optional): minimum length of value, counted as Unicode Grapheme Clusters
- `knownValues` (array of strings, optional): a set of suggested or common values for this field. Values are not limited to this set (aka, not a closed enum).
- `enum` (array of strings, optional): a closed set of allowed values
- `default` (string, optional): a default value for this field
- `const` (string, optional): a fixed (constant) value for this field
Strings are Unicode. For non-Unicode encodings, use `bytes` instead. The basic `minLength`/`maxLength` validation constraints are counted as UTF-8 bytes. Note that JavaScript stores strings as UTF-16 by default, and it is necessary to re-encode to count accurately. The `minGraphemes`/`maxGraphemes` validation constraints work with Grapheme Clusters, which have a complex technical and linguistic definition, but loosely correspond to "distinct visual characters" like Latin letters, CJK characters, punctuation, digits, or emoji (which might comprise multiple Unicode codepoints and many UTF-8 bytes).
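The difference between byte counts, codepoint counts, and grapheme counts can be seen in a short Python sketch. (Python's standard library has no grapheme-cluster counter; a dedicated library would be needed for `maxGraphemes` enforcement, so only bytes and codepoints are shown here.)

```python
# UTF-8 byte counting vs codepoint counting for Lexicon string limits.
# maxLength/minLength count UTF-8 bytes, so the string must be encoded first.
def utf8_length(s: str) -> int:
    return len(s.encode("utf-8"))

s = "h\u00e9llo"       # "é" (U+00E9) is one codepoint but two UTF-8 bytes
print(len(s))          # 5 codepoints
print(utf8_length(s))  # 6 bytes

# Family emoji: a single grapheme cluster ("one visual character"), built
# from four person codepoints joined by three zero-width joiners (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
print(len(family))          # 7 codepoints
print(utf8_length(family))  # 25 UTF-8 bytes
```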
`format` constrains the string format and provides additional semantic context. Refer to the Data Model specification for the available format types and their definitions.

`const` and `default` are mutually exclusive.
### bytes

Type-specific fields:

- `minLength` (integer, optional): minimum size of value, as raw bytes with no encoding
- `maxLength` (integer, optional): maximum size of value, as raw bytes with no encoding

### cid-link

No type-specific fields.

See the Data Model spec for CID restrictions.
### array

Type-specific fields:

- `items` (object, required): describes the schema of the elements of this array
- `minLength` (integer, optional): minimum count of elements in array
- `maxLength` (integer, optional): maximum count of elements in array

In theory arrays have homogeneous types (meaning every element has the same type). However, with union types this restriction is meaningless, so implementations can not assume that all the elements have the same type.
### object

A generic object schema which can be nested inside other definitions by reference.

Type-specific fields:

- `properties` (map of strings-to-objects, required): defines the properties (fields) by name, each with their own schema
- `required` (array of strings, optional): indicates which properties are required
- `nullable` (array of strings, optional): indicates which properties can have `null` as a value

As described in the data model specification, there is a semantic difference in data between omitting a field; including the field with the value `null`; and including the field with a "false-y" value (`false`, `0`, empty array, etc).
### blob

Type-specific fields:

- `accept` (array of strings, optional): list of acceptable MIME types. Each may end in `*` as a glob pattern (eg, `image/*`). Use `*/*` to indicate that any MIME type is accepted.
- `maxSize` (integer, optional): maximum size in bytes
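A simplified sketch of matching an uploaded blob's MIME type against an `accept` list, covering the exact-match, `image/*`-style, and `*/*` cases described above (a full glob implementation could be more general):

```python
# Sketch of MIME type matching against a blob "accept" list. Handles exact
# matches, "type/*" subtype globs, and the "*/*" wildcard; a simplified
# illustration, not a complete glob implementation.
def mime_accepted(mime: str, accept: list[str]) -> bool:
    for pattern in accept:
        if pattern == "*/*":
            return True
        if pattern.endswith("/*") and mime.split("/")[0] == pattern[:-2]:
            return True
        if mime == pattern:
            return True
    return False

print(mime_accepted("image/png", ["image/*"]))  # True
print(mime_accepted("video/mp4", ["image/*"]))  # False
print(mime_accepted("video/mp4", ["*/*"]))      # True
```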
### params

This is a limited-scope type which is only ever used for the `parameters` field on `query`, `procedure`, and `subscription` primary types. These map to HTTP query parameters.

Type-specific fields:

- `required` (array of strings, optional): same semantics as the field on `object`
- `properties`: similar to properties under `object`, but can only include the types `boolean`, `integer`, `string`, and `unknown`; or an `array` of one of these types

Note that unlike `object`, there is no `nullable` field on `params`.
### token

Tokens are empty data values which exist only to be referenced by name. They are used to define a set of values with specific meanings. The `description` field should clarify the meaning of the token. Tokens encode as string data, with the string being the fully-qualified reference to the token itself (an NSID followed by an optional fragment).

Tokens are similar to the concept of a "symbol" in some programming languages, distinct from strings, variables, built-in keywords, or other identifiers.

For example, tokens could be defined to represent the states of an entity (in a state machine), or to enumerate a list of categories.

No type-specific fields.
### ref

Type-specific fields:

- `ref` (string, required): reference to another schema definition

Refs are a mechanism for re-using a schema definition in multiple places. The `ref` string can be a global reference to a Lexicon type definition (an NSID, optionally with a `#`-delimited name indicating a definition other than `main`), or can indicate a local definition within the same Lexicon file (a `#` followed by a name).
### union

Type-specific fields:

- `refs` (array of strings, required): references to schema definitions
- `closed` (boolean, optional): indicates if a union is "open" or "closed". Defaults to `false` (open union)

Unions represent that multiple possible types could be present at this location in the schema. The references follow the same syntax as `ref`, allowing references to both global and local schema definitions. Actual data will validate against a single specific type: the union does not combine fields from multiple schemas, or define a new hybrid data type. The different types are referred to as variants.

By default unions are "open", meaning that future revisions of the schema could add more types to the list of refs (though can not remove types). This means that implementations should be permissive when validating, in case they do not have the most recent version of the Lexicon. The `closed` flag (boolean) can indicate that the set of types is fixed and can not be extended in the future.

A `union` schema definition with no `refs` is allowed and is similar to `unknown`, as long as the `closed` flag is false (the default). The main difference is that the data would be required to have the `$type` field. An empty refs list with `closed` set to true is an invalid schema.

The schema definitions pointed to by a `union` are objects, or types with a clear mapping to an object, like a `record`. All the variants must be represented by a CBOR map (or JSON object) and must include a `$type` field indicating the variant type. Because the data must be an object, unions can not reference `token` (which would correspond to string data).
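The open/closed validation semantics above can be sketched in Python. This is a simplification: it compares `$type` values directly against the `refs` strings, ignoring ref normalization (local `#name` refs, bare-NSID vs `#main`); the variant data and NSIDs are hypothetical.

```python
# Sketch of union validation semantics: data must be an object carrying
# "$type"; a closed union requires $type to be one of the refs, while an
# open union tolerates unknown variants (the schema may have gained refs).
def check_union(data: object, refs: list[str], closed: bool = False) -> bool:
    if not isinstance(data, dict) or "$type" not in data:
        return False        # every union variant must be an object with $type
    if data["$type"] in refs:
        return True
    return not closed       # unknown variant: acceptable only if union is open

variant = {"$type": "com.example.defs#imageEmbed", "alt": "a photo"}
print(check_union(variant, ["com.example.defs#imageEmbed"]))              # True
print(check_union(variant, ["com.example.defs#linkEmbed"]))               # True (open)
print(check_union(variant, ["com.example.defs#linkEmbed"], closed=True))  # False
```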
### unknown

Indicates that any data object could appear at this location, with no specific validation. The top-level data must be an object (not a string, boolean, etc). As with all other data types, the value `null` is not allowed unless the field is specifically marked as `nullable`.

The data object may contain a `$type` field indicating the schema of the data, but this is not currently required. The top-level data object must not have the structure of a compound data type, like blob (`$type: blob`) or CID link (`$link`).

The (nested) contents of the data object must still be valid under the atproto data model. For example, it should not contain floats. Nested compound types like blobs and CID links should be validated and transformed as expected.

Lexicon designers are strongly recommended to not use `unknown` fields in `record` objects for now.

No type-specific fields.
## String Formats

Strings can optionally be constrained to one of the following `format` types:

- `at-identifier`: either a Handle or a DID, details described below
- `at-uri`: AT-URI
- `cid`: CID in string format, details specified in Data Model
- `datetime`: timestamp, details specified below
- `did`: generic DID identifier
- `handle`: Handle identifier
- `nsid`: Namespaced Identifier
- `tid`: Timestamp Identifier (TID)
- `record-key`: Record Key, matching the general syntax ("any")
- `uri`: generic URI, details specified below
- `language`: language code, details specified below
For the various identifier formats, when doing Lexicon schema validation the most expansive identifier syntax format should be permitted. Problems with identifiers which do pass basic syntax validation should be reported as application errors, not Lexicon data validation errors. For example, data with any kind of DID in a `did` format string field should pass Lexicon validation, with unsupported DID methods being raised separately as an application error.
### at-identifier

A string type which is either a DID (format `did`) or a handle (format `handle`). Mostly used in XRPC query parameters. It is unambiguous whether an at-identifier is a handle or a DID, because a DID always starts with `did:`, and the colon character (`:`) is not allowed in handles.
### datetime

Full-precision date and time, with timezone information.

This format is intended for use with computer-generated timestamps in the modern computing era (eg, after the UNIX epoch). If you need to represent historical or ancient events, ambiguity, or far-future times, a different format is probably more appropriate. Datetimes before the Current Era (year zero) are specifically disallowed.
Datetime format standards are notoriously flexible and overlapping. Datetime strings in atproto should meet the intersecting requirements of the RFC 3339, ISO 8601, and WHATWG HTML datetime standards.

The character separating "date" and "time" parts must be an upper-case `T`.

Timezone specification is required. It is strongly preferred to use the UTC timezone, and to represent the timezone with a simple capital `Z` suffix (lower-case is not allowed). While hour/minute suffix syntax (like `+01:00` or `-10:30`) is supported, "negative zero" (`-00:00`) is specifically disallowed (by ISO 8601).

Whole seconds precision is required, and arbitrary fractional precision digits are allowed. Best practice is to use at least millisecond precision, and to pad with zeros to the generated precision (eg, trailing `:12.340Z` instead of `:12.34Z`). Not all datetime formatting libraries support trailing zero formatting. Both millisecond and microsecond precision have reasonable cross-language support; nanosecond precision does not.
Implementations should be aware of two ambiguities when round-tripping records containing datetimes: loss of precision, and trailing fractional-second zeros. If de-serializing Lexicon records into native types, and then re-serializing, the string representation may not be the same, which could result in broken hash references, sanity check failures, or repository update churn. A safer approach is to deserialize the datetime as a simple string, which ensures round-trip re-serialization.
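The round-trip hazard can be demonstrated with Python's standard `datetime` module: parsing into a native type and re-serializing changes the string, even though the instant is identical.

```python
from datetime import datetime

# Demonstrates the round-trip hazard described above: deserializing a
# datetime string into a native type and re-serializing can change the
# string representation, even though the instant is the same.
original = "1985-04-12T23:20:50.120Z"
parsed = datetime.strptime(original, "%Y-%m-%dT%H:%M:%S.%f%z")
reserialized = parsed.isoformat()

print(reserialized)              # 1985-04-12T23:20:50.120000+00:00
print(reserialized == original)  # False: fractional padding and "Z" both changed
```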
Implementations "should" validate that the semantics of the datetime are valid. For example, a month or day `00` is invalid.
Valid examples:

```text
# preferred
1985-04-12T23:20:50.123Z
1985-04-12T23:20:50.123456Z
1985-04-12T23:20:50.120Z
1985-04-12T23:20:50.120000Z

# supported
1985-04-12T23:20:50.12345678912345Z
1985-04-12T23:20:50Z
1985-04-12T23:20:50.0Z
1985-04-12T23:20:50.123+00:00
1985-04-12T23:20:50.123-07:00
```
Invalid examples:

```text
1985-04-12
1985-04-12T23:20Z
1985-04-12T23:20:5Z
1985-04-12T23:20:50.123
+001985-04-12T23:20:50.123Z
23:20:50.123Z
-1985-04-12T23:20:50.123Z
1985-4-12T23:20:50.123Z
01985-04-12T23:20:50.123Z
1985-04-12T23:20:50.123+00
1985-04-12T23:20:50.123+0000

# ISO-8601 strict capitalization
1985-04-12t23:20:50.123Z
1985-04-12T23:20:50.123z

# RFC-3339, but not ISO-8601
1985-04-12T23:20:50.123-00:00
1985-04-12 23:20:50.123Z

# timezone is required
1985-04-12T23:20:50.123

# syntax looks ok, but datetime is not valid
1985-04-12T23:99:50.123Z
1985-00-12T23:20:50.123Z
```
### uri

Flexible to any URI schema, following the generic RFC-3986 definition of URIs. This includes, but isn't limited to: `did`, `https`, `wss`, `ipfs` (for CIDs), `dns`, and of course `at`.

Maximum length in Lexicons is 8 KBytes.
### language

An IETF Language Tag string, compliant with BCP 47, defined in RFC 5646 ("Tags for Identifying Languages"). This is the same standard used to identify languages in HTTP, HTML, and other web standards. The Lexicon string must validate as a "well-formed" language tag, as defined in the RFC. Clients should ignore language strings which are "well-formed" but not "valid" according to the RFC.

As specified in the RFC, ISO 639 two-character and three-character language codes can be used on their own, lower-cased, such as `ja` (Japanese) or `ban` (Balinese). Regional sub-tags can be added, like `pt-BR` (Brazilian Portuguese). Additional subtags can also be added, such as `hy-Latn-IT-arevela`.
Language codes generally need to be parsed, normalized, and matched semantically, not simply string-compared. For example, a search engine might simplify language tags to ISO 639 codes for indexing and filtering, while a client application (user agent) would retain the full language code for presentation (text rendering) locally.
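The "simplify for indexing" approach above can be sketched by extracting the primary language subtag. This is deliberately minimal; real matching should follow the RFC 4647 lookup rules rather than simple string splitting.

```python
# Sketch: reduce a BCP 47 language tag to its primary language subtag
# (eg for search indexing), while the full tag is retained for display.
# A minimal illustration; RFC 4647 defines proper matching rules.
def primary_subtag(tag: str) -> str:
    return tag.split("-")[0].lower()

print(primary_subtag("pt-BR"))               # pt
print(primary_subtag("hy-Latn-IT-arevela"))  # hy
print(primary_subtag("JA"))                  # ja (normalized to lower-case)
```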
## When to use `$type`

Data objects sometimes include a `$type` field which indicates their Lexicon type. The general principle is that this field needs to be included any time there could be ambiguity about the content type when validating data.

The specific rules are:

- `record` objects must always include `$type`. While the type is often known from context (eg, the collection part of the path for records stored in a repository), record objects can also be passed around outside of repositories and need to be self-describing
- `union` variants must always include `$type`, except at the top level of `subscription` messages

Note that `blob` objects always include `$type`, which allows generic processing.

As a reminder, `main` types must be referenced in `$type` fields as just the NSID, not including a `#main` suffix.
## Lexicon Evolution
Lexicons are allowed to change over time, within some bounds to ensure both forwards and backwards compatibility. The basic principle is that all old data must still be valid under the updated Lexicon, and new data must be valid under the old Lexicon.
- Any new fields must be optional
- Non-optional fields can not be removed. A best practice is to retain all fields in the Lexicon and mark them as deprecated if they are no longer used.
- Types can not change
- Fields can not be renamed
If larger breaking changes are necessary, a new Lexicon name must be used.
It can be ambiguous when a Lexicon has been published and becomes "set in stone". At a minimum, public adoption and implementation by a third party, even without explicit permission, indicates that the Lexicon has been released and should not break compatibility. A best practice is to clearly indicate in the Lexicon type name any experimental or development status. Eg, `com.corp.experimental.newRecord`.
## Authority and Control
The authority for a Lexicon is determined by the NSID, and rooted in DNS control of the domain authority. That authority has ultimate control over the Lexicon definition, and responsibility for maintenance and distribution of Lexicon schema definitions.
In a crisis, such as unintentional loss of DNS control to a bad actor, the protocol ecosystem could decide to disregard this chain of authority. This should only be done in exceptional circumstances, and not as a mechanism to subvert an active authority. The primary mechanism for resolving protocol disputes is to fork Lexicons into a new namespace.
Protocol implementations should generally consider data which fails to validate against the Lexicon to be entirely invalid, and should not try to repair or do partial processing on the individual piece of data.
Unexpected fields in data which otherwise conforms to the Lexicon should be ignored. When doing schema validation, they should be treated at worst as warnings. This is necessary to allow evolution of the schema by the controlling authority, and to be robust in the case of out-of-date Lexicons.
Third parties can technically insert any additional fields they want into data. This is not the recommended way to extend applications, but it is not specifically disallowed. One danger with this is that the Lexicon may be updated to include fields with the same field names but different types, which would make existing data invalid.
## Usage and Implementation Guidelines
It should be possible to translate Lexicon schemas to JSON Schema or OpenAPI and use tools and libraries from those ecosystems to work with atproto data in JSON format.
Implementations which serialize and deserialize data from JSON or CBOR into structures derived from specific Lexicons should be aware of the risk of "clobbering" unexpected fields. For example, if a Lexicon is updated to add a new (optional) field, old implementations would not be aware of that field, and might accidentally strip the data when de-serializing and then re-serializing. Depending on the context, one way to avoid this problem is to retain any "extra" fields, or to pass through the original data object instead of re-serializing it.
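The "retain extra fields" strategy can be sketched in Python. The record shape (`$type`, `text`, `createdAt`) is a hypothetical example schema, not a published Lexicon.

```python
# Sketch of the "retain extra fields" strategy described above: deserialize
# into known fields, but keep everything else so that re-serialization does
# not clobber fields added by a newer version of the Lexicon.
KNOWN_FIELDS = {"$type", "text", "createdAt"}  # hypothetical record schema

def split_record(data: dict) -> tuple[dict, dict]:
    known = {k: v for k, v in data.items() if k in KNOWN_FIELDS}
    extra = {k: v for k, v in data.items() if k not in KNOWN_FIELDS}
    return known, extra

incoming = {
    "$type": "com.example.post",
    "text": "hello",
    "createdAt": "1985-04-12T23:20:50.123Z",
    "labels": ["new-optional-field"],  # unknown to this implementation
}
known, extra = split_record(incoming)
reserialized = {**known, **extra}      # merge extras back when re-serializing
print(reserialized == incoming)        # True: nothing was clobbered
```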
## Possible Future Changes

The validation rules for unexpected additional fields may change. For example, there could be a mechanism for Lexicons to indicate that the schema is "closed" and unexpected fields are not allowed, or a convention around field name prefixes (`x-`) to indicate unofficial extensions.