Universal container format based on progressive specialization

_[This is a work-in-progress draft design which has been heavily edited since it was first published]_

This is an attempt at designing a highly flexible, yet compact, multipurpose container format that can function as a content/entity identifier, as a file header, as part of a protocol message, or even as a container for both metadata and data by itself.

The underlying concept is very simple: successive type enumerations can be used to progressively "namespace" into more and more specialized contexts that describe increasingly fine-grained information. Note that these type enumerations don't have to be limited to built-in fields (like `entity domain` or `schema version`) -- they can be dynamically inferred from fields whose semantics are progressively refined by the schema itself (somewhat like a state machine).

_(This is mostly an illustrative example of how such a format could be designed, but I did put a lot of thought into it, so I think it's a worthwhile read)_

It starts with a message encoding identifier (1 character), which can be any one of `raw-binary`, `base64`, `base32`, etc.:
```
<message encoding [1 char]>
```

Everything after that character is binary (once the message encoding has been decoded, if needed). First comes a version number for the container format (varint):
```
<container version [varint]>
```
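The draft does not pin down a particular varint encoding. As a minimal sketch, assuming the common unsigned LEB128 layout (7 payload bits per byte, high bit used as a continuation flag), the helpers could look like this. The names `encode_varint` and `decode_varint` are illustrative only:

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint (assumed layout)."""
    if value < 0:
        raise ValueError("varints in this sketch are unsigned")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit set: more bytes follow
        else:
            out.append(byte)  # last byte: continuation bit clear
            return bytes(out)


def decode_varint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, next_offset) for the varint starting at buf[offset]."""
    result = 0
    shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7
```

With this layout, any value below 128 (such as a small version number) occupies a single byte.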

Now a varint for an entity domain identifier (e.g. `file`, `ipfs`, `ipns`, `https`, `bitcoin`, `ethereum` etc.):
```
<entity domain [varint]>
```

And now a varint version number of the schema for the domain (each domain independently maintains its own schema versioning):
```
<domain-specific schema version [varint]>
```
Now the base payload (AKA required fields), whose schema is specific to the particular domain and schema version. Its total length is included so that a client can segment it even if it is unfamiliar with that particular combination:
```
<base payload length [varint]>
<base payload [arbitrary binary layout - can be variable length]>
```
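To illustrate why the length prefix matters, here is a rough sketch of a client splitting a decoded container into its parts without knowing the domain schema. It assumes LEB128-style varints, and `Header` and `parse_container` are hypothetical names, not part of the proposal:

```python
from typing import NamedTuple, Tuple


def decode_varint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode an unsigned LEB128 varint (assumed layout); return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


class Header(NamedTuple):
    container_version: int
    entity_domain: int
    schema_version: int
    base_payload: bytes  # opaque unless the (domain, version) pair is known
    field_data: bytes    # everything after the base payload


def parse_container(binary: bytes) -> Header:
    """Split the binary part of a container (message encoding character already removed)."""
    offset = 0
    container_version, offset = decode_varint(binary, offset)
    entity_domain, offset = decode_varint(binary, offset)
    schema_version, offset = decode_varint(binary, offset)
    length, offset = decode_varint(binary, offset)
    end = offset + length
    if end > len(binary):
        raise ValueError("base payload is truncated")
    return Header(container_version, entity_domain, schema_version,
                  binary[offset:end], binary[end:])
```

A client that does not recognize the domain can still hand `base_payload` along untouched and continue with `field_data`.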

And now field data (AKA optional fields), in a simplified protocol-buffer-like encoding (roughly described below):
```
<field data [unspecified total length]>
```

That's all really. It's not bound to contain a hash of any sort, or to be associated with a particular category within a set of predefined codec types. 

Example: say we want to encode `[raw-binary, container version 2, IPFS, schema version 1]`. Suppose the first required field of that schema is `resource type`, and that its value is `UnixFS File`. That value would in turn refine the schema further, so that `<dag hash type [varint]>` and `<dag hash [binary string]>` are expected as the following fields.

The base document would look something like:
```
<encoding: "b" [1 character]>
<container version: 2 [1 byte]>
<entity domain: IPFS [1 byte]>
<domain-specific schema version: 1 [1 byte]>
<base payload length: 34 [1 byte]>
<resource type: UnixFS file [1 byte]>
<dag hash type: sha-256 [1 byte]>
<dag hash [32 bytes]>
```
(Total length: 1 char + 38 bytes)
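The byte count above can be checked with a short sketch. All numeric codes here (`DOMAIN_IPFS`, `RESOURCE_UNIXFS_FILE`, `HASH_SHA256`) are made-up placeholders, since the draft does not assign any values; the varint layout is assumed to be unsigned LEB128, and the hashed content is a stand-in:

```python
import hashlib


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint (assumed layout)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


# Hypothetical code assignments, for illustration only.
DOMAIN_IPFS = 1
RESOURCE_UNIXFS_FILE = 0
HASH_SHA256 = 0


def build_example(dag_hash: bytes) -> bytes:
    """Assemble the base document from the example, prefixed with the "b" encoding character."""
    base_payload = (encode_varint(RESOURCE_UNIXFS_FILE)
                    + encode_varint(HASH_SHA256)
                    + dag_hash)
    body = (encode_varint(2)               # container version
            + encode_varint(DOMAIN_IPFS)   # entity domain
            + encode_varint(1)             # domain-specific schema version
            + encode_varint(len(base_payload))
            + base_payload)
    return b"b" + body


document = build_example(hashlib.sha256(b"placeholder content").digest())
```

Here `len(document)` is 39: one encoding character followed by 38 bytes, of which the base payload takes 34.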

## Optional fields:

Each optional field is structured as:
```
<data type and field identifier [varint]>
<field payload>
```
Where the first bit of `data type and field identifier` represents the data type and the remaining bits the field identifier (specific to the particular schema). The identifier can grow indefinitely since it's a varint. A single-byte varint carries 7 payload bits, so after the type bit 6 bits remain, which supports up to 64 different field IDs.
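A possible way to pack the key, as a sketch: this assumes the "first bit" is the least significant bit of the varint value (as in protocol buffer tags) and that varints are unsigned LEB128. Neither is fixed by the draft, and `encode_field_key` / `decode_field_key` are illustrative names:

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint (assumed layout)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, next_offset) for the varint starting at buf[offset]."""
    result = 0
    shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


def encode_field_key(data_type: int, field_id: int) -> bytes:
    """Pack the 1-bit data type into the lowest bit and the field ID above it."""
    if data_type not in (0, 1):
        raise ValueError("data type must be 0 (varint) or 1 (binary string)")
    if field_id < 0:
        raise ValueError("field ID must be non-negative")
    return encode_varint((field_id << 1) | data_type)


def decode_field_key(buf: bytes, offset: int = 0) -> Tuple[int, int, int]:
    """Return (data_type, field_id, next_offset)."""
    key, offset = decode_varint(buf, offset)
    return key & 1, key >> 1, offset
```

Under these assumptions field IDs 0 to 63 fit in a one-byte key, and IDs 64 to 8191 need two bytes.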

Data type can be:
```
0: varint 
1: length prepended binary string (where length is a varint)
```
_(I'm not sure if there's a need for anything else, since booleans can be contained in bitfields and floats can be stored in binary strings)_

So let's say that, for the example, we wanted to add `file size`, `chunking algorithm` and `max chunk size` optional fields to the base document:
```
<data type: 0, field id: file size (#0) [1 byte]>
<field payload [6 bytes]>
<data type: 0, field id: chunking algorithm (#1) [1 byte]>
<field payload [1 byte]>
<data type: 0, field id: max chunk size (#2) [1 byte]>
<field payload [3 bytes]>
```
Totals: `file size` 7 bytes, `chunking algorithm` 2 bytes, `max chunk size` 4 bytes. Of course, if some information cannot be represented here (say, chunking is variable), it may simply be left out.
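The sizes of these three fields can be reproduced with a small sketch. The concrete values (a 1 TiB file, algorithm code 0, a 256 KiB chunk limit) are chosen only so that the payloads come out at 6, 1 and 3 bytes; the key layout (type in the lowest bit) and the LEB128 varint are assumptions, and `encode_varint_field` is a hypothetical helper:

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint (assumed layout)."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_varint_field(field_id: int, value: int) -> bytes:
    """Encode an optional field of data type 0 (varint): key followed by payload."""
    key = (field_id << 1) | 0  # data type 0 in the lowest bit (assumed position)
    return encode_varint(key) + encode_varint(value)


# Field IDs follow the example: #0 file size, #1 chunking algorithm, #2 max chunk size.
file_size = encode_varint_field(0, 2**40)        # 1 TiB, illustrative value
chunking_algorithm = encode_varint_field(1, 0)   # hypothetical algorithm code
max_chunk_size = encode_varint_field(2, 262144)  # 256 KiB, illustrative value

field_data = file_size + chunking_algorithm + max_chunk_size
```

The three fields come to 7, 2 and 4 bytes, so `field_data` is 13 bytes in total.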

Now let's say the user also wants to add a signature for the hash, which is not supported in the base schema. They would need to use their own application-specific field identifier in a reserved range (for this example, say 4096+ is reserved; 4096 is roughly midway within the range available for 2-byte identifiers).

```
<data type: 1, field id: hmac-sha-256 hash signature (#4096) [2 bytes]>
<field payload [1 for length + 32 bytes for data]>
```

Even if the client doesn't understand this field, it can safely ignore and skip it since all the length information is available through the encoding itself.
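Skipping works because each data type determines how many bytes its payload occupies. A minimal sketch of a reader that walks the field data and keeps only the IDs it recognizes, assuming LEB128 varints and the type bit in the lowest bit of the key (`iter_fields` and `known_fields` are illustrative names):

```python
from typing import Dict, Iterator, Set, Tuple, Union


def decode_varint(buf: bytes, offset: int = 0) -> Tuple[int, int]:
    """Decode an unsigned LEB128 varint (assumed layout); return (value, next_offset)."""
    result = 0
    shift = 0
    while True:
        if offset >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


def iter_fields(field_data: bytes) -> Iterator[Tuple[int, Union[int, bytes]]]:
    """Yield (field_id, value) for every optional field, whether known or not."""
    offset = 0
    while offset < len(field_data):
        key, offset = decode_varint(field_data, offset)
        data_type, field_id = key & 1, key >> 1
        if data_type == 0:
            # Varint payload: its own continuation bits delimit it.
            value, offset = decode_varint(field_data, offset)
        else:
            # Binary string: a varint length followed by that many bytes.
            length, offset = decode_varint(field_data, offset)
            end = offset + length
            if end > len(field_data):
                raise ValueError("truncated binary string field")
            value = field_data[offset:end]
            offset = end
        yield field_id, value


def known_fields(field_data: bytes, known_ids: Set[int]) -> Dict[int, Union[int, bytes]]:
    """Keep only recognized fields; unknown ones are parsed past and dropped."""
    return {fid: value for fid, value in iter_fields(field_data) if fid in known_ids}
```

A client that only knows IDs 0 to 2 would call `known_fields(field_data, {0, 1, 2})` and never need to interpret field #4096.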

Note that it's possible to standardize identifiers within the range 4096+ as application reserved globally for all domains. This would mean that application-specific fields could be added to a document even if its schema is not understood by the client.

