Protocol Tech Debt (2024) #2128

bnewbold · 2024-02-03T00:30:39Z

bnewbold
Feb 3, 2024
Maintainer

This is a living list of known issues with atproto, as it is running today in the live network, which we would like to change or fix.

It isn’t a list of missing protocol features or product features; there are a bunch of other things we plan to finish or add to atproto. These are only lower-level technical issues with stuff which has already been specified and implemented, and which could impact future interoperability.

Note that almost none of these should block apps, bots, or other integrations from being built today. Most of these have to do with low-level infrastructure (like Relays), or dealing with old content from the early days of the network.

Urgent

Stuff that we want to fix in time for federation, but might slip.

Commit prev is Nullable and Optional: see discussion below about nullable+optional in general. In particular we should make the prev field on commit objects one or the other; probably optional. In theory this would require a change of the repo version number, but that is disruptive; maybe we can get away with just saying it is optional in version 3? UPDATE: we will say it is nullable-only in version 3, see Resolution of repo v3 commit `prev` being both nullable and optional #2181
Colon in Record Keys: according to the spec, record keys can not have colons (:), but this has not been enforced by the PDS, and there are some popular feed generator records in the wild with this character. There is some debate over whether we should change the spec or migrate the records; the important thing is to get implementation and specification aligned. UPDATE: we decided we'll allow colons in record keys, see Decision to allow colon character (`:`) in Record Keys #2224
Remove Legacy Blobs: there used to be a different format for blob encoding. It would be great to say that there is only one acceptable blob encoding going forward, and either invalidate or migrate all the old records containing them (which date back to early 2023). If we only update the records, then “strong” references (by CID) will throw warnings (eventually), but that is way easier than recursive migrations, and probably good enough.
Enforce Datetime Syntax Strictly: currently the reference service implementations are lenient about datetime parsing, and will try to make the most of strings which are valid ISO strings but not valid RFC 3339, or are missing timezone context. There are many such records. We should at least prevent creation of more such records, and possibly bulk-update old invalid records. UPDATE: datetimes are enforced strictly on record creation for posts, though not other datetime fields
Repo Event Stream Message Types: will probably be updated to have consolidated “identity” and “account” message types (replacing handle, migration, tombstone). There should also be a recognizable event for new identities. This work will likely happen when account migration is finalized and shipped.
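
To illustrate the datetime strictness item above, here is a minimal sketch of a strict check in Python. The pattern and the `is_strict_datetime` name are assumptions made for illustration, not the reference implementation's rules: the sketch only requires an uppercase `T` separator, seconds, optional fractional seconds, and an explicit timezone (`Z` or `±HH:MM`).

```python
import re
from datetime import datetime

# Hypothetical strict pattern (an assumption, not the official rule set):
# date, uppercase "T", time with seconds, optional fraction, mandatory timezone.
_STRICT_DATETIME = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})$"
)


def is_strict_datetime(value: str) -> bool:
    """Return True only if the string matches the pattern and is a real date."""
    if not _STRICT_DATETIME.match(value):
        return False
    # Drop fractional seconds and normalize "Z" so fromisoformat accepts the
    # string on older Python versions as well.
    base = re.sub(r"\.\d+", "", value).replace("Z", "+00:00")
    try:
        datetime.fromisoformat(base)
    except ValueError:
        return False
    return True
```

A lenient parser would accept a string such as `2024-02-03 00:30:39` (no `T`, no timezone); the sketch rejects it, which is the kind of behaviour the item asks for at record creation time.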

Strictness and Completeness

These are changes which should not impact “well behaved” clients, but could invalidate some old records.

Enforce Record Key As Part Of Lexicon Validation: Lexicons declare a record-key syntax for records of that type, but we are not consistent about enforcing that. We should reject records with “wrong” record key syntax, and ensure that the specs are clear on this. For example, trying to create an app.bsky.actor.profile with record key as a TID should be a validation error (can override by skipping validation at record creation time).
Add tid and record-key as String Formats to Lexicon Language: many other identifiers in atproto have their own Lexicon string formats, which help with automatic validation. TID and Record Key are both specified identifiers and appear in a lot of APIs. They should get formats, and existing Lexicons should be updated to require them (this is a “bending” schema change). UPDATE: these formats have been added to the Lexicon specs, though are not yet enforced in the Lexicons themselves.
Nullable and Optional: Lexicon currently allows specifying fields which are both nullable and optional, and both JSON and CBOR can represent this distinction (aka, explicit null vs entirely omitting the field). Some programming language serialization libraries can support this as well, but others struggle. In particular, there isn’t an obvious/idiomatic way to distinguish in golang using the standard Marshal/Unmarshal system. Proposal is to update Lexicon spec to say this combination is disallowed; and update any instances in current Lexicons (might be breaking changes).
Formalize and Document CAR Diff Contents: repo event stream messages contain binary data (a CAR file containing "blocks") representing new MST nodes and records from a specific commit. Which record and MST node blocks are contained, and any flexibility around this, has changed a bit over time and hasn't been formalized yet.
Formalize Lexicon Unions: it is probably the case that only objects (containing a $type) or token strings (which are a reference) can be used in unions. Eg, it probably isn't possible or allowed to put an integer in a union. This should be clarified.
Formalize Lexicon unknown: it is probably the case that only objects can be used for unknown format data, but this isn't specified and should be cleared up. Furthermore, any recursive restrictions on the object data should be clarified. For example, floats are mentioned as not being allowed, but further invalid blob-like objects are probably not allowed either.
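
The "Nullable and Optional" distinction above can be made concrete with a small Python sketch. It shows that JSON can tell an explicit `null` apart from an omitted field, but only if the decoding code keeps a separate marker for "not present". The `MISSING` sentinel and the `describe` helper are hypothetical names used for illustration only.

```python
import json

# Sentinel for "field was not present at all", distinct from None (explicit null).
MISSING = object()


def read_field(raw_json: str, name: str):
    """Return the field value, None for an explicit null, or MISSING if omitted."""
    obj = json.loads(raw_json)
    return obj.get(name, MISSING)


def describe(raw_json: str, name: str) -> str:
    """Classify how a field appears in a JSON object."""
    value = read_field(raw_json, name)
    if value is MISSING:
        return "omitted"
    if value is None:
        return "explicit null"
    return "present"
```

For example, `describe('{"prev": null}', "prev")` and `describe('{}', "prev")` give different answers. Serialization systems that map both cases to the same zero value lose that difference, which is the difficulty the item describes.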

Lexicon Cleanups

There are some cleanups we’d like to do which will bend or break Lexicon stability rules. We don’t think these will actually be very disruptive, and think it is reasonable to declare “one last” round of breaking changes before we commit to stability.

Move Ozone Lexicons to New Namespace: there are a bunch of Lexicons under com.atproto.admin.* which are pretty specific to the Ozone backend. We’d like to move these to an Ozone-specific NSID namespace (eg, tools.ozone.*), which would be independent of both com.atproto and app.bsky.
Remove Deprecated Fields and Endpoints: there are a few Lexicon fields and endpoints which are explicitly marked “DEPRECATED”. It would be good to just remove those while we have the chance, at least if they are optional fields. See also: remove deprecated com.atproto.sync endpoints (getHead and getCheckout) #2505, remove deprecated sync fields #2506
Lexicon string length hardening. There are a few string fields with no max lengths. We started tightening these, but it was disruptive with existing records. We hope to have a process for rolling out this kind of change gradually. See also: lexicons: more string limits #1994
Small cleanup tasks: Lexicon nits (Jan 2024 edition) #2111
update many endpoints to return 404 for "Not Found", instead of 400

“Someday”

Things that might change in the future, but aren’t planned work. We might get to these in the coming year or so; going through a formal standardization process would also be a good time to revisit them.

Don’t require /xrpc/ prefix: requiring HTTP URL path prefixes is generally frowned on by standards folks. We also try not to use the “XRPC” terminology as much these days. We can probably find a way to make the prefix flexible, allowing folks to use an alternative if they want.
Event Stream Sequence Numbers: for example, sharding, or having more formal mechanisms to reset or migrate sequences.
Event Stream Messages are Two CBOR Objects: this makes decoding efficient (single-pass, known schema) for some implementations (like golang), but makes other implementations quite difficult (like rust). This might just be a wart.
Handles and DIDs in AT URIs: it is confusing to allow either handles or DIDs in AT-URIs, particularly when they are in records (as opposed to in API query parameters, etc). This might just be a wart. Discussion: Where are `at://` URIs with handles supported? #1778
Clarify role of floating point numbers in data model: the IPLD data model (and DAG-CBOR codec) allows floats, but the atproto specification recommends against them. Note that CBOR explicitly distinguishes floats from integers, while JSON is more casual about the distinction. How should floats in CBOR and JSON be handled in atproto? Should records containing a float field be considered invalid? Even within an unknown nested field/object?
Optionally retain record history in repo: make it possible to store multiple versions of a record, with different CIDs, in the same repo. One way to do this would be extending the repo "path" syntax for old versions. This would be optional, as an extra flag on update operations.
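
As a rough illustration of the floating point question above, the following Python sketch walks a decoded record and reports where floats appear, including inside nested objects. The `find_floats` helper and the sample record are made up for this example; whether such records should be rejected is exactly the open question.

```python
import json


def find_floats(value, path="$"):
    """Yield the path of every float found in nested JSON-like data."""
    if isinstance(value, float):
        yield path
    elif isinstance(value, dict):
        for key, child in value.items():
            yield from find_floats(child, f"{path}.{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from find_floats(child, f"{path}[{index}]")


# Hypothetical record: Python's JSON decoder keeps "3" as int and "2.0" as float.
record = json.loads(
    '{"count": 3, "score": 1.5, "nested": {"ratio": 2.0, "tags": [1, 0.25]}}'
)
float_paths = list(find_floats(record))
# float_paths == ["$.score", "$.nested.ratio", "$.nested.tags[1]"]
```

Note that this depends on the decoder: Python distinguishes `2.0` from `2` when parsing JSON, but not every JSON library does, which is part of why the JSON side of the question is harder to pin down than the CBOR side.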

"Unfortunate"

These are things we know are controversial or have poor developer experience, but are not likely to change at this point. But maybe they will, if developer friction continues to be bad despite iterations on docs and SDK design.

Datetimes as Strings: some folks would prefer we used UNIX timestamp integers (seconds? milliseconds? nanoseconds?), instead of the current datetime string format in Lexicons, or in particular in wire parts of the protocol. Other folks think the human-accessibility and web-iness is good.
Datetime Semantics: we have discussed trust in timestamps in distributed systems, and how createdAt and indexedAt can be combined into a sortAt for use in displaying and ordering posts in feeds. This continues to cause confusion and consternation, and some folks think that global reliable timestamps are important enough to warrant an extension to the protocol.
UTF-8 Richtext Facet Indexing: this comes up periodically as a confusing or bad decision. We are pretty confident we made the right call on this.
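
For readers unfamiliar with the facet indexing issue above: richtext facets refer to ranges of the post text by offsets into its UTF-8 encoding, not by character positions. A minimal Python sketch of the difference follows; `utf8_byte_range` is a hypothetical helper, not part of any SDK.

```python
def utf8_byte_range(text: str, substring: str) -> tuple[int, int]:
    """Return (byte_start, byte_end) of the first occurrence, in UTF-8 bytes."""
    char_start = text.index(substring)
    # Bytes before the match: encode the prefix and measure it.
    byte_start = len(text[:char_start].encode("utf-8"))
    byte_end = byte_start + len(substring.encode("utf-8"))
    return byte_start, byte_end


text = "héllo 👋 @alice.example"
mention = "@alice.example"
char_index = text.index(mention)          # 8 (counted in code points)
byte_range = utf8_byte_range(text, mention)  # (12, 26) (counted in UTF-8 bytes)
```

The mention starts at character 8 but at byte 12, because `é` takes two bytes and the emoji takes four. Languages whose native strings are indexed in UTF-16 code units or code points need a conversion step like this one, which is where the recurring confusion comes from.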

DavidBuchanan314 · 2024-02-03T01:30:21Z


Regarding "recursive migration" of records, I believe it's relatively feasible using the following high-level approach:

1. Stop accepting new records that don't meet the new validity criteria (whatever is chosen)
2. Build a reverse-lookup table index for all record CID dependencies (but only for records from before the new validity rules) (NB: the cryptographic properties of CIDs guarantee that there are no dependency cycles)
3. Build a table that remembers whether each record is "visited" or not.
4. Initialize an empty "edit queue".
5. For each record not yet marked as "visited":
   1. Following CID references in both directions, enumerate all connected nodes and store them in a set. Mark them as "visited" as you add them. Most of the time this will be a small set, but it could be large for "hellthreads" and trendy quote-posts chains (but probably still fitting in memory).
   2. Initialize an empty lookup table mapping "old CID" to "new CID".
   3. Iterate through this set of records in topological order. For each record:
      1. Normalize it according to whatever rules are chosen, making use of the "old to new CID" lookup table for any CIDs it references (the topological ordering guarantees that any changes you need to know about will already be in the table).
      2. If the record actually changed during normalization, note the change in the "edit queue", and note the changed CID in the "old to new CID" lookup table.
6. Process/apply your edit queue. This will inevitably need to be split up for each PDS (to simplify things you could maintain a separate queue for each PDS from the start)

I believe this algorithm is overall O(n), and it guarantees that each record is processed at most once. Step 5 is most simply done serially, but it can be done arbitrarily slowly. Assuming you have the disk space for it, you can fully complete all these steps and double-check your work before embarking on step 6, which actually commits the changes. While step 6 is in progress, people might notice subtle breakage when browsing old threads, but other than that it should be fairly unobtrusive. Theoretically you could figure out how to make all the changes atomically but that's probably more trouble than it's worth. There's also plenty of scope for concurrency/parallelism.

Edit: I just thought of one issue with this approach. If any new replies (or other references) are created to posts that need edits, during the execution of the algorithm, then those references would become stale. You could minimise this by waiting a while after step 1.
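
The planning part of the procedure above (everything before the edit queue is applied) can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not a production design: `fake_cid` is a stand-in for real CID computation, `normalize` is a hypothetical rule (trimming whitespace from a `text` field), and each record is assumed to be a dict with `text` and `refs` keys.

```python
import hashlib
import json
from graphlib import TopologicalSorter


def fake_cid(record: dict) -> str:
    """Stand-in for a real CID: a hash of the canonical JSON encoding."""
    encoded = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()[:16]


def normalize(record: dict) -> dict:
    """Hypothetical normalization rule: strip whitespace around 'text'."""
    return {**record, "text": record["text"].strip()}


def plan_migration(records: dict[str, dict]) -> list[tuple[str, str, dict]]:
    """Return an edit queue of (old_cid, new_cid, new_record) tuples."""
    # Reverse-lookup index: cid -> set of cids that reference it.
    referenced_by: dict[str, set[str]] = {cid: set() for cid in records}
    for cid, rec in records.items():
        for ref in rec["refs"]:
            if ref in records:
                referenced_by[ref].add(cid)

    visited: set[str] = set()
    edit_queue = []
    for start in records:
        if start in visited:
            continue
        # Enumerate the connected set, following references in both directions.
        component, stack = set(), [start]
        while stack:
            cid = stack.pop()
            if cid in component:
                continue
            component.add(cid)
            visited.add(cid)
            stack.extend(r for r in records[cid]["refs"] if r in records)
            stack.extend(referenced_by[cid])
        # Topological order: referenced records come before their referrers.
        graph = {
            cid: [r for r in records[cid]["refs"] if r in records]
            for cid in component
        }
        old_to_new: dict[str, str] = {}
        for cid in TopologicalSorter(graph).static_order():
            rec = normalize(records[cid])
            rec["refs"] = [old_to_new.get(r, r) for r in rec["refs"]]
            new_cid = fake_cid(rec)
            if new_cid != cid:
                old_to_new[cid] = new_cid
                edit_queue.append((cid, new_cid, rec))
    return edit_queue
```

The sketch keeps everything in one dictionary and one queue; splitting the queue per PDS and guarding against references created while the plan is being computed are left out.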

0 replies

DavidBuchanan314 · 2024-02-18T00:40:57Z


Something else you might want to add: [non-]support for floating point values in records.

The spec says they're not allowed (last time I checked) but the current implementation allows them (also last time I checked), and there are floats in records in-the-wild (at least, there are in my repo)

1 reply

bnewbold Feb 18, 2024
Maintainer Author

great point, added

Zackaryia · 2024-03-20T23:22:32Z


In order to create trusted timestamps, I recommend requiring all timestamps to be signed by some sort of agreed-upon time server or blockchain, similar to how https://opentimestamps.org/ works. The only key difference is that OpenTimestamps uses Bitcoin to sign timestamps, but I recommend XRP because it has fast blocks (~4 seconds) and is really cheap per transaction. The signature would just include the Merkle tree path between the object's hash and the XRP record hash, as well as the ID of the XRP record and the ledger ID. I guesstimate that this would only require ~300 bytes of extra data, which is more than affordable for signed timestamps.

0 replies

Zackaryia · 2024-03-20T23:24:43Z


Also, why are Merkle search trees used instead of B+ trees? They are less efficient in speed and space, and their only advantage is their reproducibility, but as far as I can tell a reproducible repo is not needed.

2 replies

DavidBuchanan314 Mar 21, 2024

The benefits of the MST relative to alternatives are subtle. For one thing, the MST gives you compact exclusion proofs (i.e. proof that a particular record does not currently exist). Speed is mostly a non-issue because the MST itself is not the primary record lookup mechanism - AppViews etc. will have their own db indexes. Most importantly of all, the MST gives compact authenticated diffs, which are critical for the repo sync mechanism(s) (including but not limited to the "firehose")

Zackaryia Mar 21, 2024

I believe that you can get proof that a record does not exist and much smaller authenticated diffs with B+ trees.
The main issue with MST is that randomness is not evenly distributed, which leads to non-uniform layouts and therefore non-optimal diffs, whereas B+ trees can be re-balanced to make optimal diffs.

I could create a small demo to prove this if it would be useful, but when I played around with real repos I found the layout to be much worse than even a semi-well balanced B+ tree.

The most optimal authenticated diff would be just a Merkle tree hash path between an object and the root node, which is very large with MSTs because they aren't designed to generate the most compact Merkle trees.

I have been playing around with the idea of untrusted PDSes, where a client downloads the full proof chain between a DID and a post to prove that a DID posted a message. However, atproto is not designed with this in mind, so the full chain would be pretty large, but I think this is a cool goal.
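
For context on the layout discussion in this thread, here is a minimal Python sketch of how a key's layer in the MST is derived from its hash. It assumes the rule as I understand it from the repository spec: hash the key with SHA-256, count leading zero bits, and divide by two (rounding down). The function names are hypothetical.

```python
import hashlib


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits before the first set bit in a byte string."""
    count = 0
    for byte in digest:
        if byte == 0:
            count += 8
            continue
        # 8 - bit_length() is the number of leading zero bits in this byte.
        count += 8 - byte.bit_length()
        break
    return count


def mst_layer(key: str) -> int:
    """Layer of a key: leading zero bits of its SHA-256 hash, in 2-bit chunks."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return leading_zero_bits(digest) // 2
```

Because the layer depends only on the key, any two implementations holding the same set of keys build the same tree shape. That determinism is the "reproducibility" property being weighed against B+ tree balancing above.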
