Description
As discussed in multiple LA meetings, I'd like to enable DAG support in the SDK.
Although, there might be an ambiguity on what it actually entails. I showed multiple examples where DAGs were used with a single UAST (tree-like), multiple roots (token stream, native AST, UAST) and more graph-like structures (all mentioned UAST roots, schema nodes, root-less annotations pointing to UAST nodes, etc).
For now, I only propose to enable the simplest use case: a single UAST root, where a node can be referenced multiple times (without loops).
The goal of this issue is to start a discussion. Later it will be split into separate issues covering specific topics of the conversion. It will also serve as a first step toward full DAG support.
Use cases:
- Some drivers emit the same node twice in different places in the AST:
- Efficiently represent the same positional node used multiple times.
- Binary format already compresses them out.
- But it will increase the performance of the encoder.
- Efficiently represent abiguous nodes:
- Define Ambiguity node #337
- Child nodes will be cloned right now. Might lead to duplication of large subtrees.
- Partially allow drivers to resolve symbols to a single virtual node
- Go only?
Required changes:
SDK will need to be updated to handle these kind of nodes properly. The change requires to track visited nodes and should be easy to implement.
Thanks to opaque binary format, there are no changes to the protocol.
The binary format, as it is right now, supports DAGs out of the box. However, it tries to automatically compress leaf subtrees by referencing the same node ID multiple times. This will no longer work with DAGs, since decoder will assume that its the same node, regardless if a node was generated by the compressor or not. Thus, we will need to add a new field to protobuf definition of the node. This field can be defined as bool clone
and will control whatever this node was made for compression reasons (true
), or it's exactly the same node from semantic point of view (false
).
YAML encoder will probably need to be changed to print implicit node IDs. We could use YAML built-in node ID references, but turned out that there is no YAML library in Go that supports this feature. Thus, the best way is to encode an additional @ref
pseudo-property, and deduplicate nodes based on this property when decoding. Same can be done for JSON.
Babelfish Web can probably ignore this change for now. The proxy can simply take the in-memory DAG and serialize it to JSON without using our encoder. This operation is safe and will only lead to learger JSON output for the Babelfish Web. Later, it can be updated to use our JSON encode and support @ref
for a more compact encoding.
Libuast's UastLoad
method will need to be updated to track visited nodes. It will then automatically work for all clients.
CC @bblfsh/maintainers