OT, but I dislike that this page uses italic font for two entirely different purposes, emphasis and as a hover target for additional information. How am I as the reader supposed to know which is which?
Oops, that’s a total mistake on our part! I’ve (re)added a reference (it was lost along the way somehow) and included it back in our latter discussion. Thanks for pointing out that omission!
After some thought … this seems vulnerable to a similar problem as many hash-tables, wherein if an adversary can choose keys to add to the data structure, they can come up with keys that mess it up enough to ruin the performance characteristics.
In this case, they can generate keys that have a low rank, or I guess, sufficient keys of the same rank. That will bias the expected distribution of ranks among the keys and make the structure top-heavy (or middle-heavy or whatever.) This would be quite bad for a zip-tree since it creates longer and longer linked lists; not as bad if the “G” data structure is more efficient, but possibly still a problem.
Prolly trees aren’t vulnerable to this, AFAIK, because the tree structure is determined not by individual keys but by a rolling hash function over concatenated series of keys in a node.
Unfortunately, prolly-trees are also vulnerable to adversarial keys, in pretty much the same way as any probabilistic tree structure is vulnerable. If you are aware of the rolling hash procedure, you can come up with keys that will trigger a boundary (or not), just the same as you can with g-trees or other structures that rely on local patterns. In here, the dolt folks explain how they try to adjust things to mitigate this (and some other properties) to a degree: https://www.dolthub.com/blog/2024-03-03-prolly-trees/, but this introduces new problems (like, your boundary selection is no longer independent, as detailed here: https://interjectedfuture.com/lab-notes/lab-note-033-cascading-boundary-changes-in-prolly-trees/). Ideally, you want to completely avoid having external information to help select your boundaries (having said that, dolt’s modification to adjust for node size is quite nice).
In g-trees, we opt to abstract away this part with the internal set structure. So you could, for instance, parameterize the internal set structure to be another g-tree with a different pseudorandom function to determine ranks (use the leading zero count of the binary representation on the main tree, and the trailing zero count on the internal set tree for example). Note also that, while it is totally true that someone with control over the keys can come up with pretty horrific patterns, we can also fight that to a degree with a good hash function. While obviously not an actual fix, it can at least make attacks like this more annoying for our adversary.
Interesting. From the doc comment (ranks are stored implicitly, nodes are stored as arrays), it looks like ClojureDart is using Merkle Search Trees, not G-trees.
Come for the data structures, stay for all the thoughtful reading enhancements.
Indeed! This is a lovely way to do online publishing.
According to the README on the repo it’s written using this template for a program called Macromania.
Even if I don’t use those tools, I’ll certainly be aiming to make my future online writing as helpfully interactive as this.
OT, but I dislike that this page uses italic font for two entirely different purposes, emphasis and as a hover target for additional information. How am I as the reader supposed to know which is which?
Slightly surprised there’s no mention of the seemingly-similar Prolly Tree. (At least, not in the first half of the paper…)
Oops, that’s a total mistake on our part! I’ve (re)added a reference (it was lost along the way somehow) and included it back in our latter discussion. Thanks for pointing out that omission!
That’s another very cute B-tree-like data structure. Thanks for the pointer!
After some thought … this seems vulnerable to a similar problem as many hash-tables, wherein if an adversary can choose keys to add to the data structure, they can come up with keys that mess it up enough to ruin the performance characteristics.
In this case, they can generate keys that have a low rank, or I guess, sufficient keys of the same rank. That will bias the expected distribution of ranks among the keys and make the structure top-heavy (or middle-heavy or whatever.) This would be quite bad for a zip-tree since it creates longer and longer linked lists; not as bad if the “G” data structure is more efficient, but possibly still a problem.
Prolly trees aren’t vulnerable to this, AFAIK, because the tree structure is determined not by individual keys but by a rolling hash function over concatenated series of keys in a node.
Unfortunately, prolly-trees are also vulnerable to adversarial keys, in pretty much the same way as any probabilistic tree structure is vulnerable. If you are aware of the rolling hash procedure, you can come up with keys that will trigger a boundary (or not), just the same as you can with g-trees or other structures that rely on local patterns. In here, the dolt folks explain how they try to adjust things to mitigate this (and some other properties) to a degree: https://www.dolthub.com/blog/2024-03-03-prolly-trees/, but this introduces new problems (like, your boundary selection is no longer independent, as detailed here: https://interjectedfuture.com/lab-notes/lab-note-033-cascading-boundary-changes-in-prolly-trees/). Ideally, you want to completely avoid having external information to help select your boundaries (having said that, dolt’s modification to adjust for node size is quite nice).
In g-trees, we opt to abstract away this part with the internal set structure. So you could, for instance, parameterize the internal set structure to be another g-tree with a different pseudorandom function to determine ranks (use the leading zero count of the binary representation on the main tree, and the trailing zero count on the internal set tree for example). Note also that, while it is totally true that someone with control over the keys can come up with pretty horrific patterns, we can also fight that to a degree with a good hash function. While obviously not an actual fix, it can at least make attacks like this more annoying for our adversary.
Sorted maps in ClojureDart are also geometric search trees and I find them easier to implement than B-trees (no balancing). https://github.com/Tensegritics/ClojureDart/blob/e05433c2999af00b19c1ea5c6093f639d1d7d4c5/clj/src/cljd/core.cljd#L8067
Interesting. From the doc comment (ranks are stored implicitly, nodes are stored as arrays), it looks like ClojureDart is using Merkle Search Trees, not G-trees.
Looks indeed similar modulo the merkle part. Thanks