-
Notifications
You must be signed in to change notification settings - Fork 243
Add ZeroAsciiDenseSparse2dTrie for more efficient storage of data keys with many attributes #7264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Manishearth
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good design!
|
|
||
| impl ZeroAsciiDenseSparse2dTrieOwned { | ||
| /// Builds one of these from a two-dimensional BTreeMap and a delimiter. | ||
| pub fn try_from_btree_map_str( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: please document the soft invariant that the inner btree values should be all in the same locality for maximum performance
though currently this is a "this code will panic" invariant so perhaps we should just say that first (and onc ethat is fixed, talk about the soft performance invariant)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added docs about the locality invariant. I called it "in a compact range".
| /// ``` | ||
| #[cfg(feature = "alloc")] | ||
| #[derive(Debug, PartialEq, Eq)] | ||
| pub struct ZeroAsciiDenseSparse2dTrieOwned { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
observation: this isn't generic over a Store. Making such a change would be tricky, though, you'd need to be generic over two kinds of stores. I think this is fine for now, we can experiment later.
thought: name's not great, I don't have better ideas, and I don't care strongly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added docs explaining why it isn't generic over a Store.
| } | ||
|
|
||
| /// Borrows the structure for encoding into [`zerovec`]. | ||
| pub fn as_encodeable(&self) -> impl EncodeAsVarULE<ZeroAsciiDenseSparse2dTrieVarULE> + '_ { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praise: this is a good trick for avoiding more unsafe impls.
utils/zerotrie/src/dense.rs
Outdated
| /// A data structure designed for 2-dimensional ASCII keys with a fixed | ||
| /// delimiter that exhibit a mix of density and sparsity (owned version). | ||
| /// | ||
| /// It stores prefixes and suffixes separated by a delimiter, which must not |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: the "delimeter" is really a mostly-internal detail, perhaps we should externally document this as a trie for 2D keys, and briefly mention that when constructing the trie one must include a delimeter that is not found in the 2D keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I improved this.
| #[cfg(feature = "alloc")] | ||
| #[derive(Debug, PartialEq, Eq)] | ||
| pub struct ZeroAsciiDenseSparse2dTrieOwned { | ||
| /// The main trie, including all data that isn't in the dense matrix. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: please explicitly document the "row offset index" as an internal impl comment
issue: please include a worked out example showing the data model here as an internal comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, in the docs on the dense field
utils/zerotrie/src/builder/dense.rs
Outdated
| } | ||
| // >= because DenseType::MAX is the sentinel | ||
| if max - min >= usize::from(DenseType::MAX) { | ||
| todo!("need to handle this case: pick the best dense row offset"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: probably make this somewhat actionable to the user by saying what's wrong with the data
"currently do not support handling a large variance of values in the same prefix" or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, and added a comment explaining how I'd implement it when we need it
utils/zerotrie/src/builder/dense.rs
Outdated
| .iter() | ||
| .map(|suffix| { | ||
| let value = sub_trie.get(suffix).unwrap_or(sentinel); | ||
| let Ok(offset) = DenseType::try_from(value - min) else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very minor nit: I'd preemptively set let row_value_offset = min at the beginning of this branch and then subtract that instead of min here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
#6920