Skip to content

Conversation

@sffc
Copy link
Member

@sffc sffc commented Dec 2, 2025

@sffc sffc requested a review from a team as a code owner December 2, 2025 16:52
@sffc sffc marked this pull request as draft December 2, 2025 16:52
@sffc sffc changed the title Initial work on ZeroTrie dense data optimization Add ZeroAsciiDenseSparse2dTrie for more efficient storage of data keys with many attributes Dec 10, 2025
@sffc sffc marked this pull request as ready for review December 10, 2025 03:33
Copy link
Member

@Manishearth Manishearth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good design!


impl ZeroAsciiDenseSparse2dTrieOwned {
/// Builds one of these from a two-dimensional BTreeMap and a delimiter.
pub fn try_from_btree_map_str(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please document the soft invariant that the inner btree values should be all in the same locality for maximum performance

though currently this is a "this code will panic" invariant so perhaps we should just say that first (and onc ethat is fixed, talk about the soft performance invariant)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added docs about the locality invariant. I called it "in a compact range".

/// ```
#[cfg(feature = "alloc")]
#[derive(Debug, PartialEq, Eq)]
pub struct ZeroAsciiDenseSparse2dTrieOwned {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

observation: this isn't generic over a Store. Making such a change would be tricky, though, you'd need to be generic over two kinds of stores. I think this is fine for now, we can experiment later.

thought: name's not great, I don't have better ideas, and I don't care strongly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added docs explaining why it isn't generic over a Store.

}

/// Borrows the structure for encoding into [`zerovec`].
pub fn as_encodeable(&self) -> impl EncodeAsVarULE<ZeroAsciiDenseSparse2dTrieVarULE> + '_ {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: this is a good trick for avoiding more unsafe impls.

/// A data structure designed for 2-dimensional ASCII keys with a fixed
/// delimiter that exhibit a mix of density and sparsity (owned version).
///
/// It stores prefixes and suffixes separated by a delimiter, which must not
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: the "delimeter" is really a mostly-internal detail, perhaps we should externally document this as a trie for 2D keys, and briefly mention that when constructing the trie one must include a delimeter that is not found in the 2D keys.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I improved this.

#[cfg(feature = "alloc")]
#[derive(Debug, PartialEq, Eq)]
pub struct ZeroAsciiDenseSparse2dTrieOwned {
/// The main trie, including all data that isn't in the dense matrix.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue: please explicitly document the "row offset index" as an internal impl comment

issue: please include a worked out example showing the data model here as an internal comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, in the docs on the dense field

}
// >= because DenseType::MAX is the sentinel
if max - min >= usize::from(DenseType::MAX) {
todo!("need to handle this case: pick the best dense row offset");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: probably make this somewhat actionable to the user by saying what's wrong with the data

"currently do not support handling a large variance of values in the same prefix" or something

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, and added a comment explaining how I'd implement it when we need it

.iter()
.map(|suffix| {
let value = sub_trie.get(suffix).unwrap_or(sentinel);
let Ok(offset) = DenseType::try_from(value - min) else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very minor nit: I'd preemptively set let row_value_offset = min at the beginning of this branch and then subtract that instead of min here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@sffc sffc requested a review from Manishearth December 11, 2025 01:00
@sffc sffc merged commit b57d797 into unicode-org:main Dec 11, 2025
32 checks passed
@sffc sffc deleted the zerodense branch December 11, 2025 01:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants