Skip to content
Merged
Prev Previous commit
Next Next commit
review feedback
  • Loading branch information
sffc committed Dec 11, 2025
commit 25a8c32ca4dcb2e50bae25e5aebdf54d5ddc7e93
15 changes: 12 additions & 3 deletions utils/zerotrie/src/builder/dense.rs
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,10 @@ impl<'a> DenseSparse2dAsciiWithFixedDelimiterBuilder<'a> {
}
// >= because DenseType::MAX is the sentinel
if max - min >= usize::from(DenseType::MAX) {
todo!("need to handle this case: pick the best dense row offset");
// How to implement this when we need it:
// 1. Select the row offset that gets the greatest number of values into the dense matrix.
// 2. Put all out-of-range values into the primary trie, as we do with sparse rows.
todo!("values in row are not in a sufficiently compact range");
}
let sentinel = min + usize::from(DenseType::MAX);
// Partition the entries based on whether they can be encoded as dense
Expand All @@ -61,20 +64,21 @@ impl<'a> DenseSparse2dAsciiWithFixedDelimiterBuilder<'a> {
.collect::<ZeroTrieSimpleAscii<Vec<u8>>>();
if sub_trie.byte_len() > self.suffixes.len() * core::mem::size_of::<DenseType>() {
// Create a dense prefix
let row_value_offset = min;
let offsets = self
.suffixes
.iter()
.map(|suffix| {
let value = sub_trie.get(suffix).unwrap_or(sentinel);
let Ok(offset) = DenseType::try_from(value - min) else {
let Ok(offset) = DenseType::try_from(value - row_value_offset) else {
unreachable!("this should have been handled earlier");
};
offset
})
.collect::<Vec<DenseType>>();
self.dense.push(Row {
prefix,
row_value_offset: min,
row_value_offset,
offsets,
});
for (suffix, value) in sparse_only.iter() {
Expand Down Expand Up @@ -144,6 +148,11 @@ impl<'a> DenseSparse2dAsciiWithFixedDelimiterBuilder<'a> {

impl ZeroAsciiDenseSparse2dTrieOwned {
/// Builds one of these from a two-dimensional BTreeMap and a delimiter.
///
/// Keep in mind the recommendations for optimal data size described in
/// the [class docs].
///
/// [class docs]: ZeroAsciiDenseSparse2dTrieOwned
pub fn try_from_btree_map_str(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please document the soft invariant that the inner btree values should be all in the same locality for maximum performance

though currently this is a "this code will panic" invariant so perhaps we should just say that first (and onc ethat is fixed, talk about the soft performance invariant)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added docs about the locality invariant. I called it "in a compact range".

entries: &BTreeMap<&str, BTreeMap<&str, usize>>,
delimiter: u8,
Expand Down
64 changes: 53 additions & 11 deletions utils/zerotrie/src/dense.rs
Original file line number Diff line number Diff line change
Expand Up @@ -22,14 +22,30 @@ use zerovec::ZeroVec;

pub(crate) type DenseType = u16;

/// A data structure designed for 2-dimensional ASCII keys with a fixed
/// delimiter that exhibit a mix of density and sparsity (owned version).
/// A data structure designed for 2-dimensional ASCII keys that exhibit a mix
/// of density and sparsity.
///
/// It stores prefixes and suffixes separated by a delimiter, which must not
/// occur in any of the prefix or suffix strings.
/// This structure is best for data with the following properties:
///
/// This type can deliver data size savings if the suffixes are commonly
/// repeated. If they are mostly unique, a simple trie is likely better.
/// 1. Suffixes are commonly repeated. For example, the two-dimentional keys
/// "aa/ZZ" and "bb/ZZ" have the same suffix (second-level key).
/// 1. Some prefixes are sparse (contain just a few suffixes) while others are
/// dense (contain many/most of the suffixes).
/// 1. Values are mostly in a compact range within a prefix. For example, for
/// the keys "qq/AA", "qq/BB", and "qq/CC", the values 100, 101, and 103
/// are in a compact range, but the values 1, 500, and 7e8 are not.
///
/// Instead of having a pluggable store like [`ZeroTrie`], there are three
/// separate versions of this type:
///
/// 1. [`ZeroAsciiDenseSparse2dTrieOwned`], which owns its data
/// 1. [`ZeroAsciiDenseSparse2dTrieBorrowed`], which borrows its data
/// 1. [`ZeroAsciiDenseSparse2dTrieVarULE`], for zero-copy storage as bytes
///
/// Internally, the structure stores prefixes and suffixes separated by a
/// delimiter, which must not occur in any of the prefix or suffix strings.
///
/// ✨ *Enabled with the `alloc` and `dense` Cargo features.*
///
/// # Examples
///
Expand Down Expand Up @@ -64,23 +80,49 @@ pub struct ZeroAsciiDenseSparse2dTrieOwned {
pub(crate) primary: ZeroTrieSimpleAscii<Vec<u8>>,
/// A trie mapping suffixes to column IDs in the dense matrix.
pub(crate) suffixes: ZeroTrieSimpleAscii<Vec<u8>>,
/// The dense matrix.
/// The dense matrix. To look up a value from the dense matrix:
///
/// 1. Find the row index and row value offset from [`Self::primary`]
/// 2. Find the column index from [`Self::suffixes`]
/// 3. Find the number of columns from [`Self::suffix_count`]
/// 4. Select the entry from the matrix based on the row and column index.
/// 5. Add the row value offset to recover the original value.
///
/// The row value offset exists in order to allow storage of rows with
/// values that are tight in a certain range, but when that range exceeds
/// the capacity of [`DenseType`], such as 100000, 100001, and 100002.
///
/// For example: suppose you are looking up "ddd/ZZ".
///
/// 1. Check the primary matrix. Row index is 4, row value offset 100.
/// 2. Check the suffixes matrix. Column index is 19.
/// 3. Check the number of columns. There are 30 columns.
/// 4. Select the value at index (30*4)+19 = 139. It is 15.
/// 5. Add 100 to recover the original value. Return 115.
pub(crate) dense: ZeroVec<'static, DenseType>,
/// The number of columns in the dense matrix.
pub(crate) suffix_count: u16,
/// The delimiter byte.
pub(crate) delimiter: u8,
}

/// A data structure designed for 2-dimensional ASCII keys with a fixed
/// delimiter that exhibit a mix of density and sparsity (ULE version).
/// A data structure designed for 2-dimensional ASCII keys that exhibit a mix
/// of density and sparsity.
///
/// For more details, see [`ZeroAsciiDenseSparse2dTrieOwned`].
///
/// ✨ *Enabled with the `dense` Cargo feature.*
pub type ZeroAsciiDenseSparse2dTrieVarULE = VarTupleULE<
(u16, u8),
Tuple3VarULE<ZeroSlice<u8>, ZeroSlice<u8>, ZeroSlice<DenseType>, Index32>,
>;

/// A data structure designed for 2-dimensional ASCII keys with a fixed
/// delimiter that exhibit a mix of density and sparsity (borrowed version).
/// A data structure designed for 2-dimensional ASCII keys that exhibit a mix
/// of density and sparsity.
///
/// For more details, see [`ZeroAsciiDenseSparse2dTrieOwned`].
///
/// ✨ *Enabled with the `dense` Cargo feature.*
#[derive(Debug, PartialEq, Eq)]
pub struct ZeroAsciiDenseSparse2dTrieBorrowed<'a> {
primary: &'a ZeroTrieSimpleAscii<[u8]>,
Expand Down
Loading