Description of the Problem
When executing a search query that hits a massive number of splits (e.g., in a 50PB log scale deployment with a broad time range and weak filters), Quickwit exhibits extremely heavy continuous reads from S3. This pattern can saturate the network ingress bandwidth on the search nodes.
Upon analyzing the S3 access pattern (`s3_compatible_storage.rs`, `bundle_storage.rs`, and the `SplitCache`), it is clear that Quickwit is already highly optimized (it uses HTTP range requests efficiently). The problem stems from the concurrency logic in the `leaf.rs` module.
Currently, in `single_doc_mapping_leaf_search`, all matched uncached splits are scheduled and dispatched concurrently via `run_local_search_tasks`.
While Quickwit has a great `CanSplitDoBetter` pruning mechanism, it has limited effectiveness in this extreme scenario. Because hundreds or thousands of splits are spawned and enter their warmup phase (triggering S3 I/O) almost simultaneously, the pruning state (`worst_hit_found`) does not get populated fast enough to prevent unnecessary splits from aggressively downloading their fast fields and term dictionaries.
As a result, an enormous amount of unnecessary S3 Range Request bandwidth is wasted on splits that would have been pruned anyway if they were processed slightly later.
Proposed Solution: Progressive Two-Phase Search
To maximize the power of the existing `CanSplitDoBetter` mechanism without impacting normal query latency (where the split count is reasonable), we propose modifying `single_doc_mapping_leaf_search` to implement a "Two-Phase" or "Progressive" execution path when the split count is large.
Logic implementation:
- New Configuration: Introduce `progressive_search_batch_size: usize` in `SearcherConfig` (defaulting to 20 or similar). If set to 0, it falls back to the current behavior.
- Phase 1 (Priority Batch): In `leaf.rs`, take the first `batch_size` splits (which are already sorted by `optimize_split_order`, meaning they are highly relevant) and execute them. We `tokio::join!` them and wait for completion.
- Data Population: Upon Phase 1 completion, the `CanSplitDoBetter` structure (which holds the `incremental_merge_collector`) is filled with concrete `worst_hit` scores from the most relevant splits.
- Phase 2 (Remaining Splits): We then dispatch the remaining splits. Because `CanSplitDoBetter` is now populated with a strong baseline, the `simplify_search_request` function can aggressively convert unnecessary splits into count-only queries or completely skip them (if `count` is not needed), preventing them from making any heavy S3 I/O requests. A minimal sketch of this flow follows the list.
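For concreteness, here is a minimal Rust sketch of the proposed dispatch order. It is not based on the actual `leaf.rs` types: `PruningState` and `search_split` are hypothetical stand-ins for the shared `CanSplitDoBetter` baseline and the per-split leaf search, and splits are represented by plain string IDs.

```rust
use std::sync::{Arc, Mutex};

/// Shared pruning baseline (plays the role of `worst_hit_found`).
#[derive(Default)]
struct PruningState {
    worst_hit: Mutex<Option<f32>>,
}

/// Hypothetical per-split search. In the real code path, this is where
/// warmup (S3 I/O) happens and where `simplify_search_request` would
/// consult the baseline before downloading anything.
async fn search_split(split_id: String, state: Arc<PruningState>) -> f32 {
    let score = split_id.len() as f32; // placeholder for the split's best hit
    let mut worst = state.worst_hit.lock().unwrap();
    *worst = Some(match *worst {
        Some(w) => w.max(score),
        None => score,
    });
    score
}

/// Search `splits` (assumed already sorted by relevance) in two phases:
/// the first `batch_size` splits run to completion so the pruning baseline
/// is populated before the remaining splits are dispatched.
async fn progressive_leaf_search(
    splits: Vec<String>,
    batch_size: usize,
    state: Arc<PruningState>,
) {
    async fn run_batch(batch: &[String], state: &Arc<PruningState>) {
        let handles: Vec<_> = batch
            .iter()
            .cloned()
            .map(|split| tokio::spawn(search_split(split, Arc::clone(state))))
            .collect();
        futures::future::join_all(handles).await;
    }

    if batch_size == 0 || splits.len() <= batch_size {
        // 0 (or a small split count) falls back to today's behavior:
        // everything is dispatched concurrently in a single wave.
        run_batch(&splits, &state).await;
        return;
    }

    let (priority, rest) = splits.split_at(batch_size);
    // Phase 1: priority batch, awaited to completion.
    run_batch(priority, &state).await;
    // Phase 2: the remaining splits now see a populated baseline and can be
    // simplified into count-only requests or skipped before any heavy S3 reads.
    run_batch(rest, &state).await;
}

#[tokio::main]
async fn main() {
    let splits: Vec<String> = (0..100).map(|i| format!("split-{i}")).collect();
    progressive_leaf_search(splits, 20, Arc::new(PruningState::default())).await;
}
```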
Impact & Safety:
- Pagination (`count` accuracy) remains unaffected: `simplify_search_request` correctly respects `CountAll` and safely converts splits to count-only (bypassing expensive field data downloads) instead of dropping them.
- This effectively bounds the peak S3 GET request burst to exactly what is needed for the Top-K.
- It degrades seamlessly back to the original concurrency behavior if the query matches fewer than `progressive_search_batch_size` splits (a possible config shape is sketched below).
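As a configuration illustration, the field name and default below come from this proposal; the struct is a simplified stand-in, not Quickwit's actual `SearcherConfig`:

```rust
use serde::Deserialize;

// Simplified stand-in for `SearcherConfig`; only the proposed knob is shown.
#[derive(Debug, Deserialize)]
struct SearcherConfig {
    /// Number of most relevant splits searched to completion before the
    /// remaining splits are dispatched. 0 keeps the current fully-concurrent
    /// behavior.
    #[serde(default = "default_progressive_search_batch_size")]
    progressive_search_batch_size: usize,
    // ... existing searcher settings elided ...
}

fn default_progressive_search_batch_size() -> usize {
    20
}
```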
Environment details:
- OS: Linux / Kubernetes
- Quickwit Version: latest `main`