
Optimize S3 Bandwidth Usage via Two-Phase Progressive Leaf Search for Heavy Queries #6210

@snac21

Description of the Problem
When executing a search query that hits a massive number of splits (e.g., in a 50PB log scale deployment with a broad time range and weak filters), Quickwit exhibits extremely heavy continuous reads from S3. This pattern can saturate the network ingress bandwidth on the search nodes.

Upon analyzing the S3 access pattern (s3_compatible_storage.rs, bundle_storage.rs, and SplitCache), it's clear that Quickwit is already highly optimized (utilizing HTTP Range Requests efficiently). The problem stems from the concurrency logic in the leaf.rs module.

Currently, in single_doc_mapping_leaf_search, all matched uncached splits are scheduled and dispatched concurrently via run_local_search_tasks.
While Quickwit has a strong CanSplitDoBetter pruning mechanism, its effectiveness is limited in this extreme scenario. Because hundreds or thousands of splits are spawned and enter their warmup phase (triggering S3 I/O) almost simultaneously, the pruning state (worst_hit_found) is not populated fast enough to stop unnecessary splits from aggressively downloading their fast fields and term dictionaries.
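The race can be illustrated with a minimal sketch (all names here are hypothetical stand-ins, not Quickwit's actual API): pruning is only possible once a baseline worst hit exists, so splits dispatched before any split completes can never be pruned.

```rust
// Hypothetical sketch: `worst_hit_found` stands in for the shared pruning
// state that CanSplitDoBetter maintains.

#[derive(Clone, Copy)]
struct SplitMeta {
    max_possible_score: f32, // upper bound on what this split could contribute
}

// Returns true if the split can be skipped given the current worst hit.
fn can_prune(split: SplitMeta, worst_hit_found: Option<f32>) -> bool {
    match worst_hit_found {
        // No baseline yet: every split must be warmed up (S3 I/O).
        None => false,
        // Baseline known: splits that cannot beat it are skipped.
        Some(worst) => split.max_possible_score <= worst,
    }
}

fn main() {
    let split = SplitMeta { max_possible_score: 0.4 };
    // All splits dispatched at once: no baseline, nothing prunes.
    assert!(!can_prune(split, None));
    // After some splits complete, the baseline prunes weak splits.
    assert!(can_prune(split, Some(0.7)));
    println!("ok");
}
```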

As a result, an enormous amount of unnecessary S3 Range Request bandwidth is wasted on splits that would have been pruned anyway if they were processed slightly later.

Proposed Solution: Progressive Two-Phase Search
To maximize the power of the existing CanSplitDoBetter mechanism without impacting normal query latency (where split_count is reasonable), we propose modifying single_doc_mapping_leaf_search to implement a "Two-Phase" or "Progressive" execution path when the split count is large.

Implementation outline:

  1. New Configuration: Introduce progressive_search_batch_size: usize in SearcherConfig (defaulting to 20 or similar). If set to 0, it falls back to the current behavior.
  2. Phase 1 (Priority Batch): In leaf.rs, we take the first batch_size splits (which are already sorted by optimize_split_order, meaning they are highly relevant) and execute them. We tokio::join! them and wait for completion.
  3. Data Population: Upon Phase 1 completion, the CanSplitDoBetter structure (which holds the incremental_merge_collector) is filled with concrete worst_hit scores from the most relevant splits.
  4. Phase 2 (Remaining Splits): We then dispatch the remaining splits. Because CanSplitDoBetter is now populated with a strong baseline, the simplify_search_request function can aggressively convert unnecessary splits into count-only queries or completely skip them (if count is not needed), preventing them from making any heavy S3 I/O requests.
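The steps above can be sketched as follows. This is a std-only, synchronous sketch with hypothetical names; the real change would live in single_doc_mapping_leaf_search with tokio tasks and the existing CanSplitDoBetter / incremental_merge_collector machinery.

```rust
// Hedged sketch of the proposed two-phase dispatch (hypothetical names).

#[derive(Clone, Copy)]
struct Split {
    max_possible_score: f32, // upper bound this split could contribute
}

/// Returns the number of splits that performed "heavy I/O" (warmup).
/// Splits are assumed pre-sorted best-first, as optimize_split_order
/// would leave them.
fn two_phase_search(splits: &[Split], batch_size: usize) -> usize {
    if batch_size == 0 || splits.len() <= batch_size {
        // Fallback: current behavior, everything is dispatched at once.
        return splits.len();
    }
    let mut warmed = 0;
    let mut worst_hit: Option<f32> = None;

    // Phase 1: run the most promising batch to completion, populating
    // the pruning baseline from their hits.
    for split in &splits[..batch_size] {
        warmed += 1;
        let score = split.max_possible_score; // stand-in for a real hit score
        worst_hit = Some(worst_hit.map_or(score, |w| w.min(score)));
    }

    // Phase 2: dispatch the rest; the baseline now prunes weak splits
    // (in the real code, simplify_search_request would downgrade them
    // to count-only requests or skip them).
    for split in &splits[batch_size..] {
        if split.max_possible_score > worst_hit.unwrap() {
            warmed += 1;
        }
    }
    warmed
}

fn main() {
    // 100 splits, generated best-first with strictly decreasing bounds.
    let splits: Vec<Split> = (0..100)
        .map(|i| Split { max_possible_score: 1.0 - i as f32 / 100.0 })
        .collect();

    // One-shot dispatch warms every split; two-phase bounds the burst
    // to the first batch, since no later split can beat the baseline.
    assert_eq!(two_phase_search(&splits, 0), 100);
    assert_eq!(two_phase_search(&splits, 20), 20);
    println!("ok");
}
```

The key design point is that Phase 1 runs to completion before Phase 2 starts, so the baseline is guaranteed to be populated when the bulk of the splits is scheduled.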

Impact & Safety:

  • Pagination and count accuracy remain unaffected. simplify_search_request correctly respects CountAll and safely converts splits to count-only requests (bypassing expensive field-data downloads) instead of dropping them.
  • This effectively bounds the peak S3 GET burst to what is actually needed for the Top-K.
  • It falls back seamlessly to the original concurrency behavior when a query matches fewer than progressive_search_batch_size splits.
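Assuming the new knob lands in SearcherConfig, it could be exposed in the node config's searcher section roughly like this (the key name is the proposed, not-yet-existing option):

```yaml
# quickwit.yaml (sketch; `progressive_search_batch_size` is hypothetical)
searcher:
  # Size of the Phase 1 priority batch. 0 disables progressive search
  # and keeps the current fully concurrent dispatch.
  progressive_search_batch_size: 20
```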

Environment details:

  • OS: Linux / Kubernetes
  • Quickwit Version: Latest main
