1. 21
    1. 2

      When it comes to algorithm complexity, how relevant is the worst case? I understand that it gives an upper bound on running time, but without a sense of how probable the worst case is, does it really matter in practice?

      1. 7

        It is very relevant if the worst case is what happens on your specific workload! For instance, here the worst case is when the data is sorted (or almost sorted).

        At Quickwit, we are building a log search engine. Most of the time, users run a query and want their logs sorted by timestamp. The data in the index is stored in ingestion order, which is already roughly chronological, so our theoretical worst case is actually the most common case on our workload.

        You will find a similar situation with sort algorithms. The first implementation of quicksort in libc was the naive one, where the pivot is simply the first element. That degrades to O(n^2) when the data is already (or almost) sorted. In the real world, a lot of what we sort is already almost sorted, so it was incredibly slow.

        They fixed it with a better pivot choice. Also, many language standard libraries now prefer Timsort, which is specifically optimized for this very common case.
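
        To make this concrete, here is a minimal sketch (in Python, not the historical libc code) contrasting a first-element-pivot quicksort with a random-pivot one on already-sorted input; the naive variant does the full quadratic work:

        ```python
        import random
        import sys
        import time

        def first_pivot_quicksort(data):
            """Naive quicksort: the pivot is always the first element."""
            if len(data) <= 1:
                return data
            pivot = data[0]
            left = [x for x in data[1:] if x < pivot]
            right = [x for x in data[1:] if x >= pivot]
            return first_pivot_quicksort(left) + [pivot] + first_pivot_quicksort(right)

        def random_pivot_quicksort(data):
            """Same algorithm with a random pivot: expected O(n log n) on any input."""
            if len(data) <= 1:
                return data
            pivot = data[random.randrange(len(data))]
            left = [x for x in data if x < pivot]
            equal = [x for x in data if x == pivot]
            right = [x for x in data if x > pivot]
            return random_pivot_quicksort(left) + equal + random_pivot_quicksort(right)

        # On sorted input the first-element pivot never splits the list: the
        # recursion goes n levels deep and does O(n) work per level, i.e. O(n^2).
        sys.setrecursionlimit(10_000)
        already_sorted = list(range(2_000))

        for quicksort in (first_pivot_quicksort, random_pivot_quicksort):
            start = time.perf_counter()
            quicksort(already_sorted)
            print(f"{quicksort.__name__}: {time.perf_counter() - start:.4f}s")
        ```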

        1. 1

          I hadn’t considered that already-sorted results would be the common case for chronological ordering. Makes perfect sense. Thanks for the explanation.

      2. 5

        It might matter if you’re writing something that’ll be exposed to the internet, because then, if there’s a way for the user to trigger the worst case, it could be used to DoS you.

        IIRC, a few years ago there was an issue with predictable hash functions for hashtable keys in several languages, which could be used to DoS webservers.
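
        That was the hash-flooding class of attacks: with a predictable hash, an attacker can craft request parameters that all land in the same bucket, so each insert scans every previous entry. A minimal Python sketch of the effect (the `CollidingKey` class stands in for attacker-crafted inputs; the real-world fix was randomized hashing, e.g. SipHash):

        ```python
        import time

        class CollidingKey:
            """Stand-in for attacker-crafted strings that all hash to one bucket."""
            def __init__(self, value):
                self.value = value
            def __hash__(self):
                return 0  # every key collides, so each insert scans prior entries
            def __eq__(self, other):
                return self.value == other.value

        def time_inserts(n, make_key):
            table = {}
            start = time.perf_counter()
            for i in range(n):
                table[make_key(i)] = i
            return time.perf_counter() - start

        n = 5_000
        print(f"distinct hashes: {time_inserts(n, lambda i: i):.3f}s")   # ~linear total
        print(f"colliding keys:  {time_inserts(n, CollidingKey):.3f}s")  # ~quadratic total
        ```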

      3. 2

        You have rediscovered the concept of amortized complexity analysis. The idea is to compute the complexity of a sequence of operations that might occur “in practice” and derive new worst-case bounds which are (theoretically) more favorable. Both amortized and worst-case analyses are relevant.

        For example, tangent to @dzwdz’s comment, an (open-addressed) hash table usually has some sort of quadratic-time worst-case behavior, typically by forcing many rehashing operations under the hood. However, under typical use, one of the n factors can become log n by amortization, which is tolerable. This matters because sometimes quadratic-time behaviors are found in the wild! Examples of this happened in Rust and in Python, Perl, PHP, and more. Users will write code “in practice” which exhibits worst-case behaviors.
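
        Here is a toy illustration of the amortized argument, using a dynamic array rather than a hash table (the doubling argument behind amortized rehashing has the same shape): a single append can cost O(n) when it triggers a resize, yet a sequence of n appends costs O(n) in total.

        ```python
        class GrowableArray:
            """Toy dynamic array with capacity doubling."""

            def __init__(self):
                self.capacity = 1
                self.size = 0
                self.slots = [None]
                self.copied = 0  # total elements moved by resizes, to verify the bound

            def append(self, item):
                if self.size == self.capacity:
                    self.capacity *= 2  # doubling is what makes the amortized bound work
                    self.slots = self.slots + [None] * (self.capacity - self.size)
                    self.copied += self.size  # the resize copied every existing element
                self.slots[self.size] = item
                self.size += 1

        arr = GrowableArray()
        n = 1_000_000
        for i in range(n):
            arr.append(i)

        # Resizes at sizes 1, 2, 4, ... copy 1 + 2 + 4 + ... < 2n elements in total,
        # even though the single worst-case append copies half the array at once.
        print(f"appends: {n}, elements copied by resizes: {arr.copied}")
        ```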

      4. 2

        Besides DoS, it matters for work done in batches where each new batch depends on the aggregate result of the previous one: the worst-case input shape can then recur on every batch rather than once in a while.
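
        For instance, a hypothetical sketch where each batch’s input includes the previous batch’s sorted output: an algorithm whose worst case is almost-sorted input (like the naive quicksort above) would hit that worst case on every single batch.

        ```python
        import random

        def process(batch):
            """Stand-in for per-batch work whose cost depends on the input's shape."""
            return sorted(batch)  # with a first-element-pivot quicksort,
                                  # almost-sorted input would be the O(n^2) worst case

        aggregate = []
        for _ in range(100):
            new_records = [random.random() for _ in range(1_000)]
            # Each batch's input is derived from the previous batch's output, so the
            # worst-case shape (here: an almost-sorted list) recurs every iteration.
            aggregate = process(aggregate + new_records)
        ```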