It is too fast.
Seriously, this reminds me of 1995 and the first time I clicked links on a webpage on a university network instead of dial up.
It is so fast, it's a testament to how terrible the web has gotten.
100% this, man. We have so much power at our fingertips and waste it on bloat.
That event of switching from dialup to 100BaseT when I went to college was, by the numbers and multiplicatively, the single biggest personal tech step improvement I’ve lived through. The web was like lightning for years. I miss it dearly and love sites like this one that just don’t do any extra bullshit.
I was talking about how I wish we could get Google Fiber where we lived.
Wife: “Is it going to feel as fast as when I connected to the ethernet in my dorm at UT Austin in 1998?”
Me: “I mean, technically, it’s…”
Wife: “Is it going to feel as fast?”
Me: “Probably…”
Wife: “I have standards.”
The answer is NO.
It is not going to feel as fast.
Looks nice and clean. Is it free software? Can I use it for indexing and searching my own documentation and host it myself?
I’d love to see the source just for a good example of proper use of PostgreSQL for full-text indexing! I haven’t seen an application that does full-text indexing/search so well, and so quickly, in a very long time. I have a feeling that, even if it’s got some of the usual small hacks that personal projects sometimes have, this is the kind of source code that aspiring developers would find extremely rewarding to read and understand.
[Comment removed by author]
Thank you! Unfortunately not; it's more of a 48-hour hack that is too ugly and way too tailored to the OpenBSD manual pages. I hope to be able to release it in the somewhat near future though, once most of the ugly stuff is sorted out.
Release the ugly software!
I don’t think it’s ever too early to release the code - put a big warning notice on the README that it’s all a big hack and ship it anyway.
The value for me would be seeing what hacky tricks you used to get it to work. I couldn’t care less if it’s not polished yet.
Agreed, and some of the fun lies in letting other people find cool ways to remove the hacky stuff (if you want to accept contributions).
I had done a hack as well for manpages with sqlite https://lobste.rs/s/t7tuej/epilys_buke_toy_manpage_full_text_search
Feel free to play with it, since it’s free software.
Source is now available through Git:
I am curious about which search technology is used.
My guess is a custom indexing solution tailored to the use case that keeps everything in memory.
When your corpus is static, relatively small for modern computers, and changes slowly, you can do lots of things to squeeze perf out.
I’m using PostgreSQL with proper indexing. :)
Computers are ridiculously fast. JASSjr, a project demonstrating how easy it is to write a minimal search engine (just a hash table and a merge), can search a 173,252-document corpus with an average length of 470 words in 10ms using a naïve Python implementation on an i5-4570 (a CPU from 2013). This doesn't include reading the index, but if you are clever about your index format you can make that fast too. For a long time my hobby search project was a CGI binary written in C that would read and decode the index file on every request. It had about 80,000 documents and still ran in 10ms with nothing more than a hash table and a repeated 2-way merge for the k-way merge algorithm.
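The hash-table-plus-merge design described above can be sketched in a few lines of Python. This is my own illustrative sketch, not JASSjr's actual code; the term splitting and scoring are deliberately naïve:

```python
from collections import defaultdict

# Toy inverted index: a hash table mapping each term to a sorted
# list of document IDs.
def build_index(docs):
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].append(doc_id)  # doc IDs arrive already sorted
    return index

# Classic 2-way merge of two sorted postings lists (intersection).
def merge(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# A k-way merge done as repeated 2-way merges, as described above.
def search(index, query):
    postings = [index.get(t, []) for t in query.lower().split()]
    if not postings:
        return []
    result = postings[0]
    for p in postings[1:]:
        result = merge(result, p)
    return result
```

For example, with docs = ["the atomic add", "atomic operations", "add int"], search(build_index(docs), "atomic add") intersects the postings [0, 1] and [0, 2] and returns [0].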
Something might be goofy with tokenization on the indexer side and/or stemming on the query side…
Search for “atomic” and I’d expect some hits on the atomic kernel functions.
For instance, compare to apropos:
https://man.ifconfig.se/?q=atomic&a=false&s=false
vs.
https://man.openbsd.org/?query=atomic&apropos=1&sec=0&arch=default&manpath=OpenBSD-current
Yeah, the indexer turns words into "lexemes", which are normalized forms of the words. This makes a search for the word "code" match documents containing "code", "coding", and "coder". In your case, "atoms" and "atomic" both become "atom", and various X libraries seem to use the word "atoms" more frequently. Pagination and more than 10 results per query are on the todo list, though.
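As a rough illustration of that normalization: PostgreSQL's english dictionary runs words through a Snowball stemmer (via to_tsvector/to_tsquery). The crude suffix-stripper below is only a stand-in for that and produces different lexemes than Postgres would, but it shows why related word forms match each other:

```python
# Crude stand-in for stemming: strip one common suffix, then any
# trailing "e", so related word forms collapse to one "lexeme".
# PostgreSQL's Snowball stemmer is far more careful; this is only
# to show why "atoms" and "atomic" end up matching each other.
def lexeme(word):
    w = word.lower()
    for suffix in ("ing", "er", "ic", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    return w.rstrip("e")
```

With this toy rule, "code", "coding", and "coder" all collapse to the same lexeme, and "atoms" and "atomic" both collapse to "atom", so a search for any one form matches documents containing the others.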
You can also choose to only search the kernel functions:
https://man.ifconfig.se/?q=atomic&a=false&s=9
Is there a way to tune the relevancy model used? If it's processing "atomic" like this, it's tokenizing atomic_add_int into potentially [atomic, add, int] and then probably processing those tokens into lexemes. In my past experience with things like Apache Lucene, there's usually a way to influence the scoring so an exact token match (atomic in this case) would boost the score higher than a match on a shared lexeme/lemma/stem/etc.

I unfortunately have no experience using PostgreSQL for this type of stuff…but there's probably a way.
Does it count as full-text search if it’s based on indexed words and you can’t search for an arbitrary string?
Full-text refers to searching over the entire documents rather than just their summaries or titles, as classic retrieval engines used to do. We now live in a world where full-text search is usually the standard.