1. 38
    1. 3

      Does anyone know how it compares to https://stork-search.net/ ?

      Seems quite similar at first glance, except for MIT vs GPL; both seem to use Rust and WASM and focus on static sites.

      1. 10

        They’re both pretty good search products and you’d be fine choosing either in most cases. Stork has obviously been around for longer and so has some extra polish in that regard. The main advantage (and raison d’être) of Pagefind is that it uses considerably less bandwidth, as it only loads the portions of the index that it needs to complete a search, whereas Stork loads the entire index up front.

        Stork advantages:

        • Can be used on content other than html
        • Stemming can be set for languages other than English and on a per file basis
        • Result ranking boosts exact matches and down-weights prefix matches and then stop words
        • Apache licensed

        Pagefind advantages:

        • Easier to set up for the common case of a static site generator (just point it at your output dir and go)
        • Tweaking is done without a separate config file
        • Uses considerably less bandwidth
        • MIT licensed

        Some areas in which I think both could improve:

        • Neither of them use BM25 or TFIDF for ranking. BM25 is the industry standard for first-stage ranking and TFIDF is the okay-ish ranking that most hobbyists will come across. Either would also make stop words obsolete (a rough BM25 sketch follows this list)
        • Neither do language detection for deciding on the stemmer (fairly easy to do with trigram statistics)
        • Neither of them do query expansion
        • They’re both fast largely on account of being in Rust, but there is room for better performance by reducing allocation during indexing, using a different index structure for search (easier in Stork’s case than Pagefind’s due to how the chunking constrains choices), and improving the algorithm for evaluating and merging results lists during the search
        • There’s still further room available for shrinking the index size in both of them
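
        For reference, here is a minimal BM25 sketch (purely illustrative; neither product scores this way today). k1 and b are the usual defaults, and the IDF term is what makes separate stop word lists unnecessary:

        ```rust
        // Minimal BM25 sketch; illustrative only, not code from Pagefind or Stork.
        fn bm25_term_score(
            tf: f64,          // occurrences of the term in this document
            doc_len: f64,     // length of this document in words
            avg_doc_len: f64, // average document length in the collection
            df: f64,          // number of documents containing the term
            n_docs: f64,      // total number of documents
        ) -> f64 {
            let (k1, b) = (1.2, 0.75); // standard BM25 constants
            // IDF dampens very common words, which is what replaces stop word lists
            let idf = ((n_docs - df + 0.5) / (df + 0.5) + 1.0).ln();
            // Saturating term frequency, normalised by document length
            let tf_part = (tf * (k1 + 1.0)) / (tf + k1 * (1.0 - b + b * doc_len / avg_doc_len));
            idf * tf_part
        }

        fn main() {
            // A term appearing 3 times in a 200-word page, on a 10,000-page site
            let score = bm25_term_score(3.0, 200.0, 350.0, 120.0, 10_000.0);
            println!("BM25 contribution for this term: {score:.3}");
        }
        ```

        A document’s score for a query is just the sum of this over the query terms.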

        But for the target use case of blogs and small-to-medium static websites, either would likely be fine

        1. 2

          Wow, thanks for this exhaustive comparison!

        2. 2

          Wow, yes, fantastic write-up. I should definitely add a roadmap to the Pagefind documentation, as there are quite a few relevant things in our short-term plans.

          One of the imminent features to release is multilingual support, for which much of the piping is already in place. My intention is to take a shortcut on the trigram statistics angle and make use of the HTML metadata. If output correctly, a static site should have a language attribute on the HTML (or otherwise detectable through the URL structure). Using this, Pagefind can swap out the stemmer for that page as a whole. The plan is then that the index chunks would be output separately for each language, and in the browser you choose which language you’re searching up front. In our experience, it isn’t common to want to search all languages of a multilingual website at once. This should be out in a few weeks, and would still give you multilingual search without a configuration file.
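
          Roughly, the shortcut could look something like the following (a sketch assuming the rust-stemmers crate; the real implementation may differ): read the page’s language attribute, or a code pulled from the URL, and pick the stemmer once for the whole page.

          ```rust
          // Sketch: map an HTML lang attribute (or URL prefix) to a Snowball stemmer.
          // Assumes the rust-stemmers crate; not Pagefind's actual code.
          use rust_stemmers::{Algorithm, Stemmer};

          fn stemmer_for_lang(lang: &str) -> Stemmer {
              // Only the primary subtag matters here ("fr-FR" -> "fr")
              let primary = lang.split('-').next().unwrap_or("en").to_lowercase();
              let algorithm = match primary.as_str() {
                  "fr" => Algorithm::French,
                  "de" => Algorithm::German,
                  "es" => Algorithm::Spanish,
                  _ => Algorithm::English, // fall back to English
              };
              Stemmer::create(algorithm)
          }

          fn main() {
              // e.g. taken from <html lang="fr-FR"> or a /fr-fr/ URL prefix
              let stemmer = stemmer_for_lang("fr-FR");
              println!("{}", stemmer.stem("configuration"));
          }
          ```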

          I haven’t documented Pagefind’s result ranking fully, but it currently should be boosting exact matches and down-weighting prefix matches, which is then combined with a rudimentary term frequency. Medium term I plan to add TFIDF — I have a rough plan for how to get the data I need into the browser without a large network request. Unsure on BM25.

          Query expansion is hard in a bandwidth sense, as most of the information isn’t loaded in. I do want to experiment with loading in a subset of the “dictionary” that was found on the site (likely the high ranking TFIDF words) and providing some spell checking functionality for those words specifically, if I can do it in a reasonable bandwidth footprint.

          Speed is something to revisit; bandwidth has been the full priority thus far. I would be keen to hear any thoughts you have on shrinking the index size, though — I’m sure you’ve looked into it already but I have exhausted my current avenues of index shrinkage :)

          1. 4

            Thanks :)

            That sounds like a good plan for multilingual. I agree that it isn’t common to search all languages at once. Hopefully the solution you describe can also be integrated into the website’s language selector so that it can remain completely transparent to the user

            That’s good to hear. Sorry that I missed the down-weighting of prefix matches when I was reading the code. If you are implementing TFIDF I highly recommend BM25, as it gives better results with mostly only a formula change. But there seems to be no way to do better than BM25 without extra ranking factors or machine learning: http://www.cs.otago.ac.nz/homepages/andrew/papers/2014-2.pdf

            I’m assuming that with the extra network weight of TFIDF you’re referring to having to have all document IDs in the metadata so that you can compute the ranking without requesting the full document bodies? For the term frequency part you should be able to just use an extra byte in every position in the postings list, which shouldn’t be much overhead on a per-chunk basis. There’s no point using more than a byte, as knowing that there are more than 255 instances of “the” in a document offers only minuscule diminishing returns.
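
            Something like the following is all I mean by the extra byte (the structures here are made up for illustration, not taken from either codebase):

            ```rust
            // One postings entry: which page, plus a term-frequency byte.
            // Saturating at 255 is fine; past that the ranking signal is negligible.
            struct Posting {
                page_id: u32,
                term_frequency: u8,
            }

            fn add_occurrence(postings: &mut Vec<Posting>, page_id: u32) {
                match postings.last_mut() {
                    // Same page as the previous occurrence: bump the saturating counter
                    Some(p) if p.page_id == page_id => {
                        p.term_frequency = p.term_frequency.saturating_add(1)
                    }
                    // New page: start a fresh entry with a count of 1
                    _ => postings.push(Posting { page_id, term_frequency: 1 }),
                }
            }

            fn main() {
                let mut postings = Vec::new();
                for page in [1u32, 1, 1, 2] {
                    add_occurrence(&mut postings, page);
                }
                for p in &postings {
                    println!("page {} tf {}", p.page_id, p.term_frequency);
                }
            }
            ```

            (This assumes occurrences arrive grouped per page, which is how indexing normally walks the documents anyway.)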

            For the dictionary front you could investigate Levenshtein distance. It would allow you to spell check using only the chunks you’ll have already fetched. Typically the first and last letters of a word will be typed correctly, and the middle will contain either a transposition, addition, or removal, and likely only one of those. I haven’t investigated the state of the algorithms to do that though: https://en.wikipedia.org/wiki/Levenshtein_distance
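
            The textbook dynamic-programming version is short. One caveat: plain Levenshtein counts a transposition as two edits; Damerau-Levenshtein treats it as one.

            ```rust
            // Textbook Levenshtein distance: the number of single-character
            // insertions, deletions, and substitutions to turn `a` into `b`.
            fn levenshtein(a: &str, b: &str) -> usize {
                let a: Vec<char> = a.chars().collect();
                let b: Vec<char> = b.chars().collect();
                // prev[j] = distance between the first i chars of a and first j of b
                let mut prev: Vec<usize> = (0..=b.len()).collect();
                for (i, &ca) in a.iter().enumerate() {
                    let mut curr = vec![i + 1];
                    for (j, &cb) in b.iter().enumerate() {
                        let cost = if ca == cb { 0 } else { 1 };
                        curr.push((prev[j] + cost)    // substitution (or match)
                            .min(prev[j + 1] + 1)     // deletion from `a`
                            .min(curr[j] + 1));       // insertion into `a`
                    }
                    prev = curr;
                }
                prev[b.len()]
            }

            fn main() {
                // A plausible typo sits within distance 1 or 2 of the intended word
                println!("{}", levenshtein("comfigure", "configure")); // 1
            }
            ```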

            Query expansion proper is very hard and is normally done by mining query logs. General-purpose thesauri typically give bad results, and domain-specific ones are expensive to create. I’m not sure what the solution there is, or if it’s worth covering at all. If you did implement it, I would imagine a thesaurus at the start of every chunk covering the words included, which should be minimal network overhead.

            Are you doing run length encoding for the postings list yet? I didn’t check, sorry. Doing that with group varint, vbyte, or simple8 compression will save you the most. You might also want to look into Trie structures, which would allow you to compress your terms list considerably and still perform prefix search. As a note, I wouldn’t recommend B-Tree structures greater than a depth of 2 (which is how you’ve already implemented Pagefind’s index anyway).
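
            In case it helps, the vbyte half is only a few lines once the postings are delta-encoded (group varint and the simple-style encoders are more work but faster to decode):

            ```rust
            // Delta-encode a sorted postings list, then vbyte-encode the gaps:
            // 7 bits of payload per byte, top bit set on the final byte of a number.
            fn compress_postings(doc_ids: &[u32]) -> Vec<u8> {
                let mut out = Vec::new();
                let mut prev = 0;
                for &id in doc_ids {
                    let mut gap = id - prev; // gaps are small, so most fit in one byte
                    prev = id;
                    loop {
                        if gap < 128 {
                            out.push(gap as u8 | 0x80); // last byte: set the stop bit
                            break;
                        }
                        out.push((gap & 0x7f) as u8);
                        gap >>= 7;
                    }
                }
                out
            }

            fn main() {
                // Four doc ids: 16 bytes raw, 5 bytes compressed
                let compressed = compress_postings(&[3, 7, 8, 200]);
                println!("{} bytes", compressed.len());
            }
            ```

            The decoder just reads bytes until it sees one with the top bit set, then adds the decoded gap to a running doc id.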

            For speed, two easy things. 1. Sort the postings lists by length before merging and merge from shortest to longest. This allows you to skip as much as possible when advancing the comparison pointers. 2. Have the document parser return an iterator for the indexer to use so that you’re not allocating and deallocating all the structures required to temporarily hold the document. Not searching to completion would also speed it up, but I’m not sure that it’s a feature for a small site.
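
            Point 1 in sketch form, for an AND-style query (the inner advance here is linear; skip pointers or galloping search would let you jump further):

            ```rust
            // Intersect sorted postings lists, shortest first: the smallest list
            // drives the loop, so the other cursors skip as far as possible.
            // Illustrative only, not code from either project.
            fn intersect(mut lists: Vec<&[u32]>) -> Vec<u32> {
                lists.sort_by_key(|l| l.len()); // shortest list first
                let (first, rest) = match lists.split_first() {
                    Some(split) => split,
                    None => return Vec::new(),
                };
                let mut cursors = vec![0usize; rest.len()];
                let mut out = Vec::new();
                'outer: for &doc in first.iter() {
                    for (list, cur) in rest.iter().zip(cursors.iter_mut()) {
                        while *cur < list.len() && list[*cur] < doc {
                            *cur += 1; // advance this list's comparison pointer
                        }
                        if *cur >= list.len() {
                            break 'outer; // one list exhausted: no more matches
                        }
                        if list[*cur] != doc {
                            continue 'outer; // doc missing from one list
                        }
                    }
                    out.push(doc);
                }
                out
            }

            fn main() {
                let a: &[u32] = &[1, 4, 9, 12];
                let b: &[u32] = &[2, 4, 6, 9, 10, 12, 15, 20];
                println!("{:?}", intersect(vec![a, b])); // [4, 9, 12]
            }
            ```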

            To improve result accuracy you might also want to consider keeping a second index for the titles of pages and boosting rankings on those. Quite often people just want to find a specific page again when they search.

            And finally a question. How much improvement does stemming give if you also support prefix search?

            1. 2

              Hopefully the solution you describe can also be integrated into the website’s language selector so that it can remain completely transparent to the user

              That’s the goal — one potential path is that the search bundle is output for each language directory, so the site would load /fr-fr/_pagefind/pagefind.js and get a search experience specifically for that language. Some degree of this will need to be done, as the wasm file is language-specific (I’m avoiding loading in every stemmer)

              Thanks for the tips on term ranking — also funny that you link to an Otago University paper, that’s where my CS degree is from :) (though I didn’t study information retrieval). The extra byte plan sounds like a good strategy.

              It would allow you to spell check using only the chunks you’ll have already fetched

              The reason I have been investigating spellcheck with an extra index is that the chunks as they exist now are difficult to suggest searches from, since the words are stored stemmed. Many words stem down to words that aren’t valid (configuration -> configur), so that doesn’t give me enough information to show some helper text like “Showing results for configure” in response to a search for comfigure.

              Thesaurus at the start of each chunk would be alright on the network, but if those words were then incorporated into the search we would need to load the other chunks for those words, which would make the network requests heavier across the board unless they were only used in a “no results” setting.

              Are you doing run length encoding for the postings list yet?

              Every index that Pagefind spits out is manually gzipped, and gunzipped in the browser from the Pagefind js. It’s been quite a cheap way to get RLE for “free”. I did some brief experiments early on with being smarter about the data structures, but nothing beat a simple gzip. Doing it manually also means that you aren’t reliant on server support, and the compressed versions happily sit in caches.
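
              The build-side half of that is tiny; something in the spirit of this (a sketch with the flate2 crate rather than the exact code):

              ```rust
              // Sketch of gzipping an index chunk at build time with the flate2 crate.
              use flate2::write::GzEncoder;
              use flate2::Compression;
              use std::io::Write;

              fn gzip_chunk(raw: &[u8]) -> std::io::Result<Vec<u8>> {
                  let mut encoder = GzEncoder::new(Vec::new(), Compression::best());
                  encoder.write_all(raw)?;
                  encoder.finish() // returns the compressed bytes
              }

              fn main() -> std::io::Result<()> {
                  let chunk = b"an index chunk with lots of repeated postings postings postings";
                  let gz = gzip_chunk(chunk)?;
                  println!("{} bytes -> {} bytes", chunk.len(), gz.len());
                  Ok(())
              }
              ```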

              Great tips on speed — I’ll definitely look into those.

              To improve result accuracy you might also want to consider keeping a second index for the titles of pages

              I have some plans here to add some generic weighting support. Ideally I can implement something more configurable, with a data-pagefind-weight="4" style tag that can be wrapped around any content, which would provide the ability to add title ranking. I haven’t done much R&D on this yet, but the loose plan is to investigate adding some marker bytes into the word position indexes that can signify the next n words should be weighted higher / lower, without having to split out separate indexes.
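
              To make that loose plan slightly more concrete, here is one hypothetical shape for it (nothing here is decided, it is just the idea written down):

              ```rust
              // Hypothetical marker scheme: the position stream is u32s, and values
              // with the top bit set are markers meaning "the next `count` words
              // carry `weight`" rather than real word positions.
              const MARKER: u32 = 1 << 31;

              fn weight_marker(count: u16, weight: u8) -> u32 {
                  MARKER | (count as u32) << 8 | weight as u32
              }

              fn decode(stream: &[u32]) {
                  let mut boosted_remaining = 0u32;
                  let mut weight = 1u8;
                  for &value in stream {
                      if (value & MARKER) != 0 {
                          boosted_remaining = (value >> 8) & 0xffff; // words to boost
                          weight = (value & 0xff) as u8; // e.g. data-pagefind-weight
                          continue;
                      }
                      let w = if boosted_remaining > 0 {
                          boosted_remaining -= 1;
                          weight
                      } else {
                          1
                      };
                      println!("word position {value} has weight {w}");
                  }
              }

              fn main() {
                  // Positions 0..=4, with positions 1 and 2 (a heading, say) weighted 4x
                  decode(&[0, weight_marker(2, 4), 1, 2, 3, 4]);
              }
              ```

              Word positions never get anywhere near the top bit on sites of this size, so the markers can’t collide with real positions.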

              And finally a question. How much improvement does stemming give if you also support prefix search?

              Great question! For partially-typed words, not a lot — the prefix search handles that well. For full words stemming provides a rudimentary thesaurus-like search, in that configuration and configuring will both stem down to configur and match each other. Additionally, storing words against their stem makes for smaller indexes, since we don’t need to allocate every version of configur* in the index.

              These are great questions and tips, thanks for the detailed dig! I’ve been tackling this from the “I want to build search, let’s learn information retrieval” side, rather than the “I know IR, let’s build search” side, so there are definitely aspects I’m still up-skilling on :)

              1. 1

                That’s the goal — one potential path is that the search bundle is output for each language directory, so the site would load /fr-fr/_pagefind/pagefind.js and get a search experience specifically for that language. Some degree of this will need to be done, as the wasm file is language-specific (I’m avoiding loading in every stemmer)

                Brilliant. That sounds like it’ll be nice and ergonomic

                Thanks for the tips on term ranking — also funny that you link to an Otago University paper, that’s where my CS degree is from :) (though I didn’t study information retrieval). The extra byte plan sounds like a good strategy.

                If you look at the literature on performance and compression, the author of that paper does very well. There was a fairly recent comparison published of academic open-source search engines. A shame you didn’t take the paper, as not many universities teach search engines.

                The reason I have been investigating spellcheck with an extra index is that the chunks as they exist now are difficult to suggest searches from, since the words are stored stemmed. Many words stem down to words that aren’t valid (configuration -> configur), so that doesn’t give me enough information to show some helper text like “Showing results for configure” in response to a search for comfigure.

                Seeing as you’re already scanning them with the prefix search, you could store the words unstemmed and stem at search time. Though what you lose in postings compression might weigh as much as storing the words twice would. Don’t know… would have to test

                Or you could stem the misspelled word, then fix, and silently add the fixed stemmed version to the query. As you’re already doing prefix search you’re gonna get a bunch of results that are good quality but don’t match the query literally anyway

                Thesaurus at the start of each chunk would be alright on the network, but if those words were then incorporated into the search we would need to load the other chunks for those words, which would make the network requests heavier across the board unless they were only used in a “no results” setting.

                You’d want to include the words/postings found from the thesaurus in the same chunk as the original term as you’re adding them to the query anyway. But yeah not worth talking too deeply about a feature which won’t be worth implementing

                Every index that Pagefind spits out is manually gzipped, and gunzipped in the browser from the Pagefind js. It’s been quite a cheap way to get RLE for “free”. I did some brief experiments early on with being smarter about the data structures, but nothing beat a simple gzip. Doing it manually also means that you aren’t reliant on server support, and the compressed versions happily sit in caches.

                You’ll find that small integer compression on top of RLE will compress a lot better than GZIP even including the weight of the decompressor. GZIP is a decent general purpose compressor but it can’t beat something that’s specialised

                I have some plans here to add some generic weighting support. Ideally I can implement something more configurable, with a data-pagefind-weight=“4” style tag that can be wrapped around any content, which would provide the ability to add title ranking. I haven’t done much R&D on this yet, but the loose plan is to investigate adding some marker bytes into the word position indexes that can signify the next n words should be weighted higher / lower, without having to split out separate indexes.

                Sounds like a neat solution. I haven’t experimented with position indexes myself, but bigram chaining is another implementation of phrase searching and may compress better (or worse). Worth being aware of if you weren’t already
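
                A toy version of bigram chaining, just to show the shape (a real index would store compressed postings lists rather than hash sets):

                ```rust
                use std::collections::{HashMap, HashSet};

                // Map every adjacent word pair to the documents containing it, then
                // answer a phrase query by intersecting the doc sets of its bigrams.
                // Phrases longer than two words can produce false positives, so
                // results may still need verification against the documents.
                type BigramIndex = HashMap<(String, String), HashSet<u32>>;

                fn index_doc(index: &mut BigramIndex, doc_id: u32, text: &str) {
                    let words: Vec<&str> = text.split_whitespace().collect();
                    for pair in words.windows(2) {
                        index
                            .entry((pair[0].to_string(), pair[1].to_string()))
                            .or_default()
                            .insert(doc_id);
                    }
                }

                fn phrase_query(index: &BigramIndex, phrase: &str) -> HashSet<u32> {
                    let words: Vec<&str> = phrase.split_whitespace().collect();
                    let mut result: Option<HashSet<u32>> = None;
                    for pair in words.windows(2) {
                        let docs = index
                            .get(&(pair[0].to_string(), pair[1].to_string()))
                            .cloned()
                            .unwrap_or_default();
                        result = Some(match result {
                            Some(acc) => acc.intersection(&docs).cloned().collect(),
                            None => docs,
                        });
                    }
                    result.unwrap_or_default()
                }

                fn main() {
                    let mut index = BigramIndex::new();
                    index_doc(&mut index, 1, "the quick brown fox");
                    index_doc(&mut index, 2, "a quick red fox");
                    println!("{:?}", phrase_query(&index, "quick brown fox")); // {1}
                }
                ```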

                Great question! For partially-typed words, not a lot — the prefix search handles that well. For full words stemming provides a rudimentary thesaurus-like search, in that configuration and configuring will both stem down to configur and match each other. Additionally, storing words against their stem makes for smaller indexes, since we don’t need to allocate every version of configur* in the index.

                I’ve found in my experience that a lot of words have a stem which is also a word (except when you use the Snowball stemmers, of course), and that stem can often be the form that users enter into the search box, and that most articles which talk about configuration will also talk about configuring. But I’m also working on web search and not a product for individual sites, so there’s a different precision/recall tradeoff.

                Another interesting thing about Trie structures is you can use their branching factors to find stems. I haven’t tested this in a search engine context though so I’m not sure if it’s better or worse than snowball. But might be worth playing with https://github.com/takuyaa/yada

                I’ve been tackling this from the “I want to build search, let’s learn information retrieval” side, rather than the “I know IR, let’s build search” side, so there are definitely aspects I’m still up-skilling on :)

                It’s a fun journey and it’s always good to see more people on it :)

                1. 1

                  Amazing resources, thanks. I’ll definitely be revisiting these comments in the future.

                  Cheers for the great discussion :)

                  1. 1

                    No problem! I very much enjoyed it as well :)

      2. 3

        My impression is that it will use less bandwidth compared to Stork.

      3. 2

        Yes, bandwidth is the leading differentiator here. If you look at Stork’s test site here, the index for 500 pages is nearly 2MB after compression, and the wasm itself is 350KB.

        The Pagefind XKCD demo has 2,500 pages. The exact bandwidth depends on how much you search, but a simple search can come in at around 100KB including the wasm, js, and indexes.

    2. 2

      This is pretty neat. I couldn’t figure out how to get it set up with my Jekyll+GitHub pages site, though. I would be happy to run it on my own, offline, before pushing, but I can’t figure out how to get it to ignore _site, the generated directory. And it doesn’t seem to understand the markdown sources, unfortunately.

      1. 2

        Yes, GitHub Pages is a challenge; building locally and pushing would be the best bet (or doing so in a GitHub Action).

        Pagefind actually wants to index only your _site folder — it’s built to index the output HTML of a site rather than the input markdown.

        If you run it locally on your _site folder, you could output the bundle in your site root with --bundle-dir ../_pagefind and push that. Then if you add include: ["_pagefind"] to your Jekyll config this folder will be carried through to the output site when it builds (otherwise Jekyll would ignore it due to the leading _). Hope that makes sense!

        1. 3

          So it looks like this requires a little two-step render:

          • Build the site locally
          • Run pagefind, which scans _site and deposits metadata into the site root
          • Commit the metadata and push to GitHub, which will re-build

          Do I have that right?

          1. 2

            Yep that’s it!

            There’s an upcoming GitHub Pages release that will make life a lot easier — mentioned in this talk but applies to any SSG. Doesn’t look like it’s released yet, but when it is you’ll be able to go to your Pages settings and change from a source branch to a source GitHub Action, and it’ll generate a workflow automatically for your SSG — as described in this site readme. Then it would be trivial to run Pagefind after that build in the action.

            1. 2

              Oh that sounds very nice!

        2. 2

          Oh, good to know. I was thinking about how one could build that into the homegrown salad of plugins and scripts that a lot of people have for their sites. Indexing the output makes it pretty trivial, I guess.

    3. 2

      Just added this to my blog and I’m very impressed so far!

      1. 2

        Nice to hear. Out of all the products I’ve found in this space, it’s been one of the easiest to set up. I’ve also found it to be decently performant. I’ve looked at frontend-driven search products over the years, but it’s always come with an upfront cost of a few megs to load the index file. This is the first I’ve found that chunks it and only loads the necessary chunks.

    4. 2

      I’ve been using Algolia, but this looks really promising as a replacement.

      1. 3

        I hadn’t heard of Algolia before, I guess because I mostly focus on open-source search engines and self-hosting. What sort of scale are you doing with it?

        1. 2

          I am low enough volume that I can use their free plan, but obviously “use some service and just hope they don’t go broke or start charging a lot” is a bad long term strategy.

          1. 2

            Indeed, hence my desire to self-host everything

    5. 3

      Over the course of a year, I have visitors from almost every country in the world to https://pythonspeed.com. Even on a daily basis I get a bunch of visitors from countries where it’s pretty expensive to pay for bandwidth. I feel a little guilty about the web fonts, though I tried to make them small.

      All the search solutions I’ve found have either used a bunch of bandwidth, or didn’t even bother talking about it, to the point where I assumed they’d just use lots out of not caring.

      But this seems very promising:

      “For a 10,000 page site, you can expect to perform a single-word search with a total network payload under 300KB — including the Pagefind javascript and webassembly libraries.”

      1. 7

        Author here — it’s been a great couple of weeks hearing from people that I’m not alone in the frustrations I have had with picking a search tool. Hopefully it would come in closer to 100KB for you — that 300KB figure is from my testing on a clone of MDN