It is too fast.
Seriously, this reminds me of 1995 and the first time I clicked links on a webpage on a university network instead of dial up.
It is so fast, it's a testament to how terrible the web has gotten.
100% this, man. We have so much power at our fingertips and waste it on bloat.
That event of switching from dialup to 100BaseT when I went to college was, by the numbers and multiplicatively, the single biggest personal tech step improvement I’ve lived through. The web was like lightning for years. I miss it dearly and love sites like this one that just don’t do any extra bullshit.
I was talking about how I wish we could get Google Fiber where we lived.
Wife: “Is it going to feel as fast as when I connected to the ethernet in my dorm at UT Austin in 1998?”
Me: “I mean, technically, it’s…”
Wife: “Is it going to feel as fast?”
Me: “Probably…”
Wife: “I have standards.”
The answer is NO.
It is not going to feel as fast.
Looks nice and clean. Is it free software? Can I use it for indexing and searching my own documentation and host it myself?
I’d love to see the source just for a good example of proper use of PostgreSQL for full-text indexing! I haven’t seen an application that does full-text indexing/search so well, and so quickly, in a very long time. I have a feeling that, even if it’s got some of the usual small hacks that personal projects sometimes have, this is the kind of source code that aspiring developers would find extremely rewarding to read and understand.
[Comment removed by author]
Thank you! Unfortunately not; it's more of a 48-hour hack that is too ugly and way too tailored to the OpenBSD manual pages. I hope to be able to release it in the somewhat near future though, once most of the ugly stuff is sorted out.
Release the ugly software!
I don’t think it’s ever too early to release the code - put a big warning notice on the README that it’s all a big hack and ship it anyway.
The value for me would be seeing what hacky tricks you used to get it to work. I couldn’t care less if it’s not polished yet.
Agreed, and some of the fun lies in letting other people find cool ways to remove the hacky stuff (if you want to accept contributions).
I had done a hack as well for manpages with sqlite https://lobste.rs/s/t7tuej/epilys_buke_toy_manpage_full_text_search
Feel free to play with it, since it’s free software.
Source is now available through Git:
I am curious about which search technology is used.
My guess is a custom indexing solution tailored to the use case that keeps everything in memory.
When your corpus is static, relatively small for modern computers, and changes slowly, you can do lots of things to squeeze perf out.
I’m using PostgreSQL with proper indexing. :)
Computers are ridiculously fast. JASSjr, a project demonstrating how easy it is to write a minimal search engine (just a hash table and a merge), can search a 173,252-document corpus with an average length of 470 words in 10ms using a naïve Python implementation on an i5-4570 (a CPU from 2013). This doesn't include reading the index, but if you are clever about your index format you can make that fast too. For a long time my hobby search project was a CGI binary written in C that would read and decode the index file on every request. It had about 80,000 documents and still ran in 10ms with nothing more than a hash table and a repeated 2-way merge for the k-way merge algorithm.
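The hash-table-plus-merge design described above can be sketched in a few lines of Python. This is my own illustrative sketch, not JASSjr's actual code; the term splitting and scoring are deliberately naïve:

```python
from collections import defaultdict

# Toy inverted index: a hash table mapping each term to a sorted
# list of document IDs.
def build_index(docs):
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            index[term].append(doc_id)  # doc IDs arrive already sorted
    return index

# Classic 2-way merge of two sorted postings lists (intersection).
def merge(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# A k-way merge done as repeated 2-way merges, as described above.
def search(index, query):
    postings = [index.get(t, []) for t in query.lower().split()]
    if not postings:
        return []
    result = postings[0]
    for p in postings[1:]:
        result = merge(result, p)
    return result
```

For example, with docs = ["the atomic add", "atomic operations", "add int"], search(build_index(docs), "atomic add") intersects the postings [0, 1] and [0, 2] and returns [0].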
Something might be goofy with tokenization on the indexer side and/or stemming on the query side…
Search for “atomic” and I’d expect some hits on the atomic kernel functions.
For instance, compare to apropos:
https://man.ifconfig.se/?q=atomic&a=false&s=false
vs.
https://man.openbsd.org/?query=atomic&apropos=1&sec=0&arch=default&manpath=OpenBSD-current
Yeah, the indexer turns words into "lexemes", which are normalized forms of the words. This makes a search for the word "code" match documents containing "code", "coding", and "coder". In your case, "atoms" and "atomic" both become "atom", and various X libraries seem to use the word "atoms" more frequently. Pagination and more than 10 results per query are on the todo list, though.
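As a rough illustration of that normalization: PostgreSQL's english dictionary runs words through a Snowball stemmer (via to_tsvector/to_tsquery). The crude suffix-stripper below is only a stand-in for that and produces different lexemes than Postgres would, but it shows why related word forms match each other:

```python
# Crude stand-in for stemming: strip one common suffix, then any
# trailing "e", so related word forms collapse to one "lexeme".
# PostgreSQL's Snowball stemmer is far more careful; this is only
# to show why "atoms" and "atomic" end up matching each other.
def lexeme(word):
    w = word.lower()
    for suffix in ("ing", "er", "ic", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            w = w[: -len(suffix)]
            break
    return w.rstrip("e")
```

With this toy rule, "code", "coding", and "coder" all collapse to the same lexeme, and "atoms" and "atomic" both collapse to "atom", so a search for any one form matches documents containing the others.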
You can also choose to only search the kernel functions:
https://man.ifconfig.se/?q=atomic&a=false&s=9
Is there a way to tune the relevancy model used? If it's processing "atomic" like this, it's tokenizing atomic_add_int into potentially [atomic, add, int] and then probably processing those tokens into lexemes. In my past experience with things like Apache Lucene, there's usually a way to influence the scoring so an exact token match (atomic in this case) would boost the score higher than a match on a shared lexeme/lemma/stem/etc.

I unfortunately have no experience using PostgreSQL for this type of stuff…but there's probably a way.
Does it count as full-text search if it’s based on indexed words and you can’t search for an arbitrary string?
Full-text refers to searching over the entire documents rather than just their summaries or titles, as classic retrieval engines used to do. We now live in a world where full-text search is usually the standard.