The original file-per-message format is MH, which has been around since the 1970s.
It’s less true now that NVMe storage is so fast, but file-per-message used to be very slow because it requires many blocking filesystem metadata operations. But mbox is not much faster, has some nasty performance cliffs, >From corruption, and concurrency issues. So IMAP servers such as Cyrus and Dovecot do best with file-per-message storage, supported by separate mailbox index files which allow them to go much faster: multiple messages per IO instead of multiple IOs per message.
I tend to think the right way to store mail is to let Dovecot take care of it, and access it over IMAP. (Edited to add) One of the things languishing on my todo list is to set up Dovecot as a local IMAP proxy/cache for my Fastmail account, for better latency with non-webmail access. Fastmail’s web UI is indeed fast but I have ancient habits and I like my old terminal MUA. I guess the setup I have in mind might also work OK for the author of the post?
(Flash back to the job interview in 2002 for my postmaster job at Cambridge University during which I gave an off-the-cuff 5 minute lecture on tradeoffs in mail storage formats.)
I think the same, and ended up just going with Dovecot’s mdbox, which is a format that I can import/export to as needed. Anecdotally, that has yielded the best performance for me (although since it’s a personal mail server, I didn’t measure, I just went on vibes, so I can’t provide data to back that up). I was also even able to enable compression with it and not feel much pain.
I’m curious - why not a database? Maildir/MH/mbox all have gross tradeoffs between performance and reliability, whereas SQLite would cut the Gordian knot of those. Of course, schema design could be a problem, but even just stuffing the whole message in without special columns for indexing would probably end up working better.
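To make the "just stuff the whole message in" idea concrete, here is a minimal sketch using Python's standard `sqlite3` module. The table name, column names, and helper functions are all made up for illustration; this is not a schema that any existing mail tool uses.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a hypothetical one-table message store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id INTEGER PRIMARY KEY,"
        " raw BLOB NOT NULL)"  # the whole RFC 5322 message, unparsed
    )
    return db


def add_message(db, raw: bytes) -> int:
    """Store one raw message and return its row id."""
    # 'with db' wraps the insert in a transaction:
    # commit on success, rollback if an exception is raised.
    with db:
        cur = db.execute("INSERT INTO messages (raw) VALUES (?)", (raw,))
    return cur.lastrowid


def get_message(db, msg_id: int) -> bytes:
    """Fetch the raw bytes back, unchanged."""
    row = db.execute(
        "SELECT raw FROM messages WHERE id = ?", (msg_id,)
    ).fetchone()
    if row is None:
        raise KeyError(msg_id)
    return row[0]
```

Anything beyond this (flags, folders, header indexes) would be extra columns or derived tables layered on top, and that is where the schema design question starts.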
I think archival purposes used to be a really good reason, but SQLite is now a Library of Congress recommended format for long-term storage, so I am not really convinced of it anymore.
On the subject of schemas: Once upon a time I added SQLite support to dbmail. If you only do IMAP/SMTP with the occasional ad-hoc query, maybe that’s fine for you. The general architecture is good enough for archival purposes, but then derived tables are used to make IMAP suck less and are sometimes useful for ad-hoc queries. Obviously you can add your own triggers and build up your own cache tables as well…
For a server or general backing storage for e.g. a GUI client, I agree. This is aimed at people who enjoy having stuff stored as regular files that can be inspected with regular command line tools, like grepping for something etc.
Even for the local MUA people, I think command line tools that abstract SQL queries so you can use a more robust format than plain text would be worthwhile.
I have also started a (still WIP) implementation in Rust here.
Very nice. Why FNV64a (non-cryptographic, obscure?) rather than a strong (and ideally well-known) hash like a SHA or BLAKE?
There’s really no need for cryptographically strong hashing in this context. FNV is very simple to implement and fast.
On the other hand, SHA is already always available and fast. And strong. To me, FNV has no benefit over a stock cryptographic hash, and some disadvantages: a strong hash completely obviates the need for the version/conflict numbers, for example. And the need for the length salt.
The unique ID description is weird. It specifies FNV salted with the length of the message, but FNV isn’t a salted hash, so it isn’t clear how to calculate it.
Salting is typically done by concatenating the salt to the bytes you want to hash, before hashing.
Here’s how I interpreted it.
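For readers without the link, one possible reading can be sketched in Python. The FNV-1a constants are the standard 64-bit offset basis and prime. The salt encoding (message length as 8 little-endian bytes) and its position (prepended) are assumptions chosen for illustration, since the thread does not say what the spec intends; `salted_id` is a hypothetical helper name.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF


def fnv1a_64(data: bytes) -> int:
    """Plain 64-bit FNV-1a: XOR each byte in, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64  # keep the result within 64 bits
    return h


def salted_id(message: bytes) -> str:
    """Hash 'salt + message' and return 16 hex digits.

    ASSUMPTION: the salt is the message length encoded as 8 little-endian
    bytes and placed before the message. The real format may differ.
    """
    salt = len(message).to_bytes(8, "little")
    return format(fnv1a_64(salt + message), "016x")
```

Whether the length goes before or after the message, and how it is encoded, changes every resulting ID, which is why the ambiguity matters for interoperability.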
Looks interesting. Seems like moving the flags to be in a file introduces more disk activity but also the potential for more complicated race conditions? IE when doing something to a flag in Maildir, it’s a rename (atomic). For m2dir, you have to check that .meta exists, check that the metadata file exists, open and read the metadata file, and then write out the new contents. You’d definitely have to lock the metadata file, right? Probably also the mail file to indicate that you’re “using” it (otherwise deleting the mail file whilst writing out the metadata will create an orphan - which may or may not be a problem until you then get the same hash for a future email.)
How do you distinguish new files? I guess “the absence of a \Seen flag” but that then requires you to check whether the metadata file exists (for each email!) and also parse out the contents (you can have new emails with flags due to something like sieve) - with a big mailbox, this is going to add up, right?
I like the actual message on the filesystem as a file idea. Metadata, Flags, search trees, etc could all be done in sqlite and be way faster for mail apps I would think. Also would pretty much fix all these race conditions for the most part.
Race conditions occur when two or more writes need to be synchronized and there is the possibility of a change in between.
Maildir avoids races by using move, which is atomic when the source and destination are on the same filesystem, and by storing metadata in the filename - also atomic.
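As a rough illustration of the rename approach, here is a minimal Python sketch of setting a Maildir flag. It assumes the usual `:2,FLAGS` info suffix with flag letters kept in ASCII order, and ignores the less common `:1,` experimental form; `set_flag` is a hypothetical helper, not part of any library.

```python
import os


def set_flag(path: str, flag: str) -> str:
    """Add one Maildir flag letter by renaming the file; return the new path."""
    directory, name = os.path.split(path)
    # Split "unique:2,FLAGS" into the unique part and the existing flags.
    base, sep, info = name.partition(":2,")
    flags = set(info) if sep else set()
    flags.add(flag)
    new_name = base + ":2," + "".join(sorted(flags))
    new_path = os.path.join(directory, new_name)
    if new_path != path:
        # One rename() call: other processes see either the old name
        # or the new one, never a half-updated state.
        os.rename(path, new_path)
    return new_path
```

There is no lock and no separate metadata file to keep in step. A reader that still holds the old name simply fails to open it and has to rescan the directory.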
If you store messages in the filesystem and use a database to maintain metadata, you’re going to get a race condition.
The author’s conceptions of things that Maildir supports that are no longer necessary are… odd. Opportunities for disordered writes by different programs abound in multiprocessor systems. NFS is not the only contributor, just the most widely used.
I agree the data can be de-synced between the two files, that’s definitely true. That’s true whether you use a DB or files.