Reading this makes me wonder “What If OpenDocument Used A Directory”? You’d get most of the benefits given in the article, and some more. Imagine being able to update an image in your presentation by just editing the image file!
It’s surprising how difficult it is to use directories like this on a modern OS, though: you can’t attach a directory to an email, set an application to open it, and so on.
Apple’s “packages” are basically exactly what you’re describing, and indeed Apple’s office suite stored documents as packages for a while. Finder displays packages as if they were normal files (you can’t dig into them unless you right-click and choose “Show Package Contents”), and Mail automatically zips directories (including packages) if you try to attach them.
The office suite moved away at some point in the past few years (files are zips now, I believe); I’m not sure what the reasoning was, but I suspect it either made iCloud sync tricky or caused too much confusion on non-Apple devices.
Even before Apple’s “packages”, Apple used “resource forks”, which were essentially a second fork of structured information attached to every file. (Anyone remember ResEdit? It let you manipulate the data in these forks directly, even in compiled apps/binaries. It was amazing.)
Microsoft still has something similar in Alternate Data Streams. Their main use case is being very surprising when you mistakenly include a colon (:) in a file name and it writes to the ADS.

IIRC (and I certainly may not!), the whole reason Alternate Data Streams exist is to support the Mac’s resource forks.
It’s a benefit of having them, but not why NTFS supports them as such. When Microsoft designed NTFS, they needed a file system that could support not just FAT, but also HPFS (the OS/2 file system) with its additional metadata and greatly expanded extended attributes (64 KiB or so, IIRC); it also needed a way to natively handle POSIX permissions. Rather than solving these two cases ad hoc, Microsoft simply let NTFS have lots of alternate data streams. When Windows NT Server came on the scene, you’re entirely right that they used ADS to support storing Mac resource forks, but that was more a “build with what you’ve got” than their original purpose as such.
Thank you!
That’s an extremely silly feature, then ;)
Or frustrating you when you delete the _files folder Firefox created because you didn’t want to switch the remembered Ctrl+S preference away from HTML (Complete), and the .html file goes away without warning.

(Not sure if they still do that. The last time I daily-drove Windows was XP.)
As @Gaelan points out, this is what bundles are in OpenStep / Cocoa. NeXT adopted this model in preference to Apple-style resource forks (HFS has two streams associated with every file, HFS+ allowed an unlimited number) because it provided better interoperability. If you store Mac files with forks on an NFS or SMB server or a FAT filesystem, the extra data is stored in a hidden file. If you then copy the document on a non-Mac machine, you may end up with something that can’t be opened because you failed to copy the hidden file that contained half the document. With a NeXT-style bundle, other systems just see a directory and they know how to handle directories.
Two Apple technologies pushed the rest of the ecosystem to adopt bundles: Spotlight and Time Machine.
Spotlight indexes files on the filesystem. Every time a file is written, the index needs updating. This works a lot better if you can separate the text-like bits from the rest of the document, so that you can read those into the indexer process without touching anything else. For a zip archive, you need to read the archive’s central directory, pull out the right parts, and then update the index. For a bundle, you can aggregate all of the metadata in a property list and read just that for non-text formats; for rich text, you can avoid any disk I/O on the non-text bits when doing full-text indexing.
Time Machine backs up files. It does file-level deduplication, so if a file hasn’t been modified then it will not be backed up twice. Imagine you have a rich text document containing text and a load of images. If you use a zip file or a SQLite database and make a single-character change, the whole thing gets backed up again. If you use a bundle, all of the embedded images remain unchanged and just the text portion gets a new backup. This means you’re adding a few KiB to the backup instead of a few MiB or more. SQLite also has the problem that you write transactions into the file and then clean them up once they’re committed to disk, so you may end up backing up multiple intermediate states (if you run a backup while saving, your backup disk will eventually contain three copies of the file: the old, the new, and the intermediate).
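The file-level deduplication point is easy to see in a few lines of Python. This is a toy sketch, not how Time Machine actually works, and the file names are invented for illustration:

```python
import hashlib
import tempfile
from pathlib import Path

def file_hashes(root: Path) -> dict:
    """Hash every file in a bundle-style directory individually,
    the way a file-level deduplicating backup sees it."""
    return {p.name: hashlib.sha256(p.read_bytes()).hexdigest()
            for p in root.iterdir()}

# Build a toy "document bundle": a small text part plus a large binary part.
root = Path(tempfile.mkdtemp())
doc = root / "demo.bundle"
doc.mkdir()
(doc / "text.txt").write_bytes(b"Hello, world")
(doc / "image.bin").write_bytes(bytes(1_000_000))  # stand-in for an embedded image

before = file_hashes(doc)

# A one-character edit to the text part...
(doc / "text.txt").write_bytes(b"Hello, world!")
after = file_hashes(doc)

# ...changes only one file inside the bundle, so a per-file backup
# re-copies a few bytes instead of the whole megabyte. (Had the
# document been a single zip or SQLite file, its one hash would
# have changed and the entire file would be backed up again.)
changed = [name for name in before if before[name] != after[name]]
print(changed)
```

The directory layout does the work here: the backup tool doesn’t need to understand the document format at all to avoid re-copying the images.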
This also makes the format great for non-destructive editing. Something like iMovie or Final Cut keeps the original source videos and stores the set of transformations done on them. A format that required writing all of the source videos to disk every time you saved would be painful. SQLite allows in-place modification, so it wouldn’t have that problem, but zip files do. Even on a G4 PowerBook, these programs worked well with videos that included tens of GiBs of raw data and played nicely with backups and indexing.
GNUstep has supported bundles since the ‘90s but the open source desktop ecosystem seems to prefer copying bad ideas to copying good ones.
I’m certain I’ve seen some tool that lets you mount a zip file so it appears as a regular directory to other programs, so you could have those benefits.
And remember, the file system is also a database! :)
Emacs has that functionality.
find-file a .zip or other archive and it’ll show the directory structure similar to dired, and allows editing and saving the files (if the appropriate zip/tar/7z/rar commands are installed). I’m not sure of the implementation details, but IIRC it’s smart enough to avoid extracting the whole archive.
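That matches how the zip format is meant to be read. As a sketch of the same idea, Python’s zipfile module lists an archive from the central directory at the end of the file and decompresses only the member you ask for:

```python
import io
import zipfile

# Build a small archive in memory with several members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("README.txt", "hello")
    z.writestr("data/big.bin", bytes(100_000))

# Reading one member needs only the central directory (at the end of
# the file) plus that member's compressed bytes; nothing else is
# decompressed or extracted to disk.
with zipfile.ZipFile(buf) as z:
    names = z.namelist()          # listing comes from the central directory
    text = z.read("README.txt")   # decompresses just this member

print(names, text)
```

An editor front-end can do exactly this to show the file listing and open a single member, which is presumably what Emacs is doing under the hood.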
Windows does a half-assed version of that out of the box: Explorer lets you navigate zip files as if they were regular folders (but for the icon). I don’t think the Windows APIs expose them as directory-like, though.
Which does make sense, as the OS can’t know which “view” of the zip file you want: you might want to move the file around, or you might want to manipulate the directory. Much easier to provide a nice unified view to the GUI user, but let programs do their thing.
The shell APIs do: if you were to consistently use IShellFolder and friends, you could have a program that was agnostic to whether something was a zip, a file system directory, an FTP server, etc.
Of course this mostly doesn’t happen in practice outside of Explorer.
This still incurs a heavy performance cost under the hood though since presumably you have to rewrite large portions of the ZIP to service write calls at the FUSE layer.
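Right, and the standard zip toolkits make the problem visible: there is no in-place replace, so “updating” a member either appends a duplicate entry (leaving the old bytes as dead weight) or forces rewriting the rest of the archive. A small sketch with Python’s zipfile:

```python
import io
import warnings
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("doc.txt", "version 1")

old_size = len(buf.getvalue())

# "Updating" a member in append mode doesn't rewrite anything in place:
# the old compressed data stays in the file and a second entry with the
# same name is appended (zipfile even warns about the duplicate name).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    with zipfile.ZipFile(buf, "a") as z:
        z.writestr("doc.txt", "version 2")

with zipfile.ZipFile(buf) as z:
    entries = [i.filename for i in z.infolist()]
    current = z.read("doc.txt")  # readers resolve the name to the last entry

print(entries, current, len(buf.getvalue()) > old_size)
```

A FUSE layer that wants the archive to stay compact has to do the honest thing instead: copy every untouched member into a fresh archive on each write, which is where the heavy cost comes from.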
For version control, definitely!
Are there any document formats that lend themselves to version control? That’s an age-old problem.
Most of the markup languages from Markdown to SGML to (La)TeX.
I thought so, and I think it’s a sorry state the world is in, because once you have been exposed to version control, POOF, all programs that normal people use become unusable. Not unlike how learning what a monad is immediately makes you incapable of explaining it. Not that learning the good stuff isn’t good, but it didn’t have to come with devastating life changes.
I’ve written a handful of presentations in HTML, and it can’t be done quickly. I did not find Markdown or LaTeX useful for that, because they don’t give you control over the presentation – kind of important for a presentation.
time to read https://www.sqlite.org/whentouse.html#container and https://www.sqlite.org/fasterthanfs.html then :)
basically what a tar archive is
No more so than zip. Arguably less so, since it lacks a central directory (and has an even higher per-file overhead).
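The per-file overhead is easy to measure with Python’s tarfile module (a quick sketch; the numbers follow from tar’s 512-byte block structure):

```python
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:  # plain, uncompressed tar
    info = tarfile.TarInfo("tiny.txt")
    data = b"x"                      # a single byte of content
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

size = len(buf.getvalue())
# One 512-byte header plus one 512-byte data block for our single byte,
# then the end-of-archive marker and record padding on top of that.
print(size)
```

So a one-byte file costs at least a kilobyte before padding, and since there is no central directory, listing the archive means walking headers through the entire file.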
The biggest takeaway for me here was “it’s ok to store BLOBs in sqlite”. That’s going to be very handy!
Unless you use a backup system that does file-level deduplication. Then please store blobs in separate files!
Good idea! Thanks!
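For what it’s worth, the blob round trip is tiny in code. A minimal sketch with Python’s sqlite3 module (the table and column names are invented for the example):

```python
import sqlite3

# An in-memory database standing in for a document file on disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE attachments (name TEXT PRIMARY KEY, data BLOB)")

payload = bytes(range(256))  # stand-in for image bytes
db.execute("INSERT INTO attachments VALUES (?, ?)", ("image.png", payload))
db.commit()

(round_tripped,) = db.execute(
    "SELECT data FROM attachments WHERE name = ?", ("image.png",)
).fetchone()
print(round_tripped == payload)
```

Passing bytes through a parameterized query stores them verbatim as a BLOB; no encoding or escaping on your side is needed.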
Are there multiple interoperable implementations of the sqlite file format? Is the format specified somewhere? Does the format remain backwards compatible indefinitely?
I don’t know the answers, but it feels like these are more important questions when considering a document format.
I think your latter two questions are addressed right on the SQLite home page:
“Note that there have been no breaking changes since the file format was designed in 2004. The changes shown in the version history above have all been one of (1) typo fixes, (2) clarifications, or (3) filling in the ‘reserved for future extensions’ bits with descriptions of those extensions as they occurred.”
I’m probably just anxious after the sqlite2 -> sqlite3 breakage, though maybe that taught them the value of keeping things stable.
Would you care to elaborate? Docs suggest that sqlite3 was released 2004-08-09; I have not read anything about instability or migration issues.
That’s a long time to be anxious for, it may be time to let that go ;-)
D. Richard Hipp addressed that in a comment on Hacker News: https://news.ycombinator.com/item?id=37558809