
Conversation

@kjnsn (Contributor) commented Aug 22, 2024

This extends off the ideas in #6493, using RFC 7464 as the serialization format.

Motivation (from @ridiculousfish in #6493)

fish history is currently a broken pseudo-YAML. Mea culpa: I added this about 8 years ago, without thinking hard enough. It is ad hoc, underspecified, and buggy (#6032). This complicates adding new fields while maintaining backwards compatibility, and it prevents the format from being successfully parsed by any other libraries.

The goal here is to settle on a widely-recognized, fully-specified format.

The original issue was raised in 2020, before the Rust port of fish. While there was general consensus that changing the history format would be beneficial, the change was parked, and much has changed in the four years since.

Goals

Address the issues mentioned by focusing on the following properties:

  • Ability to add new fields while maintaining backwards compatibility
  • Use a standard format that can be parsed and used by other libraries
  • High-quality, solid parsing & serialization

Solutions

Use a standard for the format, in this case RFC 7464. I chose this format because the RFC is only seven pages long, and its motivations align with the needs of the fish history format. To quote the RFC:

Consider a sequence of one million values, each possibly one kilobyte
when encoded -- roughly one gigabyte. It is often desirable to
process such a dataset in an incremental manner without having to
first read all of it before beginning to produce results.

The format is designed with a specific use case in mind: a large number of small JSON items in a stream that may be written to by multiple programs at the same time. Each record begins with the ASCII record separator (0x1E) and is terminated by an LF (0x0A). The RFC states that an incomplete record can be detected by the absence of an LF before the next RS. This is important for a shell history file, as some situations may result in only a partial buffer being flushed to the file (such as an unexpected termination). The RFC also states that invalid records are not fatal and should be ignored.
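
To illustrate the framing (a minimal sketch of the RFC's encoding, not the code in this PR):

```rust
// Each RFC 7464 record is RS (0x1E) + JSON text + LF (0x0A). A record that is
// missing its trailing LF can be recognized as truncated and skipped.
const RS: u8 = 0x1E;
const LF: u8 = 0x0A;

fn frame_record(json: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(json.len() + 2);
    out.push(RS);
    out.extend_from_slice(json.as_bytes());
    out.push(LF);
    out
}

/// Split a buffer into complete records, dropping anything not terminated by
/// LF (for example a partial buffer flushed before an unexpected termination).
fn complete_records(buf: &[u8]) -> Vec<&[u8]> {
    buf.split(|&b| b == RS)
        .filter(|chunk| chunk.last() == Some(&LF))
        .map(|chunk| &chunk[..chunk.len() - 1])
        .collect()
}
```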

JSON is a well-specified format, described in RFC 7159. Using JSON for history was generally considered a good idea in issue #6493.


Use the serde_json crate. This library is probably the closest Rust has to an official JSON library. This ensures (hopefully) fairly bug-free serialization/deserialization and reduces the amount of custom formatting required.
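
As a rough sketch of what this looks like (field names here are illustrative, not necessarily the exact schema in this PR):

```rust
use serde::{Deserialize, Serialize};

// serde ignores unknown fields by default when deserializing, which is what
// allows adding new fields later while older readers keep working.
#[derive(Serialize, Deserialize)]
struct HistoryEntry {
    cmd: String,
    when: i64,
    #[serde(default, skip_serializing_if = "Vec::is_empty")]
    paths: Vec<String>,
}

fn roundtrip() -> serde_json::Result<()> {
    let entry = HistoryEntry {
        cmd: "echo '{foo: 123}'".to_string(), // a value the old pseudo-YAML mishandles
        when: 1724300000,
        paths: Vec::new(),
    };
    let json = serde_json::to_string(&entry)?;
    let parsed: HistoryEntry = serde_json::from_str(&json)?;
    assert_eq!(parsed.cmd, entry.cmd);
    Ok(())
}
```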


History is automatically rewritten with a .jsonseq extension. Unfortunately, there is no official file extension for the format: https://www.iana.org/assignments/media-types/application/json-seq.

@kjnsn marked this pull request as ready for review August 25, 2024 09:42
@mqudsi (Contributor) commented Aug 27, 2024

Ideally we would have had a discussion about the formats before you embarked on this impressive PR, but here we are! I think that of the standardized, human-readable formats, a variant of JSON is a decent first choice despite its verbosity and repetitiveness (though there are a number of derivatives to explore that don't suffer from that bloat), but I have in the past expressed, if not a preference for SQLite, then at least a preference for exploring it as an option.

SQLite of course doesn't have the human readability of JSON, but it's such a universal binary format that it's recommended by the Library of Congress. It doesn't just solve our need for serializing history; it also removes basically all the supporting code we have in place to manage memory maps, make sure history records are written atomically, avoid record corruption, perform lookups based on various preconditions, create "append-only views" that are essentially SQL transactions, and everything else we do in code that touches the history file.

As for portability, I would argue that SQLite is more portable than any custom serialization format, even one based on JSON, since it self-describes not only the individual records but also the relationships between them and more.

@kjnsn (Contributor, Author) commented Aug 28, 2024

Ideally we would have had a discussion about the formats before you embarked on this impressive PR, but here we are!

I was stuck in bed with COVID and had time to kill; it was a good distraction. I also don't think it would be extremely difficult to change this PR to use something like JSON Lines instead, and serde can be adapted pretty easily.

As for portability, I would argue that SQLite is more portable than any custom serialization format, even one based on JSON, since it self-describes not only the individual records but also the relationships between them and more.

I don't have strong opinions here honestly, but I do have a preference for plain text (keeping in line with the Unix/GNU philosophy). I've tried to do a small refactor in this PR as well, with a goal of separating the low-level history storage system from the searching/when-to-save logic.

I'm wondering if the performance issues would be minimized with this change, as reading a bunch of JSON records is much faster than YAML parsing, and the serde library has gone through a bunch of optimisations.

@kjnsn (Contributor, Author) commented Aug 28, 2024

I wonder if JSON-SEQ would actually be safer on NFS than SQLite, as concurrent writes would likely not corrupt the history file. At worst you would end up with a couple of lines of invalid data, but the rest of the history would be fine.

Another advantage of text: you can use things like rsync to merge history across multiple devices.

@mqudsi (Contributor) commented Aug 31, 2024

I hope you're feeling better after COVID!

I'm very sympathetic to the text-only argument, but I don't consider JSON (or YAML, for that matter) to be great for what is effectively tabular/columnar data. With the switch to JSON text sequences, I think the human readability arguments lose some of their steam because the history file now contains ASCII control characters, which are effectively binary in nature. You are suddenly limited in terms of what editors you can use to sanely view the file, and you cannot edit it without care (and only in certain, sane editors).

JSON (and YAML, as a JSON superset) is excellent for dynamically shaped data with multiple objects each with its own type/schema, such as the results of an API. It's extremely subpar for what is effectively a table of records having identical shape, because it is ridiculously verbose for that purpose (inflating the cost to store on disk as well as the cost to read or write the file for (de)serialization purposes). In fact, I would argue that even primitive CSV with a header row would be superior for this purpose, if it weren't for the fact that double quotes are so prevalent in a shell history and, ironically, are the only character you need to escape (rather than delimit) within a CSV record.

A schema-less encoding like protobuf, or at least one with a much more compact representation like msgpack, would be strictly better, if only they had text-based alternatives. And if the option is a binary codec like that, that's all the more reason to just use SQLite and be done with it (though I would prefer not to have a dependency even on that, to tell the truth).

(I don't want to get into commenting about the specifics of your PR at this time, but I do want to point out that you're doing a lot of repeat allocations with the vectors in order to avoid writing a custom serde visitor/(de)serializer for the struct members. Additionally, from what I understand, the recommendation has been to not use the full-blown serde+serde_json crates if you don't need to, and there has been a push to get people to use simd_json instead for markedly better performance.)

Anyway, I would say that if we are not doing something as radical as shifting to a binary serialization format or SQLite, I don't see a point in switching to JSON at all. The underlying motivation was largely a question of performance, but a proper (de)serializer for the bespoke history format (or YAML in general) would solve that altogether, replacing the string-munging hacks we use for productivity at the cost of heavy allocations. It wouldn't need to be built on serde at all (though it could be, at least in the interim); the pros of using serde would be making it easier to add support for new fields, while the cons would be longer compile times and more dependencies.

Either way, I think a (de)serialization solution would have to be a bit lower level than what you are currently using (not to take away from how awesome it was that you were able to rewrite the history format and preserve backwards compatibility in such a short timespan), and in that case one might as well just keep the current format rather than use JSON.

@kjnsn (Contributor, Author) commented Sep 1, 2024

With the switch to JSON text sequences, I think the human readability arguments lose some of their steam because the history file now contains ASCII control characters, which are effectively binary in nature. You are suddenly limited in terms of what editors you can use to sanely view the file, and you cannot edit it without care (and only in certain, sane editors).

Yep, correct. The only reason I chose JSON-SEQ over JSON Lines is that one has an RFC and the other doesn't. It's reasonable to switch to JSON Lines for the reasons listed though, acknowledging that the spec isn't as formal.

(I don't want to get into commenting about the specifics of your PR at this time, but I do want to point out that you're doing a lot of repeat allocations with the vectors in order to avoid writing a custom serde visitor/(de)serializer for the struct members.

Yeah, I'm aware. I didn't want to spend too much time fixing those issues without first having a discussion about whether this is a worthwhile change.

Anyway, I would say that if we are not doing something as radical as shifting to a binary serialization format or SQLite, I don't see a point in switching to JSON at all. The underlying motivation was largely a question of performance...

I don't know if I agree. How fast is fast enough? You mention using SIMD to deserialize JSON. You can keep optimising until you're hand-writing assembly for specific architectures, but even Rust just delegates that to the C library in most cases (slice comparison, for example).

My goal with this PR was to increase maintainability. My definition of maintainability is "how easy is it for someone new to the codebase to extend functionality and fix bugs". I love fish, and I believe that increasing maintainability will mean that fish is still around in 10-20+ years. My understanding is that the Rust rewrite and the original history format discussion were about improving maintainability, with performance largely secondary.

Regarding performance, SQLite is actually slower than just writing directly to disk (https://www.sqlite.org/fasterthanfs.html), and it looks like serde_json is comparable in performance to other libraries: https://github.com/serde-rs/json-benchmark. Keep in mind that these can easily read & write over 500 MB per second, which is absolutely more than good enough for a shell history (unless I'm really missing something).

Ultimately, if the consensus from people who've been developing fish for a long time is that this change isn't worth it given all considerations, then that's fine! I do, however, think it's important to set some explicit goals, like "99.9% of history searches complete in under X ms for a history no larger than 100k entries" and/or "custom serialisation/deserialisation logic is something we want to avoid if possible".

@kjnsn (Contributor, Author) commented Sep 1, 2024

SQLite (unfortunately) isn't a silver bullet for network filesystems either. See atuinsh/atuin#2356.

Regarding plain text (whichever format it is): it means users can write scripts to import from other shells even if we don't support it. Having a one-line sed/awk/jq command to copy across a history is an amazing thing, and I don't know if we would get that with SQLite. I also don't know if we can do it currently, because our YAML implementation uses a custom spec. This was raised with atuin as well: atuinsh/atuin#2345.

I would prefer to support user-friendliness over speed, especially since the slogan and primary design goal is user-friendliness.

@faho (Member) commented Sep 1, 2024

The underlying motivation was largely a question of performance

The underlying motivation of #6493 was to fix bugs and make it easier to extend, e.g. add new fields, by using a well-specified format that gets us out of writing our own parser. This would also enable external tools to read our history without bespoke code. JSON fits that bill nicely (jq exists); SQLite would be okay-ish (the command-line tool requires you to write SQL, which is a pretty big context switch; otherwise you need to link it). Other binary formats would have issues with being parseable by others.

I'm going to repeat something I've said about performance a lot: If you are going to bring up performance as a counter to something, I would like to request a motivating example.

That means I want to see a case where json is prohibitively or at least annoyingly slow.

To be frank, I do not believe the current history format is a bottleneck, and I see no reason to believe a well-written json library will be either.

the cons would be longer compile times and more dependencies

Quick compile-time tests show a clean debug build on this old laptop of mine (the new one broke down yesterday) going from 37s to 44s. That's not great, but I would not call it prohibitive.

The dependencies seem to mostly be amortized - Cargo.lock shows 5 more packages. We are already using most of what serde depends on.

Not that I'm a lover of serde - I still remember when they sneakily added a binary blob, so I don't really trust it. I would be open to another JSON library.

@kjnsn (Contributor, Author) commented Sep 1, 2024

I'm going to repeat something I've said about performance a lot: If you are going to bring up performance as a counter to something, I would like to request a motivating example.

That means I want to see a case where json is prohibitively or at least annoyingly slow.

To be frank, I do not believe the current history format is a bottleneck, and I see no reason to believe a well-written json library will be either.

I had a quick look through the list of bugs that contained the word "history". The vast majority concerned correctness, history file corruption, and usability. Regarding performance, I could only find a single issue that predates the Rust port (out of >1,000): #4650. This evidence leads me to believe performance is not a problem for history.

Not that I'm a lover of serde, I still remember when they sneakily added a binary blob and so I don't really trust it. I would be open to another json library.

I personally don't care which library is used, but I do believe that rolling a custom JSON library isn't the right thing to do. I'm happy to switch to another library if desired :)

@ridiculousfish (Member) commented Sep 1, 2024

Thank you to @kjnsn for tackling this! To be sure, this is something that would go in after the upcoming 4.0 release.

I too have gone back and forth between SQLite and JSON Lines / RFC 7464. I prototyped with SQLite a few years back and hit some discouraging speed bumps.

None of the issues below are unsolvable: SQLite is possible. However, I no longer believe it will result in major simplification: I think fish would still be on the hook for globbing, case folding, and at least some file locking, and we may end up with an awkward schema. That's what's pushing me away from it and towards JSON Lines, which seems like the best of the bad options.

Anyways the specific issues I encountered:

  • The SQLite C API is pretty raw and awkward to use directly. We'd probably use some wrapper library, or an ORM. So it wouldn't just be SQLite - it would be SQLite and some other (TBD) library.

  • Case folding and globs:

    • SQLite's case-insensitive search only does basic ASCII case folding (see "7: Collating Sequences"). We would either have to mess with the ICU extension, or install our own functions, or just live with this (new) limitation.
    • SQLite doesn't support case-insensitive globs - the best option is to convert everything to lowercase and glob on that.
    • SQLite globs aren't perfectly compatible with fish, as they support "?" and fish has dropped that.
  • Network filesystems:

    • SQLite basically says don't use NFS.
    • If SQLite uses lockfiles (e.g. on NFS) then a crash could cause the history file to fail to open until the lockfile is removed. Probably fish itself would have to be prepared for this.
  • Readers block writers:

    • In fish, when you hit up arrow, it starts a lazy history search. That is, fish searches the history file until it finds a match, and returns that; the search can then be resumed (say, hitting up arrow again). But this lazy search is annoying in SQLite, because while a SQLite query is "open" (not fully consumed) then it prevents writes to the database. So we would have to implement our own pagination so we don't hold queries open; otherwise if you press up arrow in one session and forget about it, it will block all others.

    • WAL mode mitigates this somewhat, but still documents:

      Thus a long-running read transaction can prevent a checkpointer from making progress. But presumably every read transaction will eventually end and the checkpointer will be able to continue.

      This presumption is NOT true in fish today, and would need to be made true.

    • Also WAL just doesn't work over NFS, full stop. Shared-memory shenanigans, you know how it goes.

  • Indexing is inefficient space-wise:

    • If you want to be able to efficiently answer a question like "has this command been run before" then you need to index the table of command lines, which doubles its on-disk size. This isn't a deal breaker but it's still annoying.
    • "Without rowid" tables prevent this growth but have many other drawbacks, including the inability to refer to a command line by id (only by full text).
    • Hash-based indexes can help but there's no direct support for these; we'd have to roll our own (i.e. do the hashing ourselves and index the hash column).
    • We could also just use SQLite with a dumb "flat" schema: one row per command, with linear searching, similar to how the history works today. But then we aren't getting much out of it!

@mqudsi (Contributor) commented Sep 2, 2024

I'm aware of the IPC issues we'd run into with SQLite (having tried to abstract over them in rusqlite upstream, had some discussions with the SQLite authors about shared-memory mode, etc.) and am far from convinced it would magically suit our purposes.

But I'm not convinced switching to JSON here really solves anything. The breaking change to the history format doesn't buy us anything because we can accomplish pretty much the same by just using a serde yaml crate to do the serialization or deserialization for us, no? I'm just advocating a more conservative approach of only making breaking changes where we net clear wins. And my concerns about JSON text sequences, with their binary/control characters, mean I would hope to see very definitive advantages to adopting the format besides it just being a better/cleaner RFC.

To address some earlier points/questions that were raised in response to my comment above: I am certainly not recommending that we write a custom parser and serializer/deserializer for JSON (that's what we have now for YAML). What I was suggesting is that we should use our own implementation of the serde Visitor trait, or manually guide a Serialize/Deserialize trait implementation, rather than shoe-horn everything through the easiest default approaches that require vector allocations and repeated collection reads/writes for each and every record. There is a world of difference between manually writing a serde Serialize impl and writing your own serializer altogether, and the tradeoffs in terms of maintainability, required effort, etc. vs. performance are firmly in favor of doing so.
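
To illustrate the kind of thing I mean (a rough sketch only; HistoryItem and its fields are stand-ins, not the actual types in this PR):

```rust
use serde::ser::{Serialize, SerializeStruct, Serializer};

struct HistoryItem {
    cmd: String,
    when: i64,
    required_paths: Vec<String>,
}

// Borrow-only wrapper that streams the paths instead of building a temporary
// Vec just for serialization.
struct Paths<'a>(&'a [String]);

impl Serialize for Paths<'_> {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        serializer.collect_seq(self.0.iter().map(String::as_str))
    }
}

impl Serialize for HistoryItem {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        let mut s = serializer.serialize_struct("HistoryItem", 3)?;
        s.serialize_field("cmd", &self.cmd)?;
        s.serialize_field("when", &self.when)?;
        s.serialize_field("paths", &Paths(&self.required_paths))?;
        s.end()
    }
}
```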

The issues you'd find searching on GitHub are the correctness ones I'd alluded to when talking about SQLite; those would be the biggest wins I'd (ideally) hope to solve with a change to the history format (though it might not be possible, as I hinted and as ridiculousfish elaborated). My reason for mentioning performance is because that's what we, as core fish devs, have seen from actually profiling the code and identifying the hotspots, which isn't something you'd expect random users of the shell to notice and open issues about.

(I would appreciate it if at least some credence were lent to the fact that we're bringing up these concerns based on what we've seen working intimately with fish for years, rather than just raising random points because they would be nice to have, though I understand that seeing it from the other perspective might be hard because we've obviously not documented our concerns, or even necessarily voiced them publicly, outside of commit messages no one is expected to read.)

@faho (Member) commented Sep 3, 2024

But I'm not convinced switching to JSON here really solves anything. The breaking change to the history format doesn't buy us anything because we can accomplish pretty much the same by just using a serde yaml crate

The big win for a JSON dialect is that jq exists and is widely used. This includes jsonseq, because jq has support for it.

There are tools for yaml, but they aren't as popular (e.g. the archlinux package statistics show jq on ~70% of systems, the comparable yq on ~10%). SQLite has sql, which is a pretty crappy language and feels like a weird fit.

And a line-based format (including jsonseq at least as it is used here) is easily greppable for quick tasks.

There is also the question of whether we could switch to real YAML, given the various idiosyncrasies of our current format. Is it close enough to a YAML subset that it would even work in all cases? And in what way would that improve things - would it enable an old fish to read new history?

my reason for mentioning performance is because that's what we, as core fish devs, have seen from actually profiling the code and identifying the hotspots

I am well aware that you have profiled the history code and made some improvements, and that's cool. Thanks!

But what I am saying is that I haven't, in practice, noticed history reading to be a bottleneck anywhere. Searching history works and is snappy, even opening the ctrl-r history pager and just smashing through pages feels instantaneous, even on weaker machines.

So what I would like to have is a use case where the performance is too slow. Some form of budget for how long an operation can take before it is visible to the user. A goal.

Benchmarking this vs master, I see that echo $history is ~10% slower (on the order of 200ms vs 220ms). But that is also printing the entirety of my 36k entry history. And I quite simply don't see how that difference matters to anyone.

What is more concerning is that that 20ms difference seems to be constant - echo $history[1] takes ~30ms vs ~50ms. But that would be something to try to improve in this PR, not something that would seem like it requires a different format. It also seems to only happen via fish -c, I can't get it to happen in interactive use.

I am not saying that performance never matters, what I am saying is I want to know if it matters in this case.

@kjnsn (Contributor, Author) commented Sep 4, 2024

I just checked, and fish's YAML format is indeed broken. My history file cannot be parsed by anything other than fish. The issue I encountered came from this line (paraphrased):

- cmd: echo '{foo: 123}'

So just switching to a third-party serde YAML library would mean that existing history files are unparseable, and if we did serialize valid YAML, it would break other running fish processes (since the file is mmapped).

So this is my logic:

  1. The history format is broken
  2. We want the history format to not be broken, for lots of reasons (adding new fields, reducing bugs, parse-ability by external tools)
  3. Changing the format into anything else, including valid YAML, is a backwards-incompatible change
  4. Since the change is backwards-incompatible, why not use a "better" format (for any definition of "better")?

I'm wary of letting perfect be the enemy of good. I also genuinely don't care if JSON Lines is preferred over JSON-SEQ; I'm happy to make that change, or to switch to another format.

I have some theories as to where the extra 20ms might be coming from. If anyone has some concrete suggestions as to where the performance could be improved, I'm more than willing to take those suggestions on board. I really enjoy collaborating and learning from others :)

@kjnsn (Contributor, Author) commented Sep 5, 2024

Benchmarking this vs master, I see that echo $history is ~10% slower (on the order of 200ms vs 220ms). But that is also printing the entirety of my 36k entry history. And I quite simply don't see how that difference matters to anyone.

What is more concerning is that that 20ms difference seems to be constant - echo $history[1] takes ~30ms vs ~50ms. But that would be something to try to improve in this PR, not something that would seem like it requires a different format. It also seems to only happen via fish -c, I can't get it to happen in interactive use.

Did you compile with the release settings (cargo build -r)? Because I can replicate the exact same constant 20ms regression with an unoptimized debug build, but optimized it's fine :)

@faho (Member) commented Sep 8, 2024

Did you compile with the release settings (cargo build -r)?

Urgh, turns out naming your build directory "build-release" does not magically change it to a release build.

I can replicate the exact same constant 20ms regression with an unoptimized debug build, but optimized it's fine :)

There does still seem to be a constant difference, but now it's ~3-4ms instead of ~20ms.

So if you do count $history, you'll get 21ms vs 17ms, but if you do count $history[1..10] you get 8ms vs 5ms.

(including fish startup time, which is around 4.5ms here)

But given that we're in the milliseconds for a rarely used operation, and given that even the debug build at 10x slower was fast enough that I thought ctrl-r paging through the history was snappy, I really don't think this matters.

It would be cool to get that constant factor down - there's probably something in the initialization that takes a while, but it doesn't seem necessary.

@septatrix (Contributor) commented

Anyways the specific issues I encountered:

  • The SQLite C API is pretty raw and awkward to use directly. We'd probably use some wrapper library, or an ORM. So it wouldn't just be SQLite - it would be SQLite and some other (TBD) library.

I don't think this would be too bad. There is a good selection of lightweight query builders or ORMs for Rust.

  • Case folding and globs:
    • SQLite's case-insensitive search only does basic ASCII case folding (see "7: Collating Sequences"). We would either have to mess with the ICU extension, or install our own functions, or just live with this (new) limitation.
    • SQLite doesn't support case-insensitive globs - the best option is to convert everything to lowercase and glob on that.
    • SQLite globs aren't perfectly compatible with fish, as they support "?" and fish has dropped that.

Quite a while ago I played with this - I think while trying out Zig - and it was fairly easy to implement custom collations/functions. SQLite supporting a superset of fish globs is also not a problem. Though, as LIKE and glob() will always use the BINARY or NOCASE collation, it would be simplest to register a fish_glob() function and use that instead.
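
Roughly something like this (a sketch using rusqlite's create_scalar_function; fish_glob_matches is just a stand-in for fish's real matcher):

```rust
// Assumes rusqlite with its "functions" feature enabled.
use rusqlite::functions::FunctionFlags;
use rusqlite::{Connection, Result};

fn fish_glob_matches(pattern: &str, text: &str) -> bool {
    // Placeholder logic only - the real implementation would reuse fish's
    // case-insensitive wildcard code.
    text.to_lowercase().contains(&pattern.to_lowercase())
}

fn register_fish_glob(conn: &Connection) -> Result<()> {
    conn.create_scalar_function(
        "fish_glob",
        2,
        FunctionFlags::SQLITE_UTF8 | FunctionFlags::SQLITE_DETERMINISTIC,
        |ctx| {
            let pattern: String = ctx.get(0)?;
            let text: String = ctx.get(1)?;
            Ok(fish_glob_matches(&pattern, &text))
        },
    )
}

// Queries can then use: SELECT cmd FROM history WHERE fish_glob(?1, cmd);
```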

  • Network filesystems:
    • SQLite basically says don't use NFS.
    • If SQLite uses lockfiles (e.g. on NFS) then a crash could cause the history file to fail to open until the lockfile is removed. Probably fish itself would have to be prepared for this.

Even if we go with another implementation, locking on such network filesystems will be ugly/impossible. However, NFSv4 requires locking as part of the spec, and many CIFS servers also support locking. (Even then, it would be possible to couple this with custom locking or select the unix-dotfile VFS, which should work in even more cases.) What we gain is a decent amount of simplification regarding mmapping and locking for local filesystems.
Currently it also seems like the code assumes that history is strictly appended to and that failed locks are mostly ignored; however, when deleting history entries this is not the case.

Whichever way forward we choose, it will be very hard to beat the resilience of SQLite...

  • Readers block writers:
    • In fish, when you hit up arrow, it starts a lazy history search. That is, fish searches the history file until it finds a match, and returns that; the search can then be resumed (say, hitting up arrow again). But this lazy search is annoying in SQLite, because while a SQLite query is "open" (not fully consumed) then it prevents writes to the database. So we would have to implement our own pagination so we don't hold queries open; otherwise if you press up arrow in one session and forget about it, it will block all others.

    • WAL mode mitigates this somewhat, but still documents:

      Thus a long-running read transaction can prevent a checkpointer from making progress. But presumably every read transaction will eventually end and the checkpointer will be able to continue.

      This presumption is NOT true in fish today, and would need to be made true.

    • Also WAL just doesn't work over NFS, full stop. Shared-memory shenanigans, you know how it goes.

I am not sure how the lazy search currently works, but if I start a search in one shell and delete an old entry from another shell, wouldn't that result in a torn read? Or is the whole history loaded directly into memory?
Regardless, pagination is quite easy to implement as long as one has some strictly monotonically increasing data (e.g. when, or a row id). Simply add AND rowid < $lowest_rowid_from_before to subsequent queries - no need to keep the session open.
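
For example (a rough sketch with rusqlite; the history table and cmd column are made-up names):

```rust
use rusqlite::{params, Connection, Result};

/// Fetch the next page of history entries older than `before_rowid`, newest
/// first. Each call opens and fully consumes its own statement, so no read
/// transaction is held open between key presses.
fn next_page(conn: &Connection, before_rowid: i64, page_size: u32) -> Result<Vec<(i64, String)>> {
    let mut stmt = conn.prepare(
        "SELECT rowid, cmd FROM history WHERE rowid < ?1 ORDER BY rowid DESC LIMIT ?2",
    )?;
    let rows = stmt.query_map(params![before_rowid, page_size], |row| {
        Ok((row.get(0)?, row.get(1)?))
    })?;
    rows.collect()
}

// The caller remembers the smallest rowid it has seen and passes it as
// `before_rowid` on the next call, starting from i64::MAX.
```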

  • Indexing is inefficient space-wise:
    • If you want to be able to efficiently answer a question like "has this command been run before" then you need to index the table of command lines, which doubles its on-disk size. This isn't a deal breaker but it's still annoying.
    • "Without rowid" tables prevent this growth but have many other drawbacks, including the inability to refer to a command line by id (only by full text).
    • Hash-based indexes can help but there's no direct support for these; we'd have to roll our own (i.e. do the hashing ourselves and index the hash column).
    • We could also just use SQLite with a dumb "flat" schema: one row per command, with linear searching, similar to how the history works today. But then we aren't getting much out of it!

As faho pointed out before, features/extensibility, not performance, should be the more important factor. Still, if we want performance, it is likely a lot easier to achieve it using SQLite than with some homemade text-based serialization.

Another advantage of SQLite would be extensibility. In the future, the schema could be expanded to allow extensions to add auxiliary data like the return code or runtime of a command to another table and refer to the original command via foreign keys.
