JSON Format for History #10680
Conversation
Ideally we would have had a discussion about the formats before you embarked on this impressive PR, but here we are! I think that, of the standardized, human-readable formats, a variant of JSON is a decent first choice despite its verbosity and repetitiveness (though there are a number of derivatives worth exploring that don't suffer from that bloat), but I have in the past expressed if not a preference for then at least an interest in exploring SQLite as an option.

SQLite of course doesn't have the human readability of JSON, but it's such a universal binary format that it's recommended by the Library of Congress, and it doesn't just solve our need for serializing history: it also removes basically all the supporting code we have in place to manage memory maps, make sure history records are written atomically, avoid record corruption, perform lookups based off of various preconditions, create "append-only views" that are essentially SQL transactions, and everything else we do in code that touches the history file. As for portability, I would argue that SQLite is more portable than any custom serialization format, even one based off of JSON, since it self-describes not only the individual records but also the relationships between them and more.
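For concreteness, here is a rough, hypothetical sketch (not fish code, and not a proposal for the actual schema) of what an SQLite-backed history could look like using the rusqlite crate; the table and column names are invented for illustration:

```rust
use rusqlite::{params, Connection, Result};

fn open_history(path: &str) -> Result<Connection> {
    let conn = Connection::open(path)?;
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS history (
             id        INTEGER PRIMARY KEY,
             cmd       TEXT NOT NULL,
             timestamp INTEGER NOT NULL
         );",
    )?;
    Ok(conn)
}

fn append_history_item(conn: &mut Connection, cmd: &str, timestamp: i64) -> Result<()> {
    // Atomicity and crash safety come from the database's transaction
    // machinery rather than from hand-rolled mmap/atomic-write/locking code.
    let tx = conn.transaction()?;
    tx.execute(
        "INSERT INTO history (cmd, timestamp) VALUES (?1, ?2)",
        params![cmd, timestamp],
    )?;
    tx.commit()
}
```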
I was stuck in bed with covid and had time to kill, it was a good distraction. I also don't think it would be extremely difficult to change this PR to use something like JSON-LD instead, and serde can be adapted pretty easily.
I don't have strong opinions here honestly, but I do have a preference for plain text (keeping in line with the unix/gnu philosophy here). I've tried to do a small refactor in this PR as well, potentially with the goal of separating the low-level history storage system from the searching/when-to-save logic. I'm wondering if the performance issues would be minimized with this change, as reading a bunch of JSON records is much faster than YAML parsing, and the serde library has gone through a bunch of optimisations.
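To illustrate the kind of separation meant here, a minimal sketch; the trait and names are hypothetical, not fish's actual API:

```rust
use std::io;

/// Low-level storage: only knows how to append and iterate raw records;
/// the on-disk format (YAML, JSON sequences, SQLite, ...) hides behind it.
trait HistoryStore {
    fn append(&mut self, command: &str, timestamp: u64) -> io::Result<()>;
    fn iter(&self) -> Box<dyn Iterator<Item = (String, u64)> + '_>;
}

/// Search and when-to-save logic works against the trait, so it never
/// needs to change when the storage format does.
fn most_recent_match(store: &dyn HistoryStore, needle: &str) -> Option<String> {
    store
        .iter()
        .filter(|(cmd, _)| cmd.contains(needle))
        .max_by_key(|&(_, when)| when)
        .map(|(cmd, _)| cmd)
}
```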
I wonder if JSON-SEQ would actually be safer on NFS than sqlite, as concurrent writes would likely not corrupt the history file. At worst you would end up with a couple of lines of invalid data, but the rest of the history is fine. Another advantage of text: you can use things like rsync to merge history across multiple devices.
I hope you're feeling better after COVID! I'm very sympathetic to the text-only argument, but I don't consider JSON (or YAML, for that matter) to be great for what is effectively tabular/columnar data. With the switch to JSON text sequences, I think the human readability arguments lose some of their steam because the history file now contains ASCII control characters, which are effectively binary in nature. You are suddenly limited in terms of what editors you can use to sanely view the file, and you cannot edit it without care (and only in certain, sane editors).

JSON (and YAML, as a JSON superset) are excellent for dynamically shaped data with multiple objects each with its own type/schema, such as the results of an API. It's extremely subpar for what is effectively a table of records having identical shape, because it is ridiculously verbose for that purpose (inflating the cost to store on disk as well as the cost to read or write the file for (de)serialization purposes). In fact, I would argue that even primitive CSV with a header row would be superior for this purpose, if it weren't for the fact that double quotes are so prevalent in a shell history and, ironically, are the only character you need to escape (rather than delimit) within a CSV record. A schema-less encoding like protobuf, or at least one with a much more compact representation like msgpack, would be strictly better, if they only had text-based alternatives. And if the option is a binary codec like that, that's all the more reason to just use SQLite and be done with it (though I would prefer not to have a dependency even on that, to tell the truth).

(I don't want to get into commenting about the specifics of your PR at this time, but I do want to point out that you're doing a lot of repeat allocations with the vectors in order to avoid writing a custom serde visitor/(de)serializer for the struct members. Additionally, from what I understand the new recommendation has been to actually not use the full-blown serde+serde_json crates if you don't need to, and there has been a push to get people to use lighter alternatives.)

Anyway, I would say that if we are not doing something as radical as shifting to a binary serialization format or SQLite, I don't see a point in switching to JSON at all. The underlying motivation was largely a question of performance, but a proper (de)serializer for the bespoke history format (or YAML in general) would solve that altogether, instead of the string munging hacks we use for productivity at the cost of heavy allocations. It wouldn't need to be built on serde at all (but could be, at least in the interim); the pros of using serde would be making it easier to add support for new fields, while the cons would be longer compile times and more dependencies. Either way, I think a (de)serialization solution would have to be a bit lower level than what you are currently using (not to take away from how awesome it was that you were able to rewrite the history format and preserve backwards compatibility in such a short timespan), and in that case one might as well just keep the current format rather than use JSON.
Yep correct. The only reason I chose JSON-SEQ over JSON-LD is that one has an RFC and the other doesn't. It's reasonable to switch to JSON-LD for the reasons listed though, acknowledging that the spec isn't as formal.
Yeah I'm aware. I didn't want to spend too much time fixing those issues without having a discussion about whether this is a worthwhile change first.
I don't know if I agree. How fast is fast enough? You mention using SIMD to deserialize JSON. You can keep on optimising until you're hand-writing assembly for specific architectures, but even rust just delegates that to the c library in most cases (slice comparison, for example).

My goal with this PR was to increase maintainability. My definition of maintainability is "how easy is it for someone new to the codebase to extend functionality and fix bugs". I love fish and I believe that increasing maintainability will mean that fish is still around in 10-20+ years. My understanding is that the rust rewrite and the original history format discussion were about improving maintainability, and performance was largely secondary.

Regarding performance, sqlite is actually slower than just writing directly to disk (https://www.sqlite.org/fasterthanfs.html), and it looks like serde_json is comparable in performance to other libraries: https://github.com/serde-rs/json-benchmark. Keep in mind that these can easily read & write over 500 MB per second, which is absolutely more than good enough for a shell history (unless I'm really missing something).

Ultimately, if the consensus from people who've been developing fish for a long time is that this change isn't worth it given all considerations, then that's fine! I do however think it's important to set some explicit goals, like "99.9% of history searches complete in under X ms for a history no larger than 100k entries" and/or "custom serialisation/deserialisation logic is something we want to avoid if possible".
Sqlite (unfortunately) isn't a silver bullet for network filesystems either. See atuinsh/atuin#2356. Regarding plain text (whichever format it is), it means that users can write scripts to import from other shells even if we don't support it. Having a single-line sed/awk/jq command to copy across a history is an amazing thing to have, and I don't know if we would get that with sqlite. I also don't know if we can do it currently because our YAML implementation uses a custom spec. This was raised with atuin as well: atuinsh/atuin#2345. I would prefer to support user-friendliness over speed, especially since the slogan and primary design goal is user-friendliness.
The underlying motivation of #6493 was to fix bugs and make it easier to extend, e.g. add new fields, by using a well-specified format that gets us out of writing our own parser. This would also enable external tools to read our history without bespoke code. JSON fits that bill nicely (jq exists), sqlite would be okay-ish (the commandline tool requires you to write sql, which is a pretty big context-switch, otherwise you need to link it). Other binary formats would have issues with being parseable by others.

I'm going to repeat something I've said about performance a lot: if you are going to bring up performance as a counter to something, I would like to request a motivating example. That means I want to see a case where json is prohibitively or at least annoyingly slow. To be frank, I do not believe the current history format is a bottleneck, and I see no reason to believe a well-written json library will be either.
Quick compile time tests show a clean debug build on this old laptop of mine (the new one broke down yesterday) going from 37s to 44s. That's not great, but I would not call it prohibitive. The dependencies seem to mostly be amortized - Cargo.lock shows 5 more packages. We are already using most of what serde depends on. Not that I'm a lover of serde, I still remember when they sneakily added a binary blob and so I don't really trust it. I would be open to another json library.
I had a quick look through the list of bugs that contained the word "history". The vast majority concerned correctness, history file corruption, and usability. Regarding performance, I could only find one single issue that predates rust-fish (out of >1,000): #4650. This evidence leads me to believe performance is not a problem for history.
I personally don't care which library is used, but I do believe that rolling a custom JSON library isn't the right thing to do. I'm happy to switch to another library if desired :)
Thank you to @kjnsn for tackling this! To be sure, this is something that would go in after the upcoming 4.0 release.

I too have gone back and forth between SQLite and JSON Lines / RFC7464. I prototyped with SQLite a few years back and hit some discouraging speed bumps. None of these are unsolvable: SQLite is possible. However, I no longer believe it will result in major simplification: I think fish would still be on the hook for globbing, case folding, at least some file locking, and we may have an awkward schema. That's what's pushing me away from it and towards JSON Lines, which seems like the best of bad options. Anyways, the specific issues I encountered:
I'm aware of the IPC issues we'd run into with SQLite (having tried to abstract over them in rusqlite upstream, had some discussions with the SQLite authors about shared memory mode, etc.) and am far from convinced it would magically suit our purposes. But I'm not convinced switching to JSON here really solves anything. The breaking change to the history format doesn't buy us anything because we can accomplish pretty much the same by just using a serde yaml crate to do the serialization or deserialization for us, no? I'm just advocating a more conservative approach of only making breaking changes where we net clear wins. And my concerns about JSON seq with its binary/control characters mean I would hope to see very definitive advantages to adopting it besides it just being a better/cleaner RFC.

To address some earlier points/questions that were raised in response to my above comment: I am certainly not recommending that we write a custom parser and serializer/deserializer for JSON (that's what we have now for YAML). What I was suggesting is that we should certainly use our own implementation of the serde visitor/(de)serializer.

While the issues you'd find searching on GitHub would be the correctness ones I'd alluded to when I was talking about SQLite and saying how these would be the biggest wins I'd (ideally) hope to solve with a change to the history format (though it might not be possible, as I hinted and as rf elaborated), my reason for mentioning performance is that it's what we, as core fish devs, have seen from actually profiling the code and identifying the hotspots, which isn't something you'd expect random users of the shell to realize and open issues regarding. (I would appreciate it if at least some credence would be lent to the fact that we're bringing up these concerns based off of what we've seen working intimately with fish for years, rather than just bringing up random points because they would be nice to have, though I understand that seeing it from the other perspective might be hard because we've obviously not documented our concerns or even necessarily publicly voiced them outside of commit messages no one is expected to read.)
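As a concrete illustration of what a hand-written serde implementation could look like, a minimal sketch only; `HistoryItem` and its fields are hypothetical here, not fish's actual types:

```rust
use serde::ser::{Serialize, SerializeMap, Serializer};

struct HistoryItem {
    cmd: String,
    when: u64,
}

impl Serialize for HistoryItem {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        // Writing the (de)serializer by hand gives full control over what
        // gets allocated, instead of building temporary collections first.
        let mut map = serializer.serialize_map(Some(2))?;
        map.serialize_entry("cmd", &self.cmd)?;
        map.serialize_entry("when", &self.when)?;
        map.end()
    }
}
```

The deserialization side can be handled the same way with a custom `Visitor`, while derive remains an option for fields where the extra control isn't needed.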
The big win for a json dialect is that jq exists and is widely used. This includes jsonseq, because jq has support for it. There are tools for yaml, but they aren't as popular (e.g. the archlinux package statistics show jq on ~70% of systems, the comparable yq on ~10%). SQLite has sql, which is a pretty crappy language and feels like a weird fit. And a line-based format (including jsonseq, at least as it is used here) is easily greppable for quick tasks.

There is also the question of whether we could switch to real yaml given the various idiosyncrasies of our current format. Is it enough of a yaml subset that it would even work in all cases? And in what way would that improve things, would it enable an old fish to read new history?
I am well aware that you have profiled the history code and made some improvements, and that's cool. Thanks! But what I am saying is that I haven't, in practice, noticed history reading to be a bottleneck anywhere. Searching history works and is snappy, even opening the ctrl-r history pager and just smashing through pages feels instantaneous, even on weaker machines. So what I would like to have is a use case where the performance is too slow. Some form of budget of how long an operation can take before it is visible to the user. A goal.

Benchmarking this vs master, I see a difference of about 20ms. What is more concerning is that that 20ms difference seems to be constant. I am not saying that performance never matters, what I am saying is I want to know if it matters in this case.
I just checked, and fish's YAML format is indeed broken. My history file cannot be parsed by anything other than fish. The issue I encountered came from this line (paraphrased):
So just switching to a third-party library like serde yaml would mean that existing history files are unparseable, and that if we did serialize valid yaml, it would break other running fish processes (since the file is mmap-ed). So this is my logic:
I'm wary of letting perfect be the enemy of good. I also genuinely don't care if JSON-LD is preferred over JSON-SEQ; I'm happy to make that change, or to switch to another format. I have some theories as to where the extra 20ms might be coming from. If anyone has some concrete suggestions as to where the performance could be improved, I'm more than willing to take those suggestions on board. I really enjoy collaborating and learning from others :)
Did you compile with the release settings?
Urgh, turns out naming your build directory "build-release" does not magically change it to a release build.
There does still seem to be a constant difference, but now it's ~3-4ms instead of ~20ms. So a timed run lands in that range (including fish startup time, which is around 4.5ms here).

But given that we're in the milliseconds for a rarely used operation, and given that even the debug build at 10x slower was fast enough that I thought ctrl-r paging through the history was snappy, I really don't think this matters. It would be cool to get that constant factor down - there's probably something in the initialization that takes a while, but it doesn't seem necessary.
I don't think this would be too bad. There is a good selection of lightweight query builders or ORMs for Rust.
Quite a while ago I played with this - I think while trying out Zig - and it was fairly easy to implement custom collations/functions. And SQLite supporting a superset of fish globs is also not a problem. Though as
Even if we go with another implementation, the locking on such network filesystems will be ugly/impossible. However, NFSv4 requires locking as part of the spec, and many CIFS servers also have support for locking. (Even then it would be possible to couple this with custom locking or selecting the …)

Whichever way we choose going forward, it will be very hard to beat the resilience of SQLite...
I am not sure how the lazy search currently works, but if I start a search in one shell and delete an old entry from another shell, wouldn't that result in a torn read? Or is the whole history directly loaded into memory?
As pointed out by faho before, not performance but features/extensibility should be the more important factor. Still, if we want performance, it is likely a lot easier to achieve it using sqlite than with some homemade text-based serialization. Another advantage of SQLite would be extensibility. In the future the schema could be expanded to allow extensions to add auxiliary data, like the return code or runtime of a command, to another table and refer to the original command with foreign IDs.
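A hypothetical sketch of that extensibility point (table and column names invented for illustration, not a proposed schema): an auxiliary table holds the extra data and points back at the main history row by id.

```rust
use rusqlite::{Connection, Result};

// Extensions could add per-command metadata without touching the core
// history table; rows refer back to history(id) via a foreign key.
fn add_metadata_table(conn: &Connection) -> Result<()> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS history_metadata (
             history_id  INTEGER NOT NULL REFERENCES history(id),
             exit_status INTEGER,
             duration_ms INTEGER
         );",
    )
}
```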
This extends off the ideas in #6493, using RFC 7464 as the serialization format.
Motivation (from @ridiculousfish in #6493)
The original issue was raised in 2020, before rust based fish, and while there was general consensus that changing the history format would be beneficial, the change was parked and many things have changed in the four years since.
Goals
Address the issues mentioned by focusing on the following properties:
Solutions
Use a standard for the format, in this case RFC 7464. I chose this format as the RFC is seven pages long, and the motivations of the RFC align with the needs of the fish history format. To quote the RFC:
The format is designed with a specific use-case in mind, where there are a large number of small JSON items in a stream that may be written to by multiple programs at the same time. Each record is delimited by the ASCII record separator 0x1E and terminated by a LF. The RFC states that an incomplete record can be detected by the absence of a LF before a RS. This is important for a shell history file, as some situations may result in only a partial buffer being flushed to the file (such as an unexpected termination). The RFC also states that invalid records are not fatal, and should be ignored.

JSON is a well specified format, described in RFC 7159. Using JSON for history was generally considered a good idea in issue #6493.
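To make the framing concrete, here is a minimal sketch (not the PR's actual code) of writing and reading RFC 7464 records, skipping records that lack the terminating LF or contain invalid JSON:

```rust
use std::io::{self, Write};

const RS: u8 = 0x1e; // ASCII record separator
const LF: u8 = b'\n';

/// Frame one already-serialized JSON record per RFC 7464: RS, JSON text, LF.
fn write_record<W: Write>(mut out: W, json: &str) -> io::Result<()> {
    out.write_all(&[RS])?;
    out.write_all(json.as_bytes())?;
    out.write_all(&[LF])
}

/// Split a buffer into records, ignoring any that were truncated
/// (no trailing LF) or otherwise unparsable.
fn read_records(buf: &[u8]) -> Vec<serde_json::Value> {
    buf.split(|&b| b == RS)
        .filter(|chunk| chunk.last() == Some(&LF))
        .filter_map(|chunk| serde_json::from_slice(&chunk[..chunk.len() - 1]).ok())
        .collect()
}
```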
Use the serde_json crate. This library is probably the closest rust has to an official JSON library. This ensures (hopefully) fairly bug-free serialization/deserialization, and reduces the amount of custom formatting required.
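For example, a history entry could be modeled as a plain struct and (de)serialized through serde_json (a sketch only; the field names here are illustrative, not necessarily what the PR uses):

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct HistoryEntry {
    cmd: String,
    when: u64,
    #[serde(default, skip_serializing_if = "Vec::is_empty")]
    paths: Vec<String>,
}

fn main() -> serde_json::Result<()> {
    let entry = HistoryEntry {
        cmd: "git status".into(),
        when: 1_700_000_000,
        paths: vec![],
    };
    // One compact JSON object per history item.
    println!("{}", serde_json::to_string(&entry)?);

    let parsed: HistoryEntry = serde_json::from_str(r#"{"cmd":"ls","when":1}"#)?;
    assert_eq!(parsed.cmd, "ls");
    Ok(())
}
```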
History is automatically rewritten with a .jsonseq extension. Unfortunately, there is no official file extension for the format: https://www.iana.org/assignments/media-types/application/json-seq.