1. 58
    Nobody Gets Fired for Picking JSON, but Maybe They Should? api mcyoung.xyz
    1. 36

      I ran into numerical identifiers getting silently corrupted during roundtripping, and had to encode them as strings. It’s the worst of everything. Restrictive and under-specified at the same time.
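
      For what it’s worth, in Go the string workaround can at least be confined to the struct tags; a minimal sketch (the Record type and field names are just an illustration, not from my actual code):

        package main

        import (
            "encoding/json"
            "fmt"
        )

        // Record is a hypothetical payload whose numeric ID must survive
        // consumers that coerce every JSON number to float64.
        type Record struct {
            // The ",string" option makes encoding/json write and read the ID
            // as a JSON string, so it never hits the float64 precision cliff.
            ID int64 `json:"id,string"`
        }

        func main() {
            out, _ := json.Marshal(Record{ID: 1234567890123456789})
            fmt.Println(string(out)) // {"id":"1234567890123456789"}

            var back Record
            _ = json.Unmarshal(out, &back)
            fmt.Println(back.ID == 1234567890123456789) // true
        }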

      1. 12

        In the Cloud Native Buildpack ecosystem there’s a bug where you store data as TOML, but due to an implementation detail it’s persisted via JSON and then re-serialized into TOML. The round trip turns an int (which TOML has) into a float, so if you’re using a strongly typed language you end up with values that aren’t even the same type as what you put in.

        So we end up having to do the same: either make it a float up front or encode it as a string. The rest of the ecosystem is pretty great, but that’s a subtle but meaningful gotcha.
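
        For anyone wondering how the corruption sneaks in: a sketch of the failure mode in Go (the actual buildpack code path may differ), where the intermediate JSON hop decodes into untyped values:

          package main

          import (
              "encoding/json"
              "fmt"
          )

          func main() {
              // Decoding into interface{} turns every JSON number into float64,
              // no matter how it was originally written.
              var v map[string]interface{}
              _ = json.Unmarshal([]byte(`{"count": 3}`), &v)
              fmt.Printf("%T\n", v["count"]) // float64

              // Re-serializing from here writes a float where the original TOML had an int.
          }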

        1. 7

          We had this same issue when we first added a REST API to the (until then embedded-Java-only) Neo4j.

          What got really messy later on was realizing that some implementations of JSON do support real ints. Like, the .NET one, if I recall correctly, will happily round-trip i64s without precision loss… but then that just makes it even worse, because now some devs think that’s safe and fine, because it usually is in their ecosystem.

        2. 7

          Given you can’t have a JSON object with non-string keys, you end up dealing with this mismatch a lot.
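
          Go’s encoding/json, for instance, papers over that mismatch by stringifying integer map keys on the way out and re-parsing them on the way in; a small sketch:

            package main

            import (
                "encoding/json"
                "fmt"
            )

            func main() {
                // JSON object keys are always strings, so integer keys get stringified.
                b, _ := json.Marshal(map[int64]string{42: "answer"})
                fmt.Println(string(b)) // {"42":"answer"}

                // The round trip works, but only because the library re-parses the keys.
                var back map[int64]string
                _ = json.Unmarshal(b, &back)
                fmt.Println(back[42]) // answer
            }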

          1. 5

            This is bad, but a while back I discovered that somehow GraphQL finds a way to make it even worse; the only integer type is defined as 32 bits.

            1. 3

              Heh! Guilty as charged. I had a bug like that in my JSON parser; it corrupted very large integers, such as those used by tweet IDs back in the day: https://github.com/SBJson/SBJson/pull/171#issuecomment-19842731

              The bug wasn’t even in the parsing, exactly, but in my assumption that converting NSDecimalNumber to a long long integer would retain as much precision as possible. This turned out to be wrong.

              1. 2

                Likewise, we used it for a stock exchange data API, started with numbers for price data, and then switched to strings instead for the very reasons explained in the article. It made me sad!

              2. 15

                RFC 7493 (“Internet JSON”) addresses many of these points: https://www.rfc-editor.org/rfc/rfc7493

                Visa and Mastercard credit card numbers happen to fit in the “safe” range for binary64, which may lull you into a false sense of security…

                Suggesting that somebody would store a credit card number as a JSON number is certainly the weirdest thing I’ve read all week.

                1. [Comment removed by author]

                2. 21

                  I’m here for all the hate on json. It’s very expensive to pass around and serialize and deserialize. It’s the modern CSV, and like CSV it appeals to people because it seems so neutral and safe.

                  1. 5

                    Call me when there’s a viable alternative. I like the idea of protobuf, but I haven’t seen an implementation that isn’t a huge pain to work with, and precisely zero of my applications are bottlenecking anywhere near JSON de/serialization. 🤷‍♂️

                    1. 4

                      If you use JSON today, swapping with CBOR shouldn’t even be that hard, on paper. It’s the same kind of schemaless format. However, unlike JSON, you get actual binary blobs, a good spec, and integer/floats with specified precision. It’s also faster and more compact.
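
                      A minimal sketch of what the swap looks like in Go, assuming a third-party CBOR library such as github.com/fxamacker/cbor/v2 (there is no CBOR package in the standard library):

                        package main

                        import (
                            "fmt"

                            "github.com/fxamacker/cbor/v2" // assumed third-party dependency
                        )

                        type Event struct {
                            ID      uint64 `cbor:"id"`
                            Payload []byte `cbor:"payload"` // real binary, no base64 detour
                        }

                        func main() {
                            // Marshal/Unmarshal mirror encoding/json, so the switch is mostly
                            // a matter of changing the import and the struct tags.
                            b, _ := cbor.Marshal(Event{ID: 1 << 60, Payload: []byte{0xde, 0xad}})

                            var e Event
                            _ = cbor.Unmarshal(b, &e)
                            fmt.Printf("%+v\n", e)
                        }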

                      1. 1

                        There are a lot of alternatives. But then you have to persuade people to install a library.

                        Avro is in many ways the most JSON-like (it even has a JSON-based encoding). It’s just not that hard to do better.

                    2. 26

                      Most of this is lame. Number representation problems aren’t part of JSON; they’re the fault of JSON parsers that just convert all numbers to doubles, just because “that’s what JavaScript does”. It’s not much harder for the parser to pick an appropriate representation for each number, like using int64 unless there’s a decimal point.

                      Unicode has weirdnesses too, but what data formats go to the trouble of specifying in exhaustive detail exactly what forms of encoding and normalization to use? I don’t think this is a specific fault of JSON.

                      ProtoBuf just isn’t comparable. It’s a much more complex format that requires a schema definition to be parsed at all. It absolutely does not fit the same use cases as JSON.

                      1. 43

                        Nah, you don’t get to put the blame on parsers here. The problem is that JSON is under-specified. Parsers aren’t doing anything wrong; it’s perfectly in line with the spec to convert all numbers to doubles. Hell, considering the history of JSON as “JavaScript Object Notation”, converting all numbers to doubles was probably the original intention behind the spec.

                        1. 16

                          What is it about specs that leads to this kind of reasoning? “Either the spec is wrong, or the user is wrong, which is it?”

                          I’m not at retirement age, but I’ll pull the grumpy old man card and say “they’re both wrong.”

                          If the spec gives you a terrible option, you should criticize the spec for it. If you take the terrible option, you should also be criticized for it. There is more than enough blame for everyone to get a bit.

                          JSON absolutely has culpability here, because it doesn’t nail down what a compliant parser should do. The only thing approaching a defense is to say “this format was built to be used by javascript engines that can only natively represent doubles, the idea of using it elsewhere was a mistake.” But even if you went that route, why not just specify that all numbers are doubles? You’d make it much clearer what you could and couldn’t use the spec for.

                          1. 9

                            If you make a bad design decision in your parser and defend it with “the spec says I’m allowed to do this” it doesn’t suddenly make it not a bad design decision!

                          2. 4

                            I’m not sure how json is underspecified. json.org says “number”. It doesn’t say “float”.

                            The problem is that the parser output doesn’t match the parser input. i.e. print(parse(foo)) != foo

                            This seems like a pretty serious bug to me. Maybe I could understand the output order being changed, as json.org doesn’t specify ordering of fields. But to change the content of a field? Come on… that cannot possibly be a good decision.

                            1. 5

                              In JavaScript, “number” means double-precision floating point. In JavaScript, parseFloat(x).toString() != x can also be true. Hence, it can also be true for JavaScript Object Notation. What you are complaining about is perfectly in line with not only the text of the spec but also the original intention.

                              1. 4

                                That’s impossible with fixed precision numbers in general.

                                1. 3

                                  But… what is a “number”? It can’t be a number in the mathematical sense; the syntax is too constrained for that (it doesn’t allow the number 1/3, or pi, etc.).

                                  And it’s clearly not one of the “normal” numbers computers work with like ints or floats: the syntax is an arbitrary-precision decimal. Is that what you mean parsers should require as input and output?

                                  I can kind of get behind the purity of that, but it sure makes JSON a lot less convenient if you can’t say JSON.dumps({"myval": 1}).

                                  1. 3

                                    It says “number”, but it means “JavaScript number”.

                                    1. 2

                                      Not sure about that. From json.org:

                                      JSON is a text format that is completely language independent

                                      What does ECMA-404 say?

                                      JSON stands for JavaScript Object Notation and was inspired by the object literals of JavaScript aka ECMAScript as defined in the ECMAScript Language Specification, Third Edition. However, it does not attempt to impose ECMAScript’s internal data representations on other programming languages. Instead, it shares a small subset of ECMAScript’s syntax with all other programming languages. The JSON syntax is not a specification of a complete data interchange. Meaningful data interchange requires agreement between a producer and consumer on the semantics attached to a particular use of the JSON syntax. What JSON does provide is the syntactic framework to which such semantics can be attached

                                      JSON is agnostic about the semantics of numbers. In any programming language, there can be a variety of number types of various capacities and complements, fixed or floating, binary or decimal. That can make interchange between different programming languages difficult. JSON instead offers only the representation of numbers that humans use: a sequence of digits. All programming languages know how to make sense of digit sequences even if they disagree on internal representations. That is enough to allow interchange.

                                      It does not mean “javascript number”.

                                      1. 3

                                        Yeah, but neither of those includes any of the interop lessons that went into RFC 7159 or RFC 7493; in fact ECMA-404 explicitly says it does not support interoperability between independent implementations. My (implicit) point was that the (implicit) meaning of the early (bad) specifications, based on their historical origin and based on what was actually implemented, was in fact quite obvious to anyone who took notice of JSON as she is spoke. Anyone who takes the old JSON specs at face value and ignores the context is doing a bad job.

                                2. 13

                                  Number representation problems aren’t part of JSON; they’re the fault of JSON parsers that just convert all numbers to doubles, just because “that’s what JavaScript does”.

                                  That has been explicitly part of the JSON spec for ten years. Even before then, any JSON implementation that used numbers that don’t fit in float64 was asking for trouble because there have always been JSON parsers that only support float64. Causing interop problems by using wider numbers is a bug, a failure to be conservative in what you send.
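
                                  Concretely, the cliff sits at 2^53; a quick check:

                                    package main

                                    import "fmt"

                                    func main() {
                                        var maxSafe int64 = 1 << 53 // 9007199254740992: the largest integer float64 counts exactly
                                        fmt.Println(float64(maxSafe) == float64(maxSafe+1)) // true: the +1 is silently lost
                                    }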

                                  1. 3

                                    Yeah, we ran into this. Our C++ JSON parser would handle uint64 correctly. One of the consumers was JS though and would lose precision when converting the numbers to float64s (as expected), resulting in duplicated ID numbers with the JS client but not the C++ clients.

                                  2. 12

                                    ProtoBuf just isn’t comparable.

                                    I think it is! ProtoBuf is as good an implementation of a schema-full data format as JSON is an implementation of a schemaless one!

                                    1. 9

                                      I’ve got a long draft of a blog post that goes on and on about protobuf’s footguns in the same vein as this post. Sounds like you might have some material to contribute? ☠️

                                    2. 10

                                      Number representation problems aren’t part of JSON; they’re the fault of JSON parsers that just convert all numbers to doubles, just because “that’s what JavaScript does”. It’s not much harder for the parser to pick an appropriate representation for each number, like using int64 unless there’s a decimal point.

                                      100%. People love to give PHP shit for footguns, but the built-in JSON decoder will return an integer if the JSON value had no decimal point (and fits within an integer).

                                      But even more egregious than the “some decoders do stupid things” excuse is the “some end users do ridiculous things” excuse.

                                      Really? People are storing license keys, credit card numbers and the like as integers? I thought we had this discussion years ago about not storing phone numbers as integers, and people learnt the lesson?

                                      1. 6

                                        Number representation problems aren’t part of JSON; they’re the fault of JSON parsers that just convert all numbers to doubles, just because “that’s what JavaScript does”. It’s not much harder for the parser to pick an appropriate representation for each number, like using int64 unless there’s a decimal point.

                                        You can make a better parser, but now your JSON documents aren’t going to be compatible with the ecosystem! The original JSON use case was JavaScript web stuff, and back then JS didn’t even have integers; it only had doubles. So you had to stay within the limits of doubles. You just had to. This sucks, but we can’t expect the entire JSON ecosystem to just fix itself. Much better to use a format that has a specification which is strict about what parsers must support.

                                        1. 5

                                          It’s not much harder for the parser to pick an appropriate representation for each number, like using int64 unless there’s a decimal point.

                                          Even better, the parser could return the original string representation of a number and let the user decide how to interpret it. This works especially well in streaming pull parsers (as opposed to parsers that produce an in-memory representation of the entire document) where the user can just call the appropriate value accessor (value<uint64_t>(), value<double>(), etc) without having to store the string representation long-term.

                                          1. 3

                                            Yeah, the Go JSON parser added an option to do this a while ago.
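
                                            That option is (*json.Decoder).UseNumber; a quick sketch of using it to keep the original digits and pick the type yourself:

                                              package main

                                              import (
                                                  "encoding/json"
                                                  "fmt"
                                                  "strings"
                                              )

                                              func main() {
                                                  dec := json.NewDecoder(strings.NewReader(`{"id": 9007199254740993}`))
                                                  dec.UseNumber() // numbers arrive as json.Number (the raw text), not float64

                                                  var doc map[string]interface{}
                                                  _ = dec.Decode(&doc)

                                                  n := doc["id"].(json.Number)
                                                  id, err := n.Int64() // caller chooses: Int64, Float64, or big.Int via n.String()
                                                  fmt.Println(id, err) // 9007199254740993 <nil>
                                              }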

                                          2. 4

                                            There’s another fun aspect to this that we ran into at work: if you happen to use JSON and accept integers which you use in a database query, make sure that you get only integers and not floats, or even bignums (if sufficiently large).

                                            If you don’t ensure you have proper integers in an expected range, the float or bignum can cause your query to fall back to a full table scan. For instance: SELECT foo FROM large_table WHERE id=1.0 will not be able to use your carefully crafted primary key index because of the type mismatch (assuming id is a normal int or bigint). This allows attackers to slow your database down to a crawl simply by tweaking the JSON slightly.

                                            This only (or mostly?) affects dynamically typed languages like Python (or in our case, Clojure).

                                            1. 3

                                              Ah, JSON has no integer type. It’s always valid JSON to write 1 as 1.0.

                                              1. 1

                                                As pointed out in the OP blog post, the JSON spec does not mandate the representation and (de)serialization of numbers. Many parsers (for instance Python’s builtin and Clojure’s Cheshire) make a distinction between how it’s written and parse integers “as written” into actual integers, even bignums.

                                                1. 3

                                                  Sure, but now you’re relying on implementation-defined behavior, which may or may not be documented.

                                          3. 8

                                            There are other surprising pitfalls around strings: are “x” and “\u0078” the same string? RFC 8259 feels the need to call out that they are, for the purposes of checking whether object keys are equal.

                                            Unfortunately it doesn’t even go this far. It says that if your implementation chooses to act this way it will be interoperable with others that make the same decision. Useless.

                                            1. 32

                                              JSON is a wonderful example of superficial simplicity, where the simplicity is the result not of careful work to distill and model the problem, nor of a deliberately limited scope tailored to a large subset of problems. Instead, simplicity is achieved by externalizing difficult problems.

                                              Unfortunately in the case of JSON this kind of superficial simplicity worked very well and had a facile justification in the emerging world of Rich Internet Applications, as you can naively implement JSON quickly (or output it without a serializer using template languages).

                                              1. 4

                                                I think you’re describing WorseIsBetter. IIRC it was originally about engineers and academics thinking hard about the problems of safe memory access, interoperability, etc. whilst at the same time C (unsafe memory access) and UNIX (pipe strings around) managed to succeed, but have forced everyone to deal with segfaults/overflows/aliasing/etc. and ad-hoc parsing, respectively.

                                                1. 2

                                                  I think there’s some level of difference: worse-is-better is to me largely about a “git’r’done” approach where a more complex/highly considered approach was beaten to market by a “worse” product that became entrenched. JSON was not developed in that context. I don’t have anything to back this up, but it feels instead very much like a rebuttal against the widespread adoption of (and concurrent complaints about) XML.

                                                  1. 4

                                                    Historically, the emergence and rapid adoption of JSON was absolutely a reaction to XML with its excessive and byzantine complexity. I don’t disagree with the “superficial simplicity” characterization, but in my own experience it’s an improvement. I’d rather work with mediocre JSON data than mediocre XML data any day.

                                              2. 2

                                                The newer JSON RFCs are hamstrung by years of underspecification and the many existing implementations that do weird things in the murky parts of RFC 4627. Unfortunately they didn’t feel they could deem those existing implementations to be out of spec, so instead you have to read RFC 8259 in a special way that translates “interoperable” into SHOULD. Basically, follow RFC 7493.

                                                1. 5

                                                  Yep. Awful. Compare to the approach taken with SMTP, which clearly and arguably successfully obsoleted certain message syntaxes. 7493 is fine, but even there we see howlers like “A receiving implementation MAY treat two I-JSON messages as equivalent if they differ only in the order of the object members.” MAY?? Good grief.

                                                  1. 5

                                                    Yeah, good grief. Thanks for pointing that out, I knew there were still weaknesses in I-JSON but failed to remember exactly where.

                                              3. 8

                                                This means that your system probably can’t safely handle JSON documents with unknown fields.

                                                Like Protobuf handles unknown fields any better?

                                                If you’re sending me unknown fields, as in, they’re not in my published schema, I’m either ignoring them if I’m honouring Postel, or you’re getting a 400 Bad Request.

                                                I honestly can’t think of a reason why I would accept unknown fields you’re POSTing to my API.
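
                                                In Go that’s one method call on the decoder; a sketch of the 400 route (handler and payload names are made up):

                                                  package api

                                                  import (
                                                      "encoding/json"
                                                      "net/http"
                                                  )

                                                  func createUser(w http.ResponseWriter, r *http.Request) {
                                                      var req struct {
                                                          Name string `json:"name"`
                                                      }
                                                      dec := json.NewDecoder(r.Body)
                                                      dec.DisallowUnknownFields() // any field outside the schema becomes an error
                                                      if err := dec.Decode(&req); err != nil {
                                                          http.Error(w, "bad request: "+err.Error(), http.StatusBadRequest)
                                                          return
                                                      }
                                                      // ... handle req ...
                                                  }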

                                                And if JSON is so bad, you need to ask yourself why it’s so ubiquitous.

                                                Also, streaming parsers for JSON exist. I can’t speak to their implementation, but I’ve seen them.

                                                1. 7

                                                  And if JSON is so bad, you need to ask yourself why it’s so ubiquitous.

                                                  That’s really never been a good argument. It’s popular, like many other things, because it’s the easiest format for the first few hours/days (especially when javascript is involved). By the time the (numerous) downsides become more apparent, it’s too late to change your service/data/protocol.

                                                  1. 6

                                                    Like Protobuf handles unknown fields any better?

                                                    What’s the problem with protobuf unknown fields? I checked the official Dart, Java, Python, C++ implementations, they all handle unknown fields.

                                                    I honestly can’t think of a reason why I would accept unknown fields you’re POSTing to my API.

                                                    You probably shouldn’t. The protobuf guide says:

                                                    In general, public APIs should drop unknown fields on server-side to prevent security attack via unknown fields. For example, garbage unknown fields may cause a server to fail when it starts to use them as new fields in the future.

                                                      1. 1

                                                        Yeah, that’s why I mentioned ignoring unknown fields. But generally in a distributed system I’d use a schema registry like Apicurio so that when a publisher uses a new version of a schema, consumers can pull it down as needed.

                                                        1. 1

                                                          Schema registries are nice (I’ve used buf), but they don’t solve the fundamental problem that old consumers will run at the same time as updated consumers.

                                                          Ingress services can do what they want with unknown fields, but middleboxes need to pass them unmolested.

                                                    1. 7

                                                      For a language that is so ‘deeply flawed’ it is doing quite well. With any encoding/tool/language one needs to understand its limitations and make decisions appropriately. For 97.5% of tasks it is very suitable.

                                                      1. 5

                                                        As part of a post I wrote on working with binary data, I saved the following timestamps from talks by Joe Armstrong and Martin Thompson on why you shouldn’t use JSON:

                                                        I think Joe and Martin have slightly different arguments, but would probably both say “yes” to the question in the title.

                                                        1. 5

                                                          What everyone winds up doing, one way or another, is to rely on base64 encoding. […] This has the unfortunate side-effect of defeating JSON’s human-readable property: if the blob contains mostly ASCII, a human reader can’t tell.

                                                          Raise your hand if you see something that looks like base64 and you instinctively check whether it starts with eyJ, the telltale base64 of a {" at the start of a JSON object.

                                                          1. 5

                                                            I don’t understand the point about streaming; you can stream JSON just fine. We support parsing JSON in chunks in the Dart standard library.

                                                            1. 2

                                                              JSON streaming works fine, but it needs a minimal amount of buy-in on the protocol end, which many popular protocols (cough JSON-RPC cough) manage to fumble. So many APIs have {"type": "Name", "data": {}}… totally unstreamable.

                                                              1. 1

                                                                While that is possible, you may encounter pretty basic parse errors later. E.g. an object might not be properly closed, etc.

                                                                If you mitigate against that, you can certainly parse JSON in chunks.

                                                                1. 1

                                                                  In my experience streaming isn’t well supported in the json library world. I’m getting flashbacks to dealing with >10GB objects when trying to recover a corrupt Cassandra database – their recovery tools would generate files that other tools in their suite could not consume.
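
                                                                  For what it’s worth, Go’s encoding/json can consume a huge top-level array incrementally; a sketch of the documented Decoder pattern:

                                                                    package main

                                                                    import (
                                                                        "encoding/json"
                                                                        "fmt"
                                                                        "strings"
                                                                    )

                                                                    func main() {
                                                                        r := strings.NewReader(`[{"id":1},{"id":2},{"id":3}]`) // imagine this being 10GB
                                                                        dec := json.NewDecoder(r)

                                                                        dec.Token() // consume the opening '['
                                                                        for dec.More() {
                                                                            var row struct {
                                                                                ID int `json:"id"`
                                                                            }
                                                                            if err := dec.Decode(&row); err != nil {
                                                                                panic(err)
                                                                            }
                                                                            fmt.Println(row.ID) // process and drop each element
                                                                        }
                                                                        dec.Token() // consume the closing ']'
                                                                    }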

                                                                2. 4

                                                                  It’s always “compared to what” and “for what purpose”.

                                                                  And to all those people who complain about JSON verbosity compared to MessagePack. Go on msgpack.org, click on “Try!” and type in 1.2 or [1.1,2.2,3.4,4.4,5.5,6.6].

                                                                  I am a huge fan of Protobuf and some of its competitors, but comparing them with JSON is like saying wget or curl are better than Firefox or Chrome or vice versa. They are both HTTP[1] clients, yet their goals, intentions and applicable use cases vary a lot.

                                                                  The same is true for compression formats and algorithms, even things like encryption and so on. Different use cases require different properties. The idea of there just being one Turing-complete programming language is just as silly.

                                                                  Why would this be different for data serialization?

                                                                  Look at your use case and choose the appropriate tool.

                                                                  Take a deep dive into XML, including CDATA, handling truthiness in a million different ways, binary data, XML being UTF-8 or UTF-16, or neither, attributes vs children vs tags, the fact that fast XML parsers still have security issues on a regular basis, verbose tag closing or tags that close themselves, and you’ll find JSON might not be a much better fit - or not.

                                                                  This of course isn’t meant to be pro-JSON or a defense of JSON, but to say that the idea of being angry at a tool - especially given its history and context - is a bit silly.

                                                                  [1] and other protocols

                                                                  1. 3

                                                                    Like u/Relax says, JSON might only be more compact if your use case is storing floats whose decimal notation is short. Try 152351.152152151211292 and suddenly, tada, msgpack is shorter. In the real world, floats will not nicely fit in a short decimal notation; integers will be smaller in msgpack/cbor; and blobs will be about 3/4 of what they’d be in JSON+base64.

                                                                    1. 1

                                                                      JSON+base64

                                                                      If you need it. Base64 is an odd choice in many situations. The fact that Protocol Buffers uses it in certain situations is strange indeed, basically double-serializing stuff. Base64 is for when you have to use text, but this rarely is the case, and I’ve seen more than one company and many projects just skipping or re-inventing parts of gRPC to get around that weirdness when wanting to basically have gRPC in the web browser, without the silliness (sorry, interesting design choices) that their proxies bring just to get around browsers not having trailer support.

                                                                      Don’t get me wrong though, please. I don’t hate on msgpack. The whole point of my post is to consider what you use, but don’t assume one size fits all. Somehow we live in a time where serialization formats have haters and fans who think only their favorite one should ever be used.

                                                                      1. 3

                                                                        If you need it. Base64 is an odd choice in many situations.

                                                                        Well, with JSON, you need it as soon as you want to embed any form of binary data, since strings have to be UTF-8. Want to embed a thumbnail, a hash, another file format? There you go, pay the roughly 4/3 size tax from base64. This is just not an issue with (binary) protobuf, msgpack, CBOR, etc.

                                                                        I’m not even getting into the performance overhead of having to base64-encode, or even escape regular strings, of course.

                                                                        The whole point of my post is consider what you use, but don’t assume one size fits all.

                                                                        And the OP is about how JSON is not one-size-fits-all either, and about how it’s way overused given the tradeoffs! There are situations where JSON is fine (esp. with JSONL imho). There are also lots of situations where it’s used and shouldn’t be (for example, HTTP+JSON APIs could use CBOR instead, gain on performance, correctness, interoperability, etc. and lose basically nothing).

                                                                    2. 3

                                                                      And to all those people who complain about JSON verbosity compared to MessagePack. Go on msgpack.org, click on “Try!” and type in 1.2 or [1.1,2.2,3.4,4.4,5.5,6.6].

                                                                      What’s your point? That msgpack stores doubles rather than strings? That seems…way, way better. Given that msgpack is otherwise more compact, I have a hard time believing there are meaningful pathological cases where this matters. The upside of having a tighter spec just seems like such a massive win.

                                                                      1. 1

                                                                        I have a hard time believing there are meaningful pathological cases where this matters

                                                                        Huh? Whenever you have anything like speeds, distances, durations, and so on. Stuff that float is basically made for. And changing the unit from something common (duration in seconds, speed in m/s or km/h, distances in meters, etc.) because of how it is being serialized is rarely reasonable.

                                                                        I don’t think it matters at all, because I could just use JSON + gzip, have something easily readable, copyable (for debugging), and move on.

                                                                        1. 2

                                                                          In those cases msgpack is more efficient because it’s storing the 8-byte double rather than a long decimal string. The pathological cases I’m talking about are where msgpack is surprisingly less efficient than JSON.

                                                                      2. 2

                                                                        I prefer 1111.2222

                                                                      3. 4

                                                                          Well, like a lot of POSIX warts or C undefined behavior, it’s all valid criticism, it’s all things I agree shouldn’t be here, and it’s all things that would be hard if not impossible to fix without breaking everything that already exists. But they’re also all problems that seem not to be deal breakers in the real world, and can largely be routed around if you know what you are doing.

                                                                          I can know that because it has been used successfully for two decades and people made money and solved problems with it. That is real-life proof that it works. I don’t have to like JSON, I don’t have to find it elegant, but I will respect it, because its ubiquity means it did something right in the actual reality of the real world, and the actual reality of the real world doesn’t give a damn about what I or anybody else believes. People don’t get fired for choosing JSON because for two decades people didn’t get fired for picking it. Tools and companies that use it do not seem to get selected out of the market. Even when they use it in use cases where it has obvious drawbacks, like config files. Compare that to XML, a tool that many happily stopped picking as soon as they discovered a better alternative.

                                                                          Doesn’t mean we cannot search for something better. Doesn’t mean you can’t pick a better tool for the job. But if you screw up with whatever better tool you picked, someone could rightfully point at JSON and say “if you had picked that crusty old standard, you could have made it work”, and they’d likely be right. So, barring a massive disruption in our industry (which can always happen), I’d say it is unlikely that anybody will be fired for choosing JSON anytime soon, even with all the perfectly valid criticism the author points at it.

                                                                        1. 4

                                                                            So, barring a massive disruption in our industry (which can always happen), I’d say it is unlikely that anybody will be fired for choosing JSON leaded gasoline anytime soon, even with all the perfectly valid criticism the author points at it.

                                                                          1. 6

                                                                              Look, I promise I’ll be the first to drop JSON if somebody shows me that it in any way increases blood lead levels in kids.

                                                                              In fact, I’ll also drop it if we find a huge zero day that would break every parser trying to fix it, or if it introduces a massive vulnerability, or if it is found to be so energy-intensive to parse that it makes AI training seem like green energy. I’ll also drop it if a new disruption in the industry comes along and JSON is so bad at that new thing that it wouldn’t make sense to use. Heck, I’ll even drop it if some weird moral panic about JSON comes along and it stops being the format that everybody under the sun agrees on and is capable of using. I’ll even go further: show me that there is a Big JSON that stands to profit by keeping JSON’s downsides hidden, and even if I don’t see direct evidence of harm, I’ll drop it. Show me that there is something unlikely to happen but that could cause massive harm if it does, and I’ll apply the precautionary principle and drop it. Show me any of that is happening, and I’ll change my mind, because that’s what sane people do.

                                                                              Now, that article sure didn’t show me any of that. But hey, something could, and I’ll keep my mind open if it ever happens.

                                                                            That good enough for you?

                                                                            1. 2

                                                                              Provocative analogy! Does government regulation count as “massive disruption”? Are you advocating for it with some kind of tongue-in-cheek motte-and-bailey?

                                                                          2. 4

                                                                            Even though I read this post and thread last night, I still just now briefly read this as “nobody gets fired for pickling JSON” and I almost had an existential/ontological crisis about whether JSON is already pickled or not.

                                                                            1. 3

                                                                                  I like how the author highlights JSONL as a fix for the streaming problem. I hope they are not aware that there are… multiple competing standards and implementations. My favourite bane is that some implementations are lenient when it comes to a missing last newline, and some are not and silently drop the value. Elasticsearch (at least when I was active in it) did the second, leading to a lot of consulting calls from customers asking “how come we lose about a percent of our data?” with me replying “are you by chance importing in bulks of 100?”.

                                                                                  As a small illustration, this issue has most of the personnel involved in the different formats… https://github.com/wardi/jsonlines/issues/22

                                                                              1. 3

                                                                                The Preserves data format could do the job better.

                                                                                https://preserves.gitlab.io/preserves/

                                                                                1. 2

                                                                                  I have found binary and simple delimited text formats to be incredibly easy to implement across several languages. Unless writing software that is expected to run in a browser, I explicitly avoid JSON.

                                                                                  If I need a self-describing format I reach for CBOR. Otherwise I typically use BARE or plaintext.

                                                                                  1. 2

                                                                                    Protobuf doesn’t have this problem: in a nutshell, the Protobuf wire format is as if we removed the braces and brackets from the top-level array or object of a document, and made it so that values with the same key get merged. In the wire format, the equivalent of the JSONL document

                                                                                    {"foo": {"x": 1}, "bar": [5, 6]}
                                                                                    {"foo": {"y": 2}, "bar": [7, 8]}
                                                                                    

                                                                                    is automatically “merged” into the single document

                                                                                    { "foo": { "x": 1, "y": 2 }, "bar": [5, 6] }
                                                                                    

                                                                                    This forms the basis of the “message merge” operation, which is intimately connected to how the wire format was designed. We’ll dive into this fundamental operation in a future article.

                                                                                    "bar" should be [5, 6, 7, 8], no? See: Last One Wins.

                                                                                    1. 2

                                                                                          Adjacently relevant: a comprehensive comparison of JSON parser behavior, with some snark on the JSON spec(s): https://seriot.ch/projects/parsing_json.html

                                                                                      1. 1

                                                                                        For Go specifically, when dealing with unknown data it’s a good idea to use (*Decoder).UseNumber, which preserves the numbers as strings, so you can turn them into int64 or bigints if necessary. Bit of a footgun there for sure.

                                                                                            My biggest beef with JSON is when people use it in places it’s bad at. Twice in my career I’ve had to deal with converting HTML to home-grown JSON-based presentation formats for rendering news articles on smartphones, and both times I wished it was just simplified HTML or even XML. How do you represent a link within text? Company #1: UTF-8 byte ranges, which led to a few nasty crashes with poor bounds checking and UTF-16; company #2: impossible, insert a button element after the text :). Could probably be remedied with a more thoughtful format; I’m curious how Quill does it.

                                                                                        1. [Comment removed by author]

                                                                                          1. [Comment removed by author]

                                                                                            1. 1

                                                                                              Protobuf is unstreamable on the send side btw, because you need to predetermine the length of the content.

                                                                                              1. 1

                                                                                                    Excellent write-up.

                                                                                                1. -2

                                                                                                  slanted titles are insane