1. 67
    1. 58

      The big advantage of CSV and TSV is that you can edit them in a text editor. If you’re putting non-printing characters in as field separators, you lose this. If you don’t need that property, then there are a lot of better options, up to and including SQLite databases.
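
      If you do give up hand-editability, the jump to SQLite is small. A rough sketch with the sqlite3 shell (people.tsv, assumed to have a header row, and the people table are made-up names here):

      $ printf '.mode tabs\n.import people.tsv people\nSELECT COUNT(*) FROM people;\n' | sqlite3 data.db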

      1. 27

        Obvious solution is to put non-printing characters on the keyboard

        1. 17

          …APL user?

          1. 5

            Close; he uses J.

        2. 6

          And after some time, people would start using them for crazy stuff that no one anticipated and this solution wouldn’t work anymore 👌

          1. 1

            Though I suppose that it has the advantage of not coming with any meaning pre-loaded into it. Yet. If we use these delimiter tokens for data files then people will be at least slightly discouraged from overloading them in ways that break those files.

      2. 9

        Also, grep works on both CSV and TSV, which is very useful … it won’t end up printing crap to your terminal.

        diff and git merge can work to a degree as well.

        Bytes and text are essential narrow waists :) I may change this to “M x N waist” to be more clear.

        A text editor or grep is one of M, and TSV is one of N.

        If you have an arbitrary DSV, you’re not really in that world any more – now you need to write your own tools.


        FWIW I switched from CSV to TSV, because the format is much simpler. As far as I can tell, there is exactly one TSV format, but multiple different CSV formats in practice. There’s less room for misunderstanding.

        1. 6

          If you have an arbitrary DSV, you’re not really in that world any more – now you need to write your own tools.

          Do you? I believe awk and tr deal with it just fine. E.g. tr to convert from DSV to TSV for printing:

          $ printf '42\x1f99\x1e13\x1f420\x1e' | tr $(printf '\x1f\x1e') '\t\n'
          42	99
          13	420
          

          and awk for selecting single columns, printing it TSV:

          $ printf '42\x1f99\x1e13\x1f420\x1e' | awk -v RS='\x1e' -v FS='\x1f' -v OFS='\t' -v ORS='\n' '{ print $0, $1, $2, NR, NF }'
          4299	42	99	1	2
          13420	13	420	2	2
          

          Also, I think grep shouldn’t have any problems either, it should pass the non-printable characters as-is?

          1. 4

            grep (GNU grep 3.11) does pass the non-printables through but doesn’t recognise \x1e as a line separator (and has no option to specify that either) which means you get the whole wash of data whatever you search for.

            $ printf '42\x1f99\x1e13\x1f420\x1e' | grep 99
            429913420
            $ printf '42\x1f99\x1e13\x1f420\x1e' | grep 13
            429913420
            

            You’d have to pipe it through tr to swap \x1e for \n before grep.
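
            For example, reusing the tr mapping from the comment above:

            $ printf '42\x1f99\x1e13\x1f420\x1e' | tr $(printf '\x1f\x1e') '\t\n' | grep 99
            42	99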

            1. 2

              Fair, I didn’t know. You can use awk as a grep substitute though.
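
              For instance, with the same RS/FS trick as above (gawk assumed for the \x escapes):

              $ printf '42\x1f99\x1e13\x1f420\x1e' | awk -v RS='\x1e' -v FS='\x1f' -v OFS='\t' '/99/ { print $1, $2 }'
              42	99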

          2. 2

            It’s cool that that works, but I’d argue it is indeed a case of writing your own tools! Compare with

            $ printf '42\t99\n13\t420\n'
            42      99
            13      420
            
            $ printf '42\t99\n13\t420\n' | awk '{ print $0, $1, $2, NR, NF }'
            42      99 42 99 1 2
            13      420 13 420 2 2
            

            And there are more tools, like head and tail and shuf.

             

            xargs -0 and find -print0 actually have the same problem – I pointed this out somewhere on https://www.oilshell.org

            It kind of “infects” other tools too: head -0, tail -0, sort -0, … which are sometimes spelled sort -z, etc.
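
            For example (a contrived sketch), once one stage of a pipeline emits NUL-terminated records, every later stage needs its own -z/-0 flag:

            $ printf '%s\0' b a | sort -z | xargs -0 -n1 echo
            a
            b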


            The Oils solution is “TSV8” (not fully implemented) – basically you can optionally use JSON-style strings within TSV cells.

            So head tail grep cat awk cut work for “free”. But if you need to represent something with tabs or with \x1f, you can. (It handles arbitrary binary data, which is a primary rationale for the J8 Notation upgrade of JSON - https://www.oilshell.org/release/latest/doc/j8-notation.html)

             

            I don’t really see the appeal of \x1f because it just “pushes the problem around”.

            Now instead of escaping tab, you have to escape \x1f. In practice, TSV works very well for me – I can do nearly 100% of my work without tabs.

            If I need them, then there’s TSV8 (or something different like sqlite).

            1. 1

              head can be done in awk; the rest would likely require converting the output to zero-terminated records and piping it into the zero-terminated versions of those tools. With both DSV and zero-terminated commands, I’d make a bunch of aliases and call it a day.

              I guess that counts as “writing your own tools”, but I end up turning commonly used commands into functions and scripts anyway, so I don’t see it as a great burden. I guess to each their workflow.
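
              Something along those lines (dsv2tsv and dsvgrep are made-up names, gawk assumed, and the grep one naively splices the pattern straight into the awk program):

              dsv2tsv() { tr "$(printf '\x1f\x1e')" '\t\n'; }
              dsvgrep() { awk -v RS='\x1e' -v ORS='\x1e' "/$1/"; }

              $ printf '42\x1f99\x1e13\x1f420\x1e' | dsvgrep 99 | dsv2tsv
              42	99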

      3. 2

        The other major advantage is the ubiquity of the format. You lose a lot of tools if you aren’t using the common formats.

    2. 37

      My rule of thumb is:

      • go with CSV if you want everyone and their dog to eventually be able to open/import your data
      • if CSV is not enough: use a really good, standard way to serialize data that fits your kind of data: XML, JSON, Parquet, etc.

      But don’t roll your own “almost CSV, but not really, and it cannot really be imported everywhere” format.

      1. 21

        If I’m producing the data, I tend to favour TSV over CSV. I’ve not seen anything in ages that can read CSV but can’t read TSV (though the converse is not true), and the IANA registration for TSV simply disallows tabs in fields (which is rarely a problem), so there’s no escaping. If you need tabs in fields, there are escaping schemes that are simpler than CSV.
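
        For example, the backslash convention that mysqldump and Postgres COPY use for their tab-separated output (roughly: a literal tab becomes \t, a newline becomes \n, a backslash becomes \\) only needs a couple of substitutions. A rough sketch with GNU sed, escaping backslashes first and then tabs:

        $ printf 'one field\twith a tab\n' | sed 's/\\/\\\\/g; s/\t/\\t/g'
        one field\twith a tab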

      2. 6

        But don’t roll your own “almost CSV, but not really, and it cannot really be imported everywhere” format.

        This is what Microsoft does in Excel. Their format depends on localization, version, and maybe the moon phase. For example, in the Czech localization they will use semicolons as separators (even though they call it comma-separated values), while in the English localization they will use commas. They are not compatible even with themselves.

        1. 1

          I believe the standard in Swedish Excel is to use semicolons as separators too. Do Czech numbers use commas or periods as the decimal sign? If so, I believe this was Microsoft’s “solution” to not have everyone outside the Anglosphere have to wrap decimal numbers in quotes.

          1. 3

            Yes, we have commas, and it is probably the reason (they thought it would save some bytes because decimal values would not need quotes around them?). But it is still wrong IMHO, because a data interchange format should not depend on locale (which can be different on the other side of the data interchange) and should contain data in a machine-readable format.

    3. 15

      When I worked for a data analytics company, I once pitched to my management the idea that we’d charge for a separate “CSV plugin” or “CSV processing” because we found that

      1. problems with malformed ingested CSV accounted for about 75% of support cases that made it to the developers, and
      2. processing well-formed CSV, in general, took something like 35% longer than Avro, the format to which we translated CSV immediately after ingestion. Parsing CSV and maintaining the logic for all its exceptions carried a lot of development and operations overhead!

      I don’t know if they ever implemented the idea. The organization was dysfunctional and still trying to become a product organization from its roots in consulting services, a.k.a. doing whatever made the customer happy, including suffering mistakes from manual CSV output.

    4. 11

      What do you do when you serialize to DSV and one of your columns contains a DSV document?

      1. 9

        That’s when you switch to SQLite!

      2. 4

        This is a good question, but I think the answer is the same as for CSV: define an escaping scheme and use it. No schemaless format can contain itself any other way, and ASCII (or Unicode) cannot define such a scheme, because they are character encoding schemes, not data encoding schemes.

        There’s certainly some irony in the post author complaining about an underspecified data encoding scheme (CSV) and proposing a different but equally underspecified data encoding scheme. Though FWIW, I think we would be a lot better off with DSV; the underspecified conditions are far rarer in real data.

          1. 1

            From that link:

            While there are various specifications and implementations for the CSV format (for ex. [4], [5], [6] and [7]), there is no formal specification in existence, which allows for a wide variety of interpretations of CSV files. This section documents the format that seems to be followed by most implementations[.]

            1. 1

              It was the state 20+ years ago. Anyone could have commented on that Request for Comments… I recommend considering this the CSV standard, and anything else as legacy pre-standard CSV. In your software you may have options for standard CSV and for your previous version (for compatibility with older versions). Twenty years is quite a long time…

      3. 1

        I think the idea here is that you don’t let that happen - strip those weird characters out of your text before you encode it, and document that the format you are using is incompatible with those values.

    5. 7

      One of the hidden footguns on my list of go-to test cases is that CSVs are sensitive to locale-defined behaviour, which trivially leaks from C into scripting languages. The radix point, which defaults to a period under LC_ALL=C, is a comma in some locales. Lua <= 5.2, with its ‘everything is a lua_number and lua_number is a build-time double’ setup, will fail miserably for string processing under a French locale, where something like printf("%.4f", myfloat) yields a comma.

      One of the uglier cases where I was called in to debug this in the wild was on a supposedly headless system (which did process CSV embedded in XML, because it was from the 2000s) where a parasitic dependency tried to open an X11 connection asynchronously, retrieve the locale from metadata stored in ATOMs, and proceed to setlocale for the entire process. In most cases it would win this race, but every once in a blue moon the processing would break as the radix point changed mid-stream.

    6. 7

      I think it’s a shame that 0x1C to 0x1F as separators with 0x1B as an escape aren’t used more often, but they aren’t, so it’s not really worth trying to push them into use now, since you can just specify your CSV flavor instead and have fewer compatibility problems. In practice, I haven’t received any CSVs with escaping problems in years of handling them from various random government sources. Typically the systems generating them just can’t do anything outside of uppercase A–Z plus space anyway.

      1. 5

        I don’t think it’s a shame, because it pushes the problem around, without solving it –

        Now instead of escaping \t or comma, you have to escape 0x1c 0x1f. The corner case still exists, but it is potentially more surprising / bad when it happens.

        The nesting issue brought up is one place where that happens; another case is encoding binary data like a small favicon.ico.

        (I mentioned that in Oils we have J8 Notation to solve this problem – it can represent any byte string, and is backward compatible with JSON. It can fit within a TSV cell.)

    7. 3

      I think CSV is a great way to exchange data between systems that you have control over, but it is not a great format for exchanging data between systems you have no control over.

      So for example, if you know the quirks of how CSV export works in, say, Splunk, you can adapt your parser to fix this issue with a few lines of code (as long as your parser is flexible like the Python one, but most parsers are).

      However, if you are receiving files from other people and have no control over what they’re using, it may be better to just ask them to send it in another data format (e.g.: xlsx) and export it to CSV yourself, making sure that you respect the rules that you need to follow. Or just use another data format.

      1. 6

        On the other hand, CSV is the worst way to exchange data between systems you control: it’s by definition a lowest-common-denominator, low-information format whose main advantages are being ubiquitous (with debatable compatibility) and being nominally writable by hand in a text editor.

        If you control both producer and consumer it’s trivial to use a richer format because you can just decide to add that and use the relevant dependency if any. You have absolutely no reason to restrict yourself to csv.

        1. 4

          What I meant about having “control” is not always having access to change the producer. The Splunk example would be a good one: let’s say it only supports CSVs; in this case you don’t have access, but you still have control in the sense that you know how it behaves. The main issue with CSVs is generally someone just giving you a random file, probably generated from Excel, Google Sheets, etc.

          Also, one thing that is nice about CSV is that it is easy to generate even if not explicitly supported: anything that allows you to print formatted data (e.g.: printf) can be abused to generate CSV, with varying amounts of success depending on how your data is organised.
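
          A trivial sketch of that idea (which of course falls apart as soon as a value contains a comma, a quote or a newline):

          $ printf '%s,%s\n' alice 30 bob 25
          alice,30
          bob,25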

          But yes, if you can modify the producer and consumer there are better ways to exchange data. Probably the second lowest common denominator would be JSON.

      2. 3

        However, if you are receiving files from other people and have no control over what they’re using, it may be better to just ask them to send it in another data format (e.g.: xlsx) and export it to CSV yourself

        If you have no control, you can’t ask them to send in another data format.

        If you can ask them, then ask them to send in anything that is well-specified. That includes CSV, if you can agree on all the options.

        But xlsx is a terrible format for data because it is designed to contain things which are not data, so it is much more complicated and prone to interpretation. The paragraph of instructions needed to disambiguate CSV is completely swamped by the unresolved issues in xlsx. It’s a minimum of 7 XML files in a directory tree inside a zip overcoat.

        1. 3

          I think you and I are imagining two completely different situations. Partially my fault, but well, let me answer your points.

          If you have no control, you can’t ask them to send in another data format.

          Generally you can. The situation I was imagining is something like a Business Analyst doing something in Excel and sending me the results. Most of the time I don’t have control over where the data is being processed, but I can at least ask “pretty please, can you send it in another format that doesn’t mess up the output?”.

          If you can ask them, then ask them to send in anything that is well-specified. That includes CSV, if you can agree on all the options.

          Sure, this works really well for someone that is not tech savvy: “can you please export in CSV in Excel and mark all those options that will not mess up the output?”. Maybe they will do this the first time correctly (since I will guide them), but forget the next time and then everything is messed up.

          But xlsx is a terrible format for data because it is designed to contain things which are not data, so it is much more complicated and prone to interpretation. The paragraph of instructions needed to disambiguate CSV is completely swamped by the unresolved issues in xlsx. It’s a minimum of 7 XML files in a directory tree inside a zip overcoat.

          This depends. If the data is in Excel, I much prefer that they send me the original Excel file. I can then make sure I export it correctly, instead of having someone trying to export to CSV and messing up the settings.

          And yes, I didn’t mean I would try to parse the XLSX in place of CSV. What I meant was that I would make sure myself that the data is exported correctly.

          1. 1

            Maybe they will do this the first time correctly (since I will guide them), but forget the next time and then everything is messed up.

            Ah. I was envisioning – because this is the normal state of affairs for me – an automated or semi-automated process on the far side sending me an updated version of the same data each time (daily, weekly, monthly…) which I want to process automatically in order to feed other systems.

            If people are sending one-offs without an automatic process, that’s a whole different game. Not a fun one, either.

    8. 3

      And then you go look up what universal newlines are.
      And then you find out that there are different Dialects of CSVs.
      And then you learn that Python has a Sniffer that claims it can deduce the CSV format for you.
      And then you realize that the data format needs its format deduced and so now you have to become The Joker.

      This got a lol out of me, I feel this. CSV is the lowest-common-denominator of data formats, so they are reliably garbage. “Nothing but edge cases” indeed.

    9. 3

      The one big hurdle standing in the way of adopting another character as a field separator is whether Excel recognizes it on import.

      Like the piece mentions, one part of the US federal bureaucracy uses the file separator character, and they have to provide a tool to convert it to commas - otherwise fewer people can access the data!

      Incidentally, I wonder if the prevalent use of decimal commas in many European countries led to correct quoting of at least values with decimals in them.

    10. 3

      We didn’t have to escape anything because we don’t have any printable characters that would conflict with our control characters.

      Uhhh… you sure about that?

      1. 2

        Well, I’ve never seen any…

        1. 1

          ;)

    11. 3

      I have pretty extensive experience with this! CSVs are, in fact, terrible!

      TSVs, as others have mentioned, are amazing! For some very interesting reason, mysqldump cannot, to save its life, generate a valid CSV. It can’t even make a CSV that it can then, later on, read back itself.

      It can generate valid TSVs! So can pg_dump! So can pgloader! A lot of things can generate valid TSVs and everything I’ve ever seen can also read these same TSVs back in.

    12. 2

      RFC 4180 defines CSV. This is a really simple and useful format (if you want it as text and are OK with a single table).

      Related comment https://lobste.rs/s/oq78jm/using_commandline_process_csv_files#c_qsgwly

      BTW: You’ve already spent more time on this discussion than implementing a correct CSV parser/generator would take.

      1. 1

        Didn’t know about that RFC, thanks. It looks good but mostly kicks the hard part of the problem down the road, similar to how JSON does. For example the locale-sensitive-number-parsing problems that others have described. The RFC doesn’t say anything about numbers, just strings, so using it you’ll produce valid CSV data that then has to be transformed and cleaned up anyway. It does what it says on the tin, but is borderline too minimalistic to be useful.

        1. 1

          I agree that it might define/recommend encodings for common data types.

          But how to interpret the data is a task for the schema, and the schema is application specific. It is quite useless to know that the value is the decimal number 35.50 if you do not know whether it is a temperature (which unit?), a price (which currency?), an age, etc. The number 35.50 is about as useful and valuable as the string “35.50”. Data types matter in the binary world, where we can store them efficiently (like uint32 as 4 octets) and run instructions on them, but in the world of text formats they are quite irrelevant in themselves. They are important as part of the schema, which also gives us the semantics (like: the temp column contains a temperature in Celsius, written as a decimal number with a decimal point, with these min/max values…).

    13. 2

      If you’re using the Python CSV module, the easiest way to get relatively sensible escaping to work is to use the confusingly named “excel” dialect, which uses double quotes around strings that contain commas and correctly escapes those double quotes.

      csv.writer(fp, dialect="excel").writerows(…)
      

      Or use “excel-tab” for correctly escaped TSV.

      1. 3

        Does “correctly” mean RFC 4180? I know some nonstandard dialects use backslashes and I don’t trust anything named “excel” to be standard or correct. (e.g. I know Excel’s CSV dialect depends on the locale.) To my disgust the Python documentation fails to describe the option settings used by the predefined dialects such as excel https://docs.python.org/3/library/csv.html
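
        FWIW, the settings can at least be read off the dialect class itself (a quick check in a shell, not a substitute for documentation):

        $ python3 -c 'import csv; d = csv.excel; print(repr(d.delimiter), repr(d.quotechar), d.doublequote, d.quoting == csv.QUOTE_MINIMAL, repr(d.lineterminator))'
        ',' '"' True True '\r\n'

        Comma, minimal double-quote quoting with doubled quote characters, and CRLF line endings, which is at least consistent with RFC 4180’s quoting rules.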

    14. 2

      Back in 2019, I was experimenting with serializing targets and “dictionaries” to files in CMake, as the delimiter in CMake for lists is a ; (as everything in CMake is a string, due to its lineage with Tcl). I abandoned it for the very same reasons that others have posted (CSVs are just better to grep through!), though in the case of CMake we don’t have better options up to and including sqlite databases (I will not be accepting “just use a different build system” as a better option, because if it was that easy no one would be using <build system you hate> anymore 🙂)

    15. 2

      The conclusion I draw from this writeup is

      Data is kinda bad. DSVs are not the answer because readability and compatibility are important.

    16. 2

      If you use the defaults of the Python csv module, everything works the vast majority of the time. This article could have been reduced to the statement “csv dialects exist”, which might be useful for people who have never dealt with csv before but is basically common knowledge for everyone else.

    17. 2

      No, this is no better than TSV, and worse in some ways.

      If you’re gonna use a weird format then use one with a proper spec that lets you properly record the data types and physical units for each column.

      Parquet and its derivatives are good choices.

    18. 1

      If you serialize with a quoting style and deserialize with another one, of course it won’t work. That is the expected behaviour and the whole reason those parameters exist. There is nothing bad about it.

      I find the argument against CSV very weak; it always boils down to corner cases and format inconsistency. CSV was (and is) more of a common practice than a formalized format. Of course, if you just go for no quoting, then you cannot support newlines or commas, because these are used as separators. Likewise, if you support quoting, of course you break trivial naive parsing by simple string splitting and lose the convenience of one row per text line.

      There’s nothing “kind of bad” about this. Those are just the obvious design choices and their implications.

      If I am inserting data by writing CSV content by hand, of course I want to skip the quoting if I can.

    19. 1

      Holy 1970s, Batman! This was a real issue for punch cards.

      The usual issue is that any choice requires an entire ecosystem of support. You could probably go a lot further trying to make a “correct” CSV standard. That is, picking one set of rules and getting lots of people to buy into those rules. For example, if you make “Comma Separated Data” have no options, then you have ‘import from any, export to the one true standard only’.

    20. 1

      I always thought the Python csv module was only useful because you don’t want to mess with the escaping. When you reach the point where you know the separators can’t conflict with the data, I think you are better off parsing the « dsv » with a plain file reader and splitting on the two special characters. You could even do that with many Unix programs in a single one-liner.
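
      For example, plain bash can do the splitting itself (a rough sketch; bash-specific because of the $'…' escapes and read -d):

      $ printf '42\x1f99\x1e13\x1f420\x1e' | while IFS=$'\x1f' read -r -d $'\x1e' a b; do echo "$a -> $b"; done
      42 -> 99
      13 -> 420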