Comments on DBMS Musings: "Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?" by Daniel Abadi

Interesting read, although I would avoid running benchmark tests on the t2 AWS instance type, as those have a concept of burst CPU credits. Unless you explicitly set unlimited mode when you launched the instance, it is possible that the machine ran out of CPU credits at some point during the experiment and that this altered your results. It probably did not alter much, but something to keep in mind for future experiments 💪
(Brent Cetinich, 2020-12-02)

Thanks for the comment.

Like I said ... "vast majority". You bring examples of column-stores; Sybase IQ is another historical example. But the vast majority of data engines were row-stores :)
(Daniel Abadi, 2019-08-01)

"For decades, the vast majority of data engines used row-oriented storage formats"

Well, that's not quite true. Speaking of decades and rows vs. columns, APL, A+, J, K, Q, and kdb+ should definitely be mentioned. Beloved pets of stock exchanges and power plant engineers for decades.
Essentially columnar.
(Mobbcore, 2019-07-31)

Really good article ... thank you!
(durga prasad kumar, 2019-05-22)

Very interesting article and debate. Maybe using Parquet/ORC on a RAM disk, taking into account the copying overhead of course, could contribute some more numbers to your debate.
(Anonymous, 2018-10-22)

Really nice article!
(RD, 2018-03-28)

Thanks for the reply.

1) Not necessarily for a column format, especially if the update layout also uses a column format. Our experience has been that one can get by with a 2-3X degradation even with large/frequent updates.

2) Will defer to your experience.
(Anonymous, 2018-01-10)

Thanks for the comment and the more precise numbers in several places. I do disagree about a couple of things though:

(1) Updatability is mostly orthogonal to row vs. column.
Both can dense-pack data if updates are not allowed, and both are more likely to use less read-optimal data layouts if updates are allowed.

(2) Real-world DB execution engines are more CPU intensive than what you indicate. (At least in my experience.)
(Daniel Abadi, 2018-01-08)

Good article, but there are some important points missed here.

The most important is that row-oriented storage also implies one other thing: in all real-world implementations, row data is not packed contiguously like an array of C structs, which this post implies (in the absence of source code). That's because it has to allow updates/deletes to rows. Typical in-memory stores will use some form of hash map for the in-memory cache. An interesting aside is that linked hash maps typically fare better in scans than normal or even open hash maps when row sizes are a bit large (like >100 bytes).

Secondly, a row format that allows varying column types also has to maintain those offsets somewhere in the schema and read them while decoding, so additional reads are required, breaking the cache lines further. On the other hand, columns of primitive types have fixed sizes (or sizes that can be read or jumped over in batches, as with run-length or efficient encodings like FastPFOR, which are not possible with a row format).

All in all, the raw speed per core for a single integer/long in rows of ~24-byte size is ~50ns per row and typically more, and cannot become much better no matter what one does.
Compared to that, column formats can do <1ns per row, with or without encodings, as the post also shows.

Some other points worth noting in the article and comments:

"So even if the processor is doing a 4-byte integer comparison every single cycle, it is processing no more than 12GB a second"

A superscalar architecture can do about 4 instructions per cycle even without SIMD. Not sure about EC2, but on my laptop I typically get <1ns per integer/long (per core) even for average processing (which is more instructions than a simple equality comparison). Adding additional filters on the same integer/long column hardly changes the numbers.

"But it is far less likely that we will see heavier-weight schemes like gzip and snappy in the Apache Arrow library any time soon"

I don't know about Apache Arrow, but compressing in memory means one can fit a lot more in memory. Given the typical decompression speed of schemes like LZ4, which is in the range of ~2GB/s, compared to ~500MB/s disk reads even with SSDs, it's a win in most scenarios (assuming no spare RAM for OS buffers).

"However, for memory, the difference is usually less than an order of magnitude."

Not true. Typical numbers are ~4 cycles for L1, ~10 for L2, ~40 for L3, and ~100 or more for RAM. So the relative difference between sequential and random access is similar whether it's disk or memory. Besides, all Parquet/ORC scanners will do sequential column block reads as far as possible, skipping forward in the same file as required.

"On the other hand the amount of processing done per data item is tiny in this example; in a real system there is generally much more CPU overhead per data item."

In typical queries, the problem still remains the RAM/cache-to-CPU speed.
For example, even for simple joins with reference data that fits in L1/L2 cache, the best hash joins have to jump around the hash buckets and are typically an order of magnitude slower. Adding more filters, especially on primitives, hardly affects the numbers. For bigger joins the hits are even larger (and of course, if a shuffle/sort gets introduced, that will completely dominate the numbers). The case where CPU matters is simple full-scan queries with multiple complex filters, but those are serviced much better using indexes.

"I would say that there are fundamental differences between main-memory column-stores and disk-resident column-stores."

I have yet to find a main-memory engine that can do significantly better than Spark+Parquet, for example, especially if the latter is stored uncompressed and the file gets cached in OS buffers. The memory-optimized engine we have built at SnappyData can go about 5-10X faster in the best cases, but less for complex queries. The bigger advantages lie elsewhere: indexing, minimizing shuffle, a hybrid store, etc.
(Anonymous, 2018-01-03)

This comment has been removed by the author.
(Anonymous, 2018-01-03)

I hope you can give more thought to how to combine the disk-based format and the memory-based format.

From my experience, customers may want to cache data in memory, but the cost is high, so they want to move data back to disk when resources are not sufficient.
If one format can be flexible enough to support this, it may be a good choice for users.
(DjvuLee, 2017-11-08)

Thanks for your reply!

I agree with your conclusion; it is just that the numbers you gave make the evidence not so strong :-)

I have been considering the same problem for a while: do we really need the Apache Arrow project? Thanks for this article!
(DjvuLee, 2017-11-08)

I said "around 30GB". I was just trying to give ballpark numbers rather than exact figures.

As far as the number of CPU cores: this post should be understood on a per-core basis. I agree that if there are many cores, all pulling data from memory at the same time, the bottleneck will be pushed back towards memory. On the other hand, the amount of processing done per data item is tiny in this example; in a real system there is generally much more CPU overhead per data item. What I'm trying to show in this article is that it is surprisingly easy for the CPU to become a bottleneck.
(Daniel Abadi, 2017-11-08)

Nice article!

But just a simple question: why do you say the memory bandwidth from memory to CPU is 30GB/s? DDR3-2133 is about 18.3 GB/s, and DDR4-3200 is just about 25.6 GB/s. The numbers you give exaggerate the CPU processing rate and memory bandwidth.
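(Editorial aside: the DDR figures cited in this exchange can be sanity-checked from the standard arithmetic of peak transfer rate times the 8-byte DDR bus width; these are per-channel numbers, and the channel count per machine is an assumption here.)

```python
# Peak DRAM bandwidth per channel = transfers/s * 8-byte DDR bus width.
def ddr_bandwidth_gbps(mega_transfers_per_s):
    return mega_transfers_per_s * 1e6 * 8 / 1e9

print(ddr_bandwidth_gbps(3200))      # 25.6 GB/s for DDR4-3200, as cited
print(ddr_bandwidth_gbps(2133))      # ~17.1 GB/s for DDR3-2133 (a bit below
                                     # the 18.3 GB/s quoted in the comment)
print(2 * ddr_bandwidth_gbps(2133))  # ~34.1 GB/s dual-channel: close to the
                                     # post's ballpark "around 30GB" figure
```

So the "around 30GB" and "25.6 GB/s" numbers are not in conflict; they differ mainly in whether one channel or a dual-channel configuration is assumed.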
What's more, there are lots of CPU cores in one machine; you just do not mention this enough.
(DjvuLee, 2017-11-08)

Yes, indeed, and you showed admirable restraint in not making the example even more stark (it could have been 16 columns, not 6, and the column of interest could have been 6 bits wide, not 32) :-)
(Geoff Langdale, 2017-11-01)

Hi Geoff,

What I was trying to say is that AVX2 or AVX512 is far less helpful for row-stores than for column-stores, since in row-stores the register is polluted with data that will not be operated on.
(Daniel Abadi, 2017-11-01)

Is the fact that your benchmark example isn't compute-bound due to the instance not necessarily being capable of using a modern ISA? I would hope to be able to get 8-16 32-bit integer compares per cycle with AVX2 or AVX512 in the steady state (memory permitting; also branch misses permitting, assuming we take a branch when we see something).
(Geoff Langdale, 2017-11-01)
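(Editorial aside: a back-of-envelope sketch of the throughput numbers this thread keeps circling. The ~3 GHz clock and the one-compare-per-cycle scalar rate are assumptions for illustration, not measurements from the post; the point is that even modest SIMD compare throughput exceeds what a single memory channel can deliver, which is why a columnar scan shifts from compute-bound to memory-bound.)

```python
# Scalar vs. AVX2 32-bit compare throughput on an assumed ~3 GHz core,
# versus one DDR4-3200 memory channel.
clock_hz = 3e9
scalar_gbps = clock_hz * 4 / 1e9    # one 4-byte compare per cycle
avx2_gbps = clock_hz * 8 * 4 / 1e9  # 8 lanes of 32-bit compares per cycle
ddr4_gbps = 3200e6 * 8 / 1e9        # peak per-channel DDR4-3200 bandwidth

print(scalar_gbps)  # 12.0 -- the post's "no more than 12GB a second"
print(avx2_gbps)    # 96.0 -- far more than one memory channel can feed
print(ddr4_gbps)    # 25.6
```

This also illustrates Abadi's reply above: in a row-store most of those SIMD lanes are wasted on bytes from other columns, so only a column layout can actually approach the vectorized rate.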