Talk:Data orientation

Latest comment: 6 months ago by Jorgecarleitao in topic Re-write

Row-oriented redirects here

edit

since row-oriented redirects here, should there be some clarification as to why that is, and why it doesn't make sense for there to be a row-oriented article? Swoophle (talk) 19:58, 28 July 2017 (UTC)Reply

okay, have redirected Row-oriented to a section in this article. Whehn I get the chance ill try to add at the start of that section a point of clarifiction as to why Swoophle (talk) 20:05, 28 July 2017 (UTC)Reply

Reference [1] gives 404

edit

First reference to "Ins and Outs of columnar DBs" gives a not-found error 193.19.114.132 (talk) 10:01, 16 February 2017 (UTC)Reply

Mistake?

edit

In the description of the column-oriented systems, the main difference shown is that column databases show each key only once, and then provide the record references, giving the example: …;Smith:001,Jones:002,004,Johnson:003;…

however, this is not a difference. At least one relational database (DB2/Unix), implements secondary indexes in such a way that each key can appear exactly once, and then can be followed by a list of record id's.

I think that this difference should be removed. The difference suggested basically rests on a relatively small feature that may or may not be implemented in a given row-database. — Preceding unsigned comment added by Namel74 (talkcontribs) 10:24, 27 February 2014 (UTC)Reply

Justifications for being included in the Implementations section

edit

It seems to me the definition of Column-Oriented DBMSs as defined here, does not completely apply to all of the listed implementations. Microsoft SQL Server 2012 (Enterprise Edition) includes Column Store indexes, but does that justify its inclusion here? Apache Cassandra is admittedly a hybrid. Perhaps the inclusion list should be separated into Pure Implementation (DBMSs that primarily use Column-Oriented storage/processing.), Virtual (Column-Oriented features presented logically through a presentation layer, where another DBMS is at a lower level), Hybrid (Hybrid DBMS that implements Column-Oriented functionality with at least one other type of DBMS natively) and Implemented Features (DBMSs which implement certain Column-Oriented features but is not natively Column-Oriented).

  • Response: While an interesting concept, I recommend any such categorization be based on citable published sources regarding column-oriented data-layout capabilities of the systems in question and their effects on performance. The above description of a "Pure Implementation" has little to do with citable facts. Whether a DBMS "primarily" uses a particular model is somewhere between a subjective and marketing distinction. The real problem may be with the article itself, as it's unclear that there is any necessary reason for a distinction between "column and row oriented databases" vs the more granular "column and row oriented storage layout features within databases". Jeskeca (talk) 21:02, 27 August 2012 (UTC)Reply

I would also comment that HBase seems quite out of place here-- much more so even than Cassandra. The split between column-oriented relational DB vs. column-family DB like Cassandra and HBase should be called out. — Preceding unsigned comment added by 24.104.132.118 (talk) 20:48, 1 November 2012 (UTC)Reply

Interesting paper

edit

Hello, I came across a very good paper which shows a possible future for coloumn-oriented DBMS.

"A Common Database Approach for OLTP and OLAP Using an In-Memory Column Database" Hasso Plattner (2009) - Cofounder of SAP. He explains in Detail what column storage does and how it can be a read/write approach when used in combination with In Memory. As I am not a writer I hope some can put this information here. — Preceding unsigned comment added by 128.176.236.92 (talk) 14:01, 30 November 2011 (UTC)Reply

Proprietary software?

edit

I wonder how some open source databases ended up with 'proprietary' tag even though they are free. InfiniDB is GPLv2 licenced, for example. —Preceding unsigned comment added by 80.221.41.202 (talk) 22:04, 30 January 2010 (UTC)Reply

Current version seems to be quite confusing - in particular the section title "Open-source (proprietary software)" is self-contradictory in my opinion. Maybe this was supposed to be a list of dual-licensed software? But any way this title is confusing. Michał Kosmulski (talk) 09:51, 13 March 2010 (UTC)Reply

rewrite 4/23

edit

I agree with the discussion below and found the content misleading and straying off topic as well. I have submitted a rewrite which is substantially similar in coverage but which I believe characterizes the column-oriented DBMS better. In hindsight, I feel that the page still reads too much as a 'comparison of column-oriented and row-oriented DBMS' rather than a standalone description of 'column oriented dbms' as it is titled, as I adopted this by mirrioring some of the previous content. I'm curious to know if others feel this is beneficial, or if this content should be more properly moved to a 'comparison of column-oriented and row-oriented dbms' page, with this page more deliberately dedicated to column-oriented. Jeskeca (talk) 01:43, 24 April 2008 (UTC)Reply


comments from before 4/23 rewrite below

edit

Cost of querying

edit

I read the article:

For [read mostly] workloads, column stores are advantageous because:

  1. Queries tend to require only a few columns out of many contained in a table. Whereas a row store must read the entire table, a column store can confine its reads to the columns required.
  2. Column data, being of uniform type, is much easier to compress than row data. Thus, the actual disk reads required to read the columns required for a query may be far less than the size of the raw data in these columns.

I think this needs to be reformulated, it doesn't seem to be coming across right. Is this an advantage when doing queries to extract a subset of the data? Is there an assumption that the table is not indexed? An index can be used to find a relevant subset of data quickly, and then there's no need to read the entire table. Given such a case, the first point seems to be moot. Perhaps the article should be more specific about the cases where vertical fragmentation is a benefit.

I can't see that the conclusions drawn in the second point follow either. Clarification and/or sources would be beneficial.

WikiKetil 23:16, 6 September 2007 (UTC)Reply

Re-write

edit

This article needs to be rewritten with the content explicity describing row-column databases, not as an article that describes traditional row based RDMBS and then contrasting it with column based dbs in a couple of sentences. Its round the way, and doesn't contain enogh detail either of column based databases. scope_creep 16:11, 7 September 2007 (UTC)Reply

Article Content Needs Full Rewrite

edit

This "definition" is ridiculous. By only citing the most rudimentary basics between the two types of databases it serves no one any good whatsoever. I am now going to spend hours searching and reading the web to find out what the real differences are and if given the time I will edit this article. Prof Kayyos 16:54, 7 September 2007 (UTC)Reply

It's not the most rudimentary basics, it's the fundamental architecture of the data store. It's not clear to me what you mean by "the real differences"... storing the data column by column, instead of row by row, is the real difference, with the advantages and disadvantages described.
Is your background in computer science, have you reviewed the literature on database CS at all? Georgewilliamherbert 00:32, 8 September 2007 (UTC)Reply

Blog reference

edit

I inserted two references earlier today, one of which was to a blog by Dave Kellogg describing column oriented RDBMS systems. He does a lot of RDBMS work and counts as an expert in the field, in addition to being head of a company in a related business space (XML servers, not DBs per se). I believe that the column serves as a good non-technical reference / info source. We can include hard references to all the academic papers and product marketing info sheets we want, but external general overview info is also good source/reference material.

There's a general desire to minimize blog usage in Wikipedia references - but the objective there is to minimize unreliable sources. For sources selected precisely because they are "general" and not focused-technical, that is somewhat a moot point.

Feel free to discuss here if there are further objections... Thanks. Georgewilliamherbert (talk) 22:25, 22 January 2008 (UTC)Reply

The pointed-to article says "I’d recently heard that Michael Stonebraker had founded Vertica, a column-oriented DBMS company.... So I decided to try and figure out what column-oriented DBMS is and why you might want one." That is, he is new to column-oriented DBMSs. And it concludes "For more information on column-oriented DBMSs, check out the Wikipedia entry here." That is, this article is already more complete than his article. So why are we pointing to it? --Macrakis (talk) 18:10, 23 January 2008 (UTC)Reply
He's a DBMS expert in general, but acknowledges not in Column-oriented RDBMS. He's presenting an overview of them to readers who are less familiar than he is. He's an independent source in the DBMS user community commenting on what they are, though not as an expert on them specifically. Georgewilliamherbert (talk) 00:11, 24 January 2008 (UTC)Reply
OK, thanks for confirming my understanding. I will therefore remove the reference. --Macrakis (talk) 05:28, 24 January 2008 (UTC)Reply

An anon added the following comment:

This simply isn't true that schema changes in a row-oriented database have to be expensive. It's really just a shortcoming of MySQL's implementation. In PostgreSQL, for example, adding or removing a column only modify the metadata for a table. If you add a column, new rows with a value for that column will have space for it; old rows that didn't have a value will be null.

This may all be true, but it is written as a comment, not part of the article. I've rewritten it. --Macrakis (talk) 16:01, 3 February 2008 (UTC)Reply

Naming

edit

Aren't this database the same as Pick-style database. Multivalue databases?

I think the article itself should say that this is also called "column-store database". The two terms seem to be competing right now, with column-oriented returning 30K Google hits, and column-store returning 28K Google hits. --Westwind273 (talk) 21:30, 11 June 2008 (UTC)Reply

InfoBright problems?

edit

The Community Edition of Infobright is crippled: you can't update a database once it has been loaded, you have to delete the entire database and reload it from scratch. The Enterprise Edition is not open source either, so I've shifted it into the commercial section.

Smt52 (talk) 21:36, 19 December 2010 (UTC)Reply

Infobright Community Edition does not limit the use of any of its functionality. In addition, the "crippleware" annotation is opinion. Given Infobright Community Edition released whitepapers/customer testimonials showing successful usage of Infobright Community Edition in production, then the "crippleware" verbiage is inappropriate for wikipedia.

Comparation table (pros/cons)

edit

Can anyone provide a table comparing row and column oriented DBMS (incl. pros and cons)? Especially I'd like to know when to use a column oriented instead of an indexed row oriented DBMS (because yes, indexing creates overhead, but it's also a lot more flexible and stored in memory only). And some more infos about performance would be interesting, too, especially when accessing full row data (where column oriented DBMS completely fail in my opinion). --178.197.226.72 (talk) 11:47, 12 January 2014 (UTC)Reply

From SenSage: "In this architecture, the data is stored in columns instead of rows, which eliminates the need for indices when storing event data to increase data compression and retrieval speeds." - That's the kind of info that belong into this articles. But anyway,, why should a column oriented DB stored in HDD be better than a row oriented DB with indexes in RAM? All this doesn't sound very convincing to me...is it maybe just a marketing gig?? --178.197.226.72 (talk) 11:54, 12 January 2014 (UTC)Reply

Row based salary index example correct?

edit

Is the example of the index on the salary column on a row-based system correct? It states "An index on the salary column would look something like this:"

001:40000;
002:50000;
003:44000;
004:55000;

Don't indexes sort values either ascending or descending? So perhaps the example should be:

001:40000;
003:44000;
002:50000;
004:55000;

I'm not an expert, just thought it was worth querying — Preceding unsigned comment added by Jda81 (talkcontribs)

@Jda81: You are correct, modern databases use logically sorted index types by default. The article even states that a bit later. -- intgr [talk] 17:05, 26 August 2015 (UTC)Reply
edit

Hello fellow Wikipedians,

I have just added archive links to one external link on Column-oriented DBMS. Please take a moment to review my edit. You may add {{cbignore}} after the link to keep me from modifying it, if I keep adding bad data, but formatting bugs should be reported instead. Alternatively, you can add {{nobots|deny=InternetArchiveBot}} to keep me off the page altogether, but should be used as a last resort. I made the following changes:

When you have finished reviewing my changes, please set the checked parameter below to true or failed to let others know (documentation at {{Sourcecheck}}).

This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).

  • If you have discovered URLs which were erroneously considered dead by the bot, you can report them with this tool.
  • If you found an error with any archives or the URLs themselves, you can fix them with this tool.

Cheers.—cyberbot IITalk to my owner:Online 00:40, 31 March 2016 (UTC)Reply

There are several errors in this article

edit

I work in the database field and have written articles, white papers and books on this subject. There are many mistakes here. The initial section really does more to describe the limitation of a column store (and somehow plug IBM BigQuery) rather than describe a columnar store's benefit. Later in the article, there is mention of using a columnar store for OLTP. This is absolutely the wrong use case! As for the list of columnar databases that are available today, the originators of the concept, Stonebraker and the others at MIT, actually started several companies that are not mentioned. I will work to improve this over the next couple of days. Virtualelvis —Preceding undated comment added 17:14, 21 April 2016 (UTC)Reply

@Virtualelvis: The lead you've written looks decent, although now it does not mention any downsides at all to column-oriented databases.
Is this Ventana article actually a reliable source? From a brief glance, it looks like a typical "market research company" that happily publishes whatever conclusions their customers pay for.
You say you work in the database field, please note that Wikimedia Foundation's terms of use require disclosing affiliations that you have with the subjects you write about. Which products and/or companies have you been involved with? Do they include Ventana? -- intgr [talk] 09:23, 22 April 2016 (UTC)Reply

Main definition correct?

edit

The first sentence says that column-oriented DBMSs store data in columns. This is true for relational columnar DBMSs like Vertica, but it's false if applied to HBase and similar, which are non-relational column-oriented DBMSs that still store data row by row. Thus, the main definition is misleading. Please refer to http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html for a more clarifying distinction. — Preceding unsigned comment added by 194.176.222.229 (talk) 09:23, 25 April 2016 (UTC)Reply

Questions on the Benefits section

edit

The Benefits section seems inconsistent when it states:

Comparisons between row-oriented and column-oriented databases are typically concerned with the efficiency of hard-disk access for a given workload, as seek time is incredibly long compared to the other bottlenecks in computers. For example, a Serial ATA (SATA) hard drive has a maximum transfer rate of 600 MB/second (Megabytes per second) [4] while DDR3 SDRAM Memory can reach transfer rates of 17 GB/s (Gigabytes per second).[5]:157–165 Clearly, a major bottleneck in handling big data is disk access.

because it pinpoints seek time as an issue, while the example immediately after is not about seek time, but transfer rate.

Also, if seek time is the problem to be addressed, then the article should mention SSD drives, which have a much lower seek time than hard drives Solid-state_drive, and would mitigate the identified problem.

Frdpnl (talk) 14:08, 13 June 2016 (UTC)Reply

Description of Column-orientation misleading (and potentially wrong)

edit

I read the description and I think it is misleading or wrong. I am no expert but a quick scan through what is available elsewhere on the internet tells me the "reverse indexing" of associating values with rows is not correct but I will need help researching and citing it correctly.

Examples for the table below:

Developer Product Row-or-Column
Oracle Oracle Row
Microsoft SQL Server Row
SAP IQ Column
SAP HANA Hybrid

Basic case: Stored as a column at a time.

edit

It is possible to associate the columns with the rows by just look what the index of the column is. This for me is the most simple column-orientated database (analogy -- in programming a matrix can be represented by an array of arrays - and the choice is which indicates the row and which the column -- do you have an array of rows or do you have an array of columns and from programming perspective the difference is accessing matrix[row][column] or matrix[column][row]).


Implicit Index Developer Product Row-or-Column
1 Oracle Oracle Row
2 Microsoft SQL Server Row
3 SAP IQ Column
4 SAP HANA Hybrid

Seperately sorted dictionary columns:

edit

Column: Developer

Index Value
1 Microsoft
2 Oracle
3 SAP

Column: Product

Index Value
1 HANA
2 IQ
3 Oracle
4 SQL Server

Column: Row-or-column

Index Value
1 Column
2 Hybrid
3 Row

Table:

Developer Product Row-or-column
2 3 3
1 4 3
3 2 1
3 1 2

With virtual IDs

edit

Columns are stored and table is constructed with offsets rather than storing explicit index for each column:

Assume all columns are stored with string length = 20 , the offset column below not stored just provided for information.

Column: Developer

Offset Value
0 Microsoft
+20 Oracle
+40 SAP

Column: Product

Offset Value
0 HANA
+20 IQ
+40 Oracle
+60 SQL Server

Column: Row-or-column

Offset Value
0 Column
+20 Hybrid
+40 Row

Table - store the offset of each value in the virtual id.

Developer Product Row-or-column
20 40 40
0 60 40
40 20 0
40 0 20


The "C" in SCSS

edit

Much as the "S" in PDP-8/S might have been for Serial or for Small, the "C" in SCSS could have been for Conversational or for Columnar/Column-oriented. The marketing money was on Conversational, but the COLMUN aspect was a technical advancement. Pi314m (talk) 20:22, 23 November 2018 (UTC)Reply

Re-write

edit

I have re-written this article to better cover this topic.

The critical aspects are:

  1. Column-oriented and row-oriented are two choices with tradeoffs. This article should not be only about column-oriented, but the tradeoff more generally
  2. data orientation is not a property of a RDBMS - it is a property of the data format. RDBMS _may_ use one or the other, but the property is not specific to a RDBMS. In particular, OLAP query engines like Apache Spark are not RDBMS and need to decide which orientation to use.
  3. The article was written at the time where column-oriented was relatively novel. More recent treaties (see refs) on this aspect (columnar vs row) frame this problem more generally.

The article continues to need some love, but I believe this setup is a better starting point, as it frames this article more generally (and can now be further detailed). Jorgecarleitao (talk) 20:28, 22 May 2024 (UTC)Reply