Column-oriented DBMS

From Wikipedia, the free encyclopedia

A column-oriented DBMS is a database management system (DBMS) which stores its content by column rather than by row. This has advantages for read-mostly databases such as data warehouses and library catalogues.

Fundamentally, a relational database management system (RDBMS) can be implemented according to one of two storage architectures. The most common of these (and found in Oracle, IBM DB2 and Microsoft SQL Server) is the "row store" - so named, because it stores a table's data in row-wise fashion. To illustrate, consider the table, Employees, shown below:

Empid Lastname Firstname Salary
1 Smith Joe 40000
2 Jones Mary 50000
3 Johnson Cathy 44000

This simple table includes an employee identifier (EmpId), name fields (Lastname and Firstname) and a salary (Salary). A rowstore implementation of an RDBMS might store the contents of this table in a file as follows:

      1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;

This file stores every attribute of a given row in sequence, with the last entry of one row followed by the first entry of the next.

An alternative storage architecture to the "row store" is the "column store". Example column stores include MonetDB, C-Store (and its commercialization, Vertica), Sybase IQ and BigTable. Whereas a row store stores data row-wise, a column store stores data column-wise as in the following representation of the aforementioned Employees table:

      1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000

In other words, the table is stored one column at a time, with the end of one column followed by the beginning of the next. Though not strictly required, a column store typically stores each column's worth of data in a separate file. Analagously, a row store is frequently partitioned so that a single table is divided into multiple files.

In practice, row store architectures are well-suited for OLTP-like workloads which are more heavily loaded with update transactions (writes) rather than queries (reads). This is because row stores are extremely "write friendly", in that adding a row of data to a table requires a simple file appending I/O. On the other hand, column stores are well-suited for OLAP (data warehouse)-like workloads. Such workloads can be characterized as "read-mostly", primarily because of the relatively high cost of queries (which might perform complex analysis of several TB of data) vs transactions (which primarily consist of simple insertions). For such workloads, column stores are advantageous because:

  1. Queries tend to require only a few columns out of many contained in a table. Whereas a row store must read the entire table, a column store can confine its reads to the columns required.
  2. Column data, being of uniform type, is much easier to compress than row data. Thus, the actual disk reads required to read the columns required for a query may be far less than the size of the raw data in these columns.

Examples of column-oriented DBMSs include: