RCFile

From Wikipedia, the free encyclopedia

RCFile (Record Columnar File) is a data placement structure that determines how to store relational tables on computer clusters. It is designed for systems using the MapReduce framework. The RCFile structure is a systematic combination of multiple components including data storage format, data compression approach, and optimization techniques for data reading. It is able to meet all the four requirements of data placement: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) a strong adaptivity to dynamic data access patterns.

RCFile is a result of basic research with collaborative efforts from Facebook, Ohio State University, and Institute of Computing Technology, Chinese Academy of Sciences. A research paper entitled “RCFile: a Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse systems”[1] was published and presented in ICDE’ 11. The data placement structure and its implementation presented in the paper have been widely adopted in the open source community, big data analytics industries, and application users. See the section of Impacts.

Summary

Data storage format

In a relational database, data is organized as two-dimensional tables. For example, a table in a database consists of 4 columns (c1 to c4):

c1 c2 c3 c4
11 12 13 14
21 22 23 24
31 32 33 34
41 42 43 44
51 52 53 54

To serialize the table, RCFile partitions this table first horizontally and then vertically, instead of only partitioning the table horizontally like the row-oriented DBMS (row-store). The horizontal partitioning will first partition the table into multiple row groups based on the row-group size, which is a user-specified value determining the size of each row group. For example, the table mentioned above can be partitioned to two row groups.

Row Group 1
c1 c2 c3 c4
11 12 13 14
21 22 23 24
31 32 33 34
Row Group 2
c1 c2 c3 c4
41 42 43 44
51 52 53 54


Then, in every row group, RCFile partitions the data vertically like column-store. Thus, the table will be serialized as:

      Row Group 1   Row Group 2 
      11, 21, 31;     41, 51;
      12, 22, 32;     42, 52;
      13, 23, 33;     43, 53;
      14, 24, 34;     44, 54;

Column data compression

Within each row group, columns are compressed to reduce storage space usage. Since data of a column are stored adjacently, the pattern of a column can be detected and thus the suitable compression algorithm can be selected for a high compression ratio.

Performance Benefits

In data warehousing systems, column-store is more efficient when a query only projects a subset of columns, because column-store only read necessary columns from disks but row-store will read an entire row. In MapReduce-based systems, data is normally stored on a distributed system, such as Hadoop Distributed File System (HDFS), and different data blocks might be stored in different machines. Thus, for column-store on MapReduce, different groups of columns might be stored on different machines, which introduces extra network costs when a query projects columns placed on different machines. For MapReduce-based systems, the merit of row-store is that there is no extra network costs to construct a row in query processing, and the merit of column-store is that there is no unnecessary local I/O costs when read data from disks.

RCFile combines merits of row-store and column-store via horizontal-vertical partitioning. With horizontal partitioning, RCFile places all columns of a row in a single machine and thus can eliminate the extra network costs when constructing a row. With vertical partitioning, for a query, RCFile will only read necessary columns from disks and thus can eliminate the unnecessary local I/O costs. Moreover, in every row group, data compression can be done by using compression algorithms used in column-store.

Impacts

RCFile has been adopted in major real-world systems for big data analytics. Here is a list of representative examples.

  1. RCFile has become the default data placement structure in Facebook's production Hadoop cluster.[2] This is so far the world's largest Hadoop cluster,[3] where 40 terabytes compressed data sets are added every day.[4] In addition, all the data sets stored in HDFS before RCFile have also been transformed to use the RCFile structure.[2]
  2. RCFile has been adopted in Apache Hive (since v0.4),[5] which is an open source data store system running on top of Hadoop and is being widely used in various companies around the world,[6] including several major Internet services, such as Facebook, Taobao, and Netflix.[7]
  3. RCFile has been adopted in Apache Pig (since v0.7),[8] which is another open source data processing system being widely used in many organizations,[9] including several major Web service providers, such as Twitter, Yahoo, Linkedin, AOL, and Salesforce.com.
  4. RCFile has become the de facto standard data storage structure in Hadoop software environment supported by HCatalog project (formerly known as Howl) that is the table and storage management service for Hadoop.[10] RCFile is also supported by the open source Elephant Bird library that is used in production in Twitter for daily data analytics.[11]

See also

References

External links

This article is issued from Wikipedia. The text is available under the Creative Commons Attribution/Share Alike; additional terms may apply for the media files.