TR-11-4.pdf

"RCFile: a fast and space-efficient data placement structure in 
MapReduce-based warehouse systems",  

Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, 
and Zhiwei Xu 

Proceedings of International Conference on Data Engineering 
(ICDE 2011), Hannover, Germany, April 11-16, 2011.


Abstract

MapReduce-based data warehouse systems are playing important
roles of supporting big data analytics to understand
quickly the dynamics of user behavior trends and their needs
in typical Web service providers and social network sites (e.g.,
Facebook). In such a system, the data placement structure is
a critical factor that can affect the warehouse performance in
a fundamental way. Based on our observations and analysis
of Facebook production systems, we have characterized four
requirements for the data placement structure: (1) fast data
loading, (2) fast query processing, (3) highly efficient storage
space utilization, and (4) strong adaptivity to highly dynamic
workload patterns. We have examined three commonly accepted
data placement structures in conventional databases, namely row-stores,
column-stores, and hybrid-stores in the context of large
data analysis using MapReduce. We show that they are not very
suitable for big data processing in distributed systems. In this
paper, we present a big data placement structure called RCFile
(Record Columnar File) and its implementation in the Hadoop
system. With intensive experiments, we show the effectiveness
of RCFile in satisfying the four requirements. RCFile has been
chosen in Facebook data warehouse system as the default option.
It has also been adopted by Hive and Pig, the two most widely
used data analysis systems developed in Facebook and Yahoo!