TR-13-5.pdf

"Understanding insights into the basic structure and essential issues 
of table placement methods in clusters" 

Yin Huai, Siyuan Ma, Rubao Lee, Owen O'Malley, and Xiaodong Zhang, 

Proceedings of the 39th International Conference on Very Large Data Bases
(VLDB 2013), Riva del Garda, Trento, Italy, August 26-30, 2013.
(This paper will be presented at VLDB 2014 in Hangzhou, China.)

Abstract

A table placement method is a critical component in big data analytics
on distributed systems. It determines how data values
in a two-dimensional table are organized and stored in the underlying
cluster. Based on Hadoop computing environments, several
table placement methods have been proposed and implemented.
However, no comprehensive and systematic study has yet been done to
understand, compare, and evaluate different table placement methods.
Thus, it is highly desirable to gain important insights
into the basic structure and essential issues of table placement
methods in the context of big data processing infrastructures.
In this paper, we present such a study. The basic structure of
a table placement method consists of three core operations: row
reordering, table partitioning, and data packing. All existing
placement methods are built from these core operations, and they differ
in three key factors: (1) the size of a horizontal
logical subset of a table (or the size of a row group), (2) the function
of mapping columns to column groups, and (3) the function
of packing columns or column groups in a row group into physical
blocks. We have designed and implemented a benchmarking tool
to provide insights into how variations of each factor affect the I/O
performance of reading data of a table stored by a table placement
method. Based on our results, we suggest actions to optimize
table reading performance. Results from large-scale experiments
have also confirmed that our findings are valid for production
workloads. Finally, we present ORC File as a case study to show
the effectiveness of our findings and suggested actions.
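
The three core operations and three key factors named in the abstract can be illustrated with a minimal in-memory sketch. This is not the paper's benchmarking tool or any real file format: the function `place_table` and its parameters `row_group_size`, `column_groups`, `block_of`, and `sort_key` are hypothetical names chosen for illustration, and "physical blocks" are modeled as plain dictionary entries.

```python
from collections import defaultdict


def place_table(rows, row_group_size, column_groups, block_of, sort_key=None):
    """Toy table placement: returns {block_id: [(row_group, column_group, values)]}.

    All names here are illustrative assumptions, not an API from the paper.
    """
    # Row reordering (optional): reorder rows before they are partitioned.
    if sort_key is not None:
        rows = sorted(rows, key=sort_key)

    blocks = defaultdict(list)
    # Table partitioning, horizontal step: cut the table into row groups.
    # Factor 1 is the row group size.
    for rg, start in enumerate(range(0, len(rows), row_group_size)):
        row_group = rows[start:start + row_group_size]
        # Table partitioning, vertical step: split the row group by column group.
        # Factor 2 is the mapping from columns to column groups.
        for cg, cols in enumerate(column_groups):
            values = [tuple(row[c] for c in cols) for row in row_group]
            # Data packing: decide which physical block stores this piece.
            # Factor 3 is the packing function.
            blocks[block_of(rg, cg)].append((rg, cg, values))
    return dict(blocks)


rows = [(3, "c", 3.0), (1, "a", 1.0), (2, "b", 2.0), (4, "d", 4.0)]

# Layout A: rows sorted on column 0, one column per column group,
# all column groups of a row group packed into the same block.
layout_a = place_table(
    rows,
    row_group_size=2,
    column_groups=[[0], [1], [2]],
    block_of=lambda rg, cg: rg,
    sort_key=lambda r: r[0],
)

# Layout B: no reordering, columns 0 and 1 grouped together,
# each column group of each row group packed into its own block.
layout_b = place_table(
    rows,
    row_group_size=2,
    column_groups=[[0, 1], [2]],
    block_of=lambda rg, cg: (rg, cg),
)
```

Changing only `row_group_size`, `column_groups`, or `block_of` yields a different layout from the same three operations, which is the sense in which the abstract describes existing methods as variations on one basic structure.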