TR-11-8.pdf

"DOT: a matrix model for analyzing, optimizing and deploying software for 
big data analytics in distributed systems",  

Yin Huai, Rubao Lee, Simon Zhang, Cathy H. Xia, and Xiaodong Zhang 

Proceedings of 2nd ACM Symposium on Cloud Computing 
(SOCC 2011), Cascais, Portugal, October 27-28, 2011.  


Abstract

Traditional parallel processing models, such as BSP, are
"scale up" based, aiming to achieve high performance by
increasing computing power, interconnection network bandwith, 
and memory/storage capacity within dedicated systems, 
while big data analytics tasks aiming for high throughput 
demand that large distributed systems "scale out" by
continuously adding computing and storage resources through
networks. Each one of the "scale up" model and "scale out"
model has a different set of performance requirements and
system bottlenecks. In this paper, we develop a general
model that abstracts critical computation and communication 
behavior and computation-communication interactions
for big data analytics in a scalable and fault-tolerant manner. 
Our model is called DOT, represented by three matrices 
for data sets (D), concurrent data processing operations
(O), and data transformations (T), respectively. With the
DOT model, any big data analytics job execution in various
software frameworks can be represented by a specific or
non-specific number of elementary/composite DOT blocks,
each of which performs operations on the data sets, stores
intermediate results, makes necessary data transfers, and
performs data transformations in the end. The DOT model
achieves the goals of scalability and fault-tolerance by 
enforcing a data-dependency-free relationship among concurrent
tasks. Under the DOT model, we provide a set of optimization 
guidelines, which are framework and implementation  
independent, and applicable to a wide variety of big
data analytics jobs. Finally, we demonstrate the effectiveness
of the DOT model through several case studies.