TR-04-7.pdf

"Adaptive memory allocations in clusters to handle unexpectedly large
data-intensive jobs",

Li Xiao, Songqing Chen, and Xiaodong Zhang

IEEE Transactions on Parallel and Distributed Systems, Vol. 15, No. 7,
2004, pp. 577-592.

Abstract

In a cluster system with dynamic load sharing support, a job submission
or migration to a workstation is determined by the availability of CPU
and memory resources of the workstation at the time. In such a system,
a small number of running jobs with unexpectedly large memory allocation
requirements may significantly increase the queuing delays of the
remaining jobs with normal memory requirements, slowing down the execution
of each individual job and decreasing system throughput.  We call
this phenomenon the job blocking problem because the big jobs block
the execution of the majority of jobs in the cluster. Since the memory
demand of a job may not be known in advance and may change dynamically,
unsuitable job submissions or migrations that cause the blocking
problem are likely, and existing load sharing schemes are unable
to handle this problem effectively.  We propose two schemes to address
this problem.  The first scheme, network RAM supported load sharing,
combines job migrations with network RAM: it uses remote execution to
initially allocate a job to the most lightly loaded workstation and, if
necessary, network RAM to provide the job with a global memory space larger
than would otherwise be available.  This scheme has the merits of both
job migrations and network RAM. Our experiments show its effectiveness
and scalability.  However, this scheme requires a network RAM facility in
the cluster, which may cause additional overhead and increase cluster
network traffic.  To address this limitation, we propose a second
scheme, memory reservation incorporated with dynamic load sharing,
which adaptively reserves a small set of workstations to provide special
services to the jobs demanding large memory allocations.  As soon as
the blocking problem is resolved by the memory reservation scheme,
the system will adaptively switch back to the normal load sharing state.

Both schemes target large data-intensive jobs in clusters
and are mutually complementary. The network RAM supported load sharing
scheme can fully utilize the cluster's global memory space, while the memory
reservation scheme has the advantages of simple implementation and low
overhead.  Thus, both can be effective alternatives that are practical
to deploy in cluster computing under different system conditions.
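
The placement decision behind the first scheme can be illustrated with a
minimal sketch. It is not the authors' implementation: the Workstation
record, the place_job function, and the use of the running-job count as the
load measure are all hypothetical names and assumptions chosen for
illustration. The sketch picks the most lightly loaded workstation and
reports how much memory would have to come from network RAM.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Workstation:
    # Hypothetical record; the abstract does not define these fields.
    name: str
    free_memory_mb: int  # idle memory currently available on this node
    running_jobs: int    # assumed stand-in for CPU load


def place_job(nodes: List[Workstation], demand_mb: int) -> Tuple[Workstation, int]:
    """Return the chosen node and the memory (MB) to borrow from other nodes."""
    if not nodes:
        raise ValueError("cluster has no workstations")
    # Most lightly loaded node first; ties go to the node with more free memory.
    target = min(nodes, key=lambda n: (n.running_jobs, -n.free_memory_mb))
    # Whatever the local node cannot hold must be served by remote memory.
    shortfall = max(0, demand_mb - target.free_memory_mb)
    remote_free = sum(n.free_memory_mb for n in nodes if n is not target)
    if shortfall > remote_free:
        raise MemoryError("demand exceeds the cluster's global free memory")
    return target, shortfall
```

A job whose demand fits in the chosen workstation's free memory yields a
shortfall of zero, which corresponds to plain remote execution with no
network RAM involved.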