PGAS and Hybrid MPI+PGAS Programming Models on Modern HPC Clusters with
Accelerators
When: May 31, 2016, Time: 1:30pm-5:30pm, Room - Ayasofya
Where: Istanbul, Turkey
Tutorial slides are available here.
Abstract
Multi-core processors, accelerators (GPGPUs), co-processors (Xeon
Phis) and high-performance interconnects (InfiniBand, 10-40 GigE/iWARP
and RoCE) with RDMA support are shaping the architectures for
next-generation clusters. Efficient programming models to design
applications on these clusters as well as on future exascale systems
are still evolving. The new MPI-3 standard enhances the Remote Memory
Access (RMA) model and introduces non-blocking collectives.
Partitioned Global Address Space (PGAS) models provide an attractive
alternative to the MPI model owing to their easy-to-use global
shared-memory abstractions and light-weight one-sided communication.
At the same time, hybrid MPI+PGAS programming models are gaining
attention as a possible solution for programming exascale systems.
These hybrid models help codes designed with MPI to take advantage of
PGAS models without paying the prohibitive cost of re-designing
complete applications. They also enable hierarchical design of
applications, using the different models to suit modern architectures.
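To make these MPI-3 features concrete, the following minimal C sketch (not taken from the tutorial material; the ring-neighbor pattern is purely illustrative) combines a one-sided MPI_Put through the RMA interface with a non-blocking MPI_Iallreduce that can be overlapped with independent computation:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* MPI-3 RMA: allocate a window of one integer on every rank. */
    int *win_buf;
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);
    *win_buf = 0;

    /* One-sided put: write our rank into the right neighbor's window.
       The target does not post a matching receive. */
    int right = (rank + 1) % size;
    MPI_Win_fence(0, win);
    MPI_Put(&rank, 1, MPI_INT, right, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    /* MPI-3 non-blocking collective: start the reduction, do
       independent work, then wait for completion. */
    int local = *win_buf, global = 0;
    MPI_Request req;
    MPI_Iallreduce(&local, &global, 1, MPI_INT, MPI_SUM,
                   MPI_COMM_WORLD, &req);
    /* ... independent computation could be overlapped here ... */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("sum of received ranks = %d\n", global);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}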
In this tutorial, we provide an overview of the research and
development taking place on these programming models (MPI, PGAS and
hybrid MPI+PGAS) and discuss the opportunities and challenges in
designing the associated runtimes as we head toward exascale
computing with accelerator-based systems. We start with an in-depth
overview of modern system architectures with multi-core processors,
GPU accelerators, Xeon Phi co-processors and high-performance
interconnects. We present an overview of the new MPI-3 RMA model,
language-based (UPC and CAF) and library-based (OpenSHMEM, UPC++) PGAS
models. We introduce MPI+PGAS hybrid programming models and the
associated unified runtime concept. We examine and contrast the
challenges in designing high-performance MPI-3-compliant, OpenSHMEM
and hybrid MPI+OpenSHMEM runtimes for both host-based and
accelerator-based (GPU and MIC) systems. We present case studies
using application kernels to demonstrate how one can exploit hybrid
MPI+PGAS programming models to achieve better performance without
rewriting the complete code. Using the publicly available MVAPICH2-X,
MVAPICH2-GDR and MVAPICH2-MIC libraries, we present the challenges and
opportunities to design efficient MPI, PGAS and hybrid MPI+PGAS
runtimes for next-generation systems. We introduce the concept of
'CUDA-Aware MPI/PGAS' to combine high productivity and high
performance. We present how to take advantage of GPU features such as
Unified Virtual Addressing (UVA), CUDA-IPC and GPUDirect RDMA to
design efficient MPI, OpenSHMEM and Hybrid MPI+OpenSHMEM
runtimes. Similarly, using the MVAPICH2-MIC runtime, we present
optimized data-movement schemes for different system configurations,
including multiple MICs per node on the same socket and/or across
different sockets.
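As a concrete illustration of the CUDA-Aware concept described above, the sketch below passes a GPU device pointer directly to MPI_Send/MPI_Recv; a CUDA-aware runtime such as MVAPICH2-GDR then moves the data internally using UVA, CUDA-IPC or GPUDirect RDMA as appropriate. This is a minimal sketch assuming two ranks and a CUDA-aware MPI build; with a non-CUDA-aware library, the data would first have to be staged through host memory with cudaMemcpy.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    float *d_buf;                          /* device memory, not host */
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    if (rank == 0) {
        /* ... a kernel would fill d_buf here ... */
        /* CUDA-Aware MPI: the device pointer goes straight into
           MPI_Send; no explicit host staging buffer is needed. */
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}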
Objectives
HPC systems are marked by the use of multi-core processors,
accelerators (GPGPUs), co-processors (Xeon Phis) and high-performance
interconnects (InfiniBand, 10-40 GigE/iWARP and RoCE) with RDMA
support. Efficient programming models to design applications on these
clusters, as well as on future exascale systems, are still
evolving. However, programming models, runtimes and associated
application designs are not yet taking full advantage of these
trends. Highlighting these emerging trends and the associated
challenges, this tutorial is proposed with the following goals:
- Teach designers, developers and users how to efficiently design
and use parallel programming models (MPI and PGAS) and accelerators (GPU and
MIC)
- Guide scientists, engineers, researchers and students engaged in
designing next-generation HPC systems and applications
- Help newcomers to the field of HPC and exascale computing to
understand the concepts and designs of parallel programming models,
accelerators, networking, and RDMA
- Demonstrate, through case studies with representative benchmarks
and applications, the impact that advanced optimizations and tuning of
middleware can have on application performance
The content level will be as follows: 30% beginner, 40% intermediate,
and 30% advanced. There is no fixed prerequisite: attendees with a
general knowledge of high-performance computing, networking,
programming models, parallel applications, and related issues will be
able to follow and appreciate the material. The tutorial is designed
so that attendees are exposed to the topics in a smooth and
progressive manner.
Outline of the Tutorial
- Overview of the Modern HPC System Architectures
- Multi-core Processors
- High Performance Interconnects (InfiniBand, 10GigE/iWARP and
RDMA over Converged Enhanced Ethernet (RoCE))
- Heterogeneity with Accelerators (GPUs) and Coprocessors (Xeon Phis)
- Introduction to MPI and Partitioned Global Address Space
(PGAS) Programming Models
- MPI-3 Features including RMA and Non-blocking
collectives
- Library-based Models: Case Study with OpenSHMEM
- Language-based Models: Case Study with UPC
- Overview of MPI+PGAS Hybrid Programming Models and
Benefits
- Challenges and Opportunities in Designing Scalable and
High Performance Runtimes (MPI, PGAS and Hybrid MPI+PGAS) on Host-based
Modern Systems
- Application-level Case Studies for using Hybrid MPI+PGAS Models (a
minimal hybrid code sketch follows this outline)
- Challenges and Opportunities in Designing Scalable and
High Performance Runtimes (MPI, PGAS and Hybrid MPI+PGAS) on GPU Clusters
- Overview of CUDA-Aware Concept
- Designing Efficient MPI Runtime for GPU Clusters
- Designing Efficient OpenSHMEM Runtime for GPU Clusters
- Challenges and Opportunities in Designing Scalable and
High Performance Runtimes (MPI, PGAS and Hybrid MPI+PGAS) on MIC Clusters
- Designing Efficient MPI Runtime for Intel MIC Clusters
- Designing Efficient OpenSHMEM Runtime for Intel MIC Clusters
- Conclusion and Q&A
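As referenced in the outline above, here is a minimal hybrid MPI+OpenSHMEM sketch in C. It assumes a unified runtime (such as MVAPICH2-X) that allows MPI and OpenSHMEM calls to be mixed in a single program; the ring-neighbor exchange is purely illustrative, and the exact initialization order may be runtime-specific.

#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* With a unified runtime (e.g. MVAPICH2-X), MPI and OpenSHMEM
       share one job: initialize both models. */
    MPI_Init(&argc, &argv);
    shmem_init();

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Symmetric-heap allocation: the same object exists on every PE. */
    int *flag = (int *)shmem_malloc(sizeof(int));
    *flag = 0;
    shmem_barrier_all();

    /* PGAS-style one-sided put to the right neighbor: the target
       posts no matching receive. */
    int right = (rank + 1) % size;
    shmem_int_p(flag, rank, right);
    shmem_barrier_all();

    /* MPI collective over the values delivered via OpenSHMEM. */
    int sum = 0;
    MPI_Allreduce(flag, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of received ranks = %d\n", sum);

    shmem_free(flag);
    shmem_finalize();
    MPI_Finalize();
    return 0;
}

The point of the unified-runtime design discussed in the tutorial is that both models share one communication substrate, so a hybrid program of this kind does not pay for two independent sets of network resources.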
Brief Biography of Speakers
Dr. Dhabaleswar K. (DK)
Panda is a Professor and University Distinguished Scholar
of Computer Science at the Ohio State
University. He obtained his Ph.D. in computer engineering from the
University of Southern California. His research interests include
parallel computer architecture, high performance computing,
communication protocols, file systems, network-based computing, and
Quality of Service. He has published over 350 papers in major journals
and international conferences related to these research
areas. Dr. Panda and his research group members have been doing
extensive research on modern networking technologies including
InfiniBand, HSE and RDMA over Converged Enhanced Ethernet (RoCE). His
research group is currently collaborating with National Laboratories
and leading InfiniBand and 10-40GigE/iWARP companies on designing various
subsystems of next generation high-end systems. The MVAPICH2 (High Performance
MPI over InfiniBand, iWARP and RoCE) and MVAPICH2-X (Hybrid MPI
and PGAS (OpenSHMEM, UPC, CAF and UPC++)) software packages, developed
by his research group, are currently being used by more than 2,600
organizations worldwide (in 81 countries). This software has enabled
several InfiniBand clusters (including the 10th-ranked system) to enter the
latest TOP500 ranking. These software packages are also available with
the Open Fabrics stack for network vendors (InfiniBand and iWARP),
server vendors and Linux distributors. Dr. Panda's research is
supported by funding from the US National Science Foundation, the US
Department of Energy, and several industry partners including Intel,
Cisco, Sun, Mellanox, QLogic, NVIDIA and NetApp. He is an IEEE Fellow
and a
member of ACM. More details about Dr. Panda, including a
comprehensive CV and list of publications, are available
here.
Khaled
Hamidouche is a Senior Research Associate in the Department of
Computer Science and Engineering at The Ohio State University. He is a
member of the Network-Based Computing Laboratory led by
Dr. D. K. Panda. His research interests include high-performance
interconnects, parallel programming models, accelerator computing and
high-end computing applications. His current focus is on designing
high performance unified MPI, PGAS and hybrid MPI+PGAS runtimes for
InfiniBand clusters and their support for accelerators. Dr. Hamidouche
is involved in the design and development of the popular MVAPICH2
library and its derivatives MVAPICH2-MIC, MVAPICH2-GDR and
MVAPICH2-X. He has published over 45 papers in international journals
and conferences related to these research areas. He has been actively
involved in various professional activities in academic journals and
conferences. He is a member of ACM. More details about Dr. Hamidouche
are available
here.
Last Updated: May 30, 2016