Uday BONDHUGULA

Dept. of Computer Science & Engg.       Phone: +1-614-260-7695
The Ohio State University               Email: bondhugu@cse.ohio-state.edu
2015 Neil Ave. 395DL                    Fax:   +1-614-292-2911
Columbus, OH 43210 USA                  Web:   http://www.cse.ohio-state.edu/~bondhugu
------------------------------------------------------------------------

Research Interests

Parallelizing compilers, Automatic parallelization for multicores, Polyhedral model, Compilation for GPUs, Parallel programming models

Education

- Ph.D., Computer Science & Engineering                       Sep '04 - Aug '08
  The Ohio State University (OSU)                             Columbus, OH
  Advisor: Prof. P. Sadayappan

- Bachelor of Technology, Computer Science & Engineering      Jul 2004
  Indian Institute of Technology (IIT), Madras.               Chennai, India

Professional Experience

- Visiting Researcher                                         Mar 2008 - Apr 2008
  ALCHEMY team
  INRIA Futurs (INRIA Saclay), Ile de France                  Orsay, FRANCE

- Research Intern                                             Jun 2007 - Mar 2008
  Advanced Compilation Technologies
  IBM T.J. Watson Research Center                             Yorktown Heights, NY

- Graduate Research Associate                                 Apr '05 - Jun '07, Oct '07 - Aug '08
  Dept. of CSE, OSU
  Automatic parallelization for multicores, GPUs, Automatic polyhedral transformations for parallelism and locality

- Graduate Teaching Associate                                 Sep 2004 - Mar 2005
  Department of Comp. Sci. & Engg., OSU.
  Instructor for CSE 459.21 'Programming in C', CSE 459.23 'Programming in Java'.

- Summer Intern                                               May 2003 - Jul 2003
  Trilogy Software Inc.                                       Bangalore, India

Conference Publications

1. A Practical Automatic Polyhedral Parallelizer and Locality Optimizer
   Uday Bondhugula, A. Hartono, J. Ramanujam, P. Sadayappan.
   ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08), Jun 2008, Tucson, Arizona.

2. Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model
   Uday Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan.
   International Conference on Compiler Construction (CC), Apr 2008, Budapest, Hungary.

3. Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories
   M. Baskaran, Uday Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan.
   ACM SIGPLAN PPoPP'08, Feb 2008, Salt Lake City, Utah.

4. Automatic Mapping of Nested Loops to FPGAs
   Uday Bondhugula, J. Ramanujam, and P. Sadayappan.
   ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '07), Mar 2007, San Jose, California.

5. Effective Automatic Parallelization of Stencil Computations
   S. Krishnamoorthy, M. Baskaran, Uday Bondhugula, J. Ramanujam, A. Rountev, and P. Sadayappan.
   ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '07), Jun 2007, San Diego, California.

6. Hardware/Software Integration for FPGA-based All-Pairs Shortest-Paths
   Uday Bondhugula, A. Devulapalli, J. Dinan, J. Fernando, P. Wyckoff, E. Stahlberg, and P. Sadayappan.
   IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), Apr 2006, Napa Valley, California.

7. Parallel FPGA-based All-Pairs Shortest-Paths in a Directed Graph
   Uday Bondhugula, A. Devulapalli, J. Fernando, P. Wyckoff, and P. Sadayappan.
   20th IEEE International Parallel and Distributed Processing Symposium (IPDPS), Apr 2006, Rhodes, Greece.

8. High performance RDMA-based All-to-all Broadcast for InfiniBand Clusters
   S. Sur, Uday Bondhugula, A. Mamidala, H.-W. Jin, and D. K. Panda.
   12th IEEE International Conference on High Performance Computing (HIPC '05), Dec 2005.

Research Experience

1. Automatic parallelization for multicores

   - Affine transformations in the polyhedral model capture a complex sequence of execution-reordering loop transformations that can improve performance by parallelization as well as locality enhancement.
     Although a significant body of research has addressed affine scheduling and partitioning, automatically finding good affine transforms for communication-optimized coarse-grained parallelization together with locality optimization for sequences of imperfectly-nested loops has remained challenging. We have developed a new automatic transformation framework that addresses this problem. The approach works by finding good ways of tiling through a powerful and practical linear cost function that simultaneously enables minimization of inter-processor communication and improved reuse at each node. Fusion across a long sequence of loop nests that have a producer/consumer relationship is also enabled. Programs requiring one-dimensional versus multi-dimensional time schedules (with scheduling-based approaches) are all handled with the same algorithm. Synchronization-free parallelism, pipelined parallelism or permutable loops at various levels can be detected. The framework scales very well with input size, is targetable to multiple parallel architectures (general-purpose multicores, Cell, GPGPUs), and has been implemented in a tool, PLUTO.
     [Bondhugula et al., CC 2008, OSU-CISRC-5/07-TR43].

   - PLUTO is a fully automatic source-to-source transformation system that can optimize regular programs for parallelism and locality simultaneously. It is an implementation of our new automatic transformation framework for multicores and uses the LooPo project's frontend, the CLooG code generator and the PIP library. OpenMP parallel code is generated from sequential C input. Through this work, we demonstrated the practicality of analytical model-driven automatic transformation in the polyhedral framework -- far beyond what is possible by current production compilers. Experimental results from the implemented system show very high speedups for local/parallel execution on multicores over state-of-the-art research compiler frameworks as well as native compilers.
     [Bondhugula et al., PLDI 2008, OSU-CISRC-10/07-TR70]
     http://pluto-compiler.sourceforge.net

   - Automatic polyhedral parallelization in XLC/TPO
     IBM T.J. Watson Research Center, Yorktown Heights, NY        Oct 2007 -- present

2. Compilation for Accelerators

   - Compilation and automatic parallelization for GPUs, FPGAs; addressing issues such as mapping under resource constraints, custom processor array design, automatic data movement from explicitly addressable memories, and Verilog/HDL code generation

   - Accelerating the all-pairs shortest-paths problem through a custom processor array on the Cray XD1 FPGA
     [Bondhugula et al., IPDPS'06, FCCM'06]

3. High performance cluster interconnects

   - Optimized collective communication over RDMA-enabled networks: I worked on optimizing MPI all-to-all broadcast over InfiniBand clusters using advanced InfiniBand features like Remote DMA. The new implementation is faster, scales better and is part of the OSU MVAPICH software
     [Sur et al., HIPC '05]

Awards & Honors

- ACM SIGPLAN Professional Activities Committee travel award for PLDI 2008
- All-India Rank 84 (top 0.06%) at the Indian Institutes of Technology Joint Entrance Examination (IIT-JEE) 2000, out of a total of about 1,27,000 candidates.
- Represented state of Andhra Pradesh, India at the Indian National Mathematical Olympiad in 1999.
- National Talent Search Exam (NTSE) scholarship (India) - 1998.

Miscellaneous

- Reviewer for LCPC 2006, PPoPP 2007, ICS 2007, LCPC 2007
- Table Tennis: The Ohio State University team (2007 - short while), Umpire - US College Nationals (NCTTA) 2007, India south-zone zonals (1997, 1998)

References

Available on request.