Albert Hartono
Dept. of Computer Science and Engineering
Dreese Labs
2015 Neil Ave., 395
Columbus, OH 43210
(614)370-5269
hartonoa@cse.ohio-state.edu
http://www.cse.ohio-state.edu/~hartonoa

RESEARCH INTERESTS
Compile-time and run-time optimizations for High Performance Computing (HPC)
Loop transformations for locality and parallelism
Automation of empirical performance tuning of parallel applications
Parallel programming models

EDUCATION
2003-2009: Ph.D. Computer Science and Engineering, Ohio State University, Columbus, OH.
2001-2003: M.S. Computer Science, Indiana University, Bloomington, IN.
1996-2000: B-Tech. Electrical Engineering, Trisakti University, Jakarta, Indonesia.

EXPERIENCE
Su 2003-Au 2009: Graduate Research Associate, Ohio State University, Columbus, OH.
    Department of Computer Science and Engineering
    Advisor: Prof. P. Sadayappan
Su 2007: Research Intern, IBM Almaden Research Center, San Jose, CA.
    GPFS (General Parallel File System) team
    Supervisor: John Palmer
Au 2006-Sp 2008: Graduate Research Co-op, Argonne National Laboratory, Argonne, IL.
Su 2006: Wallace Givens Research Fellow, Argonne National Laboratory, Argonne, IL.
    Mathematics and Computer Science Division
    Supervisor: Dr. Boyana Norris
Su 2004: Software Engineer Intern, Bell Laboratories, Lucent Technologies, Murray Hill, NJ.
    Lucent nmake Product Builder, Software Technology Center
    Supervisor: Gary M. Selzer
2002-2003: Associate Instructor, Indiana University, Bloomington, IN.
    Department of Computer Science
1998-2000: Laboratory Assistant, Trisakti University, Jakarta, Indonesia.
    Department of Electrical Engineering
1998-1999: Teaching Assistant, Trisakti University, Jakarta, Indonesia.
    Department of Electrical Engineering

JOURNAL PUBLICATIONS
Performance Optimization of Tensor Contraction Expressions for Many Body Methods in Quantum Chemistry.
Albert Hartono, Qingda Lu, Thomas Henretty, Sriram Krishnamoorthy, Huaijian Zhang, Gerald Baumgartner, David Bernholdt, Marcel Nooijen, Russell Pitzer, J. Ramanujam, and P. Sadayappan.
Journal of Physical Chemistry A. (accepted)

BOOK CHAPTER
Annotations for Productivity and Performance Portability.
Boyana Norris, Albert Hartono, and William Gropp.
Petascale Computing: Algorithms and Applications. Computational Science. Chapman and Hall / CRC Press, Taylor and Francis Group, 2007.

REFEREED CONFERENCE PUBLICATIONS
Parametric Tiled Loop Generation for Effective Parallel Execution on Multicore Processors.
Albert Hartono, Muthu Manikandan Baskaran, J. Ramanujam, and P. Sadayappan.
In IEEE International Parallel and Distributed Processing Symposium (IPDPS), April 2010, Atlanta, Georgia. (submitted)

Parameterized Tiling Revisited.
Muthu Manikandan Baskaran, Albert Hartono, Thomas Henretty, J. Ramanujam, and P. Sadayappan.
In IEEE/ACM International Symposium on Code Generation and Optimization (CGO), April 2010, Toronto, Canada. (accepted)

Parametric Multi-Level Tiling of Imperfectly Nested Loops.
Albert Hartono, Muthu Manikandan Baskaran, Cedric Bastoul, Albert Cohen, Sriram Krishnamoorthy, Boyana Norris, J. Ramanujam, and P. Sadayappan.
In ACM International Conference on Supercomputing (ICS), June 2009, Yorktown Heights, New York.

Generating Empirically Optimized Composed Matrix Kernels from MATLAB Prototypes.
Boyana Norris, Albert Hartono, Elizabeth Jessup, and Jeremy Siek.
In International Conference on Computational Science (ICCS), May 2009, Baton Rouge, Louisiana.

Annotation-Based Empirical Performance Tuning Using Orio.
Albert Hartono, Boyana Norris, and P. Sadayappan.
In IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2009, Rome, Italy.

A Polyhedral Framework for Automatic Parallelization and Locality Optimization.
Uday Bondhugula, Muthu Baskaran, Albert Hartono, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan.
In Workshop on Compilers for Parallel Computing (CPC), January 2009, Zurich, Switzerland.

A Practical Automatic Polyhedral Parallelizer and Locality Optimizer.
Uday Bondhugula, Albert Hartono, J. Ramanujam, and P. Sadayappan.
In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), June 2008, Tucson, Arizona.

Towards Effective Automatic Parallelization for Multicore Systems.
Uday Bondhugula, Muthu Baskaran, Albert Hartono, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan.
In IPDPS Workshop on Next Generation Software (NGS), April 2008, Miami, Florida.

Designing High Performance and Scalable MPI Intra-Node Communication Support for Clusters.
Lei Chai, Albert Hartono, and Dhabaleswar K. Panda.
In IEEE International Conference on Cluster Computing (CLUSTER), September 2006, Barcelona, Spain.

Identifying Cost-Effective Common Subexpressions to Reduce Operation Count in Tensor Contraction Evaluations.
Albert Hartono, Qingda Lu, Xiaoyang Gao, Sriram Krishnamoorthy, Marcel Nooijen, Gerald Baumgartner, Venkatesh Choppella, David E. Bernholdt, Russell M. Pitzer, J. Ramanujam, Atanas Rountev, and P. Sadayappan.
In International Conference on Computational Science (ICCS), May 2006, Reading, UK.

Automated Operation Minimization of Tensor Contraction Expressions in Electronic Structure Calculations.
Albert Hartono, Alexander Sibiryakov, Marcel Nooijen, Gerald Baumgartner, David E. Bernholdt, So Hirata, Chi-Chung Lam, Russell M. Pitzer, J. Ramanujam, and P. Sadayappan.
In International Conference on Computational Science (ICCS), May 2005, Atlanta, Georgia.

RESEARCH REPORTS
PrimeTile: A Parametric Multi-Level Tiler for Imperfect Loop Nests.
Albert Hartono, Muthu Manikandan Baskaran, Cedric Bastoul, Albert Cohen, Sriram Krishnamoorthy, Boyana Norris, J. Ramanujam, and P. Sadayappan.
Research Report OSU-CISRC-2/09-TR04, Computer Science and Engineering Department, Ohio State University, February 2009.
Generating Empirically Optimized Composed Matrix Kernels from MATLAB Prototypes.
Boyana Norris, Albert Hartono, Elizabeth Jessup, and Jeremy Siek.
Preprint ANL/MCS-P1581-0209, Mathematics and Computer Science Division, Argonne National Laboratory, February 2009.

Annotation-Based Empirical Performance Tuning Using Orio.
Albert Hartono, Boyana Norris, and P. Sadayappan.
Preprint ANL/MCS-P1556-1108, Mathematics and Computer Science Division, Argonne National Laboratory, November 2008.

Annotations for Productivity and Performance Portability.
Boyana Norris, Albert Hartono, and William Gropp.
Preprint ANL/MCS-P1392-0107, Mathematics and Computer Science Division, Argonne National Laboratory, January 2007.

Automated Operation Minimization of Tensor Contraction Expressions in Electronic Structure Calculations.
Albert Hartono, Alexander Sibiryakov, Marcel Nooijen, Gerald Baumgartner, David E. Bernholdt, So Hirata, Chi-Chung Lam, Russell M. Pitzer, J. Ramanujam, and P. Sadayappan.
Research Report OSU-CISRC-2/05-TR10, Computer Science and Engineering Department, Ohio State University, February 2005.

SOFTWARE
PrimeTile: A Parametric Multi-Level Tiler for Imperfect Loop Nests.
    http://primetile.sourceforge.net/
Orio: An Annotation-Based Empirical Performance Tuning Framework.
    http://trac.mcs.anl.gov/projects/performance/wiki/Orio

SELECTED PROJECTS
Loop transformations for data locality and parallelism
Tiling is a crucial loop transformation for generating high-performance code on modern architectures; it aims to maximize data locality and exploit coarse-grained parallelism. An automatic source-to-source transformation tool, Pluto, has been developed to generate parallel OpenMP tiled code from sequential, untiled code. However, Pluto-generated tiled code uses fixed tile sizes. Tiled code with variable (parametric) tile sizes is important for autotuning systems such as ATLAS, because the tile sizes can then be tuned without regenerating the code. Multi-level tiled code is important for increasing data reuse in deep memory hierarchies.
A code generator, called PrimeTile, has been developed to generate efficient parametric multi-level tiled loop nests. Experimental results show significant speedups for both sequential and parallel execution over code produced by the best production compilers.

Empirical performance tuning using annotations
An extensible annotation-based empirical performance tuning tool, called Orio, has been developed to achieve both performance and productivity in scientific computations. Users annotate the source code with performance hints that trigger a set of low-level optimizations. Given the values of various performance parameters, Orio generates many tuned code versions, empirically evaluates the performance of selected versions, and finally chooses the best-performing code. The Orio framework has a clean design and can easily be extended with external code optimization tools. Experiments show that Orio can deliver performance improvements when used alone or in conjunction with other transformation tools.

Operation minimization of tensor contraction expressions
Complex tensor expressions arising in many science and engineering domains have many algebraically equivalent forms, which can differ drastically in the number of arithmetic operations required to evaluate them. However, finding an operation-minimal form of a tensor expression is NP-hard. We have developed effective operation minimization techniques for tensor expressions using algebraic transformations and common-subexpression elimination. Using our automated operation minimization tool, scientists can find new, more efficient formulations of tensor contraction expressions.

High-performance cluster interconnects
We have designed and implemented an optimized MPI intra-node communication scheme for NUMA and multicore systems, based on user-space memory copies. Our approach makes efficient use of the L2 cache and significantly reduces memory consumption.
The implementation is faster, more memory-efficient, and more scalable, and it is currently part of the MVAPICH release.

High-performance parallel file systems
I designed and implemented a lightweight performance analysis tool to track and visualize lock contention inside IBM's General Parallel File System (GPFS). Using the tool, I identified both well-behaved and poorly behaved mutexes, the latter resulting from poor coding practices.

Automatic dependency-based Java build support
I implemented an extension to Lucent nmake Product Builder that uses JavaDeps, an open-source tool, to automatically extract implicit build dependencies from Java source code containing the new J2SE 1.5 language constructs.

AWARDS AND HONORS
2009: Travel award, International Conference on Supercomputing (ICS)
2009: Travel award, International Parallel and Distributed Processing Symposium (IPDPS)
2000: Ranked 4th among the ten best graduates of the Electrical Engineering Department, Trisakti University, Jakarta, Indonesia.
1999: Undergraduate engineering scholarship, Trisakti University, Jakarta, Indonesia.

REFERENCES
Dr. P. Sadayappan, Professor
    Ohio State University
    saday@cse.ohio-state.edu
Dr. Boyana Norris, Computer Scientist
    Argonne National Laboratory
    norris@mcs.anl.gov
Dr. J. Ramanujam, Professor
    Louisiana State University
    jxr@ece.lsu.edu
Dr. John Palmer, Research Staff Member and Manager (GPFS - General Parallel File System)
    IBM Almaden Research Center
    jpalmer@almaden.ibm.com
Gary M. Selzer, Manager (Lucent nmake Product Builder)
    Bell Laboratories, Lucent Technologies (now Alcatel-Lucent)
    gmselzer@alcatel-lucent.com