Research Interests

My research interests cover different aspects of PC cluster technology (architecture, system software, and applications) and the intersection of cluster architecture and Grid computing.

One important focus of my investigation is how to integrate terabyte-sized disk storage built from off-the-shelf components into a parallel machine with supercomputer performance and enhanced I/O capabilities. The node parallelism of cluster architectures presents a unique opportunity to increase file access performance when combined with a well-designed striped file system that can exploit parallel disk access. Another focus of my research is the investigation of mechanisms to increase the efficiency of data transport between remote clusters over wide area networks. The purpose of this project is to overcome the performance limitations that data-intensive Grid applications currently encounter when accessing remote data repositories. On the application side, I am studying computationally intensive Computational Biology problems involving large amounts of data, such as genome assemblies, whole-genome comparisons, and tools for high-sensitivity queries to biological databases.

Many of our experiments start on a small cluster (called Datacluster) that Dr. Srini Parthasarathy and I use for our research. For larger runs we use the Ohio Supercomputer Center machines, particularly the larger Itanium2 cluster. Our Myrinet-interconnected cluster consists of eight nodes plus one file server. Each node is a Dell PowerEdge 1400 equipped with two 1 GHz Pentium III processors, 1 GB of RAM, an 18 GB SCSI disk, and two 60 GB IBM Deskstar 75GXP IDE disks connected to a 3Ware controller in a RAID 0 configuration (hardware striping). The file server has the same features, except that it has 2 GB of RAM, three 40 GB SCSI disks, and no IDE disks. We are currently running the Red Hat distribution of Linux on the cluster. A dedicated 1 Gb/s link to the Ohio Supercomputer Center is used for high-throughput data transfer and Grid computing experiments.

Current Projects

High Performance I/O

Today several TBs of disk storage can be easily added to a cluster for less than $1,000/TB by installing a few IDE disks on each node. Besides the low cost, such a configuration has other benefits, such as a large aggregate disk access bandwidth and the availability of resources for distributed preprocessing and caching of data. Currently, however, there are no robust system tools capable of reaping all the potential benefits of this type of distributed storage. We are interested in developing tools and system software to enable parallel applications to efficiently and transparently access storage on multiple cluster nodes. We are looking at existing parallel file systems (PVFS) and parallel I/O libraries (MPI-IO) and studying how they can be adapted and optimized for our purposes. Some of this work is done in collaboration with the Ohio Supercomputer Center in the context of the OSC Mass Storage project.
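
As an illustration of how storage spread over several nodes can be presented as one file, the sketch below uses plain round-robin striping to map a logical byte range onto node-local pieces. It is a minimal example under stated assumptions, not the layout used by PVFS or by our tools: the function names (locate, split_request) and the parameters (stripe_size, num_nodes) are hypothetical and chosen only for this example.

```python
def locate(offset, stripe_size, num_nodes):
    """Map a logical file offset to (node index, offset in that node's local file).

    Assumes simple round-robin striping: stripe 0 on node 0, stripe 1 on node 1, ...
    """
    stripe_index = offset // stripe_size
    node = stripe_index % num_nodes
    # Each node stores its stripes back to back in a local file.
    local_offset = (stripe_index // num_nodes) * stripe_size + offset % stripe_size
    return node, local_offset


def split_request(offset, length, stripe_size, num_nodes):
    """Split a contiguous logical read into (node, local_offset, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        node, local_offset = locate(offset, stripe_size, num_nodes)
        # Never read past the end of the current stripe or of the request.
        chunk = min(stripe_size - offset % stripe_size, end - offset)
        pieces.append((node, local_offset, chunk))
        offset += chunk
    return pieces


if __name__ == "__main__":
    # A 96-byte read starting at offset 32, with 64-byte stripes on 8 nodes.
    for piece in split_request(32, 96, stripe_size=64, num_nodes=8):
        print(piece)
```

Because consecutive stripes land on different nodes, a large sequential read turns into independent per-node requests that can be issued in parallel, which is where the aggregate disk bandwidth comes from.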

The Organic Grid

Desktop grids have recently been used to perform some of the largest computations in the world and have the potential to grow by several more orders of magnitude. However, current approaches to utilizing desktop resources require either centralized servers or extensive knowledge of the underlying system, limiting their scalability. We propose a biologically inspired and fully decentralized approach to the organization of computation that is based on the autonomous scheduling of strongly mobile agents on a peer-to-peer network. Our approach achieves the following design objectives: near-zero knowledge of network topology, zero knowledge of system status, autonomous scheduling, distributed computation, and no specialized nodes. Every node is equally responsible for scheduling and computation, both of which are performed with practically no prior knowledge about the system. We have implemented an extension of Java with strong mobility that allows multi-threaded agents to migrate with all of their execution state. We built a grid infrastructure, the Organic Grid, in which an application is scheduled by encapsulating it in an agent together with a scheduler specific to the application characteristics. We are currently working on a screen saver for deploying the Organic Grid to desktop PCs. This work is a collaboration with Dr. Gerald Baumgartner.

Cellular Computation

We have recently started exploring new concepts at the intersection of computing and biology. Using quantitative models of protein-DNA binding found in the literature, we are studying how to build sequential circuits using the transcriptional machinery of the cell. The basic idea is that the recruitment of RNA polymerase to a gene promoter region (an event that starts the expression of the gene and then the production of the protein it encodes) is modulated by the action of multiple transcription factors (a class of DNA-binding proteins). Using this principle, properly designed configurations of transcription factors implement elementary gate functionality (such as AND, OR, XOR) in terms of protein concentrations. Starting with the definition of a finite state machine, we are investigating how to define a configuration of transcription factors that implements the given machine. Other researchers have studied the case of combinational logic, and we build upon their results in tackling the challenge of implementing arbitrary sequential circuits.
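
To make the idea of gates "in terms of protein concentrations" concrete, the sketch below uses a generic textbook Hill function as a stand-in for promoter occupancy by a transcription factor. This is an assumption made purely for illustration and is not the quantitative binding model used in our work; the function names and the parameters k (half-saturation concentration) and n (cooperativity) are hypothetical.

```python
def hill(conc, k=1.0, n=2.0):
    """Fraction of promoter activation by one activating transcription factor.

    Generic Hill function: close to 0 when conc << k, close to 1 when conc >> k.
    """
    x = (conc / k) ** n
    return x / (1.0 + x)


def and_gate(a, b):
    """Output is high only if both factors are bound (both are required)."""
    return hill(a) * hill(b)


def or_gate(a, b):
    """Output is high if at least one factor is bound (either one suffices)."""
    return 1.0 - (1.0 - hill(a)) * (1.0 - hill(b))


if __name__ == "__main__":
    low, high = 0.1, 10.0  # illustrative concentrations relative to k
    for a in (low, high):
        for b in (low, high):
            print(a, b, round(and_gate(a, b), 3), round(or_gate(a, b), 3))
```

Reading an output above 0.5 as logical 1 gives the usual AND and OR truth tables; a sequential circuit would additionally feed such outputs back as the concentrations that drive the next state.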

Data transport over the Grid

The increasing need to access distant large data sets is the motivation behind the study of novel techniques to increase the throughput of data transfer between remote sites. One of the tools we are developing is an enhanced version of a remote storage access tool called Storage Resource Broker (SRB); SRB is a production-quality tool developed at the San Diego Supercomputer Center. The performance-enhanced version of SRB employs several strategies to increase the amount of data moved per unit of time between two remote machines. One such strategy is to introduce a notion of pipelining in handling the data and to overlap different stages of the transfer. Another strategy is to stripe data across several parallel connections between the two remote sites. In a collaboration with researchers at the San Diego Supercomputer Center, I am developing an MPI-IO interface to SRB that would enable parallel applications to take full advantage of the aggregate bandwidth of the network striping.
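
The striping strategy can be sketched as follows. This is a minimal, self-contained illustration and does not use the SRB API: the "connections" are simulated by worker threads, send_stream is a hypothetical stand-in for one network connection, and the chunk size and stream count are arbitrary. Pipelining of transfer stages is not shown.

```python
from concurrent.futures import ThreadPoolExecutor


def stripe(data, chunk_size, num_streams):
    """Assign consecutive chunks to streams round-robin.

    Returns one list per stream of (sequence number, chunk) pairs.
    """
    streams = [[] for _ in range(num_streams)]
    for seq, start in enumerate(range(0, len(data), chunk_size)):
        streams[seq % num_streams].append((seq, data[start:start + chunk_size]))
    return streams


def send_stream(chunks):
    """Hypothetical stand-in for one connection: returns the chunks unchanged."""
    return list(chunks)


def transfer(data, chunk_size=4, num_streams=3):
    """Move data over several simulated streams and reassemble it in order."""
    streams = stripe(data, chunk_size, num_streams)
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        received = [item for part in pool.map(send_stream, streams) for item in part]
    # Sequence numbers restore the original order regardless of arrival order.
    return b"".join(chunk for _, chunk in sorted(received))


if __name__ == "__main__":
    payload = b"abcdefghijklmnopqrstuvwxyz"
    print(transfer(payload) == payload)
```

Tagging each chunk with a sequence number is what lets the receiver reassemble the data correctly even when the parallel connections deliver at different rates.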

Data intensive Computational Biology

I keep experimenting with our cluster to learn new ways of building and programming clusters specifically to solve demanding computational biology problems. For example, in a recent project on the assembly and annotation of a complete mammalian genome directed by Dr. Bo Yuan at the OSU Medical College, the development of a parallel version of a popular bioinformatics tool (BLAST) and an enhanced 1 TB storage system resulted in an order-of-magnitude improvement in the speed of the large-scale computation required for the assembly.

In a separate project in collaboration with Dr. Ralf Bundschuh in the Physics Department, we are studying how to enhance the sensitivity of PSI-BLAST using a novel statistical theory of sequence alignments. PSI-BLAST is the program of choice for searching large protein databases. This new theory should enable the creation of a version of PSI-BLAST in which position-specific gap costs allow the detection of very weak, and thus previously undetectable, alignments.


Projects I have been involved with in the past


Downloads


Publications


Copyright disclaimer: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder. 

Journals

Conferences

Other