Exploration of Dynamic Sequences in Scientific Databases

 

Principal Investigator:

            Hakan Ferhatosmanoglu

 

Students:

            Tan Apaydin

            Guadalupe Canahuate

 

Alumni:

Ozgur Ozturk (Oracle)

            Fatih Altiparmak (ASELSAN)

            Michael Gibas (Teradata)

           

This research has been supported in part by NSF IIS -0546713, 08/01/2006 - 07/31/2011 (estimated).

 

Latest results can be found at our Database Lab and Bioinformatics Lab websites.

Software can be found here.

Publications: For a complete list of papers, please visit here. 

 

A description of the research activities, followed by a list of recent papers funded by this project are given below.

 

 

High Performance Querying of Multi-dimensional Scientific Data

Advances in technology have enabled the production of massive volumes of data through experiments and simulations in many science and engineering disciplines. As a result, scientific data repositories increasingly involve large amounts of non-traditional data in the form of sequences of images and streams of empirical measurements and observations, generated by a diverse set of data sources. As databases increasingly integrate such different types of information, it becomes necessary to support efficient retrieval of multi-dimensional data sequences. Both the dimensionality and the amount of data that needs to be processed are increasing rapidly. Traditionally, DBMSs supported simple query types such as selection, range, and join queries. With the new DBMS applications, supporting new types of queries is emerging. We develop an iterative querying framework for modeling and mining of data represented by sequences of observations.

In our recent work (VLDB ’07), we proposed a model where a range of queries can be posed by varying the coefficients of an optimization objective function and defining a set of constraints on the attributes. Due to the nature of scientific applications, traditional database queries are largely being replaced by queries that involve complex mathematical and statistical functions over multiple attributes, e.g., HENP queries seeking events with small momenta for a set of particles. Currently, there is no effective DB support for such complex queries. We proposed model-based optimization queries, in which a generic model is used to define optimization of linear and nonlinear expressions over object attributes as well as many existing query types studied in database research literature. A significant subset of this general model relevant to real-world applications include queries where the optimization function and constraints are convex. We cast such queries as members of the convex optimization (CP) model and provide a unified query processing framework for CP queries that I/O optimally accesses data and space partitioning index structures without changing the underlying structures.  We performed experiments to show the generality of the technique and where possible, compared to techniques developed for specialized optimization queries.

With the newly proposed Optimization Query Framework, we achieved nearly identical performance to the limited optimization query types with optimal solutions, while providing generic processing for a broader class of queries, and while effectively handling problem constraints. We experimentally showed that the framework can be used to answer a number of different types of optimization queries including nearest neighbor queries, linear function optimization, and non-linear function optimization queries.

This framework can be used to measure I/O performance associated with using different indexes and index types to answer certain classes of optimization queries, in order to determine which structures can most effectively answer the optimization query type. Database researchers and administrators can use this technique as a benchmarking tool to evaluate the performance of a wide range of index structures available in the literature. The system does not need to know the objective function, weights, or constraints in advance, nor to compute a queried function for all data objects in order to find the optimal answers. The technique provides optimization of arbitrary convex functions, and does not incur a significant penalty in order to provide this generality. This makes the framework appropriate for applications and domains where a number of different functions are being optimized or when optimization is being performed over different constrained regions and the exact query parameters are not known in advance.


Summarization of Multiple Data Streams

We develop tools to manage multiple multi-dimensional data streams, and algorithms for continuous similarity and optimization based queries. The developed tools are integrated and tested in real-time and near real-time environmental monitoring projects.

Environmental data repositories involve sequences of data collected by a diverse set of data acquisition technologies. Civil and environmental engineers are interested in using the available data to model and monitor chemical, hydrological, thermal, and seismic changes over both space and time. Such a modeling requires an integrated analysis over dynamic multi-dimensional time series of temperature, humidity, bathymetry, water-level and tide observations at various gauge stations and buoys, as well as maps, and images. Similarity queries are executed to detect seismic/weather patterns similar to a previously known pattern. Similarity joins over streaming and archival data are needed in modeling and classification of a new object as a certain terrain property, by comparing its attributes to a set of previously classified objects in the archival database.

In our recent work, we have developed an online predictive quantization (PQ) based stream summarization method for data mining and optimization based analysis. The underlying methodology applies both to a distributed or central setting, and can be utilized for data transfers as well. A synopsis over a sliding window of most recent entries is computed in one pass and dynamically updated in constant time. The correlation between consecutive data elements is taken into account without the need for preprocessing. We extended PQ to multiple streams (PQ-Stream) for summarization and querying of a massive number of streams. Queries on any subsequence of a sliding window over multiple streams are processed in real-time.

PQ-stream, is shown to be more advantageous over transform-based techniques, as the amortized time complexity of the synopsis update is O(1) for K =1, i.e., independent of the sliding window length. With PQ-Stream the queries are answered in O(L) time where L stands for the length of the query window, which can be over any subsequence of multiple streams.

 


Microarray Data Mining

 

Time-series expression data is used to investigate complex gene regulation schemes and metabolic pathways. These investigations are facilitated by algorithms that can extract and cluster related behaviors from the full population of time-series behaviors observed. We investigated methods for the analysis of time series gene expression data, with a focus on Haemophilus influenzae a major cause of otitis media in children. After the preprocessing and discretization, taking both positive and negative correlations into consideration, data is passed to a clustering algorithm that allows elucidation and searching of timeseries patterns across multiple experiments. As a result we are able to identify several signal pathways that initiate competence development, and to characterize the transcriptomes of wild-type and an adenylate cyclase mutant (cya) strains under both nutrient-limiting and nutrient-complete growth conditions.

 

We then extend this work using multi-metric similarity based analysis. No single clustering method with single distance metric is capable of capturing all types of relationships that a gene may have with other genes. Genes are grouped around a query gene, and ranked corresponding to different levels of similarity utilizing multiple metrics. In these gene centered clusters no two genes are distant from each other, greater than a threshold value. The genes are then ranked by their frequency of co-occurrence. The grouping and rankings are drawn by applying set operations over results of multiple distance metrics, each capturing particular similarities such as shifted relationships, negative correlations and strong positive relationships.

 

The experimental study for the multi-metric gene expression analysis algorithm is done on two case studies. In the first one, a single yeast cell cycle dataset is used. It is shown that different combination of set operations reveals different kinds of interactions between genes. The algorithm is also tested on multiple microarray datasets obtained from Stanford Microarray Database. By utilizing several metrics, various types of relationships using a metric-independent algorithm can be captured.

 

 

·        CellTrack: An Open-Source Software for Cell Tracking and Motility Analysis. A. Sacan, H. Ferhatosmanoglu, H. Coskun, Bioinformatics, vol. 24, no. 14, July 2008, pp. 1647-1649.

 

 

This page was last updated on July 29, 2008 by Hakan Ferhatosmanoglu.