Exploration of Dynamic Sequences in Scientific Databases
Principal Investigator:
Students:
Alumni:
Ozgur Ozturk (Oracle)
Fatih Altiparmak (ASELSAN)
Michael Gibas (Teradata)
This research has been supported in part by NSF IIS -0546713, 08/01/2006 - 07/31/2011 (estimated).
Latest results can be found at our Database Lab and Bioinformatics Lab websites.
Publications: For a complete list of papers, please visit here.
A description of the research activities, followed by a list of recent papers funded by this project are given below.
|
High Performance Querying of Multi-dimensional Scientific Data Advances in technology have enabled the production of massive volumes of data through experiments and simulations in many science and engineering disciplines. As a result, scientific data repositories increasingly involve large amounts of non-traditional data in the form of sequences of images and streams of empirical measurements and observations, generated by a diverse set of data sources. As databases increasingly integrate such different types of information, it becomes necessary to support efficient retrieval of multi-dimensional data sequences. Both the dimensionality and the amount of data that needs to be processed are increasing rapidly. Traditionally, DBMSs supported simple query types such as selection, range, and join queries. With the new DBMS applications, supporting new types of queries is emerging. We develop an iterative querying framework for modeling and mining of data represented by sequences of observations. In our recent work (VLDB ’07), we proposed a model where a range of queries can be posed by varying the coefficients of an optimization objective function and defining a set of constraints on the attributes. Due to the nature of scientific applications, traditional database queries are largely being replaced by queries that involve complex mathematical and statistical functions over multiple attributes, e.g., HENP queries seeking events with small momenta for a set of particles. Currently, there is no effective DB support for such complex queries. We proposed model-based optimization queries, in which a generic model is used to define optimization of linear and nonlinear expressions over object attributes as well as many existing query types studied in database research literature. A significant subset of this general model relevant to real-world applications include queries where the optimization function and constraints are convex. We cast such queries as members of the convex optimization (CP) model and provide a unified query processing framework for CP queries that I/O optimally accesses data and space partitioning index structures without changing the underlying structures. We performed experiments to show the generality of the technique and where possible, compared to techniques developed for specialized optimization queries. With the newly proposed Optimization Query Framework, we achieved nearly identical performance to the limited optimization query types with optimal solutions, while providing generic processing for a broader class of queries, and while effectively handling problem constraints. We experimentally showed that the framework can be used to answer a number of different types of optimization queries including nearest neighbor queries, linear function optimization, and non-linear function optimization queries. This framework can be used to measure I/O performance associated with using different indexes and index types to answer certain classes of optimization queries, in order to determine which structures can most effectively answer the optimization query type. Database researchers and administrators can use this technique as a benchmarking tool to evaluate the performance of a wide range of index structures available in the literature. The system does not need to know the objective function, weights, or constraints in advance, nor to compute a queried function for all data objects in order to find the optimal answers. The technique provides optimization of arbitrary convex functions, and does not incur a significant penalty in order to provide this generality. This makes the framework appropriate for applications and domains where a number of different functions are being optimized or when optimization is being performed over different constrained regions and the exact query parameters are not known in advance.
We develop tools to
manage multiple multi-dimensional data streams, and algorithms for
continuous similarity and optimization based queries. The developed tools are
integrated and tested in real-time and near real-time environmental
monitoring projects. In our recent work, we have developed an online predictive quantization (PQ) based stream summarization method for data mining and optimization based analysis. The underlying methodology applies both to a distributed or central setting, and can be utilized for data transfers as well. A synopsis over a sliding window of most recent entries is computed in one pass and dynamically updated in constant time. The correlation between consecutive data elements is taken into account without the need for preprocessing. We extended PQ to multiple streams (PQ-Stream) for summarization and querying of a massive number of streams. Queries on any subsequence of a sliding window over multiple streams are processed in real-time. PQ-stream, is shown to be more advantageous over transform-based techniques, as the amortized time complexity of the synopsis update is O(1) for K =1, i.e., independent of the sliding window length. With PQ-Stream the queries are answered in O(L) time where L stands for the length of the query window, which can be over any subsequence of multiple streams.
Time-series expression data is used to investigate complex gene regulation schemes and metabolic pathways. These investigations are facilitated by algorithms that can extract and cluster related behaviors from the full population of time-series behaviors observed. We investigated methods for the analysis of time series gene expression data, with a focus on Haemophilus influenzae a major cause of otitis media in children. After the preprocessing and discretization, taking both positive and negative correlations into consideration, data is passed to a clustering algorithm that allows elucidation and searching of timeseries patterns across multiple experiments. As a result we are able to identify several signal pathways that initiate competence development, and to characterize the transcriptomes of wild-type and an adenylate cyclase mutant (cya) strains under both nutrient-limiting and nutrient-complete growth conditions. We then extend this work using multi-metric similarity based analysis. No single clustering method with single distance metric is capable of capturing all types of relationships that a gene may have with other genes. Genes are grouped around a query gene, and ranked corresponding to different levels of similarity utilizing multiple metrics. In these gene centered clusters no two genes are distant from each other, greater than a threshold value. The genes are then ranked by their frequency of co-occurrence. The grouping and rankings are drawn by applying set operations over results of multiple distance metrics, each capturing particular similarities such as shifted relationships, negative correlations and strong positive relationships. The
experimental study for the multi-metric gene expression analysis algorithm is done
on two case studies. In the first one, a single yeast cell cycle dataset is used. It
is shown that different combination of set operations reveals different kinds
of interactions between genes. The algorithm is also tested on
multiple microarray datasets obtained from Stanford Microarray Database.
By utilizing several metrics, various types of relationships using a
metric-independent algorithm can be captured. |
· CellTrack: An Open-Source Software for Cell Tracking and Motility Analysis. A. Sacan, H. Ferhatosmanoglu, H. Coskun, Bioinformatics, vol. 24, no. 14, July 2008, pp. 1647-1649.
This page was last updated on July 29, 2008 by Hakan Ferhatosmanoglu.