Exploration of Dynamic Sequences in Scientific Databases
Principal Investigator:
Students:
Alumni:
Bharath Sriram
Guadalupe Canahuate
Tan Apaydin
Ozgur Ozturk
Fatih Altiparmak
Michael Gibas
This research has been supported in part by NSF IIS-0546713, 08/01/2006 - 07/31/2011 (estimated).
Latest results can be found at our Database Lab and Bioinformatics Lab websites.
Publications: For a complete list of papers, please visit here.
A description of the research activities, followed by a list of recent papers funded by this project, is given below.
Multi-dimensional Data
Scientific data repositories increasingly involve large numbers of sequences and streams generated by diverse data sources. Both the dimensionality and the volume of the data to be processed are increasing rapidly. In our recent work (VLDB '07), we proposed a query model that involves varying the coefficients of an objective function and defining a set of constraints on the attributes. Due to the nature of scientific applications, traditional database queries are largely being replaced by queries that involve functions over multiple attributes. We use model-based optimization queries of linear and nonlinear expressions over object attributes, which cover many existing query types studied in the database research literature. A subset of this general model that is relevant to real-world applications includes queries where the optimization function and constraints are convex. We provide a unified query processing framework for such queries that accesses data and space-partitioning index structures I/O-optimally, without changing the underlying structures. For the more limited optimization query types that already have optimal solutions, the framework achieves nearly identical performance, while providing generic processing for a broader class of queries and effectively handling problem constraints.
Streaming Data and Text
We develop tools to manage multiple multi-dimensional data streams, and algorithms for continuous similarity queries. The developed tools are integrated and tested in environmental monitoring projects. Environmental data repositories involve sequences of data collected by a diverse set of data acquisition technologies. Environmental engineers are interested in using the available data to model and monitor chemical, hydrological, thermal, and seismic changes over both space and time.
Such modeling requires an integrated analysis of dynamic multi-dimensional time series of temperature, humidity, bathymetry, water-level, and tide observations at various gauge stations and buoys. Similarity queries are executed to detect seismic or weather patterns similar to a previously known pattern. Similarity joins over streaming and archival data are needed to model and classify a new object as a certain terrain property by comparing its attributes to a set of previously classified objects in the archival database.
In our recent work, we developed an online predictive quantization (PQ) method for mining multiple streams. The underlying methodology applies to both distributed and central settings, and can be used for data transfers as well. A synopsis over a sliding window of the most recent entries is computed in one pass and dynamically updated in constant time. The correlation between consecutive data elements is taken into account without the need for preprocessing. We extended PQ to multiple streams (PQ-Stream) for summarization and querying of a massive number of streams. Queries on any subsequence of a sliding window over multiple streams are processed in real time. PQ-Stream is shown to have an advantage over transform-based techniques, as the amortized time complexity of the synopsis update is independent of the sliding window length. With PQ-Stream, queries are answered in O(L) time, where L is the length of the query window, which can be over any subsequence of multiple streams.
An interesting application for data streams is managing and mining streaming text. Recently, microblogging services such as Twitter have attracted significant attention from both practical and research perspectives. In these services, users may become overwhelmed by the raw data. One solution to this problem is the classification of streaming short text messages. As short texts do not provide sufficient word occurrences, traditional classification methods such as 'Bag-Of-Words' have limitations. To address this problem, we propose to use a small set of domain-specific features extracted from the author's profile and text. This multi-dimensional approach effectively classifies the text into a predefined set of generic classes such as News, Events, Opinions, Deals, and Private Messages. We are planning to support similarity search within our classes, supplemented with semantic information gathered from the URLs in the tweets. We believe that this will result in higher precision and be especially useful when Twitter is accessed on hand-held devices, where performance and accuracy are the major concerns.
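The feature-based idea for short text can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the six binary features, the word lists, the author set, the toy messages, and the nearest-neighbor rule (Hamming distance) standing in for whatever classifier the project actually uses. It only shows how a handful of profile and text cues can separate the generic classes without relying on word counts.

```python
import re

# Hypothetical lookups, invented for this sketch (not the project's lists).
NEWS_AUTHORS = {"newsdesk"}
TIME_WORDS = {"today", "tonight", "tomorrow", "am", "pm"}
OPINION_WORDS = {"love", "hate", "great", "awful", "best", "worst"}


def extract_features(author, text):
    """Map one short message to a tuple of six binary features."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return (
        int(author in NEWS_AUTHORS),                   # author-profile cue
        int("http://" in text or "https://" in text),  # message links out
        int(bool(re.search(r"[$%]", text))),           # currency or percent sign
        int(text.startswith("@")),                     # addressed to a user
        int(bool(tokens & TIME_WORDS)),                # time or event word
        int(bool(tokens & OPINION_WORDS)),             # opinion word
    )


def hamming(a, b):
    """Number of positions where two feature tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(author, text, labeled):
    """Return the label of the closest labeled example (first one on ties)."""
    features = extract_features(author, text)
    _, label = min(labeled, key=lambda example: hamming(features, example[0]))
    return label


# Toy labeled messages, made up for the example.
TOY_MESSAGES = [
    ("newsdesk", "Senate passes budget bill http://example.com/a", "News"),
    ("alice", "Concert downtown tonight at 8 pm", "Events"),
    ("bob", "I love this phone, best purchase ever", "Opinions"),
    ("shopbot", "50% off shoes, only $20 http://example.com/d", "Deals"),
    ("carol", "@dave see you at lunch", "Private Messages"),
]
LABELED = [(extract_features(a, t), c) for a, t, c in TOY_MESSAGES]

print(classify("erin", "@frank are you coming?", LABELED))
print(classify("dealz", "Save 30% on laptops http://example.com/x", LABELED))
```

Because each message is reduced to a fixed-length feature tuple, the cost of classifying one message does not depend on vocabulary size, which is the property that makes such features attractive for very short texts.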
Time-series expression data is used to investigate complex gene regulation schemes and metabolic pathways. These investigations are facilitated by algorithms that can extract and cluster related behaviors from the full population of time-series behaviors observed. We investigated methods for the analysis of time-series gene expression data, with a focus on Haemophilus influenzae, a major cause of otitis media in children. After preprocessing and discretization, taking both positive and negative correlations into consideration, the data is passed to a clustering algorithm that allows elucidation and searching of time-series patterns across multiple experiments. As a result, we are able to identify several signal pathways that initiate competence development, and to characterize the transcriptomes of wild-type and adenylate cyclase mutant (cya) strains under both nutrient-limiting and nutrient-complete growth conditions.
We then extend this work using multi-metric similarity-based analysis. No single clustering method with a single distance metric is capable of capturing all types of relationships that a gene may have with other genes. Genes are grouped around a query gene and ranked according to different levels of similarity using multiple metrics. In these gene-centered clusters, no two genes are farther apart than a threshold value. The genes are then ranked by their frequency of co-occurrence. The groupings and rankings are obtained by applying set operations over the results of multiple distance metrics, each capturing particular similarities such as shifted relationships, negative correlations, and strong positive relationships. We performed experiments on two case studies and illustrated that, by utilizing several metrics, a metric-independent algorithm can capture various types of relationships.
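The multi-metric grouping described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the project's algorithm: the three distance functions (all derived from Pearson correlation), the threshold of 0.25, the gene names, and the expression profiles are all hypothetical, and the sketch only checks each gene's distance to the query gene rather than enforcing the pairwise threshold between all cluster members.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


# Hypothetical metrics, one per relationship type named in the text.
def positive_dist(q, g):
    return 1.0 - pearson(q, g)            # small when strongly co-expressed


def negative_dist(q, g):
    return 1.0 + pearson(q, g)            # small when anti-correlated


def shifted_dist(q, g):
    return 1.0 - pearson(q[:-1], g[1:])   # small when g follows q one step later


METRICS = {"positive": positive_dist, "negative": negative_dist, "shifted": shifted_dist}


def rank_around(profiles, query, threshold=0.25):
    """Per-metric neighbor sets of the query, plus genes ranked by how many sets contain them."""
    q = profiles[query]
    hits = {
        name: {g for g, p in profiles.items() if g != query and dist(q, p) <= threshold}
        for name, dist in METRICS.items()
    }
    counts = Counter(g for genes in hits.values() for g in genes)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return hits, ranked


# Toy expression profiles, made up for the example.
PROFILES = {
    "query": [1, 3, 2, 5, 4, 6],
    "geneA": [2, 6, 4, 10, 8, 12],      # scaled copy of the query
    "geneB": [-1, -3, -2, -5, -4, -6],  # mirror image of the query
    "geneC": [0, 1, 3, 2, 5, 4],        # query delayed by one time point
    "geneD": [1, 4, 5, 7, 9, 10],       # close under both positive and shifted
}

hits, ranked = rank_around(PROFILES, "query")
print(hits)
print(ranked)
```

Set operations then express the different similarity levels directly: for instance, `hits["positive"] & hits["shifted"]` keeps only genes that are close under both metrics, while the union of all three sets gives the loosest grouping.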
· Short Text Classification in Twitter to Improve Information Filtering. E. Demir, D. Fuhry, B. Sriram, M. Demirbas, H. Ferhatosmanoglu. Proceedings of the ACM SIGIR 2010 Posters and Demos, Geneva, Switzerland, to appear.
· Crowd-sourced Sensing and Collaboration Using Twitter. M. Demirbas, C. Akcora, M. Bayir, Y. Yilmaz, H. Ferhatosmanoglu. Proceedings of IEEE WoWMoM, 2010, Montreal, Canada, June 2010.
· Identifying Breakpoints in Public Opinion. C. Akcora, M. Bayir, M. Demirbas, H. Ferhatosmanoglu, Social Media Analytics, SIGKDD Workshop, July 2010.
· Secondary Bitmap Indexes with Vertical and Horizontal Partitioning. G. Canahuate, T. Apaydin, A. Sacan, H. Ferhatosmanoglu, EDBT, March 2009, pp. 600-611.
· Bitmap-based Index Structures. G. Canahuate, H. Ferhatosmanoglu. Encyclopedia of Database Systems 2009: 248-25.
· CellTrack: An Open-Source Software for Cell Tracking and Motility Analysis. A. Sacan, H. Ferhatosmanoglu, H. Coskun, Bioinformatics, vol. 24, no. 14, July 2008, pp. 1647-1649.
This page was last updated on June 15, 2010.