SDM 2010 Workshop on Clustering: Theory and applications

May 1, 2010

Mikhail Belkin, Ohio State University
Joseph S. Verducci, Ohio State University

Web page created by Yuwen Zhuang.

Clustering is one of the key components of data analysis, used in thousands of practical applications. It spans both statistics and computer science and, arguably, is a key part of human (and animal) cognition. In this workshop we will look at theoretical and practical issues related to clustering data from both computational and statistical points of view.

The workshop will be held on May 1, 2010 in Columbus, Ohio as part of the 2010 SIAM International Conference on Data Mining.


Keynote Speaker: Anil Jain, Michigan State University

Invited Speakers:

Daniel Janies, Ohio State University

Sun Kim, Indiana University

Ann Lee, Carnegie Mellon University

Tao Shi, Ohio State University

Suresh Venkatasubramanian, University of Utah


Time: 8:30-9:30
Anil Jain, Michigan State University

Data clustering: 50 years beyond K-means

Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, etc. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Data clustering algorithms do not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). Contributions to the vast clustering literature (both algorithms and applications) have come from a large number of diverse scientific fields. One of the most popular, simplest, and most effective clustering algorithms is K-means. Although K-means was proposed over 50 years ago, it and its variants are still widely used because no clustering algorithm has been shown to consistently outperform it. In other words, K-means continues to be an admissible clustering algorithm, with no clustering algorithm being optimal. This highlights the ill-posed nature of clustering. It is instructive for users as well as designers of clustering algorithms to understand the exploratory nature of clustering, whose primary goal is to formulate hypotheses about structure in given data. This talk will present a brief overview of clustering, summarize some of the well-known clustering methods, discuss various issues in designing clustering algorithms, and highlight some research directions: semi-supervised clustering, ensemble clustering, simultaneous dimensionality reduction and clustering, and large-scale clustering.
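For readers unfamiliar with it, K-means itself is short to state: alternate between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points. A minimal NumPy sketch of Lloyd's algorithm with k-means++ seeding follows; both techniques are standard, but the function names and parameters here are illustrative, not taken from the talk:

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    # k-means++ seeding: pick initial centroids far apart,
    # with probability proportional to squared distance to the nearest chosen center
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    rng = np.random.default_rng(seed)
    centers = kmeans_pp_init(X, k, rng)
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of its members
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties out
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The admissibility point in the abstract is visible even here: the result depends on the seeding, the choice of k, and the implicit assumption of roughly spherical clusters.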

Time: 9:30-10:00
Ann Lee, Carnegie Mellon University

A Geometry-Based Metric Between Distributions of High-Dimensional Data

For tasks such as classification, clustering and information retrieval, high-dimensional data objects are often transformed into a distribution over a well-chosen space of features. This approach is particularly useful in image analysis, where an image is a data object and features could represent attributes for color, texture, shape, etc. The outcome of these tasks relies on the careful construction of a metric for comparing feature distributions. Furthermore, these distributions are often estimated from data by histograms, where the bins correspond to a certain partitioning of the feature space. In this talk, I will describe a novel metric for distributions that takes the intrinsic data geometry into account via a generalization of the so-called diffusion distance. I will also show how to estimate these distances from data and how, due to an efficient coordinate coding of the histograms (i.e., density estimates), our distance computations are linear in time and allow for quick searches and organization of large databases. The method will be illustrated on texture discrimination and clustering of digital pathology images.
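As a rough illustration of the construction this talk generalizes (the classical diffusion distance, not the speaker's new metric), one can embed histogram bins via eigenvectors of a Markov matrix built on the feature space, scale the coordinates by powers of the eigenvalues, and compare histograms by Euclidean distance in that short coordinate vector. All names and parameter choices below are illustrative:

```python
import numpy as np

def diffusion_coords(bin_centers, t=1, eps=1.0, n_coords=5):
    """Spectral coordinates for histogram bins: after projection, Euclidean
    distance between histograms approximates a diffusion distance at scale t."""
    # Gaussian affinity between bin centers, normalized to a Markov matrix
    D2 = ((bin_centers[:, None, :] - bin_centers[None, :, :]) ** 2).sum(-1)
    W = np.exp(-D2 / eps)
    P = W / W.sum(axis=1, keepdims=True)
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)[:n_coords]      # leading eigenvalues first
    # scale each eigenvector by lambda^t: high-frequency directions are damped
    return (vals.real[order] ** t) * vecs.real[:, order]

def diffusion_distance(h1, h2, coords):
    """Distance between two histograms in diffusion coordinates: linear in
    the number of retained coordinates."""
    return np.linalg.norm((h1 - h2) @ coords)
```

The point about linear-time search in the abstract corresponds to the last line: once each histogram is coded by a few coordinates, comparisons no longer touch the full bin structure.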

Coffee break: 10:00-10:30

Time: 10:30-11:00
Suresh Venkatasubramanian, University of Utah

On the geometry of clusterings, both hard and soft

It has long been recognized that a single clustering may not be the best way to characterize a data set. Data can be polymorphic and heterogeneous, and different clustering algorithms bring out different patterns in a given data set. Motivated by this, researchers are now studying how to analyze collections of clusterings, by comparing them, clustering them, and even finding so-called "diverse" clusterings that are distant from each other. Analyzing collections of clusterings requires an understanding of how to organize them geometrically, much like data mining for document or image data can be translated into questions about points in high-dimensional Euclidean spaces. In this talk, I will describe current research efforts towards designing a geometry for clusterings. These efforts have yielded two outcomes thus far: a unified approach to representing clusterings in which the hard clusterings can be viewed as integer points of a continuous metric space of soft clusterings, and a new tree-based representation for soft clusterings that may be of independent interest. This is joint work with Parasaran Raman.
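One concrete way to picture the space described here: a soft clustering of n points into k clusters can be encoded as an n-by-k row-stochastic membership matrix, and hard clusterings are exactly the 0/1 integer points of that space. This small sketch uses my own names, not the paper's:

```python
import numpy as np

def is_soft_clustering(M, tol=1e-9):
    """A soft clustering of n points into k clusters: an n x k matrix whose
    rows are membership probabilities (nonnegative, summing to one)."""
    M = np.asarray(M, dtype=float)
    return bool((M >= -tol).all() and np.allclose(M.sum(axis=1), 1.0))

def is_hard_clustering(M, tol=1e-9):
    """Hard clusterings are the integer (0/1) points of the soft space."""
    return is_soft_clustering(M, tol) and bool(np.allclose(M, np.round(M)))

def hard_from_labels(labels, k):
    """Embed an ordinary label vector as a 0/1 membership matrix."""
    M = np.zeros((len(labels), k))
    M[np.arange(len(labels)), labels] = 1.0
    return M
```

Under this encoding, any metric on membership matrices immediately compares hard and soft clusterings in the same space, which is the kind of unification the abstract refers to.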

Time: 11:00-11:30
Tao Shi, Ohio State University

Multi-Sample Data Spectroscopic Clustering of Large Datasets using Nystrom Extension

Spectral clustering algorithms have shown promising results in statistics, bioinformatics, genetics, machine learning, and other scientific fields. These spectral algorithms cluster n observations into groups by investigating eigenvectors of an affinity matrix or its corresponding Laplacian matrix, both of which are of size n by n. However, the computation involved in the eigen-decomposition of an n by n matrix is expensive or even infeasible when the sample size is large. To overcome this computational hurdle, subsampling techniques, such as the Nystrom extension, have been used to approximate eigenvectors of large matrices. In this talk, we discuss the statistical properties of such approximations and their influence on the accuracy of various spectral clustering algorithms. We found that the perturbation of the spectrum due to subsampling can lead to large discrepancies among clustering results based on different subsamples. In order to provide accurate and stable clustering results for large datasets, we propose a method to combine multiple subsamples using data spectroscopic clustering and the Nystrom extension. In addition, we propose a sparse approximation of the eigenvectors to further speed up the computation. Simulations and experiments on real data sets show that this multi-sample approach is fast and accurate.
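A minimal sketch of the basic Nystrom idea the abstract starts from (the multi-sample combination and sparse approximation are the talk's contributions and are not shown): eigen-decompose the small m-by-m affinity matrix among m sampled points, then extend each eigenvector to all n points through the n-by-m affinity slice. Kernel choice and names here are illustrative:

```python
import numpy as np

def nystrom_eigvecs(X, sample_idx, eps=1.0, n_vecs=2):
    """Approximate the leading eigenvectors of the full n x n Gaussian
    affinity matrix using only its n x m slice, m = len(sample_idx) << n."""
    S = X[sample_idx]
    # m x m affinity among the sampled points: the only matrix we decompose
    Wss = np.exp(-((S[:, None, :] - S[None, :, :]) ** 2).sum(-1) / eps)
    vals, vecs = np.linalg.eigh(Wss)
    vals = vals[::-1][:n_vecs]            # leading eigenvalues, descending
    vecs = vecs[:, ::-1][:, :n_vecs]
    # n x m affinity slice; the Nystrom formula v(x) = W(x, S) u / lambda
    # extends each subsample eigenvector to every point
    Wns = np.exp(-((X[:, None, :] - S[None, :, :]) ** 2).sum(-1) / eps)
    return Wns @ vecs / vals
```

The sensitivity discussed in the abstract is easy to see with this sketch: rerunning it with a different sample_idx perturbs the approximate eigenvectors, and hence the clustering read off from them.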

Time: 11:30-12:00
Daniel Janies, Ohio State University

Large scale genetic, phenotypic, and global visualization of emergent infectious diseases

Emerging infectious diseases are critical issues for public health. As demonstrated by the response to SARS and pandemic influenza, rapid genomic sequencing has become a primary method of diagnosing agents of infectious disease. Despite this key descriptive role, raw sequence data, on their own, do not provide the knowledge required to understand the etiology and trajectory of disease outbreaks. In order to understand epidemics one must incorporate knowledge of pathogen phenotypes as well as factors external to the pathogen, such as its spread among various hosts and geography. Sequence alignment and phylogenetics are fundamental tools for making sense of sequence data from emergent pathogens via comparison to well-characterized pathogens. However, both of these tools require significant computational equipment and user expertise. We have developed web-based applications that marry parallel sequence alignment and tree search with geographic visualization of the spread of pathogens and their genotypes and phenotypes. In doing so, we lessen the expertise and nearly eliminate the computational equipment required for individuals to use sequence alignment and phylogenetics. Our applications allow users to analyze large datasets of raw genetic sequence and phenotypes from pathogens and hosts in a geographic context. We provide open source code for those users who want to perform more complex analyses and work on their own equipment. Overall, we strive to foster interactions among diverse disciplines by providing a common framework for hypothesis generation and testing. I will discuss several multidisciplinary use cases, including analyses of host adaptation and drug resistance over space and time.

Time: 12:00-12:30
Sun Kim, Indiana University

Integrated Analysis of microRNA and mRNA expression data

Clustering techniques have been commonly used in analyzing expression data measured under various experimental conditions. As biologists have become able to measure expression levels of various genomic elements of different types, a new problem has emerged: performing integrated analysis of expression data for various genomic elements. In this talk, I will introduce the problem of performing integrated analysis of microRNA and mRNA expression data. MicroRNAs are small molecules of 19-24 bp in length, and they are known to interfere with genes at the transcription and translation levels. As a result, we can observe negative correlations between mRNA and microRNA expression levels when microRNAs interfere with genes at the transcription level. Unfortunately, simultaneous analysis of microRNA and mRNA data faces significant challenges. In addition to the well-known problems in analyzing expression data, targets of microRNAs are typically predicted by computational methods that are not reliable. We have been developing computational methods for the integrated analysis of microRNA and mRNA expression data. In this talk, I will present our previous work and share some preliminary data towards a new method.
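As a toy illustration of the negative-correlation signal described above (not the authors' actual method), one could screen predicted miRNA-target pairs for strongly negative Pearson correlation of their expression profiles across samples; the names, data layout, and threshold are all assumptions for the sketch:

```python
import numpy as np

def negative_corr_pairs(mirna, mrna, targets, thresh=-0.7):
    """For each predicted (miRNA, target-mRNA) pair, compute the Pearson
    correlation of their expression profiles across samples and keep the
    strongly negative ones, consistent with transcript-level repression.

    mirna:   array of shape (n_mirnas, n_samples)
    mrna:    array of shape (n_mrnas, n_samples)
    targets: iterable of (mirna_index, mrna_index) predicted pairs
    """
    hits = []
    for i, j in targets:
        r = np.corrcoef(mirna[i], mrna[j])[0, 1]
        if r <= thresh:
            hits.append((i, j, r))
    return hits
```

In practice such a screen inherits both difficulties named in the abstract: noisy expression estimates make r unstable, and unreliable target predictions mean many candidate pairs should show no correlation at all.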