Publications
of Mikhail Belkin
[Journal papers] -- [Conference papers] -- [Ph.D. Thesis] -- [Book Chapter] -- [Technical Reports]
Journal Papers and preprints:
-
On Learning with Integral Operators
[pdf, bib]
L. Rosasco, M.Belkin, E. De Vito,
Journal of Machine Learning Research (accepted, pending minor revisions).
-
+ abstract
-
A large number of learning algorithms, for example,
spectral clustering, kernel Principal Components
Analysis and many manifold methods are
based on estimating eigenvalues and eigenfunctions
of operators defined by a similarity function or a
kernel, given empirical data. Thus for the analysis
of algorithms, it is an important problem to
be able to assess the quality of such approximations.
The contribution of our paper is two-fold.
First, we use a technique based on a concentration
inequality for Hilbert spaces to provide new simplified
proofs for a number of results in spectral
approximation. Second, using these methods we
provide several new results for estimating spectral
properties of the graph Laplacian operator extending
and strengthening results from [28].
-
Learning Gaussian Mixtures with Arbitrary Separation [pdf]
Mikhail Belkin, Kaushik Sinha
Preprint
-
+ abstract
-
In this paper we present a method for learning the parameters of a mixture of k identical
spherical Gaussians in n-dimensional space with an arbitrarily small separation between the
components. Our algorithm is polynomial in all parameters other than k. The algorithm is
based on an appropriate grid search over the space of parameters. The theoretical analysis of
the algorithm hinges on a reduction of the problem to 1 dimension and showing that two 1-
dimensional mixtures whose densities are close in the L2 norm must have similar means and
mixing coefficients. To produce such a lower bound for the L2 norm in terms of the distances
between the corresponding means, we analyze the behavior of the Fourier transform of a mixture
of Gaussians in 1 dimension around the origin, which turns out to be closely related to the
properties of the Vandermonde matrix obtained from the component means. Analysis of this
matrix together with basic function approximation results allows us to provide a lower bound
for the norm of the mixture in the Fourier domain.
In recent years much research has been aimed at understanding the computational aspects
of learning parameters of Gaussians mixture distributions in high dimension. To the best of
our knowledge all existing work on learning parameters of Gaussian mixtures assumes minimum
separation between components of the mixture which is an increasing function of either the
dimension of the space n or the number of components k. In our paper we prove the first
result showing that parameters of a n-dimensional Gaussian mixture model with arbitrarily
small component separation can be learned in time polynomial in n. We also believe that the
new technique used to provide the key lower bound is of independent interest and will be useful
for other related problems.
-
Data spetroscopy: eigenspace of convolution operators and clustering [pdf]
Tao Shi, Mikhail Belkin, Bin Yu
The Annals of Statistics, vol. 37, Number 6B (2009), 3960-3984.
-
+ abstract
-
This paper focuses on obtaining clustering information about a
distribution from its i.i.d. samples. We develop theoretical results to
understand and use clustering information contained in the eigenvectors
of data adjacency matrices based on a radial kernel function
with a sufficiently fast tail decay. In particular, we provide population
analyses to gain insights into which eigenvectors should be used and
when the clustering information for the distribution can be recovered
from the sample. We learn that a fixed number of top eigenvectors
might at the same time contain redundant clustering information and
miss relevant clustering information. We use this insight to design
the Data Spectroscopic clustering (DaSpec) algorithm that utilizes
properly selected eigenvectors to determine the number of clusters
automatically and to group the data accordingly. Our findings extend
the intuitions underlying existing spectral techniques such as
spectral clustering and Kernel Principal Components Analysis, and
provide new understanding into their usability and modes of failure.
Simulation studies and experiments on real world data are conducted
to show the potential of our algorithm. In particular, DaSpec is found
to handle unbalanced groups and recover clusters of different shapes
better than the competing methods.
-
Convergence of Laplacian Eigenmaps [pdf, bib]
M. Belkin, P. Niyogi
preprint,
2008.
-
+ abstract
-
Geometrically based methods for various tasks of data analysis
have attracted considerable attention over the last few years.
In many of these algorithms, a central role is played by the
eigenvectors of the graph Laplacian of a
data-derived graph. In this paper, we show that if points are
sampled uniformly at random from an unknown submanifold ${\cal M}$
of $\R^N$, then the eigenvectors of a suitably constructed
graph Laplacian converge to the eigenfunctions of the Laplace
Beltrami operator on ${\cal M}$. This basic result directly
establishes the convergence of spectral manifold learning
algorithms such as Laplacian Eigenmaps and Diffusion Maps.
It also has implications for the understanding of geometric algorithms
in data analysis, computational harmonic analysis, geometric
random graphs, and graphics.
-
Towards a Theoretical Foundation for Laplacian-Based
Manifold Methods [pdf, bib]
M. Belkin, P. Niyogi
Journal of Computer and System Sciences, 2008.
Volume 74 , Issue 8, pp. 1289-1308
Special Issue on Learning Theory (invited).
-
+ abstract
-
In recent years manifold methods have attracted a considerable amount of atten-
tion in machine learning. However most algorithms in that class may be termed
"manifold-motivated" as they lack any explicit theoretical guarantees. In this pa-
per we take a step towards closing the gap between theory and practice for a class
of Laplacian-based manifold methods. These methods utilize the graph Laplacian
associated to a data set for a variety of applications in semi-supervised learning,
clustering, data representation.
We show that under certain conditions the graph Laplacian of a point cloud of
data samples converges to the Laplace-Beltrami operator on the underlying mani-
fold. Theorem 3.1 contains the .rst result showing convergence of a random graph
Laplacian to the manifold Laplacian in the context of machine learning.
-
Manifold Regularization:
a Geometric Framework for Learning from Labeled and
Unlabeled Examples [pdf, bib]
M. Belkin, P. Niyogi, V. Sindhwani
Journal of Machine Learning Research, 7(Nov):2399-2434,
2006.
-
+ abstract
- We propose a family of learning algorithms based on a
new form of regularization that allows us to exploit
the geometry of the marginal distribution. We focus
on a semi-supervised framework that incorporates
labeled and unlabeled data in a general-purpose
learner. Some transductive graph learning algorithms
and standard methods including support vector
machines and regularized least squares can be
obtained as special cases. We use properties of
reproducing kernel Hilbert spaces to prove new
Representer theorems that provide theoretical basis
for the algorithms. As a result (in contrast to
purely graph-based approaches) we obtain a natural
out-of-sample extension to novel examples and so are
able to handle both transductive and truly
semi-supervised settings. We present experimental
evidence suggesting that our semi-supervised
algorithms are able to use unlabeled data
effectively. Finally we have a brief discussion of
unsupervised and fully supervised learning within our
general framework.
-
Consistency of Spectral Clustering [pdf, bib]
U. von Luxburg, M. Belkin, O. Bousquet,
The Annals of Statistics 2008, Vol. 36, No. 2, 555-586.
-
+ abstract
-
Consistency is a key property of statistical
algorithms when the data is drawn from some
underlying probability distribution. Surprisingly,
despite decades of work, little is known about
consistency of most clustering algorithms. In this
paper we investigate consistency of the popular
family of spectral clustering algorithms, which
clusters the data with the help of eigenvectors of
graph Laplacian matrices. We develop new methods to
establish that for increasing sample size, those
eigenvectors converge to the eigenvectors of certain
limit operators. As a result we can prove that one of
the two major classes of spectral clustering
(normalized clustering) converges under very general
conditions, while the other (unnormalized clustering)
is only consistent under strong additional
assumptions, which are not always satisfied in real
data. We conclude that our analysis provides strong
evidence for the superiority of normalized spectral
clustering.
-
Semi-supervised Learning on Riemannian Manifolds
[pdf, bib]
M. Belkin, P. Niyogi
Machine Learning, 56, Invited, secial Issue on
Clustering, 209-239, 2004.
-
+ abstract
-
We consider the general problem of utilizing both
labeled and unlabeled data to improve classification
accuracy. Under the assumption that the data lie on a
submanifold in a high dimensional space, we develop
an algorithmic framework to classify a partially
labeled data set in a principled manner. The central
idea of our approach is that classification functions
are naturally defined only on the submanifold in
question rather than the total ambient space. Using
the Laplace-Beltrami operator one produces a basis
(the Laplacian Eigenmaps) for a Hilbert space of
square integrable functions on the submanifold. To
recover such a basis, only unlabeled examples are
required. Once such a basis is obtained, training can
be performed using the labeled data set. Our
algorithm models the manifold using the adjacency
graph for the data and approximates the Laplace-
Beltrami operator by the graph Laplacian. We provide
details of the algorithm, its theoretical
justification, and several practical applications for
image, speech, and text classification.
-
Laplacian Eigenmaps for Dimensionality Reduction and
Data Representation [pdf, bib]
M. Belkin, P. Niyogi
Neural Computation, June 2003; 15 (6):1373-1396.
-
+ abstract
-
One of the central problems in machine learning and
pattern recognition is to develop appropriate
representations for complex data. We consider the
problem of constructing a representation for data
lying on a lowdimensional manifold embedded in a
high-dimensional space. Drawing on the correspondence
between the graph Laplacian, the Laplace Beltrami
operator on the manifold, and the connections to the
heat equation, we propose a geometrically motivated
algorithm for representing the highdimensional data.
The algorithm provides a computationally efficient
approach to non
dimensionality reduction that
has locality-preserving properties and a natural
connection to clustering. Some potential applications
and illustrative examples are discussed.
Refereed and Invited Conference
Proceedings:
-
Semi-supervised Learning using Sparse
Eigenfunction Bases
[pdf, bib]
K. Sinha, M.Belkin, NIPS 2009.
-
+ abstract
-
We present a new framework for semi-supervised learning with sparse eigenfunction
bases of kernel matrices. It turns out that when the data has clustered, that
is, when the high density regions are suf.ciently separated by low density valleys,
each high density area corresponds to a unique representative eigenvector.
Linear combination of such eigenvectors (or, more precisely, of their Nystrom
extensions) provide good candidates for good classi.cation functions when the
cluster assumption holds. By .rst choosing an appropriate basis of these eigenvectors
from unlabeled data and then using labeled data with Lasso to select a
classi.er in the span of these eigenvectors, we obtain a classi.er, which has a very
sparse representation in this basis. Importantly, the sparsity corresponds naturally
to the cluster assumption.
Experimental results on a number of real-world data-sets show that our method
is competitive with the state of the art semi-supervised learning algorithms and
outperforms the natural base-line algorithm (Lasso in the Kernel PCA basis).
-
A Note on Learning with Integral Operators
[pdf, bib]
L. Rosasco, M.Belkin, E. De Vito, COLT 2009.
-
+ abstract
-
A large number of learning algorithms, for example,
spectral clustering, kernel Principal Components
Analysis and many manifold methods are
based on estimating eigenvalues and eigenfunctions
of operators defined by a similarity function or a
kernel, given empirical data. Thus for the analysis
of algorithms, it is an important problem to
be able to assess the quality of such approximations.
The contribution of our paper is two-fold.
First, We use a technique based on a concentration
inequality for Hilbert spaces to provide new simplified
proofs for a number of results in spectral
approximation. Second, using these methods we
provide several new results for estimating spectral
properties of the graph Laplacian operator extending
and strengthening results from [28].
-
Constructing Laplace Operator from Point Clouds in R^d
[pdf, bib, code]
M. Belkin, J. Sun, Y. Wang, SODA 2009, pp. 1031--1040.
-
+ abstract
-
We present an algorithm for approximating the Laplace-
Beltrami operator from an arbitrary point cloud obtained
from a k-dimensional manifold embedded in the d-
dimensional space. We show that this PCD Laplace (Point-
Cloud Data Laplace) operator converges to the Laplace-
Beltrami operator on the underlying manifold as the point
cloud becomes denser. Unlike the previous work, we do
not assume that the data samples are independent identically
distributed from a probability distribution and do
not require a global mesh. The resulting algorithm is easy
to implement. We present experimental results indicating
that even for point sets sampled from a uniform distribution,
PCD Laplace converges faster than the weighted graph
Laplacian. We also show that using our PCD Laplacian we
can directly estimate certain geometric invariants, such as
manifold area.
-
Data Spectroscopy: Learning Mixture Models using Eigenspaces of
Convolution Operators
[pdf, bib]
T. Shi, M. Belkin, B. Yu, ICML 2008.
-
+ abstract
-
In this paper we develop a spectral frame-
work for estimating mixture distributions,
specifically Gaussian mixture models. In
physics, spectroscopy is often used for the
identification of substances through their
spectrum. Treating a kernel function K(x, y)
as "light" and the sampled data as "sub-
stance", the spectrum of their interaction
(eigenvalues and eigenvectors of the kernel
matrix K) unveils certain aspects of the un-
derlying parametric distribution p, such as
the parameters of a Gaussian mixture. Our
approach extends the intuitions and analyses
underlying the existing spectral techniques,
such as spectral clustering and Kernel Prin-
cipal Components Analysis (KPCA).
We construct algorithms to estimate param-
eters of Gaussian mixture models, includ-
ing the number of mixture components, their
means and covariance matrices, which are im-
portant in many practical applications. We
provide a theoretical framework and show en-
couraging experimental results.
-
Discrete Laplace Operator for Meshed Surfaces
[pdf, bib]
M. Belkin, J. Sun, Y. Wang, SOCG 2008.
-
+ abstract
-
In recent years a considerable amount of work in graphics and geometric optimization used tools
based on the Laplace-Beltrami operator on a surface. The applications of the Laplacian include mesh
editing, surface smoothing, and shape interpolations among others. However, it has been shown [12, 23,
25] that the popular cotangent approximation schemes do not provide convergent point-wise (or even
L^2) estimates, while many applications rely on point-wise estimation. Existence of such schemes has
been an open question [12].
In this paper we propose the first algorithm for approximating the Laplace operator of a surface from
a mesh with point-wise convergence guarantees applicable to arbitrary meshed surfaces. We show that
for a sufficiently fine mesh over an arbitrary surface, our mesh Laplacian is close to the Laplace-Beltrami
operator on the surface at every point of the surface.
Moreover, the proposed algorithm is simple and easily implementable. Experimental evidence shows
that our algorithm exhibits convergence empirically and outperforms cotangent-based methods in providing
accurate approximation of the Laplace operator for various meshes.
-
Component Based Shape Retrieval Using Differential Profiles.
L. Ding, M. Belkin, Proceedings of ACM International Conference on Multimedia Information Retrieval, 2008.
-
+ abstract
-
In this paper, we describe the use of differential profiles, which are computed from 2D shapes smoothed with Gaussian
functions, as the shape features for building a shape retrieval system. We build a global shape component dictionary
from all the shape descriptors collected from shapes available in a database and then represent each shape as a
probabilistic mixture of elements from such a dictionary. Finally, shape retrieval from a given database is simply
done by comparing the mixing coefficients of the model of a query shape and those of known shapes. Our retrieval
experiments are done on both object contour and line drawing collections and show promising results.
-
The value of labeled and unlabeled examples when the model is imperfect
[pdf, bib]
K. Sinha, M. Belkin, NIPS 2007.
-
+ abstract
-
Semi-supervised learning, i.e. learning from both labeled and unlabeled data has
received signi.cant attention in the machine learning literature in recent years.
Still our understanding of the theoretical foundations of the usefulness of unlabeled
data remains somewhat limited. The simplest and the best understood situation
is when the data is described by an identi.able mixture model, and where
each class comes from a pure component. This natural setup and its implications
ware analyzed in [11, 5]. One important result was that in certain regimes, labeled
data becomes exponentially more valuable than unlabeled data.
However, in most realistic situations, one would not expect that the data comes
from a parametric mixture distribution with identifiable components. There have
been recent efforts to analyze the non-parametric situation, for example, .cluster.
and manifold assumptions have been suggested as a basis for analysis. Still,
a satisfactory and fairly complete theoretical understanding of the nonparametric
problem, similar to that in [11, 5] has not yet been developed.
In this paper we investigate an intermediate situation, when the data comes from a
probability distribution, which can be modeled, but not perfectly, by an identi.able
mixture distribution. This seems applicable to many situation, when, for example,
a mixture of Gaussians is used to model the data. the contribution of this paper is
an analysis of the role of labeled and unlabeled data depending on the amount of
imperfection in the model.
-
Convergence of Laplacian
Eigenmaps [long tech report]
M. Belkin, P. Niyogi, NIPS 2006.
-
On the Relation Between Low Density Separation,
Spectral Clustering and Graph Cuts [pdf, bib]
H. Narayanan, M. Belkin, P. Niyogi, NIPS 2006.
-
+ abstract
-
One of the intuitions underlying many graph-based
methods for clustering and semi-supervised learning,
is that class or cluster boundaries pass through
areas of low probability density. In this paper we
provide some formal analysis of that notion for a
probability distribution. We introduce a notion of
weighted boundary volume, which measures the length
of the class/cluster boundary weighted by the density
of the underlying probability distribution. We show
that sizes of the cuts of certain commonly used data
adjacency graphs converge to this continuous weighted
volume of the boundary.
-
Heat Flow and a Faster Algorithm to Compute the
Surface Area of a Convex Body [pdf, bib]
M. Belkin, H. Narayanan, P. Niyogi, FOCS 2006.
-
+ abstract
-
We draw on the observation that the amount of heat
diffusing outside of a heated body in a short period
of time is proportional to its surface area, to
design a simple algorithm for approximating the
surface area of a convex body given by a membership
oracle. Our method has a complexity of O*(n^4), where
n is the dimension, compared to O*(n^8.5) for the
previous best algorithm. We show that our complexity
cannot be improved given the current state-of-the-art
in volume estimation.
-
Maximum Margin Semi-Supervised Learning for Structured
Variables [pdf, bib]
Y. Altun, D. McAllester, M. Belkin, NIPS 2005.
-
+ abstract
-
Many real-world classification problems involve the
prediction of multiple inter-dependent variables
forming some structural dependency. Recent progress
in machine learning has mainly focused on supervised
classification of such structured variables. In this
paper, we investigate structured classification in a
semi-supervised setting. We present a discriminative
approach that utilizes the intrinsic geometry of
input patterns revealed by unlabeled data points and
we derive a maximum-margin formulation of
semi-supervised learning for structured variables.
Unlike transductive algorithms, our formulation
naturally extends to new test points.
-
Beyond the Point Cloud: from Transductive to
Semi-supervised Learning [pdf, bib]
V. Sindhwani, P. Niyogi, M. Belkin, ICML 2005.
-
+ abstract
-
Due to its occurrence in engineering domains and
implications for natural learning, the problem of
utilizing unlabeled data is attracting increasing
attention in machine learning. A large body of recent
literature has focussed on the transductive setting
where labels of unlabeled examples are estimated by
learning a function defined only over the point cloud
data. In a truly semi-supervised setting however, a
learning machine has access to labeled and unlabeled
examples and must make predictions on data points
never encountered before. In this paper, we show how
to turn transductive and standard supervised learning
algorithms into semi-supervised learners. We
construct a family of data-dependent norms on
Reproducing Kernel Hilbert Spaces (RKHS). These norms
allow us to warp the structure of the RKHS to reflect
the underlying geometry of the data. We derive
explicit formulas for the corresponding new kernels.
Our approach demonstrates state of the art
performance on a variety of classification tasks.
-
A Co-Regularization Approach to Semi-supervised
Learning with Multiple Views [pdf]
V. Sindhwani, P. Niyogi, M. Belkin,
ICML Workshop on Learning with Multiple Views, 2005.
-
+ abstract
-
The Co-Training algorithm uses unlabeled
examples in multiple views to bootstrap classifiers in each view, typically in a greedy
manner, and operating under assumptions
of view-independence and compatibility. In
this paper, we propose a Co-Regularization
framework where classifiers are learnt in each
view through forms of multi-view regularization. We propose algorithms within this
framework that are based on optimizing measures of agreement and smoothness over la-
beled and unlabeled examples. These algorithms naturally extend standard
regularization methods like Support Vector Machines
(SVM) and Regularized Least squares (RLS)
for multi-view semi-supervised learning, and
inherit their benefits and applicability to
high-dimensional classification problems. An
empirical investigation is presented that confirms the promise of this approach.
-
Linear Manifold Regularization for Large Scale
Semi-supervised Learning
V. Sindhwani, P. Niyogi, M. Belkin, S. Keerthi
ICML Workshop on Learning with Partially Classified
Training Data, 2005
-
Towards a Theoretical Foundation for Laplacian-Based
Manifold Methods [pdf, bib]
M. Belkin, P. Niyogi, COLT 2005
-
On Manifold Regularization [pdf]
M. Belkin, P. Niyogi, V. Sindhwani, AISTATS 2005
-
Limits of Spectral Clustering [pdf]    Outstanding
Student Paper Award
U. von Luxburg, O. Bousquet, M. Belkin, NIPS 2004.
-
Regularization and Semi-supervised Learning on Large
Graphs [pdf, bib]
M. Belkin, I. Matveeva, P. Niyogi, COLT 2004.
-
On the Convergence of Spectral Clustering on Random
Samples: the Normalized Case [pdf]
U. von Luxburg, O. Bousquet, M. Belkin, COLT 2004.
-
Tikhonov Regularization and Semi-supervised Learning
on Large Graphs
M. Belkin, I. Matveeva, P. Niyogi
ICASSP 2004, Special Session: Manifolds and Geometry in
Signal Processing (Invited paper).
-
Using Manifold Structure for Partially Labeled
Classification [pdf]
M. Belkin, P. Niyogi, NIPS 2002
-
Laplacian Eigenmaps and Spectral Techniques for
Embedding and Clustering [pdf]
M. Belkin, P. Niyogi, NIPS 2001.
-
Using Eigenvectors of the Bigram Graph to Infer
Morpheme Identity [pdf]
M. Belkin, J. Goldsmith
Morphological and Phonological Learning: Proceedings of
the 6th Workshop
of the ACL Special Interest Group in Computational
Phonology (SIGPHON), Philadelphia, July 2002, pp.
41-47.
Ph.D. Thesis:
-
Problems of Learning on Manifolds [pdf, bib]
Ph.D. Dissertation, University of Chicago, Dept. of
Mathematics, 2003.
Advisor: Partha Niyogi.
Book chapter:
-
The Geometric Basis of Semi-supervised Learning [pdf]
V. Sindhwani, M. Belkin, P. Niyogi.
Book chapter in "Semi-supervised Learning", Chapelle, O.,
B. Schölkopf and A. Zien, Eds., MIT Press, 2006.
Technical Reports:
-
Consistency of Spectral Clustering [pdf]
U. von Luxburg, M. Belkin, O. Bousquet
Max Planck Insitute for Biological Cybernetics Technical
Report TR-134, 2004. (The Annals of Statistics, to
appear)
-
+ abstract
-
Consistency is a key property of statistical
algorithms when the data is drawn from some
underlying probability distribution. Surprisingly,
despite decades of work, little is known about
consistency of most clustering algorithms. In this
paper we investigate consistency of the popular
family of spectral clustering algorithms, which
clusters the data with the help of eigenvectors of
graph Laplacian matrices. We develop new methods to
establish that for increasing sample size, those
eigenvectors converge to the eigenvectors of certain
limit operators. As a result we can prove that one of
the two major classes of spectral clustering
(normalized clustering) converges under very general
conditions, while the other (unnormalized clustering)
is only consistent under strong additional
assumptions, which are not always satisfied in real
data. We conclude that our analysis provides strong
evidence for the superiority of normalized spectral
clustering.
-
Manifold Regularization: a Geometric Framework for
Learning from Examples
M. Belkin, P. Niyogi, V. Sindhwani
University of Chicago CS Technical Report TR-2004-06,
2004.
-
Regression and Regularization on Large Graphs
[link]
M. Belkin, I. Matveeva, P. Niyogi
The University of Chicago CS Technical Report
TR-2003-11.
-
Semi-supervised learning on manifolds
[link]
M. Belkin, P. Niyogi
The University of Chicago CS Technical Report
TR-2002-12.
-
Laplacian Eigenmaps for Dimensionality Reduction and
Data Representation
[link]
M. Belkin, P. Niyogi
The University of Chicago CS Technical Report TR-2002-01.