Publications of Mikhail Belkin

[Journal papers] -- [Conference papers] -- [Ph.D. Thesis] -- [Book Chapter] -- [Technical Reports]


Journal Papers and preprints:

  1. On Learning with Integral Operators [pdf, bib]
    L. Rosasco, M.Belkin, E. De Vito,
    Journal of Machine Learning Research (accepted, pending minor revisions).
    + abstract
    A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold. First, we use a technique based on a concentration inequality for Hilbert spaces to provide new simplified proofs for a number of results in spectral approximation. Second, using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from [28].

  2. Learning Gaussian Mixtures with Arbitrary Separation [pdf]
    Mikhail Belkin, Kaushik Sinha
    Preprint
    + abstract
    In this paper we present a method for learning the parameters of a mixture of k identical spherical Gaussians in n-dimensional space with an arbitrarily small separation between the components. Our algorithm is polynomial in all parameters other than k. The algorithm is based on an appropriate grid search over the space of parameters. The theoretical analysis of the algorithm hinges on a reduction of the problem to 1 dimension and showing that two 1- dimensional mixtures whose densities are close in the L2 norm must have similar means and mixing coefficients. To produce such a lower bound for the L2 norm in terms of the distances between the corresponding means, we analyze the behavior of the Fourier transform of a mixture of Gaussians in 1 dimension around the origin, which turns out to be closely related to the properties of the Vandermonde matrix obtained from the component means. Analysis of this matrix together with basic function approximation results allows us to provide a lower bound for the norm of the mixture in the Fourier domain. In recent years much research has been aimed at understanding the computational aspects of learning parameters of Gaussians mixture distributions in high dimension. To the best of our knowledge all existing work on learning parameters of Gaussian mixtures assumes minimum separation between components of the mixture which is an increasing function of either the dimension of the space n or the number of components k. In our paper we prove the first result showing that parameters of a n-dimensional Gaussian mixture model with arbitrarily small component separation can be learned in time polynomial in n. We also believe that the new technique used to provide the key lower bound is of independent interest and will be useful for other related problems.

  3. Data spetroscopy: eigenspace of convolution operators and clustering [pdf]
    Tao Shi, Mikhail Belkin, Bin Yu
    The Annals of Statistics, vol. 37, Number 6B (2009), 3960-3984.
    + abstract
    This paper focuses on obtaining clustering information about a distribution from its i.i.d. samples. We develop theoretical results to understand and use clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function with a sufficiently fast tail decay. In particular, we provide population analyses to gain insights into which eigenvectors should be used and when the clustering information for the distribution can be recovered from the sample. We learn that a fixed number of top eigenvectors might at the same time contain redundant clustering information and miss relevant clustering information. We use this insight to design the Data Spectroscopic clustering (DaSpec) algorithm that utilizes properly selected eigenvectors to determine the number of clusters automatically and to group the data accordingly. Our findings extend the intuitions underlying existing spectral techniques such as spectral clustering and Kernel Principal Components Analysis, and provide new understanding into their usability and modes of failure. Simulation studies and experiments on real world data are conducted to show the potential of our algorithm. In particular, DaSpec is found to handle unbalanced groups and recover clusters of different shapes better than the competing methods.

  4. Convergence of Laplacian Eigenmaps [pdf, bib]
    M. Belkin, P. Niyogi
    preprint, 2008.
    + abstract
    Geometrically based methods for various tasks of data analysis have attracted considerable attention over the last few years. In many of these algorithms, a central role is played by the eigenvectors of the graph Laplacian of a data-derived graph. In this paper, we show that if points are sampled uniformly at random from an unknown submanifold ${\cal M}$ of $\R^N$, then the eigenvectors of a suitably constructed graph Laplacian converge to the eigenfunctions of the Laplace Beltrami operator on ${\cal M}$. This basic result directly establishes the convergence of spectral manifold learning algorithms such as Laplacian Eigenmaps and Diffusion Maps. It also has implications for the understanding of geometric algorithms in data analysis, computational harmonic analysis, geometric random graphs, and graphics.

  5. Towards a Theoretical Foundation for Laplacian-Based Manifold Methods [pdf, bib]
    M. Belkin, P. Niyogi
    Journal of Computer and System Sciences, 2008.
    Volume 74 , Issue 8, pp. 1289-1308 Special Issue on Learning Theory (invited).
    + abstract
    In recent years manifold methods have attracted a considerable amount of atten- tion in machine learning. However most algorithms in that class may be termed "manifold-motivated" as they lack any explicit theoretical guarantees. In this pa- per we take a step towards closing the gap between theory and practice for a class of Laplacian-based manifold methods. These methods utilize the graph Laplacian associated to a data set for a variety of applications in semi-supervised learning, clustering, data representation. We show that under certain conditions the graph Laplacian of a point cloud of data samples converges to the Laplace-Beltrami operator on the underlying mani- fold. Theorem 3.1 contains the .rst result showing convergence of a random graph Laplacian to the manifold Laplacian in the context of machine learning.

  6. Manifold Regularization: a Geometric Framework for Learning from Labeled and Unlabeled Examples [pdf, bib]
    M. Belkin, P. Niyogi, V. Sindhwani
    Journal of Machine Learning Research, 7(Nov):2399-2434, 2006.
    + abstract
    We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework.

  7. Consistency of Spectral Clustering [pdf, bib]
    U. von Luxburg, M. Belkin, O. Bousquet,
    The Annals of Statistics 2008, Vol. 36, No. 2, 555-586.
    + abstract
    Consistency is a key property of statistical algorithms when the data is drawn from some underlying probability distribution. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of the popular family of spectral clustering algorithms, which clusters the data with the help of eigenvectors of graph Laplacian matrices. We develop new methods to establish that for increasing sample size, those eigenvectors converge to the eigenvectors of certain limit operators. As a result we can prove that one of the two major classes of spectral clustering (normalized clustering) converges under very general conditions, while the other (unnormalized clustering) is only consistent under strong additional assumptions, which are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering.

  8. Semi-supervised Learning on Riemannian Manifolds [pdf, bib]
    M. Belkin, P. Niyogi
    Machine Learning, 56, Invited, secial Issue on Clustering, 209-239, 2004.
    + abstract
    We consider the general problem of utilizing both labeled and unlabeled data to improve classification accuracy. Under the assumption that the data lie on a submanifold in a high dimensional space, we develop an algorithmic framework to classify a partially labeled data set in a principled manner. The central idea of our approach is that classification functions are naturally defined only on the submanifold in question rather than the total ambient space. Using the Laplace-Beltrami operator one produces a basis (the Laplacian Eigenmaps) for a Hilbert space of square integrable functions on the submanifold. To recover such a basis, only unlabeled examples are required. Once such a basis is obtained, training can be performed using the labeled data set. Our algorithm models the manifold using the adjacency graph for the data and approximates the Laplace- Beltrami operator by the graph Laplacian. We provide details of the algorithm, its theoretical justification, and several practical applications for image, speech, and text classification.

  9. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation [pdf, bib]
    M. Belkin, P. Niyogi
    Neural Computation, June 2003; 15 (6):1373-1396.
    + abstract
    One of the central problems in machine learning and pattern recognition is to develop appropriate representations for complex data. We consider the problem of constructing a representation for data lying on a lowdimensional manifold embedded in a high-dimensional space. Drawing on the correspondence between the graph Laplacian, the Laplace Beltrami operator on the manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for representing the highdimensional data. The algorithm provides a computationally efficient approach to non dimensionality reduction that has locality-preserving properties and a natural connection to clustering. Some potential applications and illustrative examples are discussed.


Refereed and Invited Conference Proceedings:
  1. Semi-supervised Learning using Sparse Eigenfunction Bases [pdf, bib]
    K. Sinha, M.Belkin, NIPS 2009.
    + abstract
    We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices. It turns out that when the data has clustered, that is, when the high density regions are suf.ciently separated by low density valleys, each high density area corresponds to a unique representative eigenvector. Linear combination of such eigenvectors (or, more precisely, of their Nystrom extensions) provide good candidates for good classi.cation functions when the cluster assumption holds. By .rst choosing an appropriate basis of these eigenvectors from unlabeled data and then using labeled data with Lasso to select a classi.er in the span of these eigenvectors, we obtain a classi.er, which has a very sparse representation in this basis. Importantly, the sparsity corresponds naturally to the cluster assumption. Experimental results on a number of real-world data-sets show that our method is competitive with the state of the art semi-supervised learning algorithms and outperforms the natural base-line algorithm (Lasso in the Kernel PCA basis).

  2. A Note on Learning with Integral Operators [pdf, bib]
    L. Rosasco, M.Belkin, E. De Vito, COLT 2009.
    + abstract
    A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold. First, We use a technique based on a concentration inequality for Hilbert spaces to provide new simplified proofs for a number of results in spectral approximation. Second, using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from [28].

  3. Constructing Laplace Operator from Point Clouds in R^d [pdf, bib, code]
    M. Belkin, J. Sun, Y. Wang, SODA 2009, pp. 1031--1040.
    + abstract
    We present an algorithm for approximating the Laplace- Beltrami operator from an arbitrary point cloud obtained from a k-dimensional manifold embedded in the d- dimensional space. We show that this PCD Laplace (Point- Cloud Data Laplace) operator converges to the Laplace- Beltrami operator on the underlying manifold as the point cloud becomes denser. Unlike the previous work, we do not assume that the data samples are independent identically distributed from a probability distribution and do not require a global mesh. The resulting algorithm is easy to implement. We present experimental results indicating that even for point sets sampled from a uniform distribution, PCD Laplace converges faster than the weighted graph Laplacian. We also show that using our PCD Laplacian we can directly estimate certain geometric invariants, such as manifold area.

  4. Data Spectroscopy: Learning Mixture Models using Eigenspaces of Convolution Operators [pdf, bib]
    T. Shi, M. Belkin, B. Yu, ICML 2008.
    + abstract
    In this paper we develop a spectral frame- work for estimating mixture distributions, specifically Gaussian mixture models. In physics, spectroscopy is often used for the identification of substances through their spectrum. Treating a kernel function K(x, y) as "light" and the sampled data as "sub- stance", the spectrum of their interaction (eigenvalues and eigenvectors of the kernel matrix K) unveils certain aspects of the un- derlying parametric distribution p, such as the parameters of a Gaussian mixture. Our approach extends the intuitions and analyses underlying the existing spectral techniques, such as spectral clustering and Kernel Prin- cipal Components Analysis (KPCA). We construct algorithms to estimate param- eters of Gaussian mixture models, includ- ing the number of mixture components, their means and covariance matrices, which are im- portant in many practical applications. We provide a theoretical framework and show en- couraging experimental results.

  5. Discrete Laplace Operator for Meshed Surfaces [pdf, bib]
    M. Belkin, J. Sun, Y. Wang, SOCG 2008.
    + abstract
    In recent years a considerable amount of work in graphics and geometric optimization used tools based on the Laplace-Beltrami operator on a surface. The applications of the Laplacian include mesh editing, surface smoothing, and shape interpolations among others. However, it has been shown [12, 23, 25] that the popular cotangent approximation schemes do not provide convergent point-wise (or even L^2) estimates, while many applications rely on point-wise estimation. Existence of such schemes has been an open question [12]. In this paper we propose the first algorithm for approximating the Laplace operator of a surface from a mesh with point-wise convergence guarantees applicable to arbitrary meshed surfaces. We show that for a sufficiently fine mesh over an arbitrary surface, our mesh Laplacian is close to the Laplace-Beltrami operator on the surface at every point of the surface. Moreover, the proposed algorithm is simple and easily implementable. Experimental evidence shows that our algorithm exhibits convergence empirically and outperforms cotangent-based methods in providing accurate approximation of the Laplace operator for various meshes.

  6. Component Based Shape Retrieval Using Differential Profiles.
    L. Ding, M. Belkin, Proceedings of ACM International Conference on Multimedia Information Retrieval, 2008.
    + abstract
    In this paper, we describe the use of differential profiles, which are computed from 2D shapes smoothed with Gaussian functions, as the shape features for building a shape retrieval system. We build a global shape component dictionary from all the shape descriptors collected from shapes available in a database and then represent each shape as a probabilistic mixture of elements from such a dictionary. Finally, shape retrieval from a given database is simply done by comparing the mixing coefficients of the model of a query shape and those of known shapes. Our retrieval experiments are done on both object contour and line drawing collections and show promising results.


  7. The value of labeled and unlabeled examples when the model is imperfect [pdf, bib]
    K. Sinha, M. Belkin, NIPS 2007.
    + abstract
    Semi-supervised learning, i.e. learning from both labeled and unlabeled data has received signi.cant attention in the machine learning literature in recent years. Still our understanding of the theoretical foundations of the usefulness of unlabeled data remains somewhat limited. The simplest and the best understood situation is when the data is described by an identi.able mixture model, and where each class comes from a pure component. This natural setup and its implications ware analyzed in [11, 5]. One important result was that in certain regimes, labeled data becomes exponentially more valuable than unlabeled data. However, in most realistic situations, one would not expect that the data comes from a parametric mixture distribution with identifiable components. There have been recent efforts to analyze the non-parametric situation, for example, .cluster. and manifold assumptions have been suggested as a basis for analysis. Still, a satisfactory and fairly complete theoretical understanding of the nonparametric problem, similar to that in [11, 5] has not yet been developed. In this paper we investigate an intermediate situation, when the data comes from a probability distribution, which can be modeled, but not perfectly, by an identi.able mixture distribution. This seems applicable to many situation, when, for example, a mixture of Gaussians is used to model the data. the contribution of this paper is an analysis of the role of labeled and unlabeled data depending on the amount of imperfection in the model.

  8. Convergence of Laplacian Eigenmaps [long tech report]
    M. Belkin, P. Niyogi, NIPS 2006.

  9. On the Relation Between Low Density Separation, Spectral Clustering and Graph Cuts [pdf, bib]
    H. Narayanan, M. Belkin, P. Niyogi, NIPS 2006.
    + abstract
    One of the intuitions underlying many graph-based methods for clustering and semi-supervised learning, is that class or cluster boundaries pass through areas of low probability density. In this paper we provide some formal analysis of that notion for a probability distribution. We introduce a notion of weighted boundary volume, which measures the length of the class/cluster boundary weighted by the density of the underlying probability distribution. We show that sizes of the cuts of certain commonly used data adjacency graphs converge to this continuous weighted volume of the boundary.

  10. Heat Flow and a Faster Algorithm to Compute the Surface Area of a Convex Body [pdf, bib]
    M. Belkin, H. Narayanan, P. Niyogi, FOCS 2006.
    + abstract
    We draw on the observation that the amount of heat diffusing outside of a heated body in a short period of time is proportional to its surface area, to design a simple algorithm for approximating the surface area of a convex body given by a membership oracle. Our method has a complexity of O*(n^4), where n is the dimension, compared to O*(n^8.5) for the previous best algorithm. We show that our complexity cannot be improved given the current state-of-the-art in volume estimation.

  11. Maximum Margin Semi-Supervised Learning for Structured Variables [pdf, bib]
    Y. Altun, D. McAllester, M. Belkin, NIPS 2005.
    + abstract
    Many real-world classification problems involve the prediction of multiple inter-dependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points.

  12. Beyond the Point Cloud: from Transductive to Semi-supervised Learning [pdf, bib]
    V. Sindhwani, P. Niyogi, M. Belkin, ICML 2005.
    + abstract
    Due to its occurrence in engineering domains and implications for natural learning, the problem of utilizing unlabeled data is attracting increasing attention in machine learning. A large body of recent literature has focussed on the transductive setting where labels of unlabeled examples are estimated by learning a function defined only over the point cloud data. In a truly semi-supervised setting however, a learning machine has access to labeled and unlabeled examples and must make predictions on data points never encountered before. In this paper, we show how to turn transductive and standard supervised learning algorithms into semi-supervised learners. We construct a family of data-dependent norms on Reproducing Kernel Hilbert Spaces (RKHS). These norms allow us to warp the structure of the RKHS to reflect the underlying geometry of the data. We derive explicit formulas for the corresponding new kernels. Our approach demonstrates state of the art performance on a variety of classification tasks.

  13. A Co-Regularization Approach to Semi-supervised Learning with Multiple Views [pdf]
    V. Sindhwani, P. Niyogi, M. Belkin,
    ICML Workshop on Learning with Multiple Views, 2005.
    + abstract
    The Co-Training algorithm uses unlabeled examples in multiple views to bootstrap classifiers in each view, typically in a greedy manner, and operating under assumptions of view-independence and compatibility. In this paper, we propose a Co-Regularization framework where classifiers are learnt in each view through forms of multi-view regularization. We propose algorithms within this framework that are based on optimizing measures of agreement and smoothness over la- beled and unlabeled examples. These algorithms naturally extend standard regularization methods like Support Vector Machines (SVM) and Regularized Least squares (RLS) for multi-view semi-supervised learning, and inherit their benefits and applicability to high-dimensional classification problems. An empirical investigation is presented that confirms the promise of this approach.

  14. Linear Manifold Regularization for Large Scale Semi-supervised Learning
    V. Sindhwani, P. Niyogi, M. Belkin, S. Keerthi
    ICML Workshop on Learning with Partially Classified Training Data, 2005

  15. Towards a Theoretical Foundation for Laplacian-Based Manifold Methods [pdf, bib]
    M. Belkin, P. Niyogi, COLT 2005

  16. On Manifold Regularization [pdf]
    M. Belkin, P. Niyogi, V. Sindhwani, AISTATS 2005

  17. Limits of Spectral Clustering [pdf]    Outstanding Student Paper Award
    U. von Luxburg, O. Bousquet, M. Belkin, NIPS 2004.

  18. Regularization and Semi-supervised Learning on Large Graphs [pdf, bib]
    M. Belkin, I. Matveeva, P. Niyogi, COLT 2004.

  19. On the Convergence of Spectral Clustering on Random Samples: the Normalized Case [pdf]
    U. von Luxburg, O. Bousquet, M. Belkin, COLT 2004.

  20. Tikhonov Regularization and Semi-supervised Learning on Large Graphs
    M. Belkin, I. Matveeva, P. Niyogi
    ICASSP 2004, Special Session: Manifolds and Geometry in Signal Processing (Invited paper).

  21. Using Manifold Structure for Partially Labeled Classification [pdf]
    M. Belkin, P. Niyogi, NIPS 2002

  22. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering [pdf]
    M. Belkin, P. Niyogi, NIPS 2001.

  23. Using Eigenvectors of the Bigram Graph to Infer Morpheme Identity [pdf]
    M. Belkin, J. Goldsmith
    Morphological and Phonological Learning: Proceedings of the 6th Workshop
    of the ACL Special Interest Group in Computational Phonology (SIGPHON), Philadelphia, July 2002, pp. 41-47.


Ph.D. Thesis:
  • Problems of Learning on Manifolds [pdf, bib]
    Ph.D. Dissertation, University of Chicago, Dept. of Mathematics, 2003.
    Advisor: Partha Niyogi.


Book chapter:
  • The Geometric Basis of Semi-supervised Learning [pdf]
    V. Sindhwani, M. Belkin, P. Niyogi.
    Book chapter in "Semi-supervised Learning", Chapelle, O., B. Schölkopf and A. Zien, Eds., MIT Press, 2006.


Technical Reports:

  • Consistency of Spectral Clustering [pdf]
    U. von Luxburg, M. Belkin, O. Bousquet
    Max Planck Insitute for Biological Cybernetics Technical Report TR-134, 2004. (The Annals of Statistics, to appear)
    + abstract
    Consistency is a key property of statistical algorithms when the data is drawn from some underlying probability distribution. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of the popular family of spectral clustering algorithms, which clusters the data with the help of eigenvectors of graph Laplacian matrices. We develop new methods to establish that for increasing sample size, those eigenvectors converge to the eigenvectors of certain limit operators. As a result we can prove that one of the two major classes of spectral clustering (normalized clustering) converges under very general conditions, while the other (unnormalized clustering) is only consistent under strong additional assumptions, which are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering.

  • Manifold Regularization: a Geometric Framework for Learning from Examples
    M. Belkin, P. Niyogi, V. Sindhwani
    University of Chicago CS Technical Report TR-2004-06, 2004.

  • Regression and Regularization on Large Graphs [link]
    M. Belkin, I. Matveeva, P. Niyogi
    The University of Chicago CS Technical Report TR-2003-11.

  • Semi-supervised learning on manifolds [link]
    M. Belkin, P. Niyogi
    The University of Chicago CS Technical Report TR-2002-12.

  • Laplacian Eigenmaps for Dimensionality Reduction and Data Representation [link]
    M. Belkin, P. Niyogi
    The University of Chicago CS Technical Report TR-2002-01.