gavi's picture
From gavi rss RSS  subscribe Subscribe

Nonlinear Dimensionality Reduction and Manifold Learning 



 

 
 
Tags:  Dimensionality  Reduction 
Views:  2037
Downloads:  19
Published:  August 30, 2007
 
0
download

Share plick with friends Share
save to favorite
Report Abuse Report Abuse
 
Related Plicks
Dimensional Aster Doily

Dimensional Aster Doily

From: Brendasdoilies
Views: 26 Comments: 0
Dimensional Aster Doily
 
Group Theory and Three-dimensional Manifolds (Mathematical Monograph, No. 4)

Group Theory and Three-dimensional Manifolds (Mathematical Monograph, No. 4)

From: anon-389856
Views: 180 Comments: 0
Group Theory and Three-dimensional Manifolds (Mathematical Monograph, No. 4) ,rx library delphi 2007, calibre ebook cnet, jacksonville public library texas, benefits information technology infrastructure library
 
Electron Tomography: Methods for Three-Dimensional Visualization of Structures in the Cell

Electron Tomography: Methods for Three-Dimensional Visualization of Structures in the Cell

From: anon-392016
Views: 367 Comments: 0
Electron Tomography: Methods for Three-Dimensional Visualization of Structures in the Cell ,epub reader for mac, bad back book ebook, embrace library, cane toomer ebook
 
Optimization Theory: The Finite Dimensional Case (Pure & Applied Mathematics)

Optimization Theory: The Finite Dimensional Case (Pure & Applied Mathematics)

From: anon-390739
Views: 268 Comments: 0
Optimization Theory: The Finite Dimensional Case (Pure & Applied Mathematics) ,why do libraries need a database, shareware html code library, library method of collecting data, virtual library and user 0
 
See all 
 
More from this user
Microsoft Office Business Scorecard Manager 2005

Microsoft Office Business Scorecard Manager 2005

From: gavi
Views: 3321
Comments: 0

Google Earth

Google Earth

From: gavi
Views: 4610
Comments: 1

Comparing J2EE with .NET

Comparing J2EE with .NET

From: gavi
Views: 3781
Comments: 0

flash

flash

From: gavi
Views: 2250
Comments: 0

Evolution Of Soa - Gartner

Evolution Of Soa - Gartner

From: gavi
Views: 3883
Comments: 0

 
See all 
 
 
 URL:          AddThis Social Bookmark Button
Embed Thin Player: (fits in most blogs)
Embed Full Player :
 
 

Name

Email (will NOT be shown to other users)

 

 
 
Comments: (watch)
 
 
Notes:
 
Slide 1: Nonlinear Dimensionality Reduction and Manifold Learning Martin Law CSE Department Michigan State University
Slide 2: Why Dimensionality Reduction More features ) more information ) potentially higher accuracy Unfortunately, more features ) harder to train a classifier The curse of dimensionality Solution: start with as many potentially useful features as possible, and then reduce the number of features 08/30/07 2
Slide 3: Why Dimensionality Reduction Number of potential features can be huge Image data: each pixel of an image • A 64£64 image ) 4096 features Genomic data: expression levels of the genes • Several thousand features Text categorization: frequencies of phrases in a document or in a web page • More than ten thousand features 08/30/07 3
Slide 4: Why Dimensionality Reduction Two approaches to reduce no. of features Feature selection: select the salient features by some criteria Feature extraction: obtain a reduced set of features by a transformation of all features Data visualization and exploratory data analysis also need to reduce dimension Usually reduce to 2D or 3D 08/30/07 4
Slide 5: Outline Linear methods and their drawbacks Manifold and dimensionality reduction A taxonomy of the nonlinear algorithms A quick tour of some nonlinear dimensionality reduction algorithms Auto-associative NN, kernel PCA, principal curves, SOM, GTM, MDS, ISOMAP, LLE Conclusion and possible research directions 08/30/07 5
Slide 6: Linear Method: Principal Component Analysis (PCA) PCA finds the best “subspace” that captures as much data variance as possible Based on eigen-decomposition of data covariance matrix Principal component 08/30/07 6
Slide 7: Linear Method: Linear Discriminant Analysis (LDA) LDA finds the projection that best separates the two classes Multiple discriminant analysis (MDA) extends LDA to multiple classes Best projection direction for classification 08/30/07 7
Slide 8: Deficiencies of Linear Methods Data may not be best summarized by linear combination of features Example: PCA cannot discover 1D structure of a helix 2 0 1 5 1 0 5 0 1 0 .5 0 -0 .5 -0 .5 -1 -1 0 .5 0 1 08/30/07 8
Slide 9: Deficiencies of Linear Methods Useful characteristics for real world data are usually not linear combination of features Example: poses of faces Example: when shall I get hit (from motion data) 08/30/07 Figures from ISOMAP paper and Jerkins et. al 9
Slide 10: Deficiencies of Linear Methods Currently, these characteristics can only be extracted by a carefully crafted system Time consuming and tedious Good nonlinear dimensionality reduction algorithms can help by extracting these features automatically 08/30/07 10
Slide 11: Manifold and Dimensionality Reduction Manifold: generalized “subspace” in Rn The value of n is usually large Points in a local region on a manifold can be indexed by a subset of Rk The value of k is usually small The index is called the co-ordinate of the point on the manifold There may not be such co-ordinate systems globally 08/30/07 11
Slide 12: Example of a Manifold Rn M z R 2 x2 x: coordinate for z x x1 08/30/07 12
Slide 13: Manifold and Dimensionality Reduction Assumption: data points lie closely to M Most unsupervised feature extraction schemes assume this implicitly If there is a global indexing scheme for M Find z on Mthat is closest to the data y Interpret the coordinates of z as the reduced dimension representation of y Other mapping that preserves certain local properties of Mcan be used 08/30/07 13
Slide 14: Manifold for Dimensionality Reduction y Rn M z R 2 x2 x: coordinate for z x x1 08/30/07 14
Slide 15: Taxonomy of the Nonlinear Algorithms Is there any explicit notion of manifold The map to reduce dimensionality: parametric or non-parametric Can the algorithm use a dissimilarity matrix (instead of pattern matrix) as input Is local neighborhood in Rn used 08/30/07 15
Slide 16: Summary of the Algorithms Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes Local No neighborhood 08/30/07 16
Slide 17: Summary of the Algorithms Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes Local No neighborhood 08/30/07 17
Slide 18: Auto-associative Neural Networks Given: a data set Y={y1, …, ym} in Rn Train a neural network using yi as input and target output, with a “middle” layer of k neurons, using square loss function The output of the k neurons when yi is used as input are the transformed vector Denote this by xi 08/30/07 18
Slide 19: Auto-associative Neural Networks Input: Target output: Actual output: Find weight that minimize Output layer Decoding layer “Middle” layer Encoding layer Input layer Transformed data: 08/30/07 19
Slide 20: Auto-associative Neural Networks Intuition: xi is “compressed” version of yi as it can reconstruct yi approximately When the transfer function is linear, this approach is equivalent to PCA Different transfer functions can be used for different types of mapping Drawbacks: training not simple, local minima, meaning of extracted features obscure 08/30/07 20
Slide 21: Summary of the Algorithms Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes Local No neighborhood 08/30/07 21
Slide 22: Recap of Support Vector Machines (SVM) SVM finds the separating hyperplane between 2 classes that maximizes the margin Key to the nonlinearity of SVM: (y) (.) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) Input space 08/30/07 Feature space 22
Slide 23: Recap of Support Vector Machines (SVM) Why (x): we hope the data become linearly separable in the feature space This is equivalent to finding a non-linear decision boundary in the input space 08/30/07 23
Slide 24: Recap of Support Vector Machines (SVM) Example K(x, y): kernel function 08/30/07 24
Slide 25: Recap of Support Vector Machines (SVM) In practice, all computation in the feature space are performed using K(x, y) We specify K(x, y), and not (x, y), in practice Intuition of K(x, y): similarity between x and y Lesson here: any linear algorithm that can be expressed by inner products can be made nonlinear by going to the feature space 08/30/07 25
Slide 26: Kernel PCA (KPCA) Perform PCA in the feature space Steps for KPCA Calculate the kernel matrix Perform eigen-decomposition of Let i be the entries of the j-th eigenvector is the j-th principal component 08/30/07 26
Slide 27: Kernel PCA The j-th component of the transformed vector of y is the projection of (y) on the jth principal component in the feature space The principal components in the feature space can be visualized by a contour plot For linear PCA: Principal component The contours 08/30/07 27
Slide 28: Kernel PCA: Example Figures from ftp://ftp.research.micro soft.com/users/mtippin g/skpca_nips.ps.gz 08/30/07 28
Slide 29: Summary of the Algorithms Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes Local No neighborhood 08/30/07 29
Slide 30: Principal Curves f(s) Find the curve that passes through the “center” of the data points The curve can be generalized to a manifold 08/30/07 30
Slide 31: Principal Curves Version one: f(s) is self-consistent, i.e., f(s) is the average of all points projected to it 08/30/07 31
Slide 32: Principal Curves Version two: the data are generated by first picking a point on f(s), and then Gaussian noise is added f(s) is defined in a non-parametric manner Locally weighted running lines or cubic smoothing splines can be used to smooth f(s) 08/30/07 32
Slide 33: Summary of the Algorithms Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes Local No neighborhood 08/30/07 33
Slide 34: Kohonen’s Self-organizing Maps (SOM) Learn a topographic map between points in Rn and (usually) a 2D array of grid points (nodes) Each node contains a reference vector tj in Rn Learning of SOM is often on-line, i.e., the data points come sequentially 08/30/07 34
Slide 35: Learning of SOM For a new data point z, node j wins if tj is closest to z among all reference vectors Node j and its neighbors learn from z The nodes learn by moving their reference vectors closer to z with a learning rate  The neighborhood of a node is large initially, but becomes smaller gradually  also decreases gradually until convergence 08/30/07 35
Slide 36: SOM: When Neighborhood Is Large Initially t1 =(9,5,2,6) t5=(9,9,9,1) t9 =(3,2,3,3) t13=(0,0,5,5) New data (3,7,9,3) t2 =(4,8,0,7) t6=(7,9,8,3) t10=(7,3,2,1) t14=(8,7,8,1) winner t3 =(5,6,4,6) t7=(4,1,9,8) t11=(9,2,8,6) t15=(3,4,7,5) t4 =(0,3,0,1) t8=(7,5,8,2) t12=(3,5,8,7) t16=(1,1,7,9) 08/30/07 36
Slide 37: SOM: Example t1 =(9,5,2,6) t5 =(9,9,9,1) t9 =(3,5,6,3) t13 =(2,4,7,4) t2 =(4,8,0,7) t6 =(5,8,9,3) t10 =(5,5,6,2) t14 =(6,7,9,2) t3 =(5,6,4,6) t7 =(4,4,9,6) t11 =(6,5,9,5) t15 =(3,6,8,4) t4 =(0,3,0,1) t8 =(5,6,9,3) t12 =(3,6,9,5) t16 =(2,4,8,6) 08/30/07 37
Slide 38: SOM: When Neighborhood Is Small t1 =(9,5,2,6) t5 =(9,9,9,1) t9 =(3,5,6,3) t13 =(2,4,7,4) New data (4,8,4,5) winner t2 =(4,8,0,7) t6 =(5,8,9,3) t10 =(5,5,6,2) t14 =(6,7,9,2) t3 =(5,6,4,6) t7 =(4,4,9,6) t11 =(6,5,9,5) t15 =(3,6,8,4) t4 =(0,3,0,1) t8 =(5,6,9,3) t12 =(3,6,9,5) t16 =(2,4,8,6) 08/30/07 38
Slide 39: SOM: When Neighborhood Is Small t1 =(9,5,2,6) t5 =(9,9,9,1) t9 =(3,5,6,3) t13 =(2,4,7,4) t2 =(4,8,2,6) t6 =(5,8,9,3) t10 =(5,5,6,2) t14 =(6,7,9,2) t3 =(5,7,4,6) t7 =(4,6,7,5) t11 =(6,5,9,5) t15 =(3,6,8,4) t4 =(2,6,2,3) t8 =(5,6,9,3) t12 =(3,6,9,5) t16 =(2,4,8,6) 08/30/07 39
Slide 40: Summary of the Algorithms Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Disimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes Local No neighborhood 08/30/07 40
Slide 41: Generative Topographic Maps (GTM) A probabilistic formalization of SOM Figure from GTM paper 08/30/07 41
Slide 42: Generative Topographic Maps (GTM) The mapping y consist of two parts  Map to a point on the manifold l=1,...,n Gaussian noise is then added to zi to form yi The manifold is parameterized by W and the M functions j(x) j(x) usually take the form of Gaussians 08/30/07 42
Slide 43: Generative Topographic Maps (GTM) Closely related to radial basis function neural networks Smoothness of the manifold can be controlled by regularizing W Other formalization of SOM exists Example: topographic vector quantization (TVQ) performs vector quantization with neighborhood relationship as in SOM 08/30/07 43
Slide 44: Summary of the Algorithms Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes Local No neighborhood 08/30/07 44
Slide 45: Multidimensional Scaling (MDS) Given: ij, a distance measure between zi and zj The items zi and zj may not be in an Euclidean space ij might be asymmetric and do not satisfy the triangle inequality Goal: map all zi to xi, points in a low dimension space Rk, that best “preserves” ij Two types of MDS Metric MDS: minimizes discrepancy in numerical values • Classical scaling and metric least square scaling Nonmetric MDS: minimizes discrepancy in ranks 08/30/07 45
Slide 46: Multidimensional Scaling (MDS) zi ij zj xi dij MDS xj Items in possibly non-Euclidean space Data points in low-D Euclidean space 08/30/07 Goal: find xi s that make ij as “close” to dij as possible 46
Slide 47: Classical Scaling Define [A]rs = - ½ ij2 and [B]rs=xrT xs 1: matrix with entries all 1 X, the coordinates of the projected points, can be recovered by eigenvalue decomposition of B It is equivalent to linear PCA if zi is in Rn and Euclidean distances between zi are used Classical scaling is essentially “linear” 08/30/07 47
Slide 48: Metric Least Square Scaling (MLSS) Minimize a certain function of the difference between ij and dij dij is the distance between xi and xj in Rk Sammon maps: minimize Sammon’s stress Gradient descent can be used for finding xi that minimize S 08/30/07 48
Slide 49: Notes on MDS No explicit mapping  If a new item comes, we do not know where it should be mapped to It is not easy to change the dimension of Rk (where xi are in) in MLSS 08/30/07 49
Slide 50: Summary of the Algorithms Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes Local No neighborhood 08/30/07 50
Slide 51: Geodesic Distance Geodesic: the shortest curve on a manifold that connects two points on the manifold Example: on a sphere, geodesics are great circles Geodesic distance: length of the geodesic A B Figure from http://mathworld.wolfra m.com/GreatCircle.html 08/30/07 51
Slide 52: Geodesic Distance Euclidean distance needs not be a good measure between two points on a manifold Length of geodesic is more appropriate Example: Swiss roll Figure from LLE paper 08/30/07 52
Slide 53: Isometric Feature Mapping (ISOMAP) Take a distance matrix {ij} as input Estimate geodesic distance between any two points by “a chain of short paths” Formulate this as a graph theory problem Perform classical scaling on the matrix of geodesic distances to obtain final projection 08/30/07 53
Slide 54: Steps to Estimate Geodesic Distances Find the “neighbors” of all data items zi Two possible definitions of neighbors • • Set of items whose distances are less than  The K closest items • Construct a weighted undirected graph Vertex i corresponds to zi An edge between the vertex i and j iff zi and zj are neighbors, and its weight is ij 08/30/07 54
Slide 55: Steps to Estimate Geodesic Distances 1. Find the shortest distance between all pairs of vertices in the graph Floyd (O(m3)) or Dijkstra (O(m2log m+mp)) The shortest distance between vertices i and j in the graph is the estimated geodesic distance between zi and zj 08/30/07 55
Slide 56: Rationale for the Geodesic Distance Estimation Figures from ISOMAP paper 08/30/07 56
Slide 57: A Run of ISOMAP Figure from http://isomap.stanford. edu/handfig.html 08/30/07 57
Slide 58: A Run of ISOMAP Figures from ISOMAP paper 08/30/07 58
Slide 59: Interpolation on Straight Lines in the Projected Co-ordinates Figures from ISOMAP paper 08/30/07 59
Slide 60: Summary of the Algorithms Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes Local No neighborhood 08/30/07 60
Slide 61: Local Linear Embedding (LLE) Assumption: manifold is approximately “linear” when viewed locally, that is, in a small neighborhood Approximation error, (W), can be made small Local neighborhood is effected by the constraint Wij=0 if zi is not a neighbor of zj 08/30/07 61
Slide 62: Local Linear Embedding (LLE) Additional constraint j Wij = 1 is needed before finding W that minimizes (W) Meaning of W: a linear representation of every data point by its neighbors This is an intrinsic geometrical property of the manifold A good projection should preserve this geometric property as much as possible 08/30/07 62
Slide 63: Finding a Map to a Lower Dimensional Space xi 2 Rk: projected vector for zi The geometrical property is best preserved if the error below is small X is given by the eigenvectors of the lowest k non-zero eigenvalues of the matrix 08/30/07 63
Slide 64: Examples Figure from LLE paper 08/30/07 64
Slide 65: Why LLE and ISOMAP Are Successful Both methods are non-parametric, thus do not assume much about the manifold They focus on the notion of manifold, in particular the local property 08/30/07 65
Slide 66: Some Limitations of ISOMAP and LLE Both algorithms are unlikely to work well for manifolds like a sphere or a torus Both require dense data points on the manifold for good estimation A good neighborhood seems essential to their success They are hard to be evaluated quantitatively 08/30/07 66
Slide 67: Conclusion Dimensionality reduction is an important problem in pattern recognition It is closely related to manifold learning Algorithms for nonlinear dimensionality reduction and manifold learning are presented The rise of fame for ISOMAP and LLE stimulates a lot of research in this area 08/30/07 67
Slide 68: Possible Research Directions Manifold is the magic word How to determine the dimension of the target space Rk Similarity instead of dissimilarity What can one do if the data are not so dense Can side information be used to learn the manifold, e.g., by examples of geodesic 08/30/07 68
Slide 69: References Note: some graphs in this presentation are taken/inspired from the papers/URL listed here. P. Baldi and K. Hornik. Neural networks and principal component analysis: learning from examples without local minima. Neural Networks, vol 2, pp. 5358, 1989 C. M. Bishop, M. Svensen and C. K. I. Williams. GTM: the generative topographic mapping. Neural Computation, vol 10, pp. 215--234, 1998 N. Cristianini. ICML'01 tutorial on support vector and kernl machines, 2001 T. F. Cox and M. A. A. Cox. Multidimensional Scaling, 2nd edition, 2001. Chapman & Hall/CRC D. DeMers and G. W. Cottrell. Non-Linear Dimensionality Reduction. Advances in Neural Information Processing Systems 5, pp. 580-587, 1993 T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, vol 84, pp. 502--516, 1989 08/30/07 69
Slide 70: References O. C. Jenkins and M. J Mataric´, Deriving Action and Behavior Primitives from Human Motion Data. In IROS-2002, pp. 2551-2556, 2002 T. Kohonen. Self-organizing maps, 3rd edition, 2001. Springer-Verlag New York S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, vol. 290, pp. 2323--2326, 2000 B. Schölkopf, A.J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, vol 10, pp. 1299--1319, 1998 J.B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, vol. 290, pp. 2319--2323, 2000 R. Tibshirani. Principal curves revisited. Statistics and Computing, vol 2, pp. 183--190, 1992 08/30/07 70

   
Time on Slide Time on Plick
Slides per Visit Slide Views Views by Location