From:
anon-392016
Views: 367
Comments: 0
Electron Tomography: Methods for Three-Dimensional Visualization of Structures in the Cell ,epub reader for mac, bad back book ebook, embrace library, cane toomer ebook
From:
anon-390739
Views: 268
Comments: 0
Optimization Theory: The Finite Dimensional Case (Pure & Applied Mathematics) ,why do libraries need a database, shareware html code library, library method of collecting data, virtual library and user 0
Slide 1: Nonlinear Dimensionality Reduction and Manifold Learning
Martin Law
CSE Department Michigan State University
Slide 2: Why Dimensionality Reduction
More features ) more information ) potentially higher accuracy Unfortunately, more features ) harder to train a classifier
The curse of dimensionality
Solution: start with as many potentially useful features as possible, and then reduce the number of features
08/30/07
2
Slide 3: Why Dimensionality Reduction
Number of potential features can be huge
Image data: each pixel of an image
• A 64£64 image ) 4096 features
Genomic data: expression levels of the genes
• Several thousand features
Text categorization: frequencies of phrases in a document or in a web page
• More than ten thousand features
08/30/07
3
Slide 4: Why Dimensionality Reduction
Two approaches to reduce no. of features
Feature selection: select the salient features by some criteria Feature extraction: obtain a reduced set of features by a transformation of all features
Data visualization and exploratory data analysis also need to reduce dimension
Usually reduce to 2D or 3D
08/30/07 4
Slide 5: Outline
Linear methods and their drawbacks Manifold and dimensionality reduction A taxonomy of the nonlinear algorithms A quick tour of some nonlinear dimensionality reduction algorithms
Auto-associative NN, kernel PCA, principal curves, SOM, GTM, MDS, ISOMAP, LLE
Conclusion and possible research directions
08/30/07 5
Slide 6: Linear Method: Principal Component Analysis (PCA)
PCA finds the best “subspace” that captures as much data variance as possible Based on eigen-decomposition of data covariance matrix
Principal component
08/30/07
6
Slide 7: Linear Method: Linear Discriminant Analysis (LDA)
LDA finds the projection that best separates the two classes Multiple discriminant analysis (MDA) extends LDA to multiple classes
Best projection direction for classification
08/30/07
7
Slide 8: Deficiencies of Linear Methods
Data may not be best summarized by linear combination of features
Example: PCA cannot discover 1D structure of a helix
2 0 1 5
1 0
5
0 1 0 .5 0 -0 .5 -0 .5 -1 -1 0 .5 0 1
08/30/07
8
Slide 9: Deficiencies of Linear Methods
Useful characteristics for real world data are usually not linear combination of features
Example: poses of faces
Example: when shall I get hit (from motion data)
08/30/07
Figures from ISOMAP paper and Jerkins et. al
9
Slide 10: Deficiencies of Linear Methods
Currently, these characteristics can only be extracted by a carefully crafted system
Time consuming and tedious
Good nonlinear dimensionality reduction algorithms can help by extracting these features automatically
08/30/07
10
Slide 11: Manifold and Dimensionality Reduction
Manifold: generalized “subspace” in Rn
The value of n is usually large
Points in a local region on a manifold can be indexed by a subset of Rk
The value of k is usually small The index is called the co-ordinate of the point on the manifold
There may not be such co-ordinate systems globally
08/30/07 11
Slide 12: Example of a Manifold
Rn M
z
R
2
x2
x: coordinate for z x x1
08/30/07
12
Slide 13: Manifold and Dimensionality Reduction
Assumption: data points lie closely to M
Most unsupervised feature extraction schemes assume this implicitly
If there is a global indexing scheme for M
Find z on Mthat is closest to the data y Interpret the coordinates of z as the reduced dimension representation of y
Other mapping that preserves certain local properties of Mcan be used
08/30/07 13
Slide 14: Manifold for Dimensionality Reduction
y
Rn
M
z
R
2
x2
x: coordinate for z x x1
08/30/07
14
Slide 15: Taxonomy of the Nonlinear Algorithms
Is there any explicit notion of manifold The map to reduce dimensionality: parametric or non-parametric Can the algorithm use a dissimilarity matrix (instead of pattern matrix) as input Is local neighborhood in Rn used
08/30/07
15
Slide 16: Summary of the Algorithms
Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes
Local No neighborhood
08/30/07
16
Slide 17: Summary of the Algorithms
Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes
Local No neighborhood
08/30/07
17
Slide 18: Auto-associative Neural Networks
Given: a data set Y={y1, …, ym} in Rn Train a neural network using yi as input and target output, with a “middle” layer of k neurons, using square loss function The output of the k neurons when yi is used as input are the transformed vector
Denote this by xi
08/30/07
18
Slide 19: Auto-associative Neural Networks
Input: Target output: Actual output: Find weight that minimize
Output layer Decoding layer “Middle” layer Encoding layer Input layer
Transformed data:
08/30/07
19
Slide 20: Auto-associative Neural Networks
Intuition: xi is “compressed” version of yi as it can reconstruct yi approximately When the transfer function is linear, this approach is equivalent to PCA Different transfer functions can be used for different types of mapping Drawbacks: training not simple, local minima, meaning of extracted features obscure
08/30/07 20
Slide 21: Summary of the Algorithms
Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes
Local No neighborhood
08/30/07
21
Slide 22: Recap of Support Vector Machines (SVM)
SVM finds the separating hyperplane between 2 classes that maximizes the margin Key to the nonlinearity of SVM: (y) (.)
( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( )
Input space
08/30/07
Feature space
22
Slide 23: Recap of Support Vector Machines (SVM)
Why (x): we hope the data become linearly separable in the feature space This is equivalent to finding a non-linear decision boundary in the input space
08/30/07
23
Slide 24: Recap of Support Vector Machines (SVM)
Example
K(x, y): kernel function
08/30/07 24
Slide 25: Recap of Support Vector Machines (SVM)
In practice, all computation in the feature space are performed using K(x, y)
We specify K(x, y), and not (x, y), in practice Intuition of K(x, y): similarity between x and y
Lesson here: any linear algorithm that can be expressed by inner products can be made nonlinear by going to the feature space
08/30/07
25
Slide 26: Kernel PCA (KPCA)
Perform PCA in the feature space Steps for KPCA
Calculate the kernel matrix Perform eigen-decomposition of
Let i be the entries of the j-th eigenvector is the j-th principal component
08/30/07 26
Slide 27: Kernel PCA
The j-th component of the transformed vector of y is the projection of (y) on the jth principal component in the feature space The principal components in the feature space can be visualized by a contour plot
For linear PCA:
Principal component
The contours
08/30/07 27
Slide 28: Kernel PCA: Example
Figures from ftp://ftp.research.micro soft.com/users/mtippin g/skpca_nips.ps.gz
08/30/07
28
Slide 29: Summary of the Algorithms
Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes
Local No neighborhood
08/30/07
29
Slide 30: Principal Curves
f(s)
Find the curve that passes through the “center” of the data points The curve can be generalized to a manifold
08/30/07 30
Slide 31: Principal Curves
Version one: f(s) is self-consistent, i.e., f(s) is the average of all points projected to it
08/30/07
31
Slide 32: Principal Curves
Version two: the data are generated by first picking a point on f(s), and then Gaussian noise is added f(s) is defined in a non-parametric manner
Locally weighted running lines or cubic smoothing splines can be used to smooth f(s)
08/30/07
32
Slide 33: Summary of the Algorithms
Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes
Local No neighborhood
08/30/07
33
Slide 34: Kohonen’s Self-organizing Maps (SOM)
Learn a topographic map between points in Rn and (usually) a 2D array of grid points (nodes) Each node contains a reference vector tj in Rn Learning of SOM is often on-line, i.e., the data points come sequentially
08/30/07
34
Slide 35: Learning of SOM
For a new data point z, node j wins if tj is closest to z among all reference vectors Node j and its neighbors learn from z
The nodes learn by moving their reference vectors closer to z with a learning rate
The neighborhood of a node is large initially, but becomes smaller gradually also decreases gradually until convergence
08/30/07 35
Slide 36: SOM: When Neighborhood Is Large Initially
t1 =(9,5,2,6) t5=(9,9,9,1) t9 =(3,2,3,3) t13=(0,0,5,5)
New data (3,7,9,3)
t2 =(4,8,0,7)
t6=(7,9,8,3)
t10=(7,3,2,1)
t14=(8,7,8,1)
winner
t3 =(5,6,4,6) t7=(4,1,9,8) t11=(9,2,8,6) t15=(3,4,7,5)
t4 =(0,3,0,1)
t8=(7,5,8,2)
t12=(3,5,8,7)
t16=(1,1,7,9)
08/30/07
36
Slide 37: SOM: Example
t1 =(9,5,2,6) t5 =(9,9,9,1) t9 =(3,5,6,3) t13 =(2,4,7,4)
t2 =(4,8,0,7)
t6 =(5,8,9,3)
t10 =(5,5,6,2)
t14 =(6,7,9,2)
t3 =(5,6,4,6)
t7 =(4,4,9,6)
t11 =(6,5,9,5)
t15 =(3,6,8,4)
t4 =(0,3,0,1)
t8 =(5,6,9,3)
t12 =(3,6,9,5)
t16 =(2,4,8,6)
08/30/07
37
Slide 38: SOM: When Neighborhood Is Small
t1 =(9,5,2,6) t5 =(9,9,9,1) t9 =(3,5,6,3) t13 =(2,4,7,4)
New data (4,8,4,5)
winner
t2 =(4,8,0,7)
t6 =(5,8,9,3)
t10 =(5,5,6,2)
t14 =(6,7,9,2)
t3 =(5,6,4,6)
t7 =(4,4,9,6)
t11 =(6,5,9,5)
t15 =(3,6,8,4)
t4 =(0,3,0,1)
t8 =(5,6,9,3)
t12 =(3,6,9,5)
t16 =(2,4,8,6)
08/30/07
38
Slide 39: SOM: When Neighborhood Is Small
t1 =(9,5,2,6) t5 =(9,9,9,1) t9 =(3,5,6,3) t13 =(2,4,7,4)
t2 =(4,8,2,6)
t6 =(5,8,9,3)
t10 =(5,5,6,2)
t14 =(6,7,9,2)
t3 =(5,7,4,6)
t7 =(4,6,7,5)
t11 =(6,5,9,5)
t15 =(3,6,8,4)
t4 =(2,6,2,3)
t8 =(5,6,9,3)
t12 =(3,6,9,5)
t16 =(2,4,8,6)
08/30/07
39
Slide 40: Summary of the Algorithms
Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Disimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes
Local No neighborhood
08/30/07
40
Slide 41: Generative Topographic Maps (GTM)
A probabilistic formalization of SOM
Figure from GTM paper
08/30/07 41
Slide 42: Generative Topographic Maps (GTM)
The mapping y consist of two parts
Map to a point on the manifold
l=1,...,n
Gaussian noise is then added to zi to form yi
The manifold is parameterized by W and the M functions j(x)
j(x) usually take the form of Gaussians
08/30/07 42
Slide 43: Generative Topographic Maps (GTM)
Closely related to radial basis function neural networks Smoothness of the manifold can be controlled by regularizing W Other formalization of SOM exists
Example: topographic vector quantization (TVQ) performs vector quantization with neighborhood relationship as in SOM
08/30/07 43
Slide 44: Summary of the Algorithms
Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes
Local No neighborhood
08/30/07
44
Slide 45: Multidimensional Scaling (MDS)
Given: ij, a distance measure between zi and zj
The items zi and zj may not be in an Euclidean space ij might be asymmetric and do not satisfy the triangle inequality
Goal: map all zi to xi, points in a low dimension space Rk, that best “preserves” ij Two types of MDS
Metric MDS: minimizes discrepancy in numerical values
• Classical scaling and metric least square scaling
Nonmetric MDS: minimizes discrepancy in ranks
08/30/07 45
Slide 46: Multidimensional Scaling (MDS)
zi ij zj xi dij MDS xj
Items in possibly non-Euclidean space
Data points in low-D Euclidean space
08/30/07
Goal: find xi s that make ij as “close” to dij as possible
46
Slide 47: Classical Scaling
Define [A]rs = - ½ ij2 and [B]rs=xrT xs
1: matrix with entries all 1
X, the coordinates of the projected points, can be recovered by eigenvalue decomposition of B It is equivalent to linear PCA if zi is in Rn and Euclidean distances between zi are used
Classical scaling is essentially “linear”
08/30/07
47
Slide 48: Metric Least Square Scaling (MLSS)
Minimize a certain function of the difference between ij and dij
dij is the distance between xi and xj in Rk
Sammon maps: minimize Sammon’s stress
Gradient descent can be used for finding xi that minimize S
08/30/07 48
Slide 49: Notes on MDS
No explicit mapping
If a new item comes, we do not know where it should be mapped to
It is not easy to change the dimension of Rk (where xi are in) in MLSS
08/30/07
49
Slide 50: Summary of the Algorithms
Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes
Local No neighborhood
08/30/07
50
Slide 51: Geodesic Distance
Geodesic: the shortest curve on a manifold that connects two points on the manifold
Example: on a sphere, geodesics are great circles
Geodesic distance: length of the geodesic
A B
Figure from http://mathworld.wolfra m.com/GreatCircle.html
08/30/07 51
Slide 52: Geodesic Distance
Euclidean distance needs not be a good measure between two points on a manifold
Length of geodesic is more appropriate
Example: Swiss roll
Figure from LLE paper
08/30/07
52
Slide 53: Isometric Feature Mapping (ISOMAP)
Take a distance matrix {ij} as input Estimate geodesic distance between any two points by “a chain of short paths”
Formulate this as a graph theory problem
Perform classical scaling on the matrix of geodesic distances to obtain final projection
08/30/07
53
Slide 54: Steps to Estimate Geodesic Distances
Find the “neighbors” of all data items zi
Two possible definitions of neighbors
• • Set of items whose distances are less than The K closest items
•
Construct a weighted undirected graph
Vertex i corresponds to zi An edge between the vertex i and j iff zi and zj are neighbors, and its weight is ij
08/30/07 54
Slide 55: Steps to Estimate Geodesic Distances
1. Find the shortest distance between all pairs of vertices in the graph
Floyd (O(m3)) or Dijkstra (O(m2log m+mp))
The shortest distance between vertices i and j in the graph is the estimated geodesic distance between zi and zj
08/30/07
55
Slide 56: Rationale for the Geodesic Distance Estimation
Figures from ISOMAP paper
08/30/07 56
Slide 57: A Run of ISOMAP
Figure from http://isomap.stanford. edu/handfig.html
08/30/07 57
Slide 58: A Run of ISOMAP
Figures from ISOMAP paper
08/30/07 58
Slide 59: Interpolation on Straight Lines in the Projected Co-ordinates
Figures from ISOMAP paper
08/30/07 59
Slide 60: Summary of the Algorithms
Auto. Principal KPCA SOM GTM MDS ISOMAP LLE NN curves Explicit Manifold Parametric Dissimilarity matrix No Yes No No Yes No(?) No Yes No No No(?) No No No No Yes Yes No No No No Yes No Yes No Yes Yes Yes No No(?) Yes
Local No neighborhood
08/30/07
60
Slide 61: Local Linear Embedding (LLE)
Assumption: manifold is approximately “linear” when viewed locally, that is, in a small neighborhood
Approximation error, (W), can be made small
Local neighborhood is effected by the constraint Wij=0 if zi is not a neighbor of zj
08/30/07 61
Slide 62: Local Linear Embedding (LLE)
Additional constraint j Wij = 1 is needed before finding W that minimizes (W) Meaning of W: a linear representation of every data point by its neighbors
This is an intrinsic geometrical property of the manifold
A good projection should preserve this geometric property as much as possible
08/30/07 62
Slide 63: Finding a Map to a Lower Dimensional Space
xi 2 Rk: projected vector for zi The geometrical property is best preserved if the error below is small
X is given by the eigenvectors of the lowest k non-zero eigenvalues of the matrix
08/30/07
63
Slide 64: Examples
Figure from LLE paper
08/30/07 64
Slide 65: Why LLE and ISOMAP Are Successful
Both methods are non-parametric, thus do not assume much about the manifold They focus on the notion of manifold, in particular the local property
08/30/07
65
Slide 66: Some Limitations of ISOMAP and LLE
Both algorithms are unlikely to work well for manifolds like a sphere or a torus Both require dense data points on the manifold for good estimation A good neighborhood seems essential to their success They are hard to be evaluated quantitatively
08/30/07
66
Slide 67: Conclusion
Dimensionality reduction is an important problem in pattern recognition It is closely related to manifold learning Algorithms for nonlinear dimensionality reduction and manifold learning are presented The rise of fame for ISOMAP and LLE stimulates a lot of research in this area
08/30/07
67
Slide 68: Possible Research Directions
Manifold is the magic word How to determine the dimension of the target space Rk Similarity instead of dissimilarity What can one do if the data are not so dense Can side information be used to learn the manifold, e.g., by examples of geodesic
08/30/07 68
Slide 69: References
Note: some graphs in this presentation are taken/inspired from the papers/URL listed here. P. Baldi and K. Hornik. Neural networks and principal component analysis: learning from examples without local minima. Neural Networks, vol 2, pp. 5358, 1989 C. M. Bishop, M. Svensen and C. K. I. Williams. GTM: the generative topographic mapping. Neural Computation, vol 10, pp. 215--234, 1998 N. Cristianini. ICML'01 tutorial on support vector and kernl machines, 2001 T. F. Cox and M. A. A. Cox. Multidimensional Scaling, 2nd edition, 2001. Chapman & Hall/CRC D. DeMers and G. W. Cottrell. Non-Linear Dimensionality Reduction. Advances in Neural Information Processing Systems 5, pp. 580-587, 1993 T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, vol 84, pp. 502--516, 1989
08/30/07
69
Slide 70: References
O. C. Jenkins and M. J Mataric´, Deriving Action and Behavior Primitives from Human Motion Data. In IROS-2002, pp. 2551-2556, 2002 T. Kohonen. Self-organizing maps, 3rd edition, 2001. Springer-Verlag New York S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, vol. 290, pp. 2323--2326, 2000 B. Schölkopf, A.J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, vol 10, pp. 1299--1319, 1998 J.B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, vol. 290, pp. 2319--2323, 2000 R. Tibshirani. Principal curves revisited. Statistics and Computing, vol 2, pp. 183--190, 1992
08/30/07 70