SVD & PCA
Dimensionality Reduction

ML for Science and Engineering

Why Reduce Dimensions?

More features should mean more information, right? Not always.
The Curse of Dimensionality
As dimensions grow, data becomes sparse. Distances between points converge. Models need exponentially more data to generalize.
Noise accumulates
Each added feature brings signal and noise. With many features, noise can dominate the patterns.
Computation
Training time, memory, and storage all grow with dimension. Many algorithms scale as $O(d^2)$ or $O(d^3)$.
Visualization
Humans can only visualize 2 or 3 dimensions. Reducing to 2 dimensions lets us see structure in the data.

A Real Example: Iris Flowers

The Iris dataset has 4 features per flower — measurements of two structures:

#   Feature        Unit
1   Sepal length   cm
2   Sepal width    cm
3   Petal length   cm
4   Petal width    cm

Sepal = outer leaf-like part  |  Petal = inner colorful part

150 flowers, 3 species, 4 measurements each.

A classic benchmark for classification and dimensionality reduction (Fisher, 1936).
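The dataset ships with scikit-learn, so it is easy to load and inspect. A minimal sketch, assuming scikit-learn is installed:

```python
# Load the Iris dataset (150 samples x 4 features) with scikit-learn.
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # shape (150, 4): sepal length, sepal width, petal length, petal width (cm)
y = iris.target    # shape (150,): species label 0, 1, or 2

print(X.shape)     # (150, 4)
```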

How do you visualize a point in 4-dimensional space?
How do you find the most important directions?
Key idea: Most of the variation might live in a low-dimensional subspace. PCA finds it.

Start Simple: 2D Data

Pick two features: Sepal Length vs Petal Length.

Sample   Sepal L. (cm)   Petal L. (cm)   Species

(Interactive table: showing first 6 of 150 samples.)

Principal Components

The data is spread out more in some directions than others.

PC1 = direction of maximum variance (major axis).
PC2 = second-most variance, orthogonal to PC1.
Orthogonality: $\mathbf{u}_1 \perp \mathbf{u}_2$ — new coordinate system.
$\mathbf{u}_1^\top \mathbf{u}_2 = 0$

Center and Standardize

Before finding principal components, we preprocess each feature:

$\text{Center: } x_j \leftarrow x_j - \mu_j \qquad\qquad \text{Standardize: } x_j \leftarrow \frac{x_j - \mu_j}{\sigma_j}$
(Interactive figure: raw data → centered → standardized, with the mean $\mu$ shown.)
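In NumPy, both preprocessing steps are one-liners. A sketch on toy two-feature data (the numbers are placeholders, not Iris measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[5.8, 3.8], scale=[0.8, 1.8], size=(150, 2))  # toy 2-feature data

# Center: subtract the per-feature mean
X_centered = X - X.mean(axis=0)

# Standardize: also divide by the per-feature standard deviation
X_std = X_centered / X.std(axis=0)

print(np.allclose(X_std.mean(axis=0), 0))  # True: zero mean
print(np.allclose(X_std.std(axis=0), 1))   # True: unit variance
```

Standardizing matters when features have different units or scales; otherwise the large-scale feature dominates the variance.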

Project onto an Axis

Data can be projected onto the axis of highest variation: $\mathbf{u}_1$.

Each point maps to its shadow on the line. The projected data is 1-dimensional. What information is lost?

Reconstruction Error

The error is the distance from each original point to its projection.

Goal: Find $\mathbf{u}_1$ that maximizes variance of projections
Equivalently: minimizes reconstruction error (sum of squared distances)

Find the Best Axis

Drag the line to rotate it. Find the direction that maximizes variance of projections.

(Interactive demo: displays the variance of projections, the reconstruction error, and the current angle, with overlays for PC1 and PC2.)

Projection onto a Direction

Given centered data $\{\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(n)}\}$ in $\mathbb{R}^d$, find the unit vector $\mathbf{u}$ along which projections have maximum variance.

What is the scalar projection of $\mathbf{x}^{(i)}$ onto $\mathbf{u}$?
$z^{(i)} = \mathbf{x}^{(i)\top}\mathbf{u}$
The scalar projection $z$ is the signed length of the shadow of $\mathbf{x}$ onto the direction $\mathbf{u}$.
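Numerically, the scalar projection is just a dot product. A tiny sketch:

```python
import numpy as np

x = np.array([3.0, 4.0])
u = np.array([1.0, 0.0])   # unit vector along the first axis

z = x @ u                  # scalar projection: signed length of the shadow
print(z)                   # 3.0

# The projected point (the shadow itself) is z * u:
print(z * u)               # [3. 0.]
```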

Maximize Variance of Projections

The scalar projection is $z^{(i)} = \mathbf{x}^{(i)\top}\mathbf{u}$. We want to find $\mathbf{u}$ that maximizes the variance of $z$:

$\max_{\mathbf{u}} \;\; \frac{1}{n}\sum_{i=1}^{n}\left(\mathbf{x}^{(i)\top}\mathbf{u}\right)^2 \quad \text{subject to } \|\mathbf{u}\|=1$

Rewrite the sum in matrix form:

$= \frac{1}{n}\sum_{i=1}^{n} \mathbf{u}^\top \mathbf{x}^{(i)}\mathbf{x}^{(i)\top}\mathbf{u} = \mathbf{u}^\top \underbrace{\left[\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}^{(i)}\mathbf{x}^{(i)\top}\right]}_{\mathbf{C}} \mathbf{u} = \mathbf{u}^\top \mathbf{C} \, \mathbf{u}$
$\mathbf{C} = \frac{1}{n}\mathbf{X}^\top\mathbf{X}$ is the covariance matrix (for centered data).
Our objective becomes: maximize $\mathbf{u}^\top\mathbf{C}\mathbf{u}$ subject to $\|\mathbf{u}\|=1$.

Lagrange Multipliers

Constrained optimization: maximize $\mathbf{u}^\top\mathbf{C}\mathbf{u}$ subject to $\mathbf{u}^\top\mathbf{u} = 1$.

$\mathcal{L}(\mathbf{u}, \lambda) = \mathbf{u}^\top\mathbf{C}\mathbf{u} - \lambda\left(\mathbf{u}^\top\mathbf{u} - 1\right)$

Take the derivative and set to zero:

$\frac{\partial\mathcal{L}}{\partial\mathbf{u}} = 2\mathbf{C}\mathbf{u} - 2\lambda\mathbf{u} = 0$
$\mathbf{C}\mathbf{u} = \lambda\mathbf{u}$
This is an eigenvalue problem. The direction $\mathbf{u}$ that maximizes projection variance is an eigenvector of the covariance matrix, and $\lambda$ is the corresponding eigenvalue.
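We can check this numerically with `np.linalg.eigh`, which handles symmetric matrices like $\mathbf{C}$ (the synthetic data below is for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated data
X = X - X.mean(axis=0)                                              # center

C = (X.T @ X) / len(X)                    # covariance matrix

# eigh returns eigenvalues in ascending order for symmetric matrices
eigvals, eigvecs = np.linalg.eigh(C)
u1 = eigvecs[:, -1]                       # eigenvector with largest eigenvalue = PC1

# Variance of the projections onto u1 equals the top eigenvalue
proj_var = np.var(X @ u1)
print(np.isclose(proj_var, eigvals[-1]))  # True
```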

Eigenvalues Tell the Story

PC1: eigenvector with largest eigenvalue $\lambda_1$
PC2: eigenvector with second-largest eigenvalue $\lambda_2$
 ⋮
PC$d$: eigenvector with smallest eigenvalue $\lambda_d$
$\text{Var along } \mathbf{u}_k = \mathbf{u}_k^\top\mathbf{C}\mathbf{u}_k = \lambda_k$
Eigenvectors → directions.
Eigenvalues → importance (variance captured).

(Figure: the Iris dataset's 4 eigenvalues, in decreasing order.)

Projecting to Lower Dimensions

Given $\mathbf{x}^{(i)} \in \mathbb{R}^d$, project to $\mathbb{R}^k$ using the top $k$ eigenvectors:

$\mathbf{y}^{(i)} = \begin{bmatrix} \mathbf{u}_1^\top\mathbf{x}^{(i)} \\ \mathbf{u}_2^\top\mathbf{x}^{(i)} \\ \vdots \\ \mathbf{u}_k^\top\mathbf{x}^{(i)} \end{bmatrix} \in \mathbb{R}^k$

Variance retained by keeping $k$ of $d$ components:

$\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{d}\lambda_i} \times 100\%$
Iris example: first 2 of 4 components capture 95.8% of total variance. Projecting 4D → 2D loses only 4.2%.
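Computing the retained-variance curve is a cumulative sum. A sketch using illustrative eigenvalues chosen to roughly match the standardized Iris spectrum (not exact values):

```python
import numpy as np

# Illustrative eigenvalue spectrum, roughly the standardized Iris values
lam = np.array([2.92, 0.91, 0.15, 0.02])

retained = lam.cumsum() / lam.sum() * 100
print(retained.round(1))   # cumulative % of variance for k = 1, 2, 3, 4
```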

SVD: Singular Value Decomposition

Any matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ can be factored as:

$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$
$\underbrace{\begin{bmatrix} & & \\ & \mathbf{X} & \\ & & \end{bmatrix}}_{m \times n} = \underbrace{\begin{bmatrix} \vert & \vert & & \vert \\ \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_m \\ \vert & \vert & & \vert \end{bmatrix}}_{\color{#98c379}{\mathbf{U}\;(m \times m)}} \underbrace{\left[\begin{array}{ccc} \sigma_1 & & 0 \\ & \ddots & \\ 0 & & \sigma_n \\ \hline & \mathbf{0} & \end{array}\right]}_{\color{#e5c07b}{\boldsymbol{\Sigma}\;(m \times n)}} \underbrace{\begin{bmatrix} \text{---} \; \mathbf{v}_1^\top \; \text{---} \\ \text{---} \; \mathbf{v}_2^\top \; \text{---} \\ \vdots \\ \text{---} \; \mathbf{v}_n^\top \; \text{---} \end{bmatrix}}_{\color{#e06c75}{\mathbf{V}^\top\;(n \times n)}}$
$\mathbf{U}$: left singular vectors (orthonormal columns)
$\boldsymbol{\Sigma}$: singular values (diagonal, $\sigma_1 \geq \sigma_2 \geq \cdots \geq 0$)
$\mathbf{V}^\top$: right singular vectors (orthonormal rows)

SVD ↔ PCA Connection

Start from the SVD of the centered data matrix $\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$:

$\mathbf{X}^\top\mathbf{X} = (\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top)^\top(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top) = \mathbf{V}\boldsymbol{\Sigma}^\top\underbrace{\mathbf{U}^\top\mathbf{U}}_{\mathbf{I}}\boldsymbol{\Sigma}\mathbf{V}^\top = \mathbf{V}\boldsymbol{\Sigma}^2\mathbf{V}^\top$

Compare with eigendecomposition of covariance $\mathbf{C} = \frac{1}{n}\mathbf{X}^\top\mathbf{X}$:

Right singular vectors $\mathbf{V}$ = eigenvectors of $\mathbf{X}^\top\mathbf{X}$
= principal component directions
Singular values relate to eigenvalues:
$\lambda_i = \frac{\sigma_i^2}{n}$
SVD computes PCA without ever forming $\mathbf{X}^\top\mathbf{X}$. This is numerically more stable and often faster.
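The identity $\lambda_i = \sigma_i^2 / n$ is easy to verify numerically on random centered data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
X = X - X.mean(axis=0)                    # center
n = len(X)

# Route 1: eigendecomposition of the covariance matrix
C = X.T @ X / n
eigvals = np.linalg.eigvalsh(C)[::-1]     # sort descending

# Route 2: SVD of the data matrix (never forms X^T X)
s = np.linalg.svd(X, compute_uv=False)    # singular values, descending

print(np.allclose(eigvals, s**2 / n))     # True: lambda_i = sigma_i^2 / n
```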

Three Forms of SVD

The SVD comes in three flavors, each progressively more compact. Assume $m \geq n$ (more rows than columns).

Full SVD: $\mathbf{U}$ is $m \times m$, $\boldsymbol{\Sigma}$ is $m \times n$, $\mathbf{V}^\top$ is $n \times n$. Exact, but $\mathbf{U}$ is large.
Economy (thin) SVD: keep only the first $n$ columns of $\mathbf{U}$, so $\hat{\mathbf{U}}$ is $m \times n$ and $\hat{\boldsymbol{\Sigma}}$ is $n \times n$. Still exact.
Truncated SVD: keep only the first $r < n$ components, $\mathbf{U}_r\boldsymbol{\Sigma}_r\mathbf{V}_r^\top$. An approximation (next slide).

Outer-product form: $\mathbf{X} = \sum_{i=1}^{n}\sigma_i\,\mathbf{u}_i\mathbf{v}_i^\top$, a weighted sum of rank-1 outer products.

Low-Rank Approximation

Keep only the first $r$ singular values (set the rest to zero):

$\mathbf{X} \approx \mathbf{X}_r = \mathbf{U}_r\boldsymbol{\Sigma}_r\mathbf{V}_r^\top = \sum_{i=1}^{r}\sigma_i\,\mathbf{u}_i\mathbf{v}_i^\top$
Eckart-Young theorem (1936): $\mathbf{X}_r$ is the best rank-$r$ approximation to $\mathbf{X}$ in both Frobenius and spectral norms:
$\mathbf{X}_r = \arg\min_{\text{rank}(\mathbf{B}) = r}\|\mathbf{X} - \mathbf{B}\|_F$
This is the theoretical foundation for data compression, denoising, and dimensionality reduction. We truncate the small singular values (noise) and keep the large ones (signal).
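Eckart-Young also gives the size of the error: the Frobenius error of the rank-$r$ truncation is $\sqrt{\sum_{i>r}\sigma_i^2}$, the energy in the discarded singular values. A sketch verifying this on a noisy low-rank matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 40))   # rank-5 matrix
A += 0.01 * rng.normal(size=A.shape)                      # plus small noise

U, S, Vt = np.linalg.svd(A, full_matrices=False)

r = 5
A_r = U[:, :r] * S[:r] @ Vt[:r]    # truncated SVD: best rank-r approximation

# Frobenius error equals the root-sum-square of the discarded singular values
err = np.linalg.norm(A - A_r, 'fro')
print(np.isclose(err, np.sqrt((S[r:]**2).sum())))   # True
```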

PCA in 3 Lines of Code


# 1. Center the data
X_centered = X - X.mean(axis=0)

# 2. Compute the SVD
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Project onto top k components
X_pca = X_centered @ Vt[:k].T
        
S contains the singular values $\sigma_1, \sigma_2, \dots$
Vt rows are the principal directions
X_pca is the reduced representation
That is it. No need to compute $\mathbf{X}^\top\mathbf{X}$ or solve eigenvalue problems manually. np.linalg.svd does everything.
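Inverting the projection takes one more line: multiply by the same $k$ rows of `Vt` and add the mean back. A self-contained sketch with random placeholder data standing in for `X`:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 4))      # placeholder data
k = 2

X_mean = X.mean(axis=0)
X_centered = X - X_mean
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_pca = X_centered @ Vt[:k].T      # project onto the top k components

# Map back to the original space: the best rank-k reconstruction of X
X_approx = X_pca @ Vt[:k] + X_mean
print(X_approx.shape)              # (150, 4)
```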

Application 1: Iris Dataset PCA

The Iris dataset: 150 flowers, 4 measurements each (sepal length, sepal width, petal length, petal width).

72.8% of variance in PC1 alone! 95.8% in just 2 components.
4D → 2D with minimal loss. Three species separate cleanly.

Application 2: Image Compression

An image is a matrix. Apply SVD and keep only the first $r$ singular values.

(Figure: original image alongside rank-$r$ reconstructions for increasing $r$.)

Brunton & Kutz, Data-Driven Science and Engineering, Cambridge UP, 2019

Eigenfaces: Faces as Points in High-D Space

2,410 face images, each $192 \times 168 = 32{,}256$ pixels. Each face is a point in 32,256-dimensional space.

Goal: represent any face as a combination of a few basis faces (eigenfaces). PCA finds these basis directions.

Average Face

$\bar{\mathbf{x}} = \frac{1}{n}\sum \mathbf{x}^{(i)}$

First 8 Eigenfaces (Principal Components)

(Figure: eigenfaces PC 1 through PC 8.)

Each eigenface captures a different mode of variation (lighting, pose, expression).

Brunton & Kutz, Data-Driven Science and Engineering, Cambridge UP, 2019

Face Reconstruction

Reconstruct a face using $k$ eigenfaces: $\hat{\mathbf{x}} = \bar{\mathbf{x}} + \sum_{i=1}^k c_i\,\mathbf{u}_i$

(Figure: original face next to its reconstruction with $k = 800$ eigenfaces.)

Brunton & Kutz, Data-Driven Science and Engineering, Cambridge UP, 2019
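The reconstruction formula is a few lines of NumPy. A sketch on synthetic stand-in "images" (the real pipeline uses the 32,256-pixel faces; here the coefficients $c_i = \mathbf{u}_i^\top(\mathbf{x} - \bar{\mathbf{x}})$ come from the rows of `Vt`):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 64                       # 200 tiny 8x8 "faces", flattened to 64 pixels
faces = rng.normal(size=(n, d))      # synthetic stand-in data

mean_face = faces.mean(axis=0)
Xc = faces - mean_face               # center: subtract the average face

# Eigenfaces = right singular vectors of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 10
x = faces[0]
c = Vt[:k] @ (x - mean_face)         # coefficients c_i = u_i^T (x - mean)
x_hat = mean_face + Vt[:k].T @ c     # reconstruction from k eigenfaces

print(x_hat.shape)                   # (64,)
```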

Scree Plot: How Many Components?

The scree plot shows singular values (or eigenvalues) in decreasing order. Look for the "elbow".

Rules of thumb:
• Cumulative variance ≥ 95%
• "Elbow" in scree plot
• Cross-validate for prediction
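The cumulative-variance rule is easy to automate. A sketch with a hypothetical helper (`n_components_for` is not a library function) that picks the smallest $k$ meeting a target ratio, using $\sigma_i^2$ as the variance per component:

```python
import numpy as np

def n_components_for(variance_ratio, S):
    """Smallest k whose cumulative explained variance meets the target ratio.

    S: singular values in decreasing order; sigma_i^2 is the variance per component.
    """
    frac = np.cumsum(S**2) / np.sum(S**2)
    return int(np.searchsorted(frac, variance_ratio) + 1)

S = np.array([10.0, 5.0, 2.0, 1.0, 0.5])   # illustrative singular values
print(n_components_for(0.95, S))           # -> 2
```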

Summary

PCA finds orthogonal directions of maximum variance.

The key equation:
$\mathbf{C}\mathbf{u} = \lambda\mathbf{u}$
SVD provides the computation.
$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$

No need to form $\mathbf{X}^\top\mathbf{X}$ explicitly.
Low-rank approximation compresses data optimally (Eckart-Young).

Applications:
• Dimensionality reduction
• Image compression
• Denoising
• Feature extraction
• Visualization
Limitations: PCA only captures linear structure. For nonlinear manifolds, you need kernel PCA, autoencoders, or t-SNE.

What's Next?

What if the data is a time series?
What if we want to decompose dynamics, not just static data?

$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top \quad \longrightarrow \quad$ Dynamic Mode Decomposition

Next lecture: Linear Dynamical Systems & DMD