Singular Value Decomposition

Illustrating the power and utility of SVD with a few applications.

By Vamshi Jandhyala in mathematics, machine learning, linear algebra

September 2, 2021

Overview

The singular value decomposition (SVD) is among the most important matrix factorizations of the computational era. In many domains, complex systems will generate data that is naturally arranged in large matrices, or more generally in arrays. For example, a time-series of data from an experiment or a simulation may be arranged in a matrix with each column containing all of the measurements at a given time. If the data at each instant in time is multi-dimensional, as in a high-resolution simulation of the weather in three spatial dimensions, it is possible to reshape or flatten this data into a high-dimensional column vector, forming the columns of a large matrix. Similarly, the pixel values in a grayscale image may be stored in a matrix, or these images may be reshaped into large column vectors in a matrix to represent the frames of a movie.

Remarkably, the data generated by these systems are typically $\textbf{low rank}$, meaning that there are a few dominant patterns that explain the high-dimensional data. The SVD provides a systematic way to determine a low-dimensional approximation to high-dimensional data in terms of dominant patterns. This technique is $\textbf{data-driven}$ in that patterns are discovered purely from data, without the addition of expert knowledge or intuition. The SVD is numerically stable and provides a hierarchical representation of the data in terms of a new coordinate system defined by dominant correlations within the data. Moreover, the SVD is guaranteed to exist for any matrix, unlike the eigendecomposition.

Definition of SVD

Generally, we are interested in analyzing a large data set $\mathbf{X} ∈ \mathbb{R}^{n×m}$:

$$ \mathbf{X} = \begin{bmatrix} \vert & \vert & &\vert \\
\mathbf{x}_1 & \mathbf{x}_2 & \dots &\mathbf{x}_m \\
\vert & \vert & &\vert \end{bmatrix} $$

The columns $\mathbf{x}_k ∈ \mathbb{R}^n$ may be measurements from simulations or experiments. The index $k$ is a label indicating the $k^{th}$ distinct set of measurements. Often the state-dimension $n$ is very large, on the order of millions or billions of degrees of freedom. The columns are often called $\textbf{snapshots}$, and $m$ is the number of snapshots in $\mathbf{X}$. For many systems $n \gg m$, resulting in a tall-skinny matrix, as opposed to a short-fat matrix when $n \ll m$. The SVD is a matrix decomposition, unique up to sign conventions on the singular vectors (when the singular values are distinct), that exists for every real-valued matrix $\mathbf{X} ∈ \mathbb{R}^{n×m}$,

$$ \mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T $$

where $\mathbf{U} ∈ \mathbb{R}^{n×n}$ and $\mathbf{V} ∈ \mathbb{R}^{m×m}$ are unitary matrices (i.e. $\mathbf{U}\mathbf{U}^T = \mathbf{U}^T\mathbf{U} = \mathbf{I}$) with orthonormal columns, and $\mathbf{\Sigma} ∈ \mathbb{R}^{n×m}$ is a matrix with real, nonnegative entries on the diagonal and zeros off the diagonal.

When $n ≥ m$, the matrix $\mathbf{\Sigma}$ has at most $m$ nonzero elements on the diagonal and may be written as $\mathbf{\Sigma} = \begin{bmatrix} \mathbf{\hat{\Sigma}} \\ \mathbf{0} \end{bmatrix}$. Therefore, it is possible to exactly represent $\mathbf{X}$ using the economy SVD:

$$ \mathbf{X} = \mathbf{U}\mathbf{\Sigma}\mathbf{V}^T = \begin{bmatrix} \mathbf{\hat{U}} & \mathbf{\hat{U}^{⊥}} \end{bmatrix} \begin{bmatrix} \mathbf{\hat{\Sigma}} \\
\mathbf{0} \\
\end{bmatrix} \mathbf{V}^T = \mathbf{\hat{U}}\mathbf{\hat{\Sigma}}\mathbf{V}^T $$

The columns of $\mathbf{\hat{U}^{⊥}}$ span a vector space that is complementary and orthogonal to that spanned by $\mathbf{\hat{U}}$. The columns of $\mathbf{U}$ are called left singular vectors of $\mathbf{X}$ and the columns of $\mathbf{V}$ are right singular vectors. The diagonal elements of $\mathbf{\Sigma}$ are called singular values and they are ordered from largest to smallest. The rank of $\mathbf{X}$ is equal to the number of nonzero singular values. The cost of computing the SVD is $\mathcal{O}(\min(mn^2,m^2n))$.

Computing SVD

The code for computing SVD in Python is quite straightforward.

import numpy as np

X = np.random.rand(5, 3)                                   # create a random 5x3 data matrix
U, S, VT = np.linalg.svd(X, full_matrices=True)            # full SVD (NumPy returns V transposed)
Uhat, Shat, VThat = np.linalg.svd(X, full_matrices=False)  # economy SVD
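
As a quick sanity check (a minimal sketch continuing the snippet above), the economy factors reconstruct $\mathbf{X}$ exactly and the columns of $\mathbf{\hat{U}}$ are orthonormal; note that NumPy returns the singular values as a 1-D array and the matrix $\mathbf{V}^T$ rather than $\mathbf{V}$:

# The economy factors reconstruct X exactly
print(np.allclose(X, Uhat @ np.diag(Shat) @ VThat))   # True

# The columns of Uhat are orthonormal
print(np.allclose(Uhat.T @ Uhat, np.eye(3)))          # True

# The rank of X equals the number of nonzero singular values
print(np.sum(Shat > 1e-10))                           # 3 (almost surely) for a random 5x3 matrix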

Matrix Approximation

Perhaps the most useful and defining property of the SVD is that it provides an optimal low-rank approximation to a matrix $\mathbf{X}$. In fact, the SVD provides a hierarchy of low-rank approximations, since a rank-$r$ approximation is obtained by keeping the leading $r$ singular values and vectors, and discarding the rest.

Schmidt (of Gram-Schmidt) generalized the SVD to function spaces and developed an approximation theorem, establishing truncated SVD as the optimal low-rank approximation of the underlying matrix X. Schmidt’s approximation theorem was rediscovered by Eckart and Young, and is sometimes referred to as the $\textbf{Eckart-Young}$ theorem.

The optimal rank-$r$ approximation to $\mathbf{X}$, in a least-squares sense, is given by the rank-$r$ SVD truncation $\mathbf{\tilde{X}}$:

$$ \underset{\mathbf{\tilde{X}},\ \text{s.t.}\ \operatorname{rank}(\mathbf{\tilde{X}})=r}{\operatorname{argmin}} \|\mathbf{X} - \mathbf{\tilde{X}}\|_F = \mathbf{\tilde{U}}\mathbf{\tilde{\Sigma}}\mathbf{\tilde{V}}^T $$

Here, $\mathbf{\tilde{U}}$ and $\mathbf{\tilde{V}}$ denote the first $r$ leading columns of $\mathbf{U}$ and $\mathbf{V}$, and $\mathbf{\tilde{\Sigma}}$ contains the leading $r × r$ sub-block of $\mathbf{\Sigma}$. $\| \cdot \|_F$ is the Frobenius norm.

Here, we establish the notation that a truncated SVD basis (and the resulting approximated matrix $\mathbf{\tilde{X}}$) will be denoted by $\mathbf{\tilde{X}} = \mathbf{\tilde{U}}\mathbf{\tilde{\Sigma}}\mathbf{\tilde{V}}^T$. Because $\mathbf{\tilde{\Sigma}}$ is diagonal, the rank-$r$ SVD approximation is given by the sum of $r$ distinct rank-$1$ matrices:

$$ \mathbf{\tilde{X}} = \sum_{k=1}^r \sigma_k \mathbf{u}_k \mathbf{v}_k^T = \sigma_1 \mathbf{u}_1 \mathbf{v}_1^T + \sigma_2 \mathbf{u}_2 \mathbf{v}_2^T + \dots + \sigma_r \mathbf{u}_r \mathbf{v}_r^T. $$

This is the so-called $\textbf{dyadic}$ summation. For a given rank $r$, there is no better approximation for $\mathbf{X}$, in the $\ell_2$ sense, than the truncated SVD approximation $\mathbf{\tilde{X}}$. Thus, high-dimensional data may be well described by a few dominant patterns given by the columns of $\mathbf{\tilde{U}}$ and $\mathbf{\tilde{V}}$.
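
The following is a minimal numerical sketch of this optimality (with hypothetical dimensions): the Frobenius error of the rank-$r$ truncation equals $\sqrt{\sum_{k>r} \sigma_k^2}$, the energy in the discarded singular values.

import numpy as np

np.random.seed(0)
X = np.random.rand(200, 50)
U, S, VT = np.linalg.svd(X, full_matrices=False)

r = 10
Xr = U[:, :r] @ np.diag(S[:r]) @ VT[:r, :]   # rank-r truncation

# The Frobenius error of the truncation equals the root of the discarded energy
print(np.linalg.norm(X - Xr, 'fro'))         # agrees with the line below
print(np.sqrt(np.sum(S[r:]**2)))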

Image Compression

We demonstrate the idea of matrix approximation with a simple example: image compression. Natural images present a simple and intuitive example of this inherent compressibility. A grayscale image may be thought of as a real-valued matrix $\mathbf{X} ∈ \mathbb{R}^{n×m}$, where $n$ and $m$ are the number of pixels in the vertical and horizontal directions, respectively.

Consider the grayscale image of the lion shown below.

The figures below show the approximate matrix $\mathbf{\tilde{X}}$ for various SVD truncation values $r$. By $r = 100$, the reconstructed image is quite accurate. If the original image has $5000 × 3000$ pixels, the SVD truncation compresses the original image to roughly $5.33\% = (5000\times 100 + 100 + 100\times3000)/(5000\times3000)$ of its original storage, since only the first $100$ columns of $\mathbf{U}$ and $\mathbf{V}$, along with the first $100$ diagonal elements of $\mathbf{\Sigma}$, need to be stored.

Reconstructed images for truncation values $r = 5$, $r = 20$, and $r = 100$.
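
The storage fraction above is just a count of stored entries; a small helper function (hypothetical, not part of the original code) makes the arithmetic explicit:

def svd_storage_fraction(n, m, r):
    # Fraction of the original n*m entries needed to store the rank-r factors:
    # the first r columns of U (n*r), r singular values, and the first r rows of V^T (r*m)
    return (n * r + r + r * m) / (n * m)

print(svd_storage_fraction(5000, 3000, 100))   # ~0.0533, i.e. about 5.33%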

The Python code for the truncated SVD is given below:

from matplotlib.image import imread
import matplotlib.pyplot as plt
import numpy as np
import os

A = imread('lion.png')
img = plt.imshow(A)
img.set_cmap('gray')
plt.axis('off')
plt.show()

X = np.mean(A, -1)  # convert RGB to grayscale by averaging over the color channels
U, S, VT = np.linalg.svd(X, full_matrices=False)
S = np.diag(S)      # arrange the singular values in a diagonal matrix

j = 0
for r in (5, 20, 100):
    # Construct approximate image
    Xapprox = U[:,:r] @ S[0:r,:r] @ VT[:r,:]
    plt.figure(j+1)
    j += 1
    img = plt.imshow(Xapprox)
    img.set_cmap('gray')
    plt.axis('off')
    plt.title('r = ' + str(r))
    plt.show()
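
A useful companion to these reconstructions (a sketch continuing the block above, assuming `S` is still in scope) is the singular value spectrum and the cumulative energy, which indicate how much of the image is captured by the first $r$ modes:

# Singular value spectrum and cumulative energy of the image
sigma = np.diag(S)                               # recover the 1-D array of singular values
energy = np.cumsum(sigma**2) / np.sum(sigma**2)

plt.figure()
plt.semilogy(sigma)
plt.title('Singular values')
plt.show()

plt.figure()
plt.plot(energy)
plt.title('Cumulative energy')
plt.show()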

Eigenfaces

In this problem, PCA (i.e. SVD on mean-subtracted data) is applied to a large library of facial images to extract the most dominant correlations between images. The result of this decomposition is a set of eigenfaces that define a new coordinate system. Images may be represented in these coordinates by taking the dot product with each of the principal components.

Here, we demonstrate this algorithm using the Extended Yale Face Database B, consisting of cropped and aligned images of $38$ individuals ($28$ from the extended database and $10$ from the original database) under $9$ poses and $64$ lighting conditions. Each image is $192$ pixels tall and $168$ pixels wide, and is reshaped into a large column vector with $192 × 168 = 32{,}256$ elements. We use the first $36$ people in the database as our training data for the eigenfaces example, and we hold back two people as a test set. The average face is computed and subtracted from each column vector, and the mean-subtracted image vectors are stacked horizontally as columns in the data matrix $\mathbf{X}$. Taking the SVD of the mean-subtracted matrix $\mathbf{X}$ therefore results in the PCA. The columns of $\mathbf{U}$ are the eigenfaces, and they may be reshaped back into $192 \times 168$ images.

Using the eigenface library $\mathbf{\tilde{U}}$, we will approximately represent an image that was not in the training data. At the beginning, we held back two individuals (the $37$th and $38$th people), and we now use one of their images as a test image, $\mathbf{x}_{test}$. We will see how well a rank-$r$ SVD basis approximates this image using the projection $\mathbf{\tilde{x}}_{test} = \mathbf{\tilde{U}} \mathbf{\tilde{U}}^T \mathbf{x}_{test}$.

import matplotlib.pyplot as plt
import numpy as np
import scipy.io

mat_contents = scipy.io.loadmat('allFaces.mat')
faces = mat_contents['faces'] #(32256, 2410)
m = int(mat_contents['m'])
n = int(mat_contents['n'])
nfaces = np.ndarray.flatten(mat_contents['nfaces']) # number of images for each person

# We use the first 36 people for training data
trainingFaces = faces[:,:np.sum(nfaces[:36])]
avgFace = np.mean(trainingFaces, axis=1) # size 192*168 by 1

# Compute eigenfaces on mean-subtracted training data
X = trainingFaces - np.tile(avgFace,(trainingFaces.shape[1],1)).T
U, S, VT = np.linalg.svd(X,full_matrices=False)

fig1 = plt.figure()
ax1 = fig1.add_subplot(121)
img_avg = ax1.imshow(np.reshape(avgFace,(m,n)).T)
img_avg.set_cmap('gray')
plt.axis('off')

ax2 = fig1.add_subplot(122)
img_u1 = ax2.imshow(np.reshape(U[:,0],(m,n)).T)
img_u1.set_cmap('gray')
plt.axis('off')

plt.show()

## Now show the eigenface reconstruction of an image that was omitted from the training data
testFace = faces[:,np.sum(nfaces[:36])] # First face of person 37
plt.imshow(np.reshape(testFace,(m,n)).T)
plt.set_cmap('gray')
plt.title('Original Image')
plt.axis('off')
plt.show()

testFaceMS = testFace - avgFace
r_list = [50, 200, 800, 1600]

for r in r_list:
    reconFace = avgFace + U[:,:r]  @ U[:,:r].T @ testFaceMS
    img = plt.imshow(np.reshape(reconFace,(m,n)).T)
    img.set_cmap('gray')
    plt.title('r = ' + str(r))
    plt.axis('off')
    plt.show()
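
To quantify how the approximation improves with $r$ (a short sketch continuing the code above, reusing `testFace`, `testFaceMS`, `avgFace`, and `U`), one can track the relative reconstruction error of the test face:

# Relative reconstruction error of the test face as a function of r
for r in r_list:
    reconFace = avgFace + U[:, :r] @ (U[:, :r].T @ testFaceMS)
    rel_err = np.linalg.norm(testFace - reconFace) / np.linalg.norm(testFace)
    print('r =', r, 'relative error =', round(rel_err, 3))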

Training faces

Average face and the first eigenface

Test face (the 37th person in the dataset)

Face reconstruction using the first $r$ eigenfaces

References

Steven L. Brunton and J. Nathan Kutz, Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control, Cambridge University Press.