Numerical Linear Algebra (Part 2)

Least squares problems

Given an $m$ x $n$ matrix $A$ with $m > n$ and an $m$-vector $b$, in general there may not be a vector $x$ such that $A\; x = b$. So instead we can try to find an $x$ such that the residual $b - A\; x$ is "as small as possible". The least squares problem is to do this by minimizing the 2-norm $||\;b - A\;x||_2$ of the residual (hence the name least squares).

The condition $m>n$ stipulated above makes this an overdetermined least squares problem. The case $m < n$ gives an underdetermined problem which we will not cover.

Often the least squares problem is written as $A\; x \simeq b$ to emphasize the approximate nature of the solution.

The four fundamental subspaces of linear algebra corresponding to a matrix $A$ are the range or column space of $A$, the nullspace of $A$, the row space (which is the column space of $A^T$) and the left nullspace (which is the nullspace of $A^T$). We will refer to these as $\operatorname{im} A$, $\operatorname{ker} A$, $\operatorname{im} A^T$ and $\operatorname{ker} A^T$.

These spaces are related by orthogonality in pairs. For example, $(\operatorname{im} A)^\perp$ = $\operatorname{ker} A^T$ -- the orthogonal complement of the image of $A$ is the kernel of $A^T$. And so on.

The figure below should clarify why $\operatorname{im} A$ is an important object in least squares problems.

[Figure: the vector $b$, the image of $A$, and the residual $b - A\; x$, which is shortest when it is perpendicular to the image]

Every point in the image of $A$ corresponds to $A\; x$ for some $x$. The vector from the tip of $b$ to such a point is the residual vector $b - A\; x$. The shortest distance from the tip of $b$ to the image of $A$ achieves the minimization of the residual, so the corresponding $x$ is the least squares solution.

Since we want the residual $r = b - A\; x$ to be orthogonal to the image of $A$, this is the same as requiring $r \in (\operatorname{im} A)^\perp = \operatorname{ker} A^T$. Thus we want

$$ \begin{align} A^T \; r &= 0\\ A^T\; (b - A\; x) &=0\\ A^T\; b - A^T\;A\; x &= 0 \end{align} $$

Thus the least squares problem can be solved if we can solve the square linear system

$$ A^T\; A\; x = A^T\; b $$

The matrix $A^T\; A$ is square and is nonsingular if and only if $A$ is a full rank matrix (remember we are assuming that $A$ has more rows than columns).

This is the oldest way to solve least squares problems, and the above system is called the normal equations.

The solution can then be expressed as $x = (A^T\; A)^{-1} \; A^T\; b$, and the matrix $(A^T\; A)^{-1} \; A^T$ is called the pseudoinverse of $A$, often written $A^+$. Of course, in practice we will not form inverses explicitly; instead we solve the normal equations with a linear solve.
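
To make this concrete, here is a minimal sketch (on an arbitrary random system, not the exercise data below) of forming and solving the normal equations directly:

import numpy as np

np.random.seed(0)
A = np.random.rand(10, 3)   # tall matrix, full rank (almost surely)
b = np.random.rand(10)

# Solve the normal equations A^T A x = A^T b with a linear solver,
# rather than forming the pseudoinverse explicitly.
x = np.linalg.solve(A.T @ A, A.T @ b)

# The residual should be orthogonal to the columns of A: A^T (b - A x) = 0.
print(np.allclose(A.T @ (b - A @ x), 0))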

Exercise 1

  1. Compute the condition numbers of $A$ and $A^T\; A$ for a few random matrices $A$. How do you think the condition numbers of $A$ and $A^T\;A$ are related in general?
  2. Use the normal equation method to solve a least squares problem that finds the "best fit" line through the dataset:
    x = [-0.01, 0.078, 0.191, 0.318, 0.467, 0.54, 0.694, 0.76, 0.898, 0.988]
    y = [0.988, 1.056, 1.142, 1.156, 1.221, 1.279, 1.349, 1.367, 1.463, 1.536]
    Although this is commonly referred to as linear regression, you can see that it really isn't exclusive to lines. You could easily adapt the code to fit quadratics, cubics and so on.
  3. Repeat the above exercise using the least squares solver lstsq from scipy.linalg.
  4. Repeat the exercise using numpy.polyfit(x, y, deg) to compute the best fit line instead.
  5. Finally repeat using scipy.stats.linregress. It provides you with information about how good the fit is.

Least Squares and QR Factorization

The condition number of a full rank tall rectangular matrix $A$ (more rows than columns) is defined by replacing the inverse in the definition of condition number by pseudoinverse. Thus $\operatorname{cond}(A) := ||A||\;\;||A^+||$.
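
As a quick sanity check of this definition (a small sketch on an arbitrary random matrix), numpy's cond handles rectangular matrices and should agree with the product $||A||\;\;||A^+||$ computed from the pseudoinverse:

import numpy as np

np.random.seed(0)
A = np.random.rand(10, 3)                  # tall, full rank (almost surely)

# cond(A) = ||A|| ||A^+|| in the 2-norm; pinv computes the pseudoinverse A^+.
print(np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.pinv(A), 2))
print(np.linalg.cond(A))                   # numpy's default is the 2-norm; should agree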

The conditioning of the least squares problem depends both on the condition number of the matrix and on the angle that $b$ makes with the image of $A$. If that angle is small, then the conditioning depends essentially only on the condition number of the matrix.

There is nothing we can do about the angle between $b$ and the image of $A$, since $b$ is the given data and $A$ depends on the problem. The one thing we can do is replace the matrix $A$ by another matrix which has the same image but is much better conditioned.

What matrix has the best conditioning? An orthogonal matrix!

The first modern advance in solving least squares problems was this realization. So now our task is to find a matrix with orthogonal columns which has the same image as $A$.

This is achieved by the QR factorization of $A$. For every full rank tall matrix $A$ of size $m$ x $n$ one can find an orthogonal matrix $Q$ of size $m$ x $m$ and an $m$ x $n$ upper "triangular" matrix $R$ (its top $n$ x $n$ block is upper triangular and its remaining rows are zero) such that $A = Q\; R$.

Actually there are two types of QR factorization, the reduced (also known as economy or economic) and the full QR factorization. The pictures below capture the essential features of the two.

[Figure: block shapes of the reduced and full QR factorizations]

The sizes are drawn to scale. If the matrix $A$ has $m$ rows and $n$ columns then so does $\hat{Q}$ in the reduced QR factorization. The $Q$ in the full QR has some additional orthonormal columns tacked on to make it $m$ x $m$. The matrix $\hat{R}$ in the reduced QR is invertible if and only if $A$ is full rank.

We will distinguish between the two factorizations by decorating the reduced QR factors with hats. It is the $\hat{Q}$ in the reduced QR which has the same image as $A$ (the columns of the full $Q$ obviously span the whole space).

The following example computes the reduced and full QR factorizations.

In [1]:
import numpy as np
from scipy.linalg import qr

np.random.seed(6142016)
A = np.random.rand(3, 2)
# Reduced (economic) QR: Q is 3 x 2 with orthonormal columns, R is 2 x 2 upper triangular.
Q, R = qr(A, mode='economic')

print('Economic QR')
print()
print(Q)
print()
print(R)
print()

A = np.random.rand(3, 2)
# Full QR: Q is 3 x 3 orthogonal, R is 3 x 2 with a zero bottom row.
Q, R = qr(A, mode='full')

print('Full QR')
print()
print(Q)
print()
print(R)
print()
Economic QR

[[-0.37462426  0.88131196]
 [-0.48644815  0.07761285]
 [-0.78931924 -0.46611744]]

[[-1.20063905 -0.84288104]
 [ 0.         -0.15951347]]

Full QR

[[-0.11803397  0.98770614  0.10249176]
 [-0.68717369 -0.00673629 -0.72646194]
 [-0.7168405  -0.15617683  0.67952078]]

[[-0.28124322 -0.28336706]
 [ 0.          0.30083746]
 [ 0.          0.        ]]

If the matrix $A$ is not very well conditioned then using the reduced QR factorization is a better way to solve a least squares problem than using the normal equations. You start with the normal equations, substitute the reduced QR, and use the fact that $\hat{Q}^T\;\hat{Q} = I$ (the columns of $\hat{Q}$ are orthonormal) to obtain an algorithm: $$ \begin{align} A^T \; A\; x &= A^T\; b\\ \hat{R}^T\; \hat{Q}^T\;\hat{Q}\; \hat{R}\; x &= \hat{R}^T\; \hat{Q}^T\; b\\ \hat{R}^T\; \hat{R}\; x &= \hat{R}^T\; \hat{Q}^T\;b \end{align} $$

Then, cancelling $\hat{R}^T$, which we can do since $\hat{R}$ is nonsingular ($A$ was assumed to have full rank), we get the system to solve

$$ \hat{R} \; x = \hat{Q}^T\; b $$
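
For illustration, here is a minimal sketch of this algorithm on an arbitrary random system (the regression data from the exercises is left for Exercise 2); solve_triangular exploits the triangular structure of $\hat{R}$:

import numpy as np
from scipy.linalg import qr, solve_triangular, lstsq

np.random.seed(0)
A = np.random.rand(10, 3)
b = np.random.rand(10)

# Reduced QR: Q_hat is 10 x 3 with orthonormal columns, R_hat is 3 x 3 upper triangular.
Q_hat, R_hat = qr(A, mode='economic')

# Back substitution for R_hat x = Q_hat^T b.
x = solve_triangular(R_hat, Q_hat.T @ b)

# Sanity check against scipy's least squares solver.
print(np.allclose(x, lstsq(A, b)[0]))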

Exercise 2

  1. Solve the linear regression problem from the last exercise, but this time using reduced QR factorization.

Eigenvalue problems

Due to Abel's impossibility theorem about the unsolvability (by radicals) of general polynomial equations of degree 5 and higher, there is no finite algorithm to compute the eigenvalues of a general matrix of size $5$ x $5$ or larger exactly. All eigenvalue finding algorithms are therefore iterative.

Modern numerical analysts have produced iterative methods with excellent convergence. There are also classical, slower iterative algorithms which are still worth learning about, especially when only a few eigenvalues are needed. They are very easy to implement, so we start with them.

The power method, shifted inverse iteration and Rayleigh quotient iteration were covered in lecture. They will be added here later.
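
In the meantime, here is a minimal sketch of the basic power method for illustration (it assumes the matrix has a single dominant eigenvalue):

import numpy as np

def power_method(A, num_iters=500):
    # Repeatedly multiply by A and normalize; the iterate converges to a
    # dominant eigenvector when a dominant eigenvalue exists.
    x = np.random.rand(A.shape[0])
    for _ in range(num_iters):
        x = A @ x
        x = x / np.linalg.norm(x)
    # The Rayleigh quotient x^T A x / x^T x estimates the eigenvalue.
    return (x @ A @ x) / (x @ x), x

np.random.seed(0)
B = np.random.rand(4, 4)
A = B + B.T                                   # symmetric, so the eigenvalues are real
lam, v = power_method(A)
print(lam)
print(np.max(np.abs(np.linalg.eigvals(A))))   # should match the estimate above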

Singular value decomposition (SVD)

Reduced and full SVD were covered in lecture. Will be added here later.
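
Until then, the following small sketch shows the shapes produced by the full and reduced SVD in scipy, mirroring the full/reduced distinction for QR:

import numpy as np
from scipy.linalg import svd

np.random.seed(0)
A = np.random.rand(3, 2)

# Full SVD: U is 3 x 3, s holds the two singular values, Vt is 2 x 2.
U, s, Vt = svd(A)
print(U.shape, s.shape, Vt.shape)

# Reduced (economy) SVD: U_hat is 3 x 2 with orthonormal columns.
U_hat, s, Vt = svd(A, full_matrices=False)
print(U_hat.shape, s.shape, Vt.shape)

# Both factorizations reconstruct A; here we check the reduced one.
print(np.allclose(U_hat @ np.diag(s) @ Vt, A))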