Linear least squares (mathematics)

This article is about the mathematics that underlie curve fitting using linear least squares. For statistical regression analysis using least squares, see linear regression. For linear regression on a single variable, see simple linear regression. For other uses, see ordinary least squares and regression analysis.

In statistics and mathematics, linear least squares is an approach to fitting a mathematical or statistical model to data in cases where the idealized value provided by the model for any data point is expressed linearly in terms of the unknown parameters of the model. The resulting fitted model can be used to summarize the data, to predict unobserved values from the same system, and to understand the mechanisms that may underlie the system.

Mathematically, linear least squares is the problem of approximately solving an overdetermined system of linear equations, where the best approximation is defined as that which minimizes the sum of squared differences between the data values and their corresponding modeled values. The approach is called linear least squares since the assumed function is linear in the parameters to be estimated. Linear least squares problems are convex and have a closed-form solution that is unique, provided that the number of data points used for fitting equals or exceeds the number of unknown parameters, except in special degenerate situations. In contrast, non-linear least squares problems generally must be solved by an iterative procedure, and the problems can be non-convex with multiple optima for the objective function. If prior distributions are available, then even an underdetermined system can be solved using the Bayesian MMSE estimator.

In statistics, linear least squares problems correspond to a particularly important type of statistical model called linear regression which arises as a particular form of regression analysis. One basic form of such a model is an ordinary least squares model. The present article concentrates on the mathematical aspects of linear least squares problems, with discussion of the formulation and interpretation of statistical regression models and statistical inferences related to these being dealt with in the articles just mentioned. See outline of regression analysis for an outline of the topic.

Example

Using a quadratic model

Figure: the result of fitting a quadratic function $y=\beta _{1}+\beta _{2}x+\beta _{3}x^{2}$ (in blue) through a set of data points $(x_{i},y_{i})$ (in red). In linear least squares the function need not be linear in the argument $x$, but only in the parameters $\beta _{j}$ that are determined to give the best fit.

Importantly, in "linear least squares" we are not restricted to using a line as the model. For instance, we could choose the restricted quadratic model $y=\beta _{1}x^{2}$ . This model is still linear in the $\beta _{1}$ parameter, so we can perform the same analysis, constructing a system of equations from the data points (1, 6), (2, 5), (3, 7) and (4, 10):

{\begin{alignedat}{2}6&&\;=\beta _{1}(1)^{2}\\5&&\;=\beta _{1}(2)^{2}\\7&&\;=\beta _{1}(3)^{2}\\10&&\;=\beta _{1}(4)^{2}\\\end{alignedat}}

The sum of squared residuals is $S(\beta _{1})=\sum _{i}(y_{i}-\beta _{1}x_{i}^{2})^{2}$ . Its partial derivative with respect to the parameter (here there is only one) is computed and set to 0:

${\frac {\partial S}{\partial \beta _{1}}}=0=708\beta _{1}-498$

and solved

$\beta _{1}=0.703$

leading to the resulting best fit model $y=0.703x^{2}$
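The arithmetic behind the coefficients 708 and 498 can be checked with a minimal sketch. It assumes NumPy is available; the variable names `a`, `c` and `beta1` are illustrative and do not appear in the text above.

```python
import numpy as np

# Data points read off the system of equations above.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])

# S(b) = sum_i (y_i - b * x_i^2)^2, so dS/db = 2*b*sum(x^4) - 2*sum(x^2 * y).
a = 2.0 * np.sum(x**4)       # coefficient of beta_1 in dS/dbeta_1
c = 2.0 * np.sum(x**2 * y)   # constant term in dS/dbeta_1
beta1 = c / a                # root of a*beta_1 - c = 0

print(a, c, round(beta1, 3))  # 708.0 498.0 0.703
```

Setting the derivative to zero is sufficient here because S is a convex quadratic in the single parameter.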

The general problem

Consider an overdetermined system

\sum _{j=1}^{n}X_{ij}\beta _{j}=y_{i},\ (i=1,2,\dots ,m),

of m linear equations in n unknown coefficients, β₁,β₂,…,β_n, with m > n. (Note: for a linear model as above, not all of $X$ contains information on the data points: the first column is populated with ones, $X_{i1}=1$ , and only the other columns contain actual data, so n = number of regressors + 1.) This can be written in matrix form as

\mathbf {X} {\boldsymbol {\beta }}=\mathbf {y} ,

where

\mathbf {X} ={\begin{bmatrix}X_{11}&X_{12}&\cdots &X_{1n}\\X_{21}&X_{22}&\cdots &X_{2n}\\\vdots &\vdots &\ddots &\vdots \\X_{m1}&X_{m2}&\cdots &X_{mn}\end{bmatrix}},\qquad {\boldsymbol {\beta }}={\begin{bmatrix}\beta _{1}\\\beta _{2}\\\vdots \\\beta _{n}\end{bmatrix}},\qquad \mathbf {y} ={\begin{bmatrix}y_{1}\\y_{2}\\\vdots \\y_{m}\end{bmatrix}}.

Such a system usually has no solution, so the goal is instead to find the coefficients ${\boldsymbol {\beta }}$ which fit the equations "best," in the sense of solving the quadratic minimization problem

{\hat {\boldsymbol {\beta }}}={\underset {\boldsymbol {\beta }}{\operatorname {arg\,min} }}\,S({\boldsymbol {\beta }}),

where the objective function S is given by

S({\boldsymbol {\beta }})=\sum _{i=1}^{m}{\bigl |}y_{i}-\sum _{j=1}^{n}X_{ij}\beta _{j}{\bigr |}^{2}={\bigl \|}\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }}{\bigr \|}^{2}.

A justification for choosing this criterion is given in properties below. This minimization problem has a unique solution, provided that the n columns of the matrix $\mathbf {X}$ are linearly independent, given by solving the normal equations

(\mathbf {X} ^{\rm {T}}\mathbf {X} ){\hat {\boldsymbol {\beta }}}=\mathbf {X} ^{\rm {T}}\mathbf {y} .

The matrix $\mathbf {X} ^{\rm {T}}\mathbf {X}$ is known as the Gramian matrix of $\mathbf {X}$ , which possesses several nice properties such as being a positive semi-definite matrix, and the matrix $\mathbf {X} ^{\rm {T}}\mathbf {y}$ is known as the moment matrix of regressand by regressors.^[1] Finally, ${\hat {\boldsymbol {\beta }}}$ is the coefficient vector of the least-squares hyperplane, expressed as

{\hat {\boldsymbol {\beta }}}=(\mathbf {X} ^{\rm {T}}\mathbf {X} )^{-1}\mathbf {X} ^{\rm {T}}\mathbf {y} .

Example implementation

MATLAB

The following MATLAB code shows an implementation of this approach on the data used in the first example above.

% MATLAB code for finding the best fit line using least squares method
input = [...                 % input in the form of matrix
    1, 6;...                 % rows contain points
    2, 5;...
    3, 7;...
    4, 10];
m = size(input, 1);            % number of points (rows)
X = [ones(m,1), input(:,1)];   % forming X of X beta = y
y = input(:,2);                % forming y of X beta = y
betaHat = (X' * X) \ (X' * y); % solving the normal equations for beta
% display best fit parameters
disp(betaHat);
% plot the best fit line
xx = linspace(0, 5, 2);
yy = betaHat(1) + betaHat(2)*xx;
plot(xx, yy)
% plot the points (data) for which we found the best fit
hold on
plot(input(:,1), input(:,2), 'or')
hold off

Python

Python code using the same variable naming as the MATLAB code above:

import numpy as np
import matplotlib.pyplot as plt
input = np.array([
    [1, 6],
    [2, 5],
    [3, 7],
    [4, 10]
])
m = input.shape[0]                              # number of points
X = np.column_stack([np.ones(m), input[:, 0]])  # forming X of X beta = y
y = input[:, 1]                                 # forming y of X beta = y
betaHat = np.linalg.solve(X.T @ X, X.T @ y)     # solving the normal equations for beta
print(betaHat)
plt.figure(1)
xx = np.linspace(0, 5, 2)
yy = betaHat[0] + betaHat[1] * xx
plt.plot(xx, yy, color='b')
plt.scatter(input[:,0], input[:,1], color='r')
plt.show()

Julia

using Plots
pyplot() #choose plotting backend
input = [
    1 6
    2 5
    3 7
    4 10]
m = size(input)[1]
X = [ones(m) input[:,1]]
y = input[:,2]
betaHat = (X' * X ) \ X' * y #backslash computes LS-solution as in Matlab
print(betaHat)
plot(x->betaHat[2]*x + betaHat[1],0,5,label="curve fit")
scatter!(input[:,1],input[:,2],label="data")

R

m <- 4
n <- 2
input <- matrix(c(1, 6, 2, 5, 3, 7, 4, 10), ncol = n, byrow = T)
k <- rep(1,m)
X <- cbind(k, input[,1])
y <- input[,2]
X.T <- t(X)
betaHat <- solve(X.T%*%X) %*% X.T %*%y
print(betaHat)
plot(input)
abline(betaHat[1], betaHat[2])

Derivation of the normal equations

Define the $i$ th residual to be

r_{i}=y_{i}-\sum _{j=1}^{n}X_{ij}\beta _{j}

Then $S$ can be rewritten

S=\sum _{i=1}^{m}r_{i}^{2}.

Given that S is convex, it is minimized when its gradient vector is zero. (If the gradient vector were not zero, there would be a direction in which we could move to decrease S further; see maxima and minima.) The elements of the gradient vector are the partial derivatives of S with respect to the parameters:

{\frac {\partial S}{\partial \beta _{j}}}=2\sum _{i=1}^{m}r_{i}{\frac {\partial r_{i}}{\partial \beta _{j}}}\ (j=1,2,\dots ,n).

The derivatives are

{\frac {\partial r_{i}}{\partial \beta _{j}}}=-X_{ij}.

Substitution of the expressions for the residuals and the derivatives into the gradient equations gives (with the summation index inside the residual renamed k to keep it distinct from j)

{\frac {\partial S}{\partial \beta _{j}}}=2\sum _{i=1}^{m}\left(y_{i}-\sum _{k=1}^{n}X_{ik}\beta _{k}\right)(-X_{ij})\ (j=1,2,\dots ,n).

Thus if ${\hat {\boldsymbol {\beta }}}$ minimizes S, we have

2\sum _{i=1}^{m}\left(y_{i}-\sum _{k=1}^{n}X_{ik}{\hat {\beta }}_{k}\right)(-X_{ij})=0\ (j=1,2,\dots ,n).

Upon rearrangement, we obtain the normal equations:

\sum _{i=1}^{m}\sum _{k=1}^{n}X_{ij}X_{ik}{\hat {\beta }}_{k}=\sum _{i=1}^{m}X_{ij}y_{i}\ (j=1,2,\dots ,n).

The normal equations are written in matrix notation as

(\mathbf {X} ^{\mathrm {T} }\mathbf {X} ){\hat {\boldsymbol {\beta }}}=\mathbf {X} ^{\mathrm {T} }\mathbf {y}

(where X^T is the matrix transpose of X).

The solution of the normal equations yields the vector ${\hat {\boldsymbol {\beta }}}$ of the optimal parameter values.
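As a numerical illustration of this derivation, the following sketch solves the normal equations for the four data points used in the example implementations and confirms that every component of the gradient vanishes at the solution. It assumes NumPy; the names `beta_hat`, `r` and `grad` are illustrative.

```python
import numpy as np

# Design matrix with a column of ones and a column of x values; observations y.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([6.0, 5.0, 7.0, 10.0])

# Solve the normal equations (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient components dS/dbeta_j = 2 * sum_i r_i * (-X_ij), evaluated at beta_hat.
r = y - X @ beta_hat
grad = -2.0 * (X.T @ r)

print(beta_hat)  # approximately [3.5 1.4]
print(grad)      # approximately [0. 0.]
```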

Derivation directly in terms of matrices

The normal equations can be derived directly from a matrix representation of the problem as follows. The objective is to minimize

S({\boldsymbol {\beta }})={\bigl \|}\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }}{\bigr \|}^{2}=(\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }})^{\rm {T}}(\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }})=\mathbf {y} ^{\rm {T}}\mathbf {y} -{\boldsymbol {\beta }}^{\rm {T}}\mathbf {X} ^{\rm {T}}\mathbf {y} -\mathbf {y} ^{\rm {T}}\mathbf {X} {\boldsymbol {\beta }}+{\boldsymbol {\beta }}^{\rm {T}}\mathbf {X} ^{\rm {T}}\mathbf {X} {\boldsymbol {\beta }}.

Note that ${\boldsymbol {\beta }}^{\rm {T}}\mathbf {X} ^{\rm {T}}\mathbf {y}$ has dimension 1×1, so it is a scalar and equal to its own transpose, $({\boldsymbol {\beta }}^{\rm {T}}\mathbf {X} ^{\rm {T}}\mathbf {y} )^{\rm {T}}=\mathbf {y} ^{\rm {T}}\mathbf {X} {\boldsymbol {\beta }}$ . Hence ${\boldsymbol {\beta }}^{\rm {T}}\mathbf {X} ^{\rm {T}}\mathbf {y} =\mathbf {y} ^{\rm {T}}\mathbf {X} {\boldsymbol {\beta }}$ and the quantity to minimize becomes

S({\boldsymbol {\beta }})=\mathbf {y} ^{\rm {T}}\mathbf {y} -2{\boldsymbol {\beta }}^{\rm {T}}\mathbf {X} ^{\rm {T}}\mathbf {y} +{\boldsymbol {\beta }}^{\rm {T}}\mathbf {X} ^{\rm {T}}\mathbf {X} {\boldsymbol {\beta }}.

Differentiating this with respect to ${\boldsymbol {\beta }}$ and equating to zero to satisfy the first-order conditions gives

-\mathbf {X} ^{\rm {T}}\mathbf {y} +(\mathbf {X} ^{\rm {T}}\mathbf {X} ){\boldsymbol {\beta }}=0,

which is equivalent to the above-given normal equations. A sufficient condition for satisfaction of the second-order conditions for a minimum is that $\mathbf {X}$ have full column rank, in which case $\mathbf {X} ^{\rm {T}}\mathbf {X}$ is positive definite.

Derivation without calculus

When $\mathbf {X} ^{\rm {T}}\mathbf {X}$ is positive definite, the formula for the minimizing value of ${\boldsymbol {\beta }}$ can be derived without the use of derivatives. The quantity

S({\boldsymbol {\beta }})=\mathbf {y} ^{\rm {T}}\mathbf {y} -2{\boldsymbol {\beta }}^{\rm {T}}\mathbf {X} ^{\rm {T}}\mathbf {y} +{\boldsymbol {\beta }}^{\rm {T}}\mathbf {X} ^{\rm {T}}\mathbf {X} {\boldsymbol {\beta }}

can be written as

\langle {\boldsymbol {\beta }},{\boldsymbol {\beta }}\rangle -2\langle {\boldsymbol {\beta }},(\mathbf {X} ^{\rm {T}}\mathbf {X} )^{-1}\mathbf {X} ^{\rm {T}}\mathbf {y} \rangle +\langle (\mathbf {X} ^{\rm {T}}\mathbf {X} )^{-1}\mathbf {X} ^{\rm {T}}\mathbf {y} ,(\mathbf {X} ^{\rm {T}}\mathbf {X} )^{-1}\mathbf {X} ^{\rm {T}}\mathbf {y} \rangle +C,

where $C$ depends only on $\mathbf {y}$ and $\mathbf {X}$ , and $\langle \cdot ,\cdot \rangle$ is the inner product defined by

\langle x,y\rangle =x^{\rm {T}}(\mathbf {X} ^{\rm {T}}\mathbf {X} )y.

It follows that $S({\boldsymbol {\beta }})$ is equal to

\langle {\boldsymbol {\beta }}-(\mathbf {X} ^{\rm {T}}\mathbf {X} )^{-1}\mathbf {X} ^{\rm {T}}\mathbf {y} ,{\boldsymbol {\beta }}-(\mathbf {X} ^{\rm {T}}\mathbf {X} )^{-1}\mathbf {X} ^{\rm {T}}\mathbf {y} \rangle +C

and therefore minimized exactly when

{\boldsymbol {\beta }}-(\mathbf {X} ^{\rm {T}}\mathbf {X} )^{-1}\mathbf {X} ^{\rm {T}}\mathbf {y} =0.
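The completed-square identity can be checked numerically. The sketch below assumes NumPy and the same four data points as the example implementations. It takes $C=\mathbf {y} ^{\rm {T}}\mathbf {y} -\langle {\hat {\boldsymbol {\beta }}},{\hat {\boldsymbol {\beta }}}\rangle$, which follows from expanding the inner products but is not stated explicitly in the text; the helper names `inner` and `S` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([6.0, 5.0, 7.0, 10.0])
G = X.T @ X  # Gram matrix, positive definite because X has full column rank

def inner(u, v):
    """Inner product <u, v> = u^T (X^T X) v."""
    return u @ G @ v

def S(beta):
    """Sum of squared residuals."""
    r = y - X @ beta
    return r @ r

beta_hat = np.linalg.solve(G, X.T @ y)
C = y @ y - inner(beta_hat, beta_hat)  # the part of S that does not depend on beta

beta = rng.normal(size=2)              # an arbitrary trial coefficient vector
d = beta - beta_hat
lhs = S(beta)
rhs = inner(d, d) + C
print(lhs, rhs)                        # the two values agree up to rounding
```

Since the inner product of `d` with itself is non-negative and zero only when `d` is zero, `C` is also the minimum value of S.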

Generalization for complex equations

In general, the coefficients of the matrices $\mathbf {X}$ , ${\boldsymbol {\beta }}$ and $\mathbf {y}$ can be complex. By using a Hermitian transpose instead of a simple transpose, it is possible to find a vector ${\hat {\boldsymbol {\beta }}}$ which minimizes $S({\boldsymbol {\beta }})$ , just as for real matrices. To obtain the normal equations we follow a similar path as in the previous derivations:

{\displaystyle S({\boldsymbol {\beta }})=\langle \mathbf {y} -\mathbf {X} {\boldsymbol {\beta }},\mathbf {y} -\mathbf {X} {\boldsymbol {\beta }}\rangle =\langle \mathbf {y} ,\mathbf {y} \rangle -{\overline {\langle \mathbf {X} {\boldsymbol {\beta }},\mathbf {y} \rangle }}-{\overline {\langle \mathbf {y} ,\mathbf {X} {\boldsymbol {\beta }}\rangle }}+\langle \mathbf {X} {\boldsymbol {\beta }},\mathbf {X} {\boldsymbol {\beta }}\rangle =\mathbf {y} ^{\rm {T}}{\overline {\mathbf {y} }}-{\boldsymbol {\beta }}^{\dagger }\mathbf {X} ^{\dagger }\mathbf {y} -\mathbf {y} ^{\dagger }\mathbf {X} {\boldsymbol {\beta }}+{\boldsymbol {\beta }}^{\rm {T}}\mathbf {X} ^{\rm {T}}{\overline {\mathbf {X} }}{\overline {\boldsymbol {\beta }}},}

where ${\displaystyle \dagger }$ stands for Hermitian transpose.

We should now take derivatives of $S({\boldsymbol {\beta }})$ with respect to each of the coefficients $\beta _{j}$ , but first we separate the real and imaginary parts to deal with the conjugate factors in the above expression. For each $\beta _{j}$ we have

{\displaystyle \beta _{j}=\beta _{j}^{R}+i\beta _{j}^{I}}

and the derivatives change into

{\displaystyle {\frac {\partial S}{\partial \beta _{j}}}={\frac {\partial S}{\partial \beta _{j}^{R}}}{\frac {\partial \beta _{j}^{R}}{\partial \beta _{j}}}+{\frac {\partial S}{\partial \beta _{j}^{I}}}{\frac {\partial \beta _{j}^{I}}{\partial \beta _{j}}}={\frac {\partial S}{\partial \beta _{j}^{R}}}-i{\frac {\partial S}{\partial \beta _{j}^{I}}}\ \ (j=1,2,3,...,n).}

After rewriting $S({\boldsymbol {\beta }})$ in summation form and writing $\beta _{j}$ explicitly, we can calculate both partial derivatives, with the result:

{\displaystyle {\frac {\partial S}{\partial \beta _{j}^{R}}}=-\sum _{i=1}^{m}{\Big (}{\overline {X}}_{ij}y_{i}+{\overline {y}}_{i}X_{ij}{\Big )}+2\sum _{i=1}^{m}X_{ij}{\overline {X}}_{ij}\beta _{j}^{R}+\sum _{i=1}^{m}\sum _{k\neq j}^{n}{\Big (}X_{ij}{\overline {X}}_{ik}{\overline {\beta }}_{k}+\beta _{k}X_{ik}{\overline {X}}_{ij}{\Big )},}

{\displaystyle -i{\frac {\partial S}{\partial \beta _{j}^{I}}}=\sum _{i=1}^{m}{\Big (}{\overline {X}}_{ij}y_{i}-{\overline {y}}_{i}X_{ij}{\Big )}-2i\sum _{i=1}^{m}X_{ij}{\overline {X}}_{ij}\beta _{j}^{I}+\sum _{i=1}^{m}\sum _{k\neq j}^{n}{\Big (}X_{ij}{\overline {X}}_{ik}{\overline {\beta }}_{k}-\beta _{k}X_{ik}{\overline {X}}_{ij}{\Big )},}

which, after adding the two together and setting the sum to zero (the minimization condition for ${\hat {\boldsymbol {\beta }}}$ ), yields

{\displaystyle \sum _{i=1}^{m}X_{ij}{\overline {y}}_{i}=\sum _{i=1}^{m}\sum _{k=1}^{n}X_{ij}{\overline {X}}_{ik}{\overline {\hat {\beta }}}_{k}\ \ (j=1,2,3,...,n).}

In matrix form:

{\displaystyle {\textbf {X}}^{\rm {T}}{\overline {\textbf {y}}}={\textbf {X}}^{\rm {T}}{\overline {{\big (}{\textbf {X}}{\boldsymbol {\hat {\beta }}}{\big )}}}\ \ \ {\text{or}}\ \ \ {\big (}{\textbf {X}}^{\dagger }{\textbf {X}}{\big )}{\boldsymbol {\hat {\beta }}}={\textbf {X}}^{\dagger }{\textbf {y}}.}
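The complex normal equations can be illustrated with a small sketch. It assumes NumPy and uses randomly generated complex data, which is purely hypothetical, and compares the result with NumPy's general least-squares routine `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 2
# Hypothetical complex design matrix and observation vector.
X = rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))
y = rng.normal(size=m) + 1j * rng.normal(size=m)

Xh = X.conj().T                                    # Hermitian transpose of X
beta_hat = np.linalg.solve(Xh @ X, Xh @ y)         # (X^dagger X) beta = X^dagger y
beta_ref, *_ = np.linalg.lstsq(X, y, rcond=None)   # reference least-squares solution

residual = y - X @ beta_hat
print(np.allclose(beta_hat, beta_ref))
print(np.allclose(Xh @ residual, 0.0))
```

The second check expresses the same condition in another form: the residual is orthogonal to the columns of X under the Hermitian inner product.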

Computation

A general approach to the least squares problem $\min _{\boldsymbol {\beta }}\,{\big \|}\mathbf {y} -X{\boldsymbol {\beta }}{\big \|}^{2}$ can be described as follows. Suppose that we can find an n by m matrix S such that XS is an orthogonal projection onto the image of X. Then a solution to our minimization problem is given by

{\boldsymbol {\beta }}=S\mathbf {y}

simply because

X{\boldsymbol {\beta }}=X(S\mathbf {y} )=(XS)\mathbf {y}

is exactly the sought-for orthogonal projection of $\mathbf {y}$ onto the image of X (see the picture below, and note that, as explained in the next section, the image of X is just the subspace generated by the column vectors of X). A few popular ways to find such a matrix S are described below.

Inverting the matrix of the normal equations

The algebraic solution of the normal equations can be written as

{\hat {\boldsymbol {\beta }}}=(\mathbf {X} ^{\rm {T}}\mathbf {X} )^{-1}\mathbf {X} ^{\rm {T}}\mathbf {y} =\mathbf {X} ^{+}\mathbf {y}

where X ⁺ is the Moore–Penrose pseudoinverse of X. Although this equation is correct, and can work in many applications, it is not computationally efficient to invert the normal equations matrix (the Gramian matrix). An exception occurs in numerical smoothing and differentiation where an analytical expression is required.

If the matrix X^TX is well-conditioned and positive definite, implying that it has full rank, the normal equations can be solved directly by using the Cholesky decomposition R^TR, where R is an upper triangular matrix, giving:

R^{\rm {T}}R{\hat {\boldsymbol {\beta }}}=X^{\rm {T}}\mathbf {y} .

The solution is obtained in two stages, a forward substitution step, solving for z:

R^{\rm {T}}\mathbf {z} =X^{\rm {T}}\mathbf {y} ,

followed by a backward substitution, solving for ${\hat {\boldsymbol {\beta }}}$

R{\hat {\boldsymbol {\beta }}}=\mathbf {z} .

Both substitutions are facilitated by the triangular nature of R.
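The two-stage Cholesky solution can be sketched as follows, assuming NumPy and the four data points from the example implementations. `np.linalg.cholesky` returns the lower-triangular factor L with X^TX = LL^T, so the upper-triangular R of the text is taken as L^T. The two substitution helpers are illustrative and written out by hand to show how the triangular structure is used.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L z = b for a lower-triangular matrix L."""
    z = np.zeros_like(b, dtype=float)
    for i in range(len(b)):
        z[i] = (b[i] - L[i, :i] @ z[:i]) / L[i, i]
    return z

def backward_substitution(U, z):
    """Solve U x = z for an upper-triangular matrix U."""
    x = np.zeros_like(z, dtype=float)
    for i in reversed(range(len(z))):
        x[i] = (z[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([6.0, 5.0, 7.0, 10.0])

R = np.linalg.cholesky(X.T @ X).T           # upper triangular, X^T X = R^T R
z = forward_substitution(R.T, X.T @ y)      # stage 1: R^T z = X^T y
beta_hat = backward_substitution(R, z)      # stage 2: R beta = z
print(beta_hat)                             # approximately [3.5 1.4]
```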

See example of linear regression for a worked-out numerical example with three parameters.

Orthogonal decomposition methods

Orthogonal decomposition methods of solving the least squares problem are slower than the normal equations method but are more numerically stable because they avoid forming the product X^TX.

The residuals are written in matrix notation as

\mathbf {r} =\mathbf {y} -X{\hat {\boldsymbol {\beta }}}.

The matrix X is subjected to an orthogonal decomposition, e.g., the QR decomposition as follows.

X=Q{\begin{pmatrix}R\\0\end{pmatrix}}\

where Q is an m×m orthogonal matrix (Q^TQ=I) and R is an n×n upper triangular matrix with $r_{ii}>0$ .

The residual vector is left-multiplied by Q^T.

Q^{\rm {T}}\mathbf {r} =Q^{\rm {T}}\mathbf {y} -\left(Q^{\rm {T}}Q\right){\begin{pmatrix}R\\0\end{pmatrix}}{\hat {\boldsymbol {\beta }}}={\begin{bmatrix}\left(Q^{\rm {T}}\mathbf {y} \right)_{n}-R{\hat {\boldsymbol {\beta }}}\\\left(Q^{\rm {T}}\mathbf {y} \right)_{m-n}\end{bmatrix}}={\begin{bmatrix}\mathbf {u} \\\mathbf {v} \end{bmatrix}}

Because Q is orthogonal, the sum of squares of the residuals, s, may be written as:

s=\|\mathbf {r} \|^{2}=\mathbf {r} ^{\rm {T}}\mathbf {r} =\mathbf {r} ^{\rm {T}}QQ^{\rm {T}}\mathbf {r} =\mathbf {u} ^{\rm {T}}\mathbf {u} +\mathbf {v} ^{\rm {T}}\mathbf {v}

Since v doesn't depend on β, the minimum value of s is attained when the upper block, u, is zero. Therefore the parameters are found by solving:

R{\hat {\boldsymbol {\beta }}}=\left(Q^{\rm {T}}\mathbf {y} \right)_{n}.

These equations are easily solved as R is upper triangular.
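A minimal sketch of the QR approach follows, assuming NumPy and the same four data points. `np.linalg.qr` in its default reduced mode returns only the first n columns of Q (called `Q1` here) together with the n×n block R, which is all that the equation for the parameters needs. NumPy does not guarantee positive diagonal entries in R, which does not affect the solution.

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([6.0, 5.0, 7.0, 10.0])

Q1, R = np.linalg.qr(X)       # reduced QR: Q1 is m x n, R is n x n upper triangular
qty_n = Q1.T @ y              # the first n components of Q^T y

# Solve R beta = (Q^T y)_n; a general solver is used here for brevity,
# although back substitution would exploit the triangular structure.
beta_hat = np.linalg.solve(R, qty_n)

# The minimum sum of squares equals v^T v = ||y||^2 - ||(Q^T y)_n||^2.
s_min = y @ y - qty_n @ qty_n
print(beta_hat, s_min)        # approximately [3.5 1.4] and 4.2
```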

An alternative decomposition of X is the singular value decomposition (SVD)^[2]

X=U\Sigma V^{\rm {T}}\

where U is an m by m orthogonal matrix, V is an n by n orthogonal matrix and $\Sigma$ is an m by n matrix with all its elements outside of the main diagonal equal to 0. The pseudoinverse of $\Sigma$ is easily obtained by inverting its non-zero diagonal elements and transposing. Hence,

\mathbf {X} \mathbf {X} ^{+}=U\Sigma V^{\rm {T}}V\Sigma ^{+}U^{\rm {T}}=UPU^{\rm {T}},

where P is obtained from $\Sigma$ by replacing its non-zero diagonal elements with ones. Since $(\mathbf {X} \mathbf {X} ^{+})^{*}=\mathbf {X} \mathbf {X} ^{+}$ (a property of the pseudoinverse), the matrix $UPU^{\rm {T}}$ is an orthogonal projection onto the image (column-space) of X. In accordance with the general approach described in the introduction above (find XS which is an orthogonal projection),

S=\mathbf {X} ^{+}

and thus,

\beta =V\Sigma ^{+}U^{\rm {T}}\mathbf {y}

is a solution of the least squares problem. This method is the most computationally intensive, but is particularly useful if the normal equations matrix, X^TX, is very ill-conditioned (i.e. if its condition number multiplied by the machine's relative round-off error is appreciably large). In that case, including the smallest singular values in the inversion merely adds numerical noise to the solution. This can be cured with the truncated SVD approach, which gives a more stable answer by explicitly setting to zero all singular values below a certain threshold and so ignoring them, a process closely related to factor analysis.
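The SVD solution with truncation can be sketched as below, assuming NumPy. The function name `svd_lstsq` and the relative threshold `rel_tol` are illustrative choices, not part of any standard; a suitable threshold depends on the problem.

```python
import numpy as np

def svd_lstsq(X, y, rel_tol=1e-12):
    """Least squares via beta = V Sigma^+ U^T y, dropping small singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > rel_tol * s[0]            # s is sorted in decreasing order
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]          # invert only the retained singular values
    return Vt.T @ (s_inv * (U.T @ y))

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([6.0, 5.0, 7.0, 10.0])

X = np.column_stack([np.ones(4), x])
b = svd_lstsq(X, y)                      # full-rank case

# Rank-deficient case: the x column is duplicated, so X2^T X2 is singular.
X2 = np.column_stack([np.ones(4), x, x])
b2 = svd_lstsq(X2, y)
print(b, b2)
```

In the rank-deficient case the normal equations have no unique solution, while the truncated SVD returns the minimum-norm coefficient vector, which gives the same fitted values as the full-rank fit.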

Properties of the least-squares estimators

Figure: the residual vector, $y-X{\hat {\boldsymbol {\beta }}}$ , which corresponds to the solution of a least squares system, $y=X{\boldsymbol {\beta }}+\epsilon$ , is orthogonal to the column space of the matrix $X$ .

The gradient equations at the minimum can be written as

(\mathbf {y} -X{\hat {\boldsymbol {\beta }}})^{\rm {T}}X=0.

A geometrical interpretation of these equations is that the vector of residuals, $\mathbf {y} -X{\hat {\boldsymbol {\beta }}}$ , is orthogonal to the column space of X, since the dot product $(\mathbf {y} -X{\hat {\boldsymbol {\beta }}})\cdot X\mathbf {v}$ is equal to zero for any conformable vector v. This means that $\mathbf {y} -X{\boldsymbol {\hat {\beta }}}$ is the shortest of all possible vectors $\mathbf {y} -X{\boldsymbol {\beta }}$ , that is, the variance of the residuals is the minimum possible. This is illustrated at the right.

Introducing ${\hat {\boldsymbol {\gamma }}}$ and a matrix K with the assumption that a matrix $[X\ K]$ is non-singular and K^T X = 0 (cf. Orthogonal projections), the residual vector should satisfy the following equation:

{\hat {\mathbf {r} }}\triangleq \mathbf {y} -X{\hat {\boldsymbol {\beta }}}=K{\hat {\boldsymbol {\gamma }}}.

The equation and solution of linear least squares are thus described as follows:

\mathbf {y} ={\begin{bmatrix}X&K\end{bmatrix}}{\begin{pmatrix}{\hat {\boldsymbol {\beta }}}\\{\hat {\boldsymbol {\gamma }}}\end{pmatrix}},

{\begin{pmatrix}{\hat {\boldsymbol {\beta }}}\\{\hat {\boldsymbol {\gamma }}}\end{pmatrix}}={\begin{bmatrix}X&K\end{bmatrix}}^{-1}\mathbf {y} ={\begin{bmatrix}(X^{\rm {T}}X)^{-1}X^{\rm {T}}\\(K^{\rm {T}}K)^{-1}K^{\rm {T}}\end{bmatrix}}\mathbf {y} .

If the experimental errors, $\epsilon \,$ , are uncorrelated, have a mean of zero and a constant variance, $\sigma ^{2}$ , the Gauss–Markov theorem states that the least-squares estimator, ${\hat {\boldsymbol {\beta }}}$ , has the minimum variance of all unbiased estimators that are linear combinations of the observations. In this sense it is the best, or optimal, estimator of the parameters. Note particularly that this property is independent of the statistical distribution function of the errors. In other words, the distribution function of the errors need not be a normal distribution. However, for some probability distributions, there is no guarantee that the least-squares solution is even possible given the observations; still, in such cases it is the best estimator that is both linear and unbiased.

For example, it is easy to show that the arithmetic mean of a set of measurements of a quantity is the least-squares estimator of the value of that quantity. If the conditions of the Gauss-Markov theorem apply, the arithmetic mean is optimal, whatever the distribution of errors of the measurements might be.
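The claim about the arithmetic mean can be illustrated with a short sketch, assuming NumPy. The model has a single constant parameter, so X is one column of ones; the measurement values are hypothetical.

```python
import numpy as np

# Hypothetical repeated measurements of a single quantity.
y = np.array([2.0, 3.0, 7.0, 8.0])

# Model y_i = beta_1 + error: the design matrix is a single column of ones.
X = np.ones((len(y), 1))

# Normal equations reduce to m * beta_1 = sum(y).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat[0], y.mean())  # both equal 5.0
```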

However, in the case that the experimental errors do belong to a normal distribution, the least-squares estimator is also a maximum likelihood estimator.^[3]

These properties underpin the use of the method of least squares for all types of data fitting, even when the assumptions are not strictly valid.

Limitations

An assumption underlying the treatment given above is that the independent variable, x, is free of error. In practice, the errors on the measurements of the independent variable are usually much smaller than the errors on the dependent variable and can therefore be ignored. When this is not the case, total least squares or more generally errors-in-variables models, or rigorous least squares, should be used. This can be done by adjusting the weighting scheme to take into account errors on both the dependent and independent variables and then following the standard procedure.^[4]^[5]

In some cases the (weighted) normal equations matrix X^TX is ill-conditioned. When fitting polynomials the design matrix X is a Vandermonde matrix, and Vandermonde matrices become increasingly ill-conditioned as the order of the matrix increases. In these cases, the least squares estimate amplifies the measurement noise and may be grossly inaccurate. Various regularization techniques can be applied in such cases, the most common of which is called ridge regression. If further information about the parameters is known, for example, a range of possible values of ${\hat {\boldsymbol {\beta }}}$ , then various techniques can be used to increase the stability of the solution. For example, see constrained least squares.
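The growth of ill-conditioning in polynomial fitting can be observed with a minimal sketch, assuming NumPy. The sample points (20 equally spaced values on [0, 1]) and the polynomial degrees are arbitrary illustrative choices.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 20)   # hypothetical sample locations

results = []
for degree in (2, 5, 8):
    # Columns 1, x, x^2, ..., x^degree: a Vandermonde design matrix.
    X = np.vander(x, degree + 1, increasing=True)
    cond_X = np.linalg.cond(X)            # 2-norm condition number of X
    cond_gram = np.linalg.cond(X.T @ X)   # condition number of X^T X
    results.append((degree, cond_X, cond_gram))
    print(degree, f"{cond_X:.3e}", f"{cond_gram:.3e}")
```

In exact arithmetic the 2-norm condition number of X^TX is the square of that of X, which is one reason methods that avoid forming X^TX tend to be more robust for such problems.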

Another drawback of the least squares estimator is the fact that the norm of the residuals, $\|\mathbf {y} -X{\hat {\boldsymbol {\beta }}}\|$ , is minimized, whereas in some cases one is truly interested in obtaining a small error in the parameter estimate ${\hat {\boldsymbol {\beta }}}$ , e.g., a small value of $\|{\boldsymbol {\beta }}-{\hat {\boldsymbol {\beta }}}\|$ . However, since the true parameter ${\boldsymbol {\beta }}$ is necessarily unknown, this quantity cannot be directly minimized. If a prior probability on ${\boldsymbol {\beta }}$ is known, then a Bayes estimator can be used to minimize the mean squared error, $E\left\{\|{\boldsymbol {\beta }}-{\hat {\boldsymbol {\beta }}}\|^{2}\right\}$ . The least squares method is often applied when no prior is known. Surprisingly, when several parameters are being estimated jointly, better estimators can be constructed, an effect known as Stein's phenomenon. For example, if the measurement error is Gaussian, several estimators are known which dominate, or outperform, the least squares technique; the best known of these is the James–Stein estimator. This is an example of more general shrinkage estimators that have been applied to regression problems.

Weighted linear least squares

Parameter errors and correlation

The estimated parameter values are linear combinations of the observed values

{\hat {\boldsymbol {\beta }}}=(X^{\rm {T}}WX)^{-1}X^{\rm {T}}W\mathbf {y} .\,

Therefore an expression for the estimated errors in the parameters can be obtained by error propagation from the errors in the observations. Let the variance-covariance matrix for the observations be denoted by M and that of the parameters by M^β. Then,

M^{\beta }=(X^{\rm {T}}WX)^{-1}X^{\rm {T}}WMW^{\rm {T}}X(X^{\rm {T}}W^{\rm {T}}X)^{-1}.

When W = M⁻¹ this simplifies to

M^{\beta }=(X^{\rm {T}}WX)^{-1}.

When unit weights are used (W = I, the identity matrix) it is implied that the experimental errors are uncorrelated and all equal: M = σ²I, where σ² is the a priori variance of an observation. In any case, σ² is approximated by the reduced chi-squared $\chi _{\nu }^{2}$ :

M^{\beta }=\chi _{\nu }^{2}(X^{\rm {T}}X)^{-1}.

\chi _{\nu }^{2}=S/\nu

where S is the minimum value of the (weighted) objective function:

S=r^{\rm {T}}Wr

The denominator, $\nu =m-n$ , is the number of degrees of freedom; see effective degrees of freedom for generalizations for the case of correlated observations.

In all cases, the variance of the parameter $\beta _{i}$ is given by $M_{ii}^{\beta }$ and the covariance between parameters $\beta _{i}$ and $\beta _{j}$ is given by $M_{ij}^{\beta }$ . Standard deviation is the square root of variance, $\sigma _{i}={\sqrt {M_{ii}^{\beta }}}$ , and the correlation coefficient is given by $\rho _{ij}=M_{ij}^{\beta }/(\sigma _{i}\sigma _{j})$ . These error estimates reflect only random errors in the measurements. The true uncertainty in the parameters is larger due to the presence of systematic errors which, by definition, cannot be quantified. Note that even though the observations may be uncorrelated, the parameters are typically correlated.
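For the unit-weight case, the parameter covariance matrix, standard deviations and correlation coefficient can be computed as in the following sketch. It assumes NumPy and reuses the four data points from the example implementations; with so few points the numbers are only illustrative.

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([6.0, 5.0, 7.0, 10.0])
m, n = X.shape

XtX_inv = np.linalg.inv(X.T @ X)   # an explicit inverse is needed for the covariance
beta_hat = XtX_inv @ X.T @ y

r = y - X @ beta_hat
S = r @ r                          # minimum of the objective function (W = I)
nu = m - n                         # degrees of freedom
chi2_red = S / nu                  # reduced chi-squared, used as the estimate of sigma^2

M_beta = chi2_red * XtX_inv        # variance-covariance matrix of the parameters
sigma = np.sqrt(np.diag(M_beta))   # parameter standard deviations
rho = M_beta / np.outer(sigma, sigma)  # correlation coefficients

print(M_beta)
print(sigma, rho[0, 1])
```

The strongly negative correlation between intercept and slope here shows that the parameters are correlated even though the observations were treated as uncorrelated.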

Parameter confidence limits

Main article: Confidence interval

It is often assumed, for want of any concrete evidence but often appealing to the central limit theorem (see Normal distribution § Occurrence), that the error on each observation belongs to a normal distribution with a mean of zero and standard deviation $\sigma$ . Under that assumption the following probabilities can be derived for a single scalar parameter estimate in terms of its estimated standard error $se_{\beta }$ :

68% that the interval ${\hat {\beta }}\pm se_{\beta }$ encompasses the true coefficient value

95% that the interval ${\hat {\beta }}\pm 2se_{\beta }$ encompasses the true coefficient value

99% that the interval ${\hat {\beta }}\pm 2.5se_{\beta }$ encompasses the true coefficient value

The assumption is not unreasonable when m >> n. If the experimental errors are normally distributed the parameters will belong to a Student's t-distribution with m − n degrees of freedom. When m >> n Student's t-distribution approximates a normal distribution. Note, however, that these confidence limits cannot take systematic error into account. Also, parameter errors should be quoted to one significant figure only, as they are subject to sampling error.^[8]

When the number of observations is relatively small, Chebyshev's inequality can be used for an upper bound on probabilities, regardless of any assumptions about the distribution of experimental errors: the maximum probabilities that a parameter will be more than 1, 2 or 3 standard deviations away from its expectation value are 100%, 25% and 11% respectively.

Residual values and correlation

The residuals are related to the observations by

\mathbf {\hat {r}} =\mathbf {y} -X{\hat {\boldsymbol {\beta }}}=\mathbf {y} -H\mathbf {y} =(I-H)\mathbf {y}

where H is the idempotent matrix known as the hat matrix:

H=X\left(X^{\rm {T}}WX\right)^{-1}X^{\rm {T}}W

and I is the identity matrix. The variance-covariance matrix of the residuals, M^r is given by

M^{\mathbf {r} }=\left(I-H\right)M\left(I-H\right)^{\rm {T}}.

Thus the residuals are correlated, even if the observations are not.

When $W=M^{-1}$ ,

M^{\mathbf {r} }=\left(I-H\right)M.

The sum of residual values is equal to zero whenever the model function contains a constant term. Left-multiply the expression for the residuals by X^T:

X^{\rm {T}}{\hat {\mathbf {r} }}=X^{\rm {T}}\mathbf {y} -X^{\rm {T}}X{\hat {\boldsymbol {\beta }}}=X^{\rm {T}}\mathbf {y} -(X^{\rm {T}}X)(X^{\rm {T}}X)^{-1}X^{\rm {T}}\mathbf {y} =\mathbf {0}

Say, for example, that the first term of the model is a constant, so that $X_{{i1}}=1$ for all i. In that case it follows that

\sum _{i=1}^{m}X_{i1}{\hat {r}}_{i}=\sum _{i=1}^{m}{\hat {r}}_{i}=0.

Thus, in the motivational example above, the fact that the sum of residual values is equal to zero is not accidental but is a consequence of the presence of the constant term, α, in the model.
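The orthogonality relation can be verified on a small data set. The sketch below fits a straight line with a constant term by the usual closed-form expressions; the four data points and the function name `fit_line` are chosen for illustration only.

```python
def fit_line(x, y):
    """Unweighted least-squares fit of y = alpha + beta*x; returns (alpha, beta)."""
    m = len(x)
    xbar = sum(x) / m
    ybar = sum(y) / m
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = ybar - beta * xbar
    return alpha, beta

# Illustrative data (an assumption of this sketch)
x = [1.0, 2.0, 3.0, 4.0]
y = [6.0, 5.0, 7.0, 10.0]

alpha, beta = fit_line(x, y)
residuals = [yi - (alpha + beta * xi) for xi, yi in zip(x, y)]

# The two components of X^T r: one for the constant column, one for the x column.
sum_r = sum(residuals)
sum_xr = sum(xi * ri for xi, ri in zip(x, residuals))
print(alpha, beta, sum_r, sum_xr)
```

Both sums vanish up to rounding error. The first is the statement about the constant term; the second is the corresponding row of $X^{\rm {T}}{\hat {\mathbf {r} }}=\mathbf {0}$ for the column of x values.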

If experimental error follows a normal distribution, then, because of the linear relationship between residuals and observations, so should residuals,^[9] but since the observations are only a sample of the population of all possible observations, the residuals should belong to a Student's t-distribution. Studentized residuals are useful in making a statistical test for an outlier when a particular residual appears to be excessively large.

Objective function

The optimal value of the objective function, found by substituting in the optimal expression for the coefficient vector, can be written as (assuming unweighted observations)

S=\mathbf {y} ^{\rm {T}}(I-H)^{\rm {T}}(I-H)\mathbf {y} =\mathbf {y} ^{\rm {T}}(I-H)\mathbf {y} ,

the latter equality holding since (I – H) is symmetric and idempotent. It can be shown from this^[10] that under an appropriate assignment of weights the expected value of S is m-n. If instead unit weights are assumed, the expected value of S is $(m-n)\sigma ^{2}$ , where $\sigma ^{2}$ is the variance of each observation.

If it is assumed that the residuals belong to a normal distribution, the objective function, being a sum of weighted squared residuals, will belong to a chi-squared ( $\chi ^{2}$ ) distribution with m-n degrees of freedom. Some illustrative percentile values of $\chi ^{2}$ are given in the following table.^[11]

{\begin{array}{r|ccc}m-n&\chi _{0.50}^{2}&\chi _{0.95}^{2}&\chi _{0.99}^{2}\\\hline 10&9.34&18.3&23.2\\25&24.3&37.7&44.3\\100&99.3&124&136\end{array}}

These values can be used as a statistical criterion for goodness of fit. When unit weights are used, S should be divided by the variance of an observation before it is compared with these values.
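The expected value of S and the use of the table can be illustrated by simulation. The following sketch assumes a straight-line model (n = 2), m = 12 equally spaced points, normally distributed errors with σ = 0.5 and unit weights; all of these choices, and the function name, are assumptions made for the example.

```python
import random

def residual_sum_of_squares(x, y):
    """S for an unweighted least-squares fit of y = b1 + b2*x."""
    m = len(x)
    xbar = sum(x) / m
    ybar = sum(y) / m
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b1 = ybar - b2 * xbar
    return sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))

random.seed(0)  # fixed seed so that the run is repeatable
m, n, sigma = 12, 2, 0.5
x = [float(i) for i in range(m)]
trials = 4000

S_values = []
for _ in range(trials):
    # Simulated observations: true line 1 + 2x plus normal noise
    y = [1.0 + 2.0 * xi + random.gauss(0.0, sigma) for xi in x]
    S_values.append(residual_sum_of_squares(x, y))

mean_S = sum(S_values) / trials
expected_S = (m - n) * sigma ** 2
# Fraction of runs in which S / sigma^2 exceeds the tabulated 95th percentile for 10 degrees of freedom
frac_above = sum(1 for s in S_values if s / sigma ** 2 > 18.3) / trials
print(mean_S, expected_S, frac_above)
```

The sample mean of S should lie close to $(m-n)\sigma ^{2}=2.5$, and roughly 5% of the simulated values of $S/\sigma ^{2}$ should exceed 18.3, in line with the first row of the table.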

Constrained linear least squares

Often it is of interest to solve a linear least squares problem with an additional constraint on the solution. With constrained linear least squares, the original equation

\mathbf {X} {\boldsymbol {\beta }}=\mathbf {y}

must be fit as closely as possible (in the least squares sense) while ensuring that some other property of ${\boldsymbol {\beta }}$ is maintained. There are often special purpose algorithms for solving such problems efficiently. Some examples of constraints are given below:

Equality constrained least squares: the elements of ${\boldsymbol {\beta }}$ must exactly satisfy $\mathbf {L} {\boldsymbol {\beta }}=\mathbf {d}$ (see Ordinary least squares#Constrained estimation).
Regularized least squares: the elements of ${\boldsymbol {\beta }}$ must satisfy $\|\mathbf {L} {\boldsymbol {\beta }}-\mathbf {d} \|\leq \rho$.
Non-negative least squares (NNLS): The vector ${\boldsymbol {\beta }}$ must satisfy the vector inequality ${\boldsymbol {\beta }}\geq {\boldsymbol {0}}$ defined componentwise—that is, each component must be either positive or zero.
Box-constrained least squares: The vector ${\boldsymbol {\beta }}$ must satisfy the vector inequalities ${\boldsymbol {lb}}\leq {\boldsymbol {\beta }}\leq {\boldsymbol {ub}}$ , each of which is defined componentwise.
Integer constrained least squares: all elements of ${\boldsymbol {\beta }}$ must be integers (instead of real numbers).
Phase constrained least squares: all elements of ${\boldsymbol {\beta }}$ must have the same phase (or must be real rather than complex numbers, i.e. phase = 0).

When the constraint only applies to some of the variables, the mixed problem may be solved using separable least squares by letting $\mathbf {X} =[\mathbf {X_{1}} \mathbf {X_{2}} ]$ and $\mathbf {\beta } ^{\rm {T}}=[\mathbf {\beta _{1}} ^{\rm {T}}\mathbf {\beta _{2}} ^{\rm {T}}]$ represent the unconstrained (1) and constrained (2) components. Then substituting the least squares solution for $\mathbf {\beta _{1}}$ , i.e.

{\hat {\boldsymbol {\beta _{1}}}}=\mathbf {X_{1}} ^{+}(\mathbf {y} -\mathbf {X_{2}} {\boldsymbol {\beta _{2}}})

back into the original expression gives (following some rearrangement) an equation that can be solved as a purely constrained problem in $\mathbf {\beta _{2}}$:

\mathbf {P} \mathbf {X_{2}} {\boldsymbol {\beta _{2}}}=\mathbf {P} \mathbf {y}

where $\mathbf {P} :=\mathbf {I} -\mathbf {X_{1}} \mathbf {X_{1}} ^{+}$ is a projection matrix. Following the constrained estimation of ${\hat {\boldsymbol {\beta _{2}}}}$ the vector ${\hat {\boldsymbol {\beta _{1}}}}$ is obtained from the expression above.
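The separable approach can be sketched for the simplest mixed problem: an unconstrained intercept ($\mathbf {X_{1}}$ is a column of ones) and a single slope constrained to be non-negative ($\mathbf {X_{2}}$ is the column of x values). In that special case $\mathbf {P}$ subtracts the mean from a vector, and the one-variable non-negative problem is solved by clipping the unconstrained solution at zero. The function name and data are assumptions of this example.

```python
def separable_nonneg_slope(x, y):
    """Fit y = b1 + b2*x with b1 free and b2 >= 0 using the projected problem P X2 b2 = P y."""
    m = len(x)
    xbar = sum(x) / m
    ybar = sum(y) / m
    px = [xi - xbar for xi in x]  # P X2: the x column with its mean removed
    py = [yi - ybar for yi in y]  # P y: the observations with their mean removed
    unconstrained = sum(a * b for a, b in zip(px, py)) / sum(a * a for a in px)
    b2 = max(0.0, unconstrained)  # one-variable non-negative least squares
    # b1 = X1^+ (y - X2 b2), which for a column of ones is a mean
    b1 = sum(yi - b2 * xi for xi, yi in zip(x, y)) / m
    return b1, b2

rising = separable_nonneg_slope([1.0, 2.0, 3.0, 4.0], [6.0, 5.0, 7.0, 10.0])
falling = separable_nonneg_slope([0.0, 1.0, 2.0, 3.0], [3.0, 2.0, 2.0, 1.0])
print(rising, falling)
```

For the rising data the constraint is inactive and the ordinary fit is recovered; for the falling data the slope is held at zero and the intercept becomes the mean of the observations.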

Typical uses and applications

Polynomial fitting: models are polynomials in an independent variable, x:
- Straight line: $f(x,{\boldsymbol {\beta }})=\beta _{1}+\beta _{2}x$ .^[12]
- Quadratic: $f(x,{\boldsymbol {\beta }})=\beta _{1}+\beta _{2}x+\beta _{3}x^{2}$ .
- Cubic, quartic and higher polynomials. For regression with high-order polynomials, the use of orthogonal polynomials is recommended.^[13]
Numerical smoothing and differentiation — this is an application of polynomial fitting.
Multinomials in more than one independent variable, including surface fitting
Curve fitting with B-splines^[4]
Chemometrics, Calibration curve, Standard addition, Gran plot, analysis of mixtures

Uses in data fitting

The primary application of linear least squares is in data fitting. Given a set of m data points $y_{1},y_{2},\dots ,y_{m},$ consisting of experimentally measured values taken at m values $x_{1},x_{2},\dots ,x_{m}$ of an independent variable ( $x_{i}$ may be scalar or vector quantities), and given a model function $y=f(x,{\boldsymbol {\beta }}),$ with ${\boldsymbol {\beta }}=(\beta _{1},\beta _{2},\dots ,\beta _{n}),$ it is desired to find the parameters $\beta _{j}$ such that the model function "best" fits the data. In linear least squares, linearity is meant to be with respect to parameters $\beta _{j},$ so

f(x,{\boldsymbol {\beta }})=\sum _{j=1}^{n}\beta _{j}\phi _{j}(x).

Here, the functions $\phi _{j}$ may be nonlinear with respect to the variable x.

Ideally, the model function fits the data exactly, so

y_{i}=f(x_{i},{\boldsymbol {\beta }})

for all $i=1,2,\dots ,m.$ This is usually not possible in practice, as there are more data points than there are parameters to be determined. The approach chosen then is to find the minimal possible value of the sum of squares of the residuals

r_{i}({\boldsymbol {\beta }})=y_{i}-f(x_{i},{\boldsymbol {\beta }}),\ (i=1,2,\dots ,m)

that is, to minimize the function

S({\boldsymbol {\beta }})=\sum _{i=1}^{m}r_{i}^{2}({\boldsymbol {\beta }}).

After substituting for $r_{i}$ and then for $f$ , this minimization problem becomes the quadratic minimization problem above with

X_{ij}=\phi _{j}(x_{i}),

and the best fit can be found by solving the normal equations.
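The construction $X_{ij}=\phi _{j}(x_{i})$ followed by the normal equations can be sketched as follows. This is an illustrative implementation in plain Python with a small Gaussian elimination routine; the helper names (`design_matrix`, `solve`, `fit`) are hypothetical, and for ill-conditioned problems an orthogonal decomposition method would normally be preferred to forming $X^{\rm {T}}X$.

```python
def design_matrix(xs, basis):
    """X[i][j] = phi_j(x_i) for a list of basis functions phi_j."""
    return [[phi(x) for phi in basis] for x in xs]

def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        z[i] = (aug[i][n] - sum(aug[i][j] * z[j] for j in range(i + 1, n))) / aug[i][i]
    return z

def fit(xs, ys, basis):
    """Least-squares coefficients from the normal equations (X^T X) beta = X^T y."""
    X = design_matrix(xs, basis)
    n = len(basis)
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(n)] for j in range(n)]
    xty = [sum(row[j] * y for row, y in zip(X, ys)) for j in range(n)]
    return solve(xtx, xty)

# Quadratic model: phi_1 = 1, phi_2 = x, phi_3 = x^2 (nonlinear in x, linear in beta)
basis = [lambda x: 1.0, lambda x: x, lambda x: x * x]
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [1.0 - 2.0 * x + 0.5 * x * x for x in xs]  # data lying exactly on a quadratic
beta = fit(xs, ys, basis)
print(beta)
```

Because the illustrative data lie exactly on a quadratic, the fit recovers the generating coefficients up to rounding. Any other set of basis functions, for example trigonometric terms, can be passed in the same way.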

Further discussion

The numerical methods for linear least squares are important because linear regression models are among the most important types of model, both as formal statistical models and for exploration of data-sets. The majority of statistical computer packages contain facilities for regression analysis that make use of linear least squares computations. Hence it is appropriate that considerable effort has been devoted to the task of ensuring that these computations are undertaken efficiently and with due regard to round-off error.

Individual statistical analyses are seldom undertaken in isolation, but rather are part of a sequence of investigatory steps. Some of the topics involved in considering numerical methods for linear least squares relate to this point. Important topics therefore include:

Computations where a number of similar, and often nested, models are considered for the same data-set. That is, where models with the same dependent variable but different sets of independent variables are to be considered, for essentially the same set of data-points.
Computations for analyses that occur in a sequence, as the number of data-points increases.
Special considerations for very extensive data-sets.
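For the second of these cases, one simple possibility is to keep running sums of the quantities that enter the normal equations, so that the fit can be refreshed as each data point arrives without revisiting earlier points. The sketch below does this for a straight-line model; the class `RunningLineFit` is hypothetical, and accumulating raw sums in this way is sensitive to rounding error when the x values are large compared with their spread.

```python
class RunningLineFit:
    """Accumulates the entries of X^T X and X^T y for the model y = b1 + b2*x."""

    def __init__(self):
        self.m = 0
        self.sx = 0.0
        self.sy = 0.0
        self.sxx = 0.0
        self.sxy = 0.0

    def add(self, x, y):
        """Include one more observation (x, y)."""
        self.m += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def coefficients(self):
        """Solve the 2x2 normal equations; requires at least two distinct x values."""
        det = self.m * self.sxx - self.sx ** 2
        b2 = (self.m * self.sxy - self.sx * self.sy) / det
        b1 = (self.sy - b2 * self.sx) / self.m
        return b1, b2

running = RunningLineFit()
for xi, yi in [(1.0, 6.0), (2.0, 5.0), (3.0, 7.0)]:
    running.add(xi, yi)
after_three = running.coefficients()

running.add(4.0, 10.0)  # a fourth observation arrives later
after_four = running.coefficients()
print(after_three, after_four)
```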

Fitting of linear models by least squares often, but not always, arises in the context of statistical analysis. It can therefore be important that considerations of computation efficiency for such problems extend to all of the auxiliary quantities required for such analyses, and are not restricted to the formal solution of the linear least squares problem.

Rounding errors

Matrix calculations, like any other, are affected by rounding errors. An early summary of these effects, regarding the choice of computation methods for matrix inversion, was provided by Wilkinson.^[14]
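A small example of such an effect, offered here as an illustration rather than as part of Wilkinson's analysis, arises for the straight-line model, where the determinant of $X^{\rm {T}}X$ is $m\sum x_{i}^{2}-(\sum x_{i})^{2}$. When the x values are large compared with their spread, forming this quantity from raw sums in double precision loses most of its significant digits, whereas the algebraically equivalent centred form does not. The data below are chosen for the purpose of the demonstration.

```python
# x values with a large common offset and a small spread (illustrative data)
x = [1.0e8 + i for i in range(4)]
m = len(x)

# Raw-sum form: sum(x^2) - (sum x)^2 / m, i.e. the determinant of X^T X divided by m
naive_sxx = sum(v * v for v in x) - sum(x) ** 2 / m

# Centred form: sum((x - mean)^2), algebraically identical to the expression above
xbar = sum(x) / m
centered_sxx = sum((v - xbar) ** 2 for v in x)

print("raw sums:", naive_sxx)
print("centred: ", centered_sxx)
```

The exact value is 5. The centred form reproduces it, while the raw-sum form cannot, because both of its terms are near $4\times 10^{16}$, where adjacent double-precision numbers are 8 apart.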

References

[1] Goldberger, Arthur S. (1964). "Classical Linear Regression". Econometric Theory. New York: John Wiley & Sons. pp. 156–212 [p. 158]. ISBN 0-471-31101-4.
[2] Lawson, C. L.; Hanson, R. J. (1974). Solving Least Squares Problems. Englewood Cliffs, NJ: Prentice-Hall. ISBN 0-13-822585-0.
[3] Margenau, Henry; Murphy, George Moseley (1956). The Mathematics of Physics and Chemistry. Princeton: Van Nostrand.
[4] Gans, Peter (1992). Data fitting in the Chemical Sciences. New York: Wiley. ISBN 0-471-93412-7.
[5] Deming, W. E. (1943). Statistical adjustment of Data. New York: Wiley.
[6] This implies that the observations are uncorrelated. If the observations are correlated, the expression $\textstyle S=\sum _{k}\sum _{j}r_{k}W_{kj}r_{j}\,$ applies. In this case the weight matrix should ideally be equal to the inverse of the variance-covariance matrix of the observations.
[7] Strutz, T. (2016). Data Fitting and Uncertainty (A practical introduction to weighted least squares and beyond). Springer Vieweg. ISBN 978-3-658-11455-8. Chapter 3.
[8] Mandel, John (1964). The Statistical Analysis of Experimental Data. New York: Interscience.
[9] Mardia, K. V.; Kent, J. T.; Bibby, J. M. (1979). Multivariate analysis. New York: Academic Press. ISBN 0-12-471250-9.
[10] Hamilton, W. C. (1964). Statistics in Physical Science. New York: Ronald Press.
[11] Spiegel, Murray R. (1975). Schaum's outline of theory and problems of probability and statistics. New York: McGraw-Hill. ISBN 0-585-26739-1.
[12] Acton, F. S. (1959). Analysis of Straight-Line Data. New York: Wiley.
[13] Guest, P. G. (1961). Numerical Methods of Curve Fitting. Cambridge: Cambridge University Press.
[14] Wilkinson, J. H. (1963). "Chapter 3: Matrix Computations". Rounding Errors in Algebraic Processes. London: Her Majesty's Stationery Office (National Physical Laboratory, Notes in Applied Science, No. 32).


This article is issued from Wikipedia - version of the 11/28/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.
