Squared deviations from the mean

Squared deviations from the mean (SDM) appear in many statistical calculations. In probability theory and statistics, variance is defined as the expected value of the SDM (for a theoretical distribution) or as its average value (for actual experimental data). Computations for analysis of variance involve partitioning a sum of SDM.

Introduction

The computations involved are easier to understand after studying the expected value of the square of a random variable:

\operatorname {E}(X^{2}).

It is well known that for a random variable $X$ with mean $\mu$ and variance $\sigma ^{2}$ :

\sigma ^{2}=\operatorname {E}(X^{2})-\mu ^{2}

^[1]

Therefore

\operatorname {E}(X^{2})=\sigma ^{2}+\mu ^{2}.
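As a quick check of this identity, the sketch below evaluates both sides exactly for a fair six-sided die. The die and the helper name `expectation` are illustrative assumptions and do not come from the text above.

```python
from fractions import Fraction

# Illustrative distribution (an assumption): a fair six-sided die.
outcomes = range(1, 7)
prob = Fraction(1, 6)

def expectation(f):
    """Exact expected value of f(X) for the die above."""
    return sum(prob * f(x) for x in outcomes)

mu = expectation(lambda x: x)                  # mean
ex2 = expectation(lambda x: x * x)             # E(X^2)
sigma2 = expectation(lambda x: (x - mu) ** 2)  # variance as expected squared deviation

# The two printed values are the two sides of E(X^2) = sigma^2 + mu^2.
print(ex2, sigma2 + mu ** 2)
```

Exact fractions are used instead of floating point so that the two sides can be compared for equality without rounding error.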

From the above, the following can be derived for a sample of n independent observations of $X$:

\operatorname {E}\left(\sum \left(X^{2}\right)\right)=n\sigma ^{2}+n\mu ^{2}

\operatorname {E}\left(\left(\sum X\right)^{2}\right)=n\sigma ^{2}+n^{2}\mu ^{2}
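One way to obtain these two results is sketched below. It assumes the n observations, written here as $X_{1},\dots ,X_{n}$ (indexing introduced only for this illustration), are independent with common mean $\mu$ and variance $\sigma ^{2}$. The identity $\operatorname {E}(X^{2})=\sigma ^{2}+\mu ^{2}$ is applied first to each observation and then to their sum:

```latex
\operatorname{E}\left(\sum_{i=1}^{n} X_{i}^{2}\right)
  = \sum_{i=1}^{n} \operatorname{E}(X_{i}^{2})
  = n(\sigma^{2} + \mu^{2})
  = n\sigma^{2} + n\mu^{2}

\operatorname{E}\left(\left(\sum_{i=1}^{n} X_{i}\right)^{2}\right)
  = \operatorname{Var}\left(\sum_{i=1}^{n} X_{i}\right)
    + \left(\operatorname{E}\left(\sum_{i=1}^{n} X_{i}\right)\right)^{2}
  = n\sigma^{2} + (n\mu)^{2}
  = n\sigma^{2} + n^{2}\mu^{2}
```

Independence is what allows the variance of the sum to be written as $n\sigma ^{2}$.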

If $\hat{Y}$ is a vector of n predictions and $Y$ is the vector of the true values, then the sum of squared errors (SSE) of the predictor, written here with a conventional factor of one half, is: $SSE={\frac {1}{2}}\sum _{i=1}^{n}({\hat {Y}}_{i}-Y_{i})^{2}$
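The SSE formula above can be sketched in a few lines. The function name `sse` and the sample vectors are hypothetical, chosen only for illustration; the factor of one half follows the formula as written above.

```python
def sse(predictions, targets):
    """Half the sum of squared differences between predictions and true values."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return 0.5 * sum((p - t) ** 2 for p, t in zip(predictions, targets))

# Hypothetical predictions and true values, for illustration only.
y_hat = [2.5, 0.0, 2.0]
y = [3.0, -0.5, 2.0]
print(sse(y_hat, y))
```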

Sample variance

Main article: Sample variance

The sum of squared deviations needed to calculate sample variance (before deciding whether to divide by n or n − 1) is most easily calculated as

S=\sum x^{2}-{\frac {\left(\sum x\right)^{2}}{n}}

From the two expectations derived above, the expected value of this sum is

\operatorname {E}(S)=n\sigma ^{2}+n\mu ^{2}-{\frac {n\sigma ^{2}+n^{2}\mu ^{2}}{n}}

which implies

\operatorname {E}(S)=(n-1)\sigma ^{2}.

This justifies the use of the divisor n − 1 in the calculation of an unbiased sample estimate of σ²: dividing S by n − 1 gives a quantity whose expected value is σ².
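The result E(S) = (n − 1)σ² can be checked exactly on a small case. The sketch below assumes, purely for illustration, that each observation is an independent roll of a fair six-sided die and that n = 3; it averages S over all equally likely samples. The names `sum_sq_dev` and `expected_s` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Illustrative population (an assumption): a fair six-sided die, rolled n times.
values = range(1, 7)
n = 3

def sum_sq_dev(sample):
    """S = sum(x^2) - (sum(x))^2 / n for one sample of integers."""
    total = sum(sample)
    return sum(x * x for x in sample) - Fraction(total * total, len(sample))

# Mean and variance of a single roll.
mu = Fraction(sum(values), len(values))
sigma2 = sum((x - mu) ** 2 for x in values) / len(values)

# Every ordered sample of size n is equally likely, so E(S) is a plain average.
samples = list(product(values, repeat=n))
expected_s = sum(sum_sq_dev(s) for s in samples) / len(samples)

# The two printed values are E(S) and (n - 1) * sigma^2.
print(expected_s, (n - 1) * sigma2)
```

Enumerating all samples avoids the noise of a random simulation, at the cost of only being practical for small n.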

Partition — analysis of variance

Main article: Partition of sums of squares

When data are available for k different treatment groups, where group i has size n_i for i from 1 to k and n = n_1 + ... + n_k is the total number of observations, it is assumed that the expected mean of each group is

\operatorname {E}(\mu _{i})=\mu +T_{i}

and the variance of each treatment group is unchanged from the population variance $\sigma ^{2}$ .

Under the null hypothesis that the treatments have no effect, each of the $T_{i}$ is zero.

It is now possible to calculate three sums of squares:

Individual

I=\sum x^{2}

\operatorname {E}(I)=n\sigma ^{2}+n\mu ^{2}

Treatments

T=\sum _{{i=1}}^{k}\left(\left(\sum x\right)^{2}/n_{i}\right)

\operatorname {E}(T)=k\sigma ^{2}+\sum _{{i=1}}^{k}n_{i}(\mu +T_{i})^{2}

\operatorname {E}(T)=k\sigma ^{2}+n\mu ^{2}+2\mu \sum _{{i=1}}^{k}(n_{i}T_{i})+\sum _{{i=1}}^{k}n_{i}(T_{i})^{2}

Under the null hypothesis that the treatments cause no differences and all the $T_{i}$ are zero, the expectation simplifies to

\operatorname {E}(T)=k\sigma ^{2}+n\mu ^{2}.

Combination

C=\left(\sum x\right)^{2}/n

\operatorname {E}(C)=\sigma ^{2}+n\mu ^{2}

Sums of squared deviations

Main article: Sum of squared deviations

Under the null hypothesis, the expected difference between any pair of I, T, and C does not depend on $\mu$ , only on $\sigma ^{2}$ .

\operatorname {E}(I-C)=(n-1)\sigma ^{2}

total squared deviations aka total sum of squares

\operatorname {E}(T-C)=(k-1)\sigma ^{2}

treatment squared deviations aka explained sum of squares

\operatorname {E}(I-T)=(n-k)\sigma ^{2}

residual squared deviations aka residual sum of squares

The constants (n − 1), (k − 1), and (n − k) are normally referred to as the number of degrees of freedom.

Example

In a very simple example, 5 observations arise from two treatments. The first treatment gives three values, 1, 2, and 3, and the second treatment gives two values, 4 and 6.

I={\frac {1^{2}}{1}}+{\frac {2^{2}}{1}}+{\frac {3^{2}}{1}}+{\frac {4^{2}}{1}}+{\frac {6^{2}}{1}}=66

T={\frac {(1+2+3)^{2}}{3}}+{\frac {(4+6)^{2}}{2}}=12+50=62

C={\frac {(1+2+3+4+6)^{2}}{5}}=256/5=51.2

Giving

Total squared deviations = 66 − 51.2 = 14.8 with 4 degrees of freedom.

Treatment squared deviations = 62 − 51.2 = 10.8 with 1 degree of freedom.

Residual squared deviations = 66 − 62 = 4 with 3 degrees of freedom.
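The arithmetic in this example can be reproduced with a short function. This is a minimal sketch assuming integer observations grouped by treatment; the function name `partition` is hypothetical.

```python
from fractions import Fraction

def partition(groups):
    """Return (total, treatment, residual) squared deviations for grouped integer data."""
    observations = [x for g in groups for x in g]
    n = len(observations)
    individual = sum(x * x for x in observations)                    # I
    treatments = sum(Fraction(sum(g) ** 2, len(g)) for g in groups)  # T
    combination = Fraction(sum(observations) ** 2, n)                # C
    return (individual - combination,
            treatments - combination,
            individual - treatments)

# The two treatment groups from the example above.
total, treatment, residual = partition([[1, 2, 3], [4, 6]])
print(float(total), float(treatment), float(residual))
```

The three returned values correspond to I − C, T − C, and I − T, so the treatment and residual parts always add up to the total.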

Two-way analysis of variance

Main article: Two-way analysis of variance

The following hypothetical example gives the yields of 15 plants subject to two different environmental variations, and three different fertilisers.

	Extra CO₂	Extra humidity
No fertiliser	7, 2, 1	7, 6
Nitrate	11, 6	10, 7, 3
Phosphate	5, 3, 4	11, 4

Five sums of squares are calculated; the last column gives the coefficient of $\sigma ^{2}$ in the expected value of each sum:

Factor	Calculation	Sum	$\sigma ^{2}$
Individual	$7^{2}+2^{2}+1^{2}+7^{2}+6^{2}+11^{2}+6^{2}+10^{2}+7^{2}+3^{2}+5^{2}+3^{2}+4^{2}+11^{2}+4^{2}$	641	15
Fertiliser × Environment	${\frac {(7+2+1)^{2}}{3}}+{\frac {(7+6)^{2}}{2}}+{\frac {(11+6)^{2}}{2}}+{\frac {(10+7+3)^{2}}{3}}+{\frac {(5+3+4)^{2}}{3}}+{\frac {(11+4)^{2}}{2}}$	556.1667	6
Fertiliser	${\frac {(7+2+1+7+6)^{2}}{5}}+{\frac {(11+6+10+7+3)^{2}}{5}}+{\frac {(5+3+4+11+4)^{2}}{5}}$	525.4	3
Environment	${\frac {(7+2+1+11+6+5+3+4)^{2}}{8}}+{\frac {(7+6+10+7+3+11+4)^{2}}{7}}$	519.2679	2
Composite	${\frac {(7+2+1+11+6+5+3+4+7+6+10+7+3+11+4)^{2}}{15}}$	504.6	1

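Finally, the sums of squared deviations required for the analysis of variance can be calculated. In the table below, each of the five right-hand columns lists the multipliers (1 or −1) applied to the five sums to obtain the corresponding sum of squared deviations; applying the same multipliers to the $\sigma ^{2}$ column gives the degrees of freedom.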

Factor	Sum	$\sigma ^{2}$	Total	Environment	Fertiliser	Fertiliser × Environment	Residual
Individual	641	15	1				1
Fertiliser × Environment	556.1667	6				1	−1
Fertiliser	525.4	3			1	−1
Environment	519.2679	2		1		−1
Composite	504.6	1	−1	−1	−1	1

Squared deviations			136.4	14.668	20.8	16.099	84.833
Degrees of freedom			14	1	2	2	9
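The two tables above can be recomputed with the following sketch. The dictionary layout, the key labels, and the helper name `pooled_sum` are assumptions made for illustration; the yields are copied from the data table.

```python
# Yields keyed by (fertiliser, environment), copied from the data table.
yields = {
    ("none", "co2"): [7, 2, 1],
    ("none", "humidity"): [7, 6],
    ("nitrate", "co2"): [11, 6],
    ("nitrate", "humidity"): [10, 7, 3],
    ("phosphate", "co2"): [5, 3, 4],
    ("phosphate", "humidity"): [11, 4],
}

def pooled_sum(key_index=None):
    """Sum of (group total)^2 / (group size) after pooling cells.

    key_index=0 pools by fertiliser, 1 pools by environment,
    and None keeps every fertiliser-environment cell separate.
    """
    groups = {}
    for key, values in yields.items():
        label = key if key_index is None else key[key_index]
        groups.setdefault(label, []).extend(values)
    return sum(sum(v) ** 2 / len(v) for v in groups.values())

all_values = [x for v in yields.values() for x in v]
individual = sum(x * x for x in all_values)
cells = pooled_sum()
fertiliser = pooled_sum(0)
environment = pooled_sum(1)
composite = sum(all_values) ** 2 / len(all_values)

# Combine the five sums using the multipliers from the second table.
deviations = {
    "total": individual - composite,
    "environment": environment - composite,
    "fertiliser": fertiliser - composite,
    "interaction": cells - fertiliser - environment + composite,
    "residual": individual - cells,
}
for name, value in deviations.items():
    print(f"{name}: {value:.3f}")
```

Floating-point division is used here, so the printed values are rounded to three decimal places for comparison with the table.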

References

[1] Mood & Graybill: An Introduction to the Theory of Statistics (McGraw-Hill)

This article is issued from Wikipedia - version of the 2/26/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.