Suppose we are given the value of a random vector $X$ from which we wish to estimate the value of another random vector $Y$. In other words, given $X$ we want to find a function $g$ such that $g(X)$ is a good approximation of $Y$. How might we go about this? If our definition of “good approximation” is the MMSE (minimum mean squared error) and the random vectors $X$ and $Y$ have finite second order moments (i.e. $E[\|X\|^2]$ and $E[\|Y\|^2]$ are finite), then the language of Hilbert spaces helps us greatly.

Let the dimensions of $X$ and $Y$ be $m$ and $n$ respectively (they need not be equal). The convenience of the metric $E\|Y - g(X)\|^2 = \sum_{i=1}^n E[(Y_i - g_i(X))^2]$ is that we may minimise it by finding the component functions $g_1, \ldots, g_n$ that minimise each term of the sum separately, and then our required $g$ is $(g_1, \ldots, g_n)^T$.

Imagine a component $Y_i$ (now a random variable as opposed to a random vector) living in the space of random variables with finite second order moment (this space is denoted by $L^2(\Omega, \mathcal{F}, P)$, where $\mathcal{F}$ is a collection of events whose probability can be measured). The measurable functions of $X$ from $\mathbb{R}^m$ to $\mathbb{R}$ with finite second moment form a closed subspace of $L^2(\Omega, \mathcal{F}, P)$ which we denote by $V$. (By measurable we mean that we can still compute the probability of events based on values of the function.) If $Y_i$ is in this subspace, it means that $Y_i$ can be entirely determined from knowledge of $X$, and an error of zero is possible. If $Y_i$ is not in the subspace, we imagine $Y_i$ as an arrow sticking out from the subspace.

With inner product on $L^2(\Omega, \mathcal{F}, P)$ given by $\langle U, W \rangle = E[UW]$ (noting that $\|U\| = 0$ if and only if $U = 0$ with probability one), we see that $E[(Y_i - g_i(X))^2]$ is minimised when the error $Y_i - g_i(X)$ is orthogonal to any measurable function of $X$. In our Hilbert space we think of $g_i(X)$ as being the projection of $Y_i$ onto the subspace $V$:

$$E[(Y_i - g_i(X))\,h(X)] = 0 \quad \text{for any random variable } h(X) \in V. \qquad (*)$$

Note that if this were not true, but instead there existed $h(X) \in V$ with $\|h(X)\| = 1$ (i.e. normalised) and $\langle Y_i - g_i(X), h(X) \rangle = c \neq 0$, we may write $Y_i - g_i(X) = c\,h(X) + (Y_i - g_i(X) - c\,h(X))$ and so

$$\|Y_i - (g_i(X) + c\,h(X))\|^2 = \|Y_i - g_i(X)\|^2 - c^2 < \|Y_i - g_i(X)\|^2,$$

contradicting the minimality of $g_i(X)$. Hence the orthogonality condition $(*)$ must hold. Such a projection is unique in the almost-sure sense (i.e. any other minimising estimate is equal to $g_i(X)$ with probability one).

By $(*)$, $Y_i - g_i(X)$ is also orthogonal to any constant random variable, or in other words, $E[(Y_i - g_i(X))\,c] = 0$ for any constant $c$, showing that the error has zero mean (worth remembering: zero-mean random variables are orthogonal to constants!). In vector notation, with $e := Y - g(X)$,

$$E[e] = 0,$$

or

$$E[g(X)] = E[Y]. \qquad (1)$$

For each $j$, $X_j$ is in $V$, so by $(*)$, $E[(Y_i - g_i(X))X_j] = 0$ for each $i$. Switching to vector notation, this gives us

$$\mathrm{Cov}(Y - g(X), X) = 0, \qquad (2)$$

where $\mathrm{Cov}(U, W) := E[(U - E[U])(W - E[W])^T]$ (we shall use the notation $\mathrm{Cov}(U)$ to mean $\mathrm{Cov}(U, U)$).

This is the orthogonality condition in our vector case. By Pythagoras’s theorem, the minimum error is

$$E\|Y - g(X)\|^2 = \mathrm{tr}\left(\mathrm{Cov}(Y) - \mathrm{Cov}(g(X))\right),$$

where $\mathrm{tr}$ represents the sum of the diagonal entries of a square matrix (the trace).
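As a quick sanity check, here is a small Monte Carlo sketch in numpy. The model $Y = X^2 + W$ (with $W$ independent zero-mean noise) is an assumption chosen for illustration, because it makes the minimiser known exactly: for any $h$, $E[(Y - h(X))^2] = E[(X^2 - h(X))^2] + E[W^2]$, so $g(X) = X^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Toy model where the best estimate is known exactly: Y = X^2 + W.
x = rng.standard_normal(n)
w = rng.standard_normal(n)
y = x**2 + w
g = x**2                        # the minimising g(X)

err = y - g

# (*): the error is orthogonal to arbitrary measurable functions of X.
for h in (x, x**2, np.sin(x)):
    print(f"E[err * h(X)] ~ {np.mean(err * h):+.4f}")   # all near zero

# The minimum error equals Var(Y) - Var(g(X)) (the trace identity, scalar case).
print(np.mean(err**2), y.var() - g.var())               # nearly equal
```

Both printed quantities are near $1$ here, since $\mathrm{Var}(Y) = \mathrm{Var}(X^2) + \mathrm{Var}(W) = 2 + 1$ and $\mathrm{Var}(g(X)) = 2$.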

#### Linear Case

If we take the example of a linear estimator of the form $g(X) = AX + b$ ($A$ a matrix, $b$ a vector), we may use (1) and (2) to identify the $A$ and $b$ that minimise $E\|Y - AX - b\|^2$. By (1), $E[AX + b] = E[Y]$, from which $b = E[Y] - AE[X]$ and $g(X) = A(X - E[X]) + E[Y]$.

Note that any component of $AX + b$ is itself a linear transform of $X$ (plus a constant) and so lies in $V$. By $(*)$, it must be orthogonal to any component of the error vector $Y - AX - b$, giving us

$$\mathrm{Cov}(Y - AX - b, X) = \mathrm{Cov}(Y, X) - A\,\mathrm{Cov}(X) = 0.$$

If $\mathrm{Cov}(X)$ is invertible, this gives $A = \mathrm{Cov}(Y, X)\,\mathrm{Cov}(X)^{-1}$, leading to the following expressions:

$$g(X) = E[Y] + \mathrm{Cov}(Y, X)\,\mathrm{Cov}(X)^{-1}(X - E[X]), \qquad (3)$$

$$\mathrm{Cov}(Y - g(X)) = \mathrm{Cov}(Y) - \mathrm{Cov}(Y, X)\,\mathrm{Cov}(X)^{-1}\,\mathrm{Cov}(X, Y) \qquad (4)$$

(here using the fact that $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)^T$).
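These identities are easy to check numerically. The sketch below uses numpy with a made-up, non-Gaussian joint distribution (the exponential samples and the mixing matrix are arbitrary assumptions): since (3) and (4) involve only first and second moments, they hold exactly when the same empirical covariances are used throughout.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Arbitrary non-Gaussian data: only first and second moments matter here.
x = rng.exponential(size=(3, n))                    # X is 3-dimensional
y = np.array([[1.0, 2.0, 0.0],
              [0.5, -1.0, 3.0]]) @ x + rng.standard_normal((2, n))

# Empirical joint covariance (rows = components, columns = samples).
c = np.cov(np.vstack([y, x]))
c_y, c_yx, c_x = c[:2, :2], c[:2, 2:], c[2:, 2:]

a = c_yx @ np.linalg.inv(c_x)                       # A = Cov(Y,X) Cov(X)^{-1}
b = y.mean(axis=1) - a @ x.mean(axis=1)             # b = E[Y] - A E[X]
e = y - (a @ x + b[:, None])                        # error vector

# (2): Cov(Y - g(X), X) = 0, and (4): the error covariance formula.
print(np.cov(np.vstack([e, x]))[:2, 2:])            # zero matrix
print(np.allclose(np.cov(e), c_y - a @ c_yx.T))     # True
```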

Note that this precise argument is made in least squares problems: if we wish to find a vector $\hat{y}$ so that $\|y - \hat{y}\|^2$ is minimised and $\hat{y}$ is only permitted to lie in the space spanned by the columns of some matrix $A$, we find the projection of $y$ onto the subspace formed by the columns of $A$. This gives the orthogonality condition $A^T(y - Ax) = 0$ (writing $\hat{y} = Ax$ for some $x$, since $\hat{y}$ is in the column space of $A$), from which

$$\hat{y} = A(A^TA)^{-1}A^T y.$$

In the special case when $A$ is a single column vector $a$, the projection operator has the attractive form $\dfrac{aa^T}{a^Ta}$, which is simply the outer product $aa^T$ if $a$ has unit length.
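A short numerical sketch of this projection (numpy; the particular $A$ and $y$ below are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 2))     # columns span the permitted subspace
y = rng.standard_normal(5)

# Projection onto the column space of A: P = A (A^T A)^{-1} A^T.
P = A @ np.linalg.inv(A.T @ A) @ A.T
y_hat = P @ y

print(A.T @ (y - y_hat))                       # orthogonality condition: ~ [0, 0]

# Agreement with the standard least squares solver.
x, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(y_hat, A @ x))               # True

# Single-column case: P = a a^T / (a^T a), idempotent as any projection must be.
a = A[:, 0]
P1 = np.outer(a, a) / (a @ a)
print(np.allclose(P1 @ P1, P1))                # True
```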

#### Conditional Expectation

The post so far has made no mention of the conditional expectation $E[Y|X]$, which is what the minimum mean squared error estimator in fact turns out to be. In brief, $E[Y|X]$ is defined as a measurable function of $X$ satisfying

$$E\left[E[Y|X]\,h(X)\right] = E[Y\,h(X)] \qquad (+)$$

for any bounded measurable function $h$. This is a measure-theoretic definition that has the advantage of unifying the discrete and continuous cases, and avoids division-by-zero possibilities that can arise when dealing with probability densities. (Actually conditional expectation is first defined with respect to a set of events called a sigma algebra, which may be considered as an information source for that random variable. The larger the set, the more information we have about that random variable.)

If expectation is thought of as an averaging process, then conditional expectation is an average with respect to information or uncertainty. Formula $(+)$ says that this average should be equal to the unconditioned average on any measurable set (to see this take $h$ to be $1$ on that set and $0$ otherwise). There are details missing here, but the interested reader is encouraged to see the references for more.

In the Hilbert space of random variables with finite second moment, this condition is equivalent to $Y - E[Y|X]$ being orthogonal to $V$, so $E[Y|X]$ is our best estimate $g(X)$ found earlier. Hence the conditional expectation can be viewed as the projection of $Y$ onto $V$, i.e. the best estimate of $Y$ in the expected least squares sense. Conditional probability may be defined in terms of conditional expectation via $P(A|X) = E[1_A|X]$, where $1_A$ is the random variable equal to $1$ if our event $A$ occurs, and $0$ otherwise.
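The defining property $(+)$ can be verified exactly in a finite example. The toy setup below (an assumption for illustration, not from the post) takes two fair dice, with $X$ the first roll and $Y$ the total, so that $E[Y|X] = X + 3.5$:

```python
import itertools
import numpy as np

# Two fair dice: X is the first roll, Y the total. E[Y|X] = X + 3.5.
outcomes = list(itertools.product(range(1, 7), repeat=2))
x = np.array([a for a, b in outcomes], dtype=float)
y = np.array([a + b for a, b in outcomes], dtype=float)
cond = x + 3.5                                   # E[Y|X], a function of X

# (+): E[E[Y|X] h(X)] = E[Y h(X)], exact by enumeration over all 36 outcomes.
for h in (x, x**2, (x == 3).astype(float)):
    print(np.mean(cond * h), np.mean(y * h))     # equal pairs
```

Taking $h = 1_{\{X = 3\}}$ is exactly the "average over a measurable set" reading of $(+)$ mentioned above.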

#### Gaussian Case

In the particular case of $X$ and $Y$ being jointly Gaussian vectors, it turns out that the best linear estimate is also the best overall estimate in the least squares sense. To see this, let $\hat{Y} = g(X)$ be the best linear estimate (3) of $Y$ given $X$. Since uncorrelated and independent are equivalent notions in the Gaussian world, it follows from (2) and the independence of $Y - \hat{Y}$ and $X$ that given $X = x$ the conditional distribution of $Y - \hat{Y}$ does not depend on $x$. It is in fact zero-mean Gaussian with covariance given in (4). Hence the distribution of $Y$ given $X = x$ is Gaussian with mean $g(x)$ (the linear estimate of $Y$) and the same covariance matrix. As a result the conditional expectation $E[Y|X = x]$ is equal to the mean of that Gaussian distribution, which is $g(x)$, and as this is true for all $x$, $E[Y|X] = g(X)$.

For example, if $Y = HX + N$, where $X$ is a zero-mean Gaussian vector, $H$ is a deterministic $n$ by $m$ matrix and $N$ is a zero-mean $n$-dimensional Gaussian noise vector uncorrelated with $X$, we have

$$\mathrm{Cov}(X, Y) = \mathrm{Cov}(X)H^T, \qquad \mathrm{Cov}(Y) = H\,\mathrm{Cov}(X)H^T + \mathrm{Cov}(N),$$

and as $X$ and $Y$ are jointly Gaussian, our MMSE estimator of $X$ given $Y$ is also the best linear estimator:

$$E[X|Y] = \mathrm{Cov}(X)H^T\left(H\,\mathrm{Cov}(X)H^T + \mathrm{Cov}(N)\right)^{-1}Y.$$

This solution is used in a wide variety of linear estimation applications, ranging from regression analysis in statistics to communication theory (estimating signals $X$ passing through a channel $H$ and corrupted by noise $N$).
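To illustrate, here is a simulation sketch in numpy (the particular channel $H$, $\mathrm{Cov}(X)$ and $\mathrm{Cov}(N)$ are made-up values) comparing the MMSE estimator against a plain least squares inversion of the channel, which ignores the prior statistics of $X$:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, trials = 2, 4, 200_000

H = rng.standard_normal((n, m))            # known channel matrix
cx = np.array([[2.0, 0.5], [0.5, 1.0]])    # Cov(X)
cn = 2.0 * np.eye(n)                       # Cov(N)

# Sample the model Y = H X + N with zero-mean Gaussian X and N.
x = rng.multivariate_normal(np.zeros(m), cx, size=trials).T
noise = rng.multivariate_normal(np.zeros(n), cn, size=trials).T
y = H @ x + noise

# MMSE (= best linear) estimator: Cov(X) H^T (H Cov(X) H^T + Cov(N))^{-1} Y.
K = cx @ H.T @ np.linalg.inv(H @ cx @ H.T + cn)
x_mmse = K @ y
x_ls = np.linalg.pinv(H) @ y               # least squares inversion, for comparison

mse = lambda e: np.mean(np.sum(e**2, axis=0))
print(mse(x - x_mmse), mse(x - x_ls))      # MMSE error is the smaller of the two
```

The empirical MMSE error also matches the trace of the theoretical error covariance $\mathrm{Cov}(X) - K H \,\mathrm{Cov}(X)$, the analogue of (4) with the roles of $X$ and $Y$ swapped.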

#### References

[1] Williams, *Probability with Martingales*, Cambridge University Press, 2001.

[2] Hajek, *Notes for ECE 534: An Exploration of Random Processes for Engineers*, July 2011, available here.
