Suppose we are given the value of the random vector $Y$, from which we wish to estimate the value of another random vector $X$. In other words, given $Y$ we want to find a function $f$ such that $f(Y)$ is a good approximation of $X$. How might we go about this? If “good approximation” means minimising the mean squared error $E\|X - f(Y)\|^2$ (the MMSE criterion), and the random vectors $X$ and $Y$ have finite second order moments (i.e. $E\|X\|^2$ and $E\|Y\|^2$ are finite), then the language of Hilbert spaces helps us greatly.
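Let the dimensions of $X$ and $Y$ be $m$ and $n$ respectively (they need not be equal). The convenience of the metric
$$E\|X - f(Y)\|^2 = \sum_{i=1}^m E|X_i - f_i(Y)|^2$$
is that we may minimise it by finding the component functions $f_i$ that minimise each term $E|X_i - f_i(Y)|^2$ of the sum separately, and then our required $f$ is $f = (f_1, \ldots, f_m)^T$.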
Imagine a component $X_i$ (now a random variable as opposed to a random vector) living in the space of random variables with finite second order moment (this space is denoted by $L^2(\Omega, \mathcal{F}, P)$, where $\mathcal{F}$ is a collection of events whose probability can be measured). The random variables $g(Y)$, where $g$ is a measurable function from $\mathbb{R}^n$ to $\mathbb{R}$ and $g(Y)$ has finite second moment, form a subspace of $L^2(\Omega, \mathcal{F}, P)$ which we denote by $\mathcal{M}(Y)$. (By measurable we mean that we can still compute the probability of events based on values of the function.) If $X_i$ is in this subspace, it means that $X_i$ can be entirely determined from knowledge of $Y$, and an error of zero is possible. If $X_i$ is not in the subspace, we imagine $X_i$ as an arrow from the origin $0$ sticking out from the subspace.
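With inner product on $L^2(\Omega, \mathcal{F}, P)$ given by $\langle U, V \rangle = E[UV]$ (noting that $\langle U, U \rangle = 0$ if and only if $U = 0$ with probability one), we see that $E|X_i - f_i(Y)|^2 = \|X_i - f_i(Y)\|^2$ is minimised when the error $X_i - f_i(Y)$ is orthogonal to any measurable function of $Y$. In our Hilbert space we think of $f_i(Y)$ as being the projection of $X_i$ onto the subspace $\mathcal{M}(Y)$:
$$E[(X_i - f_i(Y))g(Y)] = 0 \quad \text{for any random variable } g(Y) \in \mathcal{M}(Y). \quad \quad (*)$$
Note that if this were not true, but instead there existed $g(Y) \in \mathcal{M}(Y)$ with $\|g(Y)\| = 1$ (i.e. normalised) and $\langle X_i - f_i(Y), g(Y) \rangle = \alpha \neq 0$, we may write
$$\|X_i - f_i(Y) - \alpha g(Y)\|^2 = \|X_i - f_i(Y)\|^2 - 2\alpha \langle X_i - f_i(Y), g(Y) \rangle + \alpha^2 \|g(Y)\|^2 = \|X_i - f_i(Y)\|^2 - \alpha^2,$$
and so, since $f_i(Y) + \alpha g(Y)$ is also in $\mathcal{M}(Y)$,
$$\|X_i - (f_i(Y) + \alpha g(Y))\|^2 < \|X_i - f_i(Y)\|^2,$$
contradicting the minimality of $\|X_i - f_i(Y)\|^2$. Hence the orthogonality condition (*) must hold. Such a projection is unique in the almost-sure sense (i.e. any other random variable satisfying (*) is equal to $f_i(Y)$ with probability one).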
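By (*), $X_i - f_i(Y)$ is also orthogonal to any constant random variable, or in other words, $E[(X_i - f_i(Y))c] = 0$ for any constant $c$, showing that the error $X_i - f_i(Y)$ has zero mean (worth remembering: zero-mean random variables are orthogonal to constants!). In vector notation, with $f = (f_1, \ldots, f_m)^T$,
$$E[X - f(Y)] = 0, \quad \text{or} \quad E f(Y) = EX. \quad \quad (1)$$
For each $j$, the component $g_j(Y)$ of a vector $g(Y) = (g_1(Y), \ldots, g_k(Y))^T$ of square-integrable measurable functions of $Y$ is in $\mathcal{M}(Y)$, so by (*), $E[(X_i - f_i(Y))g_j(Y)] = 0$ for every $i$. Switching to vector notation, this gives us
$$E[(X - f(Y))g(Y)^*] = 0, \quad \quad (2)$$
where the right-hand side is the $m$ by $k$ zero matrix (we shall use the notation $v^*$ to mean the transpose $v^T$).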
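This is the orthogonality condition in our vector case. Taking $g = f$ in (2) shows that the error is orthogonal to the estimate itself, so by Pythagoras’s theorem, the minimum error is
$$E\|X - f(Y)\|^2 = E\|X\|^2 - E\|f(Y)\|^2 = \mathrm{tr}\left(E[XX^*] - E[f(Y)f(Y)^*]\right), \quad \quad (3)$$
where $\mathrm{tr}$ represents the sum of the diagonal entries of a square matrix (the trace).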
Linear Case
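If we take the example of a linear estimator of the form $f(Y) = AY + b$ ($A$ a matrix, $b$ a vector), we may use (1) and (2) to identify the $A$ and $b$ that minimise $E\|X - AY - b\|^2$. By (1), $A\,EY + b = EX$, from which $b = EX - A\,EY$ and $f(Y) = A(Y - EY) + EX$.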
Note that any component of $Y$ is in itself a linear transform of $Y$. By (*), it must be orthogonal to any component of the error vector $f(Y) - X$, giving us
$$0 = E[(f(Y)-X)Y^*] = E[(A(Y-EY) + EX - X)Y^*] = A\Sigma_Y - \Sigma_{XY},$$
where $\Sigma_Y = E[(Y-EY)(Y-EY)^*]$ is the covariance matrix of $Y$ and $\Sigma_{XY} = E[(X-EX)(Y-EY)^*]$ is the cross-covariance of $X$ and $Y$.
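If $\Sigma_Y$ is invertible, this gives $A = \Sigma_{XY}\Sigma_Y^{-1}$, leading to the following expressions:
$$f(Y) = EX + \Sigma_{XY}\Sigma_Y^{-1}(Y - EY),$$
$$\mathrm{Cov}(X - f(Y)) = \Sigma_X - \Sigma_{XY}\Sigma_Y^{-1}\Sigma_{YX} \quad \quad (4)$$
(here using the fact that $\Sigma_{XY}^* = \Sigma_{YX}$),
$$E\|X - f(Y)\|^2 = \mathrm{tr}\left(\Sigma_X - \Sigma_{XY}\Sigma_Y^{-1}\Sigma_{YX}\right).$$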
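The linear-case formulas can be checked numerically. The following is a minimal sketch, assuming NumPy and replacing the true moments with sample moments; the helper name `lmmse_fit` and the synthetic data are illustrative choices, not part of the derivation. With sample moments, the zero-mean condition (1), the orthogonality of the error to $Y$, and the error covariance (4) all hold up to floating-point error, even though the noise here is deliberately non-Gaussian.

```python
import numpy as np

def lmmse_fit(X, Y):
    """Fit f(y) = A y + b from samples (rows are samples).

    X has shape (N, m), Y has shape (N, n). Sample moments stand in
    for the true moments, which is an assumption of this sketch.
    """
    N = X.shape[0]
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    S_xy = Xc.T @ Yc / N          # sample cross-covariance, (m, n)
    S_y = Yc.T @ Yc / N           # sample covariance of Y, (n, n)
    # A = S_xy S_y^{-1}; S_y is symmetric, so solve S_y A^T = S_xy^T.
    A = np.linalg.solve(S_y, S_xy.T).T
    b = mx - A @ my               # from condition (1)
    return A, b

rng = np.random.default_rng(0)
N, m, n = 5000, 2, 3
Y = rng.normal(size=(N, n))
# X depends linearly on Y plus non-Gaussian (cubed normal) noise.
X = Y @ rng.normal(size=(n, m)) + 0.5 * rng.normal(size=(N, m)) ** 3

A, b = lmmse_fit(X, Y)
err = X - (Y @ A.T + b)

mean_err = err.mean(axis=0)       # condition (1): should vanish
cross = err.T @ Y / N             # orthogonality of the error to Y

# Error covariance versus formula (4), both from sample moments.
Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
S_x = Xc.T @ Xc / N
S_yx = Yc.T @ Xc / N
err_cov = (err - mean_err).T @ (err - mean_err) / N
formula_4 = S_x - A @ S_yx
```

Solving the linear system rather than forming $\Sigma_Y^{-1}$ explicitly is a numerical convenience only; it computes the same $A$.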
Note that this precise argument is made in least squares problems: if we wish to find a vector $\hat{b}$ so that $\|b - \hat{b}\|^2$ is minimised and $\hat{b}$ is only permitted in the space spanned by the columns of some matrix $A$, we find the projection of $b$ onto the column space of $A$. This gives the orthogonality condition $A^*(b - Ax) = 0$ (for some $x$, since $\hat{b} = Ax$ is in the column space of $A$), from which
$$\hat{b} = Ax = A(A^*A)^{-1}A^*b,$$
provided $A^*A$ is invertible. In the special case when $A$ is a single column vector $a$, the projection operator has the attractive form $aa^*/(a^*a)$, which is simply the outer product $aa^*$ if $a$ has unit length.
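As a small illustration of the least squares projection, here is a sketch assuming NumPy and a matrix with linearly independent columns. The function name `project_onto_columns` and the numbers are made up for the example. It solves the normal equations and confirms that the residual is orthogonal to every column, and that for a single unit-length column the projector reduces to an outer product.

```python
import numpy as np

def project_onto_columns(A, b):
    """Project b onto the column space of A (full column rank assumed)."""
    # Normal equations A^* A x = A^* b, i.e. the orthogonality condition.
    x = np.linalg.solve(A.T @ A, A.T @ b)
    return A @ x

A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

b_hat = project_onto_columns(A, b)
residual = b - b_hat                      # orthogonal to the columns of A

# Single-column case: projector a a^* / (a^* a).
a = np.array([3.0, 4.0])
P = np.outer(a, a) / (a @ a)
u = a / np.linalg.norm(a)                 # unit-length version of a
P_unit = np.outer(u, u)                   # plain outer product
```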
Conditional Expectation
The post so far has made no mention of the conditional expectation $E[X|Y]$, which is in fact what the minimum mean squared error estimator turns out to be. In brief, $E[X|Y]$ is defined as a measurable function of $Y$ satisfying
$$E[g(Y)E[X|Y]] = E[g(Y)X] \quad \quad (+)$$
for any (bounded) measurable function $g$. This is a measure-theoretic definition that has the advantage of unifying the discrete and continuous cases, and avoids the division-by-zero possibilities that can arise when dealing with probability densities. (Actually, conditional expectation is first defined with respect to a set of events called a sigma algebra, which may be considered as an information source for that random variable. The larger the set, the more information we have about that random variable.)
If expectation is thought of as an averaging process, then conditional expectation is an average with respect to information or uncertainty. Formula (+) says that this average should be equal to the unconditioned average on any measurable set determined by $Y$ (to see this take $g$ to be $1$ on that set and $0$ otherwise). There are details missing here, but the interested reader is encouraged to see the references for more.
In the Hilbert space of random variables with finite second moment, this condition is equivalent to $X - E[X|Y]$ being orthogonal to the subspace $\mathcal{M}(Y)$ of measurable functions of $Y$, so $E[X|Y]$ is our best estimate $f(Y)$ found earlier. Hence the conditional expectation $E[X|Y]$ can be viewed as the vector projection of $X$ onto $\mathcal{M}(Y)$, i.e. the best estimate of $X$ in the expected least squares sense. Conditional probability may be defined in terms of conditional expectation via $P(A|Y) = E[1_A|Y]$, where $1_A$ is the random variable equal to $1$ if our event $A$ occurs, and $0$ otherwise.
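For a finite discrete pair the defining property (+) can be verified directly. The sketch below assumes NumPy and a made-up joint probability table `p`; the function `g_vals` is an arbitrary choice. It computes $E[X|Y=y]$ by the elementary formula, checks (+) for one function $g$, and compares the mean squared error of the conditional mean with that of the constant estimate $EX$, which is also a function of $Y$.

```python
import numpy as np

# Hypothetical joint pmf: p[i, j] = P(X = xs[i], Y = ys[j]).
xs = np.array([0.0, 1.0, 2.0])
ys = np.array([-1.0, 1.0])
p = np.array([[0.10, 0.20],
              [0.30, 0.10],
              [0.05, 0.25]])

p_y = p.sum(axis=0)                               # marginal pmf of Y
cond_mean = (xs[:, None] * p).sum(axis=0) / p_y   # E[X | Y = ys[j]]

def expect(table):
    """E[h(X, Y)] where table[i, j] = h(xs[i], ys[j])."""
    return float((np.broadcast_to(table, p.shape) * p).sum())

# An arbitrary function g of Y, tabulated at the values of Y.
g_vals = ys ** 2 + 3 * ys

lhs = expect(g_vals[None, :] * cond_mean[None, :])   # E[g(Y) E[X|Y]]
rhs = expect(g_vals[None, :] * xs[:, None])          # E[g(Y) X]

# Mean squared error of two estimators that depend only on Y.
mse_cond = expect((xs[:, None] - cond_mean[None, :]) ** 2)
mse_const = expect((xs[:, None] - expect(xs[:, None])) ** 2)
```

Any other table of values indexed by $y$ could replace the constant estimate in the last comparison; by the projection argument its mean squared error cannot fall below `mse_cond`.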
Gaussian Case
In the particular case of $X$ and $Y$ being jointly Gaussian vectors, it turns out that the best linear estimate is also the best overall estimate in the least squares sense. To see this, let $f(Y)$ be the best linear estimate of $X$ given $Y$. The error $X - f(Y)$ is a linear function of $(X, Y)$, hence jointly Gaussian with $Y$, and it is zero-mean and uncorrelated with $Y$ by the orthogonality conditions above. Since uncorrelated and independent are equivalent notions in the Gaussian world, it follows from the independence of $X - f(Y)$ and $Y$ that given $Y = y$ the conditional distribution of $X - f(Y)$ does not depend on $y$. (It is in fact zero-mean Gaussian with covariance given in (4).) Hence the distribution of $X$ given $Y = y$ is Gaussian with mean $f(y)$ (the linear estimate of $X$) and the same covariance matrix. As a result the conditional expectation $E[X|Y=y]$ is equal to the mean of that Gaussian distribution, which is $f(y)$, and as this is true for all $y$, $E[X|Y] = f(Y)$.
For example, if $Y = HX + V$, where $X$ is zero-mean Gaussian, $H$ is a deterministic $n$ by $m$ matrix and $V$ is a zero-mean $n$-dimensional Gaussian noise vector independent of (and hence uncorrelated with) $X$, we have
$$\Sigma_{XY} = E[X(HX + V)^*] = \Sigma_X H^*,$$
$$\Sigma_Y = E[(HX + V)(HX + V)^*] = H\Sigma_X H^* + \Sigma_V,$$
and as $X$ and $Y$ are jointly Gaussian, our MMSE estimator is also the best linear estimator:
$$E[X|Y] = f(Y) = \Sigma_{XY}\Sigma_Y^{-1}Y = \Sigma_X H^*(H\Sigma_X H^* + \Sigma_V)^{-1} Y.$$
This solution is used in a wide variety of linear estimation applications, ranging from regression analysis in statistics to communication theory (estimating signals passing through a channel $H$ and corrupted by noise $V$).
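A Monte Carlo sketch of the channel example follows, assuming NumPy. The particular $H$, $\Sigma_X$ and $\Sigma_V$ are hypothetical values chosen for illustration, and the checks are statistical rather than exact. It forms the gain $\Sigma_X H^*(H\Sigma_X H^* + \Sigma_V)^{-1}$, applies it to simulated observations, and compares the empirical error covariance with $\Sigma_X - \Sigma_{XY}\Sigma_Y^{-1}\Sigma_{YX}$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, N = 2, 3, 100_000

# Hypothetical channel and covariances (assumptions for this sketch).
H = np.array([[1.0, 0.0],
              [0.5, 1.0],
              [0.0, 2.0]])                 # n by m
Sigma_X = np.array([[2.0, 0.3],
                    [0.3, 1.0]])
Sigma_V = 0.5 * np.eye(n)

# Gain G = Sigma_X H^* (H Sigma_X H^* + Sigma_V)^{-1}.
Sigma_Y = H @ Sigma_X @ H.T + Sigma_V
# Sigma_Y is symmetric, so transposing Sigma_Y^{-1} H Sigma_X gives G.
G = np.linalg.solve(Sigma_Y, H @ Sigma_X).T

# Simulate Y = H X + V with independent zero-mean Gaussian X and V.
X = rng.multivariate_normal(np.zeros(m), Sigma_X, size=N)
V = rng.multivariate_normal(np.zeros(n), Sigma_V, size=N)
Y = X @ H.T + V

X_hat = Y @ G.T                            # estimate of E[X | Y] per sample
err = X - X_hat

emp_err_cov = err.T @ err / N              # empirical error covariance
theory_err_cov = Sigma_X - G @ H @ Sigma_X # Sigma_YX = H Sigma_X
cross = err.T @ Y / N                      # should be close to zero
```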
References
[1] Williams, Probability with Martingales, Cambridge University Press, 2001.
[2] Hajek, Notes for ECE 534: An Exploration of Random Processes for Engineers, July 2011.