# Chaitanya's Random Pages

## December 5, 2011

### The MMSE estimator is orthogonal projection

Filed under: mathematics — ckrao @ 9:36 pm

Suppose we are given the value of the random vector $Y$ from which we wish to estimate the value of another random vector $X$. In other words, given $Y$ we want to find a function $f$ such that $f(Y)$ is a good approximation of $X$. How might we go about this? If our definition of “good approximation” is the MMSE (minimum mean squared error) $E \|f(Y)-X\|^2$ and the random vectors $X$ and $Y$ have finite second order moments (i.e. $E \|X\|^2 = E \sum_i |X_i|^2$ is finite), then the language of Hilbert spaces helps us greatly.

Let the dimensions of $X$ and $Y$ be $m$ and $n$ respectively (they need not be equal). The convenience of the metric $\|f(Y) - X\|^2$ is that we may minimise it by finding the component functions  $f_1(Y), \ldots, f_m(Y)$ that minimise each term $E|f_i(Y) - X_i|^2$ of the sum separately, and then our required $f$ is $f(Y) = [f_1(Y), \ldots, f_m(Y)]^T$.

Imagine a component $X_i$ (now a random variable as opposed to a random vector) living in the space of random variables with finite second order moment (this space is denoted by $L^2(\cal{F})$ where $\cal{F}$ is a collection of events whose probability can be measured). The space of measurable functions of $Y$ from $\mathbb{C}^n$ to $\mathbb{C}$ forms a subspace of $L^2(\cal{F})$ which we denote by $L^2(\sigma(Y))$. (By measurable we mean that we can still compute the probability of events based on values of the function.) If $X$ is in this subspace, it means that $X_i$ can be entirely determined from knowledge of $Y$, and an error of zero is possible. If $X$ is not in the subspace, we imagine $X_i$ as an arrow from $0$ sticking out from the subspace.

With inner product on $L^2(\cal{F})$ given by $\langle U,V \rangle = E UV^*$ (noting that $\langle U,U \rangle = 0$ if and only if $U = 0$ with probability one), we see that $E|f_i(Y) - X_i|^2$ is minimised when the error $e_i = f_i(Y) - X_i$ is orthogonal to any measurable function of $Y$. In our Hilbert space we think of $f_i(Y)$ being the projection of $X$ onto the subspace $L^2(\sigma(Y))$:

$\displaystyle \langle g(Y), f_i(Y) - X_i \rangle = 0$ for any random variable $g(Y) \quad \quad (*)$

Note that if this were not true, but instead there existed $g(Y) \in L^2(\sigma(Y))$ with $\langle g(Y),g(Y) \rangle = E|g(Y)|^2 = 1$ (i.e. normalised) and $\langle f_i(Y) - X_i, g(Y) \rangle = \alpha \neq 0$, we may write $h(Y) = f_i(Y) - \alpha g(Y) \in L^2(\sigma(Y))$ and so

$\begin{array}{lcl} E|h(Y) - X_i|^2 &=& E|f_i(Y) - \alpha g(Y) - X_i|^2 \\ &=& E| f_i(Y) - X_i|^2 + |\alpha|^2 E |g(Y)|^2 - \alpha^* \langle f_i(Y) - X_i , g(Y) \rangle - \alpha \langle g(Y), f_i(Y) - X_i \rangle \\&=& E|f_i(Y) - X_i|^2 + |\alpha|^2 - |\alpha|^2 - |\alpha|^2 \\ &<& E| f_i(Y) - X_i|^2,\end{array}$

contradicting the minimality of $E|f_i(Y) - X_i|^2$. Hence the orthogonality condition $(*)$ must hold. Such a projection is unique in the almost-sure sense (i.e. any other random variable is equal to $f_i(Y)$ with probability one).

By $(*)$, $f_i(Y) - X_i$ is also orthogonal to any constant random variable, or in other words, $Ec(f_i(Y) - X_i) = 0$ for any constant $c$ showing that the error $f_i(Y) - X_i$ has zero mean (worth remembering: zero-mean random variables are orthogonal to constants!). In vector notation, with $e := f(Y) - X$,

$Ee = 0$  or $Ef(Y) = EX \quad \quad (1)$

For each $j$, $f_j(Y)$ is in $L^2(\sigma(Y))$ so by $(*)$, $\langle f_j(Y), f_i(Y) - X_i \rangle = 0$. Switching to vector notation, this gives us

$\displaystyle {\rm Cov}(f(Y), f(Y)-X) = Ef(Y)(f(Y) - X)^* = 0 \quad \quad (2)$,

where ${\rm Cov}(U,V) :=: \Sigma_{UV} := E(U-E(U))(V-E(V))^*$ (we shall use the notation ${\rm Cov}(U) = \Sigma_U$ to mean ${\rm Cov}(U,U)$).

This is the orthogonality condition in our vector case. By Pythagoras’s theorem, the minimum error is

$\sum_i E|f_i(Y) - X_i|^2 = \sum_i \left(E|X_i|^2 - E|f_i(Y)|^2\right) = {\rm tr}({\rm Cov}(X) - {\rm Cov}(f(Y))),$

where ${\rm tr}$ represents the sum of the diagonal entries of a square matrix (the trace).

#### Linear Case

If we take the example of a linear estimator of the form $f(Y) = AY + b$ ($A$ a matrix, $b$ a vector), we may use (1) and (2) to identify $A$ and $b$ that minimises $E\|f(Y) - X\|^2$. By (1), $AEY + Eb = EX$, from which $b = Eb = EX - AEY$ and $f(Y) = A(Y-EY) + EX$.

Note that any component of $Y$ is in itself a linear transform of $Y$. By (*), it must be orthogonal to any component of the error vector $f(Y) - X$, giving us

$\displaystyle 0 = E(f(Y)-X)Y^* = E[(A(Y-EY) + EX - X)Y^*] = A\Sigma_Y - \Sigma_{XY}.$

If ${\rm Cov}(Y)$ is invertible, this gives $A = \Sigma_{XY}\Sigma_Y^{-1}$, leading to the following expressions:

$\displaystyle f(Y) = EX + \Sigma_{XY}\Sigma_Y^{-1}(Y-EY) \quad \quad (3)$

$\begin{array} {lcl} {\rm Cov}(f(Y) - X) &=& {\rm Cov}(X) - {\rm Cov}(EX + {\rm Cov}(X,Y){\rm Cov}(Y)^{-1}(Y-EY))\\ &=& {\rm Cov}(X) - {\rm Cov}(X,Y){\rm Cov}(Y)^{-1}{\rm Cov}(Y-EY) {\rm Cov}(Y)^{-1}{\rm Cov}(Y,X)\\ &=& \Sigma_X - \Sigma_{XY}\Sigma_Y^{-1} \Sigma_{YX} \quad \quad (4) \end{array}$

(here using the fact that ${\rm Cov}(AX) = A{\rm Cov}(X)A^*$),

$E\|f(Y) - X\|^2 = {\rm tr}( {\rm Cov}(f(Y)) - X) = {\rm tr} (\Sigma_X - \Sigma_{XY}\Sigma_Y^{-1} \Sigma_{YX}) \quad \quad (5)$.

Note that this precise argument is made in least squares problems: if we wish to find a matrix $W$ so that $\|x - Wy\|^2$ is minimised and $Wy$ is only permitted in a space spanned by the columns of some matrix $C$, we find the projection of $x$ onto the subspace whose columns are formed by $C$. This gives the orthogonality condition $0 = C^* (x - Wy) = C^* (x - Cz)$ (for some $z$, since $Wy$ is in the column space of $C$), from which

$\displaystyle Wy = Cz = C(C^*C)^{-1}C^*x.$

In the special case when $C =c$ is a single column vector, the projection operator has the attractive form $(cc^*)/(c^*c)$ which is simply the outer product $cc^*$ if $c$ has unit length.

#### Conditional Expectation

The post so far has made no mention of the conditional expectation $E[X|Y]$, which is what in fact the minimum mean squared estimator turns out to be. In brief $E[X|Y]$ is defined as a measurable function of $Y$ satisfying

$\displaystyle E[g(Y)E[X|Y]] = E[g(Y)X] \quad \quad (+)$

for any measurable function $g(Y)$. This is a measure-theoretic definition that has the advantage of unifying the discrete and continuous cases, and avoids division-by-zero possibilities that can arise when dealing with probability densities. (Actually conditional expectation is first defined with respect to a set of events called a sigma algebra which may be considered as an information source for that random variable. The larger the set, the more information we have about that random variable.)

If expectation is thought of as an averaging process, then conditional expectation is an average with respect to information or uncertainty. Formula (+) says that this average should be equal to the unconditioned average on any measurable set (to see this take $g(Y)$ to be $1$ on that set and $0$ otherwise). There are details missing here, but the interested reader is encouraged to see the references for more.

In the Hilbert space of random variables with finite second moment, this condition is equivalent to $(X-E[X|Y])$ being orthogonal to $g(Y)$, so $E[X|Y]$ is our best estimate $f(Y)$ found earlier. Hence the conditional expectation $E[X_i|Y]$ can be viewed as the vector projection of $X_i$ onto $L^2(\sigma(Y))$, i.e. the best estimate of $X_i$ in the expected least squares sense. Conditional probability may be defined in terms of conditional expectation via $P(A|B) = E[1_A|B]$, where $1_A$ is the random variable equal to $1$ if our event $A$ occurs, and $0$ otherwise.

#### Gaussian Case

In the particular case of $X$ and $Y$ being Gaussian vectors, it turns out that the best linear estimate is also the best overall estimate in the least squares sense. To see this, let $f(Y)$ be the best linear estimate of $X$ given $Y$. Since uncorrelated and independent are equivalent notions in the Gaussian world, it follows from the independence of $(X-f(Y))$ and $f(Y)$ that given $Y=y$ the conditional distribution of $(X-f(Y))$ does not depend on $y$. It is in fact zero-mean Gaussian with variance given in (4)). Hence the distribution of $X$ $(= (X-f(Y)) + f(Y))$ given $Y=y$ is Gaussian with mean $f(Y)$ (the linear estimate of $X$) and the same covariance matrix. As a result the conditional expectation $E[X|Y=y]$ is equal to the mean of that Gaussian distribution, which is $f(y)$, and as this is true for all $y$, $E[X|Y] = f(Y)$.

For example, if $Y = HX + V$, where $X$ is zero-mean Gaussian, $H$ is a deterministic $n$ by $m$ matrix and $V$ is a zero-mean $n$-dimensional Gaussian noise vector uncorrelated with $X$, we have

$\displaystyle \Sigma_{XY} ={\rm Cov}(X,HX +V) = {\rm Cov}(X)H^* + {\rm Cov}(X,V) = \Sigma_X H^*$

$\displaystyle \Sigma_Y = {\rm Cov}(HX+V) = H\Sigma_X H^*+ \Sigma_V,$

and as $X$ and $Y$ are jointly Gaussian, our MMSE estimator is also the best linear estimator:

$\displaystyle E[X|Y] = f(Y) =\Sigma_{XY}\Sigma_Y^{-1}Y = \Sigma_X H^*(H\Sigma_X H^* + \Sigma_V)^{-1} Y.$

This solution is used in a wide variety of linear estimation applications, ranging from regression analysis in statistics to communication theory (estimating signals passing through a channel $H$ and corrupted by noise $V$).

#### References

[1] Williams, Probability with Martingales, Cambridge University Press, 2001.

[2] Hajek, Notes for ECE 534: An Exploration of Random Processes for Engineers, July 2011, available here.