Chaitanya's Random Pages

February 29, 2012

Expensive bowling in ODIs

Filed under: cricket,sport — ckrao @ 8:39 pm

Fascinating that just a few days after I blogged about economical bowling performances in cricket ODIs, we see the most expensive analysis in terms of economy rate (qualification: 5 overs or more). I found about it at Andy Zaltzman’s twitter feed here.

The list of expensive analyses can be found here. Earlier in the month a bowler (Brian Vitori of Zimbabwe) had conceded 100+ runs for only the fourth time in an ODI (9-0-105-1 against New Zealand at Napier).

Lasith Malinga bowled 7.4 overs for 96 runs (and one wicket) as India managed to chase down a victory target of 321 in just 36.4 overs. It was the fifth highest run rate for a 300+ score according to this.

Here is Malinga’s ball-by-ball breakdown. After 14 balls he had only conceded 13, then he conceded 83 off his next 32 balls!

1 1 1 0 0 2 | 1 2 1 0 4 0 | 0 0 4 1 4 6 | 4 x 2W 0 4 1 4 | 4 1 0 4 1 W 4lb | 1 1 4 1 1 0 | 2 6 4 4 4 4 | 1 1 4 4

(W = wide, x = wicket, lb = legbye)

For economical analyses I pointed out those that happened in recent times. The following lists the most expensive analyses that occurred prior to 2000.

  • Saurav Ganguly (India) bowled 5 overs for 62 against Pakistan at Toronto in 1998.
  • Rajab Ali (Kenya) bowled 6 overs for 67 against SL at Kandy in 1996 as SL posted a massive score of 5/398 (the highest at the time) in the World Cup.
  • Ravi Shastri (India) bowled 7 overs for 77 against the West Indies at Jamshedpur in 1983. Viv Richards scored 149 off 99 balls in the game.
  • Greg Matthews (Australia) bowled 5 overs for 54 against India at Delhi in 1986.
  • Heath Davis (New Zealand) bowled 5 overs for 54 against India at Bangalore in 1997.

February 28, 2012

Finite differences for polynomial extrapolation

Filed under: mathematics — ckrao @ 9:41 pm

If I am given the values of a polynomial at some consecutive integers and I want to find its value at the next integer, how might I proceed? For example, if I know p(0) = 1, p(1) = 3, p(2) = 11, p(3) = 28, p(4) = 57, how would I find p(5)?

The easiest way I know is to set up a finite difference table. Write the polynomial values in a row, then underneath each row write the difference of the two numbers above it. Continue until all the numbers in a row are equal (or the row below it contains 0s).

For example:

Then we extrapolate by starting at the bottom row and writing the next equal number (in this case 6), then proceeding upwards by writing the sum of the number to its left and below it. In our example we obtain the following.

Hence we conclude that p(5) = 101. Notice that we did not need to find p(x) for general x (it turns out to be p(x) = (x^3 + 3x^2 + 2)/2 in this example).

The above principle works based on the fact that if p(x) is a polynomial, then p(x+1) - p(x) reduces its degree by 1. By doing this successively the degree eventually becomes 0. Constant polynomials are uniquely determined by one value so by extrapolating this and propagating the result upwards, we can extrapolate any degree of polynomial in a unique manner provided we have a complete table.

What if we wanted to find p(10) in the above example? Would we have to resort to finding p(x) by solving a system of linear equations in its coefficients? That is, do we need to set p(x) = ax^3 + bx^2 + cx + d, substitute known values and then solve for a, b, c, d? In this post I want to show how to proceed in a different way.

Firstly, consider what occurs if our original row were taken from a diagonal of Pascal’s triangle (i.e. binomial coefficients). For example, here in the first row we show numbers of the form \binom{n}{3} = n(n-1)(n-2)/6.

We find the second row has numbers of the form \binom{n}{2} and the third row has numbers of the form \binom{n}{1}. In general if the first row has numbers of the form \binom{n}{k} (where k is fixed and n varies), then the i‘th row has numbers of the form \binom{n}{k-i+1}. This is a consequence of the fundamental identity

\displaystyle \binom{n+1}{k} - \binom{n}{k} = \binom{n}{k-1}.

Given this, it makes sense to try to write our polynomial as a linear combination of binomial coefficients (\binom{n}{k}) rather than as a linear combination of monomials (x^n).

Returning to our previous example, we wish to evaluate p(10) given p(0) = 1, p(1) = 3, p(2) = 11, p(3) = 28, p(4) = 57. Notice how the entire finite difference table can be generated from its first diagonal (running NW-SE). For example,

\begin{array}{lcl} 28 &=& 11 + 17\\ &=& (3 + 8) + (8 + 9)\\ &=& 3 + 2\times 8 + 9\\ &=& (1+2) + 2(2 + 6) + (6 + 3)\\ &=& 1 + 3.2 + 3.6 + 3.\end{array}

This shows that p(3) = 28 can be written in terms of the first diagonal (1, 2, 6, 3) with the binomial coefficients 1,3,3,1. Another way of thinking of this is to see how many ways each number in the first diagonal can trickle across the table. In the above example, the number 6 can trickle up to p(3) in \binom{3}{2} ways: it needs to go up twice and across once in three total moves (up-up-across, up-across-up or across-up-up). This should hopefully explain how the binomial coefficients enter the picture. In general if we need to go across p places and up q places, this can be done in \binom{p+q}{p} ways.

Now if we wish to find p(10), we need to go 10 spaces across from p(0) = 1, nine spaces across and one space up from  2 = p(1)-p(0), eight spaces across and two spaces up from 6, and seven spaces across and three spaces up from 3. We conclude that we may write

\displaystyle p(10) = 1\binom{10}{0} + 2\binom{10}{1} + 6\binom{10}{2} + 3\binom{10}{3}.

Hence p(10) = 1 + 2\times 10 + 6\times (10.9)/2 + 3\times (10.9.8)/6 = 1 + 20 + 270 + 360 = 651. In general, if the first diagonal of the table is a_0, a_1, \ldots, a_n and a_0= p(0), then

\displaystyle p(x) = \sum_{i=0}^n a_i \binom{x}{i}.

There is actually a strong link here between this formula and the Taylor series expansion, forming the basis for umbral calculus. See [2] for more details.

References

[1] Jim Loy’s pages, Finite Difference

[2] Weisstein, Eric W. “Newton’s Forward Difference Formula.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/NewtonsForwardDifferenceFormula.html

February 25, 2012

Economical bowling in ODIs

Filed under: cricket,sport — ckrao @ 12:58 am

A couple of months ago Saeed Ajmal (Pakistan) produced a spell of 7 overs, 4 maidens, 2 wickets for 6 runs in a one-day international against Bangladesh. I wondered how often in recent times it has happened that a bowler has been so economical in ODIs, especially with the trend towards higher run-rates (2006 saw the first 400+ team score and now it has occurred ten times). I used CricInfo’s Statsguru to find out. It turns out that there have been 31 instances of a bowler bowling at least 7 overs with an economy rate less than 1 run per over, with a fair number of those occuring in the 2000s, though mostly against the weaker teams.

The list can be seen here.

The most economical was 4 wickets for 3 runs in 10 overs by Phil Simmons (WI) against Pakistan! Only two of the 31 performances have been five wicket hauls:

  • Muralitharan against NZ at Sharjah in 2002 (10-3-9-5) (New Zealand still posted 218 and won the match!)
  • Sunil Joshi against South Africa at Nairobi in 1999 (10-6-6-5)

Below are some other notable performances of “recent” times.

  • Curtly Ambrose (West Indies) had 10-5-5-1 against Sri Lanka at Sharjah in 1999. His overs were bowled in a single spell.
  • Andre Botha (Ireland) had 8-4-5-2 against Pakistan at Kingston in 2007 (during the World Cup), a game which Pakistan famously lost. This was the last time we saw such an economical bowling display in an ODI against a non-minnow (non-Irish to be more specific) side.
  • Glenn McGrath (Australia) had 10-4-8-4 against India at Sydney in 2000. This was McGrath at the peak of his powers.
  • Aasif Karim (Kenya) had 8.2-6-7-3 against Australia at Durbin in 2003 (during the World Cup Super Sixes stage as a 39 year-old). He had 5 taken off his last two balls so at one point his figures were 8-6-2-3!
  • Trevor Gripper (Zim) had 7-4-6-0 against West Indies at Kandy in 2001. Interestingly he only played 8 ODIs.

February 13, 2012

Decomposition of a 3x3x3 cube

Filed under: mathematics — ckrao @ 9:14 pm

Here is a decomposition of a 3 by 3 by 3 cube that I learnt recently: it can be divided into six 2x2x1 blocks and three unit cubes. The following diagram attempts to show this in layers, where the unit cubes are in red and the 2x2x1 blocks are each shown in a different colour.

A question for the interested reader: do all such decompositions necessarily have the unit cubes appearing along a diagonal of the 3x3x3 cube?

January 31, 2012

Recent fast test cricket centuries

Filed under: cricket,sport — ckrao @ 8:05 pm

David Warner recently scored 180 off 159 balls in a test cricket match against India, which ended up being the third fastest score of 180+ in terms of strike rate (runs/balls) (some older innings don’t have strike rate information). The only faster innings were the 222 scored off 168 balls by Nathan Astle vs England and the 293 off 254 by Virender Sehwag against Sri Lanka.

Michael Clarke scored his more recent 210 speedily too, with only 20 scores of 210+ having a faster strike rate (out of 210 other scores, some which have no data on balls faced). Clarke and Ponting posted a partnership of 386, which was Australia’s largest since 1946 and the 19th highest overall. Some records from day 2 of the 3rd test are here and here. It was the fourth time a triple century and double century were scored by a batsman in a series and the 15th time that two 200+ scores were posted by a batsman in a series (full list here).

Finally Clarke’s triple century in the 2nd test was the 25th overall and the 7th fastest in terms of strike rate (Hammond’s 336* was probably quicker but we don’t know how many balls he faced). He is one of very few players to average 45+ in both tests and ODIs.

January 9, 2012

Extremal points of a triangle

Filed under: mathematics — ckrao @ 11:41 am

In this post we summarise some of the points of a triangle that maximise or minimise various functions on the plane. This is on a similar theme to an earlier post on optimisation problems given a point inside a given angle. Many of the results are proved in [1].

Firstly we introduce some notation and terminology. Let ABC be a given triangle with side lengths a = BC,b = AC, c = AB and let P be a variable point in the plane. Let x, y, z be the distances from P to the sides AB, AC, BC respectively. The feet of the perpendiculars from P to the sides of ABC form a pedal triangle. If AP, BP, CP meet sides BC, CA, AB in points D, E, F, then \bigtriangleup DEF is known as the Cevian triangle of P.

Functions involving distances to the vertices

  • \displaystyle \min |PA| + |PB| + |PC|: P is the Fermat point of ABC. If the largest angle is 120 degrees or more, the Fermat point is at the vertex of this angle, otherwise we can construct it by drawing an equilateral triangle ABD external to ABC, then finding the intersection of CD and the circumcircle of ABD.
    In this figure one can show that PA + PB + PC = CD.
  • \displaystyle \max \min \{|PA|, |PB|, |PC|\}: if ABC is not obtuse, P is the circumcentre of ABC. Otherwise it is the intersection of the longest side of ABC with the perpendicular bisector of its middle-length side.
  • \displaystyle \min \{aPA + bPB + cPC\}: P is the orthocentre H of ABC, with minimum value given by four times the area of ABC.
  • \displaystyle \min PA^2 + PB^2 + PC^2: P is the centroid G of ABC with minimum value given by (a^2 + b^2 + c^2)/3.
  • \displaystyle \min aPA^2 + bPB^2 + cPC^2: P is the incentre I of ABC with minimum value given by abc.
  • \displaystyle \min xPA^2 + yPB^2 + zPC^2: This generalises the previous two expressions: P is the point described by the position vector (xA + yB + zC)/(x+y+z) with minimum value equal to (a^2yz + b^2xz + c^2 xy)/(x+y+z).
  • \displaystyle \min PA.PB.c+ PB.PC.a + PC.PA.b: P is the orthocentre H of ABC with minimum value given by abc.

Functions involving distances to the sides

  • \displaystyle \min x + y + z: P is the vertex of the largest angle of ABC.
  • \displaystyle \max xyz: P is the centroid G of ABC.
  • \displaystyle \max xy + yz + zx: P is the Mittenpunkt of ABC, given by the position vector [a(b+c-a)A + b(c + a - b)B + c(a + b - c)C ]/[2(ab + bc + ca) - (a^2 + b^2 + c^2) ]. This and the previous fact can be proved using Lagrange multipliers using the constraint that ax + by + cz is constant (equal to twice the area of ABC). The Mittenpunkt is the point of intersection of lines joining the triangle’s excentres (formed by exterior angle bisectors) and midpoints.
  • \displaystyle \max \frac{xyz}{PA.PB.PC}: P is the incentre I of ABC (IMO shortlist 2001)

Functions involving the pedal triangle

  • maximum area of pedal triangle: P is the circumcentre O of ABC.
  • minimum perimeter of pedal triangle: P is the orthocentre O of ABC.
  • minimum pedal radius (radius of circumcircle of pedal triangle): P is the incentre of ABC. See [2] for more details.

Functions involving the Cevian triangle

  • maximum area of Cevian triangle: P is the centroid G of ABC with maximum area given by a fourth of the area of ABC.

 

References

[1] T. Andreescu, O. Mushkarov, L. Stoyanov, Geometric Problems on Maxima and Minima, Birkhäuser, 2006.

[2] V. Naik, Optimization methods in planar geometry, available at www.cmi.ac.in/~vipul/olymp_resources/olympiadarticles/geometricoptimization.pdf

December 30, 2011

The distribution of Melbourne’s maximum temperatures

Filed under: climate and weather — ckrao @ 11:26 pm

Here I have generated histograms by month of the maximum temperatures of my home city Melbourne (Australia) for the forty-year period of 1971-2010. The temperatures are collected into temperature bins of size 2 degrees Celsius. The most frequent temperature range is 14-16°C in the cooler months (June-Sep) up to 20-22°C in the warmer months (Dec-Mar). The moderating influence of the Southern Ocean is the primary cause of this relatively small difference. However in the summer months hot winds from the interior of the continent skew the distribution and push the mean maximum closer to 25°C. Curiously in January the maximum has been as likely to be 36-38°C as 30-32°C.

One can see how bunched up the maximum temperatures are in the winter (never too cold) compared with the summer (sometimes too hot!). Since the data is collected over a forty year period, simply divide the y value by 40 to see how many days per month one achieves a particular range. For example, in the month of May between 8 and 9 days per month the maximum temperatures has been between 16 and 18°C.

Finally here is the above data aggregated.

Here we see the most common range is 14-16°C which has happened about 57 days per year (47 of these between May and Sep) and 81% of the time it has been between 12 and 26°C (and 17% of the time above 26°C). The overall range is 7.0 to 46.4°C. On average the maximum temperature has exceeded 40°C 1.5 times per year and been less than 10°C 0.8 times per year.

The data for this post comes from http://www.bom.gov.au/climate/data/.

The results of Fuss and Carlitz for bicentric quadrilaterals

Filed under: mathematics — ckrao @ 4:47 am

Any triangle has both an incircle (tangent to its sides) and circumcircle (through its vertices).

The distance d between the centres O, I of these circles is related to their radii r (for the incircle) and R (for the circumcircle) via Euler’s theorem in geometry:

\displaystyle d^2 = R^2 - 2Rr

or the equivalent formula

\displaystyle \frac{1}{R-d} + \frac{1}{R+d} = \frac{1}{r}.

A convex quadrilateral however need not have an incircle or circumcircle in general. For it to have an incircle (i.e. a tangential quadrilateral), its pairs of opposite sides must have the same sum. For it to have a circumcircle (i.e. a cyclic quadrilateral), its opposite angles must be supplementary. One that satisfies both of these properties is called a bicentric quadrilateral.

Remarkably, if we define d, r and R for a bicentric quadrilateral analogous to the triangular case, we have formulas similar to Euler’s due to Nicolaus Fuss and Leonard Carlitz:

Carlitz: \displaystyle d^2 = R^2 - 2Rr.\mu,

where \displaystyle \mu = \sqrt{\frac{(ab+cd)(ad + bc)}{(a+c)^2(ac+bd)}} and a,b,c,d are the quadrilateral’s side lengths.

Fuss: \displaystyle \frac{1}{(R-d)^2} + \frac{1}{(R+d)^2} = \frac{1}{r^2}

Isn’t it cool how similar these formulas look for the triangular and quadrilateral cases?! Both of these results (and indeed Euler’s theorem) follow from the intersecting chords theorem. Fuss’s theorem can be shown by proving that both sides of the equation are equal to \displaystyle \frac{1}{AI^2} + \frac{1}{CI^2}. Carlitz’s identity can be shown via the preliminary result that \mu = IE/BE, where E is the intersection of line AI with the circumcircle. A beautiful proof of Fuss’s theorem due to Salazar (using little more than a short angle chase and the theorems of Pythagoras and Apollonius) is found in [1] while that of Carlitz can be seen from p154 of [2].

By Poncelet’s Porism, it is true that if we start with two circles of radius r and R with distance d between their centres such that the Fuss equation \frac{1}{(R-d)^2} + \frac{1}{(R+d)^2} = \frac{1}{r^2} is satisfied, then a bicentric quadrilateral with those parameters may be constructed. Simply start at any point on the circumcircle and successively draw tangents to the incircle to generate the other three vertices of the quadrilateral. This was how I drew the above bicentric quadrilateral.

References

[1] A. Bogomolny, Fuss’ Theorem from Interactive Mathematics Miscellany and Puzzles http://www.cut-the-knot.org/Curriculum/Geometry/Fuss.shtml#S, Accessed 30 December 2011.

[2] O. Calin, Euclidean and Non-Euclidean Geometry: a metric approach, available at http://math.emich.edu/~ocalin/Math341/Newpdf/driver13GeomB.pdf

December 15, 2011

Australia in close test match finishes

Filed under: cricket,sport — ckrao @ 10:59 am

Following Australia’s recent narrow loss to New Zealand I had a look at this page and found that Australia features in 15 out of the 17 closest finishes (measured by runs) in a test match!

One can add to these the only two test match ties both involving Australia as well!

Here are the recent closest results (by runs) for Australia together with intermediate scores in their final innings. (By recent, I mean after 1929!)

lost by 1 run vs WI, 92/3 (7/74 chasing 184)
lost by 2 runs vs Eng, 2005 (7/137 chasing 282)
lost by 3 runs vs Eng, 82/3 (9/218 chasing 292)
lost by 5 runs vs SA, 93/4 (8/75 chasing 117)
lost by 7 runs vs NZ, 11/2 (9/199 chasing 241)
lost by 12 runs vs Eng, 98/9 (3/130 chasing 175)
lost by 12 runs vs Ind, 04/5 (7/58 chasing 107)

All were losses, but in all but one case Australia was way behind and did well to come as close as it did!

The first recent case of a close win to Australia was in 92/3 versus Sri Lanka (16 runs). That time it was Sri Lanka’s turn to collapse (2/127 chasing 181) after a first innings lead of 291!

On the plus side for Australia, its narrowest wins (after 1977) while chasing have all come against South Africa in South Africa:

won by 2 wickets vs SA, 96/7 (target 270)
won by 2 wickets vs SA, 05/6 (target 292)
won by 2 wickets vs SA, 11/2 (target 310)

Source: http://stats.espncricinfo.com/ci/content/records/255926.html

December 5, 2011

The MMSE estimator is orthogonal projection

Filed under: mathematics — ckrao @ 9:36 pm

Suppose we are given the value of the random vector Y from which we wish to estimate the value of another random vector X. In other words, given Y we want to find a function f such that f(Y) is a good approximation of X. How might we go about this? If our definition of “good approximation” is the MMSE (minimum mean squared error) E \|f(Y)-X\|^2 and the random vectors X and Y have finite second order moments (i.e. E \|X\|^2 = E \sum_i |X_i|^2 is finite), then the language of Hilbert spaces helps us greatly.

Let the dimensions of X and Y be m and n respectively (they need not be equal). The convenience of the metric \|f(Y) - X\|^2 is that we may minimise it by finding the component functions  f_1(Y), \ldots, f_m(Y) that minimise each term E|f_i(Y) - X_i|^2 of the sum separately, and then our required f is f(Y) = [f_1(Y), \ldots, f_m(Y)]^T.

Imagine a component X_i (now a random variable as opposed to a random vector) living in the space of random variables with finite second order moment (this space is denoted by L^2(\cal{F}) where \cal{F} is a collection of events whose probability can be measured). The space of measurable functions of Y from \mathbb{C}^n to \mathbb{C} forms a subspace of L^2(\cal{F}) which we denote by L^2(\sigma(Y)). (By measurable we mean that we can still compute the probability of events based on values of the function.) If X is in this subspace, it means that X_i can be entirely determined from knowledge of Y, and an error of zero is possible. If X is not in the subspace, we imagine X_i as an arrow from 0 sticking out from the subspace.

With inner product on L^2(\cal{F}) given by \langle U,V \rangle = E UV^* (noting that \langle U,U \rangle = 0 if and only if U = 0 with probability one), we see that E|f_i(Y) - X_i|^2 is minimised when the error e_i = f_i(Y) - X_i is orthogonal to any measurable function of Y. In our Hilbert space we think of f_i(Y) being the projection of X onto the subspace L^2(\sigma(Y)):

\displaystyle \langle g(Y), f_i(Y) - X_i \rangle = 0 for any random variable g(Y) \quad \quad (*)

Note that if this were not true, but instead there existed g(Y) \in L^2(\sigma(Y)) with \langle g(Y),g(Y) \rangle = E|g(Y)|^2 = 1 (i.e. normalised) and \langle f_i(Y) - X_i, g(Y) \rangle = \alpha \neq 0, we may write h(Y) = f_i(Y) - \alpha g(Y) \in L^2(\sigma(Y)) and so

\begin{array}{lcl} E|h(Y) - X_i|^2 &=& E|f_i(Y) - \alpha g(Y) - X_i|^2 \\ &=& E| f_i(Y) - X_i|^2 + |\alpha|^2 E |g(Y)|^2 - \alpha^* \langle f_i(Y) - X_i , g(Y) \rangle - \alpha \langle g(Y), f_i(Y) - X_i \rangle \\&=& E|f_i(Y) - X_i|^2 + |\alpha|^2 - |\alpha|^2 - |\alpha|^2 \\ &<& E| f_i(Y) - X_i|^2,\end{array}

contradicting the minimality of E|f_i(Y) - X_i|^2. Hence the orthogonality condition (*) must hold. Such a projection is unique in the almost-sure sense (i.e. any other random variable is equal to f_i(Y) with probability one).

By (*), f_i(Y) - X_i is also orthogonal to any constant random variable, or in other words, Ec(f_i(Y) - X_i) = 0 for any constant c showing that the error f_i(Y) - X_i has zero mean (worth remembering: zero-mean random variables are orthogonal to constants!). In vector notation, with e := f(Y) - X,

Ee = 0  or Ef(Y) = EX \quad \quad (1)

For each j, f_j(Y) is in L^2(\sigma(Y)) so by (*), \langle f_j(Y), f_i(Y) - X_i \rangle = 0. Switching to vector notation, this gives us

\displaystyle {\rm Cov}(f(Y), f(Y)-X) = Ef(Y)(f(Y) - X)^* = 0 \quad \quad (2),

where {\rm Cov}(U,V) :=: \Sigma_{UV} := E(U-E(U))(V-E(V))^* (we shall use the notation {\rm Cov}(U) = \Sigma_U to mean {\rm Cov}(U,U)).

This is the orthogonality condition in our vector case. By Pythagoras’s theorem, the minimum error is

\sum_i E|f_i(Y) - X_i|^2 = \sum_i \left(E|X_i|^2 - E|f_i(Y)|^2\right) = {\rm tr}({\rm Cov}(X) - {\rm Cov}(f(Y))),

where {\rm tr} represents the sum of the diagonal entries of a square matrix (the trace).

Linear Case

If we take the example of a linear estimator of the form f(Y) = AY + b (A a matrix, b a vector), we may use (1) and (2) to identify A and b that minimises E\|f(Y) - X\|^2. By (1), AEY + Eb = EX, from which b = Eb = EX - AEY and f(Y) = A(Y-EY) + EX.

Note that any component of Y is in itself a linear transform of Y. By (*), it must be orthogonal to any component of the error vector f(Y) - X, giving us

\displaystyle 0 = E(f(Y)-X)Y^* = E[(A(Y-EY) + EX - X)Y^*] = A\Sigma_Y - \Sigma_{XY}.

If {\rm Cov}(Y) is invertible, this gives A = \Sigma_{XY}\Sigma_Y^{-1}, leading to the following expressions:

\displaystyle f(Y) = EX + \Sigma_{XY}\Sigma_Y^{-1}(Y-EY) \quad \quad (3)

\begin{array} {lcl} {\rm Cov}(f(Y) - X) &=& {\rm Cov}(X) - {\rm Cov}(EX + {\rm Cov}(X,Y){\rm Cov}(Y)^{-1}(Y-EY))\\ &=& {\rm Cov}(X) - {\rm Cov}(X,Y){\rm Cov}(Y)^{-1}{\rm Cov}(Y-EY) {\rm Cov}(Y)^{-1}{\rm Cov}(Y,X)\\ &=& \Sigma_X - \Sigma_{XY}\Sigma_Y^{-1} \Sigma_{YX} \quad \quad (4) \end{array}

(here using the fact that {\rm Cov}(AX) = A{\rm Cov}(X)A^*),

E\|f(Y) - X\|^2 = {\rm tr}( {\rm Cov}(f(Y)) - X) = {\rm tr} (\Sigma_X - \Sigma_{XY}\Sigma_Y^{-1} \Sigma_{YX}) \quad \quad (5).

Note that this precise argument is made in least squares problems: if we wish to find a matrix W so that \|x - Wy\|^2 is minimised and Wy is only permitted in a space spanned by the columns of some matrix C, we find the projection of x onto the subspace whose columns are formed by C. This gives the orthogonality condition 0 = C^* (x - Wy) = C^* (x - Cz) (for some z, since Wy is in the column space of C), from which

\displaystyle Wy = Cz = C(C^*C)^{-1}C^*x.

In the special case when C =c is a single column vector, the projection operator has the attractive form (cc^*)/(c^*c) which is simply the outer product cc^* if c has unit length.

Conditional Expectation

The post so far has made no mention of the conditional expectation E[X|Y], which is what in fact the minimum mean squared estimator turns out to be. In brief E[X|Y] is defined as a measurable function of Y satisfying

\displaystyle E[g(Y)E[X|Y]] = E[g(Y)X] \quad \quad (+)

for any measurable function g(Y). This is a measure-theoretic definition that has the advantage of unifying the discrete and continuous cases, and avoids division-by-zero possibilities that can arise when dealing with probability densities. (Actually conditional expectation is first defined with respect to a set of events called a sigma algebra which may be considered as an information source for that random variable. The larger the set, the more information we have about that random variable.)

If expectation is thought of as an averaging process, then conditional expectation is an average with respect to information or uncertainty. Formula (+) says that this average should be equal to the unconditioned average on any measurable set (to see this take g(Y) to be 1 on that set and 0 otherwise). There are details missing here, but the interested reader is encouraged to see the references for more.

In the Hilbert space of random variables with finite second moment, this condition is equivalent to (X-E[X|Y]) being orthogonal to g(Y), so E[X|Y] is our best estimate f(Y) found earlier. Hence the conditional expectation E[X_i|Y] can be viewed as the vector projection of X_i onto L^2(\sigma(Y)), i.e. the best estimate of X_i in the expected least squares sense. Conditional probability may be defined in terms of conditional expectation via P(A|B) = E[1_A|B], where 1_A is the random variable equal to 1 if our event A occurs, and 0 otherwise.

Gaussian Case

In the particular case of X and Y being Gaussian vectors, it turns out that the best linear estimate is also the best overall estimate in the least squares sense. To see this, let f(Y) be the best linear estimate of X given Y. Since uncorrelated and independent are equivalent notions in the Gaussian world, it follows from the independence of (X-f(Y)) and f(Y) that given Y=y the conditional distribution of (X-f(Y)) does not depend on y. It is in fact zero-mean Gaussian with variance given in (4)). Hence the distribution of X (= (X-f(Y)) + f(Y)) given Y=y is Gaussian with mean f(Y) (the linear estimate of X) and the same covariance matrix. As a result the conditional expectation E[X|Y=y] is equal to the mean of that Gaussian distribution, which is f(y), and as this is true for all y, E[X|Y] = f(Y).

For example, if Y = HX + V, where X is zero-mean Gaussian, H is a deterministic n by m matrix and V is a zero-mean n-dimensional Gaussian noise vector uncorrelated with X, we have

\displaystyle \Sigma_{XY} ={\rm Cov}(X,HX +V) = {\rm Cov}(X)H^* + {\rm Cov}(X,V) = \Sigma_X H^*

\displaystyle \Sigma_Y = {\rm Cov}(HX+V) = H\Sigma_X H^*+ \Sigma_V,

and as X and Y are jointly Gaussian, our MMSE estimator is also the best linear estimator:

\displaystyle E[X|Y] = f(Y) =\Sigma_{XY}\Sigma_Y^{-1}Y = \Sigma_X H^*(H\Sigma_X H^* + \Sigma_V)^{-1} Y.

This solution is used in a wide variety of linear estimation applications, ranging from regression analysis in statistics to communication theory (estimating signals passing through a channel H and corrupted by noise V).

References

[1] Williams, Probability with Martingales, Cambridge University Press, 2001.

[2] Hajek, Notes for ECE 534: An Exploration of Random Processes for Engineers, July 2011, available here.

Next Page »

Theme: Rubric. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.