# Chaitanya's Random Pages

## August 31, 2014

### Players with high test cricket + one day international cricket batting averages

Filed under: cricket,sport — ckrao @ 8:10 am

Jacques Kallis recently retired from international cricket with a phenomenal record of close to 25,000 runs, almost 600 wickets and 300+ catches on the international stage. He ended up with batting averages of 55.37 in test cricket and 44.36 in one day internationals (ODIs), a sum just shy of 100. It turns out this is the highest sum of averages for any retired player who has played both of these forms of the game, just ahead of Michael Hussey (not counting players with a small number of innings).

The following tables show current and former players with a sum of test and ODI batting average above 90. Sangakkara has improved his numbers in both formats dramatically in recent years while Amla and de Villiers are the only players with sum exceeding 100. Statistics are current to August 30 2014.

Current players

| Player | Test Innings | Test Runs | Test Average | ODI Innings | ODI Runs | ODI Average | ODI Strike Rate | Test + ODI Average |
|---|---|---|---|---|---|---|---|---|
| HM Amla | 137 | 6415 | 51.32 | 89 | 4539 | 55.35 | 89.84 | 106.67 |
| AB de Villiers | 159 | 7296 | 51.02 | 159 | 6701 | 50.38 | 95.04 | 101.40 |
| KC Sangakkara | 221 | 11988 | 58.76 | 357 | 12844 | 40.14 | 77.55 | 98.90 |
| IJL Trott | 87 | 3763 | 46.45 | 65 | 2819 | 51.25 | 77.06 | 97.71 |
| MJ Clarke | 180 | 8240 | 51.50 | 215 | 7683 | 44.67 | 78.76 | 96.17 |
| S Chanderpaul | 266 | 11414 | 51.88 | 251 | 8778 | 41.60 | 70.74 | 93.48 |
| MS Dhoni | 140 | 4808 | 38.46 | 215 | 8098 | 53.28 | 89.31 | 91.74 |
| V Kohli | 51 | 1855 | 39.47 | 128 | 5674 | 51.58 | 89.75 | 91.05 |
| AD Mathews | 74 | 3054 | 52.66 | 111 | 3013 | 38.14 | 84.06 | 90.79 |
| Misbah-ul-Haq | 84 | 3285 | 46.93 | 136 | 4594 | 43.75 | 73.82 | 90.68 |

(NB: Fawad Alam would qualify but only played 3 test matches.)

Former players

| Player | Test Innings | Test Runs | Test Average | ODI Innings | ODI Runs | ODI Average | ODI Strike Rate | Test + ODI Average |
|---|---|---|---|---|---|---|---|---|
| DG Bradman | 80 | 6996 | 99.94 | n/a | n/a | n/a | n/a | 99.94 |
| JH Kallis | 280 | 13289 | 55.37 | 314 | 11579 | 44.36 | 72.89 | 99.73 |
| MEK Hussey | 137 | 6235 | 51.53 | 157 | 5442 | 48.16 | 87.16 | 99.69 |
| SR Tendulkar | 329 | 15921 | 53.79 | 452 | 18426 | 44.83 | 86.23 | 98.62 |
| IVA Richards | 182 | 8540 | 50.24 | 167 | 6721 | 47.00 | 90.20 | 97.24 |
| ML Hayden | 184 | 8625 | 50.74 | 155 | 6133 | 43.81 | 78.96 | 94.54 |
| Javed Miandad | 189 | 8832 | 52.57 | 218 | 7381 | 41.70 | 67.01 | 94.27 |
| GS Chappell | 151 | 7110 | 53.86 | 72 | 2331 | 40.19 | 75.70 | 94.05 |
| Mohammad Yousuf | 156 | 7530 | 52.29 | 273 | 9720 | 41.72 | 75.10 | 94.01 |
| RT Ponting | 287 | 13378 | 51.85 | 365 | 13704 | 42.04 | 80.39 | 93.89 |
| BC Lara | 232 | 11953 | 52.89 | 289 | 10405 | 40.49 | 79.51 | 93.38 |
| Zaheer Abbas | 124 | 5062 | 44.80 | 60 | 2572 | 47.63 | 84.80 | 92.43 |
| GM Turner | 73 | 2991 | 44.64 | 40 | 1598 | 47.00 | 68.05 | 91.64 |
| R Dravid | 286 | 13288 | 52.31 | 318 | 10889 | 39.17 | 71.24 | 91.48 |
| DM Jones | 89 | 3631 | 46.55 | 161 | 6068 | 44.62 | 72.56 | 91.17 |

(Bradman predated ODI cricket.)

The above statistics are from ESPN Cricinfo.

## August 30, 2014

### Waiting times and the size biased distribution

Filed under: mathematics — ckrao @ 12:48 pm

Suppose buses arrive at a stop after independent random intervals of $T$ minutes, where $T$ is uniformly distributed between 10 and 20. It is natural to ask how long one expects to wait from some random point in time $t$ until the next bus arrives. On the one hand a bus could arrive immediately; on the other, if $t$ falls just after the previous bus left, one could wait as long as 20 minutes. Interestingly, this waiting time $W$ is no longer uniformly distributed.

In general, if the interarrival times $T$ have (cumulative) distribution function $F_T$ and mean $\mathbb{E}[T]$, the following is true.

The probability density of waiting time is $f_W(x) = \text{Pr}(T \geq x)/\mathbb{E}[T] = (1-F_T(x))/\mathbb{E}[T].\quad\quad(1)$

Applying this formula to the above bus example, we have $\mathbb{E}[T] = 15$ and

$\displaystyle F_T(x) = \begin{cases}0, & x < 10\\ (x-10)/10, & 10 \leq x \leq 20\\ 1, & x > 20\end{cases}$

Using (1), the probability density function of the waiting time is $f_W(x) = 1/15$ for $0 \leq x \leq 10$, decreasing linearly as $f_W(x) = (20-x)/150$ for $10 \leq x \leq 20$, and zero elsewhere.

We see that the density function is a rectangle (centred at $x = 5$ and carrying 2/3 of the total probability mass) followed by a triangle (with centroid at $x=40/3$ and 1/3 of the mass). From this we can say the expected waiting time is $\mathbb{E}[W] = 5 \times (2/3) + 40/3 \times (1/3) = 70/9 \approx 7.78$ minutes. This exceeds $15/2 = 7.5$ minutes, half the mean interarrival time, which is what one might initially expect to wait.
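As a sanity check, the expected waiting time can be estimated by direct simulation. Here is a minimal Python sketch (the horizon and trial counts are arbitrary choices, not from the post):

```python
import random

def wait_for_next_bus(horizon, rng):
    """Simulate U(10,20) bus gaps and return the wait from a random arrival time."""
    t, arrivals = 0.0, []
    while t < horizon + 20:  # run past the horizon so a next bus always exists
        t += rng.uniform(10, 20)
        arrivals.append(t)
    s = rng.uniform(0, horizon)  # turn up at a uniformly random time
    return next(a for a in arrivals if a >= s) - s

rng = random.Random(0)
waits = [wait_for_next_bus(1000.0, rng) for _ in range(20000)]
mean_wait = sum(waits) / len(waits)
short_frac = sum(w <= 10 for w in waits) / len(waits)
print(mean_wait)   # close to 70/9, i.e. about 7.78, rather than 7.5
print(short_frac)  # close to 2/3: the mass of the rectangular part
```

The simulated mean lands near 70/9 rather than 7.5, and roughly two thirds of the waits fall in the rectangular part of the density, matching the calculation above.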

To prove (1), we shall first look at the length of the particular inter-arrival period that contains a given point $t$; the waiting time is then the part of this period after $t$. We make use of the following result from alternating renewal process theory [1]:

Proposition: Consider a system that alternates between on and off states where:

• the on durations $X_1, X_2, \ldots$ are independent and identically distributed according to random variable $X$, and
• the off durations $Y_1, Y_2, \ldots$ are independent and identically distributed according to random variable $Y$.

(However it may be that $X_n$ and $Y_n$ are dependent in general.)

Let $p(t)$ be the probability that the system is in the on state at time $t$. If $X + Y$ has finite mean and is non-lattice (i.e. its possible values are not confined to integer multiples of some fixed positive number), then

$\displaystyle \lim_{t \rightarrow \infty} p(t) = \frac{\mathbb{E}[X]}{\mathbb{E}[X+Y]}.\quad\quad(2)$

This result is easily understood: the probability the system is on at a given time is the mean on time divided by the mean total cycle time. However it is based on the key renewal theorem, a limit theorem which is not so straightforward to prove.
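For intuition, (2) is easy to check numerically. Below is a small Python sketch with arbitrarily chosen on-times $X \sim U(0,2)$ and off-times $Y \sim U(0,1)$ (my choice for illustration, not from the text), so the limit should be $1/1.5 \approx 0.667$:

```python
import random

def is_on_at(t, rng):
    """Run on/off cycles (on ~ U(0,2), off ~ U(0,1)) and report the state at time t."""
    clock = 0.0
    while True:
        x = rng.uniform(0, 2)   # on phase
        if clock + x > t:
            return True
        clock += x
        y = rng.uniform(0, 1)   # off phase
        if clock + y > t:
            return False
        clock += y

rng = random.Random(1)
trials = 5000
p_on = sum(is_on_at(50.0, rng) for _ in range(trials)) / trials
print(p_on)  # near E[X]/E[X+Y] = 1/1.5, about 0.667
```

Taking $t = 50$, many cycles (mean length 1.5) have elapsed, so the empirical on-probability is already close to the limiting value.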

Now consider a random process with independent and identically distributed interarrival times $T_1, T_2, \ldots$ with common distribution $T$ with probability distribution function $F_T$. Consider the length of the particular interarrival period $T^*$ that contains some point $t$. We shall determine the distribution of $T^*$. We define an on-off system as follows.

• If $T^*$ is greater than some given $x > 0$, let our system be on for the entire interarrival period (i.e. $X_{n} = T_n, Y_{n} = 0$).
• If the length of the interarrival period is less than or equal to $x$ let our system be off for the interarrival period (i.e. $X_{n} = 0, Y_{n} = T_n$).

Then by this definition the conditions of the above proposition are met and so

\begin{aligned} \text{Pr}(T^* > x) &= \text{Pr(system is on at time t)}\\ &= \frac{\mathbb{E}[X_n]}{\mathbb{E}[X_n+Y_n]}\quad\text{(by (2))}\\ &= \frac{\mathbb{E}[T_n \mid T_n > x]\text{Pr}(T_n > x)}{\mathbb{E}[T_n]}\\ &= \frac{\int_x^{\infty} t\ \ dF_T(t)}{\mathbb{E}[T]}.\quad\quad(3)\end{aligned}

Equivalently, $\text{Pr}(T^* \leq x) = \frac{\int_0^{x} t\ \ dF_T(t)}{\mathbb{E}[T]}$, and if $T$ has probability density $f_T$, differentiating both sides of this equation gives the probability density function of the length of interarrival period containing a point $t$:

$\displaystyle f_{T^*}(x) = xf_T(x)/\mathbb{E}[T].\quad\quad (4)$

Intuitively, longer periods are more likely to contain the point $t$, with probability in direct proportion to the length of the interval. Hence the original density $f_T(x)$ is scaled by the length $x$ and then normalized by $\int_0^{\infty} x f_T(x)\ dx = \mathbb{E}[T]$ to make $xf_T(x)/\mathbb{E}[T]$ a valid probability density. $T^*$ defined by (3) is known as the size-biased transform of $T$.
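To see the size-biasing concretely, here is a small simulation (illustrative Python, with my own choice of horizon and trial count) of the length of the interval covering a random time in the bus example $T \sim U(10,20)$; by (4) its mean should be $\mathbb{E}[T^2]/\mathbb{E}[T] = 140/9 \approx 15.56$, not 15:

```python
import random

def covering_interval_length(horizon, rng):
    """Length of the U(10,20) interarrival interval containing a random time point."""
    s = rng.uniform(0, horizon)
    prev = t = 0.0
    while t < s:        # step through renewals until we pass the sampled time s
        prev = t
        t += rng.uniform(10, 20)
    return t - prev     # the interval (prev, t] contains s

rng = random.Random(2)
lengths = [covering_interval_length(1000.0, rng) for _ in range(20000)]
mean_len = sum(lengths) / len(lengths)
print(mean_len)  # near E[T^2]/E[T] = 140/9, about 15.56
```

The inspected interval is noticeably longer on average than a typical interarrival time of 15 minutes, exactly as the size-biased density predicts.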

Note that the length of the particular interarrival interval $T^*$ is stochastically larger than any of the identically distributed interarrival times $T$. In other words,

$\displaystyle \text{Pr}(T^* > x) \geq \text{Pr}(T > x).\quad\quad(5)$

This is the so-called inspection paradox, which can be proved by the Chebyshev correlation inequality (a form of rearrangement inequality), which states that when $f$ and $g$ are functions of the random variable $X$ having the same monotonicity (i.e. both non-increasing or both non-decreasing), then $f(X)$ and $g(X)$ are non-negatively correlated:

$\displaystyle \mathbb{E} [f(X)g(X)] \geq \mathbb{E}[f(X)] \mathbb{E}[g(X)].\quad\quad(6)$

Applying this result with the non-decreasing functions $f(T) = T$ and $g(T) = {\bold 1}_{(T > x)}$ (which is 1 when $T > x$ and 0 otherwise):

\begin{aligned} \text{Pr}(T^* > x) &= \frac{\int_x^{\infty} t\ \ dF_T(t)}{E[T]}\\ &= \frac{\mathbb{E}[T {\bold 1}_{(T>x)}]}{\mathbb{E}[T]}\\ &\geq \frac{\mathbb{E}[T] \mathbb{E}[{\bold 1}_{(T>x)}]}{\mathbb{E}[T]}\\ &= \mathbb{E}[{\bold 1}_{(T>x)}]\\ &= \text{Pr}(T > x).\quad\quad(7)\end{aligned}

The inspection paradox is a form of sampling bias, where results are modified as a result of a non-intended sampling of a given space. Another example is the friendship paradox, where in a graph of acquaintances with vertices having some differing degrees (number of friends), the average degree of a friend is greater than the average degree of a randomly chosen node. For one thing a friend will never have degree zero while a randomly chosen person might! The degree of a friend is an example of sampling from a size-biased distribution.
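As a toy illustration of the friendship paradox, consider a hypothetical five-person network (my own example) with one popular hub:

```python
# Hypothetical acquaintance graph: a "hub" with four friends who know only the hub.
graph = {
    "hub": ["a", "b", "c", "d"],
    "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"],
}
degree = {person: len(friends) for person, friends in graph.items()}

# Average degree of a uniformly chosen person: (4 + 1 + 1 + 1 + 1) / 5.
avg_person = sum(degree.values()) / len(degree)

# Average degree of a friend, averaged over every friendship "slot" (directed
# edge) -- the hub is counted once for each friend who names it.
friend_degrees = [degree[f] for friends in graph.values() for f in friends]
avg_friend = sum(friend_degrees) / len(friend_degrees)

print(avg_person, avg_friend)  # prints 1.6 2.5
```

A random person has 1.6 friends on average, but a random friend has 2.5: sampling along friendships over-weights the well-connected hub, which is precisely size-biased sampling of the degree distribution.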

How about the distribution of the waiting time $W$ for the next arrival? For fixed $x > 0$ we consider another on-off system where this time the system is in an off state during the last $x$ units of time of an interarrival interval, and in the on state otherwise. That is,

• $Y_n = \min(x,T_n)$,
• $X_n = T_n - Y_n$.

Then we have

\begin{aligned} \text{Pr}(W \leq x) &= \text{Pr(system is off at time t)}\\ &= \frac{\mathbb{E}[Y_n]}{\mathbb{E}[X_n+Y_n]} \quad\quad \text{(by (2))}\\ &= \frac{\mathbb{E}[\min(x,T_n)]}{\mathbb{E}[T_n]}\\ &= \frac{\int_0^{x} t\ \ dF_T(t) + \int_x^{\infty} x\ \ dF_T(t) }{\mathbb{E}[T_n]}\\ &= \frac{ \left[tF_T(t) \right]_{0}^x - \int_0^xF_T(t)\ dt + x(1-F_T(x))}{\mathbb{E}[T_n]}\quad \quad \text{(integrating the first term by parts)} \\&= \frac{xF_T(x) - \int_0^x F_T(t)\ dt + x - xF_T(x)}{\mathbb{E}[T_n]}\\ &= \frac{\int_0^x (1-F_T(t))\ dt }{\mathbb{E}[T_n]}.\quad\quad(8)\end{aligned}

Alternatively, this form can be found by imagining the time $t$ to be uniformly distributed within the interarrival interval containing it, thus partitioning the inspected interval $T^*$ into $(1-X)T^*$ and $XT^*$, where $X \sim U(0,1)$ is independent of $T^*$. Hence the distribution of $W = XT^*$ is found by integrating over the distribution of $T^*$ and using the fact that $dF_{T^*}(t) = \frac{t}{\mathbb{E}[T]}\ dF_T(t)$ from (3):

\begin{aligned} \text{Pr}(W \leq x) &= \text{Pr}(XT^* \leq x)\\&= \int_0^{\infty} \text{Pr}(Xt \leq x \mid T^* = t)\ dF_{T^*}(t)\\&= \int_0^{\infty} \text{Pr}(X \leq x/t)\ dF_{T^*}(t)\\&=\int_0^x 1\ dF_{T^*}(t) + \int_x^{\infty} (x/t)\ dF_{T^*}(t)\\&=\int_0^x \frac{t}{\mathbb{E}[T]}\ dF_T(t) + x\int_x^{\infty} \frac{1}{t}\frac{t}{\mathbb{E}[T]}\ dF_T(t)\\&= \frac{\left[tF_T(t)\right]_0^x - \int_0^x F_T(t)\ dt + x(1-F_T(x))}{\mathbb{E}[T]} \quad\quad \text{(integrating the first term by parts)}\\&= \frac{- \int_0^xF_T(t)\ dt + x}{\mathbb{E}[T]}\\&= \frac{\int_0^x (1-F_T(t))\ dt }{\mathbb{E}[T]}.\quad\quad(9)\end{aligned}
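The decomposition $W = XT^*$ also gives a quick way to simulate $W$ for the bus example: for $T \sim U(10,20)$, (4) gives $f_{T^*}(x) = x/150$ on $[10,20]$, so $F_{T^*}(x) = (x^2-100)/300$, which inverts in closed form. A short illustrative Python sketch (sample sizes are my own choice):

```python
import random

rng = random.Random(3)

def sample_T_star():
    # Inverse-CDF sampling of the size-biased length T* for T ~ U(10, 20):
    # F_{T*}(x) = (x^2 - 100)/300 on [10, 20], so x = sqrt(300 u + 100).
    return (300 * rng.random() + 100) ** 0.5

# W = X * T* with X ~ U(0,1) independent of T*.
waits = [rng.random() * sample_T_star() for _ in range(20000)]
mean_w = sum(waits) / len(waits)
print(mean_w)  # near 70/9, about 7.78
```

The mean agrees with the earlier calculation $\mathbb{E}[W] = \mathbb{E}[X]\,\mathbb{E}[T^*] = \frac{1}{2} \cdot \frac{140}{9} = \frac{70}{9}$.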

Differentiating both sides of (8) or (9) with respect to $x$ leads to (1), the probability density function of waiting time $f_W(x) = (1-F_T(x))/\mathbb{E}[T]$.

Finally, if we wish to find the moments $\mathbb{E}[W^k]$, we write $W = XT^*$ where $X \sim U(0,1)$ and $X$ is independent of the size-biased variable $T^*$:

\begin{aligned}\mathbb{E}[W^k] &= \mathbb{E}[(XT^*)^k]\\ &= \mathbb{E}[X^k] \mathbb{E}[(T^*)^k]\\&= \mathbb{E}[X^k]\int_0^{\infty}\frac{x}{\mathbb{E}[T]} (x^k)\ dF_T(x)\quad\quad \text{(}dF_{T^*}(x)\text{ related to }dF_T(x)\text{ via (3))}\\ &= \frac{\mathbb{E}[X^k]\mathbb{E}[T^{k+1}]}{\mathbb{E}[T]}\\ &= \frac{\left(\int_0^1 x^k\ dx\right) \mathbb{E}[T^{k+1}]}{\mathbb{E}[T]}\\ &= \frac{\mathbb{E}[T^{k+1}]}{(k+1)\mathbb{E}[T]}.\quad\quad(10)\end{aligned}

In particular

\begin{aligned} \mathbb{E}[W] &= \mathbb{E}[T^2]/(2\mathbb{E}[T])\\ &\geq (\mathbb{E}[T])^2/(2\mathbb{E}[T])\\ &= \mathbb{E}[T]/2\quad\quad(11)\end{aligned}.

Hence we can say that the mean waiting time for the next arrival from any given point in time is more than half the mean interarrival time, unless $\mathbb{E}[T^2] = (\mathbb{E}[T])^2$ (i.e. zero variance in $T$, meaning $T$ is deterministic).

For example, in the initial bus example where $T \sim U(10,20)$:

• $E[T] = 15$,
• $E[T^2] = \int_{10}^{20} (t^2/10)\ dt = (20^3-10^3)/30 = 700/3$ and
• $E[T^3] = \int_{10}^{20} (t^3/10)\ dt = (20^4-10^4)/40 = 3750$,

so $E[W] = E[T^2]/(2E[T]) = 700/90 = 70/9 \approx 7.78 > E[T]/2 = 7.5$ and $E[W^2] = E[T^3]/(3E[T]) = 250/3 \approx 83.33$. Hence the waiting time has mean 7.78 and standard deviation of $\sqrt{250/3-(70/9)^2} \approx 4.78$.
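These figures can be reproduced exactly from (10) with rational arithmetic. A small Python check, using the standard moment formula $\mathbb{E}[T^k] = (20^{k+1} - 10^{k+1})/(10(k+1))$ for $T \sim U(10,20)$:

```python
from fractions import Fraction

def moment_T(k):
    # E[T^k] for T ~ U(10, 20): the integral of t^k / 10 over [10, 20].
    return Fraction(20**(k + 1) - 10**(k + 1), 10 * (k + 1))

def moment_W(k):
    # Formula (10): E[W^k] = E[T^(k+1)] / ((k+1) E[T]).
    return moment_T(k + 1) / ((k + 1) * moment_T(1))

mean_W = moment_W(1)             # 70/9
second_W = moment_W(2)           # 250/3
var_W = second_W - mean_W**2     # 1850/81
print(mean_W, second_W, float(var_W) ** 0.5)
```

This confirms the exact values $\mathbb{E}[W] = 70/9$, $\mathbb{E}[W^2] = 250/3$ and standard deviation $\sqrt{1850/81} \approx 4.78$.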

#### References

[1] S. Ross, Stochastic Processes, John Wiley & Sons, 1996.

[2] R. Arratia and L. Goldstein, Size bias, sampling, the waiting time paradox, and infinite divisibility: when is the increment independent? Available at arxiv.org/pdf/1007.3910.

[3] This answer by Did on math.stackexchange.
