## Useless statistical indicator alert Russell Degnan

In the past few years, statistics have taken a real leap forward in cricket. While we lag behind baseball by some margin, coverage is now full of interesting (albeit sometimes pointless) graphics, and articles abound on this or that statistical artifact. Unfortunately, some of them are complete crap.

The Numbers Game is like that. I like to read it a lot; it often produces interesting numbers, like the contribution by the last 5 wickets or the series on individual strokes (although all of them have their flaws). But it has an annoying tendency to present figures that are clearly no more than luck as some sort of keen insight into the game.

I mentioned above that baseball is well beyond cricket in terms of statistics, and there is a simple reason for it. They understand error, and distributions, and the all-important difference between transient and persistent phenomena. You never see an error estimate in the Numbers Game. If there were, he would soon realize that cricketers play so few games, and have such variable results, that seemingly sensible statistics like "the difference between performances on the sub-continent and away" are only valid for the few players with dozens of games in each. He clearly has no sense of distribution, for reasons we'll see shortly. And the consistency with which figures are presented that are the result of a few lucky innings, rather than a season-by-season pattern, is phenomenal.

Baseball understands this. And still they make mistakes. But to see what I mean, I highly recommend this article by Bill James. It lays out some of the problems in great detail, concluding, in part, that many things can't be measured (as either there or not there) because the errors in baseball are too large.

Cricket, if anything, is more so. Take this table of differences between the performances of left- and right-hand batsmen against different sides. It looks plausible, but the figures don't bear it out. The measure that should be looked at here is not the difference between the two, but whether the performances match what you would expect. When you look at those numbers, all sides are well within 4 runs or 10% of expectation (based on projecting the linear correlation, with r-squared: 0.8968). Yet the averages themselves are so variable (it only takes a double-hundred or a half-dozen cheap wickets to shift one by a run or so) that any difference is probably just luck.
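
To make "match what you would expect" concrete, here is a minimal sketch of the idea: fit a straight line to one set of averages against the other, then look at how far each side sits from that line. The numbers below are invented purely for illustration (they are not the Numbers Game table), and the names `fit_line` and `r_squared` are my own.

```python
# Hypothetical per-side averages (made-up numbers, NOT the real table):
# runs per wicket conceded to right-handers (x) and to left-handers (y).
right = [28.0, 31.5, 33.0, 36.5, 40.0, 45.5]
left = [29.5, 30.0, 35.0, 35.5, 41.5, 44.0]


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Share of the variation in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


slope, intercept = fit_line(right, left)

# Residual = actual average minus the average the line would predict.
residuals = [y - (intercept + slope * x) for x, y in zip(right, left)]

for x, y, res in zip(right, left, residuals):
    print(f"right {x:5.1f}  left {y:5.1f}  vs expectation {res:+.2f}")
print(f"r-squared: {r_squared(right, left, slope, intercept):.4f}")
```

The residuals, not the raw left-minus-right gaps, are the numbers worth staring at, and even then only once you have some idea of how much an average wobbles by chance.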

The difference between first and second innings performances is a slightly different case. The correlation is much lower, so there is more going on than just standard variation. But here the time-frame (since 1990) is so long that you can't say anything useful about current sides, or even a side from the mid-90s. What should be plotted is the difference between the expected second innings averages and the actual second innings averages, for each side, for each year. Then we would either see a trend, or a lot of natural variation, or something in between.

But neither of those has anything on this week's effort. The use of standard deviation in this case is hopelessly misguided, for many reasons. The distribution of a batsman's scores is heavily skewed, with over a third (as a rule) less than 20. Standard deviation is not only more likely to measure the ability of a batsman to score big centuries (e.g. Lara, Atapattu, Zaheer Abbas, Bradman), but is also more likely to be low when the median is near the mean (i.e. when the average is low, as for Pollock, Marsh, and Hadlee).

But he didn't just use standard deviation; he created an index of average/st_dev. Look at what that is:

( runs / (innings - notouts))
_______________________________

sqrt( ( sum ( diff_means ^ 2 )) / innings )

Which means several things:
- Not outs provide an arbitrary cap on the potential runs, and therefore affect the average less than the standard deviation. This is why there is an arbitrary 5000-run limit. Without that, you get Pollock (1.33), Brett Lee (1.21) and other useful lower-order batsmen in the top 10.
- A high score affects an average much less than it affects the standard deviation (try it and see). Players without big knocks (Mark Waugh, Chanderpaul, Ranatunga) do better.
- The number of innings increases your index by the square root of that number. Hence a player with the same score distribution, but quadruple the number of knocks will have double the index.
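
For anyone who wants to "try it and see", here is a rough sketch of the index as written above. It rests on two assumptions of mine: that `diff_means` is each score's difference from the plain mean score per innings, and that not outs are passed in as a simple count. The function name `consistency_index` and the scores are made up for illustration.

```python
from math import sqrt


def consistency_index(scores, not_outs=0):
    """average / st_dev, following the formula as written above.

    Assumption: diff_means is each score minus the plain mean score
    per innings (not the batting average).
    """
    innings = len(scores)
    average = sum(scores) / (innings - not_outs)
    mean_score = sum(scores) / innings
    variance = sum((s - mean_score) ** 2 for s in scores) / innings
    return average / sqrt(variance)


# Hypothetical scores (made up): ten steady knocks, all dismissed.
steady = [30, 40, 50, 20, 60, 35, 45, 25, 55, 40]

# The same ten knocks plus one very big innings.
with_big = steady + [300]

print("steady:            ", round(consistency_index(steady), 2))
print("plus a big hundred:", round(consistency_index(with_big), 2))
print("two not outs:      ", round(consistency_index(steady, not_outs=2), 2))
```

The big innings lifts the average, but it lifts the standard deviation by far more, so the index falls. Counting two of the steady knocks as not outs raises the average while leaving the deviation untouched, so the index rises.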

It is an interesting figure, and consistency is in the mix, but so are lots of other factors you don't want, which do nothing but skew the figures.

There are at least two better ways to measure consistency. One is the median, which gives the central knock and is a reasonable way of telling how often a player gets a start. Another would be to remove the innings bias by dividing the current index by the square root of the number of innings (unsurprisingly, Bradman dominates given this measure).
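
As a small illustration of why the median is a steadier guide, here is a sketch using Python's standard `statistics` module. The scores are invented, and the 20-run threshold for "a start" is my own arbitrary choice.

```python
from statistics import mean, median

# Hypothetical scores (made up): mostly low, with one very large innings.
scores = [0, 4, 7, 12, 15, 18, 23, 31, 46, 64, 210]

# The same innings, but with the big score cut down to a plain hundred.
smaller_top = scores[:-1] + [110]

# Share of innings reaching 20, an arbitrary threshold for "a start".
start_rate = sum(1 for s in scores if s >= 20) / len(scores)

print("mean:  ", round(mean(scores), 1), "->", round(mean(smaller_top), 1))
print("median:", median(scores), "->", median(smaller_top))
print("starts:", round(start_rate, 2))
```

The mean moves a long way when the top score changes; the median does not move at all, because it only depends on the middle of the distribution.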

Regardless, a more than cursory examination of the statistics being produced would also help. Just because a statistic is constructed to say something doesn't mean it necessarily does.

Cricket - Analysis 29th April, 2006 19:09:44

sqrt? diff_means^2? You lost me, Russ.

Mind you I'm the sort of writer that can't count past ten without taking my socks off, so make of that what you will.

Statistics are a useful tool for historical analysis, but they are a hopeless predictor. As they say in financial circles, past results are no guarantee of future performance.
Scott Wickstein  1st May, 2006 10:14:22