We have seen that the two most commonly used ways of summarising the centre of a set of observed values are the mean and the median. The mean is the arithmetic average, and the median is the middle value once the observations are sorted.
Andrew Vickers uses the following example to illustrate why we need both measures and what goes wrong when there are outliers. Seven people with annual incomes of $85,000, $50,000, $60,000, $40,000, $75,000, $100,000 and $45,000 are at a dinner. Bill Gates walks in. How does the distribution of salaries in the room change?
Before Gates
Before Mr Gates walked in, the average salary was ($85,000 + $50,000 + $60,000 + $40,000 + $75,000 + $100,000 + $45,000) / 7 = $65,000. To find the median, we first arrange the numbers in ascending order: $40,000, $45,000, $50,000, $60,000, $75,000, $85,000, $100,000. With seven values, the midpoint is the fourth one, $60,000, which is the median.
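The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function names `mean` and `median` and the variable `incomes` are my own choices for illustration, and the median is computed the long way (sort, then pick the middle) to mirror the steps in the text.

```python
# Annual incomes of the seven diners, as listed in the example.
incomes = [85_000, 50_000, 60_000, 40_000, 75_000, 100_000, 45_000]


def mean(values):
    """Arithmetic average: sum of the values divided by their count."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the sorted data; for an even count,
    the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(f"mean:   ${mean(incomes):,.0f}")    # mean:   $65,000
print(f"median: ${median(incomes):,.0f}")  # median: $60,000
```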
After Gates
The picture changes once Mr Gates enters the room. Let’s assume his annual income (!) is $1 billion (the highest number I could envision). The mean is now $1,000,455,000 / 8 = $125,056,875, or $125 million and a bit. And the median? With eight values there is no single middle one, so we average the fourth and fifth: ($60,000 + $75,000) / 2 = $67,500.
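Here is a small sketch of the same calculation using Python's standard `statistics` module. The $1 billion figure is the assumption made in the text, not a real income, and the variable names are mine.

```python
from statistics import mean, median

# The seven original diners.
diners = [85_000, 50_000, 60_000, 40_000, 75_000, 100_000, 45_000]

# Assumed income for Mr Gates, as in the text (an illustrative figure).
gates_income = 1_000_000_000

with_gates = diners + [gates_income]

mean_after = mean(with_gates)
median_after = median(with_gates)  # averages the two middle values for even n

print(f"mean after:   ${mean_after:,.0f}")    # mean after:   $125,056,875
print(f"median after: ${median_after:,.0f}")  # median after: $67,500
```

One outlier moves the mean from $65,000 to about $125 million, while the median shifts by only $7,500.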
You might say the median ($67,500) better represents the crowd of upper-middle-class people (and one billionaire). The mean, the so-called average, appears helpless here.
The session cannot be complete without invoking my favourite plot of all – the box plot.
You may have noticed that seven of the eight incomes fall below the mean.
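That count is easy to verify. The sketch below again assumes the $1 billion income used in the example and simply counts how many of the eight values are smaller than the mean; the names `incomes`, `avg` and `below` are illustrative.

```python
from statistics import mean

# Seven diners plus the assumed $1 billion income for Mr Gates.
incomes = [85_000, 50_000, 60_000, 40_000, 75_000, 100_000, 45_000,
           1_000_000_000]

avg = mean(incomes)

# Keep only the incomes strictly below the mean.
below = [x for x in incomes if x < avg]

print(f"{len(below)} of {len(incomes)} incomes are below the mean")
# 7 of 8 incomes are below the mean
```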
Reference
Andrew Vickers, What is a p-value anyway? 34 Stories to Help You Actually Understand Statistics