AC9M9ST03 · YEAR 9 · STATISTICS

Comparing Data Distributions

ACARA v9 CONTENT DESCRIPTION: represent the distribution of multiple data sets for numerical variables using comparative representations; compare data distributions with consideration of centre, spread and shape, and the effect of outliers on these measures
Builds on: Reading Surveys: the mean and median. This unit extends that work on the mean and median, and on reading data displays. Comparing distributions by centre, spread and shape is the foundation for choosing displays and planning investigations in the rest of this strand.

Comparing whole distributions, not just averages

Comparing two or more sets of data is one of the most useful things statistics can do: which class scored higher, which machine is more consistent, which suburb has the wider range of house prices. To compare fairly you need more than a single average; you need to describe and contrast whole distributions. This unit builds the tools for that, centred on the five-number summary and the box plot, and on three lenses for comparison: centre, spread and shape.

The five-number summary

The five-number summary captures a distribution in five values: the minimum, the lower quartile (Q1), the median, the upper quartile (Q3), and the maximum. The median splits the sorted data in half. The lower quartile is the median of the lower half and the upper quartile is the median of the upper half; when there is an odd number of values, the overall median itself is left out of both halves. For the nine values two, four, five, six, seven, eight, nine, eleven and thirteen, the median is seven, the lower quartile is the middle of two, four, five, six, which is four and a half, and the upper quartile is the middle of eight, nine, eleven, thirteen, which is ten.

From data to five numbers
The nine sorted values, with the five-number summary marked: minimum, lower quartile, median, upper quartile and maximum.
The five-number summary of 2, 4, 5, 6, 7, 8, 9, 11, 13 is minimum 2, lower quartile 4.5, median 7, upper quartile 10 and maximum 13. The median splits the data in half, and each quartile is the median of a half.
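The median-of-halves rule described above can be checked with a short Python sketch. It is an illustration only: the function name five_number_summary is made up for this example, and it assumes the data has at least two values. Calculators and spreadsheets sometimes use other quartile conventions, which can give slightly different answers.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    lower = data[:n // 2]            # values before the middle position
    upper = data[n // 2 + n % 2:]    # skips the median itself when n is odd
    return data[0], median(lower), median(data), median(upper), data[-1]

print(five_number_summary([2, 4, 5, 6, 7, 8, 9, 11, 13]))
# (2, 4.5, 7, 10.0, 13)
```

Sorting first means the values can be entered in any order.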

The box plot

A box plot draws this summary as a picture. A box stretches from the lower quartile to the upper quartile, with a line inside it at the median, and whiskers reach out to the minimum and maximum. The box therefore holds the middle half of the data, and its length is the interquartile range. Drawing two box plots on the same scale makes a comparison immediate: you can see at a glance which distribution sits higher and which is more spread out, without reading a single number.

Anatomy of a box plot
The same data as a box plot: the box spans the quartiles, a line marks the median, and whiskers reach the extremes.
The box runs from the lower quartile 4.5 to the upper quartile 10, so it holds the middle half of the data. The line inside marks the median 7, and the whiskers reach out to the minimum 2 and the maximum 13.
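A rough text version of this picture can be produced with a few lines of Python. This is a sketch for illustration only: the function name text_box_plot, the symbols used and the 0 to 15 scale are choices made here, not a standard.

```python
def text_box_plot(minimum, q1, med, q3, maximum, lo, hi, width=31):
    """Draw one box plot as a line of text on a scale running from lo to hi."""
    def col(x):
        # Convert a data value to a character position on the scale.
        return round((x - lo) * (width - 1) / (hi - lo))

    cells = [" "] * width
    for i in range(col(minimum), col(maximum) + 1):
        cells[i] = "-"               # whiskers from minimum to maximum
    for i in range(col(q1), col(q3) + 1):
        cells[i] = "="               # box from Q1 to Q3
    cells[col(minimum)] = "|"        # end of the lower whisker
    cells[col(maximum)] = "|"        # end of the upper whisker
    cells[col(med)] = "M"            # median line inside the box
    return "".join(cells)

print(text_box_plot(2, 4.5, 7, 10, 13, lo=0, hi=15))
#     |----=====M======-----|
```

Drawing a second data set with the same lo and hi puts both plots on a common scale, which is what makes a comparison fair.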

Spread: range and the IQR

Spread can be measured in two ways, and the difference matters. The range is simply the maximum minus the minimum: easy to find, but sensitive to a single extreme value. The interquartile range, or IQR, is the upper quartile minus the lower quartile, which is the width of the middle fifty percent of the data. Because it ignores the smallest and largest quarters, the IQR is resistant to outliers and is usually the better measure of how spread out the bulk of the data is. For the values above, the range is thirteen minus two, which is eleven, while the IQR is ten minus four and a half, which is five and a half.

Range versus IQR
Two spreads on the one box plot: the full range against the interquartile range, the width of the box.
The range, 13 - 2 = 11, spans everything and reacts to extremes. The IQR, 10 - 4.5 = 5.5, is the box width, the middle half, and is resistant to outliers. The IQR is usually the better measure of spread.
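To see the difference in sensitivity, the Python sketch below computes both measures for the nine values, then again after the largest value is swapped for a much bigger one. The second data set is invented purely for illustration, and the function names are made up for this example.

```python
from statistics import median

def data_range(values):
    """Maximum minus minimum."""
    return max(values) - min(values)

def iqr(values):
    """Upper quartile minus lower quartile, using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[:n // 2])            # median of the lower half
    q3 = median(data[n // 2 + n % 2:])    # median of the upper half
    return q3 - q1

original = [2, 4, 5, 6, 7, 8, 9, 11, 13]
stretched = [2, 4, 5, 6, 7, 8, 9, 11, 130]   # hypothetical: 13 replaced by 130

print(data_range(original), iqr(original))     # 11 5.5
print(data_range(stretched), iqr(stretched))   # 128 5.5
```

One extreme value changes the range from 11 to 128, while the IQR stays at 5.5.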

Spotting outliers

Outliers, values far from the rest, deserve special care, and there is a standard rule for spotting them. A value is treated as an outlier if it lies below the lower quartile minus one and a half times the IQR, or above the upper quartile plus one and a half times the IQR. Take the data three, five, six, seven, eight, nine and forty. The lower quartile is five and the upper quartile is nine, so the IQR is four, and one and a half times the IQR is six. The lower boundary is five minus six, which is negative one, and no value falls below it. The upper boundary is nine plus six, which is fifteen, and since forty is well above fifteen it is an outlier. An outlier inflates the range and pulls the mean towards it, but leaves the median and IQR almost untouched.

The outlier rule
A fence one and a half times the IQR beyond each quartile flags values that sit unusually far out.
For the data 3, 5, 6, 7, 8, 9, 40, a value is an outlier if it lies below Q1 - 1.5 × IQR = 5 - 6 = -1 or above Q3 + 1.5 × IQR = 9 + 6 = 15. Here 40 is far beyond 15, so it is the outlier: it inflates the range and pulls the mean, but barely moves the median or IQR.
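The fence calculation, and the effect of the outlier on the mean and median, can be checked with the Python sketch below. The function name outlier_fences is made up for this example, and the quartiles are found with the median-of-halves rule used in this unit.

```python
from statistics import mean, median

def outlier_fences(values):
    """Return (lower fence, upper fence) using the 1.5 x IQR rule."""
    data = sorted(values)
    n = len(data)
    q1 = median(data[:n // 2])            # median of the lower half
    q3 = median(data[n // 2 + n % 2:])    # median of the upper half
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step

data = [3, 5, 6, 7, 8, 9, 40]
low, high = outlier_fences(data)
outliers = [v for v in data if v < low or v > high]
kept = [v for v in data if low <= v <= high]

print(low, high, outliers)                          # -1.0 15.0 [40]
print(round(mean(data), 2), round(mean(kept), 2))   # 11.14 6.33
print(median(data), median(kept))                   # 7 6.5
```

With 40 included the mean is about 11.14; without it the mean is about 6.33. The median only shifts from 7 to 6.5.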

Comparing centre, spread and shape

With these tools, comparing distributions comes down to three questions. For centre, compare the medians: a data set whose box and median line sit higher has typically larger values. For spread, compare the IQRs or ranges: a shorter box means more consistent, less variable data. For shape, look at symmetry. A distribution is right-skewed if it has a long tail of high values, which pulls the mean above the median; left-skewed if the tail runs the other way, which pulls the mean below the median; and roughly symmetric if the median sits centrally and the mean is close to it.

Comparing two distributions
Two box plots on one scale. Box B sits well to the right of box A, with a similar width.
On a common scale, B's box sits well to the right of A's, so B's centre is higher: median 15.5 against 7. The boxes are a similar width, so the spreads are alike: an IQR of 6 against 5.5.

Shapes: symmetric and skewed
Three shapes of distribution and what each does to the mean compared with the median.
When a distribution is symmetric, the mean ≈ the median. A right-skewed tail of high values pulls the mean above the median; a left-skewed tail pulls it below. The shape tells you which average to trust: for skewed data, the median is usually the more representative.
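The link between shape and the two averages can be tried out with the Python sketch below. The three small data sets are invented for illustration, and the function name mean_against_median is made up for this example. Comparing the mean with the median is a rough indicator of skew, not a definition of it.

```python
from statistics import mean, median

def mean_against_median(values):
    """Say whether the mean sits above, below or on the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "mean above median"
    if m < med:
        return "mean below median"
    return "mean equals median"

# Hypothetical data sets, chosen only to show each shape.
symmetric = [2, 4, 6, 8, 10]
right_skewed = [2, 3, 4, 5, 26]      # long tail of high values
left_skewed = [2, 23, 24, 25, 26]    # long tail of low values

print(mean_against_median(symmetric))      # mean equals median
print(mean_against_median(right_skewed))   # mean above median
print(mean_against_median(left_skewed))    # mean below median
```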

A worked comparison

This example ties the ideas together. Suppose data set A has a median of seven and an IQR of five and a half, while data set B has a median of fifteen and a half and an IQR of six. Comparing centres, B is typically higher, since its median is more than double A's. Comparing spreads, B is slightly more variable, with a marginally larger IQR, although the difference is small enough that the two spreads are much alike. Shape would be judged from the box plots themselves, by checking whether each median sits centrally in its box and whether one whisker is much longer than the other. Reporting a comparison in these terms, centre then spread then shape, supported by box plots on a common scale, is exactly what it means to compare distributions properly, rather than relying on one number to tell a whole story.
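The centre and spread parts of this comparison can be written as a small Python sketch that works from the summary values alone. The function name compare and the wording of its sentences are choices made for this example.

```python
def compare(name_a, median_a, iqr_a, name_b, median_b, iqr_b):
    """Return two sentences comparing centre and spread from the summaries."""
    def verdict(a, b, phrase):
        if a == b:
            return f"{name_a} and {name_b} are equal"
        return f"{name_b if b > a else name_a} is {phrase}"

    centre = verdict(median_a, median_b, "typically higher")
    spread = verdict(iqr_a, iqr_b, "more variable")
    return (
        f"Centre: {centre} (medians {median_a} and {median_b}).",
        f"Spread: {spread} (IQRs {iqr_a} and {iqr_b}).",
    )

for sentence in compare("A", 7, 5.5, "B", 15.5, 6):
    print(sentence)
# Centre: B is typically higher (medians 7 and 15.5).
# Spread: B is more variable (IQRs 5.5 and 6).
```

The sketch only says which value is larger; judging whether a difference is big enough to matter is still up to the person writing the comparison.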

Quick self-check
1. For 2, 4, 5, 6, 7, 8, 9, 11, 13, what is the median?
2. For the same data 2, 4, 5, 6, 7, 8, 9, 11, 13, what is the lower quartile (Q1)?
3. Which measure of spread is RESISTANT to outliers?
4. Data 3, 5, 6, 7, 8, 9, 40 has Q1 = 5 and Q3 = 9 (IQR = 4). Is 40 an outlier?
5. Box plot A has median 7; box plot B has median 15.5, on the same scale. Comparing CENTRE, which is typically higher?