How Wrong Is the Mean? | The Null Hypothesis

The mean gives you a single number to represent an entire dataset. That number is almost always wrong for any individual observation — the question is by how much, in what direction, and whether the wrongness is systematic or random. The measures in this unit exist to answer that question. Some are crude. Some are elegant. Some only work on certain data types. Together they tell you something the mean never could: the shape of the uncertainty around it.

Interactive 19: The Percentile Machine

Percentiles don't describe the center of data; they describe position within it. Scoring in the 90th percentile on an exam doesn't mean you got 90%. It means 90% of the scores were below yours. Quartiles are just percentiles with special names: Q1 is the 25th, Q2 is the 50th (the median), and Q3 is the 75th. The ogive you built in unit 3 was answering percentile questions the whole time; you just didn't call them that yet.

Enter a percentile to see its exact position in the dataset. The rank formula interpolates if the position isn't a whole number.

Target Percentile (1-99): 50

Data positions: i = 1 to i = 8 (n = 8)

Rank Formula

i = \frac{P}{100}(n + 1)

i = \frac{50}{100}(8 + 1)

i = 4.50

Resulting Value

23.50

Interpolated 50% of the way between the 4th value (22) and the 5th value (25).
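The rank formula and its interpolation step can be sketched in a few lines of Python. This is a minimal illustration, not the interactive's own code: the function name `percentile_value`, the clamping at the ends of the data, and every value in `data` other than the 4th (22) and 5th (25) are assumptions made for the example.

```python
def percentile_value(sorted_values, p):
    """Value at the p-th percentile using the rank formula i = (p / 100) * (n + 1)."""
    n = len(sorted_values)
    i = (p / 100) * (n + 1)  # 1-based position; may be fractional

    # Assumption: positions outside 1..n are clamped to the smallest/largest value.
    if i <= 1:
        return sorted_values[0]
    if i >= n:
        return sorted_values[-1]

    lower = int(i)           # whole part of the position
    fraction = i - lower     # how far to move toward the next value
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)


# Hypothetical sorted data: only the 4th (22) and 5th (25) values come from the text.
data = [12, 15, 19, 22, 25, 28, 31, 36]
median = percentile_value(data, 50)  # position 4.5, halfway between 22 and 25
```

With eight values, the 50th percentile lands at position 4.5, so the result is halfway between the 4th and 5th values, matching the 23.50 shown above.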

Interactive 20: The Spread Spectrum

Range is the simplest possible measure of spread: largest minus smallest. It takes two seconds to compute and ignores every observation in between. The Interquartile Range fixes the main weakness — instead of measuring the full span of the data, it measures the span of the middle 50%, making it immune to the extreme values that make range unreliable. Neither measure uses all the data. That limitation is what variance exists to solve.

Dataset A: Tightly Clustered

Range: 13 - 9 = 4

IQR: 12 - 10 = 2

Variance: s^2 = 2.5

Dataset B: Widely Spread

Range: 21 - 1 = 20

IQR: 16 - 6 = 10

Variance: s^2 = 62.5
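The three measures can be computed side by side with Python's standard `statistics` module. This is a sketch under stated assumptions: the page does not list the underlying observations, so `dataset_a` and `dataset_b` are hypothetical five-value datasets chosen because they reproduce the figures shown, and the quartiles use the module's "inclusive" method, which is one of several quartile conventions.

```python
import statistics


def spread_summary(values):
    """Return range, IQR, and sample variance for a list of numbers."""
    data = sorted(values)
    # Assumption: "inclusive" quartiles; other conventions can give other cut points.
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "range": data[-1] - data[0],
        "iqr": q3 - q1,
        "variance": statistics.variance(data),  # sample variance (n - 1 divisor)
    }


# Hypothetical datasets that reproduce the readouts above.
dataset_a = [9, 10, 11, 12, 13]
dataset_b = [1, 6, 11, 16, 21]

summary_a = spread_summary(dataset_a)
summary_b = spread_summary(dataset_b)
```

Range and IQR each read off just two positions in the sorted data, while the variance line is the only one that touches every observation.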

Interactive 21: The Variance Engine

Variance measures how far each observation strays from the mean, squares those distances to make them positive, and averages the result (for a sample, the sum is divided by n - 1 rather than n). The squaring does two things: it removes negative signs, and it punishes large deviations disproportionately, so an observation twice as far from the mean contributes four times as much to the variance. Standard deviation takes the square root to bring the measure back into the original units. Adding a constant to every value shifts the mean but leaves the variance untouched. Multiplying every value by a constant scales the variance by the square of that constant.

\bar{x} = 14

Sum of Deviations

\sum(x_i - \bar{x}) = 0

Variance

s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1} = \frac{166}{6-1} = 33.2

Standard Deviation

s = \sqrt{s^2} = \sqrt{33.2} \approx 5.8
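The calculation, along with the shift and scale properties, can be checked with a short Python sketch. The page does not list the six observations, so `data` below is a hypothetical dataset chosen to have a mean of 14 and squared deviations summing to 166; `sample_variance` is an illustrative helper, not the interactive's code.

```python
import math


def sample_variance(values):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


# Hypothetical data: mean 14, squared deviations sum to 166.
data = [6, 10, 14, 15, 16, 23]

mean = sum(data) / len(data)
deviation_sum = sum(x - mean for x in data)  # raw deviations always cancel out
variance = sample_variance(data)
std_dev = math.sqrt(variance)

# Adding a constant leaves the variance unchanged.
shifted_variance = sample_variance([x + 100 for x in data])
# Multiplying by 3 scales the variance by 3 squared.
scaled_variance = sample_variance([3 * x for x in data])
```

The raw deviations summing to zero is the reason they have to be squared before averaging: without the squaring, every dataset would report zero spread.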

Interactive 22: The Comparison Problem

A standard deviation of 12.7 years sounds large. A standard deviation of $16,097 sounds enormous. You cannot compare them — they live in different units on different scales. The Coefficient of Variation solves this by expressing standard deviation as a percentage of the mean, making spread dimensionless and therefore comparable. It answers the question raw standard deviation can't: relative to what it's describing, how spread out is this data?

Age (Years)

Mean (\bar{x}): 56.2

Standard Dev (s): 12.7

CV Calculation

CV = \frac{12.7}{56.2} \times 100 = 22.6\%

Second variable (dollars)

Mean (\bar{x}): 89,432

Standard Dev (s): 16,097

CV Calculation

CV = \frac{16097}{89432} \times 100 = 18.0\%

Age (Years) has greater relative variation despite having a smaller absolute standard deviation.
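The comparison reduces to one division per variable. The following is a minimal Python sketch using the means and standard deviations shown above; the function name `coefficient_of_variation` and the zero-mean guard are assumptions added for illustration.

```python
def coefficient_of_variation(mean, std_dev):
    """Standard deviation expressed as a percentage of the mean."""
    if mean == 0:
        # The CV is undefined when the mean is zero.
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return std_dev / mean * 100


age_cv = coefficient_of_variation(56.2, 12.7)       # years
dollar_cv = coefficient_of_variation(89432, 16097)  # dollars

# Both results are unit-free percentages, so they can be compared directly.
age_varies_more = age_cv > dollar_cv
```

Because the units cancel in the division, the two percentages sit on the same scale even though one input is in years and the other in dollars.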

Interactive 23: The Box Plot

A box plot compresses five numbers (minimum, Q1, median, Q3, maximum) into a single visual that shows center, spread, skewness, and outliers at once. The box holds the middle 50% of the data. The whiskers extend to the most extreme non-outlier values. Anything more than 1.5 × IQR beyond the box edges is flagged as an outlier. It is among the most information-dense standard statistical graphics.

Step 0: Raw Data

Start with the unorganized data points.

3310, 3925, 3355, 3480, 3490, 3540, 3520, 3650, 3730, 3480, 3550, 3450

Five-Number Summary

Minimum: 3310.0

Q1: 3472.5

Median: 3505.0

Q3: 3575.0

Maximum: 3925.0

IQR: 102.5

\text{Five-Number Summary: } [3310.0,\, Q_1(3472.5),\, Q_2(3505.0),\, Q_3(3575.0),\, 3925.0]
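The summary and the 1.5 × IQR rule can be reproduced from the twelve raw values with a short Python sketch. Two assumptions are worth flagging: the quartiles shown above match the "inclusive" method of `statistics.quantiles`, which is a different convention from the (n + 1) rank formula, and the helper name `box_plot_summary` and its dictionary keys are made up for this example.

```python
import statistics


def box_plot_summary(values):
    """Five-number summary, IQR, whisker ends, and flagged outliers."""
    data = sorted(values)
    # Assumption: "inclusive" quartiles, which reproduce the figures shown above.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1

    # Fences sit 1.5 * IQR beyond the box edges.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]

    return {
        "five_numbers": (data[0], q1, median, q3, data[-1]),
        "iqr": iqr,
        "whiskers": (inside[0], inside[-1]),  # most extreme non-outlier values
        "outliers": [x for x in data if x < lower_fence or x > upper_fence],
    }


raw = [3310, 3925, 3355, 3480, 3490, 3540, 3520, 3650, 3730, 3480, 3550, 3450]
summary = box_plot_summary(raw)
```

Under this convention the fences fall at 3318.75 and 3728.75, so the minimum and maximum in the five-number summary are not necessarily where the whiskers end.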

Interactive 24: The Z-Score and the CV

The Coefficient of Variation standardizes a distribution: it tells you how spread out it is relative to its mean. The Z-score standardizes an individual observation: it tells you how far a specific value sits from the mean, measured in standard deviations. Both are dimensionless. Both remove the problem of incompatible units. But they answer different questions: CV describes the whole distribution, Z-score locates one value within it. Ahmed's scores of 70 in statistics and 80 in mathematics cannot be compared directly, because each course has its own mean and spread. Expressed as Z-scores, they sit on a common scale and the comparison becomes straightforward.

Statistics: \mu = 65, \sigma = 2

Raw Score (X): 70.0

Translation Equation

Z = \frac{X - \mu}{\sigma} = \frac{70.0 - 65}{2} = 2.50

Mathematics: \mu = 70, \sigma = 5

Raw Score (X): 80.0

Translation Equation

Z = \frac{X - \mu}{\sigma} = \frac{80.0 - 70}{5} = 2.00

Unified Z-Score Scale (Standard Deviations): Stats at Z = 2.50, Math at Z = 2.00

Ahmed's Statistics performance is relatively stronger.
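The two translations can be written as one small Python function applied to each subject. This is a sketch using the means, standard deviations, and raw scores shown above; the function name `z_score` and the variable names are illustrative.

```python
def z_score(x, mean, std_dev):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / std_dev


stats_z = z_score(70.0, mean=65, std_dev=2)
math_z = z_score(80.0, mean=70, std_dev=5)

# The higher Z-score marks the relatively stronger performance.
stronger_subject = "Statistics" if stats_z > math_z else "Mathematics"
```

The raw score of 80 is the larger number, but it sits only 2 standard deviations above its course mean, while the 70 sits 2.5 above its own.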

Putting It All Together

Test your understanding of spread, variation, and standardisation. These exercises combine all the concepts from this unit into practical scenarios.

Scenario 1: The CV Decision

A factory measures two quality metrics: weight (mean = 500g, SD = 25g) and length (mean = 12cm, SD = 1.8cm). Which metric has greater relative variation?

Calculate CV for Weight (%)

Calculate CV for Length (%)

Which metric requires tighter quality control due to higher relative variation?


Unit 5: How Wrong Is the Mean?

NEXT Unit 6: When Variables Meet

"Most interesting questions in statistics involve two variables. Learn how to describe relationships using cross-tabulation and contingency tables."
