When Variables Meet | The Null Hypothesis

Until now every technique in this course has looked at one variable at a time. Center, spread, shape — all describing a single column of numbers in isolation. But most interesting questions in statistics involve two variables and the relationship between them. Does education level affect employment? Does house size predict price? Do book prices correlate with borrowing rates? The answer always depends on what kind of variables you're working with — because the method that works for nationalities and employment status will destroy you if you apply it to salaries and exam scores.

Interactive 25: The Contingency Table

A cross tabulation, also called a contingency table, does for two categorical variables what a frequency table does for one: it counts. Every cell is the number of observations that belong to both categories simultaneously. The marginal distributions (the row and column totals) collapse the table back into single-variable frequency distributions. The interesting part is what lives between the margins: the joint distribution, which shows not just how many people have each characteristic, but how those characteristics travel together.

Education	Employed	Unemployed	Row Total
High School	$40$	$10$	$50$
Bachelor's	$60$	$5$	$65$
Master's+	$30$	$2$	$32$
Col Total	$130$	$17$	$147$

Grand total percentages answer "what proportion of everyone?" Row percentages answer "what proportion within each category?" Column percentages answer "what proportion within each outcome?" Three different questions from the same table.
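The three kinds of percentage can be sketched in a few lines of plain Python. The cell counts are copied from the table above; the names `counts`, `percentages`, `grand_pct`, `row_pct`, and `col_pct` are illustrative choices, not part of the course material.

```python
# Cell counts copied from the education/employment table above.
counts = {
    "High School": {"Employed": 40, "Unemployed": 10},
    "Bachelor's": {"Employed": 60, "Unemployed": 5},
    "Master's+": {"Employed": 30, "Unemployed": 2},
}
outcomes = ["Employed", "Unemployed"]

# Marginal distributions: row totals, column totals, grand total.
row_totals = {edu: sum(cells.values()) for edu, cells in counts.items()}
col_totals = {o: sum(cells[o] for cells in counts.values()) for o in outcomes}
grand_total = sum(row_totals.values())


def percentages(denominator):
    """Return a table of percentages; denominator(edu, outcome) picks the base."""
    return {
        edu: {o: 100 * cells[o] / denominator(edu, o) for o in outcomes}
        for edu, cells in counts.items()
    }


grand_pct = percentages(lambda edu, o: grand_total)     # share of everyone
row_pct = percentages(lambda edu, o: row_totals[edu])   # share within education level
col_pct = percentages(lambda edu, o: col_totals[o])     # share within employment outcome

for edu in counts:
    for o in outcomes:
        print(f"{edu:12s} {o:10s} of all: {grand_pct[edu][o]:5.1f}%  "
              f"of row: {row_pct[edu][o]:5.1f}%  of column: {col_pct[edu][o]:5.1f}%")
```

For the 40 employed high school graduates, the same cell reads as about 27.2% of everyone, 80.0% of high school graduates, and about 30.8% of the employed.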

Interactive 26: Reading the Bars

A cross tabulation table has two natural bar chart representations — and they answer different questions. Grouping bars by education level lets you compare employment outcomes within each education group. Grouping bars by employment status lets you compare education profiles within each employment outcome. The data is identical. The question you're asking determines which grouping you choose.

[Interactive bar chart: Employed and Unemployed counts across High School, Bachelor's, and Master's+]

Same data. Two groupings. Two different stories. The grouping is an argument about what you're trying to show.

Interactive 27: The Scatter Plot

When both variables are quantitative, the cross tabulation approach loses too much information by forcing continuous data into categories. A scatter plot preserves every observation as a point in two-dimensional space. The pattern of those points — their direction, their tightness, their shape — is the relationship made visible before any formula touches it. Reading a scatter plot correctly is the prerequisite for everything that follows in this unit.

Pearson Correlation (r)

$$r = \frac{\text{Cov}(X, Y)}{S_x S_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} = \frac{2298.9}{\sqrt{4594.7 \times 1186.2}} = 0.985$$
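As a minimal sketch of the formula above, the function below computes r directly from the sums of deviations. The function name `pearson_r` and the house-size data are hypothetical and used only for illustration. The last line rechecks the arithmetic of the worked numbers in the formula.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation from the sums-of-deviations form of the formula."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical data: house size (square metres) and price (thousands).
sizes = [50, 70, 80, 100, 120]
prices = [150, 200, 230, 290, 330]
r = pearson_r(sizes, prices)
print(f"r = {r:.3f}")

# Recheck the worked numbers shown in the formula above.
r_worked = 2298.9 / sqrt(4594.7 * 1186.2)
print(f"worked example: {r_worked:.3f}")
```

Because the divisor (n − 1 for a sample, N for a population) appears in both the covariance and the product of standard deviations, it cancels, which is why the sums alone are enough here.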


Always plot before you calculate. A correlation coefficient without a scatter plot is a number without context.

Interactive 28: Spearman vs Pearson

Spearman measures whether the ranks of two variables move together. Pearson measures whether the values themselves move together linearly. For ordinal data you have no choice: use Spearman. For quantitative data Pearson is more powerful when the relationship is linear, but Spearman is more robust when the relationship is monotonic but curved or when outliers are present. Both produce a coefficient between −1 and +1. Both answer the same question, how strongly these variables are associated, but they listen to the data differently.

Pearson Correlation (Values)

$$r_{\text{pearson}} = \frac{\text{Cov}(X, Y)}{S_x S_y} = \frac{143.0}{\sqrt{112.0 \times 207.9}} = 0.937$$

Spearman Rank Correlation (Ranks)

$$r_{\text{spearman}} = 1 - \frac{6 \sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 0.0}{7(7^2 - 1)} = 1.000$$
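The contrast between the two coefficients can be sketched on a small made-up data set in which y rises with x along a curve. The data and the helper names (`average_ranks`, `pearson_r`, `spearman_rho`, `spearman_shortcut`) are assumptions for illustration and do not reproduce the seven points behind the figures above. Note that the shortcut formula with the sum of squared rank differences is only exact when there are no tied values; applying Pearson to the ranks handles ties as well.

```python
from math import sqrt


def average_ranks(values):
    """Rank from 1 (smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def spearman_rho(xs, ys):
    """Spearman as Pearson applied to ranks (works with ties)."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


def spearman_shortcut(xs, ys):
    """1 - 6*sum(d^2) / (n(n^2 - 1)); exact only when there are no ties."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2
             for rx, ry in zip(average_ranks(xs), average_ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Hypothetical data: y always rises with x, but along a curve.
x = [1, 2, 3, 4, 5, 6, 7]
y = [v ** 3 for v in x]

print(f"Pearson:  {pearson_r(x, y):.3f}")
print(f"Spearman: {spearman_rho(x, y):.3f}")
```

Every increase in x comes with an increase in y, so the ranks agree perfectly and Spearman reports a perfect association, while Pearson falls short of 1 because the points do not lie on a straight line.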

Mixed Measurement Levels: If one variable is quantitative and the other is ordinal, always downgrade the quantitative variable to ordinal ranks and use Spearman. You cannot use Pearson unless both variables are strictly quantitative.

The choice between Spearman and Pearson isn't arbitrary. It follows from your measurement level and the shape of your relationship.

Interactive 29: Correlation Is Not Causation

A high correlation between two variables means they move together. It does not mean one causes the other. Ice cream sales and drowning deaths are positively correlated — not because ice cream causes drowning, but because both increase in summer when temperatures rise. The lurking variable is temperature. Correlation finds the pattern. It cannot tell you why the pattern exists. That question requires a different kind of evidence entirely.

The Admissions Paradox

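Overall, acceptance rates seem negatively correlated with test scores. But within each specific department, higher scores mean higher acceptance. The lurking variable is department competitiveness. This can happen when high-scoring applicants mostly apply to the most competitive departments, which reject most applicants, so pooling the departments reverses the trend seen inside each one. A reversal of this kind is known as Simpson's paradox.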
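A small numeric sketch shows how such a reversal can arise. All numbers below are invented for illustration: two hypothetical departments, each listing a few score levels with the acceptance rate at that level. The names `easy_dept`, `hard_dept`, and `pearson_r` are likewise assumptions.

```python
from math import sqrt


def pearson_r(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: (test score, acceptance rate at that score).
# The less competitive department draws lower scores but accepts most applicants.
easy_dept = [(50, 0.70), (55, 0.75), (60, 0.80)]
# The competitive department draws higher scores but accepts few applicants.
hard_dept = [(80, 0.20), (85, 0.25), (90, 0.30)]


def correlation(points):
    scores = [s for s, _ in points]
    rates = [a for _, a in points]
    return pearson_r(scores, rates)


within_easy = correlation(easy_dept)
within_hard = correlation(hard_dept)
pooled = correlation(easy_dept + hard_dept)

print(f"within less competitive dept: {within_easy:+.2f}")
print(f"within competitive dept:      {within_hard:+.2f}")
print(f"all applicants pooled:        {pooled:+.2f}")
```

Inside each department the correlation is positive, yet the pooled correlation is negative, because the pooled data mostly compares one department against the other rather than one applicant against another.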

Every correlation in this gallery is mathematically real. Every causal interpretation is wrong.

Putting It All Together

Test your understanding of cross tabulations, correlation coefficients, and causal reasoning. These exercises combine all the concepts from this unit into practical scenarios.

Scenario 1: The Cross-Tabulation

A researcher surveyed 100 students. 60 students passed the exam, and 40 failed. Of the 60 who passed, 40 had studied and 20 had not. Of the 40 who failed, 10 had studied and 30 had not. Fill in the contingency table below.

Status \ Prep	Studied	Did Not Study
Passed
Failed


Unit 6: When Variables Meet

NEXT Unit 7: From Pattern to Prediction

"Correlation tells you two variables move together. Regression tells you the exact equation of that movement."
