Until now, every technique in this course has looked at one variable at a time. Center, spread, shape: all describe a single column of numbers in isolation. But most interesting questions in statistics involve two variables and the relationship between them. Does education level affect employment? Does house size predict price? Do book prices correlate with borrowing rates? How you answer always depends on what kind of variables you're working with, because the method that works for nationalities and employment status will destroy you if you apply it to salaries and exam scores.
A cross tabulation table does for two categorical variables what a frequency table does for one: it counts. Every cell is the number of observations that belong to both categories simultaneously. The marginal distributions — the row and column totals — collapse the table back into single-variable frequency distributions. The interesting part is what lives between the margins: the joint distribution, which shows not just how many people have each characteristic, but how those characteristics travel together.
| Education | Employed | Unemployed | Row Total |
|---|---|---|---|
| High School | | | |
| Bachelor's | | | |
| Master's+ | | | |
| Col Total | | | |
Grand total percentages answer "what proportion of everyone?" Row percentages answer "what proportion within each category?" Column percentages answer "what proportion within each outcome?" Three different questions from the same table.
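The three kinds of percentage can be sketched in a few lines of plain Python. The counts below are invented purely for illustration (they are not course data), and the names `counts`, `total_pct`, `row_pct`, and `col_pct` are hypothetical. The only thing that changes between the three results is the denominator.

```python
# Hypothetical counts, made up for illustration only (not from the course data).
counts = {
    "High School": {"Employed": 20, "Unemployed": 30},
    "Bachelor's": {"Employed": 35, "Unemployed": 15},
    "Master's+": {"Employed": 45, "Unemployed": 5},
}
outcomes = ["Employed", "Unemployed"]

# Marginal distributions: row totals, column totals, and the grand total.
row_totals = {edu: sum(row.values()) for edu, row in counts.items()}
col_totals = {o: sum(row[o] for row in counts.values()) for o in outcomes}
grand_total = sum(row_totals.values())


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


# Same cell counts, three different denominators.
total_pct = {e: {o: pct(n, grand_total) for o, n in r.items()} for e, r in counts.items()}
row_pct = {e: {o: pct(n, row_totals[e]) for o, n in r.items()} for e, r in counts.items()}
col_pct = {e: {o: pct(n, col_totals[o]) for o, n in r.items()} for e, r in counts.items()}

for edu in counts:
    for o in outcomes:
        print(edu, o, total_pct[edu][o], row_pct[edu][o], col_pct[edu][o])
```

With these made-up counts, the "High School, Employed" cell is 13.3% of everyone, 40.0% of the high school group, and 20.0% of all employed people: one count, three answers, depending on the question asked.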
A cross tabulation table has two natural bar chart representations — and they answer different questions. Grouping bars by education level lets you compare employment outcomes within each education group. Grouping bars by employment status lets you compare education profiles within each employment outcome. The data is identical. The question you're asking determines which grouping you choose.
Same data. Two groupings. Two different stories. The grouping is an argument about what you're trying to show.
When both variables are quantitative, the cross tabulation approach loses too much information by forcing continuous data into categories. A scatter plot preserves every observation as a point in two-dimensional space. The pattern of those points — their direction, their tightness, their shape — is the relationship made visible before any formula touches it. Reading a scatter plot correctly is the prerequisite for everything that follows in this unit.
Always plot before you calculate. A correlation coefficient without a scatter plot is a number without context.
Spearman measures whether the ranks of two variables move together. Pearson measures whether the values themselves move together linearly. For ordinal data you have no choice: use Spearman. For quantitative data with a linear relationship Pearson is more powerful, but Spearman is more robust when the relationship is monotonic without being linear, or when outliers are present. Both produce a coefficient between −1 and +1. Both answer the same question, how strongly these variables are associated, but they listen to the data differently.
[Interactive comparison: Pearson (Values) vs. Spearman (Ranks)]
The choice between Spearman and Pearson isn't arbitrary. It follows from your measurement level and the shape of your relationship.
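To make the difference concrete, here is a minimal sketch of both coefficients in plain Python. The helper names `pearson`, `average_ranks`, and `spearman` are hypothetical, and the data (x from 1 to 5, y = x cubed) is a toy example chosen because it is perfectly monotonic but not linear. Spearman is computed here as Pearson applied to ranks, with tied values sharing the average of their ranks.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Ranks from 1 to n; tied values receive the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy data: y always rises with x, but along a curve rather than a line.
xs = [1, 2, 3, 4, 5]
ys = [x ** 3 for x in xs]

print(round(pearson(xs, ys), 3))   # below 1: the points do not sit on a line
print(round(spearman(xs, ys), 3))  # 1.0: the rank order agrees perfectly
```

Pearson comes out at roughly 0.94 because the curve bends away from any straight line, while Spearman is exactly 1 because the ordering of x and y is identical.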
A high correlation between two variables means they move together. It does not mean one causes the other. Ice cream sales and drowning deaths are positively correlated — not because ice cream causes drowning, but because both increase in summer when temperatures rise. The lurking variable is temperature. Correlation finds the pattern. It cannot tell you why the pattern exists. That question requires a different kind of evidence entirely.
The Admissions Paradox
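Overall, acceptance rates seem negatively correlated with test scores. But within each specific department, higher scores mean higher acceptance. The lurking variable is department competitiveness: if high scorers apply mostly to the most selective departments, pooling all departments pairs high scores with low acceptance rates.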
Every correlation in this gallery is mathematically real. Every causal interpretation is wrong.
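The Admissions Paradox above can be reproduced with a handful of made-up applicant records. Everything in this sketch is hypothetical: the two departments, the scores, and the accept/reject outcomes were invented so that the reversal is easy to see. It is an illustration of the mechanism, not real admissions data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical applicants: (department, test score, accepted as 1 or 0).
# "Open" admits most applicants and attracts lower scorers;
# "Selective" admits few and attracts higher scorers.
applicants = [
    ("Open", 50, 1), ("Open", 50, 0), ("Open", 60, 1), ("Open", 60, 1),
    ("Selective", 80, 0), ("Selective", 80, 0),
    ("Selective", 90, 1), ("Selective", 90, 0),
]


def score_vs_acceptance(records):
    """Correlation between test score and the accept/reject outcome."""
    scores = [score for _, score, _ in records]
    accepted = [outcome for _, _, outcome in records]
    return pearson(scores, accepted)


overall = score_vs_acceptance(applicants)
by_department = {
    dept: score_vs_acceptance([r for r in applicants if r[0] == dept])
    for dept in ("Open", "Selective")
}

print("overall:", round(overall, 2))        # negative when departments are pooled
for dept, r in by_department.items():
    print(dept + ":", round(r, 2))          # positive inside each department
```

Pooling the departments gives a negative correlation between score and acceptance, yet inside each department the correlation is positive. Nothing about any individual applicant changed; only the decision to ignore the department did.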
Test your understanding of cross tabulations, correlation coefficients, and causal reasoning. These exercises combine all the concepts from this unit into practical scenarios.
Scenario 1: The Cross-Tabulation
A researcher surveyed 100 students. 60 students passed the exam, and 40 failed. Of the 60 who passed, 40 had studied and 20 had not. Of the 40 who failed, 10 had studied and 30 had not. Fill in the contingency table below.
| Status \ Prep | Studied | Did Not Study |
|---|---|---|
| Passed | | |
| Failed | | |