SA101: Principles of Statistics 1
July 2026

From Pattern to Prediction

Correlation tells you two variables move together. Regression estimates the equation of that movement and hands you a machine for predicting values you've never seen. The least squares method finds the line that minimizes the sum of squared residuals: the total squared vertical distance between every observed point and the line's prediction. It will always produce a line. It produces one even when the relationship is curved, even when one outlier is pulling the slope sideways, even when the trend ended three years ago. The formula doesn't warn you. That's your job.

Interactive 30: The Residual

Every regression line makes a prediction for every observation. The residual is the gap between reality and the prediction (observed minus predicted): how wrong the line is at that specific point. Some residuals are positive (the line underestimated), some negative (it overestimated). For the least squares line with an intercept, they always sum to zero, which means the overestimates and underestimates cancel exactly. Because that sum is zero no matter how large the individual errors are, it cannot measure fit. The least squares method instead chooses the line that minimizes the sum of the squared residuals.

[Interactive plot: x-axis from 50 to 110, y-axis from 3.0 to 7.0. Drag the left handle to change the intercept and the right handle to change the slope.]
Equation: $\hat{y} = 1.57 + 0.041x$
Sum of Residuals: $\sum e_i = -0.00$
Loss (SSR): $\sum e_i^2 = 0.31$

The least squares line is the unique line for which no other straight line produces a smaller sum of squared residuals. Drag the line away from optimal and watch the loss grow.
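
The claim above can be checked numerically. What follows is a minimal Python sketch, not the widget's actual code: the seven data points are invented for illustration (chosen to sit in the plot's 50 to 110 range), and `fit_line`, `residuals`, and `ssr` are hypothetical helper names.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def residuals(xs, ys, b0, b1):
    """Observed minus predicted, one value per observation."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


def ssr(xs, ys, b0, b1):
    """Sum of squared residuals: the loss that least squares minimizes."""
    return sum(e ** 2 for e in residuals(xs, ys, b0, b1))


# Invented data for illustration only.
xs = [50, 60, 70, 80, 90, 100, 110]
ys = [3.6, 4.1, 4.3, 5.0, 5.2, 5.8, 6.0]

b0, b1 = fit_line(xs, ys)
best = ssr(xs, ys, b0, b1)

print(f"fitted line: y_hat = {b0:.2f} + {b1:.3f}x")
print(f"sum of residuals: {sum(residuals(xs, ys, b0, b1)):.6f}")
print(f"SSR at the optimum: {best:.4f}")

# Nudging either coefficient away from the optimum increases the loss.
for d0, d1 in [(0.2, 0.0), (-0.2, 0.0), (0.0, 0.005), (0.0, -0.005)]:
    shifted = ssr(xs, ys, b0 + d0, b1 + d1)
    print(f"shift ({d0:+}, {d1:+}): SSR = {shifted:.4f}")
```

Every shifted line reports a larger SSR than the fitted one, which is what dragging the handles away from the optimum shows.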

Interactive 31: The Equation Builder

The slope b₁ tells you how much Y changes for every one-unit increase in X. The intercept b₀ tells you the predicted value of Y when X is zero — which is sometimes meaningful and sometimes an extrapolation beyond your data. Both are derived from the same quantities you already computed for Pearson correlation: the covariance of X and Y, and the variance of X. The slope is just their ratio.

Before we build the equation, which variable are we trying to predict?

b₁ is the covariance-to-variance ratio. b₀ is where the line crosses the Y-axis. Together they define the prediction machine.
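
As a rough illustration of that ratio, here is a short Python sketch. The five data points are invented, the function names are hypothetical, and the intercept uses the standard identity b₀ = ȳ − b₁x̄ (the least squares line always passes through the point of means).

```python
def mean(values):
    return sum(values) / len(values)


def sample_covariance(xs, ys):
    """Sample covariance of X and Y (n - 1 divisor)."""
    x_bar, y_bar = mean(xs), mean(ys)
    total = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def sample_variance(xs):
    """Sample variance of X (n - 1 divisor)."""
    x_bar = mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)


# Invented data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope = sample_covariance(xs, ys) / sample_variance(xs)  # 1.5 / 2.5
intercept = mean(ys) - slope * mean(xs)                  # 4 - 0.6 * 3

print(f"b1 = {slope:.2f}, b0 = {intercept:.2f}")  # b1 = 0.60, b0 = 2.20
```

The n − 1 divisors cancel in the ratio, so the slope is the same whether you divide by n or n − 1, provided you use the same divisor in both places.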

Interactive 32: What R² Actually Means

R² is the proportion of variability in Y that is explained by X. An R² of 0.89 means 89% of the variation in FTE counts across hospitals is accounted for by the number of beds alone. The remaining 11% is left unexplained by the model: it reflects variables not in the equation (staffing policies, hospital type, location, patient complexity) and random variation. R² doesn't tell you if the relationship is causal. It tells you how much of the story your single predictor is telling.

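[Interactive figure with two panels: Total Variation (from Mean), measured around y = ȳ, and Unexplained Variation (Residuals), measured around y = ŷ. A slider runs from R² = 0.00 to R² = 1.00.]
Example reading: R² = 0.46, r (Pearson) = +0.68. The model explains 46% of the variance. The remaining 54% is left unexplained: it reflects variables not in our equation and random variation.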

R² = 0 means a straight line in your predictor explains none of the variation in Y. R² = 1 means every observation falls exactly on the line. Reality lives somewhere in between, and where it lives matters enormously.
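
To make the definition concrete, the sketch below computes R² as one minus the ratio of unexplained to total variation, and confirms that in simple linear regression it equals the square of Pearson's r. The data are invented and the function names are hypothetical.

```python
import math


def sums_of_squares(xs, ys):
    """Return (sxx, syy, sxy) around the means of X and Y."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxx, syy, sxy


def r_squared(xs, ys):
    """R^2 = 1 - SSR / SST for the least squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx, syy, sxy = sums_of_squares(xs, ys)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    unexplained = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - unexplained / syy  # syy is the total variation (SST)


def pearson_r(xs, ys):
    sxx, syy, sxy = sums_of_squares(xs, ys)
    return sxy / math.sqrt(sxx * syy)


# Invented data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r2 = r_squared(xs, ys)
r = pearson_r(xs, ys)
print(f"R^2 = {r2:.2f}, r = {r:+.2f}, r squared = {r * r:.2f}")
```

The same identity links the two figures quoted in this unit: r = +0.68 gives 0.68² ≈ 0.46.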

Interactive 33: Time as a Variable

When one of your variables is time, regression becomes forecasting. The independent variable X is no longer beds or passengers; it's years, months, or quarters, coded as integers starting from 1. The slope becomes a trend: how much does the variable grow per time period on average? The intercept becomes the baseline: the predicted value at time zero, one period before the first observation. Everything else is identical to what you already built. Time series analysis is just regression wearing a watch.

[Interactive chart: drag to scrub through time; includes an Extrapolation Warning indicator.]
Regression Equation: $\hat{y} = 6.16 + 2.669t$, where t = years since 1999
Forecasting: Time (t) = 16, Year = 2015, Predicted Sales = 48.86

The trend line is a regression line. The forecast is a prediction. Both are only as reliable as the data behind them.
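
The forecast shown for 2015 can be reproduced in a few lines. This is a sketch that takes the coefficients and the coding rule (t = years since 1999) from the example above; the function and constant names are hypothetical.

```python
BASE_YEAR = 1999   # t = years since 1999, so the year 2000 is t = 1
INTERCEPT = 6.16   # b0 from the regression equation above
SLOPE = 2.669      # b1: average growth in sales per year


def time_code(year):
    """Convert a calendar year into the integer time index t."""
    return year - BASE_YEAR


def forecast(year):
    """Predicted sales for a calendar year from the fitted trend line."""
    return INTERCEPT + SLOPE * time_code(year)


print(time_code(2015), round(forecast(2015), 2))  # 16 48.86
```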

Interactive 34: Don't Trust the Line Out Here

The regression line is built from your data. It knows what happened inside your observed range. It has no knowledge of what happens beyond it. Extrapolation — predicting Y for X values outside the range of your data — assumes the trend continues unchanged indefinitely. Sometimes that's reasonable over a short horizon. Often it isn't. The line doesn't know about recessions, policy changes, pandemics, or anything else that happened after your last data point. The formula keeps calculating. You have to know when to stop trusting it.

[Interactive chart: year axis from 2000 to 2050, with a slider from 2010 to 2050.]
Example reading: Year = 2010, Predicted ŷ = 35.5

Interpolation — predicting within your data range — is relatively safe. Extrapolation is an assumption dressed as a calculation.
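
One way to make that distinction explicit in code is to return a flag alongside every prediction. The sketch below reuses the trend line ŷ = 6.16 + 2.669t (t = years since 1999), which reproduces the 35.5 shown for 2010. The observed range of 2000 to 2015 is an assumption made for illustration, since the lesson does not state where the data end, and the function names are hypothetical.

```python
BASE_YEAR = 1999
B0, B1 = 6.16, 2.669

# Assumption for illustration: the lesson does not state the data range.
FIRST_OBSERVED_YEAR = 2000
LAST_OBSERVED_YEAR = 2015


def predict(year):
    """Trend-line prediction for a calendar year."""
    return B0 + B1 * (year - BASE_YEAR)


def predict_with_flag(year):
    """Return (prediction, is_extrapolation) for a calendar year."""
    inside = FIRST_OBSERVED_YEAR <= year <= LAST_OBSERVED_YEAR
    return predict(year), not inside


for year in (2010, 2030, 2050):
    y_hat, flagged = predict_with_flag(year)
    note = "extrapolation: treat with caution" if flagged else "inside observed range"
    print(f"{year}: y_hat = {y_hat:.1f} ({note})")
```

The arithmetic is identical for all three years. Only the flag records that the later two rest on the assumption that the trend continues unchanged.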

Interactive 35: What the Line Assumes

The least squares formula will always produce a line. It produces one even if your data is curved, even if one outlier is dragging the slope, even if the residuals show a systematic pattern the line is missing entirely. The formula doesn't warn you. It just calculates. Before trusting a regression line you need to verify three things: that the relationship actually looks linear in a scatter plot, that no single observation is exerting disproportionate influence on the slope, and that the residuals scatter randomly around zero — because systematic residuals mean the line is missing something real.

1. Linearity Assumption

Does a straight line adequately capture the relationship?

[Interactive panels: Scatter Plot; Residuals vs Fitted. Selector: Underlying DGP (data-generating process).]

2. Independence from Extreme Leverage

Is the model overly reliant on a single observation?

[Interactive plot: drag a point to test leverage.]
Regression Slopes: Current $b_1$ = 1.030; Baseline $b_1$ = 1.062; $\Delta b_1$ = -0.031
Intercept & Fit: Intercept $b_0$ = -0.03; R-Squared $R^2$ = 0.974

3. Homoscedasticity & Patterns

Do residuals have equal variance across all predicted values?

[Interactive selector: Select Residual Pattern]

The formula gives you a line. The scatter plot and residual plot tell you whether to trust it.
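
The second check, whether one observation is steering the slope, can be approximated without a plot by refitting the line with each point left out in turn. This is a simple leave-one-out sketch, not a formal leverage or Cook's distance calculation. The data are invented, with the last point deliberately placed far from the others, and the function names are hypothetical.

```python
def slope(xs, ys):
    """Least squares slope b1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def leave_one_out_slope_changes(xs, ys):
    """For each point i, return (slope without point i) - (slope with all points)."""
    full = slope(xs, ys)
    changes = []
    for i in range(len(xs)):
        xs_without = xs[:i] + xs[i + 1:]
        ys_without = ys[:i] + ys[i + 1:]
        changes.append(slope(xs_without, ys_without) - full)
    return changes


# Invented data: six points near y = x, plus one distant point that breaks the trend.
xs = [1, 2, 3, 4, 5, 6, 20]
ys = [1.1, 2.0, 2.9, 4.2, 5.1, 5.9, 8.0]

changes = leave_one_out_slope_changes(xs, ys)
most_influential = max(range(len(xs)), key=lambda i: abs(changes[i]))

for i, delta in enumerate(changes):
    print(f"drop point {i} (x = {xs[i]}): slope changes by {delta:+.3f}")
print(f"most influential point: index {most_influential}")
```

Dropping any of the first six points barely moves the slope, while dropping the distant point changes it substantially. A shift that large is the signal to inspect that observation before trusting the line.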

The Exercise

Test your understanding of the least squares method, regression equations, and extrapolation. These scenarios ask you to build the model and interpret its output correctly.

Scenario 1: A supermarket chain records annual sales from 1989 to 1994. You need to build a forecasting model to predict sales in 2000.
1. Time Coding
2. Calculate Slope (b₁)

Given: $\sum(x-\bar{x})(y-\bar{y}) = 16.5$, $\sum(x-\bar{x})^2 = 17.5$

3. Calculate Intercept (b₀)

Given: $\bar{x} = 3.5$, $\bar{y} = 15.53$

4. Forecast for 2000
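
One way to check your answers to the four steps is sketched below in Python. It assumes 1989 is coded as t = 1 (so t = year − 1988), which is consistent with the given x̄ = 3.5 and Σ(x − x̄)² = 17.5 for six years coded 1 to 6. The variable names are illustrative.

```python
# Step 1: time coding. Assumption: 1989 -> 1, 1990 -> 2, ..., 1994 -> 6.
FIRST_YEAR = 1989


def time_code(year):
    return year - FIRST_YEAR + 1


# Given quantities from the exercise.
sum_xy_dev = 16.5   # sum of (x - x_bar) * (y - y_bar)
sum_xx_dev = 17.5   # sum of (x - x_bar) ** 2
x_bar = 3.5
y_bar = 15.53

# Step 2: slope is the ratio of the two sums.
b1 = sum_xy_dev / sum_xx_dev

# Step 3: the line passes through (x_bar, y_bar).
b0 = y_bar - b1 * x_bar

# Step 4: forecast for 2000.
t_2000 = time_code(2000)
forecast_2000 = b0 + b1 * t_2000

print(f"b1 = {b1:.3f}")                                   # b1 = 0.943
print(f"b0 = {b0:.2f}")                                   # b0 = 12.23
print(f"t = {t_2000}, forecast = {forecast_2000:.2f}")    # t = 12, forecast = 23.54
```

The data end at t = 6, so a forecast at t = 12 reaches six periods beyond the observed range and should be read as an extrapolation.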