Biostatistics

The Vital Predictor

"Converting raw clinical data into actionable cardiovascular insights"

Introduction

Today, we're analyzing the bloodpress.txt dataset. The data was collected from 20 unrelated patients with high blood pressure levels. Our goal is to analyze the relationship between blood pressure and various variables to identify potential determinants and risk indicators of hypertension.

Since we're using Python, we're mainly working with:

  • pandas for data operations.
  • numpy for mathematical/algebraic operations.
  • seaborn for visualizations.

Those are the industry standard, but that doesn't mean they're necessarily the best/most efficient. I encourage you to always learn about more libraries.
For example, while pandas is a great library for easily loading and editing data, polars is a newer, much faster library (especially for big data).
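As a minimal sketch of the stack we'll lean on throughout (the polars import is only shown as the optional alternative mentioned above):

```python
# Core stack used throughout this case file.
import pandas as pd               # tabular data loading and manipulation
import numpy as np                # vectorized math / linear algebra
import seaborn as sns             # statistical visualizations
import matplotlib.pyplot as plt   # seaborn draws on top of matplotlib

# Optional: polars is a faster, lazy-evaluated alternative to pandas.
# import polars as pl
```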

Theorem I.I

The Ten Assumptions of OLS

Phase 1: Data Inspection

A good data analyst/scientist always starts by simply looking at the dataset at hand. We start by inspecting the raw clinical records of 20 unrelated patients with high blood pressure levels. Looking at this wall of a table, we can see the mean and median (50%) are almost identical. The 1st and 3rd quartiles (25% and 75% respectively) are also close in values; this indicates no outliers—or at least no strong ones.

Raw Dataset Preview (n=20)

| Pt ID | BP  | Age | Weight | BSA  | Dur | Pulse | Stress |
|-------|-----|-----|--------|------|-----|-------|--------|
| #1    | 105 | 47  | 85.4   | 1.75 | 5.1 | 63    | 33     |
| #2    | 115 | 49  | 94.2   | 2.10 | 3.8 | 70    | 14     |
| #3    | 116 | 49  | 95.3   | 1.98 | 8.2 | 72    | 10     |
| #4    | 117 | 50  | 94.7   | 2.01 | 5.8 | 73    | 99     |
| #5    | 112 | 51  | 89.4   | 1.89 | 7.0 | 72    | 95     |
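If you want to follow along, a minimal loading-and-inspection sketch might look like this (assuming bloodpress.txt is whitespace-delimited and sits next to your notebook; adjust the separator and path to your copy):

```python
import pandas as pd

# bloodpress.txt is whitespace/tab-delimited in the original source.
df = pd.read_csv("bloodpress.txt", sep=r"\s+")

print(df.head())       # the first few raw records (the preview above)
print(df.describe())   # mean vs. median (50%), quartiles, spread
```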

Phase 2: Variable Exploration

Now, tables are informative if you're looking for detailed statistics. But you know what's better for an overall look and for viewing the distribution of each variable? GRAPHS! By visualizing the relationships between variables, we can form initial hypotheses. Does stress actually raise blood pressure? Is age a bigger factor than weight? Explore the data yourself below.

Pearson Correlation: r = 0.659
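A sketch of the kind of pairwise exploration the interactive chart does, shown here for Age vs. BP (column names assumed to match the preview above):

```python
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

r, p = pearsonr(df["Age"], df["BP"])   # r ≈ 0.659 for this pair (see the heatmap below)
sns.regplot(data=df, x="Age", y="BP", ci=None)
plt.title(f"BP vs. Age (Pearson r = {r:.3f})")
plt.show()
```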

Note: Why KDE?

Kernel Density Estimation (KDE) is preferred over Histograms in our case.

Histograms use discrete, rectangular bars to show frequency counts within set intervals (bins), making them ideal for identifying exact counts and data spread. Kernel Density Estimation (KDE) creates a smooth, continuous probability density curve by summing small Gaussian curves over each data point, making it better for visualizing shape and identifying trends without binning bias.

Try selecting the same variable for both X and Y axes above to see the KDE distribution!
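In code, the two views differ by a single seaborn call; a minimal sketch for one variable (BP here):

```python
import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df["BP"], bins=8, ax=axes[0])    # discrete bins, exact counts
sns.kdeplot(df["BP"], fill=True, ax=axes[1])  # smooth density, no binning bias
axes[0].set_title("Histogram")
axes[1].set_title("KDE")
plt.show()
```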

Phase 3: Multicollinearity Check

Multicollinearity is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. Our Heatmap reveals a suspicious correlation between Weight and BSA (0.875). This redundancy inflates the variance of our coefficients, making the model unstable. We confirm this with the Variance Inflation Factor (VIF). A VIF over 5.0 typically indicates problematic multicollinearity.
|        | BP    | Age   | Weight | BSA   | Dur   | Pulse | Stress |
|--------|-------|-------|--------|-------|-------|-------|--------|
| BP     | 1     | 0.659 | 0.95   | 0.866 | 0.293 | 0.721 | 0.164  |
| Age    | 0.659 | 1     | 0.407  | 0.378 | 0.344 | 0.619 | 0.368  |
| Weight | 0.95  | 0.407 | 1      | 0.875 | 0.201 | 0.659 | 0.034  |
| BSA    | 0.866 | 0.378 | 0.875  | 1     | 0.131 | 0.465 | 0.018  |
| Dur    | 0.293 | 0.344 | 0.201  | 0.131 | 1     | 0.402 | 0.312  |
| Pulse  | 0.721 | 0.619 | 0.659  | 0.465 | 0.402 | 1     | 0.506  |
| Stress | 0.164 | 0.368 | 0.034  | 0.018 | 0.312 | 0.506 | 1      |
Values > 0.7 indicate high correlation. Notice Weight vs BSA (0.875).
Weight and BSA are highly correlated (VIF > 5), causing multicollinearity.
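The heatmap itself is essentially one seaborn call; a minimal sketch (the patient-ID column is assumed to be the first column and is dropped before correlating):

```python
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.iloc[:, 1:].corr()   # drop the patient ID column, correlate the rest
sns.heatmap(corr, annot=True, fmt=".3f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Pearson correlation matrix")
plt.show()
```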

The rule is simple:

  • A VIF above 5 indicates "mild" multicollinearity.
  • A VIF above 10 indicates severe multicollinearity.
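A minimal VIF check with statsmodels (predictors only, with a constant added so the per-variable VIFs are meaningful):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]])
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))   # Weight and BSA come out above 5; the rest are fine
```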

Phase 4: Model Tuning

The results are very promising! We have high values for Adjusted R-squared, F-statistic, and Log-Likelihood.
BUT
Dur and Stress have P-values well above the typical threshold of 0.05 (refer to H0 - Justify your alpha for a guide on selecting a reasonable alpha level that isn't 0.05), meaning they're statistically insignificant. Those variables aren't harmful, but they aren't useful either. Let's talk Backward Elimination.

Initial Model Results

R²: 0.926 · Adj. R²: 0.900 · AIC: 83.27

| Variable | Coef     | Std Err | t       | P>|t| | [0.025, 0.975]   |
|----------|----------|---------|---------|-------|------------------|
| const    | 114.0000 | 0.384   | 296.731 | 0.000 | 113.176, 114.824 |
| Age      | 1.4077   | 0.514   | 2.737   | 0.016 | 0.304, 2.511     |
| BSA      | 3.3512   | 0.471   | 7.114   | 0.000 | 2.341, 4.362     |
| Dur      | 0.1648   | 0.438   | 0.376   | 0.713 | -0.776, 1.105    |
| Pulse    | 1.7359   | 0.606   | 2.866   | 0.012 | 0.437, 3.035     |
| Stress   | -0.6206  | 0.483   | -1.284  | 0.220 | -1.657, 0.416    |
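A sketch of how a table like this is produced with statsmodels. The predictors are standardized (z-scored), which is why the intercept lands on the mean BP of ~114; Weight is already excluded because of the multicollinearity finding above:

```python
import statsmodels.api as sm

predictors = ["Age", "BSA", "Dur", "Pulse", "Stress"]
X = (df[predictors] - df[predictors].mean()) / df[predictors].std()  # z-score scaling
X = sm.add_constant(X)
y = df["BP"]

initial_model = sm.OLS(y, X).fit()
print(initial_model.summary())   # Adj. R², AIC, per-variable t and p values
```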

Backward Elimination

In Data Science, we prefer Parsimony: the simplest model that explains the data is usually the best. Keeping insignificant variables adds complexity without value.

We refine the model by dropping Dur and Stress and re-fitting. The result is a much cleaner model where every variable pulls its weight.
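The refit is the same call with the insignificant columns dropped; a short sketch continuing from the previous code (X and y as built there):

```python
import statsmodels.api as sm

# X and y from the previous sketch (standardized predictors + constant).
X_final = X.drop(columns=["Dur", "Stress"])   # keep const, Age, BSA, Pulse
final_model = sm.OLS(y, X_final).fit()
print(final_model.summary())
```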

Final Model Results

R²: 0.917 · Adj. R²: 0.902 · AIC: 81.57

| Variable | Coef     | Std Err | t       | P>|t| | [0.025, 0.975]   |
|----------|----------|---------|---------|-------|------------------|
| const    | 114.0000 | 0.381   | 299.488 | 0.000 | 113.193, 114.807 |
| Age      | 1.3535   | 0.501   | 2.699   | 0.016 | 0.290, 2.416     |
| BSA      | 3.5173   | 0.445   | 7.906   | 0.000 | 2.574, 4.460     |
| Pulse    | 1.4441   | 0.524   | 2.755   | 0.014 | 0.333, 2.555     |

Durbin-Watson: 2.420 · Jarque-Bera: 0.641

Analysis of Variance (ANOVA)

While coefficients tell us how much BP changes, ANOVA tells us how important each factor is.

"BSA accounts for approximately 80% of the explained variance in Blood Pressure. Age and Pulse matter, but they are minor players compared to body mass."

ANOVA Results (Type: II · Method: OLS)

| Source   | Sum Sq | DF | Mean Sq | F     | PR(>F)  |
|----------|--------|----|---------|-------|---------|
| Age      | 21.11  | 1  | 21.11   | 7.28  | 0.0158  |
| BSA      | 181.13 | 1  | 181.13  | 62.51 | < 0.001 |
| Pulse    | 21.99  | 1  | 21.99   | 7.59  | 0.0141  |
| Residual | 46.37  | 16 | 2.90    | -     | -       |
Table 3.2: Variance decomposition of the final model.
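statsmodels can produce this decomposition directly with anova_lm; a sketch, refitting the final model through the formula API so the design information is available (predictors standardized as before):

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Standardized copy of the predictors (BP stays on its original scale).
df_std = df.copy()
for col in ["Age", "BSA", "Pulse"]:
    df_std[col] = (df_std[col] - df_std[col].mean()) / df_std[col].std()

ols_fit = smf.ols("BP ~ Age + BSA + Pulse", data=df_std).fit()
print(anova_lm(ols_fit, typ=2))   # Type II sums of squares, F and p per term
```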

Phase 5: Diagnostic Analysis

We fitted our model, made sure every variable is significant, and confirmed it has a very good, if not excellent, fit. The final touch is to check the residuals. We need to make sure they are normally distributed (so that significance tests (t-tests, F-tests), confidence intervals, and prediction intervals are accurate) and have constant variance (so that the linear regression model produces reliable, unbiased, and efficient estimates). These two assumptions are critical for inference, not so much for prediction.

Residuals vs. Fitted

(Should be random cloud)

Q-Q Plot

(Should follow the red line)
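Both diagnostic plots take only a few lines; a sketch using the final_model fit from the earlier code:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

resid = final_model.resid
fitted = final_model.fittedvalues

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(fitted, resid)                       # should look like a random cloud
axes[0].axhline(0, color="red", linestyle="--")
axes[0].set(title="Residuals vs. Fitted", xlabel="Fitted values", ylabel="Residuals")

sm.qqplot(resid, line="45", fit=True, ax=axes[1])    # points should hug the red line
axes[1].set_title("Normal Q-Q")
plt.show()
```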

Normality Test (Shapiro-Wilk)

We fail to reject the null hypothesis of normality (p > 0.05). The residuals follow a normal distribution, allowing us to trust our confidence intervals.

Statistic (W): 0.9625 · P-Value: > 0.05
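The test itself is a single scipy call on the residuals:

```python
from scipy.stats import shapiro

w_stat, p_value = shapiro(final_model.resid)
print(f"W = {w_stat:.4f}, p = {p_value:.4f}")   # p > 0.05 → fail to reject normality
```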

Phase 6: Production Readiness

The final model passed all checks. We extracted the coefficients to build the prediction formula:
$$\text{BP} = 114.00 + 1.35(\text{Age}_{std}) + 3.52(\text{BSA}_{std}) + 1.44(\text{Pulse}_{std})$$
As the data guy, you'll usually work with developers to deploy an interactive tool like this. We simply export the model configuration and scaling parameters to a JSON file (click the code button to see how), and then build the production dashboard below.
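A minimal sketch of that hand-off file; the field names here are my own choice, not a fixed schema, and final_model / df come from the earlier code:

```python
import json

predictors = ["Age", "BSA", "Pulse"]
config = {
    "intercept": float(final_model.params["const"]),
    "coefficients": {p: float(final_model.params[p]) for p in predictors},
    # Scaling parameters so the dashboard can standardize raw inputs the same way.
    "scaler": {p: {"mean": float(df[p].mean()), "std": float(df[p].std())}
               for p in predictors},
}

with open("bp_model.json", "w") as f:
    json.dump(config, f, indent=2)
```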

Note: Weight vs. BSA

Back in the multicollinearity check, both Weight and BSA had VIF values above 5; the other predictors were fine.
Upon further investigation as to what "BSA" exactly is, it refers to "Body Surface Area" and is calculated as follows:

$$\text{BSA} = \sqrt{\frac{\text{Weight} \times \text{Height}}{3600}}$$

Where

$$\begin{align*} \text{BSA} &: \text{Body Surface Area } (m^2) \\ W &: \text{Weight } (kg) \\ H &: \text{Height } (cm) \end{align*}$$

and so it becomes clear: BSA is a function of Weight! Of course they're going to be highly correlated!
Having both variables is redundant and, as expected, causes multicollinearity, making estimates unreliable for inference (prediction is fine).

So, what do we do? Simple: we remove one of the two variables. Which variable to keep is up to you, fellow researcher, as it depends on its importance to your theoretical framework.
My opinion is to keep BSA, since it contains Weight.
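For reference, the formula above is trivial to compute; a hypothetical helper (the 178 cm height is an illustrative value, not from the dataset):

```python
import math

def body_surface_area(weight_kg: float, height_cm: float) -> float:
    """BSA in m², per the square-root formula quoted above."""
    return math.sqrt(weight_kg * height_cm / 3600)

print(body_surface_area(85.4, 178))   # ≈ 2.06 m² for an 85.4 kg, 178 cm patient
```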

[Interactive dashboard: enter Age, BSA, and Pulse (defaults 48, 2, 70) to get the predicted systolic pressure in mmHg and its risk condition, plus a sensitivity analysis, a risk profile topology, and a comparison of the current patient vs. the average hypertensive patient.]
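Under the hood, the dashboard prediction is just the exported JSON applied to standardized inputs; a sketch, with illustrative function and file names and assuming bp_model.json was written as shown earlier:

```python
import json

with open("bp_model.json") as f:
    cfg = json.load(f)

def predict_bp(age: float, bsa: float, pulse: float) -> float:
    """Predicted systolic BP (mmHg) from the exported model configuration."""
    raw = {"Age": age, "BSA": bsa, "Pulse": pulse}
    bp = cfg["intercept"]
    for name, value in raw.items():
        z = (value - cfg["scaler"][name]["mean"]) / cfg["scaler"][name]["std"]
        bp += cfg["coefficients"][name] * z
    return bp

print(round(predict_bp(age=48, bsa=2.0, pulse=70), 1))
```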

Key Insights

  • BSA Dominance: Body Surface Area is the strongest predictor. A 1 standard deviation increase in BSA raises BP by ~3.5 mmHg.
  • Pulse Impact: Heart rate is a significant runner-up. Every standardized unit increase adds another ~1.44 mmHg to the pressure.

Final Thoughts

A quick reality check: This dataset had 20 observations and started with 6 predictors. That is ~3.3 observations per predictor. The general rule of thumb is 10-15 observations per predictor to avoid overfitting. In a real-world clinical setting, we would need much more data to make medical decisions.

That said, you just went through the entire lifecycle of a data science project—from loading raw data and diagnosing statistical assumptions, all the way to extracting the coefficients into a JSON structure for software integration.

I encourage you to keep exploring and testing different methods on different datasets. Most importantly, focus on sharpening your theoretical intuition. Libraries like pandas and statsmodels will update and change, but the math behind them remains the same.