Biostatistics · Vol. I · March 2026

The Vital Predictor

Introduction

Today, we're analyzing the bloodpress.txt dataset.
The data was collected from 20 unrelated patients with high blood pressure.
Our goal is to analyze the relationship between blood pressure and various variables to identify potential determinants and risk indicators of hypertension.

Since we're using Python, we're mainly working with:

  • pandas for data operations.
  • numpy for mathematical/algebraic operations.
  • seaborn for visualizations.

These are the industry standard, but that doesn't mean they're always the best or most efficient choice. I encourage you to keep learning about other libraries.
For example, while pandas is a great library for easily loading and editing data, polars is a newer, much faster library (especially for big data).
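
To follow along, here is a minimal loading sketch. It assumes bloodpress.txt is whitespace-delimited with one header row; adjust the separator if your copy of the file differs.

```python
import pandas as pd

# Assumes a whitespace-delimited file with a header row:
# Pt, BP, Age, Weight, BSA, Dur, Pulse, Stress
df = pd.read_csv("bloodpress.txt", sep=r"\s+")

print(df.shape)   # expect (20, 8)
print(df.head())
```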

Theorem I.I: The Ten Assumptions of OLS

Phase 1: Data Inspection

Raw Dataset Preview (n=20)

Pt ID   BP    Age   Weight   BSA    Dur   Pulse   Stress
#1      105   47    85.4     1.75   5.1   63      33
#2      115   49    94.2     2.1    3.8   70      14
#3      116   49    95.3     1.98   8.2   72      10
#4      117   50    94.7     2.01   5.8   73      99
#5      112   51    89.4     1.89   7     72      95

Phase 2: Variable Exploration

[Interactive figure: scatter plot of selected variables. Pearson correlation: r = 0.659 (BP vs. Age).]

Note: Why KDE?

Kernel Density Estimation (KDE) is preferred over histograms in our case.

Histograms use discrete, rectangular bars to show frequency counts within set intervals (bins), making them ideal for reading exact counts and data spread. KDE creates a smooth, continuous probability density curve by summing small Gaussian curves (kernels) centered on each data point, making it better for visualizing overall shape and spotting trends without binning bias.

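For readers working offline, a small seaborn sketch of the same exploration, reusing the df loaded earlier (the Age/BP pairing is just one choice; any pair works):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter of BP against Age (r = 0.659, per the matrix in Phase 3)
sns.scatterplot(data=df, x="Age", y="BP")
plt.show()

# KDE of a single variable: a smooth density with no binning bias
sns.kdeplot(data=df, x="BP", fill=True)
plt.show()
```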

Phase 3: Multicollinearity Check

        BP     Age    Weight  BSA    Dur    Pulse  Stress
BP      1      0.659  0.95    0.866  0.293  0.721  0.164
Age     0.659  1      0.407   0.378  0.344  0.619  0.368
Weight  0.95   0.407  1       0.875  0.201  0.659  0.034
BSA     0.866  0.378  0.875   1      0.131  0.465  0.018
Dur     0.293  0.344  0.201   0.131  1      0.402  0.312
Pulse   0.721  0.619  0.659   0.465  0.402  1      0.506
Stress  0.164  0.368  0.034   0.018  0.312  0.506  1
Values above 0.7 indicate high correlation. For multicollinearity, what matters is correlation between predictors: notice Weight vs. BSA (0.875).
Weight and BSA are highly correlated (VIF > 5), causing multicollinearity.
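
Reproducing the matrix takes one pandas call; a hedged sketch, assuming the patient-ID column is named Pt:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise Pearson correlations (dropping the ID column, assumed
# here to be named "Pt"; adjust to your file's header)
corr = df.drop(columns=["Pt"]).corr().round(3)

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```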
The rule is simple:
  • A VIF above 5 indicates "mild" multicollinearity.
  • A VIF above 10 indicates severe multicollinearity.
Both Weight and BSA have VIF > 5; the other variables are fine.
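
A sketch of the VIF computation with statsmodels; it assumes the column names above and includes the constant, since VIFs are defined against the full design matrix:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIFs are computed column-by-column against the full design matrix
X = sm.add_constant(df[["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]])
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs.drop("const"))  # the constant's "VIF" is not meaningful
```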
Upon further investigation into what "BSA" actually is: it stands for "Body Surface Area", and (by the Mosteller formula) it is calculated as follows:

$$\text{BSA} = \sqrt{\frac{\text{Weight} \times \text{Height}}{3600}}$$

where

$$\begin{align*} \text{BSA} &: \text{Body Surface Area } (\mathrm{m}^2) \\ \text{Weight} &: \text{Weight } (\mathrm{kg}) \\ \text{Height} &: \text{Height } (\mathrm{cm}) \end{align*}$$
and so it becomes clear: BSA is a function of Weight! Of course they're going to be highly correlated!
Having both variables is redundant and, as expected, causes multicollinearity, making estimates unreliable for inference (prediction is fine).
So, what do we do? Simple: we remove one of the two variables. Which one to keep is up to you, fellow researcher, as it depends on each variable's importance to your theoretical framework.
My opinion is to keep BSA, since it already encodes Weight (along with Height).
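
To make the dependence concrete, a tiny sketch of the formula (the 70 kg / 170 cm patient is a made-up illustration):

```python
import math

def bsa(weight_kg: float, height_cm: float) -> float:
    """Mosteller body-surface-area formula, in square meters."""
    return math.sqrt(weight_kg * height_cm / 3600)

print(bsa(70, 170))  # ~1.82 m^2 for a hypothetical 70 kg, 170 cm patient
```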

Phase 4: Model Tuning

Initial Model Results

R²: 0.926 · Adj. R²: 0.900 · AIC: 83.27
Variable   Coef       Std Err   t         P>|t|   [0.025, 0.975]
const      114.0000   0.384     296.731   0.000   113.176, 114.824
Age        1.4077     0.514     2.737     0.016   0.304, 2.511
BSA        3.3512     0.471     7.114     0.000   2.341, 4.362
Dur        0.1648     0.438     0.376     0.713   -0.776, 1.105
Pulse      1.7359     0.606     2.866     0.012   0.437, 3.035
Stress     -0.6206    0.483     -1.284    0.220   -1.657, 0.416

Model Diagnostics & Parsimony

The results above are very promising! We have high values for adjusted R-squared, F-statistic, and log-likelihood. However, Dur and Stress have high p-values, meaning they are statistically insignificant.
In Data Science, we prefer Parsimony: the simplest model that explains the data is usually the best. Keeping insignificant variables adds complexity without value.
We refine the model by dropping Dur and Stress (Weight was already removed for redundancy back in Phase 3) and re-fitting. The result below is a much cleaner model where every variable pulls its weight.
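
A sketch of the refit with statsmodels. One assumption worth flagging: the constant of 114 equals the sample mean of BP, which suggests the predictors were standardized (z-scored) before fitting, so the sketch does the same:

```python
import statsmodels.api as sm

# Standardize the surviving predictors, then fit OLS with a constant
predictors = df[["Age", "BSA", "Pulse"]]
Z = sm.add_constant((predictors - predictors.mean()) / predictors.std())

final_model = sm.OLS(df["BP"], Z).fit()
print(final_model.summary())
```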

Final Model Results

R²: 0.917 · Adj. R²: 0.902 · AIC: 81.57

Variable   Coef       Std Err   t         P>|t|   [0.025, 0.975]
const      114.0000   0.381     299.488   0.000   113.193, 114.807
Age        1.3535     0.501     2.699     0.016   0.290, 2.416
BSA        3.5173     0.445     7.906     0.000   2.574, 4.460
Pulse      1.4441     0.524     2.755     0.014   0.333, 2.555

Durbin-Watson: 2.420 · Jarque-Bera: 0.641

Analysis of Variance (ANOVA)

While coefficients tell us how much BP changes, ANOVA tells us how important each factor is.

"BSA accounts for approximately 80% of the explained variance in Blood Pressure. Age and Pulse matter, but they are minor players compared to body mass."

Type: II · Method: OLS

Source     Sum Sq   df   Mean Sq   F       PR(>F)
Age        21.11    1    21.11     7.28    0.0158
BSA        181.13   1    181.13    62.51   < 0.001
Pulse      21.99    1    21.99     7.59    0.0141
Residual   46.37    16   2.90      -       -

Table 3.2: Variance decomposition of the final model.

Phase 5: Diagnostic Analysis

[Figures: Residuals vs. Fitted (linearity & homoscedasticity) and Normal Q-Q (ordered values vs. theoretical quantiles).]
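
Both panels can be reproduced offline with matplotlib and statsmodels, reusing final_model from the refit sketch:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs fitted: we want a flat, patternless cloud around zero
axes[0].scatter(final_model.fittedvalues, final_model.resid)
axes[0].axhline(0, linestyle="--")
axes[0].set(title="Residuals vs Fitted", xlabel="Fitted values", ylabel="Residuals")

# Normal Q-Q: points should track the 45-degree reference line
sm.qqplot(final_model.resid, line="45", fit=True, ax=axes[1])
axes[1].set_title("Normal Q-Q")

plt.tight_layout()
plt.show()
```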

Normality Test (Shapiro-Wilk)

We fail to reject the null hypothesis of normality (p > 0.05). The residuals are consistent with a normal distribution, so we can trust our confidence intervals.

Statistic (W): 0.9625 · P-value: > 0.05
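
The test itself is a single scipy call on the model residuals:

```python
from scipy import stats

w_stat, p_value = stats.shapiro(final_model.resid)
print(f"W = {w_stat:.4f}, p = {p_value:.4f}")  # p > 0.05: no evidence against normality
```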

Phase 6: Production Readiness

[Interactive calculator: clinical inputs (Age: 48, BSA: 2, Pulse: 70) yield a predicted systolic pressure in mmHg and a condition label, plus a sensitivity analysis, a risk-profile topology, and a comparison of the current profile vs. the average hypertensive patient.]
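
Offline, the same prediction is a short dot product. This sketch standardizes the new patient's values with the training data's means and standard deviations (required by the standardized fit, and an easy production bug to miss), then dumps the coefficients to JSON for software integration:

```python
import json

patient = {"Age": 48, "BSA": 2.0, "Pulse": 70}

# Standardize new inputs with the TRAINING data's statistics
bp_hat = final_model.params["const"]
for name, value in patient.items():
    z = (value - df[name].mean()) / df[name].std()
    bp_hat += final_model.params[name] * z

print(f"Predicted systolic BP: {bp_hat:.1f} mmHg")

# Coefficients as a JSON structure for downstream software
print(json.dumps(final_model.params.round(4).to_dict(), indent=2))
```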

Key Insights

  • BSA Dominance: Body Surface Area is the strongest predictor. A 1 standard deviation increase in BSA raises BP by ~3.5 mmHg.
  • Pulse Impact: Heart rate is a significant runner-up. Every standardized-unit increase adds another ~1.44 mmHg to the pressure.

Final Thoughts

A quick reality check: this dataset had 20 observations and started with 6 predictors. That is ~3.3 observations per predictor. The general rule of thumb is 10-15 observations per predictor to avoid overfitting. In a real-world clinical setting, we would need much more data to make medical decisions.
That said, you just went through the entire lifecycle of a data science project: from loading raw data and diagnosing statistical assumptions, all the way to extracting the coefficients into a JSON structure for software integration.
I encourage you to keep exploring and testing different methods on different datasets. Most importantly, focus on sharpening your theoretical intuition. Libraries like pandas and statsmodels will update and change, but the math behind them remains the same.
