Biostatistics

The Vital Predictor

"Converting raw clinical data into actionable cardiovascular insights"

Introduction

Today, we're analyzing the bloodpress.txt dataset. The data was collected from 20 unrelated patients with high blood pressure levels. Our goal is to analyze the relationship between blood pressure and various variables to identify potential determinants and risk indicators of hypertension.

Since we're using Python, we're mainly working with:

  • pandas for data operations.
  • numpy for mathematical/algebraic operations.
  • seaborn for visualizations.

Those are the industry standard, but that doesn't mean they're necessarily the best/most efficient. I encourage you to always learn about more libraries.
For example, while pandas is a great library for easily loading and editing data, polars is a newer, much faster library (especially for big data).
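As a minimal sketch of the stack we'll lean on throughout (the polars import is only shown as the optional alternative mentioned above):

```python
# Core stack used throughout this case file.
import pandas as pd               # tabular data loading and manipulation
import numpy as np                # vectorized math / linear algebra
import seaborn as sns             # statistical visualizations
import matplotlib.pyplot as plt   # seaborn draws on top of matplotlib

# Optional: polars is a faster, lazy-evaluated alternative to pandas.
# import polars as pl
```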

Theorem I.I

The Ten Assumptions of OLS

Phase 1: Data Inspection

A good data analyst/scientist always starts by simply looking at the dataset at hand. We start by inspecting the raw clinical records of 20 unrelated patients with high blood pressure levels. Looking at this wall of a table, we can see the mean and median (50%) are almost identical. The 1st and 3rd quartiles (25% and 75% respectively) are also close in values; this indicates no outliers—or at least no strong ones.

Raw Dataset Preview (n=20)

| Pt ID | BP  | Age | Weight | BSA  | Dur | Pulse | Stress |
|-------|-----|-----|--------|------|-----|-------|--------|
| #1    | 105 | 47  | 85.4   | 1.75 | 5.1 | 63    | 33     |
| #2    | 115 | 49  | 94.2   | 2.10 | 3.8 | 70    | 14     |
| #3    | 116 | 49  | 95.3   | 1.98 | 8.2 | 72    | 10     |
| #4    | 117 | 50  | 94.7   | 2.01 | 5.8 | 73    | 99     |
| #5    | 112 | 51  | 89.4   | 1.89 | 7.0 | 72    | 95     |
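If you want to follow along, a minimal loading-and-inspection sketch might look like this (assuming bloodpress.txt is whitespace-delimited and sits next to your notebook; adjust the separator and path to your copy):

```python
import pandas as pd

# bloodpress.txt is whitespace/tab-delimited in the original source.
df = pd.read_csv("bloodpress.txt", sep=r"\s+")

print(df.head())       # the first few raw records (the preview above)
print(df.describe())   # mean vs. median (50%), quartiles, spread
```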

Phase 2: Variable Exploration

Now, tables are informative if you're looking for detailed statistics. But you know what's better for an overall look and for viewing the distribution of each variable? GRAPHS! By visualizing the relationships between variables, we can form initial hypotheses. Does stress actually raise blood pressure? Is age a bigger factor than weight? Explore the data yourself below.

Pearson Correlation: r = 0.659
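A sketch of the kind of pairwise exploration the interactive chart does, shown here for Age vs. BP (column names assumed to match the preview above):

```python
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import pearsonr

r, p = pearsonr(df["Age"], df["BP"])   # r ≈ 0.659 for this pair (see the heatmap below)
sns.regplot(data=df, x="Age", y="BP", ci=None)
plt.title(f"BP vs. Age (Pearson r = {r:.3f})")
plt.show()
```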

Note: Why KDE?

Kernel Density Estimation (KDE) is preferred over Histograms in our case.

Histograms use discrete, rectangular bars to show frequency counts within set intervals (bins), making them ideal for identifying exact counts and data spread. Kernel Density Estimation (KDE) creates a smooth, continuous probability density curve by summing small Gaussian curves over each data point, making it better for visualizing shape and identifying trends without binning bias.

Try selecting the same variable for both X and Y axes above to see the KDE distribution!
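In code, the two views differ by a single seaborn call; a minimal sketch for one variable (BP here):

```python
import seaborn as sns
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df["BP"], bins=8, ax=axes[0])    # discrete bins, exact counts
sns.kdeplot(df["BP"], fill=True, ax=axes[1])  # smooth density, no binning bias
axes[0].set_title("Histogram")
axes[1].set_title("KDE")
plt.show()
```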

Phase 3: Multicollinearity Check

Multicollinearity is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. Our Heatmap reveals a suspicious correlation between Weight and BSA (0.875). This redundancy inflates the variance of our coefficients, making the model unstable. We confirm this with the Variance Inflation Factor (VIF). A VIF over 5.0 typically indicates problematic multicollinearity.
|        | BP    | Age   | Weight | BSA   | Dur   | Pulse | Stress |
|--------|-------|-------|--------|-------|-------|-------|--------|
| BP     | 1     | 0.659 | 0.95   | 0.866 | 0.293 | 0.721 | 0.164  |
| Age    | 0.659 | 1     | 0.407  | 0.378 | 0.344 | 0.619 | 0.368  |
| Weight | 0.95  | 0.407 | 1      | 0.875 | 0.201 | 0.659 | 0.034  |
| BSA    | 0.866 | 0.378 | 0.875  | 1     | 0.131 | 0.465 | 0.018  |
| Dur    | 0.293 | 0.344 | 0.201  | 0.131 | 1     | 0.402 | 0.312  |
| Pulse  | 0.721 | 0.619 | 0.659  | 0.465 | 0.402 | 1     | 0.506  |
| Stress | 0.164 | 0.368 | 0.034  | 0.018 | 0.312 | 0.506 | 1      |
Values > 0.7 indicate high correlation. Notice Weight vs BSA (0.875).
Weight and BSA are highly correlated (VIF > 5), causing multicollinearity.
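The heatmap itself is essentially one seaborn call; a minimal sketch (the patient-ID column is assumed to be the first column and is dropped before correlating):

```python
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.iloc[:, 1:].corr()   # drop the patient ID column, correlate the rest
sns.heatmap(corr, annot=True, fmt=".3f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Pearson correlation matrix")
plt.show()
```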

The rule is simple:

  • A VIF above 5 indicates "mild" multicollinearity.
  • A VIF above 10 indicates severe multicollinearity.
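A minimal VIF check with statsmodels (predictors only, with a constant added so the per-variable VIFs are meaningful):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = sm.add_constant(df[["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]])
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))   # Weight and BSA come out above 5; the rest are fine
```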

Phase 4: Model Tuning

The results are very promising! We have high values for Adjusted R-squared, F-statistic, and Log-Likelihood.
BUT
Dur and Stress have P-values well above the typical threshold of 0.05 (refer to H0 - Justify your alpha for a guide on selecting a reasonable alpha level that isn't 0.05), meaning they're statistically insignificant. Those variables aren't harmful, but they aren't useful either. Let's talk Backward Elimination.

Initial Model Results

R²: 0.926 · Adj. R²: 0.900 · AIC: 83.27

| Variable | Coef     | Std Err | t       | P>|t| | [0.025, 0.975]   |
|----------|----------|---------|---------|-------|------------------|
| const    | 114.0000 | 0.384   | 296.731 | 0.000 | 113.176, 114.824 |
| Age      | 1.4077   | 0.514   | 2.737   | 0.016 | 0.304, 2.511     |
| BSA      | 3.3512   | 0.471   | 7.114   | 0.000 | 2.341, 4.362     |
| Dur      | 0.1648   | 0.438   | 0.376   | 0.713 | -0.776, 1.105    |
| Pulse    | 1.7359   | 0.606   | 2.866   | 0.012 | 0.437, 3.035     |
| Stress   | -0.6206  | 0.483   | -1.284  | 0.220 | -1.657, 0.416    |
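A sketch of how a table like this is produced with statsmodels. The predictors are standardized (z-scored), which is why the intercept lands on the mean BP of ~114; Weight is already excluded because of the multicollinearity finding above:

```python
import statsmodels.api as sm

predictors = ["Age", "BSA", "Dur", "Pulse", "Stress"]
X = (df[predictors] - df[predictors].mean()) / df[predictors].std()  # z-score scaling
X = sm.add_constant(X)
y = df["BP"]

initial_model = sm.OLS(y, X).fit()
print(initial_model.summary())   # Adj. R², AIC, per-variable t and p values
```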

Backward Elimination

In Data Science, we prefer Parsimony: the simplest model that explains the data is usually the best. Keeping insignificant variables adds complexity without value.

We refine the model by dropping Dur and Stress and re-fitting. The result is a much cleaner model where every variable pulls its weight.
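The refit is the same call with the insignificant columns dropped; a short sketch continuing from the previous code (X and y as built there):

```python
import statsmodels.api as sm

# X and y from the previous sketch (standardized predictors + constant).
X_final = X.drop(columns=["Dur", "Stress"])   # keep const, Age, BSA, Pulse
final_model = sm.OLS(y, X_final).fit()
print(final_model.summary())
```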

Final Model Results

R²: 0.917 · Adj. R²: 0.902 · AIC: 81.57

| Variable | Coef     | Std Err | t       | P>|t| | [0.025, 0.975]   |
|----------|----------|---------|---------|-------|------------------|
| const    | 114.0000 | 0.381   | 299.488 | 0.000 | 113.193, 114.807 |
| Age      | 1.3535   | 0.501   | 2.699   | 0.016 | 0.290, 2.416     |
| BSA      | 3.5173   | 0.445   | 7.906   | 0.000 | 2.574, 4.460     |
| Pulse    | 1.4441   | 0.524   | 2.755   | 0.014 | 0.333, 2.555     |

Durbin-Watson: 2.420 · Jarque-Bera: 0.641

Analysis of Variance (ANOVA)

While coefficients tell us how much BP changes, ANOVA tells us how important each factor is.

"BSA accounts for approximately 80% of the explained variance in Blood Pressure. Age and Pulse matter, but they are minor players compared to body mass."

ANOVA Results (Type: II · Method: OLS)

| Source   | Sum Sq | DF | Mean Sq | F     | PR(>F)  |
|----------|--------|----|---------|-------|---------|
| Age      | 21.11  | 1  | 21.11   | 7.28  | 0.0158  |
| BSA      | 181.13 | 1  | 181.13  | 62.51 | < 0.001 |
| Pulse    | 21.99  | 1  | 21.99   | 7.59  | 0.0141  |
| Residual | 46.37  | 16 | 2.90    | -     | -       |
Table 3.2: Variance decomposition of the final model.
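statsmodels can produce this decomposition directly with anova_lm; a sketch, refitting the final model through the formula API so the design information is available (predictors standardized as before):

```python
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Standardized copy of the predictors (BP stays on its original scale).
df_std = df.copy()
for col in ["Age", "BSA", "Pulse"]:
    df_std[col] = (df_std[col] - df_std[col].mean()) / df_std[col].std()

ols_fit = smf.ols("BP ~ Age + BSA + Pulse", data=df_std).fit()
print(anova_lm(ols_fit, typ=2))   # Type II sums of squares, F and p per term
```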

Phase 5: Diagnostic Analysis

We fitted our model, made sure every variable is significant, and confirmed it has a very good, if not excellent, fit. The final touch is to check the residuals. We need to make sure they are normally distributed (so that significance tests (t-tests, F-tests), confidence intervals, and prediction intervals are accurate) and have constant variance (so that the linear regression model produces reliable, unbiased, and efficient estimates). These two assumptions are critical for inference, not so much for prediction.

Residuals vs. Fitted

(Should be random cloud)

Q-Q Plot

(Should follow the red line)
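Both diagnostic plots take only a few lines; a sketch using the final_model fit from the earlier code:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

resid = final_model.resid
fitted = final_model.fittedvalues

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(fitted, resid)                       # should look like a random cloud
axes[0].axhline(0, color="red", linestyle="--")
axes[0].set(title="Residuals vs. Fitted", xlabel="Fitted values", ylabel="Residuals")

sm.qqplot(resid, line="45", fit=True, ax=axes[1])    # points should hug the red line
axes[1].set_title("Normal Q-Q")
plt.show()
```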

Normality Test (Shapiro-Wilk)

We fail to reject the null hypothesis of normality (p > 0.05). The residuals follow a normal distribution, allowing us to trust our confidence intervals.

Statistic (W): 0.9625 · P-Value: > 0.05
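The test itself is a single scipy call on the residuals:

```python
from scipy.stats import shapiro

w_stat, p_value = shapiro(final_model.resid)
print(f"W = {w_stat:.4f}, p = {p_value:.4f}")   # p > 0.05 → fail to reject normality
```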

Phase 6: Production Readiness

The final model passed all checks. We extracted the coefficients to build the prediction formula:
$$\text{BP} = 114.00 + 1.35(\text{Age}_{std}) + 3.52(\text{BSA}_{std}) + 1.44(\text{Pulse}_{std})$$
As the data guy, you'll usually work with developers to deploy an interactive tool like this. We simply export the model configuration and scaling parameters to a JSON file (click the code button to see how), and then build the production dashboard below.
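A minimal sketch of that hand-off file; the field names here are my own choice, not a fixed schema, and final_model / df come from the earlier code:

```python
import json

predictors = ["Age", "BSA", "Pulse"]
config = {
    "intercept": float(final_model.params["const"]),
    "coefficients": {p: float(final_model.params[p]) for p in predictors},
    # Scaling parameters so the dashboard can standardize raw inputs the same way.
    "scaler": {p: {"mean": float(df[p].mean()), "std": float(df[p].std())}
               for p in predictors},
}

with open("bp_model.json", "w") as f:
    json.dump(config, f, indent=2)
```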

Note: Weight vs. BSA

Back in the multicollinearity check, both Weight and BSA had VIF values above 5; the other predictors were fine.
Upon further investigation as to what "BSA" exactly is, it refers to "Body Surface Area" and is calculated as follows:

$$\text{BSA} = \sqrt{\frac{\text{Weight} \times \text{Height}}{3600}}$$

Where

$$\begin{align*} \text{BSA} &: \text{Body Surface Area } (m^2) \\ W &: \text{Weight } (kg) \\ H &: \text{Height } (cm) \end{align*}$$

and so it becomes clear: BSA is a function of Weight! Of course they're going to be highly correlated!
Having both variables is redundant and, as expected, causes multicollinearity, making estimates unreliable for inference (prediction is fine).

So, what do we do? Simple: we remove one of the two variables. Which variable to keep is up to you, fellow researcher, as it depends on its importance to your theoretical framework.
My opinion is to keep BSA, since it contains Weight.
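For reference, the formula above is trivial to compute; a hypothetical helper (the 178 cm height is an illustrative value, not from the dataset):

```python
import math

def body_surface_area(weight_kg: float, height_cm: float) -> float:
    """BSA in m², per the square-root formula quoted above."""
    return math.sqrt(weight_kg * height_cm / 3600)

print(body_surface_area(85.4, 178))   # ≈ 2.06 m² for an 85.4 kg, 178 cm patient
```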

[Interactive dashboard: enter Age, BSA, and Pulse (defaults 48, 2, 70) to get the predicted systolic pressure in mmHg and its risk condition, plus a sensitivity analysis, a risk profile topology, and a comparison of the current patient vs. the average hypertensive patient.]
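Under the hood, the dashboard prediction is just the exported JSON applied to standardized inputs; a sketch, with illustrative function and file names and assuming bp_model.json was written as shown earlier:

```python
import json

with open("bp_model.json") as f:
    cfg = json.load(f)

def predict_bp(age: float, bsa: float, pulse: float) -> float:
    """Predicted systolic BP (mmHg) from the exported model configuration."""
    raw = {"Age": age, "BSA": bsa, "Pulse": pulse}
    bp = cfg["intercept"]
    for name, value in raw.items():
        z = (value - cfg["scaler"][name]["mean"]) / cfg["scaler"][name]["std"]
        bp += cfg["coefficients"][name] * z
    return bp

print(round(predict_bp(age=48, bsa=2.0, pulse=70), 1))
```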

Key Insights

  • BSA Dominance: Body Surface Area is the strongest predictor. A 1 standard deviation increase in BSA raises BP by ~3.5 mmHg.
  • Pulse Impact: Heart rate is a significant runner-up. Every standardized unit increase adds another ~1.44 mmHg to the pressure.

Final Thoughts

A quick reality check: This dataset had 20 observations and started with 6 predictors. That is ~3.3 observations per predictor. The general rule of thumb is 10-15 observations per predictor to avoid overfitting. In a real-world clinical setting, we would need much more data to make medical decisions.

That said, you just went through the entire lifecycle of a data science project—from loading raw data and diagnosing statistical assumptions, all the way to extracting the coefficients into a JSON structure for software integration.

I encourage you to keep exploring and testing different methods on different datasets. Most importantly, focus on sharpening your theoretical intuition. Libraries like pandas and statsmodels will update and change, but the math behind them remains the same.