Biostatistics · Vol. I · March 2026

The Vital Predictor

Introduction

Today, we're analyzing the bloodpress.txt dataset.
The data was collected from 20 unrelated patients with high blood pressure.
Our goal is to analyze the relationship between blood pressure and various variables to identify potential determinants and risk indicators of hypertension.

Since we're using Python, we're mainly working with:

  • pandas for data operations.
  • numpy for mathematical/algebraic operations.
  • seaborn for visualizations.

These are the industry standard, but that doesn't mean they're always the best or most efficient choice. I encourage you to keep learning about other libraries.
For example, while pandas is a great library for easily loading and editing data, polars is a newer, much faster library (especially for big data).
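
To follow along, here is a minimal loading sketch. It assumes bloodpress.txt is whitespace-delimited with one header row; adjust the separator if your copy of the file differs.

```python
import pandas as pd

# Assumes a whitespace-delimited file with a header row:
# Pt, BP, Age, Weight, BSA, Dur, Pulse, Stress
df = pd.read_csv("bloodpress.txt", sep=r"\s+")

print(df.shape)   # expect (20, 8)
print(df.head())
```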

Theorem I.I: The Ten Assumptions of OLS

Phase 1: Data Inspection

Raw Dataset Preview (n=20)

Pt ID   BP    Age   Weight   BSA    Dur   Pulse   Stress
#1      105   47    85.4     1.75   5.1   63      33
#2      115   49    94.2     2.1    3.8   70      14
#3      116   49    95.3     1.98   8.2   72      10
#4      117   50    94.7     2.01   5.8   73      99
#5      112   51    89.4     1.89   7     72      95

Phase 2: Variable Exploration

[Interactive figure: scatter plot of selected variables. Pearson correlation: r = 0.659 (BP vs. Age).]

Note: Why KDE?

Kernel Density Estimation (KDE) is preferred over histograms in our case.

Histograms use discrete, rectangular bars to show frequency counts within set intervals (bins), making them ideal for reading exact counts and data spread. KDE creates a smooth, continuous probability density curve by summing small Gaussian curves (kernels) centered on each data point, making it better for visualizing overall shape and spotting trends without binning bias.

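For readers working offline, a small seaborn sketch of the same exploration, reusing the df loaded earlier (the Age/BP pairing is just one choice; any pair works):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter of BP against Age (r = 0.659, per the matrix in Phase 3)
sns.scatterplot(data=df, x="Age", y="BP")
plt.show()

# KDE of a single variable: a smooth density with no binning bias
sns.kdeplot(data=df, x="BP", fill=True)
plt.show()
```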

Phase 3: Multicollinearity Check

        BP     Age    Weight  BSA    Dur    Pulse  Stress
BP      1      0.659  0.95    0.866  0.293  0.721  0.164
Age     0.659  1      0.407   0.378  0.344  0.619  0.368
Weight  0.95   0.407  1       0.875  0.201  0.659  0.034
BSA     0.866  0.378  0.875   1      0.131  0.465  0.018
Dur     0.293  0.344  0.201   0.131  1      0.402  0.312
Pulse   0.721  0.619  0.659   0.465  0.402  1      0.506
Stress  0.164  0.368  0.034   0.018  0.312  0.506  1
Values above 0.7 indicate high correlation. For multicollinearity, what matters is correlation between predictors: notice Weight vs. BSA (0.875).
Weight and BSA are highly correlated (VIF > 5), causing multicollinearity.
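
Reproducing the matrix takes one pandas call; a hedged sketch, assuming the patient-ID column is named Pt:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise Pearson correlations (dropping the ID column, assumed
# here to be named "Pt"; adjust to your file's header)
corr = df.drop(columns=["Pt"]).corr().round(3)

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```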
The rule is simple:
  • A VIF above 5 indicates "mild" multicollinearity.
  • A VIF above 10 indicates severe multicollinearity.
Both Weight and BSA have VIF > 5; the other variables are fine.
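
A sketch of the VIF computation with statsmodels; it assumes the column names above and includes the constant, since VIFs are defined against the full design matrix:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIFs are computed column-by-column against the full design matrix
X = sm.add_constant(df[["Age", "Weight", "BSA", "Dur", "Pulse", "Stress"]])
vifs = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vifs.drop("const"))  # the constant's "VIF" is not meaningful
```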
Upon further investigation into what "BSA" actually is: it stands for "Body Surface Area", and (by the Mosteller formula) it is calculated as follows:

$$\text{BSA} = \sqrt{\frac{\text{Weight} \times \text{Height}}{3600}}$$

where

$$\begin{align*} \text{BSA} &: \text{Body Surface Area } (\mathrm{m}^2) \\ \text{Weight} &: \text{Weight } (\mathrm{kg}) \\ \text{Height} &: \text{Height } (\mathrm{cm}) \end{align*}$$
and so it becomes clear: BSA is a function of Weight! Of course they're going to be highly correlated!
Having both variables is redundant and, as expected, causes multicollinearity, making estimates unreliable for inference (prediction is fine).
So, what do we do? Simple: we remove one of the two variables. Which one to keep is up to you, fellow researcher, as it depends on each variable's importance to your theoretical framework.
My opinion is to keep BSA, since it already encodes Weight (along with Height).
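
To make the dependence concrete, a tiny sketch of the formula (the 70 kg / 170 cm patient is a made-up illustration):

```python
import math

def bsa(weight_kg: float, height_cm: float) -> float:
    """Mosteller body-surface-area formula, in square meters."""
    return math.sqrt(weight_kg * height_cm / 3600)

print(bsa(70, 170))  # ~1.82 m^2 for a hypothetical 70 kg, 170 cm patient
```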

Phase 4: Model Tuning

Initial Model Results

R²: 0.926 · Adj. R²: 0.900 · AIC: 83.27
Variable   Coef       Std Err   t         P>|t|   [0.025, 0.975]
const      114.0000   0.384     296.731   0.000   113.176, 114.824
Age        1.4077     0.514     2.737     0.016   0.304, 2.511
BSA        3.3512     0.471     7.114     0.000   2.341, 4.362
Dur        0.1648     0.438     0.376     0.713   -0.776, 1.105
Pulse      1.7359     0.606     2.866     0.012   0.437, 3.035
Stress     -0.6206    0.483     -1.284    0.220   -1.657, 0.416

Model Diagnostics & Parsimony

The results above are very promising! We have high values for adjusted R-squared, F-statistic, and log-likelihood. However, Dur and Stress have high p-values, meaning they are statistically insignificant.
In Data Science, we prefer Parsimony: the simplest model that explains the data is usually the best. Keeping insignificant variables adds complexity without value.
We refine the model by dropping Dur and Stress (Weight was already removed for redundancy back in Phase 3) and re-fitting. The result below is a much cleaner model where every variable pulls its weight.
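
A sketch of the refit with statsmodels. One assumption worth flagging: the constant of 114 equals the sample mean of BP, which suggests the predictors were standardized (z-scored) before fitting, so the sketch does the same:

```python
import statsmodels.api as sm

# Standardize the surviving predictors, then fit OLS with a constant
predictors = df[["Age", "BSA", "Pulse"]]
Z = sm.add_constant((predictors - predictors.mean()) / predictors.std())

final_model = sm.OLS(df["BP"], Z).fit()
print(final_model.summary())
```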

Final Model Results

R²: 0.917 · Adj. R²: 0.902 · AIC: 81.57

Variable   Coef       Std Err   t         P>|t|   [0.025, 0.975]
const      114.0000   0.381     299.488   0.000   113.193, 114.807
Age        1.3535     0.501     2.699     0.016   0.290, 2.416
BSA        3.5173     0.445     7.906     0.000   2.574, 4.460
Pulse      1.4441     0.524     2.755     0.014   0.333, 2.555

Durbin-Watson: 2.420 · Jarque-Bera: 0.641

Analysis of Variance (ANOVA)

While coefficients tell us how much BP changes, ANOVA tells us how important each factor is.

"BSA accounts for approximately 80% of the explained variance in Blood Pressure. Age and Pulse matter, but they are minor players compared to body mass."

Type: II · Method: OLS

Source     Sum Sq   df   Mean Sq   F       PR(>F)
Age        21.11    1    21.11     7.28    0.0158
BSA        181.13   1    181.13    62.51   < 0.001
Pulse      21.99    1    21.99     7.59    0.0141
Residual   46.37    16   2.90      -       -

Table 3.2: Variance decomposition of the final model.

Phase 5: Diagnostic Analysis

[Figures: Residuals vs. Fitted (linearity & homoscedasticity) and Normal Q-Q (ordered values vs. theoretical quantiles).]
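
Both panels can be reproduced offline with matplotlib and statsmodels, reusing final_model from the refit sketch:

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs fitted: we want a flat, patternless cloud around zero
axes[0].scatter(final_model.fittedvalues, final_model.resid)
axes[0].axhline(0, linestyle="--")
axes[0].set(title="Residuals vs Fitted", xlabel="Fitted values", ylabel="Residuals")

# Normal Q-Q: points should track the 45-degree reference line
sm.qqplot(final_model.resid, line="45", fit=True, ax=axes[1])
axes[1].set_title("Normal Q-Q")

plt.tight_layout()
plt.show()
```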

Normality Test (Shapiro-Wilk)

We fail to reject the null hypothesis of normality (p > 0.05). The residuals are consistent with a normal distribution, so we can trust our confidence intervals.

Statistic (W): 0.9625 · P-value: > 0.05
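
The test itself is a single scipy call on the model residuals:

```python
from scipy import stats

w_stat, p_value = stats.shapiro(final_model.resid)
print(f"W = {w_stat:.4f}, p = {p_value:.4f}")  # p > 0.05: no evidence against normality
```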

Phase 6: Production Readiness

[Interactive calculator: clinical inputs (Age: 48, BSA: 2, Pulse: 70) yield a predicted systolic pressure in mmHg and a condition label, plus a sensitivity analysis, a risk-profile topology, and a comparison of the current profile vs. the average hypertensive patient.]
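
Offline, the same prediction is a short dot product. This sketch standardizes the new patient's values with the training data's means and standard deviations (required by the standardized fit, and an easy production bug to miss), then dumps the coefficients to JSON for software integration:

```python
import json

patient = {"Age": 48, "BSA": 2.0, "Pulse": 70}

# Standardize new inputs with the TRAINING data's statistics
bp_hat = final_model.params["const"]
for name, value in patient.items():
    z = (value - df[name].mean()) / df[name].std()
    bp_hat += final_model.params[name] * z

print(f"Predicted systolic BP: {bp_hat:.1f} mmHg")

# Coefficients as a JSON structure for downstream software
print(json.dumps(final_model.params.round(4).to_dict(), indent=2))
```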

Key Insights

  • BSA Dominance: Body Surface Area is the strongest predictor. A 1 standard deviation increase in BSA raises BP by ~3.5 mmHg.
  • Pulse Impact: Heart rate is a significant runner-up. Every standardized-unit increase adds another ~1.44 mmHg to the pressure.

Final Thoughts

A quick reality check: this dataset had 20 observations and started with 6 predictors. That is ~3.3 observations per predictor. The general rule of thumb is 10-15 observations per predictor to avoid overfitting. In a real-world clinical setting, we would need much more data to make medical decisions.
That said, you just went through the entire lifecycle of a data science project: from loading raw data and diagnosing statistical assumptions, all the way to extracting the coefficients into a JSON structure for software integration.
I encourage you to keep exploring and testing different methods on different datasets. Most importantly, focus on sharpening your theoretical intuition. Libraries like pandas and statsmodels will update and change, but the math behind them remains the same.
