Today, we're analyzing the bloodpress.txt dataset. The data was collected from unrelated patients with high blood pressure levels. Our goal is to analyze the relationship between blood pressure and various variables to identify potential determinants and risk indicators of hypertension.
Since we're using Python, we're mainly working with:
- numpy for mathematical/algebraic operations.
- seaborn for visualizations.

These are the industry standard, but that doesn't mean they're necessarily the best or most efficient. I encourage you to always learn about more libraries.
For example, while pandas is a great library for easily loading and editing data, polars is a newer, much faster library (especially for big data).
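As a quick sketch of the loading step (the whitespace-delimited layout of `bloodpress.txt` is an assumption here), the first few rows are inlined below so the snippet runs on its own; against the real file you would point `read_csv` at the path instead:

```python
import io

import pandas as pd

# First rows of the dataset, inlined so the example is self-contained.
# With the real file: pd.read_csv("bloodpress.txt", sep=r"\s+")
raw = """Pt BP Age Weight BSA Dur Pulse Stress
1 105 47 85.4 1.75 5.1 63 33
2 115 49 94.2 2.10 3.8 70 14
3 116 49 95.3 1.98 8.2 72 10
4 117 50 94.7 2.01 5.8 73 99
5 112 51 89.4 1.89 7.0 72 95
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")
print(df.shape)  # (5, 8)
```

polars exposes a very similar `pl.read_csv` entry point, so swapping libraries later is cheap.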
| Pt ID | BP | Age | Weight | BSA | Dur | Pulse | Stress |
|---|---|---|---|---|---|---|---|
| #1 | 105 | 47 | 85.4 | 1.75 | 5.1 | 63 | 33 |
| #2 | 115 | 49 | 94.2 | 2.1 | 3.8 | 70 | 14 |
| #3 | 116 | 49 | 95.3 | 1.98 | 8.2 | 72 | 10 |
| #4 | 117 | 50 | 94.7 | 2.01 | 5.8 | 73 | 99 |
| #5 | 112 | 51 | 89.4 | 1.89 | 7 | 72 | 95 |
Kernel Density Estimation (KDE) is preferred over Histograms in our case.
Histograms use discrete, rectangular bars to show frequency counts within set intervals (bins), making them ideal for identifying exact counts and data spread. Kernel Density Estimation (KDE) creates a smooth, continuous probability density curve by summing small Gaussian curves over each data point, making it better for visualizing shape and identifying trends without binning bias.
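To make the "summing small Gaussian curves" idea concrete, here is a minimal sketch using scipy's `gaussian_kde` on a synthetic, BP-like sample (the data is made up for illustration; seaborn's `sns.kdeplot` wraps the same idea for plotting):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sample = rng.normal(loc=113, scale=5, size=200)  # synthetic BP-like values

kde = gaussian_kde(sample)        # places one Gaussian kernel per data point
xs = np.linspace(90, 136, 500)
density = kde(xs)                 # smooth density curve, no binning required

# A KDE is a proper probability density: non-negative everywhere and
# integrating to (approximately) 1 over its support.
area = float((density * (xs[1] - xs[0])).sum())
print(round(area, 2))
```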
Weight and BSA are strongly correlated (0.875). This redundancy inflates the variance of our coefficients, making the model unstable. We confirm this with the Variance Inflation Factor (VIF). The rule is simple: a VIF over 5.0 typically indicates problematic multicollinearity.
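A minimal VIF check with statsmodels looks like this; the data below is synthetic, constructed so that BSA is almost a deterministic function of Weight while Age stays independent:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
n = 100
weight = rng.normal(90, 5, n)
bsa = 0.02 * weight + rng.normal(0, 0.02, n)  # nearly a function of weight
age = rng.normal(50, 3, n)                    # independent of the others

X = pd.DataFrame({"const": 1.0, "Weight": weight, "BSA": bsa, "Age": age})
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
# Weight and BSA blow past the VIF > 5 threshold; Age does not.
```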
Dur and Stress have p-values well above the typical threshold of 0.05 (refer to H0 - Justify your alpha for a guide on selecting a reasonable alpha level that isn't $0.05$), meaning they're statistically insignificant. Those variables aren't harmful, but they aren't useful either. Let's talk Backward Elimination.

| Variable | Coef | Std Err | t | P>|t| | [0.025 - 0.975] |
|---|---|---|---|---|---|
| const | 114.0000 | 0.384 | 296.731 | 0.000 | 113.176 , 114.824 |
| Age | 1.4077 | 0.514 | 2.737 | 0.016 | 0.304 , 2.511 |
| BSA | 3.3512 | 0.471 | 7.114 | 0.000 | 2.341 , 4.362 |
| Dur | 0.1648 | 0.438 | 0.376 | 0.713 | -0.776 , 1.105 |
| Pulse | 1.7359 | 0.606 | 2.866 | 0.012 | 0.437 , 3.035 |
| Stress | -0.6206 | 0.483 | -1.284 | 0.220 | -1.657 , 0.416 |
In Data Science, we prefer Parsimony: the simplest model that explains the data is usually the best. Keeping insignificant variables adds complexity without value.
We refine the model by dropping Dur and Stress and re-fitting. The result is a much cleaner model where every variable pulls its weight.
| Variable | Coef | Std Err | t | P>|t| | [0.025 - 0.975] |
|---|---|---|---|---|---|
| const | 114.0000 | 0.381 | 299.488 | 0.000 | 113.193 , 114.807 |
| Age | 1.3535 | 0.501 | 2.699 | 0.016 | 0.290 , 2.416 |
| BSA | 3.5173 | 0.445 | 7.906 | 0.000 | 2.574 , 4.460 |
| Pulse | 1.4441 | 0.524 | 2.755 | 0.014 | 0.333 , 2.555 |
While coefficients tell us how much BP changes, ANOVA tells us how important each factor is.
"BSA accounts for approximately 80% of the explained variance in Blood Pressure. Age and Pulse matter, but they are minor players compared to body mass."
| Source | Sum Sq | DF | Mean Sq | F | PR(>F) |
|---|---|---|---|---|---|
| Age | 21.11 | 1 | 21.11 | 7.28 | 0.0158 |
| BSA | 181.13 | 1 | 181.13 | 62.51 | < 0.001 |
| Pulse | 21.99 | 1 | 21.99 | 7.59 | 0.0141 |
| Residual | 46.37 | 16 | 2.90 | - | - |
[Residuals vs. fitted values plot: should be a random cloud]
[Q-Q plot of residuals: should follow the red line]
We fail to reject the null hypothesis of normality ($p > 0.05$). The residuals follow a normal distribution, allowing us to trust our confidence intervals.
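A normality check like this is commonly done with the Shapiro-Wilk test; the residuals below are simulated for illustration, with a skewed sample included for contrast:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(4)
residuals = rng.normal(0, 1, 100)   # stand-in for the model's residuals
skewed = rng.exponential(1.0, 100)  # clearly non-normal, for contrast

stat, p = shapiro(residuals)
stat_skew, p_skew = shapiro(skewed)
# Large p -> fail to reject normality; tiny p (skewed case) -> reject it.
print(round(p, 3), p_skew)
```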
Both Weight and BSA have VIF > 5; the others are fine.
Upon further investigation into what "BSA" actually is, it refers to "Body Surface Area", commonly estimated with the Du Bois formula:

$$BSA = 0.007184 \times W^{0.425} \times H^{0.725}$$

where $W$ is weight in kg and $H$ is height in cm.

And so it becomes clear: BSA is a function of Weight! Of course they're going to be highly correlated!
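Coding up a BSA estimate makes the collinearity obvious. The Du Bois formula used below is one common convention (an assumption on my part, since several BSA formulas exist), and the weights and heights are synthetic:

```python
import numpy as np

def bsa_dubois(weight_kg, height_cm):
    """Du Bois body-surface-area estimate in m^2 (one common convention)."""
    return 0.007184 * weight_kg**0.425 * height_cm**0.725

rng = np.random.default_rng(5)
weight = rng.normal(90, 8, 500)   # synthetic weights, kg
height = rng.normal(170, 6, 500)  # synthetic heights, cm
bsa = bsa_dubois(weight, height)

# BSA rises monotonically with weight, so the two end up strongly
# correlated: exactly the multicollinearity the VIFs flagged.
r = float(np.corrcoef(weight, bsa)[0, 1])
print(round(r, 2))
```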
Having both variables is redundant and, as expected, caused multicollinearity, making estimates unreliable for inference (prediction is fine).
So, what to do? Simple: we remove one of the two variables. Which one to keep is up to you, fellow researcher, as it depends on variable importance to your theoretical grounding.
My opinion is to keep BSA, since it already contains the information in Weight.
[Chart: Current vs. Avg Hypertensive]
A quick reality check: This dataset had 20 observations and started with 6 predictors. That is roughly 3 observations per predictor. The general rule of thumb is 10 to 20 observations per predictor to avoid overfitting. In a real-world clinical setting, we would need much more data to make medical decisions.
That said, you just went through the entire lifecycle of a data science project: from loading raw data and diagnosing statistical assumptions, all the way to extracting the coefficients into a JSON structure for software integration.
I encourage you to keep exploring and testing different methods on different datasets. Most importantly, focus on sharpening your theoretical intuition. Libraries like pandas and statsmodels will update and change, but the math behind them remains the same.