The Vital Predictor
Introduction
Since we're using Python, we're mainly working with:
- pandas for data operations.
- numpy for mathematical/algebraic operations.
- seaborn for visualizations.
Those are the industry standard, but that doesn't mean they're necessarily the best/most efficient. I encourage you to always learn about more libraries.
For example, while pandas is a great library for easily loading and editing data, polars is a newer, much faster library (especially for big data).
The Ten Assumptions of OLS
Phase 1: Data Inspection
Raw Dataset Preview (n=20)
| Pt ID | BP | Age | Weight | BSA | Dur | Pulse | Stress |
|---|---|---|---|---|---|---|---|
| #1 | 105 | 47 | 85.4 | 1.75 | 5.1 | 63 | 33 |
| #2 | 115 | 49 | 94.2 | 2.1 | 3.8 | 70 | 14 |
| #3 | 116 | 49 | 95.3 | 1.98 | 8.2 | 72 | 10 |
| #4 | 117 | 50 | 94.7 | 2.01 | 5.8 | 73 | 99 |
| #5 | 112 | 51 | 89.4 | 1.89 | 7 | 72 | 95 |
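
As a rough illustration of this inspection step, the five previewed rows can be loaded into a pandas DataFrame and checked for shape, missing values, and basic ranges. This is only a sketch: the names `preview`, `n_missing`, and `summary` are made up for the example, and the full dataset has 20 patients rather than the 5 shown here.

```python
import pandas as pd

# Only the five rows shown in the preview table (the full dataset has n=20).
preview = pd.DataFrame(
    {
        "BP": [105, 115, 116, 117, 112],
        "Age": [47, 49, 49, 50, 51],
        "Weight": [85.4, 94.2, 95.3, 94.7, 89.4],
        "BSA": [1.75, 2.10, 1.98, 2.01, 1.89],
        "Dur": [5.1, 3.8, 8.2, 5.8, 7.0],
        "Pulse": [63, 70, 72, 73, 72],
        "Stress": [33, 14, 10, 99, 95],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Pt ID"),
)

# Basic inspection: dimensions, missing values, and per-column ranges.
n_missing = int(preview.isna().sum().sum())
summary = preview.describe().loc[["mean", "min", "max"]]

print(preview.shape)
print(f"missing values: {n_missing}")
print(summary)
```

Checks like these are cheap, and they catch missing values or implausible ranges before any modeling starts.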
Phase 2: Variable Exploration
Note: Why KDE?
Kernel Density Estimation (KDE) is preferred over Histograms in our case.
Histograms use discrete, rectangular bars to show frequency counts within set intervals (bins), making them ideal for identifying exact counts and data spread. KDE creates a smooth, continuous probability density curve by summing a small Gaussian curve centered on each data point, making it better for visualizing shape and identifying trends without binning bias.
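To make the "summing small Gaussian curves" idea concrete, here is a minimal NumPy sketch of a KDE next to a histogram of the same values. The helper `kde_curve`, the bandwidth of 3.0, and the evaluation grid are assumptions chosen for illustration. In practice seaborn picks a bandwidth for you.

```python
import numpy as np

def kde_curve(samples, grid, bandwidth):
    """Average one Gaussian bump per data point, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = distance from grid point i to sample j, in bandwidth units
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return bumps.mean(axis=1)

# BP values from the five previewed rows only.
bp = np.array([105, 115, 116, 117, 112], dtype=float)

grid = np.linspace(80.0, 140.0, 601)
density = kde_curve(bp, grid, bandwidth=3.0)

# A density should integrate to roughly 1 (simple Riemann sum here).
area = float(np.sum(density) * (grid[1] - grid[0]))

# The histogram of the same data depends on the chosen bins.
counts, edges = np.histogram(bp, bins=3)

print(f"area under KDE: {area:.4f}")
print("histogram counts:", counts, "edges:", edges)
```

Changing `bins` reshapes the histogram, while the KDE only depends on the bandwidth, which is the "binning bias" trade-off described above.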
Phase 3: Multicollinearity Check
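- If the VIF is higher than the lower cutoff (5 is a common rule of thumb), it indicates "mild" multicollinearity.
- If the VIF is above the upper cutoff (10 is a common rule of thumb), it's severe multicollinearity.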
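For reference, the variance inflation factor of predictor j is 1 / (1 - R²_j), where R²_j comes from regressing that predictor on all the others. The sketch below computes it with plain NumPy on synthetic data. The data, the helper name `vif`, and the cutoffs 5 and 10 are all assumptions for illustration, not values from the actual dataset.

```python
import numpy as np

def vif(X):
    """VIF per column: regress each column on the others, then 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # add an intercept
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic stand-in data: "bsa" is built from "weight", "age" is independent.
rng = np.random.default_rng(0)
n = 200
weight = rng.normal(90.0, 5.0, n)
bsa = 0.02 * weight + rng.normal(0.0, 0.02, n)
age = rng.normal(50.0, 3.0, n)

names = ["weight", "bsa", "age"]
vifs = vif(np.column_stack([weight, bsa, age]))

MILD, SEVERE = 5.0, 10.0  # assumed rule-of-thumb cutoffs
for name, value in zip(names, vifs):
    label = "severe" if value > SEVERE else "mild" if value > MILD else "ok"
    print(f"{name:7s} VIF = {value:6.2f}  ({label})")
```

statsmodels ships a `variance_inflation_factor` function in `statsmodels.stats.outliers_influence` that does the same job. The manual version is shown only to expose the math.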
Weight and BSA have VIFs above the threshold; the others are fine.
If you're not sure what "BSA" is exactly, it refers to "Body Surface Area", which is calculated from a patient's weight and height.
BSA is a function of Weight! Of course they're going to be highly correlated!
So we drop Weight and keep BSA, since it contains Weight.
Phase 4: Model Tuning
Initial Model Results
| Variable | Coef | Std Err | t | P>\|t\| | 95% CI [0.025, 0.975] |
|---|---|---|---|---|---|
| const | 114.0000 | 0.384 | 296.731 | 0.000 | 113.176 , 114.824 |
| Age | 1.4077 | 0.514 | 2.737 | 0.016 | 0.304 , 2.511 |
| BSA | 3.3512 | 0.471 | 7.114 | 0.000 | 2.341 , 4.362 |
| Dur | 0.1648 | 0.438 | 0.376 | 0.713 | -0.776 , 1.105 |
| Pulse | 1.7359 | 0.606 | 2.866 | 0.012 | 0.437 , 3.035 |
| Stress | -0.6206 | 0.483 | -1.284 | 0.220 | -1.657 , 0.416 |
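
A constant of exactly 114.0000 suggests the predictors were standardized before fitting, although the source doesn't state this. When every predictor has mean zero, the OLS intercept equals the mean of the response. The sketch below demonstrates that property on synthetic data with NumPy. The data and variable names are assumptions, and the fitted numbers will not match the table.

```python
import numpy as np

# Synthetic stand-in for three raw predictors (e.g. age, BSA, pulse) and BP.
rng = np.random.default_rng(42)
n = 20
X_raw = rng.normal(loc=[50.0, 2.0, 70.0], scale=[3.0, 0.15, 4.0], size=(n, 3))
y = (
    114.0
    + 0.5 * (X_raw[:, 0] - 50.0)
    + 20.0 * (X_raw[:, 1] - 2.0)
    + 0.3 * (X_raw[:, 2] - 70.0)
    + rng.normal(0.0, 1.5, n)
)

# Standardize: subtract each column's mean and divide by its standard deviation.
Z = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(n), Z])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slopes = beta[0], beta[1:]

print(f"intercept = {intercept:.4f}, mean(y) = {y.mean():.4f}")
print("slopes per 1 SD:", np.round(slopes, 4))
```

With standardized inputs, each slope reads as "change in BP per one standard deviation of that predictor", which makes the coefficients directly comparable.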
Model Diagnostics & Parsimony
Dur and Stress have high P-values, meaning they are not statistically significant.
We simplify the model by dropping Weight (redundant), Dur, and Stress and re-fitting. The result below is a much cleaner model where every variable pulls its weight.
Final Model Results
| Variable | Coef | Std Err | t | P>\|t\| | 95% CI [0.025, 0.975] |
|---|---|---|---|---|---|
| const | 114.0000 | 0.381 | 299.488 | 0.000 | 113.193 , 114.807 |
| Age | 1.3535 | 0.501 | 2.699 | 0.016 | 0.290 , 2.416 |
| BSA | 3.5173 | 0.445 | 7.906 | 0.000 | 2.574 , 4.460 |
| Pulse | 1.4441 | 0.524 | 2.755 | 0.014 | 0.333 , 2.555 |
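
The columns of this table are linked: t is the coefficient divided by its standard error, and the 95% interval is the coefficient plus or minus a critical t value times the standard error. The sketch below re-derives both from the printed numbers. It assumes 16 residual degrees of freedom (20 patients minus 4 fitted parameters) and hard-codes the corresponding two-sided critical value of about 2.120, so small rounding differences from the table are expected.

```python
# Two-sided 95% critical value of Student's t with 16 degrees of freedom
# (hard-coded assumption: n = 20 observations, 4 fitted parameters).
T_CRIT = 2.120

# (coefficient, standard error) copied from the final model table.
final_model = {
    "Age": (1.3535, 0.501),
    "BSA": (3.5173, 0.445),
    "Pulse": (1.4441, 0.524),
}

def t_and_ci(coef, std_err, t_crit=T_CRIT):
    """Return the t statistic and the confidence interval bounds."""
    t_stat = coef / std_err
    half_width = t_crit * std_err
    return t_stat, coef - half_width, coef + half_width

results = {name: t_and_ci(c, se) for name, (c, se) in final_model.items()}

for name, (t_stat, low, high) in results.items():
    print(f"{name:5s} t = {t_stat:6.3f}   95% CI = [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero corresponds to a P-value below 0.05, which is the case for all three remaining predictors.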
Analysis of Variance (ANOVA)
"BSA accounts for approximately 80% of the explained variance in Blood Pressure. Age and Pulse matter, but they are minor players compared to body mass."
ANOVA Table
| Source | Sum Sq | df | Mean Sq | F | PR(>F) |
|---|---|---|---|---|---|
| Age | 21.11 | 1 | 21.11 | 7.28 | 0.0158 |
| BSA | 181.13 | 1 | 181.13 | 62.51 | < 0.001 |
| Pulse | 21.99 | 1 | 21.99 | 7.59 | 0.0141 |
| Residual | 46.37 | 16 | 2.90 | - | - |
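
The "approximately 80%" figure and the F column can both be reproduced from the sums of squares in the table. In the sketch below, the share is each term's sum of squares divided by the total over the three listed terms, and F is the term's mean square divided by the residual mean square. One caveat: each F value is very close to the square of the matching t statistic in the final model table, which hints that these are partial sums of squares, so treat the shares as a relative comparison rather than a strict decomposition.

```python
# Sums of squares copied from the ANOVA table (each term has 1 degree of freedom).
sum_sq = {"Age": 21.11, "BSA": 181.13, "Pulse": 21.99}
ss_resid, df_resid = 46.37, 16

ms_resid = ss_resid / df_resid       # residual mean square
ss_listed = sum(sum_sq.values())     # total over the three listed terms

# Share of the listed (explained) sum of squares per term.
share = {name: ss / ss_listed for name, ss in sum_sq.items()}

# With 1 df per term, mean square == sum of squares, so F = SS / MS_resid.
f_stats = {name: ss / ms_resid for name, ss in sum_sq.items()}

for name in sum_sq:
    print(f"{name:5s} share = {share[name]:.1%}   F = {f_stats[name]:.2f}")
```

BSA's share comes out at roughly 81%, in line with the quoted "approximately 80%".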
Phase 5: Diagnostic Analysis
Linearity & Homoscedasticity (Residuals vs Fitted)
Normal Q-Q (Ordered Values vs Theoretical)
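A Q-Q plot pairs the sorted residuals ("ordered values") with the quantiles a normal distribution would produce at the same ranks ("theoretical"). The sketch below builds those pairs by hand. The residuals are synthetic draws (scale 1.7, roughly the square root of the residual mean square of 2.90), and the plotting positions (i - 0.5) / n are one common convention, so this is an assumption-laden illustration rather than the page's actual plot.

```python
import numpy as np
from statistics import NormalDist

# Synthetic stand-in residuals; the real ones come from the fitted model.
rng = np.random.default_rng(0)
residuals = rng.normal(0.0, 1.7, size=200)

# Ordered values: the residuals sorted from smallest to largest.
ordered = np.sort(residuals)
n = len(ordered)

# Theoretical quantiles of a standard normal at plotting positions (i - 0.5) / n.
probs = (np.arange(1, n + 1) - 0.5) / n
std_normal = NormalDist()
theoretical = np.array([std_normal.inv_cdf(float(p)) for p in probs])

# If the residuals are close to normal, the points fall near a straight line,
# so the correlation between the two columns is close to 1.
qq_corr = float(np.corrcoef(theoretical, ordered)[0, 1])
print(f"Q-Q correlation: {qq_corr:.4f}")
```

Plotting `theoretical` against `ordered` gives the familiar Q-Q picture. Points that bend away from the line in the tails are the usual sign of non-normal residuals.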
Normality Test (Shapiro-Wilk)
We fail to reject the null hypothesis of normality (the Shapiro-Wilk p-value is above the significance level). The residuals are consistent with a normal distribution, which supports trusting our confidence intervals.
Phase 6: Production Readiness
Clinical Inputs
Risk Profile Topology
Current vs. Avg Hypertensive
Key Insights
- BSA Dominance: Body Surface Area is the strongest predictor. A 1 standard deviation increase in BSA raises BP by ~3.5 mmHg.
- Pulse Impact: Heart Rate is a significant runner-up. Every standardized unit increase adds another ~1.44 mmHg to the pressure.
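These insights follow directly from the final model's coefficients. As a small worked example, the sketch below turns them into a prediction function that takes standardized inputs (z-scores). The function name `predict_bp` and the example patient are hypothetical, and converting raw measurements to z-scores would need the dataset's means and standard deviations, which aren't listed here.

```python
# Coefficients copied from the final model table (inputs are z-scores).
FINAL_MODEL = {"const": 114.0, "Age": 1.3535, "BSA": 3.5173, "Pulse": 1.4441}

def predict_bp(z_age, z_bsa, z_pulse):
    """Predicted blood pressure in mmHg for standardized predictor values."""
    return (
        FINAL_MODEL["const"]
        + FINAL_MODEL["Age"] * z_age
        + FINAL_MODEL["BSA"] * z_bsa
        + FINAL_MODEL["Pulse"] * z_pulse
    )

baseline = predict_bp(0.0, 0.0, 0.0)     # an exactly average patient
higher_bsa = predict_bp(0.0, 1.0, 0.0)   # same patient, BSA one SD higher

print(f"baseline: {baseline:.2f} mmHg")
print(f"+1 SD BSA: {higher_bsa:.2f} mmHg (difference {higher_bsa - baseline:.2f})")
```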
Final Thoughts
The model's output can be packaged as a JSON structure for software integration.
Libraries like pandas and statsmodels will update and change, but the math behind them remains the same.
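One possible shape for such a JSON structure is sketched below with Python's standard `json` module. Every field name here (`target`, `unit`, `predictors_standardized`, `intercept`, `coefficients`) is hypothetical. Only the numbers come from the final model table.

```python
import json

# Hypothetical payload layout; only the numeric values come from the final model.
model_card = {
    "target": "BP",
    "unit": "mmHg",
    "predictors_standardized": True,  # assumption inferred from the coefficients
    "intercept": 114.0,
    "coefficients": {"Age": 1.3535, "BSA": 3.5173, "Pulse": 1.4441},
}

encoded = json.dumps(model_card, indent=2)  # text that another service can read
decoded = json.loads(encoded)               # round-trip back to a dict

print(encoded)
```

Keeping the coefficients in a plain, library-independent format like this means a downstream application does not need pandas or statsmodels to reproduce a prediction.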