Deep Learning
JUN 2026

Statistical Surgery: Compressing VGG19

How we aggressively reduced deep learning parameters in hematological imaging while maintaining 98.4% accuracy.

Research Team
Ezz Eldin Ahmed, Researcher
Abdulrahman Mostafa Kamel, Researcher
Masty Ahmed, Researcher
Mohamed Amir, Researcher

Introduction

Artificial intelligence has fundamentally transformed modern computing, but this predictive power comes at a severe computational cost. Classical statistical learning theory suggests that highly parameterized models are prone to overfitting. Yet, deep learning often departs from these classical bias-variance expectations through a phenomenon known as benign overfitting.

Architectures such as VGG19 contain approximately 140 million parameters. When transferred into specialized medical settings—like the BloodMNIST hematological imaging task—the model offers immense representational power but creates a massive, statistically unjustifiable parameter-to-task imbalance.

Parameters
139.6M → 35.5M

74.5% reduction in total parameters via structured surgery.

Efficiency Gain
2.39x Speedup
231.3 ms → 96.9 ms

58.1% lower wall-clock latency on standard hardware (T4 GPU).

Storage Footprint
532.6 MB → 134.9 MB

74.7% smaller serialized model for edge deployment.

§1 Research Methodology

The objective is to minimize the empirical risk while simultaneously reducing the parameter count $\mathcal{P}$. This is formulated as a constrained optimization problem:

$$\min_{W} \mathcal{L}(W; \mathcal{D}) \quad \text{s.t.} \quad \|W\|_0 \leq \kappa$$

where $\kappa$ is the target parameter budget and $\mathcal{L}$ is the cross-entropy loss.

Hematological Dataset Exploration

Dataset N = 17,092 Samples

Statistical Bias Note

"The moderate class imbalance observed here necessitates the use of Macro-averaged F1 metrics. Accuracy alone would be biased toward the majority classes (Neutrophils and Eosinophils)."

Figure: Intensity Profile (RGB Mean).

Dataset: BloodMNIST-224 · Task: 8-Class Classification · Sparsity Context: Baseline Analysis

§2 Baseline Model Analysis

The uncompressed VGG19 architecture acts as our empirical upper bound. Fine-tuned using SGD, it achieved a top-1 accuracy of 98.48% with a mean latency of 231.3 ms.

Feature Hierarchy & Volumetrics

Figure: VGG19 Hierarchical Construction. Isometric decomposition visualizing the bottleneck transitions and parameter distribution.

§ 2.1 Empirical Performance Fidelity

Our stopping criteria relied on the minimum per-class F1-score to protect minority classes from being "averaged out" by high performance on dominant classes.

Summary Statistics
Macro-F1: 98.57%
Latency: 231 ms

Figure: F1-Score Distribution (%) per class (Eosinophils, Platelets, Basophils, Lymphocytes, Erythroblasts, Monocytes, Neutrophils, Immature Granulocytes), axis from 90% to 100%, with points flagged Optimal or Critical. Average: 98.57%.

PCA Redundancy Audit

Quantifying the intrinsic dimensionality of latent activation spaces.

Conv Block 3: 41 / 256 Dim (Architectural Capacity: 84% Redundant)
Conv Block 4: 122 / 512 Dim (Architectural Capacity: 76% Redundant)
Conv Block 5: 59 / 512 Dim (Architectural Capacity: 88% Redundant)
FC1 Head: 285 / 4096 Dim (Architectural Capacity: 93% Redundant)
FC2 Head: 137 / 4096 Dim (Architectural Capacity: 97% Redundant)
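One way such an audit could be run is sketched below. It assumes that "intrinsic dimensionality" means the number of principal components needed to reach a cumulative explained-variance threshold; the article does not state the threshold it used, so `variance_kept=0.95` and the function names are illustrative only.

```python
import numpy as np

def intrinsic_dim(activations: np.ndarray, variance_kept: float = 0.95) -> int:
    """Number of principal components needed to explain `variance_kept` of the variance.

    activations: (n_samples, n_features) matrix of flattened layer activations.
    variance_kept: hypothetical threshold, not taken from the article.
    """
    centered = activations - activations.mean(axis=0, keepdims=True)
    # Singular values of the centered data give the PCA spectrum.
    s = np.linalg.svd(centered, compute_uv=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    k = int(np.searchsorted(cumulative, variance_kept)) + 1
    return min(k, len(s))

def redundancy(dim: int, width: int) -> float:
    """Share of the layer width not needed by the retained components."""
    return 1.0 - dim / width

# Synthetic check: 64-wide activations that really live in 5 dimensions.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
acts = latent @ rng.normal(size=(5, 64))
k = intrinsic_dim(acts)
print(k, f"{redundancy(41, 256):.0%}")
```

Applied to the reported figures, `redundancy(41, 256)` gives roughly 84%, which is how the percentage for Conv Block 3 relates to its 41 retained dimensions.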

§3 L1 Lasso Regularization

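The L1 norm penalty (Lasso) induces a continuous zero-attraction effect on individual weights. While this statistically simplifies the model, it creates a Hardware Paradox: weights that are numerically zero still occupy dense tensors, so memory footprint and inference latency do not improve.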

$$w_j^{new} \leftarrow w_j - \eta \nabla \mathcal{L}_{task,j} - \eta \lambda \, \text{sign}(w_j)$$
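The update rule can be written out directly. The following is a minimal sketch in NumPy; the function name and the toy values are illustrative and not taken from the training code.

```python
import numpy as np

def l1_sgd_step(w, grad_task, lr, lam):
    """One subgradient step: w <- w - lr * grad_task - lr * lam * sign(w)."""
    return w - lr * grad_task - lr * lam * np.sign(w)

w = np.array([0.5, -0.2, 0.0])
grad = np.zeros_like(w)   # zero task gradient isolates the L1 pull
w_next = l1_sgd_step(w, grad, lr=0.1, lam=1.0)
print(w_next)
```

With the task gradient set to zero, every nonzero weight moves by `lr * lam` toward zero and a weight already at zero stays put. Because plain subgradient steps rarely land on exactly zero, weights tend to end up "nearly zero", which is why a magnitude threshold is applied afterwards.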

To prevent weights from becoming trapped in "nearly-zero" local minima during pruning, we combined Stochastic Gradient Descent with Warm Restarts (SGDR) for global exploration with the Adam optimizer for local convergence. The periodic restarts let training escape poor basins and explore the loss landscape more broadly without catastrophic forgetting.
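The warm-restart part of that schedule can be illustrated with the standard cosine-annealing formula. This is a sketch only: the cycle length `t0`, the multiplier `t_mult` and the learning rates are placeholder values, and the article does not describe how the schedule was combined with Adam.

```python
import math

def sgdr_lr(epoch: float, lr_max: float, lr_min: float = 0.0,
            t0: int = 5, t_mult: int = 2) -> float:
    """Cosine-annealed learning rate with warm restarts.

    t0 is the length of the first cycle in epochs; each later cycle is
    t_mult times longer. All default values are illustrative.
    """
    t_i, t_cur = t0, epoch
    while t_cur >= t_i:       # skip past completed cycles
        t_cur -= t_i
        t_i *= t_mult
    cos = math.cos(math.pi * t_cur / t_i)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)

schedule = [sgdr_lr(e, lr_max=0.01) for e in range(22)]
print([round(lr, 4) for lr in schedule])
```

With these placeholder settings the rate decays within each cycle and jumps back to `lr_max` at epochs 5 and 15, which is the "restart" that gives the optimizer a chance to leave a poor basin.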

Training Dynamics & Phase Transition

Figure: Tracking the emergence of sparsity under increasing L1 pressure across 22 epochs. Panels: Macro F1, Soft Sparsity, Global Sparsity. At epoch 0 the readout is Macro F1 98.6% with 0.0% soft and global sparsity.

Zero-Attraction Phase: the L1 norm is pulling redundant weights toward zero while preserving topological fidelity.

Weight Magnitude Audit

Analyzing the post-Lasso zero-attraction topology.

Figure: Full Spectrum Density, Magnitude Distribution (|w|), with a Surgical Threshold Sweep separating zero from effective weights.

Threshold (ε): 0.015
Theoretical Sparsity: 95.3%

Parameters that can be numerically zeroed without significant loss impact.

Memory Occupancy
528 MB

Static occupancy despite numerical sparsity—demonstrating the Hardware Paradox.

Inference Profile
0.0x (No Speedup)

Unstructured sparsity does not bypass SIMD multipliers. The tensor must undergo structural surgery to gain latency benefits.
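The paradox is easy to reproduce on a toy tensor. The sketch below uses a synthetic Laplace-distributed weight matrix as a hypothetical stand-in for post-Lasso weights; only the threshold value 0.015 comes from the audit above.

```python
import numpy as np

def sparsity_at_threshold(weights: np.ndarray, eps: float) -> float:
    """Fraction of weights with |w| < eps."""
    return float(np.mean(np.abs(weights) < eps))

rng = np.random.default_rng(0)
# Hypothetical stand-in for post-Lasso weights: a sharp peak around zero.
w = rng.laplace(loc=0.0, scale=0.005, size=(512, 512)).astype(np.float32)

eps = 0.015
sparsity = sparsity_at_threshold(w, eps)
w_masked = np.where(np.abs(w) < eps, 0.0, w).astype(np.float32)

bytes_before = w.nbytes
bytes_after = w_masked.nbytes   # identical: zeros are still stored as FP32
print(f"sparsity={sparsity:.1%}, bytes {bytes_before} -> {bytes_after}")
```

The masked tensor keeps its shape and byte count, so a dense matrix multiply does exactly the same work as before. Only removing whole rows, columns or channels changes the tensor shape that the hardware sees.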

§4 Structured Surgery via L0 Gates

To achieve physical acceleration, we must shift from weight-level sparsity to channel-level surgery. We implement a Gaussian Stochastic Gate—a differentiable relaxation that samples gate states from a continuous distribution during training.

To control the pruning rate, we used a PID controller to dynamically modulate the regularization penalty $\lambda$. This prevented the 'sparsity collapse' often seen in static regularization, successfully stabilizing the network as pressure peaked around epoch 13.
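A controller of this kind might look like the sketch below. The article does not specify the error signal or the gains, so the choice of "target sparsity minus current sparsity" as the error, the class name `PIDLambda` and all numeric values are assumptions made for illustration.

```python
class PIDLambda:
    """PID controller that sets the penalty weight from a sparsity error."""

    def __init__(self, kp: float, ki: float, kd: float, lam_base: float = 0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.lam_base = lam_base
        self.integral = 0.0
        self.prev_error = None

    def update(self, target: float, current: float) -> float:
        error = target - current              # > 0 means: prune harder
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        out = (self.lam_base + self.kp * error
               + self.ki * self.integral + self.kd * derivative)
        return max(0.0, out)                  # keep the penalty non-negative

controller = PIDLambda(kp=1.0, ki=0.1, kd=0.5)   # hypothetical gains
lam_under = controller.update(target=0.75, current=0.0)
lam_over = controller.update(target=0.75, current=0.90)
print(lam_under, lam_over)
```

When the network is under-pruned the penalty rises, and when sparsity overshoots the target the penalty is cut back (here all the way to zero), which is the mechanism that guards against a runaway collapse.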

Differentiable L0 Relaxation

Figure: The Gaussian Stochastic Gate. Differentiable L0 relaxation, visualizing the transition from continuous parameters to discrete hardware gates on a 0 to 1 scale with an active threshold. Panel: Tensor Shutter Array (Stochastic); each cell represents a convolutional channel, and stochastic sampling determines whether the "shutter" (gate) is physically open for inference. Control: Gate Bias ($\mu$), shown at 0.50 with gate status ACTIVE.

The Gaussian gate acts as a continuous proxy for discrete L0 penalization. By adjusting the bias, we modulate the probability of channel survival.
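The gate can be sketched as follows, assuming the commonly used Gaussian formulation in which a gate is the clipped sum of a learnable bias and Gaussian noise, and the penalty is the probability that the gate is open. The article does not give its exact parameterization, so `sigma` and the function names are assumptions.

```python
import math
import numpy as np

def sample_gates(mu: np.ndarray, sigma: float, rng: np.random.Generator) -> np.ndarray:
    """Training-time gates: z = clip(mu + sigma * noise, 0, 1), one per channel."""
    noise = rng.standard_normal(mu.shape)
    return np.clip(mu + sigma * noise, 0.0, 1.0)

def open_probability(mu: np.ndarray, sigma: float) -> np.ndarray:
    """P(mu + sigma * noise > 0) = Phi(mu / sigma), the standard normal CDF."""
    return np.array([0.5 * (1.0 + math.erf(m / (sigma * math.sqrt(2.0)))) for m in mu])

def expected_l0(mu: np.ndarray, sigma: float) -> float:
    """Differentiable surrogate for the number of open gates."""
    return float(open_probability(mu, sigma).sum())

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.0, -1.0, 2.0])   # per-channel gate bias
z = sample_gates(mu, sigma=0.5, rng=rng)
penalty = expected_l0(mu, sigma=0.5)
print(z, round(penalty, 3))
```

Raising a channel's bias pushes its open probability toward 1, while a strongly negative bias drives it toward 0. At inference time a channel whose gate is closed can be removed outright.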

§ 4.2 The Latent Hierarchy

Structural analysis reveals a depth-dependent survival gradient across the VGG19 backbone. Early layers that encode visual primitives (edges/colors, Blocks 1-2) showed high survival (62.5% to 87.5%), the transition zone (Block 3) saw intermediate survival (~50%) as ImageNet-specific features dropped, and deeper layers (Blocks 4-5) exposed massive redundant capacity with low survival (~20%).

The Survival Gradient

Visualizing the structural survival of VGG19 channels. Early layers (Blocks 1-2) are preserved for low-level feature extraction, while deep layers are aggressively pruned.

Layer  Block    Surviving / Total  Survival
L1     Block 1  40 / 64            62.5%
L2     Block 1  47 / 64            73.4%
L3     Block 2  97 / 128           75.8%
L4     Block 2  112 / 128          87.5%
L5     Block 3  122 / 256          47.7%
L6     Block 3  116 / 256          45.3%
L7     Block 3  127 / 256          49.6%
L8     Block 3  133 / 256          52.0%
L9     Block 4  115 / 512          22.5%
L10    Block 4  112 / 512          21.9%
L11    Block 4  123 / 512          24.0%
L12    Block 4  124 / 512          24.2%
L13    Block 4  114 / 512          22.3%
L14    Block 5  97 / 512           18.9%
L15    Block 5  110 / 512          21.5%
L16    Block 5  83 / 512           16.2%

§ 4.3 Structured Tensor Surgery

Figure: Pruning Visualization. Discarding redundant channels physically to unlock hardware throughput. The Pruning Severity (L0 Penalty) control ranges from Diagnostic Conservation to Aggressive Pruning.

Shown state at 35% severity:
Tensor Shape: [B, 512, 14, 14] (Batch × Channels × Height × Width)
Inference Latency: 168.1 ms
Throughput: 1.38x
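Physically discarding a channel means slicing it out of two tensors: the output dimension of the layer that produces it and the input dimension of the layer that consumes it. The sketch below shows this on raw NumPy weight arrays with hypothetical shapes; which channels survive would come from the trained gates, whereas here the first 115 are kept purely for illustration (115 of 512 is the count reported for the first Block 4 layer).

```python
import numpy as np

def prune_conv_pair(w_curr, b_curr, w_next, keep):
    """Remove pruned output channels of one conv layer and the matching
    input channels of the next.

    w_curr: (out_c, in_c, kh, kw) weights; b_curr: (out_c,) bias
    w_next: (next_out_c, out_c, kh, kw) weights of the following layer
    keep:   boolean mask of length out_c (True = channel survives)
    """
    return w_curr[keep], b_curr[keep], w_next[:, keep]

rng = np.random.default_rng(0)
w1 = rng.normal(size=(512, 256, 3, 3)).astype(np.float32)
b1 = np.zeros(512, dtype=np.float32)
w2 = rng.normal(size=(512, 512, 3, 3)).astype(np.float32)

keep = np.zeros(512, dtype=bool)
keep[:115] = True   # placeholder selection of surviving channels
w1_p, b1_p, w2_p = prune_conv_pair(w1, b1, w2, keep)
print(w1_p.shape, b1_p.shape, w2_p.shape)
```

Unlike masking, the pruned tensors are genuinely smaller, so both the stored model and the convolution workload shrink.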

§ 4.4 Inference Throughput Race

Hardware Inference Race

Throughput Benchmark

Simulating a clinical queue of 20 diagnostic batches. The L0 model achieves physical acceleration through tensor surgery, clearing the queue while the Baseline still processes.

L0 Structured Surgery latency: 96.9 ms
Baseline VGG19 latency: 231.3 ms
Efficiency Gain: 2.39x Speedup
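The headline numbers follow from the two latencies alone. A short worked calculation, assuming the 20 batches are processed one after another at constant latency:

```python
baseline_ms = 231.3
pruned_ms = 96.9
batches = 20

speedup = baseline_ms / pruned_ms                    # ratio of latencies
latency_reduction = 1.0 - pruned_ms / baseline_ms    # fraction of time saved
queue_baseline_s = batches * baseline_ms / 1000.0
queue_pruned_s = batches * pruned_ms / 1000.0

print(f"{speedup:.2f}x, {latency_reduction:.1%}, "
      f"{queue_baseline_s:.3f}s vs {queue_pruned_s:.3f}s")
```

Under that assumption the pruned model clears the queue in about 1.9 s while the baseline needs about 4.6 s.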

§5 Low-Rank Factorization (SVD)

The classification head is compressed using Truncated SVD, splitting enormous matrices into compressed, sequential multiplications.

$$W \approx U_k \Sigma_k V_k^T$$
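The factorization turns one large matrix multiply into two smaller ones. The following sketch uses a synthetic weight matrix with hypothetical dimensions; `factorize_dense` is an illustrative helper, not the project's code.

```python
import numpy as np

def factorize_dense(w: np.ndarray, k: int):
    """Split W (out x in) into A (out x k) and B (k x in) with W ~= A @ B."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :k] * s[:k]      # fold singular values into the left factor
    b = vt[:k, :]
    return a, b

rng = np.random.default_rng(0)
# Hypothetical weight matrix with true rank 16 plus small noise.
w = rng.normal(size=(256, 16)) @ rng.normal(size=(16, 512))
w += 0.01 * rng.normal(size=w.shape)

a, b = factorize_dense(w, k=16)
x = rng.normal(size=(512,))
y_full = w @ x
y_low = a @ (b @ x)           # two small multiplications instead of one large

params_full = w.size          # out * in
params_low = a.size + b.size  # k * (out + in)
rel_err = np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full)
print(params_full, params_low, rel_err)
```

The factored form only saves parameters when `k * (out + in)` is smaller than `out * in`. A 32x32 matrix at rank 16, for example, needs 16 * 64 = 1,024 parameters, exactly as many as the original, so the compression ratio there is 1.0x.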

Singular Value Decomposition

Decomposing high-density weight tensors into essential geometric primitives.

Figure: an uncompressed 32x32 weight tensor next to its SVD approximation at rank k = 16 (1,024 parameters). Energy: 98.2%. Compression: 1.0x.

§ 5.1 The Diagnostic Fidelity Sweep

Stripping out high-frequency parametric noise through SVD can actually improve classification accuracy by acting as a structural denoiser.

Diagnostic Fidelity Sweep

Why SVD works: Clinical images contain massive spatial redundancy.

Rank k = 31
78.7% Energy
Audit Objective

The previous section showed that the weights are redundant. This section shows that the diagnostic data itself is low-rank, allowing the network to discard high-frequency "noise" without losing the cell's nucleus structure.
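A low-rank claim like this is usually checked through the retained "energy" of a truncated SVD. The sketch below assumes energy means the share of squared singular values kept, which the article does not define explicitly, and it uses a synthetic image rather than a BloodMNIST sample.

```python
import numpy as np

def energy_at_rank(matrix: np.ndarray, k: int) -> float:
    """Share of squared singular-value mass captured by the top-k components."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return float(np.sum(s[:k] ** 2) / np.sum(s ** 2))

def rank_for_energy(matrix: np.ndarray, target: float) -> int:
    """Smallest k whose retained energy reaches `target`."""
    s = np.linalg.svd(matrix, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(min(np.searchsorted(cumulative, target) + 1, len(s)))

# Hypothetical grayscale "image": smooth low-rank structure plus pixel noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 224)
image = np.outer(np.sin(3 * t), np.cos(2 * t)) + 0.05 * rng.normal(size=(224, 224))

e31 = energy_at_rank(image, 31)
k90 = rank_for_energy(image, 0.90)
print(round(e31, 4), k90)
```

On real images the spectrum decays more slowly than in this toy case, so the rank needed for a given energy level has to be measured per dataset.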

Figure: rank sweep from Abstract Pattern to Clinical Detail, shown at Compression Rank k = 31.

Diagnostic Insight

Optimal Spectral Cutoff. High-frequency pixel noise is removed, but the diagnostic nucleus remains structurally intact.

Information Gain: 12.9x
Fidelity Price: Minimal
Diagnostic Fidelity: 98.8% F1
Physical Compression: 78.2% reduction
Spectral Rank Profile

FC0 Latent Rank: 309
FC3 Latent Rank: 169
Active Parameters: 30.47M

Diagnostic Insight

Native Stability. Operating at full spectral capacity. Redundancy present but largely unexploited.

Sweep Energy Threshold (ε): 0.50, on a scale from Degenerate (0.1) to High Fidelity (0.5)

§6 Discussion & Unified Synthesis

By applying the Akaike Information Criterion (AIC), we find that our compressed models are not merely smaller; they are also preferred under this model-selection criterion. The baseline VGG19 (AIC: $2.79 \times 10^8$) suffers from massive parametric bloat, while our SVD variant (AIC: $4.32 \times 10^7$) and L0 Surgery variant (AIC: $7.11 \times 10^7$) achieve substantially lower AIC values.
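For reference, the standard definitions are AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the parameter count, n the sample count and L the maximized likelihood. The sketch below plugs in the parameter counts quoted in this article; the negative log-likelihood is a placeholder because it is not reported, and using the full dataset size for n is an assumption.

```python
import math

def aic(num_params: float, neg_log_lik: float) -> float:
    """AIC = 2k - 2 ln(L) = 2k + 2 * NLL."""
    return 2.0 * num_params + 2.0 * neg_log_lik

def bic(num_params: float, neg_log_lik: float, n_samples: int) -> float:
    """BIC = k ln(n) - 2 ln(L) = k ln(n) + 2 * NLL."""
    return num_params * math.log(n_samples) + 2.0 * neg_log_lik

# Parameter counts quoted in the article; the NLL value is a placeholder.
k_baseline, k_l0 = 139.6e6, 35.5e6
nll_placeholder = 0.0
aic_baseline = aic(k_baseline, nll_placeholder)
aic_l0 = aic(k_l0, nll_placeholder)
bic_baseline = bic(k_baseline, nll_placeholder, n_samples=17092)
print(f"{aic_baseline:.3g} {aic_l0:.3g} {bic_baseline:.3g}")
```

Even with the likelihood term set to zero, the 2k term alone lands at about 2.79e8 for the baseline and 7.1e7 for the L0 variant, which suggests that at this scale the score is dominated by the parameter count rather than by differences in fit.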

Statistical Parsimony Audit

Evaluating model selection through Information-Theoretic criteria. Lower values indicate a more efficient trade-off between empirical fit and parameter complexity.

Metric Definitions

AIC / BIC

Penalize complexity to prevent overfitting. BIC imposes a stronger penalty based on sample size, favouring simpler models.

MDL

Minimum Description Length. Evaluates the statistical hypothesis by the length of its shortest possible description.

The massive reduction in AIC/BIC/MDL confirms that VGG19 is severely over-parameterized for the BloodMNIST task, captured here by the Parsimony Gap.

§ 6.1 The Pareto Efficiency Frontier

We have mapped a robust Pareto frontier governing the trade-off between strict diagnostic fidelity and absolute computational efficiency.

Figure: The Pareto Efficiency Frontier. Mapping the trade-off between predictive fidelity and resource constraints. Series: Baseline, L1 Lasso, SVD, L0 Surgery.

§ 6.2 Heterogeneous Degradation

The transition from monolithic architectures to compressed variants does not degrade performance uniformly across the classification manifold. Structurally distinct cell types, such as Platelets and Eosinophils, remain robust under extreme pruning (at most 1.6% below baseline under L0 Surgery). Conversely, most of the diagnostic loss is concentrated in morphologically ambiguous classes such as Basophils (-11.5%) and Immature Granulocytes (-7.5%), with Lymphocytes (-10.1%) and Monocytes (-8.4%) also heavily affected; these classes require high-dimensional feature detectors in the deepest layers of the network.

Heterogeneous Degradation Audit

Visualizing the relative performance drop across compression variants compared to the uncompressed baseline. Structurally distinctive classes remain stable, whereas morphologically ambiguous classes account for the majority of the degradation.

Legend: Stable / High Loss. The Baseline column is the per-class score; the other columns are changes vs. the baseline.

Class         Baseline  SVD     L1 Lasso  L0 Surgery
Basophil      99.6      -0.4%   -0.8%     -11.5%
Eosinophil    100.0     -0.2%   -0.2%     -1.6%
Erythroblast  99.4      -0.2%   -1.3%     -3.4%
IG            96.0      +1.0%   +0.3%     -7.5%
Lymphocyte    99.6      -1.2%   -1.0%     -10.1%
Monocyte      98.6      ±0.0    -1.5%     -8.4%
Neutrophil    96.8      +1.1%   +1.0%     -1.9%
Platelet      100.0     ±0.0    -0.2%     -0.8%


The 85% Guardrail: No class fell below the predetermined clinical threshold, confirming that even the most aggressive surgery preserved the minimum features required for diagnostic reliability.
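The guardrail can be checked against the per-class figures of the degradation audit. This sketch assumes the reported drops are percentage points relative to the baseline score; the variable names are illustrative.

```python
classes = ["Basophil", "Eosinophil", "Erythroblast", "IG",
           "Lymphocyte", "Monocyte", "Neutrophil", "Platelet"]
baseline = [99.6, 100.0, 99.4, 96.0, 99.6, 98.6, 96.8, 100.0]
l0_delta = [-11.5, -1.6, -3.4, -7.5, -10.1, -8.4, -1.9, -0.8]

GUARDRAIL = 85.0

# Assumption: deltas are percentage points relative to the baseline score.
l0_scores = {c: round(b + d, 1) for c, b, d in zip(classes, baseline, l0_delta)}
worst_class = min(l0_scores, key=l0_scores.get)
passes = all(score >= GUARDRAIL for score in l0_scores.values())
print(worst_class, l0_scores[worst_class], passes)
```

Under that reading the weakest class after L0 Surgery is Basophil at about 88.1, which still clears the 85% threshold. The same minimum-over-classes check is what a per-class stopping criterion would evaluate after each pruning step.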

§ 6.3 Unified Error Topology

Comparative analysis of the error manifold across compression techniques. Notice how SVD preserves the baseline's decision boundaries while L0 introduces selective sensitivity in morphologically similar clusters.

Cross-Methodology Comparison

Unified Error Topology Analysis

Visualizing the transition of classification boundaries under different compression constraints. Diagonal cells are the correct class; off-diagonal cells are misclassifications.

              Basophil  Eosinophil  Erythroblast  IG   Lymphocyte  Monocyte  Neutrophil  Platelet
Basophil      243       0           0             0    0           0         1           0
Eosinophil    0         623         0             1    0           0         0           0
Erythroblast  0         0           303           3    2           3         0           0
IG            0         1           4             559  1           2         12          0
Lymphocyte    0         0           1             2    238         2         0           0
Monocyte      1         0           0             5    1           276       1           0
Neutrophil    0         0           0             9    0           0         657         0
Platelet      0         0           0             0    0           0         0           470

Boundary Sensitivity

The baseline VGG19 shows exceptional fidelity. Most errors are concentrated in the 'Immature Granulocytes' class, which shares morphological primitives with Neutrophils.

Macro F1 Stability: 98.57%

Statistical Insight: The diagonal entries represent the true positives. Off-diagonal concentration in the middle rows confirms that morphological ambiguity is the primary bottleneck for both dense and sparse models.
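Per-class metrics can be derived from the matrix above with a few array operations. The sketch assumes rows are true classes and columns are predicted classes, an orientation the figure does not state explicitly.

```python
import numpy as np

classes = ["Basophil", "Eosinophil", "Erythroblast", "IG",
           "Lymphocyte", "Monocyte", "Neutrophil", "Platelet"]
cm = np.array([
    [243,   0,   0,   0,   0,   0,   1,   0],
    [  0, 623,   0,   1,   0,   0,   0,   0],
    [  0,   0, 303,   3,   2,   3,   0,   0],
    [  0,   1,   4, 559,   1,   2,  12,   0],
    [  0,   0,   1,   2, 238,   2,   0,   0],
    [  1,   0,   0,   5,   1, 276,   1,   0],
    [  0,   0,   0,   9,   0,   0, 657,   0],
    [  0,   0,   0,   0,   0,   0,   0, 470],
])

tp = np.diag(cm).astype(float)
recall = tp / cm.sum(axis=1)      # rows: true class (assumed)
precision = tp / cm.sum(axis=0)   # columns: predicted class (assumed)
f1 = 2 * precision * recall / (precision + recall)

accuracy = tp.sum() / cm.sum()
macro_f1 = f1.mean()
hardest = classes[int(np.argmin(f1))]
print(f"accuracy={accuracy:.4f}, hardest class={hardest}")
```

The diagonal sums to 3,369 of 3,421 samples, i.e. an accuracy of about 98.48%, in line with the baseline top-1 figure, and the class with the lowest F1 is IG, with its largest confusion being the 12 samples assigned to Neutrophil.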

§ 6.4 The Compound Pipeline

The final model architecture utilizes a unified pipeline. Future deployment pipelines can compound these methods to achieve compression ratios inaccessible to any single method.

Compounding Efficiency

Figure: Simulation of Pipeline Stacking. Compression vectors can be toggled, including a Post-Training Quantization Strategy. With no vectors applied (0.0% compression) the architecture is a dense convolutional base with a dense classifier: memory footprint 532.6 MB, inference latency 231.3 ms.

Stacked compression vectors yield super-linear savings in deployment environments.

Deployment Decision Matrix

The Deployment Framework: a statistically-driven matrix for selecting the optimal compression strategy based on clinical and hardware constraints.

Step 01: Primary Constraint Analysis

The Practitioner's Playbook

Summary of Research Recommendations

Match Method to Redundancy

Not all redundancy is created equal. Fully-connected layers exhibit low-rank linear redundancy (their weight matrices have far fewer significant directions than their nominal width), making them ideal for SVD. Convolutional layers, however, possess spatial filter redundancy that requires structured pruning.

Practical Implementation

Use SVD for dense layers; use L0 for convolutional bases.

Hardware Profiling Audit

System Audit & Reproducibility Specs

Hardware Stack

GPU: NVIDIA T4 Tensor Core (16GB GDDR6, Turing Architecture)
CPU: Intel Xeon Processor, 2 vCPUs @ 2.20GHz (Cloud Instance)
RAM: 12.7 GB System RAM (Google Colab Runtime Environment)

Software Stack

OS: Ubuntu 22.04 LTS, Linux Kernel (Colab Container)
ENV: Python 3.10.x, PyTorch 2.x, CUDA 12.x Support
DATA: medmnist v3.0.2, BloodMNIST+ (224px native resolution)

Benchmarked @ Batch Size 32 · Floating Point Precision: FP32 · Inference Context: Local/Non-Distributed