Statistical Surgery: Compressing VGG19 | The Null Hypothesis

Introduction

Artificial intelligence has fundamentally transformed modern computing, but this predictive power comes at a severe computational cost. Classical statistical learning theory suggests that highly parameterized models are prone to overfitting. Yet, deep learning often departs from these classical bias-variance expectations through a phenomenon known as benign overfitting.

Architectures such as VGG19 contain approximately 140 million parameters. When transferred into specialized medical settings—like the BloodMNIST hematological imaging task—the model offers immense representational power but creates a massive, statistically unjustifiable parameter-to-task imbalance.

Parameters

139.6M→35.5M

74.5% reduction in total parameters via structured surgery.

Efficiency Gain

2.39x Speedup

231.3ms→96.9ms

58.1% lower wall-clock latency on standard hardware (T4 GPU).

Storage Footprint

532.6MB→134.9MB

Significant reduction in serialized model size for edge deployment.
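The headline figures can be re-derived from the before and after values quoted in the cards. The sketch below is a plain arithmetic check; the helper names are illustrative, and the last digit can differ slightly from the cards because the inputs are already rounded.

```python
def pct_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


def speedup(before, after):
    """How many times faster `after` is than `before`."""
    return before / after


# Before/after values as quoted in the summary cards.
params = pct_reduction(139.6, 35.5)     # millions of parameters
latency = pct_reduction(231.3, 96.9)    # milliseconds per batch
storage = pct_reduction(532.6, 134.9)   # megabytes on disk
gain = speedup(231.3, 96.9)

print(f"parameters: {params:.1f}% fewer")
print(f"latency:    {latency:.1f}% lower ({gain:.2f}x speedup)")
print(f"storage:    {storage:.1f}% smaller")
```

Note that a 2.39x speedup and a 58.1% latency reduction are the same measurement expressed two ways: 1 - 1/2.39 is roughly 0.58.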

§1 Research Methodology

The objective is to minimize the empirical risk while simultaneously reducing the parameter count $\mathcal{P} = \|W\|_0$, the number of non-zero weights. This is formulated as a constrained optimization problem:

$$\min_{W} \mathcal{L}(W; \mathcal{D}) \quad \text{s.t.} \quad \|W\|_0 \leq \kappa$$

where $\kappa$ is the target parameter budget and $\mathcal{L}$ is the cross-entropy loss.
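Because the $\|W\|_0$ constraint is combinatorial, it is usually handled in penalized form. As a sketch (the multiplier $\lambda$ is the penalty strength that reappears in §3 and §4; the exact objective used in the experiments is an assumption here), the constrained problem becomes a penalized one, and the L0 term can be relaxed to an L1 term whose subgradient is $\lambda\,\text{sign}(w_j)$:

```latex
\min_{W} \; \mathcal{L}(W; \mathcal{D}) + \lambda \|W\|_0
\quad \longrightarrow \quad
\min_{W} \; \mathcal{L}(W; \mathcal{D}) + \lambda \|W\|_1,
\qquad \|W\|_1 = \sum_j |w_j|
```

A larger $\lambda$ plays the role of a smaller budget $\kappa$: both push more weights to zero.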

Hematological Dataset Exploration

Dataset N = 17,092 Samples

Statistical Bias Note

"The moderate class imbalance observed here necessitates the use of Macro-averaged F1 metrics. Accuracy alone would be biased toward the majority classes (Neutrophils and Eosinophils)."

Figure: Intensity Profile (RGB Mean).

Dataset: BloodMNIST-224 • Task: 8-Class Classification • Sparsity Context: Baseline Analysis

§2 Baseline Model Analysis

The uncompressed VGG19 architecture acts as our empirical upper bound. Fine-tuned using SGD, it achieved a top-1 accuracy of 98.48% with a mean latency of 231.3 ms.

Feature Hierarchy & Volumetrics

VGG19 Hierarchical Construction

Isometric decomposition visualizing the bottleneck transitions and parameter distribution.

§ 2.1 Empirical Performance Fidelity

Our stopping criteria relied on the minimum per-class F1-score to protect minority classes from being "averaged out" by high performance on dominant classes.
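To make the difference between accuracy, Macro-F1, and the minimum per-class F1 concrete, here is a minimal sketch on a toy confusion matrix. The matrix is invented for illustration (it is not BloodMNIST data), and rows are assumed to be true classes with columns as predictions.

```python
def per_class_f1(cm):
    """F1 for each class of a square confusion matrix (rows = true, cols = predicted)."""
    n = len(cm)
    scores = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                        # true class i, predicted elsewhere
        fp = sum(cm[r][i] for r in range(n)) - tp   # predicted i, true class elsewhere
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def accuracy(cm):
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Toy 3-class problem: the third class is rare and is never predicted.
toy_cm = [
    [95, 5, 0],
    [4, 96, 0],
    [6, 4, 0],
]

f1_scores = per_class_f1(toy_cm)
macro_f1 = sum(f1_scores) / len(f1_scores)
min_f1 = min(f1_scores)

print(f"accuracy = {accuracy(toy_cm):.3f}")
print(f"macro F1 = {macro_f1:.3f}")
print(f"min F1   = {min_f1:.3f}")
```

Accuracy stays above 90% even though one class is missed entirely, while Macro-F1 drops sharply and the minimum per-class F1 falls to zero. A stopping rule keyed to the minimum therefore reacts to exactly the failure that accuracy hides.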

Summary Statistics: Macro-F1 98.57% • Latency 231 ms

Figure: F1-Score Distribution (%) by class, for Eosinophils, Platelets, Basophils, Lymphocytes, Erythroblasts, Monocytes, Neutrophils, and Immature Granulocytes. The axis spans 90% to 100%, points are marked Optimal or Critical, and the average is 98.57%.

PCA Redundancy Audit

Quantifying the intrinsic dimensionality of latent activation spaces.

Layer | Retained / total dimensions | Architectural Capacity
Conv Block 3 | 41 / 256 | 84% Redundant
Conv Block 4 | 122 / 512 | 76% Redundant
Conv Block 5 | 59 / 512 | 88% Redundant
FC1 Head | 285 / 4096 | 93% Redundant
FC2 Head | 137 / 4096 | 97% Redundant
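The audit reduces to two small calculations: picking the number of principal components that explain a chosen share of the variance, and converting that count into a redundancy percentage. The sketch below assumes a cumulative explained-variance criterion; the threshold and the example spectrum are hypothetical, since the article does not state which cutoff was used.

```python
def intrinsic_dim(eigenvalues, threshold):
    """Smallest number of components whose variance share reaches `threshold`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)


def redundancy_pct(retained, total):
    """Share of dimensions that the audit marks as unused, in percent."""
    return 100.0 * (1.0 - retained / total)


# Hypothetical variance spectrum: three components carry 90% of the variance.
spectrum = [5.0, 3.0, 1.0, 0.5, 0.3, 0.2]
k = intrinsic_dim(spectrum, threshold=0.85)
print(f"components needed: {k} of {len(spectrum)}")

# Retained / total dimensions as reported in the audit above.
audit = {
    "Conv Block 3": (41, 256),
    "Conv Block 4": (122, 512),
    "Conv Block 5": (59, 512),
    "FC1 Head": (285, 4096),
    "FC2 Head": (137, 4096),
}
for name, (retained, total) in audit.items():
    print(f"{name}: {redundancy_pct(retained, total):.0f}% redundant")
```

The reported redundancy figures follow directly from the retained and total dimensions, so the only free choice in such an audit is the variance threshold that defines the "elbow".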

§3 L1 Lasso Regularization

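The L1 norm penalty (Lasso) induces a continuous zero-attraction effect on individual weights. While this statistically simplifies the model, it creates a Hardware Paradox: weights that are numerically zero still sit inside dense tensors, so memory occupancy and inference latency do not improve. The per-weight update with learning rate $\eta$ and penalty strength $\lambda$ is: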

$$w_j^{new} \leftarrow w_j - \eta \nabla \mathcal{L}_{task,j} - \eta \lambda \, \text{sign}(w_j)$$
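A minimal sketch of this update in plain Python follows. The weights, gradients, learning rate, and penalty strength are invented for illustration; the function name is hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def l1_step(weights, task_grads, lr, lam):
    """One subgradient step: task gradient plus the constant L1 pull toward zero."""
    return [
        w - lr * g - lr * lam * sign(w)
        for w, g in zip(weights, task_grads)
    ]


weights = [0.5, -0.2, 0.001]
task_grads = [0.0, 0.0, 0.0]   # zero task gradient isolates the L1 term
updated = l1_step(weights, task_grads, lr=0.1, lam=0.1)
print(updated)
```

With a zero task gradient, every weight moves toward zero by the same fixed amount, `lr * lam`. The third weight is smaller than that step, so it overshoots and changes sign instead of landing on zero. This is why plain subgradient steps leave many weights nearly zero rather than exactly zero, and why a magnitude threshold is applied afterward.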

To prevent weights from becoming trapped in "nearly-zero" local minima during pruning, we utilized a combination of Stochastic Gradient Descent with Warm Restarts (SGDR) for global exploration and the Adam optimizer for local convergence. This cyclical approach allowed for a robust discovery of the global error manifold without catastrophic forgetting.

Training Dynamics & Phase Transition

Tracking the emergence of sparsity under increasing L1 pressure.

Figure: Macro F1 and sparsity (soft and global) tracked over 22 epochs under the L1 penalty constraint. At epoch 0 the Macro F1 is 98.6% and both sparsity measures are 0.0%.

Zero-Attraction Phase: the L1 norm pulls redundant weights toward zero while preserving topological fidelity.

Weight Magnitude Audit

Analyzing the post-Lasso zero-attraction topology.

Figure: Full Spectrum Density, the magnitude distribution (|w|) of the weights, with a Surgical Threshold Sweep separating zero from effective weights. Threshold shown: ε = 0.015.

Theoretical Sparsity

95.3%

Parameters that can be numerically zeroed without significant loss impact.

Memory Occupancy

528 MB

Static occupancy despite numerical sparsity—demonstrating the Hardware Paradox.

Inference Profile

1.0x (No Speedup)

Unstructured sparsity does not bypass SIMD multipliers. The tensor must undergo structural surgery to gain latency benefits.
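The Hardware Paradox can be reproduced in a few lines. The sketch below thresholds a small dense buffer at ε = 0.015 and shows that the storage size does not change; the weight values are invented, and a standard-library float32 array stands in for a dense tensor.

```python
from array import array

EPSILON = 0.015  # surgical threshold from the sweep above

# Hypothetical post-Lasso weights stored as a dense float32 buffer.
weights = array("f", [0.2, -0.01, 0.004, -0.3, 0.0149, 0.5, -0.002, 0.012])
bytes_before = weights.itemsize * len(weights)

# Numerically zero every weight whose magnitude falls below the threshold.
pruned = array("f", [0.0 if abs(w) < EPSILON else w for w in weights])
bytes_after = pruned.itemsize * len(pruned)

sparsity = sum(1 for w in pruned if w == 0.0) / len(pruned)

print(f"sparsity: {sparsity:.1%}")
print(f"dense storage: {bytes_before} bytes before, {bytes_after} bytes after")
```

Zeros occupy the same four bytes as any other float32 value, and a dense matrix multiply still visits them. Savings only appear once whole rows, columns, or channels are removed so that the tensor shape itself shrinks.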

§4 Structured Surgery via L0 Gates

To achieve physical acceleration, we must shift from weight-level sparsity to channel-level surgery. We implement a Gaussian Stochastic Gate—a differentiable relaxation that samples gate states from a continuous distribution during training.

To control the pruning rate, we used a PID controller to dynamically modulate the regularization penalty $\lambda$. This prevented the 'sparsity collapse' often seen in static regularization, successfully stabilizing the network as pressure peaked around epoch 13.
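A PID loop of this kind can be sketched as follows. The class name, gains, clamp, and target sparsity are all hypothetical; the article does not give the controller's actual parameters, so this only illustrates the feedback mechanism.

```python
class PenaltyController:
    """Textbook PID loop that maps a sparsity error to a penalty strength."""

    def __init__(self, kp, ki, kd, lam_max=10.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.lam_max = lam_max
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, target_sparsity, measured_sparsity):
        error = target_sparsity - measured_sparsity   # positive: prune harder
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        lam = self.kp * error + self.ki * self.integral + self.kd * derivative
        # The penalty cannot be negative and is capped to avoid runaway pressure.
        return min(max(lam, 0.0), self.lam_max)


controller = PenaltyController(kp=1.0, ki=0.1, kd=0.0)
lam_early = controller.update(target_sparsity=0.75, measured_sparsity=0.0)
lam_overshoot = controller.update(target_sparsity=0.75, measured_sparsity=0.9)
print(lam_early, lam_overshoot)
```

Early in training the network is far below the target, so the controller applies strong pressure. Once measured sparsity overshoots the target, the error turns negative and the penalty is released, which is the mechanism that guards against collapse under a static penalty.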

Differentiable L0 Relaxation

The Gaussian Stochastic Gate

Differentiable L0 relaxation: Visualizing the transition from continuous parameters to discrete hardware gates.

Tensor Shutter Array (Stochastic)

Each cell represents a convolutional channel. Stochastic sampling determines whether the "shutter" (gate) is physically open for inference.

Gate Bias ($\mu$): 0.50 • Gate Status: ACTIVE (Identity Map Inherited)

Information Transfer

The Gaussian gate acts as a continuous proxy for discrete L0 penalization. By adjusting the bias, we modulate the probability of channel survival.
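One common way to build such a gate is a clipped Gaussian: add noise to the bias and clamp the result to [0, 1]. The sketch below assumes that form, and the noise scale `sigma` is a hypothetical value; the article does not spell out the exact parameterization.

```python
import math
import random


def sample_gate(mu, sigma, rng):
    """Training-time gate: Gaussian noise around the bias, clipped to [0, 1]."""
    return min(1.0, max(0.0, mu + sigma * rng.gauss(0.0, 1.0)))


def prob_open(mu, sigma):
    """P(mu + sigma * eps > 0) for standard normal eps, i.e. Phi(mu / sigma)."""
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))


def expected_l0(mus, sigma):
    """Differentiable surrogate for the number of open channels."""
    return sum(prob_open(mu, sigma) for mu in mus)


rng = random.Random(0)
samples = [sample_gate(0.5, 0.5, rng) for _ in range(1000)]

print(f"P(open) at mu = 0.50: {prob_open(0.5, 0.5):.4f}")
print(f"P(open) at mu = 0.00: {prob_open(0.0, 0.5):.4f}")
print(f"expected open channels: {expected_l0([0.5, 0.0, -0.5], 0.5):.3f}")
```

Lowering the bias lowers the survival probability smoothly, so the expected channel count can be penalized with gradient descent even though the final keep-or-drop decision is discrete.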

§ 4.2 The Latent Hierarchy

Structural analysis reveals a depth-dependent survival gradient across the VGG19 backbone. Visual primitives (edges/colors) in Blocks 1-2 showed high survival (62.5% to 87.5%), the transition zone in Block 3 saw intermediate survival (~50%) as ImageNet-specific features dropped, and the deeper layers in Blocks 4-5 exposed massive redundant capacity with low survival (~20%).

The Survival Gradient

Visualizing the structural survival of VGG19 channels. Early layers (Blocks 1-2) are preserved for low-level feature extraction, while deep layers are aggressively pruned.

Layer | Block | Channels kept / total | Survival
L1 | Block 1 | 40/64 | 62.5%
L2 | Block 1 | 47/64 | 73.4%
L3 | Block 2 | 97/128 | 75.8%
L4 | Block 2 | 112/128 | 87.5%
L5 | Block 3 | 122/256 | 47.7%
L6 | Block 3 | 116/256 | 45.3%
L7 | Block 3 | 127/256 | 49.6%
L8 | Block 3 | 133/256 | 52.0%
L9 | Block 4 | 115/512 | 22.5%
L10 | Block 4 | 112/512 | 21.9%
L11 | Block 4 | 123/512 | 24.0%
L12 | Block 4 | 124/512 | 24.2%
L13 | Block 5 | 114/512 | 22.3%
L14 | Block 5 | 97/512 | 18.9%
L15 | Block 5 | 110/512 | 21.5%
L16 | Block 5 | 83/512 | 16.2%
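The survival counts translate directly into a parameter count for the convolutional base. The sketch below assumes the standard VGG19 layout (3x3 kernels with bias, RGB input) and that removing a layer's output channels also removes the matching input channels of the next layer. It covers the convolutional layers only, not the classifier head.

```python
# Output channels per convolutional layer: standard VGG19, then after surgery.
DENSE = [64, 64, 128, 128, 256, 256, 256, 256,
         512, 512, 512, 512, 512, 512, 512, 512]
KEPT = [40, 47, 97, 112, 122, 116, 127, 133,
        115, 112, 123, 124, 114, 97, 110, 83]


def conv_params(out_channels, in_channels=3, kernel=3):
    """Weights plus biases for a chain of square-kernel convolutions."""
    total = 0
    fan_in = in_channels
    for fan_out in out_channels:
        total += fan_in * fan_out * kernel * kernel + fan_out
        fan_in = fan_out   # the next layer reads only the surviving channels
    return total


dense_total = conv_params(DENSE)
pruned_total = conv_params(KEPT)

print(f"dense conv base:  {dense_total:,} parameters")
print(f"pruned conv base: {pruned_total:,} parameters")
print(f"remaining share:  {100 * pruned_total / dense_total:.1f}%")
```

The savings compound because each layer's cost is the product of its input and output widths: keeping about 20% of the channels on both sides of a deep layer leaves roughly 4% of that layer's weights.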

§ 4.3 Structured Tensor Surgery

Pruning Visualization

Discarding redundant channels physically to unlock hardware throughput.

Pruning Severity (L0 Penalty): 35%, on a scale from Diagnostic Conservation to Aggressive Pruning

Tensor Shape: [B, 512, 14, 14] (Batch × Channels × Height × Width)

Inference Latency: 168.1 ms • Throughput: 1.38x

§ 4.4 Inference Throughput Race

Hardware Inference Race

Throughput Benchmark

Simulating a clinical queue of 20 diagnostic batches. The L0 model achieves physical acceleration through tensor surgery, clearing the queue while the Baseline still processes.

L0 Structured Surgery latency: 96.9 ms

Baseline VGG19 latency: 231.3 ms

Efficiency Gain: 2.39x Speedup
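The race can be worked through by hand. The sketch below assumes batches are processed strictly one after another at the mean latencies quoted above, with no queueing or transfer overhead.

```python
import math

QUEUE_LENGTH = 20          # diagnostic batches in the simulated clinical queue
L0_LATENCY_MS = 96.9       # per-batch latency after structured surgery
BASELINE_LATENCY_MS = 231.3


def time_to_clear(batches, latency_ms):
    """Total time to process the queue sequentially, in milliseconds."""
    return batches * latency_ms


l0_total = time_to_clear(QUEUE_LENGTH, L0_LATENCY_MS)
baseline_total = time_to_clear(QUEUE_LENGTH, BASELINE_LATENCY_MS)

# Whole batches the baseline has finished at the moment the L0 model is done.
baseline_done = math.floor(l0_total / BASELINE_LATENCY_MS)

print(f"L0 clears the queue in       {l0_total / 1000:.2f} s")
print(f"Baseline clears the queue in {baseline_total / 1000:.2f} s")
print(f"Baseline batches finished when L0 is done: {baseline_done} of {QUEUE_LENGTH}")
```

Under these assumptions the baseline has completed fewer than half of the batches by the time the pruned model has emptied the queue.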

§5 Low-Rank Factorization (SVD)

The classification head is compressed using Truncated SVD, splitting enormous matrices into compressed, sequential multiplications.

$$W \approx U_k \Sigma_k V_k^T$$

Singular Value Decomposition

Decomposing high-density weight tensors into essential geometric primitives.

Energy: 98.2% • Compression: 1.0x

Uncompressed Weights: 32x32 Tensor

SVD Approximation (Rank-16): 1,024 Parameters

Approximation Rank: k = 16

§ 5.1 The Diagnostic Fidelity Sweep

Stripping out high-frequency parametric noise through SVD can actually improve classification accuracy by acting as a structural denoiser.

Diagnostic Fidelity Sweep

Why SVD works: Clinical images contain massive spatial redundancy.

Rank k = 31 • 78.7% Energy

Audit Objective

The previous section showed that the weights are redundant. This section shows that the diagnostic data itself is low-rank, allowing the network to discard high-frequency "noise" without losing the cell's nucleus structure.

Diagnostic Insight

Optimal Spectral Cutoff. High-frequency pixel noise is removed, but the diagnostic nucleus remains structurally intact.

Information Gain: 12.9x • Fidelity Price: Minimal

Diagnostic Fidelity: 98.8% F1 • Physical Compression: 78.2% reduction

Spectral Rank Profile

FC0 Latent Rank: 309

FC3 Latent Rank: 169

Active Parameters: 30.47M
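A rank-k factorization replaces an m x n matrix with two factors holding k(m + n) values, so it only saves parameters when k is below mn / (m + n). The sketch below checks this on the 32x32 demo and then on the classifier head. The layer shapes are assumptions: a torchvision-style VGG19 head (25088 to 4096 to 4096) with an 8-class output, FC0 and FC3 taken to be the first two linear layers, biases kept, and an unchanged convolutional base of 20,024,384 parameters.

```python
def dense_params(rows, cols):
    """Parameter count of an unfactored weight matrix."""
    return rows * cols


def factored_params(rows, cols, rank):
    """Parameter count of two sequential factors (rows x rank) and (rank x cols)."""
    return rank * (rows + cols)


# 32x32 demo at rank 16: the factors are exactly as large as the original.
demo_ratio = dense_params(32, 32) / factored_params(32, 32, 16)

# Classifier head under the assumptions stated above.
CONV_BASE = 20_024_384                       # standard VGG19 convolutional layers
fc0 = factored_params(25088, 4096, 309)      # FC0 at latent rank 309
fc3 = factored_params(4096, 4096, 169)       # FC3 at latent rank 169
output_layer = 4096 * 8 + 8                  # final 8-class layer with bias
hidden_biases = 4096 + 4096

total = CONV_BASE + fc0 + fc3 + output_layer + hidden_biases

print(f"32x32 at rank 16: {factored_params(32, 32, 16):,} parameters ({demo_ratio:.1f}x)")
print(f"FC0 factored: {fc0:,} (dense: {dense_params(25088, 4096):,})")
print(f"FC3 factored: {fc3:,} (dense: {dense_params(4096, 4096):,})")
print(f"total: {total / 1e6:.2f}M parameters")
```

Under these assumptions the total comes to about 30.47M, which is consistent with the Active Parameters figure and with a reduction of roughly 78% from 139.6M. The 32x32 demo also explains the 1.0x compression reading: rank 16 is the break-even point for a square 32x32 matrix.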

Diagnostic Insight

Native Stability. Operating at full spectral capacity. Redundancy present but largely unexploited.

Sweep Energy Threshold (ε): 0.50, on a scale from Degenerate (0.1) to High Fidelity (0.5)

§6 Discussion & Unified Synthesis

By applying the Akaike Information Criterion (AIC), we find that our compressed models are not merely smaller; they are also preferred under the criterion. The baseline VGG19 (AIC: $2.79 \times 10^8$) suffers from massive parametric bloat, while our SVD variant (AIC: $4.32 \times 10^7$) and L0 Surgery variant (AIC: $7.11 \times 10^7$) achieve substantially lower AIC values.

Statistical Parsimony Audit

Evaluating model selection through Information-Theoretic criteria. Lower values indicate a more efficient trade-off between empirical fit and parameter complexity.

Metric Definitions

AIC / BIC

Penalize complexity to prevent overfitting. BIC imposes a stronger penalty based on sample size, favouring simpler models.

MDL

Minimum Description Length. Evaluates the statistical hypothesis by the length of its shortest possible description.

The massive reduction in AIC/BIC/MDL confirms that VGG19 is severely over-parameterized for the BloodMNIST task, captured here by the Parsimony Gap.
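For reference, the textbook definitions are AIC = 2k + 2·NLL and BIC = k·ln(n) + 2·NLL, where k is the parameter count, n the sample size, and NLL the negative log-likelihood. The sketch below plugs in the parameter counts quoted earlier; the NLL value and the choice of n are hypothetical placeholders, since the article does not report them.

```python
import math


def aic(num_params, nll):
    """Akaike Information Criterion from a parameter count and a total NLL."""
    return 2 * num_params + 2 * nll


def bic(num_params, num_samples, nll):
    """Bayesian Information Criterion; the penalty grows with log sample size."""
    return num_params * math.log(num_samples) + 2 * nll


HYPOTHETICAL_NLL = 500.0     # placeholder total NLL in nats, not a reported value
NUM_SAMPLES = 17_092         # full dataset size; the split actually used is an assumption

baseline_aic = aic(139.6e6, HYPOTHETICAL_NLL)
l0_aic = aic(35.5e6, HYPOTHETICAL_NLL)
baseline_bic = bic(139.6e6, NUM_SAMPLES, HYPOTHETICAL_NLL)

print(f"baseline AIC: {baseline_aic:.3e}")
print(f"L0 AIC:       {l0_aic:.3e}")
print(f"baseline BIC: {baseline_bic:.3e}")
```

With parameter counts in the tens of millions, the 2k term dwarfs any plausible likelihood term, so at this scale the criteria mostly rank models by size. The baseline value of about 2.79 × 10^8 is essentially twice its parameter count.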

§ 6.1 The Pareto Efficiency Frontier

We have mapped a robust Pareto frontier governing the trade-off between strict diagnostic fidelity and absolute computational efficiency.

The Pareto Efficiency Frontier

Mapping the trade-off between predictive fidelity and resource constraints.

Figure: Pareto frontier with four plotted variants: Baseline, L1 Lasso, SVD, and L0 Surgery.

§ 6.2 Heterogeneous Degradation

The transition from monolithic architectures to compressed variants does not degrade performance uniformly across the classification manifold. Structurally distinct cell types, such as Platelets and Eosinophils, remain robust under extreme pruning. Conversely, the majority of the diagnostic loss is concentrated in morphologically ambiguous classes like Basophils and Immature Granulocytes, which require high-dimensional feature detectors in the deepest layers of the network.

Heterogeneous Degradation Audit

Visualizing the relative performance drop across compression variants compared to the uncompressed baseline. Structurally distinctive classes remain stable, whereas morphologically ambiguous classes account for the majority of the degradation.

Legend: Stable, High Loss

Variant | Basophil | Eosinophil | Erythroblast | IG | Lymphocyte | Monocyte | Neutrophil | Platelet
Baseline | 99.6 | 100.0 | 99.4 | 96.0 | 99.6 | 98.6 | 96.8 | 100.0
L1 Lasso (vs Base) | -0.8% | -0.2% | -1.3% | +0.3% | -1.0% | -1.5% | +1.0% | -0.2%
L0 Surgery (vs Base) | -11.5% | -1.6% | -3.4% | -7.5% | -10.1% | -8.4% | -1.9% | -0.8%

SVD (vs Base), seven of eight cells present, listed in order: -0.4%, -0.2%, +1.0%, -1.2%, ±0.0, +1.1%, ±0.0

The 85% Guardrail: No class fell below the predetermined clinical threshold, confirming that even the most aggressive surgery preserved the minimum features required for diagnostic reliability.
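The guardrail can be checked against the L0 Surgery row of the audit. The sketch below uses the baseline scores and L0 deltas listed above. Whether each delta is in percentage points or relative to the baseline is not stated, so both readings are computed as an assumption check.

```python
GUARDRAIL = 85.0

CLASSES = ["Basophil", "Eosinophil", "Erythroblast", "IG",
           "Lymphocyte", "Monocyte", "Neutrophil", "Platelet"]
BASELINE = [99.6, 100.0, 99.4, 96.0, 99.6, 98.6, 96.8, 100.0]
L0_DELTA = [-11.5, -1.6, -3.4, -7.5, -10.1, -8.4, -1.9, -0.8]

# Reading 1: deltas are percentage points subtracted from the baseline score.
absolute = [base + delta for base, delta in zip(BASELINE, L0_DELTA)]

# Reading 2: deltas are relative changes of the baseline score.
relative = [base * (1 + delta / 100) for base, delta in zip(BASELINE, L0_DELTA)]

worst_score, worst_class = min(zip(absolute, CLASSES))
passes = all(score >= GUARDRAIL for score in absolute + relative)

print(f"lowest class under L0 Surgery: {worst_class} at {worst_score:.1f}")
print(f"all classes at or above {GUARDRAIL:.0f}: {passes}")
```

Under either reading the lowest class is Basophil at roughly 88, about three points above the threshold, so the guardrail holds but with limited margin.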

§ 6.3 Unified Error Topology

Comparative analysis of the error manifold across compression techniques. Notice how SVD preserves the baseline's decision boundaries while L0 introduces selective sensitivity in morphologically similar clusters.

Cross-Methodology Comparison

Unified Error Topology Analysis

Visualizing the transition of classification boundaries under different compression constraints.

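Figure: confusion matrix for the baseline over Basophil, Eosinophil, Erythroblast, IG, Lymphocyte, Monocyte, Neutrophil, and Platelet, with cells marked Correct Class or Misclassification. Diagonal (correct) counts and row totals: Basophil 243 of 244, Eosinophil 623 of 624, Erythroblast 303 of 311, IG 559 of 579, Lymphocyte 238 of 243, Monocyte 276 of 284, Neutrophil 657 of 666, Platelet 470 of 470. The IG row reads 0, 1, 4, 559, 1, 2, 12, 0 across the eight columns, so its largest confusion (12 samples) is with Neutrophil.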

Boundary Sensitivity

The baseline VGG19 shows exceptional fidelity. Most errors are concentrated in the 'Immature Granulocytes' class, which shares morphological primitives with Neutrophils.

Macro F1 Stability: 98.57%

Statistical Insight: The diagonal entries represent the true positives. Off-diagonal concentration in the middle rows confirms that morphological ambiguity is the primary bottleneck for both dense and sparse models.
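The overall accuracy can be recovered from the matrix using only the diagonal counts and the row sums of the values listed above. The sketch below treats rows as true classes, which is an assumption about the matrix orientation.

```python
CLASSES = ["Basophil", "Eosinophil", "Erythroblast", "IG",
           "Lymphocyte", "Monocyte", "Neutrophil", "Platelet"]
CORRECT = [243, 623, 303, 559, 238, 276, 657, 470]      # diagonal entries
ROW_TOTAL = [244, 624, 311, 579, 243, 284, 666, 470]    # sum of each row

overall_accuracy = sum(CORRECT) / sum(ROW_TOTAL)

# Share of each row that lands on the diagonal (recall, if rows are true classes).
row_share = {
    name: correct / total
    for name, correct, total in zip(CLASSES, CORRECT, ROW_TOTAL)
}
weakest = min(row_share, key=row_share.get)

print(f"overall accuracy: {100 * overall_accuracy:.2f}%")
print(f"weakest row: {weakest} at {100 * row_share[weakest]:.1f}%")
```

The result, 3,369 correct out of 3,421, reproduces the 98.48% top-1 accuracy reported for the baseline, and the weakest row is IG, in line with the boundary-sensitivity note.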

§ 6.4 The Compound Pipeline

The final model architecture utilizes a unified pipeline. Future deployment pipelines can compound these methods to achieve compression ratios inaccessible to any single method.

Compounding Efficiency

Simulation of Pipeline Stacking

Figure: pipeline stacking simulation. Starting state: dense convolutional base and dense classifier, 0.0% compression, 532.6 MB memory footprint, 231.3 ms inference latency. Selectable compression vectors include a post-training quantization strategy.

Stacked compression vectors compound multiplicatively, so the combined savings exceed what any single method delivers in deployment environments.
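One way to read the compounding claim is that storage ratios from independent stages multiply. The sketch below stacks the measured surgery footprint with a hypothetical INT8 post-training quantization step. The 4x quantization ratio assumes 32-bit weights stored as 8-bit with no metadata overhead; the result is a projection, not a measured outcome.

```python
def stacked_footprint_mb(footprint_mb, ratios):
    """Apply successive compression ratios to a starting footprint."""
    for ratio in ratios:
        footprint_mb /= ratio
    return footprint_mb


BASELINE_MB = 532.6
SURGERY_MB = 134.9

surgery_ratio = BASELINE_MB / SURGERY_MB    # measured: structured surgery
int8_ratio = 32 / 8                         # assumed: FP32 to INT8, no overhead

projected_mb = stacked_footprint_mb(BASELINE_MB, [surgery_ratio, int8_ratio])
combined_ratio = BASELINE_MB / projected_mb

print(f"surgery alone:   {surgery_ratio:.2f}x")
print(f"with INT8 (est): {combined_ratio:.2f}x, about {projected_mb:.1f} MB")
```

Quantization accuracy would need its own evaluation against the per-class guardrail before such a stacked pipeline could be deployed.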

Deployment Decision Matrix

The Deployment Framework

A statistically-driven matrix for selecting the optimal compression strategy based on clinical and hardware constraints.

Step 01

Primary Constraint Analysis

The Practitioner's Playbook

Summary of Research Recommendations

Match Method to Redundancy

Not all redundancy is created equal. Fully-connected layers exhibit low-rank linear redundancy, making them ideal for SVD. Convolutional layers, however, possess spatial filter redundancy that requires structured pruning.

Practical Implementation

Use SVD for dense layers; use L0 for convolutional bases.

Hardware Profiling Audit

System Audit & Reproducibility Specs

Hardware Stack

GPU

NVIDIA T4 Tensor Core

16GB GDDR6 • Turing Architecture

CPU

Intel Xeon Processor

2 vCPUs @ 2.20GHz (Cloud Instance)

RAM

12.7 GB System RAM

Google Colab Runtime Environment

Software Stack

OS

Ubuntu 22.04 LTS

Linux Kernel (Colab Container)

ENV

Python 3.10.x

PyTorch 2.x • CUDA 12.x Support

DATA

medmnist v3.0.2

BloodMNIST+ (224px native resolution)

Benchmarked @ Batch Size 32 • Floating Point Precision: FP32 • Inference Context: Local/Non-Distributed
