Statistical Inference
The Logic of Discovery
The P-Value Definition
We start with a boring assumption called the Null Hypothesis (H₀). It states that nothing interesting is happening: the drug didn't work, the groups are identical, or the correlation is zero.
We then construct a mathematical universe where H₀ is true. If our observed data falls in the extreme tails of this universe, it is surprising. The P-Value is the probability of seeing data at least this extreme by pure luck, assuming H₀ is true.
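As a minimal sketch of that definition (the coin-flip scenario and every number here are our own illustration, not from the text), we can simulate the null universe directly:

```python
import numpy as np

rng = np.random.default_rng(42)

# H0: the coin is fair (p = 0.5). Observed data: 16 heads in 20 flips.
n_flips, observed_heads = 20, 16

# Build the "mathematical universe where H0 is true" by simulation.
null_universe = rng.binomial(n=n_flips, p=0.5, size=100_000)

# P-value: probability, under H0, of data at least this extreme
# (two-sided: at least as far from the expected 10 heads as 16 is).
extremity = abs(observed_heads - n_flips / 2)
p_value = np.mean(np.abs(null_universe - n_flips / 2) >= extremity)
print(f"p ≈ {p_value:.4f}")  # ≈ 0.012: surprising if the coin is fair
```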
The Cost of Skepticism
How surprising is "surprising enough"? We set a threshold called Alpha (α). This is a policy decision, not a mathematical truth.
- Set α low (0.01): You rarely cry wolf (Low Type I Error), but you miss real discoveries (High Type II Error).
- Set α high (0.10): You find everything, including noise (High Type I Error). A simulation of this trade-off follows below.
The Trade-off
The Alternative Reality
We don't just reject H₀; we accept an alternative H₁. This leads to the Neyman-Pearson Framework, where we care about Statistical Power (1 − β).
In the standard power diagram, the blue area (β) represents the risk of missing a real effect. Notice how increasing sample size (n) or effect size pulls the distributions apart, increasing Power.
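A sketch of the calculation behind that picture, assuming the simplest case of a one-sided z-test with known variance (the function and the parameter grid are ours, not the article's):

```python
import numpy as np
from scipy.stats import norm

def power_one_sided_z(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Power of a one-sided one-sample z-test with known sigma = 1.

    The test statistic is normal under both H0 and H1; the two
    distributions are pulled apart by sqrt(n) * effect_size.
    """
    z_crit = norm.ppf(1 - alpha)         # rejection threshold under H0
    shift = np.sqrt(n) * effect_size     # how far H1 sits from H0
    return 1 - norm.cdf(z_crit - shift)  # power = 1 - beta

for n in (10, 30, 100):
    for d in (0.2, 0.5):
        print(f"n={n:>3}, effect={d}: power ≈ {power_one_sided_z(d, n):.2f}")
```

Either raising n or raising the effect size increases the shift term, which is exactly the "pulling apart" described above.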
The Law of Large Numbers
A Single Study Means Nothing
We tend to treat a single result as truth. It isn't.
If the Null Hypothesis is actually true (the drug does nothing), P-values follow a Uniform Distribution on [0, 1]: every value is equally likely.
Any single "significant" result (the red bar) could just be a random draw from this flat distribution. Only repeated replication reveals the true shape.
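A quick simulation of this (two identical groups, t-tested 10,000 times; the parameters are our own choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, studies = 50, 10_000

# Run many "studies" of a drug that does nothing: both groups N(0, 1).
pvals = np.array([
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue
    for _ in range(studies)
])

# Under a true H0, p-values are uniform on [0, 1]:
hist, _ = np.histogram(pvals, bins=10, range=(0, 1))
print(hist / studies)  # each bin holds ~10% of studies
print(f"'significant' by pure luck: {np.mean(pvals < 0.05):.3f}")  # ≈ 0.05
```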
How to Break Science
The Look-Elsewhere Effect
If α = 0.05, you accept a 5% risk of a False Positive on every single test. This implies a dangerous mathematical certainty:
If you run 20 useless experiments on pure noise, the chance of at least one "significant" result is 1 − 0.95^20 ≈ 64%. You are more likely than not to "discover" something.
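The arithmetic, spelled out (assuming independent tests, each run at α = 0.05):

```python
# Family-wise error rate: chance of at least one false positive
# among k independent tests, each at alpha = 0.05.
alpha = 0.05
for k in (1, 5, 10, 20, 60):
    fwer = 1 - (1 - alpha) ** k
    print(f"k={k:>2} experiments -> P(>=1 false positive) = {fwer:.2f}")
# k=20 -> 0.64: the ~64% chance quoted above.
```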
The P-Hacking Game
"hack.hypothesis.prefix Cyan hack.hypothesis.jelly hack.hypothesis.cause_q Hair Loss?"
P-Hacking Detector
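A sketch of the game under the hood, assuming 20 independent jelly-bean colors with no real effect whatsoever (the color names and sample sizes are ours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng()  # unseeded: results vary run to run
colors = ["Cyan", "Magenta", "Lime", "Mauve", "Teal"] + \
         [f"Color{i}" for i in range(15)]
n = 40

discoveries = []
for color in colors:
    # Both groups are pure noise: the beans do nothing.
    treated = rng.normal(0, 1, n)
    control = rng.normal(0, 1, n)
    p = stats.ttest_ind(treated, control).pvalue
    if p < 0.05:
        discoveries.append((color, round(p, 3)))

print(f"Tested {len(colors)} colors; 'discoveries': {discoveries}")
# Rerun this a few times: roughly 64% of runs yield at least one
# headline like "Cyan jelly beans linked to hair loss (p < 0.05)".
```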
Manufacturing Significance
Sometimes we don't run new experiments; we just "clean" the old ones. By selectively removing data points (labeling them "Outliers"), we can force a significant difference where none exists.
This is often called Data Torture: "If you torture the data long enough, it will confess."
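A deliberately abusive sketch of this (our own construction): start with two identical groups, then repeatedly delete the most inconvenient point as an "outlier" until the test confesses.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng()
a, b = rng.normal(0, 1, 30), rng.normal(0, 1, 30)  # truly identical groups

p = stats.ttest_ind(a, b).pvalue
while p >= 0.05 and len(a) > 5:
    # "Clean" the data: drop whichever point most opposes the
    # difference we want to find, and label it an outlier.
    if a.mean() < b.mean():
        a, b = b, a                 # torture toward a > b
    a = np.delete(a, np.argmin(a))  # remove a's most inconvenient point
    p = stats.ttest_ind(a, b).pvalue

print(f"p = {p:.4f} after trimming to n = {len(a)} 'non-outliers'")
# This usually manufactures p < 0.05 between identical groups.
```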
Estimation over Decision
The Dance of Confidence
The binary decision ("Significant / Not Significant") destroys information. Instead, we should focus on Estimation using Confidence Intervals (CI).
A 95% CI doesn't mean "95% chance the truth is in here." It means: "If we repeated this experiment 100 times, about 95 of the resulting intervals would capture the true parameter."
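A repetition sketch of that exact statement (the true mean, σ, and n are arbitrary choices of ours):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng()
true_mu, n, repeats = 10.0, 25, 100

hits = 0
for _ in range(repeats):
    sample = rng.normal(true_mu, 2.0, n)
    # 95% t-interval for the mean of this one sample.
    lo, hi = stats.t.interval(0.95, df=n - 1,
                              loc=sample.mean(),
                              scale=stats.sem(sample))
    hits += (lo <= true_mu <= hi)

print(f"{hits}/{repeats} intervals captured the true mean")  # ≈ 95
```

Each individual interval either contains the truth or it doesn't; the 95% describes the long-run behavior of the procedure, not any single interval.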