Post-Hoc Tests After ANOVA: Tukey HSD vs Bonferroni vs FDR Compared

April 24, 2026

You ran a one-way ANOVA with five groups, got F(4, 45) = 6.8, p = .0002, and now you need to know which groups actually differ from which. That is what post-hoc tests are for — and the choice between Tukey HSD, Bonferroni, and FDR changes which comparisons come out significant and how your paper reads.

This post compares the three dominant families side by side, with a single worked dataset showing how each method labels the same 10 pairwise comparisons differently. If you know which to choose before you run the test, the results section writes itself.

Why Post-Hoc Tests Exist

A significant ANOVA tells you that at least one group mean differs from at least one other — but not which. The obvious fix is to run all pairwise t-tests, but that inflates the false positive rate. With five groups there are ten possible pairwise comparisons, and at α = .05 for each, the probability of at least one false positive is:

$1 - (1 - 0.05)^{10} = 0.401$

A 40% chance of a spurious “significant” result is unacceptable for publication. Post-hoc methods preserve an overall error rate of 5% across the whole family of comparisons. They differ in how they preserve it — and that choice has real consequences for statistical power.
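The arithmetic is quick to verify. A short check in Python, using only the standard library:

```python
from math import comb

k = comb(5, 2)              # 10 pairwise comparisons among 5 groups
fwer = 1 - (1 - 0.05) ** k  # P(at least one false positive) at per-test alpha = .05
print(k, round(fwer, 3))    # 10 0.401
```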

The Three Error-Rate Philosophies

All post-hoc methods answer one question: “How do I handle the fact that I am testing many hypotheses at once?” But they answer it differently.

| Framework | Controls | Interpretation |
| --- | --- | --- |
| Family-wise error rate (FWER) | Probability of any false positive in the family | “At most 5% chance of making any Type I error across all comparisons” |
| False discovery rate (FDR) | Expected proportion of false positives among significant results | “At most 5% of the comparisons I call significant will be false” |
| Per-comparison error rate (no correction) | Probability of a false positive for each test | “5% chance of error on each test considered alone” (inflated overall) |

FWER is conservative — it protects against any false positive, which makes it hard to detect real effects when you have many comparisons. FDR is liberal — it tolerates some false positives among your discoveries, which preserves power when you are doing screening-style work. Per-comparison is inappropriate when you have more than one test.

Comparison: Tukey HSD vs Bonferroni vs Holm vs BH-FDR

The four most-used corrections, ranked from conservative to liberal:

| Method | Controls | Power | Best used when |
| --- | --- | --- | --- |
| Bonferroni | FWER, strong | Lowest | Small number of planned comparisons (< 5); simple to justify |
| Holm (Holm-Bonferroni) | FWER, strong | Uniformly higher than Bonferroni | Any time you would use Bonferroni; there is no reason to prefer plain Bonferroni |
| Tukey HSD | FWER, strong | Higher than Bonferroni when all pairwise comparisons are done | All pairwise comparisons after one-way ANOVA with equal or roughly equal group sizes |
| Benjamini-Hochberg (FDR) | FDR | Highest | Many exploratory comparisons (genomics, screening); when false discoveries are tolerable |

Two other methods appear often but serve niche roles: Scheffé’s test (extremely conservative; used for complex custom contrasts) and Dunnett’s test (only when comparing each group against a single control, which is more powerful than Tukey for that specific design).

Tip

Plain Bonferroni is rarely the right choice. Holm-Bonferroni dominates it — same error control, strictly more power. If a reviewer requests “Bonferroni correction,” you can report Holm and cite the reason: Holm is a uniformly more powerful closed-testing procedure that controls FWER at the same level.

How Each Method Works

Bonferroni

Divide α by the number of tests. With 10 comparisons at α = .05, declare significance only if p < .005.

Formula $\alpha_{\text{adj}} = \frac{\alpha}{k}$

where k is the number of comparisons. Equivalently, multiply each p-value by k and compare to the original α.
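As a sketch, the adjustment is a one-liner with NumPy (the helper name bonferroni below is ours, not a library API):

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Multiply each p-value by k, cap at 1, and compare to alpha."""
    p_adj = np.minimum(np.asarray(pvals) * len(pvals), 1.0)
    return p_adj, p_adj <= alpha
```

statsmodels ships the same correction as multipletests(pvals, method="bonferroni") in statsmodels.stats.multitest.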

Holm-Bonferroni

Rank p-values from smallest to largest: p(1) ≤ p(2) ≤ ... ≤ p(k). Compare the smallest to α/k, the second to α/(k−1), and so on. Stop at the first non-significant test; everything after is also non-significant.
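A minimal step-down implementation of that rule, assuming NumPy (the helper name holm is ours):

```python
import numpy as np

def holm(pvals, alpha=0.05):
    """Step-down Holm procedure; returns a boolean rejection mask."""
    p = np.asarray(pvals)
    k = len(p)
    order = np.argsort(p)                  # indices, smallest p first
    reject = np.zeros(k, dtype=bool)
    for rank, idx in enumerate(order):
        if p[idx] <= alpha / (k - rank):   # alpha/k, then alpha/(k-1), ...
            reject[idx] = True
        else:
            break                          # first failure: stop; the rest are non-significant
    return reject
```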

Tukey HSD (Honestly Significant Difference)

Uses the studentized range distribution instead of the t-distribution. The critical value depends on the number of groups and the error degrees of freedom. For all pairwise comparisons after ANOVA, Tukey HSD is more powerful than Bonferroni because it uses more information about the joint distribution of the comparisons.

$q = \frac{|\bar{x}_i - \bar{x}_j|}{\sqrt{\text{MS}_{\text{within}} / n}}$

Compare q to the critical value from the studentized range table at α = .05.
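In practice you rarely compute q by hand. A sketch using statsmodels' pairwise_tukeyhsd, with made-up data shaped like the running example (five groups of n = 10, so error df = 45):

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Hypothetical data: five conditions with 10 observations each
values = np.concatenate([rng.normal(loc=m, scale=2.0, size=10)
                         for m in (10, 14, 13, 11, 12)])
groups = np.repeat(list("ABCDE"), 10)

# Uses the studentized range distribution for all 10 pairwise comparisons
print(pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05))
```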

Benjamini-Hochberg FDR

Rank p-values smallest to largest. For each rank i out of k, compare p(i) to (i/k) × α. The largest p-value satisfying this inequality, and all smaller p-values, are declared significant.

Benjamini-Hochberg Threshold $p_{(i)} \leq \frac{i}{k} \cdot \alpha$

The threshold grows with rank, so smaller p-values face tighter cutoffs and larger p-values face looser cutoffs. This is why FDR keeps power even with hundreds of tests.
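A sketch of the step-up procedure with NumPy (bh_fdr is our name for it; statsmodels' multipletests(..., method="fdr_bh") is the library version):

```python
import numpy as np

def bh_fdr(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; returns a boolean rejection mask."""
    p = np.asarray(pvals)
    k = len(p)
    order = np.argsort(p)                       # indices, smallest p first
    below = p[order] <= np.arange(1, k + 1) / k * alpha
    reject = np.zeros(k, dtype=bool)
    if below.any():
        cutoff = np.nonzero(below)[0].max()     # largest rank passing the test
        reject[order[:cutoff + 1]] = True       # reject it and everything smaller
    return reject
```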

Worked Example: 5-Group Comparison, 10 Pairwise p-Values

Suppose a one-way ANOVA with five cell culture conditions shows F(4, 45) = 6.8, p = .0002. The ten pairwise t-test p-values (sorted smallest to largest) are:

| Rank i | Comparison | Raw p | Bonferroni adj (× 10) | Holm threshold (α/(k−i+1)) | BH threshold ((i/k) × α) |
| --- | --- | --- | --- | --- | --- |
| 1 | A vs B | .0003 | .003 * | .005 * | .005 * |
| 2 | A vs C | .002 | .020 * | .0056 * | .010 * |
| 3 | B vs D | .008 | .080 | .00625 | .015 * |
| 4 | C vs E | .015 | .150 | .00714 | .020 * |
| 5 | A vs D | .022 | .220 | .00833 | .025 * |
| 6 | B vs C | .041 | .410 | .010 | .030 |
| 7 | D vs E | .062 | .620 | .0125 | .035 |
| 8 | A vs E | .089 | .890 | .0167 | .040 |
| 9 | B vs E | .204 | > 1.0 | .025 | .045 |
| 10 | C vs D | .411 | > 1.0 | .050 | .050 |

Asterisks mark comparisons the method calls significant at α = .05: for the Bonferroni column, the adjusted p-value is compared to .05; for the Holm and BH columns, the raw p-value is compared to the listed threshold.

What Each Method Declares Significant
  • No correction: A-B, A-C, B-D, C-E, A-D, B-C — 6 significant (with a ~40% chance that at least one is a false positive)
  • Bonferroni: A-B, A-C — 2 significant (very conservative)
  • Holm: A-B, A-C — 2 significant (same as Bonferroni here; would differ with more tests)
  • Tukey HSD: Typically matches Holm or is slightly more powerful for all-pairwise designs — would likely give A-B, A-C, and possibly B-D
  • BH-FDR (q = .05): A-B, A-C, B-D, C-E, A-D — 5 significant

The same raw p-values produce five different result counts depending on the correction. That is not cheating — it is the nature of controlling different error rates.
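The Bonferroni, Holm, and BH counts above can be reproduced with statsmodels' multipletests:

```python
from statsmodels.stats.multitest import multipletests

# The ten sorted raw p-values from the worked-example table
pvals = [.0003, .002, .008, .015, .022, .041, .062, .089, .204, .411]

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(f"{method:>10}: {reject.sum()} significant")
# bonferroni: 2, holm: 2, fdr_bh: 5
```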

Choosing the Right Method

The choice is not a statistical preference — it is a question about what kind of error you care about:

| Situation | Recommended method | Why |
| --- | --- | --- |
| All pairwise comparisons after ANOVA, equal n per group | Tukey HSD | Designed for this exact case; more powerful than Bonferroni |
| All pairwise comparisons, unequal n per group | Tukey-Kramer | Generalization of Tukey HSD to unbalanced designs |
| Each group vs one control only | Dunnett’s test | More powerful than Tukey when you only care about vs-control comparisons |
| Small number (< 5) of pre-planned specific comparisons | Holm-Bonferroni | Tight FWER control; easy to justify to reviewers |
| Screening: many comparisons where false discoveries are tolerable | Benjamini-Hochberg (FDR) | Preserves power; standard in genomics and high-throughput work |
| Complex custom contrasts (e.g., linear trends) | Scheffé | Valid for any linear contrast, at the cost of low power |
| Non-parametric equivalent (after Kruskal-Wallis) | Dunn’s test | Correctly handles rank-based multiple comparisons |

Common Mistake

Using pairwise Mann-Whitney U tests as a post-hoc after Kruskal-Wallis. Each Mann-Whitney re-ranks only the two groups being compared, which is inconsistent with the original omnibus test’s ranking across all groups. Use Dunn’s test, which preserves the original ranks.
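If you work in Python, the third-party scikit-posthocs package implements Dunn's test; a minimal sketch with hypothetical long-format data:

```python
import pandas as pd
import scikit_posthocs as sp  # third-party: pip install scikit-posthocs

# Hypothetical data: three groups in long format
df = pd.DataFrame({
    "value": [3, 5, 4, 6, 9, 8, 10, 7, 12, 14, 11, 13],
    "group": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
})

# Ranks all groups together once, then compares pairs on those shared ranks;
# p_adjust applies a multiplicity correction to the pairwise p-values
print(sp.posthoc_dunn(df, val_col="value", group_col="group", p_adjust="holm"))
```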

Power and Sample-Size Considerations

Post-hoc corrections shrink your effective power. If you designed a study to detect a medium effect (Cohen’s d = 0.5) at 80% power with a simple two-group t-test (n ≈ 64 per group), testing the same comparison at the Bonferroni threshold for 10 comparisons (α = .005) pushes the requirement to roughly 107 per group, about 1.7× the sample size, to retain 80% power. FDR requires much less inflation, often under 1.2×, because the threshold is not applied uniformly.
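A sketch of that calculation with statsmodels' power module:

```python
from statsmodels.stats.power import TTestIndPower

solver = TTestIndPower()
# Per-group n for d = 0.5, 80% power, two-sided test
n_plain = solver.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
n_bonf = solver.solve_power(effect_size=0.5, alpha=0.05 / 10, power=0.8)
print(round(n_plain), round(n_bonf))  # about 64 vs about 107 per group
```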

If you are planning a study, decide the correction method before data collection and factor it into the sample size calculation. Post-hoc-correcting a study that was powered for uncorrected comparisons is a common reason studies fail to reject even when the effect is real.

How to Report Post-Hoc Results

The results section must: (1) name the post-hoc method, (2) report adjusted p-values alongside test statistics, (3) include effect sizes for each reported comparison, and (4) if using FDR, report q-values explicitly rather than calling them p-values.

Publication-Ready Write-Up

“A significant main effect of treatment condition was observed, F(4, 45) = 6.80, p < .001, ηp² = 0.38. Tukey HSD post-hoc comparisons identified significant differences between conditions A and B (mean difference = 4.2, 95% CI [1.8, 6.6], p_adj = .003, Cohen’s d = 1.12) and between conditions A and C (mean difference = 3.1, 95% CI [0.6, 5.5], p_adj = .021, Cohen’s d = 0.82). All other pairwise comparisons were not significant after correction.”

For a deeper look at how test selection interacts with sample size — including how post-hoc corrections affect power calculations — see our t-test power analysis walkthrough or the statistical test decision guide if you are still deciding between ANOVA and other approaches.


The core rule: decide the correction method before looking at the p-values. Picking the method that makes your result significant is p-hacking, and reviewers and editors are increasingly trained to detect it.

Ready to analyze your data?

Join the beta waitlist and be the first to try GraphHelix.