Post-Hoc Tests After ANOVA: Tukey HSD vs Bonferroni vs FDR Compared
You ran a one-way ANOVA with five groups, got F(4, 45) = 6.8, p = .0002, and now you need to know which groups actually differ from which. That is what post-hoc tests are for — and the choice between Tukey HSD, Bonferroni, and FDR changes which comparisons come out significant and how your paper reads.
This post compares the three dominant families side by side, with a single worked dataset showing how each method labels the same 10 pairwise comparisons differently. If you know which to choose before you run the test, the results section writes itself.
Why Post-Hoc Tests Exist
A significant ANOVA tells you that at least one group mean differs from at least one other — but not which. The obvious fix is to run all pairwise t-tests, but that inflates the false positive rate. With five groups there are ten possible pairwise comparisons, and at α = .05 for each, the probability of at least one false positive is:
$1 - (1 - 0.05)^{10} = 0.401$

A 40% chance of a spurious “significant” result is unacceptable for publication. Post-hoc methods preserve an overall error rate of 5% across the whole family of comparisons. They differ in how they preserve it — and that choice has real consequences for statistical power.
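A quick numeric check of that inflation, assuming independent tests (the function name here is made up for this post, not from any library):

```python
# Probability of at least one false positive across k independent tests,
# each run at per-test level alpha.

def familywise_error_rate(alpha: float, k: int) -> float:
    return 1 - (1 - alpha) ** k

print(round(familywise_error_rate(0.05, 10), 3))  # 0.401
```

With shared groups the pairwise tests are not truly independent, so treat this as an upper-ballpark figure rather than an exact rate.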
The Three Error-Rate Philosophies
All post-hoc methods answer one question: “How do I handle the fact that I am testing many hypotheses at once?” But they answer it differently.
| Framework | Controls | Interpretation |
|---|---|---|
| Family-wise error rate (FWER) | Probability of any false positive in the family | “At most 5% chance of making any Type I error across all comparisons” |
| False discovery rate (FDR) | Expected proportion of false positives among significant results | “At most 5% of the comparisons I call significant will be false” |
| Per-comparison error rate (no correction) | Probability of false positive for each test | “5% chance of error on each test considered alone” (inflated overall) |
FWER is conservative — it protects against any false positive, which makes it hard to detect real effects when you have many comparisons. FDR is liberal — it tolerates some false positives among your discoveries, which preserves power when you are doing screening-style work. Per-comparison is inappropriate when you have more than one test.
Comparison: Tukey HSD vs Bonferroni vs Holm vs BH-FDR
The four most-used corrections, ranked from conservative to liberal:
| Method | Controls | Power | Best used when |
|---|---|---|---|
| Bonferroni | FWER, strong | Lowest | Small number of planned comparisons (< 5); simple to justify |
| Holm (Holm-Bonferroni) | FWER, strong | Uniformly higher than Bonferroni | Any time you would use Bonferroni — there is no reason to prefer plain Bonferroni |
| Tukey HSD | FWER, strong | Higher than Bonferroni when all pairwise comparisons are done | All pairwise comparisons after one-way ANOVA with equal or roughly equal group sizes |
| Benjamini-Hochberg (FDR) | FDR | Highest | Many exploratory comparisons (genomics, screening); when false discoveries are tolerable |
Two other methods appear often but serve niche roles: Scheffé’s test (extremely conservative; used for complex custom contrasts) and Dunnett’s test (only when comparing each group against a single control, which is more powerful than Tukey for that specific design).
Plain Bonferroni is rarely the right choice. Holm-Bonferroni dominates it — same error control, strictly more power. If a reviewer requests “Bonferroni correction,” you can report Holm and cite the reason: Holm is a uniformly more powerful closed-testing procedure that controls FWER at the same level.
How Each Method Works
Bonferroni
Divide α by the number of tests:

$\alpha_{\text{per-test}} = \frac{\alpha}{k}$

where k is the number of comparisons. With 10 comparisons at α = .05, declare significance only if p < .005. Equivalently, multiply each p-value by k and compare to the original α.
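The multiply-by-k form is the one software reports as “adjusted p-values.” A minimal sketch, using three invented p-values:

```python
# Bonferroni: multiply each raw p-value by the number of tests k, cap at 1.0,
# then compare the adjusted values to alpha as usual.

def bonferroni_adjust(pvals):
    """Return Bonferroni-adjusted p-values, capped at 1.0."""
    k = len(pvals)
    return [min(p * k, 1.0) for p in pvals]

raw = [0.004, 0.012, 0.030]          # three hypothetical comparisons
adj = bonferroni_adjust(raw)         # -> [0.012, 0.036, 0.090]
print(sum(p < 0.05 for p in adj))    # 2 -- the third comparison no longer passes
```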
Holm-Bonferroni
Rank p-values from smallest to largest: p(1) ≤ p(2) ≤ ... ≤ p(k). Compare the smallest to α/k, the second to α/(k−1), and so on. Stop at the first non-significant test; everything after is also non-significant.
Tukey HSD (Honestly Significant Difference)
Uses the studentized range distribution instead of the t-distribution. The critical value depends on the number of groups and the error degrees of freedom. For all pairwise comparisons after ANOVA, Tukey HSD is more powerful than Bonferroni because it uses more information about the joint distribution of the comparisons.
$q = \frac{|\bar{x}_i - \bar{x}_j|}{\sqrt{\text{MS}_{\text{within}} / n}}$

Compare q to the critical value from the studentized range table at α = .05.
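Computing the statistic itself needs nothing beyond the ANOVA table. The means, MS within, and group size below are invented for illustration; they are not this post’s worked dataset:

```python
import math

# Studentized-range statistic for one pair, assuming equal group sizes n.
def tukey_q(mean_i, mean_j, ms_within, n):
    """q = |mean_i - mean_j| / sqrt(MS_within / n)."""
    return abs(mean_i - mean_j) / math.sqrt(ms_within / n)

q = tukey_q(24.6, 20.4, ms_within=9.0, n=10)
print(round(q, 2))  # 4.43
```

A q of 4.43 would then be compared to the tabled critical value for k = 5 groups and 45 error df, roughly 4.0 at α = .05.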
Benjamini-Hochberg FDR
Rank p-values smallest to largest. For each rank i out of k, compare p(i) to (i/k) × α. The largest p-value satisfying this inequality, and all smaller p-values, are declared significant.
The threshold grows with rank, so smaller p-values face tighter cutoffs and larger p-values face looser cutoffs. This is why FDR keeps power even with hundreds of tests.
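The step-up logic is easy to get backwards by hand, so here is a sketch with invented p-values chosen so Holm would flag only the first comparison while BH admits all four:

```python
# Benjamini-Hochberg step-up at FDR level alpha.
def bh_significant(pvals, alpha=0.05):
    """Return per-p-value significance flags (in original order)."""
    k = len(pvals)
    order = sorted(range(k), key=lambda i: pvals[i])
    # Find the largest rank whose sorted p-value sits under its (i/k)*alpha line...
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / k * alpha:
            cutoff = rank
    # ...then every rank up to the cutoff is declared significant.
    flags = [False] * k
    for rank, i in enumerate(order, start=1):
        flags[i] = rank <= cutoff
    return flags

print(bh_significant([0.004, 0.02, 0.03, 0.04]))  # [True, True, True, True]
```

Note the asymmetry with Holm: BH scans for the largest passing rank rather than stopping at the first failure, which is exactly why it is more permissive in the tail.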
Worked Example: 5-Group Comparison, 10 Pairwise p-Values
Suppose a one-way ANOVA with five cell culture conditions shows F(4, 45) = 6.8, p = .0002. The ten pairwise t-test p-values (sorted smallest to largest) are:
| Rank i | Comparison | Raw p | Bonferroni adj (× 10) | Holm threshold (α/(k−i+1)) | BH threshold ((i/k)×α) |
|---|---|---|---|---|---|
| 1 | A vs B | .0003 | .003 * | .005 * | .005 * |
| 2 | A vs C | .002 | .020 * | .0056 * | .010 * |
| 3 | B vs D | .008 | .080 | .00625 | .015 * |
| 4 | C vs E | .015 | .150 | .00714 | .020 * |
| 5 | A vs D | .022 | .220 | .00833 | .025 * |
| 6 | B vs C | .041 | .410 | .010 | .030 |
| 7 | D vs E | .062 | .620 | .0125 | .035 |
| 8 | A vs E | .089 | .890 | .0167 | .040 |
| 9 | B vs E | .204 | > 1.0 | .025 | .045 |
| 10 | C vs D | .411 | > 1.0 | .050 | .050 |
Asterisks mark comparisons the method calls significant at α = .05.
- No correction: A-B, A-C, B-D, C-E, A-D, B-C — 6 significant (with a ~40% chance that at least one is a false positive)
- Bonferroni: A-B, A-C — 2 significant (very conservative)
- Holm: A-B, A-C — 2 significant (same as Bonferroni here; would differ with more tests)
- Tukey HSD: Typically matches Holm or is slightly more powerful for all-pairwise designs — would likely give A-B, A-C, and possibly B-D
- BH-FDR (q = .05): A-B, A-C, B-D, C-E, A-D — 5 significant
The same raw p-values produce five different result counts depending on the correction. That is not cheating — it is the nature of controlling different error rates.
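Those counts can be reproduced directly from the table’s raw p-values by hard-coding each decision rule (Tukey HSD is omitted because its q critical values need the group means, not just the p-values):

```python
# Decision rules applied to the worked example's ten sorted raw p-values.
pvals = [0.0003, 0.002, 0.008, 0.015, 0.022, 0.041, 0.062, 0.089, 0.204, 0.411]
alpha, k = 0.05, len(pvals)

uncorrected = sum(p < alpha for p in pvals)
bonferroni = sum(p < alpha / k for p in pvals)

# Holm: step down through the sorted p-values, stop at the first failure.
holm = 0
for i, p in enumerate(pvals):              # already sorted ascending
    if p <= alpha / (k - i):               # thresholds a/10, a/9, ..., a/1
        holm += 1
    else:
        break

# BH: largest rank i with p(i) <= (i/k) * alpha; that rank and below all pass.
bh = max((i for i in range(1, k + 1) if pvals[i - 1] <= i / k * alpha), default=0)

print(uncorrected, bonferroni, holm, bh)  # 6 2 2 5
```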
Choosing the Right Method
The choice is not a statistical preference — it is a question about what kind of error you care about:
| Situation | Recommended method | Why |
|---|---|---|
| All pairwise comparisons after ANOVA, equal n per group | Tukey HSD | Designed for this exact case; more powerful than Bonferroni |
| All pairwise comparisons, unequal n per group | Tukey-Kramer | Generalization of Tukey HSD to unbalanced designs |
| Each group vs one control only | Dunnett’s test | More powerful than Tukey when you only care about vs-control |
| Small number (< 5) of pre-planned specific comparisons | Holm-Bonferroni | Tight FWER control; easy to justify to reviewers |
| Screening: many comparisons where false discoveries are tolerable | Benjamini-Hochberg (FDR) | Preserves power; standard in genomics, high-throughput |
| Complex custom contrasts (e.g., linear trends) | Scheffé | Valid for any linear contrast, at cost of low power |
| Non-parametric equivalent (after Kruskal-Wallis) | Dunn’s test | Correctly handles rank-based multiple comparisons |
A common mistake: using pairwise Mann-Whitney U tests as the post-hoc after Kruskal-Wallis. Each Mann-Whitney re-ranks only the two groups being compared, which is inconsistent with the original omnibus test’s ranking across all groups. Use Dunn’s test, which preserves the original ranks.
Power and Sample-Size Considerations
Post-hoc corrections shrink your effective power. If you designed a study to detect a medium effect (Cohen’s d = 0.5) at 80% power with a simple two-group t-test (n ~ 64 per group), the same study with 10 pairwise comparisons under Bonferroni needs roughly 1.5× the sample size to retain 80% power. FDR requires much less inflation — often < 1.2× — because the threshold is not applied uniformly.
If you are planning a study, decide the correction method before data collection and factor it into the sample size calculation. Post-hoc-correcting a study that was powered for uncorrected comparisons is a common reason studies fail to reject even when the effect is real.
How to Report Post-Hoc Results
The results section must: (1) name the post-hoc method, (2) report adjusted p-values alongside test statistics, (3) include effect sizes for each reported comparison, and (4) if using FDR, report q-values explicitly rather than calling them p-values.
“A significant main effect of treatment condition was observed, F(4, 45) = 6.80, p < .001, ηp² = 0.38. Tukey HSD post-hoc comparisons identified significant differences between conditions A and B (mean difference = 4.2, 95% CI [1.8, 6.6], padj = .003, Cohen’s d = 1.12) and between conditions A and C (mean difference = 3.1, 95% CI [0.6, 5.5], padj = .021, Cohen’s d = 0.82). All other pairwise comparisons were not significant after correction.”
For a deeper look at how test selection interacts with sample size — including how post-hoc corrections affect power calculations — see our t-test power analysis walkthrough or the statistical test decision guide if you are still deciding between ANOVA and other approaches.
Methodology References
Primary sources for the methods discussed:
- Proper application of the multiple comparison test (PMC) — peer-reviewed methodology overview
- Statistical notes: post-hoc multiple comparisons (PMC) — practitioner-oriented comparison of FWER and FDR methods
- APA JARS — reporting standards requiring effect sizes alongside p-values
The core rule: decide the correction method before looking at the p-values. Picking the method that makes your result significant is p-hacking, and reviewers and editors are increasingly trained to detect it.