Normality Test Before ANOVA: A Decision Guide for Shapiro-Wilk, Q-Q Plots, and Kruskal-Wallis

April 30, 2026

Your one-way ANOVA design has four groups, n = 12 per group, and you remember someone in grad school saying you have to test normality first. Shapiro-Wilk on the raw outcome variable returns p = 0.03 in one of the groups. Should you switch to Kruskal-Wallis? In most cases, no, and the reason exposes a misconception that a large share of biology and ecology papers get wrong.

This guide walks through the actual decision: when normality matters for ANOVA, what you should be testing (the residuals, not the raw outcome), and when a violation justifies switching to a non-parametric test. The short version: at n ≥ 30 per group in a balanced design, ANOVA is robust; the residuals matter, not the raw data; and Shapiro-Wilk on small samples is a worse signal than a Q-Q plot.

What ANOVA Actually Assumes

The normality assumption for ANOVA is a statement about the residuals, not the outcome variable. Specifically: the within-group residuals (each observation minus its group mean) should be approximately normally distributed. This is the assumption the F-test’s p-value depends on.

If you test normality on the raw outcome variable across all groups, you will frequently see bimodal distributions — not because the data violate ANOVA’s assumption, but because the groups have different means (which is exactly what ANOVA is detecting). The raw-data histogram is uninformative; the residuals histogram is what matters.
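A minimal sketch of the residual computation, using only the Python standard library. The group names and values are made up for illustration; the point is only that the pooled raw values are spread across the gaps between group means, while the residuals are not.

```python
from statistics import mean

# Hypothetical data: three groups with clearly different means.
groups = {
    "control": [4.8, 5.1, 5.0, 4.9, 5.2],
    "low_dose": [9.9, 10.2, 10.0, 9.8, 10.1],
    "high_dose": [15.1, 14.9, 15.0, 15.2, 14.8],
}

# Residual = observation minus the mean of its own group.
residuals = {}
for name, values in groups.items():
    group_mean = mean(values)
    residuals[name] = [x - group_mean for x in values]

pooled_raw = [x for values in groups.values() for x in values]
pooled_residuals = [r for rs in residuals.values() for r in rs]

# The raw values form three separate clusters; the residuals form one.
print("raw range:     ", min(pooled_raw), "to", max(pooled_raw))
print("residual range:", round(min(pooled_residuals), 2),
      "to", round(max(pooled_residuals), 2))
```

A histogram of `pooled_raw` would show three clusters, one per group, while a histogram of `pooled_residuals` would show a single cluster centered on zero. The second one is what the normality assumption refers to.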

Common Mistake

Over 70% of ecology papers and 90% of biology papers test normality of the outcome variable instead of residuals. When the groups differ (which is what you ran ANOVA to detect), the raw-data distribution is bimodal — but the residuals can still be perfectly normal. Always test within-group normality or compute residuals first.

The Decision Tree

Three factors decide whether you need to act on a normality violation:

  1. Sample size per group
  2. Severity of the violation (visual + numeric)
  3. Group balance (equal vs unequal n)

Walk through them in order:

Step 1 — Check sample size

For balanced designs (equal n per group), one-way ANOVA is robust to non-normality once n ≥ 30 per group. The Central Limit Theorem applies to the sampling distribution of group means, which is what the F-statistic compares. At n ≥ 30, even moderately skewed residuals produce reliable p-values.

If n < 30 per group, normality matters more — proceed to Step 2.
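The Central Limit Theorem argument can be checked with a small simulation. This is a sketch under stated assumptions: the population is exponential (skewness 2), chosen only as a convenient skewed example, and the function names are hypothetical. For independent draws, the skewness of a mean of n observations is the population skewness divided by √n, so the simulated values should land near 2/√5 ≈ 0.89 and 2/√30 ≈ 0.37.

```python
import random

def skewness(xs):
    """Third central moment divided by the cubed standard deviation."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def skew_of_group_means(n, reps=5000, seed=42):
    """Skewness of the mean of n draws from an exponential(1) population."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = [
        sum(rng.expovariate(1.0) for _ in range(n)) / n
        for _ in range(reps)
    ]
    return skewness(means)

skew_n5 = skew_of_group_means(5)
skew_n30 = skew_of_group_means(30)
print(f"skewness of group means at n = 5:  {skew_n5:.2f}")
print(f"skewness of group means at n = 30: {skew_n30:.2f}")
```

The group means are much closer to symmetric at n = 30 than at n = 5, even though the individual observations are equally skewed in both cases.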

Step 2 — Look at a Q-Q plot before reading Shapiro-Wilk

Q-Q (quantile-quantile) plots show observed quantiles against the quantiles of a normal distribution. Points falling near the diagonal indicate normality; systematic departures indicate the type of violation:

Q-Q pattern | What it means | How worried to be
Points on the diagonal | Approximately normal | Not worried
S-shape (steep ends, flat middle) | Heavy tails (kurtosis > 3) | Mild; ANOVA still robust
Curved (bow-shaped) | Skew | Moderate; consider transform or non-parametric
Stair-step pattern | Discrete or rounded data | Depends; non-parametric usually better
Single point far off diagonal | Outlier | Investigate the data point, not the test

The visual is more diagnostic than the test statistic. A Shapiro-Wilk p-value of 0.04 with a clean Q-Q plot is almost always a false positive; a Shapiro-Wilk p-value of 0.20 with a banana-shaped Q-Q plot is a real violation hidden by low power.
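One way to get the Q-Q coordinates without drawing anything is `scipy.stats.probplot`, which returns the theoretical and ordered sample quantiles plus the correlation of the fitted line. The sketch below assumes NumPy and SciPy are installed and uses two constructed samples (not real data): one that follows normal quantiles by design and one that is its exponential, which is right-skewed.

```python
import numpy as np
from scipy import stats

n = 12
# Constructed samples: evenly spaced normal quantiles, and their exponential.
positions = (np.arange(1, n + 1) - 0.5) / n
normal_like = stats.norm.ppf(positions)
right_skewed = np.exp(normal_like)

def qq_summary(sample):
    """Return Q-Q coordinates and the correlation r of the Q-Q line."""
    (theoretical, ordered), (slope, intercept, r) = stats.probplot(
        sample, dist="norm"
    )
    return theoretical, ordered, r

_, _, r_normal = qq_summary(normal_like)
_, _, r_skewed = qq_summary(right_skewed)

print(f"Q-Q correlation, normal-like sample:  {r_normal:.3f}")
print(f"Q-Q correlation, right-skewed sample: {r_skewed:.3f}")
```

To actually look at the plot, `probplot` also accepts a `plot` argument (for example a Matplotlib Axes object). The correlation is only a rough numeric companion to the picture; the shape of the departure is what identifies the type of violation.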

Step 3 — Run Shapiro-Wilk per group, not pooled

Run the Shapiro-Wilk test within each group on the raw values, or compute residuals (observation minus its group mean) and test the residuals pooled. Pooled raw values across groups will fail normality whenever the groups actually differ — which is misleading.

Shapiro-Wilk interpretation

H₀: data come from a normal distribution. p < 0.05 rejects normality.

Power scales with n: at n = 10, the test misses moderate skew; at n = 300, it flags trivial deviations. Always pair with a Q-Q plot.
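The three ways of running the test can be compared side by side with `scipy.stats.shapiro`. This is a sketch on simulated data, assuming NumPy and SciPy are available: four hypothetical groups of n = 12 with normal noise around well-separated means, so the within-group assumption holds by construction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical design: four groups, n = 12 each, normal noise, different means.
group_means = [10.0, 20.0, 30.0, 40.0]
groups = [rng.normal(loc=m, scale=1.0, size=12) for m in group_means]

# Option A: one test per group on the raw values.
per_group = [stats.shapiro(g) for g in groups]

# Option B: one test on the pooled within-group residuals.
residuals = np.concatenate([g - g.mean() for g in groups])
pooled_resid = stats.shapiro(residuals)

# What not to do: pool the raw values across groups.
pooled_raw = stats.shapiro(np.concatenate(groups))

for i, res in enumerate(per_group, start=1):
    print(f"group {i}: W = {res.statistic:.3f}, p = {res.pvalue:.3f}")
print(f"pooled residuals: W = {pooled_resid.statistic:.3f}, p = {pooled_resid.pvalue:.3f}")
print(f"pooled raw data:  W = {pooled_raw.statistic:.3f}, p = {pooled_raw.pvalue:.4f}")
```

The pooled raw data are rejected because the four group means are far apart, not because any group is non-normal. The per-group and residual results depend on the random draw, so treat the printed p-values as an illustration rather than a benchmark.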

Step 4 — Decide based on the combination

Per-group n | Q-Q + Shapiro-Wilk | Decision
n < 15 | Clean Q-Q, p > 0.05 | Run ANOVA
n < 15 | Skewed or heavy-tailed Q-Q, p < 0.05 | Transform (log, sqrt) or run Kruskal-Wallis
15 ≤ n < 30 | Mild deviation in Q-Q | Run ANOVA; report Welch’s correction if variances unequal
n ≥ 30 (balanced) | Any Q-Q pattern except severe | Run ANOVA; CLT covers it
n ≥ 30 (balanced) | Severe skew or extreme outliers | Investigate outliers; consider transform if biological reason exists
Unbalanced + non-normal | Any | Robustness reduced; consider transform or non-parametric
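The table can be written down as a small lookup function, which makes its gaps explicit. This is a hypothetical sketch: the function name, the Q-Q labels ("clean", "mild", "skewed", "heavy_tailed", "severe"), and the handling of combinations the table does not list are all assumptions, flagged in the comments.

```python
def anova_decision(n_per_group, balanced, qq, shapiro_p):
    """Return the action suggested by the decision table.

    qq is a hypothetical label for what the Q-Q plot looks like:
    "clean", "mild", "skewed", "heavy_tailed", or "severe".
    """
    not_covered = "Not covered by the table; use judgment"

    # Unbalanced designs with any visible non-normality get their own row.
    # Assumption: an unbalanced design with a clean plot falls through
    # to the sample-size rules below.
    if not balanced and qq != "clean":
        return "Robustness reduced; consider transform or non-parametric"

    if n_per_group >= 30:
        if qq == "severe":
            return "Investigate outliers; consider transform"
        return "Run ANOVA"

    if n_per_group >= 15:
        # The table lists only a mild deviation here; treating a clean
        # plot the same way is an assumption of this sketch.
        if qq in ("clean", "mild"):
            return "Run ANOVA; report Welch's correction if variances unequal"
        return not_covered

    # n < 15: the table rows require the plot and the test to agree.
    if qq == "clean" and shapiro_p > 0.05:
        return "Run ANOVA"
    if qq in ("skewed", "heavy_tailed") and shapiro_p < 0.05:
        return "Transform (log, sqrt) or run Kruskal-Wallis"
    return not_covered


print(anova_decision(12, True, "clean", 0.30))
print(anova_decision(12, True, "skewed", 0.01))
print(anova_decision(40, True, "heavy_tailed", 0.001))
```

Combinations such as a clean Q-Q plot with p < 0.05 at n < 15 return the "not covered" message on purpose: the table leaves them to judgment, and the sketch does not try to fill them in.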

What “Severe” Means in Practice

Severe non-normality is not p < 0.05 on Shapiro-Wilk. It looks like:

  • Skew > 2 or kurtosis > 7 (the standard heuristic for “serious” deviation)
  • A Q-Q plot where most points deviate, not just the tails
  • A histogram where the distribution is visibly bimodal within a group (not across groups)
  • Concentration measurements (cytokines, gene expression) on a linear scale — these are usually log-normal and need a log transform regardless of test results

Most violations seen in practice are mild and ANOVA tolerates them. If you cannot tell from the Q-Q plot whether a deviation is severe, it is most likely not severe.
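The skew and kurtosis heuristic is easy to compute by hand. The sketch below uses plain moment-based estimates and assumes that "kurtosis" means the non-excess form, where a normal distribution scores 3 (consistent with the "kurtosis > 3" note in the Q-Q table). It also applies the skew threshold to the absolute value so that left skew is caught too. The function names and the two toy samples are hypothetical.

```python
def skew_and_kurtosis(xs):
    """Moment-based skewness and (non-excess) kurtosis; normal data give 0 and 3."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

def looks_severe(xs):
    """Apply the |skew| > 2 or kurtosis > 7 heuristic."""
    skew, kurt = skew_and_kurtosis(xs)
    return abs(skew) > 2 or kurt > 7

symmetric = [1, 2, 3, 4, 5]
one_extreme = [1] * 9 + [100]  # nine identical values and one extreme point

for label, sample in [("symmetric", symmetric), ("one extreme", one_extreme)]:
    skew, kurt = skew_and_kurtosis(sample)
    print(f"{label}: skew = {skew:.2f}, kurtosis = {kurt:.2f}, "
          f"severe = {looks_severe(sample)}")
```

In an ANOVA setting these would be computed on the within-group residuals. With small samples both estimates are noisy, so they are a supplement to the Q-Q plot rather than a replacement for it.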

Transformations Before Switching to Non-Parametric

For right-skewed continuous data (concentrations, expression levels, antibody titers, time-to-event measurements), a log transformation often restores normality. Many biological measurements are approximately log-normal because they result from multiplicative processes (compounding growth rates, dilution series, signal amplification).

Try transformations in this order:

  1. Log (natural log or log₁₀) for right-skewed positive data
  2. Square root for count data with variance proportional to the mean
  3. Arcsine square root for proportions or percentages bounded 0–1

Pick the transformation before looking at the test result. Choosing the transformation that gives significance is p-hacking. Document your transformation rule in the methods section.
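A short sketch of the three transformations, using only the standard library. The data are a made-up doubling series standing in for a dilution-style measurement, and the helper names are hypothetical. The doubling series is strongly right-skewed on the linear scale and exactly symmetric after a log transform.

```python
import math

def skewness(xs):
    """Third central moment divided by the cubed standard deviation."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def log_transform(xs):
    """Natural log; requires strictly positive values."""
    return [math.log(x) for x in xs]

def sqrt_transform(counts):
    """Square root, for counts with variance proportional to the mean."""
    return [math.sqrt(c) for c in counts]

def arcsine_sqrt(p):
    """Arcsine square root, for a proportion between 0 and 1."""
    return math.asin(math.sqrt(p))

# Hypothetical doubling series (for example, a dilution-style readout).
concentrations = [1, 2, 4, 8, 16, 32, 64, 128]
logged = log_transform(concentrations)

print(f"skewness before log: {skewness(concentrations):.2f}")
print(f"skewness after log:  {skewness(logged):.2f}")
```

The transformation is chosen from the type of data (positive and right-skewed, counts, proportions), not from which one produces the smallest p-value, in line with the rule above.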

When to Just Use Kruskal-Wallis

Kruskal-Wallis (the non-parametric equivalent of one-way ANOVA) is appropriate when:

  • The outcome is genuinely ordinal (Likert scale, severity score)
  • n is small (< 10 per group) and a Q-Q plot shows a non-linear pattern
  • Outliers cannot be removed (rare events, real biological extremes) and a robust test is preferred
  • Reviewers explicitly request it after seeing your assumption-checking section

If you go non-parametric, the post-hoc test changes too: use Dunn’s test after Kruskal-Wallis, not pairwise Mann-Whitney (which re-ranks the data inconsistently with the omnibus test). When the parametric assumptions do hold, switching to non-parametric costs about 4.5% in efficiency (the asymptotic relative efficiency of Kruskal-Wallis against the F-test under normality is 3/π ≈ 0.955), which is not catastrophic, but real.
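Running the omnibus test itself is a single call to `scipy.stats.kruskal`. The sketch below assumes SciPy is installed and uses three made-up groups with no ties and no overlap, so the H statistic can be checked by hand: the rank sums are 15, 40, and 65, giving H = 12/(15·16) · (15² + 40² + 65²)/5 − 3·16 = 12.5.

```python
from scipy import stats

# Hypothetical scores: three groups of five, no ties, fully separated.
group_a = [1, 2, 3, 4, 5]
group_b = [6, 7, 8, 9, 10]
group_c = [11, 12, 13, 14, 15]

result = stats.kruskal(group_a, group_b, group_c)
print(f"H = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```

Dunn’s test is not included in this sketch. It is not, to my knowledge, part of `scipy.stats`, so the post-hoc step would need a separate package or a manual implementation.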

Once you settle on a test, the post-hoc choice still matters for how you report group differences. The companion guide on Tukey HSD vs Bonferroni vs FDR covers that decision in detail.

Reporting the Assumption Check

Methods sections that pass review on assumption-checking typically include three sentences:

  1. Which test you ran for normality (Shapiro-Wilk on within-group residuals or per-group)
  2. What you found (test statistic, p-value, plus a sentence on the Q-Q plot if relevant)
  3. Whether you proceeded with ANOVA, used a transformation, or switched to a non-parametric test — and why

Avoid the phrase “normality was confirmed” — a non-significant Shapiro-Wilk does not confirm normality, it only fails to reject it. The neutral phrasing is “Shapiro-Wilk did not reject normality (W = 0.95, p = 0.12), and the Q-Q plot showed no systematic deviation.” This phrasing is what a careful reviewer expects to see — the PMC guide to normality testing for non-statisticians explicitly recommends pairing the test with visual inspection. Reviewers checking your statistics section will notice the absence of one or the other.

If you are still choosing between tests for a multi-group comparison, the decision guide for choosing the right statistical test walks through the test-selection logic before you reach the assumption-check stage.
