Choosing the Right Statistical Test: A Decision Guide for Experimental Biologists

April 18, 2026

Every experimental biologist hits the same wall: the data are in the spreadsheet, the deadline is near, and the question is “which test do I run?” Reviewers will notice if you pick the wrong one. The right answer is not a single test — it is a set of four decisions that narrow the options.

This guide walks through those four decisions as a decision tree. Follow the path that matches your data and you will land on the correct test, plus the effect size and reporting format to use with it. We will anchor each branch to the kind of experimental data biologists actually collect.

The Four Decisions That Choose Your Statistical Test

Ignore the 30-row flowcharts. For 90% of experimental designs, the test falls out of four questions:

  1. What is the dependent variable? Continuous, categorical, or time-to-event?
  2. How many groups or conditions are you comparing? One, two, or three-plus?
  3. Are the groups independent or paired? Different subjects per group, or the same subject measured twice?
  4. Do your data meet parametric assumptions? Normally distributed residuals and roughly equal variances?

Answering these four in order gives you the test. The rest of this post walks each branch.

Step 1: What Is Your Dependent Variable?

The dependent variable is the outcome you measured. Its type — continuous, categorical, or time-to-event — determines the whole family of tests available.

Variable type                | Examples                                                              | Test family
Continuous                   | Enzyme activity, tumor volume, blood pressure, fluorescence intensity | t-tests, ANOVA, regression
Categorical (two categories) | Survived vs died, positive vs negative, responder vs non-responder    | Chi-square, Fisher’s exact, logistic regression
Categorical (more than two)  | Genotype (AA/AB/BB), treatment arm (control/low/high)                 | Chi-square, multinomial regression
Ordinal                      | Pain score 0–10, histology grade 1–4, Likert scale                    | Non-parametric tests (Mann-Whitney, Kruskal-Wallis) or ordinal regression
Time-to-event                | Time to relapse, survival time, time to disease progression           | Kaplan-Meier, log-rank, Cox proportional hazards
Count                        | Number of plaques, colony counts, seizure frequency                   | Poisson or negative binomial regression

Common Mistake

Treating ordinal scores as continuous. A pain score of 6 is not “twice as bad” as 3 — the distances between ranks are not equal. Running a t-test on Likert or ordinal grading data is common but technically wrong. Use Mann-Whitney or ordinal regression instead.
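As a minimal sketch of the rank-based alternative, using scipy and invented pain scores:

```python
from scipy import stats

# Hypothetical ordinal pain scores (0-10) for two independent groups.
control = [6, 7, 5, 8, 6, 7, 9, 6]
treated = [3, 4, 2, 5, 3, 4, 3, 5]

# Mann-Whitney U compares rank distributions, so it never assumes the
# distances between ordinal levels are equal.
u_stat, p_value = stats.mannwhitneyu(control, treated, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```

The test statistic is built from ranks alone, so recoding the scale (say, 0–10 to 0–100) leaves the result unchanged.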

Step 2: How Many Groups Are You Comparing?

Within the continuous-outcome family, the count of groups or conditions splits the decision tree in two.

One group vs. a reference value

You measured one group and want to test whether its mean differs from a known value (for example, is tumor mass different from a historical control of 500 mg?). Use a one-sample t-test.
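In scipy, the 500 mg example looks like this (the measurements are invented for illustration):

```python
from scipy import stats

# Hypothetical tumor masses (mg) from one treatment group.
tumor_mass = [452, 478, 512, 465, 490, 440, 505, 470]

# Test whether the group mean differs from the historical control of 500 mg.
t_stat, p_value = stats.ttest_1samp(tumor_mass, popmean=500)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```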

Two groups

Comparing exactly two conditions. This is where most bench experiments land: treated vs control, wild-type vs knockout, pre vs post. Go to Step 3 to decide between paired and independent.

Three or more groups

Multiple treatment levels, multiple cell lines, or a dose-response design. Use ANOVA — and follow with a post-hoc test to identify which groups differ. Running three separate t-tests instead of an ANOVA inflates the false positive rate from 5% to roughly 14% with three groups.
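The roughly-14% figure follows directly from the complement rule, assuming the comparisons are independent:

```python
# Family-wise error rate for k comparisons, each run at alpha = 0.05:
# P(at least one false positive) = 1 - (1 - alpha)**k
alpha = 0.05
for k in (1, 3, 6):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k} comparisons: family-wise error rate = {fwer:.1%}")
```

Three groups need three pairwise comparisons (14.3%); four groups need six (26.5%). The rate only climbs from there.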

Tip

If you have two factors (for example, treatment × time point), you want a two-way ANOVA, not separate one-way ANOVAs. Two-way ANOVA lets you test the interaction — whether treatment effect depends on time — which is usually the most interesting biological question.
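A sketch of the interaction test using statsmodels’ formula interface; the data, column names, and effect sizes are all invented:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)

# Hypothetical design: 2 treatments x 3 time points, 6 replicates per cell.
treatment = np.repeat(["control", "drug"], 18)
time = np.tile(np.repeat([1, 4, 24], 6), 2)
# Simulate a response where the drug effect grows with time (an interaction).
response = 10 + 2 * (treatment == "drug") * np.log1p(time) + rng.normal(0, 1, 36)

df = pd.DataFrame({"treatment": treatment, "time": time, "response": response})

# C(...) treats time as categorical; '*' expands to both main effects
# plus the treatment:time interaction term.
model = smf.ols("response ~ C(treatment) * C(time)", data=df).fit()
print(anova_lm(model, typ=2))
```

The row labeled `C(treatment):C(time)` in the ANOVA table is the interaction: a small p-value there means the treatment effect differs across time points.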

Step 3: Independent or Paired?

The pairing question decides between two different tests that share the “t-test” label. Get this wrong and your p-value is either too high (you threw away power) or invalid entirely.

Design type           | Example                                                   | Two-group test                 | Multi-group test
Independent           | Different mice in treatment vs control groups             | Unpaired t-test                | One-way ANOVA
Paired (same subject) | Blood pressure before and after drug in the same patients | Paired t-test                  | Repeated measures ANOVA
Matched               | Cases matched to controls on age and sex                  | Paired t-test on matched pairs | RM-ANOVA or mixed model
Nested / clustered    | Multiple cells per animal, multiple animals per cage      | Linear mixed model             | Linear mixed model

How to tell: ask yourself whether you could swap any data point from group A with any data point from group B without changing the interpretation. If yes, groups are independent. If no — because they come from the same subject, culture, or animal — they are paired.

Common Mistake

Cell biology experiments often use paired designs without realizing it. If you split one cell culture into treated and control wells, those two wells are paired — they share the same biological replicate. Treating them as independent throws away statistical power and may hide real effects.
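A quick simulation (scipy, with invented numbers) shows how much the pairing recovers when wells share a biological replicate:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical paired design: each of 10 cultures is split into a control
# and a treated well, so both wells inherit the culture's baseline signal.
baseline = rng.normal(100, 15, 10)               # culture-to-culture variation
control = baseline + rng.normal(0, 3, 10)
treated = baseline - 5 + rng.normal(0, 3, 10)    # true effect: -5 units

# Wrong: treating wells as independent buries the effect in the
# large between-culture variance.
_, p_indep = stats.ttest_ind(control, treated)
# Right: the paired test subtracts the shared baseline first.
_, p_paired = stats.ttest_rel(control, treated)
print(f"independent: p = {p_indep:.3f}, paired: p = {p_paired:.4f}")
```

The exact p-values depend on the seed, but the paired test consistently comes out far smaller because the differences have much lower variance than the raw values.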

Step 4: Do Your Data Meet Parametric Assumptions?

t-tests, ANOVA, and linear regression are parametric — they assume your data (or more precisely, the residuals) are approximately normally distributed and that groups have similar variance. Violating these assumptions with small samples produces p-values that are too low or too high.

What to check

Normality. Run a Shapiro-Wilk test on the residuals (not the raw data). If p > .05, normality is plausible. Complement with a Q-Q plot — points should fall close to the diagonal line.

Equal variances. Run Levene’s test. If p > .05, equal variances are plausible. Many statisticians now recommend Welch’s t-test as a default because it does not assume equal variances.

Sample size. With n < 10 per group, normality tests have low power — a non-significant result does not mean normal. With n > 30 per group, the Central Limit Theorem means small normality violations matter little.
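Both checks are one-liners in scipy; here is a sketch on simulated data (group means, spreads, and sizes are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Two hypothetical groups of continuous measurements.
group_a = rng.normal(50, 8, 15)
group_b = rng.normal(58, 8, 15)

# For a two-group comparison, the residuals are each observation minus
# its own group mean -- test those, not the pooled raw data.
residuals = np.concatenate([group_a - group_a.mean(), group_b - group_b.mean()])

_, p_normal = stats.shapiro(residuals)       # normality of residuals
_, p_var = stats.levene(group_a, group_b)    # equality of variances
print(f"Shapiro-Wilk p = {p_normal:.3f}, Levene p = {p_var:.3f}")

# Welch's t-test (equal_var=False) is a safe default either way.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
```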

Rule of Thumb

If normality fails or n is small and distribution looks skewed, use the non-parametric equivalent:

  • Unpaired t-test → Mann-Whitney U test
  • Paired t-test → Wilcoxon signed-rank test
  • One-way ANOVA → Kruskal-Wallis test
  • Repeated measures ANOVA → Friedman test

Non-parametric tests give up little when data are truly normal (the Mann-Whitney U test’s asymptotic relative efficiency versus the t-test is 3/π ≈ 0.955, about a 4.5% cost), but they can have substantially more power when data are skewed. The penalty is small; the robustness is large.
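A small simulation on invented lognormal data illustrates the trade-off; the exact power estimates depend on the seed, sample size, and effect size chosen:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Heavily skewed (lognormal) data with a genuine shift in group B.
n_sims, n = 500, 12
t_hits = u_hits = 0
for _ in range(n_sims):
    a = rng.lognormal(mean=0.0, sigma=1.0, size=n)
    b = rng.lognormal(mean=0.9, sigma=1.0, size=n)
    t_hits += stats.ttest_ind(a, b).pvalue < 0.05      # parametric
    u_hits += stats.mannwhitneyu(a, b).pvalue < 0.05   # rank-based

print(f"t-test power      ~ {t_hits / n_sims:.2f}")
print(f"Mann-Whitney power ~ {u_hits / n_sims:.2f}")
```

On skewed data like this, the rank-based test detects the shift considerably more often than the t-test does.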

The Complete Decision Tree

Putting the four steps together, here is the full map for continuous outcomes:

Groups         | Design                  | Parametric (normal)      | Non-parametric
1 vs reference | -                       | One-sample t-test        | Wilcoxon signed-rank (against median)
2              | Independent             | Unpaired t-test (Welch)  | Mann-Whitney U
2              | Paired                  | Paired t-test            | Wilcoxon signed-rank
3+             | Independent             | One-way ANOVA + post-hoc | Kruskal-Wallis + Dunn’s
3+             | Repeated measures       | RM-ANOVA + post-hoc      | Friedman + Dunn’s
3+             | Two factors             | Two-way ANOVA            | Aligned-ranks or permutation
Any            | Continuous predictor(s) | Linear regression        | Robust regression or Spearman correlation
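If you want the map as executable logic, a plain dictionary lookup is enough; the keys and labels below are placeholders for the tests in the table, not a library API:

```python
# Lookup-table version of the decision tree for continuous outcomes.
# Keys: (groups, design, parametric_assumptions_met).
DECISION_TREE = {
    ("1 vs reference", "-", True):  "One-sample t-test",
    ("1 vs reference", "-", False): "Wilcoxon signed-rank (against median)",
    ("2", "independent", True):     "Unpaired t-test (Welch)",
    ("2", "independent", False):    "Mann-Whitney U",
    ("2", "paired", True):          "Paired t-test",
    ("2", "paired", False):         "Wilcoxon signed-rank",
    ("3+", "independent", True):    "One-way ANOVA + post-hoc",
    ("3+", "independent", False):   "Kruskal-Wallis + Dunn's",
    ("3+", "repeated", True):       "RM-ANOVA + post-hoc",
    ("3+", "repeated", False):      "Friedman + Dunn's",
}

def choose_test(groups: str, design: str, parametric: bool) -> str:
    """Return the recommended test for a continuous outcome."""
    return DECISION_TREE[(groups, design, parametric)]

print(choose_test("2", "paired", False))  # Wilcoxon signed-rank
```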

For categorical outcomes, the tree is shorter: 2×2 tables with small cell counts (any expected count < 5) use Fisher’s exact test; larger tables use the chi-square test of independence; matched binary outcomes use McNemar’s test.
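In scipy, the expected counts that drive this choice come free from `chi2_contingency`; the 2×2 table below is invented:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: rows = treated/control, cols = responder/non.
table = np.array([[1, 9],
                  [7, 3]])

# chi2_contingency also returns the expected counts under independence,
# which decide between the chi-square and Fisher's exact test.
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
if expected.min() < 5:
    odds_ratio, p = stats.fisher_exact(table)
    print(f"Fisher's exact: p = {p:.3f}")
else:
    p = p_chi2
    print(f"Chi-square: p = {p:.3f}")
```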

For time-to-event data, use Kaplan-Meier with log-rank for unadjusted comparisons and Cox proportional hazards regression when you need to adjust for covariates.

Worked Example: A Two-Group Bench Experiment

Worked Example

You are testing whether a knockdown reduces mitochondrial membrane potential. You have 12 replicates per group: 12 control wells, 12 siRNA wells. All wells came from the same parental cell line, split 24 hours before treatment.

  1. Dependent variable: Fluorescence intensity (continuous) → t-test family.
  2. Groups: Two (control vs knockdown) → t-test, not ANOVA.
  3. Design: Wells from the same parent culture split into treated and control — these are paired by biological replicate. If you did 3 independent transfection days (4 wells each per group), the 4 wells per day are technical replicates and you should average or model them as nested. If all 24 wells came from one transfection, n = 1 biological replicate, not 12.
  4. Assumptions: Shapiro-Wilk on residuals, p = .31. Normal. Levene p = .64. Equal variances.

Test: Paired t-test on biological-replicate means.
Report: “Mitochondrial membrane potential was lower in knockdown compared to control (paired t(n−1) = 4.21, p = .002, Cohen’s d = 1.38, 95% CI [0.45, 2.28]).”
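The effect size in a report like that is straightforward to compute by hand; here is a sketch with invented biological-replicate means (not the values from the example above):

```python
import numpy as np
from scipy import stats

# Hypothetical fluorescence means, one value per transfection day.
control = np.array([412.0, 398.0, 405.0, 391.0])
knockdown = np.array([371.0, 344.0, 360.0, 352.0])

t_stat, p_value = stats.ttest_rel(control, knockdown)

# Cohen's d for paired data: mean difference / SD of the differences.
diffs = control - knockdown
d = diffs.mean() / diffs.std(ddof=1)
print(f"t({len(diffs) - 1}) = {t_stat:.2f}, p = {p_value:.4f}, d = {d:.2f}")
```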

Common Traps That Reviewers Catch

Even with the right branch, there are three traps worth naming:

  1. Technical replicates masquerading as biological replicates. Three wells from the same dish are not n = 3. Reviewers will flag this in any competent journal.
  2. Reporting only the p-value. APA and most major journals now require effect sizes and 95% confidence intervals. Cohen’s d for t-tests, partial η² for ANOVA, r for correlations.
  3. Running multiple pairwise tests without correction. If you compare 4 groups with t-tests instead of ANOVA+post-hoc, your false positive rate climbs. Always correct, or use an omnibus test first.

The decision tree above catches the most common mistakes — pairing errors, assumption violations, multiple comparison inflation. For the edge cases (nested designs, mixed models, Bayesian alternatives), a consult with a biostatistician is worth the hour it takes.

Standards and Reporting References

Whatever test you land on, reviewers expect the full reporting package alongside the p-value: the test statistic with its degrees of freedom, the exact p-value, an effect size, and a 95% confidence interval.

If you want the test recommendation automated, GraphHelix’s sample size calculator starts with the same four decisions above and outputs a defensible power analysis. The logic is the same whether you run it in your head, on a flowchart, or in a tool — the four decisions are what determine the test.

Ready to analyze your data?

Join the beta waitlist and be the first to try GraphHelix.