Choosing the Right Statistical Test: A Decision Guide for Experimental Biologists
Every experimental biologist hits the same wall: the data are in the spreadsheet, the deadline is near, and the question is “which test do I run?” Reviewers will notice if you pick the wrong one. The right answer is not a single test — it is a set of four decisions that narrow the options.
This guide walks through those four decisions as a decision tree. Follow the path that matches your data and you will land on the correct test, plus the effect size and reporting format to use with it. We will anchor each branch to the kind of experimental data biologists actually collect.
The Four Decisions That Choose Your Statistical Test
Ignore the 30-row flowcharts. For 90% of experimental designs, the test falls out of four questions:
- What is the dependent variable? Continuous, categorical, or time-to-event?
- How many groups or conditions are you comparing? One, two, or three-plus?
- Are the groups independent or paired? Different subjects per group, or the same subject measured twice?
- Do your data meet parametric assumptions? Normally distributed residuals and roughly equal variances?
Answering these four in order gives you the test. The rest of this post walks through each branch.
Step 1: What Is Your Dependent Variable?
The dependent variable is the outcome you measured. Its type — continuous, categorical, or time-to-event — determines the whole family of tests available.
| Variable type | Examples | Test family |
|---|---|---|
| Continuous | Enzyme activity, tumor volume, blood pressure, fluorescence intensity | t-tests, ANOVA, regression |
| Categorical (two categories) | Survived vs died, positive vs negative, responder vs non-responder | Chi-square, Fisher’s exact, logistic regression |
| Categorical (more than two) | Genotype (AA/AB/BB), treatment arm (control/low/high) | Chi-square, multinomial regression |
| Ordinal | Pain score 0–10, histology grade 1–4, Likert scale | Non-parametric tests (Mann-Whitney, Kruskal-Wallis) or ordinal regression |
| Time-to-event | Time to relapse, survival time, time to disease progression | Kaplan-Meier, log-rank, Cox proportional hazards |
| Count | Number of plaques, colony counts, seizure frequency | Poisson or negative binomial regression |
Treating ordinal scores as continuous is a common trap. A pain score of 6 is not "twice as bad" as 3 — the distances between ranks are not equal. Running a t-test on Likert or ordinal grading data is widespread but technically wrong. Use Mann-Whitney or ordinal regression instead.
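A rank-based test sidesteps the unequal-spacing problem entirely. A minimal sketch with scipy, using invented pain scores for two hypothetical treatment arms:

```python
from scipy import stats

# Hypothetical pain scores (0-10 ordinal scale) for two arms;
# values are invented for illustration.
placebo = [6, 7, 5, 8, 6, 7, 9, 6]
drug    = [4, 3, 5, 2, 4, 3, 5, 4]

# Mann-Whitney U compares rank distributions, so the unequal spacing
# between ordinal levels does not matter.
u, p = stats.mannwhitneyu(placebo, drug, alternative="two-sided")
print(f"U = {u}, p = {p:.4f}")
```

The same call works for any ordinal outcome — histology grades, Likert items — because only the ordering of values enters the statistic.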
Step 2: How Many Groups Are You Comparing?
Within the continuous-outcome family, the count of groups or conditions splits the decision tree in two.
One group vs. a reference value
You measured one group and want to test whether its mean differs from a known value (for example, is tumor mass different from a historical control of 500 mg?). Use a one-sample t-test.
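In scipy this is a one-liner; the tumor masses below are invented, and 500 mg is the historical reference from the example above:

```python
from scipy import stats

# Invented tumor masses (mg) for one treated group.
tumor_mg = [430, 465, 480, 512, 455, 470, 448, 490]

# One-sample t-test against the historical reference of 500 mg.
t, p = stats.ttest_1samp(tumor_mg, popmean=500)
print(f"t({len(tumor_mg) - 1}) = {t:.2f}, p = {p:.4f}")
```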
Two groups
Comparing exactly two conditions. This is where most bench experiments land: treated vs control, wild-type vs knockout, pre vs post. Go to Step 3 to decide between paired and independent.
Three or more groups
Multiple treatment levels, multiple cell lines, or a dose-response design. Use ANOVA — and follow with a post-hoc test to identify which groups differ. Running three separate t-tests instead of an ANOVA inflates the false positive rate from 5% to roughly 14% with three groups.
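The "5% to roughly 14%" figure is just the familywise error formula: with k independent comparisons at alpha = 0.05, the chance of at least one false positive is 1 − (1 − alpha)^k. Three groups mean three pairwise comparisons:

```python
# Familywise error rate for k independent comparisons at alpha = 0.05.
alpha = 0.05
for k in (1, 3, 6):
    fwer = 1 - (1 - alpha) ** k
    print(f"{k} comparisons: familywise error rate = {fwer:.1%}")
```

With three comparisons the rate is about 14.3%; with four groups (six pairwise comparisons) it climbs past 26%.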
If you have two factors (for example, treatment × time point), you want a two-way ANOVA, not separate one-way ANOVAs. Two-way ANOVA lets you test the interaction — whether treatment effect depends on time — which is usually the most interesting biological question.
Step 3: Independent or Paired?
The pairing question decides between two completely different tests with the same name. Get this wrong and your p-value is either too high (lost power) or invalid entirely.
| Design type | Example | Two-group test | Multi-group test |
|---|---|---|---|
| Independent | Different mice in treatment vs control groups | Unpaired t-test | One-way ANOVA |
| Paired (same subject) | Blood pressure before and after drug in the same patients | Paired t-test | Repeated measures ANOVA |
| Matched | Cases matched to controls on age and sex | Paired t-test on matched pairs | RM-ANOVA or mixed model |
| Nested / clustered | Multiple cells per animal, multiple animals per cage | Linear mixed model | Linear mixed model |
How to tell: ask yourself whether you could swap any data point from group A with any data point from group B without changing the interpretation. If yes, groups are independent. If no — because they come from the same subject, culture, or animal — they are paired.
Cell biology experiments often use paired designs without realizing it. If you split one cell culture into treated and control wells, those two wells are paired — they share the same biological replicate. Treating them as independent throws away statistical power and may hide real effects.
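The power difference is easy to see with simulated paired data. In this sketch each simulated "culture" contributes one control and one treated well that share a random baseline, mimicking the split-culture design above (all numbers are invented):

```python
import random
from scipy import stats

random.seed(1)
# Each culture has its own baseline; wells inherit it plus small noise.
baseline = [random.gauss(100, 15) for _ in range(10)]
control  = [b + random.gauss(0, 3) for b in baseline]
treated  = [b - 8 + random.gauss(0, 3) for b in baseline]  # true 8-unit drop

t_unpaired, p_unpaired = stats.ttest_ind(control, treated)
t_paired, p_paired = stats.ttest_rel(control, treated)
print(f"unpaired p = {p_unpaired:.4f}, paired p = {p_paired:.4f}")
```

The paired test removes the large between-culture variation, so its p-value comes out far smaller for the same data — that is the power you throw away by treating paired wells as independent.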
Step 4: Do Your Data Meet Parametric Assumptions?
t-tests, ANOVA, and linear regression are parametric — they assume your data (or more precisely, the residuals) are approximately normally distributed and that groups have similar variance. Violating these assumptions with small samples produces p-values that are too low or too high.
What to check
Normality. Run a Shapiro-Wilk test on the residuals (not the raw data). If p > .05, normality is plausible. Complement with a Q-Q plot — points should fall close to the diagonal line.
Equal variances. Run Levene’s test. If p > .05, equal variances are plausible. Many statisticians now recommend Welch’s t-test as a default because it does not assume equal variances.
Sample size. With n < 10 per group, normality tests have low power — a non-significant result does not mean normal. With n > 30 per group, the Central Limit Theorem means small normality violations matter little.
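Both checks take a few lines in scipy. A sketch on two invented groups, computing residuals as deviations from each group mean before the normality test:

```python
from scipy import stats

# Illustrative data for a two-group comparison.
group_a = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7, 5.4, 5.0]
group_b = [6.0, 5.8, 6.3, 5.9, 6.1, 5.7, 6.2, 6.0, 5.6, 6.4]

# Normality: Shapiro-Wilk on the residuals, not the raw data.
mean_a = sum(group_a) / len(group_a)
mean_b = sum(group_b) / len(group_b)
residuals = [x - mean_a for x in group_a] + [x - mean_b for x in group_b]
w, p_norm = stats.shapiro(residuals)

# Equal variances: Levene's test on the raw groups.
stat, p_var = stats.levene(group_a, group_b)
print(f"Shapiro-Wilk p = {p_norm:.3f}, Levene p = {p_var:.3f}")
```

If either check fails, Welch's t-test (`equal_var=False` in scipy's `ttest_ind`) handles unequal variances, and the non-parametric equivalents below handle non-normality.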
If normality fails or n is small and distribution looks skewed, use the non-parametric equivalent:
- Unpaired t-test → Mann-Whitney U test
- Paired t-test → Wilcoxon signed-rank test
- One-way ANOVA → Kruskal-Wallis test
- Repeated measures ANOVA → Friedman test
Non-parametric tests give up little when data are truly normal (the Mann-Whitney U test retains roughly 95% of the t-test's asymptotic efficiency), but can have substantially more power when data are skewed. The penalty is small; the robustness is large.
The Complete Decision Tree
Putting the four steps together, here is the full map for continuous outcomes:
| Groups | Design | Parametric (normal) | Non-parametric |
|---|---|---|---|
| 1 vs reference | — | One-sample t-test | Wilcoxon signed-rank (against median) |
| 2 | Independent | Unpaired t-test (Welch) | Mann-Whitney U |
| 2 | Paired | Paired t-test | Wilcoxon signed-rank |
| 3+ | Independent | One-way ANOVA + post-hoc | Kruskal-Wallis + Dunn’s |
| 3+ | Repeated measures | RM-ANOVA + post-hoc | Friedman + Dunn’s |
| 3+ | Two factors | Two-way ANOVA | Aligned-ranks or permutation |
| Continuous predictor(s) | — | Linear regression | Robust regression or Spearman correlation |
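The continuous-outcome rows of the table above can be encoded as a lookup — a minimal sketch, with the group/design labels as assumed keys:

```python
# Decision-tree lookup for continuous outcomes; keys and test names
# mirror the table above (labels are this sketch's own convention).
def choose_test(groups: str, design: str, parametric: bool) -> str:
    table = {
        ("1", "-"):            ("One-sample t-test", "Wilcoxon signed-rank"),
        ("2", "independent"):  ("Unpaired t-test (Welch)", "Mann-Whitney U"),
        ("2", "paired"):       ("Paired t-test", "Wilcoxon signed-rank"),
        ("3+", "independent"): ("One-way ANOVA + post-hoc", "Kruskal-Wallis + Dunn's"),
        ("3+", "repeated"):    ("RM-ANOVA + post-hoc", "Friedman + Dunn's"),
    }
    parametric_test, nonparametric_test = table[(groups, design)]
    return parametric_test if parametric else nonparametric_test

print(choose_test("2", "paired", parametric=True))  # Paired t-test
```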
For categorical outcomes, the tree is shorter: 2×2 tables with small cell counts (any expected count < 5) use Fisher’s exact test; larger tables use the chi-square test of independence; matched binary outcomes use McNemar’s test.
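The expected-count check is mechanical: scipy's chi-square routine returns the expected counts, so the Fisher-vs-chi-square decision can be made in code. A sketch with an invented 2×2 table:

```python
from scipy import stats

# Invented 2x2 table: rows = treated/control, cols = outcome +/-.
table = [[1, 9],
         [8, 2]]

# chi2_contingency also returns the expected counts under independence.
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

if expected.min() < 5:
    # Any expected count below 5: use Fisher's exact test instead.
    odds, p = stats.fisher_exact(table)
    print(f"Fisher's exact: p = {p:.4f}")
else:
    print(f"Chi-square: p = {p_chi:.4f}")
```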
For time-to-event data, use Kaplan-Meier with log-rank for unadjusted comparisons and Cox proportional hazards regression when you need to adjust for covariates.
Worked Example: A Two-Group Bench Experiment
You are testing whether a knockdown reduces mitochondrial membrane potential. You have 12 replicates per group: 12 control wells, 12 siRNA wells. All wells came from the same parental cell line, split 24 hours before treatment.
- Dependent variable: Fluorescence intensity (continuous) → t-test family.
- Groups: Two (control vs knockdown) → t-test, not ANOVA.
- Design: Wells from the same parent culture split into treated and control — these are paired by biological replicate. If you did 3 independent transfection days (4 wells each per group), the 4 wells per day are technical replicates and you should average or model them as nested. If all 24 wells came from one transfection, n = 1 biological replicate, not 12.
- Assumptions: Shapiro-Wilk on residuals, p = .31. Normal. Levene p = .64. Equal variances.
Test: Paired t-test on biological-replicate means.
Report: “Mitochondrial membrane potential was lower in knockdown compared to control (paired t(n−1) = 4.21, p = .002, Cohen’s d = 1.38, 95% CI [0.45, 2.28]).”
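The same analysis as code, with invented fluorescence values for three biological replicates (transfection days), each already averaged over its technical-replicate wells — the numbers here will not reproduce the t and d in the report above, which come from a different (hypothetical) dataset:

```python
import statistics
from scipy import stats

# Per-day means (invented): one control and one siRNA value per day.
control   = [1020.0, 981.0, 1053.0]
knockdown = [790.0, 815.0, 833.0]

t, p = stats.ttest_rel(control, knockdown)

# Cohen's d for paired data: mean difference / SD of the differences.
diffs = [c - k for c, k in zip(control, knockdown)]
d = statistics.mean(diffs) / statistics.stdev(diffs)
print(f"paired t({len(diffs) - 1}) = {t:.2f}, p = {p:.4f}, d = {d:.2f}")
```

Note that averaging to per-day means first leaves n = 3, so the degrees of freedom are 2 — small n demands a large effect to reach significance.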
Common Traps That Reviewers Catch
Even with the right branch, there are three traps worth naming:
- Technical replicates masquerading as biological replicates. Three wells from the same dish are not n = 3. Reviewers will flag this in any competent journal.
- Reporting only the p-value. APA and most major journals now require effect sizes and 95% confidence intervals. Cohen’s d for t-tests, partial η² for ANOVA, r for correlations.
- Running multiple pairwise tests without correction. If you compare 4 groups with t-tests instead of ANOVA + post-hoc, that is six pairwise tests and a familywise false positive rate above 26%. Always correct, or use an omnibus test first.
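The simplest correction is Bonferroni: multiply each raw p-value by the number of comparisons and cap at 1. A sketch for the four-group scenario (six pairwise comparisons; p-values invented):

```python
# Bonferroni adjustment for six pairwise comparisons.
raw_p = [0.004, 0.021, 0.048, 0.130, 0.410, 0.770]
m = len(raw_p)
adjusted = [min(p * m, 1.0) for p in raw_p]
for p, adj in zip(raw_p, adjusted):
    flag = "significant" if adj < 0.05 else "ns"
    print(f"raw p = {p:.3f} -> adjusted p = {adj:.3f} ({flag})")
```

Note how 0.021 and 0.048 — "significant" on their own — no longer survive the correction; only the first comparison does. Post-hoc procedures like Tukey's HSD or Holm are less conservative, but the logic is the same.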
The decision tree above catches the most common mistakes — pairing errors, assumption violations, multiple comparison inflation. For the edge cases (nested designs, mixed models, Bayesian alternatives), a consult with a biostatistician is worth the hour it takes.
Standards and Reporting References
The reporting frameworks below specify what reviewers expect to see alongside the test result itself:
- APA Journal Article Reporting Standards (JARS) — quantitative reporting format, required by most psychology and behavioral journals
- CONSORT guidelines — clinical trial reporting, requires pre-specified analysis plans
- ARRIVE 2.0 guidelines — animal research reporting, requires justification of sample size and statistical test selection
If you want the test recommendation automated, GraphHelix’s sample size calculator starts with the same four decisions above and outputs a defensible power analysis. The logic is the same whether you run it in your head, on a flowchart, or in a tool — the four decisions are what determine the test.