Volume 175, Issue 18, p. 3636–3637

Commentary on the BJP's new statistical reporting guidelines

Harvey J Motulsky, GraphPad Software, La Jolla, CA, USA
Martin C Michel, Department of Pharmacology, Johannes Gutenberg University, Mainz, Germany

Correspondence: Harvey Motulsky, GraphPad Software, La Jolla, CA, USA. E-mail: [email protected]

First published: 24 August 2018

The BJP has recently established new guidelines for reporting data analysis and statistics (Curtis et al., 2018), which replace previously published guidelines (Curtis et al., 2015). We agree that more rigour in data analysis is helpful to improve reproducibility but disagree with some of these guidelines.

Guideline: Sample size must be 5 or greater

We agree that small sample sizes are a major source of poor reproducibility but mildly disagree with the recommendation. The threshold of n ≥ 5 is arbitrary and has no mathematical justification. With a continuous variable, a large effect and little variation, valid conclusions can be reached from experiments with n = 3 or 4 (de Winter, 2013). Examples where sample sizes of <5 can be meaningful because of large effect sizes include highly inducible genes such as interleukin-2 and the knock-down of gene expression by siRNA. Also, see the generic example below.

Guideline: Studies should be designed with equal sample sizes in all treatment groups

We strongly disagree. There are two situations where it is optimal to plan for unequal group sizes:
  • When one control group is compared to multiple treatment groups (common in pharmacology). For these experiments, it makes sense to use a larger sample size for the control group, since it is part of every comparison. The optimal approach is to make the sample size of the control group equal to the sample size of each treatment group times the square root of k, where k is the number of treatment groups compared against the control (Dunnett, 1964).
  • When one treatment is more expensive, more time consuming or riskier than the other (rare in pharmacology). There are well-established methods for planning experiments with unequal sample sizes in this case. The total sample size will be higher, but the cost, time or risk will be lower.
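The square-root allocation rule for the first situation amounts to one line of arithmetic. A minimal sketch (the function name and the sample numbers are illustrative, not from the original):

```python
import math

def control_group_size(n_per_treatment: int, k_treatment_groups: int) -> int:
    """Control-group size under the square-root allocation rule
    (Dunnett, 1964): n_control = n_treatment * sqrt(k), where k is
    the number of treatment groups compared against the control."""
    return round(n_per_treatment * math.sqrt(k_treatment_groups))

# Example: three treatment groups of n = 10 each.
print(control_group_size(10, 3))  # -> 17, versus 10 per treatment group
```

So with three treatment groups of 10 animals each, the control group would get about 17 rather than 10.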

Guideline: Only do follow-up tests after ANOVA if the overall ANOVA has P < 0.05

We disagree. With the exception of the rarely used protected Fisher LSD test, none of the multiple comparison tests commonly used as follow-up require that the overall ANOVA reach a statistically significant conclusion (Hsu, 1996). For many pharmacological experiments, we see no reason to even report the overall ANOVA results, as the follow-up multiple comparison tests answer all the scientific questions the study was designed to ask.

Note that P > 0.05 for the overall ANOVA does not prove that all population means are equal but only says that the overall ANOVA has not detected such differences. It is rare for a follow-up test to reach statistical significance when the overall ANOVA does not, but it can happen (especially with Dunnett's test since there are fewer comparisons), and the results of the follow-up tests are valid in these cases.
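To illustrate the mechanics, the sketch below runs the overall ANOVA and the follow-up comparisons independently, so neither is gated on the other. The data are invented, and Bonferroni-corrected t-tests stand in for Dunnett's test (which newer SciPy versions provide as scipy.stats.dunnett):

```python
from scipy import stats

# Invented example data: responses under control and two treatments.
control = [10.1, 9.8, 10.3, 10.0, 9.9]
drug_a = [11.0, 10.6, 11.3, 10.9, 11.1]
drug_b = [10.2, 10.0, 10.4, 9.9, 10.1]

# Overall one-way ANOVA, reported here only for completeness.
f_stat, anova_p = stats.f_oneway(control, drug_a, drug_b)

# Follow-up comparisons of each treatment against the control.
# Their validity does not depend on anova_p being < 0.05.
followup_p = {}
for name, group in [("drug_a", drug_a), ("drug_b", drug_b)]:
    t, p = stats.ttest_ind(group, control)
    followup_p[name] = min(1.0, p * 2)  # Bonferroni for 2 comparisons

print(anova_p, followup_p)
```

The point of the sketch is structural: the follow-up comparisons are computed and reported in their own right, not conditioned on the overall ANOVA result.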

Guideline: Normalized data must be analysed with a non-parametric method

We strongly disagree. Curtis et al. (2018) state that no parametric test is possible if every value is expressed as a fraction of its control. In fact, such data can be analysed with a one-sample t-test of the null hypothesis that the population mean is 1.0 (for ratios), 100 (for percentages) or 0.0 (for logarithms of ratios). See the example below.

Example of a one-sample t-test with n = 3

Set-up: An experiment (n = 3) with results expressed as per cent of control.

Data: 47, 41, 42.

Summary: Mean = 43.33; SD = 3.215; SEM = 1.856.

Critical value of t distribution for df = 2 and 95% confidence via Excel: =T.INV(0.975, 2) = 4.303.

Margin of error = Critical t * SEM = 4.303 * 1.856 = 7.986. 95% confidence interval: mean ± margin of error, from 35.35 to 51.32.

Given a reasonable experimental context, this is convincing evidence that the treatment inhibited the response by about 50–65%.

Optional: If you also want a P value…

Null hypothesis: Population mean = 100.

t ratio = abs(mean − null hypothesis)/SEM = 30.53.

Two-tailed P with df = 2 via Excel: =T.DIST.2T(30.53, 2) = 0.0011.
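As a check, the same numbers can be reproduced with SciPy, where scipy.stats.ttest_1samp performs the one-sample t-test directly (a sketch of the worked example above):

```python
import numpy as np
from scipy import stats

data = np.array([47.0, 41.0, 42.0])  # responses as per cent of control

mean = data.mean()
sd = data.std(ddof=1)          # sample SD
sem = sd / np.sqrt(len(data))  # standard error of the mean

# 95% confidence interval from the t distribution with df = 2.
t_crit = stats.t.ppf(0.975, df=len(data) - 1)
ci_low, ci_high = mean - t_crit * sem, mean + t_crit * sem

# One-sample t test against the null hypothesis that the mean is 100.
result = stats.ttest_1samp(data, popmean=100.0)

print(round(mean, 2), round(sd, 3), round(sem, 3))  # 43.33 3.215 1.856
print(round(ci_low, 2), round(ci_high, 2))          # 35.35 51.32
print(round(abs(result.statistic), 2))              # 30.53
```

The confidence interval and t ratio match the hand calculation above, and the two-tailed P value agrees with the Excel result to the reported precision.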


This project has received funding from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement no. 777364. This Joint Undertaking receives support from the European Union's Horizon 2020 research and innovation programme and EFPIA.

Conflict of interest

H.J.M. is a shareholder and employee of GraphPad Software, a company that provides software for scientific statistical analysis. Work by M.C.M. is part of the European Quality In Preclinical Data (EQIPD) consortium.