Statistics
Textbooks
Boundless Statistics
Estimation and Hypothesis Testing
Comparing More than Two Means
Concept Version 7
Created by Boundless

Elements of a Designed Study

The problem of comparing more than two means results from the increase in Type I error that occurs when statistical tests are used repeatedly.

Learning Objective

  • Discuss the increasing Type I error that accompanies comparisons of more than two means and the various methods of correcting this error.


Key Points

    • Unless the tests are perfectly dependent, the familywise error rate increases as the number of comparisons increases.
    • Multiple testing correction refers to re-calculating probabilities obtained from a statistical test which was repeated multiple times.
    • In order to retain a prescribed familywise error rate $\alpha$ in an analysis involving more than one comparison, the error rate for each comparison must be more stringent than $\alpha$.
    • The most conservative method of controlling the familywise error rate, free of independence and distribution assumptions, is known as the Bonferroni correction.
    • Multiple comparison procedures are commonly used in an analysis of variance after obtaining a significant omnibus test result, like the ANOVA $F$-test.

Terms

  • ANOVA

    Analysis of variance—a collection of statistical models used to analyze the differences between group means and their associated procedures (such as "variation" among and between groups).

  • Boole's inequality

    a result in probability theory stating that, for any finite or countable set of events, the probability that at least one of the events happens is no greater than the sum of the probabilities of the individual events

  • Bonferroni correction

    a method used to counteract the problem of multiple comparisons; considered the simplest and most conservative method to control the familywise error rate


Full Text

For hypothesis testing, the problem of comparing more than two means results from the increase in Type I error that occurs when statistical tests are used repeatedly. If $n$ independent comparisons are performed, the experiment-wide significance level $\bar{\alpha}$, also termed FWER for familywise error rate, is given by:

$\bar{\alpha} = 1 - (1 - \alpha_{\text{per comparison}})^n$

Hence, unless the tests are perfectly dependent, $\bar{\alpha}$ increases as the number of comparisons increases. If we do not assume that the comparisons are independent, then we can still say:

$\bar{\alpha} \le n \cdot \alpha_{\text{per comparison}}$
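As a quick numerical illustration (a minimal Python sketch, not part of the original text), the exact familywise rate for independent comparisons can be tabulated alongside the distribution-free Boole upper bound:

```python
# Familywise error rate (FWER) for n independent comparisons at a fixed
# per-comparison level alpha, next to the Boole upper bound n * alpha.
alpha = 0.05  # per-comparison significance level

for n in (1, 5, 10, 20):
    fwer_independent = 1 - (1 - alpha) ** n  # exact, assuming independence
    boole_bound = n * alpha                  # holds without independence
    print(f"n={n:2d}  FWER={fwer_independent:.3f}  bound={boole_bound:.2f}")
```

Already at 10 comparisons the chance of at least one false positive exceeds 40%, which is why a correction is needed.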

There are different ways to assure that the familywise error rate is at most $\bar{\alpha}$. The most conservative method, free of independence and distribution assumptions, is the Bonferroni correction $\alpha_{\text{per comparison}} = \frac{\bar{\alpha}}{n}$. A more sensitive correction can be obtained by solving the equation for the familywise error rate of $n$ independent comparisons for $\alpha_{\text{per comparison}}$.

This yields $\alpha_{\text{per comparison}} = 1 - (1 - \bar{\alpha})^{1/n}$, which is known as the Šidák correction. Another procedure is the Holm–Bonferroni method, which uniformly delivers more power than the simple Bonferroni correction by testing only the most extreme $p$-value ($i = 1$) against the strictest criterion, and the others ($i > 1$) against progressively less strict criteria.
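The three corrections can be sketched in Python as follows; the function names are illustrative, and the Holm procedure is written in its standard step-down form:

```python
def bonferroni_alpha(fwer, n):
    """Per-comparison level under the Bonferroni correction: fwer / n."""
    return fwer / n

def sidak_alpha(fwer, n):
    """Per-comparison level under the Sidak correction: 1 - (1 - fwer)^(1/n)."""
    return 1 - (1 - fwer) ** (1 / n)

def holm_reject(p_values, fwer=0.05):
    """Holm-Bonferroni step-down test; returns reject decisions in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, idx in enumerate(order):
        # rank 0 faces the strictest criterion fwer/m, rank 1 faces fwer/(m-1), ...
        if p_values[idx] <= fwer / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-significant test
    return reject
```

For example, `holm_reject([0.01, 0.04, 0.03, 0.005])` rejects the first and last hypotheses: 0.005 is tested against 0.05/4 and 0.01 against 0.05/3, but 0.03 fails against 0.05/2, stopping the procedure.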

Methods

Multiple testing correction refers to re-calculating probabilities obtained from a statistical test which was repeated multiple times. In order to retain a prescribed familywise error rate $\alpha$ in an analysis involving more than one comparison, the error rate for each comparison must be more stringent than $\alpha$. Boole's inequality implies that if each test is performed to have type I error rate $\alpha/n$, the total error rate will not exceed $\alpha$. This is called the Bonferroni correction and is one of the most commonly used approaches for multiple comparisons.
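In practice this means comparing each $p$-value against $\alpha/n$, or equivalently comparing the adjusted value $\min(n \cdot p, 1)$ against $\alpha$. A minimal sketch (the helper name is hypothetical):

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: min(n * p, 1) for each of the n tests."""
    n = len(p_values)
    return [min(n * p, 1.0) for p in p_values]

# Three hypothetical raw p-values from repeated tests:
raw = [0.004, 0.020, 0.300]
adjusted = bonferroni_adjust(raw)
# 0.004 * 3 = 0.012 stays below alpha = 0.05, so only the first
# hypothesis is still rejected after correction.
```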

Because simple techniques such as the Bonferroni method can be too conservative, there has been a great deal of attention paid to developing better techniques, such that the overall rate of false positives can be maintained without inflating the rate of false negatives unnecessarily. Such methods can be divided into general categories:

  • Methods where total alpha can be proved to never exceed 0.05 (or some other chosen value) under any conditions. These methods provide "strong" control against Type I error, in all conditions including when the null hypothesis is only partially correct.
  • Methods where total alpha can be proved not to exceed 0.05 except under certain defined conditions.
  • Methods which rely on an omnibus test before proceeding to multiple comparisons. Typically these methods require a significant ANOVA/Tukey's range test before proceeding to multiple comparisons. These methods have "weak" control of Type I error.
  • Empirical methods, which control the proportion of Type I errors adaptively, utilizing correlation and distribution characteristics of the observed data.

Post-Hoc Testing of ANOVA

Multiple comparison procedures are commonly used in an analysis of variance after obtaining a significant omnibus test result, like the ANOVA $F$-test. The significant ANOVA result suggests rejecting the global null hypothesis $H_0$ that the means are the same across the groups being compared. Multiple comparison procedures are then used to determine which means differ. In a one-way ANOVA involving $K$ group means, there are $\frac{K(K-1)}{2}$ pairwise comparisons.
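The pairwise-comparison count can be verified with a short sketch (the group labels are hypothetical):

```python
from itertools import combinations

# With K group means, a one-way ANOVA admits K*(K-1)/2 pairwise comparisons.
groups = ["A", "B", "C", "D"]  # K = 4 hypothetical treatment groups
pairs = list(combinations(groups, 2))

K = len(groups)
assert len(pairs) == K * (K - 1) // 2  # 6 pairwise comparisons for K = 4
```

Each of these pairs would be tested by the chosen post-hoc procedure, which is exactly why the per-comparison level must be tightened as $K$ grows.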


Except where noted, content and user contributions on this site are licensed under CC BY-SA 4.0 with attribution required.