## Abstract

A common approach to the analysis of experimental data across much of the biological sciences is test-qualified pooling. Here non-significant terms are dropped from a statistical model, effectively pooling the variation associated with each removed term with the error term used to test hypotheses (or estimate effect sizes). This pooling is only carried out if statistical testing on the basis of applying that data to a previous more complicated model provides motivation for this model simplification; hence the pooling is test-qualified. In pooling, the researcher increases the degrees of freedom of the error term with the aim of increasing statistical power to test their hypotheses of interest. Despite this approach being widely adopted and explicitly recommended by some of the most widely cited statistical textbooks aimed at biologists, here we argue that (except in highly specialized circumstances that we can identify) the hoped-for improvement in statistical power will be small or non-existent, and there is likely to be much reduced reliability of the statistical procedures through deviation of type I error rates from nominal levels. We thus call for greatly reduced use of test-qualified pooling across experimental biology, more careful justification of any use that continues, and a different philosophy for initial selection of statistical models in the light of this change in procedure.

## 1. Introduction

A common approach to the analysis of experimental data across disparate parts of the biological sciences is test-qualified pooling. A common manifestation of this approach can be summarized as follows: the researcher fits their data to a model that they select on the basis of the design of their study and the hypotheses they are interested in testing. After examining the significance of terms in the model that are not specifically related to the hypothesis currently under investigation, the researcher then removes non-significant terms from the model, and re-fits their data to this simplified model. That is, some terms were included in the original model not because they allow an interesting hypothesis to be tested but because (on the basis of the specifics of the experimental design allied to previous knowledge of the system) they were expected to explain substantial portions of the variation. If the data generated in this particular experiment do not suggest that one or more of these terms are strongly influential then they are dropped from the model, and further analysis is performed based on a simplified model. Such a simplification process is often seen as attractive in making presentation of results more compact, in highlighting more influential variables, and/or in increasing statistical power for exploring the significance of remaining terms. By simplifying the model in this way, the researcher is effectively *pooling* the variation associated with each removed term with the error term that will ultimately be used to test their hypotheses. This pooling is only carried out if statistical testing on the basis of applying that data to a previous more complicated model provides motivation for this approach, hence the pooling is *test-qualified*. In pooling, the researcher increases the degrees of freedom of the error term with the aim of increasing statistical power to test their hypotheses of interest. 
Although this approach is widely adopted and explicitly recommended by some works on data analysis (e.g. [1]), other influential authors have explicitly warned against it (e.g. [2]). Here, we want to offer some resolution of this apparent conflict in the literature, in order to help authors, reviewers, editors, and readers evaluate the consequences of pooling in different circumstances. Note that although we couch this discussion in terms of null-hypothesis statistical testing, the arguments transfer naturally to approaches based on estimation of effect size. Our discussion is, however, focused on the analysis of data from planned experiments rather than from purely observational studies. The costs and benefits of test-qualified pooling are more clear-cut for planned experiments where potential confounding factors can often be eliminated or controlled for by careful experimental design, removing the need to deal with these factors statistically. Also, planned experiments generally are of what is termed a ‘confirmatory’ nature, where the study specifically aims to test one or more hypotheses known from the outset. Observational studies more often have an ‘exploratory’ motivation involving measuring a broad range of variables and then seeking to rank them in terms of potential importance and influence. We return to these issues in the *Discussion*.

## 2. Being clear what pooling is and why you might want to do it

To clarify the issues we consider a specific example. You are interested in the effect of an experimental treatment (a new humidification system) on the growth of individually potted tomato plants. Your experiment will be conducted in 10 small greenhouses at your research station, and the nature of the treatment means that it has to be applied to whole greenhouses. You install the humidification system in five (randomly selected) greenhouses, leaving the other five as controls, and you assay the growth of 40 tomato plants in each greenhouse. In this design the greenhouse is the experimental unit, and any hypothesis test of the treatment should use an error based on the variation among greenhouses rather than variation among the individual plants. In this case the simplest means of analysis would be to calculate a mean growth rate across the 40 plants in each greenhouse and carry out a one-way ANOVA using these 10 independent data points.
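
As a minimal sketch of this simplest analysis, the Python below collapses each greenhouse to its mean and computes the one-way ANOVA *F*-ratio from those 10 means. The function name, treatment labels, and growth values are hypothetical and are generated only to make the example runnable; they are not data from any real experiment.

```python
import random
from statistics import fmean


def oneway_anova_on_unit_means(units):
    """One-way ANOVA that treats each greenhouse mean as one data point.

    units: list of (treatment_label, [plant measurements]), one per greenhouse.
    Returns (F, df_treatment, df_error).
    """
    # Collapse every greenhouse (the experimental unit) to a single mean.
    means_by_treatment = {}
    for treatment, plants in units:
        means_by_treatment.setdefault(treatment, []).append(fmean(plants))

    all_means = [m for ms in means_by_treatment.values() for m in ms]
    grand = fmean(all_means)

    # Between-treatment and among-greenhouse (within-treatment) sums of squares.
    ss_between = sum(len(ms) * (fmean(ms) - grand) ** 2
                     for ms in means_by_treatment.values())
    ss_within = 0.0
    for ms in means_by_treatment.values():
        centre = fmean(ms)
        ss_within += sum((m - centre) ** 2 for m in ms)

    df_between = len(means_by_treatment) - 1
    df_within = len(all_means) - len(means_by_treatment)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical data: 10 greenhouses, 40 plants each, no true treatment effect.
rng = random.Random(0)
units = [("humidified" if g < 5 else "control",
          [rng.gauss(10.0, 2.0) for _ in range(40)])
         for g in range(10)]

f_stat, df1, df2 = oneway_anova_on_unit_means(units)
print(f"F = {f_stat:.3f} on {df1} and {df2} d.f.")
```

However many plants are measured per greenhouse, the error term in this analysis has only 8 d.f., because it is built from the 10 greenhouse means.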

However, as a thought experiment, suppose that we somehow knew for a fact that growth conditions (in the absence of our treatment manipulation) were absolutely identical among our greenhouses. In this imaginary situation we might argue that, since greenhouse-to-greenhouse variation is not confounded with any treatment effect, we can use the growth measures from the individual plants as independent data points in our analysis. This will result in a substantial increase in our degrees of freedom, and consequently in our statistical power to detect treatment effects. Of course, in reality we cannot usually know with certainty whether our greenhouses vary, and this has led to the development of methods for test-qualified pooling. In this case, we would start by fitting the nested model defined by the design of our study (with individual plants being nested within greenhouse). This would include the treatment term, a nested term for the variation among greenhouses in the same treatment group, and a second error term corresponding to the variation among plants in the same greenhouse. The key to test-qualified pooling is that the set of data itself influences the nature of the analyses performed on it. If initial analysis of the full model indicates substantial variation among greenhouses, then the significance of the treatment term is tested using the variation among greenhouses as its error term with 8 d.f. However, if there is no evidence of substantial greenhouse-to-greenhouse variation in this initial analysis, then the among-greenhouse and the true error variations are pooled, and this combined error term with 398 d.f. is then used to provide a test of the treatment effect that is expected to benefit from higher statistical power (see [3–5] for commonly cited texts that recommend this approach).
The justification that advocates of test-qualified pooling give for this approach is that in the absence of any greenhouse effect, the among-greenhouse and the within-greenhouse error terms are both estimating the same thing, and so by combining them we get a better estimate than we would estimating the two separately.
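
The decision rule just described can be sketched in Python as follows. This is an illustration under stated assumptions, not a recommended procedure or anyone's published code: the design is assumed balanced, the pooling threshold `alpha_pool` and all data values are hypothetical, and the *F*-distribution tail probability is computed with a small standard-library helper (`f_sf`, our own name) based on the usual continued-fraction expansion of the incomplete beta function.

```python
import math
import random
from statistics import fmean


def _betacf(a, b, x, max_iter=300, eps=1e-12, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def f_sf(f, d1, d2):
    """Upper-tail probability P(F > f) for an F distribution on (d1, d2) d.f."""
    if f <= 0.0:
        return 1.0
    return _betainc(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f))


def nested_anova_with_pooling(units, alpha_pool=0.05):
    """units: list of (treatment, [plant values]); balanced design assumed."""
    n = len(units[0][1])                                  # plants per greenhouse
    gh_means = [(t, fmean(ys)) for t, ys in units]
    treatments = sorted({t for t, _ in units})
    t_means = {t: fmean([m for tt, m in gh_means if tt == t]) for t in treatments}
    grand = fmean([m for _, m in gh_means])

    ss_treat = sum(n * (t_means[t] - grand) ** 2 for t, _ in gh_means)
    ss_gh = sum(n * (m - t_means[t]) ** 2 for t, m in gh_means)
    ss_plant = 0.0
    for _, ys in units:
        centre = fmean(ys)
        ss_plant += sum((y - centre) ** 2 for y in ys)

    df_treat = len(treatments) - 1
    df_gh = len(units) - len(treatments)                  # 8 in the example
    df_plant = len(units) * (n - 1)                       # 390 in the example
    ms_gh, ms_plant = ss_gh / df_gh, ss_plant / df_plant

    # Pre-test: is there evidence of among-greenhouse variation?
    p_gh = f_sf(ms_gh / ms_plant, df_gh, df_plant)
    pooled = p_gh > alpha_pool
    if pooled:                                            # pooled error, 398 d.f.
        df_err = df_gh + df_plant
        ms_err = (ss_gh + ss_plant) / df_err
    else:                                                 # among-greenhouse error, 8 d.f.
        df_err, ms_err = df_gh, ms_gh
    f_treat = (ss_treat / df_treat) / ms_err
    return {"pooled": pooled, "p_greenhouse": p_gh, "df_error": df_err,
            "F_treatment": f_treat, "p_treatment": f_sf(f_treat, df_treat, df_err)}


# Hypothetical data: 10 greenhouses, 40 plants each, modest greenhouse variation.
rng = random.Random(42)
units = []
for g in range(10):
    shift = rng.gauss(0.0, 0.5)
    units.append(("humidified" if g < 5 else "control",
                  [10.0 + shift + rng.gauss(0.0, 2.0) for _ in range(40)]))
result = nested_anova_with_pooling(units)
print(result)
```

Note that the error degrees of freedom reported for the treatment test depend on the outcome of the pre-test, which is exactly the data-dependence at issue in the rest of this paper.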

However, pooling is not limited to nested designs. Continuing with tomatoes and greenhouses, you now want to compare the effects of four different growing media in individually potted tomato plants rather than the effect of humidity. To gain a sufficient sample size for the experiment you have to use three different greenhouses to keep all the plants, but because your treatments can now be applied randomly to individual plants, you randomly allocate equal numbers of plants to each treatment in each greenhouse leading to a randomized block design (with specific greenhouse identity as the blocking factor, with three levels). The statistical model implied by this design would include terms for both treatment applied to a plant and the specific greenhouse a plant was kept in, as well as a treatment-by-greenhouse interaction and an among-plant error term based on the variation among individual plants within the same treatment-greenhouse combination. Depending on the exact hypothesis we wish to test, the appropriate error term for our treatment effect will be either the interaction term, or the among-plant error term [6], but in either case, if the interaction term is not significant, we might choose to pool its variation with the among-plant error term prior to testing the treatment effect. Similarly, we might then decide that if the greenhouse term is also non-significant, we would add that source of variation and its associated degrees of freedom to our error pool. In either case, we would be carrying out test-qualified pooling.
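
To make the degrees-of-freedom bookkeeping in this design concrete, the short Python sketch below tabulates the d.f. of each term and of the error pool after successive rounds of pooling. The number of plants per treatment-greenhouse combination is not specified in our example, so the value of 10 used here is purely an assumption for illustration, as are the function names.

```python
def block_design_df(n_treatments, n_blocks, plants_per_cell):
    """Degrees of freedom for a balanced randomized block design with replication."""
    return {
        "treatment": n_treatments - 1,
        "greenhouse": n_blocks - 1,
        "interaction": (n_treatments - 1) * (n_blocks - 1),
        "among_plant_error": n_treatments * n_blocks * (plants_per_cell - 1),
    }


def pooled_error_df(df, pooled_terms):
    """Error d.f. after adding the listed terms to the among-plant error pool."""
    return df["among_plant_error"] + sum(df[term] for term in pooled_terms)


# Four growing media, three greenhouses; 10 plants per cell is an assumed value.
df = block_design_df(n_treatments=4, n_blocks=3, plants_per_cell=10)
print(df)
# -> {'treatment': 3, 'greenhouse': 2, 'interaction': 6, 'among_plant_error': 108}

print(pooled_error_df(df, []))                             # no pooling -> 108
print(pooled_error_df(df, ["interaction"]))                # interaction pooled -> 114
print(pooled_error_df(df, ["interaction", "greenhouse"]))  # both pooled -> 116
```

With replication at this level, the among-plant error term already carries most of the available degrees of freedom, so each round of pooling adds comparatively few.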

In another form of pooling, the initial test that determines whether pooling is used is entirely separate from the model testing the hypotheses of interest. To illustrate this, we return to the experiment above comparing the effects of four different growing media on individually potted tomato plants. Imagine that, because of a change of supplier at your institute, you ended up using two different but broadly similar types of pots to grow the tomatoes in. Plants are randomized to pot type as well as to growth medium and greenhouse. You really do not expect type of pot to influence growth rates, but just to be careful you first of all perform a *t*-test comparing growth rates across the two types of pot. Your plan is that if (as you expect) this *t*-test reveals no evidence of a difference, you report this and use this test as justification for pooling data across the two pot types in your subsequent analyses. However, if it does reveal evidence of a difference, then you will either add pot type as a factor in subsequent analyses or carry out separate analyses for the two types of pot. Again, there is the potential for pooling driven by the results of a pre-test, so this scenario is another manifestation of test-qualified pooling.

## 3. Why is test-qualified pooling controversial?

The case against pooling was made most forcefully and explicitly in the biological literature by Stuart Hurlbert, primarily in relation to its use in nested designs [2]. Hurlbert coined the expression *pseudoreplication* for the situation where authors treat data-points that are not independent as if they were independent in their data analysis. His original paper on this [7] has been cited over 6000 times and has been hugely influential in the design of data collection and the analysis of data spanning all of biology. Hurlbert considers the pooling of errors in a nested analysis to be a form of pseudoreplication, a form that he calls *test-qualified sacrificial pseudoreplication*. He argues that pooling biases *p*-values downwards and biases confidence intervals towards being too narrow. He further argues that demanding a higher *p*-value than 0.05 in the initial test before pooling (a process often called ‘sometimes pooling’) reduces but does not eliminate these problems. An analogous argument can be made against pooling interaction terms with error terms when analysing randomized block designs [6]. However, even in situations where pooling might not be regarded as analogous to pseudoreplication (e.g. pooling an interaction between two fixed factors prior to testing the main effects), type I error rates can be increased (as we will see below). Despite this, pooling is still regularly practiced, and is recommended in influential statistics textbooks aimed at biologists (e.g. [3–5]) and research papers on statistical methodology (e.g. [1,8]). In the next section we argue that both philosophically and pragmatically there are strong arguments for siding with Hurlbert.

## 4. The philosophy and pragmatics of pooling

The two main philosophical arguments against pooling are well articulated by Newman *et al*. [6], and can be explained in the context of our greenhouses and growth media example. Firstly, if we use pooling, then the way that we test for an effect of growth medium becomes conditional on the data, but that conditionality is not acknowledged in the associated *p*-values. That is, whether we test the effect of medium in a model with or without a *greenhouse* term will be determined by the data. Philosophically, *p*-values are probabilities based on a very large number of notional replicates of exactly the experiment under investigation. So imagine that we repeat the full experiment and analysis of the resulting data again and again. In replicates of this experiment, if we adopt a test-qualified pooling approach then sometimes the analysis will test the main hypothesis one way and sometimes the other. Each form of the analysis will be implemented only for a specific subset of replicate experiments, determined by the patterns of data in each replicate. Importantly, this subset is a biased sample of all the possible replicate experiments in terms of properties of the sample. Yet the test is predicated on the assumption that it is applied to data from an experiment drawn without bias from the population of all possible replicates of this experiment. It is this mismatch that leads to lack of control of type I error rates and of confidence interval coverage. Secondly, by pooling (no matter what critical value we compare the calculated *p*-value against) we are accepting that the null hypothesis of no effect of *greenhouse* is true, whereas the whole philosophy of null-hypothesis statistical testing is that the null hypothesis is never accepted as true; rather, we either reject it or find that we do not have sufficient grounds to reject it. Thus, from a purist philosophical perspective pooling should not be recommended.

We next ask whether there is a pragmatic argument that, although pooling has some less-than-ideal properties, the resulting misbehaviours are relatively mild and are sometimes outweighed by the benefits of pooling (enhanced power). There is no underlying theory to give general and definitive answers to this question; all we have to go on are a number of numerical explorations of specific cases. However, the consensus in this literature is that (i) pooling can cause actual type I error rates to be very different from the nominal value, and (ii) there is no consistent and substantial increase in power to compensate. Walde-Tsadik & Afifi [9] explored the effect of always pooling when one factor is associated with a *p*-value above 0.05, and also of ‘sometimes pooling’ when the required critical value was higher than 0.05 in two-way ANOVA random effects models. They found that both procedures very rarely offered adequate control of type I error rates and even less commonly led to significant improvement in power to test for an effect of the other factor. Hines [10] performed extensive simulations and concluded that for multifactorial ANOVA ‘the conditions for pooling to be even potentially rewarding are more restrictive than might be expected, and power improvements are generally lower’. Janky [11] performed a similar analysis of split-plot designs and concluded that ‘pooling generally inflates type I error and offers at best insubstantial gain in power (and often power loss) relative to the nominal test’. Even when using a conservative ‘sometimes pooling’ value of *α* = 0.35 to trigger pooling, Janky found the type I error rate in subsequent tests on pooled data rose from the nominal 5% to generally somewhere between 7% and 11%.
This study was interesting for highlighting that pooling actually led to a reduction of power more often than it led to a substantial gain in power; this occurs because the increase in inherent variation caused by pooling dominates any effect of increased degrees of freedom devoted to exploring remaining factors. Figure 1 shows examples of deviations in both directions from the nominal 5% level for type I error rates generated by simulations of our whole-greenhouse-treatment thought experiment. In exploring our model we found that small changes in parameter values could lead to a substantial change in the magnitude and direction of deviations from the nominal level. It is difficult to make generalizations about the circumstances under which deviations will be strongest. In common with the other studies discussed directly above, we found that the direction and magnitude of deviations are driven by a complex interaction between structure of the experimental design, aspects of the shape of the underlying ‘population’ from which sample values are obtained, and sample sizes. Also, as the highest line in figure 1 illustrates, relationships with parameter values can be non-monotonic.
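
For readers who wish to explore such deviations themselves, the following Python sketch simulates the whole-greenhouse-treatment thought experiment under a true null hypothesis and compares the type I error rate of a ‘never pool’ analysis with that of test-qualified pooling. It is not the code used to generate figure 1: the normal distributions, the parameter values, the pooling threshold `alpha_pool`, and all function names are assumptions chosen only for illustration.

```python
import math
import random
from statistics import fmean


def _betacf(a, b, x, max_iter=300, eps=1e-12, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def _betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def f_sf(f, d1, d2):
    """Upper-tail probability P(F > f) for an F distribution on (d1, d2) d.f."""
    if f <= 0.0:
        return 1.0
    return _betainc(d2 / 2.0, d1 / 2.0, d2 / (d2 + d1 * f))


def type_i_error_rates(sd_greenhouse, n_reps=1000, alpha_pool=0.25, alpha=0.05,
                       n_greenhouses=10, n_plants=40, sd_plant=1.0, seed=1):
    """Estimate type I error rates with no true treatment effect (balanced design)."""
    rng = random.Random(seed)
    half = n_greenhouses // 2            # n_greenhouses is assumed to be even
    df_gh = n_greenhouses - 2
    df_plant = n_greenhouses * (n_plants - 1)
    hits = {"never_pool": 0, "test_qualified": 0}
    for _ in range(n_reps):
        means, ss_plant = [], 0.0
        for _g in range(n_greenhouses):
            shift = rng.gauss(0.0, sd_greenhouse)       # greenhouse effect only
            ys = [shift + rng.gauss(0.0, sd_plant) for _p in range(n_plants)]
            centre = fmean(ys)
            means.append(centre)
            ss_plant += sum((y - centre) ** 2 for y in ys)
        # The first half of the greenhouses is labelled 'treated', the rest 'control'.
        t_means = [fmean(means[:half])] * half + [fmean(means[half:])] * half
        grand = fmean(means)
        ss_treat = n_plants * sum((t - grand) ** 2 for t in t_means)
        ss_gh = n_plants * sum((m - t) ** 2 for m, t in zip(means, t_means))
        ms_gh, ms_plant = ss_gh / df_gh, ss_plant / df_plant

        p_never = f_sf(ss_treat / ms_gh, 1, df_gh)      # always use 8 d.f. error
        if f_sf(ms_gh / ms_plant, df_gh, df_plant) > alpha_pool:
            df_err = df_gh + df_plant                   # pre-test passed: pool
            p_tq = f_sf(ss_treat / ((ss_gh + ss_plant) / df_err), 1, df_err)
        else:
            p_tq = p_never
        hits["never_pool"] += p_never < alpha
        hits["test_qualified"] += p_tq < alpha
    return {name: count / n_reps for name, count in hits.items()}


for sd in (0.0, 0.2):
    print(f"sd_greenhouse = {sd}: {type_i_error_rates(sd)}")
```

Estimates from a run of this size carry appreciable Monte Carlo error, so any comparison between the two procedures should be based on many more replicates and a wider range of parameter values than are used here.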

## 5. Discussion and conclusion

Test-qualified pooling is widely adopted, but its prevalence across the biological sciences is patchy. For example, it is much less common in clinical trials, where statistical analyses often have to be specified when the trial is pre-registered, and scope for flexibility in data analysis is thus reduced. Test-qualified pooling is also relatively uncommon in the agricultural sciences, where particular designs and modes of analysis that avoid issues of pooling are traditional, and where the commonly used statistical software package *Genstat* is particularly suited to forms of analysis that avoid test-qualified pooling.

We do still consider that test-qualified pooling is over-used in biology. Put simply, in ‘confirmatory studies’ based on designed experiments where we aim to test specific hypotheses (or estimate specific effect sizes) we do not recommend pooling under almost any circumstances. The often-modest expected increases in power from pooling do not make it an attractive option when its drawbacks are taken into account. Apart from statistical power, the other attraction of pooling is simplification of the presentation of results, but we feel that this will never be sufficient grounds for justifying the process. We would only recommend pooling in such a study if the decision to consider test-qualified pooling was made on the basis of a prior simulation study that aimed at evaluating the consequences of pooling for type I and type II error rates. We have yet to see an example of a study that provided such a justification for pooling.

As we mentioned in the *Introduction*, it is not as easy to offer clear and simple guidance on pooling in purely observational studies, and in studies where the researchers' aims are more focused on exploration or prediction than on testing specific hypotheses. However, in such situations pooling can be seen as a facet of *model selection*, which is an area of considerable activity in applied statistics. A particularly useful introduction to the concepts involved is that of Chatfield [12]. He makes the point that if the same dataset is used both to select the most appropriate model from a suite of alternatives and to fit that model, then the interpretation of the fitted model should be quite different from circumstances where the form of the model is decided upon first and only then is the data applied to fit that model. Where there is uncertainty as to the most appropriate model, there are methodological developments in *model averaging* that can acknowledge this ([13] and [14] offer good introductions for the biologist). A failure to properly acknowledge model uncertainty when the same data is used to select and fit the model can lead to very unreliable inferences [12,15,16].

Despite the complexity of the literature on model selection and model uncertainty, we feel that we can offer a general opinion on the utility of test-qualified pooling outside designed experiments. For more exploratory studies, where the intention is to identify factors that might be of interest rather than to test specific hypotheses, test-qualified pooling might be more attractive, since researchers may be willing to live with loss of control of type I error rates if this helps boost their statistical power to flag up factors of interest. That is, they may be prepared to suffer higher rates of false positives to boost their likelihood of detecting real effects. We expect that these power gains may sometimes be considerable for nested designs. However, for other types of design the literature discussed in the last section should serve as a caution that power gains from pooling may be small or non-existent. Our view is that even in exploratory studies, test-qualified pooling cannot really be recommended except perhaps where the design is nested and where the size of the experiment was reduced from its ideal size by practical constraints or unforeseen adverse circumstances.

Where does this leave the experimenter in our tomato plant example who just wanted to be diligent and reassure themselves and their readers that there was no effect due to two different types of pots being used? They have to make a decision about how important this check is to them. If they feel that it is worth investing a few degrees of freedom in, then they should include *type-of-pot* as a factor in their analysis and pay a modest cost in reduced power to test the hypothesis (comparing different growth media) that they are really interested in. Alternatively, they may decide that careful experimental design and explanation of that experimental design should allay concerns about differential effects of pot types sufficiently that there is no need for formal statistical testing. More generally, we all have to accept that there are no free statistical analyses, and think hard about which factors to include in any model. This is analogous to the decision to block on a given variable in experimental design. It is only advantageous to block on variables that explain a substantial fraction of variation between experimental units; otherwise the degrees of freedom lost in including that blocking term are not compensated for by effective partitioning of variation into error and other terms.

Sometimes we can make a strong enough case based on careful experimental design (especially use of randomization), biological intuition, and logical reasoning for why we can safely assume that some potentially influential factors are in fact very unlikely to be important in our study, and so we omit them from our statistical procedures. In fact, we do this all the time. In our example, the researcher felt no need to test whether growth was affected by which shelf in the greenhouse a pot was placed on, which side of the greenhouse it was on, or how near it was to the greenhouse door. Sometimes we will feel that we cannot make a sufficiently strong case this way, and we should then include that factor in our model and explore its effects statistically. As so often in the design and analysis of scientific studies, there are no black-and-white rules for which factors to include in your statistical model; we need to think hard about it and justify our choices in terms of experimental design, understanding of underlying biology, and logical reasoning. This should be good news: model selection should be much more about biology than about mathematics and probability theory, and biology is what we are interested in.

## Authors' contributions

This article was conceived, developed, and written equally by both authors.

## Competing interests

We declare we have no competing interests.

## Funding

We received no funding for this study.

## Acknowledgement

We thank Gavin Gibson and three anonymous reviewers for perceptive comments.

- Received August 23, 2016.
- Accepted February 14, 2017.

- © 2017 The Author(s)

Published by the Royal Society. All rights reserved.