## Abstract

Primary results from the Farm Scale Evaluations (FSEs) of spring-sown genetically modified herbicide-tolerant crops were published in 2003. We provide a statistical assessment of the results for count data, addressing issues of sample size (*n*), efficiency, power, statistical significance, variability and model selection. Treatment effects were consistent between rare and abundant species. Coefficients of variation averaged 73% but varied widely. High variability in vegetation indicators was usually offset by large *n* and treatment effects, whilst invertebrate indicators often had smaller *n* and lower variability; overall, achieved power was broadly consistent across indicators. Inferences about treatment effects were robust to model misspecification, justifying the statistical model adopted. As expected, increases in *n* would improve detectability of effects whilst, for example, halving *n* would have resulted in a loss of significant results of about the same order. 40% of the 531 published analyses had greater than 80% power to detect a 1.5-fold effect; reducing *n* by one-third would most likely halve the number of analyses meeting this criterion. Overall, the data collected vindicated the initial statistical power analysis and the planned replication. The FSEs provide a valuable database of variability and estimates of power under various sample size scenarios to aid planning of more efficient future studies.

## 1. Introduction

The Farm Scale Evaluations (FSEs) of spring-sown genetically modified herbicide-tolerant (GMHT) crops were conducted in the UK from 2000 to 2002 (Firbank *et al*. 1999, 2003*a*,*b*). The effects of the management regimes associated with conventional and genetically modified beet (*Beta vulgaris* L.), maize (*Zea mays* L.) and oilseed rape (*Brassica napus* L.) crops on weed plant and invertebrate indicators within fields and in field margins were compared. Each crop was treated as a different experiment. The first results were published in October 2003 for vegetation (Heard *et al*. 2003), soil-surface-active invertebrates (Brooks *et al*. 2003), epigeal and aerial arthropods (Haughton *et al*. 2003), field boundary invertebrates and vegetation (Roy *et al*. 2003), and plant and invertebrate trophic groups (Hawes *et al*. 2003).

The design considerations and statistical methods developed for the FSEs are described in detail elsewhere (Rothery *et al*. 2002, 2003; Perry *et al*. 2003). Briefly, each experiment comprised a randomized block design, with whole fields as blocks and with the treatment (conventional or GMHT) replicated once on half-field units in each field. The primary concerns were with tests of the null hypothesis of no difference in abundance, measured as counts of individuals in each half-field, between the GMHT and conventional treatments, and with estimates of treatment effects.

The FSEs were unusual in at least three ways. First, the FSEs cost £6 m (about £0.5 m crop^{−1} yr^{−1}), much more than most ecological experiments, corresponding in total to 24 standard research grants (Crawley 2003). Second, they were highly controversial and attracted intense examination because of the public concern over genetic modification (Perry 2003). In response, the research proposed by the contractors was overseen by a Scientific Steering Committee that closely scrutinized the planned design and analysis, which became the subject of considerable discussion and further research (Perry *et al*. 2003).

Third, Firbank *et al*. (2003*b*) and others (Lawton 2003; May 2003; Webb 2003; Pollock 2004) have emphasized the prime importance of the FSE database as a source of baseline measurements of the abundance of biodiversity to inform changes in policy for British agriculture. Research is now showing how biodiversity can be enhanced in arable landscapes by the manipulation of farming systems (Dewar *et al*. 2003) and their adjacent field margins (Sotherton 1991), and there is a perceived need to restore the balance between agricultural production and wildlife.

For each of the above reasons, it is fair to ask whether the effort that went into planning was justified, whether the original assumptions were vindicated, whether more sites were required or fewer could have been used, whether the analysis of the FSEs was efficient, and how estimates of variability might be used to inform the design of future similar studies. This paper is intended to provide answers to these and similar questions.

## 2. Background

The statistical power of a significance test is the probability of rejecting the null hypothesis when some given alternative hypothesis is true. Prior to publication of the results, a statistical power analysis (Perry *et al*. 2003) had suggested that the planned replication of around 60 fields per crop over 3 years would be sufficient to provide useful information from which valid statistical inferences could be drawn. Specifically, it indicated that a sample size of *n*=60 fields should have provided adequate power (more than 80%) to detect multiplicative differences of *R*=1.5-fold, for a given biological indicator, so long as its coefficient of variation (CV) did not exceed 50% and its mean abundance exceeded 5.0. Power was estimated over scenarios that encompassed a range of treatment differences, number of fields and degrees of random variability, both for a standard log-Normal model, based on a Normal distribution of logarithmically transformed counts, and also for an extended negative binomial model developed to be more realistic for the count data, particularly for small abundances. For the extended model the variance (*V*) of the count was assumed to be related to the mean count (*μ*) through a power law (Taylor 1961) with parameters *α* and *β*, i.e. *V*=*αμ*^{β}. The mean count, *μ*_{ij}, for treatment *i* in field *j* was given by ln *μ*_{ij}=*γ*+*F*_{j}+*t*_{i}, where *γ* is the logarithm of the overall mean count (*γ*=ln[*μ*], *μ*=e^{γ}), *F*_{j} is a field effect and *t*_{i} is a treatment effect. Note that treatment and field effects were, therefore, multiplicative on the natural count scale (*μ*_{ij}=*μ* exp[*F*_{j}]exp[*t*_{i}]).

The model was used to simulate count data to estimate power for detecting multiplicative differences *R*=1.3, 1.5 and 2, using sample sizes *n*=20, 30, 40, 60 and 90, with mean counts *μ*=1, 5, 10 and 100, field effects which varied over a 100-fold range, *β*=1.0, 1.5 and 2, and values of *α* chosen to achieve coefficients of variation on the natural scale (CV) of 50, 80 and 100%. The power of the Monte Carlo paired randomization test (two-tailed test at the 5% significance level) (Manly 1994) was estimated using 10^{5} sets (500 repetitions of each of 199 randomized sets plus the original data) of simulated data, for each of 12 combinations of the model parameters. Three test-statistics were examined, reflecting three forms of variance–mean relationships defined through *β*: *d*, the mean of the differences between the two treatments on a logarithmic scale (*β*=2); *r*, the logarithm of the ratio of the overall arithmetic means of the two treatments (*β*=1); and *d*_{w}, a weighted version of *d* with weights based on the approximate variance of the difference in logarithmically transformed counts, assuming *β*=1.5.
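As a rough illustration of how values of *α* might be chosen to hit a target CV, the sketch below inverts the power law *V*=*αμ*^{β}. It assumes that CV means √*V*/*μ* evaluated at the overall mean *μ*, ignoring field effects; the power analysis may have defined it differently. The function names and the full scenario grid are hypothetical (the simulations above used only 12 combinations).

```python
def alpha_for_cv(cv, mu, beta):
    """Return alpha such that V = alpha * mu**beta gives sqrt(V) / mu == cv.

    Assumption: CV is taken at the overall mean mu, with no field effect.
    """
    return cv ** 2 * mu ** (2.0 - beta)


def scenario_grid(cvs=(0.5, 0.8, 1.0), mus=(1, 5, 10, 100), betas=(1.0, 1.5, 2.0)):
    """Enumerate (beta, mu, cv, alpha) over a hypothetical full grid."""
    rows = []
    for beta in betas:
        for mu in mus:
            for cv in cvs:
                rows.append((beta, mu, cv, alpha_for_cv(cv, mu, beta)))
    return rows


if __name__ == "__main__":
    for beta, mu, cv, alpha in scenario_grid():
        print(f"beta={beta:.1f} mu={mu:>3} CV={cv:.0%} alpha={alpha:.3f}")
```

Under this assumption, *β*=2 gives *α*=CV² whatever the mean, whereas for *β*<2 the value of *α* needed for a fixed CV grows with *μ*.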

The analysis reported in the five FSE data papers was a standard randomized block ANOVA. Prior to analysis the total count, *c*_{ij}, per half-field for treatment *i* in field *j* was transformed to *l*_{ij}=log(*c*_{ij}+1), after inspection of residuals had suggested that the standard log-Normal model with *β*=2 provided an adequate model. The realized sample size, *n*, was the number of fields remaining after excluding those with missing values, and those for which the total whole-field count was zero or one. The null hypothesis was tested with a Monte Carlo paired randomization test using the test-statistic *d*=Σ_{j}(*l*_{2j}−*l*_{1j})/*n*, the mean over fields of the difference between the transformed GMHT (*i*=2) and conventional (*i*=1) counts, where *n* is the number of fields in the analysis, with *p*-values estimated from 999 random permutations. Treatment effects were estimated by the multiplicative ratio (GMHT/conventional), calculated as *R*=10^{d}.
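The test described above can be sketched as follows. This is a minimal illustration, not the code used for the published analyses: it assumes that *d* is the mean within-field difference of log_{10}(count+1), that each random permutation swaps the two treatment labels within a field with probability one-half, and that the observed data count as one of the randomized sets. The function name and the example counts are hypothetical.

```python
import math
import random


def paired_randomization_test(conv, gmht, n_perm=999, seed=0):
    """Two-tailed Monte Carlo paired randomization test on log10(count + 1).

    conv, gmht: half-field counts per field, in the same field order.
    Returns (n, d, R, p) where R = 10**d estimates the GMHT/conventional ratio.
    """
    rng = random.Random(seed)
    # Exclude fields whose whole-field total count is zero or one.
    pairs = [(c, g) for c, g in zip(conv, gmht) if c + g > 1]
    diffs = [math.log10(g + 1) - math.log10(c + 1) for c, g in pairs]
    n = len(diffs)
    d_obs = sum(diffs) / n
    extreme = 1  # the observed arrangement is counted as one randomized set
    for _ in range(n_perm):
        # Swapping the labels within a field flips the sign of its difference.
        d_perm = sum(x if rng.random() < 0.5 else -x for x in diffs) / n
        if abs(d_perm) >= abs(d_obs) - 1e-12:
            extreme += 1
    return n, d_obs, 10 ** d_obs, extreme / (n_perm + 1)


if __name__ == "__main__":
    conventional = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120]
    gm = [2 * c for c in conventional]
    print(paired_randomization_test(conventional, gm))
```

With 999 permutations plus the observed data, the smallest attainable *p*-value is 0.001.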

In addition to the published analysis that assumed *β*=2, two other multiplicative models were fitted which were similar, except they made different assumptions about the relationship between variance and mean expressed through *β*. One was a standard generalized linear model (GLM; McCullagh & Nelder 1989) with logarithmic link and Poisson error distribution (*β*=1); the other was a GLM with logarithmic link and power law variance function (*β*=1.5).

What follows is a statistical assessment of the FSE results for count data for spring-sown crops as published in Heard *et al*. (2003), Brooks *et al*. (2003), Haughton *et al*. (2003), Roy *et al*. (2003) and Hawes *et al*. (2003), focusing on estimates of variability and their effect on realized power. Many of the 531 biological indicators tested in those papers were pre-selected on the basis of taxonomic groups, but do not form a random sample because the other criteria for inclusion were mean abundance, and, to a lesser extent, the ecological importance of the test result. Results for other data types (plant biomass, crop canopy, height, etc.) and follow-up samples taken in the two subsequent cropping years are not considered here.

## 3. Methods

We study the relationships amongst statistical significance, sample size and treatment effect; estimate the actual value of *β* and various measures of variability; compare the performance of different statistical models for *β*, and the three test-statistics *d*, *r* and *d*_{w}; investigate the effects of increasing/reducing the sample size of the FSEs on the realized significance levels; estimate the realized power, and compare it to power estimates for different possible future values of *n*; and estimate the sample size required to achieve 80 and 90% power in a given percentage of analyses of the measured biological indicators.

### (a) Relationships amongst significance level, sample size and treatment effect

A volcano plot (−log(*p*) versus log(*R*), Jin *et al*. 2001) allowed an assessment of the frequency of significant results achieved for various sizes of estimated treatment effect, particularly those greater than 1.5-fold identified by Perry *et al*. (2003) as effects the FSEs had sought to detect with relatively high frequency. Small values of realized sample size, *n*, occurred when a biological indicator was relatively rare, so a scatter plot of log(*R*) versus *n* allowed an appraisal of whether treatment effects were consistent between abundant and rare species or groups.

### (b) Estimation of *β*

For each of the three fitted multiplicative models described above, the true but unknown value of *β* was estimated as *β*_{est} from the regression coefficient (*b*) in a linear regression of the logarithm of the absolute standardized residuals on the logarithmically transformed fitted values, i.e. *β*_{est}=2*b*+*β* (Carroll & Ruppert 1988). A combined estimate *β*_{o} was then calculated from the three estimates using linear interpolation to find the value of *β* for which *β*_{est}=*β*, i.e. the value for which the regression coefficient *b* in the residual plot is zero (*cf* Perry 1987).
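The interpolation step can be illustrated with a short sketch. It assumes that the regression slope *b* has already been obtained for each assumed *β*, and that, when no sign change in *b* brackets the root, the nearest pair of models is extrapolated linearly; that second rule is an assumption made here for completeness, not something stated in the text. The function name and the slopes in the example are hypothetical.

```python
def combined_beta(assumed_betas, slopes):
    """Find the beta at which the residual-regression slope b is zero.

    For each fitted model, beta_est = 2 * b + beta, so beta_est == beta
    exactly when b == 0. Uses linear interpolation between the two models
    whose slopes bracket zero; otherwise (assumption) extrapolates from the
    pair of adjacent models closest to zero slope.
    """
    pts = sorted(zip(assumed_betas, slopes))
    for beta, b in pts:
        if b == 0:
            return beta
    for (beta_lo, b_lo), (beta_hi, b_hi) in zip(pts, pts[1:]):
        if b_lo * b_hi < 0:
            return beta_lo - b_lo * (beta_hi - beta_lo) / (b_hi - b_lo)
    # No sign change: extrapolate from the end nearer to b == 0.
    if abs(pts[0][1]) < abs(pts[-1][1]):
        (beta_lo, b_lo), (beta_hi, b_hi) = pts[0], pts[1]
    else:
        (beta_lo, b_lo), (beta_hi, b_hi) = pts[-2], pts[-1]
    if b_hi == b_lo:
        raise ValueError("slopes are identical; cannot extrapolate")
    return beta_lo - b_lo * (beta_hi - beta_lo) / (b_hi - b_lo)


if __name__ == "__main__":
    # Hypothetical slopes from the beta = 1, 1.5 and 2 models.
    print(combined_beta([1.0, 1.5, 2.0], [0.35, 0.10, -0.15]))
```

In this made-up example each model on its own gives *β*_{est}=2*b*+*β*=1.7, and the interpolated value agrees.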

### (c) Summary statistics and measures of variability

Whole-field geometric means, *M*, measures of variability (CV on natural scale and standard deviation, *s*, the square-root of the residual mean square of the ANOVA from the published analysis on the natural logarithmic scale) and estimates of *β*_{o} were computed for each of 531 reported tests of the null hypothesis for count data. Some summary statistics were tabulated for each of the five indicator groups (FSE papers), for all 531 indicators combined and for each of the three crops. These included: mean, minimum and maximum values of *n* and CV; median, and lower and upper quartiles of *β*_{o} after exclusion of analyses with *n*<30; and the frequency with which large treatment effects were detected with statistical significance.

### (d) Comparison of test-statistics from different models

Since the majority of individual values and all median values of *β*_{o} were found to lie between 1.5 and 2, a graphical comparison was made of the 531 published results of tests and estimates of treatment effects using the statistic *d* (assuming *β*=2) with the unpublished results based on the statistic *r*_{1.5} (assuming *β*=1.5).

### (e) Significance of *d*-statistic in relation to sample size

The effect of reducing or increasing the sample size of the FSEs on the realized significance levels was examined. Note that *p*-values for the Monte Carlo paired randomization test for *d* were very similar to those for the *t*-test. This analysis, therefore, uses the *p*-value for the parametric paired *t*-test, i.e. *t*=|*d*|/s.e.[*d*], where the standard error of the test-statistic *d* is based on the residual mean square in the ANOVA for the randomized block design. New *t*-values, *t*_{n}, were calculated for a range of projected sample sizes *n*_{p}=*kn*, where *k*=0.08, 0.17, 0.33, 0.5, 0.67, 1, 1.5, 2, 3, 6 and 12, using *t*_{n}=*t*√*k*, since s.e.[*d*] is proportional to 1/√*n* when *d* and the residual mean square are held at their observed values. Corresponding *p*-values were calculated from Student's *t* distribution with *n*_{p}−1 d.f. The percentage of analyses statistically significant at the 5 and 1% levels were tabulated for each value of *k*, for each group of indicators and for all 531 indicators combined.
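The rescaling of *t* with sample size follows from the standard error of a difference between two treatment means in a randomized block design. The short derivation below assumes that the observed *d* and the residual standard deviation *s* (on the same logarithmic scale as *d*) would be unchanged at the projected sample size; the numerical example is hypothetical.

$$
\text{s.e.}[d]=s\sqrt{\frac{2}{n}}
\quad\Longrightarrow\quad
t_{n}=\frac{|d|}{s\sqrt{2/(kn)}}=\sqrt{k}\,\frac{|d|}{s\sqrt{2/n}}=t\sqrt{k}.
$$

For instance, an indicator with *t*=2.5 from *n*=60 fields would, at *k*=0.5 (*n*_{p}=30), have *t*_{n}=2.5×√0.5≈1.77, which falls below the two-tailed 5% critical value of about 2.045 for 29 d.f., so the result would no longer be significant.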

### (f) Estimates of statistical power of *d*-statistic

Statistical power depends upon the chosen experimental design, the magnitude of the effect specified, variability, abundance and replication. For these data, the results of the power analysis (Perry *et al*. 2003) were used to develop an empirical model to estimate power for detecting a specified difference (*R*), as follows:where Probit[] denotes the cumulative distribution of the standardized Normal distribution, the estimated non-centrality parameter, *θ*, is calculated as , and other terms are as defined earlier (and see the Glossary of statistical terms given in table A1 of the electronic supplementary material). This model has mean absolute error of 1.5 percentage points over the range of power values reported in Perry *et al*. (2003).
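For orientation only, the sketch below computes power from a textbook Normal approximation for a paired comparison. It is not the empirical model described above: it takes the non-centrality parameter to be |ln *R*| divided by *s*√(2/*n*), with *s* the residual standard deviation on the natural logarithmic scale, and it ignores the discreteness of counts and the dependence on abundance. Its values should therefore not be expected to reproduce the estimates reported here. The function name is hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def approx_power(n, s, R, alpha=0.05):
    """Normal-approximation power of a two-tailed paired test.

    n: number of fields; s: residual SD on the natural-log scale;
    R: multiplicative treatment effect to detect.
    Assumption: theta = |ln R| / (s * sqrt(2 / n)); not the paper's model.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    theta = abs(log(R)) / (s * sqrt(2.0 / n))
    # Probability of rejection in either tail.
    return std_normal.cdf(theta - z_crit) + std_normal.cdf(-theta - z_crit)


if __name__ == "__main__":
    for n in (20, 30, 40, 60, 90):
        print(n, round(approx_power(n, s=1.0, R=1.5), 3))
```

When *R*=1 the function returns the significance level *α*, as it should for a test of the stated size.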

The power of the *d*-statistic was estimated for each individual indicator with values of treatment effect *R*=1.1, 1.2, 1.3, 1.4, 1.5 and 2, and for projected sample sizes of *n*_{p}=*kn*, where *k*=0.08, 0.17, 0.33, 0.5, 1, 1.5, 2, 3, 6 and 12. The numbers of indicators with greater than 80 and 90% power were obtained for each group of indicators and for all 531 indicators combined, for values of *R*=1.3, 1.5 and 2.

The sample size, *n*_{80}, required for 80% power was estimated for each indicator for values of *R*=1.1, 1.2, 1.3, 1.4, 1.5 and 2. Estimates were obtained by solving the equation that defines the non-centrality parameter, *θ*, iteratively (Conte & De Boor 1980).

Estimates of *n*_{80} were adjusted to allow for the difference between the originally planned sample size (*n*_{o}) and the realized sample size (*n*), by multiplying by *n*_{o}/*n*, where *n*_{o}=66, 65 and 67 for beet, maize and spring oilseed rape, respectively. Median values and 60-, 70-, 80- and 90-percentiles of the distribution of *n*_{80} were calculated for each value of *R*, for each group of indicators and for all 531 indicators combined.
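The iterative search for the required sample size and the subsequent adjustment can be sketched as below. The power function here is a textbook Normal approximation standing in for the empirical model, so it is an assumption, as are the bisection search, the function names and the example values; only the adjustment by *n*_{o}/*n* is taken from the text.

```python
from math import log, sqrt
from statistics import NormalDist


def approx_power(n, s, R, alpha=0.05):
    """Stand-in Normal-approximation power (assumption, not the paper's model)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)
    theta = abs(log(R)) / (s * sqrt(2.0 / n))
    return nd.cdf(theta - z_crit) + nd.cdf(-theta - z_crit)


def required_n(s, R, target=0.80, alpha=0.05, n_max=10 ** 7):
    """Smallest integer n with approx_power >= target, found by bisection."""
    if approx_power(n_max, s, R, alpha) < target:
        raise ValueError("target power not reached within n_max")
    lo, hi = 2, n_max
    while lo < hi:
        mid = (lo + hi) // 2
        if approx_power(mid, s, R, alpha) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


def adjust_for_planned(n_required, n_planned, n_realized):
    """Scale a required realized sample size up to planned fields (n_o / n)."""
    return n_required * n_planned / n_realized


if __name__ == "__main__":
    n80 = required_n(s=1.0, R=1.5)
    # Hypothetical indicator analysed on 50 of 66 planned beet fields.
    print(n80, adjust_for_planned(n80, n_planned=66, n_realized=50))
```

Bisection works here because the approximate power increases monotonically with *n*, so the smallest *n* meeting the target is well defined.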

## 4. Results

### (a) Relationships amongst significance level, sample size and treatment effect

There were 110 indicators in total for which the estimated treatment effect exceeded 1.5-fold (i.e. *R*>1.5 or *R*<0.67; shown as symbols outside the two vertical dashed lines in figure 1*a*), and 82% of these (those above the horizontal line in figure 1*a*) achieved significance at the 5% level. There was no apparent relationship between the size of the treatment effect and realized sample size, for any of the three crops (figure 1*b*).

### (b) Summary statistics, estimates of *β* and measures of variability

Summary statistics for *n*, CV and *β*_{o} are given in table 1. Although individual values varied from less than zero to considerably greater than three, median values of *β*_{o} were remarkably consistent between the groups of indicators and the crops, all falling between 1.5 and 2.0, and averaging 1.7 overall. Values of *n* always exceeded 40 for the vegetation indicators.

Whilst *n* was very small for some indicators, its mean value usually exceeded 45. Similarly, CV varied from below 10 to almost 200%, but the mean CV was consistent between crops and 73% overall. The mean CV for vegetation indicators, 97%, was notably larger than that for the other indicator groups. This indicates that, although, on average, the power would have been of the order of 70% to detect an effect of size *R*=1.5, the actual treatment effect for many indicators was larger than this, especially those for vegetation. Individual values of *n*, *M*, CV, *β*_{o} and *s* for each of the 531 indicators are given in table A2 of the electronic supplementary material. Values of *s* were relatively small for indicators measuring trophic interactions (Hawes *et al*. 2003) but relatively large for vegetation indicators (Heard *et al*. 2003).

### (c) Comparison of test-statistics from different models

Inferences appeared robust to model misspecification. Values in the scatterplot (figure 2) of treatment effects using test-statistic *d* (assuming *β*=2) versus *r*_{1.5} (assuming *β*=1.5) were clustered tightly around the equality line, especially within the range −0.3<*d*<0.3, that accounted for over 90% of all values. Only in about 4% of cases would a significant test at the 5% level using one model have given non-significance using the other.

### (d) Significance of *d*-statistic in relation to sample size

The predicted percentage of analyses that would result in a significant treatment difference at 5 and 1%, for various multiples, *k*, of *n*, are shown in figure 3 for each of the five groups of indicators and for all 531 indicators combined. Overall achieved percentages, given by *k*=1, were 27.1% at the 5% level and 15.4% at the 1% level. As expected, projected increases in sample size would improve detectability of effects and *vice versa*. A halving of sample size would have resulted in a loss of significant results of about the same order. There was a relatively larger number of significant treatment effects for the vegetation indicators reported by Heard *et al*. (2003). This reflects the fact that herbicide management affects vegetation directly, whereas invertebrates were generally affected less and indirectly (Firbank *et al*. 2003*b*).

### (e) Estimates of statistical power of *d*-statistic

Estimates of the numbers of analyses with greater than 80 and 90% realized power are shown in table 2, for each group of indicators and for all 531 indicators combined. Of course, there is a distribution of power over the different analyses. However, for a particular analysis with a true power value of 80%, we might expect each realization to yield greater than 80% power in approximately half of the cases, which is not greatly dissimilar to the 40% achieved. For values of *R*≤1.5 a change from ≥80% to the more stringent requirement of ≥90% power reduced the estimated number of analyses achieving this by about one-third, although for *R*=2 the reduction is not nearly as great. Notably, one in four of the FSE analyses had greater than 90% power to detect an effect of size *R*=1.5.

The extent to which the percentage of analyses with power greater than 80% is increased by projected increases of sample size and reduced by decreases is quantified in figure 4. Note that a reduction in sample size of just one-third, here represented by a decrease from about *n*=67 to *n*=44, would likely almost halve the number of analyses meeting this criterion. Results for the case of greater than 90% power are in table A3 of the electronic supplementary material.

Median values of the sample size, *n*_{80}, required for 80% power are shown in table 3. Values of *n* lie, as expected, between the tabulated values for *R*=1.5 and *R*=2. If greater certainty of large power is required then sample sizes must be increased; estimated sample sizes required to achieve at least 80% power in 60, 70, 80 and 90% of analyses are presented in figure A1 of the electronic supplementary material.

## 5. Discussion

Prior to the FSEs, data on measures of variability for any biological indicators at the plot scale of half- or whole-fields were very sparse; hence the ability to predict power was restricted (Perry *et al*. 2003). The results confirmed the choice of the range of variability in counts used in the power analysis and, among indicators whose estimated treatment effect exceeded 1.5-fold, the percentage of tests that achieved statistical significance (82%) slightly exceeded 80%. Although there was no guarantee at the planning stage in 1999 that this would be the case, interim unpublished analyses during 2000 and 2001 for a limited number of sites gave confidence that it would be. Had this not been true, sample sizes could have been increased in the later years of the FSEs; Firbank *et al*. (2003*b*) emphasized that treatment effects were consistent with no evidence of interactions of treatment with years.

The lack of a relationship between the size of the treatment effect and realized sample size gives confidence that effects are consistent between rare and abundant species. This is important, since many species of conservation value in arable ecosystems may suffer effects such as a ‘double jeopardy’ from being rare and restricted in range (Lawton 1993); monitoring of their biodiversity requires special care.

The statistical model adopted for the FSE data was justified. Results from Clark *et al*. (1996) and earlier authors indicated that values of *β* should be, on average, less than 2 but somewhat greater than 1.5, as used in the power analysis (Perry *et al*. 2003). Whilst the estimated average value for *β* was, at 1.7, closer to 1.5 than to 2, the value assumed by the published analyses, there would have been very little difference in the inferences drawn had *β* been assumed to be 1.5.

Vegetation in the FSEs probably provided the most important biological indicators, being affected by direct herbicidal effects (Firbank *et al*. 2003*b*). Vegetation indicators were generally surprisingly variable, as measured by CV and *s*. However, sample size for vegetation indicators was generally large and *n* always exceeded 40. Also, treatment effect size was usually large; 22 of the 48 analyses had values of *R*>2. These large values of *n* and *R* more than offset the large variability, and explained the large realized power for vegetation indicators. By contrast, low abundance resulted in some very small values of *n* for analyses in the aerial paper; this frequently prevented the epigeal invertebrate indicators concerned from being analysed with great power.

In summary, for such a costly experiment it was proper to make a considerable initial effort to plan and to estimate the replication required to achieve the desired power. This analysis has shown this effort to have been entirely justified, and vindicated the original assumptions. It suggests that any future projects of major ecological importance or risk assessments of important novel agricultural practices may merit similar inputs. It reflects a growing trend in recent years to give greater prominence to power calculations. These have often been hampered by lack of knowledge concerning variability, but this was not the case here.

New EU legislation, both for genetically modified crops and non-genetically modified applications, requires the effects of various agricultural practices on biodiversity to be studied as part of the regulatory and registration processes, and monitored subsequently. The FSEs provide a valuable database of variability (see statistics given in table A2 of the electronic supplementary material) that enables future such studies to be planned more efficiently than could have been the case previously. The more direct comparisons of estimated power under various scenarios of sample size, presented here, will assist predictions of power and significance levels for future studies, similar to the FSEs, which for reasons of cost may not be as well resourced.

## Acknowledgments

We thank our colleagues in the FSE Research Consortium and the Scientific Steering Committee, and also Robin Thompson for helpful comments on the manuscript. The FSEs were funded by Defra and the Scottish Executive. Rothamsted Research receives grant-aided support from the BBSRC.

## Footnotes

The electronic supplementary material is available at http://dx.doi.org/10.1098/rspb.2005.3282 or via http://www.journals.royalsoc.ac.uk.

- Received May 10, 2005.
- Accepted August 2, 2005.

- © 2005 The Royal Society