Garcia-Gonzalez *et al.* [1] conducted an original and elegant experiment examining whether fertilization of a female's eggs by multiple males (polyandry) can provide fitness benefits via ‘bet-hedging’ (i.e. due to decreased variance in offspring fitness). The authors measured these benefits in both stable and variable environments, and also quantified the joint fitness consequences of bet-hedging and sperm competition. We believe that the study's experimental design is sound, but that its statistical analysis was incorrect. Here, we reanalyse the raw data and find that all but one of the study's results are consistent with the null hypothesis that polyandry does not provide benefits via bet-hedging, contrary to the original conclusions.

Garcia-Gonzalez *et al.* [1] compared the fitness of females under experimentally imposed monandry and polyandry. The eggs of 12 females were divided into three batches, representing three ‘generations’. For each female and generation, half of the eggs were fertilized using the sperm of a single male (monandry) and the other half with sperm from three males (polyandry). The authors measured fertilization rates and offspring viability (the latter in two different environments, termed A and B) for each of the 12 females and three generations.

From the viability and fertilization data, the authors calculated the between-generation geometric means of each fitness measure (*W*_{BG}) under polyandry and monandry, and calculated the difference in *W*_{BG} between treatments as

$$\Delta_{\mathrm{Geo}} = W_{\mathrm{BG,\,polyandry}} - W_{\mathrm{BG,\,monandry}}. \tag{1.1}$$

A positive value of Δ_{Geo} implies that polyandry improves geometric mean fitness. To account for variation in Δ_{Geo} due to the arbitrary ordering of generations, the authors repeated this calculation 10^{4} times while randomly varying the order of generations for each female, and generated 95% central ranges for Δ_{Geo}, which they took as approximate confidence intervals. Similar randomization procedures were used to simulate various regimes of stable and fluctuating environmental conditions.
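The calculation of Δ_{Geo} can be illustrated with a minimal Python sketch. It assumes that *W*_{BG} is the geometric mean, across generations, of the mean fitness within each generation; this is one plausible reading rather than the original authors' code. The function names and the toy data are hypothetical.

```python
import math
import statistics


def between_generation_geomean(fitness_by_generation):
    """Geometric mean across generations of the within-generation mean fitness.

    ASSUMPTION: one plausible reading of W_BG, not the original authors' code.
    fitness_by_generation[g] lists the fitness of every female in generation g.
    """
    generation_means = [statistics.mean(gen) for gen in fitness_by_generation]
    return math.prod(generation_means) ** (1 / len(generation_means))


def delta_geo(polyandry, monandry):
    """Difference in between-generation geometric mean fitness (polyandry - monandry)."""
    return between_generation_geomean(polyandry) - between_generation_geomean(monandry)


# Hypothetical toy data: two generations with two females each.
# Both treatments have the same arithmetic mean fitness (0.5),
# but monandry fluctuates more strongly between generations.
monandry = [[0.8, 0.8], [0.2, 0.2]]
polyandry = [[0.5, 0.5], [0.5, 0.5]]

print(delta_geo(polyandry, monandry))
```

In this toy case, the monandrous geometric mean is about 0.4 and the polyandrous one is 0.5, so Δ_{Geo} is approximately 0.1 even though the arithmetic means are equal.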

This method of generating confidence intervals is problematic. It accounts for one source of variation in the sample (i.e. the ordering of generations) but neglects the variance that results from sampling a finite number of males, females and offspring from the population. This causes the analysis to substantially underestimate the size of the confidence intervals around Δ_{Geo}, and thereby to produce false positives.

Moreover, we believe that the geometric mean is not the most appropriate fitness measure for Garcia-Gonzalez *et al.*'s analysis. In cases where (i) the absolute fitness of each strategy comes from the same probability distribution in all generations and (ii) fitness is uncorrelated among individuals of the same genotype (here, monandrous and polyandrous females), it is more appropriate to apply Gillespie's measure [2–4]
$$\mu_i - \frac{\sigma_i^{2}}{N}, \tag{1.2}$$

where *μ*_{i} and *σ*_{i}^{2} are the population mean and variance in reproductive success of each strategy within a generation, and *N* is the population size. The assumptions behind Gillespie's measure are met for those treatments of Garcia-Gonzalez *et al.* in which the environment is assumed constant between generations. By contrast, the geometric mean fitness is a more appropriate measure when the expected absolute fitness of a strategy fluctuates between generations, but there is minimal within-generation variation in fitness among individuals of the same genotype [3–6]. The geometric mean can also be applied to relative fitness, regardless of the underlying assumptions, but this approach is of limited practical use because calculating relative fitness requires knowing the frequencies of each genotype in the population [3].
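As a brief illustration, Gillespie's measure (mean minus variance divided by population size) can be computed as in the Python sketch below. The sketch assumes that the population mean and variance are estimated by the sample mean and the unbiased sample variance; the function name and the example values are hypothetical.

```python
import statistics


def gillespie_measure(reproductive_success, population_size):
    """Gillespie's measure: mean minus variance divided by population size.

    ASSUMPTION: population moments are estimated from the sample, using the
    unbiased (n - 1) sample variance returned by statistics.variance.
    """
    mu = statistics.mean(reproductive_success)
    sigma_sq = statistics.variance(reproductive_success)
    return mu - sigma_sq / population_size


# Hypothetical values: equal means, different within-generation variance.
low_variance = [0.5, 0.5, 0.5, 0.5]
high_variance = [0.4, 0.6, 0.5, 0.5]

print(gillespie_measure(low_variance, population_size=12))
print(gillespie_measure(high_variance, population_size=12))
```

The strategy with higher within-generation variance receives a slightly lower score, and the penalty shrinks as the population size grows.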

Garcia-Gonzalez *et al.* also simulated fitness in deterministically alternating environments of the form ABA and BAB. In these environments, neither of the above fitness measures strictly applies. This is because there is both between-generation variance in a strategy's expected success (making Gillespie's measure inappropriate) and also substantial within-generation variance among individuals playing the same strategy (which violates the conditions for the geometric mean fitness). These environmental regimes necessitate a more complex analysis, which we omit for brevity (cf. [7]).

Here, we estimate confidence intervals using a bootstrapping method that accounts for the missing sources of sampling variance. We also estimate the probability of obtaining the observed values under the null hypothesis that there is no difference in fitness between monandrous and polyandrous treatments (a statistic not provided in the original paper). We present results both for the difference in geometric mean fitness Δ_{Geo} and for Gillespie's measure Δ_{Gill}.

## 2. Generating effect size confidence intervals

The original experiment generated 36 data points (12 females measured over three generations) for each fitness component (fertilization rates and offspring viability in environments A and B) under both monandry and polyandry. We resampled from these data 10^{4} times to perform the bootstrap analysis. For each run, we sampled 36 monandrous data points from the original 36 with replacement. Matching polyandrous data points were selected so as to maintain the pairings of female and generation from the original experiment.

For each run, we calculated the difference in geometric mean fitness (Δ_{Geo}) and Gillespie's measure (Δ_{Gill}) using equations (1.1) and (1.2), respectively. For the geometric mean fitness, we split the 36 data points into three generations of 12 females at random, maintaining the pairing of treatments as above. For Gillespie's measure, we approximated the population means and variances by the sample means and unbiased sample variances and, for consistency, assumed a population size of *N* = 12. Because individual females and males appear multiple times in the original experiment, this procedure will tend to underestimate the true population variances, and hence the strength of bet-hedging effects. We nevertheless believe this pseudoreplication is unlikely to affect the results strongly.

For both fitness definitions, we calculated the mean and 95% central range from the bootstrap distributions (table 1); the latter provides an estimate of the 95% effect size confidence intervals (the true CIs are probably wider, due to additional variance in the population that is not captured by the sample). This procedure was performed using data from environments A, B, and an average of these two environments, as described in the original study and in our Mathematica code (see electronic supplementary material).
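The paired bootstrap described in this section can be sketched in Python as follows. This is an illustrative sketch, not the Mathematica code from the electronic supplementary material. It assumes that *W*_{BG} is the geometric mean across generations of the within-generation mean fitness, and all function and parameter names are hypothetical.

```python
import math
import random
import statistics


def gillespie(values, pop_size):
    # Sample mean minus unbiased sample variance divided by population size.
    return statistics.mean(values) - statistics.variance(values) / pop_size


def geomean_fitness(values, n_generations):
    # ASSUMPTION: W_BG = geometric mean over generations of the
    # within-generation mean; consecutive blocks are the generations.
    size = len(values) // n_generations
    means = [statistics.mean(values[g * size:(g + 1) * size])
             for g in range(n_generations)]
    return math.prod(means) ** (1 / n_generations)


def bootstrap_deltas(mono, poly, n_runs=10_000, n_generations=3,
                     pop_size=12, seed=0):
    """Return bootstrap samples of (delta_geo, delta_gill) for paired data."""
    rng = random.Random(seed)
    n = len(mono)
    d_geo, d_gill = [], []
    for _ in range(n_runs):
        # Resample pairs with replacement, so each monandrous value keeps
        # its matching polyandrous value (same female and generation).
        idx = [rng.randrange(n) for _ in range(n)]
        m = [mono[i] for i in idx]
        p = [poly[i] for i in idx]
        # The resampled order is random, so consecutive blocks of the
        # lists already form a random split into generations.
        d_geo.append(geomean_fitness(p, n_generations)
                     - geomean_fitness(m, n_generations))
        d_gill.append(gillespie(p, pop_size) - gillespie(m, pop_size))
    return d_geo, d_gill


def central_range_95(samples):
    # The first and last of the 39 cut points for n=40 are the 2.5% and
    # 97.5% quantiles.
    cuts = statistics.quantiles(samples, n=40)
    return cuts[0], cuts[-1]
```

Calling `bootstrap_deltas(mono, poly)` with two equal-length lists of paired fitness values, then passing either returned list to `central_range_95`, gives an approximate 95% interval for the corresponding effect size.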

## 3. Null hypothesis significance testing

We also used bootstrapping to simulate the expected distribution of the mean values of Δ_{Geo} and Δ_{Gill} under the null hypothesis that there is no difference between the polyandrous and monandrous treatments. In each of 10^{4} runs, we randomly swapped the monandrous and polyandrous data points within each of the 36 data pairs with probability 0.5. This gave us two new ‘treatment groups’, each consisting of random mixtures of the original monandry and polyandry treatments. We calculated mean Δ_{Geo} and Δ_{Gill} for each run using the same method as above in order to obtain their approximate distributions under the null hypothesis.

We next calculated the proportion of mean Δ_{Geo} and Δ_{Gill} generated under the null model that were at least as large (in absolute value) as the means calculated in the previous section, giving an approximate two-tailed *p*-value (table 1). One can interpret this *p*-value as the probability of seeing a result at least as extreme as the observed one, under the assumption that mating treatment has no effect on the distributions of Δ_{Geo} and Δ_{Gill}. Exact *p*-values would probably be larger, due to additional variance in the population that is not captured by the sample.
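The null-model procedure can be sketched in Python as below. This is a simplified illustration rather than the analysis actually performed. It applies the within-pair swapping to Δ_{Gill} only, and it compares a single statistic per swapped dataset with the observed statistic instead of comparing bootstrap means. The function names and default parameters are hypothetical.

```python
import random
import statistics


def delta_gill(mono, poly, pop_size=12):
    """Difference in Gillespie's measure (polyandry - monandry)."""
    def gillespie(values):
        return statistics.mean(values) - statistics.variance(values) / pop_size
    return gillespie(poly) - gillespie(mono)


def swap_test_p_value(mono, poly, n_runs=10_000, seed=0):
    """Approximate two-tailed p-value from random within-pair swaps.

    SIMPLIFICATION: one statistic per swapped dataset is compared with the
    observed statistic, rather than with a bootstrap mean.
    """
    rng = random.Random(seed)
    observed = abs(delta_gill(mono, poly))
    at_least_as_extreme = 0
    for _ in range(n_runs):
        m, p = [], []
        for a, b in zip(mono, poly):
            # Swap the treatment labels within each pair with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            m.append(a)
            p.append(b)
        if abs(delta_gill(m, p)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_runs
```

Swapping within pairs preserves the pairing of female and generation while removing any systematic treatment effect, which is why the swapped datasets approximate the null distribution.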

## 4. Results and discussion

Table 1 shows that most of the comparisons made by Garcia-Gonzalez *et al.* [1] yielded results consistent with the null hypothesis (*α* = 0.05) after reanalysis, suggesting that fitness did not differ significantly between polyandry and monandry treatments. Nevertheless, the estimated 95% CIs often included large differences, suggesting that this dataset does not rule out the existence of a substantial benefit from bet-hedging via polyandry. The two fitness measures (Δ_{Geo} and Δ_{Gill}) gave quantitatively similar results, largely because variance in both measures was dominated by statistical fluctuations in the sample means. These fluctuations would be reduced with a larger sample size. Because the revised confidence intervals are large, our analysis highlights that a greater degree of replication (especially of females) is required to measure the benefits of bet-hedging with sufficient precision. We applaud Garcia-Gonzalez *et al.*'s efforts to establish a ‘proof-of-principle’ approach for studying bet-hedging in isolation from other factors, and we hope that our modified statistical approach proves useful to future experiments.

## Data accessibility

Mathematica code for the statistical model is provided in the electronic supplementary material.

## Authors' contributions

J.M.H. and L.H. conceived of the study and wrote the manuscript. J.M.H. performed the statistical analysis.

## Competing interests

We declare we have no competing interests.

## Funding

Funding was provided by an Australian Postgraduate Award to J.M.H. and an Australian Research Council DECRA to L.H.

## Acknowledgments

We thank Francisco Garcia-Gonzalez for insightful and generous comments throughout the writing and review of this manuscript. We also thank Anne Lizé and two anonymous reviewers for their helpful feedback.

## Footnotes

The accompanying reply can be viewed at http://dx.doi.org/doi:10.1098/rspb.2015.0866.

- Received February 13, 2015.
- Accepted April 10, 2015.

- © 2015 The Author(s) Published by the Royal Society. All rights reserved.