## Abstract

A common approach to estimating the total number of extant species in a taxonomic group is to extrapolate from the temporal pattern of known species descriptions. A formal statistical approach to this problem is provided. The approach is applied to a number of global datasets for birds, ants, mosses, lycophytes, monilophytes (ferns and horsetails), gymnosperms and also to New World grasses and UK flowering plants. Overall, our results suggest that unless the inventory of a group is nearly complete, estimating the total number of species is associated with very large margins of error. The strong influence of unpredictable variations in the discovery process on species accumulation curves makes these data unreliable in estimating total species numbers.

## 1. Introduction

Current knowledge of global species richness is based on a small to moderate fraction of all extant species. The actual size of that fraction, and thus the magnitude of global biodiversity, has been much debated (May 1988, 1990, 2000; Gaston 1991, in press; Hammond 1995; Hawksworth 2001; Lambshead & Boucher 2003) and at times the issue has received substantial media attention. Our ability to provide an accurate answer has been argued to provide a valuable test of understanding the structure and composition of global biodiversity, the answer itself to be a basic fact that we should know about the world around us, and one that will facilitate better answers to other important questions, such as the extent to which the carrying capacity of diversity on Earth has been attained, the rate at which species are becoming globally extinct and the scale of the task faced by species conservation (see Gaston (in press) for a review).

The key issue arising from the uncertainty over global species numbers is that of how to extrapolate from the fraction of species that is known to science. By far, the most popular method has involved the use of species discovery curves to estimate the number of species that remain to be discovered in a given taxonomic group, globally or regionally (Steyskal 1965; Frank & Curtis 1979; Soberon & Llorente 1993; Scoble *et al.* 1995; Medellin & Soberon 1999; Ertter 2000; Aravind *et al.* 2004; Gower 2004; Shen Tsung-Jen & Chih-Feng 2003; Solow & Smith 2005; Wilson & Costello 2005; Pimm *et al.* 2006). This approach involves plotting a cumulative frequency curve for the taxon with the expectation that this becomes asymptotic when the inventory is reaching completion and new species are becoming more and more difficult to find. Unfortunately, estimates of species numbers derived in this way tend to lack associated error margins (Steyskal 1965; Frank & Curtis 1979; Soberon & Llorente 1993; Solow & Smith 2005), making it impossible to objectively assess their accuracy. Perhaps in consequence, while some have expressed severe reservations about the application of this approach (Hammond 1995), others have attempted to place it on a more secure statistical foundation (Solow & Smith 2005; Wilson & Costello 2005).

To investigate the confidence with which predictions of total species numbers can be made from species discovery curves, and to determine the effect of incompleteness of discovery curves on confidence limits of predictions, we have compiled and examined datasets for birds of the world and UK flowering plants, which are assumed to be more or less complete. We have also compiled datasets for ants, mosses, lycophytes, monilophytes (ferns and horsetails), gymnosperms and New World grasses as a sample of other major taxa. We developed a generalized linear model for analysing and interpreting the dynamics of these species discovery curves, which provides both point estimates and confidence limits for the number of unknown species using analysis of deviance.

## 2. Material and methods

### (a) Datasets

Datasets of accepted species and their date of publication were assembled for birds of the world, UK flowering plants, ants of the world, mosses of the world, lycophytes of the world, monilophytes (ferns and horsetails) of the world, gymnosperms of the world and New World grasses. The date of the basionym of the accepted name was used where available, namely for UK flowering plants, mosses, lycophytes, monilophytes and gymnosperms. Otherwise, the date used corresponded to the date of publication of the accepted name, namely for birds, ants and New World grasses. The data for birds were supplied by Alan Peterson as a download from http://www.zoonomen.net for the period between 1758 and 2004. UK plant data were compiled from the Oxford University Herbaria database (http://herbaria.plants.ox.ac.uk/bol/?oxford) based on Kent (1992) and Stace (1997) for the period between 1753 and 2005. The ant data were supplied by Donat Agosti as a download from Antbase for the period between 1750 and 2006 (Agosti & Johnson 2005). Moss data were supplied by Marshall Crosby as a download from Crosby *et al.* (2006) for the period between 1753 and 2004. Monilophytes and lycophytes were extracted from World Ferns on CD-ROM for the period between 1753 and 2000 (Hassler & Swale 2001). Data on gymnosperms were compiled from Farjon (2001), World Checklist of Cycads (http://plantnet.rbgsyd.gov.au/PlantNet/cycad/wlist.html) and the TROPICOS database (http://mobot.mobot.org/W3T/Search/vast.html) for the period between 1753 and 2005. New World grasses were supplied by Gerrit Davidse as a download from http://mobot.mobot.org/Pick/Search/nwgc.html for the period between 1753 and 2006.

### (b) The model

It is assumed that species identification and group membership are generally agreed, and that the total number of species in the group, *N*_{tot}, is fixed. The problem is to estimate *N*_{tot}, or equivalently the number hitherto undiscovered. We postulate that the expected number of species discovered in time *t*, *S*_{t}, is some fraction *k* of the number of undiscovered species at time *t*−1, i.e. *E*(*S*_{t})=*k*(*N*_{tot}−*N*_{t−1}), where *N*_{t−1} is the cumulative number of species discovered up to time *t*−1. The coefficient *k* depends on several interacting factors, including the effort expended in discovering new species, the visibility of the undiscovered species, the expertise in identifying new species and the proportion of habitat remaining unexplored. We have no independent estimates of how these factors vary through time, but we might suppose that *k* would decrease if more obvious species are discovered first, and conversely that *k* would increase with increasing discovery effort and smaller areas of unexplored habitat.
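As an illustration only, the postulated relationship can be iterated forward to show how expected discoveries shrink as the pool of undiscovered species is depleted. The sketch below assumes the form *E*(*S*_{t})=*k*(*N*_{tot}−*N*_{t−1}) with a constant *k*; the function name and the values of *N*_{tot} and *k* are hypothetical and are not taken from any of the datasets.

```python
def expected_discoveries(n_tot, k, n_start, steps):
    """Iterate E(S_t) = k * (N_tot - N_{t-1}); return (N_{t-1}, E(S_t)) pairs."""
    known = float(n_start)
    rows = []
    for _ in range(steps):
        s_t = k * (n_tot - known)  # expected discoveries in this interval
        rows.append((known, s_t))
        known += s_t               # cumulative total carried to the next interval
    return rows


# Hypothetical values chosen only for illustration.
rows = expected_discoveries(n_tot=10000, k=0.05, n_start=0, steps=5)
for known, s_t in rows:
    print(f"N_(t-1) = {known:8.1f}   E(S_t) = {s_t:6.1f}")
```

Under these assumptions the number of undiscovered species falls by a factor of (1−*k*) in each interval, so a plot of *S*_{t} against *N*_{t−1} is a straight line with slope −*k*.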

As we have no independent means of separating the effects of discovery effort, the visibility of undiscovered species, etc., we propose fitting the simplest model in which systematic variation in these factors is low compared with the effect of diminishing new species, and *k* is therefore roughly constant. If a plot of *S*_{t} against *N*_{t−1}, smoothed by local regression or spline interpolation, shows a more or less linear trend, then this justifies fitting a generalized linear model of *S*_{t} on *N*_{t−1} to the observations in this linear period. The model then has the form of a linear regression of *S*_{t} on *N*_{t−1}, with intercept *kN*_{tot} and slope −*k*. The point estimate of *N*_{tot} is then minus the intercept over the slope, i.e. the value of *N*_{t−1} for which *S*_{t} is zero.
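The point estimate can be sketched as follows. For brevity, this example uses ordinary least squares as a stand-in for the generalized linear model fit described in the text, so it illustrates only the intercept-over-slope calculation and not the error model. The function names and the noise-free data are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def point_estimate(intercept, slope):
    """Value of N_{t-1} at which the fitted S_t is zero; None if slope is not negative."""
    if slope >= 0:
        return None  # no decline in S_t, so no finite estimate of N_tot
    return -intercept / slope


# Hypothetical noise-free data generated with N_tot = 1000 and k = 0.1.
known = [500.0, 550.0, 595.0, 635.5]          # N_{t-1}
found = [0.1 * (1000.0 - n) for n in known]   # S_t
intercept, slope = fit_line(known, found)
n_tot_hat = point_estimate(intercept, slope)
print(intercept, slope, n_tot_hat)
```

With real data the slope estimate is noisy, and a slope close to zero makes the ratio, and hence the estimate of *N*_{tot}, very unstable.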

*S*_{t} could show a nonlinear decline with *N*_{t−1}. The model could be altered to include, for example, *k* as a linear function of *N*_{t−1}, e.g. *k*=*b*+*cN*_{t−1}. In this case, *E*(*S*_{t})=*kN*_{tot}−(*b*+*cN*_{t−1})*N*_{t−1}. This variation could model both increases and decreases in discovery rates. Once the model has been fitted, this quadratic equation could be solved for *S*_{t}=0, giving the point estimate as the smallest, positive, real root. Other functional forms for *k* are also possible. If *S*_{t} increases with *N*_{t−1}, then increases in discovery effort dominate the decline in the number of species remaining to be discovered, and estimating *N*_{tot} is impossible.
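The root-finding step can be sketched as below, assuming the fitted model has been written as a quadratic *a*_0+*a*_1*N*_{t−1}+*a*_2*N*_{t−1}^2 in the cumulative total. The coefficient names and example values are hypothetical; the example coefficients come from expanding (*b*+*cN*_{t−1})(*N*_{tot}−*N*_{t−1}) with *b*=0.02, *c*=0.0001 and *N*_{tot}=1000.

```python
import math


def smallest_positive_root(a0, a1, a2):
    """Smallest positive real root of a0 + a1*N + a2*N**2 = 0, or None if there is none."""
    if a2 == 0:
        if a1 == 0:
            return None
        roots = [-a0 / a1]  # model is linear, single root
    else:
        disc = a1 * a1 - 4.0 * a2 * a0
        if disc < 0:
            return None     # no real root: the fitted curve never reaches S_t = 0
        sq = math.sqrt(disc)
        roots = [(-a1 - sq) / (2.0 * a2), (-a1 + sq) / (2.0 * a2)]
    positive = [r for r in roots if r > 0]
    return min(positive) if positive else None


# Hypothetical coefficients: (0.02 + 0.0001*N) * (1000 - N) = 20 + 0.08*N - 0.0001*N**2
root = smallest_positive_root(20.0, 0.08, -0.0001)
print(root)
```

A return value of `None` corresponds to the situation described above in which no point estimate can be made.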

If species discoveries were independent of one another, the model would have Poisson error and identity link. However, in practice, the data are likely to show overdispersion, the residual deviance being greater than the corresponding degrees of freedom. The discovery time is usually defined as the date of the first published description, and such descriptions tend to appear in groups such as monographs and other books (Wilson & Costello 2005; Bebber *et al.* 2007). This type of model is described by McCullagh & Nelder (1989) and Venables & Ripley (2002); it is often referred to as a quasi-Poisson model. The model also gives an estimate of the scale factor, or dispersion, of the quasi-Poisson distribution. This is a measure of the overdispersion, being 1 for Poisson errors and larger when data tend to be grouped. The scale factor is best estimated as the residual Pearson *χ*^{2} divided by its degrees of freedom, rather than the mean residual variance (Venables & Ripley 2002).
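The dispersion estimate can be written out explicitly. The sketch below computes the Pearson *χ*^{2} statistic for a Poisson variance function (variance equal to the fitted mean) and divides it by the residual degrees of freedom. The function name and the observed and fitted values are hypothetical and serve only to show the arithmetic.

```python
def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square divided by residual degrees of freedom (Poisson variance = mean)."""
    df = len(observed) - n_params
    if df <= 0:
        raise ValueError("need more observations than fitted parameters")
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi2 / df


# Hypothetical counts and fitted means from a two-parameter model.
observed = [12, 3, 20, 1, 9]
fitted = [10.0, 8.0, 10.0, 5.0, 12.0]
phi = pearson_dispersion(observed, fitted, n_params=2)
print(phi)
```

A value near 1 would be consistent with Poisson errors; values well above 1, as in this made-up example, indicate the clustering of descriptions that the text describes.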

It is then straightforward to derive confidence limits for *N*_{tot}. Let *N̂*_{tot} denote the point estimate and define a new variable *R*=*N̂*_{tot}+*M*−*N*_{t−1}, where *M* can be positive or negative. Fit a generalized linear model for *S*_{t} against *R* with Poisson error and identity link *without intercept*. This forces the regression through *S*_{t}=0 at *N*_{t−1}=*N̂*_{tot}+*M*. The model will give the same residual deviance as the best fit when *M* is zero, and a larger residual deviance as *M* diverges from zero. Changes in deviance scaled by the dispersion have an *F* distribution (McCullagh & Nelder 1989), allowing calculation of confidence limits. The fit has one degree of freedom more in the residual term, and thus if the deviance ratio is significant (i.e. if it is greater than 3.84), then *N̂*_{tot}+*M* lies outside the 95% CI. A search gives the upper and lower limits for which this happens. Note that if the negative slope is not well defined, the upper limit may be infinity. This model is closely related to Fieller's (1954) theorem, which can also give infinite CIs if the data are uninformative.
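The search for confidence limits can be sketched as follows. This is a minimal illustration rather than the implementation used for the analyses: it assumes the no-intercept Poisson fit with identity link, for which the maximum-likelihood slope has the closed form Σ*S*_{t}/Σ*R*, and it uses a simple fixed-step search. The function names, the step size and the noise-free data are hypothetical; the threshold of 3.84 is the one given in the text.

```python
import math


def poisson_deviance(observed, fitted):
    """Poisson deviance, taking y*log(y/mu) as 0 when y == 0."""
    total = 0.0
    for y, mu in zip(observed, fitted):
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += term - (y - mu)
    return 2.0 * total


def profile_deviance(known, found, n_candidate):
    """Deviance of the no-intercept fit S_t = k*R with R = n_candidate - N_{t-1}."""
    r = [n_candidate - n for n in known]
    if min(r) <= 0:
        return math.inf          # candidate total below species already known
    k = sum(found) / sum(r)      # closed-form MLE of the slope through the origin
    return poisson_deviance(found, [k * ri for ri in r])


def confidence_limits(known, found, n_hat, dispersion, crit=3.84, step=1.0):
    """Step outwards from n_hat until the scaled deviance change exceeds crit."""
    d_min = profile_deviance(known, found, n_hat)

    def outside(n_candidate):
        change = profile_deviance(known, found, n_candidate) - d_min
        return change / dispersion > crit

    lower = n_hat
    while not outside(lower - step):
        lower -= step
    # As the candidate total grows without bound the fit tends to a constant mean.
    mean_found = sum(found) / len(found)
    d_flat = poisson_deviance(found, [mean_found] * len(found))
    if (d_flat - d_min) / dispersion <= crit:
        return lower, math.inf   # the data cannot exclude an unbounded total
    upper = n_hat
    while not outside(upper + step):
        upper += step
    return lower, upper


# Hypothetical noise-free data with N_tot = 10000, k = 0.1, dispersion taken as 1.
known_big = [5000.0, 5500.0, 5950.0, 6355.0]
found_big = [500.0, 450.0, 405.0, 364.5]
limits_big = confidence_limits(known_big, found_big, 10000.0, 1.0)

# Same shape of decline with tenfold fewer species: the upper limit is unbounded.
known_small = [500.0, 550.0, 595.0, 635.5]
found_small = [50.0, 45.0, 40.5, 36.45]
limits_small = confidence_limits(known_small, found_small, 1000.0, 1.0)
print(limits_big, limits_small)
```

The second example shows how an unbounded upper limit arises: when a model with no decline at all fits nearly as well as the best fit, no finite value of *N*_{tot} can be excluded.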

Species discovery curves show a variety of trajectories, from those that appear to be increasing exponentially, such as the New World grasses (figure 1*a*), to those that appear to have reached an asymptote, for example, birds of the world (figure 1*h*). By plotting *S*_{t} against *N*_{t−1} and fitting smoothing splines to these data, changes in the discovery rate *k* can be followed (figure 2). Successful prediction using the model requires that the slope of *S*_{t} against *N*_{t−1} be negative and constant, such that *k* is positive and constant. If the slope is zero or positive, this shows that more species are being discovered than expected by the model. Zero or positive slopes indicate increases in discovery effort or rates of description, or some other process unrelated to *N*_{tot}. Smoothing splines were therefore used to identify regions of the data with negative slopes, where model fitting would give sensible predictions.
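The identification of declining regions can be illustrated with a simplified stand-in. The sketch below replaces the smoothing splines used in the analysis with moving-window least-squares slopes, which is a cruder technique chosen only to keep the example short. The function names, window size and data are hypothetical.

```python
def local_slopes(x, y, half_window=1):
    """Least-squares slope of y on x within a moving window centred on each point."""
    slopes = []
    for i in range(len(x)):
        lo = max(0, i - half_window)
        hi = min(len(x), i + half_window + 1)
        xs, ys = x[lo:hi], y[lo:hi]
        mean_x = sum(xs) / len(xs)
        mean_y = sum(ys) / len(ys)
        sxx = sum((a - mean_x) ** 2 for a in xs)
        sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
        slopes.append(sxy / sxx if sxx > 0 else 0.0)
    return slopes


def negative_slope_runs(slopes):
    """Inclusive (start, end) index pairs of consecutive points with negative slope."""
    runs = []
    start = None
    for i, s in enumerate(slopes):
        if s < 0 and start is None:
            start = i
        elif s >= 0 and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(slopes) - 1))
    return runs


# Hypothetical series: S_t rises during a start-up period and then declines.
cumulative = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]  # N_{t-1}
per_period = [1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0]         # S_t
slopes = local_slopes(cumulative, per_period, half_window=1)
runs = negative_slope_runs(slopes)
print(runs)
```

Only the observations inside a run with negative slope would be candidates for model fitting; a later run of non-negative slopes would correspond to the renewed increases in discovery rate seen in some groups.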

## 3. Results

In all cases except for the British flora, *S*_{t} increased with *N*_{t−1} for the early discoveries (figure 2*a–h*). This was reflected in an exponential increase in *N*_{t} over time (figure 1*a–h*), and meant that for all groups except the British flora and birds of the world, bounded confidence limits on *N*_{tot} could not be estimated when the entire dataset was included (table 1). For all groups except New World grasses (figure 2*a*), the discovery rate declined after this initial ‘start-up’ period. The New World grasses were omitted from further analyses, as there was no indication of decline in *S*_{t}, and consequently no point estimate of *N*_{tot} could be made. Models were then fitted to subsets of the data for the other groups, omitting the early phase of increasing *S*_{t}.

Gymnosperms, ants and mosses all showed a similar discovery rate dynamic (figure 2*b–d*). Although *S*_{t} declined after an initial increase, this was followed by another increase for the most recently discovered species. Model fits that included only the central declining *S*_{t} phase gave bounded confidence limits for *N*_{tot} (table 1). However, inclusion of the most recently discovered species either gave very large, or unbounded, confidence limits (table 1).

Ferns and lycopods show a slightly different pattern (figure 2*e*,*f*). In these groups, *S*_{t} remained roughly constant, precluding estimation of *N*_{tot} even when the increasing *S*_{t} phase was omitted. Bounded confidence limits could be obtained for the latest 10–20% of discoveries, however, owing to recent declines in discovery rates (table 1). In other words, predictions at some point in the past would have been impossible. For ferns, the upper 95% confidence limit of the best estimate that includes the most recent data was only 271 species more than the current total of 14 891 species (table 1). For lycopods, the upper limit was 22 species more than the current total of 484 species (table 1).

The best-behaved groups, in terms of long-term declines in *S*_{t}, were the British flora and birds of the world (figure 2*g*,*h*). For the British flora, the first data point (Linnaeus) was omitted from the model fit as it contained many more species than any other point. Fitting the remaining data gave very narrow confidence limits for *N*_{tot}, but omission of just the most recent 10% of discoveries led to unbounded confidence limits (table 1). The best estimate for British flora gave 95% confidence limits of 1459–1488 species, with a current known total of 1458 species. Implementing *k* as a linear function of *N*_{t−1} (the ‘varying *k*’ model) gave a best estimate of 1493 species with 95% confidence limits of 1459–1559 species, using the 131 most recent discoveries (table 2). This interval is wider than that of the best estimate from the constant *k* model. The varying *k* model was able to give bounded, though wide, confidence limits when the most recent 10% of discoveries were omitted (table 2). Omission of more than 10% of the most recent discoveries made prediction impossible.

For the birds of the world, omission of earlier data gave progressively smaller estimates of the dispersion index, and tighter confidence limits on *N*_{tot} (table 1). However, omission of just a few late discoveries widened the confidence limits dramatically. Once again, prediction in the absence of just a few species was impossible. The 95% confidence limits on the best estimate for birds were 9994–10 061, with a current total of 9968 species. This estimate included only the most recent 25% of species in the model. Use of the varying *k* model did not lead to improvements in prediction over the constant *k* model (table 2).

## 4. Discussion

### (a) Methodological issues

The modelling of species discovery curves and the prediction of total species numbers are complicated by three features of the data. The model we have proposed, and the analyses we have conducted, explicitly address these features. Firstly, the discovery rate is governed not only by the number of species remaining to be found, but also by the effort employed in finding and reporting them. The variability of discovery effort is best illustrated by the early exponential increase in discoveries, which is independent of the number of species remaining to be found. All but one of the datasets presented here contain this feature. The UK plants dataset is anomalous because Linnaeus described more than 800 species in 1753. This dataset therefore does not suffer from the problem of erratic early data collection. Wilson & Costello (2005) attempted to model early rate increases using logistic curves. However, because these early discoveries are largely uninformative of *N*_{tot}, there seems to be little reason to include them. In the worst case, their inclusion could bias estimates of *N*_{tot} or overstate the informativeness of the dataset. We found no support for the use of models of varying *k*, and would instead recommend limiting the data to subsets in which *S*_{t} declines linearly with *N*_{t−1}.

The second problematic feature of the data is also due to variability in discovery effort, namely the occurrence of false plateaux that leads to underestimates of *N*_{tot}. The ant and moss datasets demonstrate this issue. Both curves apparently begin to flatten at approximately 80% of the current total. However, for both ants and mosses, the subsequent rate of discovery increases, and analyses that include these most recent data cannot provide upper confidence limits for *N*_{tot}.

The third feature of the data regards the error distribution. Solow & Smith (2005) regard discovery dates as having independent Poisson errors, while Wilson & Costello (2005) recognize that the assumption of independence cannot be maintained for these data. Estimates of the dispersion parameter for our data are much greater than unity, and assumption of Poisson errors would lead to underestimates of the range of the confidence limits on *N*_{tot}. The model also avoids problems of time-series autocorrelations, as the fits are not functions of time.

### (b) Predictions

The approach of using virtually completed curves for birds and UK flowering plants has demonstrated that our model can yield predictions with a high degree of confidence, but only when the vast majority of species are already described. The bird dataset shows that the earliest 50% of the data are not helpful in prediction, but thereafter, subsets including the most recent discoveries provide consistent estimates of *N*_{tot}. Even when discovery rates show a long period of decline, predictions from incomplete datasets can be highly uncertain. For birds, omission of the last 10–20% of species greatly widens the confidence limits. Analysis of the whole UK plants dataset provides a realistic estimate with narrow confidence limits; however, if only the earliest 90% of the dataset is used, then no upper confidence limit can be set on *N*_{tot}.

We have presented datasets for most major lineages of land plants with the exception of hornworts and liverworts. For mosses, lycophytes, monilophytes (ferns and horsetails) and gymnosperms, we have complete world coverage. These data are not available for angiosperms, so we have used New World grasses as a surrogate for angiosperms. Our results for these five datasets demonstrate that although plants are generally considered as relatively well known, there is no evidence that any of the major lineages of land plants are reaching or nearing an asymptote, with the exception of monilophytes and lycophytes. For both these lineages, there is a point estimate and small associated error only if a very recent subset of the data is used (5% for monilophytes and 18% for lycophytes), but this is over such a short period of time that it is impossible to distinguish an asymptotic curve from what might be a false plateau. The problem of false plateaux is well illustrated by the ant data, which yield narrowly bounded estimates between the 40th and 80th percentile, but subsequent discoveries, i.e. the most recent 20%, clearly show this to be misleading. We consider New World grasses to be a fair surrogate for all angiosperms owing to the size of the group and the extent of its geographical distribution, which covers many distinct habitats. Nonetheless, for New World grasses, we interpret the fact that there is no indication of decline in *S*_{t}, and consequently no point estimate of *N*_{tot}, as a strong indication that the inventory of angiosperms is far from nearing completion.

## 5. Conclusion

In conclusion, these results are significant in two respects. First, unless an inventory is more or less complete (e.g. 90% complete for birds), extrapolations based on existing data are associated with very large margins of error. This, in addition to issues relating to synonymy, partly explains current levels of uncertainty about species numbers even for relatively well-known taxa such as plants (Scotland & Wortley 2003; Wortley & Scotland 2004). Unfortunately, the completeness of an inventory cannot be known until all species have been found. Second, any extrapolation from existing data is sensitive to the dynamics of the discovery process over time, as well as to the proportion of known species used in the extrapolation. It is clear that species discovery curves, governed both by the number of species remaining to be discovered, and by the vagaries of discovery effort, are largely unable to provide statistically rigorous estimates of total species numbers in a group, unless long periods with near-zero discovery have elapsed. Changes in discovery effort appear to be arbitrary and are unlikely to be predictable, so apparent plateaux in discovery curves cannot be relied upon to indicate the final approach to completeness of the inventory. Even when data are well behaved, confidence limits on *N*_{tot} become very large when just a few of the most recently discovered species are omitted.

Recent literature (Solow & Smith 2005; Wilson & Costello 2005), including this paper, attempts to place analyses of species discovery curves on a more secure statistical foundation by proposing improved models, dealing with the error distribution and making the interpretation of results more transparent. We consider that the approach used here focuses on the essential element of flattening of these curves when species become harder to find and deals appropriately with the error distribution. In addition, plotting the data as number of species discovered per year versus number discovered up to that year (figure 2) reveals the noisy and unpredictable nature of the discoveries, which may be obscured by traditional accumulation curves (figure 1).

Our results suggest that prediction for incomplete datasets is problematic because, unless a curve has flattened for some considerable time, it contains little appropriate information. There are many reasons why species continue to be described for many taxa, such as the use of new analytical techniques, new species concepts, the exploration of new areas of the world and the publication of long-term monographic studies. Thus, even for apparently completed curves, it takes only renewed effort in any one of these areas to discover new species. This suggests that prediction using discovery curves for incomplete groups is largely futile. We suggest that biologists shift focus from species discovery curves to other methods (Gaston in press) that are immune to the problems caused by temporal variations in the discovery process.

## Acknowledgments

We thank Donat Agosti, Marshall Crosby, Gerry Davidse and Alan Peterson for making data available. We thank Donat Agosti, Bob May, Alan Peterson and two anonymous reviewers for comments on various drafts of the manuscript.

## Footnotes

- Received February 23, 2007.
- Accepted April 3, 2007.

- © 2007 The Royal Society