In a recent article, we examined a simple and rather generic individual-based model consisting of a large number of organisms that undergo reproduction with mutation and death through competitive interaction. Our analysis revealed that the formation and coherence of species depend crucially on population size. Specifically, species are unlikely to form under high values of μK, the product of the mutation rate (μ) and the carrying capacity (K). The model contains only the two basic processes of competition and mutation. This simplicity allowed us to uncover the root cause of a phenomenon that we believe could be quite general.
To what extent do our theoretical findings manifest themselves in real ecological systems? We investigated this question by comparing the outputs of our model with phylogenetic data derived from ecogenomic surveys in the literature [2,3]. We found that the reconstructed phylogenetic trees of organisms with body size around the millimetre scale or below have similar characteristics to those occurring in our model for parameters where species do not form. This finding led us to ask: ‘are there species smaller than 1 mm?’
In their comment, Morgan et al. propose that our theoretical findings, though correct, are not applicable to real ecological communities. They argue that the work reported in references [2,3] was flawed, specifically suggesting that the counts of operational taxonomic units (OTUs, interpretable as lineages) reported in those articles are highly inflated owing to errors in sequencing. If this were true, then the patterns observed in our figure 1 would be artefacts, and their similarity to the results of our model mere coincidence. We believe that Morgan et al. are unjustified in dismissing these data and the conclusions we drew from them, as we now explain.
For the datasets in question, the number of OTUs found declines steadily with the maximal permitted genetic distance within OTUs. In the light of our theoretical findings, this fact suggests the absence of genetic species. Morgan et al. would like to demonstrate that species have in fact formed. To do this, they propose to ‘clean’ the underlying sequence data by removing large numbers of sequences, so as to reveal a pattern that they believe has been obscured by noise. The remarkable effect of this removal process can be seen in figure 2 of their comment, in which a plateau in the number of OTUs is recovered from data where OTUs previously declined smoothly. Morgan et al. claim that this plateau, which was absent from the untreated data, is the one predicted by our theory in the case when species have formed.
We would like to urge caution. Selectively removing parts of a dataset can profoundly alter it, and often imposes a new structure not present in the original data. Any noise removal requires some preconceptions about structure in the underlying data; one must have an extremely good understanding of both the system and the noise in order to attempt this. For ecogenomic pyrosequencing data, this understanding might still be insufficient at present. One can test for bias in a denoising algorithm such as the one used by Morgan et al. by inputting data known to have no structure, and seeing if the algorithm creates a structure where none previously existed (a false positive).
We have undertaken such a test. We applied the procedure used by Morgan et al. to two synthetic datasets, each consisting of 5000 sequences of 200 base pairs in length. The first set was designed to mimic the low-diversity mock community used by Morgan et al.; it was obtained by repeatedly sampling from a set of 10 initial sequences. The second was a high-diversity dataset generated by repeatedly replacing one randomly chosen sequence by a copy of another randomly chosen sequence, modified by random substitutions with probability 0.01 per base pair. This process simulates neutral evolution; after many iterations, it produces sequence data with no discernible species structure. Applying the fast clustering algorithm of OCTUPUS to these datasets for a range of levels of genetic similarity leads to the expected [1,4] structures in figures 1 and 2 (triangles). We observe a plateau at low genetic distances for the low-diversity dataset, and a steady decline in the number of OTUs for the high-diversity set.
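The construction of the two synthetic datasets can be illustrated with a short Python sketch. It is not the R script provided in the electronic supplementary material, and several details are assumptions made for illustration: the function names, the choice of a single random ancestor as the starting population for the neutral process, and the convention that a substitution always changes the base.

```python
import random

BASES = "ACGT"

def random_sequence(length, rng):
    """Draw a uniformly random DNA sequence of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))

def mutate(seq, rate, rng):
    """Substitute each base independently with probability `rate`.
    Assumption: a substitution always yields a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def low_diversity(n_reads, length, n_founders, rng):
    """Sample reads repeatedly from a small set of initial sequences."""
    founders = [random_sequence(length, rng) for _ in range(n_founders)]
    return [rng.choice(founders) for _ in range(n_reads)]

def high_diversity(n_reads, length, n_steps, rate, rng):
    """Neutral process: at each step, one randomly chosen sequence is
    replaced by a mutated copy of another randomly chosen sequence.
    Assumption: the population starts as copies of a single ancestor."""
    pop = [random_sequence(length, rng)] * n_reads
    for _ in range(n_steps):
        target, source = rng.sample(range(n_reads), 2)
        pop[target] = mutate(pop[source], rate, rng)
    return pop

if __name__ == "__main__":
    rng = random.Random(0)
    # Reduced sizes so that the demonstration runs quickly; the text
    # uses 5000 sequences of 200 base pairs.
    low = low_diversity(200, 50, 10, rng)
    high = high_diversity(200, 50, 2000, 0.01, rng)
    print("unique sequences (low diversity): ", len(set(low)))
    print("unique sequences (high diversity):", len(set(high)))
```

The number of iterations (`n_steps`) is a free parameter in this sketch; the text specifies only that the process is run for many iterations.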
To model sequencing errors (the noise), sequences in both datasets were then subjected to random substitutions with a probability of 0.01 per base pair, simulating raw sequencer reads. In the output of the clustering algorithm (figures 1 and 2, diamonds), the addition of noise is observed to shift the original curves to the right. The low-diversity dataset exhibits highly inflated numbers of OTUs at small genetic distance, in line with concerns raised by Morgan et al. For the high-diversity dataset, however, the effect is weaker, suggesting that raw or slightly processed high-diversity data can meaningfully be analysed in this format.
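The effect of substitution noise on OTU counts can be sketched in Python as follows. This is an illustration only: the greedy centroid clustering below is a simple stand-in and not the OCTUPUS algorithm, the Hamming distance assumes equal-length reads without insertions or deletions, and all function names are hypothetical.

```python
import random

BASES = "ACGT"

def add_noise(reads, error_rate, rng):
    """Apply independent substitutions to every base of every read.
    Assumption: an error always changes the base."""
    noisy = []
    for read in reads:
        chars = [
            rng.choice([b for b in BASES if b != base])
            if rng.random() < error_rate else base
            for base in read
        ]
        noisy.append("".join(chars))
    return noisy

def hamming(a, b):
    """Number of mismatching positions between two equal-length reads."""
    return sum(x != y for x, y in zip(a, b))

def count_otus(reads, max_distance):
    """Greedy clustering: a read opens a new OTU unless it lies within
    `max_distance` mismatches of an existing OTU centroid."""
    centroids = []
    for read in reads:
        if not any(hamming(read, c) <= max_distance for c in centroids):
            centroids.append(read)
    return len(centroids)

if __name__ == "__main__":
    rng = random.Random(0)
    founders = ["".join(rng.choice(BASES) for _ in range(100)) for _ in range(5)]
    clean = [rng.choice(founders) for _ in range(200)]
    noisy = add_noise(clean, 0.01, rng)
    for d in (0, 1, 2, 5, 10, 15):
        print(d, count_otus(clean, d), count_otus(noisy, d))
```

Printing the OTU count against the distance threshold for the clean and noisy reads gives a rough analogue of the triangle and diamond curves, within the limits of the simplified clustering used here.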
We then applied the APDP-SS algorithm [4,5] to delete some of the raw reads. The steps of the algorithm involving primer occurrences and comparison with GenBank were omitted as they are not relevant to synthetic data. For the low-diversity dataset, clustering after application of APDP (figure 1, squares) reveals a structure very similar to the original data, with a pronounced plateau at low genetic distances.
When applied to the high-diversity dataset, however, APDP again generates a plateau (figure 2, squares). This plateau is an artefact that would wrongly suggest the presence of only about 33 unique sequences in the original data; in fact, there were 4383. This result is important in the light of the similarity between our figure 2 and figure 2 of Morgan et al. In our case, the APDP algorithm has created a plateau from underlying data where this did not exist. In their case, Morgan et al. conclude that the algorithm has uncovered a true signal that was obscured by noise.
We have not analysed in detail exactly how APDP imposes the structure found in figures 1 and 2, although it appears to be mainly due to the blanket removal of all singleton sequences. This step was recognized by Morgan et al. as potentially problematic but retained as ‘a conservative approach’, supported by its apparently successful inclusion in other recent algorithms. Further analysis of this algorithm is clearly necessary. We have included as electronic supplementary material the R script used for the processing chain reported above, so that others may reproduce our test.
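To illustrate why blanket singleton removal can reshape a dataset, the following Python sketch drops every read whose exact sequence occurs only once. It is a simplified stand-in for that single step, not an implementation of APDP, and it assumes that a 'singleton' means a read with no exact duplicate in the dataset.

```python
from collections import Counter

def remove_singletons(reads):
    """Keep only reads whose exact sequence occurs at least twice."""
    counts = Counter(reads)
    return [r for r in reads if counts[r] > 1]

if __name__ == "__main__":
    # Toy data: two abundant sequences plus two sequences seen once each.
    reads = ["AAAA", "AAAA", "AAAT", "AACA", "CCCC", "CCCC", "CCCC"]
    kept = remove_singletons(reads)
    print("unique sequences before:", len(set(reads)))
    print("unique sequences after: ", len(set(kept)))
```

In a dataset where most sequences are unique, such a filter discards the majority of the distinct sequences and retains only the few that happen to be duplicated, which is consistent with the collapse in apparent diversity described above.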
In our original article, we began a theoretical investigation of the basic mechanisms leading to genetic clustering. As well as challenging the result of references [2,3], Morgan et al. have speculated about some aspects of our model that they believe are too simple; for example, asexual reproduction. Our experience suggests that the mechanism of cluster formation is generic and will hold in more realistic models. Crucially, we have already demonstrated that the same phenomenon occurs in both the phenotypic and genotypic versions of the model, which appear very different a priori. We are currently studying other variants of the model incorporating sexual reproduction, and hope that other researchers will also investigate this question.
Although the simulated organisms in our models do not form species when μK is large, it is important to note that the populations do still exhibit a certain structure. In particular, while not forming species, individuals are phenotypically (or genetically) differentiated and adapted to their niches. We expect that future theoretical work will establish that many population-level features (including biogeographic structure, ecological differentiation, etc.) are not dependent on the existence of coherent species. Indeed, even reproductive isolation of two subpopulations does not conclusively demonstrate the separation of species; the same would be observed if specimens were taken from opposite ends of a ring species.
Further work is needed to accurately assess the extent of species formation in the meiofaunal biosphere. As we have seen, the handling of errors produced in current high-throughput sequencing technologies poses a major challenge. Possible areas for improvement include more extensive genetic and phylogenetic analyses of selected meiofaunal taxa, the synthesis of population-level surveys with selective whole-genome sequencing, and the development of more sophisticated mathematical models incorporating the effects of sequencing errors. The question of species formation is closely related to the problem of identifying so-called barcoding gaps [10,11]; however, in the present literature, the existence of species is often assumed a priori. Reanalysis of existing data without this assumption could well provide new insights. As the quantity and quality of ecogenomic data improve, we may find that the concept of ‘species’ is no longer central to our understanding of many aspects of ecology and biodiversity.
The accompanying comment can be viewed at http://dx.doi.org/10.1098/rspb.2013.3076.
- Received January 24, 2014.
- Accepted February 19, 2014.
- © 2014 The Author(s) Published by the Royal Society. All rights reserved.