Detecting regions of the human genome that are, or have been, influenced by natural selection remains an important goal for geneticists. Many methods are used to infer selection, but there is a general reliance on an accurate understanding of how mutation and recombination events are distributed, and the well-known link between these processes and their evolutionary transience introduces uncertainty into inferences. Here, we present and apply two new, independent approaches; one based on single nucleotide polymorphisms (SNPs) that exploits geographical patterns in how humans lost variability as we colonized the world, the other based on the relationship between microsatellite repeat number and heterozygosity. We show that the two methods give concordant results. Of these, the SNP-based method is both widely applicable and detects selection over a well-defined time interval, the last 50 000 years. Analysis of all human genes by their Gene Ontology codes reveals how accelerated and decelerated loss of variability are both preferentially associated with immune genes. Applied to 168 immune genes used as the focus of a previous study, we show that members of the same gene family tend to yield similar indices of selection, even when located on different chromosomes. We hope our approach will provide a useful tool with which to infer where selection has acted to shape the human genome.
One hundred and fifty years after Darwin published ‘The Origin’, literally millions of single nucleotide polymorphisms (SNPs) [1–3] finally provide the tools that should allow us to analyse in detail how natural selection has acted on, and continues to shape the human genome. Various approaches have been explored , including the study of linkage disequilibrium blocks , detection of SNP clusters, testing for an excess of SNPs with one very common allele , discovery of unusually large or small genetic distances between populations  and, within genes, inferences about the ratio of synonymous to non-synonymous substitutions . Although these studies have told us much, they tend to focus on directional rather than balancing selection and to rely on poorly tested assumptions about where and at what rate recombination events and mutations occur, assumptions that are increasingly being challenged [8–11]. Where balancing selection has been tested for, it seems elusive , possibly because ‘the requirements for detection by means of SNP data alone will rarely be met’ , a notable exception being Andrés et al. . This is potentially of concern because there is increasing evidence that heterozygote advantage may be common, particularly at immune loci, in both humans [14,15] and many other species [15–17].
An alternative, and we believe novel, approach to the detection of natural selection is suggested by humankind's unusual demographic history. Somewhat over 50 000 years ago, anatomically modern humans moved out of Africa to colonize the world [18–20]. As they did so, one or a series of population bottlenecks caused a dramatic loss of neutral genetic variability [18,21–23], manifest everywhere people have looked, from microsatellites  and SNPs  to morphological traits  and even commensal bacterial diversity . The signature of this loss is a monotonic decline in neutral genetic variability with land-only distance from Africa [19,23]. Previous methods for inferring selection have tended either to ignore this trend completely or to treat it as a nuisance variable that has to be controlled . However, the uniformity and ubiquity of the decline in variability itself provides a useful new null hypothesis. Deviations from the overall trend should be informative about the action of natural selection. For example, balancing selection maintains two or more lineages within a population, thereby creating regions of enhanced diversity . During a population bottleneck, such regions should show reduced diversity loss, manifest as genomic regions in which the gradient of diversity against distance from Africa is close to zero. Similarly, positive selection acting on variants that helped early modern humans adapt to new environments will have accelerated the reduction of diversity and created steeper slopes. Finally, positive slopes might be generated wherever the non-African environment presented new challenges that were best met by multiallelic solutions, for example when humans encountered new classes of pathogens or parasites [15,28].
A pervasive problem with many tests for selection is the lack of independent verification. Most tests rely on assumptions about local recombination or mutation rates, if only neither have changed appreciably in the recent past. In practice, these assumptions are open to question. Point mutations appear to occur non-randomly, falling in clusters [9,10], and these clusters themselves correlate with local recombination rate , though this may reflect correlation of both with features such as local GC base composition . Nonetheless, recombination hotspots can be both intense and highly localized  and appear to be evolutionarily transient [11,32]. Equally, the clustering of SNPs may reflect gene conversion events focused on existing polymorphisms [10,33], potentially creating a dynamic and constantly changing mutation landscape. The main method for detecting selection that directly bypasses these issues involves dn/dS ratios , the proportion of all nucleotide substitutions that cause changes at the level of the protein. However, being based on several/many mutations in coding regions, this method cannot be used meaningfully to infer current selection acting on a single variant allele, or selection acting on variants in non-coding regions.
Given the above uncertainties, it is desirable to compare two independent methods of inference. For a second test, we therefore turned to microsatellites. It is well-established that microsatellite heterozygosity is positively correlated with repeat number [35–37]. Consequently, the average relationship between heterozygosity and repeat number provides an expectation for how variable an ‘average’ microsatellite of a given repeat number should be . Wherever a microsatellite lies near to a gene experiencing selection, this expectation will change. In regions affected by balancing selection, a microsatellite should carry greater heterozygosity than expected from the number of repeats it carries. Similarly, microsatellites in regions affected by strong directional selection will have lost variability through selective sweep effects, and should show less variability than expected for their length.
The two methods for detecting selection described above are essentially independent: the first looks at how heterozygosity varies across the world regardless of absolute levels, while the latter looks at patterns within a single population and focuses on absolute variability relative to an extrinsic relationship, the way microsatellite heterozygosity scales with repeat number. Here we cross-test these two methods using large, published human datasets and show that they yield concordant patterns. We then apply the more general SNP-based approach to show how immune genes in particular exhibit patterns consistent with both balancing and directional selection.
SNP data were downloaded from http://hapmap.ncbi.nlm.nih.gov/, specifically HapMap phase II and III (5 February 2009 release) genotyped in the following population samples: Yoruba from Nigeria (YRI), Europeans from Utah (CEU), Lahuya from Kenya (LWK), Maasai from Kenya (MKK), Toscans from Italy (TSI), Han from China (CHB) and Japanese from Japan (JPT) . Four other populations were excluded owing to their greater risk of mixed ancestry. Heterozygosity was estimated assuming two alleles in Hardy–Weinberg equilibrium. Distance from Africa was measured as the land-only route from Addis Ababa to the town of sampling/centre of sampling region . CEU was taken as Paris, an intermediate western European location.
To determine the local slope of SNP heterozygosity against distance from Africa for any given point in the genome, a custom macro was written in Visual Basic. SNP data for the relevant chromosome were read into an array and stored as heterozygosities for each of the seven populations. Local slope was then calculated as the Pearson correlation coefficient of average heterozygosity against distance from Africa across the seven populations, average heterozygosity being based on all SNPs within a given distance of the focal location. A correlation coefficient was preferred to the actual slope because, with so few data points, steep but poorly supported slope values often arise by chance, while large correlation coefficients more often imply a well-defined relationship, regardless of whether the slope itself is steep (for a given set of SNPs, heterozygosity varies little among populations, so large outliers are unlikely). In all cases, we compared the results obtained using four different window sizes: ±10, ±25, ±50 and ±100 kb.
Microsatellite data were downloaded from the Centre d'Etude du Polymorphisme Humain (CEPH) website (http://www.cephb.fr/en/cephdb/) and are based on the data published by Dib et al. . The location of each microsatellite on the human genome, build 36.6 (chosen for maximum compatibility across all datasets used), was determined through the sequence-tagged sites database, and expected heterozygosity calculated using the frequencies of alleles listed, assuming Hardy–Weinberg equilibrium. Wherever possible, we extracted the clone sequence and the primer sequences, with which we calculated the mean allele length converted to numbers of repeat units (= ‘length’), on the assumption of no insertions or deletions in the regions between the primer sites and the microsatellite. Finally, we calculated residual heterozygosity at each locus. Plotting heterozygosity against length yields the expected positive relationship. However, the variance in heterozygosity declines strongly with increasing repeat number, owing to the fact that while essentially all long microsatellites have high heterozygosity, short microsatellites can have almost any value. To reduce this bias, we therefore expressed the heterozygosity of each microsatellite as the standardized residual heterozygosity of all loci within 0.5 repeat units in length. Loci with extreme residuals (greater than 2.5 s.d.) were excluded, since these may include strongly aberrant loci with unusual features such as insertions or deletions in their flanking DNA.
A full list of all annotated human genes was downloaded from the Gene Ontology (GO) website (http://www.geneontology.org) on 10 March 2009 . Locations on the human genome build 36.6 were verified and each locus stored as its unweighted mid-point location (i.e. we used the middle base rather than the middle exonic base), along with all associated GO codes. In addition, we also downloaded a list of 168 genes from a previous paper examining selection on immune genes , along with their locations. This list was used as a supplementary test of the association between selection and immune genes.
(a) Microsatellite heterozygosity and single nucleotide polymorphism variability
After excluding loci with extreme residual heterozygosity and where lack of sequence/primer information precluded inference of repeat number, data from a total of 4524 microsatellites were retained. Data were combined into 20 equal-width bins spanning the range of residual heterozygosity, standardized by subtracting the mean and dividing by the standard deviation, of −2.5 to 2.5. Within each bin, each microsatellite was placed at the centre of a symmetrical window (four sizes examined = ±10, ±25, ±50 and ±100 kb) and in each case the correlation coefficient of the relationship between SNP heterozygosity and distance from Africa was calculated based on the seven study populations. Figure 1 shows how the mean correlation coefficient varies across the 20 microsatellite bins for a window size of ±25 kb. A regression based on the data as shown is significant (r2 = 0.488, n = 19, p = 0.0009), but becomes appreciably stronger if the first data point, a major outlier, is removed (r2 = 0.812, n = 19, p = 3.3 × 10−7). The lowest bin is likely to be an outlier because very low heterozygosity can result from several processes other than selection, most obviously stabilization of the locus through internal point mutations . The data point for the highest bin contained only a single locus and was omitted in both cases. Other window sizes yield substantially weaker associations, the narrowest window being non-significant and the two larger windows approaching significance (p ∼ 0.07 in both cases). In all cases, excluding the extreme bins yields stronger, more positive slopes. We believe our optimum window size lies at 25 kb because while smaller windows reflect well local conditions, they carry more statistical noise owing to the small number of SNPs included, and for larger windows the converse is true; with more reliable numbers of SNPs reducing stochastic noise but the larger regions tending to embrace more than one functional block.
(b) Single nucleotide polymorphism variability and GO codes
Using the ‘best’ bin size determined from the microsatellite analysis, 25 kb, we next analysed a list of 65508 genes and gene functions downloaded from the GO website. Multiple GO codes for the same gene (defined as having the same start and stop location) were treated as separate entries and gene location was taken as the mid-point of the gene. Having determined the local SNP slope at each locus, mean slopes were calculated for each GO code found, with qualifying codes having more than five different genes. To assess whether immune-related genes tend to have extreme slopes, suggesting selection, all retained GO codes (n = 1308) were classified blind by one of us (C.B. presented with an alphabetically ordered list of gene classes without any inferred selection coefficients) as to whether they were or were not directly linked to immune function. Examples include ‘defence against bacteria’, ‘positive regulation of chemokine biosynthetic process’ and ‘natural killer cell activation’ (for full list, see the electronic supplementary material, table S1). Attempts to use the GO coding system directly failed because key descriptors such as ‘immune response’, while capturing many relevant genes, also exclude many legitimate and important classes (e.g. ‘I-κB kinase/NF-κB cascade’) which would have to be added manually. After sorting by mean slope, the frequencies of immune genes were determined for each consecutive block of 100 codes (figure 2). The two highest bin counts are found in the highest and lowest mean slope classes, significantly more often than expected by chance (χ21 = 7.43, p = 0.006). The 24 GO codes associated with the strongest positive and negative slopes are listed in table 1. Note that the standard errors of GO codes with negative slopes tend to be appreciably lower than those of the top positive slopes, despite being based on similar numbers of genes, suggesting that the selective forces acting on genes in code classes that yield positive slopes are more heterogeneous. Finally, to get an idea of the level of non-independence, we also estimated the correlation between the slopes of adjacent genes, classified according to genomic separation (end of gene1, start of gene2) in 10 kb bins, finding a decline from r = 0.62 (genes less than 10 kb apart) down to r = 0.32 (genes separated by 190–200 kb), suggesting that only extremely close genes will have similar slopes owing to proximity alone.
(c) Single nucleotide polymorphism variability around 168 immune-related genes
Slopes were determined for each of the 168 genes studied by Walsh et al. , plus the gene APCS, which does not appear in the main list, but is discussed in the text. We also included Walsh et al.'s positive control, beta haemoglobin (HBB). Results are summarised in table 2. Several trends are apparent. First, the six genes identified as putatively under selection (IL9, CAV2, FUT2, ABCC1, VAV3 and APCS) and the positive control, HBB, tend to yield strongly negative slopes (−0.947, −0.908, −0.98, 0.67, −0.31, −0.912, −0.94, respectively). Indeed, IL9 and FUT2, and other members of the CAV and VAV gene families, CAV1 and VAV2, yield four of the 12 most negative values found. ABCC1 and VAV3 are very big genes (approx. 0.2 and 0.4 Mb, respectively), and both contain regions outside the window used that give strongly negative slopes, though other ATP-binding cassette (ABC) genes are also positive (see below).
The second trend is for genes with similar names to yield similar slopes. A rigorous analysis is hampered both by non-independence owing to gene clustering and the fact that our understanding of function is insufficiently complete to group genes accurately by function. Some genes with similar names may have very different functions in terms of the precise role they play. Nonetheless, several groupings stand out. All three CAV and all three VAV genes have strongly negative values, despite lying on multiple chromosomes. Similarly, all four ABC genes, all five alpha defensin (DEFA) genes and 11 of 13 interferon alpha (IFNA) genes have positive/strongly positive slopes. Interestingly, although the DEFA genes all form a single cluster, DEFT1 lies within this cluster and has a negative slope, indicating that the generally positive slopes are not owing entirely to linkage disequilibrium. IFNA genes also form a cluster on chromosome 9, but the cluster is big enough (275 kb) to contain contrasting slopes and the two group members with negative slopes lie at either end.
We show that microsatellites which are more heterozygous than expected for their repeat number tend to lie in genomic regions where SNP variability either fails to decline or actually increases with distance from Africa. Assays of regions around human genes reveal how key immune gene classes tend to show extreme SNP slopes, with antigen presentation genes having the most positive slopes and genes associated with defence against bacterial infection showing the most negative. Focusing on 168 immune genes studied previously , we find good agreement with the original conclusions in terms of genes experiencing directional selection, but also identify several candidate gene families that appear to be under balancing selection.
Previous methods for detecting the action of natural selection on the human genome have met with mixed success . Apart from the obvious problem of false positives that applies to all genome-wide analyses, many of the other methods rely on identifying regions of the genome with unusual characteristics, such as high levels of linkage disequilibrium or SNP density. Such approaches could be powerful with a complete understanding of how recombination and mutation events occur, but as yet we do not have this. Instead, it seems that recombination and mutation events tend to cluster with each other , and that rates can vary over periods of evolutionary time as short as that which separates humans and chimpanzees [11,32,43]. Also, mutations may be more common near to microdeletions  or simply to each other . Such uncertainties make the interpretation of the distribution of SNPs within any given population difficult. Methods based on finding SNPs with unusually high differences in allele frequency among populations potentially overcome these issues, but are in turn hampered by ascertainment bias, the phenomenon in which the discovery process may favour SNPs with unusually large allele frequency differences among populations [45,46], which would exacerbate the (already non-trivial) issue of false positives.
Our new method offers two potentially important advantages over other methods. First, by comparing levels of variability among global populations relative to a well-defined expectation, the strong linear decline with distance from Africa, many of the problems associated with not knowing how patterns of linkage disequilibrium and mutations came to be distributed are avoided. Second, although ascertainment bias has the clear potential to enrich for SNPs that give large Fst values, our method averages heterozygosity over many SNPs, reducing greatly the impact of one or a few unusual markers. A further aspect of our approach is that it detects selection over a well-defined time scale, specifically the period in which humans colonized the world from Africa, somewhat over 50 000 years. On the one hand, this means our method is inappropriate, for example, in detecting selection acting on humans before they left Africa. On the other hand, having a known period may allow substantial future refinement, for example by modelling the impact of recombination.
A further benefit of our method is that it detects several different forms of selection, including balancing selection. Balancing selection has previously proved difficult to detect , despite evidence that it affects a number of genomic regions [14,47]. The key issue is that the primary prediction of balancing selection, that of maintaining locally higher levels of heterozygosity , is difficult to distinguish within a population from other factors such as the presence of mutation hotspots [48–50]. However, when a population goes through a bottleneck and as a result suffers genome-wide loss of neutral diversity, those regions experiencing balancing selection should stand out as islands where diversity has been unusually retained. Our approach appears to show this, both through the fact that microsatellites with higher than expected variability for their repeat number lie in genomic regions where variability has not declined across the world, and through the fact that genes most known for balancing selection, those associated with antigen presentation, also lie in these areas.
Our approach remains somewhat crude. The analysis presented is based only on seven populations, three of which are in Africa, and using a point of origin for the decline of variability, Addis Ababa, which was chosen somewhat arbitrarily and which should probably be replaced by a location lying more in central southern Africa [19,23,25]. Use of more populations could help immensely, particularly the inclusion of populations from South America, the part of the world most distant from Africa. Another improvement involves ascertainment bias during the discovery process . Although we believe that ascertainment bias impacts rather little on our analysis overall, there remains a concern that locally one or a few unusual SNPs could impact our analysis. Use of larger SNP datasets based on markers developed so as to minimize bias would help reduce this potential problem further. Arguably the biggest improvement is likely to be achieved through a more sophisticated statistical analysis. We currently treat all SNPs as equal and independent (in the sense that we do not recover phase), even though it is clear that recombination rates vary widely across the genome. Algorithms that reconstruct phase and estimate local recombination rates  have the potential to yield appreciably improved estimates of heterozygosity, based more on haplotype blocks than on individual SNPs. A further issue relates to gene classification. When analysing all genes together we were forced to use a pragmatic rule of counting many genes several times, one for each GO code attracted. While this should not bias our results in terms of creating consistently high or low slopes for immune genes, it is clearly sub optimal. More focused studies should, by their nature, be able to avoid this problem. For example, one might compare members of a gene family, some involved in immune function and some not.
Finally, it is worth considering how our method works in practice. Applied to a list of known, immune-related genes, we find that our method tends to yield strong negative slopes when applied to ‘hits’ generated by other tests. Strong negative slopes should indicate purifying selection, selection that has acted to accelerate the loss of diversity relative to neutral sites. This makes biological sense, in that the other tests are generally aimed at detecting patterns generated by this form of selection, and we can imagine that humans encountered many new pathogens as they moved into new areas and encountered new foods and new climates. However, we also find several gene clusters that yield strongly positive values, suggestive of balancing or diversifying selection. These include: ABC proteins, a group only recently recognized as being important in the immune system, but whose functions include regulation of antigen presentation ; defensins, specifically the DEFA group, involved with defence against bacteria and antitoxin activity ; and IFNA, a group of proteins with direct antiviral, antiproliferative and immunomodulatory properties . Across all qualifying GO codes, we find that immune-related genes are over-represented both in genes yielding extremely high and extremely low slopes, suggesting that immune genes in general are more likely than average to be under selection.
In conclusion, we present a new method for detecting the action of natural selection on the human genome that exploits our unusual demographic history. By using as our null hypothesis the changes in diversity that are known to have occurred when humans moved out of Africa to colonize the world, we bypass many of the uncertainties that attach to other approaches. Our method appears effective in pinpointing immune-related genes as foci for natural selection, supporting the findings of other studies . Future expansion of SNP datasets to embrace further populations and rigorous modelling to determine null distributions for our measure should increase its power.
- Received September 23, 2010.
- Accepted October 18, 2010.
- This Journal is © 2010 The Royal Society