## Abstract

Introgression is the permanent incorporation of genes from the genome of one population into another. This can have severe consequences, such as extinction of endemic species, or the spread of transgenes. Quantification of the risk of introgression is an important component of genetically modified crop regulation. Most theoretical introgression studies aimed at such quantification disregard one or more of the most important factors concerning introgression: realistic genetical mechanisms, repeated invasions and stochasticity. In addition, the use of linkage as a risk mitigation strategy has not been studied properly yet with genetic introgression models. Current genetic introgression studies fail to take repeated invasions and demographic stochasticity into account properly, and use incorrect measures of introgression risk that can be manipulated by arbitrary choices. In this study, we present proper methods for risk quantification that overcome these difficulties. We generalize a probabilistic risk measure, the so-called hazard rate of introgression, for application to introgression models with complex genetics and small natural population sizes. We illustrate the method by studying the effects of linkage and recombination on transgene introgression risk at different population sizes.

## 1. Introduction

Human activity has dramatically increased the rate of hybridization between species or ecotypes by agriculture, trade and travel. An important consequence is the potential occurrence of introgression, when genes from the genome of one population or species become permanently incorporated into the genome of another. Hybridization and introgression can have undesirable effects, such as the extinction of endemic species, or the spread of resistance genes, which may, for instance, result in an increased weediness of plant species [1,2]. Especially, the application of genetically modified crops in agriculture has raised many concerns about the incidence of introgression. Quantification of the risk of transgene introgression is therefore a key component of the regulation of genetically modified (GM) crops. There are two aspects to risk: the probability of occurrence and the consequences. In this study, we focus on the first. Most theoretical introgression studies aimed at quantifying this aspect disregard one or more of the most important factors involved: realistic genetical mechanisms, repeated invasions and stochasticity.

The inclusion of realistic genetical mechanisms, especially pertaining to the location of a transgene in the crop genome, is of critical importance. Linkage to a crop gene that is under positive selection in the wild population may considerably enhance introgression, whereas linkage to a deleterious gene will reduce it. Thus, multi-locus genetics are an important aspect of introgression risk. It has, however, been ignored in the modelling literature until now [3]. Specifically, the use of genomic linkage as a strategy to mitigate introgression is still in the conceptual stage [4].

The application of genetic models for introgression risk assessment is hampered by the fact that until now there has been no universal, standardized quantification method. Different measures are used in different studies, and most of these are seriously flawed. We first give an overview of the previously used methods, and outline their limitations. We then propose that a measure called the ‘hazard rate’ can be used for quantifying introgression risk, and show how this statistic can be calculated from branching-process models with multi-locus genetics, as well as from simulations. Finally, we illustrate the method by considering the effects of genetic linkage and recombination at different population sizes.

## 2. Quantifying introgression risk

At first sight, it may seem as though the spread of a gene through introgression might be studied with classical population genetics models that include selection and recombination. However, there are important differences, because these models consider only evolutionary time scales, where new genotypes are created through mutation. The time between successive mutations is assumed to be very large compared with the generation time. Therefore, each new mutation can be considered as a single invasion into a stable population, and the effects of repeated invasions are ignored. In situations with gene flow over an extended period, invasion repetition is an important element that changes the population genetic dynamics considerably. Yet many genetic introgression studies are based on deterministic models with single invasion attempts [5]. The relative fitness advantage of an invading gene is then used to measure invasion risk. This measure is closely related to the probability of success of a single invasion in an infinitely large population [6,7], or the fixation probability of single mutations in finite populations [8].

There are several reasons why the probability of fixation is an inappropriate measure of introgression risk. When the repetition of invasions persists indefinitely, the foreign gene will eventually become fixed in any population of finite size owing to genetic drift, regardless of its fitness effects. Models based on single invasions, however, predict that the establishment probability of deleterious invading genes is zero for infinite populations, or very small when population size is finite. Similarly, repetitive invasions of advantageous genes in (infinitely) large populations will have a success probability of one, even if the success probability of a single invasion is very small. Even when repetitive invasions occur only during a finite time period, the establishment probability of an invading gene is much higher than would be predicted on the basis of the single invasion scenario, because the existence of multiple invaders increases the probability that fixation occurs through drift.

Even in genetic studies that do consider repeated invasions, the measures currently proposed for introgression risk are suspect. The most popular risk measure is the frequency of the transgene after a fixed period of gene flow [9]. This measure is subjective, because the choice of the period length is arbitrary. Furthermore, it says little about the eventual fate of the transgene. For example, a constant flow of a near-neutral transgene into a wild population may lead to a considerably high transgene frequency in a population, even when it is still likely to be removed by drift after gene flow stops. On the other hand, when the transgene is under positive selection and the population size is large, even a nearly undetectably low frequency may result in fixation [10]. An alternative measure used is the fixation probability of the transgene after a certain number of generations of gene flow [10]. While this measure overcomes some of the objections raised earlier, it also depends on the (arbitrarily chosen) period of gene flow. It increases with the length of the period, and converges to a value of one, because the foreign gene will always become fixed in any population of finite size due to genetic drift, regardless of its fitness effects.

The current genetic introgression studies also ignore another crucial aspect of introgression. Because the initial numbers of invaders are usually small, invasion attempts fail several times owing to demographic stochasticity, before permanent establishment is initiated. The time until this initiation is an important characteristic of invasion risk that should be included in its quantification.

In previous papers, we presented a method to quantify introgression risk properly, taking repeated invasions and demographic stochasticity into account. We derived a probabilistic risk measure, the so-called hazard rate of introgression, based on branching-process models. These models, however, assumed simple single-locus two-allele dynamics, and (infinitely) large recipient populations. In this study, we extend this approach in two ways. First, we extend the branching-process calculations to include more realistic genetical mechanisms. This results in a method for proper introgression risk quantification in wild populations of moderate to large sizes, that is computationally much more efficient than the usual simulation-based estimation. Second, to study introgression risk for small populations, we develop a method to calculate hazard rates from simulations. We demonstrate these new methods with a two-locus model with recombination. This allows us to study the effect of linkage between the transgene and a domestication gene under negative selection. In addition, we examine the accuracy of the branching-process approximation for different population sizes.

We discuss our methods in the context of transgene introgression from GM crops into wild populations, but they can be used in any situation where repeated invasions occur. For example, they may have important applications in the evolutionary dynamics of microbial systems, where mutation rates are high, and additional modes of genome modification occur, such as bacterial competence [11]. Other examples are epidemic processes [12], exotic species invasions [13], and the origin, growth and spread of tumours [14].

## 3. Definition of the hazard rate

The hazard rate is defined as the probability per time unit that a so-called introgression event occurs, given that it has not previously occurred [15]. The use of hazard rates is well established in the field of survival analysis, where they represent instantaneous mortality risks. The interested reader can find more information on this in the work of Kalbfleisch & Prentice [15], for example. If *T* denotes the (random) time at which an introgression event occurs, then the hazard rate at time *n* can be expressed as follows:
3.1so the hazard rate is a function of time that is in one-to-one correspondence with the distribution of *T*. Further information about the nature of this correspondence in an introgression context can be found in a paper by Ghosh & Haccou [16].

In previous models, we considered (infinitely) large wild populations, where the probability of interaction between hybrid lineages can be ignored. In that case, an introgression event corresponds to the initiation of a permanently introgressed lineage. In finite populations, this definition is problematic, because individuals from different lineages may produce offspring together. Therefore, permanent introgression is not necessarily initiated by a single lineage. One possibility would be to study the frequency of the transgene after a certain period, as done by Thompson *et al*. [9]. The time at which this frequency exceeds a given level could then be considered as the starting point of introgression. However, because the choice of the threshold frequency is arbitrary, this is a doubtful method.

Here, we propose an alternative definition: an introgression event has occurred at or before a specific time if the invading allele will go to fixation in the population even if no further invasions occur after that time. This redefinition is slight, and with respect to the asymptotic hazard rate, it is equivalent to the one we used in previous papers [16,17]. However, it does have the important consequence that the hazard rate can be estimated from simulations by means of survival analysis methods [15]. This implies that with the methodology presented in this study hazard rates can be computed for stochastic population dynamic models with any degree of genetic and/or ecological complexity, and for small- or medium-sized populations, using simulation-based approaches.

The hazard rate can also be calculated from branching-process models, using the approach that we developed previously for a single-locus system [16]. Although this is an approximation, based on infinitely large population sizes, it has the advantage that it does not require computer simulations, but only the numerical solution of a system of equations, which is a computationally much more efficient method.

As an illustration, we show here how to calculate hazard rates for two-locus two-allele dynamics. It should be noted, though, that our methods can straightforwardly be generalized to more complex cases. For the situation that we consider here, it turns out that the branching-process approximation already works extremely well for population sizes of about 100 individuals.

## 4. Models and computations

We consider situations with a flow of pollen from a crop field into a nearby population of a wild relative. The crop contains a transgene conferring a positive fitness effect, which is physically linked to a domestication gene with a negative fitness effect. In heterozygotes, recombination can cause the transgene to become uncoupled from the domestication gene, creating the haplotype with the highest fitness. The crop is assumed to be homozygous for the crop alleles at both the transgene and the domestication gene; the wild population is assumed to be homozygous for the wild-type alleles at both loci. We represent the transgene and domestication gene alleles by the capital letters *A* and *B*, respectively. The corresponding wild-type alleles are given by the lower-case letters *a* and *b*, respectively. We use the symbol *w _{i}* to denote the fitness of an individual of genotype

*i*. Wild-type individuals (

*abab*) are taken to have a fitness of one, and the fitness of the other genotypes is expressed relative to this. All plants in the model are hermaphroditic annuals, and mate randomly. The wild population is assumed to have a fixed size.

### (a) Branching-process approach

With this approach, the wild population is assumed to be large enough such that hybrids and their descendants initially cross only with wild individuals (see [16]). Consequently, in the initial phase of the invasion, there are no hybrid–hybrid crosses, which means that there are no homozygotes for either the transgene or domestication allele in the wild population. While it is true that such homozygotes will eventually be produced in finite populations, we assume that in sufficiently large populations, the success of an invasion is determined by this initial phase. This allows hybrid lineages to be considered independent of each other, which simplifies the dynamics considerably.

We here consider a model where a Poisson-distributed number (*m*) of F_{1} hybrids (genotype *ABab*) is produced per generation by crosses between individuals from the crop (*ABAB*) and the wild population (*abab*). The following generation, these F_{1} hybrids produce a number of offspring according to a Poisson distribution with mean of 2*w _{ABab}*. Thus, the number of offspring produced by a genotype is twice its fitness. A fitness of one thus corresponds to an individual producing on average two offspring, because each parent contributes only half of an offspring's chromosomes. The genotypes of the offspring produced by the F

_{1}hybrids depend on the recombination rate (

*r*) between the transgene and the domestication gene. With probability ½

*r*, the genotype is

*Abab*; with probability ½

*r*, it is

*aBab*; with probability ½ (1 −

*r*),

*ABab*; and with probability ½ (1 −

*r*), it is

*abab*. Because we assume that the population is large, these four genotypes are the only ones we have to take into account. Furthermore, we have to consider the fate of only the two genotypes that carry the transgene (i.e.

*ABab*and

*Abab*), because the others cannot initiate lineages that lead to the permanent introgression of the transgene.

As argued by Ghosh & Haccou [16], the hazard rate increases from zero to an asymptote when there is continuous gene flow (figure 1). This asymptotic hazard rate provides a convenient measure of introgression risk. Details of the derivation of the asymptotic value are given in the appendix in the electronic supplementary material. Here, we summarize the main results. According to branching process theory (see the electronic supplementary material, appendix) [18], the extinction probability of a lineage after a single invasion of the genotype *ABab* equals the smallest root of the following equation:
4.1
where *w _{ABab}* represents the fitness of an individual of genotype

*ABab*,

*r*represents the recombination rate between the loci of the transgene and domestication gene and

*q*represents the extinction probability of the lineage of a single

_{Abab}*Abab*individual, which is the smallest root of the following equation: 4.2where

*w*denotes the fitness of an individual of genotype-

_{Abab}*Abab.*The smallest root

*q*will be less than one when

_{Abab}*w*is greater than one. Equation (4.2) can be solved numerically, and then used along with equation (4.1) to calculate the extinction probability of a single hybrid individual. This extinction probability can be combined with the hybridization rate to find an expression for the asymptotic hazard rate (see the appendix in the electronic supplementary material for full details): 4.3where

_{Abab}*m*is the average number of hybrids produced per generation.

### (b) Population genetic simulation approach

The simulation-based approach considers a wild population of a fixed size, *N*, which may be small. Consequently, hybrids may mate with other hybrids, and all possible genotypes can appear. The population is represented by a vector, where each item corresponds to the number of individuals of a given genotype. This vector has length 16 (rather than nine), because we distinguish paternal and maternal chromosomes. The life cycle progresses according to three stages: reproduction, death of adults and germination of seeds. Reproduction takes place through the production of exactly *N* seeds. For this purpose, first, the expected frequency of the 16 genotypes in the following generation is calculated given random mating, their current frequency and the fitness effects of the two loci. Then, the frequencies of the 16 genotypes among the produced seeds are drawn as random numbers from a multinomial distribution with the expected frequencies as probabilities. A small number of randomly selected seeds are then replaced by seeds created through hybridization between the wild population and the crop. For this, the paternal contribution of the selected seeds is replaced with an *AB* gamete from the crop. The number of hybrids produced is chosen by drawing a random number from a binomial distribution with *N* trials and a success probability of *m*/*N*. In the limit of large *N*, this becomes a Poisson distribution with a mean of *m*, which agrees with the use of a Poisson distribution in the branching-process approach. After these seeds have been created, the adult individuals die, and all seeds germinate and establish themselves. The simulation model was programmed and run in the programming package R. For every combination of parameters settings, we ran 100 000 replicates, varying the number of generations during which hybridization took place from 1 to 100.

To estimate the hazard rate from such simulations, we need to find the probability that an introgression event has occurred at each time step. This is done by running the simulation model with continuous hybridization for *n* generations. At that time, hybridization stops and the simulation continues until the transgene is either fixed or has disappeared from the population. The proportion of replicates that reach fixation after *n* generations of hybridization provides an estimate of the probability *P*(*T* ≤ *n*). From these probabilities, the hazard rate can be calculated at each value of *n*, using equation (3.1). Calculated in this way, the hazard rate will reach an asymptote, such as for the branching-process approach. Unfortunately, for large values of *n*, the probabilities in equation (3.1) can be very low, giving a large error in the estimation of the asymptotic hazard rate. However, during the asymptotic phase, there is a linear relationship between *n* and ln(*P*(*T* > *n*)) (figure 1). This means that a more accurate estimate of the asymptotic hazard rate can be obtained by means of the slope of the linear regression of ln(*P*(*T* > *n*)) [15]. The location where the linear part of this function starts has to be determined either by eye, or through formal methods for estimating lag-times in exponential distributions [19]. Note that this has to be done separately for every combination of parameter values, because the rate at which the asymptote is approached may differ between combinations.

## 5. Results

In analysing our results, we take the hybridization rate, *m*, to be fixed. It has been previously shown [16] that the asymptotic hazard rate increases monotonically with *m*.

Figure 2 shows the effect of the fitnesses of type *ABab* and *Abab* individuals on the asymptotic hazard rate. As expected, the asymptotic hazard rate increases with increasing fitness for both genotypes. The rate of increase depends on both the population size and the recombination rate. Figure 2 also shows that there is a strong interactive effect of the recombination rate and the population size. At a low recombination rate (*r* = 0.005), the hazard rate is generally the highest for the smallest population size (*N* = 10). On the other hand, at higher recombination rates (*r* = 0.05, *r* = 0.5), the same population size generally gives the lowest asymptotic values for the hazard rate. The effect of changing the fitness of type *Abab* individuals is larger at high recombination rates than at lower recombination rates.

The interactive effects of recombination rate and population size are studied in more detail in figure 3. For the branching process, an increase in the recombination rate simply results in an increase in the asymptotic hazard rate. In the simulation model, the hazard rate reaches very high levels in extremely small populations, consisting of just a few individuals. At intermediate population sizes, of about 10–20 individuals, however, the hazard rate is quite low. At large populations, the hazard rates of the simulation-based method approach the branching-process value. As can be seen from the figure, the branching-process approximation already works well at population sizes of 100 individuals, especially with low recombination rates.

The effect of drift at different population sizes is studied in figure 4. At small resident population sizes, it appears that invasion is largely due to short-term chance events, whereas long-term selection appears more important at larger resident population sizes.

## 6. Discussion

The hazard rate provides an intuitive and general way to quantify introgression risk with repeated invasions. In previous papers, we demonstrated how this measure can be used to study introgression risk in relation to fitness parameters [16] and crop-management schemes [17]. The underlying genetics, however, did not go beyond one-locus two-allele situations. With the methods presented in this study, more realistic population genetics can be examined. For the sake of simplicity, we demonstrated this by a two-locus two-allele model, but generalization of our methods to more complex genetic mechanisms is straightforward. Similarly, more complex life cycles or ecological conditions can be incorporated. Thus, the current generalization makes it possible to calculate hazard rates for realistic models with a high level of complexity.

We presented two complementary approaches for hazard-rate calculations. For small populations, where drift plays an important role, simulation models should be used to calculate hazard rates. On the other hand, branching-process models provide a fast and efficient way of approximating hazard rates for large populations, where simulations would require much time and computer space. As demonstrated by our results, this approximation is very efficient, and may already work well at moderate population sizes (of 100 individuals or more).

As seen in figure 3, the effects of drift can be counterintuitive and cannot be extrapolated from the results of branching-process models. While one might expect that drift always increases the probability of a successful invasion, at some intermediate population sizes invasion risks are smaller than at larger population sizes. The reason for this lies in the fact that selection is the main driving force behind invasions in larger populations. Under strong selective pressures, the invasion risk can be high at large populations, but smaller at small populations because potential invaders can be removed from the population by drift.

Another difference between the results of the two approaches occurs when the fitness of the introgressed *Abab* genotype is equal to one (i.e. when the transgene is selectively neutral). In this case, the branching-process model predicts a hazard rate of zero. This is because at neutrality, the transgene can only go to fixation as a result of drift, and this is not possible in infinitely large populations. In the simulation model, drift does occur, and therefore the asymptotic hazard rate for this model need not approach zero with these identical parameters.

The results from both approaches show that recombination rates between loci have large effects, and thus that linkage is an important aspect of introgression modelling. As figure 3 shows, lowering the recombination rate reduces the hazard rate significantly, which implies that the expected time until the initiation of introgression increases [16]. For the lowest examined recombination rate (*r* = 0.005), this expected time has an order of magnitude of 100 generations. There is an interaction between the effect of linkage and fitness. At high recombination rates, there are more type *Abab* individuals produced, and so changing the fitness of such individuals has a larger effect than at low recombination rates (figure 2). The sensitivity of the hazard rate to recombination rates is smallest in small populations, where introgression is primarily driven by drift (figure 3).

We find high hazard rates at small population sizes that are of the same order of magnitude as the hybridization rate. This is because hybridization alone can be enough to push invading genes to fixation, as can be seen in figure 4. Consequently, many copies of the domestication gene also go to fixation under these circumstances. The hazard rate typically reaches a minimum at population sizes of the order of 10–20. This is because introgression is primarily still driven by drift in these circumstances, and the number of invaders is small compared with the number of residents, so drift generally acts to push these invaders out of the population. At larger population sizes, of the order of 80 and higher, selection becomes the dominant factor, and the results from numerical simulations approach those predicted by branching processes.

These results shed light on the circumstances when branching-process models can be used as good predictors for invasion risk, and when simulation models should be used. To date, little work has been done in investigating how large a resident population is necessary for the results of branching-process models to hold (but see [20]). Our results suggest that branching processes provide good approximations for population sizes that are ecologically relevant.

While the approaches outlined are important for calculating introgression risks, there is still much to be done. In the simulation model, population sizes were assumed to be fixed, which is usually not the case in reality. Also, we have not taken into account the metapopulation dynamics of wild populations, which would be an important aspect of a more complete model [10,21]. Another extension is to consider time-inhomogeneous processes, such as caused by, for example, crop-management schemes, as considered by Ghosh *et al*. [17].

In conclusion, hazard rates characterize the important aspects of invasion risk in situations with repeated invasions. With the methods described here, they become applicable for models with realistic genetic as well as ecological mechanisms, providing an accurate measure of risk in complex situations.

## Acknowledgements

This research was funded through the research programme Ecology Regarding Genetically modified Organisms (ERGO), commissioned by four Dutch ministries. The main investor is the Ministry of Housing, Spatial Planning and the Environment (VROM). This funding programme is managed by the Earth and Life Sciences Council (ALW) of the Netherlands Organization for Scientific Research (NWO). P.H.'s research is additionally supported by the Nonlinear Dynamics of Natural systems (NDNS) programme of the NWO. We also thank Peter van Tienderen, Klaas Vrieling, Tom de Jong, Jun Rong, Wil Tamis and Cilia Grebenstein for helpful discussions.

- Received August 15, 2012.
- Accepted September 18, 2012.

- This journal is © 2012 The Royal Society