- Split View
-
Views
-
Cite
Cite
Hyunjin Shim, Stefan Laurent, Sebastian Matuszewski, Matthieu Foll, Jeffrey D Jensen, Detecting and Quantifying Changing Selection Intensities from Time-Sampled Polymorphism Data, G3 Genes|Genomes|Genetics, Volume 6, Issue 4, 1 April 2016, Pages 893–904, https://doi.org/10.1534/g3.115.023200
- Share Icon Share
Abstract
During his well-known debate with Fisher regarding the phenotypic dataset of Panaxia dominula, Wright suggested fluctuating selection as a potential explanation for the observed change in allele frequencies. This model has since been invoked in a number of analyses, with the focus of discussion centering mainly on random or oscillatory fluctuations of selection intensities. Here, we present a novel method to consider nonrandom changes in selection intensities using Wright-Fisher approximate Bayesian (ABC)-based approaches, in order to detect and evaluate a change in selection strength from time-sampled data. This novel method jointly estimates the position of a change point as well as the strength of both corresponding selection coefficients (and dominance for diploid cases) from the allele trajectory. The simulation studies of this method reveal the combinations of parameter ranges and input values that optimize performance, thus indicating optimal experimental design strategies. We apply this approach to both the historical dataset of P. dominula in order to shed light on this historical debate, as well as to whole-genome time-serial data from influenza virus in order to identify sites with changing selection intensities in response to drug treatment.
The common assumption of constant selection intensity through time utilized in many tests of selection is often criticized as unrealistic in natural and experimental populations, both owing to environmental changes (e.g., fluctuations in climate, predator density, or nutrient availability) as well as to genetic changes (e.g., epistasis, clonal interference). Despite this, such considerations are not accounted for in most population genetic models, since inferring changing selection coefficients (s) from single-time point polymorphism data are difficult. However, owing to recent technological advances, time-sampled polymorphism data are increasingly available, and time-serial analytical methods are expanding (Malaspinas et al. 2012; Mathieson and McVean 2013; Foll et al. 2014; Lacerda and Seoighe 2014; and see review of Bank et al. 2014), allowing for an empirical evaluation of the importance of changing s models.
Fluctuating selection in natural populations was suggested by Wright (1948) with regards to the phenotypic time-serial data of Panaxia dominula (scarlet tiger moth) to account for its observed annual fluctuations in coloring polymorphism (Fisher and Ford 1947). Since, there have been several theoretical considerations of fluctuating selection (Kimura 1954; Karlin and Levikson 1974; Karlin and Lieberman 1974; Chevin 2013; Gossmann et al. 2014; Gompert 2015), as well as many observations of fluctuating selection in natural populations (for a review, see Bell 2010). Nonetheless, until recently, analyses of fluctuating selection centered on random or seasonal oscillations of selection strength through time, as the mathematical complexity of analytical methods only allowed the simplest cases to be considered.
Approximate Bayesian Computation (ABC) has the advantage of being flexible in integrating complex models due to computational efficiency and the lack of likelihood computation (Beaumont 2010). Recently, a hierarchical ABC-based method based on the Wright-Fisher model was developed in order to infer genome-wide effective population size and per-site selection coefficients from whole-genome multiple-time point datasets (Foll et al. 2014, 2015). While the initial approach performs well overall, the authors noted the possibility for observations inconsistent with a single-s Wright-Fisher model; this was indeed observed at certain sites in their analysis of the influenza virus genome, which were simply excluded from consideration. Thus, as a natural extension, we here investigate the presence of changing selection intensities in these outlier single nucleotide polymorphisms (SNPs); in doing so, we also develop an extended Wright-Fisher ABC-based method capable of detecting and quantifying changing selection intensities through time (termed CP-WFABC).
Methods
Wright-Fisher ABC-based method
The Wright-Fisher model simulator with a change point in selection strength is used to simulate the data X, with relative fitnesses wAA = 1 + s, wAa = 1 + sh and waa = 1 for the diploid model, and wA = 1 + s and wa = 1 for the haploid model (Ewens 2004). Initially, the random sampling of an allele from generation 1 to generation CP-1 is simulated using s1, and onwards from the change point (CP) using s2. In order to simulate realistic allele trajectories with changing selection coefficients, the allele needs to be segregating at the time of the change point. This condition is necessary since the change in selection coefficient cannot occur if the allele is either lost or fixed beforehand, assuming the infinite-site model with no back mutations. Thus, only alleles segregating at the change point are accepted as simulated datasets.
i. Simulate K trajectories from the Wright-Fisher model with a change in selection intensity, with θ randomly sampled from its prior P(θ), conditional on the allele segregating at the change point.
ii. Compute U(xk) for each trajectory using the Fs’i summary statistic between all consecutive sampling time points i: U(xk,i) where i = 1,…,M−1 where M is the last sampling time point.
iii. Retain the 1000 best simulations with the smallest Euclidian distance between U(xk,i) (from the simulated) and U(Xi) (from the observed) to obtain an approximate posterior density of P(θ|X).
For the second step, it is important to note that the Fs’ summary statistic is calculated between every pair of consecutive sampling time points (Figure 1). This construction of the summary statistic enables information on both the timing and strength of the allele frequency change to be captured, as the timing of the change is essential in detecting the change point and the strength of the change is essential in estimating the corresponding selection coefficients. For the diploid model, an additional parameter h is inferred jointly with the other three parameters, as its value is one of the determining factors in the timing of allele frequency change (Haldane 1932).
For the third step, the simulated Fs’ summary statistics U(xk,i) between every pair of consecutive sampling time points are compared with the corresponding observed Fs’ summary statistics U(Xi) – allowing a small fraction of the simulated trajectories with allele frequency changes that best match the observed trajectory (in terms of both timing and strength) to be retained.
Wright-Fisher ABC-based method with change point analysis
Simulated data with constant selection and with changing selection
We generated simulated datasets of different effective population sizes using the Wright-Fisher model for two scenarios: (1) trajectories of constant selection with only s and h (for diploid models) as parameters, and (2) trajectories of changing selection with s1, s2, CP, and h (for diploid models) as parameters. For selection coefficients, uniform priors of (−1, 1) were used. The uniform prior of CP was set to occur between the second generation and the second-to-last generation (2, T−1), where T is the number of generations of the population in the time-serial data. The dominance coefficient h for the diploid model was randomly drawn from one of three values: complete recessiveness, codominance, or complete dominance [0, 0.5, 1]. Although these prior ranges are uninformative, the constraint on the trajectories to be segregating at the change point shapes the distribution of the prior ranges according to the input parameters such as ploidy, effective population size, initial allele frequency, and number of generations; the updated priors for the haploid population of Ne = 100 and the diploid population of Ne = 50 are shown as examples (Supplemental Material, Figure S1 and Figure S2).
The other input values – such as the number of generations (T = 100), the sampling time points (M = 10), the sample size (n = 100), the initial allele arising as a new mutation, and the ascertainment of observing a minimum frequency at 2% – were kept constant for the two scenarios. We retained the best 0.1% of 1,000,000 simulations for each pseudoobservable trajectory using the rejection algorithm based on the Euclidean distance as described above. The mode of the posterior distribution from the best simulations (Sunnåker et al. 2013) was used to evaluate the estimated parameter value against the true parameter value.
Data availability
The R package of CP-WFABC is available on jensenlab.epfl.ch.
Results
Model choice in the change point Wright-Fisher ABC method
We considered two cases for the pseudoobservables to test the sensitivity and specificity of the ABC model choice: the first case when the pseudoobserved trajectories have a single selection coefficient, and the second case when they have changing selection coefficients. One thousand pseudoobservable trajectories were generated for each case with the data ascertainment minimum frequency set to 2% for at least one of the sampling time points. Additionally, for the second case, pseudoobservable trajectories were accepted only when the allele was segregating at the time of the change point – a constraint for realistic combinations of selection coefficients, change points, and dominance (for diploids) – in order to reproduce changing selection trajectories in real datasets. All other input values were kept constant as in the simulated datasets described in the previous section.
The results of the ABC model choice from a haploid population with Ne = 100, and a diploid population with Ne = 500, are represented as ROC curves (Xavier et al. 2011) in Figure 2. Specificity is given on the x-axis showing the true negative rate, while sensitivity is given on the y-axis showing the true positive rate of the Bayes factor B1,0 calculated from 1000 pseudoobservables of changing selection (where B1,0 should be large) and 1000 pseudoobservables of constant selection (where B1,0 should be small). The overall ROC curves in black (all trajectories) show that when the specificity threshold is most conservative in detecting no false positives (i.e., B1,0 = infinite), the Bayes factor B1,0 has a sensitivity of ∼30% for all populations. Considering that the pseudoobservable trajectories were simulated randomly from a wide range of prior values, the Bayes factor B1,0 from CP-WFABC is, in general, sensitive and specific. The ROC curves in black for the other haploid and diploid populations (Figure S3) also indicate that the Bayes factor B1,0 is sensitive and specific as they are above the diagonal line of no-discrimination. The area under the ROC curve (AUC) is used to assess how reflective the Bayes factor B1,0 is of the true model, as summarized for all pseudoobservable populations in Table 1. The AUC values show that the Bayes factor B1,0 is ∼80% more probable to rank a randomly chosen changing selection case above a randomly chosen constant selection case. Additionally, the distribution of Bayes factors B1,0 under the null model M0 (i.e., case 1) was used to compute the significance level α at 1% (Good 1992). For both diploids and haploids, the significance threshold is higher for smaller population sizes (Table 2), and the calculation of these thresholds will be important in any given data application.
AUC values and confidence intervals for ROC curves
ROC (Receiver operating characteristic) curves . | |||
---|---|---|---|
100 | 1000 | 10000 | |
AUC | 0.7936 | 0.7988 | 0.7943 |
CI | 0.7756–0.8124 | 0.7797–0.8181 | 0.7750–0.8140 |
ROC (Receiver operating characteristic) curves . | |||
---|---|---|---|
100 | 1000 | 10000 | |
AUC | 0.7936 | 0.7988 | 0.7943 |
CI | 0.7756–0.8124 | 0.7797–0.8181 | 0.7750–0.8140 |
ROC curves for haploid populations . | |||
---|---|---|---|
50 | 500 | 5000 | |
AUC | 0.8233 | 0.7378 | 0.7470 |
CI | 0.8051–0.8402 | 0.7164–0.7593 | 0.7257–0.7683 |
ROC curves for haploid populations . | |||
---|---|---|---|
50 | 500 | 5000 | |
AUC | 0.8233 | 0.7378 | 0.7470 |
CI | 0.8051–0.8402 | 0.7164–0.7593 | 0.7257–0.7683 |
ROC, ; AUC, area under the ROC curve; CI, confidence interval.
ROC (Receiver operating characteristic) curves . | |||
---|---|---|---|
100 | 1000 | 10000 | |
AUC | 0.7936 | 0.7988 | 0.7943 |
CI | 0.7756–0.8124 | 0.7797–0.8181 | 0.7750–0.8140 |
ROC (Receiver operating characteristic) curves . | |||
---|---|---|---|
100 | 1000 | 10000 | |
AUC | 0.7936 | 0.7988 | 0.7943 |
CI | 0.7756–0.8124 | 0.7797–0.8181 | 0.7750–0.8140 |
ROC curves for haploid populations . | |||
---|---|---|---|
50 | 500 | 5000 | |
AUC | 0.8233 | 0.7378 | 0.7470 |
CI | 0.8051–0.8402 | 0.7164–0.7593 | 0.7257–0.7683 |
ROC curves for haploid populations . | |||
---|---|---|---|
50 | 500 | 5000 | |
AUC | 0.8233 | 0.7378 | 0.7470 |
CI | 0.8051–0.8402 | 0.7164–0.7593 | 0.7257–0.7683 |
ROC, ; AUC, area under the ROC curve; CI, confidence interval.
The Bayes factor (BF) thresholds for the significance level α of 1% computed using the distribution of Bayes factors under the null model M0
Diploid populations . | |||
---|---|---|---|
50 | 500 | 5000 | |
BF | 4.7 | 1.6 | 1.3 |
Diploid populations . | |||
---|---|---|---|
50 | 500 | 5000 | |
BF | 4.7 | 1.6 | 1.3 |
Haploid populations | |||
100 | 1000 | 10000 | |
BF | 3.7 | 3.2 | 1.7 |
Haploid populations | |||
100 | 1000 | 10000 | |
BF | 3.7 | 3.2 | 1.7 |
Diploid populations . | |||
---|---|---|---|
50 | 500 | 5000 | |
BF | 4.7 | 1.6 | 1.3 |
Diploid populations . | |||
---|---|---|---|
50 | 500 | 5000 | |
BF | 4.7 | 1.6 | 1.3 |
Haploid populations | |||
100 | 1000 | 10000 | |
BF | 3.7 | 3.2 | 1.7 |
Haploid populations | |||
100 | 1000 | 10000 | |
BF | 3.7 | 3.2 | 1.7 |
Following the detection of changing selection trajectories using ABC model choice, the quality of parameter estimation by the model chosen was evaluated. The cross-validation results from the haploid population of Ne = 100 are shown in Figure 3 and those from the diploid population of Ne = 500 in Figure 4 (see Figure S4, Figure S5, Figure S6, and Figure S7 for additional results). For the case where the pseudoobservables were of constant selection, the estimation for a single s (and the dominance h for diploid) using CP-WFABC is very accurate, as the mode of the best simulations from the M0 model for each pseudoobservable lies along the red diagonal line. Exceptions include uninformative trajectories where the allele surpasses the minimum frequency of 2% in the first sampling and is lost immediately due to genetic drift or negative selection and therefore not observed in subsequent samplings. Such uninformative trajectories will always keep the same set of best simulations from the M0 model since their selection strength is indistinguishable, and they result in horizontal lines along the estimated negative value. This phenomenon is particularly pronounced when population size is small as shown in Figure 3 and Figure S6, since the role of genetic drift is more significant.
For the second case, when the pseudoobservables are of changing selection intensity, the joint estimation of the parameters is also effective for a restricted range of values. In Figure 3, Figure 4, Figure S4, Figure S5, Figure S6, and Figure S7, the mode estimation of each pseudoobservable is color-coded according to the three categories of trajectory shape. The green dots are pseudoobservable trajectories that change from positive s1 to positive s2. The blue dots are those that change from positive s1 to negative s2, while the magenta colors include all other cases (e.g., neutral or negative s1 to any value of s2). There is a clear clustering by category – with the best estimation being of positive values of s1 below 0.5, moderate values of s2 between −0.5 and 0.5, and CP values for the blue category of positive s1 to negative s2. In trajectories other than those with positive s1 to negative s2, the change point is difficult to detect, particularly for diploid populations where the additional dominance parameter h was estimated (Figure 4, Figure S6, and Figure S7). This trend is also observed when the ROC curves are generated according to these three categories (Figure 2 and Figure S3). For all populations, the Bayes factor B1,0 is more sensitive and specific for trajectories changing from positive s1 to negative s2 (ROC curves in blue), reaching above 60% of the true positive rate when there are no false positives. Despite the restricted range of good parameter estimation in s1, s2, and CP, the estimation of dominance is robust for both cases of constant and changing selection (except for the small population size of Ne = 50; Figure S6).
R2 values of the parameter estimation with the ABC (approximate Bayesian computation) model choice for haploid populations
Case 1: 1000 constant selection pseudoobservables . | |||
---|---|---|---|
M0 (True) | 100 | 1000 | 10,000 |
s | 0.558 | 0.843 | 0.914 |
M1 (False) | 100 | 1000 | 10,000 |
s 1 | 0.366 | 0.769 | 0.834 |
s 2 | 0.608 | 0.752 | 0.685 |
CP | −16 | −5.97 | −2.27 |
Case 1: 1000 constant selection pseudoobservables . | |||
---|---|---|---|
M0 (True) | 100 | 1000 | 10,000 |
s | 0.558 | 0.843 | 0.914 |
M1 (False) | 100 | 1000 | 10,000 |
s 1 | 0.366 | 0.769 | 0.834 |
s 2 | 0.608 | 0.752 | 0.685 |
CP | −16 | −5.97 | −2.27 |
Case 2: 1000 changing selection pseudoobservables | |||
M0 (False) | 100 | 1000 | 10000 |
s | −0.278 | −0.00232 | −0.00205 |
M1 (True) | 100 | 1000 | 10,000 |
s 1 | −0.323 | 0.363 | 0.415 |
s 2 | 0.704 | 0.768 | 0.777 |
CP | 0.0946 | 0.413 | 0.303 |
Case 2: 1000 changing selection pseudoobservables | |||
M0 (False) | 100 | 1000 | 10000 |
s | −0.278 | −0.00232 | −0.00205 |
M1 (True) | 100 | 1000 | 10,000 |
s 1 | −0.323 | 0.363 | 0.415 |
s 2 | 0.704 | 0.768 | 0.777 |
CP | 0.0946 | 0.413 | 0.303 |
Case 1: 1000 constant selection pseudoobservables . | |||
---|---|---|---|
M0 (True) | 100 | 1000 | 10,000 |
s | 0.558 | 0.843 | 0.914 |
M1 (False) | 100 | 1000 | 10,000 |
s 1 | 0.366 | 0.769 | 0.834 |
s 2 | 0.608 | 0.752 | 0.685 |
CP | −16 | −5.97 | −2.27 |
Case 1: 1000 constant selection pseudoobservables . | |||
---|---|---|---|
M0 (True) | 100 | 1000 | 10,000 |
s | 0.558 | 0.843 | 0.914 |
M1 (False) | 100 | 1000 | 10,000 |
s 1 | 0.366 | 0.769 | 0.834 |
s 2 | 0.608 | 0.752 | 0.685 |
CP | −16 | −5.97 | −2.27 |
Case 2: 1000 changing selection pseudoobservables | |||
M0 (False) | 100 | 1000 | 10000 |
s | −0.278 | −0.00232 | −0.00205 |
M1 (True) | 100 | 1000 | 10,000 |
s 1 | −0.323 | 0.363 | 0.415 |
s 2 | 0.704 | 0.768 | 0.777 |
CP | 0.0946 | 0.413 | 0.303 |
Case 2: 1000 changing selection pseudoobservables | |||
M0 (False) | 100 | 1000 | 10000 |
s | −0.278 | −0.00232 | −0.00205 |
M1 (True) | 100 | 1000 | 10,000 |
s 1 | −0.323 | 0.363 | 0.415 |
s 2 | 0.704 | 0.768 | 0.777 |
CP | 0.0946 | 0.413 | 0.303 |
R2 values of the parameter estimation with the ABC (approximate Bayesian computation) model choice for diploid populations
Case 1: 1000 constant selection pseudoobservables . | |||
---|---|---|---|
M0 (True) | 50 | 500 | 5000 |
s | −0.13 | 0.55 | 0.796 |
h | −0.789 | 0.653 | 0.88 |
M1 (False) | 50 | 500 | 5000 |
s 1 | −0.194 | 0.363 | 0.746 |
s 2 | 0.491 | 0.57 | 0.609 |
h | 0.195 | 0.764 | 0.857 |
CP | −10.8 | −1.65 | −1.37 |
Case 1: 1000 constant selection pseudoobservables . | |||
---|---|---|---|
M0 (True) | 50 | 500 | 5000 |
s | −0.13 | 0.55 | 0.796 |
h | −0.789 | 0.653 | 0.88 |
M1 (False) | 50 | 500 | 5000 |
s 1 | −0.194 | 0.363 | 0.746 |
s 2 | 0.491 | 0.57 | 0.609 |
h | 0.195 | 0.764 | 0.857 |
CP | −10.8 | −1.65 | −1.37 |
Case 2: 1000 changing selection pseudoobservables | |||
M0 (False) | 50 | 500 | 5000 |
s | −0.441 | 0.0955 | 0.118 |
h | −0.119 | 0.525 | 0.573 |
M1 (True) | 50 | 500 | 5000 |
s 1 | −0.713 | 0.181 | 0.277 |
s 2 | 0.467 | 0.536 | 0.463 |
h | 0.219 | 0.611 | 0.678 |
CP | −0.145 | −0.258 | −0.125 |
Case 2: 1000 changing selection pseudoobservables | |||
M0 (False) | 50 | 500 | 5000 |
s | −0.441 | 0.0955 | 0.118 |
h | −0.119 | 0.525 | 0.573 |
M1 (True) | 50 | 500 | 5000 |
s 1 | −0.713 | 0.181 | 0.277 |
s 2 | 0.467 | 0.536 | 0.463 |
h | 0.219 | 0.611 | 0.678 |
CP | −0.145 | −0.258 | −0.125 |
Case 1: 1000 constant selection pseudoobservables . | |||
---|---|---|---|
M0 (True) | 50 | 500 | 5000 |
s | −0.13 | 0.55 | 0.796 |
h | −0.789 | 0.653 | 0.88 |
M1 (False) | 50 | 500 | 5000 |
s 1 | −0.194 | 0.363 | 0.746 |
s 2 | 0.491 | 0.57 | 0.609 |
h | 0.195 | 0.764 | 0.857 |
CP | −10.8 | −1.65 | −1.37 |
Case 1: 1000 constant selection pseudoobservables . | |||
---|---|---|---|
M0 (True) | 50 | 500 | 5000 |
s | −0.13 | 0.55 | 0.796 |
h | −0.789 | 0.653 | 0.88 |
M1 (False) | 50 | 500 | 5000 |
s 1 | −0.194 | 0.363 | 0.746 |
s 2 | 0.491 | 0.57 | 0.609 |
h | 0.195 | 0.764 | 0.857 |
CP | −10.8 | −1.65 | −1.37 |
Case 2: 1000 changing selection pseudoobservables | |||
M0 (False) | 50 | 500 | 5000 |
s | −0.441 | 0.0955 | 0.118 |
h | −0.119 | 0.525 | 0.573 |
M1 (True) | 50 | 500 | 5000 |
s 1 | −0.713 | 0.181 | 0.277 |
s 2 | 0.467 | 0.536 | 0.463 |
h | 0.219 | 0.611 | 0.678 |
CP | −0.145 | −0.258 | −0.125 |
Case 2: 1000 changing selection pseudoobservables | |||
M0 (False) | 50 | 500 | 5000 |
s | −0.441 | 0.0955 | 0.118 |
h | −0.119 | 0.525 | 0.573 |
M1 (True) | 50 | 500 | 5000 |
s 1 | −0.713 | 0.181 | 0.277 |
s 2 | 0.467 | 0.536 | 0.463 |
h | 0.219 | 0.611 | 0.678 |
CP | −0.145 | −0.258 | −0.125 |
For the second case in which the pseudoobservables have changing selection coefficients, parameter estimation from the true model (M1) performs better than that from the false model (M0) for all parameters – particularly when population sizes are large. As expected, there is a trend of better parameter estimation as population size increases (though see Figure S4 and Figure S5). Additionally, it has been shown that the value of the Bayes factor B1,0 is a good indicator of the parameter estimation performance (results not shown).
Data application
Historical dataset of P. dominula
A long-running dataset based on the medionigra morph responsible for darker wing color in wild populations of P. dominula (Figure S8) began in 1939 with collections by Fisher and Ford (1947) and continued through to 1999 (Cook and Jones 1996; Jones 2000). Despite this phenotypic time-serial data having been analyzed previously from various angles (O’Hara 2005; Mathieson and McVean 2013; Foll et al. 2015), it is still relevant to consider a model of changing selection in time, as Wright (1948) originally suggested.
The recent reconsiderations of the dataset tend to favor a lethal-recessive model with an effective population of Ne = 500 (Mathieson and McVean 2013; Foll et al. 2015) – however, the biological question of how the medionigra morph could have reached the initial frequency of 11% in the dataset remains unanswered with this conclusion of constant strong negative selection. Wright asserted that the trajectory of the medionigra morph during this period could be explained by fluctuating selection with “no net selective advantage or disadvantage.” Although this alternative hypothesis has been considered as a random fluctuation of selection by estimating selection coefficients between every sampling time point (see O’Hara 2005), the quantitative plausibility of a directional change-in-s model over a single-s model lacks thorough investigation. Thus, we reanalyzed this dataset using the CP-WFABC method in order to investigate the possibility of changing selection in the medionigra morph during the 60-yr data collection.
Using the ABC model choice introduced here as a test for a change in selection strength, and to estimate the parameters of interest for the chosen model, we assume the medionigra allele is a single codominant locus responsible for the homozygous and heterozygous expressions of the phenotypic forms bimacula and medionigra, respectively (Cook and Jones 1996). The model M0 assumes a single selection coefficient, thus the only parameter to estimate is s. The M1 model assumes a change in selection strength, thus the parameters of interest are s1, s2, and CP. Both M0 and M1 take the prior range of (−1, 1) for the selection coefficients and the prior range of (2, 59) for the change point in the M1 model. For the M1 model, these uninformative priors are updated with the constraint that the allele must be segregating at the time of change point. Here, we create 107 simulated datasets for each M0 and M1, and apply the rejection algorithm of the CP-WFABC method to retain the best 1000 simulations compared with the observed trajectory. The effective population size is assumed to be Ne = 500 as in previous studies (Wright 1948; Cook and Jones 1996; O’Hara 2005), with an initial allele frequency of 11% and a minimum frequency ascertainment of 2%.
The Bayes factor for M1 over M0 is calculated as 0.952, indicating that the single coefficient M0 model cannot be rejected in favor of the changing selection M1 model (Table 4). From the parameter estimation of the model M0 (Figure S9), the mode of the posterior distribution for s is given as −0.15 as asserted by Fisher and Ford (1947). When the ABC model choice was repeated with a smaller population size of Ne = 50 as suggested by Wright (1948) and O’Hara (2005), the Bayes factor increases to 1.87 (i.e., the changing selection model is twice as likely as the constant selection) – however, this value is not large enough to be significant for a diploid population of Ne = 50 (Table 4).
Experimental evolution of influenza virus with drug treatment
The evolution of pathogens within a host is one of the most important cases in which the possibility of fluctuating selection must be considered – as they may experience drastically changing selective pressures due to host immune response, specific drug treatments, and/or pathogenic cooperation or competition (Tanaka and Valckenborgh 2011; Hall et al. 2011). Thus, how these pathogens adapt to these rapid external and internal changes is of major concern to the biomedical community.
The time-serial experimental dataset of influenza A conducted by Renzette et al. (2014) and Foll et al. (2014) is an interesting case study on the impact of drug treatment on influenza virus evolution. The dataset consists of 13 sampling points from which population-level whole-genome data were collected. Drug treatment with a commonly used neuraminidase inhibitor (oseltamivir) began after the collection of the third sample and continued, at increasing concentrations, until the final passage. Using WFABC, the genome-wide effective population size across the sampling time points was estimated (Ne = 176) and the SNPs under selection were identified.
Here, we apply the CP-WFABC method on two cases of interest from this study to consider a possible change in selection strength under drug treatment: the first case includes trajectories identified as being driven by positive selection, while the second includes outlier trajectories (i.e., trajectories not fitting a single s Wright-Fisher model). For all cases, we test the model M0 (i.e., a single selection coefficient) and M1 (i.e., a changing selection coefficient), with parameters of interest (s) and (s1, s2, CP), respectively. The number of generations per passage is assumed to be 13 (following Foll et al. 2014), and the minimum frequency of 2% is set as an ascertainment for observing the minor allele in the data. De novo mutations are assumed to occur at the sampling time point where the SNP was first sampled (except for trajectories whose initial sample frequency was first sampled above 1/Ne, which are assumed to be standing variation). This assumption is based on the high mutation rate, large population bottlenecks associated with passaging, and large census population size between passages. A total of 107 datasets were simulated for each M0 and M1; the best 1000 trajectories from the lot were retained using the rejection algorithm described in the Methods section. The uniform prior ranges for the selection coefficients were set as (−1, 1), for the change point as (t0, 157) where t0 is the generation when the SNP was first sampled, and with the constraint of segregating alleles at the change point for M1.
The results for the Bayes factors and the parameter estimates are summarized in Table 5 for all trajectories of interest. The Bayes factors of most trajectories show strong support for the changing selection model: the stronger the selection strength change, the larger the Bayes factor. Using the Bayes factors from the simulation studies as guidance (Table 4), the significance threshold to reject M0 is computed as 3.7 for a small haploid population. As expected, the trajectories identified as outliers of the single s Wright-Fisher model (NP 159, PB1 33) all reject the constant selection model M0 with a large Bayes factor. We also note that the Bayes factor for the drug-resistant mutation H275Y (NA 823) does not support the changing selection model strongly, confirming that the experimental evolution procedure kept the selective pressure of the drug constant by adjusting the drug concentration to reduce viral plaque numbers to 50% at each passage. As shown in Figure 5, the change points are estimated to be mostly between the seventh and eighth passages, a notable result since three of these trajectories (HA 48, HA 1395, NA 582) are increasing rapidly after the drug-resistant mutation H275Y appears, whereas one trajectory (NP 159) from a different segment decreases rapidly. This result may indicate that positive selective for the three SNPs (including HA 1395; a known compensatory mutation encoded also as D112N) increased along with the drug-resistant mutation H275Y, potentially due to epistatic interactions, whereas another SNP (NP 159) decreased at that time, potentially owing to clonal interference. The single selection estimates from WFABC (Foll et al. 2015) are similar to the estimates of CP-WFABC only when the Bayes factor does not reject M0 – strong evidence that an alternative model of changing selection must be considered for some trajectories in order to correctly estimate selection coefficients.
Bayes factors and parameters estimated for the influenza trajectories in the presence and absence of drug
Segment . | Position . | Bayes Factor (M1/M0) . | M0 s estimate . | M1 s1 estimate . | M1 s2 estimate . | M1 CP estimate . | WFABC s estimate . |
---|---|---|---|---|---|---|---|
Trajectories under selection with drug | |||||||
PAa | 2194 | ∞ | — | 0.251 | −0.328 | 24 (p1–2) | 0.09 |
HAb | 48 | 42.5 | 0.11 | 0.031 | 0.309 | 102(p7–8) | 0.14 |
HAb | 1395 | 5.5 | 0.142 | 0.088 | 0.188 | 104 (p7–8) | 0.22 |
NAb | 582 | 999 | 0.127 | 0.019 | 0.41 | 102 (p7–8) | 0.29 |
NAb | 823 | 1.8 | 0.114 | 0.103 | 0.139 | 105 (p8–9) | 0.15 |
Ma | 147 | 2.7 | 0.056 | 0.042 | 0.089 | 116 (p8–9) | 0.08 |
NSb | 820 | 1.7 | 0.037 | 0.029 | 0.05 | 146 (p11–12) | 0.12 |
Outlier trajectories with drug | |||||||
NPb | 159 | 8.09 | 0.011 | 0.027 | −0.054 | 103 (p7–8) | 0 |
PB1b | 33 | 16.2 | 0.047 | 0.023 | 0.256 | 113 (p8–9) | 0.14 |
Trajectories under selection without drug | |||||||
PB1a | 1119 | 0.57 | 0.044 | 0.038 | 0.047 | 60 (p4–5) | 0.06 |
HAb | 1395 | 2.38 | 0.097 | 0.083 | 0.131 | 106 (p8–9) | 0.12 |
NPa | 1104 | 1.72 | 0.047 | 0.042 | 0.062 | 110 (p8–9) | 0.05 |
NPb | 1396 | 0.89 | 0.049 | 0.049 | 0.053 | 49 (p3–4) | 0.09 |
Segment . | Position . | Bayes Factor (M1/M0) . | M0 s estimate . | M1 s1 estimate . | M1 s2 estimate . | M1 CP estimate . | WFABC s estimate . |
---|---|---|---|---|---|---|---|
Trajectories under selection with drug | |||||||
PAa | 2194 | ∞ | — | 0.251 | −0.328 | 24 (p1–2) | 0.09 |
HAb | 48 | 42.5 | 0.11 | 0.031 | 0.309 | 102(p7–8) | 0.14 |
HAb | 1395 | 5.5 | 0.142 | 0.088 | 0.188 | 104 (p7–8) | 0.22 |
NAb | 582 | 999 | 0.127 | 0.019 | 0.41 | 102 (p7–8) | 0.29 |
NAb | 823 | 1.8 | 0.114 | 0.103 | 0.139 | 105 (p8–9) | 0.15 |
Ma | 147 | 2.7 | 0.056 | 0.042 | 0.089 | 116 (p8–9) | 0.08 |
NSb | 820 | 1.7 | 0.037 | 0.029 | 0.05 | 146 (p11–12) | 0.12 |
Outlier trajectories with drug | |||||||
NPb | 159 | 8.09 | 0.011 | 0.027 | −0.054 | 103 (p7–8) | 0 |
PB1b | 33 | 16.2 | 0.047 | 0.023 | 0.256 | 113 (p8–9) | 0.14 |
Trajectories under selection without drug | |||||||
PB1a | 1119 | 0.57 | 0.044 | 0.038 | 0.047 | 60 (p4–5) | 0.06 |
HAb | 1395 | 2.38 | 0.097 | 0.083 | 0.131 | 106 (p8–9) | 0.12 |
NPa | 1104 | 1.72 | 0.047 | 0.042 | 0.062 | 110 (p8–9) | 0.05 |
NPb | 1396 | 0.89 | 0.049 | 0.049 | 0.053 | 49 (p3–4) | 0.09 |
The estimates whose Bayes factors show strong support for M1 are in bold.
Standing variation.
A de novo mutation.
Segment . | Position . | Bayes Factor (M1/M0) . | M0 s estimate . | M1 s1 estimate . | M1 s2 estimate . | M1 CP estimate . | WFABC s estimate . |
---|---|---|---|---|---|---|---|
Trajectories under selection with drug | |||||||
PAa | 2194 | ∞ | — | 0.251 | −0.328 | 24 (p1–2) | 0.09 |
HAb | 48 | 42.5 | 0.11 | 0.031 | 0.309 | 102(p7–8) | 0.14 |
HAb | 1395 | 5.5 | 0.142 | 0.088 | 0.188 | 104 (p7–8) | 0.22 |
NAb | 582 | 999 | 0.127 | 0.019 | 0.41 | 102 (p7–8) | 0.29 |
NAb | 823 | 1.8 | 0.114 | 0.103 | 0.139 | 105 (p8–9) | 0.15 |
Ma | 147 | 2.7 | 0.056 | 0.042 | 0.089 | 116 (p8–9) | 0.08 |
NSb | 820 | 1.7 | 0.037 | 0.029 | 0.05 | 146 (p11–12) | 0.12 |
Outlier trajectories with drug | |||||||
NPb | 159 | 8.09 | 0.011 | 0.027 | −0.054 | 103 (p7–8) | 0 |
PB1b | 33 | 16.2 | 0.047 | 0.023 | 0.256 | 113 (p8–9) | 0.14 |
Trajectories under selection without drug | |||||||
PB1a | 1119 | 0.57 | 0.044 | 0.038 | 0.047 | 60 (p4–5) | 0.06 |
HAb | 1395 | 2.38 | 0.097 | 0.083 | 0.131 | 106 (p8–9) | 0.12 |
NPa | 1104 | 1.72 | 0.047 | 0.042 | 0.062 | 110 (p8–9) | 0.05 |
NPb | 1396 | 0.89 | 0.049 | 0.049 | 0.053 | 49 (p3–4) | 0.09 |
Segment . | Position . | Bayes Factor (M1/M0) . | M0 s estimate . | M1 s1 estimate . | M1 s2 estimate . | M1 CP estimate . | WFABC s estimate . |
---|---|---|---|---|---|---|---|
Trajectories under selection with drug | |||||||
PAa | 2194 | ∞ | — | 0.251 | −0.328 | 24 (p1–2) | 0.09 |
HAb | 48 | 42.5 | 0.11 | 0.031 | 0.309 | 102(p7–8) | 0.14 |
HAb | 1395 | 5.5 | 0.142 | 0.088 | 0.188 | 104 (p7–8) | 0.22 |
NAb | 582 | 999 | 0.127 | 0.019 | 0.41 | 102 (p7–8) | 0.29 |
NAb | 823 | 1.8 | 0.114 | 0.103 | 0.139 | 105 (p8–9) | 0.15 |
Ma | 147 | 2.7 | 0.056 | 0.042 | 0.089 | 116 (p8–9) | 0.08 |
NSb | 820 | 1.7 | 0.037 | 0.029 | 0.05 | 146 (p11–12) | 0.12 |
Outlier trajectories with drug | |||||||
NPb | 159 | 8.09 | 0.011 | 0.027 | −0.054 | 103 (p7–8) | 0 |
PB1b | 33 | 16.2 | 0.047 | 0.023 | 0.256 | 113 (p8–9) | 0.14 |
Trajectories under selection without drug | |||||||
PB1a | 1119 | 0.57 | 0.044 | 0.038 | 0.047 | 60 (p4–5) | 0.06 |
HAb | 1395 | 2.38 | 0.097 | 0.083 | 0.131 | 106 (p8–9) | 0.12 |
NPa | 1104 | 1.72 | 0.047 | 0.042 | 0.062 | 110 (p8–9) | 0.05 |
NPb | 1396 | 0.89 | 0.049 | 0.049 | 0.053 | 49 (p3–4) | 0.09 |
The estimates whose Bayes factors show strong support for M1 are in bold.
Standing variation.
A de novo mutation.
We also applied CP-WFABC to the control case of SNPs increasing in frequency without drug as a comparison to the case with drug. The effective population sizes of the viral populations were averaged to be 226 in the absence of drug from the previous study (Foll et al. 2014). All of the other inputs and the ABC model choice conditions were kept the same as in the drug case. The Bayes factor results summarized in Table 5 demonstrate that none of the SNP trajectories under selection in the control experiment can reject M0 (i.e., constant selection).
Discussion
These simulations demonstrate that the novel CP-WFABC approach presented here is able to detect changing selection trajectories via ABC model choice, and also to estimate a wide range of parameters of interest. Performance was analyzed separately for three categories of allele trajectories according to the nature of the change in selection strength: (1) a change from positive s1 to positive s2, (2) a change from positive s1 to negative s2, and (3) all other changes. The datasets for each possible combination were generated using the Wright-Fisher model with a change in selection strength, using the most general prior ranges for all parameters s1, s2, CP, and h for diploids, with the only constraint being segregation of the allele at the change point. For both the detection and parameter estimation, CP-WFABC performs best when the change is large, particularly for the second category of change (positive s1 to negative s2), as shown in the ROC curves (Figure 2 and Figure S3) and the cross-validation graphs (Figure 3, Figure 4, Figure S4, Figure S5, Figure S6, and Figure S7). For the first category (positive s1 to positive s2) and the third category (any other changes), the change point is difficult to estimate, particularly for diploids where the additional parameter h is also estimated. The ABC model choice of CP-WFABC has the best sensitivity for full specificity, for larger population sizes (Ne > 500 for diploids), and for haploid populations.
The parameter estimates of s1 and s2 perform best when the values are moderate – given as the intervals (0, 0.5) and (−0.5, 0.5) respectively. For s1, the optimal parameter range for estimation is (0, 0.5), where a de novo mutation that survives negative selection and segregates until the change point is indistinguishable from other drifting mutations with similar trajectories with uninformative low allele frequency. These trajectories naturally arise more frequently when population size is small and in diploids where the dominance effect plays a role, as shown in the third category of change (Figure 3, Figure 4, Figure S4, Figure S5, Figure S6, and Figure S7; magenta points). When an initial frequency of 10% is used instead of a de novo mutation, the advantage of having a more informative trajectory at the beginning is counteracted by the effect of more cases dominated by negative selection or genetic drift segregating until the change point. Thus, the performance of CP-WFABC for standing variation is similar to that of de novo mutation. For s2, the optimal parameter range for estimation is (−0.5, 0.5), as trajectories with extreme values are less informative since they are lost or fixed directly after the change point (explaining the clustering of the change points at earlier times). For diploid populations, estimates of h are accurate to the level of determining dominance from codominance or recessiveness, particularly for large population sizes (Figure S7), given the difficulty of joint estimation with the three other parameters. Indeed, estimation of this additional parameter comes at the cost of worse performance for the other parameters, as can be seen in the ROC curves and cross-validation graphs: the detection and parameter estimation of changing selection cases is always better for haploids. Thus, in diploid cases, we recommend fixing the dominance parameter if known, in order to improve the performance of CP-WFABC.
Although CP-WFABC is intended to detect and evaluate changing selection intensities, the simulation studies demonstrate that the method also performs well in estimating parameters for cases of constant selection (as has been demonstrated by Foll et al. 2014). For haploids with large population sizes, in particular, the estimated values of the single parameter s correlate almost perfectly with the true values (Figure S5). However, when the population size is small for both diploids and haploids, some trajectories that are lost by genetic drift are difficult to estimate, as shown in Figure 3 and Figure S6 as horizontal lines along negative estimated values. This limitation of constant selection coefficients, however, is due to the simulation conditions of the de novo mutation at the first generation and the minimal ascertainment scheme (minimum frequency of 2% at one of the sampling time points). For real datasets, the conditions are likely to be less stringent, since such uninformative trajectories will not be considered for parameter estimation.
Finally, we utilized this approach to make inference in two very different time-sampled datasets: P. dominula (diploid) and Influenza A virus (haploid). The time-serial medionigra trajectory of P. dominula was reanalyzed to test for a change in selection strength and/or direction during 60 yr of data collection. By assuming h as codominant, the results for Ne = 500 indicate that the model M0 of constant selection cannot be rejected according to the Bayes factor from the ABC model choice algorithm. The selection coefficient from this model is estimated as −0.15, corresponding with that calculated by Fisher and Ford (1947). However, when the population size is assumed to be smaller (Ne = 50), the Bayes factor result supports M1 (changing selection) twice as strongly as M0 (constant selection), but not strong enough to reject M0 according to the significance level test computed with the distribution of Bayes factors under the null model M0. This dataset of the medionigra morph thus demonstrates the difficulty of detecting and evaluating a change in selection when the assumption of a known and constant effective population size is not fulfilled. Depending on the assumed population sizes, the conclusions may indeed change, as this study demonstrates. Furthermore, if the population size fluctuates, this method has the caveat of detecting a change in Ne*s rather than only in selection strength.
Next, CP-WFABC was applied to SNP trajectories of interest from an experimental dataset of influenza A virus. The ABC model choice test was conducted on the trajectories identified as 1) being positively selected and 2) as outliers from the single-s WFABC method. For the SNPs in the presence of drug, the Bayes factor for six out of nine trajectories favored the changing selection model M1. As shown in Figure 5, the change points for four out of these six trajectories occurred between passages seven and eight – the interval during which three trajectories from the Hemagglutinin and Neuraminidase segments (HA and NA, respectively) increased rapidly while one trajectory from the segment NP decreased rapidly along with the known drug-resistant mutation NA 823 (H275Y). These results appear to support the presence of epistasis and/or clonal interference, where the selection strength of the other SNPs is influenced by the appearance of a drug-resistant mutation under drug pressure. In fact, a known compensatory mutation HA 1395 (D112N) for infectivity (Thoennes et al. 2008) was among the three trajectories increasing rapidly, further confirming that increased infectivity might contribute to tissue culture adaptation, thus reinforcing the use of the method to evaluate biological hypotheses. Moreover, the estimated values of selection coefficients differed greatly between the constant-selection method (WFABC) and CP-WFABC. In particular, the estimates of s for PA 2194 and NP 159 from the Wright-Fisher model were inferred by WFABC as being near zero (i.e., neutral), though the more robust CP-WFABC estimates here indicate fluctuation of s from negative to positive values. Thus, these fluctuations cannot be explained by genetic drift alone, as previously speculated. We have therefore identified some cases where an alternative model of changing selection is essential for correctly estimating selection parameters and identifying change points. In the absence of drug treatment, the Bayes factor test could not reject the constant selection model M0 for any of the identified SNPs suggesting that in, the absence of the drug, the selective pressures on the population are largely constant as expected.
The simulation studies of changing selection reveal some important points to consider from the standpoint of experimental design. For CP-WFABC, the parameter estimation of a single selection coefficient between two sampling time points performs reasonably well for haploid population sizes above Ne = 1000. However, it is advisable to have three sampling time points to enhance the parameter estimation by maximizing the information contained in the allele trajectory for both selection periods, particularly for diploids and in smaller population sizes (Ne < 1000). Thus, in order for a change to be detectable, it is required to have at least four sampling time points where the change must occur between the second and third sampling time points. The simulation studies of CP-WFABC confirm that the estimation of CP performs best at the intermediate range of time-sampled data, as any change happening before the second and after the second-to-last sampling time point is impossible to detect (Figure 3, Figure 4, Figure S4, Figure S5, Figure S6, and Figure S7). Finally, it remains a future challenge to expand this method to consider more than one change point in selection strength, as some of the trajectories in the influenza A application (such as PA 2194 and PB1 33) suggest the presence of several change points along the trajectory, and to resolve the problem arising from the assumption of constant population size.
Acknowledgments
We thank Kristen Irwin for helpful comments. The computations were performed at Vital-IT (http://www.vital-it.ch) in the Swiss Institute of Bioinformatics. This work was funded by grants from the Swiss National Science Foundation and a European Research Council starting grant to J.D.J.
Footnotes
Supplemental Material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.115.023200/-/DC1
Communicating editor: Y. Kim