## Abstract

The availability of whole genome sequencing data from multiple related populations creates opportunities to test sophisticated population genetic models of convergent adaptation. Recent work by Lee and Coop (2017) developed models to infer modes of convergent adaption at local genomic scales, providing a rich framework for assessing how selection has acted across multiple populations at the tested locus. Here I present, *rdmc*, an R package that builds on the existing software implementation of Lee and Coop (2017) that prioritizes ease of use, portability, and scalability. I demonstrate installation and comprehensive overview of the package’s current utilities.

Convergent adaptation occurs when natural selection independently orchestrates the evolution of the same set of trait or traits in multiple populations (Takuno *et al.* 2015; Tishkoff *et al.* 2007; Yeaman *et al.* 2016; Losos 2011). Efforts by Lee and Coop (2017) used coalescent theory to develop composite likelihood models to infer which among several competing modes of convergent adaptation best explains allele frequencies at a putatively selected region. These models provide rich statistical information about the inferred adaptive mutation, including its location along the region, the strength of selection, migration rate, age, and its initial allele frequency prior to selection.

To facilitate use of the convergent adaptation models of Lee and Coop (2017), I developed rdmc, an R package implementing their models that was designed to be easy to use, portable, and scalable. In this short manuscript, I provide an overview of the usage and installation of the package, concluding with opportunities for future improvements and expansion to the software.

## Materials And Methods

Lee and Coop (2017) described three distinct modes of convergent adaption: independent mutations, where two or more populations independently gain the selected mutation; migration, where the mutation occurs once and subsequently migrates to multiple populations prior to fixation; and standing variation, where the mutation was present at low frequency in the ancestral population prior to divergence. The models are composite likelihood-based, where likelihood calculations are made over a grid of user-chosen input parameters.

### Data requirements

Using *rdmc* requires two kinds of allele frequency data. The first is allele frequencies from unlinked neutral sites across all populations. The second is allele frequencies from at least three populations that have putatively undergone convergent adaptation at a specific locus, and three or more populations that did not. Sample allele frequencies can be estimated with a number of existing software resources including VCFtools (Danecek *et al.* 2011) and ANGSD (Korneliussen *et al.* 2014). Typically, the allele frequencies at sites that have putatively undergone convergent adaptation will have been identified prior to using *rdmc*. Numerous methods exists for identifying such regions, such as finding overlapping selective sweeps in multiple populations (Stetter *et al.* 2020), or by identifying regions with elevated values between populations that putatively did and did not experience convergent adaptation (Hohenlohe *et al.* 2010). Additionally, *rdmc* requires an estimate of the per base recombination rate for the region or regions of interest, and an estimate of the effective population size. Assuming a mutation rate, effective population size can be estimated from genetic diversity (Gillespie 2004), or inferred via multiple methods (Gutenkunst *et al.* 2010; Schiffels and Durbin 2014; Excoffier and Foll 2011). Likewise, local recombination rates can be derived from genetic maps (Swarts *et al.* 2014), or inferred (Chan *et al.* 2012; McVean and Auton 2007; Adrion *et al.* 2020). At the time of writing, all available models in *rdmc* assume a single effective population size for all populations. Depending on which modes of convergent adaptation are being investigated, users must also provide vectors of selection coefficients, migration rates, allele frequency ages, and initial allele frequencies prior to selection. The exhaustive list of required inputs and their definitions is given in Table 1.

### Installation and dependencies

Installation of *rdmc* requires the r package *devtools* (Wickham *et al.* 2020b). With devtools available, the package can be installed and made locally available with the following R commands:

`devtools::install_github(’silastittes/rdmc’) library(rdmc)`

In addition to *devtools*, *rdmc* depends on several other packages. Namely, *MASS* (Venables and Ripley 2002), *dplyr* (Wickham *et al.* 2020a), *tidyr* (Wickham and Henry 2020), *purrr* (Henry and Wickham 2019), *magrittr* (Bache and Wickham 2014), and *rlang* (Henry and Wickham 2020). All dependencies are automatically installed or updated when the installation command above is issued. I encourage users to update to the most recent version of R prior to issuing any of the commands featured here.

### Specifying parameters and input data

For convenience, the original simulated example data generated by Lee and Coop (2017) are provided with the installation and can be loaded with:

`#load example data`

`data(neutral_freqs)`

`data(selected_freqs)`

`data(positions)`

The example data consists of 10,000 simulated base pairs from six populations, three of which (with indices 1,3,5) independently mutated to the selected allele at position 0, along with three other populations that evolved neutrally. Allele frequencies must be be passed to *rdmc* as a matrix, where each row is a population and each column is a locus. Users should note that the simulated positions here take on values between zero and one, but that base pair positions along the chromosomes of empirical data should not be altered prior to fitting the models.

When fitting possible convergent adaptation models, several quantities are reused regardless of which modes of convergent adaptation are of interest. In efforts to minimize computation, all parameters and quantities that are common across models are stored in a single named list generated with the function *parameter_barge()* that can be used when fitting any of the possible models. The list of quantities is generated using:

`#specify parameters and input data.`

`param_list <-`

`parameter_barge(`

`Ne = 10000,`

`rec = 0.005,`

`neutral_freqs = neutral_freqs,`

`selected_freqs = selected_freqs,`

`selected_pops = c(1, 3, 5),`

`positions = positions,`

`n_sites = 10,`

`sample_sizes = rep(10, 6),`

`num_bins = 1000,`

`sels = c(`

`1e-4,`

`1e-3,`

`0.01,`

`seq(0.02, 0.14, by = 0.01),`

`seq(0.15, 0.3, by = 0.05),`

`seq(0.4, 0.6, by = 0.1)`

`),`

`times = c(0, 5, 25, 50, 100, 500, 1000, 1e4, 1e6),`

`gs = c(1 / (2 * 10000), 10 -̂(4:1)),`

`migs = c(10 -̂(seq(5, 1, by = -2)), 0.5, 1),`

`sources = selected_pops,`

`locus_name = ”test_locus”,`

`cholesky = TRUE`

`)`

where all the arguments are fully described in Table 1. This command also determines the grid of parameter values (namely the arguments, *sels*, *times*, *gs*, *migs*, *sources*, and *n_sites* or *positions*) that will be used in the likelihood calculations. Depending on which modes of convergent adaptation are being studied, some of these grid parameters may not be used for inferences. Users must still input values for all of the grid parameters.

Naturally, features of the input data (the density and amount of variation in the allele frequencies, the effective population size, and the mutation and recombination rates), will impact the model results, and will determine the resolution we have to infer the model parameters. The number and density of points along the grid of parameters also affect the resolution one has to make inferences. However, computation time and memory usage may become infeasible if these grids are made too large.

### Fitting the models

Once the parameter barge is constructed, all models can be fit using this list of quantities as the only data input. All of the mode types (neutral, independent mutations, standing variation with and without a source population, migration, and mixed-modes) are implemented using the same function, *mode_cle()*, passing the desired mode as an argument to the function. The neutral, independent mutations, migration, and standing variation with a source population modes can be fit, respectively with:

`#fit composite likelihood models`

`neut_cle <- mode_cle(param_list, mode = ”neutral”)`

`ind_cle <- mode_cle(param_list, mode = ”independent”)`

`mig_cle <- mode_cle(param_list, mode = ”migration”)`

`sv_cle <- mode_cle(param_list, mode = ”standing_source”)`

Models of mixed modes, where specified populations are modeled under different modes, can be also implemented by modifying the parameter list object in-place. Specifically, mixed modes are constructed by adding the *sets* and *modes* arguments, which groups the population indices according the vector of modes, and specifies which modes are to be used. For example, to fit a model where populations with indices 1 and 3 adapted via standing variation, and population 5 gained the same mutation independently, and another mixed-mode model where populations 1 and 3 adapted via migration, and population 5 mutated independently:

`#update barge to fit a mixed-mode model`

`param_list <-`

`update_mode(`

`barge = param_list,`

`sets = list(c(1, 3), 5),`

`modes = c(”standing_source”, ”independent”))`

`#fit mixed-mode model`

`multi_svind <- mode_cle(param_list, ”multi”)`

`#update to another mixed-mode`

`param_list <-`

`update_mode(`

`barge = param_list,`

`sets = list(c(1, 3), 5),`

`modes = c(“migration”, “independent”))`

`#fit mixed-mode model`

`multi_migind <- mode_cle(param_list, ”multi”)`

Regardless of which mode is used when calling *mode_cle()*, the data frame returned will always contain the same 10 features: The 6 grid parameters generated by *parameter_barge()* (Table 1), the composite likelihood score calculated over all possible combinations of the grid parameters, the indices of the selected populations, and the names of the locus and mode that were used. To always maintain the same number of columns, missing (*NA*) values are added when variables are not used for a given mode type. As will be shown below, this design facilitates combining results from multiple models for downstream analyses.

## Results And Discussion

### Benchmarking

The computation time and memory usage of *rdmc* increases with the complexity of the model and size of the input data used. Compared to the original code implemented by Lee and Coop (2017), *rdmc* is slightly faster computationally, and requires substantially less memory. However, the reduced time and memory allocation for *rdmc* only occurs when Cholesky factorization is used to obtain the inverse of the neutral and selected covariance matrices (Tables 1 and 2). Alternatively, the matrix inverses are obtained using *ginv()* from the MASS package (Venables and Ripley 2002), which requires a larger memory allocation, but will still approximate the inverse even if the covariance matrix is not positive definite. Users are therefore encouraged to use the default *parameter_barge()* argument *cholesky = TRUE* unless Cholesky factorization fails.

The composite likelihood calculations are made over a grid of input parameters chosen when constructing the parameter barge (code shown above), hence, a denser grid will also have a considerable impact on time and memory usage. The size of the example data provided gives a realistic sense of memory and time usage for potential empirical data. While most modern laptops are capable of handling the required memory, many users will be interested in genome-wide analysis, where the mode of convergence for many separate regions are of interest. In these instances, cloud or high performance computing environments will be more appropriate. Making *rdmc* a portable and easy to install R package simplifies running separate genomic regions as independent jobs in parallel using workflows such as Snakemake (Köster and Rahmann 2012) or Nextflow (Di Tommaso *et al.* 2017).

### Extracting useful quantities and visualization

Once the models of interest have finished, the common format of the returned data frames allows all of the inferences to be combined into a single data frame, which simplifies creation of statistical and graphical summaries, and storage:

`#rdmc loads dplyr::bind_rows()`

`all_mods <-`

`bind_rows(`

`ind_cle,`

`mig_cle,`

`sv_cle,`

`multi_svind,`

`multi_migind`

`)`

`#save results to file`

`readr::write_csv(neut_cle, ”rdmc_neutral.csv”)`

`readr::write_csv(all_mods, ”rdmc_modes.csv”)`

With a single data frame containing output from all tested models, there are many visualization and summary methods available in the R ecosystem (R Core Team 2020). For example, the maximum composite likelihood estimate of the selection coefficient for each model can be accessed with:

`#rdmc loads dplyr::group_by() and magrittr::%>%`

`all_mods %>%`

`group_by(model) %>%`

`filter(cle == max(cle)) %>%`

`select(selected_sites, sels, model)`

`#returns (model names edited here for space)`

`# A tibble: 4 × 3`

`# Groups: model [4]`

`selected_sites sels model`

`<dbl> <dbl> <chr>`

`1 0.0017 0.03 independent`

`2 0.0017 0.03 migration`

`3 0.0017 0.03 stdvar-stdvar-ind.`

`4 0.0017 0.03 mig.-mig.-ind.`

Visualizing the composite likelihood values by genomic position (relative to the neutral composite likelihood) (Figure 1) can be done with:

`library(ggplot2)`

`library(cowplot)`

`theme_set(theme_cowplot(font_size = 18))`

`neut <- unique(neut_cle$cle)`

`all_mods %>%`

`group_by(selected_sites, model) %>%`

`summarize(mcle = max(cle) - neut) %>%`

`ggplot(aes(selected_sites, mcle, color = model)) +`

`geom_line() +`

`geom_point() +`

`xlab(”Position”) +`

`ylab(“Composite likelihood”) +`

`theme(legend.position = ”n”) +`

`scale_color_brewer(palette = ”Set1”)`

Lastly, one can visualize the likelihood surface with respect to specific parameter, such as selection (Figure 1):

`#visualize likelihood surface wrt selection`

`all_mods %>%`

`group_by(sels, model) %>%`

`summarize(mcle = max(cle) - neut) %>%`

`ggplot(aes(sels, mcle, color = model)) +`

`geom_line() +`

`geom_point() +`

`ylab(”Composite likelihood”) +`

`xlab(”Selection coefficient”) +`

`scale_color_brewer(palette = ”Set1”)`

## Concluding remarks and future developments

*rdmc* was made to facilitate the use of convergent adaptation models of Lee and Coop (2017). The package is easy to install, and requires only a few lines of code to generate and analyze the output. By making *rdmc* an R package, the code is highly portable and has relatively few, highly maintained dependencies, making it simpler to adopt to different computing systems. Because of its portability and ease of use, *rmdc* also simplifies downstream tasks which facilitates usage at large scales, such as modeling thousand of genomic regions simultaneously on high performance computing resources.

Several elaborations to the currently available utilities in the *rdmc* package could be addded. Since the methods developed in Lee and Coop (2017), additional models have been developed, including ones that can use putatively selected deletion variation, strong selection, concurrent sweeps, and variation in population size among populations (Oziolor *et al.* 2019). Lee and Coop (2017) also introduced parametric bootstrapping to evaluate support for alternative modes. While not currently incorporated into *rdmc*, future development of the package would include functions to perform bootstrapping. However, for the same reasons mentioned above, *rdmc* should facilitate creation and computation of bootstrap replicates in parallel.

## Web Resources

The source of the package and workflow outlined above are available at https://github.com/silastittes/rdmc. The package is released under GNU General Public License (v3.0). All of the presented analyses were computed on a personal laptop (x86_64, Apple) using R version 4.0.0 2020-04-24).

The original code associated with Lee and Coop (2017) is available https://github.com/kristinmlee/dmc.

## Acknowledgments

I would like to thank Jeff Ross-Ibarra for early reviews and encouragement to write this manuscript, to Kristin Lee, Sivan Yair, and Graham Coop for developing the methods and code that form the foundation of the *rdmc* package, and for giving me their blessing in pursuing the development of the package. Lastly, I thank Felix Andrews for helpfully suggesting the function name, *parameter_barge()*. This work was supported by the National Science Foundation, Plant Genome Research Project, award number 1822330.

## Footnotes

*Communicating editor: A. Kern*

- Received April 24, 2020.
- Accepted July 16, 2020.

- Copyright © 2020 Tittes
*et al.*

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.