Evaluating Adaptive Divergence Between Migratory and Nonmigratory Ecotypes of a Salmonid Fish, Oncorhynchus mykiss

Next-generation sequencing and the application of population genomic and association approaches have made it possible to detect selection and unravel the genetic basis to variable phenotypic traits. The use of these two approaches in parallel is especially attractive in nonmodel organisms that lack a sequenced and annotated genome, but only works well when population structure is not confounded with the phenotype of interest. Herein, we use population genomics in a nonmodel fish species, rainbow trout (Oncorhynchus mykiss), to better understand adaptive divergence between migratory and nonmigratory ecotypes and to further our understanding about the genetic basis of migration. Restriction site-associated DNA (RAD) tag sequencing was used to identify single-nucleotide polymorphisms (SNPs) in migrant and resident O. mykiss from two systems, one in Alaska and the other in Oregon. A total of 7920 and 6755 SNPs met filtering criteria in the Alaska and Oregon data sets, respectively. Population genetic tests determined that 1423 SNPs were candidates for selection when loci were compared between resident and migrant samples. Previous linkage mapping studies that used RAD DNA tag SNPs were available to determine the position of 1990 markers. Several significant SNPs are located in genome regions that contain quantitative trait loci for migratory-related traits, reinforcing the importance of these regions in the genetic basis of migration/residency. Annotation of genome regions linked to significant SNPs revealed genes involved in processes known to be important in migration (such as osmoregulatory function). This study adds to our growing knowledge on adaptive divergence between migratory and nonmigratory ecotypes of this species; across studies, this complex trait appears to be controlled by many loci of small effect, with some in common, but many loci not shared between populations studied.


File S1 Reads per individual from Illumina RADseq
Sequencing statistics for Illumina RADseq. Columns include the sequencing chemistry (both Illumina HiSeq and HiScanSeq were used), the number of individuals pooled per library, the raw number of reads (in millions) the number of quality filtered targeted reads (in millions) and the average number of quality filtered reads per individual (in millions).  Miller et al. (2012). Naming follows on from the database used in Miller et al (2012) and sequences for both allele 1 and allele 2 are given.
File S3 Names and statistics of loci with a significant population genomic statistic.
File S4 Sequences of 61 scaffolds (Miller et al pers. comm.) that produced both a significant association in at least one of the population genomic and/or GWAS tests and an annotation from either BLASTn or BLASTx analysis.

Overview
Though population structure is completely confounded with phenotype in the Sashin Creek system, we opted to conduct a GWAS to ask the following questions: 1) how severe is the confounding, and what happens when we control for population structure using 'neutral loci' identified from population genomics approaches? and 2) since population structure is not confounded with phenotype in Little Sheep Creek, what loci are associated with the phenotype when controlling for population structure in the same way?

Methods
A mixed model approach was used to evaluate associations between genotype and phenotype in each of the three datasets: Sashin only, Little Sheep Creek only, and the combined data set. This model included sex and population structure as co--factors, and an FDR correction (alpha =0.05; BENJAMINI and YEKUTIELI 2001) was applied to account for multiple testing. To account for population structure, a principal component analysis using genotypes from all neutral markers from the output from LOSITAN on the combined dataset was performed. Principle component analysis has been shown to accurately account for structure whilst being computationally faster than using, for example, STRUCTURE (PRICE et al. 2006;ZHAO et al. 2007). Principle component analysis with the 3,801 neutral loci identified from LOSITAN from the combined dataset identified three groupings of samples: the Little Sheep Creek individuals (composed of both residents and migrants), the residents from Sashin Creek and the migrants from Sashin Creek ( Figure S1). The first three axes account for 12.6, 3.3 and 2.8 % of the variation respectively and were used as fixed effects in GWAS models.
To test the effectiveness of the mixed models in removing false positives, quantile--quantile plots were constructed.

Results and Discussion
Mixed models were used for GWAS on each of the three datasets separately for models for two different traits: 1) migration coded as a binary trait (0 = migrants, 1 = residents) and 2) fork length of the fish. Population structure was included as a fixed effect. Sex was also used as a fixed effect since migration has a well--known sex bias with females being more likely to migrate than males (DELLEFORS and FAREMO 1988;JONSSON et al. 1998). A total of 295 SNPs (p = <0.001) were significantly associated with migration in the Sashin Creek population. The strongest association between SNP genotype and migration in Sashin Creek was with SNP R41941 (FDR corrected p = 1.406 x 10 --5 ; Table S1). For the majority of significant associations the migrant phenotypes showed a more even distribution of genotypes between the three genotypic classes (but note that heterozygote deficits occur in several SNPs) than the resident phenotype (Table S1). The Sashin Creek fork length GWAS produced six significant SNPs. SNP R51797 produced the most significant association (FDR corrected p = 0.002 Table S2). The binary trait GWAS in Little Sheep Creek samples failed to find any statistically significant SNPs (data not shown). The fork length GWAS for Little Sheep Creek produced eight significant associations. The most significant association was with SNP 49559 (FDR corrected p = 0.001 Table S2). The combined binary GWAS failed to produce and significant associations, whereas the combined fork length GWAS produced 9 significant associations with the strongest being with SNP R50174 (FDR corrected p = 7.53 x 10 --4 ). No loci were significant in both the binary and fork length GWAS within the Sashin Creek populations. Very few significant loci were in common between the fork length GWAS analyses in the three data sets. Only R41941 was significant in both the Sashin Creek and combined datasets and only R50490 was significant in both the Little Sheep Creek and combined datasets. Mapping data showed every chromosome except Omy27 had at least one significant SNP for the binary GWAS in the Sashin Creek dataset ( Figure S2a). Kernel smoothing suggests a relatively constant distribution of associations with no obvious chromosome containing a peak. However Omy6 and Omy16 seem to produce a slightly higher peak than the other chromosomes and Omy2 Omy7 and Omy10 also showing a slight increase the kernel-smoothed average p value. The Sashin Creek fork length GWAS produced seven mapped loci that were significant (FDR corrected p=< 0.05) which were found on Omy4, Omy5, Omy10, Omy12, Omy17 and Omy21 ( Figure S2b). Kernel smoothing methods showed no chromosomes with a notable peak of associations. The Little Sheep Creek fork length GWAS produced 13 mapped loci that were significant, (FDR corrected p = < 0.05) which were located on Omy1, Omy3, Omy4, Omy5, Omy7, Omy10, Omy16, and Omy26. Kernel smoothing methods showed no chromosomes with any peaks of association ( Figure S2c). Quantile--quantile plots were constructed to test for effectiveness of removing potential false positives (due to confounding with population structure) from the model. The Sashin Creek binary GWAS shows that even after correcting for population structure there is still inflation of p--values and therefore the presence of potential false positives. However, both fork length GWAS for Sashin Creek and Little Sheep Creek suggest over correction of p--values and therefore potential false negatives ( Figure S3).
Comparing population genomic and association approaches. Of the 317 loci significantly associated with either the binary migratory trait or fork length, 61 were also suggestive of positive selection in FST outlier analysis in at least one population. A total of 55 out of 295 loci were significant in both the outlier and associations analyses within the Sashin Creek population, the remaining 6 where significant in both the Little Sheep Creek fork length GWAS and were outliers in the Sashin Creek FST analysis.

Table S1
List of 20 most significant SNPs from GWAS (binary) between resident and migrants for Sashin Creek. All p-values are FDR adjusted (alpha = 0.05). The number of individuals for each of the three genotypes is also given. migrants residents SNP ID --log 10 P