Regulatory Architecture of Gene Expression Variation in the Threespine Stickleback Gasterosteus aculeatus

Much adaptive evolutionary change is underlain by mutational variation in regions of the genome that regulate gene expression rather than in the coding regions of the genes themselves. An understanding of the role of gene expression variation in facilitating local adaptation will be aided by an understanding of underlying regulatory networks. Here, we characterize the genetic architecture of gene expression variation in the threespine stickleback (Gasterosteus aculeatus), an important model in the study of adaptive evolution. We collected transcriptomic and genomic data from 60 half-sib families using an expression microarray and genotyping-by-sequencing, and located expression quantitative trait loci (eQTL) underlying the variation in gene expression in liver tissue using an interval mapping approach. We identified eQTL for several thousand expression traits. Expression was influenced by polymorphism in both cis- and trans-regulatory regions. Trans-eQTL clustered into hotspots. We did not identify master transcriptional regulators in hotspot locations: rather, the presence of hotspots may be driven by complex interactions between multiple transcription factors. One observed hotspot colocated with a QTL recently found to underlie salinity tolerance in the threespine stickleback. However, most other observed hotspots did not colocate with regions of the genome known to be involved in adaptive divergence between marine and freshwater habitats.


Introduction 29
It is now known that much adaptive evolution is underlain by changes in regions of the genome 30 regulating gene expression, rather than in the protein coding regions of the genes themselves (Pavey 31 et al. 2010). Recent work has demonstrated that much variation in gene expression is heritable, and 32 Reads were split by barcode, and barcodes removed, using a custom perl script. Low quality bases 167 were removed from the reads via window adaptive trimming using Trim.pl (available: 168 http://wiki.bioinformatics.ucdavis.edu/index.php/Trim.pl, Illumina quality score ≤ 20). Paired-end 169 reads for each of these individuals were aligned to the BROAD S1 stickleback genome using BWA 170 aln/sampe (v 0.6.2) with default parameters (Li and Durbin 2009). The threespine stickleback genome 171 comprises 21 assembled chromosomes plus 1,823 un-placed genomic scaffolds. Unmapped reads, and 172 reads with non-unique optimal alignments, pair-rescued alignments, or any alternative suboptimal 173 alignments, were discarded from resulting SAM files. SAM files were converted to sorted BAM files 174

Genotype quality control 186
Vcftools (Danecek et al. 2011) was used to recode genotypes with a genotype quality phred score 187 (GQ) < 25 or a sequencing depth (DP) < 8 or > 1000 to missing. Vcf files for all families were merged 188 and the merged file converted to the input format for Plink 1.07 (Purcell et al. 2007). For SNPs on all 189 autosomal chromosomes and the pseudoautosomal region of Chromosome 19 (see below), the 190 following filters were applied in Plink: hwe (based on founders only) < 0.01, maximum missing 191 genotypes = 0.25, minor allele frequency > 0.05, offspring with > 70% missing data removed. 192 Adjacent SNPs in complete linkage disequilibrium were manually consolidated into a single locus, 193 with combined SNP information used to call genotypes. 194 Several approaches were used check for sample contamination or errors in barcode splitting and 195 family assignment: in Plink, the mendel option was used to screen families for Mendelian errors, and 196 sample relatedness was examined by graphically visualizing genome-wide IBD-sharing coefficients 197 generated by genome; the program SNPPIT (Anderson 2012) was used to assign individuals to 198 parents, based on five independent datasets of 100 SNPs; and 220 SNPs on Stratum II of calcul = 3), and included dye, temperature treatment, and sex as fixed factors in the model. Due to the 236 relatively small size of some of our half-sib families, we examined sire effects only, with a separate 237 QTL effect estimated for each sire. Excluding dam effects is expected to reduce our power of eQTL 238 detection, as fewer parents will be segregating for each QTL. 239 A fast algorithm was used to identify phase and estimate transmission probabilities at each 240 chromosomal location (Elsen et al. 1999, QTLMap option: --snp). Autosomes and the 241 pseudoautosomal portion of the sex chromosome were scanned at 1cM intervals, and the presence of 242 QTL on a chromosome was assessed using a likelihood ratio test (LRT) under the hypothesis of one 243 versus no QTL. Chromosome-wide LRT significance thresholds for each trait were identified 244 empirically, by permuting fixed effects and traits amongst individuals within families and 245 recalculating LRT scores (5000 permutations). As the combination of 5000 permutations x 10,332 246 traits x 21 chromosomes was computationally prohibitive, we first performed permutations on a 247 subset of 200 expression traits to establish a LRT threshold below which identified QTL were 248 unlikely to be significant at chromosome-wide p < 0.05 (LRT = 55), and then used permutations to 249 assess significance of all QTL above this threshold. The non-pseudo-autosomal region of the female 250 Chromosome 19 can be considered analogous to the X chromosome; identification of QTL in this 251 region requires estimation of dam effects and was therefore not performed. The 95% confidence 252 interval for each QTL was estimated using the drop-off method implemented in QTLMap 0.9.7, 253 which returns flanking map positions plus their nearest marker. 254

Cis vs. trans eQTL 255
To discriminate cis vs. trans QTL, we compared inferred QTL location to the position of the 256 expressed gene according to the BROAD G. aculeatus genome annotation v. 1.77 (available: 257 http://ftp.ensembl.org/pub/release-77/gtf/gasterosteus_aculeatus/). All positions on the BROAD 258 annotation were re-coded to positions on our modified chromosome assemblies. For genes on 259 scaffolds un-anchored to our assembly, we also used information on chromosomal scaffold locations we observed strong enrichment of significant trans eQTL on the same chromosome as the regulated 265 gene, indicating that these were actually mis-identified cis eQTL; we therefore selected the 266 conservative 10Mb threshold. In practice, examination of our results showed that 95% CI of eQTL 267 sometimes extended further than this 10Mb threshold. Considering median 95% CI (approximately 268 1Mb), we therefore classified a QTL as trans if the SNP closest to the upper or lower 95% confidence 269 bounds of that QTL was further than 9.5Mb from the regulated gene. Following Johnsson et al. (2015) we applied a local significance threshold (chromosome-wide p < 0.01) for evaluation of possible cis-271 QTL and a genome-wide significance threshold (genome-wide p < 0.021, = chromosome-wide 272 threshold of 0.001 * 21 chromosomes) for evaluation of possible trans-QTL. Although this 273 significance threshold is permissive, we considered it acceptable as our aim was to analyse the eQTL 274 distribution across the genome rather than to identify individual QTL-locus associations. Similar 275 significance thresholds have been used for eQTL detection in comparable studies (e.g. Whiteley et al. 276

2008). 277
To ask whether the effect of variation in trans regulatory sites was more often non-additive than the 278 effect of variation in cis regulatory sites, we examined the narrow sense heritability (h 2 ) and 279 dominance proportion of genetic variance (d 2 ) estimated for each expression trait by Leder et al. 280 (2015) and provided in the Supplementary Data for that paper. 281

Genes with plastic vs. non-plastic expression 282
To investigate whether genes exhibiting an alteration in expression level in response to a temperature 283 stress treatment (i.e. those exhibiting environmental plasticity) had a different underlying regulatory 284 architecture to those not exhibiting such a response, we divided genes into a 'responding' and 'non-

Evaluation of eQTL hotspots 288
As all identified eQTL had wide 95% confidence intervals, meaning that physically close eQTL 289 positions could be due to the effect of the same locus (see below), we evaluated potential eQTL 290 hotspots by counting eQTL within 5cM bins across the genome ('hotspot size' = number of eQTL). 291 Where the number of 1cM bins within a chromosome was not a simple multiple of 5, bin sizes at the 292 start and/or end of the chromosome were increased to 6 or 7. To obtain an empirical significance 293 threshold above which clusters of eQTL could be considered a 'hotspot', we simulated the expected 294 neutral distribution of eQTL across the genome using a custom script. We performed 5000 295 simulations: for each, we assigned n eQTL (where n = relevant number of significant eQTL) 296 randomly across the 3,062 1cM bins of the genome and then summed them into 5cM (or larger) bins 297 as described above. Conservatively, we compared the size of hotspots in the real data to the size 298 distribution of the largest hotspot observed over each of the 5000 simulations. 299

Association of eQTL with regions under selection 300
Hohenlohe et al. regulators that might contribute to adaptation to different aquatic habitats by comparing the location 304 of these regions with the location of our identified trans eQTL hotspots. We also compared hotspot 305 location to regions of the genome inferred by Guo et al. (2015) to be involved in adaptive 306 differentiation amongst different stickleback populations in the Baltic Sea. 307

Ortholog identification 308
In order to maximize the functional information available, we identified human orthologues for G.

Hotspot annotation 316
To identify regulatory genes physically associated with an eQTL hotspot, we defined hotspot 317 confidence boundaries as being the most frequently observed 95% confidence limits of all significant 318 eQTL centred in the hotspot. We used AmiGO2 (Carbon et al. 2009) to identify 'molecular function' 319 or 'biological process' Gene Ontology (GO) terms associated with transcriptional regulation by 320 applying the search term [transcription regulation andpathway]. We then used BioMart to examine 321 all genes within the hotspot boundaries for any of these GO annotations, using the HGNC symbols as 322 input. As an important transcriptional regulator generating a hotspot might itself be regulated by the 323 hotspot rather than physically present within it, we repeated this analysis for all genes with eQTL 324 mapped to the hotspot (cis eQTL significant at chromosome-wide p < 0.01; trans eQTL significant at 325 genome wide p < 0.021). We used DAVID (Huang et al. 2009a;.to examine GO term enrichment 326 for the sets of genes with trans QTL mapping to each hotspot, using the 9,071 genes on the 327 microarray with identified human orthologues as the background. To increase our sample size, we 328 lowered our stringency and examined all genes with trans eQTL mapping to the hotspot locations at 329 genome-wide p < 0.057 (chromosome-wide p < 0.0027). 330

Upstream regulator and functional interaction analyses 331
To search for regulatory genes which may be responsible for the expression variation in genes with 332 identified trans eQTL, we used the upstream regulator analysis in the Ingenuity Pathway Analysis 333 (IPA) software (Qiagen). This analysis uses a Fisher's Exact Test to determine whether genes in a test 334 dataset are enriched for known targets of a specific transcription factor. We used the human HGNC 335 symbols as identifiers in IPA. First we examined all genes that had a significant trans-eQTL mapping detail the upstream regulators potentially involved in generating eQTL hotspots, we lowered our 338 stringency and also examined all genes with trans eQTL mapping to the hotspot locations at genome-339 wide p < 0.057 (chromosome-wide p < 0.0027). 340 Since transcription is typically initiated by a complex of genes rather than a single transcription factor, 341 we examined functional relationships among the identified upstream regulators for each hotspot 342 (Table S7b), the genes located within a hotspot, and the genes with significant eQTL mapping to that 343 hotspot (Table S3; cis eQTL significant at chromosome-wide p < 0.01, trans eQTL significant at 344 genome-wide p < 0.021), using STRING v10 (Jensen et al. 2009, http://string-db.org/). We searched 345 for evidence of functional relationships from experiments, databases and gene co-expression, and 346 applied a minimum required interaction score of 0.4.  Table S0. 356

Identification of cis and trans eQTL 370
Expression data were available for 500 of the 517 genotyped offspring. Twenty-six of these offspring 371 had > 60% missing genotype data and were removed from the analysis. As we found that missing 372 values in the expression trait file caused QTLMap to over-estimate the LRT statistic, we eliminated 373 these from the dataset by removing one additional individual and 195 expression traits. Eighty-eight 374 genotyped parents, 473 genotyped and phenotyped offspring (mean no. offspring per family = 15.8, 375 mean proportion of missing genotypes in offspring = 0.11; maximum = 0.56), and 10,332 expression 376 traits were retained for the analysis. At chromosome-wide p < 0.01, we identified 5,366 eQTL 377 associated with 4,507 expression traits (43.7% of the 10,322 expression traits examined, Table S3). 378 Based on our recoded gene positions, we classified 2,335 of these as cis eQTL, 2,870 as trans eQTL, 379 and 161 as unknownthat is, the expressed gene was located on a scaffold that had not been assigned 380 to a G. aculeatus chromosome by either this study or Glazer et al. (2015 ; Table S3, Table S4). Four 381 hundred and seventy four of the trans eQTL were significant at genome-wide p < 0.021. Of these, 382 84.5% mapped to a chromosome other than the one containing the regulated gene. After application of 383 this genome-wide significance threshold for trans eQTL, 2,858 expression traits (27.7% of those 384 examined) remained associated with one or more significant cis or trans eQTL. Of these, 79.4% were 385 associated with a cis eQTL, 13.9% with one or more trans eQTL, 2.3% with both a cis and a trans 386 eQTL and 4.4% with eQTL of unknown class (Table S3). The physical distribution across the genome 387 of the 2,858 loci with significant cis or trans eQTL is shown in Figure S1. Mean 95% confidence 388

Association of eQTL with regions under selection 421
None of our identified eQTL hotspots overlapped parallel regions of the genome divergent between  Table S5). 429

Hotspot annotation 430
We identified human orthologues for 16,315 of the 20,787 protein-coding genes annotated on the 431 Broad stickleback genome (78.5%, Table S4). There were 393 genes with human annotation 432 physically located within the designated boundaries of the eleven hotspots (Table S5). Of these, 70 433 (17.8%) had a GO term related to transcription regulation (Table 1, Table S6). In addition, 21 genes 434 with significant cis eQTL or trans eQTL mapping to a hotspot had GO terms related to transcriptional 435 regulation (Table 1, Table S6). Following correction for multiple testing we found no significant GO 436 term enrichment amongst any group of genes trans regulated by the same eQTL hotspot.

Upstream regulator and functional interaction analyses 438
When examining all 405 genes with trans eQTL significant at genome wide p < 0.021, 79 439 significantly enriched upstream regulators were identified using IPA (Table S7). In total, these 440 regulators had 208 of the genes in the dataset as known targets. Hepatocyte nuclear factor 4α 441 (HNF4A) was identified as a particularly important regulator (p = 9.3*10 -8 ), with 70 (33.7%) of these 442 genes as downstream targets. Other highly enriched regulatory factors included one cut homeobox 1 443 (ONECUT1; p = 3.2*10 -5 ; 16 target genes ), Nuclear Receptor Subfamily 4 Group A Member 1 444 (NR4A1; p = 2.0*10 -4 ; 11 genes), Signal Transducer And Activator Of Transcription 5B (STAT5B; p 445 = 5.5*10 -4 ; 10 genes), Krüppel-like factor 3 (KLF3; p = 8.7*10 -4 ; 15 genes), estrogen receptor 1 446 (ESR1; p = 1.8*10 -3 ; 37 genes); Hepatocyte nuclear factor 1α (HNF1A; p = 1.9*10 -3 ; 17 genes); 447 CAMP Responsive Element Binding Protein 1 (CREB1; p=3.2*10 -3 ; 19 genes) and myc proto-448 oncogene protein (MYC; p = 3.4*10 -3 ; 30 genes). The full list of 79 significant upstream regulators is 449 in Table S7. 450 To identify upstream regulators that could be contributing to the ten eQTL hotspots we further 451 examined all genes that had trans eQTL mapping to the hotspots at genome-wide p< 0.057 (1120 452 genes). One hundred and ninety-two different enriched upstream regulators were identified for these 453 genes (Table S7b). For genes with trans eQTL mapping to the Chr4b, Chr6, Chr12a and Chr12c 454 hotspots, HNF4A remained an important regulator. Only five of the identified upstream regulators 455 were physically located within a hotspot (NFKB1, Chr4a; SOX3, Chr4a; SRF, Chr9; NFATC2, 456 Chr12b; NR4A1, Chr12b). Five had significant cis or trans eQTL mapping to a hotspot (IRF2, Chr4a 457 trans; NCOA4, Chr6 cis; HIF1A, Chr6 trans; JUP, Chr7 trans; ELK1, Chr12a cis). None of these ten 458 hotspot-associated regulatory proteins were identified as significant upstream regulators for the sets 459 genes with trans eQTL mapping to the same hotspotin other words, their presence did not appear to 460 be causative of the observed hotspots. 461 When the enriched upstream regulators, genes with cis eQTL mapping to a hotspot at chromosome-462 wide p < 0.01, and genes with trans eQTL mapping to a hotspot at genome wide p < 0.021 were 463 examined in STRING, multiple protein-protein interactions were found (Figure 2, Figure S4). In 464 particular for the Chr6 hotspot we found a complex interaction network that included eight molecules approach to generate an improved linkage map, and applied interval mapping to identify eQTL. Our 472 new map was independent of that recently constructed by Glazer et al. (2015), and the congruent 473 placement of scaffolds between the two maps confirms the reliability of these new genome 474 assemblies. Our map covered a substantially larger distance in cM than those of Roesti   Using a chromosome-wide significance threshold for cis regulatory loci and a genome-wide threshold 482 for trans loci, we identified eQTL for just over a quarter of the 10,332 expression traits examined. 483 Because at least 74% of these expression traits exhibit significant heritable variation (Leder et al. 484 2015), and gene expression is commonly regulated by multiple eQTL, we expect that a much larger 485 number of underlying eQTL remain undetected due to low statistical power. Despite expectations that 486 trans regulatory regions might be under purifying selection due to their potentially pleiotropic effect, 487 and that the effect of trans eQTL on expression will be weaker than that cis eQTL, we found many 488 cases where gene expression was influenced by regulatory variation in trans but not in cis. This Identifying eQTL directly implicated in local adaptation in sticklebacks was not our experimental 534 aim, and it is possible that regulatory hotspots acting in tissues or life stages that we did not examine 535 have a role in stickleback adaptive radiation. In general, it is difficult to predict in which tissues, or at 536 which life stages, gene expression variation gives rise to observed adaptive differences. We examined 537 transcription in the liver, an easily accessible, metabolically active, tissue. The liver expresses many 538 genes with potential roles in the physiological adaptation to different aquatic environments, including 539 hormone receptors and genes involved in osmoregulation, energy homeostasis, and response to 540 hypoxia. Further, many eQTL identified in this study may be common to other tissues. In general, the 541 extent to which eQTL are shared amongst tissues remains unclear, due to the need for very large 542 sample sizes and the limitations of the statistical methodologies available to address this question (The GTEx Consortium 2015). In particular, variation in gene expression levels amongst tissues 544 means that the power to detect underlying eQTL also varies amongst tissues. Although studies have 545 suggested that up to 70% of genes may have common underlying eQTL across tissues (Nica et al. 546 2011), there is also some evidence that trans eQTL hotspots in particular may act in a tissue-specific 547 manner (Grundberg et al. 2012). Thus, replication of this study in a greater range of tissues, and at 548 different life stages, would shed more light on the regulatory genetic architecture underlying the 549 parallel changes observed when marine sticklebacks independently colonize freshwater. 550 To investigate the potential genetic mechanisms generating the nine observed eQTL hotspots we 551 searched for associated loci with known transcriptional regulatory functions, and performed upstream 552 regulator analysis for the genes with eQTL in the hotspots. Although the pathways regulating 553 transcription are still poorly characterized for most genes, particularly in non-mammalian species, 554 these analyses can provide useful preliminary information. We found no evidence that eQTL hotspots 555 were due to the presence of a single 'master' regulatory locus, or a cluster of regulatory genes, at the 556 hotspot locations. Although many genes with roles in transcriptional regulation were present in, or 557 regulated by, hotspots, finding such genes is not unexpected: approximately 18% of the human 558 orthologues of BROAD stickleback genes are annotated with the GO terms that we used to identify 559 transcriptional regulators. It is also possible that the regulatory elements generating such hotspots are 560 not annotated coding genes: microRNAs and long non-coding RNAs are potentially important trans 561 regulators (Vance and Ponting 2014) and not yet well characterized across the stickleback genome. 562 Our results suggest that, alternatively, these hotspots may be generated by a complex interaction of 563 multiple transcription regulators. Several well-characterized regulatory proteins were identified as 564 important upstream regulators for genes with trans eQTL mapping to the hotspots. Unsurprisingly, 565 these included three genes -HNF4A, ONECUT1 and HNF1A -known to be master transcriptional 566 regulators in the mammalian liver (Odom et al. 2004). HNF4A and ONECUT1 were identified as a 567 particularly strongly enriched upstream regulators when examining all genes with a trans eQTL at 568 genome-wide p<0.021 (Table S7), and were also found to be enriched when examining the subsets of 569 genes with trans eQTL mapping to the hotspots on Chromosome 4, Chromosome 6, and Chromosome 570 12 (Table S7). None of the three genes were physically located in any hotspot, and we were unable to 571 identify significant eQTL underlying variation in their expression (ONECUT1 was not on the 572 microarray). However, we note that HNF4A is less than 300 kb from hotspot Chr12b. These In conclusion, we have performed the first genome-wide characterisation of the regulatory 593 architecture of gene expression in G. aculeatus. We found that variation in gene expression was 594 influenced by polymorphism in both cis-acting and trans acting regulatory regions. Trans-acting 595 eQTLS clustered into hotspots. In general these hotspots did not co-locate with regions of the genome 596 known to be associated with parallel adaptive divergence among marine and freshwater threespine 597 sticklebacks. However, one hotspot overlapped with a known QTL underlying salinity tolerance, a 598 locally adaptive trait. Hotspots locations appeared to be mediated by complex interactions amongst 599 regulator molecules rather than the presence of few 'master regulators'. Our broad-scale study 600 suggests many avenues for finer-scale investigation of the role of transcriptional regulation in 601 stickleback evolution. barcode splitting and we thank her for bioinformatics advice. We also thank the numerous people who 612 helped in obtaining and maintaining sticklebacks. Generous computing resources were provided by 613 the Finnish Centre for Scientific Computing (CSC-IT). Research was conducted under an ethical 614 license from the University of Helsinki (HY121-06). Comments from two anonymous reviewers 615 greatly improved the manuscript. 616 Table 1: Known transcriptional regulators associated with identified eQTL hotspots. Human orthologues of stickleback genes were identified using BioMart. 878