Draft Genome Assembly and Population Genetics of an Agricultural Pollinator, the Solitary Alkali Bee (Halictidae: Nomia melanderi)

Alkali bees (Nomia melanderi) are solitary relatives of the halictine bees, which have become an important model for the evolution of social behavior, but for which few solitary comparisons exist. These ground-nesting bees defend their developing offspring against pathogens and predators, and thus exhibit some of the key traits that preceded insect sociality. Alkali bees are also efficient native pollinators of alfalfa seed, which is a crop of major economic value in the United States. We sequenced, assembled, and annotated a high-quality draft genome of 299.6 Mbp for this species. Repetitive content makes up more than one-third of this genome, and previously uncharacterized transposable elements are the most abundant type of repetitive DNA. We predicted 10,847 protein coding genes, and identify 479 of these undergoing positive directional selection with the use of population genetic analysis based on low-coverage whole genome sequencing of 19 individuals. We found evidence of recent population bottlenecks, but no significant evidence of population structure. We also identify 45 genes enriched for protein translation and folding, transcriptional regulation, and triglyceride metabolism evolving slower in alkali bees compared to other halictid bees. These resources will be useful for future studies of bee comparative genomics and pollinator health research.

is most powerful for understanding social evolution when it includes closely related species that are representative of the solitary ancestor from which eusociality arose (Rehan and Toth 2015). However, the rate at which genomic resources have become available for social Hymenoptera has far out-paced that for solitary species. Genome assemblies are publicly available for just three solitary bees and no solitary vespid wasps, compared to over 30 reference genomes currently available for social bees, wasps, and ants (Branstetter et al. 2018). This is in stark disproportion to the species that express solitary behavior among bees and wasps, most of which lead solitary lifestyles (Wcislo and Fewell 2017).
Alkali bees (Nomia melanderi) belong to the subfamily Nomiinae (Halictidae), a taxon composed of species that are solitary, though some express communal behavior and other forms of social tolerance (Wcislo and Engel 1996). The subfamily is the sister clade to the Halictinae, which includes both solitary and social lineages (Danforth et al. 2008). The alkali bees may be representative of the solitary ancestor from which eusociality likely evolved within the bee family Halictidae, and provide important phylogenetic context to comparative genomics (Brady et al. 2006;Gibbs et al. 2012). Alkali bees also possess several of the characteristic traits thought to be important in the ancestor of social halictids, including nest defense and other forms of maternal care (Batra and Bohart 1969;Batra 1970Batra , 1972) ( Figure 1A). As such, this species has become an important model for testing hypotheses for the origins of eusociality, and has provided meaningful insight into the reproductive physiology of solitary bees (Kapheim 2017;Kapheim andJohnson 2017a, 2017b). Development of genomic resources for this species will enable additional hypothesis testing regarding the solitary antecedents of eusociality in this family, and insects in general.
The development of genomic resources for alkali bees will also have practical and applied benefits. Alkali bees are native pollinators of alfalfa seed, which is a multi-billion dollar industry in the United States, accounting for one-third of the $14 billion value attributed to U.S. bee-pollinated crops (Van Deynze et al. 2008; U.S. Department of Agriculture 2014). With issues of honey bee health and colony loss over the last decade, increased attention has been placed on the need to find alternative pollinators for many of our most important crops. Aggregations of alkali bees have been sustainably managed alongside alfalfa fields in southeastern Washington state for several decades (Cane 2008), and they are more effective pollinators of this crop than honey bees (Batra 1976;Cane 2002). Moreover, as a naturally aggregating native species, they are less costly pollinators than alfalfa leafcutter bees (Megachile rotundata), which must be purchased commercially (James 2011). Genomic resources have been an invaluable resource for the study of honey bee health and management, and are thus likely to benefit this important pollinator as well.
Here we present a draft genome assembly and annotation for N. melanderi, along with intial genomic comparisons with other Hymenoptera, a description of transcription factor binding sites, and population genetic analyses based on resequencing of individuals from throughout the southeastern Washington population. These resources will provide an important foundation for future research in sociogenomics and pollinator health.

Genome sequencing and assembly
Sample collections: All of the bees used for sequencing were collected from nesting aggregations in and around Touchet, Washington (USA) with permission from private land owners in June 2014 or June 2015. Samples were collected from two sites approximately 8 km apart, separated by agricultural land dominated by alfalfa seed. Dispersal distance is unknown for this species, but adult females are known to forage up to 3 miles from their nests (Stephen 2003). Adult males and females were captured live, and flash frozen in liquid nitrogen. They were transported in a dry nitrogen shipper, and then stored at 280°u ntil nucleic acid extraction.
DNA and RNA isolation: For genome sequencing, we isolated genomic DNA from individual males in three separate reactions targeting either the head or one half of a thorax. We used a Qiagen MagAttract kit, following the manufacturer's protocol, with two 200 ml elutions in AE buffer. We isolated RNA from three adult females using a Qiagen RNeasy kit, following the manufacturer's protocol, eluting once in 50 ml of water. We extracted RNA from the head and rest of the body separately for each female. For whole genome resequencing, we isolated genomic DNA of 18 adult females and one male from half of a thorax with a Qiagen MagAttract kit, as above. DNA was quantified with a dsDNA high sensitivity Qubit reaction, and quality was assessed on an agarose gel. RNA was quantified on a Nanodrop spectrophotometer, and quality was assessed with a Bioanalyzer. Genome assembly: The DNA shotgun and mate-pair library sequencing generated a total of 593,526,700 reads. After adapter trimming, these reads were filtered for quality (Phred 64 , 7) and excessive ($10) Ns. We removed PCR duplicates from read pairs.
We used SOAPdenovo 2 with default parameters for genome assembly. We began by constructing contigs from the shotgun library reads split into kmers, which were used to construct a de Bruijn graph. Filtered reads were then realigned onto the contigs, and used to construct scaffolds based on shared paired-end relationships between contigs. We then closed gaps in the assembly using information from paired-end reads that mapped to a unique contig and a gap region. BUSCO assessment of assembly completeness: The genome assembly completeness in terms of expected gene content was quantified using the Benchmarking Universal Single-Copy Ortholog (BUSCO) assessment tool (Waterhouse et al. 2018) for N. melanderi and seven other Apoidea species. Assembly completeness assessments employed BUSCOv3.0.3 with Augustus 3.3 (Stanke et al. 2006), HMMER 3.1b2 (Finn et al. 2011), and BLAST+ 2.7.1 (Camacho et al. 2009) (Camacho et al. 2009), using both the hymenoptera_odb9 and the insecta_odb9 BUSCO lineage datasets and the Augustus species parameter 'honeybee1'.

Genome annotation
Gene annotation: We predicted gene models based on homology and de novo methods. Results were integrated with GLEAN (Elsik et al. 2014). Homology based gene prediction used the gene models of four species (Apis mellifera, Acromyrmex echinator, Drosophila melanogaster, and Homo sapiens). We used TBLASTN to gather a non-redundant set of protein sequences, and then selected the most similar proteins for each candidate protein coding region based on sequence similarity. Short fragments were connected with a custom script (SOLAR), and Genewise (v2.0) (Birney et al. 2004) was used to generate the gene structures based on the homology alignments. This generated four gene sets based on homology with four different species.
We used Augustus (Stanke et al. 2006) and SNAP (Johnson et al. 2008) for de novo gene prediction, with parameters trained on 500-1,000 intact genes from the homology-based predictions. We chose genes that were predicted by both programs for the final de novo gene set.
The four homology-based gene sets and one de novo gene set were integrated to generate a consensus gene set with GLEAN. We then filtered genes affiliated with repetitive DNA and genes whose CDS regions contained more than 30% Ns. Repetitive DNA was identified through annotation of tandem repeats (Tandem Repeats Finder v4.04) (Benson 1999) and transposable elements (TEs). This initial identification of TEs was performed based on homology-based and de novo The maximum likelihood 15-species molecular phylogeny estimated from the superalignment of 2,025 single-copy orthologs recovers supported families. Branch lengths represent substitutions per site, all nodes achieved 100% bootstrap support. Right: Total gene counts per species partitioned into categories from single-copy orthologs in all 15 species, or present but not necessarily single-copy in all (i.e., including gene duplications), to lineagerestricted orthologs (Halictidae, Apidae and M. rotundata, Apoidea, Formicoidea, Apoidea and Formicoidea, Hymenoptera, specific outgroups), genes showing orthology in less than 13 species (i.e., patchy distributions), genes present in the outgroups (present in P. domunila or C. cinctus, present in P. dominula or C. Cinctus or P. humanus), and genes with orthologs from other sequenced insect genomes or with no identifiable orthology. The purple Halictidae bar is present but barely visible as only 16 to 32 orthologous genes were assigned to the Halictidae-restricted category. (D) A large proportion of repetitive DNA consists of uncharacterized transposable elements, but all major transposon groups were detected.
We used the 571,457,212 reads generated from RNA sequencing to polish the gene set. After filtering, we mapped reads to the genome with TopHat (Trapnell et al. 2009), and used Cufflinks (Trapnell et al. 2012) to assemble transcripts. Assembled transcripts were then used to predict ORFs. Transcript-based gene models with intact ORFs that had no overlap with the GLEAN gene set were added. GLEAN gene models were replaced by transcript-based gene models with intact ORFs when there was a discrepancy in length or merging of gene models. Transcripts without intact ORFs were used to extend the incomplete GLEAN gene models to find start and stop codons.
Putative gene functions were assigned to genes based on best alignments to the Swiss-Prot database (Release 2013_11) (Bairoch 2004) using BLASTP. We used InterPro databases v32.0 (Zdobnov and Apweiler 2001;Quevillon et al. 2005) including Pfam, PRINTS, PROSITE, Pro-Dom, and SMART to identify protein motifs and domains. Gene Ontology terms were obtained from the corresponding InterPro entries.
BUSCO assessment of annotation completeness: Annotated gene set completeness in terms of expected gene content was quantified using the BUSCO assessment tool (Waterhouse et al. 2018) for N. melanderi and seven other Apoidea species. Gene sets were first filtered to select the single longest protein sequence for any genes with annotated alternative transcripts. Gene set completeness assessments employed BUS-COv3.0.3 with HMMER 3.1b2 (Finn et al. 2011), and BLAST+ 2.7.1 (Camacho et al. 2009), using both the hymenoptera_odb9 and the insecta_odb9 BUSCO lineage datasets.
Transcription factor motif scans: We generated binding scores for 223 representative transcription factor (TF) binding motifs in the N. melanderi genome. Motifs representative of TF clusters with at least one ortholog in bees (Kapheim et al. 2015) were selected from Fly-FactorSurvey (Zhu et al. 2011). After masking tandem repeats with Tandem Repeat Finder, we produced normalized genome-wide scoring profiles for each selected TF motif in the genome based on sliding windows of 500 bp with 250 bp overlap. We used the HMM-based motif scoring program Stubb (Sinha et al. 2006) with a fixed transition probability of 0.0025 and a background state nucleotide distribution learned from 5 kb regions without coding features of length . 22 kb. We then normalized these motif scores using two different methods. First, we created a "Rank Normalized" matrix, to normalize the window scores across each motif on a scale of 0 (best) to 1 (worst). Second, we created a "G/C Normalized" matrix, by considering each window's GC content. Motifs with high GC content are likely to produce a high Stubb score in a GC rich window. We thus separated genomic windows into 20 bins of equal size based on GC content, and performed rank-normalization separately within each bin. We next summarized motif scores at the gene level. For each gene, we calculated a score for each motif as Pgm = 1-(1-Ngm)^Wg, where Ngm is the best normalized score for motif m among the Wg windows that fall within the regulatory region of the gene g. We defined the regulatory region of the gene in five different ways: 5Kup2Kdown -5000 bp upstream to 2000 bp downstream of a gene's transcription start site (TSS), 5Kup -5000 bp upstream of a gene's TSS, 1Kup -1000 bp upstream of a gene's TSS, NearStartSiteall genomic windows that are closer to the gene's TSS than any other gene TSS, GeneTerrall genomic windows between the boundary positions of the nearest non-overlapping gene neighbors within at least 5000 bp upstream of the TSS.
We used the results of these target motif scans to check for transcription factor motif enrichment among gene sets of interest (i.e., genes under selection). For each normalization method and regulatory region, we created two motif target gene sets: a "conservative" set that contains only the top 100 genes by normalized score and a more "liberal" set that contains the 800 top genes. Enrichment tests for genes of interest were performed using the one-sided Fisher exact test for each of 1784 motif target sets defined using the two thresholds, both "G/C" and "Rank" normalization procedures, the 1Kup (likely the core promoter) and GeneTerr (likely containing distal enhancers) regulatory region definitions, and each of the representative 223 motifs. Multiple hypothesis test corrections were performed using the Benjamini-Hochberg procedure (Benjamini and Hochberg 1995). For significantly enriched motifs (adjusted-P , 6E-04), we determined if an ortholog of the fly transcription factor protein was present in the N. melanderi genome using blastp with e-value , 10e-3 and % identity $ 50.
Transposable element identification: We performed a more detailed de novo investigation of transposable elements in the N. melanderi genome using raw sequencing reads in a genome assembly-independent approach (Goubert et al. 2015). This method uses short reads representing a .1x coverage of the genome for an assembly, which typically is only successful for sequences which are represented as multiple copies (i.e., repeated) within the genome and hence contribute .1x coverage (of their respective repeat family) to the assembly. Consequently, this repeatome assembly represents a qualitative overview of repetitive and transposable elements in the genome and can subsequently be used to quantify each element based on the number of reads mapping to them. It is expected that this approach is less biased than inferring repetitive elements based on the genome assembly, which is limited by the small proportion of reads that span across longer repetitive sequences.
First, we filtered a subset of five million raw reads for mitochondrial contamination to avoid biasing the detection of highly repetitive sequences. This involved aligning reads to the genome assembly with bwa-mem (Li 2013), and evaluating read depth with bedtools (Quinlan and Hall 2010). We identified contigs and scaffolds with high coverage ($ 500x) as potential mitochondrial sequences, based on the assumption that the number of sequenced mitochondrial copies is much higher than that of the nuclear genome. These contigs and scaffolds were further analyzed for sequence similarity (blastn v. 2.2.28+) to the mitogenome of the closest available bee species, Halictus rubicundus (KT164656.1). We identified five scaffolds as putatively mitochondrial (scaffold235256, scaffold241193, scaffold252191, scaffold252994, scaffold257806). Reads aligning to these scaffolds were filtered from the analysis.
The remaining reads were used for repeat analysis in five iterations of the transposable element discovery program DnaPipeTE v1.1 (Goubert et al. 2015), following Stolle et al. (2018). Each iteration used a new set of the same number of reads randomly sampled from the filtered reads. The analysis was repeated for different number of reads to represent a genome sequence assembly length coverage of 0.20x-0.40x in steps of 0.05x. This series of repeat content estimates determines the amount of input data that provides a stable estimate of genomic repeat content, and thus ensures that adequate coverage has been obtained for accurate estimates. The final set of repetitive elements was generated based on 0.30x coverage, using RepeatMasker v4.0.7 and a 10% sequence divergence cut-off. Overlap between repetitive element annotations and genes was detected with bedtools.

Orthology delineation
Orthologous groups (OGs) delineated across 116 insect species were retrieved from OrthoDB v9.1 (Zdobnov et al. 2017) to identify orthologs. The OrthoDB orthology delineation procedure employs allagainst-all protein sequence alignments to identify all best reciprocal hits (BRHs) between genes from each pair of species. It then uses a graph-based approach that starts with BRH triangulation to build OGs containing all genes descended from a single gene in the last common ancestor of the considered species. The annotated proteins from the genomes of N. melanderi were first filtered to select one protein-coding transcript per gene and then mapped to OrthoDB v9.1 at the Insecta level, using all 116 species and an unpublished halictid bee genome (Megalopta genalis; K. M. Kapheim et al. unpublished) for orthology mapping. The OrthoDB orthology mapping approach uses the same BRH-based procedure as for building OGs, but only allowing proteins from the mapped species to join existing OGs.

Phylogenomic analysis
We reconstructed a molecular species phylogeny from 2,025 universal single-copy orthologs among the protein sequences of 15 insects including N. melanderi (Table S1-S2). The protein sequences from each orthogroup were first aligned with Muscle 3.8.31 (Edgar 2004 With these data, we performed a comparative orthology analysis to identify genes with universal, widely shared, or lineage-specific/ restricted distributions across the selected species, or with identifiable orthologs from other insect species from OrthoDB v9.1. Ortholog presence, absence, and copy numbers were assessed for all OGs across the 15 species to classify genes according to their orthology profiles. The categories (each mutually exclusive) included: 1) Single-copy in all 15 insect species; 2) Present in all 15 insect species; 3) Halictidae: Present in .=2 Halictidae but none of the other 11 species; 4) Apidae + Mrot: Present in .= 2 Apidae and Megachile rotundata but none of the other 11 species; 5) Apoidea: Present in .= 1 Halictidae, present in .= 1 Apidae and Megachile rotundata but none of the other 7 species; 6) Formicoidea: Present in .= 2 Formicoidea but none of the other 11 species; 7) Apoidea + Formicoidea: Present in .=2 Apoidea, present in .=1 Formicoidea but not in Polistes dominula or Cephus cinctus or P. humanus; 8) Pdom + Ccin: Present in P. dominula and C. cinctus but none of the other 13 species; 9) Pdom or Ccin or Phum: Present in .=2 of P. dominula or C. cinctus or P. humanus and none of the other 12 species; 10) Hymenoptera: Present in .=2 Hymenoptera and absent from P. humanus; 11) Present , 13: Present in ,13 of the 15 species, i.e., a patchy distribution not represented by any other category; 12) Other orthology: Present in any other insect from OrthoDB v9.1; 13) No orthology: No identifiable orthology at the OrthoDB v9.1 Insecta level.

Population genetic analysis
SNP discovery and filtering: We used sequences generated from the 18 females and one male to characterize genetic variants following GATK best practices (https://software.broadinstitute.org/gatk/bestpractices/). Reads were pre-processed by quality trimming using sickle with default parameters (Joshi and Fass 2011). We then converted paired reads to BAM format and marked adapters with Picard tools ("Picard. http://picard.sourceforge.net/. Accessed January 12, 2016").
We performed variant calling in two rounds. The first pass was to generate a high quality SNP set that could be used for base quality recalibration, followed by a second pass of variant calling. For both rounds, we used the HaplotypeCaller function in Picard tools (-variant_ index_type LINEAR,-variant_index_parameter 128000, -ERC GVCF), followed by joint genotyping for the 18 females and individual genotyping for the male sample (GenotypeGVCFs). Haplotype caller was run set with ploidy level = 2n for all samples, including the haploid male. The latter was used to identify low-confidence or spurious SNPs that could be filtered from the female calls.
Variant filtering followed the GATK generic recommendations (-filterExpression "QD , 2.0, FS . 60.0, MQ , 40.0, ReadPosRank-Sum , -8.0,-restrictAllelesTo BIALLELIC). These were further filtered for SNPs identified as heterozygous in the male sample and for which genotypes were missing in any sample (-max-missing-count 0). This set of high-confidence SNPs was used as input for base quality score recalibration for the 18 females. The second round of variant calling and filtering for these samples followed that of the first round, with the exception that we allowed missing genotypes in up to 8 samples. We then applied a final, more stringent set of filters using vcftools (Danecek et al. 2011) (-min-meanDP 5,-max-missing-count 4,-maf 0.05,-minGQ 9,-minDP 3). This yielded a final set of 412,800 high confidence SNPs used in the downstream analyses (File S1).

Structure analysis:
We evaluated the potential for population structure by estimating heterozygosity, relatedness, and Hardy-Weinberg disequilibrium within our samples using vcftools. We also used ADMIXTURE v.1.3 (Alexander et al. 2009) to look for evidence of population structure (N = 18 diploids). We randomly extracted SNPS that were at least 1000bp apart across the genome and ran K = 1-4 for three independent datasets. SNP function: We identified the functional role (e.g., upstream, synonymous, non-synonymous, etc.) of SNPs using SNPEFF (Cingolani et al. 2012) for all SNPs within our data set (N = 412,800).
Genetic diversity: We characterized genetic diversity by evaluating pi and Tajima's D in 10Kb and 1Kb windows with vcftools (-windowpi,-TajimaD,-site-pi). We mapped gene models to these windows with bedtools intersect, and Tajima's D and pi values were averaged over each gene model using the aggregate function in R (R Core Team 2016). We then calculated the cumulative percentile for pi and Tajima's D for each gene using the ecdf function in R. These percentiles were then multiplied and recalculated. Genes for the joint percentile of pi and Tajima's D that fell in the lowest 5% were considered to be under ongoing positive selection. This allowed us to identify outlier regions which are likely experiencing positive selection relative to the entire genome (Kelley et al. 2006). To estimate genetic diversity across the genome in windows, we first calculated coverage at each site within 1Kb windows across the genome using bedtools coverage. Within each window, we estimated the proportion of sites with at least 5 reads of coverage. We used this value as the denominator to calculate pi within 1Kb windows.
Effective population size and demography: We estimated Ne using SMC++ (Terhorst et al. 2017). We randomly selected 4 large scaffolds (.1 Kb) and estimated effective population size of our single Nomia. melanderi population from 1000 to 100000 years before present. We assumed a single generation per year and a mutation rate of 6.8x10 29 (Liu et al. 2017). For each scaffold, we created 6 datasets by randomly selecting between 5 and 8 individuals without replacement. We used these files to estimate Ne using the cross-validation for each scaffold.
We evaluated the possibility of recent demographic changes by estimating Tajima's D in 1000bp windows across the genome for all samples (Tajima 1989).

Evolutionary rate analysis
Single copy orthologs were extracted from OGs identified above for Lasioglossum albipes, Dufourea novaengliae, M. genalis, and N. melanderi. Peptide alignments were obtained by running GUIDANCE2 (Penn et al. 2010) with the PRANK aligner (Löytynoja 2014) and species tree ((Dnov:67.51,(Nmel:58.18,(Mgen:47.03,Lalb:47.03):11.15):9.33); (Branstetter et al. 2017)) on each orthogroup. Low scoring residues (scores , 0.5) were masked to N using GUIDANCE2 to mask poor quality regions of each alignment. PAL2NAL (Suyama et al. 2006) was used to back-translate aligned peptide sequences to CDS and format alignments for PAML. PAML (Yang 2007) was run to evaluate the likelihood of multiple hypothesized branch models of dN/dS relative to two null models with trees and parameters as follows: Orthogroups with dS . 2 were removed, and likelihood ratio tests were performed to determine the most likely value of omega for each branch.

Functional Enrichment Tests
We performed all tests of functional enrichment using the GOstats package (Gentleman and Falcon 2013) in R version 3.4.4. We used terms that were significantly enriched (P , 0.05) to build word clouds with the R packages tm (Feinerer et al. 2008), SnowballC (Bouchet-Valat 2014), and wordcloud (Fellows 2018).

Data Availability
Sequence data are available at NCBI (BioProject PRJNA495036). The genome assembly is available at NCBI (BioProject PRJNA494873). Genetic variants and genotypes are available in VCF format in File S1. TF binding motif scores are in File S2. Repetitive DNA content is in File S3. SNP effects are in File S4. The genome annotation (GFF format) is in File S5. All supplementary tables (Table S1-S8) and files (Files S1-S5) have been deposited at FigShare. Supplemental material available at Figshare: https://doi.org/10.25387/g3.7296281.

RESULTS AND DISCUSSION
The N. melanderi genome assembly resulted in 268,376 scaffolds (3,194 . 1 kb) with an N50 contig length of 25.01 kb and scaffold length of 2.05 Mb (Table 1). Total size is estimated to be 299.6 Mb, based on a k-mer analysis with k = 17 and a peak depth of 70. CEGMA analysis indicated 244 of 248 (98.39%) core eukaryotic genes were completely assembled, and 10.25% of the detected CEGMAs had more than one ortholog. BUSCO analyses indicated 98.8% of Insecta BUSCOs were complete in the assembly (Table S3).
Our official gene set includes 10,847 predicted protein-coding gene models. This is likely to be a relatively complete gene set, as 96.0% of Insecta BUSCOs were identified as complete, which is comparable to other bee genomes (Table S3). Most (8,075) of the predicted genes belong to ancient OGs that include orthologs in vertebrate species. However, there were 819 genes without any known orthologs ( Figure 1B). Our comparative analysis with representative Hymenoptera species and the outgroup, P. humanus, identified 2,025 single-copy orthologs from which we constructed the molecular species phylogeny that confidently places Halictidae as a sister group to the combined Apidae and Megachilidae groups within Apoidea ( Figure 1C). Orthology delineation showed that 92.2% of N. melanderi predicted genes have orthologs in other insects and only 16 of them were unique to the family Halictidae ( Figure 1C). Transcription factor motif binding scores for each gene are available in File S2.
In a genome-assembly independent approach using short reads and DnaPipeTE, we assembled 54,236 repetitive elements, suggesting that 37.5% of the N. melanderi genome is repetitive content (File S3; Figure  1D). We identified transposable elements from all major groups (LTR, LINE, SINE, DNA, Helitron) and other elements with similarities to unclassified repeats (7,866 total annotated repeats), but unknown elements are the most abundant type of transposon (25.5%) ( Figure 1D), showing no similarities to known repetitive elements, conserved domains, or sequences in NCBI's non-redundant nt database.
Of annotated transposable elements, LINE retrotransposons (most common: I and Jockey) were the most abundant, followed by LTR retrotransposons (most common: Gypsy) and small amounts of DNA (mostly Tc1-Mariner, PiggyBac, hAT and Kolobok) or other transposons (File S3). Some annotations suggest the presence of Crypton, Helitron and Maverick elements as well as 5S/tRNA SINE (File S3). A majority of the detected retroelements show little sequence divergence, indicating recent activity, particularly Gypsy (LTR), Copia (LTR), I (LINE) and R2 (LINE).
Annotation of the genome assembly yielded 25.93 Mbp of masked sequences (8.59% at 10% sequence divergence), which is less than the repetitive fraction of .37% inferred by DnaPipeTE. Even at a 20% sequence divergence threshold, only 43.36 Mbp (14.37%) were masked, suggesting that a substantial fraction of the repetitive part of the genome is not part of the genome assembly, likely due to the technical limitations in assembling repetitive elements from short reads.
Our population genetic analysis indicated our population is panmictic. We did not find any evidence of population structure among our samples. Across all three datasets run through STRUCTURE, the lowest CV error was found for K = 1 (CV = 0.68) (Figure 2A). Likewise, pairwise relatedness estimates based on the unadjusted Ajk statistic were close to 0 (-0.084 --0.047) for all females in our population (Yang et al. 2010).
Solitary bees are expected to have high genetic diversity and large effective population sizes (Romiguier et al. 2014), and recent census data suggests there are 17 million females nesting in our study population in the Touchet Valley (Washington, USA) (Cane 2008  We tested for population structure for K = 1-4 (right numbers) and found that the most likely K = 1 (average CV error = 0.68 across three independent runs). K:CV is given to the right of each row. (B) Estimates of N e show evidence for a decline in effective population size in our alkali bee population, beginning about 10,000 years before present. Blue line, median estimated N e ; shaded gray area, 95% confidence intervals. (C) Genes under positive selection are significantly enriched for molecular functions and biological processes related to tRNA transfer and binding. (D) Genes with a slower evolutionary rate (dN/dS) in N. melanderi than in other halictid bees are significantly enriched for processes and functions related to transcription and translation. In B and C, the size of the word corresponds to the frequency to which that term appears on a list of significantly enriched GO terms. (E) The distribution of dN/dS values for N. melanderi genes are skewed toward zero, and none are greater than 1. Blue dashed line, mean dN/dS. However, we find several lines of evidence to suggest that effective population size of our N. melanderi population has declined in the recent past. First, our estimates of genetic diversity were surprisingly low. Three of the 18 females in our dataset had significantly higher homozygosity than expected (P , 0.05). Genetic diversity (pi) across the genome in 1Kb windows (corrected for coverage, see Methods) was estimated to be 0.00153. This is intermediate to diversity previously estimated for Apis mellifera (0.0131, (Harpur et al. 2014)) and Bombus impatiens (0.002, (Harpur et al. 2017)). Second, the genome-wide average Tajima's D was significantly greater than 0 (one-way T-test; mean = 0.77 +/2 0.002 SE; P , 0.00001) indicating a recent population decline.
Third, N e is predicted to have declined within the last 10,000 years ( Figure 2B). In the last 2,000 years, N e has had a median of 12,554 individuals (range: 3,119-3,978,942). The long, slow population decline reflected in our samples corresponds to a period during which much of Washington state was underwater due to glacial flooding, known as the Missoula Floods. Our study area, Touchet Valley, was under Lake Lewis during this time, and was thus uninhabitable for ground-nesting bees.
More recent fluctuations in N e may reflect less catastrophic events. Seed growers have maintained large nesting areas ("bee beds") for alkali bees within a 240 km 2 watershed that encompasses our sampling area for several decades (Cane 2008). Some of these bee beds are among the largest nesting aggregations ever recorded, at up to 278 nests per m 2 . However, survey data suggests there are large fluctuations in population size, as the population increased ninefold over an eight year period (1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006) (Cane 2008). Records from individual bee beds reflect these fluctuations. For example, a bed that was started in 1973 grew from 550 nesting females to 5.3 million nesting females in 33 years (Johansen et al. 1978;Cane 2008). However, other beds were destroyed or abandoned for decades at a time, only to be recolonized later. A large population crash occurred in the 1990s, likely due to use of a new pesticide (Cane 2008), and flooding events have caused massive valley-wide reproductive failures (Stephen 2003). Our wide range of N e estimates and signatures of genetic bottlenecks likely reflect these population fluctuations.
Our selection scan revealed 479 N. melanderi genes under positive directional selection. Genes under selection were highly conserved, and the age distribution was similar to the distribution across all predicted genes (x 2 = 54, d.f. = 48, P = 0.26; Table S4). Genes showing signatures of ongoing positive selection were enriched for functions related to tRNA transfer and DNA/nucleosome binding ( Figure 2C, Table S5). Because DNA binding is typically an indicator of transcription factor activity, we performed enrichment analysis of genes under selection with our previously defined transcription factor motif target sets (File S2). The most enriched motif target sets (adjusted-P , 6E-04) included transcription factors involved in neural differentiation (brick-a-brack 1, prospero, nubbin, zelda, twin-of-eyeless, pox meso, worniu) and neural secretory functions (dimmed) ( Table S6). We identified 505,203 functional predictions for 412,800 variable sites (SNPs) within 9,692 genes, most of which are intergenic (File S4).
Our analysis of evolutionary rates included 6,644 single-copy orthologs, most of which (95%) were evolving at similar rates across all four halictid bee lineages. We identified 61 N. melanderi genes that are evolving at a significantly different rate from other halictid bees (Table S7). Of these, the majority (74%) are evolving slower than in other lineages. These genes are significantly enriched for functions related to transcription and translation ( Figure 2D, Table S8). The distribution of estimated dN/dS values for N. melanderi genes was skewed toward zero, with a notable absence of values greater than one ( Figure 2E). This suggests that most genes in our analysis show evidence of neutral or purifying selection. This result is likely influenced by the vast evolutionary distance separating the four halictid lineages, which shared a common ancestor . 150 million years ago (Branstetter et al. 2017). Our set of single-copy orthologs was thus limited to highly conserved genes.
In conclusion, we present a high quality draft genome assembly of the solitary alkali bee, N. melanderi, that will be a valuable resource for both basic and applied research communities.