argyle: An R Package for Analysis of Illumina Genotyping Arrays

Genotyping microarrays are an important and widely-used tool in genetics. I present argyle, an R package for analysis of genotyping array data tailored to Illumina arrays. The goal of the argyle package is to provide simple, expressive tools for nonexpert users to perform quality checks and exploratory analyses of genotyping data. To these ends, the package consists of a suite of quality-control functions, normalization procedures, and utilities for visually and statistically summarizing such data. Format-conversion tools allow interoperability with popular software packages for analysis of genetic data including PLINK, R/qtl and DOQTL. Detailed vignettes demonstrating common use cases are included as supporting information. argyle bridges the gap between the low-level tasks of quality control and high-level tasks of genetic analysis. It is freely available at https://github.com/andrewparkermorgan/argyle and has been submitted to Bioconductor.


Introduction
This vignette demonstrates the steps required to load genotype data from output files produced by Illumina BeadStudio. Two files are expected: one named like *Sample_Map.zip, the sample manifest, and one named like *FinalReport.zip, which contains genotype calls and hybridization intensity values. These are plain-text reports compressed with ZIP. For Mac and Linux users, they will be decompressed on-the-fly by argyle; Windows users will need to unzip them before starting the import process.
The FinalReport is expected to have the following columns: SNP Name (unique marker name), Sample ID (unique sample name), Allele1 -Forward (first allele call, one of ACGTN), Allele 2 -Forward (second allele call, one of ACGTN), X (hybridization intensity in the x dimension) and Y (intensity in the y dimension). Column names are preceeded by a header section describing the number of samples and markers genotyped. Each row after the column names represents a single genotype (ie. a sample-marker pair). Homozygous calls will have the same value in columns Allele 1 and Allele 2; heterozygous calls will have diffferent values in the two columns; and missing calls will have an N in both columns. Allele calls are made with respect to the forward strand for ease of comparison with public databases and reference genome sequences.
In addition to the Sample_Map and FinalReport, argyle requires the user to supply a marker map for the array. This should be a dataframe with at least the following six columns: chr (chromosome), marker (unique marker name, to be matched to SNP Name in FinalReport), cM (genetic position in centimorgans), pos (physical position in base pairs), A1 (reference allele) and A2 (alternate allele). Alleles should be specified with respect to the forward strand, as in FinalReport. The dataframe should have row names, and these should match the values in the marker column.
The marker map must be prepared ahead of time, and doing so requires knowledge of the array platform. Maps for the Mouse Universal Genotyping Array series are available for download at http://csbio.unc.edu/CCstatus/.
This and other vignettes included with argyle makes use of genotyping reports from the Mouse Universal Genotyping Arrays for the house mouse (Mus musculus) sold by Neogen Inc. Users of other platforms should consult their vendor with specific questions about file formats.

Importing from BeadStudio files
First, load the argyle package and (for convenience) set the working directory for the R session.

library(argyle) setwd("~/argyle")
Next load the marker map, which has been pre-constructed and saved in an Rdata file. The object is a dataframe called snps. To import genotypes, use the read.beadstudio() function. Argument in.path specifies the directory in which argyle will search for the Sample_Map and FinalReport files, and prefix gives the file name prefix (in our case, none). The marker map (argument snps) must also be provided. Genotypes at markers not present in the marker map will be omitted from the output. If keep.intensity = TRUE (the default), hybridization intensities will be imported along with genotype calls.

The genotypes object
The genotypes object is the central data structure of the argyle package. The design mimics the file structure used by PLINK (see here for details). A genotypes is a matrix with markers in rows and samples in columns; each entry is a biallelic genotype call. Row names match marker names, and column names match sample names. The genotypes class inherits from the matrix in base R, so all functions which work for matricesapply(), dim() and so on -will also work for genotypes objects. Marker and sample metadata are stored as attributes of the matrix.
For a high-level overview genotypes object, use summary():

summary(geno)
## ---geno ---## A genotypes object with 77808 sites x 96 samples ## Allele encoding: native ## Intensity data: yes (raw) ## Sample metadata: yes ( 0 male / 0 female / 96 unknown ) ## Filters set: 0 sites / 0 samples ## File source: /Users/apm/Dropbox/pmdvlab/argyle/manuscript/vignettes/datasets/mega_example (on 2015-10 ## Checksum: ccf22694fee67b07e68b8aeb6fb5b053 As expected, the object contains 77808 markers and 96 samples. Alleles are encoded using the native scheme for BeadStudio -that is, as nucleotides (ACGT) with respect to the forward strand for homozygous calls, H for heterozygous calls, and N for no-calls (missing data). Intensity data is present, and raw indicates that no intensity normalization has been applied yet. No filters have been set. A checksum is computed during the import process that can be used to check file integrity if re-importing the same data again.
To peek at the contents of the object, use head():   This function returns a dataframe with (at least) six columns: fid (a "family" identifier, or more generally any grouping variable; it defaults to the sample name), iid (unique sample name), mom and dad (pedigree information, if any; 0 indicates missing data), sex (0 = unknown, 1 = male, 2 = female), and pheno (phenotype; -9 indicates missing data). Row names match the iid column.

## [1] 1247
Note that argyle's subset(), like subset.data.frame() in base R, uses lazy evaluation. By default the subsetting expression is evaluated in the context of the marker map, but we can also subset according to sample properties by passing the argument by = "samples": g3 <-subset(geno, fid != "exclude_me", by = "samples") ncol(g3) # how many samples are left?

Allele encoding schemes
When reading genotypes from BeadStudio output, alleles are encoded just as they appear in the file (native mode). However, it is often useful to represent genotypes as allele counts (of the non-reference or minor allele). To convert to numeric allele encoding with 0 = homozygous reference allele, 1 = heterozygous and 2 = homozygous alternate allele, do g2.recode <-recode(g2, "01") ## Recoding to 0/1/2 using reference alleles.
In the numeric encoding it is easy and fast to compute allele frequencies.

Other input formats
The argyle package can read PLINK binary filesets, and by extension, any format that can be converted by PLINK into such a fileset. PLINK-related functions are covered in a separate vignette.
Users whose genotype data is stored in ad hoc formats can still benefit from the functions provided by argyle. If genotype data can be manipulated into matrix form in R, a genotypes object can be constructed manually as shown in the example below.