# On Robust Association Testing for Quantitative Traits and Rare Variants

- Peng Wei*,
^{†},^{1}, - Ying Cao
^{†}, - Yiwei Zhang
^{‡}, - Zhiyuan Xu
^{‡}, - Il-Youp Kwak
^{‡}, - Eric Boerwinkle
^{†},^{§}and - Wei Pan
^{‡},^{1}

^{*}Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas 77030^{†}Human Genetics Center, The University of Texas School of Public Health, Houston, Texas 77030^{‡}Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455^{§}Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030

- 1Corresponding authors: Department of Biostatistics, The University of Texas MD Anderson Cancer Center, 1400 Pressler Street, Pickens Tower, FCT4.6044, Houston, TX 77030. E-mail: pwei2{at}mdanderson.org; and Division of Biostatistics, School of Public Health, University of Minnesota, MMC 303, A460 Mayo, Minneapolis, MN 55455. E-mail: weip{at}biostat.umn.edu

## Abstract

With the advance of sequencing technologies, it has become a routine practice to test for association between a quantitative trait and a set of rare variants (RVs). While a number of RV association tests have been proposed, there is a dearth of studies on the robustness of RV association testing for nonnormal distributed traits, *e.g.*, due to skewness, which is ubiquitous in cohort studies. By extensive simulations, we demonstrate that commonly used RV tests, including sequence kernel association test (SKAT) and optimal unified SKAT (SKAT-O), are not robust to heavy-tailed or right-skewed trait distributions with inflated type I error rates; in contrast, the adaptive sum of powered score (aSPU) test is much more robust. Here we further propose a robust version of the aSPU test, called aSPUr. We conduct extensive simulations to evaluate the power of the tests, finding that for a larger number of RVs, aSPU is often more powerful than SKAT and SKAT-O, owing to its high data-adaptivity. We also compare different tests by conducting association analysis of triglyceride levels using the NHLBI ESP whole-exome sequencing data. The QQ plots for SKAT and SKAT-O were severely inflated (*λ* = 1.89 and 1.78, respectively), while those for aSPU and aSPUr behaved normally. Due to its relatively high robustness to outliers and high power of the aSPU test, we recommend its use complementary to SKAT and SKAT-O. If there is evidence of inflated type I error rate from the aSPU test, we would recommend the use of the more robust, but less powerful, aSPUr test.

- SKAT
- associate testing
- next-generation sequencing
- rare variants
- robustness

Thanks to the rapidly decreasing cost of the next-generation sequencing (NGS) technology, whole-exome sequencing (WES) and whole-genome sequencing (WGS) have been performed in many deeply phenotyped prospective cohort studies and electronic health record (EHR)-based cohorts of tens of thousands of individuals. Completed and ongoing large-scale WES and WGS sequencing efforts include the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (ESP) (Crosby *et al.* 2014), Trans-Omics for Precision Medicine (TOPMed) Program (Abecasis *et al.* 2015), the NHGRI Genome Sequencing Program (GSP), the UK10K project (UK10K Consortium 2015), and the Geisinger MyCode project (Mukherjee *et al.* 2015), to name a few. This big wave of sequencing data provides researchers with unprecedented opportunities to investigate low frequency [minor allele frequency (MAF) between 1 and 5%] and rare (MAF ) single nucleotide variants (SNVs) in association with complex phenotypes and diseases (Yi *et al.* 2011; Schaid *et al.* 2013; Lee *et al.* 2014). An example of the initial successes is the discovery of rare functional variants in *APOC3* associated with lower plasma triglyceride levels and a reduced risk of coronary heart disease (Crosby *et al.* 2014).

Many phenotypes such as triglyceride and fasting glucose collected in population-based cohort studies are quantitative and may not follow a normal distribution, as explicitly or implicitly assumed in most existing statistical methods for rare variant (RV)-based association testing (Bansal *et al.* 2010; Fan *et al.* 2015). However, there is a dearth of literature on the robustness of RV tests to the nonnormality of the observed traits, *e.g.*, due to skewness, which is expected to be ubiquitous in cohort studies. In particular, we find that commonly used RV tests, including the sequence kernel association test (SKAT) (Wu *et al.* 2011) and SKAT-O test (Lee *et al.* 2012), are very sensitive to quantitative trait’s deviation from normality and can have severely inflated association p-values. For example, when applied to the ESP WES data in association with plasma triglyceride levels, as described in detail later on, SKAT and SKAT-O had globally inflated quantile–quantile (QQ) plots with genomic control (GC; Devlin and Roeder 1999) respectively. In addition to the case study of RV-triglyceride association testing, here we have conducted extensive simulation studies to investigate the performance of several commonly used RV tests, including the burden test (Li and Leal 2008), SKAT, and SKAT-O, as well as our recently proposed adaptive sum of powered score (aSPU) test (Pan *et al.* 2014), in the presence of nonnormal quantitative traits. We have also studied and compared some commonly used *ad hoc* strategies to deal with nonnormal traits, such as natural logarithm transformation, inverse normal transformation, Winsorizing, trimming, and minor allele count (MAC) thresholding. Although we find that the aSPU test is more robust than SKAT and SKAT-O, it can sometimes suffer from inflated type I error rates in the presence of a few contaminated observations. In response, we further propose a robust version of the aSPU test, called aSPUr. While the traditional variant-by-variant association test for common SNVs (MAF ) has been shown to be robust to nonnormal distributed traits (Cao *et al.* 2014), here we demonstrate that RV association testing can be very sensitive to quantitative trait’s subtle deviation from normality. Based on type I error control and statistical power considerations, we further provide practitioners with some general guidelines and a new robust test to deal with nonnormal quantitative traits.

## Methods

### Review of existing RV tests

We first review our recently proposed class of sum of powered score (SPU) tests and their adaptive version called aSPU test (Pan *et al.* 2014). The former include the burden and SKAT tests as special cases. We then introduce a new robust version of the SPU and aSPU tests, denoted as SPUr and aSPUr. Consider a linear model for a quantitative trait,where is the trait for subject *i*, is the MAC (coded as 0, 1, or 2) of SNV *j* for subject *I*, and the error term is assumed to have a distribution with mean 0 and a constant variance The main interest is to test *i.e.*, none of the *k* variants in a set is associated with the phenotype. The score vector isand its covariance matrix is which can be consistently estimated by In fact, for any generalized linear models (GLMs) with a canonical link function, the score vector *U* remains the same as the above.

Pan *et al.* (2014) proposed a class of SPU tests, for an integer Note that when and the SPU test is equivalent to the burden test and the SKAT test under the linear kernel with equal RV weighting, respectively. Importantly, as *γ* increases, the SPU(*γ*) test puts more weights on the larger components of *U* while gradually ignoring the remaining components. In particular, we haveAs will be shown, since the SPU tests are based on resampling methods to calculate their p-values, they are invariant to monotone transformations, such as That is, we can define which uses only the largest component of and does not aggregate information from other RVs. More generally, as we increase the value of *γ*, we put higher and higher weights on the larger components of *U*, effectively realizing RV selection. On the other hand, an even integer of *γ* automatically eliminates the effects of different signs of ’s, avoiding power loss of the burden test in the presence of different association directions. However, an odd integer of *γ* might be more suitable, as in the SPU(1) or burden test, when the associations are all in the same direction.

Without covariates, Pan *et al.* (2014) proposed using permutations to obtain p-values for the SPU tests. With covariates, the parametric bootstrap (or, alternatively, permuting residuals) can be performed. Briefly, we fit a null model under and obtain the residuals, then we randomly permute the residuals and add them to the estimated means of the traits from the null model, obtaining a new set of null traits We use the null traits to obtain a null statistic We repeat the above process *B* times, and calculate the p-value as

Since the power of an SPU(*γ*) test depends on the choice of *γ* while the optimal choice of *γ* depends on the unknown true association pattern of the RVs to be tested, it would be desirable to data-adaptively choose the value of *γ*. For this purpose, Pan *et al.* (2014) proposed an adaptive SPU (aSPU) test to combine information across multiple SPU tests with various values of *γ*. Suppose that we have some candidate values of *γ* in *e.g.*, as used in our later simulation experiments, and suppose that the p-value of the test is then our combining procedure is to take the minimum p-value:Of course, is no longer a genuine p-value; as for the SPU tests, we recourse to a resampling method to estimate its p-value. As before, first we simulate *B* independent copies of the null traits by the parametric bootstrap for We then calculate the corresponding SPU test statistics and their p-values Thus, we have and the final p-value of the aSPU test is We used in our simulation experiments. Note that, we can first use a smaller or so to scan a genome, then use a larger *B* to test on a few genes or regions that pass the significance criterion (*e.g.*, p-value ) in the first step.

### New tests: robust SPU and aSPU tests

A potential problem with the above Gaussian likelihood-based approach is its nonrobustness to outliers, which can be caused by non-Gaussian errors or contaminated traits Consider a situation where we observe a singleton for RV *j*; that is, say and all other for Then the *j*th component of the score vector is which will be largely influenced by a single observation As to be shown later, in such a situation, if is contaminated or measured with error, then we may have inflated type I errors. To overcome the problem, we propose using a robust regression method. Rather than using the Gaussian-based likelihood, we propose using the Huber loss with the corresponding score vector with andwhere is chosen to maintain a high efficiency for a normal error (*i.e.*, trait) distribution, and is an estimate of *σ* (Jureckova and Picek 2006). Under we can use the median absolute deviation (MAD) as a robust estimate of *σ*. It is clear that the truncation of at a constant *c* eliminates or alleviates the undue influence of outlying ’s.

We define a robust SPU (SPUr) test for a given asWith various values of we obtain a class of the SPUr tests. Accordingly we define an adaptive robust SPU (aSPUr) test aswhere is the p-value of the SPUr(*γ*) test, and we use as before. The p-values of the SPUr and aSPUr tests are obtained in the same way as for the SPU and aSPU tests described earlier.

Alternatively, based on some initial estimate (*e.g.*, the least squares or least absolute deviation estimate) of *β*, we define residuals and then use which might give higher power than using the other estimate of *σ* (which does not take account of possible effects of RVs). However, it is difficult to obtain reliable estimates of for RVs, which in fact motivated the development of the burden tests and other methods. This is a topic to be explored in the future.

### Comparison with Winsorizing and trimming

Two simple and straightforward ways to handle outliers are Winsorizing and trimming. For a specified small such as or 0.025, define the -percentile and -percentile of as and respectively. In Winsorizing, any satisfying is truncated at and any is truncated at In trimming, any observation *i* is removed from the dataset if or

Winsorizing is to some degree like using the Huber loss function in truncating outlying trait values. However there are two important differences. First, the choice of the threshold is arbitrary, which may be too small or too large, depending on the unknown proportion of the outliers. Second, more importantly, in Winsorizing whether a trait value is judged to be an outlier or not is completely based on its absolute value without accounting for any covariates; if instead we Winsorize residuals it will be more similar to using the Huber loss. As will be shown, ignoring covariate effects may lead to severely inflated type I errors or power loss.

In addition to the above two disadvantages shared with Winsorizing, trimming is too extreme in eliminating the observations judged to be, but in truth may or may not be, outliers, which often leads to severe loss of power.

### Software and data availability

The aSPUr test has been implemented in *R* package “aSPU” available on the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/aSPU/. The NHLBI ESP data are accessible from the National Center for Biotechnology Information (NCBI) dbGaP with accession numbers phs000398, phs000400, phs000401, and phs000281.

## Results

### Simulation set-ups

To evaluate and compare the performance of various tests, we conducted extensive simulation studies under different trait distributions. The genotype data were simulated following Wang and Elston (2007) and Basu and Pan (2011). Specifically, a latent variable was simulated from a *k*-dimensional multivariate normal distribution with *V* as an AR-1(*ρ*) correlation structure: for any Then we randomly drew from a uniform distribution *k* MAFs between 0.1 and 0.5%, and accordingly dichotomized to yield a haplotype. We similarly simulated another latent variable and the corresponding haplotype. We combined the two haplotypes to form the genotype for subject *i*. This process was repeated times to generate genotypes for subjects. We used and to generate independent and correlated SNVs (in linkage equilibrium and in linkage disequilibrium) respectively. Note that we only used unphased genotypes, not haplotypes, in simulations.

A trait was simulated from linear regression model with the following error distribution. First, was independent and identically distributed (iid) from Second, was iid from a Log-normal distribution with mean 0 and SD on the log scale. Third, was iid from a *t*-distribution with degrees of freedom or 3. Fourth, was iid from a contaminated a single observation with was randomly chosen and its trait had an additive error with or 10.

To evaluate the empirical type I error (null cases), we had For empirical power (nonnull cases), we randomly chose eight SNVs from *k* SNVs as causal ones with nonzero ’s while other SNVs having their For causal SNVs, we used two sets of coefficients: (power set-up I) and (power set-up II), which favors SKAT and the burden test, respectively. In the absence of other covariates, was just a constant 1 for the intercept term; otherwise, we randomly generated two independent covariates from with To investigate the robustness of a method to the number of SNVs, we increased *k* from 8 to 256.

### Simulation results

Table 1 shows the empirical type I error rates for various tests without any transformation. Under error distribution, all tests controlled the type I error rates satisfactorily at the nominal level Under heavy-tailed ( and ) and skewed ( and ) error distributions, SKAT and SKAT-O had severely inflated type I error rates, while aSPU and aSPUr controlled their type I error rates satisfactorily. With even just 1 (out of 400) quantitative trait contaminated, all the tests except aSPUr could not control the type I error rates well, though the aSPU test (along with the SPU tests) performed much better than SKAT and SKAT-O. The same conclusions held with correlated SNVs and with or without covariates (Supplemental Material, Table S1, Table S2, and Table S3). For nonnormal error distributions subject to Winsorizing or trimming, the results were dependent on the choice of the cut-off as shown in Table S4 and Table S5. For example, when the error distribution was with no covariates, SKAT and SKAT-O after Winsorizing at could maintain a correct type I error rate, but not at For an error distribution of SKAT and SKAT-O could not control the type I error rates for either or (Table S4). With covariates, the performance became worse, especially with trimming; neither Winsorizing nor trimming could control type I error rates at either or (Table S5).

For empirical power comparison, under error distribution the aSPU test performed similarly to SKAT or SKAT-O with a smaller number of SNVs; however, as the number of SNVs increased, the aSPU test became more powerful (Table S6). As reported before (Pan *et al.* 2014; 2015a,b), with increasing number of SNVs an SPU(*γ*) test with a larger value tended to be more powerful; in particular, SPU(4) could be much more powerful than SPU(2), *e.g.*, when there were 128 or more SNVs (Table S6). Of note, SPU(2) is equivalent to SKAT with a linear kernel which was optimal here and was used throughout (Pan 2009; Pan 2011). On the other hand, the aSPUr was conservative, especially for causal SNVs with larger effect sizes (Cases III and IV in Table S6), in which it was hard to distinguish a genuinely large effect size of a RV from a contaminated trait value. In robust statistics, one would like to use some initial estimator to estimate and thus take account of large effects, which however was almost impossible in the current context for RVs: with small MAFs, it is almost impossible to obtain reliable estimates for RVs. The same conclusions held with correlated SNVs (Table S7). For Winsorizing or trimming, there was always a dramatic loss of power with trimming, while Winsorizing performed well for a smaller number of SNVs. But its performance deteriorated as the number of SNVs increased; in particular, again its performance depended on the use of the level (Table S8 and Table S9).

We further investigated the effects of natural logarithm (Ln) and rank-based inverse normal (INV) transformations on the type I error rate and power. We considered the skewed and heavy-tailed error distributions. With the two covariates in the simulated data, we first regressed them out in a linear model under then used the residuals or their transformations to test their association with a set of SNVs.

Figure 1 shows the type I error rates and powers for a skewed error distribution First, without transformation both SKAT and SKAT-O gave severely inflated type I error rates, while aSPU and aSPUr controlled their type I error rates satisfactorily; with Ln transformation, all tests performed well, although the type I error rates of SKAT and SKAT-O might be slightly inflated; with INV transformation, all tests controlled the type I error rates satisfactorily. Second, under power set-up I which favored SKAT and SPU(2) because the association directions of the eight causal SNVs were different, without transformation although both aSPU and aSPUr could control the type I error rates, they lost power dramatically as compared to those with transformed traits; between the two, aSPUr was more powerful. On the other hand, with Ln transformation SKAT was most powerful, followed by SKAT-O, then aSPU, and finally aSPUr, though the power difference became smaller as the number of SNVs to be tested increased; with INV transformation, SKAT was most powerful for smaller numbers of SNVs while aSPU was more powerful for larger numbers of SNVs; for unknown reasons, aSPUr did not perform well. Third, under power set-up II which favored burden tests because the association directions of the eight causal SNVs were the same, without transformation although both aSPU and aSPUr could control the type I error rates, they lost power dramatically as compared to those with transformed traits; between the two, aSPU was more powerful. With Ln transformation, SKAT-O and aSPU were most powerful, followed by SKAT or aSPUr, though the power difference was not dramatic; with INV transformation, aSPUr was consistently best, followed by SKAT-O and aSPU (for which the former had an edge for a smaller number of SNVs while the latter had otherwise), finally by SKAT.

Figure 2 shows the type I error rates and powers for a heavy-tailed (and nonskewed) error distribution First, without transformation both SKAT and SKAT-O gave severely inflated type I error rates, while aSPU and aSPUr controlled their type I error rates satisfactorily. Although no reason to use a Ln transformation, to show possible effects of using an incorrect transformation, we also presented results based on the Ln transformation: again both aSPU and aSPUr were robust with well-controlled type I error rates, while SKAT and SKAT-O had severely inflated ones; with INV transformation, all tests were satisfactory. On the other hand, under power set-up I, with no or Ln transformation, although both aSPU and aSPUr could control the type I error rates, aSPU lost power dramatically while aSPUr did not as compared to those with transformed traits; between the two, aSPUr was much more powerful. We showed the results for SKAT and SKAT-O, even though they had severely inflated type I error rates. With INV transformation, SKAT and aSPUr were the winners, though SKAT was slightly more powerful for smaller numbers of SNVs while aSPUr was more powerful for larger numbers of SNVs, closely followed by aSPU, then SKAT-O. Finally, under power set-up II, with no or Ln transformation, although both aSPU and aSPUr could control the type I error rates, aSPU lost power dramatically while aSPUr did not as compared to those with transformed traits; between the two, aSPUr was more powerful; with INV transformation, aSPUr was best, closely followed by aSPU, then SKAT-O, and then SKAT.

### Data example: application to the NHLBI ESP triglyceride phenotype

To further demonstrate the performance of various RV tests in a real data example, we analyzed the WES data in association with plasma triglyceride level in 1731 individuals of European ancestry who were sequenced in the NHLBI ESP project. The study subjects were selected from the following population-based cohorts: Atherosclerosis Risk in Communities, the Cardiovascular Heart Study, the Framingham Heart Study, and the Womens Health Initiative; see Crosby *et al.* (2014) for details. We performed gene-based RV association tests, including SKAT, SKAT-O, T1 burden test, aSPU, and aSPUr, on untransformed, natural logarithm transformed, and rank-based inverse normal transformed triglyceride levels, denoted as TG, Ln(TG), and INV(TG), respectively. Following Crosby *et al.* (2014), we included nonsynonymous (nonsense and missense) and splice-site variants of MAF within each gene and excluded genes with cumulative MACs <5, resulting in 13,978 genes. The genome-wide significance threshold was set at based on the Bonferroni procedure. As in Crosby *et al.* (2014), we performed natural logarithm transformation on the raw triglyceride level and adjusted for covariates including age, sex, two principal components capturing population substructure, and indicator variables for the ESP ascertainment scheme in all association testing. We used QQ plots and GC *λ* to detect possible inflation of the RV association test p-values. Because there were a large number of extremely rare variants, *e.g.*, singletons and doubletons, in the ESP WES data, we used the power set for both aSPU and aSPUr, as suggested by Pan *et al.* (2014) for numerical stability. In addition, we used the following stage-wise bootstrap procedure for aSPU and aSPUr: we started with for all genes and then gradually increased *B*. If an estimated p-value was we increased *B* to to reestimate the p-value until for genome-wide significance. Moreover, we used *APOC3* as a positive control gene to compare the power of various tests. *APOC3* was identified as the top gene harboring putatively functional RVs associated with reduced level of Ln(TG), and was further replicated and confirmed in independent large samples (Crosby *et al.* 2014). In addition, it was identified in other RV association studies (Tachmazidou *et al.* 2013; Li *et al.* 2015).

Figure 3F shows that TG was right-skewed with some individuals having extremely high TG levels. When applied to TG, the QQ plots for aSPU and aSPUr behaved normally as shown in Figure 3 (). In contrast, the SKAT and SKAT-O tests had severely inflated QQ plots (*λ* = 1.89 and 1.78, respectively); the QQ plot for T1 was less inflated but had a discernable deviation from the null in the tail area (*λ* = 1.13). We investigated the effectiveness of some *ad hoc* strategies, including trimming, Winsorizing, and increasing the MAC threshold, in alleviating the p-value inflation. When trimming at *λ* for SKAT and SKAT-O was reduced to 1.09 and 1.10, respectively; when Winsorizing at *λ* was reduced to 1.12 and 1.11, respectively. Despite the improvement, the QQ plots for SKAT and SKAT-O remained inflated (Figure S1). When we excluded genes with a MAC the QQ plots for SKAT and SKAT-O were still inflated with *λ* = 1.36 and 1.30, respectively, whereas T1 had a much improved QQ plot (*λ* = 1.04) (Figure S2). However, increasing the MAC threshold to 30 would further exclude 8155 genes, including the positive control gene *APOC3* with 14 minor alleles. When applied to Ln(TG) and INV(TG), all tests had well-behaved QQ plots and (Figure S3 and Figure S4). As shown in Figure 3F, TG approximately followed a Log-normal distribution, leading to similar results from the Ln and INV transformations; see the p-value comparison for *APOC3* in Table 2. To investigate whether the RV association test p-value inflation was also applicable to common variants, we performed conventional variant-by-variant association testing of TG for 50,602 SNPs with an MAF . As shown in Figure S5, the QQ plot was well behaved with suggesting that the p-value inflation was likely a unique problem for some RV association tests.

Table 2 shows the p-values and ranking of *APOC3* by various tests. In the analysis of TG, *APOC3* was ranked 62nd by aSPUr, but was not among the top 200 genes by all other methods; in the analyses of Ln(TG) and INV(TG), it was ranked among the top two by T1, SKAT-O, aSPU, and aSPUr, but not SKAT. This is consistent with the results reported in Crosby *et al.* (2014) that *APOC3* was the top gene associated with Ln(TG) by the T1 test but its p-value was not genome-wide significant in the discovery samples from the ESP. We demonstrate here that the statistical significance of *APOC3* was dependent on the transformation of the phenotype TG. Figure 3F shows that the carriers of the minor alleles for five out of six RVs in *APOC3* had reduced TG levels compared with the population average. In the presence of quite a few individuals with extremely high TG levels, *APOC3* was only nominally associated with TG and lowly ranked by all RV association tests except for aSPUr. By downweighting the extremely high TG observations, aSPUr increased the statistical significance and ranking of *APOC3* compared with aSPU and other tests, while avoiding global inflation of the p-values. On the other hand, since both Ln and INV transformations reduced the impact of extremely high TG observations, the association signal of *APOC3* was much amplified, resulting in its high ranking. In addition, as the majority of the variants in *APOC3* reduced the TG level, *i.e.*, the effects were roughly in the same direction, the T1 burden test and adaptive tests that incorporate the burden test, such as aSPU, aSPUr, and SKAT-O, yielded higher ranking for *APOC3* than did SKAT.

## Discussion

In summary, we have demonstrated using extensive simulations and application to the ESP WES data that SKAT and SKAT-O are not robust to heavy-tailed or skewed error distributions of quantitative traits with inflated type I error rates. *Ad hoc* remediation procedures, such as trimming and Winsorizing, may not be effective in reducing the inflated type I error rates and may lead to severe power loss. Depending on the underlying trait distributions, Ln or INV transformation may help control the type I error rates for SKAT and SKAT-O, which, however, could lead to transformation-specific association results as illustrated in the *APOC3* example, as well as power loss as demonstrated in simulation set-up II in Figure 2. On the other hand, the aSPU test and the newly proposed aSPUr test are much more robust to quantitative traits’ deviation from normality.

The nonrobustness of the SKAT test is mainly due to its poor asymptotic approximation of the null distribution in the presence of outliers. Note that the issue with SKAT remained with the use of its resampling method to calculate its p-values: we found that SKAT-Resampling implemented in the *R* package “SKAT” gave essentially equal p-values to those of SKAT in both simulations and real data application; the Pearson correlation between the two sets of the p-values was Moreover, the RV weighting scheme of SKAT makes its type I error inflation even worse. Since SKAT puts a higher weight on a more rare SNV *j*, if then it is a high-leverage point; in addition, if is outlying, then we know is an influential point. Hence, although SKAT’s weighting on rare SNVs might help it gain power to detect associated RVs, at the same time, the weighting also renders its nonrobustness to observations with outlying traits, which could happen when the trait has a heavy-tailed or right-skewed distribution, as shown in our simulations and supported by the real data application. For the latter, when applied to TG, SKAT with its default weighting and equal weighting gave a GC *λ* of 1.89 and 1.86, respectively.

Recently Auer *et al.* (2016) also reported that single SNV-based and SNV-set-based RV tests can be nonrobust to phenotypic outliers and nonnormality, which is in agreement with the main theme of this paper and highlights the importance of the topic studied here. They recommended the INV transformation for nonnormally distributed traits. Our work here is distinctive from Auer *et al.* in several important aspects. First, in addition to demonstrating the nonrobustness of existing RV tests, we have proposed a new SNV-set-based robust RV test, aSPUr. Second, while Auer *et al.* applied the Huber robust regression in the context of single SNV-based RV testing, we impose the Huber loss on the score vector *U* in the proposed aSPUr, which is generalizable to the broad class of score vector-based SNV-set RV tests, *e.g.*, SPU(1)/T1 and SPU(2)/SKAT. Third, in contrast to the finding of Auer *et al.* that the permutation test was the least powerful method when applied to single SNV-based RV test, we found the permutation-based aSPU and aSPUr to be robust in terms of both type I error control and maintaining high statistical power in the presence of true signals. Finally, we found that while the INV transformation could maintain the type I error rate, it could also lead to transformation-dependent ranking order of p-values as demonstrated in the *APOC3* example (Table 2).

We proposed the aSPUr test in the Huber loss framework. It is one of the first proposed and most thoroughly studied loss function in the robust statistics literature. For example, when the error distribution is normal, it has been shown that the Huber loss achieves asymptotic efficiency with the tuning parameter *c* being equal to 1.345 (Huber 1964). We found that it performed satisfactorily in our extensive numerical experiments. Other loss functions are also possible, for example, Tukeys biweight function (Jureckova and Picek 2006), which warrants further investigation.

In conclusion, we would recommend the use of the aSPU test for its robustness to heavily-tailed or skewed error distributions and its high power across many situations due to its adaptiveness. If there is evidence of inflated type I error or *λ*, *e.g.*, through QQ plots, then one may try the more robust aSPUr test or SKAT and SKAT-O with the INV transformation. Finally, neither Winsorizing nor trimming the data before applying another test, *e.g.*, SKAT or SKAT-O, outperformed the aSPUr test that was applied to the original data.

## Acknowledgments

This research was supported by National Institutes of Health (NIH) grant R01HL116720. P.W. was also supported by NIH grants R01CA169122 and R21HL126032, W.P. by R01GM113250, R01HL105397, and R01GM081535, and E.B. by RC2HL102419. The authors declare that they have no competing interests.

## Footnotes

Supplemental material is available online at www.g3journal.org/lookup/suppl/doi:10.1534/g3.116.035485/-/DC1.

*Communicating editor: D. J. de Koning*

- Received July 12, 2016.
- Accepted September 21, 2016.

- Copyright © 2016 Wei
*et al.*

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.