## Abstract

The methods commonly used to test associations between ordinal phenotypes and genotypes often treat either the ordinal phenotype or the genotype as a continuous variable. To address the limitations of these approaches, we propose a model in which both the ordinal phenotype and the genotype are viewed as manifestations of an underlying multivariate normal random variable. The proposed method allows the ordinal phenotype, the genotype, and covariates to be modeled jointly. We employ the generalized estimating equation technique and M-estimation theory to estimate the model parameters and derive the corresponding asymptotic distribution. Numerical simulations and real data applications are conducted to compare the performance of the proposed method with that of methods based on the logit and probit models. Although our method may have limitations in controlling the Type I error rate in some situations, its gains in power support its practical value when the phenotype is genuinely ordinal.

Research in the field of genetic epidemiology suggests that some genetic variants play important roles in the etiology of human diseases. On the one hand, genetic variants are defined as genotypes, which are often treated as ordinal variables. On the other hand, there are multiple data types for diseases, *i.e.*, phenotypes, which can be continuous, binary, or ordinal (Li *et al.* 2006; Kim *et al.* 2013). Note that both binary and ordinal variables are categorical variables, but the latter can describe the disease state of a patient more precisely in many circumstances. For example, four levels—normal liver, light steatosis, moderate steatosis, and severe steatosis—have been utilized to describe the severity of liver steatosis (Bedogni *et al.* 2010).

With the development of high-throughput biological technology, ever more genotype data and data on complex traits have been generated and deposited in public databases. New statistical tests are urgently needed to investigate the associations between genotypes and traits, and to extract information that helps explain the mechanisms underlying the occurrence and development of diseases and traits.

Genome-wide association studies aim to identify associations between phenotypes and genotypes. In these studies, genotypes are often treated as predictors and phenotypes as outcomes. If the phenotype of interest is continuous, the classic linear regression model is commonly employed. When the phenotype is ordinal, the multinomial logit model (McCullagh 1980; Zhang *et al.* 2015) or the ordered probit model (Daykin and Moffatt 2002; Wang 2014) is usually recommended. All these models regress phenotype values or their distribution-based transformations on genotypes, under the assumptions that genotype values are continuous (Korse *et al.* 2009; Bedogni *et al.* 2010) and that the probability of having a disease increases linearly with the genotype value. However, the continuity assumption on genotype values and the linearity assumption between a phenotype and a genotype are difficult to verify in practice. If these two assumptions are violated, the corresponding Wald test statistics may lose substantial power. To overcome this, some researchers treated genotypes as ordinal variables and reversed the regression by regressing genotypes on phenotypes (O’Reilly *et al.* 2012). When the phenotype is a continuous variable, this approach is indeed useful for removing or relaxing the continuity and linearity assumptions. However, it does not work when the phenotype is genuinely ordinal, as in the above-mentioned example of liver steatosis. Therefore, we propose a new method to deal with this problem.

In this work, we treat genotypes as ordinal variables and propose a new procedure to assess the association between an ordinal phenotype and an ordinal genotype after adjusting for covariates. Rather than regressing the phenotype on the genotype or the genotype on the phenotype, as existing methods do, we jointly model the phenotype and genotype by introducing a latent variable following a multivariate normal distribution. The phenotype and genotype are regarded as manifestation values of the latent variable. The relationships between phenotypes, genotypes, and covariates of interest are described by the covariance matrix. Taking advantage of the framework of generalized estimating equations (Hanley *et al.* 2003; Zhang *et al.* 2014) and M-estimation theory (Huber 1981; Stefanski and Boos 2002), we construct a Wald test statistic for an equivalent transformation of the original null hypothesis, and prove that it asymptotically follows the standard normal distribution under the null hypothesis. Numerical simulations are conducted to compare the proposed method with other methods. Our simulation results show that the proposed method can suitably maintain Type I error control and may achieve considerable statistical power compared to existing methods in various scenarios. Finally, we apply the proposed method to anticyclic citrullinated protein antibody data for rheumatoid arthritis studies to further demonstrate its performance.

## Materials and Methods

### Notations

Let random variables *G* and *Y* denote a genotype and an ordinal phenotype, respectively, and let Z be a $(k-2)$-dimensional continuous covariate of interest, with $Z = (Z_1, \ldots, Z_{k-2})^\tau$, where *τ* denotes the transpose of a matrix or vector. Without loss of generality, we assume that there are two alleles at a genetic locus, with one being the risk allele and the other being the reference allele. The value of the random variable *G* represents the number of copies of the risk allele at a locus, which means that *G* can take three values: 0, 1, or 2. Suppose that *Y* takes *m* ordinal values: $1, 2, \ldots, m$. The null hypothesis states that after adjusting for covariates, the phenotype is not related to the genotype, *i.e.*, the phenotype and genotype are conditionally independent given a set of covariates, which can be denoted by

$$H_0: Y \perp G \mid Z. \tag{1}$$

Assume that *n* subjects are enrolled in the genetic association study. Further, let $(g_1, \ldots, g_n)$, $(y_1, \ldots, y_n)$, and $(z_1, \ldots, z_n)$ be the *n* observations of *G*, *Y*, and Z corresponding to the *n* subjects, respectively.

### Equivalent statement of $H_0$ by introducing a latent variable U

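We assume that the combined vector $(G, Y, Z^\tau)^\tau$ is generated from a *k*-dimensional random variable $U = (U_1, U_2, \ldots, U_k)^\tau$ following a multivariate normal distribution, with mean vector $0_k$ (the *k*-dimensional column vector with all elements being zero) and covariance matrix Δ, where $\Delta = (\delta_{ij})_{k \times k}$ and all its diagonal elements are equal to one. That is, *G* and *Y* are manifestations of $U_1$ and $U_2$, respectively, and $Z = (U_3, \ldots, U_k)^\tau$. We rewrite Δ in the partitioned matrix form

$$\Delta = \begin{pmatrix} \Delta_{11} & \Delta_{12} \\ \Delta_{21} & \Delta_{22} \end{pmatrix}, \tag{2}$$

where $\Delta_{11}$ is a 2 × 2 matrix. Then, Z follows a multivariate normal distribution with mean vector $0_{k-2}$ and covariance matrix $\Delta_{22}$, and *G* and *Y* can be obtained as follows:

$$G = i \quad \text{if } a_i < U_1 \le a_{i+1}, \quad i = 0, 1, 2, \tag{3}$$

and

$$Y = j \quad \text{if } b_{j-1} < U_2 \le b_j, \quad j = 1, \ldots, m, \tag{4}$$

where $-\infty = a_0 < a_1 < a_2 < a_3 = +\infty$ and $-\infty = b_0 < b_1 < \cdots < b_{m-1} < b_m = +\infty$.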
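The threshold construction above can be illustrated with a short simulation. The following is a minimal sketch, assuming a single covariate (*k* = 3) and using only the Python standard library; the covariance values, the cutpoints, and all function names are hypothetical choices made for illustration and are not taken from the article.

```python
import math
import random
from bisect import bisect_left


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def draw_latent(chol, rng):
    """Draw one zero-mean normal vector whose covariance is chol * chol^T."""
    e = [rng.gauss(0.0, 1.0) for _ in chol]
    return [sum(row[k] * e[k] for k in range(i + 1)) for i, row in enumerate(chol)]


def discretize(u, cutpoints):
    """Number of cutpoints strictly below u (intervals are open left, closed right)."""
    return bisect_left(cutpoints, u)


def simulate_subject(chol, a_cuts, b_cuts, rng):
    """Return (G, Y, Z) for one subject from the latent vector U."""
    u = draw_latent(chol, rng)
    g = discretize(u[0], a_cuts)       # genotype in {0, 1, 2}
    y = discretize(u[1], b_cuts) + 1   # phenotype in {1, ..., m}
    return g, y, u[2:]                 # covariates are observed directly


# Hypothetical parameter values, for illustration only.
DELTA = [[1.0, 0.3, 0.2], [0.3, 1.0, 0.4], [0.2, 0.4, 1.0]]
A_CUTS = [0.0, 1.0]          # two finite thresholds for G
B_CUTS = [-0.5, 0.5, 1.2]    # m - 1 = 3 finite thresholds for Y (m = 4)

rng = random.Random(1)
chol = cholesky(DELTA)
sample = [simulate_subject(chol, A_CUTS, B_CUTS, rng) for _ in range(200)]
```

Only the latent correlations and the cutpoints drive the joint distribution of the observed triple, which is why the hypothesis of interest can later be stated purely in terms of the covariance matrix.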

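Based on the theory of the conditional normal distribution, and with $\Delta_{11}$, $\Delta_{12}$, $\Delta_{21} = \Delta_{12}^\tau$, and $\Delta_{22}$ denoting the blocks of Δ in (2), we have that

$$(U_1, U_2)^\tau \mid Z \sim N_2\left(\Delta_{12}\Delta_{22}^{-1}Z,\; \Delta_{11} - \Delta_{12}\Delta_{22}^{-1}\Delta_{21}\right). \tag{5}$$

We define $\delta_1 = (\delta_{13}, \ldots, \delta_{1k})^\tau$ and $\delta_2 = (\delta_{23}, \ldots, \delta_{2k})^\tau$, so that $\Delta_{12} = (\delta_1, \delta_2)^\tau$. Then, the conditional covariance matrix above can be further expressed as

$$\Delta_{11} - \Delta_{12}\Delta_{22}^{-1}\Delta_{21} = \begin{pmatrix} 1 - \delta_1^\tau\Delta_{22}^{-1}\delta_1 & \delta_{12} - \delta_1^\tau\Delta_{22}^{-1}\delta_2 \\ \delta_{12} - \delta_1^\tau\Delta_{22}^{-1}\delta_2 & 1 - \delta_2^\tau\Delta_{22}^{-1}\delta_2 \end{pmatrix}. \tag{6}$$

Now, denote $\eta = \delta_{12} - \delta_1^\tau\Delta_{22}^{-1}\delta_2$. By introducing the latent variable U and taking advantage of its distribution property, we can state that the original hypothesis $H_0$ is exactly equivalent to

$$H_0': \eta = 0. \tag{7}$$

### Proposed statistical test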

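In this subsection, we construct a test statistic to test $H_0$ based on the generalized estimating equation technique and M-estimation theory. Recall that, with $\delta_j = \mathrm{Cov}(U_j, Z)$ for $j = 1, 2$ and $\Delta_{22} = \mathrm{Cov}(Z)$, the joint distribution of $U_1$ and Z is

$$(U_1, Z^\tau)^\tau \sim N_{k-1}\left(0_{k-1},\; \Delta_{(1)}\right), \qquad \Delta_{(1)} = \begin{pmatrix} 1 & \delta_1^\tau \\ \delta_1 & \Delta_{22} \end{pmatrix}. \tag{8}$$

Hence, the conditional distribution of $U_1$ given Z is

$$U_1 \mid Z \sim N\left(\delta_1^\tau\Delta_{22}^{-1}Z,\; 1 - \delta_1^\tau\Delta_{22}^{-1}\delta_1\right). \tag{9}$$

Similarly, the conditional distribution of $U_2$ given Z is

$$U_2 \mid Z \sim N\left(\delta_2^\tau\Delta_{22}^{-1}Z,\; 1 - \delta_2^\tau\Delta_{22}^{-1}\delta_2\right). \tag{10}$$

Denote by $\phi_d(\cdot\,; V)$ the density of the *d*-dimensional normal distribution with mean zero and covariance matrix $V$, let $\Delta_{(2)}$ be defined as $\Delta_{(1)}$ with $\delta_1$ replaced by $\delta_2$, and let $\Delta_{11}$ be the 2 × 2 covariance matrix of $(U_1, U_2)^\tau$, with unit diagonal and off-diagonal element $\delta_{12}$. It should be noted that the marginal density function of Z and the joint density functions of each pair of variables among *G*, *Y*, and Z are as follows:

$$f(z) = \phi_{k-2}(z; \Delta_{22}), \tag{11}$$

$$f(g, z) = \sum_{i=0}^{2} I(g = i) \int_{a_i}^{a_{i+1}} \phi_{k-1}\left((u_1, z^\tau)^\tau; \Delta_{(1)}\right) du_1, \tag{12}$$

$$f(y, z) = \sum_{j=1}^{m} I(y = j) \int_{b_{j-1}}^{b_j} \phi_{k-1}\left((u_2, z^\tau)^\tau; \Delta_{(2)}\right) du_2, \tag{13}$$

$$f(g, y) = \sum_{i=0}^{2}\sum_{j=1}^{m} I(g = i, y = j) \int_{a_i}^{a_{i+1}}\int_{b_{j-1}}^{b_j} \phi_2\left((u_1, u_2)^\tau; \Delta_{11}\right) du_2\, du_1, \tag{14}$$

where $I(E)$ is an indicator function, *i.e.*, $I(E)$ is one if the event *E* holds and zero otherwise, and $a_0 < a_1 < a_2 < a_3$ and $b_0 < b_1 < \cdots < b_m$ are the thresholds that map $U_1$ to *G* and $U_2$ to *Y* in (3) and (4). The unknown parameters can be estimated via the following procedures.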
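For a single covariate (*k* = 3, so that the variance of Z is one), the conditional distributions above reduce to simple closed forms. The sketch below is an illustration under that assumption; the names `d12`, `d13`, and `d23` are hypothetical labels for the three correlations among the latent components, and the numerical values are made up.

```python
def conditional_mean_given_z(d13, d23, z):
    """Conditional means of U1 and U2 given a scalar covariate Z = z."""
    return d13 * z, d23 * z


def conditional_cov_given_z(d12, d13, d23):
    """Conditional covariance matrix of (U1, U2) given a scalar covariate Z.

    With unit variances, the Schur complement Delta11 - Delta12 Delta22^{-1} Delta21
    has diagonal entries 1 - d13^2 and 1 - d23^2 and off-diagonal d12 - d13 * d23.
    """
    off_diagonal = d12 - d13 * d23
    return [[1.0 - d13 * d13, off_diagonal],
            [off_diagonal, 1.0 - d23 * d23]]


# Hypothetical values: an alternative (nonzero conditional covariance) ...
cov_alt = conditional_cov_given_z(d12=0.30, d13=0.20, d23=0.40)
# ... and a null configuration, where d12 equals d13 * d23.
cov_null = conditional_cov_given_z(d12=0.08, d13=0.20, d23=0.40)
```

In the null configuration the off-diagonal entry vanishes up to rounding, so the two latent variables, and therefore *G* and *Y*, are conditionally independent given Z even though their marginal correlation is not zero.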

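First, based on the marginal density function (11) of Z, we have the likelihood function

$$L(\Delta_{22}) = \prod_{i=1}^{n} f(z_i), \tag{15}$$

where $f$ is the density of the multivariate normal distribution of Z, with mean zero and covariance matrix $\Delta_{22}$. By maximizing $L(\Delta_{22})$ over $\Delta_{22}$, we can obtain the MLE of $\Delta_{22}$ (denoted by $\hat{\Delta}_{22}$), which is the sample covariance matrix of the observed data $z_1, \ldots, z_n$.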

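Second, in our model both *G* and *Y* are ordinal variables, whose realizations are determined by the intervals in which the values of the two standard normal random variables $U_1$ and $U_2$ fall, respectively. Specifically, we employ the distribution properties of *G* and *Y* to estimate the thresholds $a = (a_1, a_2)^\tau$ of $U_1$ and $b = (b_1, \ldots, b_{m-1})^\tau$ of $U_2$, respectively. We define $\hat{p}_0 = n^{-1}\sum_{i=1}^{n} I(g_i = 0)$, $\hat{p}_1 = n^{-1}\sum_{i=1}^{n} I(g_i = 1)$, $\hat{p}_2 = n^{-1}\sum_{i=1}^{n} I(g_i = 2)$, and $\hat{q}_j = n^{-1}\sum_{i=1}^{n} I(y_i = j)$ for $j = 1, \ldots, m$, based on the observed data $(g_1, \ldots, g_n)$ and $(y_1, \ldots, y_n)$ of *G* and *Y*, respectively. Recall that $a_0 = b_0 = -\infty$ and $a_3 = b_m = +\infty$. Then, we can estimate $a_i$ as $\hat{a}_i$ ($i = 1, 2$), and $b_j$ as $\hat{b}_j$ ($j = 1, \ldots, m-1$), by solving the following equations:

$$\Phi(\hat{a}_i) = \sum_{l=0}^{i-1} \hat{p}_l, \quad i = 1, 2, \tag{16}$$

$$\Phi(\hat{b}_j) = \sum_{l=1}^{j} \hat{q}_l, \quad j = 1, \ldots, m-1, \tag{17}$$

where $\Phi$ is the cumulative distribution function of the standard normal variable.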
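The threshold step amounts to applying the inverse standard normal distribution function to cumulative sample proportions. A minimal sketch follows, assuming the thresholds are ordered cutpoints on a standard normal latent scale; the function name and the toy data are hypothetical.

```python
from collections import Counter
from statistics import NormalDist


def estimate_thresholds(values, levels):
    """Estimate latent cutpoints from the observed values of an ordinal variable.

    `levels` lists the possible values in increasing order. The function returns
    len(levels) - 1 cutpoints, each the standard normal quantile of the
    cumulative sample proportion up to that level. NormalDist.inv_cdf raises
    StatisticsError if a cumulative proportion is exactly 0 or 1, which happens
    when the lowest level, or every level above some point, is unobserved.
    """
    n = len(values)
    counts = Counter(values)
    cumulative = 0
    cutpoints = []
    for level in levels[:-1]:
        cumulative += counts.get(level, 0)
        cutpoints.append(NormalDist().inv_cdf(cumulative / n))
    return cutpoints


# Toy data (made up): 50 subjects with G = 0, 34 with G = 1, 16 with G = 2.
g_obs = [0] * 50 + [1] * 34 + [2] * 16
a_hat = estimate_thresholds(g_obs, levels=[0, 1, 2])

# Toy phenotype with m = 4 levels.
y_obs = [1] * 40 + [2] * 30 + [3] * 20 + [4] * 10
b_hat = estimate_thresholds(y_obs, levels=[1, 2, 3, 4])
```

Because the cumulative proportions are nondecreasing, the estimated cutpoints come out in nondecreasing order without any further constraint.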

The parameter can be estimated using the generalized estimating equation technique and M-estimation theory. First, let(18)The function vector consisting of the first-order partial derivatives of with respect to each parameter in is(19)Then, the estimator of is the root of the following generalized estimating equation(20)After estimating all the unknown parameters, the estimate of can be expressed as . To construct a Wald-type test statistic, we need to derive the asymptotical distribution of . Based on the classical M-estimation theory, asymptotically follows a multivariate normal distribution. That is,(21)where , , and

According to the delta method, is also asymptotically multivariate normal. Namely,(22)where .
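The delta-method step, and the standardized statistic it leads to, can be sketched for the single-covariate case. This is an illustration under stated assumptions rather than the authors' exact statistic: it takes as given a hypothetical estimate `d_hat` of the three latent correlations and a hypothetical estimate `cov_hat` of the covariance matrix of that estimator (already divided by the sample size), and it tests whether `d12 - d13 * d23` is zero.

```python
import math
from statistics import NormalDist


def wald_test_conditional_cov(d_hat, cov_hat):
    """Delta-method Wald test of h(d) = d12 - d13 * d23 = 0.

    d_hat   : (d12, d13, d23), estimated latent correlations.
    cov_hat : 3 x 3 estimated covariance matrix of d_hat.
    Returns (h, z, p), where p is the two-sided standard normal p-value.
    """
    d12, d13, d23 = d_hat
    h = d12 - d13 * d23
    gradient = (1.0, -d23, -d13)   # partial derivatives of h w.r.t. (d12, d13, d23)
    var_h = sum(gradient[i] * cov_hat[i][j] * gradient[j]
                for i in range(3) for j in range(3))
    z = h / math.sqrt(var_h)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return h, z, p


# Made-up inputs: independent estimators, each with standard error 0.05.
COV_HAT = [[0.0025, 0.0, 0.0], [0.0, 0.0025, 0.0], [0.0, 0.0, 0.0025]]
h_alt, z_alt, p_alt = wald_test_conditional_cov((0.30, 0.20, 0.40), COV_HAT)
h_null, z_null, p_null = wald_test_conditional_cov((0.08, 0.20, 0.40), COV_HAT)
```

In practice the covariance matrix would come from the sandwich formula of M-estimation theory rather than being specified by hand, and the estimators would generally be correlated.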

Now, we can propose a new Wald test statistic for the null hypothesis as follows:

(23)

### Data availability

The authors affirm that all data necessary for confirming the conclusions of the article are present within the article, figures, and tables. Supplemental material available at FigShare: https://doi.org/10.25387/g3.8226650.

## Results

### Simulation results

In this subsection, we present a series of simulation studies to investigate the performance of our proposed latent variable model (abbreviated as lvm) and compare it with the probit and logit models, both of which regress an ordinal phenotype on a genotype. We compared the three methods under multiple simulation scenarios, so that different modeling assumptions would be favored. Two types of data generation mechanism were considered throughout our simulation studies: (i) generating data from a multivariate normal random variable (named the ND mechanism) and (ii) generating data under the proportional odds model (named the PO mechanism). In addition, three genetic models (co-dominant, dominant, and recessive) were considered.

The specifics of our simulation data generation scenarios are as follows. For simplicity, the dimension of the covariate Z is one, and the number of levels for a phenotype *Y* is set to five. Under the ND mechanism, the three-dimensional latent variable $U = (U_1, U_2, U_3)^\tau$ was generated from a multivariate normal distribution, with the genotype *G* being a manifestation of $U_1$, the ordinal phenotype *Y* being a manifestation of $U_2$, and the covariate Z being equal to $U_3$. Each marginal distribution of $U$ followed a standard normal distribution. Note that the distribution of *G* varied according to the type of the true genetic model. Let *A* (major allele) and *a* (minor allele) denote two alleles at the single biallelic locus corresponding to the genotype *G*. Under the Hardy–Weinberg equilibrium (HWE) conditions, the expected genotype frequencies of *G* being $AA$, $Aa$, and $aa$ would be $(1-p)^2$, $2p(1-p)$, and $p^2$, respectively, with the minor allele frequency (MAF) *p* taking on values from . If the HWE assumption did not hold, we directly set three kinds of multinomial distribution (P) for ($AA$, $Aa$, $aa$) with different parameter structures , and . When a co-dominant model was assumed, the three genotypes $AA$, $Aa$, and $aa$ were coded as 0, 1, and 2, respectively. In a model where a dominant effect was assumed, the genotype $AA$ was coded as 0 while both $Aa$ and $aa$ were coded as 1. Accordingly, scores of 0 for both $AA$ and $Aa$ and 1 for $aa$ were employed in a recessive model. In addition, under each genetic model the probabilities of *Y* being 1, 2, 3, 4, and 5 were always , and 0.05, respectively. We set the covariance matrix Σ of U as (24) to investigate the Type I error rate, and let Σ be (25) to compare the statistical power under different alternatives, depending on the parameter *θ*, whose range was the set .
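The genotype coding under the three genetic models, and the latent cutpoints implied by HWE, can be sketched as follows. This is an illustration only: the way the cutpoints are derived from the coded genotype distribution is an assumption about how a manifestation of a standard normal variable can reproduce those frequencies, and the function names and the MAF values used are hypothetical.

```python
from statistics import NormalDist

# Coding of the genotypes AA, Aa, aa under each genetic model.
CODING = {
    "co-dominant": {"AA": 0, "Aa": 1, "aa": 2},
    "dominant": {"AA": 0, "Aa": 1, "aa": 1},
    "recessive": {"AA": 0, "Aa": 0, "aa": 1},
}


def hwe_frequencies(p):
    """Expected genotype frequencies under HWE for minor allele frequency p."""
    q = 1.0 - p
    return {"AA": q * q, "Aa": 2.0 * p * q, "aa": p * p}


def coded_distribution(p, model):
    """Probability of each coded genotype value under HWE and a genetic model."""
    dist = {}
    for genotype, prob in hwe_frequencies(p).items():
        code = CODING[model][genotype]
        dist[code] = dist.get(code, 0.0) + prob
    return dist


def latent_cutpoints(p, model):
    """Cutpoints on a standard normal scale that reproduce the coded distribution."""
    dist = coded_distribution(p, model)
    cumulative = 0.0
    cutpoints = []
    for code in sorted(dist)[:-1]:
        cumulative += dist[code]
        cutpoints.append(NormalDist().inv_cdf(cumulative))
    return cutpoints


freqs = hwe_frequencies(0.3)                       # hypothetical MAF
recessive = coded_distribution(0.3, "recessive")
cuts_codominant = latent_cutpoints(0.3, "co-dominant")
cuts_dominant = latent_cutpoints(0.5, "dominant")
```

The dominant and recessive models collapse the genotype to two coded values, so each needs a single cutpoint, whereas the co-dominant model needs two.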

Under the PO mechanism, the ordinal phenotype *Y* was related to the genotype *G* and covariate Z through the proportional odds model. Specifically, the distribution of *G* under different genetic models would remain the same as that under the ND mechanism, regardless of whether the HWE held. In this case, Z still followed the standard normal distribution, and *Y* was generated with five levels using the proportional odds model(26)where , , , . It should be noted that . When , the null hypothesis was true, and the derived simulation data were used to compare the Type I error rates for the three models of interest, *i.e.*, the lvm, probit, and logit models. When , we explored the power of these three models under different alternatives.
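Generating an ordinal outcome from a proportional odds model can be sketched as below. The parameterization used here, logit P(Y ≤ j | G, Z) = α_j − (βG + γZ), is one common convention and is an assumption; the sign convention and the parameter values in the article may differ, and the names `alphas`, `beta`, and `gamma` as well as all numbers are hypothetical.

```python
import math
import random


def po_probabilities(alphas, eta):
    """Category probabilities when logit P(Y <= j) = alphas[j - 1] - eta.

    `alphas` must be strictly increasing; there are len(alphas) + 1 categories.
    """
    cumulative = [1.0 / (1.0 + math.exp(-(a - eta))) for a in alphas] + [1.0]
    return [cumulative[0]] + [cumulative[j] - cumulative[j - 1]
                              for j in range(1, len(cumulative))]


def sample_po(alphas, beta, gamma, g, z, rng):
    """Draw Y in {1, ..., len(alphas) + 1} given genotype g and covariate z."""
    probs = po_probabilities(alphas, beta * g + gamma * z)
    return rng.choices(range(1, len(probs) + 1), weights=probs, k=1)[0]


# Hypothetical intercepts for a five-level phenotype.
ALPHAS = [-1.0, 0.0, 1.0, 2.0]
rng = random.Random(7)

probs_null = po_probabilities(ALPHAS, eta=0.0)
probs_shifted = po_probabilities(ALPHAS, eta=1.0)
y_draws = [sample_po(ALPHAS, beta=0.0, gamma=0.5, g=1, z=rng.gauss(0.0, 1.0), rng=rng)
           for _ in range(100)]
```

Setting `beta` to zero removes the genotype from the linear predictor, which corresponds to generating data under the null hypothesis in this parameterization.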

As previously described, we considered scenarios. For each simulation scenario, we generated 1000 datasets, each consisting of 300 subjects. P-values were calculated for each dataset using each of the three models. The nominal level of the tests was set to 0.05, and all simulations were performed using the R language (https://www.r-project.org/). The empirical Type I error rates and power estimates were calculated as the proportion of rejections in each scenario. The results are presented side-by-side in Tables 1–4. Note that the blanks (marked with —) in these four tables result from the unavailability of simulation data under the corresponding scenarios. The reason is that under these parameter settings, the expected number of subjects with *G* = 1 is only three, so a sample can easily have all *G* values equal to 0, in which case none of the three considered models can be fitted.
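The empirical rejection rate and its Monte Carlo uncertainty can be computed as in the following sketch. The function names are hypothetical, and the MAF of 0.1 used in the last lines is an assumed value chosen because it reproduces an expected count of three among 300 subjects under the recessive coding.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Proportion of replicates whose p-value falls below the nominal level."""
    return sum(p < alpha for p in p_values) / len(p_values)


def monte_carlo_se(rate, n_replicates):
    """Binomial standard error of an empirical rejection rate."""
    return math.sqrt(rate * (1.0 - rate) / n_replicates)


def expected_minor_homozygotes(n_subjects, maf):
    """Expected number of aa genotypes under HWE."""
    return n_subjects * maf * maf


toy_rate = rejection_rate([0.01, 0.20, 0.04, 0.50])
se_nominal = monte_carlo_se(0.05, 1000)
expected_aa = expected_minor_homozygotes(300, 0.1)   # assumed MAF
```

With 1000 replicates, the standard error at a true rate of 0.05 is about 0.007, so empirical Type I error rates within roughly 0.036 to 0.064 are compatible with the nominal level.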

Table 1 presents the results for the ND mechanism when the HWE holds. The first five rows suggest that all three methods can control the Type I error rate at the nominal level of 0.05 under the three different genetic models. Furthermore, the remaining rows show that the lvm model enhances the statistical power over the probit and logit models. In some cases, the power gain can be as high as 0.07 to 0.1. For example, when and , the empirical powers for the probit, logit, and lvm models are 0.514, 0.481, and 0.557 under the co-dominant genetic model, and are 0.503, 0.459, and 0.544 under the dominant genetic model, respectively. When the true genetic model is recessive and and , the power estimates for the probit, logit, and lvm models are 0.126, 0.094, and 0.166, respectively.

The results of the three methods under the ND mechanism when the HWE does not hold are presented in Table 2. We can observe that the proposed method achieves a greater power than the other two methods, even though the distribution of the genotype *G* does not satisfy the HWE conditions. Specifically, the power gain can be as high as 0.07 when and the distribution of *G* is .

The corresponding results when the data are generated under the PO mechanism and the HWE is assumed are displayed in Table 3. It follows that all three models can control the Type I error rates under the null hypothesis. Even though the data are generated using the proportional odds model, the lvm method still performs better than the other two methods in detecting an alternative hypothesis, and can achieve a power gain of up to 0.058 in some cases, such as the setting with for the recessive genetic model. It is worth noting that the advantage of the proposed lvm method is more obvious under the recessive model.

The simulation results under the PO mechanism when the HWE does not hold are presented in Table 4. We observe that the advantage of the proposed lvm method is not as pronounced as it is when the HWE holds, but the lvm method is superior to the probit model in all scenarios. Moreover, the logit model has a slightly greater power under the co-dominant and dominant models, while the lvm method outperforms it in the recessive model.

From all four tables, it can be seen that the proposed method might have limitations in controlling Type I error rates in a few situations, while the power gains in almost all simulation scenarios indicate its efficiency for practical applications.

### Application to anticyclic citrullinated protein antibody data for rheumatoid arthritis study

It is well known that rheumatoid arthritis (RA) is significantly associated with some genetic variants (Carlton *et al.* 2005; Ruiz-Larrañaga *et al.* 2016). The anticyclic citrullinated protein antibody (anti-CCP) can be an auxiliary diagnosis indicator for RA, and the specificity of anti-CCP lies between 87.8% and 96.4% (Coenen *et al.* 2007). Besides, the genomic region of 6p21.33 has been reported to be associated with RA (Zhang *et al.* 2009; Zhang and Li 2015). The aim here is to check whether the single nucleotide polymorphisms (SNPs) in the 6p21.33 region are associated with anti-CCP, taking advantage of the proposed lvm test.

Note that there are a total of 45 SNPs in the 6p21.33 region according to the Genetic Analysis Workshop 16 data, all of which meet the quality control rules of the MAF being more than 5%, the missing rate being smaller than 15%, and the smallest genotype count being no less than five. The anti-CCP measure takes four values, 1, 2, 3, and 4, and the numbers of subjects with these four values were 1195, 103, 66, and 698, respectively. The total number of subjects was 2062. Five principal components, obtained by applying the multi-dimensional scaling method (Li and Yu 2008) to the 12747 population structure information SNPs (Yu *et al.* 2008), were used to adjust for population stratification effects. Before conducting the association analysis, we ran chi-square tests to check whether the HWE holds for each of these 45 SNPs in controls. The P-value results are summarized in Table 5. At the 0.05 level of significance, the HWE holds for all SNPs in controls on the basis of their Bonferroni-corrected P-values. We then applied the probit model, the logit model, and the lvm model in turn to these 45 SNPs to test their association with anti-CCP. The results are also presented in Table 5. After Bonferroni correction, the SNPs rs2246986, rs3093998, rs2071596, and rs2844509 were found to be significant under the probit and logit models, while the SNPs rs2516398, rs2844494, rs3130637, rs3093993, rs3095227, rs2259435, and rs2844509 were identified as significant using the lvm method. Although these two groups of SNPs overlap at only one SNP, rs2844509, each of the other SNPs found by the probit (logit) model is physically close to one or two SNPs found by the lvm model. For example, the SNP rs2246986 of the first group is 677 kb away from rs2516398 on one side and 1212 kb away from rs2844494 on the other side, and both rs2516398 and rs2844494 are included in the second group. In addition, the SNP rs3093998 (in the first group) is 2971 kb away from rs3130637 (in the second group). The distances are so short that it is reasonable to infer that the SNPs rs2246986, rs2516398, and rs2844494 contain similar information, as do the SNPs rs3093998 and rs3130637. In short, for detecting the association between anti-CCP and the genomic region 6p21.33, the proposed lvm method is more powerful than the methods based on the logit and probit models.
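The two screening steps used above, a chi-square test of HWE and a Bonferroni correction over 45 SNPs, can be sketched as follows. This is a generic textbook version with one degree of freedom, not necessarily the exact procedure used for Table 5; the genotype counts are made up and the function names are hypothetical.

```python
import math


def hwe_chisq_test(n_AA, n_Aa, n_aa):
    """Pearson chi-square test of Hardy-Weinberg equilibrium (1 degree of freedom).

    Returns (statistic, p_value). The p-value uses the identity that the
    survival function of a chi-square variable with 1 df at x is erfc(sqrt(x / 2)).
    Monomorphic SNPs (allele frequency 0 or 1) give zero expected counts and
    are not handled here.
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_aa + n_Aa) / (2 * n)            # frequency of allele a
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p * p]
    observed = [n_AA, n_Aa, n_aa]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, math.erfc(math.sqrt(stat / 2.0))


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


stat_ok, p_ok = hwe_chisq_test(25, 50, 25)      # counts exactly in HWE
stat_bad, p_bad = hwe_chisq_test(40, 20, 40)    # strong heterozygote deficit
threshold = bonferroni_threshold(0.05, 45)
```

With 45 SNPs and an overall level of 0.05, each individual test is compared against a threshold of about 0.0011.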

## Discussion

In this work, we have shown that treating a genotype variable as ordinal without assuming linearity can result in a more powerful and robust test, by introducing a joint multivariate normal distribution for the group of genotypes, traits, and covariates. We have also demonstrated that the proposed lvm test can provide appropriate Type I error rates. An important strength of our method is that it makes no assumption about the type of relationship between a phenotype and a genotype, nor does it treat the genotype as a continuous variable. Rather, our approach only introduces a latent multivariate normal variable to characterize the relationship between the two, which we consider a reasonable and broadly applicable assumption.

Besides the simulations at the significance level of 0.05, we also conducted simulation studies at a lower significance level of 0.005. The results are given in Supplementary Table S1. We found that the proposed method can reasonably control Type I error rates and achieve power gains at this lower significance level, similar to the results in Tables 1–4. It is worth mentioning that our proposed test can also be applied to situations where the outcome is continuous. In such a situation, we can still employ a joint multivariate normal distribution to model outcomes, genotypes, and covariates simultaneously. Even though the proposed method might have limitations in Type I error rate control in some situations, the power gains support its efficiency in practical applications.

In population-based genetic association studies, hundreds of thousands of subjects are often enrolled to achieve optimal power. It is inevitable that there exists a population stratification effect in such large-scale studies. Not considering the effect of population stratification could lead to many false positive findings, and therefore adjusting for its effect represents the basis for conducting a genetic association analysis (Price *et al.* 2006; Li and Yu 2008). In this study, to characterize the influence of population stratification when investigating the relationships between ordinal traits and genotypes, we treat these effects as covariates in the proposed lvm method. The numerical results of our simulation studies and real data applications have demonstrated that the strategy is feasible and effective.

## Acknowledgments

We would like to thank the associate editor and the two anonymous reviewers for their insightful comments, which helped us to improve the manuscript. This work is partially supported by the Beijing Natural Science Foundation (Z180006), National Natural Science Foundation of China (11722113, 11661080, 11731011, 11501134), Applied Basic Research Project of Yunnan Province (2017FB002), and Science and Technology Research Project of the Education Department in Hubei Province (Q20172505).

## Footnotes


*Communicating editor: G. de los Campos*

- Received May 8, 2019.
- Accepted June 4, 2019.

- Copyright © 2019 Wang *et al.*

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.