# A Multiple-Trait Bayesian Variable Selection Regression Method for Integrating Phenotypic Causal Networks in Genome-Wide Association Studies

^{*}Department of Animal Science, University of California, Davis^{†}Department of Animal and Poultry Sciences, Virginia Polytechnic Institute and State University

- 1Corresponding author: 2139 Meyer Hall, Department of Animal Science, University of California Davis, 1 Shield Avenue, Davis, CA 95616. E-mail: qtlcheng{at}ucdavis.edu

## Abstract

Bayesian regression methods that incorporate different mixture priors for marker effects are used in multi-trait genomic prediction. These methods can also be extended to genome-wide association studies (GWAS). In multiple-trait GWAS, incorporating the underlying causal structures among traits is essential for comprehensively understanding the relationship between genotypes and traits of interest. Therefore, we develop a GWAS methodology, SEM-Bayesian alphabet, which, by applying the structural equation model (SEM), can be used to incorporate causal structures into multi-trait Bayesian regression methods. SEM-Bayesian alphabet provides a more comprehensive understanding of the genotype-phenotype mapping than multi-trait GWAS by performing GWAS based on indirect, direct and overall marker effects. The superior performance of SEM-Bayesian alphabet was demonstrated by comparing its GWAS results with other similar multi-trait GWAS methods on real and simulated data. The software tool JWAS offers open-source routines to perform these analyses.

- Structural Equation Models
- Bayesian Regression
- Variable Selection
- GWAS
- Genomic Prediction
- GenPred
- Shared data resources

Genome-wide association studies (GWAS) are widely used to identify associations between single nucleotide polymorphisms (SNPs) and phenotypes (Ozaki *et al.* 2002; Visscher *et al.* 2017; McCarthy *et al.* 2008; Cantor *et al.* 2010). GWAS have successfully mapped quantitative trait loci (QTL) associated with traits of interest, *e.g.*, meat quality and quantity in livestock (Sharma *et al.* 2015), crop yields in plants (Liu and Yan 2018), and diseases in humans (Visscher *et al.* 2017). GWAS are typically based on using linear mixed models to fit one SNP at a time to a single trait (Hackinger and Zeggini 2017). While this allows for a relatively simple statistical model, the interwoven nature of gene expression translates to many traits being correlated with each other (Sodini *et al.* 2018). These correlations can be utilized in multi-trait linear mixed models for GWAS to reduce false positives and increase the statistical power for association mapping (O’Reilly *et al.* 2012; Korte *et al.* 2012).

Conventional multi-trait linear mixed models do not consider the causal relationships between traits. To address this issue, researchers have proposed refining the multi-trait methods with structural equation models (SEM) introduced by Wright (1934) that consider the causal relationship among traits. A model that incorporates causal structures should better reflect underlying genetic mechanisms. Gianola and Sorensen (2004) used SEM to extend conventional multi-trait linear mixed models to accommodate for recursive and simultaneous relationships among traits, which allows traits to be explanatory variables for other traits. Recently, Momen *et al.* (2018, 2019) proposed the SEM-based GWAS (SEM-GWAS) methodology by applying SEM to linear mixed models for GWAS. They showed that while conventional GWAS methodology only provides overall SNP effects, SEM-GWAS can capture the complex causal relationships among traits and further decompose the overall SNP effects into direct and indirect effects.

The SEM-GWAS method proposed by Momen *et al.* (2018, 2019) is based on linear mixed models with a fixed substitution effect for the tested SNP and a random effect with covariances defined by a “genomic relationship matrix” computed from genotypes (VanRaden 2008) to account for genetic relatedness. Markers are usually implicitly assumed to affect all traits when the ”genomic relationship matrix” is constructed in multi-trait analysis. However, this assumption is not biologically meaningful, especially in multi-trait analyses involving many traits. Cheng *et al.* (2018b) proposed a general class of multi-trait Bayesian variable selection regression methods that use a broad range of mixture priors, *e.g.*, multi-trait BayesCΠ, where each locus can affect any combination of traits, which allows us to more closely model the true biological mechanisms, *e.g.*, pleiotropy (Cheng *et al.* 2018b).

The primary goal of this current research is to develop a multi-trait Bayesian regression GWAS method that more closely resembles the underlying biological mechanisms including pleiotropy and causal structure among traits. In this paper, we develop and implement a new GWAS method called SEM-Bayesian alphabet, which integrates SEM to the multi-trait Bayesian variable selection methods, to incorporate the underlying biological mechanism. The term “Bayesian alphabet” denotes a collection of Bayesian regression models that differ in the priors adopted for marker effects (Gianola 2013). In this paper, we use SEM-BayesCΠ, a Bayesian variable selection method, to show the utility of the SEM-Bayesian alphabet. The performance of our proposed method is studied using real and simulated data.

## Materials And Methods

### Multi-trait Bayesian regression model using mixture priors

Assuming that individuals have all traits measured with a general mean as the only fixed effect, we write the multi-trait model for individual *i* from *n* genotyped individuals as:where is the vector of phenotypes of *t* traits for individual *i*, μ is a vector of overall means for *t* traits, *p* is the number of genotyped loci, is the genotype covariate at locus *j* for individual *i* (coded as 0,1,2), is the vector of marker effects of *t* traits for locus *j*, and is the vector of residuals of *t* traits for individual *i*. The fixed effects are assigned flat priors. The residuals, , are a priori assumed to be independently and identically distributed multivariate normal vectors with null mean and covariance matrix R, which is assumed to have an inverse Wishart prior distribution, .

Allowing each locus to affect any combination of traits, in a multiple-trait Bayesian variable selection method, *e.g.*, multi-trait BayesCΠ (Cheng *et al.* 2018b), the vector of marker effects at locus *j* can be written as , where is a diagonal matrix whose diagonal element is , where is the indicator variable indicating whether the marker effect of locus *j* for trait *k* is zero or not, and is a priori assumed to be independently and identically distributed multivariate normal vectors with null mean and covariance matrix G, which is assumed to have an inverse Wishart prior distribution, . Given that a locus can have an effect on any combination of traits, we use numeric labels to represent all possible combinations for , in which case the prior distribution for is:where is the prior probability that the vector corresponds to the vector labeled “*i*” and . We assume the prior for is a uniform distribution.

### Structural Equation Model

The linear SEM is composed of two parts: the measurement equation analyzing the relationship between the observable variables and latent variables, and the structural equation capturing the connections among latent variables (Anderson and Gerbing 1988). These two equations can be written as:where is the vector of observable variables for individual *i*, is a vector of endogenous latent variables, is a vector with exogenous latent variables, and are the matrix of unknown coefficients in structural equation, Λ is a matrix of unknown structural coefficients, and are and *q* × 1 vectors of residuals. The details of parameter estimation are discussed in Song and Lee (2012).

In our study, no latent variables are assumed and the sole observable variables are phenotypes. Thus only the causal relationship among observable variables, *i.e.*, phenotypes, are fitted in the SEM model (also known as path analysis (Wright 1921)) as:(1)where and Λ are defined as above, represents everything that is not explained by , and Λ is an matrix of structural coefficients representing the causal structure recovered from the Inductive Causation (IC) algorithm as described in the next section.

To illustrate, we assume that the phenotypes of three traits for each individual (*i.e.*, and for traits 1, 2, and 3) have the following causal relationship:where causal coefficient represents that a 1-unit increase in trait *i* results in a unit increase in trait *j*. Given the causal structure above, the Λ can be written as:

### Searching causal structure

As described above, fitting the SEM requires the causal structure among all traits to be known before analysis. To explore the wide-range of possible underlying causal structures, we use the method from Valente *et al.* (2010) to discern the causal structure based on the posterior distribution of the residual covariance matrix. The reason we do not directly apply this method to phenotype data are that the covariance among phenotypes is likely confounded by genetic effects. The process of inferring causal structure is composed of three steps:

Fit the multi-trait Bayesian model and obtain the posterior distribution of the residual covariance matrix.

Follow Valente

*et al.*(2010) to derive the conditional independence relationship among traits based on the posterior distribution of the residual covariance matrix. In detail, we derive the residual partial correlation p(, |h), where h is a set of traits, to test whether trait is conditionally independent from . The highest posterior density (HPD) interval of 0.9 was used to make statistical decisions. If HPD interval of p(, |h) contains zero, and are regarded as conditionally independent on h.Apply the IC algorithm (Pearl 2009) as described in the Appendix to the conditional independence relationship from step 2 to obtain the causal structure.

### SEM-Bayes

Assume in equation (1) and follow assumptions in multi-trait BayesCΠ, the SEM-Bayes model can be written as:(3)Move from the right side to the left side of equation (3), and define , where I is a identity matrix and Λ is a matrix of structural coefficients based on the discerned causal structure, the model becomes:(4)To guarantee that the structural coefficient is identifiable, we assume that the residuals for each trait of individual *i* are independent with each other, which means the residual covariance matrix is diagonal (Wu *et al.* 2010; Momen *et al.* 2018). The vector of all non-zero elements in Λ, *e.g.*, , is assumed to have a prior distribution:where **1** is a vector of ones, I is the identity matrix, and is a known mean for all elements in λ. is a tuning parameter to adjust the sharpness degree of the prior (Gianola and Sorensen 2004). In this paper, we set and . The priors for the remaining parameters are the same as in the section Multi-trait Bayesian regression model using mixture priors.

Gibbs samplers are used to draw samples for all parameters. The full conditional distribution to draw samples for λ is shown below. The derivations of the full conditional distributions of the remaining parameters of interest for Gibbs samplers are in Cheng *et al.* (2018b).

*Full conditional distribution of* Λ** :** We follow Gianola and Sorensen (2004) to obtain the full conditional distribution of Λ, with the difference between our derivation and Gianola and Sorensen (2004) being that we specify the causal structure with positions of parameters in the Λ. Let Ω denote all parameters except λ in the SEM-BayesCΠ and use the causal structure as an example, the left hand side of equation (4), , can be written as:The conditional posterior distribution of λ can be written as:(5)Setting , equation (5) can be written as:Following the derivation in Gianola and Sorensen (2004) and the fact that in a recursive system, the full conditional distribution of λ iswhere

### Decomposition of SNP effects

In SEM-BayesCΠ, the marker effect for locus *j*, , is considered as the vector of direct marker effect of *t* traits. The indirect effect of locus *j* of *t* traits can be calculated as . The overall effect of locus *j* on *t* traits is computed as or , which is the summation of both direct and indirect effect of locus *j*. For example, given a causal structure , the direct effect for locus *j* on three traits is , and the indirect effect for locus *j* on three traits is calculated as and the overall effect of locus *j* on trait *k* is .

### Inference of association based on genomic windows

Markers in a genomic window are usually highly correlated, indicating that any single marker may not show a strong association with the trait even though a causal variant exists in the window. In this paper, we make an inference of association based on genomic windows, because multiple markers inside a genomic window may jointly capture the signal from the causal variant (Fernando and Garrick 2013; Fernando *et al.* 2017).

To make an inference of association based on genomic windows, posterior distribution for the proportion of the genetic variance explained by markers in genomic window *w*, , is estimated from MCMC samples of overall, direct, and indirect marker effects as follows. For one MCMC sample of all marker effects on one trait, let , , and denote direct, indirect, and overall effects of all markers respectively.

The genetic value that is attributed to genomic window *w* is calculated as:where is a matrix of marker covariates in window *w* and , , and are the MCMC samples of direct, indirect, and overall marker effects for SNPs in window *w*. Then the variance explained by the genomic window *w* is defined as:Similarly, the total genetic variance is calculated as:The proportion of the genetic variance explained by direct, indirect, and overall marker effects in the genomic window *w* is calculated as:Given the MCMC samples of , the window posterior probability of association (WPPA) is calculated as the proportion of MCMC samples of that exceed a specific value *T* (Fernando and Garrick 2013; Chen *et al.* 2017; Lloyd-Jones *et al.* 2017). In this paper, associations are tested for non-overlapping windows of 100 SNPs, and genomic windows that explain over of the total genetic variance were deemed to be of potential interest (*i.e.*, , where *N* is the total number of windows).

### Data analysis

#### Real data:

The Rice Diversity Panel with 413 *Oryza sativa* individual accessions was used in the analysis. Three traits were considered, including plant height (PH), flowering time in Arkansas (FTA), and panicle number per plant (PN) in our GWAS. After removing the records with missing data for these three traits and genotype with minor allele frequency , 370 individuals with 33,519 SNPs genotyped were included in our analysis. The phenotypic and genotypic data were publicly available for download from http://www.ricediversity.org/. It has been shown that using a threshold of WPPA to declare a significant genomic window restricts the proportion of false positives (PFP) to (Fernando *et al.* 2017). A previous GWAS (Zhao *et al.* 2011) identified significantly associated SNPs in chromosome 6 for flowering time in Arkansas (FTA) using the same dataset. A threshold of WPPA = 0.8 and p-value = 5 in our GWAS analysis resulted in similarly significant signals. This result suggests that a WPPA of 0.8 and p-value = 5 are reasonable for declaring a significant genomic window.

#### Simulated data:

To compare SEM-BayesCΠ with SEM-GWAS of Momen *et al.* (2018), we simulated data based on real genotypes from the Rice Diversity Panel. The simulation scenarios in Chen *et al.* (2017) were applied to simulate different genetic architectures. The QTL effects were generated from unit-gamma distribution (scale = 1) with three different shape parameters (*γ*): fewer QTL with large effects (), fewer QTL with small or large effects and many QTL with intermediate effects (), and the intermediate case (). In addition to the distribution of QTL effects, the number of QTL () may play an important role in GWAS, thus three numbers of QTL () combined with the three shape parameters were used to create 9 scenarios. For each scenario, 50 replicated populations were simulated with QTL positions randomly sampled across the genome. Trait 1 was assumed to have a causal effect on trait 2 with causal structural coefficient . The QTL effects were simulated under two scenarios (Figure 1): the QTL have direct effect on both trait 1 and trait 2 in scenario 1, where QTL only have direct effect on trait 1 in scenario 2. Half of the QTL were simulated following scenario 1, while the remaining followed scenario 2. Phenotypes for two traits were generated based on heritability of 0.5. In the simulated data analysis, the causal structure was assumed known to exclude the bias caused by searching causal structures. The structural coefficients were assumed known in SEM-GWAS.

We have implemented SEM-Bayesian alphabet in JWAS (Cheng *et al.* 2018a), an open-source, publicly available package for single-trait and multi-trait whole-genome analyses written in the freely available Julia language. More details can be found at https://reworkhow.github.io/JWAS.jl/latest/.

## Results

### Simulated data result

The performances of SEM-BayesCΠ and SEM-GWAS were compared based on the AUC (Area under Receiver Operating Characteristic). Inference of association on genomic windows for SEM-GWAS was based on the minimum p-value (Begum *et al.* 2016), *i.e.*, genomic windows containing at least one significant variant are declared as significant windows. To exclude the irrelevant AUC with low levels of specificity, only the partial area under the curve up until the false positive rate of 5% (pAUC5) (Chen *et al.* 2017; Ma *et al.* 2013) was calculated. For the convenience of comparison, all pAUC5 measurements were rescaled such that the pAUC5 of the random classifier is equal to 1. The R package ROCR (Sing *et al.* 2005) was used to obtain the pAUC5; the paired *t*-tests (p-value < 0.1) were used for comparing both the pAUC5 mean across all scenarios (overall mean comparison) and the pAUC5 mean for each level of and shape parameters *γ* (marginal mean comparison).

The GWAS results based on overall, direct and indirect effect were shown in Table 1. For the overall effect result on trait 1, there is no significant difference between SEM-BayesCΠ and SEM-GWAS in both overall mean comparison and marginal mean comparison. The direct effect on trait 1 is the same as the overall effect on trait 1 since the direct effect and overall effect on trait 1 are equal based on the causal structure. For the overall effect on trait 2, the pAUC5 mean of SEM-BayesCΠ is significantly higher than that of SEM-GWAS in the overall mean comparison, and some marginal mean comparisons (*e.g.*, and ). For the direct effect on trait 2, though higher overall mean of pAUC5 is usually observed in SEM-BayesCΠ, there is no significant difference (p-value < 0.1) between SEM-BayesCΠ and SEM-GWAS in both overall mean comparison and marginal mean comparison. For the indirect effect result on trait 2, similar to the overall effect result on trait 2, the pAUC5 mean of SEM-BayesCΠ is significantly higher than that of SEM-GWAS in the overall mean comparison, and some marginal mean comparisons (*e.g.*, and ).

### Real data result

#### Causal structure and structural coefficients:

The causal structure among three traits is inferred by the IC algorithm from the estimated posterior distribution of the residual covariance matrix in the multi-trait BayesCΠ model. Figure 2 shows three potential phenotypic causal structures among traits PH (), FTA (), and PN () recovered for the 0.9 HPD interval. The causal structure matrices for IC1 (), IC2 (), and IC3 () are:These three causal structures are fitted in the SEM-BayesCΠ model, and samples from posterior distributions for coefficients in these causal structures are obtained. The 90% credible intervals for structural coefficients in IC1, IC2, and IC3 are shown in Table 2. It is worth noting that the causal structures in Figure 2 provides the same set of marginal and conditional independencies, extra biological knowledge is required to further infer the causal structure. In this paper, the SEM-BayesCΠ model is proposed to incorporate (known) underlying causal structure among traits for GWAS. Thus, for the simplicity of our presentation, the causal structure is assumed known in Real Data Result section, *e.g.*, IC1 is used to demonstrate the performance of SEM-BayesCΠ.

#### Decomposition of SNP effects:

Direct, indirect, and overall SNP effects for all markers are estimated from SEM-BayesCΠ and SEM-GWAS. In SEM-BayesCΠ, direct SNP effects are assigned mixture priors, where each locus can affect any combinations of traits directly; samples from posterior distributions of indirect effects are obtained using joint samples from posterior distributions of Λ and direct SNP effects . In IC1, for trait PH, the overall SNP effect is equal to the direct SNP effect, because there is no intermediate trait. For trait FTA, the overall SNP effect is composed of direct SNP effect and indirect SNP effect transmitted from PH. So the overall SNP effect for FTA is given by summing the direct SNP effect and indirect SNP effect. Similarly, for trait PN, the overall SNP effect is obtained by summing the direct SNP effect and indirect SNP effect transmitted from PH.

The results of GWAS from SEM-BayesCΠ and SEM-GWAS incorporating causal structure IC1 are shown in Figure 3. Significant signals are found only for trait FTA. SEM-BayesCΠ adopts a threshold of WPPA = 0.8 to declare a significant genomic window and SEM-GWAS adopts a threshold of p-value = . The overall SNP effects are partitioned into direct and indirect effects, and GWAS are performed for the direct, indirect, and overall SNP effects separately for trait FTA. In Figure 3, the blue points, pink points and green points represents the significant genomic windows located in chromosome 1, chromosome 5 and chromosome 6. Window A contains SNPs from “id1000759” to “id1001229”; window B contains SNPs from “id1023967” to “id1024499”; window C contains SNPs from “id5013234” to “id5013920”; window D contains SNPs from “id6005814” to “id6006470”.

For the overall effects, in the SEM-GWAS, window D achieved -log(p-value) 14.21; in the SEM-BayesCΠ, window C achieved WPPA 0.90, window D achieved WPPA 0.88, and window A achieved WPPA 0.82. For the direct effect, in the SEM-GWAS, window D achieved -log(p-value) 12.24; in the SEM-BayesCΠ, window C achieved WPPA 0.90, window D achieved WPPA 0.86, and window A achieved WPPA 0.80. For the indirect effect, in the SEM-GWAS, window B achieved -log(p-value) 14.65; in the SEM-BayesCΠ, although no window is identified as significant in SEM-BayesCΠ, a peak was observed at window B with WPPA 0.52. Further, for all three effects, the results from SEM-BayesCΠ and SEM-GWAS are correlated (the correlation between the WPPA from SEM-BayesCΠ and -log(p-value) from the SEM-GWAS is higher than 0.5). The correlation of indirect effect results from these two methods results achieved 0.70. Also, for both SEM-BayesCΠ and SEM-GWAS, the overall effect is more correlated with direct effect rather than indirect effect. The magnitudes for overall, direct, and indirect SNP effect in SEM-BayesCΠ are also shown in Figure 5. Though most large overall SNP effects consist of a large direct SNP effect and a relatively small indirect SNP effect, the indirect effect of some SNPs play an important role, *e.g.*, the overall effect of SNP “id1024159”, as shown in Figure 5, consists of a large indirect SNP effect and relatively small direct effect.

### Genetic pleiotropy in SEM-BayesCΠ

As shown in Figure 4, the posterior distribution of the parameter Π is obtained, and markers show different levels of pleiotropy for direct SNP effects. In SEM-BayesCΠ, each SNP can have direct effects on any combination of traits, and the parameter Π is used to estimate the proportion of SNPs having different levels of pleiotropy. Indirect SNP effects on one trait are transmitted from direct SNP effects on other intermediate traits. For example, in causal structure IC1, the indirect SNP effects on FTA is transmitted from direct SNP effects on the trait PH. A SNP having no direct effect on FTA may have overall effects on FTA if its direct effect on trait PH is non-zero. The proportions of markers affecting different combinations of traits through overall SNP effects can also be estimated. This result is shown in Figure 4, and different probabilities are observed for some cases between overall and direct SNP effects. More SNPs have effects on all traits simultaneously when overall SNP effects are considered compared to direct SNP effects (case 1). If only overall SNP effects are considered, some cases having non-zero probabilities for indirect SNP effects are hidden by the causal relationships among traits (cases 2-4). The same patterns for direct and overall SNP effects are observed in cases 5-8 because there is no causal relationship between trait FTA and PN.

## Discussion

The complex causal relationships among multiple traits are usually not considered in conventional multi-trait GWAS. Here we propose the SEM-Bayesian alphabet method to incorporate pre-inferred causal structures among multiple traits into multi-trait Bayesian regression methods. SEM-Bayesian alphabet accounts for causal structures among traits, and has the potential advantage of estimating causal effects, providing genomic window-based inference, as well as providing a comprehensive understanding of the underlying biological mechanism.

### GWAS

To show the potential utility of SEM-Bayesian alphabet, simulated data were used to compare SEM-BayesCΠ with SEM-GWAS. A wide variety of potential genomic architectures were constructed by the combination of different levels of skewness of gamma distribution for QTL effects (*γ*) and different numbers of QTL() (Chen *et al.* 2017).

BayesCΠ was also performed to estimate overall SNP effects on the same datasets, and similar results as those in the SEM-BayesCΠ were obtained (not shown in this paper). This is reasonable since the SEM-Bayesian alphabet model can be reduced to a model similar to Bayesian regression by reparameterization, indicating that the joint likelihood functions of SEM-Bayesian alphabet and Bayesian regression are similar. Compared to Bayesian regression, the SEM-Bayesian alphabet provides a more comprehensive understanding of the underlying biological mechanism by decomposing overall SNP effects into direct and indirect SNP effects (Figures 3, 4 and 5).

The comparison between SEM-BayesCΠ and SEM-GWAS using simulated data were shown in Table 1. As shown in our results, SEM-BayesCΠ has relatively the same or higher pAUC5 than SEM-GWAS in all simulation scenarios. In some scenarios, SEM-BayesCΠ has significantly higher pAUC5 than SEM-GWAS. For example, when one trait is affected by few QTL of large effects (*e.g.*, *n** _{QTL}* = 30,), SEM-BayesCΠ has significantly higher pAUC5 than SEM-GWAS to infer indirect and overall effects. Though significant difference is not observed for direct effect, higher overall mean of pAUC5 is usually observed in SEM-BayesCΠ.

### Causal structure

The causal structure is assumed to be known in SEM-Bayesian alphabet, and it is usually discerned by three types of algorithms: the constraint-based algorithm, the score-based algorithms, and the hybrid algorithms. The IC algorithm (Pearl 2009; Valente *et al.* 2010) used in this paper is a typical constraint-based algorithm, which is based on conditional independence tests. The score-based algorithms apply the heuristic optimization techniques, which set an initial graph structure and assign an initial goodness-of-fit score to it, and then maximize the goodness-of-fit score to obtain the most possible causal structure. The hybrid algorithm is a hybrid of both the constraint-based and the score-based algorithms. It utilizes conditional independence tests to reduce the space of candidate causal structures, and uses network scores to identify the optimal structure among them (Scutari 2014). The causal structures inferred from these algorithms may be different. Note that different evaluation criteria may also result in different outcome causal structures. For example, in this paper, if we choose 0.99 instead of 0.9 HPD interval to search for causal structures, there will be no edge between the traits PH and PN.

### Decomposition of SNP effects

In some previous analysis (Mi *et al.* 2010; Momen *et al.* 2018, 2019), the indirect SNP effect of locus *j* of *t* traits is obtained by multiplying the estimated Λ, , and estimated direct SNP effects, , as . This is similar to using posterior means of causal structural coefficients and direct SNP effects for calculation of the indirect SNP effects. In our method, indirect SNP effects are estimated using joint samples from posterior distributions of Λ and . We compared these two approaches for indirect SNP effect estimation on real rice data, and found that the indirect effects estimated from these two approaches are slightly different. The SEM-BayesCΠ approach should be used in indirect SNP effect estimation due to the fact that Λ and may be highly dependent.

## Conclusion

SEM-Bayesian alphabet provides more interpretation into biological mechanisms than Bayesian regression methods by decomposing the overall SNP effects into direct and indirect SNP effects. In SEM-Bayesian alphabet, posterior distributions of the overall, direct, and indirect SNP effects, as well as causal structure coefficients, are obtained, which are used to make inferences about these parameters. Compared to the typical GWAS method incorporating causal structure among multiple traits, such as SEM-GWAS, SEM-Bayesian alphabet obtains the posterior distributions for the proportion of variance attributed to a genomic region to detect causal loci (*i.e.*, the use of WPPA). The level of gene pleiotropy, *e.g.*, proportion of markers affecting different combinations of traits as shown in Figure 4, can also be further dissected into direct and indirect SNP effects. Also, with estimating structural coefficients, SEM-Bayesian alphabet still has relatively same or greater pAUC5 than SEM-GWAS in all scenarios of simulated data. In summary, SEM-Bayesian alphabet offers a more comprehensive understanding of the underlying biological mechanisms including pleiotropy and causal relationships among traits than conventional GWAS, as well as has a potential advantage in the GWAS inference than other GWAS considering complex causal effect among multiple traits.

## Acknowledgments

We want to thank Bruno D. Valente for his explanation of the causal structure searching method and Mehdi Momen for his clarification on the IC algorithm. This work was supported by the United States Department of Agriculture, Agriculture and Food Research Initiative National Institute of Food and Agriculture Competitive grant no. 2018-67015-27957.

## APPENDIX

### Inductive causation algorithm

We use the IC algorithm (Pearl 2009) to recover the causal structure among a set of traits, denoted as *U* below, from the conditional independence relationship. The IC algorithm is composed of three steps (Valente *et al.* 2010; Chicharro and Panzeri 2014):

1. For each pair of traits *X* and *Y*, search for a set such that holds. That is, X and Y are independent, conditional on . If there is no such , place an undirected edge between these two traits.

2. If this pair of traits *X* and *Y* are non-adjacent (*i.e.*, no un-directed edge between *X* and *Y*) with a common neighbor *Z* (*i.e.*, Z is adjacent to X as well as to Y), and , place arrowheads pointing to Z, *i.e.*, .

3. In the partially-oriented graph from step 2, orient as many edges as possible following two requirements:

(a) Any alternative orientation will not yield a new *V* structure (*i.e.*, X→Z←Y).

(b) Any alternative orientation will not yield a directed cycle.

In summary, we find all the pairs of variables that have a dependent relationship to reconstruct the basic structure of the underlying causal network in step 1. Then we find all the *V* structures in the network in step 2 and prevent the creation of new V structures or directed cycles in step 3.

### Marker effect decomposition

Estimated direct, indirect, and overall SNP effects from SEM-BayesCΠ incorporating the IC1 causal structure for the trait flowering time at Arkansas (FTA) are shown in Figure 5. The SNP “id1024159” has direct effect 0.005 on trait FTA, while its indirect effect transmitted through PH is -0.039. Thus, the overall effect of SNP “id1024159” is mainly determined by the indirect effect.

## Footnotes

*Communicating editor: D.-J. de Koning*

- Received July 30, 2020.
- Accepted September 28, 2020.

- Copyright © 2020 Wang
*et al.*

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.