A Bayesian model for the analysis of transgenerational epigenetic variation.

Epigenetics has become one of the major areas of biological research. However, the degree of phenotypic variability that is explained by epigenetic processes still remains unclear. From a quantitative genetics perspective, the estimation of variance components is achieved by means of the information provided by the resemblance between relatives. In a previous study, this resemblance was described as a function of the epigenetic variance component and a reset coefficient that indicates the rate of dissipation of epigenetic marks across generations. Given these assumptions, we propose a Bayesian mixed model methodology that allows the estimation of epigenetic variance from a genealogical and phenotypic database. The methodology is based on the development of a T matrix of epigenetic relationships that depends on the reset coefficient. In addition, we present a simple procedure for the calculation of the inverse of this matrix (T−1) and a Gibbs sampler algorithm that obtains posterior estimates of all the unknowns in the model. The new procedure was used with two simulated data sets and with a beef cattle database. In the simulated populations, the results of the analysis provided marginal posterior distributions that included the population parameters in the regions of highest posterior density. In the case of the beef cattle dataset, the posterior estimate of transgenerational epigenetic variability was very low and a model comparison test indicated that a model that did not included it was the most plausible.


epigenetics Bayesian analysis genetic variance resemblance between relatives
Epigenetics studies variations in gene expression that are not caused by modifications of the DNA sequence (Eccleston et al. 2007) and is now considered as one of the most important fields of biological research. Several biochemical mechanisms that alter gene activity (DNA methylation, histone modifications, etc.) underpin epigenetic processes (Jablonka and Raz 2009;Law and Jacobsen 2010). A number of studies have emphasized the importance of epigenetics in cancer (Esteller 2008) and other human illnesses (Feinberg 2007) as well as in other mammal (Reik 2007) and plant traits (Henderson and Jacobsen 2007). Gene activity modifications may occasionally occur in a sperm or an egg cell and can be transferred to the next generation, denoted as transgenerational epigenetic inheritance (Youngson and Whitelaw 2008), a phenomenon that has been reported in a wide range of organisms (Jablonka and Raz 2009). Despite this, the magnitude of the phenotypic variation that is explained by epigenetic processes still remains unclear (Grossniklaus et al. 2013;Heard and Martienssen 2014). From the perspective of quantitative genetics, the presence of transgenerational epigenetic inheritance involves a redefinition of the covariance between relatives. Tal et al. (2010) developed a model for the calculation of the covariance between relatives for asexual and sexual reproduction as a function of epigenetic heritability (g 2 ), the reset coefficient (v), and its complement, the epigenetic transmission coefficient (1 2 v). According to these authors, the covariance between relatives is reduced as the number of opportunities to dissipate (or reset) the epigenetic marks increase. Therefore, for sexual diploid organisms, the covariance between parent and offspring is greater than the covariance between full sibs, and the covariance between half sibs is greater than the covariance between an uncle and its nephew, despite the fact that the additive numerator relationship between them is identical (0.5 and 0.25, respectively).
In animal and plant breeding, the standard procedure for estimating variance components is by means of a linear mixed model (Henderson 1984) that includes systematic or random environmental effects, one or more random genetic effects, and a residual. The variance components are estimated by using likelihood-based procedures (Patterson and Thompson 1971) or Bayesian approaches (Gianola and Fernando 1986). Under the Bayesian paradigm, the standard procedure for determining the posterior distribution of the variance components uses the Gibbs Sampler algorithm (Gelfand and Smith 1990) that involves an updated iterative sampling scheme from the full conditional distributions of all the unknowns in the model.
The aim of this study is to present a Bayesian linear mixed model that allows one to estimate the heritable epigenetic variability, the reset coefficient, and the epigenetic transmission coefficient based on genealogical and phenotypic information. The model is illustrated with two simulated datasets and with one example that considers a beef cattle database.

Statistical model
The standard mixed linear model is described by the following equation: where y is the vector of phenotypic records, b is the vector of systematic effects, u is the vector of random additive genetic effects, and e is the vector of residuals. Then, X and Z are the matrices that link the systematic and additive genetic effects with the data. The usual assumption for the prior distribution of u and e are the following multivariate Gaussian distributions (MVN): where s 2 u and s 2 e are the additive genetic and residual variances, respectively, and A is the numerator relationship matrix. Further, the prior distribution of b is commonly assumed to be a uniform distribution. Conjugate priors for the variance components are the following inverted chi-square distributions: where s 2 u and s 2 e are the prior values of the variances and n u and n e are their corresponding prior "degrees of belief ".
To estimate transgenerational epigenetic variability, this standard model can be expanded to: where w is the vector of individual epigenetic effects. Note that the incidence matrix (Z) is the same for u and w and that both genetic and epigenetic effects are assumed to be independent. The prior distribution of w is defined as: where T is the matrix of epigenetic relationships between individuals and s 2 w is the transgenerational epigenetic variance. As before, the prior distribution of s 2 w is defined as: with hyperparameters s 2 w and n w . The structure of the T matrix is defined by the recursive relationship between the epigenetic effect of one individual (w i ) with respect to the epigenetic effects of its father (w fi ) and mother (w mi ): As defined by Tal et al. (2010), v is the reset coefficient and (1 2 v) is the epigenetic transmission coefficient. The reset coefficient represents the proportion of epigenetic marks across the parental genome that are expected to be erased, whereas its opposite, the epigenetic transmission coefficient, indicates the proportion that are transmitted. Furthermore, e i is the residual epigenetic effect of the ith individual, independent from w i, , and whose distribution is: if only one ancestor in known; assuming that the variance of transgenerational epigenetics effects (V()) is constant across generations: In matrix notation: where the P matrix defines a recurrent relationship with the epigenetic effects of the father and mother. For nonbase individuals, the ith row of the P matrix contains a parameter l in the column pertaining to the father and mother of the ith individual. The rest of elements are null. Furthermore if: where VðeÞ is a diagonal matrix with entries equal to s 2 w for base individuals, ð1 2 l 2 Þs 2 w for individuals with one known ancestor, and ð1 2 2l 2 Þs 2 w for individuals whose father and mother are known. The prior distribution for l will be assumed to be uniform, between 0 and 0.5.

Parameter estimation through a Gibbs sampler
The Gibbs sampler algorithm is an iterative, updating sampling scheme that obtains samples from the marginal posterior distributions of all the unknowns in a model. It requires samples from the full conditional distributions of all the parameters. In the proposed model, the set of parameters can be classified into three main groups: i) the location parameters (b, u, and w); ii) the parameter l associated with matrix T; iii) the variance components (s 2 u , s 2 w , and s 2 e ). The full conditional distributions for these groups are: Sampling of the location parameters (b, u, w): The conditional distributions of the location parameters are univariate Gaussian (Wang et al. 1994), with parameters drawn from the following mixed model equations: given the multivariate Gaussian nature of the conditional distribution of the location parameters.
A key limitation of the procedure is the calculation of the A 21 and T 21 matrices. Whereas A 21 is calculated by the standard Henderson's rules (Henderson 1976) only once, the T 21 matrix needs to be calculated afresh in each cycle as the parameter l is updated. An algorithm to set-up the T 21 matrix is presented in the appendix of this work (see Appendix).
The full conditional distribution of l: The conditional distribution of l is developed from the recursive definition of the epigenetic effects (equation 1). It is a truncated univariate Gaussian distribution (TN) between 0 and 0.5 due to the prior distribution of the parameter.
pðljwÞ TN ½0;0:5 ðm l ; s l Þ where n 1 is the number of individuals with known fathers and mothers, n 2 is the number of individuals with only father known, and n 3 is the number of individuals with only mother known.
The full conditional distributions of the variance components (s 2 u ; s 2 w ; s 2 e ): The conditional distributions for the variance components are the following inverted x 2 distributions where nan is the number of individuals in the population, ndat is the number of phenotypic data, and s 2 x and n x are the hyperparameters for s 2 x .

Simulated datasets
To evaluate the procedure, we simulated two different datasets. The first one had a relative high percentage of variation caused by additive and transgenerational epigenetic effects (35% and 20%, respectively) and an intermediate reset coefficient (v = 0.40). In the second dataset, a lower percentage of variation was explained by additive genetic and transgenerational epigenetic effects (15% and 10%, respectively) and the reset coefficient was greater (v = 0.80). Each dataset was composed by a three-generation population. The base populations included 3000 individuals (1500 sires and 1500 dams) and each of the two subsequent generations were composed by 3000 full sib families of 10 individuals each one. The sire and dam for each family were sampled randomly from the individuals of the previous generation. The genetic and epigenetic effects for each individual (u i and w i ) in the base population were generated from the following Gaussian distributions: Furthermore, the genetic and epigenetic effects for the individuals (u j and w j ) of the second and third generation were obtained by sampling from: where u fi and u mj and w fi and w mj were the additive genetic and transgenerational epigenetic effects of the father and the mother of the jth individual, respectively. In addition, one phenotypic record (y i ) was generated for every individual by: where m was a general mean set to 100 units. The values of the parameters for the two datasets were: Dataset 1.
The Gibbs sampler was implemented using own software written in Fortran 95. The analysis consisted of a single chain of 1,250,000 cycles, and the first 250,000 were discarded. Each analysis took 36 hr using a single thread of an Intel Xeon E5-2650 of 2.00 GHz. The source code is included as Supporting Information, File S1. Convergence was checked by the visual inspection of the chains and with the test of Raftery and Lewis (1992). All samples were stored for calculating summary statistics.

Pirenaica beef cattle data
As an illustrative example, we used a database of phenotypic records from the yield recording system of the Pirenaica beef cattle breed. The Pirenaica breed is a meat-type beef population from northern Spain with an approximate census of 20,000 individuals that are typically reared under extensive conditions (Sánchez et al. 2002). The data set was made up of 78,209 records for birth weight with an average value of 41.52 kg and a raw standard deviation of 4.65 kg. In addition, a pedigree file including 125,974 individualsire-dam records was used. This information was provided by the National Pirenaica Breeders Confederation (Confederación Nacional de Asociaciones de Ganado Pirenaico; http://www.conaspi.net). Animal Care and Use Committee approval was not required for this study as field data were obtained from the Yield Recording System of the Pirenaica breed; furthermore, data were recorded by the stockbreeders themselves, under standard farm management, with no additional requirements.
The full model of analysis was: Where b is the vector of fixed systematic effects that included sex (2 levels) and age of the mother (16 levels), u and m are the vectors of direct and maternal additive genetic effects with 125,974 elements, p is the vector of permanent environmental maternal effects (21,143 levels), h is the vector of the random herd-year-season effects (12,925 levels), w is the vector of transgenerational epigenetic effects (125,974 levels), and e is the vector of the residuals. Furthermore, X, Z 1 , Z 2 , and Z 3 are the incidence matrices that link the different effects with the phenotypic data. Appropriate uniform bounded distributions were assumed for the systematic effects and for each variance component (s 2 x ¼ fs 2 p ; s 2 h ; s 2 w ; s 2 e g), defined by hyperparameters (n x = 22 and s 2 x = 0). Furthermore, the prior distribution for the (co) variance matrix between direct and maternal genetic effects (G) was the following inverted Wishart: being s 2 u and s 2 m the direct and maternal genetic variances and s um the covariance between them. In this case, n G was set to 23 and G 0 to a 2 · 2 matrix of zeroes to define a uniform distribution.
Two statistical models (I and II) were fitted. Model I includes transgenerational epigenetic variance (s 2 w ), whereas Model II does not. For each model, the Gibbs sampler was implemented with a single chain of 3,250,000 cycles, and the first 250,000 were discarded. Convergence was checked by the visual inspection of the chains and with the test of Raftery and Lewis (1992). The models were compared based on the pseudo-log-marginal probability of the data (Gelfand 1996;Varona and Sorensen 2014) by computing the logarithm of the Conditional Predictive Ordinate (LogCPO) calculated from the Markov chain Monte Carlo (MCMC) samples. It was calculated as: where y -i is the vector of data with the ith datum (y i ) deleted, M k is the kth model and being Ns is the number of MCMC draws, u j k is the jth draw from the posterior distribution of the parameters of the kth model.

RESULTS AND DISCUSSION
From the perspective of quantitative genetics, the estimation of variance components is obtained from the statistical information provided by the covariance between the phenotypes of relatives (Falconer and Mckay 1996). Tal et al. (2010) proposed a simple model for the resemblance between relatives under transgenerational epigenetic inheritance for asexual and sexual reproduction. This model assumes that transgenerational epigenetic effects are not correlated with the additive genetic ones, and that they are distributed under a Gaussian law. After the central limit theorem, this may be explained by a large number of epigenetic marks randomly distributed across the genome. In addition, the distribution pattern of these marks is assumed independent between individuals unless they are relatives. In that case, some of these marks should have not been erased during the meiosis drawing them apart from their common ancestor. The population rate of erasure of these marks is measured through the reset coefficient (v).
In this study, we propose a procedure that makes use of the aforementioned authors' definition for sexual diploid organisms to estimate the parameters under a Bayesian mixed model framework. Our approach makes it feasible to estimate transgenerational epigenetic variability from huge datasets of genealogical and phenotypic data. The key to the procedure is the definition of a T matrix of (co) variance between transgenerational epigenetic effects. This matrix exclusively depends on a single parameter (l) which is directly related to the reset coefficient (v) put forward by Tal et al. (2010). As an example, for a simple pedigree of seven individuals (Figure 1), the T matrix is: Figure 1 Example of a pedigree.
T ¼ 2 6 6 6 6 6 6 6 6 4 1 0 0 l l l l 2 0 1 0 l 0 0 0 0 0 1 0 l l l 2 l l 0 1 l 2 l 2 l 3 l 0 l l 2 1 2l 2 2l 3 l 0 l l 2 2l 2 1 l l 2 0 l 2 l 3 2l 3 l 1 3 7 7 7 7 7 7 7 7 5 As Tal et al. (2010) stated, if we compare the covariance between full sibs (individuals 5 and 6) and the covariance between sire-offspring (individuals 1 and 4), the covariance is greater for the sire offspring pair, l vs. 2l 2 , as l is defined as being within the interval (0, 0.5). Note that if the reset coefficient is 0, then l is 1/2 and the T matrix becomes equivalent to the numerator relationship matrix (A), where the covariances between full sibs and sire offspring are equivalent. Similarly, the uncle2nephew covariance (individuals 5 and 7) is 2l 3 , lower than the covariance between half sibs (individuals 4 and 5) which is only l 2 . The mixed model implementation of the covariance between relatives requires the inverse of the T matrix to construct the mixed model equations, in a similar manner as the numerator relationship matrix in the standard model (Henderson 1984). In this study, we describe a simple procedure for the calculation of this T -1 matrix (see the Appendix section). The procedure takes into account the recursive nature of the transgenerational epigenetic effects, using an argument equivalent to Quass (1976), for the inverse of numerator relationship matrix (A 21 ), and Quintanilla et al. (1999), for dam-related permanent maternal environmental effects. The main consequence is that the inverse of the T matrix can be sequentially constructed by reading the pedigree of the population, in the light of very simple rules. These rules are close to Henderson's (1976) rules for constructing the inverse of the A matrix.
For the example pedigree, the T 21 matrix is: Given this algorithm to set-up the T 21 matrix, the implementation of a Gibbs sampling approach becomes straightforward; it merely requires sequential iterative sampling from Gaussian (b, u, and w), truncated Gaussian (l) and inverted x 2 distributions (s 2 u ; s 2 w ; s 2 e ). The only minor complication is that the T 21 matrix depends on parameter l and consequently must be calculated in each Gibbs sampler's iteration. One possible alternative to avoid this matrix inversion, although not to avoid its updating in each cycle, is to implement an approach similar to the one proposed by Rodríguez-Ramilo et al. (2014) for the genomic relationship matrix. However, setting up the T matrix is computationally more demanding than the calculation of its inverse with the algorithm proposed in this study.
The proposed procedure was first checked with simulated data. The summary of the marginal posterior distributions for the variance components, additive genetic (h 2 ), and epigenetic (g 2 ) heritability, reset coefficient (v), and epigenetic transmission coefficient (1 2 v) for both cases of simulation are presented in Table 1. The greatest posterior density at 95% for all parameters in the model included the simulated values. However, some details of the results should be highlighted. In the second case of simulation, with lower s 2 w and l, the posterior standard deviation (and the HPD95) for the s 2 w and s 2 e were remarkable wider (66.19 and 67.42, respectively). The cause of this phenomenon is that there is a statistical confounding between the epigenetic and residual variance components when the reset coefficient (v) is very high, because the T and I matrices becomes very similar. To illustrate this fact, in Figure 2 and Figure  3, we present the joint posterior densities of g 2 and v for both cases of simulation. The results of the first case of simulation showed posterior independence between both parameters, whereas in the second, the marginal posterior density presented a half-moon shape. This indicates that for large values of v, g 2 (and s 2 w ) may take any value on its support with a fairly equal probability. In other words, in the first simulated case the model showed very good ability to discriminate between both parameters, whereas in the second it did not. In addition, mixing of the MCMC procedure in the second case of simulation was clearly worst, and adaptive MCMC algorithms, such as the proposed by Mathew et al. (2012) may represent an interesting alternative for its implementation in large data sets.
It must be noted that the sources of information for the estimation of s 2 w and v (or l) comes from the comparison between the covariance between different categories of relatives (see Table 2). Following T 21 ¼ 2 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 4 Tal et al. (2010), one estimate of v can be achieved from the difference between the estimates of covariance between half-sib and uncle2nephew relationships, a difference that becomes almost null for low values of l (or high v), as seen in the second case of simulation (23.10 vs. 22.62). On the contrary, with greater values of l(or lower v), as in the first case of simulation, this difference is higher (60.60 vs. 57.36), and, thus, the amount of information available for estimation purposes increases. This uncertainty in the estimation of l (or v) is also reflected in the estimation of s 2 w (g 2 ) and s 2 e and implies wider posterior marginal densities. Nevertheless, it is important to emphasize that the joint posterior mode (v = 0.80, g 2 = 0.13) in the second case of simulation was very close to the simulated parameters (0.80 and 0.10, respectively).
We have also checked our model with a real dataset. In Table 3, summary statistics of the marginal posterior distributions for the parameters and LogCPO values are presented for models I and II in the analysis of birth weight from the Pirenaica beef cattle population. The posterior mean estimates for heritability ranged between 0.38 (Model I) to 0.41 (Model II), and the posterior mean estimates for transgenerational epigenetic heritability was only 0.04 (Model I), with a greatest posterior density at 95% that ranged between 0.00 and 0.11. These results are coherent with the output of the model comparison test (LogCPO), which pointed to Model II as the more plausible. If we focus in this latter model, the estimates of direct and maternal heritability were within the range of estimates in the literature for these traits (Meyer 1992;Varona et al. 1999;Jamrozik and Miller 2014), and the absence of the epigenetic transgenerational heritability is coherent with the fact that most of the epigenetic marks are erased during the meiosis in mammals Raz 2009, Schmitz 2014). However, it should be highlighted that the posterior mean estimate of the reset coefficient under Model I was surprisingly low (0.20), although its posterior standard deviation was very high (0.20), and that the HPD95 region covered almost all the parametric space (0.0120.89). This wide range reflects the absence of information to properly estimate the parameter, given the low magnitude of s 2 w . It is worth noting that the procedure also can provide epigenetic breeding values in the fields of animal and plant breeding. These are calculated by weighting the phenotypic information of relatives n   according to the magnitude of the elements of the T matrix. In fact, for epigenetic effects, the weight of distantly related individuals is lowered as the number of opportunities to reset the epigenetic marks increases. Both additive genetic and epigenetic effects can be used for prediction of the future performance of the individual: the expected genetic response (R) after one cycle of mass selection should be R ¼ i½h 2 þ ð1 2 vÞg 2 s 2 p , where i is the intensity of selection and s 2 p is the phenotypic variance. However, further research must be undertaken to develop adequate indices of selection that consider genetic and epigenetic effects. Although both of them affect the immediate future performance of the offspring, epigenetic effects are diluted in future generations as the epigenetic marks are cleaned. It is important to note that if this selection pressure is relaxed, the average epigenetic effect declines to zero with a rate of ð12vÞ n , where n is the number of generations without selection. It should be further noted that the proposed model uses a very basic definition of transgenerational epigenetic inheritance, as it assumes equal epigenetic variance for all individuals in the population. However, when phenotypic records are recorded through the life of the individual, transgenerational epigenetic variance can be modeled as age dependent, based on the consideration that the number of epigenetic marks was accumulated in the genome of the individuals. Moreover, in the proposed model, it is assumed that the l parameter (or the reset coefficient) is equal for fathers and mothers, and, in future research, it seems reasonable to assume a different transmission coefficient for males and females, allowing for the presence of sex differential genomic imprinting (Barlow 1997;Reik and Walker 1998). The model can be refined by the inclusion of a hierarchical Bayesian paradigm that includes systematic effects for any environmental factors that may influence the epigenetic transmission coefficient. In addition, a more profound approach could even allow for the consideration of a genetic determinism of the reset coefficient with a model similar to that proposed by Varona et al. (2008) for the degree of asymmetry of a skewed Gaussian distribution.
Finally, and as mentioned by Raz (2009) andTal et al. (2010), epigenetics may be viewed as a more wide ranging concept that could include several types of cultural transmission (Avital and Jablonka 2000; Richerson and Boyd 2005). The procedure suggested in this work also could be applied to the analysis of human or animal datasets that involve any kind of transgenerational transmission, even those not directly related to gene expression.
By way of a conclusion, we can say that this paper presents an original procedure for the estimation of transgenerational epigenetic variability based on some generalized assumptions. It is hoped that the proposal will lead to future research on variations of epigenetic transmission abilities caused by environmental and/or genetic factors.

ACKNOWLEDGMENTS
We thank the Pirenaica Breeders Association (Confederación Nacional de Asociaciones de Ganado Pirenaico) for the provision of data used in this work. We also thank the useful comments of the editor and the reviewers. This research was funded by grant AGL2010-15903 (Ministerio de Ciencia e Innovación, Spain).