A High-Fidelity Cell Lineage Tracing Method for Obtaining Systematic Spatiotemporal Gene Expression Patterns in Caenorhabditis elegans

Advances in microscopy and fluorescent reporters have allowed us to detect the onset of gene expression on a cell-by-cell basis in a systematic fashion. This information, however, is often encoded in large repositories of images, and developing ways to extract this spatiotemporal expression data is a difficult problem that often uses complex domain-specific methods for each individual data set. We present a more unified approach that incorporates general previous information into a hierarchical probabilistic model to extract spatiotemporal gene expression from 4D confocal microscopy images of developing Caenorhabditis elegans embryos. This approach reduces the overall error rate of our automated lineage tracing pipeline by 3.8-fold, allowing us to routinely follow the C. elegans lineage to later stages of development, where individual neuronal subspecification becomes apparent. Unlike previous methods that often use custom approaches that are organism specific, our method uses generalized linear models and extensions of standard reversible jump Markov chain Monte Carlo methods that can be readily extended to other organisms for a variety of biological inference problems relating to cell fate specification. This modeling approach is flexible and provides tractable avenues for incorporating additional previous information into the model for similar difficult high-fidelity/low error tolerance image analysis problems for systematically applied genomic experiments.


Notation
Throughout this manuscript, we use various superscripts to denote group elements.
In several cases, the superquadric kernel function , is used to describe a weighted value based on the normalized distance from the center of each nucleus where is the 3D location index and is a location. , are arbitrary constants. is an anisotropic superquadric normalized distance function from nucleus to some arbitrary point / This distance function is a direct extension of the normalized superquadric distance function Jaklic et al. (2000) where the distance value is modified by and , to account for larger attenuation effects. is the affine transformed point that is projected into the superquadrics coordinate space: , where is the inverse rotation matrix of . A more generalized version of this distance function is used for some components: During the decoding process, the base probability is changed to a mixture of the trained time to event model and base probability:

| ,
This mixture provides a minimum allowed probability for the time to event model , and prevents the time to event component of the model from over-correcting when encountering false positives and false negatives. Linkage Model The full time varying trained parameters for the model are shown in Supplemental Figure 2.
The covariability function, | | , helps correct for interactions between individual features.
While these interactions are often minor and have minimal impact on the training of the model, failing to account for these interactions can result in odd shaped divisions called during the decoding process (often due to false-positives/false negatives that the model does not correct for appropriately. The interaction between variables is modelled as: where is a binary indicator value for the terms. Four interaction terms are modelled: division angle and difference of mother/daughter cells; division angle and distance of mother/daughter cells, distance of mother/daughter cells and difference of distance of mother/daughter cells, and distance of mother/daughter cells and the difference in the radius of the cells.

Initial Seeding
Before the decoding of the algorithm enters its main iterative phase, an initial seeding step is run. This initial seeding proposal is run for two reasons: to run a fast proposal for individual nuclei that is independent of time; and to obtain an initial seeding of the nuclei to fit a model for the normalized offset and rate of the image series. For each time point t, let , be the proposed points. The proposal seeding is done using the proportional probability of the discriminative model, nuclei size, and the spatial interaction: , |, , | , , | , , | , . An iterative algorithm is run that proposes to add a new point , to the seed data, , , , , where the new point is acceptance ratio is above a certain cutoff . The algorithm terminates for time when a new point has not been accepted in the last attempts.
Empirical Proposal Distributions While all moves are reversible, a number of moves break detailed balance by using empirical proposal distribution to bias the move towards more probable regions. These moves are necessary to bring the computational demands of the decoding down to reasonable levels.

Decoding
Reversible Jump Markov Chain Monte Carlo Methods (RJMCMC) are used to decode the lineage from any given trace. RJMCMC methods differ from traditional MCMC methods in that special consideration must be made to properly normalize the trans-dimensional jumps in the model. In many cases with Bayesian methods, the normalizing constant can be dealt with by integrated out the prior information. In our particular case, this is not possible; and as a result, we use a composite space and pseudo-priors to allow for equivalent normalizing factors Godsill (2001). Let | , , be a draw from the joint prior of the augmented data, and let | , , be the normalizing ratio constant of a reversible jump move (in this example, the addition of a new point 1 at time to the model where ĉ , . The acceptance ratio of the move is then: provides the missing normalizing constant for the distributions that result from the addition of the point that is missing in ĉ but present in (e.g., the Gaussian prior on the major axis). For performance reasons, we simulate the most probable draw from the joint prior. While this limits our ability to address uncertainty in the choice of model, the effect is mitigated by the fact that we are using a hybrid simulated annealing approach to sample the most probable configuration and are not directly concerned with the uncertainty of the lineage. Pure methods for properly calculating this constant using full detailed balance RJMCMC methods or bridge/path sampling are possible, but are not practical from a computational/memory perspective.

Sequential RJMCMC Moves
Both the add point and remove point use empirical proposals. Split, merge, and updating the points within a branch are done uniformly. The add point proposal proposes a region to add a new point. Let , be the set of all possible regions. The probability of a point being added at time to region i is: where , , is the residual from the generative observed image at the sub-location in region , , and is the proposed time to event ratio probability of adding a new hypothetical point and attaching it to the parent of time 1: , is a Gaussian kernel function describing the weight from the center of the parent to location and 0 are negative constants that invert the log probabilities. A new point is then drawn uniformly from this region. The proposal to remove a nuclei is based on the current state of the discriminative model. The probability to propose removing nuclei at time is: The split move splits a single nuclei into two nuclei. The location of the two new nuclei is set to be orthogonal to the parent point. Let 2,5 , be a magnitude that is drawn from a uniform distribution times the major axis length of the previous point, and be a unit vector drawn from uniformly from the unit sphere. The locations for the new nuclei and are then a function of previous location, the magnitude, and the unit vector , .

Non-sequential RJMCMC Moves
Empirical proposals for selecting a new point are used for: adding a new branch, adjusting the beginning of a branch, and adjusting the end of a branch. Proposing a branch to update or delete are selected uniformly from all possibly branches. The proposal distribution for selecting a branch to update is: where is the beginning parent (the mother cell of a division). These proposals are used for relinking both the beginning and end of the branch. For adding a new branch, an empirical proposal is used for selecting the initial seed point based on a set of past sequential rejected points at time : , . During the chronological stage, when rejecting a move from ĉ to ĉ in the add stage, the point , has a probability of adding to the rejected point list. This acceptance probability of adding a point to the set: , , , , |ĉ ĉ |

, |ĉ ĉ|
This proposal biases to points that have strong evidence from the image of existing but are often rejected due to topological constraints (linkage, time to event, etc.). While the length of a new branch , as well as the distance to look for relinking the start and end of each branch , are drawn from a modified Poisson distribution, each move has an empirically probability of extending the length of this branch based on the following: Let be the proposed new configuration from a new branch, or the start/end relinking moves, where ĉ is the original configuration. We extend the addition of a new branch, or the relinking of the start end by with probability , where is the probability ratio of the bottom up components: and is the product of the link probability and time to event of the proposed new branch/relinking step. This empirical proposal extends regions where strong evidence from the data that a point should exist, but whose linkage and time to event components suggests the incorrect parent (and hence, that it should be in a different time point).

Relinking Empirical Moves
During the relinking phase, a point can be added whose true parent is bound incorrectly to a child, or whose parent will unlikely be accepted due to a previous miscalled division. To overcome these issues, we've designed several empirical proposals that will allow an already bound child to choose a new parent. Let , , , , , be the current parent, first, and second (if exists) child of a cell nucleus and | | is the number of children. When proposing a new parent, , , each child of the current parent , , has a probability of entering its own relinking phase of: , log log where are constant weight factors, and , are the link and time to event probability improvement ratios: When proposing a new parent, it is possible that the probability of the link is good, but the time to event model leads to a decreased probability of the true parent being selected. In many situations, this occurs when a parent/child relationship in a previous or future time point is incorrect. This is also addressed through empirical proposals. When a new proposed parent increases the number of children (a continuation to a division), a proposal to relink a previous or future cell within the branch is made: exp are the improvements of the linkage as described above, is the ratio of the previous division to the geometric mean of the divisions before and after(if it exists) and is a log weighted probability based on the improvement of the link and the change in the time to event distributions: , , , increases the probability of a past future relink if the link probability has improved, but the time to event probability has gotten worse. This ensures that a previous time relinking will only be attempted if the transition to a division is restrictive due to a division that recently occurred.
To maintain balance, we must also allow the reverse: when a new parent is proposed, and the old parent transitions from 2 children to 1 (division to a continuation), a relinking step that allows a previous or future division occurs with probability: exp is a slightly modified version of removing a division. Instead of a straight value for the time to event probability, a slightly modified value is used to allow for an increased probability of adding a past division when it nears the expected division time: 1 1 exp

Decoding algorithm
The decoding algorithm is based on single time and multi time reversible jump Markov chain Monte Carlo methods. The specifics of the algorithm are described below using pseudo code and briefly described here. The program initiates with a proposal step (PROPOSAL) that burns in a series of proposal points using the discriminative distribution across all the points. This is followed by a fit of the normalized time to a linear model (NORMALIZED_FIT). The method then enters an iterative procedure, where for every time point, for a variable number of iterations per time point, sequential and non-sequential moves are sampled with fixed probability constants (SAMPLE_FULL). After running through all time points, the algorithm enters a final iteration where only non-sequential moves are sampled (SAMPLE_NON). . The prerequisite on having 2 subsequent cells being missing for a detection error and allowing for parent identifiable cells to belong to previous branches is to allow flexibility in division calls: cells undergoing division often appear to divide within a range of 2 time points and are often so closely spaced that the "ground truth" could occur in either time point (this is especially the case in z-axis divisions, where the increased z attenuation often makes two daughter cells appear to be a larger mother cell.)

Imaging setup and strain information
Images are captured as described in (Murray et al., 2008