Bayesian characterization of uncertainty in intra-subject non-rigid registration

In settings where high-level inferences are made based on registered image data, the registration uncertainty can contain important information. In this article, we propose a Bayesian non-rigid registration framework where conventional dissimilarity and regularization energies can be included in the likelihood and the prior distribution on deformations respectively through the use of Boltzmann’s distribution. The posterior distribution is characterized using Markov Chain Monte Carlo (MCMC) methods with the effect of the Boltzmann temperature hyper-parameters marginalized under broad uninformative hyper-prior distributions. The MCMC chain permits estimation of the most likely deformation as well as the associated uncertainty. On synthetic examples, we demonstrate the ability of the method to identify the maximum a posteriori estimate and the associated posterior uncertainty, and demonstrate that the posterior distribution can be non-Gaussian. Additionally, results from registering clinical data acquired during neurosurgery for resection of brain tumor are provided; we compare the method to single transformation results from a deterministic optimizer and introduce methods that summarize the high-dimensional uncertainty. At the site of resection, the registration uncertainty increases and the marginal distribution on deformations is shown to be multi-modal.


Introduction
The vast majority of non-rigid registration methods only report a single transformation (i.e. a point-estimate) result without any estimate of the registration uncertainty, which could provide valuable information whenever important clinical decisions, or high-level analysis, are based on registered data.Various factors such as the ill-posed nature of non-rigid registration, the stochastic nature of the images, the large variability of anatomy, the presence of homogeneous intensity regions, and imaging artifacts like distortion and biasfield, may negatively affect the registration uncertainty.Despite this, state-of-the-art registration methods are typically evaluated on a limited set of representative images through estimation of the alignment error of sparse homologous landmarks, or regions, based a single registration result.Suppose a new dataset, with new pathology or image artifacts that are not represented in the evaluation dataset, was to be registered -then it cannot be assumed that the resulting deformation point-estimate will carry the same accuracy as the previous evaluation results would indicate.If the registration result was associated with some uncertainty measure, or confidence limits, it could be possible to detect that the method had difficulty in finding a proper alignment of the images in certain locations, and the surgical risk involved with making decisions based on the registered data could be evaluated.It should be noted that a highly uncertain (low precision) registration result does not necessarily mean that the result is also inaccurate and vice versa, so systematic errors (bias) of the registration model should be evaluated before placing trust in precise registration results.
In image-guided neurosurgery, where surgeons assess the operative risk based on registered image data, misplaced confidence in registration results may have severe consequences.Important functional areas of the brain are commonly defined from pre-operative studies.If these eloquent areas are located adjacent to a tumor, it is critical that their locations are mapped as precisely as possible to the intra-operative scans.Factors such as resection of tissue, degraded intra-operative image quality, or blood and edema around a tumor, may all contribute to an increase in registration uncertainty and/or decrease of registration accuracy, especially in the vicinity of the surgical field, precisely where accuracy is most essential.For a neurosurgeon, whose goal is to maximize the resection of tumor tissue while preserving eloquent areas, information of the uncertainty in the location of the functional areas in relation to the tumor is just as important as the optimal estimate.Current image guided navigation systems use rigid registration to map between the pre-and intra-operative space (Shamir et al., 2009;Gumprecht et al., 1999), and researchers have focused on estimating the distribution of errors in rigid registration, especially in fiducialbased registration where a set of homologous points are used to drive the rigid transform that aligns the two spaces (Maurer et al., 1997).It is generally known that the Fiducial Registration Error (FRE) is a poor predictor of real registration error (Danilchenko and Fitzpatrick, 2011).Hence, recent works have focused on characterizing the distribution of the Target Registration Error (TRE), i.e. the distance between homologous points other than the fiducials used in the FRE, after registration (Danilchenko and Fitzpatrick, 2011;Seginer, 2011).An extension to the standard rigid fiducial registration algorithm was presented by Pennec and Thirion (1997) where, given a set of matched geometric features in the form of points or frames, they estimate a rigid motion as well as the covariance matrix which is compatible with the matched points.Another well known method, the Iterative Closest Point (ICP) algorithm, which does not assume point homology, has been extended to include point uncertainties (Maier-Hein et al., 2012;Fieten et al., 2010).
In rigid intensity-based registration, Kybic (2008) assumed normally distributed registration (translation) parameters and used the Hessian of the similarity criterion to compute confidence intervals of the registration parameters.Robinson and Milanfar (2004) used the Cramèr-Rao bounds to find a lower bound on the covariance of the parameters in a 2Dtranslational registration.Bansal et al. (2009) introduced a framework for quantifying errors in registration by computing the confidence intervals of the estimated parameters (translation, rotation and scale) for the similarity transform.They show that the parameters are multi-variate normally distributed and use the covariance matrix to compute the confidence interval of the transformation parameters.
Rigid registration cannot account for the general motion of brain tissue that occurs during neurosurgery which has spurred interest in non-rigid registration methods.However, few authors have attempted to quantify the full uncertainty of the estimates produced by nonrigid registration methods.Hub et al. (2009) developed a heuristic method for quantifying uncertainty in B-spline image registration by perturbing the B-spline control points and analyzing the effect on the similarity criterion.A statistical approach was presented by Kybic (2010) who considered images to be random processes and proposed a bootstrapping method to compute statistics on registration parameters.However, their method was only demonstrated on 2D datasets with low-dimensional transformation models.Gee et al. (1994Gee et al. ( , 1995a) ) described a Bayesian approach using a prior based on linear elasticity to regularize the problem.By using local quadratic approximations of the likelihood term, they constructed a Gibbs sampler to estimate the posterior mean (Gee et al., 1995a), while a deterministic method was used to estimate the posterior mode (Gee et al., 1994).They reported results of estimating transformation parameter variance for the case of 2D registrations.Another fast Bayesian approach for approximating the posterior uncertainty on registration parameters using variational Bayes was introduced by Simpson et al. (2012) where an approximate posterior distribution on the registration and hyper-parameters is estimated based on a set of simple parametric distributions and has shown to be useful in classification (Simpson et al., 2011a) and segmentation propagation tasks (Simpson et al., 2011b).
In Bayesian inference, typical approaches to finding the maximum a posteriori (MAP) estimate of non-parametric distributions involve some form of local gradient based maximization of the log posterior.Unfortunately, there are usually no guarantees that the optimization method finds the global maximum, nor are "error-bars" or credible intervals of the MAP results usually provided.Because the posterior distribution is non-parametric, and assumed to have heavy tails and possibly be multi-modal, Markov Chain Monte Carlo (MCMC) is the most principled approach to infer posterior estimates, including the uncertainty (Gelman et al., 2003).MCMC numerically approximates the posterior distribution by a set of posterior samples.While it asymptotically characterizes the entire posterior distribution, a large number of samples may be needed to characterize the posterior to the needed accuracy, and MCMC methods are consequently compromised by a high computational burden.As a result, few applications of MCMC have been reported in medical image analysis.Recently, Smal et al. (2012) applied MCMC for motion analysis of tagged magnetic resonance images.For segmentation purposes, Fan et al. (2007) used an MCMC method for sampling curves from the exponentiated negative segmentation energy and showed how to summarize posterior marginal distributions in terms of marginal probability maps.An extension of that work was presented in Chang and Fisher (2011) where they sample level sets instead of curves, which facilitates topology changes.MCMC has been widely used in functional Magnetic Resonance Imaging (fMRI) analysis (Woolrich et al., 2006;de Pasquale et al., 2008) and Diffusion Tensor Imaging (DTI) (Behrens et al., 2003;McNab et al., 2009).
The posterior distribution over deformations is a highly informative high-dimensional spatially varying object with local correlations.In this article, we introduce a Bayesian nonrigid registration framework where the posterior distribution on deformation parameters is characterized by MCMC.Although posterior approximations through, for instance, variational Bayes may provide fast estimates of the posterior, it is not known how well they approximate the unknown shape of the posterior distribution.Even though MCMC comes at a hefty computational price, and consequently may not be directly applicable in many clinical applications today, the insights gained into the complexities of the posterior distribution can be used to guide construction of new and fast posterior approximations and help point the way for designing future clinical systems that puts clinically relevant error-bars on registration results to reduce surgical risk.The framework allows for any traditional dissimilarity and regularization energy to be included in the likelihood and prior distribution respectively using Boltzmann's distribution.We note that other authors have derived prior probability distributions on deformations from regularization energies using Gibbs distribution (e.g.Gee and Bajcsy (1999) and Papademetris et al. (2001)) which is closely related to Boltzmann's distribution.
We demonstrate the method with a Sum of Squared Differences (SSD) based likelihood, where the Boltzmann temperature models the image noise variance.The prior model over deformations is based on a linear elastic bio-mechanical model which, together with Boltzmann's distribution, implements a phenomenological approach to regularization where the Boltzmann temperature modulates the inverse stiffness of the tissue material.Higher temperatures (lower stiffness) facilitate more radical changes to the tissue configuration, while lower temperatures (higher stiffness) prevent large non-smooth deformations.Consequently, one major concern is the effect of model hyper-parameters, in our case the mechanical properties of the tissue and the variance of the image noise, on the registration results.The hyper-parameters are typically specified manually on an ad-hoc basis because it is difficult to find appropriate point-estimates for them.In the context of prostate registration (Risholm et al., 2012), we have shown that the posterior distribution is highly sensitive to the particular value of these hyper-parameters.If training data is available, e.g. in the form of homologous structures or landmarks, the hyper-parameters can be estimated.Examples of this in registration are e.g. in a classic machine learning framework for model selection through optimization of the cross-validation error (Yeo et al., 2010), or by maximumlikelihood estimation (Risholm et al., 2012).However, if no such data is available, the standard way of treating the uncertainty in hyper-parameter values in a Bayesian framework is by equipping them with uninformative hyper-prior distributions and marginalizing them out from the final registration estimates.Another advantage of marginalization of broad hyper-parameters is that the likelihood and prior can, when appropriate, become heavy tailed.Assuming they were modeled with Gaussian distributions, then the marginalized distribution can be viewed as a mixture of Gaussian distributions with varying variance, and such distributions are well equipped for handling outliers.Working with the posterior distribution under marginalized hyper-parameters is complex and computationally expensive and may necessitate fast approximations such as variational Bayes (Simpson et al., 2012).By using a system of sequential local Laplace approximations, we show how to marginalize model hyper-parameters representing image noise and tissue stiffness over broad uninformative log-normal hyper-priors (Janoos et al., 2012).
While a lot of research has gone into finding ways of reducing high-dimensional uncertainty information into simple local low-dimensional objects that can be easily visualized (Spiegelhalter et al., 2011), limited work has been reported on posterior uncertainty visualization in medical applications; including visualization of uncertainty in image guided needle insertions (Simpson et al., 2006), uncertainty in radiation dose control in radiation oncology (McCormick et al., 2004), determining uncertainty in white matter fiber orientation of diffusion tensors (Jones, 2003) and visualization of the local registration variance characterized by variational Bayes (Simpson et al., 2011a) or Gibbs sampling (Gee et al., 1995a).Gee and Bajcsy (1999) also showed how to derive Bayesian credible intervals about deformed point estimates from the posterior variance.While the main focus of this paper is non-parametric estimation of the posterior distribution over deformations, we also show techniques to extract simple and robust low-dimensional posterior summaries, such as the joint mode and the registration uncertainty, and show the importance of such summaries in the realm of neurosurgery.

Methods
Consider two images m : Ω m → ℝ and f : Ω f → ℝ, where we want to align the moving image m(x), x ∈ Ω m , with a fixed image f (x), x ∈ Ω f , where Ω f , Ω m ⊂ ℝ d are the domains of the d-dimensional fixed and moving images respectively.The usual aim of non-rigid registration (Crum et al., 2004) is to find a high-dimensional displacement field u(x) : x ∈ Ω f ↦ u(x) + x ∈ Ω m that minimizes an energy function composed of an image dissimilarity measure E s (u; f, m) and a regularization term E r (u) on the displacements.Although this framework does not depend on the specific form of the dissimilarity energy, regularization energy or transformation model, for clarity and computational tractability, we restrict discussion in this paper to the Sum of Squared Differences (SSD) dissimilarity measure, a linear elastic regularization model and a Finite Element (FE) based transformation model.FE-modeling (Zienkiewicz and Taylor, 2000;Bro-nielsen, 1998) is a popular method for solving bio-mechanical problems and has shown great potential in modeling tissue deformations (Miller et al., 2010).Here, the non-rigid transformation is effected through deformations of the FE mesh, which implements the bio-mechanical model based on tissue stiffness.Furthermore, it provides machinery to rapidly compute the energy functionals E s and E r as well as interpolating the displacement u(x) at inter-vertex positions.The FE method is described further in Appendix A.
We start by describing the Bayesian formulation of the registration problem in Section 2.1.This is followed by a description of the estimation of the posterior distribution by way of MCMC in Section 2.2, and the hyper-parameter marginalization using local Laplace approximations is described in Section 2.2.3.Finally, in Section 2.3, an algorithm for efficiently finding the mode of the posterior deformation samples is presented.

Bayesian Non-Rigid Registration Framework
In our Bayesian setting, depicted by the hierarchical graphical model in Fig. 1, the fixed and moving images, f and m respectively, are treated as random variables and are assumed to be generated by a model m = f ∘ u + ε that introduces both noise ε and displacement u in the image generation process, where displacement is also treated as a random variable with an associated probability measure.The likelihood, p(m | u, τ s , f), expresses the probability of observing the moving image, given a deformation u, under the noise model parameterized by τ s .The prior p(u | τ r ) contributes additional information to regularize the ill-posed maximum likelihood problem where τ r controls the spatial regularization.With accurate point-estimates for the hyper-parameters ≜ {τ r , τ s }, they can be modeled with delta distributions, or alternatively with maximum entropy distributions.In this paper we model them with log-normal distributions p(τ r | λ r , μ r ) and p(τ s | λ s , μ s ), where Γ ≜ {μ r , λ r , μ s , λ s } are the log-normal hyper-prior parameters.
As can be seen from the graphical illustration of the model in Fig. 1, the joint distribution can be factorized as follows (omitting the hyper-prior parameters Γ): (1) By Bayes' theorem, the posterior distribution on deformation and hyperparameters ≜ {τ r , τ s } can be written as: (2) Assuming the dissimilarity and regularization energies (E s and E r respectively) as the sufficient statistics of the posterior p(u | , m, f), then the Boltzmann's distribution obtained from these energies has maximum entropy that satisfies this sufficiency property.Modeled with Boltzmann's distribution, the likelihood with a SSD energy functional is now equivalent to a voxel-based i.i.d.Gaussian noise model with zero mean and variance τ s .As a precursor to registration, images are often smoothed, which introduces local correlation in the noise and the i.i.d.assumption is theoretically violated.One approach to model such noise correlation is to decimate the data, or through the inclusion of a virtual decimation factor in the likelihood as proposed by Groves et al. (2011).This approach is especially important if the main goal is to infer correct values for hyper-parameters, e.g.noise and regularization/stiffness parameters (Simpson et al., 2012).However, as we will discuss in Section 4, the (in)validity of the i.i.d.assumption is not a problem in practice.
The linear elastic energy discretized by the FE-method in Appendix A takes the form of Eqn.(A.3) which is quadratic in the displacement, and the prior distribution takes a form similar to a multi-variate normal with the stiffness matrix serving the role of a sparsely banded (and singular) precision matrix.It should be noted that regularizers typically used in non-rigid registration are designed so as to not penalize rigid transformations such as rotations and/or translations.This implies invariance of the regularization energy under all deformations equivalent up to a rigid transform.Therefore, as the regularization term has a non-negligible kernel, the prior obtained in this manner is improper, i.e. does not have finite measure on the domain ℝ D , where D is the dimensionality of u.In the elastic energy model of the FE-method used here, which is only translation, and not rotation invariant, the stiffness matrix K (see Eqn. (A.3)) is positive semi-definite and yields an improper prior on ℝ D .However, in this case, when combined with the likelihood term and considering boundary constraints on the feasible values of the deformation (yielding a compact domain), this technicality does not pose any problems.
With fixed values of the temperature hyper-parameters (HPs) , the posterior distribution on the displacement field takes the following form: (3) and Dom(u) denotes the domain of the transformation u.The computation of the posterior for a given pair of images is complicated not only by the need to estimate the partition function Z( ), but also by the sensitivity of the shape of the posterior distribution on the specific value of the temperatures, viz.τ s which controls the image noise model and τ r which determines the tissue stiffness.While the ratio of these HPs affects the location of the mode, which is important in optimization-based (i.e.maximum a posteriori (MAP)) registration, the absolute value of each HP affects the spread and skew of the posterior distribution and also, in case of multiple modes, the relative weighting of the modes.The Bayesian approach encourages marginalizing out these HPs as nuisance variables under an appropriate hyper-prior.Assuming a finite range in the orders of magnitude for τ r and τ s , i.e.

and
, the temperatures are equipped with maximum entropy prior distributions: and .Then the complete probability model is: (4) The posterior distribution with hyper-parameters marginalized out is: (5) The marginalization in combination with the likelihood defined as a function of arbitrary and possibly non-smooth images, obviates any meaningful parametric approximation of this distribution and necessitates non-parametric analysis or approximations.6).However, when also sampling over temperatures , it becomes necessary to estimate the partition function ratio Z( )/Z( ) in the MH criterion.Partition ratios are often estimated with an importance sampler (Bishop, 2006), and in our case that would entail generating representative importance samples of u that cover the parameter space by running the MH sampler at high temperature values.The drawback of this method is that it requires a very long MCMC chain to generate importance samples to ensure low bias and variance in the estimates of the partition function.In Section 2.2.3, we describe an alternative approach where deformations are sampled directly from the posterior distribution with hyper-parameters marginalized out using Laplace approximations.

Proposal Distribution-
The acceptance and convergence rate of the MH sampler is dependent on the size and shape of the proposal distribution (Gelman et al., 2003).It is clear that the most effective proposal distribution would be the posterior distribution itself.However, with limited information available regarding the posterior, the typical choice of proposal distribution is a normal distribution because it can be efficiently sampled from, and because of its symmetry, the proposal distributions cancel out in the MH-criterion in Eqn.(6).
Displacement proposals u′ are sampled from a multi-variate normal centered on the previous displacement u n : (7) If little was known about the posterior, a diagonal proposal covariance would be most suitable, but convergence would be slow.However, since the transformation of neighboring FE-vertices is correlated, enforcing a neighborhood structure on the correlation matrix Σ of the proposal distribution improves performance.The proposal correlation is set to Σ ij = exp(− (v i , v j )/ρ), where is the distance between FE-vertices i and j and ρ is a constant.If a proposal sample causes folding of the deformation field, the sample is discarded and a new sample is drawn.Performance of MH sampling is characterized by the proportion of jumps that are accepted.Suppose that the proposal kernel has the same shape as the posterior distribution, then it is known that the optimal jumping rule has acceptance rate around 23% in high dimensions (Gelman et al., 2003).In our simulations, the acceptance rate is tuned automatically to be approximately 23% by adjusting the proposal variance .

Hyper-Parameter
Marginalization-Although the hyper-prior is log-normal, the shape of the posterior distribution on temperatures is unknown.Consequently, if we were to sample temperatures, a standard choice for proposal distribution would be a univariate normal, or possibly a log-normal, centered on the previously accepted temperature.However, instead of sampling from the joint posterior on displacement and temperatures, which requires complicated tuning of three proposal scale parameters ( and two variances for the temperature proposal distributions) as well as partition ratio estimates via MCMC integration, we use the approach presented in Janoos et al. (2012) where hyper-parameters (HP) are marginalized using local Laplace approximations of the hyper-parameter posteriors.The problem of sampling from the posterior distribution of the registration parameters with HPs marginalized out is formulated in the variational free-energy framework, where the variational density turns out to be equivalent to the Laplace approximation of the HP posterior.Estimating the mode of the HP posterior requires derivatives of the partition function of the deformation parameter posterior, which are in turn quickly estimated using a local Laplace approximation.By applying Laplace's approximation locally during the intermediate steps of marginalizing out the HPs, the effect of this inaccuracy on the posterior density of the deformation parameter is restricted (Friston et al., 2007).A detailed description of the marginalization procedure is included in Appendix B. The MCMC scheme for sampling deformation parameters u from the posterior distribution in Eqn.(4) with Laplace marginalization of hyper-parameters is summarized in Algorithm 1.

Posterior Estimates
With a collection of samples {u s } s=1,…,S characterizing the posterior distribution in Eqn. ( 5), statistics such as the MAP estimate of the deformation, its uncertainty and the uncertainty in the deformations at individual FE-vertices can be extracted, as explained next.

2.3.1.
Posterior Modes-If the deformations are sampled from a distribution with fixed hyper-parameters, the joint mode on deformations can easily be found by selecting the sample with the highest un-normalized log posterior probability according to Eqn. (3).However, when marginalizing hyper-parameters the value of the partition function is required to calculate individual posterior log probabilities.A more tractable approach in this case is to use the density defined by the samples along with a non-parametric mode finding method such as mean-shift (Cheng, 1995).
Because the high-dimensional probability space is only sparsely sampled, the kernel density estimation for mean-shift, when directly applied on the joint distribution of deformations, is poorly conditioned.However, the Markov structure of the FE-mesh allows a factorization of the joint distribution on deformations into a product of low-dimensional conditional distributions on which the mean-shift procedure is well posed.Consider the 2-D example in Fig. 2 for illustrative purposes.Furthermore, assume a set of samples {u s } s=1,…,S are given, where are the deformations of a triangular FE-mesh with four 2-D vertices and two triangular elements = {a, b, c} and = {b, d, c}.The joint posterior distribution on deformations can then be factorized as: (8) For high-dimensional joint distributions, the dimensionality of the factorized distributions is dependent on the mesh topology and in the worst case equals the number of vertices in the Markov blanket of the vertex with the highest valency.In general, the vertices are ordered according to valency and the vertex with the lowest valency is factorized first.Defining the vector u bc ≜ [0, u b , u c , 0] and similarly for u abc and u bcd , then the mean-shift procedure on the joint log-posterior in Eqn. ( 8) factorizes according to δ(u) = δ(u abc )+ δ(u bcd ) − δ(u bc ) where is the mean-shift operator and K is a radial basis function (typically a Gaussian kernel).The mean-shift estimation sets u ← δ(u) and continues until convergence.

Marginal Posterior Distributions-
The marginal posterior distribution of a deformation component u i is defined by (where Dom(u /i ) denotes the domain of all components except the i-th): (9) i.e. the probability distribution of the movement of one FE vertex along one axis in the deformation field.In a sampling framework, the posterior samples of this particular deformation component characterizes the marginal distribution.Posterior marginal modes can easily be found with the mean shift method on the samples from the specific deformation component.The deformation estimates should fall within a certain confidence interval which can be robustly conveyed by e.g. the Inter-Quartile Range (IQR).As the IQR and displacements are only defined at FE vertices, we use linear FE-based interpolation to determine them at inter-vertex locations.

Results
The proposed method for characterizing the posterior distribution over deformations was evaluated on both synthetic and clinical image data.In Section 3.1 the method is evaluated on a synthetic dataset with a non-convex likelihood and known ground-truth deformation, while Section 3.2 demonstrates the feasibility of registering clinical Magnetic Resonance (MR) images acquired during neurosurgery for tumor resection.All clinical data was used with informed consent under an institutional review board approved protocol.
To evaluate the MAP mode estimated with the mean-shift method described in Section 2.3.1, the posterior mode of Eqn.(2) was also estimated with a global non-linear optimization procedure implemented in C++ and NLopt (Johnson, 2012).The global stochastic optimizer, a Multi-Level Single-Linkage (MLSL) algorithm (Rinnooy Kan and Timmer, 1987), was coupled with a local gradient-based Method of Moving Asymptotes (MMA) (Svanberg, 2002), and non-linear constraints to prevent flipping of tetrahedra (positive deformation Jacobian over the element) were included through an augmented Lagrangian method which adds a penalty to the objective function for any violated constraints.After characterizing the posterior distribution with Algorithm 1 on a given dataset, the deterministic optimizer was run with the temperatures fixed to the mean of the temperatures marginalized through Laplace approximations.
All algorithms were implemented in C++ with extensive use of OpenMP ® for parallelization whenever feasible, and the experiments were run on Linux machines with 2 × 6-core Intel Xeon ™ 2.67GHz CPUs, 12GB DDR-3 RAM.
For all experiments, the material parameters of the linear elastic regularizer were initialized with a Young's modulus of E = 1.0 and Poisson's ratio of ν = 0.49.Because the elastic model is used as a regularizer, as opposed to a pure mechanical description of the deformation determined from boundary conditions and external forces, the Young's modulus is unitless and only relative stiffness can be prescribed.Every drawn proposal sample that caused folding of the FE-mesh was rejected.

Synthetic Experiment
This section evaluates and provides insights into the complex posterior distribution using the synthetic dataset of Fig. 3 which simulates a typical deformation seen during neurosurgical procedures.A synthetic deformation field was created with a thin-plate spline deformation model (Bookstein, 1989) simulating brain shift with an approximately 12mm collapse of the brain surface towards the mid-brain.To make the problem even more challenging, a sinusoidal component was added to the deformation in the xand ydirections with magnitude 6mm and 4mm and wavelength w/2 and w/4 respectively, where w is the image width.A clinical T1 MR-image of size 256×256×88 and 1.0×1.0×2.0mm 3 voxel size acquired prior to neurosurgery was deformed by the simulated deformation field to obtain a synthetic proxy for an intra-operative image.Image intensities were normalized to the range [0, 1] and Gaussian noise with mean μ = 0 and variance σ 2 = 0.06 was added to the intraoperative image.
The reference image was skull stripped using Brain Extraction Tool (BET) (Smith, 2002) to create an approximate label-map of the brain tissue, which was then used to build the FEmesh with V = 317 vertices and E = 1177 tetrahedra using the Computational Geometry Algorithms Library (CGAL) (CGAL, 2008).The minimum and maximum element volumes were 340mm 3 and 2430mm 3 respectively, while the total mesh volume was 1530cm 3 .From the simulated deformation field, the ground truth deformation of the vertices was available, with maximum absolute deformation of [11.5, 10.9, 3.4]mm and mean absolute deformation of [4.0, 2.6, 2.0]mm.Broad log-normal hyper-priors model the temperature parameters (λ r = 1/100, μ r = ln(100), λ s = 1/200, μ s = ln(0.01)),and the proposal covariance matrix was constructed with ρ = 100.
The scale of the proposal distribution was tuned to achieve an acceptance rate of approximately 23% as described in Section 2.2.2.

Convergence-
The MCMC sampler was evaluated with the Geweke time-series criterion (Geweke, 1991) to test the intra-chain convergence using long MCMC chains from instances of the sampler initialized at dispersed starting points.Inter-chain convergence was also evaluated with the potential scale reduction criterion (Gelman et al., 2003).The Geweke time-series approach (Geweke, 1991) applies a two-sided z test for convergence by comparing the mean of the first half of the chain with the mean of a certain number of sub-chains from the second half of the chain.The scale reduction criterion uses a one-sided variance ratio test to check whether a set of parallel chains have converged to the same target distribution, where a large potential scale reduction indicates a lack of convergence.
We used three parallel samplers initialized within approximately 4mm in all dimensions around the ground truth and 5•10 6 samples were generated in each chain.After discarding the initial 20% of the samples for chain burn-in followed by thinning with a factor of 20, 2 • 10 5 samples were retained.At a sampling rate of approximately 30 samples per second, the total running time for a single chain was approximately 46 hours.Figure 4 shows intra-chain convergence evaluated with the Geweke criterion for each chain.The chains are seen to have converged with the majority of the z-scores within two standard errors of the mean.
However, with low Markov chain mixing time (Gelman et al., 2003), the chains may be stuck in different regions of the probability space.The inter-chain convergence through the potential scale reduction measures the amount to which different instances of the sampler have converged to the same equilibrium distribution, which is confirmed by the results in Fig. 5. Following convergence evaluation, the samples were then pooled together for a total of 6 • 10 5 samples, which were used in the next set of results.6(a)-6(b) plots the distribution of the marginalized temperatures under the log-normal hyper-priors.

Posterior Estimates-Figures
The mode of the joint distribution was estimated with the mean-shift method described in Section 2.3.1, and the results compared with the ground truth and the results from the MAP estimate of the deterministic optimizer (with temperatures fixed to the mean of the marginalized temperatures, i.e. τ r = 1174 and τ s = 0.19).From Fig. 7, it can be seen that the estimated posterior mode is highly concordant with that found by the optimizer, and has a maximum and mean absolute error in each of the three dimensions with respect to the ground truth deformation of [2.4,1.9, 1.6]mm and [0.3, 0.2, 0.3]mm respectively.Table 1 summarizes the registration results.The marginal distribution of the component with the maximal error (2.4mm) is shown in Fig. 8(a) with the joint mode and true displacement overlaid.The reason for the misalignment may be due to the regularizer, which introduces a bias in the registration changing the location of the MAP estimate away from the ground truth.We also found that at the mode, 80 out of 951 deformation components fell outside the ± IQR/2 range of the ground truth values as shown in Fig. 7(c), while 2/951 fell outside the 95% interval.Hence, in this case the ground truth deformation falls inside the distribution prescribed by the IQR with (951 -80)/951 ≈ 92% certainty.The marginal distribution for the component with the highest IQR (5.0mm) is shown in Fig. 8(b) while the distribution of IQRs is shown in Fig. 8(c).

Hyper-parameter
Sensitivity-To explore the sensitivity of the posterior distribution to the hyper-parameters, we ran the MCMC sampler with a set of fixed HPs and compared the resulting posterior modes with the ground truth and the posterior summaries from the posterior with marginalized HPs.We selected four sets of fixed HPs based on overdispersed values from the distribution of marginalized temperatures shown in Fig. 6.The specific temperature values and the resulting deviation from the ground truth is summarized in Table 2.It is clear that modeling HPs with point-estimates when having little information regarding the correct value of these HPs can negatively affect the posterior distribution.In our case, the posterior mode has a maximum error of 8.56mm when using dispersed pointestimate HPs compared to 2.41mm when marginalizing HPs (see Table 1).

Normality of Posterior Distribution over
Deformations-To investigate the deviation from normality of the marginal distributions, Fig. 9 plots the marginal distribution of the components with the highest/lowest kurtosis and skewness estimates.A sufficient condition to show non-normality of the posterior distribution is to show that one of the posterior marginal distributions deviates from normality.The plots indicate that the marginal distributions have a bias for leptokurtic distributions, i.e. a distribution with a heavy tail.Mardia's test (Mardia, 1970), which uses skewness and kurtosis statistics to test for multinormality, further confirms that the posterior distribution does not pass the test for multivariate skewness nor kurtosis under a 5% significance level.

Clinical Experiments
The clinical experiments were carried out on a neurosurgical MR-dataset acquired on a Siemens Magnetom Verio 3T MR machine in the Advanced Multi-modal Imaging and Guidance Operating (AMIGO) suite at Brigham and Women's hospital.A pre-operative T2 Fast Spin Echo MR image of size 512×512×100 and resolution 0.5×0.5×1.5mm 3 was acquired prior to surgery.After resection, an intra-operative T2 BLADE MR image of size 320×320×33 with resolution 0.6×0.6×5.0mm 3 was acquired.Because of brain-shift and the missing tissue due to the resection, large tissue movements are observed around the resection site.Corresponding slices of the dataset can be seen in Fig. 10.
BET was used to skull-strip the pre-operative image, and the corresponding label-map defining the brain-tissue was used to construct a FE-mesh (V = 858 vertices and E = 3791 tetrahedra) which discretizes the image domain covering the brain.The minimum and maximum element volumes were 140mm 3 and 660mm 3 respectively while the total element volume was 1554cm 3 .
Both the pre-and intra-operative images went through a number of preprocessing stages: 1) the image intensities were normalized through histogram matching, 2) normalization of the intensities to the interval [0, 1], 3) subsampling the pre-operative image by a factor of 2 in all dimensions to reduce the data size, and 4) Gaussian smoothing with variance 2mm 2 .
To assess MCMC chain convergence, three MCMC parallel chains were generated from dispersed starting points (±4mm in each dimension) around u = 0.Each chain used identical parameters with the log-normal hyper-prior parameters set to λ r = 1/20, μ r = ln(100), λ s = 1/200, μ s = ln(0.01).This represents nearly identical hyper-prior parameters as used in the synthetic experiments, with the exception of an increased precision (λ r ) of the regularization hyper-prior to incorporate the a-priori knowledge that deformations observed in the clinical case require more regularization due to missing correspondences due to resection and manipulation of tissue during the procedure.The proposal covariance was constructed with ρ = 100, and the proposal scale was tuned to to ensure an acceptance rate of approximately 23%.A total number of 5 • 10 6 samples were generated for each chain, the first 1 • 10 6 samples were thrown away as burn-in samples and the rest of the samples were thinned out with a factor of 20.With a sampling rate of 27 samples per second, the total simulation time was approximately 50 hours.The chains were found to have converged because 99.5% of the Geweke z-scores were within 2σ and the maximum scale reduction was 1.1.The remaining samples from the three MCMC chains were pooled together for a total of 6 • 10 5 samples that were used in the following set of results.9(f) shows a scatter plot of skewness vs. kurtosis for the marginal distributions which are biased toward leptokurtic distributions.The minimum and maximum skewness was found to be −0.75 and 1.95, while the kurtosis was in the range [−0.50, 6.64] with a median of 0.11.It is evident that the posterior distribution is not normally distributed which is confirmed by Mardia's test for multivariate normality (Mardia, 1970) which rejects the null-hypothesis that the posterior distribution is multivariate normal under a 5% significance level, both with regards to kurtosis as well as skewness.The non-normality of the posterior distribution is further emphasized by Fig. 11 where we show marginal distributions with multiple modes.

Posterior Summaries-Figure
Table 3 summarizes the results in terms of the distance between the MAP estimate of the deterministic optimizer, the joint mode of posterior deformation samples found with mean shift, as well as the marginal posterior deformation modes found with mean shift.The results indicate that the marginal modes have a maximum error of 1.85mm from the joint mode, while the joint and optimizer modes match well with a maximum absolute distance of 0.45mm.The discrepancy between the marginal and joint mode is natural for highdimensional non-Gaussian distributions.Risholm et al. (2010) we introduced posterior predictive summaries for portraying the location of deformed DTI and fMRI activated areas.Given S posterior deformation samples, a marginal posterior distribution on whether a voxel in the intra-operative domain is inside or outside a functional activated area is constructed by deforming the fMRI activated area with each of the S deformation samples and counting how many times a voxel fell inside the deformed area.Similarly, fiber tracts extracted from DTI are deformed with each of the S deformation samples and a voxel visitation count volume is created by incrementing each voxel a deformed fiber crosses.To highlight the clinical importance of characterizing deformation uncertainty in neurosurgery, we compare a posterior predictive distribution of a motion activated center as found by fMRI with residual tumor as delineated by a neurosurgeon in Fig. 13.Notice that the posterior predictive distribution may provide important information which can make it easier for the surgeon to minimize the surgical risk involved with resecting more of the tumor.

Clinically Relevant Posterior Summaries-In
In Fig. 11(a) we visualize the expected deformed pre-operative image with regards to the posterior distribution.With a Gaussian posterior distribution, this posterior expected deformed image could be constructed by first deforming the pre-operative image with the MAP estimate followed by Gaussian smoothing with spatial variance according to the registration uncertainty.However, since the posterior distribution is clearly not Gaussian, a more accurate posterior expected image can be estimated according to which takes into account the skewness and leptokurticity of the posterior distribution.In such a posterior expected image, areas of high uncertainty will be relatively blurry compared to areas of high certainty.The posterior expected image shows the resection area as more blurry compared to areas away from the resection.In the same figure (Fig. 11) we show the corresponding spatial uncertainty in terms of the IQR, and it can be seen that the uncertainty around the resection is considerably higher than in other areas.To investigate this further we plot the marginal distributions for the x-, yand zcomponents of the FE vertex that is closest to the resection cavity and find that the marginal distributions have a high spread (IQR) and are multi-modal.
In Fig. 12 we visualize the directional uncertainty in terms of the IQRs along the x-, yand z-directions.The IQRs indicate that the uncertainty is highest at the resection site, but also high along the boundary of the mesh/brain which is caused by the reduced valence of the boundary vertices.One approach to reduce the registration uncertainty along the boundary of the mesh is to include suitable boundary conditions in the registration, e.g. to add an additional prior distribution which prevents boundary vertices from moving into the skull extracted from the intra-operative image.In our framework, this can be implemented by discarding proposal samples which move a boundary vertex into the skull.

Discussion
It has been our goal in this investigation to characterize the posterior distribution in some intra-subject non-rigid registration problems.Our longer range goal is to provide uncertainty information to clinical end-users, because we feel that it should be factored into decision making or treatment optimization.For example, communicating the distributional information in Fig. 13(c) gives the hypothetical end-user the option to discount the utility of the registration information if the uncertainty in the result is too large; this opportunity is not available if the MAP result, i.e.Fig. 13(b), is the only information about the registration result that is provided.
A Bayesian non-rigid registration framework, that characterizes the posterior distribution on deformations by MCMC, was presented.The likelihood and prior distribution on deformations is constructed by converting dissimilarity and regularization energies into probability density functions by way of Boltzmann's distribution.Proper treatment of hyperparameters in registration, e.g.noise precision and the tissue parameters/regularization weight, is an important but often neglected problem.Non-rigid registration is a highly under-determined problem and therefore generally needs a prior (regularizer) to reduce the solution space.The prior introduces a bias, but with a large reduction in uncertainty.Decreasing the importance of the prior, e.g. by setting the prior temperature to a large value, effectively makes the regularization negligible which increases the posterior uncertainty.Even though registration results are sensitive to these parameters, they are in current practice often set on an ad-hoc basis.One approach to determine hyper-parameters that are highly informative for a specific clinical application is to learn them from a collection of representative clinical datasets where homologous landmarks or structures are identified (Risholm et al., 2012).However, in cases where such data is not available, and little information is available regarding the optimal value of these parameters, the recommended Bayesian approach to handle them is to equip them with uninformative hyper-priors and marginalize them out.We formulated the posterior distribution over deformations, with the log-normal hyper-parameters marginalized over, in a variational free-energy framework where the variational density is the Laplace approximation of the hyper-parameter posterior.A log-normal hyper-distribution was chosen because it provides very broad hyper-priors (with variation over orders of magnitude), reflecting the uncertainty in the exact values of the hyper-parameters.The hyper-prior ensures that while tissue-stiffness and image-noise variance parameters are allowed to vary over a very wide range of values, they are kept physically sensible by disallowing fantastically huge or small values.However, ideally, they should be centered around some sensible values derived for instance through hyperparameter optimization using homologous landmarks/structures.The posterior estimation of the prior temperature can be seen as an estimate of the homogeneous inverse stiffness of the underlying tissue.In the future we plan to extend the current methodology to marginalize/ estimate compartment-or element-wise tissue stiffness.
Our experiments have shown that marginal posterior distributions on deformations may be multi-modal -which does not imply that the complete posterior distribution is necessarily multimodal.However, because we have shown that the marginal distributions are non-Gaussian, and possibly multi-modal, we can claim that the posterior distribution is non-Gaussian.While it may be surprising to some that the marginal distributions can have multiple modes, our experience in image registration has been that registration systems can become trapped in local extrema in the face of truncation or resection.Likelihood functions are complicated and highly non-convex functions of the transformation parameters; when there is structural disagreement among images, the likelihood functions frequently exhibit multiple modes that reflect spurious matches of image structures.Because of this, multiple modes in posterior marginals, or even the complete posterior distribution, are a possibility.Furthermore, increasing the degrees of freedom of the warp, which is similar to reducing the smoothing in the objective function, seems unlikely to reduce the occurrence of multiple modes in the posterior marginal; frequently it happens that increasing the degrees of freedom leads to more extrema in optimization problems.
While the proposed MCMC method is a principled approach to Bayesian inference, and provides a numerical approximation to the exact posterior distribution, it does come with high computational cost which is not unreasonable in an exploratory investigation.Although the computational cost may currently be too high for use in neurosurgery, it is straightforward to apply the method to new applications; e.g. to monitor delivered dose in adaptive radiotherapy where each fraction is delivered with approximately 24 hour intervals.For other applications, the proposed methodology can provide valuable insights into complex posterior distributions to help guide the design of faster methods that find simple robust and accurate approximations to the complex posterior distribution.One example is the variational Bayes approach taken by Simpson et al. (2012) which provides a fast exact analytical solution to a parametric approximation to the posterior distribution that facilitates high-dimensional registration.However, the effect of the approximation on the estimated posterior statistics is unknown and may be significant because of the possibly multi-modal and non-Gaussian posterior distribution.Parametric approximations of the posterior should therefore be compared against the asymptotically correct posterior distribution that can be generated by MCMC.
Another MCMC approach to registration, the Gibbs sampler proposed by (Gee et al., 1995b,a), requires approximations of the likelihood function.By using MH, such likelihood approximations are not required and the proposed method is consequently more general in terms of the dissimilarity energies that can readily be used.On the other hand, Gibbs sampling should in theory converge quicker on the posterior distribution because every sample is accepted.The proposed marginalization over hyper-parameters using local Laplace approximations can readily be incorporated with the Gibbs sampler proposed by Gee et al. (1995b,a).
The sampling approach we have used here is computationally intensive.Consequently, for the experiments in this article, we have used meshes with approximately 1000-2500 degrees of freedom, which corresponds to a vertex spacing of approximately 15mm in the clinical case; this is comparable to control point spacing typically used in intra-subject registration (Rueckert et al., 1999).Our vertex spacing is larger than the 5mm B-spline spacing that is commonly used in inter-subject registration (Klein et al., 2009), and it is possible that with higher resolution meshes and data, the estimated posterior uncertainty would decrease.With higher degrees of freedom comes not only higher computational cost, but also higher storage requirements because of the need to store enough samples to characterize the posterior.It is common practice to thin the MH sequence by a factor of at least ten to reduce intra-chain correlation and to discard the first half of the sample chain to remove burn-in effects.Consequently, if 10 7 samples are required to characterize a posterior distribution with 100k degrees of freedom, the storage requirements will be approximately 200Gb after thinning and burn-in.This does not pose a significant problem in practice, especially if advanced storage formats such as HDF5 (The HDF Group, 2000-2010), which can sequentially compress large files, are used.
Because of the high computational complexity of MCMC, some simplifying model assumptions were made to achieve tolerable running times.First, the likelihood assumes that the intensity difference between registered voxels differ by i.i.d.Gaussian noise.Theoretically, this is not a valid assumption, especially when Gaussian blurring is applied, but has been shown to work well in practice.In our work, where light blurring was used, we have expanded on the Gaussian noise model by marginalizing over the noise precision under a log-normal hyper-prior, which effectively can convert the Gaussian noise distribution into a distribution with heavy tails which is better suited for handling outliers, e.g.areas of resection.A well founded approach for handling the noise correlation caused by smoothing of the images is to include a constant Virtual Decimation (VD) factor in the likelihood term (Simpson et al., 2012;Groves et al., 2011), which is similar to decimating the data.This is especially important if the main goal is to infer correct noise and regularization (stiffness) parameters.However, in our case, where the main goal is to characterize the posterior distribution on deformations, the constant VD factor will be absorbed into the marginalized noise temperature and will not have an affect in practice, aside from changing the scale of the temperature parameter.Although the SSD-based likelihood does not currently accommodate for functional or statistical intensity differences, such complex relationships can easily be included in the model, though likely with some computational cost.
Secondly, the FE-based transformation model was a natural choice given the use of the phenomenologically inspired linear elastic prior, but other popular transformation models such as B-splines (Rueckert et al., 1999) can also be incorporated.
Thirdly, linear elasticity has been widely used as a prior in non-rigid registration, but more complex prior models such as visco-and poro-elasticity (Kyriacou et al., 2002) may be more appropriate.Choosing a suitable prior that is a good model for the underlying tissue under deformation may improve the registration results and reduce registration uncertainty.If a segmentation of the underlying tissue is available, and relative stiffness between tissue types is known, a mechanical model that reflects this can easily be constructed and spatially varying (pseudo-)stiffness parameters can easily be encoded in the model without any modifications to the proposed posterior sampling scheme.It is also possible to use a random variable for each tissue type, and equip it with a hyper-prior and marginalize over it, but this requires changing the Laplace approximation accordingly, or using MH of the per tissue type hyper-parameters.
We have used common modeling assumptions, and some approximations, and our goal is to understand the implications of those assumptions on the uncertainty in the posterior distribution (or in deterministic terms, the landscape of the objective function).In the future, it is likely that developments in likelihood modeling, prior modeling, FEM technology, inference algorithms, and application specific development and engineering will lead to faster methods that could have tighter posterior distributions and associated marginals than those that are illustrated in Fig. 13(c).Our experiments indicate that, depending on the application, it may be appropriate to consider the implications of non-Gaussian, and potentially multi-modal, posterior distributions in the design of the system.Approaches that should be investigated to speed up chain convergence, reduce inter-sample correlations and the random walk behavior of MH include Hybrid Monte Carlo (Duane et al., 1987), which uses gradients of the log posterior distribution to take larger proposal steps while keeping the rejection rate low, and adaptive MCMC where the proposal distribution is dynamically fitted to match the posterior based on the accepted samples (Andrieu and Thoms, 2008).One approach for learning the best compromise between likelihood model, prior model, mesh size, accuracy and uncertainty is through model selection using e.g. the Deviance Information Criterion (DIC) which is a suitable model selection criterion when the posterior is obtained by MCMC (Spiegelhalter et al., 2002).
In future work we will adapt the framework to registration of MRI and ultrasound where we suspect the posterior distribution will be multi-modal because of the high noise level and lack of contrast in ultrasound images.We also suspect that exploiting the high-dimensional information available in the posterior distribution on deformations when doing high-level inferences based on the registered data will open up new and exciting research opportunities.Simpson et al. (2011a) showed that utilizing the registration uncertainty in classification of Alzheimer patients improved their classification results.In adaptive radiotherapy, monitoring the radiation dose delivered to critical tissue is important due to changes in anatomy over the treatment period.In Risholm et al. (2011) we showed that the uncertainty in registration can induce significant uncertainty in the estimated cumulative dose delivered to critical tissue during head and neck radiotherapy.The estimated delivered dose and the associated uncertainty was visualized in terms of dose volume histograms.Finding lowdimensional representations of the posterior distribution that are suitable for visualization is a difficult task and an active research topic (Spiegelhalter et al., 2011).In this paper we used simple posterior summaries to convey uncertainty.However, one can envision that more complex approaches to interacting and extracting clinical relevant summaries from the posterior distribution can be designed in the future -for instance by exploiting the local correlations in the posterior.

Conclusion
We presented a Bayesian non-rigid registration framework where the posterior distribution is characterized through MCMC sampling.In contrast to conventional registration approaches that only report posterior modes, the posterior distribution, represented by a set of MCMC samples, contains other valuable information such as the registration uncertainty; this is important information when higher level inferences are drawn based on the data being registered.energy of a deformation, which we use as a regularizing term to penalize physically implausible solutions, takes the form E el (u) = ∫ x∈Ω f u(x) ⊤ K x u(x)dx, where K x is the stiffness matrix that is directly proportional to the Young's modulus.Under the linear element model, this is equivalent to (A.3) where K i,j is non-zero only if vertices v i and v j share a tetrahedral element.
Consequently, both the SSD and elastic energy contributions of one vertex v i under displacement u i depends only on the displacements of vertices v j that share tetrahedral elements with v i .Therefore, the energy term exhibits Markov structure and the Markov blanket for vertex v i is = ∪ e \v i such that v i ∈ .
Even though linear elasticity puts a penalty on non-smooth deformations, it does not prevent unwanted folding of tissue/elements.Folding of elements can be prevented by restricting the determinant of the Jacobian over the deformed element to be positive |J(u e (Ω e )| > 0.

Appendix B. Marginalization through Laplace Approximations
To simplify the following derivations, we define θ r ≜ ln τ r and θ s ≜ ln τ s , so that the lognormal priors over imply normal priors over Θ ≜ {θ r , θ s }.Introducing a variational density q(Θ) over the HPs Θ, the marginal distribution of the deformations u in Eqn. ( 4) can be decomposed into a free-energy and KL-divergence term as: Typically in the applications of this method the value of e θ s −θ r ≪ 1, therefore using the fact that for small Θ it is the case that |I + ΘA| = 1 + τ r (A)Θ + (Θ 2 ), the approximation: To avoid the expensive inversion of the 3N × 3N matrix D at each new value of deformation u * , as required in the evaluation of ξ, the term τ r (D −1 K) is directly estimated from K − D using Gauss quadrature and the modified Chebyshev algorithm along with stochastic estimates of the modified moments (Meurant, 2009).In this calculation, pseudo-inverse K − (which can be pre-computed) is used since the translation invariance property of K renders det K = 0.The expected value of ln Z(Θ) under the variational distribution q(Θ) = (Θ * , , where [ln | 1+e θ s −θ r τ r (D −1 K)|] is evaluated numerically using Gauss-Hermite quadrature (Davis and   Rabinowitz, 2007).Even though the scale reduction criterion is violated, the marginal distribution is relatively well-characterized.Notice that the red line denotes the joint mode, which for highdimensional non-Gaussian distributions do not necessarily align with the marginal mode.In this case it is highly concordant with the ground truth, but not with the marginal mode.Histogram plots of the marginalized temperatures using Laplace approximations under the hyper-prior distributions with the method described in Section 2.  Registration error plots from the experiment on the synthetic MRI data in Fig. 3. (a) Error between the joint mode, estimated non-parametrically from the MCMC samples, and the ground-truth plotted against the ground-truth deformation value.A maximum absolute error of 2.4mm was found in the x-direction, while the mean error over all dimensions was 0.27mm.Notice that the highest errors are found for components with small deformations which may be explained by the regularizer which puts more restrictions on large deformations.(b) Histogram of the absolute distances between ground truth and the mode.(c) The distance between ground truth and the mode in relation to the spread of the 50% middle samples (IQR) and the spread of the 95% middle samples of the corresponding deformation component.The components are sorted in ascending order of IQR.(d) The marginal and joint error in the MCMC non-parametric mode (as compared to ground truth) plotted against corresponding error in optimizer estimates.It can be observed that there is a high correlation between the non-parametric joint error and optimizer error, while the errors of the marginal estimates are less correlated with regards to the optimizer and joint estimates.In Section 3.1.4,the posterior is shown to not be normally distributed which explains the fact that the marginal modes do not correspond well with the joint mode.The IQR along the z-direction (superior-inferior).Uncertainty is clearly increased around the resection.It is worth noting that the uncertainty is in general also somewhat higher along the cortical surface, i.e. the outer boundary of the FE-mesh because there are fewer constraints in the FE-mesh due to the reduction in connected elements/vertices.However, it is straightforward in our Bayesian registration framework to reduce the boundary uncertainty by including prior information, e.g. that boundary vertices should not deform into the skull defined in the intra-operative domain.The sensitivity of the posterior distribution to the HPs was explored by estimating the posterior mode for four separate sets of dispersed HPs selected from the tails of the marginalized HPs shown in Fig. 6.This table summarizes the specific values of the HPs in the first column, and the maximum error, 95th percentile error and median error, all reported in millimeters, are summarized in the following columns.The posterior mode was found to be sensitive to using point-estimates of the HPs, with a maximum error compared to the ground truth in the range of 6.43mm-8.56mmcompared to a maximum error of 2.41mm when marginalizing HPs under uninformative hyper-priors.Summary of the results (in millimeters) from the registration with the neurosurgical dataset in Fig. 10.The MAP estimate u o from the deterministic optimizer described in Section 3.1.2is compared with the posterior joint mode u j and the posterior marginal modes u m .The deformation vectors are of length 3V where V = 858 is the number of vertices in the FE-mesh.Notice that the optimizer and mean shift on the posterior samples are finding the approximately same MAP estimate (with a maximum error of 0.46mm), while the marginal estimates are off by a maximum of 1.85mm.Outline of Monte Carlo sampling algorithm with marginalization of hyper-parameters through Laplace approximations.The initial transformation u 0 can either be set to a zero deformation, or alternatively to the result of a fast deterministic registration method.The log temperatures are typically set to the mean of the log normal prior.Note that in Appendix B, where the marginalization through Laplace approximations is derived, θ r ≜ ln τ r and θ s ≜ ln τ s and Θ = {θ r , θ s } to simplify the mathematical derivation.
Estimation through MCMC Simulation 2.2.1.Metropolis-Hastings-One method to characterize the posterior distribution p(u | m, f, Γ) with HPs marginalized (see Eqn. (5)) is to perform Metropolis Hastings (MH) sampling (Gelman et al., 2003) from the joint distribution on displacement and temperatures p(u, | m, f, Γ) (see Eqn. (4)) and discard the temperature values, effectively marginalizing them out.With ω ≜ {u, }, the MH algorithm starts from an initial sample ω 0 and generates a proposal sample, ω′ at iteration n + 1 from a simple proposal distribution which only depends on the previous sample ω′ ~ π(ω | ω n ).The proposal sample ω′ is accepted (i.e.ω n+1 = ω′) or rejected (i.e.ω n+1 = ω n ) with acceptance probability: (6) Because MH-MCMC has no parametric assumptions about the posterior distribution, and does not require the computation of gradients, it is general in terms of the types of likelihood functions and prior distributions that can be used.If the posterior distribution includes nonlinear constraints, e.g.prevention of folding of the FE-mesh or movement of the mesh into image regions such as the skull, the proposal sample should be checked before computing the MH-criterion.If the constraints are violated, a new sample should be drawn.With constant values of the temperature HPs, the partition function cancels in the MH criterion in Eqn. (

Figure 6
Figure 6(c)-6(d) shows histogram plots of the marginalized temperatures under the lognormal hyper-priors using the method in Section 2.2.3.The MCMC posterior mode was compared to the MAP estimate computed with the deterministic optimizer (with temperatures fixed to the mean of the marginalized temperatures, i.e. τ s = 1.59 and τ r = 7528).
the KL divergence term is positive, F acts as a lower bound on the estimate of ln p(u | m, f) and is exact for q = p(Θ | u, m, f).Restricting the variational density q(Θ) to the family of normal distributions, the minimizer of (q||p(Θ | u, m, f)) is ( [p(Θ | u, m, f)], ar[p(Θ | u, m, f)]).However, as the mean and variance of the posterior do not have closed-form solutions, and to avoid expensive Monte Carlo integration, we use a Laplace approximation of p(Θ| u, m, f) by replacing the mean with the mode and the variance with the local curvature of the mode.Consequently, q(Θ) ~ (Θ * , Λ* −1 ), where Θ * = argmax Θ ln p(Θ | u, m, f) is the conditional mode and Λ * = −2∇∇ ⊤ ln p(Θ| u, m, f) is the negative Hessian of ln p(Θ| u, m, f) respectively, which are equivalent to the mode and negative Hessian of ln p(u, Θ| m, f).The maximum of ln p(u = u ′, Θ| m, f) with respect to Θ (see Eqn. (4)) can be computed with a Newton-like method using its analytical gradient: the term ξ ≜ τ r (D −1 K)e θ s −θ r /(1+e θ s −θ r τ r (D −1 K)), the gradient of the approximate log-partition function becomes: eqns.(B.4) and (C.5), it can be seen that ln p(Θ|u, m, f) is concave in Θ, implying the existence of a global and unique maximum.

1
The gradient and Hessian of the similarity energy with respect to u can be expressed as transformations of the moving image gradient and Hessian as follows: Med Image Anal.Author manuscript; available in PMC 2014 July 01.NIH-PA Author Manuscript NIH-PA Author Manuscript NIH-PA Author Manuscript

Figure 1 .
Figure 1.Graphical model of the Bayesian image registration framework.Using a generative form, m = f ∘ u + ε, for the registration problem where the moving image m is a realization of a deformed fixed image f plus some noise ε.The hyper-parameter τ r controls the spatial precision of the deformation field u under a log-normal hyper-prior parametrized by the mean μ r and precision λ r , and is inversely proportional to tissue stiffness in the elastic regularization model used in this article.A log-normal hyper-prior, with mean μ s and precision λ s , models the uncertainty in the parameter τ s which is proportional to the variance of noise introduced during formation of image m.

Figure 2 .
Figure 2. A simple FE-mesh which consists of vertices a, b, c and d and two triangular elements = {a, b, c} and = {b, d, c}.Assume u a , u b , u c and u d are displacements associated with the FE-vertices, then, due to the Markov structure of the FE-mesh, the high-dimensional joint distribution on deformations can be factorized into lower-dimensional marginal distributions p(u a , u b , u c , u d ) = p(u a , u b , u c ) p(u b , u c , u d )/p(u b , u c ) according to Eqn. (8).

Figure 3 .
Figure 3. Corresponding axial slices of the synthetic data used for the simulations.(a) A T1 MRimage (256×256×88) with 1.0×1.0×2.0mm 3 spacing acquired prior to neuro-surgery with intensities in the range [0, 1].(b) Using a B-spline deformation model, a brain-shift of about 12mm was simulated together with global sinusoidal components with magnitude of 6mm and 4mm in the xand y-components as explained in the text.The resulting ground-truth deformation field was applied to (a) to create the synthetic "intra-operative" image in (b).(c) A 4×4 checkerboard pattern of the images in (a) and (b).(d) A 3D view of the corresponding FE-mesh consisting of V = 317 vertices and E = 1177 tetrahedra.The blue color indicates that the element surface is on the boundary of the mesh, while red indicates that the surface is shared between two elements.

Figure 4 .
Figure 4. Geweke time-series results for the 3 MCMC chains for the synthetic data in Fig. 3.The latter half of each chain after thinning (1 • 10 5 samples) was divided into 20 subchains, and zscores comparing the first half of a chain with each of the 20 subintervals were computed for each of the 951 deformation components.Each figure shows box-plots of the 951 z-scores for the 20 sub-intervals, and we can conclude that the chains have converged because the majority of the z-scores are within two standard errors of the mean.

Figure 5 .
Figure 5. Scale reduction results from the experiment on the synthetic MRI data in Fig. 3. (a) Histogram of the potential scale reduction for the deformation components.The general guideline is that the scale reduction should be below ≈ 1.1 for each component before terminating the chains.In this case, four components exceed a scale reduction of 1.1, with the max scale reduction at 1.23.(b) Marginal distribution of the component with the largest scale reduction after pooling together the samples from the three individual MCMC chains.Even though the scale reduction criterion is violated, the marginal distribution is relatively well-characterized.Notice that the red line denotes the joint mode, which for highdimensional non-Gaussian distributions do not necessarily align with the marginal mode.In this case it is highly concordant with the ground truth, but not with the marginal mode.
2.3.(a,b) The distribution of marginalized temperatures from registering the synthetic dataset in Fig. 3.The mean and variances were τ ̂r = 1174, τ ̂s = 0.19 and .(c,d) The distribution of marginalized temperatures from registering the clinical dataset in Fig. 10.The mean and variances were τ ̂r = 7528, τ ̂s = 1.59 and .

Figure 8 .
Figure 8. Histogram plots of the deformation components with (a) the highest error and (b) highest IQR from registering the synthetic MRI data in Fig. 3.The component in (a) has an absolute error of 2.4mm while the component in (b) has an IQR of 5.0mm.(c) The distribution of the 951 IQRs (one for each deformation component).The smallest and largest IQR were 0.52mm and 5.00mm respectively.

Figure 9 .
Figure 9. Marginal distributions of the components with min/max skewness and kurtosis from the posterior distribution characterizing the displacement field which aligns the synthetic MRI dataset in Fig. 3.A normal distribution is assumed to have zero skewness and kurtosis.(a) A plot of the marginal distribution with the highest kurtosis (2.9).(b) The marginal distribution with the lowest kurtosis (−0.5).(c) Marginal distribution with the highest skewness (0.67).(d) Marginal distribution with the lowest skewness (−0.56).(e) A scatter plot of kurtosis vs skewness for the marginal distributions on displacements for the synthetic data in Fig. 3. (f) A scatter plot of the kurtosis vs skewness for the marginal distributions on displacements for the clinical data in Fig. 10.The marginal distributions have a bias towards being leptokurtic (positive kurtosis) which is shown in the distributions by an acute peak and heavier tails than a normal distribution.

Figure 10 .
Figure 10.Corresponding axial slices of the clinical MRI data used for the simulations.(a) A skull stripped pre-operative T2 FSE MR-image acquired prior to neurosurgery.The FE-mesh discretizes the image domain represented by the remaining brain tissue after the skull is stripped.(b) Deformed pre-operative image to match the intra-operative image.(c) An intraoperative T2 BLADE MR image acquired after resection.Note that skull stripping the intraoperative image can be challenging due to the resection and is not required in the proposed registration procedure.

Figure 11 .
Figure 11.Qualitative registration results for the clinical dataset in Fig. 10.All images show corresponding axial slices.(a) Mean marginal deformed pre-operative image.Given the posterior deformation samples {u s } s=1…S , we sum over all deformed pre-operative images and normalize with the number of samples .This is equivalent to the expected deformed pre-operative image with respect to the posterior distribution over deformations (u|f,m) (f ∘ u).Areas of high uncertainty are relatively blurred compared to areas of high certainty.Notice that the resection area is relatively blurry compared to other areas.(b) The axial slice of the intra-operative image corresponding to the slice shown in (a).(c) The uncertainty of the estimated deformation in terms of the IQRs (summed over the IQR for the x-, y-and z-components of the deformation field) for the same slice as visualized in (a) and (b).The units of the colorbar is in millimeters.Notice the high uncertainty around the resection where the (summed) IQR is 19.6mm.(d-f) The marginal distributions (x-, y-and z-components) for the FE-vertex closest to the resection cavity.We notice that the distributions are multi-modal, skewed and consequently non-Gaussian.

Figure 12 .
Figure 12.This figure visualizes the directional uncertainty for the estimated deformation that aligns the clinical dataset in Fig.10for an axial (row 1), sagittal (row 2) and coronal (row 3) slice.(a) Deformed (with MAP estimate) pre-operative image.(b) The IQR along the x-direction (left-right according to patient).(c) The IQR along the y-direction (anterior-posterior). (d)The IQR along the z-direction (superior-inferior).Uncertainty is clearly increased around the resection.It is worth noting that the uncertainty is in general also somewhat higher along the cortical surface, i.e. the outer boundary of the FE-mesh because there are fewer constraints in the FE-mesh due to the reduction in connected elements/vertices.However, it is straightforward in our Bayesian registration framework to reduce the boundary uncertainty by including prior information, e.g. that boundary vertices should not deform into the skull defined in the intra-operative domain.

Figure 13 .
Figure 13.Predictive posterior distribution of fMRI activated area from hand tapping in relation to intra-operative residual tumor.(a) fMRI is acquired pre-operatively and associated with a pre-operative anatomical image.Here we show the pre-operative image with the tumor, as delineated by a neurosurgeon, overlaid in green and the outline of the fMRI activated area overlaid in purple.Notice that the fMRI activated area is dangerously close to the tumor.(b) Because fMRI cannot be acquired intra-operatively, it needs to be registered to an intraoperative image to locate it in relation to the tumor during surgery.Here we show the intraoperative image overlaid with the residual tumor in green, and the fMRI activated area, deformed by the most probable deformation estimate, in purple.(c) Close-up of the location in (b), but with the predictive posterior distribution on the location of the deformed motion activated area overlaid.The most probable deformation of the fMRI activated area is shown with a black outline.In this surgical case, where the posterior predictive distribution was computed retrospectively, it was deemed too risky to resect the tumor close to the fMRI activated area because the true intra-operative location of the fMRI area could not be assessed properly.However, if the full posterior predictive distribution on the location of the functional area in the intra-operative space was available, e.g. as in Fig. (c), the surgeon may have been able to better assess the surgical risk and continued the resection.