Image-Driven Population Analysis Through Mixture Modeling

We present iCluster, a fast and efficient algorithm that clusters a set of images while co-registering them using a parameterized, nonlinear transformation model. The output of the algorithm is a small number of template images that represent different modes in a population. This is in contrast with traditional, hypothesis-driven computational anatomy approaches that assume a single template to construct an atlas. We derive the algorithm based on a generative model of an image population as a mixture of deformable template images. We validate and explore our method in four experiments. In the first experiment, we use synthetic data to explore the behavior of the algorithm and inform a design choice on parameter settings. In the second experiment, we demonstrate the utility of having multiple atlases for the application of localizing temporal lobe brain structures in a pool of subjects that contains healthy controls and schizophrenia patients. Next, we employ iCluster to partition a data set of 415 whole brain MR volumes of subjects aged 18 through 96 years into three anatomical subgroups. Our analysis suggests that these subgroups mainly correspond to age groups. The templates reveal significant structural differences across these age groups that confirm previous findings in aging research. In the final experiment, we run iCluster on a group of 15 patients with dementia and 15 age-matched healthy controls. The algorithm produces two modes, one of which contains dementia patients only. These results suggest that the algorithm can be used to discover subpopulations that correspond to interesting structural or functional ldquomodesrdquo.


I. INTRODUCTION
T ODAY, computational anatomy studies are mainly hy- pothesis-driven, aiming to identify and characterize structural or functional differences between, for instance a group of patients with a specific disorder and control subjects.This approach is based on two premises: accurate clinical classification of subjects and spatial correspondence across the images.In practice, achieving either can be challenging.First, the complex spectrum of symptoms of neuro-degenerative disorders like schizophrenia and overlapping symptoms across different types of dementia, such as Alzheimer's disease, delirium and depression, make a diagnosis based on standardized clinical tests difficult [22].Second, establishing across-subject correspondence in the images is a particularly hard problem constrained by the specifics of the application.A popular technique is to normalize all subjects into a standard space, such as the so-called Talairach space [47], by registering each image with a single, universal template image that represents an average brain [12].However, the quality of such an alignment is limited by the accuracy with which the universal template represents the population in the study.
With the increasing availability of medical images, datadriven algorithms offer the ability to probe a population and potentially discover subgroups that may differ in unexpected ways.In this paper, we propose and demonstrate an efficient probabilistic clustering algorithm, called iCluster, that 1) computes a small number of templates that summarize a given population of images; 2) simultaneously co-registers all the images using a nonlinear transformation model; 3) assigns each input image to a template that best describes the image.The templates are guaranteed to live in an affine-normalized space, i.e., they are spatially aligned with respect to an affine transformation model.A preliminary version of iCluster was published at the International Conference on Medical Image Computing and Computer Assisted Intervention [42].This paper expands the conference paper with a more detailed theoretical development and more extensive experimental work.
In our experiments, we demonstrate that the templates computed by the proposed algorithm can be used for various purposes, including constructing multiple atlases for improved segmentation and discovering structural modes of a population.On a data set of 50 brain MR images with manual labels for several temporal lobe structures, we illustrate that the subpopulations computed by iCluster manifest significantly improved average label alignment compared to the clinical subpopulations and the whole population.This result suggests that a multi-template strategy will yield improved segmentation accuracy in an atlas-based framework.In other experiments, we show that the modes of the population discovered by iCluster capture known structural differences and similarities.On a population of 415 brain magnetic resonance imaging (MRI) of subjects aged 18-96 years, the algorithm computed three unique templates that mainly comprised of young subjects (mean age 31), older middle aged subjects (mean age 69), and elderly subjects (mean age 79).In another setting, we demonstrate that the modes discovered by the algorithm reflect the two groups of subjects (with mild dementia and healthy) in the population.These results suggest that iCluster can be used to probe a population of images to discover important structural or functional "modes." The remainder of the paper is organized as follows.Section II includes an overview of the literature on atlas construction and inter-subject registration.In Section III, we introduce the generative model and develop our algorithm.Section IV reports experimental results.Section V discusses the advantages and drawbacks of the proposed algorithm, while pointing to future directions of research.Section VI concludes with a summary of contributions.

II. BACKGROUND AND PRIOR WORK
In medical imaging, the term atlas usually refers to a (probabilistic) model of a population of images, with the parameters learned from a training data set [14], [51].In its simplest form, an atlas is a mean intensity image, which we call a template [6], [12], [53], [54].Richer statistics, such as intensity variance or segmentation label counts, can also be included in the atlas model [19].Atlases are used for various purposes including normalization of new subjects for structure and function localization, segmentation, or parcellation of certain structures of interest, and group analysis that aims to identify pathology-related changes or developmental trends.
Atlas construction requires a dense correspondence across subjects.Earlier techniques used a single image-either a standard template [12], or an arbitrary subject from the training data set [25]-to initially align images using a pairwise registration algorithm.Other methods focused on determining the least biased template from the training set [31], [37].A single template approach faces substantial methodological challenges when presented with a heterogeneous population, such as patients and matched normal control subjects in clinical studies.To circumvent this, more recent approaches co-register the group of images simultaneously without computing a group template [46], [58].Even though these algorithms remove the requirement of a single template, they do not attempt to model the heterogeneity in the population.Recent work [9] presented a method that automatically identified the modes of a population using a mean-shift algorithm.This approach solved pairwise registrations to compute each interimage distance, which slowed down the algorithm substantially.Furthermore, the multi-modality of the population was not modeled explicitly, making it difficult to extract a representation of the heterogeneous population.An alternative strategy to atlas-based segmentation is to use all training images as the atlas [27].A new subject is registered with each training image and segmentation is based on a fusion of the manual labels in the training data.This approach is not suitable for anatomical variability studies, where a universal coordinate frame is necessary to identify and characterize group differences and study developmental and pathological trends.
There is a rich range of techniques used to characterize similarities and differences across subpopulations defined by attributes like gender, handedness and pathology.Volume-based [11], [39], [44], voxel-based [4], [15], and deformation-based [5] morphometry methods are commonly used to compare anatomical MRI scans of two or more groups of subjects.Other examples include statistical analysis of functional magnetic resonance imaging (fMRI), positron emission tomography (PET), and diffusion data to identify age and disease-related changes in the functional and structural organization of the brain [24], [33].In these studies, intersubject correspondence is typically achieved via one of the image registration algorithms discussed above.When faced with a heterogeneous group of healthy and pathological brains, however, establishing intersubject correspondence is an ambiguous and more challenging problem due to dramatic structural changes associated with the pathology.For instance, defining a similarity measure when certain corresponding regions are missing or unclear, is not straightforward.
Probabilistic atlases are powerful tools used commonly for supervised segmentation [3], [13], [18], [55].A probabilistic atlas can provide statistics about the frequency of a certain label at a particular location, and topological information like the frequency of two different labels neighboring each other at a particular location and with a certain orientation.Moreover, it can include information about the relationship between labels and image intensities.Given a new image, intensity models, such as a template image, are typically used for spatial normalization.Automatic segmentation is then formulated as an inference problem.Recent joint registration and segmentation frameworks [3], [38] integrate the two steps: spatial normalization is updated based on the current segmentation and vice versa.Most atlas-based segmentation approaches make a strong unimodal assumption on the intensity distribution either when building the atlas, or when segmenting the new image or at both stages.In other words, they assume a homogeneous population, where each subject can be modeled as a deformed and noisy version of a universal template.However, there is growing evidence that population-specific atlases can improve the quality of segmentation [48], [57].This, we believe, highlights the limitations of a single-template atlas in segmentation applications and points toward a multi-template atlas strategy.
In this paper, we develop a probabilistic framework for joint registration of a set of images into a common coordinate frame, while clustering them into a small number of groups, each represented by a template image.We employ a mixture of Gaussians model and a maximum likelihood framework which we solve using the generalized expectation maximization (GEM) algorithm.A similar approach was independently developed in [1], which provides a rigorous analysis of the maximum a posteriori estimate of the deformable templates using a Gaussian kernel based deformation parametrization.In [1] the application of the framework was limited to 2-D images of handwritten digits.In contrast, we focus on high-resolution 3-D medical data and employ a B-spline parametrization for the nonlinear transformation, as previously demonstrated in [41].Furthermore, we present approximate solutions to the template estimation problem that yield fast algorithms applicable to large data sets.Our algorithm can also be viewed as an extension of the approach in [50], which solves the registration problem as an initial, separate step.Our framework leads to a fast, scalable, and flexible algorithm that removes the sensitivity of the resulting atlas coordinate frame to the selected target.Moreover, it provides a novel, data-driven way to probe the population for different modes.Analyzing the discovered subpopulations and their representative templates promises to advance our understanding of dominant structural or functional changes due to pathology or development.

III. MODEL AND ALGORITHM
We assume that the input images are generated from a small number of templates , where is known and fixed.Later, we will propose a strategy to automatically determine from the data.Thus, for each , there exists such that (1) where is an admissible, invertible spatial warp, such as a parameterized nonlinear transformation, denotes its inverse, is a spatially independent, non-stationary Gaussian noise field with zero mean and standard deviation .The last term models imaging noise, and the independent Gaussian assumption is a commonly used model in the literature [18].We model the noise parameters in the coordinate frame of the template.Fig. 1 illustrates this generative model for two templates.
Let denote the conditional probability of the image given that it is generated by the 'th template, and with the fixed model parameters.This can be computed from (1) (2) where is the Gaussian density with mean and standard deviation .
Let denote the prior probabilities of the templates.This distribution governs the initial random draw of templates shown in Fig. 1 and models the possibly unbalanced sizes of the clusters.Thus the parameters for the whole model include the templates , template priors and standard deviation image .The spatial transformations can be viewed as hidden random variables, drawn independently for each image from a prior distribution that favors smoother transformations, for instance.In this paper, however, for simplicity we will treat as model parameters.We use to denote the pooled set of model parameters and spatial transformations.Marginalizing over all possible template indices, we obtain the probability of observing a particular image (3)

A. Generalized EM for Atlas Construction
We formulate the problem of atlas construction as a maximum likelihood estimation (4) where denotes the log-likelihood of the entire image set evaluated for the parameter .We use a generalized expectation maximization (GEM) algorithm to solve (4).For a fixed , using Jensen's inequality we form a lower bound for (5) where is a constant that does not depend on and is the posterior probability that the image was generated from the template (6) Note that .The GEM algorithm iteratively improves this lower bound.Let be the guess of at iteration .Computing -or, equivalently -is the E-step of iteration .The M-step updates to increase .In our formulation, we use a coordinate ascent strategy in the M-step and divide it into two substeps: the T-step ("T" stands for template) where we compute the closed form expressions of the template parameters that maximize ; and the R-step ("R" stands for registration) where we numerically solve for the transformation parameters .We will use to denote the Jacobian field of a transformation with respect to the spatial coordinates and will indicate the determinant of matrix .Derivations for the T-and R-steps can be found in the Appendix.Here we summarize the algorithm.
• E-step: Given the model parameters from iteration , the algorithm updates the posterior cluster probabilities: 1) , where is defined in (2).
2) Normalize to sum to 1 (7) These probabilities can be seen as "soft cluster memberships," where indicates a "hard membership" in cluster .
• T-step: Given the posterior probability estimates and transformation parameters , the algorithm updates its estimates of the templates , template priors and standard deviation image , for which we derive closed-form expressions where is the "effective template" (i.e., target image in registration) for image at iteration and is the weighted sum of square differences (WSSD) objective function of the R-step.The effective template is a weighted average of the current templates and the weights are membership probabilities.A single, invertible transformation is estimated for each image.Current membership probabilities determine which template the image should be aligning with.We employ a B-spline transformation model (on an 8 8 8 control point grid, unless specified otherwise) and a multiresolution strategy.In general, this transformation model does not guarantee invertibility.In practice, the algorithm checks for invertibility by monitoring the Jacobian terms and terminates when there is a Jacobian determinant value below a certain small positive threshold.Rather than solving the nonconvex optimization problem of ( 11), we perform a single Brent's method line search [10] based on gradient directions.The line search of each image is done in parallel, since the optimization for one image does not depend on other images.This strategy guarantees that the lower bound on the log-likelihood is improved, if not maximized, at each step; hence the name Generalized EM.

B. Initialization
The above GEM algorithm does not guarantee that the computed template images are in alignment.To introduce a notion of common coordinate frame, we use an initial affine normalization step that coregisters all images using a single dynamic mean image and an affine transformation model.This step is one of the popular coregistration algorithms used in practice.After affine normalization, the GEM algorithm starts with the E-step by computing membership probabilities according to (7).We initialize the template images as a random selection of different input images, where is the predetermined number of templates.In our experiments, we explore various values for and only report results for the values that produce robust results across multiple random initializations as discussed in Section IV-A.The template priors are initially assigned to be , and the variance image is initialized to be the sample variance at each voxel after affine normalization.Each R-step is initialized with the transformation parameters from the previous iteration.

C. Gradient Re-Normalization
In group-wise registration, one needs to anchor the registration parameters to avoid global transformation drifts across subjects [8], [46], [58].A natural common coordinate frame can be defined as the average of the population.This natural coordinate frame is computed implicitly by constraining the sum of all displacements across the subjects to be zero.We extend this strategy to the multi-template setting by constraining each point in the common coordinate frame to lie at the average location of corresponding points across the images in each cluster.To impose this constraint, we use the soft memberships ( 13) Equivalently (14) , and .Summing both sides of ( 14) over yields (15) which is the anchoring constraint used by other group-wise registration methods [8], [46], [58].
In a gradient descent optimization strategy, one way of imposing the constraint of ( 15) is to re-normalize the gradients of the R-step objective function by subtracting the average gradient from all the individual image gradients.Let be a dimensional row vector that denotes the gradient of the R-step objective function with respect to the transformation parameters of the image at iteration .Then, before each update of the transformation parameters, one re-normalizes the gradients: (16) In the multi-template setting, we extend this re-normalization to satisfy the constraint of (13).We stack all the gradient row vectors to create an matrix and all the membership probabilities to create an column vector for each .First, using the Gram-Schmidt process, we obtain at most orthonormal vectors from .Using this orthonormal basis, we re-normalize all the gradients as (17) where denotes the identity matrix, denotes the th column of and denotes the transpose of .After re-normalization each column of is orthogonal to for all . In other words: .

D. Determining the Optimal Number of Templates
Determining the optimal number of clusters is a classical problem in unsupervised machine learning, which unfortunately has no universal solution [35], [49].The problem can be viewed as a specific case of model selection.In general, increasing the number of clusters provides a better fit to the observed data, yet this does not necessarily translate into improved generalization.A standard approach to controlling the generalization ability of the model is to penalize the model complexity.Bayesian information criterion (BIC) is a widely-used technique that attempts to achieve this balance [45].In our setting, BIC (or equivalently minimum description length) can be formulated as minimizing the penalized negative log-likelihood (18) where is the maximum value of the likelihood in (3) for a fixed number of templates and is the total number of model parameters, which in our case is equal to , where is the number of transformation parameters and is the number of voxels.
Alternatively, one can use the stability of the resulting model to quantitatively asses the structure in the clustered data, cf.[7].In practice, we found it useful to measure the stability of the output against different random initializations.For example, we observed that beyond a particular input , the computed clustering is significantly less consistent across runs with different initializations.We quantify this consistency using a relative measure defined for each run as  where denotes the membership probabilities computed in run and is the average membership probability over all remaining runs for a fixed input .To handle the ambiguity in cluster indexing, we maximized (19) over all permutations of indexing of the templates in all runs.This procedure yields a relative consistency value for each run with a fixed input .Based on the stability criterion, we propose to pick the highest value of that yields a relatively high average consistency (e.g., the average over multiple runs exceeds 0.9).
We tested both BIC and the consistency criterion using synthetic data where ground truth was known.Our experiments, presented in Section IV-A, indicate that the consistency criterion yields an accurate prediction of the optimal number of templates.

E. Complexity
Each iteration of the algorithm has a computational complexity and memory requirement of , where is the number of input images, is the number of templates and is the number of voxels.We use multithreading in ITK [30] to implement a parallelized version of iCluster.Similar to [2], [58], we employ a stochastic subsampling strategy to speed up the algorithm.At each iteration, a random sample of less than 1% of the voxels was used to compute the soft memberships, templates, template priors, standard deviation image and to update transformation parameters.In practice, we run the numerical optimization of the R-step as a single line search for each image, where the search directions are the normalized gradients.The effect of stochastic subsampling is investigated using synthetic data in Section IV-A.Selecting a stopping criterion is not straightforward with the subsampling strategy, since a comparison of the objective function values across iterations is not possible.Instead, one can monitor the change in the parameters.In practice, the algorithm stops when the change in the class memberships and registration parameters falls below a predetermined threshold.Fig. 2 summarizes the iCluster algorithm.

IV. EXPERIMENTS
We validate the algorithm and investigate its behavior in four different experiments.In the first experiment, we use synthetic data to inform a choice of parameter settings, including the amount of subsampling.The availability of ground truth allows us to quantify the quality of results objectively and perform comparisons across different settings of parameters.The second experiment demonstrates the use of iCluster for building a multi- template atlas for a segmentation application.In the third experiment, we employ iCluster to compute multiple templates from a large data set that contains 415 brain MRI volumes.Our results demonstrate that these templates correspond to different age groups.In the last experiment, we use our algorithm on a smaller population that contains patients with dementia and healthy subjects.The results indicate that the templates computed by the algorithm correspond to the two clinical groups.We find the correlation between the image-based clustering and demographic and clinical characteristics particularly intriguing, given that iCluster does not have access to this information when constructing the model of heterogeneity in the population.

A. Synthetic Experiments
In this experiment, we synthesized three data sets from four whole brain MR images (obtained from the Oasis repository [34], with an image resolution of 176 208 176 voxels and voxel dimensions of 1 mm ).The subjects were warped by applying random transformations parameterized with a 8 8 8 B-spline model [41].Each control point was displaced by an amount sampled uniformly from a 20 mm box around its original location.Furthermore, the warped images were corrupted with i.i.d.zero mean Gaussian noise with a variance equal to 10% of the maximum intensity value.Axial slices of the original images and representative synthetic images are shown in Fig. 3. Table I summarizes the ground truth information for the synthetic data.
1) Effect of Stochastic Subsampling: First, we analyze the effect of stochastic subsampling on the quality of results.We ran iCluster on synthetic Data Set 3, with input .The four templates were initialized poorly as four different synthetic  subjects that were all generated from the original subject 1.The quality of results was assessed using two measures: membership accuracy and error in the template images.
To define membership accuracy, we used the inner product between two membership probability matrices as a proxy for similarity.Formally, let denote a set of output membership probabilities and denote ground truth membership probabilities, with 1 corresponding to the template that generated the image and all remaining entries equal to zero.We define membership accuracy as (20) where is the number of input images.To resolve the ambiguity in the cluster indices, we maximize (20) over all possible permutations of the ground truth template indices.We use this maximum value as a measurement of membership accuracy.
Let denote the output template images and denote the ground truth templates, i.e., original subject MRIs.We define the average template error as (21) where is the number of voxels in is the number of templates and the template indexing is determined by maximizing (20) for output memberships.
Fig. 4 shows both the membership accuracy and template error values for a range of sampling percentages, where the sampling percentage is the ratio of the size of the stochastic set of voxels used at each iteration to the total number of voxels.For each parameter setting, we performed 10 runs of iCluster starting from the same poor initialization.Each run yielded a different output due to stochastic subsampling.For sampling percentage values larger than 0.1% membership accuracy was perfect and the template error reached its minimum for all ten runs.In practice, we chose 0.5% as the sampling percentage.This corresponds to using roughly 30 000 voxels at each iteration.The bottom row of Fig. 3 shows the four templates computed by iCluster with input and 0.5% sampling percentage.The output templates were computed using (8) on the whole domain with the estimated model parameters.
2) Determining the Optimal Number of Templates: Here, we compare two methods for automatically determining the optimal number of templates.We ran iCluster on the three synthetic data sets with a range of input values.For each setting, we ran the algorithm ten times with different random initializations to a collection of outputs.Using (18), we computed the negative penalized log-likelihood values for these outputs.Fig. 5 plots these values as a function of input for the three data sets.BIC determines the optimal number of templates as the value of that minimizes the penalized log-likelihood of the data under the estimated model.According to this criterion, data sets 1,2, and 3 have at least 4, 5, and 4 underlying templates, respectively.The optimal for data sets 1 and 2 should have been 2 and 3, respectively.
Alternatively, we can look at the consistency of the resulting model to determine the optimal number of templates.We quantified the consistency of the model using the relative membership consistency measure defined in (19).The average relative membership consistency values for each input are shown in Fig. 6.Based on the consistency criterion, we propose to select the highest value of that yields a relatively high average consistency (e.g., the mean over multiple runs exceeds 0.9).According to this criterion, data sets 1, 2, and 3 have 2, 3, and 4 underlying templates, respectively, which agrees perfectly with the ground truth.In the remaining experiments, we used the consistency criterion to determine the optimal number of templates.

B. Segmentation Label Alignment
In atlas-based segmentation, one typically normalizes the new subject by registering the image with a template.Segmentation is then achieved by inferring labels based on the intensities of the new image and the training images that contain manual labels.The training data is usually employed to establish a prior for segmentation.To assess the quality of this prior, one can measure its agreement with the ground truth label of a new subject.In the following experiment, we measure this agreement by quantifying the alignment between one (new) subject and the remaining (training) subjects.In the case of multiple atlases, this requires an assignment of the new subject to one of the atlases.If these atlases are constructed through an image-based clustering strategy, as the one proposed in this paper, one can use the same framework to determine this assignment.This means fixing the template images, noise variance image and template priors in the iCluster algorithm.The assignment of the new subject can then be computed using the same GEM algorithm, which iterates over the E and R-steps.
In this experiment, we used a data set of 50 whole brain MR brain images that contained 16 patients with first episode schizophrenia (SZ), 17 patients with first-episode affective disorder (AFF), and 17 age-matched healthy subjects (CON).The MRI volumes were obtained using a 1.5-T General Electric scanner (GE Medical Systems, Milwaukee, WI).The acquisition protocol was a coronal series of contiguous images.The imaging variables were as follows: ms, ms, one repetition, 45 nutation angle, 24-cm field-of-view, (number of excitations), matrix (192 phase-encoding steps) . The voxel dimensions were 0.9375 0.9375 1.5 mm.First episode patients are relatively free of chronicity-related confounds such as the long-term effects of medication, thus any structural differences between the three groups are subtle, local and difficult to identify in individual scans.Fig. 7 shows coronal slices of the affine-normalized mean images for each clinical population.A detailed description of the data and related findings are reported in [28].
For these images, we also had manual delineations of eight temporal lobe structures: the (left and right) superior temporal gyrus (STG), hippocampus (HIP), amygdala (AMY), and parahippocampal gyrus (PHG).Prior MRI studies of schizophrenic patients revealed structural brain abnormalities, with low volumes of gray matter in the left posterior superior temporal gyrus and in medial temporal lobe structures.However, the specificity to schizophrenia and the roles of chronic morbidity and neuroleptic treatment in these abnormalities remain under investigation [28], [29].Accurate segmentation tools for temporal lobe structures is thus important for schizophrenia research.We used manual labels to explore label alignment across subjects under different groupings: on the whole data   We ran iCluster on the 50 MR images for different values of input .We emphasize that the algorithm did not have access to the clinical and manual label data.Fig. 8 shows the iCluster output membership consistency, as defined in III-D.We ran the algorithm ten times for each value of input .Based on our proposed consistency criterion, we determine as the optimal number of templates.However, to provide a comparison with the clinical grouping (where there are three groups: SZ, AFF, and CON), we present results for as well.Tables II and III show the relationship between the clustering of the algorithm and the clinical diagnosis.We observe that the clustering computed by the algorithm demonstrates no correlation with the clinical diagnosis.This result confirms the difficulty of identifying structural differences between these first-episode patients and control subjects on an individual basis.Fig. 9 shows coronal views of the two templates discovered by iCluster and the difference image between these two.There are subtle structural differences between the two templates, especially around the cortical regions of the temporal lobes.
To measure the quality of alignment of a region of interest in two subjects, we employed two measures: 1) the Dice score which quantifies the overlap between the regions of interest in two subjects [55] and 2) the modified Haussdorff distance [56], which is defined as the average Euclidean distance (in millimeters) between a boundary point and the closest corresponding boundary point in the other subject.The Dice score ranges between 0 and 1, where 1 indicates a perfect overlap.The Haussdorff distance achieves zero at perfect alignment; higher values indicate worse alignment.We compared average label alignments for three strategies.1) ALL: All subjects were coregistered with a single dynamic average template.This was achieved using the iCluster algorithm with and a 32 32 32 B-spline grid.The average label alignment for each subject was then computed by averaging all pairwise measures of label alignment with the remaining subjects.
2) CLIN: Each clinical group was coregistered separately using iCluster with and a 32 32 32 B-spline grid.The average label alignment for each subject was then computed by averaging all pairwise measures of label alignment with the remaining subjects with the same clinical diagnosis.
3) iC2 and iC3: We ran iCluster on all subjects with input and 3, and a 32 32 32 B-spline grid.For each input value, we report label alignment results for the run that yielded the highest relative consistency value as defined in (19).The average label alignment for each subject was then computed by averaging all pairwise measures of label alignment with the remaining subjects in the same cluster.Fig. 10 shows the average Dice scores and Hausdorff distances for the individual ROIs.These values were computed in the atlas space, where the manual labels were interpolated using the transformations obtained from the registrations and the nearest neighbor interpolator.We performed a paired permutation test comparison between the average label alignments of the three scenarios.The p-values were computed by assessing the average difference between two sets of paired measurements based on a histogram of differences obtained by randomly shuffling the order of pairings.The comparisons for the Haussdorff distances are presented in Table IV.Dice score comparisons yield similar results.In summary, iCluster with input yields the best label alignment results, where 6 out of 8 ROIs were significantly better aligned (with ) compared to the first two strategies of co-registering all subjects (ALL) and clinical groups separately (CLIN).This result provides further evidence for the usefulness of the proposed consistency criterion that determines the optimal number of templates.ALL and CLIN yield statistically improved label alignment for only one ROI: the right Superior Temporal Gyrus.
These results suggest that, on average, for most ROIs we achieve a better agreement between the ground truth labels and a prior obtained via iCluster, than a prior computed by co-registering all subjects or subjects within a clinical population.

C. Age Groups in the OASIS Data Set
In this experiment, we used the OASIS data set [34] which consists of 415 preprocessed (skull stripped and gain-field corrected) brain MR images of subjects aged 18-96 years including individuals with early-stage Alzheimer's disease (AD).We ran iCluster on the whole data set while varying the number of templates from 2 through 5.Each run took 4-8 h on a 16 processor PC with 128 GB RAM.Fig. 11 shows the output consistency against for different values of input .For and 5 the consistency values are significantly smaller than 0.9.We, therefore, report our results for and .Figs. 12 and 13 show the two and three robust templates computed with and , respectively.Fig. 14 shows typical individual subjects (in their native coordinates) corresponding to each cluster computed with .These subjects were chosen based on age, gender, and clinical condition, not image similarity.Fig. 15 shows the age distributions determined via Parzen  window estimator based on a Gaussian kernel with a standard deviation of four years.
It is easy to see that each template corresponds to a unique age group: For , we identify a group of 268 young subjects (aged years) and a group of 147 elderly subjects (aged years).For the algorithm detected 201 young subjects (Group 1, aged years), an older middle aged group of 127 subjects (Group 2, aged years) and elderly 87 subjects (Group 3, aged years).Fig. 15(b) illustrates the intersection between the middle aged distribution of and the distributions of .This plot reveals that the middle aged group for consists of two subpopulations: 1) a younger group of subjects that are in the young group for and 2) an older age group in the elderly for .These results suggest that the dominant structural modes in this large population are mainly due to aging.Analyzing the decomposition of the whole age distribution [shown in black in Fig. 15(b)] indicates that iCluster does not simply find the three major age modes.Specifically, the small middle peak around 50 years is robustly included with the younger population in both and .With three modes, the algorithm identifies an older middle aged group (Group 2) that has a significant overlap in age with the elderly group (Group 3).
We further analyzed the clinical dementia rating (CDR) [36] data to explore the differences across the image-based clusters.Table V summarizes the results.Group 1 [Fig.13(a)] has almost no subjects with positive CDR (an indication of probable Alzheimer's), whereas Group 2 [Fig.13(b)] consists of 35% patients diagnosed with probable Alzheimer's disease (AD) (i.e., has a CDR score of greater than zero), and 65% subjects with no dementia.Group 3 [Fig.13(c)] includes 69% patients with probable AD and 31% healthy subjects with zero CDR.The difference between the patient percentage in each group is statistically significant at as determined by a permutation test.This result indicates that the old-middle aged group computed by iCluster contains a majority of healthy individuals, whereas the elderly group is dominated by probable AD patients.
An important question at this point is to what extent these dementia profiles are correlated with the age data of the individuals, since it is known that the rate of incidence of dementia increases with aging [21].Moreover, we would like to explore the influence of gender on these structural modes.One important point to note is that approximately half of the subjects over 60 years old (100 subjects) were clinically diagnosed with dementia, as summarized in Table VI.Examining this table reveals a difference between the two genders: healthy females without dementia are more likely to belong to Group 2 [Fig.13(b)].On the other hand, males with positive CDR (i.e., with dementia) are more likely to belong to Group 3 [Fig.13(c)].For the other two groups, i.e., males without dementia and females with dementia, there is no obvious relationship that these tables reveal.
To get a better insight into the characteristics of the discovered structural modes, we performed a multinomial logistic regression on the iCluster group memberships using age, gender and clinical data1 as regressors.Table VII reports the regression coefficients, assuming Group 2 to be the reference category.If we convert the estimated probabilities to group assignments, the total model achieves around 75% training accuracy and a likelihood ratio test estimates the significance of the full, fitted model at .The significance of each coefficient was determined with a Wald test [17].These results suggest that the most significant factor that determines group assignment is age: with each year, the odds of a subject being assigned to the next, older group increases by approximately .Groups 2 and 3 are also differentiated by the clinical score and gender (with less significance).One point decrease in the MMSE score increases the odds of a subject belonging to Group 3, rather than Group 2, by .A female's odds of belonging to Group 2 versus Group 3 is roughly 2.5-fold higher than a male's.
These results confirm that aging and dementia are both significant factors that influence major structural changes in the brain.Moreover, our results indicate that these factors may have different effects for the two genders.These findings demonstrate a qualitative similarity with the ones reported in [20], where aging and dementia are shown to correlate with brain atrophy in a similar manner.Furthermore, [20] reports that these effects have a tendency to be different in the two genders: males tend to demonstrate a higher rate of atrophy.The gender difference, however, does not reach statistical significance in the analysis of [20] and remains under debate in the literature [23], [32].

D. Patients With Dementia
In the fourth experiment, we used a set of 30 subjects (aged between 65 and 84 years) from the OASIS data set.Fifteen of these had a positive CDR, i.e., were diagnosed with very mild to mild dementia and probable AD (aged years, with education level of ), while the other 15 individuals were controls (aged years, with education level of ) with no sign of clinical dementia at the time of scanning.Fig. 16 shows the consistency of iCluster outputs over a range of input K values.For , we observe that the membership consistency is less than 0.9, thus we report results for : Group 1 [Fig.17   We performed a multinomial logistic regression on the iCluster assignments using age, education data (1: less than high school, 2: high school, 3: some college, 4: college graduate, 5: beyond college), clinical score and gender data as regressors.Only the clinical score demonstrated significant

V. DISCUSSION
Our experiments demonstrate the use of iCluster in multiple settings.The synthetic experiments served to asses the effect of stochastic subsampling on the quality of results and informed the design of the method that automatically determines the optimal number of templates.In the second experiment presented in Section IV-B, we show that, using the proposed clustering strategy, one can compute a multi-template atlas for a segmentation application.Based on growing evidence that populationspecific atlases yield more accurate segmentation, we can em-ploy iCluster to discover coherent subpopulations in a large population of images and construct separate atlases for each subpopulation.Our experiments suggest that a multi-template atlas can improve segmentation quality.The proposed approach promises significantly better segmentation than a disease-specific atlas, especially in the case of spectrum diseases such as schizophrenia.
In another setting, we demonstrate the utility of an imagedriven approach for computational anatomy.This is in contrast with today's popular techniques that rely on a clinical or demographic classification of the subjects.Our experiments show that iCluster can robustly identify structural modes in a population that are mainly determined by age and dementia.This type of analysis promises to provide insight into the major factors that drive structural change and, more importantly, characterize subtypes of a particular disorder.
In our experiments, enlarged ventricles are immediately obvious in the older and dementia templates when compared to the younger and healthy populations, respectively.Moreover, cortical thinning and anterior white matter changes are visible in the difference images shown in Fig. 13.These types of structural changes due to aging and dementia have been well documented in the literature [16], [26], [43].Further analysis is required to understand the structural differences between the discovered modes.The intermediate group (the older middle aged in the first experiment) and the mixture group in the dementia experiment can provide interesting insights into structural changes due to aging and dementia.
With a single template, i.e., input , iCluster can be seen as an efficient unbiased template estimation algorithm, similar to the ones proposed in [14], [31], [58].Yet, the main point of this paper is that a single template is not sufficient to summarize the variability in a large and heterogenous population of images.To that extent, iCluster is similar to the recent works on atlas stratification [9] and deformable templates [1].In the atlas stratification framework of [9], the authors propose to use an off-the-shelf clustering algorithm on images to identify underlying homogeneous subpopulations.The framework does not explicitly model anatomical heterogeneity and yields a computationally inefficient algorithm, where one needs to perform pairwise registration instances to analyze input images.The generative model we developed in this paper is similar    to the deformable templates model of [1].Yet, in contrast with [1], our main focus is to propose a computationally efficient algorithm that can be employed on large collections of high resolution medical image data.Most importantly, however, we include a concrete demonstration of how an image-clustering approach can be used to construct multiple segmentation atlases and study the effects of clinical and demographic factors on neuroanatomy.
The image-based clustering approach can also be extended to descriptors of anatomical shape, such as volume [20] or surface-based representations [52].Various shape descriptors have been used to study the effects of disease progression and aging on anatomy.Based on similarity measures defined for these different descriptors, one can potentially derive different shape clustering algorithms.One such algorithm was proposed in [50].The main drawback of such a shape-based approach is the need for accurate segmentations, which limits the amount of data such a strategy can be applied to.An image-based clustering approach, on the other hand, has the advantage that it can be used with large collections of raw images.Furthermore, imagebased clustering can potentially reveal modes in a population that differ in unexpected regions.
We view iCluster as a first step towards a more comprehensive image-driven population analysis framework.The current algorithm suffers from several limitations.Notably, the simple additive Gaussian noise model cannot handle significant intensity variations across images.Thus, the current algorithm can only be used with intensity corrected (e.g., histogram matched, bias field corrected) images of the same modality.Moreover, the algorithm constructs clusters based on a similarity measure computed over whole images.This makes the method less sensitive to subtle and local differences across groups of images.One solution is to use a similarity measure computed over a region of interest in the E-step of iCluster.In the following, we summarize the possible directions one can explore to extend iCluster to a broader set of problems.
1) Use an entropy-based similarity measure that is insensitive to intensity variations to compute memberships in the E-step and perform co-registration in the R-step.2) Compute memberships within a region of interest or based on a different type of information, e.g., connectivity from diffusion data.3) Use more sophisticated models of deformation, e.g., diffeomorphisms.Moreover, one can integrate a more sophisticated prior on the spatial transformations.Hence, the memberships will be a function of both a similarity measure based on image intensities and the deformation cost.4) Rather than using an additive noise model on intensities, one could explicitly model the variance in warps which would lead to a clustering strategy based on deformations.

VI. CONCLUSION
We presented a fast and efficient image clustering algorithm for co-registering a group of images, computing multiple templates that represent different modes in the population, and determining template assignments.We demonstrated our algorithm in several experiments, which illustrated a multi-template atlas strategy for accurate image segmentation and revealed age and disease-related modes in a population.Our results confirm previous findings and lead to interesting insights that suggest future research directions in computational anatomy.
(35) (36) where in (35) and (36) we dropped and added terms that do not depend on .
Authorized licensed use limited to: MIT Libraries.Downloaded on September 4, 2009 at 15:15 from IEEE Xplore.Restrictions apply.

Fig. 3 .
Fig. 3. Top row: Axial slices of the original subject MRIs used to synthesize data.Middle row: Axial slices of representative synthetic images.Bottom row: Axial slices of the four templates computed by iCluster with K = 4 and 0.5% sampling percentage.

Fig. 4 .
Fig. 4. Output quality as a function of sampling percentage, i.e., the ratio of the size of stochastic set of voxels used at each iteration to the total number of voxels.Error bars indicate standard deviation.
Authorized licensed use limited to: MIT Libraries.Downloaded on September 4, 2009 at 15:15 from IEEE Xplore.Restrictions apply.

Fig. 6 .
Fig. 6.Consistency Criterion: The consistency of output membership probabilities for a range of input K values.Error bars indicate standard error.(a) Syn- thetic Data 1.(b) Synthetic Data 2. (c) Synthetic Data 3.

Fig. 8 .
Fig. 8. Consistency Criterion for the schizophrenia data set: The consistency of output membership probabilities for input K = 2; 3; 4. Error bars indicate standard error.
(a)] consists of 25 subjects, 15 of which were CDR zero.Group 2 [its template shown in Fig.

17
(c)] consists of five subjects, all of which have dementia.

Fig. 15 .
Fig. 15.Age distributions of the OASIS data.(a) Age distributions for K = 2, (b) the relationship between the ages of subjects in clusters identified for K = 2 and for K = 3, (c) Age distributions for K = 3.(a) K = 2, (b) from K = 2 to K = 3, (c) K = 3.

Fig. 16 .
Fig. 16.Consistency criterion for the 30 subject dementia data set.The consistency of output membership probabilities for a range of input K values.Error bars indicate standard error.

Fig. 17 .
Fig. 17.Two templates and their difference image for the 30 subject dementia data set.(a) Mostly Healthy.(b) Difference Image.(c) Dementia Patients.

TABLE I SUMMARY
OF GROUND TRUTH FOR THE SYNTHETIC DATA

TABLE II CLINICAL
COMPOSITION OF CLUSTERS FORK = 2

TABLE III CLINICAL
COMPOSITION OF CLUSTERS FORK = 3 set, on the clinical grouping, and on the image-based clustering as determined by iCluster.

TABLE V NUMBER
(PERCENTAGE) OF SUBJECTS WITH RESPECT TO THEIR GENDER AND CLINICAL DEMENTIA SCORE IN EACH GROUP COMPUTED BY ICLUSTER WITH K = 3

TABLE VI NUMBER
(PERCENTAGE) OF SUBJECTS AGED 60 AND OLDER WITH RESPECT The fact that Group 2 comprised of dementia patients with significantly low MMSE scores is intriguing.Yet, the more interesting question is, what is special about the ten dementia patients assigned to Group 1?This clustering suggests that their anatomies are more similar to healthy subjects in the same age group.Clinical and demographic attributes of the patients in the two groups are virtually identical: 1) age: TO THEIR GENDER AND CLINICAL DEMENTIA SCORE DATA IN EACH GROUP COMPUTED BY ICLUSTER WITH K = 3 relevance to differentiate the two groups (see Table VIII): the first group's average MMSE score was , whereas group two's score was .

TABLE VII LOGISTIC
REGRESSION COEFFICIENTS ON ICLUSTER MEMBERSHIPS COMPUTEDFOR THE WHOLE OASIS DATASET AND K = 3

TABLE VIII LOGISTIC
REGRESSION COEFFICIENTS ON ICLUSTER MEMBERSHIPS COMPUTER FOR THE 30 SUBJECT DEMENTIA DATASET AND K = 2