H-means image segmentation to identify solar thermal features

Properly segmenting multiband images of the Sun by their thermal properties will help determine the thermal structure of the solar corona. However, off-the-shelf segmentation algorithms are typically inappropriate because temperature information is captured by the relative intensities in different pass-bands, while the absolute levels are not relevant. Input features are therefore pixel-wise proportions of photons observed in each band. To segment solar images based on these proportions, we use a modiﬁcation of k -means clustering that we call the H -means algorithm because it uses the Hellinger distance to compare probability vectors. H -means has a closed-form expression for cluster centroids, so computation is as fast as k -means. Tempering the input probability vectors reveals a broader class of H -means algorithms which include spherical k -means clustering. More generally, H -means can be used anytime the input feature is a probabilistic distribution, and hence is useful beyond image segmentation applications.


INTRODUCTION
The thermal structure of the solar corona is a crucial factor in a proper understanding of the Sun-Earth connection.It places constraints on coronal heating mechanisms, controls the characteristics of the solar wind, and has the potential to predict flare onsets and mass ejections.Determining the thermal structure is however difficult since the data are obtained as images at high cadence in multiple filters.Hence, fast image segmentation methods that segregate regions of similar thermal properties are needed.
Professor van Dyk was supported in part by NSF grant DMS-09-07522 and by a British Royal Society Wolfson Research Merit Award.N. Stein and Professor Meng were supported in part by NSF grant DMS-09-07185.V. Kashyap was partially supported by CXC NASA contract NAS8-39073.The authors are grateful to three reviewers for their very helpful and encouraging comments.
However, for studying the thermal structure of the solar corona, standard segmentation approaches suffer two disadvantages.First, off-the-shelf segmentation algorithms often assume that the absolute intensities in each band of multiband images are the natural input features to be compared, which is not the case in the thermal structure problem.Second, model-based approaches that incorporate more complicated observation models or input features often suffer from computation requirements that prevent their use on streams of high-resolution images.Fast computation is necessary when analyzing data from the Atmospheric Imaging Assembly (AIA) of the Solar Dynamics Observatory, which captures high resolution multiband images with a cadence of approximately ten seconds.
In the corona, absolute brightness roughly corresponds to the amount of plasma in a particular region, which is not relevant for studying thermal properties.Thermal properties such as average temperature are instead captured by the relative intensities in the different passbands.Motivated by a probabilistic model of the data generating process, we cluster pixels based on the vectors of proportions of photons observed in each passband.To cluster these probability vectors, we use a modified k-means algorithm that replaces Euclidean distance with Hellinger distance, inspiring the name H-means.
In Section 2, we describe the H-means algorithm and discuss tempering the input distributions.In Section 3, we describe our application and the statistical model that inspires our approach.In Section 4, we illustrate the performance of our algorithm on AIA images.Section 5 concludes.

H-MEANS CLUSTERING
Clustering algorithms based on distances other than Euclidean distance have been proposed and studied in, for instance, [6], [7], [8], [9], [10], and [11].The Hellinger distance has been suggested as an alternative to Euclidean distance in clustering and dimension reduction problems in [12], [13], [14], and [11], but there has been relatively little study of substituting the Hellinger distance for Euclidean distance in k-means clustering.We call this modification to k-means the H-means algorithm.
A standard generalization of the k-means algorithm replaces squared Euclidean distance with another dissimilarity measure d(y 1 , y 2 ).The so-called k-medoids algorithm alternates between updating cluster assignments and updating cluster medoids.To initialize the k-medoids algorithm, we can randomly chose the starting medoids m 1 , . . ., m k from the data {y 1 , . . ., y n }.We assign each unit to the cluster with the closest medoid as measured by the dissimilarity d.Formally, cluster assignments {c 1 , . . ., c n } are determined by setting c i equal to the j that minimizes d(m j , y i ).Then, the following two steps are repeated until a convergence criterion is met: • Update cluster assignments by setting c i to the j that minimizes d(m j , y i ) for each unit i.
• Update the cluster medoids by setting m j equal to the minimizer arg min In cases where the distance is an arbitrary dissimilarity measure only defined pairwise between units in the data set, the cluster medoid is found as the point from the original data set in the given cluster that minimizes the total distance (1).
The usual problem with k-medoids is that, if there is not a closed-form expression for (1), then finding the minimizer in (1) requires an expensive search, and the computation consequently scales quadratically in the number of observations.When there is a closed-form solution for the minimizer (1), then it is possible to replace Euclidean distance with another metric without the usual computational disadvantages of the k-medoids algorithm.Fortunately, the Hellinger distance (2) has such a closed-form solution.The squared Hellinger distance between two discrete 1 probability distributions p 1 and p 2 on a state space X is ( For a collection p 1 , . . ., p n of discrete probability distributions, the minimizer

Spherical k-means and H-means with tempering
The spherical k-means algorithm [6] is closely related to Hmeans.The spherical k-means algorithm takes input vectors {y 1 , . . ., y n } with nonnegative entries and normalizes them to have unit length.Unit vectors are then compared according to their cosine similarity.That is, the distance between vectors y i and y j is given by Like Euclidean k-means and H-means, the spherical k-means algorithm has a closed-form expression for the cluster centroids, enabling fast computation [6].
In fact, the spherical k-means algorithm can be viewed as a member of a class of generalized H-means algorithms.The Hellinger distance (2) can be expressed as We can also "temper" these distributions with a parameter α ≥ 0, defining Then, Hellinger distance and cosine distance are related by Thus, spherical k-means can be viewed as H-means with tempered inputs.

IDENTIFYING SOLAR THERMAL FEATURES
Much is still unknown about the processes that govern thermal features in the corona.The Atmospheric Imaging Assembly is a four-telescope array on the Solar Dynamics Observatory satellite that captures near-simultaneous images through seven extreme ultraviolet wavelength passband filters.Because these different filters have differing responses to temperature, they should be useful in reconstructing the temperature distribution of the emitting plasma in each image pixel.If it were possible to accurately infer these underlying temperature distributions, then astronomers could study the thermal characteristics of interesting features on the Sun, such as loops of hot plasma, and trace the evolution of these thermal properties over time.However, with only at most seven filters, each responding to a fairly wide range of temperatures, reconstructing the temperature distributions is an extremely underconstrained problem.Instead of adding constraints from prior information about the likely shapes of temperature distributions, we choose to bypass the reconstruction step altogether.Our approach is motivated by a statistical model for the data generating process.For simplicity of illustration, we assume no background contamination and no spatial dependence.The J × 1 observed vector y i in pixel i is modeled as a Poisson random variable with mean where P is a diagonal J × J matrix of exposure times, A = (a jk ) is a J × K matrix encoding the response of the jth filter to temperature bin k, and µ i is a K × 1 vector of true intensities in each temperature bin.We parameterize the true intensities as where γ i is a scalar roughly corresponding to the amount of plasma in pixel i and θ i is a discretized probability distribution for temperature.For the scientific applications we consider, we focus on properties of θ i such as average temperature, and γ i is a nuisance parameter.It is difficult to reliably estimate the temperature distributions because K is larger than J.
To avoid directly modeling the niusance parameters γ i , we use the conditional likelihood of y i given the pixel-wise total j y ij , a multinomial distribution free of γ i .The multinomial probabilities π i = P Aθ i 1 P Aθ i capture all of the temperature information available after the degradation by the response matrix A. Thus, in an attempt to retain the available information with respect to this model, we identify solar thermal features by clustering pixels with similar values of the maximum likelihood estimates πi = y i / j y ij .

Simulations
To investigate the performance of H-means clustering, we applied it to six 128 × 128 images simulated under the model ( 7) and (8) using the same response matrix A that we use on AIA data in Section 4.2. Figure 1a plots the simulated temperature map.We used two different underlying temperature distributions arranged in a vertical stripe pattern.The temperature distributions on the log 10 of the temperature in Kelvin were Gaussian with standard deviation 0.25 and means 6.00 and 6.05.Values for γ i in (8) were set proportional to the observed total intensity summed across all filters in a slice of an observation of the Sun, shown in Figure 1b.
Figure 1c shows the results of H-means clustering with ten clusters based on the data simulated in six bands under (7) and (8).To make the graphical display more meaningful, gray scale values of each cluster were set to the pooled proportions of counts in the 171 Angstrom band.The clustering successfully reveals the vertical stripe pattern of the true underlying temperature distributions, without being obscured by the patterns in the γ i map.

Application to AIA Data
We have applied H-means clustering to six full-resolution (4096 × 4096) images of the Sun, using the 94, 131, 171, 193, 211, and 335 Angstrom filters, observed on October 2, 2010, at 05:57am UT. Figure 2 shows the original, untransformed data in all six bands.Figure 3a shows the estimated proportions πi,171 in the 171 Angstrom filter.That is, Figure 3a plots the pixel-wise ratios of counts in the 171 Angstrom band to the sums of counts across all bands.Figure 3b displays the results of applying H-means clustering with 64 clusters.As in Section 4.1, gray scale values for each cluster were set to the pooled proportions of counts in the 171 Angstrom band.The H-means clustering results capture features in the estimated probability images that do not appear in the original images, indicating that clustering is operating on features of the temperature distributions that are of scientific interest.This is reflected in the similarity between Figures 3a and  3b.Moreover, the 'S'-shaped region evident in Figure 3 is a much larger scale feature than the typical solar features that astronomers study, suggesting a direction for further investigation.

CONCLUSION
H-means clustering is a general method for clustering probability distributions based on the Hellinger distance.It is based on the k-means algorithm, but by tempering the input probability distributions, H-means encompasses other methods designed to reduce the influence of total counts or magnitudes across categories, such as spherical k-means.We applied Hmeans in a model-inspired image segmentation algorithm to reveal thermal features in multiband images of the Sun.A key advantage of this method is that is does not require the reconstruction of the underlying temperature distributions in each pixel.A remaining question is the effect of different choices of the tempering parameter α, given the connection between H-means and spherical k-means.Future work will aim to develop a statistically principled way to choose α in practice.

Fig. 1 .
Fig. 1. Results for a simulated image under the model (7) and (8).(a) The simulated map of the true temperature distributions.(b) The simulated map of the nuisance parameters γ i , corresponding to the total intensity.(c) The results of Hmeans clustering with 10 clusters.

Fig. 2 .Fig. 3 .
Fig. 2. AIA images of the Sun from 05:57am UT on October 2, 2010.The images in the top row, left to right, are the 94, 131, and 171 Angstrom filters.The images in the bottom row are the 193, 211, and 335 Angstrom filters.