Learning generic prior models for visual computation

This paper presents a novel theory for learning generic prior models from a set of observed natural images, based on the minimax entropy theory that the authors studied in modeling textures. We start by studying the statistics of natural images, including their scale-invariance properties; then generic prior models are learned to duplicate the observed statistics. The learned Gibbs distributions confirm and improve the forms of existing prior models. More interestingly, inverted potentials are found to be necessary, and such potentials form patterns and enhance preferred image features. The learned model is compared with existing prior models in experiments on image restoration.


Introduction and motivation
Many generic smoothness models have been widely used in visual computation, ranging from image restoration and motion analysis to 3D surface reconstruction. For example, in image segmentation (Geman and Geman 1984, Blake and Zisserman 1987, Mumford and Shah 1989), these smoothness prior models take the form of the following joint probability distribution:
$$p(I) = \frac{1}{Z} \exp\Big\{ -\sum_{(x,y)} \big[\phi(\nabla_x I(x,y)) + \phi(\nabla_y I(x,y))\big] \Big\},$$
where $\nabla_x I(x,y) = I(x+1, y) - I(x,y)$ and $\nabla_y I(x,y) = I(x, y+1) - I(x,y)$ are differential operators. Three typical forms of the potential function $\phi()$ are displayed in figure (1). The functions in figures (1.b) and (1.c) have flat tails to preserve edges and object boundaries, and thus they are said to have advantages over the function in figure (1.a).
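To make the form of these models concrete, here is a minimal numerical sketch (not code from the paper) of the prior energy $\sum_{x,y} \phi(\nabla_x I) + \phi(\nabla_y I)$, using an illustrative flat-tailed potential $\phi(\xi) = \xi^2/(1+\xi^2)$ of the kind shown in figure (1.b):

```python
import numpy as np

def phi(xi):
    # A robust potential: quadratic near 0, with flat tails that stop
    # penalizing large differences, so edges are preserved.
    return xi**2 / (1.0 + xi**2)

def prior_energy(I):
    # p(I) is proportional to exp(-prior_energy(I)).
    gx = I[:, 1:] - I[:, :-1]   # nabla_x I(x,y) = I(x+1,y) - I(x,y)
    gy = I[1:, :] - I[:-1, :]   # nabla_y I(x,y) = I(x,y+1) - I(x,y)
    return phi(gx).sum() + phi(gy).sum()

I = np.random.rand(64, 64) * 31   # stand-in image with intensities in [0, 31]
print("prior energy:", prior_energy(I))
```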
These prior models enjoy nice interpretations in terms of regularization theory (Poggio, Torre, and Koch 1985), physical modeling (Terzopoulos 1983), Bayesian theory (Geman and Geman 1984), and robust statistics (Geiger and Yuille 1991, Black and Rangarajan 1997). There is, however, little rigorous theoretical or empirical justification for applying these prior models to general images, and the following questions are not answered in the literature. i) Why are differential operators good choices for capturing image features? ii) What are the best forms for $p(I)$ and $\phi()$? iii) Real-world scenes are observed at arbitrary scales, so a good prior model should remain the same for image features at multiple scales. However, none of the existing prior models on 2D images has the scale-invariance property, i.e., they are not renormalizable in terms of renormalization group theory (Wilson 1975).

[Footnote 1: This work was supported by an NSF grant and an ARO grant to Mumford. For correspondence, contact Song Chun Zhu at zhu@dam.brown.edu; detailed reports are available at http://hrl.harvard.edu/people/postdocs/zhu/zhu.html]
This paper presents a novel theory for learning generic prior models from a set of observed natural images, based on the minimax entropy theory that the authors studied in modeling textures (Zhu, Wu, and Mumford 1996). We start by studying the statistics of natural images, including their scale-invariance properties; then generic prior models are learned to duplicate the observed statistics. First, instead of being limited to differential operators, our theory examines whatever filters capture the structures of natural images, such as Gabor filters (Daugman 1985). An information criterion is put forth for choosing the most informative features (or filters) in $p(I)$. Second, unlike previous prior models, which subjectively assume parametric forms for the potential functions $\phi()$, our theory uses non-parametric forms and learns them from observed images. The learned Gibbs distributions confirm and improve the forms of existing prior models. More interestingly, inverted potentials are found to be necessary, and such potentials produce patterns and enhance preferred image features. The learned model is compared with existing prior models in experiments on image restoration.

[Footnote 2: Here, natural images refer to an arbitrary view of the world, indoor or outdoor.]
This paper is arranged as follows. Section (2) discusses the objective and theory of learning prior models. Section (3) presents a novel information criterion for model selection. Sections (4) and (5) demonstrate experiments on the statistics of natural images and on prior learning. Section (6) compares different prior models in experiments on image restoration. Finally, section (7) concludes with a discussion.

Learning prior models by maximum entropy
We define an image $I$ on an $N \times N$ lattice $L$, and we assume that there is an underlying joint probability distribution $f(I)$ on the image space for general natural images, i.e., arbitrary views of the world. To estimate $f(I)$ from a set of observed images $\{I_n^{obs}, n = 1, 2, \ldots, M\}$, we first choose a set of filters $\{F^{(\alpha)}, \alpha = 1, 2, \ldots, K\}$ which are characteristic of the observation. Given a linear filter $F^{(\alpha)}$ and an image $I$, the empirical marginal distribution (or histogram) of the filtered image $F^{(\alpha)} * I(x,y)$ is
$$H^{(\alpha)}(z; I) = \frac{1}{|L|} \sum_{(x,y) \in L} \delta\big(z - F^{(\alpha)} * I(x,y)\big),$$
where $\delta()$ is a Dirac function with point mass concentrated at $0$, and $|L|$ is the size of the image lattice. We compute the average histogram over all observed images as the observed statistics,
$$\mu_{obs}^{(\alpha)}(z) = \frac{1}{M} \sum_{n=1}^{M} H^{(\alpha)}(z; I_n^{obs}), \quad \alpha = 1, 2, \ldots, K.$$
We note that $\mu_{obs}^{(\alpha)}(z)$ is an unbiased estimate of $E_f[H^{(\alpha)}(z; I)]$, and the latter is a 1D marginal distribution of $f(I)$.
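As a concrete illustration, the following sketch computes $H^{(\alpha)}(z; I)$ for a $\nabla_x$-type filter and averages over a set of images to form $\mu_{obs}^{(\alpha)}(z)$; the binning and the stand-in data are illustrative assumptions, not the paper's choices:

```python
import numpy as np
from scipy.signal import convolve2d

def filter_histogram(I, F, n_bins=33, rng=(-16.0, 16.0)):
    # Empirical marginal distribution of F * I, normalized by the
    # number of response pixels so the bins sum to 1.
    response = convolve2d(I, F, mode='valid')
    h, _ = np.histogram(response, bins=n_bins, range=rng)
    return h / response.size

grad_x = np.array([[-1.0, 1.0]])   # nabla_x written as a convolution kernel
images = [np.random.rand(64, 64) * 31 for _ in range(44)]   # stand-in data
mu_obs = np.mean([filter_histogram(I, grad_x) for I in images], axis=0)
```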
Given a set of filters $S = \{F^{(\alpha)}, \alpha = 1, 2, \ldots, K\}$ and observed statistics $\{\mu_{obs}^{(\alpha)}, \alpha = 1, 2, \ldots, K\}$, a maximum entropy (ME) distribution is derived in the following Gibbs form:
$$p(I; \Lambda, S) = \frac{1}{Z} \exp\Big\{ -\sum_{\alpha=1}^{K} \sum_{(x,y) \in L} \lambda^{(\alpha)}\big(F^{(\alpha)} * I(x,y)\big) \Big\},$$
where $\Lambda = (\lambda^{(1)}(), \ldots, \lambda^{(K)}())$ is a set of potential functions on the features extracted by $S$. In practice, the filter responses are divided into a finite number of bins, so each $\lambda^{(\alpha)}()$ is approximated by a piecewise constant function, i.e., a vector, which we denote by $\lambda^{(\alpha)}$. The $\lambda^{(\alpha)}$'s are computed in a non-parametric way so that the learned $p(I; \Lambda, S)$ reproduces the observed statistics:
$$E_{p(I; \Lambda, S)}\big[H^{(\alpha)}(z; I)\big] = \mu_{obs}^{(\alpha)}(z), \quad \alpha = 1, 2, \ldots, K.$$
So, as far as the selected features and their statistics are concerned, we cannot distinguish between $p(I; \Lambda, S)$ and the "true" distribution $f(I)$.
Unfortunately, there is no simple way to express the $\lambda^{(\alpha)}$'s in terms of the $\mu_{obs}^{(\alpha)}$'s. We adopt the Gibbs sampler (Geman and Geman 1984), which simulates an inhomogeneous Markov chain in image space (Winkler 1995). This Monte Carlo method iteratively draws a sample from the distribution $p(I; \Lambda, S)$, computes the histograms of the filter responses for this sample, and updates the $\lambda^{(\alpha)}$'s to bring the synthesized histograms closer to the observed ones. For a detailed account of the computation of the $\lambda^{(\alpha)}$'s, the reader is referred to (Zhu, Wu and Mumford 1996).
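The following is a schematic sketch of this learning loop under simplifying assumptions: a single $\nabla_x$ filter, a single-site Metropolis sampler standing in for the Gibbs sampler, and a fixed step size. The actual procedure is given in (Zhu, Wu and Mumford 1996).

```python
import numpy as np

BINS = np.linspace(-16, 16, 34)          # 33 bins for filter responses

def hist_gx(I):
    gx = I[:, 1:] - I[:, :-1]
    h, _ = np.histogram(gx, bins=BINS)
    return h / gx.size

def energy(I, lam):
    # U(I) = sum over pixels of lambda(nabla_x I), lambda piecewise constant.
    gx = I[:, 1:] - I[:, :-1]
    idx = np.clip(np.digitize(gx, BINS) - 1, 0, len(lam) - 1)
    return lam[idx].sum()

def mcmc_sweep(I, lam, n_steps, rng):
    # Single-site Metropolis updates targeting p(I) ~ exp(-U(I)).
    for _ in range(n_steps):
        x, y = rng.integers(0, I.shape[0]), rng.integers(0, I.shape[1])
        old = I[x, y]
        e_old = energy(I, lam)
        I[x, y] = rng.integers(0, 32)     # propose a new intensity
        if rng.random() >= np.exp(min(0.0, e_old - energy(I, lam))):
            I[x, y] = old                 # reject the proposal
    return I

def learn_lambda(mu_obs, n_iter=50, eta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    lam = np.zeros(len(BINS) - 1)
    I = np.floor(rng.random((32, 32)) * 32)
    for _ in range(n_iter):
        I = mcmc_sweep(I, lam, 2000, rng)
        lam += eta * (hist_gx(I) - mu_obs)   # pull model stats toward mu_obs
    return lam
```

The update direction follows the maximum-likelihood gradient: where the synthesized histogram exceeds the observed one, the potential is raised, making those filter responses less likely.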
In our previous papers, the following proposition was observed.
Proposition 2 As $M \to \infty$ and $K \to \infty$, with only linear filters used, $p(I; \Lambda, S)$ converges to the underlying distribution $f(I)$.
We shall discuss how to choose filters in the next section.

Information criterion for model selection
We notice that the statistics of natural images vary from image to image. For each image $I_n^{obs}$, or a group of images in a given domain, it is desirable to have a specific underlying distribution $f_n(I)$. Given a set $S$ and an ME distribution $p(I; \Lambda, S)$, the goodness of $p(I; \Lambda, S)$ with respect to $I_n^{obs}$ depends on $S$, and is often measured by the Kullback-Leibler information distance between $f_n(I)$ and $p(I; \Lambda, S)$ (Kullback and Leibler 1951). Then, for a fixed model complexity $K$, the best feature set $S^*$ is selected by the following criterion:
$$S^* = \arg\min_{|S| = K} D_K = \arg\min_{|S| = K} \frac{1}{M} \sum_{n=1}^{M} KL\big(f_n(I),\, p(I; \Lambda, S)\big),$$
where $S$ is chosen from a general filter bank $B$, such as Gabor filters at multiple scales and orientations.

Enumerating all possible feature sets $S$ in a filter bank and comparing their entropies is computationally too expensive. Instead, we propose a stepwise greedy procedure for minimizing the average KL-distance. We start from $S = \emptyset$, with $p(I; \Lambda, S)$ a uniform distribution, and sequentially introduce one filter at a time; at each step, the added filter is the one leading to the maximum decrease in the average KL-distance, and we keep doing this until the decrease is smaller than a certain value. Let $S$ be the currently selected set and $p(I; \Lambda, S)$ the ME distribution duplicating the observed statistics. For the next step, let $S^+ = S \cup \{F^{(\beta)}\}$ be a new feature set, and $p(I; \Lambda^+, S^+)$ the new ME distribution. Our greedy procedure chooses the next filter by minimizing an information criterion IC*; a sketch of the pursuit loop follows.
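In outline, the greedy pursuit looks like the sketch below; `fit_model` and `score_filter` are hypothetical stand-ins for the learning loop of section (2) and the criterion IC developed in this section:

```python
def greedy_filter_selection(filter_bank, fit_model, score_filter, tol=0.01):
    # Start from S empty (p uniform); repeatedly add the candidate filter
    # with the largest criterion value until no candidate is informative.
    S = []
    model = fit_model(S)
    while True:
        scores = {f: score_filter(f, model)
                  for f in filter_bank if f not in S}
        if not scores:
            return S, model
        best = max(scores, key=scores.get)
        if scores[best] < tol:            # decrease too small: stop
            return S, model
        S.append(best)
        model = fit_model(S)              # refit the ME model with S+
```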

To compute IC*, for each $f_n(I)$ and $S$ we introduce a new ME distribution $p(I; \Lambda_n^+, S)$ which reproduces the statistics of $I_n^{obs}$, i.e., $E_{p(I; \Lambda_n^+, S)}[H^{(\alpha)}(z; I)] = H^{(\alpha)}(z; I_n^{obs})$. Since $H^{(\alpha)}(z; I_n^{obs})$ is a closer estimate of the marginal distribution $E_{f_n}[H^{(\alpha)}(z; I)]$ than $\mu_{obs}^{(\alpha)}(z)$, we assume $E_{f_n}[H^{(\alpha)}(z; I)] = H^{(\alpha)}(z; I_n^{obs})$. Thus $f_n(I)$ is better estimated by $p(I; \Lambda_n^+, S)$ than by $p(I; \Lambda, S)$, as stated in the following proposition.

Proposition 3 Given the two ME distributions $p(I; \Lambda, S)$ and $p(I; \Lambda_n^+, S)$ defined above,
$$KL\big(f_n(I),\, p(I; \Lambda, S)\big) = KL\big(f_n(I),\, p(I; \Lambda_n^+, S)\big) + KL\big(p(I; \Lambda_n^+, S),\, p(I; \Lambda, S)\big).$$

[Proof] The proof follows proposition 4 in (Zhu, Wu, and Mumford 1996).

By proposition 3, we can rewrite IC* in terms of distances between ME distributions. The following proposition measures the distance between two ME distributions in terms of the difference of their marginal statistics. In the following, let $p_0(I)$ and $p(I)$ be two ME distributions with $E_{p_0(I)}[H^{(\alpha)}(z; I)] = h_0^{(\alpha)}$ and $E_{p(I)}[H^{(\alpha)}(z; I)] = h^{(\alpha)}$ for $\alpha = 1, 2, \ldots, k$, and denote $h_0 = (h_0^{(1)}, \ldots, h_0^{(k)})$ and $h = (h^{(1)}, \ldots, h^{(k)})$.

Proposition 4 Fixing $h_0$, $KL(p_0(I), p(I))$ is a function of $h$, and
$$KL\big(p_0(I), p(I)\big) = (h - h_0)\, \mathrm{Var}^{-1}(h^*)\, (h - h_0)^T,$$
where $\mathrm{Var}(h^*)$ is a variance matrix of $h^*$ and $h^*$ lies between $h_0$ and $h$.

[Proof] The proof follows proposition 3 in (Zhu, Wu and Mumford 1996).
In practice, for computational convenience, we use the $L_1$-norm distance in place of the quadratic term. In summary, using the $L_1$-norm distance together with equation (4), we approximate IC* by the criterion IC defined below:
$$IC(F^{(\beta)}) = \frac{1}{M} \sum_{n=1}^{M} \Big\| H^{(\beta)}(z; I_n^{obs}) - E_{p(I; \Lambda, S)}\big[H^{(\beta)}(z; I)\big] \Big\|_{1} \;-\; \frac{1}{M} \sum_{n=1}^{M} \Big\| H^{(\beta)}(z; I_n^{obs}) - \mu_{obs}^{(\beta)}(z) \Big\|_{1}.$$
We call the first term the average information gain (AIG) of $F^{(\beta)}$, and the second term the average information fluctuation (AIF).
In practice, we need to sample $p(I; \Lambda, S)$, i.e., synthesize images $\{I_n^{syn}, n = 1, 2, \ldots, M'\}$, and use the averaged histogram of these synthesized images to estimate $E_{p(I; \Lambda, S)}[H^{(\beta)}(z; I)]$. For a filter $F^{(\beta)}$, the bigger AIG is, the more information $F^{(\beta)}$ captures, as it measures the error between the current model and the observations. AIF is a measure of disagreement among the observed images: the bigger AIF is, the less commonly $F^{(\beta)}$ is shared by all images.
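Under the $L_1$ form above, the criterion can be computed directly from per-image observed histograms and histograms of synthesized images; a sketch follows, with the caveat that the paper's exact estimator may differ in detail:

```python
import numpy as np

def information_criterion(obs_hists, syn_hists):
    # obs_hists: (M, n_bins) histograms of the observed images
    # syn_hists: (M', n_bins) histograms of images sampled from p(I; L, S)
    obs = np.asarray(obs_hists)
    mu_obs = obs.mean(axis=0)
    mu_syn = np.asarray(syn_hists).mean(axis=0)    # estimates E_p[H(z; I)]
    aig = np.abs(obs - mu_syn).sum(axis=1).mean()  # model vs. observations
    aif = np.abs(obs - mu_obs).sum(axis=1).mean()  # spread across images
    return aig - aif
```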

Statistics of natural images
We start by studying the statistical properties of natural images. We collected a set of 44 natural images, four of which are shown in figure (2); these images are normalized to have intensities between 0 and 31. As stated in proposition (2), marginal distributions of linear filters alone are capable of characterizing $f(I)$. In the rest of this paper, we shall therefore only study the histograms of linearly filtered images.
Firstly, for some features, the statistics of natural images vary largely from image to image. As an example, we study the $\delta()$ filter, whose response is simply the image intensity itself; the histograms are shown in figure (3). Secondly, for some other filters, the histograms of the filtered images are amazingly consistent across all 44 natural images, and they are very different from those of noise images; therefore the IC is relatively large for these features. For example, we look at the filter $\nabla_x$, whose histograms are plotted in figure (4). The average histogram in figure (4.a) is very different from a Gaussian distribution: figure (5.a) plots it against a Gaussian curve (dashed) of the same mean and variance, and the histogram of natural images has higher kurtosis and heavier tails. Similar results are reported in (Field 1994). To see the difference in the tails, we plot the logarithms of the two curves in figure (5.b). Thirdly, the statistics of natural images are scale invariant with respect to some features. We look at the filter $\nabla_x$ again. Let $\mu_x^{obs}(z)$ be the average histogram of the filtered images. We scale down all observed images from $N \times N$ pixels to $N/2 \times N/2$ pixels by averaging each $2 \times 2$ block of pixels, and compute the average histogram over these scaled images. Continuing this process, we obtain $\mu_x^{obs,[s]}(z)$, where $s = 0, 1, 2, \ldots$ is the index of the layer, or scale, in the image pyramid. Figure (6.a) plots $\mu_x^{obs,[s]}(z)$ for $s = 0, 1, 2$, and they are almost identical. In contrast, figure (6.b) plots the histograms of $\nabla_x I$ with $I$ being a uniform noise image at scales $s = 0, 1, 2$. Similar results are observed for the filter $\nabla_y$.
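This scale-invariance test is easy to reproduce in outline. The sketch below (with stand-in data) builds the pyramid by $2 \times 2$ averaging and prints the $\nabla_x$ histogram at each layer:

```python
import numpy as np

def downsample2(I):
    # Average each 2x2 block, halving the lattice size.
    return 0.25 * (I[0::2, 0::2] + I[1::2, 0::2]
                   + I[0::2, 1::2] + I[1::2, 1::2])

def gx_histogram(I, bins=np.linspace(-16, 16, 34)):
    gx = I[:, 1:] - I[:, :-1]
    h, _ = np.histogram(gx, bins=bins)
    return h / gx.size

I = np.random.rand(256, 256) * 31   # replace with a natural image
for s in range(3):                  # layers s = 0, 1, 2 of the pyramid
    print(f"scale {s}:", np.round(gx_histogram(I), 3))
    I = downsample2(I)
```

For a noise image the histogram contracts as $s$ grows, as in figure (6.b); for natural images it stays nearly unchanged.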

Simulations of prior learning
This section briefly presents the experiments on learning generic prior models. We first choose a general filter bank to characterize the interesting features of natural images. This filter bank includes the intensity filter $\delta()$, the Laplacian-of-Gaussian filters at various scales, and Gabor filters with both sine and cosine components at various scales and orientations; a sketch of such kernels follows. According to the information criterion discussed before, we found that $\Delta$, $\nabla_x$, $\nabla_y$ are sequentially the three most important filters, whose ICs are the biggest among all filters in the filter bank; here $\Delta$ is the second-order differential operator whose impulse response is the $3 \times 3$ window $[0, 1, 0;\; 1, -4, 1;\; 0, 1, 0]$. A detailed account of the ICs of each filter is given in (Zhu and Mumford 1996).
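For concreteness, here is one conventional way to construct such kernels; the sizes and parameters are illustrative choices, not those used in the experiments:

```python
import numpy as np

laplacian = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=float)   # the 3x3 Delta filter

def log_kernel(sigma, size=9):
    # Laplacian of Gaussian at scale sigma, normalized to zero mean.
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    r2 = x**2 + y**2
    k = (r2 - 2 * sigma**2) / sigma**4 * np.exp(-r2 / (2 * sigma**2))
    return k - k.mean()

def gabor_pair(sigma, theta, freq=0.25, size=9):
    # Cosine and sine Gabor components at scale sigma, orientation theta.
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    u = x * np.cos(theta) + y * np.sin(theta)
    env = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return (env * np.cos(2 * np.pi * freq * u),
            env * np.sin(2 * np.pi * freq * u))
```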
Thus a prior model is learned in the form
$$p_3(I; \Lambda) = \frac{1}{Z} \exp\Big\{ -\sum_{\alpha=1}^{3} \sum_{(x,y)} \lambda^{(\alpha)}\big(F^{(\alpha)} * I(x,y)\big) \Big\}.$$
The potential functions $\lambda^{(\alpha)}()$, $\alpha = 1, 2, 3$ are plotted in figure (7). Although we have used the three most informative filters to extract the structures and statistics of natural images, the image synthesized according to model $p_3(I; \Lambda)$, shown in figure (8), is still far from natural. In particular, even though the learned potential functions $\lambda^{(\alpha)}(z)$, $\alpha = 1, 2, 3$ all have flat tails to encourage intensity breaks, the model only generates small speckles instead of the big regions and long edges one might expect. We also sampled the distributions with the potential functions shown in figure (1); the sampled images have even fewer features.
In the next experiment, we study a prior model which has the scale-invariance property with respect to the filters $\nabla_x$ and $\nabla_y$, as we found in natural images.
Given an image $I$ defined on an $N \times N$ lattice $L$, we build an image pyramid by scaling the image down as before, with $I^{[s]}, s = 0, 1, 2, 3$ being the four layers of the pyramid. We set $I^{[0]} = I$, and $I^{[s]}$ is defined on a lattice $L^{[s]}$ of size $N/2^s \times N/2^s$ pixels. Let $\mu_x^{obs,[s]}(z)$ denote the average of $H_x(z; I_n^{[s]})$ over the observed images; similarly we define $H_y(z; I^{[s]})$ and $\mu_y^{obs,[s]}(z)$. In contrast to existing prior models in vision, the learned model in figure (9) has inverted potentials $\lambda_x^{[s]}()$ for $s = 1, 2, 3$. Such potentials have significant meaning in visual computation. In image restoration, when a high intensity difference $\nabla_x I^{[s]}(x, y)$ is present, it is very likely to be noise if $s = 0$; however, this is not true for $s = 1, 2, 3$. Additive noise can hardly pass to the higher layers of the pyramid, because at each layer the $2 \times 2$ averaging operator reduces the variance of the noise by a factor of 4 (see the numerical check below). So when $\nabla_x I^{[s]}(x, y)$ is large for $s = 1, 2, 3$, it is more likely to be a true edge or object boundary. Thus in $p_s(I)$, $\lambda_x^{[0]}()$ suppresses noise at the first layer, whereas $\lambda_x^{[s]}()$, $s = 1, 2, 3$ encourage sharp edges to form, and thus enhance blurred boundaries. We notice that figure (10) shows regions of various scales, and the intensity contrasts are also higher at the boundaries; these are missing in figure (8). Model $p_s(I)$ further leads to the study of a new class of reaction-diffusion equations, in which the inverted terms produce reaction and form patterns. For a more detailed account, and for more experiments such as clutter removal, which we omit here due to space limitations, the reader is referred to (Zhu and Mumford 1997).
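The variance claim is easy to verify numerically; a quick check under the stated $2 \times 2$ averaging:

```python
import numpy as np

noise = np.random.randn(512, 512)   # i.i.d. Gaussian noise, variance 1
for s in range(4):
    print(f"layer {s}: variance = {noise.var():.4f}")
    noise = 0.25 * (noise[0::2, 0::2] + noise[1::2, 0::2]
                    + noise[0::2, 1::2] + noise[1::2, 1::2])
# Prints variances near 1, 1/4, 1/16, 1/64: each layer divides by 4.
```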

Discussion
In this paper, a general theory is proposed for learning generic prior models for natural images. We argue that the same strategy can be used in other applications, for example, learning prior models for MRI images and for 3D surfaces, where prior models of different forms are expected.
An important finding in the learned prior models is the inverted potentials, which are associated with reaction, pattern formation, and feature enhancement. Although the synthesized images bear important features of natural images, they are still far from realistic. In other words, these generic prior models can do very little beyond image restoration. This is mainly due to the fact that all generic prior models are assumed to be translation invariant; this homogeneity assumption is unrealistic.
We call the generic prior models studied in this paper the first-level prior. A more sophisticated prior model, which we call the second-level prior, should incorporate concepts like object geometry; such a prior model is used in image segmentation (Zhu and Yuille 1996). It is our hope that this article will stimulate further investigation along this direction, to build more realistic prior models.

Figure 2: Four of the 44 collected natural images.

Figure 3: The intensity histograms: a. averaged over 44 natural images; b. an individual natural image; c. a uniform noise image.

Figure 4: The histograms of $\nabla_x I$: a. averaged over 44 natural images; b. an individual natural image (the same image as in figure (3.b)); c. a uniform noise image.

Figure 6: a. $\mu_x^{obs,[s]}(z)$ for $s = 0, 1, 2$; b. histograms of a filtered uniform noise image at scales $s = 0$ (solid curve), $s = 1$ (dash-dotted curve), and $s = 2$ (dashed curve).

Figure 7: The three potential functions $\lambda^{(\alpha)}()$, $\alpha = 1, 2, 3$. Dashed curves are the fitting functions $\phi()$ with parameters $(a, c, \gamma)$.

Figure (11.a) shows an input image $I_d$: the lobster boat image in figure (2), distorted with additive i.i.d. Gaussian noise. Thus the data model $p(I_d \mid I)$ is known to be Gaussian. Then, given a prior model $p(I)$, ...
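The passage above breaks off here, but the setup it describes, a Gaussian data model plus a smoothness prior, leads to a standard MAP restoration. Below is a minimal sketch by gradient descent on the negative log-posterior, using an illustrative potential $\phi(\xi) = \xi^2/(1+\xi^2)$ rather than the learned potentials of this paper:

```python
import numpy as np

def restore(I_d, sigma=2.0, weight=1.0, step=0.1, n_iter=200):
    # MAP estimate: minimize ||I - I_d||^2 / (2 sigma^2) + weight * E(I),
    # where E(I) = sum phi(nabla_x I) + phi(nabla_y I).
    I = I_d.copy()
    for _ in range(n_iter):
        g = (I - I_d) / sigma**2                 # gradient of the data term
        for ax in (0, 1):
            d = np.diff(I, axis=ax)              # forward differences
            dphi = 2 * d / (1 + d**2) ** 2       # phi'(xi) for this potential
            pad = [(0, 0), (0, 0)]
            pad[ax] = (1, 1)
            # gradient of the prior term via a discrete divergence
            g -= weight * np.diff(np.pad(dphi, pad), axis=ax)
        I -= step * g
    return I

# Example usage with stand-in data for the noisy boat image.
noisy = np.random.rand(64, 64) * 31
clean = restore(noisy)
```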