

Prior Learning and Gibbs Reaction-Diffusion

Song Chun Zhu and David Mumford

Abstract: This article addresses two important themes in early visual computation. First, it presents a novel theory for learning the universal statistics of natural images, a prior model for typical cluttered scenes of the world, from a set of natural images. Second, it proposes a general framework for designing reaction-diffusion equations for image processing. We start by studying the statistics of natural images, including their scale-invariant properties; then generic prior models are learned to duplicate the observed statistics, based on the minimax entropy theory studied in two previous papers. The resulting Gibbs distributions have potentials of the form U(I; Λ, S) = ∑_{α=1}^{K} ∑_{x,y} λ^(α)((F^(α) * I)(x, y)), with S = {F^(1), F^(2), ..., F^(K)} being a set of filters and Λ = {λ^(1)(), λ^(2)(), ..., λ^(K)()} the potential functions.

INTRODUCTION
In computer vision, many generic prior models have been proposed to capture the universal low-level statistics of natural images. These models presume that surfaces of objects are smooth and that adjacent pixels in images have similar intensity values unless separated by edges, and they are applied in vision algorithms ranging from image restoration and motion analysis to 3D surface reconstruction.
For example, in image restoration, general smoothness models are expressed as probability distributions [9], [4], [20], [11]:

p(I) = (1/Z) exp{ −∑_{x,y} [ψ(∇x I(x, y)) + ψ(∇y I(x, y))] },     (1)

where I is the image, Z is a normalization factor, and ∇x I(x, y) = I(x + 1, y) − I(x, y), ∇y I(x, y) = I(x, y + 1) − I(x, y) are difference operators. Three typical forms of the potential function ψ() are displayed in Fig. 1. The functions in Figs. 1b and 1c have flat tails to preserve edges and object boundaries, and thus they are said to have advantages over the quadratic function in Fig. 1a.
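As a concrete sketch, the three potential forms in Fig. 1 can be written as follows (the parameterizations, in particular the T-function, are our own illustrative choices, not the paper's exact fits):

```python
import numpy as np

def psi_quadratic(x, a=1.0):
    """(a) Quadratic potential: penalizes all gradients, blurring edges."""
    return a * np.square(x)

def psi_line_process(x, a=1.0, q=2.0):
    """(b) Line process: quadratic up to a threshold q, then constant,
    so large intensity breaks (edges) are not penalized further."""
    return a * np.minimum(q**2, np.square(x))

def psi_t_function(x, a=1.0, b=1.0):
    """(c) A flat-tailed 'T-function' (one common choice): saturates at
    a/b for large |x|, again sparing strong edges."""
    return a * np.square(x) / (1.0 + b * np.square(x))
```

The flat tails of (b) and (c) are exactly what lets the corresponding priors preserve object boundaries.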
These prior models have been motivated by regularization theory [26], [18],¹ physical modeling [31], [4],² Bayesian theory [9], [20], and robust statistics [19], [13], [3]. Some connections between these interpretations are also observed in [12], [13], based on effective energy in statistical mechanics. Prior models of this kind are either generalized from traditional physical models [37] or chosen for mathematical convenience. There is, however, little rigorous theoretical or empirical justification for applying these prior models to generic images, and there is little theory to guide the construction and selection of prior models. One may ask the following questions:

1) Why are differential operators good choices for capturing image features?
2) What are the best forms for p(I) and ψ()?
3) Real-world scenes are observed at more or less arbitrary scales, so a good prior model should remain the same for image features at multiple scales. However, none of the existing prior models has this scale-invariance property on the 2D image lattice, i.e., none is renormalizable in terms of renormalization group theory [36].
In previous work on modeling textures, we proposed a new class of Gibbs distributions of the following form [40], [41]:

p(I; Λ, S) = (1/Z) exp{ −∑_{α=1}^{K} ∑_{x,y} λ^(α)((F^(α) * I)(x, y)) }.     (2)

1. The smoothness term is explained as a stabilizer for solving "ill-posed" problems [32].
2. If ψ() is quadratic, then variational solutions minimizing the potential are splines, such as flexible membrane or thin plate models.

In the above equation, S = {F^(1), F^(2), ..., F^(K)} is a set of linear filters, and Λ = {λ^(1)(), ..., λ^(K)()} is a set of potential functions learned so that p(I; Λ, S) reproduces the marginal distributions of the filter responses F^(α) * I estimated over a set of training images, while having maximum entropy; the best set of features {F^(1), F^(2), ..., F^(K)} is selected by minimizing the entropy of p(I) [41]. The conclusion of our earlier papers is that, for an appropriate choice of a small set of filters S, random samples from these models can duplicate very general classes of textures, as far as normal human perception is concerned. Recently, we found that similar ideas of model inference using maximum entropy have also been used in natural language modeling [1].
In this paper, we want to study to what extent probability distributions of this type can be used to model generic natural images, and we try to answer the three questions raised above.
We start by studying the statistics of a database of 44 real-world images, and then we describe experiments in which Gibbs distributions of the form (2) were constructed to duplicate the observed statistics. The learned potential functions λ^(α)(), α = 1, 2, ..., K can be classified into two categories: diffusion terms, which are similar to Fig. 1c, and reaction terms, which, in contrast to all previous models, have inverted potentials (i.e., λ(x) decreasing as a function of |x|).
We find that the partial differential equations given by gradient descent on U(I; Λ, S) are essentially reaction-diffusion equations, which we call the Gibbs Reaction And Diffusion Equations (GRADE). In GRADE, the diffusion components produce denoising effects similar to anisotropic diffusion [25], while the reaction components form patterns and enhance preferred image features.
The learned prior models are applied to the following applications.
First, we run the GRADE starting with white noise images and demonstrate how it can easily generate canonical texture patterns, such as leopard blobs and zebra stripes, as the Turing reaction-diffusion equations do [34], [38]. Thus our theory provides a new method for designing PDEs for pattern synthesis.
Second, we illustrate how the learned models can be used for denoising, image enhancement, and clutter removal by careful choice of both prior and noise models of this type, incorporating the appropriate features extracted at various scales and orientations. The computation simulates a stochastic process, the Langevin equations, for sampling the posterior distribution.
This paper is arranged as follows: Section 2 presents a general theory for prior learning. Section 3 demonstrates some experiments on the statistics of natural images and prior learning. Section 4 studies the reaction-diffusion equations. Section 5 demonstrates experiments on denoising, image enhancement, and clutter removal. Finally, Section 6 concludes with a discussion.

Goal of Prior Learning and Two Extreme Cases
We define an image I on an N × N lattice L to be a function such that, for any pixel (x, y), I(x, y) takes values in a set of gray levels that is either an interval of R or a finite subset of Z. We assume that there is an underlying probability distribution f(I) on the image space, and let {I_n^obs, n = 1, ..., M} be a set of observed images which are independent samples from f(I). The objective of learning a generic prior model is to look for common features and their statistics in the observed natural images. Such features and their statistics are then incorporated into a probability distribution p(I) as an estimate of f(I), so that p(I), as a prior model, will bias vision algorithms against image features which are not typical in natural images, such as noise distortion and blurring. For this objective, it is reasonable to assume that any image feature has an equal chance to occur at any location, so f(I) is translation invariant with respect to (x, y). We will discuss the limits of this assumption in Section 6.
Now, to learn a prior model from the observed images, we immediately have two simple solutions. The first matches the intensity histogram using the δ() filter. The second uses each observed image itself as a filter F^(obs n), where <·, ·> is the inner product, ψ₂(0) = 0, and ψ₂(x) = ∞ if x ≠ 0, i.e., ψ₂() is similar to the Potts model [37]. These two solutions stand for two typical mechanisms for constructing probability models in the literature. The first is often used for image coding [35], and the second is a special case of the learning scheme using radial basis functions (RBF) [27]. Although the philosophies for learning these two prior models are very different, we observe that they share two common properties.
1) The potentials ψ₁(), ψ₂() are built on the responses of linear filters.
2) In both cases, p(I) reproduces the marginal distributions of the corresponding filter responses observed in the training images.

This second property is in general not satisfied by the existing prior models in (1). p(I) in both cases meets our objective for prior learning, but intuitively neither is a good choice. In (5), the δ() filter does not capture spatial structures larger than one pixel, and in (7), the filters F^(obs n) are too specific to predict features in unobserved images.
In fact, the filters used above lie at the two extremes of the spectrum of all linear filters. As discussed by Gabor [7], the δ filter is localized in space but is extended uniformly in frequency. In contrast, some other filters, like sine waves, are well localized in frequency but extended in space. The filter F^(obs n) includes a specific combination of all the components in both space and frequency. A quantitative analysis of the goodness of these filters is given in Table 1 in Section 3.2.

Learning Prior Models by Minimax Entropy
To generalize the two extreme examples, it is desirable to compute a probability distribution which duplicates the observed marginal distributions for an arbitrary set of filters, linear or nonlinear.This goal is achieved by a minimax entropy theory studied for modeling textures in our previous papers [40], [41].
Given a set of filters {F^(α), α = 1, 2, ..., K} and the observed statistics {μ_obs^(α), α = 1, 2, ..., K}, a maximum entropy (ME) distribution is derived which has the Gibbs form of (2). Here we consider linear filters only, and Λ = {λ^(1)(), ..., λ^(K)()} is a set of potential functions on the features extracted by S. In practice, image intensities are discretized into a finite number of gray levels, and the filter responses are divided into a finite number of bins; therefore each λ^(α)() is approximated by a piecewise constant function, i.e., a vector, which we denote by λ^(α), α = 1, 2, ..., K.
The λ^(α)'s are computed in a nonparametric way so that the learned p(I; Λ, S) reproduces the observed statistics:

E_{p(I; Λ, S)}[H^(α)(z; I)] = μ_obs^(α)(z),   α = 1, 2, ..., K,

where H^(α)(z; I) is the histogram of the responses of filter F^(α) applied to I.
Therefore as far as the selected features and their statistics are concerned, we cannot distinguish between p(I; L, S) and the "true" distribution f(I).
Unfortunately, there is no simple way to express the λ^(α)'s in terms of the μ_obs^(α)'s as in the two extreme examples. To compute the λ^(α)'s, we adopted the Gibbs sampler [9], which simulates an inhomogeneous Markov chain in image space. This Monte Carlo method iteratively samples from the distribution p(I; Λ, S), computes the histograms of the filter responses for this sample, and updates the λ^(α)'s to bring the histograms closer to the observed ones.
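The learning loop can be illustrated on a toy 1D analogue, matching the histogram of nearest-neighbor differences under a single "filter" (the names, bin layout, and learning rate below are our own; this is a sketch of the scheme, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
BINS = np.linspace(-8, 8, 17)   # bin edges for filter responses (16 bins)
G = 8                           # gray levels 0 .. G-1

def hist(signal):
    """Normalized histogram of nearest-neighbor differences, the 1D
    analogue of the gradient filter."""
    h, _ = np.histogram(np.diff(signal), bins=BINS)
    return h / max(h.sum(), 1)

def gibbs_sweep(signal, lam):
    """One sweep of the Gibbs sampler for p(I) ~ exp(-sum_x lam[bin(x)])."""
    for i in range(len(signal)):
        energies = []
        for g in range(G):
            s = signal.copy(); s[i] = g
            d = np.clip(np.digitize(np.diff(s), BINS) - 1, 0, len(BINS) - 2)
            energies.append(lam[d].sum())
        p = np.exp(-np.array(energies)); p /= p.sum()
        signal[i] = rng.choice(G, p=p)
    return signal

def learn_lambda(observed, sweeps=30, eta=2.0):
    """Iteratively raise lam where the model's histogram overshoots the
    observed one and lower it where it undershoots."""
    lam = np.zeros(len(BINS) - 1)
    h_obs = hist(observed)
    syn = rng.integers(0, G, size=len(observed))
    for _ in range(sweeps):
        syn = gibbs_sweep(syn, lam)
        lam += eta * (hist(syn) - h_obs)
    return lam, syn
```

In the real experiments the same update runs on 2D images with many filters, which is far more expensive; the structure of the loop is unchanged.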
For a detailed account of the computation of l (a) s, the readers are referred to [40], [41].
In our previous papers, the following two propositions were established.

PROPOSITION 1. Given a filter set S and observed statistics {μ_obs^(α), α = 1, ..., K}, there is a unique solution for Λ = {λ^(1), ..., λ^(K)}.

PROPOSITION 2. f(I) is determined by its marginal distributions; thus p(I) = f(I) if it reproduces all the marginal distributions of linear filters.
But for computational reasons, it is often desirable to choose a small set of filters which most efficiently capture the image structures. Given a set of filters S and an ME distribution p(I; Λ, S), the goodness of p(I; Λ, S) is often measured by the Kullback-Leibler distance between p(I; Λ, S) and the ideal distribution f(I) [17], where S is chosen from a general filter bank B, such as Gabor filters at multiple scales and orientations. Enumerating all possible feature sets S in the filter bank and comparing their entropies is computationally too expensive. Instead, in [41] we propose a stepwise greedy procedure for minimizing the KL-distance. We start from S = ∅ and p(I; Λ, S) a uniform distribution, and introduce one filter at a time. Each added filter is chosen to maximally decrease the KL-distance, and we keep doing this until the decrease is smaller than a certain value.
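The stepwise greedy pursuit can be sketched as follows; `info_gain` stands in for the information-gain computation, which we leave abstract here (the callback and threshold are our own illustrative choices):

```python
import numpy as np

def greedy_select(bank, info_gain, threshold=0.05):
    """Stepwise greedy filter pursuit: repeatedly add the filter with the
    largest information gain relative to the current model, stopping when
    the best remaining gain falls below `threshold`."""
    selected = []
    remaining = list(bank)
    while remaining:
        gains = [info_gain(selected, f) for f in remaining]
        k = int(np.argmax(gains))
        if gains[k] < threshold:
            break                      # no filter adds enough information
        selected.append(remaining.pop(k))
    return selected
```

In practice each call to `info_gain` requires sampling the current ME model, which is what makes the exhaustive search infeasible and the greedy procedure attractive.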
In the experiments of this paper, we have used a simpler measure of the "information gain" achieved by adding one new filter to our feature set S. This is roughly an L1-distance (vs. the L2-measure implicit in the Kullback-Leibler distance); the readers are referred to [42] for a detailed account.

DEFINITION 3. Given S = {F^(1), F^(2), ..., F^(K)} and p(I; Λ, S) defined above, the information criterion (IC) for each filter F^(β) in the bank is the average information gain (AIG) of adding F^(β), discounted by its average information fluctuation (AIF), defined below.
For a filter F^(β), the bigger AIG is, the more information F^(β) captures, as it reports the error between the current model and the observations. AIF is a measure of disagreement among the observed images: the bigger AIF is, the less their responses to F^(β) have in common.

EXPERIMENTS ON NATURAL IMAGES
This section presents experiments on learning prior models, and we start by exploring the statistical properties of natural images.

Statistics of Natural Images
It is well known that natural images have statistical properties distinguishing them from random noise images [28], [6], [24]. In our experiments, we collected a set of 44 natural images, six of which are shown in Fig. 2. These images come from various sources: some were digitized from books and postcards, and some are from a Corel image database. Our database includes both indoor and outdoor pictures, country and urban scenes, and all images are normalized to have intensities between zero and 31. As stated in Proposition 2, marginal distributions of linear filters alone are capable of characterizing f(I). In the rest of this paper, we shall only study the following bank B of linear filters.
1) The intensity filter δ().
2) The Laplacian of Gaussian filters:

LG(x, y; s) = const · (x² + y² − s²) e^{−(x² + y²)/s²},

where s = √2 σ stands for the scale of the filter. We denote these filters by LG(s). A special filter is the one at the smallest scale, with a 3 × 3 window; we denote it by ∆.
3) Gabor filters with both sine and cosine components, which are models for the frequency and orientation sensitive simple cells.

G(x, y; s, θ) = const · Rot_θ [ g(x, y; s) e^{−i 2πx/s} ],

where g is an elongated Gaussian envelope at scale s. It is a sine wave at frequency 2π/s modulated by an elongated Gaussian function and rotated by angle θ.
We denote the real and imaginary parts of G(x, y; s, θ) by Gcos(s, θ) and Gsin(s, θ). Two special Gsin(s, θ) filters are the gradients ∇x and ∇y.
4) We approximate large-scale filters by filters of small window size applied at the higher levels of the image pyramid, where the image at one level is a "blown-down" version (i.e., averaged in 2 × 2 blocks) of the image below.
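A sketch of how such a filter bank can be constructed numerically (window sizes, normalization, and the elongation factor of the Gabor envelope are our own assumptions, not the paper's exact constants):

```python
import numpy as np

def laplacian_of_gaussian(size, s):
    """LG(s): isotropic center-surround filter (x^2+y^2-s^2)*exp(-(x^2+y^2)/s^2),
    shifted to have zero DC response."""
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    r2 = x**2 + y**2
    f = (r2 - s**2) * np.exp(-r2 / s**2)
    return f - f.mean()                     # enforce zero response to constants

def gabor(size, s, theta):
    """Gabor pair: complex sine wave at frequency 2*pi/s under an elongated
    Gaussian envelope, rotated by theta. Real part = Gcos, imaginary = Gsin."""
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (yr / 2.0)**2) / (2.0 * s**2))  # elongated along yr
    wave = np.exp(1j * 2.0 * np.pi * xr / s)
    return envelope * wave
```

Scaling s up enlarges the window; the pyramid trick in item 4) avoids those large convolutions by shrinking the image instead.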
We observed three important aspects of the statistics of natural images.
First, for some features, the statistics of natural images vary widely from image to image. Consider the δ() filter as in Section 2.1. The average intensity histogram of the 44 images, μ_obs^(δ)(z), is plotted in Fig. 3a, and Fig. 3b is the intensity histogram of an individual image (the temple image in Fig. 2). It appears that μ_obs^(δ)(z) is close to a uniform distribution (Fig. 3c), while the difference between Fig. 3a and Fig. 3b is very big. Thus the IC for the filter δ() should be small (see Table 1). Second, for many other filters, the histograms of their responses are amazingly consistent across all 44 natural images, and they are very different from the histograms of noise images. For example, consider the filter ∇x. Fig. 4a is the average histogram of the 44 filtered natural images, Fig. 4b is the histogram of an individual filtered image (the same image as in Fig. 3b), and Fig. 4c is the histogram of a filtered uniform noise image.
The average histogram in Fig. 4a is very different from a Gaussian distribution. To see this, Fig. 5a plots it against a Gaussian curve (dashed) of the same mean and variance. The histogram of natural images has higher kurtosis and heavier tails. Similar results are reported in [6]. To see the difference in the tails, Fig. 5b plots the logarithms of the two curves.
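The heavy-tail observation is easy to reproduce: gradients of a piecewise-flat signal (a crude stand-in for a natural image) have far higher kurtosis than a Gaussian. The signal construction below is our own toy example, not the paper's data:

```python
import numpy as np

def kurtosis(z):
    """Sample kurtosis E[(z - mu)^4] / sigma^4; equals 3 for a Gaussian."""
    z = np.asarray(z, dtype=float)
    d = z - z.mean()
    return (d**4).mean() / (d**2).mean()**2

rng = np.random.default_rng(1)
# Piecewise-flat 1D signal: 50 constant blocks of length 20 plus faint texture.
signal = np.repeat(rng.integers(0, 32, size=50), 20).astype(float)
signal += rng.normal(0.0, 0.1, size=signal.size)
grad = np.diff(signal)                                 # 1D analogue of the gradient filter
noise_grad = np.diff(rng.normal(0.0, 1.0, size=signal.size))
```

The gradient of the piecewise-flat signal is mostly near zero with occasional large jumps at block boundaries, which is exactly the mostly-flat, heavy-tailed shape of Fig. 4a.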
Third, the statistics of natural images are essentially scale invariant with respect to some features. As an example, we look at the filters ∇x and ∇y. For the filter ∇x, let μ_{x,s}(z) be the average histogram of ∇x I^[s] over the 44 natural images, where I^[s] is the image at level s of the pyramid. This scale invariance is largely due to the facts that 1) natural images contain objects of all sizes, and 2) natural scenes are viewed and made into images at arbitrary distances.

Empirical Prior Models
In this section, we learn the prior models according to the theory proposed in Section 2 and analyze the efficiency of the filters quantitatively.

Experiment I
We start from S = ∅ and p₀(I) a uniform distribution. We compute the AIF, AIG, and IC for all filters in our filter bank, and list the results for a small number of filters in Table 1. The filter ∆ has the biggest IC (= 0.642) and thus is chosen as F^(1). An ME distribution p₁(I; Λ, S) is learned, and the information criterion for each filter is shown in the column headed p₁(I) in Table 1. We notice that the IC for the filter ∆ drops to near zero, and the IC also drops for the other filters, because these filters are in general not independent of ∆. Some small filters like LG(1) have smaller ICs than others, due to their higher correlation with ∆.
The big filters with larger IC are investigated in Experiment II. In this experiment, we choose ∇x and ∇y as F^(2) and F^(3), as in other prior models, and a prior model p₃(I) is learned with the three corresponding potential functions. A synthesized image sampled from p₃(I) is displayed in Fig. 8.
So far, we have used three filters to characterize the statistics of natural images, and the synthesized image in Fig. 8 is still far from natural ones. In particular, even though the learned potential functions λ^(α)(z), α = 1, 2, 3 all have flat tails to preserve intensity breaks, they only generate small speckles instead of the big regions and long edges one might expect. Based on this synthesized image, we compute the AIG and IC for all filters, and the results are listed in Table 1 in column p₃(I).

Experiment II
It is clear that we need large-scale filters to do better. Rather than using the large-scale Gabor filters, we chose to use ∇x and ∇y at four different scales and to impose explicitly the scale-invariant property that we find in natural images.
Given an image I defined on an N × N lattice, we build a pyramid in the same way as before. Let I^[s], s = 0, 1, 2, 3 be the four layers of the pyramid, and let H_{x,s}(z) denote the histogram of ∇x I^[s](x, y) and H_{y,s}(z) the histogram of ∇y I^[s](x, y). We ask for a probability model p(I) which satisfies E_p[H_{x,s}] = E_p[H_{y,s}] = μ(z) for s = 0, 1, 2, 3, where μ(z) is the average of the observed histograms of ∇x I^[s] and ∇y I^[s] over all 44 natural images at all scales.

TABLE 1 THE INFORMATION CRITERION FOR FILTER SELECTION
Fig. 9 displays λ_{x,s}(), s = 0, 1, 2, 3. At the beginning of the learning process, all λ_{x,s}() are of the form displayed in Fig. 7, with low values around zero to encourage smoothness. As the learning proceeds, λ_{x,3}() gradually turns "upside down," with smaller values at the two tails. Then λ_{x,2}() and λ_{x,1}() turn upside down one by one. Similar results are observed for λ_{y,s}(), s = 0, 1, 2, 3. Fig. 11 is a typical sample image from p_s(I). To demonstrate that it has scale-invariant properties, Fig. 10 shows the histograms H_{x,s} and log H_{x,s} of this synthesized image for s = 0, 1, 2, 3.
The learning process iterates for more than 10,000 sweeps. To verify the learned λ()'s, we restarted a homogeneous Markov chain from a noise image using the learned model, and found that the Markov chain reaches a perceptually similar image after 6,000 sweeps.

Remark 1
In Fig. 9, we notice that the λ_{x,s}() are inverted, i.e., decreasing functions of |z|, for s = 1, 2, 3, distinguishing this model from other prior models in computer vision. First, as the image intensity has the finite range [0, 31], ∇x I^[s] takes values in [−31, 31]. Therefore we may define λ_{x,s}(z) = 0 for |z| > 31, so p_s(I) is still well-defined. Second, such inverted potentials have significant meaning in visual computation. In image restoration, when a large intensity difference ∇x I(x, y) is present, it is very likely to be noise if s = 0.
However, this is not true for s = 1, 2, 3. Additive noise can hardly pass to the higher layers of the pyramid, because at each layer the 2 × 2 averaging operator reduces the variance of the noise by a factor of four. When ∇x I^[s](x, y) is large for s = 1, 2, 3, it is more likely to be a true edge or object boundary. So in p_s(I), λ_{x,0}() suppresses noise at the first layer, while λ_{x,s}(), s = 1, 2, 3 encourage sharp edges to form, and thus enhance blurred boundaries. We notice that regions of various scales emerge in Fig. 11, and the intensity contrasts are also higher at the boundaries. These appearances are missing in Fig. 8.
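The variance claim is immediate to check numerically (a minimal sketch; `blow_down` is our name for the 2 × 2 averaging step):

```python
import numpy as np

def blow_down(img):
    """One pyramid step: average non-overlapping 2 x 2 blocks."""
    h, w = img.shape
    h, w = h // 2 * 2, w // 2 * 2          # crop to even dimensions
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(2)
noise = rng.normal(0.0, 1.0, size=(256, 256))
level1 = blow_down(noise)
# Averaging four i.i.d. values divides the variance by four, so additive
# noise is strongly attenuated at the higher pyramid layers while large
# structures (true edges) survive the averaging.
```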

Remark 2
Based on the image in Fig. 11, we computed the IC and AIG for all filters and list them under column p_s(I) in Table 1. We also compare the two extreme cases discussed in Section 2.1. For the δ() filter, AIF is very big, and AIG is only slightly bigger than AIF. Since none of the prior models that we learned has any preference over the image intensity domain, the image intensity has a uniform distribution, but we limit it inside [0, 31]; thus the first row of Table 1 has the same value for IC and AIG. For the filters F^(obs i), AIF = (M − 1)/M, i.e., the biggest among all filters, and AIG → 1. In both cases, the ICs are the two smallest.

GIBBS REACTION-DIFFUSION EQUATIONS

From Gibbs Distribution to Reaction-Diffusion Equations
The empirical results in the previous section suggest that the forms of the potentials λ^(α)(z) learned from images of real-world scenes can be divided into two classes: upright curves λ(z), for which λ() is an even function increasing with |z|, and inverted curves, for which the opposite happens. A similar phenomenon was observed in our learned texture models [40].
In Fig. 9, the λ_{x,s}(z) are fitted to the family of functions (see the dashed curves)

φ(ξ) = a ( 1 − 1 / (1 + (|ξ − ξ₀| / b)^γ) ),

where ξ₀ and b are, respectively, translation and scaling constants, and |a| weights the contribution of the filter.
In general, the Gibbs distribution learned from images in a given application has a potential function of the form U(I; Λ, S) = ∑_{α=1}^{K} ∑_{x,y} φ^(α)((F^(α) * I)(x, y)). Instead of defining a whole distribution with U, one can use U to set up a variational problem. In particular, one can attempt to minimize U by gradient descent. This leads to a nonlinear parabolic partial differential equation:

∂I/∂t = −∑_{α=1}^{K} F̃^(α) * φ^(α)′(F^(α) * I),     (14)

with F̃^(α)(x, y) = F^(α)(−x, −y). Thus, starting from an input image I(x, y, 0) = I_in, the first term diffuses the image by reducing the gradients, while the second term forms patterns as the reaction term. We call (14) the GRADE. Since the computation of (14) involves convolving twice for each of the selected filters, a conventional way to compute efficiently is to build an image pyramid, so that filters with large scales and low frequencies can be scaled down into small ones at the higher levels of the pyramid. This is appropriate especially when the filters are selected from a bank of multiple scales, such as the Gabor filters and the wavelet transforms. We adopt this representation in our experiments.
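A single GRADE step can be sketched numerically as follows (explicit Euler with periodic boundaries; the convolution helper, the constants in φ, and the sign convention for reaction filters are our own illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def conv2_same_wrap(img, ker):
    """2D convolution, periodic boundary, 'same' size. Kernel dimensions
    must be odd so that the adjoint (mirrored-kernel) trick below is exact."""
    kh, kw = ker.shape
    out = np.zeros(img.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += ker[i, j] * np.roll(img, (i - kh // 2, j - kw // 2), axis=(0, 1))
    return out

def phi(z, a=1.0, b=5.0, gamma=1.5):
    """Fitted potential family phi(z) = a*(1 - 1/(1 + (|z|/b)^gamma))."""
    return a * (1.0 - 1.0 / (1.0 + (np.abs(z) / b)**gamma))

def phi_prime(z, a=1.0, b=5.0, gamma=1.5):
    """Derivative of phi (smooth case gamma > 1; the cusp case gamma < 1
    would need the clipping discussed in the text)."""
    u = (np.abs(z) / b)**gamma
    return a * gamma * (np.abs(z) / b)**(gamma - 1.0) * np.sign(z) / (b * (1.0 + u)**2)

def grade_step(img, filters, signs, dt=0.1):
    """One explicit Euler step of dI/dt = -sum_a sgn_a * F~a * phi'(Fa * I);
    sgn = +1 for diffusion filters, -1 for reaction filters."""
    dI = np.zeros(img.shape, dtype=float)
    for F, sgn in zip(filters, signs):
        resp = conv2_same_wrap(img, F)
        dI -= sgn * conv2_same_wrap(phi_prime(resp), F[::-1, ::-1])  # mirrored kernel
    return img + dt * dI
```

With all signs positive this is pure diffusion and each step lowers U; flipping a sign turns that filter's term into a reaction force that rewards strong responses.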
For an image I, let I^[s] be the image at level s = 0, 1, 2, ... of a pyramid, with I^[0] = I; the GRADE equations can be derived similarly for this pyramidal representation.

Anisotropic Diffusion and Gibbs Reaction-Diffusion
This section compares the GRADEs with previous diffusion equations in vision.
In [25], [23], anisotropic diffusion equations for generating image scale spaces are introduced in the following form:

∂I/∂t = div( c(x, y, t) ∇I ),     (15)

where div is the divergence operator, i.e., div V = ∂P/∂x + ∂Q/∂y for V = (P, Q). Perona and Malik defined the heat conductivity c(x, y, t) as a function of the local gradient magnitude, for example c = 1/(1 + (|∇I|/K)²). Equation (15) minimizes, in continuous form, the energy function (16), whose potential functions are plotted in Fig. 12. Similar forms of the energy functions are widely used as prior distributions [9], [4], [20], [11], and they can also be equivalently interpreted in the sense of robust statistics [13], [3].
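For comparison, the classic Perona-Malik scheme can be sketched in a few lines (4-neighbor discretization with edge-replicating boundary; the step size and K are our own illustrative choices):

```python
import numpy as np

def perona_malik_step(I, dt=0.2, K=4.0):
    """One step of Perona-Malik anisotropic diffusion,
    I_t = div(c(|grad I|) grad I), with c(g) = 1/(1 + (g/K)^2)."""
    P = np.pad(I, 1, mode='edge')          # replicate boundary pixels
    dN = P[:-2, 1:-1] - I                  # differences to the four neighbors
    dS = P[2:, 1:-1] - I
    dE = P[1:-1, 2:] - I
    dW = P[1:-1, :-2] - I
    c = lambda d: 1.0 / (1.0 + (d / K)**2) # conductivity drops at strong edges
    return I + dt * (c(dN) * dN + c(dS) * dS + c(dE) * dE + c(dW) * dW)
```

Here the conductivity depends only on the local gradient, whereas the GRADE conductivities in (17) depend on the responses of a whole set of filters, i.e., on local patterns.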
In the following, we address three important properties of the Gibbs reaction-diffusion equations.
First, we note that (14) extends (15) to a discrete lattice by defining a vector field and a divergence operator on a graph, so that (14) can be written in the divergence form (17). Compared to (15), which transfers "heat" among adjacent pixels, (17) transfers "heat" in many directions in a graph, and the conductivities are defined as functions of the local patterns, not just the local gradients.

Second, as Fig. 13 shows, φ^(α)′(0) is undefined when γ < 1; in the numerical simulations it may take any value bounded by the summation of the other, well-defined terms in the differential equation. Intuitively, when γ < 1 and ξ = (F^(α) * I)(x, y) = 0, φ^(α)′(0) forms an attractive basin in its neighborhood N^(α)(x, y), specified by the filter window of F^(α). If a pixel (u, v) ∈ N^(α)(x, y) is involved in multiple zero filter responses, it accumulates the depths of the attractive basins generated by each filter. Thus, unless the absolute value of the driving force from the other well-defined terms is larger than the total depth of the attractive basin at (u, v), I(u, v) will stay unchanged. In the image restoration experiments in later sections, γ < 1 shows better performance in forming piecewise constant regions.

Third, the learned potential functions confirm and improve the existing prior models and diffusion equations; but, more interestingly, the reaction terms appear here for the first time, and they can produce patterns and enhance preferred features. We will demonstrate this property in the experiments below.

Gibbs Reaction-Diffusion for Pattern Formation
In the literature, there are many nonlinear PDEs for pattern formation, of which the following two examples are interesting:

1) The Turing reaction-diffusion equation, which models the chemical mechanism of animal coats [33], [21]. Two canonical patterns that the Turing equations can synthesize are leopard blobs and zebra stripes [34], [38]. These equations have also been applied to image processing, such as image halftoning [29], and a theoretical analysis can be found in [15].
2) The Swindale equation, which simulates the development of ocular dominance stripes in the visual cortex of cats and monkeys [30]. The simulated patterns are very similar to zebra stripes.
In this section, we show that these patterns can be easily generated with only two or three filters using the GRADE. We run (14) starting with I(x, y, 0) a uniform noise image, and the GRADE converges to a local minimum. Some synthesized texture patterns are displayed in Fig. 14.
For all six patterns in Fig. 14, only a small number of filters and parameters were needed. It seems that the leopard blobs and zebra stripes are among the most canonical patterns, generated with easy choices of filters and parameters. As shown in [40], the Gibbs distributions are capable of modeling a large variety of texture patterns, but the filters and the forms of ψ(x) have to be learned for each given texture pattern.

IMAGE ENHANCEMENT AND CLUTTER REMOVAL
So far we have studied the use of a single energy function U(I), either as the log likelihood of a probability distribution or as a function of I to be minimized by gradient descent. In image processing, we often need to model both the underlying images and some distortions, and to maximize a posterior distribution. Suppose the distortions are additive, i.e., an input image is I_in = I + C. In many applications, the distortion image C is not i.i.d. Gaussian noise but clutter with structures, such as trees in front of a building or a military target. Such clutter is very hard to handle with edge detection and image segmentation algorithms.
We propose to model clutter by an extra Gibbs distribution, which can be learned from training images by the minimax entropy theory, as we did for the underlying image I. Thus an extra pyramidal representation for I_in − I is needed in Gibbs distribution form, as shown in Fig. 15. The resulting posterior distribution is still of Gibbs form, with potential function U(I; Λ, S) + U_C(I_in − I), where U_C() is the potential of the clutter distribution. Analyses of the convergence of the equations can be found in [14], [10], [8]. The computational load of the annealing process is notorious, but for applications like denoising, a fast decrease of temperature may not affect the final result very much.
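The stochastic sampling step mentioned above (the Langevin equation) can be sketched as follows; `grad_U` stands for the gradient of the full posterior potential, prior plus clutter terms, which we leave abstract (the step size and temperature handling are our own choices):

```python
import numpy as np

def langevin_step(I, grad_U, rng, dt=0.01, T=1.0):
    """One step of the Langevin equation for sampling p(I) ~ exp(-U(I)/T):
    gradient descent on U plus Brownian forcing scaled by temperature T.
    Annealing T toward zero recovers plain gradient descent (MAP estimation)."""
    noise = rng.normal(size=I.shape)
    return I - dt * grad_U(I) + np.sqrt(2.0 * T * dt) * noise
```

For a quadratic potential U(I) = ∑ I²/2 the chain is an Ornstein-Uhlenbeck process whose stationary pixel variance is approximately T, which gives a quick sanity check of the noise scaling.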

Experiment I
In the first experiment, we take U_C to be quadratic, i.e., C is an i.i.d. Gaussian noise image. We first compare the performance of the three prior models p_l(I), p_t(I), and p_s(I), whose potential functions take, respectively, the three forms discussed above.

Experiment II
In many applications, i.i.d. Gaussian models for distortions are not sufficient. For example, in Fig. 17a, the tree branches in the foreground make image segmentation and object recognition extremely difficult, because they cause strong edges across the image. Modeling such clutter is a challenging problem in many applications. In this paper, we treat clutter only as a two-dimensional pattern, ignoring its geometry and 3D structure. We collected a set of images of buildings and a set of images of trees, all against a clean background (the sky). For the tree images, we translate the image intensities to [−31, 0], i.e., zero for sky. Since the trees are always darker than the buildings, the negative intensities approximately account for the occlusion effects. We learn a Gibbs distribution for each set in the pyramid, and these models are then adopted, respectively, as the prior distribution and the likelihood in (18). We recover the underlying images by maximizing the posterior distribution using the stochastic process. As a comparison, we ran the anisotropic diffusion process [25] on Fig. 19a; images at iterations t = 50, 100, 300 are displayed in Fig. 20. As t → ∞, I(t) becomes a flat image. A robust anisotropic diffusion equation was recently reported in [2].

CONCLUSION
In this paper, we studied the statistics of natural images, and based on them we proposed a novel theory for learning generic prior models: the universal statistics of real-world scenes. We argue that the same strategy developed in this paper can be used in other applications, for example, learning probability models for MRI images and 3D depth maps.
The learned prior models demonstrate some important properties, such as the "inverted" potential terms for pattern formation and image enhancement. The expressive power of the learned Gibbs distributions allows us to model structured noise, i.e., clutter in natural scenes. Furthermore, our prior learning method provides a novel framework for designing reaction-diffusion equations based on the observed images in a given application, without modeling the physical or chemical processes as was done before [33].
Although the synthesized images bear important features of natural images, they are still far from realistic ones.In other words, these generic prior models can do very little beyond image restoration.This is mainly due to the fact that all generic prior models are assumed to be translation invariant, and this homogeneity assumption is unrealistic.We call the generic prior models studied in this paper the first-level prior.A more sophisticated prior model should incorporate concepts like object geometry, and we call such prior models second-level priors.Diffusion equations derived from this second-level prior are studied in image segmentation [39], and in scale space of shapes [16].A discussion of some typical diffusion equations is given in [22].It is our hope that this article will stimulate further investigations on building more realistic prior models as well as sophisticated PDEs for visual computation.

Fig. 1. Three existing forms for ψ(). (a) Quadratic: ψ(x) = ax². (b) Line process: ψ(x) = a min(q², x²). (c) T-function: a flat-tailed potential.

Then, for a fixed model complexity K, the best feature set S* is selected by the following criterion: we call the first term the "average information gain" (AIG) of choosing F^(β), and the second term the "average information fluctuation" (AIF). Intuitively, AIG measures the average error between the filter responses in the database and the marginal distribution of the current model p(I; Λ, S). In practice, we need to sample p(I; Λ, S); thus we synthesize images I_n^syn and estimate E_{p(I; Λ, S)}[H^(β)(z; I)] by the sample average of the synthesized histograms.
∇x and ∇y. For each observed image I_obs^n in the database, we build a pyramid with I_s^n being the image at the sth layer.
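The pyramid construction and the per-scale gradient histograms can be sketched as follows; 2x2 block averaging as the downsampling operator is an assumption for illustration.

```python
import numpy as np

def downsample(img):
    """One pyramid level: average 2x2 blocks and subsample."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2]
                   + img[0::2, 1::2] + img[1::2, 1::2])

def grad_x_histogram(img, bins):
    """Normalized histogram of horizontal differences (the response to the
    gradient filter in the x direction)."""
    dx = img[:, 1:] - img[:, :-1]
    return np.histogram(dx, bins=bins, density=True)[0]

def pyramid_histograms(img, n_scales, bins):
    """Histograms mu_{x,s}(z) for s = 0, 1, ..., n_scales - 1."""
    out = []
    for _ in range(n_scales):
        out.append(grad_x_histogram(img, bins))
        img = downsample(img)
    return out
```

For natural images the curves at s = 0, 1, 2 nearly coincide (scale invariance), whereas for noise images they broaden with s, as in Fig. 6b.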

the average histograms of ∇x I_obs^n over n = 1, 2, ..., 44. Fig. 6a plots μ_{x,s}(z) for s = 0, 1, 2, and they are almost identical. To see the tails more clearly, we display log μ_{x,s}(z), s = 0, 1, 2, in Fig. 6c; the differences between them are still small. Similar results are observed for μ_{y,s}(z), s = 0, 1, 2, the average histograms of ∇y I_obs^n. In contrast, Fig. 6b plots the histograms of ∇x I_s with I_s being a uniform noise image at scales s = 0, 1, 2. Combining the second and the third aspects above, we conclude that the histograms of ∇x I_s^n and ∇y I_s^n are very consistent across all observed natural images and across scales s = 0, 1, 2. The scale invariant property of natural images is largely caused by the following facts:

p_3(I) is learned with a potential defined for z ∈ [-22, 22]. These three curves are fitted with functions of the form ψ(ξ), with N × N being the size of the synthesized image.
Λ[s] is the image lattice at level s, and μ(z) is the average of the observed histograms of ∇x I[s] and ∇y I[s] over all 44 natural images at all scales. This results in a maximum entropy distribution p_s(I) with an energy of the following form. The filter set is now divided into two parts, S = S_d ∪ S_r, with S_d = {F^(α), α = 1, 2, ..., n_d} and S_r = {F^(α), α = n_d + 1, ..., K}. In most cases S_d consists of filters such as ∇x, ∇y, Δ, which capture the general smoothness of images, and S_r contains filters which characterize the prominent features of a class of images, e.g., Gabor filters at various orientations and scales which respond to the larger edges and bars.
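The total Gibbs potential then splits into a diffusion part over S_d and a reaction part over S_r. A minimal sketch, assuming a single shared potential function for each group (the filters and potentials here are placeholders, not the learned ones):

```python
import numpy as np
from scipy.signal import convolve2d

def gibbs_energy(img, diff_filters, react_filters, psi_d, psi_r):
    """U(I) = sum over S_d of psi_d(F * I) + sum over S_r of psi_r(F * I).
    psi_d is a smoothing potential; psi_r is typically "inverted",
    rewarding strong responses to the pattern filters."""
    U = 0.0
    for F in diff_filters:
        U += np.sum(psi_d(convolve2d(img, F, mode='valid')))
    for F in react_filters:
        U += np.sum(psi_r(convolve2d(img, F, mode='valid')))
    return U
```

Since the gradient filters in S_d sum to zero, the diffusion part of the energy is invariant to adding a constant intensity to the image.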

With I[s], s = 0, 1, ..., being the layers of a pyramid and I[0] = I, the potential function becomes a sum over the pyramid levels. We can derive the GRADE equations similarly for this pyramidal representation.
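A discretized gradient-descent step on such an energy can be sketched as follows: dU/dI is obtained by convolving ψ'(F * I) with the 180°-rotated filter, the adjoint of convolution. The explicit Euler scheme and step size here are illustrative assumptions, and odd-sized (centered) filters are used so the flipped filter realizes the adjoint under 'same' cropping.

```python
import numpy as np
from scipy.signal import convolve2d

def grade_step(img, filters, psi_primes, dt=0.1):
    """One explicit Euler step of dI/dt = -dU/dI, with
    dU/dI = sum_a F_-^(a) * psi'^(a)(F^(a) * I),
    where F_- is the filter rotated by 180 degrees."""
    dU = np.zeros_like(img)
    for F, dpsi in zip(filters, psi_primes):
        resp = convolve2d(img, F, mode='same', boundary='symm')
        F_flip = F[::-1, ::-1]
        dU += convolve2d(dpsi(resp), F_flip, mode='same', boundary='symm')
    return img - dt * dU
```

With quadratic potentials (psi'(x) = x) and gradient filters, repeated steps behave like heat diffusion and progressively smooth the image.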

φ(x) has a round tip for γ ≥ 1, and a cusp occurs at x = 0 for 0 < γ < 1, which leaves φ′(0) undefined; i.e., φ′(0) can be any value in (-∞, ∞), as shown by the dotted curves in Fig. 13d. An interesting fact is that the potential function learned from real-world images does have a cusp, as shown in Fig. 9a, where the best fit is γ = 0.7. This cusp forms because a large part of the objects in real-world images have flat intensity appearances, and φ(x) with γ < 1 produces piecewise constant regions with much stronger forces than γ ≥ 1. By continuity, φ′(x) can be assigned any value in the range [-ω, ω] for x ∈ [-ε, ε]. In numerical simulations, for x ∈ [-ε, ε], we take
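Numerically, the subgradient at the cusp can be handled by a bounded linear interpolation on [-ε, ε]. The simple power-law potential below is a stand-in chosen only to exhibit the cusp for γ < 1; it is not the paper's fitted form.

```python
import numpy as np

def phi(x, a=1.0, b=1.0, gamma=0.7):
    """Illustrative cusp potential (assumed form): a * (|x|/b)^gamma.
    For 0 < gamma < 1 this has a cusp at x = 0."""
    return a * (np.abs(x) / b) ** gamma

def phi_prime_regularized(x, eps=1e-2, a=1.0, b=1.0, gamma=0.7):
    """phi'(x) is unbounded near the cusp for gamma < 1; inside [-eps, eps]
    we interpolate linearly so the subgradient stays in a bounded range
    [-omega, omega], with omega = phi'(eps)."""
    x = np.asarray(x, dtype=float)
    omega = a * gamma / b * (eps / b) ** (gamma - 1.0)
    safe = np.maximum(np.abs(x), eps)           # avoid 0^(negative power)
    outer = a * gamma / b * np.sign(x) * (safe / b) ** (gamma - 1.0)
    inner = omega * x / eps                     # linear across the cusp
    return np.where(np.abs(x) > eps, outer, inner)
```

The interpolation matches φ′ continuously at x = ±ε and keeps the descent dynamics well defined on the flat regions.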

Fig. 14f is generated with two reaction filters, Gcos(4, 30°) and Gcos(4, 60°), at level one, where (a = -3.5, b = 10, γ = 2.0, ξ₀ = -3) for ψ₁^(2)(ξ) and ψ₁^(3)(ξ).

Fig. 15. The computational scheme for removing noise and clutter.

Fig. 16. (a) The noise-distorted image. (b)-(d) Restored images by prior models p_l(I), p_t(I), and p_s(I), respectively.
Fig. 17. (a) The observed image. (b) The restored image using six filters.

For example, Fig. 17b is computed using six filters, with two filters for I, {∇_{x,0}, ∇_{y,0}}, and four filters for I_C, {δ, ∇x, ∇y, Gcos(2, 30°)}; i.e., the potential terms for I_C are fit to the potential functions learned from the set of tree images. The energy term φ*(I_C(x, y)) forces zero intensity for the clutter image while allowing large negative intensities for the dark tree branches.
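A toy version of this two-channel scheme can be sketched as coupled gradient descent on a data term plus separate priors for I and I_C. The quadratic data term, the step size, and the pointwise clutter prior below are assumptions for illustration, not the learned potentials.

```python
import numpy as np

def restore(obs, phi_prime_I, phi_prime_C, n_iter=200, dt=0.05, sigma2=1.0):
    """Coupled descent on
      E(I, I_C) = |obs - I - I_C|^2 / (2 sigma^2)
                + sum phi(grad I) + sum phi*(I_C):
    the prior on I uses gradient potentials, while the prior on the
    clutter channel I_C acts pointwise on intensity."""
    I = obs.copy()
    C = np.zeros_like(obs)
    for _ in range(n_iter):
        resid = (I + C - obs) / sigma2
        # gradient of the smoothness prior on I (discrete divergence form)
        dxI = np.diff(I, axis=1)
        dyI = np.diff(I, axis=0)
        gI = np.zeros_like(I)
        gI[:, :-1] -= phi_prime_I(dxI)
        gI[:, 1:] += phi_prime_I(dxI)
        gI[:-1, :] -= phi_prime_I(dyI)
        gI[1:, :] += phi_prime_I(dyI)
        I -= dt * (resid + gI)
        C -= dt * (resid + phi_prime_C(C))
    return I, C
```

With a smoothness prior on I and a pointwise prior on I_C, isolated spikes migrate into the clutter channel while I stays smooth.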
Fig. 18b is computed using eight filters, with four filters for I and four filters for I_C. Thirteen filters are used for Fig. 19, where the clutter is much heavier. As a comparison, we ran the anisotropic diffusion process [25] on Fig. 19a, and the images at iterations t = 50, 100, 300 are displayed in Fig. 20. As we can see, as t → ∞, I(t) becomes a flat image. A robust anisotropic diffusion equation was recently reported in [2].
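For reference, the anisotropic diffusion of [25] takes the following discrete form; the conduction function g and the parameters here are one common choice, a sketch rather than the exact setting used for Fig. 20.

```python
import numpy as np

def perona_malik_step(img, kappa=0.5, dt=0.2):
    """One step of Perona-Malik anisotropic diffusion:
    dI/dt = div( g(|grad I|) grad I ),  g(s) = 1 / (1 + (s/kappa)^2).
    Conduction drops near strong edges, but as t -> infinity the
    image still converges to a constant."""
    p = np.pad(img, 1, mode='edge')
    # differences to the four neighbors (replicated borders give zero flux)
    dN = p[:-2, 1:-1] - img
    dS = p[2:, 1:-1] - img
    dE = p[1:-1, 2:] - img
    dW = p[1:-1, :-2] - img
    g = lambda d: 1.0 / (1.0 + (d / kappa) ** 2)
    return img + dt * (g(dN) * dN + g(dS) * dS + g(dE) * dE + g(dW) * dW)
```

Each step is a symmetric positive-weight averaging, so the mean intensity is preserved exactly while the variance decays, which is why I(t) flattens as t grows.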