Prior Learning and Gibbs Reaction-Diffusion

Song Chun Zhu and David Mumford

Abstract—This article addresses two important themes in early visual computation: first, it presents a novel theory for learning the universal statistics of natural images—a prior model for typical cluttered scenes of the world—from a set of natural images; second, it proposes a general framework for designing reaction-diffusion equations for image processing. We start by studying the statistics of natural images, including their scale-invariant properties; generic prior models are then learned to duplicate the observed statistics, based on the minimax entropy theory studied in two previous papers. The resulting Gibbs distributions have potentials of the form U(I; Λ, S) = Σ_{α=1}^{K} Σ_{x,y} λ^{(α)}((F^{(α)} * I)(x, y)), with S = {F^{(1)}, F^{(2)}, ..., F^{(K)}} a set of filters and Λ = {λ^{(1)}(), λ^{(2)}(), ..., λ^{(K)}()} the potential functions. The learned Gibbs distributions confirm and improve the form of existing prior models, such as the line process, but, in contrast to all previous models, inverted potentials (i.e., λ(x) decreasing as a function of |x|) were found to be necessary. We find that the partial differential equations given by gradient descent on U(I; Λ, S) are essentially reaction-diffusion equations, where the usual energy terms produce anisotropic diffusion while the inverted energy terms produce reaction associated with pattern formation, enhancing preferred image features. We illustrate how these models can be used for texture pattern rendering, denoising, image enhancement, and clutter removal by careful choice of both prior and data models of this type, incorporating the appropriate features.

Index Terms—Visual learning, Gibbs distribution, reaction-diffusion, anisotropic diffusion, texture synthesis, clutter modeling, image restoration.

• S.C. Zhu is with the Computer Science Department, Stanford University, Stanford, CA 94305. E-mail: szhu@cs.stanford.edu.
• D. Mumford is with the Division of Applied Mathematics, Brown University, Providence, RI 02912.
Manuscript received 12 July 1996; revised 15 Sept. 1997. Recommended for acceptance by B. Vemuri.

1 INTRODUCTION

In computer vision, many generic prior models have been proposed to capture the universal low-level statistics of natural images. These models presume that surfaces of objects are smooth and that adjacent pixels in images have similar intensity values unless separated by edges, and they are applied in vision algorithms ranging from image restoration and motion analysis to 3D surface reconstruction. For example, in image restoration, general smoothness models are expressed as probability distributions [9], [4], [20], [11]:

p(I) = (1/Z) exp{ −Σ_{(x,y)} [ ψ(∇x I(x, y)) + ψ(∇y I(x, y)) ] },    (1)

where I is the image, Z is a normalization factor, and ∇x I(x, y) = I(x + 1, y) − I(x, y), ∇y I(x, y) = I(x, y + 1) − I(x, y) are differential operators. Three typical forms of the potential function ψ() are displayed in Fig. 1. The functions in Fig. 1b and Fig. 1c have flat tails to preserve edges and object boundaries, and thus they are said to have advantages over the quadratic function in Fig. 1a.

Fig. 1. Three existing forms for ψ(). (a) Quadratic: ψ(x) = ax². (b) Line process: ψ(x) = a min(θ², x²). (c) T-function: ψ(x) = a(1 − 1/(1 + cx²)).

These prior models have been motivated by regularization theory [26], [18] (where the smoothness term is explained as a stabilizer for solving "ill-posed" problems [32]), physical modeling [31], [4] (if ψ() is quadratic, the variational solutions minimizing the potential are splines, such as flexible membrane or thin plate models), Bayesian theory [9], [20], and robust statistics [19], [13], [3]. Some connections between these interpretations have also been observed in [12], [13], based on effective energy in statistical mechanics. Prior models of this kind are either generalized from traditional physical models [37] or chosen for mathematical convenience. There is, however, little rigorous theoretical or empirical justification for applying these prior models to generic images, and there is little theory to guide the construction and selection of prior models. One may ask the following questions:

1) Why are differential operators good choices for capturing image features?
2) What is the best form for p(I) and ψ()?
3) Real-world scenes are observed at more or less arbitrary scales, so a good prior model should remain the same for image features at multiple scales. However, none of the existing prior models has this scale-invariance property on the 2D image lattice, i.e., none is renormalizable in terms of renormalization group theory [36].

In previous work on modeling textures, we proposed a new class of Gibbs distributions of the following form [40], [41]:

p(I; Λ, S) = (1/Z) e^{−U(I; Λ, S)},    (2)

U(I; Λ, S) = Σ_{α=1}^{K} Σ_{x,y} λ^{(α)}((F^{(α)} * I)(x, y)).    (3)

In the above equations, S = {F^{(1)}, F^{(2)}, ..., F^{(K)}} is a set of linear filters, and Λ = {λ^{(1)}(), λ^{(2)}(), ..., λ^{(K)}()} is a set of potential functions on the features extracted by S. The central property of this class of models is that they reproduce the marginal distributions of F^{(α)} * I estimated over a set of training images while having maximum entropy, and the best set of features {F^{(1)}, F^{(2)}, ..., F^{(K)}} is selected by minimizing the entropy of p(I) [41]. The conclusion of our earlier papers is that, for an appropriate choice of a small set of filters S, random samples from these models can duplicate very general classes of textures—as far as normal human perception is concerned. Recently, we found that similar ideas of model inference using maximum entropy have also been used in natural language modeling [1].

In this paper, we study to what extent probability distributions of this type can be used to model generic natural images, and we try to answer the three questions raised above. We start by studying the statistics of a database of 44 real-world images, and then we describe experiments in which Gibbs distributions of the form (2) were constructed to duplicate the observed statistics. The learned potential functions λ^{(α)}(), α = 1, 2, ..., K fall into two categories: diffusion terms, which are similar to Fig. 1c, and reaction terms, which, in contrast to all previous models, have inverted potentials (i.e., λ(x) decreasing as a function of |x|).
We find that the partial differential equations given by gradient descent on U(I; Λ, S) are essentially reaction-diffusion equations, which we call the Gibbs Reaction And Diffusion Equations (GRADE). In GRADE, the diffusion components produce denoising effects similar to anisotropic diffusion [25], while the reaction components form patterns and enhance preferred image features. The learned prior models are applied as follows. First, we run GRADE starting with white noise images and demonstrate how it can easily generate canonical texture patterns, such as leopard blobs and zebra stripes, as the Turing reaction-diffusion equations do [34], [38]; our theory thus provides a new method for designing PDEs for pattern synthesis. Second, we illustrate how the learned models can be used for denoising, image enhancement, and clutter removal by careful choice of both prior and noise models of this type, incorporating the appropriate features extracted at various scales and orientations. The computation simulates a stochastic process—the Langevin equations—for sampling the posterior distribution.

This paper is arranged as follows: Section 2 presents a general theory for prior learning. Section 3 describes experiments on the statistics of natural images and prior learning. Section 4 studies the reaction-diffusion equations. Section 5 presents experiments on denoising, image enhancement, and clutter removal. Finally, Section 6 concludes with a discussion.

2 THEORY OF PRIOR LEARNING

2.1 Goal of Prior Learning and Two Extreme Cases

We define an image I on an N × N lattice L to be a function such that, for any pixel (x, y), I(x, y) takes values in a set that is either an interval of ℝ or a subset of ℤ. We assume that there is an underlying probability distribution f(I) on the space of such images for general natural images—arbitrary views of the world. Let {I_n^obs, n = 1, 2, ..., M} be a set of observed images which are independent samples from f(I).

The objective of learning a generic prior model is to look for common features and their statistics in the observed natural images. Such features and their statistics are then incorporated into a probability distribution p(I) as an estimate of f(I), so that p(I), used as a prior model, will bias vision algorithms against image features which are not typical in natural images, such as noise distortion and blurring. For this objective, it is reasonable to assume that any image feature has an equal chance to occur at any location, so f(I) is translation invariant with respect to (x, y). We discuss the limits of this assumption in Section 6.

To study the properties of the images I_n^obs, n = 1, 2, ..., M, we start by exploring a set of linear filters S = {F^{(α)}, α = 1, 2, ..., K} which are characteristic of the observed images. The statistics extracted by S are the empirical marginal distributions (or histograms) of the filter responses, defined below.
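For concreteness, the following sketch computes the empirical marginal distribution of a filter response and its average over a database, matching Definitions 1 and 2 below. This is our own minimal illustration, not the authors' code; the function names, the bin choice, and the toy random database are ours.

```python
import numpy as np
from scipy.signal import correlate2d

def filter_histogram(image, filt, bins):
    """Empirical marginal distribution H(z; I) of the response F * I (Definition 2)."""
    response = correlate2d(image, filt, mode="valid")
    hist, _ = np.histogram(response, bins=bins)
    return hist / max(hist.sum(), 1)              # normalize so the histogram sums to one

def observed_statistics(images, filt, bins):
    """Average histogram mu_obs(z) over a database of observed images."""
    return np.mean([filter_histogram(im, filt, bins) for im in images], axis=0)

# Example with the gradient filter nabla_x: I(x+1, y) - I(x, y).
nabla_x = np.array([[-1.0], [1.0]])
bins = np.linspace(-31, 31, 63)
database = [np.random.randint(0, 32, (64, 64)).astype(float) for _ in range(5)]  # toy stand-in
mu_obs = observed_statistics(database, nabla_x, bins)
```

With real natural images in place of the random stand-ins, the resulting μ_obs is exactly the observed statistic that the learned model is later required to reproduce.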
DEFINITION 1. Given a probability distribution f(I), the marginal distribution of f(I) with respect to F^{(α)} is

f^{(α)}(z) = ∫ δ(z − (F^{(α)} * I)(x, y)) f(I) dI = E_f[ δ(z − (F^{(α)} * I)(x, y)) ],  ∀z ∈ ℝ, ∀(x, y) ∈ L,

where δ() is the Dirac function with point mass concentrated at zero.

DEFINITION 2. Given a linear filter F^{(α)} and an image I, the empirical marginal distribution (or histogram) of the filtered image F^{(α)} * I is

H^{(α)}(z; I) = (1/|L|) Σ_{(x,y)∈L} δ(z − (F^{(α)} * I)(x, y)).

We compute the histogram averaged over all images in {I_n^obs} as the observed statistic,

μ_obs^{(α)}(z) = (1/M) Σ_{n=1}^{M} H^{(α)}(z; I_n^obs),  α = 1, 2, ..., K.

If we make a good choice of our database, then we may assume that μ_obs^{(α)}(z) is an unbiased estimate of f^{(α)}(z), and as M → ∞, μ_obs^{(α)}(z) converges to f^{(α)}(z) = E_f[H^{(α)}(z; I)].

Now, to learn a prior model from the observed images {I_n^obs, n = 1, 2, ..., M}, we immediately have two simple solutions. The first one is

p(I) = Π_{(x,y)∈L} μ_obs^{(o)}(I(x, y)),    (4)

where μ_obs^{(o)} is the observed average histogram of the image intensities, i.e., the filter F^{(o)} = δ is used. Taking ψ1(z) = −log μ_obs^{(o)}(z), we rewrite (4) as

p(I) = (1/Z) e^{−Σ_{(x,y)} ψ1(I(x, y))}.    (5)

The second solution is

p(I) = (1/M) Σ_{n=1}^{M} δ(I − I_n^obs).    (6)

Let ⟨I_n^obs, I_n^obs⟩ = c_n for n = 1, 2, ..., M; then (6) becomes

p(I) = (1/M) Σ_{n=1}^{M} e^{−ψ2(⟨I_n^obs, I⟩ − c_n)},    (7)

where ⟨·, ·⟩ is the inner product, ψ2(0) = 0, and ψ2(x) = ∞ if x ≠ 0, i.e., ψ2() is similar to the Potts model [37].

These two solutions stand for two typical mechanisms for constructing probability models in the literature. The first is often used for image coding [35], and the second is a special case of the learning scheme using radial basis functions (RBF) [27] (in RBF, the basis functions are presumed to be smooth, such as a Gaussian function; here, using δ() is more loyal to the observed data). Although the philosophies behind these two prior models are very different, we observe that they share two common properties:

1) The potentials ψ1(), ψ2() are built on the responses of linear filters. In (7), the I_n^obs, n = 1, 2, ..., M are used as linear filters of size N × N pixels, which we denote by F^{(obs_n)} = I_n^obs.
2) For each filter chosen, p(I) in both cases duplicates the observed marginal distributions: it is trivial to prove that in both cases E_p[H^{(α)}(z; I)] = μ_obs^{(α)}(z), and thus, as M increases, E_p[H^{(α)}(z; I)] → E_f[H^{(α)}(z; I)]. This second property is in general not satisfied by the existing prior models in (1).

p(I) in both cases meets our objective for prior learning, but intuitively neither is a good choice. In (5), the δ() filter does not capture spatial structures larger than one pixel, and in (7), the filters F^{(obs_n)} are too specific to predict features in unobserved images. In fact, the filters used above lie at the two extremes of the spectrum of all linear filters. As discussed by Gabor [7], the δ filter is localized in space but extended uniformly in frequency. In contrast, some other filters, like sine waves, are well localized in frequency but extended in space. The filter F^{(obs_n)} includes a specific combination of all the components in both space and frequency. A quantitative analysis of the goodness of these filters is given in Table 1 in Section 3.2.

2.2 Learning Prior Models by Minimax Entropy

To generalize the two extreme examples, it is desirable to compute a probability distribution which duplicates the observed marginal distributions for an arbitrary set of filters, linear or nonlinear. This goal is achieved by the minimax entropy theory studied for modeling textures in our previous papers [40], [41]. Given a set of filters {F^{(α)}, α = 1, 2, ..., K} and observed statistics {μ_obs^{(α)}(z), α = 1, 2, ..., K}, a maximum entropy (ME) distribution is derived which has the following Gibbs form:

p(I; Λ, S) = (1/Z) e^{−U(I; Λ, S)},    (8)

U(I; Λ, S) = Σ_{α=1}^{K} Σ_{x,y} λ^{(α)}((F^{(α)} * I)(x, y)).    (9)

In the above equation, we consider linear filters only, and Λ = {λ^{(1)}(), λ^{(2)}(), ..., λ^{(K)}()} is a set of potential functions on the features extracted by S. In practice, image intensities are discretized into a finite number of gray levels and the filter responses are divided into a finite number of bins, so each λ^{(α)}() is approximated by a piecewise constant function, i.e., a vector, which we denote by λ^{(α)}, α = 1, 2, ..., K.
The λ^{(α)}s are computed in a nonparametric way so that the learned p(I; Λ, S) reproduces the observed statistics:

E_{p(I; Λ, S)}[H^{(α)}(z; I)] = μ_obs^{(α)}(z),  α = 1, 2, ..., K, ∀z.

Therefore, as far as the selected features and their statistics are concerned, we cannot distinguish between p(I; Λ, S) and the "true" distribution f(I).

Unfortunately, there is no simple way to express the λ^{(α)}s in terms of the μ_obs^{(α)}s as in the two extreme examples. To compute the λ^{(α)}s, we adopted the Gibbs sampler [9], which simulates an inhomogeneous Markov chain in image space. This Monte Carlo method iteratively samples from the distribution p(I; Λ, S), computes the histograms of the filter responses for this sample, and updates the λ^{(α)} to bring the histograms closer to the observed ones. For a detailed account of the computation of the λ^{(α)}s, the reader is referred to [40], [41]. In our previous papers, the following two propositions are observed.

PROPOSITION 1. Given a filter set S and observed statistics {μ_obs^{(α)}, α = 1, 2, ..., K}, there is a unique solution for {λ^{(α)}, α = 1, 2, ..., K}.

PROPOSITION 2. f(I) is determined by its marginal distributions; thus p(I) = f(I) if it reproduces all the marginal distributions of linear filters.
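The update loop just described can be summarized in a short sketch. This is our own illustration under simplifying assumptions, not the authors' implementation: `sample_sweep` stands for one sweep of the Gibbs sampler of p(I; Λ, S), which we leave as an assumed black box, and the step size `eta` is a hypothetical parameter.

```python
import numpy as np
from scipy.signal import correlate2d

def response_histogram(image, filt, bins):
    """Normalized histogram H(z; I) of the filter response F * I."""
    hist, _ = np.histogram(correlate2d(image, filt, mode="valid"), bins=bins)
    return hist / max(hist.sum(), 1)

def learn_potentials(mu_obs, filters, bins, sample_sweep, sweeps=1000, eta=0.1):
    """Nonparametric estimation of the potential vectors lambda^(alpha): repeatedly
    sample p(I; Lambda, S) and move each lambda so that the model histograms
    approach the observed ones."""
    lambdas = [np.zeros(len(bins) - 1) for _ in filters]        # piecewise-constant potentials
    image = np.random.randint(0, 32, (128, 128)).astype(float)  # initial state of the Markov chain
    for _ in range(sweeps):
        image = sample_sweep(image, lambdas, filters, bins)     # one sweep of the Gibbs sampler
        for a, filt in enumerate(filters):
            h_model = response_histogram(image, filt, bins)
            lambdas[a] += eta * (h_model - mu_obs[a])           # raise lambda where the model over-produces a response
    return lambdas
```

The update direction follows from the maximum entropy formulation: the gradient of the log-likelihood with respect to λ^{(α)} is the difference between the model's expected histogram and μ_obs^{(α)}.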
But for computational reasons, it is often desirable to choose a small set of filters which most efficiently capture the image structures. Given a set of filters S and an ME distribution p(I; Λ, S), the goodness of p(I; Λ, S) is often measured by the Kullback-Leibler divergence between p(I; Λ, S) and the ideal distribution f(I) [17]:

KL(f(I), p(I; Λ, S)) = ∫ f(I) log [ f(I) / p(I; Λ, S) ] dI = E_f[log f(I)] − E_f[log p(I; Λ, S)].

Then, for a fixed model complexity K, the best feature set S* is selected by the following criterion:

S* = arg min_{|S| = K} KL(f(I), p(I; Λ, S)),

where S is chosen from a general filter bank B, such as Gabor filters at multiple scales and orientations. Enumerating all possible sets of features S in the filter bank and comparing their entropies is computationally too expensive. Instead, in [41] we propose a stepwise greedy procedure for minimizing the KL divergence. We start from S = ∅ and p(I; Λ, S) a uniform distribution, and introduce one filter at a time; each added filter is chosen to maximally decrease the KL divergence, and we keep doing this until the decrease is smaller than a certain value.

In the experiments of this paper, we use a simpler measure of the "information gain" achieved by adding one new filter to our feature set S. This is roughly an L¹-distance (vs. the L²-measure implicit in the Kullback-Leibler divergence); the reader is referred to [42] for a detailed account.

DEFINITION 3. Given S = {F^{(1)}, F^{(2)}, ..., F^{(K)}} and p(I; Λ, S) defined above, the information criterion (IC) for each filter F^{(β)} ∈ B \ S at step K + 1 is

IC^{(β)} = (1/2M) Σ_{n=1}^{M} Σ_z | H^{(β)}(z; I_n^obs) − E_{p(I; Λ, S)}[H^{(β)}(z; I)] | − (1/2M) Σ_{n=1}^{M} Σ_z | H^{(β)}(z; I_n^obs) − μ_obs^{(β)}(z) |.

We call the first term the "average information gain" (AIG) obtained by choosing F^{(β)}, and the second term the "average information fluctuation" (AIF). Intuitively, AIG measures the average error between the filter responses in the database and the marginal distribution of the current model p(I; Λ, S). In practice, we need to sample p(I; Λ, S), i.e., synthesize images I_n^syn, n = 1, 2, ..., M′, and estimate E_{p(I; Λ, S)}[H^{(β)}(z; I)] by μ_syn^{(β)}(z) = (1/M′) Σ_{n=1}^{M′} H^{(β)}(z; I_n^syn). For a filter F^{(β)}, the bigger AIG is, the more information F^{(β)} captures, as it reports the error between the current model and the observations. AIF is a measure of disagreement among the observed images: the bigger AIF is, the less their responses to F^{(β)} have in common.

3 EXPERIMENTS ON NATURAL IMAGES

This section presents experiments on learning prior models, and we start by exploring the statistical properties of natural images.

3.1 Statistics of Natural Images

It is well known that natural images have statistical properties distinguishing them from random noise images [28], [6], [24]. In our experiments, we collected a set of 44 natural images, six of which are shown in Fig. 2. These images are from various sources, some digitized from books and postcards and some from a Corel image database. Our database includes both indoor and outdoor pictures, country and urban scenes, and all images are normalized to have intensity between zero and 31.

Fig. 2. Six of the 44 collected natural images.
Fig. 3. The intensity histograms in the domain [0, 31]. (a) Averaged over 44 natural images. (b) An individual natural image. (c) A uniform noise image.
Fig. 4. The histograms of ∇x I plotted in the domain [−30, 30]. (a) Averaged over 44 natural images. (b) An individual natural image. (c) A uniform noise image.
Fig. 5. (a) The histogram of ∇x I plotted against a Gaussian curve (dashed) of the same mean and variance in the domain [−15, 15]. (b) The logarithm of the two curves in (a).

As stated in Proposition 2, marginal distributions of linear filters alone are capable of characterizing f(I). In the rest of this paper, we shall only study the following bank B of linear filters:

1) An intensity filter δ().
2) Isotropic center-surround filters, i.e., the Laplacian of Gaussian filters

LG(x, y, σ) = const · (x² + y² − σ²) e^{−(x² + y²)/σ²},    (10)

where √2 σ stands for the scale of the filter. We denote these filters by LG(σ). A special filter is LG(√2/2), which has a 3 × 3 window [0, 1/4, 0; 1/4, −1, 1/4; 0, 1/4, 0], and we denote it by Δ.
3) Gabor filters with both sine and cosine components, which are models for the frequency- and orientation-sensitive simple cells,

G(x, y, σ, θ) = const · Rot(θ) ∘ e^{−(1/(2σ²))(4x² + y²)} e^{−i 2π x / σ}.    (11)

This is a sine wave at frequency 2π/σ modulated by an elongated Gaussian function and rotated to angle θ. We denote the real and imaginary parts of G(x, y, σ, θ) by Gcos(σ, θ) and Gsin(σ, θ). Two special Gsin(σ, θ) filters are the gradients ∇x, ∇y.
4) We approximate large-scale filters by filters of small window size applied at a higher level of the image pyramid, where the image at one level is a "blown-down" version (i.e., averaged in 2 × 2 blocks) of the image below.

We observed three important aspects of the statistics of natural images.

First, for some features, the statistics of natural images vary widely from image to image. Consider the δ() filter of Section 2.1. The average intensity histogram of the 44 images, μ_obs^{(o)}, is plotted in Fig. 3a, and Fig. 3b is the intensity histogram of an individual image (the temple image in Fig. 2). It appears that μ_obs^{(o)}(z) is close to a uniform distribution (Fig. 3c), while the difference between Fig. 3a and Fig. 3b is very big. Thus the IC for the filter δ() should be small (see Table 1).

Second, for many other filters, the histograms of their responses are amazingly consistent across all 44 natural images, and they are very different from the histograms of noise images. For example, consider the filter ∇x. Fig. 4a is the average histogram of the 44 filtered natural images, Fig. 4b is the histogram of an individual filtered image (the same image as in Fig. 3b), and Fig. 4c is the histogram of a filtered uniform noise image. The average histogram in Fig. 4a is very different from a Gaussian distribution: Fig. 5a plots it against a Gaussian curve (dashed) of the same mean and variance, and the histogram of natural images has higher kurtosis and heavier tails. Similar results are reported in [6]. To see the difference in the tails, Fig. 5b plots the logarithm of the two curves.

Third, the statistics of natural images are essentially scale invariant with respect to some features. As an example, we look at the filters ∇x and ∇y. For each image I_n^obs, we build a pyramid with I_n^{[s]} being the image at the sth layer. We set I_n^{[0]} = I_n^obs and let

I_n^{[s+1]}(x, y) = (1/4) [ I_n^{[s]}(2x, 2y) + I_n^{[s]}(2x, 2y + 1) + I_n^{[s]}(2x + 1, 2y) + I_n^{[s]}(2x + 1, 2y + 1) ].

The size of I_n^{[s]} is N/2^s × N/2^s. For the filter ∇x, let μ_{x,s}(z) be the average histogram of ∇x I_n^{[s]} over n = 1, 2, ..., 44. Fig. 6a plots μ_{x,s}(z) for s = 0, 1, 2, and they are almost identical. To see the tails more clearly, Fig. 6b displays log μ_{x,s}(z) for s = 0, 1, 2; the differences between them are still small. Similar results are observed for μ_{y,s}(z), s = 0, 1, 2, the average histograms of ∇y I_n^{[s]}. In contrast, Fig. 6c plots the histograms of ∇x I with I being a uniform noise image at scales s = 0, 1, 2.

Combining the second and third aspects above, we conclude that the histograms of ∇x I_n^{[s]}, ∇y I_n^{[s]} are very consistent across all observed natural images and across scales s = 0, 1, 2. The scale-invariance property of natural images is largely caused by two facts: 1) natural images contain objects of all sizes, and 2) natural scenes are viewed and made into images at arbitrary distances.
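The scale-invariance test is straightforward to reproduce. The sketch below (our own illustration, with hypothetical helper names) builds the 2 × 2 block-averaged pyramid and prints a tail statistic of the ∇x histogram at each level; on a natural image the histograms come out nearly identical across levels, whereas on a noise image they do not.

```python
import numpy as np

def blow_down(image):
    """One pyramid step: average each 2 x 2 block ('blown-down' image)."""
    h, w = image.shape
    image = image[: h - h % 2, : w - w % 2]
    return 0.25 * (image[0::2, 0::2] + image[0::2, 1::2] +
                   image[1::2, 0::2] + image[1::2, 1::2])

def gradient_x_histogram(image, bins):
    """Histogram of nabla_x I(x, y) = I(x+1, y) - I(x, y)."""
    dx = image[1:, :] - image[:-1, :]
    hist, _ = np.histogram(dx, bins=bins, density=True)
    return hist

bins = np.linspace(-30, 30, 61)
image = np.random.rand(256, 256) * 31   # noise stand-in; a natural image would show the invariance
for s in range(3):
    mu_x_s = gradient_x_histogram(image, bins)
    print(f"scale {s}: mass in |z| > 10 is {mu_x_s[:20].sum() + mu_x_s[-20:].sum():.4f}")
    image = blow_down(image)
```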
3.2 Empirical Prior Models

In this section, we learn prior models according to the theory proposed in Section 2 and analyze the efficiency of the filters quantitatively.

3.2.1 Experiment I

We start from S = ∅ and p0(I) a uniform distribution, and compute the AIF, AIG, and IC for all filters in our filter bank; the results for a small number of filters are listed in Table 1. The filter Δ has the biggest IC (= 0.642) and is thus chosen as F^{(1)}. An ME distribution p1(I; Λ, S) is learned, and the information criterion for each filter is shown in the column headed p1(I) in Table 1. We notice that the IC for the filter Δ drops to near zero, and the IC also drops for the other filters because these filters are in general not independent of Δ. Some small filters, like LG(1), have smaller ICs than others due to their higher correlation with Δ. The big filters with larger IC are investigated in Experiment II. In this experiment, we choose both ∇x and ∇y as F^{(2)}, F^{(3)}, as in other prior models. Therefore, a prior model p3(I) is learned with potential

U3(I; Λ, S) = Σ_{(x,y)} [ λ^{(1)}(ΔI(x, y)) + λ^{(2)}(∇x I(x, y)) + λ^{(3)}(∇y I(x, y)) ].

The learned potential functions λ^{(α)}(z), α = 1, 2, 3 are plotted in Fig. 7. Since μ_obs^{(1)}(z) = 0 for |z| ≥ 9.5 and μ_obs^{(α)}(z) = 0 for |z| ≥ 22, α = 2, 3 (in fact, we set μ_obs^{(1)}(z) = 0 whenever μ_obs^{(1)}(z) < 1/N², with N × N the size of the synthesized image), we only plot λ^{(1)}(z) for z ∈ [−9.5, 9.5] and λ^{(2)}(z), λ^{(3)}(z) for z ∈ [−22, 22]. These three curves are fitted with the functions ψ1(z) = 2.1(1 − 1/(1 + (|z|/4.8)^1.32)), ψ2(z) = 1.25(1 − 1/(1 + (|z|/2.8)^1.5)), and ψ3(z) = 1.95(1 − 1/(1 + (|z|/2.8)^1.5)), respectively. A synthesized image sampled from p3(I) is displayed in Fig. 8.

So far, we have used three filters to characterize the statistics of natural images, and the synthesized image in Fig. 8 is still far from natural ones. In particular, even though the learned potential functions λ^{(α)}(z), α = 1, 2, 3 all have flat tails to preserve intensity breaks, they only generate small speckles instead of the big regions and long edges one might expect. Based on this synthesized image, we compute the AIG and IC for all filters; the results are listed in the column p3(I) of Table 1.

Fig. 6. (a) μ_{x,s}(z), s = 0, 1, 2. (b) log μ_{x,s}(z), s = 0 (solid), s = 1 (dash-dotted), and s = 2 (dashed). (c) Histograms of a filtered uniform noise image at scales s = 0 (solid), s = 1 (dash-dotted), and s = 2 (dashed).
Fig. 7. The three learned potential functions for the filters (a) Δ, (b) ∇x, and (c) ∇y. Dashed curves are the fitting functions: (a) ψ1(x) = 2.1(1 − 1/(1 + (|x|/4.8)^1.32)). (b) ψ2(x) = 1.25(1 − 1/(1 + (|x|/2.8)^1.5)). (c) ψ3(x) = 1.95(1 − 1/(1 + (|x|/2.8)^1.5)).
Fig. 8. A typical sample of p3(I) (256 × 256 pixels).

3.2.2 Experiment II

It is clear that we need large-scale filters to do better. Rather than using large-scale Gabor filters, we chose to use ∇x and ∇y at four different scales and to impose explicitly the scale-invariance property that we find in natural images. Given an image I defined on an N × N lattice L, we build a pyramid in the same way as before. Let I^{[s]}, s = 0, 1, 2, 3 be the four layers of the pyramid, and let H_{x,s}(z, x, y) denote the histogram of ∇x I^{[s]}(x, y) and H_{y,s}(z, x, y) the histogram of ∇y I^{[s]}(x, y). We ask for a probability model p(I) which satisfies

E_{p(I)}[H_{x,s}(z, x, y)] = μ(z),  E_{p(I)}[H_{y,s}(z, x, y)] = μ(z),  ∀z, ∀(x, y) ∈ L_s, s = 0, 1, 2, 3,

where L_s is the image lattice at level s and μ(z) is the average of the observed histograms of ∇x I^{[s]} and ∇y I^{[s]} over all 44 natural images at all scales. This results in a maximum entropy distribution ps(I) with energy of the following form:

Us(I) = Σ_{s=0}^{3} Σ_{(x,y)∈L_s} [ λ_{x,s}(∇x I^{[s]}(x, y)) + λ_{y,s}(∇y I^{[s]}(x, y)) ].    (12)
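To make (12) concrete, here is a small routine (our own sketch, not the authors' code) that evaluates the four-scale energy using the fitted potential family a(1 − 1/(1 + (|z − x0|/b)^γ)) as a stand-in for the learned vectors λ_{x,s}, λ_{y,s}; the per-scale parameter values are those later quoted in the caption of Fig. 9.

```python
import numpy as np

def fitted_potential(z, a, b, gamma, x0=0.0):
    """Potential family used to fit the learned curves: a * (1 - 1/(1 + (|z - x0|/b)^gamma))."""
    return a * (1.0 - 1.0 / (1.0 + (np.abs(z - x0) / b) ** gamma))

# (a, b, gamma) per scale, from the Fig. 9 caption: upright at s = 0, inverted at s = 1, 2, 3.
SCALE_PARAMS = [(5.0, 10.0, 0.7), (-2.0, 10.0, 1.6), (-4.8, 15.0, 2.0), (-10.0, 22.0, 5.0)]

def four_scale_energy(image):
    """Evaluate Us(I) of (12): sum potentials of nabla_x and nabla_y at four pyramid levels."""
    energy, level = 0.0, image
    for a, b, gamma in SCALE_PARAMS:
        dx = level[1:, :] - level[:-1, :]
        dy = level[:, 1:] - level[:, :-1]
        energy += fitted_potential(dx, a, b, gamma).sum() + fitted_potential(dy, a, b, gamma).sum()
        h, w = level.shape
        level = level[: h - h % 2, : w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return energy
```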
TABLE 1. THE INFORMATION CRITERION FOR FILTER SELECTION.

Fig. 9 displays λ_{x,s}(), s = 0, 1, 2, 3. At the beginning of the learning process, all λ_{x,s}(), s = 0, 1, 2, 3 are of the form displayed in Fig. 7, with low values around zero to encourage smoothness. As learning proceeds, λ_{x,3}() gradually turns "upside down," with smaller values at the two tails; then λ_{x,2}() and λ_{x,1}() turn upside down one by one. Similar results are observed for λ_{y,s}(), s = 0, 1, 2, 3. Fig. 11 is a typical sample image from ps(I). To demonstrate that it has scale-invariant properties, Fig. 10 shows the histograms H_{x,s} and log H_{x,s} of this synthesized image for s = 0, 1, 2, 3. The learning process iterates for more than 10,000 sweeps. To verify the learned λ()s, we restarted a homogeneous Markov chain from a noise image using the learned model and found that the Markov chain reaches a perceptually similar image after 6,000 sweeps.

3.2.2.1 Remark 1

In Fig. 9, we notice that the λ_{x,s}() are inverted, i.e., decreasing functions of |z|, for s = 1, 2, 3, distinguishing this model from other prior models in computer vision. First of all, as the image intensity has the finite range [0, 31], ∇x I is confined to [−31, 31]; therefore we may define λ_{x,s}(z) = 0 for |z| > 31, so ps(I) is still well defined. Second, such inverted potentials have significant meaning in visual computation. In image restoration, when a high intensity difference ∇x I^{[s]}(x, y) is present, it is very likely to be noise if s = 0. However, this is not true for s = 1, 2, 3: additive noise can hardly pass to the higher layers of the pyramid, because at each layer the 2 × 2 averaging operator reduces the variance of the noise by a factor of four. When ∇x I^{[s]}(x, y) is large for s = 1, 2, 3, it is more likely to be a true edge or object boundary. So in ps(I), λ_{x,0}() suppresses noise at the first layer, while λ_{x,s}(), s = 1, 2, 3 encourage sharp edges to form and thus enhance blurred boundaries. We notice that regions of various scales emerge in Fig. 11 and that the intensity contrasts are higher at the boundaries; these appearances are missing in Fig. 8.

3.2.2.2 Remark 2

Based on the image in Fig. 11, we computed IC and AIG for all filters and list them under the column ps(I) in Table 1. We also compare the two extreme cases discussed in Section 2.1. For the δ() filter, AIF is very big, and AIG is only slightly bigger than AIF. Since none of the prior models that we learned has a preference about the image intensity domain, the image intensity has a uniform distribution, which we limit to [0, 31]; thus the first row of Table 1 has the same value for IC and AIG. For the filter I^{(obs_i)}, AIF = (M − 1)/M, the biggest among all filters, and AIG → 1. In both cases, the ICs are the two smallest.

4 GIBBS REACTION-DIFFUSION EQUATIONS

4.1 From Gibbs Distributions to Reaction-Diffusion Equations

The empirical results in the previous section suggest that the forms of the potentials λ^{(α)}(z) learned from images of real-world scenes can be divided into two classes: upright curves λ(z), for which λ() is an even function increasing with |z|, and inverted curves, for which the opposite happens. A similar phenomenon was observed in our learned texture models [40]. In Fig. 9, the λ_{x,s}(z) are fit to the family of functions (see the dashed curves)

φ(x) = a (1 − 1/(1 + (|x − x0|/b)^γ)),  a > 0,
ψ(x) = a (1 − 1/(1 + (|x − x0|/b)^γ)),  a < 0,
where x0 and b are, respectively, the translation and scaling constants, and a weights the contribution of the filter. In general, the Gibbs distribution learned from images in a given application has a potential function of the following form:

U(I; Λ, S) = Σ_{α=1}^{n_d} Σ_{x,y} φ^{(α)}((F^{(α)} * I)(x, y)) + Σ_{α=n_d+1}^{K} Σ_{x,y} ψ^{(α)}((F^{(α)} * I)(x, y)).    (13)

Note that the filter set is now divided into two parts, S = S_d ∪ S_r, with S_d = {F^{(α)}, α = 1, 2, ..., n_d} and S_r = {F^{(α)}, α = n_d + 1, ..., K}. In most cases, S_d consists of filters such as ∇x, ∇y, Δ, which capture the general smoothness of images, and S_r contains filters which characterize the prominent features of a class of images, e.g., Gabor filters at various orientations and scales, which respond to the larger edges and bars.

Fig. 9. Learned potential functions λ_{x,s}(), s = 0, 1, 2, 3. The dashed curves are fitting functions f(x) = a(1 − 1/(1 + (|x|/b)^γ)): (a) a = 5, b = 10, x0 = 0, γ = 0.7. (b) a = −2.0, b = 10, x0 = 0, γ = 1.6. (c) a = −4.8, b = 15, x0 = 0, γ = 2.0. (d) a = −10.0, b = 22, x0 = 0, γ = 5.0.
Fig. 10. (a) The histograms of the synthesized image at four scales—almost indistinguishable. (b) The logarithm of the histograms in Fig. 10a.

Instead of defining a whole distribution with U, one can use U to set up a variational problem; in particular, one can attempt to minimize U by gradient descent. This leads to a nonlinear parabolic partial differential equation,

I_t = Σ_{α=1}^{n_d} F_−^{(α)} * φ^{(α)′}(F^{(α)} * I) + Σ_{α=n_d+1}^{K} F_−^{(α)} * ψ^{(α)′}(F^{(α)} * I),    (14)

with F_−^{(α)}(x, y) = −F^{(α)}(−x, −y). Thus, starting from an input image I(x, y, 0) = I^in, the first sum diffuses the image by reducing the gradients, while the second sum forms patterns as the reaction term. We call (14) the GRADE.

Since the computation of (14) involves convolving twice with each of the selected filters, a conventional way to compute efficiently is to build an image pyramid, so that filters with large scales and low frequencies can be scaled down into small ones at the higher levels of the pyramid. This is appropriate especially when the filters are selected from a bank of multiple scales, such as the Gabor filters and the wavelet transforms, and we adopt this representation in our experiments. For an image I, let I^{[s]} be the image at level s = 0, 1, 2, ... of a pyramid, with I^{[0]} = I; the potential function becomes

U(I; Λ, S) = Σ_s Σ_α Σ_{(x,y)∈L_s} φ_s^{(α)}((F_s^{(α)} * I^{[s]})(x, y)) + Σ_s Σ_α Σ_{(x,y)∈L_s} ψ_s^{(α)}((F_s^{(α)} * I^{[s]})(x, y)),

and the GRADE equations can be derived similarly for this pyramidal representation.
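The following is a minimal explicit-Euler sketch of (14), assuming the fitted potential family of Section 3.2 for φ and ψ and small numpy kernels for the filters. It is our own illustration, not the authors' implementation, and it ignores the pyramidal representation and the cusp treatment discussed in Section 4.2.

```python
import numpy as np
from scipy.signal import convolve2d

def potential_deriv(z, a, b, gamma, x0=0.0, eps=1e-6):
    """Derivative of a*(1 - 1/(1 + (|z - x0|/b)^gamma)) away from the cusp at z = x0."""
    u = (np.abs(z - x0) + eps) / b
    return a * gamma * u ** (gamma - 1.0) / (b * (1.0 + u ** gamma) ** 2) * np.sign(z - x0)

def grade_step(image, filters, params, dt=0.1):
    """One explicit Euler step of (14):
    I_t = sum_alpha F_minus^(alpha) * phi'^(alpha)(F^(alpha) * I), F_minus(x, y) = -F(-x, -y)."""
    velocity = np.zeros_like(image)
    for filt, (a, b, gamma, x0) in zip(filters, params):
        response = convolve2d(image, filt, mode="same", boundary="symm")
        force = potential_deriv(response, a, b, gamma, x0)
        f_minus = -filt[::-1, ::-1]               # F_minus(x, y) = -F(-x, -y)
        velocity += convolve2d(force, f_minus, mode="same", boundary="symm")
    return image + dt * velocity
```

Upright potentials (a > 0) make the corresponding term shrink filter responses—diffusion—while inverted potentials (a < 0) push responses away from zero and act as the reaction term.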
4.2 Anisotropic Diffusion and Gibbs Reaction-Diffusion

This section compares GRADEs with previous diffusion equations in vision. In [25], [23], anisotropic diffusion equations for generating image scale spaces are introduced in the following form:

I_t = div(c(x, y, t) ∇I),  I(x, y, 0) = I^in,    (15)

where div is the divergence operator, i.e., div V = ∇x P + ∇y Q for V = (P, Q). Perona and Malik defined the heat conductivity c(x, y, t) as a function of the local gradients, for example,

I_t = ∇x ( I_x / (1 + (I_x / b)²) ) + ∇y ( I_y / (1 + (I_y / b)²) ).    (16)

Equation (16) minimizes the energy function in continuous form

U(I) = ∫∫ [ λ(∇x I(x, y)) + λ(∇y I(x, y)) ] dx dy,

where λ(x) = a log(1 + (x/b)²) and λ′(x) = (2ax/b²) / (1 + (x/b)²); both are plotted in Fig. 12. Similar forms of the energy function are widely used as prior distributions [9], [4], [20], [11], and they can also be equivalently interpreted in the sense of robust statistics [13], [3]. In the following, we address three important properties of the Gibbs reaction-diffusion equations.

First, we note that (14) is an extension of (15) to a discrete lattice, obtained by defining a vector field

V(x, y) = ( φ^{(1)′}(·), ..., φ^{(n_d)′}(·), ψ^{(n_d+1)′}(·), ..., ψ^{(K)′}(·) )

and a divergence operator

div = F_−^{(1)} * + F_−^{(2)} * + ⋯ + F_−^{(K)} * ,

so that (14) can be written as

I_t = div V.    (17)

Compared to (15), which transfers "heat" among adjacent pixels, (17) transfers "heat" in many directions in a graph, and the conductivities are defined as functions of the local patterns, not just the local gradients.

Second, as shown in Fig. 13, φ(x) has a round tip for γ ≥ 1, while a cusp occurs at x = 0 for 0 < γ < 1, which leaves φ′(0) undefined; i.e., φ′(0) can be any value in (−∞, ∞), as shown by the dotted curves in Fig. 13d. An interesting fact is that the potential function learned from real-world images does have a cusp, as shown in Fig. 9a, where the best fit is γ = 0.7. This cusp forms because a large portion of the objects in real-world images have flat intensity appearances, and φ(x) with γ < 1 produces piecewise constant regions with much stronger forces than γ ≥ 1. By continuity, φ′(x) may be assigned any value in the range [−w, w] for x ∈ [−ε, ε], with w = φ′(ε) = (aγ/b) (ε/b)^{γ−1} / (1 + (ε/b)^γ)². In numerical simulations, for x ∈ [−ε, ε] we take

φ′(x) = +w if s < −w,  −s if s ∈ [−w, w],  −w if s > w,

where s is the sum of the other terms in the differential equation, whose values are well defined. Intuitively, when γ < 1 and x = (F^{(α)} * I)(x, y) = 0, φ^{(α)′}(0) forms an attractive basin in the neighborhood N(x, y) specified by the filter window of F^{(α)}. For a pixel (u, v) ∈ N(x, y), the depth of the attractive basin is w F_−^{(α)}(u − x, v − y). If a pixel is involved in multiple zero filter responses, it accumulates the depths of the attractive basins generated by each filter. Thus, unless the absolute value of the driving force from the other well-defined terms is larger than the total depth of the attractive basin at (u, v), I(u, v) stays unchanged. In the image restoration experiments in later sections, γ < 1 shows better performance in forming piecewise constant regions.

Third, the learned potential functions confirm and improve the existing prior models and diffusion equations; more interestingly, the reaction terms appear here for the first time, and they can produce patterns and enhance preferred features. We demonstrate this property in the experiments below.

Fig. 11. A typical sample of ps(I) (384 × 384 pixels).
Fig. 12. (a) λ(x) = a log(1 + (x/b)²). (b) λ′(x) = (2ax/b²)/(1 + (x/b)²).
Fig. 13. The potential function φ(x) = a(1 − 1/(1 + (|x|/b)^γ)) and its derivative φ′(x). (a) φ(x), γ ≥ 1 (γ = 2.0). (b) φ(x), γ < 1 (γ = 0.8). (c) φ′(x), γ ≥ 1. (d) φ′(x), γ < 1.
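For comparison, a standard discretization of the Perona-Malik equation (16) looks as follows. This is the classical four-neighbor scheme, included only to contrast with the GRADE step above; the parameter values are arbitrary.

```python
import numpy as np

def perona_malik_step(image, b=3.0, dt=0.2):
    """One explicit step of I_t = div( g(|grad I|) grad I ) with g(d) = 1/(1 + (d/b)^2).
    Neighbor differences use wrap-around boundaries for brevity."""
    diffs = [np.roll(image, shift, axis) - image
             for axis in (0, 1) for shift in (-1, 1)]          # N, S, E, W differences
    flux = sum(d / (1.0 + (d / b) ** 2) for d in diffs)        # conductivity times difference
    return image + dt * flux
```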
4.3 Gibbs Reaction-Diffusion for Pattern Formation

In the literature there are many nonlinear PDEs for pattern formation, of which the following two examples are interesting:

1) The Turing reaction-diffusion equation, which models the chemical mechanism of animal coats [33], [21]. Two canonical patterns that the Turing equations can synthesize are leopard blobs and zebra stripes [34], [38]. These equations have also been applied to image processing, such as image halftoning [29], and a theoretical analysis can be found in [15].
2) The Swindale equation, which simulates the development of the ocular dominance stripes in the visual cortex of cats and monkeys [30]. The simulated patterns are very similar to zebra stripes.

In this section, we show that these patterns can easily be generated with only two or three filters using the GRADE. We run (14) starting with I(x, y, 0) a uniform noise image, and the GRADE converges to a local minimum. Some synthesized texture patterns are displayed in Fig. 14. For all six patterns in Fig. 14, we choose F_0^{(1)} = Δ, the Laplacian of Gaussian filter at level zero of the image pyramid, as the only diffusion filter, and we fix (a = 5, b = 10, x0 = 0, γ = 1.2) for φ_0^{(1)}(x). For the three patterns in Fig. 14a, Fig. 14b, and Fig. 14c, we choose the isotropic center-surround filter LG(√2), of window size 7 × 7 pixels, as the reaction filter F_1^{(2)} at level one of the image pyramid, and we set (a = −6.0, b = 10, γ = 2.0) for ψ_1^{(2)}(x). The differences between these three patterns are caused by x0 in ψ_1^{(2)}(x): x0 = 0 forms patterns with symmetric appearances for both black and white parts, as shown in Fig. 14a; as x0 becomes negative, black blobs begin to form, as shown in Fig. 14b, where x0 = −6; and positive x0 generates white blobs on a black background, as shown in Fig. 14c, where x0 = 6. The general smoothness of the images is attributed to the diffusion filter. Fig. 14d is generated with two reaction filters, LG(√2) at level one and at level two, respectively; therefore the GRADE creates blobs of mixed sizes. Similarly, we selected one cosine Gabor filter Gcos(4, 30°) (size 7 × 7 pixels, oriented at 30°) at level one as the reaction filter F_1^{(2)} for Fig. 14e, where (a = −3.5, b = 10, γ = 2.0, x0 = 0) for ψ_1^{(2)}(x). Fig. 14f is generated with two reaction filters, Gcos(4, 30°) and Gcos(4, 60°), at level one, where (a = −3.5, b = 10, γ = 2.0, x0 = −3) for ψ_1^{(2)}(x) and ψ_1^{(3)}(x).

It seems that the leopard blobs and zebra stripes are among the most canonical patterns, in that they can be generated with easy choices of filters and parameters. As shown in [40], the Gibbs distributions are capable of modeling a large variety of texture patterns, but the filters and the forms of ψ(x) have to be learned for each given texture pattern.

Fig. 14. Leopard blobs and zebra stripes synthesized by GRADEs.
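As a usage example (reusing grade_step from the sketch in Section 4.1), the configuration below corresponds to the blob patterns described above: Δ as the diffusion filter with (a = 5, b = 10, γ = 1.2, x0 = 0), and a 7 × 7 center-surround kernel as the reaction filter with (a = −6, b = 10, γ = 2, x0 = −6), the setting reported for Fig. 14b. We run the reaction filter at level zero rather than level one of the pyramid, which is a simplification of ours, and the LoG sampling helper is hypothetical.

```python
import numpy as np

delta = np.array([[0.0, 0.25, 0.0],          # the 3 x 3 window of Delta = LG(sqrt(2)/2)
                  [0.25, -1.0, 0.25],
                  [0.0, 0.25, 0.0]])

def log_kernel(size=7, sigma=np.sqrt(2.0)):
    """Laplacian-of-Gaussian LG(sigma) sampled on a size x size window (zero mean)."""
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    k = (x ** 2 + y ** 2 - sigma ** 2) * np.exp(-(x ** 2 + y ** 2) / sigma ** 2)
    return k - k.mean()

filters = [delta, log_kernel()]
params = [(5.0, 10.0, 1.2, 0.0),             # upright phi: diffusion
          (-6.0, 10.0, 2.0, -6.0)]           # inverted psi with x0 = -6: black blobs

image = np.random.rand(128, 128) * 31.0      # uniform noise initialization
for _ in range(500):
    image = grade_step(image, filters, params, dt=0.05)
```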
5 IMAGE ENHANCEMENT AND CLUTTER REMOVAL

So far we have studied the use of a single energy function U(I), either as the log-likelihood of a probability distribution of I or as a function of I to be minimized by gradient descent. In image processing, we often need to model both the underlying images and some distortions, and to maximize a posterior distribution. Suppose the distortions are additive, i.e., an input image is I^in = I + C. In many applications, the distortion image C is not i.i.d. Gaussian noise but clutter with structure, such as trees in front of a building or a military target; such clutter is very hard to handle by edge detection and image segmentation algorithms. We propose to model clutter by an extra Gibbs distribution, which can be learned from training images by the minimax entropy theory as we did for the underlying image I. Thus an extra pyramidal representation of I^in − I is maintained in Gibbs form, as shown in Fig. 15. The resulting posterior distribution is still of Gibbs form, with potential function

U*(I) = U_C(I^in − I; Λ_C, S_C) + U(I; Λ, S),    (18)

where U_C() is the potential of the clutter distribution.

Fig. 15. The computational scheme for removing noise and clutter.

Thus the MAP estimate of I is the minimum of U*. In the experiments which follow, we use the Langevin equation for minimization, a variant of simulated annealing:

dI_t = −∇U*(I) dt + √(2T(t)) dw_t,    (19)

where w(x, y, t) is the standard Brownian motion process, i.e., dw_t = √dt N(0, 1), and T(t) is the "temperature," which controls the magnitude of the random fluctuation. Under mild conditions on U*, (19) approaches a global minimum of U* at low temperature. Analyses of the convergence of such equations can be found in [14], [10], [8]. The computational load of the annealing process is notorious but, for applications like denoising, a fast decrease of temperature may not affect the final result very much.
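A discretized form of (19) is easy to write down; the sketch below is our own illustration with an Euler-Maruyama step and a simple geometric temperature schedule (the schedule, step size, and the grad_energy callback for ∇U* are assumptions, not the authors' settings).

```python
import numpy as np

def langevin_minimize(image, grad_energy, steps=2000, dt=0.01, t0=1.0, t_final=0.01):
    """Simulate dI = -grad U*(I) dt + sqrt(2 T(t)) dw, annealing T(t) from t0 down to t_final."""
    for k in range(steps):
        temperature = t0 * (t_final / t0) ** (k / max(steps - 1, 1))
        noise = np.random.randn(*image.shape)                  # dw ~ sqrt(dt) N(0, 1) per pixel
        image = image - dt * grad_energy(image) + np.sqrt(2.0 * temperature * dt) * noise
    return image
```

As the text notes, a fast temperature decrease is usually acceptable for denoising, even though it sacrifices the global-optimality guarantee of slow annealing.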
5.1 Experiment I

In the first experiment, we take U_C to be quadratic, i.e., C to be an i.i.d. Gaussian noise image. We first compare the performance of three prior models pl(I), pt(I), and ps(I), whose potential functions are, respectively,

U_l(I) = Σ_{x,y} ψ_l(∇x I) + ψ_l(∇y I),  ψ_l(x) = a min(θ², x²),    (20)
U_t(I) = Σ_{x,y} ψ_t(∇x I) + ψ_t(∇y I),  ψ_t(x) = a x² / (1 + c x²),    (21)
U_s(I) = the four-scale energy in (12).    (22)

ψ_l() and ψ_t() are the line process and the T-function displayed in Fig. 1b and Fig. 1c, respectively. Fig. 16 demonstrates the results. The original image is the lobster boat displayed in Fig. 2; it is normalized to have intensity in [0, 31], and Gaussian noise from N(0, 25) is added. The distorted image is displayed in Fig. 16a, where we keep the image boundary noise-free for convenience of the boundary condition. The restored images using pl(I), pt(I), and ps(I) are shown in Fig. 16b, Fig. 16c, and Fig. 16d, respectively. ps(I), which is the only model with a reaction term, appears to have the best effect in recovering the boat, especially the top of the boat, but it also enhances the water.

Fig. 16. (a) The noise-distorted image. (b)-(d) Restored images by the prior models pl(I), pt(I), and ps(I), respectively.

5.2 Experiment II

In many applications, i.i.d. Gaussian models for the distortions are not sufficient. For example, in Fig. 17a, the tree branches in the foreground will make image segmentation and object recognition extremely difficult, because they cause strong edges across the image. Modeling such clutter is a challenging problem in many applications. In this paper, we consider clutter only as a two-dimensional pattern, ignoring its geometry and 3D structure.

We collected a set of images of buildings and a set of images of trees, all against a clean background—the sky. For the tree images, we translate the image intensities to [−31, 0], i.e., zero for sky; in this case, since the trees are always darker than the building, the negative intensities approximately take care of the occlusion effects. We learn the Gibbs distributions for each set separately in the pyramid, and these models are then adopted respectively as the prior distribution and the likelihood, as in (18). We recover the underlying images by maximizing the posterior distribution using the stochastic process (19).

For example, Fig. 17b is computed using six filters, with two filters for I, {∇_{x,0}, ∇_{y,0}}, and four filters for I_C, {δ, ∇x, ∇y, Gcos(2, 30°)}; i.e., the potential for I_C is

U_C(I_C) = Σ_{x,y} φ(∇x I_C(x, y)) + φ(∇y I_C(x, y)) + f*(I_C(x, y)) + ψ*((Gcos * I_C)(x, y)).

In the above equation, f*(x) and ψ*(x) are fit to the potential functions learned from the set of tree images:

f*(x) = a1 (1 − 1/(1 + (|x − x0|/b)^γ)) if x < x0,  a2 (1 − 1/(1 + (|x − x0|/b)^γ)) if x ≥ x0,  with a2 > a1 > 0,

ψ*(x) = a (1 − 1/(1 + (|x − x0|/b)^γ)) if x < x0,  0 if x ≥ x0,  with a > 0.

So the energy term f*(I_C(x, y)) forces zero intensity for the clutter image while allowing for large negative intensities for the dark tree branches.

Fig. 18b is computed using eight filters, with four filters for I and four filters for I_C. Thirteen filters are used for Fig. 19, where the clutter is much heavier. As a comparison, we run the anisotropic diffusion process [25] on Fig. 19a; the images at iterations t = 50, 100, 300 are displayed in Fig. 20. As we can see, as t → ∞, I(t) becomes a flat image. A robust anisotropic diffusion equation has recently been reported in [2].

Fig. 17. (a) The observed image. (b) The restored image using six filters.
Fig. 18. (a) An observed image. (b) The restored image using eight filters.
Fig. 19. (a) The observed image. (b) The restored image using 13 filters.
Fig. 20. Images by anisotropic diffusion at iterations (a) t = 50, (b) t = 100, and (c) t = 300, for comparison.
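The two clutter potentials above translate directly into code. This sketch is our own; the numerical parameters are not specified in the text (they are fitted to the tree images), so the arguments here are placeholders.

```python
import numpy as np

def f_star(z, a1, a2, b, gamma, x0=0.0):
    """Asymmetric intensity potential f*: weight a1 below x0 and a2 above (a2 > a1 > 0),
    so large negative (dark tree) intensities are penalized less than positive ones."""
    a = np.where(z < x0, a1, a2)
    return a * (1.0 - 1.0 / (1.0 + (np.abs(z - x0) / b) ** gamma))

def psi_star(z, a, b, gamma, x0=0.0):
    """One-sided potential psi*: active only for responses below x0, zero otherwise (a > 0)."""
    base = a * (1.0 - 1.0 / (1.0 + (np.abs(z - x0) / b) ** gamma))
    return np.where(z < x0, base, 0.0)
```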
6 CONCLUSION

In this paper, we studied the statistics of natural images and, based on them, proposed a novel theory for learning generic prior models—the universal statistics of real-world scenes. We argue that the same strategy developed in this paper can be used in other applications, for example, learning probability models for MRI images and for 3D depth maps. The learned prior models demonstrate important properties such as the "inverted" potential terms for pattern formation and image enhancement. The expressive power of the learned Gibbs distributions allows us to model structured noise—clutter—in natural scenes. Furthermore, our prior learning method provides a novel framework for designing reaction-diffusion equations based on the observed images in a given application, without modeling the physical or chemical processes as was done before [33].

Although the synthesized images bear important features of natural images, they are still far from realistic ones. In other words, these generic prior models can do very little beyond image restoration. This is mainly due to the fact that all generic prior models are assumed to be translation invariant, and this homogeneity assumption is unrealistic. We call the generic prior models studied in this paper first-level priors. A more sophisticated prior model should incorporate concepts like object geometry, and we call such prior models second-level priors. Diffusion equations derived from second-level priors are studied in image segmentation [39] and in the scale space of shapes [16]; a discussion of some typical diffusion equations is given in [22]. It is our hope that this article will stimulate further investigations into building more realistic prior models as well as sophisticated PDEs for visual computation.

ACKNOWLEDGMENT

This work was started when the authors were at Harvard University. This research was supported by a U.S. National Science Foundation grant and a grant from ARO. We thank Y.N. Wu and S. Geman for valuable discussions.

REFERENCES

[1] A. Berger, V. Della Pietra, and S. Della Pietra, "A Maximum Entropy Approach to Natural Language Processing," Computational Linguistics, vol. 22, no. 1, 1996.
[2] M. Black, G. Sapiro, D. Marimont, and D. Heeger, "Robust Anisotropic Diffusion," IEEE Trans. Image Processing, to appear.
[3] M.J. Black and A. Rangarajan, "On the Unification of Line Processes, Outlier Rejection, and Robust Statistics With Applications in Early Vision," Int'l J. Computer Vision, vol. 19, no. 1, 1996.
[4] A. Blake and A. Zisserman, Visual Reconstruction. Cambridge, Mass.: MIT Press, 1987.
[5] J. Daugman, "Uncertainty Relation for Resolution in Space, Spatial Frequency, and Orientation Optimized by Two-Dimensional Visual Cortical Filters," J. Optical Soc. America, vol. 2, no. 7, pp. 1,160-1,169, 1985.
[6] D.J. Field, "Relations Between the Statistics of Natural Images and the Response Properties of Cortical Cells," J. Optical Soc. America A, vol. 4, no. 12, 1987.
[7] D. Gabor, "Theory of Communication," IEE Proc., vol. 93, no. 26, 1946.
[8] S.B. Gelfand and S.K. Mitter, "On Sampling Methods and Annealing Algorithms," Markov Random Fields—Theory and Applications. New York: Academic Press, 1993.
[9] S. Geman and D. Geman, "Stochastic Relaxation, Gibbs Distributions and the Bayesian Restoration of Images," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 6, no. 7, pp. 721-741, July 1984.
[10] S. Geman and C. Hwang, "Diffusion for Global Optimization," SIAM J. Control and Optimization, vol. 24, no. 5, 1986.
[11] D. Geman and G. Reynolds, "Constrained Restoration and the Recovery of Discontinuities," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 14, pp. 367-383, 1992.
[12] D. Geiger and F. Girosi, "Parallel and Deterministic Algorithms for MRFs: Surface Reconstruction," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 5, pp. 401-412, May 1991.
[13] D. Geiger and A.L. Yuille, "A Common Framework for Image Segmentation," Int'l J. Computer Vision, vol. 6, no. 3, pp. 227-243, 1991.
[14] B. Gidas, "A Renormalization Group Approach to Image Processing Problems," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, no. 2, Feb. 1989.
[15] P. Grindrod, The Theory and Applications of Reaction-Diffusion Equations. New York: Oxford Univ. Press, 1996.
[16] B. Kimia, A. Tannenbaum, and S. Zucker, "Shapes, Shocks, and Deformations I: The Components of Two-Dimensional Shape and the Reaction-Diffusion Space," Int'l J. Computer Vision, vol. 15, pp. 189-224, 1995.
[17] S. Kullback and R.A. Leibler, "On Information and Sufficiency," Ann. Math. Stat., vol. 22, pp. 79-86, 1951.
[18] J. Marroquin, S. Mitter, and T. Poggio, "Probabilistic Solution of Ill-Posed Problems in Computational Vision," J. Am. Statistical Assoc., vol. 82, no. 397, 1987.
[19] P. Meer, D. Mintz, D.Y. Kim, and A. Rosenfeld, "Robust Regression Methods for Computer Vision: A Review," Int'l J. Computer Vision, vol. 6, no. 1, 1991.
[20] D. Mumford and J. Shah, "Optimal Approximations by Piecewise Smooth Functions and Associated Variational Problems," Comm. Pure Applied Math., vol. 42, pp. 577-684, 1989.
[21] J.D. Murray, "A Pre-Pattern Formation Mechanism for Mammalian Coat Markings," J. Theoretical Biology, vol. 88, pp. 161-199, 1981.
[22] W. Niessen, B. Romeny, L. Florack, and M. Viergever, "A General Framework for Geometry-Driven Evolution Equations," Int'l J. Computer Vision, vol. 21, no. 3, pp. 187-205, 1997.
[23] M. Nitzberg and T. Shiota, "Nonlinear Image Filtering With Edge and Corner Enhancement," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 14, no. 8, pp. 826-833, Aug. 1992.
[24] B.A. Olshausen and D.J. Field, "Natural Image Statistics and Efficient Coding," Proc. Workshop on Information Theory and the Brain, Sept. 1995.
[25] P. Perona and J. Malik, "Scale-Space and Edge Detection Using Anisotropic Diffusion," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 12, no. 7, pp. 629-639, July 1990.
[26] T. Poggio, V. Torre, and C. Koch, "Computational Vision and Regularization Theory," Nature, vol. 317, pp. 314-319, 1985.
[27] T. Poggio and F. Girosi, "Networks for Approximation and Learning," Proc. IEEE, vol. 78, pp. 1,481-1,497, 1990.
[28] D.L. Ruderman and W. Bialek, "Statistics of Natural Images: Scaling in the Woods," Phys. Rev. Letters, vol. 73, pp. 814-817, 1994.
[29] A. Sherstinsky and R. Picard, "M-Lattice: From Morphogenesis to Image Processing," IEEE Trans. Image Processing, vol. 5, no. 7, July 1996.
[30] N.V. Swindale, "A Model for the Formation of Ocular Dominance Stripes," Proc. Royal Soc. London B, vol. 208, pp. 243-264, 1980.
[31] D. Terzopoulos, "Multilevel Computational Processes for Visual Surface Reconstruction," Computer Vision, Graphics, and Image Processing, vol. 24, pp. 52-96, 1983.
[32] A.N. Tikhonov and V.Y. Arsenin, Solutions of Ill-Posed Problems. V.H. Winston & Sons, 1977.
[33] A.M. Turing, "The Chemical Basis of Morphogenesis," Philosophical Trans. Royal Soc. London B, vol. 237, pp. 37-72, 1952.
[34] G. Turk, "Generating Textures on Arbitrary Surfaces Using Reaction-Diffusion," Computer Graphics, vol. 25, no. 4, 1991.
[35] A.B. Watson, "Efficiency of a Model Human Image Code," J. Optical Soc. America A, vol. 4, no. 12, 1987.
[36] K. Wilson, "The Renormalization Group: Critical Phenomena and the Kondo Problem," Rev. Mod. Phys., vol. 47, pp. 773-840, 1975.
[37] G. Winkler, Image Analysis, Random Fields and Dynamic Monte Carlo Methods. New York: Springer-Verlag, 1995.
[38] A. Witkin and M. Kass, "Reaction-Diffusion Textures," Computer Graphics, vol. 25, no. 4, 1991.
[39] S.C. Zhu and A.L. Yuille, "Region Competition: Unifying Snakes, Region Growing, and Bayes/MDL for Multi-Band Image Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 18, no. 9, pp. 884-900, Sept. 1996.
[40] S.C. Zhu, Y.N. Wu, and D.B. Mumford, "Filters, Random Fields, and Minimax Entropy (FRAME): Towards a Unified Theory for Texture Modeling," Proc. CVPR, 1996.
[41] S.C. Zhu, Y.N. Wu, and D.B. Mumford, "Minimax Entropy Principle and Its Application to Texture Modeling," Neural Computation, vol. 9, no. 8, Nov. 1997.
[42] S.C. Zhu and D.B. Mumford, "Learning Generic Prior Models for Visual Computation," Harvard Robotics Lab, Technical Report TR-96-05 (a short version appeared in Proc. CVPR, 1997).

Song Chun Zhu received his BS degree in computer science from the University of Science and Technology of China in 1991, and his MS and PhD degrees in computer science from Harvard University in 1994 and 1996, respectively. From 1996 to 1997, he worked in the Division of Applied Mathematics at Brown University, and he joined the Computer Science Department at Stanford University in 1997. His research is concentrated in the areas of computer and human vision, statistical modeling, and pattern theory.

David Mumford received his AB degree in mathematics from Harvard University in 1957 and his PhD degree, also in mathematics, from Harvard in 1961. He continued at Harvard as instructor (1961), associate professor (1963), and professor (1967). He was chairman of the department from 1981 to 1984 and has held visiting appointments at the Institute for Advanced Study (Princeton), Warwick University, the Tata Institute of Fundamental Research (Bombay), the Institut des Hautes Etudes Scientifiques, and the Isaac Newton Institute of Mathematical Sciences (Cambridge). He was Higgins Professor in Mathematics until his retirement in 1997 and was appointed University Professor at Brown in 1996. He received the Fields Medal in 1974 and an honorary DSc from Warwick in 1983, and he is a member of the National Academy of Sciences. He is president of the International Mathematical Union (1995-1998). Professor Mumford worked in the area of algebraic geometry up to 1983; since then he has been studying the application of mathematical ideas to the modeling of intelligence, both theoretically and in animals.