Exploring Defocus Matting: Nonparametric Acceleration, Super-Resolution, and Off-Center Matting

Defocus matting is a fully automatic and passive method for pulling mattes from video captured with coaxial cameras that have different depths of field and planes of focus. Nonparametric sampling can accelerate the video-matting process from minutes to seconds per frame. In addition, a super-resolution technique efficiently bridges the gap between mattes from high-resolution video cameras and those from low-resolution cameras. Off-center matting pulls mattes for an external high-resolution camera that doesn't share the same center of projection as the low-resolution cameras used to capture the defocus matting data.

Image matting and compositing are important operations in image editing, photography, and film production. Matting separates a foreground element from an image by estimating a color, $F$, and an opacity, $\alpha$, for each foreground pixel. The set of all $\alpha$ values is the alpha matte. Compositing blends the extracted foreground element, $F$, on top of an opaque background image, $B$, using linear blending: $I[x, y] = \alpha F + (1 - \alpha) B$. Given an image, $I$, matting solves the inverse problem with seven unknowns ($\alpha$, $F_r$, $F_g$, $F_b$, $B_r$, $B_g$, $B_b$) and three constraints ($I_r$, $I_g$, $I_b$).
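As a minimal sketch of why the inverse problem is underconstrained, the blending equation can be written per pixel as below. The function name `composite` and the example colors are illustrative assumptions, not part of the original system; the point is only that two different ($\alpha$, $F$, $B$) triples can produce the same observed color.

```python
def composite(alpha, fg, bg):
    """Blend one RGB foreground pixel over one RGB background pixel."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


# Two different (alpha, F, B) triples, chosen by hand for illustration.
observed_a = composite(0.5, (1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
observed_b = composite(0.25, (1.0, 0.0, 0.0), (1.0 / 3.0, 0.0, 2.0 / 3.0))

# Both triples explain the same three measurements, so a single image
# cannot distinguish between them without extra constraints.
same_color = all(abs(a - b) < 1e-9 for a, b in zip(observed_a, observed_b))
```

Here `same_color` evaluates to true, which is the ambiguity that blue screens, trimaps, or additional defocused images are meant to resolve.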
To make the matting problem tractable, most commercial matting approaches use a background with known, constant color. This is called blue screen matting (see the "Previous Work in Matting and Compositing" sidebar), even though green is preferable when shooting with digital cameras. Unfortunately, blue screen matting is an intrusive, expensive process unavailable to amateur users. The ideal matting approach works for scenes with arbitrary, unknown, and possibly dynamic backgrounds. This natural image matting typically requires substantial manual labor. However, two fully automatic natural image matting solutions have recently been developed. They acquire additional data during scene capture using special imaging devices.
In previous work, we (Joshi, Matusik, and Avidan) describe using a camera array to create a synthetic aperture image that focuses on the foreground object [1]. We estimate the variance of foreground and background pixels and compute a high-quality alpha matte at several frames per second (fps). We also show how to transfer the computed alpha matte to one of the cameras in the array using image-based rendering. Although camera arrays might be feasible for professional use, they can be impractical for certain applications.
McGuire et al. [2] developed a fully automated method that computes alpha mattes using three video cameras that share a common center of projection but vary in depth of field and focal plane. The additional defocus information constrains the original ill-posed matting problem. This defocus matting approach can compute high-quality mattes from natural images without user assistance. The special-purpose defocus camera uses three imagers that share an optical axis using beam splitters. Defocus matting is fully automatic, passive, and could be implemented compactly using methods similar to those used for constructing 3-charge-coupled-device (CCD) cameras. However, the approach also has several limitations. In this article, we address these limitations and extend defocus matting in several important ways.

Improvements on defocus matting
Our improved hardware setup allows more accurate alignment of cameras and beam splitters and more efficient light splitting, thus improving the input images' noise characteristics.
Our approach also addresses the speed of McGuire et al.'s optimization procedure. Their process requires many minutes of computation per frame and, when used for processing videos, runs independently on each video frame. We accelerate the method for video sequences by using a nonparametric sampling technique that relies on the optimization result from previous frames.
Third, we consider a setup consisting of a defocus camera where the primary camera is high resolution and the other two cameras are low resolution. We show how to handle the resolution difference by adapting an image-restoration technique to perform super-resolution to create alpha mattes that match the primary camera's resolution.
This setup gives good results, but it requires placing beam splitters in front of all the cameras, which might be undesirable in some situations. Thus, we extend defocus matting to handle the case where an off-center camera doesn't share a center of projection with the low-resolution cameras. In such a setup, we attach the three defocus cameras next to the off-center camera and align them in software instead of hardware.

The A-Cam
The A-Cam consists of a compact collection of three video cameras, each with a different plane of focus and depth of field, that share a single center of projection (Figure 1). The A-Cam is an improved version of McGuire et al.'s camera for defocus matting [2]. The three cameras are axis-aligned and image the scene through a tree of beam splitters. We focus one camera, $I_F$, on the foreground, another camera, $I_B$, on the background, and a pinhole camera, $I_P$, on the entire scene. The foreground and background cameras have a large aperture with a narrow depth of field, whereas the pinhole camera has a small aperture with a large depth of field. Figures 2a to 2c give examples of these images.
The defocus in the foreground and background cameras occurs because the cone of rays from a point in the scene intersects the image plane at a disk. The size of the resulting point-spread function (PSF), or circle of confusion, depends on the depth $z_F$ at which the camera is focused, the depth $z_R$ of the point, the lens's f-number, the focal length $f$, and the width of a pixel [2]. Depths are positive distances in front of the lens. To a good approximation, we can express the images from the three cameras, $I_P$, $I_B$, and $I_F$, with one equation per camera (Equations 1, 2, and 3).
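The following is a hedged sketch of this image-formation model, not the authors' implementation. It assumes the textbook thin-lens formula for the blur radius, a 1D single-channel signal with a box filter standing in for the 2D disk PSF, and the usual defocus matting assignment of blur: the background is blurred in $I_F$, and the foreground and $\alpha$ are blurred in $I_B$. All function and parameter names are illustrative.

```python
def coc_radius_pixels(z_focus, z_point, f_number, focal_length, pixel_width):
    """Thin-lens blur radius (in pixels) of a point at z_point when focused at z_focus."""
    ratio = (z_focus * (z_point - focal_length)) / (z_point * (z_focus - focal_length))
    return focal_length / (2.0 * pixel_width * f_number) * abs(ratio - 1.0)


def box_blur(signal, radius):
    """1D stand-in for convolution with disk(radius); samples past the edges are clamped."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)] for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def render_three_images(alpha, fg, bg, r_f, r_b):
    """Assumed forward model for the pinhole, background- and foreground-focused images."""
    alpha_fg = [a * f for a, f in zip(alpha, fg)]
    # Pinhole image: everything sharp.
    i_p = [af + (1 - a) * b for af, a, b in zip(alpha_fg, alpha, bg)]
    # Foreground-focused image: only the background is defocused.
    bg_blur = box_blur(bg, r_f)
    i_f = [af + (1 - a) * b for af, a, b in zip(alpha_fg, alpha, bg_blur)]
    # Background-focused image: the foreground and its matte are defocused.
    af_blur = box_blur(alpha_fg, r_b)
    a_blur = box_blur(alpha, r_b)
    i_b = [af + (1 - a) * b for af, a, b in zip(af_blur, a_blur, bg)]
    return i_p, i_b, i_f


def reconstruction_error(observed, reconstructed):
    """Sum of squared differences between observed and reconstructed image triples."""
    return sum((o - r) ** 2
               for obs, rec in zip(observed, reconstructed)
               for o, r in zip(obs, rec))
```

Because every step is a multiplication, addition, or linear filter, the derivatives of the reconstructed images with respect to the unknowns are cheap to evaluate, which is what makes a gradient-based optimizer practical.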

Previous Work in Matting and Compositing
Matting and compositing are important tasks in television, film production, and publishing and have attracted researchers' attention since at least the late 1950s. Vlahos's initial work on matting and compositing led to the development of the UltiMatte system. Wallace [1] and Smith and Blinn [2] formalized digital compositing for film production mathematically, including the invention of two-background matte extraction.
Two-background matte extraction shows that by imaging a foreground object against two backgrounds with different intensity (or color), you can derive an expression for $\alpha$ and the foreground color. Zongker et al. [3] introduced environment matting, an extension of alpha matting to cover more complex light transport effects (such as specular reflection or refraction).
Estimating $\alpha$ and foreground color for scenes in which the background can't be controlled is often referred to as natural image matting. This task is much more difficult because the solution is typically underconstrained. To constrain the problem, most methods make assumptions about the frequency content of background, foreground, and alpha [4-6]. Furthermore, they rely on a user-specified trimap that segments the image into definitely foreground, definitely background, and unknown regions. These algorithms analyze unknown pixels using local color distributions. Natural image matting methods have been extended to video by propagating user-specified trimaps with optical flow [7]. The Zcam (www.3dvsystems.com) and the work of Yasuda et al. [8] follow some of the same principles as our method in that they film additional data during capture to aid in matting. Both approaches also include a beam splitter so the extra camera shares the same optical axis as the main camera; however, the Zcam uses active illumination, which is undesirable for many applications, while the work of Yasuda et al. focuses on segmenting people using passive IR.
Equations 1, 2, and 3 give us seven unknowns ($\alpha$, $F_r$, $F_g$, $F_b$, $B_r$, $B_g$, $B_b$) and nine constraints (three per equation). The convolutions with $\mathrm{disk}(r_F)$ and $\mathrm{disk}(r_B)$ mean that we can't solve for each pixel independently. Instead, we search for a global solution that minimizes this error function:

$$J = (I_P - \hat{I}_P)^2 + (I_B - \hat{I}_B)^2 + (I_F - \hat{I}_F)^2, \qquad (4)$$

where $I_P$, $I_F$, and $I_B$ are the observed images, and $\hat{I}_P$, $\hat{I}_F$, and $\hat{I}_B$ are the reconstructed images. We find the minimum using McGuire et al.'s sparse nonlinear optimization procedure [2]. This procedure finds a solution by iteratively updating initial values for the unknowns, taking steps along the error function's gradient toward a minimum value. Because the convolutions in Equations 2 and 3 are linear operations and Equation 1 is also linear, we can efficiently compute these equations' derivatives and then easily compute the error function's gradient. The optimization procedure also includes several regularization terms, such as terms enforcing spatial smoothness for the recovered $\alpha$, $F$, and $B$.

We've improved the defocus camera hardware in two ways.
First, we constructed a rigid case to hold the cameras and beam splitters such that the optical centers are the same. This improves the cameras' overall alignment. In addition, the defocus camera is now self-contained, and the entire unit is much more portable and can be mounted on a tripod, as Figure 1 shows.
We also modified the beam splitters. The original design uses two 50/50 beam splitters. The pinhole camera receives half the light and the foreground- and background-focused cameras each receive one-quarter. However, because the camera apertures don't reflect this division of light, the pinhole doesn't receive enough light and the other cameras receive more than they need. As a result, the pinhole camera's images suffer from noise because of lack of light, and the other two cameras tend to be overexposed. We modified the design to use a 90/10 beam splitter to send 90 percent of the light to the pinhole with the remaining 10 percent split equally between the other two cameras. The ratio of the apertures now matches these splitting ratios so that each sensor receives the same amount of light.
Even though the three cameras are better aligned and matched in light exposure, we still need to colorimetrically and geometrically align them. We calibrate color through simple histogram matching, which matches the A-Cam's foreground- and background-focused cameras to the pinhole camera. We compute the histogram remapping on smoothed versions of the images to limit the effect of noise in the calibration process. The cameras' geometric alignment is straightforward. Because the cameras are coaxial, we need to correct only for each camera sensor's possible rotation and translation, which we do using one homography per camera. We do this by placing a grid in the scene, detecting feature points on the grid, and computing a homography for the foreground and background cameras to align them to the pinhole camera.
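The color calibration step can be sketched as classic CDF-based histogram matching. This is a minimal illustration for one channel with integer intensity levels; the function names are hypothetical, and the smoothing applied before the remapping is computed is left out.

```python
def cumulative_histogram(values, levels=256):
    """Normalized cumulative histogram of integer intensities in [0, levels)."""
    counts = [0] * levels
    for v in values:
        counts[v] += 1
    total = float(len(values))
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / total)
    return cdf


def histogram_match_lut(source, reference, levels=256):
    """Map each source level to the smallest reference level whose CDF reaches it."""
    src_cdf = cumulative_histogram(source, levels)
    ref_cdf = cumulative_histogram(reference, levels)
    lut, j = [], 0
    for s in src_cdf:
        # Both CDFs are non-decreasing, so j only ever moves forward.
        while j < levels - 1 and ref_cdf[j] < s:
            j += 1
        lut.append(j)
    return lut


def apply_lut(values, lut):
    """Remap intensities through the lookup table."""
    return [lut[v] for v in values]


# Toy data: the defocused camera reads 100 levels darker than the pinhole camera.
defocused = [10, 10, 20, 30]
pinhole = [110, 110, 120, 130]
matched = apply_lut(defocused, histogram_match_lut(defocused, pinhole))
```

In practice the lookup table would be computed once per channel from calibration footage and then applied to every frame.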

Nonparametric video matting
The original defocus matting work makes no distinction between still image matting and video matting, and creates mattes for video by processing each frame independently. With many minutes of CPU time per frame and video rates of 30 frames per second (fps), processing video clips quickly becomes impractical. To address this issue, we accelerate the video-matting process using a nonparametric model built on the fly from our optimization results. For videos with strong temporal consistency, where "past predicts future" is a reliable assumption, this process provides a significant speedup.

Our method starts with a training phase in which we individually process the first $k$ frames of a video sequence using our full optimization procedure (we use $k = 5$). Then, during the processing phase, we use the results of these $k$ frames to predict the result for a subsequent frame $i$ using a method similar to image analogies [3]. Specifically, we create an input feature vector, $v$, for each pixel. This vector contains the corresponding pixel values from $I_P$, $I_B$, and $I_F$ and the $\mathrm{disk}(r_B)$ and $\mathrm{disk}(r_F)$ size neighborhoods of pixels from $I_B$ and $I_F$, respectively. (The sizes of $\mathrm{disk}(r_B)$ and $\mathrm{disk}(r_F)$ are a function of the camera lens settings and the foreground and background depths. These disks model the lens' PSF at the foreground and background depths [2]. As a rule of thumb, they're generally on the order of two to 10 pixels wide for common defocus matting setups.) We don't use a neighborhood from $I_P$ because surrounding pixels don't affect the pinhole image constraint. For the neighborhood of pixels, instead of using pixel values directly, we use a rotationally invariant transform of the pixel data [4]. This transform collapses the feature space significantly, allowing our model to generalize to a larger range of input data with many fewer samples.

We populate a lookup table with the feature vectors and corresponding optimization results from the training frames. We then predict values for a subsequent frame $i$ using a process similar to that used by Hertzmann et al. [3]. In scanline order, for pixel $q$ in frame $i$, we

1. Construct the input feature vector $v(q)$.
2. Find five nearest-neighbor pixels (the five pixels $p_j$ closest to $q$ in terms of distance, $d_j = \|v(q) - v(p_j)\|^2$).
3. From these nearest-neighbor pixels, select the smoothest pixel, $p_{smooth}$; that is, the pixel whose sum of squared differences between its corresponding $\alpha$ and the $\alpha$ values for the already computed neighboring pixels in the alpha matte is minimal.
4. Set $d_{smooth} = \|v(q) - v(p_{smooth})\|^2$.
5. From the nearest-neighbor pixels, find the nearest pixel, $p_n$; that is, the pixel whose distance, $d_n = \|v(q) - v(p_n)\|^2$, is smallest.
Intuitively, we construct the result for frame $i$ pixel by pixel, setting each pixel's result equal either to the result for the closest match or to the result for a close match whose corresponding alpha is similar to those already computed in the image. The $\gamma$ parameter lets us adjust the result's smoothness. We've empirically found that $\gamma = 2$ works well. For efficiency, instead of finding exact nearest neighbors, we use an approximate nearest-neighbor algorithm with a small tolerance. Once we've predicted the entire frame, we evaluate the error function in Equation 4 with this newly created result. For any pixels whose error is greater than some tolerance $\epsilon$, we refine the estimate by running the optimization. We then add these newly optimized pixels back into our lookup table so we can continually update our nonparametric model.
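The per-pixel prediction can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it uses a brute-force search instead of an approximate nearest-neighbor structure, it predicts only $\alpha$, and the final decision rule (prefer the smooth candidate when $d_{smooth} \le \gamma\, d_n$) is an assumption modeled on the coherence test in image analogies, since the text does not spell the rule out. The names `predict_alpha` and `model` are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def predict_alpha(query, model, neighbor_alphas, gamma=2.0, k=5):
    """Predict alpha for one pixel.

    model: list of (feature_vector, alpha) pairs from already-optimized frames.
    neighbor_alphas: alpha values already predicted for adjacent pixels.
    """
    # The k training pixels nearest to the query in feature space.
    candidates = sorted(model, key=lambda entry: sq_dist(query, entry[0]))[:k]
    # The candidate whose alpha agrees best with the already computed neighbors.
    smooth = min(candidates,
                 key=lambda entry: sum((entry[1] - a) ** 2 for a in neighbor_alphas))
    # The candidate that is nearest in feature space.
    nearest = candidates[0]
    d_smooth = sq_dist(query, smooth[0])
    d_near = sq_dist(query, nearest[0])
    # Assumed rule: accept the smooth candidate unless it is much farther away.
    return smooth[1] if d_smooth <= gamma * d_near else nearest[1]


# Toy one-dimensional features, invented for illustration.
model = [((0.0,), 0.0), ((0.1,), 1.0), ((5.0,), 0.5)]
strict = predict_alpha((0.02,), model, [1.0, 1.0], gamma=2.0)
relaxed = predict_alpha((0.02,), model, [1.0, 1.0], gamma=20.0)
```

With the small $\gamma$ the nearest match wins (`strict` is 0.0); with the large one the neighbor-consistent match wins (`relaxed` is 1.0), which is the smoothness trade-off the parameter controls under this assumed rule.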
With this approach, the number of pixels to be optimized depends on the video's nature. If the video exhibits slow transitions, the nonparametric method should suffice. In case of abrupt changes, say due to camera motion, we'll have to evaluate more pixels using the time-consuming optimization method. In this latter case, after we add the frame to the model, the system is bootstrapped and can return to computing subsequent frames primarily from the model. By varying the $\epsilon$ tolerance, we can trade accuracy for speed, where a value of 0 would perform full optimization for each frame and a value of infinity would compute results using only the nonparametric model from the first $k$ frames. An in-between value would be a mixture of the two, which we refer to as mixed optimization.
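A small sketch of how the tolerance selects work for the optimizer is shown below. The helper name and the error values are invented for illustration; the per-pixel errors would come from evaluating Equation 4 on the predicted frame.

```python
def pixels_to_refine(per_pixel_error, epsilon):
    """Indices of predicted pixels whose reconstruction error exceeds epsilon."""
    return [i for i, err in enumerate(per_pixel_error) if err > epsilon]


errors = [0.001, 0.004, 0.010, 0.002]

mixed = pixels_to_refine(errors, 0.003)           # only the poorly predicted pixels
full = pixels_to_refine(errors, 0.0)              # every pixel with nonzero error
model_only = pixels_to_refine(errors, float("inf"))  # nothing is re-optimized
```

The length of the returned list, relative to the frame size, is what determines where a frame falls between the full-optimization and model-only running times.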

Alpha super-resolution
The resolution of the three cameras currently limits the resolution of the mattes pulled with defocus matting, so high-resolution results require three high-resolution cameras. Although using high-resolution cameras might seem reasonable, increases in computation time, memory and bandwidth usage, and camera cost can quickly make this option impractical. Using full optimization, the computation time for one 320 × 240 frame is about five to 10 minutes, which would put a 1-megapixel video at well over an hour per frame. Our nonparametric method will accelerate this process. However, when building a nonparametric model from high-resolution images, the model's size can become prohibitively large. Another concern is the limited bandwidth of cameras and their capture devices. Generally, cameras trade frame rate for increased resolution, a trade-off that's nice to avoid if possible.
To combat these issues, we propose a method that computes high-resolution alpha mattes by replacing the A-Cam's pinhole camera with a high-resolution camera, leaving the other two cameras the same. We downsample the high-resolution pinhole image, $I_P^H$, to get $I_P$ to match the resolution of $I_F$ and $I_B$. We then use $I_P$, $I_F$, and $I_B$ as before to compute mattes at 320 × 240 resolution using our nonparametrically accelerated optimization method. We then use a super-resolution technique as a postprocess to upgrade the alpha matte to the primary camera's native resolution. Specifically, given $\alpha$ and $F$, as computed using the method described in the previous sections, and given a blur function $f$, which models the resolution gap, where $I_P = f(I_P^H)$, we want to find $\alpha^H$ and $F^H$ such that $f(\alpha^H) = \alpha$ and $f(F^H) = F$. This is an underconstrained problem: there are many high-resolution $\alpha$ and $F$ images that could blur to the low-resolution versions. Thus, we solve this problem using an edge-preserving regularization term that's based on anisotropic diffusion. Researchers have used similar techniques for image restoration, and the following derivation mirrors those used in these techniques (see Kornprobst et al. [5] for a more detailed explanation of this derivation).
Specifically, we want to optimize a regularized error function (Equation 5) that combines a data term, which requires $f(\alpha^H)$ to match $\alpha$, with an edge-preserving regularization term. This regularization term lets us perform smoothing when gradient values are low; for high gradients, we can apply smoothing only along an edge and not across it.
We can minimize this error function using a variational method. Applying the Euler-Lagrange equation gives the condition under which the error function is minimized. Imposing Neumann boundary conditions, which specify that values of the solution's gradient in the direction normal to the boundary are zero, lets us convert the divergence term in that condition to a Laplacian matrix, $B$, that is spatially varying as a function of the gradients of $\alpha^H$.

We then assume that the gradients of $\alpha^H$ correspond to the gradients of the high-resolution pinhole image, $I_P^H$. This enforces that edges in the original image are preserved in the sharpened alpha matte. This assumption is valid for depth edges, which dominate in the unknown region, because gradients due to depth discontinuities correspond to edges in the alpha matte. Although this assumption is invalid for gradients in $I_P^H$ that don't appear in the alpha matte (for example, because of texture), this shouldn't cause any errors, as our regularization term is edge-preserving but shouldn't introduce edges when none are present. We also note that our super-resolution method doesn't rely on any new information relative to our original matting construct. The gradients we use for super-resolution are the same ones needed for defocus matting to compute an alpha matte. We perform super-resolution only on alpha in the unknown region.

With this assumption, Equation 5 becomes a linear system (Equation 6), and we can solve for $\alpha^H$ in one step using a sparse linear least-squares solver. Observe that our regularization depends on the edges of $I_P^H$ and not the edges of $\alpha^H$. This is where our method deviates from similar image restoration methods. In those methods, the regularization matrix $B$ is a function of the edges of the image being solved for, which in this case is $\alpha^H$. This gives rise to a nonlinear PDE that must be solved using an iterative approach. Our solution is linear, as $B$ is a function of a known value and thus can be solved in one step using linear least-squares. We use the same process to obtain $F^H$, the high-resolution foreground.
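To make the one-step linear solve concrete, here is a 1D sketch under explicit assumptions. It is not the authors' formulation: the blur $f$ is modeled as block averaging, the regularizer is a weighted sum of squared differences whose weights come from a guide signal standing in for $I_P^H$, and the weight function and the parameters `lam` and `kappa` are illustrative choices. A small dense Gaussian elimination stands in for a sparse solver, and the guide must have `factor` times as many samples as the low-resolution matte.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def upsample_alpha(alpha_low, guide_high, factor=2, lam=0.1, kappa=0.1):
    """Edge-guided upsampling: minimize |f(x) - alpha_low|^2 + lam * sum_j w_j (x[j+1] - x[j])^2."""
    n, m = len(guide_high), len(alpha_low)
    # Blur-and-downsample operator: each low-res sample averages `factor` high-res samples.
    f = [[1.0 / factor if i * factor <= j < (i + 1) * factor else 0.0 for j in range(n)]
         for i in range(m)]
    # Edge-stopping weights computed from the known guide, not from the unknown x,
    # which is what keeps the system linear.
    w = [1.0 / (1.0 + ((guide_high[j + 1] - guide_high[j]) / kappa) ** 2)
         for j in range(n - 1)]
    # Normal equations: (f^T f + lam * D^T W D) x = f^T alpha_low.
    lhs = [[sum(f[i][r] * f[i][c] for i in range(m)) for c in range(n)] for r in range(n)]
    for j, wj in enumerate(w):
        lhs[j][j] += lam * wj
        lhs[j + 1][j + 1] += lam * wj
        lhs[j][j + 1] -= lam * wj
        lhs[j + 1][j] -= lam * wj
    rhs = [sum(f[i][r] * alpha_low[i] for i in range(m)) for r in range(n)]
    return solve_linear(lhs, rhs)


# A low-resolution matte edge, sharpened using a guide with an edge at the same place.
alpha_high = upsample_alpha([0.0, 0.0, 1.0, 1.0], [0.2] * 4 + [0.9] * 4)
```

Because the weights depend only on the guide, the system matrix is fixed before solving, so no iteration over the unknown matte is needed.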
Performing super-resolution for the entire image at once can be problematic for large images: even though the matrices in Equation 6 are sparse, they can still be quite large. However, because the blurring that we're trying to remove is a relatively local function, we can perform super-resolution on a block-by-block basis rather efficiently even for large images. We've empirically determined that 60-pixel subblocks that overlap by seven pixels work well for our scenes.
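One way to enumerate such overlapping subblocks along a single image dimension is sketched below. The function name is hypothetical, and how results are merged in the overlap regions is not specified in the text, so that step is omitted.

```python
def block_ranges(length, block=60, overlap=7):
    """(start, stop) index pairs covering [0, length) with neighbors sharing `overlap` pixels."""
    step = block - overlap
    ranges, start = [], 0
    while True:
        stop = min(start + block, length)
        ranges.append((start, stop))
        if stop == length:
            return ranges
        start += step


# Column ranges for a 120-pixel-wide region; the last block is clipped at the border.
columns = block_ranges(120)
```

Applying the same function to rows and taking the Cartesian product of the two lists yields the 2D subblocks, each of which leads to a small, independent linear system.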

Off-center alpha matting
The A-Cam uses a collection of beam splitters placed between the cameras and the scene. Although this might be acceptable in some cases, it might not be desirable for high-end photography or for filming movies where the user might not want to let any optical device, such as a beam splitter, come between the camera and the scene. We propose a hybrid camera approach [6] in which we use the A-Cam as an accessory to an external off-center camera, and the cameras don't share the same center of projection, as Figure 1b shows. In this situation, we can't use a matte from the A-Cam directly as a matte for the off-center camera. Instead, we compute the alpha matte directly for the off-center camera and use the data from the A-Cam to regularize the solution.
The off-center camera and the A-Cam observe the same scene, albeit with potentially different camera settings. Still, we can assume that they're both focused on the foreground object and that we can colorimetrically and geometrically align the cameras. Given this information, how can we use the A-Cam data to regularize the alpha matte we compute for the off-center camera?

Regularized alpha matting
The external off-center camera image, $I_{OFF}$, gives us one equation with seven unknowns and three constraints:

$$I_{OFF} = \alpha_{OFF} F_{OFF} + (1 - \alpha_{OFF}) B_{OFF}.$$

If the A-Cam and primary camera are colorimetrically and geometrically aligned and the foreground object is in focus in both cameras, we can assume that $\alpha_{OFF} = \alpha$ and $F_{OFF} = F$. This reduces the number of unknowns by four. Together with Equations 1, 2, and 3, this gives us a set of four equations with 10 unknowns and 12 constraints.

IEEE Computer Graphics and Applications
Just as before, we can't solve this problem on a pixel-by-pixel basis because we convolve the images $I_B$ and $I_F$ with a finite-size disk. Instead, we solve this system of equations using regularized optimization. Of all the solutions that satisfy the primary camera's matting equation, we choose the one that also satisfies the alpha matting equation of the A-Cam. Specifically, let $J = (I_P - \hat{I}_P)^2 + (I_B - \hat{I}_B)^2 + (I_F - \hat{I}_F)^2$ be the error term for the A-Cam, where $\hat{I}_P$, $\hat{I}_B$, and $\hat{I}_F$ are the pinhole, background, and foreground images recovered by the optimization, respectively, and let $J_{OFF} = (I_{OFF} - \hat{I}_{OFF})^2$ be the error term for the off-center camera, where $\hat{I}_{OFF}$ is the recovered off-center camera image. We solve

$$\arg\min_{\alpha, F, B, B_{OFF}} \; J_{OFF} + \lambda J, \qquad (7)$$

where $\lambda$ is the regularization parameter controlled by the user. We've empirically found that setting $\lambda$ between 0.3 and 0.5 works well.

Although we assume that the off-center camera and the A-Cam can share $\alpha$ and $F$, this isn't the case with the background. This is because the background $B$ and the background $B_{OFF}$ are related by a homography if they're planar, or by pixel-wise correspondence if they aren't. In addition, they might be related by a convolution (if they're defocused differently because of different depths of field of the lenses in the A-Cam and off-center camera). We avoid estimating these relationships by directly solving for both $B$ and $B_{OFF}$. We adapt McGuire et al.'s [2] minimization method to minimize the function in Equation 7 by adding the additional unknown, $B_{OFF}$, and error term, $J_{OFF}$, to the error function and gradient computation and by incorporating the regularization factor; the minimization method is otherwise unchanged. We've similarly adapted our nonparametric acceleration method to incorporate this additional data.

Off-center calibration
Colorimetrically and geometrically calibrating the primary camera and the A-Cam takes several steps.
We achieve color calibration through simple histogram matching, as described earlier. For geometric calibration, we assume that we know the foreground object's depth and place a grid at that location ahead of time. (We address this assumption later.) We use corresponding points from this grid to compute a homography between the primary camera and the A-Cam.
Because this initial homography aligns features at only one depth and a typical foreground object spans a range of depths, we must perform additional alignment between the off-center camera and the A-Cam. One approach is to use pixel-wise optical flow. However, because optical flow tends to fail along object borders, we instead align the foreground by computing homographies on a block-by-block basis. Because we solve for different backgrounds for the A-Cam and off-center camera, we need to align only the foreground pixels. We do this by computing homographies to best align only the foreground pixels, as determined by our trimap. We align the background pixels in a block with the same transform as the foreground pixels. This block-based alignment preserves the structure of the foreground object in the border region by providing a rigid transform across the boundary. Our method works well as long as the object boundary lies within the block. Small seams can exist if the object boundary lies across a block boundary, so we size our blocks to approximately match the unknown region's size to minimize this situation's occurrence. We've empirically determined that 60 × 60-pixel blocks work well for our scenes.
Computing these homographies requires estimating eight parameters per homography. Fortunately, we can rely on the fact that the space of all homographies between a pair of images is a 4D space [7]. Intuitively, this is true because a plane that defines the homography has four degrees of freedom. We therefore image a planar grid in 10 positions and compute the homography for each position. For these 10 homographies, we use principal components analysis (PCA) to find the four basis homographies spanning the space of all homographies between the two cameras. Then, for every pixel block, we find a homography that minimizes the sum-of-squared differences between the aligned pixel blocks. We need to solve for only four unknowns (the four coefficients of the basis homographies), which we do with an iterative Levenberg-Marquardt solver.
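The four-coefficient parameterization can be illustrated with a short sketch. The basis matrices below are a made-up toy basis chosen so the arithmetic is easy to follow; in the actual procedure they would come from PCA on the 10 calibration homographies, and the coefficients would be found by the per-block solver rather than set by hand.

```python
def combine_homographies(basis, coeffs):
    """Linear combination of 3x3 basis matrices with the given coefficients."""
    return [[sum(c * h[r][k] for c, h in zip(coeffs, basis)) for k in range(3)]
            for r in range(3)]


def apply_homography(h, x, y):
    """Map pixel (x, y) through homography h using homogeneous coordinates."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return u / w, v / w


# Hypothetical basis: identity, x shift, y shift, and one perspective term.
basis = [
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    [[0, 0, 1], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 1], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [1, 0, 0]],
]

# Four coefficients fully determine the block's homography.
h_block = combine_homographies(basis, (1, 5, -3, 0))
mapped = apply_homography(h_block, 10, 20)
```

A per-block alignment would wrap these two functions in a cost that warps the block with the candidate coefficients and sums squared differences over the foreground pixels, leaving only four numbers to optimize instead of eight.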
We return now to the assumption that we know the foreground object's depth ahead of time. This need not be the case. After we recover the four basis homography matrices between the off-center camera and the A-Cam, we can represent any homography, including the one caused by the plane going through the foreground object, as a linear combination of the basis matrices. We can estimate the foreground object's depth by estimating the plane that minimizes the image difference between the off-center camera and the A-Cam. The plane estimation involves estimating only four parameters (the coefficients of the basis homographies) and not the eight parameters of a homography. Note that even if the off-center camera moves, as long as the A-Cam remains rigidly attached to it, these basis homographies can continue to be used for alignment.

Experiments
We conducted several experiments to validate our methods. We first show results using our nonparametric acceleration method. We then show results that illustrate our super-resolution method and follow those with results from off-center matting. In addition to visual results, we provide an analysis of the running times of our extensions.

Figure 3 compares results from our nonparametric method with a result from the original defocus matting paper. We captured this data set, of a person with blond hair, using three 640 × 480 (Bayer pattern) video cameras. As in McGuire et al. [2], we demosaiced the data to 320 × 240 resolution for computing mattes. For this sequence, we used the first five frames for training the model, and here show the result of the 22nd frame. For a comparison of the longer video sequence, see our video results at http://graphics.ucsd.edu/papers/exploring_defocus_matting. Our mixed optimization approach produces a result of comparable quality to the original method in a little less than half the time (3.5 minutes per frame compared to 8.5 minutes with the original optimization process).

Figure 3 also shows our super-resolution approach applied to this same data set. For the first result, we push our super-resolution method and downsample the A-Cam data to 160 × 120 and then compute the matte at this low resolution. We then upgrade these mattes to 640 × 480 resolution. In the second result, we upgrade a 320 × 240 result to 640 × 480. We show the original low-resolution versions and a result computed natively at 640 × 480 for comparison. Both super-resolution results look sharper and more detailed than the original optimization output, and the upgraded 160 × 120 result (Figure 3f) and the upgraded 320 × 240 result (Figure 3h) are on par with the result computed natively at 640 × 480 (Figure 3i).
Although the results with this original data set are reasonable, some errors are still present. These errors are because of poor geometric alignment and saturation of the foreground- and background-focused cameras. Our nonparametric method doesn't introduce these errors; similar artifacts are present in McGuire et al.'s results [2]. It's the presence of these errors that led us to change the beam splitters and build an accurately designed case for holding the camera and beam splitter assembly.

Figure 2 shows a result for a new data set acquired with our improved A-Cam setup. We show the A-Cam images, resulting matte and composite, and super-resolution from 320 × 240 to 640 × 480. The images show a stuffed bear in front of some wood boxes and artificial trees. The improved alignment from our new A-Cam hardware setup lets us pull a much cleaner and more accurate matte, even though the background in this scene has complex structure and high-frequency content.
Our final set of results illustrates off-center matting. We attached the A-Cam to an external off-center camera, as Figure 1b shows. We captured images of a grid in several positions in the filming area and used them to compute the basis homographies between the off-center camera and the A-Cam, as we described earlier.

One data set shows a gorilla in front of some wood boxes. In this data set, there is a relatively large (5×) resolution difference between the off-center camera and the A-Cam because the off-center camera is a two-megapixel consumer camera. This object spans a relatively large depth range, thus our per-block geometric alignment was essential. We successfully pulled the alpha matte and, as Figure 5 shows, recovered a significant amount of detail using our super-resolution method.

Figure 6 shows a single frame of a video sequence of an off-center matting result for a person with black hair in an office supply room. We captured this data set using a 640 × 480 video camera mounted next to the A-Cam. We computed mattes at 320 × 240 resolution and then upgraded to 640 × 480 using our super-resolution method. Figure 7 shows a side-by-side comparison of a result using full optimization, mixed optimization, and the nonparametric model alone. We computed the fully optimized result in six minutes, the mixed optimization result in 1.5 minutes, and the nonparametric result in 30 seconds. We used the first five frames of a 60-frame sequence for training. Video results for this data set are available at http://graphics.ucsd.edu/papers/exploring_defocus_matting.

Table 1 compares running times of our two video data sets processed with and without our extensions. This data demonstrates that our pure nonparametric approach provides a significant speedup over the full optimization approach. A mixed optimization still produces a moderate speedup.
The exact value to set for ε in a mixed approach depends on the desired balance of speed versus quality and also on the overall data quality; that is, some inherent lower bound on the error exists, given noise in the data, calibration errors, and so on. Thus, setting ε below this lower bound is equivalent to specifying full optimization. The units of ε are the mean-squared difference in image intensity between the observed and predicted images. For the data sets shown here, we set ε equal to 0.003, which corresponds to around a 5-percent error in image intensity.
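To make the role of ε concrete, the following Python sketch computes the mean-squared difference between an observed and a predicted image and compares it with the threshold. It is only an illustration under stated assumptions: intensities are taken to be normalized to [0, 1], the function names are hypothetical and not part of the authors' Matlab implementation, and the rule "refine only when the error exceeds ε" is our reading of the mixed approach, consistent with the remark that setting ε below the noise floor is equivalent to full optimization.

```python
import numpy as np

EPSILON = 0.003  # threshold quoted in the text (mean-squared intensity difference)


def mean_squared_difference(observed, predicted):
    """Mean-squared difference of image intensity, assuming values in [0, 1]."""
    observed = np.asarray(observed, dtype=np.float64)
    predicted = np.asarray(predicted, dtype=np.float64)
    return float(np.mean((observed - predicted) ** 2))


def needs_optimization(observed, predicted, epsilon=EPSILON):
    """Assumed mixed-mode rule: refine only when the error exceeds epsilon."""
    return mean_squared_difference(observed, predicted) > epsilon


# A uniform 5% intensity error gives an MSE of 0.05**2 = 0.0025, just under epsilon.
observed = np.full((4, 4), 0.50)
predicted = observed + 0.05
mse = mean_squared_difference(observed, predicted)

# The root of the threshold is about 0.055, matching the "around 5 percent" figure.
rms_at_threshold = float(np.sqrt(EPSILON))
```

Taking the square root of ε = 0.003 gives a root-mean-square intensity error of about 0.055, which is where the "around 5 percent" figure comes from.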
Although we initially envisioned the super-resolution method as a way to match mixed-resolution cameras, our method can also serve as an acceleration technique when all cameras have the same resolution. First, we downsample all videos to a low resolution, then compute an alpha matte, and then upgrade back to the original resolution. Although this might not provide exactly the same quality as optimizing at the A-Cam's native resolution, for some applications the speed increase might justify the slight loss of quality.
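The downsample, matte, and upsample sequence described above can be sketched in Python as follows. This is a minimal sketch, not the article's implementation: `pull_matte` is a hypothetical caller-supplied function standing in for the defocus matting optimization, 2 × 2 block averaging is an assumed downsampling filter, and nearest-neighbor replication is only a placeholder for the super-resolution step, which is considerably more sophisticated.

```python
import numpy as np


def downsample2(image):
    """Halve resolution by averaging 2x2 blocks (height and width must be even)."""
    h, w = image.shape[:2]
    blocks = image.reshape(h // 2, 2, w // 2, 2, *image.shape[2:])
    return blocks.mean(axis=(1, 3))


def upsample2_nearest(alpha):
    """Placeholder for the super-resolution step: nearest-neighbor 2x upsampling."""
    return np.repeat(np.repeat(alpha, 2, axis=0), 2, axis=1)


def accelerated_matte(frames, pull_matte):
    """Downsample every input, pull a low-resolution matte, then upsample it.

    `pull_matte` (hypothetical) maps a list of low-resolution frames to an
    alpha matte with the same height and width as those frames.
    """
    low_res = [downsample2(f) for f in frames]
    alpha_low = pull_matte(low_res)
    return upsample2_nearest(alpha_low)


# Toy usage: a stand-in matte function that simply thresholds the first frame.
rng = np.random.default_rng(0)
frames = [rng.random((480, 640)) for _ in range(3)]
alpha = accelerated_matte(frames, lambda fs: (fs[0] > 0.5).astype(np.float64))
```

The potential saving comes from the middle step: at half resolution in each dimension, the matte computation touches only a quarter of the pixels.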

Discussion and future work
Our novel defocus matting extensions increase speed and resolution and can be used with traditional consumer cameras; however, some limitations remain. Our nonparametric method's two primary limitations are typical of many learning methods.
The first limitation is that errors present in the training set will propagate through to prediction. Currently, we pick our training set by simply using the first five frames in a sequence. By using this scheme, we let the quality of the optimization for these initial frames dictate our training data's quality, but we don't have any way to ensure that these first five frames are actually good examples. Furthermore, we pick entire frames for training. One straightforward modification would be to include training data on a per-pixel basis only where the absolute error from the optimization is low. A further modification would be to train a classifier to automatically label "good" and "bad" results.
The nonparametric method's second limitation is the memory used by the model, which grows as we update it with new frames. Our current Matlab implementation has an upper limit of about 60 added frames. This number varies, however, with the ε threshold and the extent of the video's temporal coherence; that is, the more the frames are alike, the smaller the model can be. One solution to this problem is to compress the model. Recently, researchers have used canonical correlation analysis to reduce nonparametric model size. We're interested in exploring this as an alternative to the traditional PCA-based techniques.
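The per-pixel training-set selection suggested for the first limitation could look something like the following Python sketch. Everything here is an assumption made for illustration: the feature layout, the function name `select_training_pixels`, and the cutoff `max_error`, for which the article gives no value.

```python
import numpy as np


def select_training_pixels(features, alphas, abs_error, max_error=0.05):
    """Keep only pixels whose optimization error is low.

    features:  (H, W, D) per-pixel feature vectors (hypothetical layout)
    alphas:    (H, W) optimized alpha values used as training targets
    abs_error: (H, W) absolute error reported by the optimization
    max_error: assumed cutoff; the article does not specify one
    """
    mask = abs_error <= max_error
    # Boolean indexing flattens the selected pixels into (N, D) and (N,) arrays.
    return features[mask], alphas[mask]


# Toy 2x2 frame with 3-dimensional features; two pixels fall under the cutoff.
features = np.arange(2 * 2 * 3, dtype=np.float64).reshape(2, 2, 3)
alphas = np.array([[0.0, 0.5], [1.0, 0.2]])
abs_error = np.array([[0.01, 0.20], [0.03, 0.10]])
train_x, train_y = select_training_pixels(features, alphas, abs_error)
```

Filtering in this way would also keep fewer samples per frame, which might ease the memory growth noted for the second limitation, although we have not measured that.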
A limitation of our super-resolution method is that it can enhance noise in the input low-resolution alpha matte. Occasionally a small, relatively low, errant alpha value becomes more prominent after super-resolution. In practice, a small amount of postprocessing [2] will often remove these errors before super-resolution. Decoupling our super-resolution method from the matte-optimization procedure might also introduce errors relative to computing the matte at high resolution with an integrated super-resolution method. Our motivation was speed, and we considered this additional source of error acceptable. However, one future area of work is to merge these processes without the performance hit of full optimization at high resolution. Another approach for improving our method is to combine it with a nonparametric approach similar in principle to our acceleration method. This seems like a promising direction because example-based methods have been used successfully for super-resolution of traditional images.
Our off-center method is currently limited to scenes containing predominantly planar foreground and background objects. Although our alignment can handle some depth variation, this remains a limitation in general. One way to address it is to use more sophisticated alignment techniques that leverage the fact that the off-center primary camera and the pinhole camera essentially provide a stereo pair. Also, although the resolution gap is the most obvious distinction between the consumer-level camera and the A-Cam, other differences in camera setup might be considered, such as a gap in the dynamic range of the two cameras.
While we've addressed several limitations of defocus matting, other inherent limitations remain. The primary limitation is the presence of local minima in the error function minimized by nonlinear optimization. The final results can be sensitive to the initial guess and the choice of the regularization parameters involved in the process. Thus, the process can often terminate in a local minimum. Figure 8 shows an example of this type of failure. In this result, the optimization terminated with an incorrect low-frequency blur of the foreground color polluting the background color. Because of the large amount of defocus and the relatively smooth background, this error violates only the pinhole constraint and yields low error under the remaining constraints. Our nonparametric acceleration method provides a better initial guess than the original process of inpainting the unknown areas and estimating the initial alpha, but it doesn't remove the need for some form of initial guess for generating the training frames. Improving the initial guess (for example, by using single-frame methods such as Poisson matting) could be a fruitful direction for future work.