Deep Sparse-coded Network (DSN)

We present Deep Sparse-coded Network (DSN), a deep architecture based on multilayer sparse coding. It has been considered difficult to learn a useful feature hierarchy by stacking sparse coding layers in a straightforward manner. The primary reason is sparse coding's modeling assumption of taking in a dense input and yielding a sparse output vector; applying one sparse coding layer to the output of another tends to violate this assumption. We overcome this shortcoming by interlacing nonlinear pooling units. Average- or max-pooled sparse codes are aggregated to form dense input vectors for the next sparse coding layer. Pooling achieves nonlinear activation analogous to neural networks while not introducing diminished gradient flows during training. Beyond pretraining via greedy layerwise sparse coding and dictionary learning, we introduce a novel backpropagation algorithm to finetune the proposed DSN. We build an experimental 4-layer DSN with the ℓ1-regularized LARS and the greedy-ℓ0 OMP, and demonstrate superior performance over a similarly configured stacked autoencoder (SAE) on CIFAR-10.


MOTIVATION
The representational power of single-layer feature learning is limited for tasks that involve large, complex data objects such as a high-resolution image of a human face. Best current practices in visual recognition use deep architectures based on the autoencoder [1], restricted Boltzmann machine (RBM) [2], and convolutional neural network (CNN) [3]. A deep architecture stacks two or more layers of feature learning units in the hope of discovering hierarchical representations for data. In other words, deep architectures allow us to understand a feature at each layer in terms of the features of the layer below. Such hierarchical decomposition is particularly useful when we cannot resolve the ambiguity of the low-level (or localized) features of data. Another benefit of using deep architectures is representational efficiency. Deep architectures can compact all characteristic features of an entire image, book, or lengthy multimedia clip into a single vector.
In an empirical analysis by Coates, Lee and Ng [4], sparse coding is found superior to the RBM, deep neural network, and CNN for classification tasks on the CIFAR-10 and NORB datasets. We have also drawn a similar conclusion from our own experiments with sparse coding. With these in mind, it is sound to build a deep architecture on sparse coding. Unfortunately, this takes much more than just stacking sparse coding units together. From sparse coding research on hierarchical feature learning [5, 6], we can deduce plausible explanations for the difficulty. First, sparse coding (in particular, the ℓ1-regularized LASSO or LARS) is computationally expensive for multilayering and the associated optimizations. From our experience, it is indeed quite cumbersome and challenging to simply connect multiple sparse coding units and run data through them as a feedforward network. Secondly, sparse coding makes an inherent assumption that the input is non-sparse. This makes the straightforward adoption of taking the output of one sparse coding unit as the input to another flawed. Lastly, it is difficult to optimize all layers of sparse coding jointly. One consensual notion of deep learning suggests that layer-by-layer unsupervised pretraining should be followed by supervised finetuning of the whole system, which is commonly done by backpropagation.
In this paper, we present a deep architecture for sparse coding as a principled extension of its single-layer counterpart. We build on both ℓ1-regularized and greedy-ℓ0 sparse coding. Using max pooling as a nonlinear activation analogous to neural networks, we avoid a linear cascade of dictionaries and keep the effect of multilayering intact. This architectural choice also remedies the problem of too many feature vectors by aggregating them to their maximum elements, and helps preserve translational invariance in higher-layer representations. Beyond the layer-by-layer pretraining, we propose a novel backpropagation algorithm that can further improve performance.
The rest of this paper is organized as follows. In Section 2, we provide background on sparse coding. Section 3 introduces Deep Sparse-coded Network (DSN), explains its architectural principles, and discusses training algorithms. In Section 4, we present an empirical evaluation of DSN, and Section 5 concludes the paper.

SPARSE CODING BACKGROUND
Sparse coding is a general class of unsupervised methods for learning efficient representations of data as linear combinations of basis vectors in a dictionary. Given an input (patch) x ∈ R^N drawn from the raw data and a dictionary D ∈ R^{N×K}, sparse coding searches for a representation (i.e., sparse code) y ∈ R^K via the ℓ1-regularized optimization $\min_y \|x - Dy\|_2^2 + \lambda \|y\|_1$, known as LASSO or LARS. Greedy-ℓ0 matching pursuit such as OMP is an alternative method for sparse coding that solves $\min_y \|x - Dy\|_2^2$ s.t. $\|y\|_0 \le S$. Dictionary learning is essential for sparse coding. During unsupervised feature learning, we perform sparse coding on unlabeled training examples while holding D constant. After sparse coding finishes, dictionary updates follow. The two steps alternate until convergence. Sparse coding can be thought of as a generalization of K-means clustering, which hard-assigns each training example to one cluster. Sparse coding can also be thought of as a less stringent, purely data-driven version of a Gaussian mixture model.
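To make the two formulations concrete, the following small sketch (ours, not the paper's code) codes one dense patch against a fixed random dictionary with scikit-learn's SparseCoder; the dictionary, dimensions, and parameter values are illustrative assumptions.

```python
# Minimal sketch of l1-regularized (LASSO/LARS) and greedy-l0 (OMP) sparse coding.
import numpy as np
from sklearn.decomposition import SparseCoder

N, K = 108, 432                      # input dimension and number of basis vectors (illustrative)
rng = np.random.default_rng(0)
D = rng.standard_normal((K, N))      # SparseCoder convention: rows are basis vectors
D /= np.linalg.norm(D, axis=1, keepdims=True)

x = rng.standard_normal((1, N))      # one dense input patch

# l1-regularized coding: min_y ||x - Dy||_2^2 + lambda * ||y||_1
lars_coder = SparseCoder(dictionary=D, transform_algorithm='lasso_lars', transform_alpha=0.1)
y_l1 = lars_coder.transform(x)

# greedy-l0 coding: min_y ||x - Dy||_2^2  s.t.  ||y||_0 <= S
omp_coder = SparseCoder(dictionary=D, transform_algorithm='omp', transform_n_nonzero_coefs=20)
y_l0 = omp_coder.transform(x)

print(np.count_nonzero(y_l1), np.count_nonzero(y_l0))   # both codes are sparse
```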

Architectural Overview
Figure 1: Deep Sparse-coded Network (DSN) with four layers

Deep Sparse-coded Network (DSN) is a feedforward network built on multilayer sparse coding. We present a 4-layer DSN in Figure 1. This is a deep architecture because there are two hidden layers of sparse coding, each of which can learn the corresponding level's feature representations by training its own dictionary. Similar to a neural network, layers 1 and 4 are the input and output layers. The input layer takes in vectorized patches drawn from the raw data, which are sparse coded and max pooled, propagating up the layers. Unlike a convolutional neural network, we do not count pooling units as a separate layer. The output layer consists of classifiers or regressors specific to application needs.
Figure 2 depicts the stackable layering module used to build DSN. Sparse coding and pooling units together constitute the module. The Jth hidden layer (for J ≥ II) takes in the pooled sparse codes z_{J−1} from the previous hidden layer and produces y_J using dictionary D_J. Max pooling the y_J's yields the pooled sparse codes z_J, which are passed as the input to hidden layer J + 1.
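The layering module can be sketched in a few lines. The following illustration is our own (not the authors' implementation): it assumes scikit-learn's SparseCoder for the coding step and groups of M consecutive codes for max pooling; how pooled codes are grouped into the next layer's inputs is a simplification.

```python
# Sketch of the stackable DSN layer module: sparse coding with dictionary D_J,
# then max pooling over groups of M_J codes; pooled codes feed the next layer.
import numpy as np
from sklearn.decomposition import SparseCoder

def dsn_layer(X, D, M, algorithm='omp', **kw):
    """Sparse-code the rows of X against dictionary D (K x N), then max-pool
    every group of M consecutive codes into one pooled sparse code z."""
    coder = SparseCoder(dictionary=D, transform_algorithm=algorithm, **kw)
    Y = coder.transform(X)                                         # (num_inputs, K) codes y
    num_groups = Y.shape[0] // M
    Z = Y[:num_groups * M].reshape(num_groups, M, -1).max(axis=1)  # pooled codes z
    return Y, Z

# Hidden layer J takes the pooled codes z_{J-1} of the layer below as its dense input.
rng = np.random.default_rng(0)
X = rng.standard_normal((36, 108))                    # e.g., 36 whitened patches from one quadrant
D1 = rng.standard_normal((432, 108)); D1 /= np.linalg.norm(D1, axis=1, keepdims=True)
D2 = rng.standard_normal((864, 432)); D2 /= np.linalg.norm(D2, axis=1, keepdims=True)

_, Z1 = dsn_layer(X, D1, M=9, transform_n_nonzero_coefs=86)   # hidden layer 1
_, Z2 = dsn_layer(Z1, D2, M=4, transform_n_nonzero_coefs=86)  # hidden layer 2
print(Z1.shape, Z2.shape)
```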

Algorithms
Hinton et al. [7] suggested pretraining a deep architecture with layer-by-layer unsupervised learning and finetuning via backpropagation, a supervised training algorithm popularized by neural networks. We explain the training algorithms for DSN using the example architecture in Figure 1.

Pretraining
We pretrain DSN greedily, layer by layer, by alternating sparse coding and dictionary learning at each hidden layer. Pretraining completes by producing the dictionaries D_I and D_II and the highest hidden layer's pooled sparse codes {z_II}.

Max pooling is crucial for our DSN architecture. It subsamples sparse codes to their max elements. More importantly, max pooling serves as the nonlinear activation function of a neural network. Without nonlinear pooling, multilayering has no effect: x = D_I y_I and y_I = D_II y_II imply x = D_I D_II y_II ≈ D y_II, because a linear cascade of dictionaries is simply D ≈ D_I D_II regardless of the total number of layers.
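The collapse argument can be checked numerically. The sketch below (illustrative shapes only) confirms that two stacked linear coding layers without pooling reduce to a single effective dictionary.

```python
# Without nonlinear pooling, stacking two sparse coding layers is equivalent to a
# single effective dictionary D = D_I D_II, so the second layer adds no depth.
import numpy as np

rng = np.random.default_rng(0)
D_I  = rng.standard_normal((108, 432))     # layer-1 dictionary (N x K1)
D_II = rng.standard_normal((432, 864))     # layer-2 dictionary (K1 x K2)
y_II = np.zeros(864)
y_II[rng.choice(864, 86, replace=False)] = rng.standard_normal(86)   # a sparse layer-2 code

x_two_layers = D_I @ (D_II @ y_II)         # x = D_I y_I with y_I = D_II y_II
x_one_layer  = (D_I @ D_II) @ y_II         # x = D y_II with D = D_I D_II
print(np.allclose(x_two_layers, x_one_layer))   # True: the layers collapse
```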

Training classifiers at output layer
DSN learns each layer's dictionary greedily during pretraining. The resulting highest-hidden-layer output z_II is already a powerful feature for classification tasks. Suppose the DSN output layer predicts a class label l = h_w(φ), where h_w(.) is a standard linear classifier or logistic regression that takes a feature encoding φ as input. Note that φ is encoded from z_II, but its exact form depends on the DSN setup. For instance, we may have φ = [z_II^(1) z_II^(2) z_II^(3) z_II^(4)] if the highest hidden layer yields four pooled sparse codes per training example.
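As an illustrative sketch of the output layer, with placeholder random data and scikit-learn's LinearSVC standing in for the 1-vs-all linear classifier:

```python
# Output layer: a linear classifier trained on phi, the concatenation of the highest
# hidden layer's pooled sparse codes (placeholder data; shapes are assumptions).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
num_examples, K2 = 200, 864
Z_II = rng.standard_normal((num_examples, 4, K2))    # 4 pooled sparse codes per example
phi = Z_II.reshape(num_examples, 4 * K2)             # phi = [z_II^(1) ... z_II^(4)]
labels = rng.integers(0, 10, size=num_examples)      # placeholder 10-class labels

h_w = LinearSVC()        # one-vs-rest by default, i.e., 1-vs-all linear classifiers
h_w.fit(phi, labels)
predicted = h_w.predict(phi)
```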

Backpropagation
By now, we have the DSN output layer with trained classifiers, and this is already a good working pipeline for discriminative tasks. However, we might further improve the performance of DSN by optimizing the whole network in a supervised setting. Is backpropagation possible for DSN?

DSN backpropagation is quite different from that of a conventional neural network or other deep learning architectures. We explain our algorithm again using the DSN example in Figure 1. The complete feedforward path of DSN is summarized by

x → (sparse coding with D_I) → y_I → (max pooling) → z_I → (sparse coding with D_II) → y_II → (max pooling) → z_II → φ → l = h_w(φ).

We define the loss or cost function for the DSN classification as the squared prediction error $J(z_{II}) = \frac{1}{2}[\,l - h_w(z_{II})\,]^2$. Our objective now is to propagate the loss value down the reverse path and adjust the sparse codes. Fixing the classifier weights w, we back-estimate the optimal z*_II that minimizes J(z_II). To do so, we perform gradient descent on J(z_II), adjusting each element of the vector z_II = [z_II,1 z_II,2 ... z_II,K2]^T, where K2 is the number of basis vectors in dictionary D_II for hidden layer 2. Since the optimal z*_II is estimated by correcting z_II, the partial derivative is taken with respect to each element z_II,k. Noting that our classifier h_w(z_II) is linear, $\frac{\partial J(z_{II})}{\partial z_{II,k}} = -[\,l - h_w(z_{II})\,]\, w_k$. Therefore, the following gradient descent rule adjusts z_II to obtain z*_II:

$z_{II,k} \leftarrow z_{II,k} + \alpha\,[\,l - h_w(z_{II})\,]\, w_k,$

where α is the learning rate. This update rule is intuitive because it down-propagates the error [l − h_w(z_II)] proportionately to the contribution from each z_II,k and adjusts accordingly.
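A minimal sketch of this back-estimation step, assuming the linear classifier h_w(z) = wᵀz and the squared loss above (learning rate and step count are arbitrary):

```python
# Back-estimate z*_II by gradient descent on J(z_II) with classifier weights w fixed.
import numpy as np

def back_estimate_z(z_II, w, label, alpha=0.01, num_steps=100):
    """z_II, w: (K2,) arrays; label: scalar class target. Returns corrected z*_II."""
    z = z_II.copy()
    for _ in range(num_steps):
        error = label - w @ z        # l - h_w(z_II)
        z += alpha * error * w       # z_II,k <- z_II,k + alpha * error * w_k
        # equivalently z -= alpha * dJ/dz, since dJ/dz_II,k = -(l - h_w(z_II)) w_k
    return z
```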
Using the corrected z*_II, we can correct the unpooled original y_II's to the optimal y*_II's by a procedure called putback, illustrated in Figure 3. At hidden layer 2, we have performed max pooling over groups of M2 sparse codes; for putback, we keep the original M2 y_II's that produced z_II in memory, so that the corrected values in z*_II are put back to the corresponding locations in the original sparse codes y_II and yield the y*_II's.
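A possible implementation of putback for one pooling group is sketched below; it assumes plain element-wise max pooling and recovers the winning locations by argmax (our illustration of the described procedure, not the authors' code).

```python
# Putback: write each corrected pooled value z*_II,k back to the code y_II in the
# pooling group that attained the max for dictionary atom k.
import numpy as np

def putback(Y, z_star):
    """Y: (M, K) original unpooled sparse codes of one pooling group,
    z_star: (K,) corrected pooled sparse code. Returns corrected codes Y*."""
    Y_star = Y.copy()
    winners = Y.argmax(axis=0)                       # which of the M codes won the max pool
    Y_star[winners, np.arange(Y.shape[1])] = z_star  # scatter corrections back
    return Y_star
```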

Down-propagation continues in the same manner at hidden layer 1: correcting the pooled sparse codes and putting them back, we obtain the y*_I's. Each pooling group at hidden layer 1 originally has M1 y_I's that need to be saved in memory. With the y*_I's, we should correct D_I, not x, because it does not make sense to correct the data input. Hence, down-propagation of the error for DSN stops here, and we up-propagate the corrected sparse codes to finetune the dictionaries and the classifier weights.

Adjusting D_I requires solving the following optimization problem given the examples (x, y*_I):

$\min_{d_{I,k}} J(D_I) \quad \text{s.t.} \quad \|d_{I,k}\|_2^2 = 1 \;\; \forall k, \qquad (5)$

where d_I,k is the kth basis vector in D_I. Taking the partial derivative with respect to d_I,k yields

$\frac{\partial J(D_I)}{\partial d_{I,k}} = (D_I y^*_I - x)\,(y^*_{I,k} - y_{I,k}),$

where y*_I = [y*_I,1 ... y*_I,K1]^T and y_I = [y_I,1 ... y_I,K1]^T. Gradient descent with this derivative gives the update rule for D_I; we denote the corrected dictionary D*_I. We redo sparse coding at hidden layer 1 with D*_I, followed by max pooling. Similarly at hidden layer 2, we update D_II using the corrected codes, where z†_I denotes the pooled sparse code over the M1 y†_I's obtained from the sparse coding redone with D*_I. Using the corrected dictionary D*_II, we also redo sparse coding and max pooling at hidden layer 2. The resulting pooled sparse codes z†_II are the output of the highest hidden layer, which will be used to retrain the classifier h_w. All of the steps just described constitute a single iteration of DSN backpropagation; we run multiple iterations until convergence. In short, the reverse path alternates gradient descent (GD) corrections with sparse coding (SC) using the corrected dictionaries. We present the backpropagation algorithm for a general L-layer DSN in Algorithm 1.

Algorithm 1: DSN backpropagation (outline)
Require: pretrained {D_I, D_II, ..., D_{L−2}} and classifier h_w
Input: labeled training examples {(X_1, l_1), ..., (X_m, l_m)}
Output: fine-tuned {D*_I, D*_II, ..., D*_{L−2}} and classifier h*_w
repeat
  Down-propagation: back-estimate and put back the pooled sparse codes, then adjust the dictionaries from the top hidden layer down to layer 1
  Up-propagation: compute y†(i)_I by sparse coding with D*_I ∀i; for each higher hidden layer J, compute y†(i)_J by sparse coding with D*_J and z†(i)_J by max pooling ∀i; retrain h_w on the resulting z†(i)'s
until converged
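The dictionary-adjustment step could be sketched as follows; the gradient expression follows our reading of the text, and the unit-norm constraint is re-imposed by column normalization after each step (an illustrative reconstruction, not the authors' exact procedure).

```python
# One gradient descent step adjusting D_I from a pair (x, y*_I), with the original
# code y_I kept for the gradient, followed by re-normalizing each basis vector.
import numpy as np

def adjust_dictionary(D, x, y, y_star, alpha=0.01):
    """D: (N, K) layer-1 dictionary; x: (N,) input patch;
    y, y_star: (K,) original and corrected sparse codes. Returns corrected D*."""
    residual = D @ y_star - x                        # D_I y*_I - x
    grad = np.outer(residual, y_star - y)            # column k: (D_I y*_I - x)(y*_I,k - y_I,k)
    D_star = D - alpha * grad                        # gradient descent step
    D_star /= np.linalg.norm(D_star, axis=0, keepdims=True)   # re-impose ||d_I,k||_2 = 1
    return D_star
```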

EXPERIMENTAL RESULTS
We evaluate the classification performance of a 4-layer DSN using CIFAR-10.

Sparse coding setup
We denote different configurations of LARS and OMP by LARS-λ and OMP-ρ, where ρ = (S/K) × 100 (%). We configure hidden layer 1 sparse coding with the denser LARS-0.1 and OMP-20. For hidden layer 2, we use LARS-0.2 and OMP-10.
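For concreteness, this naming convention could map onto coder settings as in the sketch below; the use of scikit-learn's SparseCoder is our own assumption.

```python
# Map "LARS-lambda" / "OMP-rho" names onto sparse coder parameters.
from sklearn.decomposition import SparseCoder

def make_coder(D, variant):
    """D: (K, N) dictionary with basis vectors as rows; variant: e.g. 'LARS-0.1' or 'OMP-20'."""
    K = D.shape[0]
    if variant.startswith('LARS-'):          # LARS-lambda: l1 penalty weight
        return SparseCoder(dictionary=D, transform_algorithm='lasso_lars',
                           transform_alpha=float(variant.split('-')[1]))
    if variant.startswith('OMP-'):           # OMP-rho: keep S = rho% of K atoms
        rho = float(variant.split('-')[1])
        return SparseCoder(dictionary=D, transform_algorithm='omp',
                           transform_n_nonzero_coefs=max(1, int(round(rho / 100 * K))))
    raise ValueError(variant)

# Hidden layer 1: LARS-0.1 or OMP-20; hidden layer 2: LARS-0.2 or OMP-10.
```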

Data processing and training
Instead of using the full CIFAR-10 dataset, we uniformly sample 20,000 images and split them into four folds for cross validation. We use three folds for training and the remaining fold for testing, and we enforce the same number of images per class. For the output layer, we train a 1-vs-all linear classifier for each of the ten classes in CIFAR-10. Each datum in CIFAR-10 is a 3×32×32 color image. We build a per-image feature vector from densely overlapping patches drawn from a receptive field of width w = 6 pixels with stride s = 2. Thus, each vectorized patch has size N = 3 × 6 × 6 = 108. We preprocess the patches by ZCA whitening. Figure 4 illustrates sparse coding and max pooling at hidden layer 1. Each image is divided into four quadrants, and each quadrant contains four (pooling) groups of 9 patches. Hidden layer 1 uses a dictionary of size K1 = 4N = 432 and a max pooling factor M1 = 9, and produces four pooled sparse codes z_I per quadrant, which are passed to hidden layer 2. Figure 5 illustrates sparse coding and max pooling at hidden layer 2, where we use K2 = 2K1 = 864 and M2 = 4. We encode the final per-image feature vector as the concatenation φ_DSN = [z_II^(1) z_II^(2) z_II^(3) z_II^(4)], which has dimensionality 4K2 = 3456. (The feature vector is not too dense because of sparse coding.)
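A standard ZCA whitening recipe for the 108-dimensional patch vectors is sketched below; this is a common formulation, and the regularization epsilon is an assumption rather than a value taken from the paper.

```python
# ZCA whitening of vectorized patches: zero-center, then rotate with U diag(1/sqrt(S+eps)) U^T.
import numpy as np

def zca_whiten(X, eps=1e-2):
    """X: (num_patches, 108) vectorized patches. Returns whitened patches."""
    X = X - X.mean(axis=0)                           # zero-center each dimension
    cov = X.T @ X / X.shape[0]
    U, S, _ = np.linalg.svd(cov)
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T    # ZCA whitening transform
    return X @ W
```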

Results
We report the cross-validated 1-vs-all classification accuracy of DSN against a deep SAE. Both DSN variants achieve better classification accuracy than the deep SAE, and DSN-LARS is the best performer. Optimization by backpropagation is critical for the deep SAE, as it gains more than 7% accuracy over pretraining only. DSN-OMP improves by 4.7% with backpropagation, whereas the improvement is slightly smaller for DSN-LARS with a 4.4% gain. Importantly, DSN-OMP with pretraining only is already 0.7% better than the deep SAE with both pretraining and backpropagation.

CONCLUSION
Motivated by the superior feature learning performance of single-layer sparse coding, we have presented Deep Sparse-coded Network (DSN), a deep architecture based on multilayer sparse coding. DSN is a feedforward network with two or more hidden layers of sparse coding interlaced with max pooling units. We have discussed the benefits of DSN and described its training methods, including a novel backpropagation algorithm. From our experiments, we have found that DSN is superior to a deep SAE in classifying CIFAR-10 images.
