Hyperparameter Optimization and Boosting for Classifying Facial Expressions: How good can a "Null" Model be?

James Bergstra, bergstra@uwaterloo.ca, University of Waterloo, 200 University Ave., Waterloo, Ont. N2L 3G1, CANADA
David D. Cox, davidcox@fas.harvard.edu, Harvard University, 52 Oxford St., Cambridge, MA 02138, USA

arXiv:1306.3476v1 [cs.CV] 14 Jun 2013

Presented at the ICML Workshop on Representation and Learning, Atlanta, Georgia, USA, 2013. Copyright 2013 by the author(s).

Abstract

One of the goals of the ICML workshop on representation and learning is to establish benchmark scores for a new data set of labeled facial expressions. This paper presents the performance of a "Null model" consisting of convolutions with random weights, PCA, pooling, normalization, and a linear readout. Our approach focused on hyperparameter optimization rather than novel model components. On the Facial Expression Recognition Challenge held by the Kaggle website, our hyperparameter optimization approach achieved a score of 60% accuracy on the test data. This paper also introduces a new algorithm that combines hyperparameter optimization with ensemble construction. This algorithm constructed an ensemble of four models that scored 65.5% accuracy. These scores rank 12th and 5th respectively among the 56 challenge participants. It is worth noting that our approach was developed prior to the release of the data set, and applied without modification; our strong competition performance suggests that the TPE hyperparameter optimization algorithm and the domain expertise encoded in our Null model can generalize to new image classification data sets.

1. Introduction

The design of an effective machine learning system typically involves making many design choices that reflect the nature of the data at hand and the inferences we wish to make. The techniques at our disposal, as designers of machine learning systems, are intuition (repeating what worked in other settings that seem similar, appealing to our expertise) and search (model selection by trial-and-error, using e.g. cross-validation). In common practice, both intuition and search are typically carried out informally. A practitioner may design a complete system by a semi-automated search process in which small-scale searches (e.g. grid search) update the practitioner's own implicit beliefs regarding what constitutes a good model for the task at hand. Those beliefs inform the choice of future small-scale searches in an iterative process that makes progressive improvements to the system. (We may see many empirical results published in machine learning conferences as evidence of this process unfolding on an international scale over the course of years.) One practical problem that arises from this common practice is that algorithms which have been demonstrated to work on particular data sets are notoriously difficult to adapt to new data sets. The trouble is that the implicit beliefs of the practitioner play a crucial role in the process of model selection. This difficulty has been widely recognized by domain experts, but the status quo remains because many experts feel that their search is sufficiently efficient and that the insights gained from the model selection process are valuable. We hope that our results, taken together with other recent work on hyperparameter optimization such as Bergstra et al. (2011), Snoek et al. (2012), and Thornton et al. (2012), challenge these beliefs and induce more researchers to recognize automatic hyperparameter optimization as an important technique for model evaluation.
1.1. Hyperparameter Optimization and Ensemble Construction

While hyperparameter optimization is important, it is not the only standard performance-enhancing technique used to improve the scores of a model family on a given benchmark task. Ensemble methods such as Bagging (Breiman, 1996a), Boosting (Demiriz et al., 2002), Stacking (Wolpert, 1992; Breiman, 1996b; Sill et al., 2009), and Bayesian Model Averaging (Hoeting et al., 1999) are also commonly employed to extract every last drop of accuracy from a given set of algorithmic technology. One of the goals of automating hyperparameter optimization is to assess, in an objective way, how good a set of classification system components can be. In pursuit of that goal, the automation of ensemble creation is also critical. This paper presents a first attempt to provide a fully automated algorithm for model selection and ensemble construction. It starts from a palette of configurable pre-processing strategies and classification algorithms, and proceeds to create the most accurate ensemble of optimized components that it can. Our algorithm uses a Boosting approach to ensemble construction, in which hyperparameter optimization plays the role of a base learner. This setting creates unique challenges that motivate a new Boosting algorithm, which we call SVM HyperBoost.

2. Null Model for Image Classification

Our basic approach is described in Bergstra et al. (2013a). We use Hyperopt (Bergstra et al., 2013b) to describe a configuration space that includes one-layer, two-layer, and three-layer convolutional networks. The elements of our image classification model are standard scaling (image resolution), affine warp (rigid image deformation), filterbank normalized cross-correlation, local spatial pooling, di-histogram spatial pooling, and an L2-SVM classifier.

For each layer in each architecture, hyperparameters govern the size of filters, the volume of pooling regions, constants that modulate local normalization, and so on. The filters themselves are either chosen randomly from a centered Gaussian distribution, or are random projections of PCA components of training data (as it appears as input to each layer), or are random projections of input patches (again, as input arrives to each layer). Features for classification are derived from the output layer by either signed (di-histogram) or unsigned pooling over some topographically local partitioning of output features. This configuration space was chosen to span the model space investigated by Pinto & Cox (2011) and the random-filter models of Coates & Ng (2011). Relative to Bergstra et al. (2013a) we add the possibility of affine warping of input images and remove the input-cropping step. In total, the configuration space includes 238 hyperparameters, although no configuration uses all 238 at once. Many of the hyperparameters are conditional hyperparameters because they are only active in certain conditions; for example, the hyperparameters governing the creation of a third layer are inactive for two-layer models. The details of this meta-model are described in Bergstra et al. (2013a) and implemented in the hyperopt-convnet software available from http://github.com/jaberg/hyperopt-convnet.

Notable omissions from the model space include: backpropagation (Rumelhart et al., 1986), unsupervised learning for filters such as RBMs (Hinton, 2002; Hinton et al., 2006), Sparse Coding (Coates & Ng, 2011), DAAs (Vincent et al., 2008), and recent high-performance regularization strategies such as dropout (Hinton et al., 2012) and maxout (Goodfellow et al., 2013).

Hyperparameter optimization within this Null model was carried out using the TPE algorithm (Bergstra et al., 2011), as implemented by the hyperopt software (available from http://jaberg.github.com/hyperopt).
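To make the idea of a conditional configuration space concrete, here is a minimal sketch of how such a space can be written down with hyperopt and searched with TPE. The space is a toy stand-in (a handful of hyperparameters rather than the 238 of hyperopt-convnet), and the objective returns a constant so the example runs end to end; in the real system the objective would build the convolutional features, fit the L2-SVM, and report validation error.

```python
from hyperopt import hp, fmin, tpe, Trials

# Toy, illustrative space: the third-layer hyperparameters exist only when the
# three-layer branch of the architecture is chosen, i.e. they are conditional.
space = {
    "arch": hp.choice("arch", [
        {"n_layers": 2},
        {"n_layers": 3,
         "l3_filter_size": hp.choice("l3_filter_size", [3, 5, 7]),
         "l3_pool_size": hp.choice("l3_pool_size", [2, 3])},
    ]),
    "filter_init": hp.choice("filter_init",
                             ["gaussian", "pca_projection", "patch_projection"]),
    "norm_thresh": hp.loguniform("norm_thresh", -4, 2),
}

def objective(config):
    # Placeholder so the sketch runs; a real objective would train and score a
    # model described by `config` and return its validation error.
    return 0.5

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
```

Because the two-layer branch of `hp.choice` never instantiates the third-layer hyperparameters, proposals in that branch simply omit them, mirroring the "inactive hyperparameter" behaviour described above.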
3. SVM HyperBoosting

This section describes an ensemble construction method (SVM HyperBoost, or just HyperBoost) that is particularly well-suited to the use of a hyperparameter optimization algorithm as an inner loop. The algorithm is presented in the context of models which have the form of a feature extractor and a linear classifier. In this context, the ensemble is simply a larger linear function that can be seen as the concatenation of ensemble members. The HyperBoost algorithm can be understood as a piecewise training of this single giant linear classifier.

To derive the HyperBoost algorithm, suppose that we commit to using an ensemble of size J. (No such commitment is necessary in practice, but it makes the development clearer.) The ideal ensemble weights w^(*) and hyperparameter configuration settings λ^(*) for a binary classification task would optimize generalization error:

    w^{(*)}, \lambda^{(*)} = \operatorname*{argmin}_{w \in \mathbb{R}^{*},\ \lambda \in \mathcal{H}^{J}} \; \mathbb{E}_{x, y \sim \mathcal{D}} \left[ \mathbb{I}\Big\{ 0 > y \Big( \sum_{j} w_j \cdot f(x, \lambda_j) \Big) \Big\} \right]. \qquad (1)

Here D stands for a joint density over inputs and labels (x ∈ R^M, y ∈ {−1, 1}). We have used f(x, λ_j) to denote the feature vector associated to input x by hyperparameter configuration λ_j. The expression I{0 > a} denotes the indicator function that is one for values of a which are negative, and zero otherwise. We use H to stand for the set of possible hyperparameter configurations, so that the argmin means "choose J optimal hyperparameter configurations" (one for each ensemble member). We use the notation w ∈ R^* in the argmin to indicate that the final set of weights w^(*) will be a vector, but it is not known a priori how many elements it will have. Rather, w will be logically divided into J pieces corresponding to ensemble elements, and each piece w_j will have a dimensionality that matches f(x, λ_j).

The joint optimization of w and λ implied by Equation 1 is challenging because of

• the complicated effect of each λ_j on f,
• the expectation over the unknown D, and
• the non-differentiable indicator function.

Our strategy for dealing with the complicated relationship between λ_j and f is to select the λ_j configurations greedily, using the algorithm illustrated in Figure 1. Our strategy for dealing with the expectation is to estimate it from what is typically called validation data, so that each argmin for j < J is what previous work has called hyperparameter optimization. Our strategy for dealing with the non-differentiability of the indicator function is to use a gradient-free optimization method, namely TPE (Bergstra et al., 2011).
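As an illustration of the quantity that each greedy round minimizes, the following minimal sketch estimates the expectation in Equation 1 on held-out validation data for a single candidate configuration: the feature sets already in the ensemble are concatenated with the candidate's features, a linear classifier is fit, and the validation error rate is returned. This is a simplified reading of the procedure (it refits all weights jointly, whereas the carry-forward scheme described below freezes earlier rounds up to a scale factor); `extract_features` is a hypothetical stand-in for the convolutional feature extractor f(x, λ), and scikit-learn's LinearSVC stands in for the GPU-based L2-SVM used in the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def extract_features(X, lam):
    # Hypothetical stand-in for f(x, lambda): a random ReLU projection whose
    # width and seed are read from the candidate configuration `lam`.
    rng = np.random.RandomState(lam["seed"])
    W = rng.randn(X.shape[1], lam["n_features"])
    return np.maximum(0.0, X @ W)

def round_objective(lam, prev_fit, prev_val, X_fit, y_fit, X_val, y_val, C=1.0):
    # Validation-set estimate of the Eq. (1) objective for one greedy round:
    # concatenate earlier rounds' features with the candidate's, fit a linear
    # SVM, and return the misclassification rate on the validation examples.
    F_fit = np.hstack(prev_fit + [extract_features(X_fit, lam)])
    F_val = np.hstack(prev_val + [extract_features(X_val, lam)])
    clf = LinearSVC(C=C).fit(F_fit, y_fit)
    return np.mean(clf.predict(F_val) != y_val)
```

A hyperparameter optimizer such as TPE can then minimize `round_objective` over candidate configurations, exactly as in the single-model search of Section 2.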
Normally, Boosting (functional gradient methods) on a Hinge loss or Zero-One loss would quickly run into trouble because once the training margins are pushed past the decision boundary, subsequent rounds have nothing to do (the training criterion is completely satisfied). We avoid this problem using two techniques. First, Boosting on sufficient validation data helps, because models fit to training data are seldom perfect for validation data by random chance (incidentally, we are interested in collaborations that might shed light on exactly how much validation data is necessary). Second, each round of HyperBoost is free to scale the contribution of previous ensemble components (see α in Figure 1), so standard SVM regularization techniques (i.e. the parameter C) allow us to meaningfully add features and improve w even if the Hinge loss had been reduced to 0 at a previous HyperBoosting iteration. The regularization parameter governing the entire SVM is a hyperparameter that is re-optimized on every round of HyperBoost. This technique makes HyperBoost a partially corrective Boosting algorithm.

Figure 1. The SVM HyperBoost algorithm creates a large linear SVM piece-wise. The first round of training is standard SVM training. At the end of the first round, the SVM weights are fixed (*) up to multiplicative scaling. Subsequent rounds "carry forward" the total contribution of previous features and their corresponding fixed weights toward label predictions. On Round 2, HyperBoost optimizes the feature weights (shown in light red) for a candidate feature set (bright red) and re-scales (via α) weights fit in previous rounds. This approximate, greedy procedure makes it possible to fit very large SVMs to large numbers of examples, when feature computation is also computationally costly.

3.1. Weak Learners vs. Strong Learners

HyperBoost is suitable for Boosting strong base learners. In fact, when Boosting and model fitting are conducted on statistically independent example sets, the distinction between "weak" and "strong" learners is no longer important. Instead, any learners (weak or strong) simply provide models, and HyperBoosting chooses the model that most improves the validation set performance of the ensemble. While strong learners generally require additional regularization compared with weak learners in order to generalize correctly from training data, strong and weak base learners are equally useful for HyperBoosting.
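The carry-forward step of Figure 1 admits the following reading (a minimal sketch of one plausible implementation, not the authors' code, written for the binary case of Equation 1): the frozen ensemble accumulated so far is summarized by a single column of decision values, a new linear SVM is fit on that column together with the candidate feature set, and the coefficient learned for the carried-forward column plays the role of α, while C remains free to be re-optimized on every round.

```python
import numpy as np
from sklearn.svm import LinearSVC

def carry_forward_fit(prev_decision, new_feats, y, C=1.0):
    # prev_decision: (n,) decision values of the frozen ensemble from earlier rounds
    # new_feats:     (n, d_j) features produced by this round's candidate configuration
    # Earlier rounds' individual weights are not re-fit; only their common scale
    # (alpha), the new block of weights, and the bias are learned here.
    Z = np.hstack([prev_decision[:, None], new_feats])
    clf = LinearSVC(C=C).fit(Z, y)
    alpha = clf.coef_[0, 0]      # multiplicative re-scaling of the carried-forward part
    w_new = clf.coef_[0, 1:]     # weights for the newly added feature set
    b = clf.intercept_[0]
    return alpha, w_new, b
```

After the round, the ensemble's decision value for an example becomes alpha * prev_decision + new_feats @ w_new + b, and that value is what gets carried forward into the next round.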
4. Results on Facial Expression

In support of the ICML 2013 workshop on representation learning, Pierre-Luc Carrier and Aaron Courville released a data set for facial expression recognition as a Kaggle competition (http://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge). The data consist of 48x48 pixel grayscale images of faces, and labels for the expressions of those faces. The faces have been automatically registered so that each face is approximately centered and occupies about the same amount of area within each image. The task is to categorize each face as one of seven categories (anger, disgust, fear, happy, sad, surprised, neutral). The set distributed by Kaggle consists of 28,709 training examples and 7,178 test examples.

Our protocol for model selection was simple cross-validation on the training examples. We partitioned the training data into 20,709 SVM-fitting examples and 8,000 validation examples, and performed hyperparameter optimization with regard to performance on this validation set. The test examples were not used for model selection. The Kaggle website only provided the images for test examples, to prevent cheating in the contest. The test scores listed for the Facial Expression Recognition Challenge were obtained by uploading our predictions to Kaggle's website, which computed the test set accuracy on our behalf.

On each round of HyperBoosting, we evaluated 1000 non-degenerate hyperparameter proposals in search of the best feature set to add to the ensemble. These proposals ranged in accuracy from a chance baseline of 20% up to a relatively strong 62%. Experiments were done using a single computer with four NVidia Tesla 2050 GPUs and a slow file system, so typically 2 or 3 jobs would run simultaneously. Many configurations were invalid (e.g. downsampling so much that 0 features remain), but these were recognized relatively quickly. Valid (non-degenerate) trials typically took 10-25 minutes to complete, so each round of HyperBoost took two or three days using this one machine.

The accuracies of the models chosen by HyperBoost are shown in Figure 2. HyperBoost creates a small ensemble whose combined accuracy (65.5%) is significantly better than the best individual model (60%). The ranking relative to other models in the Kaggle competition is shown in Table 1. The ensemble of size 4 ranks among the top 5 competition entries.

[Figure 2, "HyperBoost for Ensemble Construction": classification accuracy (0.58 to 0.65) versus HyperBoost round (1 to 4), with curves for Test, Validation (Ensemble), and Validation (Round Features).]

Figure 2. HyperBoost improves test set generalization with successive rounds while the individual feature sets chosen at each round hold steady just below 60% accuracy. Each round selects the best of 1000 non-degenerate candidate feature sets. Training set accuracy (not shown) ranges from 85% for the first round up to 97% on the fourth round.

It is worth noting that the model and training programs used for HyperBoosting in this model space were entirely designed prior to the release of the data set. The model space was chosen to span the models of Pinto & Cox (2011) and Coates & Ng (2011). Pinto & Cox (2011) reported excellent match verification performance on the Labeled Faces in the Wild (LFW) data set (Huang et al., 2007), and Coates & Ng (2011) advanced the state of the art at the time on the CIFAR-10 object recognition data set. Our approach was developed prior to the release of the Facial Expression Recognition data set, so the good performance speaks directly to the ability of our meta-modeling approach to generalize to new image classification tasks.

It is also worth noting that the training accuracy (not shown) of all models in the ensemble is much higher than the generalization accuracy (from 85% up to 97%). The size of individual feature sets was capped at 9000 (to stay within the available memory on the GPU cards), and all of the best models approached this maximum number of features. Although these large feature sets demonstrated significant over-fitting of the training data (these feature sets represent strong base learners for Boosting), HyperBoost selected ensemble members that brought steady improvement on the test set. This is a familiar story for Boosting algorithms based on an exponential loss, but HyperBoost produces the effect while operating on the more representative hinge loss.
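For concreteness, the sketch below reproduces the model-selection split used throughout this section (28,709 labelled training images divided into 20,709 SVM-fitting examples and 8,000 validation examples); the actual partition and seed are not published, so a random permutation is assumed here.

```python
import numpy as np

N_TRAIN, N_FIT = 28709, 20709       # sizes reported in the text
rng = np.random.RandomState(0)      # assumed seed, for reproducibility of the sketch only
perm = rng.permutation(N_TRAIN)
fit_idx, val_idx = perm[:N_FIT], perm[N_FIT:]
assert val_idx.size == 8000         # validation examples used to score the 1000 proposals per round
```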
Table 1. Performance relative to Kaggle submissions.

Rank  Team                    Accuracy (%)
1     "RBM"                   71.162
2     "Unsupervised"          69.267
3     "Maxim Milakov"         68.821
4     "Radu+Marius+Cristi"    67.484
      HyperBoost Round 4      65.450
5     "Lor.Voldy"             65.255
...
11    "jaberg"                61.967
      HyperBoost Round 1      61.466
12    "bulbugoglu"            59.654
...
56    "dstarerstor"           20.006

5. Conclusion

Hyperparameter optimization within large model classes is difficult. We have shown that hyperparameter optimization within a Null model achieves over 60% accuracy in the workshop's Facial Expression Recognition Challenge, which ranks 12th of 56 contest submissions. Further use of an ensemble-construction mechanism raises that accuracy to 65.5%, which would have ranked 5th of 56 had it been ready by the contest closing date. These performances underscore the importance and difficulty of fully leveraging known algorithmic technology for image classification. We can only conjecture that a future version of our model space that includes a wider range of algorithms for feature initialization and refinement (e.g. backpropagation, dropout, maxout, sparse coding, sparsity regularization, RBMs, DAAs) could perform better yet. By the same token, until such a search is carried out, it is difficult to make quantitative claims regarding the value added by such algorithms over and above a well-configured set of simpler components.
The software used in these experiments is publicly available from github:

Hyperopt: the TPE hyperparameter optimization algorithm and distributed optimization infrastructure. http://jaberg.github.com/hyperopt

Hyperopt-ConvNet: the HyperBoost algorithm and the hyperopt-searchable representation of the image classification model. http://github.com/jaberg/hyperopt-convnet

Acknowledgments

This project has been supported by the Rowland Institute of Harvard, the United States' National Science Foundation (IIS 0963668), and Canada's Natural Sciences and Engineering Research Council through the Banting Fellowship program.

References

Bergstra, J. and Bengio, Y. Random search for hyperparameter optimization. Journal of Machine Learning Research, 13:281–305, 2012.

Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. Algorithms for hyper-parameter optimization. In NIPS*24, pp. 2546–2554, 2011.

Bergstra, J., Yamins, D., and Cox, D. D. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In Proc. ICML, 2013a.

Bergstra, J., Yamins, D., and Cox, D. D. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In SciPy'13, 2013b.

Breiman, L. Bagging predictors. Machine Learning, 24:123–140, 1996a.

Breiman, L. Stacked regressions. Machine Learning, 24:49–64, 1996b.

Coates, A. and Ng, A. Y. The importance of encoding versus training with sparse coding and vector quantization. In Proc. ICML-28, 2011.

Demiriz, A., Bennett, K. P., and Shawe-Taylor, J. Linear programming boosting via column generation. Machine Learning, 46:225–254, 2002.

Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout networks. CoRR, abs/1302.4398, 2013.

Hinton, G. E. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771–1800, 2002.

Hinton, G. E., Osindero, S., and Teh, Y. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. Bayesian model averaging: A tutorial. Statistical Science, pp. 382–401, 1999.

Huang, G. B., Ramesh, M., Berg, T., and Learned-Miller, E. Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.

Pinto, N. and Cox, D. D. Beyond simple features: A large-scale feature search approach to unconstrained face recognition. In Proc. Face and Gesture Recognition, 2011.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning internal representations by error propagation. In Rumelhart, D. E. and McClelland, J. L. (eds.), Parallel Distributed Processing, volume 1, chapter 8, pp. 318–362. MIT Press, Cambridge, 1986.

Sill, J., Takács, G., Mackey, L., and Lin, D. Feature-weighted linear stacking. CoRR, abs/0911.0460, 2009.

Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.

Thornton, C., Hutter, F., Hoos, H. H., and Leyton-Brown, K. Auto-WEKA: Automated selection and hyper-parameter optimization of classification algorithms. CoRR, abs/1208.3719, 2012.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In Proc. of the 25th ICML, pp. 1096–1103. ACM, 2008.

Wolpert, D. H. Stacked generalization. Neural Networks, 5:241–259, 1992.