Maximizing all margins: Pushing face recognition with Kernel Plurality

We present two theses in this paper: First, performance of most existing face recognition algorithms improves if instead of the whole image, smaller patches are individually classified followed by label aggregation using voting. Second, weighted plurality voting outperforms other popular voting methods if the weights are set such that they maximize the victory margin for the winner with respect to each of the losers. Moreover, this can be done while taking higher order relationships among patches into account using kernels. We call this scheme Kernel Plurality. We verify our proposals with detailed experimental results and show that our framework with Kernel Plurality improves the performance of various face recognition algorithms beyond what has been previously reported in the literature. Furthermore, on five different benchmark datasets - Yale A, CMU PIE, MERL Dome, Extended Yale B and Multi-PIE, we show that Kernel Plurality in conjunction with recent face recognition algorithms can provide state-of-the-art results in terms of face recognition rates.


Introduction
There is little debate that today we live amid an abundance of face recognition (FR) methods [24,12,2,3,14]. Some of the methods do well on concrete measures like classification accuracy and computational efficiency, while others score high on subjective measures like ease of implementation and public domain availability. Here we intend to revisit existing FR methods, from the rusty old Eigenfaces [24] to the more recent Volterrafaces [14], in order to explore the possibility of squeezing more performance from them while maintaining their existing advantages.
We begin by noting that FR as a classification problem is characterized by high data dimensionality and data sparsity. These are the textbook conditions that lead classifiers to overfit the data. We believe that this is one of the reasons the performance of many FR algorithms has been limited to a much lower level than what could be achieved if this issue were addressed. Our simple yet effective solution to this problem is to divide images into patches and to train classifiers per patch location. During the testing stage, a single label for an image is obtained by weighted plurality voting over the patch locations. Note that the use of patches has been explored from time to time in FR, but our proposal is broader in the sense that it calls upon all FR methods to be used in this manner.
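The label-fusion step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function name and unit default weights are our own, and the weighted variant anticipates the learned weights discussed later.

```python
def plurality_label(votes, weights=None):
    """Aggregate per-patch predictions into a single image label by
    (weighted) plurality: each patch location casts one vote, and the
    label with the largest total weight wins."""
    if weights is None:
        weights = [1.0] * len(votes)  # simple (unweighted) plurality
    scores = {}
    for v, wt in zip(votes, weights):
        scores[v] = scores.get(v, 0.0) + wt
    return max(scores, key=scores.get)
```

For example, with unit weights the most frequent patch label wins, while a sufficiently up-weighted minority voter can overturn the count.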
Next, we make the observation that in a weighted voting scheme, the manner in which weights are selected is critical. There is a large body of literature that has tried to address this problem, with significant methods including Log-Odds Weighted Voting [16], Weighted Majority Voting [17], Bagging [4], Boosting [21,9] and Stacking [26]. It has been shown that most supervised weighted voting methods learn weights based on maximization of the margin of victory [22,13] in a two-class scenario. In the case of plurality voting (multiclass), there is a margin of victory with respect to each of the losers. Interestingly, even the more recent multiclass Boosting methods do not take advantage of this and only maximize the minimum margin of victory [21]. We propose to learn plurality voting weights such that all the margins of victory are maximized simultaneously. We call our scheme Kernel Plurality since, in addition to maximizing all margins, it also allows higher order relations among various patch labels to be taken into account during weight computation via the use of kernels.
We corroborate our proposals with extensive experimental results using five different benchmark face datasets and five different FR algorithms. We show that: (1) FR algorithms used within our framework significantly exceed their own performance without our framework. (2) Kernel Plurality outperforms simple Plurality, Log-Odds Weighted Plurality [16] and Stacking [26] implemented with SVMs. Note that different FR methods perform differently on various datasets, and though the absolute performance of FR methods is important, as shown in Fig. 4, it is more enlightening to look at the percentage improvement in performance of various FR methods (Fig. 5). That said, in conjunction with the recently proposed Volterrafaces [14] and LBP [2], Kernel Plurality does provide state-of-the-art results.
To summarize, the key points made in this paper are: (1) Patch based voting outperforms holistic classification for various algorithms across databases. (2) Using off-the-shelf classifiers (e.g. SVMs) for label aggregation is not optimal. (3) Kernel Plurality outperforms existing voting methods across most databases and training set sizes, indicating the utility of all-margin maximization. (4) On average, Kernel Plurality improves accuracy over Plurality by 3−21%, while for a state-of-the-art method like Volterrafaces the improvement ranges from 5−66%.

Kernel Plurality
Kernel Plurality is a new kernel based voting method. In the next subsection we describe the process through which the optimal weights are obtained for a given kernel using a training set of feature vectors. Following that we will outline the process by which a winning label is selected for a test feature vector using a given kernel and the computed weights.

Weight Computation
The meaning of various symbols and functions used in this discussion is summarized in Table 1. According to weighted Plurality, if we ignore ties for the moment, x_i is assigned a label l according to the following criterion:

Σ_{c_k ∈ C} w_k δ_{c_k(x_i), l} > Σ_{c_k ∈ C} w_k δ_{c_k(x_i), m}  ∀ m ∈ L, m ≠ l,   (1)

where δ is the Kronecker delta function and w_k ∈ R is the weight associated with the classifier c_k. Another way to express the criterion in Eq. 1 is to say that x_i should be assigned the label l such that

∏_{m∈L, m≠l} I_{R+}( Σ_{c_k ∈ C} w_k (δ_{c_k(x_i), l} − δ_{c_k(x_i), m}) ) = 1,   (2)

where I is the indicator function (Table 1). Eq. 2 encodes that the winner label l must have more weighted votes than each of the other losing labels. We can rewrite this in dot-product form as

∏_{m∈L, m≠l} I_{R+}( ⟨p_lm(x_i), w⟩ ) = 1,   (3)

where the k-th element of the prediction vector p_lm(x_i) ∈ P is given by

[p_lm(x_i)]_k = δ_{c_k(x_i), l} − δ_{c_k(x_i), m}.   (4)

The transformation of the decision criterion from Eq. 1 to Eq. 3 brings out the fact that a Plurality contest among multiple classes can be fully described by a set of multiple pair-wise contests. To understand this more clearly, consider the example outlined in Fig. 1. There are eight classifiers that vote for five classes (A-E) as shown in Fig. 1(a). In this example, Eq. 1 selects class E as the winner of the Plurality contest. The same conclusion can also be reached if we consider all binary contests between the classes A-E, which we represent using a digraph (directed graph) in Fig. 1(b) with an edge from label l_i to l_j if ⟨p_{l_i l_j}(x), w⟩ > 0. If there is a tie, edges pointing to both labels are added. Given such a digraph, the winner of the Plurality contest is the root of the corresponding Strongly Connected Components (SCC) graph [6,18]. The SCC graph is shown in Fig. 1(b) using colored overlays, where class E, the correct winner, is also the root of the SCC graph. In case of a tie for the win, the SCC root will correspond to multiple voting digraph nodes (i.e. Eq. 3 will be set to zero for multiple l) and a strategy must be chosen to resolve the tie. We will revisit this graph formulation of voting while using Kernel Plurality on test feature vectors.
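The pairwise formulation of Eq. 3 and the prediction vectors of Eq. 4 can be sketched concretely. This is a toy illustration with unit weights and a hypothetical vote pattern (eight classifiers, five classes, as in the Fig. 1 example); the function names are our own.

```python
import numpy as np

def prediction_vector(votes, l, m):
    """p_lm(x) from Eq. 4: the k-th entry is
    delta(c_k(x), l) - delta(c_k(x), m), i.e. +1 if classifier k voted
    for l, -1 if it voted for m, and 0 otherwise."""
    return np.array([(v == l) - (v == m) for v in votes], dtype=float)

def wins_all_contests(votes, w, l, labels):
    """Eq. 3: label l wins the weighted Plurality contest iff
    <p_lm(x), w> > 0 against every other losing label m."""
    return all(prediction_vector(votes, l, m) @ w > 0
               for m in labels if m != l)
```

With unit weights, a label that receives the most votes wins every pairwise contest, matching the plain plurality outcome.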
At this stage we introduce the first of the two key ideas behind Kernel Plurality. Note that the ensembles we are considering have a fixed size and the classifiers are learned independently using different patches. In such a setting, the linear relation in Eq. 3 implies that the elements of p_lm(x_i) act independently as they contribute their votes toward a decision. For instance, conditions such as 'The winner should be the label that is picked by both classifier 1 and classifier 2' cannot be encoded using a linear equation like Eq. 3. We would like to take such higher order interactions among classifiers into account while deciding the winner of a Plurality contest. Mathematically, this translates to transforming the prediction vector p_lm(x_i) and the weight vector w using some mapping ϕ to a kernel space K. The winner label l must now be chosen such that

∏_{m∈L, m≠l} I_{R+}( ⟨ϕ(p_lm(x_i)), ϕ(w)⟩ ) = 1.   (5)

For a given ensemble and ϕ, we do not know the best w a priori and would like to recover it using the training data. This brings us to the second key idea behind Kernel Plurality. For the case of two-class weighted voting contests, Lin et al. [16] show that the reliability of classification increases with the margin of victory. Since a Plurality contest can be defined in terms of multiple two-class contests (Fig. 1), we reason that Plurality would provide more reliable generalization performance on a test set if its weights are set such that the margin of victory with respect to each losing class is maximized for the training feature vectors. Note that this is in contrast to maximization of the minimum margin, which some existing techniques [21] try to achieve. The idea of maximizing all margins as opposed to only the minimum margin is explained with a toy example in Fig. 2. Fig. 2(a) shows four classes in some embedding space with two noisy data points that belong to class 1. Note that due to their proximity to classes 2 and 4, respectively, the two data points cannot be reliably classified. If the minimum margin for class 1 is maximized, we get the situation shown in Fig. 2(b), where class 2, the class closest to class 1, is pushed far away, but the other classes have clustered not far from class 2. In this case, the ambiguity for the data point which was closer to class 2 has been removed, but the other point is still closer to class 4. If, as proposed, all the margins are maximized with respect to class 1, we get the situation shown in Fig. 2(c), where classes 3 and 4 are pushed farther away than before. Thus it is more likely that the ambiguity for the second data point would also be removed.
In terms of the mathematical formulation, the similarity between our objective in the prediction space P and the objective of Support Vector Machines [7] can be readily noted. Borrowing the formalism from SVMs, for a given training set T of feature vectors x_i with labels l_i, we would like to set the weights w⋆ such that

w⋆ = argmin_w (1/2)⟨ϕ(w), ϕ(w)⟩  subject to  ⟨ϕ(p_{l_i m}(x_i)), ϕ(w)⟩ ≥ 1  ∀ x_i ∈ T, ∀ m ∈ L, m ≠ l_i.   (7)

Note that we have encoded the problem such that the margins should be above a certain threshold, and the norm of the weight vector w, which is inversely proportional to the margin, should be minimized. To build robustness against outliers we also introduce soft margins in our formulation and allow certain ⟨ϕ(p_{l_i m}(x_i)), ϕ(w)⟩ to be less than 1. This transforms Eq. 7 to

w⋆ = argmin_w (1/2)⟨ϕ(w), ϕ(w)⟩ + C Σ_i ξ_i  subject to  ⟨ϕ(p_{l_i m}(x_i)), ϕ(w)⟩ ≥ 1 − ξ_i, ξ_i ≥ 0,   (8)

where ξ_i are the slack variables and C is a constant controlling the soft-margin trade-off.
A few salient points should be noted: Firstly, in terms of the SVM, we only have one class whose margin has to be maximized with respect to the origin. Consequently, the decision plane runs through the origin and b, the intercept parameter in the standard SVM formulation [7], is set to 0. Secondly, we can generate an equivalent two-class problem by negating all the vectors and labeling them class 2; the symmetry would force the decision plane to pass through the origin. Thirdly, recall from the beginning of this section that unlike most other weighted voting schemes, which restrict the weight vector to the positive quadrant, we defined w ∈ R^{|C|}. This was done since in simple weighted Plurality (ϕ is the identity function) a weight vector can be transformed into a non-negative vector w+ that picks the same winner as w using Eq. 1, but this cannot be guaranteed for a general kernel space K. Finally and most importantly, the procedure outlined above is not classifying the feature vectors x_i ∈ D using an SVM. We are working in the prediction space P, where we have a two-class problem, while we have an |L|-way classification problem in D. We have simply used the mathematical modeling provided by SVMs to optimize our objective function of maximizing all victory margins in a Plurality contest.
The solution of the mathematical program in Eq. 8 is given by

ϕ(w⋆) = Σ_i α_i ϕ(p′_i),   (9)

where the ϕ(p′_i) are the support vectors and the α_i are the corresponding coefficients. As in SVMs, the exact form of the mapping ϕ is not required as long as the kernel matrix K, with its i-th row, j-th column entry given as

K_ij = K(p_i, p_j) = ⟨ϕ(p_i), ϕ(p_j)⟩,

is available for all the prediction vectors p_i, p_j ∈ P.
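The weight-learning step of Eq. 8 can be sketched via the symmetric two-class reduction described above. This is a minimal sketch assuming scikit-learn's `SVC` as the QP solver (an assumption on our part; the paper's experiments used LIBSVM [5], of which scikit-learn's `SVC` is a wrapper); the function name and toy data are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def learn_plurality_weights(pred_vectors, C=1.0, kernel="linear"):
    """Solve the all-margin problem of Eq. 8 by reduction to a
    two-class SVM: the training prediction vectors p_{l_i m}(x_i) form
    class +1 and their negations class -1. By symmetry the optimal
    decision plane passes through the origin (intercept b = 0), and
    the fitted decision function plays the role of <phi(p), phi(w*)>."""
    P = np.asarray(pred_vectors, dtype=float)
    X = np.vstack([P, -P])                                  # symmetric copies
    y = np.concatenate([np.ones(len(P)), -np.ones(len(P))]) # +1 / -1 labels
    return SVC(C=C, kernel=kernel).fit(X, y)
```

A positive `decision_function` value on a test prediction vector p then means p lies on the winning side of the learned boundary; non-linear kernels (`rbf`, `poly`, `sigmoid`) swap in directly.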
We summarize the key ideas behind the Kernel Plurality weight learning algorithm with an example in Fig. 3. In Fig. 3(a) we show feature vectors in the input space D with 3 linear classifiers. The different labelings imposed by the 3 classifiers are shown in Fig. 3(b). In Fig. 3(c) we have colored the feature vectors according to their corresponding prediction vectors, listed in Fig. 3(e) (given by Eq. 4). Eq. 8 asks for a weight vector w⋆ such that the prediction vectors are separated from the origin with maximum margin in the prediction space P, as shown in Fig. 3(d). We allow for a non-linear boundary in P by using the kernel mapping ϕ. This corresponds to the separation boundary being a hyperplane in the kernel space K, as depicted in Fig. 3(f), where we do our computations. We must mention that the complexity of our method is governed by the efficiency of the quadratic program solver used to find the weights.

Voting with Kernel Plurality
Given the set of prediction labels {c_k(y_i)} for a test vector y_i, we now consider the problem of conducting a Kernel Plurality contest among the elements of L to pick a label for y_i. Combining Eq. 3 and Eq. 1, we pick l as the label for y_i such that

∏_{m∈L, m≠l} I_{R+}( ⟨ϕ(p_lm(y_i)), ϕ(w⋆)⟩ ) = 1.   (10)

In case of a tie for the win, the left-hand side of Eq. 10 would be zero for all the tied labels. For the purpose of the results presented in this paper, we randomly choose one of the tied labels as the winner.
In practice, instead of evaluating Eq. 10 explicitly, we found it more efficient to generate the set of pair-wise prediction vectors {p_{l_i l_j}(y_i)}, l_i, l_j ∈ L, and classify them using an SVM with weight vector w⋆ and associated kernel. The classification results are used to build the edges in the voting digraph (Fig. 1) and a winner is picked using an SCC algorithm.

Table 2. Details of the Face Recognition and Label Aggregation algorithms used in our experiments.

Face Recognition Methods:
- Nearest Neighbor (NN): L2 distance based classification.
- Eigenfaces [24] (Eig): PCA + NN.
- Volterrafaces [14] (Vol): Discriminative filtering + NN.
- Tensor Subspace Analysis [12] (TSA): Tensor extension of Locality Preserving Projections (LPP) [11].
- Local Binary Patterns [2] (LBP): Local features + NN.

Label Aggregation Methods:
- Support Vector Machine [7] (SVM): Label vectors classified with a linear SVM, as in Stacking [26].
- Log-Odds Weighted Voting [16] (WMV): Plurality with each voter's weight set to the log of its correct classification odds.
- Simple Plurality [16] (Vot): Plurality with weights set to unity.
- Linear Kernel Plurality (Lin): Kernel Plurality with a linear kernel.
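The digraph-based winner selection can be sketched as follows. This is our own illustrative shortcut, assuming a boolean pairwise decision function (e.g. the sign of the SVM decision value on p_lm): because every pair of labels is connected by at least one edge, the condensation of the voting digraph is totally ordered, and a node of maximum out-degree always lies in the root SCC.

```python
import random
from itertools import permutations

def plurality_winner(labels, beats, rng=random):
    """Pick the winner of a (Kernel) Plurality contest from pairwise
    decisions. `beats(l, m)` is True when l defeats m; a tie adds
    edges in both directions. A label with maximum out-degree is
    guaranteed to belong to the root SCC of the voting digraph;
    ties for the win are broken at random, as in the paper."""
    edges = {(l, m) for l, m in permutations(labels, 2)
             if beats(l, m) or not beats(m, l)}   # tie -> both edges
    wins = {l: sum((l, m) in edges for m in labels if m != l)
            for l in labels}
    best = max(wins.values())
    return rng.choice([l for l in labels if wins[l] == best])
```

A full Tarjan/Kosaraju SCC pass, as in [6,18], recovers the entire root component; the max-out-degree shortcut above simply returns one of its members.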

Experiments & Results
In order to validate our framework, we conducted extensive experiments using five different benchmark FR datasets - Yale A, CMU PIE, Extended Yale B, Multi-PIE and MERL Dome. Details of these datasets are summarized in Table 3. We used the preprocessing protocol proposed in [12] that is also used by other methods like [14] and references therein. For the Yale A, CMU PIE and Extended Yale B datasets, we obtained the preprocessed images from the authors of [12]. For the Multi-PIE and MERL Dome datasets, we used a subset of 50 labels (subjects), which were then manually cropped and aligned in line with the other three datasets. Note that all the reported results were generated by running the various algorithms on the same set of images.
Since our framework is independent of any one particular FR algorithm, we selected five different publicly available FR methods for our experiments. These are Eigenfaces (Eig) [24] - a PCA based method, Volterrafaces (Vol) [14] - a recently proposed state-of-the-art method, Tensor Subspace Analysis (TSA) [12] - a method representative of the class of embedding based techniques, Local Binary Patterns (LBP) [2] - a recently proposed feature-based state-of-the-art method, and the Nearest Neighbor (NN) classifier - a baseline method. More details for these methods can be found in Table 2. For each algorithm, we also created an associated ensemble of classifiers where each constituent classifier worked with only an 8 × 8 pixel patch of the face image. The different methods for label aggregation we tested included SVM [7] (an instance of Stacking [26]), Log-Odds Weighted Voting (WMV) [16], Simple Plurality (Vot), and Plurality with a Linear Kernel (Lin), Radial Basis Function Kernel (RBF), Polynomial Kernel (Pol) and Sigmoid Kernel (Sig). We used the LIBSVM [5] software package as our SVM implementation. These methods are summarized in Table 2.
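The per-patch ensemble construction can be sketched with the simplest of the five base methods, the nearest-neighbor classifier. This is a hypothetical minimal stand-in (class and method names are our own; the actual experiments used Eig/Vol/TSA/LBP as well), showing one L2 nearest-neighbor voter per 8 × 8 patch location.

```python
import numpy as np

class PatchEnsemble:
    """One nearest-neighbor classifier per 8x8 patch location, as a
    stand-in for the per-patch FR classifiers described in the text."""

    def __init__(self, patch=8):
        self.patch = patch

    def _patches(self, img):
        # non-overlapping patch x patch blocks, flattened per location
        p, (h, w) = self.patch, img.shape
        return [img[r:r + p, c:c + p].ravel()
                for r in range(0, h - p + 1, p)
                for c in range(0, w - p + 1, p)]

    def fit(self, imgs, labels):
        # gallery[k] stacks the training vectors at patch location k
        self.labels = np.asarray(labels)
        self.gallery = [np.stack(col) for col in
                        zip(*(self._patches(im) for im in imgs))]
        return self

    def predict_votes(self, img):
        # each patch location casts one vote via L2 nearest neighbor
        return [self.labels[np.argmin(((g - q) ** 2).sum(axis=1))]
                for g, q in zip(self.gallery, self._patches(img))]
```

The resulting vote list per test image is exactly what the label aggregation methods (Vot, WMV, Kernel Plurality) consume.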
All the conclusions drawn in this section are based on the tabulated classification error rates for the Extended Yale B, Yale A, MERL Dome, CMU PIE and Multi-PIE datasets presented in Fig. 4. The reported error rates are averages over ten different random splits of the data. Each row of these tables is labeled by the name of the algorithm used to generate the results listed in it. The name is given in the format 'ALG + AGG', where ALG is the abbreviated FR method name and AGG is the abbreviated label aggregation method name (see Table 2). Parameters for the FR algorithms were set using cross validation, as recommended in [14]. The heading of each column indicates the number (n) of images per label used for training. In each case, ∼ n/2 images were used as gallery images while the rest were used as probe images while generating the prediction vectors to learn the Kernel Plurality weights. The algorithm with the lowest error rate for each FR algorithm is indicated in bold black while the best performer for the whole database is indicated in bold red. We conducted experiments with seven different training set sizes for each dataset-algorithm combination. Due to lack of space, we have only included results for three representative training set sizes in Table 4. Our complete results can be found in the Supplementary Material (http://www.seas.harvard.edu/∼rkkumar).
First, we test the broader proposal made in this paper: that almost all FR algorithms benefit from patch based classification and subsequent label fusion. For this we compare the performance of each selected classifier (ALG) on the whole image to the performance of the corresponding ensemble with traditional label aggregation methods like ALG+WMV and ALG+Vot. It can be noted that across databases, FR methods and training set sizes, the ensemble results are significantly better than those of the corresponding FR methods applied to the whole image (ALG) (only one exception was observed). At the same time, the importance of a good label aggregation method is highlighted by the ALG+SVM results. Here we used the labels generated by the ensemble directly as input to a multi-class SVM. Since the number of classes (|L|) is large in all the databases used, it can be noted that ALG+SVM almost always fails to improve the performance over ALG.
Next we examine our second hypothesis: that the Kernel Plurality method, which picks voting weights so as to maximize the victory margin with respect to each losing class, is indeed effective. From the tabulated error rates, we can note that across most databases, FR methods and training set sizes, the ensemble results with Kernel Plurality (Lin, RBF, Pol and Sig) are better than those from the existing methods (WMV and Vot). For easy reading, we have color coded in black those cases where Lin, RBF, Pol and Sig outperform the corresponding Vot method.
The gains provided by Kernel Plurality are quantitatively captured in the plot presented in Fig. 5. For each database-training set size combination presented in Fig. 4, we have plotted the percentage improvement in error rate achieved by the Kernel Plurality variants of the five selected FR algorithms over simple Plurality. Each bar shows the range of improvement achieved by the five FR algorithms on a particular database-training set combination, and the marker shows the average improvement. The average improvement ranges from 3−21% across the different cases, but the maximum improvement, typically achieved by Volterrafaces, spans a more significant 5−66% range.
The effectiveness of the kernel in Kernel Plurality is demonstrated by the fact that the RBF, Pol and Sig variants of Kernel Plurality outperform the Lin variant in most cases (Fig. 4). This is highlighted by the fact that in most cases, the best performer for a given database-algorithm-training set size combination (encoded in bold black font) is one of the kernel methods. We must point out that the use of patch-wise classification and Kernel Plurality not only improves the performance of individual classifiers, but in conjunction with recent algorithms like Volterrafaces [14] and LBP [2], our framework can achieve state-of-the-art performance. Instances of this are highlighted with red bold font for all of the selected databases. These rates also compare favorably with the performance of many other existing FR methods listed in [14].
Finally, it is instructive to consider a failure case for Kernel Plurality. An easy to understand failure case is a face image whose prediction vector falls within the SVM margin due to the slack ξ (Eq. 8). Even though it is possible to assign voter weights such that this face is classified correctly, it is sacrificed in the hope of better generalization performance. Such a face image would likely be correctly classified by other weighting schemes like Log-Odds Weighted Voting.

Discussion
Here we note the similarities and dissimilarities among Kernel Plurality, Boosting, and SVMs, especially in the context of all margin maximization and Kernel Space voting.
Boosting can be looked at as a weighted voting method with the constraints that all the votes sum to unity and are positive. In a two-class scenario, Boosting has been linked to victory margin maximization [22]. Though there is a lack of proof that some of its variants, like AdaBoost [9], indeed maximize the victory margin, there are other two-class classification algorithms, like LPBoost [8], that do so. Thus, barring the important concepts of kernel voting and possible negative weights, it would seem that Kernel Plurality is similar in spirit to Boosting in the two-class scenario.
As we move to the case of multi-class classification, the notion of the margin of victory in a voting scheme must be semantically expanded. For the winner, there is now a margin of victory with respect to each of the losers. But Boosting has traditionally defined the margin in the multi-class scenario as the minimum of all the margins [22]. This has also been noted in the recently published multi-class generalization of LPBoost [21], which ends up maximizing the minimum margin. At this point Kernel Plurality departs significantly from Boosting (in addition to having negative weights and kernels) since it explicitly tries to maximize all the margins. As in the case of two-class voting [16], the expected improvement in generalization performance due to all-margin maximization was confirmed by our results.
Investigations into margins have also revealed connections between Boosting and SVMs [22,20]. They are not exactly the same, but for the binary classification problem, Boosting with a given set of hypotheses is 'similar' to running an SVM with a kernel mapping related to the label vectors generated by the hypothesis set [20]. Such a relation is not clear for the multi-class scenario; hence our use of kernels with an SVM in the prediction space P warrants further theoretical investigation.
The difference between Kernel Plurality, which maximizes all victory margins, and a collection of SVMs maximizing all pair-wise margins in the feature space D must also be appreciated. First, the former works in the prediction space while the latter works in the feature space. Second, in the former case we have one classifier which is required to classify O(|L|²) pair-wise prediction vectors per test feature vector, while the latter case requires training O(|L|²) classifiers instead of one.

Conclusions
In a literature landscape teeming with face recognition algorithms, instead of introducing yet another method, here we have made proposals that can potentially improve the performance of most of them. We note that face recognition as a classification problem is especially susceptible to overfitting, and for various popular algorithms this seems to be holding their performance back. We propose and demonstrate that applying face recognition algorithms to patches and then appropriately aggregating the labels tends to do better than applying the algorithms to the whole image. Aggregating labels without taking higher order interactions among patch labels into account amounts to neglecting correlated discriminatory information present in the image patches. To remedy this we propose a new voting algorithm called Kernel Plurality, which takes these higher order interactions into account while maximizing the margin of victory for the correct label with respect to each of the losers. This results in better generalization performance for Kernel Plurality as compared to Log-Odds Weighted Plurality, Simple Plurality and Stacking with SVMs.
Table 1. Symbols and their meaning.
- D, x_i: Feature/Input/Data space; i-th vector in it.
- L, l_j: Label space; j-th label in it.
- C, c_k: Classifier ensemble; k-th classifier in it.
- w_k: Weight associated with classifier c_k.
- c_k(x): Label assigned by c_k to x ∈ D.
- R, R+: Set of real numbers; positive real numbers.
- I_A(x): Indicator function; 1 if x ∈ A, else 0.
- P: Prediction subspace, P = {−1, 0, 1}^{|C|}.
- δ_{i,j}: Kronecker delta function; 1 if i = j, else 0.
- K, ϕ, K(·,·): Kernel space; mapping; kernel matrix.
- T: Training set, T ⊂ D.

Figure 1. Plurality as a set of pair-wise contests: (a) Votes cast by 8 classifiers toward classes A to E. (b) The corresponding voting digraph (in black) showing pair-wise contests, and its Strongly Connected Components graph (in color).

Figure 2. All-margin maximization: (a) Four classes embedded in some space, with two noisy data points that belong to class 1 but seem closer to classes 2 and 4. (b) If, for class 1, only the minimum margin is maximized, classes 3 and 4 can possibly cluster just beyond the closest class (2). As a result, the ambiguity for the noisy data point closer to class 4, as shown, may remain. (c) If, for class 1, all pairwise margins are maximized, classes 3 and 4 are pushed farther away and the ambiguity for both noisy data points can be reduced.

Figure 3. Kernel Plurality: Given a set of data points (a) and an ensemble of classifiers that labels them (b), we can encode each data point (c) with a prediction vector p, as tabulated in (e). Kernel Plurality tries to find a weight vector in the prediction space P such that the associated decision boundary separates all the p's from the origin with maximum margin, as shown in (d). A non-linear decision boundary in P corresponds to a linear hyperplane in the kernel space K associated with the mapping ϕ, as shown in (f), which is where we compute.

Figure 5. Percentage Improvement in Error Rates: For each database-training set size combination in Fig. 4, we have plotted the percentage improvement in error rates achieved by the Kernel Plurality methods over Plurality (Vot). Each bar shows the range of improvement achieved by the five selected FR algorithms, and the marker shows their average.

Table 3. Databases used in our experiments.

Figure 4. Classification Error Rates: The key for algorithm names and color encoding is provided below the table. The lower the error, the better the method. In most cases - across databases, FR algorithms and training set sizes - Kernel Plurality methods (Lin, RBF, Pol and Sig) outperform the competing methods. Black bold font: best result for a dataset-algorithm combination. Red bold font: best result for the dataset. Black font: better result than the corresponding unit-weight Plurality.