Discriminative virtual views for cross-view action recognition

We propose an approach for cross-view action recognition by way of 'virtual views' that connect the action descriptors extracted from one (source) view to those extracted from another (target) view. Each virtual view is associated with a linear transformation of the action descriptor, and the sequence of transformations arising from the sequence of virtual views aims at bridging the source and target views while preserving discrimination among action categories. Our approach is capable of operating without access to labeled action samples in the target view and without access to corresponding action instances in the two views, and it can also naturally incorporate and exploit corresponding instances or partial labeling in the target view when they are available. The proposed approach achieves improved or competitive performance relative to existing methods when instance correspondences or target labels are available, and it goes beyond the capabilities of these methods by providing some level of discrimination even when neither correspondences nor target labels exist.


Introduction
We consider the challenge of recognizing human actions across changes in the observer's viewpoint. Opportunities for the use of action analysis in domains such as surveillance, video indexing/retrieval, and human-computer interaction are growing fast [16,18,1], but realizing this potential relies on the ability to accurately interpret human activities from a broad range of viewing directions. In a typical action recognition setting, spatio-temporal features are computed from a video to represent the underlying action. These features can be powerful in discriminating between different actions observed from similar viewpoints, but since the same action can appear quite different when observed from different directions, the utility of these features degrades as the change in viewpoint grows.
The brute-force approach of training independent classifiers for each action in each view does not scale well due to the requirement of excessive labeled training data, so a possible line of attack is to search for view-invariant features, representations, or models that can be used for all viewpoints. One approach is to infer three-dimensional scene structure so that the derived action descriptors can be adapted from one view to another through geometric reasoning [29,26,15,10,4], while another is to search for spatio-temporal features of a video sequence that are insensitive to changes in view angle [17,21,23,22,3,28]. Recent view-invariant approaches include [27] and [13]. The former learns a classifier on examples taken from various views, and the latter introduces a temporal self-similarity matrix and demonstrates its view stability empirically.
Figure 1. Knowledge transfer using 'virtual views'. Action descriptors $x$ from different views are augmented into cross-view feature vectors $\tilde{x}$ by applying a finite sequence of linear transformations $g(\lambda_i, x)$ to each descriptor $x$. We introduce a flexible, semi-supervised framework for learning the transform sequences in a way that can exploit various forms of partial labeling for the two camera angles.
Another emerging family of approaches addresses cross-view action recognition by adapting features, representations, or recognition models trained on one or more source views to a target view where the recognition task will be performed [8,7,14]. This boils down to drawing some form of statistical connection between view-dependent features extracted from different viewing directions. This is attractive because it reduces reliance on accurately inferring explicit camera geometry, extended motion trajectories, and three-dimensional actor models. A notable example of this knowledge transfer approach is the work of Farhadi et al. [8,7], who rely on simultaneous multi-view observations of the same action instance to explicitly identify maps between one view's features and those of another, thereby allowing a classifier learned in one view to be adapted by suitably reorganizing its weights. Another example is the work of Liu et al. [14], who rely on the same style of input to learn a cross-view bag of 'bilingual words' representation in which each bilingual word represents the co-occurrence of one visual word in one view with another visual word in another view.
We propose a different approach to view knowledge transfer that significantly relaxes the requirements on the training data. Instead of requiring access to simultaneous multi-view observations of the same action instance, our approach can leverage a variety of weak forms of supervision, including cases in which action categories are labeled in only one camera angle and there are no links or labels at all in another. As depicted visually in Figure 1, the conceptual idea is to construct 'virtual views' between action descriptors from one viewpoint and those from another. We imagine that an action descriptor transforms continuously between one viewpoint and another, and we compute 'virtual views' as a sequence of transformed descriptors obtained by making a finite number of stops along the way. The intermediate views are virtual because they exist only in an abstract feature space and are not identified with any physical change in camera position. Taken together, the sequence of transformed descriptors represents an augmented feature that embeds the statistical transition between two views, and by developing a discriminative method for learning the sequence of transform operators, we ensure that these augmented 'cross-view' features can be used to meaningfully compare action descriptors from different viewpoints.
Our key technical contribution is an information-theoretic framework that allows learning discriminative 'virtual view' transformations using a wide variety of partial labelings. Like the approaches in [8,14], it can exploit the case in which an unlabeled action instance (execution) observed simultaneously in both views yields a matched pair, and a few such (unlabeled) pairs are available. We refer to this working mode as the correspondence mode. At the same time, our approach can also operate under the conditions usually considered by the transfer learning or domain adaptation paradigm [9,6,19,2,11], where the samples in the target domain are usually partially labeled while matched instances with the source view may not exist. We refer to this working mode as the partially labeled mode. In addition to these two working modes, our approach can operate in a third mode where the target view is completely unlabeled and no target instances are matched to the source instances. We refer to it as the unlabeled mode. Experiments show that our approach provides improved or competitive performance relative to existing methods when operating in the first two modes, and that it provides some discrimination in the third.

Discriminative Virtual Views
Consider source view $V_S$ and target view $V_T$, and imagine that they are connected by some virtual path $V(\lambda)$, $0 \le \lambda \le 1$, with $V(0) = V_S$ and $V(1) = V_T$. Recall that this virtual path does not correspond to physical changes in camera position, but instead is associated with transformations of action descriptors. For the transformations of action descriptors along the virtual path $V(\lambda)$, we will use a particular class of linear projections. To this end, it is convenient to express the transformation associated with the source view as $g_S(x) = A_S^\top x$ and that associated with the target view as $g_T(x) = A_T^\top x$, where $x$ is a $D$-dimensional raw action descriptor (e.g., a histogram on a vocabulary of visual words) computed from either the source view (in the former case) or the target view (in the latter case). Here $A_S$ and $A_T$ are both $D \times d$ matrices satisfying $A_S^\top A_S = I$ and $A_T^\top A_T = I$, i.e., they both have orthonormal columns, and they induce a linear dimensionality reduction.
We represent the view change along the virtual path $V(\lambda)$ implicitly as alterations of the feature extractors $g_S$ and $g_T$ (and thus the matrices $A_S$ and $A_T$). For this purpose, we define $g(\lambda, x) = A_\lambda^\top x$ for $0 < \lambda < 1$, where $A_\lambda$ is also a $D \times d$ transformation matrix, $g(0, x) = g_S(x)$, and $g(1, x) = g_T(x)$. We sample the virtual path $V(\lambda)$ at a finite number of points $0 = \lambda_0 < \lambda_1 < \dots < \lambda_L < \lambda_{L+1} = 1$, and the consecutive incremental 'jumps' from $V(0) = V_S$ to $V(\lambda_1)$, $V(\lambda_2)$, etc., through to $V(1) = V_T$ are intended to establish a smooth bridge between the visual information existing in the two views. Since we have associated a view $V$ with a transform $g$ uniquely identified by a matrix $A$, the sequence of virtual-view transforms $g(\lambda_i, x)$ can provide a sequence of 'virtual' features that characterize the smooth changes of the features from the source to the target. Refer again to Figure 1.
The major questions to be answered are how to choose effective transformations $g_S$, $g_T$ (i.e., $A_S$, $A_T$) and how to alter the transformations to define the virtual path $g(\lambda, x)$ (i.e., $A_\lambda$). In Section 2.1, we show that for a given pair of transformations $A_S$, $A_T$, there exists a particular 'shortest' path connecting the two, allowing the virtual views to be obtained analytically. Then, in Section 2.2, we formulate the problem of identifying the optimal pair $(A_S, A_T)$ under our three distinct working modes, so that in each case the augmented cross-view features are discriminative among action categories. Finally, in Section 2.3, we provide the algorithm to solve this problem and determine the optimal $A_S$ and $A_T$.

Obtaining a Virtual Path
For the moment, let us assume that the source and target view transformations $A_S$ and $A_T$ have been given, and our task is to compute the transforms $A_\lambda$ that connect them along a virtual path. To this end, we aim to determine a path of $D \times d$ matrices from $A_S$ to $A_T$. There are various ways to establish such connections between the two matrices, among which one possibility is to look into the space of all $D \times d$ matrices and make use of its geometry [24]. However, manipulation in this space is computationally inconvenient, so we pursue an alternative approach. By construction, the columns of $A_S$ and $A_T$ are of unit length and therefore lie on a hyper-sphere. Thus, a natural definition for a continuous path between the $i$th column of $A_S$ and the $i$th column of $A_T$ is the segment of the great circle that connects them. We define a closed-form path between the matrices as wholes by separately identifying the $d$ geodesics between their $d$ corresponding columns, and then traveling simultaneously along these geodesics from the columns of $A_S$ at rates that guarantee simultaneous arrival at the columns of $A_T$. Specifically, to get the transforms $A_\lambda$, we compute the angle between each pair of corresponding columns of $A_S$ and $A_T$, and then obtain the corresponding column of $A_\lambda$ by moving a fraction $\lambda$ of the way along the connecting great-circle segment. Note that the columns of an $A_\lambda$ constructed in this way are not necessarily orthogonal, but remain unit-norm. The preservation of unit length guarantees that the transformed feature $A_\lambda^\top x$ is at the same scale as $A_S^\top x$ and $A_T^\top x$.
To create our augmented cross-view feature, we simply concatenate the transformed features at the sampled path parameters $\lambda_0 = 0, \lambda_1, \dots, \lambda_L, \lambda_{L+1} = 1$ into a single long feature vector:
$$\tilde{x} = \big[\, g(\lambda_0, x)^\top,\ g(\lambda_1, x)^\top,\ \dots,\ g(\lambda_{L+1}, x)^\top \,\big]^\top \in \mathbb{R}^{d(L+2)}.$$
This new feature implicitly incorporates the smooth change from one view to the other, and therefore bridges the two views and serves as a new, unified feature vector.
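The construction above can be illustrated with a short NumPy sketch. It assumes the standard constant-speed great-circle (spherical linear) interpolation between corresponding unit-norm columns, which is one parameterization consistent with the description; it also assumes equally spaced path parameters and that no pair of corresponding columns is antipodal. The function names `virtual_path` and `cross_view_feature` are hypothetical.

```python
import numpy as np

def virtual_path(A_S, A_T, L):
    """Return L+2 matrices from A_S to A_T, sampled at lambda_i = i / (L + 1).

    Each column travels along the great circle joining the corresponding
    unit-norm columns of A_S and A_T (assumed interpolation rule).
    """
    # Angle between corresponding columns, one angle per column.
    cos_theta = np.clip(np.sum(A_S * A_T, axis=0), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    small = theta < 1e-8                      # nearly identical columns
    sin_theta = np.where(small, 1.0, np.sin(theta))
    path = []
    for i in range(L + 2):
        lam = i / (L + 1)
        # Interpolation weights; fall back to linear weights for tiny angles.
        w_S = np.where(small, 1.0 - lam, np.sin((1.0 - lam) * theta) / sin_theta)
        w_T = np.where(small, lam, np.sin(lam * theta) / sin_theta)
        path.append(A_S * w_S + A_T * w_T)    # weights broadcast over columns
    return path

def cross_view_feature(x, path):
    """Concatenate the projections A^T x over all matrices on the path."""
    return np.concatenate([A.T @ x for A in path])
```

With `L` intermediate views and `d` columns per matrix, `cross_view_feature` returns a vector of length `d * (L + 2)`, matching the dimension of the concatenated feature.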

Maximizing Discrimination
Since our virtual view transforms are completely determined by matrices $A_S$ and $A_T$, we now turn to the question of how to choose good values for $A_S$ and $A_T$. Let us consider a two-class problem (multi-class problems can be treated as a set of two-class problems using a one-versus-all approach) with positive training examples $\{(x_{P,i}, 1)\}_{i=1}^{n_P}$ and negative training examples $\{(x_{N,j}, -1)\}_{j=1}^{n_N}$. In the unlabeled mode, all these labeled samples come from the source view. For the partially labeled mode, only a minority of the above training samples come from the target view.
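In either case, we would like to maximize our ability to discriminate between the two classes in all available labeled samples. To this end, we seek transformations $A_S$ and $A_T$ that maximize the mutual information between the cross-view feature $\tilde{x}$ and the class label $c \in \{1, -1\}$:
$$\{A_S^*, A_T^*\} = \arg\max_{A_S, A_T} I(\tilde{x}; c). \qquad (3)$$
Note that
$$I(\tilde{x}; c) = H(\tilde{x}) - H(\tilde{x} \mid c) = H(\tilde{x}) - \sum_{c} p(c)\, H(\tilde{x} \mid c), \qquad (4)$$
so (3) can be written in terms of the differential entropy $H(\tilde{x})$.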
To solve (3), we approximate the differential entropy $H(\tilde{x})$ using a finite set of samples. Assuming that the samples of the cross-view feature $\tilde{x}$ are drawn from a Gaussian distribution, we may write $H(\tilde{x}) = \frac{1}{2}\ln\big((2\pi e)^{d(L+2)} \det \Sigma\big)$, in which the covariance matrix $\Sigma$ can be estimated from the samples $\tilde{x}$. Further assuming equal prior probabilities for the two classes, we approximate the objective in (3) by
$$\max_{A_S, A_T}\ \tfrac{1}{2}\ln\det\Sigma_{all} - \tfrac{1}{4}\ln\det\Sigma_{P} - \tfrac{1}{4}\ln\det\Sigma_{N}, \qquad (5)$$
where $\Sigma_{all}$, $\Sigma_P$, $\Sigma_N$ are covariance matrices computed from all labeled samples, the positive samples, and the negative samples respectively.
We may take a similar approach to choose the optimal transformation pair $A_S$ and $A_T$ in the correspondence mode. Specifically, labeled samples can be written as $\{(x^{(S)}_{P,i}, 1)\}_{i=1}^{n_P}$ and $\{(x^{(S)}_{N,j}, -1)\}_{j=1}^{n_N}$, since in this mode the labels are not shared across the two views. The instances in correspondence, meanwhile, can be expressed as pairs $(x^{(S)}_k, x^{(T)}_k)$ indexed by $k$, where each unlabeled pair $(x^{(S)}, x^{(T)})$ describes the same instance (execution) of an unlabeled action in the two views. We expand all $x^{(S)}$ and $x^{(T)}$ to get $\tilde{x}^{(S)}$ and $\tilde{x}^{(T)}$, and define $\Delta\tilde{x} = \tilde{x}^{(S)} - \tilde{x}^{(T)}$ for each pair $(x^{(S)}, x^{(T)})$ corresponding to the same instance. Since the pair $(\tilde{x}^{(S)}, \tilde{x}^{(T)})$ describes the same instance of an action, we expect $\Delta\tilde{x}$ to be close to zero. In addition to maximizing the mutual information between $\tilde{x}$ and the class label $c \in \{1, -1\}$, we add the penalty $H(\Delta\tilde{x})$ to solve
$$\max_{A_S, A_T}\ I(\tilde{x}; c) - H(\Delta\tilde{x}). \qquad (6)$$
As previously, we approximate the mutual information in terms of covariance matrices, and we assume $\Delta\tilde{x}$ to be Gaussian distributed with zero mean, since we expect it to be not only compactly distributed but also close to the origin. Up to an additive constant, the objective in (6) is therefore approximated by
$$\max_{A_S, A_T}\ \tfrac{1}{2}\ln\det\Sigma_{all} - \tfrac{1}{4}\ln\det\Sigma_{P} - \tfrac{1}{4}\ln\det\Sigma_{N} - \tfrac{1}{2}\ln\det\Sigma_{\Delta}, \qquad (7)$$
where $\Sigma_\Delta$ is the correlation matrix (the uncentered second-moment matrix), not the covariance matrix, of all the $\Delta\tilde{x}$'s. Minimizing $\det\Sigma_\Delta$ yields $\Delta\tilde{x}$'s concentrated around 0, by which we enforce the correspondence between the pair $(\tilde{x}^{(S)}, \tilde{x}^{(T)})$.
A practical issue that may arise is a rank deficiency in any of the covariance/correlation matrices. In this case we first determine the minimum rank among all involved matrices (say, $r$), and use the product of the $r$ largest eigenvalues of each matrix to approximate its determinant.
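As a rough illustration of how the Gaussian approximation might be evaluated from samples, the sketch below computes a log-determinant objective from sample covariance matrices, using the product of the r largest eigenvalues of each matrix when some matrix is rank deficient. The function names are hypothetical. The 1/2 and 1/4 weights follow from the equal-prior Gaussian assumption, and the unweighted penalty on matched-pair differences is an assumption of this sketch rather than a detail confirmed by the text.

```python
import numpy as np

def logdet_top_r(M, r):
    """Log of the product of the r largest eigenvalues of a symmetric matrix."""
    eigvals = np.linalg.eigvalsh(M)            # returned in ascending order
    return float(np.sum(np.log(eigvals[-r:]))) if r > 0 else 0.0

def approx_objective(X_pos, X_neg, X_delta=None):
    """Gaussian approximation of the discriminative objective.

    X_pos, X_neg: arrays whose rows are cross-view features of each class.
    X_delta: optional array whose rows are differences of matched pairs.
    """
    X_all = np.vstack([X_pos, X_neg])
    mats = [np.atleast_2d(np.cov(X, rowvar=False)) for X in (X_all, X_pos, X_neg)]
    weights = [0.5, -0.25, -0.25]
    if X_delta is not None:
        # Uncentered second moment, reflecting the zero-mean assumption.
        mats.append(X_delta.T @ X_delta / len(X_delta))
        weights.append(-0.5)
    # Use a common rank so that all determinants are compared consistently.
    r = min(int(np.linalg.matrix_rank(M)) for M in mats)
    return sum(w * logdet_top_r(M, r) for w, M in zip(weights, mats))
```

Well-separated classes inflate the covariance of all samples relative to the per-class covariances, which raises the value; tightly matched pairs shrink the second-moment matrix of the differences, which also raises it.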
In fact, our learning algorithm for maximizing (7) can not only exploit the forms of semi-supervision considered in the three working modes, but also accommodate any mixture of those modes: we simply need to encode the information regarding available labels and corresponding instances into the covariance/correlation matrices $\Sigma_{all}$, $\Sigma_P$, $\Sigma_N$, and $\Sigma_\Delta$, respectively.

Obtaining the Optimal Virtual Views
We now go on to present the algorithm with which we optimize the two objectives (5) and (7) above. For simplicity, we denote the objectives in both (5) and (7) as $J(A_S, A_T)$ in the following discussion.
We employ a greedy algorithm that iteratively searches for transformations $(A_S, A_T)$ that maximize $J$. To use a gradient-based approach, we would need to evaluate $\partial J(A_S, A_T)/\partial A_S$ and $\partial J(A_S, A_T)/\partial A_T$ subject to $A_S^\top A_S = I$ and $A_T^\top A_T = I$, which is difficult. Instead, we consider an axis-rotating approach. Let $A_S(t-1)$ be the estimate for $A_S$ and $A_T(t-1)$ be the estimate for $A_T$ at iteration $t-1$. We seek matrices $R_S(t), R_T(t) \in SO(D)$, i.e., the $D$-dimensional special orthogonal group, so that the estimate at step $t$ is $A_S(t) = R_S(t)A_S(t-1)$ and $A_T(t) = R_T(t)A_T(t-1)$. In essence, we seek a pair $R_S(t)$, $R_T(t)$ that provides a steep ascent in $J$. Note that $SO(D)$ corresponds to the set of rotation operations in $\mathbb{R}^D$, thus the resulting $A_S(t)$, $A_T(t)$ will have orthonormal columns as well. We summarize the algorithm by which we obtain the optimal $R_S(t)$, $R_T(t)$, and consequently $A_S(t)$, $A_T(t)$, from $A_S(t-1)$, $A_T(t-1)$ in Algorithm 1. The mathematical principle behind this algorithm involves approximate gradient computation on $SO(D)$ and is briefly introduced in the Appendix. More details on $SO(D)$ can be found, for example, in [12].

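Algorithm 1. Greedy axis rotation (iteration $t$).
Input: the current estimates $A_S(t-1)$ and $A_T(t-1)$, a small positive number $\epsilon$, and a step length $\delta$.
1. For each $1 \le i < j \le D$, compute $R_{S,i,j} = \exp(\epsilon(E_{i,j} - E_{j,i}))$ and $\Delta J_{S,i,j} = [J(R_{S,i,j}A_S(t-1), A_T(t-1)) - J(A_S(t-1), A_T(t-1))]/\epsilon$, where $E_{i,j}$ is a matrix whose $(i,j)$th element is one and all others are zero;
2. For each $1 \le k < l \le D$, compute $R_{T,k,l} = \exp(\epsilon(E_{k,l} - E_{l,k}))$ and $\Delta J_{T,k,l} = [J(A_S(t-1), R_{T,k,l}A_T(t-1)) - J(A_S(t-1), A_T(t-1))]/\epsilon$;
3. Form the approximate gradients $\nabla J_S = \sum_{i,j} c_{i,j}(E_{i,j} - E_{j,i})$ and $\nabla J_T = \sum_{k,l} c_{k,l}(E_{k,l} - E_{l,k})$ with $c_{i,j} = \Delta J_{S,i,j}$ and $c_{k,l} = \Delta J_{T,k,l}$;
4. Perform a line search over $n$ on $J(\exp(n\delta\nabla J_S)A_S(t-1), \exp(n\delta\nabla J_T)A_T(t-1))$, and set $R_S(t) = \exp(n\delta\nabla J_S)$ and $R_T(t) = \exp(n\delta\nabla J_T)$ for the best $n$.
Output: $A_S(t) = R_S(t)A_S(t-1)$ and $A_T(t) = R_T(t)A_T(t-1)$.
In practice, we initialize $A_S(0)$ and $A_T(0)$ as described in the next section, and iterate Algorithm 1 until $A_S(t) = A_S(t-1)$ and $A_T(t) = A_T(t-1)$.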
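One iteration of the axis-rotating search can be sketched as follows for the generic single-matrix problem of maximizing J(RA) over rotations R, as treated in the Appendix; a joint update of the source and target matrices would apply the same steps to both. The function names, the default values of `eps`, `delta`, and `max_n`, and the stopping rule of the line search (stop at the first step that fails to improve) are assumptions of this sketch.

```python
import numpy as np

def plane_rotation(D, i, j, eps):
    """exp(eps * (E_ij - E_ji)): a rotation by eps in the (i, j) plane."""
    R = np.eye(D)
    c, s = np.cos(eps), np.sin(eps)
    R[i, i] = R[j, j] = c
    R[i, j], R[j, i] = s, -s
    return R

def expm_skew(K):
    """Matrix exponential of a real skew-symmetric K via the Hermitian matrix iK."""
    w, V = np.linalg.eigh(1j * K)
    return ((V * np.exp(-1j * w)) @ V.conj().T).real

def approx_gradient(J, A, eps=1e-4):
    """Finite-difference gradient in so(D), normalised to unit length."""
    D = A.shape[0]
    J0 = J(A)
    G = np.zeros((D, D))
    for i in range(D):
        for j in range(i + 1, D):
            c = (J(plane_rotation(D, i, j, eps) @ A) - J0) / eps
            G[i, j], G[j, i] = c, -c           # coefficient of E_ij - E_ji
    norm = np.linalg.norm(np.triu(G, 1))
    return G / norm if norm > 0 else G

def rotation_step(J, A, eps=1e-4, delta=0.05, max_n=50):
    """Line search along the approximate gradient; returns (new A, new J)."""
    G = approx_gradient(J, A, eps)
    best_A, best_J = A, J(A)
    for n in range(1, max_n + 1):
        candidate = expm_skew(n * delta * G) @ A
        value = J(candidate)
        if value <= best_J:
            break
        best_A, best_J = candidate, value
    return best_A, best_J
```

Because every update multiplies the current matrix by a rotation, the columns stay orthonormal without any explicit re-orthogonalisation.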

Implementation Details and Extensions
The first step in training our model is to determine the working mode and extract the corresponding single-view action descriptors from each training video. In all cases we use an equally spaced sequence for the path parameter $\lambda$, i.e., $\lambda_i = i/(L+1)$. Once the optimal transformations are computed, we compute cross-view features $\tilde{x}$ for all training samples and use the subset of labeled samples to train a cross-view action classifier. There are many possible choices for the classifier, and in the experiments we use the Multiple Kernel Learning SVM (MKL-SVM) [20].
For any testing observation $x$ from the target view, we compute its cross-view feature $\tilde{x}$ using all transformations obtained from the training stage, and then evaluate the MKL-SVM at this cross-view feature.
Initialization. Good choices for initializing $A_S$ and $A_T$ can expedite the training procedure. For the source view, we find it effective to use an orthonormal basis that spans the $d$-dimensional subspace determined by the Fisher discriminant for the labeled samples in that view. For the target view, we simply use the basis of the Fisher discriminant subspace if labeled samples are available, or that of the principal subspace if not.
Multiple Action Classes. For an $M$-class action recognition problem, we learn $M$ binary one-against-all models as described above. The final classification is determined by selecting the model whose MKL-SVM yields the maximum response.
Multiple Source Views. In many applications we may have $w$ source views with $w > 1$. In this case, given a test instance from the target view, we simply aggregate the response values from the $w$ MKL-SVM classifiers on their respective cross-view features $\tilde{x}$, and then make a binary decision with the threshold at 0. For an $M$-class problem, we select the class which achieves the maximum aggregated response value.
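The fusion rule for multiple classes and multiple source views can be sketched in a few lines. The sketch assumes that aggregation means summing the classifier responses over source views, which is one plausible reading; the function name `fuse_and_classify` and the layout of the response array are hypothetical.

```python
import numpy as np

def fuse_and_classify(responses):
    """Pick the class with the largest aggregated response.

    responses: array of shape (w, M), where entry (s, m) is the response of
    the one-against-all model for class m trained with source view s.
    Returns the selected class index and the aggregated responses.
    """
    aggregated = np.asarray(responses, dtype=float).sum(axis=0)
    return int(np.argmax(aggregated)), aggregated
```

For a single binary model, the same aggregated value would simply be compared against the threshold 0.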

Experimental Evaluation
Following [8,7,14], we evaluate our approach on the IXMAS multi-view action dataset [26], which contains eleven categories including actions like walk, kick, and throw. Each action is performed three times by twelve actors and recorded from five different views, including four side views and one top view. To enable appropriate comparison, we use the same low-level action descriptors used in [14]. Specifically, the action is represented by a concatenation of a spatio-temporal interest-point-based descriptor [5] and a shape-flow descriptor [25]. The two types of descriptors serve as complementary local and global characterizations of the motion. For the local interest-point-based descriptor, a 2-D Gaussian filter and then a 1-D Gabor filter are applied to the video, and the interest points are detected at the local maxima of the response. The parameters for the two filters are σ = 2 and τ = 1.5 respectively, and at most 200 maxima are extracted from each video. Then, the spatio-temporal volumes around the maxima are extracted, and gradient-based descriptors are computed and reduced to 100 dimensions via PCA. These descriptors are further quantized to visual words by k-means clustering. Eventually, each action is represented by a histogram over 1000 visual words. For the global shape-flow feature, a three-channel descriptor is computed from each frame: horizontal optical flow, vertical optical flow, and silhouette. Each of these channels has the same dimension as the input frame, and PCA is again employed to reduce the dimensionality. Descriptors from neighboring frames are concatenated with the current frame descriptor to incorporate temporal information. Finally, the histogram vector is built over 500 quantized visual words. See [5,25] for more details.
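The quantization step that turns local descriptors into a visual-word histogram can be sketched as below. The sketch assumes a codebook has already been learned (for example by k-means) and that the histogram is normalised to sum to one; the normalisation and the function name `bow_histogram` are assumptions, not details given in the text.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Histogram of nearest visual words.

    descriptors: array of shape (n, p), one local descriptor per row.
    codebook: array of shape (k, p), one visual word (cluster centre) per row.
    """
    # Squared Euclidean distances between every descriptor and every word.
    d2 = (np.sum(descriptors ** 2, axis=1)[:, None]
          + np.sum(codebook ** 2, axis=1)[None, :]
          - 2.0 * descriptors @ codebook.T)
    words = np.argmin(d2, axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist
```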

Pairwise Cross-View Recognition
We first look into all possible pairwise view combinations (twenty in total for five views) to evaluate the proposed approach. We begin with the correspondence mode and compare with existing approaches. We then show results on partially labeled and unlabeled modes.
Correspondence mode: We follow the same data separation scheme as in [14] (inherited from [8,7]) for fair comparison. This is a leave-one-action-class-out scheme, where we consider one action class (called an 'orphan action') in the target view, and exclude all videos of that class when learning the quantized visual words and establishing correspondence. The instances in correspondence are randomly selected from the non-orphan training actions, and approximately 30% of the non-orphan samples serve as such pairs. We adopt the six-fold cross-validation scheme of [14] to build the discriminative virtual views as well as train the classifier on the augmented cross-view feature vectors. We sample ten virtual views from the virtual path, and set the transformed virtual view dimension to d = 20. The MKL-SVM, meanwhile, consists of nine Gaussian kernels with bandwidths in powers of 10 ranging from $10^{-4}$ to $10^{4}$. The final performance is reported as an average over all action classes. The recognition accuracy is shown in Table 1 for all possible source-target view combinations, as compared to the baselines [7] and [14]. (We omit the accuracy of [8] since it reports the lowest in all cases.) It is seen that our approach is outperformed by [7] on two source-target combinations and by [14] on one combination, while it achieves a uniform improvement over all baselines on the other combinations. In particular, our approach achieves increased accuracy on average for all five possible target views with varying source views, though the increase on camera 4, the top view, is less significant than the others.
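The bank of nine Gaussian kernels could be assembled as in the sketch below, which produces one Gram matrix per bandwidth. The text does not say how the bandwidth enters the kernel, so the parameterization exp(-||x - y||^2 / (2 sigma^2)) with sigma equal to a power of 10 is an assumption, as is the function name `gaussian_kernel_bank`; learning the kernel combination weights is left to the MKL solver and is not shown.

```python
import numpy as np

def gaussian_kernel_bank(X, Y=None, exponents=range(-4, 5)):
    """Stack of Gaussian Gram matrices, one per bandwidth 10**k.

    X: array of shape (n, p); Y: array of shape (m, p), defaults to X.
    Returns an array of shape (len(exponents), n, m).
    """
    Y = X if Y is None else Y
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2.0 * X @ Y.T)
    sq = np.maximum(sq, 0.0)                   # guard against rounding below zero
    return np.stack([np.exp(-sq / (2.0 * (10.0 ** k) ** 2)) for k in exponents])
```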
Partially labeled and unlabeled modes: As mentioned earlier, the view-transfer problem has much in common with other transfer learning or domain adaptation problems. Therefore, we consider cross-view recognition in a setting similar to semi-supervised classification, where a small portion of the samples from the target view is labeled, and we compare our performance to that obtained by the methods studied in [2]. Again, we employ a six-fold cross-validation strategy, and provide class labels to randomly selected samples from the target view. We again sample ten virtual views from the virtual path, and set the dimension of the transformed feature to d = 20. Three types of SVMs used in [2] are employed in our experiment for comparison: SVMSUT, AUGSVM, and MIXSVM. SVMSUT trains a single classifier on all labeled samples from both views and treats each sample as independent. AUGSVM uses a new feature vector which reserves space for both views, and fills an original feature into its corresponding space to obtain the new features. MIXSVM, meanwhile, trains two SVMs on the source and target and then learns an optimal linear combination of the two. Since we use MKL-SVM instead of a single SVM, we use MKL versions of the three baselines as well for comparison. (We refer to them using their original names even though we actually use their MKL versions.) The kernel types and parameters remain the same as in the previous experiment, and we use the fusion scheme for multiple action classes introduced in the previous section.
Table 1. Cross-view action recognition accuracy on the IXMAS dataset when matched instances are available between the source view and the target view (correspondence mode). Each row is a source view and each column a target view. The three accuracy numbers in a triple are the average recognition accuracy of [7], [14], and our approach, respectively.
Table 2. Cross-view action recognition accuracy on the IXMAS dataset when some labels are available in the target view but there are no matched pairs (partially labeled mode). Each row is a source view and each column a target view. The four accuracy numbers in a tuple are the average recognition accuracy of SVMSUT, AUGSVM, MIXSVM from [2], and our approach, respectively.
We vary the fraction of the labeled samples from the target view in increments of 5% up to 30%. The average recognition accuracies for different fractions are shown in Figure 2, from which a substantial improvement is observed for our approach relative to the baselines. Note that the left side of the graph (0%) corresponds to the unlabeled mode, in which no target samples have labels for training. Our approach handles this mode seamlessly, and it outperforms the baselines that have access to labeled target samples (ours is 26% accurate with no target labels, while the others are less than 26% accurate even with 5% of target labels). Also note that AUGSVM directly combines source and target samples into a single vector, while we augment either the source or target feature by the discriminative virtual features. Therefore, one can view AUGSVM as a limiting case of our framework in which the number of virtual views is set to zero. Our increase in accuracy therefore demonstrates the advantage of using the virtual views.

Effects of Varying Parameters
In the previous experiments we use ten virtual views, each with a 20-dimensional feature. To investigate the effects of changes in these parameters, we vary the number of virtual views from 1 to 10 and set the dimension for each virtual view to be 5, 10, or 20. All three working modes are evaluated, and the overall recognition accuracy is given in Figure 3. A significant jump is observed from one virtual view to two virtual views, after which the accuracy increases only incrementally. Also, the dimension increase from 10 to 20 leads to only a mild accuracy improvement, especially for the correspondence mode. These observations imply that one may use a relatively small number of virtual views and lower dimensions per view unless a very high accuracy is desired.

Non-Discriminative Virtual Views
The transformations $A_S$, $A_T$ and $A_\lambda$ are learned through mutual information maximization to discriminate action categories. How will the performance be affected if these transformations are not learned discriminatively? To answer this question, we let $A_S$, $A_T$ be the bases of the principal subspaces of the source and target samples respectively, and directly compute $A_\lambda$ following Section 2.1 without the optimizations in Section 2.2 or 2.3. This modification reduces our approach to bridging the source and target by non-discriminative projections, similar to the method of Gopalan et al. [11]. In Table 3, we compare results of our proposed approach to those of this non-discriminative version, as well as to [11]. The average recognition accuracy of these non-discriminative approaches suffers a significant drop for all three working modes, which underscores the benefit of learning the virtual views discriminatively.

Multiple Source Views
To explore the benefits of having multiple source views, we select a target view and use all other four views as sources. Classifiers trained on the four source-target pairs are fused using the method presented in Section 3. We again compare the correspondence mode, in which matched pairs are available, against the strategy in [14], and the partially labeled mode against the three domain-transfer SVMs, for which the fusion of multiple classifiers is the same. The average accuracy is provided in Table 4 and Table 5. Comparing Table 4 with Table 1, we find a moderate performance gain from fusing multiple source views, while [14] sees a substantial increase. Overall, we achieve accuracy comparable to [14]. By comparing Table 5 to Table 2, it is also interesting to note that for the partially labeled mode the performance gain relative to a single source view is more significant for the baseline SVMs than for our approach, though our fused classifier still reports the best accuracy. This may imply that our view transfer method, which attempts to bridge two views via a smooth path, has more thoroughly exploited the connection between the source and the target, so that additional source views only contribute limited additional discrimination.

Conclusion
We propose an approach for cross-view action recognition, in which the source and target views are explicitly connected by a smooth virtual path represented as a sequence of linear transformations of action descriptors. The linear transformations are selected discriminatively based on a measure of mutual information in a training set between the virtual views and class labels. This view-transfer mechanism operates under a variety of weakly supervised scenarios (matched source-target pairs, partial target labels without matched pairs, and no target labels or matches), which have previously been considered quite separately. In all cases, our performance compares to or improves upon the state of the art.

Appendix: Approximate Gradient Ascent on SO(D)
Consider the generic optimization problem $\max_{R \in SO(D)} J(RA)$. The steepest ascent direction is the gradient of $J$ with respect to $R$. The gradient on $SO(D)$ is defined as a vector $\nabla J \in so(D)$, where $so(D)$ is the associated Lie algebra, such that
$$\nabla J = \arg\max_{\xi \in so(D),\ \|\xi\| = 1} \frac{\partial J(A)}{\partial \xi}.$$
Here $\partial J(A)/\partial \xi$ is the directional derivative of $J$ along $\xi$. To find the optimal $\xi$, we express it as a linear combination of the basis axes of $so(D)$:
$$\xi = \sum_{i,j} c_{i,j}(E_{i,j} - E_{j,i}),$$
where we have employed the fact that $\{E_{i,j} - E_{j,i} : 1 \le i \le D-1,\ i+1 \le j \le D\}$ is a basis of $so(D)$. Consequently, the search for a gradient direction becomes
$$\nabla J = \arg\max_{c_{i,j}} \frac{\partial J(A)}{\partial\big(\sum_{i,j} c_{i,j}(E_{i,j} - E_{j,i})\big)}, \quad \text{s.t.}\ \sum_{i,j} c_{i,j}^2 = 1. \qquad (9)$$
We first approximate the directional derivative along a linear combination of basis axes by the linear combination of directional derivatives along the axes, i.e.,
$$\frac{\partial J(A)}{\partial\big(\sum_{i,j} c_{i,j}(E_{i,j} - E_{j,i})\big)} \approx \sum_{i,j} c_{i,j}\,\frac{\partial J(A)}{\partial(E_{i,j} - E_{j,i})} \approx \sum_{i,j} c_{i,j}\,\Delta J_{i,j}, \qquad \Delta J_{i,j} = \frac{J(\exp(\epsilon(E_{i,j} - E_{j,i}))A) - J(A)}{\epsilon},$$
in which $\epsilon$ is a small positive number. As a result, the optimization (9) has the closed-form solution $c_{i,j} = \Delta J_{i,j}$, up to the normalization required by the constraint.
We hence find an approximate gradient on $SO(D)$, namely $\nabla J$, and the final step is a line search along $\nabla J$ with step length $\delta$, i.e., over the values $J(\exp(n\delta\nabla J)A)$. By jointly considering $R_S$ and $R_T$ we reach the greedy axis rotation algorithm in Algorithm 1.

Figure 2. Cross-view action recognition accuracy on the IXMAS dataset compared with baselines from [2] when a varying fraction of samples are labeled in the target view. Note that our approach provides some discrimination even when no target labels are available (i.e., at 0%).

Figure 3. Cross-view action recognition accuracy on the IXMAS dataset for all three working modes (correspondence mode, partially labeled mode, unlabeled mode) with a varying number of virtual views and a varying dimension d for each virtual view.

Table 3. Cross-view action recognition accuracy on the IXMAS dataset with [11], non-discriminative virtual views (NDVV), and our approach, under all three working modes.

Table 5. Cross-view action recognition accuracy with multiple source views in partially labeled mode.