Subspace Projection Methods for Large Scale Image Data Analysis

Ashish Gupta and Alper Yilmaz
College of Engineering, Ohio State University, Columbus, Ohio 43210
Email: [email protected], [email protected]
[email protected]Abstract—Images have become the most popular type of mul- timedia in the Big Data era. Widely used applications like auto- matic CBIR underscore the importance of image understanding, especially in terms of the semantically meaningful information. Typically, high dimensional image descriptors are embedded to a subspace using a simple linear projection. However, semantic information has a complex distribution in feature space that requires a non-linear projection. We first estimate an intrinsic dimensionality of image data. Next we build a measure of visual information in embedded subspace. We compare several linear and non-linear projection methods. We use multiple image databases towards a comprehensive evaluation. We report results in terms of information content, consequent recognition rates, (a) (b) and computational cost. This paper is relevant for researchers interested in dimensionality reduction for large scale image Fig. 1: Typical subspace projection methods assume relevant understanding that is both quick and preserves semantically features have global extent and compute an optimal subspace relevant information. for them with a linear transformation between descriptors in I. I NTRODUCTION high dimensional space to the subspace. We pursue a multiple subspace model for visual categories, which have limited The volume of image data stored and shared across net- extent and different embedded subspaces. works has grown exponentially in the Big Data era. This trend is expected to continue and also extends to other types of multimedia like videos. Moreover, like textual information in the past, the visual content of an image is also being image can be associated with an appropriate visual category. A used as both structured query and retrieved information. These category itself is a node in a taxonomy of visual concepts. A conditions require both efficient data management and quick popular example is ImageNet [4], that is inspired by WordNet analysis of large scale image data. Towards data management, for textual data. Unlike textual data, the principal caveat the MapReduce framework designed by Google is very simple here is the large variation in feature description of visual to implement and very flexible in that it can be extended for objects associated with any category. The main reason is that various large-scale data processing functions. This framework any visual object is in turn composed of multiple parts [5], is a powerful tool to develop scalable parallel applications which are associated with different set of visual categories. to process big data on large clusters of computers. Towards In essence, every visual concept has a very complex feature image analysis, much effort has been made in the information description. Correspondingly, the set of descriptors spatially retrieval (IR) field to find lower-dimensional representation sampled from that image have a very complex distribution in of the original high-dimensional data, which enables efficient feature space. Based on the work in: [6] that describes sub- processing of a massive data set while preserving essential space clustering which is related to the concept of computing features [1]. Note that, image content needs to model com- a visual model on multiple subspaces; and [7], that introduces plex visual concepts, and therefore IR methods need to be a probabilistic subspace clustering approach that represents an semantically relevant. 
Probabilistic topic models proved to be image as a sparse combination of cluster exemplars embedded an effective way to organize large volumes of text documents in a union of subspaces, our hypothesis is that descriptors [2]. In natural language processing, a topic model refers from the same image occur in different regions of feature to a type of statistical model for representing a collection space and are embedded in different subspaces, depicted in of documents by discovering abstract topics. This paradigm Figure 1b. In order for IR methods to be effective, the lower- translates to modeling image data in terms of abstract visual dimensional representation should preserve the nature of the categories. Each of the plethora of visual objects in any complex distribution of an image. In other words, the subspace Scene−15 D−SIFT, 500 feature vectors of 128 dimensions 250 paper is focused on only linear methods and offers a sound 200 theoretical discussion, whereas we consider both linear and dimensions 150 non-linear methods with a thorough empirical evaluation. The 100 50 authors in [19] and [20] write an explanatory survey which 0 provides an overview of some linear and non-linear methods. feature vectors In comparison we explore more recent and effective non-linear (a) Sample of 500 D-SIFT image feature descriptors, each of 128 methods as well as a focus on visual concepts in image data. dimensions, from Scene-15 dataset. Our contributions in this paper are as follows. We note Information−Theoretic Co−Clustering of Scene−15 D−SIFT 500x128 into 10 row and 10 column clusters 250 that choice lower-dimensionality by practitioners, regardless 200 of projection method, is typically ad-hoc and large since dimensions 150 their rationale is reduce dimensionality by as little as possi- 100 ble and consider descriptor distribution in the entire feature 50 space to have a single source. We adopt the rationale of 0 descriptor distribution to have multi-region multi-subspace feature vectors sources, where the intrinsic dimensionality of any one visual (b) Reordered rows and columns of the 500 × 128 feature vector category is relatively small. Accordingly, we utilize multiple matrix of Scene-15 data shown in 2a intrinsic dimensionality estimation techniques to validate our Fig. 2: Visual concepts are embedded in multiple subspaces. hypothesis. Next we develop an entropy based measure for in- We used Information-theoretic co-clustering [3] to compute formation in embedded subspace, inspired by work in [21] that 10 block clusters. The columns are instances of the feature descriptors and rows the dimensions. In each block, rows explores methods for discovering ‘hidden’ structure in high- correspond to subspaces, whereas and columns correspond to dimensional data. The better a projection method, the more spatial clusters. The color-bar shows the SIFT value ranging from 0 to 255. These values here are inverted information it succeeds in preserving. We translate the mutual for the sake of visibility. separation between all descriptors into a probability distribu- tion and compute a R´enyi entropy of it1 . We evaluate several projection method should attempt to model visual concepts in linear and non-linear subspace projection methods based on feature space with different limited spatial extent and different a criteria of information content, classification performance, subspaces. This insight motivates our work in this paper. and computational time. 
Our objective is a method effective The use of PCA towards improve classification results as preserving visual concepts will simultaneously capable for in [8] popularized use of dimensionality reduction among being scaled to Big Data applications using images. researchers in computer vision. Due to success of Eigen- The rest of the paper is organized as follows. We discuss faces [9], standard practice in the community is to anal- estimation of intrinsic dimensionality of image descriptor data yse eigenvectors of the distribution of training descriptors in Section II. In Section III, we describe our measure of infor- to compute an Eigen-space of specified dimensionality [10]. mation in embedded subspace. We discuss various linear and Unlike this practise or arbitrarily selected lower-dimensions, non-linear projection methods in Section IV. Our evaluation in this paper we compute intrinsic lower-dimensionality of based on information measure, classification performance and image data. A popular survey on dimensionality reduction computational cost is described in Section V-B. We conclude techniques was made in [11]. This paper considers a small with of summary in Section VI. set of methods and is now dated written well before the arrival of Big Data problems. The use of Non-negative Matrix II. I NTRINSIC D IMENSIONALITY Factorization (NMF) for IR was made in [12]. Note that In [13], the researchers use linear discriminant projections PCA and SVD are special cases of NMF. The work in [13] to evaluate sub-manifolds ranging from 128 to 30 dimensions; analyzed a few subspace projection methods, but their focus and report best comparative performance for 30 dimensions. was limited to evaluating method of discriminative projections Besides seeming incomplete, their estimate is not based on an versus principal components. In addition, they do not report inherent dimensionality leading to issues of training set bias. on computational cost, which is important for practitioners Since, we are interested in Big Data, we adopt a principled in Big Data and a major part the evaluation criteria in approach to the estimation of intrinsic dimensionality, that can this paper. While simple and fast linear methods like PCA be scaled to arbitrarily large dataset size. We consider the are traditional, subsequently developed non-linear projection geometry of descriptor distributions by four different methods methods like Isomap [14], LLE [15] have demonstrated better to understand variance in estimated dimensionality across performance, but their proof was initially restricted to small visual concepts to finally compute the most likely estimate toy problems in machine learning. Nevertheless, these methods across all datasets. suggested that relevant information regarding a visual concept 1 The R´enyi divergence of order α or α-divergence of a distribution P from exists in spatially local structurs in the distribution of feature ! 1 Pn pα descriptors [16] and the importance of the preservation of these a distribution Q is defined to be Dα (P kQ) = α−1 log i=1 α−1 qi i , structures when embedding in the sub-manifold [17]. More when 0 < α < ∞ and α 6= 1. In particular the limit α → 1 gives the recently, an interesting related paper is [18]. However, this Kullback-Leibler divergence. (a) Pascal VOC 2007 (b) Scene-15 (c) Caltech-101 Fig. 3: Samples of visual categories in popular image databases used in this paper. A. Correlation Dimension C. 
A. Correlation Dimension

The intuition behind the correlation dimension estimator is that the number of feature descriptors in a hyper-sphere of radius r is proportional to r^p, where p is the estimated dimensionality. Following [22], it computes the relative number of feature vectors C(r) that lie within a hyper-sphere of radius r:

C(r) = \frac{2}{n(n-1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} \mathbb{1}_r, \qquad \mathbb{1}_r = \begin{cases} 1 & \text{if } \|z_i - z_j\| \le r \\ 0 & \text{otherwise} \end{cases}  (1)

Since C(r) is proportional to r^p, it can be used to estimate the dimensionality as p = \lim_{r \to 0} \frac{\log C(r)}{\log r}. Since the limit cannot be solved explicitly, its value is approximated by computing C(r) for two values of r. The approximate dimensionality \hat{p}_{CorrDim} is given by the ratio

\hat{p}_{CorrDim} = \frac{\log C(r_2) - \log C(r_1)}{\log r_2 - \log r_1}  (2)

B. Maximum Likelihood Estimate

The Maximum Likelihood Estimate counts the number of feature descriptors {z_1, ..., z_n} covered by a hyper-sphere with a growing radius r, and models this number as a Poisson process. Based on [23], the relation between the rate of the Poisson process λ(t) and the intrinsic dimensionality p is

\lambda(t) = \frac{f(z)\, \pi^{p/2}\, p\, t^{p-1}}{\Gamma(1 + p/2)}  (3)

in which f(z) is the sampling density. The intrinsic dimensionality p around z_i, given its k nearest neighbors, is

\hat{p}_k(z_i) = \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \log \frac{T_k(z_i)}{T_j(z_i)} \right]^{-1}  (4)

in which T_k(z_i) represents the radius of the smallest hyper-sphere with center z_i that covers k neighboring data points. The estimate of the intrinsic dimensionality \hat{p}_{MLE} of data Z is obtained by averaging over the n local estimates:

\hat{p}_{MLE} = \frac{1}{n} \sum_{i=1}^{n} \hat{p}_k(z_i)  (5)

C. Normalized Eigenvalue

Eigenvalue based intrinsic dimensionality estimation was proposed in [24]. After computing PCA on the data Z, we look at the normalized eigenvalues. The intrinsic dimension is determined by the number of these eigenvalues that are greater than a given small threshold value ε.

D. GMST Estimate

The Geodesic Minimum Spanning Tree (GMST) estimate is based on the observation that the length function of a geodesic minimum spanning tree depends on the intrinsic dimensionality p [25]. It is the minimum spanning tree of the neighborhood graph defined on the data Z. Similar to Isomap, the GMST estimator constructs a neighborhood graph G on the data Z, in which every data point z_i is connected with its k nearest neighbors z_{ij}. The geodesic minimum spanning tree T is defined as the minimal graph over Z, with length L(Z) = \min_{T \in \mathcal{T}} \sum_{e \in T} g_e, where \mathcal{T} is the set of all sub-trees of graph G, e is an edge, and g_e is the Euclidean distance corresponding to edge e. In GMST estimation, a number of subsets A ⊂ Z are constructed with various sizes m. Next the lengths L(A) of the GMSTs of the subsets A are computed. Theoretically, the ratio log L(A)/log m is linear and is approximated by a function of the form y = ax + b. The estimated intrinsic dimensionality of Z is \hat{p}_{GMST} = \frac{1}{1-a}.
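To make the estimators above concrete, the following is a minimal NumPy/SciPy sketch of two of them: the two-scale correlation dimension of Eqs. (1)-(2) and the normalized-eigenvalue rule of Section II-C. The radii r1, r2 and the threshold eps are free parameters of this sketch; the paper does not state the values it used.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(Z, r1, r2):
    """Two-scale estimate of Eq. (2): (log C(r2) - log C(r1)) / (log r2 - log r1)."""
    d = pdist(Z)                       # all n(n-1)/2 pairwise Euclidean distances
    C = lambda r: np.mean(d <= r)      # relative pair count C(r) of Eq. (1)
    return (np.log(C(r2)) - np.log(C(r1))) / (np.log(r2) - np.log(r1))

def eigenvalue_dimension(Z, eps=1e-2):
    """Section II-C: count normalized PCA eigenvalues above a small threshold."""
    Zc = Z - Z.mean(axis=0)
    evals = np.linalg.eigvalsh(np.cov(Zc, rowvar=False))
    evals = evals / evals.sum()        # normalize the spectrum
    return int(np.sum(evals > eps))

# Toy check: a 3-D subspace embedded linearly in 128 dimensions.
rng = np.random.default_rng(0)
Z = rng.normal(size=(2000, 3)) @ rng.normal(size=(3, 128))
d = pdist(Z)
r1, r2 = np.quantile(d, [0.1, 0.3])    # the radii are a choice; estimates are sensitive to it
print(eigenvalue_dimension(Z), correlation_dimension(Z, r1, r2))
```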
E. Image Databases

We selected several popular databases that are frequently referred to in the computer vision literature. They vary in the number of labeled visual categories and the number of training and validation images. In our experiments we use the protocol associated with these datasets in terms of train, validation, and test image partitioning. The databases utilized in this paper are Caltech-101 [26], Caltech-256 [27], Scene-15 [28], VOC-2006, VOC-2007, and VOC-2010 [29]. Details on these databases can be found in the cited references. An illustrative sample of visual categories in Scene-15 and VOC-2007 is shown in Figure 3.

Fig. 3: Samples of visual categories in popular image databases used in this paper. (a) Pascal VOC 2007 (b) Scene-15 (c) Caltech-101.

F. Empirical Results on Dimensionality

In this experiment we report results on the Caltech-101, Caltech-256, Scene-15, VOC-2006, VOC-2007, and VOC-2010 databases. Dense SIFT (D-SIFT) descriptors were computed on 8 × 8 pixel patches with a step size of 4 pixels. For Scene-15, in accordance with [28], 100 images from each category are used as the training set and the rest are used for testing. The train, validate, and test image sets are provided by the publishers of the VOC-2006, VOC-2007, and VOC-2010 datasets [30]; in these experiments, the training set was used for training and the validate set for testing. For Caltech-101 and Caltech-256, the training set was 30 randomly selected images, according to [27], and the remaining images were used for testing. A random sample of 10000 descriptors from the images in the training set is utilized in a 10-fold estimation routine. The results are shown in Figure 5, where the mean estimated intrinsic dimensionality for all databases is in the neighborhood of 14. The variance in dimensionality, depicted by the box-plot, is indicative of the difference between visual categories.

Fig. 4: Intrinsic dimensionality of visual categories in the Caltech-101 image database. Typically, conceptually simpler categories have a lower dimensionality.

Fig. 5: Intrinsic dimensionality of visual categories estimated using the CorrDim, EigenValue, MLE, and GMST estimates. The box-plot shows the mean and variance across categories in each dataset. The variance here reflects the difference in the structure of the feature descriptor distribution across categories. The mean estimate observed across datasets is largely similar.
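The 10-fold estimation routine described above can be sketched as follows, under the assumption that `pool` holds a category's full set of D-SIFT descriptors and `estimate_dim` is any of the four estimators of Sections II-A to II-D (for instance the correlation-dimension sketch earlier, with its radii fixed beforehand).

```python
import numpy as np

def folded_estimate(pool, estimate_dim, n_sample=10000, n_folds=10, seed=0):
    """Average an intrinsic-dimension estimator over random descriptor sub-samples."""
    rng = np.random.default_rng(seed)
    est = [estimate_dim(pool[rng.choice(len(pool),
                                        size=min(n_sample, len(pool)),
                                        replace=False)])
           for _ in range(n_folds)]
    return float(np.mean(est)), float(np.std(est))  # mean and spread, as in Fig. 5
```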
III. SUBSPACE INFORMATION MEASURE

We formulate the mutual separation between embedded descriptors as a probability distribution, by normalizing a histogram of all pair-wise distances, and then compute the Rényi entropy of this probability distribution. We motivate this measure using the illustrative example shown in Figure 6. Consider an isometric distribution. It arguably lacks 'structure', in the sense that any two randomly selected subsets of the data will exhibit identical mutual separation statistics. A distribution with 'structure' deviates from such an isometric distribution. This property can be estimated by measuring the diversity in the set of all pair-wise distances; we consider this measure of diversity to correlate with a measure of 'structure' in the descriptor distribution. For example, consider the 'swiss'-roll in Figure 6a and the 'intersect'-loop in Figure 6b, wherein the structure is readily apparent. The third data distribution, in Figure 6c, is a random sample of 1000 PCA-embedded D-SIFT feature vectors from images labelled 'car' in the VOC-2006 database. A set of n(n−1)/2 pair-wise distances is computed over the n = 500 data points using the Euclidean distance metric. Subsequently, the Rényi entropy with α = 2 is computed for the normalized histogram (with 100 bins) of each distribution. In this experiment, the 'intersect'-loop distribution with Hα = −19.31 is estimated to have more structure than the 'swiss'-roll distribution with Hα = −25.33, and both have more than the PCA-embedded 'VOC-2006:car' distribution with Hα = −33.03. Rényi entropy thus appears to successfully encode 'structure' in a descriptor distribution.

Fig. 6: Rényi entropy reflects 'structure' in descriptor distribution. (a) 'Swiss' synthetic data (b) 'Intersect' synthetic data (c) 'VOC-2006:car' data.

Rényi entropy was introduced as a generalization of Shannon entropy. It was developed to be the most general type of information measure that preserves the additivity of statistically independent systems and is compatible with Kolmogorov's axioms of probability [31], [32]. For a discrete probability distribution P = {p_1, p_2, ..., p_n} satisfying \sum_{i=1}^{n} p_i = 1 and p_i ≥ 0, ∀ 1 ≤ i ≤ n, the Rényi entropy of order α is defined as

H_\alpha(x) = \frac{1}{1-\alpha} \log \sum_{i=1}^{n} p_i^\alpha  (6)

In the limit α → 1, Rényi entropy reduces to Shannon entropy.
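A minimal implementation of Eq. (6), with the Shannon limit recovered as α → 1, can be written as follows.

```python
import numpy as np

def renyi_entropy(p, alpha=2.0):
    """Rényi entropy of a discrete distribution p, Eq. (6)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # empty bins contribute nothing
    if np.isclose(alpha, 1.0):            # limit alpha -> 1: Shannon entropy
        return float(-np.sum(p * np.log(p)))
    return float(np.log(np.sum(p ** alpha)) / (1.0 - alpha))

# A uniform distribution maximizes H_alpha: for 100 equal bins,
# H_2 = log(100) ~= 4.605, and any added 'structure' lowers it.
print(renyi_entropy(np.full(100, 0.01)))
```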
IV. SUBSPACE PROJECTION METHODS

Canonical dimensionality reduction seeks to project a p-dimensional feature descriptor vector z = [z_1, ..., z_p]^T to a lower dimensional representation s = [s_1, ..., s_k]^T with k ≤ p, based on some criterion. Typically, the criterion is preservation of some geometric structure, which can be global or localized to the neighborhood of z. The technique may utilize a linear or a non-linear projection function to compute s. Based on these distinctions, the techniques considered here are grouped into traditional linear, global non-linear, local non-linear, and variants of local non-linear techniques.

A. Global Linear Techniques

1) Principal Component Analysis: PCA [8] constructs a low-dimensional representation of the descriptors that describes as much of their variance as possible, projecting the data onto the computed eigenvectors with the greatest variance. The key benefits of PCA are its ease of implementation and speed of computation.

B. Global Non-linear Techniques

These attempt to preserve global properties of the descriptor distribution while constructing non-linear projections.

1) Multi Dimensional Scaling: MDS [33] seeks an embedding into a lower dimensional space such that distances between pairs of feature descriptors are preserved. The quality of the mapping is expressed in a 'stress function', a measure of the error between the pairwise distances in the low- and high-dimensional representations. Given n data vectors, let the distance between z_i and z_j ∈ R^p be δ_{i,j}. The goal of MDS is to find vectors {s_1, ..., s_n} in R^k, given Δ, such that \|s_i - s_j\| ≈ δ_{i,j} ∀ z_i, z_j ∈ Z.

2) Stochastic Proximity Embedding: SPE is an iterative algorithm that generates low-dimensional Euclidean embeddings [34]. It attempts to preserve similarities between 'related' descriptors by minimizing the 'stress function' used in MDS (Section IV-B1). It uses an iterative pair-wise refinement strategy, which attempts to optimize the trade-off between preservation of local geometry and minimization of separation between distant vectors.

3) Isomap: Isomap [14] attempts to preserve pair-wise geodesic distance, the distance between two points on a manifold, which distinguishes it from methods that preserve pair-wise Euclidean distances. The geodesic distances between feature vectors {z_1, ..., z_n} are computed by constructing a neighborhood graph G, in which every feature vector z_i is connected with its k nearest neighbors z_{ij}. Dijkstra's algorithm is utilized to compute the shortest distance between two nodes on G, which is a satisfactory approximation of the geodesic distance between the corresponding feature vectors.

4) Diffusion Maps: Diffusion maps are based on defining a Markov random walk on the graph of the descriptors. The proximity of descriptors is inferred from the aggregate results of random walk trials on the graph between the nodes corresponding to these descriptors. The objective is retention of pair-wise diffusion distances between embedded descriptors [35]. Since the diffusion distance is based on multiple paths through the graph, it is comparatively more robust to noise than the geodesic distance.

5) Kernel PCA: Kernel PCA computes the principal eigenvectors of the kernel matrix, instead of the covariance matrix as in traditional PCA [36]. The application of PCA in kernel space gives Kernel PCA the ability to construct non-linear mappings.

C. Local Non-linear Techniques

Local non-linear embedding techniques aim to preserve the distribution properties of local neighbourhoods around feature vectors.

1) Locally Linear Embedding: LLE is similar to Isomap in that it constructs a graph, but unlike Isomap it can embed to a non-convex manifold [15]. Local distribution properties of the manifold are expressed by writing each feature vector as a linear combination of its nearest neighbors. The criterion of LLE is retention of the 'reconstruction' weights in these linear combinations. The objective of LLE is to find low-dimensional global co-ordinates for feature vectors that lie on or near a manifold in high dimensional space.

2) Laplacian Eigenmaps: Laplacian Eigenmaps is a geometrically inspired embedding technique by [37]. The objective is minimization of the distance between each embedded descriptor and its k nearest neighbors, mapped to a graph structure, by solving an eigenvector problem set up using the graph's Laplacian.

D. Variants of Local Non-linear Techniques

1) Locality Preserving Projection: Locality Preserving Projection (LPP) [17] combines aspects of global linear and local non-linear embedding techniques, computing a linear mapping that minimizes the cost function of the non-linear Laplacian Eigenmaps technique. It builds a graph G using the neighborhood information of the feature vectors. The edge weights of G are

w_{ij} = \exp\left( -\frac{\|z_i - z_j\|^2}{2\sigma^2} \right)  (7)

Next, LPP solves the generalized eigenvalue problem

(Z - \bar{Z})^T L (Z - \bar{Z})\, \upsilon = \lambda\, (Z - \bar{Z})^T M (Z - \bar{Z})\, \upsilon  (8)

where L is the graph Laplacian and M is the degree matrix. The eigenvectors υ_i corresponding to the k smallest non-zero eigenvalues form the columns of a linear mapping T that minimizes the Laplacian Eigenmap cost function [17]. The embedded vectors are expressed as Y = (Z - \bar{Z}) T.
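To make Eqs. (7)-(8) concrete, below is a compact NumPy/SciPy sketch of LPP: heat-kernel weights on a k-nearest-neighbor graph, followed by a generalized eigenproblem. The neighborhood size k, the kernel width sigma, and the small ridge that keeps the right-hand matrix positive definite are assumptions of this sketch, not values from the paper.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import pdist, squareform

def lpp(Z, n_components=14, k=12, sigma=1.0, ridge=1e-6):
    """Locality Preserving Projection, following Eqs. (7)-(8)."""
    Zc = Z - Z.mean(axis=0)                     # center the data, (Z - Zbar)
    D2 = squareform(pdist(Zc, 'sqeuclidean'))   # squared pairwise distances
    W = np.exp(-D2 / (2.0 * sigma ** 2))        # heat-kernel weights, Eq. (7)
    # keep only each point's k nearest neighbors, symmetrized
    keep = np.zeros_like(W, dtype=bool)
    nn = np.argsort(D2, axis=1)[:, 1:k + 1]     # column 0 is the point itself
    keep[np.repeat(np.arange(len(Zc)), k), nn.ravel()] = True
    W = np.where(keep | keep.T, W, 0.0)
    M = np.diag(W.sum(axis=1))                  # degree matrix
    L = M - W                                   # graph Laplacian
    A = Zc.T @ L @ Zc                           # left-hand side of Eq. (8)
    B = Zc.T @ M @ Zc + ridge * np.eye(Zc.shape[1])
    _, vecs = eigh(A, B)                        # generalized eigenproblem, Eq. (8)
    T = vecs[:, :n_components]                  # smallest eigenvalues first
    return Zc @ T, T                            # embedded vectors Y = (Z - Zbar) T

Y, T = lpp(np.random.default_rng(0).normal(size=(500, 128)))
print(Y.shape)                                  # (500, 14)
```

Once T is learned, embedding a new descriptor is a single matrix product, which is why LPP is fast in the testing phase.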
2) Neighborhood Preserving Embedding: Similar to LPP, Neighbourhood Preserving Embedding (NPE) [38] minimizes a cost function typical of a non-linear technique, but uses a linear mapping. NPE can be considered a linear approximation to LLE.

3) Probabilistic Principal Component Analysis: Probabilistic Principal Component Analysis [39] is a probabilistic model for PCA which combines local PCA models within the framework of a probabilistic mixture, in which all the parameters are determined by maximum likelihood using an EM algorithm.

4) Landmark Isomap: Landmark Isomap [40] is an extension of Isomap. An issue with Isomap is the costly global computation utilizing all feature vectors Z. Landmark Isomap approximates this global computation by a much smaller set of computations, using a small subset of the data called landmark points.

5) Stochastic Neighborhood Embedding: Stochastic Neighborhood Embedding (SNE) [41] approximates a probability distribution of descriptors in high-dimensional space with a probability distribution in the embedded space.

6) Symmetric Stochastic Neighbor Embedding: SymSNE [?] is a variation of Stochastic Neighbor Embedding.

7) t-SNE: An improvement over SNE, t-SNE [42] avoids the 'crowding' issue, where moderately dissimilar feature vectors are huddled together in the embedding, by using a heavy-tailed Student t-distribution that allows moderate distances in the high-dimensional space to be projected to much larger distances in the embedding.

V. EXPERIMENTS AND RESULTS

We describe our experiments to evaluate subspace projection methods in terms of embedded information content, classification performance, and finally computational time cost. In these experiments we used the Scene-15, VOC-2006, VOC-2007, and VOC-2010 databases. The descriptor used is D-SIFT. Training and test data for all the datasets are generated in a similar manner to that described in the previous experiments.

A. Information measure of subspace methods

We use a random sample of 500 embedded descriptors of each image. Pair-wise distances are computed using the Minkowski metric², a generalization of the Euclidean metric, selected in part due to concern over the reliability of the Euclidean metric in higher dimensions, as discussed in [43]. The parameter m is, in this experiment, assigned the value of the estimated intrinsic dimensionality of the visual category to which the image belongs. A normalized histogram of 100 bins is computed, followed by the Rényi entropy with α = 2. The average entropic value of all visual categories for each database is listed in Table I. To compare the subspace methods, an aggregate entropic measure over all datasets is listed in the rightmost column. The variance in entropy indicates that visual categories have inherently different local descriptor distribution structures.

²The Minkowski metric d_M(\cdot, \cdot), with parameter m, between two vectors k, l ∈ R^p, is d_M(k, l) = \left( \sum_{j=1}^{p} |k_j - l_j|^m \right)^{1/m}.

Note that LPP has a mean entropic measure of −4.54, which is promising in comparison to PCA at −24.43. Though both methods use a linear embedding, LPP uses local structure, instead of the global structure used by PCA. Evidently, the global embedding of PCA loses more relevant information than the local embedding of LPP.

TABLE I: Rényi entropy of the normalized distribution density of pair-wise distances in the embedded space, by subspace method and database.

Method         Scene15   VOC2006   VOC2007   VOC2010   Average
Diff. Maps      -14.55    -14.56    -14.49    -14.52    -14.55
Factor Anal.    -33.53    -33.39    -33.44    -33.47    -33.47
Isomap          -26.92    -27.25    -27.24    -27.26    -27.11
L. Isomap       -26.11    -26.40    -26.41    -26.42    -26.29
LLE              -9.35     -8.16     -8.05     -8.06     -8.69
LPP              -5.23     -3.69     -3.69     -3.69     -4.54
MDS             -23.82    -23.69    -23.76    -23.77    -23.79
NPE              -9.35     -8.32     -8.32     -8.33     -9.10
PCA             -24.45    -24.40    -24.44    -24.44    -24.43
ProbPCA         -15.33    -15.38    -15.37    -15.37    -15.33
SPE             -11.45    -11.41    -11.48    -11.49    -11.51
SymSNE          -16.12    -16.04    -16.07    -16.08    -16.10
tSNE            -19.53    -19.37    -19.46    -19.44    -19.45
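The measurement protocol of Section V-A can be sketched end-to-end. Here `embedded` is assumed to hold the 500 embedded descriptors sampled from one image, and m the estimated intrinsic dimensionality of its category (around 14, per Section II-F).

```python
import numpy as np
from scipy.spatial.distance import pdist

def subspace_information(embedded, m=14, bins=100, alpha=2.0):
    """Rényi entropy of the normalized histogram of pairwise Minkowski distances."""
    d = pdist(embedded, 'minkowski', p=m)   # all n(n-1)/2 pairwise distances
    hist, _ = np.histogram(d, bins=bins)
    prob = hist / hist.sum()                # normalize into a distribution
    prob = prob[prob > 0]
    return float(np.log(np.sum(prob ** alpha)) / (1.0 - alpha))
```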
B. Classification Performance Analysis

In this experiment we consider the Scene-15, VOC-2006, VOC-2007, and VOC-2010 databases. We first learn a visual model using randomly sampled images from the training set. The model is computed using sparse coding [44] with the ℓ1-norm and a regularization parameter of λ = 10; the training routine is run for a maximum of 36,000 iterations. The regularization parameter was determined empirically by a grid search over λ ∈ {10^{-4}, 10^{-3}, ..., 10^{3}, 10^{4}}, and the maximum number of iterations is set sufficiently high to allow the optimization routine to converge within satisfactory tolerance. A similar parameter set is used in image encoding, with λ = 10. We use an SVM with an RBF kernel; the performance is reported in Table II, which shows the mAP for each database, averaged across all of its visual categories. LPP achieves the best average performance.

TABLE II: Comparative analysis of classification performance, measured in mAP, for the subspace methods aggregated over all the visual categories in each of the databases considered here.

Method         Scene15   VOC-2006   VOC-2007   VOC-2010   Average
Diff. Maps       61.79      69.18      68.91      69.05     67.35
Factor Anal.     62.68      70.38      68.86      68.44     67.54
Isomap           64.04      71.84      68.95      64.09     66.77
LLE              58.04      70.21      68.15      68.13     66.13
LPP              68.21      73.68      72.75      69.14     70.74
MDS              59.77      72.58      71.49      70.32     68.59
NPE              64.94      72.68      73.15      68.85     69.86
PCA              64.80      70.97      70.21      69.88     68.98
ProbPCA          62.77      70.91      71.80      69.51     68.87
SPE              63.13      64.59      70.15      69.70     67.53
SymSNE           63.14      69.91      69.94      68.83     68.03
tSNE             56.58      63.91      69.16      69.10     65.43

The computational time for all the techniques is shown in Figure 7. Some of the non-linear techniques used several orders of magnitude more time than the linear methods; for effective visualization the y-axis is on a log scale. The error bars denote the variation across categories, showing the minimum and maximum time used. The traditional linear methods PCA, LDA, and MDS are the fastest, as expected. The important outcome is that LPP, which performed the best, is also faster than the other non-linear methods. This makes LPP stand out as a valid candidate to replace PCA in future.

Fig. 7: Comparative look at the computation time, in log(sec.), of the various techniques.
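A hedged sketch of the classification pipeline of Section V-B follows. The paper uses the online dictionary learning of [44] in Matlab; scikit-learn's MiniBatchDictionaryLearning is used here as a stand-in, and the dictionary size, the max-pooling of per-descriptor codes into an image vector, and the untuned RBF-SVM are assumptions of this sketch rather than the paper's exact settings.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.svm import SVC

def train_pipeline(sampled_descriptors, image_descriptor_sets, labels):
    """Learn a sparse-coding dictionary, encode images, train an RBF-SVM."""
    dico = MiniBatchDictionaryLearning(n_components=1024, alpha=10,
                                       transform_algorithm='lasso_lars',
                                       transform_alpha=10, random_state=0)
    dico.fit(sampled_descriptors)       # dictionary from sampled training descriptors
    # encode each image: l1 sparse codes of its descriptors, max-pooled (an assumption)
    X = np.array([np.abs(dico.transform(d)).max(axis=0)
                  for d in image_descriptor_sets])
    clf = SVC(kernel='rbf').fit(X, labels)
    return dico, clf
```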
C. Computational Time Cost

In this experiment, we report results on the VOC-2006, VOC-2007, and VOC-2010 databases. The time taken to learn a subspace projection function for 10000 randomly selected feature vectors from the training images of each dataset, for each subspace method, is reported in Table III. The experiment was run using Matlab scripts [45] on 3.0 GHz Intel processors with 48 GB of RAM. The average time across all databases is shown in the Average column. As expected, the linear methods, MDS and PCA, are found to be the fastest. Of interest here is the cost of LPP, which is well below the other methods of its class that preserve local structure. Although it is more complex than PCA, it is a candidate for offline systems and is more efficient than newer methods like SPE and t-SNE, which are orders of magnitude more costly. In addition, LPP uses a linear projective matrix, which makes it very fast in operation once it has been trained.

TABLE III: Relative computational time cost, in seconds, of the subspace projection methods for the VOC-2006, VOC-2007, and VOC-2010 databases.

Subspace Method    VOC2006   VOC2007   VOC2010   Average
PCA                  0.444     0.357     0.224     0.233
LDA                  0.990     0.948     0.917     0.643
MDS                  0.211     0.215     0.214     0.148
ProbPCA                861       875       880       611
FactorAnalysis         113       116       111      79.8
Isomap                7827      7987      7799      5834
LandmarkIsomap         217       222       223       208
LLE                    813       890       757       617
DiffusionMaps         4926      5130      4881      3783
KernelPCA            52729     50741     68566     39736
SymSNE              126050    188320    179230    142233
tSNE                  6397     28810     27941     15407
NPE                    123       997       948       505
LPP                   31.5       268       213       132
SPE                  13142     22524     23141     16038

VI. SUMMARY

In this paper, we considered subspace projection methods for high-dimensional and large scale image data. We focused on developing a method that best preserves semantically relevant information, which we argued is embedded in different subspaces with varying local spatial extent. The hierarchical composition of a visual category in terms of other visual categories, along with the properties of local patch image descriptors, supports our interpretation of the semantic information in images. Instead of an ad-hoc lower dimensionality, we computed an intrinsic dimensionality using multiple approaches on several databases. The mean and variance of these results indicate that image data has a much lower dimensionality than has typically been used in the literature. We developed an information measure for the descriptor distribution in the embedded subspace and used it to compare several subspace projection methods. Since researchers and practitioners overwhelmingly use simple linear methods, our main objective was to build upon our intuition of a multi-subspace, local-extent model of semantic information distribution in feature space. Consequently, we focused on non-linear methods that preserve the mutual separation of descriptors in close proximity while ignoring the mutual separation of descriptors far from each other. Their comparison with linear methods like PCA was interesting, since those methods consider the global distribution.

Since this work is intended for Big Data, the computational complexity of each method and its ability to scale to large data was also important. Accordingly, we also analyzed the relative computational time of all the methods. A global linear method like PCA took 0.23 seconds, in comparison to a global non-linear method like Kernel PCA which took 39736.6 seconds. Regardless of their performance benefits, when applied to large scale learning problems, non-linear embedding methods are prohibitively slow. However, LPP took 132.5 seconds, which, though higher than PCA, is much lower than other embedding techniques with comparable performance.

In conclusion, we have shown that semantically meaningful large scale image analysis requires a subspace projection method that emphasizes preservation of the geometry of the distribution of feature descriptors in close mutual proximity in the training phase, in order to learn a linear projection that is quick in the testing phase. Future work will consider other forms of multimedia, like video.
REFERENCES

[1] D. Markonis, R. Schaer, I. Eggel, H. Muller, and A. Depeursinge, "Using MapReduce for Large-Scale Medical Image Analysis," in IEEE Second International Conference on Healthcare Informatics, Imaging and Systems Biology, 2012.
[2] D. M. Blei, A. Y. Ng, M. I. Jordan, and J. Lafferty, "Latent Dirichlet Allocation," Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
[3] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. S. Modha, "A generalized maximum entropy approach to Bregman co-clustering and matrix approximation," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '04), New York, 2004, pp. 1–6.
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
[6] R. Vidal, "Subspace Clustering," IEEE Signal Processing Magazine, pp. 52–68, Mar. 2011.
[7] A. Adler and M. Elad, "Probabilistic Subspace Clustering via Sparse Representations," in ICML 2012 Workshop on Sparsity, Dictionaries and Projections in Machine Learning and Signal Processing, Edinburgh, 2012, pp. 1–4.
[8] I. T. Jolliffe, Principal Component Analysis, 2nd ed. New York: Springer, 2002.
[9] M. Turk and A. Pentland, "Face recognition using eigenfaces," in IEEE Conference on Computer Vision and Pattern Recognition, Maui, HI, 1991, pp. 586–591.
[10] Y. Ke and R. Sukthankar, "PCA-SIFT: A more distinctive representation for local image descriptors," in IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, 2004, pp. 506–513.
[11] I. K. Fodor, "A survey of dimension reduction techniques," Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Tech. Rep., 2002.
[12] S. Tsuge and M. Shishibori, "Dimensionality Reduction using Non-Negative Matrix Factorization for Information Retrieval," in IEEE International Conference on Systems, Man, and Cybernetics, vol. 2, 2001, pp. 960–965.
[13] H. Cai, K. Mikolajczyk, and J. Matas, "Learning linear discriminant projections for dimensionality reduction of image descriptors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 338–352, 2011.
[14] J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
[15] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[16] Y. Sun, S. Todorovic, and S. Goodison, "Local-learning-based feature selection for high-dimensional data analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1610–1626, 2010.
[17] X. He, S. Yan, Y. Hu, and H.-J. Zhang, "Learning a locality preserving subspace for visual recognition," in IEEE International Conference on Computer Vision, 2003, pp. 385–392.
[18] J. P. Cunningham and Z. Ghahramani, "Linear Dimensionality Reduction: Survey, Insights, and Generalizations," Journal of Machine Learning Research, vol. 16, pp. 2859–2900, 2015.
[19] C. O. S. Sorzano, J. Vargas, and A. P. Montano, "A survey of dimensionality reduction techniques," arXiv preprint arXiv:1403.2877, 2014.
[20] A. Sarveniazi, "An Actual Survey of Dimensionality Reduction," American Journal of Computational Mathematics, vol. 4, no. 2, pp. 55–72, 2014.
[21] D. Scott and S. Sain, "Multidimensional Density Estimation," Handbook of Statistics, vol. 24, pp. 229–261, 2005.
[22] F. Camastra and A. Vinciarelli, "Estimating the Intrinsic Dimension of Data with a Fractal-Based Method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 10, pp. 1404–1407, 2002.
[23] E. Levina and P. J. Bickel, "Maximum Likelihood Estimation of Intrinsic Dimension," in Neural Information Processing Systems, 2005, pp. 777–784.
[24] F. Camastra, "Data dimensionality estimation methods: a survey," Pattern Recognition, vol. 36, no. 12, pp. 2945–2954, 2003.
[25] B. Poczos and A. Lorincz, "Independent Subspace Analysis Using Geodesic Spanning Trees," in International Conference on Machine Learning, Bonn, 2005, pp. 1–8.
[26] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," Computer Vision and Image Understanding, vol. 106, no. 1, pp. 59–70, 2007.
[27] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," California Institute of Technology, Tech. Rep. 7694, 2007.
[28] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories," in IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 2169–2178.
[29] M. Everingham, L. Van Gool, C. K. I. Williams, and J. Winn, "The PASCAL Visual Object Classes (VOC) Challenge," International Journal of Computer Vision, vol. 88, no. 2, pp. 303–338, 2010.
[30] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL Visual Object Classes Challenge 2010 (VOC2010) Results," http://www.pascal-network.org/challenges/VOC/voc2010/workshop/index.html, 2010.
[31] A. Rényi, "On Measures of Information and Entropy," in 4th Berkeley Symposium on Mathematics, Statistics and Probability, Berkeley, 1960, pp. 547–561.
[32] J. C. Principe and D. Xu, "Information-theoretic learning using Renyi's quadratic entropy," in Proceedings of the First International Workshop on Independent Component Analysis and Signal Separation, 1999, pp. 407–412.
[33] T. F. Cox and M. A. A. Cox, Multidimensional Scaling, 2nd ed. Chapman and Hall/CRC, 2001.
[34] D. K. Agrafiotis, "Stochastic Proximity Embedding," Journal of Computational Chemistry, vol. 24, pp. 1215–1221, 2003.
[35] S. Lafon and A. B. Lee, "Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1393–1403, 2006.
[36] H. Hoffmann, "Kernel PCA for novelty detection," Pattern Recognition, vol. 40, no. 3, pp. 863–874, 2007.
[37] M. Belkin and P. Niyogi, "Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering," in Neural Information Processing Systems. MIT Press, 2002, pp. 585–591.
[38] X. He, S. Yan, and H.-J. Zhang, "Neighborhood Preserving Embedding," in IEEE International Conference on Computer Vision, 2005, pp. 1–6.
[39] M. E. Tipping and C. M. Bishop, "Mixtures of probabilistic principal component analyzers," Neural Computation, vol. 11, no. 2, pp. 443–482, 1999.
[40] V. de Silva and J. B. Tenenbaum, "Global Versus Local Methods in Nonlinear Dimensionality Reduction," in Neural Information Processing Systems, 2003, pp. 721–728.
[41] G. Hinton and S. Roweis, "Stochastic Neighbor Embedding," in Neural Information Processing Systems, 2002, pp. 833–840.
[42] L. van der Maaten and G. Hinton, "Visualizing Data using t-SNE," Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
[43] C. C. Aggarwal, A. Hinneburg, and D. A. Keim, "On the Surprising Behavior of Distance Metrics in High Dimensional Space," in Database Theory - ICDT 2001, LNCS vol. 1973. Springer, 2001, pp. 420–434.
[44] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online Learning for Matrix Factorization and Sparse Coding," Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.
[45] L. J. P. van der Maaten, "An Introduction to Dimensionality Reduction Using Matlab," Universiteit Maastricht, Maastricht, Netherlands, Tech. Rep., 2007.