III.      Kernel Feature Extraction



       A kernel matrix $K$ is a Hermitian, positive semi-definite matrix that computes the inner products between any finite sequence of inputs $\{x_1, \ldots, x_n\}$ in a mapped feature space, and is defined as

$$K_{ij} = k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle.$$


 Commonly used kernels are [34]:


·              The linear kernel: $k(x, y) = \langle x, y \rangle$

·              The polynomial kernel: $k(x, y) = (\langle x, y \rangle + c)^d$

·              The hyperbolic tangent kernel: $k(x, y) = \tanh(\kappa \langle x, y \rangle + c)$

·              The ANOVA RB kernel: $k(x, y) = \sum_{k=1}^{n} \exp\!\left(-\sigma (x^k - y^k)^2\right)^d$

·              The Gaussian RBF kernel: $k(x, y) = \exp\!\left(-\|x - y\|^2 / 2\sigma^2\right)$

·              The Laplace RBF kernel: $k(x, y) = \exp\!\left(-\|x - y\| / \sigma\right)$

·              The linear spline kernel in one dimension: $k(x, y) = 1 + xy + xy\min(x,y) - \frac{x+y}{2}\min(x,y)^2 + \frac{1}{3}\min(x,y)^3$



Kernel selection depends heavily on the specifics of the data. For instance, the linear kernel, the simplest of all kernels, is important for large sparse data vectors, whereas the Gaussian and Laplace RBF kernels are general-purpose kernels used when no prior knowledge about the data is available. The Gaussian kernel avoids the sparse distribution caused by a high-degree polynomial kernel in a large feature space. The polynomial kernel is widely applied in image processing, while the ANOVA RB kernel is usually reserved for regression tasks. Because classification accuracy heavily depends on kernel selection, researchers have proposed building kernel functions from a general-purpose learning element and a problem-specific kernel function. For example, Amari and Wu [32] modified the kernel function by enlarging the Riemannian geometric structure induced by the kernel around the support vectors, and Souza and Carvalho [33] proposed selecting the hyperplane parameters using k-fold cross-validation and leave-one-out criteria.
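To make the kernel definitions above concrete, the following sketch evaluates a few of the listed kernels and assembles a Gram matrix. The function names and the default parameter values ($\sigma = 1$, $c = 1$, $d = 3$) are our own illustrative choices, not prescribed by the text:

```python
import numpy as np

def linear(x, y):
    # The linear kernel: a plain inner product.
    return np.dot(x, y)

def polynomial(x, y, c=1.0, d=3):
    # The polynomial kernel with offset c and degree d.
    return (np.dot(x, y) + c) ** d

def gaussian_rbf(x, y, sigma=1.0):
    # The Gaussian RBF kernel uses the squared Euclidean distance.
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def laplace_rbf(x, y, sigma=1.0):
    # The Laplace RBF kernel uses the distance itself, not its square.
    return np.exp(-np.linalg.norm(x - y) / sigma)

def gram_matrix(X, kernel):
    # K[i, j] = k(x_i, x_j); the result is Hermitian (symmetric for
    # real data) and positive semi-definite for any Mercer kernel.
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K
```

Note that for the Gaussian and Laplace kernels every diagonal entry of the Gram matrix equals 1, since $k(x, x) = \exp(0)$.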


In 2006, Shen and Chou proposed modifications to the method of Ding and Dubchak (2001). First, they proposed a somewhat ad hoc ensemble learning approach in which multi-class k-NN classifiers, individually trained on each feature space (such as C or S), were later combined; second, they proposed the use of four additional feature groups to replace the amino-acid composition [31].


Weston et al. (2001) performed feature selection for SVMs by combining a feature scaling technique with the leave-one-out error bound. Chapelle, Vapnik, Bousquet, and Mukherjee (2002) tuned multiple parameters of two-norm SVMs by minimizing the radius-margin bound or the span bound. Ong and Smola (2003) applied semidefinite programming to learn the kernel function via hyperkernels. Lanckriet, Cristianini, Bartlett, El Ghaoui, and Jordan (2004) designed the kernel matrix directly by semidefinite programming.




    Our new kernel feature analysis for CT colonographic images is based on the following two methods: Kernel Principal Feature Analysis and Sparse Kernel Feature Analysis.  This section provides a brief review of these existing methods.  In the next section, we will extend these existing methods and propose a new method called Accelerated Kernel Feature Analysis.

A.     Kernel Principal Feature Analysis

Kernel Principal Component Analysis (KPCA) uses a Mercer kernel [15] to perform a linear principal component analysis of the data mapped into a feature space $H$. Without loss of generality, we assume that the image of the data has been centered, so that its scatter matrix in $H$ is given by $C = \frac{1}{n}\sum_{i=1}^{n} \Phi(x_i)\Phi(x_i)^T$. Eigenvalues $\lambda$ and eigenvectors $v$ are obtained by solving

$$\lambda v = C v \qquad (1)$$

for $\lambda \geq 0$. Since the mapping $\Phi$ is not explicitly known, (1) must be solved indirectly. Letting $\alpha_i$ denote the coefficients of the expansion of $v$ in the images of the data gives

$$v = \sum_{i=1}^{n} \alpha_i \Phi(x_i). \qquad (2)$$
Multiplying by $\Phi(x_k)^T$ on the left, for $k = 1, \ldots, n$, yields

$$\lambda \langle \Phi(x_k), v \rangle = \frac{1}{n} \sum_{j=1}^{n} \langle \Phi(x_k), \Phi(x_j) \rangle \langle \Phi(x_j), v \rangle. \qquad (3)$$
Substitution of (2) into (3) produces

$$n\lambda \sum_{i=1}^{n} \alpha_i \langle \Phi(x_k), \Phi(x_i) \rangle = \sum_{j=1}^{n} \langle \Phi(x_k), \Phi(x_j) \rangle \sum_{i=1}^{n} \alpha_i \langle \Phi(x_j), \Phi(x_i) \rangle, \quad k = 1, \ldots, n, \qquad (4)$$

which can be rewritten as $n\lambda K\alpha = K^2\alpha$, where $K$ is an $n \times n$ Gram matrix with elements $K_{ij} = k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$, and $\alpha = (\alpha_1, \ldots, \alpha_n)^T$. The latter is a dual eigenvalue problem equivalent to the problem

$$n\lambda \alpha = K\alpha. \qquad (5)$$
Since $\langle v^j, v^j \rangle = (\alpha^j)^T K \alpha^j = n\lambda_j \langle \alpha^j, \alpha^j \rangle$, the normalization of each eigenvector ($\langle v^j, v^j \rangle = 1$) requires $n\lambda_j \langle \alpha^j, \alpha^j \rangle = 1$.

In the following, we choose a Gaussian kernel, i.e.,

$$k(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2\sigma^2}\right). \qquad (6)$$
If the image of the data is not centered in the Hilbert space, we need to use the centered Gram matrix deduced by Smola, Mangasarian, and Schölkopf [16]:

$$\tilde{K} = K - 1_n K - K 1_n + 1_n K 1_n, \qquad (7)$$

where $K$ is the Gram matrix of the uncentered data, and $1_n$ is the $n \times n$ matrix whose entries are all equal to $1/n$.
Keeping the $\ell$ eigenvectors associated with the $\ell$ largest eigenvalues, we can reconstruct data in the mapped space: $\hat{\Phi}(x_i) = \sum_{j=1}^{\ell} \beta_{ij} v^j$, where $\beta_{ij} = \langle \Phi(x_i), v^j \rangle$. The square reconstruction error of each datum is $\|\Phi(x_i) - \hat{\Phi}(x_i)\|^2 = \|\Phi(x_i)\|^2 - \sum_{j=1}^{\ell} \beta_{ij}^2$. The mean square error is $\frac{1}{n}\sum_{i=1}^{n}\|\Phi(x_i)\|^2 - \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{\ell} \beta_{ij}^2$. Using (5), $\frac{1}{n}\sum_{i=1}^{n} \beta_{ij}^2 = \lambda_j$. Since $\frac{1}{n}\sum_{i=1}^{n}\|\Phi(x_i)\|^2 = \mathrm{tr}(C) = \sum_{j=1}^{n} \lambda_j$, the mean square reconstruction error is $\sum_{j=\ell+1}^{n} \lambda_j$, the sum of the discarded eigenvalues.
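The centering formula (7) can be verified numerically. In the sketch below we use a linear kernel, for which centering in feature space is the same as centering the raw data, so both sides can be computed explicitly (the function name is our own):

```python
import numpy as np

def center_gram(K):
    # Centered Gram matrix: K - 1n K - K 1n + 1n K 1n,
    # where (1n)_ij = 1/n, as in (7).
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n
```

For a linear kernel, `center_gram(X @ X.T)` equals the Gram matrix of the mean-subtracted data, which is exactly what (7) accomplishes implicitly for an arbitrary feature map.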

Kernel PCA can now be summarized as follows:

Step 1:     Calculate the Gram matrix using (6), which contains the inner products between pairs of image vectors.

Step 2:     Use (5) to get the coefficient vectors $\alpha^j$ for $j = 1, \ldots, \ell$.

Step 3:     The projection of a test point $\Phi(x)$ along the j-th eigenvector is

$$\langle v^j, \Phi(x) \rangle = \sum_{i=1}^{n} \alpha_i^j k(x_i, x).$$
    The above implicitly contains an eigenvalue problem of rank $n$, so the computational complexity of Kernel PCA is $O(n^3)$. In addition, each resulting eigenvector is represented as a linear combination of $n$ terms. Thus, all the data $\{x_i\}_{i=1}^{n}$ must be retained, which is computationally cumbersome and unacceptable for incremental or on-line learning.
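The three steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the choice of $\sigma$, and the decision to project only the training points are our own. The coefficient vectors are scaled so that $n\lambda_j \langle \alpha^j, \alpha^j \rangle = 1$:

```python
import numpy as np

def kernel_pca(X, n_components, sigma=1.0):
    """Sketch of the three KPCA steps: Gram matrix, dual eigenproblem,
    projection.  Returns training projections and the eigenvalues of
    the centered Gram matrix (i.e., n*lambda_j in the text)."""
    n = X.shape[0]
    # Step 1: Gaussian Gram matrix, as in (6).
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    # Center in feature space, as in (7).
    one_n = np.full((n, n), 1.0 / n)
    K = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Step 2: solve the dual eigenproblem K alpha = (n lambda) alpha,
    # keep the leading eigenvectors, and normalize each alpha^j by
    # 1/sqrt(n lambda_j) so the feature-space eigenvector has unit length.
    eigvals, eigvecs = np.linalg.eigh(K)        # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    alphas = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])
    # Step 3: projections of the training points onto the eigenvectors.
    return K @ alphas, eigvals
```

With this normalization, the mean squared projection along the j-th component equals $\lambda_j$, consistent with the reconstruction-error analysis above.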

B.     Sparse Kernel Feature Analysis

    Sparse Kernel Feature Analysis (SKFA) [16] extracts features one by one, in order of decreasing projection variance.  SKFA improves on the computational costs of KPCA, associated with both time complexity and data retention requirements.  In that sense, SKFA is a more compact and time-efficient method than KPCA.  The particular advantage of SKFA is that $\ell$ features depend on only $\ell$ elements of $\{\Phi(x_i)\}_{i=1}^{n}$, which is extremely useful for on-line learning.  We let $\nu^j$, for $j = 1, \ldots, \ell$, denote the features selected by SKFA. We intentionally avoid using $v^j$ in order to distinguish these features from the eigenvectors obtained using KPCA. Again, we analyze the scatter matrix of the image data.  Following (1), with $v$ replaced by $\nu$, we obtain $\lambda_j = \langle \nu^j, C\nu^j \rangle$, where $\nu^j$ is the j-th feature with unit length. Thus, $\lambda_j = \frac{1}{n}\sum_{i=1}^{n} \langle \nu^j, \Phi(x_i) \rangle^2$. Therefore, the first feature, corresponding to the maximum eigenvalue, is chosen as the direction with the maximum projected variance:

$$\nu^1 = \arg\max_{\|\nu\| = 1} \frac{1}{n}\sum_{i=1}^{n} \langle \nu, \Phi(x_i) \rangle^2. \qquad (8)$$
The global solution of (8), the first principal component, needs to satisfy an $L_2$ normalization constraint, i.e., unit Euclidean length. Changing the $L_2$ constraint to an $L_1$ constraint leads to a vertex solution. The $L_1$ constraint assumed by SKFA is

$$\nu \in F_{L_1} = \left\{ \sum_{i=1}^{n} \alpha_i \Phi(x_i) : \sum_{i=1}^{n} |\alpha_i| \leq 1 \right\}. \qquad (9)$$
The first feature selected by SKFA satisfies

$$\nu^1 = \arg\max_{\nu \in F_{L_1}} \frac{1}{n}\sum_{i=1}^{n} \langle \nu, \Phi(x_i) \rangle^2. \qquad (10)$$
Smola et al. showed that this feature corresponds to an element of the image $\{\Phi(x_i)\}_{i=1}^{n}$; hence

$$\nu^1 = \arg\max_{\nu \in \{\Phi(x_1), \ldots, \Phi(x_n)\}} \frac{1}{n}\sum_{i=1}^{n} \langle \nu, \Phi(x_i) \rangle^2.$$
Subsequent features are obtained iteratively. Suppose $i-1$ features $\nu^1, \ldots, \nu^{i-1}$ have been found; then each image $\Phi(x_j)$ is projected into the orthogonal subspace to obtain

$$\phi_i(x_j) = \Phi(x_j) - \sum_{k=1}^{i-1} \langle \nu^k, \Phi(x_j) \rangle \nu^k,$$
where $\phi_i(x_j) \perp \mathrm{span}\{\nu^1, \ldots, \nu^{i-1}\}$. Then $\phi_i(x_j)$ is normalized by the $L_1$ constraint in (9). The projection variance of the normalized $\phi_i(x_j)$ with all $n$ images is then calculated. Finally, one identifies the maximum projection variance and selects the corresponding direction as the i-th feature, $\nu^i$.

    Based on the $\ell$ features extracted by SKFA, each training datum in the mapped space can be reconstructed by

$$\hat{\Phi}(x_i) = \sum_{j=1}^{\ell} \beta_{ij} \nu^j, \qquad (11)$$
where $\beta_{ij} = \langle \Phi(x_i), \nu^j \rangle$, the projection of the i-th training datum on the j-th feature, is stored after extracting the j-th feature.  According to (11), the feature set $\{\nu^j\}_{j=1}^{\ell}$ depends only upon the set of $\ell$ image vectors $\{\Phi(x_{c_j})\}_{j=1}^{\ell}$, where $c_j$ denotes the subscript of the projected image $\phi_j(x_{c_j})$ that was selected when constructing $\nu^j$. Therefore, after training, we only need to retain the $\ell$ input vectors $\{x_{c_j}\}_{j=1}^{\ell}$, where $\ell$ is the number of features extracted.

In this way, SKFA extracts $\ell$ features, where one assumes $\ell \ll n$.   As $O(in^2)$ operations are required to extract the i-th feature, the total computational cost for $\ell$ features is $O(\ell^2 n^2)$, which is an improvement over the $O(n^3)$ operations required by kernel PCA.
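The greedy selection and deflation loop of SKFA can be sketched as follows. To keep the orthogonal projection explicit, this sketch uses the identity feature map (a linear kernel), which is our simplification rather than the kernelized implementation of [16]; all names are our own. Each selected feature is a normalized residual of one training vector, so only the selected input vectors need to be retained:

```python
import numpy as np

def skfa_linear(X, n_features):
    """Greedy sparse feature extraction with an explicit feature map.
    Returns the unit-length features and the indices of the training
    vectors they were built from."""
    residual = X.astype(float).copy()   # phi_i(x_j) for the current i
    features, selected = [], []
    for _ in range(n_features):
        norms = np.linalg.norm(residual, axis=1)
        # Candidate directions: each residual, normalized to unit length.
        cand = residual / np.where(norms > 1e-12, norms, 1.0)[:, None]
        # Projection variance of every candidate over the whole data set.
        var = np.mean((residual @ cand.T) ** 2, axis=0)
        var[norms <= 1e-12] = -np.inf   # skip exhausted directions
        idx = int(np.argmax(var))
        nu = cand[idx]
        features.append(nu)
        selected.append(idx)
        # Deflate: project every residual into the orthogonal subspace.
        residual = residual - np.outer(residual @ nu, nu)
    return np.array(features), selected
```

Because each candidate is drawn from residuals that are already orthogonal to the previously selected features, the extracted features come out mutually orthonormal, mirroring the orthogonal-subspace projection step described above.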