A kernel K is a Hermitian, positive semi-definite matrix that computes the inner products between any finite sequence of inputs $\{x_1, \ldots, x_n\}$ and is defined as
$K_{ij} = k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$,
where $\Phi$ is the (implicit) feature map.
Commonly used kernels are:
· The linear kernel: $k(x, y) = \langle x, y \rangle$
· The polynomial kernel: $k(x, y) = (\langle x, y \rangle + c)^d$, $c \ge 0$
· The hyperbolic tangent kernel: $k(x, y) = \tanh(\kappa \langle x, y \rangle + c)$
· The ANOVA RB kernel: $k(x, y) = \sum_{k=1}^{n} \exp\!\left(-\sigma (x^k - y^k)^2\right)^d$
· The Gaussian RBF kernel: $k(x, y) = \exp\!\left(-\|x - y\|^2 / 2\sigma^2\right)$
· The Laplace RBF kernel: $k(x, y) = \exp\!\left(-\|x - y\| / \sigma\right)$
· The linear spline kernel in one dimension: $k(x, y) = 1 + xy + xy\min(x,y) - \frac{x+y}{2}\min(x,y)^2 + \frac{1}{3}\min(x,y)^3$
Kernel selection is heavily dependent on the specifics of the data. For instance, the linear kernel, the simplest of all kernels, is important for large sparse data vectors, whereas the Gaussian and Laplace RBF kernels are general-purpose kernels used when prior knowledge about the data is not available. The Gaussian kernel avoids the sparse distribution caused by high-degree polynomial kernels in a large feature space. The polynomial kernel is widely applied in image processing, while the ANOVA RB kernel is usually reserved for regression tasks. Because classification accuracy heavily depends on kernel selection, researchers have proposed combining a general-purpose learning element with a problem-specific kernel function. For example, Amari and Wu modified the kernel function by enlarging the Riemannian geometric structure induced by the kernel around the support vectors, and Souza and Carvalho proposed selecting the hyperplane parameters by k-fold cross-validation and leave-one-out criteria.
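The kernels listed above can be sketched directly in NumPy; the function names and the default parameter values ($\sigma$, $c$, $d$, $\kappa$) below are illustrative assumptions, not choices made in the text.

```python
import numpy as np

def linear(x, y):
    # <x, y>
    return np.dot(x, y)

def polynomial(x, y, c=1.0, d=3):
    # (<x, y> + c)^d
    return (np.dot(x, y) + c) ** d

def hyperbolic_tangent(x, y, kappa=1.0, c=-1.0):
    # tanh(kappa * <x, y> + c)
    return np.tanh(kappa * np.dot(x, y) + c)

def anova_rb(x, y, sigma=1.0, d=2):
    # sum over coordinates of exp(-sigma * (x_k - y_k)^2)^d
    return np.sum(np.exp(-sigma * (x - y) ** 2) ** d)

def gaussian_rbf(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def laplace_rbf(x, y, sigma=1.0):
    # exp(-||x - y|| / sigma)
    return np.exp(-np.sqrt(np.sum((x - y) ** 2)) / sigma)

def linear_spline_1d(x, y):
    # one-dimensional linear spline kernel on scalars x, y
    m = min(x, y)
    return 1.0 + x * y + x * y * m - (x + y) / 2.0 * m ** 2 + m ** 3 / 3.0
```

All of these return a scalar for a pair of input vectors (scalars, for the spline), so a Gram matrix is obtained by evaluating the chosen kernel over all input pairs.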
In 2006, Shen and Chou proposed two modifications to the method of Ding and Dubchak (2001). First, they proposed a somewhat ad hoc ensemble learning approach in which multi-class k-NN classifiers, individually trained on each feature space (such as C or S), were later combined; second, they proposed four additional feature groups to replace the amino-acid composition.
Weston et al. (2001) performed feature selection for SVMs by combining a feature-scaling technique with the leave-one-out error bound. Chapelle, Vapnik, Bousquet, and Mukherjee (2002) tuned multiple parameters for two-norm SVMs by minimizing the radius-margin bound or the span bound. Ong and Smola (2003) applied semidefinite programming to learn the kernel function via hyperkernels. Lanckriet, Cristianini, Bartlett, Ghaoui, and Jordan (2004) designed the kernel matrix directly by semidefinite programming.
KERNEL FEATURE ANALYSIS:
Our new kernel feature analysis for CT colonographic images is based on the following two methods: Kernel Principal Feature Analysis and Sparse Kernel Feature Analysis. This section provides a brief review of these existing methods. In the next section, we will extend these existing methods and propose a new method called Accelerated Kernel Feature Analysis.
Kernel Principal Component Analysis (KPCA) uses a Mercer kernel to implicitly map the data $\{x_i\}_{i=1}^{n}$ into a Hilbert space via $\Phi$ and performs a linear principal component analysis of this transformed image. Without loss of generality, we assume that the image of the data has been centered, so that its scatter matrix is given by $C = \frac{1}{n}\sum_{i=1}^{n}\Phi(x_i)\Phi(x_i)^T$. Eigenvalues $\lambda$ and eigenvectors $w$ are obtained by solving

$\lambda w = C w$  (1)

for $w$. Since $\Phi$ is not known explicitly, (1) must be solved indirectly. Letting $w = \sum_{i=1}^{n}\alpha_i \Phi(x_i)$ gives

$\lambda \sum_{i=1}^{n}\alpha_i \Phi(x_i) = \frac{1}{n}\sum_{j=1}^{n}\Phi(x_j)\Phi(x_j)^T \sum_{i=1}^{n}\alpha_i \Phi(x_i).$  (2)

Multiplying by $\Phi(x_k)^T$ on the left, for $k = 1, \ldots, n$, yields

$\lambda \sum_{i=1}^{n}\alpha_i \langle \Phi(x_k), \Phi(x_i)\rangle = \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i \langle \Phi(x_k), \Phi(x_j)\rangle \langle \Phi(x_j), \Phi(x_i)\rangle.$  (3)

Substitution of (2) into (3) produces

$n\lambda K\alpha = K^2\alpha,$  (4)

which can be rewritten as $n\lambda\alpha = K\alpha$, where K is an $n \times n$ Gram matrix, with the element $K_{ij} = \langle \Phi(x_i), \Phi(x_j)\rangle$, and $\alpha = (\alpha_1, \ldots, \alpha_n)^T$. The latter is a dual eigenvalue problem equivalent to the problem

$\tilde{\lambda}\alpha = K\alpha, \quad \tilde{\lambda} = n\lambda.$  (5)

Since $w^j = \sum_i \alpha_i^j \Phi(x_i)$, the normalization of each eigenvector ($\langle w^j, w^j\rangle = 1$) requires $\tilde{\lambda}_j \langle \alpha^j, \alpha^j \rangle = 1$.
In the following, we choose a Gaussian kernel, i.e., $k(x_i, x_j) = \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$.
If the image of the data is not centered in the Hilbert space, we need to use the centered Gram matrix deduced by Smola, Mangasarian, and Schölkopf:

$\tilde{K} = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n,$  (6)

where K is the Gram matrix of the uncentered data, and $\mathbf{1}_n$ is the $n \times n$ matrix with all entries equal to $1/n$.
Keeping the $\ell$ eigenvectors associated with the largest eigenvalues, we can reconstruct data in the mapped space: $\hat{\Phi}(x_i) = \sum_{j=1}^{\ell} \beta_{ij} w^j$, where $\beta_{ij} = \langle \Phi(x_i), w^j \rangle$. The reconstruction square error of each data point is $\|\Phi(x_i) - \hat{\Phi}(x_i)\|^2 = \|\Phi(x_i)\|^2 - \sum_{j=1}^{\ell}\beta_{ij}^2$. The mean square error is $\frac{1}{n}\sum_{i=1}^{n}\left(\|\Phi(x_i)\|^2 - \sum_{j=1}^{\ell}\beta_{ij}^2\right)$. Using (5), $\sum_{i=1}^{n}\beta_{ij}^2 = \tilde{\lambda}_j$. Therefore, the mean square reconstruction error is $\frac{1}{n}\left(\sum_{i=1}^{n}\|\Phi(x_i)\|^2 - \sum_{j=1}^{\ell}\tilde{\lambda}_j\right)$. Since $\sum_{i=1}^{n}\|\Phi(x_i)\|^2 = \operatorname{tr}(K) = \sum_{j=1}^{n}\tilde{\lambda}_j$, the mean square reconstruction error reduces to $\frac{1}{n}\sum_{j=\ell+1}^{n}\tilde{\lambda}_j$.
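The discarded-eigenvalue expression for the mean square reconstruction error can be checked numerically. The sketch below is an illustrative assumption using the linear kernel, where the feature map $\Phi(x) = x$ is explicit, so the error of reconstructing from the top-$\ell$ kernel eigenvectors can be computed directly and compared with the sum of the discarded eigenvalues of K divided by n.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
X -= X.mean(axis=0)            # center so Phi(x) = x is centered too

K = X @ X.T                    # linear-kernel Gram matrix
lam, U = np.linalg.eigh(K)     # eigenvalues in ascending order
l = 2                          # keep the top-l features

# Reconstruct each point from its projections onto the top-l directions;
# in the linear case this is projection onto span of the top-l eigenvectors.
top = U[:, -l:]
X_hat = top @ top.T @ X
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))

# The mean square error matches (1/n) * sum of the discarded eigenvalues.
assert abs(mse - lam[:-l].sum() / len(X)) < 1e-9
```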
Kernel PCA can now be summarized as follows:
Step 1: Calculate the Gram matrix using (6), which contains the inner products between pairs of image vectors.
Step 2: Use (5) to get the coefficient vectors for .
Step 3: The projection of a test point $x$ along the j-th eigenvector is $\langle w^j, \Phi(x)\rangle = \sum_{i=1}^{n} \alpha_i^j k(x_i, x)$.
The above implicitly contains an eigenvalue problem of rank $n$, so the computational complexity of Kernel PCA is $O(n^3)$. In addition, each resulting eigenvector is represented as a linear combination of $n$ terms. Thus, all the data $\{x_i\}_{i=1}^{n}$ must be retained, which is computationally cumbersome and unacceptable for incremental or on-line learning.
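The three steps above can be sketched compactly in NumPy. This is a minimal illustration, not the authors' implementation; the Gaussian-kernel bandwidth and function names are assumptions.

```python
import numpy as np

def gram_matrix(X, sigma=1.0):
    # Step 1: Gram matrix of pairwise Gaussian-kernel inner products.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def center_gram(K):
    # Centering in feature space: K~ = K - 1n K - K 1n + 1n K 1n.
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    return K - one @ K - K @ one + one @ K @ one

def kpca(X, n_components=2, sigma=1.0):
    # Step 2: solve the dual eigenvalue problem K alpha = lambda alpha.
    K = center_gram(gram_matrix(X, sigma))
    lam, alpha = np.linalg.eigh(K)
    idx = np.argsort(lam)[::-1][:n_components]
    lam, alpha = lam[idx], alpha[:, idx]
    # Scale alpha so each feature-space eigenvector has unit length.
    alpha = alpha / np.sqrt(np.maximum(lam, 1e-12))
    # Step 3: projections of the training points onto the eigenvectors.
    return K @ alpha, lam
```

For a new test point, the same Step-3 formula applies with a (centered) row of kernel evaluations against the training set in place of a row of K.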
Sparse Kernel Feature Analysis (SKFA) extracts the features one by one in order of decreasing projection variance. SKFA improves on the computational costs of KPCA in both time complexity and data retention requirements; in that sense, SKFA is a more compact and time-efficient method than KPCA. A particular advantage of SKFA is that the features depend on only $\ell$ elements of the image $\{\Phi(x_i)\}_{i=1}^{n}$, which is extremely useful for on-line learning. We let $\nu_j$, for $j = 1, \ldots, \ell$, denote the features selected by SKFA. We intentionally avoid using $w^j$ in order to distinguish these features from the eigenvectors obtained using KPCA. Again, we analyze the scatter matrix of the image data. Following (1), with $w$ replaced by $\nu$, we obtain $\lambda \nu_j = C \nu_j$, where $\nu_j$ is the j-th feature with unit length. Thus, $\lambda = \langle \nu_j, C \nu_j \rangle$. Therefore, the first feature, corresponding to the maximum eigenvalue, is chosen as the direction with the maximum projected variance:

$\nu_1 = \arg\max_{\|\nu\|=1} \langle \nu, C\nu \rangle = \arg\max_{\|\nu\|=1} \frac{1}{n}\sum_{i=1}^{n} \langle \nu, \Phi(x_i) \rangle^2.$  (8)
The global solution of (8) needs to satisfy an $L_2$ normalization constraint, i.e., unit Euclidean length. Changing the constraint to an $L_1$ constraint leads to a vertex solution. The $L_1$ constraint assumed by SKFA is

$\nu \in \left\{ \sum_{i=1}^{n} \alpha_i \Phi(x_i) : \sum_{i=1}^{n} |\alpha_i| \le 1 \right\}.$  (9)
The first feature selected by SKFA satisfies

$\nu_1 = \arg\max_{\nu} \frac{1}{n}\sum_{i=1}^{n} \langle \nu, \Phi(x_i) \rangle^2$ subject to (9).  (10)

Smola et al. showed that this feature corresponds to an element of the image $\{\Phi(x_i)\}_{i=1}^{n}$; hence

$\nu_1 = \frac{\Phi(x_{i^*})}{\|\Phi(x_{i^*})\|}, \quad i^* = \arg\max_{i} \frac{1}{n}\sum_{j=1}^{n} \left\langle \frac{\Phi(x_i)}{\|\Phi(x_i)\|}, \Phi(x_j) \right\rangle^2.$
Subsequent features are obtained iteratively. Suppose $i-1$ features have been found; then each image $\Phi(x_j)$ is projected into the orthogonal subspace to obtain

$\phi_i(x_j) = \Phi(x_j) - \sum_{l=1}^{i-1} \langle \nu_l, \Phi(x_j) \rangle \nu_l,$

where $j = 1, \ldots, n$. Then each $\phi_i(x_j)$ is normalized by the constraint in (9). The projection variance of each normalized $\phi_i(x_j)$ with all $\phi_i(x_k)$, $k = 1, \ldots, n$, is then calculated. Finally, one identifies the maximum projection variance and selects the corresponding normalized $\phi_i(x_j)$ as the i-th feature, $\nu_i$.
Based on the $\ell$ features extracted by SKFA, each training data point in the mapped space can be reconstructed by

$\hat{\Phi}(x_i) = \sum_{j=1}^{\ell} \beta_{ij} \nu_j,$  (11)

where $\beta_{ij} = \langle \Phi(x_i), \nu_j \rangle$, the projection of the i-th training data point on the j-th feature, is stored after extracting the j-th feature. According to (11), the feature set only depends upon the set of image vectors $\{\Phi(x_{idx(j)})\}_{j=1}^{\ell}$, where $idx(j)$ denotes the subscript of the projected image that was selected when constructing $\nu_j$. Therefore, after training, we only need to retain the input vectors $\{x_{idx(j)}\}_{j=1}^{\ell}$, where $\ell$ is the number of features extracted.
In this way, SKFA extracts $\ell$ features, where one assumes $\ell \ll n$. As $O(i n^2)$ operations are required to extract the i-th feature, the total computational cost for $\ell$ features is $O(\ell^2 n^2)$, which is an improvement over the $O(n^3)$ operations required by kernel PCA.
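The greedy selection and deflation described above can be sketched as follows. This is a minimal illustration under the stated constraints, not the authors' code: each feature is a normalized deflated image chosen to maximize the projected variance, every quantity is computed from the Gram matrix alone, and the function and variable names are assumptions.

```python
import numpy as np

def skfa(K, n_features):
    """K: n-by-n Gram matrix. Returns the indices of the chosen images
    and the projections beta of all data on each extracted feature."""
    n = K.shape[0]
    # P[i, j] = <phi(x_i), phi(x_j)> for the current deflated images.
    P = K.astype(float).copy()
    chosen, projections = [], []
    for _ in range(n_features):
        diag = np.maximum(np.diag(P), 1e-12)
        # Projection variance onto each normalized candidate image.
        var = np.mean(P ** 2, axis=1) / diag
        i = int(np.argmax(var))
        beta = P[i] / np.sqrt(diag[i])   # projections on the new feature
        chosen.append(i)
        projections.append(beta)
        # Deflate: remove the component along the selected feature.
        P = P - np.outer(beta, beta)
    return chosen, np.array(projections)
```

Only the Gram matrix enters the computation, so after training one needs to keep just the $\ell$ selected input vectors, as stated above.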