# III. Kernel Feature Extraction

KERNEL BASICS:

A kernel Gram matrix *K* is a Hermitian, positive semi-definite matrix that computes the inner products between any finite sequence of inputs $\{x_i\}_{i=1}^{n}$ and is defined as $K_{ij} = k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$.

Commonly used kernels are [34]:

- The linear kernel: $k(x, y) = x^{\top} y$
- The polynomial kernel: $k(x, y) = (x^{\top} y + c)^{d}$
- The hyperbolic tangent kernel: $k(x, y) = \tanh(\kappa\, x^{\top} y + c)$
- The ANOVA RB kernel: $k(x, y) = \sum_{k=1}^{m} \exp\!\left(-\sigma (x^{k} - y^{k})^{2}\right)^{d}$
- The Gaussian RBF kernel: $k(x, y) = \exp\!\left(-\|x - y\|^{2} / 2\sigma^{2}\right)$
- The Laplace RBF kernel: $k(x, y) = \exp\!\left(-\|x - y\| / \sigma\right)$
- The linear spline kernel in one dimension: $k(x, y) = 1 + xy + xy\min(x, y) - \frac{x + y}{2}\min(x, y)^{2} + \frac{1}{3}\min(x, y)^{3}$
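As a quick illustration, the sketch below (Python with NumPy, not part of the original text) evaluates a few of these kernels on a pair of vectors; the parameter names `c`, `d`, and `sigma` follow the conventions in the list above.

```python
import numpy as np

def linear(x, y):
    # k(x, y) = <x, y>
    return np.dot(x, y)

def polynomial(x, y, c=1.0, d=3):
    # k(x, y) = (<x, y> + c)^d
    return (np.dot(x, y) + c) ** d

def gaussian_rbf(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def laplace_rbf(x, y, sigma=1.0):
    # k(x, y) = exp(-||x - y|| / sigma)
    return np.exp(-np.linalg.norm(x - y) / sigma)

x = np.array([1.0, 2.0])
y = np.array([1.0, 2.0])
# For x == y, both RBF kernels evaluate to 1 and the linear kernel to ||x||^2.
```

Each function returns a single kernel value; stacking the values $k(x_i, x_j)$ for all pairs of inputs produces the Gram matrix *K* defined above.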

Kernel selection is heavily dependent on the specifics of the data.
For instance, the linear kernel is important for large sparse data vectors and
is the simplest of all kernels, whereas the Gaussian and Laplace RBF kernels
are general-purpose kernels used when prior knowledge about the data is not
available. The Gaussian kernel avoids the sparse distribution caused by a
high-degree polynomial kernel in a large feature space. The polynomial kernel is
widely applied in image processing, while the ANOVA RB kernel is usually reserved for
regression tasks. Because classification accuracy heavily depends on kernel
selection, researchers have proposed building kernel functions from a general-purpose
learning element and a problem-specific kernel function. For example,
Amari and Wu [32] modified the kernel function by enlarging the Riemannian
geometry structure induced by the kernel around the support vectors, and Souza and Carvalho
[33] proposed selecting the hyperplane parameters using k-fold cross-validation
and leave-one-out criteria.

In 2006, Shen and Chou proposed modifications to the method
of Ding and Dubchak (2001). First, they proposed a somewhat ad hoc ensemble learning
approach in which multi-class k-NN classifiers, individually trained on each feature
space (such as C or S), were later combined; second, they proposed the use
of four additional feature groups to replace the amino-acid composition [31].

Weston et al. (2001) performed feature selection for SVMs by
combining a feature scaling technique with the leave-one-out error bound.
Chapelle, Vapnik, Bousquet, and Mukherjee (2002) tuned multiple parameters for
two-norm SVMs by minimizing the radius margin bound or the span bound. Ong and
Smola (2003) applied semidefinite programming to learn the kernel function via
hyperkernels. Lanckriet, Cristianini, Bartlett, El Ghaoui, and Jordan (2004)
designed the kernel matrix directly by semidefinite programming.

__KERNEL FEATURE ANALYSIS:__


Our new kernel feature analysis for CT colonographic
images is based on the following two methods: Kernel Principal Feature Analysis
and Sparse Kernel Feature Analysis. This section provides a brief review of
these existing methods; in the next section, we extend them and propose a new
method called Accelerated Kernel Feature Analysis.

## A. Kernel Principal Feature Analysis

Kernel Principal Component Analysis (KPCA) uses a Mercer
kernel [15] to perform a linear principal component analysis of the transformed image $\Phi(x)$. Without loss of generality, we assume that the image of the data has been centered, so that its scatter matrix in the feature space is given by $C = \frac{1}{n}\sum_{i=1}^{n} \Phi(x_i)\Phi(x_i)^{\top}$. Eigenvalues $\lambda$ and eigenvectors $V$ are obtained by
solving

$\lambda V = C V$ (1)

for $\lambda \geq 0$ and $V$ in the feature space. Since the map $\Phi$ is generally not known explicitly, (1) must be solved
indirectly. Letting $V = \sum_{i=1}^{n} \alpha_i \Phi(x_i)$ gives

$\lambda \sum_{i=1}^{n} \alpha_i \Phi(x_i) = \frac{1}{n} \sum_{j=1}^{n} \Phi(x_j)\Phi(x_j)^{\top} \sum_{i=1}^{n} \alpha_i \Phi(x_i)$ (2)

Multiplying by $\Phi(x_k)^{\top}$ on the left, for $k = 1, \ldots, n$, yields

$\lambda \left\langle \Phi(x_k), V \right\rangle = \left\langle \Phi(x_k), C V \right\rangle$ (3)

Substitution of (2) into (3) produces

$\lambda \sum_{i=1}^{n} \alpha_i \left\langle \Phi(x_k), \Phi(x_i) \right\rangle = \frac{1}{n} \sum_{j=1}^{n} \left\langle \Phi(x_k), \Phi(x_j) \right\rangle \sum_{i=1}^{n} \alpha_i \left\langle \Phi(x_j), \Phi(x_i) \right\rangle$ (4)

which can be rewritten as $n \lambda K \alpha = K^{2} \alpha$, where *K* is an $n \times n$ Gram matrix, with the elements
$K_{ij} = \left\langle \Phi(x_i), \Phi(x_j) \right\rangle$ and $\alpha = (\alpha_1, \ldots, \alpha_n)^{\top}$. The latter is a
dual eigenvalue problem equivalent to the problem

$n \lambda\, \alpha = K \alpha$ (5)

Since $\|V_j\|^{2} = (\alpha^{j})^{\top} K \alpha^{j} = n \lambda_j \left\langle \alpha^{j}, \alpha^{j} \right\rangle$, the normalization of each
eigenvector ($\|V_j\| = 1$) requires $n \lambda_j \left\langle \alpha^{j}, \alpha^{j} \right\rangle = 1$.

In the following, we choose a Gaussian kernel, i.e.,

$k(x_i, x_j) = \exp\!\left(-\|x_i - x_j\|^{2} / 2\sigma^{2}\right)$ (6)

If the image of the data is not centered in the Hilbert space,
we need to use the centered Gram matrix deduced by Smola, Mangasarian, and
Schölkopf [16]:

$\tilde{K} = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n$ (7)

where *K* is the Gram matrix of the uncentered data, and

$\mathbf{1}_n$ is the $n \times n$ matrix with every entry equal to $1/n$.
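As a sanity check on the centering formula, the sketch below (an illustration, not from the original text) uses a linear kernel, for which $\Phi$ is the identity, so the centered Gram matrix can be compared against the Gram matrix of explicitly centered features:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))     # 10 points in R^3
K = X @ X.T                          # uncentered Gram matrix (linear kernel)

n = K.shape[0]
one_n = np.full((n, n), 1.0 / n)     # the matrix 1_n with every entry 1/n
K_tilde = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Direct computation: center the features first, then form the Gram matrix.
Xc = X - X.mean(axis=0)
assert np.allclose(K_tilde, Xc @ Xc.T)
```

Each row of the centered Gram matrix also sums to (numerically) zero, since the centered images sum to the zero vector.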

Keeping the $\ell$ eigenvectors associated with the $\ell$ largest eigenvalues,
we can reconstruct data in the mapped space: $\hat{\Phi}(x_i) = \sum_{j=1}^{\ell} \beta_{ij} V_j$, where $\beta_{ij} = \left\langle V_j, \Phi(x_i) \right\rangle$. The reconstruction square error of
each data point $\Phi(x_i)$
is $\|\Phi(x_i) - \hat{\Phi}(x_i)\|^{2} = \|\Phi(x_i)\|^{2} - \sum_{j=1}^{\ell} \beta_{ij}^{2}$. The
mean square error is $\frac{1}{n}\sum_{i=1}^{n}\|\Phi(x_i)\|^{2} - \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{\ell}\beta_{ij}^{2}$. Using (5), $\frac{1}{n}\sum_{i=1}^{n}\beta_{ij}^{2} = \lambda_j$. Therefore, the mean square
reconstruction error is $\frac{1}{n}\sum_{i=1}^{n}\|\Phi(x_i)\|^{2} - \sum_{j=1}^{\ell}\lambda_j$. Since $\frac{1}{n}\sum_{i=1}^{n}\|\Phi(x_i)\|^{2} = \operatorname{tr}(C)$ and $\operatorname{tr}(C) = \sum_{j=1}^{n}\lambda_j$, the mean square reconstruction error reduces to $\sum_{j=\ell+1}^{n}\lambda_j$.

Kernel PCA can now be summarized as follows:

**Step 1:** Calculate the Gram matrix using (6), which contains the inner products between pairs of image vectors.

**Step 2:** Use (5) to get the coefficient vectors $\alpha^{j}$ for $j = 1, \ldots, n$.

**Step 3:** The projection of a test point $\Phi(x)$ along the *j*-th eigenvector is

$\left\langle V_j, \Phi(x) \right\rangle = \sum_{i=1}^{n} \alpha_i^{j} k(x_i, x)$
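The three steps can be sketched as follows. This is a minimal NumPy illustration under the conventions above, not the authors' implementation: it combines the Gaussian kernel of (6), the centering of (7), and the dual eigenproblem of (5), and returns the projections of the training points themselves.

```python
import numpy as np

def kernel_pca(X, n_components, sigma=1.0):
    """Sketch of the three KPCA steps with a Gaussian kernel."""
    # Step 1: Gram matrix of pairwise kernel evaluations, eq. (6).
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2.0 * sigma ** 2))

    # Center in feature space, eq. (7).
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    K = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Step 2: dual eigenproblem, eq. (5); eigh returns ascending eigenvalues,
    # so reverse to get them in descending order.
    eigvals, eigvecs = np.linalg.eigh(K)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # Normalize so each feature-space eigenvector V_j has unit length:
    # n * lambda_j * <alpha^j, alpha^j> = 1 (note eigvals[j] = n * lambda_j).
    alphas = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])

    # Step 3: projections of the training points along each eigenvector.
    return K @ alphas

X = np.random.default_rng(1).standard_normal((20, 2))
Z = kernel_pca(X, 3)
```

Projections onto distinct eigenvectors are uncorrelated, so `Z.T @ Z` comes out (numerically) diagonal; projecting a new test point would instead use the kernel evaluations between that point and the training set, as in Step 3.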

The above implicitly contains an eigenvalue problem of
rank *n*, so the computational complexity of Kernel PCA is $O(n^{3})$. In addition, each resulting
eigenvector is represented as a linear combination of *n* terms. Thus, all
data contained in $\{x_i\}_{i=1}^{n}$ must
be retained, which is computationally cumbersome and unacceptable for
incremental or on-line learning.

## B. Sparse Kernel Feature Analysis

Sparse Kernel Feature Analysis (SKFA) [16] extracts the features one by one in order of decreasing projection variance. SKFA improves on the computational costs of KPCA, in both time complexity and data retention requirements; in that sense, SKFA is a more compact and time-efficient method than KPCA. The particular advantage of SKFA is that the $\ell$ features depend
on only $\ell$ elements
of $\{\Phi(x_i)\}_{i=1}^{n}$, which
is extremely useful for on-line learning. We let $\psi_j$, for $j = 1, \ldots, \ell$, denote the features selected by SKFA.
We intentionally avoid using $V_j$ in order to distinguish these features
from the eigenvectors obtained using KPCA. Again, we analyze the scatter matrix
of the image data. Following (1), with $V$ replaced by $\psi_j$, we obtain $\lambda_j \psi_j = C \psi_j$, where $\psi_j$ is the *j*-th
feature with unit length. Thus, $\lambda_j = \left\langle \psi_j, C \psi_j \right\rangle = \frac{1}{n} \sum_{i=1}^{n} \left\langle \psi_j, \Phi(x_i) \right\rangle^{2}$. Therefore, the first feature,
corresponding to the maximum eigenvalue, is chosen as the direction with the
maximum projected variance:

$\psi_1 = \arg\max_{\|\psi\| = 1} \frac{1}{n} \sum_{i=1}^{n} \left\langle \psi, \Phi(x_i) \right\rangle^{2}$ (8)

The global solution of (8), the first principal component, needs to satisfy an $\ell_2$ normalization
constraint, i.e., unit Euclidean length. Changing the $\ell_2$ constraint to an $\ell_1$ constraint leads to
a vertex solution. The $\ell_1$ constraint assumed by SKFA is

$\psi \in \left\{ \sum_{i=1}^{n} \alpha_i \Phi(x_i) : \sum_{i=1}^{n} |\alpha_i| \leq 1 \right\}$ (9)

The first feature selected by SKFA satisfies

$\psi_1 = \arg\max_{\psi} \frac{1}{n} \sum_{i=1}^{n} \left\langle \psi, \Phi(x_i) \right\rangle^{2}$ subject to (9).

Smola et al. showed that this feature corresponds to an
element of the image $\{\Phi(x_i)\}_{i=1}^{n}$, hence

$\psi_1 = \arg\max_{\psi \in \{\pm\Phi(x_1), \ldots, \pm\Phi(x_n)\}} \frac{1}{n} \sum_{i=1}^{n} \left\langle \psi, \Phi(x_i) \right\rangle^{2}$ (10)

Subsequent features are obtained iteratively. Suppose $i - 1$ features $\psi_1, \ldots, \psi_{i-1}$ have been found; then
each image $\Phi(x_j)$ is
projected into the orthogonal subspace to obtain

$\phi_i(x_j) = \Phi(x_j) - \sum_{k=1}^{i-1} \left\langle \psi_k, \Phi(x_j) \right\rangle \psi_k$ (11)

where $\left\langle \psi_k, \psi_{k'} \right\rangle = \delta_{kk'}$. Then $\phi_i(x_j)$ is normalized by the $\ell_1$ constraint in (9).
The projection variance of the normalized $\phi_i(x_j)$ with all $\Phi(x_m)$, $m = 1, \ldots, n$, is then calculated. Finally, one
identifies the maximum projection variance and selects the corresponding $\phi_i(x_j)$ as the *i*-th
feature, $\psi_i$.

Based on the $\ell$ features extracted by SKFA, each
training datum in the mapped space can be reconstructed by

$\hat{\Phi}(x_i) = \sum_{j=1}^{\ell} \beta_{ij} \psi_j$ (12)

where $\beta_{ij}$, the projection of the *i*-th training
datum on the *j*-th feature, is stored after extracting the *j*-th
feature. According to (11), the feature set $\{\psi_j\}_{j=1}^{\ell}$ depends only upon the set of $\ell$ image vectors $\{\Phi(x_{s_j})\}_{j=1}^{\ell}$, where $s_j$ denotes the
subscript of the projected image $\phi_j(x_{s_j})$ that was selected when constructing $\psi_j$. Therefore, after
training, we only need to retain the $\ell$ input vectors $\{x_{s_j}\}_{j=1}^{\ell}$, where $\ell$ is the number of features
extracted.
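A simplified sketch of this greedy iteration, operating purely on the Gram matrix, is given below. It is illustrative only: it normalizes each candidate direction to unit Euclidean length in feature space rather than applying the exact $\ell_1$ handling of (9), and it omits sign bookkeeping; for a Gaussian kernel the diagonal of *K* is 1, so the two normalizations coincide.

```python
import numpy as np

def skfa(K, n_features):
    """Greedy sketch of SKFA-style feature selection on the Gram matrix K.

    Each feature is the (deflated, normalized) image of one training point,
    chosen to maximize the projected variance of the data, then removed from
    all remaining images by Gram-Schmidt deflation, as in eq. (11).
    """
    G = K.astype(float).copy()
    n = G.shape[0]
    selected = []
    for _ in range(n_features):
        norms = np.diag(G).copy()
        norms[norms < 1e-12] = np.inf          # skip exhausted directions
        # Projected variance of the data onto each candidate direction.
        scores = np.sum(G ** 2, axis=1) / (n * norms)
        s = int(np.argmax(scores))
        selected.append(s)
        # Deflate: remove the component along the chosen direction.
        g = G[:, s].copy()
        G -= np.outer(g, g) / g[s]
    return selected

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 5))
K = X @ X.T                                    # linear kernel for illustration
idx = skfa(K, 3)                               # subscripts s_j of the kept inputs
```

After training, only the input vectors indexed by `idx` would need to be retained, which is the sparsity property of SKFA described above.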

In this way, SKFA extracts $\ell$ features, where one assumes $\ell \ll n$. As $O(i n^{2})$ operations are
required to extract the *i*-th feature, the total computational cost for $\ell$ features is $O(\ell^{2} n^{2})$, which is an
improvement over the $O(n^{3})$ operations required by kernel PCA.