Feature Selection and Dimensionality Reduction#
When working with large datasets that have many features, it can be difficult to figure out which features are important and which are not. This is especially true in unsupervised learning problems, where we have a dataset \(\mathbf{x}\) but no regression or classification target value \(y\). Fortunately, there are some powerful dimensionality analysis and reduction techniques that can be applied to these problems. Here, we will focus on the most popular of these techniques: principal component analysis (PCA).
The Correlation Matrix#
In order to identify and extract meaningful features from data, we must first understand how the data is distributed. If the data is normalized (i.e. the transformation \(\mathbf{x} \rightarrow \mathbf{z}\) is applied), then every feature has mean \(\mu = 0\) and standard deviation \(\sigma = 1\); however, significant correlations may still exist between features, making the inclusion of some features redundant. We can see the degree to which any pair of normalized features is correlated by examining the entries of the correlation matrix \(\bar{\Sigma}\), given by:
\[
\bar{\Sigma} = \frac{1}{N-1}\sum_{n=1}^{N} \mathbf{z}_n \mathbf{z}_n^T
\]
where \(\mathbf{z}_1, \mathbf{z}_2, ..., \mathbf{z}_N\) is the normalized dataset.
As a motivating example, let’s examine the correlation matrix of random 3D points that are approximately confined to the plane defined by the equation \(x_3 = 3x_1 -2x_2\). We can generate this dataset using the following Python code:
Important
The covariance matrix \(\Sigma\) is different from the correlation matrix \(\bar{\Sigma}\), though the two are commonly confused with one another. Both matrices are symmetric, with entries given by:
\[
\Sigma_{ij} = \frac{1}{N-1}\sum_{n=1}^{N} (x_{n,i} - \mu_i)(x_{n,j} - \mu_j), \qquad
\bar{\Sigma}_{ij} = \frac{\Sigma_{ij}}{\sigma_i \sigma_j}
\]
The difference between these two matrices is the division of each \((i,j)\) entry by \(\sigma_i\sigma_j\).
import numpy as np
import matplotlib.pyplot as plt
# set x3 as a linear combination of x1 and x2:
data_x1x2 = np.random.uniform(-2,2,size=(2,200))
data_x3 = np.dot(np.array([3,-2]),data_x1x2).reshape(1,-1)
# Add a little bit of noise to x3:
data_x3 += np.random.normal(0, 1, size=data_x3.shape)
# combine x1,x2,x3 features into a dataset (shape: (N,3)):
planar_data = np.vstack([ data_x1x2, data_x3 ]).T
# plot dataset in 3D:
plt.figure()
ax = plt.axes(projection='3d')
ax.scatter3D(planar_data[:,0],
             planar_data[:,1],
             planar_data[:,2])
ax.set_xlabel(r'$x_1$')
ax.set_ylabel(r'$x_2$')
ax.set_zlabel(r'$x_3$')
plt.tight_layout()
plt.show()

Next, we normalize the dataset and compute \(\bar{\Sigma}\) using np.cov:
from sklearn.preprocessing import StandardScaler
# normalize data:
scaler = StandardScaler()
normalized_data = scaler.fit_transform(planar_data)
# compute the covariance matrix of the normalized data,
# which is called the correlation matrix:
cor_mat = np.cov(normalized_data.T)
# visualize the correlation matrix:
max_cor = np.max(np.abs(cor_mat))
plt.matshow(cor_mat, cmap='seismic',
            vmin=-max_cor, vmax=max_cor)
# annotate each entry (x = column index j, y = row index i):
for i, row in enumerate(cor_mat):
    for j, cor in enumerate(row):
        plt.text(j, i, f'{cor:.2f}', ha='center', va='center', fontsize=18)
features = [ r'$x_1$', r'$x_2$', r'$x_3$']
plt.gca().set_xticks([0,1,2])
plt.gca().set_yticks([0,1,2])
plt.gca().set_xticklabels(features)
plt.gca().set_yticklabels(features)
plt.show()

Examining the correlation matrix, we see a strong positive correlation between \(x_1\) and \(x_3\) and a strong negative correlation between \(x_2\) and \(x_3\), which corresponds to the relationship \(x_3 \approx 3x_1 - 2 x_2\). Since \(x_3\) is highly correlated with the other features, it is largely redundant: it carries little information that is not already present in \(x_1\) and \(x_2\).
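To make the distinction in the note above concrete, we can compute the covariance matrix \(\Sigma\) of the raw (unnormalized) data and check that dividing each entry by \(\sigma_i\sigma_j\) recovers the correlation matrix. This is only a quick sanity check using the variables defined above; note that np.corrcoef performs the normalization internally, and that cor_mat differs from it by a constant factor of \(N/(N-1) \approx 1.005\) because StandardScaler divides by \(N\) while np.cov divides by \(N-1\) (this is why the diagonal of \(\bar{\Sigma}\) in this example is about 1.005 rather than exactly 1).

# covariance matrix of the raw (unnormalized) data:
cov_mat = np.cov(planar_data.T)

# dividing each (i,j) entry by sigma_i * sigma_j gives the correlation matrix:
sigmas = np.sqrt(np.diag(cov_mat))
print(cov_mat / np.outer(sigmas, sigmas))

# np.corrcoef computes the same matrix directly from the raw data:
print(np.corrcoef(planar_data.T))

# cor_mat (np.cov of the StandardScaler-normalized data) matches these
# up to a constant factor of N/(N-1):
N = len(planar_data)
print(cor_mat * (N - 1) / N)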
The Correlation Matrix in Supervised Learning
In supervised learning (where the dataset contains \((\mathbf{x},y)\) pairs, not just \(\mathbf{x}\) values) the correlation matrix and also be used to quantify the linear relationship between features and the output label \(y\). This is done by simply appending the corresponding \(y\) to the end of each \(\mathbf{x}\), normalizing this vector, and then computing the correlation matrix.
The values in this matrix that correspond to the correlation of \(y\) with each feature in \(\mathbf{x}\) can be used to reduce the dimensionality of the \(\mathbf{x}\) data. Specifically, features with the weakest correlation with \(y\) can be dropped, resulting in a much smaller feature vector. Since the dropped features had low correlation with \(y\), it is likely that it will not cause a drop in model accuracy. In fact, this will sometimes result an increase in model accuracy.
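Below is a minimal sketch of this procedure. The dataset here (data_x, data_y) is a made-up example used only for illustration, and the correlation threshold of 0.2 is an arbitrary choice; np.corrcoef handles the normalization step internally.

# hypothetical supervised dataset: 200 samples, 5 features, 1 target
rng = np.random.default_rng(0)
data_x = rng.normal(size=(200, 5))
data_y = 2.0*data_x[:, 0] - 1.0*data_x[:, 3] + rng.normal(0, 0.1, size=200)

# append y to each feature vector and compute the correlation matrix:
xy = np.column_stack([data_x, data_y])
cor_xy = np.corrcoef(xy.T)          # shape (6, 6)

# last row (excluding the corner) holds the correlation of y with each feature:
y_cor = cor_xy[-1, :-1]
print('feature-target correlations:', y_cor)

# keep only the features most correlated (in absolute value) with y:
keep = np.abs(y_cor) > 0.2          # threshold chosen arbitrarily for illustration
reduced_x = data_x[:, keep]
print('reduced feature matrix shape:', reduced_x.shape)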
Principal Components Analysis#
Because the correlation matrix is symmetric, we can diagonalize it by writing it as the product:
\[
\bar{\Sigma} = \mathbf{P} \mathbf{D} \mathbf{P}^T
\]
where \(\mathbf{D}\) is a diagonal matrix containing the eigenvalues of \(\bar{\Sigma}\) and \(\mathbf{P}\) is an orthogonal matrix (i.e. \(\mathbf{P}^T = \mathbf{P}^{-1}\)). The columns \(\mathbf{p}_1, \mathbf{p}_2, ..., \mathbf{p}_D\) of \(\mathbf{P}\) are called the principal components of the dataset. The principal components are vectors of magnitude \(1\) that are pairwise orthogonal, that is:
\[
\mathbf{p}_i^T \mathbf{p}_j = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases}
\]
For our example dataset, we can compute the principal component matrix \(\mathbf{P}\) and eigenvalue matrix \(\mathbf{D}\) by diagonalizing \(\bar{\Sigma}\) as follows:
# diagonalize the correlation matrix
# (np.linalg.eigh returns eigenvalues in ascending order):
D_diag, P = np.linalg.eigh(cor_mat)
# make D_diag into a diagonal matrix:
D = np.diag(D_diag)
print('P matrix:')
print(P)
print('\nD matrix:')
print(D)
print('\nP @ D @ P.T (correlation matrix):')
print(P @ D @ P.T)
P matrix:
[[-0.58601984 -0.5588232   0.58676859]
 [ 0.4044678  -0.82921035 -0.38576677]
 [ 0.70213     0.01126202  0.71195971]]

D matrix:
[[0.03450965 0.         0.        ]
 [0.         1.03164571 0.        ]
 [0.         0.         1.94892002]]

P @ D @ P.T (correlation matrix):
[[ 1.00502513  0.02871695  0.79348019]
 [ 0.02871695  1.00502513 -0.5351054 ]
 [ 0.79348019 -0.5351054   1.00502513]]
Each principal component \(\mathbf{p}_i\) (the \(i\)th column of \(\mathbf{P}\)) has an associated eigenvalue \(\lambda_i\), which is the \(i\)th entry along the diagonal of \(\mathbf{D}\). The eigenvalue \(\lambda_i\) measures the variance of the data along the direction \(\mathbf{p}_i\). The principal component with the largest \(\lambda_i\) is called the first principal component, since it points in the direction that “accounts for” the largest share of the variance of the data. Similarly, the second principal component points in the direction that “accounts for” the most variance not captured by the first principal component, and so on. Here, we will denote the first principal component as \(\mathbf{p}^{(1)}\), the second as \(\mathbf{p}^{(2)}\), and so on. We will use the same notation for the principal component eigenvalues, i.e. \(\lambda^{(1)}, \lambda^{(2)}\), etc.
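We can check this interpretation numerically: projecting the normalized data onto each column \(\mathbf{p}_i\) and taking the sample variance of the projections should reproduce the corresponding eigenvalue \(\lambda_i\). A quick sketch using the normalized_data, P, and D_diag computed above (ddof=1 matches the convention used by np.cov):

# variance of the data along each eigenvector direction should equal
# the corresponding eigenvalue:
for i in range(3):
    proj = normalized_data @ P[:, i]
    print(f'variance along p_{i+1}: {np.var(proj, ddof=1):.4f}   '
          f'eigenvalue: {D_diag[i]:.4f}')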
Examining the printout of the \(\mathbf{D}\) matrix above, we see that \(\mathbf{p}^{(1)} = \mathbf{p}_3\) (the first principal component is the third column of \(\mathbf{P}\)) and \(\mathbf{p}^{(2)} = \mathbf{p}_2\) (the second principal component is the second column of \(\mathbf{P}\)). The corresponding eigenvalues are \(\lambda^{(1)} \approx 1.949\) and \(\lambda^{(2)} \approx 1.032\). However, we observe that \(\lambda^{(3)} \approx 0.035 \ll \lambda^{(1)}, \lambda^{(2)}\), which suggests that the third principal component accounts for very little of the variance in the data. This is due to the fact that the data is approximately confined to a 2D plane embedded in a larger 3D space.
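Since np.linalg.eigh returns eigenvalues in ascending order, the first principal component is the last column of P. A small sketch (reusing the P and D_diag from above) that reorders the eigenvalues and components into the descending order \(\lambda^{(1)} \geq \lambda^{(2)} \geq \lambda^{(3)}\) used in the text:

# sort eigenvalues (and matching eigenvector columns) in descending order:
order = np.argsort(D_diag)[::-1]
eigvals_sorted = D_diag[order]   # lambda^(1), lambda^(2), lambda^(3)
P_sorted = P[:, order]           # columns are p^(1), p^(2), p^(3)
print('sorted eigenvalues:', eigvals_sorted)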
One of the most powerful aspects of principal component analysis (often abbreviated PCA) is that we can project the normalized data onto the subset of principal components that are significant (i.e. have large \(\lambda_i\)), thereby reducing the dimensionality of the data while retaining as much of the variance as possible in the reduced data.
To project a normalized feature vector \(\mathbf{z}\) onto the first \(k\) principal components, we first write it as a linear combination of all the principal components \(\mathbf{p}^{(1)}, ..., \mathbf{p}^{(D)}\):
\[
\mathbf{z} = u_1\,\mathbf{p}^{(1)} + u_2\,\mathbf{p}^{(2)} + ... + u_D\,\mathbf{p}^{(D)} = \sum_{i=1}^{D} u_i\,\mathbf{p}^{(i)}
\]
Next, we solve for the coefficients \(u_i\). Since the \(\mathbf{p}^{(i)}\) are orthonormal basis vectors, each coefficient can be computed independently of the others by taking a dot product:
\[
u_i = \left(\mathbf{p}^{(i)}\right)^T \mathbf{z}
\]
The vector of the first \(k\) coefficients \(\mathbf{u} = \begin{bmatrix} u_1 & u_2 & ... & u_k \end{bmatrix}^T\) is the reduced \(k\)-dimensional representation of \(\mathbf{z}\). This vector is the projection of the data onto the first \(k\) principal components.
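In code, projecting onto the first \(k\) principal components amounts to a matrix product between the normalized data and the first \(k\) principal-component columns. A minimal sketch, reusing normalized_data from earlier and P_sorted from the reordering sketch above, with \(k = 2\) (since the first two components capture nearly all of the variance):

# each row of u_data holds the coefficients (u_1, ..., u_k) of one data point:
k = 2
P_k = P_sorted[:, :k]           # first k principal components as columns
u_data = normalized_data @ P_k  # shape (N, k)
print('reduced data shape:', u_data.shape)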
PCA Dimension Reduction#
The sklearn Python package has functionality that makes PCA dimensionality reduction very easy. To compute the 2D PCA embedding of the 3D dataset we have been working with so far, we can use sklearn.decomposition.PCA and visualize the projected data as follows:
from sklearn.decomposition import PCA
# project normalized data onto the
# first two principal components:
pca = PCA(n_components=2)
pca.fit(normalized_data)
pc_data = pca.transform(normalized_data)
# plot projected data:
plt.figure()
plt.scatter(pc_data[:,0], pc_data[:,1])
plt.xlabel(r'$u_1$')
plt.ylabel(r'$u_2$')
plt.show()
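As a sanity check (not part of the workflow above), we can confirm that sklearn's PCA agrees with the manual eigendecomposition: its explained_variance_ attribute should match the largest eigenvalues of \(\bar{\Sigma}\), and its projections should match u_data up to an arbitrary sign flip of each component. This assumes the eigvals_sorted, P_sorted, and u_data variables from the earlier sketches:

# explained variances should match the top eigenvalues of the correlation matrix:
print('sklearn explained variances:', pca.explained_variance_)
print('eigenvalues (sorted):       ', eigvals_sorted[:2])

# component directions and projections agree up to a per-component sign flip:
print(np.allclose(np.abs(pca.components_), np.abs(P_sorted[:, :2].T)))
print(np.allclose(np.abs(pc_data), np.abs(u_data)))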

Exercises#
Exercise 1: Applying PCA
Let’s get some practice working with PCA in the sklearn package. Consider the following 30-dimensional dataset sampled from a multivariate normal distribution:
from scipy.stats import ortho_group, multivariate_normal
# generate multivariate normal random dataset:
D = np.diag([1e-2]*20+list(np.random.uniform(1,10,10)))
U = ortho_group.rvs(D.shape[0])
mu = np.random.uniform(0,100, size=30)
data_x = multivariate_normal.rvs(mean=mu,cov=(U @ D @ U.T), size=4000)
To start, let’s try to determine the approximate dimensionality of the data. First, let’s take a look at all 30 principal components. Normalize the data and fit an instance of sklearn.decomposition.PCA to it with n_components=30. Then take a look at the fitted PCA object’s explained_variance_ attribute.
# fit a full PCA to data to determine explained variances:
full_pca = PCA(n_components=30)
full_pca.fit(data_z)
# extract explained variances (entries of D):
variances = full_pca.explained_variance_
This is an array containing the eigenvalues of \(\bar{\Sigma}\) corresponding to each principal component (i.e. the diagonal of \(\mathbf{D}\), sorted in descending order). Plot these values and try to determine the underlying dimensionality of the data (you should see a significant drop in explained variance at the number of underlying dimensions).
Once you determine the underlying number of dimensions, use another PCA object with that number of components to reduce the dimensionality of the data. Plot the projections of the normalized data onto the first and last of these principal components (i.e. \(u_1\) vs. \(u_k\), where \(k\) is the estimated dimensionality of the data).
Solutions#
Exercise 1: Applying PCA#
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from scipy.stats import ortho_group, multivariate_normal
# generate multivariate normal random dataset:
D = np.diag([1e-2]*20+list(np.linspace(1,3,10)**1.2))
U = ortho_group.rvs(D.shape[0])
mu = np.random.uniform(0,100, size=30)
data_x = multivariate_normal.rvs(mean=mu,cov=(U @ D @ U.T), size=4000)
# normalize dataset:
scaler = StandardScaler()
scaler.fit(data_x)
data_z = scaler.transform(data_x)
# fit a full PCA to data to determine explained variances:
full_pca = PCA(n_components=30)
full_pca.fit(data_z)
# extract explained variances (entries of D):
variances = full_pca.explained_variance_
# plot explained variance versus p.c. number:
plt.figure()
plt.bar(np.arange(1,len(variances)+1), variances)
plt.xlabel('Principal component')
plt.ylabel('Explained variance')
plt.axvline(10.5, color='r', linestyle=':', label='Recommended cutoff\n(k=10 dimensions)')
plt.legend()
plt.show()
# use a 10-dimensional PCA to reduce data:
partial_pca = PCA(n_components=10)
partial_pca.fit(data_z)
data_u = partial_pca.transform(data_z)
# plot u1 versus u10 to observe differences:
plt.figure()
plt.title('Projections onto Principal Components')
plt.grid()
plt.scatter(data_u[:,0], data_u[:,-1])
plt.xlabel(r'$u_1$')
plt.ylabel(r'$u_{10}$')
plt.show()
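As an optional follow-up (not required by the exercise), we can double-check the choice of \(k = 10\) by plotting the cumulative explained variance ratio from the fitted full_pca object; the curve should level off once roughly 10 components have been included:

# cumulative fraction of the total variance captured by the first k components:
cum_ratio = np.cumsum(full_pca.explained_variance_ratio_)
plt.figure()
plt.plot(np.arange(1, len(cum_ratio)+1), cum_ratio, marker='o')
plt.axvline(10, color='r', linestyle=':', label='k = 10')
plt.xlabel('Number of components $k$')
plt.ylabel('Cumulative explained variance ratio')
plt.legend()
plt.show()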

