Advanced Regression Models#

In the last section, we learned how regression models can be fit to data using the gradient descent learning algorithm. Interestingly, for certain regression models, the optimal “best fit” solution can be found in closed form. The simplest of these closed-form models is multivariate linear regression.

Multivariate Linear Regression#

As the name suggests, multivariate linear regression models are generalizations of linear regression models where \(\mathbf{x}\) has many dimensions. A linear regression model for a \(D\)-dimensional feature vector \(\mathbf{x}\) takes the form:

\[\hat{y} = f(\mathbf{x}) = w_0 + \sum_{i=1}^D w_ix_i\]

where \(w_0\) and the \(w_i\) are the weights that must be learned. If we define \(\underline{\mathbf{x}} = \begin{bmatrix} 1 & x_1 & x_2 & ... & x_D \end{bmatrix}^T\) (the vector \(\mathbf{x}\) prepended with \(1\)), we can write \(f(\mathbf{x})\) as an inner product of \(\underline{\mathbf{x}}\) with the weight vector \(\mathbf{w} = \begin{bmatrix} w_0 & w_1 & w_2 & ... & w_D \end{bmatrix}^T\):

\[\hat{y} = f(\mathbf{x}) = \underline{\mathbf{x}}^T\mathbf{w}\]

For linear regression models, it is helpful to represent a dataset \(\{ (\mathbf{x}_n,y_n) \}_{n=1}^N\) as a matrix-vector pair \((\mathbf{X},\mathbf{y})\), given by:

\[\begin{split}\mathbf{X} = \begin{bmatrix} \underline{\mathbf{x}_1}^T \\ \underline{\mathbf{x}_2}^T \\ \vdots \\ \underline{\mathbf{x}_N}^T \end{bmatrix},\qquad\qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}\end{split}\]

In terms of \(\mathbf{X}\) and \(\mathbf{y}\), the task of fitting the linear regression model reduces to solving for the weights \(\mathbf{w}\) satisfying the matrix equation \(\mathbf{X}\mathbf{w} \approx \mathbf{y}\). Since most datasets have some degree of noise, it is usually impossible to find a weight vector \(\mathbf{w}\) for which \(\mathbf{X}\mathbf{w} = \mathbf{y}\) exactly. Instead, we must settle for weights \(\mathbf{w}\) that minimize some loss function. The most popular choice of loss function for linear regression is the mean square error (MSE). In terms of \(\mathbf{X}\) and \(\mathbf{y}\), the MSE can be written as:

\[\mathcal{E}(f) = \frac{1}{N}(\mathbf{X}\mathbf{w} -\mathbf{y})^T(\mathbf{X}\mathbf{w} - \mathbf{y})\]

It can be shown that the weights \(\mathbf{w}\) minimizing \(\mathcal{E}(f)\) can be computed in closed form as \(\mathbf{w} = \mathbf{X}^+\mathbf{y}\), where \(\mathbf{X}^+\) is the Moore-Penrose inverse (sometimes called the pseudo-inverse) of \(\mathbf{X}\). For a sufficiently well-conditioned data matrix \(\mathbf{X}\), the weights can be computed as:

\[\mathbf{w} = \mathbf{X}^{+}\mathbf{y} = \left( (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\right)\mathbf{y}\]

For most applications, fitting linear regression models with the Moore-Penrose inverse is preferred to other methods, such as gradient descent.

Tip

To compute the Moore-Penrose inverse in Python, you can use the np.linalg.pinv function.
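For example, a minimal sketch of a closed-form fit on made-up synthetic data (the dataset, true weights, and variable names below are purely illustrative) might look like this:

import numpy as np

# make up a small synthetic dataset (N = 50 points, D = 3 features):
np.random.seed(0)
X_features = np.random.uniform(0, 1, size=(50, 3))
y_vector = 2.0 + X_features @ np.array([1.0, -3.0, 0.5]) + \
           np.random.normal(0, 0.1, size=50)

# prepend a column of ones to form the data matrix X:
X_matrix = np.hstack([np.ones((50, 1)), X_features])

# closed-form fit: w = X^+ y
w = np.linalg.pinv(X_matrix) @ y_vector

# evaluate predictions and the mean square error:
yhat = X_matrix @ w
mse = np.mean((yhat - y_vector)**2)
print('w:', w)      # should be close to [2, 1, -3, 0.5]
print('MSE:', mse)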

High-Dimensional Embeddings#

While linear regression models are quite powerful and admit an optimal closed-form fit, they often fail to yield good predictions on data with non-linear trends. Fortunately, we can extend the closed-form solution of linear regression to work with some non-linear models as well. The general idea behind this extension is to embed the feature vector \(\mathbf{x}\) into a high-dimensional space where linear regression can be applied. Often, this embedding is a non-linear function of \(\mathbf{x}\), which allows the model to perform regression on non-linear functions of the features \(\mathbf{x}\). These embedding models have the general form:

\[ f(\mathbf{x}) = w_0 + \sum_{j=1}^{D_{emb}} w_j \phi_j(\mathbf{x})\]

where the \(\phi_j: \mathbb{R}^{D} \rightarrow \mathbb{R}\) are the embedding functions and \(D_{emb}\) is the dimension of the embedding. In most cases, \(D_{emb} > D\). As with standard multivariate linear regression, the MSE loss is commonly used. The MSE loss of the embedding model is:

\[\mathcal{E}(f) = \frac{1}{N}(\mathbf{\Phi}(\mathbf{X})\mathbf{w} - \mathbf{y})^T(\mathbf{\Phi}(\mathbf{X})\mathbf{w} - \mathbf{y})\]

where \(\mathbf{\Phi}(\mathbf{X})\) is the embedding of the data matrix \(\mathbf{X}\):

\[\begin{split}\mathbf{\Phi}(\mathbf{X}) = \begin{bmatrix} 1 & \phi_1(\mathbf{x}_1) & \phi_2(\mathbf{x}_1) & \dots & \phi_{D_{emb}}(\mathbf{x}_1) \\ 1 & \phi_1(\mathbf{x}_2) & \phi_2(\mathbf{x}_2) & \dots & \phi_{D_{emb}}(\mathbf{x}_2) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(\mathbf{x}_N) & \phi_2(\mathbf{x}_N) & \dots & \phi_{D_{emb}}(\mathbf{x}_N) \end{bmatrix}\end{split}\]

The MSE-minimizing weight vector \(\mathbf{w}\) can be solved for using the Moore-Penrose inverse: \(\mathbf{w} = \mathbf{\Phi}(\mathbf{X})^{+}\mathbf{y}\).
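As a sketch of how this works in practice, the code below fits an embedding model to made-up 2D data using the arbitrarily chosen embedding functions \(\phi_1(\mathbf{x}) = x_1\), \(\phi_2(\mathbf{x}) = x_2\), and \(\phi_3(\mathbf{x}) = x_1 x_2\) (the data and embedding functions are assumptions made just for this illustration):

import numpy as np

# illustrative 2D dataset (features x1 and x2):
np.random.seed(0)
X = np.random.uniform(-1, 1, size=(100, 2))
y = 1.0 + X[:,0] - 2.0*X[:,1] + 3.0*X[:,0]*X[:,1] + \
    np.random.normal(0, 0.05, size=100)

# embedding with D_emb = 3: phi_1(x) = x1, phi_2(x) = x2, phi_3(x) = x1*x2
Phi = np.column_stack([
    np.ones(len(X)),   # constant column (for w_0)
    X[:,0],            # phi_1
    X[:,1],            # phi_2
    X[:,0]*X[:,1],     # phi_3 (non-linear in x)
])

# closed-form fit via the Moore-Penrose inverse:
w = np.linalg.pinv(Phi) @ y
print('w:', w)   # should be close to [1, 1, -2, 3]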

Example: Polynomial Fits of Single-Variable Functions#

We have already studied the problem of fitting a polynomial of degree \(D\) to single-variable data using gradient descent. It turns out that we can obtain a closed-form best fit using the method of high-dimensional embeddings, where the \(\phi_j\) functions are simply powers of \(\mathbf{x}\): \(\phi_j(\mathbf{x}) = x^j\) (since \(\mathbf{x}\) is a vector of length 1, we can treat it as a scalar \(x\)). The embedding matrix \(\mathbf{\Phi}(\mathbf{X})\) for a polynomial model is known as a Vandermonde matrix:

\[\begin{split}\mathbf{\Phi}(\mathbf{X}) = \begin{bmatrix} 1 & x_1 & x_1^2 & \dots & x_1^{D_{emb}} \\ 1 & x_2 & x_2^2 & \dots & x_2^{D_{emb}} \\ 1 & x_3 & x_3^2 & \dots & x_3^{D_{emb}} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_N & x_N^2 & \dots & x_N^{D_{emb}} \end{bmatrix}\end{split}\]

One can compute the optimal weights \(\mathbf{w}\) of a degree \(D_{emb}\) polynomial fit by computing the Moore-Penrose inverse of this matrix. In fact, this least-squares approach is essentially what the np.polyfit function, which we have used before, performs internally.
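As a rough sanity check (the data below is made up for illustration), we can build the Vandermonde matrix with np.vander, fit the weights with the Moore-Penrose inverse, and compare against np.polyfit; the two sets of coefficients should agree closely:

import numpy as np

# small illustrative 1D dataset:
np.random.seed(0)
x = np.linspace(0, 10, 15)
y = 0.5*x**2 - 2.0*x + 1.0 + np.random.normal(0, 0.5, size=len(x))

# degree-2 (quadratic) fit:
deg = 2

# Vandermonde embedding matrix with columns [1, x, x^2]:
Phi = np.vander(x, deg+1, increasing=True)

# closed-form fit via the Moore-Penrose inverse:
w = np.linalg.pinv(Phi) @ y

# np.polyfit returns coefficients in decreasing-power order, so reverse them:
w_polyfit = np.polyfit(x, y, deg)[::-1]

print('w (pseudo-inverse):', w)
print('w (np.polyfit):    ', w_polyfit)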

Underfitting and Overfitting#

One of the benefits of using high-dimensional embeddings of data is that they allow many different possible non-linear functions to be incorporated into the model \(f(\mathbf{x})\), allowing for high model flexibility. This flexibility, however, can come at a significant cost: the more weights a model has, the more likely it is to overfit the data. What do we mean by overfitting? Overfitting occurs when the predictions made by a fitted model correspond too closely to the training data and therefore fail to generalize to unseen data outside the training dataset. Put simply, overfitted models tend to “memorize” the training data instead of “learn” from it.

If a model is too inflexible, it may be subject to the opposite problem: underfitting. Underfitting occurs when the model is unable to capture the underlying trends of the data. Typically, underfitting occurs as a result of poor model choice. A good model is one that strikes a balance between these extremes, being just flexible enough to learn the underlying trends of the training data, but not so flexible that it simply “memorizes” the training data.

To help illustrate underfitting and overfitting, let’s return to the familiar example of fitting 1D data with polynomials:

import numpy as np
import matplotlib.pyplot as plt

# seed random number generator:
np.random.seed(0)

# define data distribution:
def y_distribution(x):
    return -0.1*(x-5)**2 + 1.0 + \
            np.random.normal(0,0.3,size=len(x))

# generate training data:
data_n = 12
data_x = np.linspace(0,10,data_n)
data_y = y_distribution(data_x)

# fit data to a line (degree 1 polynomial):
xy_linefit = np.poly1d(np.polyfit(data_x,data_y,deg=1))

# fit data to a quadratic (degree 2 polynomial):
xy_quadfit = np.poly1d(np.polyfit(data_x,data_y,deg=2))

# fit a degree N-1 polynomial to the data:
xy_polyfit = np.poly1d(np.polyfit(data_x,data_y,deg=data_n-1))

# plot data and models for comparison:
plt.figure()
eval_x = np.linspace(0,10,1000)
plt.scatter(data_x,data_y, color='k', label=r'Training Data $(x,y)$')
plt.plot(eval_x, xy_linefit(eval_x), label=r'Linear Fit ($D_{emb} = 1$)')
plt.plot(eval_x, xy_quadfit(eval_x), label=r'Quadratic Fit ($D_{emb} = 2$)')
plt.plot(eval_x, xy_polyfit(eval_x), 
         label=r'Polynomial Fit ($D_{emb} = ' + f'{data_n-1}' + '$)')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
[Figure: training data plotted alongside the linear (D_emb = 1), quadratic (D_emb = 2), and degree-11 polynomial fits]

In the plot above, the polynomial with \(D_{emb} = 11\) fits the data perfectly; however, we can see that it interpolates between data points very poorly, especially at the edges of the data distribution, where \(x \approx 0\) and \(x \approx 10\). It is very likely that this model is overfitting the data. On the other hand, we see that the linear model (\(D_{emb} = 1\)) is not flexible enough to capture the non-linearities of the data. As a result, it is very likely that this model is underfitting the data. The quadratic fit (\(D_{emb} = 2\)) seems to neither overfit nor underfit the data, which suggests it is the best model for this training dataset.

Identifying Overfitting#

The easiest way of identifying whether a model is overfitting the data is by comparing the model’s error on the training set to its error on the validation set. Since the model is only fit to the training set data, the validation set error gives us an idea of how accurate the model is on unseen data. If the validation error is significantly higher than the training error and the training error is close to its minimum, the model is likely overfitting the data. On the other hand, if both the training and validation error are high, the model is likely underfitting the data.

To illustrate this, we plot the training and validation error of our 1D polynomial models and indicate the regions of overfitting and underfitting:

# validation dataset (for the sake of illustration):
val_data_x = np.linspace(-1,11, 5)
val_data_y = y_distribution(val_data_x)

# embedding dimensions to try:
embedding_dims = list(range(1, 8))

# initialize lists to store results:
models = []
train_errors = []
validation_errors = []

for d in embedding_dims:

    # fit the model to the training set:
    xy_fit = np.poly1d(np.polyfit(data_x, data_y, deg=d))
    
    # compute mean square error for training and validation set:
    train_error = np.mean((xy_fit(data_x) - data_y)**2)
    validation_error = np.mean((xy_fit(val_data_x) - val_data_y)**2)

    # record results:
    models.append(xy_fit)
    train_errors.append(train_error)
    validation_errors.append(validation_error)

# identify regions of under/overfitting:
underfit_region = (0.5, 1.5)
overfit_region = (5.5,7.5)

# plot train/validation error versus embedding dimension curve:
plt.figure(figsize=(6.5,2))
plt.plot(embedding_dims, train_errors, 'o:', label='Training Set Error')
plt.plot(embedding_dims, validation_errors, 'o:', label='Validation Set Error')
plt.ylim((-0.2,6.0))
plt.xlim((min(underfit_region),max(overfit_region)))
plt.xlabel('Polynomial Degree ($D_{emb}$)')
plt.ylabel(r'$\mathcal{E}(f)$ (Mean Square Error)')
plt.axvspan(*underfit_region, alpha=0.4, color='orange', label='Underfitting')
plt.axvspan(*overfit_region, alpha=0.3, color='red', label='Overfitting')
plt.legend()
plt.show()
[Figure: training and validation set MSE versus polynomial degree D_emb, with the underfitting and overfitting regions highlighted]

We can see that some underfitting occurs with the linear model (\(D_{emb} = 1\)). We also observe that for polynomials with degree \(D_{emb} > 5\), the training error approaches \(0\) while the validation error begins to increase. This is the characteristic symptom of overfitting.

Regularization#

There are a few ways to deal with overfitting. One way is to collect more data: as the size and diversity of the training dataset increase, it becomes harder for the model to “memorize” the dataset, which reduces overfitting. Another way to deal with overfitting is to try a less complex model. In general, the fewer weights a model has, the less prone it is to overfitting. There is also a third method for reducing overfitting in some models that requires changing neither the model nor the dataset. This is called regularization.

The most common forms of regularization incorporate a “model complexity penalty” term directly into the model loss function \(\mathcal{E}(f)\). For linear regression problems (both with and without an embedding \(\mathbf{\Phi}\)), a popular choice of regularization term is the sum of squares of the model weights multiplied by a constant \(\lambda\), called the regularization parameter:

\[\text{ Penalty Term } = \lambda \sum_{j} w_j^2 = \lambda(\mathbf{w}^T\mathbf{w})\]

As the value of the regularization parameter \(\lambda\) increases, the model is penalized more for having weights with large magnitudes, which reduces the model’s ability to overfit data. When this sum of squares penalty term is added to the mean square error loss function of a linear regression model, the resulting regression problem is called Ridge regression:

\[\mathcal{E}(f) = \frac{1}{N}(\mathbf{\Phi}(\mathbf{X})\mathbf{w} - \mathbf{y})^T(\mathbf{\Phi}(\mathbf{X})\mathbf{w} - \mathbf{y}) + \underbrace{\lambda(\mathbf{w}^T\mathbf{w})}_{\text{regularization term}}\]

For any value of \(\lambda \geq 0\), the optimal weights \(\mathbf{w}\) for a Ridge regression problem can be computed in closed form:

\[\mathbf{w} = \left((\mathbf{\Phi}(\mathbf{X})^T\mathbf{\Phi}(\mathbf{X}) + \lambda\mathbf{I})^{-1} \mathbf{\Phi}(\mathbf{X})^T \right) \mathbf{y}\]

Note

Take note of how the term

\[(\mathbf{\Phi}(\mathbf{X})^T\mathbf{\Phi}(\mathbf{X}) + \lambda \mathbf{I})^{-1}\mathbf{\Phi}(\mathbf{X})^T\]

is a “regularized” form of the Moore-Penrose inverse of \(\mathbf{\Phi}(\mathbf{X})\):

\[\mathbf{\Phi}(\mathbf{X})^+ = (\mathbf{\Phi}(\mathbf{X})^T\mathbf{\Phi}(\mathbf{X}))^{-1}\mathbf{\Phi}(\mathbf{X})^T.\]

As expected, the two agree when \(\lambda = 0\) (i.e. no regularization is applied).
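The sketch below applies this closed-form Ridge solution to an illustrative polynomial-embedding example (the data, embedding degree, and value of \(\lambda\) are arbitrary choices made for demonstration):

import numpy as np

# illustrative 1D data with a degree-3 polynomial embedding:
np.random.seed(0)
x = np.linspace(0, 10, 12)
y = -0.1*(x-5)**2 + 1.0 + np.random.normal(0, 0.3, size=len(x))
Phi = np.vander(x, 4, increasing=True)   # columns [1, x, x^2, x^3]

lambda_reg = 0.1   # regularization parameter (arbitrary illustrative value)

# closed-form Ridge solution: w = (Phi^T Phi + lambda*I)^(-1) Phi^T y
I = np.eye(Phi.shape[1])
w_ridge = np.linalg.inv(Phi.T @ Phi + lambda_reg*I) @ Phi.T @ y

# setting lambda = 0 recovers the ordinary (pseudo-inverse) least-squares fit:
w_unregularized = np.linalg.pinv(Phi) @ y

print('Ridge weights:        ', w_ridge)
print('Unregularized weights:', w_unregularized)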

Another popular form of regularization is Lasso regression, which imposes a penalty term proportional to the sum of absolute values of the weights:

\[\mathcal{E}(f) = \frac{1}{N}(\mathbf{\Phi}(\mathbf{X})\mathbf{w} - \mathbf{y})^T(\mathbf{\Phi}(\mathbf{X})\mathbf{w} - \mathbf{y}) + \underbrace{\lambda \sum_{j=1}^{D_{emb}} |w_j|}_{\text{regularization term}}\]

Unlike Ridge regression, Lasso regression does not readily admit a closed-form solution for the optimal weights \(\mathbf{w}\).
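Instead, Lasso models are typically fit with iterative solvers such as coordinate descent. As a sketch, scikit-learn’s sklearn.linear_model.Lasso can be used much like the Ridge object that appears in the exercises below; note that scikit-learn scales its loss slightly differently, so its alpha argument only roughly corresponds to the \(\lambda\) defined above (the data here is made up for illustration):

import numpy as np
from sklearn.linear_model import Lasso

# illustrative dataset: 5 features, but only the first two actually affect y
np.random.seed(0)
X = np.random.normal(0, 1, size=(100, 5))
y = 3.0*X[:,0] - 2.0*X[:,1] + np.random.normal(0, 0.1, size=100)

# fit a Lasso model (alpha plays the role of the regularization parameter):
model = Lasso(alpha=0.1)
model.fit(X, y)

print('intercept (w_0):', model.intercept_)
print('weights:', model.coef_)   # weights of the irrelevant features are driven to (or near) zero

A distinctive property of the absolute-value penalty is that it tends to drive some weights exactly to zero, so Lasso regression can also act as a form of feature selection.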

Exercises#

Exercise 1: Computing the Moore-Penrose Inverse

Consider the following data matrix \(\mathbf{X}\) and label vector \(\mathbf{y}\):

y_vector = np.array(np.arange(8)).T # shape: (8,)
X_matrix = np.hstack([
    np.ones((8,1)),
    np.sqrt(np.arange(8*4)).reshape(8,4)
]) # shape: (8,5)

Compute the weight vector \(\mathbf{w} = \mathbf{X}^+\mathbf{y}\) for a standard multivariate linear regression model using two different methods:

  1. Compute \(\mathbf{X}^{+}\) using the np.linalg.pinv function

  2. Compute \(\mathbf{X}^{+}\) using np.linalg.inv and the formula:

\[\mathbf{X}^+ = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^T\]

Compare the weights \(\mathbf{w}\) computed from each method and verify that they are roughly the same. Then compute the mean square error (MSE) of the linear regression model \(f(\mathbf{x}) = \mathbf{w}^{T}\underline{\mathbf{x}}\). Do not worry about normalizing the data in \(\mathbf{X}\).

Exercise 2: Regularized Linear Regression

Let’s get some practice working with Ridge regression. To start, copy & paste the following code to generate a training and validation dataset:

import numpy as np

def generate_X(n_data):
    np.random.seed(0) 
    return np.array([
        np.random.uniform(0,10,n_data)*(10**(-n/4))
        for n in range(40)
    ]).T

def generate_y(X, noise=0.5):
    np.random.seed(0)
    return np.array([
        np.sum([ 2e-6*(10**(n/4))*x_n for n,x_n in enumerate(x) ]) + \
            np.random.normal(0,noise)
        for x in X
    ])

# generate training set:
X_train = generate_X(45)
y_train = generate_y(X_train, noise=5.0)


# generate validation set:
X_validation = generate_X(20)
y_validation = generate_y(X_validation, noise=0.0)

First, normalize the data and then fit a Ridge regression model to the training data using sklearn.linear_model.Ridge. Plot the training set and validation set mean square error versus the regularization parameter \(\lambda\) as \(\lambda\) is varied from \(0\) to \(10\) (Note: \(\lambda\) corresponds to the argument alpha of the Ridge object). You should see that the model overfits the training set for \(\lambda = 0\) (no regularization), but as \(\lambda\) increases, the validation set error should decrease.


Hint: To fit a Ridge regression model to a normalized data matrix Z_train and labels y_train, and make predictions on a validation dataset Z_validation, you can use the following code:

from sklearn.linear_model import Ridge

lambda_reg = 0.5 # regularization parameter

# fit model to training data:
model = Ridge(alpha=lambda_reg)
model.fit(Z_train,y_train)

# make predictions on validation set data:
yhat_validation = model.predict(Z_validation)

Solutions:#

Exercise 1: Computing the Moore-Penrose Inverse#

import numpy as np

# given data matrix (X) and target vector (y):
y_vector = np.array(np.arange(8)).T # shape: (8,)
X_matrix = np.hstack([
    np.ones((8,1)), 
    np.sqrt(np.arange(8*4)).reshape(8,4)
]) # shape: (8,5)

# compute weights using np.linalg.pinv:
w_numpy = np.linalg.pinv(X_matrix) @ y_vector

# print np.pinv weight vector:
print('w (computed using np.pinv):')
print(w_numpy)

# compute weight vector using the formula:
X_pinv = np.linalg.inv(X_matrix.T @ X_matrix) @ X_matrix.T
w_formula = X_pinv @ y_vector

# print formula weight vector:
print('w (computed with formula):')
print(w_formula)

# compute the MSE of the linear regression model:
yhat_vector = X_matrix @ w_numpy
mse = np.mean( (yhat_vector - y_vector)**2 )
print('\nMean Square Error: ', mse)
w (computed using np.pinv):
[ -19.13512146  -12.39983252  184.11013704 -396.71057275  228.66445803]
w (computed with formula):
[ -19.13512504  -12.39983948  184.11022702 -396.71075017  228.66455279]

Mean Square Error:  6.017971108799712e-05

Exercise 2: Regularized Linear Regression#

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

def generate_X(n_data):
    np.random.seed(0)
    return np.array([
        np.random.uniform(0,10,n_data)*(10**(-n/4))
        for n in range(40)
    ]).T

def generate_y(X, noise=0.5):
    np.random.seed(0)
    return np.array([
        np.sum([ 2e-6*(10**(n/4))*x_n for n,x_n in enumerate(x) ]) + \
            np.random.normal(0,noise)
        for x in X
    ])

# generate training set:
X_train = generate_X(45)
y_train = generate_y(X_train, noise=5.0)


# generate validation set:
X_validation = generate_X(20)
y_validation = generate_y(X_validation, noise=0.0)

# normalize training and validation sets:
scaler = StandardScaler()
scaler.fit(X_train)
Z_train = scaler.transform(X_train)
Z_validation = scaler.transform(X_validation)

lambda_values = np.linspace(0,10.0, 100)

train_mse_values = []
validation_mse_values = []

for lambda_reg in lambda_values:
    ridge_model = Ridge(alpha=lambda_reg)
    ridge_model.fit(Z_train, y_train)
    

    # make prediction and evaluate training set error:
    yhat_train = ridge_model.predict(Z_train)
    mse_train = np.mean((yhat_train - y_train)**2)

    # make predictions and evaluate validation set error:
    yhat_validation = ridge_model.predict(Z_validation)
    mse_validation = np.mean((yhat_validation - y_validation)**2)
    
    # record results:
    train_mse_values.append(mse_train)
    validation_mse_values.append(mse_validation)

# plot ridge regression results:
plt.figure()
plt.plot(lambda_values, train_mse_values, label='Training Set')
plt.plot(lambda_values, validation_mse_values, label='Validation Set')
plt.ylabel('Mean Square Error (MSE)')
plt.xlabel(r'Regularization ($\lambda$)')
plt.legend()
plt.show()
[Figure: training and validation set MSE versus the regularization parameter lambda for the Ridge regression model]