Materials+ML Workshop Day 8¶


Content for today:¶

  • Regression Models Review

    • Linear Regression
    • High-dimensional Embeddings
    • Kernel Machines
  • Unsupervised Learning

    • Feature Selection
    • Dimensionality reduction
    • Clustering
    • Distribution Estimation
    • Anomaly Detection
  • Application: Classifying Superconductors

    • Application of unsupervised methods

Tentative Workshop Schedule:¶

Session | Date | Content
------- | ---- | -------
Day 0 | 06/16/2023 (2:30-3:30 PM) | Introduction, Setting up your Python Notebook
Day 1 | 06/19/2023 (2:30-3:30 PM) | Python Data Types
Day 2 | 06/20/2023 (2:30-3:30 PM) | Python Functions and Classes
Day 3 | 06/21/2023 (2:30-3:30 PM) | Scientific Computing with Numpy and Scipy
Day 4 | 06/22/2023 (2:30-3:30 PM) | Data Manipulation and Visualization
Day 5 | 06/23/2023 (2:30-3:30 PM) | Materials Science Packages
Day 6 | 06/26/2023 (2:30-3:30 PM) | Introduction to ML, Supervised Learning
Day 7 | 06/27/2023 (2:30-3:30 PM) | Regression Models
Day 8 | 06/28/2023 (2:30-3:30 PM) | Unsupervised Learning
Day 9 | 06/29/2023 (2:30-3:30 PM) | Neural Networks
Day 10 | 06/30/2023 (2:30-3:30 PM) | Advanced Applications in Materials Science

Questions¶

  • Regression Models
    • Linear Regression
    • High-dimensional Embeddings
    • Kernel Machines
    • Supervised Learning (in general)

Multivariate Linear Regression¶

  • Multivariate linear regression is a regression model that estimates a label as a linear combination of the features:
$$\hat{y} = f(\mathbf{x}) = w_0 + \sum_{i=1}^D w_i x_i$$
  • We can re-write the linear regression model in vector form:

    • Let $\underline{\mathbf{x}} = \begin{bmatrix} 1 & x_1 & x_2 & \dots & x_D \end{bmatrix}^T$ ($\mathbf{x}$ padded with a 1)
    • Let $\mathbf{w} = \begin{bmatrix} w_0 & w_1 & w_2 & \dots & w_D \end{bmatrix}^T$ (the weight vector)
  • $f(\mathbf{x})$ is just the inner product (i.e. dot product) of these two vectors:
$$\hat{y} = f(\mathbf{x}) = \underline{\mathbf{x}}^T\mathbf{w}$$

Closed Form Solution:¶

  • Multivariate Linear Regression:
$$\mathbf{w} = \mathbf{X}^+\mathbf{y}$$
  • Above, $\mathbf{X}^+$ denotes the Moore-Penrose inverse (sometimes called the pseudo-inverse) of $\mathbf{X}$.
  • If the dataset size $N$ is sufficiently large such that $\mathbf{X}$ has linearly independent columns, the optimal weights can be computed as:
$$\mathbf{w} = \mathbf{X}^{+}\mathbf{y} = \left( (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\right)\mathbf{y}$$
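
As a quick sketch of this formula (the data below is synthetic and purely illustrative), the weights can be computed in NumPy with the pseudo-inverse, and predictions are just inner products:

```python
import numpy as np

# Illustrative synthetic data: N = 100 points, D = 3 features
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))
y = 2.0 + X_raw @ np.array([1.5, -0.7, 0.3]) + 0.1 * rng.normal(size=100)

# Pad each feature vector with a leading 1 (the underlined-x convention above)
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# w = X^+ y, using the Moore-Penrose pseudo-inverse
w = np.linalg.pinv(X) @ y

# Predictions are inner products: y_hat = X w
y_hat = X @ w
print(w)  # approximately [2.0, 1.5, -0.7, 0.3]
```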

High-Dimensional Embeddings¶

  • Often, the trends of $y$ with respect to $\mathbf{x}$ are non-linear, so multivariate linear regression may fail to give good results.

  • One way of handling this is by embedding the data in a higher-dimensional space using many different non-linear functions:

$$\phi_j(\mathbf{x}) : \mathbb{R}^{D} \rightarrow \mathbb{R}\qquad (j = 1, 2, ..., D_{emb})$$

(The $\phi_j$ are nonlinear functions, and $D_{emb}$ is the embedding dimension)

$$\hat{y} = f(\mathbf{x}) = w_0 + \sum_{j=1}^{D_{emb}} w_j \phi_j(\mathbf{x})$$
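
A minimal sketch of this idea (assuming NumPy, and using polynomial functions $\phi_j(x) = x^j$ purely for illustration): the embedded features are fit exactly like a linear model.

```python
import numpy as np

# Illustrative 1D data with a nonlinear trend
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=x.size)

# Embed x into D_emb nonlinear features phi_j(x) = x^j
# (any other set of nonlinear functions could be used instead)
D_emb = 5
Phi = np.vander(x, N=D_emb + 1, increasing=True)  # columns: 1, x, x^2, ..., x^D_emb

# Fit the weights exactly as in multivariate linear regression
w = np.linalg.pinv(Phi) @ y
y_hat = Phi @ w
```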

Underfitting and Overfitting¶

  • Finding the best fit of a model requires striking a balance between underfitting and overfitting the data.
  • A model underfits the data if it has insufficient degrees of freedom to model the data.
  • A model overfits the data if it has too many degrees of freedom such that it fails to generalize well outside of the training data.

Polynomial Regression Example:

(Figure: polynomial fits of varying degree, from underfitting to overfitting)

Regularization:¶

  • To reduce overfitting, we apply regularization.
  • Usually, a penalty term is added to the overall model loss function:

    $$\text{ Penalty Term } = \lambda \sum_{j} w_j^2 = \lambda(\mathbf{w}^T\mathbf{w})$$

  • The parameter $\lambda$ is called the regularization parameter

    • as $\lambda$ increases, more regularization is applied.
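
As a sketch of this kind of $L_2$ (ridge) penalized fit on illustrative data, one common closed form adds $\lambda\mathbf{I}$ inside the normal equations; scikit-learn's Ridge performs a similar fit.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] + 0.1 * rng.normal(size=50)

# Minimize ||X w - y||^2 + lam * w^T w.
# A standard closed form: w = (X^T X + lam * I)^{-1} X^T y
# (in practice, the bias w_0 is often left unpenalized)
def ridge_fit(X, y, lam):
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

w_weak = ridge_fit(X, y, lam=1e-3)   # close to the unregularized solution
w_strong = ridge_fit(X, y, lam=1e3)  # weights shrunk toward zero
```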

Today's Content:¶

Unsupervised Learning

  • Feature Selection
  • Dimensionality reduction
  • Clustering
  • Distribution Estimation
  • Anomaly Detection

Unsupervised Learning Models:¶

  • Models applied to unlabeled data with the goal of discovering trends or patterns, extracting features, or finding relationships within the data.
    • Deals with datasets of features only
    • (just $\mathbf{x}$, not $(\mathbf{x},y)$ pairs)

(Figure: overview of unsupervised learning)

Feature Selection and Dimensionality Reduction¶

  • Determines which features are the most "meaningful" in explaining how the data is distributed
  • Sometimes we work with high-dimensional data that is very sparse

  • Reducing the dimensionality of the data might be necessary

    • Reduces computational complexity
    • Eliminates unnecessary (or redundant) features
    • Can even improve model accuracy

The Importance of Dimensionality¶

  • Dimensionality is an important concept in materials science.
    • The dimensionality of a material affects its properties
  • Much like materials, the dimensionality of a dataset can say a lot about its properties:
    • How complex is the data?
    • Does the data have fewer degrees of freedom than features?
  • Sometimes, data can be confined to some low-dimensional manifold embedded in a higher-dimensional space.

Example: The "Swiss Roll" manifold

(Figure: data on the Swiss roll manifold)
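
As a quick hands-on illustration (the sample size and noise level below are arbitrary), scikit-learn can generate Swiss roll data: each point has three coordinates, but only two underlying degrees of freedom.

```python
from sklearn.datasets import make_swiss_roll

# 3D points that lie (approximately) on a 2D manifold
X, t = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)
print(X.shape)  # (1000, 3): three features, two intrinsic degrees of freedom
```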

Review: The Covariance Matrix¶

  • The Covariance Matrix describes the variance of data in more than one dimension:
$$\mathbf{\Sigma} = \begin{bmatrix} \sigma_{1}^2 & \sigma_{12} & \dots & \sigma_{1d} \\ \sigma_{21} & \sigma_{2}^2 & \dots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \dots & \sigma_{d}^2 \end{bmatrix}$$
  • $\Sigma_{ii} = \sigma_i^2$: variance in dimension $i$
  • $\Sigma_{ij} = \sigma_{ij}$: covariance between dimensions $i$ and $j$
$$\Sigma_{ij} = \frac{1}{N} \sum_{n=1}^N ((\mathbf{x}_n)_i - \mu_i)((\mathbf{x}_n)_j - \mu_j)$$
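
A short sketch of computing this with NumPy on illustrative random data (note that np.cov uses the $1/(N-1)$ normalization unless bias=True):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # N = 200 points, d = 3 features (illustrative)

# Covariance matrix, directly from the formula above (1/N normalization)
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / X.shape[0]

# NumPy equivalent: rowvar=False puts features in columns, bias=True gives 1/N
Sigma_np = np.cov(X, rowvar=False, bias=True)
print(np.allclose(Sigma, Sigma_np))  # True
```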

The Correlation Matrix:¶

  • Recall that it is generally a good idea to normalize our data:
$$\mathbf{x} \mapsto \mathbf{z}:\quad z_i = \frac{x_i - \mu_i}{\sigma_i}$$
  • The correlation matrix (denoted $\bar{\Sigma}$) is the covariance matrix of the normalized data:
$$ \bar{\Sigma} = \frac{1}{N} \sum_{n=1}^N \mathbf{z}_n\mathbf{z}_n^T $$
  • The entries of the correlation matrix (in terms of the original data) are:
$$\bar{\Sigma}_{ij} = \frac{1}{N} \sum_{n=1}^N \frac{((\mathbf{x}_n)_i - \mu_i)((\mathbf{x}_n)_j - \mu_j)}{\sigma_i\sigma_j}$$
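
A corresponding sketch for the correlation matrix (again on illustrative data); np.corrcoef gives the same result directly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 0.1])  # features on different scales

# Normalize each feature, then take the covariance of the normalized data
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr = Z.T @ Z / X.shape[0]

# NumPy equivalent (rowvar=False because features are in columns)
print(np.allclose(corr, np.corrcoef(X, rowvar=False)))  # True
```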

Interpreting the Correlation Matrix¶

$$\bar{\Sigma}_{ij} = \frac{1}{N} \sum_{n=1}^N \frac{((\mathbf{x}_n)_i - \mu_i)((\mathbf{x}_n)_j - \mu_j)}{\sigma_i\sigma_j}$$
  • The diagonal of the correlation matrix consists of $1$s. (Why?)
  • The off-diagonal components describe the strength of correlation between feature dimensions $i$ and $j$
    • Positive values: positive correlation
    • Negative values: negative correlation
    • Zero values: no correlation

Principal Components Analysis (PCA)¶

  • The eigenvectors of the correlation matrix are called principal components.

  • The associated eigenvalues describe the proportion of the data variance in the direction of each principal component.

$$\bar{\Sigma} = P D P^{T}$$
  • $D$: Diagonal matrix (eigenvalues along diagonal)
  • $P$: Principal component matrix (columns are principal components)
  • Since $\bar{\Sigma}$ is symmetric, the principal components are all orthogonal.

Dimension reduction with PCA¶

We can project our (normalized) data onto the first $n$ principal components to reduce the dimensionality of the data, while still keeping most of the variance:

$$\mathbf{z} \mapsto \mathbf{u} = \begin{bmatrix} \mathbf{z}^T\mathbf{p}^{(1)} \\ \mathbf{z}^T\mathbf{p}^{(2)} \\ \vdots \\ \mathbf{z}^T\mathbf{p}^{(n)} \\ \end{bmatrix}$$
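
A minimal sketch of this projection with NumPy (the data and the choice of $n = 2$ are illustrative; scikit-learn's PCA class performs equivalent steps on centered data, so standardize first to reproduce the correlation-based version):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))  # correlated features (illustrative)

# Normalize, then eigendecompose the correlation matrix
Z = (X - X.mean(axis=0)) / X.std(axis=0)
corr = Z.T @ Z / Z.shape[0]
eigvals, P = np.linalg.eigh(corr)      # eigh: for symmetric matrices, ascending order
order = np.argsort(eigvals)[::-1]      # sort by decreasing variance
eigvals, P = eigvals[order], P[:, order]

# Project onto the first n principal components
n = 2
U = Z @ P[:, :n]                                # reduced-dimension data, shape (500, n)
explained = eigvals[:n].sum() / eigvals.sum()   # fraction of variance kept
```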

Clustering and Distribution Estimation¶

  • Clustering methods allow us to identify dense groupings of data.

  • Distribution Estimation allows us to estimate the probability distribution of the data.

K-Means Clustering:¶

  • $k$-means is a popular clustering algorithm that identifies the center points of a specified number of clusters, $k$

  • These center points are called centroids

(Figure: k-means clustering example)
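
A minimal usage sketch with scikit-learn (the blob data and $k = 3$ are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative 2D data with three dense groupings
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
centroids = kmeans.cluster_centers_  # the k centroids
labels = kmeans.labels_              # cluster assignment of each point
```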

Kernel Density Estimation:¶

  • Kernel Density Estimation (KDE) estimates the probability distribution of an entire dataset

  • Estimates the distribution as a sum of multivariate normal "bumps" at the position of each datapoint

(Figure: kernel density estimate of a dataset)
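
A minimal sketch using SciPy's gaussian_kde on illustrative 1D data (scikit-learn's KernelDensity is a common alternative):

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Illustrative 1D data drawn from two overlapping groups
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(1.0, 1.0, 300)])

kde = gaussian_kde(x)             # bandwidth chosen automatically (Scott's rule)
grid = np.linspace(-5, 5, 200)
density = kde(grid)               # estimated probability density on the grid
```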

Gaussian Mixture Model¶

  • A Gaussian Mixture Model (GMM) performs both clustering and distribution estimation simultaneously.

  • Works by fitting a mixture of multivariate normal distributions to the data

(Figure: Gaussian mixture model fit)
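
A minimal sketch with scikit-learn's GaussianMixture on illustrative data, showing both the clustering and the distribution-estimation sides:

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# Illustrative data with three dense groupings
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

labels = gmm.predict(X)              # clustering: most likely component per point
log_density = gmm.score_samples(X)   # distribution estimate: log p(x) under the mixture
```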

Application: Classifying Superconductors¶

  • Exploring the distribution of superconducting materials

Recommended Reading:¶

  • Neural Networks

(Note: some sections are still in progress ☹️)

If possible, try to do the exercises. Bring your questions to our next meeting tomorrow.