Application: Bandgap Prediction#

Let’s now tackle a more difficult regression problem: predicting material bandgaps. The bandgap of a material is an important property related to whether or not the material is a conductor: materials with a zero bandgap are typically conductors, whereas materials with a positive bandgap are insulators (if the gap is large) or semiconductors (if the gap is small). We can read the bandgap off of a material’s band structure as the energy gap between the highest occupied (valence) band and the lowest unoccupied (conduction) band. For example, the bandgap of one crystalline phase of the insulator SiO\(_2\) (mp-546794) is roughly 5.69 eV, which is the difference in band energy at the \(\Gamma\) point (shown in purple):

[Figure: band structure of SiO\(_2\) (mp-546794), with the 5.69 eV bandgap at the \(\Gamma\) point shown in purple]

Bandgap Dataset#

The bandgap can be estimated with ab initio calculation methods, such as density functional theory (DFT). In this section, we will use bandgap values estimated through DFT calculations, as reported in the Materials Project database.

You can download the dataset for this section using the following Python code:

import requests

CSV_URL = 'https://raw.githubusercontent.com/cburdine/materials-ml-workshop/main/MaterialsML/regression/bandgaps.csv'

r = requests.get(CSV_URL)
with open('bandgaps.csv', 'w', encoding='utf-8') as f:
    f.write(r.text)

Alternatively, you can download the CSV file directly from the URL above.

Loading the Dataset#

To start, let’s load the dataset into a Pandas dataframe and view it:

import pandas as pd

# load dataset into a pandas DataFrame:
BANDGAP_CSV = 'bandgaps.csv'
data_df = pd.read_csv(BANDGAP_CSV)

# show dataframe in notebook:
display(data_df)
| | mp_id | formula | composition | crystal_system | symmetry_symbol | volume | density | e_fermi | formation_energy_per_atom | band_gap | bandgap_direct |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mp-1005 | FeP | {'Fe': 1, 'P': 1} | Orthorhombic | Pnma | 90.583862 | 6.366076 | 9.163143 | -0.598423 | 0.0000 | False |
| 1 | mp-1006367 | Ce2HfSe5 | {'Ce': 2, 'Hf': 1, 'Se': 5} | Orthorhombic | Pnma | 785.754385 | 7.215011 | 6.181067 | -1.941463 | 0.0000 | False |
| 2 | mp-1001606 | LuFeC2 | {'Lu': 1, 'Fe': 1, 'C': 2} | Orthorhombic | Amm2 | 45.919844 | 9.215206 | 5.530327 | -0.286076 | 0.0000 | False |
| 3 | mp-1001613 | LuGa | {'Lu': 1, 'Ga': 1} | Orthorhombic | Cmcm | 87.819318 | 9.253483 | 3.090952 | -0.608621 | 0.0000 | False |
| 4 | mp-1008624 | YBiPd | {'Y': 1, 'Bi': 1, 'Pd': 1} | Cubic | F-43m | 74.987877 | 8.952998 | 5.397804 | -0.961384 | 0.0000 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29190 | mp-555534 | K4MnMo4O15 | {'K': 4, 'Mn': 1, 'Mo': 4, 'O': 15} | Trigonal | P-3 | 808.366683 | 3.430836 | 0.147938 | -2.154358 | 2.9808 | False |
| 29191 | mp-23703 | LiH | {'Li': 1, 'H': 1} | Cubic | Fm-3m | 16.207758 | 0.814396 | -0.997677 | -0.495530 | 2.9810 | True |
| 29192 | mp-561093 | Rb2Cr2O7 | {'Rb': 2, 'Cr': 2, 'O': 7} | Monoclinic | P2_1/c | 793.897416 | 3.237203 | -1.868672 | -1.934791 | 2.9814 | True |
| 29193 | mp-16180 | Na6S2O9 | {'Na': 6, 'S': 2, 'O': 9} | Cubic | Fm-3m | 218.925628 | 2.624871 | 1.735700 | -2.095472 | 2.9814 | False |
| 29194 | mp-557108 | Ba2Nb2TeO10 | {'Ba': 2, 'Nb': 2, 'Te': 1, 'O': 10} | Orthorhombic | Pbca | 943.682404 | 5.265263 | 1.411631 | -2.859859 | 2.9816 | False |

29195 rows × 11 columns

There are several features included in this dataset. Here’s a summary of each:

  • mp_id: Materials Project ID of material

  • formula: Chemical formula

  • composition: Composition of the material in the unit cell (as a Python dict)

  • crystal_system: Crystal system of the material

  • symmetry_symbol: Space group symbol of the crystal structure

  • volume: Unit cell volume (Å\(^3\))

  • density: Density of material (g/cm\(^3\))

  • e_fermi: Fermi energy (relative to calculated band structure) (eV)

  • formation_energy_per_atom: Formation energy per unit cell atom (eV/atom)

  • band_gap: Estimated bandgap of the material (eV)

  • bandgap_direct: Whether or not the estimated bandgap is direct.
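
Before engineering features, it can also help to glance at the distribution of the bandgap values we will be predicting. Below is a minimal sketch (assuming matplotlib is available):

import matplotlib.pyplot as plt

# sketch: histogram of the bandgap values in the dataset:
data_df['band_gap'].hist(bins=50)
plt.xlabel('Bandgap (eV)')
plt.ylabel('Number of materials')
plt.show()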

The features that most distinctly identify each material in the dataset are its composition and crystal system. If we want to include these as features in a supervised model, we must find a way to convert them to numerical vectors. Let’s start by listing all of the elements and crystal systems in the dataset:

# Generate a list of elements in the dataset:
ELEMENTS = set()
for v in data_df['composition'].values:
    ELEMENTS |= set(eval(v).keys())
ELEMENTS = sorted(ELEMENTS)

# Generate a list of the crystal systems in the dataset:
CRYSTAL_SYSTEMS = sorted(set(data_df['crystal_system']))

# print the sizes of ELEMENTS and CRYSTAL_SYSTEMS
print('Number of elements:', len(ELEMENTS))
print('Number of crystal systems:', len(CRYSTAL_SYSTEMS))
Number of elements: 86
Number of crystal systems: 7
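
One note of caution: the composition column stores each composition dict as a string, which we parse above with Python’s built-in eval(). For CSV files from untrusted sources, ast.literal_eval is a safer drop-in replacement, since it parses only literal values and cannot execute arbitrary code. A minimal sketch:

from ast import literal_eval

# literal_eval parses only Python literals (dicts, lists, numbers, strings),
# so, unlike eval(), it cannot execute arbitrary code:
composition = literal_eval("{'Si': 1, 'O': 2}")
print(composition)   # {'Si': 1, 'O': 2}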

We see that there are 86 elements represented in the dataset and 7 different crystal systems. The simplest way of converting the composition of each material to a vector is by representing it as a “bag of atoms”, that is, by converting the composition to a vector with 86 entries (one for each element present in the data), where each entry gives the fraction of atoms in the unit cell contributed by the corresponding element.

We can also represent the crystal system of each material as a vector with seven entries, where only the value corresponding to the crystal system is set to 1. The remaining entries of this vector are set to 0. Below, we write Python functions to vectorize the material composition and the crystal system:

import numpy as np

def vectorize_composition(composition, elements):
    """ converts an elemental composition dict to a vector. """
    total_n = sum(composition.values())
    vec = np.zeros(len(elements))
    for elem, n in composition.items():
        if elem in elements:
            vec[elements.index(elem)] = n/total_n
    return vec

def vectorize_crystal_system(crystal_system, systems):
    """ converts a crystal system to a vector. """
    vec = np.zeros(len(systems))
    if crystal_system in systems:
        vec[systems.index(crystal_system)] = 1.0
        
    return vec

Below, we give some examples of how this vectorization procedure works for compositions and crystal systems:

# generate an example of a composition vector:
test_comp = {'Si': 1, 'O': 2}
print('Example of a composition vector:')
print(vectorize_composition(
            test_comp, 
            elements=['C', 'O', 'Si']))

# generate an example of a crystal system vector:
test_system = 'Hexagonal'
print('Example of a crystal system vector:')
print(vectorize_crystal_system(
            test_system, 
            systems=['Cubic', 'Hexagonal']))
Example of a composition vector:
[0.         0.66666667 0.33333333]
Example of a crystal system vector:
[0. 1.]

Next, let’s write a function that extracts features from each row in the dataframe and constructs each material’s feature vector \(\mathbf{x}\) and label vector \(\mathbf{y}\). The feature vector \(\mathbf{x}\) will contain the following values combined into a single vector:

  • The material’s “bag of atoms” composition vector

  • The material’s crystal system vector

  • The material’s unit cell volume

  • The material’s density

  • The material’s formation energy per atom

The label vector \(\mathbf{y}\) for each material will contain two values: the bandgap of the material (in eV) and a value of \(\pm 1\) indicating whether the bandgap is direct (\(+1\)) or indirect (\(-1\)).

def parse_data_vector(row):
    """ parses x and y vectors from a dataframe row """
    
    # parse whether or not the bandgap is direct: 
    bandgap_direct = 1.0 if row['bandgap_direct'] else -1.0
    
    # parse the composition dict:
    composition_dict = eval(row['composition'])
    
    # parse feature vector (x):
    x_vector = np.concatenate([
        vectorize_composition(composition_dict, ELEMENTS),
        vectorize_crystal_system(row['crystal_system'], CRYSTAL_SYSTEMS),
        np.array([ row['volume'] ]),
        np.array([ row['density'] ]),
        np.array([ row['formation_energy_per_atom'] ]),
    ])
    
    # parse label vector (y):
    y_vector = np.concatenate([
        np.array([ row['band_gap'] ]),
        np.array([ bandgap_direct ])
    ])
    
    return x_vector, y_vector
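
As a quick sanity check, we can apply this function to a single row. With 86 elements, 7 crystal systems, and 3 scalar features, each \(\mathbf{x}\) vector should have \(86 + 7 + 3 = 96\) entries:

# sanity check: parse the first row of the dataframe:
x0, y0 = parse_data_vector(data_df.iloc[0])
print('x vector length:', len(x0))   # expect 96
print('y vector:', y0)               # [band_gap, +/-1 direct flag]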

Next, we will iterate over each row in the dataframe and apply the parse_data_vector() function to extract the \(\mathbf{x}\) and \(\mathbf{y}\) vectors from the dataset. We will take the list of all \(\mathbf{x}\) and \(\mathbf{y}\) vectors and convert them into a pair of numpy arrays:

# parse x and y vectors from dataframe rows:
data_x, data_y = [], []
for i, row in data_df.iterrows():
    x_vector, y_vector = parse_data_vector(row)
    data_x.append(x_vector)
    data_y.append(y_vector)

# convert x and y to numpy arrays:
data_x = np.array(data_x)
data_y = np.array(data_y)

print('data_x shape:', data_x.shape)
print('data_y shape:', data_y.shape)
data_x shape: (29195, 96)
data_y shape: (29195, 2)

Now that we have parsed features from the dataset, let’s write a function that automatically splits data into training, validation, and test sets, along with a function that normalizes these sets:

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def train_val_test_split(data_x, data_y, split=(0.8,0.1,0.1)):
    """ splits data into train, validation, and test sets. """
    
    # split train and nontrain data:
    train_x, nontrain_x, train_y, nontrain_y = \
        train_test_split(
            data_x, data_y, train_size=split[0]/sum(split))
    
    # split the remaining (nontrain) data into validation and test data:
    val_x, test_x, val_y, test_y = \
        train_test_split(
            nontrain_x, nontrain_y, 
            test_size=split[2]/(split[1]+split[2]))
    
    return (train_x, val_x, test_x), \
           (train_y, val_y, test_y)
    
def normalize(train_x, val_x, test_x):
    """ normalizes a dataset with a standard scaler. """
    
    # fit the scaler on the training set only (to avoid data leakage),
    # then apply the same transformation to the validation and test sets:
    scaler = StandardScaler()
    train_z = scaler.fit_transform(train_x)
    val_z = scaler.transform(val_x)
    test_z = scaler.transform(test_x)
    return scaler, train_z, val_z, test_z

Classifying Metals and Non-Metals#

The first supervised learning problem we will examine is discriminating between materials with a zero bandgap (i.e. metals) and materials with a nonzero bandgap (i.e. non-metals). We will start by creating a vector of \(y\) labels, with \(+1\) indicating a metal and \(-1\) indicating a non-metal:

# label: +1 = metal (zero bandgap), -1 = non-metal (nonzero bandgap):
metals_y = np.array([ 1.0 if y[0] <= 0 else -1.0 for y in data_y])

metal_subsets_x, metal_subsets_y = \
    train_val_test_split(data_x, metals_y)

train_x, val_x, test_x = metal_subsets_x
train_y, val_y, test_y = metal_subsets_y

scaler, train_z, val_z, test_z = \
    normalize(train_x, val_x, test_x)

It’s usually best to start by seeing how well a simple model, such as a linear classifier, fits the data. We could use a linear classifier (also known as a Perceptron) like we did in the last application section, but since we now know that regularization can sometimes improve the validation accuracy of a linear model, let’s use a ridge regression linear classifier, which is equipped with a regularization parameter \(\lambda\). When \(\lambda = 0\) (no regularization is applied), the model is identical to a plain linear classifier. Scikit-learn provides an implementation of this model in sklearn.linear_model.RidgeClassifier. (Note that in sklearn, the regularization parameter \(\lambda\) is referred to as \(\alpha\).) After fitting the model, we can use scikit-learn’s ConfusionMatrixDisplay class to visualize the kinds of classification errors made by our model on the validation set:

from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

model = RidgeClassifier(alpha=20)
model.fit(train_z, train_y)

train_yhat = model.predict(train_z)
val_yhat = model.predict(val_z)

# compute the validation set confusion matrix:
cm = confusion_matrix(val_y, val_yhat)

# display confusion matrix:
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=model.classes_)
disp.plot()
plt.gca().set_yticklabels(['Non Metal', 'Metal'])
plt.gca().set_xticklabels(['Non Metal', 'Metal'])
plt.show()
[Figure: confusion matrix of the ridge classifier on the validation set]

From inspecting the confusion matrix, we see that a significant majority of the materials were classified correctly, though metals were misclassified as non-metals almost three times as often as non-metals were misclassified as metals. We can evaluate the overall validation set accuracy of the model from the confusion matrix as follows:

accuracy = np.sum(np.diag(cm)) / np.sum(cm)
print(accuracy)
0.8688771665410701
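
The value \(\alpha = 20\) used above was chosen somewhat arbitrarily. Since \(\alpha\) is a hyperparameter, we can tune it by comparing validation accuracies. Below is a minimal sketch of such a sweep (the candidate \(\alpha\) values are arbitrary):

# sketch: sweep the regularization strength and keep the value
# with the best validation accuracy (candidate values are arbitrary):
best_alpha, best_acc = None, 0.0
for alpha in [0.1, 1.0, 10.0, 20.0, 50.0, 100.0]:
    clf = RidgeClassifier(alpha=alpha)
    clf.fit(train_z, train_y)
    acc = clf.score(val_z, val_y)
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

print('best alpha:', best_alpha)
print('best validation accuracy:', best_acc)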

Estimating the Bandgap of Non-Metals#

We have shown that we can classify materials as metals or non-metals with roughly 87% accuracy. The next problem we will consider is estimating the bandgap of materials that are either known to be non-metals or are classified as such by the previous model. Let’s start by extracting all non-metallic materials, i.e. those with nonzero bandgaps:

bandgap_x = data_x[data_y[:,0] > 0]
bandgap_y = data_y[(data_y[:,0] > 0),0]

print('bandgap_x shape:', bandgap_x.shape)
print('bandgap_y shape:', bandgap_y.shape)
bandgap_x shape: (11464, 96)
bandgap_y shape: (11464,)

We can split and normalize this data with the following code:

bandgap_subsets_x, bandgap_subsets_y = \
    train_val_test_split(bandgap_x, bandgap_y)

train_x, val_x, test_x = bandgap_subsets_x
train_y, val_y, test_y = bandgap_subsets_y

scaler, train_z, val_z, test_z = \
    normalize(train_x, val_x, test_x)

Ridge Regression Model#

Since predicting the bandgap is a regression problem, a good model to start with is ridge regression. We will use sklearn.linear_model.Ridge:

from sklearn.linear_model import Ridge

# construct a ridge regression model:
ridge_model = Ridge(alpha=0.01)
ridge_model.fit(train_z, train_y)

train_yhat = ridge_model.predict(train_z)
val_yhat = ridge_model.predict(val_z)

# evaluate the MSE on the training/validation sets:
train_mse = np.mean((train_yhat - train_y)**2)
val_mse = np.mean((val_yhat - val_y)**2)

# evaluate the training/validation error:
print('training stddev:', np.std(train_y))
print('training MSE:', train_mse)
print('validation MSE:', val_mse)
print('validation RMSE/stddev:', np.sqrt(val_mse)/np.std(train_y))
training stddev: 0.8648993902607139
training MSE: 0.43916086692415607
validation MSE: 0.4431211081856411
validation RMSE/stddev: 0.769654172578981

While a model root mean square error (RMSE) that is roughly three-quarters of the standard deviation of the data is likely a statistically significant improvement over always predicting the mean bandgap (which would yield an RMSE equal to the standard deviation), it leaves much to be desired. Let’s try a more complex model: gradient boosting regression. Since our dataset is quite large, it may also be more computationally efficient to use the histogram variant, which is implemented in sklearn.ensemble.HistGradientBoostingRegressor:

Gradient Boosting Regression#

from sklearn.ensemble import HistGradientBoostingRegressor

# Construct a histogram gradient boosting model:
hgbr = HistGradientBoostingRegressor(
    loss='squared_error',
    learning_rate=0.1,
    max_leaf_nodes=128,
)
hgbr.fit(train_z, train_y)

# make predictions on the training/validation sets:
train_yhat = hgbr.predict(train_z)
val_yhat = hgbr.predict(val_z)

# evaluate the MSE on the training/validation sets:
train_mse = np.mean((train_yhat - train_y)**2)
val_mse = np.mean((val_yhat - val_y)**2)

# print the training/validation error
print('training stddev:', np.std(train_y))
print('training MSE:', train_mse)
print('validation MSE:', val_mse)
print('validation RMSE/stddev:', np.sqrt(val_mse)/np.std(train_y))
training stddev: 0.8648993902607139
training MSE: 0.08587815088682216
validation MSE: 0.1270557773908582
validation RMSE/stddev: 0.4121275221300857
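
As an aside, HistGradientBoostingRegressor also supports built-in early stopping, which internally holds out a fraction of the training data and stops adding trees once the internal validation score stops improving. A minimal sketch (the stopping parameters below are arbitrary):

# sketch: gradient boosting with built-in early stopping:
hgbr_es = HistGradientBoostingRegressor(
    learning_rate=0.1,
    max_leaf_nodes=128,
    early_stopping=True,       # hold out part of the training data
    validation_fraction=0.1,   # fraction held out for early stopping
    n_iter_no_change=10,       # patience (in boosting iterations)
)
hgbr_es.fit(train_z, train_y)
print('boosting iterations used:', hgbr_es.n_iter_)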

The histogram gradient boosting model is a significant improvement upon the ridge regression model. Let’s also see how well a support vector regression (SVR) model works for bandgap estimation when we use the radial basis function (RBF) kernel. We will use sklearn.svm.SVR. Take note that SVR models may take a few minutes to fit to large datasets:

RBF Support Vector Regression#

from sklearn.svm import SVR

# construct an SVR with an rbf kernel:
svr_model = SVR(kernel='rbf', C=200, epsilon=1e-3, cache_size=500)
svr_model.fit(train_z, train_y)

# make predictions on the train/validation/test sets:
train_yhat = svr_model.predict(train_z)
val_yhat = svr_model.predict(val_z)
test_yhat = svr_model.predict(test_z)

# evaluate the MSE on the train/validation/test sets:
train_mse = np.mean((train_yhat - train_y)**2)
val_mse = np.mean((val_yhat - val_y)**2)
test_mse = np.mean((test_yhat - test_y)**2)

# print the training/validation error:
print('# of support vectors:', svr_model.n_support_) 
print('training stddev:', np.std(train_y))
print('training MSE:', train_mse)
print('validation MSE:', val_mse)
print('validation RMSE/stddev:', np.sqrt(val_mse)/np.std(train_y))

# print the test error:
print('test MSE:', test_mse)
print('test RMSE/stddev: ', np.sqrt(test_mse)/np.std(train_y))
# of support vectors: [9152]
training stddev: 0.8640814456704967
training MSE: 0.01842750045837074
validation MSE: 0.10748327182498967
validation RMSE/stddev: 0.37941610323604913
test MSE: 0.1108143968686793
test RMSE/stddev:  0.38525067990691425
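
To visualize how well the SVR model tracks the DFT-computed bandgaps, we can make a parity plot of predicted versus actual values on the validation set. A minimal sketch:

import matplotlib.pyplot as plt

# sketch: parity plot of SVR predictions vs. DFT bandgaps:
plt.scatter(val_y, val_yhat, s=4, alpha=0.3)
lims = [val_y.min(), val_y.max()]
plt.plot(lims, lims, 'k--')   # perfect-prediction line
plt.xlabel('DFT bandgap (eV)')
plt.ylabel('Predicted bandgap (eV)')
plt.show()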

Exercises#

Exercise 1: Classifying Direct vs. Indirect Bandgaps

Let’s apply the same techniques we used to classify metallic versus insulating materials to predict whether a bandgapped material has a direct or indirect bandgap. We can extract the \(\mathbf{x}\) and \(y\) data for this task using the following code:

direct_gap_x = data_x[data_y[:,0] > 0]
direct_gap_y = data_y[(data_y[:,0] > 0),1]

print('direct_gap_x shape:', direct_gap_x.shape)
print('direct_gap_y shape:', direct_gap_y.shape)

Split this data into training, validation, and test sets. Then, fit a ridge classifier (or another model you think may perform well on this task) to the training set. Visualize the fitted model’s accuracy on a validation set by plotting a confusion matrix. Once you have found a model with acceptable accuracy, evaluate the final accuracy of the model using the test set.


Hint: Note that you can obtain the accuracy of a classification model in sklearn using the model.score(x,y) method.

Solutions#

Exercise 1: Classifying Direct vs. Indirect Bandgaps#

import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# get direct bandgap data for materials with nonzero gaps:
direct_gap_x = data_x[data_y[:,0] > 0]
direct_gap_y = data_y[(data_y[:,0] > 0),1]

# split data in train/test/validation sets:
direct_subsets_x, direct_subsets_y = \
    train_val_test_split(direct_gap_x, direct_gap_y)
train_x, val_x, test_x = direct_subsets_x
train_y, val_y, test_y = direct_subsets_y

# normalize data:
scaler, train_z, val_z, test_z = \
    normalize(train_x, val_x, test_x)

# fit a ridge classifier to the data:
direct_ridge_model = RidgeClassifier(alpha=20)
direct_ridge_model.fit(train_z, train_y)

# make predictions on the validation & test sets:
val_yhat = direct_ridge_model.predict(val_z)
test_yhat = direct_ridge_model.predict(test_z)

# compute the validation set confusion matrix:
cm = confusion_matrix(val_y, val_yhat)

# display validation set confusion matrix:
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=direct_ridge_model.classes_)
disp.plot()
plt.gca().set_yticklabels(['Indirect', 'Direct'])
plt.gca().set_xticklabels(['Indirect', 'Direct'])
plt.show()

# print validation set accuracy and test set accuracy:
val_acc = direct_ridge_model.score(val_z, val_y)
test_acc = direct_ridge_model.score(test_z, test_y)
print('validation set accuracy:', val_acc)
print('test set accuracy:', test_acc)
[Figure: confusion matrix of the direct vs. indirect bandgap classifier on the validation set]
validation set accuracy: 0.7250523377529658
test set accuracy: 0.7191207257501745