Application: Bandgap Prediction#

Let’s now tackle a more difficult regression problem: predicting material bandgaps. The bandgap of a material is an important property related to whether or not the material is a conductor: materials with a zero bandgap are typically conductors, whereas materials with a positive bandgap are insulators (if the gap is large) or semiconductors (if the gap is small). We can read the bandgap off of a material’s band structure as the energy gap between the highest occupied (valence) band and the lowest unoccupied (conduction) band. For example, the bandgap of one crystalline phase of the insulator SiO\(_2\) (mp-546794) is roughly 5.69 eV, which is the difference in band energy at the \(\Gamma\) point (shown in purple):

[Figure: band structure of SiO\(_2\) (mp-546794), with the 5.69 eV bandgap at the \(\Gamma\) point shown in purple]

Bandgap Dataset#

The bandgap can be estimated with ab initio calculation methods, such as density functional theory (DFT). In this section, we will use bandgap values estimated through DFT calculations, as reported in the Materials Project database.

You can download the dataset for this section using the following Python code:

import requests

CSV_URL = 'https://raw.githubusercontent.com/cburdine/materials-ml-workshop/main/MaterialsML/regression/bandgaps.csv'

r = requests.get(CSV_URL)
with open('bandgaps.csv', 'w', encoding='utf-8') as f:
    f.write(r.text)

Alternatively, you can download the CSV file directly from the URL above.

Loading the Dataset#

To start, let’s load the dataset into a Pandas dataframe and view it:

import pandas as pd

# load dataset into a pandas DataFrame:
BANDGAP_CSV = 'bandgaps.csv'
data_df = pd.read_csv(BANDGAP_CSV)

# show dataframe in notebook:
display(data_df)
| | mp_id | formula | composition | crystal_system | symmetry_symbol | volume | density | e_fermi | formation_energy_per_atom | band_gap | bandgap_direct |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | mp-1005 | FeP | {'Fe': 1, 'P': 1} | Orthorhombic | Pnma | 90.583862 | 6.366076 | 9.163143 | -0.598423 | 0.0000 | False |
| 1 | mp-1006367 | Ce2HfSe5 | {'Ce': 2, 'Hf': 1, 'Se': 5} | Orthorhombic | Pnma | 785.754385 | 7.215011 | 6.181067 | -1.941463 | 0.0000 | False |
| 2 | mp-1001606 | LuFeC2 | {'Lu': 1, 'Fe': 1, 'C': 2} | Orthorhombic | Amm2 | 45.919844 | 9.215206 | 5.530327 | -0.286076 | 0.0000 | False |
| 3 | mp-1001613 | LuGa | {'Lu': 1, 'Ga': 1} | Orthorhombic | Cmcm | 87.819318 | 9.253483 | 3.090952 | -0.608621 | 0.0000 | False |
| 4 | mp-1008624 | YBiPd | {'Y': 1, 'Bi': 1, 'Pd': 1} | Cubic | F-43m | 74.987877 | 8.952998 | 5.397804 | -0.961384 | 0.0000 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 29190 | mp-555534 | K4MnMo4O15 | {'K': 4, 'Mn': 1, 'Mo': 4, 'O': 15} | Trigonal | P-3 | 808.366683 | 3.430836 | 0.147938 | -2.154358 | 2.9808 | False |
| 29191 | mp-23703 | LiH | {'Li': 1, 'H': 1} | Cubic | Fm-3m | 16.207758 | 0.814396 | -0.997677 | -0.495530 | 2.9810 | True |
| 29192 | mp-561093 | Rb2Cr2O7 | {'Rb': 2, 'Cr': 2, 'O': 7} | Monoclinic | P2_1/c | 793.897416 | 3.237203 | -1.868672 | -1.934791 | 2.9814 | True |
| 29193 | mp-16180 | Na6S2O9 | {'Na': 6, 'S': 2, 'O': 9} | Cubic | Fm-3m | 218.925628 | 2.624871 | 1.735700 | -2.095472 | 2.9814 | False |
| 29194 | mp-557108 | Ba2Nb2TeO10 | {'Ba': 2, 'Nb': 2, 'Te': 1, 'O': 10} | Orthorhombic | Pbca | 943.682404 | 5.265263 | 1.411631 | -2.859859 | 2.9816 | False |

29195 rows × 11 columns

There are several features included in this dataset. Here’s a summary of each:

  • mp_id: Materials Project ID of material

  • formula: Chemical formula

  • composition: Composition of the material in the unit cell (as a Python dict)

  • crystal_system: Crystal system of the material

  • symmetry_symbol: Space group symbol of the crystal structure

  • volume: Unit cell volume (Å\(^3\))

  • density: Density of material (g/cm\(^3\))

  • e_fermi: Fermi energy (relative to calculated band structure) (eV)

  • formation_energy_per_atom: Formation energy per unit cell atom (eV/atom)

  • band_gap: Estimated bandgap of the material (eV)

  • bandgap_direct: Whether or not the estimated bandgap is direct.
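
Before engineering features, it can also help to glance at the distribution of the bandgap values we will be predicting. Below is a minimal sketch (assuming matplotlib is available):

import matplotlib.pyplot as plt

# sketch: histogram of the bandgap values in the dataset:
data_df['band_gap'].hist(bins=50)
plt.xlabel('Bandgap (eV)')
plt.ylabel('Number of materials')
plt.show()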

The features that most distinctly identify each material in the dataset are its composition and crystal system. If we want to include these as features in a supervised model, we must find a way to convert them to numerical vectors. Let’s start by listing all of the elements and crystal systems in the dataset:

# Generate a list of elements in the dataset:
ELEMENTS = set()
for v in data_df['composition'].values:
    ELEMENTS |= set(eval(v).keys())
ELEMENTS = sorted(ELEMENTS)

# Generate a list of the crystal systems in the dataset:
CRYSTAL_SYSTEMS = sorted(set(data_df['crystal_system']))

# print the sizes of ELEMENTS and CRYSTAL_SYSTEMS
print('Number of elements:', len(ELEMENTS))
print('Number of crystal systems:', len(CRYSTAL_SYSTEMS))
Number of elements: 86
Number of crystal systems: 7
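
One note of caution: the composition column stores each composition dict as a string, which we parse above with Python’s built-in eval(). For CSV files from untrusted sources, ast.literal_eval is a safer drop-in replacement, since it parses only literal values and cannot execute arbitrary code. A minimal sketch:

from ast import literal_eval

# literal_eval parses only Python literals (dicts, lists, numbers, strings),
# so, unlike eval(), it cannot execute arbitrary code:
composition = literal_eval("{'Si': 1, 'O': 2}")
print(composition)   # {'Si': 1, 'O': 2}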

We see that there are 86 elements represented in the dataset and 7 different crystal systems. The simplest way of converting the composition of each material to a vector is by representing it as a “bag of atoms”, that is, by converting the composition to a vector with 86 entries (one for each element present in the data), where each entry gives the fraction of atoms in the unit cell contributed by the corresponding element.

We can also represent the crystal system of each material as a vector with seven entries, where only the value corresponding to the crystal system is set to 1. The remaining entries of this vector are set to 0. Below, we write Python functions to vectorize the material composition and the crystal system:

import numpy as np

def vectorize_composition(composition, elements):
    """ converts an elemental composition dict to a vector. """
    total_n = sum(composition.values())
    vec = np.zeros(len(elements))
    for elem, n in composition.items():
        if elem in elements:
            vec[elements.index(elem)] = n/total_n
    return vec

def vectorize_crystal_system(crystal_system, systems):
    """ converts a crystal system to a vector. """
    vec = np.zeros(len(systems))
    if crystal_system in systems:
        vec[systems.index(crystal_system)] = 1.0
        
    return vec

Below, we give some examples of how this vectorization procedure works for compositions and crystal systems:

# generate an example of a composition vector:
test_comp = {'Si': 1, 'O': 2}
print('Example of a composition vector:')
print(vectorize_composition(
            test_comp, 
            elements=['C', 'O', 'Si']))

# generate an example of a crystal system vector:
test_system = 'Hexagonal'
print('Example of a crystal system vector:')
print(vectorize_crystal_system(
            test_system, 
            systems=['Cubic', 'Hexagonal']))
Example of a composition vector:
[0.         0.66666667 0.33333333]
Example of a crystal system vector:
[0. 1.]

Next, let’s write a function that extracts features from each row in the dataframe and constructs each material’s feature vector \(\mathbf{x}\) and label vector \(\mathbf{y}\). The feature vector \(\mathbf{x}\) will contain the following values combined into a single vector:

  • The material’s “bag of atoms” composition vector

  • The material’s crystal system vector

  • The material’s unit cell volume

  • The material’s density

  • The material’s formation energy per atom

The label vector \(\mathbf{y}\) for each material will contain two values: the bandgap of the material (in eV) and a value of \(\pm 1\) indicating whether the bandgap is direct (\(+1\)) or indirect (\(-1\)).

def parse_data_vector(row):
    """ parses x and y vectors from a dataframe row """
    
    # parse whether or not the bandgap is direct: 
    bandgap_direct = 1.0 if row['bandgap_direct'] else -1.0
    
    # parse the composition dict:
    composition_dict = eval(row['composition'])
    
    # parse feature vector (x):
    x_vector = np.concatenate([
        vectorize_composition(composition_dict, ELEMENTS),
        vectorize_crystal_system(row['crystal_system'], CRYSTAL_SYSTEMS),
        np.array([ row['volume'] ]),
        np.array([ row['density'] ]),
        np.array([ row['formation_energy_per_atom'] ]),
    ])
    
    # parse label vector (y):
    y_vector = np.concatenate([
        np.array([ row['band_gap'] ]),
        np.array([ bandgap_direct ])
    ])
    
    return x_vector, y_vector
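
As a quick sanity check, we can apply this function to a single row. With 86 elements, 7 crystal systems, and 3 scalar features, each \(\mathbf{x}\) vector should have \(86 + 7 + 3 = 96\) entries:

# sanity check: parse the first row of the dataframe:
x0, y0 = parse_data_vector(data_df.iloc[0])
print('x vector length:', len(x0))   # expect 96
print('y vector:', y0)               # [band_gap, +/-1 direct flag]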

Next, we will iterate over each row in the dataframe and apply the parse_data_vector() function to extract the \(\mathbf{x}\) and \(\mathbf{y}\) vectors from the dataset. We will take the list of all \(\mathbf{x}\) and \(\mathbf{y}\) vectors and convert them into a pair of numpy arrays:

# parse x and y vectors from dataframe rows:
data_x, data_y = [], []
for i, row in data_df.iterrows():
    x_vector, y_vector = parse_data_vector(row)
    data_x.append(x_vector)
    data_y.append(y_vector)

# convert x and y to numpy arrays:
data_x = np.array(data_x)
data_y = np.array(data_y)

print('data_x shape:', data_x.shape)
print('data_y shape:', data_y.shape)
data_x shape: (29195, 96)
data_y shape: (29195, 2)

Now that we have parsed features from the dataset, let’s write a function that automatically splits data into training, validation, and test sets, along with a function that normalizes these sets:

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

def train_val_test_split(data_x, data_y, split=(0.8,0.1,0.1)):
    """ splits data into train, validation, and test sets. """
    
    # split train and nontrain data:
    train_x, nontrain_x, train_y, nontrain_y = \
        train_test_split(
            data_x, data_y, train_size=split[0]/sum(split))
    
    # split the remaining (nontrain) data into validation and test data:
    val_x, test_x, val_y, test_y = \
        train_test_split(
            nontrain_x, nontrain_y, 
            test_size=split[2]/(split[1]+split[2]))
    
    return (train_x, val_x, test_x), \
           (train_y, val_y, test_y)
    
def normalize(train_x, val_x, test_x):
    """ normalizes a dataset with a standard scaler. """
    
    # fit the scaler on the training set only (to avoid data leakage),
    # then apply the same transformation to the validation and test sets:
    scaler = StandardScaler()
    train_z = scaler.fit_transform(train_x)
    val_z = scaler.transform(val_x)
    test_z = scaler.transform(test_x)
    return scaler, train_z, val_z, test_z

Classifying Metals and Non-Metals#

The first supervised learning problem we will examine is discriminating between materials with a zero bandgap (i.e. metals) and materials with a nonzero bandgap (i.e. non-metals). We will start by creating a vector of \(y\) labels, with \(+1\) indicating a metal and \(-1\) indicating a non-metal:

# label: +1 = metal (zero bandgap), -1 = non-metal (nonzero bandgap):
metals_y = np.array([ 1.0 if y[0] <= 0 else -1.0 for y in data_y])

metal_subsets_x, metal_subsets_y = \
    train_val_test_split(data_x, metals_y)

train_x, val_x, test_x = metal_subsets_x
train_y, val_y, test_y = metal_subsets_y

scaler, train_z, val_z, test_z = \
    normalize(train_x, val_x, test_x)

It’s usually best to start by seeing how well a simple model, such as a linear classifier, fits the data. We could use a linear classifier (also known as a Perceptron) like we did in the last application section, but since we now know that regularization can sometimes improve the validation accuracy of a linear model, let’s use a ridge regression linear classifier, which is equipped with a regularization parameter \(\lambda\). When \(\lambda = 0\) (no regularization is applied), the model is identical to a plain linear classifier. Scikit-learn provides an implementation of this model in sklearn.linear_model.RidgeClassifier. (Note that in sklearn, the regularization parameter \(\lambda\) is referred to as \(\alpha\).) After fitting the model, we can use scikit-learn’s ConfusionMatrixDisplay class to visualize the kinds of classification errors made by our model on the validation set:

from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

model = RidgeClassifier(alpha=20)
model.fit(train_z, train_y)

train_yhat = model.predict(train_z)
val_yhat = model.predict(val_z)

# compute the validation set confusion matrix:
cm = confusion_matrix(val_y, val_yhat)

# display confusion matrix:
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=model.classes_)
disp.plot()
plt.gca().set_yticklabels(['Non Metal', 'Metal'])
plt.gca().set_xticklabels(['Non Metal', 'Metal'])
plt.show()
[Figure: confusion matrix of the ridge classifier on the validation set]

From inspecting the confusion matrix, we see that a significant majority of the materials were classified correctly, though metals were misclassified as non-metals almost three times as often as non-metals were misclassified as metals. We can evaluate the overall validation set accuracy of the model from the confusion matrix as follows:

accuracy = np.sum(np.diag(cm)) / np.sum(cm)
print(accuracy)
0.8688771665410701
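
The value \(\alpha = 20\) used above was chosen somewhat arbitrarily. Since \(\alpha\) is a hyperparameter, we can tune it by comparing validation accuracies. Below is a minimal sketch of such a sweep (the candidate \(\alpha\) values are arbitrary):

# sketch: sweep the regularization strength and keep the value
# with the best validation accuracy (candidate values are arbitrary):
best_alpha, best_acc = None, 0.0
for alpha in [0.1, 1.0, 10.0, 20.0, 50.0, 100.0]:
    clf = RidgeClassifier(alpha=alpha)
    clf.fit(train_z, train_y)
    acc = clf.score(val_z, val_y)
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

print('best alpha:', best_alpha)
print('best validation accuracy:', best_acc)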

Estimating the Bandgap of Non-Metals#

We have shown that we can classify materials as metals or non-metals with roughly 87% accuracy. The next problem we will consider is estimating the bandgap of materials that are either known to be non-metals or are classified as such by the previous model. Let’s start by extracting all non-metallic materials, i.e. those with nonzero bandgaps:

bandgap_x = data_x[data_y[:,0] > 0]
bandgap_y = data_y[(data_y[:,0] > 0),0]

print('bandgap_x shape:', bandgap_x.shape)
print('bandgap_y shape:', bandgap_y.shape)
bandgap_x shape: (11464, 96)
bandgap_y shape: (11464,)

We can split and normalize this data with the following code:

bandgap_subsets_x, bandgap_subsets_y = \
    train_val_test_split(bandgap_x, bandgap_y)

train_x, val_x, test_x = bandgap_subsets_x
train_y, val_y, test_y = bandgap_subsets_y

scaler, train_z, val_z, test_z = \
    normalize(train_x, val_x, test_x)

Ridge Regression Model#

Since predicting the bandgap is a regression problem, a good model to start with is ridge regression. We will use sklearn.linear_model.Ridge:

from sklearn.linear_model import Ridge

# construct a ridge regression model:
ridge_model = Ridge(alpha=0.01)
ridge_model.fit(train_z, train_y)

train_yhat = ridge_model.predict(train_z)
val_yhat = ridge_model.predict(val_z)

# evaluate the MSE on the training/validation sets:
train_mse = np.mean((train_yhat - train_y)**2)
val_mse = np.mean((val_yhat - val_y)**2)

# evaluate the training/validation error:
print('training stddev:', np.std(train_y))
print('training MSE:', train_mse)
print('validation MSE:', val_mse)
print('validation RMSE/stddev:', np.sqrt(val_mse)/np.std(train_y))
training stddev: 0.8648993902607139
training MSE: 0.43916086692415607
validation MSE: 0.4431211081856411
validation RMSE/stddev: 0.769654172578981

While a model root mean square error (RMSE) that is roughly three-quarters of the standard deviation of the data is likely a statistically significant improvement over always predicting the mean bandgap (which would yield an RMSE equal to the standard deviation), it leaves much to be desired. Let’s try a more complex model: gradient boosting regression. Since our dataset is quite large, it may also be more computationally efficient to use the histogram variant, which is implemented in sklearn.ensemble.HistGradientBoostingRegressor:

Gradient Boosting Regression#

from sklearn.ensemble import HistGradientBoostingRegressor

# Construct a histogram gradient boosting model:
hgbr = HistGradientBoostingRegressor(
    loss='squared_error',
    learning_rate=0.1,
    max_leaf_nodes=128,
)
hgbr.fit(train_z, train_y)

# make predictions on the training/validation sets:
train_yhat = hgbr.predict(train_z)
val_yhat = hgbr.predict(val_z)

# evaluate the MSE on the training/validation sets:
train_mse = np.mean((train_yhat - train_y)**2)
val_mse = np.mean((val_yhat - val_y)**2)

# print the training/validation error
print('training stddev:', np.std(train_y))
print('training MSE:', train_mse)
print('validation MSE:', val_mse)
print('validation RMSE/stddev:', np.sqrt(val_mse)/np.std(train_y))
training stddev: 0.8648993902607139
training MSE: 0.08587815088682216
validation MSE: 0.1270557773908582
validation RMSE/stddev: 0.4121275221300857
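
As an aside, HistGradientBoostingRegressor also supports built-in early stopping, which internally holds out a fraction of the training data and stops adding trees once the internal validation score stops improving. A minimal sketch (the stopping parameters below are arbitrary):

# sketch: gradient boosting with built-in early stopping:
hgbr_es = HistGradientBoostingRegressor(
    learning_rate=0.1,
    max_leaf_nodes=128,
    early_stopping=True,       # hold out part of the training data
    validation_fraction=0.1,   # fraction held out for early stopping
    n_iter_no_change=10,       # patience (in boosting iterations)
)
hgbr_es.fit(train_z, train_y)
print('boosting iterations used:', hgbr_es.n_iter_)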

The histogram gradient boosting model is a significant improvement upon the ridge regression model. Let’s also see how well a support vector regression (SVR) model works for bandgap estimation when we use the radial basis function (RBF) kernel. We will use sklearn.svm.SVR. Take note that SVR models may take a few minutes to fit to large datasets:

RBF Support Vector Regression#

from sklearn.svm import SVR

# construct an SVR with an rbf kernel:
svr_model = SVR(kernel='rbf', C=200, epsilon=1e-3, cache_size=500)
svr_model.fit(train_z, train_y)

# make predictions on the train/validation/test sets:
train_yhat = svr_model.predict(train_z)
val_yhat = svr_model.predict(val_z)
test_yhat = svr_model.predict(test_z)

# evaluate the MSE on the train/validation/test sets:
train_mse = np.mean((train_yhat - train_y)**2)
val_mse = np.mean((val_yhat - val_y)**2)
test_mse = np.mean((test_yhat - test_y)**2)

# print the training/validation error:
print('# of support vectors:', svr_model.n_support_) 
print('training stddev:', np.std(train_y))
print('training MSE:', train_mse)
print('validation MSE:', val_mse)
print('validation RMSE/stddev:', np.sqrt(val_mse)/np.std(train_y))

# print the test error:
print('test MSE:', test_mse)
print('test RMSE/stddev: ', np.sqrt(test_mse)/np.std(train_y))
# of support vectors: [9152]
training stddev: 0.8640814456704967
training MSE: 0.01842750045837074
validation MSE: 0.10748327182498967
validation RMSE/stddev: 0.37941610323604913
test MSE: 0.1108143968686793
test RMSE/stddev:  0.38525067990691425
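
To visualize how well the SVR model tracks the DFT-computed bandgaps, we can make a parity plot of predicted versus actual values on the validation set. A minimal sketch:

import matplotlib.pyplot as plt

# sketch: parity plot of SVR predictions vs. DFT bandgaps:
plt.scatter(val_y, val_yhat, s=4, alpha=0.3)
lims = [val_y.min(), val_y.max()]
plt.plot(lims, lims, 'k--')   # perfect-prediction line
plt.xlabel('DFT bandgap (eV)')
plt.ylabel('Predicted bandgap (eV)')
plt.show()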

Exercises#

Exercise 1: Classifying Direct vs. Indirect Bandgaps

Let’s apply the same techniques we used to classify metallic versus insulating materials to predict whether a bandgapped material has a direct or indirect bandgap. We can extract the \(\mathbf{x}\) and \(y\) data for this task using the following code:

direct_gap_x = data_x[data_y[:,0] > 0]
direct_gap_y = data_y[(data_y[:,0] > 0),1]

print('direct_gap_x shape:', direct_gap_x.shape)
print('direct_gap_y shape:', direct_gap_y.shape)

Split this data into training, validation, and test sets. Then, fit a ridge classifier (or another model you think may perform well on this task) to the training set. Visualize the fitted model’s accuracy on a validation set by plotting a confusion matrix. Once you have found a model with acceptable accuracy, evaluate the final accuracy of the model using the test set.


Hint: Note that you can obtain the accuracy of a classification model in sklearn using the model.score(x,y) method.

Solutions#

Exercise 1: Classifying Direct vs. Indirect Bandgaps#

import numpy as np
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# get direct bandgap data for materials with nonzero gaps:
direct_gap_x = data_x[data_y[:,0] > 0]
direct_gap_y = data_y[(data_y[:,0] > 0),1]

# split data in train/test/validation sets:
direct_subsets_x, direct_subsets_y = \
    train_val_test_split(direct_gap_x, direct_gap_y)
train_x, val_x, test_x = direct_subsets_x
train_y, val_y, test_y = direct_subsets_y

# normalize data:
scaler, train_z, val_z, test_z = \
    normalize(train_x, val_x, test_x)

# fit a ridge classifier to the data:
direct_ridge_model = RidgeClassifier(alpha=20)
direct_ridge_model.fit(train_z, train_y)

# make predictions on the validation & test sets:
val_yhat = direct_ridge_model.predict(val_z)
test_yhat = direct_ridge_model.predict(test_z)

# compute the validation set confusion matrix:
cm = confusion_matrix(val_y, val_yhat)

# display validation set confusion matrix:
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=direct_ridge_model.classes_)
disp.plot()
plt.gca().set_yticklabels(['Indirect', 'Direct'])
plt.gca().set_xticklabels(['Indirect', 'Direct'])
plt.show()

# print validation set accuracy and test set accuracy:
val_acc = direct_ridge_model.score(val_z, val_y)
test_acc = direct_ridge_model.score(test_z, test_y)
print('validation set accuracy:', val_acc)
print('test set accuracy:', test_acc)
[Figure: confusion matrix of the direct vs. indirect bandgap classifier on the validation set]
validation set accuracy: 0.7250523377529658
test set accuracy: 0.7191207257501745