Application: Classifying Superconductors#

You can download the dataset for this section using the following Python code:

import requests

CSV_URL = 'https://raw.githubusercontent.com/cburdine/materials-ml-workshop/main/MaterialsML/unsupervised_learning/matml_supercon_cleaned.csv'

r = requests.get(CSV_URL)
with open('matml_supercon_cleaned.csv', 'w', encoding='utf-8') as f:
    f.write(r.text)

Alternatively, you can download the CSV file directly here.

Loading the Dataset#

Let’s begin by loading the dataset in to a Pandas dataframe so we can view the features:

Hide code cell source
import pandas as pd

# load dataset into a pandas DataFrame:
SUPERCON_CSV = 'matml_supercon_cleaned.csv'
data_df = pd.read_csv(SUPERCON_CSV)

# remove rows for materials with an ambiguous composition:
data_df = data_df[data_df['Composition'].isnull() == False]

# remove corrupted data with invalid Tc:
data_df = data_df[data_df['Tc (K)'] < 400]

# show dataframe in notebook:
display(data_df)
Material Substitutions Composition Tc (K) Pressure (GPa) Shape Substrate DOI
1 In O {} {'In': 1, 'O': 1} 2.00 0.0 films NaN 10.1134/1.568304
2 Pr Ru 2 Si 2 {} {'Pr': 1, 'Ru': 2, 'Si': 2} 14.00 0.0 NaN NaN 10.1088/0953-8984/12/34/307
3 Lu 3 Os 4 Ge 13 {} {'Lu': 3, 'Os': 4, 'Ge': 13} 2.85 0.0 NaN NaN NaN
4 La Mn O 3.04 {} {'La': 1, 'Mn': 1, 'O': 3.04} 125.00 0.0 NaN NaN 10.1103/physrevlett.87.127206
5 La Mn O 3.045 {} {'La': 1, 'Mn': 1, 'O': 3.045} 130.00 0.0 NaN NaN 10.1103/physrevlett.87.127206
... ... ... ... ... ... ... ... ...
23876 Nb Se 2 {} {'Nb': 1, 'Se': 2} 7.00 0.0 nanobelts NaN 10.1103/physrevb.75.020501
23877 Mg B 2 {} {'Mg': 1, 'B': 2} 40.00 0.0 NaN NaN 10.1103/physrevb.66.012511
23878 Ca Bi 2 {} {'Ca': 1, 'Bi': 2} 2.00 0.0 NaN NaN NaN
23879 Ca Bi 2 {} {'Ca': 1, 'Bi': 2} 2.00 0.0 single-crystalline NaN NaN
23880 Cd 2 Re 2 O 7 {} {'Cd': 2, 'Re': 2, 'O': 7} 1.00 0.0 single crystals NaN 10.1016/s0022-3697(02)00090-2

19463 rows × 8 columns

Here’s a summary of the different features included in the dataset:

  • Material: The chemical formula of the superconducting material.

  • Substitutions: Python dictionary consisting of doping ratios.

  • Composition: Composition of the chemical formula under doping.

  • Tc: Reported critical temperature of superconductivity (K).

  • Pressure: Applied pressure to superconducting sample (GPa).

  • Shape: Shape of the superconducting sample (if any).

  • Substrate: Substrate the sample was deposited on (if any).

  • DOI: DOI of paper that data was extracted from.

In this section, we will not attempt to predict any superconductor properties; instead, we will aim to extract insight from this dataset by applying clustering and distribution estimation models.

Preprocessing Data#

Hide code cell source
import numpy as np

# Generate a list of elements in the dataset:
ELEMENTS = set()
for v in data_df['Composition'].values:
    ELEMENTS |= set(eval(v).keys())
ELEMENTS = sorted(ELEMENTS)

def vectorize_composition(composition, elements):
    """ converts an elemental composition dict to a vector. """
    total_n = sum(composition.values())
    vec = np.zeros(len(elements))
    for elem, n in composition.items():
        if elem in elements:
            vec[elements.index(elem)] = n/total_n
    return vec

print('Number of elements:', len(ELEMENTS))
Number of elements: 86
def parse_data_vector(row):
    """ parses data from a dataframe row """
    
    # parse the composition dict:
    composition_dict = eval(row['Composition'])
    
    # parse feature vector (x):
    x_vector = np.concatenate([
        vectorize_composition(composition_dict, ELEMENTS),
        np.array([ row['Pressure (GPa)'] ]),
        np.array([ row['Tc (K)'] ]),
    ])
     
    return x_vector
from sklearn.preprocessing import StandardScaler

data_x = np.array([
    parse_data_vector(row)
    for _, row in data_df.iterrows()
])

print(data_x.shape)

# normalize data:
scaler = StandardScaler()
data_z = scaler.fit_transform(data_x)

# extract a matrix of only compositions:
composition_z = data_z[:,:-2]
(19463, 88)

Estimate the Empirical Distribution of \(T_c\)#

from sklearn.neighbors import KernelDensity

Tc_data = data_x[:,-1]
Tc_kde = KernelDensity(kernel='gaussian').fit(Tc_data.reshape(-1,1))
import matplotlib.pyplot as plt

# generate points to evaluate the KDE model:
Tc_eval_pts = np.linspace(0, np.max(Tc_data), 1000)

# determine the normalized density of the distribution:
Tc_density = np.exp(Tc_kde.score_samples(
                        Tc_eval_pts.reshape(-1,1)))
Tc_density /= np.trapz(Tc_density, Tc_eval_pts)

plt.figure()
plt.plot(Tc_eval_pts, Tc_density, label='KDE Distribution')
plt.fill_between(Tc_eval_pts, Tc_density, alpha=0.2)
plt.axvline(
    40,  linestyle=':', color='r',
    label='Limit of Conventional\nSuperconductivity (40 K)')
plt.grid()
plt.xlabel('Temperature (K)')
plt.ylabel('Probability Density')
plt.legend()
plt.xlim(0,np.max(Tc_data))
plt.show()
Hide code cell output
../_images/343d1798b01f27a78b3ac6847e100b79093231fd7e1c05317ffa0b5a4ce7ae93.png

Identifying Outliers in the Distribution:#

N_outliers = 10

log_likelihoods = Tc_kde.score_samples(Tc_data.reshape(-1,1))
cutoff = np.sort(log_likelihoods)[N_outliers]

outlier_df = data_df[log_likelihoods < cutoff]
display(outlier_df[['Material','Tc (K)','Pressure (GPa)','DOI']])
Material Tc (K) Pressure (GPa) DOI
520 Y H 6 290.0 300.0 10.1103/physrevb.99.220502
618 Si 2 H 6 153.0 170.0 NaN
1422 Sc H 6 196.0 170.0 10.1007/1-4020-3989-1_6
2879 H 3 S 0.96 Si 0.04 274.0 170.0 NaN
3714 Y H 10 303.0 400.0 NaN
4164 Y H 9 256.0 177.0 NaN
6699 K H 2 P O 4 210.0 170.0 NaN
10629 Si 2 H 6 174.0 275.0 10.1016/j.physc.2012.12.002
14301 H 3 S 178.0 170.0 10.1038/srep38570
20021 La H 10 218.0 170.0 NaN

Correlations of Composition with \(T_c\)#

cov_matrix = np.cov(data_z.T)
Tc_covs = cov_matrix[:-2,-1]
from pymatgen.util.plotting import periodic_table_heatmap

import matplotlib.pyplot as plt
import matplotlib
# convert Tc covariances to an (element -> cov.) dictionary:
Tc_element_covs = {
    elem : cov
    for elem, cov in zip(ELEMENTS, Tc_covs)
}

# plot covariance periodic table heatmap:
max_cov = np.max(np.abs(Tc_covs))
heatmap = periodic_table_heatmap(
    Tc_element_covs, 
    cmap='coolwarm', 
    symbol_fontsize=20,
    cmap_range=(-max_cov, max_cov),
    blank_color=matplotlib.colormaps['coolwarm'](0.5)
)
plt.title('Correlation of Element Composition with $T_c$', fontsize='20')
plt.show()
Hide code cell output
../_images/84ed81c277677fcd04d9702800a3a00c47a2f5ca52ccd6c98a15937e01040f45.png

Identifying Clusters in the Data#

from sklearn.decomposition import PCA

full_pca = PCA(n_components=data_z.shape[1])
full_pca.fit(data_z)
lambdas = full_pca.explained_variance_
# cutoff (selected by inspecting the plot below)
pca_cutoff = 10
Hide code cell source
# plot explained variance vs. principal component:
plt.figure()
plt.bar(np.arange(1,len(lambdas)+1),lambdas)
plt.ylabel(r'Explained Variance ($\lambda$)')
plt.axvline(pca_cutoff, linestyle='--', color='r', 
            label=f'Selected Cutoff: {pca_cutoff}')
plt.xlabel('Principal Component')
plt.show()
../_images/a1f6da77d60f78b5e00ce1068cee50c8883ae17af138a1a539b3bced165888e0.png
pca = PCA(n_components=pca_cutoff)
data_u = pca.fit_transform(data_z)

plt.figure()
plt.grid()
plt.scatter(data_u[:,0], data_u[:,1], marker='+')
plt.xlabel(r'$u_1$')
plt.ylabel(r'$u_2$')
plt.title('Projection onto Principal Components')
plt.show()
Hide code cell output
../_images/a5e5a097501eab76d7be97436bcc75199a004a40daa10b0fcbe87b14264228f3.png
from sklearn.cluster import KMeans

# number of clusters to find:
n_clusters = 6
kmeans = KMeans(n_clusters, n_init='auto')
kmeans.fit(data_u)
assignments = kmeans.predict(data_u)

clusters = {}
for i in range(n_clusters):
    clusters[i] = data_u[assignments == i]
Hide code cell source
plt.figure()
for i, cluster in clusters.items():
   plt.scatter(cluster[:,0], cluster[:,1], marker='+', 
               label=f'Cluster {i}')
plt.xlabel(r'$u_1$')
plt.ylabel(r'$u_2$')
plt.title('Principal Component Clusters')
plt.legend()
plt.show()
../_images/4dd52f348f7a1354c5ed2b308e415463c80235f51e45867788a7c07aa139ab8e.png
Hide code cell content
# show some samples of each cluster:
for i in clusters.keys():

    # obtain cluster dataframe:
    cluster_df = data_df[assignments == i]
    sample_df = cluster_df[
        ['Material', 'Tc (K)', 'Pressure (GPa)', 'DOI']
    ].sample(5)

    # show sample dataframe:
    display(f'Cluster {i} Examples:')
    display(sample_df)
'Cluster 0 Examples:'
Material Tc (K) Pressure (GPa) DOI
5143 Y Ba 2 Cu 4 O 8 80.0 0.0 10.1103/physrevlett.100.047003
17201 La 1.885 Ba 0.115 Cu O 4 13.0 0.0 10.1126/science.aan3438
20577 Bi 2 Sr 2 Ca Cu 2 O 8 33.5 0.0 10.1088/0953-8984/22/42/426004
10749 Sr 2 Ru O 4 1.4 0.0 10.1103/physrevlett.87.257601
3716 Mg H 6 271.0 300.0 NaN
'Cluster 1 Examples:'
Material Tc (K) Pressure (GPa) DOI
5367 Ce Pt 3 Si 0.75 0.0 10.1038/srep41853
3459 Ce Pt 3 Si 0.50 0.0 10.1088/1367-2630/11/5/055054
3900 Y Pt Bi 0.77 0.0 10.1088/0953-8984/27/27/275701
8238 Ce Co In 5 2.30 0.0 10.1038/srep01446
10002 Cs 3 C 60 40.00 0.0 10.1088/0953-8984/28/15/153001
'Cluster 2 Examples:'
Material Tc (K) Pressure (GPa) DOI
8271 Mg B 2 39.00 0.0 10.1088/0953-2048/17/5/001
18635 Mg 0.9 (Al Li) 0.1 B 2 1.00 0.0 10.1016/j.physc.2007.04.111
14636 Zr B 12 5.82 0.0 10.1103/physrevb.69.212508
14385 Mg B 2 40.00 0.0 10.1007/s10948-006-0164-9
784 Mg B 2 0.62 0.0 10.1016/j.ssc.2009.12.007
'Cluster 3 Examples:'
Material Tc (K) Pressure (GPa) DOI
11673 Li Fe As 17.0 0.0 10.1103/revmodphys.83.1589
16583 Sm 0.95 La 0.05 Fe As O 0.85 F 0.15 52.0 0.0 10.1209/0295-5075/84/67014
3497 K 0.95 Ni 1.86 Se 2 0.3 0.0 NaN
20433 Nb Se 2 7.0 0.0 10.1038/nphys3583
12507 Fe Se 65.0 0.0 NaN
'Cluster 4 Examples:'
Material Tc (K) Pressure (GPa) DOI
4605 Ho Ni 2 B 2 C 8.0 0.0 NaN
10048 Ti 0.4 0.0 10.1134/s1063783418110173
12554 Mo N 7.8 0.0 NaN
13135 Nb 8.5 0.0 10.1103/physrevb.76.132502
4930 Ti N 4.5 0.0 10.1007/s10909-013-1078-0
'Cluster 5 Examples:'
Material Tc (K) Pressure (GPa) DOI
22433 Pr Ru 4 Sb 12 1.04 0.0 10.1143/jpsj.73.1135
16770 Pd 0.88 Ni 0.12 9.00 0.0 10.1103/physrevb.93.174510
2083 U Rh Ge 0.25 0.0 10.1103/physrevb.67.054416
11349 Al 1.30 0.0 10.1021/acs.nanolett.9b02369
10217 Tl Ni 2 Se S 1.30 0.0 10.1103/physrevb.90.201105