Application: Classifying Superconductors#
You can download the dataset for this section using the following Python code:
import requests
CSV_URL = 'https://raw.githubusercontent.com/cburdine/materials-ml-workshop/main/MaterialsML/unsupervised_learning/matml_supercon_cleaned.csv'
r = requests.get(CSV_URL)
with open('matml_supercon_cleaned.csv', 'w', encoding='utf-8') as f:
f.write(r.text)
Alternatively, you can download the CSV file directly here.
Loading the Dataset#
Let’s begin by loading the dataset in to a Pandas dataframe so we can view the features:
Show code cell source
import pandas as pd
# load dataset into a pandas DataFrame:
SUPERCON_CSV = 'matml_supercon_cleaned.csv'
data_df = pd.read_csv(SUPERCON_CSV)
# remove rows for materials with an ambiguous composition:
data_df = data_df[data_df['Composition'].isnull() == False]
# remove corrupted data with invalid Tc:
data_df = data_df[data_df['Tc (K)'] < 400]
# show dataframe in notebook:
display(data_df)
Material | Substitutions | Composition | Tc (K) | Pressure (GPa) | Shape | Substrate | DOI | |
---|---|---|---|---|---|---|---|---|
1 | In O | {} | {'In': 1, 'O': 1} | 2.00 | 0.0 | films | NaN | 10.1134/1.568304 |
2 | Pr Ru 2 Si 2 | {} | {'Pr': 1, 'Ru': 2, 'Si': 2} | 14.00 | 0.0 | NaN | NaN | 10.1088/0953-8984/12/34/307 |
3 | Lu 3 Os 4 Ge 13 | {} | {'Lu': 3, 'Os': 4, 'Ge': 13} | 2.85 | 0.0 | NaN | NaN | NaN |
4 | La Mn O 3.04 | {} | {'La': 1, 'Mn': 1, 'O': 3.04} | 125.00 | 0.0 | NaN | NaN | 10.1103/physrevlett.87.127206 |
5 | La Mn O 3.045 | {} | {'La': 1, 'Mn': 1, 'O': 3.045} | 130.00 | 0.0 | NaN | NaN | 10.1103/physrevlett.87.127206 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
23876 | Nb Se 2 | {} | {'Nb': 1, 'Se': 2} | 7.00 | 0.0 | nanobelts | NaN | 10.1103/physrevb.75.020501 |
23877 | Mg B 2 | {} | {'Mg': 1, 'B': 2} | 40.00 | 0.0 | NaN | NaN | 10.1103/physrevb.66.012511 |
23878 | Ca Bi 2 | {} | {'Ca': 1, 'Bi': 2} | 2.00 | 0.0 | NaN | NaN | NaN |
23879 | Ca Bi 2 | {} | {'Ca': 1, 'Bi': 2} | 2.00 | 0.0 | single-crystalline | NaN | NaN |
23880 | Cd 2 Re 2 O 7 | {} | {'Cd': 2, 'Re': 2, 'O': 7} | 1.00 | 0.0 | single crystals | NaN | 10.1016/s0022-3697(02)00090-2 |
19463 rows × 8 columns
Here’s a summary of the different features included in the dataset:
Material: The chemical formula of the superconducting material.
Substitutions: Python dictionary consisting of doping ratios.
Composition: Composition of the chemical formula under doping.
Tc: Reported critical temperature of superconductivity (K).
Pressure: Applied pressure to superconducting sample (GPa).
Shape: Shape of the superconducting sample (if any).
Substrate: Substrate the sample was deposited on (if any).
DOI: DOI of paper that data was extracted from.
In this section, we will not attempt to predict any superconductor properties; instead, we will aim to extract insight from this dataset by applying clustering and distribution estimation models.
Preprocessing Data#
Show code cell source
import numpy as np
# Generate a list of elements in the dataset:
ELEMENTS = set()
for v in data_df['Composition'].values:
ELEMENTS |= set(eval(v).keys())
ELEMENTS = sorted(ELEMENTS)
def vectorize_composition(composition, elements):
""" converts an elemental composition dict to a vector. """
total_n = sum(composition.values())
vec = np.zeros(len(elements))
for elem, n in composition.items():
if elem in elements:
vec[elements.index(elem)] = n/total_n
return vec
print('Number of elements:', len(ELEMENTS))
Number of elements: 86
def parse_data_vector(row):
""" parses data from a dataframe row """
# parse the composition dict:
composition_dict = eval(row['Composition'])
# parse feature vector (x):
x_vector = np.concatenate([
vectorize_composition(composition_dict, ELEMENTS),
np.array([ row['Pressure (GPa)'] ]),
np.array([ row['Tc (K)'] ]),
])
return x_vector
from sklearn.preprocessing import StandardScaler
data_x = np.array([
parse_data_vector(row)
for _, row in data_df.iterrows()
])
print(data_x.shape)
# normalize data:
scaler = StandardScaler()
data_z = scaler.fit_transform(data_x)
# extract a matrix of only compositions:
composition_z = data_z[:,:-2]
(19463, 88)
Estimate the Empirical Distribution of \(T_c\)#
from sklearn.neighbors import KernelDensity
Tc_data = data_x[:,-1]
Tc_kde = KernelDensity(kernel='gaussian').fit(Tc_data.reshape(-1,1))
import matplotlib.pyplot as plt
# generate points to evaluate the KDE model:
Tc_eval_pts = np.linspace(0, np.max(Tc_data), 1000)
# determine the normalized density of the distribution:
Tc_density = np.exp(Tc_kde.score_samples(
Tc_eval_pts.reshape(-1,1)))
Tc_density /= np.trapz(Tc_density, Tc_eval_pts)
plt.figure()
plt.plot(Tc_eval_pts, Tc_density, label='KDE Distribution')
plt.fill_between(Tc_eval_pts, Tc_density, alpha=0.2)
plt.axvline(
40, linestyle=':', color='r',
label='Limit of Conventional\nSuperconductivity (40 K)')
plt.grid()
plt.xlabel('Temperature (K)')
plt.ylabel('Probability Density')
plt.legend()
plt.xlim(0,np.max(Tc_data))
plt.show()
Show code cell output

Identifying Outliers in the Distribution:#
N_outliers = 10
log_likelihoods = Tc_kde.score_samples(Tc_data.reshape(-1,1))
cutoff = np.sort(log_likelihoods)[N_outliers]
outlier_df = data_df[log_likelihoods < cutoff]
display(outlier_df[['Material','Tc (K)','Pressure (GPa)','DOI']])
Material | Tc (K) | Pressure (GPa) | DOI | |
---|---|---|---|---|
520 | Y H 6 | 290.0 | 300.0 | 10.1103/physrevb.99.220502 |
618 | Si 2 H 6 | 153.0 | 170.0 | NaN |
1422 | Sc H 6 | 196.0 | 170.0 | 10.1007/1-4020-3989-1_6 |
2879 | H 3 S 0.96 Si 0.04 | 274.0 | 170.0 | NaN |
3714 | Y H 10 | 303.0 | 400.0 | NaN |
4164 | Y H 9 | 256.0 | 177.0 | NaN |
6699 | K H 2 P O 4 | 210.0 | 170.0 | NaN |
10629 | Si 2 H 6 | 174.0 | 275.0 | 10.1016/j.physc.2012.12.002 |
14301 | H 3 S | 178.0 | 170.0 | 10.1038/srep38570 |
20021 | La H 10 | 218.0 | 170.0 | NaN |
Correlations of Composition with \(T_c\)#
cov_matrix = np.cov(data_z.T)
Tc_covs = cov_matrix[:-2,-1]
from pymatgen.util.plotting import periodic_table_heatmap
import matplotlib.pyplot as plt
import matplotlib
# convert Tc covariances to an (element -> cov.) dictionary:
Tc_element_covs = {
elem : cov
for elem, cov in zip(ELEMENTS, Tc_covs)
}
# plot covariance periodic table heatmap:
max_cov = np.max(np.abs(Tc_covs))
heatmap = periodic_table_heatmap(
Tc_element_covs,
cmap='coolwarm',
symbol_fontsize=20,
cmap_range=(-max_cov, max_cov),
blank_color=matplotlib.colormaps['coolwarm'](0.5)
)
plt.title('Correlation of Element Composition with $T_c$', fontsize='20')
plt.show()
Show code cell output

Identifying Clusters in the Data#
from sklearn.decomposition import PCA
full_pca = PCA(n_components=data_z.shape[1])
full_pca.fit(data_z)
lambdas = full_pca.explained_variance_
# cutoff (selected by inspecting the plot below)
pca_cutoff = 10
Show code cell source
# plot explained variance vs. principal component:
plt.figure()
plt.bar(np.arange(1,len(lambdas)+1),lambdas)
plt.ylabel(r'Explained Variance ($\lambda$)')
plt.axvline(pca_cutoff, linestyle='--', color='r',
label=f'Selected Cutoff: {pca_cutoff}')
plt.xlabel('Principal Component')
plt.show()

pca = PCA(n_components=pca_cutoff)
data_u = pca.fit_transform(data_z)
plt.figure()
plt.grid()
plt.scatter(data_u[:,0], data_u[:,1], marker='+')
plt.xlabel(r'$u_1$')
plt.ylabel(r'$u_2$')
plt.title('Projection onto Principal Components')
plt.show()
Show code cell output

from sklearn.cluster import KMeans
# number of clusters to find:
n_clusters = 6
kmeans = KMeans(n_clusters, n_init='auto')
kmeans.fit(data_u)
assignments = kmeans.predict(data_u)
clusters = {}
for i in range(n_clusters):
clusters[i] = data_u[assignments == i]
Show code cell source
plt.figure()
for i, cluster in clusters.items():
plt.scatter(cluster[:,0], cluster[:,1], marker='+',
label=f'Cluster {i}')
plt.xlabel(r'$u_1$')
plt.ylabel(r'$u_2$')
plt.title('Principal Component Clusters')
plt.legend()
plt.show()

Show code cell content
# show some samples of each cluster:
for i in clusters.keys():
# obtain cluster dataframe:
cluster_df = data_df[assignments == i]
sample_df = cluster_df[
['Material', 'Tc (K)', 'Pressure (GPa)', 'DOI']
].sample(5)
# show sample dataframe:
display(f'Cluster {i} Examples:')
display(sample_df)
'Cluster 0 Examples:'
Material | Tc (K) | Pressure (GPa) | DOI | |
---|---|---|---|---|
5143 | Y Ba 2 Cu 4 O 8 | 80.0 | 0.0 | 10.1103/physrevlett.100.047003 |
17201 | La 1.885 Ba 0.115 Cu O 4 | 13.0 | 0.0 | 10.1126/science.aan3438 |
20577 | Bi 2 Sr 2 Ca Cu 2 O 8 | 33.5 | 0.0 | 10.1088/0953-8984/22/42/426004 |
10749 | Sr 2 Ru O 4 | 1.4 | 0.0 | 10.1103/physrevlett.87.257601 |
3716 | Mg H 6 | 271.0 | 300.0 | NaN |
'Cluster 1 Examples:'
Material | Tc (K) | Pressure (GPa) | DOI | |
---|---|---|---|---|
5367 | Ce Pt 3 Si | 0.75 | 0.0 | 10.1038/srep41853 |
3459 | Ce Pt 3 Si | 0.50 | 0.0 | 10.1088/1367-2630/11/5/055054 |
3900 | Y Pt Bi | 0.77 | 0.0 | 10.1088/0953-8984/27/27/275701 |
8238 | Ce Co In 5 | 2.30 | 0.0 | 10.1038/srep01446 |
10002 | Cs 3 C 60 | 40.00 | 0.0 | 10.1088/0953-8984/28/15/153001 |
'Cluster 2 Examples:'
Material | Tc (K) | Pressure (GPa) | DOI | |
---|---|---|---|---|
8271 | Mg B 2 | 39.00 | 0.0 | 10.1088/0953-2048/17/5/001 |
18635 | Mg 0.9 (Al Li) 0.1 B 2 | 1.00 | 0.0 | 10.1016/j.physc.2007.04.111 |
14636 | Zr B 12 | 5.82 | 0.0 | 10.1103/physrevb.69.212508 |
14385 | Mg B 2 | 40.00 | 0.0 | 10.1007/s10948-006-0164-9 |
784 | Mg B 2 | 0.62 | 0.0 | 10.1016/j.ssc.2009.12.007 |
'Cluster 3 Examples:'
Material | Tc (K) | Pressure (GPa) | DOI | |
---|---|---|---|---|
11673 | Li Fe As | 17.0 | 0.0 | 10.1103/revmodphys.83.1589 |
16583 | Sm 0.95 La 0.05 Fe As O 0.85 F 0.15 | 52.0 | 0.0 | 10.1209/0295-5075/84/67014 |
3497 | K 0.95 Ni 1.86 Se 2 | 0.3 | 0.0 | NaN |
20433 | Nb Se 2 | 7.0 | 0.0 | 10.1038/nphys3583 |
12507 | Fe Se | 65.0 | 0.0 | NaN |
'Cluster 4 Examples:'
Material | Tc (K) | Pressure (GPa) | DOI | |
---|---|---|---|---|
4605 | Ho Ni 2 B 2 C | 8.0 | 0.0 | NaN |
10048 | Ti | 0.4 | 0.0 | 10.1134/s1063783418110173 |
12554 | Mo N | 7.8 | 0.0 | NaN |
13135 | Nb | 8.5 | 0.0 | 10.1103/physrevb.76.132502 |
4930 | Ti N | 4.5 | 0.0 | 10.1007/s10909-013-1078-0 |
'Cluster 5 Examples:'
Material | Tc (K) | Pressure (GPa) | DOI | |
---|---|---|---|---|
22433 | Pr Ru 4 Sb 12 | 1.04 | 0.0 | 10.1143/jpsj.73.1135 |
16770 | Pd 0.88 Ni 0.12 | 9.00 | 0.0 | 10.1103/physrevb.93.174510 |
2083 | U Rh Ge | 0.25 | 0.0 | 10.1103/physrevb.67.054416 |
11349 | Al | 1.30 | 0.0 | 10.1021/acs.nanolett.9b02369 |
10217 | Tl Ni 2 Se S | 1.30 | 0.0 | 10.1103/physrevb.90.201105 |