Application: Classifying Superconductors

Application: Classifying Superconductors#

You can download the dataset for this section using the following Python code:

import requests

CSV_URL = 'https://raw.githubusercontent.com/cburdine/materials-ml-workshop/main/MaterialsML/unsupervised_learning/matml_supercon_cleaned.csv'

r = requests.get(CSV_URL)
with open('matml_supercon_cleaned.csv', 'w', encoding='utf-8') as f:
    f.write(r.text)

Alternatively, you can download the CSV file directly here.

Loading the Dataset#

Let’s begin by loading the dataset in to a Pandas dataframe so we can view the features:

	Material	Substitutions	Composition	Tc (K)	Pressure (GPa)	Shape	Substrate	DOI
1	In O	{}	{'In': 1, 'O': 1}	2.00	0.0	films	NaN	10.1134/1.568304
2	Pr Ru 2 Si 2	{}	{'Pr': 1, 'Ru': 2, 'Si': 2}	14.00	0.0	NaN	NaN	10.1088/0953-8984/12/34/307
3	Lu 3 Os 4 Ge 13	{}	{'Lu': 3, 'Os': 4, 'Ge': 13}	2.85	0.0	NaN	NaN	NaN
4	La Mn O 3.04	{}	{'La': 1, 'Mn': 1, 'O': 3.04}	125.00	0.0	NaN	NaN	10.1103/physrevlett.87.127206
5	La Mn O 3.045	{}	{'La': 1, 'Mn': 1, 'O': 3.045}	130.00	0.0	NaN	NaN	10.1103/physrevlett.87.127206
...	...	...	...	...	...	...	...	...
23876	Nb Se 2	{}	{'Nb': 1, 'Se': 2}	7.00	0.0	nanobelts	NaN	10.1103/physrevb.75.020501
23877	Mg B 2	{}	{'Mg': 1, 'B': 2}	40.00	0.0	NaN	NaN	10.1103/physrevb.66.012511
23878	Ca Bi 2	{}	{'Ca': 1, 'Bi': 2}	2.00	0.0	NaN	NaN	NaN
23879	Ca Bi 2	{}	{'Ca': 1, 'Bi': 2}	2.00	0.0	single-crystalline	NaN	NaN
23880	Cd 2 Re 2 O 7	{}	{'Cd': 2, 'Re': 2, 'O': 7}	1.00	0.0	single crystals	NaN	10.1016/s0022-3697(02)00090-2

19463 rows × 8 columns

Here’s a summary of the different features included in the dataset:

Material: The chemical formula of the superconducting material.
Substitutions: Python dictionary consisting of doping ratios.
Composition: Composition of the chemical formula under doping.
Tc: Reported critical temperature of superconductivity (K).
Pressure: Applied pressure to superconducting sample (GPa).
Shape: Shape of the superconducting sample (if any).
Substrate: Substrate the sample was deposited on (if any).
DOI: DOI of paper that data was extracted from.

In this section, we will not attempt to predict any superconductor properties; instead, we will aim to extract insight from this dataset by applying clustering and distribution estimation models.

Preprocessing Data#

Number of elements: 86

def parse_data_vector(row):
    """ parses data from a dataframe row """
    
    # parse the composition dict:
    composition_dict = eval(row['Composition'])
    
    # parse feature vector (x):
    x_vector = np.concatenate([
        vectorize_composition(composition_dict, ELEMENTS),
        np.array([ row['Pressure (GPa)'] ]),
        np.array([ row['Tc (K)'] ]),
    ])
     
    return x_vector

from sklearn.preprocessing import StandardScaler

data_x = np.array([
    parse_data_vector(row)
    for _, row in data_df.iterrows()
])

print(data_x.shape)

# normalize data:
scaler = StandardScaler()
data_z = scaler.fit_transform(data_x)

# extract a matrix of only compositions:
composition_z = data_z[:,:-2]

(19463, 88)

Estimate the Empirical Distribution of \(T_c\)#

from sklearn.neighbors import KernelDensity

Tc_data = data_x[:,-1]
Tc_kde = KernelDensity(kernel='gaussian').fit(Tc_data.reshape(-1,1))

import matplotlib.pyplot as plt

# generate points to evaluate the KDE model:
Tc_eval_pts = np.linspace(0, np.max(Tc_data), 1000)

# determine the normalized density of the distribution:
Tc_density = np.exp(Tc_kde.score_samples(
                        Tc_eval_pts.reshape(-1,1)))
Tc_density /= np.trapz(Tc_density, Tc_eval_pts)

plt.figure()
plt.plot(Tc_eval_pts, Tc_density, label='KDE Distribution')
plt.fill_between(Tc_eval_pts, Tc_density, alpha=0.2)
plt.axvline(
    40,  linestyle=':', color='r',
    label='Limit of Conventional\nSuperconductivity (40 K)')
plt.grid()
plt.xlabel('Temperature (K)')
plt.ylabel('Probability Density')
plt.legend()
plt.xlim(0,np.max(Tc_data))
plt.show()

Identifying Outliers in the Distribution:#

N_outliers = 10

log_likelihoods = Tc_kde.score_samples(Tc_data.reshape(-1,1))
cutoff = np.sort(log_likelihoods)[N_outliers]

outlier_df = data_df[log_likelihoods < cutoff]
display(outlier_df[['Material','Tc (K)','Pressure (GPa)','DOI']])

	Material	Tc (K)	Pressure (GPa)	DOI
520	Y H 6	290.0	300.0	10.1103/physrevb.99.220502
618	Si 2 H 6	153.0	170.0	NaN
1422	Sc H 6	196.0	170.0	10.1007/1-4020-3989-1_6
2879	H 3 S 0.96 Si 0.04	274.0	170.0	NaN
3714	Y H 10	303.0	400.0	NaN
4164	Y H 9	256.0	177.0	NaN
6699	K H 2 P O 4	210.0	170.0	NaN
10629	Si 2 H 6	174.0	275.0	10.1016/j.physc.2012.12.002
14301	H 3 S	178.0	170.0	10.1038/srep38570
20021	La H 10	218.0	170.0	NaN

Correlations of Composition with \(T_c\)#

cov_matrix = np.cov(data_z.T)
Tc_covs = cov_matrix[:-2,-1]

from pymatgen.util.plotting import periodic_table_heatmap

import matplotlib.pyplot as plt
import matplotlib
# convert Tc covariances to an (element -> cov.) dictionary:
Tc_element_covs = {
    elem : cov
    for elem, cov in zip(ELEMENTS, Tc_covs)
}

# plot covariance periodic table heatmap:
max_cov = np.max(np.abs(Tc_covs))
heatmap = periodic_table_heatmap(
    Tc_element_covs, 
    cmap='coolwarm', 
    symbol_fontsize=20,
    cmap_range=(-max_cov, max_cov),
    blank_color=matplotlib.colormaps['coolwarm'](0.5)
)
plt.title('Correlation of Element Composition with $T_c$', fontsize='20')
plt.show()

Identifying Clusters in the Data#

from sklearn.decomposition import PCA

full_pca = PCA(n_components=data_z.shape[1])
full_pca.fit(data_z)
lambdas = full_pca.explained_variance_

# cutoff (selected by inspecting the plot below)
pca_cutoff = 10

../_images/a1f6da77d60f78b5e00ce1068cee50c8883ae17af138a1a539b3bced165888e0.png

pca = PCA(n_components=pca_cutoff)
data_u = pca.fit_transform(data_z)

plt.figure()
plt.grid()
plt.scatter(data_u[:,0], data_u[:,1], marker='+')
plt.xlabel(r'$u_1$')
plt.ylabel(r'$u_2$')
plt.title('Projection onto Principal Components')
plt.show()

from sklearn.cluster import KMeans

# number of clusters to find:
n_clusters = 6
kmeans = KMeans(n_clusters, n_init='auto')
kmeans.fit(data_u)
assignments = kmeans.predict(data_u)

clusters = {}
for i in range(n_clusters):
    clusters[i] = data_u[assignments == i]

../_images/4dd52f348f7a1354c5ed2b308e415463c80235f51e45867788a7c07aa139ab8e.png

Show code cell content Hide code cell content

# show some samples of each cluster:
for i in clusters.keys():

    # obtain cluster dataframe:
    cluster_df = data_df[assignments == i]
    sample_df = cluster_df[
        ['Material', 'Tc (K)', 'Pressure (GPa)', 'DOI']
    ].sample(5)

    # show sample dataframe:
    display(f'Cluster {i} Examples:')
    display(sample_df)

'Cluster 0 Examples:'

	Material	Tc (K)	Pressure (GPa)	DOI
5143	Y Ba 2 Cu 4 O 8	80.0	0.0	10.1103/physrevlett.100.047003
17201	La 1.885 Ba 0.115 Cu O 4	13.0	0.0	10.1126/science.aan3438
20577	Bi 2 Sr 2 Ca Cu 2 O 8	33.5	0.0	10.1088/0953-8984/22/42/426004
10749	Sr 2 Ru O 4	1.4	0.0	10.1103/physrevlett.87.257601
3716	Mg H 6	271.0	300.0	NaN

'Cluster 1 Examples:'

	Material	Tc (K)	DOI
5367	Ce Pt 3 Si	0.75	10.1038/srep41853
3459	Ce Pt 3 Si	0.50	10.1088/1367-2630/11/5/055054
3900	Y Pt Bi	0.77	10.1088/0953-8984/27/27/275701
8238	Ce Co In 5	2.30	10.1038/srep01446
10002	Cs 3 C 60	40.00	10.1088/0953-8984/28/15/153001

'Cluster 2 Examples:'

	Material	Tc (K)	DOI
8271	Mg B 2	39.00	10.1088/0953-2048/17/5/001
18635	Mg 0.9 (Al Li) 0.1 B 2	1.00	10.1016/j.physc.2007.04.111
14636	Zr B 12	5.82	10.1103/physrevb.69.212508
14385	Mg B 2	40.00	10.1007/s10948-006-0164-9
784	Mg B 2	0.62	10.1016/j.ssc.2009.12.007

'Cluster 3 Examples:'

	Material	Tc (K)	DOI
11673	Li Fe As	17.0	10.1103/revmodphys.83.1589
16583	Sm 0.95 La 0.05 Fe As O 0.85 F 0.15	52.0	10.1209/0295-5075/84/67014
3497	K 0.95 Ni 1.86 Se 2	0.3	NaN
20433	Nb Se 2	7.0	10.1038/nphys3583
12507	Fe Se	65.0	NaN

'Cluster 4 Examples:'

	Material	Tc (K)	DOI
4605	Ho Ni 2 B 2 C	8.0	NaN
10048	Ti	0.4	10.1134/s1063783418110173
12554	Mo N	7.8	NaN
13135	Nb	8.5	10.1103/physrevb.76.132502
4930	Ti N	4.5	10.1007/s10909-013-1078-0

'Cluster 5 Examples:'

	Material	Tc (K)	DOI
22433	Pr Ru 4 Sb 12	1.04	10.1143/jpsj.73.1135
16770	Pd 0.88 Ni 0.12	9.00	10.1103/physrevb.93.174510
2083	U Rh Ge	0.25	10.1103/physrevb.67.054416
11349	Al	1.30	10.1021/acs.nanolett.9b02369
10217	Tl Ni 2 Se S	1.30	10.1103/physrevb.90.201105