Data Analysis and Visualization#

Now that we have learned about the Numpy and Scipy packages, let’s take a look at some packages that can help us analyze and visualize data. Although several Python packages can assist with these tasks, we will focus on the two most popular: Pandas and Matplotlib. Pandas (pandas) is a package that provides an interface for working with large heterogeneous datasets using array-like structures called DataFrames. Matplotlib (matplotlib) is the most popular data visualization tool in Python, with an interface inspired by the plotting functionality in MATLAB. In this section, we will discuss how to analyze data with pandas; in the next section, we will discuss how to plot and visualize data with matplotlib.

The Pandas Package#

Pandas is an open-source Python package for data manipulation and analysis. The package is (sadly) not named after the fuzzy black and white mammal. Its name is derived from “panel data”, a term from econometrics that refers to the type of series-based data commonly used in that field. Much like how the numpy package is centered around the numpy.ndarray data type, Pandas is centered around an array-like data type called pandas.DataFrame.

When using the Pandas package, it is customary to import it with the alias pd:

import pandas as pd

Working with Pandas DataFrames#

A DataFrame is a two-dimensional table-like structure with labeled rows and columns. It is similar to a spreadsheet or SQL table. The easiest way to create a DataFrame is by constructing one from a Python dictionary as follows:

# Data on the first four elements of the periodic table:
elements_data = {
    'Element' : ['H', 'He', 'Li', 'Be'],
    'Atomic Number' : [ 1, 2, 3, 4 ],
    'Mass' : [ 1.008, 4.002, 6.940, 9.012],
    'Electronegativity' : [ 2.20, 0.0, 0.98, 1.57 ]
}

# construct dataframe from data dictionary:
df = pd.DataFrame(elements_data)

In a Jupyter Notebook, we can display a Pandas DataFrame using the display function:

display(df)
Element Atomic Number Mass Electronegativity
0 H 1 1.008 2.20
1 He 2 4.002 0.00
2 Li 3 6.940 0.98
3 Be 4 9.012 1.57
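Before manipulating a DataFrame, it often helps to check its size and the data type pandas inferred for each column. The following is a minimal sketch that rebuilds the same four-element table and inspects it with the shape, columns, and dtypes attributes:

```python
import pandas as pd

# Same four-element dictionary as in the example above:
elements_data = {
    'Element': ['H', 'He', 'Li', 'Be'],
    'Atomic Number': [1, 2, 3, 4],
    'Mass': [1.008, 4.002, 6.940, 9.012],
    'Electronegativity': [2.20, 0.0, 0.98, 1.57],
}
df = pd.DataFrame(elements_data)

# (number of rows, number of columns):
print(df.shape)

# column labels, in the order of the dictionary keys:
print(list(df.columns))

# data type inferred for each column:
print(df.dtypes)
```

Each dictionary key becomes a column label, and each list becomes that column's values, so all lists must have the same length.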

Data Manipulation#

Pandas provides various functions and methods that make the manipulation and transformation of data relatively simple. Using square brackets (i.e. []) together with indexers of the DataFrame class such as .iloc and .at, we can index the DataFrame by row or column, or even by ranges of rows and columns. For example:

# access a single column:
display(df['Element'])

# access multiple columns:
display(df[ ['Element', 'Mass'] ])

# access a single row:
display(df.iloc[1])

# access a single value:
display(df.at[1,'Mass'])

# access a range of rows:
display(df.iloc[1:3])

# access multiple rows and columns simultaneously:
display(df.iloc[1:3, :2])
0     H
1    He
2    Li
3    Be
Name: Element, dtype: object
Element Mass
0 H 1.008
1 He 4.002
2 Li 6.940
3 Be 9.012
Element                 He
Atomic Number            2
Mass                 4.002
Electronegativity      0.0
Name: 1, dtype: object
np.float64(4.002)
Element Atomic Number Mass Electronegativity
1 He 2 4.002 0.00
2 Li 3 6.940 0.98
Element Atomic Number
1 He 2
2 Li 3

From inspecting the output above, we notice that when a range of rows or a list of columns is accessed, the returned result is a pandas.DataFrame. However, whenever we access a single row or a single column of a DataFrame, the result is a 1D sequence of values called a Series, not a DataFrame. To access the values of a Series, we simply use square brackets:

# extract series (i.e. row or column) from dataframe:
element_series = df['Element']
helium_series = df.iloc[1]

# verify the returned types are a Series object:
print(type(element_series))
print(type(helium_series))

# access values in series:
print(element_series[0])
print(helium_series['Mass'])
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
H
4.002
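Square brackets on a Series look values up by label, which coincides with position only while the default 0, 1, 2, ... index is intact. The sketch below, which uses a reduced two-column copy of the table, contrasts position-based access (.iloc) with label-based access (.loc) on both a DataFrame and a filtered Series:

```python
import pandas as pd

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Mass': [1.008, 4.002, 6.940, 9.012],
})

# .iloc slices by position and excludes the end position:
print(len(df.iloc[1:3]))   # 2 rows (positions 1 and 2)

# .loc slices by label and includes the end label:
print(len(df.loc[1:3]))    # 3 rows (labels 1, 2 and 3)

# after filtering, the surviving rows keep their original labels:
heavy = df[df['Mass'] > 4]['Element']
print(list(heavy.index))   # [1, 2, 3]

first_by_position = heavy.iloc[0]   # first remaining entry
by_label = heavy.loc[1]             # entry whose label is 1

# element symbols can serve as row labels instead of integers:
by_symbol = df.set_index('Element')
li_mass = by_symbol.loc['Li', 'Mass']
print(first_by_position, by_label, li_mass)
```

Here both first_by_position and by_label refer to helium, because the filtered Series no longer has a label 0. Using .iloc or .loc explicitly avoids ambiguity about which kind of lookup is intended.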

Sometimes, we might want to extract Series data as a Numpy array. This can be done by simply constructing a Numpy array from the Series object:

import numpy as np

# get the 'Mass' column and convert it to a numpy array:
mass_series = df['Mass']
mass_array = np.array(mass_series)

print(mass_array)
[1.008 4.002 6.94  9.012]
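Pandas also offers the to_numpy method for the same purpose, which works on both Series and DataFrame objects. A short sketch, using a reduced copy of the table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Atomic Number': [1, 2, 3, 4],
    'Mass': [1.008, 4.002, 6.940, 9.012],
})

# a Series becomes a 1D array:
mass_array = df['Mass'].to_numpy()
print(type(mass_array), mass_array.shape)

# a DataFrame of numeric columns becomes a 2D array (rows x columns):
numeric_array = df[['Atomic Number', 'Mass']].to_numpy()
print(numeric_array.shape)
```

Note that mixing integer and floating-point columns in one 2D array converts every entry to a common type (here, float).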

We can also filter the rows of DataFrames using Boolean indexing. This feature is similar to how Numpy arrays can be filtered:

# keep only elements with a nonzero electronegativity:
filtered_df = df[ df['Electronegativity'] > 0 ]

display(filtered_df)
Element Atomic Number Mass Electronegativity
0 H 1 1.008 2.20
2 Li 3 6.940 0.98
3 Be 4 9.012 1.57
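Several conditions can be combined in one filter. As with Numpy arrays, the element-wise operators & (and), | (or), and ~ (not) are used instead of the Python keywords, and each condition needs its own parentheses. A minimal sketch, in which the mass threshold of 9 is an arbitrary choice for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Mass': [1.008, 4.002, 6.940, 9.012],
    'Electronegativity': [2.20, 0.0, 0.98, 1.57],
})

# build a Boolean mask from two conditions:
mask = (df['Electronegativity'] > 0) & (df['Mass'] < 9)

# rows where the mask is True:
light_df = df[mask]
print(list(light_df['Element']))   # ['H', 'Li']

# rows where the mask is False:
other_df = df[~mask]
print(list(other_df['Element']))   # ['He', 'Be']

# the same filter written as a query string:
queried_df = df.query('Electronegativity > 0 and Mass < 9')
```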

We can also add columns to the DataFrame by assigning a list (or Numpy array) of values to the new column name. For example:

# add group and period columns to the dataframe:
df['Group'] = [ 1, 18, 1, 2 ]
df['Period'] = np.array([ 1, 1, 2, 2 ])

display(df)
Element Atomic Number Mass Electronegativity Group Period
0 H 1 1.008 2.20 1 1
1 He 2 4.002 0.00 18 1
2 Li 3 6.940 0.98 1 2
3 Be 4 9.012 1.57 2 2
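Assigning to df['Group'] modifies the DataFrame in place. When the original table should stay unchanged, the assign method returns a new DataFrame with the extra columns instead. A brief sketch of that alternative, using a reduced copy of the table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Atomic Number': [1, 2, 3, 4],
})

# assign returns a new DataFrame and leaves df untouched:
df_with_groups = df.assign(
    Group=[1, 18, 1, 2],
    Period=np.array([1, 1, 2, 2]),
)
print(list(df.columns))
print(list(df_with_groups.columns))

# drop removes columns, again returning a new DataFrame:
df_without_period = df_with_groups.drop(columns=['Period'])
print(list(df_without_period.columns))
```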

Transforming Data#

A crucial part of working with Pandas DataFrames is the transformation of data. Usually this involves applying some mathematical function to a DataFrame column and storing the result in a new DataFrame column. To show how this is done in Pandas, let’s write a function called approximate_mass, which (naively) approximates atomic mass as the atomic number times the sum of the proton and neutron masses (in atomic mass units); that is, it assumes one neutron per proton. We can apply the function using the apply method of either a Series or DataFrame object:

from scipy.constants import proton_mass, neutron_mass, m_u

# define an approximation function:
def approximate_mass(atomic_number):
    """ Naively approximates atomic mass """
    return atomic_number * (proton_mass + neutron_mass)/ m_u

# add an "Estimated Mass" column to the dataframe:
df['Estimated Mass'] = df['Atomic Number'].apply(approximate_mass)

display(df)
Element Atomic Number Mass Electronegativity Group Period Estimated Mass
0 H 1 1.008 2.20 1 1 2.015941
1 He 2 4.002 0.00 18 1 4.031883
2 Li 3 6.940 0.98 1 2 6.047824
3 Be 4 9.012 1.57 2 2 8.063766
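For simple arithmetic such as this, the same result can be obtained without apply by operating on the whole column at once, just as with Numpy arrays. The sketch below compares the two approaches. To keep it runnable without SciPy, the combined proton and neutron mass is hard-coded as a rounded value (an assumption of this sketch, not part of the original example):

```python
import numpy as np
import pandas as pd

# proton mass + neutron mass in atomic mass units (rounded, hard-coded):
NUCLEON_PAIR_MASS_U = 1.007276 + 1.008665

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Atomic Number': [1, 2, 3, 4],
    'Mass': [1.008, 4.002, 6.940, 9.012],
})

def approximate_mass(atomic_number):
    """ Naively approximates atomic mass """
    return atomic_number * NUCLEON_PAIR_MASS_U

# element-by-element, via apply:
applied = df['Atomic Number'].apply(approximate_mass)

# whole column at once (vectorized):
vectorized = df['Atomic Number'] * NUCLEON_PAIR_MASS_U

# relative error of the estimate with respect to the tabulated mass:
df['Estimated Mass'] = vectorized
df['Relative Error'] = (df['Estimated Mass'] - df['Mass']) / df['Mass']
print(df)
```

The apply method remains useful when the transformation cannot be written as array arithmetic, for example when it involves branching logic or string processing.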

Analyzing Data#

Pandas provides a wide range of functions for analyzing data. The quickest way to obtain summary statistics of numerical columns in a DataFrame is by using the describe method:

df.describe()
Atomic Number Mass Electronegativity Group Period Estimated Mass
count 4.000000 4.000000 4.000000 4.000000 4.00000 4.000000
mean 2.500000 5.240500 1.187500 5.500000 1.50000 5.039853
std 1.290994 3.490962 0.935356 8.346656 0.57735 2.602569
min 1.000000 1.008000 0.000000 1.000000 1.00000 2.015941
25% 1.750000 3.253500 0.735000 1.000000 1.00000 3.527897
50% 2.500000 5.471000 1.275000 1.500000 1.50000 5.039853
75% 3.250000 7.458000 1.727500 6.000000 2.00000 6.551809
max 4.000000 9.012000 2.200000 18.000000 2.00000 8.063766

This method reports key statistics, such as the count, mean, standard deviation (std), min, max, and quartiles. We can obtain each of these values directly using the method with the respective name (e.g. df['Mass'].mean()). Below, we give some examples of how to compute these statistics:

# compute mean of the 'Mass' column:
display(df['Mass'].mean())

# compute standard deviations of mass and electronegativity:
display(df[['Mass', 'Electronegativity']].std())

# compute the mean of electronegativity grouped by 'Group':
display(df.groupby('Group')['Electronegativity'].mean())
np.float64(5.2405)
Mass                 3.490962
Electronegativity    0.935356
dtype: float64
Group
1     1.59
2     1.57
18    0.00
Name: Electronegativity, dtype: float64
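A grouped column can also be summarized with several statistics at once by passing a list of method names to agg. A minimal sketch on the same four elements:

```python
import pandas as pd

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Electronegativity': [2.20, 0.0, 0.98, 1.57],
    'Group': [1, 18, 1, 2],
})

# one row per group, one column per requested statistic:
summary = df.groupby('Group')['Electronegativity'].agg(
    ['mean', 'min', 'max', 'count']
)
print(summary)

# individual entries are looked up by group label and statistic name:
group1_mean = summary.loc[1, 'mean']
print(group1_mean)
```

Group 1 contains hydrogen and lithium, so its mean is the average of 2.20 and 0.98, in agreement with the value of 1.59 shown above.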

Importing and Exporting Data#

Pandas supports reading and writing data from/to various file formats, including CSV (comma-separated values), Excel spreadsheets, SQL databases, and more. You can use functions like read_csv, to_csv, read_excel, to_excel, read_sql, and to_sql to handle data input and output operations. In this workshop, we will use data that is primarily written in the CSV format. For example:

# export dataframe to CSV file:
df.to_csv('elements_data.csv', index=False)

# import the exported CSV back into a DataFrame:
imported_df = pd.read_csv('elements_data.csv')

# display imported DataFrame:
display(imported_df)
Element Atomic Number Mass Electronegativity Group Period Estimated Mass
0 H 1 1.008 2.20 1 1 2.015941
1 He 2 4.002 0.00 18 1 4.031883
2 Li 3 6.940 0.98 1 2 6.047824
3 Be 4 9.012 1.57 2 2 8.063766
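The index=False argument keeps the row labels out of the file. To illustrate why this matters, the sketch below performs the round trip in memory (using io.StringIO in place of a file on disk, an assumption made only to keep the example self-contained) with and without the index:

```python
import io
import pandas as pd

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Mass': [1.008, 4.002, 6.940, 9.012],
})

# with no file path, to_csv returns the CSV text as a string:
csv_text = df.to_csv(index=False)
print(csv_text)

# read the text back as if it were a file:
round_trip = pd.read_csv(io.StringIO(csv_text))
print(list(round_trip.columns))

# if the index is written, it comes back as an extra unnamed column:
csv_with_index = df.to_csv()
with_index = pd.read_csv(io.StringIO(csv_with_index))
print(list(with_index.columns))   # ['Unnamed: 0', 'Element', 'Mass']

# index_col=0 tells read_csv to use that first column as the index:
restored = pd.read_csv(io.StringIO(csv_with_index), index_col=0)
print(list(restored.columns))
```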

Exercises#

Exercise 1: Exploring the Periodic Table

For this exercise, we will be working with a large dataset with data about elements of the Periodic Table. First, download the dataset here and extract the Periodic Table of Elements.csv file into the same folder as your Jupyter Notebook. Then, you should be able to import the data into your Jupyter Notebook as follows:

import pandas as pd

# Load periodic table dataframe:
ptable_df = pd.read_csv('Periodic Table of Elements.csv')

# Display dataframe columns:
display(ptable_df.columns) 

# Display dataframe:
display(ptable_df)

Using this data, answer the following questions:

  1. What fraction of elements of the Periodic Table were discovered before 1900?

  2. Which elements have at least 100 isotopes?

  3. What is the average atomic mass of the radioactive elements?


Data used in this exercise was obtained from GoodmanSciences.

Solutions#

Exercise 1: Exploring the Periodic Table#

import numpy as np
import pandas as pd

# Load periodic table dataframe:
ptable_df = pd.read_csv('Periodic Table of Elements.csv')

# Display dataframe columns:
display(ptable_df.columns) 

# Display dataframe:
display(ptable_df)


# Question 1:
print('What fraction of elements of the periodic table was discovered before 1900?')

year_discovered = ptable_df['Year']
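# note: count() skips missing (NaN) entries, so the denominator is the number
# of elements with a recorded year, which may be fewer than the number of rows:
frac = year_discovered[year_discovered < 1900].count() / year_discovered.count()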
print(frac)


# Question 2:
print('\nWhich elements have at least 100 isotopes?')

n_isotopes = ptable_df['NumberOfIsotopes']
max_isotope_df = ptable_df[n_isotopes >= 100]
print(max_isotope_df[['Element','NumberOfIsotopes']])


# Question 3:
print('\nWhat is the average atomic mass of radioactive elements?')
radioactive_elements = ptable_df[ptable_df['Radioactive'] == 'yes']
print(radioactive_elements['AtomicMass'].mean())
Index(['AtomicNumber', 'Element', 'Symbol', 'AtomicMass', 'NumberofNeutrons',
       'NumberofProtons', 'NumberofElectrons', 'Period', 'Group', 'Phase',
       'Radioactive', 'Natural', 'Metal', 'Nonmetal', 'Metalloid', 'Type',
       'AtomicRadius', 'Electronegativity', 'FirstIonization', 'Density',
       'MeltingPoint', 'BoilingPoint', 'NumberOfIsotopes', 'Discoverer',
       'Year', 'SpecificHeat', 'NumberofShells', 'NumberofValence'],
      dtype='object')
AtomicNumber Element Symbol AtomicMass NumberofNeutrons NumberofProtons NumberofElectrons Period Group Phase ... FirstIonization Density MeltingPoint BoilingPoint NumberOfIsotopes Discoverer Year SpecificHeat NumberofShells NumberofValence
0 1 Hydrogen H 1.007 0 1 1 1 1.0 gas ... 13.5984 0.000090 14.175 20.28 3.0 Cavendish 1766.0 14.304 1 1.0
1 2 Helium He 4.002 2 2 2 1 18.0 gas ... 24.5874 0.000179 NaN 4.22 5.0 Janssen 1868.0 5.193 1 NaN
2 3 Lithium Li 6.941 4 3 3 2 1.0 solid ... 5.3917 0.534000 453.850 1615.00 5.0 Arfvedson 1817.0 3.582 2 1.0
3 4 Beryllium Be 9.012 5 4 4 2 2.0 solid ... 9.3227 1.850000 1560.150 2742.00 6.0 Vaulquelin 1798.0 1.825 2 2.0
4 5 Boron B 10.811 6 5 5 2 13.0 solid ... 8.2980 2.340000 2573.150 4200.00 6.0 Gay-Lussac 1808.0 1.026 2 3.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
113 114 Flerovium Fl 289.000 175 114 114 7 14.0 artificial ... NaN NaN NaN NaN NaN NaN 1999.0 NaN 7 4.0
114 115 Moscovium Mc 288.000 173 115 115 7 15.0 artificial ... NaN NaN NaN NaN NaN NaN 2010.0 NaN 7 5.0
115 116 Livermorium Lv 292.000 176 116 116 7 16.0 artificial ... NaN NaN NaN NaN NaN NaN 2000.0 NaN 7 6.0
116 117 Tennessine Ts 295.000 178 117 117 7 17.0 artificial ... NaN NaN NaN NaN NaN NaN 2010.0 NaN 7 7.0
117 118 Oganesson Og 294.000 176 118 118 7 18.0 artificial ... NaN NaN NaN NaN NaN NaN 2006.0 NaN 7 8.0

118 rows × 28 columns

What fraction of elements of the periodic table was discovered before 1900?
0.6635514018691588

Which elements have at least 100 isotopes?
         Element  NumberOfIsotopes
92     Neptunium             153.0
93     Plutonium             163.0
94     Americium             133.0
95        Curium             133.0
97   Californium             123.0
98   Einsteinium             123.0
99       Fermium             103.0
102   Lawrencium             203.0

What is the average atomic mass of radioactive elements?
248.0298108108108
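When a column has missing entries, the answer to Question 1 depends on the choice of denominator, since Series.count() ignores NaN values while len() counts every row. The following sketch illustrates the difference on a small, hypothetical table (the element names and years are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# hypothetical toy data: element 'C' has no recorded discovery year
toy_df = pd.DataFrame({
    'Element': ['A', 'B', 'C', 'D'],
    'Year': [1766.0, 1868.0, np.nan, 2006.0],
})
year = toy_df['Year']

# comparisons with NaN evaluate to False, so 'C' is not counted here:
n_before_1900 = (year < 1900).sum()

# fraction among elements with a recorded year (NaN excluded):
frac_of_known = n_before_1900 / year.count()

# fraction among all rows (NaN included in the denominator):
frac_of_all = n_before_1900 / len(year)

print(n_before_1900, frac_of_known, frac_of_all)

# number of missing entries in the column:
print(year.isna().sum())
```

Which denominator is appropriate depends on how the question is interpreted; checking isna().sum() first makes the choice explicit.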