Data Analysis and Visualization#
Now that we have learned about the Numpy and Scipy packages, let’s take a look at some packages that can help us with analyzing and visualizing data. Although there are several Python packages that can assist with these tasks, we will focus on the two most popular packages: Pandas and Matplotlib. Pandas (`pandas`) is a package that provides an interface for working with large heterogeneous datasets using array-like structures called DataFrames. Matplotlib (`matplotlib`) is the most popular data visualization tool in Python, with an interface inspired by the plotting functionality in MATLAB. In this section, we will discuss how to analyze data with the `pandas` package; plotting and visualizing data with `matplotlib` is covered in the next section.
The Pandas Package#
Pandas is an open-source Python package for data manipulation and analysis. The package is (sadly) not named after the fuzzy black and white mammal. Its name is derived from “panel data”, a term from econometrics for the kind of multi-period, series-based data commonly used in that field. Much like how the `numpy` package is centered around the `numpy.ndarray` data type, Pandas is centered around an array-like data type called `pandas.DataFrame`.
When using the Pandas package, it is customary to import it with the alias `pd`:
import pandas as pd
Working with Pandas DataFrames#
A DataFrame is a two-dimensional table-like structure with labeled rows and columns. It is similar to a spreadsheet or SQL table. The easiest way to create a DataFrame is by constructing one from a Python dictionary as follows:
# Data on the first four elements of the periodic table:
elements_data = {
    'Element' : ['H', 'He', 'Li', 'Be'],
    'Atomic Number' : [ 1, 2, 3, 4 ],
    'Mass' : [ 1.008, 4.002, 6.940, 9.012],
    'Electronegativity' : [ 2.20, 0.0, 0.98, 1.57 ]
}
# construct dataframe from data dictionary:
df = pd.DataFrame(elements_data)
In a Jupyter Notebook, we can display a Pandas DataFrame using the `display` function:
display(df)
| | Element | Atomic Number | Mass | Electronegativity |
---|---|---|---|---|
0 | H | 1 | 1.008 | 2.20 |
1 | He | 2 | 4.002 | 0.00 |
2 | Li | 3 | 6.940 | 0.98 |
3 | Be | 4 | 9.012 | 1.57 |
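Once a DataFrame has been constructed, it is often useful to check its size and the data type of each column before going further. The following is a minimal sketch that rebuilds the same example table so that it runs on its own, and uses `print` rather than `display` so that it also works outside of a notebook:

```python
import pandas as pd

# Rebuild the small example table so this snippet is self-contained.
elements_data = {
    'Element': ['H', 'He', 'Li', 'Be'],
    'Atomic Number': [1, 2, 3, 4],
    'Mass': [1.008, 4.002, 6.940, 9.012],
    'Electronegativity': [2.20, 0.0, 0.98, 1.57],
}
df = pd.DataFrame(elements_data)

# (number of rows, number of columns):
print(df.shape)

# column labels as a plain Python list:
print(list(df.columns))

# data type that Pandas inferred for each column:
print(df.dtypes)
```

Each dictionary key becomes a column label, and each list becomes that column's values, so all lists must have the same length.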
Data Manipulation#
Pandas provides various functions and methods that make the manipulation and transformation of data relatively simple. Using square brackets (i.e. `[…]`) and the methods of the `DataFrame` class, we can index a DataFrame by row, by column, or by ranges of rows and columns. For example:
# access a single column:
display(df['Element'])
# access multiple columns:
display(df[ ['Element', 'Mass'] ])
# access a single row:
display(df.iloc[1])
# access a single value:
display(df.at[1,'Mass'])
# access a range of rows:
display(df.iloc[1:3])
# access multiple rows and columns simultaneously:
display(df.iloc[1:3, :2])
0 H
1 He
2 Li
3 Be
Name: Element, dtype: object
| | Element | Mass |
---|---|---|
0 | H | 1.008 |
1 | He | 4.002 |
2 | Li | 6.940 |
3 | Be | 9.012 |
Element He
Atomic Number 2
Mass 4.002
Electronegativity 0.0
Name: 1, dtype: object
np.float64(4.002)
| | Element | Atomic Number | Mass | Electronegativity |
---|---|---|---|---|
1 | He | 2 | 4.002 | 0.00 |
2 | Li | 3 | 6.940 | 0.98 |
| | Element | Atomic Number |
---|---|---|
1 | He | 2 |
2 | Li | 3 |
From inspecting the output above, we notice that when a range of rows or columns is accessed, the returned result is a `pandas.DataFrame`. However, whenever we access a single row or a single column of a DataFrame, the result is a 1D sequence of values called a Series, not a DataFrame. To access the values of a Series, we simply use square brackets:
# extract series (i.e. row or column) from dataframe:
element_series = df['Element']
helium_series = df.iloc[1]
# verify the returned types are a Series object:
print(type(element_series))
print(type(helium_series))
# access values in series:
print(element_series[0])
print(helium_series['Mass'])
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
H
4.002
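The examples so far select rows by integer position with `iloc`. Pandas also offers `loc`, which selects by label. The sketch below assumes we want to look rows up by element symbol, so it first turns the `'Element'` column into the row index with `set_index`; the variable names are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Mass': [1.008, 4.002, 6.940, 9.012],
})

# use the element symbols as row labels instead of 0, 1, 2, ...
by_symbol = df.set_index('Element')

# label-based lookup: row labelled 'He', column labelled 'Mass'
he_mass = by_symbol.loc['He', 'Mass']

# position-based lookup: second row, first column
same_mass = by_symbol.iloc[1, 0]

# label slices include both endpoints; position slices exclude the stop
label_slice = by_symbol.loc['He':'Li']
position_slice = by_symbol.iloc[1:3]

print(he_mass, same_mass)
print(label_slice.equals(position_slice))
```

Note the difference in slicing behavior: `loc['He':'Li']` includes the `'Li'` row, whereas `iloc[1:3]` stops before position 3.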
Sometimes, we might want to extract Series data as a Numpy array. This can be done by simply constructing a Numpy array from the Series object:
import numpy as np
# get the 'Mass' column and convert it to a numpy array:
mass_series = df['Mass']
mass_array = np.array(mass_series)
print(mass_array)
[1.008 4.002 6.94 9.012]
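As an alternative to calling `np.array` on the Series, Pandas provides the `Series.to_numpy()` method for the same purpose. A short sketch, using a standalone Series with the same mass values:

```python
import numpy as np
import pandas as pd

mass_series = pd.Series([1.008, 4.002, 6.940, 9.012], name='Mass')

# to_numpy() returns the values of the Series as a numpy.ndarray:
mass_array = mass_series.to_numpy()

print(type(mass_array))
print(mass_array.dtype)
```

Both approaches produce an array with the same values; `to_numpy()` simply makes the conversion explicit in the code.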
We can also filter the rows of DataFrames using Boolean indexing. This feature is similar to how Numpy arrays can be filtered:
# only show electronegative elements:
filtered_df = df[ df['Electronegativity'] > 0 ]
display(filtered_df)
| | Element | Atomic Number | Mass | Electronegativity |
---|---|---|---|---|
0 | H | 1 | 1.008 | 2.20 |
2 | Li | 3 | 6.940 | 0.98 |
3 | Be | 4 | 9.012 | 1.57 |
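Boolean masks can also be combined, just as with Numpy arrays. The sketch below uses `&` (and) and `|` (or) on a reduced copy of the example table; each condition must be wrapped in parentheses because these operators bind more tightly than comparisons:

```python
import pandas as pd

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Atomic Number': [1, 2, 3, 4],
    'Electronegativity': [2.20, 0.0, 0.98, 1.57],
})

# rows where BOTH conditions hold:
both = df[(df['Electronegativity'] > 0) & (df['Atomic Number'] > 1)]

# rows where AT LEAST ONE condition holds:
either = df[(df['Electronegativity'] > 2) | (df['Atomic Number'] == 4)]

print(list(both['Element']))
print(list(either['Element']))
```

Python's `and` and `or` keywords do not work element-wise on Series, which is why the `&` and `|` operators are used here.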
We can also add columns to the DataFrame by assigning a list (or Numpy array) of values to the new column name. For example:
# add group and period columns to the dataframe:
df['Group'] = [ 1, 18, 1, 2 ]
df['Period'] = np.array([ 1, 1, 2, 2 ])
display(df)
| | Element | Atomic Number | Mass | Electronegativity | Group | Period |
---|---|---|---|---|---|---|
0 | H | 1 | 1.008 | 2.20 | 1 | 1 |
1 | He | 2 | 4.002 | 0.00 | 18 | 1 |
2 | Li | 3 | 6.940 | 0.98 | 1 | 2 |
3 | Be | 4 | 9.012 | 1.57 | 2 | 2 |
Transforming Data#
A crucial part of working with Pandas DataFrames is the transformation of data. Usually this involves applying some mathematical function to a DataFrame column and storing the result in a new DataFrame column. To show how this is done in Pandas, let’s write a function called `approximate_mass`, which (naively) approximates atomic mass as the atomic number times the sum of the proton and neutron masses (in atomic mass units). We can apply the function using the `apply` method of either a Series or DataFrame object:
from scipy.constants import proton_mass, neutron_mass, m_u
# define an approximation function:
def approximate_mass(atomic_number):
    """ Naively approximates atomic mass """
    return atomic_number * (proton_mass + neutron_mass) / m_u
# add an "Estimated Mass" column to the dataframe:
df['Estimated Mass'] = df['Atomic Number'].apply(approximate_mass)
display(df)
| | Element | Atomic Number | Mass | Electronegativity | Group | Period | Estimated Mass |
---|---|---|---|---|---|---|---|
0 | H | 1 | 1.008 | 2.20 | 1 | 1 | 2.015941 |
1 | He | 2 | 4.002 | 0.00 | 18 | 1 | 4.031883 |
2 | Li | 3 | 6.940 | 0.98 | 1 | 2 | 6.047824 |
3 | Be | 4 | 9.012 | 1.57 | 2 | 2 | 8.063766 |
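For simple arithmetic like this, the same column can also be computed without `apply`, because arithmetic operators act on a whole Series at once (much like Numpy arrays). The following sketch compares the two approaches on a reduced copy of the example table; it is meant as an illustration of the alternative, not a replacement for the method shown above:

```python
import numpy as np
import pandas as pd
from scipy.constants import proton_mass, neutron_mass, m_u

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Atomic Number': [1, 2, 3, 4],
})

def approximate_mass(atomic_number):
    """ Naively approximates atomic mass """
    return atomic_number * (proton_mass + neutron_mass) / m_u

# element-by-element: the function is called once per value
applied = df['Atomic Number'].apply(approximate_mass)

# whole-column arithmetic: one expression for the entire Series
vectorized = df['Atomic Number'] * (proton_mass + neutron_mass) / m_u

# both approaches should agree to floating-point precision
print(np.allclose(applied, vectorized))
```

The `apply` method remains the more general tool, since it also works with functions that cannot be written as plain column arithmetic.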
Analyzing Data#
Pandas provides a wide range of functions for analyzing data. The quickest way to obtain summary statistics of numerical columns in a DataFrame is by using the `describe` method:
df.describe()
| | Atomic Number | Mass | Electronegativity | Group | Period | Estimated Mass |
---|---|---|---|---|---|---|
count | 4.000000 | 4.000000 | 4.000000 | 4.000000 | 4.00000 | 4.000000 |
mean | 2.500000 | 5.240500 | 1.187500 | 5.500000 | 1.50000 | 5.039853 |
std | 1.290994 | 3.490962 | 0.935356 | 8.346656 | 0.57735 | 2.602569 |
min | 1.000000 | 1.008000 | 0.000000 | 1.000000 | 1.00000 | 2.015941 |
25% | 1.750000 | 3.253500 | 0.735000 | 1.000000 | 1.00000 | 3.527897 |
50% | 2.500000 | 5.471000 | 1.275000 | 1.500000 | 1.50000 | 5.039853 |
75% | 3.250000 | 7.458000 | 1.727500 | 6.000000 | 2.00000 | 6.551809 |
max | 4.000000 | 9.012000 | 2.200000 | 18.000000 | 2.00000 | 8.063766 |
This method reports key statistics for each numerical column, such as the count, mean, standard deviation (std), min, and max. We can also obtain these values individually using the method with the respective name (e.g. `df['Mass'].mean()`). Below, we give some examples of how to compute these statistics individually:
# compute mean of the 'Mass' column:
display(df['Mass'].mean())
# compute standard deviations of mass and electronegativity:
display(df[['Mass', 'Electronegativity']].std())
# compute the mean of electronegativity grouped by 'Group':
display(df.groupby('Group')['Electronegativity'].mean())
np.float64(5.2405)
Mass 3.490962
Electronegativity 0.935356
dtype: float64
Group
1 1.59
2 1.57
18 0.00
Name: Electronegativity, dtype: float64
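When several statistics are needed per group, the grouped column's `agg` method accepts a list of statistic names and returns one column per statistic. A minimal sketch on a reduced copy of the example table:

```python
import pandas as pd

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Electronegativity': [2.20, 0.0, 0.98, 1.57],
    'Group': [1, 18, 1, 2],
})

# count, mean and max of electronegativity within each group:
summary = df.groupby('Group')['Electronegativity'].agg(['count', 'mean', 'max'])

print(summary)
```

The result is a DataFrame indexed by the group labels, so an individual entry can be read with, for example, `summary.loc[1, 'mean']`.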
Importing and Exporting Data#
Pandas supports reading and writing data from/to various file formats, including CSV (comma-separated values), Excel spreadsheets, SQL databases, and more. You can use functions like `read_csv`, `to_csv`, `read_excel`, `to_excel`, `read_sql`, and `to_sql` to handle data input and output operations. In this workshop, we will use data that is primarily written in the CSV format. For example:
# export dataframe to CSV file:
df.to_csv('elements_data.csv', index=False)
# import the exported csv back into a Dataframe:
imported_df = pd.read_csv('elements_data.csv')
# display imported DataFrame:
display(imported_df)
| | Element | Atomic Number | Mass | Electronegativity | Group | Period | Estimated Mass |
---|---|---|---|---|---|---|---|
0 | H | 1 | 1.008 | 2.20 | 1 | 1 | 2.015941 |
1 | He | 2 | 4.002 | 0.00 | 18 | 1 | 4.031883 |
2 | Li | 3 | 6.940 | 0.98 | 1 | 2 | 6.047824 |
3 | Be | 4 | 9.012 | 1.57 | 2 | 2 | 8.063766 |
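The `index=False` argument in the export step keeps the row labels (0, 1, 2, …) out of the file. The sketch below illustrates what happens without it. To avoid touching the disk, it writes the CSV text to a string (which `to_csv` returns when no path is given) and reads it back through `io.StringIO`:

```python
import io
import pandas as pd

df = pd.DataFrame({'Element': ['H', 'He'], 'Mass': [1.008, 4.002]})

# to_csv returns the CSV text as a string when no file path is given
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# read both versions back in:
reloaded_with = pd.read_csv(io.StringIO(with_index))
reloaded_without = pd.read_csv(io.StringIO(without_index))

print(list(reloaded_with.columns))     # includes an extra 'Unnamed: 0' column
print(list(reloaded_without.columns))

# if the index was written, index_col=0 restores it on import:
recovered = pd.read_csv(io.StringIO(with_index), index_col=0)
print(list(recovered.columns))
```

Either convention works, as long as the export and import steps agree on whether the first column of the file holds the index.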
Exercises#
Exercise 1: Exploring the Periodic Table
For this exercise, we will be working with a large dataset with data about elements of the Periodic Table. First, download the dataset here and extract the `Periodic Table of Elements.csv` file into the same folder as your Jupyter Notebook. Then, you should be able to import the data into your Jupyter Notebook as follows:
import pandas as pd
# Load periodic table dataframe:
ptable_df = pd.read_csv('Periodic Table of Elements.csv')
# Display dataframe columns:
display(ptable_df.columns)
# Display dataframe:
display(ptable_df)
Using this data, answer the following questions:
What fraction of elements of the Periodic Table were discovered before 1900?
Which elements have at least 100 isotopes?
What is the average atomic mass of the radioactive elements?
Data used in this exercise was obtained from GoodmanSciences.
Solutions#
Exercise 1: Exploring the Periodic Table#
import numpy as np
import pandas as pd
# Load periodic table dataframe:
ptable_df = pd.read_csv('Periodic Table of Elements.csv')
# Display dataframe columns:
display(ptable_df.columns)
# Display dataframe:
display(ptable_df)
# Question 1:
print('What fraction of elements of the periodic table was discovered before 1900?')
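year_discovered = ptable_df['Year']
# note: count() ignores missing (NaN) years, so this fraction is taken over
# the elements that have a recorded discovery year:
frac = year_discovered[year_discovered < 1900].count() / year_discovered.count()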
print(frac)
# Question 2:
print('\nWhich elements have at least 100 isotopes?')
n_isotopes = ptable_df['NumberOfIsotopes']
max_isotope_df = ptable_df[n_isotopes >= 100]
print(max_isotope_df[['Element','NumberOfIsotopes']])
# Question 3:
print('\nWhat is the average atomic mass of radioactive elements?')
radioactive_elements = ptable_df[ptable_df['Radioactive'] == 'yes']
print(radioactive_elements['AtomicMass'].mean())
Index(['AtomicNumber', 'Element', 'Symbol', 'AtomicMass', 'NumberofNeutrons',
'NumberofProtons', 'NumberofElectrons', 'Period', 'Group', 'Phase',
'Radioactive', 'Natural', 'Metal', 'Nonmetal', 'Metalloid', 'Type',
'AtomicRadius', 'Electronegativity', 'FirstIonization', 'Density',
'MeltingPoint', 'BoilingPoint', 'NumberOfIsotopes', 'Discoverer',
'Year', 'SpecificHeat', 'NumberofShells', 'NumberofValence'],
dtype='object')
| | AtomicNumber | Element | Symbol | AtomicMass | NumberofNeutrons | NumberofProtons | NumberofElectrons | Period | Group | Phase | ... | FirstIonization | Density | MeltingPoint | BoilingPoint | NumberOfIsotopes | Discoverer | Year | SpecificHeat | NumberofShells | NumberofValence |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Hydrogen | H | 1.007 | 0 | 1 | 1 | 1 | 1.0 | gas | ... | 13.5984 | 0.000090 | 14.175 | 20.28 | 3.0 | Cavendish | 1766.0 | 14.304 | 1 | 1.0 |
1 | 2 | Helium | He | 4.002 | 2 | 2 | 2 | 1 | 18.0 | gas | ... | 24.5874 | 0.000179 | NaN | 4.22 | 5.0 | Janssen | 1868.0 | 5.193 | 1 | NaN |
2 | 3 | Lithium | Li | 6.941 | 4 | 3 | 3 | 2 | 1.0 | solid | ... | 5.3917 | 0.534000 | 453.850 | 1615.00 | 5.0 | Arfvedson | 1817.0 | 3.582 | 2 | 1.0 |
3 | 4 | Beryllium | Be | 9.012 | 5 | 4 | 4 | 2 | 2.0 | solid | ... | 9.3227 | 1.850000 | 1560.150 | 2742.00 | 6.0 | Vaulquelin | 1798.0 | 1.825 | 2 | 2.0 |
4 | 5 | Boron | B | 10.811 | 6 | 5 | 5 | 2 | 13.0 | solid | ... | 8.2980 | 2.340000 | 2573.150 | 4200.00 | 6.0 | Gay-Lussac | 1808.0 | 1.026 | 2 | 3.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
113 | 114 | Flerovium | Fl | 289.000 | 175 | 114 | 114 | 7 | 14.0 | artificial | ... | NaN | NaN | NaN | NaN | NaN | NaN | 1999.0 | NaN | 7 | 4.0 |
114 | 115 | Moscovium | Mc | 288.000 | 173 | 115 | 115 | 7 | 15.0 | artificial | ... | NaN | NaN | NaN | NaN | NaN | NaN | 2010.0 | NaN | 7 | 5.0 |
115 | 116 | Livermorium | Lv | 292.000 | 176 | 116 | 116 | 7 | 16.0 | artificial | ... | NaN | NaN | NaN | NaN | NaN | NaN | 2000.0 | NaN | 7 | 6.0 |
116 | 117 | Tennessine | Ts | 295.000 | 178 | 117 | 117 | 7 | 17.0 | artificial | ... | NaN | NaN | NaN | NaN | NaN | NaN | 2010.0 | NaN | 7 | 7.0 |
117 | 118 | Oganesson | Og | 294.000 | 176 | 118 | 118 | 7 | 18.0 | artificial | ... | NaN | NaN | NaN | NaN | NaN | NaN | 2006.0 | NaN | 7 | 8.0 |
118 rows × 28 columns
What fraction of elements of the periodic table was discovered before 1900?
0.6635514018691588
Which elements have at least 100 isotopes?
Element NumberOfIsotopes
92 Neptunium 153.0
93 Plutonium 163.0
94 Americium 133.0
95 Curium 133.0
97 Californium 123.0
98 Einsteinium 123.0
99 Fermium 103.0
102 Lawrencium 203.0
What is the average atomic mass of radioactive elements?
248.0298108108108
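The Question 1 solution divides by `year_discovered.count()`, and `count()` skips missing values, so the choice of denominator matters when the `Year` column contains `NaN` entries. The following sketch uses a small, made-up Series (not the real dataset) to show how the two possible denominators differ:

```python
import numpy as np
import pandas as pd

# Hypothetical discovery years; NaN marks an entry with no recorded year.
years = pd.Series([1766.0, 1868.0, np.nan, 1999.0])

# comparisons with NaN evaluate to False
before_1900 = years < 1900

# fraction among entries that have a recorded year (NaN excluded):
frac_known = before_1900.sum() / years.count()

# fraction among all rows (NaN rows counted as "not before 1900"):
frac_all = before_1900.mean()

# number of missing entries:
n_missing = years.isna().sum()

print(frac_known, frac_all, n_missing)
```

Which fraction is appropriate depends on how the missing years should be interpreted, so it is worth checking `isna().sum()` on a column before computing ratios from it.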