Data Analysis and Visualization

Now that we have learned about the NumPy and SciPy packages, let’s take a look at some packages that can help us analyze and visualize data. Although several Python packages can assist with these tasks, we will focus on the two most popular: Pandas and Matplotlib. Pandas (pandas) is a package that provides an interface for working with large heterogeneous datasets using table-like structures called DataFrames. Matplotlib (matplotlib) is the most popular data visualization tool in Python, with an interface inspired by the plotting functionality in MATLAB. In this section, we will discuss how to analyze data with the pandas package; the next section covers how to plot and visualize data with matplotlib.

The Pandas Package#

Pandas is an open-source Python package for data manipulation and analysis. The package is (sadly) not named after the fuzzy black and white mammal. The name is derived from “panel data”, a term from econometrics that refers to the type of series-based data commonly used in that field. Much like how the numpy package is centered around the numpy.ndarray data type, Pandas is centered around a table-like data type called pandas.DataFrame.

When using the Pandas package, it is customary to import it with the alias pd:

import pandas as pd

Working with Pandas DataFrames#

A DataFrame is a two-dimensional table-like structure with labeled rows and columns. It is similar to a spreadsheet or SQL table. The easiest way to create a DataFrame is by constructing one from a Python dictionary as follows:

# Data on the first four elements of the periodic table:
elements_data = {
    'Element' : ['H', 'He', 'Li', 'Be'],
    'Atomic Number' : [ 1, 2, 3, 4 ],
    'Mass' : [ 1.008, 4.002, 6.940, 9.012],
    'Electronegativity' : [ 2.20, 0.0, 0.98, 1.57 ]
}

# construct dataframe from data dictionary:
df = pd.DataFrame(elements_data)

In a Jupyter Notebook, we can display a Pandas DataFrame using the display function:

display(df)

	Element	Atomic Number	Mass	Electronegativity
0	H	1	1.008	2.20
1	He	2	4.002	0.00
2	Li	3	6.940	0.98
3	Be	4	9.012	1.57
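Before manipulating a DataFrame, it often helps to check its size, column names, and column data types. The following is a minimal sketch that rebuilds the same four-element table and inspects it with the shape, columns, and dtypes attributes; the variable names n_rows, n_cols, and column_names are only illustrative.

```python
import pandas as pd

# Rebuild the four-element table from the example above:
elements_data = {
    'Element': ['H', 'He', 'Li', 'Be'],
    'Atomic Number': [1, 2, 3, 4],
    'Mass': [1.008, 4.002, 6.940, 9.012],
    'Electronegativity': [2.20, 0.0, 0.98, 1.57],
}
df = pd.DataFrame(elements_data)

# shape is a (rows, columns) tuple:
n_rows, n_cols = df.shape

# columns holds the column labels, in the order of the dictionary keys:
column_names = list(df.columns)

print(n_rows, n_cols)
print(column_names)

# dtypes lists the data type that pandas inferred for each column:
print(df.dtypes)
```

Each column has its own data type, which is what lets a single DataFrame hold text and numbers side by side.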

Data Manipulation:#

Pandas provides various functions and methods that make the manipulation and transformation of data relatively simple. Using square brackets (i.e. […]) and the methods of the DataFrame class, we can index the DataFrame by row or column, or even by ranges of rows and columns. For example:

# access a single column:
display(df['Element'])

# access multiple columns:
display(df[ ['Element', 'Mass'] ])

# access a single row:
display(df.iloc[1])

# access a single value:
display(df.at[1,'Mass'])

# access a range of rows:
display(df.iloc[1:3])

# access multiple rows and columns simultaneously:
display(df.iloc[1:3, :2])

0     H
1    He
2    Li
3    Be
Name: Element, dtype: object

	Element	Mass
0	H	1.008
1	He	4.002
2	Li	6.940
3	Be	9.012

Element                 He
Atomic Number            2
Mass                 4.002
Electronegativity      0.0
Name: 1, dtype: object

np.float64(4.002)

	Element	Atomic Number	Mass	Electronegativity
1	He	2	4.002	0.00
2	Li	3	6.940	0.98

	Element	Atomic Number
1	He	2
2	Li	3

From inspecting the output above, we notice that when a range of rows or columns is accessed, the returned result is a pandas.DataFrame. However, whenever we access a single row or a single column of a DataFrame, the result is a 1D sequence of values called a Series, not a DataFrame. To access the values of a Series, we simply use square brackets:

# extract series (i.e. row or column) from dataframe:
element_series = df['Element']
helium_series = df.iloc[1]

# verify that both returned objects are Series:
print(type(element_series))
print(type(helium_series))

# access values in series:
print(element_series[0])
print(helium_series['Mass'])

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
H
4.002
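One detail worth keeping in mind is that square brackets on a Series look values up by index label, not by position. In the example above the two coincide because the row labels are 0, 1, 2, 3. The sketch below (using a reduced two-column version of the table, with illustrative variable names) shows a case where they differ, and uses .iloc and .loc to make the intent explicit.

```python
import pandas as pd

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Mass': [1.008, 4.002, 6.940, 9.012],
})

# Slicing rows keeps the original row labels (here 2 and 3):
tail_elements = df.iloc[2:]['Element']
print(list(tail_elements.index))

# .iloc selects by position, .loc selects by label:
by_position = tail_elements.iloc[0]   # first entry of the slice
by_label = tail_elements.loc[2]       # entry whose row label is 2
print(by_position, by_label)

# Labels need not be integers; here the element symbols become the index:
masses = df.set_index('Element')['Mass']
helium_mass = masses.loc['He']
print(helium_mass)
```

Using .loc and .iloc explicitly avoids surprises once rows have been sliced, filtered, or re-indexed.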

Sometimes, we might want to extract Series data as a NumPy array. This can be done by simply constructing a NumPy array from the Series object:

import numpy as np

# get the 'Mass' column and convert it to a numpy array:
mass_series = df['Mass']
mass_array = np.array(mass_series)

print(mass_array)

[1.008 4.002 6.94  9.012]
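Pandas also provides a to_numpy method on both Series and DataFrame objects, which does the same conversion without calling np.array directly. A short sketch, again on a reduced version of the table:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Atomic Number': [1, 2, 3, 4],
    'Mass': [1.008, 4.002, 6.940, 9.012],
})

# Convert a single column (a Series) to a 1D array:
mass_array = df['Mass'].to_numpy()
print(mass_array.shape)

# Convert several numerical columns to a 2D array (one row per element):
table_array = df[['Atomic Number', 'Mass']].to_numpy()
print(table_array.shape)
print(table_array.dtype)
```

When the selected columns mix integers and floats, the resulting array uses a single floating-point type for all entries.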

We can also filter the rows of DataFrames using Boolean indexing. This feature is similar to how NumPy arrays can be filtered:

# only show elements with a nonzero electronegativity:
filtered_df = df[ df['Electronegativity'] > 0 ]

display(filtered_df)

	Element	Atomic Number	Mass	Electronegativity
0	H	1	1.008	2.20
2	Li	3	6.940	0.98
3	Be	4	9.012	1.57
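Several conditions can be combined in one filter using the element-wise operators & (and) and | (or). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. The sketch below uses the same four-element table; the thresholds are arbitrary and chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Atomic Number': [1, 2, 3, 4],
    'Mass': [1.008, 4.002, 6.940, 9.012],
    'Electronegativity': [2.20, 0.0, 0.98, 1.57],
})

# rows that satisfy both conditions:
light_and_electronegative = df[(df['Electronegativity'] > 0) & (df['Mass'] < 9)]
print(list(light_and_electronegative['Element']))

# rows that satisfy at least one condition:
first_or_heavy = df[(df['Atomic Number'] == 1) | (df['Mass'] > 9)]
print(list(first_or_heavy['Element']))

# rows whose value appears in a given list:
selected = df[df['Element'].isin(['He', 'Be'])]
print(list(selected['Element']))
```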

We can also add columns to the DataFrame by assigning a list (or NumPy array) of values to the new column name. For example:

# add group and period columns to the dataframe:
df['Group'] = [ 1, 18, 1, 2 ]
df['Period'] = np.array([ 1, 1, 2, 2 ])

display(df)

	Element	Atomic Number	Mass	Electronegativity	Group	Period
0	H	1	1.008	2.20	1	1
1	He	2	4.002	0.00	18	1
2	Li	3	6.940	0.98	1	2
3	Be	4	9.012	1.57	2	2

Transforming Data#

A crucial part of working with Pandas DataFrames is the transformation of data. Usually this involves applying some mathematical function to a DataFrame column and storing the result in a new DataFrame column. To show how this is done in Pandas, let’s write a function called approximate_mass, which (naively) approximates atomic mass as the atomic number times the sum of the proton and neutron masses (in atomic mass units). This assumes that every atom has as many neutrons as protons, which is why the estimate for hydrogen comes out at roughly twice the tabulated mass. We can apply the function using the apply method of either a Series or DataFrame object:

from scipy.constants import proton_mass, neutron_mass, m_u

# define an approximation function:
def approximate_mass(atomic_number):
    """ Naively approximates atomic mass """
    return atomic_number * (proton_mass + neutron_mass)/ m_u

# add an "Estimated Mass" column to the dataframe:
df['Estimated Mass'] = df['Atomic Number'].apply(approximate_mass)

display(df)

	Element	Atomic Number	Mass	Electronegativity	Group	Period	Estimated Mass
0	H	1	1.008	2.20	1	1	2.015941
1	He	2	4.002	0.00	18	1	4.031883
2	Li	3	6.940	0.98	1	2	6.047824
3	Be	4	9.012	1.57	2	2	8.063766
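Because approximate_mass only uses arithmetic, it can also be called on the whole column at once instead of going through apply. The sketch below compares the two approaches and adds a 'Relative Error' column, which is a hypothetical name introduced here for illustration and not part of the original table.

```python
import numpy as np
import pandas as pd
from scipy.constants import proton_mass, neutron_mass, m_u

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Atomic Number': [1, 2, 3, 4],
    'Mass': [1.008, 4.002, 6.940, 9.012],
})

def approximate_mass(atomic_number):
    """ Naively approximates atomic mass """
    return atomic_number * (proton_mass + neutron_mass) / m_u

# element-by-element evaluation with apply:
applied = df['Atomic Number'].apply(approximate_mass)

# vectorized evaluation on the whole column at once:
vectorized = approximate_mass(df['Atomic Number'])
print(np.allclose(applied, vectorized))

# arithmetic between columns is also element-wise:
df['Estimated Mass'] = vectorized
df['Relative Error'] = (df['Estimated Mass'] - df['Mass']) / df['Mass']
print(df)
```

For simple arithmetic the vectorized form is usually preferable on large tables, while apply remains useful for functions that cannot operate on whole columns.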

Analyzing Data#

Pandas provides a wide range of functions for analyzing data. The quickest way to obtain summary statistics of numerical columns in a DataFrame is by using the describe method:

df.describe()

	Atomic Number	Mass	Electronegativity	Group	Period	Estimated Mass
count	4.000000	4.000000	4.000000	4.000000	4.00000	4.000000
mean	2.500000	5.240500	1.187500	5.500000	1.50000	5.039853
std	1.290994	3.490962	0.935356	8.346656	0.57735	2.602569
min	1.000000	1.008000	0.000000	1.000000	1.00000	2.015941
25%	1.750000	3.253500	0.735000	1.000000	1.00000	3.527897
50%	2.500000	5.471000	1.275000	1.500000	1.50000	5.039853
75%	3.250000	7.458000	1.727500	6.000000	2.00000	6.551809
max	4.000000	9.012000	2.200000	18.000000	2.00000	8.063766

This method reports key statistics, such as the number of values (count), mean, standard deviation (std), min, max, and quartiles. Each statistic can also be computed on its own by calling the method of the same name (e.g. df['Mass'].mean(), or df.mean(numeric_only=True) for all numerical columns at once). Below, we give some examples of how to compute these statistics individually:

# compute mean of the 'Mass' column:
display(df['Mass'].mean())

# compute standard deviations of mass and electronegativity:
display(df[['Mass', 'Electronegativity']].std())

# compute the mean of electronegativity grouped by 'Group':
display(df.groupby('Group')['Electronegativity'].mean())
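As a rough check of what these calls return, the sketch below recomputes a few of the statistics on a reduced copy of the table. The agg method, which evaluates several named statistics in one call, is shown as one possible shortcut; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Mass': [1.008, 4.002, 6.940, 9.012],
    'Electronegativity': [2.20, 0.0, 0.98, 1.57],
    'Group': [1, 18, 1, 2],
})

# several statistics of one column in a single call:
mass_summary = df['Mass'].agg(['min', 'max', 'mean'])
print(mass_summary)

# mean electronegativity per group; H and Li share group 1:
group_means = df.groupby('Group')['Electronegativity'].mean()
print(group_means)

# column means, skipping the non-numerical 'Element' column:
numeric_means = df.mean(numeric_only=True)
print(numeric_means)
```

The result of groupby(...).mean() is a Series indexed by the group labels, so individual group values can be read with .loc.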

Importing and Exporting Data#

Pandas supports reading and writing data from/to various file formats, including CSV (comma-separated values), Excel spreadsheets, SQL databases, and more. You can use functions like read_csv, to_csv, read_excel, to_excel, read_sql, and to_sql to handle data input and output operations. In this workshop, we will primarily use data stored in the CSV format. For example:

# export dataframe to CSV file:
df.to_csv('elements_data.csv', index=False)

# import the exported CSV back into a DataFrame:
imported_df = pd.read_csv('elements_data.csv')

# display imported DataFrame:
display(imported_df)

	Element	Atomic Number	Mass	Electronegativity	Group	Period	Estimated Mass
0	H	1	1.008	2.20	1	1	2.015941
1	He	2	4.002	0.00	18	1	4.031883
2	Li	3	6.940	0.98	1	2	6.047824
3	Be	4	9.012	1.57	2	2	8.063766
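To see what the index=False argument does, it can help to look at the CSV text itself. The sketch below writes to an in-memory text buffer (io.StringIO) instead of a file, which is a convenience chosen for this illustration so that nothing is written to disk.

```python
import io
import pandas as pd

df = pd.DataFrame({
    'Element': ['H', 'He', 'Li', 'Be'],
    'Atomic Number': [1, 2, 3, 4],
    'Mass': [1.008, 4.002, 6.940, 9.012],
})

# write the CSV text into an in-memory buffer, without the row labels:
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# when no path is given, to_csv returns the CSV text as a string;
# with the default index=True, the row labels form an unnamed first column:
csv_with_index = df.to_csv()
print(csv_with_index)

# read the text back into a DataFrame:
roundtrip_df = pd.read_csv(io.StringIO(csv_text))
print(roundtrip_df)
```

Without index=False, reading the file back produces an extra column (typically named 'Unnamed: 0') holding the old row labels.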

Exercises#

Solutions#

Exercise 1: Exploring the Periodic Table#

import numpy as np
import pandas as pd

# Load periodic table dataframe:
ptable_df = pd.read_csv('Periodic Table of Elements.csv')

# Display dataframe columns:
display(ptable_df.columns) 

# Display dataframe:
display(ptable_df)


# Question 1:
print('What fraction of elements of the periodic table was discovered before 1900?')

year_discovered = ptable_df['Year']
frac = year_discovered[year_discovered < 1900].count() / year_discovered.count()
print(frac)


# Question 2:
print('\nWhich elements have at least 100 isotopes?')

n_isotopes = ptable_df['NumberOfIsotopes']
max_isotope_df = ptable_df[n_isotopes >= 100]
print(max_isotope_df[['Element','NumberOfIsotopes']])


# Question 3:
print('\nWhat is the average atomic mass of radioactive elements?')
radioactive_elements = ptable_df[ptable_df['Radioactive'] == 'yes']
print(radioactive_elements['AtomicMass'].mean())

Index(['AtomicNumber', 'Element', 'Symbol', 'AtomicMass', 'NumberofNeutrons',
       'NumberofProtons', 'NumberofElectrons', 'Period', 'Group', 'Phase',
       'Radioactive', 'Natural', 'Metal', 'Nonmetal', 'Metalloid', 'Type',
       'AtomicRadius', 'Electronegativity', 'FirstIonization', 'Density',
       'MeltingPoint', 'BoilingPoint', 'NumberOfIsotopes', 'Discoverer',
       'Year', 'SpecificHeat', 'NumberofShells', 'NumberofValence'],
      dtype='object')

	AtomicNumber	Element	Symbol	AtomicMass	NumberofNeutrons	NumberofProtons	NumberofElectrons	Period	Group	Phase	...	FirstIonization	Density	MeltingPoint	BoilingPoint	NumberOfIsotopes	Discoverer	Year	SpecificHeat	NumberofShells	NumberofValence
0	1	Hydrogen	H	1.007	0	1	1	1	1.0	gas	...	13.5984	0.000090	14.175	20.28	3.0	Cavendish	1766.0	14.304	1	1.0
1	2	Helium	He	4.002	2	2	2	1	18.0	gas	...	24.5874	0.000179	NaN	4.22	5.0	Janssen	1868.0	5.193	1	NaN
2	3	Lithium	Li	6.941	4	3	3	2	1.0	solid	...	5.3917	0.534000	453.850	1615.00	5.0	Arfvedson	1817.0	3.582	2	1.0
3	4	Beryllium	Be	9.012	5	4	4	2	2.0	solid	...	9.3227	1.850000	1560.150	2742.00	6.0	Vaulquelin	1798.0	1.825	2	2.0
4	5	Boron	B	10.811	6	5	5	2	13.0	solid	...	8.2980	2.340000	2573.150	4200.00	6.0	Gay-Lussac	1808.0	1.026	2	3.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
113	114	Flerovium	Fl	289.000	175	114	114	7	14.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	1999.0	NaN	7	4.0
114	115	Moscovium	Mc	288.000	173	115	115	7	15.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	2010.0	NaN	7	5.0
115	116	Livermorium	Lv	292.000	176	116	116	7	16.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	2000.0	NaN	7	6.0
116	117	Tennessine	Ts	295.000	178	117	117	7	17.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	2010.0	NaN	7	7.0
117	118	Oganesson	Og	294.000	176	118	118	7	18.0	artificial	...	NaN	NaN	NaN	NaN	NaN	NaN	2006.0	NaN	7	8.0

118 rows × 28 columns

What fraction of elements of the periodic table was discovered before 1900?
0.6635514018691588

Which elements have at least 100 isotopes?
         Element  NumberOfIsotopes
92     Neptunium             153.0
93     Plutonium             163.0
94     Americium             133.0
95        Curium             133.0
97   Californium             123.0
98   Einsteinium             123.0
99       Fermium             103.0
102   Lawrencium             203.0

What is the average atomic mass of radioactive elements?
0298108108108
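Note that the periodic table data contains missing values (NaN), and the count method used in Question 1 only counts non-missing entries. The sketch below uses a small made-up Series (not the real dataset) to show how the choice of denominator changes the fraction when some years are missing.

```python
import numpy as np
import pandas as pd

# toy data for illustration only: three known years and one missing value
years = pd.Series([1766.0, 1868.0, 1999.0, np.nan], name='Year')

# comparisons with NaN evaluate to False:
before_1900 = years < 1900
print(list(before_1900))

# fraction relative to entries with a known year (count skips NaN):
frac_of_known = before_1900.sum() / years.count()

# fraction relative to all entries (len includes NaN):
frac_of_all = before_1900.sum() / len(years)

print(frac_of_known, frac_of_all)
```

Which denominator is appropriate depends on the question being asked; the solution above reports the fraction among elements with a recorded discovery year.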