More plotting with pandas

Things we will cover:

scatter plots
histograms
grouping data
box-and-whisker plots
plotting on a map

First, the usual imports…

import matplotlib.pyplot as plt
import pandas as pd

Scatter plots

Let’s use the box office dataset

movie_data = pd.read_csv('all-time-worldwide-box-office.csv')
movie_data.head()

	Rank	Year	Movie	WorldwideBox Office	DomesticBox Office
0	1	2009	Avatar	$2,845,899,541</td> <td>$ 760,507,625	$2,085,391,916
1	2	2019	Avengers: Endgame	$2,797,800,564</td> <td>$ 858,373,000	$1,939,427,564
2	3	1997	Titanic	$2,207,986,545</td> <td>$ 659,363,944	$1,548,622,601
3	4	2015	Star Wars Ep. VII: The Force Awakens	$2,064,615,817</td> <td>$ 936,662,225	$1,127,953,592
4	5	2018	Avengers: Infinity War	$2,044,540,523</td> <td>$ 678,815,482	$1,365,725,041

Remember, we need to clean those dollar amounts up. This is the same thing we did last time, except we’ll also conver the dollar amounts to billions at the same time.

def clean_money(entry):
    if isinstance(entry, float):
        return 0

    return int(entry.strip('$').replace(',', ''))

movie_data['Worldwide'] = movie_data['WorldwideBox Office'].apply(clean_money)/1e9
movie_data['Domestic'] = movie_data['DomesticBox Office'].apply(clean_money)/1e9
movie_data['International'] = movie_data['InternationalBox Office'].apply(clean_money)/1e9
movie_data.head()

	Rank	Year	Movie	WorldwideBox Office	DomesticBox Office	InternationalBox Office	Worldwide	Domestic
0	1	2009	Avatar	$2,845,899,541</td> <td>$ 760,507,625	$2,085,391,916	2.845900	0.760508	2.085392
1	2	2019	Avengers: Endgame	$2,797,800,564</td> <td>$ 858,373,000	$1,939,427,564	2.797801	0.858373	1.939428
2	3	1997	Titanic	$2,207,986,545</td> <td>$ 659,363,944	$1,548,622,601	2.207987	0.659364	1.548623
3	4	2015	Star Wars Ep. VII: The Force Awakens	$2,064,615,817</td> <td>$ 936,662,225	$1,127,953,592	2.064616	0.936662	1.127954
4	5	2018	Avengers: Infinity War	$2,044,540,523</td> <td>$ 678,815,482	$1,365,725,041	2.044541	0.678815	1.365725

To make a scatter plot, use:

kind='scatter'

Scatter plots work well when the values the data can take are granular. For example, no two movies likely have the exact same box office numbers. So let’s scatter year versus worldwide box office. This is a nice way to see what is in the dataset.

movie_data.plot(x='Year', y='Worldwide', kind='scatter',
                ylabel='WorldWide Box Office (billions)', xlabel='Year')

    <AxesSubplot:xlabel='Year', ylabel='WorldWide Box Office (billions)'>

png

Scatter plots are also useful when you want to see if there is a relationship between two columns in your dataset. For example, if a movie has a high domestic box office, is it likely to also have a high international box office? Let’s do a scatter plot of these two columns.

movie_data.plot(x='Domestic', y='International', kind='scatter',
                xlabel='Domestic Box Office (billions)',
                ylabel='International Box Office (billions)')

    <AxesSubplot:xlabel='Domestic Box Office (billions)', ylabel='International Box Office (billions)'>

png

One of the useful things about a scatter plot is we can look for and then examine outliers. What are these movies that have very low International box office numbers?

low_international = movie_data['International'] < 0.1
movie_data[low_international]

	Rank	Year	Movie	WorldwideBox Office	DomesticBox Office	InternationalBox Office	Worldwide	Domestic
407	412	2000	How the Grinch Stole Christmas	$345,445,403</td> <td>$ 260,348,825	$85,096,578	0.345445	0.260349	0.085097
455	460	1984	Beverly Hills Cop	$316,300,000</td> <td>$ 234,760,478	$81,539,522	0.316300	0.234760	0.081540
482	487	2009	The Blind Side	$305,705,794</td> <td>$ 255,959,475	$49,746,319	0.305706	0.255959	0.049746
505	511	2002	Austin Powers in Goldmember	$296,338,663</td> <td>$ 213,117,789	$83,220,874	0.296339	0.213118	0.083221
508	514	1984	Ghostbusters	$295,211,811</td> <td>$ 242,212,467	$52,999,344	0.295212	0.242212	0.052999
530	536	2005	Wedding Crashers	$283,218,368</td> <td>$ 209,218,368	$74,000,000	0.283218	0.209218	0.074000
552	558	2012	Lincoln	$273,346,281</td> <td>$ 182,207,973	$91,138,308	0.273346	0.182208	0.091138

What about those movies that have very high Domestic box office numbers (all of them also doing quite well internationally)?

outliers = movie_data['Domestic'] > 0.6
movie_data[outliers]

	Rank	Year	Movie	WorldwideBox Office	DomesticBox Office	InternationalBox Office	Worldwide	Domestic
0	1	2009	Avatar	$2,845,899,541</td> <td>$ 760,507,625	$2,085,391,916	2.845900	0.760508	2.085392
1	2	2019	Avengers: Endgame	$2,797,800,564</td> <td>$ 858,373,000	$1,939,427,564	2.797801	0.858373	1.939428
2	3	1997	Titanic	$2,207,986,545</td> <td>$ 659,363,944	$1,548,622,601	2.207987	0.659364	1.548623
3	4	2015	Star Wars Ep. VII: The Force Awakens	$2,064,615,817</td> <td>$ 936,662,225	$1,127,953,592	2.064616	0.936662	1.127954
4	5	2018	Avengers: Infinity War	$2,044,540,523</td> <td>$ 678,815,482	$1,365,725,041	2.044541	0.678815	1.365725
5	6	2015	Jurassic World	$1,669,979,967</td> <td>$ 652,306,625	$1,017,673,342	1.669980	0.652307	1.017673
8	9	2012	The Avengers	$1,515,100,211</td> <td>$ 623,357,910	$891,742,301	1.515100	0.623358	0.891742
11	12	2018	Black Panther	$1,336,494,321</td> <td>$ 700,059,566	$636,434,755	1.336494	0.700060	0.636435
13	14	2017	Star Wars Ep. VIII: The Last Jedi	$1,331,635,141</td> <td>$ 620,181,382	$711,453,759	1.331635	0.620181	0.711454
17	18	2018	Incredibles 2	$1,242,805,359</td> <td>$ 608,581,744	$634,223,615	1.242805	0.608582	0.634224

Histograms

A histogram let’s us visualize how many things meet certain criteria. For examle, how many movies manage to make a billion at the box office? We can do a histogram on the worldwide box office to see this. To create a histogram, you can use hist() on a single column or an entire data frame:

movie_data.hist(figsize=(10,10))

    array([[<AxesSubplot:title={'center':'Rank'}>,
            <AxesSubplot:title={'center':'Year'}>],
           [<AxesSubplot:title={'center':'Worldwide'}>,
            <AxesSubplot:title={'center':'Domestic'}>],
           [<AxesSubplot:title={'center':'International'}>, <AxesSubplot:>]],
          dtype=object)

png

movie_data.hist(column='Worldwide')

    array([[<AxesSubplot:title={'center':'Worldwide'}>]], dtype=object)

png

To put labels on the axes of this plot, we need to use matplotlib because they don’t work inside of hist():

movie_data.hist(column='Worldwide')
plt.xlabel('Worldwide Box Office (billions)')
plt.ylabel('Number of Movies')
plt.title('Histogram of Worldwide Box Office')

    Text(0.5, 1.0, 'Histogram of Worldwide Box Office')

png

We can specify the number of bins to use in the histogram

movie_data.hist(column='Worldwide',bins=20)
plt.xlabel('Worldwide Box Office (billions)')
plt.ylabel('Number of Movies')
plt.title('Histogram of Worldwide Box Office')

    Text(0.5, 1.0, 'Histogram of Worldwide Box Office')

png

We can specify the exact bins to use:

movie_data.hist(column='Worldwide', bins=[0,0.5,1,1.5,2,2.5,3])
plt.xlabel('Worldwide Box Office (billions)')
plt.ylabel('Number of Movies')
plt.title('Histogram of Worldwide Box Office')

    Text(0.5, 1.0, 'Histogram of Worldwide Box Office')

png

Grouping Data

When you have data that is categorical — it takes a discrete set of values — you may want to group the data according to these values.

What are you going to do with the data once it is grouped? Maybe you want to look at the mean or max or min money earned by all the movies released each year.

Let’s group all of our movies by the year they were released and look at their mean box office:

means_per_year = movie_data.groupby('Year').mean().reset_index()
means_per_year.head()

	Year	Rank	Worldwide	Domestic	International
0	1939	319.0	0.390525	0.198680	0.191845
1	1942	571.0	0.268000	0.102797	0.165203
2	1950	585.0	0.263591	0.085000	0.178591
3	1965	533.0	0.286214	0.163214	0.123000
4	1972	568.0	0.268500	0.134966	0.133534

Notice that we used reset_index() to get Year added as a column.

Now we can plot this:

recent = means_per_year['Year'] > 2000
means_per_year[recent].plot(x='Year', y='Worldwide', ylabel='Mean Worldwide Box Office (billions)', )

    <AxesSubplot:xlabel='Year', ylabel='Mean Worldwide Box Office (billions)'>

png

We can also look at the total box office of all movies released each year, using sum():

total_per_year = movie_data.groupby('Year').sum().reset_index()
total_per_year.head()

	Year	Rank	Worldwide	Domestic	International
0	1939	319	0.390525	0.198680	0.191845
1	1942	571	0.268000	0.102797	0.165203
2	1950	585	0.263591	0.085000	0.178591
3	1965	533	0.286214	0.163214	0.123000
4	1972	568	0.268500	0.134966	0.133534

Now we can plot this:

total_per_year.plot(x='Year', y='Worldwide', ylabel='Total Box Office')

    <AxesSubplot:xlabel='Year', ylabel='Total Box Office'>

png

Box-and-whisker plots

A box-and-whisker plot is useful for exploring multiple statistics that characterize a distribution of values.

We can divide all the data into quartiles — four segments that each contain 25% of the data:

Q1 is the 25th-percentile. 25% of the data is below that point.
Q2 is the 50th-percentile — the median. Half of the data is at or below that point, and half above it.
Q3 is the 75th percentile. 75% of the data is below that point.

A box plot shows all three of these.

Let’s create a box-and-whisker plot that shows statistics about the worldwide box office for movies since the year 2000.

recent = movie_data['Year'] > 2000
means = movie_data[recent].groupby('Year').mean()
means.boxplot(column=['Worldwide','International','Domestic'])

    <AxesSubplot:>

png

The bottom of the box is Q1, the top of the box is Q3.
The middle line is the Q2, the median.
The distance between Q3 and Q1 is the interquartile range (IQR).
Each whisker extends to the furthest data point that is within 1.5 times the IQR.
Dots are outliers

Use box plots carefully. These two histograms have the same box-and-whiskers plot.

histograms with same box and whisker plot

You really should use a violin plot instead of a box-and-whiskers plot. But for that you should use seaborn.

import seaborn as sns

recent_movies = movie_data[recent]
sns.violinplot(data=recent_movies, y="Worldwide")

    <AxesSubplot:ylabel='Worldwide'>

png

In the world of Python plotting:

matplotlib — does everything, low level
pandas — uses dataframes, built on matplotlib, can always use matplotlib to tweak
seaborn — tries to do a small number of things well, and make them look pretty

We won’t spend time on violin plots here. Go here if you want to learn more about violin plots.

You can plot lots of box plots at once using by inside of boxplot():

recent = movie_data['Year'] > 2010
movie_data[recent].boxplot(by="Year", column='Worldwide', figsize=(8,5))
# Because we are plotting a bunch of things, we have  both 'suptitle' and 'title' available
plt.suptitle('Worldwide Box Office')
plt.title('')
plt.ylabel('Worldwide Box Office (billions)')

    Text(0, 0.5, 'Worldwide Box Office (billions)')

png

import seaborn as sns

recent_movies = movie_data[recent]
fig, ax = plt.subplots(figsize=(12, 5))
sns.violinplot(data=recent_movies, x="Year", y="Worldwide", ax=ax)

    <AxesSubplot:xlabel='Year', ylabel='Worldwide'>

png

Plotting on a map of the world

Let’s use a new data set — one that includes oil spills and their locations.

Let’s take a look at the dataset:

! head incidents.csv

It has lots of empty data points!

oil_spills = pd.read_csv('incidents.csv')
oil_spills.head()

	id	open_date	name	location	lat	lon	threat	tags	commodity	measure_skim	measure_shore	measure_bio	measure_disperse	measure_burn	max_ptl_release_gallons	posts	description
0	10431	2022-03-21	Tug Vessel Loses Power, Grounds, and Leaks Die…	Neva Strait, Sitka, AK	57.270000	-135.593330	Oil	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0	At approximately 0400 on 21-Mar02922, the tug …
1	10430	2022-03-17	Compromised Fuel Transfer Pipe Spills Oil into…	Oswego, NY	43.459410	-76.531650	Oil	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0	On March 17, 2022, NOAA ERD was notified by Mi…
2	10429	2022-03-16	Floating Humpback Whale Carcass off of Carolin…	Carolina Beach, NC, USA	34.031323	-77.830343	Other	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0	On March 16, 2022, the Gulf of Mexico Marine M…
3	10428	2022-03-15	Containership Grounded off Gibson Island in Ch…	Gibson Island, MD, USA	39.070000	-76.410000	Oil	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2	On 15 March 2022, USCG Sector Maryland NCR not…
4	10426	2022-03-14	Oil Pipeline Discharge into Cahokia Canal, Edw…	Cahokia Canal, Edwardsville, IL	38.824034	-89.974600	Oil	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0	On March 14, 2022, USEPA Region 5 contacted th…

Let’s create a simple scatter plot using latitude and longitude.

oil_spills.plot(x="lon", y="lat", kind="scatter", xlabel='Longitude', ylabel='Latitude', title='Oil Spills')

    <AxesSubplot:title={'center':'Oil Spills'}, xlabel='Longitude', ylabel='Latitude'>

png

This looks a little bit like the United States. What if we put a real map behind it?

🌎🐼

pip install geopandas

import geopandas as gpd

countries = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))

# plot the countries first
countries.plot(color="lightgrey")
# now plot the oil spills
# use ax=plt.gca() to get it on the same plot
# use figsize to get a good shape
oil_spills.plot(x="lon", y="lat", kind="scatter", ax=plt.gca(),figsize=(100,70))

    <AxesSubplot:xlabel='lon', ylabel='lat'>

png

That’s pretty good, but what if we want to only plot North and South America?

subset = countries.continent.isin(['North America','South America'])

countries = gpd.read_file(
               gpd.datasets.get_path("naturalearth_lowres"))
# get just the countries we want to plot
subset = countries.continent.isin(['North America','South America'])
countries[subset].plot(color="lightgrey")
oil_spills.plot(x="lon", y="lat", kind="scatter", ax=plt.gca(),figsize=(20,15), s=1, color='firebrick')

    <AxesSubplot:xlabel='lon', ylabel='lat'>

png

What if we want to zoom in on just the continental United States? Use these parameters in plot()

ylim=(start,end)
xlim=(start,end)

countries = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
subset = countries.continent.isin(['North America'])
countries[subset].plot(color="lightgrey")
oil_spills.plot(x="lon", y="lat", kind="scatter", ax=plt.gca(),figsize=(20,15),
                s=1, color='firebrick', ylim=(10,55), xlim=(-150,-50))

    <AxesSubplot:xlabel='lon', ylabel='lat'>

png

Let’s do it all over again

We’ll read in the Chicago beach water monitoring data

chicago = pd.read_csv("bwq.csv", parse_dates=['Measurement Timestamp'])
chicago.head()

	Beach Name	Measurement Timestamp	Water Temperature	Turbidity	Transducer Depth	Wave Height	Wave Period	Battery Life	Measurement Timestamp Label	Measurement ID
0	Montrose Beach	2013-08-30 08:00:00	20.3	1.18	0.891	0.080	3.0	9.4	8/30/2013 8:00 AM	MontroseBeach201308300800
1	Ohio Street Beach	2016-05-26 13:00:00	14.4	1.23	NaN	0.111	4.0	12.4	05/26/2016 1:00 PM	OhioStreetBeach201605261300
2	Calumet Beach	2013-09-03 16:00:00	23.2	3.63	1.201	0.174	6.0	9.4	9/3/2013 4:00 PM	CalumetBeach201309031600
3	Calumet Beach	2014-05-28 12:00:00	16.2	1.26	1.514	0.147	4.0	11.7	5/28/2014 12:00 PM	CalumetBeach201405281200
4	Montrose Beach	2014-05-28 12:00:00	14.4	3.36	1.388	0.298	4.0	11.9	5/28/2014 12:00 PM	MontroseBeach201405281200

Then we need to drop any invalid measurements:

chicago = chicago.dropna(subset=['Measurement Timestamp'])

and add a column for the year:

chicago = chicago.dropna(subset=['Measurement Timestamp'])

def get_year(date):
    return int(date.year)

chicago['Year'] = chicago['Measurement Timestamp'].apply(get_year)
chicago.head()

	Beach Name	Measurement Timestamp	Water Temperature	Turbidity	Transducer Depth	Wave Height	Wave Period	Battery Life	Measurement Timestamp Label	Measurement ID	Year
0	Montrose Beach	2013-08-30 08:00:00	20.3	1.18	0.891	0.080	3.0	9.4	8/30/2013 8:00 AM	MontroseBeach201308300800	2013
1	Ohio Street Beach	2016-05-26 13:00:00	14.4	1.23	NaN	0.111	4.0	12.4	05/26/2016 1:00 PM	OhioStreetBeach201605261300	2016
2	Calumet Beach	2013-09-03 16:00:00	23.2	3.63	1.201	0.174	6.0	9.4	9/3/2013 4:00 PM	CalumetBeach201309031600	2013
3	Calumet Beach	2014-05-28 12:00:00	16.2	1.26	1.514	0.147	4.0	11.7	5/28/2014 12:00 PM	CalumetBeach201405281200	2014
4	Montrose Beach	2014-05-28 12:00:00	14.4	3.36	1.388	0.298	4.0	11.9	5/28/2014 12:00 PM	MontroseBeach201405281200	2014

Let’s try doing a scatter plot of wave height and wave period to see if they are related:

chicago.plot(x='Wave Period', y='Wave Height', kind='scatter')

    <AxesSubplot:xlabel='Wave Period', ylabel='Wave Height'>

png

Oops! Looks like we have some negative wave heights and periods! Let’s clean our data:

clean_chicago = chicago[(chicago['Wave Height'] > 0) & (chicago['Wave Period'] > 0)]
clean_chicago.plot(x='Wave Period', y='Wave Height', kind='scatter')

    <AxesSubplot:xlabel='Wave Period', ylabel='Wave Height'>

png

We can do the same thing with wave period and Turbidity.

clean_chicago.plot(x='Wave Period', y='Turbidity', kind='scatter')

    <AxesSubplot:xlabel='Wave Period', ylabel='Turbidity'>

png

This is a line plot, but for fun, let’s look at wave height over time at a certain beach:



recent = clean_chicago['Year'] == 2016
montrose = clean_chicago['Beach Name'] == 'Montrose Beach'
clean_chicago[recent & montrose].plot(x='Measurement Timestamp', y='Wave Height')

    <AxesSubplot:xlabel='Measurement Timestamp'>

png

How about a histogram so we can see how common various wave heights are?


clean_chicago[recent & montrose].hist(column='Wave Height')

    array([[<AxesSubplot:title={'center':'Wave Height'}>]], dtype=object)

png

A box plot would give us some statistical data about wave heights:

earlier = clean_chicago['Year'] == 2014
clean_chicago[earlier].groupby('Beach Name').boxplot(column='Wave Height', figsize=(6,18))

    63rd Street Beach         AxesSubplot(0.1,0.679412;0.363636x0.220588)
    Calumet Beach        AxesSubplot(0.536364,0.679412;0.363636x0.220588)
    Montrose Beach            AxesSubplot(0.1,0.414706;0.363636x0.220588)
    Ohio Street Beach    AxesSubplot(0.536364,0.414706;0.363636x0.220588)
    Osterman Beach                AxesSubplot(0.1,0.15;0.363636x0.220588)
    Rainbow Beach            AxesSubplot(0.536364,0.15;0.363636x0.220588)
    dtype: object

png

Or we could put those all on one graph:

clean_chicago[earlier].boxplot(by='Beach Name',column='Wave Height', figsize=(10,5))

    <AxesSubplot:title={'center':'Wave Height'}, xlabel='Beach Name'>

png

Finally, we can use groupby to compute max wave heights and see which beach has the highest wave:

grouped = clean_chicago[earlier].groupby('Beach Name').max().sort_values('Wave Height').reset_index()
grouped.plot(x = 'Beach Name', y='Wave Height', kind='bar')

    <AxesSubplot:xlabel='Beach Name'>

png