Making Expected Goal Charts in R

Visualizing hockey games has never been easier, thanks to the data made available through sites like Moneypuck and EvolvingHockey. In a few easy steps, we can take raw data from these sites and produce visuals that nicely summarize how a particular game was played. We’re going to recreate a chart like the one below, which visualizes expected goals over the course of a game. Let’s jump right in.

An example Expected Goals chart from EvolvingHockey.com

Getting Data from Moneypuck.com

Moneypuck is a great resource for publicly available data, which we will pull down to create our very own Expected Goal game charts. We are particularly interested in the shot data, available here for download, but we will pull this data in without leaving R using the code below.

library(tidyverse)
library(downloader)

download("http://peter-tanner.com/moneypuck/downloads/shots_2019.zip", destfile = "dataset.zip", mode = "wb") #downloads the zip file
unzip("dataset.zip", exdir = ".") #unzip the file into our working directory
shots = read_csv("shots_2019.csv") #read in the shots csv file

What we’ve just done is take the URL of the zip file on the Moneypuck site, download the zip file, unzip its contents into our working directory, and read in the shots CSV file for analysis. Now we are ready to wrangle our shots data with the goal of plotting the expected goals of one game over time, for both teams involved.
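
Before going further, it can help to confirm that the columns used in the rest of this post are actually in the file, since the layout of a public data set can change between seasons. The snippet below is a minimal sketch using a small made-up data frame (`shots_demo`, a hypothetical stand-in for `shots`); the column names are the ones this post relies on.

```r
# Made-up stand-in for the real `shots` data frame (values are illustrative only).
shots_demo <- data.frame(
  game_id  = c(20951, 20951, 20952),
  time     = c(34, 120, 15),
  teamCode = c("CAR", "TOR", "BOS"),
  xGoal    = c(0.08, 0.21, 0.05),
  goal     = c(0, 1, 0)
)

# Columns the rest of this walkthrough relies on.
needed_cols <- c("game_id", "time", "teamCode", "xGoal", "goal")

# setdiff() returns any needed columns that are absent from the data.
missing_cols <- setdiff(needed_cols, names(shots_demo))

if (length(missing_cols) > 0) {
  stop("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```

Swapping `shots_demo` for `shots` runs the same check against the downloaded file.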

Wrangling our Shots data for Plotting

Before we jump into the data, it’s worth pausing and thinking about what we want our visual to look like. This will inform any variables we need to alter or create ahead of time, and reduce any backtracking we may need to do later.

The shots data, as the name suggests, includes all the shots taken in the 2019-2020 season to date. The observations alone are powerful, and Moneypuck even includes the results of its Expected Goals model, which we can leverage in our visuals. I encourage readers to poke around the data file and see what else could be of interest.

To start – we want to visualize just one game – so let’s filter down our data set to look at a game. We’ll use the game_id variable to do this:

game_to_chart = 20951 #insert game_id we want to chart

shots %>% 
  filter(game_id == game_to_chart) #filter data to this game

Pretty neat – now we have just the shots taken in the ludicrous display last night that was the Leafs vs. Hurricanes game. You can filter any other game_id you desire by just replacing the value in the game_to_chart variable we created. Our end goal is to create a cumulative expected goals chart like the ones featured on various analytics sites, and the next step requires a little forethought.
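
If you don’t know the game_id you want, one way to find candidates is to list the games in which a given team took a shot. This is a rough sketch on made-up data (`shots_demo` and `team_to_find` are hypothetical names); it only assumes the `game_id` and `teamCode` columns used elsewhere in this post.

```r
# Made-up stand-in for `shots`: a few shots across three games.
shots_demo <- data.frame(
  game_id  = c(20950, 20950, 20951, 20951, 20951, 20952),
  teamCode = c("TOR", "PIT", "TOR", "CAR", "CAR", "BOS"),
  xGoal    = c(0.10, 0.05, 0.21, 0.08, 0.30, 0.02)
)

team_to_find <- "TOR" # team whose games we want to list

# Keep the game_ids of rows where this team shot, drop duplicates, and sort.
team_games <- sort(unique(shots_demo$game_id[shots_demo$teamCode == team_to_find]))
team_games
```

Any value in `team_games` can then be assigned to game_to_chart.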

We have observations for each shot, the xG for that shot, and the time in seconds of when that shot was taken. What happens if there was a second in the game where a shot wasn’t taken? How would our plot look? We can imagine a plot where the Leafs go 5 minutes without a shot, and our chart would jump from one value to another, which can be misleading. What we need to do is “pad” our data with observations for each second, using the padr package as below:

library(padr)
shots %>% 
  filter(game_id == game_to_chart) %>%
  pad_int('time', group = 'teamCode', start_val = 0, end_val = 3600) %>%  #pad data for seconds without a shot
  mutate(xGoal = ifelse(is.na(xGoal),0,xGoal)) #convert NAs to 0 so they are plotted
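
To see what padding does on a small scale, here is a base R sketch of the same idea on a made-up 10-second "game". It is not how padr works internally, just an illustration: build every team and second combination, join the real shots onto it, and fill the gaps with 0. The `shots_demo` data frame is hypothetical.

```r
# Two made-up shots for one team.
shots_demo <- data.frame(
  teamCode = c("TOR", "TOR"),
  time     = c(3L, 7L),
  xGoal    = c(0.10, 0.25)
)

# Every (team, second) combination we want a row for.
all_seconds <- expand.grid(
  teamCode = unique(shots_demo$teamCode),
  time     = 0:10,
  stringsAsFactors = FALSE
)

# Left join: seconds without a shot get NA for xGoal.
padded <- merge(all_seconds, shots_demo, by = c("teamCode", "time"), all.x = TRUE)

# Replace the NAs with 0, mirroring the ifelse() step above.
padded$xGoal[is.na(padded$xGoal)] <- 0

# Put rows in game order within each team.
padded <- padded[order(padded$teamCode, padded$time), ]
```

The result has one row per second from 0 to 10, with xGoal equal to 0 everywhere except seconds 3 and 7.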

Now we have observations for each second in the game. Since we’re going to plot xGoals, we also replaced missing values with 0 using ifelse so they aren’t dropped from our plot. Next, we’re ready to take the cumulative sum of expected goals so we can see how the game progressed over time:

shots %>% 
  filter(game_id == game_to_chart) %>%
  pad_int('time', group = 'teamCode', start_val = 0, end_val = 3600) %>%  #pad data for seconds without a shot
  mutate(xGoal = ifelse(is.na(xGoal),0,xGoal)) %>%  #convert NAs to 0 so they are plotted
  group_by(teamCode) %>%
  mutate(cumulativexG = cumsum(xGoal)) #take cumulative sum to add up xG over time
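
A quick sanity check on the cumulative sum: each team’s final cumulativexG value should equal that team’s total xG for the game. The sketch below shows this on made-up padded data (`padded_demo` is hypothetical), using base R’s ave() as a stand-in for group_by() plus mutate().

```r
# Made-up padded data: two teams, five seconds each.
padded_demo <- data.frame(
  teamCode = rep(c("CAR", "TOR"), each = 5),
  time     = rep(0:4, times = 2),
  xGoal    = c(0, 0.2, 0, 0.1, 0,  0, 0, 0.3, 0, 0)
)

# Sort by team and time so cumsum() adds things up in game order.
padded_demo <- padded_demo[order(padded_demo$teamCode, padded_demo$time), ]

# ave() applies cumsum() separately within each team.
padded_demo$cumulativexG <- ave(padded_demo$xGoal, padded_demo$teamCode, FUN = cumsum)

# Since xG is never negative, the largest cumulative value is the final one.
final_xg <- tapply(padded_demo$cumulativexG, padded_demo$teamCode, max)
total_xg <- tapply(padded_demo$xGoal, padded_demo$teamCode, sum)

# The two should agree up to floating-point tolerance.
stopifnot(isTRUE(all.equal(final_xg, total_xg)))
```

If this check fails on real data, the rows are probably not in time order within each team when cumsum() runs.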

Time to Plot!

Time to ggplot! We have our data in a format that will allow us to visualize the expected goal trends for a game – so let’s see what it looks like:

shots %>% 
  filter(game_id == game_to_chart) %>%
  pad_int('time', group = 'teamCode', start_val = 0, end_val = 3600) %>%  #pad data for seconds without a shot
  mutate(xGoal = ifelse(is.na(xGoal),0,xGoal)) %>%  #convert NAs to 0 so they are plotted
  group_by(teamCode) %>%
  mutate(cumulativexG = cumsum(xGoal)) %>%  #take cumulative sum to add up xG over time
  ggplot(aes(time, cumulativexG, group = teamCode, color = teamCode)) +
  geom_line()

Pretty cool! We can see that Carolina basically dominated this one from start to finish. We can identify periods of time where a team was dominant, like the first few minutes for Carolina, as well as periods where chances flatlined, like most of the game for the Leafs.

This is a good start, but we may want to add another dimension – one that would allow us to see when an actual goal occurred. We’re going to do that by adding a new layer to our plot that includes only the goals. This is easier if we first assign the wrangled data to a new object:

plot = shots %>% 
  filter(game_id == game_to_chart) %>%
  pad_int('time', group = 'teamCode', start_val = 0, end_val = 3600) %>%  #pad data for seconds without a shot
  mutate(xGoal = ifelse(is.na(xGoal),0,xGoal)) %>%  #convert NAs to 0 so they are plotted
  group_by(teamCode) %>%
  mutate(cumulativexG = cumsum(xGoal))  #take cumulative sum to add up xG over time
 
plot %>%  
  ggplot(aes(time, cumulativexG, group = teamCode, color = teamCode)) +
  geom_line() +
  geom_point(data = plot %>% filter(goal == 1))

Now we can see when the goals occurred in this game. The Leafs got out to a 1-0 lead despite being outchanced, before the Hurricanes rattled off the next 4 goals. We have a pretty decent overhead view of how this game transpired at this point. Let’s clean it up and call it a day:

#Packages required for the ggplot theme
library(awtools)
library(showtext)
library(extrafont)
font_add_google("IBM Plex Mono", "IBM Plex Mono")
font_add_google("IBM Plex Sans", "IBM Plex Sans")
showtext_auto()

plot %>%  
  ggplot(aes(time, cumulativexG, group = teamCode, color = teamCode)) +
  geom_line(size = 1.5) +
  geom_point(data = plot %>% filter(goal == 1), fill = "white", size = 3.5, alpha = 0.9, shape = 21, stroke = 1.5) +
  geom_vline(xintercept = c(1200,2400,3600), color = "grey", alpha = 0.4) +
  geom_hline(yintercept = c(1,2,3,4), color = "grey", alpha = 0.4) +
  a_plex_theme(grid = FALSE) +
  scale_color_manual(values = c('#CC0000', '#00205B')) +
  labs(title = "TOR 3 vs. CAR 6", subtitle = "Toronto was felled by the Carolina Hurricanes and stud goaltender and zamboni driver, David Ayres. \nThe Leafs have stumbled the past few games, struggling to string games of solid effort together. \nLuckily, the Panthers lost to Vegas in their matchup.", x = "Game Seconds", y = "Expected Goals", caption = "Chart by @MackinawStats, data from @Moneypuck") +
  scale_x_continuous(breaks = c(1200,2400,3600)) +
  geom_label(data = plot %>% filter(time == 3600), aes(label = teamCode), vjust = 1.5) +
  theme(legend.position = "none")
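
The title above is typed in by hand. If you chart many games, one option is to build it from the data by counting goals per team. This is only a sketch on made-up data (`plot_demo` is hypothetical), and it assumes every row with goal equal to 1 should count toward the score, which may not hold for every situation in the real file. Padded rows have no goal value, so NAs are skipped.

```r
# Made-up stand-in for the `plot` object; the NA mimics a padded row.
plot_demo <- data.frame(
  teamCode = c("TOR", "CAR", "CAR", "TOR", "CAR"),
  goal     = c(1, 1, NA, 0, 1)
)

# Sum the goal flag within each team, ignoring NAs from padded rows.
score <- tapply(plot_demo$goal, plot_demo$teamCode, sum, na.rm = TRUE)

# Build a "TEAM n vs. TEAM m" string; tapply() orders teams alphabetically.
chart_title <- paste(names(score), score, collapse = " vs. ")
chart_title
```

The resulting string could be passed to labs(title = chart_title) in place of the hard-coded one.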

Potential Additions and Happy Charting

There are a lot of game aspects our chart leaves out, like any powerplays the teams had, or the fact that the Hurricanes had to play 3 different goalies, one of whom may have had to clean the ice at the conclusion of the game. There are always additional layers and annotations worth adding, which is a subject for a future post. Enjoy!

-Mackinaw Stats
