03 Mar 2024 6 min read Tutorials

How to Create an Injury Report with Production Cost

Learn how to create an injury report that includes impact metrics, such as Points per Game.

In this Tutorial

We'll show you how to expand a daily injury report to include other factors, such as points per game, so you can gauge which injuries are more "expensive".

We'll answer five questions using this report:

What injuries are the most pervasive?
What team has the most injuries?
Which position has the most injuries?
Which injuries are the most expensive (from a production perspective)?
What team is incurring the highest production cost?

🏒

For more information on injury reports, check out our newsletter edition entitled Ice Breakers: Decoding the Impact of Injuries in Hockey.

Getting the Data and Code

For this tutorial, you can find the data and code links below:

Sourcing & Shaping the Data

The original source for the injury data for this tutorial is our sports data provider, My Sports Feeds. However, we created a dataset for this tutorial that curated injury data for one day and then integrated player and goalie statistics. This starting file (final_player_injury_impact_dataset.csv) can be downloaded from here.

With the data downloaded, open RStudio and create a new project. Add the following libraries into the project.


library(dplyr)
library(tidyverse)
library(ggplot2)
...

Next, load the curated injury and stats dataset, rename the columns and implement a small amount of data cleaning and transformation.

...

injury_data_current_df <- read.csv("final_player_injury_impact_dataset.csv")

colnames(injury_data_current_df) <- c("DATE", "PLAYER_ID", "LNAME", "FNAME",
                                      "JNUM", "POSITION", "TEAM_ID", "TEAM_ABBR",
                                      "TEAM_CITY", "TEAM_NAME", "INJURY", 
                                      "GP", "G", "A", "PTS", "TOI", "PTS_PG",
                                      "GAA", "SAVE_PCT")

injury_data_current_df$PLAYER_NAME <- paste(injury_data_current_df$FNAME, 
                                      injury_data_current_df$LNAME, sep = " ")
injury_data_current_df$TEAM <- paste(injury_data_current_df$TEAM_CITY, 
                                  injury_data_current_df$TEAM_NAME, sep = " ")
                                  
injury_data_current_df$PTS_PG <- round(injury_data_current_df$PTS_PG, 2)
injury_data_current_df$TOI <- round(injury_data_current_df$TOI, 2)
...

An optional step here is to load a daily snapshot of the player stats summary, which you can use to calculate the league-average points per game. For the date we pulled the data, this was 0.342, so you can either load the MSF Player Stats Summary dataset and run the following R code or simply assign 0.342 to the avg_points_per_game variable.

...
nhl_player_data_df <- read.csv("MSF_Player_Stats_Summary.csv")

nhl_player_data_df$PTS_P_G  = nhl_player_data_df$POINTS / nhl_player_data_df$GAMES_PLAYED 

avg_points_per_game = mean(nhl_player_data_df$PTS_P_G)

avg_points_per_game
...
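The either-or choice described above can be folded into one step. The following is a minimal sketch rather than part of the original code: it assumes the file name and the POINTS and GAMES_PLAYED columns used above, and it falls back to the 0.342 value quoted in the text when the file is not in the working directory. Skipping players with zero games played is our own assumption and may shift the result slightly.

```r
# Sketch: compute the league-average points per game if the optional
# stats file is present; otherwise use the value quoted in the tutorial.
stats_file <- "MSF_Player_Stats_Summary.csv"
fallback_avg <- 0.342  # value reported in the text for the snapshot date

if (file.exists(stats_file)) {
  nhl_player_data_df <- read.csv(stats_file)
  # Assumption: skip players with zero games played to avoid dividing by zero
  played <- nhl_player_data_df$GAMES_PLAYED > 0
  avg_points_per_game <- mean(
    nhl_player_data_df$POINTS[played] / nhl_player_data_df$GAMES_PLAYED[played],
    na.rm = TRUE
  )
} else {
  avg_points_per_game <- fallback_avg
}

avg_points_per_game
```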

You will now create a data frame called sub_injury_dataset to use throughout the tutorial. Note that sub_injury_dataset itself keeps its rows with NAs; only the copy written to file (clean_df) has them omitted, so you can use the cleaned data in other tools or platforms should you want to.

...
sub_injury_dataset <- injury_data_current_df %>% 
  select(PLAYER_ID, PLAYER_NAME, TEAM_ID, TEAM_ABBR, TEAM, JNUM, POSITION, 
         INJURY, GP, G, A, PTS, PTS_PG, TOI, GAA, SAVE_PCT) %>% 
  arrange(desc(PTS_PG))

sub_injury_dataset$PTSPG_DIFF = round(sub_injury_dataset$PTS_PG - avg_points_per_game, 2)

rows_with_na <- sum(apply(sub_injury_dataset, 1, function(x) any(is.na(x))))

clean_df <- na.omit(sub_injury_dataset)

write.csv(clean_df, "injury_summary_data_w_cost.csv", row.names = FALSE)
...
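Before dropping rows, it can help to see where the missing values sit. The sketch below uses a small made-up data frame (the values are illustrative and not taken from the tutorial's CSV) with a few of the tutorial's column names. It counts NAs per column and shows how many rows na.omit() would keep. If, for instance, goalie-only columns such as GAA were stored as NA for skaters, na.omit() would drop every skater, so this check is worth running on the real data.

```r
# Toy data: illustrative values only, not taken from the tutorial's CSV
toy_df <- data.frame(
  PLAYER_NAME = c("Player A", "Player B", "Player C", "Player D"),
  POSITION    = c("C", "D", "G", "LW"),
  PTS_PG      = c(0.85, NA, 0.00, 0.40),
  GAA         = c(NA, NA, 2.75, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy_df))
na_per_column

# Number of rows containing at least one NA
rows_with_na <- sum(!complete.cases(toy_df))
rows_with_na

# Number of rows that survive na.omit()
rows_kept <- nrow(na.omit(toy_df))
rows_kept
```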

What Injuries are the Most Pervasive?

The first question is which injuries are the most pervasive in the dataset. To answer it, group the data by INJURY and count the rows in each group.

total_injuries_df <- sub_injury_dataset %>% 
  group_by(INJURY) %>% 
  summarize(COUNT = n()) %>% 
  arrange(desc(COUNT))

ggplot(total_injuries_df, aes(x = reorder(INJURY, COUNT), y = COUNT)) +
  geom_bar(stat = "identity", fill = "#549490") +
  geom_text(aes(label = COUNT), vjust = -.5, size = 3, color = "#549490") +
  theme_minimal() + 
  theme(panel.grid = element_blank()) +
  labs(x = "Injury Type", y = "Count", 
       title = "Count of Types of Injuries",
       caption = "Data Source: My Sports Feeds") + 
  theme(plot.title = element_text(face = "bold", size = 12),
        axis.title = element_text(size = 8),
        axis.text.x = element_text(angle = 45, size = 8, hjust = 1),
        axis.text.y = element_blank(),
        panel.grid.major.y = element_blank(),   
        panel.grid.minor.y = element_blank(),
        legend.position = "none")

The result of the above code is a relatively clean R chart that shows upper-body injuries as the most common type (n = 25).
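Raw counts can also be expressed as a share of all reported injuries, which makes snapshots from different days easier to compare. This is a minimal sketch on made-up data; the injury labels and the SHARE_PCT column name are our own illustrations, not part of the original code.

```r
library(dplyr)

# Toy data: illustrative injury labels, not the tutorial's dataset
toy_injuries <- data.frame(
  INJURY = c("upper body", "upper body", "lower body", "upper body",
             "knee", "lower body", "upper body", "illness")
)

injury_share_df <- toy_injuries %>%
  count(INJURY, name = "COUNT") %>%   # same counts as group_by() + n()
  mutate(SHARE_PCT = round(100 * COUNT / sum(COUNT), 1)) %>%
  arrange(desc(COUNT))

injury_share_df
```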

While this is somewhat interesting, it's general and doesn't quantify impact or production costs.

What Team has the Most Injuries?

Our first impact metric is the count of injuries by TEAM. Here we'll see which teams had the most injuries on the day we pulled the data.

injuries_by_team <- sub_injury_dataset %>% 
  group_by(TEAM)  %>% 
  summarize(COUNT = n()) %>% 
  arrange(desc(COUNT))

write.csv(injuries_by_team, "injuries_by_team.csv", row.names = FALSE)

Note that here again we wrote the data frame to file so we could use it in Datawrapper, an online data tool.

With this chart, we're starting to see some impact, but it's still not where we want it to be.

What Positions have the Most Injuries?

The third general chart shows which position was most vulnerable to injuries for the time period. To create this view, we now group by POSITION and take the count.

total_injuries_by_position_df <- sub_injury_dataset %>% 
  group_by(POSITION) %>% 
  summarize(COUNT = n()) %>% 
  arrange(desc(COUNT))

write.csv(total_injuries_by_position_df, "total_injuries_by_position.csv", row.names = FALSE)

Here again, we save the result as a local file and use Datawrapper to display it. Defense (D) is the position with the most injuries.

💡

As a variation on the above chart, you could evaluate where the highest impacts are when it comes to position. For D, this might be Blocked Shots and for forwards it might be Points.
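One way to start on this variation is to attach impact metrics to the position counts. The sketch below uses made-up values and only columns that exist in the tutorial's dataset (PTS_PG and TOI); a blocked-shots column is not part of that dataset, so it is left out here.

```r
library(dplyr)

# Toy data: illustrative values only
toy_df <- data.frame(
  POSITION = c("D", "D", "C", "LW", "D", "C"),
  PTS_PG   = c(0.20, 0.40, 0.90, 0.50, 0.30, 0.70),
  TOI      = c(21.5, 19.0, 18.0, 15.5, 22.5, 17.0)
)

impact_by_position_df <- toy_df %>%
  group_by(POSITION) %>%
  summarize(
    COUNT      = n(),                                   # injuries at the position
    AVG_PTS_PG = round(mean(PTS_PG, na.rm = TRUE), 2),  # scoring lost per player
    AVG_TOI    = round(mean(TOI, na.rm = TRUE), 2)      # ice time lost per player
  ) %>%
  arrange(desc(AVG_PTS_PG))

impact_by_position_df
```

In this toy example D has the most injuries, but C carries the highest average points per game, which is the kind of contrast the count-only chart cannot show.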

Which Injuries are the Most Expensive?

It's this next chart that starts to show the cost of the injuries. Here, we again group by TEAM, but add summary statistics to see how injuries would impact a team. For example, we include the average points per game (AVG_PTS_PG) of each team's players in the injury dataset. The higher this number, the more productive the injured players are. We've also included other metrics, such as average time on ice (AVG_TOI), the average difference between points per game and the league average (AVG_PTSPG_DIFF), and the goalie averages (AVG_GAA and AVG_SAVE_PCT).

💡

Be creative and add your own impact metrics here. You also may want to break out position, so impact metrics are directly mapped into the roles.

grouped_data <- sub_injury_dataset %>%
  group_by(TEAM) %>%
  summarize(
    TOTAL_INJURIES = n_distinct(PLAYER_ID),  
    AVG_PTS_PG = round(mean(PTS_PG, na.rm = TRUE), 2),   
    AVG_TOI = round(mean(TOI, na.rm = TRUE), 2),  
    AVG_PTSPG_DIFF = round(mean(PTSPG_DIFF, na.rm = TRUE), 2), 
    AVG_GAA = round(mean(GAA, na.rm = TRUE), 2), 
    AVG_SAVE_PCT = round(mean(SAVE_PCT, na.rm = TRUE), 2), 
  )

write.csv(grouped_data, "grouped_injury_data.csv", row.names = FALSE)

For this dataset, we saved it and then used Microsoft Excel to create a heatmap. This gives you a team-level view where you can begin to see how the total injuries incurred for that snapshot could impact production. For example, Toronto has 7 injuries with an AVG_PTS_PG of 0.31 (below the league average of 0.342), whereas Colorado has 3 injuries with an AVG_PTS_PG of 1.05 (roughly three times the league average).
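If you would rather stay in R than switch to Excel, a similar heatmap can be approximated with geom_tile(). This is a sketch under stated assumptions, not the method used for the chart in this tutorial: the team names and values are made up, and rescaling each metric to a 0 to 1 range (so that columns with different units can share one color scale) is our own choice. The rescaling would divide by zero for a metric that is identical across all teams.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy data: illustrative teams and values only
toy_grouped <- data.frame(
  TEAM           = c("Team A", "Team B", "Team C"),
  TOTAL_INJURIES = c(7, 3, 5),
  AVG_PTS_PG     = c(0.31, 1.05, 0.50),
  AVG_TOI        = c(14.2, 19.8, 16.0)
)

heatmap_df <- toy_grouped %>%
  pivot_longer(-TEAM, names_to = "METRIC", values_to = "VALUE") %>%
  group_by(METRIC) %>%
  # Rescale within each metric so the fill color is comparable across columns
  mutate(SCALED = (VALUE - min(VALUE)) / (max(VALUE) - min(VALUE))) %>%
  ungroup()

heatmap_plot <- ggplot(heatmap_df, aes(x = METRIC, y = TEAM, fill = SCALED)) +
  geom_tile(color = "white") +
  geom_text(aes(label = VALUE), size = 3) +
  scale_fill_gradient(low = "#FFFFFF", high = "#6A4C93") +
  theme_minimal() +
  labs(x = NULL, y = NULL, fill = "Relative value",
       title = "Injury Impact Heatmap (toy data)")

heatmap_plot
```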

What Team is Incurring the Highest Production Cost?

Here, we again group by TEAM but only include the average points per game metric.

💡

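Note that in this data frame, we remove the rows with NAs, so the numbers will change. Because mean() returns NA when any value in a group is missing, this drops every team that has at least one injured player with a missing PTS_PG. NAs can indicate data issues or players that have been brought up temporarily from the minors.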

injury_cost_by_team <- sub_injury_dataset %>% 
  group_by(TEAM)  %>% 
  summarize(AVG_PTS_P_GAME = round(mean(PTS_PG), 2)) %>% 
  arrange(desc(AVG_PTS_P_GAME))

injury_cost_by_team = na.omit(injury_cost_by_team)

ggplot(injury_cost_by_team, aes(x = reorder(TEAM, AVG_PTS_P_GAME), y = AVG_PTS_P_GAME)) +
  geom_bar(stat = "identity", fill = "#6A4C93") +  
  coord_flip() +  
  labs(x = "Team", y = "Average Points per Game", 
       title = "Cost of Injury per Team",
       subtitle = "Average points lost per game due to injuries",
       caption = "Data Source: My Sports Feeds") +   
  theme_bw() +  
  theme(plot.title = element_text(face = "bold", size = 14),
        axis.title = element_text(size = 12),
        axis.text = element_text(size = 10))

And here, irrespective of the number of injuries, we see that Pittsburgh has the highest impact cost with an average points per game of 0.69.

Simple as it is, this chart shows which teams are hit hardest by injuries to higher-production players.
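The NA handling in this section affects which teams appear in the chart, so it is worth seeing on a small example. The sketch below uses made-up data to contrast the approach in this section (take the mean, then drop teams whose result is NA) with an alternative that passes na.rm = TRUE and keeps every team. The alternative is our suggestion, not what the tutorial's chart uses.

```r
library(dplyr)

# Toy data: Team A has one injured player with a missing PTS_PG
toy_df <- data.frame(
  TEAM   = c("Team A", "Team A", "Team B", "Team B"),
  PTS_PG = c(0.60, NA, 0.40, 0.90)
)

# Approach in this section: mean() is NA for Team A, so na.omit() drops the team
drop_teams_df <- toy_df %>%
  group_by(TEAM) %>%
  summarize(AVG_PTS_P_GAME = round(mean(PTS_PG), 2)) %>%
  na.omit()

# Alternative: ignore the missing player and keep every team
keep_teams_df <- toy_df %>%
  group_by(TEAM) %>%
  summarize(AVG_PTS_P_GAME = round(mean(PTS_PG, na.rm = TRUE), 2))

drop_teams_df
keep_teams_df
```

Which behavior you want depends on what the NAs mean in your data: dropping the team is conservative, while na.rm = TRUE averages over only the players with known stats.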

Summary

In this tutorial, we walked through how to create different views that begin to show the impact of injuries. The first three charts were more general in nature; they showed counts by injury, totals at the team level and counts by position. The fourth and fifth charts introduced more metrics that indicate potential impact. For example, higher team averages for points per game and time on ice indicate that higher-production players are on the injured list.

As follow-on exercises, you could create your own impact metrics and build a comparative view where you can compare two teams playing one another to see whether injuries could be a potential factor in the game.
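As a starting point for the comparative view, you could filter a team-level summary down to the two teams in a matchup and add a metric of your own. The sketch below is illustrative only: the teams and values are made up, and EST_PTS_SIDELINED_PG (injury count multiplied by average points per game) is a hypothetical metric we introduce here, not one used elsewhere in this tutorial.

```r
library(dplyr)

# Toy team-level summary: illustrative values only
toy_grouped <- data.frame(
  TEAM           = c("Team A", "Team B", "Team C"),
  TOTAL_INJURIES = c(7, 3, 5),
  AVG_PTS_PG     = c(0.31, 1.05, 0.50)
)

matchup <- c("Team A", "Team B")  # hypothetical opponents

matchup_df <- toy_grouped %>%
  filter(TEAM %in% matchup) %>%
  # Hypothetical metric: rough total of points per game sitting on the injured list
  mutate(EST_PTS_SIDELINED_PG = round(TOTAL_INJURIES * AVG_PTS_PG, 2)) %>%
  arrange(desc(EST_PTS_SIDELINED_PG))

matchup_df
```

In this toy matchup, the team with fewer injuries is missing more production, which is the kind of signal a comparative view is meant to surface.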
