```{r setup, include=FALSE} knitr::opts_chunk$set(collapse = TRUE, warning = FALSE, message = FALSE, dpi = 340) library(tidyverse) library(tidycensus) library(ggthemes) library(gganimate) library(sf) options(tigris_use_cache = TRUE) ``` **Updated `r Sys.Date()`** With the onset COVID-19 pandemic, many people are creating and sharing visualizations help us understand and map the spread of the virus. I live and work in Chicago, which has been under a "shelter-in-place" order for over a week. Schools have been closed for two weeks. I've been busy with the little ones and with work. But I thought I might visualize the spread of COVID-19 in Illinois, which I have not yet seen done. I like to set a custom `ggplot2` theme. Nothing fancy. ```{r} # Define ggplot theme theme_set( theme_minimal() + theme( text = element_text(family = "Raleway"), axis.title.x = element_text(size = 10, hjust = 1), axis.title.y = element_text(size = 10), axis.text.x = element_text(size = 6), axis.text.y = element_text(size = 6), plot.title = element_text(size = 9, colour = "grey25", face = "bold"), plot.subtitle = element_text(family = "Lato", size = 8, colour = "grey45"), plot.caption = element_text(family = "Lato", size = 6, colour = "grey45"), legend.margin=margin(), legend.text = element_text(size = 6), legend.title = element_text(size = 8, vjust = 0), legend.key.size = unit(0.5, "cm") ) ) ``` # Getting the data from the New York Times COVID-19 github repo There are a few organizations tracking confirmed COVID-19 cases and deaths, including the [Johns Hopkins University Center for Systems Science and Engineering](https://github.com/CSSEGISandData/COVID-19) and the [New York Times](https://github.com/nytimes/covid-19-data). The JHU repo seems a bit more comprehensive, but the New York Times gives me just what I need. I read the county-level csv file straight from GitHub and filter it to include only Illinois counties. ```{r results='hide'} il <- read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv") %>% filter(state == "Illinois", county != "Unknown") ``` I use the `{tidycensus}` package to get both county population from the 2010 Dicennial Census and county geographic data. This will enable me to determine per capita rates, as well as create spatial visualizations of COVID-19 cases. ```{r results='hide'} # il county census data il_pop <- get_decennial(geography = "county", variables = "P001001", state = "IL", geometry = TRUE) %>% janitor::clean_names() %>% transmute(geoid = geoid, county = str_remove(name, " County, Illinois"), population = value, geometry = geometry) ``` # Visualizing confirmed COVID-19 cases in Illinois Accessing the data was easy. I visualize the growth of confirmed cases in Illinois overall and then by county. ## Overall growth of confirmed cases ```{r} il %>% filter(date > "2020-02-20") %>% group_by(date) %>% summarize(total_cases = sum(cases, na.rm = TRUE)) %>% ggplot() + geom_col(aes(x = date, y = total_cases), fill = "cornflowerblue", alpha = 0.7) + labs(x = "", y = "", title = "Daily confirmed cases in Illinois", caption = "Data from New York Times Covid-19 repository:\nhttps://github.com/nytimes/covid-19-data") ``` Let's animate that plot, because why not and I'm obsessed with `{gganimate}` lately. ```{r} il %>% filter(date > "2020-02-20") %>% group_by(date) %>% summarize(total_cases = sum(cases, na.rm = TRUE)) %>% ggplot() + geom_col(aes(x = date, y = total_cases), fill = "cornflowerblue", alpha = 0.7) + labs(x = "", y = "", title = "Daily confirmed cases in Illinois", subtitle = "Date: {closest_state}", caption = "Data from New York Times Covid-19 repository:\nhttps://github.com/nytimes/covid-19-data") + transition_states(date) + shadow_mark(fill = "cornflowerblue", alpha = 0.25) + enter_grow() ``` Now, let's look at these again, but group by county. ```{r} il %>% filter(date > "2020-02-20") %>% ggplot() + geom_col(aes(x = date, y = cases, fill = county), alpha = 0.7) + scale_fill_viridis_d(option = "plasma", guide = FALSE) + labs(x = "", y = "", title = "Daily confirmed cases in Illinois", caption = "Data from New York Times Covid-19 repository:\nhttps://github.com/nytimes/covid-19-data") ``` And animated: ```{r} il %>% filter(date > "2020-02-20") %>% ggplot() + geom_col(aes(x = date, y = cases, fill = county), alpha = 0.7) + scale_fill_viridis_d(option = "plasma", guide = FALSE) + labs(x = "", y = "", title = "Daily confirmed cases in Illinois", subtitle = "Date: {closest_state}", caption = "Data from New York Times Covid-19 repository:\nhttps://github.com/nytimes/covid-19-data") + transition_states(date) + shadow_mark(alpha = 0.5) + enter_grow() ``` ## Nominal confirmed cases by county First, a simple visualization of the number of confirmed cases in Illinois, by county, since Feb. 20. I include a couple of relevant dates as annotations--when CPS closed schools, the statewide shelter-in-place order, and the closing of the Lakefront trail. Cook County, where Chicago is located and by far the most populous in the state, has the largest number of confirmed cases, unsurprisingly. As of March 28, there are over 2,000 confirmed cases. ```{r} il %>% filter(cases != 0, date > "2020-02-20") %>% ggplot(aes(x = date, y = cases, group = county)) + geom_line(aes(color = county)) + geom_vline(xintercept = as.Date("2020-03-10"), linetype = 2) + geom_vline(xintercept = as.Date("2020-03-21"), linetype = 2) + geom_vline(xintercept = as.Date("2020-03-27"), linetype = 2) + scale_color_viridis_d(guide = FALSE) + annotate("text", label = "CPS closes schools", x = as.Date("2020-03-10"), y = max(il$cases)*0.5, hjust = 1.1, size = 3, family = "Lato", fontface = "italic") + annotate("text", label = "Statewide shelter-in-place", x = as.Date("2020-03-21"), y = max(il$cases)*0.5, hjust = 1.1, size = 3, family = "Lato", fontface = "italic") + annotate("text", label = "Lakefront trail\ncloses", x = as.Date("2020-03-27"), y = max(il$cases)*0.5, hjust = 1.1, size = 3, family = "Lato", fontface = "italic") + annotate("curve", x = as.Date("2020-03-16"), y = 600, xend = as.Date("2020-03-18"), yend = (il$cases[il$county == "Cook" & il$date == "2020-03-18"] + 50), size = 0.25, curvature = -.3, arrow = arrow(length = unit(2, "mm"))) + annotate("text", label = "Cook County", x = as.Date("2020-03-16"), y = 600, hjust = 1.1, size = 2.5, family = "Lato") + labs(x = "", y = "", title = "COVID-19 Cases in Illinois", subtitle = "Number of confirmed cases by county", caption = "Data from New York Times Covid-19 repository:\nhttps://github.com/nytimes/covid-19-data") ``` ## Log scale for confirmed cases in Illinois Using a log scale helps to clarify the line graph, turning the steeply curving exponential line into a nice linear line. A linear increase on a log scale graph corresponds to a nominal exponential increase. A flatter line would mean that the rate of confirmed cases is becoming more linear. Notable in this version is the extreme growth of cases in DuPage County, which borders Cook County to the west and is the second-most populous in the state. Tracking close to DuPage County is Lake County, also part of the Chicago metropolitan area, bordering Cook County to the north. ```{r} il %>% filter(date > "2020-02-20") %>% ggplot(aes(x = date, y = cases, group = county)) + geom_line(aes(color = county)) + geom_vline(xintercept = as.Date("2020-03-10"), linetype = 2) + geom_vline(xintercept = as.Date("2020-03-21"), linetype = 2) + geom_vline(xintercept = as.Date("2020-03-27"), linetype = 2) + scale_color_viridis_d(guide = FALSE) + scale_y_log10() + annotate("text", label = "CPS closes schools", x = as.Date("2020-03-10"), y = max(il$cases)*0.5, hjust = 1.1, size = 3, family = "Lato", fontface = "italic") + annotate("text", label = "Statewide shelter-in-place", x = as.Date("2020-03-21"), y = max(il$cases)*0.5, hjust = 1.1, size = 3, family = "Lato", fontface = "italic") + annotate("text", label = "Lakefront trail\ncloses", x = as.Date("2020-03-27"), y = max(il$cases)*0.5, hjust = 1.1, size = 3, family = "Lato", fontface = "italic") + annotate("curve", x = as.Date("2020-03-16"), y = (il$cases[il$county == "Cook" & il$date == "2020-03-18"] + 300), xend = as.Date("2020-03-18"), yend = (il$cases[il$county == "Cook" & il$date == "2020-03-18"] + 10), size = 0.25, curvature = -.3, arrow = arrow(length = unit(2, "mm"))) + annotate("text", label = "Cook County", x = as.Date("2020-03-16"), y = (il$cases[il$county == "Cook" & il$date == "2020-03-18"] + 300), hjust = 1.1, size = 2.5, family = "Lato") + annotate("curve", x = as.Date("2020-03-15"), y = (il$cases[il$county == "DuPage" & il$date == "2020-03-16"] + 10), xend = as.Date("2020-03-16"), yend = (il$cases[il$county == "DuPage" & il$date == "2020-03-16"] + 5), size = 0.25, curvature = .3, arrow = arrow(length = unit(2, "mm"))) + annotate("text", label = "DuPage County", x = as.Date("2020-03-16"), y = (il$cases[il$county == "DuPage" & il$date == "2020-03-17"] - 10), hjust = 1.1, vjust = 0, size = 2.5, family = "Lato") + labs(x = "", y = "", title = "COVID-19 Cases in Illinois, Log Scale", subtitle = "Log (base 10) number of confirmed cases by county", caption = "Data from New York Times Covid-19 repository:\nhttps://github.com/nytimes/covid-19-data") ``` ## Number of confirmed cases per 10,000 Look at the rate per 10,000 people in each county reveals another dimension of the spread. For one, it shows that the situation in both Lake and DuPage counties, while nominally appearing better than Cook, is almost as bad when taking the population into account. ```{r} il_per_cap <- il %>% filter(date > "2020-02-20") %>% complete(county, nesting(date), fill = list(cases = 0, deaths = 0)) %>% arrange(date, county) %>% select(-fips) %>% left_join(il %>% select(county, fips) %>% distinct(), by = "county") %>% left_join(il_pop %>% select(geoid, population), by = c("fips" = "geoid")) %>% mutate(cases_per = (cases/population) * 10000, deaths_per = (deaths/population) * 100000) il_per_cap %>% ggplot(aes(x = date, y = cases_per, group = county)) + geom_line(aes(color = county)) + geom_vline(xintercept = as.Date("2020-03-10"), linetype = 2) + geom_vline(xintercept = as.Date("2020-03-21"), linetype = 2) + geom_vline(xintercept = as.Date("2020-03-27"), linetype = 2) + scale_color_viridis_d(guide = FALSE) + annotate("text", label = "CPS closes schools", x = as.Date("2020-03-10"), y = max(il_per_cap$cases_per, na.rm = TRUE)*0.95, hjust = 1.1, size = 3, family = "Lato", fontface = "italic") + annotate("text", label = "Statewide shelter-in-place", x = as.Date("2020-03-21"), y = max(il_per_cap$cases_per, na.rm = TRUE)*0.95, hjust = 1.1, size = 3, family = "Lato", fontface = "italic") + annotate("text", label = "Lakefront trail\ncloses", x = as.Date("2020-03-27"), y = max(il_per_cap$cases_per, na.rm = TRUE)*0.95, hjust = 1.1, size = 3, family = "Lato", fontface = "italic") + annotate("segment", x = as.Date("2020-03-24"), y = (il_per_cap$cases_per[il_per_cap$county == "Cook" & il_per_cap$date == "2020-03-25"] + 0.5), xend = as.Date("2020-03-25"), yend = (il_per_cap$cases_per[il_per_cap$county == "Cook" & il_per_cap$date == "2020-03-25"]), size = 0.25, arrow = arrow(length = unit(2, "mm"))) + annotate("text", label = "Cook County", x = as.Date("2020-03-24"), y = (il_per_cap$cases_per[il_per_cap$county == "Cook" & il_per_cap$date == "2020-03-25"] + 0.55), hjust = 1.1, size = 2.5, family = "Lato") + annotate("segment", x = as.Date("2020-03-24"), y = (il_per_cap$cases_per[il_per_cap$county == "Lake" & il_per_cap$date == "2020-03-25"] + 0.6), xend = as.Date("2020-03-25"), yend = (il_per_cap$cases_per[il_per_cap$county == "Lake" & il_per_cap$date == "2020-03-25"]), size = 0.25, arrow = arrow(length = unit(2, "mm"))) + annotate("text", label = "Lake County", x = as.Date("2020-03-24"), y = (il_per_cap$cases_per[il_per_cap$county == "Lake" & il_per_cap$date == "2020-03-25"] + 0.65), hjust = 1.1, size = 2.5, family = "Lato") + labs(x = "", y = "", title = "COVID-19 Cases in Illinois", subtitle = "Number of confirmed cases per 10,000 by county", caption = "Data from New York Times Covid-19 repository:\nhttps://github.com/nytimes/covid-19-data") ``` # Spatial visualization of confirmed COVID-19 cases in Illinois The next set of visualizations look at the spatial distribution of cases. This analysis shows us that, although most of the cases are concentrated around the Chicago metropolitan area, where most of the populatoin of Illinois is located, there are cases throughout the state. At this point, not every country has a confirmed case, but I imagine that reality is not far off. I use the brilliant `{gganimate}` package to animate the growth of cases by county. Not all the counties in Illinois are represented in the data, but I want them included in the visualization. So I have to do a little more wrangling of the data. I will complete the dataframe by doing a `full_join()` with county column from the census data and then recalculate the rate per 10,000. ```{r} il <- il %>% full_join(il_pop %>% st_drop_geometry() %>% select(county)) %>% complete(county, nesting(date), fill = list(cases = 0, deaths = 0)) %>% arrange(date, county) il_per_cap <- il %>% left_join(il_pop %>% select(county, population), by = c("county")) %>% mutate(cases_per = (cases/population) * 10000, deaths_per = (deaths/population) * 100000) ``` For the first map, I use the log of the number of confirmed cases. Since Chicago has so many more cases than any other county, nominally, the maps would not show much meaningful contrast. Logging the number of cases makes the contrast between low-number counties. ```{r} il_anim <- il %>% filter(date > "2020-02-20") %>% mutate(log_cases = log10(cases)) %>% left_join(il_pop %>% select(county, population), by = c("county")) %>% ggplot() + geom_sf(data = il_pop, aes(geometry = geometry)) + geom_sf(aes(geometry = geometry, fill = log_cases)) + scale_fill_viridis_c(name = "Cases (logged)", na.value = "grey70") + labs(x = "", y = "", title = "Confirmed COVID-19 Cases in Illinois\nby County (log scale)", subtitle = "{current_frame}", caption = "Notes. Gray indicates no cases.\nData from New York Times Covid-19 repository:\nhttps://github.com/nytimes/covid-19-data") + transition_manual(date) + theme(plot.background = element_rect(fill = "antiquewhite2", color = NA), axis.ticks.x = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) animate(il_anim, fps = 10) ``` Second, I use the number of cases per 10,000, as above. This, to my mind, does the best job of showing the severity of the pandemic in each county. ```{r} il_anim <- il_per_cap %>% mutate(cases_per = ifelse(cases_per == 0, NA, cases_per)) %>% ggplot() + geom_sf(data = il_pop, aes(geometry = geometry)) + geom_sf(aes(geometry = geometry, fill = cases_per)) + scale_fill_viridis_c(name = "Cases per 10,000", na.value = "grey70") + labs(x = "", y = "", title = "Confirmed COVID-19 Cases per 10,000\nin Illinois by County", subtitle = "{current_frame}", caption = "Data from New York Times Covid-19 repository:\nhttps://github.com/nytimes/covid-19-data") + transition_manual(date) + theme(plot.background = element_rect(fill = "cornsilk2", color = NA), axis.ticks.x = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) animate(il_anim, fps = 10) ``` Last, I want to capture the distribution of the population in Illinois with the number of cases in each county. This makes it clearer that, while the Chicago metropolitan area has the most number of cases, it is also by far the most populous area in the state. I use the log of the population, since Cook County is SO much bigger than every other county. ```{r} il_anim <- il_per_cap %>% mutate(lon = map_dbl(geometry, ~st_point_on_surface(.x)[[1]]), lat = map_dbl(geometry, ~st_point_on_surface(.x)[[2]]), cases_per = ifelse(cases_per == 0, NA, cases_per)) %>% ggplot() + geom_sf(data = il_pop, aes(geometry = geometry, fill = log(population, base = 10))) + geom_point(aes(x = lon, y = lat, size = cases_per), color = "tomato", alpha = 0.7) + scale_fill_distiller(name = "Logged Population", labels = scales::comma, palette = "Blues", direction = 1) + scale_size(name = "Cases per 10,000") + labs(x = "", y = "", title = "Confirmed COVID-19 Cases per 10,000\nin Illinois by County", subtitle = "Date: {current_frame}", caption = "Data from New York Times Covid-19 repository:\nhttps://github.com/nytimes/covid-19-data") + transition_manual(date) + theme(plot.background = element_rect(fill = "cornsilk2", color = NA), panel.background = element_rect(fill = "cornsilk2", color = NA), axis.ticks.x = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), axis.ticks.y = element_blank(), panel.grid.major = element_blank(), panel.grid.minor = element_blank()) animate(il_anim, fps = 10) ``` # Final remarks The number of confirmed cases represents a likely biased sample of the true number of cases in the population, since confirmation depends upon testing and access to testing. The large number of confirmed cases in Cook, DuPage, and Lake counties could be because there is better access to tests, a more affluent population who can afford testing, etc. Needless to say, the number of confirmed cases does not represent the number of true cases.