4 Link Community Data
4.1 Overview
Geographic location can serve as a “key” that links different datasets together. By georeferencing each dataset, i.e., specifying its spatial location, we can integrate different types of information in one setting. In this tutorial, we will use the approximated “service areas” generated in our buffer analysis to identify vulnerable populations during the COVID pandemic. We will connect Chicago COVID-19 Case data by ZIP Code, available as a flat file on the city’s data portal, to our environment.
We will then overlay the 1-mile buffers representing walkable access to the Methadone providers in the city. We use a conservative threshold because of the multiple challenges posed by the COVID pandemic that may impact travel.
Our final goal will be to identify zip codes most impacted by COVID that are outside our acceptable access threshold. Our tutorial objectives are to:
- Clean data in preparation for a merge
- Integrate data using geographic identifiers
- Generate maps for a basic gap analysis
4.2 Environment Setup
To replicate the code & functions illustrated in this tutorial, you’ll need to have R and RStudio downloaded and installed on your system. This tutorial assumes some familiarity with the R programming language.
4.2.1 Input/Output
Our inputs include multiple CSV and SHP files, all of which can be found here, though the providers point file was generated in the Geocoding tutorial. Note that all four files (.dbf, .prj, .shp, and .shx) are required to constitute a shapefile.
- Chicago Methadone Clinics, methadone_clinics.shp
- 1-mile buffer zone of Clinics, methadoneClinics_1mi.shp
- Chicago Zip Codes, Chicago_Zips.shp
- Chicago COVID case data by Zip, COVID-19_Cases__Tests__and_Deaths_by_ZIP_Code.csv
We will link COVID cumulative case rate data for two weekly time periods in 2020 with a master Zip Code file, and save as a new dataset. Then, we can overlay buffers of health providers we are interested in to identify areas that have varying access to medications during a time of contagion.
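Before moving on, it can be worth confirming that every component of each shapefile is actually present. A minimal check, assuming the files sit in the data/ folder referenced in the code that follows:
list.files("data", pattern = "methadone_clinics")   # should list the .dbf, .prj, .shp, and .shx components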
4.2.2 Load Libraries
We will use the following packages in this tutorial:
- dplyr: to wrangle non-spatial data (or, use tidyverse)
- sf: to manipulate spatial data
- tmap: to visualize and create maps
Load the libraries for use.
library(sf)
library(tmap)
library(dplyr)
4.2.3 Load Data
First we’ll load the shapefiles.
chicagoZips <- st_read("data/Chicago_Zips.shp")
Reading layer `Chicago_Zips' from data source
`/Users/maryniakolak/Code/opioid-environment-toolkit/data/Chicago_Zips.shp'
using driver `ESRI Shapefile'
Simple feature collection with 61 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS: WGS84(DD)
methadoneSf <- st_read("data/methadone_clinics.shp")
Reading layer `methadone_clinics' from data source
`/Users/maryniakolak/Code/opioid-environment-toolkit/data/methadone_clinics.shp'
using driver `ESRI Shapefile'
Simple feature collection with 27 features and 8 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -87.7349 ymin: 41.68698 xmax: -87.57673 ymax: 41.96475
Geodetic CRS: WGS 84
buffers <- st_read("data/methadoneClinics_1mi.shp")
Reading layer `methadoneClinics_1mi' from data source
`/Users/maryniakolak/Code/opioid-environment-toolkit/data/methadoneClinics_1mi.shp'
using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 1141979 ymin: 1824050 xmax: 1195960 ymax: 1935751
Projected CRS: NAD83 / Illinois East (ftUS)
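Notice that the buffer layer is stored in a projected coordinate reference system (NAD83 / Illinois East), while the zip code and clinic layers use WGS 84. tmap reprojects layers on the fly when mapping, but if you plan to run spatial operations across layers it can help to align them explicitly. A minimal sketch, with an object name (buffersWGS84) of our own choosing:
st_crs(buffers) == st_crs(chicagoZips)                       # FALSE: the CRSs differ
buffersWGS84 <- st_transform(buffers, st_crs(chicagoZips))   # reproject buffers to match the zips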
Next, we’ll load some new data we’re interested in joining in: Chicago COVID-19 Cases, Tests, and Deaths by ZIP Code, found on the city data portal here. We’ll load in a CSV and inspect the data:
COVID <- read.csv("data/COVID-19_Cases__Tests__and_Deaths_by_ZIP_Code.csv")
4.3 Inspect Data
4.3.1 Identify Geographic Key
The spatial Zip dataset will be our primary file. We need to first identify the geographic identifier we would like to merge on. What is the main key of interest that we will merge our data on?
head(chicagoZips)
Simple feature collection with 6 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.80649 ymin: 41.88747 xmax: -87.59852 ymax: 41.93228
Geodetic CRS: WGS84(DD)
objectid shape_area shape_len zip geometry
1 33 106052287 42720.04 60647 MULTIPOLYGON (((-87.67762 4...
2 34 127476051 48103.78 60639 MULTIPOLYGON (((-87.72683 4...
3 35 45069038 27288.61 60707 MULTIPOLYGON (((-87.785 41....
4 36 70853834 42527.99 60622 MULTIPOLYGON (((-87.66707 4...
5 37 99039621 47970.14 60651 MULTIPOLYGON (((-87.70656 4...
6 38 23506056 34689.35 60611 MULTIPOLYGON (((-87.61401 4...
In this dataset, we can see that the column showing our geographic key is “zip.”
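It can also help to note how the key is stored (character vs. numeric), since mismatched key types are a common cause of failed joins later on. A quick check:
class(chicagoZips$zip)   # note the type; compare it to the ZIP.Code column in the CSV below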
4.3.2 Inspect Case Data
Next, let’s inspect COVID case data. What information do we need from this file? We may not need everything, so consider just identifying the field with the geographic identifier and main variable(s) of interest.
head(COVID)
  ZIP.Code Week.Number Week.Start Week.End Cases...Weekly Cases...Cumulative
1 60603 39 09/20/2020 09/26/2020 0 13
2 60604 39 09/20/2020 09/26/2020 0 31
3 60611 16 04/12/2020 04/18/2020 8 72
4 60611 15 04/05/2020 04/11/2020 7 64
5 60615 11 03/08/2020 03/14/2020 NA NA
6 60603 10 03/01/2020 03/07/2020 NA NA
Case.Rate...Weekly Case.Rate...Cumulative Tests...Weekly Tests...Cumulative
1 0 1107.3 25 327
2 0 3964.2 12 339
3 25 222.0 101 450
4 22 197.4 59 349
5 NA NA 6 9
6 NA NA 0 0
Test.Rate...Weekly Test.Rate...Cumulative Percent.Tested.Positive...Weekly
1 2130 27853.5 0.0
2 1534 43350.4 0.0
3 312 1387.8 0.1
4 182 1076.3 0.1
5 14 21.7 NA
6 0 0.0 NA
Percent.Tested.Positive...Cumulative Deaths...Weekly Deaths...Cumulative
1 0.0 0 0
2 0.1 0 0
3 0.2 0 0
4 0.2 0 0
5 NA 0 0
6 NA 0 0
Death.Rate...Weekly Death.Rate...Cumulative Population Row.ID
1 0 0 1174 60603-39
2 0 0 782 60604-39
3 0 0 32426 60611-16
4 0 0 32426 60611-15
5 0 0 41563 60615-11
6 0 0 1174 60603-10
ZIP.Code.Location
1 POINT (-87.625473 41.880112)
2 POINT (-87.629029 41.878153)
3 POINT (-87.620291 41.894734)
4 POINT (-87.620291 41.894734)
5 POINT (-87.602725 41.801993)
6 POINT (-87.625473 41.880112)
We would like to subset the data to include “Case Rate - Cumulative.” The key we’ll need to join data on will be “ZIP.Code.”
4.3.3 Long vs. Wide Data Formats
Upon closer inspection, we can see that individual zip codes (e.g., 60611) appear multiple times as separate observations. The data is in long form, and could be considered “tidy” – each variable is included as a different column. Because each week is recorded as a separate observation, we see zip codes showing up numerous times. While this data format is tidy, it is problematic for merging with spatial data.
Spatial data needs to be in wide format when mapping and doing some spatial analyses, so that each row corresponds to a single spatial object. Thus a zip code should only show up once, as a single observation.
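One quick way to see this repetition directly is to count how many observations each zip code contributes; a base R sketch:
head(sort(table(COVID$ZIP.Code), decreasing = TRUE))   # each zip appears once per reported week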
4.4 Data Subsetting
There are many ways to reshape data in R. Check out options like pivot_wider in tidyr or dcast in reshape2, for example! For our work, we’ll just subset two cleaned data products corresponding to weeks 19 and 39. Then, we’ll merge with the master zip code dataset.
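As a sketch of that reshaping alternative (assuming tidyr is installed; the object name COVID.wide is ours), the full table could be pivoted so each zip code becomes one row with a case-rate column per week:
COVID.wide <- COVID %>%
  select(zip = ZIP.Code, Week.Number, Case.Rate...Cumulative) %>%
  tidyr::pivot_wider(names_from = Week.Number,
                     values_from = Case.Rate...Cumulative,
                     names_prefix = "CaseRtWk")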
4.4.1 Identify Column Names
Because this data file has extremely long header names (common in the epi world), let’s use the colnames function to get exactly what we need.
colnames(COVID)
 [1] "ZIP.Code"
[2] "Week.Number"
[3] "Week.Start"
[4] "Week.End"
[5] "Cases...Weekly"
[6] "Cases...Cumulative"
[7] "Case.Rate...Weekly"
[8] "Case.Rate...Cumulative"
[9] "Tests...Weekly"
[10] "Tests...Cumulative"
[11] "Test.Rate...Weekly"
[12] "Test.Rate...Cumulative"
[13] "Percent.Tested.Positive...Weekly"
[14] "Percent.Tested.Positive...Cumulative"
[15] "Deaths...Weekly"
[16] "Deaths...Cumulative"
[17] "Death.Rate...Weekly"
[18] "Death.Rate...Cumulative"
[19] "Population"
[20] "Row.ID"
[21] "ZIP.Code.Location"
We will gather data for “Case.Rate…Cumulative”.
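Before filtering, you may also want to confirm which weeks are actually present in the file; a quick base R check:
sort(unique(COVID$Week.Number))   # confirm weeks 19 and 39 are available before subsetting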
4.4.2 Filter & Subset Data
Next, we’ll use dplyr and piping to:
- filter data to the week of interest
- subset to only the zip code and case rate data we want
- rename variables to our desired names
Since the zip code dataset uses “zip” as its geographic identifier, we’ll rename the corresponding field to that here.
COVID.sub19 <- COVID %>%
filter(Week.Number == 19) %>%
select(zip = ZIP.Code,
CaseRtWk19 = Case.Rate...Cumulative)
COVID.sub39 <- COVID %>%
filter(Week.Number == 39) %>%
select(zip = ZIP.Code,
CaseRtWk39 = Case.Rate...Cumulative)
head(COVID.sub19)
    zip CaseRtWk19
1 60604 1790.3
2 60601 429.3
3 60612 1468.9
4 60613 628.6
5 60615 659.2
6 60605 541.4
head(COVID.sub39)
    zip CaseRtWk39
1 60603 1107.3
2 60604 3964.2
3 60601 1451.4
4 60602 1688.1
5 60605 1420.8
6 60610 1706.9
As usual, we’ll inspect the data to confirm we did what we hoped.
4.4.3 Check Dimensions
We have about 60 zip codes in Chicago. Let’s make sure our final, subsetted data has similar dimensions!
dim(COVID.sub19)
[1] 60  2
dim(COVID.sub39)
[1] 60  2
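If the counts had not lined up, a quick way to spot mismatched identifiers before merging is to compare the key columns directly; a base R sketch:
setdiff(chicagoZips$zip, COVID.sub39$zip)   # zips in the spatial file with no COVID record
setdiff(COVID.sub39$zip, chicagoZips$zip)   # COVID records with no matching zip polygon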
4.5 Merge Data
Let’s merge the data using the zip code geographic identifier. Inspect the data to confirm it merged correctly!
zipsMerged <- merge(chicagoZips, COVID.sub19, by = "zip", all.x = TRUE)
head(zipsMerged)
Simple feature collection with 6 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.63999 ymin: 41.85317 xmax: -87.60246 ymax: 41.88913
Geodetic CRS: WGS84(DD)
zip objectid shape_area shape_len CaseRtWk19 geometry
1 60601 27 9166246 19804.58 429.3 MULTIPOLYGON (((-87.62271 4...
2 60602 26 4847125 14448.17 643.1 MULTIPOLYGON (((-87.60997 4...
3 60603 19 4560229 13672.68 NA MULTIPOLYGON (((-87.61633 4...
4 60604 48 4294902 12245.81 1790.3 MULTIPOLYGON (((-87.63376 4...
5 60605 20 36301276 37973.35 541.4 MULTIPOLYGON (((-87.62064 4...
6 60606 31 6766411 12040.44 612.7 MULTIPOLYGON (((-87.63397 4...
zipsMerged <- merge(zipsMerged, COVID.sub39, by = "zip", all.x = TRUE)
head(zipsMerged)
Simple feature collection with 6 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.63999 ymin: 41.85317 xmax: -87.60246 ymax: 41.88913
Geodetic CRS: WGS84(DD)
zip objectid shape_area shape_len CaseRtWk19 CaseRtWk39
1 60601 27 9166246 19804.58 429.3 1451.4
2 60602 26 4847125 14448.17 643.1 1688.1
3 60603 19 4560229 13672.68 NA 1107.3
4 60604 48 4294902 12245.81 1790.3 3964.2
5 60605 20 36301276 37973.35 541.4 1420.8
6 60606 31 6766411 12040.44 612.7 2289.6
geometry
1 MULTIPOLYGON (((-87.62271 4...
2 MULTIPOLYGON (((-87.60997 4...
3 MULTIPOLYGON (((-87.61633 4...
4 MULTIPOLYGON (((-87.63376 4...
5 MULTIPOLYGON (((-87.62064 4...
6 MULTIPOLYGON (((-87.63397 4...
Above, we first merged Week 19 data to the master zip dataset. We inspect the data to confirm it merged correctly. Then, we merge Week 39 data.
We include all “x” values, meaning we retain all the zip codes for our master dataset. This corresponds to a left join. It can be helpful to inspect dimensions after a merge to ensure you have retained all the data you expected!
There are many ways to reshape and merge data. Try your favorite R approaches here!
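For example, a minimal dplyr-based sketch of the same merge (left_join requires the key columns to share a type, so both are coerced to character here; the object name zipsJoined is ours):
zipsJoined <- chicagoZips %>%
  mutate(zip = as.character(zip)) %>%
  left_join(mutate(COVID.sub19, zip = as.character(zip)), by = "zip") %>%
  left_join(mutate(COVID.sub39, zip = as.character(zip)), by = "zip")
dim(zipsJoined)   # should still have 61 rows, one per zip code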
4.6 Visualize Data
Now we are ready to inspect our data spatially! First, we’ll make a simple map.
4.6.1 Choropleth Map Basics
We generate a choropleth map (or statistical map) of the Week 39 case rate data using six quantile bins and a Blue-Purple color palette.
You can find an R color cheatsheet useful for identifying palette codes here. (More details on thematic mapping are in tutorials that follow!) We overlay the buffers and clinical providers.
tm_shape(zipsMerged) +
tm_polygons(
fill = "CaseRtWk39",
fill.scale = tm_scale_intervals(
style = "quantile",
n = 6,
values = "BuPu"),
fill.legend = tm_legend(
title = "COVID Case rate",
title.size = 0.7,
text.size = 0.5,
show = TRUE,
frame = FALSE,
position = c("left","bottom"))) +
tm_shape(buffers) + tm_borders() +
tm_shape(methadoneSf) + tm_dots(size = 0.3) +
  tm_title("Walkable Methadone Service Areas in Week 39")
Already we can generate some insight. Areas on the far West side of the city have some of the highest case rates, but are outside a walkable distance to Methadone providers. Individuals with opioid use disorder who need medication access in these locales may be especially vulnerable during the pandemic.
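To make that visual gap analysis more explicit, one optional sketch (the object names are ours) flags the zip codes that fall entirely outside the 1-mile buffers and ranks them by Week 39 case rate:
buffersWGS84 <- st_transform(buffers, st_crs(zipsMerged))          # align CRS before comparing layers
outsideBuffer <- lengths(st_intersects(zipsMerged, buffersWGS84)) == 0
gapZips <- zipsMerged[outsideBuffer, ]
head(gapZips[order(-gapZips$CaseRtWk39), c("zip", "CaseRtWk39")])  # highest-rate zips with no walkable access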
4.6.2 Fixed Breaks
If we generate a quantile map for each time period, we’ll see different breaks or data classifications for each, as case rates grew from week 19 to 39. If we wanted to generate a comparable map, we may need to generate comparable breaks in the data.
For practice, we’ll use breaks of 1,000 - 5,000, following trends in Week 39. First, let’s see what that looks like for just one time period.
tm_shape(zipsMerged) +
tm_polygons(fill = "CaseRtWk39",
fill.scale = tm_scale_intervals(
breaks = c(1000,2000,3000,4000,5000),
values ="BuPu"),
fill.legend = tm_legend(
title = "COVID Case rate",
title.size = 0.7,
text.size = 0.5,
show = TRUE,
frame = FALSE,
position = c("left","bottom"))) +
tm_shape(buffers) + tm_polygons(alpha=0.2) +
tm_shape(methadoneSf) + tm_symbols(fill = "black",size = 0.3) +
tm_layout(main.title = "Walkable Methadone Service Areas in Week 39",
main.title.position = "center",
main.title.size = 1,
          frame.alpha = 0.5)
Next, we’ll write each map to its own object. Then we can use tmap_arrange to generate a map panel of both maps. We keep standard formatting so the legends remain outside of the map. In future tutorials, we can customize things further. For now, it will be enough to get an idea of trends!
wk19 <- tm_shape(zipsMerged) +
tm_polygons(fill = "CaseRtWk19",
fill.scale = tm_scale_intervals(
breaks = c(1000,2000,3000,4000,5000),
values ="BuPu"),
fill.legend = tm_legend(
title = "COVID Case rate",
title.size = 0.7,
text.size = 0.5,
show = TRUE,
frame = FALSE,
position = c("left","bottom"))) +
tm_shape(buffers) + tm_polygons(alpha=0.2) +
tm_shape(methadoneSf) + tm_symbols(fill = "black",size = 0.3) +
tm_layout(main.title = "Walkable Methadone Service Areas in Week 19",
main.title.position = "center",
main.title.size = 1,
frame.alpha = 0.5)
wk39 <- tm_shape(zipsMerged) +
tm_polygons(fill = "CaseRtWk39",
fill.scale = tm_scale_intervals(
breaks = c(1000,2000,3000,4000,5000),
values ="BuPu"),
fill.legend = tm_legend(
title = "COVID Case rate",
title.size = 0.7,
text.size = 0.5,
show = TRUE,
frame = FALSE,
position = c("left","bottom"))) +
tm_shape(buffers) + tm_polygons(alpha=0.2) +
tm_shape(methadoneSf) + tm_symbols(fill = "black",size = 0.3) +
tm_layout(main.title = "Walkable Methadone Service Areas in Week 39",
main.title.position = "center",
main.title.size = 1,
frame.alpha = 0.5)
tmap_arrange(wk19, wk39)
4.6.3 Save Data
We save our newly merged ZCTA level data for future analysis.
#write_sf(zipsMerged, "data/chizips_COVID.gpkg")
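If you uncomment the write_sf() line above, the saved GeoPackage can be read back in a later session with st_read; a sketch assuming the same path:
#chizips <- st_read("data/chizips_COVID.gpkg")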