4 Link Community Data
4.1 Overview
Geographic location can serve as a “key” that links different datasets together. By georeferencing each dataset, i.e., specifying its spatial location, we can integrate different types of information in one setting. In this tutorial, we will use the approximated “service areas” generated in our buffer analysis to identify vulnerable populations during the COVID pandemic. We will connect Chicago COVID-19 Case data by ZIP Code, available as a flat file on the city’s data portal, to our environment.
We will then overlay the 1-mile buffers representing walkable access to the Methadone providers in the city. We use a conservative threshold because of the multiple challenges posed by the COVID pandemic that may impact travel.
Our final goal will be to identify zip codes most impacted by COVID that are outside our acceptable access threshold. Our tutorial objectives are to:
- Clean data in preparation for a merge
- Integrate data using geographic identifiers
- Generate maps for a basic gap analysis
4.2 Environment Setup
To replicate the code & functions illustrated in this tutorial, you’ll need to have R and RStudio downloaded and installed on your system. This tutorial assumes some familiarity with the R programming language.
4.2.1 Input/Output
Our inputs include multiple CSV and SHP files, all of which can be found here, though the providers point file was generated in the Geocoding tutorial. Note that all four files (.dbf, .prj, .shp, and .shx) are required to constitute a shapefile.
- Chicago Methadone Clinics, methadone_clinics.shp
- 1-mile buffer zone of Clinics, methadoneClinics_1mi.shp
- Chicago Zip Codes, Chicago_Zips.shp
- Chicago COVID case data by Zip, COVID-19_Cases__Tests__and_Deaths_by_ZIP_Code.csv
We will link COVID cumulative case rate data for two weekly time periods in 2020 with a master Zip Code file, and save as a new dataset. Then, we can overlay buffers of health providers we are interested in to identify areas that have varying access to medications during a time of contagion.
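Before moving on, it can be worth confirming that every component of each shapefile is actually present. A minimal check, assuming the files sit in the data/ folder referenced in the code that follows:
list.files("data", pattern = "methadone_clinics")   # should list the .dbf, .prj, .shp, and .shx components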
4.2.2 Load Libraries
We will use the following packages in this tutorial:
- dplyr: to wrangle non-spatial data (or, use tidyverse)
- sf: to manipulate spatial data
- tmap: to visualize and create maps
Load the libraries for use.
library(sf)
library(tmap)
library(dplyr)
4.2.3 Load Data
First we’ll load the shapefiles.
chicagoZips <- st_read("data/Chicago_Zips.shp")
Reading layer `Chicago_Zips' from data source
`/Users/maryniakolak/Code/opioid-environment-toolkit/data/Chicago_Zips.shp'
using driver `ESRI Shapefile'
Simple feature collection with 61 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.94011 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS: WGS84(DD)
methadoneSf <- st_read("data/methadone_clinics.shp")
Reading layer `methadone_clinics' from data source
`/Users/maryniakolak/Code/opioid-environment-toolkit/data/methadone_clinics.shp'
using driver `ESRI Shapefile'
Simple feature collection with 27 features and 8 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -87.7349 ymin: 41.68698 xmax: -87.57673 ymax: 41.96475
Geodetic CRS: WGS 84
buffers <- st_read("data/methadoneClinics_1mi.shp")
Reading layer `methadoneClinics_1mi' from data source
`/Users/maryniakolak/Code/opioid-environment-toolkit/data/methadoneClinics_1mi.shp'
using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 1141979 ymin: 1824050 xmax: 1195960 ymax: 1935751
Projected CRS: NAD83 / Illinois East (ftUS)
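Notice that the buffer layer is stored in a projected coordinate reference system (NAD83 / Illinois East), while the zip code and clinic layers use WGS 84. tmap reprojects layers on the fly when mapping, but if you plan to run spatial operations across layers it can help to align them explicitly. A minimal sketch, with an object name (buffersWGS84) of our own choosing:
st_crs(buffers) == st_crs(chicagoZips)                       # FALSE: the CRSs differ
buffersWGS84 <- st_transform(buffers, st_crs(chicagoZips))   # reproject buffers to match the zips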
Next, we’ll load some new data we’re interested in joining in: Chicago COVID-19 Cases, Tests, and Deaths by ZIP Code, found on the city data portal here. We’ll load in a CSV and inspect the data:
COVID <- read.csv("data/COVID-19_Cases__Tests__and_Deaths_by_ZIP_Code.csv")
4.3 Inspect Data
4.3.1 Identify Geographic Key
The spatial Zip dataset will be our primary file. We need to first identify the geographic identifier we would like to merge on. What is the main key of interest that we will merge our data on?
head(chicagoZips)
Simple feature collection with 6 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.80649 ymin: 41.88747 xmax: -87.59852 ymax: 41.93228
Geodetic CRS: WGS84(DD)
objectid shape_area shape_len zip geometry
1 33 106052287 42720.04 60647 MULTIPOLYGON (((-87.67762 4...
2 34 127476051 48103.78 60639 MULTIPOLYGON (((-87.72683 4...
3 35 45069038 27288.61 60707 MULTIPOLYGON (((-87.785 41....
4 36 70853834 42527.99 60622 MULTIPOLYGON (((-87.66707 4...
5 37 99039621 47970.14 60651 MULTIPOLYGON (((-87.70656 4...
6 38 23506056 34689.35 60611 MULTIPOLYGON (((-87.61401 4...
In this dataset, we can see that the column showing our geographic key is “zip.”
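It can also help to note how the key is stored (character vs. numeric), since mismatched key types are a common cause of failed joins later on. A quick check:
class(chicagoZips$zip)   # note the type; compare it to the ZIP.Code column in the CSV below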
4.3.2 Inspect Case Data
Next, let’s inspect COVID case data. What information do we need from this file? We may not need everything, so consider just identifying the field with the geographic identifier and main variable(s) of interest.
head(COVID)
  ZIP.Code Week.Number Week.Start Week.End Cases...Weekly Cases...Cumulative
1 60603 39 09/20/2020 09/26/2020 0 13
2 60604 39 09/20/2020 09/26/2020 0 31
3 60611 16 04/12/2020 04/18/2020 8 72
4 60611 15 04/05/2020 04/11/2020 7 64
5 60615 11 03/08/2020 03/14/2020 NA NA
6 60603 10 03/01/2020 03/07/2020 NA NA
Case.Rate...Weekly Case.Rate...Cumulative Tests...Weekly Tests...Cumulative
1 0 1107.3 25 327
2 0 3964.2 12 339
3 25 222.0 101 450
4 22 197.4 59 349
5 NA NA 6 9
6 NA NA 0 0
Test.Rate...Weekly Test.Rate...Cumulative Percent.Tested.Positive...Weekly
1 2130 27853.5 0.0
2 1534 43350.4 0.0
3 312 1387.8 0.1
4 182 1076.3 0.1
5 14 21.7 NA
6 0 0.0 NA
Percent.Tested.Positive...Cumulative Deaths...Weekly Deaths...Cumulative
1 0.0 0 0
2 0.1 0 0
3 0.2 0 0
4 0.2 0 0
5 NA 0 0
6 NA 0 0
Death.Rate...Weekly Death.Rate...Cumulative Population Row.ID
1 0 0 1174 60603-39
2 0 0 782 60604-39
3 0 0 32426 60611-16
4 0 0 32426 60611-15
5 0 0 41563 60615-11
6 0 0 1174 60603-10
ZIP.Code.Location
1 POINT (-87.625473 41.880112)
2 POINT (-87.629029 41.878153)
3 POINT (-87.620291 41.894734)
4 POINT (-87.620291 41.894734)
5 POINT (-87.602725 41.801993)
6 POINT (-87.625473 41.880112)
We would like to subset the data to include “Case Rate - Cumulative.” The key we’ll need to join data on will be “ZIP.Code.”
4.3.3 Long vs. Wide Data Formats
Upon closer inspection, we can see that individual zip codes (e.g., 60611) appear multiple times as separate observations. The data is in long form, and could be considered “tidy” – each variable is included as a different column. Because each week is recorded as a separate observation, we see zip codes showing up numerous times. While this data format is tidy, it is problematic for merging with spatial data.
Spatial data needs to be in wide format when mapping and doing some spatial analyses, so that each row corresponds to a single spatial object. Thus a zip code should only show up once, as a single observation.
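One quick way to see this repetition directly is to count how many observations each zip code contributes; a base R sketch:
head(sort(table(COVID$ZIP.Code), decreasing = TRUE))   # each zip appears once per reported week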
4.4 Data Subsetting
There are many ways to reshape data in R. Check out options like pivot_wider in tidyr or dcast in reshape2, for example! For our work, we’ll just subset two cleaned data products corresponding to weeks 19 and 39. Then, we’ll merge with the master zip code dataset.
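As a sketch of that reshaping alternative (assuming tidyr is installed; the object name COVID.wide is ours), the full table could be pivoted so each zip code becomes one row with a case-rate column per week:
COVID.wide <- COVID %>%
  select(zip = ZIP.Code, Week.Number, Case.Rate...Cumulative) %>%
  tidyr::pivot_wider(names_from = Week.Number,
                     values_from = Case.Rate...Cumulative,
                     names_prefix = "CaseRtWk")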
4.4.1 Identify Column Names
Because this data file has extremely long header names (common in the epi world), let’s use the colnames function to get exactly what we need.
colnames(COVID)
 [1] "ZIP.Code"
[2] "Week.Number"
[3] "Week.Start"
[4] "Week.End"
[5] "Cases...Weekly"
[6] "Cases...Cumulative"
[7] "Case.Rate...Weekly"
[8] "Case.Rate...Cumulative"
[9] "Tests...Weekly"
[10] "Tests...Cumulative"
[11] "Test.Rate...Weekly"
[12] "Test.Rate...Cumulative"
[13] "Percent.Tested.Positive...Weekly"
[14] "Percent.Tested.Positive...Cumulative"
[15] "Deaths...Weekly"
[16] "Deaths...Cumulative"
[17] "Death.Rate...Weekly"
[18] "Death.Rate...Cumulative"
[19] "Population"
[20] "Row.ID"
[21] "ZIP.Code.Location"
We will gather data for “Case.Rate…Cumulative”.
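Before filtering, you may also want to confirm which weeks are actually present in the file; a quick base R check:
sort(unique(COVID$Week.Number))   # confirm weeks 19 and 39 are available before subsetting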
4.4.2 Filter & Subset Data
Next, we’ll use dplyr and piping to:
- filter data to the week of interest
- subset to only the zip code and case rate data we want
- rename variables to our desired names
Since the zip code dataset uses “zip” as its geographic identifier, we’ll rename the corresponding field to that here.
COVID.sub19 <- COVID %>%
filter(Week.Number == 19) %>%
select(zip = ZIP.Code,
CaseRtWk19 = Case.Rate...Cumulative)
COVID.sub39 <- COVID %>%
filter(Week.Number == 39) %>%
select(zip = ZIP.Code,
CaseRtWk39 = Case.Rate...Cumulative)
head(COVID.sub19)
    zip CaseRtWk19
1 60604 1790.3
2 60601 429.3
3 60612 1468.9
4 60613 628.6
5 60615 659.2
6 60605 541.4
head(COVID.sub39)
    zip CaseRtWk39
1 60603 1107.3
2 60604 3964.2
3 60601 1451.4
4 60602 1688.1
5 60605 1420.8
6 60610 1706.9
As usual, we’ll inspect the data to confirm we did what we hoped.
4.4.3 Check Dimensions
We have about 60 zip codes in Chicago. Let’s make sure our final, subsetted data has similar dimensions!
dim(COVID.sub19)
[1] 60  2
dim(COVID.sub39)
[1] 60  2
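If the counts had not lined up, a quick way to spot mismatched identifiers before merging is to compare the key columns directly; a base R sketch:
setdiff(chicagoZips$zip, COVID.sub39$zip)   # zips in the spatial file with no COVID record
setdiff(COVID.sub39$zip, chicagoZips$zip)   # COVID records with no matching zip polygon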
4.5 Merge Data
Let’s merge the data using the zip code geographic identifier. Inspect the data to confirm it merged correctly!
zipsMerged <- merge(chicagoZips, COVID.sub19, by = "zip", all.x = TRUE)
head(zipsMerged)
Simple feature collection with 6 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.63999 ymin: 41.85317 xmax: -87.60246 ymax: 41.88913
Geodetic CRS: WGS84(DD)
zip objectid shape_area shape_len CaseRtWk19 geometry
1 60601 27 9166246 19804.58 429.3 MULTIPOLYGON (((-87.62271 4...
2 60602 26 4847125 14448.17 643.1 MULTIPOLYGON (((-87.60997 4...
3 60603 19 4560229 13672.68 NA MULTIPOLYGON (((-87.61633 4...
4 60604 48 4294902 12245.81 1790.3 MULTIPOLYGON (((-87.63376 4...
5 60605 20 36301276 37973.35 541.4 MULTIPOLYGON (((-87.62064 4...
6 60606 31 6766411 12040.44 612.7 MULTIPOLYGON (((-87.63397 4...
zipsMerged <- merge(zipsMerged, COVID.sub39, by = "zip", all.x = TRUE)
head(zipsMerged)
Simple feature collection with 6 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -87.63999 ymin: 41.85317 xmax: -87.60246 ymax: 41.88913
Geodetic CRS: WGS84(DD)
zip objectid shape_area shape_len CaseRtWk19 CaseRtWk39
1 60601 27 9166246 19804.58 429.3 1451.4
2 60602 26 4847125 14448.17 643.1 1688.1
3 60603 19 4560229 13672.68 NA 1107.3
4 60604 48 4294902 12245.81 1790.3 3964.2
5 60605 20 36301276 37973.35 541.4 1420.8
6 60606 31 6766411 12040.44 612.7 2289.6
geometry
1 MULTIPOLYGON (((-87.62271 4...
2 MULTIPOLYGON (((-87.60997 4...
3 MULTIPOLYGON (((-87.61633 4...
4 MULTIPOLYGON (((-87.63376 4...
5 MULTIPOLYGON (((-87.62064 4...
6 MULTIPOLYGON (((-87.63397 4...
Above, we first merged Week 19 data to the master zip dataset. We inspect the data to confirm it merged correctly. Then, we merge Week 39 data.
We include all “x” values, meaning we retain all the zip codes for our master dataset. This corresponds to a left join. It can be helpful to inspect dimensions after a merge to ensure you have retained all the data you expected!
There are many ways to reshape and merge data. Try your favorite R approaches here!
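For example, a minimal dplyr-based sketch of the same merge (left_join requires the key columns to share a type, so both are coerced to character here; the object name zipsJoined is ours):
zipsJoined <- chicagoZips %>%
  mutate(zip = as.character(zip)) %>%
  left_join(mutate(COVID.sub19, zip = as.character(zip)), by = "zip") %>%
  left_join(mutate(COVID.sub39, zip = as.character(zip)), by = "zip")
dim(zipsJoined)   # should still have 61 rows, one per zip code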
4.6 Visualize Data
Now we are ready to inspect our data spatially! First, we’ll make a simple map.
4.6.1 Choropleth Map Basics
We generate a choropleth map (or statistical map) of the Week 39 case rate data using six quantile bins and a Blue-Purple color palette.
You can find an R color cheatsheet useful for identifying palette codes here. (More details on thematic mapping are in tutorials that follow!) We overlay the buffers and clinical providers.
tm_shape(zipsMerged) +
tm_polygons(
fill = "CaseRtWk39",
fill.scale = tm_scale_intervals(
style = "quantile",
n = 6,
values = "BuPu"),
fill.legend = tm_legend(
title = "COVID Case rate",
title.size = 0.7,
text.size = 0.5,
show = TRUE,
frame = FALSE,
position = c("left","bottom"))) +
tm_shape(buffers) + tm_borders() +
tm_shape(methadoneSf) + tm_dots(size = 0.3) +
  tm_title("Walkable Methadone Service Areas in Week 39")
Already we can generate some insight. Areas on the far West side of the city have some of the highest case rates, but are outside a walkable distance to Methadone providers. Individuals with opioid use disorder who need medication access in these locales may be especially vulnerable during the pandemic.
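To make that visual gap analysis more explicit, one optional sketch (the object names are ours) flags the zip codes that fall entirely outside the 1-mile buffers and ranks them by Week 39 case rate:
buffersWGS84 <- st_transform(buffers, st_crs(zipsMerged))          # align CRS before comparing layers
outsideBuffer <- lengths(st_intersects(zipsMerged, buffersWGS84)) == 0
gapZips <- zipsMerged[outsideBuffer, ]
head(gapZips[order(-gapZips$CaseRtWk39), c("zip", "CaseRtWk39")])  # highest-rate zips with no walkable access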
4.6.2 Fixed Breaks
If we generate a quantile map for each time period, we’ll see different breaks or data classifications for each, as case rates grew from week 19 to 39. If we wanted to generate a comparable map, we may need to generate comparable breaks in the data.
For practice, we’ll use breaks of 1,000 - 5,000, following trends in Week 39. First, let’s see what that looks like for just one time period.
tm_shape(zipsMerged) +
tm_polygons(fill = "CaseRtWk39",
fill.scale = tm_scale_intervals(
breaks = c(1000,2000,3000,4000,5000),
values ="BuPu"),
fill.legend = tm_legend(
title = "COVID Case rate",
title.size = 0.7,
text.size = 0.5,
show = TRUE,
frame = FALSE,
position = c("left","bottom"))) +
tm_shape(buffers) + tm_polygons(alpha=0.2) +
tm_shape(methadoneSf) + tm_symbols(fill = "black",size = 0.3) +
tm_layout(main.title = "Walkable Methadone Service Areas in Week 39",
main.title.position = "center",
main.title.size = 1,
          frame.alpha = 0.5)
Next, we’ll write each map to its own object. Then we can use tmap_arrange to generate a map panel of both maps. We keep standard formatting so the legends remain outside of the map. In future tutorials, we can customize things further. For now, it will be enough to get an idea of trends!
wk19 <- tm_shape(zipsMerged) +
tm_polygons(fill = "CaseRtWk19",
fill.scale = tm_scale_intervals(
breaks = c(1000,2000,3000,4000,5000),
values ="BuPu"),
fill.legend = tm_legend(
title = "COVID Case rate",
title.size = 0.7,
text.size = 0.5,
show = TRUE,
frame = FALSE,
position = c("left","bottom"))) +
tm_shape(buffers) + tm_polygons(alpha=0.2) +
tm_shape(methadoneSf) + tm_symbols(fill = "black",size = 0.3) +
tm_layout(main.title = "Walkable Methadone Service Areas in Week 19",
main.title.position = "center",
main.title.size = 1,
frame.alpha = 0.5)
wk39 <- tm_shape(zipsMerged) +
tm_polygons(fill = "CaseRtWk39",
fill.scale = tm_scale_intervals(
breaks = c(1000,2000,3000,4000,5000),
values ="BuPu"),
fill.legend = tm_legend(
title = "COVID Case rate",
title.size = 0.7,
text.size = 0.5,
show = TRUE,
frame = FALSE,
position = c("left","bottom"))) +
tm_shape(buffers) + tm_polygons(alpha=0.2) +
tm_shape(methadoneSf) + tm_symbols(fill = "black",size = 0.3) +
tm_layout(main.title = "Walkable Methadone Service Areas in Week 39",
main.title.position = "center",
main.title.size = 1,
frame.alpha = 0.5)
tmap_arrange(wk19, wk39)
4.6.3 Save Data
We save our newly merged ZCTA level data for future analysis.
#write_sf(zipsMerged, "data/chizips_COVID.gpkg")
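If you uncomment the write_sf() line above, the saved GeoPackage can be read back in a later session with st_read; a sketch assuming the same path:
#chizips <- st_read("data/chizips_COVID.gpkg")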