Adding Data to OEPS
Because the OEPS backend must export the same data to many different formats, we have a registry that configures and manages all of the data within the system. The data itself is stored across many different CSV files, while a set of JSON files hold descriptive information about these CSV files and how they relate. Finally, a set of a "geodata sources" provide base geometries to which any of the data CSVs can be joined.
With these pieces in mind, adding new data to OEPS is a multi-step process.
Data preparation
Incoming data:
- MUST be stored CSV format
- MUST contain values for only one year and only one geography level
- MUST have proper column names (see below)
- MUST include an appropriate join column (see below)
- MAY have data for as many different variables as needed
Naming columns
If you are adding values for a variable that already exists in the registry, say, "Total Population", then your column must have the exact name that we already use for that variable, TotPop
.
If you are adding values for a new variable, then your columns must follow our existing naming conventions, and you'll need to create a new entry for that variable as described below.
Configuring a join column
A join column serves as the linkage between the non-spatial CSV data and geographic data like county boundaries. To facilitate this linkage, any incoming CSV must have one of these columns:
Name | Geography | Description |
---|---|---|
GEOID |
all | For state, county, and tract data the GEOID is equivalent to the FIPS id. For ZCTA data, GEOID can match the zip code or ZCTA5 for each ZCTA. |
FIPS |
state, county, tract | FIPS ids are nested such that, for example, the 5-digit FIPS for a county includes the 2-digit FIPS for its state. |
ZCTA5 |
zcta | 5-digit zip code that corresponds with the zip code tabulation area. |
HEROP_ID |
all | A HEROP_ID is our version of GEOID , which also includes a "summary-level" code on the front of it, indicating what geography level the id refers to. |
Registry preparation
After the data has been prepared, the registry must also be prepared. The best way to do this is by asking these questions:
Is there a geodata target for this data already in the registry?
If the new data must be joined to 2020 county boundaries, for example, is there already a geodata-source
for that geography and vintage?
- If yes, great! Proceed to next question.
- If not, you need to create a new geodata source before continuing.
Is there an existing table source that matches the year and geography of the incoming data?
If the new CSV contains county-level data for 2021, then is there already a county-2021
table source in the registry?
- If yes, great! Proceed to next question.
- If not, you need to create a new table-source before continuing.
Are all columns in the incoming data already defined in the OEPS registry?
You can check this with the following command:
flask inspect-csv -s path/to/MyNewCountyData2021.csv
Note
You can run this command on a directory as well, and every CSV within it will be inspected.
- If yes, great! You are ready to merge data into OEPS.
- If no, you need to create a new variable entry for every new incoming variable.
Workflow
Listed here in logical order, though often only the later steps will be needed.
- Create a new geodata source
- Create a new table source
- Create a new variable entry
- Create a new theme or construct
- Merge data into OEPS
Create a new geodata source
A geodata source is a JSON file that points to one of the zipped shapefiles we have in our geodata.healthyregions.org S3 storage system. If you need a new geography level and/or vintage that is not already in S3, see github.com/healthyregions/geodata to learn how to add it.
To create a new geodata source, add a new JSON file named <summary level>-<year>.json
. Using 2010 county boundaries as an example, name the file counties-2010.json
and its content will look like this:
{
"bq_dataset_name": "spatial",
"bq_table_name": "counties2010",
"name": "counties-2010",
"title": "County Boundaries, 2010",
"description": "Shapefile of county boundaries from the US Census Bureau, 2010.",
"path": "https://herop-geodata.s3.us-east-2.amazonaws.com/census/county-2010-500k-shp.zip",
"format": "shp",
"mediatype": "application/vnd.shp",
"summary_level": "county",
"schema": {
"primaryKey": "HEROP_ID",
"fields": [
{
"name": "HEROP_ID",
"title": "HEROP_ID",
"type": "string"
}
]
}
}
Note
Currently, we use plurals in geodata-source names, "counties", instead of "county" which we use elsewhere. In the future, geodata source definitions will not use plural.
Note
The bq* attributes are used for Google BigQuery, and are likely to be removed in the future.
Tip
See geodata sources for a full explanation of geodata sources.
Create a new table source
The following command can be used to create a new table source:
flask create-table-source -n <name> -g <geodata-source>
For example, to add a new table source to hold county-level data for the year 2021 (which will join to 2020 geometries) use this command:
flask create-table-source -n county-2021 -g counties-2020
This command will do two things:
- Using the specified geodata-source, create a "dummy" CSV in
data/tables
namedcounty-2021.csv
that only contains join columns--essentially a placeholder file into which new data can be merged - Create a new JSON file for this table-source,
registry/table-sources/county-2021.json
that adds the new CSV to the registry.
Tip
See table sources for a full explanation of table sources.
Create a new variable entry
In registry/variables.json
there is a single entry for each variable that has values for any geography level or any year within OEPS. Currently, adding new variables is a manual process, just directly edit the content of that file. A full variable definition looks like this:
"TotPop": {
"title": "Total Population",
"name": "TotPop",
"type": "integer",
"example": "1632480",
"description": "Total population",
"constraints": "1980-2000 historic data was acquired from NHGIS and then interpolated to modern county boundaries through a population weighted interpolation using the tidycensus `interpolate_pw` function. For 1980, the underlying population weighting was county subdivisions, while for 1990 and 200 the underlying population weighting was tracts.",
"construct": "Population",
"source": "ACS 2018, 5-Year; Census 2010; IPUMS NHGIS",
"source_long": "American Community Survey 2014-2018 5 Year Estimate; 2010 Decennial Census; Integrated Public Use Microdata Series National Historic Geographic Information System",
"comments": "",
"metadata_doc_url": "https://github.com/GeoDaCenter/opioid-policy-scan/blob/main/data_final/metadata/Age_2018.md",
"table_sources": []
}
Some notes on a few of these properties:
- The top-level key and
name
for this variable must be identical, and must follow these rules:- Follow abbreviation patterns we already use
- Follow PascalCase: Use
TotPop
, don't usetot_pop
ortotPop
ortot-pop
type
must be one of:number
for any decimal number valuesinteger
for any integer valuesstring
for text values, for example a coded entryboolean
for true/false or yes/no valuesdate
for date values- see Frictionless field data types for more info
construct
must match the name of a construct in thethemes.json
file. If you are adding a variable that part of a new construct, then you must also create a new construct.table_sources
must start as an empty list, and will fill up with the names of table sources that hold values for this variable, as more and more data is merged into the system.
Tip
See variables for a full explanation of variables.
Create a new theme or construct
OEPS is built on a conceptual framework that places different measures (or, "variables") into different themes and constructs. This grouping is defined in the registry/themes.json
file, which lists "themes" at the top-level and then nests one or more constructs within each theme.
To add a new construct or theme, just edit that file directly. Once you are finished, you'll need to regenerate the docs content for the OEPS explorer.
flask build-explorer-docs
Tip
See themes & constructs for a full explanation of themes and constructs.
Merge data into OEPS
Once both the incoming data and the registry have been prepared, you can merge in an external CSV with this command, using a new set of county-level 2021 data as an example.
flask merge-data-table -s path/to/MyNewCountyData2021.csv -t county-2021 --dry-run
It is wise to use the --dry-run
flag to test the merge before actually altering any data.
What this command will do:
- Load the incoming CSV data, and check for a join field (
GEOID
,FIPS
,ZCTA5
, orHEROP_ID
)- If one of these is present, it will be used to generate a
HEROP_ID
which is the actual join field that will be used during the process
- If one of these is present, it will be used to generate a
- Load the specified table source CSV in the registry and merge these two data frames
- Write out the merged data frame into the existing table source
- Update the
table_sources
attribute invaribles.json
for variable in the incoming CSV
Once you have completed this operation for one or more CSVs, you will need to perform some or all of these actions to make sure the derived output from OEPS is up-to-date: