Adding Data to OEPS

Because the OEPS backend must export the same data to many different formats, we have a registry that configures and manages all of the data within the system. The data itself is stored across many different CSV files, while a set of JSON files hold descriptive information about these CSV files and how they relate. Finally, a set of a "geodata sources" provide base geometries to which any of the data CSVs can be joined.

With these pieces in mind, adding new data to OEPS is a multi-step process.

Adding Data to OEPS

Data preparation

Incoming data:

MUST be stored CSV format
MUST contain values for only one year and only one geography level
MUST have proper column names (see below)
MUST include an appropriate join column (see below)
MAY have data for as many different variables as needed

Naming columns

If you are adding values for a variable that already exists in the registry, say, "Total Population", then your column must have the exact name that we already use for that variable, TotPop.

If you are adding values for a new variable, then your columns must follow our existing naming conventions, and you'll need to create a new entry for that variable as described below.

Configuring a join column

A join column serves as the linkage between the non-spatial CSV data and geographic data like county boundaries. To facilitate this linkage, any incoming CSV must have one of these columns:

Name	Geography	Description
`GEOID`	all	For state, county, and tract data the `GEOID` is equivalent to the `FIPS` id. For ZCTA data, `GEOID` can match the zip code or `ZCTA5` for each ZCTA.
`FIPS`	state, county, tract	`FIPS` ids are nested such that, for example, the 5-digit `FIPS` for a county includes the 2-digit `FIPS` for its state.
`ZCTA5`	zcta	5-digit zip code that corresponds with the zip code tabulation area.
`HEROP_ID`	all	A `HEROP_ID` is our version of `GEOID`, which also includes a "summary-level" code on the front of it, indicating what geography level the id refers to.

Registry preparation

After the data has been prepared, the registry must also be prepared. The best way to do this is by asking these questions:

Is there a geodata target for this data already in the registry?

If the new data must be joined to 2020 county boundaries, for example, is there already a geodata-source for that geography and vintage?

If yes, great! Proceed to next question.
If not, you need to create a new geodata source before continuing.

Is there an existing table source that matches the year and geography of the incoming data?

If the new CSV contains county-level data for 2021, then is there already a county-2021 table source in the registry?

If yes, great! Proceed to next question.
If not, you need to create a new table-source before continuing.

Are all columns in the incoming data already defined in the OEPS registry?

You can check this with the following command:

flask inspect-csv -s path/to/MyNewCountyData2021.csv

Note

You can run this command on a directory as well, and every CSV within it will be inspected.

If yes, great! You are ready to merge data into OEPS.
If no, you need to create a new variable entry for every new incoming variable.

Workflow

Listed here in logical order, though often only the later steps will be needed.

Create a new geodata source
Create a new table source
Create a new variable entry
Create a new theme or construct
Merge data into OEPS

Create a new geodata source

A geodata source is a JSON file that points to one of the zipped shapefiles we have in our geodata.healthyregions.org S3 storage system. If you need a new geography level and/or vintage that is not already in S3, see github.com/healthyregions/geodata to learn how to add it.

To create a new geodata source, add a new JSON file named <summary level>-<year>.json. Using 2010 county boundaries as an example, name the file counties-2010.json and its content will look like this:

{
    "bq_dataset_name": "spatial",
    "bq_table_name": "counties2010",
    "name": "counties-2010",
    "title": "County Boundaries, 2010",
    "description": "Shapefile of county boundaries from the US Census Bureau, 2010.",
    "path": "https://herop-geodata.s3.us-east-2.amazonaws.com/census/county-2010-500k-shp.zip",
    "format": "shp",
    "mediatype": "application/vnd.shp",
    "summary_level": "county",
    "schema": {
        "primaryKey": "HEROP_ID",
        "fields": [
            {
                "name": "HEROP_ID",
                "title": "HEROP_ID",
                "type": "string"
            }
        ]
    }
}

Note

Currently, we use plurals in geodata-source names, "counties", instead of "county" which we use elsewhere. In the future, geodata source definitions will not use plural.

Note

The bq* attributes are used for Google BigQuery, and are likely to be removed in the future.

Tip

See geodata sources for a full explanation of geodata sources.

Create a new table source

The following command can be used to create a new table source:

flask create-table-source -n <name> -g <geodata-source>

For example, to add a new table source to hold county-level data for the year 2021 (which will join to 2020 geometries) use this command:

flask create-table-source -n county-2021 -g counties-2020

This command will do two things:

Using the specified geodata-source, create a "dummy" CSV in data/tables named county-2021.csv that only contains join columns--essentially a placeholder file into which new data can be merged
Create a new JSON file for this table-source, registry/table-sources/county-2021.json that adds the new CSV to the registry.

Tip

See table sources for a full explanation of table sources.

Getting started with PagesCMS

Some elements of the registry can be directly edited through an online interface called PagesCMS. Here's how you can get started with this.

Go to app.pagescms.org/healthyregions/oeps/main
Sign in with GitHub if needed
Once logged in, you'll see oeps and main in the left-hand menu
- Click this button and go to > "Manage branches"
Create a new branch with a descriptive name, add_my_new_variable
- You'll now see oeps and add_my_new_variable in the left-hand menu
Perform the edits you need to do
- All work will be committed directly to your new branch.
When you are ready, go to the actual GitHub repo, https://github.com/healthyregions/oeps/pulls, and create a new pull request into main from your branch.

Create a new variable entry

In registry/variables there are individual files for each variable with data in OEPS. Instead of editing these files directly, we can edit them through PagesCMS.

See getting started with PagesCMS to continue.

Some data entry notes:

name must follow these rules:
- Follow abbreviation patterns we already use
- Follow PascalCase
  - For example, use TotPop, not tot_pop or totPop or tot-pop
type must be one of:
- number for any decimal number values
- integer for any integer values
- string for text values, for example a coded entry
- boolean for true/false or yes/no values
- date for date values
- see Frictionless field data types for more info
table_sources leave this blank, it should not be edited in PagesCMS.

Tip

See variables for a full explanation of variables.

Create a new metadata entry

In registry/metadata there are individual files for each metadata entry. Instead of editing these files directly, we can edit them through PagesCMS.

Note

The actual metadata markdown files themselves are stored at the top of the main repo, in /metadata. In the future, we could further integrate them with the registry so that these files are also edited directly in PagesCMS.

To add a new metadata entry, you should begin the process in GitHub by creating a new branch and added a new markdown file to in this directory: https://github.com/healthyregions/oeps/tree/main/metadata. Create your file based on https://github.com/healthyregions/oeps/tree/main/metadata/metadata-template.md.

With your markdown file complete, you need to create an entry for it in PagesCMS. See getting started with PagesCMS, and be sure to use the branch you have already created (no need to make a new one).

Once the new metadata entry has been created, you can also attach new variables to it if needed.

After a pull request has been merged into the main branch, this command will need to be run to update the docs page in the OEPS Explorer:

flask build-explorer-docs

Tip

See metadata for a full explanation of metadata entries.

Merge data into OEPS

Once both the incoming data and the registry have been prepared, you can merge in an external CSV with this command, using a new set of county-level 2021 data as an example.

flask merge-data-table -s path/to/MyNewCountyData2021.csv -t county-2021 --dry-run

It is wise to use the --dry-run flag to test the merge before actually altering any data.

What this command will do:

Load the incoming CSV data, and check for a join field (GEOID, FIPS, ZCTA5, or HEROP_ID)
- If one of these is present, it will be used to generate a HEROP_ID which is the actual join field that will be used during the process
Load the specified table source CSV in the registry and merge these two data frames
Write out the merged data frame into the existing table source
Update the table_sources attribute in varibles.json for variable in the incoming CSV

Once you have completed this operation for one or more CSVs, you will need to perform some or all of these actions to make sure the derived output from OEPS is up-to-date: