Data Model

Variables

Tip

For a list of all variables currently in the registry, see variables.csv on Github.

A single file, variable.json, serves as a central lookup for all variables, each one being defined as a Frictionless Field Descriptor, with some extra properties that we have added for our own needs. The key for each item in the lookup must be the same as its name property.

Fields are defined by a JSON object that adheres to the field descriptors portion of the table schema standard, though not all possible attributes are required or implemented.

The following additional attributes are also supported and in some cases required:

Property Format Description Comment
name Primary identifier Same as key for this item, must be CamelCase and <= 10 characters long.
title Human-readable title Appears in data dictionaries and OEPS Explorer interfaces
type Datatype of this variable See Frictionless Data field descriptor types
example Varies An example value for the field
description A description for the field
longitudinal N/A N/A
analysis N/A N/A
table_sources List of all table sources this variable appears in Table source names are in this list

Example variable entry

"TotPop": {
    "title": "Total Population",
    "name": "TotPop",
    "type": "integer",
    "example": "1632480",
    "description": "Total population",
    "metadata": "Demographic_Characteristics",
    "longitudinal": true,
    "analysis": false,
    "table_sources": [
      "county-1980",
      "county-1990",
      "county-2000",
      "county-2010",
      "county-2018",
      "county-2020",
      "county-2023",
      "state-1980",
      "state-1990",
      "state-2010",
      "state-2000",
      "state-2018",
      "tract-1980",
      "tract-1990",
      "tract-2010",
      "tract-2000",
      "tract-2020",
      "tract-2018",
      "tract-2023",
      "zcta-2018",
      "zcta-2020",
      "zcta-2023"
    ]
  },
}

Table Sources

Tip

For a list of all table sources currently in the registry, see table_sources.csv on Github.

Table sources are a JSON representations of each CSV dataset within OEPS--one JSON file per CSV. The structure is based on the Tabular Data Resource from Frictionless Data. However, where a schema property would typically define a primary key, foreign key (for joins), and a list of all fields, all of this information is inferred or standardized elsewhere and need not be stored in these files.

Characteristics of table source CSVs:

  • Only have data for one geography level (state, county, tract, or zcta)
  • Only have data for one year
  • Named with the format {geography}-{year}, for example, county-2020
  • Has a HEROP_ID column as primary key that joins each row to a geography unit.
  • Has column names that match (exactly) with variable names already defined in the registry.

Each table source is defined by the following attributes:

Property Description Comment
name ID of table source Will always be in {geography}-{year} format
title Human-readable title Currently not used anywhere, and set to match name
description Short description Will always be "This CSV aggregates all OEPS data values from {year} at the {geography} level.
path Path to CSV Relative to data directory, this path will always be tables/{name}.csv, i.e. tables/{geography}-{year}.csv
format Will always be csv
mediatype Will always be text/csv
data_year Year of the data in this CSV
geodata_source Name of geodata source this CSV will join to Geodata source must already exist in the registry. Importantly, the year of the CSV data may not match the geodata source, as 2015 data should be joined to 2010 geographies (for example).

Future simplification

Much of the content stored in the attributes described above can be inferred from other information, or is always the same across all table sources, so it's possible that some of these will be removed in the future.

Example table_source

{
  "bq_dataset_name": "tabular",
  "bq_table_name": "county-2020",
  "name": "county-2020",
  "path": "tables/county-2020.csv",
  "format": "csv",
  "mediatype": "text/csv",
  "title": "county-2020",
  "description": "This CSV aggregates all OEPS data values from 2020 at the county level.",
  "year": "2020",
  "geodata_source": "counties-2018"
}

Metadata

Tip

For a list of all metadata documents currently in the registry (and links to the markdown files), see metadata.csv on Github.

The registry contains JSON references to metadata documents, which are ultimately stored as Markdown files on Github. Each reference defines a theme, construct, and proxy to be attached to the document, as well as the URL to the file.

Within OEPS, themes, constructs, and proxies create a hierarchical conceptual framework through which each variable can be interpreted. Metadata documents are created one-per-proxy, a proxy being a grouping of variables that (typically) have been created or extracted from the same source.

An example metadata entry looks like this:

{
  "id": "Access_MOUDs",
  "theme": "Environment",
  "construct": "Spatial Access to MOUDs",
  "proxy": "Spatial access metrics... (this is a description of the variables themselves)",
  "url": "https://github.com/healthyregions/oeps/blob/main/metadata/Access_MOUDs.md",
  "source": "SAMSHA 2019, 2021; Vivitrol 2020; OSRM 2020;",
  "source_long": "U.S. Substance Abuse and Mental Health Services Administration, Treatment Locator Tool, 2019; Vivitrol, 2020; Open Source Routing Machine, 2020"
}

Each variable entry must reference a metadata entry, making a linkage to the document that describes its provenance. Typically, multiple variables are described in the same metadata document, so their metadata value will all be the same.

Geodata Sources

Tip

For a list of all geodata sources currently in the registry, see geodata_sources.csv on Github.

Geodata sources define the base geospatial data that can be joined to CSV data. There are 4 different geography levels (referred to as "summary levels" in the code baes): States, Counties, Tracts, and Zip Code Tabulation Areas (ZCTAs). Because we have many different years of data in the CSVs, we also need to include different years of spatial data, as boundaries and geographic unit ids change over time.

Characteristics of geodata source shapefiles:

  • Has a HEROP_ID field which will be used by all CSV files for joins.
  • Stored as zip file in AWS S3, not locally in the repository.

Each geodata source is defined by the following attributes:

Property Description Comment
name ID of geodata source Will always be in {geography (plural)}-{year} format.
title Human-readable title
description Short description Will always be "Shapefile of {geography} boundaries from the US Census Bureau, {year}".
path Path to zipped shapefile This is a full URL to the zipped file in AWS S3.
format Will always be shp
mediatype Will always be application/vnd.shp
summary_level Type of geography Will be one of: state, county, tract, or zcta.
bq_dataset_name Target dataset during BigQuery upload To be deprecated.
bq_table_name Target table during BigQuery upload To be deprecated.
schema A nested JSON schema object The schema defines primaryKey (always "HEROP_ID"), and fields, which will include, at least an entry for HEROP_ID.

Note

Geodata sources are very similar to table sources (based on Frictionless Data Resources) but instead of referencing CSV files they reference zipped ESRI Shapfiles, and they do include the schema property.

Example geodata_source

{
    "bq_dataset_name": "spatial",
    "bq_table_name": "counties2010",
    "name": "counties-2010",
    "title": "County Boundaries, 2010",
    "description": "Shapefile of county boundaries from the US Census Bureau, 2010.",
    "path": "https://herop-geodata.s3.us-east-2.amazonaws.com/census/county-2010-500k-shp.zip",
    "format": "shp",
    "mediatype": "application/vnd.shp",
    "summary_level": "county",
    "schema": {
        "primaryKey": "HEROP_ID",
        "fields": [
            {
                "name": "HEROP_ID",
                "title": "HEROP_ID",
                "type": "string"
            },
            {
                "name": "name",
                "title": "Name",
                "type": "string"
            }
        ]
    }
}