Data Model
Variables
Tip
For a list of all variables currently in the registry, see variables.csv on Github.
A single file, variable.json, serves as a central lookup for all variables, each one being defined as a Frictionless Field Descriptor, with some extra properties that we have added for our own needs. The key for each item in the lookup must be the same as its name property.
Fields are defined by a JSON object that adheres to the field descriptors portion of the table schema standard, though not all possible attributes are required or implemented.
The following additional attributes are also supported and in some cases required:
| Property | Format | Description | Comment |
|---|---|---|---|
name |
Primary identifier | Same as key for this item, must be CamelCase and <= 10 characters long. | |
title |
Human-readable title | Appears in data dictionaries and OEPS Explorer interfaces | |
type |
Datatype of this variable | See Frictionless Data field descriptor types | |
example |
Varies | An example value for the field | |
description |
A description for the field | ||
longitudinal |
N/A | N/A | |
analysis |
N/A | N/A | |
table_sources |
List of all table sources this variable appears in | Table source names are in this list |
Example variable entry
"TotPop": {
"title": "Total Population",
"name": "TotPop",
"type": "integer",
"example": "1632480",
"description": "Total population",
"metadata": "Demographic_Characteristics",
"longitudinal": true,
"analysis": false,
"table_sources": [
"county-1980",
"county-1990",
"county-2000",
"county-2010",
"county-2018",
"county-2020",
"county-2023",
"state-1980",
"state-1990",
"state-2010",
"state-2000",
"state-2018",
"tract-1980",
"tract-1990",
"tract-2010",
"tract-2000",
"tract-2020",
"tract-2018",
"tract-2023",
"zcta-2018",
"zcta-2020",
"zcta-2023"
]
},
}
Table Sources
Tip
For a list of all table sources currently in the registry, see table_sources.csv on Github.
Table sources are a JSON representations of each CSV dataset within OEPS--one JSON file per CSV. The structure is based on the Tabular Data Resource from Frictionless Data. However, where a schema property would typically define a primary key, foreign key (for joins), and a list of all fields, all of this information is inferred or standardized elsewhere and need not be stored in these files.
Characteristics of table source CSVs:
- Only have data for one geography level (state, county, tract, or zcta)
- Only have data for one year
- Named with the format
{geography}-{year}, for example,county-2020 - Has a
HEROP_IDcolumn as primary key that joins each row to a geography unit. - Has column names that match (exactly) with variable names already defined in the registry.
Each table source is defined by the following attributes:
| Property | Description | Comment |
|---|---|---|
name |
ID of table source | Will always be in {geography}-{year} format |
title |
Human-readable title | Currently not used anywhere, and set to match name |
description |
Short description | Will always be "This CSV aggregates all OEPS data values from {year} at the {geography} level. |
path |
Path to CSV | Relative to data directory, this path will always be tables/{name}.csv, i.e. tables/{geography}-{year}.csv |
format |
Will always be csv |
|
mediatype |
Will always be text/csv |
|
data_year |
Year of the data in this CSV | |
geodata_source |
Name of geodata source this CSV will join to | Geodata source must already exist in the registry. Importantly, the year of the CSV data may not match the geodata source, as 2015 data should be joined to 2010 geographies (for example). |
Future simplification
Much of the content stored in the attributes described above can be inferred from other information, or is always the same across all table sources, so it's possible that some of these will be removed in the future.
Example table_source
{
"bq_dataset_name": "tabular",
"bq_table_name": "county-2020",
"name": "county-2020",
"path": "tables/county-2020.csv",
"format": "csv",
"mediatype": "text/csv",
"title": "county-2020",
"description": "This CSV aggregates all OEPS data values from 2020 at the county level.",
"year": "2020",
"geodata_source": "counties-2018"
}
Metadata
Tip
For a list of all metadata documents currently in the registry (and links to the markdown files), see metadata.csv on Github.
The registry contains JSON references to metadata documents, which are ultimately stored as Markdown files on Github. Each reference defines a theme, construct, and proxy to be attached to the document, as well as the URL to the file.
Within OEPS, themes, constructs, and proxies create a hierarchical conceptual framework through which each variable can be interpreted. Metadata documents are created one-per-proxy, a proxy being a grouping of variables that (typically) have been created or extracted from the same source.
An example metadata entry looks like this:
{
"id": "Access_MOUDs",
"theme": "Environment",
"construct": "Spatial Access to MOUDs",
"proxy": "Spatial access metrics... (this is a description of the variables themselves)",
"url": "https://github.com/healthyregions/oeps/blob/main/metadata/Access_MOUDs.md",
"source": "SAMSHA 2019, 2021; Vivitrol 2020; OSRM 2020;",
"source_long": "U.S. Substance Abuse and Mental Health Services Administration, Treatment Locator Tool, 2019; Vivitrol, 2020; Open Source Routing Machine, 2020"
}
Each variable entry must reference a metadata entry, making a linkage to the document that describes its provenance. Typically, multiple variables are described in the same metadata document, so their metadata value will all be the same.
Geodata Sources
Tip
For a list of all geodata sources currently in the registry, see geodata_sources.csv on Github.
Geodata sources define the base geospatial data that can be joined to CSV data. There are 4 different geography levels (referred to as "summary levels" in the code baes): States, Counties, Tracts, and Zip Code Tabulation Areas (ZCTAs). Because we have many different years of data in the CSVs, we also need to include different years of spatial data, as boundaries and geographic unit ids change over time.
Characteristics of geodata source shapefiles:
- Has a
HEROP_IDfield which will be used by all CSV files for joins. - Stored as zip file in AWS S3, not locally in the repository.
Each geodata source is defined by the following attributes:
| Property | Description | Comment |
|---|---|---|
name |
ID of geodata source | Will always be in {geography (plural)}-{year} format. |
title |
Human-readable title | |
description |
Short description | Will always be "Shapefile of {geography} boundaries from the US Census Bureau, {year}". |
path |
Path to zipped shapefile | This is a full URL to the zipped file in AWS S3. |
format |
Will always be shp |
|
mediatype |
Will always be application/vnd.shp |
|
summary_level |
Type of geography | Will be one of: state, county, tract, or zcta. |
bq_dataset_name |
Target dataset during BigQuery upload | To be deprecated. |
bq_table_name |
Target table during BigQuery upload | To be deprecated. |
schema |
A nested JSON schema object | The schema defines primaryKey (always "HEROP_ID"), and fields, which will include, at least an entry for HEROP_ID. |
Note
Geodata sources are very similar to table sources (based on Frictionless Data Resources) but instead of referencing CSV files they reference zipped ESRI Shapfiles, and they do include the schema property.
Example geodata_source
{
"bq_dataset_name": "spatial",
"bq_table_name": "counties2010",
"name": "counties-2010",
"title": "County Boundaries, 2010",
"description": "Shapefile of county boundaries from the US Census Bureau, 2010.",
"path": "https://herop-geodata.s3.us-east-2.amazonaws.com/census/county-2010-500k-shp.zip",
"format": "shp",
"mediatype": "application/vnd.shp",
"summary_level": "county",
"schema": {
"primaryKey": "HEROP_ID",
"fields": [
{
"name": "HEROP_ID",
"title": "HEROP_ID",
"type": "string"
},
{
"name": "name",
"title": "Name",
"type": "string"
}
]
}
}