Constructing and Validating a Comprehensive XYZ Index Across U.S. Census Tracts
Authors: Author One, Author Two, Author Three
Abstract
The abstract to be filled in.
Keywords: Pedestrian Environment Index, walkability, bikeability, road safety, social greenspace, transit accessibility, OpenStreetMap, built environment, neighborhood exposome
Introduction
Katherine will fill when finished!
Related Work
To be filled in.
Methods
This section describes the construction and validation of neighborhood-level built environment characteristics at the census tract level across the contiguous United States. All characteristics were derived from publicly available geospatial data sources, primarily OpenStreetMap (OSM), the U.S. Census Bureau, and the Missouri Census Data Center. The methodology comprises four core Pedestrian Environment Index (PEI) sub-components adapted from Peiravian et al. and four novel supplementary indices: the Social Greenspace Index (SGI), the Bike Infrastructure Index (BII), the Road Safety Index (RSI), and the Public Transport Accessibility Level (PTAL). All indices are computed at the census tract level and normalized to enable nationwide comparison.
Data Sources
OpenStreetMap
The primary geospatial data source is OpenStreetMap (OSM), queried through the Overpass API. OSM provides crowd-sourced geographic data including road networks, points of interest, land-use polygons, and cycling infrastructure. To enable longitudinal analysis, we obtained OSM data snapshots for three time points: 2013, 2017, and 2022.
U.S. Census Bureau
Census tract boundary shapefiles (TIGER/Line) were obtained from the U.S. Census Bureau for the corresponding years. These shapefiles provide the geographic units of analysis and the land area denominators used in density calculations.
Missouri Census Data Center
Tract-level population counts were obtained from the Missouri Census Data Center (MCDC), which aggregates decennial census and American Community Survey estimates by tract and block group. Direct download of population tables from MCDC was preferred over the Census API due to rate-limiting and data-vintage inconsistencies encountered when scaling API queries to all U.S. tracts.
General Transit Feed Specification
Public transit schedule data were obtained from publicly available General Transit Feed Specification (GTFS) feeds aggregated through the Mobility Database. GTFS data provide stop locations and service frequency for bus and rail networks.
Core PEI Sub-components
The Pedestrian Environment Index (PEI) is a composite walkability measure comprising four sub-components, each capturing a distinct dimension of the pedestrian environment. We adapted the methodology of Peiravian et al., which was originally applied to the city of Chicago, and extended it to all census tracts in the contiguous United States.
Intersection Density Index (IDI)
Intersection density quantifies street network connectivity. Using OSM highway data queried via the Overpass API, we identified all road network nodes that participate in more than one way (i.e., true intersections, excluding dead ends). Intersection density for each tract is computed as:
\[
\text{IDI} = \frac{N_{\text{intersections}}}{A_{\text{tract}}}
\]
where \(N_{\text{intersections}}\) is the count of intersection nodes within the tract and \(A_{\text{tract}}\) is the tract land area in square miles.
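The intersection rule can be illustrated with a minimal Python sketch. The `ways` mapping, the node IDs, and both function names are hypothetical stand-ins for parsed Overpass output, and the sketch assumes the ways have already been clipped to the tract; it is an illustration of the counting rule, not the project's actual implementation.

```python
from collections import Counter


def intersection_nodes(ways):
    """Return the node IDs that participate in more than one way."""
    counts = Counter()
    for nodes in ways.values():
        # Count each node once per way so a closed loop does not
        # register as an intersection with itself.
        counts.update(set(nodes))
    return {node for node, n_ways in counts.items() if n_ways > 1}


def intersection_density(ways, tract_area_sq_mi):
    """Intersections per square mile of tract land area."""
    if tract_area_sq_mi <= 0:
        raise ValueError("tract land area must be positive")
    return len(intersection_nodes(ways)) / tract_area_sq_mi


# Toy network (hypothetical IDs): way a meets way b at node 2,
# and way b meets way c at node 4. Nodes 1, 3, 5, 6 belong to one way only.
ways = {
    "a": [1, 2, 3],
    "b": [2, 4],
    "c": [4, 5, 6],
}

print(intersection_nodes(ways))
print(intersection_density(ways, tract_area_sq_mi=0.5))
```

Dead ends and mid-way shape nodes drop out because they belong to a single way.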
Land-use Diversity Index (LDI)
Land-use diversity captures the mix of functional land uses within a tract, which is associated with the availability of destinations within walking distance. Land-use data were retrieved from OSM using the landuse tag via the Overpass API. For each tract, we computed the Shannon entropy of the area distribution across land-use categories:
\[
H = -\sum_{j=1}^{k} p_j \ln p_j
\]
where \(p_j\) is the proportion of total tract area occupied by land-use type \(j\) and \(k\) is the number of land-use types with non-zero area. The raw entropy is then normalized by the theoretical maximum:
\[
\text{LDI} = \frac{H}{\ln k}
\]
This yields a value between 0 (homogeneous land use) and 1 (maximum diversity).
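A minimal sketch of the normalized entropy calculation follows. Two choices here are assumptions rather than details taken from the text: proportions are computed relative to the total mapped land-use area (so that they sum to 1), and a tract with a single land-use type is assigned 0 because \(\ln 1 = 0\) would otherwise make the ratio undefined. The function name and category labels are hypothetical.

```python
import math


def land_use_diversity(areas):
    """Normalized Shannon entropy of land-use areas for one tract.

    `areas` maps a land-use category to its area within the tract.
    """
    positive = [a for a in areas.values() if a > 0]
    k = len(positive)
    if k <= 1:
        # Assumption: a single land-use type (or none) is fully homogeneous.
        return 0.0
    total = sum(positive)
    entropy = -sum((a / total) * math.log(a / total) for a in positive)
    return entropy / math.log(k)


print(land_use_diversity({"residential": 9.0, "retail": 1.0}))
print(land_use_diversity({"residential": 1.0, "retail": 1.0,
                          "industrial": 1.0, "commercial": 1.0}))
```

An even split across categories reaches the maximum, while a tract dominated by one category scores well below it.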
Commercial Density Index (CDI)
Commercial density reflects the availability of destinations such as shops, services, and amenities. We queried OSM via the Overpass API for points of interest matching the following tags: shop=*; amenity values of restaurant, cafe, bank, school, and cinema; and leisure values of park, sports_centre, and stadium. Commercial density for each tract is:
\[
\text{CDI} = \frac{N_{\text{POI}}}{A_{\text{tract}}}
\]
where \(N_{\text{POI}}\) is the count of qualifying points of interest.
Population Density Index (PDI)
Population density reflects the residential intensity of a tract, which is a precondition for pedestrian activity. Total population was obtained from the Missouri Census Data Center and land area from Census TIGER/Line shapefiles:
\[
\text{PDI} = \frac{P_{\text{tract}}}{A_{\text{tract}}}
\]
where \(P_{\text{tract}}\) is the total resident population of the tract.
Normalization and Composite PEI
Each raw sub-component was percentile-rank normalized across all census tracts nationwide and across all available years, producing values \(S_i \in [0, 1]\). This cross-sectional and cross-temporal normalization enables direct comparison of sub-component values between any two tracts regardless of location or time period.
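One way to realize a pooled percentile rank is sketched below. The exact rank convention is not stated in the text, so this sketch assumes average ranks for ties, rescaled so the smallest value maps to 0 and the largest to 1; the function name and the toy (tract, year) records are hypothetical.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(values):
    """Map raw values to [0, 1] by rank, giving tied values the same score."""
    n = len(values)
    if n == 0:
        return []
    if n == 1:
        return [0.0]  # Assumption: a lone value has no rank information.
    ordered = sorted(values)
    ranks = []
    for v in values:
        lower = bisect_left(ordered, v)    # number of values strictly below v
        upper = bisect_right(ordered, v)   # number of values at or below v
        average_rank = (lower + upper - 1) / 2   # 0-based, averaged over ties
        ranks.append(average_rank / (n - 1))
    return ranks


# Pool every (tract, year) record before ranking so that scores are
# comparable across both space and time.
records = {
    ("tract_A", 2013): 10.0,
    ("tract_A", 2022): 20.0,
    ("tract_B", 2013): 20.0,
    ("tract_B", 2022): 40.0,
}
normalized = dict(zip(records, percentile_rank(list(records.values()))))
print(normalized)
```

Ranking the pooled records, rather than each year separately, is what makes a 2013 score directly comparable with a 2022 score.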
The composite PEI is computed using a multiplicative aggregation:
\[
\text{PEI} = \frac{\prod_{i=1}^{n} (1 + S_i)}{2^{n}}
\]
where \(S_i\) is the normalized value of the \(i\)-th sub-component and \(n\) is the number of sub-components. The denominator \(2^n\) represents the theoretical maximum of the numerator (attained when all \(S_i = 1\)), ensuring that \(\text{PEI} \in [0, 1]\). Because every factor \(1 + S_i\) is at least 1, the attainable minimum is \(2^{-n}\) rather than 0. With the four core sub-components, \(n = 4\) and \(2^n = 16\).
The multiplicative form has the property that a tract must score reasonably well on all sub-components to achieve a high composite score. A tract that is high on one dimension but near zero on another is penalized more heavily than under an additive (averaging) scheme, reflecting the theoretical premise that walkability is jointly determined by street connectivity, land-use mix, destination availability, and residential intensity.
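The aggregation can be sketched in a few lines of Python, assuming the Peiravian et al. form in which the numerator is the product of \((1 + S_i)\), which is the form consistent with a maximum of \(2^n\). The function name is hypothetical.

```python
import math


def composite_pei(scores):
    """Multiplicative PEI from normalized sub-component scores in [0, 1]."""
    if not scores:
        raise ValueError("at least one sub-component score is required")
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("sub-component scores must lie in [0, 1]")
    return math.prod(1.0 + s for s in scores) / 2 ** len(scores)


balanced = composite_pei([0.5, 0.5, 0.5, 0.5])
lopsided = composite_pei([1.0, 1.0, 0.0, 0.0])
print(balanced, lopsided)
```

Both toy tracts have the same mean sub-component score (0.5), yet the lopsided tract receives the lower composite, which is the behavior the multiplicative form is meant to produce.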
Supplementary Indices
In addition to the four core PEI sub-components, we constructed four supplementary indices to capture dimensions of the neighborhood built environment not represented in the original PEI formulation.
Social Greenspace Index (SGI)
The Social Greenspace Index quantifies access to outdoor social gathering spaces, focusing on designed public spaces that facilitate pedestrian activity and social interaction rather than general vegetation or tree canopy cover. Using OSM polygon data queried via the Overpass API, we extracted features tagged as leisure=park, leisure=playground, leisure=dog_park, leisure=stadium, and leisure=common, as well as natural=grassland and landuse=meadow where they represent publicly accessible spaces. SGI is computed as:
\[
\text{SGI} = \frac{\sum_{g} A_g}{A_{\text{tract}}}
\]
where \(A_g\) is the area of each qualifying greenspace polygon intersected with the census tract boundary. This yields a normalized ratio between 0 and 1 for each tract.
Bike Infrastructure Index (BII)
The Bike Infrastructure Index scores each census tract based on the quantity and quality of cycling infrastructure. From OSM, we extracted all road and path features carrying bicycle-related tags, including cycleway, bicycle, bicycle_road, highway, ref, network, and cyclestreet. After filtering to ensure each road segment was counted only once and that non-bicycle roads were excluded, each segment was classified into one of four tiers reflecting the degree of separation from motor vehicle traffic.
Table: Bike Infrastructure Index tier weights
| Infrastructure Category | Weight |
|---|---|
| Separated/protected paths and cycle streets | 0.50 |
| Painted bike lanes | 0.25 |
| Shared roads with bicycle designation | 0.15 |
| Local residential roads | 0.10 |
For each segment, the infrastructure score is the product of its tier weight \(w_c\) and its length \(l_s\). The tract-level BII is:
\[
\text{BII} = \sum_{s} w_{c(s)} \, l_s
\]
where the sum is over all qualifying segments \(s\) within the tract and \(c(s)\) denotes the tier classification of segment \(s\).
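The weighted sum can be sketched as follows, using the tier weights from the table above. The tier keys and function name are hypothetical labels, the length unit is left unspecified as in the text, and the sketch assumes no further normalization (for example by tract area) is applied at this stage.

```python
# Tier weights from the table; the dictionary keys are hypothetical labels.
TIER_WEIGHTS = {
    "separated": 0.50,          # separated/protected paths and cycle streets
    "painted_lane": 0.25,       # painted bike lanes
    "shared": 0.15,             # shared roads with bicycle designation
    "local_residential": 0.10,  # local residential roads
}


def bike_infrastructure_index(segments):
    """Sum of tier weight times length over a tract's segments.

    `segments` is an iterable of (tier, length) pairs, each segment
    appearing once and already clipped to the tract.
    """
    return sum(TIER_WEIGHTS[tier] * length for tier, length in segments)


segments = [
    ("separated", 2.0),
    ("painted_lane", 4.0),
    ("local_residential", 10.0),
]
print(bike_infrastructure_index(segments))
```

In this toy tract, 2 units of protected path contribute as much as 10 units of local residential road.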
Road Safety Index (RSI)
The Road Safety Index quantifies pedestrian risk attributable to road traffic speed. We focused on roads with the OSM highway tags most relevant to pedestrian environments: residential, unclassified, tertiary, secondary, and primary. We assigned default speed limits by road classification using standardized federal speed limit guidelines mapped to OSM highway tags.
Each road segment receives a risk score derived from Nilsson's Power Model, which translates changes in mean traffic speed into expected changes in crash severity:
\[
r_s = \left( \frac{v_s}{v_0} \right)^{\beta}
\]
where \(v_s\) is the speed limit of segment \(s\), \(v_0 = 20\) mph is a pedestrian-safe baseline speed, and \(\beta = 1.5\) corresponds to a power-law relationship calibrated for fatal and serious injury crashes. Segment-level risk is then weighted by segment length and aggregated to the tract, so that higher raw RSI values indicate greater pedestrian risk.
The raw RSI is then normalized and inverted, so that higher values indicate lower pedestrian risk and therefore more walkable areas; this orientation matches that of the other indices. Lastly, each census tract's RSI is multiplied by a population density scaling factor so that the index accounts for pedestrian density in the area.
As a sensitivity analysis, we compared the standardized federal speed assignment approach against an alternative in which default speed limits were assigned based on state-level statutory speed limits by road type. Preliminary comparisons indicated that the two approaches produce similar relative distributions of tract-level RSI scores. We also evaluated a variant that expanded the RSI beyond pedestrian-oriented roads to include highways (OSM tags motorway, trunk), given that the presence of a highway within a census tract may reduce walkability through barrier effects even when pedestrians do not directly use it.
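The segment-level risk score and its length weighting can be sketched as below. The text says only that risk is "weighted by segment length", so the length-weighted mean used here is an assumption, as are the function names. The subsequent normalization, inversion, and population density scaling are omitted because their exact forms are not given.

```python
def segment_risk(speed_mph, baseline_mph=20.0, beta=1.5):
    """Power-model risk of one segment relative to the baseline speed."""
    return (speed_mph / baseline_mph) ** beta


def length_weighted_risk(segments):
    """Length-weighted mean risk over (speed_mph, length) pairs.

    Assumption: the tract aggregate is a weighted mean rather than a sum.
    """
    total_length = sum(length for _, length in segments)
    if total_length <= 0:
        raise ValueError("tract has no qualifying road length")
    weighted = sum(segment_risk(speed) * length for speed, length in segments)
    return weighted / total_length


# Toy tract: 3 length units at the 20 mph baseline, 1 unit at 80 mph.
segments = [(20.0, 3.0), (80.0, 1.0)]
print(segment_risk(20.0))
print(length_weighted_risk(segments))
```

A segment at the baseline speed has a risk of 1, so the aggregate reads as a multiple of the baseline risk.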
Public Transport Accessibility Level (PTAL)
The Public Transport Accessibility Level index measures ease of access to public transit services, adapted from the Transport for London methodology. For each census tract centroid, we identified all transit stops within specified walking thresholds: 640 m for bus stops and 960 m for rail stations, using stop locations from GTFS feeds. Walk access time was calculated at a standard speed of 80 m/min (4.8 km/hr).
For each nearby transit route, an Equivalent Doorstep Frequency (EDF) was computed as:
\[
\text{EDF} = \frac{30}{T_w + W}
\]
where \(T_w\) is the walk time in minutes from the tract centroid to the stop and \(W\) is the average waiting time, defined as half the headway plus a reliability penalty (2 min for bus, 0.75 min for rail). The numerator of 30 follows the Transport for London convention.
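The tract-level Accessibility Index (AI) is the sum of EDFs across all nearby routes, with non-dominant routes within each transport mode discounted by 50% to avoid overrepresentation of redundant services:
\[
\text{AI} = \sum_{m} \left( \text{EDF}_{m}^{\max} + 0.5 \sum_{r \neq r_m^{*}} \text{EDF}_{m,r} \right)
\]
where \(m\) indexes transport modes, \(\text{EDF}_{m}^{\max}\) is the EDF of the dominant (highest-EDF) route \(r_m^{*}\) of mode \(m\), and the inner sum runs over the remaining routes of that mode.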
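A minimal sketch of the EDF and Accessibility Index calculation is given below. It assumes the Transport for London form, EDF = 30 / (walk time + average waiting time), and treats the dominant route of each mode as the one with the highest EDF. The walking thresholds, walking speed, and reliability penalties come from the text; the function names, the route tuples, and the use of straight distance in metres are hypothetical simplifications.

```python
WALK_SPEED_M_PER_MIN = 80.0
MAX_WALK_M = {"bus": 640.0, "rail": 960.0}
RELIABILITY_MIN = {"bus": 2.0, "rail": 0.75}


def edf(mode, walk_distance_m, headway_min):
    """Equivalent Doorstep Frequency for one route at one stop."""
    walk_time = walk_distance_m / WALK_SPEED_M_PER_MIN
    average_wait = headway_min / 2 + RELIABILITY_MIN[mode]
    return 30.0 / (walk_time + average_wait)


def accessibility_index(routes):
    """Sum EDFs per mode, counting non-dominant routes at half weight.

    `routes` is a list of (mode, walk_distance_m, headway_min) tuples.
    """
    by_mode = {}
    for mode, distance, headway in routes:
        if distance > MAX_WALK_M[mode]:
            continue  # stop lies beyond the walking threshold for its mode
        by_mode.setdefault(mode, []).append(edf(mode, distance, headway))
    total = 0.0
    for edfs in by_mode.values():
        dominant = max(edfs)
        total += dominant + 0.5 * (sum(edfs) - dominant)
    return total


routes = [
    ("bus", 400.0, 10.0),   # 5 min walk + 7 min wait  -> EDF 2.5
    ("bus", 640.0, 12.0),   # 8 min walk + 8 min wait  -> EDF 1.875
    ("bus", 700.0, 5.0),    # beyond 640 m, ignored
    ("rail", 800.0, 8.5),   # 10 min walk + 5 min wait -> EDF 2.0
]
print(accessibility_index(routes))
```

Here the second bus route contributes only half of its EDF, while the single rail route counts in full as the dominant route of its mode.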
Computational Workflow
All sub-component and supplementary index calculations were implemented in Python using osmnx, geopandas, pandas, and numpy. The workflow proceeds in four stages: (1) data acquisition, in which OSM features and Census boundary files are obtained for each target year; (2) raw index computation, in which each sub-component or supplementary index is calculated per tract; (3) percentile-rank normalization across all tracts and years; and (4) composite PEI computation using the multiplicative aggregation described under Normalization and Composite PEI.
At the national scale (approximately 84,000 census tracts), the primary computational bottleneck was population data retrieval. Initial attempts to query the Census API at scale encountered rate-limiting and data-vintage inconsistencies; we addressed this by downloading population tables directly from the Missouri Census Data Center. For OSM-derived indices, historical data were downloaded in 30 × 30 mile chunks through live Overpass API queries. The data were then aggregated and filtered according to the needs of each index. Each supplementary index (BII, RSI, SGI) required approximately 30 to 60 minutes of processing time per year on consumer hardware.
Output files are produced as both CSV and GeoJSON, enabling integration with GIS software and web-based visualization platforms. An interactive web application for exploring PEI and supplementary index values across U.S. tracts is available at http://sustainableurbansystems.com/PEI-Map/.
Validation
We evaluated the construct validity of the PEI sub-components and supplementary indices using three complementary approaches.
Convergent Validity
We assessed convergent validity by computing Pearson and Spearman correlations between each novel characteristic and established measures. For PEI sub-components and BII, the primary benchmark was the EPA National Walkability Index (NWI), which is constructed from intersection density, proximity to transit, and land-use diversity at the block-group level. For SGI, we compared against childhood opportunity index greenspace scores. For RSI, we compared against proximity-to-roadway variables. Where available, we also compared against external benchmarks such as the EPA Smart Location Database walkability scores.
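The two correlation measures can be sketched with the standard library alone. Spearman correlation is computed here as the Pearson correlation of average ranks, which is the usual definition; the function names and toy values are hypothetical and carry no empirical meaning.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: a monotone but non-linear relationship between two measures.
index_values = [1.0, 2.0, 3.0, 4.0, 5.0]
benchmark_values = [1.0, 4.0, 9.0, 16.0, 25.0]
print(pearson(index_values, benchmark_values))
print(spearman(index_values, benchmark_values))
```

Reporting both is useful because Spearman captures any monotone agreement, whereas Pearson is sensitive to departures from linearity.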
Known-groups Comparisons
We conducted known-groups comparisons by stratifying census tracts into urbanicity categories (urban, suburban, and rural) derived from the USDA Rural-Urban Commuting Area (RUCA) codes and testing whether each characteristic differentiated between these groups in theoretically expected directions. For example, we expected higher PEI composite scores in urban versus rural tracts, higher SGI in suburban tracts with dedicated park space versus dense urban cores, higher BII in urban tracts with dedicated cycling infrastructure investment, and higher PTAL in urban tracts served by frequent transit.
Discriminant Validity
We assessed discriminant validity by computing pairwise correlations among characteristics hypothesized to capture distinct constructs. Specifically, we expected BII (cycling infrastructure) and SGI (recreational outdoor space) to show only modest correlation, as a neighborhood may invest in cycling lanes without necessarily providing park space, and vice versa. Similarly, RSI (pedestrian risk from traffic speed) was expected to correlate only weakly with CDI (commercial destination density), as commercial density reflects the presence of destinations rather than the speed environment of surrounding roads. We also expected PTAL (transit service frequency) to be partially independent of IDI (street network connectivity), as transit frequency depends on agency investment and ridership demand rather than network topology alone. Pairwise correlations substantially exceeding \(r = 0.70\) between theoretically distinct pairs would suggest redundancy and prompt reconsideration of whether the characteristics capture unique variance in the neighborhood environment.
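The redundancy screen can be sketched as a pairwise scan against the 0.70 threshold. The function names are hypothetical, the use of the absolute correlation is an assumption, and the toy columns are invented solely to exercise the code.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / math.sqrt(sxx * syy)


def redundant_pairs(columns, threshold=0.70):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Invented values for three placeholder indices across five tracts.
toy = {
    "index_a": [1, 2, 3, 4, 5],
    "index_b": [2, 1, 4, 3, 5],
    "index_c": [5, 1, 4, 2, 3],
}
print(redundant_pairs(toy))
```

Pairs returned by the scan are candidates for the reconsideration described above; pairs below the threshold are treated as capturing distinct variance.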
Results
To be filled in.
Discussion
To be filled in.
Limitations
This section explores the main limitations of relying on OpenStreetMap (OSM) as the primary data source. Using OSM involves a fundamental trade-off: while it provides a high-resolution, temporal, and open-source proxy of the built environment, its crowdsourced nature introduces non-uniformity across geographies and timeframes. OSM is not a "ground truth" record but a reflection of volunteered input, and the resulting systematic biases must be considered when applying the PEI.
Urban bias
The concept of the digital divide in Volunteered Geographic Information (VGI) is well-documented, and its effects are prominently observed within this project. Because OSM relies solely on volunteers, mapping density is highly correlated with population density. In major metropolitan areas, high contributor activity ensures that nearly all geospatial points of interest are captured. Conversely, in rural areas, the dataset often suffers from significant under-mapping. This creates a systematic bias where rural areas may receive artificially low scores across several indices, such as the Commercial Density Index (CDI) or the Social Greenspace Index (SGI). In these contexts, a score of zero frequently reflects a lack of local mapping effort rather than the physical absence of a park or a shop. Consequently, the PEI may overstate the disparity in walkability between urban and rural environments.
Historical incompleteness
While OSM is widely regarded today as a reliable data source, the same cannot be said for the project’s earlier stages. During the initial years of our longitudinal study, specifically around 2013, the platform was characterized by chronic under-mapping. The database had not yet reached the critical mass of contributors required for comprehensive nationwide coverage. This poses a significant challenge for temporal analysis. Because the map was less complete in 2013 and 2017 than it is in 2022, longitudinal trends may be artificially inflated. An increase in a tract’s Bike Infrastructure Index (BII) or Commercial Density Index (CDI) over time often reflects the growth of the OSM community and its mapping efforts rather than actual physical development. Consequently, comparing snapshots across a decade requires caution, as the baseline data from earlier years is inherently less reliable. This makes it difficult to distinguish between a neighborhood that has recently improved its walkability and one that was simply documented late by the OSM community.
User error
Our methodology relies heavily on the assumption that contributors input data correctly. However, due to the lack of a centralized verification process, the OSM dataset is subject to the subjective interpretations and varying technical proficiency of individual mappers. This introduces significant uncertainty into our index calculations. For instance, the distinction between residential, unclassified, and tertiary roads is often subtle and varies by region. A user may incorrectly categorize a road, which directly skews the default speed assumptions used in the Road Safety Index (RSI). Furthermore, users may input geometries incompletely, in scattered fragments, or with topological errors. Since the PEI is an inherently geospatial metric, these geometric inaccuracies can severely skew the results.
Qualitative gap
Each index relies on the quantity of features inside each census tract and does not consider the quality of the built environment. For example, while an area might include a park, it may be poorly maintained or underdeveloped compared to other parks. This is not accounted for in any index, and each built environment feature is weighted the same regardless of its condition. This could theoretically be rectified by including OSM metadata describing the state of a park (such as the condition or maintenance tags), but these data are highly subjective and often missing from current OSM entries. Consequently, the indices prioritize the physical presence of infrastructure over its actual utility or quality.
Conclusion
To be filled in.
Acknowledgments
To be filled in.
Dataset Merge Tool
Probabilistic Dataset Merge
A general-purpose tool for linking two tabular datasets at a common geographic resolution, built around the immediate need to merge our team's Pedestrian Environment Index (PEI) scores with Adolescent Brain Cognitive Development (ABCD) mental-health data at the U.S. census-tract level. The matching logic is dataset-agnostic, so the same tool serves both our in-house research question and a public release for other groups facing comparable record-linkage problems.
Live tool: https://dataset-merge-smur.netlify.app/
Setup
# Backend (Python)
cd version-3
python -m venv venv && source venv/bin/activate
pip install -e .
pytest # 85 tests
# Webapp (loads the same Python via Pyodide in-browser)
cd apps/dataset-merge
pnpm install
pnpm dev
Full backend documentation: version-3/docs/.
Abstract
The tool takes a target dataset and a supplemental dataset, both at the same geographic resolution, and finds the closest supplemental row for each target row using standardized Euclidean distance over shared columns. For every match it also computes a battery of match-quality signals — Nearest Neighbor Distance Ratio (NNDR), Mutual Nearest Neighbor (MNN) confirmation, per-row feature contributions, and dataset-level Standardized Mean Difference (SMD) — and emits a plain-English flags column so non-technical researchers can read row-by-row whether to trust each link.
The driving use case is a collaboration with Dr. Benson Ku (Emory University): linking PEI built-environment scores to ABCD mental-health outcomes by census tract, so we can ask whether walkable, well-connected environments correlate with measured adolescent well-being. The matching infrastructure built for that question is general — it works on any pair of CSVs that share at least one column — so the same code is released as a public webapp.
How it works
The pipeline is a straight line of pure functions:
- Align shared columns. Exact column-name matches are detected automatically; mismatched names can be linked manually in the webapp.
- Joint z-score standardization. Both datasets are normalized using a combined per-column mean and standard deviation, so the same raw value maps to the same standardized value in either source.
- Compute distances. For every target row, the standardized Euclidean distance is computed against every supplemental row. Distances are kept in full so the quality signals can be derived.
- Pick the best match per target. Closest supplemental row by distance. Ties are recorded.
- Derive quality signals and flags. See "Match quality signals" below.
Output is two CSVs: a linked dataset (target rows + matched supplemental columns + per-row signals + plain-English flags) and a wider detail file (per-feature contributions, full per-row diagnostics) for audit.
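The first four steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the tool's actual code: the function names are hypothetical, rows are assumed to be already aligned lists of shared numeric columns, the joint standard deviation is the population form, and constant columns are given a divisor of 1 to avoid division by zero.

```python
import math


def joint_standardize(target, supplemental):
    """Z-score both tables with one combined per-column mean and std."""
    combined = target + supplemental
    means, stds = [], []
    for column in zip(*combined):
        mean = sum(column) / len(column)
        variance = sum((v - mean) ** 2 for v in column) / len(column)
        means.append(mean)
        stds.append(math.sqrt(variance) or 1.0)  # constant column: divide by 1

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

    return scale(target), scale(supplemental)


def match(target, supplemental):
    """Brute-force nearest supplemental row for every target row."""
    target_z, supplemental_z = joint_standardize(target, supplemental)
    results = []
    for row in target_z:
        # Every target row is compared against every supplemental row.
        distances = [math.dist(row, other) for other in supplemental_z]
        best = min(distances)
        ties = [j for j, d in enumerate(distances) if d == best]
        results.append({"match": ties[0], "distance": best,
                        "ties": ties, "all_distances": distances})
    return results


target = [[1.0, 10.0], [5.0, 50.0]]
supplemental = [[1.1, 11.0], [4.9, 52.0], [3.0, 30.0]]
links = match(target, supplemental)
print([(r["match"], round(r["distance"], 3)) for r in links])
```

Keeping the full list of distances per target row, rather than only the minimum, is what allows ratio-based and reverse-direction quality signals to be derived afterwards.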
Privacy / HIPAA framing
The matching engine is deliberately brute-force: every target row is compared against every supplemental row with no spatial index, no kd-tree, no approximate nearest-neighbor structure. This is a privacy decision, not a performance one. ABCD is HIPAA-protected and the supplemental side carries location-derived attributes; any indexed structure that exploits geographic proximity could leak location information through query patterns or re-identification side channels. Holding the algorithm at brute force keeps the privacy posture explicit: every match is computed in isolation, and the same code path is followed for every row.
The current build also surfaces a soft warning when a user attempts to match on columns that look like direct identifiers (census tract IDs, coordinates). That warning is bypassable today and is one of the items flagged for the next revision; an external HIPAA / ethics review of the methodology is the top-priority next step.
Match quality signals
Each linked row carries five signals plus a derived flags string. Brief summary; full reference at version-3/docs/signals/.
| Signal | What it captures |
|---|---|
| euc_distance | Standardized Euclidean distance to the matched row. |
| cascading_nndr | \(d_1/d_2\) ratio (Lowe 2004) plus a "near-miss count" of supplemental rows within threshold of the best match. Primary ambiguity signal; scale-invariant across runs. |
| mnn_confirmed | Reverse-direction check: does the matched supplemental row also point back to this target? Catches asymmetric matches where the supplemental row "belongs" to a different target. |
| per_row_feature_contribution | Per-feature share of the squared distance for one match. Useful when a flagged row turns out to be driven by a single column with a unit error. |
| dataset_smd | Run-wide standardized mean difference per feature across all matched pairs. Threshold benchmarks from Austin (PMC3472075): >0.10 = imbalance, >0.25 = poor. |
| flags | Plain-English summary string. Empty when no signal fires; otherwise a \|-joined list (ambiguous match — NNDR 0.92 (>= 0.80) \| 3 near-miss row(s) ...). |
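The arithmetic behind several of these signals can be sketched as below. The function names are hypothetical and do not mirror the tool's API; the near-miss count is omitted because its threshold is not specified here, and the SMD uses the common pooled-standard-deviation form with sample variances and an absolute value, which is an assumption about the tool's exact definition.

```python
import math
from statistics import mean, variance


def nndr(distances):
    """Ratio of the closest to the second-closest distance (d1 / d2)."""
    d1, d2 = sorted(distances)[:2]
    return d1 / d2 if d2 > 0 else 1.0  # two exact matches: fully ambiguous


def mnn_confirmed(dist, i):
    """True if target i's best supplemental row also has i as its best target."""
    j = min(range(len(dist[i])), key=lambda col: dist[i][col])
    back = min(range(len(dist)), key=lambda row: dist[row][j])
    return back == i


def feature_contributions(target_row, matched_row):
    """Share of the squared distance attributable to each feature."""
    squared = [(a - b) ** 2 for a, b in zip(target_row, matched_row)]
    total = sum(squared)
    return [s / total if total else 0.0 for s in squared]


def standardized_mean_difference(target_col, matched_col):
    """Absolute mean difference divided by the pooled standard deviation."""
    pooled_sd = math.sqrt((variance(target_col) + variance(matched_col)) / 2)
    if pooled_sd == 0:
        return 0.0
    return abs(mean(target_col) - mean(matched_col)) / pooled_sd


# Toy distance matrix: rows are targets, columns are supplemental rows.
dist = [
    [1.0, 4.0, 5.0],
    [0.5, 3.0, 6.0],
]
print(nndr(dist[0]))                                   # d1/d2 for target 0
print(mnn_confirmed(dist, 0), mnn_confirmed(dist, 1))  # reverse-direction check
print(feature_contributions([0.0, 0.0], [3.0, 4.0]))
print(standardized_mean_difference([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

In the toy matrix both targets pick supplemental row 0, but that row is closer to target 1, so only target 1 is mutually confirmed.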
Webapp
A React + Vite frontend (apps/dataset-merge/) loads the entire Python matcher into the browser via Pyodide. Researchers upload two CSVs, link columns interactively, and inspect a Results UI with a sortable per-row diagnostics table, a per-feature SMD bar chart, and a per-row drill-down showing feature-contribution bars, a top-k rank plot, and a full-population distance histogram.
Running the matcher in the browser means no participant data ever leaves the researcher's machine, which is another deliberate piece of the privacy posture. The frontend and CLI share the same Python module (web_api mirrors the file-based coordinator), so any signal works identically in both entry points.
Explanatory PDFs
A separate LaTeX pipeline (version-3/explanatory/) generates one PDF per characteristic match scenario: exact match, rounding discrepancy, scale mismatch, ambiguous match, and MNN not confirmed. Each PDF walks a non-technical reader through the target row, all 20 candidates, a worked distance calculation, a histogram of all candidate distances, and the value each signal takes for that scenario with a plain-English explanation.
These are surfaced from the webapp's "How it works" page so a researcher can see what each flag actually corresponds to in real data.
Runtimes
(Current, on a single laptop core, with the full signals pipeline.)
| Workload | Time |
|---|---|
| acs-test (~350 tracts × ~350 tracts) | sub-second |
| Realistic ABCD-scale workload (~70k × 70k target) | does not yet meet the under-2-minute target |
The signals layer added meaningful overhead. Vectorising the distance matrix and the per-row signal loop is the natural next pass and is queued as a high-priority next step (see below). Optimisation has to stay within the brute-force constraint set by the privacy posture.
Strengths and Weaknesses
Strengths
- General-purpose matching: works on any pair of CSVs sharing at least one column. Same code serves both our in-house PEI ↔ ABCD merge and a public release.
- Honest quality reporting: every row carries a plain-English flag string instead of a single composite "match score." Researchers can trace exactly why a row was flagged.
- Browser-based, no data egress: Pyodide runs the matcher locally; participant data never leaves the user's machine.
- Privacy posture is explicit and load-bearing: brute force is deliberate, documented, and tested.
Weaknesses
- The signals are theoretically motivated but not empirically validated against human-labelled match quality.
- The default NNDR threshold (0.8) is borrowed from computer vision (Lowe 2004) and has not been calibrated on tabular census-tract data.
- Performance does not yet hit the realistic ABCD workload target.
- Missing values are silently coerced to zero rather than imputed or flagged.
- The PII safeguard is currently a soft warning and is bypassable.
Next steps
Top priorities (full list in the repo's HANDOFF.md):
- External HIPAA / ethics audit of the matching methodology.
- Harden the PII safeguard beyond a soft warning.
- Performance pass — target ~70k × 70k under 2 minutes, staying brute-force.
- Empirically validate the signals against human-labelled match quality on a sample.
- Source real-population calibration data (target dataset measured independently of the supplemental ACS pool) and rebuild the NNDR calibration script.
- UX overhaul of the webapp's "How it works" page — more visual explainers, full PDFs and detailed glossary moved to a links list.