By Michael Garber

Revised September 23, 2025

URL of this page: https://michaeldgarber.github.io/climate-health/describe-code-pipeline

Corresponding GitHub repository: https://github.com/michaeldgarber/climate-health/tree/main

Overview

For an overview of the project, please see here:

https://michaeldgarber.github.io/climate-health/wf-health-scenarios-doc.html

This document describes code that produces the results. The code files to conduct the main steps of the analysis are located in the scripts folder of this repository: https://github.com/michaeldgarber/climate-health/tree/main.

The three R scripts to run this analysis are the following:

  1. data-mgmt-analyses-before-modeling-RD-as-outcome.R Loads data from various folders, combines the data into a single dataset, and creates some new variables. It also explores the data with plots and maps.
  2. model-RD-as-outcome.R Fits models for the effect of the intervention variables on the rate difference.
  3. predict-values-alt-tree-ac.R Defines scenarios and predicts alternative values of the RD based on the scenario values and the models from Step 2.

To run all three scripts at once, source-scripts-general-eff-mod.R may be used; it “sources” all three scripts in order.
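In essence, that wrapper does something like the following (a sketch, assuming the here package is used to build paths from the project root, as in the source() call shown later in this document):

library(here)

# Run the three main scripts in order; each builds on objects created by the previous one
source(here("scripts", "data-mgmt-analyses-before-modeling-RD-as-outcome.R"))
source(here("scripts", "model-RD-as-outcome.R"))
source(here("scripts", "predict-values-alt-tree-ac.R"))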

What each script does in more detail

1. Import and read data

The script: data-mgmt-analyses-before-modeling-RD-as-outcome.R

What it does: The purpose of this code is to read in data on the spatially varying estimates of the effect of wildfire on hospitalizations, along with covariates, and to create a unified ZCTA-level dataset that will be used in the modeling step. Note that I use zip code interchangeably with ZCTA here.

Data used: The code begins by reading in information from various sources. Specifically, data include:

  • Effect estimates relating wildfire days with hospitalizations from Do et al. (2024). The measures are the rate difference (variable name: rd_quo_pt) and its corresponding standard deviation (rd_quo_sd). The _quo suffix in the variable name differentiates the “status-quo” rate difference from the alternative rate differences used later in the pipeline. Other variables related to the rate difference are apparent from the dplyr::mutate() step in the code.

  • Various zip-code-level demographic variables, some of which are included, for convenience, in the .csv file shared by the authors of the Do et al. article. Other covariates are gathered directly from the Healthy Places Index of the Public Health Alliance of Southern California.

  • Each zip code’s biome (variable name: biome_name_freq, where the _freq denotes that the data are ordered by frequency)

  • Each zip code’s rural-urban commuting area (RUCA) category (variable name: ruca_cat, a 4-category variable in which the category 1-3 is the most urban and the other categories are progressively more rural)

  • Each zip code’s proportion of land cover with impervious surface (variable name: imperv_prop)

  • Each zip code’s air basin (variable name: basin_name)

Processing: It then links all of the information together at the zip-code level using dplyr::left_join(), joining by the field zcta that each dataset has in common.
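Schematically, the linkage looks something like this (the input and lookup-table names are illustrative, not the actual object names in the script):

library(dplyr)

wf_eff_emm_wide = wf_eff_zcta %>%                 # effect estimates from Do et al. (illustrative name)
  left_join(hpi_covariates, by = "zcta") %>%      # demographic / Healthy Places Index covariates
  left_join(zcta_biome_lookup, by = "zcta") %>%   # biome_name_freq
  left_join(zcta_ruca_lookup, by = "zcta") %>%    # ruca_cat
  left_join(zcta_imperv_lookup, by = "zcta") %>%  # imperv_prop
  left_join(zcta_basin_lookup, by = "zcta")       # basin_name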

Quality checks and data exploration: The code subsequently creates maps and plots to assess univariate distributions of the variables along with bivariate distributions to assess how variables relate to one another. These checks are performed to gain a better understanding of the data and to inform modeling.
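For instance, the checks are along these lines (a sketch, not the exact plots in the script):

library(ggplot2)

# Univariate: distribution of the status-quo rate difference
ggplot(wf_eff_emm_wide, aes(x = rd_quo_pt)) +
  geom_histogram()

# Bivariate: rate difference versus square root of tree canopy, by rural-urban category
ggplot(wf_eff_emm_wide, aes(x = tree_canopy_prop_sqrt, y = rd_quo_pt, colour = ruca_cat)) +
  geom_point()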

Outputs: The code has two important outputs:

  • wf_eff_emm_wide.R: This is the final zip-code-level dataset with the above information linked together.

  • wf_eff_emm_wide_no_outliers.R: The same dataset with observations containing rate-difference outliers removed.

2. Model the rate difference as the outcome

The script: model-RD-as-outcome.R

What it does: The purpose of this code is to model the spatially varying rate difference (RD) at the ZCTA level using the unified dataset created previously. It fits a series of weighted Gaussian generalized linear models to estimate how the RD varies as a function of the intervention constructs of interest (either tree canopy or air conditioning), along with covariates that we expect may confound the association between these intervention variables and the outcome.

Modeling approach:

Covariates: For both tree canopy and air conditioning, we tried various models, aiming to control for confounding of the association between the intervention variable of interest and the RD while keeping the models relatively parsimonious. In each model, in addition to the intervention of interest (either tree canopy or air conditioning), we included the following covariates:

  • proportion insured

  • rural-urban category

  • biome, and

  • two interaction terms

    • the intervention variable (square root of tree canopy, or the air-conditioning proportion) times the rural-urban category

    • the intervention variable times biome.

Weighting: Because the model outcome is itself a modeled result (with a point estimate and a standard error), we used weighted regression so that observations (ZCTAs) with more precise estimates receive more weight. The weight variable is rd_quo_weight. It is defined in the previous script as rd_quo_weight = rd_quo_sd_min / rd_quo_sd, where the denominator is the standard deviation of the modeled RD in that ZCTA and the numerator is the minimum SD over all ZCTAs. ZCTAs with higher SDs therefore receive lower weights.
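In dplyr terms, that definition amounts to something like the following (a sketch of the calculation described above, not the verbatim code):

library(dplyr)

wf_eff_emm_wide = wf_eff_emm_wide %>%
  mutate(
    rd_quo_sd_min = min(rd_quo_sd, na.rm = TRUE),  # minimum SD across all ZCTAs
    rd_quo_weight = rd_quo_sd_min / rd_quo_sd      # less precise (higher-SD) ZCTAs get smaller weights
  )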

For reference, the code to fit each final glm model follows:

For tree canopy as the intervention: glm_rd_tree_conf_ins_ruca_biome_int_ruca_biome

glm_rd_tree_conf_ins_ruca_biome_int_ruca_biome = glm(
  rd_quo_pt ~ tree_canopy_prop_sqrt + insured_prop + ruca_cat +
    biome_name_freq +
    tree_canopy_prop_sqrt * ruca_cat +
    tree_canopy_prop_sqrt * biome_name_freq,
  family = gaussian,
  na.action = "na.exclude",
  weights = rd_quo_weight,
  data = wf_eff_emm_wide_no_outliers
)

For air conditioning as the intervention: glm_rd_ac_conf_ins_ruca_biome_int_ruca_biome

glm_rd_ac_conf_ins_ruca_biome_int_ruca_biome = glm(
  rd_quo_pt ~ ac_prop + insured_prop + ruca_cat +
    biome_name_freq +
    ac_prop * ruca_cat +
    ac_prop * biome_name_freq,
  family = gaussian,
  na.action = "na.exclude",
  weights = rd_quo_weight,
  data = wf_eff_emm_wide_no_outliers
)

Output: This script does not save any output to disk, as it runs quite quickly. Its outputs (the model objects beginning with glm_) are used in the next script.
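For example, either fitted object can be inspected interactively before moving on (this is not part of the script itself):

summary(glm_rd_tree_conf_ins_ruca_biome_int_ruca_biome)  # coefficients, SEs, and fit statistics
summary(glm_rd_ac_conf_ins_ruca_biome_int_ruca_biome)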

3. Predict alternative rate differences under various scenarios

The script: predict-values-alt-tree-ac.R

Dependencies: It depends on the model objects from the modeling script being available in the session’s memory. That script does not need to be run manually, however; at the top of this script, it is run via source(here("scripts","model-RD-as-outcome.R")).

What it does, in summary: The purpose of the script is to predict a range of alternative rate differences for the effect of wildfire on acute-care utilization under various scenarios. To define the alternative scenarios, for each intervention measure, the script finds quintile values (i.e., the 20th, 40th, 60th, and 80th percentiles) over ZCTAs within each biome. Percentiles are calculated within biome to ensure that scenarios are realistic for a given biome. Then, for each scenario, zip codes with a value of the effect modifier below the scenario’s percentile have that effect modifier raised to the level of the percentile, and a new rate difference is predicted from the new value, holding all other variables constant.

The script then calculates the difference in rate differences between the status-quo and alternative scenarios, as well as the difference in the absolute number of cases. To wash out the influence of the modeling itself, the “status-quo” scenario is also predicted from the model, so the comparison is akin to asking: what would the status-quo RD be if the patterns described by the model held true and all variables were at their status-quo levels? The script also produces confidence intervals for these measures via bootstrapping. Finally, it calculates various summary measures, summing over all ZCTAs and stratifying by rural-urban category and biome.
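As a simplified sketch of the scenario logic for tree canopy (the actual script handles both intervention measures, bootstrapping, and case counts; names such as scenario_p and the intermediate objects are illustrative):

library(dplyr)

# 1. Find, say, the 60th-percentile tree canopy within each biome
scenario_p = 0.6
scenario_values = wf_eff_emm_wide_no_outliers %>%
  group_by(biome_name_freq) %>%
  summarise(tree_canopy_scenario = quantile(tree_canopy_prop_sqrt, probs = scenario_p, na.rm = TRUE))

# 2. Raise ZCTAs below that percentile up to it, holding all other variables constant
newdata_alt = wf_eff_emm_wide_no_outliers %>%
  left_join(scenario_values, by = "biome_name_freq") %>%
  mutate(tree_canopy_prop_sqrt = pmax(tree_canopy_prop_sqrt, tree_canopy_scenario))

# 3. Predict status-quo and alternative RDs from the fitted model, then take their difference
rd_quo_hat = predict(glm_rd_tree_conf_ins_ruca_biome_int_ruca_biome,
                     newdata = wf_eff_emm_wide_no_outliers)
rd_alt_hat = predict(glm_rd_tree_conf_ins_ruca_biome_int_ruca_biome,
                     newdata = newdata_alt)
rd_diff = rd_alt_hat - rd_quo_hat  # difference in rate differences for each ZCTA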

Step-by-step details:

In progress

On integrating different effect modifiers: To integrate a new effect modifier, the code could be modified slightly by finding and replacing the name of one of the existing effect modifiers with the name of the new effect modifier of interest.

We considered including an option in the functions to input the name of an effect modifier as a function argument. This would reduce the need for copy-and-paste and keep the code more DRY (don’t repeat yourself). However, it would also make the functions more complex and therefore more error-prone, so they do not currently have this flexibility.
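For reference, the parameterization we considered (but did not implement) might have looked roughly like this, passing the effect-modifier column as an argument; none of these function or argument names exist in the scripts:

library(dplyr)

scenario_values_by_biome = function(data, effect_modifier, scenario_p = 0.6) {
  # effect_modifier is an unquoted column name, handled with tidy evaluation ({{ }})
  data %>%
    group_by(biome_name_freq) %>%
    summarise(scenario_value = quantile({{ effect_modifier }}, probs = scenario_p, na.rm = TRUE))
}

# e.g., scenario_values_by_biome(wf_eff_emm_wide_no_outliers, tree_canopy_prop_sqrt)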

Other code beyond the main pipeline (Data-gathering R scripts, R Markdown, Shiny)

Scripts to gather data

The above sequence of scripts will work assuming all of the data are available locally on the computer in the locations specified by the relative paths, as indicated in the statements that set the working directory using the here package. For example, the code above expects to find the R object zcta_ca_biome_highest.R in the “data-processed” folder one level beneath the directory corresponding to the R project, whose folder name is “climate-health.”
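For example, assuming the project root is the climate-health folder, here() resolves relative paths like this (illustrative only):

library(here)
# here() builds paths relative to the project root (the "climate-health" folder)
here("data-processed", "zcta_ca_biome_highest.R")
#> "<path to project>/climate-health/data-processed/zcta_ca_biome_highest.R"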

These code files gather data from various publicly available sources and document where they come from. Because spatial data can be time-consuming to gather, I don’t run these scripts every time. Instead, I have saved the relevant information to create aspatial zip-code-level look-up tables that are joined with the main effect-estimate data in data-mgmt-analyses-before-modeling-RD-as-outcome.R.

  • get-zcta-california-geo.R This code uses the tidycensus package and other functions to read in publicly available geometry data for zip codes in the State of California (a rough sketch of this kind of call follows this list). It also reads in zip-code-level lookup tables for rural-urban commuting codes.

  • link-zctas-biome.R This code links zip codes with biomes.

  • california-air-basins-merge-w-zctas.R This code reads in data on air basins and merges them with the zip-code geometry.
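As an example of the kind of tidycensus call used for the geometry data in get-zcta-california-geo.R (a sketch; the variable, year, and options are placeholders, not necessarily what the script uses):

library(tidycensus)

zcta_ca_geo = get_acs(
  geography = "zcta",
  variables = "B01003_001",  # total population, used here only as a placeholder variable
  state = "CA",
  year = 2019,
  geometry = TRUE            # also return ZCTA boundaries as an sf object
)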

RMarkdown documents to create tables

climate-health/docs/wf-health-scenarios-tables.Rmd: This RMarkdown document creates tables that can be readily pasted into a Word document for the manuscript. It relies on results from the main pipeline being either saved locally or created in memory during the session.

The shiny app

Directory: The shiny app’s folder is climate-health/climate-health-shiny

Data management: The shiny app expects all of its input data to be contained within that folder, so there is some additional data management needed to get the shiny app to work. This script (climate-health-shiny/manage-data-for-app.R) loads the data created elsewhere in the pipeline and brings it into the shiny app folder. So if the pipeline is altered, this script should be re-run to update the shiny app’s data.
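The idea, roughly (file and object names here are placeholders, not the actual ones in manage-data-for-app.R):

library(here)

# Load an output created by the main pipeline...
load(here("data-processed", "results_for_app.RData"))  # placeholder file name
# ...and re-save it inside the shiny app's own folder so the app can find it
save(results_for_app, file = here("climate-health-shiny", "results_for_app.RData"))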

The app’s script: The script defining the shiny app is app-wf-int.R.
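With the data in place, the app can be launched locally along these lines (assuming app-wf-int.R ends with an expression that produces a shiny app object, which runApp() accepts directly):

library(shiny)
library(here)

runApp(here("climate-health-shiny", "app-wf-int.R"))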