By Michael Garber
Revised September 23, 2025
URL of this page: https://michaeldgarber.github.io/climate-health/describe-code-pipeline
Corresponding GitHub repository: https://github.com/michaeldgarber/climate-health/tree/main
For an overview of the project, please see here:
https://michaeldgarber.github.io/climate-health/wf-health-scenarios-doc.html
This document describes code that produces the results. The code files to conduct the main steps of the analysis are located in the scripts folder of this repository: https://github.com/michaeldgarber/climate-health/tree/main.
The three R scripts to run this analysis are the following:
data-mgmt-analyses-before-modeling-RD-as-outcome.R
This code loads data from various folders, combines the data together, and creates some new variables. It also explores the data with plots and maps.
model-RD-as-outcome.R
Fit models for the effect of the intervention variables on the rate difference.
predict-values-alt-tree-ac.R
Define scenarios and predict alternative values of the RD based on scenario values and the models from Step 2.
To quickly run all three scripts in one file, this code may be used: source-scripts-general-eff-mod.R. It “sources” all three scripts.
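The runner script is not shown in this document; as a sketch, it presumably amounts to three source() calls with project-relative paths built by the here package (the exact contents may differ):

```r
# Hypothetical sketch of source-scripts-general-eff-mod.R:
# run the three pipeline scripts in order. Paths are relative to the
# R-project root via here().
library(here)

source(here("scripts", "data-mgmt-analyses-before-modeling-RD-as-outcome.R"))
source(here("scripts", "model-RD-as-outcome.R"))
source(here("scripts", "predict-values-alt-tree-ac.R"))
```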
The script:
data-mgmt-analyses-before-modeling-RD-as-outcome.R
What it does: The purpose of this code is to read in data on the spatially varying estimates of the effect of wildfire on hospitalizations, along with covariates, and to create a unified dataset that will be used for the modeling step at the ZCTA level. Note that I use “zip code” interchangeably with “ZCTA” here.
Data used: The code begins by reading in information from various sources. Specifically, data include:
Effect estimates relating wildfire days with hospitalizations from Do et al. (2024). Measures are the rate difference (variable name: rd_quo_pt) along with its corresponding standard deviation (rd_quo_sd). The _quo suffix in the variable name is used to differentiate the “status-quo” rate difference from an alternative rate difference used later in the pipeline. Other variables related to the rate difference are apparent from the dplyr::mutate() step in the code.
Various zip-code-level demographic variables, some of which are included, for convenience, in the .csv file that was shared by the authors of the Do et al. article. Other covariates are gathered directly from the Healthy Places Index of the Public Health Alliance of Southern California.
Each zip code’s biome (variable name: biome_name_freq, where the _freq suffix denotes that the data are ordered by frequency)
Each zip code’s rural-urban commuting code (variable name: ruca_cat, a 4-category variable with 1-3 being most urban and the other categories progressively more rural)
Each zip code’s proportion of land cover with impervious surface (variable name: imperv_prop)
Each zip code’s air basin (variable name: basin_name)
Processing: It then links all of the information together at the zip-code level using dplyr::left_join(), joining by the field zcta that each dataset has in common.
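To illustrate the join step, here is a minimal, self-contained sketch with made-up toy values (the real script joins several tables with many more columns):

```r
library(dplyr)

# Toy illustration of joining zip-code-level tables by their shared
# key, zcta. Values are hypothetical.
rd_data <- tibble(
  zcta = c("90001", "90002"),
  rd_quo_pt = c(0.4, -0.1)
)
biome_data <- tibble(
  zcta = c("90001", "90002"),
  biome_name_freq = c("Mediterranean forests", "Deserts")
)

# left_join() keeps every row of the first table and attaches
# matching columns from the second.
wf_eff_emm_wide_toy <- rd_data %>%
  left_join(biome_data, by = "zcta")
```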
Quality checks and data exploration: The code subsequently creates maps and plots to assess univariate distributions of the variables along with bivariate distributions to assess how variables relate to one another. These checks are performed to gain a better understanding of the data and to inform modeling.
Outputs: The code has two important outputs:
wf_eff_emm_wide.R: This is the final zip-code-level dataset with the above information linked together.
wf_eff_emm_wide_no_outliers.R: The same dataset, with observations containing outliers for the rate difference removed.
The script:
model-RD-as-outcome.R
What it does: The purpose of this code is to model the spatially varying rate difference (RD) at the ZCTA level using the unified dataset created previously. It fits a series of weighted Gaussian generalized linear models to estimate how the RD varies as a function of the intervention constructs of interest (either tree canopy or air conditioning), along with covariates we expect may confound the association between these intervention variables and the outcome.
Modeling approach:
Covariates: For both tree canopy and air conditioning, we tried various models to control for confounding of the association between the RD and the intervention variable of interest while keeping the models relatively parsimonious. For each model, in addition to including the intervention of interest (either tree canopy or air conditioning), we included the following covariates:
proportion insured
rural-urban category
biome, and
two interaction terms:
the intervention variable times the rural-urban category
the intervention variable times biome
(In the tree-canopy model, the intervention variable is the square root of tree canopy; in the air-conditioning model, it is the air-conditioning proportion, as shown in the glm calls below.)
Weighting: Considering the model outcome is itself a modeled result (with a point estimate and a standard error), we used weighted regression so that observations (ZCTAs) with more precise estimates are weighted more highly. The weight variable is rd_quo_weight. It is defined in the previous code as rd_quo_weight = rd_quo_sd_min/rd_quo_sd, where the denominator is the standard deviation for the modeled RD in that ZCTA and the numerator is the minimum SD over all ZCTAs. This way, the ZCTA with the minimum SD receives a weight of 1, and ZCTAs with higher SDs receive lower weights.
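The weight definition can be illustrated with toy SD values (made up for this example):

```r
library(dplyr)

# Sketch of the weight definition: the ZCTA with the smallest SD gets
# weight 1; ZCTAs with larger SDs get proportionally smaller weights.
toy <- tibble(
  zcta      = c("a", "b", "c"),
  rd_quo_sd = c(0.5, 1, 2)  # hypothetical SDs of the modeled RD
)

toy <- toy %>%
  mutate(
    rd_quo_sd_min = min(rd_quo_sd, na.rm = TRUE),
    rd_quo_weight = rd_quo_sd_min / rd_quo_sd
  )
# rd_quo_weight: 1.00, 0.50, 0.25
```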
For reference, the code to fit each final glm model follows:
For tree canopy as the intervention:
glm_rd_tree_conf_ins_ruca_biome_int_ruca_biome
glm_rd_tree_conf_ins_ruca_biome_int_ruca_biome <- glm(
  rd_quo_pt ~ tree_canopy_prop_sqrt + insured_prop + ruca_cat +
    biome_name_freq +
    tree_canopy_prop_sqrt * ruca_cat +
    tree_canopy_prop_sqrt * biome_name_freq,
  family = gaussian,
  na.action = "na.exclude",
  weights = rd_quo_weight,
  data = wf_eff_emm_wide_no_outliers
)
For air conditioning as the intervention:
glm_rd_ac_conf_ins_ruca_biome_int_ruca_biome
glm_rd_ac_conf_ins_ruca_biome_int_ruca_biome <- glm(
  rd_quo_pt ~ ac_prop + insured_prop + ruca_cat +
    biome_name_freq +
    ac_prop * ruca_cat +
    ac_prop * biome_name_freq,
  family = gaussian,
  na.action = "na.exclude",
  weights = rd_quo_weight,
  data = wf_eff_emm_wide_no_outliers
)
Output: This script does not save any output to disk, as it runs quite quickly. Its outputs (the model objects beginning with glm_) are used in the next script.
The script:
predict-values-alt-tree-ac.R
Dependencies: It depends on the modeling script having been run in the session’s memory, as it uses the model objects. That script does not need to be run manually, however. At the top of this script, that script is run via source(here("scripts", "model-RD-as-outcome.R")).
What it does, in summary: The purpose of the script is to predict a range of alternative rate differences for the effect of wildfire on acute-care utilization over various scenarios. To define the alternative scenarios, for each intervention measure, the script finds quintile values (i.e., 20th percentile, 40th percentile, …, 80th percentile) over ZCTAs within biome. Percentiles are relative to biome to ensure that scenarios are realistic for a given biome. Then, for each scenario, in zip codes with a value of the effect modifier below that of the scenario’s percentile, the effect modifier is raised to the level of that percentile, and a new rate difference is calculated based on the new value for the effect modifier, holding all other variables constant. The script then calculates a difference in rate differences between the status-quo and alternative scenarios. It also calculates the difference in the absolute number of cases. To wash out the influence of modeling, the “status-quo” scenario is also predicted from the model, so it is akin to asking: what would the status-quo RD be if the patterns described by the model held true and all variables were at their status-quo levels? The script also produces confidence intervals for these measures via bootstrapping. Finally, the script predicts various summary measures, summing over all ZCTAs and stratifying summaries by rural-urban category and biome.
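The core scenario step might be sketched as follows. This is illustrative only: the model object and dataset names come from the scripts above, but the column name tree_canopy_prop and the choice of the 60th-percentile scenario are assumptions, and the actual script likely differs in its details.

```r
library(dplyr)

scenario_p <- 0.6  # e.g., the 60th-percentile scenario (assumed)

# Within each biome, raise the effect modifier up to the scenario
# percentile for ZCTAs currently below it.
dat_alt <- wf_eff_emm_wide_no_outliers %>%
  group_by(biome_name_freq) %>%
  mutate(
    tree_cut = quantile(tree_canopy_prop, probs = scenario_p, na.rm = TRUE)
  ) %>%
  ungroup() %>%
  mutate(
    tree_canopy_prop      = pmax(tree_canopy_prop, tree_cut),
    tree_canopy_prop_sqrt = sqrt(tree_canopy_prop)  # the model's term
  )

# Predict both the alternative RD and the model-based status-quo RD,
# then take their difference (the difference in rate differences).
rd_alt <- predict(glm_rd_tree_conf_ins_ruca_biome_int_ruca_biome,
                  newdata = dat_alt)
rd_quo <- predict(glm_rd_tree_conf_ins_ruca_biome_int_ruca_biome,
                  newdata = wf_eff_emm_wide_no_outliers)
diff_in_rd <- rd_alt - rd_quo
```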
Step-by-step details:
In progress
On integrating different effect modifiers: To integrate a new effect modifier, the code could be slightly modified by finding and replacing the name of one of the effect modifiers for a new effect modifier of interest.
We considered including an option in the functions to input the name of an effect modifier as a function argument. This would reduce the need for copy and paste and keep the code more DRY (don’t-repeat-yourself). However, it would also make the functions more complex and thus more likely to not work. So the functions currently do not have this flexibility.
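For reference, the considered (but not adopted) design could look like the following sketch, which uses dplyr’s tidy-evaluation operators to accept the effect modifier’s column name as an argument. The function name and toy values are hypothetical:

```r
library(dplyr)

# Hypothetical helper: raise a chosen effect modifier up to a
# within-biome percentile. The column is passed unquoted and injected
# with {{ }} (rlang's curly-curly), so the same function would serve
# tree canopy or air conditioning.
raise_to_quantile <- function(data, em_var, p = 0.6) {
  data %>%
    group_by(biome_name_freq) %>%
    mutate(
      em_cut = quantile({{ em_var }}, probs = p, na.rm = TRUE),
      {{ em_var }} := pmax({{ em_var }}, em_cut)
    ) %>%
    ungroup() %>%
    select(-em_cut)
}

# Toy usage with made-up values:
toy <- tibble(
  biome_name_freq  = c("a", "a", "a"),
  tree_canopy_prop = c(0, 0.5, 1)
)
toy_alt <- raise_to_quantile(toy, tree_canopy_prop, p = 0.6)
# tree_canopy_prop becomes 0.6, 0.6, 1.0
```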
The above sequence of scripts will work assuming all of the data are available locally on the computer in the locations specified by the relative paths, as constructed in the statements using the here() package. For example, the code above expects to find the R object, zcta_ca_biome_highest.R, in the “data-processed” folder one level beneath the directory corresponding to the R project, the folder name of which is “climate-health.”
These code files explain and gather data from various publicly available sources. As spatial data can be more time-consuming to gather, I don’t run these scripts every time. Instead, I have saved the relevant information to create aspatial zip-code-level look-up tables that are joined with the main effect-estimate data in data-mgmt-analyses-before-modeling-RD-as-outcome.R.
get-zcta-california-geo.R
This code uses tidycensus and other functions to read in publicly available geometry data for zip codes in the State of California. It also reads in zip-code-level lookup tables for rural-urban commuting codes.
link-zctas-biome.R
This code links zip codes with biomes.
california-air-basins-merge-w-zctas.R
This code reads in data on air basins and merges them with the zip-code geometry.
climate-health/docs/wf-health-scenarios-tables.Rmd:
This RMarkdown document creates tables that can be readily pasted into a Word document for the manuscript. It relies on results from the main pipeline being either saved locally or created in memory during the session.
Directory: The shiny app’s folder is
climate-health/climate-health-shiny
Data management: The shiny app expects all of its input data to be contained within that folder, so some additional data management is needed to get the shiny app to work. This script (climate-health-shiny/manage-data-for-app.R) loads the data created elsewhere in the pipeline and brings it into the shiny app folder. So if the pipeline is altered, this script should be re-run to update the shiny app’s data.
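A minimal sketch of what such a data-management script might do follows. The object and file names here are purely illustrative assumptions, not the script’s actual contents:

```r
# Hypothetical sketch of manage-data-for-app.R: load a pipeline
# output and re-save it inside the shiny app's folder so the deployed
# app is self-contained. File and object names are assumed.
library(here)

load(here("data-processed", "wf_eff_emm_wide.RData"))  # assumed name
save(
  wf_eff_emm_wide,
  file = here("climate-health-shiny", "wf_eff_emm_wide.RData")
)
```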
The app’s script: The script to run the shiny app is app-wf-int.R.