By Michael Garber

Revised September 23, 2025

URL of this page: https://michaeldgarber.github.io/climate-health/describe-code-pipeline

Corresponding GitHub repository: https://github.com/michaeldgarber/climate-health/tree/main

Overview

For an overview of the project, please see here:

https://michaeldgarber.github.io/climate-health/wf-health-scenarios-doc.html

This document describes code that produces the results. The code files to conduct the main steps of the analysis are located in the scripts folder of this repository: https://github.com/michaeldgarber/climate-health/tree/main.

The three R scripts to run this analysis are the following:

  1. data-mgmt-analyses-before-modeling-RD-as-outcome.R Loads data from various folders, combines the data into a single dataset, and creates some new variables. It also explores the data with plots and maps.
  2. model-RD-as-outcome.R Fits models for the effect of the intervention variables on the rate difference.
  3. predict-values-alt-tree-ac.R Defines scenarios and predicts alternative values of the RD based on the scenario values and the models from Step 2.

To run all three scripts at once, source-scripts-general-eff-mod.R may be used; it “sources” all three scripts in order.
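In essence, that wrapper does something like the following (a sketch, assuming the here package is used to build paths from the project root, as in the source() call shown later in this document):

library(here)

# Run the three main scripts in order; each builds on objects created by the previous one
source(here("scripts", "data-mgmt-analyses-before-modeling-RD-as-outcome.R"))
source(here("scripts", "model-RD-as-outcome.R"))
source(here("scripts", "predict-values-alt-tree-ac.R"))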

What each script does in more detail

1. Import and read data

The script: data-mgmt-analyses-before-modeling-RD-as-outcome.R

What it does: The purpose of this code is to read in data on the spatially varying estimates of the effect of wildfire on hospitalizations, along with covariates, and to create a unified ZCTA-level dataset that will be used in the modeling step. Note that I use zip code interchangeably with ZCTA here.

Data used: The code begins by reading in information from various sources. Specifically, data include:

  • Effect estimates relating wildfire days with hospitalizations from Do et al. (2024). The measures are the rate difference (variable name: rd_quo_pt) and its corresponding standard deviation (rd_quo_sd). The _quo suffix in the variable name differentiates the “status-quo” rate difference from the alternative rate differences used later in the pipeline. Other variables related to the rate difference are apparent from the dplyr::mutate() step in the code.

  • Various zip-code-level demographic variables, some of which are included, for convenience, in the .csv file shared by the authors of the Do et al. article. Other covariates are gathered directly from the Healthy Places Index of the Public Health Alliance of Southern California.

  • Each zip code’s biome (variable name: biome_name_freq, where the _freq denotes that the data are ordered by frequency)

  • Each zip code’s rural-urban commuting area (RUCA) category (variable name: ruca_cat, a 4-category variable in which the category 1-3 is the most urban and the other categories are progressively more rural)

  • Each zip code’s proportion of land cover with impervious surface (variable name: imperv_prop)

  • Each zip code’s air basin (variable name: basin_name)

Processing: It then links all of the information together at the zip-code level using dplyr::left_join(), joining by the field zcta that each dataset has in common.
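Schematically, the linkage looks something like this (the input and lookup-table names are illustrative, not the actual object names in the script):

library(dplyr)

wf_eff_emm_wide = wf_eff_zcta %>%                 # effect estimates from Do et al. (illustrative name)
  left_join(hpi_covariates, by = "zcta") %>%      # demographic / Healthy Places Index covariates
  left_join(zcta_biome_lookup, by = "zcta") %>%   # biome_name_freq
  left_join(zcta_ruca_lookup, by = "zcta") %>%    # ruca_cat
  left_join(zcta_imperv_lookup, by = "zcta") %>%  # imperv_prop
  left_join(zcta_basin_lookup, by = "zcta")       # basin_name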

Quality checks and data exploration: The code subsequently creates maps and plots to assess univariate distributions of the variables along with bivariate distributions to assess how variables relate to one another. These checks are performed to gain a better understanding of the data and to inform modeling.
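For instance, the checks are along these lines (a sketch, not the exact plots in the script):

library(ggplot2)

# Univariate: distribution of the status-quo rate difference
ggplot(wf_eff_emm_wide, aes(x = rd_quo_pt)) +
  geom_histogram()

# Bivariate: rate difference versus square root of tree canopy, by rural-urban category
ggplot(wf_eff_emm_wide, aes(x = tree_canopy_prop_sqrt, y = rd_quo_pt, colour = ruca_cat)) +
  geom_point()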

Outputs: The code has two important outputs:

  • wf_eff_emm_wide.R: This is the final zip-code-level dataset with the above information linked together.

  • wf_eff_emm_wide_no_outliers.R: The same dataset with observations containing rate-difference outliers removed.

2. Model the rate difference as the outcome

The script: model-RD-as-outcome.R

What it does: The purpose of this code is to model the spatially varying rate difference (RD) at the ZCTA level using the unified dataset created previously. It fits a series of weighted Gaussian generalized linear models to estimate how the RD varies as a function of the intervention constructs of interest (either tree canopy or air conditioning), along with covariates that we expect may confound the association between these intervention variables and the outcome.

Modeling approach:

Covariates: For both tree canopy and air conditioning, we tried various models, aiming to control for confounding of the association between the intervention variable of interest and the RD while keeping the models relatively parsimonious. In each model, in addition to the intervention of interest (either tree canopy or air conditioning), we included the following covariates:

  • proportion insured

  • rural-urban category

  • biome, and

  • two interaction terms

    • the intervention variable (square root of tree canopy, or the air-conditioning proportion) times the rural-urban category

    • the intervention variable times biome.

Weighting: Because the model outcome is itself a modeled result (with a point estimate and a standard error), we used weighted regression so that observations (ZCTAs) with more precise estimates receive more weight. The weight variable is rd_quo_weight. It is defined in the previous script as rd_quo_weight = rd_quo_sd_min / rd_quo_sd, where the denominator is the standard deviation of the modeled RD in that ZCTA and the numerator is the minimum SD over all ZCTAs. ZCTAs with higher SDs therefore receive lower weights.
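In dplyr terms, that definition amounts to something like the following (a sketch of the calculation described above, not the verbatim code):

library(dplyr)

wf_eff_emm_wide = wf_eff_emm_wide %>%
  mutate(
    rd_quo_sd_min = min(rd_quo_sd, na.rm = TRUE),  # minimum SD across all ZCTAs
    rd_quo_weight = rd_quo_sd_min / rd_quo_sd      # less precise (higher-SD) ZCTAs get smaller weights
  )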

For reference, the code to fit each final glm model follows:

For tree canopy as the intervention: glm_rd_tree_conf_ins_ruca_biome_int_ruca_biome

glm_rd_tree_conf_ins_ruca_biome_int_ruca_biome = glm(
  rd_quo_pt ~ tree_canopy_prop_sqrt + insured_prop + ruca_cat +
    biome_name_freq +
    tree_canopy_prop_sqrt * ruca_cat +
    tree_canopy_prop_sqrt * biome_name_freq,
  family = gaussian,
  na.action = "na.exclude",
  weights = rd_quo_weight,
  data = wf_eff_emm_wide_no_outliers
)

For air conditioning as the intervention: glm_rd_ac_conf_ins_ruca_biome_int_ruca_biome

glm_rd_ac_conf_ins_ruca_biome_int_ruca_biome = glm(
  rd_quo_pt ~ ac_prop + insured_prop + ruca_cat +
    biome_name_freq +
    ac_prop * ruca_cat +
    ac_prop * biome_name_freq,
  family = gaussian,
  na.action = "na.exclude",
  weights = rd_quo_weight,
  data = wf_eff_emm_wide_no_outliers
)

Output: This script does not save any output to disk, as it runs quite quickly. Its outputs (the model objects beginning with glm_) are used in the next script.
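For example, either fitted object can be inspected interactively before moving on (this is not part of the script itself):

summary(glm_rd_tree_conf_ins_ruca_biome_int_ruca_biome)  # coefficients, SEs, and fit statistics
summary(glm_rd_ac_conf_ins_ruca_biome_int_ruca_biome)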

3. Predict alternative rate differences under various scenarios

The script: predict-values-alt-tree-ac.R

Dependencies: It depends on the model objects from the modeling script being available in the session’s memory. That script does not need to be run manually, however; at the top of this script, it is run via source(here("scripts","model-RD-as-outcome.R")).

What it does, in summary: The purpose of the script is to predict a range of alternative rate differences for the effect of wildfire on acute-care utilization under various scenarios. To define the alternative scenarios, for each intervention measure, the script finds quintile values (i.e., the 20th, 40th, 60th, and 80th percentiles) over ZCTAs within each biome. Percentiles are calculated within biome to ensure that scenarios are realistic for a given biome. Then, for each scenario, zip codes with a value of the effect modifier below the scenario’s percentile have that effect modifier raised to the level of the percentile, and a new rate difference is predicted from the new value, holding all other variables constant.

The script then calculates the difference in rate differences between the status-quo and alternative scenarios, as well as the difference in the absolute number of cases. To wash out the influence of the modeling itself, the “status-quo” scenario is also predicted from the model, so the comparison is akin to asking: what would the status-quo RD be if the patterns described by the model held true and all variables were at their status-quo levels? The script also produces confidence intervals for these measures via bootstrapping. Finally, it calculates various summary measures, summing over all ZCTAs and stratifying by rural-urban category and biome.
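As a simplified sketch of the scenario logic for tree canopy (the actual script handles both intervention measures, bootstrapping, and case counts; names such as scenario_p and the intermediate objects are illustrative):

library(dplyr)

# 1. Find, say, the 60th-percentile tree canopy within each biome
scenario_p = 0.6
scenario_values = wf_eff_emm_wide_no_outliers %>%
  group_by(biome_name_freq) %>%
  summarise(tree_canopy_scenario = quantile(tree_canopy_prop_sqrt, probs = scenario_p, na.rm = TRUE))

# 2. Raise ZCTAs below that percentile up to it, holding all other variables constant
newdata_alt = wf_eff_emm_wide_no_outliers %>%
  left_join(scenario_values, by = "biome_name_freq") %>%
  mutate(tree_canopy_prop_sqrt = pmax(tree_canopy_prop_sqrt, tree_canopy_scenario))

# 3. Predict status-quo and alternative RDs from the fitted model, then take their difference
rd_quo_hat = predict(glm_rd_tree_conf_ins_ruca_biome_int_ruca_biome,
                     newdata = wf_eff_emm_wide_no_outliers)
rd_alt_hat = predict(glm_rd_tree_conf_ins_ruca_biome_int_ruca_biome,
                     newdata = newdata_alt)
rd_diff = rd_alt_hat - rd_quo_hat  # difference in rate differences for each ZCTA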

Step-by-step details:

In progress

On integrating different effect modifiers: To integrate a new effect modifier, the code could be modified slightly by finding and replacing the name of one of the existing effect modifiers with the name of the new effect modifier of interest.

We considered including an option in the functions to input the name of an effect modifier as a function argument. This would reduce the need for copy-and-paste and keep the code more DRY (don’t repeat yourself). However, it would also make the functions more complex and therefore more error-prone, so they do not currently have this flexibility.
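For reference, the parameterization we considered (but did not implement) might have looked roughly like this, passing the effect-modifier column as an argument; none of these function or argument names exist in the scripts:

library(dplyr)

scenario_values_by_biome = function(data, effect_modifier, scenario_p = 0.6) {
  # effect_modifier is an unquoted column name, handled with tidy evaluation ({{ }})
  data %>%
    group_by(biome_name_freq) %>%
    summarise(scenario_value = quantile({{ effect_modifier }}, probs = scenario_p, na.rm = TRUE))
}

# e.g., scenario_values_by_biome(wf_eff_emm_wide_no_outliers, tree_canopy_prop_sqrt)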

Other code beyond the main pipeline (Data-gathering R scripts, R Markdown, Shiny)

Scripts to gather data

The above sequence of scripts will work assuming all of the data are available locally on the computer in the locations specified by the relative paths, as indicated in the statements that set the working directory using the here package. For example, the code above expects to find the R object zcta_ca_biome_highest.R in the “data-processed” folder one level beneath the directory corresponding to the R project, whose folder name is “climate-health.”
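For example, assuming the project root is the climate-health folder, here() resolves relative paths like this (illustrative only):

library(here)
# here() builds paths relative to the project root (the "climate-health" folder)
here("data-processed", "zcta_ca_biome_highest.R")
#> "<path to project>/climate-health/data-processed/zcta_ca_biome_highest.R"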

These code files gather data from various publicly available sources and document where they come from. Because spatial data can be time-consuming to gather, I don’t run these scripts every time. Instead, I have saved the relevant information to create aspatial zip-code-level look-up tables that are joined with the main effect-estimate data in data-mgmt-analyses-before-modeling-RD-as-outcome.R.

  • get-zcta-california-geo.R This code uses the tidycensus package and other functions to read in publicly available geometry data for zip codes in the State of California (a rough sketch of this kind of call follows this list). It also reads in zip-code-level lookup tables for rural-urban commuting codes.

  • link-zctas-biome.R This code links zip codes with biomes.

  • california-air-basins-merge-w-zctas.R This code reads in data on air basins and merges them with the zip-code geometry.
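As an example of the kind of tidycensus call used for the geometry data in get-zcta-california-geo.R (a sketch; the variable, year, and options are placeholders, not necessarily what the script uses):

library(tidycensus)

zcta_ca_geo = get_acs(
  geography = "zcta",
  variables = "B01003_001",  # total population, used here only as a placeholder variable
  state = "CA",
  year = 2019,
  geometry = TRUE            # also return ZCTA boundaries as an sf object
)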

RMarkdown documents to create tables

climate-health/docs/wf-health-scenarios-tables.Rmd: This RMarkdown document creates tables that can be readily pasted into a Word document for the manuscript. It relies on results from the main pipeline being either saved locally or created in memory during the session.

The shiny app

Directory: The shiny app’s folder is climate-health/climate-health-shiny

Data management: The shiny app expects all of its input data to be contained within that folder, so there is some additional data management needed to get the shiny app to work. This script (climate-health-shiny/manage-data-for-app.R) loads the data created elsewhere in the pipeline and brings it into the shiny app folder. So if the pipeline is altered, this script should be re-run to update the shiny app’s data.
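The idea, roughly (file and object names here are placeholders, not the actual ones in manage-data-for-app.R):

library(here)

# Load an output created by the main pipeline...
load(here("data-processed", "results_for_app.RData"))  # placeholder file name
# ...and re-save it inside the shiny app's own folder so the app can find it
save(results_for_app, file = here("climate-health-shiny", "results_for_app.RData"))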

The app’s script: The script defining the shiny app is app-wf-int.R.
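With the data in place, the app can be launched locally along these lines (assuming app-wf-int.R ends with an expression that produces a shiny app object, which runApp() accepts directly):

library(shiny)
library(here)

runApp(here("climate-health-shiny", "app-wf-int.R"))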