Here we show how to apply Bayesian kernel machine regression (BKMR) using the R package bkmr to a simulated dataset. We use one of the datasets generated for the workshop Statistical Approaches for Assessing Health Effects of Environmental Chemical Mixtures in Epidemiology Studies, hosted by the National Institute for Environmental Health Sciences (NIEHS) in 2015.

The goal of this workshop was to bring together statisticians, epidemiologists, and toxicologists to compare the performance and characteristics of statistical methods for estimating the health effects of mixtures by applying the methods to a set of common datasets. These common datasets including two simulated datasets and one ‘real world’ dataset. We will focus on the first simulated dataset (Data Set #1). Additional details on the workshop and on the datasets, as well as the simulated datasets themselves, are available at the workshop website.

1 Load packages and download data

To run the code in this document, you will need to have the following R packages installed: bkmr, readxl (to load in the dataset), and ggplot2 (to generate plots). To install the packages, which only needs to be done once, type

install.packages("bkmr")
install.packages("readxl")
install.packages("ggplot2")

We begin by loading the bkmr package, as well as the ggplot2 package, which will be used to generate plots:

library(bkmr)
library(ggplot2)

Next, we need to load in the example dataset. Note that the dataset file DataSet1.xls should be located in your current working directory. You can see what your working directory is by typing getwd(). The following code checks to see if the dataset is located in the current working directory, and if not it downloads the dataset (.xls file) into the current working directory.

if (!"DataSet1.xls" %in% list.files(getwd())) {
    download.file("http://www.niehs.nih.gov/about/events/pastmtg/2015/statistical/dataset1xls.xls", "DataSet1.xls")
}
dat <- readxl::read_excel("DataSet1.xls", sheet = 1, col_names =  TRUE)

2 Explore the data

Let’s look at the first few rows of the dataset.

head(dat)
## # A tibble: 6 x 10
##     obs         Y        X1        X2        X3        X4        X5
##   <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
## 1     1  7.534686 0.4157066 0.5308077 0.2223965 1.1592634 2.4577556
## 2     2 19.611934 0.5293572 0.9339570 1.1210595 1.3350074 0.3096883
## 3     3 12.664050 0.4849759 0.7210988 0.4629027 1.0334138 0.9492810
## 4     4 15.600289 0.8275456 1.0457137 0.9699040 0.9045099 0.9107914
## 5     5 18.606498 0.5190363 0.7802400 0.6142188 0.3729743 0.5038126
## 6     6 18.525891 0.4009491 0.8639886 0.5501847 0.9011016 1.2907615
## # ... with 3 more variables: X6 <dbl>, X7 <dbl>, Z <dbl>

In this dataset, there are 500 rows, an outcome variable Y, seven exposures X1 through X7, and a single covariate Z. To fit the BKMR model, we define a matrix of exposures and a matrix of covariates.

covar <- data.matrix(dat[, "Z"])
expos <- data.matrix(dat[, c("X1", "X2", "X3", "X4", "X5", "X6", "X7")])
Y <- dat$Y

Note that in the original methodology paper in which BKMR was introduced (Bobb et al. 2015), the exposure variables of interest were denoted by z and the covariates that were adjusted for were denoted by x. For consistency with that literature, here we will rename the exposures and covariates accordingly.

colnames(covar) <- "x"
colnames(expos) <- paste0("z", 1:ncol(expos))

It is a good idea for all of the exposures to have the same scale, so here we z-score the exposures:

scale_expos <- scale(expos)

3 Fit the BKMR model

To fit the BKMR model, we use the kmbayes function, which implements a Markov chain Monte Carlo (MCMC) algorithm.

set.seed(1000)
fitkm <- kmbayes(Y, Z = scale_expos, X = covar, iter = 10000, 
                 verbose = FALSE, varsel = TRUE)

The argument iter indicates the number of iterations of the MCMC sampler; verbose indicates how much interim output summarizing the progress of the model fitting should be printed; and varsel indicates whether to conduct variable selection on the exposures. You can type ?kmbayes to view the help file for this function, which provides additional documentation.

The interim output tells you about the algorithm as it is running, including the number of iterations competed and the amount of time elapsed since the algorithm started. With verbose = TRUE (the default), the interim output will additionally include the acceptance rates of parameters that are updated by using the Metropolis-Hastings algorithm.

4 Investigate model convergence

Let’s visually inspect the trace plots, showing how various parameter values change as the sampler runs.

TracePlot(fit = fitkm, par = "beta")

TracePlot(fit = fitkm, par = "sigsq.eps")

TracePlot(fit = fitkm, par = "r", comp = 7)

5 Posterior inclusion probabilities

Because we fit the model with variable selection (we set the argument varsel to be TRUE), we can estimate the posterior inclusion probability (PIP) for each of the exposures.

ExtractPIPs(fitkm)
##   variable    PIP
## 1       z1 1.0000
## 2       z2 1.0000
## 3       z3 0.0004
## 4       z4 1.0000
## 5       z5 1.0000
## 6       z6 0.0022
## 7       z7 1.0000

The PIPs can be thought of as a measure of variable importance, with higher values (closer to 1) indicating higher importance, and lower values (closer to 0) indicating lower importance.

6 Plot the exposure-response function

Once we have fit the BKMR model, we usually would like to visualize the exposure-response function. Because we can’t view a high-dimensional surface, we instead look at different cross-sections of this surface. We do this by focusing on the relationships of 1 or 2 exposures with the outcome and setting the remaining exposures to specific values.

6.1 Univariate cross-sections

One cross-section of interest is the univariate relationship between each exposure and the outcome, where all of the other exposures are fixed to a particular quantile (e.g., 75th percentile). This can be done using the function PredictorResponseUnivar. The argument specifying the quantile at which to fix the other exposures is given by q.fixed. The default choice is to fix the other exposures at their median values (i.e., q.fixed = 0.5).

pred.resp.univar <- PredictorResponseUnivar(fit = fitkm)

We use the ggplot2 package to plot the resulting cross sections.

ggplot(pred.resp.univar, aes(z, est, ymin = est - 1.96*se, 
                             ymax = est + 1.96*se)) + 
    geom_smooth(stat = "identity") + 
    facet_wrap(~variable, ncol = 4) +
    xlab("expos") +
    ylab("h(expos)")

6.2 Bivariate cross-sections

We can similarly visualize the bivarate exposure-response function for two exposures, where all of the other exposures are fixed at a particular quantile (default is the median). This is done using the PredictorResponseBivar function. By default, this function will estimate the bivariate exposure-response function for all pairs of exposures. Because for this example z3 and z6 were not associated with the outcome (see their posterior inclusion probabilities and univariate exposure-response functions above), here we focus on the remaining exposure variables.

expos.pairs <- subset(data.frame(expand.grid(expos1 = c(1,2,4,5,7), expos2 = c(1,2,4,5,7))), expos1 < expos2)
expos.pairs
##    expos1 expos2
## 6       1      2
## 11      1      4
## 12      2      4
## 16      1      5
## 17      2      5
## 18      4      5
## 21      1      7
## 22      2      7
## 23      4      7
## 24      5      7
pred.resp.bivar <- PredictorResponseBivar(fit = fitkm, 
                                          min.plot.dist = 0.5,
                                          z.pairs = expos.pairs)

This output can be used to create image (or contour) plots.

ggplot(pred.resp.bivar, aes(z1, z2, fill = est)) + 
    geom_raster() + 
    facet_grid(variable2 ~ variable1) +
    scale_fill_gradientn(colours = c("#0000FFFF","#FFFFFFFF","#FF0000FF")) +
    xlab("expos1") +
    ylab("expos2") +
    ggtitle("h(expos1, expos2)")

Because we specified min.plot.dist = 0.5 as an argument in the PredictorResponseBivar function, the exposure-response surface is only estimated for points that are within 0.5 units from an observed data point. Points farther than this distance are grayed out in the plot.

Since it can be hard to see what’s going on in these types of plots, an alternative approach is to investigate the exposure-response function of a single exposure where the second exposure is fixed at various quantiles. This can be done using the PredictorResponseBivarLevels function, which takes as input the bivariate exposure-response function outputted from the previous command, where the argument qs specifies a sequence of quantiles at which to fix the second exposure

pred.resp.bivar.levels <- PredictorResponseBivarLevels(
    pred.resp.bivar, scale_expos, qs = c(0.10, 0.5, 0.90))
ggplot(pred.resp.bivar.levels, aes(z1, est)) + 
    geom_smooth(aes(col = quantile), stat = "identity") + 
    facet_grid(variable2 ~ variable1) +
    ggtitle("h(expos1 | quantiles of expos2)") +
    xlab("expos1")

6.3 Three-way interactions

Building upon the previous section, we can investigate possible 3-way interactions by visualizing the bivariate exposure-response function for a third exposure fixed at different quantiles. For example, here we estimate and plot the exposure response-function of z7 and z1 for z5 fixed at its 10th, 50th, and 90th percentiles:

pred.bivar.10 <- PredictorResponseBivar(
    fit = fitkm, min.plot.dist = 1, z.pairs = rbind(c(7,1)), 
    whichz3 = 5, prob = 0.10)
pred.bivar.50 <- PredictorResponseBivar(
    fit = fitkm, min.plot.dist = 1, z.pairs = rbind(c(7,1)), 
    whichz3 = 5, prob = 0.50)
pred.bivar.90 <- PredictorResponseBivar(
    fit = fitkm, min.plot.dist = 1, z.pairs = rbind(c(7,1)), 
    whichz3 = 5, prob = 0.90)

pred.bivar.10$expos5 <- "10% of z5"
pred.bivar.50$expos5 <- "50% of z5"
pred.bivar.90$expos5 <- "90% of z5"

pred.bivar.by.expos5 <- rbind(pred.bivar.10, pred.bivar.50, pred.bivar.90)
pred.bivar.by.expos5$expos5 <- as.factor(pred.bivar.by.expos5$expos5)
ggplot(pred.bivar.by.expos5, aes(z1, z2, fill = est)) + 
    geom_raster() + 
    facet_wrap(~expos5) +
    scale_fill_gradientn(colours = c("#0000FFFF","#FFFFFFFF","#FF0000FF")) +
    xlab(unique(pred.bivar.by.expos5$variable1)) +
    ylab(unique(pred.bivar.by.expos5$variable2))

7 Calculate summary statistics

In addition to visually inspecting the estimated exposure-response function, one may also wish to calculate a range of summary statistics that highlight specific features of the high-dimensional surface.

7.1 Cumulative effects

One potential summary measure of interest is to compute the overall effect of the mixture, by comparing the value of the exposure-response function when all of exposures are at a particular quantile as compared to when all of them are at their median value. The function OverallRiskSummaries allows one to specify a sequence of quantiles using the argument qs and the fixed quantile using the argument q.fixed (the default is the median value).

risks.overall <- OverallRiskSummaries(fit = fitkm, qs = seq(0.25, 0.75, by = 0.05), q.fixed = 0.5)
risks.overall
##    quantile        est         sd
## 1      0.25 -3.3406701 0.28277781
## 2      0.30 -2.9172379 0.22262555
## 3      0.35 -2.2685305 0.16803133
## 4      0.40 -1.3294117 0.11029672
## 5      0.45 -0.5111734 0.04964008
## 6      0.50  0.0000000 0.00000000
## 7      0.55  0.6657768 0.05670175
## 8      0.60  1.3125629 0.12339046
## 9      0.65  1.7372179 0.18014540
## 10     0.70  2.1827869 0.25463784
## 11     0.75  2.3676724 0.35668475

Here we see that as compared to when all of the exposures are fixed at their median value, when all of the exposures are at their 75th percentile the outcome increases by 2.37 units.

We can also plot these overall risk summaries:

ggplot(risks.overall, aes(quantile, est, ymin = est - 1.96*sd, 
                          ymax = est + 1.96*sd)) + 
geom_hline(yintercept = 0, lty = 2, col = "brown") +
geom_pointrange()

We can see that as cumulative levels across all exposures increases, the health risks increase. There is also a suggestion of a ceiling effect, which indicates a nonlinear exposure-response function.

7.2 Single-exposure effects

Another summary that may be of interest is to estimate the contribution of individual exposures to the cumulative effect. For example, we may wish to compare risk when a single exposure is at the 75th percentile as compared to when that exposure is at its 25th percentile, where all of the remaining exposures are fixed to a particular quantile. We refer to this as the single-exposure health risks, and these can be computed using the function SingVarRiskSummaries. The two different quantiles at which to compare the risk are specified using the qs.diff argument, and a sequence of values at which to fix the remaining exposures can be specified using the q.fixed argument.

risks.singvar <- SingVarRiskSummaries(
    fit = fitkm, qs.diff = c(0.25, 0.75), 
    q.fixed = c(0.25, 0.50, 0.75))

subset(risks.singvar, variable %in% c("z1", "z2", "z4", "z5", "z7"))
## # A tibble: 15 x 4
##    q.fixed variable        est        sd
##     <fctr>   <fctr>      <dbl>     <dbl>
##  1    0.25       z1  3.7965650 0.5578806
##  2    0.25       z2  1.8420522 0.5475556
##  3    0.25       z4 -0.8635769 0.3048438
##  4    0.25       z5 -4.9552478 0.3666194
##  5    0.25       z7  5.6091526 0.4116644
##  6     0.5       z1  4.1889493 0.4184572
##  7     0.5       z2  2.0204781 0.4136215
##  8     0.5       z4 -0.7409458 0.2627585
##  9     0.5       z5 -5.5182836 0.2997611
## 10     0.5       z7  5.8774231 0.3192093
## 11    0.75       z1  4.4643997 0.4866115
## 12    0.75       z2  2.1159967 0.5733866
## 13    0.75       z4 -0.5189454 0.3296200
## 14    0.75       z5 -6.1126144 0.3946934
## 15    0.75       z7  6.0002724 0.4401308

Here we see, for example, that when exposure z7 increases from its 25th to its 75th percentile (and the other exposures remain fixed at their 75th percentile), the outcome increases by 6 units. We can also plot these numerical summaries, which makes it easier to identify the respective contributions of individual exposures to the health risks:

ggplot(risks.singvar, aes(variable, est, ymin = est - 1.96*sd, ymax = est + 1.96*sd, 
                          col = q.fixed)) + 
    geom_pointrange(position = position_dodge(width = 0.75)) + coord_flip()

7.3 Interactions

We may wish to compute specific ‘interaction’ parameters. For example, we may which to compare the single-exposure health risks when all of the other exposures are fixed to their 75th percentile to when all of the other exposures are fixed to their 25th percentile. In the previous plot, this corresponds to substracting the estimate represented by the red circle from the estimate represented by the blue circle. This can be done using the function SingVarIntSummaries.

risks.int <- SingVarIntSummaries(fit = fitkm, 
qs.diff = c(0.25, 0.75), qs.fixed = c(0.25, 0.75))
risks.int
## # A tibble: 7 x 3
##   variable           est         sd
##     <fctr>         <dbl>      <dbl>
## 1       z1  0.6678346664 0.68815716
## 2       z2  0.2739445277 0.80383535
## 3       z3 -0.0001114915 0.03373484
## 4       z4  0.3446315576 0.41077469
## 5       z5 -1.1573665853 0.51849356
## 6       z6  0.0160479504 0.07432786
## 7       z7  0.3911197472 0.60639524
ggplot(risks.int, aes(variable, est, ymin = est - 1.96*sd, ymax = est + 1.96*sd)) + 
    geom_pointrange(position = position_dodge(width = 0.75)) +
    geom_hline(yintercept = 0, lty = 2, col = "brown") +
    coord_flip()

We see that the single-exposure risk for \(z_5\) (associated with a change from its 25th to 75th percentile) decreases by 1.16 units when the remaining exposures are fixed at their 25th percentile, as compared to when they are fixed at their 75th percentile.

8 Comparison of BKMR fit to true exposure-response function

After applying each method to the simulated dataset, workshop partipants were informed of the true model used to generate the data. These solutions have been posted on the conference website (.pdf solutions). For Simulated Dataset 1, the model used to generate the data was given by \[ Y_i = h(z_{i1}, z_{i2}, z_{i4}, z_{i5}, z_{i7}) + 2x_{i} + \epsilon_i, \] where \(\epsilon_i ~ N(0, 2.32^2)\), and the exposure-response function was a biologically-based dose-response function based on endocrine disruption, \[ h(z_1, z_2, z_4, z_5, z_7) = \alpha_0 + \frac{\alpha_1(\frac{T}{K_t} + \frac{z_1}{K_1} + \frac{z_2}{K_2})}{1 + \frac{T}{K_t} + \frac{z_1}{K_1} + \frac{z_2}{K_2} + \frac{z_4}{K_4} + \frac{z_5}{K_5}}\left(R_{00} + \frac{\lambda z_7}{K_7 + z_7}\right), \] where \((T/K_t, K_1, K_2, K_4, K_5, K_7, R_{00}, \lambda) = (0.5, 1.5, 3, 4.5, 1, 0.5, 1, 2)\).

8.1 Overall fit

To evaluate the overall estimates of the BKMR fit to the true exposure-response function, here we plot the estimated \(h_i = h(z_{i1}, z_{i2}, z_{i4}, z_{i5}, z_{i7})\) to the true \(h_i\).

htrue_fun <- function(z1, z2, z4, z5, z7) {
    alpha0 <- 2
    alpha1 <- 20
    TKt <- 0.5
    K1 <- 1.5
    K2 <- 3
    K4 <- 4.5
    K5 <- 1
    K7 <- 0.5
    R00 <- 1
    lambda <- 2
    R0 <- R00 + lambda*z7/(K7 + z7)
    num <- TKt + z1/K1 + z2/K2
    alpha0 + alpha1*num/(1 + num + z4/K4 + z5/K5)*R0
}

hi_true <- htrue_fun(z1 = expos[, "z1"], z2 = expos[, "z2"], z4 = expos[, "z4"], z5 = expos[, "z5"], z7 = expos[, "z7"])
hi_est <- ComputePostmeanHnew(fitkm)$postmean

plot(hi_true, hi_est, xlab = "True h", ylab = "Estimated h")
abline(0, 1, col = "red")

8.2 Univariate exposure-response relationships

Here we plot the estimated univariate exposure-response functions estimate from BKMR to the true exposure-response functions, where each of the other exposures are fixed at their median values.

meds <- apply(expos, 2, median)
df_preds_true <- data.frame()
for (i in 1:ncol(expos)) {
    new_grid <- as.list(meds)
    new_grid[[i]] <- seq(min(expos[, i]), max(expos[, i]), length.out = 50)
    new_grid <- expand.grid(new_grid)
    pred <- with(new_grid, htrue_fun(z1, z2, z4, z5, z7))
    pred_center <- pred - mean(pred)
    df0 <- data.frame(variable = paste0("z", i), 
                      expos = new_grid[, i],
                      z = (new_grid[, i] - mean(expos[, i]))/sd(expos[, i]),
                      est = pred_center)
    df_preds_true <- rbind(df_preds_true, df0)
}
df_preds_true$se <- NA

ggplot(pred.resp.univar, aes(z, est, ymin = est - 1.96*se, 
                             ymax = est + 1.96*se)) + 
    geom_smooth(stat = "identity") + 
    geom_line(data = df_preds_true, col = "red", lty = 2) +
    facet_wrap(~variable, ncol = 4) +
    xlab("expos") +
    ylab("h(expos)")

We see that our estimates of the exposure-response function (blue solid line) well-align with the true exposure-response function (red dashed line).

9 Speeding up computation time

We consider a Gaussian predictive process approach for speeding up the model fitting. To do this, we first need to define a matrix of knots that cover the exposure space, where we select a smaller number of knots than the number of exposures. We will consider both using 100 knots and using 50 knots.

set.seed(1000)
knots100 <- fields::cover.design(scale_expos, nd = 100)$design
fitkm_knots100 <- kmbayes(Y, scale_expos, covar, iter = 10000, 
                          verbose = FALSE, varsel = TRUE,
                          knots = knots100)

set.seed(1000)
knots50 <- fields::cover.design(scale_expos, nd = 50)$design
fitkm_knots50 <- kmbayes(Y, scale_expos, covar, iter = 10000, 
                             verbose = FALSE, varsel = TRUE,
                             knots = knots50)

Let’s compare the computation time:

time_per_iter <- data.frame(
    full = with(fitkm, 
                difftime(time2, time1, units = "secs")/iter),
    knots100 = with(fitkm_knots100, 
                    difftime(time2, time1, units = "secs")/iter),
    knots50 = with(fitkm_knots50, 
                   difftime(time2, time1, units = "secs")/iter)
)
time_per_iter
##             full        knots100         knots50
## 1 0.1368005 secs 0.07026066 secs 0.03593337 secs

We next see if the reduction in computation time leads to a meaningful reduction in the accuracy of the estimated exposure-response function in the the approximate, Gaussian predictive process version as compared to full BKMR. Below we plot the estimated person-specific health risks \(h_i\) estimated under the full BKMR model to the health risks estimated under the approximate, Gaussian predictive process version, and we calculate the root mean square error (rMSE).

hi_est <- ComputePostmeanHnew(fitkm)$postmean
hi_knot100 <- ComputePostmeanHnew(fitkm_knots100)$postmean
hi_knot50 <- ComputePostmeanHnew(fitkm_knots50)$postmean

par(mfrow = c(1, 2))
plot(hi_est, hi_knot100, xlab = "Full model", 
     ylab = "100 knots", main = "Estimated h")
abline(0, 1, col = "red")
plot(hi_est, hi_knot50, xlab = "Full model", 
     ylab = "50 knots", main = "Estimated h")
abline(0, 1, col = "red")

rMSE <- data.frame(
    knots50 = sqrt(mean((hi_knot50 - hi_est)^2)), 
    knots100 = sqrt(mean((hi_knot100 - hi_est)^2))
)
rMSE
##     knots50   knots100
## 1 0.1232796 0.05223356

10 How to learn more

Details on other functionalities are available in the help page for the kmbayes function (type ?kmbayes to view) and on the package overview guide. Additional information on the statistical methodology and on the computational details are provided in Bobb et al. 2015.

11 References

Bobb, JF, Valeri L, Claus Henn B, Christiani DC, Wright RO, Mazumdar M, Godleski JJ, Coull BA (2015). Bayesian Kernel Machine Regression for Estimating the Health Effects of Multi-Pollutant Mixtures. Biostatistics 16, no. 3: 493–508.