bkmr R package with simulated data from the NIEHS mixtures workshop
Here we show how to apply Bayesian kernel machine regression (BKMR) using the R package bkmr to a simulated dataset. We use one of the datasets generated for the workshop Statistical Approaches for Assessing Health Effects of Environmental Chemical Mixtures in Epidemiology Studies, hosted by the National Institute for Environmental Health Sciences (NIEHS) in 2015.
The goal of this workshop was to bring together statisticians, epidemiologists, and toxicologists to compare the performance and characteristics of statistical methods for estimating the health effects of mixtures by applying the methods to a set of common datasets. These common datasets included two simulated datasets and one ‘real world’ dataset. We will focus on the first simulated dataset (Data Set #1). Additional details on the workshop and on the datasets, as well as the simulated datasets themselves, are available at the workshop website.
To run the code in this document, you will need to have the following R packages installed: bkmr, readxl (to load in the dataset), and ggplot2 (to generate plots). To install the packages, which only needs to be done once, type
install.packages("bkmr")
install.packages("readxl")
install.packages("ggplot2")
We begin by loading the bkmr package, as well as the ggplot2 package, which will be used to generate plots:
library(bkmr)
library(ggplot2)
Next, we need to load in the example dataset. Note that the dataset file DataSet1.xls should be located in your current working directory. You can see what your working directory is by typing getwd(). The following code checks whether the dataset is located in the current working directory and, if not, downloads the dataset (.xls file) into the current working directory.
if (!"DataSet1.xls" %in% list.files(getwd())) {
download.file("http://www.niehs.nih.gov/about/events/pastmtg/2015/statistical/dataset1xls.xls", "DataSet1.xls")
}
dat <- readxl::read_excel("DataSet1.xls", sheet = 1, col_names = TRUE)
Let’s look at the first few rows of the dataset.
head(dat)
## # A tibble: 6 x 10
## obs Y X1 X2 X3 X4 X5
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 7.534686 0.4157066 0.5308077 0.2223965 1.1592634 2.4577556
## 2 2 19.611934 0.5293572 0.9339570 1.1210595 1.3350074 0.3096883
## 3 3 12.664050 0.4849759 0.7210988 0.4629027 1.0334138 0.9492810
## 4 4 15.600289 0.8275456 1.0457137 0.9699040 0.9045099 0.9107914
## 5 5 18.606498 0.5190363 0.7802400 0.6142188 0.3729743 0.5038126
## 6 6 18.525891 0.4009491 0.8639886 0.5501847 0.9011016 1.2907615
## # ... with 3 more variables: X6 <dbl>, X7 <dbl>, Z <dbl>
In this dataset, there are 500 rows, an outcome variable Y, seven exposures X1 through X7, and a single covariate Z. To fit the BKMR model, we define a matrix of exposures and a matrix of covariates.
covar <- data.matrix(dat[, "Z"])
expos <- data.matrix(dat[, c("X1", "X2", "X3", "X4", "X5", "X6", "X7")])
Y <- dat$Y
Note that in the original methodology paper in which BKMR was introduced (Bobb et al. 2015), the exposure variables of interest were denoted by z and the covariates that were adjusted for were denoted by x. For consistency with that literature, here we rename the exposures and covariates accordingly.
colnames(covar) <- "x"
colnames(expos) <- paste0("z", 1:ncol(expos))
It is a good idea for all of the exposures to have the same scale, so here we z-score the exposures:
scale_expos <- scale(expos)
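As a quick sanity check (not part of the original analysis), we can verify that each scaled exposure now has mean zero and standard deviation one:
round(colMeans(scale_expos), 10)  # each column should be (numerically) zero
apply(scale_expos, 2, sd)         # each column should be one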
To fit the BKMR model, we use the kmbayes function, which implements a Markov chain Monte Carlo (MCMC) algorithm.
set.seed(1000)
fitkm <- kmbayes(Y, Z = scale_expos, X = covar, iter = 10000,
verbose = FALSE, varsel = TRUE)
The argument iter indicates the number of iterations of the MCMC sampler; verbose indicates how much interim output summarizing the progress of the model fitting should be printed; and varsel indicates whether to conduct variable selection on the exposures. You can type ?kmbayes to view the help file for this function, which provides additional documentation.
The interim output tells you about the algorithm as it is running, including the number of iterations completed and the amount of time elapsed since the algorithm started. With verbose = TRUE (the default), the interim output will additionally include the acceptance rates of parameters that are updated using the Metropolis-Hastings algorithm.
Let’s visually inspect the trace plots, showing how various parameter values change as the sampler runs.
TracePlot(fit = fitkm, par = "beta")
TracePlot(fit = fitkm, par = "sigsq.eps")
TracePlot(fit = fitkm, par = "r", comp = 7)
Because we fit the model with variable selection (we set the argument varsel to TRUE), we can estimate the posterior inclusion probability (PIP) for each of the exposures.
ExtractPIPs(fitkm)
## variable PIP
## 1 z1 1.0000
## 2 z2 1.0000
## 3 z3 0.0004
## 4 z4 1.0000
## 5 z5 1.0000
## 6 z6 0.0022
## 7 z7 1.0000
The PIPs can be thought of as a measure of variable importance, with higher values (closer to 1) indicating higher importance, and lower values (closer to 0) indicating lower importance.
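For instance, we can rank the exposures by their PIPs:
pips <- ExtractPIPs(fitkm)
pips[order(pips$PIP, decreasing = TRUE), ]  # exposures ranked from highest to lowest PIP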
Once we have fit the BKMR model, we usually would like to visualize the exposure-response function. Because we can’t view a high-dimensional surface, we instead look at different cross-sections of this surface. We do this by focusing on the relationship of one or two exposures with the outcome and setting the remaining exposures to specific values.
One cross-section of interest is the univariate relationship between each exposure and the outcome, where all of the other exposures are fixed at a particular quantile (e.g., the 75th percentile). This can be done using the function PredictorResponseUnivar. The argument specifying the quantile at which to fix the other exposures is given by q.fixed. The default is to fix the other exposures at their median values (i.e., q.fixed = 0.5).
pred.resp.univar <- PredictorResponseUnivar(fit = fitkm)
We use the ggplot2 package to plot the resulting cross-sections.
ggplot(pred.resp.univar, aes(z, est, ymin = est - 1.96*se,
ymax = est + 1.96*se)) +
geom_smooth(stat = "identity") +
facet_wrap(~variable, ncol = 4) +
xlab("expos") +
ylab("h(expos)")
We can similarly visualize the bivariate exposure-response function for two exposures, where all of the other exposures are fixed at a particular quantile (default is the median). This is done using the PredictorResponseBivar function. By default, this function will estimate the bivariate exposure-response function for all pairs of exposures. Because for this example z3 and z6 were not associated with the outcome (see their posterior inclusion probabilities and univariate exposure-response functions above), here we focus on the remaining exposure variables.
expos.pairs <- subset(data.frame(expand.grid(expos1 = c(1,2,4,5,7), expos2 = c(1,2,4,5,7))), expos1 < expos2)
expos.pairs
## expos1 expos2
## 6 1 2
## 11 1 4
## 12 2 4
## 16 1 5
## 17 2 5
## 18 4 5
## 21 1 7
## 22 2 7
## 23 4 7
## 24 5 7
pred.resp.bivar <- PredictorResponseBivar(fit = fitkm,
min.plot.dist = 0.5,
z.pairs = expos.pairs)
This output can be used to create image (or contour) plots.
ggplot(pred.resp.bivar, aes(z1, z2, fill = est)) +
geom_raster() +
facet_grid(variable2 ~ variable1) +
scale_fill_gradientn(colours = c("#0000FFFF","#FFFFFFFF","#FF0000FF")) +
xlab("expos1") +
ylab("expos2") +
ggtitle("h(expos1, expos2)")
Because we specified min.plot.dist = 0.5 as an argument to the PredictorResponseBivar function, the exposure-response surface is only estimated at points that are within 0.5 units of an observed data point. Points farther than this distance are grayed out in the plot.
Since it can be hard to see what’s going on in these types of plots, an alternative approach is to investigate the exposure-response function of a single exposure with the second exposure fixed at various quantiles. This can be done using the PredictorResponseBivarLevels function, which takes as input the bivariate exposure-response function outputted from the previous command; the argument qs specifies a sequence of quantiles at which to fix the second exposure.
pred.resp.bivar.levels <- PredictorResponseBivarLevels(
pred.resp.bivar, scale_expos, qs = c(0.10, 0.5, 0.90))
ggplot(pred.resp.bivar.levels, aes(z1, est)) +
geom_smooth(aes(col = quantile), stat = "identity") +
facet_grid(variable2 ~ variable1) +
ggtitle("h(expos1 | quantiles of expos2)") +
xlab("expos1")
Building upon the previous section, we can investigate possible 3-way interactions by visualizing the bivariate exposure-response function with a third exposure fixed at different quantiles. For example, here we estimate and plot the exposure-response function of z7 and z1 with z5 fixed at its 10th, 50th, and 90th percentiles:
pred.bivar.10 <- PredictorResponseBivar(
fit = fitkm, min.plot.dist = 1, z.pairs = rbind(c(7,1)),
whichz3 = 5, prob = 0.10)
pred.bivar.50 <- PredictorResponseBivar(
fit = fitkm, min.plot.dist = 1, z.pairs = rbind(c(7,1)),
whichz3 = 5, prob = 0.50)
pred.bivar.90 <- PredictorResponseBivar(
fit = fitkm, min.plot.dist = 1, z.pairs = rbind(c(7,1)),
whichz3 = 5, prob = 0.90)
pred.bivar.10$expos5 <- "10% of z5"
pred.bivar.50$expos5 <- "50% of z5"
pred.bivar.90$expos5 <- "90% of z5"
pred.bivar.by.expos5 <- rbind(pred.bivar.10, pred.bivar.50, pred.bivar.90)
pred.bivar.by.expos5$expos5 <- as.factor(pred.bivar.by.expos5$expos5)
ggplot(pred.bivar.by.expos5, aes(z1, z2, fill = est)) +
geom_raster() +
facet_wrap(~expos5) +
scale_fill_gradientn(colours = c("#0000FFFF","#FFFFFFFF","#FF0000FF")) +
xlab(unique(pred.bivar.by.expos5$variable1)) +
ylab(unique(pred.bivar.by.expos5$variable2))
In addition to visually inspecting the estimated exposure-response function, one may also wish to calculate a range of summary statistics that highlight specific features of the high-dimensional surface.
One summary measure of interest is the overall effect of the mixture, computed by comparing the value of the exposure-response function when all of the exposures are at a particular quantile to its value when all of them are at their median value. The function OverallRiskSummaries allows one to specify a sequence of quantiles using the argument qs and the fixed quantile using the argument q.fixed (the default is the median value).
risks.overall <- OverallRiskSummaries(fit = fitkm, qs = seq(0.25, 0.75, by = 0.05), q.fixed = 0.5)
risks.overall
## quantile est sd
## 1 0.25 -3.3406701 0.28277781
## 2 0.30 -2.9172379 0.22262555
## 3 0.35 -2.2685305 0.16803133
## 4 0.40 -1.3294117 0.11029672
## 5 0.45 -0.5111734 0.04964008
## 6 0.50 0.0000000 0.00000000
## 7 0.55 0.6657768 0.05670175
## 8 0.60 1.3125629 0.12339046
## 9 0.65 1.7372179 0.18014540
## 10 0.70 2.1827869 0.25463784
## 11 0.75 2.3676724 0.35668475
Here we see that as compared to when all of the exposures are fixed at their median value, when all of the exposures are at their 75th percentile the outcome increases by 2.37 units.
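Since the output provides posterior means and standard deviations, we can also report an approximate 95% interval for this contrast; a small sketch using the normal approximation, matching the 1.96*sd intervals used in the plots below:
with(risks.overall[which.max(risks.overall$quantile), ],
     c(lower = est - 1.96*sd, upper = est + 1.96*sd))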
We can also plot these overall risk summaries:
ggplot(risks.overall, aes(quantile, est, ymin = est - 1.96*sd,
ymax = est + 1.96*sd)) +
geom_hline(yintercept = 0, lty = 2, col = "brown") +
geom_pointrange()
We can see that as the cumulative level across all exposures increases, the health risks increase. There is also a suggestion of a ceiling effect, which indicates a nonlinear exposure-response function.
Another summary of interest is the contribution of individual exposures to the cumulative effect. For example, we may wish to compare the risk when a single exposure is at its 75th percentile to the risk when that exposure is at its 25th percentile, with all of the remaining exposures fixed at a particular quantile. We refer to these as single-exposure health risks, and they can be computed using the function SingVarRiskSummaries. The two quantiles at which to compare the risk are specified using the qs.diff argument, and a sequence of values at which to fix the remaining exposures can be specified using the q.fixed argument.
risks.singvar <- SingVarRiskSummaries(
fit = fitkm, qs.diff = c(0.25, 0.75),
q.fixed = c(0.25, 0.50, 0.75))
subset(risks.singvar, variable %in% c("z1", "z2", "z4", "z5", "z7"))
## # A tibble: 15 x 4
## q.fixed variable est sd
## <fctr> <fctr> <dbl> <dbl>
## 1 0.25 z1 3.7965650 0.5578806
## 2 0.25 z2 1.8420522 0.5475556
## 3 0.25 z4 -0.8635769 0.3048438
## 4 0.25 z5 -4.9552478 0.3666194
## 5 0.25 z7 5.6091526 0.4116644
## 6 0.5 z1 4.1889493 0.4184572
## 7 0.5 z2 2.0204781 0.4136215
## 8 0.5 z4 -0.7409458 0.2627585
## 9 0.5 z5 -5.5182836 0.2997611
## 10 0.5 z7 5.8774231 0.3192093
## 11 0.75 z1 4.4643997 0.4866115
## 12 0.75 z2 2.1159967 0.5733866
## 13 0.75 z4 -0.5189454 0.3296200
## 14 0.75 z5 -6.1126144 0.3946934
## 15 0.75 z7 6.0002724 0.4401308
Here we see, for example, that when exposure z7 increases from its 25th to its 75th percentile (and the other exposures remain fixed at their 75th percentile), the outcome increases by 6 units. We can also plot these numerical summaries, which makes it easier to identify the respective contributions of individual exposures to the health risks:
ggplot(risks.singvar, aes(variable, est, ymin = est - 1.96*sd, ymax = est + 1.96*sd,
col = q.fixed)) +
geom_pointrange(position = position_dodge(width = 0.75)) + coord_flip()
We may also wish to compute specific ‘interaction’ parameters. For example, we may wish to compare the single-exposure health risks when all of the other exposures are fixed at their 75th percentile to when all of the other exposures are fixed at their 25th percentile. In the previous plot, this corresponds to subtracting the estimate represented by the red circle from the estimate represented by the blue circle. This can be done using the function SingVarIntSummaries.
risks.int <- SingVarIntSummaries(fit = fitkm,
qs.diff = c(0.25, 0.75), qs.fixed = c(0.25, 0.75))
risks.int
## # A tibble: 7 x 3
## variable est sd
## <fctr> <dbl> <dbl>
## 1 z1 0.6678346664 0.68815716
## 2 z2 0.2739445277 0.80383535
## 3 z3 -0.0001114915 0.03373484
## 4 z4 0.3446315576 0.41077469
## 5 z5 -1.1573665853 0.51849356
## 6 z6 0.0160479504 0.07432786
## 7 z7 0.3911197472 0.60639524
ggplot(risks.int, aes(variable, est, ymin = est - 1.96*sd, ymax = est + 1.96*sd)) +
geom_pointrange(position = position_dodge(width = 0.75)) +
geom_hline(yintercept = 0, lty = 2, col = "brown") +
coord_flip()
We see that the single-exposure risk for \(z_5\) (associated with a change from its 25th to 75th percentile) decreases by 1.16 units when the remaining exposures are fixed at their 25th percentile, as compared to when they are fixed at their 75th percentile.
After applying each method to the simulated dataset, workshop participants were informed of the true model used to generate the data. These solutions have been posted on the conference website (.pdf solutions). For Simulated Dataset 1, the model used to generate the data was given by \[ Y_i = h(z_{i1}, z_{i2}, z_{i4}, z_{i5}, z_{i7}) + 2x_{i} + \epsilon_i, \] where \(\epsilon_i \sim N(0, 2.32^2)\), and the exposure-response function was a biologically based dose-response function based on endocrine disruption, \[ h(z_1, z_2, z_4, z_5, z_7) = \alpha_0 + \frac{\alpha_1\left(\frac{T}{K_t} + \frac{z_1}{K_1} + \frac{z_2}{K_2}\right)}{1 + \frac{T}{K_t} + \frac{z_1}{K_1} + \frac{z_2}{K_2} + \frac{z_4}{K_4} + \frac{z_5}{K_5}}\left(R_{00} + \frac{\lambda z_7}{K_7 + z_7}\right), \] where \((T/K_t, K_1, K_2, K_4, K_5, K_7, R_{00}, \lambda) = (0.5, 1.5, 3, 4.5, 1, 0.5, 1, 2)\) and, as encoded in the function below, \((\alpha_0, \alpha_1) = (2, 20)\).
To evaluate how well the BKMR fit captures the true exposure-response function, here we plot the estimated \(h_i = h(z_{i1}, z_{i2}, z_{i4}, z_{i5}, z_{i7})\) against the true \(h_i\).
htrue_fun <- function(z1, z2, z4, z5, z7) {
alpha0 <- 2
alpha1 <- 20
TKt <- 0.5
K1 <- 1.5
K2 <- 3
K4 <- 4.5
K5 <- 1
K7 <- 0.5
R00 <- 1
lambda <- 2
R0 <- R00 + lambda*z7/(K7 + z7)
num <- TKt + z1/K1 + z2/K2
alpha0 + alpha1*num/(1 + num + z4/K4 + z5/K5)*R0
}
hi_true <- htrue_fun(z1 = expos[, "z1"], z2 = expos[, "z2"], z4 = expos[, "z4"], z5 = expos[, "z5"], z7 = expos[, "z7"])
hi_est <- ComputePostmeanHnew(fitkm)$postmean
plot(hi_true, hi_est, xlab = "True h", ylab = "Estimated h")
abline(0, 1, col = "red")
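We can also summarize the agreement numerically, for example via the correlation and the root mean squared difference between the true and estimated values (an illustrative check, not part of the workshop solutions):
cor(hi_true, hi_est)               # correlation between true and estimated h
sqrt(mean((hi_est - hi_true)^2))   # root mean squared difference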
Here we plot the univariate exposure-response functions estimated from BKMR against the true exposure-response functions, where each of the other exposures is fixed at its median value.
meds <- apply(expos, 2, median)
df_preds_true <- data.frame()
for (i in 1:ncol(expos)) {
new_grid <- as.list(meds)
new_grid[[i]] <- seq(min(expos[, i]), max(expos[, i]), length.out = 50)
new_grid <- expand.grid(new_grid)
pred <- with(new_grid, htrue_fun(z1, z2, z4, z5, z7))
pred_center <- pred - mean(pred)
df0 <- data.frame(variable = paste0("z", i),
expos = new_grid[, i],
z = (new_grid[, i] - mean(expos[, i]))/sd(expos[, i]),
est = pred_center)
df_preds_true <- rbind(df_preds_true, df0)
}
df_preds_true$se <- NA
ggplot(pred.resp.univar, aes(z, est, ymin = est - 1.96*se,
ymax = est + 1.96*se)) +
geom_smooth(stat = "identity") +
geom_line(data = df_preds_true, col = "red", lty = 2) +
facet_wrap(~variable, ncol = 4) +
xlab("expos") +
ylab("h(expos)")
We see that our estimates of the exposure-response function (blue solid line) align well with the true exposure-response function (red dashed line).
We now consider a Gaussian predictive process approach for speeding up the model fitting. To do this, we first need to define a matrix of knots covering the exposure space, where the number of knots is smaller than the number of observations. We will consider using both 100 knots and 50 knots.
set.seed(1000)
knots100 <- fields::cover.design(scale_expos, nd = 100)$design
fitkm_knots100 <- kmbayes(Y, scale_expos, covar, iter = 10000,
verbose = FALSE, varsel = TRUE,
knots = knots100)
set.seed(1000)
knots50 <- fields::cover.design(scale_expos, nd = 50)$design
fitkm_knots50 <- kmbayes(Y, scale_expos, covar, iter = 10000,
verbose = FALSE, varsel = TRUE,
knots = knots50)
Let’s compare the computation time:
time_per_iter <- data.frame(
full = with(fitkm,
difftime(time2, time1, units = "secs")/iter),
knots100 = with(fitkm_knots100,
difftime(time2, time1, units = "secs")/iter),
knots50 = with(fitkm_knots50,
difftime(time2, time1, units = "secs")/iter)
)
time_per_iter
## full knots100 knots50
## 1 0.1368005 secs 0.07026066 secs 0.03593337 secs
We next examine whether the reduction in computation time leads to a meaningful reduction in the accuracy of the estimated exposure-response function under the approximate, Gaussian predictive process version as compared to full BKMR. Below we plot the person-specific health risks \(h_i\) estimated under the full BKMR model against those estimated under the approximate, Gaussian predictive process version, and we calculate the root mean squared error (rMSE).
hi_est <- ComputePostmeanHnew(fitkm)$postmean
hi_knot100 <- ComputePostmeanHnew(fitkm_knots100)$postmean
hi_knot50 <- ComputePostmeanHnew(fitkm_knots50)$postmean
par(mfrow = c(1, 2))
plot(hi_est, hi_knot100, xlab = "Full model",
ylab = "100 knots", main = "Estimated h")
abline(0, 1, col = "red")
plot(hi_est, hi_knot50, xlab = "Full model",
ylab = "50 knots", main = "Estimated h")
abline(0, 1, col = "red")
rMSE <- data.frame(
knots50 = sqrt(mean((hi_knot50 - hi_est)^2)),
knots100 = sqrt(mean((hi_knot100 - hi_est)^2))
)
rMSE
## knots50 knots100
## 1 0.1232796 0.05223356
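Beyond the fitted exposure-response function, one may also want to check that the variable selection results are similar across the full and knot-based fits; a minimal sketch comparing the PIPs:
data.frame(variable = ExtractPIPs(fitkm)$variable,
           full     = ExtractPIPs(fitkm)$PIP,
           knots100 = ExtractPIPs(fitkm_knots100)$PIP,
           knots50  = ExtractPIPs(fitkm_knots50)$PIP)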
Details on other functionalities are available in the help page for the kmbayes function (type ?kmbayes to view) and in the package overview guide. Additional information on the statistical methodology and on the computational details is provided in Bobb et al. (2015).
Bobb JF, Valeri L, Claus Henn B, Christiani DC, Wright RO, Mazumdar M, Godleski JJ, Coull BA (2015). Bayesian Kernel Machine Regression for Estimating the Health Effects of Multi-Pollutant Mixtures. Biostatistics, 16(3), 493–508.