In this document, we demonstrate how to apply Bayesian kernel machine regression (BKMR) for binary outcomes using the bkmr R package. See the overview guide for additional information about BKMR, including guided examples for continuous outcomes.

Overview of probit BKMR

We implement kernel machine regression (KMR) for binary outcomes,

\[ \Phi^{-1}(P(Y_i = 1)) = h(z_{i1}, \ldots, z_{iM}) + \beta{\bf x}_i, \quad i = 1, \ldots, n \] where \(\Phi\) is the cumulative distribution function (CDF) of the standard normal distribution (so \(\Phi^{-1}\) is the probit link function), the outcome \(Y_i\) is a binary (0/1) variable, \(h\) is a flexible function of the predictor variables \(z_{i1}, \ldots, z_{iM}\), and \({\bf x}_i\) is a vector of covariates (\(\beta\) is the corresponding vector of coefficients). We will refer to the predictors \({\bf z}\) as exposure variables and to \(h(\cdot)\) as the exposure-response function. The function \(h\) is modeled using a kernel machine representation that can capture complex, non-linear and non-additive exposure-response relationships.
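
For context (full details on the prior specification are given in the overview guide), bkmr models \(h\) through a Gaussian kernel, which under component-wise variable selection takes the form \[ K({\bf z}, {\bf z}') = \exp\left\{-\sum_{m=1}^M r_m (z_m - z_m')^2\right\}, \] where the parameter \(r_m \geq 0\) controls how smoothly \(h\) varies in the direction of exposure \(m\), and \(r_m = 0\) corresponds to dropping exposure \(m\) from \(h\) entirely. The \(r_m\) parameters reappear below when we tune the variable selection algorithm.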

We implement probit, rather than logistic, regression primarily for computational convenience: the probit model admits efficient Bayesian inference via Gibbs sampling. In particular, the probit model above can be re-expressed by incorporating a latent normal random variable (\(Y^*\)), as \[ Y_i^* = h(z_{i1}, \ldots, z_{iM}) + \beta{\bf x}_i + \epsilon_i, \quad i = 1, \ldots, n \] where \(\epsilon_i \sim \mbox{N}(0,1)\) and \(Y_i = I(Y_i^* > 0)\) is equal to 1 if \(Y_i^* > 0\) and is equal to 0 otherwise. In our example below we will demonstrate how the exposure-response function \(h\) can be interpreted under the probit regression model.
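
As a quick sanity check of this latent-variable formulation (not part of the fitting procedure), we can verify by simulation that \(P(Y^* > 0) = \Phi\{h({\bf z}) + \beta{\bf x}\}\) for a fixed value of the linear predictor:

set.seed(1)
eta <- 0.3                  ## an arbitrary fixed value of h(z) + beta * x
mean(eta + rnorm(1e5) > 0)  ## simulated P(Y* > 0)
pnorm(eta)                  ## closed-form probit probability, approximately 0.618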

Fit BKMR to simulated dataset

To run the code in this document, you will need to have the following R packages installed: bkmr and ggplot2 (to generate plots). To install the packages, which only needs to be done once, type

install.packages("bkmr")
install.packages("ggplot2")

We begin by loading these packages:

library(bkmr)
library(ggplot2)

Generate data

Let’s consider a simple example in which the outcome data are generated under the probit model above, with 30 available exposure variables, four of which are included in the true exposure-response function. We first generate the data.

seed <- 111
set.seed(seed)
n <- 200 ## number of observations
M <- 30 ## number of exposure variables
beta.true <- 0.1
Z <- matrix(runif(n * M, -1, 1), n, M)
x <- 3*cos(Z[, 1]) + 2*rnorm(n)

## exposure-response function
hfun0 <- function(z) (2*z + 0.5) ^ 2
hfun <- function(zvec) hfun0(zvec[1]) + hfun0(zvec[2]) - hfun0(zvec[3]) - hfun0(zvec[4]) + zvec[3]*zvec[4]
h <- apply(Z, 1, hfun) ## only depends on z1, z2, z3, z4

## generate using latent normal representation
eps <- rnorm(n)
ystar <- x * beta.true + h + eps
y <- ifelse(ystar > 0, 1, 0)

datp <- list(n = n, M = M, beta.true = beta.true, Z = Z, h = h, X = cbind(x), y = y, eps = eps, ystar = ystar)
rm(n, M, beta.true, Z, x, h, eps, y, ystar)

Fit BKMR

To fit the BKMR model, we use the kmbayes function, specifying a binomial distribution for the outcome via the family argument. Here we apply component-wise variable selection.

set.seed(seed)
fitpr <- kmbayes(y = datp$y, Z = datp$Z, X = datp$X, 
                 iter = 10000, verbose = FALSE, 
                 varsel = TRUE, family = "binomial",
                 control.params = list(r.jump2 = 0.5))
fitpr

Note that here we changed r.jump2, a tuning parameter of the Metropolis-Hastings algorithm for updating the \(r_m\) parameters under variable selection, in order to obtain an improved acceptance rate (details on the tuning parameters are given in the overview guide).

Visual inspection of the trace plots (e.g., using the TracePlot() function, as illustrated below) suggests that a larger number of iterations is needed for the algorithm to converge. For the purposes of this vignette, we keep the number of iterations at 10,000 to maintain a relatively short running time and to facilitate reproducibility of the results. In a real application, we would recommend increasing the number of iterations (say, to 50,000) and assessing convergence before presenting final results.
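
For example, trace plots of the covariate coefficient \(\beta\) and of the kernel parameter \(r_1\) can be generated as follows:

TracePlot(fit = fitpr, par = "beta")
TracePlot(fit = fitpr, par = "r", comp = 1)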

Posterior inclusion probabilities

We extract and plot the posterior inclusion probabilities (PIPs), which provide a measure of variable importance for each exposure within the \(h\) function.

pips <- ExtractPIPs(fitpr)

pips$exposure <- as.factor(1:nrow(pips))
ggplot(pips, aes(exposure, PIP)) + 
    geom_point() + 
    ylab("PIP") +
    ylim(0, 1)

We see that the probit BKMR model is able to correctly identify the exposure variables truly included in the exposure-response function.
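
As a quick numerical check, we can also list the exposures with the largest PIPs directly:

## exposures sorted by posterior inclusion probability (largest first)
head(pips[order(pips$PIP, decreasing = TRUE), ], 6)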

Interpreting output

On probit scale

We may wish to interpret the estimated \(h\) function directly. We note that \(h\) quantifies the relationship between the exposures and the (probit of the) probability of an event (\(Y = 1\)), holding the covariates \({\bf x}\) fixed. By considering the latent normal formulation above, \(h\) may alternatively be interpreted as the relationship between the exposures and some underlying, continuous latent variable \(Y^*\). For example, if \(Y\) is an indicator variable for whether an individual has a particular health outcome, \(Y^*\) could be interpreted as a latent marker of health status.

Let’s investigate the estimated exposure-response function \(h\). Here we plot the univariate relationship \(h(z_m)\) for each exposure \(m\), where all of the other exposures are fixed at their median values.

pred.resp.univar <- PredictorResponseUnivar(fit = fitpr)

We use the ggplot2 package to plot the resulting cross section of \(h\).

library(ggplot2)
ggplot(pred.resp.univar, aes(z, est, ymin = est - 1.96*se, ymax = est + 1.96*se)) + 
    geom_smooth(stat = "identity") + 
    facet_wrap(~ variable) +
  ylab("h(z)")

As expected given the small posterior inclusion probabilities for \(z_5\)–\(z_{30}\), there is no estimated association between these exposures and the outcome, which matches the true data-generating distribution. We next compare the exposure-response function for \(z_1\) estimated under BKMR, where the other exposures are fixed at their median values, with that estimated by a probit model assuming a linear term for each of the exposure variables, as well as with an ‘oracle’ probit model that knows the true form of the exposure-response function, fitted using maximum likelihood:

z1 <- datp$Z[, 1]
z2 <- datp$Z[, 2]
z3 <- datp$Z[, 3]
z4 <- datp$Z[, 4]
x <- drop(datp$X)
oracle <- glm(y ~ z1 + I(z1^2) + z2 + I(z2^2) + z3 + I(z3^2) + z4 + I(z4^2) + z3*z4 + x, family = binomial(link = "probit"), data = datp)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
lin <- glm(y ~ Z + x, family = binomial(link = "probit"), data = datp)

## (centered) predictions under true model
z1_grid <- seq(min(datp$Z[, 1]), max(datp$Z[, 1]), length.out = 50)
h_true <- hfun0(z1_grid)
h_true_c <- h_true - mean(h_true)

## (centered) predictions under the oracle model
hpred_oracle <- predict(oracle, newdata = data.frame(z1 = z1_grid, z2 = median(datp$Z[, 2]), z3 = median(datp$Z[, 3]), z4 = median(datp$Z[, 4]), x = 0), se.fit = TRUE)
h_oracle <- hpred_oracle$fit
h_oracle_c <- h_oracle - mean(h_oracle)

## (centered) predictions under BKMR
medians <- apply(datp$Z, 2, median)
medians.mat <- matrix(medians[-1], nrow = length(z1_grid), ncol = ncol(datp$Z) - 1, byrow = TRUE)
Znew <- cbind(z1 = z1_grid, medians.mat)
hpred_bkmr <- ComputePostmeanHnew(fit = fitpr, Znew = Znew)
h_bkmr <- hpred_bkmr$postmean
h_bkmr_c <- h_bkmr - mean(h_bkmr)

## (centered) predictions under the model with linear terms
hpred_lin <- predict(lin, newdata = list(Z = Znew, x = rep(0, nrow(Znew))), se.fit = TRUE)
h_lin <- hpred_lin$fit
h_lin_c <- h_lin - mean(h_lin)

Now let’s compare the estimated exposure-response functions \(h(z_1)\).

plot(z1_grid, h_bkmr_c, type = "l", 
     ylim = c(0.95*min(datp$h), max(datp$h)), 
     xlab = "Z1", ylab = "h(Z1)")
lines(z1_grid, h_oracle_c, col = "red", lty = 2, lwd = 2)
lines(z1_grid, h_true_c, col = "blue", lty = 3, lwd = 2)
lines(z1_grid, h_lin_c, col = "orange", lty = 4, lwd = 2)
legend("topleft", c("BKMR", "oracle", "truth", "linear"), lwd = 2, 
       col = c("black", "red", "blue", "orange"), lty = 1:4, 
       y.intersp = 0.8, cex = 0.75)

As expected, we see that the BKMR fit performs better than the model assuming a linear exposure-response relationship (on the probit scale), but not as well as the oracle model.

On probability scale

Alternatively, we may wish to interpret the association between the exposures and the (untransformed) probability of the outcome. For this we observe \[ P(Y = 1 \mid {\bf z}, {\bf x}) = \Phi\{h(z_{1}, \ldots, z_{M}) + \beta{\bf x}\}. \] Note that under probit regression, common measures of association, such as the risk difference (RD), relative risk (RR), and odds ratio (OR), will in general depend on the particular values of the covariates \({\bf x}\). For example, to calculate an overall effect comparing the probability of the outcome when the exposures \({\bf z}\) are set to their 75th percentile versus their 25th percentile on the risk difference scale, one could compute a conditional RD, \[ RD({\bf x}) = \Phi\{h({\bf z}^{0.75}) + \beta{\bf x}\} - \Phi\{h({\bf z}^{0.25}) + \beta{\bf x}\}. \] Alternatively, one could estimate a marginal RD, \(E[RD({\bf x})]\), that integrates over the covariate distribution; a sketch of this calculation is given at the end of this section.

Here we calculate the conditional RD comparing the probability of the outcome when the second exposure \(z_2\) is set to its 75th percentile versus its 50th percentile (with the other exposures at their median values), for \({\bf x}\) fixed at its 25th, 50th, and 75th percentiles. Posterior samples of the predicted probabilities may be obtained using the SamplePred function, in which the user specifies the new Z matrix at which to obtain predictions, as well as a particular value of the vector \({\bf x}\).

Xvals <- quantile(datp$X, c(0.25, 0.50, 0.75))
names(Xvals) <- paste0("X", names(Xvals))

whichz <- 2
Znew2 <- Znew1 <- apply(datp$Z, 2, quantile, 0.5)
qs.diff <- c(0.5, 0.75)
Znew2[whichz] <- apply(datp$Z[, whichz, drop = FALSE], 2, quantile, 
                        qs.diff[2])
Znew1[whichz] <- apply(datp$Z[, whichz, drop = FALSE], 2, quantile, 
                        qs.diff[1])

ptrue1 <- sapply(Xvals, function(xval) with(datp, pnorm(hfun(Znew1) + xval*beta.true)))
ptrue2 <- sapply(Xvals, function(xval) with(datp, pnorm(hfun(Znew2) + xval*beta.true)))
RDtrue <- ptrue2 - ptrue1

set.seed(seed)
pred_samps1 <- sapply(Xvals, function(xval) SamplePred(fit = fitpr, Znew = Znew1, Xnew = xval, type = "response"))
pred_samps2 <- sapply(Xvals, function(xval) SamplePred(fit = fitpr, Znew = Znew2, Xnew = xval, type = "response"))
RD_samps <- pred_samps2 - pred_samps1

The following plot shows the posterior distribution of the conditional RD, along with the posterior mean estimates and the corresponding values under the true model, across values of the covariate \(x\).

nms <- expression(X^0.25, X^0.50, X^0.75)
par(mfrow = c(1, 3))
for (i in 1:3) {
    est <- colMeans(RD_samps)[i]
    hist(RD_samps[, i], main = "", xlab = "Risk Difference", xlim = c(-0.2, 1))
    title(main = nms[i], line = 1.1)
    abline(v = RDtrue[i], col = "red", lwd = 2)
    abline(v = est, col = "blue", lwd = 2, lty = 2)
    legend("topright", paste0(c("truth (", "est. ("), round(c(RDtrue[i], est), 2), ")"), col = c("red", "blue"), lty = 1:2, lwd = 2, 
           y.intersp = 0.8, cex = 0.75)    
}

Ninety-five percent posterior credible intervals for the RD across values of \(x\) are given by:

apply(RD_samps, 2, function(x) quantile(x, c(0.025, 0.975)))
##             X25%       X50%       X75%
## 2.5%  0.01888844 0.02814177 0.01166865
## 97.5% 0.73127186 0.70997408 0.71736499
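
As noted above, one could alternatively estimate a marginal RD, \(E[RD({\bf x})]\), that averages over the covariate distribution. The following is a minimal sketch of one way to approximate this quantity, reusing the contrast vectors Znew1 and Znew2 from above and averaging the predicted probabilities over the observed covariate values; it assumes that SamplePred returns a matrix with one row per posterior sample and one column per new observation, with the rows of Znew and Xnew paired.

## marginal RD: average the predicted probabilities over the observed x values
nobs <- nrow(datp$X)
Znew1.mat <- matrix(Znew1, nrow = nobs, ncol = length(Znew1), byrow = TRUE)
Znew2.mat <- matrix(Znew2, nrow = nobs, ncol = length(Znew2), byrow = TRUE)
set.seed(seed)
psamps1 <- SamplePred(fit = fitpr, Znew = Znew1.mat, Xnew = datp$X, type = "response")
psamps2 <- SamplePred(fit = fitpr, Znew = Znew2.mat, Xnew = datp$X, type = "response")
RD_marg_samps <- rowMeans(psamps2 - psamps1) ## one marginal RD per posterior sample
mean(RD_marg_samps)                          ## posterior mean of the marginal RD
quantile(RD_marg_samps, c(0.025, 0.975))     ## 95% credible interval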