FLOW-MR-tutorial
FLOW-MR-tutorial.Rmd
Introduction
FLOW-MR
is an R package for performing Mendelian
Randomization under the mediation setting. This tutorial demonstrates
how to use the package step by step.
Installation
# Install devtools if not installed
if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
# Install FLOW-MR
devtools::install_github("ZixuanWu1/FLOW-MR")
#> rlang (1.1.4 -> 1.1.5 ) [CRAN]
#> cli (3.6.3 -> 3.6.4 ) [CRAN]
#> pillar (1.9.0 -> 1.10.1 ) [CRAN]
#> R6 (2.5.1 -> 2.6.1 ) [CRAN]
#> Rcpp (1.0.13 -> 1.0.14 ) [CRAN]
#> withr (3.0.1 -> 3.0.2 ) [CRAN]
#> gtable (0.3.5 -> 0.3.6 ) [CRAN]
#> RcppArmad... (14.0.2-1 -> 14.2.3-1) [CRAN]
#>
#> The downloaded binary packages are in
#> /var/folders/1p/kyg0zmtx4gdg27q_5x4d78fc0000gn/T//RtmpL9P6lU/downloaded_packages
#> ── R CMD build ─────────────────────────────────────────────────────────────────
#> * checking for file ‘/private/var/folders/1p/kyg0zmtx4gdg27q_5x4d78fc0000gn/T/RtmpL9P6lU/remotes17574441c9acc/ZixuanWu1-FLOW-MR-888679f/DESCRIPTION’ ... OK
#> * preparing ‘FLOWMR’:
#> * checking DESCRIPTION meta-information ... OK
#> * cleaning src
#> Warning: bad markup (extra space?) at BayesMediation.Rd:48:21
#> Warning: bad markup (extra space?) at BayesMediation.Rd:49:22
#> Warning: bad markup (extra space?) at BayesMediation.Rd:50:26
#> Warning: bad markup (extra space?) at BayesMediation.Rd:52:27
#> Warning: bad markup (extra space?) at BayesMediation.Rd:54:29
#> Warning: bad markup (extra space?) at BayesMediation.Rd:56:30
#> Warning: bad markup (extra space?) at BayesMediation.Rd:58:17
#> Warning: bad markup (extra space?) at BayesMediation.Rd:59:18
#> Warning: bad markup (extra space?) at MultivariateMediation.Rd:57:21
#> Warning: bad markup (extra space?) at MultivariateMediation.Rd:58:22
#> Warning: bad markup (extra space?) at MultivariateMediation.Rd:59:26
#> Warning: bad markup (extra space?) at MultivariateMediation.Rd:61:27
#> Warning: bad markup (extra space?) at MultivariateMediation.Rd:63:29
#> Warning: bad markup (extra space?) at MultivariateMediation.Rd:65:30
#> Warning: bad markup (extra space?) at MultivariateMediation.Rd:67:17
#> Warning: bad markup (extra space?) at MultivariateMediation.Rd:68:18
#> Warning: bad markup (extra space?) at gibbs_wrapper.Rd:64:9
#> Warning: bad markup (extra space?) at gibbs_wrapper.Rd:65:9
#> Warning: bad markup (extra space?) at gibbs_wrapper.Rd:66:14
#> Warning: bad markup (extra space?) at gibbs_wrapper.Rd:67:14
#> Warning: bad markup (extra space?) at gibbs_wrapper.Rd:68:10
#> Warning: bad markup (extra space?) at gibbs_wrapper.Rd:69:13
#> Warning: bad markup (extra space?) at gibbs_wrapper_multivariate.Rd:71:9
#> Warning: bad markup (extra space?) at gibbs_wrapper_multivariate.Rd:72:9
#> Warning: bad markup (extra space?) at gibbs_wrapper_multivariate.Rd:73:14
#> Warning: bad markup (extra space?) at gibbs_wrapper_multivariate.Rd:74:14
#> Warning: bad markup (extra space?) at gibbs_wrapper_multivariate.Rd:75:10
#> Warning: bad markup (extra space?) at gibbs_wrapper_multivariate.Rd:76:13
#> Warning: bad markup (extra space?) at summary_gibbs.Rd:24:12
#> Warning: bad markup (extra space?) at summary_gibbs.Rd:25:16
#> Warning: bad markup (extra space?) at summary_gibbs.Rd:26:10
#> Warning: bad markup (extra space?) at summary_gibbs.Rd:27:13
#> Warning: bad markup (extra space?) at summary_gibbs.Rd:28:12
#> Warning: bad markup (extra space?) at summary_gibbs.Rd:29:14
#> Warning: bad markup (extra space?) at summary_gibbs.Rd:30:11
#> Warning: bad markup (extra space?) at summary_gibbs.Rd:31:13
#> Warning: bad markup (extra space?) at zero.centered.em.Rd:42:10
#> Warning: bad markup (extra space?) at zero.centered.em.Rd:43:10
#> Warning: bad markup (extra space?) at zero.centered.em.Rd:44:10
#> * installing the package to process help pages
#> * saving partial Rd database
#> * cleaning src
#> * checking for LF line-endings in source and make files and shell scripts
#> * checking for empty or unneeded directories
#> * building ‘FLOWMR_1.0.tar.gz’
Load the Package
library(FLOWMR)
Example Usage
Step 1: Prepare Input Data
Here, we use GRAPPLE
to preprocess the input data. We
can download the GRAPPLE
package using the following
command:
devtools::install_github("jingshuw/grapple")
Next, we use GRAPPLE
to preprocess the data following a
three-sample Mendelian Randomization (MR) design. The sel.file is used
to select genome-wide significant SNPs, the exp.file contains the
exposures of interest, and the out.file contains the outcomes of
interest. For more details, visit GRAPPLE on GitHub.
In this example we use adult BMI from GIANT and childhoood BMI from EGG as selections files. We use childhood body BMI from MOBA, adult BMI from UK biobank as exposures, and Breast Cancer from this paper as outcome. One can download the traits from this link.
library(GRAPPLE)
# Selection file of snps
sel.file <- c("BMI-giant17eu.csv", "BMIchild-egg15.csv" )
# Exposure file
exp.file <- c( "BMI8year-moba19","BMIadult-ukb.csv" )
# Outcome file
out.file <- "Breast-Micha17erp.csv"
# Use plink to select independent significant SNPs
plink_refdat <- "data_maf0.01_rs_ref/data_maf0.01_rs_ref"
## Use max.p.thres to decide the significance level of SNPs
## Use cal.cor = T to compute the noise correlation
data.list <- GRAPPLE::getInput(sel.file, exp.file, out.file, plink_refdat, max.p.thres =1e-3,
plink_exe = "plink_mac_20210606/plink", cal.cor = T)
#> [1] "Marker candidates will not be obtained as number of risk factors k > 1"
#> [1] "loading data for selection: BMI-giant17eu.csv ..."
#> [1] "loading data for selection: BMIchild-egg15.csv ..."
#> [1] "loading data from exposure: BMI8year-moba19 ..."
#> [1] "loading data from exposure: BMIadult-ukb.csv ..."
#> [1] "loading data from outcome: Breast-Micha17erp.csv ..."
#> [1] "Start clumping using PLINK ..."
#> [1] "487 independent genetic instruments extracted. Done!"
Step 2: Run FLOW-MR
To run FLOW-MR, we need to prepare two input files,
Gamma_hat
and Sd_hat
. Also ensuring that the
time order is reversed. Then the function BayesMediation
will execute the Gibbs sampler.
Here, you can optionally provide cor_mat
, a K by K
matrix which represents the shared noise correlation between GWAS
summary statistics across traits. This is particularly useful when the
GWAS summary statistics include overlapping samples.
GRAPPLE
estimates the noise correlation by first selecting
non-significant SNPs (e.g., p-value > 0.5) and then computing the
correlation between the estimated effect sizes. The default value is a
diagonal matrix.
dat <-data.list$data;
# Run the mediation method
Gamma_hat =rbind(dat$gamma_out1,
dat$gamma_exp2,
dat$gamma_exp1)
Sd_hat = rbind(dat$se_out1,
dat$se_exp2,
dat$se_exp1)
cor_mat = data.list$cor.mat[3:1, 3:1]
result = BayesMediation(Gamma_hat, Sd_hat, cor = cor_mat, inv = TRUE)
#> [1] "2025-03-17 17:48:54 CDT"
#> [1] "2025-03-17 17:50:38 CDT"
Step 3: Look at summary of direct effects
In this step, we print the summary of direct effects, where each row corresponds to a parameter.
print(result$summary)
#> mean var sd 2.5% 50%
#> B[1,2] 0.158571124 1.708075e-02 0.1306933512 -0.088594706 0.155708787
#> B[1,3] -0.416820534 3.037425e-02 0.1742821007 -0.772970801 -0.408581119
#> B[2,3] 0.729903282 8.361712e-03 0.0914423953 0.564416591 0.726111975
#> sigma 0.432940868 1.258926e-02 0.1122018645 0.273914991 0.412901419
#> sigma1[1] 0.025897252 1.995543e-05 0.0044671501 0.018968703 0.025298456
#> sigma1[2] 0.025400411 5.691362e-05 0.0075441115 0.015802744 0.023753648
#> sigma1[3] 0.046966628 9.573373e-05 0.0097843616 0.032002343 0.045542140
#> sigma0[1] 0.004279803 7.322883e-07 0.0008557385 0.002882453 0.004194274
#> sigma0[2] 0.008336840 7.043863e-07 0.0008392773 0.006698575 0.008356154
#> sigma0[3] 0.010865672 1.488539e-06 0.0012200570 0.008584134 0.010815020
#> p[1] 0.096893667 1.223057e-03 0.0349722279 0.039493403 0.093387522
#> p[2] 0.023916549 3.255310e-04 0.0180424783 0.002547653 0.019647269
#> p[3] 0.042358001 2.525140e-04 0.0158906899 0.017389480 0.040189888
#> 97.5% ESS Rhat
#> B[1,2] 0.421482632 496.41018 1.0033982
#> B[1,3] -0.086527013 311.31244 1.0034254
#> B[2,3] 0.928726204 60.24373 1.0086724
#> sigma 0.702160361 4455.55279 1.0009391
#> sigma1[1] 0.036117189 1364.89387 1.0042282
#> sigma1[2] 0.044571476 2874.80598 0.9998832
#> sigma1[3] 0.069936393 1211.18638 1.0008464
#> sigma0[1] 0.006155495 178.22450 1.0449098
#> sigma0[2] 0.009872035 364.44844 1.0050532
#> sigma0[3] 0.013334242 371.98046 1.0006414
#> p[1] 0.173356366 648.51063 1.0095333
#> p[2] 0.070259503 779.62068 1.0019857
#> p[3] 0.079158536 1734.45765 1.0011530
In this summary, we provide the estimated posterior mean, variance, standard deviation, 2.5% quantile, median, 97.5% quantile, effective sample size (ESS), and Gelman-Rubin Rhat statistic for each parameter. The ESS quantifies the number of independent samples, while Rhat (the Gelman-Rubin statistic) assesses convergence. See ESS and Rhat for a detailed explanations of ESS and Rhat.
For the meaning of each parameter, B[l, k]
represents
the direct effect of the
-th
latest trait on the
-th
latest trait. For example: B[1, 2]
represents the direct
effect of adult BMI on breast cancer; B[1, 3]
represents
the direct direct effect of childhood BMI on breast cancer;
B[2, 3]
represents the direct direct effect of childhood
BMI on adult BMI.
See the figure below for a visualization.
knitr::include_graphics("https://zixuanwu1.github.io/FLOW-MR/articles/FLOW-MR-tutorial_files/figure-html/EffectVisual.png")
For the meaning of other parameters, we assume the following priors in our manuscript:
We also include the posterior summary of these parameters. In the
summary, sigma1[k]
represents the standard deviation of the
slab component for trait
,
sigma0[k]
represents the standard deviation of the spike
component for trait
,
p[k]
represents the proportion of the slab component, and
sigma
represents the standard deviation of
.
The effects should be considered as significant if its credible
interval does not include 0. For example, here B[2, 3]
and
B[1, 3]
are found to be significant, while
B[1,2]
is not.
Step 4 (Optional) Working with Raw outputs
In most cases, the summary table provides sufficient information. However, in certain situations, it may be useful to directly access the posterior samples, such as when you need to obtain other quantiles. In these cases, you can follow the procedure outlined below.
In the Gibbs sampler, we ran four chains in parallel. To access
information from the i-th chain, you can use the command
result$raw[[i]]
. For example result$raw[[1]]$B
is a three dimensional array of dimension K by K by N, where K is the
number of traits and N is the number of iterations. Similarly,
result$raw[[1]]$sigma1
,
result$raw[[1]]$sigma0
, result$raw[[1]]$p
are
two dimensional arrays of dimension K by N where K is the number of
traits and N is the number of iterations.
To simplify the procedure, one can use the function
get_posterior_samples
in FLOWMR
to extract
posterior samples from all four chains for a particular parameter after
a user-defined warm-up period. The parameter must be one of “B”,
“sigma1”, “sigma0”, “sigma”, “p”.
For example, here we extract the posterior samples for
B[1,3]
after 3000 warm-up periods
post_B = get_posterior_samples(result$raw, par = "B", K = 3, ind = c(1,3), warmup = 3000)
We can check its quantiles:
And we can also draw a histogram of it for visualization:
hist(post_B, breaks = 30, main = "Histogram of posterior samples of direct effects of X3 on X1")
For sigma1
, sgima0
and p
, the
index argument have to be 1 dimensional. For example
post_p = get_posterior_samples(result$raw, par = "p", K = 3, ind = 1, warmup = 3000)
hist(post_p, breaks = 30, main = "Histogram of posterior samples of p1")
For sigma
, the index argument should be NULL:
post_sigma = get_posterior_samples(result$raw, par = "sigma", K = 3, ind = NULL, warmup = 3000)
hist(post_sigma, breaks = 30, main = "Histogram of posterior samples of sigma")
Step 5 (Optional) MCMC diagnoistics.
If non-convergence issues arise (e.g., an Rhat value greater than
1.1), you can generate a trace plot of the posterior samples using the
function traceplot
:
The traceplot is a useful tool for diagnosing MCMC convergence (See this in rstan for example link. In the traceplot, each line corresponds to posterior samples from one MCMC chain. Non-convergence is indicated when the posterior samples from the four chains do not appear similar.
If convergence issues arise, consider increasing the number of iterations or adjusting the prior parameters
Step 6 Estimating path-wise / indirect effects
To estimate path-wise effects, use the following command. For
example, the function indirect_effect
evaluates the
indirect effect of childhood BMI on breast cancer through adult BMI:
path_effect = indirect_effect(result$raw, K = 3, path = c(3,2,1), warmup = 3000)
print(path_effect)
#> Mean 2.5% 50% 97.5%
#> 1 0.1197818 -0.06143416 0.1103945 0.3452347
The path
parameter specifies the path you are interested
it. In this case it is 3 -> 2 -> 1.
This can also be done by first extracting the posterior samples of `B[1, 2]’ and ‘B[2, 3]’, then multiply them together
post_B1 = get_posterior_samples(result$raw, par = "B", K = 3, ind = c(1,2), warmup = 3000)
post_B2 = get_posterior_samples(result$raw, par = "B", K = 3, ind = c(2,3), warmup = 3000)
print(quantile(post_B1 * post_B2, c(0.025, .5, .975)))
#> 2.5% 50% 97.5%
#> -0.06143416 0.11039454 0.34523474
Alternatively, instead of focusing on a specific path, you can
estimate the sum of effects from all non-direct paths. The function
indirect_effect
summarizes the posterior samples of all
non-direct path-wise effects. It outputs array df
of
dimension K by K by 4, where the first two dimensions correspond to the
dimensions of B
, the third dimension contains the mean,
2.5/% quantile, median, 97.5% quantile respectively.
ind_effect = indirect_effect(result$raw, K = 3, path = "all", warmup = 3000)
print(ind_effect)
#> , , Mean
#>
#> [,1] [,2] [,3]
#> [1,] 0 0 0.1197818
#> [2,] 0 0 0.0000000
#> [3,] 0 0 0.0000000
#>
#> , , Quantile_0.025
#>
#> [,1] [,2] [,3]
#> [1,] 0 0 -0.06143416
#> [2,] 0 0 0.00000000
#> [3,] 0 0 0.00000000
#>
#> , , Quantile_0.5
#>
#> [,1] [,2] [,3]
#> [1,] 0 0 0.1103945
#> [2,] 0 0 0.0000000
#> [3,] 0 0 0.0000000
#>
#> , , Quantile_0.975
#>
#> [,1] [,2] [,3]
#> [1,] 0 0 0.3452347
#> [2,] 0 0 0.0000000
#> [3,] 0 0 0.0000000
We can also examine the posterior inference for the total effects, which represent the sum of the direct and indirect effects.
tot_effect = total_effect(result$raw, K = 3, warmup = 3000)
print(tot_effect)
#> , , Mean
#>
#> [,1] [,2] [,3]
#> [1,] 0 0.1585711 -0.2970388
#> [2,] 0 0.0000000 0.7299033
#> [3,] 0 0.0000000 0.0000000
#>
#> , , Quantile_0.025
#>
#> [,1] [,2] [,3]
#> [1,] 0 -0.08859471 -0.4794140
#> [2,] 0 0.00000000 0.5644166
#> [3,] 0 0.00000000 0.0000000
#>
#> , , Quantile_0.5
#>
#> [,1] [,2] [,3]
#> [1,] 0 0.1557088 -0.2931256
#> [2,] 0 0.0000000 0.7261120
#> [3,] 0 0.0000000 0.0000000
#>
#> , , Quantile_0.975
#>
#> [,1] [,2] [,3]
#> [1,] 0 0.4214826 -0.1284121
#> [2,] 0 0.0000000 0.9287262
#> [3,] 0 0.0000000 0.0000000
Finally, we demonstrate how to extract posterior samples of total, indirect, and path-wise effects, similar to how we did for the direct effects.
For each total effect, we use the function
get_total_samples
.
tot_samples = get_total_samples(result$raw, 3, ind = c(1,3))
hist(tot_samples, breaks = 30, main = "Histogram of posterior samples of total effects of X3 on X1")
For path-wise effect, we use the function
get_indirect_samples
.
path_samples = get_indirect_samples(result$raw, 3, path = c(3,2,1))
hist(path_samples, breaks = 30, main = "Histogram of posterior samples of the effect of X3 on X1 through X2")
For the joint effect of non-direct paths, we still use the function
get_indirect_samples
. In this case we need to set
path
to be “all” and specify the index of interests. In
this case, since there is only 1 indirect path, the indirect effect will
be identical to the path-wise effect we just examined
ind_samples = get_indirect_samples(result$raw, 3, path = "all", ind = c(1,3))
hist(ind_samples, breaks = 30, main = "Histogram of posterior samples of indirect effects of X3 on X1")