In this vignette, we walk through an example to illustrate
how the fcfdr R package can be used to leverage various
relevant genetic and genomic data with GWAS \(p\)-values for type 1 diabetes (T1D) to
find new genetic associations. This vignette will take approximately 30
minutes to complete.
The data required for this example is available to download within
the fcfdr R package and includes:
GWAS \(p\)-values for T1D (Onengut-Gumuscu et al. 2015) downloaded from the NHGRI-EBI GWAS Catalog (study GCST005536 accessed on 08/10/21).
GWAS \(p\)-values for rheumatoid arthritis (RA) (Eyre et al. 2012) downloaded from the NHGRI-EBI GWAS Catalog (study GCST90013445 accessed on 08/10/21).
Binary measure of SNP overlap with regulatory factor binding
sites, derived from merging all DNaseI digital genomic footprinting
(DGF) regions from the narrow-peak classifications across 57 cell types
(see https://www.nature.com/articles/nature11247). SNP
annotations were downloaded for all 1000 Genomes phase 3 SNPs from the
LDSC
data repository and the binary DGF_ENCODE annotation
was extracted for all SNPs in our analysis.
Average fold-enrichment ratios of H3K27ac ChIP-seq counts
relative to expected background counts in T1D-relevant cell types.
Fold-enrichment ratios were downloaded from NIH Roadmap for CD3, CD4+
CD25int CD127+ Tmem, CD4+ CD25+ CD127- Treg, CD4+ CD25- Th, CD4+ CD25-
CD45RA+, CD4 memory, CD4 naive, CD8 memory and CD8 naive primary cells
(from https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/
see epigenome ID to cell type conversion table here).
The fold-enrichment ratios were averaged over cell types to derive the
values in column H3K27ac.
Firstly, we download the data:
set.seed(1)
library(fcfdr)
library(cowplot)
library(ggplot2)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data(T1D_application_data, package = "fcfdr")
head(T1D_application_data)
#> rsID CHR19 BP19 REF ALT T1D_pval MAF LDAK_weight
#> <char> <int> <int> <char> <char> <num> <num> <num>
#> 1: rs61733845 1 1118275 C T 0.10475578 0.0353535 0.896405
#> 2: rs9729550 1 1135242 A C 0.06329083 0.2575760 0.424584
#> 3: rs6603788 1 1218086 C T 0.12616240 0.0656566 0.497992
#> 4: rs1240708 1 1335790 A G 0.24334501 0.1464650 0.712747
#> 5: rs880051 1 1493727 C T 0.29581294 0.2121210 NA
#> 6: rs2296716 1 1497824 G A 0.49740945 0.1161620 NA
#> RA_pval DGF H3K27ac
#> <num> <int> <num>
#> 1: 0.3552 0 0.1235078
#> 2: 0.1089 0 2.0741089
#> 3: 0.3700 1 0.1636556
#> 4: 0.6963 0 8.5169011
#> 5: 0.5926 0 0.7284344
#> 6: 0.3080 0 0.5411567In this application we leverage GWAS \(p\)-values for RA, binary SNP overlap with regulatory factor binding sites and H3K27ac counts in T1D-relevant cell types with GWAS \(p\)-values for T1D to generate adjusted \(p\)-values (called \(v\)-values).
orig_p <- T1D_application_data$T1D_pval
chr <- T1D_application_data$CHR19
MAF <- T1D_application_data$MAF
q1 <- T1D_application_data$RA_pval
q2 <- T1D_application_data$DGF
q3 <- log(T1D_application_data$H3K27ac+1) # deal with long tailThe data frame also contains a column of LDAK weights for each SNP (https://dougspeed.com/calculate-weightings/). An LDAK weight of zero means that the signal is (almost) perfectly captured by neighbouring SNPs, and so we use the subset of SNPs with non-zero LDAK weights as our independent subset of SNPs.
We are now ready to use the fcfdr R package to generate
\(v\)-values. Firstly, we generate
\(v\)-values by leveraging GWAS \(p\)-values for RA. We supply MAF values to
prevent a bias of the KDE fit towards the behaviour of rarer SNPs (the
function intrinsically down-samples the independent subset of SNPs to
match the MAF distribution in this subset to that in the whole set of
SNPs).
Since the outputted \(v\)-values are analogous to \(p\)-values, they can be used directly in any error-rate controlling procedure. Here, we use the Benjamini-Hochberg (BH) procedure to derive FDR-adjusted \(v\)-values and plot the resultant FDR values.
res1 <- data.frame(p = orig_p, q1, v1)
mid1 <- median(res1$q1)
ggplot(res1, aes(x = p.adjust(p, method = "BH"), y = p.adjust(v1, method = "BH"), col = q1)) + geom_point(cex = 0.5) + theme_cowplot(12) + background_grid(major = "xy", minor = "none") + geom_abline(intercept = 0, slope = 1, linetype="dashed") + xlab("Original FDR") + ylab("V1 (FDR)") + ggtitle(paste0("Iteration 1")) + scale_color_gradient2(midpoint = mid1, low = "blue", mid = "white", high = "red", space = "Lab")The resultant \(v\)-values for this
first iteration (v1) are then used in the next iteration to
leverage binary data on SNP overlap with regulatory factor binding
sites. Note that the binary cFDR function implements a leave-one-out
procedure and therefore requires a group index for each SNP. This will
generally be the chromosome on which that SNP resides but can also be
indices relating to LD blocks, for example.
res2 <- data.frame(p = v1, v2, q2)
res2$q2 <- as.factor(res2$q2)
ggplot(res2, aes(x = p.adjust(p, method = "BH"), y = p.adjust(v2, method = "BH"), col = q2)) + geom_point(cex = 0.5) + theme_cowplot(12) + background_grid(major = "xy", minor = "none") + geom_abline(intercept = 0, slope = 1, linetype="dashed") + xlab("V1 (FDR)") + ylab("V2 (FDR)") + ggtitle(paste0("Iteration 2")) + scale_colour_manual(values = c("grey", "black"))The resultant \(v\)-values for this
second iteration (v2) are then used in the next iteration
to leverage H3K27ac counts.
res3 <- data.frame(p = v2, q3, v3)
ggplot(res3, aes(x = p.adjust(p, method = "BH"), y = p.adjust(v3, method = "BH"), col = q3)) + geom_point(cex = 0.5) + theme_cowplot(12) + background_grid(major = "xy", minor = "none") + geom_abline(intercept = 0, slope = 1, linetype="dashed") + xlab("V2 (FDR)") + ylab("V3 (FDR)") + ggtitle(paste0("Iteration 3")) + scale_color_gradient2(midpoint = 1, low = "blue", mid = "white", high = "red", space = "Lab")We then create a final data frame containing the results from our
analysis. Note that the sign is flipped for \(q2\) and \(q3\). This is because these are negatively
correlated with p and the flexible cFDR software
automatically flips the sign of q to ensure that low
p are enriched for low q.
res <- data.frame(orig_p, q1 = iter1_res[[1]]$q, q2 = as.factor(iter2_res$q), q3 = iter3_res[[1]]$q, v1, v2, v3)
head(res)
#> orig_p q1 q2 q3 v1 v2 v3
#> 1 0.10475578 0.3552 0 -0.1164557 0.11041697 0.11826617 0.14932460
#> 2 0.06329083 0.1089 0 -1.1230151 0.04303016 0.04697244 0.02506604
#> 3 0.12616240 0.3700 1 -0.1515664 0.13365968 0.10911264 0.13941567
#> 4 0.24334501 0.6963 0 -2.2530693 0.29820649 0.31427711 0.21649184
#> 5 0.29581294 0.5926 0 -0.5472161 0.34568571 0.36509249 0.34641533
#> 6 0.49740945 0.3080 0 -0.4325332 0.48752645 0.51798495 0.55052819We can plot the original \(p\)-values for T1D against the final adjusted \(v\)-values.
mid1 <- median(res$q1)
ggplot(res, aes(x = p.adjust(orig_p, method = "BH"), y = p.adjust(v3, method = "BH"))) + geom_point(cex = 0.5, alpha = 0.5) + theme_cowplot(12) + background_grid(major = "xy", minor = "none") + geom_abline(intercept = 0, slope = 1, linetype="dashed", col = "red") + xlab("Original P (FDR)") + ylab("V3 (FDR)") + ggtitle(paste0("FDR adjusted v-values\nagainst original FDR values"))ggplot(res, aes(x = -log10(orig_p), y = -log10(v3))) + geom_point(cex = 0.5, alpha = 0.5) + theme_cowplot(12) + background_grid(major = "xy", minor = "none") + geom_abline(intercept = 0, slope = 1, linetype="dashed", col = "red") + xlab("Original P (FDR) (-log10)") + ylab("V3 (FDR) (-log10)") + ggtitle(paste0("FDR adjusted v-values against\noriginal FDR values (FDR)")) + coord_cartesian(ylim = c(0,10), xlim = c(0,10))We find that our implementation of cFDR identifies newly FDR significant SNPs that have relatively small GWAS \(p\)-values for rheumatoid arthritis, are more likely to be found in genomic regions where regulatory factors bind and have relatively high H3K27ac counts in T1D relevant cell types.
p_fdr <- p.adjust(orig_p, method = "BH")
v3_fdr <- p.adjust(v3, method = "BH")
# choose fdr threshold corresponding to genome-wide significance threshold
fdr_thr <- max(p_fdr[which(orig_p <= 5*10^{-8})])
median(T1D_application_data$RA_p[which(v3_fdr < fdr_thr & p_fdr > fdr_thr)])
#> [1] 0.0012585
median(T1D_application_data$RA_p)
#> [1] 0.4222
mean(T1D_application_data$DGF[which(v3_fdr < fdr_thr & p_fdr > fdr_thr)])
#> [1] 0.323741
mean(T1D_application_data$DGF)
#> [1] 0.2339995
median(T1D_application_data$H3K27ac[which(v3_fdr < fdr_thr & p_fdr > fdr_thr)])
#> [1] 1.024923
median(T1D_application_data$H3K27ac)
#> [1] 0.5759056Side comment: code to create the Manhattan plot in the manuscript:
T1D_application_data$v3_fdr <- v3_fdr
nCHR <- length(unique(T1D_application_data$CHR19))
T1D_application_data$BPcum <- NA
s <- 0
nbp <- c()
T1D_application_data <- data.frame(T1D_application_data)
for (i in unique(T1D_application_data$CHR19)){
nbp[i] <- max(T1D_application_data[T1D_application_data$CHR19 == i,]$BP19)
T1D_application_data[T1D_application_data$CHR19 == i,"BPcum"] <- T1D_application_data[T1D_application_data$CHR19 == i,"BP19"] + s
s <- s + nbp[i]
}
axis.set <- T1D_application_data %>%
group_by(CHR19) %>%
summarize(center = (max(BPcum) + min(BPcum)) / 2)
ggplot(T1D_application_data, aes(x = BPcum, y = -log10(v3_fdr), col = as.factor(CHR19))) + geom_point(cex = 0.75) + theme_cowplot(12) + background_grid(major = "xy", minor = "none") + geom_hline(yintercept = -log10(fdr_thr), linetype = "dashed") + xlab("Position") + scale_color_manual(values = rep(c("#276FBF", "#183059"), nCHR)) +
scale_x_continuous(label = axis.set$CHR19, breaks = axis.set$center) + theme(legend.position = "none")+ theme(axis.text.x = element_text(size = 6, angle = 0)) + coord_cartesian(ylim=c(0,10)) + ylab(expression(paste("-log"[10],"(FDR)")))