---
title: "T1D Application"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{T1D Application}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dev = 'png'
)

Sys.setenv(`_R_S3_METHOD_REGISTRATION_NOTE_OVERWRITES_` = "false")
```

In this vignette, we walk through an example to illustrate how the `fcfdr` R package can be used to leverage various relevant genetic and genomic data with GWAS $p$-values for type 1 diabetes (T1D) to find new genetic associations. This vignette will take approximately 30 minutes to complete.

---

The data required for this example is available to download within the `fcfdr` R package and includes:

1. GWAS $p$-values for T1D ([Onengut-Gumuscu et al. 2015](https://www.nature.com/articles/ng.3245)) downloaded from the NHGRI-EBI GWAS Catalog (study GCST005536 accessed on 08/10/21).

2. GWAS $p$-values for rheumatoid arthritis (RA) ([Eyre et al. 2012](https://www.nature.com/articles/ng.2462)) downloaded from the NHGRI-EBI GWAS Catalog (study GCST90013445 accessed on 08/10/21).

3. Binary measure of SNP overlap with regulatory factor binding sites, derived from merging all DNaseI digital genomic footprinting (DGF) regions from the narrow-peak classifications across 57 cell types (see https://www.nature.com/articles/nature11247). SNP annotations were downloaded for all 1000 Genomes phase 3 SNPs from the [LDSC data repository](https://storage.googleapis.com/broad-alkesgroup-public/LDSCORE/1000G_Phase3_baselineLD_v2.1_ldscores.tgz) and the binary `DGF_ENCODE` annotation was extracted for all SNPs in our analysis.

4. Average fold-enrichment ratios of H3K27ac ChIP-seq counts relative to expected background counts in T1D-relevant cell types. Fold-enrichment ratios were downloaded from NIH Roadmap for CD3, CD4+ CD25int CD127+ Tmem, CD4+ CD25+ CD127- Treg, CD4+ CD25- Th, CD4+ CD25- CD45RA+, CD4 memory, CD4 naive, CD8 memory and CD8 naive primary cells (from https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/ see epigenome ID to cell type conversion table [here](https://docs.google.com/spreadsheets/u/1/d/1yikGx4MsO9Ei36b64yOy9Vb6oPC5IBGlFbYEt-N6gOM/edit#gid=15)). The fold-enrichment ratios were averaged over cell types to derive the values in column `H3K27ac`.

---

Firstly, we download the data:

```{r}
set.seed(1)
library(fcfdr)
library(cowplot)
library(ggplot2)
library(dplyr)

data(T1D_application_data, package = "fcfdr")
head(T1D_application_data)
```

In this application we leverage GWAS $p$-values for RA, binary SNP overlap with regulatory factor binding sites and H3K27ac counts in T1D-relevant cell types with GWAS $p$-values for T1D to generate adjusted $p$-values (called $v$-values). 

```{r}
orig_p <- T1D_application_data$T1D_pval
chr <- T1D_application_data$CHR19
MAF <- T1D_application_data$MAF
q1 <- T1D_application_data$RA_pval
q2 <- T1D_application_data$DGF
q3 <- log(T1D_application_data$H3K27ac+1) # deal with long tail
```

---

The data frame also contains a column of LDAK weights for each SNP (https://dougspeed.com/calculate-weightings/). An LDAK weight of zero means that the signal is (almost) perfectly captured by neighbouring SNPs, and so we use the subset of SNPs with non-zero LDAK weights as our independent subset of SNPs.

```{r}
ind_snps <- which(T1D_application_data$LDAK_weight != 0)
```

---

We are now ready to use the `fcfdr` R package to generate $v$-values. Firstly, we generate $v$-values by leveraging GWAS $p$-values for RA. We supply MAF values to prevent a bias of the KDE fit towards the behaviour of rarer SNPs (the function intrinsically down-samples the independent subset of SNPs to match the MAF distribution in this subset to that in the whole set of SNPs).

```{r}
iter1_res <- flexible_cfdr(p = orig_p, 
                           q = q1, 
                           indep_index = ind_snps,
                           maf = MAF)

v1 <- iter1_res[[1]]$v
```

---

Since the outputted $v$-values are analogous to $p$-values, they can be used directly in any error-rate controlling procedure. Here, we use the Benjamini-Hochberg (BH) procedure to derive FDR-adjusted $v$-values and plot the resultant FDR values.

```{r}
res1 <- data.frame(p = orig_p, q1, v1)
mid1 <- median(res1$q1)

ggplot(res1, aes(x = p.adjust(p, method = "BH"), y = p.adjust(v1, method = "BH"), col = q1)) + geom_point(cex = 0.5) + theme_cowplot(12) + background_grid(major = "xy", minor = "none") + geom_abline(intercept = 0, slope = 1,  linetype="dashed") + xlab("Original FDR") + ylab("V1 (FDR)") + ggtitle(paste0("Iteration 1")) + scale_color_gradient2(midpoint = mid1, low = "blue", mid = "white", high = "red", space = "Lab")
```

---

The resultant $v$-values for this first iteration (`v1`) are then used in the next iteration to leverage binary data on SNP overlap with regulatory factor binding sites. Note that the binary cFDR function implements a leave-one-out procedure and therefore requires a group index for each SNP. This will generally be the chromosome on which that SNP resides but can also be indices relating to LD blocks, for example.

```{r}
iter2_res <- binary_cfdr(p = v1,
                         q = q2,
                         group = chr)

v2 <- iter2_res$v
```

```{r}
res2 <- data.frame(p = v1, v2, q2)
res2$q2 <- as.factor(res2$q2)

ggplot(res2, aes(x = p.adjust(p, method = "BH"), y = p.adjust(v2, method = "BH"), col = q2)) + geom_point(cex = 0.5) + theme_cowplot(12) + background_grid(major = "xy", minor = "none") + geom_abline(intercept = 0, slope = 1,  linetype="dashed") + xlab("V1 (FDR)") + ylab("V2 (FDR)") + ggtitle(paste0("Iteration 2")) + scale_colour_manual(values = c("grey", "black"))
```

---

The resultant $v$-values for this second iteration (`v2`) are then used in the next iteration to leverage H3K27ac counts.

```{r}
iter3_res <- flexible_cfdr(p = v2, 
                           q = q3, 
                           indep_index = ind_snps,
                           maf = MAF)
v3 <- iter3_res[[1]]$v
```

```{r}
res3 <- data.frame(p = v2, q3, v3)

ggplot(res3, aes(x = p.adjust(p, method = "BH"), y = p.adjust(v3, method = "BH"), col = q3)) + geom_point(cex = 0.5) + theme_cowplot(12) + background_grid(major = "xy", minor = "none") + geom_abline(intercept = 0, slope = 1,  linetype="dashed") + xlab("V2 (FDR)") + ylab("V3 (FDR)") + ggtitle(paste0("Iteration 3")) + scale_color_gradient2(midpoint = 1, low = "blue", mid = "white", high = "red", space = "Lab")
```

---

We then create a final data frame containing the results from our analysis. Note that the sign is flipped for $q2$ and $q3$. This is because these are negatively correlated with `p` and the flexible cFDR software automatically flips the sign of `q` to ensure that low `p` are enriched for low `q`.

```{r}
res <- data.frame(orig_p, q1 = iter1_res[[1]]$q, q2 = as.factor(iter2_res$q), q3 = iter3_res[[1]]$q, v1, v2, v3)

head(res)
```

---

We can plot the original $p$-values for T1D against the final adjusted $v$-values.

```{r}
mid1 <- median(res$q1)

ggplot(res, aes(x = p.adjust(orig_p, method = "BH"), y = p.adjust(v3, method = "BH"))) + geom_point(cex = 0.5, alpha = 0.5) + theme_cowplot(12) + background_grid(major = "xy", minor = "none") + geom_abline(intercept = 0, slope = 1,  linetype="dashed", col = "red") + xlab("Original P (FDR)") + ylab("V3 (FDR)") + ggtitle(paste0("FDR adjusted v-values\nagainst original FDR values"))
```

```{r}
ggplot(res, aes(x = -log10(orig_p), y = -log10(v3))) + geom_point(cex = 0.5, alpha = 0.5) + theme_cowplot(12) + background_grid(major = "xy", minor = "none") + geom_abline(intercept = 0, slope = 1,  linetype="dashed", col = "red") + xlab("Original P (FDR) (-log10)") + ylab("V3 (FDR) (-log10)") + ggtitle(paste0("FDR adjusted v-values against\noriginal FDR values (FDR)")) + coord_cartesian(ylim = c(0,10), xlim = c(0,10))
```

---

We find that our implementation of cFDR identifies newly FDR significant SNPs that have relatively small GWAS $p$-values for rheumatoid arthritis, are more likely to be found in genomic regions where regulatory factors bind and have relatively high H3K27ac counts in T1D relevant cell types.

```{r}
p_fdr <- p.adjust(orig_p, method = "BH")
v3_fdr <- p.adjust(v3, method = "BH")

# choose fdr threshold corresponding to genome-wide significance threshold
fdr_thr <- max(p_fdr[which(orig_p <= 5*10^{-8})])

median(T1D_application_data$RA_p[which(v3_fdr < fdr_thr & p_fdr > fdr_thr)])
median(T1D_application_data$RA_p)

mean(T1D_application_data$DGF[which(v3_fdr < fdr_thr & p_fdr > fdr_thr)])
mean(T1D_application_data$DGF)

median(T1D_application_data$H3K27ac[which(v3_fdr < fdr_thr & p_fdr > fdr_thr)])
median(T1D_application_data$H3K27ac)
```

---

Side comment: code to create the Manhattan plot in the manuscript:

```{r eval = FALSE}
T1D_application_data$v3_fdr <- v3_fdr

nCHR <- length(unique(T1D_application_data$CHR19))
T1D_application_data$BPcum <- NA
s <- 0
nbp <- c()
T1D_application_data <- data.frame(T1D_application_data)
for (i in unique(T1D_application_data$CHR19)){
  nbp[i] <- max(T1D_application_data[T1D_application_data$CHR19 == i,]$BP19)
  T1D_application_data[T1D_application_data$CHR19 == i,"BPcum"] <- T1D_application_data[T1D_application_data$CHR19 == i,"BP19"] + s
  s <- s + nbp[i]
}

axis.set <- T1D_application_data %>% 
  group_by(CHR19) %>% 
  summarize(center = (max(BPcum) + min(BPcum)) / 2)

ggplot(T1D_application_data, aes(x = BPcum, y = -log10(v3_fdr), col = as.factor(CHR19))) + geom_point(cex = 0.75) + theme_cowplot(12) + background_grid(major = "xy", minor = "none") + geom_hline(yintercept = -log10(fdr_thr), linetype = "dashed") + xlab("Position") +  scale_color_manual(values = rep(c("#276FBF", "#183059"), nCHR)) +
  scale_x_continuous(label = axis.set$CHR19, breaks = axis.set$center) + theme(legend.position = "none")+ theme(axis.text.x = element_text(size = 6, angle = 0)) + coord_cartesian(ylim=c(0,10)) + ylab(expression(paste("-log"[10],"(FDR)")))
```