The fcfdr R package implements the cFDR
framework and is applicable for a wide variety of auxiliary covariates.
This is in contrast to earlier empirical cFDR methods (Liley and Wallace
2021; https://github.com/jamesliley/cfdr) that only support
auxiliary \(p\)-values from related
traits. A direct utility of fcfdr is to leverage relevant
functional genomic data with GWAS \(p\)-values to increase power for GWAS
discovery. The method generates “\(v\)-values” which can be interpreted as
GWAS \(p\)-values that have been
re-weighted according to the auxiliary data values. Since the \(v\)-values are analogous to \(p\)-values, they can be used directly in
any error-rate controlling procedure.
The fcfdr R package contains two key functions:
flexible_cfdr:
Implements cFDR leveraging \(\textbf{continuous}\) auxiliary
covariates.
binary_cfdr:
Implements cFDR leveraging \(\textbf{binary}\) auxiliary
covariates.
Both functions require two parameters to be specified:
p: GWAS \(p\)-values for the trait of interest
(vector of per-SNP \(p\)-values)
q: Auxiliary data values (vector of per-SNP
auxiliary data values).
The flexible_cfdr function also requires the indices of
an independent subset of SNPs (indep_index parameter; see
LDAK
vignette) whilst the binary_cfdr function requires
group indices for each SNP (group parameter; typically the
chromosome or LD block index for each SNP).
For further details, including instructions to generate
indep_index, examples of auxiliary data to leverage and
instructions to apply cFDR iteratively, please see the Extra
Information vignette.
Firstly, load the Flexible cFDR R package:
Next, simulate \(p\)-values for 50,000 genetic variants, including 500 associated variants.
set.seed(1)
n = 50000
n1p = 500 # associated variants
zp = c(rnorm(n1p, sd=5), rnorm(n-n1p, sd=1)) # z-scores
p = 2*pnorm(-abs(zp)) # convert to p-values
hist(p)We simulate relevant auxiliary data from a mixture normal distribution (\(q\)). The associated SNPs (with indices 1-500) are sampled from \(N(-0.5, 0.5^2)\) and the non-associated SNPs (with indices 500-50000) are sampled from \(N(2,1)\).
mixture_comp1 <- function(x) rnorm(x, mean = -0.5, sd = 0.5)
mixture_comp2 <- function(x) rnorm(x, mean = 2, sd = 1)
n = length(p)
z = runif(n)
q <- c(mixture_comp1(n1p), mixture_comp2(n-n1p))
hist(q)We can use the corr_plot function to visualise the
relationship between \(p\) and \(q\). We observe that low \(p\)-values (i.e. high \(-log10(p)\)) are enriched for low \(q\) values.
This is also clear from the stratified QQ plot.
stratified_qqplot(data_frame = data.frame(p, q), prin_value_label = "p", cond_value_label = "q", thresholds = quantile(q)[-1])We are now ready to use the flexible_cfdr function to
derive the \(v\)-values. Note that for
the purpose of the vignette, we do not specify an independent subset of
SNPs, however for real analyses this parameter should be specified
appropriately to avoid biased bandwidth estimations when fitting the
KDE. Subsets of independent SNPs can be readily found using PLINK or
LDAK - see the vignette for deriving
LDAK weights here.
By default, flexible_cfdr prints some useful plots so
that users can evaluate the accuracy of the KDE that is estimated in the
method.
(This line of code will take approximately 4 minutes to run.)
The output from the function is a list of length two. The first
element is a data.frame containing the \(p\)-values (input parameter
p), the auxiliary data values (input parameter
q) and the generated \(v\)-values. The second element contains
auxiliary data, such as how many data-points were left-censored in the
method and/or spline corrected. In this example, we can see that a value
of \(q=-0.169\) was used for left
censoring (any auxiliary data values smaller than this were set to this
value) which results in 1108/50000 (2% of) data points being left
censored. We also see that 109 (0.2% of) data points were spline
corrected.
str(res)
#> List of 2
#> $ :'data.frame': 50000 obs. of 3 variables:
#> ..$ p: num [1:50000] 1.73e-03 3.59e-01 2.94e-05 1.51e-15 9.94e-02 ...
#> ..$ q: num [1:50000] -0.325 -0.297 -0.458 -0.383 -0.578 ...
#> ..$ v: num [1:50000] 1.63e-05 3.93e-03 1.27e-07 1.02e-17 2.03e-04 ...
#> $ :'data.frame': 1 obs. of 3 variables:
#> ..$ q_low : num -0.169
#> ..$ left_cens : int 1108
#> ..$ splinecorr: int 1945
p = res[[1]]$p
q = res[[1]]$q
v = res[[1]]$vNote that the cFDR framework requires that low \(p\)-values are enriched for low \(q\) values, so that if the correlation between \(p\) and \(q\) is negative then the function intrinsically flips the sign of \(q\), meaning that the \(q\) values reported in the data.frame output may be \(q:=-q\).
We can then visualise the results using the pv_plot and
log10pv_plot functions.
We observe that \(v\)-values for SNPs with high \(q\) values are increased, whilst those for SNPs with low \(q\) values are decreased.
Finally, we run the Benjamini-Hochberg procedure on the \(v\)-values and control the FDR at 0.05. This means that we are willing to accept up to 5% of the associations to be false positives.
For comparison, we do the same to the raw p-values:
‘True’ associations are those with indices 1-500, so the proportions of false discoveries are
# cFDR
1 - (length(intersect(hit,c(1:500)))/length(hit))
#> [1] 0.4237037
# p-value
1 - (length(intersect(hit_p,c(1:500)))/length(hit_p))
#> [1] 0.02262443Altogether, the cFDR method has found 47 new associations that are true whilst controlling the FDR.