This vignette contains additional details for the fcfdr R package.
A key assumption in the cFDR methodology is that the p-values are uniformly distributed under the null hypothesis of no association. Therefore, prior to applying cFDR, users should check that this assumption is satisfied by calculating the genomic inflation factor, λ, and applying genomic control if necessary.
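As a minimal sketch of this check, the inflation factor can be estimated as the ratio of the observed median 1-df chi-squared statistic to its expected value under the null, and genomic control then rescales the statistics by that factor. The helper names `inflation_factor()` and `genomic_control()` are ours for illustration and are not part of fcfdr; the p-values are simulated.

```r
# Genomic inflation factor: observed median chi-squared statistic (1 df)
# divided by the expected median under the null (about 0.455).
inflation_factor <- function(p) {
  chisq <- qchisq(p, df = 1, lower.tail = FALSE)
  median(chisq) / qchisq(0.5, df = 1)
}

# Genomic control: divide the chi-squared statistics by lambda and convert
# back to p-values. Statistics are only deflated, never inflated.
genomic_control <- function(p) {
  lambda <- max(inflation_factor(p), 1)
  chisq <- qchisq(p, df = 1, lower.tail = FALSE)
  pchisq(chisq / lambda, df = 1, lower.tail = FALSE)
}

# Simulated p-values with artificially inflated test statistics.
set.seed(1)
p <- pchisq(1.2 * rchisq(10000, df = 1), df = 1, lower.tail = FALSE)

inflation_factor(p)                   # noticeably above 1
inflation_factor(genomic_control(p))  # back to 1 after correction
```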
The distinguishing feature of fcfdr compared to earlier cFDR methods is that it can leverage auxiliary data from an arbitrary distribution. Examples of auxiliary covariates include:
Functional genomic assays. GWAS SNPs are not randomly distributed across functional categories; for example, they are typically enriched in enhancer and open chromatin regions in relevant cell types. This suggests that leveraging, for example, ATAC-seq data or ChIP-seq data for histone modifications (e.g. H3K27ac marking enhancer regions) in relevant cell types may be useful. If the relevant cell types are known, we found the consolidated fold change values from NIH Roadmap to be useful (epigenome ID to cell type conversion available here). GWAS SNPs can easily be matched to the fold change values using the bedtools intersect function with the -wb option.
Per-SNP scores of pathogenicity/functionality/deleteriousness. Many tools have been developed that integrate various genomic and epigenomic annotation data to quantify the pathogenicity, functionality and/or deleteriousness of both coding and non-coding GWAS variants. Examples include tissue-specific GenoSkyline scores and PINES scores.
LD score annotations. The latest and recommended baseline-LD model v2.2 contains 97 annotations ranging from binary synonymous/non-synonymous annotations to continuous functional genomic annotations. The annotations for all 1000 Genomes Phase 3 SNPs (note: now extended to 19,476,620 UK Biobank SNPs with MAF ≥ 0.1%) can be downloaded readily online.
The flexible_cfdr() function requires the indices of an independent subset of the GWAS SNPs to ensure unbiased bandwidth estimation for the kernel density estimate (KDE). In practice, we use SNPs allocated a non-zero LDAK weight as our subset of independent SNPs. Instructions to generate LDAK weights can be found in this vignette. Alternatively, PLINK can be used (see here).
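A minimal sketch of deriving these indices, assuming the LDAK weights have already been read into a per-SNP table whose rows are in the same order as the p-value and auxiliary data vectors. The table `snps` and its column names are hypothetical, and the commented-out call reflects our reading of the package documentation rather than a tested invocation.

```r
# Hypothetical per-SNP table; in practice ldak_weight would come from LDAK output
# and the rows must be aligned with the p and q vectors passed to the package.
snps <- data.frame(
  rsid        = paste0("rs", 1:8),
  ldak_weight = c(0, 1.2, 0, 0.4, 0, 0, 2.1, 0.7)
)

# Indices of SNPs with a non-zero LDAK weight, used as the independent subset.
# which() silently drops any SNPs whose weight is NA.
indep_index <- which(snps$ldak_weight != 0)
indep_index

# These indices would then be supplied alongside p and q, for example
# (argument names assumed, check ?flexible_cfdr):
# res <- fcfdr::flexible_cfdr(p, q, indep_index = indep_index)
```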
The binary cFDR method does not fit a KDE and instead uses empirical CDFs. For this reason, a leave-one-chromosome-out procedure is implemented intrinsically in the binary_cfdr() function, which therefore requires the chromosome of each SNP as input.
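To illustrate the leave-one-chromosome-out idea (this is not the package's internal code), the sketch below estimates, for each SNP, the empirical CDF of the p-values within its auxiliary group using only SNPs on the other chromosomes. All data are simulated and the variable names are ours.

```r
set.seed(42)
n   <- 2000
chr <- sample(1:22, n, replace = TRUE)  # per-SNP chromosome
q   <- rbinom(n, 1, 0.3)                # binary auxiliary covariate
p   <- runif(n)                         # principal trait p-values

# For each held-out chromosome and each level of q, evaluate the empirical
# CDF built from SNPs on all other chromosomes at the held-out p-values.
loco_cond_cdf <- numeric(n)
for (k in unique(chr)) {
  held_out <- chr == k
  for (level in c(0, 1)) {
    target   <- held_out & q == level
    training <- !held_out & q == level
    if (any(target)) {
      loco_cond_cdf[target] <- ecdf(p[training])(p[target])
    }
  }
}

head(loco_cond_cdf)

# With the package, the chromosome vector would be passed directly, for example
# (argument names assumed, check ?binary_cfdr):
# res <- fcfdr::binary_cfdr(p, q, group = chr)
```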
Our cFDR method can be applied iteratively, whereby the v-values from the previous iteration are used as the p-values in the next iteration. This allows additional layers of information to be incorporated into the analysis.
An example of this is shown in the type 1 diabetes vignette.
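The iteration itself amounts to a simple loop, sketched below under the assumption that a single cFDR step is wrapped in a function taking p-values and one auxiliary covariate and returning v-values. Both `iterate_cfdr()` and `toy_step()` are hypothetical names; `toy_step()` is only a placeholder that makes the example runnable and is not a cFDR implementation.

```r
# Apply a cFDR step once per auxiliary covariate, feeding the v-values
# from each iteration in as the p-values for the next.
iterate_cfdr <- function(p, q_list, cfdr_step) {
  v <- p
  for (q in q_list) {
    stopifnot(length(q) == length(v))
    v <- cfdr_step(v, q)
  }
  v
}

# Placeholder step (NOT cFDR): halves the value where the covariate equals 1.
# In a real analysis this would wrap a call to flexible_cfdr() or binary_cfdr()
# and extract the v-values from its result.
toy_step <- function(p, q) ifelse(q == 1, p / 2, p)

p      <- c(0.2, 0.4)
q_list <- list(c(1, 0), c(1, 1))
v      <- iterate_cfdr(p, q_list, toy_step)
v
```

The order of the covariates in `q_list` determines the order in which they are incorporated.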