The flexible_cfdr
function requires as input the indices of an independent subset of SNPs.
These indices indicate the (p,q)
pairs considered
independent observations for the purpose of the KDE fitting
procedure.
In practice, we identify independent SNPs as those assigned a non-zero weighting by the LDAK package’s weighting calculation procedure. These weightings were originally developed as a means of adjusting for the unequal tagging of (causal) SNPs across the genome when estimating heritability. The calculation procedure requires haplotype information, which we obtain in the form of the 1000 Genomes (1000G) Phase 3 data.
The purpose of this vignette is to sketch out our own approach to
generating LDAK weightings for use in fcfdr
. Whilst
RMarkdown was used to generate this HTML document, the code snippets
contained in this vignette are intended to serve as static illustrations
and are not intended to be run without further modification.
This workflow was written with the use of LDAK v5.1
(download here) and PLINK
v1.90b6.21 64-bit (19 Oct 2020)
(download here). We also use
code from the R packages bigsnpr and data.table.
NB: In this section we discuss how to obtain and process haplotype data from the 1000G project. We have made available the end result, the quality-controlled, European-only data, as a ~280MB download at https://doi.org/10.5281/zenodo.4709547. If your principal p-values were taken from a GWAS conducted in a European population, you can skip the steps here and use the files in the download instead. You will still have to run LDAK. Note that the data use coordinates from the hg19 genome assembly.
We first download the 1000G Phase 3 data in the vcf
format. These data use the hg19
genome assembly for SNP
coordinates.
for i in {1..22}; do
wget -O "chr"$i".vcf.gz" "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr"$i".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
done
wget -O "chrX.vcf.gz" "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz"
wget -O "chrY.vcf.gz" "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz"
Subsequent processing depends on the data being in the
PLINK-compatible bed
, bim
, and
fam
file formats. We use PLINK to convert the
vcf
files to these formats.
for i in {1..22}; do
plink --vcf "chr"$i".vcf.gz" --make-bed --out "chr"$i
done
plink --vcf chrX.vcf.gz --make-bed --out chrX
plink --vcf chrY.vcf.gz --make-bed --out chrY
The 1000G haplotype data were obtained by sequencing individuals from a variety of populations. In practice, we take a subset of the 1000G samples to ensure that the ancestry of the individuals from which we obtain haplotype data matches the ancestry of the individuals in the GWAS of interest (from which our principal p-values come). Sample ancestry information for the 1000G project can be found at the data portal of the International Genome Sample Resource.
We specify the desired sample IDs in a fam
file. The
fam
files generated from the vcf
files by
plink --make-bed
should be identical for all chromosomes,
so we use a single custom fam
file, euro.fam
,
to downsample all files. euro.fam
was obtained by
subsetting chr1.fam
to retain only those entries with
European sample IDs. We write the new files out to the directory
euro_only
.
for i in {1..22}; do
plink --bfile "chr"$i --keep euro.fam --make-bed --silent --out "euro_only/chr"$i
done
plink --bfile chrX --keep euro.fam --make-bed --silent --out euro_only/chrX
plink --bfile chrY --keep euro.fam --make-bed --silent --out euro_only/chrY
The fam
files omit sex information. This is problematic
only when we wish to carry out QC on the Y chromosome SNPs: PLINK drops
heterozygous Y genotypes if we do not affirm the sex of the samples in
chrY.fam
. We do this by changing the values in the sex code
column from 0 (‘unknown’) to 1 (‘male’).
We then use a function from the R package bigsnpr
to
carry out some basic QC on the 1000G data and write out the duly
filtered data to another directory, euro_only_qc
.
for(i in 1:22) {
bigsnpr::snp_plinkQC(plink.path = '/bin/plink', prefix.in = paste0('euro_only/chr', i),
prefix.out = paste0('euro_only_qc/chr', i),
geno = 0, maf = 0.01, hwe = 1e-10)
}
bigsnpr::snp_plinkQC(plink.path = '/bin/plink', prefix.in = 'euro_only/chrX',
prefix.out = 'euro_only_qc/chrX',
geno = 0, maf = 0.01, hwe = 1e-10)
bigsnpr::snp_plinkQC(plink.path = '/bin/plink', prefix.in = 'euro_only/chrY',
prefix.out = 'euro_only_qc/chrY',
geno = 0, maf = 0.01, hwe = 1e-10)
NB: The following processing was carried out with a GWAS containing only autosomal SNPs, so we omit reference to the sex chromosomes henceforth.
If you skipped the previous section on processing the 1000G data then you can download our pre-processed files:
For the purpose of fitting the KDE in flexible_cfdr
we
care only about the dependence structure of the SNPs at which our
p
and q
values were obtained, so we filter the
1000G SNPs to retain those for which we have GWAS p-values. Ideally, the
1000G SNPs would form a superset of those in the GWAS, but this is
typically not the case; with ~5.5 million SNPs in our GWAS we would
expect to lose several tens of thousands of SNPs. The absence of this
relatively paltry number of SNPs should have a neglible influence upon
the KDE.
To filter the SNPs we use plink --extract
, which allows
one to extract SNPs based on:
rsIDs are not always consistent between data sets and we have found
that extracting SNPs using genomic coordinates typically yields a larger
intersection of 1000G and GWAS SNPs. Thus we use
plink --extract
with the --range
argument (documentation
here), which requires that we pass a white space-separated text file
specifying a list of genomic ‘ranges’ (hereafter the ‘range file’) in
rows in the format chr bp bp ID
. In our use case, each SNP
corresponds to a range, albeit one of length one: the first and second
bp
columns which give the start and end coordinates of the
range are duplicated. We use rsIDs as range identifiers, but these are
essentially superfluous when using the --range
argument and
will not be used by plink
to filter the SNPs.
The coordinates of the GWAS SNPs are usually readily available and a
simple approach to generating the requisite range files for
plink --extract
would copy the genomic coordinates of each
SNP to the appropriate chromosome-specific range file (albeit in the
plink
-compliant chr bp bp rsID
format).
However, this approach overlooks the possibility that there exist
multiple SNPs with different reference/alternative allele pairings at
the same locus, that is, duplicates.
We can account for this when preparing the range files by first
joining the 1000G SNPs in each chromosome’s bim
file data
to our GWAS SNPs and checking that the allele pairings match. This
approach also allows us to remove duplicated rows in the range files; as
noted above, it is not essential for a good fit that we retain
every SNP.
The following exemplifies the sort of R code one can use to write out the range files whilst carrying out the aforementioned checks.
library(data.table)
system("mkdir plinkRanges")
system("mkdir filtered")
system("mkdir ldak")
# We assume this contains columns SNPID, CHR19 and BP19 (so-called because of hg19), REF, and ALT
# note that gwas_dat needs to be a data.table
gwas_dat <- fread('gwas_sum_stats.tsv.gz', sep = '\t', header = T)
# Iterating over chromosomal bim files
for(i in 1:22) {
# bim files have no header
bim_dat <- fread(sprintf('euro_only_qc/chr%d.bim', i), sep = '\t', header = F, col.names = c('Chr', 'ID', 'Cm', 'BP19', 'A1', 'A2') )
bim_join <- merge(bim_dat, gwas_dat[CHR19 == i], by.x = 'BP19', by.y = 'BP19')
# Make sure alleles match, although for two-sided association p-values we don't care whether ref/alt is reversed
bim_join <- bim_join[(REF == A1 & ALT == A2) | (REF == A2 & ALT == A1)]
bim_join <- bim_join[, .(Chr, BP19, BP19, SNPID)]
if(any(duplicated(bim_join, by='BP19'))) {
warning(sprintf('%d duplicates removed from output', sum(duplicated(bim_join, by = 'BP19'))))
}
# Remove duplicates
bim_join <- unique(bim_join, by='BP19')
fwrite(bim_join, file = sprintf('plinkRanges/chr%d.tsv', i), row.names = F, sep = '\t', col.names = F, quote = F)
}
Note that we took care to use the hg19
assembly
coordinates to match the assembly used in the 1000G data we downloaded
above.
Using these range files we can now filter the 1000G data.
for i in {1..22}; do
plink --silent --bfile "euro_only_qc/chr"$i --extract "plinkRanges/chr"$i".tsv" --range --make-bed --out "filtered/chr"$i
done
The weighting calculation procedure entails a preprocessing step,
ldak --cut-weights
, and a calculation step,
ldak --calc-weights-all
. We refer the reader to the LDAK
documentation for further guidance. An idiosyncracy of our approach
is that we process the data in a set of chromosome-specific files. This
is an artifact of the format of the 1000G data.
We create the directory ldak
and within it
subdirectories labelled chrx
, where x ranges from 1 to 22,
to hold the results of the procedure. (note that you may have to replace
the executable ldak
with ldak5.1.linux
depending on how you downloaded the LDAK software).
for i in {1..22}; do
ldak --cut-weights "ldak/chr"$i --bfile "filtered/chr"$i
ldak --calc-weights-all "ldak/chr"$i --bfile "filtered/chr"$i
done
LDAK will write out the procedure’s results to files named
ldak/chrx/weights.all
. We join these chromosome-specific
files into one.
for i in {1..22}; do
# We omit the header in each file
sed "s/$/ $i/" <(tail -n +2 "ldak/chr$i/weights.all") >> "ldak/combined_weights.all"
done
# We add back in a single header for the combined file
sed -i '1 i\Predictor Weight Neighbours Tagging Info Check Chr' "ldak/combined_weights.all"
flexible_fcfdr
In practice, it is necessary to merge the weightings contained in
combined_weights.all
back into the file containing the
p
and q
values so that all three vectors can
be made available to flexible_cfdr
. This poses a problem,
however, as the rsIDs in the Predictor
column of
combined_weights.all
are derived from the 1000G
bim
files, not the GWAS file rsIDs specified in the range
files (which are ignored by plink
as noted above). Merging
1000G and GWAS SNPs using genomic coordinates is to be preferred over
the use of rsIDs as the latter are not always consistent between data
sets.
combined_weights.all
as written out by LDAK does not
contain SNP basepair coordinates. This means we must use the
bim
files contained in the filtered/chrx
directories to recover the basepair coordinates and
reference/alternative allele pairings. This can be accomplished by
matching the SNPs in combined_weights.all
to those in the
bim
files. As the rsIDs in
combined_weights.all
are derived from these
bim
files, it is appropriate in this case to merge these
data with the use of rsIDs. We provide the following snippets as an
example of how this can be accomplished.
To simplify matters, we first concatenate all chromosomal
bim
files.
We then add the genomic metadata we need to the
combined_weights.all
LDAK output file.
library(data.table)
bim_dat <- fread('filtered/chr_all.bim', sep = '\t', header = F, col.names = c('Chr', 'ID', 'Cm', 'BP19', 'A1', 'A2'))
weights_dat <- fread('ldak/combined_weights.all', sep = ' ', header = T)
# Drop rows with a missing ID or weight value
weights_dat <- na.omit(weights_dat, cols = c('Predictor', 'Weight'))
join_dat <- merge(weights_dat, bim_dat[, .(ID, Chr, BP19, A1, A2)], all.x = T, by.x = c('Predictor', 'Chr'), by.y = c('ID', 'Chr'), sort = F)
fwrite(join_dat, file = 'ldak/combined_weights_meta.all', sep = ' ', col.names = T, row.names = F, quote = F)
The format of the file in which your p
and
q
are stored will of course vary, but the snippet below
illustrates how combined_weights_meta.all
can be merged
with it.
weights_dat <- fread('ldak/combined_weights_meta.all', sep = ' ', header = T, select = c('Predictor', 'Weight', 'Chr', 'BP19', 'A1', 'A2'))
# We assume this contains columns SNPID, CHR19 and BP19 (so-called because of hg19), REF, and ALT
gwas_dat <- fread('gwas_sum_stats.tsv.gz', sep = '\t', header = T)
gwas_dat <- merge(gwas_dat, weights_dat, by.x = c('CHR19', 'BP19'), by.y = c('Chr', 'BP19'), sort = F)
# Drop rows where the ref/alt allele pairing differs from that already present
gwas_dat <- gwas_dat[((REF == A1 & ALT == A2) | (REF == A2 & ALT == A1))]
# Drop now-redundant allele columns
gwas_dat[, c('A1', 'A2') := NULL ]