---
title: "LDAK vignette"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{LDAK vignette}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dev = "png",
  eval = FALSE
)
```
The `flexible_cfdr` function requires as input the indices of an independent subset of SNPs. These indices indicate the `(p,q)` pairs considered independent observations for the purpose of the KDE fitting procedure.
In practice, we identify independent SNPs as those assigned a non-zero weighting by the [LDAK package's](https://dougspeed.com/ldak/) weighting calculation [procedure](https://dougspeed.com/calculate-weightings/). These weightings were originally developed as a means of adjusting for the unequal tagging of (causal) SNPs across the genome when estimating heritability. The calculation procedure requires haplotype information, which we obtain in the form of the 1000 Genomes (1000G) Phase 3 data.
The purpose of this vignette is to sketch out our own approach to generating LDAK weightings for use in `fcfdr`. Whilst RMarkdown was used to generate this HTML document, the code snippets contained in this vignette are intended to serve as static illustrations and are not intended to be run without further modification.
This workflow was written with the use of LDAK `v5.1` (download [here](https://dougspeed.com/downloads/)) and PLINK `v1.90b6.21 64-bit (19 Oct 2020)` (download [here](https://www.cog-genomics.org/plink/1.9/)). We also use code from the R packages [bigsnpr](https://privefl.github.io/bigsnpr/) and [data.table](https://rdatatable.gitlab.io/data.table/).
---
## Optional: Obtaining and processing the 1000G data
*NB: In this section we discuss how to obtain and process haplotype data from the 1000G project. We have made available the end result, the quality-controlled, European-only data, as a ~280MB download at [https://doi.org/10.5281/zenodo.4709547](https://doi.org/10.5281/zenodo.4709547). If your principal p-values were taken from a GWAS conducted in a European population, you can skip the steps here and use the files in the download instead. You will still have to run LDAK. Note that the data use coordinates from the hg19 genome assembly.*
We first download the 1000G Phase 3 data in the `vcf` format. These data use the `hg19` genome assembly for SNP coordinates.
```{bash download_1000G}
for i in {1..22}; do
wget -O "chr"$i".vcf.gz" "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr"$i".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
done
wget -O "chrX.vcf.gz" "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz"
wget -O "chrY.vcf.gz" "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz"
```
Subsequent processing depends on the data being in the PLINK-compatible `bed`, `bim`, and `fam` file formats. We use PLINK to convert the `vcf` files to these formats.
```{bash vcf_to_bed}
for i in {1..22}; do
plink --vcf "chr"$i".vcf.gz" --make-bed --out "chr"$i
done
plink --vcf chrX.vcf.gz --make-bed --out chrX
plink --vcf chrY.vcf.gz --make-bed --out chrY
```
The 1000G haplotype data were obtained by sequencing individuals from a variety of populations. In practice, we take a subset of the 1000G samples to ensure that the ancestry of the individuals from which we obtain haplotype data matches the ancestry of the individuals in the GWAS of interest (from which our principal p-values come). Sample ancestry information for the 1000G project can be found at the [data portal](https://www.internationalgenome.org/data-portal/sample) of the International Genome Sample Resource.
We specify the desired sample IDs in a `fam` file. The `fam` files generated from the `vcf` files by `plink --make-bed` should be identical for all chromosomes, so we use a single custom `fam` file, `euro.fam`, to downsample all files. `euro.fam` was obtained by subsetting `chr1.fam` to retain only those entries with European sample IDs. We write the new files out to the directory `euro_only`.
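As a minimal sketch of how `euro.fam` might be constructed, assuming a file `euro_sample_ids.txt` containing one European 1000G sample ID per line (the file name is hypothetical; the IDs can be exported from the IGSR data portal linked above):
```{r make_euro_fam}
library(data.table)

# Hypothetical file of European 1000G sample IDs, one per line
euro_ids <- fread('euro_sample_ids.txt', header = FALSE)$V1

# fam files have no header; these are the six standard columns
fam_dat <- fread('chr1.fam', header = FALSE,
                 col.names = c('FID', 'IID', 'Father', 'Mother', 'Sex', 'Pheno'))

# Retain only the European samples and write out the custom fam file
fwrite(fam_dat[IID %in% euro_ids], file = 'euro.fam', sep = ' ', col.names = FALSE)
```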
```{bash subset_to_euro}
# plink will not create the output directory for us, so make it first
mkdir -p euro_only
for i in {1..22}; do
  plink --bfile "chr"$i --keep euro.fam --make-bed --silent --out "euro_only/chr"$i
done
plink --bfile chrX --keep euro.fam --make-bed --silent --out euro_only/chrX
plink --bfile chrY --keep euro.fam --make-bed --silent --out euro_only/chrY
```
The `fam` files omit sex information. This is problematic only when we wish to carry out QC on the Y chromosome SNPs: PLINK drops heterozygous Y genotypes if we do not affirm the sex of the samples in `chrY.fam`. We do this by changing the values in the sex code column from 0 ('unknown') to 1 ('male').
```{bash}
sed -i 's/0 0 0 -9/0 0 1 -9/' euro_only/chrY.fam
```
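For illustration, a hypothetical sample's entry before (first line) and after (second line) the edit; the fifth column is the sex code:
```
HG00096 HG00096 0 0 0 -9
HG00096 HG00096 0 0 1 -9
```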
We then use a function from the R package `bigsnpr` to carry out some basic QC on the 1000G data and write out the duly filtered data to another directory, `euro_only_qc`.
```{r plink_qc}
# Create the output directory first, as plink will not create it for us
dir.create('euro_only_qc', showWarnings = FALSE)
for(chr in c(1:22, 'X', 'Y')) {
  bigsnpr::snp_plinkQC(plink.path = '/bin/plink',
                       prefix.in = paste0('euro_only/chr', chr),
                       prefix.out = paste0('euro_only_qc/chr', chr),
                       geno = 0, maf = 0.01, hwe = 1e-10)
}
```
---
## Joining the 1000G and GWAS SNPs
*NB: The following processing was carried out with a GWAS containing only autosomal SNPs, so we omit reference to the sex chromosomes henceforth.*
If you skipped the previous section on processing the 1000G data then you can download our pre-processed files:
```{bash}
wget https://zenodo.org/record/4709547/files/euro_only_qc.tar.gz
tar -xzvf euro_only_qc.tar.gz
```
For the purpose of fitting the KDE in `flexible_cfdr` we care only about the dependence structure of the SNPs at which our `p` and `q` values were obtained, so we filter the 1000G SNPs to retain those for which we have GWAS p-values. Ideally, the 1000G SNPs would form a superset of those in the GWAS, but this is typically not the case: with ~5.5 million SNPs in our GWAS we would expect to lose several tens of thousands of SNPs. The absence of this relatively small number of SNPs should have a negligible influence on the KDE.
To filter the SNPs we use `plink --extract`, which allows one to extract SNPs based on:
1. SNP ID (i.e. rsID); or
2. SNP genomic coordinates.
rsIDs are not always consistent between data sets and we have found that extracting SNPs using genomic coordinates typically yields a larger intersection of 1000G and GWAS SNPs. We therefore use `plink --extract` with the `--range` argument ([documentation here](https://zzz.bwh.harvard.edu/plink/dataman.shtml#extract)), which requires that we pass a whitespace-separated text file specifying a list of genomic 'ranges' (hereafter the 'range file'), one per row, in the format `chr bp bp ID`. In our use case, each SNP corresponds to a range, albeit one of length one: the first and second `bp` columns, which give the start and end coordinates of the range, are identical. We use rsIDs as range identifiers, but these are essentially superfluous when using the `--range` argument and will not be used by `plink` to filter the SNPs.
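For illustration, the first rows of a range file might look like this (the coordinates and rsIDs here are hypothetical):
```
1 1118275 1118275 rs1234567
1 1120431 1120431 rs7654321
```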
The coordinates of the GWAS SNPs are usually readily available and a simple approach to generating the requisite range files for `plink --extract` would copy the genomic coordinates of each SNP to the appropriate chromosome-specific range file (albeit in the `plink`-compliant `chr bp bp rsID` format). However, this approach overlooks the possibility that there exist multiple SNPs with different reference/alternative allele pairings at the same locus, that is, duplicates.
We can account for this when preparing the range files by first joining the 1000G SNPs in each chromosome's `bim` file data to our GWAS SNPs and checking that the allele pairings match. This approach also allows us to remove duplicated rows in the range files; as noted above, it is not essential for a good fit that we retain *every* SNP.
The following exemplifies the sort of R code one can use to write out the range files whilst carrying out the aforementioned checks.
```{r gwas_bim_join}
library(data.table)

dir.create('plinkRanges', showWarnings = FALSE)
dir.create('filtered', showWarnings = FALSE)
dir.create('ldak', showWarnings = FALSE)

# We assume this contains columns SNPID, CHR19 and BP19 (so-called because
# of hg19), REF, and ALT; fread returns a data.table
gwas_dat <- fread('gwas_sum_stats.tsv.gz', sep = '\t', header = TRUE)

# Iterate over the chromosome-specific bim files
for(i in 1:22) {
  # bim files have no header
  bim_dat <- fread(sprintf('euro_only_qc/chr%d.bim', i), sep = '\t', header = FALSE,
                   col.names = c('Chr', 'ID', 'Cm', 'BP19', 'A1', 'A2'))
  bim_join <- merge(bim_dat, gwas_dat[CHR19 == i], by = 'BP19')
  # Make sure the alleles match, although for two-sided association p-values
  # we don't care whether ref/alt is reversed
  bim_join <- bim_join[(REF == A1 & ALT == A2) | (REF == A2 & ALT == A1)]
  # plink's range format requires duplicated start/end coordinate columns
  bim_join <- bim_join[, .(Chr, Start = BP19, End = BP19, SNPID)]
  if(any(duplicated(bim_join, by = 'Start'))) {
    warning(sprintf('%d duplicates removed from output', sum(duplicated(bim_join, by = 'Start'))))
  }
  # Remove duplicates
  bim_join <- unique(bim_join, by = 'Start')
  fwrite(bim_join, file = sprintf('plinkRanges/chr%d.tsv', i), sep = '\t',
         row.names = FALSE, col.names = FALSE, quote = FALSE)
}
```
Note that we took care to use the `hg19` assembly coordinates to match the assembly used in the 1000G data we downloaded above.
Using these range files we can now filter the 1000G data.
```{bash plink_extract}
for i in {1..22}; do
  plink --silent --bfile "euro_only_qc/chr"$i --extract "plinkRanges/chr"$i".tsv" --range --make-bed --out "filtered/chr"$i
done
```
---
## Running the LDAK weighting calculation procedure
The weighting calculation procedure entails a preprocessing step, `ldak --cut-weights`, and a calculation step, `ldak --calc-weights-all`. We refer the reader to the [LDAK documentation](https://dougspeed.com/calculate-weightings/) for further guidance. An idiosyncrasy of our approach is that we process the data in a set of chromosome-specific files. This is an artifact of the format of the 1000G data.
We create the directory `ldak` and within it subdirectories labelled `chrx`, where x ranges from 1 to 22, to hold the results of the procedure. (Note that you may have to replace the executable name `ldak` with `ldak5.1.linux`, depending on how you downloaded the LDAK software.)
```{bash ldak_weighting}
for i in {1..22}; do
  ldak --cut-weights "ldak/chr"$i --bfile "filtered/chr"$i
  ldak --calc-weights-all "ldak/chr"$i --bfile "filtered/chr"$i
done
```
LDAK will write out the procedure's results to files named `ldak/chrx/weights.all`. We join these chromosome-specific files into one.
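For reference, each per-chromosome `weights.all` file begins with a header row followed by one row per SNP, along the following lines (the rsIDs and values shown here are hypothetical):
```
Predictor Weight Neighbours Tagging Info Check
rs1234567 0.8724 35 4.218 1 1
rs7654321 0.0000 41 6.035 1 1
```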
```{bash combine_weights_files}
for i in {1..22}; do
  # Append the chromosome number as a final column, omitting each file's header row
  sed "s/$/ $i/" <(tail -n +2 "ldak/chr$i/weights.all") >> "ldak/combined_weights.all"
done
# We add back in a single header for the combined file
sed -i '1 i\Predictor Weight Neighbours Tagging Info Check Chr' "ldak/combined_weights.all"
```
---
## Using the LDAK weightings in `flexible_cfdr`
In practice, it is necessary to merge the weightings contained in `combined_weights.all` back into the file containing the `p` and `q` values so that all three vectors can be passed to `flexible_cfdr`. This poses a problem, however: the rsIDs in the `Predictor` column of `combined_weights.all` are derived from the 1000G `bim` files, not from the GWAS rsIDs specified in the range files (which, as noted above, `plink` ignores). Merging 1000G and GWAS SNPs using genomic coordinates is preferable to merging on rsIDs, as the latter are not always consistent between data sets.
`combined_weights.all` as written out by LDAK does not contain SNP basepair coordinates, so we must use the chromosome-specific `bim` files in the `filtered` directory to recover the basepair coordinates and reference/alternative allele pairings. This can be accomplished by matching the SNPs in `combined_weights.all` to those in the `bim` files; as the rsIDs in `combined_weights.all` are derived from these very `bim` files, merging on rsID is appropriate in this case. The following snippets illustrate one way to do this.
To simplify matters, we first concatenate all chromosomal `bim` files.
```{bash}
for i in {1..22}; do
cat "filtered/chr"$i".bim" >> "filtered/chr_all.bim"
done
```
We then add the genomic metadata we need to the `combined_weights.all` LDAK output file.
```{r merge_bim_combined_weights}
library(data.table)

# bim files have no header
bim_dat <- fread('filtered/chr_all.bim', sep = '\t', header = FALSE,
                 col.names = c('Chr', 'ID', 'Cm', 'BP19', 'A1', 'A2'))
weights_dat <- fread('ldak/combined_weights.all', sep = ' ', header = TRUE)

# Drop rows with a missing ID or weight value
weights_dat <- na.omit(weights_dat, cols = c('Predictor', 'Weight'))

join_dat <- merge(weights_dat, bim_dat[, .(ID, Chr, BP19, A1, A2)], all.x = TRUE,
                  by.x = c('Predictor', 'Chr'), by.y = c('ID', 'Chr'), sort = FALSE)

fwrite(join_dat, file = 'ldak/combined_weights_meta.all', sep = ' ',
       col.names = TRUE, row.names = FALSE, quote = FALSE)
```
The format of the file in which your `p` and `q` are stored will of course vary, but the snippet below illustrates how `combined_weights_meta.all` can be merged with it.
```{r}
weights_dat <- fread('ldak/combined_weights_meta.all', sep = ' ', header = TRUE,
                     select = c('Predictor', 'Weight', 'Chr', 'BP19', 'A1', 'A2'))
# We assume this contains columns SNPID, CHR19 and BP19 (so-called because of hg19), REF, and ALT
gwas_dat <- fread('gwas_sum_stats.tsv.gz', sep = '\t', header = TRUE)
gwas_dat <- merge(gwas_dat, weights_dat, by.x = c('CHR19', 'BP19'),
                  by.y = c('Chr', 'BP19'), sort = FALSE)
# Drop rows where the ref/alt allele pairing differs from that already present
gwas_dat <- gwas_dat[(REF == A1 & ALT == A2) | (REF == A2 & ALT == A1)]
# Drop the now-redundant allele columns
gwas_dat[, c('A1', 'A2') := NULL]
```