Ancient DNA exercise
Overview
Adapted from Martin Sikora.
First:
- Navigate to your home directory:
- Create a directory called "adna"
- Navigate to the directory you just created.
We will try to
- Authenticate ancient DNA
- do some basic population genetics
Data authentication
Authentication involves making sure that the DNA that you have extracted from my fossil and sequenced is indeed from the fossil and not some modern contaminant. A big difference between modern DNA and ancient DNA is the presence of chemical damage due to the passage of time.
Direct measurements of the rate of chemical damage
First, create a directory:
mkdir 01_authentication cd 01_authentication
We will characterize DNA damage patterns using mapDamage, a software to estimate the rate of nucleotide substitution. In this section, we will examine some example BAM files for the presence of DNA damage patterns typical of ancient DNA.
We have a set of 10 modern and 26 ancient individuals (subsampled to 100k reads)
find /home/projects/22126_NGS/exercises/adna/01_authentication/bam/ -name "*bam"
First, run mapDamage on one of the modern individuals:
mapDamage -i /home/projects/22126_NGS/exercises/adna/01_authentication/bam/modern/NA20786.mapped.ILLUMINA.bwa.TSI.low_coverage.20130415.100k_ss.bam -r /home/databases/references/human/hs37d5.fa --no-stats
Examine the output:
cd NA20786.mapped.ILLUMINA.bwa.TSI.low_coverage.20130415.100k_ss.mapDamage/ okular Length_plot.pdf & okular Fragmisincorporation_plot.pdf & cd ..
Q1: which fragment length occurs most frequently?
Q2: what is the frequency of 5' C>T and 3' G>A substitutions ()
Run mapDamage on one of the ancient individuals
mapDamage -i /home/projects/22126_NGS/exercises/adna/01_authentication/bam/ancient/allentoft_2015/RISE559.sort.rmdup.realign.md.100k.bam -r /home/databases/references/human/hs37d5.fa --no-stats
Examine the output
cd RISE559.sort.rmdup.realign.md.100k.mapDamage/ okular Length_plot.pdf & okular Fragmisincorporation_plot.pdf &
Q3: At what fragment length does the distribution show its peak?
Q4: what are the frequencies of 5' C>T (red line) and 3' G>A substitutions (blue line)?
Q5: which bases are enriched at 5' flanking position?
Q6: does your sample look ancient? if not, what might be the reason?
Population genetics
Create a new subdirectory and navigate to it:
cd .. mkdir 02_popgen cd 02_popgen
Explore the reference panel dataset
Pur reference panel dataset is in binary PLINK format, a widely used format in genetic studies (see documentation here). We need to access the following files:
ls /home/projects/22126_NGS/exercises/adna/02_popgen/plink/
However, instead of copying them, we will create symbolic links using the ln command, these acts as placeholders and tell the operating system to pretend that there is an actual file there. This saves considerable disk space compared to copying over the files.
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.bed . ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.bim . ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.cluster . ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.fam . ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.sampleInfo.txt . ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/eur.poplist . ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/modern.poplist . ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/noneur.poplist .
The PLINK binary format consists of 3 files:
file | description |
world.bed | genotype data in binary format (not to be confused with genomic intervals bed file but it is confusing) |
world.bim | metadata for the variants, 1 line per variant |
world.fam | metadata for the samples, 1 line per sample |
file | description |
world.cluster | pre-defined population groupings for samples (for plink) |
world.sampleInfo.txt | additional sample metadata (for plotting etc) |
Let us explore the metadata files:
head world.fam head world.bim head world.cluster head world.sampleInfo.txt
Q7: How many samples / SNPs are in our dataset?
Q8: what populations are in our reference panel and what sample size do they have (trick: forgo the header using "tail -n+2", you need "sort" and uniq (prints 1 instance per repeated line), to tell "uniq" to count and print how many lines were repeated "-c"?
Calculate basic summary statistics (a simple description of the data) for the dataset:
/home/ctools/plink-1.9/plink --bfile world --missing --out world
Q9: are you getting the same number of variants and individuals as you did via UNIX command lines?
The world.imiss file lists the number and fraction of missing genotypes for each sample
Q10: what fraction of SNPs have a missing genotype for the Tyrolean Iceman?
Genotype and merge an ancient individual
In this section, we will merge our ancient data with the reference panel to prepare our dataset for downstream analysis genotypes for our ancient data will be obtained by randomly sampling a read from the alignments (BAM files) at the reference dataset SNP positions.
We are going to use a low-coverage individual from Allentoft et al (RISE507), this data was obtained from an ~5100-year-old individual from the Early Bronze Age Afanasievo culture in the Altai Mountains region
ls /home/projects/22126_NGS/exercises/adna/02_popgen/bam/
First, we need to extract a genomic interval bed file for the SNP positions of the reference panel:
awk '{print $1"\t"($4-1)"\t"$4}' world.bim | gzip > world.snps.bed.gz
awk is a command to create small programs. In this example, we tell it, print the first columns, the fourth column minus 1 and the fourth column again.
Inspect the results:
zcat world.snps.bed.gz | head
Create a read pileup file for the reference panel SNP positions (might take a few minutes)
samtools mpileup -f /home/databases/references/human/hs37d5.fa -B -l world.snps.bed.gz /home/projects/22126_NGS/exercises/adna/02_popgen/bam/RISE507.sort.rmdup.realign.md.bam |gzip > RISE507.pileup.gz
Examine the output:
zcat RISE507.pileup.gz |head
Q11: how many SNPs of the reference panel are covered in RISE507?
Now we will randomly sample a DNA fragment at each position and output the results in VCF format (custom python script):
zcat RISE507.pileup.gz | python2 /home/projects/22126_NGS/exercises/adna/02_popgen/get_haploid_vcf_from_pileup.py -r -s RISE507 |bgzip -c > RISE507.vcf.gz
This is done because the coverage is insufficient to ensure proper genotyping.
Let us inspect the result:
zcat RISE507.vcf.gz |grep -v "^#" |head
We convert to plink binary format:
/home/ctools/plink-1.9/plink --vcf RISE507.vcf.gz --make-bed --double-id --out RISE507
Try to merge the sample with the reference panel
/home/ctools/plink-1.9/plink --bfile world --bmerge RISE507 --out RISE507.merge
You should get an error.
Q12: how many SNPs failed the merge? What is the likely reason?
We will remove the failing SNPs and try again
/home/ctools/plink-1.9/plink --bfile RISE507 --exclude RISE507.merge.missnp --make-bed --out RISE507.merge2 /home/ctools/plink-1.9/plink --bfile world --bmerge RISE507.merge2 --out RISE507.world
Make a cluster file for subsetting
awk '{print $1,$2,$1}' RISE507.world.fam > RISE507.world.cluster
Investigate the genetic affinities of the ancient sample using PCA
In this section, we will try to place our sample within a PCA of a set of modern and ancient individuals.
First, we will have a look at the modern populations in the reference panel:
/home/ctools/plink-1.9/plink --bfile RISE507.world --keep-clusters modern.poplist --within RISE507.world.cluster --pca header tabs --out modern
We can plot the first two principal components using the custom R script plotPca.R
The three positional arguments are the eigenvector file, sample info file and prefix for the output
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R modern.eigenvec world.sampleInfo.txt modern evince modern.pca.plot.pdf &
Q13: which populations are most differentiated along PC1? Q14: which populations are most differentiated along PC2?
We repeat the exercise on a subset of European populations:
/home/ctools/plink-1.9/plink --bfile RISE507.world --keep-clusters eur.poplist --within RISE507.world.cluster --pca header tabs --out eur Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R eur.eigenvec world.sampleInfo.txt eur evince eur.pca.plot.pdf &
Q15: which populations are most differentiated along PC1? Q16: which populations are most differentiated along PC2?
Now, let us examine how the cluster of ancient individuals compared to the modern ones:
/home/ctools/plink-1.9/plink --bfile RISE507.world --pca header tabs --out ancient.world Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient.world.eigenvec world.sampleInfo.txt ancient.world evince ancient.world.pca.plot.pdf &
Here are some references if you want to read more about the different ancient samples:
sample | link |
UstIshim | [1] |
Loschbour | [2] [3] |
Brana | [4] |
NE1 | [5] |
Stuttgart | [6] |
Iceman | [7] |
Karelia | [8] |
Samara | [9] |
MA1 | [10] |
RISE507 | [11] |
Q17: which ancient individuals don't cluster close to any modern individuals? what could be a plausible reason?
Repeat the exercise but remove the non-European modern individuals:
/home/ctools/plink-1.9/plink --bfile RISE507.world --within RISE507.world.cluster --remove-clusters noneur.poplist --pca header tabs --out ancient.eur Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient.eur.eigenvec world.sampleInfo.txt ancient.eur evince ancient.eur.pca.plot.pdf &
Q18: which populations are most differentiated along PC1? what could be a plausible reason?
As a final exercise, we now project the ancient individual on PCs inferred from modern Europeans:
/home/ctools/plink-1.9/plink --bfile RISE507.world --within RISE507.world.cluster --pca-clusters eur.poplist --remove-clusters noneur.poplist --pca header tabs --out ancient_proj.eur --maf 0.01 Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient_proj.eur.eigenvec world.sampleInfo.txt ancient_proj.eur evince ancient_proj.eur.pca.plot.pdf &
Q19: where does our study individual cluster now?
Q20: How do you explain that an individual that is found closer to the modern-day Chinese border is closer to modern Europeans than he is to the Han Chinese?
Please find answers here