Ancient DNA exercise

From 22126
Revision as of 16:35, 19 March 2024 by WikiSysop (talk | contribs) (Created page with "<H2>Overview</H2> Adapted from Martin Sikora. First: <OL> <LI>Navigate to your home directory: <LI>Create a directory called "adna" <LI>Navigate to the directory you just created. </OL> We will try to # Authenticate ancient DNA # do some basic population genetics <h2> Data authentication</h2> Authentication involves making sure that the DNA that you have extracted from my fossil and sequenced is indeed from the fossil and not some modern contaminant. A big differe...")
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Overview

Adapted from Martin Sikora.

First:

  1. Navigate to your home directory:
  2. Create a directory called "adna"
  3. Navigate to the directory you just created.

We will try to

  1. Authenticate ancient DNA
  2. do some basic population genetics

Data authentication

Authentication involves making sure that the DNA that you have extracted from my fossil and sequenced is indeed from the fossil and not some modern contaminant. A big difference between modern DNA and ancient DNA is the presence of chemical damage due to the passage of time.

Direct measurements of the rate of chemical damage

First, create a directory:

mkdir  01_authentication
cd 01_authentication

We will characterize DNA damage patterns using mapDamage, a software to estimate the rate of nucleotide substitution. In this section, we will examine some example BAM files for the presence of DNA damage patterns typical of ancient DNA.


We have a set of 10 modern and 26 ancient individuals (subsampled to 100k reads)

find /home/projects/22126_NGS/exercises/adna/01_authentication/bam/ -name "*bam"

First, run mapDamage on one of the modern individuals:

mapDamage -i /home/projects/22126_NGS/exercises/adna/01_authentication/bam/modern/NA20786.mapped.ILLUMINA.bwa.TSI.low_coverage.20130415.100k_ss.bam -r /home/databases/references/human/hs37d5.fa --no-stats

Examine the output:

cd NA20786.mapped.ILLUMINA.bwa.TSI.low_coverage.20130415.100k_ss.mapDamage/
okular Length_plot.pdf &
okular Fragmisincorporation_plot.pdf &
cd ..


Q1: which fragment length occurs most frequently?

Q2: what is the frequency of 5' C>T and 3' G>A substitutions ()

Run mapDamage on one of the ancient individuals

mapDamage -i  /home/projects/22126_NGS/exercises/adna/01_authentication/bam/ancient/allentoft_2015/RISE559.sort.rmdup.realign.md.100k.bam -r /home/databases/references/human/hs37d5.fa --no-stats

Examine the output

cd RISE559.sort.rmdup.realign.md.100k.mapDamage/
okular Length_plot.pdf &
okular Fragmisincorporation_plot.pdf &

Q3: At what fragment length does the distribution show its peak?

Q4: what are the frequencies of 5' C>T (red line) and 3' G>A substitutions (blue line)?

Q5: which bases are enriched at 5' flanking position?

Q6: does your sample look ancient? if not, what might be the reason?



Population genetics

Create a new subdirectory and navigate to it:

cd ..
mkdir 02_popgen
cd 02_popgen

Explore the reference panel dataset

Pur reference panel dataset is in binary PLINK format, a widely used format in genetic studies (see documentation here). We need to access the following files:

ls /home/projects/22126_NGS/exercises/adna/02_popgen/plink/

However, instead of copying them, we will create symbolic links using the ln command, these acts as placeholders and tell the operating system to pretend that there is an actual file there. This saves considerable disk space compared to copying over the files.

ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.bed .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.bim .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.cluster .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.fam .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/plink/world.sampleInfo.txt .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/eur.poplist .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/modern.poplist .
ln -s /home/projects/22126_NGS/exercises/adna/02_popgen/noneur.poplist .

The PLINK binary format consists of 3 files:


We also have the following files than contain extra information:
file description
world.bed genotype data in binary format (not to be confused with genomic intervals bed file but it is confusing)
world.bim metadata for the variants, 1 line per variant
world.fam metadata for the samples, 1 line per sample
file description
world.cluster pre-defined population groupings for samples (for plink)
world.sampleInfo.txt additional sample metadata (for plotting etc)


Let us explore the metadata files:

head world.fam
head world.bim
head world.cluster
head world.sampleInfo.txt

Q7: How many samples / SNPs are in our dataset?


Q8: what populations are in our reference panel and what sample size do they have (trick: forgo the header using "tail -n+2", you need "sort" and uniq (prints 1 instance per repeated line), to tell "uniq" to count and print how many lines were repeated "-c"?


Calculate basic summary statistics (a simple description of the data) for the dataset:

/home/ctools/plink-1.9/plink --bfile world --missing --out world


Q9: are you getting the same number of variants and individuals as you did via UNIX command lines?

The world.imiss file lists the number and fraction of missing genotypes for each sample

Q10: what fraction of SNPs have a missing genotype for the Tyrolean Iceman?


Genotype and merge an ancient individual

In this section, we will merge our ancient data with the reference panel to prepare our dataset for downstream analysis genotypes for our ancient data will be obtained by randomly sampling a read from the alignments (BAM files) at the reference dataset SNP positions.

We are going to use a low-coverage individual from Allentoft et al (RISE507), this data was obtained from an ~5100-year-old individual from the Early Bronze Age Afanasievo culture in the Altai Mountains region

ls /home/projects/22126_NGS/exercises/adna/02_popgen/bam/

First, we need to extract a genomic interval bed file for the SNP positions of the reference panel:

 
awk '{print $1"\t"($4-1)"\t"$4}' world.bim | gzip > world.snps.bed.gz

awk is a command to create small programs. In this example, we tell it, print the first columns, the fourth column minus 1 and the fourth column again.

Inspect the results:

zcat world.snps.bed.gz | head

Create a read pileup file for the reference panel SNP positions (might take a few minutes)

samtools mpileup -f /home/databases/references/human/hs37d5.fa -B -l world.snps.bed.gz /home/projects/22126_NGS/exercises/adna/02_popgen/bam/RISE507.sort.rmdup.realign.md.bam |gzip > RISE507.pileup.gz

Examine the output:

zcat RISE507.pileup.gz |head

Q11: how many SNPs of the reference panel are covered in RISE507?


Now we will randomly sample a DNA fragment at each position and output the results in VCF format (custom python script):

zcat RISE507.pileup.gz | python2 /home/projects/22126_NGS/exercises/adna/02_popgen/get_haploid_vcf_from_pileup.py -r -s RISE507 |bgzip -c > RISE507.vcf.gz

This is done because the coverage is insufficient to ensure proper genotyping.

Let us inspect the result:

zcat RISE507.vcf.gz |grep -v "^#" |head

We convert to plink binary format:

 
/home/ctools/plink-1.9/plink --vcf RISE507.vcf.gz --make-bed --double-id --out RISE507

Try to merge the sample with the reference panel

 
/home/ctools/plink-1.9/plink --bfile world --bmerge RISE507 --out RISE507.merge

You should get an error.

Q12: how many SNPs failed the merge? What is the likely reason?

We will remove the failing SNPs and try again

/home/ctools/plink-1.9/plink --bfile RISE507 --exclude RISE507.merge.missnp --make-bed --out RISE507.merge2
/home/ctools/plink-1.9/plink --bfile world --bmerge RISE507.merge2 --out RISE507.world

Make a cluster file for subsetting

awk '{print $1,$2,$1}' RISE507.world.fam > RISE507.world.cluster

Investigate the genetic affinities of the ancient sample using PCA

In this section, we will try to place our sample within a PCA of a set of modern and ancient individuals.

First, we will have a look at the modern populations in the reference panel:

/home/ctools/plink-1.9/plink --bfile RISE507.world --keep-clusters modern.poplist --within RISE507.world.cluster --pca header tabs --out modern

We can plot the first two principal components using the custom R script plotPca.R

The three positional arguments are the eigenvector file, sample info file and prefix for the output

Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R modern.eigenvec world.sampleInfo.txt modern
evince modern.pca.plot.pdf &

Q13: which populations are most differentiated along PC1? Q14: which populations are most differentiated along PC2?

We repeat the exercise on a subset of European populations:

/home/ctools/plink-1.9/plink --bfile RISE507.world --keep-clusters eur.poplist --within RISE507.world.cluster --pca header tabs --out eur
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R eur.eigenvec world.sampleInfo.txt eur
evince eur.pca.plot.pdf &

Q15: which populations are most differentiated along PC1? Q16: which populations are most differentiated along PC2?


Now, let us examine how the cluster of ancient individuals compared to the modern ones:

/home/ctools/plink-1.9/plink --bfile RISE507.world --pca header tabs --out ancient.world
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient.world.eigenvec world.sampleInfo.txt ancient.world
evince ancient.world.pca.plot.pdf &

Here are some references if you want to read more about the different ancient samples:


sample link
UstIshim [1]
Loschbour [2] [3]
Brana [4]
NE1 [5]
Stuttgart [6]
Iceman [7]
Karelia [8]
Samara [9]
MA1 [10]
RISE507 [11]

Q17: which ancient individuals don't cluster close to any modern individuals? what could be a plausible reason?

Repeat the exercise but remove the non-European modern individuals:

/home/ctools/plink-1.9/plink --bfile RISE507.world --within RISE507.world.cluster --remove-clusters noneur.poplist --pca header tabs --out ancient.eur
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient.eur.eigenvec world.sampleInfo.txt ancient.eur
evince ancient.eur.pca.plot.pdf &

Q18: which populations are most differentiated along PC1? what could be a plausible reason?

As a final exercise, we now project the ancient individual on PCs inferred from modern Europeans:

/home/ctools/plink-1.9/plink --bfile RISE507.world --within RISE507.world.cluster --pca-clusters eur.poplist --remove-clusters noneur.poplist --pca header tabs --out ancient_proj.eur --maf 0.01
Rscript /home/projects/22126_NGS/exercises/adna/02_popgen/plotPca.R ancient_proj.eur.eigenvec world.sampleInfo.txt ancient_proj.eur
evince ancient_proj.eur.pca.plot.pdf &


Q19: where does our study individual cluster now?


Q20: How do you explain that an individual that is found closer to the modern-day Chinese border is closer to modern Europeans than he is to the Han Chinese?


Please find answers here