Alignment exercise
Overview
First:
- Navigate to your home directory:
- Create a directory called "align"
- Navigate to the directory you just created.
We will try to align different types of NGS data.
- Pseudomonas alcaligenes single-end Illumina reads
- Human single-end paired-end Illumina reads
P. aeruginosa single-end Illumina reads
Alignment using bwa mem
We will align some of the single-end reads that we trimmed from P. aeruginosa. The raw data can be found here:
/home/projects/22126_NGS/exercises/alignment/SRR8002634_1.fastq.gz
The trimmed data can be found here:
/home/projects/22126_NGS/exercises/alignment/SRR8002634_1_trimmed.fq.gz
The reference genome can be found here:
/home/databases/references/P_aeruginosa/GCF_000006765.1_ASM676v1_genomic.fasta
the basic command line to align the data using bwa mem is
bwa mem [reference genome in fasta] [input.fastq.gz] > [output.sam]
Remember that the > the operator sends the STDOUT to a file.
We have learned about multiplexing. Get used to adding reads groups to denote the read group ID and sample they belong to. For instance, let us say that our read group was "RG38" and the sample was "SMPL96". We would add the read groups as such during the alignment:
bwa mem -R "@RG\tID:RG38\tSM:SMPL96" [reference genome in fasta] [input.fastq.gz] > [output.sam]
That way, when merging multiple bam files, we know which reads came from which files.
Align the correct fastq file using the command above.
Q1: Had we not told you which fastq file contains the trimmed reads, how could have figured out which file contains the trimmed reads? I can think of 3 different ways.
Looking at the alignment
First, let us look at the alignment, let us assume you called the output SRR8002634_1.sam. The file can be viewed as such:
less -S SRR8002634_1.sam
-S wraps the lines around, press 'q' to quit. Please refer to the slides of the official documentation check what each field means. Answer the following questions:
Q2: how many lines does the header have?
Q3: what is the genomic coordinate of the first read (SRR8002634.1)?
Q4: what is the mapping quality of the third read "SRR8002634.3"? What does this tell you about this read?
Q5: Use the flags (see flags explained here) and determine how many reads mapped to the + strand and to the - strand among the first 8 reads.
Q6: Is the 10th read "SRR8002634.11" (n.b. SRR8002634.9 was removed by the trimming algorithm so we skip one) unmapped? How did you determine this?
To get some basic statistics regarding the alignment use:
samtools flagstat [input.sam]
Here is a brief table of what the different fields mean:
Category | Meaning |
mapQ | mapping quality |
QC-passed reads | most processing pipelines will QC fail reads for a variety of reasons, those are the ones that were not failed i.e. can be used for analysis |
QC-failed reads | as mentioned above, those are read with any kind of problem as determined by the processing pipeline and should not be used by downstream analyses. Most properly coded software will overlook those reads. |
total | total number of alignments |
secondary | if a mapper aligns a read equally well to 2 or more locations, I can report the best location as primary alignment and any other ones as secondary |
supplementary | If, for instance, the beginning of a read aligns well to one location and the end of the reader aligns well to a different location in a chimeric way, one read will the "normal" read and the other one will be the "supplementary". |
duplicates | these are reads are marked as "duplicates" will cover this at the next class |
mapped | total number of reads for which there is at least one location i.e. they are not left unmapped. |
paired in sequencing | I have reads that are paired (versus single-end) |
read1 | forward reads |
read2 | reverse reads |
properly paired | we have covered this and class, meaning facing each other and within a reasonable distance. |
with itself and mate mapped | the read maps and also its mate (forward or reverse) but may or may not satisfy the properly paired requirement. |
singletons | the read was mapped but not its mate |
with mate mapped to a different chr | the read was mapped but its mate mapped to different chromosome |
Q7: what is the fraction of reads that did not align to the reference?
Working with alignments
Format conversion
This should be the first and hopefully last time you work with the SAM format. First, evaluate the size of the file:
ls -lh SRR8002634_1.sam
Then we will converted to binary SAM format or BAM:
samtools view -bS [input.sam] > [output.bam]
And evaluate the space used by the BAM file:
ls -lh SRR8002634_1.bam
-l says: give me the long listing ex: creation data, file size etc. -h make the file size human-readable (ex: 2.4M instead of 2469134).
This BAM file contains exactly the same alignments as the previous file, to view it as SAM:
samtools view [input.bam] | less -S
You can filter for certain reads for a specific flag for example, to only include unmapped reads:
samtools view -f 0x4 [input.bam]
Or to exclude unmapped reads:
samtools view -F 0x4 [input.bam]
Why 0x4? refer to Decoding SAM flags.
Q8: what is the size ratio of SAM to BAM?
Then we can convert it to CRAM which compresseses it even further using the reference:
samtools view -C -T [reference genome in fasta] [input.bam] > [output.cram]
The [reference genome in fasta] is exactly the same file you used for mapping.
use "ls -lh" to find the file size.
To view the alignment as SAM:
samtools view -T [reference genome in fasta] [input.cram] | less -S
Q9: what is the size ratio of BAM to CRAM?
Please remove the SAM file and CRAM file as we will concentrate on BAM for now, remember to remove a file:
rm [file]
Sorting
First, sort the BAM file using samtools sort:
samtools sort [input.bam] > [output.bam]
Be careful not to overwrite the original file, the [input.bam] can be SRR8002634_1.bam whereas [output.bam] can be SRR8002634_1.sorted.bam. Inspect the alignments using the command "samtools view" used above and check if the reads are sorted.
You can now index the sorted BAM file using samtools index:
samtools index [input.sorted.bam]
Remember you cannot index an unsorted BAM file.
Retrieving a particular region
Once sorted and indexed, we can retrieve specific portions of the alignment, for example:
samtools view [input.sorted.bam] [region ID]:[start]-[end]
For instance, the "[region ID]:[start]-[end]" can be "NC_002516.2:1000000-1000100" to retrieve all reads overlapping positions between the 1,000,000th base to the 1,000,100th base.
Q10: How many reads are aligned between positions 2000000 and 3000000 on the reference NC_002516.2? (tip: to count lines in a file use "wc -l" do not write the output to another file, pipe directly as such "samtools view [commands] |wc -l ").
Q11: How many reads with mapping quality greater or equal to 30 are aligned between positions 2000000 and 3000000 on the reference NC_002516.2? (tip: use "samtools view" without any arguments to view options).
Average coverage
We will use mosdepth to measure the average coverage (i.e. the average number of times any base in the genome is covered) for our BAM file:
mosdepth [output prefix] [input.sorted.bam]
Feel free to use whatever [output prefix], we used SRR8002634_1. Use mosdepth on your sorted bam file and see the results in [output prefix].mosdepth.summary.txt
Does the estimate make sense?
Q12: On average, how many reads cover a base in the genome? What is the maximum number of reads covering a single position?
Use the following command to view the reads on a per position basis:
samtools mpileup [input.sorted.bam] |less -S
The wrong reference genome?
Q13: Imagine you would inadvertently align the data to a different reference genome, say the Yersinia pestis, the bacteria responsible for the plague, a distant relative of Pseudomonas alcaligenes. Would the number of aligned reads go up or down? Why? What would happen if the bacterial species was highly related, would you align more or less reads?
bonus: try it: The Yersinia pestis genome:
/home/databases/references/Y_pestis/GCF_000222975.1_ASM22297v1_genomic.fasta
The Pseudomonas alcaligenes genome:
/home/databases/references/P_alcaligenes/GCF_001597285.1_ASM159728v1_genomic.fasta
Human paired-end Illumina reads
Aligning
We will align exome-seq sequences from a Yoruba female. The raw data can be found here:
/home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz /home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz
The first one is the forward, the second, the reverse. These are pre-trimmed so no need to worry about the adapters.
Your first exercise will be to write a single command line that will:
- use bwa mem to align sequences and produce SAM format output
- transform SAM to BAM
- sort the bam file
For 1), the command line has the following format:
bwa mem [reference genome in fasta] [forward fastq] [reverse fastq]
The human reference is here:
/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa
Note: there are multiple versions of in reference genome for a given species. The version number for this particular reference is hg38 and is the latest as of writing. However, there are various different versions of the human genome ex: hg19, hg18. The coordinates will differ and you cannot assume that the sequence at a specific chromosome/coordinate will be the same with the next genome version. Be very careful to stick with the same version throughout your analyses.
If you can, add the read groups+sample, let us say that the read group was RG26 and sample was YRB42 (tip: use -R "@RG\tID:RG26\tSM:YRB42" in bwa mem).
This command will produce SAM on the STDOUT. Next, we want to convert SAM to BAM:
samtools view -bS [input SAM]
In your case, the input is in SAM format and will be the STDIN: /dev/stdin. The command above will produce BAM on the STDOUT. Next, we want to sort:
samtools sort [input BAM]
Now, the input is in BAM format and will be the STDIN: /dev/stdin. The command above will produce BAM (but sorted) on the STDOUT. The output should be redirected to a file.
Type down the command as a single line, run it and send the output to NA19201.bam. The alignment takes about 10 minutes, so free to take a bathroom break.
Q14: How to run all of the steps above on a single line using UNIX pipes?
Q15: There two major advantages of using UNIX pipes over simply running each command at the time. What would those be?
n.b. For full disclosure, the reads were only the ones that only mapped to chr20. We did this to speed things up a bit.
Statistics
flagstat
Q16: Using samtools flagstat, what is the proportion of reads that aligned to the reference?
Q17 Using the same command, how many pairs are properly paired?
Q18 Index the file and use:
samtools view [input BAM] [chromosome name]
And determine how many total reads aligned to chr20 (reminder: use wc -l to count lines).
stat
Use
samtools stat [input BAM]
To generate additional statistics. The command above generates a bunch of statistics and produces it on the standard out (STDOUT). Redirect this to a file (ex: NA19201.stat). We will then use plot-bamstats on the stat file that you have produced to create a congenial interface, for example:
plot-bamstats -p NA19201 NA19201.stat
firefox NA19201.html
Q19 look at the graph of the insert size, what would you say is the most common insert size?
igv
We will now use igv (integrated genome viewer) to look at the alignment per se. First, start:
igv &
We will do the following to load the file:
- You should see a drop-down field with probably the text "Human hg19" written in it. This is your genome and genome version. Select " More...", find and select "Human hg38" and press ok.
- Go to "File", "load from file" and select your BAM file.
- next to the field when you've selected your genome build, there is a possibility to select a chromosome, please select "chr20".
- next to the field for the chromosome, you can zoom to a specific location, type: chr20:35,581,362
There seems to be a variant that position, use igv to answer the following questions
Q20: what is the reference base? What is the base that seems to be in the sample?
Q21: how many reads support this particular base?
Q22: this mutation seems to fall in a gene, what is the name of the gene?
Feel free to use igv to look at different loci on chromosome 20.
Please find the answers here
Congratulations you finished the exercise!