Alignment exercise
Overview
In this exercise you will practice aligning NGS data and working with alignment files.
- Navigate to your home directory.
- Create a directory called
align. - Navigate to the
aligndirectory.
We will align two types of NGS data:
- Pseudomonas single-end Illumina reads
- Human paired-end Illumina reads
P. aeruginosa single-end Illumina reads
Alignment using bwa mem
We will align single-end reads that have been trimmed from P. aeruginosa.
Raw data:
/home/projects/22126_NGS/exercises/alignment/SRR8002634_1.fastq.gz
Trimmed data:
/home/projects/22126_NGS/exercises/alignment/SRR8002634_1_trimmed.fq.gz
Reference genome:
/home/databases/references/P_aeruginosa/GCF_000006765.1_ASM676v1_genomic.fasta
The basic bwa mem command to align single-end reads is:
bwa mem [reference.fasta] [reads.fastq.gz] > [output.sam]
Remember: the > operator redirects standard output (STDOUT) to a file.
We have discussed multiplexing and read groups. It is good practice to add a read group ID and sample name during alignment. For example, if the read group is RG38 and the sample is SMPL96:
bwa mem -R "@RG\tID:RG38\tSM:SMPL96" [reference.fasta] [reads.fastq.gz] > [output.sam]
This information is crucial when you later merge multiple BAM files, so you can trace which reads came from which library or sample.
Task: Align the trimmed FASTQ file using the command above.
Q1: If you were not told which FASTQ file contains the trimmed reads, how could you determine it from the files themselves? (Hint: think of at least three different ways.)
Inspecting the alignment
Assume you named your output file SRR8002634_1.sam. You can view it as:
less -S SRR8002634_1.sam
The -S option prevents line wrapping; press q to quit. Use the slides and the
<a href="https://samtools.github.io/hts-specs/SAMv1.pdf">official SAM documentation</a> to interpret each field.
Answer the following:
Q2: How many lines does the header have (lines starting with @)?
Q3: What is the genomic coordinate (reference name and position) of the first read SRR8002634.1?
Q4: What is the mapping quality of the third read SRR8002634.3? What does that mapping quality tell you about this read?
Q5: Using the SAM flag definitions (see <a href="https://broadinstitute.github.io/picard/explain-flags.html">explain flags</a>), determine among the first 8 reads how many map to the forward (+) strand and how many to the reverse (–) strand.
Q6: Is the 10th read SRR8002634.11 unmapped? (Note: SRR8002634.9 was removed by trimming, so numbering skips.) How did you determine this from the SAM fields?
To get basic alignment statistics, use:
samtools flagstat [input.sam]
Below is a brief explanation of the fields reported by flagstat:
| Category | Meaning |
|---|---|
| mapQ | Mapping quality |
| QC-passed reads | Reads not marked as QC-failed; these are typically used for analysis. |
| QC-failed reads | Reads flagged as having problems by the processing pipeline; downstream tools usually ignore them. |
| total | Total number of alignments reported. |
| secondary | Additional alignments for reads that map equally well to multiple locations. |
| supplementary | Alignments for chimeric or split reads where different parts map to different locations. |
| duplicates | Reads marked as duplicates (e.g. PCR duplicates); will be discussed in the next class. |
| mapped | Number of reads with at least one reported alignment (not unmapped). |
| paired in sequencing | Reads that were sequenced as part of a pair (not single-end). |
| read1 | First read in the pair (forward). |
| read2 | Second read in the pair (reverse). |
| properly paired | Pairs that face each other and are within the expected insert size range. |
| with itself and mate mapped | Both the read and its mate are mapped (whether or not properly paired). |
| singletons | Reads that are mapped but whose mate is unmapped. |
| with mate mapped to a different chr | Reads whose mate is mapped to a different chromosome. |
Q7: According to samtools flagstat, what fraction of reads did not align to the reference?
Working with alignments
Format conversion
This should be the first and hopefully last time you work directly with SAM for large files.
First, check the SAM file size:
ls -lh SRR8002634_1.sam
Convert SAM to BAM:
samtools view -bS [input.sam] > [output.bam]
Check the BAM file size:
ls -lh SRR8002634_1.bam
-l gives a detailed listing (permissions, size, date). -h shows file sizes in human-readable form (e.g. 2.4M instead of 2469134 bytes).
The BAM file contains exactly the same alignments as the SAM file, but in binary form. To view it as SAM:
samtools view [input.bam] | less -S
You can filter reads based on SAM flags. For example, to include only unmapped reads:
samtools view -f 0x4 [input.bam]
To exclude unmapped reads:
samtools view -F 0x4 [input.bam]
The flag 0x4 corresponds to "read unmapped" (see
<a href="https://broadinstitute.github.io/picard/explain-flags.html">flag documentation</a>).
Q8: What is the size ratio of SAM to BAM (SAM size divided by BAM size)?
Now convert BAM to CRAM, which compresses further using the reference:
samtools view -C -T [reference.fasta] [input.bam] > [output.cram]
Use the same reference FASTA you used for mapping. Check the CRAM file size with ls -lh.
To view CRAM as SAM:
samtools view -T [reference.fasta] [input.cram] | less -S
Q9: What is the size ratio of BAM to CRAM?
To save space, please remove the SAM and CRAM files (we will work with BAM only):
rm [file]
Sorting
Sort the BAM file by genomic coordinate:
samtools sort [input.bam] > [output.sorted.bam]
Be careful not to overwrite the original BAM file; for example:
input.bam=SRR8002634_1.bamoutput.sorted.bam=SRR8002634_1.sorted.bam
Use samtools view and less -S to confirm that reads are ordered by reference and coordinate.
Index the sorted BAM file:
samtools index [input.sorted.bam]
Note: you cannot index an unsorted BAM file.
Retrieving a particular region
Once sorted and indexed, you can retrieve reads from a specific region:
samtools view [input.sorted.bam] [regionID]:[start]-[end]
For example, to get reads overlapping positions 1,000,000–1,000,100 on chromosome NC_002516.2:
samtools view SRR8002634_1.sorted.bam NC_002516.2:1000000-1000100
Q10: How many reads are aligned between positions 2,000,000 and 3,000,000 on the reference NC_002516.2?
Hint: do not save to a file; instead use:
samtools view [options] | wc -l
Q11: How many reads with mapping quality ≥ 30 are aligned between positions 2,000,000 and 3,000,000 on NC_002516.2?
Hint: run samtools view without arguments to see its options.
Average coverage
We will use mosdepth to measure the average coverage (mean number of reads covering each base in the genome):
mosdepth [output_prefix] [input.sorted.bam]
Use any prefix you like (e.g. SRR8002634_1). mosdepth will write a summary file named [output_prefix].mosdepth.summary.txt.
Check the summary and see if the reported coverage makes sense given the data.
Q12: On average, how many reads cover a base in the genome? What is the maximum coverage (maximum number of reads covering a single position)?
You can also inspect per-position coverage using:
samtools mpileup [input.sorted.bam] | less -S
The wrong reference genome?
Q13: Suppose you accidentally aligned the reads to a different bacterial reference genome, e.g. Yersinia pestis (the plague bacterium), a distant relative of Pseudomonas alcaligenes. Would the number of aligned reads go up or down compared to the correct reference? Why? What if the other species was very closely related—would you expect more or fewer reads to align?
Optional bonus: Try it.
Yersinia pestis reference:
/home/databases/references/Y_pestis/GCF_000222975.1_ASM22297v1_genomic.fasta
Pseudomonas alcaligenes reference:
/home/databases/references/P_alcaligenes/GCF_001597285.1_ASM159728v1_genomic.fasta
Human paired-end Illumina reads
Aligning
We will align exome-seq reads from a <a href="https://en.wikipedia.org/wiki/Yoruba_people">Yoruba</a> female.
Raw data:
/home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz /home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz
NA19201_1.fastq.gz contains the forward reads; NA19201_2.fastq.gz contains the reverse reads. These reads are already trimmed.
Your goal is to write a single command line that:
- Uses
bwa memto align the paired-end reads and produce SAM output. - Converts SAM to BAM.
- Sorts the BAM file.
bwa mem syntax for paired-end reads:
bwa mem [reference.fasta] [forward.fastq.gz] [reverse.fastq.gz]
Human reference (GRCh38):
/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa
Note: There are multiple versions of the human reference genome (e.g. hg18, hg19, hg38). Coordinates and even sequences can differ between versions. Always make sure to use the same reference version consistently in all steps of your analysis.
If possible, add a read group and sample name. For example, if read group is RG26 and sample is YRB42:
-R "@RG\tID:RG26\tSM:YRB42"
Converting and sorting:
samtools view -bS [input.sam]converts SAM to BAM.- When reading from STDIN, use
/dev/stdinas the input. samtools sort [input.bam]sorts the BAM file.
Your combined command should:
- Run
bwa mem, - Pipe SAM output to
samtools view, - Pipe BAM output to
samtools sort, - Redirect the final sorted BAM to
NA19201.bam.
The alignment may take around 10 minutes.
Q14: Write the full one-line command that performs alignment, SAM->BAM conversion, and sorting using pipes, and saves output as NA19201.bam.
Q15: What are two major advantages of using UNIX pipes instead of running each command separately and writing intermediate files?
Note: For speed, the provided reads only contain sequences mapping to chr20.
Alignment statistics
flagstat
Q16: Using samtools flagstat, what proportion of reads aligned to the reference?
Q17: Using the same output, how many read pairs are marked as properly paired?
Q18: Index the BAM file, then run:
samtools view [input.bam] [chromosome]
Count how many reads align to chr20 (hint: pipe to wc -l). How many total reads are aligned to chr20?
stat
Generate additional statistics using:
samtools stat [input.bam]
Redirect the output to a file (e.g. NA19201.stat).
Then run:
plot-bamstats -p NA19201 NA19201.stat firefox NA19201.html
Q19: Look at the insert size distribution plot. What is the most common insert size (approximately)?
Inspecting the alignment in IGV
We will use IGV (Integrative Genomics Viewer) to inspect the human alignment.
Start IGV:
igv &
Then:
- In the genome dropdown, choose “More…”, then select Human hg38 and click OK.
- Go to “File” → “Load from File…” and select your
NA19201.bamfile. - In the chromosome dropdown, select
chr20. - In the coordinate box, enter:
chr20:35,581,362and press Enter.
There appears to be a variant at this position. Use IGV to answer:
Q20: What is the reference base at this position? What base is present in the sample reads?
Q21: How many reads support the non-reference base?
Q22: This variant lies within a gene. What is the name of the gene?
Feel free to explore other regions on chromosome 20 using IGV.
Please find the answers <a href="Alignment_exercise_answers">here</a>.
Congratulations, you finished the exercise!