Alignment exercise: Difference between revisions

From 22126
Jump to navigation Jump to search
(Created page with " <H2>Overview</H2> First: <OL> <LI>Navigate to your home directory: <LI>Create a directory called "align" <LI>Navigate to the directory you just created. </OL> We will try to align different types of NGS data. # <i>Pseudomonas alcaligenes</i> single-end Illumina reads # Human single-end paired-end Illumina reads <H2><i>P. aeruginosa</i> single-end Illumina reads</H2> <H3>Alignment using bwa mem</H3> <p> We will align some of the single-end reads that we trimmed f...")
 
No edit summary
 
(5 intermediate revisions by 2 users not shown)
Line 1: Line 1:
<h2>Overview</h2>


<p>In this exercise you will practice aligning NGS data and working with alignment files.</p>


<H2>Overview</H2>
<ol>
  <li>Navigate to your home directory.</li>
  <li>Create a directory called <code>align</code>.</li>
  <li>Navigate to the <code>align</code> directory.</li>
</ol>


First:
<p>We will align two types of NGS data:</p>
<OL>
<ol>
<LI>Navigate to your home directory:
  <li><i>Pseudomonas</i> single-end Illumina reads</li>
<LI>Create a directory called "align"
  <li>Human paired-end Illumina reads</li>
<LI>Navigate to the directory you just created.
</ol>
</OL>


We will try to align different types of NGS data.
<hr>
# <i>Pseudomonas alcaligenes</i> single-end Illumina reads
# Human single-end paired-end Illumina reads


<H2><i>P. aeruginosa</i> single-end Illumina reads</H2>
<h2><i>P. aeruginosa</i> single-end Illumina reads</h2>


<H3>Alignment using bwa mem</H3>
<h3>Alignment using bwa mem</h3>


<p> We will align some of the single-end reads that we trimmed from <i>P. aeruginosa</i>. The raw data can be found here:</p>
<p>We will align single-end reads that have been trimmed from <i>P. aeruginosa</i>.</p>
<pre>
 
/home/projects/22126_NGS/exercises/alignment/SRR8002634_1.fastq.gz
<p><b>Raw data:</b></p>
</pre>
<pre>/home/projects/22126_NGS/exercises/alignment/SRR8002634_1.fastq.gz</pre>


<p>The trimmed data can be found here:</p>
<p><b>Trimmed data:</b></p>
<pre>
<pre>/home/projects/22126_NGS/exercises/alignment/SRR8002634_1_trimmed.fq.gz</pre>
/home/projects/22126_NGS/exercises/alignment/SRR8002634_1_trimmed.fq.gz
</pre>


<p>The reference genome can be found here:</p>
<p><b>Reference genome:</b></p>
<pre>/home/databases/references/P_aeruginosa/GCF_000006765.1_ASM676v1_genomic.fasta</pre>


<pre>
<p>The basic <code>bwa mem</code> command to align single-end reads is:</p>
/home/databases/references/P_aeruginosa/GCF_000006765.1_ASM676v1_genomic.fasta
<pre>bwa mem [reference.fasta] [reads.fastq.gz] &gt; [output.sam]</pre>
</pre>


<p> the basic command line to align the data using '''bwa mem''' is</p>
<p>Remember: the <code>&gt;</code> operator redirects standard output (STDOUT) to a file.</p>


<pre>
<p>We have discussed multiplexing and read groups. It is good practice to add a read group ID and sample name during alignment. For example, if the read group is <code>RG38</code> and the sample is <code>SMPL96</code>:</p>
bwa mem  [reference genome in fasta]  [input.fastq.gz] > [output.sam]
</pre>


Remember that the '''>''' the operator sends the STDOUT to a file.  
<pre>bwa mem -R "@RG\tID:RG38\tSM:SMPL96" [reference.fasta] [reads.fastq.gz] &gt; [output.sam]</pre>


We have learned about multiplexing. Get used to adding reads groups to denote the read group ID and sample they belong to. For instance, let us say that our read group was "RG38" and the sample was "SMPL96". We would add the read groups as such during the alignment:
<p>This information is crucial when you later merge multiple BAM files, so you can trace which reads came from which library or sample.</p>


<pre>
<p><b>Task:</b> Align the <b>trimmed</b> FASTQ file using the command above.</p>
bwa mem -R "@RG\tID:RG38\tSM:SMPL96" [reference genome in fasta]  [input.fastq.gz] > [output.sam]
</pre>


That way, when merging multiple bam files, we know which reads came from which files.
<p><b>Q1:</b> If you were not told which FASTQ file contains the trimmed reads, how could you determine it from the files themselves? (Hint: think of at least three different ways.)</p>
 


'''Align''' the correct fastq file using the command above.
<hr>


'''Q1:''' Had we not told you which fastq file contains the trimmed reads, how could have figured out which file contains the trimmed reads?  I can think of 3 different ways.
<h3>Inspecting the alignment</h3>


<H3>Looking at the alignment</H3>
<p>Assume you named your output file <code>SRR8002634_1.sam</code>. You can view it as:</p>
<pre>less -S SRR8002634_1.sam</pre>


First, let us look at the alignment, let us assume you called the output '''SRR8002634_1.sam'''. The file can be viewed as such:
<p>The <code>-S</code> option prevents line wrapping; press <code>q</code> to quit. Use the slides and the
[https://samtools.github.io/hts-specs/SAMv1.pdf official SAM specification] to interpret each field.</p>


<pre>
<p>Answer the following:</p>
less -S SRR8002634_1.sam
</pre>


-S wraps the lines around, press 'q' to quit. Please refer to the slides of the [https://samtools.github.io/hts-specs/SAMv1.pdf official documentation] check what each field means. Answer the following questions:
<p><b>Q2:</b> How many lines does the header have (lines starting with <code>@</code>)?</p>


'''Q2:''' how many lines does the header have?
<p><b>Q3:</b> What is the genomic coordinate (reference name and position) of the first read <code>SRR8002634.1</code>?</p>


'''Q3:''' what is the genomic coordinate of the first read (SRR8002634.1)?
<p><b>Q4:</b> What is the mapping quality of the third read <code>SRR8002634.3</code>? What does that mapping quality tell you about this read?</p>


'''Q4:''' what is the mapping quality of the third read "SRR8002634.3"? What does this tell you about this read?
<p><b>Q5:</b> Using the SAM flag definitions (see
[https://broadinstitute.github.io/picard/explain-flags.html Picard flag explanation]), determine among the first 8 reads how many map to the <b>forward (+)</b> strand and how many to the <b>reverse (–)</b> strand.</p>


'''Q5:''' Use the flags ([https://broadinstitute.github.io/picard/explain-flags.html see flags explained here]) and determine how many reads mapped to the + strand and to the - strand among the first 8 reads.
<p><b>Q6:</b> Is the 10th read <code>SRR8002634.11</code> unmapped? (Note: <code>SRR8002634.9</code> was removed by trimming, so numbering skips.) How did you determine this from the SAM fields?</p>


'''Q6:''' Is the 10th read "SRR8002634.11" (n.b. SRR8002634.9 was removed by the trimming algorithm so we skip one) unmapped? How did you determine this?
<p>To get basic alignment statistics, use:</p>
<pre>samtools flagstat [input.sam]</pre>


To get some basic statistics regarding the alignment use:
<p>Below is a brief explanation of the fields reported by <code>flagstat</code>:</p>
<pre>
samtools flagstat [input.sam]
</pre>


Here is a brief table of what the different fields mean:
<table class="wikitable">
  <tr>
    <th>Category</th>
    <th>Meaning</th>
  </tr>
  <tr>
    <td>mapQ</td>
    <td>Mapping quality</td>
  </tr>
  <tr>
    <td>QC-passed reads</td>
    <td>Reads not marked as QC-failed; these are typically used for analysis.</td>
  </tr>
  <tr>
    <td>QC-failed reads</td>
    <td>Reads flagged as having problems by the processing pipeline; downstream tools usually ignore them.</td>
  </tr>
  <tr>
    <td>total</td>
    <td>Total number of alignments reported.</td>
  </tr>
  <tr>
    <td>secondary</td>
    <td>Additional alignments for reads that map equally well to multiple locations.</td>
  </tr>
  <tr>
    <td>supplementary</td>
    <td>Alignments for chimeric or split reads where different parts map to different locations.</td>
  </tr>
  <tr>
    <td>duplicates</td>
    <td>Reads marked as duplicates (e.g. PCR duplicates); will be discussed in the next class.</td>
  </tr>
  <tr>
    <td>mapped</td>
    <td>Number of reads with at least one reported alignment (not unmapped).</td>
  </tr>
  <tr>
    <td>paired in sequencing</td>
    <td>Reads that were sequenced as part of a pair (not single-end).</td>
  </tr>
  <tr>
    <td>read1</td>
    <td>First read in the pair (forward).</td>
  </tr>
  <tr>
    <td>read2</td>
    <td>Second read in the pair (reverse).</td>
  </tr>
  <tr>
    <td>properly paired</td>
    <td>Pairs that face each other and are within the expected insert size range.</td>
  </tr>
  <tr>
    <td>with itself and mate mapped</td>
    <td>Both the read and its mate are mapped (whether or not properly paired).</td>
  </tr>
  <tr>
    <td>singletons</td>
    <td>Reads that are mapped but whose mate is unmapped.</td>
  </tr>
  <tr>
    <td>with mate mapped to a different chr</td>
    <td>Reads whose mate is mapped to a different chromosome.</td>
  </tr>
</table>


<p><b>Q7:</b> According to <code>samtools flagstat</code>, what fraction of reads did <b>not</b> align to the reference?</p>


<hr>


{| class="wikitable"
<h3>Working with alignments</h3>
| '''Category'''
| '''Meaning'''   
|-
| mapQ
| mapping quality
|-
| QC-passed reads
| most processing pipelines will QC fail reads for a variety of reasons, those are the ones that were not failed i.e. can be used for analysis
|-
| QC-failed reads
| as mentioned above, those are read with any kind of problem as determined by the processing pipeline and should not be used by downstream analyses. Most properly coded software will overlook those reads.
|-
| total
| total number of alignments
|-
| secondary
| if a mapper aligns a read equally well to 2 or more locations, I can report the best location as primary alignment and any other ones as secondary
|-
| supplementary
| If, for instance, the beginning of a read aligns well to one location and the end of the reader aligns well to a different location in a chimeric way, one read will the "normal" read and the other one will be the "supplementary".
|-
| duplicates
| these are reads are marked as "duplicates" will cover this at the next class
|-
| mapped
| total number of reads for which there is at least one location i.e. they are not left unmapped.
|-
| paired in sequencing
| I have reads that are paired (versus single-end)
|-
| read1
| forward reads
|-
|read2
|reverse reads
|-
| properly paired
| we have covered this and class, meaning facing each other and within a reasonable distance.
|-
| with itself and mate mapped
| the read maps and also its mate (forward or reverse) but may or may not satisfy the properly paired requirement.
|-
| singletons
| the read was mapped but not its mate
|-
|with mate mapped to a different chr
| the read was mapped but its mate mapped to different chromosome
|}


'''Q7:''' what is the fraction of reads that did not align to the reference?
<h4>Format conversion</h4>


<H3>Working with alignments</H3>
<p>This should be the first and hopefully last time you work directly with SAM for large files.</p>


<H4>Format conversion</H4>
<p>First, check the SAM file size:</p>
<pre>ls -lh SRR8002634_1.sam</pre>


This should be the first and hopefully last time you work with the SAM format. First, evaluate the size of the file:
<p>Convert SAM to BAM:</p>
<pre>samtools view -bS [input.sam] &gt; [output.bam]</pre>


<pre>
<p>Check the BAM file size:</p>
ls -lh SRR8002634_1.sam
<pre>ls -lh SRR8002634_1.bam</pre>
</pre>


Then we will converted to binary SAM format or BAM:
<p><code>-l</code> gives a detailed listing (permissions, size, date). <code>-h</code> shows file sizes in human-readable form (e.g. 2.4M instead of 2469134 bytes).</p>
<pre>
samtools view -bS [input.sam] > [output.bam]
</pre>


And evaluate the space used by the BAM file:
<p>The BAM file contains exactly the same alignments as the SAM file, but in binary form. To view it as SAM:</p>
<pre>
<pre>samtools view [input.bam] | less -S</pre>
ls -lh SRR8002634_1.sam
</pre>
-l says: give me the long listing ex: creation data, file size etc. -h make the file size human-readable (ex: 2.4M instead of 2469134).


This BAM file contains exactly the same alignments as the previous file, to view it as SAM:
<p>You can filter reads based on SAM flags. For example, to include only unmapped reads:</p>
<pre>samtools view -f 0x4 [input.bam]</pre>


<pre>
<p>To exclude unmapped reads:</p>
samtools view [input.bam] | less -S
<pre>samtools view -F 0x4 [input.bam]</pre>
</pre>


You can filter for certain reads for a specific flag for example, to only include unmapped reads:
<p>The flag <code>0x4</code> corresponds to “read unmapped” (see the
[https://broadinstitute.github.io/picard/explain-flags.html flag documentation]).</p>


<pre>
<p><b>Q8:</b> What is the size ratio of SAM to BAM (SAM size divided by BAM size)?</p>
samtools view -f 0x4 [input.bam]
</pre>


Or to exclude unmapped reads:
<p>Now convert BAM to CRAM, which compresses further using the reference:</p>
<pre>
<pre>samtools view -C -T [reference.fasta] [input.bam] &gt; [output.cram]</pre>
samtools view -F 0x4 [input.bam]  
</pre>


Why 0x4? refer to [https://broadinstitute.github.io/picard/explain-flags.html Decoding SAM flags].
<p>Use the same reference FASTA you used for mapping. Check the CRAM file size with <code>ls -lh</code>.</p>


'''Q8:''' what is the size ratio of SAM to BAM?
<p>To view CRAM as SAM:</p>
<pre>samtools view -T [reference.fasta] [input.cram] | less -S</pre>


Then we can convert it to CRAM which compresseses it even further using the reference:
<p><b>Q9:</b> What is the size ratio of BAM to CRAM?</p>
<pre>
samtools view -C -T [reference genome in fasta]  [input.bam] > [output.cram]
</pre>


The [reference genome in fasta] is exactly the same file you used for mapping.
<p>To save space, please remove the SAM and CRAM files (we will work with BAM only):</p>
<pre>rm [file]</pre>


use "ls -lh" to find the file size.
<h4>Sorting</h4>


To view the alignment as SAM:
<p>Sort the BAM file by genomic coordinate:</p>
<pre>
<pre>samtools sort [input.bam] &gt; [output.sorted.bam]</pre>
samtools view -T [reference genome in fasta] [input.cram] | less -S
</pre>


<p>Be careful not to overwrite the original BAM file; for example:</p>
<ul>
  <li><code>input.bam</code> = <code>SRR8002634_1.bam</code></li>
  <li><code>output.sorted.bam</code> = <code>SRR8002634_1.sorted.bam</code></li>
</ul>


'''Q9:''' what is the size ratio of BAM to CRAM?
<p>Use <code>samtools view</code> and <code>less -S</code> to confirm that reads are ordered by reference and coordinate.</p>


Please remove the SAM file and CRAM file as we will concentrate on BAM for now, remember to remove a file:  
<p>Index the sorted BAM file:</p>
<pre>
<pre>samtools index [input.sorted.bam]</pre>
rm [file]
</pre>  


<H4>Sorting</H4>
<p>Note: you <b>cannot</b> index an unsorted BAM file.</p>


First, sort the BAM file using samtools sort:
<h4>Retrieving a particular region</h4>
<pre>
samtools sort  [input.bam] > [output.bam]
</pre>


Be careful not to overwrite the original file, the [input.bam] can be SRR8002634_1.bam whereas [output.bam] can be SRR8002634_1.sorted.bam. Inspect the alignments using the command "samtools view" used above and check if the reads are sorted.
<p>Once sorted and indexed, you can retrieve reads from a specific region:</p>
<pre>samtools view [input.sorted.bam] [regionID]:[start]-[end]</pre>


You can now index the sorted BAM file using samtools index:
<p>For example, to get reads overlapping positions 1,000,000–1,000,100 on chromosome <code>NC_002516.2</code>:</p>
<pre>
<pre>samtools view SRR8002634_1.sorted.bam NC_002516.2:1000000-1000100</pre>
samtools index [input.sorted.bam]
</pre>


Remember you cannot index an unsorted BAM file.
<p><b>Q10:</b> How many reads are aligned between positions 2,000,000 and 3,000,000 on the reference <code>NC_002516.2</code>?<br>
Hint: do not save to a file; instead use:</p>
<pre>samtools view [options] | wc -l</pre>


<H4>Retrieving a particular region</H4>
<p><b>Q11:</b> How many reads with mapping quality <b>&ge; 30</b> are aligned between positions 2,000,000 and 3,000,000 on <code>NC_002516.2</code>?<br>
Hint: run <code>samtools view</code> without arguments to see its options.</p>


Once sorted and indexed, we can retrieve specific portions of the alignment, for example:
<h4>Average coverage</h4>


<pre>
<p>We will use <code>mosdepth</code> to measure the average coverage (mean number of reads covering each base in the genome):</p>
samtools view  [input.sorted.bam] [region ID]:[start]-[end]
<pre>mosdepth [output_prefix] [input.sorted.bam]</pre>
</pre>
For instance, the "[region ID]:[start]-[end]" can be "NC_002516.2:1000000-1000100" to retrieve all reads overlapping positions between the 1,000,000th base to the 1,000,100th base.


<p>Use any prefix you like (e.g. <code>SRR8002634_1</code>). <code>mosdepth</code> will write a summary file named <code>[output_prefix].mosdepth.summary.txt</code>.</p>


'''Q10:''' How many reads are aligned between positions 2000000 and 3000000 on the reference NC_002516.2? (tip: to count lines in a file use "wc -l" do not write the output to another file, pipe directly as such "samtools view [commands] |wc -l ").
<p>Check the summary and see if the reported coverage makes sense given the data.</p>


'''Q11:''' How many reads with mapping quality greater or equal to 30 are aligned between positions 2000000 and 3000000 on the reference NC_002516.2? (tip: use "samtools view" without any arguments to view options).
<p><b>Q12:</b> On average, how many reads cover a base in the genome? What is the maximum coverage (maximum number of reads covering a single position)?</p>


<p>You can also inspect per-position coverage using:</p>
<pre>samtools mpileup [input.sorted.bam] | less -S</pre>


<H4>Average coverage</H4>
<hr>


We will use mosdepth to measure the average coverage (i.e. the average number of times any base in the genome is covered) for our BAM file:
<h3>The wrong reference genome?</h3>


<pre>
<p><b>Q13:</b> Suppose you accidentally aligned the reads to a different bacterial reference genome, e.g. <i>Yersinia pestis</i> (the plague bacterium), a distant relative of <i>Pseudomonas alcaligenes</i>. Would the number of aligned reads go up or down compared to the correct reference? Why? What if the other species was very closely related — would you expect more or fewer reads to align?</p>
mosdepth [output prefix] [input.sorted.bam] 
</pre>


Feel free to use whatever [output prefix], we used SRR8002634_1. Use mosdepth on your sorted bam file and see the results in [output prefix].mosdepth.summary.txt
<p><b>Optional bonus:</b> Try it.</p>


Does the estimate make sense?
<p><b><i>Yersinia pestis</i> reference:</b></p>
<pre>/home/databases/references/Y_pestis/GCF_000222975.1_ASM22297v1_genomic.fasta</pre>


'''Q12:''' On average, how many reads cover a base in the genome? What is the maximum number of reads covering a single position?
<p><b><i>Pseudomonas alcaligenes</i> reference:</b></p>
<pre>/home/databases/references/P_alcaligenes/GCF_001597285.1_ASM159728v1_genomic.fasta</pre>


<hr>


Use the following command to view the reads on a per position basis:
<h2>Human paired-end Illumina reads</h2>
<pre>
samtools mpileup [input.sorted.bam] |less -S
</pre>


<H3>The wrong reference genome?</H3>
<h3>Aligning</h3>


'''Q13:''' Imagine you would inadvertently align the data to a different reference genome, say the ''Yersinia pestis'', the bacteria responsible for the plague, a distant relative of ''Pseudomonas alcaligenes''. Would the number of aligned reads go up or down? Why? What would happen if the bacterial species was highly related, would you align more or less reads?
<p>We will align exome-seq reads from a [https://en.wikipedia.org/wiki/Yoruba_people Yoruba] female.</p>


'''bonus''': try it:
<p><b>Raw data:</b></p>
The ''Yersinia pestis'' genome:
<pre>/home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz
<pre>
/home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz
/home/databases/references/Y_pestis/GCF_000222975.1_ASM22297v1_genomic.fasta
</pre>
</pre>


The ''Pseudomonas alcaligenes'' genome:
<p><code>NA19201_1.fastq.gz</code> contains the forward reads; <code>NA19201_2.fastq.gz</code> contains the reverse reads. These reads are already trimmed.</p>
<pre>
/home/databases/references/P_alcaligenes/GCF_001597285.1_ASM159728v1_genomic.fasta
</pre>


<H2>Human paired-end Illumina reads</H2>
<p>Your goal is to write a <b>single command line</b> that:</p>
<H3>Aligning</H3>
<ol>
  <li>Uses <code>bwa mem</code> to align the paired-end reads and produce SAM output.</li>
  <li>Converts SAM to BAM.</li>
  <li>Sorts the BAM file.</li>
</ol>


We will align exome-seq sequences from a [https://en.wikipedia.org/wiki/Yoruba_people Yoruba] female. The raw data can be found here:
<p><b><code>bwa mem</code> syntax for paired-end reads:</b></p>
<pre>bwa mem [reference.fasta] [forward.fastq.gz] [reverse.fastq.gz]</pre>


<pre>
<p><b>Human reference (GRCh38):</b></p>
/home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz
<pre>/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa</pre>
/home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz
</pre>


The first one is the forward, the second, the reverse. These are pre-trimmed so no need to worry about the adapters.  
<p><b>Note:</b> There are multiple versions of the human reference genome (e.g. hg18, hg19, hg38). Coordinates and even sequences can differ between versions. Always make sure to use the <b>same reference version</b> consistently in all steps of your analysis.</p>


Your first exercise will be to write a single command line that will:
<p>If possible, add a read group and sample name. For example, if read group is <code>RG26</code> and sample is <code>YRB42</code>:</p>
# use '''bwa mem''' to align sequences and produce SAM format output
<pre>-R "@RG\tID:RG26\tSM:YRB42"</pre>
# transform SAM to BAM
# sort the bam file


For 1), the command line has the following format:
<p><b>Converting and sorting:</b></p>
<pre>
<ul>
bwa mem [reference genome in fasta] [forward fastq]  [reverse fastq]
  <li><code>samtools view -bS [input.sam]</code> converts SAM to BAM.</li>
</pre>
  <li>When reading from STDIN, use <code>/dev/stdin</code> as the input.</li>
  <li><code>samtools sort [input.bam]</code> sorts the BAM file.</li>
</ul>


The human reference is here:
<p>Your combined command should:</p>
<pre>
<ul>
/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa
  <li>Run <code>bwa mem</code>,</li>
</pre>
  <li>Pipe SAM output to <code>samtools view</code>,</li>
  <li>Pipe BAM output to <code>samtools sort</code>,</li>
  <li>Redirect the final sorted BAM to <code>NA19201.bam</code>.</li>
</ul>


'''Note:''' there are multiple versions of in reference genome for a given species. The version number for this particular reference is hg38 and is the latest as of writing. However, there are various different versions of the human genome ex: hg19, hg18. The coordinates will differ and you cannot assume that the sequence at a specific chromosome/coordinate will be the same with the next genome version. Be very careful to stick with the '''same version''' throughout your analyses.
<p>The alignment may take around 10 minutes.</p>


If you can, add the read groups+sample, let us say that the read group was RG26 and sample was YRB42 (tip: use -R "@RG\tID:RG26\tSM:YRB42" in bwa mem).
<p><b>Q14:</b> Write the full one-line command that performs alignment, SAM-&gt;BAM conversion, and sorting using pipes, and saves output as <code>NA19201.bam</code>.</p>


This command will produce SAM on the STDOUT. Next, we want to convert SAM to BAM:
<p><b>Q15:</b> What are two major advantages of using UNIX pipes instead of running each command separately and writing intermediate files?</p>
<pre>
samtools view -bS [input SAM]
</pre>


In your case, the input is in SAM format and will be the STDIN: /dev/stdin. The command above will produce BAM on the STDOUT. Next, we want to sort:
<p><b>Note:</b> For speed, the provided reads only contain sequences mapping to <code>chr20</code>.</p>


<pre>
<hr>
samtools sort [input BAM]
</pre>


Now, the input is in BAM format and will be the STDIN: /dev/stdin. The command above will produce BAM (but sorted) on the STDOUT.  The output should be redirected to a file.
<h3>Alignment statistics</h3>


Type down the command as a single line, run it and send the output to NA19201.bam. The alignment takes about 10 minutes, so free to take a bathroom break.
<h4>flagstat</h4>


'''Q14:''' How to run all of the steps above on a single line using UNIX pipes?
<p><b>Q16:</b> Using <code>samtools flagstat</code>, what proportion of reads aligned to the reference?</p>


'''Q15:''' There two major advantages of using UNIX pipes over simply running each command at the time. What would those be?
<p><b>Q17:</b> Using the same output, how many read pairs are marked as <b>properly paired</b>?</p>


'''n.b.''' For full disclosure, the reads were only the ones that only mapped to '''chr20'''. We did this to speed things up a bit.  
<p><b>Q18:</b> Index the BAM file, then run:</p>
<pre>samtools view [input.bam] [chromosome]</pre>


<H3>Statistics</H3>
<p>Count how many reads align to <code>chr20</code> (hint: pipe to <code>wc -l</code>). How many total reads are aligned to <code>chr20</code>?</p>
<H4>flagstat</H4>


'''Q16:''' Using '''samtools flagstat''', what is the proportion of reads that aligned to the reference?
<h4>stat</h4>


'''Q17''' Using the same command, how many pairs are properly paired?
<p>Generate additional alignment statistics using:</p>


'''Q18''' Index the file and use:
<pre>
<pre>
samtools view  [input BAM] [chromosome name]
samtools stat [input.bam] > NA19201.stat
</pre>
</pre>
And determine how many total reads aligned to '''chr20''' (reminder: use '''wc -l''' to count lines). 


<H4>stat</H4>
<p>Generate plots from the statistics file:</p>
Use
<pre>
samtools stat [input BAM]
</pre>
To generate additional statistics. The command above generates a bunch of statistics and produces it on the standard out (STDOUT). Redirect this to a file (ex:  NA19201.stat). We will then use '''plot-bamstats''' on the stat file that you have produced to create a congenial interface, for example:


<pre>
<pre>
plot-bamstats -p NA19201 NA19201.stat
plot-bamstats -p NA19201 NA19201.stat
</pre>
</pre>
<p><b>Viewing the BAM statistics report:</b></p>
<p>The command above generates a set of <b>PNG image files</b> containing
various BAM statistics (e.g. insert size, base composition, quality by cycle).
The plots are created in the current directory.</p>
<p>If you are using <b>MobaXterm</b>, you can open the PNG files directly from the
left-hand file panel.</p>
<p>If you are using <b>macOS</b> (or a standard terminal), copy the PNG files to your
local computer and open them using any image viewer. For example:</p>


<pre>
<pre>
firefox NA19201.html
scp stud0XX@pupilX:path/to/NA19201*.png .
</pre>
</pre>


'''Q19''' look at the graph of the insert size, what would you say is the most common insert size?  
<p>Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on.</p>
 
<p><b>Q19:</b> Look at the <b>insert size distribution</b> plot. What is the most
common insert size (approximately)?</p>


<hr>


<h3>Inspecting the alignment with <code>samtools tview</code></h3>


<p>We will use the text-based viewer <code>samtools tview</code> to inspect the human alignment around a potential variant.</p>


<H3>igv</H3>
<p>First, make sure your BAM file is indexed:</p>
<pre>samtools index NA19201.bam</pre>


We will now use igv (integrated genome viewer) to look at the alignment per se. First, start:
<p>Then start <code>samtools tview</code>:</p>
<pre>samtools tview NA19201.bam /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa</pre>


<pre>
<p>This opens the alignment in your terminal.</p>
igv &
</pre>


We will do the following to load the file:
<p>To jump to the region of interest (<code>chr20:35,581,362</code>):</p>
# You should see a drop-down field with probably the text "Human hg19" written in it. This is your genome and genome version. Select " More...", find and select "Human hg38" and press ok.
<ol>
# Go to "File", "load from file" and select your BAM file.
  <li>Press <code>g</code> (for “goto”).</li>
# next to the field when you've selected your genome build, there is a possibility to select a chromosome, please select "chr20".
  <li>Type <code>chr20:35581362</code> and press Enter.</li>
# next to the field for the chromosome, you can zoom to a specific location, type: chr20:35,581,362
</ol>


There seems to be a variant that position, use igv to answer the following questions
<p>You should now see:</p>
<ul>
  <li>The reference sequence on the top line.</li>
  <li>Aligned reads below. Matching bases are often shown as <code>.</code> or <code>,</code>, while mismatches are shown as the actual base (A/C/G/T).</li>
</ul>


'''Q20''': what is the reference base? What is the base that seems to be in the sample?
<p>Useful keys:</p>
<ul>
  <li><code>?</code> – show help.</li>
  <li>Arrow keys – move left/right/up/down.</li>
  <li><code>q</code> – quit <code>tview</code>.</li>
</ul>


'''Q21''': how many reads support this particular base?
<p>Use this view to answer the following:</p>


'''Q22''': this mutation seems to fall in a gene, what is the name of the gene?
<p><b>Q20:</b> At position <code>chr20:35,581,362</code>, what bases are present in the sample reads?</p>


Feel free to use igv to look at different loci on chromosome 20.
<p><b>Q21:</b> How many reads support the non-consensus base at this position? (Count the reads showing the alternative base in <code>tview</code>.)</p>


<p><b>Q22:</b> Based on the fraction of reads supporting the non-reference base, does this variant look more like a heterozygous or a homozygous variant? Explain briefly.</p>


Please find the answers [[Alignment_exercise_answers|here]]
<hr>


'''Congratulations you finished the exercise!'''
<p>Please find the answers [[Alignment_exercise_answers|here]].</p>


<p></p>
<p><b>Congratulations, you finished the exercise!</b></p>
<p></p>
<p></p>

Latest revision as of 11:55, 19 December 2025

Overview

In this exercise you will practice aligning NGS data and working with alignment files.

  1. Navigate to your home directory.
  2. Create a directory called align.
  3. Navigate to the align directory.

We will align two types of NGS data:

  1. Pseudomonas single-end Illumina reads
  2. Human paired-end Illumina reads

P. aeruginosa single-end Illumina reads

Alignment using bwa mem

We will align single-end reads that have been trimmed from P. aeruginosa.

Raw data:

/home/projects/22126_NGS/exercises/alignment/SRR8002634_1.fastq.gz

Trimmed data:

/home/projects/22126_NGS/exercises/alignment/SRR8002634_1_trimmed.fq.gz

Reference genome:

/home/databases/references/P_aeruginosa/GCF_000006765.1_ASM676v1_genomic.fasta

The basic bwa mem command to align single-end reads is:

bwa mem [reference.fasta] [reads.fastq.gz] > [output.sam]

Remember: the > operator redirects standard output (STDOUT) to a file.

We have discussed multiplexing and read groups. It is good practice to add a read group ID and sample name during alignment. For example, if the read group is RG38 and the sample is SMPL96:

bwa mem -R "@RG\tID:RG38\tSM:SMPL96" [reference.fasta] [reads.fastq.gz] > [output.sam]

This information is crucial when you later merge multiple BAM files, so you can trace which reads came from which library or sample.

Task: Align the trimmed FASTQ file using the command above.

Q1: If you were not told which FASTQ file contains the trimmed reads, how could you determine it from the files themselves? (Hint: think of at least three different ways.)


Inspecting the alignment

Assume you named your output file SRR8002634_1.sam. You can view it as:

less -S SRR8002634_1.sam

The -S option prevents line wrapping; press q to quit. Use the slides and the official SAM specification to interpret each field.

Answer the following:

Q2: How many lines does the header have (lines starting with @)?

Q3: What is the genomic coordinate (reference name and position) of the first read SRR8002634.1?

Q4: What is the mapping quality of the third read SRR8002634.3? What does that mapping quality tell you about this read?

Q5: Using the SAM flag definitions (see Picard flag explanation), determine among the first 8 reads how many map to the forward (+) strand and how many to the reverse (–) strand.

Q6: Is the 10th read SRR8002634.11 unmapped? (Note: SRR8002634.9 was removed by trimming, so numbering skips.) How did you determine this from the SAM fields?

To get basic alignment statistics, use:

samtools flagstat [input.sam]

Below is a brief explanation of the fields reported by flagstat:

Category Meaning
mapQ Mapping quality
QC-passed reads Reads not marked as QC-failed; these are typically used for analysis.
QC-failed reads Reads flagged as having problems by the processing pipeline; downstream tools usually ignore them.
total Total number of alignments reported.
secondary Additional alignments for reads that map equally well to multiple locations.
supplementary Alignments for chimeric or split reads where different parts map to different locations.
duplicates Reads marked as duplicates (e.g. PCR duplicates); will be discussed in the next class.
mapped Number of reads with at least one reported alignment (not unmapped).
paired in sequencing Reads that were sequenced as part of a pair (not single-end).
read1 First read in the pair (forward).
read2 Second read in the pair (reverse).
properly paired Pairs that face each other and are within the expected insert size range.
with itself and mate mapped Both the read and its mate are mapped (whether or not properly paired).
singletons Reads that are mapped but whose mate is unmapped.
with mate mapped to a different chr Reads whose mate is mapped to a different chromosome.

Q7: According to samtools flagstat, what fraction of reads did not align to the reference?


Working with alignments

Format conversion

This should be the first and hopefully last time you work directly with SAM for large files.

First, check the SAM file size:

ls -lh SRR8002634_1.sam

Convert SAM to BAM:

samtools view -bS [input.sam] > [output.bam]

Check the BAM file size:

ls -lh SRR8002634_1.bam

-l gives a detailed listing (permissions, size, date). -h shows file sizes in human-readable form (e.g. 2.4M instead of 2469134 bytes).

The BAM file contains exactly the same alignments as the SAM file, but in binary form. To view it as SAM:

samtools view [input.bam] | less -S

You can filter reads based on SAM flags. For example, to include only unmapped reads:

samtools view -f 0x4 [input.bam]

To exclude unmapped reads:

samtools view -F 0x4 [input.bam]

The flag 0x4 corresponds to “read unmapped” (see the flag documentation).

Q8: What is the size ratio of SAM to BAM (SAM size divided by BAM size)?

Now convert BAM to CRAM, which compresses further using the reference:

samtools view -C -T [reference.fasta] [input.bam] > [output.cram]

Use the same reference FASTA you used for mapping. Check the CRAM file size with ls -lh.

To view CRAM as SAM:

samtools view -T [reference.fasta] [input.cram] | less -S

Q9: What is the size ratio of BAM to CRAM?

To save space, please remove the SAM and CRAM files (we will work with BAM only):

rm [file]

Sorting

Sort the BAM file by genomic coordinate:

samtools sort [input.bam] > [output.sorted.bam]

Be careful not to overwrite the original BAM file; for example:

  • input.bam = SRR8002634_1.bam
  • output.sorted.bam = SRR8002634_1.sorted.bam

Use samtools view and less -S to confirm that reads are ordered by reference and coordinate.

Index the sorted BAM file:

samtools index [input.sorted.bam]

Note: you cannot index an unsorted BAM file.

Retrieving a particular region

Once sorted and indexed, you can retrieve reads from a specific region:

samtools view [input.sorted.bam] [regionID]:[start]-[end]

For example, to get reads overlapping positions 1,000,000–1,000,100 on chromosome NC_002516.2:

samtools view SRR8002634_1.sorted.bam NC_002516.2:1000000-1000100

Q10: How many reads are aligned between positions 2,000,000 and 3,000,000 on the reference NC_002516.2?
Hint: do not save to a file; instead use:

samtools view [options] | wc -l

Q11: How many reads with mapping quality ≥ 30 are aligned between positions 2,000,000 and 3,000,000 on NC_002516.2?
Hint: run samtools view without arguments to see its options.

Average coverage

We will use mosdepth to measure the average coverage (mean number of reads covering each base in the genome):

mosdepth [output_prefix] [input.sorted.bam]

Use any prefix you like (e.g. SRR8002634_1). mosdepth will write a summary file named [output_prefix].mosdepth.summary.txt.

Check the summary and see if the reported coverage makes sense given the data.

Q12: On average, how many reads cover a base in the genome? What is the maximum coverage (maximum number of reads covering a single position)?

You can also inspect per-position coverage using:

samtools mpileup [input.sorted.bam] | less -S

The wrong reference genome?

Q13: Suppose you accidentally aligned the reads to a different bacterial reference genome, e.g. Yersinia pestis (the plague bacterium), a distant relative of Pseudomonas alcaligenes. Would the number of aligned reads go up or down compared to the correct reference? Why? What if the other species was very closely related — would you expect more or fewer reads to align?

Optional bonus: Try it.

Yersinia pestis reference:

/home/databases/references/Y_pestis/GCF_000222975.1_ASM22297v1_genomic.fasta

Pseudomonas alcaligenes reference:

/home/databases/references/P_alcaligenes/GCF_001597285.1_ASM159728v1_genomic.fasta

Human paired-end Illumina reads

Aligning

We will align exome-seq reads from a Yoruba female.

Raw data:

/home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz
/home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz

NA19201_1.fastq.gz contains the forward reads; NA19201_2.fastq.gz contains the reverse reads. These reads are already trimmed.

Your goal is to write a single command line that:

  1. Uses bwa mem to align the paired-end reads and produce SAM output.
  2. Converts SAM to BAM.
  3. Sorts the BAM file.

bwa mem syntax for paired-end reads:

bwa mem [reference.fasta] [forward.fastq.gz] [reverse.fastq.gz]

Human reference (GRCh38):

/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa

Note: There are multiple versions of the human reference genome (e.g. hg18, hg19, hg38). Coordinates and even sequences can differ between versions. Always make sure to use the same reference version consistently in all steps of your analysis.

If possible, add a read group and sample name. For example, if read group is RG26 and sample is YRB42:

-R "@RG\tID:RG26\tSM:YRB42"

Converting and sorting:

  • samtools view -bS [input.sam] converts SAM to BAM.
  • When reading from STDIN, use /dev/stdin as the input.
  • samtools sort [input.bam] sorts the BAM file.

Your combined command should:

  • Run bwa mem,
  • Pipe SAM output to samtools view,
  • Pipe BAM output to samtools sort,
  • Redirect the final sorted BAM to NA19201.bam.

The alignment may take around 10 minutes.

Q14: Write the full one-line command that performs alignment, SAM->BAM conversion, and sorting using pipes, and saves output as NA19201.bam.

Q15: What are two major advantages of using UNIX pipes instead of running each command separately and writing intermediate files?

Note: For speed, the provided reads only contain sequences mapping to chr20.


Alignment statistics

flagstat

Q16: Using samtools flagstat, what proportion of reads aligned to the reference?

Q17: Using the same output, how many read pairs are marked as properly paired?

Q18: Index the BAM file, then run:

samtools view [input.bam] [chromosome]

Count how many reads align to chr20 (hint: pipe to wc -l). How many total reads are aligned to chr20?

stat

Generate additional alignment statistics using:

samtools stat [input.bam] > NA19201.stat

Generate plots from the statistics file:

plot-bamstats -p NA19201 NA19201.stat

Viewing the BAM statistics report:

The command above generates a set of PNG image files containing various BAM statistics (e.g. insert size, base composition, quality by cycle). The plots are created in the current directory.

If you are using MobaXterm, you can open the PNG files directly from the left-hand file panel.

If you are using macOS (or a standard terminal), copy the PNG files to your local computer and open them using any image viewer. For example:

scp stud0XX@pupilX:path/to/NA19201*.png .

Replace stud0XX with your student ID and pupilX with the compute node you are working on.

Q19: Look at the insert size distribution plot. What is the most common insert size (approximately)?


Inspecting the alignment with samtools tview

We will use the text-based viewer samtools tview to inspect the human alignment around a potential variant.

First, make sure your BAM file is indexed:

samtools index NA19201.bam

Then start samtools tview:

samtools tview NA19201.bam /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa

This opens the alignment in your terminal.

To jump to the region of interest (chr20:35,581,362):

  1. Press g (for “goto”).
  2. Type chr20:35581362 and press Enter.

You should now see:

  • The reference sequence on the top line.
  • Aligned reads below. Matching bases are often shown as . or ,, while mismatches are shown as the actual base (A/C/G/T).

Useful keys:

  • ? – show help.
  • Arrow keys – move left/right/up/down.
  • q – quit tview.

Use this view to answer the following:

Q20: At position chr20:35,581,362, what bases are present in the sample reads?

Q21: How many reads support the non-consensus base at this position? (Count the reads showing the alternative base in tview.)

Q22: Based on the fraction of reads supporting the non-reference base, does this variant look more like a heterozygous or a homozygous variant? Explain briefly.


Please find the answers here.

Congratulations, you finished the exercise!