Alignment exercise: Difference between revisions

From 22126
Jump to navigation Jump to search
No edit summary
No edit summary
Tag: Manual revert
 
Line 1: Line 1:
<h2>Overview</h2>
<h2>Overview</h2>


<p>
<p>In this exercise you will practice aligning NGS data and working with alignment files.</p>
In this mini-workshop you will familiarize yourself with <b>TADbit</b> (Serra et al., 2017):
from FASTQ files to contact matrix and beyond.
</p>


<p>
<ol>
<b>A Primer into 3D Genomics: A Mini-Workshop</b><br>
  <li>Navigate to your home directory.</li>
Juan Antonio Rodríguez, Globe Institute, University of Copenhagen<br>
  <li>Create a directory called <code>align</code>.</li>
9 January 2026, DTU
  <li>Navigate to the <code>align</code> directory.</li>
</p>
</ol>
 
<hr>
 
<h2>Outline of the exercises</h2>


<p>We will align two types of NGS data:</p>
<ol>
<ol>
   <li>Preprocess Hi-C FASTQ data</li>
   <li><i>Pseudomonas</i> single-end Illumina reads</li>
  <li>Index reference genome</li>
  <li>Human paired-end Illumina reads</li>
  <li>Use TADbit to:
    <ol>
      <li>Map reads to reference genome (<code>map</code>)</li>
      <li>Get intersection (<code>parse</code>)</li>
      <li>Filter reads (<code>filter</code>)</li>
      <li>Normalize (<code>normalize</code>)</li>
      <li>Generate matrices (<code>bin</code>)</li>
      <li>Export formats (<code>bin</code> + <code>cooler</code>)</li>
    </ol>
  </li>
</ol>
</ol>


<hr>
<hr>


<h2>Setup conda environment to run TADbit later</h2>
<h2><i>P. aeruginosa</i> single-end Illumina reads</h2>


<pre>
<h3>Alignment using bwa mem</h3>
cd;  # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
bash ./setup_TADbit.sh
</pre>


<p>
<p>We will align single-end reads that have been trimmed from <i>P. aeruginosa</i>.</p>
You should get (as the only output) the help from the program — this means the environment is up and running.
</p>


<p>
<p><b>Raw data:</b></p>
Make yourself familiar with the directory structure. Inside
<pre>/home/projects/22126_NGS/exercises/alignment/SRR8002634_1.fastq.gz</pre>
<code>/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE</code> we have three folders:
</p>


<ul>
<p><b>Trimmed data:</b></p>
  <li><code>fastq</code> – raw data</li>
<pre>/home/projects/22126_NGS/exercises/alignment/SRR8002634_1_trimmed.fq.gz</pre>
  <li><code>SCRIPTS</code> – scripts to run TADbit</li>
  <li><code>refGenome</code> – reference genome raw FASTA and indexed files</li>
</ul>


<hr>
<p><b>Reference genome:</b></p>
<pre>/home/databases/references/P_aeruginosa/GCF_000006765.1_ASM676v1_genomic.fasta</pre>


<h2>Index reference genome</h2>
<p>The basic <code>bwa mem</code> command to align single-end reads is:</p>
<pre>bwa mem [reference.fasta] [reads.fastq.gz] &gt; [output.sam]</pre>


<p>
<p>Remember: the <code>&gt;</code> operator redirects standard output (STDOUT) to a file.</p>
Before analyzing Hi-C data through TADbit, index the reference genome that GEM mapper will use.
This is standard for most mappers (e.g., bwa, bowtie2). We can call the <code>gem-indexer</code>
from within the TADbit environment.
</p>


<p>
<p>We have discussed multiplexing and read groups. It is good practice to add a read group ID and sample name during alignment. For example, if the read group is <code>RG38</code> and the sample is <code>SMPL96</code>:</p>
Remember to activate the tadbit conda environment.
</p>


<pre>
<pre>bwa mem -R "@RG\tID:RG38\tSM:SMPL96" [reference.fasta] [reads.fastq.gz] &gt; [output.sam]</pre>
# Move to your home
cd;


# Activate TADbit environment
<p>This information is crucial when you later merge multiple BAM files, so you can trace which reads came from which library or sample.</p>
eval "$(/home/ctools/miniconda3/bin/conda shell.bash hook)"
conda activate "$HOME/envs/tadbit_course"


# Make a WORKING folder for the course
<p><b>Task:</b> Align the <b>trimmed</b> FASTQ file using the command above.</p>
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;


# Make SCRIPT folders (to store your own scripts)
<p><b>Q1:</b> If you were not told which FASTQ file contains the trimmed reads, how could you determine it from the files themselves? (Hint: think of at least three different ways.)</p>
mkdir -p SCRIPTS;
# also a log folder for the scripts
mkdir -p SCRIPTS/log


# Make RESULTS folder
<hr>
mkdir -p tadbit_dirs;


# Make REFERENCE GENOME folder
<h3>Inspecting the alignment</h3>
mkdir -p refGenome;


# To store logs from fastp
<p>Assume you named your output file <code>SRR8002634_1.sam</code>. You can view it as:</p>
mkdir -p fastp_reports
<pre>less -S SRR8002634_1.sam</pre>


# For the fastq
<p>The <code>-S</code> option prevents line wrapping; press <code>q</code> to quit. Use the slides and the
mkdir -p fastq
[https://samtools.github.io/hts-specs/SAMv1.pdf official SAM specification] to interpret each field.</p>
# Filtered fastq
mkdir -p fastq/clean
</pre>


<p><b>Putting things into an SBATCH script</b></p>
<p>Answer the following:</p>


<p>
<p><b>Q2:</b> How many lines does the header have (lines starting with <code>@</code>)?</p>
A template for <code>sbatch</code> job submission is provided. Copy it to your <code>SCRIPTS</code> folder:
</p>


<pre>
<p><b>Q3:</b> What is the genomic coordinate (reference name and position) of the first read <code>SRR8002634.1</code>?</p>
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/
</pre>


<p>
<p><b>Q4:</b> What is the mapping quality of the third read <code>SRR8002634.3</code>? What does that mapping quality tell you about this read?</p>
Move to your <code>SCRIPTS</code> folder and make a copy called <code>00_index.sbatch</code>:
</p>


<pre>
<p><b>Q5:</b> Using the SAM flag definitions (see
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
[https://broadinstitute.github.io/picard/explain-flags.html Picard flag explanation]), determine among the first 8 reads how many map to the <b>forward (+)</b> strand and how many to the <b>reverse (–)</b> strand.</p>


cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch
<p><b>Q6:</b> Is the 10th read <code>SRR8002634.11</code> unmapped? (Note: <code>SRR8002634.9</code> was removed by trimming, so numbering skips.) How did you determine this from the SAM fields?</p>
</pre>


<p>
<p>To get basic alignment statistics, use:</p>
Open the template with your favorite editor, paste the following into the file, and save it.
<pre>samtools flagstat [input.sam]</pre>
For example: <code>emacs 00_index.sbatch</code>
</p>


<pre>
<p>Below is a brief explanation of the fields reported by <code>flagstat</code>:</p>
data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE;
cd ${data_dir};


# Running the indexer
<table class="wikitable">
# Note: the output is just a *prefix*; no file extension needed.
  <tr>
gem-indexer -t 11 -i refGenome/GCF_000002315.6_GRCg6a_genomic.fna -o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic
    <th>Category</th>
</pre>
    <th>Meaning</th>
  </tr>
  <tr>
    <td>mapQ</td>
    <td>Mapping quality</td>
  </tr>
  <tr>
    <td>QC-passed reads</td>
    <td>Reads not marked as QC-failed; these are typically used for analysis.</td>
  </tr>
  <tr>
    <td>QC-failed reads</td>
    <td>Reads flagged as having problems by the processing pipeline; downstream tools usually ignore them.</td>
  </tr>
  <tr>
    <td>total</td>
    <td>Total number of alignments reported.</td>
  </tr>
  <tr>
    <td>secondary</td>
    <td>Additional alignments for reads that map equally well to multiple locations.</td>
  </tr>
  <tr>
    <td>supplementary</td>
    <td>Alignments for chimeric or split reads where different parts map to different locations.</td>
  </tr>
  <tr>
    <td>duplicates</td>
    <td>Reads marked as duplicates (e.g. PCR duplicates); will be discussed in the next class.</td>
  </tr>
  <tr>
    <td>mapped</td>
    <td>Number of reads with at least one reported alignment (not unmapped).</td>
  </tr>
  <tr>
    <td>paired in sequencing</td>
    <td>Reads that were sequenced as part of a pair (not single-end).</td>
  </tr>
  <tr>
    <td>read1</td>
    <td>First read in the pair (forward).</td>
  </tr>
  <tr>
    <td>read2</td>
    <td>Second read in the pair (reverse).</td>
  </tr>
  <tr>
    <td>properly paired</td>
    <td>Pairs that face each other and are within the expected insert size range.</td>
  </tr>
  <tr>
    <td>with itself and mate mapped</td>
    <td>Both the read and its mate are mapped (whether or not properly paired).</td>
  </tr>
  <tr>
    <td>singletons</td>
    <td>Reads that are mapped but whose mate is unmapped.</td>
  </tr>
  <tr>
    <td>with mate mapped to a different chr</td>
    <td>Reads whose mate is mapped to a different chromosome.</td>
  </tr>
</table>


<p>
<p><b>Q7:</b> According to <code>samtools flagstat</code>, what fraction of reads did <b>not</b> align to the reference?</p>
Submit the job:
</p>


<pre>
<hr>
sbatch 00_index.sbatch;
</pre>


<p><b>⚠️ NO NEED TO RUN THIS. WE WILL GENERATE A SYMBOLIC LINK.</b></p>
<h3>Working with alignments</h3>


<p>
<h4>Format conversion</h4>
We can make a symlink to the reference genome in our folder so that we do not have to copy it:
</p>


<pre>
<p>This should be the first and hopefully last time you work directly with SAM for large files.</p>
ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem


ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna
<p>First, check the SAM file size:</p>
</pre>
<pre>ls -lh SRR8002634_1.sam</pre>


<p>
<p>Convert SAM to BAM:</p>
⏰ It should take ~5–10 min to complete.
<pre>samtools view -bS [input.sam] &gt; [output.bam]</pre>
</p>


<p>
<p>Check the BAM file size:</p>
A prepared script is also available:
<pre>ls -lh SRR8002634_1.bam</pre>
</p>


<pre>
<p><code>-l</code> gives a detailed listing (permissions, size, date). <code>-h</code> shows file sizes in human-readable form (e.g. 2.4M instead of 2469134 bytes).</p>
cd ~/3D_GENOMICS_COURSE/SCRIPTS;
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch .;
sbatch 00_index.sbatch
</pre>


<hr>
<p>The BAM file contains exactly the same alignments as the SAM file, but in binary form. To view it as SAM:</p>
<pre>samtools view [input.bam] | less -S</pre>


<h2>Pre-process Hi-C FASTQ data: minimum QC</h2>
<p>You can filter reads based on SAM flags. For example, to include only unmapped reads:</p>
<pre>samtools view -f 0x4 [input.bam]</pre>


<p>
<p>To exclude unmapped reads:</p>
While genome indexing runs, start looking at the data and pre-process it.
<pre>samtools view -F 0x4 [input.bam]</pre>
Hi-C FASTQs are paired-end reads. We will “clean” the reads from adapters,
low-quality bases, and short reads using <code>fastp</code>.
</p>


<p>
<p>The flag <code>0x4</code> corresponds to “read unmapped” (see the
Copy the template and create <code>01_fastp.sbatch</code>:
[https://broadinstitute.github.io/picard/explain-flags.html flag documentation]).</p>
</p>


<pre>
<p><b>Q8:</b> What is the size ratio of SAM to BAM (SAM size divided by BAM size)?</p>
cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/01_fastp.sbatch;
</pre>


<p>
<p>Now convert BAM to CRAM, which compresses further using the reference:</p>
Put the following into the SBATCH script:
<pre>samtools view -C -T [reference.fasta] [input.bam] &gt; [output.cram]</pre>
</p>


<pre>
<p>Use the same reference FASTA you used for mapping. Check the CRAM file size with <code>ls -lh</code>.</p>
cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"


fastp \
<p>To view CRAM as SAM:</p>
    # Read raw fastq from course folders
<pre>samtools view -T [reference.fasta] [input.cram] | less -S</pre>
    -i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
    # Store the clean fastq version in your folder
    -o clean/${sample}_R1.clean.fastq.gz \
    -I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
    -O clean/${sample}_R2.clean.fastq.gz \
    --detect_adapter_for_pe \
    # Trim first 5 bases (often lower quality)
    --trim_front1 5 \
    # Threads
    -w 10 \
    # Minimal read length (remove reads shorter than this after trimming)
    -l 30 \
    -h ${sample}.html
</pre>


<p>
<p><b>Q9:</b> What is the size ratio of BAM to CRAM?</p>
Copy the HTML report to your local computer and open it in a browser:
</p>


<pre>
<p>To save space, please remove the SAM and CRAM files (we will work with BAM only):</p>
USER="juanrod"
<pre>rm [file]</pre>
scp ${USER}@pupil1.healthtech.dtu.dk:/home/people/${USER}/3D_GENOMICS_COURSE/fastq/liver.html .
</pre>


<p>
<h4>Sorting</h4>
⏰ It should take ~1 min to complete with 6 CPUs.
</p>


<p><b>Question:</b> Check the HTML report. What percentage of reads are kept?</p>
<p>Sort the BAM file by genomic coordinate:</p>
<pre>samtools sort [input.bam] &gt; [output.sorted.bam]</pre>


<hr>
<p>Be careful not to overwrite the original BAM file; for example:</p>
<ul>
  <li><code>input.bam</code> = <code>SRR8002634_1.bam</code></li>
  <li><code>output.sorted.bam</code> = <code>SRR8002634_1.sorted.bam</code></li>
</ul>


<h2>Mapping to the reference genome</h2>
<p>Use <code>samtools view</code> and <code>less -S</code> to confirm that reads are ordered by reference and coordinate.</p>


<p>
<p>Index the sorted BAM file:</p>
TADbit maps each read separately, so we run <code>tadbit map</code> twice (once per read).
<pre>samtools index [input.sorted.bam]</pre>
It requires the restriction enzyme(s) used in the experiment. These samples were treated with two enzymes.
</p>


<p>
<p>Note: you <b>cannot</b> index an unsorted BAM file.</p>
Put the following into your mapping script:
</p>


<pre>
<h4>Retrieving a particular region</h4>
cd /home/people/$USER/3D_GENOMICS_COURSE/


# Variables used for mapping
<p>Once sorted and indexed, you can retrieve reads from a specific region:</p>
sample="liver"
<pre>samtools view [input.sorted.bam] [regionID]:[start]-[end]</pre>
ref="/refGenome/GCF_000002315.6_GRCg6a_genomic.gem"
wd="tadbit_dirs/"${sample}
mkdir -p ${wd}


# Two enzymes used in this experiment
<p>For example, to get reads overlapping positions 1,000,000–1,000,100 on chromosome <code>NC_002516.2</code>:</p>
enz="MboI HinfI"  # Double digestion (relevant for Arima/Phase Genomics)
<pre>samtools view SRR8002634_1.sorted.bam NC_002516.2:1000000-1000100</pre>


# Map read 1
<p><b>Q10:</b> How many reads are aligned between positions 2,000,000 and 3,000,000 on the reference <code>NC_002516.2</code>?<br>
rd=1;
Hint: do not save to a file; instead use:</p>
<pre>samtools view [options] | wc -l</pre>


tadbit map \
<p><b>Q11:</b> How many reads with mapping quality <b>&ge; 30</b> are aligned between positions 2,000,000 and 3,000,000 on <code>NC_002516.2</code>?<br>
  --fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
Hint: run <code>samtools view</code> without arguments to see its options.</p>
  --workdir ${wd} \
  --index ${ref} \
  --read ${rd} \
  --tmpdb ${TMPDIR} \
  --renz ${enz} \
  -C 6


# Map read 2
<h4>Average coverage</h4>
rd=2
# >>> Just change the script to take that as a parameter.
</pre>


<p>
<p>We will use <code>mosdepth</code> to measure the average coverage (mean number of reads covering each base in the genome):</p>
⏰ It should take ~5 min to complete with 6 CPUs.
<pre>/home/ctools/mosdepth/mosdepth [output_prefix] [input.sorted.bam]</pre>
</p>


<p>
<p>Use any prefix you like (e.g. <code>SRR8002634_1</code>). <code>mosdepth</code> will write a summary file named <code>[output_prefix].mosdepth.summary.txt</code>.</p>
<b>Note:</b> We are not using iterative mapping. Fragment-based mapping is the default in TADbit.
</p>


<p>
<p>Check the summary and see if the reported coverage makes sense given the data.</p>
After mapping, inspect the plots TADbit generates. Discuss the number of digested sites,
dangling ends, and ligation efficiency.
</p>


<p><b>Question:</b> How may restriction enzyme choice influence the experiment? ✂️</p>
<p><b>Q12:</b> On average, how many reads cover a base in the genome? What is the maximum coverage (maximum number of reads covering a single position)?</p>


<ul>
<p>You can also inspect per-position coverage using:</p>
  <li>Fragment size histogram</li>
<pre>samtools mpileup [input.sorted.bam] | less -S</pre>
  <li>HiC sequencing quality and digestion/ligation deconvolution</li>
</ul>


<hr>
<hr>


<h2>Finding the intersection of mapped reads (parse)</h2>
<h3>The wrong reference genome?</h3>
 
<p><b>Q13:</b> Suppose you accidentally aligned the reads to a different bacterial reference genome, e.g. <i>Yersinia pestis</i> (the plague bacterium), a distant relative of <i>Pseudomonas alcaligenes</i>. Would the number of aligned reads go up or down compared to the correct reference? Why? What if the other species was very closely related — would you expect more or fewer reads to align?</p>
 
<p><b>Optional bonus:</b> Try it.</p>
 
<p><b><i>Yersinia pestis</i> reference:</b></p>
<pre>/home/databases/references/Y_pestis/GCF_000222975.1_ASM22297v1_genomic.fasta</pre>
 
<p><b><i>Pseudomonas alcaligenes</i> reference:</b></p>
<pre>/home/databases/references/P_alcaligenes/GCF_001597285.1_ASM159728v1_genomic.fasta</pre>
 
<hr>


<p>
<h2>Human paired-end Illumina reads</h2>
Each mate of a Hi-C pair originates from the same digested/ligated fragment (unless it is a dangling end).
We identify pairs and build fragment associations with <code>tadbit parse</code>.
</p>


<p><b>⚠️ Note:</b> The chromosome prefixes to filter have to be defined in the reference genome FASTA file beforehand.
<h3>Aligning</h3>
It will only match chromosomes that start with the string in <code>--filter_chrom</code>.
</p>


<pre>
<p>We will align exome-seq reads from a [https://en.wikipedia.org/wiki/Yoruba_people Yoruba] female.</p>
cd /home/people/$USER/3D_GENOMICS_COURSE/;
sample="liver"  # sample name
ref="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/"${sample}  # workdir (auto-created by TADbit)


# Keep only canonical chromosomes and compress map files after parsing
<p><b>Raw data:</b></p>
tadbit parse \
<pre>/home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz
      --workdir ${wd} \
/home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz
      --genome ${ref} \
      --filter_chrom "chr.*" \
      --compress_input;
</pre>
</pre>


<p>
<p><code>NA19201_1.fastq.gz</code> contains the forward reads; <code>NA19201_2.fastq.gz</code> contains the reverse reads. These reads are already trimmed.</p>
⏰ It should take ~35 min to complete with 10 CPUs.
</p>


<p><b>Question:</b> Is it possible to retrieve multiple contacting regions?</p>
<p>Your goal is to write a <b>single command line</b> that:</p>
<ol>
  <li>Uses <code>bwa mem</code> to align the paired-end reads and produce SAM output.</li>
  <li>Converts SAM to BAM.</li>
  <li>Sorts the BAM file.</li>
</ol>


<hr>
<p><b><code>bwa mem</code> syntax for paired-end reads:</b></p>
<pre>bwa mem [reference.fasta] [forward.fastq.gz] [reverse.fastq.gz]</pre>


<h2>Filtering interactions</h2>
<p><b>Human reference (GRCh38):</b></p>
<pre>/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa</pre>


<p>
<p><b>Note:</b> There are multiple versions of the human reference genome (e.g. hg18, hg19, hg38). Coordinates and even sequences can differ between versions. Always make sure to use the <b>same reference version</b> consistently in all steps of your analysis.</p>
TADbit allows flexible filtering of non-wanted interactions. In many cases, the defaults work well across datasets.
</p>


<p><b>Run filtering:</b></p>
<p>If possible, add a read group and sample name. For example, if read group is <code>RG26</code> and sample is <code>YRB42</code>:</p>
<pre>-R "@RG\tID:RG26\tSM:YRB42"</pre>


<pre>
<p><b>Converting and sorting:</b></p>
tadbit filter \
<ul>
   --workdir ${wd} \
   <li><code>samtools view -bS [input.sam]</code> converts SAM to BAM.</li>
   --apply 1 2 3 4 6 7 8 9 10 \
  <li>When reading from STDIN, use <code>/dev/stdin</code> as the input.</li>
   --cpus 6 \
  <li><code>samtools sort [input.bam]</code> sorts the BAM file.</li>
   --tmpdb ${TMPDIR}
</ul>
</pre>
 
<p>Your combined command should:</p>
<ul>
  <li>Run <code>bwa mem</code>,</li>
   <li>Pipe SAM output to <code>samtools view</code>,</li>
   <li>Pipe BAM output to <code>samtools sort</code>,</li>
   <li>Redirect the final sorted BAM to <code>NA19201.bam</code>.</li>
</ul>
 
<p>The alignment may take around 10 minutes.</p>
 
<p><b>Q14:</b> Write the full one-line command that performs alignment, SAM-&gt;BAM conversion, and sorting using pipes, and saves output as <code>NA19201.bam</code>.</p>


<hr>
<p><b>Q15:</b> What are two major advantages of using UNIX pipes instead of running each command separately and writing intermediate files?</p>


<h2>Check the amount of filtered data and past commands</h2>
<p><b>Note:</b> For speed, the provided reads only contain sequences mapping to <code>chr20</code>.</p>


<p>
<hr>
<code>tadbit describe</code> summarizes what has been done so far in the workdir,
and reports counts, numbers, and parameters after each step.
</p>


<pre>
<h3>Alignment statistics</h3>
# Change to workdir
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample


# Summarize the run
<h4>flagstat</h4>
tadbit describe . | less
</pre>


<p><b>Question:</b> How many valid pairs do we keep?</p>
<p><b>Q16:</b> Using <code>samtools flagstat</code>, what proportion of reads aligned to the reference?</p>
<p><b>Question:</b> The total number of filtered reads is not equal to the initial number of reads… Why?</p>


<hr>
<p><b>Q17:</b> Using the same output, how many read pairs are marked as <b>properly paired</b>?</p>


<h2>To normalize or to not normalize</h2>
<p><b>Q18:</b> Index the BAM file, then run:</p>
<pre>samtools view [input.bam] [chromosome]</pre>


<p>
<p>Count how many reads align to <code>chr20</code> (hint: pipe to <code>wc -l</code>). How many total reads are aligned to <code>chr20</code>?</p>
In the filter step we have catalogued all the reads into categories — so it actually didn’t filter anything yet.
It is during <b>normalization</b> that we specify which categories to include/exclude so the normalization is performed accordingly.
</p>


<p>
<h4>stat</h4>
Normalization in TADbit extracts a <b>bias vector</b> (one value per bin) which adjusts interaction intensities
depending on coverage and technical biases.
</p>


<p>
<p>Generate additional alignment statistics using:</p>
<b>Important:</b> During normalization is where <b>bad columns</b> (low counts, low mappability, etc.) are removed from the matrix.
</p>


<p>
<pre>
Several normalization strategies exist (see: <code>tadbit normalize --help</code>).
samtools stat [input.bam] > NA19201.stat
A simple and commonly used option is to filter based on a <b>minimum number of counts per bin</b>.
</pre>
</p>


<p>
<p>Generate plots from the statistics file:</p>
If you want to exclude specific genomic regions, use the <code>--badcols</code> parameter.
</p>


<pre>
<pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/;
plot-bamstats -p NA19201 NA19201.stat
</pre>


# Variables used for normalization
<p><b>Viewing the BAM statistics report:</b></p>
sample="liver"  # sample name
wd="tadbit_dirs/"${sample}  # workdir (auto-created by TADbit)


# First time we define the resolution
<p>The command above generates a set of <b>PNG image files</b> containing
res="100000"  # 100 kb
various BAM statistics (e.g. insert size, base composition, quality by cycle).
The plots are created in the current directory.</p>


# Choice of normalization (raw, ICE, Vanilla, decay)
<p>If you are using <b>MobaXterm</b>, you can open the PNG files directly from the
norm="Vanilla"
left-hand file panel.</p>


# Minimum number of counts required per bin
<p>If you are using <b>macOS</b> (or a standard terminal), copy the PNG files to your
min_count=100
local computer and open them using any image viewer. For example:</p>
</pre>


<pre>
<pre>
tadbit normalize -w ${wd} \
scp stud0XX@pupilX.healthtech.dtu.dk:path/to/NA19201*.png .
      -r ${res} \
      --tmpdb ${TMPDIR} \
      --cpus 6 \
      --filter 1 2 3 4 6 7 9 10 \
      --normalization ${norm} \
      --badcols chrW:1-7000000 chrZ:1-83000000 \
      --min_count ${min_count}
</pre>
</pre>


<p>
<p>Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
⏰ It should take ~2 min to complete with 6 CPUs.
compute node you are working on.</p>
</p>


<p>
<p><b>Q19:</b> Look at the <b>insert size distribution</b> plot. What is the most
⚠️ Run another version with <code>norm="raw"</code> to compare later.
common insert size (approximately)?</p>
</p>


<p>
<hr>
Use <code>tadbit describe</code> to check how many bins were removed.
A good rule of thumb: remove ~3–4% of bins. If much more is removed, something may be wrong.
</p>


<p>
<h3>Inspecting the alignment with <code>samtools tview</code></h3>
Each job is assigned a <code>&lt;job_id&gt;</code>. This helps retrieve results from specific runs (especially when testing parameters).
</p>


<p>
<p>We will use the text-based viewer <code>samtools tview</code> to inspect the human alignment around a potential variant.</p>
If you want, you can take a quick look at the different normalization strategies and extract your own conclusions:
</p>


<p>
<p>First, make sure your BAM file is indexed:</p>
https://www.tandfonline.com/doi/full/10.2144/btn-2019-0105
<pre>samtools index NA19201.bam</pre>
</p>


<hr>
<p>Then start <code>samtools tview</code>:</p>
<pre>samtools tview NA19201.bam /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa</pre>
 
<p>This opens the alignment in your terminal.</p>
 
<p>To jump to the region of interest (<code>chr20:35,581,362</code>):</p>
<ol>
  <li>Press <code>g</code> (for “goto”).</li>
  <li>Type <code>chr20:35581362</code> and press Enter.</li>
</ol>
 
<p>You should now see:</p>
<ul>
  <li>The reference sequence on the top line.</li>
  <li>Aligned reads below. Matching bases are often shown as <code>.</code> or <code>,</code>, while mismatches are shown as the actual base (A/C/G/T).</li>
</ul>


<h2>Binning and viewing matrices</h2>
<p>Useful keys:</p>
<ul>
  <li><code>?</code> – show help.</li>
  <li>Arrow keys – move left/right/up/down.</li>
  <li><code>q</code> – quit <code>tview</code>.</li>
</ul>


<p>
<p>Use this view to answer the following:</p>
Once normalization is done, we can visualize Hi-C matrices. Using <code>-c</code> restricts the plot to a specific chromosome or region.
</p>


<pre>
<p><b>Q20:</b> At position <code>chr20:35,581,362</code>, what bases are present in the sample reads?</p>
# Variables used for binning


cd /home/people/$USER/3D_GENOMICS_COURSE/
<p><b>Q21:</b> How many reads support the non-consensus base at this position? (Count the reads showing the alternative base in <code>tview</code>.)</p>
sample="liver"
wd="tadbit_dirs/"${sample}
res="100000";
chrom="chr1"
</pre>


<pre>
<p><b>Q22:</b> Based on the fraction of reads supporting the non-reference base, does this variant look more like a heterozygous or a homozygous variant? Explain briefly.</p>
tadbit bin \
      -w ${wd} \
      -r ${res} \
      -c ${chrom} \
      --plot \
      --norm "norm" \
      --format "png" \
      --cpus 6;
</pre>


<hr>
<hr>
<p>Please find the answers [[Alignment_exercise_answers|here]].</p>


<p><b>Congratulations, you finished the exercise!</b></p>
<p><b>Congratulations, you finished the exercise!</b></p>

Latest revision as of 10:47, 7 January 2026

Overview

In this exercise you will practice aligning NGS data and working with alignment files.

  1. Navigate to your home directory.
  2. Create a directory called align.
  3. Navigate to the align directory.

We will align two types of NGS data:

  1. Pseudomonas single-end Illumina reads
  2. Human paired-end Illumina reads

P. aeruginosa single-end Illumina reads

Alignment using bwa mem

We will align single-end reads that have been trimmed from P. aeruginosa.

Raw data:

/home/projects/22126_NGS/exercises/alignment/SRR8002634_1.fastq.gz

Trimmed data:

/home/projects/22126_NGS/exercises/alignment/SRR8002634_1_trimmed.fq.gz

Reference genome:

/home/databases/references/P_aeruginosa/GCF_000006765.1_ASM676v1_genomic.fasta

The basic bwa mem command to align single-end reads is:

bwa mem [reference.fasta] [reads.fastq.gz] > [output.sam]

Remember: the > operator redirects standard output (STDOUT) to a file.

We have discussed multiplexing and read groups. It is good practice to add a read group ID and sample name during alignment. For example, if the read group is RG38 and the sample is SMPL96:

bwa mem -R "@RG\tID:RG38\tSM:SMPL96" [reference.fasta] [reads.fastq.gz] > [output.sam]

This information is crucial when you later merge multiple BAM files, so you can trace which reads came from which library or sample.

Task: Align the trimmed FASTQ file using the command above.

Q1: If you were not told which FASTQ file contains the trimmed reads, how could you determine it from the files themselves? (Hint: think of at least three different ways.)


Inspecting the alignment

Assume you named your output file SRR8002634_1.sam. You can view it as:

less -S SRR8002634_1.sam

The -S option prevents line wrapping; press q to quit. Use the slides and the official SAM specification to interpret each field.

Answer the following:

Q2: How many lines does the header have (lines starting with @)?

Q3: What is the genomic coordinate (reference name and position) of the first read SRR8002634.1?

Q4: What is the mapping quality of the third read SRR8002634.3? What does that mapping quality tell you about this read?

Q5: Using the SAM flag definitions (see Picard flag explanation), determine among the first 8 reads how many map to the forward (+) strand and how many to the reverse (–) strand.

Q6: Is the 10th read SRR8002634.11 unmapped? (Note: SRR8002634.9 was removed by trimming, so numbering skips.) How did you determine this from the SAM fields?

To get basic alignment statistics, use:

samtools flagstat [input.sam]

Below is a brief explanation of the fields reported by flagstat:

Category Meaning
mapQ Mapping quality
QC-passed reads Reads not marked as QC-failed; these are typically used for analysis.
QC-failed reads Reads flagged as having problems by the processing pipeline; downstream tools usually ignore them.
total Total number of alignments reported.
secondary Additional alignments for reads that map equally well to multiple locations.
supplementary Alignments for chimeric or split reads where different parts map to different locations.
duplicates Reads marked as duplicates (e.g. PCR duplicates); will be discussed in the next class.
mapped Number of reads with at least one reported alignment (not unmapped).
paired in sequencing Reads that were sequenced as part of a pair (not single-end).
read1 First read in the pair (forward).
read2 Second read in the pair (reverse).
properly paired Pairs that face each other and are within the expected insert size range.
with itself and mate mapped Both the read and its mate are mapped (whether or not properly paired).
singletons Reads that are mapped but whose mate is unmapped.
with mate mapped to a different chr Reads whose mate is mapped to a different chromosome.

Q7: According to samtools flagstat, what fraction of reads did not align to the reference?


Working with alignments

Format conversion

This should be the first and hopefully last time you work directly with SAM for large files.

First, check the SAM file size:

ls -lh SRR8002634_1.sam

Convert SAM to BAM:

samtools view -bS [input.sam] > [output.bam]

Check the BAM file size:

ls -lh SRR8002634_1.bam

-l gives a detailed listing (permissions, size, date). -h shows file sizes in human-readable form (e.g. 2.4M instead of 2469134 bytes).

The BAM file contains exactly the same alignments as the SAM file, but in binary form. To view it as SAM:

samtools view [input.bam] | less -S

You can filter reads based on SAM flags. For example, to include only unmapped reads:

samtools view -f 0x4 [input.bam]

To exclude unmapped reads:

samtools view -F 0x4 [input.bam]

The flag 0x4 corresponds to “read unmapped” (see the flag documentation).

Q8: What is the size ratio of SAM to BAM (SAM size divided by BAM size)?

Now convert BAM to CRAM, which compresses further using the reference:

samtools view -C -T [reference.fasta] [input.bam] > [output.cram]

Use the same reference FASTA you used for mapping. Check the CRAM file size with ls -lh.

To view CRAM as SAM:

samtools view -T [reference.fasta] [input.cram] | less -S

Q9: What is the size ratio of BAM to CRAM?

To save space, please remove the SAM and CRAM files (we will work with BAM only):

rm [file]

Sorting

Sort the BAM file by genomic coordinate:

samtools sort [input.bam] > [output.sorted.bam]

Be careful not to overwrite the original BAM file; for example:

  • input.bam = SRR8002634_1.bam
  • output.sorted.bam = SRR8002634_1.sorted.bam

Use samtools view and less -S to confirm that reads are ordered by reference and coordinate.

Index the sorted BAM file:

samtools index [input.sorted.bam]

Note: you cannot index an unsorted BAM file.

Retrieving a particular region

Once sorted and indexed, you can retrieve reads from a specific region:

samtools view [input.sorted.bam] [regionID]:[start]-[end]

For example, to get reads overlapping positions 1,000,000–1,000,100 on chromosome NC_002516.2:

samtools view SRR8002634_1.sorted.bam NC_002516.2:1000000-1000100

Q10: How many reads are aligned between positions 2,000,000 and 3,000,000 on the reference NC_002516.2?
Hint: do not save to a file; instead use:

samtools view [options] | wc -l

Q11: How many reads with mapping quality ≥ 30 are aligned between positions 2,000,000 and 3,000,000 on NC_002516.2?
Hint: run samtools view without arguments to see its options.

Average coverage

We will use mosdepth to measure the average coverage (mean number of reads covering each base in the genome):

/home/ctools/mosdepth/mosdepth [output_prefix] [input.sorted.bam]

Use any prefix you like (e.g. SRR8002634_1). mosdepth will write a summary file named [output_prefix].mosdepth.summary.txt.

Check the summary and see if the reported coverage makes sense given the data.

Q12: On average, how many reads cover a base in the genome? What is the maximum coverage (maximum number of reads covering a single position)?

You can also inspect per-position coverage using:

samtools mpileup [input.sorted.bam] | less -S

The wrong reference genome?

Q13: Suppose you accidentally aligned the reads to a different bacterial reference genome, e.g. Yersinia pestis (the plague bacterium), a distant relative of Pseudomonas alcaligenes. Would the number of aligned reads go up or down compared to the correct reference? Why? What if the other species was very closely related — would you expect more or fewer reads to align?

Optional bonus: Try it.

Yersinia pestis reference:

/home/databases/references/Y_pestis/GCF_000222975.1_ASM22297v1_genomic.fasta

Pseudomonas alcaligenes reference:

/home/databases/references/P_alcaligenes/GCF_001597285.1_ASM159728v1_genomic.fasta

Human paired-end Illumina reads

Aligning

We will align exome-seq reads from a Yoruba female.

Raw data:

/home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz
/home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz

NA19201_1.fastq.gz contains the forward reads; NA19201_2.fastq.gz contains the reverse reads. These reads are already trimmed.

Your goal is to write a single command line that:

  1. Uses bwa mem to align the paired-end reads and produce SAM output.
  2. Converts SAM to BAM.
  3. Sorts the BAM file.

bwa mem syntax for paired-end reads:

bwa mem [reference.fasta] [forward.fastq.gz] [reverse.fastq.gz]

Human reference (GRCh38):

/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa

Note: There are multiple versions of the human reference genome (e.g. hg18, hg19, hg38). Coordinates and even sequences can differ between versions. Always make sure to use the same reference version consistently in all steps of your analysis.

If possible, add a read group and sample name. For example, if read group is RG26 and sample is YRB42:

-R "@RG\tID:RG26\tSM:YRB42"

Converting and sorting:

  • samtools view -bS [input.sam] converts SAM to BAM.
  • When reading from STDIN, use /dev/stdin as the input.
  • samtools sort [input.bam] sorts the BAM file.

Your combined command should:

  • Run bwa mem,
  • Pipe SAM output to samtools view,
  • Pipe BAM output to samtools sort,
  • Redirect the final sorted BAM to NA19201.bam.

The alignment may take around 10 minutes.

Q14: Write the full one-line command that performs alignment, SAM->BAM conversion, and sorting using pipes, and saves output as NA19201.bam.

Q15: What are two major advantages of using UNIX pipes instead of running each command separately and writing intermediate files?

Note: For speed, the provided reads only contain sequences mapping to chr20.


Alignment statistics

flagstat

Q16: Using samtools flagstat, what proportion of reads aligned to the reference?

Q17: Using the same output, how many read pairs are marked as properly paired?

Q18: Index the BAM file, then run:

samtools view [input.bam] [chromosome]

Count how many reads align to chr20 (hint: pipe to wc -l). How many total reads are aligned to chr20?

stat

Generate additional alignment statistics using:

samtools stat [input.bam] > NA19201.stat

Generate plots from the statistics file:

plot-bamstats -p NA19201 NA19201.stat

Viewing the BAM statistics report:

The command above generates a set of PNG image files containing various BAM statistics (e.g. insert size, base composition, quality by cycle). The plots are created in the current directory.

If you are using MobaXterm, you can open the PNG files directly from the left-hand file panel.

If you are using macOS (or a standard terminal), copy the PNG files to your local computer and open them using any image viewer. For example:

scp stud0XX@pupilX.healthtech.dtu.dk:path/to/NA19201*.png .

Replace stud0XX with your student ID and pupilX with the compute node you are working on.

Q19: Look at the insert size distribution plot. What is the most common insert size (approximately)?


Inspecting the alignment with samtools tview

We will use the text-based viewer samtools tview to inspect the human alignment around a potential variant.

First, make sure your BAM file is indexed:

samtools index NA19201.bam

Then start samtools tview:

samtools tview NA19201.bam /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa

This opens the alignment in your terminal.

To jump to the region of interest (chr20:35,581,362):

  1. Press g (for “goto”).
  2. Type chr20:35581362 and press Enter.

You should now see:

  • The reference sequence on the top line.
  • Aligned reads below. Matching bases are often shown as . or ,, while mismatches are shown as the actual base (A/C/G/T).

Useful keys:

  • ? – show help.
  • Arrow keys – move left/right/up/down.
  • q – quit tview.

Use this view to answer the following:

Q20: At position chr20:35,581,362, what bases are present in the sample reads?

Q21: How many reads support the non-consensus base at this position? (Count the reads showing the alternative base in tview.)

Q22: Based on the fraction of reads supporting the non-reference base, does this variant look more like a heterozygous or a homozygous variant? Explain briefly.


Please find the answers here.

Congratulations, you finished the exercise!