Alignment exercise: Difference between revisions

From 22126
Jump to navigation Jump to search
No edit summary
No edit summary
Line 1: Line 1:
<h2>Overview</h2>
<h2>Overview</h2>


<p>In this exercise you will practice aligning NGS data and working with alignment files.</p>
<p>
In this exercise you will explore Hi-C data analysis using <b>TADbit</b>,
from raw FASTQ files to normalized contact matrices and domain-level
interpretation.
</p>


<ol>
<p>
  <li>Navigate to your home directory.</li>
The goal is to understand what each step of the pipeline does, which
  <li>Create a directory called <code>align</code>.</li>
parameters matter, and how choices affect downstream interpretation.
  <li>Navigate to the <code>align</code> directory.</li>
</p>
</ol>
 
<p>We will align two types of NGS data:</p>
<ol>
  <li><i>Pseudomonas</i> single-end Illumina reads</li>
  <li>Human paired-end Illumina reads</li>
</ol>


<hr>
<hr>


<h2><i>P. aeruginosa</i> single-end Illumina reads</h2>
<h2>Outline of the exercises</h2>


<h3>Alignment using bwa mem</h3>
<ol>
 
  <li>Preprocess Hi-C FASTQ data</li>
<p>We will align single-end reads that have been trimmed from <i>P. aeruginosa</i>.</p>
  <li>Index a reference genome</li>
 
  <li>Map reads to the reference genome</li>
<p><b>Raw data:</b></p>
  <li>Parse and filter read pairs</li>
<pre>/home/projects/22126_NGS/exercises/alignment/SRR8002634_1.fastq.gz</pre>
  <li>Normalize Hi-C contact matrices</li>
 
  <li>Generate and inspect contact matrices</li>
<p><b>Trimmed data:</b></p>
</ol>
<pre>/home/projects/22126_NGS/exercises/alignment/SRR8002634_1_trimmed.fq.gz</pre>
 
<p><b>Reference genome:</b></p>
<pre>/home/databases/references/P_aeruginosa/GCF_000006765.1_ASM676v1_genomic.fasta</pre>
 
<p>The basic <code>bwa mem</code> command to align single-end reads is:</p>
<pre>bwa mem [reference.fasta] [reads.fastq.gz] &gt; [output.sam]</pre>
 
<p>Remember: the <code>&gt;</code> operator redirects standard output (STDOUT) to a file.</p>
 
<p>We have discussed multiplexing and read groups. It is good practice to add a read group ID and sample name during alignment. For example, if the read group is <code>RG38</code> and the sample is <code>SMPL96</code>:</p>
 
<pre>bwa mem -R "@RG\tID:RG38\tSM:SMPL96" [reference.fasta] [reads.fastq.gz] &gt; [output.sam]</pre>
 
<p>This information is crucial when you later merge multiple BAM files, so you can trace which reads came from which library or sample.</p>
 
<p><b>Task:</b> Align the <b>trimmed</b> FASTQ file using the command above.</p>
 
<p><b>Q1:</b> If you were not told which FASTQ file contains the trimmed reads, how could you determine it from the files themselves? (Hint: think of at least three different ways.)</p>


<hr>
<hr>


<h3>Inspecting the alignment</h3>
<h2>Setup conda environment to run TADbit</h2>


<p>Assume you named your output file <code>SRR8002634_1.sam</code>. You can view it as:</p>
<p>
<pre>less -S SRR8002634_1.sam</pre>
Before starting, set up a conda environment with all required dependencies.
</p>


<p>The <code>-S</code> option prevents line wrapping; press <code>q</code> to quit. Use the slides and the
<pre>
[https://samtools.github.io/hts-specs/SAMv1.pdf official SAM specification] to interpret each field.</p>
cd;   # Home directory
 
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .
<p>Answer the following:</p>
./setup_TADbit.sh
 
</pre>
<p><b>Q2:</b> How many lines does the header have (lines starting with <code>@</code>)?</p>
 
<p><b>Q3:</b> What is the genomic coordinate (reference name and position) of the first read <code>SRR8002634.1</code>?</p>
 
<p><b>Q4:</b> What is the mapping quality of the third read <code>SRR8002634.3</code>? What does that mapping quality tell you about this read?</p>
 
<p><b>Q5:</b> Using the SAM flag definitions (see
[https://broadinstitute.github.io/picard/explain-flags.html Picard flag explanation]), determine among the first 8 reads how many map to the <b>forward (+)</b> strand and how many to the <b>reverse (–)</b> strand.</p>
 
<p><b>Q6:</b> Is the 10th read <code>SRR8002634.11</code> unmapped? (Note: <code>SRR8002634.9</code> was removed by trimming, so numbering skips.) How did you determine this from the SAM fields?</p>
 
<p>To get basic alignment statistics, use:</p>
<pre>samtools flagstat [input.sam]</pre>
 
<p>Below is a brief explanation of the fields reported by <code>flagstat</code>:</p>


<table class="wikitable">
<p>
  <tr>
If successful, the command should print the <code>tadbit</code> help message.
    <th>Category</th>
This confirms that the environment is correctly installed.
    <th>Meaning</th>
</p>
  </tr>
  <tr>
    <td>mapQ</td>
    <td>Mapping quality</td>
  </tr>
  <tr>
    <td>QC-passed reads</td>
    <td>Reads not marked as QC-failed; these are typically used for analysis.</td>
  </tr>
  <tr>
    <td>QC-failed reads</td>
    <td>Reads flagged as having problems by the processing pipeline; downstream tools usually ignore them.</td>
  </tr>
  <tr>
    <td>total</td>
    <td>Total number of alignments reported.</td>
  </tr>
  <tr>
    <td>secondary</td>
    <td>Additional alignments for reads that map equally well to multiple locations.</td>
  </tr>
  <tr>
    <td>supplementary</td>
    <td>Alignments for chimeric or split reads where different parts map to different locations.</td>
  </tr>
  <tr>
    <td>duplicates</td>
    <td>Reads marked as duplicates (e.g. PCR duplicates); will be discussed in the next class.</td>
  </tr>
  <tr>
    <td>mapped</td>
    <td>Number of reads with at least one reported alignment (not unmapped).</td>
  </tr>
  <tr>
    <td>paired in sequencing</td>
    <td>Reads that were sequenced as part of a pair (not single-end).</td>
  </tr>
  <tr>
    <td>read1</td>
    <td>First read in the pair (forward).</td>
  </tr>
  <tr>
    <td>read2</td>
    <td>Second read in the pair (reverse).</td>
  </tr>
  <tr>
    <td>properly paired</td>
    <td>Pairs that face each other and are within the expected insert size range.</td>
  </tr>
  <tr>
    <td>with itself and mate mapped</td>
    <td>Both the read and its mate are mapped (whether or not properly paired).</td>
  </tr>
  <tr>
    <td>singletons</td>
    <td>Reads that are mapped but whose mate is unmapped.</td>
  </tr>
  <tr>
    <td>with mate mapped to a different chr</td>
    <td>Reads whose mate is mapped to a different chromosome.</td>
  </tr>
</table>


<p><b>Q7:</b> According to <code>samtools flagstat</code>, what fraction of reads did <b>not</b> align to the reference?</p>
<p>
Inside <code>/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE</code> you will find:
</p>


<hr>
<h3>Working with alignments</h3>
<h4>Format conversion</h4>
<p>This should be the first and hopefully last time you work directly with SAM for large files.</p>
<p>First, check the SAM file size:</p>
<pre>ls -lh SRR8002634_1.sam</pre>
<p>Convert SAM to BAM:</p>
<pre>samtools view -bS [input.sam] &gt; [output.bam]</pre>
<p>Check the BAM file size:</p>
<pre>ls -lh SRR8002634_1.bam</pre>
<p><code>-l</code> gives a detailed listing (permissions, size, date). <code>-h</code> shows file sizes in human-readable form (e.g. 2.4M instead of 2469134 bytes).</p>
<p>The BAM file contains exactly the same alignments as the SAM file, but in binary form. To view it as SAM:</p>
<pre>samtools view [input.bam] | less -S</pre>
<p>You can filter reads based on SAM flags. For example, to include only unmapped reads:</p>
<pre>samtools view -f 0x4 [input.bam]</pre>
<p>To exclude unmapped reads:</p>
<pre>samtools view -F 0x4 [input.bam]</pre>
<p>The flag <code>0x4</code> corresponds to “read unmapped” (see the
[https://broadinstitute.github.io/picard/explain-flags.html flag documentation]).</p>
<p><b>Q8:</b> What is the size ratio of SAM to BAM (SAM size divided by BAM size)?</p>
<p>Now convert BAM to CRAM, which compresses further using the reference:</p>
<pre>samtools view -C -T [reference.fasta] [input.bam] &gt; [output.cram]</pre>
<p>Use the same reference FASTA you used for mapping. Check the CRAM file size with <code>ls -lh</code>.</p>
<p>To view CRAM as SAM:</p>
<pre>samtools view -T [reference.fasta] [input.cram] | less -S</pre>
<p><b>Q9:</b> What is the size ratio of BAM to CRAM?</p>
<p>To save space, please remove the SAM and CRAM files (we will work with BAM only):</p>
<pre>rm [file]</pre>
<h4>Sorting</h4>
<p>Sort the BAM file by genomic coordinate:</p>
<pre>samtools sort [input.bam] &gt; [output.sorted.bam]</pre>
<p>Be careful not to overwrite the original BAM file; for example:</p>
<ul>
<ul>
   <li><code>input.bam</code> = <code>SRR8002634_1.bam</code></li>
   <li><code>fastq</code> – raw Hi-C FASTQ files</li>
   <li><code>output.sorted.bam</code> = <code>SRR8002634_1.sorted.bam</code></li>
  <li><code>SCRIPTS</code> – scripts to run TADbit</li>
   <li><code>refGenome</code> – reference genome FASTA and index files</li>
</ul>
</ul>


<p>Use <code>samtools view</code> and <code>less -S</code> to confirm that reads are ordered by reference and coordinate.</p>
<hr>


<p>Index the sorted BAM file:</p>
<h2>Index reference genome</h2>
<pre>samtools index [input.sorted.bam]</pre>


<p>Note: you <b>cannot</b> index an unsorted BAM file.</p>
<p>
Before mapping Hi-C reads, the reference genome must be indexed for the
GEM mapper used by TADbit.
</p>


<h4>Retrieving a particular region</h4>
<p>
Use the provided reference genome in the <code>refGenome</code> directory.
This step only needs to be done once per reference.
</p>


<p>Once sorted and indexed, you can retrieve reads from a specific region:</p>
<hr>
<pre>samtools view [input.sorted.bam] [regionID]:[start]-[end]</pre>


<p>For example, to get reads overlapping positions 1,000,000–1,000,100 on chromosome <code>NC_002516.2</code>:</p>
<h2>Mapping Hi-C reads</h2>
<pre>samtools view SRR8002634_1.sorted.bam NC_002516.2:1000000-1000100</pre>


<p><b>Q10:</b> How many reads are aligned between positions 2,000,000 and 3,000,000 on the reference <code>NC_002516.2</code>?<br>
<p>
Hint: do not save to a file; instead use:</p>
Hi-C reads are paired-end and must be mapped with special care to preserve
<pre>samtools view [options] | wc -l</pre>
pairing information.
</p>


<p><b>Q11:</b> How many reads with mapping quality <b>&ge; 30</b> are aligned between positions 2,000,000 and 3,000,000 on <code>NC_002516.2</code>?<br>
<p>
Hint: run <code>samtools view</code> without arguments to see its options.</p>
Mapping assigns each read to a genomic coordinate in the reference genome.
Unmapped and ambiguously mapped reads will be handled in later steps.
</p>


<h4>Average coverage</h4>
<hr>
 
<p>We will use <code>mosdepth</code> to measure the average coverage (mean number of reads covering each base in the genome):</p>
<pre>/home/ctools/mosdepth/mosdepth [output_prefix] [input.sorted.bam]</pre>


<p>Use any prefix you like (e.g. <code>SRR8002634_1</code>). <code>mosdepth</code> will write a summary file named <code>[output_prefix].mosdepth.summary.txt</code>.</p>
<h2>Parsing mapped reads</h2>


<p>Check the summary and see if the reported coverage makes sense given the data.</p>
<p>
After mapping, TADbit parses the BAM file to identify valid Hi-C read pairs.
</p>


<p><b>Q12:</b> On average, how many reads cover a base in the genome? What is the maximum coverage (maximum number of reads covering a single position)?</p>
<p>
 
This step assigns read pairs to different categories (e.g. valid pairs,
<p>You can also inspect per-position coverage using:</p>
dangling ends, self-circles, duplicates).
<pre>samtools mpileup [input.sorted.bam] | less -S</pre>
</p>


<hr>
<hr>


<h3>The wrong reference genome?</h3>
<h2>Filtering reads</h2>


<p><b>Q13:</b> Suppose you accidentally aligned the reads to a different bacterial reference genome, e.g. <i>Yersinia pestis</i> (the plague bacterium), a distant relative of <i>Pseudomonas alcaligenes</i>. Would the number of aligned reads go up or down compared to the correct reference? Why? What if the other species was very closely related — would you expect more or fewer reads to align?</p>
<p>
Filtering does <b>not</b> remove reads immediately. Instead, reads are
classified into categories.
</p>


<p><b>Optional bonus:</b> Try it.</p>
<p>
These categories are later used during normalization to decide which
reads contribute to the contact matrix.
</p>


<p><b><i>Yersinia pestis</i> reference:</b></p>
<p>
<pre>/home/databases/references/Y_pestis/GCF_000222975.1_ASM22297v1_genomic.fasta</pre>
To summarize the results of mapping, parsing, and filtering:
</p>


<p><b><i>Pseudomonas alcaligenes</i> reference:</b></p>
<pre>
<pre>/home/databases/references/P_alcaligenes/GCF_001597285.1_ASM159728v1_genomic.fasta</pre>
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample
 
tadbit describe . | less
<hr>
 
<h2>Human paired-end Illumina reads</h2>
 
<h3>Aligning</h3>
 
<p>We will align exome-seq reads from a [https://en.wikipedia.org/wiki/Yoruba_people Yoruba] female.</p>
 
<p><b>Raw data:</b></p>
<pre>/home/projects/22126_NGS/exercises/alignment/NA19201_1.fastq.gz
/home/projects/22126_NGS/exercises/alignment/NA19201_2.fastq.gz
</pre>
</pre>


<p><code>NA19201_1.fastq.gz</code> contains the forward reads; <code>NA19201_2.fastq.gz</code> contains the reverse reads. These reads are already trimmed.</p>
<p><b>Q1:</b> How many valid pairs are retained after filtering?</p>
 
<p>Your goal is to write a <b>single command line</b> that:</p>
<ol>
  <li>Uses <code>bwa mem</code> to align the paired-end reads and produce SAM output.</li>
  <li>Converts SAM to BAM.</li>
  <li>Sorts the BAM file.</li>
</ol>
 
<p><b><code>bwa mem</code> syntax for paired-end reads:</b></p>
<pre>bwa mem [reference.fasta] [forward.fastq.gz] [reverse.fastq.gz]</pre>
 
<p><b>Human reference (GRCh38):</b></p>
<pre>/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa</pre>
 
<p><b>Note:</b> There are multiple versions of the human reference genome (e.g. hg18, hg19, hg38). Coordinates and even sequences can differ between versions. Always make sure to use the <b>same reference version</b> consistently in all steps of your analysis.</p>
 
<p>If possible, add a read group and sample name. For example, if read group is <code>RG26</code> and sample is <code>YRB42</code>:</p>
<pre>-R "@RG\tID:RG26\tSM:YRB42"</pre>
 
<p><b>Converting and sorting:</b></p>
<ul>
  <li><code>samtools view -bS [input.sam]</code> converts SAM to BAM.</li>
  <li>When reading from STDIN, use <code>/dev/stdin</code> as the input.</li>
  <li><code>samtools sort [input.bam]</code> sorts the BAM file.</li>
</ul>
 
<p>Your combined command should:</p>
<ul>
  <li>Run <code>bwa mem</code>,</li>
  <li>Pipe SAM output to <code>samtools view</code>,</li>
  <li>Pipe BAM output to <code>samtools sort</code>,</li>
  <li>Redirect the final sorted BAM to <code>NA19201.bam</code>.</li>
</ul>
 
<p>The alignment may take around 10 minutes.</p>
 
<p><b>Q14:</b> Write the full one-line command that performs alignment, SAM-&gt;BAM conversion, and sorting using pipes, and saves output as <code>NA19201.bam</code>.</p>


<p><b>Q15:</b> What are two major advantages of using UNIX pipes instead of running each command separately and writing intermediate files?</p>
<p><b>Q2:</b> Why does the total number of filtered reads not equal the
initial number of read pairs?</p>


<p><b>Note:</b> For speed, the provided reads only contain sequences mapping to <code>chr20</code>.</p>
<p>
Hint: read categories are not mutually exclusive.
</p>


<hr>
<hr>


<h3>Alignment statistics</h3>
<h2>To normalize or to not normalize</h2>


<h4>flagstat</h4>
<p>
Up to this point, reads have only been classified.
<b>No reads have been excluded yet.</b>
</p>


<p><b>Q16:</b> Using <code>samtools flagstat</code>, what proportion of reads aligned to the reference?</p>
<p>
Normalization is the step where you decide which categories to include
and how to correct for technical biases.
</p>


<p><b>Q17:</b> Using the same output, how many read pairs are marked as <b>properly paired</b>?</p>
<p>
Normalization in TADbit computes a <b>bias vector</b> (one value per bin),
which corrects interaction counts for sequencing depth, mappability, and
other systematic effects.
</p>


<p><b>Q18:</b> Index the BAM file, then run:</p>
<blockquote>
<pre>samtools view [input.bam] [chromosome]</pre>
<b>Important:</b> During normalization, <b>bad columns</b> (bins with low
counts or poor mappability) are removed from the matrix.
</blockquote>


<p>Count how many reads align to <code>chr20</code> (hint: pipe to <code>wc -l</code>). How many total reads are aligned to <code>chr20</code>?</p>
<p>
Several normalization strategies are available:
</p>


<h4>stat</h4>
<p>
See <code>tadbit normalize --help</code> for details.
</p>


<p>Generate additional alignment statistics using:</p>
<p>
A common approach is to require a <b>minimum number of counts per bin</b>
and to explicitly exclude problematic genomic regions.
</p>


<pre>
<pre>
samtools stat [input.bam] > NA19201.stat
cd /home/people/$USER/3D_GENOMICS_COURSE/
</pre>
 
<p>Generate plots from the statistics file:</p>


<pre>
# Variables used for normalization
plot-bamstats -p NA19201 NA19201.stat
sample="liver"                  # sample name
wd="tadbit_dirs/${sample}"      # working directory
res="100000"                    # resolution (100 kb)
norm="ICE"
min_count=5
</pre>
</pre>


<p><b>Viewing the BAM statistics report:</b></p>
<p>
To exclude specific regions (e.g. sex chromosomes or poorly assembled
regions), use the <code>--badcols</code> option.
</p>


<p>The command above generates a set of <b>PNG image files</b> containing
<p>
various BAM statistics (e.g. insert size, base composition, quality by cycle).
The normalization step should take approximately 2 minutes using 6 CPUs.
The plots are created in the current directory.</p>
</p>


<p>If you are using <b>MobaXterm</b>, you can open the PNG files directly from the
<p><b>Task:</b> Run normalization twice: once with <code>norm="ICE"</code>
left-hand file panel.</p>
and once with <code>norm="raw"</code>. Compare the results later.</p>
 
<p>If you are using <b>macOS</b> (or a standard terminal), copy the PNG files to your
local computer and open them using any image viewer. For example:</p>
 
<pre>
scp stud0XX@pupilX.healthtech.dtu.dk:path/to/NA19201*.png .
</pre>
 
<p>Replace <code>stud0XX</code> with your student ID and <code>pupilX</code> with the
compute node you are working on.</p>
 
<p><b>Q19:</b> Look at the <b>insert size distribution</b> plot. What is the most
common insert size (approximately)?</p>


<hr>
<hr>


<h3>Inspecting the alignment with <code>samtools tview</code></h3>
<h2>Contact matrices</h2>
 
<p>We will use the text-based viewer <code>samtools tview</code> to inspect the human alignment around a potential variant.</p>


<p>First, make sure your BAM file is indexed:</p>
<p>
<pre>samtools index NA19201.bam</pre>
After normalization, TADbit generates Hi-C contact matrices at the chosen
resolution.
</p>


<p>Then start <code>samtools tview</code>:</p>
<p>
<pre>samtools tview NA19201.bam /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa</pre>
These matrices represent interaction frequencies between genomic bins
and are the basis for downstream analyses such as TAD detection.
</p>


<p>This opens the alignment in your terminal.</p>
<p><b>Q3:</b> How does changing the resolution affect the appearance of the
 
contact matrix?</p>
<p>To jump to the region of interest (<code>chr20:35,581,362</code>):</p>
<ol>
  <li>Press <code>g</code> (for “goto”).</li>
  <li>Type <code>chr20:35581362</code> and press Enter.</li>
</ol>
 
<p>You should now see:</p>
<ul>
  <li>The reference sequence on the top line.</li>
  <li>Aligned reads below. Matching bases are often shown as <code>.</code> or <code>,</code>, while mismatches are shown as the actual base (A/C/G/T).</li>
</ul>
 
<p>Useful keys:</p>
<ul>
  <li><code>?</code> – show help.</li>
  <li>Arrow keys – move left/right/up/down.</li>
  <li><code>q</code> – quit <code>tview</code>.</li>
</ul>
 
<p>Use this view to answer the following:</p>
 
<p><b>Q20:</b> At position <code>chr20:35,581,362</code>, what bases are present in the sample reads?</p>
 
<p><b>Q21:</b> How many reads support the non-consensus base at this position? (Count the reads showing the alternative base in <code>tview</code>.)</p>
 
<p><b>Q22:</b> Based on the fraction of reads supporting the non-reference base, does this variant look more like a heterozygous or a homozygous variant? Explain briefly.</p>


<hr>
<hr>


<p>Please find the answers [[Alignment_exercise_answers|here]].</p>
<p><b>Congratulations, you finished the TADbit exercise!</b></p>
 
<p><b>Congratulations, you finished the exercise!</b></p>

Revision as of 15:53, 6 January 2026

Overview

In this exercise you will explore Hi-C data analysis using TADbit, from raw FASTQ files to normalized contact matrices and domain-level interpretation.

The goal is to understand what each step of the pipeline does, which parameters matter, and how choices affect downstream interpretation.


Outline of the exercises

  1. Preprocess Hi-C FASTQ data
  2. Index a reference genome
  3. Map reads to the reference genome
  4. Parse and filter read pairs
  5. Normalize Hi-C contact matrices
  6. Generate and inspect contact matrices

Setup conda environment to run TADbit

Before starting, set up a conda environment with all required dependencies.

cd;   # Home directory
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .
./setup_TADbit.sh

If successful, the command should print the tadbit help message. This confirms that the environment is correctly installed.

Inside /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE you will find:

  • fastq – raw Hi-C FASTQ files
  • SCRIPTS – scripts to run TADbit
  • refGenome – reference genome FASTA and index files

Index reference genome

Before mapping Hi-C reads, the reference genome must be indexed for the GEM mapper used by TADbit.

Use the provided reference genome in the refGenome directory. This step only needs to be done once per reference.


Mapping Hi-C reads

Hi-C reads are paired-end and must be mapped with special care to preserve pairing information.

Mapping assigns each read to a genomic coordinate in the reference genome. Unmapped and ambiguously mapped reads will be handled in later steps.


Parsing mapped reads

After mapping, TADbit parses the BAM file to identify valid Hi-C read pairs.

This step assigns read pairs to different categories (e.g. valid pairs, dangling ends, self-circles, duplicates).


Filtering reads

Filtering does not remove reads immediately. Instead, reads are classified into categories.

These categories are later used during normalization to decide which reads contribute to the contact matrix.

To summarize the results of mapping, parsing, and filtering:

cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample
tadbit describe . | less

Q1: How many valid pairs are retained after filtering?

Q2: Why does the total number of filtered reads not equal the initial number of read pairs?

Hint: read categories are not mutually exclusive.


To normalize or to not normalize

Up to this point, reads have only been classified. No reads have been excluded yet.

Normalization is the step where you decide which categories to include and how to correct for technical biases.

Normalization in TADbit computes a bias vector (one value per bin), which corrects interaction counts for sequencing depth, mappability, and other systematic effects.

Important: During normalization, bad columns (bins with low counts or poor mappability) are removed from the matrix.

Several normalization strategies are available:

See tadbit normalize --help for details.

A common approach is to require a minimum number of counts per bin and to explicitly exclude problematic genomic regions.

cd /home/people/$USER/3D_GENOMICS_COURSE/

# Variables used for normalization
sample="liver"                  # sample name
wd="tadbit_dirs/${sample}"      # working directory
res="100000"                    # resolution (100 kb)
norm="ICE"
min_count=5

To exclude specific regions (e.g. sex chromosomes or poorly assembled regions), use the --badcols option.

⏰ The normalization step should take approximately 2 minutes using 6 CPUs.

Task: Run normalization twice: once with norm="ICE" and once with norm="raw". Compare the results later.


Contact matrices

After normalization, TADbit generates Hi-C contact matrices at the chosen resolution.

These matrices represent interaction frequencies between genomic bins and are the basis for downstream analyses such as TAD detection.

Q3: How does changing the resolution affect the appearance of the contact matrix?


Congratulations, you finished the TADbit exercise!