Alignment exercise

Overview

In this mini-workshop you will familiarize yourself with TADbit (Serra et al., 2017), going from raw FASTQ files to contact matrices and beyond.

A Primer into 3D Genomics: A Mini-Workshop
Juan Antonio Rodríguez, Globe Institute, University of Copenhagen
9 January 2026, DTU


Outline of the exercises

  1. Preprocess Hi-C FASTQ data
  2. Index reference genome
  3. Use TADbit to:
    1. Map reads to reference genome (map)
    2. Get intersection (parse)
    3. Filter reads (filter)
    4. Normalize (normalize)
    5. Generate matrices (bin)
    6. Export formats (bin + cooler)

Set up the conda environment to run TADbit

cd;  # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
./setup_TADbit.sh

If the setup succeeds, the only output should be the program's help message, which means the environment is up and running.
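
A quick way to double-check later, assuming the environment path used further down in this exercise, is to activate the environment and ask one of the TADbit tools for its help text:

# Activate the environment and print the help of the mapping tool (sketch)
conda activate /home/people/${USER}/envs/tadbit_course
tadbit map --help | head -n 20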

Familiarize yourself with the directory structure. Inside /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE there are three folders:

  • fastq – raw data
  • SCRIPTS – scripts to run TADbit
  • refGenome – reference genome raw FASTA and indexed files
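
A quick way to look around (a simple sketch):

ls -lh /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE
ls -lh /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq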

Index reference genome

Before analyzing Hi-C data with TADbit, index the reference genome that the GEM mapper will use. This is standard for most mappers (e.g., bwa, bowtie2). We can call the gem-indexer from within the TADbit environment.

Remember to activate the tadbit conda environment.

# Move to your home
cd;

# Activate TADbit environment
conda activate /home/people/${USER}/envs/tadbit_course
# $USER is your user; it's an environment variable so no need to change it.

# Make a WORKING folder for the course
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;

# Make SCRIPT folders (to store your own scripts)
mkdir -p SCRIPTS;
# also a log folder for the scripts
mkdir -p SCRIPTS/log

# Make RESULTS folder
mkdir -p tadbit_dirs;

# Make REFERENCE GENOME folder
mkdir -p refGenome;

# To store logs from fastp
mkdir -p fastp_reports

# For the fastq
mkdir -p fastq
# Filtered fastq
mkdir -p fastq/clean

Putting things into an SBATCH script

A template for sbatch job submission is provided. Copy it to your SCRIPTS folder:

cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/
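
The exact contents of the course template are not reproduced here, but a generic SBATCH header looks roughly like the sketch below (all values are placeholders; keep whatever the course template already defines):

#!/bin/bash
#SBATCH --job-name=tadbit_job        # name shown in the queue (placeholder)
#SBATCH --output=log/%x_%j.out       # log file (uses the SCRIPTS/log folder made earlier)
#SBATCH --cpus-per-task=6            # CPUs requested
#SBATCH --mem=20G                    # memory requested
#SBATCH --time=02:00:00              # wall-clock limit

# your commands go below the header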

Move to your SCRIPTS folder and make a copy called 00_index.sbatch:

cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;

cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch

Open your copy (00_index.sbatch) with your favorite editor, paste the following into the file, and save it. For example: emacs 00_index.sbatch

data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE;
cd ${data_dir};

# Running the indexer
# Note: the output is just a *prefix*; no file extension needed.
gem-indexer -t 11 -i refGenome/GCF_000002315.6_GRCg6a_genomic.fna -o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic

Submit the job:

sbatch 00_index.sbatch;

⚠️ NO NEED TO RUN THIS. WE WILL GENERATE A SYMBOLIC LINK.

We can make a symlink to the reference genome in our folder so that we do not have to copy it:

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna

⏰ If you do run the indexing job, it should take ~5–10 min to complete.

A prepared script is also available:

cd ~/3D_GENOMICS_COURSE/SCRIPTS;
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch .;
sbatch 00_index.sbatch

Pre-process Hi-C FASTQ data: minimum QC

While genome indexing runs, start looking at the data and pre-process it. Hi-C FASTQs are paired-end reads. We will use fastp to remove adapters, low-quality bases, and reads that are too short.
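
If you want a first look at the raw reads before cleaning them, a simple sketch using the liver sample:

zcat /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq/liver_R1.fastq.gz | head -n 8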

Copy the template and create 01_fastp.sbatch:

cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/01_fastp.sbatch;

Put the following into the SBATCH script:

cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"

# Read the raw FASTQ from the course folder and write the cleaned version
# to your own fastq/clean folder.
#   --detect_adapter_for_pe : auto-detect adapters for paired-end data
#   --trim_front1 5         : trim the first 5 bases of read 1 (often lower quality)
#   -w 10                   : number of threads
#   -l 30                   : minimum read length after trimming
fastp \
    -i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
    -o clean/${sample}_R1.clean.fastq.gz \
    -I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
    -O clean/${sample}_R2.clean.fastq.gz \
    --detect_adapter_for_pe \
    --trim_front1 5 \
    -w 10 \
    -l 30 \
    -h ${sample}.html

Copy the HTML report to your local computer and open it in a browser:

USER="juanrod"
scp ${USER}@pupil1.healthtech.dtu.dk:/home/people/${USER}/3D_GENOMICS_COURSE/fastq/liver.html .

⏰ It should take ~1 min to complete with 6 CPUs.

Question: Check the HTML report. What percentage of reads are kept?


Mapping to the reference genome

TADbit maps each read of the pair separately, so we run tadbit map twice (once per mate). It requires the restriction enzyme(s) used in the experiment; these samples were digested with two enzymes.

Put the following into your mapping script:

cd /home/people/$USER/3D_GENOMICS_COURSE/

# Variables used for mapping
sample="liver"
ref="/refGenome/GCF_000002315.6_GRCg6a_genomic.gem"
wd="tadbit_dirs/"${sample}
mkdir -p ${wd}

# Two enzymes used in this experiment
enz="MboI HinfI"  # Double digestion (relevant for Arima/Phase Genomics)

# Map read 1
rd=1;

tadbit map \
  --fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
  --workdir ${wd} \
  --index ${ref} \
  --read ${rd} \
  --tmpdb ${TMPDIR} \
  --renz ${enz} \
  -C 6

# Map read 2: rerun the same tadbit map command with rd=2
rd=2
# >>> or change the script to take the read number as a parameter (see the sketch below).
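
One simple way to take the read number as a parameter (a sketch; 02_map.sbatch is a hypothetical name for your mapping script) is to read it from the script's first argument and submit the job once per mate:

# Inside the sbatch script, replace the hard-coded rd with:
rd=${1:-1}   # first command-line argument, defaulting to read 1

# Then submit the same script twice:
# sbatch 02_map.sbatch 1
# sbatch 02_map.sbatch 2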

⏰ It should take ~5 min to complete with 6 CPUs.

Note: We are not using iterative mapping. Fragment-based mapping is the default in TADbit.

After mapping, inspect the plots TADbit generates. Discuss the number of digested sites, dangling ends, and ligation efficiency.

Question: How may restriction enzyme choice influence the experiment? ✂️

  • Fragment size histogram
  • HiC sequencing quality and digestion/ligation deconvolution
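
To build some intuition for the enzyme question above, you can count recognition sites directly in the reference FASTA. A rough sketch (non-overlapping matches only; sites split across line breaks are missed):

fna=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna

# MboI cuts at GATC, HinfI at GANTC; more frequent sites mean shorter fragments
grep -v ">" ${fna} | grep -oi "GATC" | wc -l
grep -v ">" ${fna} | grep -oi "GA[ACGT]TC" | wc -l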

Finding the intersection of mapped reads (parse)

Each mate of a Hi-C pair originates from the same digested/ligated fragment (unless it is a dangling end). We identify pairs and build fragment associations with tadbit parse.

⚠️ Note: The pattern given to --filter_chrom is matched against the chromosome names as they appear in the reference genome FASTA, so check those names beforehand. Only chromosomes whose names match the pattern (here, names starting with "chr") are kept.

cd /home/people/$USER/3D_GENOMICS_COURSE/;
sample="liver"  # sample name
ref="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/"${sample}  # workdir (auto-created by TADbit)

# Keep only canonical chromosomes and compress map files after parsing
tadbit parse \
       --workdir ${wd} \
       --genome ${ref} \
       --filter_chrom "chr.*" \
       --compress_input;

⏰ It should take ~35 min to complete with 10 CPUs.

Question: Is it possible to retrieve multiple contacting regions?


Filtering interactions

TADbit allows flexible filtering of unwanted interactions. In many cases, the default filters work well across datasets.

Run filtering:

tadbit filter \
  --workdir ${wd} \
  --apply 1 2 3 4 6 7 8 9 10 \
  --cpus 6 \
  --tmpdb ${TMPDIR}

Check the amount of filtered data and past commands

tadbit describe summarizes what has been done so far in the workdir and reports the read counts and parameters for each step.

# Change to workdir
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample

# Summarize the run
tadbit describe . | less

Question: How many valid pairs do we keep?

Question: The total number of filtered reads is not equal to the initial number of reads… Why?


To normalize or not to normalize

The filter step only catalogues reads into categories; it does not actually remove anything yet. It is during normalization that we specify which categories to include or exclude, and the normalization is performed accordingly.

Normalization in TADbit computes a bias vector (one value per bin) that corrects interaction counts for coverage and other technical biases: the normalized count between bins i and j is roughly the raw count divided by bias_i * bias_j.

Important: Normalization is also the step where bad columns (bins with low counts, low mappability, etc.) are removed from the matrix.

Several normalization strategies exist (see: tadbit normalize --help). A simple and commonly used option is to filter based on a minimum number of counts per bin.

If you want to exclude specific genomic regions, use the --badcols parameter.

cd /home/people/$USER/3D_GENOMICS_COURSE/;

# Variables used for normalization
sample="liver"  # sample name
wd="tadbit_dirs/"${sample}  # workdir (auto-created by TADbit)

# First time we define the resolution
res="100000"  # 100 kb

# Choice of normalization (raw, ICE, Vanilla, decay)
norm="Vanilla"

# Minimum number of counts required per bin
min_count=100

tadbit normalize -w ${wd} \
       -r ${res} \
       --tmpdb ${TMPDIR} \
       --cpus 6 \
       --filter 1 2 3 4 6 7 9 10 \
       --normalization ${norm} \
       --badcols chrW:1-7000000 chrZ:1-83000000 \
       --min_count ${min_count}

⏰ It should take ~2 min to complete with 6 CPUs.

⚠️ Run another version with norm="raw" to compare later.

Use tadbit describe to check how many bins were removed. As a rule of thumb, around 3–4% of bins are removed; if many more are removed, something may be wrong.
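
As a rough sanity check (assuming the chicken genome is about 1.05 Gb), you can estimate how many 100 kb bins exist genome-wide and what 3–4% of them corresponds to:

# ~1.05 Gb / 100 kb per bin gives roughly 10,500 bins genome-wide
echo $(( 1050000000 / 100000 ))
# 3% of those bins is roughly 300, so expect a few hundred removed bins
echo $(( 1050000000 / 100000 * 3 / 100 ))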

Each job is assigned a <job_id>. This helps retrieve results from specific runs (especially when testing parameters).

If you want, you can take a quick look at the different normalization strategies and extract your own conclusions:

https://www.tandfonline.com/doi/full/10.2144/btn-2019-0105


Binning and viewing matrices

Once normalization is done, we can visualize Hi-C matrices. Using -c restricts the plot to a specific chromosome or region.

# Variables used for binning

cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/"${sample}
res="100000";
chrom="chr1"
tadbit bin \
       -w ${wd} \
       -r ${res} \
       -c ${chrom} \
       --plot \
       --norm "norm" \
       --format "png" \
       --cpus 6;
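
The outline also mentions exporting matrices to other formats (e.g. cooler). The exact option names vary between TADbit versions, so rather than guessing flags, check the binning tool's help for the available output formats:

tadbit bin --help | less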

Congratulations, you finished the exercise!