Alignment exercise

Overview

In this mini-workshop you will familiarize yourself with TADbit (Serra et al., 2017): from FASTQ files to contact matrix and beyond.

A Primer into 3D Genomics: A Mini-Workshop
Juan Antonio Rodríguez, Globe Institute, University of Copenhagen
9 January 2026, DTU

Outline of the exercises

Preprocess Hi-C FASTQ data
Index reference genome
Use TADbit to:
1. Map reads to reference genome (map)
2. Get intersection (parse)
3. Filter reads (filter)
4. Normalize (normalize)
5. Generate matrices (bin)
6. Export formats (bin + cooler)

Setup conda environment to run TADbit later

cd;  # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
./setup_TADbit.sh

You should see the program's help message as the only output; this means the environment is up and running.

Make yourself familiar with the directory structure. Inside /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE we have three folders:

fastq – raw data
SCRIPTS – scripts to run TADbit
refGenome – reference genome raw FASTA and indexed files

Index reference genome

Before analyzing Hi-C data through TADbit, index the reference genome that GEM mapper will use. This is standard for most mappers (e.g., bwa, bowtie2). We can call the gem-indexer from within the TADbit environment.

Remember to activate the tadbit conda environment.

# Move to your home
cd;

# Activate TADbit environment
conda activate /home/people/${USER}/envs/tadbit_course
# $USER is your user; it's an environment variable so no need to change it.

# Make a WORKING folder for the course
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;

# Make SCRIPT folders (to store your own scripts)
mkdir -p SCRIPTS;
# also a log folder for the scripts
mkdir -p SCRIPTS/log

# Make RESULTS folder
mkdir -p tadbit_dirs;

# Make REFERENCE GENOME folder
mkdir -p refGenome;

# To store logs from fastp
mkdir -p fastp_reports

# For the fastq
mkdir -p fastq
# Filtered fastq
mkdir -p fastq/clean

Putting things into an SBATCH script

A template for sbatch job submission is provided. Copy it to your SCRIPTS folder:

cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/

Move to your SCRIPTS folder and make a copy called 00_index.sbatch:

cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;

cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch

Open 00_index.sbatch (the copy you just made) with your favorite editor, paste the following into the file, and save it. For example: emacs 00_index.sbatch

data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE;
cd ${data_dir};

# Running the indexer
# Note: the output is just a *prefix*; no file extension needed.
gem-indexer -t 11 -i refGenome/GCF_000002315.6_GRCg6a_genomic.fna -o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic

Submit the job:

sbatch 00_index.sbatch;

⚠️ YOU DO NOT NEED TO RUN THE INDEXING JOB. An index is already available in the course folder, so we will create a symbolic link to it instead.

We can make a symlink to the reference genome in our folder so that we do not have to copy it:

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna
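A quick way to confirm that both links point to existing files is sketched below. check_link is a hypothetical helper written for this exercise, not part of TADbit or GEM; it only uses standard shell tests.

```shell
# check_link PATH: report whether PATH is a symlink that resolves to an existing file
check_link() {
    if [ -L "$1" ] && [ -e "$1" ]; then
        echo "OK      $1"
    elif [ -L "$1" ]; then
        echo "BROKEN  $1"      # the link exists but its target does not
    else
        echo "MISSING $1"      # no symlink at this path
    fi
}

# Paths as used in the ln -s commands above
ref_dir="/home/people/${USER}/3D_GENOMICS_COURSE/refGenome"
for ext in gem fna; do
    check_link "${ref_dir}/GCF_000002315.6_GRCg6a_genomic.${ext}"
done
```

If a link is reported as BROKEN or MISSING, remove it and repeat the corresponding ln -s command.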

⏰ If you do run the indexing job, it should take ~5–10 min to complete.

A prepared script is also available:

cd ~/3D_GENOMICS_COURSE/SCRIPTS;
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch .;
sbatch 00_index.sbatch

Pre-process Hi-C FASTQ data: minimum QC

While genome indexing runs, start looking at the data and pre-process it. Hi-C FASTQs are paired-end reads. We will “clean” the reads from adapters, low-quality bases, and short reads using fastp.

Copy the template and create 01_fastp.sbatch:

cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/01_fastp.sbatch;

Put the following into the SBATCH script:

cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"

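# Options used below (comments cannot be placed between backslash-continued
# lines, because the shell would end the command at the first comment):
#   -i / -I   raw R1 / R2 FASTQ, read from the course folder
#   -o / -O   clean R1 / R2 FASTQ, stored in your own folder
#   --detect_adapter_for_pe   detect adapters for paired-end data
#   --trim_front1 5           trim first 5 bases (often lower quality)
#   -w 10     threads
#   -l 30     minimal read length (remove reads shorter than this after trimming)
#   -h        HTML report
fastp \
    -i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
    -o clean/${sample}_R1.clean.fastq.gz \
    -I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
    -O clean/${sample}_R2.clean.fastq.gz \
    --detect_adapter_for_pe \
    --trim_front1 5 \
    -w 10 \
    -l 30 \
    -h ${sample}.html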

Copy the HTML report to your local computer and open it in a browser:

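# Run this on your local computer, replacing juanrod with your own username on the server
REMOTE_USER="juanrod"
scp ${REMOTE_USER}@pupil1.healthtech.dtu.dk:/home/people/${REMOTE_USER}/3D_GENOMICS_COURSE/fastq/liver.html .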

⏰ It should take ~1 min to complete with 6 CPUs.

Question: Check the HTML report. What percentage of reads are kept?
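The percentage is the number of reads after filtering divided by the number before filtering, times 100. A minimal sketch of that arithmetic follows; pct is a hypothetical helper, and the counts are made up rather than taken from the liver sample.

```shell
# pct PART TOTAL: print PART as a percentage of TOTAL with two decimals
pct() {
    awk -v part="$1" -v total="$2" 'BEGIN {
        if (total <= 0) { print "NA"; exit }
        printf "%.2f\n", 100 * part / total
    }'
}

# Hypothetical counts; replace them with the totals shown in your report
reads_before=2000000
reads_after=1900000
echo "Reads kept: $(pct "${reads_after}" "${reads_before}") %"
```

With these example numbers the script prints "Reads kept: 95.00 %".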

Mapping to the reference genome

TADbit maps each read separately, so we run tadbit map twice (once per read). It requires the restriction enzyme(s) used in the experiment. These samples were treated with two enzymes.

Put the following into your mapping script:

cd /home/people/$USER/3D_GENOMICS_COURSE/

# Variables used for mapping
sample="liver"
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem"  # relative to the course folder we just moved into (no leading slash)
wd="tadbit_dirs/"${sample}
mkdir -p ${wd}

# Two enzymes used in this experiment
enz="MboI HinfI"  # Double digestion (relevant for Arima/Phase Genomics)

# Map read 1
rd=1;

tadbit map \
  --fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
  --workdir ${wd} \
  --index ${ref} \
  --read ${rd} \
  --tmpdb ${TMPDIR} \
  --renz ${enz} \
  -C 6

# Map read 2
rd=2
# >>> Just change the script to take that as a parameter.
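One way to avoid duplicating the command is to wrap it in a function and loop over both reads. The sketch below reuses the options from the command above; DRY_RUN and the /tmp fallback for TMPDIR are assumptions added for illustration, so by default the commands are only printed and nothing is mapped.

```shell
# DRY_RUN is a hypothetical switch for this sketch: 1 = print the commands, 0 = run them
DRY_RUN="${DRY_RUN:-1}"

sample="liver"
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem"   # relative to the course folder
wd="tadbit_dirs/${sample}"
enz="MboI HinfI"

# map_read N: build (and optionally run) the tadbit map command for read N
map_read() {
    rd="$1"
    # ${enz} is left unquoted on purpose so each enzyme becomes its own argument
    set -- tadbit map \
        --fastq "fastq/clean/${sample}_R${rd}.clean.fastq.gz" \
        --workdir "${wd}" \
        --index "${ref}" \
        --read "${rd}" \
        --tmpdb "${TMPDIR:-/tmp}" \
        --renz ${enz} \
        -C 6
    if [ "${DRY_RUN}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

for read_number in 1 2; do
    map_read "${read_number}"
done
```

Check the two printed commands first, then set DRY_RUN=0 in the sbatch script to run the mapping for real.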

⏰ It should take ~5 min to complete with 6 CPUs.

Note: We are not using iterative mapping. Fragment-based mapping is the default in TADbit.

After mapping, inspect the plots TADbit generates. Discuss the number of digested sites, dangling ends, and ligation efficiency.

Question: How may restriction enzyme choice influence the experiment? ✂️

[Figure: Fragment size histogram]
[Figure: Hi-C sequencing quality and digestion/ligation deconvolution]

Finding the intersection of mapped reads (parse)

Each mate of a Hi-C pair originates from the same digested/ligated fragment (unless it is a dangling end). We identify pairs and build fragment associations with tadbit parse.

⚠️ Note: --filter_chrom is matched against the sequence names in the reference genome FASTA file, so the chromosome prefixes you filter on must already be present there. Only chromosomes whose names start with the string given in --filter_chrom are kept.
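Before parsing, it can help to check which sequence names a pattern would keep. The sketch below does this on a toy FASTA with made-up names; list_names and kept_names are hypothetical helpers, and the matching is assumed to behave like a regular expression anchored at the start of the name, as the note describes.

```shell
# Toy FASTA with made-up sequence names (not the real reference headers)
fasta=$(mktemp)
cat > "${fasta}" <<'EOF'
>chr1 example
ACGT
>chr2 example
ACGT
>chrZ example
ACGT
>NW_000001.1 unplaced scaffold
ACGT
EOF

# list_names FASTA: print the sequence name (first word of each header line)
list_names() {
    grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*$//'
}

# kept_names FASTA PATTERN: names that match PATTERN from their first character
kept_names() {
    list_names "$1" | grep -E "^(${2})"
}

kept_names "${fasta}" "chr.*"
rm -f "${fasta}"
```

Here only chr1, chr2 and chrZ are printed; the scaffold is dropped. To inspect the real names, run grep '^>' on the .fna file in your refGenome folder.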

cd /home/people/$USER/3D_GENOMICS_COURSE/;
sample="liver"  # sample name
ref="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/"${sample}  # workdir (auto-created by TADbit)

# Keep only canonical chromosomes and compress map files after parsing
tadbit parse \
       --workdir ${wd} \
       --genome ${ref} \
       --filter_chrom "chr.*" \
       --compress_input;

⏰ It should take ~35 min to complete with 10 CPUs.

Question: Is it possible to retrieve multiple contacting regions?

Filtering interactions

TADbit allows flexible filtering of unwanted interactions. In many cases, the defaults work well across datasets.

Run filtering:

tadbit filter \
  --workdir ${wd} \
  --apply 1 2 3 4 6 7 8 9 10 \
  --cpus 6 \
  --tmpdb ${TMPDIR}

Check the amount of filtered data and past commands

tadbit describe summarizes what has been done so far in the workdir, and reports counts, numbers, and parameters after each step.

# Change to workdir
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample

# Summarize the run
tadbit describe . | less

Question: How many valid pairs do we keep?

Question: The total number of filtered reads is not equal to the initial number of reads… Why?

To normalize or to not normalize

The filter step only catalogued the reads into categories, so nothing has actually been filtered out yet. It is during normalization that we specify which categories to include or exclude, and the normalization is computed accordingly.

Normalization in TADbit extracts a bias vector (one value per bin) which adjusts interaction intensities depending on coverage and technical biases.

Important: Normalization is the step where bad columns (low counts, low mappability, etc.) are removed from the matrix.

Several normalization strategies exist (see: tadbit normalize --help). A simple and commonly used option is to filter based on a minimum number of counts per bin.

If you want to exclude specific genomic regions, use the --badcols parameter.

cd /home/people/$USER/3D_GENOMICS_COURSE/;

# Variables used for normalization
sample="liver"  # sample name
wd="tadbit_dirs/"${sample}  # workdir (auto-created by TADbit)

# First time we define the resolution
res="100000"  # 100 kb

# Choice of normalization (raw, ICE, Vanilla, decay)
norm="Vanilla"

# Minimum number of counts required per bin
min_count=100

tadbit normalize -w ${wd} \
       -r ${res} \
       --tmpdb ${TMPDIR} \
       --cpus 6 \
       --filter 1 2 3 4 6 7 9 10 \
       --normalization ${norm} \
       --badcols chrW:1-7000000 chrZ:1-83000000 \
       --min_count ${min_count}

⏰ It should take ~2 min to complete with 6 CPUs.

⚠️ Run another version with norm="raw" to compare later.

Use tadbit describe to check how many bins were removed. A good rule of thumb: remove ~3–4% of bins. If much more is removed, something may be wrong.
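To put these numbers in context, a region spans its length divided by the resolution, rounded up, in bins. The sketch below applies that to the two regions passed to --badcols above; n_bins is a hypothetical helper, and total_bins and removed_bins are made-up values to be replaced by what tadbit describe reports.

```shell
# n_bins START END RES: bins of size RES needed to cover START..END (1-based, inclusive)
n_bins() {
    len=$(( $2 - $1 + 1 ))
    echo $(( (len + $3 - 1) / $3 ))
}

res=100000
# Regions from --badcols: chrW:1-7000000 and chrZ:1-83000000
masked=$(( $(n_bins 1 7000000 "${res}") + $(n_bins 1 83000000 "${res}") ))
echo "Bins covered by --badcols at ${res} bp: ${masked}"

# Hypothetical numbers; take the real ones from tadbit describe
total_bins=10000
removed_bins=350
awk -v r="${removed_bins}" -v t="${total_bins}" \
    'BEGIN { printf "Removed: %.1f %% of bins\n", 100 * r / t }'
```

At 100 kb the two regions cover 70 + 830 = 900 bins, and the example counts give 3.5 % of bins removed.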

Each job is assigned a <job_id>. This helps retrieve results from specific runs (especially when testing parameters).

If you want, you can take a quick look at this comparison of normalization strategies and draw your own conclusions:

https://www.tandfonline.com/doi/full/10.2144/btn-2019-0105

Binning and viewing matrices

Once normalization is done, we can visualize Hi-C matrices. Using -c restricts the plot to a specific chromosome or region.

# Variables used for binning

cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/"${sample}
res="100000";
chrom="chr1"

tadbit bin \
       -w ${wd} \
       -r ${res} \
       -c ${chrom} \
       --plot \
       --norm "norm" \
       --format "png" \
       --cpus 6;

Congratulations, you finished the exercise!
