Exercise and answers

From 22126
Jump to navigation Jump to search

Overview

In this mini-workshop you will familiarize yourself with TADbit (Serra et al., 2017): from FASTQ files to contact matrix and beyond.

A Primer into 3D Genomics: A Mini-Workshop
Juan Antonio Rodríguez, Globe Institute, University of Copenhagen
9 January 2026, DTU


Outline of the exercises

  1. Preprocess Hi-C FASTQ data
  2. Index reference genome
  3. Use TADbit to:
    1. Map reads to reference genome (map)
    2. Get intersection (parse)
    3. Filter reads (filter)
    4. Normalize (normalize)
    5. Generate matrices (bin)
    6. Export formats (bin + cooler)

Slurm / sbatch on the cluster

The cluster uses Slurm to schedule jobs. Instead of running long commands directly in the terminal, you write a small script (an .sbatch file) that requests resources (time, CPUs, memory) and runs your commands. Slurm decides when and where your job runs so users don’t interfere with each other.

After you submit a job you can safely disconnect without your job crashing (similar idea to screen). Slurm writes standard output and errors to log files, so you can always see what happened.

  • sbatch myjob.sbatch — submit a job (returns a <job_id>)
  • squeue -u $USER — check your jobs
  • scancel <job_id> — cancel a job

Setup conda environment to run TADbit later

You may get notified something is missing — just accept.

cd;  # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
bash ./setup_TADbit.sh

You should get (as the only output) the help from the program — this means the environment is up and running.

Make yourself familiar with the directory structure. Inside /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE we have three folders:

  • fastq – raw data
  • SCRIPTS – scripts to run TADbit
  • refGenome – reference genome raw FASTA and indexed files

Index reference genome

Before analyzing Hi-C data through TADbit, index the reference genome that GEM mapper will use. This is standard for most mappers (e.g., bwa, bowtie2). We can call the gem-indexer from within the TADbit environment.

Remember to activate the tadbit conda environment.

# Move to your home
cd;

# Activate TADbit environment
eval "$(/home/ctools/miniconda3/bin/conda shell.bash hook)"
conda activate "$HOME/envs/tadbit_course"

# Make a WORKING folder for the course
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;

# Make SCRIPT folders (to store your own scripts)
mkdir -p SCRIPTS;
mkdir -p SCRIPTS/log

# Make RESULTS folder
mkdir -p tadbit_dirs;

# Make REFERENCE GENOME folder
mkdir -p refGenome;

# To store logs from fastp
mkdir -p fastp_reports

# For the fastq
mkdir -p fastq
mkdir -p fastq/clean

Putting things into an SBATCH script

A template for sbatch job submission is provided. Copy it to your SCRIPTS folder:

cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch \
  /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/

Move to your SCRIPTS folder and make a copy called 00_index.sbatch:

cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 00_index.sbatch

Open the new copy with your favourite editor (for example: emacs 00_index.sbatch) and paste the following at the bottom of the file after the existing #SBATCH ... lines.

data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE
cd ${data_dir}

# Running the indexer
# Note: the output is just a *prefix*; no file extension needed.
gem-indexer -t 11 \
  -i refGenome/GCF_000002315.6_GRCg6a_genomic.fna \
  -o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic

If you were to run this, you would submit the job with:

sbatch 00_index.sbatch

⚠️ NO NEED TO RUN THIS. WE WILL GENERATE A SYMBOLIC LINK INSTEAD.

We make a symlink to the reference genome in our folder so that we do not have to copy it:

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem \
  /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna \
  /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna

Pre-process Hi-C FASTQ data: minimum QC

While the indexing step would run, start looking at the data and pre-process it. Hi-C FASTQs are paired-end reads. We will “clean” the reads from adapters, low-quality bases, and short reads using fastp.

Create 01_fastp.sbatch from the template:

cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 01_fastp.sbatch

Open 01_fastp.sbatch and paste the following at the bottom (after the #SBATCH lines):

cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"

# Launch fastp: raw reads are read from course folder, clean fastqs are stored in your folder.
# Enable adapter detection; trim the first 5 bases (often lower quality).
# Use 4 threads and minimum read length 30 (remove reads shorter than this after trimming).
fastp \
  -i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
  -I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
  -o clean/${sample}_R1.clean.fastq.gz \
  -O clean/${sample}_R2.clean.fastq.gz \
  --detect_adapter_for_pe \
  --trim_front1 5 \
  -w 4 \
  -l 30 \
  -h ${sample}.html

Submit the script to the queue:

sbatch 01_fastp.sbatch

When it finishes, copy the HTML report to your local computer and open it in a browser (replace SERVER and USER as needed):

scp USER@SERVER:/home/people/USER/3D_GENOMICS_COURSE/fastq/liver.html .

⏰ It should take ~1 min to complete.

Question: Check the HTML report. What percentage of reads are kept?

Answer: It should be about ~96.4%. No massive adapter content or low quality sequences. After mapping we will inspect ligation/digestion patterns in more detail.


Mapping to the reference genome

TADbit maps each read separately, so we run tadbit map twice (once per read). It requires the restriction enzyme(s) used in the experiment. These samples were treated with two enzymes.

Create 02_map.sbatch from the template and add the commands below (paste at the bottom after #SBATCH lines).

cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 02_map.sbatch
cd /home/people/$USER/3D_GENOMICS_COURSE/

sample="liver"
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem"
wd="tadbit_dirs/${sample}"
mkdir -p ${wd}

# Two enzymes used in this experiment (double digestion)
enz="MboI HinfI"

# Map read 1
rd=1
tadbit map \
  --fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
  --workdir ${wd} \
  --index ${ref} \
  --read ${rd} \
  --renz ${enz} \
  -C 6

# Map read 2
rd=2
tadbit map \
  --fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
  --workdir ${wd} \
  --index ${ref} \
  --read ${rd} \
  --renz ${enz} \
  -C 6

Submit the job:

sbatch 02_map.sbatch

If you want to inspect the job output, check the log files in SCRIPTS/log (or the directory where you submitted the job, depending on the template). TADbit also produces PNG plots inside tadbit_dirs/liver/.

After mapping, inspect the plots TADbit generates. Discuss the number of digested sites, dangling ends, and ligation efficiency.

Question: How may restriction enzyme choice influence the experiment? ✂️

Answer: Cutting frequency differs between 4-cutters and 6-cutters and influences fragment size distribution, ligation probabilities, and contact resolution. Using two enzymes increases the diversity of ligation junctions. Compare with Micro-C, which uses MNase digestion, so it cuts evenly through the genome.

Useful background figures (open locally):


Finding the intersection of mapped reads (parse)

Each mate of a Hi-C pair originates from the same digested/ligated fragment (unless it is a dangling end). We identify pairs and build fragment associations with tadbit parse.

Note on runtime: this step can be quick or can take longer depending on load and filesystem performance. Try to get it launched, then continue reading / looking at the provided plots while it runs (this is a good time for a break).

⚠️ Note: The chromosome prefixes to filter have to be defined in the reference genome FASTA file beforehand. It will only match chromosomes that start with the string in --filter_chrom.

Create 03_parse.sbatch from the template and (important!) update the template to request 10 CPUs (e.g. change #SBATCH -c 4 to #SBATCH -c 10), then paste the commands below at the bottom.

cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 03_parse.sbatch
cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
ref="/home/people/$USER/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/${sample}"

tadbit parse \
  --workdir ${wd} \
  --genome ${ref} \
  --filter_chrom "chr.*" \
  --compress_input

Submit the job:

sbatch 03_parse.sbatch

Question: Is it possible to retrieve multiple contacting regions?

Answer: Consider complex ligation products (read pairs mapping to different fragments in the same molecule, i.e., multiple contacts) and multi-mapping artifacts; TADbit focuses on valid pairs as operationally defined by the filters. Multi-contact methods (e.g., Pore-C, SPRITE) address this explicitly, but standard Hi-C largely models binary contacts per ligation event. We can view it on the bam file in the next step.


Filtering interactions

TADbit allows flexible filtering of non-wanted interactions. In many cases, the defaults work well across datasets. To see what the filter numbers mean, check:

tadbit filter --help

Run filtering (you can run in an sbatch script or interactively):

cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/${sample}"

tadbit filter \
  --workdir ${wd} \
  --apply 1 2 3 4 6 7 8 9 10 \
  --cpus 6

Check the amount of filtered data and past commands

tadbit describe summarizes what has been done so far in the workdir, and reports counts, numbers, and parameters after each step.

cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
cd tadbit_dirs/$sample

tadbit describe . | less

Question: How many valid pairs do we keep?

Answer: Check the “valid pairs” section of tadbit describe after filtering to get the exact count and percentage.

Question: The total number of filtered reads is not equal to the initial number of reads… Why?

Answer: Because a read pair can be assigned to more than one category (e.g., a dangling end that is also a duplicate). Categories are not mutually exclusive, so percentages can overlap.


To normalize or to not normalize

In the filter step we have catalogued all the reads into categories — so it actually didn’t filter anything yet. It is during normalization that we specify which categories to include/exclude so the normalization is performed accordingly.

Normalization in TADbit extracts a bias vector (one value per bin) which adjusts interaction intensities depending on coverage and technical biases.

Important: During normalization is where bad columns (low counts, low mappability, etc.) are removed from the matrix.

Several normalization strategies exist (see: tadbit normalize --help). A simple and commonly used option is to filter based on a minimum number of counts per bin. If you want to exclude specific genomic regions, use the --badcols parameter.

Create an sbatch script (e.g. 04_norm_vanilla.sbatch) if you like, then paste the commands below at the bottom.

cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 04_norm_vanilla.sbatch
cd /home/people/$USER/3D_GENOMICS_COURSE/

sample="liver"
wd="tadbit_dirs/${sample}"

# Define the resolution
res="100000"  # 100 kb

# Normalization method
norm="Vanilla"

# Minimum number of counts required per bin
min_count=100

tadbit normalize \
  -w ${wd} \
  -r ${res} \
  --cpus 6 \
  --filter 1 2 3 4 6 7 9 10 \
  --normalization ${norm} \
  --badcols chrW:1-7000000 chrZ:1-83000000 \
  --min_count ${min_count}

Submit it:

sbatch 04_norm_vanilla.sbatch

Use tadbit describe to check how many bins were removed. A good rule of thumb: remove ~3–4% of bins. If much more is removed, something may be wrong.

Optional reading on normalization strategies:

Overview article on Hi-C normalization strategies


Binning and viewing matrices

Once normalization is done, we can visualize Hi-C matrices. Using -c restricts the plot to a specific chromosome or region. This step can take a while, so using an sbatch script is recommended.

cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
cp template.sbatch 05_bin_chr1.sbatch
cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/${sample}"
res="100000"
chrom="chr1"
norm="Vanilla"

tadbit bin \
  -w ${wd} \
  -r ${res} \
  -c ${chrom} \
  --plot \
  --norm ${norm} \
  --format "png" \
  --cpus 6

Submit it:

sbatch 05_bin_chr1.sbatch

Example matrices (open locally):