We will familiarize ourselves with TADbit (Serra et al., 2017): from FASTQ files to contact matrices and beyond.

1 Outline of the exercises

Preprocess Hi‑C FASTQ data
Index reference genome
Use TADbit to:
- Map reads to reference genome (map)
- Get intersection (parse)
- Filter reads (filter)
- Normalize (normalize)
- Generate matrices (bin)
- Export formats (bin + cooler)

2 Set up conda environment to run tadbit later

cd;  # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
./setup_TADbit.sh 
# The only output you should see is the program's help message; this means the environment is up and running.

Make yourself familiar with the directory structure. Inside /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE we have three folders:

fastq # …
SCRIPTS # SCRIPTS to run tadbit
refGenome # Reference genome raw fasta and indexed to map our fastq files

3 Index reference genome

Before analyzing Hi‑C data through TADbit, index the reference genome that GEM mapper will use. This is standard for most mappers (e.g., bwa, bowtie2). We can call the gem-indexer from within the TADbit environment. Remember to activate the tadbit conda environment.

# Move to your home.
cd; 
# Activate TADbit environment
conda activate /home/people/${USER}/envs/tadbit_course # $USER is your user; it's an environment variable so no need to change it.
# Make a WORKING folder for the course 
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;
# Make SCRIPT folders (to store your own scripts)
mkdir -p SCRIPTS; 
# also a log folder for the scripts
mkdir -p SCRIPTS/log 
# Make RESULTS folder
mkdir -p tadbit_dirs;
# Make REFERENCE GENOME folder.
mkdir -p refGenome;
# To store logs from fastp
mkdir -p fastp_reports
# For the fastq
mkdir -p fastq
# Filtered fastq
mkdir -p fastq/clean 


# 🧵 PUTTING THINGS INTO A SBATCH SCRIPT:
# I have prepared a template for sbatch job submission. Copy it to your SCRIPTS folder.
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/

# Move to the SCRIPTS folder inside your home.
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;

# Copy the template under a new name for this step.
cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch 

# Open the copy with your favorite editor (for example: `emacs 00_index.sbatch`)
# and paste the following into it:
data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE;
cd ${data_dir};

# Running the indexer
# Note: the output is just a *prefix*; no file extension needed.
gem-indexer -t 11 -i refGenome/GCF_000002315.6_GRCg6a_genomic.fna -o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic

# Then submit the job from your SCRIPTS folder:
sbatch 00_index.sbatch; 

⚠️ NO NEED TO RUN THIS. WE WILL CREATE SYMBOLIC LINKS INSTEAD:

# We can make symbolic links to the shared reference genome and its index in our own folder, so that not everyone has to copy them.
ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem

ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna
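
Before moving on, it can help to confirm that both links resolve to readable files. The following is a minimal sketch, assuming the folder layout used above; `check_ref_files` is a hypothetical helper name and not part of TADbit or GEM.

```shell
# check_ref_files DIR PREFIX
# Report whether PREFIX.fna and PREFIX.gem exist in DIR and are readable.
# Returns 0 only if both are present (a broken symlink counts as missing).
check_ref_files() {
    status=0
    for ext in fna gem; do
        f="$1/$2.${ext}"
        if [ -r "$f" ]; then
            echo "OK      $f"
        else
            echo "MISSING $f"
            status=1
        fi
    done
    return "$status"
}

# Usage: prints one line per expected file.
check_ref_files "${HOME}/3D_GENOMICS_COURSE/refGenome" GCF_000002315.6_GRCg6a_genomic \
    || echo "Some reference files are missing; re-check the ln -s commands."
```

If a file is reported as missing, the usual cause is a typo in the link target, since `ln -s` happily creates links that point nowhere.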


⏰ If you do run the indexing job, it should take ~5-10 min to complete.

## I have prepared a ready-to-run script in case this is too much. (THIS WILL BE AVAILABLE FOR EACH STEP)
# Just copy it to your SCRIPTS folder and submit it
cd ~/3D_GENOMICS_COURSE/SCRIPTS;
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch .;
sbatch 00_index.sbatch

4 Pre‑process Hi‑C FASTQ data: minimum QC

While genome indexing runs, start looking at the data and pre‑process it. Hi‑C FASTQs are paired‑end reads. We will “clean” the reads from adapters, low‑quality bases, and short reads using fastp.

# You can do this for each of the steps if you want:
cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/01_fastp.sbatch; 

# And put the following there:

cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"

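# Options used below:
#   -i / -I          raw R1 / R2 FASTQ files, read from the course folder
#   -o / -O          clean R1 / R2 FASTQ files, stored in your own folder
#   --trim_front1 5  trim the first 5 bases, which are usually of lower quality
#   -w 10            threads
#   -l 30            minimal read length; if a read is shorter than this after trimming, remove it
#   -h               HTML report
# Note: keep comments out of the backslash-continued command itself,
# otherwise the shell cuts the command short at the first comment.
fastp \
    -i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
    -o clean/${sample}_R1.clean.fastq.gz \
    -I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
    -O clean/${sample}_R2.clean.fastq.gz \
    --detect_adapter_for_pe \
    --trim_front1 5 \
    -w 10 \
    -l 30 \
    -h ${sample}.html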


# Copy the HTML report to your local computer and open it in a web browser.
# Run this on your local machine and replace "juanrod" with your own username.
USER="juanrod"
scp ${USER}@pupil1.healthtech.dtu.dk:/home/people/${USER}/3D_GENOMICS_COURSE/fastq/liver.html .

⏰ It should take ~1 min to complete with 6 cpus.

Question: Check the HTML report. What percentage of reads are kept?

It should be about 96.4%, with no massive adapter content or low-quality sequences. After mapping we will inspect ligation/digestion patterns in more detail.
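
The percentage of reads kept is simply the read count after filtering divided by the read count before filtering, both of which are shown in the report summary. A minimal sketch of that arithmetic follows; `pct_kept` is a hypothetical helper, and the counts are made-up numbers chosen for illustration, not values from the course data.

```shell
# pct_kept BEFORE AFTER
# Percentage of reads kept, given total read counts before and after filtering.
pct_kept() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", 100 * after / before }'
}

# Hypothetical counts, for illustration only:
pct_kept 1000000 964000
```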

5 Mapping to the reference genome

TADbit maps each read separately, so we run tadbit map twice, once for each read. It requires the restriction enzyme(s) used in the experiment. These samples were treated with two enzymes.

# And put the following there:

cd /home/people/$USER/3D_GENOMICS_COURSE/

# Variables used for mapping
sample="liver"                                        
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem"    # relative to your 3D_GENOMICS_COURSE folder (no leading slash)
wd="tadbit_dirs/"${sample}
mkdir -p ${wd}
# Two enzymes used in this experiment
enz="MboI HinfI" # This double-digestion is particularly relevant for Arima/Phase genomics experiments
# Map read 1 or 2
rd=1;

tadbit map \
  --fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
  --workdir ${wd} \
  --index ${ref} \
  --read ${rd} \
  --tmpdb ${TMPDIR} \
  --renz ${enz} \
  -C 6

# Map read 2
rd=2
# >>> Just change the script to take that as a parameter.
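
One possible way to take the read number as a parameter is to wrap the call in a small function and loop over both reads. This is only a sketch: `map_read` and the `DRY_RUN` switch are hypothetical names introduced here, the `/tmp` fallback for `TMPDIR` is an assumption, and the flags are exactly those used in the command above. By default it only prints the commands, so you can check them before running anything.

```shell
# map_read SAMPLE READ
# Assemble the `tadbit map` call for one read end (1 or 2).
# With DRY_RUN=1 (the default here) the command is only printed;
# set DRY_RUN=0 to actually run it.
map_read() {
    sample="$1"
    rd="$2"
    set -- tadbit map \
        --fastq "fastq/clean/${sample}_R${rd}.clean.fastq.gz" \
        --workdir "tadbit_dirs/${sample}" \
        --index "refGenome/GCF_000002315.6_GRCg6a_genomic.gem" \
        --read "${rd}" \
        --tmpdb "${TMPDIR:-/tmp}" \
        --renz MboI HinfI \
        -C 6
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run from your 3D_GENOMICS_COURSE folder; maps read 1, then read 2.
for rd in 1 2; do
    map_read liver "${rd}"
done
```

Inside an sbatch script you would set `DRY_RUN=0` before the loop once the printed commands look right.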

⏰ It should take ~5 min to complete with 6 cpus.

Note: We are not using iterative mapping. Fragment‑based mapping is the default in TADbit.

After mapping, inspect the plots TADbit generates. Let’s discuss the number of digested sites, dangling ends, and ligation efficiency.

Question: How may restriction enzyme choice influence the experiment? ✂️

Cutting frequency differs between 4‑cutters and 6‑cutters, and it influences the fragment size distribution, ligation probabilities, and achievable contact resolution. Using two enzymes increases the density of cut sites and the diversity of ligation junctions. Compare with Micro‑C, which uses MNase digestion: MNase is not sequence‑specific, so it fragments the genome more evenly.
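
As a back-of-the-envelope illustration, an enzyme that recognizes n fully specified bases is expected to cut once every 4^n bp if all four bases are equally likely (a simplifying assumption; real genomes deviate from it). MboI (GATC) and HinfI (GANTC, where N is any base) both have four specified bases. The helper name `expected_spacing` is hypothetical.

```shell
# expected_spacing N
# Expected distance (bp) between cut sites for an enzyme that recognizes
# N fully specified bases, assuming all four bases are equally likely.
expected_spacing() {
    awk -v n="$1" 'BEGIN { print 4 ^ n }'
}

echo "4 specified bases (e.g. MboI, GATC): one site every $(expected_spacing 4) bp on average"
echo "6 specified bases (a typical 6-cutter): one site every $(expected_spacing 6) bp on average"
```

Under this assumption a 4-cutter produces fragments roughly 16 times shorter than a 6-cutter, which is why 4-cutters are preferred when finer resolution is the goal.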

Fragment size histogram

HiC Sequencing Quality and digestion - ligation deconvolution

6 Finding the intersection of mapped reads (`parse`)

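The two mates of a Hi‑C pair come from the same ligated molecule, which normally joins two different restriction fragments (unless it is a dangling end, in which case both mates fall on the same unligated fragment). With tadbit parse we pair up the two mapped reads and build the fragment associations.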

⚠️ ⚠️ ⚠️ Note that the chromosome names you want to keep must already be present in the headers of the reference genome FASTA file. Only chromosomes whose names start with the pattern given to --filter_chrom will be matched.

cd /home/people/$USER/3D_GENOMICS_COURSE/;
sample="liver"                                        # sample name
ref="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/"${sample}                            # workdir (auto‑created by TADbit)

# Keep only canonical chromosomes and compress map files after parsing
tadbit parse \
       --workdir ${wd} \
       --genome ${ref} \
       --filter_chrom "chr.*" \
       --compress_input;
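
To see in advance which sequences a pattern would keep, you can list the FASTA header names and test the pattern against them. This is a sketch under the assumption stated in the note above, namely that the pattern is matched from the start of the name; `list_matching_chroms` is a hypothetical helper, and the toy headers are invented for illustration.

```shell
# list_matching_chroms FASTA REGEX
# Print the sequence names (first word of each header line) that match
# REGEX, anchored at the start of the name.
list_matching_chroms() {
    grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*$//' | grep -E "^(${2})"
}

# Toy FASTA with hypothetical headers, for illustration only
toy=$(mktemp)
printf '>chr1 some description\nACGT\n>chrW\nACGT\n>NW_000001 unplaced scaffold\nACGT\n' > "${toy}"
list_matching_chroms "${toy}" 'chr.*'
rm -f "${toy}"
```

Running the same function on the course reference `.fna` file shows which names the real genome uses before you commit to a 35-minute parsing job.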

⏰ It should take ~35 min to complete with 10 cpus.

Question: Is it possible to retrieve multiple contacting regions?

Consider complex ligation products (read pairs mapping to different fragments in the same molecule, i.e., multiple contacts) and multi‑mapping artifacts; TADbit focuses on valid pairs as operationally defined by the filters. Multi‑contact methods (e.g., Pore‑C, SPRITE) address this explicitly, but standard Hi‑C largely models binary contacts per ligation event. We can look at this in the BAM file in the next step.

7 Filtering interactions

TADbit allows flexible filtering of unwanted interactions. In my own experience, the defaults usually work well across datasets.

Filter name	Description	Notes
1. Self-circle	Both ends map to the same RE fragment, on opposite strands and pointing outward	Likely same-fragment ligation
2. Dangling-end	Both ends map to the same RE fragment, on opposite strands and facing each other	Failed ligation
3. Error	Same fragment, same orientation	Mapping or digestion error
4. Extra dangling-end	Adjacent RE fragments, facing orientation, close to cutting site	< `max_molecule_length`
5. Too close to RE site (semi-dangling)	Read start < 5 bp (default) from RE cut-site	Remove ambiguous sites
6. Too short	Maps to fragments < 75 bp	Could belong to neighboring fragments
7. Too large	Maps to fragments > 100 kb (default)	Often repetitive or misassembled
8. Over-represented fragments	Top 0.5% most frequent RE fragments	PCR bias or genome issues (structural variants)
9. PCR artefact / duplicate	Same start, length, and strand for both ends	Remove duplicates
10. Random breaks	Too far from any RE site (> `minimum_distance_to_RE`)	Random shearing or non-canonical cutting
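
The first three categories in the table differ only in the relative orientation of the two ends on a single restriction fragment. The sketch below encodes that logic in a deliberately simplified form; it is an illustration of the table, not TADbit's implementation, and `classify_same_fragment` is a hypothetical helper.

```shell
# classify_same_fragment POS1 STRAND1 POS2 STRAND2
# Classify a read pair whose two ends map to the SAME restriction fragment.
# STRAND is "+" (read points right) or "-" (read points left).
classify_same_fragment() {
    # Both ends on the same strand: category 3
    if [ "$2" = "$4" ]; then
        echo "error"
        return 0
    fi
    # Take the strand of the leftmost read
    if [ "$1" -le "$3" ]; then
        left_strand="$2"
    else
        left_strand="$4"
    fi
    # Leftmost read pointing right means the reads face each other
    if [ "${left_strand}" = "+" ]; then
        echo "dangling-end"
    else
        echo "self-circle"
    fi
}

classify_same_fragment 100 + 300 -   # reads face each other
classify_same_fragment 100 - 300 +   # reads point outward
classify_same_fragment 100 + 300 +   # same strand
```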


tadbit filter \
  --workdir ${wd} \
  --apply 1 2 3 4 6 7 8 9 10 \
  --cpus 6 \
  --tmpdb ${TMPDIR}

8 Check the amount of filtered data and past commands

tadbit describe summarizes what has been done so far in the workdir, and reports counts, numbers and parameters after each step.

# Change to workdir
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample

# Summarize the run
tadbit describe . | less

Question: How many valid pairs do we keep?

Check the “valid pairs” section of tadbit describe after filtering to get the exact count and the percentage relative to the initial read pairs.

Question: The total number of filtered reads is not equal to the initial number of reads… Why?

Because a read pair can be assigned to more than one category (e.g., a dangling end that is also a duplicate). Categories are not mutually exclusive, so percentages can overlap.
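
A toy example makes the overlap concrete. The pair names and categories below are invented for illustration: four category assignments are spread over only three distinct read pairs, so summing the per-category counts overstates the number of flagged pairs.

```shell
# One line per (read pair, category) assignment; hypothetical data.
flags='pair1 dangling-end
pair1 duplicated
pair2 self-circle
pair3 duplicated'

# Sum of all per-category counts (each assignment counted once)
sum_over_categories=$(printf '%s\n' "${flags}" | wc -l | tr -d ' ')
# Number of distinct read pairs carrying at least one flag
distinct_pairs=$(printf '%s\n' "${flags}" | cut -d' ' -f1 | sort -u | wc -l | tr -d ' ')

echo "Sum over categories:    ${sum_over_categories}"
echo "Distinct flagged pairs: ${distinct_pairs}"
```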

9 To normalize or to not normalize

The filter step only catalogues the read pairs into categories; it does not actually remove anything yet. It is during normalization that we specify which categories to include/exclude, so the normalization is performed accordingly.

Normalization in TADbit extracts a bias vector (one value per bin) which adjusts interaction intensities depending on coverage and technical biases.
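
The idea of a per-bin bias vector can be illustrated with a simplified coverage-based correction: each bin's bias is its row sum divided by the mean row sum, and each cell is divided by the product of the biases of its two bins. This is a sketch of the general principle only, not TADbit's implementation; `normalize_by_coverage` is a hypothetical helper and the 2x2 matrix is a toy input.

```shell
# normalize_by_coverage
# Read a square matrix of raw counts from stdin (whitespace-separated)
# and print the matrix corrected by a per-bin coverage bias.
normalize_by_coverage() {
    awk '
    {
        n = NR
        for (j = 1; j <= NF; j++) {
            raw[NR, j] = $j
            rowsum[NR] += $j
        }
    }
    END {
        for (i = 1; i <= n; i++) total += rowsum[i]
        if (n == 0 || total == 0) exit
        mean = total / n
        # Bias vector: one value per bin
        for (i = 1; i <= n; i++) bias[i] = rowsum[i] / mean
        for (i = 1; i <= n; i++) {
            line = ""
            for (j = 1; j <= n; j++) {
                denom = bias[i] * bias[j]
                # Empty bins have bias 0 and cannot be corrected
                value = (denom > 0) ? raw[i, j] / denom : 0
                line = line (j > 1 ? " " : "") sprintf("%.3f", value)
            }
            print line
        }
    }'
}

# Toy matrix: bin 2 has twice the coverage of bin 1
printf '2 2\n2 6\n' | normalize_by_coverage
```

The guard against a zero bias hints at why bins with very low counts are dropped before normalization: their correction factor is undefined or unstable.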

Important: Bad columns (low counts, low mappability, etc.) are removed from the matrix during normalization.

Several normalization strategies exist (see: tadbit normalize --help). A simple and commonly used option is to filter based on a minimum number of counts per bin.

If you want to exclude specific genomic regions, use the --badcols parameter.


cd /home/people/$USER/3D_GENOMICS_COURSE/;

# Variables used for normalization
sample="liver"                                        # sample name
wd="tadbit_dirs/"${sample}                            # workdir (auto‑created by TADbit)

# This is the first time we define the resolution
res="100000"          # 100 kb

# Choice of normalization (e.g., Vanilla or ICE; run `tadbit normalize --help` for the options available in your version)
norm="Vanilla"

# Minimum number of counts required per bin
min_count=100

tadbit normalize -w ${wd} \
       -r ${res} \
       --tmpdb ${TMPDIR} \
       --cpus 6 \
       --filter 1 2 3 4 6 7 9 10 \
       --normalization ${norm} \
       --badcols chrW:1-7000000 chrZ:1-83000000 \
       --min_count ${min_count}

⏰ It should take ~2 min to complete with 6 cpus.

⚠️ Run another version with norm="raw" to compare later.

Use tadbit describe to check how many bins were removed. A good rule of thumb: remove ~3–4% of bins. If much more is removed → something may be wrong.
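
To put the rule of thumb into numbers: the number of bins is the sequence length divided by the resolution (rounded up), and the percentage removed follows directly. The helper names and the example figures (a 200 Mb sequence, 70 removed bins) are hypothetical and chosen only to illustrate the arithmetic.

```shell
# n_bins LENGTH RESOLUTION
# Number of bins needed to cover LENGTH bp at RESOLUTION bp per bin (rounded up).
n_bins() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# pct_removed REMOVED TOTAL
# Percentage of bins removed.
pct_removed() {
    awk -v r="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * r / t }'
}

# Hypothetical example: 200 Mb at 100 kb resolution, 70 bins removed
total=$(n_bins 200000000 100000)
echo "Total bins: ${total}"
echo "Removed:    $(pct_removed 70 "${total}")%"
```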

Each job is assigned a <job_id>. This helps retrieve results from specific runs (especially when testing parameters).

If you want, take a quick look at the different normalization strategies and draw your own conclusions. :)

https://www.tandfonline.com/doi/full/10.2144/btn-2019-0105

10 Binning and viewing matrices

Once normalization is done, we can visualize Hi-C matrices. Using -c restricts the plot to a specific chromosome or region.

# Variables used for binning

cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/"${sample}
res="100000";
chrom="chr1"

tadbit bin \
       -w ${wd} \
       -r ${res} \
       -c ${chrom} \
       --plot \
       --norm "norm" \
       --format "png" \
       --cpus 6;
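
To compare raw and normalized matrices, the same call can be repeated with a different `--norm` value. The sketch below assumes that `raw` is accepted alongside `norm` (check `tadbit bin --help` in your installation); `bin_matrix` and `DRY_RUN` are hypothetical names, and by default the commands are only printed.

```shell
# bin_matrix NORM
# Assemble the `tadbit bin` call shown above for one matrix type.
# With DRY_RUN=1 (the default here) the command is only printed;
# set DRY_RUN=0 to actually run it.
bin_matrix() {
    norm="$1"
    set -- tadbit bin \
        -w "tadbit_dirs/liver" \
        -r 100000 \
        -c chr1 \
        --plot \
        --norm "${norm}" \
        --format png \
        --cpus 6
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# "raw" is an assumed value; "norm" is the one used in the command above.
for norm in raw norm; do
    bin_matrix "${norm}"
done
```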

Raw chromosome 1 HiC matrix

Normalized HiC matrix

A Primer into 3D Genomics: A Mini‑Workshop

Juan Antonio Rodríguez, Globe Institute, University of Copenhagen

9 January 2026, DTU
