We will familiarize ourselves with TADbit (Serra et al., 2017): from FASTQ files to contact matrix and beyond.
cd; # Home folder
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .;
./setup_TADbit.sh
# The only output should be the program's help message; this means the environment is up and running.
Make yourself familiar with the directory structure. Inside /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE we have three folders:
fastq # …
SCRIPTS # Scripts to run TADbit
refGenome # Reference genome: raw FASTA and the index used to map our FASTQ files
Before analyzing Hi‑C data with TADbit, index the reference genome that the GEM mapper will use. This is standard for most mappers (e.g., bwa, bowtie2). We can call gem-indexer from within the TADbit environment. Remember to activate the tadbit conda environment.
# Move to your home.
cd;
# Activate TADbit environment
conda activate /home/people/${USER}/envs/tadbit_course # $USER is your user; it's an environment variable so no need to change it.
# Make a WORKING folder for the course
mkdir -p 3D_GENOMICS_COURSE;
cd 3D_GENOMICS_COURSE;
# Make SCRIPT folders (to store your own scripts)
mkdir -p SCRIPTS;
# also a log folder for the scripts
mkdir -p SCRIPTS/log
# Make RESULTS folder
mkdir -p tadbit_dirs;
# Make REFERENCE GENOME folder.
mkdir -p refGenome;
# To store logs from fastp
mkdir -p fastp_reports
# For the fastq
mkdir -p fastq
# Filtered fastq
mkdir -p fastq/clean
# 🧵 PUTTING THINGS INTO A SBATCH SCRIPT:
# I have prepared a template for sbatch job submission. Copy it to your SCRIPTS folder.
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/
# Move to your SCRIPTS folder.
cd /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/;
# Make a copy of the template under a new name for this step.
cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch
# Open the copy with your favorite editor (for example: `emacs 00_index.sbatch`),
# paste the following into it, and save:
data_dir=/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE;
cd ${data_dir};
# Running the indexer
# Note: the output is just a *prefix*; no file extension needed.
gem-indexer -t 11 -i refGenome/GCF_000002315.6_GRCg6a_genomic.fna -o /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic
sbatch 00_index.sbatch;
⚠️ NO NEED TO RUN THIS. WE WILL CREATE A SYMBOLIC LINK INSTEAD:
# We can make a symbolic link to the reference genome in our own folder, so that not all of us have to copy it.
ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem
ln -s /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna
⏰ It should take ~5 - 10 min to complete
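A quick way to confirm that the two symbolic links created above actually resolve is sketched below. `check_link` is a hypothetical helper written for this illustration; it is not part of TADbit or the course scripts.

```shell
# Hypothetical helper (not part of TADbit): report whether a path is a
# symbolic link whose target exists.
check_link() {
    local path="$1"
    if [ -L "$path" ] && [ -e "$path" ]; then
        echo "OK: $path -> $(readlink "$path")"
    else
        echo "BROKEN or missing: $path"
        return 1
    fi
}

# Example usage with the links created above:
# check_link /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.gem
# check_link /home/people/${USER}/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna
```

A broken link usually means the target path was mistyped or the file has not been generated yet.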
## I have prepared a script ready to run in case this is too much. (THIS WILL BE AVAILABLE FOR EACH STEP)
# Just copy to your SCRIPTS folder
cd ~/3D_GENOMICS_COURSE/SCRIPTS;
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/00_index.sbatch .;
sbatch 00_index.sbatch
While genome indexing runs, start looking at the data and pre‑process it. Hi‑C FASTQs are paired‑end reads. We will “clean” the reads by removing adapters, low‑quality bases, and short reads using fastp.
# This should be done for each of the steps if you want:
cp /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/template.sbatch /home/people/${USER}/3D_GENOMICS_COURSE/SCRIPTS/01_fastp.sbatch;
# And put the following there:
cd /home/people/$USER/3D_GENOMICS_COURSE/fastq
sample="liver"
FASTQ_DIR="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq"
# Flags used below (comments cannot sit between the continuation lines of a command):
#   -i / -I          raw R1 / R2 FASTQ files, read from the course folder
#   -o / -O          cleaned R1 / R2 FASTQ files, stored in your own folder
#   --trim_front1 5  trim the first 5 bases, which are usually of lower quality
#   -w 10            number of threads
#   -l 30            minimum read length; reads shorter than this after trimming are removed
#   -h               name of the HTML report
fastp \
-i ${FASTQ_DIR}/${sample}_R1.fastq.gz \
-o clean/${sample}_R1.clean.fastq.gz \
-I ${FASTQ_DIR}/${sample}_R2.fastq.gz \
-O clean/${sample}_R2.clean.fastq.gz \
--detect_adapter_for_pe \
--trim_front1 5 \
-w 10 \
-l 30 \
-h ${sample}.html
# Copy the HTML report to your local computer and open it in a web browser.
# Run this from your local machine, with USER set to your own course username.
USER="juanrod"
scp ${USER}@pupil1.healthtech.dtu.dk:/home/people/${USER}/3D_GENOMICS_COURSE/fastq/liver.html .
⏰ It should take ~1 min to complete with 6 cpus.
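Besides the HTML report, it can be useful to count how many reads survive the cleaning. The following is a minimal sketch, assuming standard FASTQ files with four lines per record; `count_reads` is a hypothetical helper, not a fastp or TADbit command.

```shell
# Hypothetical helper: count the reads in a gzipped FASTQ file,
# assuming the standard four lines per record.
count_reads() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Example usage, run from your fastq folder:
# count_reads /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/fastq/liver_R1.fastq.gz
# count_reads clean/liver_R1.clean.fastq.gz
```

Comparing the two numbers gives the fraction of reads kept, which should agree with the summary in the fastp report.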
TADbit maps each read separately, so we run tadbit map
twice, once for each read. It requires the restriction enzyme(s) used in
the experiment. These samples were treated with two enzymes.
# And put the following there:
cd /home/people/$USER/3D_GENOMICS_COURSE/
# Variables used for mapping
sample="liver"
ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem" # relative to the course folder we just moved into
wd="tadbit_dirs/"${sample}
mkdir -p ${wd}
# Two enzymes used in this experiment
enz="MboI HinfI" # This double-digestion is particularly relevant for Arima/Phase genomics experiments
# Map read 1 or 2
rd=1;
tadbit map \
--fastq fastq/clean/${sample}_R${rd}.clean.fastq.gz \
--workdir ${wd} \
--index ${ref} \
--read ${rd} \
--tmpdb ${TMPDIR} \
--renz ${enz} \
-C 6
# Map read 2
rd=2
# >>> Just change the script to take that as a parameter.
⏰ It should take ~5 min to complete with 6 cpus.
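One possible way to take the read number as a parameter is sketched below. `map_cmd` is a hypothetical helper that only prints the command (a dry run) and uses only the flags shown above. Two assumptions are made here: the index path is relative to your course folder, and `/tmp` is used as a fallback when `TMPDIR` is not set.

```shell
# Sketch: build the `tadbit map` command for a given read number.
# map_cmd is a hypothetical helper; it prints the command instead of running it.
map_cmd() {
    local rd="$1"
    local sample="liver"
    local ref="refGenome/GCF_000002315.6_GRCg6a_genomic.gem"  # assumed relative path
    local wd="tadbit_dirs/${sample}"
    echo tadbit map \
        --fastq "fastq/clean/${sample}_R${rd}.clean.fastq.gz" \
        --workdir "${wd}" \
        --index "${ref}" \
        --read "${rd}" \
        --tmpdb "${TMPDIR:-/tmp}" \
        --renz MboI HinfI \
        -C 6
}

# Dry run: print the command for each read.
for rd in 1 2; do
    map_cmd "${rd}"
done
```

Once the printed commands look right, removing the `echo` inside `map_cmd` would run the mapping for both reads in turn.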
Note: We are not using iterative mapping. Fragment‑based mapping is the default in TADbit.
After mapping, inspect the plots TADbit generates. Let’s discuss the number of digested sites, dangling ends, and ligation efficiency.
Each mate of a Hi‑C pair originates from the same digested/ligated fragment (unless it is a dangling end). We identify the pairs and build the fragment associations with tadbit parse.
⚠️ ⚠️ ⚠️ Note that the chromosome prefixes to filter must already be present in the sequence names of the reference genome FASTA file. Only chromosomes whose names start with the pattern given to --filter_chrom are matched.
cd /home/people/$USER/3D_GENOMICS_COURSE/;
sample="liver" # sample name
ref="/home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna"
wd="tadbit_dirs/"${sample} # workdir (auto‑created by TADbit)
# Keep only canonical chromosomes and compress map files after parsing
tadbit parse \
--workdir ${wd} \
--genome ${ref} \
--filter_chrom "chr.*" \
--compress_input;
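Before parsing, it may help to check which sequence names in the reference FASTA would match the pattern. This is a minimal sketch with a hypothetical helper, `matching_chroms`; it mimics the "name starts with the pattern" behaviour described above using `grep -E`, and is not how TADbit itself is implemented.

```shell
# Hypothetical helper: list the sequence names in a FASTA file that start
# with the given regular expression.
matching_chroms() {
    local fasta="$1" regex="$2"
    # Keep header lines, drop the leading ">", keep only the first word (the name),
    # then keep names that match the pattern from their first character.
    grep '^>' "$fasta" | sed 's/^>//' | awk '{ print $1 }' | grep -E "^(${regex})" || true
}

# Example usage:
# matching_chroms /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/refGenome/GCF_000002315.6_GRCg6a_genomic.fna 'chr.*'
```

An empty result would suggest that the FASTA headers do not use the expected prefix.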
⏰ It should take ~35 min to complete with 10 cpus.
TADbit allows flexible filtering of unwanted interactions. In my experience, the defaults usually work well across datasets.
| Filter name | Description | Notes |
|---|---|---|
| 1. Self-circle | Both ends map to the same RE fragment in opposite orientation | Likely same-fragment ligation |
| 2. Dangling-end | Both ends map to the same RE fragment in facing orientation | Failed ligation |
| 3. Error | Same fragment, same orientation | Mapping or digestion error |
| 4. Extra dangling-end | Adjacent RE fragments, facing orientation, close to cutting site | < max_molecule_length |
| 5. Too close to RE site (semi-dangling) | Read start < 5 bp (default) from RE cut-site | Remove ambiguous sites |
| 6. Too short | Maps to fragments < 75 bp | Could belong to neighboring fragments |
| 7. Too large | Maps to fragments > 100 kb (default) | Often repetitive or misassembled |
| 8. Over-represented fragments | Top 0.5% most frequent RE fragments | PCR bias or genome issues (structural variants) |
| 9. PCR artefact / duplicate | Same start, length, and strand for both ends | Remove duplicates |
| 10. Random breaks | Too far from any RE site (> minimum_distance_to_RE) | Random shearing or non-canonical cutting |
tadbit describe summarizes what has been done so far in
the workdir, and reports counts, numbers and parameters after each
step.
# Change to workdir
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample
# Summarize the run
tadbit describe . | less
Run tadbit describe after filtering to get the exact counts and percentages relative to the initial read pairs.
The filter step only sorts the read pairs into categories; it has not removed anything yet. During normalization we specify which categories to include or exclude, and the normalization is performed accordingly.
Normalization in TADbit extracts a bias vector (one value per bin) which adjusts interaction intensities depending on coverage and technical biases.
Important: bad columns (low counts, low mappability, etc.) are removed from the matrix during normalization.
Several normalization strategies exist (see:
tadbit normalize --help). A simple and commonly used option
is to filter based on a minimum number of counts per
bin.
If you want to exclude specific genomic regions, use
the --badcols parameter.
cd /home/people/$USER/3D_GENOMICS_COURSE/;
# Variables used for normalization
sample="liver" # sample name
wd="tadbit_dirs/"${sample} # workdir (auto‑created by TADbit)
# First time we define the resolution
res="100000" # 100 kb
# Choice of normalization (raw, ICE, Vanilla, decay)
norm="Vanilla"
# Minimum number of counts required per bin
min_count=100
tadbit normalize -w ${wd} \
-r ${res} \
--tmpdb ${TMPDIR} \
--cpus 6 \
--filter 1 2 3 4 6 7 9 10 \
--normalization ${norm} \
--badcols chrW:1-7000000 chrZ:1-83000000 \
--min_count ${min_count}
⏰ It should take ~2 min to complete with 6 cpus.
⚠️ Run another version with norm="raw" to compare
later.
Use tadbit describe to check how many bins were removed.
A good rule of thumb: remove ~3–4% of bins. If much more is removed →
something may be wrong.
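The rule of thumb above can be turned into a small check. This is a sketch with hypothetical helpers; you supply the numbers of removed and total bins yourself (for example, read off the tadbit describe output), and the 5% cut-off is an arbitrary assumption chosen to sit just above the ~3–4% guideline.

```shell
# Hypothetical helper: percentage of bins removed, with one decimal.
pct_bins_removed() {
    awk -v removed="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * removed / total }'
}

# Flag runs that remove clearly more than the ~3-4% rule of thumb.
# The 5% cut-off is an assumption for illustration only.
check_bins() {
    local pct
    pct=$(pct_bins_removed "$1" "$2")
    if awk -v p="$pct" 'BEGIN { exit !(p > 5) }'; then
        echo "${pct}% of bins removed: check your data"
    else
        echo "${pct}% of bins removed: within the expected range"
    fi
}

# Example with made-up numbers: 35 bins removed out of 1000.
check_bins 35 1000
```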
Each job is assigned a <job_id>. This helps
retrieve results from specific runs (especially when testing
parameters).
If you want, take a quick look at the different normalization strategies and draw your own conclusions. :)
https://www.tandfonline.com/doi/full/10.2144/btn-2019-0105
Once normalization is done, we can visualize Hi-C matrices. Using -c restricts the plot to a specific chromosome or region.
# Variables used for binning
cd /home/people/$USER/3D_GENOMICS_COURSE/
sample="liver"
wd="tadbit_dirs/"${sample}
res="100000";
chrom="chr1"
tadbit bin \
-w ${wd} \
-r ${res} \
-c ${chrom} \
--plot \
--norm "norm" \
--format "png" \
--cpus 6;
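To plot several chromosomes, the same command can be wrapped in a loop. The sketch below only prints the commands (a dry run) and reuses the flags shown above; `bin_cmd` is a hypothetical helper, and the names chr2 and chr3 are assumptions that follow the chr1 naming used in this example.

```shell
# Sketch: print one `tadbit bin` command per chromosome (dry run).
# bin_cmd is a hypothetical helper; it echoes the command instead of running it.
bin_cmd() {
    local chrom="$1"
    local sample="liver" res="100000"
    echo tadbit bin \
        -w "tadbit_dirs/${sample}" \
        -r "${res}" \
        -c "${chrom}" \
        --plot \
        --norm "norm" \
        --format "png" \
        --cpus 6
}

# chr2 and chr3 are assumed names; adjust them to your reference genome.
for chrom in chr1 chr2 chr3; do
    bin_cmd "${chrom}"
done
```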