Alignment exercise
Revision as of 10:18, 7 January 2026
Overview
In this exercise you will explore Hi-C data analysis using TADbit, from raw FASTQ files to normalized contact matrices and domain-level interpretation.
The goal is to understand what each step of the pipeline does, which parameters matter, and how choices affect downstream interpretation.
Outline of the exercises
- Preprocess Hi-C FASTQ data
- Index a reference genome
- Map reads to the reference genome
- Parse and filter read pairs
- Normalize Hi-C contact matrices
- Generate and inspect contact matrices
Set up a conda environment to run TADbit
Before starting, set up a conda environment with all required dependencies.
cd   # Home directory
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .
bash ./setup_TADbit.sh
If successful, the command should print the tadbit help message.
This confirms that the environment is correctly installed.
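If the help message does not appear, a quick check of whether the tadbit command is visible in your current shell can narrow down the problem. The following is a minimal sketch; check_tool is a hypothetical helper, not part of the course scripts, and the suggestion to activate the environment assumes the setup script created one that is not yet active.

```shell
# Report whether a command is available on the PATH.
# Prints "<name>: found" or "<name>: missing"; returns non-zero when missing.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# tadbit should be reported as found once the environment is usable.
check_tool tadbit || echo "Activate the conda environment created by setup_TADbit.sh and retry."
```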
Inside /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE you will find:
- fastq – raw Hi-C FASTQ files
- SCRIPTS – scripts to run TADbit
- refGenome – reference genome FASTA and index files
Index reference genome
Before mapping Hi-C reads, the reference genome must be indexed for the GEM mapper used by TADbit.
Use the provided reference genome in the refGenome directory.
This step only needs to be done once per reference.
Mapping Hi-C reads
Hi-C reads are paired-end and must be mapped with special care to preserve pairing information.
Mapping assigns each read to a genomic coordinate in the reference genome. Unmapped and ambiguously mapped reads will be handled in later steps.
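Because pairing information matters, it can be worth confirming before mapping that the two mate files contain the same number of reads. The sketch below assumes standard 4-line FASTQ records. count_reads and check_pairs are hypothetical helpers, and the file names in the usage line are placeholders, not the actual course files.

```shell
# Count reads in an uncompressed or gzip-compressed FASTQ file
# (assumes the standard 4 lines per record).
count_reads() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { print NR / 4 }'
}

# Compare the read counts of the two mate files.
check_pairs() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    if [ "$n1" = "$n2" ]; then
        echo "OK: $n1 read pairs"
    else
        echo "MISMATCH: $n1 vs $n2 reads"
        return 1
    fi
}
```

Usage, with placeholder file names: check_pairs sample_1.fastq.gz sample_2.fastq.gz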
Parsing mapped reads
After mapping, TADbit parses the BAM file to identify valid Hi-C read pairs.
This step assigns read pairs to different categories (e.g. valid pairs, dangling ends, self-circles, duplicates).
Filtering reads
Filtering does not remove reads immediately. Instead, reads are classified into categories.
These categories are later used during normalization to decide which reads contribute to the contact matrix.
To summarize the results of mapping, parsing, and filtering:
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample
tadbit describe . | less
Q1: How many valid pairs are retained after filtering?
Q2: Why does the total number of filtered reads not equal the initial number of read pairs?
Hint: read categories are not mutually exclusive.
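To see why overlapping categories break a simple sum, consider a toy example. The data below are made up for illustration and are not real TADbit output; the category labels are borrowed from the parsing step above.

```shell
# Toy data: one read pair per line, followed by the categories it was
# assigned to (hypothetical values, not real TADbit output).
flags='pair1 dangling-end
pair2 duplicated
pair3 dangling-end duplicated
pair4 self-circle
pair5'

# Sum of per-category counts: a pair in two categories is counted twice.
per_category_total=$(printf '%s\n' "$flags" | awk '{ n += NF - 1 } END { print n }')

# Number of distinct pairs that fall into at least one category.
flagged_pairs=$(printf '%s\n' "$flags" | awk 'NF > 1 { n++ } END { print n }')

echo "sum over categories: $per_category_total"
echo "distinct flagged pairs: $flagged_pairs"
```

Here pair3 belongs to two categories, so the per-category counts add up to 5 although only 4 distinct pairs are flagged.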
To normalize or not to normalize
Up to this point, reads have only been classified. No reads have been excluded yet.
Normalization is the step where you decide which categories to include and how to correct for technical biases.
Normalization in TADbit computes a bias vector (one value per bin), which corrects interaction counts for sequencing depth, mappability, and other systematic effects.
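As an illustration of how a bias vector acts on a matrix, the sketch below divides each raw count by the product of the biases of its two bins. This convention is an assumption made for the example, and the matrix and bias values are invented; check the TADbit documentation for the exact definition it uses.

```shell
# Hypothetical bias vector, one value per bin.
biases='1.0 2.0 0.5'

# Hypothetical raw contact matrix (3 bins, symmetric).
raw='8 4 2
4 16 2
2 2 1'

# normalized[i][j] = raw[i][j] / (bias[i] * bias[j])  (assumed convention)
normalized=$(printf '%s\n%s\n' "$biases" "$raw" | awk '
    NR == 1 { for (i = 1; i <= NF; i++) b[i] = $i; next }   # first line: biases
    {
        row = NR - 1                                         # matrix row index
        line = ""
        for (j = 1; j <= NF; j++)
            line = line (j > 1 ? " " : "") sprintf("%.2f", $j / (b[row] * b[j]))
        print line
    }')

echo "$normalized"
```

In this toy case, bins with a high bias have their counts scaled down and bins with a low bias have them scaled up.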
Important: During normalization, bad columns (bins with low counts or poor mappability) are removed from the matrix.
Several normalization strategies are available; see tadbit normalize --help for details.
A common approach is to require a minimum number of counts per bin and to explicitly exclude problematic genomic regions.
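The idea of a minimum count per bin can be sketched on a toy matrix. The example assumes the threshold applies to the total count of each bin (its row sum); the matrix is invented, and TADbit's own filtering may use additional criteria.

```shell
min_count=5   # minimum total count a bin must reach to be kept

# Hypothetical raw contact matrix (4 bins, symmetric); bin 3 is nearly empty.
raw='10 6 1 3
6 12 0 4
1 0 0 1
3 4 1 9'

# Collect the (1-based) indices of bins whose row sum is below the threshold.
bad_bins=$(printf '%s\n' "$raw" | awk -v min="$min_count" '
    {
        total = 0
        for (j = 1; j <= NF; j++) total += $j
        if (total < min) printf "%s%d", (n++ ? " " : ""), NR
    }
    END { print "" }')

echo "bins below min_count: $bad_bins"
```

Only bin 3 (total count 2) falls below the threshold here, so it would be treated as a bad column.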
cd /home/people/$USER/3D_GENOMICS_COURSE/
# Variables used for normalization
sample="liver" # sample name
wd="tadbit_dirs/${sample}" # working directory
res="100000" # resolution (100 kb)
norm="ICE"
min_count=5
To exclude specific regions (e.g. sex chromosomes or poorly assembled
regions), use the --badcols option.
⏰ The normalization step should take approximately 2 minutes using 6 CPUs.
Task: Run normalization twice: once with norm="ICE"
and once with norm="raw". Compare the results later.
Contact matrices
After normalization, TADbit generates Hi-C contact matrices at the chosen resolution.
These matrices represent interaction frequencies between genomic bins and are the basis for downstream analyses such as TAD detection.
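The relationship between resolution and matrix size can be made concrete with integer arithmetic. In this sketch, bin_index and n_bins are hypothetical helpers, and the 0-based bin numbering for 1-based positions is an assumed convention that may differ from the one TADbit uses internally.

```shell
# 0-based index of the bin containing a 1-based genomic position.
# Usage: bin_index POSITION RESOLUTION
bin_index() {
    echo $(( ($1 - 1) / $2 ))
}

# Number of bins needed to cover a chromosome of the given length.
# Usage: n_bins LENGTH RESOLUTION
n_bins() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

bin_index 250000 100000     # position 250,000 at 100 kb resolution
n_bins 1000001 100000       # a 1,000,001 bp chromosome at 100 kb resolution
```

Halving the bin size doubles the number of bins per chromosome, so the matrix has about four times as many cells while the same reads are spread more thinly across them.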
Q3: How does changing the resolution affect the appearance of the contact matrix?
Congratulations, you finished the TADbit exercise!