Alignment exercise

From 22126
Revision as of 15:53, 6 January 2026 by Mick

Overview

In this exercise you will explore Hi-C data analysis using TADbit, from raw FASTQ files to normalized contact matrices and domain-level interpretation.

The goal is to understand what each step of the pipeline does, which parameters matter, and how choices affect downstream interpretation.


Outline of the exercises

  1. Preprocess Hi-C FASTQ data
  2. Index a reference genome
  3. Map reads to the reference genome
  4. Parse and filter read pairs
  5. Normalize Hi-C contact matrices
  6. Generate and inspect contact matrices

Set up a conda environment to run TADbit

Before starting, set up a conda environment with all required dependencies.

cd   # go to your home directory
# Copy the setup script here and run it
cp /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE/SCRIPTS/setup_TADbit.sh .
./setup_TADbit.sh

If the setup succeeds, the script prints the tadbit help message, which confirms that the environment is installed correctly.

Inside /home/projects/22126_NGS/exercises/3D_GENOMICS_COURSE you will find:

  • fastq – raw Hi-C FASTQ files
  • SCRIPTS – scripts to run TADbit
  • refGenome – reference genome FASTA and index files

Index reference genome

Before mapping Hi-C reads, the reference genome must be indexed for the GEM mapper used by TADbit.

Use the provided reference genome in the refGenome directory. This step only needs to be done once per reference.


Mapping Hi-C reads

Hi-C reads are paired-end and must be mapped with special care to preserve pairing information.

Mapping assigns each read to a genomic coordinate in the reference genome. Unmapped and ambiguously mapped reads will be handled in later steps.


Parsing mapped reads

After mapping, TADbit parses the BAM file to identify valid Hi-C read pairs.

This step assigns read pairs to different categories (e.g. valid pairs, dangling ends, self-circles, duplicates).
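
The categories can be illustrated with a minimal Python sketch. It is not TADbit's implementation: it assumes a single chromosome, a sorted list of restriction cut sites, and read positions given as 5' ends, and it assigns exactly one label per pair for simplicity. The function names and the "error" label are illustrative.

```python
from bisect import bisect_right


def fragment_index(cut_sites, pos):
    """Return the index of the restriction fragment that contains pos."""
    return bisect_right(cut_sites, pos)


def classify_pair(cut_sites, pos1, strand1, pos2, strand2):
    """Classify one read pair (simplified rules, single chromosome)."""
    if fragment_index(cut_sites, pos1) != fragment_index(cut_sites, pos2):
        return "valid"          # reads fall on two different fragments
    if strand1 == strand2:
        return "error"          # same fragment, same strand
    # Same fragment, opposite strands: look at the leftmost read.
    (_, left_strand), _ = sorted([(pos1, strand1), (pos2, strand2)])
    if left_strand == "+":
        return "dangling-end"   # reads face each other: unligated fragment
    return "self-circle"        # reads face outward: fragment ligated to itself


def classify_pairs(cut_sites, pairs):
    """Label every pair; repeated identical pairs are marked as duplicates."""
    seen = set()
    labels = []
    for pair in pairs:
        if pair in seen:
            labels.append("duplicate")
            continue
        seen.add(pair)
        labels.append(classify_pair(cut_sites, *pair))
    return labels
```

For example, with cut sites at 1000, 2000 and 3000, the pair (1100, "+", 2500, "-") spans two fragments and is labelled valid, while (1100, "+", 1400, "-") lies within one fragment with the reads facing each other and is labelled a dangling end.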


Filtering reads

Filtering does not remove reads immediately. Instead, reads are classified into categories.

These categories are later used during normalization to decide which reads contribute to the contact matrix.

To summarize the results of mapping, parsing, and filtering:

sample="liver"   # sample name; set it here if it is not already defined in your shell
cd /home/people/$USER/3D_GENOMICS_COURSE/tadbit_dirs/$sample
tadbit describe . | less

Q1: How many valid pairs are retained after filtering?

Q2: Why does the total number of filtered reads not equal the initial number of read pairs?

Hint: read categories are not mutually exclusive.
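
The hint can be made concrete with a small Python sketch. The pair names and flag names below are invented for illustration and are not read from TADbit output; the point is only that one read pair can carry several flags at once.

```python
from collections import Counter

# Hypothetical flags per read pair (illustrative names, not TADbit output).
pair_flags = {
    "pair1": set(),
    "pair2": {"dangling-end", "too-short"},
    "pair3": {"duplicate"},
    "pair4": {"duplicate", "too-close-to-cut-site"},
    "pair5": set(),
}


def summarize(pair_flags):
    """Count pairs per category, pairs with any flag, and unflagged pairs."""
    per_category = Counter(
        flag for flags in pair_flags.values() for flag in flags
    )
    n_flagged = sum(1 for flags in pair_flags.values() if flags)
    n_valid = len(pair_flags) - n_flagged
    return per_category, n_flagged, n_valid


per_category, n_flagged, n_valid = summarize(pair_flags)
print(per_category)
print("flagged pairs:", n_flagged, "valid pairs:", n_valid)
```

In this toy data the per-category counts add up to 5, although only 3 pairs carry a flag, so per-category totals cannot simply be added or subtracted from the number of input pairs.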


To normalize or not to normalize

Up to this point, reads have only been classified. No reads have been excluded yet.

Normalization is the step where you decide which categories to include and how to correct for technical biases.

Normalization in TADbit computes a bias vector (one value per bin), which corrects interaction counts for sequencing depth, mappability, and other systematic effects.
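
To show what a per-bin bias vector means, here is a minimal pure-Python sketch of iterative correction on a small symmetric matrix. It is a simplified variant, not TADbit's code: it uses a square-root damped update to keep the iteration stable, and the names ice_bias and apply_bias are illustrative.

```python
import math


def ice_bias(matrix, n_iter=50):
    """Estimate one bias value per bin by damped iterative correction."""
    n = len(matrix)
    work = [row[:] for row in matrix]   # working copy, corrected in place
    bias = [1.0] * n
    for _ in range(n_iter):
        sums = [sum(row) for row in work]
        nonzero = [s for s in sums if s > 0]
        if not nonzero:
            break                        # empty matrix: nothing to correct
        mean = sum(nonzero) / len(nonzero)
        # Damped step: square root of each bin's relative coverage.
        # Empty bins are left untouched (step of 1).
        step = [math.sqrt(s / mean) if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                work[i][j] /= step[i] * step[j]
        bias = [b * s for b, s in zip(bias, step)]
    return bias


def apply_bias(matrix, bias):
    """Divide each raw count by the product of the biases of its two bins."""
    n = len(matrix)
    return [
        [matrix[i][j] / (bias[i] * bias[j]) for j in range(n)]
        for i in range(n)
    ]
```

A quick check is to build a matrix with a known bias, for example raw = [[4, 2, 0.5], [2, 16, 1], [0.5, 1, 1]], where the second bin is covered twice as well and the third bin half as well as the first. After apply_bias(raw, ice_bias(raw)), all bins should have nearly the same total.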

Important: During normalization, bad columns (bins with low counts or poor mappability) are removed from the matrix.

Several normalization strategies are available; see tadbit normalize --help for details.

A common approach is to require a minimum number of counts per bin and to explicitly exclude problematic genomic regions.

cd /home/people/$USER/3D_GENOMICS_COURSE/

# Variables used for normalization
sample="liver"                  # sample name
wd="tadbit_dirs/${sample}"      # working directory
res="100000"                    # resolution (100 kb)
norm="ICE"                      # normalization method ("ICE" or "raw")
min_count=5                     # minimum number of counts required per bin

To exclude specific regions (e.g. sex chromosomes or poorly assembled regions), use the --badcols option.
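
The two masking criteria (a minimum count per bin and explicitly excluded regions) can be sketched in Python as follows. This illustrates the idea only and makes several assumptions: a single chromosome, the row sum as the count of a bin, and excluded regions given as half-open (start, end) intervals in base pairs. TADbit's own criteria may differ in detail.

```python
def find_bad_columns(matrix, min_count, resolution, excluded_regions=()):
    """Return the sorted indices of bins that should be masked."""
    bad = set()
    # Criterion 1: bins whose total count is below the threshold.
    for i, row in enumerate(matrix):
        if sum(row) < min_count:
            bad.add(i)
    # Criterion 2: bins overlapping a user-defined region [start, end).
    for start, end in excluded_regions:
        first_bin = start // resolution
        last_bin = (end - 1) // resolution
        for i in range(first_bin, last_bin + 1):
            if i < len(matrix):
                bad.add(i)
    return sorted(bad)


counts = [
    [5, 1, 2, 2],
    [1, 1, 1, 0],
    [2, 1, 4, 1],
    [2, 0, 1, 17],
]
print(find_bad_columns(counts, min_count=5, resolution=100000,
                       excluded_regions=[(300000, 400000)]))
```

Here bin 1 is masked because its total (3) is below min_count, and bin 3 is masked because it overlaps the excluded region, even though it has plenty of counts.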

⏰ The normalization step should take approximately 2 minutes using 6 CPUs.

Task: Run normalization twice, once with norm="ICE" and once with norm="raw", and compare the results later.


Contact matrices

After normalization, TADbit generates Hi-C contact matrices at the chosen resolution.

These matrices represent interaction frequencies between genomic bins and are the basis for downstream analyses such as TAD detection.
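
The relation between read pairs, bins, and resolution can be sketched in a few lines of Python. The sketch assumes a single chromosome and pairs given as two positions in base pairs; the function name bin_contacts and the example positions are illustrative and not taken from the course data.

```python
def bin_contacts(pairs, resolution, chrom_length):
    """Build a symmetric raw contact matrix at the given resolution."""
    n_bins = -(-chrom_length // resolution)   # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // resolution, pos2 // resolution
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1                 # mirror off-diagonal contacts
    return matrix


pairs = [(50000, 150000), (120000, 180000), (10000, 390000), (60000, 170000)]
for res in (100000, 200000):
    print(res, bin_contacts(pairs, res, chrom_length=400000))
```

Running the same pairs at 100 kb and at 200 kb gives a 4 x 4 and a 2 x 2 matrix respectively, which is a useful starting point when thinking about the next question.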

Q3: How does changing the resolution affect the appearance of the contact matrix?


Congratulations, you finished the TADbit exercise!