Cancerseq exercise

From 22126
Jump to navigation Jump to search

Overview

Adapted from an original exercise by Marcin Krzystanek and Aron Eklund.


First:

  1. Navigate to your home directory:
  2. Create a directory called "cancerseq"
  3. Navigate to the directory you just created.

In this exercise, we learn the differences between standard genotyping and genotyping for cancer-seq:

  1. Somatic mutation calling
  2. Interpretation of the resulting somatic mutations

Somatic mutation calling

MuTect2

We use MuTect2, a somatic mutation caller that identifies both SNV and indels. It produces a VCF-file, although the output of Mutect2 has some information specific for somatic variants. See here for specs.

A big difference in cancer-seq variant calling using Mutect2 is that there are no ploidy assumptions. This accommodates tumor data that can have many copy number variants (CNVs).

Mutect2 is computationally intensive so we recommend parallelizing if possible. One way to achieve this is to split processes by chromosomes (calling variants for each chromosome and then merging vcf-files.)

Since we neither have the time nor the capacity to process the entire genome during our exercises, we will call somatic mutations on a small part of chromosome 10, from the 3,100,000th to the 5,100,000th base pair, which is set with the -L option. We have 2 files, one for the normal tissue:

/home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-N-WEX_recaled.bam

and for the tumor one:

/home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-T-WEX_recaled.bam

The reference can be found:

/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa

You can now run Mutech2

gatk Mutect2 -R [reference] -I [BAM NORMAL SAMPLE]  -I [BAM TUMOR SAMPLE]  -normal [ID of the NORMAL SAMPLE]   -L [REGION]  --germline-resource /home/databases/databases/GRCh38/somatic-hg38_af-only-gnomad.hg38.vcf.gz   -O [OUTPUT VCF]

please remember that the region is in the following format:

[CHR NAME]:[START COORD]-[END COORD]

The option germline resource specifies the frequency of germline mutations in a population. You can call your output TCRBOA2.vcf.gz.

Take a look at the resulting VCF file, try to count the number of raw variants either using bcftools or standard UNIX tools (zgrep or zcat+grep):

Q1

How many variants did you find?


Compare with calling all variants

Just for comparison, we try to call all variants in the interval for the germline and the tumor sample.

gatk HaplotypeCaller -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-T-WEX_recaled.bam -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -L chr10:3100000-5100000 -O TCRBOA2-T.vcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
gatk HaplotypeCaller -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-N-WEX_recaled.bam -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -L chr10:3100000-5100000 -O TCRBOA2-N.vcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz

Count the variant lines in each with the same command as above:

bcftools view -H TCRBOA2-T.vcf.gz | wc -l
bcftools view -H TCRBOA2-N.vcf.gz | wc -l 

Q2 Where do you find the highest number of raw variants? Does that make biological sense? What is the difference between the two numbers and does it match above?


Filtering and annotating variants


Before continuing, we need to filter the raw vcf-output to only get confident variants:

gatk FilterMutectCalls -V TCRBOA2.vcf.gz -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -O TCRBOA2_filtered.vcf.gz

Try to look at the variants.

Q3 What does it look like in the filter column (7th column)? What kind of filters were applied?

To add some extra information to the VCF file, we will also annotate with the dbSNP IDs of known SNP. HaplotypeCaller can do this as it calls variants, but using Mutect2 we need to do it ourselves:

java -jar /home/ctools/snpEff/SnpSift.jar  annotate /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz  [INPUT VCF] 

The command above produces output on the STDOUT. The input is the VCF you produced with FilterMutectCalls. Try to produce a file called "TCRBOA2_filtered_anno.vcf.gz". Remember to use "bgzip -c". Note that this command may take up to 3 minutes. Now try to filter mutational calls by selecting those with Mutect "PASS" annotation.

bcftools view -H -f PASS TCRBOA2_filtered_anno.vcf.gz

You should at least see this line (without the header). Don't forget you can scroll to the sides!

    #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  TCRBOA2-N-WEX   TCRBOA2-T-WEX
chr10	3165513	rs9423502	G	C	.	PASS	AS_FilterStatus=SITE;AS_SB_TABLE=39,119|0,3;DP=168;ECNT=1;GERMQ=93;MBQ=38,39;MFRL=225,184;MMQ=60,60;MPOS=4;NALOD=1.86;NLOD=21.37;POPAF=1.32;TLOD=6.08;CAF=[0.9454,0.05464];COMMON=1;G5;GNO;HD;KGPROD;KGPhase1;NSM;OTHERKG;PH3;REF;RS=9423502;RSPOS=3207705;S3D;SAO=0;SLO;SSR=0;VC=SNV;VLD;VP=0x050300000a01150517000101;WGT=1;dbSNPBuildID=119	GT:AD:AF:DP:F1R2:F2R1:SB	0/0:79,0:0.014:79:41,0:38,0:19,60,0,0	0/1:79,3:0.054:82:41,2:38,1:20,59,0,3


There is a lot of information in the line. A brief explanation of each part of the line above is in the header of the VCF file (pipe into "less -S" to look at it, but it is also difficult to interpret).

Let's focus on the FORMAT lines. The per sample information kept here is organized by GT:AD:AF:DP:F1R2:F2R1:SB and the values are kept in the two proceeding columns, also separated by colons. The first starting with 0/0 refers to the normal sample, whereas the column beginning with 0/1 refers to the tumor. This means that the tumor is heterozygote (REF/ALT) for the mutation which is not seen in the germline sample at all (it is REF/REF).

After genotype (GT) we have allelic depth (AD) which is "79,3" (i.e. 79 reads and 3 reads for the reference and mutant allele respectively) in the tumor and "79,0" in the normal sample. Then comes allelic frequency, which is a fraction of the mutant allele out of all aligned bases in this position and the depth. We will skip the remaining values for F1R2, F2R1 and SB for now.

Q4

There should be 2 variants, which one was already found in previous studies (i.e. is documented in dbSNP)?


Interpretation of the resulting somatic mutations

A list of chromosome coordinates is kind of hard to interpret. Here are some ways to approach the results.

Variant annotation with dbSNP, Variant effect predictor and COSMIC

Find the RS identifier from the cancer mutation with a dbSNP ID and look it up at dbSNP.

Q5

Find the frequency table tab. Is your mutation common in some populations? What does a high frequency tell you about its role in cancer? So far you have processed and analyzed only a small section of chromosome 10.

Now, let us analyze a bigger portion of the genome. Pick your favorite chromosome and find the corresponding VCF file on the server. For example, if you choose chromosome 7, you would use this file:

/home/projects/22126_NGS/exercises/cancer_seq/chr7.vcf

Hint: your results will be more interesting if you pick chromosomes: 18!

Filter the VCF to retain only the lines marked as "PASS".

grep "PASS" /home/projects/22126_NGS/exercises/cancer_seq/chr7.vcf > chr7_filtered.vcf

Download the filtered VCF to your own computer using:

scp [userID]@pupil1:/path/to/file .

and submit it to the VEP website using default settings. When the results become available, look in the "Somatic status" column. Are there any known cancer mutations?

If you find a known cancer mutation, find its COSMIC identifier (COSM######, e.g. COSM4597270) in the "existing variant" column. Search for your COSMIC identifier in the COSMIC database.

Q6

In which tissues is this mutation found?


cBioPortal

Go to cBioPortal, a website that provides tools to analyze several large cancer sequencing datasets. In Quick Select, choose "TCGA PanCancer Atlas studies". Then press "Query by Gene" and type in the name of the gene that was hit by this mutation. Choose "mutations" as we have not looked at Copy Number Alterations. Press "Submit Query". Look at the barcharts and play around with the options.

Q7

How often is this gene mutated in various cancer types?


Inference of tissue of origin

Next, we will do some analysis on a VCF file containing somatic mutations found throughout the entire genome:

/home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2_filtered.vcf.gz

Unlike VEP, TumorTracer requires VCF files to have the header information. Thus, we will filter this VCF file to retain: 1) header lines (which begin with "#"), and 2) data lines with a PASS call.

zgrep -E "^#|PASS" /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2_filtered.vcf.gz > TCRBOA2_filtered_pass.vcf

Download the VCF file and submit it to the https://services.healthtech.dtu.dk/services/TumorTracer-1.1/ TumorTracer server]. Make sure to specify that this VCF was generated using GRCh38 coordinates.

Q8

What tissue does TumorTracer predict? Is it a confident prediction?


Congratulations you finished the exercise!



Please find answers Cancerseq_exercise_answers.