Cancerseq exercise answers
Q1
We run:
gatk Mutect2 -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-N-WEX_recaled.bam -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-T-WEX_recaled.bam -normal TCRBOA2-N-WEX -L chr10:3100000-5100000 --germline-resource /home/databases/databases/GRCh38/somatic-hg38_af-only-gnomad.hg38.vcf.gz -O TCRBOA2.vcf.gz
Then either:
bcftools view -H TCRBOA2.vcf.gz|wc -l bcftools stats TCRBOA2.vcf.gz zgrep -v "^#" TCRBOA2.vcf.gz|wc -l zcat TCRBOA2.vcf.gz |grep -v "#"|wc -l
Will give you 9 variants.
Q2
First we run:
gatk HaplotypeCaller -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-T-WEX_recaled.bam -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -L chr10:3100000-5100000 -O TCRBOA2-T.vcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz gatk HaplotypeCaller -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-N-WEX_recaled.bam -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -L chr10:3100000-5100000 -O TCRBOA2-N.vcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
Then counting the number of variants in the tumor:
bcftools view -H TCRBOA2-T.vcf.gz | wc -l
424
and normal sample:
bcftools view -H TCRBOA2-N.vcf.gz | wc -l
413
The tumor has more variants which is expected due to a higher amount of somatic variants.
Q3
gatk FilterMutectCalls -V TCRBOA2.vcf.gz -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -O TCRBOA2_filtered.vcf.gz
We can insect visually:
bcftools view -H TCRBOA2_filtered.vcf.gz |less -S
Or to classify filters in a straightforward manner:
bcftools view -H TCRBOA2_filtered.vcf.gz |cut -f 7 |tr ";" "\n" |sort |uniq -c |sort
You get:
1 slippage 2 haplotype 2 PASS 2 weak_evidence 3 clustered_events 3 map_qual 4 strand_bias 6 normal_artifact
see some notes on the meaning of these filters here
Q4
Running
java -jar /usr/local/bin/SnpSift.jar annotate /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz TCRBOA2_filtered.vcf.gz | bgzip -c > TCRBOA2_filtered_anno.vcf.gz bcftools view -H -f PASS TCRBOA2_filtered_anno.vcf.gz | less -S
should give you:
chr10 3165513 rs9423502 G C . PASS AS_FilterStatus=SITE;AS_SB_TABLE=39,119|0,3;DP=168;ECNT=1;GERMQ=93;MBQ=38,39;MFRL=225,184;MMQ=60,60;MPOS=4;NALOD=1.86;NLOD=21.37;POPAF=1.32;TLOD=6.08;CAF=[0.9454,0.05464];COMMON=1;G5;GNO;HD;KGPROD;KGPhase1;NSM;OTHERKG;PH3;REF;RS=9423502;RSPOS=3207705;S3D;SAO=0;SLO;SSR=0;VC=SNV;VLD;VP=0x050300000a01150517000101;WGT=1;dbSNPBuildID=119 GT:AD:AF:DP:F1R2:F2R1:SB 0/0:79,0:0.014:79:41,0:38,0:19,60,0,0 0/1:79,3:0.054:82:41,2:38,1:20,59,0,3 chr10 4972935 . A T . PASS AS_FilterStatus=SITE;AS_SB_TABLE=8,84|0,4;DP=102;ECNT=1;GERMQ=93;MBQ=36,34;MFRL=302,495;MMQ=50,40;MPOS=20;NALOD=1.58;NLOD=11.08;POPAF=1.62;TLOD=8.82GT:AD:AF:DP:F1R2:F2R1:SB 0/0:37,0:0.026:37:16,0:20,0:1,36,0,0 0/1:55,4:0.083:59:26,2:28,2:7,48,0,4
The ID "rs9423502" is a dbSNP ID so the SNP at 3165513 was previously found whereas 4972935 was not.
Q5
The SNP can be found here: https://www.ncbi.nlm.nih.gov/snp/?term=rs9423502
Generally, the prevalence of the SNPs is relatively low (2-5%) which indicates that there is a potential role for diving cancer.
Q6
The variant on chromosome 18 are missense, potentially deleterious and have the COSMIC ID: COSV99493765. In the COSMIC database, it hits the DYM gene and is mostly found mutated in liver and prostate.
Q7
Often enough, around 6% in certain cases
Q8
kidney but the confidence is low as the prediction score is a virtual tie with liver.