Cancerseq exercise answers

From 22126
Jump to navigation Jump to search

Q1

We run:

gatk Mutect2  -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa  -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-N-WEX_recaled.bam     -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-T-WEX_recaled.bam        -normal TCRBOA2-N-WEX         -L chr10:3100000-5100000   --germline-resource /home/databases/databases/GRCh38/somatic-hg38_af-only-gnomad.hg38.vcf.gz   -O TCRBOA2.vcf.gz

Then either:

bcftools view -H TCRBOA2.vcf.gz|wc -l
bcftools stats TCRBOA2.vcf.gz
zgrep -v "^#" TCRBOA2.vcf.gz|wc -l 
zcat TCRBOA2.vcf.gz |grep -v "#"|wc -l 

Will give you 9 variants.


Q2

First we run:

gatk HaplotypeCaller -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-T-WEX_recaled.bam -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -L chr10:3100000-5100000 -O TCRBOA2-T.vcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
gatk HaplotypeCaller -I /home/projects/22126_NGS/exercises/cancer_seq/TCRBOA2-N-WEX_recaled.bam -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -L chr10:3100000-5100000 -O TCRBOA2-N.vcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz

Then counting the number of variants in the tumor:

bcftools view -H TCRBOA2-T.vcf.gz | wc -l

424

and normal sample:

bcftools view -H TCRBOA2-N.vcf.gz | wc -l 

413

The tumor has more variants which is expected due to a higher amount of somatic variants.

Q3

gatk FilterMutectCalls -V TCRBOA2.vcf.gz -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -O TCRBOA2_filtered.vcf.gz

We can insect visually:

bcftools view -H TCRBOA2_filtered.vcf.gz |less -S

Or to classify filters in a straightforward manner:

 bcftools view -H TCRBOA2_filtered.vcf.gz |cut -f 7   |tr ";" "\n" |sort |uniq -c  |sort

You get:

      1 slippage
      2 haplotype
      2 PASS
      2 weak_evidence
      3 clustered_events
      3 map_qual
      4 strand_bias
      6 normal_artifact

see some notes on the meaning of these filters here


Q4 Running

java -jar /usr/local/bin/SnpSift.jar  annotate /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz TCRBOA2_filtered.vcf.gz  | bgzip -c > TCRBOA2_filtered_anno.vcf.gz
bcftools view -H -f PASS TCRBOA2_filtered_anno.vcf.gz | less -S


should give you:

chr10	3165513	rs9423502	G	C	.	PASS	AS_FilterStatus=SITE;AS_SB_TABLE=39,119|0,3;DP=168;ECNT=1;GERMQ=93;MBQ=38,39;MFRL=225,184;MMQ=60,60;MPOS=4;NALOD=1.86;NLOD=21.37;POPAF=1.32;TLOD=6.08;CAF=[0.9454,0.05464];COMMON=1;G5;GNO;HD;KGPROD;KGPhase1;NSM;OTHERKG;PH3;REF;RS=9423502;RSPOS=3207705;S3D;SAO=0;SLO;SSR=0;VC=SNV;VLD;VP=0x050300000a01150517000101;WGT=1;dbSNPBuildID=119	GT:AD:AF:DP:F1R2:F2R1:SB	0/0:79,0:0.014:79:41,0:38,0:19,60,0,0	0/1:79,3:0.054:82:41,2:38,1:20,59,0,3
chr10	4972935	.	A	T	.	PASS	AS_FilterStatus=SITE;AS_SB_TABLE=8,84|0,4;DP=102;ECNT=1;GERMQ=93;MBQ=36,34;MFRL=302,495;MMQ=50,40;MPOS=20;NALOD=1.58;NLOD=11.08;POPAF=1.62;TLOD=8.82GT:AD:AF:DP:F1R2:F2R1:SB	0/0:37,0:0.026:37:16,0:20,0:1,36,0,0	0/1:55,4:0.083:59:26,2:28,2:7,48,0,4

The ID "rs9423502" is a dbSNP ID so the SNP at 3165513 was previously found whereas 4972935 was not.

Q5

The SNP can be found here: https://www.ncbi.nlm.nih.gov/snp/?term=rs9423502

Generally, the prevalence of the SNPs is relatively low (2-5%) which indicates that there is a potential role for diving cancer.

Q6

The variant on chromosome 18 are missense, potentially deleterious and have the COSMIC ID: COSV99493765. In the COSMIC database, it hits the DYM gene and is mostly found mutated in liver and prostate.

Q7

Often enough, around 6% in certain cases

Q8


kidney but the confidence is low as the prediction score is a virtual tie with liver.