SNP calling exercise part 2 answers: Difference between revisions

From 22126
Jump to navigation Jump to search
(Created page with "'''Q1''' First, running: <pre> tabix -f -p vcf NA24694.gvcf.gz </pre> then <pre> gatk --java-options "-Xmx10g" HaplotypeCaller -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam -L chr20 -O NA24694.gvcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -ERC GVCF </pre> <pre> gatk GenotypeGVCFs -R /home/databases/references/human/GRCh3...")
 
No edit summary
 
Line 1: Line 1:
'''Q1'''
'''Q1'''
First, running:
<pre>
tabix -f -p vcf NA24694.gvcf.gz
</pre>
then
<pre>
gatk --java-options "-Xmx10g" HaplotypeCaller    -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa    -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam  -L chr20  -O NA24694.gvcf.gz  --dbsnp  /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz  -ERC GVCF
</pre>
<pre>
gatk  GenotypeGVCFs -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -V NA24694.gvcf.gz  -L chr20 --dbsnp  /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -O NA24694.vcf.gz
</pre>
<pre>
bcftools stats  NA24694.vcf.gz
</pre>
Should give you:
<pre>
SN 0 number of SNPs: 75684
</pre>
so 75684 SNPs.
'''Q2'''
First run:
<pre>
tabix -p vcf NA24694.vcf.gz
</pre>
Then:
<pre>
tabix  NA24694.vcf.gz chr20:32000000-33000000 |wc -l
</pre>
So 1290 variant sites.
'''Q3'''
There are 2 ways to do this:
<pre>
bcftools view -H --type=snps NA24694.vcf.gz  chr20:32000000-33000000 |wc -l
</pre>
or
<pre>
tabix -h NA24694.vcf.gz chr20:32000000-33000000 |bcftools view -H --type=snps -  |wc -l
</pre>
Both will give you: 956.
'''Q4'''
<pre>
tabix NA24694.vcf.gz  chr20:32011209-32011209
</pre>
You get:
<pre>
chr20 32011209 rs147652161 G A 264.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-3.010e-01;DB;DP=24;ExcessHet=3.0103;FS=3.949;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=11.03;ReadPosRankSum=-3.580e-01;SOR=0.552 GT:AD:DP:GQ:PL 0/1:15,9:24:99:272,0,533
</pre>
The line above says that you probably have G and A and the site is heterozygous. G is the reference and A the alternative allele. The allele depth is 15Gs, 9As, the depth is 24, the genotype quality is 99, the PHRED genotype likelihoods are homo ref, hetero, homo alt: 272,0,533
<pre>
tabix NA24694.vcf.gz  chr20:32044279-32044279
</pre>
You get:
<pre>
chr20 32044279 rs4525768 C T 799.06 . AC=2;AF=1.00;AN=2;DB;DP=21;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=28.99;SOR=0.990 GT:AD:DP:GQ:PL 1/1:0,21:21:63:813,63,0
</pre>
The line above says that you are probably homozygous for T. C is the reference and T the alternative allele. The allele depth is 0Cs, 21Ts, the depth is 21, the genotype quality is 63, the PHRED genotype likelihoods are homo ref, hetero, homo alt: 813,63,0.
'''Q5'''
Both are heterozygous sites however this is the better one:
<pre>
chr20 32974911 rs6088051 A G 403.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.693;DB;DP=22;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=59.78;MQRankSum=-6.860e-01;QD=19.22;ReadPosRankSum=-1.703e+00;SOR=0.871 GT:AD:DP:GQ:PL 0/1:8,13:21:99:411,0,247
</pre>
and the worse one:
<pre>
chr20 64291638 rs369221086 C T 30.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.530e-01;DB;DP=6;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=28.18;MQRankSum=-9.670e-01;QD=5.11;ReadPosRankSum=-8.420e-01;SOR=0.693 GT:AD:DP:GQ:PL 0/1:4,2:6:38:38,0,114
</pre>
The first site has a depth of 21, 8As and 13Gs. In the second case, the genotype quality is 38 and the first was 99. The second one only has 4Cs and 2 Ts for a total depth of 6 which is not sufficient to confidently call a heterozygous site.
 
'''Q6'''
Either use:
<pre>
bcftools view -H --types=snps  NA24694.vcf.gz chr20:32000000-33000000 |cut -f 3 |grep -v rs |wc -l
</pre>
or
<pre>
bcftools view -H --types=snps  NA24694.vcf.gz chr20:32000000-33000000 |cut -f 3 |grep "\." |wc -l
</pre>
you will get 17.
'''Q7'''
17 is very little compared to the number of SNPs 956. However, this is very expected. Given that the individual is Han Chinese, this ethnic group is very well represented in dbSNP.
<pre>
mkdir var_recal/
gatk --java-options "-Xmx10G"  VariantRecalibrator -V NA24694.vcf.gz    --rscript-file var_recal/NA24694_plots.R -O var_recal/NA24694_recal  -mode SNP --tranches-file var_recal/NA24694_tranches  -tranche 99.0  -tranche 95.0 -tranche 90.0  -tranche 85.0 -tranche 80.0 -tranche 75.0 -tranche 70.0 -tranche 65.0 -tranche 60.0 -tranche 58.0  -an QD -an DP -an FS  -an MQRankSum -an ReadPosRankSum -an SOR -an MQ  -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /home/databases/databases/GRCh38/hapmap_3.3.hg38.vcf.gz  -resource:omni,known=false,training=true,truth=false,prior=12.0 /home/databases/databases/GRCh38/1000G_omni2.5.hg38.vcf.gz  -resource:1000G,known=false,training=true,truth=false,prior=10.0 /home/databases/databases/GRCh38/1000G_phase1.snps.high_confidence.hg38.vcf.gz    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
</pre>
'''Q8'''
Either 60 or 65 should be good.
<pre>
gatk --java-options "-Xmx10G" ApplyVQSR -V NA24694.vcf.gz  -O NA24694_sf.vcf.gz --recal-file var_recal/NA24694_recal  --tranches-file var_recal/NA24694_tranches  -truth-sensitivity-filter-level 65 --create-output-variant-index true  -mode SNP
</pre>
'''Q9'''
Let us not forget to index:
<pre>
tabix -p  vcf NA24694_sf_pass.vcf.gz
</pre>
Then run:
<pre>
bcftools view -H --types=snps  NA24694_sf_pass.vcf.gz  chr20:32000000-33000000 |wc -l
</pre>
For a total of filtered 416 SNPs which is much less of them before.
'''Q10'''
If you are more sensitive then you will let more sites through and the number would increase.
'''Q11'''


First running:
First running:
Line 165: Line 23:




'''Q12'''
'''Q2'''


One possibility is:
One possibility is:
Line 182: Line 40:
This says remove all lines with the string "PASS", extract the seventh column, sort them, unique and count them, sort again but according to numerical order. At the bottom, you have the most used filter which is depth of coverage.
This says remove all lines with the string "PASS", extract the seventh column, sort them, unique and count them, sort again but according to numerical order. At the bottom, you have the most used filter which is depth of coverage.


'''Q13'''
'''Q3'''


Initially, we isolate the ones that pass the filter:
Initially, we isolate the ones that pass the filter:
Line 203: Line 61:




'''Q14'''
'''Q4'''
Using:
Using:
<pre>
<pre>
Line 212: Line 70:
Intron 64.368%  
Intron 64.368%  


'''Q15'''
'''Q5'''
In the HTML file you see:
In the HTML file you see:



Latest revision as of 16:28, 15 December 2024

Q1

First running:

gatk VariantFiltration     -V NA24694.vcf.gz  -O NA24694_hf.vcf.gz   -filter "DP < 10.0" --filter-name "DP"    -filter "QUAL < 30.0" --filter-name "QUAL30"     -filter "SOR > 3.0" --filter-name "SOR3"     -filter "FS > 60.0" --filter-name "FS60"     -filter "MQ < 40.0" --filter-name "MQ40"      
  

and

 bcftools view  -H NA24694_hf.vcf.gz |grep -v PASS |wc -l 

Gives us 4005 sites.

 bcftools view  -H  --type=snps NA24694_hf.vcf.gz |grep -v PASS |wc -l 

2630 SNPs


Q2

One possibility is:

 bcftools view -H NA24694_hf.vcf.gz |grep -v PASS |cut -f 7  |sort |uniq -c  |sort -n 
      5 FS60;SOR3
     30 DP;MQ40;SOR3
     74 MQ40;SOR3
    158 DP;SOR3
    197 DP;MQ40
    390 MQ40
   1340 SOR3
   1811 DP

This says remove all lines with the string "PASS", extract the seventh column, sort them, unique and count them, sort again but according to numerical order. At the bottom, you have the most used filter which is depth of coverage.

Q3

Initially, we isolate the ones that pass the filter:

bcftools view  -f PASS NA24694_hf.vcf.gz |bgzip -c  > NA24694_hf_pass.vcf.gz
bcftools view -H NA24694_hf_pass.vcf.gz |wc -l 

88594 total sites (SNPS+indels+multi-allelic).

Then we retain the sites using bedtools:

bedtools intersect -header -a  NA24694_hf_pass.vcf.gz  -b /home/databases/databases/GRCh38/filter99.bed.gz |bgzip -c > NA24694_hf_map99.vcf.gz
bcftools view -H NA24694_hf_map99.vcf.gz |wc -l 

51624 total sites remain


Q4 Using:

java -jar /home/ctools/snpEff/snpEff.jar  eff  -dataDir /home/databases/databases/snpEff/  -htmlStats NA24694_hf.html GRCh38.99 NA24694_hf.vcf.gz  |bgzip -c > NA24694_hf_ann.vcf.gz

In the HTML file you see: Intron 64.368%

Q5 In the HTML file you see:

MISSENSE 584 44.242%

So a total of 584 detecting mutations can have an impact on the protein sequence.