SNP calling exercise answers part 1: Difference between revisions

From 22126
Jump to navigation Jump to search
No edit summary
No edit summary
 
Line 6: Line 6:


<pre>
<pre>
tabix -f -p vcf NA24694.gvcf.gz
/home/ctools/htslib-1.20/tabix -f -p vcf NA24694.gvcf.gz
</pre>
</pre>


Line 12: Line 12:


<pre>
<pre>
gatk --java-options "-Xmx10g" HaplotypeCaller \
/home/ctools/gatk-4.6.2.0/gatk --java-options "-Xmx10g" HaplotypeCaller \
   -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
   -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
   -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam \
   -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam \
Line 24: Line 24:


<pre>
<pre>
gatk GenotypeGVCFs \
/home/ctools/gatk-4.6.2.0/gatk GenotypeGVCFs \
   -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
   -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
   -V NA24694.gvcf.gz \
   -V NA24694.gvcf.gz \
Line 35: Line 35:


<pre>
<pre>
bcftools stats NA24694.vcf.gz
/home/ctools/bcftools-1.23/bcftools stats NA24694.vcf.gz
</pre>
</pre>


Line 53: Line 53:


<pre>
<pre>
tabix -p vcf NA24694.vcf.gz
/home/ctools/htslib-1.20/tabix -p vcf NA24694.vcf.gz
</pre>
</pre>


Line 59: Line 59:


<pre>
<pre>
tabix NA24694.vcf.gz chr20:32000000-33000000 | wc -l
/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32000000-33000000 | wc -l
</pre>
</pre>


Line 71: Line 71:


<pre>
<pre>
bcftools view -H --type snps NA24694.vcf.gz chr20:32000000-33000000 | wc -l
/home/ctools/bcftools-1.23/bcftools view -H --type snps NA24694.vcf.gz chr20:32000000-33000000 | wc -l
</pre>
</pre>


Line 77: Line 77:


<pre>
<pre>
tabix -h NA24694.vcf.gz chr20:32000000-33000000 \
/home/ctools/htslib-1.20/tabix -h NA24694.vcf.gz chr20:32000000-33000000 \
   | bcftools view -H --type snps - \
   | /home/ctools/bcftools-1.23/bcftools view -H --type snps - \
   | wc -l
   | wc -l
</pre>
</pre>
Line 91: Line 91:


<pre>
<pre>
tabix NA24694.vcf.gz chr20:32011209-32011209
/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32011209-32011209
</pre>
</pre>


Line 112: Line 112:


<pre>
<pre>
tabix NA24694.vcf.gz chr20:32044279-32044279
/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32044279-32044279
</pre>
</pre>


Line 163: Line 163:


<pre>
<pre>
bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
/home/ctools/bcftools-1.23/bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
   | cut -f 3 \
   | cut -f 3 \
   | grep -v rs \
   | grep -v rs \
Line 172: Line 172:


<pre>
<pre>
bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
/home/ctools/bcftools-1.23/bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
   | cut -f 3 \
   | cut -f 3 \
   | grep "\." \
   | grep "\." \

Latest revision as of 15:55, 7 January 2026

Answers

Q1

First index the gVCF:

/home/ctools/htslib-1.20/tabix -f -p vcf NA24694.gvcf.gz

Then run HaplotypeCaller:

/home/ctools/gatk-4.6.2.0/gatk --java-options "-Xmx10g" HaplotypeCaller \
  -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam \
  -L chr20 \
  -O NA24694.gvcf.gz \
  --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  -ERC GVCF

Then genotype the gVCF:

/home/ctools/gatk-4.6.2.0/gatk GenotypeGVCFs \
  -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \
  -V NA24694.gvcf.gz \
  -L chr20 \
  --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \
  -O NA24694.vcf.gz

Finally, count SNPs using bcftools stats:

/home/ctools/bcftools-1.23/bcftools stats NA24694.vcf.gz

You should see something like:

SN    0    number of SNPs:    75684

Answer: There are 75,684 SNPs on chromosome 20 in this sample.


Q2

Index the VCF if not already indexed:

/home/ctools/htslib-1.20/tabix -p vcf NA24694.vcf.gz

Then query the 1 Mb region:

/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32000000-33000000 | wc -l

Answer: 1290 variant sites in this region (all variant types).


Q3

Count only SNPs (exclude indels and multi-allelic sites):

/home/ctools/bcftools-1.23/bcftools view -H --type snps NA24694.vcf.gz chr20:32000000-33000000 | wc -l

or equivalently:

/home/ctools/htslib-1.20/tabix -h NA24694.vcf.gz chr20:32000000-33000000 \
  | /home/ctools/bcftools-1.23/bcftools view -H --type snps - \
  | wc -l

Answer: 956 SNPs in the region.


Q4

Variant at chr20:32011209

/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32011209-32011209
chr20  32011209  rs147652161  G  A  264.64  .  
AC=1;AF=0.500;AN=2;BaseQRankSum=-0.301;DB;DP=24;FS=3.949;MQ=60.00;QD=11.03;SOR=0.552  
GT:AD:DP:GQ:PL  0/1:15,9:24:99:272,0,533

Interpretation:

  • Genotype: 0/1 (heterozygous G/A)
  • Allele depth (AD): 15 G and 9 A
  • Depth (DP): 24
  • Genotype quality (GQ): 99
  • Genotype likelihoods (PL): 272,0,533 (het most likely)

Variant at chr20:32044279

/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32044279-32044279
chr20  32044279  rs4525768  C  T  799.06  .  
AC=2;AF=1.00;AN=2;DB;DP=21;MQ=60.00;QD=28.99  
GT:AD:DP:GQ:PL  1/1:0,21:21:63:813,63,0

Interpretation:

  • Genotype: 1/1 (homozygous T/T)
  • Allele depth (AD): 0 C, 21 T
  • Depth (DP): 21
  • Genotype quality (GQ): 63
  • Genotype likelihoods (PL): 813,63,0 (homozygous alt most likely)

Q5

Higher-quality heterozygous SNP:

chr20 32974911 rs6088051 A G ... GT:AD:DP:GQ:PL  0/1:8,13:21:99:411,0,247

Lower-quality heterozygous SNP:

chr20 64291638 rs369221086 C T ... GT:AD:DP:GQ:PL  0/1:4,2:6:38:38,0,114

Why?

  • The first site has much higher depth (21× vs 6×)
  • Allele balance is more reasonable: 8/13 vs 4/2
  • The genotype quality (GQ) is much higher: 99 vs 38
  • Overall likelihoods strongly support the correct genotype

Conclusion: More data ⇒ higher confidence.


Q6

Count SNPs with no dbSNP ID (column 3 = .):

/home/ctools/bcftools-1.23/bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
  | cut -f 3 \
  | grep -v rs \
  | wc -l

or equivalently:

/home/ctools/bcftools-1.23/bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \
  | cut -f 3 \
  | grep "\." \
  | wc -l

Answer: 17 novel SNPs


Q7

You found:

  • 956 total SNPs in the region (Q3)
  • 17 novel SNPs (Q6)

Only ~1.8% of SNPs are novel.

This is expected because:

  • Han Chinese individuals are extremely well represented in dbSNP and 1000 Genomes.
  • dbSNP contains >100 million known variants, so most common variation is already catalogued.
  • Novel variants tend to be rare or extremely rare.

Conclusion: The number of novel variants is small and biologically reasonable.