SNP calling exercise answers part 1: Difference between revisions
No edit summary |
No edit summary |
||
| Line 6: | Line 6: | ||
<pre> | <pre> | ||
tabix -f -p vcf NA24694.gvcf.gz | /home/ctools/htslib-1.20/tabix -f -p vcf NA24694.gvcf.gz | ||
</pre> | </pre> | ||
| Line 12: | Line 12: | ||
<pre> | <pre> | ||
gatk --java-options "-Xmx10g" HaplotypeCaller \ | /home/ctools/gatk-4.6.2.0/gatk --java-options "-Xmx10g" HaplotypeCaller \ | ||
-R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ | -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ | ||
-I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam \ | -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam \ | ||
| Line 24: | Line 24: | ||
<pre> | <pre> | ||
gatk GenotypeGVCFs \ | /home/ctools/gatk-4.6.2.0/gatk GenotypeGVCFs \ | ||
-R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ | -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ | ||
-V NA24694.gvcf.gz \ | -V NA24694.gvcf.gz \ | ||
| Line 35: | Line 35: | ||
<pre> | <pre> | ||
bcftools stats NA24694.vcf.gz | /home/ctools/bcftools-1.23/bcftools stats NA24694.vcf.gz | ||
</pre> | </pre> | ||
| Line 53: | Line 53: | ||
<pre> | <pre> | ||
tabix -p vcf NA24694.vcf.gz | /home/ctools/htslib-1.20/tabix -p vcf NA24694.vcf.gz | ||
</pre> | </pre> | ||
| Line 59: | Line 59: | ||
<pre> | <pre> | ||
tabix NA24694.vcf.gz chr20:32000000-33000000 | wc -l | /home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32000000-33000000 | wc -l | ||
</pre> | </pre> | ||
| Line 71: | Line 71: | ||
<pre> | <pre> | ||
bcftools view -H --type snps NA24694.vcf.gz chr20:32000000-33000000 | wc -l | /home/ctools/bcftools-1.23/bcftools view -H --type snps NA24694.vcf.gz chr20:32000000-33000000 | wc -l | ||
</pre> | </pre> | ||
| Line 77: | Line 77: | ||
<pre> | <pre> | ||
tabix -h NA24694.vcf.gz chr20:32000000-33000000 \ | /home/ctools/htslib-1.20/tabix -h NA24694.vcf.gz chr20:32000000-33000000 \ | ||
| bcftools view -H --type snps - \ | | /home/ctools/bcftools-1.23/bcftools view -H --type snps - \ | ||
| wc -l | | wc -l | ||
</pre> | </pre> | ||
| Line 91: | Line 91: | ||
<pre> | <pre> | ||
tabix NA24694.vcf.gz chr20:32011209-32011209 | /home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32011209-32011209 | ||
</pre> | </pre> | ||
| Line 112: | Line 112: | ||
<pre> | <pre> | ||
tabix NA24694.vcf.gz chr20:32044279-32044279 | /home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32044279-32044279 | ||
</pre> | </pre> | ||
| Line 163: | Line 163: | ||
<pre> | <pre> | ||
bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \ | /home/ctools/bcftools-1.23/bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \ | ||
| cut -f 3 \ | | cut -f 3 \ | ||
| grep -v rs \ | | grep -v rs \ | ||
| Line 172: | Line 172: | ||
<pre> | <pre> | ||
bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \ | /home/ctools/bcftools-1.23/bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \ | ||
| cut -f 3 \ | | cut -f 3 \ | ||
| grep "\." \ | | grep "\." \ | ||
Latest revision as of 15:55, 7 January 2026
Answers
Q1
First index the gVCF:
/home/ctools/htslib-1.20/tabix -f -p vcf NA24694.gvcf.gz
Then run HaplotypeCaller:
/home/ctools/gatk-4.6.2.0/gatk --java-options "-Xmx10g" HaplotypeCaller \ -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam \ -L chr20 \ -O NA24694.gvcf.gz \ --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \ -ERC GVCF
Then genotype the gVCF:
/home/ctools/gatk-4.6.2.0/gatk GenotypeGVCFs \ -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa \ -V NA24694.gvcf.gz \ -L chr20 \ --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz \ -O NA24694.vcf.gz
Finally, count SNPs using bcftools stats:
/home/ctools/bcftools-1.23/bcftools stats NA24694.vcf.gz
You should see something like:
SN 0 number of SNPs: 75684
Answer: There are 75,684 SNPs on chromosome 20 in this sample.
Q2
Index the VCF if not already indexed:
/home/ctools/htslib-1.20/tabix -p vcf NA24694.vcf.gz
Then query the 1 Mb region:
/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32000000-33000000 | wc -l
Answer: 1290 variant sites in this region (all variant types).
Q3
Count only SNPs (exclude indels and multi-allelic sites):
/home/ctools/bcftools-1.23/bcftools view -H --type snps NA24694.vcf.gz chr20:32000000-33000000 | wc -l
or equivalently:
/home/ctools/htslib-1.20/tabix -h NA24694.vcf.gz chr20:32000000-33000000 \ | /home/ctools/bcftools-1.23/bcftools view -H --type snps - \ | wc -l
Answer: 956 SNPs in the region.
Q4
Variant at chr20:32011209
/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32011209-32011209
chr20 32011209 rs147652161 G A 264.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-0.301;DB;DP=24;FS=3.949;MQ=60.00;QD=11.03;SOR=0.552 GT:AD:DP:GQ:PL 0/1:15,9:24:99:272,0,533
Interpretation:
- Genotype: 0/1 (heterozygous G/A)
- Allele depth (AD): 15 G and 9 A
- Depth (DP): 24
- Genotype quality (GQ): 99
- Genotype likelihoods (PL): 272,0,533 (het most likely)
Variant at chr20:32044279
/home/ctools/htslib-1.20/tabix NA24694.vcf.gz chr20:32044279-32044279
chr20 32044279 rs4525768 C T 799.06 . AC=2;AF=1.00;AN=2;DB;DP=21;MQ=60.00;QD=28.99 GT:AD:DP:GQ:PL 1/1:0,21:21:63:813,63,0
Interpretation:
- Genotype: 1/1 (homozygous T/T)
- Allele depth (AD): 0 C, 21 T
- Depth (DP): 21
- Genotype quality (GQ): 63
- Genotype likelihoods (PL): 813,63,0 (homozygous alt most likely)
Q5
Higher-quality heterozygous SNP:
chr20 32974911 rs6088051 A G ... GT:AD:DP:GQ:PL 0/1:8,13:21:99:411,0,247
Lower-quality heterozygous SNP:
chr20 64291638 rs369221086 C T ... GT:AD:DP:GQ:PL 0/1:4,2:6:38:38,0,114
Why?
- The first site has much higher depth (21× vs 6×)
- Allele balance is more reasonable: 8/13 vs 4/2
- The genotype quality (GQ) is much higher: 99 vs 38
- Overall likelihoods strongly support the correct genotype
Conclusion: More data ⇒ higher confidence.
Q6
Count SNPs with no dbSNP ID (column 3 = .):
/home/ctools/bcftools-1.23/bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \ | cut -f 3 \ | grep -v rs \ | wc -l
or equivalently:
/home/ctools/bcftools-1.23/bcftools view -H --types snps NA24694.vcf.gz chr20:32000000-33000000 \ | cut -f 3 \ | grep "\." \ | wc -l
Answer: 17 novel SNPs
Q7
You found:
- 956 total SNPs in the region (Q3)
- 17 novel SNPs (Q6)
Only ~1.8% of SNPs are novel.
This is expected because:
- Han Chinese individuals are extremely well represented in dbSNP and 1000 Genomes.
- dbSNP contains >100 million known variants, so most common variation is already catalogued.
- Novel variants tend to be rare or extremely rare.
Conclusion: The number of novel variants is small and biologically reasonable.