Longread exercise answers

From 22126
Jump to navigation Jump to search

Q1

Counting all the lines minus the header gives us:

zcat BGI_hg38_chr20.vcf.gz |grep -v "^#"|wc -l 

1878 variants


Q2

We can try the following:

zcat BGI_hg38_chr20.vcf.gz |grep -v "^#" |cut -f 10  |sed "s/:.*//g"|sort | uniq -c |sort -n
  1. grep -v "^#" grep -v: Inverts the match, i.e., selects lines that do not match the given pattern. "^#": The pattern to match lines starting with a hash (#). These lines are usually headers or comments in VCF files.
  2. cut -f 10
  3. cut: Extracts specific fields from each line. -f 10: takes the 10th field, i.e. the genotype information.
  4. sed "s/:.*//g" "s/:.*//g": Removes everything after the first colon in each line. This leaves us with the GT field (e.g., 0/1, 1/1).
  5. sort Sorts lines of text in alphabetical order.
  6. uniq -c Counts the number of occurrences of each unique line. Here, it counts each unique genotype.
  7. sort -n : Sorts the output numerically. This will arrange the genotypes in ascending order based on their frequency.
     33 1/2
     85 1|1
    182 0|1
    587 1/1
    991 0/1

So 85+182=267 variants are already phased.

Q3

chr20	2855611	rs4364082	T	C	938.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=0.690;DB;DP=47;ExcessHet=3.0103;FS=10.989;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=19.97;ReadPosRankSum=0.976;SOR=0.117	GT:AD:DP:GQ:PGT:PID:PL:PS	0|1:21,26:47:99:0|1:2855611_T_C:946,0,745:2855611
chr20	2855618	rs6051444	C	T	886.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=1.09;DB;DP=46;ExcessHet=3.0103;FS=14.200;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=19.27;ReadPosRankSum=0.855;SOR=0.387	GT:AD:DP:GQ:PGT:PID:PL:PS	0|1:22,24:46:99:0|1:2855611_T_C:894,0,797:2855611

They are 6 bases away from each other.

Q4

Sequences are quite long compared to Illumina/BGI

samtools view  HG002_pacbio.bam |awk '{print length($10)}' | awk '{sum+=$1; n++} END {if(n>0) print sum/n}'


Q5:

Aligning+indexing:

/home/ctools/minimap2/minimap2-2.26_x64-linux/minimap2 -a   /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.mini /home/projects/22126_NGS/exercises/long_reads/HG002_pacbio.fq.gz |samtools view -uS - |samtools sort /dev/stdin > HG002_pacbio.bam 
samtools index HG002_pacbio.bam 

Then:

samtools view  HG002_pacbio.bam |awk '{print length($10)}' | awk '{sum+=$1; n++} END {if(n>0) print sum/n}'

9474.43 so almost 10kb.

Q6: yes


Q7:

First phase:

whatshap phase  --ignore-read-groups  --reference=/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -o BGI_hg38_chr20_phased.vcf.gz BGI_hg38_chr20.vcf.gz HG002_pacbio.bam 

Then run:

zcat BGI_hg38_chr20_phased.vcf.gz |grep -v "^#" |cut -f 10  |sed "s/:.*//g"|sort | uniq -c |sort -n

To get:

      7 0/1
     33 1/2
    550 1|0
    616 0|1
    672 1/1

so 550+616=1166 phased variants compared to 267 previously.