Longread exercise answers

Q1

Counting all the lines minus the header gives us:

zcat BGI_hg38_chr20.vcf.gz |grep -v "^#"|wc -l

1878 variants

Q2

We can try the following:

zcat BGI_hg38_chr20.vcf.gz |grep -v "^#" |cut -f 10  |sed "s/:.*//g"|sort | uniq -c |sort -n

grep -v "^#" grep -v: Inverts the match, i.e., selects lines that do not match the given pattern. "^#": The pattern to match lines starting with a hash (#). These lines are usually headers or comments in VCF files.
cut -f 10
cut: Extracts specific fields from each line. -f 10: takes the 10th field, i.e. the genotype information.
sed "s/:.*//g" "s/:.*//g": Removes everything after the first colon in each line. This leaves us with the GT field (e.g., 0/1, 1/1).
sort Sorts lines of text in alphabetical order.
uniq -c Counts the number of occurrences of each unique line. Here, it counts each unique genotype.
sort -n : Sorts the output numerically. This will arrange the genotypes in ascending order based on their frequency.

So 85+182=267 variants are already phased.

Q3

chr20	2855611	rs4364082	T	C	938.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=0.690;DB;DP=47;ExcessHet=3.0103;FS=10.989;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=19.97;ReadPosRankSum=0.976;SOR=0.117	GT:AD:DP:GQ:PGT:PID:PL:PS	0|1:21,26:47:99:0|1:2855611_T_C:946,0,745:2855611
chr20	2855618	rs6051444	C	T	886.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=1.09;DB;DP=46;ExcessHet=3.0103;FS=14.200;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=19.27;ReadPosRankSum=0.855;SOR=0.387	GT:AD:DP:GQ:PGT:PID:PL:PS	0|1:22,24:46:99:0|1:2855611_T_C:894,0,797:2855611

They are 6 bases away from each other.

Q4

Sequences are quite long compared to Illumina/BGI

samtools view  HG002_pacbio.bam |awk '{print length($10)}' | awk '{sum+=$1; n++} END {if(n>0) print sum/n}'

Q5:

Aligning+indexing:

/home/ctools/minimap2/minimap2-2.26_x64-linux/minimap2 -a   /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.mini /home/projects/22126_NGS/exercises/long_reads/HG002_pacbio.fq.gz |samtools view -uS - |samtools sort /dev/stdin > HG002_pacbio.bam 
samtools index HG002_pacbio.bam

Then:

samtools view  HG002_pacbio.bam |awk '{print length($10)}' | awk '{sum+=$1; n++} END {if(n>0) print sum/n}'

9474.43 so almost 10kb.

Q6: yes

Q7:

First phase:

whatshap phase  --ignore-read-groups  --reference=/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -o BGI_hg38_chr20_phased.vcf.gz BGI_hg38_chr20.vcf.gz HG002_pacbio.bam

Then run:

zcat BGI_hg38_chr20_phased.vcf.gz |grep -v "^#" |cut -f 10  |sed "s/:.*//g"|sort | uniq -c |sort -n

To get:

so 550+616=1166 phased variants compared to 267 previously.

Longread exercise answers

Navigation menu

Search