WikiSysop: Created page with "'''Q1''' Counting all the lines minus the header gives us:

 zcat BGI_hg38_chr20.vcf.gz |grep -v "^#"|wc -l

1878 variants '''Q2''' We can try the following:

 zcat BGI_hg38_chr20.vcf.gz |grep -v "^#" |cut -f 10  |sed "s/:.*//g"|sort | uniq -c |sort -n

grep -v "^#" grep -v: Inverts the match, i.e., selects lines that do not match the given pattern. "^#": The pattern to match lines starting with a hash (#). These lines are usually he..."

2024-03-19T15:41:44Z

Created page with "'''Q1''' Counting all the lines minus the header gives us: <pre> zcat BGI_hg38_chr20.vcf.gz |grep -v "^#"|wc -l </pre> 1878 variants '''Q2''' We can try the following: <pre> zcat BGI_hg38_chr20.vcf.gz |grep -v "^#" |cut -f 10 |sed "s/:.*//g"|sort | uniq -c |sort -n </pre> <OL> <LI>grep -v "^#" grep -v: Inverts the match, i.e., selects lines that do not match the given pattern. "^#": The pattern to match lines starting with a hash (#). These lines are usually he..."

New page

'''Q1'''

Counting all the lines minus the header gives us:
<pre>
zcat BGI_hg38_chr20.vcf.gz |grep -v "^#"|wc -l
</pre>

1878 variants

'''Q2'''

We can try the following:
<pre>
zcat BGI_hg38_chr20.vcf.gz |grep -v "^#" |cut -f 10 |sed "s/:.*//g"|sort | uniq -c |sort -n
</pre>

<OL>
<LI>grep -v "^#" grep -v: Inverts the match, i.e., selects lines that do not match the given pattern. "^#": The pattern to match lines starting with a hash (#). These lines are usually headers or comments in VCF files.
<LI>cut -f 10
<LI> cut: Extracts specific fields from each line. -f 10: takes the 10th field, i.e. the genotype information.
<LI> sed "s/:.*//g" "s/:.*//g": Removes everything after the first colon in each line. This leaves us with the GT field (e.g., 0/1, 1/1).
<LI> sort Sorts lines of text in alphabetical order.
<LI> uniq -c Counts the number of occurrences of each unique line. Here, it counts each unique genotype.
<LI> sort -n : Sorts the output numerically. This will arrange the genotypes in ascending order based on their frequency.
</OL>

<pre>
33 1/2
85 1|1
182 0|1
587 1/1
991 0/1
</pre>

So 85+182=267 variants are already phased.

'''Q3'''

<pre>
chr20 2855611 rs4364082 T C 938.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.690;DB;DP=47;ExcessHet=3.0103;FS=10.989;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=19.97;ReadPosRankSum=0.976;SOR=0.117 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:21,26:47:99:0|1:2855611_T_C:946,0,745:2855611
chr20 2855618 rs6051444 C T 886.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=1.09;DB;DP=46;ExcessHet=3.0103;FS=14.200;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=19.27;ReadPosRankSum=0.855;SOR=0.387 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:22,24:46:99:0|1:2855611_T_C:894,0,797:2855611
</pre>

They are 6 bases away from each other.

'''Q4'''

Sequences are quite long compared to Illumina/BGI

samtools view HG002_pacbio.bam |awk '{print length($10)}' | awk '{sum+=$1; n++} END {if(n>0) print sum/n}'

'''Q5''':

Aligning+indexing:
<pre>
/home/ctools/minimap2/minimap2-2.26_x64-linux/minimap2 -a /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.mini /home/projects/22126_NGS/exercises/long_reads/HG002_pacbio.fq.gz |samtools view -uS - |samtools sort /dev/stdin > HG002_pacbio.bam
samtools index HG002_pacbio.bam
</pre>
Then:
<pre>
samtools view HG002_pacbio.bam |awk '{print length($10)}' | awk '{sum+=$1; n++} END {if(n>0) print sum/n}'
</pre>

9474.43 so almost 10kb.

'''Q6''': yes

'''Q7''':

First phase:
<pre>
whatshap phase --ignore-read-groups --reference=/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -o BGI_hg38_chr20_phased.vcf.gz BGI_hg38_chr20.vcf.gz HG002_pacbio.bam
</pre>

Then run:

<pre>
zcat BGI_hg38_chr20_phased.vcf.gz |grep -v "^#" |cut -f 10 |sed "s/:.*//g"|sort | uniq -c |sort -n
</pre>

To get:
<pre>
7 0/1
33 1/2
550 1|0
616 0|1
672 1/1
</pre>

so 550+616=1166 phased variants compared to 267 previously.

Longread exercise answers - Revision history