<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://teaching.healthtech.dtu.dk/22126/index.php?action=history&amp;feed=atom&amp;title=Longread_exercise_answers</id>
	<title>Longread exercise answers - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://teaching.healthtech.dtu.dk/22126/index.php?action=history&amp;feed=atom&amp;title=Longread_exercise_answers"/>
	<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22126/index.php?title=Longread_exercise_answers&amp;action=history"/>
	<updated>2026-05-04T04:35:48Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.41.0</generator>
	<entry>
		<id>https://teaching.healthtech.dtu.dk/22126/index.php?title=Longread_exercise_answers&amp;diff=40&amp;oldid=prev</id>
		<title>WikiSysop: Created page with &quot;&#039;&#039;&#039;Q1&#039;&#039;&#039;  Counting all the lines minus the header gives us: &lt;pre&gt; zcat BGI_hg38_chr20.vcf.gz |grep -v &quot;^#&quot;|wc -l  &lt;/pre&gt;  1878 variants   &#039;&#039;&#039;Q2&#039;&#039;&#039;  We can try the following: &lt;pre&gt; zcat BGI_hg38_chr20.vcf.gz |grep -v &quot;^#&quot; |cut -f 10  |sed &quot;s/:.*//g&quot;|sort | uniq -c |sort -n &lt;/pre&gt;   &lt;OL&gt; &lt;LI&gt;grep -v &quot;^#&quot;  grep -v: Inverts the match, i.e., selects lines that do not match the given pattern. &quot;^#&quot;: The pattern to match lines starting with a hash (#). These lines are usually he...&quot;</title>
		<link rel="alternate" type="text/html" href="https://teaching.healthtech.dtu.dk/22126/index.php?title=Longread_exercise_answers&amp;diff=40&amp;oldid=prev"/>
		<updated>2024-03-19T15:41:44Z</updated>

		<summary type="html">&lt;p&gt;Created page with &amp;quot;&amp;#039;&amp;#039;&amp;#039;Q1&amp;#039;&amp;#039;&amp;#039;  Counting all the lines minus the header gives us: &amp;lt;pre&amp;gt; zcat BGI_hg38_chr20.vcf.gz |grep -v &amp;quot;^#&amp;quot;|wc -l  &amp;lt;/pre&amp;gt;  1878 variants   &amp;#039;&amp;#039;&amp;#039;Q2&amp;#039;&amp;#039;&amp;#039;  We can try the following: &amp;lt;pre&amp;gt; zcat BGI_hg38_chr20.vcf.gz |grep -v &amp;quot;^#&amp;quot; |cut -f 10  |sed &amp;quot;s/:.*//g&amp;quot;|sort | uniq -c |sort -n &amp;lt;/pre&amp;gt;   &amp;lt;OL&amp;gt; &amp;lt;LI&amp;gt;grep -v &amp;quot;^#&amp;quot;  grep -v: Inverts the match, i.e., selects lines that do not match the given pattern. &amp;quot;^#&amp;quot;: The pattern to match lines starting with a hash (#). These lines are usually he...&amp;quot;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;&amp;#039;&amp;#039;&amp;#039;Q1&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Counting all the lines minus the header gives us:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
zcat BGI_hg38_chr20.vcf.gz |grep -v &amp;quot;^#&amp;quot;|wc -l &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
1878 variants&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Q2&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
We can try the following:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
zcat BGI_hg38_chr20.vcf.gz |grep -v &amp;quot;^#&amp;quot; |cut -f 10  |sed &amp;quot;s/:.*//g&amp;quot;|sort | uniq -c |sort -n&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;OL&amp;gt;&lt;br /&gt;
&amp;lt;LI&amp;gt;grep -v &amp;quot;^#&amp;quot;  grep -v: Inverts the match, i.e., selects lines that do not match the given pattern. &amp;quot;^#&amp;quot;: The pattern to match lines starting with a hash (#). These lines are usually headers or comments in VCF files.&lt;br /&gt;
&amp;lt;LI&amp;gt;cut -f 10&lt;br /&gt;
&amp;lt;LI&amp;gt; cut: Extracts specific fields from each line. -f 10: takes the 10th field, i.e. the genotype information.&lt;br /&gt;
&amp;lt;LI&amp;gt; sed &amp;quot;s/:.*//g&amp;quot;  &amp;quot;s/:.*//g&amp;quot;: Removes everything after the first colon in each line. This leaves us with the GT field (e.g., 0/1, 1/1).&lt;br /&gt;
&amp;lt;LI&amp;gt; sort Sorts lines of text in alphabetical order.&lt;br /&gt;
&amp;lt;LI&amp;gt; uniq -c  Counts the number of occurrences of each unique line. Here, it counts each unique genotype.&lt;br /&gt;
&amp;lt;LI&amp;gt; sort -n : Sorts the output numerically. This will arrange the genotypes in ascending order based on their frequency.&lt;br /&gt;
&amp;lt;/OL&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
     33 1/2&lt;br /&gt;
     85 1|1&lt;br /&gt;
    182 0|1&lt;br /&gt;
    587 1/1&lt;br /&gt;
    991 0/1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
So 85+182=267 variants are already phased. &lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Q3&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
chr20	2855611	rs4364082	T	C	938.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=0.690;DB;DP=47;ExcessHet=3.0103;FS=10.989;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=19.97;ReadPosRankSum=0.976;SOR=0.117	GT:AD:DP:GQ:PGT:PID:PL:PS	0|1:21,26:47:99:0|1:2855611_T_C:946,0,745:2855611&lt;br /&gt;
chr20	2855618	rs6051444	C	T	886.64	.	AC=1;AF=0.500;AN=2;BaseQRankSum=1.09;DB;DP=46;ExcessHet=3.0103;FS=14.200;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=19.27;ReadPosRankSum=0.855;SOR=0.387	GT:AD:DP:GQ:PGT:PID:PL:PS	0|1:22,24:46:99:0|1:2855611_T_C:894,0,797:2855611&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
They are 6 bases away from each other.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Q4&amp;#039;&amp;#039;&amp;#039;&lt;br /&gt;
&lt;br /&gt;
Sequences are quite long compared to Illumina/BGI&lt;br /&gt;
&lt;br /&gt;
 samtools view  HG002_pacbio.bam |awk &amp;#039;{print length($10)}&amp;#039; | awk &amp;#039;{sum+=$1; n++} END {if(n&amp;gt;0) print sum/n}&amp;#039;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Q5&amp;#039;&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
Aligning+indexing:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
/home/ctools/minimap2/minimap2-2.26_x64-linux/minimap2 -a   /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.mini /home/projects/22126_NGS/exercises/long_reads/HG002_pacbio.fq.gz |samtools view -uS - |samtools sort /dev/stdin &amp;gt; HG002_pacbio.bam &lt;br /&gt;
samtools index HG002_pacbio.bam &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
Then:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
samtools view  HG002_pacbio.bam |awk &amp;#039;{print length($10)}&amp;#039; | awk &amp;#039;{sum+=$1; n++} END {if(n&amp;gt;0) print sum/n}&amp;#039;&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
9474.43 so almost 10kb.&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Q6&amp;#039;&amp;#039;&amp;#039;: yes&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
&amp;#039;&amp;#039;&amp;#039;Q7&amp;#039;&amp;#039;&amp;#039;:&lt;br /&gt;
&lt;br /&gt;
First phase:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
whatshap phase  --ignore-read-groups  --reference=/home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -o BGI_hg38_chr20_phased.vcf.gz BGI_hg38_chr20.vcf.gz HG002_pacbio.bam &lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Then run:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
zcat BGI_hg38_chr20_phased.vcf.gz |grep -v &amp;quot;^#&amp;quot; |cut -f 10  |sed &amp;quot;s/:.*//g&amp;quot;|sort | uniq -c |sort -n&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
To get:&lt;br /&gt;
&amp;lt;pre&amp;gt;&lt;br /&gt;
      7 0/1&lt;br /&gt;
     33 1/2&lt;br /&gt;
    550 1|0&lt;br /&gt;
    616 0|1&lt;br /&gt;
    672 1/1&lt;br /&gt;
&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
so 550+616=1166 phased variants compared to 267 previously.&lt;/div&gt;</summary>
		<author><name>WikiSysop</name></author>
	</entry>
</feed>