SNP calling exercise part 2 answers

2024-12-15T15:28:16Z

Gabre:

'''Q1'''

First, running:
<pre>
gatk --java-options "-Xmx10g" HaplotypeCaller -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -I /home/projects/22126_NGS/exercises/snp_calling/NA24694.bam -L chr20 -O NA24694.gvcf.gz --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -ERC GVCF
</pre>
then indexing the resulting GVCF:
<pre>
tabix -f -p vcf NA24694.gvcf.gz
</pre>
followed by:

<pre>
gatk GenotypeGVCFs -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -V NA24694.gvcf.gz -L chr20 --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -O NA24694.vcf.gz
</pre>

<pre>
bcftools stats NA24694.vcf.gz
</pre>

Should give you:
<pre>
SN 0 number of SNPs: 75684
</pre>
so 75684 SNPs.

'''Q2'''

First run:
<pre>
tabix -p vcf NA24694.vcf.gz
</pre>

Then:
<pre>
tabix NA24694.vcf.gz chr20:32000000-33000000 |wc -l
</pre>
So 1290 variant sites.

'''Q3'''

There are 2 ways to do this:
<pre>
bcftools view -H --type=snps NA24694.vcf.gz chr20:32000000-33000000 |wc -l
</pre>
or
<pre>
tabix -h NA24694.vcf.gz chr20:32000000-33000000 |bcftools view -H --type=snps - |wc -l
</pre>
Both will give you: 956.

'''Q4'''

<pre>
tabix NA24694.vcf.gz chr20:32011209-32011209
</pre>
You get:
<pre>
chr20 32011209 rs147652161 G A 264.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-3.010e-01;DB;DP=24;ExcessHet=3.0103;FS=3.949;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=11.03;ReadPosRankSum=-3.580e-01;SOR=0.552 GT:AD:DP:GQ:PL 0/1:15,9:24:99:272,0,533
</pre>

The line above says that the site is most likely heterozygous (GT 0/1): G is the reference allele and A the alternative allele. The allele depth (AD) is 15 Gs and 9 As, the depth (DP) is 24, and the genotype quality (GQ) is 99. The Phred-scaled genotype likelihoods (PL) for homozygous reference, heterozygous and homozygous alternative are 272,0,533; the most likely genotype has a PL of 0.

<pre>
tabix NA24694.vcf.gz chr20:32044279-32044279
</pre>
You get:
<pre>
chr20 32044279 rs4525768 C T 799.06 . AC=2;AF=1.00;AN=2;DB;DP=21;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;QD=28.99;SOR=0.990 GT:AD:DP:GQ:PL 1/1:0,21:21:63:813,63,0
</pre>

The line above says that the site is most likely homozygous for T (GT 1/1): C is the reference allele and T the alternative allele. The allele depth (AD) is 0 Cs and 21 Ts, the depth (DP) is 21, and the genotype quality (GQ) is 63. The Phred-scaled genotype likelihoods (PL) for homozygous reference, heterozygous and homozygous alternative are 813,63,0.
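
To make the per-sample fields easier to read, here is a minimal Python sketch (not part of the original exercise) that pairs the FORMAT keys with the sample values of the first site. The helper name "parse_sample_fields" is hypothetical, and the INFO column is abbreviated for readability.

```python
# Minimal sketch: split one VCF data line into named per-sample fields.
# parse_sample_fields is a hypothetical helper, not part of any library.

def parse_sample_fields(vcf_line):
    """Return a dict mapping FORMAT keys (GT, AD, ...) to the first sample's values."""
    columns = vcf_line.split()             # VCF is tab-separated; split() also accepts spaces
    format_keys = columns[8].split(":")    # 9th column, e.g. GT:AD:DP:GQ:PL
    sample_values = columns[9].split(":")  # 10th column, the first sample
    return dict(zip(format_keys, sample_values))


# Record at chr20:32011209 with a shortened INFO column.
line = ("chr20 32011209 rs147652161 G A 264.64 . AC=1;AF=0.500;AN=2;DB;DP=24 "
        "GT:AD:DP:GQ:PL 0/1:15,9:24:99:272,0,533")
fields = parse_sample_fields(line)
ref_depth, alt_depth = (int(x) for x in fields["AD"].split(","))
print(fields["GT"], ref_depth, alt_depth, fields["GQ"])
```

This prints the genotype, the two allele depths and the genotype quality (0/1 15 9 99), which are the values discussed above.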

'''Q5'''

Both are heterozygous sites; however, this is the better one:
<pre>
chr20 32974911 rs6088051 A G 403.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.693;DB;DP=22;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=59.78;MQRankSum=-6.860e-01;QD=19.22;ReadPosRankSum=-1.703e+00;SOR=0.871 GT:AD:DP:GQ:PL 0/1:8,13:21:99:411,0,247
</pre>

and the worse one:
<pre>
chr20 64291638 rs369221086 C T 30.64 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.530e-01;DB;DP=6;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=28.18;MQRankSum=-9.670e-01;QD=5.11;ReadPosRankSum=-8.420e-01;SOR=0.693 GT:AD:DP:GQ:PL 0/1:4,2:6:38:38,0,114
</pre>

The first site has a depth of 21 (8 As and 13 Gs) and a genotype quality of 99. The second site has a genotype quality of only 38 and a mapping quality (MQ) of 28.18 versus 59.78 for the first. It also has only 4 Cs and 2 Ts for a total depth of 6, which is not sufficient to confidently call a heterozygous site.
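
The link between the genotype likelihoods and the genotype quality can be sketched in Python. This assumes the usual GATK convention that GQ is the gap between the lowest and second-lowest PL, capped at 99; the function name is hypothetical.

```python
# Sketch under an assumption: GQ = (second-lowest PL) - (lowest PL), capped at 99.

def genotype_quality(pl_values, cap=99):
    """Return the genotype quality implied by a list of Phred-scaled likelihoods."""
    best, second = sorted(pl_values)[:2]
    return min(cap, second - best)


good_site = [411, 0, 247]  # PL at chr20:32974911
poor_site = [38, 0, 114]   # PL at chr20:64291638
print(genotype_quality(good_site), genotype_quality(poor_site))
```

Under that assumption the sketch gives 99 and 38, matching the GQ values in the two records: at the poorly covered site the second-best genotype is much closer to the best one.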

'''Q6'''

Either use:
<pre>
bcftools view -H --types=snps NA24694.vcf.gz chr20:32000000-33000000 |cut -f 3 |grep -v rs |wc -l
</pre>

or

<pre>
bcftools view -H --types=snps NA24694.vcf.gz chr20:32000000-33000000 |cut -f 3 |grep "\." |wc -l
</pre>

You will get 17.

'''Q7'''

17 is very few compared to the 956 SNPs in the region. However, this is expected: the individual is Han Chinese, and this population is very well represented in dbSNP.

<pre>
mkdir var_recal/
gatk --java-options "-Xmx10G" VariantRecalibrator -V NA24694.vcf.gz --rscript-file var_recal/NA24694_plots.R -O var_recal/NA24694_recal -mode SNP --tranches-file var_recal/NA24694_tranches -tranche 99.0 -tranche 95.0 -tranche 90.0 -tranche 85.0 -tranche 80.0 -tranche 75.0 -tranche 70.0 -tranche 65.0 -tranche 60.0 -tranche 58.0 -an QD -an DP -an FS -an MQRankSum -an ReadPosRankSum -an SOR -an MQ -resource:hapmap,known=false,training=true,truth=true,prior=15.0 /home/databases/databases/GRCh38/hapmap_3.3.hg38.vcf.gz -resource:omni,known=false,training=true,truth=false,prior=12.0 /home/databases/databases/GRCh38/1000G_omni2.5.hg38.vcf.gz -resource:1000G,known=false,training=true,truth=false,prior=10.0 /home/databases/databases/GRCh38/1000G_phase1.snps.high_confidence.hg38.vcf.gz -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
</pre>

'''Q8'''

Either 60 or 65 should be good.

<pre>
gatk --java-options "-Xmx10G" ApplyVQSR -V NA24694.vcf.gz -O NA24694_sf.vcf.gz --recal-file var_recal/NA24694_recal --tranches-file var_recal/NA24694_tranches -truth-sensitivity-filter-level 65 --create-output-variant-index true -mode SNP
</pre>

'''Q9'''

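Let us not forget to index (this assumes that the sites that passed the filter were first written to NA24694_sf_pass.vcf.gz, for example with "bcftools view -f PASS" as in Q13):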
<pre>
tabix -p vcf NA24694_sf_pass.vcf.gz
</pre>

Then run:
<pre>
bcftools view -H --types=snps NA24694_sf_pass.vcf.gz chr20:32000000-33000000 |wc -l
</pre>
For a total of 416 SNPs after filtering, far fewer than the 956 before.

'''Q10'''

If you use a higher truth sensitivity level, you let more sites through and the number of SNPs increases.

'''Q11'''

First running:

<pre>
gatk VariantFiltration -V NA24694.vcf.gz -O NA24694_hf.vcf.gz -filter "DP < 10.0" --filter-name "DP" -filter "QUAL < 30.0" --filter-name "QUAL30" -filter "SOR > 3.0" --filter-name "SOR3" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40"
</pre>

and

<pre>
bcftools view -H NA24694_hf.vcf.gz |grep -v PASS |wc -l
</pre>

Gives us 4005 sites.

<pre>
bcftools view -H --type=snps NA24694_hf.vcf.gz |grep -v PASS |wc -l
</pre>
2630 SNPs

'''Q12'''

One possibility is:
<pre>
bcftools view -H NA24694_hf.vcf.gz |grep -v PASS |cut -f 7 |sort |uniq -c |sort -n
5 FS60;SOR3
30 DP;MQ40;SOR3
74 MQ40;SOR3
158 DP;SOR3
197 DP;MQ40
390 MQ40
1340 SOR3
1811 DP
</pre>

This removes all lines containing the string "PASS", extracts the seventh column, sorts the values, collapses identical values while counting them, and sorts again in numerical order. At the bottom, you have the filter that removed the most sites, which is depth of coverage (DP).
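
For readers more comfortable with Python, the same tally can be sketched with "collections.Counter". This is only an illustration: the records below are made-up toy lines, the function name is hypothetical, and unlike "grep -v PASS" the sketch looks only at the seventh column.

```python
from collections import Counter

# Sketch: count FILTER strings (7th column) of records that did not pass.

def count_filters(vcf_lines):
    """Return a Counter of FILTER values for non-PASS data lines."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):      # skip header lines
            continue
        filter_value = line.split("\t")[6]
        if filter_value != "PASS":
            counts[filter_value] += 1
    return counts


# Toy records (hypothetical values, only for illustration).
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr20\t100\t.\tA\tG\t50\tPASS\tDP=20",
    "chr20\t200\t.\tC\tT\t40\tDP\tDP=5",
    "chr20\t300\t.\tG\tA\t35\tDP;MQ40\tDP=4",
    "chr20\t400\t.\tT\tC\t60\tDP\tDP=8",
]
counts = count_filters(records)
print(counts.most_common())
```

"most_common()" lists the filters from most to least frequent, so the first entry plays the role of the last line of the sorted shell output.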

'''Q13'''

Initially, we isolate the ones that pass the filter:
<pre>
bcftools view -f PASS NA24694_hf.vcf.gz |bgzip -c > NA24694_hf_pass.vcf.gz
bcftools view -H NA24694_hf_pass.vcf.gz |wc -l
</pre>
88594 total sites (SNPs + indels + multi-allelic).

Then we retain the sites located in the regions listed in filter99.bed.gz using bedtools:
<pre>
bedtools intersect -header -a NA24694_hf_pass.vcf.gz -b /home/databases/databases/GRCh38/filter99.bed.gz |bgzip -c > NA24694_hf_map99.vcf.gz
</pre>

<pre>
bcftools view -H NA24694_hf_map99.vcf.gz |wc -l
</pre>
51624 total sites remain

'''Q14'''
Using:
<pre>
java -jar /home/ctools/snpEff/snpEff.jar eff -dataDir /home/databases/databases/snpEff/ -htmlStats NA24694_hf.html GRCh38.99 NA24694_hf.vcf.gz |bgzip -c > NA24694_hf_ann.vcf.gz
</pre>

In the HTML file you see:
Intron 64.368%

'''Q15'''
In the HTML file you see:

MISSENSE 584 44.242%

So a total of 584 detected mutations can have an impact on the protein sequence.

SNP calling exercise part 1 answers

2024-12-15T15:27:53Z

SNP calling exercise part 1

2024-12-15T15:27:20Z

Gabre:

<H2>Overview</H2>

First:
<OL>
<LI>Navigate to your home directory.
<LI>Create a directory called "variant_call".
<LI>Navigate to the directory you just created.
</OL>

We will:
<OL>
<LI>Genotype some whole-genome sequencing data.
<LI>Get acquainted with VCF files.
<LI>Apply soft filtering.
<LI>Apply hard filtering.
<LI>Annotate variants.
</OL>

----

<H2>Genotyping</H2>

We will genotype one chromosome from a BAM file that has been processed using the steps we detailed before. It is from a [https://en.wikipedia.org/wiki/Han_Chinese Han Chinese] male, sequenced to a depth of coverage of 24.6X.
<pre>
/home/projects/22126_NGS/exercises/snp_calling/NA24694.bam
</pre>

It has been indexed. We will first use GATK's HaplotypeCaller; the command line looks something like this:

<pre>
gatk --java-options "-Xmx10g" HaplotypeCaller -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -I [INPUT BAM] -L chr20 -O [OUTPUT ] --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz -ERC GVCF
</pre>

"-R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa" specifies the reference genome, "-L chr20" restricts the analysis to chromosome 20, and "--dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz" adds annotation from SNPs that are known to exist in the human population (see [https://www.ncbi.nlm.nih.gov/snp/ dbSNP]). We should point out that variation from Eurasians is more extensively represented in this database, even though the most genetically diverse populations are found in [https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1740-1 Africa]. Finally, "-ERC GVCF" generates a GVCF (Genomic Variant Call Format) file, which essentially contains all sites instead of just variant sites. The input is the BAM file specified above; you can call the output NA24694.gvcf.gz.

The command takes between '''20 and 30 minutes to run''', so feel free to take a break. Or feel free to copy the one I generated:

<pre>
/home/projects/22126_NGS/exercises/snp_calling/NA24694.gvcf.gz
</pre>

We will have a brief look at the file:

<pre>
zcat NA24694.gvcf.gz|less -S
</pre>

You will see several lines starting with '#'; this is the header portion. Scroll down to see the calls. The first 5 columns are the name of the chromosome, the coordinate, the SNP ID (from dbSNP), the reference base and the alternative base. You should see a mix of sites: some with only <NON_REF> in the fifth column (the alternative base), which are probably invariant, and others where there is a variant in the fifth column.
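
As a rough illustration of that mix, the following Python sketch sorts GVCF lines into header lines, reference blocks and candidate variant sites based on the fifth column. The function name and the toy records are hypothetical and not taken from NA24694.gvcf.gz.

```python
# Sketch: classify GVCF lines by looking at the ALT column (5th column).

def classify_gvcf_line(line):
    """Return 'header', 'reference block' or 'variant' for one GVCF line."""
    if line.startswith("#"):
        return "header"
    alt = line.split("\t")[4]
    # A line whose only ALT allele is <NON_REF> carries no observed variant.
    return "reference block" if alt == "<NON_REF>" else "variant"


# Toy lines (hypothetical values, only for illustration).
toy_lines = [
    "##fileformat=VCFv4.2",
    "chr20\t1000\t.\tT\t<NON_REF>\t.\t.\tEND=1100\tGT:DP\t0/0:20",
    "chr20\t1101\t.\tC\tT,<NON_REF>\t500\t.\tDP=25\tGT\t0/1",
]
labels = [classify_gvcf_line(line) for line in toy_lines]
print(labels)
```

Variant sites in a GVCF still list <NON_REF> after the observed alternative allele, which is why the sketch compares the whole column rather than searching for the string.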

Before running GATK on the GVCF file, we need to index it:
<pre>
tabix -f -p vcf [VCF to index]
</pre>

Next, we will use the GVCF to generate a VCF that contains only variant sites:
<pre>
gatk GenotypeGVCFs -R /home/databases/references/human/GRCh38_full_analysis_set_plus_decoy_hla.fa -V [INPUT GVCF] -O [OUTPUT VCF] -L chr20 --dbsnp /home/databases/databases/GRCh38/Homo_sapiens_assembly38.dbsnp138.vcf.gz
</pre>

In your case, the input is the file you generated above and the output can be called NA24694.vcf.gz.
This command takes less than 5 minutes to run.

Then let us index the file with the same tabix command:
<pre>
tabix -f -p vcf [VCF to index]
</pre>

Just as "samtools index" creates an index that lets us retrieve portions of a BAM file, tabix is a utility that lets us retrieve portions of a VCF or GVCF file.

<H3> Get acquainted with VCF files </H3>

'''Q1'''

Using:
<pre>
bcftools stats [input vcf]
</pre>
How many SNPs do you have? (hint: find the line with: "SN 0 number of SNPs")

As a reminder, a VCF file is indexed with:

<pre>
tabix -f -p vcf [VCF to index]
</pre>

The index will be stored as a "tbi" file. Then, the VCF file can be queried:

<pre>
tabix [VCF to index] [CHROMOSOME]:[START]-[END]
</pre>

where [CHROMOSOME] is the chromosome name and [START] [END] are the start and end coordinates respectively.
Or for a single coordinate:
<pre>
tabix [VCF to index] [CHROMOSOME]:[COORDINATE]-[COORDINATE]
</pre>
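
To make the region syntax concrete, here is a small Python sketch (not a replacement for tabix) that parses a region string and tests whether a position falls inside it. It assumes that both coordinates are 1-based and inclusive; the function names are hypothetical.

```python
# Sketch: interpret a region string such as "chr20:32000000-33000000".
# Assumption: START and END are 1-based and both included in the region.

def parse_region(region):
    """Split 'CHROMOSOME:START-END' into (chromosome, start, end)."""
    chrom, span = region.split(":")
    start, end = (int(x) for x in span.split("-"))
    return chrom, start, end


def in_region(chrom, pos, region):
    """Return True if the 1-based position lies inside the region."""
    region_chrom, start, end = parse_region(region)
    return chrom == region_chrom and start <= pos <= end


print(in_region("chr20", 32011209, "chr20:32000000-33000000"))
print(in_region("chr20", 64291638, "chr20:32000000-33000000"))
```

With this convention, a single coordinate is simply a region whose start and end are identical.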

'''Q2'''

Using the command above, determine how many variants there are in total in the 1 million base pair region "chr20:32000000-33000000". (hint: remember "wc -l" to count lines).

'''Q3'''

bcftools is a nifty utility that allows us to do various operations on VCF files. Type:

<pre>
bcftools view
</pre>

It has several options. Try to determine how many SNPs (excluding indels and multi-allelic sites) there are in the '''same region''' as above. (hint: you need to filter for a certain '''type''' of variant; hint 2: be careful not to include the header in the number of lines (option: -H)).

'''Q4'''

Use either tabix or bcftools to retrieve the SNP located on chromosome 20 at coordinate 32011209. Tell me in your own words what the genotype is. What about chromosome 20, coordinate 32044279? Become familiar with the different fields in the [https://samtools.github.io/hts-specs/VCFv4.2.pdf VCF specifications]. Pay attention to "1.4 Data lines". For the genotype fields (columns 9 and 10), there is more information on the GATK fields here: [https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format GATK's VCF-Variant-Call-Format].

Also, answer for each site:

<OL>
<LI> What is the allele depth, i.e. how many bases are there of each type?
<LI> What is the depth of coverage, i.e. how many bases cover this site?
<LI> What is your genotype quality?
<LI> What are your genotype likelihoods?
</OL>

'''Q5'''

Inspect the SNPs at positions:
<pre>
chr20 32974911
chr20 64291638
</pre>

One SNP has poor quality, the other has good quality. Which is which? Why do you think this is? (hint: remember from class that the more data you have, the more certain you are).

'''Q6'''

Using the same region, 32000000-33000000, how many SNPs are novel (i.e. do not exist in databases built from a large number of sampled genomes)? Note that the third column is the ID of a known SNP in dbSNP. You can use the '''cut''' command to get a specific column. Also, every ID in dbSNP starts with "rs" and you could use '''grep''' to either retain such lines or filter them out (see option -v).

'''Q7'''

Contrast the number you obtained for question 6 with the one you obtained for question 3. Do you think this is expected?

'''Congratulations you finished the exercise!'''

Please find answers [[SNP_calling_exercise_part_1_answers_|here]]

SNP calling exercise part 2

2024-12-15T15:27:18Z

Gabre:

<H2> Filtering</H2>

We saw that the data contains some calls of poor quality. Ideally, we do not want to carry over these calls into downstream analyses. We will explore how to filter out genotype data.

Please use the VCF file generated in part 1.

<H3> Hard filtering </H3>

We saw that soft filtering learns which variants are "true." However, a major downside of soft filtering is that it does not apply to samples for which we do not have a good representation of the genetic diversity.

An alternative is to do hard filtering, i.e. filtering the variants according to predetermined cutoffs. This has the downside of potentially introducing bias if the filter is correlated with the type of variant (e.g. if heterozygous sites have higher genotype quality).

First, we will consider the following mask file:

<pre>
/home/databases/databases/GRCh38/mask99.bed.gz
</pre>

It is in BED format, which describes genomic intervals. These files are extensively used in next-generation sequencing analyses. Have a look at the first lines; the format is:

<pre>
[chromosome] [start coordinate (0-based)] [end coordinate (1-based)]
</pre>

0-based means that the first base is at coordinate 0 (i.e. 0 1 2 3 ...); 1-based means that the first base is at coordinate 1 (i.e. 1 2 3 4 ...).
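
Mixing the two conventions is a classic source of off-by-one errors, so here is a minimal Python sketch of how a 1-based VCF position relates to a BED interval with a 0-based start and a 1-based end. The function names are hypothetical.

```python
# Sketch: relate 1-based VCF positions to BED intervals
# (BED start is 0-based, BED end is 1-based, as described above).

def vcf_pos_to_bed_interval(pos):
    """Return the (start, end) BED interval covering one 1-based position."""
    return pos - 1, pos


def bed_contains(bed_start, bed_end, pos):
    """Return True if the 1-based position lies inside the BED interval."""
    return bed_start < pos <= bed_end


print(vcf_pos_to_bed_interval(32011209))
print(bed_contains(0, 10, 10), bed_contains(0, 10, 11), bed_contains(10, 20, 10))
```

A BED interval "0 10" therefore covers the first ten bases, positions 1 to 10 in VCF coordinates, and the interval "10 20" starts at position 11.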

This mask file contains a set of genomic regions that you want to '''remove''' from downstream analyses. It is important to note that most genotypers do not take into account the fact that a genomic region can be duplicated. This is why a mappability filter is also a good idea. Before applying it, we filter on the variant annotations with GATK's VariantFiltration:

<pre>
gatk VariantFiltration -V [INPUT VCF] -O [OUTPUT VCF] -filter "DP < 10.0" --filter-name "DP" -filter "QUAL < 30.0" --filter-name "QUAL30" -filter "SOR > 3.0" --filter-name "SOR3" -filter "FS > 60.0" --filter-name "FS60" -filter "MQ < 40.0" --filter-name "MQ40"
</pre>

This is what the different filters mean:

{| class="wikitable"
| '''Filter'''
| '''Meaning'''
|-
| -filter "DP < 10.0"
| sites with less than 10X depth of coverage are filtered out
|-
| -filter "QUAL < 30.0"
| sites with a variant quality less than 30 are filtered out (see the difference between variant quality and genotype quality (GQ) [https://gatk.broadinstitute.org/hc/en-us/articles/360035531392?id=7258 here])
|-
| -filter "SOR > 3.0"
| sites with a strand odds ratio greater than 3.0 are filtered out (strand bias is when one strand is favored over the other, read more [https://gatk.broadinstitute.org/hc/en-us/articles/360036361772-StrandOddsRatio here])
|-
| -filter "FS > 60.0"
| sites where the Phred-scaled p-value of Fisher's exact test for strand bias (FS) is greater than 60 are filtered out (read more about FS [https://gatk.broadinstitute.org/hc/en-us/articles/360036361992-FisherStrand here])
|-
| -filter "MQ < 40.0"
| sites where the root mean square (RMS) mapping quality of the reads covering the site is less than 40 are filtered out
|-
|}

In real life, there are no perfect filters. I suggest you add them progressively, measure their effectiveness and make sure that you do not introduce unwanted bias into your analyses.
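
The logic of the hard filters can be sketched in a few lines of Python. This is only an illustration of the thresholds listed in the table, not how GATK is implemented: the function name and the two example sites are assumptions, and joining the failed filter names in alphabetical order is a guess at the output format.

```python
# Sketch: apply the five hard-filter thresholds to a dict of site annotations.
# Each entry is (filter name, test that is True when the site FAILS the filter).
HARD_FILTERS = [
    ("DP", lambda site: site["DP"] < 10.0),
    ("QUAL30", lambda site: site["QUAL"] < 30.0),
    ("SOR3", lambda site: site["SOR"] > 3.0),
    ("FS60", lambda site: site["FS"] > 60.0),
    ("MQ40", lambda site: site["MQ"] < 40.0),
]


def filter_column(site):
    """Return 'PASS' or the names of the failed filters joined by ';'."""
    failed = sorted(name for name, fails in HARD_FILTERS if fails(site))
    return ";".join(failed) if failed else "PASS"


# Illustrative annotation values for a well-supported and a poorly supported site.
well_supported = {"QUAL": 403.64, "DP": 22, "SOR": 0.871, "FS": 0.0, "MQ": 59.78}
poorly_supported = {"QUAL": 30.64, "DP": 6, "SOR": 0.693, "FS": 0.0, "MQ": 28.18}
print(filter_column(well_supported), filter_column(poorly_supported))
```

Writing the thresholds as a list makes it easy to add filters one at a time and to check how many sites each one affects.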

'''Q1'''

How many sites have been filtered out? Remember that sites that passed the filters have the string '''PASS''' in the seventh column; you can use '''grep''' with the -v option mentioned in part 1 to remove lines containing a specific string. How many SNPs were filtered out?

'''Q2'''

The filtering command leaves in the seventh column a string (specified via --filter-name above) indicating which filter or filters a site failed. Can you modify the command above using commands like '''cut''', '''sort''' and '''uniq''' to determine which filter removed the most sites? If you are just starting to use UNIX, this question might be challenging.

Let's further remove sites that are in genomic regions of poor mappability. We can use bedtools which is a set of utilities to deal with BED files (merge, intersect, etc).

<pre>
bedtools intersect -header -a [INPUT VCF] -b /home/databases/databases/GRCh38/filter99.bed.gz |bgzip -c > [OUTPUT VCF]
</pre>

The "-header" option just says to print the header of [INPUT VCF]. Retain the sites that have '''passed the hard filtering''' and that are contained in filter99.bed.gz, and call the output NA24694_hf_map99.vcf.gz. The 99 refers to the percentage of DNA fragments required to map uniquely at that particular position (read more [http://lh3lh3.users.sourceforge.net/snpable.shtml here]).

'''Q3'''

How many sites did you retain?

----

<H2>Annotation of variants</H2>

An interesting question is always what the genomic context of the variants is: are they in introns, exons or intergenic regions? We will use a program called snpEff to characterize the different genomic variants that we found. You can run the program as such:

<pre>
java -jar /home/ctools/snpEff/snpEff.jar eff -dataDir /home/databases/databases/snpEff/ -htmlStats [OUTPUT HTML] GRCh38.99 [INPUT VCF] |bgzip -c > [OUTPUT VCF]
</pre>

The -dataDir option specifies where to find the data. A human gene database was downloaded here: /home/databases/databases/snpEff/. "GRCh38.99" represents a specific version (hg38) of the human genome. As mentioned in previous exercises, be very careful to use the exact same version of the genome in your analyses.

Run it on the file resulting from hard filtering, prior to the filtering by mappability (NA24694_hf.vcf.gz). Create an output HTML named NA24694_hf.html and a VCF output named NA24694_hf_ann.vcf.gz.

Use Firefox to open the HTML report:

<pre>
firefox NA24694_hf.html
</pre>

and answer the following questions:

'''Q4'''

What is the most common genomic region (exon, downstream, intron, UTR) of the variants we detected?

'''Q5'''

How many variants can lead to a codon change? See explanation about point mutations on [https://en.wikipedia.org/wiki/Point_mutation Wikipedia]

----

Please find answers [[SNP_calling_exercise_part_2_answers|here]]

'''Congratulations you finished the exercise!'''

Note: in these exercises and answers, we sometimes piped the output of "bcftools view" into other programs. Ideally, you should use the flag:
<pre>
-O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
</pre>
and think about which one is appropriate for your situation, depending on whether you are storing data or piping to another program.

Unix pipes

2024-12-10T16:20:57Z

Gabre:

Welcome! This small guide contains information and examples that will help you as you start learning Linux.
In addition to simple Linux commands, this tutorial also provides three Python scripts that you can run to try out the concepts yourself.

# Basic Unix Commands and Concepts
This is where your journey begins!
If you are new to Unix-based systems and are used to Windows or macOS, you need a brief explanation of a few key commands and concepts. Otherwise, you will have a heart attack every time you write some code, hoping that you did not delete something important (speaking from experience).

## cd (Change Directory)

The `cd` command is used to navigate between directories (folders) in a Unix-based system.
For example, if you are in a directory called home, and you want to move to a directory inside it called documents, you would type:

```bash
cd documents
```

If you want to move to the parent directory, you can use:

```bash
cd ..
```

If you ever want to return to your home directory, simply type:

```bash
cd
```

Also, you can combine some of them! If you want to move to the parent folder, and then go to another directory from there, you can simply write:

```bash
cd ../directory_path
```

## ls (List Directory Contents)

The `ls` command lists the contents of the current directory you are in. It shows all files and subdirectories within that directory.
For example, to see what files and directories are inside the current folder, type:

```bash
ls
```

You can also add options to ls to view more details. For instance:

- `ls -l` lists files with detailed information, such as file permissions, size, and modification dates.
- `ls -a` lists all files, including hidden ones (files that start with a dot .).

## mkdir (Creating Directories)

The `mkdir` (make directory) command is used for creating new directories (folders) within the Unix file system. Organizing files into directories helps maintain a structured and manageable file system.
You can create directories from your current directory using `mkdir` like this:

```bash
mkdir 'directory_path'
```
For example, if you are in a directory named `my_directory` and want to create a directory named `my_new_directory`, you will write:

```bash
mkdir my_new_directory
```
The directory is created without any notification, but you can check that it exists by using `ls`. The output should look like this:

```
*other folders or files
my_new_directory
```

Checking it yourself is fine, but it would be better to be notified when the directory is created. For that, you can use the flag `-v`! The 'v' here means `verbose`: it reports each directory as it is created. How does it notify you? By printing a message to your terminal, since the terminal is where the standard output goes. What is standard output? We will talk about it later!

```bash
mkdir -v my_new_directory
```
This command now prints a message confirming that the directory was created.

Now imagine you need to create a folder inside a folder, which is itself inside another folder. Creating them one by one would not be that hard, but what if you need to create 20 folders like that? Instead of doing that by hand, you can use another flag, `-p`! The `-p` flag creates the parent directories as well, if they do not already exist. You can do this as follows:

```bash
mkdir -p my_new_directory/my_another_new_directory/unix_tutorial
```

This command creates every directory in the path that does not exist yet. You can also combine the flags `-v` and `-p` to be notified at every step.

You may ask yourself why we separate the directories with `/` but do not put one before the first directory. A `/` at the very first position tells your system that the path starts from the `root` directory. So if you add `/` before `my_new_directory`, your system will create the folders not from your current location but from the root directory. You can still use this form when you want to create a directory somewhere other than your current location.
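
If you ever need the same behaviour inside a Python script, the standard library offers an equivalent of `mkdir -p`. The sketch below is only an illustration: it works inside a temporary scratch directory so that nothing in your real folders is touched, and it reuses the directory names from the example above.

```python
import tempfile
from pathlib import Path

# Scratch location, so that nothing in your real folders is touched.
base = Path(tempfile.mkdtemp())

# Equivalent of: mkdir -p my_new_directory/my_another_new_directory/unix_tutorial
nested = base / "my_new_directory" / "my_another_new_directory" / "unix_tutorial"
nested.mkdir(parents=True, exist_ok=True)
nested.mkdir(parents=True, exist_ok=True)  # running it again is not an error

print(nested.is_dir())

# A leading "/" makes a path absolute (it starts from the root directory).
print(Path("my_new_directory").is_absolute(), Path("/my_new_directory").is_absolute())
```

`parents=True` plays the role of `-p`, and `exist_ok=True` keeps the call from failing when the directory is already there.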

## htop

`htop` is an interactive and user-friendly process viewer for Unix systems. It provides a real-time, color-coded display of system processes, CPU usage, memory consumption, and more. If you are used to Windows systems, `htop` is similar to the Task Manager.
You can open `htop` by simply writing:

```bash
htop
```
After running it, you should get a screen like the following:

![htop](https://github.com/user-attachments/assets/0ca69cd7-05e0-40d7-ba0f-8f539fda5b91)

The best thing (for me) that `htop` provides is mouse navigation. You can click the buttons on the green line and access CPU usage, memory usage and so on.

## time

`time` is a tiny command that measures the execution time of a command or script: you simply put it in front of the command you want to measure. It reports three different measurements:

```
real: Total elapsed (wall-clock) time from the start of the command to its end.
user: CPU time spent in user mode. This is the time the CPU spends running your own code.
sys: CPU time spent in kernel mode. This is the time spent on system calls, such as reading from or writing to files and pipes.
```
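To get a feeling for the three numbers, here is a rough sketch that collects comparable values from inside Python. It is not how the `time` command itself is implemented; the helper name `measure` is an assumption made for this example, and it uses `os.times()` for the CPU figures and `time.perf_counter()` for the wall clock.

```python
import os
import time


def measure(func, *args):
    """Return (real, user, sys) seconds spent while running func(*args)."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    func(*args)
    cpu_end = os.times()
    wall_end = time.perf_counter()
    real = wall_end - wall_start                # elapsed wall-clock time
    user = cpu_end.user - cpu_start.user        # CPU time in user mode
    system = cpu_end.system - cpu_start.system  # CPU time in kernel mode
    return real, user, system


real, user, system = measure(sum, range(1_000_000))
print(f"real {real:.3f}s  user {user:.3f}s  sys {system:.3f}s")
```

For a pure calculation like this one, `user` should make up most of `real`, while `sys` stays close to zero because almost no system calls are involved.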

# stdout, stdin and stderr

## stdout (Standard Output)

`stdout` stands for "standard output", where a program sends its regular output.
In most cases, this is your terminal screen. For example, when a command or program runs successfully, the result is displayed on `stdout`, i.e. your terminal.
You can redirect this output to a file if you don’t want it displayed on the screen.

Let's say you have a program, `hello_world.py`, that simply writes "Hello World!" to the terminal and looks like this:

```python
print("Hello World!")
```

When you run this command in Linux by writing `python3 hello_world.py` you will see the output `Hello World!` in your terminal.

Let's break down this command together. First, we write `python3`, which starts the Python interpreter on Unix-based systems. Then we say which file should be run; in this case, our little program is `hello_world.py`. When you give only these two, the program writes `Hello World!` to the terminal.

Cool, right? But what if you want to write this output to a text file named `greeting.txt`? The first way to achieve this is to change the program itself, like this:

```python
import sys

# Redirect stdout to a file
with open("greeting.txt", "w") as file:
    sys.stdout = file
    print("Hello World!")

# Point stdout back at the terminal once the file is closed
sys.stdout = sys.__stdout__
```

After this change, `python3 hello_world.py` creates `greeting.txt` and writes `Hello World!` into it. The program still writes to `stdout`, but the destination of `stdout` has been changed to the file. It works, but it is a bit tedious.
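As a side note, the standard library offers `contextlib.redirect_stdout`, which restores `stdout` automatically when the block ends. The sketch below is an alternative, not the approach this tutorial builds on; the function name `write_greeting` is only illustrative.

```python
import contextlib


def write_greeting(path):
    """Write the greeting to `path` by temporarily redirecting stdout."""
    with open(path, "w") as file, contextlib.redirect_stdout(file):
        print("Hello World!")
    # Outside the `with` block, stdout points at its previous target again.
```

Even so, this still means editing the program every time the destination changes, which is exactly what the shell lets us avoid.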

The second, somewhat easier way is to use the `file descriptors` of Linux directly. Through file descriptors, the shell lets you manipulate the outputs, errors, and inputs of programs without touching their code. Using the very first version of `hello_world.py`, you can achieve the same result like this:

```bash
python3 hello_world.py > greeting.txt
```

The `>` operator here is one of the basic redirection operators that work on `file descriptors` in Linux. Used like this, it redirects the `output` of the program into a file (overwriting the file if it already exists). We will talk about it in detail in the next chapters.

There may also be situations where you want to discard the output of a program. You can do this with file descriptors as well. `/dev/null` is a special file that acts like a black hole, so to speak: everything you send there is lost. Suppose we don't want to see the output of `hello_world.py`. We can achieve this as follows:

```bash
python3 hello_world.py > /dev/null
```

## stdin (Standard Input)

`stdin` stands for "standard input" and is where a program receives its input. By default, this is the keyboard, but it can also come from a file or the output of another command.
For example, if you run a command and are prompted to type something, that input is coming from `stdin`.

Imagine our `hello_world.py` also says our name! As the program can not know your name, "legally", you need to specify this. You can give your name like this:

```bash
python3 hello_world.py ozgur
```

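Aaaaand it won't work: the program still prints only `Hello World!`. That is because your Python code, as written, does not look at the `arguments` on the command line at all. (Strictly speaking, command-line arguments are a separate input channel from `stdin`; we will read from the real `stdin` in a moment.) The library named `argparse` in Python helps you take inputs from the command line! When you set up argparse and modify your code accordingly, it will read the argument from the command line and process it.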

We can modify our little code like this:

```python
import argparse

def main():
    parser = argparse.ArgumentParser(description="Greeting Message")
    parser.add_argument('name', nargs='?', help='Your name to greet correctly')
    args = parser.parse_args()

    print(f"Hello World! {args.name}")

if __name__ == "__main__":
    main()
```

Well, that's quite a big modification.

Here, `parser` is an instance of the `argparse.ArgumentParser` class. We add an argument to it and name it `name`. Then we call `parser.parse_args()` to read the arguments from the command line. The result, `args`, stores each argument under its name, so the name you typed on the command line is available as `args.name`. Now, if you call the code like:

```bash
python3 hello_world.py ozgur
```

You will get:

```
Hello World! ozgur
```

Even if it doesn't make sense, we were able to get our output right, that's something.

Now imagine you have two Python programs. The first one picks a random name and the second one prints `Hello World! [name]` with the chosen name (our little program). You could run the first program, see what it outputs, and then run the second one by typing that output in by hand. That is no trouble when you handle only one name at a time, but imagine typing in 50 random names. To avoid this tedious work, you can use `pipes`! A pipe (`|`) is an operator in Unix-based systems that connects the `stdout` of one program to the `stdin` of another. When you use the `pipe` operator, you do not need `argparse`: the input no longer arrives as a command-line argument but on `stdin`, which behaves like a file, so you need to read it like a file.

Let's name our first code `random_name_generator.py`:

```python
import random

names = [
"Anders", "Niels", "Jens", "Poul", "Lars", "Morten", "Søren", "Thomas", "Peter", "Martin",
"Henrik", "Jesper", "Frederik", "Kasper", "Rasmus", "Svend", "Jacob", "Simon", "Mikkel", "Christian",
"Brian", "Steffen", "Jonas", "Mark", "Daniel", "Carsten", "Torben", "Bent", "Erik", "Michael",
"Viggo", "Oskar", "Emil", "Victor", "Alexander", "Sebastian", "Oliver", "William", "Noah", "Lasse",
"Mads", "Bjørn", "Leif", "Gunnar", "Elias", "August", "Aksel", "Finn", "Ebbe", "Vladimir",
"Anne", "Karen", "Pia", "Mette", "Lise", "Hanne", "Rikke", "Sofie", "Camilla", "Maria",
"Julie", "Christine", "Birthe", "Tine", "Kirsten", "Ingrid", "Line", "Trine", "Kristine", "Mia",
"Cecilie", "Charlotte", "Emma", "Ida", "Nadia", "Sanne", "Sara", "Eva", "Helene", "Nanna",
"Maja", "Lærke", "Molly", "Stine", "Emilie", "Amalie", "Signe", "Freja", "Isabella", "Tuva",
"Viktoria", "Ane", "Dorte", "Laura", "Asta", "Marie", "Clara", "Sofia", "Filippa", "Ella",
"Alex", "Robin", "Kim", "Sam", "Alexis", "Charlie", "Taylor", "Jamie", "Morgan", "Riley"
]

# Select 10 random names without replacement
random_names = random.sample(names, 10)

# Print each name on a separate line
for name in random_names:
    print(name)
```

And after a few adjustments, our `hello_world.py`:
```python
import sys

def main():
    # Reading names, one per line, from stdin
    for line in sys.stdin:
        name = line.strip()  # Remove the trailing newline and spaces
        if name:  # Skip empty lines
            print(f"Hello World! {name}")

if __name__ == "__main__":
    main()
```

You can achieve the given task using pipes like this:

```bash
python3 random_name_generator.py | python3 hello_world.py
```

or with file descriptors, going through an intermediate file:

```bash
python3 random_name_generator.py > names.txt
python3 hello_world.py < names.txt
```

Both work perfectly, but notice how much easier `pipes` are for this type of task than file descriptors with an intermediate file.
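If you are curious what the shell does for you when you write `a | b`, the following sketch wires two processes together by hand with the `subprocess` module. The two one-line programs in `PRODUCER` and `CONSUMER` are stand-ins invented for this example so that it does not depend on the tutorial's script files.

```python
import subprocess
import sys

# Two tiny stand-in programs, run with `python -c` to keep the sketch self-contained.
PRODUCER = "print('Anders'); print('Maria')"
CONSUMER = "import sys\nfor line in sys.stdin:\n    print('Hello World! ' + line.strip())"


def run_pipeline():
    """Connect the producer's stdout to the consumer's stdin, like `producer | consumer`."""
    producer = subprocess.Popen([sys.executable, "-c", PRODUCER], stdout=subprocess.PIPE)
    consumer = subprocess.Popen(
        [sys.executable, "-c", CONSUMER],
        stdin=producer.stdout,
        stdout=subprocess.PIPE,
        text=True,
    )
    # Close our copy so the producer notices if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output
```

That is a lot of ceremony compared to a single `|` character, which is why pipes are usually written in the shell.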

## stderr (Standard Error)

`stderr` stands for "standard error" and is used by programs to send error messages or diagnostics.
By default this is also shown on your terminal screen, but it is a separate stream from `stdout`. When both are printed to your terminal it is hard to tell them apart, so in general it is better to redirect one of them.

Let's say we want to print a status message from `hello_world.py`. After every line written to `stdout`, it should print the status message `Name greeted: name`. We can print it directly with the `print` function like this:

```python
import sys

def main():
    # Reading names
    for line in sys.stdin:
        name = line.strip()  # Stripping
        if name:
            print(f"Hello World! {name}")
            print(f"Name greeted: {name}")

if __name__ == "__main__":
    main()
```

When you run this code, it will output something like this:

```
Hello World! Maria
Name greeted: Maria
Hello World! Anders
Name greeted: Anders
...
```

It works, but it is not what we want to achieve: the `status message` is still going to `stdout`.

If you redirect `stdout` using a file descriptor, both kinds of messages will still end up in the same place. So first we need to send the status message to `stderr`, and then change the `output location of stderr`.

We can send the status message to `stderr` like this:

```python
import sys

def main():
    for line in sys.stdin:
        name = line.strip()
        if name:
            print(f"Hello World! {name}")
            print(f"Name greeted: {name}", file=sys.stderr)

if __name__ == "__main__":
    main()
```
`file` is an argument of the `print` function in Python that specifies where the output goes. If you pass an open file object to that argument, the text is written there. Its default value is `sys.stdout`, so basically `stdout`. You can change the destination by setting `file=sys.stderr`.

Now we want to redirect this status message into a file named `status.txt`. As we did before, we can use `file descriptors`! Let's try it like this:

```bash
python3 random_name_generator.py | python3 hello_world.py > status.txt
```

Did not work, right? The greetings ended up in `status.txt` while the status messages still appeared on the terminal. That's because the `>` operator redirects only `stdout`. If we want to redirect `stderr`, we specify this with `2>`! But why a number, when we did not use one for redirecting `stdout`? Each of `stdin`, `stdout`, and `stderr` has its own file descriptor number:

- Standard Input (stdin): File descriptor 0
- Standard Output (stdout): File descriptor 1
- Standard Error (stderr): File descriptor 2

For the `>` operator the default is `stdout`, so `>` is simply shorthand for `1>` and you do not need to write the number explicitly.
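The descriptor numbers are not just a shell convention; a program can write to them directly. The sketch below uses `os.write`, which takes a descriptor number and raw bytes. The names `STANDARD_FDS` and `write_to_fd` are invented for this illustration.

```python
import os

# The three standard streams and their conventional descriptor numbers.
STANDARD_FDS = {"stdin": 0, "stdout": 1, "stderr": 2}


def write_to_fd(fd, text):
    """Write text straight to a file descriptor, bypassing print()."""
    return os.write(fd, text.encode())


write_to_fd(STANDARD_FDS["stdout"], "this goes to descriptor 1\n")
write_to_fd(STANDARD_FDS["stderr"], "this goes to descriptor 2\n")
```

If you save this as a script and run it with `2> errors.txt`, only the second line should land in the file, while the first one still shows up on the terminal.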

Based on this information, we can redirect our status message into `status.txt` with the following command:

```bash
python3 random_name_generator.py | python3 hello_world.py 2> status.txt
```
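The same separation can be observed from Python. As a hedged sketch, the snippet below starts a small child program (the string `CHILD`, made up for this example) and captures its two streams separately with `subprocess.run(..., capture_output=True)`.

```python
import subprocess
import sys

# A stand-in child program that writes one line to each stream.
CHILD = (
    "import sys\n"
    "print('Hello World! Maria')\n"
    "print('Name greeted: Maria', file=sys.stderr)\n"
)


def run_and_split():
    """Run the child and return its stdout and stderr as separate strings."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout, result.stderr
```

The greeting arrives in the first string and the status message in the second, which mirrors what `>` and `2>` do on the command line.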

That's the end of this chapter. Next, we will talk about a real-world implementation of all the concepts above.

# Real World Example

Welcome! This part of the tutorial provides a real-world example where you can use what you have learned above. All of the code examples below can be found in this GitHub repository. So let's get started!

# Random Integer Generator

Let's see the script first:

```python
import sys
import random as r
import argparse
import time

parser = argparse.ArgumentParser(description="This program generates random integers within a given interval.")
parser.add_argument(
    "num_of_nums",
    metavar="n",
    type=int,
    nargs="?",
    default=100,
    help="number of generated numbers (default: 100)")
parser.add_argument(
    "--min",
    metavar="min",
    type=int,
    default=10,
    help="minimum value of the interval (default: 10)")
parser.add_argument(
    "--max",
    metavar="max",
    type=int,
    default=100,
    help="maximum value of the interval (default: 100)")
parser.add_argument(
    "--output", "-o",
    metavar="FILE",
    type=str,
    default="random_numbers.txt",
    help="output file to write the numbers (default: random_numbers.txt)")
args = parser.parse_args()

def random_int_generator(number_of_numbers = 100, min_interval = 10, max_interval = 100):
    """Generates random integers within a specified interval and writes them to outputs.txt."""
    with open("outputs.txt", "w") as file:  # Open the file for writing
        for _ in range(number_of_numbers):
            num = r.randint(min_interval, max_interval)
            file.write(f"{num}\n")  # Write each number to the file, one per line

def main():
    """Main function to run the random integer generator and measure its runtime."""
    # Record the start time
    start_time = time.perf_counter()

    # Run the random integer generator
    random_int_generator(args.num_of_nums, args.min, args.max)

    # Record the end time
    end_time = time.perf_counter()

    # Calculate the runtime
    runtime = end_time - start_time

    # Print the runtime to stderr to keep it separate from the generated numbers
    print(f"The runtime of the random integer generator is {runtime:.6f} seconds", file=sys.stderr)

if __name__ == "__main__":
    main()
```
where the arguments (or flags) are:

- `n` (positional, optional): The number of random integers to generate. Defaults to 100 if not specified.
- `--min` (optional): The minimum value of the interval. Defaults to 10.
- `--max` (optional): The maximum value of the interval. Defaults to 100.
- `--output` (optional): The output file for the random numbers. Defaults to `random_numbers.txt`.

Ensure that the `min` value is less than or equal to the `max` value to avoid errors.
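The script above does not check the interval itself. If you wanted it to, one possible sketch is shown below; the function `parse_interval` is hypothetical and not part of the original script. It uses `parser.error`, which prints a usage message to `stderr` and exits with status 2.

```python
import argparse


def parse_interval(argv):
    """Parse --min/--max and reject an empty interval before any numbers are generated."""
    parser = argparse.ArgumentParser(description="Interval validation sketch")
    parser.add_argument("--min", type=int, default=10)
    parser.add_argument("--max", type=int, default=100)
    args = parser.parse_args(argv)
    if args.min > args.max:
        # parser.error prints a usage message to stderr and exits with status 2.
        parser.error(f"--min ({args.min}) must not be greater than --max ({args.max})")
    return args.min, args.max
```

Failing early with a clear message is friendlier than letting `random.randint` raise a `ValueError` deep inside the generation loop.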

The script is designed to generate random integers within a specified interval. It can be executed from the command line with optional arguments that specify the number of integers to generate and the range of values. The script itself prints none of the numbers, since they are written directly to a file named `outputs.txt`. At this point, you should be saying: `But wait! We learned that connecting scripts with pipes does not require creating files!` You are correct. We really do not need that output file, since we will be connecting the scripts directly. To achieve this, let's change this part:

```python
def random_int_generator(number_of_numbers = 100, min_interval = 10, max_interval = 100):
    """Generates random integers within a specified interval and writes them to outputs.txt."""
    with open("outputs.txt", "w") as file:  # Open the file for writing
        for _ in range(number_of_numbers):
            num = r.randint(min_interval, max_interval)
            file.write(f"{num}\n")  # Write each number to the file, one per line
```
into this:
```python
def random_int_generator(number_of_numbers, min_interval, max_interval):
    """Generates random integers within a specified interval and writes them to stdout."""
    for _ in range(number_of_numbers):
        num = r.randint(min_interval, max_interval)
        print(num, file=sys.stdout)
```
Voilà! Now it prints everything to stdout, as we discussed in the previous section.

To generate 50 random integers between 1 and 50, you would run this code as:

```bash
python3 random_int_generator.py 50 --min 1 --max 50
```

This code also provides the `runtime`, which is printed out directly to the `stderr`. By measuring the runtime, you can evaluate how quickly the program generates the desired number of random integers within the specified interval. This information is crucial for optimizing the code, especially when scaling up to generate larger datasets or integrating the generator into larger applications where performance may impact overall system efficiency. And while this process is ongoing, you can check memory or CPU usage by using `htop`, as we talked about in the previous section.

We will check the runtimes after the introduction of all three scripts :).

# Prime Checker (Naive)
The code for the naive approach looks like this:
```python
import math
import argparse
import time
import sys

parser = argparse.ArgumentParser(description="Prime Number Checker. This program checks if the input numbers are prime and writes the primes to standard output.")
parser.add_argument(
    'input_file',
    nargs='?',
    type=str,
    default='-',
    help='Path to the input file containing numbers to check. Use "-" or omit to read from stdin.'
)
args = parser.parse_args()

def is_prime(num):
    """Check if a number is prime."""
    if num <= 1:
        return False
    if num <= 3:
        return True
    if num % 2 == 0 or num % 3 == 0:
        return False
    sqrt_num = int(math.sqrt(num)) + 1
    for i in range(5, sqrt_num, 6):
        if num % i == 0 or num % (i + 2) == 0:
            return False
    return True

def prime_checker(numbers):
    """Check which numbers are prime and return them as a list."""
    primes = list(filter(is_prime, numbers))
    return primes

def main():
    # Determine the input source: file or stdin
    if args.input_file == '-' or args.input_file == '':
        input_source = sys.stdin
    else:
        input_source = open(args.input_file, 'r')

    # Read numbers from the input source
    with input_source:
        input_data = input_source.read().strip().split()
        numbers = list(map(int, input_data))

    # Measure runtime
    start_time = time.time()
    primes = prime_checker(numbers)
    end_time = time.time()
    runtime = end_time - start_time

    # Write primes to the stdout
    if primes:
        print("\n".join(map(str, primes)), file=sys.stdout)

    # Print runtime to stderr
    print(f"The runtime of the prime checker is {runtime:.6f} seconds", file=sys.stderr)

if __name__ == "__main__":
    main()
```
where the arguments (or flags) are:

- `input_file` (positional, optional): The file containing the random integers. If not given, the script reads from stdin.

This prime checker script determines which of the numbers provided via standard input (`stdin`) or through a file are prime.
It writes the prime numbers to standard output, one per line, and logs the runtime of the operation to stderr.

Let's talk about what the `naive approach` does. Despite its name, `is_prime` is a reasonably efficient trial-division function that determines whether a given number `num` is prime. It first excludes numbers less than or equal to 1 and directly identifies 2 and 3 as prime. It then eliminates even numbers and multiples of 3 to avoid unnecessary checks. For the remaining numbers, the function iterates from 5 up to the square root of `num` in steps of 6, testing divisibility by `i` and `i + 2` at each step. This leverages the fact that all primes greater than 3 are of the form `6k ± 1`, which cuts down the number of iterations compared to the truly naive method of checking every number up to `num - 1`. If no divisor is found, the function concludes that `num` is prime.
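A quick way to convince yourself that the `6k ± 1` shortcut does not miss anything is to compare it against the truly naive method on a small range. The sketch below is only a sanity check: `is_prime_brute_force` is a helper invented here, and the loop bound uses `math.isqrt` (Python 3.8+) instead of the script's `int(math.sqrt(num)) + 1`.

```python
import math


def is_prime(num):
    """6k ± 1 trial division, following the logic of the script above."""
    if num <= 1:
        return False
    if num <= 3:
        return True
    if num % 2 == 0 or num % 3 == 0:
        return False
    for i in range(5, math.isqrt(num) + 1, 6):
        if num % i == 0 or num % (i + 2) == 0:
            return False
    return True


def is_prime_brute_force(num):
    """Reference check: try every candidate divisor from 2 to num - 1."""
    return num > 1 and all(num % d != 0 for d in range(2, num))


# Collect every number on which the two methods disagree.
mismatches = [n for n in range(0, 2000) if is_prime(n) != is_prime_brute_force(n)]
print("mismatches:", mismatches)
```

An empty list means both methods agree on every number in the range, while the fast version tests only about a third of the candidates up to the square root.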

Since this script needs a list of integers, one per line (what a coincidence), you can take these integers from `random_int_generator.py`! Instead of tediously saving these numbers and feeding them into `prime_checker.py` separately, we can use the brand new thing we learned, `pipes`!

You can pipe both scripts like this:
```bash
python3 random_int_generator.py | python3 prime_checker.py
```

As we specified earlier, the random integer generator produces 100 numbers between 10 and 100 by default, so our prime checker is fed with them and prints out only the `prime ones`. That means the original output of `random_int_generator.py` never appears on the terminal, since it has been redirected to `prime_checker.py`. The prime checker also reports its runtime, so you can assess its performance.

# RSA Checker
Here comes the code first:

```python
import sys

def rsa_key_checker(p, q, e):
    """Compute the RSA key pair given primes p and q and a public exponent e."""
    modulus = p * q
    phi_n = (p - 1) * (q - 1)

    try:
        private_exponent = pow(e, -1, phi_n)  # Calculate private exponent d
    except ValueError:
        return False, "No modular inverse exists for e and phi(n)"

    # Test the RSA encryption/decryption cycle
    test_message = 42
    encrypted_message = pow(test_message, e, modulus)
    decrypted_message = pow(encrypted_message, private_exponent, modulus)

    if test_message != decrypted_message:
        return False, "Encryption/Decryption failed"

    return True, f"Valid RSA key pair. Modulus = {modulus}, Public Exponent = {e}, Private Exponent = {private_exponent}"

def read_next_prime(file):
    """Read the next prime number from a file."""
    line = file.readline()
    if line:
        return int(line.strip())
    return None

def main():
    if len(sys.argv) != 3:
        # Usage and warnings go to stderr so they never mix with the results on stdout
        print("Usage: python3 RSAchecker.py [file_with_primes_1] [file_with_primes_2]", file=sys.stderr)
        sys.exit(1)

    primes_file_1 = sys.argv[1]
    primes_file_2 = sys.argv[2]

    # Common public exponent
    e = 65537

    # Open both files
    with open(primes_file_1, "r") as file1, open(primes_file_2, "r") as file2:
        while True:
            p = read_next_prime(file1)
            q = read_next_prime(file2)

            if p is None or q is None:
                if p is None and q is None:
                    break  # Both files are fully processed
                # Handle cases where one file has fewer lines
                if p is None:
                    print("Warning: File 1 has fewer lines than File 2. Stopping.", file=sys.stderr)
                if q is None:
                    print("Warning: File 2 has fewer lines than File 1. Stopping.", file=sys.stderr)
                break

            # Compute the RSA key pair without checking if p and q are prime
            valid, message = rsa_key_checker(p, q, e)
            if valid:
                print(message)

if __name__ == "__main__":
    main()
```

The `RSAchecker.py` script builds RSA key pairs from two lists of prime numbers.
It reads prime numbers from two files, computes the RSA key pair for each pair of primes, and checks whether a test message survives an encryption and decryption round trip. If you want to learn more, you can find plenty of information about RSA encryption on the internet.
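To make the round trip concrete, here is the same calculation with small textbook numbers. The values `p = 61`, `q = 53` and `e = 17` are chosen only for illustration; the script itself uses `e = 65537` and whatever primes it reads from the files.

```python
p, q, e = 61, 53, 17          # small textbook values, far too small for real use
modulus = p * q               # 3233
phi_n = (p - 1) * (q - 1)     # 3120
d = pow(e, -1, phi_n)         # modular inverse of e modulo phi_n (Python 3.8+)

message = 42
encrypted = pow(message, e, modulus)    # encrypt with the public exponent
decrypted = pow(encrypted, d, modulus)  # decrypt with the private exponent
print(d, decrypted)
```

The private exponent comes out as 2753, because `17 * 2753 = 46801 = 15 * 3120 + 1`, and decrypting the encrypted value gives back the original message 42.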

To use this script, you need to provide two files containing prime numbers: the first supplies the prime `p` and the second the prime `q` of each pair. Each file should have one prime number per line. The script reads these files line by line, computes the RSA key pairs, and verifies their validity.

The correct way to use this script is:
`python3 RSAchecker.py [file_with_primes_1] [file_with_primes_2]`

But since we do not have the prime numbers in files, we need to utilize `file descriptors`! A way to use two of them at the same time is to wrap each command in `<( ... )`, which bash calls process substitution. It bundles the commands inside the parentheses together and hands their combined output to the outer program as if it were a file. A little confusing, right? Let's break it down using an example:

```bash
python3 RSAchecker.py <(python3 random_int_generator.py | python3 prime_checker.py) <(python3 random_int_generator.py | python3 prime_checker.py)
```
We already know the part inside the parentheses: it outputs a list of prime numbers. We do it twice, since we need a pair of prime numbers. Each `<( ... )` is replaced by the path of a temporary file descriptor (something like `/dev/fd/63`) from which the output of the bundled commands can be read, so `RSAchecker.py` receives two file names, exactly as it expects.

And voilà! It worked perfectly, and we have valid RSA key pairs for encryption.

Congrats! Your encryption works!

# Benchmarking

It is great to see all the scripts in action, but whether they run efficiently is another concern, since we (ideally) want everything to be cheap in terms of time, computation, and so on. So we need to benchmark our pipeline to see whether some script is a bottleneck or raises errors along the way. For this benchmarking, we are going to use the `time` command of Linux (see the Linux Concepts section, if you already forgot :D). Let's start building our pipeline!

# Time Efficiency Benchmarking

## Random Integer Generator and Prime Checker

Based on our knowledge from the previous section, we know that we can build this pipeline with various methods: intermediate files, file descriptors, or pipes. When we need to pick one of them, the concern is cost efficiency, and in this case that means time efficiency. Let's try every method and check whether it really makes that much of a difference. In every test we are going to generate 50,000,000 numbers between 100 and 1,000,000. All tests were run with 6GB RAM and 2GB swap memory.

### Using Intermediate Files

To test intermediate files, we will generate a file containing all the random integers and feed the `prime checker` with it. To do that, we will time the two steps separately and add the results up later. We will use the commands:

```bash
time python3 random_int_generator.py 50000000 --min 100 --max 1000000 > random_integers.txt
time python3 prime_checker.py random_integers.txt > prime_list_first.txt
```

The runtimes of the two commands are, respectively:

```
real 0m44.270s
user 0m42.046s
sys 0m2.200s

and

real 2m39.985s
user 2m6.337s
sys 0m25.831s
```

That makes a total of nearly 3 minutes and 25 seconds, `without coding time.` Please note that the prime checker is much slower than the random integer generator.
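If you do not feel like adding the stamps up by hand, a few lines of Python can do it. This is just a convenience sketch; the helper `to_seconds` is invented here and only understands the `XmY.YYYs` format that bash's `time` prints.

```python
import re


def to_seconds(stamp):
    """Convert a bash `time` stamp such as '2m39.985s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


# `real` values reported above for the generator and the prime checker.
total = to_seconds("0m44.270s") + to_seconds("2m39.985s")
print(f"{total:.3f} s = {int(total // 60)}m{total % 60:.3f}s")
```

The two `real` values add up to 204.255 seconds, which is the figure to keep in mind when comparing with the other methods.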

### Using Pipes

Let's pipe them together! We will use the code below:

```bash
time python3 random_int_generator.py 50000000 --min 100 --max 1000000 | python3 prime_checker.py
```

The total runtime of this code is:
```
real 2m41.284s
user 2m21.816s
sys 0m15.455s
```
It made a difference, yes? Saving about 43 seconds may not seem that big, but imagine much bigger tasks. We always prefer lower runtimes, together with `lower coding times.`

### Using File Descriptors

The file descriptors method is the last one to benchmark between the random integer generator and the prime checker. After this, we will connect all three scripts and find the best-est method of all time! Connecting these two scripts with file descriptors looks like this:

```bash
time python3 prime_checker_naive_approach.py > primes.txt <(python3 random_int_generator.py 50000000 --min 100 --max 1000000)
```
and the runtime:

```
real 2m50.221s
user 2m31.008s
sys 0m12.328s
```
There is no substantial difference between pipes and file descriptors (the pipe was about 9 seconds faster here), but both are clearly faster than using intermediate files. So we are going to use one of the faster ones in the RSA benchmarking.

## RSA Checker and Others

By now we know which methods are faster, so we will stick to them. Still, let's compare one more time a more automated method against tediously moving files here and there between scripts.
Let's take the long road first.

One more thing: here we will generate only 5,000,000 integers. We will explain why shortly.

### Using Intermediate Files

We can achieve it with the following commands:

```bash
time python3 random_int_generator.py 5000000 --min 100 --max 1000000 | python3 prime_checker_naive_approach.py > primes.txt
time python3 random_int_generator.py 5000000 --min 100 --max 1000000 | python3 prime_checker_naive_approach.py > primes2.txt
time python3 RSAchecker.py primes.txt primes2.txt > valid_pairs.txt
```
After running them all, the runtimes look like this:

```
real 0m12.585s
user 0m10.159s
sys 0m2.642s

and

real 0m10.612s
user 0m10.117s
sys 0m0.670s

and

real 0m3.912s
user 0m3.486s
sys 0m0.411s
```
It took nearly 27 seconds to run all three commands, with 3 files taking nearly 60MB of space. Now let's try it with the much faster method.

### Using Pipes and File Descriptors Together

We will modify the command we used in the previous section when introducing RSAchecker. It will look like this:

```bash
time python3 RSAcheckerNEW.py \
    <(python3 random_int_generator.py 5000000 --min 100 --max 1000000 | python3 prime_checker_naive_approach.py) \
    <(python3 random_int_generator.py 5000000 --min 100 --max 1000000 | python3 prime_checker_naive_approach.py) \
    > valid_pairs.txt
```
Aaaaand here comes the runtime!!!

```
real 0m15.262s
user 0m3.468s
sys 0m0.480s
```
Really a huge improvement: about 15 seconds instead of 27. In terms of time, using more automated architectures and omitting intermediate files makes a huge difference.

But why did we not use fifty million integers as we did earlier? To be honest, 6GB RAM cannot handle processing that many integers. That leads us to the second important thing, the `computational load.` The main question at this point is: which script causes that overload? Let's find out together!

## Computational Load Benchmarking with `htop`

As we discussed in the Linux Concepts section, `htop` helps us find out the computational load. We need to open an `htop` screen while the pipeline is running.
I will provide a couple of `htop` screenshots here; let's compare them.

The first one is taken during the random integer generation.
![during random int](https://github.com/user-attachments/assets/949ae9e7-6c94-42a9-94a5-1f9239a1aaf5)
The maximum CPU and memory usage is about 2.7% here: not much, and nothing to be concerned about. We can see that the random integer generator is fairly well optimized. At least, it does not cause the computational load we are talking about here.

The second screenshot was taken right after the random integer generation.
![after random int](https://github.com/user-attachments/assets/4b979224-93f0-484f-97f5-b68ac13e09c4)
Note that both memory and CPU usage went up sharply and produced some odd-looking numbers, like 101% CPU usage. Initialization may have caused that, but the program still works.

But after that, the third one was caught during the prime checker:
![during prime checker](https://github.com/user-attachments/assets/8e00ebd4-89ec-4a13-a540-e359ec9cfec3)
Even though the per-process percentages do not show 100% usage of CPU and memory, the overall meters say something different. As we can see, all of the memory and swap memory has filled up. That is exactly when the process was killed. The program cannot move on to the RSA checking part, because everything was killed and stopped at this step. At this point, we should ask ourselves how to optimize or maybe bypass this step to get a more efficient pipeline. Also, as we said earlier, we will provide other prime checking algorithms besides the naive one; you can check them yourself and find out which one is better :).
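One direction worth trying, offered here only as a sketch and not as the tutorial's own solution, is to stop loading every number into memory at once. The prime checker above reads the whole input and builds full lists, whereas the generator function below handles one line at a time. The name `stream_primes` is hypothetical, and whether this alone prevents the process from being killed on a 6GB machine is an assumption that would need to be tested.

```python
import io
import math


def is_prime(num):
    """Same 6k ± 1 trial division as in the prime checker script."""
    if num <= 1:
        return False
    if num <= 3:
        return True
    if num % 2 == 0 or num % 3 == 0:
        return False
    for i in range(5, int(math.sqrt(num)) + 1, 6):
        if num % i == 0 or num % (i + 2) == 0:
            return False
    return True


def stream_primes(lines):
    """Yield primes one at a time instead of building full lists in memory."""
    for line in lines:
        text = line.strip()
        if text and is_prime(int(text)):
            yield int(text)


# Demonstration on an in-memory stand-in for stdin.
sample_input = io.StringIO("7\n8\n11\n\n15\n")
print(list(stream_primes(sample_input)))
```

In a real script you would pass `sys.stdin` instead of the `io.StringIO` stand-in and print each prime as soon as it is found, so that memory use stays roughly constant no matter how many numbers flow through the pipe.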

# Thanks for the attention! See you in another tutorial!
Written by Özgür Yolcu

Instructed by Gabriel Renaud