Unix pipes


This is a small tutorial about Unix pipes, a powerful way to combine multiple commands.

Basic Unix Commands and Concepts

Reminder about basic commands:

cd (Change Directory):

The `cd` command is used to navigate between directories (folders) in a Unix-based system. For example, if you are in a directory called home, and you want to move to a directory inside it called documents, you would type:


cd documents


If you want to move to the parent directory, you can use:

cd ..


If you ever want to return to your home directory, simply type:

cd

You can also combine these! If you want to move to the parent folder and then go into another directory from there, you can simply write:

cd ../directory_path

ls (List Directory Contents)

The `ls` command lists the contents of the current directory you are in. It shows all files and subdirectories within that directory. For example, to see what files and directories are inside the current folder, type:


ls

You can also add options to ls to view more details. For instance:

  • `ls -l` lists files with detailed information, such as file permissions, size, and modification dates (see the sample output below).
  • `ls -a` lists all files, including hidden ones (files that start with a dot .).
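
For example, the output of `ls -l` looks roughly like this (an illustrative listing; the exact files, sizes, and dates will of course differ on your system):

-rw-r--r--  1 user group  1204 Jan 15 10:32 notes.txt
drwxr-xr-x  2 user group  4096 Jan 15 10:30 documents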

mkdir (Creating Directories)

The `mkdir` (make directory) command is used for creating new directories (folders) within the Unix file system. Organizing files into directories helps maintain a structured and manageable file system, which is a good thing. You can simply create a directory inside your current directory using `mkdir` like this:


mkdir [directory_name_here]

For example, if you are in a directory named `my_directory` and want to create a directory named `my_new_directory`, you will write:

mkdir my_new_directory

It will be created without notifying you, but you can check that the directory was created by using `ls`. The output of that command should look like this:

my_new_directory

Checking it yourself works, but it would be nicer if `mkdir` notified you when the directory is created. For that, you can use the `-v` flag. The 'v' stands for `verbose`: it reports whether the directory was created successfully. How does it notify you? By writing a message to your terminal, since the terminal is where the standard output goes. What is standard output? We will talk about it later.


mkdir -v my_new_directory

This command now prints a message confirming that the directory was created successfully.

Now imagine you need to create a folder inside a folder inside a folder. Creating all of them by hand would not be that hard, but what if you need 20 nested folders like that? Instead of doing it exhaustively, you can use another flag, `-p`! The `-p` flag will create the parent directories as well, if they do not already exist. You can achieve this like this:

mkdir -p my_new_directory/my_another_new_directory/unix_tutorial

This command will create all the directories that do not yet exist. You can also combine the `-v` and `-p` flags to get notified at every step, as shown below.
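
For example, combining both flags looks like this (if some of the directories already exist they are simply skipped, and the exact wording of the messages may vary between systems):

mkdir -pv my_new_directory/my_another_new_directory/unix_tutorial

mkdir: created directory 'my_new_directory'
mkdir: created directory 'my_new_directory/my_another_new_directory'
mkdir: created directory 'my_new_directory/my_another_new_directory/unix_tutorial'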

You might ask yourself, why are we separating the directories with `/` but not using it before the first directory? You can use it there too, but a `/` at the very first position tells your system that you are starting from the `root` directory. So if you add `/` before `my_new_directory`, your system will create all the folders not under your current location, but under the root directory. You can still use a leading `/` when you deliberately want to create directories starting from a location other than your current one.

htop

`htop` is an interactive and user-friendly process viewer for Unix systems. It provides a real-time, color-coded display of system processes, CPU usage, memory consumption, and more. If you are used to using Windows systems, `htop` is kinda similar to `Task Manager`. You can open up `htop` by simply writing:

htop

By writing that, you should get a tab like the following:

![htop](https://github.com/user-attachments/assets/0ca69cd7-05e0-40d7-ba0f-8f539fda5b91)

`htop` also accepts mouse input. You can click the buttons in the menu bar and the meters to inspect CPU usage, memory usage, and so on.

time

`time` is a tiny command that helps measure the execution time of a command or script. It gives out three different measurements, which are:

  • real: total elapsed wall-clock time from when the command starts until it finishes.
  • user: CPU time spent in user mode, i.e. the time spent actually running your code.
  • sys: CPU time spent in kernel mode, e.g. reading from and writing to files, pipes, and other file descriptors.
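
For example, timing a short command looks like this (the numbers below are only illustrative; they will differ on your machine):

time sleep 1

real    0m1.003s
user    0m0.001s
sys     0m0.002s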


stdout, stdin and stderr

stdout (Standard Output)

`stdout` stands for "standard output", where a program sends its regular output. In most cases, this is your terminal screen. For example, when a command or program runs successfully, the result is displayed on `stdout`, i.e. your terminal. You can redirect this output to a file if you don’t want it displayed on the screen.

Let's say you have a program, `hello_world.py`, that simply writes "Hello World!" to the terminal, looking like this:

#!/usr/bin/python3

print("Hello World!")

You can copy-paste this code block into a file using emacs, a powerful text editor (the `-nw` flag stands for 'no window' and keeps emacs in the terminal instead of opening a separate window):

emacs -nw hello_world.py

then hold CTRL and press 'x' followed by 's' (CTRL-x CTRL-s) to save the file. Then hold CTRL and press 'x' followed by 'c' (CTRL-x CTRL-c) to quit.

When you run this program in Linux by writing `python3 hello_world.py`, you will see the output `Hello World!` in your terminal.

Let's break down this command together. First, we write `python3`, the interpreter that runs Python files on Unix-based systems. Then we tell it which file to run; in this case, the name of our little program is `hello_world.py`. Given only these two words as a command, it simply writes `Hello World!` to the terminal.

But what if you want to send this output to a text file named `greeting.txt` instead? The first way to achieve this is to change the program itself:

#!/usr/bin/python3
import sys

# Redirect stdout to a file
with open("greeting.txt", "w") as file:
    sys.stdout = file
    print("Hello World!")
sys.stdout = sys.__stdout__  # restore the original stdout; the with block already closed the file

Then `python3 hello_world.py` would create `greeting.txt` and write `Hello World!` into it. The program still writes to `stdout`, but `stdout` now points to the file instead of the terminal. It works, but it is a bit cumbersome.
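
If you do want to redirect stdout from inside Python, a slightly cleaner variant of this first approach (a sketch, not part of the original script) uses `contextlib.redirect_stdout`, which restores the original stdout automatically:

#!/usr/bin/python3
from contextlib import redirect_stdout

# Everything printed inside this block goes into greeting.txt
with open("greeting.txt", "w") as file, redirect_stdout(file):
    print("Hello World!")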

The second, and easier, way is to redirect stdout directly from the shell. Redirection is a way to manipulate the outputs, errors, and inputs of programs. Using the very first version of `hello_world.py` and redirection, you can achieve the same thing like this:


python3 hello_world.py > greeting.txt

The `>` operator is one of the basic redirection operators in Linux. Used like this, it redirects the `output` of the program into the file `greeting.txt`, which is created automatically. We will talk about it in more detail in the next chapters.
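
You can check that the redirection worked with `cat`, a standard command that prints a file's contents to the terminal:

cat greeting.txt

Hello World!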


/dev/null

There may also be situations where you want to discard the output of a program entirely. You can do this, again, using redirection. The file `/dev/null` is a special device file that acts like a black hole, so to speak: everything you send there is lost. Suppose we don't want to see the output of `hello_world.py`. We can achieve this as follows:


python3 hello_world.py > /dev/null

stdin (Standard Input)

`stdin` stands for "standard input" and is where a program receives its input. By default, this is the keyboard, but it can also come from a file or the output of another command. For example, if you run a command and are prompted to type something, that input is coming from `stdin`.

Imagine our `hello_world.py` also greets us by name! Since the program cannot know your name, you need to supply it. You might try giving your name like this:


python3 hello_world.py rasmus

But it won't work, because our Python code never looks at the `arguments` on its command line. The Python library `argparse` helps you read input from the command line properly! When you set up argparse and modify the code accordingly, it will take the input from the command line and process it.

We can modify our little code like this:

#!/usr/bin/python3

import argparse

def main():
    parser = argparse.ArgumentParser(description="Greeting Message")
    parser.add_argument('name', nargs='?', help='Your name to greet correctly')
    args = parser.parse_args()

    print(f"Hello World! {args.name}")

if __name__ == "__main__":
    main()

That is a substantial modification.

Here, what we call 'parser' is an ArgumentParser object. We add an argument to it and name it 'name'. Then we use parser.parse_args() to collect the arguments. This lets us access each argument by its name: when you type your name in the position of the 'name' argument, you can read it in the code as `args.name`. Now, if you call the code like this:


python3 hello_world.py rasmus

You will get:


Hello World! rasmus

Even if the greeting reads a bit oddly, we got our output right, and that's something.

Now imagine you have two Python programs. One of them picks a random name and the second one prints "Hello World! [name]" with the chosen name (our little program). You could run the first program, look at its output, and then run the second program, typing in that output by hand. With a single name that is not much of a bother, but imagine feeding in 50 random names. To avoid this tedious work, you can use `pipes`! A pipe is an operator in Unix-based systems that connects the `stdout` of one program to the `stdin` of another. Note that when you use the `pipe` operator you no longer need `argparse`: with redirection or pipes, the input arrives on `stdin` as a stream, so the program reads it like a file rather than parsing command-line arguments.
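
As a minimal illustration with standard tools (a generic example, separate from our scripts), you can pipe the output of `ls` into `wc -l`, which counts lines; the result is the number of entries in your current directory:

ls | wc -l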

Let's name our first code `random_name_generator.py`:

#!/usr/bin/python3

import random

names = [
    "Anders", "Niels", "Jens", "Poul", "Lars", "Morten", "Søren", "Thomas", "Peter", "Martin",
    "Henrik", "Jesper", "Frederik", "Kasper", "Rasmus", "Svend", "Jacob", "Simon", "Mikkel", "Christian",
    "Brian", "Steffen", "Jonas", "Mark", "Daniel", "Carsten", "Torben", "Bent", "Erik", "Michael",
    "Viggo", "Oskar", "Emil", "Victor", "Alexander", "Sebastian", "Oliver", "William", "Noah", "Lasse",
    "Mads", "Bjørn", "Leif", "Gunnar", "Elias", "August", "Aksel", "Finn", "Ebbe", "Vladimir",
    "Anne", "Karen", "Pia", "Mette", "Lise", "Hanne", "Rikke", "Sofie", "Camilla", "Maria",
    "Julie", "Christine", "Birthe", "Tine", "Kirsten", "Ingrid", "Line", "Trine", "Kristine", "Mia",
    "Cecilie", "Charlotte", "Emma", "Ida", "Nadia", "Sanne", "Sara", "Eva", "Helene", "Nanna",
    "Maja", "Lærke", "Molly", "Stine", "Emilie", "Amalie", "Signe", "Freja", "Isabella", "Tuva",
    "Viktoria", "Ane", "Dorte", "Laura", "Asta", "Marie", "Clara", "Sofia", "Filippa", "Ella",
    "Alex", "Robin", "Kim", "Sam", "Alexis", "Charlie", "Taylor", "Jamie", "Morgan", "Riley"
]

# Select 10 random names without replacement
random_names = random.sample(names, 10)

# Print each name on a separate line
for name in random_names:
    print(name)

And after a few small adjustments, our `hello_world.py`:

#!/usr/bin/python3

import sys

def main():
    # Reading names
    for line in sys.stdin:
        name = line.strip()  # strip the trailing newline and surrounding whitespace
        if name:  # skip empty lines
            print(f"Hello World! {name}")

if __name__ == "__main__":
    main()

You can achieve the given task using redirection like this:

 python3 random_name_generator.py > names.txt
 python3 hello_world.py < names.txt

This works, but it creates an intermediate file called names.txt. You can achieve the same task using the pipe operator "|" like this:

python3 random_name_generator.py | python3 hello_world.py


Both work perfectly, but notice how much easier `pipes` are for this type of task, compared to redirecting through an intermediate file.

What happens behind the scenes? The command "python3 random_name_generator.py" writes to a special file called a file descriptor. Then "python3 hello_world.py" reads from that file descriptor. It is almost like the first case, except that:

  1. the intermediate file is deleted for you
  2. the second program does not need to wait for the first one to finish before it starts executing

The second aspect is particularly appealing for next-generation sequencing analysis.

stderr (Standard Error)

`stderr` stands for "standard error" and is used by programs to send error messages or diagnostics. It is also shown on your terminal screen by default, but it is separate from `stdout`. Reading both on the same terminal makes them hard to tell apart, so it is generally better to redirect one of them.

Let's say we want to print a status message for `hello_world.py`. After every greeting is written to stdout, it should print the status message `Name greeted: name`. We could print it directly with the print function like this:

#!/usr/bin/python3
import sys

def main():
    # Reading names
    for line in sys.stdin:
        name = line.strip()  # Stripping
        if name:
            print(f"Hello World! {name}")
            print(f"Name greeted: {name}")

if __name__ == "__main__":
    main()


When you run this code (feeding it names on stdin as before), it will output something like this:

Hello World! Maria
Name greeted: Maria
Hello World! Anders
Name greeted: Anders
...

It works, but it is not quite what we want: the status message still goes to stdout along with the regular output.

If you redirect stdout, both kinds of messages will still end up in the same place. So first we need to send the status message to `stderr`, and then change the `output location of stderr`.

We can send the status message to `stderr` like this:
#!/usr/bin/python3

import sys

def main():
    for line in sys.stdin:
        name = line.strip()
        if name:
            print(f"Hello World! {name}")
            print(f"Name greeted: {name}", file=sys.stderr)

if __name__ == "__main__":
    main()

`file` is an argument of the `print` function in Python that specifies where the output goes. If you pass an open file object to that argument, the output is written there. Its default value is `sys.stdout`, i.e. `stdout`. You can change it by passing `file=sys.stderr`.

Now we want to redirect this status message into a file named `status.txt`. As before, we can use redirection. Let's try it like this:

python3 hello_world.py > status.txt

It did not work, right? That's because the `>` operator redirects only `stdout`. If we want to redirect `stderr`, we specify this with `2>`. But why didn't we need a number to redirect `stdout`? In fact, `stdin`, `stdout`, and `stderr` each have a number, called a file descriptor:

  • Standard Input (stdin): file descriptor 0
  • Standard Output (stdout): file descriptor 1
  • Standard Error (stderr): file descriptor 2

But `>` redirects `stdout` by default, so you do not need to write its file descriptor explicitly; simply use ">".

Based on this information, we can redirect our status message into `status.txt` with the following command:


python3 hello_world.py 2> status.txt 
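
Note that `hello_world.py` now reads its names from `stdin`, so in practice you would feed it through the pipe. You can then send the greetings and the status messages to two separate files in one go (the file names here are just examples):

python3 random_name_generator.py | python3 hello_world.py > greetings.txt 2> status.txt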

That's the end of this chapter. Next on, we will talk about a real-world implementation of all the concepts above.

Real World Example

This part of the tutorial provides a real-world example where you can use what you have learned above. All of the code examples below can be found in this GitHub repository. So let's get started!

Random Integer Generator

Let us generate random integers. Let's see the script first:

#!/usr/bin/python3

import sys
import random as r
import argparse

parser = argparse.ArgumentParser(description="Random Integer Generator. This program generates random integers within a given interval.")
parser.add_argument("num_of_nums", metavar="n", type=int, nargs="?", default=100, help="number of generated numbers (default: 100)")
parser.add_argument("--min", metavar="min", type=int, default=10, help="minimum value of the interval (default: 10)")
parser.add_argument("--max", metavar="max", type=int, default=100, help="maximum value of the interval (default: 100)")
args = parser.parse_args()

def random_int_generator(number_of_numbers, min_interval, max_interval):
    """Generates random integers within a specified interval and writes them to stdout."""
    for _ in range(number_of_numbers):
        num = r.randint(min_interval, max_interval)
        print(num)

# Run the random integer generator
random_int_generator(args.num_of_nums, args.min, args.max)

usage:

usage: random_int_generator.py [-h] [--min min] [--max max] [n]

Random Integer Generator. This program generates random integers within a given interval.

positional arguments:
  n           number of generated numbers (default: 100)

optional arguments:
  -h, --help  show this help message and exit
  --min min   minimum value of the interval (default: 10)
  --max max   maximum value of the interval (default: 100)

The script generates random integers within a specified interval. It can be executed from the command line with optional arguments that specify how many integers to generate and the range of values. The script is located here:

/home/projects/22126_NGS/exercises/pipes/random_int_generator.py

To generate 50 random integers between 1 and 50, you would run this code as:


python3 /home/projects/22126_NGS/exercises/pipes/random_int_generator.py 50 --min 1 --max 50

Prime Checker (Naive)

A number is prime if it is divisible only by 1 and itself. 17 and 19 are prime, but 14 (divisible by 2 and 7) and 16 (divisible by 2, 4, and 8) are not. The most naive way to check whether a number is prime is to test every integer between 2 and the number minus 1 as a divisor. A slightly smarter way is to test divisors only up to the square root of the number: for 17, that means testing 2, 3, and 4 (since √17 ≈ 4.1), none of which divide it.

The code for this approach looks like this:

import sys
import math
import argparse

parser = argparse.ArgumentParser(description="Prime Number checker . This program checks if the input numbers are prime and writes the primes to stdout.")
args = parser.parse_args()

def is_prime(num):
    """Check if a number is prime."""
    if num <= 1:
        return False
    for i in range(2, int(math.sqrt(num))+1):
        if num % i == 0:
            return False
    return True

def prime_checker():
    """Check which numbers from stdin are prime and write them to stdout."""
    input_data = sys.stdin.read().strip().split()
    numbers = map(int, input_data)
    
    primes = filter(is_prime, numbers)
    print("\n".join(map(str, primes)))

# Run the prime checker
if __name__ == "__main__":
    prime_checker()


This prime checker script determines whether the numbers provided via standard input (`stdin`) or through a file are prime. It writes the prime numbers to standard output, one per line.

Let's talk about what the `naive approach` does. The `is_prime` function first excludes numbers less than or equal to 1, which are not prime by definition. For any other number `num`, it tries every integer from 2 up to the square root of `num` as a divisor; if any of them divides `num` evenly, the number is not prime. Stopping at the square root works because any divisor larger than √num must be paired with one smaller than √num, so it would already have been found. If no divisor is found, the function concludes that `num` is prime. This already saves a lot of work compared to testing every number up to `num - 1`.
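
As a quick sanity check (a small illustrative example), you can feed the checker a few numbers by hand; `echo` writes its arguments to stdout, and the pipe delivers them to the script's stdin:

echo "7 8 9" | python3 prime_checker.py

7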

Since this script needs a list of integers, one per line (what a coincidence), you can take these integers from `random_int_generator.py`! Instead of writing these numbers down and feeding them into `prime_checker.py` by hand, we can use the brand new thing we learned, `pipes`!

You can pipe the two scripts together like this:

 python3 random_int_generator.py | python3 prime_checker.py

As we specified earlier, the random integer generator produces 100 numbers between 10 and 100 by default, so our prime checker is fed with exactly those and prints out only the `prime ones`. That means the original output of `random_int_generator.py` never appears on your terminal, since it has been redirected into `prime_checker.py`. If you want to assess the performance of the pipeline, you can prefix the whole command with `time`.

RSA Checker

Let's look at the code first:

import sys
from sympy import mod_inverse, isprime

def read_primes_from_file(file_path):
    try:
        with open(file_path, 'r') as file:
            values = [int(line.strip()) for line in file if line.strip().isdigit()]
            for value in values:
                if not isprime(value):
                    raise ValueError(f"The number {value} in {file_path} is not a prime number.")
            return values
    except Exception as e:
        print(f"Error reading primes from file {file_path}: {e}")
        sys.exit(1)

def generate_rsa_keys(p, q):
    if p == q:
        raise ValueError("The two primes must be distinct.")

    # Calculate n and phi(n)
    n = p * q
    phi_n = (p - 1) * (q - 1)

    # Choose an encryption exponent e
    e = 65537  # Commonly used value for e
    if e >= phi_n or gcd(e, phi_n) != 1:
        raise ValueError("Invalid value for e. It must be coprime with phi(n) and less than phi(n).")

    # Calculate the decryption exponent d
    d = mod_inverse(e, phi_n)

    return (e, n), (d, n)

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

def main():
    if len(sys.argv) != 3:
        print("Usage: python rsa_key_gen.py <prime_file_1> <prime_file_2>")
        sys.exit(1)

    prime_file_1 = sys.argv[1]
    prime_file_2 = sys.argv[2]

    # Read primes from files
    primes1 = read_primes_from_file(prime_file_1)
    primes2 = read_primes_from_file(prime_file_2)

    # Take the minimum length of the two prime lists
    min_length = min(len(primes1), len(primes2))
    primes1 = primes1[:min_length]
    primes2 = primes2[:min_length]

    # Generate RSA keys for each pair of primes
    for p, q in zip(primes1, primes2):
        try:
            public_key, private_key = generate_rsa_keys(p, q)
            # Print keys to stdout
            print(f"Public Key: {public_key[0]}\t{public_key[1]}\tPrivate Key: {private_key[0]}\t{private_key[1]}")
        except ValueError as e:
            print(f"Error generating keys for primes {p} and {q}: {e}")

if __name__ == "__main__":
    main()


This Python program demonstrates RSA key generation using prime numbers read from two input files. It pairs each prime number from the first file with one from the second file, generates public and private keys for each pair, and prints them to the console. RSA is a widely used encryption scheme using prime numbers.
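
To see the arithmetic at toy scale, here is an illustrative sketch with tiny primes (far too small for real use; the script itself uses e = 65537), taking p = 61 and q = 53:

from sympy import mod_inverse

p, q = 61, 53                 # toy primes, far too small for real use
n = p * q                     # modulus n = 3233
phi_n = (p - 1) * (q - 1)     # phi(n) = 3120
e = 17                        # public exponent, coprime with phi(n)
d = mod_inverse(e, phi_n)     # private exponent d = 2753, since 17 * 2753 = 46801 = 15 * 3120 + 1

print("Public key:", (e, n))   # (17, 3233)
print("Private key:", (d, n))  # (2753, 3233)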

The correct way to use this script follows:

 python /home/projects/22126_NGS/exercises/pipes/RSAcompute.py [file_with_primes_1] [file_with_primes_2]

But since we do not have the prime numbers in files, we need to utilize `file descriptors`! A way to use two file descriptors at the same time is process substitution: you wrap a command (or a pipeline of commands) in `<( ... )`, and the shell runs it and presents its output to the outer command as a file descriptor. Let's break it down using an example:

 python3 /home/projects/22126_NGS/exercises/pipes/RSAcompute.py <(python3 /home/projects/22126_NGS/exercises/pipes/random_int_generator.py --min 1000000 --max 10000000 10000 | python3 /home/projects/22126_NGS/exercises/pipes/prime_checker.py) <(python3 /home/projects/22126_NGS/exercises/pipes/random_int_generator.py --min 1000000 --max 10000000 10000 | python3 /home/projects/22126_NGS/exercises/pipes/prime_checker.py)

We already know what the part inside the parentheses does: it outputs a list of prime numbers. We do it twice, since we need two files of prime numbers, and hand both to `RSAcompute.py`. "<( PROCESS HERE )" takes the stdout of PROCESS HERE and turns it into a file descriptor. Essentially, we have saved writing four intermediate files here, and RSAcompute.py can run as fast as random_int_generator and prime_checker can produce output.

It worked perfectly, and we have RSA pairs for encryption.
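
Process substitution is not specific to this pipeline. A classic stand-alone example (using placeholder directory names `dir1` and `dir2`) compares the listings of two directories without creating any temporary files:

diff <(ls dir1) <(ls dir2)

Each `<( ... )` runs its command and hands `diff` a file descriptor to read the output from.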


Benchmarking

It is lovely to see all the code in action, but checking whether it runs efficiently is another concern, since we ideally want everything to be cheap in terms of time, computation, and so on. So we need to benchmark our pipeline to see whether any step becomes a bottleneck or raises errors along the way. For this benchmarking, we are going to use the `time` command of Linux (see the Basic Unix Commands and Concepts section, if you already forgot :D). Let's start building our pipeline!

Time Efficiency Benchmarking

Random Integer Generator and Prime Checker

Based on the previous section, we know that we can build this pipeline in various ways: using intermediate files, file descriptors, or pipes. When picking one of them, the concern is cost efficiency, and in this case, time efficiency. Let's try each method and check whether it really makes that much of a difference. We are going to generate 50,000,000 numbers between 100 and 1,000,000 in every test. All tests were run on a machine with 6 GB RAM and 2 GB swap memory.

Using Intermediate Files

To test intermediate files, we will generate a file containing all the random integers and then feed `prime_checker.py` with it. We time the two steps separately and add the results afterwards. We will use these commands:

time python3 random_int_generator.py 50000000 --min 100 --max 1000000 > random_integers.txt
time python3 prime_checker.py random_integers.txt > prime_list_first.txt


The runtimes of the two commands are, respectively:

real    0m44.270s
user    0m42.046s
sys     0m2.200s

and

real    2m39.985s
user    2m6.337s
sys     0m25.831s

That makes a total of nearly 3 minutes and 30 seconds, not counting the time spent writing the code. Please note that the prime checker runs much slower than the random integer generator.

Using Pipes

Let's pipe them together! We will use the code below:


time python3 random_int_generator.py 50000000 --min 100 --max 1000000 | python3 prime_checker.py


The total runtime of this code is:

real    2m41.284s
user    2m21.816s
sys     0m15.455s

It made a difference, didn't it? Roughly 40 seconds saved may not seem like much, but imagine far bigger tasks. We always prefer lower runtimes, ideally combined with less coding effort.