Latest revision as of 10:07, 20 November 2025

Advanced UNIX and Pipes

This page covers standard input/output streams, redirection, pipes, file descriptors, and Python examples demonstrating how UNIX handles data flow. These concepts are extremely useful in NGS analysis, where tools are often chained together and large data files should be streamed instead of written to disk.

stdout, stdin, stderr

Every UNIX program interacts with three data streams:

  • stdin (0) – standard input
  • stdout (1) – standard output
  • stderr (2) – standard error

Basic redirection:

command > out.txt          # redirect stdout to file
command 2> errors.txt      # redirect stderr to file
command < input.txt        # feed file into stdin
command >> out.txt         # append output

Piping connects stdout of one command to stdin of another:

command1 | command2

Examples:

ls | wc -l                       # count files
grep HUMAN ex1.acc | sort | uniq -c
cut -f5 ex1.tot | sort -nr | head
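
The same kind of pipeline can be built from inside a Python program. The sketch below is an illustration only, not part of the course scripts: the helper name `pipe_two` is hypothetical, and the demo uses the Python interpreter itself as both commands so that it does not depend on any particular UNIX tool being installed.

```python
import subprocess
import sys


def pipe_two(cmd1, cmd2):
    """Run `cmd1 | cmd2` and return the stdout of cmd2 as text."""
    # stdout of the first process becomes a pipe we can hand to the second.
    p1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(cmd2, stdin=p1.stdout,
                          stdout=subprocess.PIPE, text=True)
    # Close our copy of the pipe so p1 is notified if p2 exits early.
    p1.stdout.close()
    output, _ = p2.communicate()
    p1.wait()
    return output


if __name__ == "__main__":
    # Similar in spirit to: printf 'b\na\nc\n' | wc -l
    producer = [sys.executable, "-c", "print('b'); print('a'); print('c')"]
    counter = [sys.executable, "-c",
               "import sys; print(sum(1 for _ in sys.stdin))"]
    print(pipe_two(producer, counter).strip())
```

Both child processes start immediately and run at the same time; the second one simply waits whenever the pipe is empty.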

/dev/null (the “black hole”)

If you want to discard output:

command > /dev/null

stderr is a separate stream, so it has to be discarded explicitly:

command 2> /dev/null
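
When a command is launched from Python instead of the shell, `subprocess.DEVNULL` plays the role of `/dev/null`. The following is a minimal sketch under that assumption; the helper `run_quiet` and the `CHILD` command are hypothetical names made up for the demonstration.

```python
import subprocess
import sys

# A child process that writes one line to stdout and one to stderr.
CHILD = [sys.executable, "-c",
         "import sys; print('result'); print('warning', file=sys.stderr)"]


def run_quiet(cmd):
    """Run cmd, keep its stdout, discard its stderr (like `cmd 2> /dev/null`)."""
    done = subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.DEVNULL, text=True, check=True)
    return done.stdout


if __name__ == "__main__":
    # Only the stdout line survives; the warning is thrown away.
    print(run_quiet(CHILD), end="")
```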

Simple Pipe Examples

These illustrate the concepts used constantly in real NGS workflows.

zcat reads.fastq.gz | head
grep -v "^#" variants.vcf | wc -l
samtools view file.bam | awk '{print $3}' | sort | uniq -c
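
The `awk '{print $3}' | sort | uniq -c` part of the last pipeline counts how often each value appears in column 3, which in SAM records is the reference sequence name. As a rough Python counterpart, the sketch below does the same counting over any iterable of tab-separated lines. The function name `count_column` and the three demo records are invented for illustration.

```python
from collections import Counter


def count_column(lines, column=2, sep="\t"):
    """Count the values found in one column (0-based) of delimited lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Skip lines that are too short to have the requested column.
        if len(fields) > column:
            counts[fields[column]] += 1
    return counts


if __name__ == "__main__":
    # Made-up records: read name, flag, reference name.
    demo = ["r1\t0\tchr1", "r2\t0\tchr2", "r3\t16\tchr1"]
    for value, n in count_column(demo).most_common():
        print(n, value)
```

To use it at the end of a pipe, pass `sys.stdin` in place of `demo`; the function reads one line at a time, so it never needs the whole input in memory.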

Using stdin/stdout in Python

Python scripts can read from stdin and write to stdout, making them pipe-friendly.

Example 1: Minimal stdin → stdout Python script

#!/usr/bin/python3
import sys

for line in sys.stdin:
    print("Hello", line.strip())

Run it with a pipe:

echo "world" | python3 hello.py

Example 2: Greeting names from stdin

Create `hello_world.py`:

#!/usr/bin/python3
import sys

def main():
    for line in sys.stdin:
        name = line.strip()
        if name:
            print(f"Hello World! {name}")

if __name__ == "__main__":
    main()

Run:

echo -e "Alice\nBob" | python3 hello_world.py

Example: Random Name Generator + Hello Script

This demonstrates connecting two Python scripts using pipes rather than temporary files.

random_name_generator.py

#!/usr/bin/python3
import random

names = [
    "Anders", "Niels", "Jens", "Poul", "Lars", "Morten", "Søren", "Thomas",
    "Peter", "Martin", "Henrik", "Jesper", "Frederik", "Kasper", "Rasmus",
    "Anne", "Maria", "Sofie", "Camilla", "Julie", "Eva", "Sara", "Ida"
]

# Print 10 random names
for name in random.sample(names, 10):
    print(name)

hello_world_stdin.py

#!/usr/bin/python3
import sys

for line in sys.stdin:
    name = line.strip()
    print(f"Hello World! {name}")

Run both using a pipe

python3 random_name_generator.py | python3 hello_world_stdin.py

No temporary files needed.

stderr (Standard Error)

stderr is meant for status messages and diagnostics, so that they stay out of the data written to stdout.

In the example below (saved as `greet.py`), the greeting goes to stdout and a status message goes to stderr.

#!/usr/bin/python3
import sys

def main():
    for line in sys.stdin:
        name = line.strip()
        print(f"Hello World! {name}")               # stdout
        print(f"Name greeted: {name}", file=sys.stderr)  # stderr

if __name__ == "__main__":
    main()

Redirect stderr:

echo -e "Alice\nBob" | python3 greet.py 2> status.txt

Now:

  • stdout → terminal
  • stderr → status.txt

Process Substitution (<(...))

Process substitution hands a command a file-like path (a “fake temporary file”) whose contents are the output of another command.

Example:

diff <(sort file1.txt) <(sort file2.txt)

No temporary files are created, and diff sees two “files”.

Process substitution is available in bash, zsh, and ksh, but it is not part of POSIX sh.
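
To see why this helps tools that expect filenames, consider a script that only knows how to open paths. The sketch below is hypothetical (the script name `only_in_first.py` and the function `lines_only_in_first` are not from the course material). Because bash replaces each `<(...)` with a path, typically something like `/dev/fd/63`, the script can read command output through plain `open()` calls.

```python
import sys


def lines_only_in_first(path_a, path_b):
    """Return the lines of path_a that do not occur in path_b, in order."""
    # Each path is opened and read exactly once: a <(...) path is a pipe,
    # so it cannot be rewound or read a second time.
    with open(path_b) as handle:
        seen = {line.rstrip("\n") for line in handle}
    with open(path_a) as handle:
        return [line.rstrip("\n") for line in handle
                if line.rstrip("\n") not in seen]


# Only run as a command-line tool when exactly two paths are given.
if __name__ == "__main__" and len(sys.argv) == 3:
    for line in lines_only_in_first(sys.argv[1], sys.argv[2]):
        print(line)
```

Usage with process substitution:

python3 only_in_first.py <(sort file1.txt) <(sort file2.txt)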

Real Example: Integer Generator + Prime Checker

random_int_generator.py

#!/usr/bin/python3

import argparse
import random

# n: how many integers to print; --min/--max: inclusive bounds
parser = argparse.ArgumentParser(description="Print random integers, one per line.")
parser.add_argument("n", nargs="?", type=int, default=10)
parser.add_argument("--min", type=int, default=10)
parser.add_argument("--max", type=int, default=100)
args = parser.parse_args()

for _ in range(args.n):
    # randint includes both endpoints
    print(random.randint(args.min, args.max))

prime_checker.py

#!/usr/bin/python3

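import sys
import math

def is_prime(num):
    """Trial division up to the integer square root of num."""
    if num <= 1:
        return False
    # math.isqrt (Python 3.8+) is exact for large integers, unlike int(math.sqrt(num))
    for i in range(2, math.isqrt(num) + 1):
        if num % i == 0:
            return False
    return True

# Read line by line so checking can start before the generator has finished
for line in sys.stdin:
    for token in line.split():
        n = int(token)
        if is_prime(n):
            print(n)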

Run both together

python3 random_int_generator.py 20 --min 1 --max 200 | python3 prime_checker.py

This streams numbers directly to the checker.

RSA Example Using Process Substitution

This example feeds two pipelines, each producing primes, into an RSA key generator.

The concept:

python3 RSAcompute.py \
  <(python3 random_int_generator.py --min 1000000 --max 10000000 10000 | python3 prime_checker.py) \
  <(python3 random_int_generator.py --min 1000000 --max 10000000 10000 | python3 prime_checker.py)

Each `<(...)` block is passed to RSAcompute.py as a file-like path that it can open and read.
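
The page does not show `RSAcompute.py` itself, so the following is only a guess at what such a script might look like: it reads one prime per line from each of the two paths and builds a textbook RSA key pair. The function names `read_primes` and `rsa_keypair` and the choice of e = 65537 are assumptions. Primes in the range used above are far too small for real cryptography; this is a teaching sketch.

```python
import math
import sys


def read_primes(path):
    """Read one integer per line from a file (or a <(...) path)."""
    with open(path) as handle:
        return [int(line) for line in handle if line.strip()]


def rsa_keypair(p, q, e=65537):
    """Build textbook RSA keys from two distinct primes p and q."""
    if p == q:
        raise ValueError("p and q must differ")
    n = p * q
    phi = (p - 1) * (q - 1)
    if math.gcd(e, phi) != 1:
        raise ValueError("e is not coprime to phi; pick other primes")
    d = pow(e, -1, phi)  # modular inverse, needs Python 3.8+
    return (n, e), (n, d)


# Only run as a command-line tool when exactly two paths are given.
if __name__ == "__main__" and len(sys.argv) == 3:
    p = read_primes(sys.argv[1])[0]
    # First prime from the second input that differs from p.
    q = next(x for x in read_primes(sys.argv[2]) if x != p)
    public, private = rsa_keypair(p, q)
    print("public (n, e):", public)
    print("private (n, d):", private)
```

With the classic small example p = 61, q = 53, e = 17, this gives n = 3233 and d = 2753.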

Benchmarking with time

time python3 random_int_generator.py 5000000 | python3 prime_checker.py

Compare:

  1. Using intermediate files
  2. Using pipes
  3. Using process substitution

Pipes are usually fastest because:

  • no disk I/O for intermediate data
  • both programs run concurrently
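
The comparison between pipes and intermediate files can also be scripted. The sketch below times both approaches for the same producer and consumer. Everything in it (`timed_pipe`, `timed_tempfile`, and the two one-line demo commands) is hypothetical and stands in for the real generator and checker scripts.

```python
import subprocess
import sys
import tempfile
import time

# Stand-ins for a generator and a consumer script.
PRODUCER = [sys.executable, "-c", "for i in range(100000): print(i)"]
CONSUMER = [sys.executable, "-c",
            "import sys; print(sum(int(x) for x in sys.stdin))"]


def timed_pipe(producer, consumer):
    """Time `producer | consumer`; both processes run concurrently."""
    start = time.perf_counter()
    p1 = subprocess.Popen(producer, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(consumer, stdin=p1.stdout,
                          stdout=subprocess.PIPE, text=True)
    p1.stdout.close()
    out, _ = p2.communicate()
    p1.wait()
    return time.perf_counter() - start, out.strip()


def timed_tempfile(producer, consumer):
    """Time `producer > tmp; consumer < tmp`; the steps run one after another."""
    start = time.perf_counter()
    with tempfile.TemporaryFile() as tmp:
        subprocess.run(producer, stdout=tmp, check=True)
        tmp.seek(0)  # rewind so the consumer reads from the beginning
        done = subprocess.run(consumer, stdin=tmp, stdout=subprocess.PIPE,
                              text=True, check=True)
    return time.perf_counter() - start, done.stdout.strip()


if __name__ == "__main__":
    for label, func in (("pipe", timed_pipe), ("temp file", timed_tempfile)):
        seconds, result = func(PRODUCER, CONSUMER)
        print(f"{label}: {seconds:.3f} s, result {result}")
```

Timings vary between machines and runs, and for inputs this small the cost of starting the processes tends to dominate, so the difference only becomes meaningful with larger data.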

Summary

Advanced UNIX concepts such as redirection, pipes, stderr handling, and process substitution are essential for:

  • chaining tools in NGS pipelines
  • avoiding large temporary files
  • streaming FASTQ/BAM/VCF data efficiently
  • mixing shell tools and Python scripts
  • optimizing performance

For core UNIX navigation and file management, see Basic UNIX Notes.