First look exercise

Revision as of 12:45, 20 November 2025 by Mick

Overview

In this exercise you will explore real NGS data and learn how to use the screen command to keep long-running jobs alive after logout.

  1. Use standard UNIX commands to work with NGS data
  2. Use screen in the shell

First look at data

  1. Navigate to your home directory:
    cd
    

    cd without arguments returns you to your home directory:

    /home/people/[STUDENT ID]
    

  2. Create a directory called first_look and cd into it.
  3. Copy the file reads.fastq.gz from:
    /home/projects/22126_NGS/exercises/first_look/
    
  4. Use zless to inspect the compressed FASTQ file:
    zless -S reads.fastq.gz
    

    A FASTQ read consists of exactly four lines:

    1. Header line starting with @
    2. Sequence line (A/C/G/T/N)
    3. “+” line (may repeat the header)
    4. Quality line (ASCII-encoded Phred scores, one character per base)

    The -S option disables line wrapping, so each long line stays on a single screen line. Use the arrow keys to scroll sideways and press q to quit.

  5. Count the number of reads in the file.

    Each read has 4 lines. Use wc to count lines:

    zcat reads.fastq.gz | wc -l
    

    Divide by 4 to get the number of reads.
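
The line count and the division by four can also be done in one step with shell arithmetic. The following is a minimal sketch, not part of the original exercise: the function name count_reads and the two-read demo file are made up for illustration, and it assumes every record occupies exactly four lines, as described above. It uses gzip -cd, which decompresses to standard output in the same way as zcat.

```shell
# Hypothetical helper: count the reads in a gzip-compressed FASTQ file,
# assuming each record is exactly four lines long.
count_reads() {
    # gzip -cd writes the decompressed data to stdout (same effect as zcat).
    n_lines=$(gzip -cd "$1" | wc -l)
    # Integer division: four lines per read.
    echo $(( $n_lines / 4 ))
}

# Tiny self-made demo file with two reads, so the sketch runs anywhere.
demo_dir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip > "$demo_dir/demo.fastq.gz"

n_reads=$(count_reads "$demo_dir/demo.fastq.gz")
echo "$n_reads"   # prints 2
```

On the course file the call would be count_reads reads.fastq.gz.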


Illumina data

  1. Copy pairedReads.tar.gz into your first_look directory from:
    /home/projects/22126_NGS/exercises/first_look/
    

    Unpack it:

    tar xvfz pairedReads.tar.gz
    

    Flags:

    • x – extract
    • v – verbose
    • f – archive file follows
    • z – gzip-compressed

    If a file ends with .tar.bz2, use j instead of z.

  2. You should now have two FASTQ files:
    • ERR243038_1.fastq
    • ERR243038_2.fastq

    Inspect the first read in each file. The two headers should be identical except for the trailing 1 or 2, which marks the two reads as a pair sequenced from opposite ends of the same DNA fragment.

  3. We now check whether the two files are “in sync.” We will:
    1. extract all header lines from each file
    2. remove the trailing 1 or 2 that distinguishes the two mates
    3. write each set of normalized headers to a new file
    4. compare the two files
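  4. Extract all header lines with grep. Matching on @ alone is not enough, because a quality line can also start with @. In these files every header begins with the run accession: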
    @ERR243038
    

    Example:

    grep '^@ERR243038' ERR243038_1.fastq | head
    

    Try this for both FASTQ files and inspect the first 10 headers.

  5. Remove the trailing 1 or 2 using sed. Examples of sed patterns:
    sed 's/PATTERN/REPLACEMENT/' file
    sed 's/PATTERN//' file              # remove PATTERN
    sed 's/^PATTERN//' file             # remove PATTERN at line start
    sed 's/PATTERN$//' file             # remove at line end
    

    Apply sed to strip the last character from each header line. In the pattern, . matches any single character and $ anchors it to the end of the line. For example:

    grep '^@ERR243038' ERR243038_1.fastq | sed 's/.$//' | head
    
  6. Redirect the output into files:
    grep '^@ERR243038' ERR243038_1.fastq | sed 's/.$//' > human_1.headers
    grep '^@ERR243038' ERR243038_2.fastq | sed 's/.$//' > human_2.headers
    
  7. Inspect the first 10 lines side-by-side with paste:
    paste human_1.headers human_2.headers | head
    
  8. Finally compare both files using diff:
    diff human_1.headers human_2.headers
    

    If diff prints nothing, the two files list the same reads in the same order, so the paired files are in sync.
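
The steps above can be wrapped into a single check that reports its result through the exit status of diff. This is a minimal sketch under stated assumptions: the function name headers_in_sync, the DEMO header prefix, and the small demo files are hypothetical and only there so the example runs without the course data. The grep and sed commands are the ones used in the exercise.

```shell
# Hypothetical helper: succeed (exit status 0) if both FASTQ files list the
# same headers in the same order once the last character is stripped.
# Usage: headers_in_sync PREFIX FILE_1 FILE_2
headers_in_sync() {
    tmp=$(mktemp -d)
    grep "^@$1" "$2" | sed 's/.$//' > "$tmp/1.headers"
    grep "^@$1" "$3" | sed 's/.$//' > "$tmp/2.headers"
    # diff exits with 0 when the files are identical and 1 when they differ.
    diff "$tmp/1.headers" "$tmp/2.headers" > /dev/null
}

# Demo data: one matching pair, plus a second file with the reads swapped.
sync_dir=$(mktemp -d)
printf '@DEMO.1/1\nACGT\n+\nIIII\n@DEMO.2/1\nTTGA\n+\nIIII\n' > "$sync_dir/a_1.fastq"
printf '@DEMO.1/2\nCCGT\n+\nIIII\n@DEMO.2/2\nGGGA\n+\nIIII\n' > "$sync_dir/a_2.fastq"
printf '@DEMO.2/2\nGGGA\n+\nIIII\n@DEMO.1/2\nCCGT\n+\nIIII\n' > "$sync_dir/swapped_2.fastq"

if headers_in_sync DEMO "$sync_dir/a_1.fastq" "$sync_dir/a_2.fastq"; then
    echo "in sync"
else
    echo "out of sync"
fi
```

For the exercise files the call would be headers_in_sync ERR243038 ERR243038_1.fastq ERR243038_2.fastq. If the headers end in /1 and /2, a stricter pattern such as sed 's|/[12]$||' would remove only that suffix instead of whatever the last character happens to be.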


Use screen in the shell

NGS jobs often run for hours. If you log out or lose your network connection, commands started from that shell are normally terminated. The screen program creates a persistent “virtual terminal” that keeps running on the server after you log out.

Benefits:

  1. Safe against connection drops
  2. Allows long-running jobs to continue after logout
  3. Can detach at work, reattach from home

Start screen:

screen

Press Enter to dismiss the welcome message.

Inside a screen session:

  • All commands run normally
  • Special commands begin with Ctrl-a

Try:

Ctrl-a ?

This opens the help screen. Press Enter to exit.

Run something simple, e.g.:

ls

Detach from the session:

Ctrl-a d

You’ll see a message such as (newer versions also print the session ID):

[detached]

Reattach later:

screen -r

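If your SSH session dies, simply reconnect and run screen -r to resume. If screen reports that the session is still attached, run screen -d -r to detach it from the old connection and reattach it here.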
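
When several sessions are running, it helps to give each one a name. The sketch below is an illustration rather than part of the original exercise: the session name first_look_demo is made up, and sleep 30 stands in for a long-running job. It uses the GNU screen options -S (name a session), -d -m (start detached), -ls (list sessions) and -X (send a command to a running session).

```shell
# Hypothetical session name; pick anything descriptive.
session=first_look_demo

if command -v screen >/dev/null 2>&1; then
    # Start a named session in detached mode; "sleep 30" stands in for a real job.
    screen -dmS "$session" sleep 30 || true

    # List all sessions. The exit status of -ls varies between versions,
    # so it is ignored here.
    screen -ls || true

    # To reattach interactively you would run:  screen -r "$session"

    # Clean up the demo session.
    screen -S "$session" -X quit || true
else
    echo "screen is not installed on this machine"
fi
```

Starting a job this way means it never depends on the login shell, so nothing has to be detached by hand before logging out.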

More documentation: screen tutorial


Congratulations — you have completed the exercise!


First_look_exercise_answers