First look exercise
Overview
In this exercise you will explore real NGS data and learn how to use the screen command to keep long-running jobs alive after logout.
- Use standard UNIX commands to work with NGS data
- Use screen in the shell
First look at data
- Navigate to your home directory:
cd
cd without arguments returns you to your home directory: /home/people/[STUDENT ID]
- Create a directory called first_look and cd into it.
- Copy the file reads.fastq.gz from:
/home/projects/22126_NGS/exercises/first_look/
- Use zless to inspect the compressed FASTQ file:
zless -S reads.fastq.gz
A FASTQ read consists of exactly four lines:
- Header line starting with @
- Sequence line (A/C/G/T/N)
- “+” line (may repeat the header)
- Quality line (ASCII PHRED scores)
The -S option disables line wrapping so the sequence appears on one line.
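To make the four-line layout concrete, the sketch below writes a small made-up FASTQ file and uses awk to label each line by its position within the record. The file name toy.fastq and both reads are invented for illustration; they are not part of the course data.

```shell
# Work in a scratch directory so nothing in first_look is touched.
workdir=$(mktemp -d)

# Two invented reads; toy.fastq is a hypothetical file name.
printf '%s\n' \
    '@read1/1' 'ACGTN' '+' 'IIII#' \
    '@read2/1' 'GGCTA' '+' 'II5II' \
    > "$workdir/toy.fastq"

# NR is awk's running line number; NR % 4 gives the position inside each record.
labelled=$(awk '
    NR % 4 == 1 { print "header:   " $0 }
    NR % 4 == 2 { print "sequence: " $0 }
    NR % 4 == 3 { print "plus:     " $0 }
    NR % 4 == 0 { print "quality:  " $0 }
' "$workdir/toy.fastq")
echo "$labelled"

rm -r "$workdir"
```

Note that the sequence and quality lines of a read always have the same length, because each base has exactly one quality character.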
- Count the number of reads in the file.
Each read has 4 lines. Use wc to count lines:
zcat reads.fastq.gz | wc -l
Divide by 4 to get the number of reads.
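The division can also be left to the shell. The following is a minimal sketch on an invented three-read file (toy.fastq.gz is a hypothetical name, not the course file); it uses gzip -dc, which does the same job as zcat, and shell integer arithmetic.

```shell
workdir=$(mktemp -d)

# Three invented reads, gzip-compressed like reads.fastq.gz.
printf '%s\n' \
    '@r1' 'ACGT' '+' 'IIII' \
    '@r2' 'TTGA' '+' 'IIII' \
    '@r3' 'CCAG' '+' 'IIII' \
    | gzip > "$workdir/toy.fastq.gz"

# Decompress to standard output and count the lines.
n_lines=$(gzip -dc "$workdir/toy.fastq.gz" | wc -l)

# $(( ... )) performs integer arithmetic in the shell.
n_reads=$(( n_lines / 4 ))
echo "$n_lines lines, $n_reads reads"

rm -r "$workdir"
```

For the toy file this reports 12 lines and 3 reads. The same two steps should work on reads.fastq.gz by swapping in its file name.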
Illumina data
- Copy pairedReads.tar.gz into your first_look directory from:
/home/projects/22126_NGS/exercises/first_look/
Unpack it:
tar xvfz pairedReads.tar.gz
Flags:
- x – extract
- v – verbose
- f – archive file follows
- z – gzip-compressed
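To see the flags in action without touching the course archive, here is a small sketch that builds a throwaway archive and unpacks it again. The file names are invented; c (create) and t (list) are two further standard tar flags that combine with f and z in the same way as x.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Two small invented files stand in for the paired FASTQ files.
printf '@r1/1\nACGT\n+\nIIII\n' > toy_1.fastq
printf '@r1/2\nTGCA\n+\nIIII\n' > toy_2.fastq

# c creates an archive; z and f mean the same as when extracting.
tar czf toy.tar.gz toy_1.fastq toy_2.fastq
rm toy_1.fastq toy_2.fastq

# t lists the archive contents without unpacking anything.
listing=$(tar tzf toy.tar.gz)
echo "$listing"

# x unpacks; v prints each file name as it is extracted.
tar xvfz toy.tar.gz
```

Listing an archive with t before extracting is a cheap way to check what it will write into the current directory.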
If a file ends with .tar.bz2, use j instead of z.
- You should now have two FASTQ files:
- ERR243038_1.fastq
- ERR243038_2.fastq
Inspect the first read in each file. The two headers should be identical except for the trailing 1 or 2, which means they are paired-end reads from opposite ends of the same DNA fragment.
- We now check whether the two files are “in sync.”
We will:
- extract all header lines from each file
- remove the final /1 or /2-type suffix
- write each set of normalized headers to a new file
- compare the two files
- Extract all header lines with grep. In these files, every FASTQ header begins with:
@ERR243038
Example:
grep '^@ERR243038' ERR243038_1.fastq | head
Try this for both FASTQ files and inspect the first 10 headers.
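Including the read-name prefix in the pattern matters because @ is also a valid quality character, so a quality line can start with it too. The sketch below shows the difference on an invented file (toy.fastq and the @TOY prefix are made up), and adds a position-based alternative with awk that assumes strict four-line records.

```shell
workdir=$(mktemp -d)

# Two invented reads; the quality line of the second one starts with "@".
printf '%s\n' \
    '@TOY.1/1' 'ACGTA' '+' 'IIIII' \
    '@TOY.2/1' 'GGCTA' '+' '@IIII' \
    > "$workdir/toy.fastq"

# A bare ^@ also matches the quality line that begins with "@".
loose=$(grep -c '^@' "$workdir/toy.fastq")

# Anchoring on the read-name prefix matches only the true headers.
strict=$(grep -c '^@TOY' "$workdir/toy.fastq")

# Taking every fourth line, starting at line 1, needs no pattern at all.
by_position=$(awk 'NR % 4 == 1' "$workdir/toy.fastq" | wc -l)

echo "loose=$loose strict=$strict by_position=$by_position"

rm -r "$workdir"
```

Here the loose pattern counts 3 lines while the other two approaches count the 2 real headers.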
- Remove the trailing 1 or 2 using sed. Examples of sed patterns:
sed 's/PATTERN/REPLACEMENT/' file
sed 's/PATTERN//' file # remove PATTERN
sed 's/^PATTERN//' file # remove PATTERN at line start
sed 's/PATTERN$//' file # remove PATTERN at line end
Apply sed to strip the last character from each header line. For example:
grep '^@ERR243038' ERR243038_1.fastq | sed 's/.$//' | head
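Because . matches any character, 's/.$//' removes whatever happens to be last on the line. As a small illustration on invented header lines (not taken from the course files), the sketch below compares it with a narrower pattern, [12]$, which only removes a trailing 1 or 2.

```shell
# Three invented header lines; the last one has no 1/2 suffix.
headers=$(printf '%s\n' '@TOY.1/1' '@TOY.1/2' '@TOY.7')

# "." matches any character, so the final character is always removed.
any_char=$(echo "$headers" | sed 's/.$//')

# "[12]" matches only a literal 1 or 2 at the end of the line.
only_12=$(echo "$headers" | sed 's/[12]$//')

echo "$any_char"
echo "$only_12"
```

The first command turns @TOY.7 into @TOY. while the second leaves it unchanged. For headers that all end in 1 or 2, both give the same result.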
- Redirect the output into files:
grep '^@ERR243038' ERR243038_1.fastq | sed 's/.$//' > human_1.headers
grep '^@ERR243038' ERR243038_2.fastq | sed 's/.$//' > human_2.headers
- Inspect the first 10 lines side-by-side with paste:
paste human_1.headers human_2.headers | head
- Finally compare both files using diff:
diff human_1.headers human_2.headers
If diff prints nothing, the paired files are perfectly in sync.
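For use in a script, the same check can rely on the exit status of diff instead of its printed output: diff exits with 0 when the files are identical and 1 when they differ. This is a minimal sketch on invented paired files; the toy_* file names, the @TOY prefix, and the check_sync helper are all hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Invented paired files with two reads each.
printf '%s\n' '@TOY.1/1' 'ACGT' '+' 'IIII' '@TOY.2/1' 'GGCA' '+' 'IIII' > toy_1.fastq
printf '%s\n' '@TOY.1/2' 'TTGC' '+' 'IIII' '@TOY.2/2' 'CATG' '+' 'IIII' > toy_2.fastq

# Select the headers, drop the last character, and save the result.
grep '^@TOY' toy_1.fastq | sed 's/.$//' > toy_1.headers
grep '^@TOY' toy_2.fastq | sed 's/.$//' > toy_2.headers

# Hypothetical helper: reports the result based on diff's exit status.
check_sync() {
    if diff "$1" "$2" > /dev/null; then
        echo "in sync"
    else
        echo "out of sync"
    fi
}

status=$(check_sync toy_1.headers toy_2.headers)

# Reversing the order of one header file simulates an out-of-sync pair.
sort -r toy_2.headers > toy_2.reversed
status_reversed=$(check_sync toy_1.headers toy_2.reversed)

echo "original: $status, reversed: $status_reversed"
```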
Use screen in the shell
NGS jobs often run for hours. If you log out or lose network connection, all running commands normally die. The screen program creates a persistent “virtual terminal” that continues running even after logout.
Benefits:
- Safe against connection drops
- Allows long-running jobs to continue after logout
- Can detach at work, reattach from home
Start screen:
screen
Press Enter to dismiss the welcome message.
Inside a screen session:
- All commands run normally
- Special commands begin with Ctrl-a
Try:
Ctrl-a ?
This opens the help screen. Press Enter to exit.
Run something simple, e.g.:
ls
Detach from the session:
Ctrl-a d
You’ll see:
[detached]
Reattach later:
screen -r
If your SSH session dies, simply reconnect and run screen -r to resume.
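When several sessions are in use, giving each one a name makes it easier to find again. The sketch below is a non-interactive illustration using standard screen options (-S, -d -m, -ls, -X); the session name first_look_demo is invented, and the sleep command merely stands in for a long-running job.

```shell
session="first_look_demo"   # hypothetical session name

if command -v screen > /dev/null 2>&1; then
    # -d -m starts the session already detached; -S gives it a name.
    screen -dmS "$session" sleep 30

    # -ls lists sessions; its exit status differs between versions, hence "|| true".
    screen -ls || true

    # -X sends a command to the named session; "quit" ends it.
    screen -S "$session" -X quit || true
else
    echo "screen is not installed on this machine"
fi
```

In interactive use, screen -S first_look_demo starts a named session and screen -r first_look_demo reattaches to it after a detach or a dropped connection.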
More documentation: screen tutorial
Congratulations — you have completed the exercise!