First look exercise: Difference between revisions
(Created page with " <H2>Overview</H2> <p>In this exercise you will try to look at empirical NGS data. Additionally, you will try to use the '''screen''' command when using the shell. </p> <OL> <LI>Use standard UNIX commands to work with NGS data <LI>Use '''screen''' in shell </OL> <HR> <H2>First look at data</H2> <OL> <LI>Navigate to your home directory: <pre> cd </pre> '''cd''' without arguments will bring you back to your home directory. In our case, your home is: <pre> /home/people...") |
No edit summary |
||
| Line 1: | Line 1: | ||
<H2>Overview</H2> | <H2>Overview</H2> | ||
<p>In this exercise you will | <p>In this exercise you will explore real NGS data and learn how to use the <b>screen</b> command to keep long-running jobs alive after logout.</p> | ||
<OL> | <OL> | ||
<LI>Use standard UNIX commands to work with NGS data | <LI>Use standard UNIX commands to work with NGS data</LI> | ||
<LI>Use | <LI>Use <b>screen</b> in the shell</LI> | ||
</OL> | </OL> | ||
| Line 14: | Line 13: | ||
<OL> | <OL> | ||
<LI>Navigate to your home directory: | <LI>Navigate to your home directory: | ||
<pre> | <pre> | ||
cd | cd | ||
</pre> | </pre> | ||
<p><code>cd</code> without arguments returns you to your home directory: | |||
<pre> | <pre> | ||
/home/people/[STUDENT ID] | /home/people/[STUDENT ID] | ||
</pre> | </pre> | ||
</p> | |||
<LI>Create a | <LI>Create a directory called <tt>first_look</tt> and <code>cd</code> into it.</LI> | ||
< | |||
</ | |||
< | <LI>Copy the file <tt>reads.fastq.gz</tt> from: | ||
< | <pre> | ||
/home/projects/22126_NGS/exercises/first_look/ | |||
</pre> | |||
<LI> | <LI>Use <b>zless</b> to inspect the compressed FASTQ file: | ||
<pre> | <pre> | ||
zless -S reads.fastq.gz | |||
</pre> | </pre> | ||
A FASTQ read consists of exactly four lines: | |||
<OL> | <OL> | ||
<LI>Header line starting with <code>@</code></LI> | |||
<LI>Sequence line (A/C/G/T/N)</LI> | |||
<LI>“+” line (may repeat the header)</LI> | |||
<LI>Quality line (ASCII PHRED scores)</LI> | |||
</OL> | |||
The <tt>-S</tt> option disables line wrapping so the sequence appears on one line. | |||
<LI>Count the number of reads in the file. | |||
<p>Each read has 4 lines. Use <code>wc</code> to count lines:</p> | |||
<pre> | <pre> | ||
zcat reads.fastq.gz | wc -l | |||
</pre> | </pre> | ||
Divide by 4 to get the number of reads. | |||
< | |||
</OL> | |||
</ | |||
<HR> | |||
<H2>Illumina data</H2> | |||
<OL> | |||
<LI>Copy <tt>pairedReads.tar.gz</tt> into your <tt>first_look</tt> directory from: | |||
<pre> | <pre> | ||
/home/projects/22126_NGS/exercises/first_look/ | |||
</pre> | </pre> | ||
Unpack it: | |||
<pre> | <pre> | ||
tar xvfz pairedReads.tar.gz | |||
</pre> | </pre> | ||
Flags: | |||
<ul> | |||
<li><b>x</b> – extract</li> | |||
<li><b>v</b> – verbose</li> | |||
<li><b>f</b> – archive file follows</li> | |||
<li><b>z</b> – gzip-compressed</li> | |||
</ul> | |||
< | If a file ends with <tt>.tar.bz2</tt>, use <code>j</code> instead of <code>z</code>. | ||
<LI>You should now have two FASTQ files: | |||
<ul> | |||
<li><tt>ERR243038_1.fastq</tt></li> | |||
<li><tt>ERR243038_2.fastq</tt></li> | |||
</ul> | |||
< | Inspect the first read in each file. The two headers should be identical except for the trailing <code>1</code> or <code>2</code> — meaning they are paired-end reads from opposite ends of the same DNA fragment. | ||
</ | |||
<LI>We now check whether the two files are “in sync.” | |||
We will: | |||
<OL> | |||
<LI>extract all header lines from each file</LI> | |||
<LI>remove the final <code>/1</code> or <code>/2</code>-type suffix</LI> | |||
<LI>write each set of normalized headers to a new file</LI> | |||
<LI>compare the two files</LI> | |||
</OL> | |||
<LI>Extract all header lines with <code>grep</code>. FASTQ headers begin with: | |||
<pre> | <pre> | ||
@ERR243038 | |||
</pre> | </pre> | ||
Example: | |||
<pre> | <pre> | ||
grep '^@ERR243038' ERR243038_1.fastq | head | |||
</pre> | </pre> | ||
Try this for both FASTQ files and inspect the first 10 headers. | |||
<LI>Remove the trailing <code>1</code> or <code>2</code> using <code>sed</code>. | |||
Examples of <code>sed</code> patterns: | |||
<pre> | <pre> | ||
sed 's/PATTERN/REPLACEMENT/' file | |||
sed 's/PATTERN//' file # remove PATTERN | |||
sed 's/^PATTERN//' file # remove PATTERN at line start | |||
sed 's/PATTERN$//' file # remove at line end | |||
</pre> | </pre> | ||
< | Apply <code>sed</code> to strip the last character from each header line. | ||
For example: | |||
<pre> | <pre> | ||
sed 's/ | grep '^@ERR243038' ERR243038_1.fastq | sed 's/.$//' | head | ||
</pre> | </pre> | ||
<LI>Redirect the output into files: | |||
<pre> | <pre> | ||
grep '^@ERR243038' ERR243038_1.fastq | sed 's/.$//' > human_1.headers | |||
grep '^@ERR243038' ERR243038_2.fastq | sed 's/.$//' > human_2.headers | |||
</pre> | </pre> | ||
<LI>Inspect the first 10 lines side-by-side with <code>paste</code>: | |||
<pre> | <pre> | ||
paste | paste human_1.headers human_2.headers | head | ||
</pre> | </pre> | ||
<LI>Finally compare both files using <code>diff</code>: | |||
<pre> | <pre> | ||
diff | diff human_1.headers human_2.headers | ||
</pre> | </pre> | ||
If <code>diff</code> prints nothing, the pair files are perfectly in sync. | |||
</OL> | </OL> | ||
<HR> | |||
< | <H2>Use <tt>screen</tt> in the shell</H2> | ||
< | <p>NGS jobs often run for hours. If you log out or lose network connection, all running commands normally die. The <b>screen</b> program creates a persistent “virtual terminal” that continues running even after logout.</p> | ||
< | <b>Benefits:</b> | ||
<OL> | <OL> | ||
<LI> | <LI>Safe against connection drops</LI> | ||
<LI> | <LI>Allows long-running jobs to continue after logout</LI> | ||
<LI> | <LI>Can detach at work, reattach from home</LI> | ||
</OL> | </OL> | ||
<b>Start screen:</b> | |||
<pre> | <pre> | ||
screen | screen | ||
</pre> | </pre> | ||
Press <kbd>Enter</kbd> to dismiss the welcome message. | |||
Inside a screen session: | |||
* All commands run normally | |||
* Special commands begin with <kbd>Ctrl-a</kbd> | |||
Try: | |||
<pre> | |||
Ctrl-a ? | |||
</pre> | |||
This opens the help screen. Press <kbd>Enter</kbd> to exit. | |||
Run something simple, e.g.: | |||
<pre> | |||
ls | |||
</pre> | |||
<b>Detach</b> from the session: | |||
<pre> | |||
Ctrl-a d | |||
</pre> | |||
You’ll see: | |||
<pre> | |||
[detached] | |||
</pre> | |||
<b>Reattach</b> later: | |||
<pre> | <pre> | ||
screen -r | screen -r | ||
</pre> | </pre> | ||
If your SSH session dies, simply reconnect and run <code>screen -r</code> to resume. | |||
More documentation: | |||
[http://kb.iu.edu/data/acuy.html screen tutorial] | |||
<HR> | <HR> | ||
<p>Congratulations you | <p><b>Congratulations — you have completed the exercise!</b></p> | ||
<HR> | <HR> | ||
[[First_look_exercise_answers]] | [[First_look_exercise_answers]] | ||
Latest revision as of 12:45, 20 November 2025
Overview
In this exercise you will explore real NGS data and learn how to use the screen command to keep long-running jobs alive after logout.
- Use standard UNIX commands to work with NGS data
- Use screen in the shell
First look at data
- Navigate to your home directory:
cd
cd without arguments returns you to your home directory:
/home/people/[STUDENT ID]
- Create a directory called first_look and cd into it.
- Copy the file reads.fastq.gz from:
/home/projects/22126_NGS/exercises/first_look/
- Use zless to inspect the compressed FASTQ file:
zless -S reads.fastq.gz
A FASTQ read consists of exactly four lines:
- Header line starting with @
- Sequence line (A/C/G/T/N)
- “+” line (may repeat the header)
- Quality line (ASCII PHRED scores)
The -S option disables line wrapping so the sequence appears on one line.
- Count the number of reads in the file.
Each read has 4 lines. Use wc to count lines:
zcat reads.fastq.gz | wc -l
Divide by 4 to get the number of reads.
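The division can also be done in the shell itself. The following is a minimal sketch on a made-up two-read file; the file name toy.fastq.gz and the read names are hypothetical, and gzip -cd is used as a portable equivalent of zcat:

```shell
# Build a tiny, made-up FASTQ file with two reads (hypothetical data).
tmpdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | gzip > "$tmpdir/toy.fastq.gz"

# Count the decompressed lines, then divide by 4 with shell integer arithmetic.
n_lines=$(gzip -cd "$tmpdir/toy.fastq.gz" | wc -l)
n_reads=$(( n_lines / 4 ))
echo "$n_reads reads"
```

Integer division silently drops any remainder, so a line count that is not a multiple of 4 suggests a truncated or malformed file.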
Illumina data
- Copy pairedReads.tar.gz into your first_look directory from:
/home/projects/22126_NGS/exercises/first_look/
Unpack it:
tar xvfz pairedReads.tar.gz
Flags:
- x – extract
- v – verbose
- f – archive file follows
- z – gzip-compressed
If a file ends with .tar.bz2, use j instead of z.
- You should now have two FASTQ files:
- ERR243038_1.fastq
- ERR243038_2.fastq
Inspect the first read in each file. The two headers should be identical except for the trailing 1 or 2, which marks them as paired-end reads from opposite ends of the same DNA fragment.
- We now check whether the two files are “in sync,” that is, whether the reads appear in the same order in both files.
We will:
- extract all header lines from each file
- remove the trailing 1 or 2 of the /1 or /2-type suffix
- write each set of normalized headers to a new file
- compare the two files
- Extract all header lines with grep. FASTQ headers in these files begin with:
@ERR243038
Example:
grep '^@ERR243038' ERR243038_1.fastq | head
Try this for both FASTQ files and inspect the first 10 headers.
- Remove the trailing 1 or 2 using sed. Examples of sed patterns:
sed 's/PATTERN/REPLACEMENT/' file
sed 's/PATTERN//' file # remove PATTERN
sed 's/^PATTERN//' file # remove PATTERN at line start
sed 's/PATTERN$//' file # remove PATTERN at line end
Apply sed to strip the last character from each header line. For example:
grep '^@ERR243038' ERR243038_1.fastq | sed 's/.$//' | head
- Redirect the output into files:
grep '^@ERR243038' ERR243038_1.fastq | sed 's/.$//' > human_1.headers
grep '^@ERR243038' ERR243038_2.fastq | sed 's/.$//' > human_2.headers
- Inspect the first 10 lines side-by-side with paste:
paste human_1.headers human_2.headers | head
- Finally compare both files using diff:
diff human_1.headers human_2.headers
If diff prints nothing, the two files are in sync.
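The whole check can be sketched end to end on toy data. This is an illustration under stated assumptions, not the exercise's own method: the file names toy_1.fastq and toy_2.fastq and the ERR000000 read names are made up, and headers are selected by position with awk (the first line of every four-line record) instead of by grep prefix, which avoids relying on the accession name.

```shell
# Work in a scratch directory with two made-up paired files (hypothetical names).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf '@ERR000000.1 toy/1\nACGT\n+\nIIII\n@ERR000000.2 toy/1\nGGCA\n+\nIIII\n' > toy_1.fastq
printf '@ERR000000.1 toy/2\nTTAC\n+\nIIII\n@ERR000000.2 toy/2\nCCGT\n+\nIIII\n' > toy_2.fastq

# Keep line 1 of every 4-line record, then strip the last character (the 1 or 2).
awk 'NR % 4 == 1' toy_1.fastq | sed 's/.$//' > toy_1.headers
awk 'NR % 4 == 1' toy_2.fastq | sed 's/.$//' > toy_2.headers

# diff exits with status 0 when the files are identical and 1 when they differ.
if diff toy_1.headers toy_2.headers > /dev/null; then
    sync_status="in sync"
else
    sync_status="out of sync"
fi
echo "$sync_status"
```

Testing the exit status rather than reading the output is convenient in scripts, where an empty diff is easy to overlook.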
Use screen in the shell
NGS jobs often run for hours. If you log out or lose network connection, all running commands normally die. The screen program creates a persistent “virtual terminal” that continues running even after logout.
Benefits:
- Safe against connection drops
- Allows long-running jobs to continue after logout
- Can detach at work, reattach from home
Start screen:
screen
Press Enter to dismiss the welcome message.
Inside a screen session:
- All commands run normally
- Special commands begin with Ctrl-a
Try:
Ctrl-a ?
This opens the help screen. Press Enter to exit.
Run something simple, e.g.:
ls
Detach from the session:
Ctrl-a d
You’ll see:
[detached]
Reattach later:
screen -r
If your SSH session dies, simply reconnect and run screen -r to resume.
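When more than one session exists, a plain screen -r cannot tell which one to resume, so naming sessions helps. The following is a minimal sketch, assuming GNU screen is installed; the session name first_look_demo and the sleep 30 placeholder job are made up for illustration:

```shell
# Hypothetical session name, used only for this illustration.
session=first_look_demo

if command -v screen >/dev/null 2>&1; then
    # -d -m starts the session already detached; -S gives it a name.
    screen -dmS "$session" sleep 30 || true
    # List existing sessions (the exit status varies between versions, hence "|| true").
    screen -ls || true
    # Interactively you would now reattach with:  screen -r "$session"
    # Here the session is simply asked to terminate instead.
    screen -S "$session" -X quit || true
    demo_status="ran"
else
    demo_status="skipped: screen not installed"
fi
echo "$demo_status"
```

In interactive use, replace the placeholder job with the real command and reattach by name once you are logged in again.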
More documentation: screen tutorial
Congratulations — you have completed the exercise!