First look exercise: Difference between revisions

From 22126
Jump to navigation Jump to search
(Created page with " <H2>Overview</H2> <p>In this exercise you will try to look at empirical NGS data. Additionally, you will try to use the '''screen''' command when using the shell. </p> <OL> <LI>Use standard UNIX commands to work with NGS data <LI>Use '''screen''' in shell </OL> <HR> <H2>First look at data</H2> <OL> <LI>Navigate to your home directory: <pre> cd </pre> '''cd''' without arguments will bring you back to your home directory. In our case, your home is: <pre> /home/people...")
 
No edit summary
 
Line 1: Line 1:
<H2>Overview</H2>
<H2>Overview</H2>


<p>In this exercise you will try to look at empirical NGS data. Additionally, you will try to use the '''screen''' command when using the shell. </p>
<p>In this exercise you will explore real NGS data and learn how to use the <b>screen</b> command to keep long-running jobs alive after logout.</p>


<OL>
<OL>
<LI>Use standard UNIX commands to work with NGS data
<LI>Use standard UNIX commands to work with NGS data</LI>
<LI>Use '''screen''' in shell
<LI>Use <b>screen</b> in the shell</LI>
</OL>
</OL>


Line 14: Line 13:


<OL>
<OL>
<LI>Navigate to your home directory:
<LI>Navigate to your home directory:
<pre>
<pre>
cd
cd
</pre>
</pre>
'''cd''' without arguments will bring you back to your home directory. In our case, your home is:
<p><code>cd</code> without arguments returns you to your home directory:
<pre>
<pre>
/home/people/[STUDENT ID]
/home/people/[STUDENT ID]
</pre>
</pre>
</p>


<LI>Create a folder called "first_look". Change dir to it.
<LI>Create a directory called <tt>first_look</tt> and <code>cd</code> into it.</LI>
<LI>Copy the "reads.fastq.gz" to your folder from "/home/projects/22126_NGS/exercises/first_look/"
<LI>The program '''zless''' allows you to view a zipped text file, press the key q to quit. Look at the data using '''less file'''. You will see a couple of reads in here. We will try to count the number of reads that there are. A read is always four lines, it may look like there are more lines but it is because the lines are wrapped. Try opening the file with '''zless -S file''' and you should see it:
  <OL>
  <LI>Header line starts with "@"
  <LI>Sequence line with the DNA sequence
  <LI>Middle header line with a "+" and sometimes also the header
  <LI>Base-quality line PHRED scaled probability that the base is wrong
  </OL>
<LI>Try to count the number of reads in the file. You can of course do this by looking into the file, but this gets hard when there are millions of reads in a file. Remember that the header always starts with "@". Try to count the number of lines using '''wc''' and divide by 4. What count do you get?  How many reads do you get?
<!-- <LI>The reads.fastq are actually not an actual datafile, it is sequencing reads from five different NGS technologies. Can you determine which are which? Hint: copy the other fastq files from "/home/projects/22126_NGS/exercises/first_look/" to your dir and look at those too. -->
</OL>


<H2>Illumina data</H2>
<LI>Copy the file <tt>reads.fastq.gz</tt> from:
<OL>
<pre>
/home/projects/22126_NGS/exercises/first_look/
</pre>


<LI> Now we are going to look at some Illumina paired-end reads. Copy the file "pairedReads.tar.gz" from "/home/projects/22126_NGS/exercises/first_look/" to your dir /home/people/[STUDENT ID]/first_look/. You will see that these files are in a tar/zipped file as ".tar.gz". A '''tar''' file is an archive (collection) of files that makes it easier to send and copy. Zipping saves disk space. To unzip and untar (unpack the files) them, use:
<LI>Use <b>zless</b> to inspect the compressed FASTQ file:
<pre>
<pre>
tar xvfz file.tar.gz  
zless -S reads.fastq.gz
</pre>
</pre>
The flag options "xvfz" do the following: '''x''' = extract the files from the archive, '''v''' = list files which are processed (v for verbose), '''f'''= what follows is the archive file, '''z''' the file is zipped. Note: If the file ends with ".tar.bz2", this means this was sorted using the Burrows-Wheeler algorithm, use instead "xvfj" as flags.


<LI> Now you should have two files, "ERR243038_1.fastq" and "ERR243038_2.fastq". Have a look at them both. Compare the headers of the first reads. They should be identical except for the last character - this means that these two reads are paired together, ie. they are the DNA sequences from the two ends of the same DNA fragment. It is important that they are in sync i.e. that the fifth read in file 1 is paired together with the fifth read in file 2.
A FASTQ read consists of exactly four lines:
<LI> Let's try to see if they in sync. We will do this by 1) extracting the header lines in each file 2) trimming away the last character to make them identical 3) send the outputs to different files and 4) compare the output we get from both files.


<OL>
<OL>
<LI>Header line starting with <code>@</code></LI>
<LI>Sequence line (A/C/G/T/N)</LI>
<LI>“+” line (may repeat the header)</LI>
<LI>Quality line (ASCII PHRED scores)</LI>
</OL>
The <tt>-S</tt> option disables line wrapping so the sequence appears on one line.
<LI>Count the number of reads in the file.
<p>Each read has 4 lines. Use <code>wc</code> to count lines:</p>


<LI>  We will extract the header lines from each file. One way to do this is to leverage the fact that every header line begins with "@ERR243038". We have covered '''grep'''. It uses 2 special characters. "^" indicates the beginning of the line. "$" indicates the end of the line. One way to specify "return all lines beginning with a particular pattern", is to use:
<pre>
<pre>
grep ^PATTERN file
zcat reads.fastq.gz | wc -l
</pre>
</pre>


One way to specify "return all lines ending with a particular pattern", is to use:
Divide by 4 to get the number of reads.
<pre>
 
grep PATTERN$ file
</OL>
</pre>
 
<HR>
 
<H2>Illumina data</H2>


Try to determine all header lines (hint: beginning with "@ERR243038") in ERR243038_1.fastq using '''grep'''. Since there are many hits, to visualize the first 10 lines, one can send the output of '''grep''' into the '''head''' command using the UNIX pipe '|' :
<OL>


<LI>Copy <tt>pairedReads.tar.gz</tt> into your <tt>first_look</tt> directory from:
<pre>
<pre>
grep ^PATTERN file | head
/home/projects/22126_NGS/exercises/first_look/
</pre>
</pre>


Fun fact, you can visualize the last 10 hits using '''tail''':
Unpack it:
 
<pre>
<pre>
grep ^PATTERN file | tail
tar xvfz pairedReads.tar.gz
</pre>
</pre>


Try to view the first 10 header lines of ERR243038_1.fastq and ERR243038_2.fastq
Flags:
<ul>
<li><b>x</b> – extract</li>
<li><b>v</b> – verbose</li>
<li><b>f</b> – archive file follows</li>
<li><b>z</b> – gzip-compressed</li>
</ul>


<LI> The second step is to remove the trailing "1" and "2" from the header lines. One potential command to do this is '''sed''' which stands for "stream editor". It can take a file or the output of another command as input and edit the strings in various ways and return the modified strings as output.  
If a file ends with <tt>.tar.bz2</tt>, use <code>j</code> instead of <code>z</code>.


For instance, to change the word PATTERN with REPLACEMENT, one can do:
<LI>You should now have two FASTQ files:
<ul>
<li><tt>ERR243038_1.fastq</tt></li>
<li><tt>ERR243038_2.fastq</tt></li>
</ul>


<pre>
Inspect the first read in each file. The two headers should be identical except for the trailing <code>1</code> or <code>2</code> — meaning they are paired-end reads from opposite ends of the same DNA fragment.
sed 's/PATTERN/REPLACEMENT/'  file
</pre>


To simply remove PATTERN, simply put an empty replace string:
<LI>We now check whether the two files are “in sync.” 
We will:
<OL>
<LI>extract all header lines from each file</LI>
<LI>remove the final <code>/1</code> or <code>/2</code>-type suffix</LI>
<LI>write each set of normalized headers to a new file</LI>
<LI>compare the two files</LI>
</OL>


<LI>Extract all header lines with <code>grep</code>. FASTQ headers begin with:
<pre>
<pre>
sed 's/PATTERN//'  file
@ERR243038
</pre>
</pre>


Note that you can force PATTERN can be at the beginning of the line or end using the same "^" and "$" characters respectively:
Example:
 
<pre>
<pre>
sed 's/^PATTERN//'   file
grep '^@ERR243038' ERR243038_1.fastq | head
sed 's/PATTERN$//'  file
</pre>
</pre>


Take the '''grep''' command you had to identify the header lines and remove the trailing "1" and "2" at the end of the line using '''sed'''. Hint use the pipe. Look at the first 10 lines.
Try this for both FASTQ files and inspect the first 10 headers.


If you are having trouble, read on, your command should look like this:
<LI>Remove the trailing <code>1</code> or <code>2</code> using <code>sed</code>.


Examples of <code>sed</code> patterns:
<pre>
<pre>
grep ^PATTERN file | sed 's/^PATTERN//' | head
sed 's/PATTERN/REPLACEMENT/' file
sed 's/PATTERN//' file              # remove PATTERN
sed 's/^PATTERN//' file             # remove PATTERN at line start
sed 's/PATTERN$//' file            # remove at line end
</pre>
</pre>


<LI> As you saw in the UNIX lectures, you can send the output of a process to a file using ">" as such:
Apply <code>sed</code> to strip the last character from each header line. 
 
For example:
<pre>
<pre>
sed 's/^PATTERN//' file  > newfile
grep '^@ERR243038' ERR243038_1.fastq | sed 's/.$//' | head
</pre>
</pre>
The previous '''sed''' command replaces any line starting with "PATTERN" and send the output to its STDOUT. The ">" redirects the STDOUT to a file called "newfile".
Now, take the combined '''grep''' and '''sed''' commands you wrote earlier and redirect their output to a file called "human_1.headers" for at "ERR243038_1.fastq" and "human_2.headers" for "ERR243038_2.fastq" (these are human sequences).
<LI>Look at the first 10 lines of the 2 files (using '''head'''), they should contain the headers from each pair. If you want to view the output side-by-side, '''paste''' is a command that puts 2 files side-by-side as such:


<LI>Redirect the output into files:
<pre>
<pre>
paste file1 file2
grep '^@ERR243038' ERR243038_1.fastq | sed 's/.$//' > human_1.headers
grep '^@ERR243038' ERR243038_2.fastq | sed 's/.$//' > human_2.headers
</pre>
</pre>


Again, you can view the first 10 lines side-by-side using '''head'''
<LI>Inspect the first 10 lines side-by-side with <code>paste</code>:
 
<pre>
<pre>
paste file1 file2 | head  
paste human_1.headers human_2.headers | head
</pre>
</pre>


Use '''paste''' and '''head''' to view the first 10 lines of "human_1.headers" and "human_2.headers" side-by-side. Are they in sync so far?
<LI>Finally compare both files using <code>diff</code>:
This can get very tedious if you have millions of lines, one way to compare two files very rapidly is using '''diff''':
 
<pre>
<pre>
diff file1 file2 
diff human_1.headers human_2.headers
</pre>
</pre>


This command will return the differences between "file1" and "file2" and return nothing otherwise.  
If <code>diff</code> prints nothing, the pair files are perfectly in sync.


Using '''diff''' make sure that "human_1.headers" and "human_2.headers" are identical and therefore that " ERR243038_1.fastq" and "ERR243038_2.fastq" were in sync.
</OL>
</OL>
</OL>


<HR>


<HR>
<H2>Use <tt>screen</tt> in the shell</H2>


<H2>Use '''screen''' in the shell</H2>
<p>NGS jobs often run for hours. If you log out or lose network connection, all running commands normally die. The <b>screen</b> program creates a persistent “virtual terminal” that continues running even after logout.</p>


<p>In NGS, due to the volume of data, it is normal for programs to run for several hours even overnight. A problem is once you log off a terminal, any program that you have running will be killed (i.e. cease to run). The command '''screen''' creates a virtual shell inside your shell. This means that you can run commands within this virtual shell and it will be as if you never logged off. There are a couple of advantages why you might want to use it:</p>
<b>Benefits:</b>
<OL>
<OL>
<LI>If you log out or lose your connection you can log in again and get back to your shell as it was. This is extremely useful when you are working on a wireless network where the connection may sometimes drop. If the connection drops any programs you have running in a normal shell will be killed - but not if you are using '''screen'''.
<LI>Safe against connection drops</LI>
<LI>If you need to run a program for a long time, then it will be killed if you log out or close your computer. But by using '''screen''' you can have the program running for as long as you like.
<LI>Allows long-running jobs to continue after logout</LI>
<LI>You can "detach" a screen at work and "attach" the screen from home or from another computer
<LI>Can detach at work, reattach from home</LI>
</OL>
</OL>


Ok lets try it. Log in to the server and start '''screen''' by typing:
<b>Start screen:</b>
<pre>
<pre>
screen
screen
</pre>  
</pre>
 
Press <kbd>Enter</kbd> to dismiss the welcome message.


You will get a welcome message, press enter. Now you have a shell inside '''screen''' - it works just like a normal shell except for a few things. You enter the control features by pressing "ctrl-a". Try pressing "ctrl-a" and afterward "?" - this will take you to the help screen, press enter to leave it again. Try typing a command like '''ls''' in the shell. Then we will detach the shell by pressing "ctrl-a-d" - now you will see a message at the bottom of the screen saying "[detached]" and you are out of the '''screen''' program. But our shell is still there, type:
Inside a screen session:
* All commands run normally 
* Special commands begin with <kbd>Ctrl-a</kbd>


Try:
<pre>
Ctrl-a ?
</pre>
This opens the help screen. Press <kbd>Enter</kbd> to exit.
Run something simple, e.g.:
<pre>
ls
</pre>
<b>Detach</b> from the session:
<pre>
Ctrl-a d
</pre>
You’ll see:
<pre>
[detached]
</pre>
<b>Reattach</b> later:
<pre>
<pre>
screen -r
screen -r
</pre>
</pre>
to reconnect to the screen - now you are back in! That is the same you would do if you lost your connection, the screen would detach itself and you would be able to reconnect to it using "screen -r". This is basically it, but there are a lot more functionalities with '''screen''' such as having multiple screen sessions, you can read more [http://kb.iu.edu/data/acuy.html here].
 
If your SSH session dies, simply reconnect and run <code>screen -r</code> to resume.
 
More documentation: 
[http://kb.iu.edu/data/acuy.html screen tutorial]


<HR>
<HR>


<p>Congratulations you finished the exercise!</p>
<p><b>Congratulations you have completed the exercise!</b></p>
 


<HR>
<HR>


[[First_look_exercise_answers]]
[[First_look_exercise_answers]]

Latest revision as of 12:45, 20 November 2025

Overview

In this exercise you will explore real NGS data and learn how to use the screen command to keep long-running jobs alive after logout.

  1. Use standard UNIX commands to work with NGS data
  2. Use screen in the shell

First look at data

  1. Navigate to your home directory:
    cd
    

    cd without arguments returns you to your home directory:

    /home/people/[STUDENT ID]
    

  2. Create a directory called first_look and cd into it.
  3. Copy the file reads.fastq.gz from:
    /home/projects/22126_NGS/exercises/first_look/
    
  4. Use zless to inspect the compressed FASTQ file:
    zless -S reads.fastq.gz
    

    A FASTQ read consists of exactly four lines:

    1. Header line starting with @
    2. Sequence line (A/C/G/T/N)
    3. “+” line (may repeat the header)
    4. Quality line (ASCII PHRED scores)

    The -S option disables line wrapping so the sequence appears on one line.

  5. Count the number of reads in the file.

    Each read has 4 lines. Use wc to count lines:

    zcat reads.fastq.gz | wc -l
    

    Divide by 4 to get the number of reads.


Illumina data

  1. Copy pairedReads.tar.gz into your first_look directory from:
    /home/projects/22126_NGS/exercises/first_look/
    

    Unpack it:

    tar xvfz pairedReads.tar.gz
    

    Flags:

    • x – extract
    • v – verbose
    • f – archive file follows
    • z – gzip-compressed

    If a file ends with .tar.bz2, use j instead of z.

  2. You should now have two FASTQ files:
    • ERR243038_1.fastq
    • ERR243038_2.fastq

    Inspect the first read in each file. The two headers should be identical except for the trailing 1 or 2 — meaning they are paired-end reads from opposite ends of the same DNA fragment.

  3. We now check whether the two files are “in sync.” We will:
    1. extract all header lines from each file
    2. remove the final /1 or /2-type suffix
    3. write each set of normalized headers to a new file
    4. compare the two files
  4. Extract all header lines with grep. FASTQ headers begin with:
    @ERR243038
    

    Example:

    grep '^@ERR243038' ERR243038_1.fastq | head
    

    Try this for both FASTQ files and inspect the first 10 headers.

  5. Remove the trailing 1 or 2 using sed. Examples of sed patterns:
    sed 's/PATTERN/REPLACEMENT/' file
    sed 's/PATTERN//' file              # remove PATTERN
    sed 's/^PATTERN//' file             # remove PATTERN at line start
    sed 's/PATTERN$//' file             # remove at line end
    

    Apply sed to strip the last character from each header line. For example:

    grep '^@ERR243038' ERR243038_1.fastq | sed 's/.$//' | head
    
  6. Redirect the output into files:
    grep '^@ERR243038' ERR243038_1.fastq | sed 's/.$//' > human_1.headers
    grep '^@ERR243038' ERR243038_2.fastq | sed 's/.$//' > human_2.headers
    
  7. Inspect the first 10 lines side-by-side with paste:
    paste human_1.headers human_2.headers | head
    
  8. Finally compare both files using diff:
    diff human_1.headers human_2.headers
    

    If diff prints nothing, the pair files are perfectly in sync.


Use screen in the shell

NGS jobs often run for hours. If you log out or lose network connection, all running commands normally die. The screen program creates a persistent “virtual terminal” that continues running even after logout.

Benefits:

  1. Safe against connection drops
  2. Allows long-running jobs to continue after logout
  3. Can detach at work, reattach from home

Start screen:

screen

Press Enter to dismiss the welcome message.

Inside a screen session:

  • All commands run normally
  • Special commands begin with Ctrl-a

Try:

Ctrl-a ?

This opens the help screen. Press Enter to exit.

Run something simple, e.g.:

ls

Detach from the session:

Ctrl-a d

You’ll see:

[detached]

Reattach later:

screen -r

If your SSH session dies, simply reconnect and run screen -r to resume.

More documentation: screen tutorial


Congratulations — you have completed the exercise!


First_look_exercise_answers