First look exercise

From 22126
Jump to navigation Jump to search

Overview

In this exercise you will try to look at empirical NGS data. Additionally, you will try to use the screen command when using the shell.

  1. Use standard UNIX commands to work with NGS data
  2. Use screen in shell

First look at data

  1. Navigate to your home directory:
    cd
    

    cd without arguments will bring you back to your home directory. In our case, your home is:

    /home/people/[STUDENT ID]
    
  2. Create a folder called "first_look". Change dir to it.
  3. Copy the "reads.fastq.gz" to your folder from "/home/projects/22126_NGS/exercises/first_look/"
  4. The program zless allows you to view a zipped text file, press the key q to quit. Look at the data using less file. You will see a couple of reads in here. We will try to count the number of reads that there are. A read is always four lines, it may look like there are more lines but it is because the lines are wrapped. Try opening the file with zless -S file and you should see it:
    1. Header line starts with "@"
    2. Sequence line with the DNA sequence
    3. Middle header line with a "+" and sometimes also the header
    4. Base-quality line PHRED scaled probability that the base is wrong
  5. Try to count the number of reads in the file. You can of course do this by looking into the file, but this gets hard when there are millions of reads in a file. Remember that the header always starts with "@". Try to count the number of lines using wc and divide by 4. What count do you get? How many reads do you get?

Illumina data

  1. Now we are going to look at some Illumina paired-end reads. Copy the file "pairedReads.tar.gz" from "/home/projects/22126_NGS/exercises/first_look/" to your dir /home/people/[STUDENT ID]/first_look/. You will see that these files are in a tar/zipped file as ".tar.gz". A tar file is an archive (collection) of files that makes it easier to send and copy. Zipping saves disk space. To unzip and untar (unpack the files) them, use:
    tar xvfz file.tar.gz 
    

    The flag options "xvfz" do the following: x = extract the files from the archive, v = list files which are processed (v for verbose), f= what follows is the archive file, z the file is zipped. Note: If the file ends with ".tar.bz2", this means this was sorted using the Burrows-Wheeler algorithm, use instead "xvfj" as flags.

  2. Now you should have two files, "ERR243038_1.fastq" and "ERR243038_2.fastq". Have a look at them both. Compare the headers of the first reads. They should be identical except for the last character - this means that these two reads are paired together, ie. they are the DNA sequences from the two ends of the same DNA fragment. It is important that they are in sync i.e. that the fifth read in file 1 is paired together with the fifth read in file 2.
  3. Let's try to see if they in sync. We will do this by 1) extracting the header lines in each file 2) trimming away the last character to make them identical 3) send the outputs to different files and 4) compare the output we get from both files.
    1. We will extract the header lines from each file. One way to do this is to leverage the fact that every header line begins with "@ERR243038". We have covered grep. It uses 2 special characters. "^" indicates the beginning of the line. "$" indicates the end of the line. One way to specify "return all lines beginning with a particular pattern", is to use:
      grep ^PATTERN file
      

      One way to specify "return all lines ending with a particular pattern", is to use:

      grep PATTERN$ file
      

      Try to determine all header lines (hint: beginning with "@ERR243038") in ERR243038_1.fastq using grep. Since there are many hits, to visualize the first 10 lines, one can send the output of grep into the head command using the UNIX pipe '|' :

      grep ^PATTERN file | head 
      

      Fun fact, you can visualize the last 10 hits using tail:

      grep ^PATTERN file | tail 
      

      Try to view the first 10 header lines of ERR243038_1.fastq and ERR243038_2.fastq

    2. The second step is to remove the trailing "1" and "2" from the header lines. One potential command to do this is sed which stands for "stream editor". It can take a file or the output of another command as input and edit the strings in various ways and return the modified strings as output. For instance, to change the word PATTERN with REPLACEMENT, one can do:
      sed 's/PATTERN/REPLACEMENT/'   file 
      

      To simply remove PATTERN, simply put an empty replace string:

      sed 's/PATTERN//'   file 
      

      Note that you can force PATTERN can be at the beginning of the line or end using the same "^" and "$" characters respectively:

      sed 's/^PATTERN//'   file 
      sed 's/PATTERN$//'   file 
      

      Take the grep command you had to identify the header lines and remove the trailing "1" and "2" at the end of the line using sed. Hint use the pipe. Look at the first 10 lines.

      If you are having trouble, read on, your command should look like this:

      grep ^PATTERN file | sed 's/^PATTERN//'  | head 
      
    3. As you saw in the UNIX lectures, you can send the output of a process to a file using ">" as such:
      sed 's/^PATTERN//'  file  > newfile
      

      The previous sed command replaces any line starting with "PATTERN" and send the output to its STDOUT. The ">" redirects the STDOUT to a file called "newfile".

      Now, take the combined grep and sed commands you wrote earlier and redirect their output to a file called "human_1.headers" for at "ERR243038_1.fastq" and "human_2.headers" for "ERR243038_2.fastq" (these are human sequences).


    4. Look at the first 10 lines of the 2 files (using head), they should contain the headers from each pair. If you want to view the output side-by-side, paste is a command that puts 2 files side-by-side as such:
      paste file1 file2
      

      Again, you can view the first 10 lines side-by-side using head

      paste file1 file2 | head 
      

      Use paste and head to view the first 10 lines of "human_1.headers" and "human_2.headers" side-by-side. Are they in sync so far?

      This can get very tedious if you have millions of lines, one way to compare two files very rapidly is using diff:

      diff file1 file2  
      

      This command will return the differences between "file1" and "file2" and return nothing otherwise.

      Using diff make sure that "human_1.headers" and "human_2.headers" are identical and therefore that " ERR243038_1.fastq" and "ERR243038_2.fastq" were in sync.



Use screen in the shell

In NGS, due to the volume of data, it is normal for programs to run for several hours even overnight. A problem is once you log off a terminal, any program that you have running will be killed (i.e. cease to run). The command screen creates a virtual shell inside your shell. This means that you can run commands within this virtual shell and it will be as if you never logged off. There are a couple of advantages why you might want to use it:

  1. If you log out or lose your connection you can log in again and get back to your shell as it was. This is extremely useful when you are working on a wireless network where the connection may sometimes drop. If the connection drops any programs you have running in a normal shell will be killed - but not if you are using screen.
  2. If you need to run a program for a long time, then it will be killed if you log out or close your computer. But by using screen you can have the program running for as long as you like.
  3. You can "detach" a screen at work and "attach" the screen from home or from another computer

Ok lets try it. Log in to the server and start screen by typing:

screen

You will get a welcome message, press enter. Now you have a shell inside screen - it works just like a normal shell except for a few things. You enter the control features by pressing "ctrl-a". Try pressing "ctrl-a" and afterward "?" - this will take you to the help screen, press enter to leave it again. Try typing a command like ls in the shell. Then we will detach the shell by pressing "ctrl-a-d" - now you will see a message at the bottom of the screen saying "[detached]" and you are out of the screen program. But our shell is still there, type:

screen -r

to reconnect to the screen - now you are back in! That is the same you would do if you lost your connection, the screen would detach itself and you would be able to reconnect to it using "screen -r". This is basically it, but there are a lot more functionalities with screen such as having multiple screen sessions, you can read more here.


Congratulations you finished the exercise!



First_look_exercise_answers