Standard streams and working with files


In the last section we learned how to make directories and move around in the file system, but we didn't actually learn how to work with files. So in this section we'll be doing just that.

Many of the commands you'll be learning in this course can receive data from the standard input and write to the standard output, so later in this section we'll introduce the concept of standard streams. Lastly, we'll look at how we can change where the standard output goes and where the standard input comes from by using redirectional operators and pipelines.

Introduction to commands

Here we list Unix commands and their main function but it's important to keep in mind that Unix commands are versatile and it is possible to complete the same tasks using different commands.

Unix command | Acronym translation | Description
touch [OPTION] <FILE> | - | Touches a file. If the file doesn't exist, it is created with the specified name. If it already exists, its timestamps are updated.
mv [OPTION] <FILE> <destination directory or another filename> | Move | Moves a file to a specified directory. It can also be used to rename files.
rm [OPTION] <FILE> | Remove | Removes the specified file. With the -r option it can also be used to remove directories, including non-empty ones.
cp [OPTION] <FILE> <destination directory or another file> | Copy | Works a lot like mv, but leaves the original in place and puts a copy at the destination. Can also be used to copy the content of one file to another file.
cat [OPTION] <FILE> | Concatenate | Concatenates files and writes the result to standard output. If used on one file, the content of that file is displayed in the command line interface.
head [OPTION] <FILE> | - | Outputs the first part of a file.
tail [OPTION] <FILE> | - | Outputs the last part of a file.
less <FILE> | - | Shows one screenful of the file at a time. This is a useful command for viewing big files as it loads small segments at a time. q: quit, space: scroll forward one page, b: scroll backward one page. Arrow keys can be used to scroll up and down one line at a time.
wc [OPTION] <FILE> | Word count | Counts the lines, words and bytes in the file or files. Options select which of the counts are shown.
paste [OPTION] <FILE> | - | Merges corresponding lines from different files side by side.
cut [OPTION] <FILE> | - | Extracts selected parts of each line (for example certain columns), depending on what is specified in the options.
echo [OPTION] <STRING> | - | Outputs the string to your command line interface. In computer language, a string is just a sequence of characters.
wget [OPTION] <URL> | Web get | A non-interactive network downloader used to download files located at the URL.
curl [OPTION] <URL> | Client URL | Like wget, it is used to download files from the specified URL. This is an alternative for macOS users, where wget is not installed by default.
tee [OPTION] <FILE> | Named after the 'T-splitter' used in plumbing | Splits its input so that it is written to both the terminal and a file.

As an introduction, you can watch this YouTube video on the use of some of the commands, 'Unix Commands for working with files'. The video covers some of the Unix commands for navigation that you learned in the last section, but it also introduces the commands touch, mv and cp.

Datafiles

Below are download links for this section's datafiles. You can download them by right-clicking and then choosing the option 'Download link'.
Datafile 1: Pseudomonas aeruginosa 16S rRNA GenBank file
Datafile 2: ex1.acc
Datafile 3: ex1.dat
You can also right click, copy the link address and type

Prompt$ wget <the link address you copied>

in your Unix terminal. You might have trouble pasting the link into your terminal because the keyboard shortcut is not necessarily Ctrl-V. On Ubuntu WSL, you paste by simply right-clicking and copy with Ctrl-Shift-C. On MobaXterm, pasting should be Shift-Ins or the middle mouse button (the one you'd normally use for scrolling) if you have one. It might be set differently on your MobaXterm, however, and you can check this under Settings --> Keyboard shortcuts --> Paste in terminal. You can copy text in MobaXterm by left-clicking and marking the text you want copied.


A little background on the files: Datafile 1 is a GenBank file that contains information about the 16S rRNA DNA sequence of the pathogenic bacterium Pseudomonas aeruginosa. The DNA sequence of 16S rRNA is a highly conserved region in bacteria and is often used to identify bacteria. Datafile 2 contains 3 tab-separated columns of numerical data and datafile 3 contains 2 tab-separated columns of accession numbers. A tab is a single character, even though the terminal displays it as a stretch of blank space (usually up to the next tab stop), and accession numbers are unique identifier tags for DNA sequences.
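Because the datafiles are tab-separated, it can help to see how a tab behaves as a column separator. The following is a minimal sketch using made-up strings rather than the real datafiles:

```shell
# A tab is one character: "a", tab, "b" and the newline make 4 bytes.
printf 'a\tb\n' | wc -c

# cut -f splits on tabs by default, so field 2 of "a<TAB>b" is "b".
printf 'a\tb\n' | cut -f 2

# Spaces are not tabs: this line has no tab, so cut prints it unchanged.
printf 'a     b\n' | cut -f 2
```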

Examples of how this section's Unix commands can be used

All of the commands below are executed in the command line interface. For example, when using the wc command your terminal should look like figure 2.1. In figure 2.1, the command is executed in Ubuntu on the Windows Subsystem for Linux (WSL), so the colouring might be different on your computer. The syntax, however, is the same.

Figure 2.1 Using wc command in a Unix environment: The wc command is executed in Ubuntu on the Windows Subsystem for Linux (WSL), so it might look a little different if you're using a different Unix environment.


cat, short for concatenate, is often used to display file contents.

Prompt$ cat <FILE> 

outputs the content of the file. You can also combine cat with options for different functionalities.

Prompt$ cat -n <FILE>

outputs the content of the file with each line prefixed by its line number. cat has other useful functionalities as well, but these require an understanding of redirectional operators, so we'll save them for later in this section.

If you're interested in the file content at the top or bottom of a file, you can use the commands head and tail.

Prompt$ head -3 <FILE> 

outputs the first 3 lines of a file.

Prompt$ tail -3 <FILE> 

outputs the last 3 lines of a file.

If you want to know the number of words, lines or characters in a file you can use the wc command.

Prompt$ wc <FILE> 

outputs the number of lines, words and bytes, in that order. This output is always followed by the filename. You can also use options for a more specific functionality.

Prompt$ wc -l <FILE> 

outputs the number of lines in a file.

Prompt$ wc -m <FILE> 

outputs the number of characters in a file.

Keep in mind that commands cat, head, tail and wc can all take multiple <FILE> arguments as shown in figure 2.1.
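The viewing and counting commands above can be tried out safely on a small throwaway file. This is only a sketch: the file name sample.txt and its contents are made up for illustration and are not part of the course datafiles.

```shell
# Work in a temporary directory so none of your own files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create a 5-line sample file (hypothetical name and content).
printf 'line1\nline2\nline3\nline4\nline5\n' > sample.txt

cat -n sample.txt     # all 5 lines, each prefixed with its line number
head -3 sample.txt    # line1, line2, line3
tail -3 sample.txt    # line3, line4, line5
wc -l sample.txt      # the number of lines (5), followed by the file name
wc sample.txt         # lines, words and bytes, followed by the file name
```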

You can use echo to write stuff in the command line interface.

Prompt$ echo <Whatever you want outputted to the command line interface> 

outputs just about anything to the command line interface.

The introductory video should have given you a basic idea of how the commands mv, cp and rm work, but there are some extra tricks that are good to know.

Prompt$ mv file1 /filepath/file2 

will move file1 to the directory /filepath and rename it to file2 in the same step.

Prompt$ mv -t <DIRECTORY> file1..file99 

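will move any number of files to the given directory. The cp -t command works in the same way. Note that the -t option belongs to the GNU versions of mv and cp found on Linux; on other systems, such as macOS, you can instead list the files first and the directory last.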
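As a rough sketch of these tricks, the commands below move, rename and copy a few empty files inside a temporary directory. All file and directory names are invented for the example, and the -t option assumes the GNU versions of mv and cp.

```shell
# Work in a temporary directory so none of your own files are touched.
workdir=$(mktemp -d)
cd "$workdir"
mkdir archive backup
touch notes1.txt notes2.txt report.txt

# Move report.txt into archive and rename it to final.txt in one step.
mv report.txt archive/final.txt

# Copy several files into backup (GNU cp: target directory comes first).
cp -t backup notes1.txt notes2.txt

# Move the same files into archive (GNU mv: target directory comes first).
mv -t archive notes1.txt notes2.txt

ls archive backup
```

After running this, archive holds final.txt, notes1.txt and notes2.txt, while backup holds the two copies.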

Exercise 1: Working with datafiles

  1. Download the 3 datafiles if you haven't already.
  2. Create 3 new files. You can call them whatever you like.
  3. Create two directories, called test and data.
  4. Delete two of the files you created and move the remaining file along with the data files to the test directory.
  5. Move all the files from the test directory to the data directory. Delete the test directory.
  6. Rename the file you created to mydatafile.gb.
  7. Copy the content of datafile 1 to mydatafile.gb and check that they're identical.
  8. Display the content of datafile 1, datafile 2 and datafile 3.
  9. Count the total number of bytes in the datafiles. (Hint: Check out the different command line options for wc).

Standard Streams

Figure 2.2: Standard Streams: This figure illustrates the concept of standard streams. You can think of the green box as the interface that you can interact with. This is what we called the command line interface in the previous section. Recall that there are 2 types of user interfaces: the command line interface (CLI) and the graphical user interface (GUI). The yellow box represents the processing of the cat command, which the shell (the command line interpreter) and the kernel oversee on the way to and from the hardware of your computer. The normal results of a command are written to the standard output (stdout), whereas error messages are written to the standard error (stderr). The standard input often originates from the keyboard (as it does when you type cat in your command line), which is why it's shown in the figure.

Now that we have a practical idea of this section's Unix commands, let's discuss the concept of standard streams. This will give you an idea of what exactly is going on when these commands are executed from the command line.
Standard streams are streams of data that travel from where a program was executed to the place where the program is processed, and then back again. It's important to emphasize that there are many streams of data in your computer, but the standard streams are the ones that the user has the most control over. There are 3 types of standard streams: standard input (stdin), standard output (stdout) and standard error (stderr). We'll go through what each term means by using the command cat as an example.

Use the Unix command cat by typing in

Prompt$ cat

in the terminal. cat now waits for you, the user, to give it some input on stdin directly from your keyboard. Simply type something and press Enter. To end the input, press Ctrl and d simultaneously. The command cat processes the stdin that you've given it and writes it to what's called the standard output (stdout). In this case stdout is just whatever you typed, and it is by default connected to the terminal, which is why it appears there. If something goes wrong, an error message is written to the standard error (stderr) instead. Depending on the error you made, different error messages can appear. If you, for example, type in 'eccho Hello', the shell might report the error message 'bash: eccho: command not found' on stderr. The stderr is also connected to the terminal by default. Sometimes nothing appears on stdout, and this is because some commands write nothing to stdout when they succeed. You've already experienced this in the last section with commands like mkdir, rmdir, rm, cd and so on.

When supplying cat with a file by typing,

Prompt$ cat <FILE> 

in your command line, it will output the file contents as stdout. It is, however, important to understand that <FILE> is not being fed as stdin to cat. When you type a command on your command line and the corresponding program is present on your system (you can find many of these programs in the directory /bin), the shell splits the line into words and passes those words to the program as command-line arguments. cat receives the file name as an argument and opens the file itself. The arguments are definitely data handed to the program, but they are not the stdin.

The stdin is connected to your keyboard, and stdout and stderr are directed to the terminal by default. We can, however, take control of these streams by using redirectional operators, pipelines and the command tee.
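That stdout and stderr really are two separate streams can be seen by sending them to different files. The sketch below uses the > operator, which is introduced in the next part of this section, together with 2>, the shell's notation for redirecting stderr. The file names are made up for the example.

```shell
# Work in a temporary directory so none of your own files are touched.
workdir=$(mktemp -d)
cd "$workdir"
echo 'hello' > exists.txt

# cat writes the content of exists.txt to stdout and an error message about
# missing.txt to stderr. "> out.txt" captures stdout, "2> err.txt" captures stderr.
cat exists.txt missing.txt > out.txt 2> err.txt || echo 'cat reported a failure'

cat out.txt    # hello
cat err.txt    # an error message that mentions missing.txt
```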

Supplementary material on standard output and standard input
Standard output
Standard input

Re-directional operators and pipelines

Operators are symbols which behave like functions within the shell. The easiest to understand might be arithmetic operators, which use symbols like + for addition and - for subtraction, or = for assigning values to variables. In this section we'll be learning redirectional operators.

Stdout redirectional operators, > and >>

The > operator is used to redirect stdout to a file (stderr is redirected with 2> instead). Here's one way of using it:

Prompt$ cat file1 > file2

This will redirect the stdout of cat file1 to file2, which is the same as writing the file contents of file1 into file2. An important feature of > is that it overwrites the content of the file it is directed to with the output that it receives. So in this case, the file content of file2 will be overwritten with the file content of file1. This means you have to be careful not to overwrite your work when using it.

The >> operator redirects stdout in the same way as >, but appends the output to the file instead of overwriting it (2>> does the same for stderr). For example,

Prompt$ cat file1 >> file2

would append the file content of file1 to file2.

Here's how the use of these operators would look in a Unix terminal.

Figure 2.3 Redirectional Operators: Here we use the Unix command echo which simply outputs whatever text you input in the command line. In the example, this output either appends or replaces the file content of Operator_Example.txt
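The behaviour described in the figure can be reproduced with a few lines. This is a sketch with made-up strings, run in a temporary directory; only the file name Operator_Example.txt is taken from the figure caption.

```shell
# Work in a temporary directory so none of your own files are touched.
workdir=$(mktemp -d)
cd "$workdir"

echo 'first'  > Operator_Example.txt    # creates the file with one line
echo 'second' >> Operator_Example.txt   # appends a second line
cat Operator_Example.txt                # first, then second

echo 'third' > Operator_Example.txt     # overwrites everything in the file
cat Operator_Example.txt                # only third is left
```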

cat can be used in a similar fashion as echo <STRING> > <FILE>, to add text to files.

Prompt$ cat > <FILE> 

will wait for the user to type stdin, which is then written to <FILE>. This is because the stdin is connected to the keyboard by default. To exit, simply hold the Ctrl key while pressing d. After entering your text and before exiting, it's a good idea to press Enter, or else the command line will look a bit weird: the prompt and the text you just entered will be on the same line, which you might find confusing. This application works for the >> operator as well.

Figure 2.4 Using cat or echo to add text to files

< operator

< is the redirectional operator for the standard input: it makes a command read its stdin from a file instead of from the keyboard. This is useful for commands that require additional input from the user, and we'll take a look at such commands in later sections. To give an example, when you're downloading and installing packages, you'll be prompted on stdin to confirm that it's okay for the package to use the stated amount of space on your device. In such a case, you need to type 'y' for yes or 'n' for no. If you're doing many time-consuming package installations, it can be quite annoying to have to be around just to press 'y' once in a while. Therefore, it is super-handy that you can use the < operator to supply the stdin that is needed. Because < expects a file name on its right-hand side, first save the answer to a file with echo (the name answer.txt is arbitrary) and then redirect that file to the command,

Prompt$ echo 'y' > answer.txt
Prompt$ apt install <package> < answer.txt

The apt command is short for 'Advanced Package Tool' and is the standard packaging tool on Debian-based Linux distributions such as Ubuntu. We'll learn more about this in the section 'File compression and advanced packaging tools', so don't worry about it now.
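A harmless way to see what < does is to compare giving wc a file name as an argument with feeding the same file to its stdin. This is a sketch with a made-up file called letters.txt:

```shell
# Work in a temporary directory so none of your own files are touched.
workdir=$(mktemp -d)
cd "$workdir"
printf 'a\nb\nc\n' > letters.txt

# File name as an argument: wc opens the file itself and prints its name.
wc -l letters.txt

# File redirected to stdin: wc never sees a name, so it prints only the count.
wc -l < letters.txt
```

Both commands count 3 lines. The difference in the output (with or without the file name) shows that in the second case the data arrived through stdin.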

Pipelines

Making pipelines or 'pipelining' as it is sometimes called, is similar to the concept of redirectional operators. Pipelines are used to redirect the stdout of one Unix command as the stdin to another Unix command. A good example of this is:

Prompt$ cat <some big file> | less

This will feed the stdout of cat <some big file> as stdin to less. If you just write cat, all of the contents of the file will rapidly be displayed on your screen and it can be a real pain to scroll all the way to the top in order to read the text. But by piping cat into less, you can scroll through the file in small segments at a time (see the Unix command table or run man less for instructions on how to scroll through the file).

As mentioned earlier, not all commands write to stdout. In the same way, not all commands read from stdin. An example of this is echo, which only outputs its command line argument <STRING>. If you pipe the stdout of another command into echo, echo simply ignores it.

Prompt$ cat <FILE> | echo 

outputs only a blank line, because echo was given no <STRING> and never reads the data that was piped to it.
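The difference between a command that reads stdin and one that ignores it can be sketched with a made-up string instead of a file:

```shell
# echo does not read stdin, so the piped text is lost and only a blank line appears.
printf 'some text\n' | echo

# cat does read stdin, so the same text passes straight through.
printf 'some text\n' | cat
```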

Some Examples: Simple Piping

These examples aren't necessarily useful, but they should give you a better idea of what pipelines are and how they can be constructed. Try them out yourself on this section's datafiles.

head -5 <datafile> | tail -2 

The first 5 lines are extracted from datafile and fed to tail -2, which extracts the last 2 lines and outputs to the command line interface.

tail -10 <datafile> | wc -c > 10_tail_chars.txt 

The last 10 lines are extracted from datafile and fed to wc -c, which counts the bytes (use wc -m to count characters instead). The count is redirected and saved to 10_tail_chars.txt.

head -5 <datafile> | cut -f 2-4 >> columns2to4.txt

The first 5 lines are extracted from datafile and fed to cut -f 2-4, which extracts columns 2 to 4. These columns are then appended to columns2to4.txt. If you wanted to cut out only columns 2 and 4, you could instead write cut -f 2,4.
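If you'd like to see exactly what each stage of such a pipeline does, it can help to run it on a tiny table where the result is easy to check by eye. The sketch below builds a made-up file, table.tsv, with six rows and four tab-separated columns; neither the file nor its content comes from the course datafiles.

```shell
# Work in a temporary directory so none of your own files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build six rows with four tab-separated columns each, e.g. "r1 a1 b1 c1".
for i in 1 2 3 4 5 6; do
    printf 'r%s\ta%s\tb%s\tc%s\n' "$i" "$i" "$i" "$i"
done > table.tsv

head -5 table.tsv | tail -2       # rows 4 and 5
head -5 table.tsv | cut -f 2-4    # columns 2 to 4 of rows 1 to 5
head -5 table.tsv | cut -f 2,4    # only columns 2 and 4 of rows 1 to 5

# Count the bytes in the last 3 rows and save the count to a file.
tail -3 table.tsv | wc -c > tail_bytes.txt
cat tail_bytes.txt                # 36 (3 rows of 12 bytes each)
```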

tee command

The tee command copies its stdin both to stdout and to a file, so the stream is split in two. The command is named after the T-splitter used in plumbing and the T-shape that is illustrated in figure 2.5.

Figure 2.5 Tee command:

The tee command is useful if you're making a long pipeline and you want to save intermediary results into files. It's actually just good general practice when making long pipelines: if something is wrong in the final output, you can check where it went wrong by looking at your intermediary files.

Figure 2.6 Tee example: A file called header_file.gb is made using the command touch and then a pipeline is constructed. When the pipeline is executed from the command line, the header of the GenBank file is saved to a file, and the number of bytes is outputted to the terminal. The contents of header_file.gb are then displayed with cat.

This command has a couple of command options which you can check out with the man command. One of the more useful ones is the -a option, short for append,

Prompt$ tee -a <FILE> 

will append to the file instead of overwriting it. This is useful in a scenario where you want to save multiple intermediary outputs in the same file.
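A small sketch of tee and tee -a in a pipeline is shown below. The file record.txt and its five lines are invented for the example and simply stand in for a larger file with a header.

```shell
# Work in a temporary directory so none of your own files are touched.
workdir=$(mktemp -d)
cd "$workdir"
printf 'h1\nh2\nh3\nbody1\nbody2\n' > record.txt

# Save the first 3 lines to header.txt AND pass them on to wc -c.
head -3 record.txt | tee header.txt | wc -c       # 9 bytes

# With -a the last 2 lines are appended to header.txt instead of replacing it.
tail -2 record.txt | tee -a header.txt | wc -c    # 12 bytes

cat header.txt    # all 5 lines: h1, h2, h3, body1, body2
```

Without tee, the intermediate lines would only exist inside the pipeline. With it, they are kept in header.txt while wc still receives them.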

Exercise 2: Re-directional operators and Pipelines

  1. Merge the lines of datafiles 2 & 3 and save them to mergefile.dat. Try displaying its content and make sure that accession numbers are on the left and the data on the right. It doesn't matter whether the files are of equal length: if there are no more lines in one file, blank entries will simply be added instead.
  2. Take the first 5 lines of mergefile.dat, cut out the first and third column and save it as columns1and3.dat.
  3. Count the number of characters in ALL of the files and append the results to a file called charsinfiles.dat.
  4. Make a pipeline that saves the bottom part of datafile1 in extracted_data.gb and the number of bytes in another file, bytefile.dat.
  5. Make a pipeline that saves the header of datafile1, appends it to extracted_data.gb and then appends the number of bytes to bytefile.dat.

Exercise 3: Moving and removing files across the file system

This exercise is a repetition of what you learned in the last section about navigating the file system combined with the commands you learned in this section.

Figure 2.7 Ex3 Branch of directories
  1. Make a branch of directories like the one shown in figure 2.7.
  2. Move datafiles 1,2 and 3 to directory AB.
  3. Make a copy of datafile 1 in A5 called datafile_copy1, a copy of datafile 2 in A7 called datafile_copy2 and a copy of datafile 3 in B7 called datafile_copy3.
  4. Move datafile_copy1 and datafile_copy2 back to AB. You should do this without making A5 and A7 your current working directory (see Hint 1).
  5. Make two files in B7 called extra_copy1 and extra_copy2.
  6. Move datafile_copy3, extra_copy1 and extra_copy2 to AB.
  7. Rename extra_copy2 as extra_copy1.
  8. Remove datafile_copy1, datafile_copy2, datafile_copy3 and extra_copy1.

Hint 1: You can move files in other directories than the one you're in by specifying an absolute path.