Unix - User contributions [en]

MediaWiki:Sidebar

2024-03-20T12:24:54Z

WikiSysop: Created page

* navigation
** https://teaching.healthtech.dtu.dk/|Course List
** https://teaching.healthtech.dtu.dk/unix/|Unix
* TOOLBOX

Commands summary

2024-03-20T12:23:28Z

WikiSysop: Created page

__NOTOC__
= Unix architecture and file system =
{| class="wikitable"
|-
!style="width: 25%"| Unix Command
!Acronym translation
!Description
|-
|'''who''' or '''whoami'''
|<nowiki>-</nowiki>
|Tells you who is logged in. '''who''' lists all users currently logged in, while '''whoami''' prints only your own username. The 'who' command is not present in MobaXterm, but you can use 'whoami' instead.
|-
|'''man''' <COMMAND>
|Manual
|A very useful command. When you run it on another Unix command, it shows you the manual for that command. This command is not present in MobaXterm, but you can instead use Google to find command manuals.
|-
|'''cal''' <nowiki>[OPTION]</nowiki>
|Calendar
|Gives you the current date in a calendar form
|-
|'''date''' <nowiki>[OPTION]</nowiki>
|<nowiki>-</nowiki>
|Gives you the current date
|-
|'''pwd''' <nowiki>[OPTION]</nowiki>
|Print working directory
|Where are you? Shows the current directory
|-
|'''ls''' <nowiki>[OPTION]</nowiki> [DIRECTORY]
|List segments
|Shows the files in the current directory if a filepath is not given.
|-
|'''cd''' <nowiki>[OPTION]</nowiki> <DIRECTORY>
|Change Directory
|Moves you to a specified directory. You can type 'cd ..' to move one directory back. The '..' simply means to go up one level towards your root
|-
|'''mkdir''' <nowiki>[OPTION]</nowiki> <DIRECTORY> 
|Make directory
|Makes the specified directory
|-
|'''rmdir''' <nowiki>[OPTION]</nowiki> <DIRECTORY>
|Remove directory
|Removes the specified directory if it's empty. To remove a non-empty directory, one can use '''rm''' with the recursive option '''-r'''.
|-
|'''ln''' <nowiki>[OPTION]</nowiki> <TARGET> <LINK_NAME>
|Link
|Can be used to create links (shortcuts) between files and directories.
|}
= Standard streams and working with files =
{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''touch''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Touches a file. If the file doesn't exist already, it will create an empty file with the specified name. If it already exists, it will update the timestamp of the file.
|-
|'''mv''' [OPTION] <FILE> <destination directory or another filename>
|Move
|Moves a file to a specified directory. It can also be used to rename files.
|-
|'''rm''' [OPTION] <FILE>
|Remove
|Removes the specified file. With the recursive option '''-r''', this command can also be used to remove non-empty directories.
|-
|'''cp''' [OPTION] <FILE> <destination directory or another file>
|Copy
|Works a lot like mv, but leaves the original in place and creates a copy at the destination. Can also be used to copy the content of one file to another file.
|-
|'''cat''' [OPTION] <FILE>
|Concatenate
|Concatenates files and writes the result to standard output. If used on one file, the content of that file is displayed in the command line interface.
|-
|'''head''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Outputs the first part of a file
|-
|'''tail''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Outputs the last part of a file
|-
|'''less''' <FILE>
|<nowiki>-</nowiki>
|Shows the file one screenful at a time. This is a useful command for viewing big files, as it loads small segments at a time. q --> quit , space --> scroll forward one page , b --> scroll backward one page. Arrow keys can be used to scroll up and down one line at a time.
|-
|'''wc''' [OPTION] <FILE>
|Word count
|Counts the lines and words in the file/files, but can also count other things based on the options you give it.
|-
|'''paste''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Merges lines from different files.
|-
|'''cut''' <nowiki>[OPTION]</nowiki> <FILE>
|<nowiki>-</nowiki>
|Cuts out selected parts (for example columns or character ranges) of each line in a file, depending on what is specified in the option.
|-
|'''echo''' [OPTION] <STRING>
|<nowiki>-</nowiki>
|Outputs the string to your command line interface. In computer language, a string is just a sequence of characters.
|-
|'''wget''' [OPTION] <URL>
|web get
|A non-interactive network downloader used to download files located at the URL.
|-
|'''curl''' [OPTION] <URL>
|client URL
|Similarly to '''wget''', it is used to download files at the specified URL. This is an alternative for macOS users, since '''wget''' is not installed on macOS by default.
|-
|'''tee''' [OPTION] <FILE>
|It's named after the 'T-splitter' used in plumbing.
|Splits output so that it can be outputted to both the terminal and a file.
|}
= Text editors and some shell scripting =
{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''alias''' <nowiki><alias_name>=<The stuff you want to make an alias for></nowiki>
|<nowiki>-</nowiki>
|Creates an alias called alias_name for what you've inserted on the right side of '='
|-
|'''source''' <FILE>
|<nowiki>-</nowiki>
|Executes the contents of a file in the current shell. Changes made when the file is run (variables, aliases etc.) stay in effect in that shell until you change them again. It is synonymous with Prompt$ '''.''' <FILE>.
|-
|'''bash''' <FILE>
|Bourne again shell
|'''Bash''' will execute <FILE> as a different process. This way, changes that occur while the file is being executed cannot affect your shell.
|}
= Setting up your shell script =
{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''which''' <COMMAND>
|<nowiki>-</nowiki>
|Shows the full path to the command.
|-
|'''read''' [OPTION] [input1] [input2] [input3]
|<nowiki>-</nowiki>
|Can be used to prompt the user for input and save it in the variables [input1]..[input3].
|}
= Filtering and regular expressions =
{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''grep''' [PATTERN] <FILE>
|Global regular expression print.
|Uses regular expressions to select the lines in a file that match the pattern.
|-
|'''sed''' [OPTION] <SCRIPT> <FILE>
|stream editor
|Allows the user to edit the content of files using regular expressions, without actually opening the files in an editor.
|-
|'''tr''' [OPTION] <SET1> <SET2>
|Translate
|Translates characters from the standard input and writes to the standard output.
|-
|'''sort''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Sorts the content of a file.
|}
= File permissions =
{|class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''chmod''' [OPTION] [MODE] <FILE>
|Change mode.
|Changes file permissions on a file according to the mode given. No need to be confused about [MODE], as it is just another term for file permissions. It can be specified with letters or numbers. It's easier to understand what you're doing with letters, but using numbers can be faster. In the example, we'll show how to use both.
|-
|'''chown''' [OPTION] [OWNER][:[GROUP]] <FILE>
|Change owner.
|Change file owner and/or group. This can be done separately or simultaneously by typing [OWNER]:[GROUP].
|-
|'''chgrp''' [OPTION] <GROUP> <FILE>
|Change group.
|Change group of a file.
|}
= File compression and advanced packaging tools =
{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''tar''' [OPTION] <ARCHIVE> <FILES>
|Tape archive
|Archive utility tool, used to create and extract archives. Archives are simply multiple files that have been combined into one file.
|-
|'''gzip''' [OPTION] <FILE>
|GNU zip
|Compresses <FILE>
|-
|'''gunzip''' [OPTION] <FILE>
|GNU unzip
|Decompresses <FILE>
|-
|'''zip''' [OPTION] <ARCHIVE> <FILE>
|<nowiki>-</nowiki>
|Archive and compression utility. Used to pack one or more files into the compressed archive <ARCHIVE>.
|-
|'''unzip''' [OPTION] <ARCHIVE>
|<nowiki>-</nowiki>
|Extracts the files from the zip archive <ARCHIVE>.
|-
|'''apt''' [OPTION] <PACKAGE>
|Advanced Packaging tool
|Package manager for Debian-based systems such as Ubuntu, with many utilities.
|}
= Processes: foreground and background, ps, top, kill, screen, nohup and daemons =
{| class="wikitable"
|-
!style="width: 18%"| Unix Command
!Acronym translation
!Description
|-
|<COMMAND> '''&'''
|<nowiki>-</nowiki>
|Runs command as a background process
|-
|'''bg''' %<JOB_ID>
|bg is short for background. The job ID is the number the shell shows in square brackets for each job, not the process identification (PID)
|Continues a stopped job in the background
|-
|'''fg''' %<JOB_ID>
|fg is short for foreground. The job ID is the number the shell shows in square brackets for each job
|Continues a stopped job in the foreground
|-
|'''sleep''' <NUMBER>[s/m/h/d]
|<nowiki>-</nowiki>
|Delays for a specified amount of time, given as <NUMBER> followed by one of the suffixes s, m, h and d, which are short for seconds, minutes, hours and days respectively. By default, the suffix is s.
|-
|'''top''' [OPTION]
|<nowiki>-</nowiki>
|Displays all the processes running on your computer
|-
|'''ps''' [OPTION]
|Process status
|Reports a snapshot of current processes
|-
|'''kill''' [OPTION] <PID>
|<nowiki>-</nowiki>
|Sends a signal to a process and by default this signal is to terminate the process
|-
|'''screen''' [OPTION]
|<nowiki>-</nowiki>
|Used to create new terminal windows that are detached from each other. Child processes created within these new terminal windows are not affected if their parent process is terminated
|-
|'''disown'''
|<nowiki>-</nowiki>
|Dissociates process from current terminal session
|-
|'''nohup''' [OPTION]
|No hangup
|Used to run commands immune to hangups, ignoring stdin. By default output is redirected to nohup.out.
|-
|'''pstree'''
|Process tree
|Display a tree of parent and child processes
|}
= Understanding network and remote servers: IP/URL, ssh, scp, wget, curl =
{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''wget''' [OPTION] <URL>
|web get
|A non-interactive network downloader used to download files located at the URL.
|-
|'''curl''' [OPTION] <URL>
|Client url
|Similarly to '''wget''', it is used to download files at the specified URL. This is an alternative for macOS users, since '''wget''' is not installed on macOS by default.
|-
|'''ssh''' [-p <PORT>] <user@IP/Domain_name>
|Secure shell
|Used to establish a secure connection to a remote server/system. It's also known as secure shell protocol.
|-
|'''scp''' [-P <PORT>] <SOURCE> <DESTINATION>
|Secure copy protocol
|Copies files securely between your computer and a remote server, in either direction. A remote <SOURCE> or <DESTINATION> is written as user@IP/Domain_name:filepath.
|-
|'''telnet''' <URL> <PORT>
|Teletype network
|Establishes a connection with the specified URL and port.
|-
|'''ifconfig''' [OPTION]
|Interface configuration
|Displays the currently active network interfaces, but when used with the option '''-a''', it displays the status of all network interfaces.
|-
|'''nslookup'''
|Name server lookup
|Used to obtain information about a server through a DNS (Domain name system) server.
|-
|'''ping''' [OPTION] <IP/URL>
|Packet Internet Groper
|Checks network connectivity between host (your computer) and host/server.
|}

File:Protocol overview.png

2024-03-20T12:22:44Z

WikiSysop:

Understanding network and remote servers: IP/URL, ssh, scp, wget, curl

2024-03-20T12:21:51Z

WikiSysop: /* Servers and URLS */

__NOTOC__
In computing, a network is a group of computers and other devices (routers, gateways, modems etc.) that are connected so that they can exchange data. The computers are either desktop PCs, which is the type that you're using, or server computers. The internet in its entirety is in fact an example of a humongous network of global proportions, so as a short introduction, watch this short and simple video explaining the internet [https://www.youtube.com/watch?v=7_LPdttKXPc What's the internet].

Computers and other devices on the internet have addresses called IP addresses, which have the syntax x.x.x.x, where each x is a number between 0 and 255. There are 2 types of IP addresses: private and public. The difference is in what numbers are used. Private IP addresses are 'private', which just means that they're not directly connected to the internet. You can check your computer's IP by typing,
Prompt$ '''ifconfig'''
which will display information on the IP addresses currently in use by your computer. For Windows users, your IP address should be under 'wifi0' or 'eth0'. The terms 'wifi0' and 'eth0' mean that you're connected to the internet via wifi and LAN cable respectively. For Mac users, it should be one of the IP addresses next to ''inet''. The 'lo: inet 127.0.0.1' is your machine's loopback address and is not your private IP address. For more information on finding your IP address on a Mac, you can follow this link [http://osxdaily.com/2010/11/21/find-ip-address-mac/ Finding your IP address on Mac OS].
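
The text above says that private and public addresses differ in which numbers they use, without listing them. As a rough sketch, the function below assumes the three private ranges reserved by RFC 1918 (10.x.x.x, 172.16.x.x to 172.31.x.x and 192.168.x.x) plus the 127.x.x.x loopback range. The function name classify_ipv4 is made up for this illustration, it does not validate its input, and it simply labels every other address 'public'.

```shell
# classify_ipv4 is a hypothetical helper, not a standard command.
# It only looks at the leading numbers of the address.
classify_ipv4() {
    case "$1" in
        10.*|192.168.*)                        echo "private" ;;
        172.1[6-9].*|172.2[0-9].*|172.3[01].*) echo "private" ;;
        127.*)                                 echo "loopback" ;;
        *)                                     echo "public" ;;
    esac
}

classify_ipv4 192.168.1.10     # a typical address handed out by a home router
classify_ipv4 127.0.0.1        # the loopback address mentioned above
classify_ipv4 91.198.174.192   # an address outside the private ranges
```

You can paste the address that '''ifconfig''' shows next to ''inet'' into the function to see which group it falls into.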

Public IP addresses connect you with the rest of the internet. Typically, public IP addresses belong to a modem/router, which you can simply think of as a frontline device that connects all of your private devices to the rest of the internet. Devices that are connected to the same modem will all share the same public IP address, and because public IP addresses are 'public', they must be unique. This is not the case for private IP addresses, because they aren't directly connected to the internet. A network that connects public IP addresses across large distances is called a 'Wide Area Network' (WAN), which is represented with a '''red''' line in '''figure 7.1'''. Just so you know, '''figure 7.1''' has been simplified so that the router for the server computers isn't included.
You can look up your public IP address by following this link [https://whatismyipaddress.com/ Public IP address]. But you can also find your public IP address by using '''wget''',
Prompt$ '''wget''' -qO- ifconfig.me
The command option '''-q''' is short for quiet mode, and '''-O-''' ensures that the output is written to stdout. macOS is not equipped with '''wget''' by default, but you can instead use the command '''curl''',
Prompt$ '''curl''' ifconfig.me

You shouldn't get too attached to your IP addresses, as they can easily change. Your public IP address is determined by your internet service provider (ISP), so it can change depending on the WiFi you're logged onto, when you unplug and replug your router, or if your ISP decides to change it.

We briefly mentioned something about using a LAN cable, but we haven't defined what LAN is. LAN is short for 'Local Area Network', and you can think of it as the network connecting your private devices. In '''figure 7.1''', this is depicted as the devices connected by the '''green''' line: printer, phone and computer. The device that allocates private IP addresses to your 'private devices', while also establishing connectivity between these devices, is your router. Often, the modem and router are combined into one device, which is how it's illustrated in '''figure 7.1'''. For more information on the difference between router and modem, follow this link [https://www.howtogeek.com/234233/whats-the-difference-between-a-modem-and-a-router/ Modem vs router].

The IP addresses we've been discussing so far are IPv4 addresses. The number of possible IPv4 addresses is approximately 4.3 billion, which seemed like a sufficient amount for some time. However, it quickly became clear that it wasn't. Therefore, the next version of IP addresses, IPv6, has 7.9x10^28 times as many IP addresses as IPv4.
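
The figure of roughly 4.3 billion follows directly from the x.x.x.x syntax: each of the four numbers can take 256 values (0 to 255). A minimal check with shell arithmetic, where the variable name ipv4_total is just chosen for this example:

```shell
# Four numbers, each with 256 possible values (0-255).
ipv4_total=$((256 * 256 * 256 * 256))
echo "Possible IPv4 addresses: $ipv4_total"
```

An IPv4 address is 32 bits long and an IPv6 address is 128 bits long, so IPv6 has 2^96 times as many addresses, which is where the factor of about 7.9x10^28 comes from. That number is too large for the integer arithmetic of the shell, so it is not computed here.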

[[File:Network_LAN&WAN.png|center|frame|'''Figure 7.1 Network Overview:''' This figure gives an overview of a network and distinguishes between LAN ('''green''') and WAN ('''red'''). The modem connects private devices within the same network to the internet, while the router connects devices within the LAN. Servers don't necessarily need to be run by a server computer and can also be run by a desktop computer.]]

== Servers and URLS ==
A 'server' is simply a computer connected to a network that provides 'services' to other computers. The computers that the server provides services for are called ''clients''. This kind of relationship between servers and clients is what's defined as the 'client-server model'.

A simple example of a server that you're likely familiar with is a print server. Print servers provide the service of printing for computers connected to the same network. In a 'client-server model', the connected computers are the clients and the print server is the server. A print server would typically be what's called a 'local server'. Servers can be divided into local and remote servers, the only difference being that local servers are set up within a LAN (so within the green lines in '''figure 7.1''') and remote servers are set up on another 'remote computer' (the servers connected by the red line in '''figure 7.1'''). It goes without saying that remote servers are found everywhere on the internet, because that's largely what the internet is made of. There are different types of servers, but you're probably most familiar with web servers (Wikipedia, Google, YouTube, Facebook etc.). Web servers are available to clients through the use of a web browser (Google Chrome, Internet Explorer etc.) and a URL, which is likely the only way you've been accessing them up till now. You can, however, also interact with them in a slightly more 'old school' way through the command line interface. For example, in the section 'Standard streams and working with files', we used '''wget''' to retrieve files from URLs.

URL is short for 'Uniform resource locator', and URLs are used to identify websites. A URL consists of several parts: 'protocol', 'subdomain', 'second level domain', 'top level domain', 'directory/folder', 'filename/webpage', and 'file extension'. Let's go through what each part represents by using a Wikipedia URL as an example.
'''https://en.wikipedia.org/wiki/Elon_Musk'''

*'''https''' --> protocol.
*'''en''' --> subdomain.
*'''wikipedia''' --> second level domain.
*'''org''' --> top level domain.
*'''wiki''' --> folder/directory.
*'''Elon_Musk''' --> webpage.
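
The same breakdown can be sketched in the shell with parameter expansion. This is only an illustration: the variable names are chosen for this example, and the splitting assumes a URL shaped exactly like the one above (a protocol, a host name with three dot-separated parts, one directory and a page).

```shell
url="https://en.wikipedia.org/wiki/Elon_Musk"

protocol=${url%%://*}        # everything before "://"
rest=${url#*://}             # everything after "://"
host=${rest%%/*}             # en.wikipedia.org
path=${rest#*/}              # wiki/Elon_Musk

subdomain=${host%%.*}        # first part of the host name
domain=${host#*.}            # wikipedia.org
second_level=${domain%%.*}   # part before the last dot
top_level=${host##*.}        # part after the last dot
directory=${path%%/*}        # first part of the path
page=${path##*/}             # last part of the path

echo "protocol:            $protocol"
echo "subdomain:           $subdomain"
echo "second level domain: $second_level"
echo "top level domain:    $top_level"
echo "directory:           $directory"
echo "webpage:             $page"
```

Here '#' removes a matching piece from the front of the value and '%' removes it from the end; doubling the character removes the longest possible match instead of the shortest.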

The protocol used is '''https''', short for 'Hypertext Transfer Protocol Secure'. This is the protocol that your browser (Google Chrome, Internet Explorer etc.) uses to securely retrieve the data corresponding to the URL from the web server. Protocols are a different topic, and we'll talk more about them later. Before 'https', the protocol used was 'http', but due to problems with data security, the use of the 'https' protocol is gradually increasing. In short, 'https' ensures that the data exchanged between your browser and the web server is encrypted, and you can read more about this by following this [https://www.entrepreneur.com/article/281633 link]. But not all websites use it; for instance, this very website is using the 'http' protocol for interaction with browsers.
Following 'https', there can be a subdomain, and in this case it's '''en'''. The subdomain can be called just about anything, but the most commonly used is 'www'. You don't actually have to add a subdomain, but what's important is that one of the 2 URLs, the one with a subdomain and the one without, redirects to the other in order to avoid duplicate versions of the URL. For example, if you write 'https://wikipedia.org/wiki/Elon_Musk' in your browser, you will be redirected to 'https://en.wikipedia.org/wiki/Elon_Musk'. Next, '''wikipedia''' is the second level domain. Along with the top level domain, the second level domain makes up the domain name of the URL, which is what makes the URL unique. The top level domain in this URL is '''org''', which is short for organization. A more commonly used top level domain that you're familiar with is 'com', short for commercial. The folder/directory is '''wiki'''. This is followed by the webpage '''Elon_Musk'''. In this case, there's no file extension, but in previous sections we've been using URLs such as https://teaching.healthtech.dtu.dk/material/unix/ex1.acc, where the filename is 'ex1' and the file extension is 'acc'.

You can find the IP address of a domain by using the command '''nslookup''',
Prompt$ '''nslookup''' <Domain_name>
which outputs the IP address along with some additional information to your terminal. For example, if you typed
Prompt$ '''nslookup''' wikipedia.org
you would get an IP address such as 91.198.174.192. The command '''nslookup''' actually receives this information from what's called DNS (Domain name system) servers. DNS servers are simply systems that store domain names with their corresponding IP addresses, ensuring that you're brought to the right IP address when you use a URL.

Another useful command is '''ping''', which is used to check network connectivity between a host (your computer) and another host or server. In simple terms, it sends a small data packet to the specified IP or URL and waits for a response. The time it takes for the response to come back is called latency. High latency is what causes 'lagging' in online computer games, or maybe just some really slow websites. Conversely, low latency ensures enjoyable gaming and web browsing. You can check the ping of a website by typing,
Prompt$ '''ping''' <IP/HOSTNAME>
in your terminal.

We've already used '''wget''' in an earlier section to download datafiles, but here we'll go more into detail with some of its options. Essentially, '''wget''' allows you to download files from servers without being logged into that server. If you're using macOS, '''wget''' is not installed by default, but you can use '''curl''' instead.
Prompt$ '''wget''' <URL>
will download files from the server specified by the URL, as long as it doesn't require any sort of login. You can try this out yourself for any URL, but the content of the files you download might seem a little strange if all you're downloading is a website. Websites are written in 'html', a markup language you might not be familiar with.

If you need to download a big file, you can run the download as a background process.
Prompt$ '''wget''' -b <URL>
will download the URL as a background process, allowing you to do other work within the shell as you wait.

If your download was interrupted for some reason, you can resume the download of the partially downloaded file using the '''-c''' option
Prompt$ '''wget''' -c <URL>
will resume the download of the file from the URL.

The command '''curl''' is quite similar to '''wget''', but there are some differences. The differences are summed up nicely in this link [https://daniel.haxx.se/docs/curl-vs-wget.html Curl vs wget].

Similar to '''wget''', you can retrieve a file from a URL,
Prompt$ '''curl''' <URL>
will fetch the content from the server specified by the URL. Unlike '''wget''', it writes the content to standard output unless you tell it to save to a file.

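Multiple files can be downloaded with the syntax,
Prompt$ '''curl''' "http://website.{URL_1,URL_2,URL_3}.com"
will download from the URLs http://website.URL_1.com, http://website.URL_2.com and http://website.URL_3.com. Note that there must be no spaces inside the curly brackets.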

If you need to download a series of files,
Prompt$ '''curl''' "ftp://ftp.something.com/file[1-20].jpeg"
will download the files file1.jpeg through file20.jpeg. Here we're using the ftp protocol.
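
To picture what the range [1-20] stands for, the loop below prints the 20 URLs one by one. It is only a sketch of the expansion, not of how '''curl''' works internally; ftp.something.com is the placeholder host from the example above, the function name list_urls is made up for this illustration, and nothing is downloaded.

```shell
# Print the URLs that the pattern file[1-20].jpeg stands for.
list_urls() {
    i=1
    while [ "$i" -le 20 ]; do
        echo "ftp://ftp.something.com/file${i}.jpeg"
        i=$((i + 1))
    done
}

list_urls
```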

You can save the content of the URL to a specific file on your computer,
Prompt$ '''curl -o''' <FILE> <URL>
will download the content from the URL and save it as <FILE>.

In networks, we distinguish between ''webpages'' and ''websites''. A website is a collection of webpages that are all under the same domain. So examples of a website and a webpage could be https://wikipedia.org and https://wikipedia.org/wiki/Elon_Musk respectively. Websites like Wikipedia, Netflix, Google etc. all have some server computer connected to the internet that provides data to clients. There are many other types of servers as well: mail servers, data servers, FTP servers, proxy servers, chat servers etc. If you're interested in these other types of servers, you can read more about them by following this link, [https://www.webopedia.com/quick_ref/servers.asp Different server types].

We defined a 'server' as a computer connected to a network that provides 'services' to other computers, which essentially means that any computer could be made into a server. That being said, however, there are actually computers that are designed especially to be servers. These types of computers are called 'server computers'. They have different specs than normal computers and are designed to serve many clients simultaneously. They also typically have a lot of hardware redundancy, such as 'RAID disk systems', 'ECC memory' and 'dual power supplies', which ensures that if one part of the server breaks, the server can continue working without crashing. However, a server doesn't have to be run by a server computer, and you can just as well run a server on a desktop computer. It really depends on the scale for which the server is going to be used. Finally, most servers do not use a GUI (Graphical User Interface) and can only be operated through a CLI (Command Line Interface), which is one of the main reasons why you've been learning to become efficient with CLIs. You might be wondering why servers don't use graphical user interfaces, as it can't be that hard to implement and then you wouldn't have to take this course. There's actually a very good reason for this. Applications, including graphical user interfaces, make servers more susceptible to security breaches, which could allow uninvited guests (hackers) inside the server. Hence, servers are normally designed to be as simple as possible, while still giving utility to the intended user.

== Protocols and ports ==
IP addresses play a central role in connecting your computer to the internet: they ensure that your requests go to the right place and that the answers find their way back to you. In order to make sure that you can access the internet efficiently, something called ''protocols'' is used. You don't need a deep understanding of protocols unless you're planning to become a web developer. Simply put, protocols are standard sets of rules that dictate how computers are to communicate efficiently across a network. A protocol that you're likely familiar with is 'http', short for 'hypertext transfer protocol', which is the protocol that your browser uses for retrieving data from a website. Protocols use something called ports, which you can think of as doors through which data can go out and in. Port numbers range from 0 to 65535, and port numbers ranging from 0-1023 are considered system ports, which are the ones that the most common protocols use. The port numbers typically used for HTTP, SMTP (Simple Mail Transfer Protocol) and FTP (File Transfer Protocol) are 80, 25 and 21 respectively. Keep in mind, however, that there are alternative ports for most protocols, and a server can in fact be set up to use any port number as long as it isn't already in use by another service. The port numbers ranging from 1024-49151 are called ''registered ports'' and are used by specific applications, while the port numbers ranging from 49152-65535 are called ''dynamic ports'' and are usually assigned as needed. Exactly what 'assigned as needed' means can be illustrated with an example. Imagine you connect to a web server on port 80 using the http protocol. The web server keeps listening on port 80 for all of its clients, so your own computer needs a port of its own on which to receive the response. Your operating system therefore picks a free port, typically a dynamic port, for your end of the connection, and the web server sends the requested data back to that port. When the connection is closed, the port is released again.
This kind of temporary assignment is the most common use for ''dynamic ports''. Ports above 1023 are also used when you install a new application that communicates on a port of its own. Let's get into how we can use commands like '''telnet''' and '''ssh''' to connect with servers. Till now, you've likely only been connecting with remote servers through the use of browser applications like Google Chrome, Mozilla Firefox, Safari, etc.

Prompt$ '''telnet''' <URL> <PORT>
will connect you to the specified <URL> using the specified <PORT>. For example, you can connect to Gmail's mail server for the 'smtp' protocol on port 465.
Prompt$ '''telnet''' smtp.gmail.com 465
Writing mails with '''telnet''' is technically possible but difficult to do and we won't be bothering with trying. Practically, '''telnet''' is mostly used to troubleshoot whether the connection of your computer to a server is working properly.

The commands '''ssh''' and '''scp''', short for secure shell and secure copy, are commands that establish secure connections to remote servers. The command '''ssh''' sets you up with a shell environment on the remote server, allowing you to do work there. The command '''scp''' allows you to copy files to and from a remote server. As a quick introduction, this tutorial tells you almost all you need to know about '''ssh''' and '''scp''', [https://www.youtube.com/watch?v=rm6pewTcSro Tutorial video on using ssh and scp].

To establish a connection to a remote server in a secure shell environment with port 443,
Prompt$ '''ssh -p443''' username@x.x.x.x
where 'username' is the username you're using on the remote server with the IP address 'x.x.x.x'. Port 443 is normally the port for 'secure web browser communication' (https), where the transferred data is highly resistant to interception. There's no specific reason why we're using it here; it's just to show that you can specify which port you would like to access remote servers with. If you don't specify a port, '''ssh''' uses port 22.

To copy a file from a remote server to your device with port 443,
Prompt$ '''scp -P 443''' username@x.x.x.x:Directory/to/the/file/file.txt /mnt/c/Users/Username/Desktop/My_working_directory
where 'Directory/to/the/file/file.txt' is the filepath leading to the location of the file on the remote server and /mnt/c/Users/Username/Desktop/My_working_directory is the filepath to where the file is copied. Note that '''scp''' uses a capital '''-P''' for the port, because the lowercase '''-p''' option means something else (preserve the timestamps and permissions of the file).

Conversely, you can copy a file from your device to the remote server with port 443,
Prompt$ '''scp -P 443''' /mnt/c/Users/Username/Desktop/My_working_directory/file.txt username@x.x.x.x:Directory/to/the/file

== From web server to your computer display (optional) ==
Here we give an explanation of exactly how information is transferred from a web server and displayed on your computer. Understanding this in detail is optional, and it's really just placed here for your curiosity.

This link [https://www.youtube.com/watch?v=PpsEaqJV_A0 Introduction to protocols] will guide you to an introductory video on protocols. In the video, protocols are explained in layers and so to be consistent we'll do the same. There's also a small mistake in the video, as the 'SMTP' (simple mail transfer protocol) is not used for checking mail but only for sending mail. For checking mail other protocols like POP (Post office protocol) and IMAP (Internet message access protocol) are used.

[[File:Protocol_overview.png|right|frame|'''Figure 7.2 Using Protocols Overview:''' The figure shows how data is transferred to your screen display through the use of a web browser (in this case, Mozilla Firefox) and URL, while also showing how protocols; http, TCP, IP influence this transaction.]]

*'''Application layer (HTTP, SMTP, FTP) and ports'''
In the application layer, protocols like HTTP (Hypertext transfer protocol) receive data from the program that you're using. In the case of HTTP, the data would originate from your web browser, but in the case of SMTP the data would originate from a mail application. After having received the data from the program you're running, the application layer will send the data through a port to the TCP (Transmission control protocol). Port numbers range from 0 to 65535, and port numbers ranging from 0-1023 are considered system ports, which are the ones that the most common protocols use. The port numbers typically used for HTTP, SMTP and FTP are 80, 25 and 21 respectively. Keep in mind, however, that there are alternative ports for almost every protocol, and a server can in fact be set up to use any port number as long as it isn't already in use by another service.
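
The port numbers named above can be collected in a small lookup. This is only a sketch of the defaults mentioned in the text; the function names default_port and is_system_port are made up for this example, and real servers can be configured to use other ports.

```shell
# default_port and is_system_port are hypothetical helpers, not standard commands.

# Print the default port of the protocols named in the text.
default_port() {
    case "$1" in
        http) echo 80 ;;
        smtp) echo 25 ;;
        ftp)  echo 21 ;;
        *)    echo "unknown" ;;
    esac
}

# Succeed (exit status 0) if the port number is in the system range 0-1023.
is_system_port() {
    [ "$1" -ge 0 ] && [ "$1" -le 1023 ]
}

for protocol in http smtp ftp; do
    port=$(default_port "$protocol")
    if is_system_port "$port"; then
        echo "$protocol uses port $port (system port)"
    fi
done
```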

*'''Transport layer (TCP)'''
In the Transmission Control Protocol (TCP), the data received from the application layer is split into what's called 'packets', which you can think of as small bundles of data. Splitting the data into packets allows it to be transported efficiently to where it ultimately needs to go. For the data to be put back together properly after having arrived at its destination, TCP equips each packet with a header containing, among other things, a sequence number that tells the receiver how to put the packets back together. First, however, these packets go through the Internet Protocol (IP).

*'''Internet layer (IP)'''
The ''Internet Protocol'' (this is what the 'IP' in 'IP address' is short for) takes care of addressing, routing and delivering your requests correctly. The packets that it receives from the transport layer are equipped with both origin and destination IP addresses. This ensures that the packets know where they need to go, and that the receiving device knows where the packets came from. Next, the packets go through the Network layer.

*'''Network layer'''
Among other things, the network layer handles 'MAC addressing', which ensures that the packets are delivered to the right device on the local network, and it converts the data from the packets into electrical impulses for transmission.

== Command list ==
Here we present all the commands used in this section.
{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''wget''' [OPTION] <URL>
|web get
|A non-interactive network downloader used to download files located at the URL.
|-
|'''curl''' [OPTION] <URL>
|Client url
|Similarly to '''wget''', it is used to download files at the specified URL. This is an alternative for macOS users, where '''wget''' isn't installed by default.
|-
|'''ssh''' [-p PORT] <user@IP/Domain_name>
|Secure shell
|Used to establish a secure connection to a remote server/system. It's also known as secure shell protocol.
|-
|'''scp''' [-P PORT] <SOURCE> <DESTINATION>
|Secure copy protocol
|Copies files securely from a remote server to your computer or from your computer to a remote server. A remote <SOURCE> or <DESTINATION> is written as user@IP/Domain_name:filepath.
|-
|'''telnet''' <IP/HOSTNAME> <PORT>
|Teletype network
|Establishes a connection with the specified host and port.
|-
|'''ifconfig''' [OPTION]
|Interface configuration
|Displays currently active network interfaces, but when used with '''-a''', it displays the status of all network interfaces.
|-
|'''nslookup''' [OPTION] <Domain_name>
|Name server lookup
|Used to obtain information about a server through a DNS (Domain name system) server.
|-
|'''ping''' [OPTION] <IP/HOSTNAME>
|Packet Internet Groper
|Checks network connectivity between host (your computer) and another host/server.
|}

== Exercises 1: Using ssh and scp ==
In order to use the commands '''ssh''' and '''scp''' you need to actually have a remote server you can try them on. You can set up an SSH server on your own computer, and although this is hardly a remote server, it will allow you to try the commands '''ssh''' and '''scp'''. For Windows users, this requires a couple more steps.

* For windows users
First make sure Ubuntu is updated,
Prompt$ '''sudo apt-get update'''
Prompt$ '''sudo apt-get upgrade'''

Then install the ssh client and server,
Prompt$ '''sudo apt-get install openssh-client'''
Prompt$ '''sudo apt-get install openssh-server'''

You should now be able to start an ssh server,
Prompt$ '''sudo service ssh start'''
Prompt$ '''ps -A'''
You should be able to see the daemon process, 'sshd', up and running. You can stop it again by typing,
Prompt$ '''sudo service ssh stop'''
* For Mac users
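Prompt$ '''sudo systemsetup -setremotelogin on'''
This turns on 'Remote Login', which enables the ssh server. macOS has no '''service''' command, so no separate start step is needed.
You can check whether the ssh server is enabled with,
Prompt$ '''sudo systemsetup -getremotelogin'''
which should report that Remote Login is on. On macOS the daemon process, 'sshd', is typically only started when a connection comes in, so it may not show up in '''ps -A''' before you connect. You can disable the ssh server again by typing,
Prompt$ '''sudo systemsetup -setremotelogin off'''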

By default, the port number that your ssh server uses is port number 22. You can, however, change this by editing the 'Port' line in the server configuration file <sshd_config> and restarting the ssh server,
Prompt$ '''sudo vim''' /etc/ssh/sshd_config

# Now that you're set up with an ssh server, start it and connect with '''ssh''' (whenever you want to exit the remote server, simply type '''exit''' in the command prompt).
# Copy any file from your computer to somewhere on the server.
# Copy any file from the remote server to your home directory.

File:Network LAN&WAN.png

2024-03-20T12:19:30Z

WikiSysop:

Understanding network and remote servers: IP/URL, ssh, scp, wget, curl

2024-03-20T12:18:42Z

WikiSysop: Created page with "__NOTOC__ In computing, a network is simply a network consisting of computers and other devices; routers, gateways, modems etc.. These computers are either desktop pc's, which is the type that you're using, or they're server computers. The internet in its entirety is in fact an example of humongous network of global proportions, so as a short introduction, watch this short and simple video explaining the internet [https://www.youtube.com/watch?v=7_LPdttKXPc What's the in..."

__NOTOC__
In computing, a network is simply a group of connected computers and other devices, such as routers, gateways and modems. These computers are either desktop PCs, which is the type that you're using, or server computers. The internet in its entirety is in fact an example of a humongous network of global proportions, so as a short introduction, watch this short and simple video explaining the internet [https://www.youtube.com/watch?v=7_LPdttKXPc What's the internet].

Computers and other devices within the internet have addresses called IP addresses that have the syntax x.x.x.x, where x is a number between 0 and 255. There are 2 types of IP addresses: private and public. The difference is in what numbers are used. Private IP addresses are 'private', which just means that they're only valid inside your own local network and are not directly reachable from the internet. You can check your computer's IP by typing,
Prompt$ '''ifconfig'''
which will display information on the IP addresses currently in use by your computer. For Windows users, your IP address should be under 'wifi0' or 'eth0'. The terms 'wifi0' and 'eth0' mean that you're connected to the internet via WiFi and LAN cable respectively. For Mac users it should be one of the IP addresses next to ''inet''. The 'lo: inet 127.0.0.1' is your machine's loopback address and is not your private IP address. For more information on finding your IP address on a Mac, you can follow this link [http://osxdaily.com/2010/11/21/find-ip-address-mac/ Finding your IP address on Mac OS].
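The distinction between private and public addresses can be sketched with a small shell function. This is a simplified illustration, not a complete classifier: the function name '''classify_ipv4''' is made up for this example, it only knows the three commonly used private ranges (10.x.x.x, 172.16.x.x-172.31.x.x and 192.168.x.x) plus the loopback range, and it labels everything else 'public' even though a few other reserved ranges exist.

```shell
# classify_ipv4 ADDRESS
# Prints "invalid", "loopback", "private" or "public".
# Hypothetical helper for illustration; other reserved ranges are ignored.
classify_ipv4() {
  echo "$1" | awk -F. '
    NF != 4 { print "invalid"; exit }
    {
      # every part must be a whole number between 0 and 255
      for (i = 1; i <= 4; i++)
        if ($i !~ /^[0-9]+$/ || $i > 255) { print "invalid"; exit }
      if ($1 == 127) print "loopback"
      else if ($1 == 10) print "private"
      else if ($1 == 172 && $2 >= 16 && $2 <= 31) print "private"
      else if ($1 == 192 && $2 == 168) print "private"
      else print "public"
    }'
}

classify_ipv4 192.168.1.10
classify_ipv4 127.0.0.1
```

You could feed it one of the addresses that '''ifconfig''' shows next to ''inet'' to see which kind it is.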

Public IP addresses connect you with the rest of the internet. Typically, public IP addresses belong to a modem/router, which you can simply think of as a frontline device that connects all of your private devices to the rest of the internet. Devices that are connected to the same modem will all share the same public IP address, and because public IP addresses are 'public' they must be unique. This is not the case for private IP addresses, because they aren't directly connected to the internet. A network that spans public IP addresses is called a 'Wide Area Network' (WAN), which is represented with a '''red''' line in '''figure 7.1'''. Just so you know, '''figure 7.1''' has been simplified so that the routers for the server computers aren't included.
You can look up your public IP address by following this link [https://whatismyipaddress.com/ Public IP address]. But you can also find your public IP address by using '''wget''',
Prompt$ '''wget''' -qO- ifconfig.me
The command option '''-q''' is short for quiet mode, and '''-O-''' ensures that the output is written to stdout. macOS is not equipped with '''wget''' by default, but you can instead use the command '''curl''',
Prompt$ '''curl''' ifconfig.me

You shouldn't get too attached to your IP addresses, as they can easily change. Your public IP address is assigned by your internet service provider (ISP) and your private IP address by the router of the network you're connected to. Hence, they'll change depending on the WiFi you're logged onto, and your public IP address may change when you unplug and replug your router or if your ISP decides to change it.

We briefly mentioned something about using a LAN cable, but we haven't defined what LAN is. LAN is short for 'Local Area Network', and you can think of it as the network connecting your private devices. In '''figure 7.1''', this is depicted as the devices connected by the '''green''' line: printer, phone and computer. The device that allocates private IP addresses to your 'private devices', while also establishing connectivity between these devices, is your router. Often, the modem and router are combined into one device, which is how it's illustrated in '''figure 7.1'''. For more information on the difference between router and modem, follow this link [https://www.howtogeek.com/234233/whats-the-difference-between-a-modem-and-a-router/ Modem vs router].

The IP addresses we've been discussing so far are IPv4 addresses. The number of possible IPv4 addresses is approximately 4.3 billion, which was sufficient for some time. However, it became clear that this would not be enough in the long run, so the next version of IP addresses, IPv6, has approximately 7.9x10^28 times as many IP addresses as IPv4.
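These numbers can be checked with a little arithmetic. The sketch assumes the standard address sizes, which are not stated in the text above: an IPv4 address is 32 bits long and an IPv6 address is 128 bits long, so the ratio between the two address spaces is 2^128 / 2^32 = 2^96. The variable name '''ipv4_total''' is just chosen for this example.

```shell
# Number of IPv4 addresses: 2^32 (four parts with 256 possible values each)
ipv4_total=$((1 << 32))
echo "IPv4 addresses: $ipv4_total"

# 2^96 is too large for shell integer arithmetic, so awk's
# floating point numbers are used to print it in scientific notation.
awk 'BEGIN { printf "IPv6 addresses per IPv4 address: %.1e\n", 2^96 }'
```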

[[File:Network_LAN&WAN.png|center|frame|'''Figure 7.1 Network Overview:''' This figure gives an overview of a network and distinguishes between LAN ('''green''') and WAN ('''red'''). The modem connects private devices within the same network to the internet, while the router connects devices within the LAN. Servers don't necessarily need to be run by a server computer and can also be run by a desktop computer.]]

== Servers and URLS ==
A 'server' is simply a computer connected to a network that provides 'services' to other computers. The computers that the server provides services for are called ''clients''. This kind of relationship between servers and clients is what's defined as the 'client-server model'.

A simple example of a server that you're likely familiar with is a print server. Print servers are servers that provide the service of printing for computers connected to the same network. In a 'client-server model', the connected computers are the clients and the print server is the server. A print server would typically be what's called a 'local server'. Servers can be distinguished into local and remote servers, the only difference being that local servers are set up within a LAN (so within the green lines in '''figure 7.1''') and remote servers are set up on another 'remote computer' (the servers connected by the red line in '''figure 7.1'''). It goes without saying that remote servers are found everywhere on the internet, because that's largely what the internet is made of. There are different types of servers, but you're probably most familiar with web servers (Wikipedia, Google, YouTube, Facebook etc.). Web servers are available to clients through the use of a web browser (Google Chrome, Internet Explorer etc.) and a URL, which is likely the only way you've been accessing them up till now. You can, however, also interact with them in a slightly more 'old school' way through the command line interface. For example, in the section 'Standard streams and working with files', we used '''wget''' to retrieve files from URLs.

URL is short for 'Uniform Resource Locator', and URLs are used to identify websites. A URL consists of several parts: 'protocol', 'sub domain', 'second level domain', 'top level domain', 'directory/folder', 'filename/webpage', and 'file extension'. Let's go through what each part represents by using a Wikipedia URL as an example.
'''https://en.wikipedia.org/wiki/Elon_Musk'''

*'''https''' --> protocol.
*'''en''' --> subdomain.
*'''wikipedia''' --> second level domain.
*'''org''' --> top level domain.
*'''wiki''' --> folder/directory.
*'''Elon_Musk''' --> webpage.

The protocol used is '''https''', short for 'Hypertext Transfer Protocol Secure'. This is the protocol that your browser (Google Chrome, Internet Explorer etc.) uses to retrieve data securely from the web server corresponding to the URL. Protocols are a different topic, and we'll talk more about that later. Before 'https', the protocol used was 'http', but due to problems with data insecurity there's a gradual increase in the use of the 'https' protocol. In short, 'https' ensures that the data exchanged between your browser and the web server is encrypted, and you can read more about this by following this link [https://www.entrepreneur.com/article/281633]. But not all websites use it, for instance, this very website is using the 'http' protocol for interaction with browsers. 
Following 'https', there can be a subdomain and in this case it's '''en'''. The subdomain can be called just about anything, but the most commonly used is 'www'. You don't actually have to add a subdomain, but what's important is that one of the 2 URLs, the one with a subdomain and the one without, redirects to the other in order to avoid duplicate versions of the URL. For example, if you write 'https://wikipedia.org/wiki/Elon_Musk' in your browser, you will be redirected to 'https://en.wikipedia.org/wiki/Elon_Musk'. Next, '''wikipedia''' is the second level domain. Along with the top level domain, the second level domain makes up the domain name of the URL, which is what makes the URL unique. The top level domain in this URL is '''org''', which is short for organization. A more commonly used top level domain you're familiar with is 'com', short for commercial. The folder/directory is '''wiki'''. This is followed by the webpage '''Elon_Musk'''. In this case, there's no file extension, but in previous sections we've been using URLs such as http://teaching.bioinformatics.dtu.dk/material/36610/ex1.acc, where the filename is 'ex1' and the file extension is 'acc'.
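The same decomposition can be done on the command line with the shell's built-in parameter expansion. This is only a sketch for URLs shaped like the example (protocol, a host with a subdomain, then a path); the variable names are chosen for this illustration and it does not handle ports, query strings or hosts without a subdomain.

```shell
url="https://en.wikipedia.org/wiki/Elon_Musk"

protocol=${url%%://*}            # everything before "://"        -> https
rest=${url#*://}                 # everything after "://"
host=${rest%%/*}                 # up to the first "/"            -> en.wikipedia.org
path=${rest#*/}                  # after the first "/"            -> wiki/Elon_Musk

top_level=${host##*.}            # after the last "."             -> org
without_tld=${host%.*}           # host without top level domain  -> en.wikipedia
second_level=${without_tld##*.}  #                                -> wikipedia
subdomain=${without_tld%.*}      #                                -> en

directory=${path%/*}             #                                -> wiki
webpage=${path##*/}              #                                -> Elon_Musk

echo "$protocol | $subdomain | $second_level | $top_level | $directory | $webpage"
```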

You can find the IP address of a domain, by using the command '''nslookup''',
Prompt$ '''nslookup''' <Domain_name>
would output the IP address along with some additional information to your terminal. For example, if you typed
Prompt$ '''nslookup''' wikipedia.org
you would get an IP address such as 91.198.174.192 (the exact address can differ depending on where and when you ask). The command '''nslookup''' actually receives this information from what's called DNS (Domain Name System) servers. DNS servers are simply systems that store domain names with their corresponding IP addresses, ensuring that you're brought to the right IP address when you use a URL.

Another useful command is '''ping''', which is used to check network connectivity between a host (your computer) and another host or server. In simple terms, it sends a small data packet (an 'echo request') to the specified IP or hostname and waits for a response. The response time from the host/server is called latency. High latency is what causes 'lagging' in online computer games, or maybe just some really slow websites. Conversely, low latency ensures enjoyable gaming and web browsing. You can check the ping of a website by typing,
Prompt$ '''ping''' <IP/HOSTNAME>
in your terminal.
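On Linux, '''ping''' keeps sending packets until you stop it with Ctrl+C; the option '''-c''' limits the number of packets. After the last packet, '''ping''' prints a summary line with the minimum, average and maximum latency. As a sketch, the hypothetical helper '''avg_latency''' picks the average out of that summary line. The sample line in the example is made up for illustration and is not a real measurement.

```shell
# avg_latency: reads ping output on stdin and prints the average
# latency (in ms) found in the summary line "... = min/avg/max/... ms".
avg_latency() {
  sed -n 's|.*= [0-9.]*/\([0-9.]*\)/.*|\1|p'
}

# Real use (needs network access):
#   ping -c 4 wikipedia.org | avg_latency

# Made-up summary line, only to show what the helper extracts:
echo "rtt min/avg/max/mdev = 9.871/10.224/10.742/0.318 ms" | avg_latency
```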

We've already used '''wget''' in an earlier section to download datafiles, but here we'll go more into detail with some of its options. Essentially, '''wget''' allows you to download files from servers without being logged into that server. macOS does not come with '''wget''' installed, but you can use '''curl''' instead.
Prompt$ '''wget''' <URL>
will download files from the server specified by the URL, as long as it doesn't require any sort of login. You can try this out yourself for any URL, but the content of the files you download might seem a little strange if all you're downloading is a website. Websites are written in 'html', a markup language you might not be familiar with.

If you need to download a big file, you can run the download as a background process.
Prompt$ '''wget -b''' <URL>
will download the URL as a background process, allowing you to do other work within the shell as you wait.

If your download was interrupted for some reason, you can resume the download of the partially downloaded file using the '''-c''' option
Prompt$ '''wget -c''' <URL>
will resume the download of the file from the URL.

The command '''curl''' is quite similar to '''wget''', but there are some differences. The differences are summed up nicely in this link [https://daniel.haxx.se/docs/curl-vs-wget.html Curl vs wget].

Similar to '''wget''' you can fetch a file from a URL,
Prompt$ '''curl''' <URL>
will fetch the content from the server specified by the URL. Unlike '''wget''', '''curl''' prints the content to your terminal (stdout) unless you tell it to save it to a file, for example with the option '''-O''', which saves the content under the file's remote name.

Multiple files can be downloaded with the syntax,
Prompt$ '''curl''' "http://website.{URL_1,URL_2,URL_3}.com"
will download from the URLs: URL_1, URL_2 and URL_3. Note that there are no spaces inside the curly brackets.

If you need to download a series of files,
Prompt$ '''curl -O''' "ftp://ftp.something.com/file[1-20].jpeg"
will download file1.jpeg through file20.jpeg. Here we're using the ftp protocol, and the quotes prevent your shell from interpreting the square brackets itself.

You can save the content of the URL to a specific file on your computer,
Prompt$ '''curl -o''' <FILE> <URL>
will download the content from the URL and save it as <FILE>.
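To see which URLs a numeric range like '[1-20]' stands for, the expansion can be imitated with a small loop. This is only an illustration of the idea, not how '''curl''' is implemented, and the function name '''expand_range''' is made up for this example. Nothing is downloaded; the URLs are only printed.

```shell
# expand_range PREFIX FIRST LAST SUFFIX
# Prints one URL per number in the range FIRST..LAST.
expand_range() {
  i=$2
  while [ "$i" -le "$3" ]; do
    printf '%s%s%s\n' "$1" "$i" "$4"
    i=$((i + 1))
  done
}

# The 20 URLs that "ftp://ftp.something.com/file[1-20].jpeg" stands for:
expand_range "ftp://ftp.something.com/file" 1 20 ".jpeg"
```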

In networks, we distinguish between ''webpages'' and ''websites''. A website is a collection of webpages that all sit under the same domain. So examples of a website and a webpage could be https://wikipedia.org and https://wikipedia.org/wiki/Elon_Musk respectively. Websites like Wikipedia, Netflix, Google etc. all have some server computer connected to the internet that provides informational data to clients. There are many other types of servers as well: mail servers, data servers, FTP servers, proxy servers, chat servers etc. If you're interested in these other types of servers you can read more about them by following this link, [https://www.webopedia.com/quick_ref/servers.asp Different server types].

We defined a 'server' as a computer connected to a network that provides 'services' to other computers, which essentially means that any computer could be made into a server. That being said, however, there are computers that are designed especially to be servers. These types of computers are called 'server computers'. They have different specs than normal computers and are designed to serve many clients simultaneously. They also typically have a lot of hardware redundancy, such as 'RAID disk systems', 'ECC memory' and 'dual power supplies', which ensures that if one part of the server breaks, the server can continue working without crashing. However, a server doesn't have to be run by a server computer and you can just as well run a server on a desktop computer. It really depends on the scale for which the server is going to be used. Finally, most servers do not use a GUI (Graphical User Interface) and can only be operated through a CLI (Command Line Interface), which is one of the main reasons why you've been learning to become efficient with CLIs. You might be wondering why servers don't use graphical user interfaces, as it can't be that hard to implement and then you wouldn't have to take this course. There's actually a very good reason for this. Applications, including graphical user interfaces, make servers more susceptible to security breaches, which could allow uninvited guests (hackers) inside the server. Hence, servers are normally designed to be as simple as possible, while still giving utility to the intended user.

== Protocols and ports ==
IP addresses play a central role in connecting your computer to the internet and they ensure that your requests go to the right place while also ensuring that information is returned correctly. In order to make sure that you can access the internet efficiently, something called ''protocols'' is used. You don't need a deep understanding of protocols unless you're planning to become a web developer. Simply put, protocols are a standard set of rules that dictate how computers are to communicate efficiently across a network. A protocol that you're likely familiar with is 'http', short for 'hypertext transfer protocol', which is the protocol that your browser uses for retrieving data from a website. Protocols use something called ports, which you can think of as doors through which data can go out and in. There are 65536 port numbers in total (0-65535), and port numbers ranging from 0-1023 are considered system ports, which are the ones that the most common protocols use. The port numbers typically used for HTTP, SMTP (Simple Mail Transfer Protocol) and FTP (File Transfer Protocol) are 80, 25 and 21 respectively. Keep in mind, however, that there are alternative ports for most protocols, and you can in fact use any port number as long as it isn't already in use by another service. The port numbers ranging from 1024-49151 are called ''registered ports'', which applications can reserve, and those ranging from 49152-65535 are called ''dynamic ports'', which are usually assigned as needed. Exactly what 'assigned as needed' means can be illustrated with an example. Imagine you connect to a web server on port 80 using the http protocol. Port 80 is the door on the server's side, but your own computer also needs a door for this particular connection, so your operating system picks a free dynamic port for it. The web server sends the requested data back to that port, which is how your computer keeps the replies of different connections apart. When the connection is closed, the port is released and can be reused.
This kind of temporary assignment is the most common use for ''dynamic ports'', however, an application you install may also be set up to use a port in the upper ranges. Let's get into how we can use commands like '''telnet''' and '''ssh''' to connect with servers. Till now, you've likely only been connecting with remote servers through the use of browser applications like Google Chrome, Mozilla Firefox, Safari, etc.
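The port ranges can be summed up in a small shell function. It follows the three ranges used by IANA, the organisation that keeps the official port list: system ports (0-1023), registered ports (1024-49151) and dynamic ports (49152-65535). The function name '''classify_port''' is made up for this sketch.

```shell
# classify_port NUMBER
# Prints "system", "registered", "dynamic" or "invalid".
classify_port() {
  port=$1
  # reject empty input and anything that is not a whole number
  case $port in
    ''|*[!0-9]*) echo "invalid"; return 1 ;;
  esac
  if [ "$port" -gt 65535 ]; then
    echo "invalid"; return 1
  elif [ "$port" -le 1023 ]; then
    echo "system"
  elif [ "$port" -le 49151 ]; then
    echo "registered"
  else
    echo "dynamic"
  fi
}

classify_port 80     # the usual HTTP port
classify_port 50000  # a port that could be handed out as needed
```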

Prompt$ '''telnet''' <HOSTNAME> <PORT>
will connect you to the specified <HOSTNAME> using the specified <PORT>. For example, you can connect to Gmail's mail server, which uses the 'smtp' protocol, on port 465.
Prompt$ '''telnet''' smtp.gmail.com 465
Writing mails with '''telnet''' is technically possible but difficult to do and we won't be bothering with trying. Practically, '''telnet''' is mostly used to troubleshoot whether the connection of your computer to a server is working properly.

The commands '''ssh''' and '''scp''', short for secure shell and secure copy, are commands that establish secure connections to remote servers. The command '''ssh''' sets you up with a shell environment on the remote server, allowing you to do work there. The command '''scp''' allows you to copy files to and from a remote server. As a quick introduction, this tutorial tells you almost all you need to know about '''ssh''' and '''scp''', [https://www.youtube.com/watch?v=rm6pewTcSro Tutorial video on using ssh and scp].

To establish a connection to a remote server in a secure shell environment with port 443,
Prompt$ '''ssh -p443''' username@x.x.x.x
where 'username' is the username you're using on the remote server with the IP address 'x.x.x.x'. Port 443 is the port normally used for secure web browser communication ('https'). There's no specific reason why we're using it here, since the security of '''ssh''' comes from its own encryption and not from the port number. It's just to show that you can specify which port you would like to access remote servers with, provided the ssh server is set up to listen on that port.

To copy a file from a remote server to your device with port 443 (note that '''scp''' uses a capital '''-P''' for the port, as the lowercase '''-p''' instead preserves the file's modification times and permissions),
Prompt$ '''scp -P 443''' username@x.x.x.x:Directory/to/the/file/file.txt /mnt/c/Users/Username/Desktop/My_working_directory
where 'Directory/to/the/file/file.txt' is the filepath leading to the location of the file on the remote server and /mnt/c/Users/Username/Desktop/My_working_directory is the filepath to where the file is copied.

Conversely, you can copy a file from your device to the remote server with port 443,
Prompt$ '''scp -P 443''' /mnt/c/Users/Username/Desktop/My_working_directory/file.txt username@x.x.x.x:Directory/to/the/file
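If you connect to the same server often, the username and port can be stored in the ssh client's configuration file, ~/.ssh/config, so that you don't have to retype them. The fragment is a minimal sketch: the alias 'myserver' is made up, and 'x.x.x.x' and 'username' are the same placeholders as in the commands above.

```text
Host myserver
    HostName x.x.x.x
    User username
    Port 443
```

With such an entry in place, '''ssh''' myserver connects with the stored username and port, and '''scp''' accepts the alias as well, for example '''scp''' file.txt myserver:Directory/to/the/file.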

== From web server to your computer display (optional) ==
Here we give an explanation of exactly how information is transferred from a web server and displayed on your computer. Understanding this in detail is optional, and it's really just placed here for your curiosity.

This link [https://www.youtube.com/watch?v=PpsEaqJV_A0 Introduction to protocols] will guide you to an introductory video on protocols. In the video, protocols are explained in layers, so to be consistent we'll do the same. There's also a small mistake in the video, as SMTP (Simple Mail Transfer Protocol) is not used for checking mail but only for sending mail. For checking mail, other protocols like POP (Post Office Protocol) and IMAP (Internet Message Access Protocol) are used.

[[File:Protocol_overview.png|right|frame|'''Figure 7.2 Using Protocols Overview:''' The figure shows how data is transferred to your screen display through the use of a web browser (in this case, Mozilla Firefox) and a URL, while also showing how the protocols HTTP, TCP and IP take part in this transaction.]]

*'''Application layer (HTTP, SMTP, FTP) and ports'''
In the application layer, protocols like HTTP (Hypertext Transfer Protocol) receive data from the program that you're using. In the case of HTTP, the data would originate from your web browser, but in the case of SMTP the data would originate from a mail application. After having received the data from the program you're running, the application layer will send the data through a port to TCP (Transmission Control Protocol). There are 65536 port numbers in total (0-65535), and port numbers ranging from 0-1023 are considered system ports, which are the ones that the most common protocols use. The port numbers typically used for HTTP, SMTP and FTP are 80, 25 and 21 respectively. Keep in mind, however, that there are alternative ports for almost every protocol, and you can in fact use any port number as long as it isn't already in use by another service.

*'''Transport layer (TCP)'''
In the Transmission Control Protocol (TCP), the data received from the application layer is split into what's called 'packets', which you can think of as small bundles of data. Splitting the data into packets allows it to be transported efficiently to where it ultimately needs to go. For the data to be put back together properly after having arrived at its destination, TCP equips each packet with a header containing, among other things, a sequence number that tells the receiver how to put the packets back together. First, however, these packets go through the Internet Protocol (IP).

*'''Internet layer (IP)'''
The ''Internet Protocol'' (this is what the 'IP' in 'IP address' is short for) takes care of addressing, routing and delivering your requests correctly. The packets that it receives from the transport layer are equipped with both origin and destination IP addresses. This ensures that the packets know where they need to go, and that the receiving device knows where the packets came from. Next, the packets go through the Network layer.

*'''Network layer'''
Among other things, the network layer handles 'MAC addressing', which ensures that the packets are delivered to the right device on the local network, and it converts the data from the packets into electrical impulses for transmission.
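The idea of splitting data into numbered packets and putting them back together can be imitated in the shell. This is a toy illustration of the principle only, not how TCP is actually implemented: the function names '''packetize''' and '''reassemble''' are made up, each 'header' is nothing more than a sequence number in front of a colon, and the message is assumed to be a single line of plain text.

```shell
# packetize MESSAGE SIZE
# Splits MESSAGE into pieces of SIZE characters and prints one
# "sequence_number:payload" line per piece.
packetize() {
  printf '%s\n' "$1" | fold -w "$2" | awk '{ printf "%d:%s\n", NR - 1, $0 }'
}

# reassemble
# Reads "sequence_number:payload" lines in any order on stdin,
# sorts them by sequence number and joins the payloads again.
reassemble() {
  sort -t: -k1,1n | cut -d: -f2- | tr -d '\n'
}

packetize "hello world" 4
# Even if the packets arrive in reverse order, the message is restored:
packetize "hello world" 4 | sort -r | reassemble
echo
```

The first command prints the three numbered packets, and the second shows that the sequence numbers are enough to restore the message even though the packets were shuffled on the way.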

== Command list ==
Here we present all the commands used in this section.
{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''wget''' [OPTION] <URL>
|web get
|A non-interactive network downloader used to download files located at the URL.
|-
|'''curl''' [OPTION] <URL>
|Client url
|Similarly to '''wget''', it is used to download files at the specified URL. This is an alternative for macOS users, where '''wget''' isn't installed by default.
|-
|'''ssh''' [-p PORT] <user@IP/Domain_name>
|Secure shell
|Used to establish a secure connection to a remote server/system. It's also known as secure shell protocol.
|-
|'''scp''' [-P PORT] <SOURCE> <DESTINATION>
|Secure copy protocol
|Copies files securely from a remote server to your computer or from your computer to a remote server. A remote <SOURCE> or <DESTINATION> is written as user@IP/Domain_name:filepath.
|-
|'''telnet''' <IP/HOSTNAME> <PORT>
|Teletype network
|Establishes a connection with the specified host and port.
|-
|'''ifconfig''' [OPTION]
|Interface configuration
|Displays currently active network interfaces, but when used with '''-a''', it displays the status of all network interfaces.
|-
|'''nslookup''' [OPTION] <Domain_name>
|Name server lookup
|Used to obtain information about a server through a DNS (Domain name system) server.
|-
|'''ping''' [OPTION] <IP/HOSTNAME>
|Packet Internet Groper
|Checks network connectivity between host (your computer) and another host/server.
|}

== Exercises 1: Using ssh and scp ==
In order to use the commands '''ssh''' and '''scp''' you need to actually have a remote server you can try them on. You can set up an SSH server on your own computer, and although this is hardly a remote server, it will allow you to try the commands '''ssh''' and '''scp'''. For Windows users, this requires a couple more steps.

* For windows users
First make sure Ubuntu is updated,
Prompt$ '''sudo apt-get update'''
Prompt$ '''sudo apt-get upgrade'''

Then install the ssh client and server,
Prompt$ '''sudo apt-get install openssh-client'''
Prompt$ '''sudo apt-get install openssh-server'''

You should now be able to start an ssh server,
Prompt$ '''sudo service ssh start'''
Prompt$ '''ps -A'''
You should be able to see the daemon process, 'sshd', up and running. You can stop it again by typing,
Prompt$ '''sudo service ssh stop'''
* For Mac users
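Prompt$ '''sudo systemsetup -setremotelogin on'''
This turns on 'Remote Login', which enables the ssh server. macOS has no '''service''' command, so no separate start step is needed.
You can check whether the ssh server is enabled with,
Prompt$ '''sudo systemsetup -getremotelogin'''
which should report that Remote Login is on. On macOS the daemon process, 'sshd', is typically only started when a connection comes in, so it may not show up in '''ps -A''' before you connect. You can disable the ssh server again by typing,
Prompt$ '''sudo systemsetup -setremotelogin off'''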

By default, the port number that your ssh server uses is port number 22. You can, however, change this by editing the 'Port' line in the server configuration file <sshd_config> and restarting the ssh server,
Prompt$ '''sudo vim''' /etc/ssh/sshd_config

# Now that you're set up with an ssh server, start it and connect with '''ssh''' (whenever you want to exit the remote server, simply type '''exit''' in the command prompt).
# Copy any file from your computer to somewhere on the server.
# Copy any file from the remote server to your home directory.

File:Ongoing processes3.png

2024-03-20T12:18:05Z

WikiSysop:

File:Top kill.png

2024-03-20T12:17:34Z

WikiSysop:

File:Ps ef command.png

2024-03-20T12:17:06Z

WikiSysop:

Processes; foreground and background, ps, top, kill, screen, nohup and daemons

2024-03-20T12:16:30Z

WikiSysop: Created page with "__NOTOC__ A process is simply the instance of a running program. Processes are a fundamental concept of Linux systems and we'll start by discussing what can be termed as the life cycle of processes. This includes the concepts of parent and child processes and while going through this subject, we'll also introduce some essential commands that allow us to view ongoing processes in our terminal. These commands are '''ps''', '''jobs''' and '''top'''. Next we discuss foregr..."

__NOTOC__
A process is simply the instance of a running program. Processes are a fundamental concept of Linux systems and we'll start by discussing what can be termed as the life cycle of processes. This includes the concepts of parent and child processes and while going through this subject, we'll also introduce some essential commands that allow us to view ongoing processes in our terminal. These commands are '''ps''', '''jobs''' and '''top'''.

Next we discuss foreground and background processes and how processes can be managed practically using UNIX commands and operators. These commands include '''sleep''', '''fg''', '''bg''', '''jobs''' and '''kill'''. The only new operator we'll be looking at is '''&''', which is used to start a command as a background process. We'll be revising redirectional operators, <, > and taking a look at file descriptors.

We then go a bit more into detail with the 'SIGHUP' signal. In short, this is a signal that is sent out when a terminal session is closed, which results in child processes being terminated. We'll discuss how one can make processes immune to the 'SIGHUP' signal, so that processes can continue even after we've closed the terminal. The commands introduced for this are '''screen''', '''disown''' and '''nohup'''. Lastly we introduce the concept of daemon processes.

== Parent and child processes ==
Processes are distinguished into parent, child and daemon processes, but for now we'll focus on parent and child processes. Every process except the very first one has a parent process, but not all processes have child processes. The relationship between the two is how you'd imagine; a child process is derived/spawned from its parent process. When a process starts the execution of a new program, it first makes a copy of itself. In Linux we call this 'forking' and it is carried out by the system call fork(). You can think of system calls as the interface between applications (the shell included) and the kernel. The fork() is usually followed by an exec() system call, which replaces the program running in the child process with the new program.

You can see the Linux processes currently running in your terminal by using the command '''ps''', which is short for 'process status'.
Prompt$ '''ps -ef'''
[[File:Ps ef command.png|none|frame|'''Figure 8.1:''' Parent and child processes. In columns 1-8 there are UID (User ID), PID (Process ID) , PPID (Parent process ID's), C (Processor Utilization), STIME (Start time), TTY ('TeleTYpewriter'), TIME (CPU Time) and CMD (the actual command). It's not important to remember what all these columns mean, as you can always figure this out by going to the man page for '''ps'''. But just know that you can display different columns depending on the options that you use with '''ps'''. ]]

A process is a child process of another process if its PPID (parent process ID) is the same as the other process's PID (process ID). By using different command options you can make '''ps''' display a lot of different column statistics on the processes running on your computer. You can often guess what these columns mean, but if you're in doubt you can always go to the man page for '''ps'''. UID is short for 'user ID' and in this case there are only two users; root and goodboy. PID is short for 'process ID', a unique identification number assigned to each process, making it easier to target. PPID is short for 'parent process ID', which is the PID of a process's parent process. Recall that UNIX commands are in fact small programs, which is why '''ps -ef''' appears as a process. Take note that the PPID for '''ps -ef''' is 4, which is the PID for '''-bash'''. This means that '''-bash''' is the parent process of '''ps -ef'''. The parent process for '''-bash''' is '''/init ro''', which in turn has no parent process. You can think of the ''init'' process as a 'super parent process', which is the very first process to be run when you start/boot a Unix computer. ''init'' should always have the PID 1. If a parent process terminates before its child process, ''init'' will become the new parent of the child. For a more graphical view of this you can use the command '''pstree''',
Prompt$ '''pstree'''
which displays a tree of parent and child processes. The ''init'' process should always be the starting node of this tree.
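
The parent/child relationship can also be checked directly from the shell. The lines below are only a minimal sketch and not part of the original material. They assume a '''ps''' that accepts the POSIX options '''-o''' and '''-p''', and they use two shell variables that have not been introduced above: $$ (the PID of the current shell) and $! (the PID of the most recently started background command).

```shell
# PID of the shell that runs these lines
shell_pid=$$

# Start a child process in the background and remember its PID
sleep 5 &
child_pid=$!

# Ask ps for the PPID column of the child only ('ppid=' suppresses the header)
child_ppid=$(ps -o ppid= -p "$child_pid" | tr -d ' ')

echo "shell PID:    $shell_pid"
echo "child PID:    $child_pid"
echo "child's PPID: $child_ppid"

# Clean up the child again
kill "$child_pid"
```

When this is typed at a prompt, the child's PPID should normally be the same number as the shell's PID.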

== Foreground and background processes ==
Processes can run in the foreground and background. A foreground process is any command that you enter at the prompt and then have to wait for before you can enter a new command. Up till now you've only been executing commands as foreground processes. Unlike foreground processes, when a background process has been executed you don't have to wait for its completion before being able to issue a new command. Any command can be run as a background process by typing a space and '&' after the command,
Prompt$ '''<COMMAND> &'''
As an example, we use the command, '''sleep''', which is essentially a pause command that does nothing for a specified amount of time,
Prompt$ '''sleep 30 &'''
will create a pause of 30 seconds as a background process, which you can view with the '''ps''' command. Now try to run '''sleep''' as a foreground process,
Prompt$ '''sleep 60'''
which will create a pause of 60 seconds. To exit this pause you have two choices. Press '''Ctrl-c''', which sends an interrupt signal (SIGINT) to the process running in the foreground, normally terminating it immediately. This is a very effective and often used way to stop programs. Alternatively, press '''Ctrl-z''' to send a suspend signal (SIGTSTP) to the process running in the foreground, pausing it immediately. If you chose the pause option you'll be able to see the process with '''ps ax''', but its status will be T (Stopped). If you've paused the process you can restart it as a background process or foreground process.
Prompt$ '''bg''' %<job ID>
Prompt$ '''fg''' %<job ID>
which will restart the process as a background or foreground process respectively. Do NOT forget your suspended processes, as they use resources, even if they do not do anything. You can see the job IDs of your background and suspended processes by executing,
Prompt$ '''jobs'''
which displays a list of active jobs. Jobs can be defined as processes that are initiated in the shell interactively by you, the user. Each job is assigned a job ID which the commands, '''bg''' and '''fg''', use for targeting.
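
As a small illustration of the '''&''' operator and '''jobs''', the sketch below starts a short '''sleep''' in the background and then waits for it. It is only an example under stated assumptions: the variable $! and the command '''wait''' are not covered in the text above, and the commands '''fg''' and '''bg''' are left out because they need an interactive prompt.

```shell
# Start a short pause in the background; the shell continues immediately
sleep 2 &
bg_pid=$!              # $! holds the PID of the most recent background command

# 'jobs' lists the jobs started from this shell
job_listing=$(jobs)
echo "$job_listing"

# 'wait' blocks until the given background process has finished
wait "$bg_pid"
exit_status=$?
echo "sleep (PID $bg_pid) finished with exit status $exit_status"
```

At an interactive prompt, the number in square brackets printed by '''jobs''' is the job ID that '''fg''' and '''bg''' expect after the % sign.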

You can terminate a background process by sending it to the foreground and then terminating it with '''Ctrl-c'''. But a more direct approach to terminating a background process is to use the '''kill''' command. It doesn't use job ID's and instead uses PID's,
Prompt$ '''kill''' <PID>
will terminate the process with corresponding <PID>. The '''kill''' command can actually be used to send a multitude of signals to processes. You can see a list of all these different signals by using the '''-l''' command line option,
Prompt$ '''kill -l'''
By default the signal used by '''kill''' is the terminate signal, 'SIGTERM', which asks the process to exit. If a process ignores it, you can send the kill signal, 'SIGKILL', with '''kill -9''' <PID>, which cannot be ignored. You can also use '''kill''' to send a stop signal, 'SIGSTOP'.
Prompt$ '''kill -19''' <PID>
will send a stop signal to the process with corresponding <PID>.
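
Signals can also be given by name instead of by number, which avoids having to remember that the stop signal is number 19 (this number applies to many Linux systems but is not the same everywhere). The following is a minimal sketch, assuming a shell whose '''kill''' accepts the POSIX '''-s''' option; the variable names are made up for the example.

```shell
# Start a long pause in the background and remember its PID
sleep 30 &
pid=$!

kill -s STOP "$pid"    # pause the process (SIGSTOP)
kill -s CONT "$pid"    # let it continue again (SIGCONT)
kill -s TERM "$pid"    # ask it to terminate (SIGTERM)

# Collect the exit status; for a process ended by a signal
# the shell reports 128 + the signal number
wait "$pid"
status=$?
echo "exit status reported by wait: $status"

# 'kill -s 0' sends no signal, it only checks whether the PID still exists
if kill -s 0 "$pid" 2>/dev/null; then
    still_running=yes
else
    still_running=no
fi
echo "still running: $still_running"
```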

Lastly, an alternative command to '''ps''' is '''top''', the difference being that '''top''' provides a continuous representation of ongoing processes with an interactive command mode and '''ps''' only provides a snapshot of current processes. By typing,
Prompt$ '''top'''
an interactive command mode and columns containing statistics of ongoing processes are displayed. The meaning of the columns is listed hereunder.
* PID: Unique process id.
* PR: Priority of the task.
* VIRT: Total virtual memory.
* USER: User name of owner.
* %CPU: CPU usage.
* TIME+: CPU Time, similar to ‘TIME’.
* SHR: Shared Memory size (kb).
* NI: Nice Value. A negative nice value implies high priority and a positive nice value implies a low priority.
* %MEM: Memory usage.

You can exit the interactive command mode of '''top''' by pressing q. You can use the kill utility by pressing k while in the interactive command mode. You will then be prompted to enter which PID you want to send a signal to, whereafter you'll be prompted to enter what kind of signal you'd like to send. By default this is the terminate signal (SIGTERM), so if you don't enter any signal type and simply press enter, a terminate signal will be sent. In '''figure 8.2''' a signal is being sent to the process with the PID, 24, from the '''top''' interactive command mode.

[[File:top kill.png|frame|none|'''Figure 8.2 The top command:''' The '''top''' command displays a continuous representation of ongoing processes. In this representation, we enter the kill utility by pressing k, whereafter we're prompted for the PID we would like to send a signal to. In this case, we enter the PID, 24. Afterwards we'll be prompted for the signal type. By default this is the terminate signal (SIGTERM), so by entering nothing and pressing enter, a terminate signal will be sent.]]

== File descriptors ==
When you open a file on your computer, your operating system creates an entry in which information about that open file is stored. These entries have an entry number, which can be any non-negative integer (6, 7, 12, 301 etc.), and this number is what's called a file descriptor. Standard input, standard output and standard error are also thought of as files in Unix, and they therefore also have file descriptors. These are by default always set to 0, 1 and 2 for stdin, stdout and stderror respectively. You can actually find these files on your system by going to the device directory.
Prompt$ cd /dev
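
To make file descriptors a little more concrete, here is a minimal sketch that opens a file on descriptor 3, writes to it and closes it again. The '''exec''' redirection syntax and the temporary file created with '''mktemp''' are additions for illustration and are not covered in the text above.

```shell
# The three standard streams show up as entries in /dev on most systems
ls -l /dev/stdin /dev/stdout /dev/stderr

# Open a temporary file for writing on file descriptor 3
tmpfile=$(mktemp)
exec 3> "$tmpfile"

# Anything sent to descriptor 3 now ends up in the file
echo "written via fd 3" >&3

# Close descriptor 3 again
exec 3>&-

content=$(cat "$tmpfile")
echo "file content: $content"

# Descriptor 1 is stdout and descriptor 2 is stderror:
# only the first echo is captured here, the second is discarded
captured=$( { echo "to stdout"; echo "to stderror" >&2; } 2>/dev/null )
echo "captured: $captured"

rm -f "$tmpfile"
```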

== Redirecting outputs and providing necessary stdin for background processes ==
If a command requires additional stdin from the user or has some output, it will cause problems if it's run as a background process. If a program run in the background requires additional stdin and doesn't receive it, the program will suspend and wait indefinitely for the input. Furthermore, stdout and stderror from a background process need to be redirected to somewhere else than the terminal. For example, when installing packages with '''apt''' there's a continuous stream of output regardless of whether it's run in the background or foreground. While this is being outputted we cannot continue working with other stuff in the terminal, which ruins the point of background processes.

So in order to run a program as a background process, you often need to redirect stdout and stderror to somewhere else, which can be done by using file descriptors.
Prompt$ <COMMAND> > <output_file> '''2>''' <error_file> '''&'''
In this command, stdout and stderror from <COMMAND> are redirected to two separate files. By default, the '''>''' operator redirects stdout, which is why we only have to write '''>'''.
If we don't care about the stdout and stderror we can redirect to what's called the ''null device''.
Prompt$ <COMMAND> > /dev/null '''2>&1 &'''
In this command, the stdout is redirected to '/dev/null' with '''>''', and the stderror is redirected with '''2>''' to the same place as the stdout, '''&1'''. Alternatively, you could have also typed,
Prompt$ <COMMAND> > /dev/null '''2>''' /dev/null '''&'''
which will also redirect both stdout and stderror to the null device. If it helps, think of the null device as a black hole where we send data we don't need.
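
The redirections above can be tried out safely with a small stand-in program. In this sketch the shell function noisy is hypothetical and only plays the role of <COMMAND>; it writes one line to stdout and one line to stderror.

```shell
workdir=$(mktemp -d)

# Stand-in for a real program: one line on stdout, one on stderror
noisy() {
    echo "normal output"
    echo "error output" >&2
}

# stdout and stderror go to separate files, run as a background process
noisy > "$workdir/out.txt" 2> "$workdir/err.txt" &
wait

out_content=$(cat "$workdir/out.txt")
err_content=$(cat "$workdir/err.txt")
echo "out.txt: $out_content"
echo "err.txt: $err_content"

# Both streams go to the null device; nothing is left to capture
leftover=$( { noisy > /dev/null 2>&1; } 2>&1 )
echo "left over after /dev/null: '$leftover'"

rm -r "$workdir"
```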

We've dealt with the problem of redirecting stdout and stderror, but we still have to deal with the stdin. For instance, in the last section where we used '''apt''' to install packages, we needed to type in 'y' for installation of a package to continue. A solution to this is simply to use the '''y''' option as shown in the last section.
Prompt$ '''sudo apt -y install''' <PACKAGE> '''>''' /dev/null '''2>&1 &'''

This command will install <PACKAGE> without asking for the additional 'y' confirmation and will direct all output to '/dev/null'. There might, however, not always be an option like the '''y''' option. Therefore, we show another method where we use '''echo''' to pipe the stdin you need,
Prompt$ '''echo''' 'Your_stdin' '''|''' '''sudo apt install''' <PACKAGE> '''>''' /dev/null '''2>&1 &'''
Also, you could create an <inputfile> and use redirectional operator for stdin, '''<''', to feed its contents.
Prompt$ '''sudo apt install''' <PACKAGE> '''<''' <inputfile> '''>''' /dev/null '''2>&1 &'''
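
Both ways of supplying stdin can be tested without installing anything. In the sketch below the shell function confirm is a hypothetical stand-in for a program that waits for a 'y' on stdin; it is not how '''apt''' itself is implemented.

```shell
workdir=$(mktemp -d)

# Stand-in for a program that asks for confirmation on stdin
confirm() {
    read -r answer
    if [ "$answer" = "y" ]; then
        echo "continuing"
    else
        echo "aborted"
    fi
}

# Method 1: pipe the answer in with echo, run in the background
echo 'y' | confirm > "$workdir/piped.txt" 2>&1 &
wait

# Method 2: put the answer in an input file and feed it with '<'
echo 'y' > "$workdir/input.txt"
confirm < "$workdir/input.txt" > "$workdir/fromfile.txt" 2>&1 &
wait

piped_result=$(cat "$workdir/piped.txt")
file_result=$(cat "$workdir/fromfile.txt")
echo "piped:     $piped_result"
echo "from file: $file_result"

rm -r "$workdir"
```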

When you use '''sudo''' you'll be prompted to enter your password, which will grant you root privileges for 15 minutes. In these 15 minutes you won't be prompted again for your password when executing commands that require root privileges. If you run a command that requires root privileges as a background process while output is redirected to /dev/null and you don't have root privileges, the process will stop. The simplest solution to this problem is to run another command with root privileges prior to running your background process. It doesn't matter what command you use to acquire these root privileges, it could for example be '''ls''',
Prompt$ sudo ls
Prompt$ sudo <COMMAND> > /dev/null '''2>&1 &'''

You could also utilize the fact that '''sudo''' by default doesn't read passwords from stdin, but directly from your keyboard. This is useful, if you need to run a command that requires both password and some stdin like 'y', because then you don't have to worry about 'y' being fed as your password.
Prompt$ '''echo''' 'y' '''| sudo apt remove ''' <PACKAGE> > /dev/null 2>&1
will ask you for your password which it receives from your keyboard. Subsequently, it's fed 'y' as stdin and all output is redirected to /dev/null. After having initiated the process, you can make it a background process, by first pausing it with '''Ctrl-z''' and then making it a background process with '''bg''' %<job ID>.

It might be useful to know that you can actually feed your password with stdin by using the '''-S''' option.
Prompt$ '''echo''' "Password" | '''sudo -S''' <COMMAND>

== Continuing processes after exiting terminal: screen, disown and nohup ==
Under normal circumstances, any child processes are sent what's called 'SIGHUP' (short for signal hangup) when the terminal session ends, effectively terminating them. In other words, if the computer or server you're working from crashes, or you simply have to go home and bring your laptop with you, ongoing processes will be terminated. This is obviously problematic and there are 3 ways to avoid this; '''screen''', '''disown''' and '''nohup'''.

When people talk about multiple screens in Linux, they're talking about running multiple terminal windows separate from each other. The processes run within these different terminal windows are not affected by the user logging off, and they're all equipped with a shell.
Prompt$ '''sudo screen -S''' <Screen_name>
will start a new terminal window called <Screen_name>. Within this terminal window you can start a lengthy process in the foreground, then leave the terminal window while keeping its shell running by pressing '''Ctrl-a''' followed by '''d'''. You don't have to redirect output as it is directed to the terminal where you started the process, which is the screen. Screens are quite useful if you're running a lengthy process and you're not sure what stdin it might need, so you want to be able to interact with it.

You can read more about screens here: 
[https://kb.iu.edu/d/acuy Additional info on screens 1] 
[https://www.computerhope.com/unix/screen.htm Additional info on screens 2]

As mentioned, the signal that terminates processes when the terminal session ends is called 'SIGHUP'. You can make processes immune to this signal by using the command '''disown''', which will detach the process from the shell's job list, so that the process is not sent a 'SIGHUP' signal when the terminal session ends. For example,
Prompt$ '''sudo bash -c''' ''''apt-get -y install''' <PACKAGE> '''>''' /dev/null '''2>&1 & disown''''
will install <PACKAGE> as a child process separate from the terminal, and the installation of the package will therefore proceed even if the terminal is shut down.

Lastly, a very useful command that has a similar function to '''disown''' is '''nohup'''. Unlike '''disown''', the command '''nohup''' is defined by POSIX, which means that it works within most shells. We haven't really discussed what POSIX is. Essentially, POSIX is a family of standards for the interfaces of Unix-like operating systems and command line shells, which ensures compatibility between them. It is, however, rarely the case that an OS is 'POSIX-certified', and most OS's are described as 'mostly POSIX-compliant'. For instance, there are some shells where '''disown''' doesn't work (tcsh, csh, dash and sh). Any command run with '''nohup''' will be immune to SIGHUP signaling. Furthermore, '''nohup''' ignores all standard input and directs any output bound for the terminal to the file nohup.out. This means that this command is not ideal for processes that require additional stdin.
Prompt$ '''nohup''' <COMMAND>
will execute <COMMAND>, direct all output to nohup.out and ignore all stdin. Also, if the terminal ends, the 'SIGHUP' signal will be ignored and the process will not end.
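
In practice '''nohup''' is often combined with the '''&''' operator and explicit redirection, so that the job runs in the background and its output goes to a file of your own choosing instead of nohup.out. The following is a minimal sketch of that combination; the file name job.log and the small '''sh -c''' command standing in for a lengthy job are assumptions made for the example.

```shell
workdir=$(mktemp -d)

# Run a stand-in job immune to SIGHUP, in the background,
# with stdin taken from /dev/null and all output sent to job.log
nohup sh -c 'sleep 1; echo finished' > "$workdir/job.log" 2>&1 < /dev/null &
job_pid=$!

# Here we simply wait for the job; normally you could log out instead
wait "$job_pid"

last_line=$(tail -n 1 "$workdir/job.log")
echo "last line of job.log: $last_line"

rm -r "$workdir"
```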

== Daemon processes ==
Daemon processes are difficult to characterize as they share many of the characteristics that normal background processes have. They run in the background and are detached from the terminal. The parent process of daemon processes is most often the init process, which means that daemons will most often have PPID value of 1. They're usually created by forking a child process, followed by an immediate exit of the process, however, they can also be created directly by the init process. Users will typically have no control over daemons. You can see the daemon processes running on your device by typing,
Prompt$ ps -A
which lists ongoing Unix processes. Except for the init process and a few others, daemon processes typically have the ending, 'd'. The output from executing '''ps -A''' is shown in '''figure 8.3'''. Here, there are 3 daemon processes which are underlined with red. Two of these are the init process and the third is what's called 'Secure shell server' process. The screen process has been underlined with orange to show that processes with a PPID value of 1, aren't necessarily a daemon. There are other daemons which won't show up unless you're using a Linux computer.

The purpose of daemons is to provide a service. Examples could be printer service, network service, sound, web service, mail service, etc. It is system level services that we often do not have to care about.

[[File:Ongoing processes3.png|frame|none|'''Figure 8.3 Ongoing processes, some of which are daemons:''' The figure shows the execution and output of the command, '''ps -A''', which shows ongoing Unix processes. 3 daemon processes are underlined with red, two of which are the init process and the third is a 'Secure shell server' process. Processes with a PPID value of 1, such as the screen process underlined with orange, aren't necessarily daemons.]]

== Command list ==
In the table below, we list this section's Unix commands.
{| class="wikitable"
|-
!style="width: 18%"| Unix Command
!Acronym translation
!Description
|-
|<COMMAND> '''&'''
|<nowiki>-</nowiki>
|Runs command as a background process
|-
|'''bg''' %<job ID>
|bg is short for background
|Continues a stopped job in the background
|-
|'''fg''' %<job ID>
|fg is short for foreground
|Continues a stopped job in the foreground
|-
|'''sleep''' <NUMBER>[s/m/h/d]
|<nowiki>-</nowiki>
|Delays for a specified amount of time. This is specified with <NUMBER> and one of the suffixes s, m, h and d, which are short for seconds, minutes, hours and days respectively. By default, the suffix is s
|-
|'''top''' [OPTION]
|<nowiki>-</nowiki>
|Displays all the processes running on your computer
|-
|'''ps''' [OPTION]
|Process status
|Reports a snapshot of current processes
|-
|'''kill''' [OPTION] <PID>
|<nowiki>-</nowiki>
|Sends a signal to a process and by default this signal is to terminate the process
|-
|'''screen''' [OPTION]
|<nowiki>-</nowiki>
|Used to create new terminal windows that are detached from each other. Child processes created within these new terminal windows are not affected if their parent process is terminated
|-
|'''disown'''
|<nowiki>-</nowiki>
|Dissociates process from current terminal session
|-
|'''nohup''' [OPTION]
|No hangup
|Used to run commands immune to hangups, ignoring stdin. By default output is redirected to nohup.out.
|-
|'''pstree'''
|Process tree
|Display a tree of parent and child processes
|}

== Exercise 1: Working with PIDs ==
1. Take a snapshot of current processes and save the process information to a file. The process information should contain PID's and PPID's as a minimum. 
2. In the file, replace the occurrence of bash with 'Hacked'. 
3. Extract the process ID of the 'Hacked' process and have the output directed to your terminal ('''Hint 1'''). 
4. Make a screen and exit it without closing it. 
5. Extract its PID and have it redirected to your terminal. 

'''Hint 1:''' A good way to do this is to use what you learned in 'Filtering and regular expressions' combined with pipelines.

File:Apt remove2.png

2024-03-20T12:15:53Z

WikiSysop:

File:Apt install.png

2024-03-20T12:15:26Z

WikiSysop:

File:Apt update&&upgrade.png

2024-03-20T12:15:05Z

WikiSysop:

File:Tar pipe gzip.png

2024-03-20T12:14:36Z

WikiSysop:

File compression and advanced packaging tools

2024-03-20T12:13:44Z

WikiSysop: Created page with "__NOTOC__ You might've encountered files with file extensions like '.tar', '.gzip' and '.zip' when downloading files from the internet. These file extensions mean that the file is compressed and file byte size reduced so that it requires less disk space. Bytes are simply a sequence of 8 bits, and we've previously seen how this is used to store ASCII characters. Files that have been compressed can also be decompressed, making them user readable but requiring more bytes on..."

__NOTOC__
You might've encountered files with file extensions like '.tar', '.gz' and '.zip' when downloading files from the internet. These file extensions mean that the file is an archive and/or has been compressed, so that its byte size is reduced and it requires less disk space. A byte is simply a sequence of 8 bits, and we've previously seen how this is used to store ASCII characters. Files that have been compressed can also be decompressed, making them user readable but requiring more bytes on your disk. In this section, we first discuss how it's possible to reduce the byte size of a file and afterwards restore it back to its original byte size. Next we discuss the file extensions '.tar', '.gz' and '.zip'; what they mean and how they differ. In short, tar doesn't really compress files and there are different ways in which files can be compressed, hence, there are also different file extensions like .gz and .zip. We will also introduce the commands '''tar''', '''gzip''' and '''zip''' and how they can be used to compress/decompress files.

Lastly, we discuss package managers and how the command '''apt''', short for 'Advanced Package Tool', is used. If you're running a Unix terminal from your Mac OS, you won't be able to run the command '''apt''', because it's not supported. In order to use it, you'd need to either use a virtualbox or connect to a server with a Linux OS. We haven't talked about connecting to remote servers and you'd have to find such a server as well, so if you want to try these commands out, the easiest option would be to use a virtualbox (there's a link to a guide in the section 'Course Introduction'). You could also try out the Mac OS equivalent of the 'Advanced Package Tool', which is called 'Homebrew'. Here's an introductory video for this package manager [https://www.youtube.com/watch?v=SELYgZvAZbU Homebrew Guide].

At first it might seem rather mysterious that files can be reduced in byte size and decompressed back to their original byte size. The concept, however, is actually quite simple. Files, especially text files, have many patterns that are redundant and appear multiple times. The idea is therefore to make a dictionary that assigns these patterns with a bit value. During file compression, every time these patterns appear, they are then assigned with the same bit value. This type of compression, is called lossless compression, and for an intuitive understanding of this, you can check out this video by crash course [https://www.youtube.com/watch?v=OtDxDvCpPL4&feature=youtu.be File compression] (0:00-6:26).

For obvious reasons, files that have many redundancies can be compressed significantly more than files that have many unique characters, like music, video and picture files. This, however, is only true if you want the compressed file to have exactly the same shade of blue or sound frequency as the original file. The type of compression we've been discussing up till now is what's called ''lossless compression''. As the name implies, none of the file content is lost during this type of compression and the original file can be recreated perfectly. But there's actually another type of compression called ''lossy compression''. The rest of the crash course video linked above is about this sort of compression (6:26-12:47). Lossy compression operates by removing unnecessary bits of information. After all, the human ear and eye are not acute enough to perceive small differences in color shades and sound frequencies. Software that uses lossy compression alters the color values of pixels and the frequencies of sounds to identical values, reducing byte size while ensuring that the result is indistinguishable to us.

= What is tar, gzip and zip =
Here, we list the commands we'll be using in the section, which includes both those used for compression/decompression of files as well as those needed for package management.

{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''tar''' [OPTION] <ARCHIVE> <FILES>
|Tape archive
|Archive utility tool, used to create and extract archives. Archives are simply multiple files that have been combined into one file.
|-
|'''gzip''' [OPTION] <FILE>
|GNU zip
|Compresses <FILE>
|-
|'''gunzip''' [OPTION] <FILE>
|GNU unzip
|Decompresses <FILE>
|-
|'''zip''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Archive and compression utility. Used to make an archive or compress <FILE>.
|-
|'''unzip''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Decompresses <FILE>
|-
|'''apt''' [OPTION] <PACKAGE>
|Advanced Package Tool
|Package manager for Ubuntu with many utilities.
|}

Tar is the oldest of the 3 utilities; '''tar''', '''gzip''', and '''zip'''. Unlike '''gzip''' and '''zip''', '''tar''' is actually not compressing files, but rather bundling them into archives and giving them the file extension '.tar'. So using '''tar''' on 100 files of 10 kB might not make the resulting '.tar' file byte size smaller than 1 MB. This actually depends on how the files are arranged in directories, as '''tar''' eliminates the space wasted by the file system. The main reason for using '''tar''' is to create a single file of multiple files, a so-called 'archive', that makes for easier portability and storage.

Prompt$ '''tar cvf''' <ARCHIVE> <DIRECTORY>
will create an archive called <ARCHIVE> from the files in <DIRECTORY>. The options '''c''', '''v''' and '''f''' are short for 'create archive', 'verbose' and 'file' (the next argument is the name of the archive file) respectively. The '''tar''' command has a lot of options, some of which are listed hereunder,

* '''c''' : Create archive
* '''x''' : Extract archive
* '''f''' : Use the given archive file name
* '''t''' : Display files in archived file
* '''u''' : Update, only adds files that are newer than the copy in the archive
* '''v''' : Verbose
* '''A''' : Concatenate archive files
* '''z''' : zip, will use gzip to compress or decompress the tar file.
* '''r''' : Append file/directory to an already existing .tar file

You can extract files from the archive by typing,
Prompt$ '''tar xvf''' <ARCHIVE>
which will extract <ARCHIVE> and return it to its original state.
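
The round trip of creating, listing and extracting an archive can be sketched as follows. This is only an illustration in a temporary directory: the directory and file names are made up, the '''v''' option is left out to keep the output short, and the '''-C''' option (change to the given directory first) is an addition that is not in the list above.

```shell
workdir=$(mktemp -d)

# Make a small directory with two files to archive
mkdir "$workdir/Example"
echo "first" > "$workdir/Example/a.txt"
echo "second" > "$workdir/Example/b.txt"

# c: create, f: write the archive to the given file name
tar cf "$workdir/example.tar" -C "$workdir" Example

# t: list the files stored in the archive
listing=$(tar tf "$workdir/example.tar")
echo "$listing"

# x: extract the archive again, here into a separate directory
mkdir "$workdir/restored"
tar xf "$workdir/example.tar" -C "$workdir/restored"

restored=$(cat "$workdir/restored/Example/a.txt")
echo "restored a.txt contains: $restored"

rm -r "$workdir"
```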

After having created an archive using '''tar''', the '''gzip''' utility can be used to compress the archive. The file extension for gzipped files is '.gz'. However, if you've used the tar utility on the files to make an archive, the file extension is .tar.gz or the abbreviated form, .tgz.
From the command line, you can compress files by typing,
Prompt$ '''gzip''' <FILE>
which will compress <FILE> to <FILE>.gz. The '''gzip''' command automatically adds the '''.gz''' file extension and by default replaces the original file with the compressed one. If you want to keep the original file as well, you can use the '''k''' command option,
Prompt$ '''gzip -k''' <FILE>
which will create <FILE>.gz and leave <FILE> in place.

You can also just as easily compress multiple files,
Prompt$ '''gzip''' file1 file2 file3
which will compress the files; file1, file2 and file3. You can also use '''gzip''' to compress every file in a directory, but you need to use the recursive command option, '''r''', otherwise you will get the error 'gzip: DIRECTORY/ is a directory -- ignored',
Prompt$ '''gzip -r''' <DIRECTORY>
which will go through <DIRECTORY> and compress every file, the end result being a directory filled with compressed files. If you want all of the files to be compressed into one file, you'd need to first create an archive with '''tar''' and then compress it with '''gzip'''. Lastly, you can set the level at which you want to compress your files, ranging from '''1''' to '''9'''. For instance,
Prompt$ '''gzip -9''' <FILE>
will compress <FILE> at the highest level. The process, however, is a lot slower than if you had compressed your file at the lowest compression level,
Prompt$ '''gzip -1''' <FILE>
which will compress <FILE> at the lowest level, but the process is significantly faster. By default, '''gzip''' will compress at level '''6'''.
A compressed file can be extracted with '''gunzip''',
Prompt$ '''gunzip''' <FILE>
which will extract <FILE> and return it to its original state. Just like with '''gzip''' you can extract multiple files simultaneously,
Prompt$ '''gunzip''' file1.gz file2.gz file3.gz
which will extract the files; file1, file2 and file3. You can also extract every file in a directory by using the '''r''' command option,
Prompt$ '''gunzip -r''' <DIRECTORY>
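
To see what '''gzip''' and '''gunzip''' do to a file, the sketch below builds a very repetitive text file, compresses it at levels 1 and 9 and prints the resulting sizes. It is a minimal example under stated assumptions: the file names are made up, and the '''-c''' option (write the result to stdout and leave the input file untouched) is an addition not covered above.

```shell
workdir=$(mktemp -d)

# A file with many repeated lines compresses very well
yes "the same line again and again" | head -n 2000 > "$workdir/data.txt"
original_size=$(wc -c < "$workdir/data.txt")

# Compress at two different levels, writing the results to new files
gzip -c -1 "$workdir/data.txt" > "$workdir/level1.gz"
gzip -c -9 "$workdir/data.txt" > "$workdir/level9.gz"
level1_size=$(wc -c < "$workdir/level1.gz")
level9_size=$(wc -c < "$workdir/level9.gz")

echo "original: $original_size bytes"
echo "level 1:  $level1_size bytes"
echo "level 9:  $level9_size bytes"

# Plain gzip creates data.txt.gz, and gunzip brings data.txt back
gzip "$workdir/data.txt"
gunzip "$workdir/data.txt.gz"
restored_size=$(wc -c < "$workdir/data.txt")
echo "restored: $restored_size bytes"

rm -r "$workdir"
```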
One of the command options for the '''tar''' command allows for simultaneously creating an archive and compressing the files using '''gzip''',
Prompt$ '''tar cvzf''' <ARCHIVE> <FILES>
will create a compressed archive, <ARCHIVE>, from <FILES>. A compressed archive can be extracted by typing,
Prompt$ '''tar xvzf''' <Compressed_ARCHIVE>
which returns the compressed archive to its original state.

You might be wondering what's the point of using the '''gzip''' command if the '''tar''' command can be used for compressing files as well. One good reason for this is that when you use the '''gzip''' utility of '''tar''' you can only use the default option settings for '''gzip'''. '''gzip''' offers a range of compression levels from 1 to 9; 1 offers the fastest compression speed but at a lower ratio, and 9 offers the highest compression ratio but at a lower speed. The gzip application uses level 6 by default. Depending on the filetype, there's often no significant difference in data storage at different compression levels. Conversely, there can be big differences in the time it takes to compress data, hence, it's often not worth the wait time to compress files at the highest level.

You can pipe the output of '''tar''' to '''gzip''',

[[File:Tar pipe gzip.png|frame|none|Figure 7.1 Piping '''tar''' with '''gzip''': The directory, Example, is archived using '''tar c''', which stdout is piped to '''gzip -9''', whereafter the stdout is redirected to the Example.tgz.]]
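
Written out as commands, a pipeline like the one in the figure could look like the sketch below. It works in a temporary directory with a made-up file, and it passes '''f -''' to '''tar''' so that the archive is explicitly written to stdout, since not every version of '''tar''' does this by default.

```shell
workdir=$(mktemp -d)

# A small directory to archive
mkdir "$workdir/Example"
echo "some text" > "$workdir/Example/notes.txt"

# Archive to stdout, pipe into gzip at level 9, redirect to Example.tgz
( cd "$workdir" && tar cf - Example | gzip -9 > Example.tgz )

# Check the result by listing the compressed archive (z: use gzip, t: list)
contents=$(tar tzf "$workdir/Example.tgz")
echo "$contents"

rm -r "$workdir"
```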

The '''zip''' utility actually came out before '''gzip'''. It was developed commercially by the firm PKWARE, whereas '''gzip''' was written as free compression/decompression software at a time when the older Unix '''compress''' program was encumbered by patents. Both '''zip''' and '''gzip''' use compression based on the 'LZ77 algorithm', but unlike '''gzip''', '''zip''' also creates archives of files and gives them the file extension '.zip'. For more information on '''tar''', '''zip''' and '''gzip''' you can follow this link, [https://itsfoss.com/tar-vs-zip-vs-gz/ Additional information on tar, zip, gzip and some other file compressions]

To use '''zip''' and '''unzip''' you first need to download and install them, and for this you can use the package managers '''apt''' and '''Homebrew''' (for Mac OS users).

== Exercise 1: tar and gzip ==
1. Download files or find files on your computer which you think can be compressed significantly. Put them together in a folder. 
2. Archive and compress the folder at default compression level. 
3. Uncompress the folder. 
4. Archive and compress it at the lowest compression level in one line. 

== Package managing ==
Package managers are software tools that automate installing, updating, configuring and removing packages. An important theme on Linux based systems is that every program does one simple function, but does it well. Larger programs on a Linux OS are sort of tailored together from smaller programs and are therefore dependent on them. This is essentially what's meant by the 'dependencies' of a program, and for the Linux OS many of these small programs have been developed in parallel by different organizations. Package managers keep all of these dependencies updated and in check, ensuring functionality and compatibility of all programs. Microsoft is the sole owner of the Windows OS and there are therefore defined procedures for installing programs. Also, larger programs are designed to be independent and for the most part they don't need to be tailored together with other small programs for full functionality. But there are some exceptions on a Windows OS which need to be installed externally and which you've probably heard of; Java, Adobe Flash player, Wizard etc. You might think that having programs without dependencies is a good idea, but a major weakness is that there'll be a lot of redundancy and wasted disk space.

The package manager used for Ubuntu is '''apt''', short for 'Advanced Package Tool'. In order to use it, you need superuser privileges, which you can get by using the command '''sudo'''. The command '''sudo''' executes the command that follows it with elevated privileges, provided that your user is permitted to do so. If you're executing this on your own computer, you'll typically be granted root privileges, which are the highest privileges you can get. 
Prompt$ '''sudo apt install''' <PROGRAM>
The command '''apt''' will download and install <PROGRAM>.
You can also remove packages,
Prompt$ '''sudo apt remove''' <PROGRAM>
which will remove <PROGRAM>. If you want to remove package configuration files as well,
Prompt$ '''sudo apt purge''' <PROGRAM>
will remove <PROGRAM> along with its configuration files. Configuration files contain information about the initial parameters and settings for a program. If you want to update your currently installed packages,
Prompt$ '''sudo apt upgrade'''
will update all of your currently installed packages. It is, however, sometimes the case that the list of available packages stored on your computer is outdated. You can refresh this list,
Prompt$ '''sudo apt update'''
which will download the latest package lists from all configured sources for your packages. These sources can be found in the directory, '/etc/apt', in files with the ending '.list'.
You can remove packages that are no longer required on your computer by using the '''autoremove''' option,
Prompt$ '''sudo apt autoremove'''
You can use the '''search''' option to search for packages with a specific feature you might need,
Prompt$ '''sudo apt search''' <REGEX>
will search for packages matching the regular expression pattern.
Lastly, the command option '''show''' displays information about a package in the terminal.
Prompt$ '''sudo apt show''' <PACKAGE>

=== Installing Emacs package example ===
Here we demonstrate how the package for the text editor, emacs, can be installed with '''apt'''.

[[File:Apt update&&upgrade.png|none|frame|'''Figure 7.2 Updating configured sources and upgrading packages:''' The command '''sudo apt update && sudo apt upgrade''' will execute '''sudo apt update''' and '''sudo apt upgrade''' consecutively, updating all configured sources and installing available upgrades for installed packages. This is equivalent to waiting for '''sudo apt update''' to complete successfully and then executing '''sudo apt upgrade'''. The '''&&''' is an operator that means 'and': the second command is only executed if the first one succeeded.]]
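
The behaviour of '''&&''' can be tried out safely without installing anything. The sketch below is only an illustration: it uses the standard commands '''true''' and '''false''', which do nothing except succeed and fail respectively, as stand-ins for a real command such as '''sudo apt update'''. The variable names are made up for this example.

```shell
# 'true' always succeeds and 'false' always fails; they stand in for real commands here.
ran_after_success=no
ran_after_failure=no

# The command after && only runs if the command before it succeeded.
true && ran_after_success=yes
false && ran_after_failure=yes

echo "second command ran after a success: $ran_after_success"
echo "second command ran after a failure: $ran_after_failure"
```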

[[File:Apt install.png|none|frame|'''Figure 7.3 Installing a package:''' The command '''apt install emacs''' is used to install the package, emacs. In this case, we show what it would look like if emacs was already installed, as it's not possible to show the whole installation process anyway. Also the installation process might take a while, and you'll have to wait for its completion before you can do anything else. There's a way to avoid this which we'll learn in the next section when we introduce the concept of background processes.]]

[[File:Apt remove2.png|none|frame|'''Figure 7.4 Removing a package:''' The command '''sudo apt remove -y emacs''' will remove the package emacs. When a command option '''-y''' is used, you won't be prompted to type the additional 'Y' for the removal to proceed.]]

== Exercise 2: Installing and using Zip ==
1. Use the '''apt''' command to install '''zip'''. 
2. Use '''zip''' to unzip the file folder you compressed in exercise 1.

File:Rwx permissions2.png

2024-03-20T12:13:00Z

WikiSysop:

File:File permissions1.png

2024-03-20T12:12:44Z

WikiSysop:

File permissions

2024-03-20T12:12:23Z

WikiSysop: Created page with "__NOTOC__ File permissions (also known as file mode) are an important concept in the Unix (OS), and you might have already seen file permissions without knowing it. You can see them by listing files in long format, using the command '''ls -l''' in your working directory. It should look something like '''figure 5.1''', shown in MobaXterm. File:File permissions1.png|middle|frame|'''Figure 5.1:''' A screenshot of a command line interface after having used the command ls..."

__NOTOC__
File permissions (also known as file mode) are an important concept in the Unix (OS), and you might have already seen file permissions without knowing it. You can see them by listing files in long format, using the command '''ls -l''' in your working directory. It should look something like '''figure 5.1''', shown in MobaXterm.
[[File:File permissions1.png|middle|frame|'''Figure 5.1:''' A screenshot of a command line interface after having used the command ls -l.]]

You can make out what most of these columns mean. Starting from the right we have the name of the file/directory, the number of bytes it contains and so on. But what does the column with those weird '-', 'r','x' symbols on the far left mean? Well, this is what's called file permissions, and is what we'll be learning in this section.

File permissions are what they sound like. They are what permits or prohibits users from interacting with files. In the early days of computing, before computers became cheap and personal, they were expensive and universities would typically have one mainframe and a bunch of terminals from where users could access the mainframe. As users connected to the same mainframe, they could also access each other's files, so in order to protect user files, a file permission system was implemented. It is still very relevant today, as supercomputers use file permission systems to determine who should be allowed to access and work with the data stored within.

[[File:Rwx permissions2.png|right|frame|'''Figure 5.2''' Overview over what is used ]]

Let's go through the file permissions on my_program.sh and nothing_here in '''Figure 5.1'''. 
The first symbol, '-' for my_program.sh and 'd' for nothing_here, is the file type and indicates that these are a regular file and a directory respectively. The next 3 characters are '''rwx''' for both my_program.sh and nothing_here. 
In Unix there are 3 things that a user can do with a file: read ('''r'''), write ('''w''') and execute ('''x'''). So if a file has the file permission '''rwx''', then the user can read, write and execute the file. Now you might be wondering why there are multiple 'r's, 'w's and 'x's in a file permission. This is simply because we distinguish between 3 types of users: the file owner, group members and other users. When you've created a file, you'd typically also be the owner of the file, but if you've made the document collectively with a group of people, you might only be a group member. The last type of user, other users, could be any stranger having a look at your files. Each of these types of user has 3 file permissions, read ('''r'''), write ('''w''') and execute ('''x'''). So counting the character for file type as well, there will always be 10 characters in total in any file permission. A hyphen (this is what '''-''' is called) in one of the 9 permission positions means that the permission is lacking.

For example, the file permission for my_program.sh, '''-rwxr--r--''' translates to:
*The file is a regular file
*The file owner can read, write and execute the file
*Group members and other users can only read the file.
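
If you want to see these 10 characters split into their parts, something like the following sketch can be used. It works on a throwaway file in a temporary directory, so the file name and the variable names are only stand-ins for the example in the figure. The '''chmod 744''' line sets the permissions to '''rwxr--r--'''; the numeric form is explained later in this section.

```shell
# Work in a throwaway directory so that no real files are touched.
workdir=$(mktemp -d)
touch "$workdir/my_program.sh"

# Give the file the permissions from the example (744 is the numeric form of rwxr--r--).
chmod 744 "$workdir/my_program.sh"

# The first 10 characters of 'ls -l' are the file type plus the 9 permission characters.
mode=$(ls -l "$workdir/my_program.sh" | cut -c1-10)
file_type=$(printf '%s' "$mode" | cut -c1)
owner_perms=$(printf '%s' "$mode" | cut -c2-4)
group_perms=$(printf '%s' "$mode" | cut -c5-7)
other_perms=$(printf '%s' "$mode" | cut -c8-10)

echo "type: $file_type  owner: $owner_perms  group: $group_perms  others: $other_perms"

rm -rf "$workdir"
```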
== Working with file permissions ==
Now that we have a basic idea of what file permissions are let's do some work with them. The Unix commands we'll be learning in this section are shown here. If you're using Ubuntu WSL, file permission commands won't work for files located in /mnt/c/*. You have to either go to your home with '''cd ~''' or use MobaXterm.
{|class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''chmod''' [OPTION] [MODE] <FILE>
|Change mode.
|Changes file permissions on a file according to the mode given. No need to be confused about [MODE], as it is just another term used for file permission. It can be specified with letters or numbers. It's easier to understand what you're doing with letters but using numbers can be faster. In the example, we'll show how to use both.
|-
|'''chown''' [OPTION] [OWNER][:[GROUP]] <FILE>
|Change owner.
|Change file owner and/or group. This can be done separately or simultaneously by typing [OWNER]:[GROUP].
|-
|'''chgrp''' [OPTION] [GROUP] <FILE>
|Change group.
|Change group of a file.
|}

== Changing File permissions with chmod ==

=== Using letters ===
In the [MODE] option of '''chmod''', the letters '''u''', '''g''' and '''o''' are used for user (file owner), group owner and other users respectively. To assign file permissions to users, the symbols '''+''' and '''-''' are used. So let's say I wanted to make changes to the file permissions of pictures_of_spiderman.jpg in '''figure 5.1'''.
Prompt$ '''chmod g+w''' pictures_of_spiderman.jpg
Assigns file permission, '''w''', to group owners. This makes the file writable for group owners.
Prompt$ '''chmod o+rwx''' pictures_of_spiderman.jpg
Assigns the file permissions, '''rwx''', to other users. This makes the file readable, writable and executable for other users.
Prompt$ '''chmod u-rx''' pictures_of_spiderman.jpg
Removes the file permissions, '''rx''', from the file owner, making it only writable for the file owner.
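
The three commands can be followed step by step on a throwaway file. This is a sketch under assumptions: the file is an empty stand-in created in a temporary directory, the helper function show_mode is made up for this example, and the starting permissions '''rw-r--r--''' are chosen by us (the '''=''' symbol sets permissions exactly instead of adding or removing them).

```shell
workdir=$(mktemp -d)
picture="$workdir/pictures_of_spiderman.jpg"
touch "$picture"

# Hypothetical helper: print the 10 permission characters of a file.
show_mode() { ls -l "$1" | cut -c1-10; }

# Known starting point: rw- for the owner, r-- for group and others.
chmod u=rw,go=r "$picture"
start=$(show_mode "$picture")

chmod g+w "$picture"      # add write permission for the group
after_g=$(show_mode "$picture")

chmod o+rwx "$picture"    # add read, write and execute for other users
after_o=$(show_mode "$picture")

chmod u-rx "$picture"     # remove read and execute from the file owner
after_u=$(show_mode "$picture")

echo "$start -> $after_g -> $after_o -> $after_u"

rm -rf "$workdir"
```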
=== Using numbers ===
If you want to assign multiple file permissions to multiple users with one command you can use numbers instead. To understand how, you need to think of file permissions as a set of binary numbers (one set for each type of user), which is how your computer interprets them. 
'''rwx rwx rwx 111 111 111'''
'''--- --- --- 000 000 000'''
'''rw- -wx r-x 110 011 101'''
'''r-- -w- --x 100 010 001'''

In the last section we learned that the bit values (reading from right to left) are 1, 2 and 4. We can add these values together to get the number corresponding to the file permission for each type of user.
'''rwx rwx rwx 111 111 111''' --> '''4+2+1 4+2+1 4+2+1''' --> '''7 7 7'''
'''--- --- --- 000 000 000''' --> '''0+0+0 0+0+0 0+0+0''' --> '''0 0 0'''
'''rw- -wx r-x 110 011 101''' --> '''4+2+0 0+2+1 4+0+1''' --> '''6 3 5'''
'''r-- -w- --x 100 010 001''' --> '''4+0+0 0+2+0 0+0+1''' --> '''4 2 1'''
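
The addition above can also be written as a small shell function. This is only a sketch to illustrate the arithmetic: the function name perm_to_octal is made up, and it expects the 9 permission characters without the file type and without spaces.

```shell
# Hypothetical helper: turn 9 permission characters (e.g. rw--wxr-x) into 3 digits.
perm_to_octal() {
    perms=$1
    result=""
    while [ -n "$perms" ]; do
        # Take the next group of three characters (owner, then group, then others).
        triplet=$(printf '%s' "$perms" | cut -c1-3)
        perms=$(printf '%s' "$perms" | cut -c4-)
        digit=0
        case $triplet in r??) digit=$((digit + 4)) ;; esac   # read    = 4
        case $triplet in ?w?) digit=$((digit + 2)) ;; esac   # write   = 2
        case $triplet in ??x) digit=$((digit + 1)) ;; esac   # execute = 1
        result="$result$digit"
    done
    printf '%s\n' "$result"
}

perm_to_octal rwxrwxrwx
perm_to_octal rw--wxr-x
perm_to_octal r---w---x
```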

Let's say I wanted to make pictures_of_spiderman.jpg available for reading, writing and execution to everyone: 

Prompt$ '''chmod 777''' pictures_of_spiderman.jpg

There's no need to use '''u''', '''g''' and '''o''', as the first spot corresponds to '''u''', the second spot corresponds to '''g''' and the third spot corresponds to '''o'''. If you were to do this with letters, one type of user at a time, you would need three commands: 
Prompt$ '''chmod u+rwx''' pictures_of_spiderman.jpg 
Prompt$ '''chmod g+rwx''' pictures_of_spiderman.jpg 
Prompt$ '''chmod o+rwx''' pictures_of_spiderman.jpg 

When you're using numbers, it's important to know that you're always changing the file permissions of all types of users. If you give fewer than three digits, the missing digits are assumed to be zeros at the front, which removes all file permissions for the corresponding type of user. For example: 

Prompt$ '''chmod 75''' pictures_of_spiderman.jpg 

This is read as 075, so it will remove '''rwx''' permission for the file owner, give '''rwx''' file permission to group users and '''r-x''' file permission to other users.
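
You can check this behaviour on a throwaway file. The sketch below assumes nothing beyond the commands already introduced; the file name demo.txt and the variable names are made up.

```shell
workdir=$(mktemp -d)
touch "$workdir/demo.txt"

# Two digits only: the missing digit counts as a zero at the front.
chmod 75 "$workdir/demo.txt"
two_digits=$(ls -l "$workdir/demo.txt" | cut -c1-10)

# The same mode written out in full.
chmod 075 "$workdir/demo.txt"
three_digits=$(ls -l "$workdir/demo.txt" | cut -c1-10)

echo "chmod 75  gives $two_digits"
echo "chmod 075 gives $three_digits"

rm -rf "$workdir"
```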

A useful command line option for '''chmod''' is '''-R''', which allows you to change the file permissions on a directory and all files within it.
Prompt$ '''chmod''' -R 777 <DIRECTORY>
gives all file permissions to all users.

== Changing file ownership with chgrp and chown ==
The command '''chgrp''' is used to change the group ownership of a file and '''chown''' can be used to change both the file owner and group. As mentioned, every file or directory can be accessed by 3 types of users: the file owner, group users and other users. We've just learned how to change file permissions for the 3 different user types using '''chmod''', but we haven't learned how to change owners and groups.
You can change the file owner of a file,
Prompt$ '''chown''' [NewFileOwner] <file>
which will replace the original owner with [NewFileOwner] if it's a valid owner. In order to actually test '''chown''' and '''chgrp''' on your computer you need to actually have different users and groups. As this is probably not something that you have there will be no exercises in using the commands '''chown''' and '''chgrp'''. You could try to create more file owners and group owners on your computer, but that's entirely up to you.
You can change the group owner of a file,
Prompt$ '''chown''' :[NewGroup] <FILE>
which will change the group owner to [NewGroup] if it's a valid group. You can also change group and file owner at the same time,
Prompt$ '''chown''' [NewFileOwner]:[NewGroup] <FILE>
will change the file owner and group of the <FILE> if [NewFileOwner] and [NewGroup] are valid.

The command '''chgrp''' can also be used to change group owners, even though '''chown''' can be used for that as well. At first, '''chown''' could only be used to change file owner or both file owner and group, hence, '''chgrp''' was needed if you only wanted to change the group owner. This, however, is not the case anymore. The best reason for still using '''chgrp''' instead of '''chown''' is that it lowers the risk of accidentally changing the file owner of a file, which could be problematic. You can change the group of a file by typing,
Prompt$ '''chgrp''' [NewGroup] <FILE>
will change the group of the <FILE> if [NewGroup] is valid.

In a similar manner to '''chmod''', the command line option '''-R''' can be used to change the ownership of all files within a directory.
Prompt$ '''chown''' -R [NewFileOwner] <DIRECTORY>
will change the owner of all files in <DIRECTORY> to [NewFileOwner].
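
Even without extra users and groups you can at least try out the syntax, because you are allowed to "change" a file to the owner and group it already has. The sketch below does exactly that on a throwaway file. It uses the numeric IDs from '''id -u''' and '''id -g''' instead of names, and '''ls -ln''' together with '''awk''' to read the IDs back; these choices, like the file and variable names, are assumptions made for this example.

```shell
workdir=$(mktemp -d)
touch "$workdir/demo.txt"

my_uid=$(id -u)   # numeric ID of the current user
my_gid=$(id -g)   # numeric ID of the current user's group

# Each form is shown once; all of them keep the file with its current owner.
chown "$my_uid" "$workdir/demo.txt"             # owner only
chgrp "$my_gid" "$workdir/demo.txt"             # group only
chown "$my_uid:$my_gid" "$workdir/demo.txt"     # owner and group together

# 'ls -ln' prints the numeric owner and group IDs in columns 3 and 4.
owner_id=$(ls -ln "$workdir/demo.txt" | awk '{print $3}')
group_id=$(ls -ln "$workdir/demo.txt" | awk '{print $4}')
echo "owner ID: $owner_id  group ID: $group_id"

rm -rf "$workdir"
```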

== Exercise 1: Changing file permission with chmod ==
# Create a file called workfile.txt.
# Change the file permission of the file to ---xrw-rwx.
# Change the file permission of the file to -rwx---r-x.
# Remove all file permission from the file, so that the file permission is ----------.
# Make a shell script file that upon execution will give your workfile the file permission -rwx--x--x, then -r-x-w---- and lastly -rwxrwxrw-.

Filtering and regular expressions

2024-03-20T12:11:52Z

WikiSysop: /* Exercise 2: Translating ASCII characters to binary and decimal values */

__NOTOC__
You can think of regular expressions (regex for short) as a pattern language that can be used to match patterns and filter data. In this section, we'll be learning so-called ''filter commands'', some of which utilize regular expressions in order to find and replace patterns. It is therefore important to have a basic understanding of regular expressions. To get you started, follow this link to an introductory video, [https://www.youtube.com/watch?v=KJG1dETacLI Basic Regular Expression Introduction Video], which gives a basic understanding of the concepts of regular expressions.

Next, you can follow this link [https://regexone.com/lesson/introduction_abcs Regex Introduction Exercises], where there are some exercises on basic regex. The 'Practice problems' are a bit more complex, but should still be doable. They are, however, optional.

[https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/pdf/ Regex cheat sheet]

== Introduction to commands ==
Now that you've been initiated in regular expressions, we'll take a look at some Unix commands that can use regular expressions. Underneath we list commands and syntax for the commands that we'll be using in this section.
{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''grep''' [OPTION] <PATTERN> <FILE>
|Global regular expression print.
|Uses regular expressions to select lines in a file that match the pattern.
|-
|'''sed''' [OPTION] <SCRIPT> <FILE>
|stream editor
|Allows the user to edit files using regular expressions, without actually opening the files.
|-
|'''tr''' [OPTION] <SET1> <SET2>
|Translate
|Translates characters from the standard input and writes to the standard output.
|-
|'''sort''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Sorts the content of a file.
|}

Datafile 1: [https://teaching.healthtech.dtu.dk/material/unix/Pseudomonas_Aeruginosa_16SrRNA.gb Pseudomonas Aeruginosa 16S rRNA Genebank file] 
Datafile 2: [https://teaching.healthtech.dtu.dk/material/unix/Genebankfiles.gb Genebank files] 
Datafile 3: [https://teaching.healthtech.dtu.dk/material/unix/ASCII_chars.dat ASCII character file] 
Datafile 4: [https://teaching.healthtech.dtu.dk/material/unix/Ex1_7bit_binarydata.dat Binary data file] 

=== grep ===
The '''grep''' command uses regular expressions as search patterns to find matching lines in files and outputs them to stdout. It has the syntax,
Prompt$ '''grep''' [OPTION] <PATTERN> <FILE>
In '''figure 5.1''', the '''grep''' command is used to capture the line containing the authors of a text, which is then redirected to a text file.
[[File:Grep example.png|none|frame|'''Figure 5.1 Using the grep command:''' Here, '''grep''' is used to capture the line containing the authors of the file and save it to <AUTHORFILE.txt>.]]

But before you start using this section's commands you should know that '''grep''' uses basic regex and not extended regex by default. For example, if you wanted to search for occurrences of 'AUTHORS' or 'authors'
Prompt$ '''grep''' 'AUTHORS|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
there would be no results, as '''grep''' doesn't interpret '|' as a special character in basic regex. There are 3 solutions to this problem.
Prompt$ '''grep''' 'AUTHORS\|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
Prompt$ '''egrep''' 'AUTHORS|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
Prompt$ '''grep -E''' 'AUTHORS|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
In the first solution, we use '\' to designate that '|' is to be interpreted as a special character. The second and third solutions are similar, as they both make '''grep''' interpret the pattern as extended regex. Of the other commands in this section, only '''sed''' uses regular expressions, and it also uses basic regex by default, so there you have to use '''\''' in the same way (many versions of '''sed''' also accept '''-E'''). The commands '''tr''' and '''sort''' don't take regular expressions at all. Alternatively, in this instance where we're interested in the occurrence of uppercase and lowercase versions of a string, '''grep''' actually has a command line option for this.
Prompt$ '''grep -i''' 'authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
would capture both 'authors' and 'AUTHORS'. It also has the added effect of capturing variants like 'Authors', 'aUTHORS' etc.
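
To see the difference between these variants without downloading the data file, you can count the matching lines in a small made-up file. The file contents, file name and variable names below are invented for this sketch; the option '''-c''' makes '''grep''' print the number of matching lines instead of the lines themselves.

```shell
workdir=$(mktemp -d)
sample="$workdir/sample.gb"

# An invented four-line file, loosely shaped like a GenBank record.
printf '%s\n' \
    'AUTHORS   Smith,J. and Jones,K.' \
    'TITLE     An invented title' \
    'authors   a lowercase line' \
    'Authors   a mixed-case line' > "$sample"

# Basic regex: '|' is an ordinary character, so nothing matches.
# ('|| true' only hides the "no match" exit status of grep.)
basic_count=$(grep -c 'AUTHORS|authors' "$sample" || true)

# Extended regex: '|' means "or".
extended_count=$(grep -c -E 'AUTHORS|authors' "$sample")

# Ignore case: also matches 'Authors'.
ignore_case_count=$(grep -c -i 'authors' "$sample")

echo "basic: $basic_count  extended: $extended_count  ignore case: $ignore_case_count"

rm -rf "$workdir"
```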

=== sed ===
The command '''sed''' stands for 'stream editor', and is typically used to substitute or delete patterns in files. 
Prompt$ '''sed''' 's/good/better/' <FILE>
substitutes the first occurrence of 'good' in each line with 'better' in <FILE>. The 's' stands for 'substitute'. You can also instead type,
Prompt$ '''sed''' 's/nice/epic/g' <FILE>
which substitutes all occurrences of 'nice' with 'epic' in each line. The 'g' stands for global replacement.
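
The difference between the two forms is easiest to see on a single line of text that contains the pattern twice. In this small sketch the text and the variable names are made up, and the line is passed to '''sed''' through a pipe instead of a file.

```shell
line='good food, good mood'

# Without 'g': only the first 'good' on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/good/better/')

# With 'g': every 'good' on the line is replaced.
every_match=$(printf '%s\n' "$line" | sed 's/good/better/g')

echo "$first_only"
echo "$every_match"
```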

In the above cases, the changes that are made to <FILE> aren't saved, as the result is simply written to stdout, which is the terminal. The changes can be saved by using the command option '''i''' or by using some redirectional operators, as shown in '''figure 5.2'''.

[[File:Sed example.png|frame|left|'''Figure 5.2 Using the sed command:''' The '''sed''' command is used to substitute occurrences of 'good' with 'better' and 'better' with 'the best'. The command option, -i, allows you to edit the file in place so that changes are saved. Otherwise, the changes are simply written to the terminal. You might be thinking that you could instead write '''sed''' 's/word1/word2/' sed_example.txt > sed_example.txt but it won't work. The shell processes redirection operators before running the command, so '> sed_example.txt' is handled first and a new empty sed_example.txt is created. This effectively overwrites the original file and '''sed''' ends up processing an empty file. This sort of thing, where empty files are created, actually poses a problem for supercomputers with giant disk systems as it slows the server down. This can happen when running automated processes with many intermediate files, where one failed subprocess results in an empty file, causing other subprocesses to produce a multitude of empty files. Therefore, it's good practice to give intermediate files file extensions that make them easy to locate and delete if something goes wrong. The last example is just another way of doing '''sed -i''', which is shown because '''-i''' doesn't behave the same in all versions of '''sed''' (on Mac OS, for example, '''-i''' must be followed by a backup suffix, which may be empty).]]
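
One common way of saving the changes without '''-i''' is to write the result to a temporary file and then move it over the original. The sketch below shows both the pitfall and this workaround on throwaway files; the file names, the '.tmp' extension and the variable names are our own choices and not necessarily what the figure uses.

```shell
workdir=$(mktemp -d)
example="$workdir/sed_example.txt"
clobbered="$workdir/clobbered.txt"
printf 'this is good\n' > "$example"
printf 'this is good\n' > "$clobbered"

# Pitfall: the shell empties the file before sed gets to read it.
sed 's/good/better/' "$clobbered" > "$clobbered"
if [ -s "$clobbered" ]; then clobbered_state="has content"; else clobbered_state="empty"; fi

# Workaround: write to a temporary file, then replace the original if sed succeeded.
sed 's/good/better/' "$example" > "$example.tmp" && mv "$example.tmp" "$example"
saved=$(cat "$example")

echo "redirecting to the same file leaves it: $clobbered_state"
echo "temporary file approach leaves: $saved"

rm -rf "$workdir"
```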

You don't have to use '''/''' as the separator that separates pattern from substitution. The '''sed''' command just uses whatever character follows the '''s''' as a separator, and '''/''' just happens to be the most commonly used. You could instead write,

Prompt$ '''sed''' 's|good|better|g' <FILE>
which would work perfectly fine. 

It is also possible to specify which lines you would like to have replaced in a file.
Prompt$ '''sed''' '666s/nice/epic/g' <FILE>
substitutes all occurrences of 'nice' with 'epic' in line 666. Note that the line number is written inside the quotes, directly in front of the 's'.
Prompt$ '''sed''' '55,$s/nice/epic/g' <FILE>
substitutes all occurrences of 'nice' with 'epic' from line 55 to the last line of <FILE>. The last line of <FILE> is indicated by the symbol, '''$'''. 
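
Here is a small sketch of line addressing that you can run as it is. It uses three invented lines of text instead of a file, and the helper function three_lines is made up for this example.

```shell
# Hypothetical helper that prints three identical lines.
three_lines() { printf 'nice\nnice\nnice\n'; }

# Substitute on line 2 only.
line_two=$(three_lines | sed '2s/nice/epic/')

# Substitute from line 2 to the last line ($).
from_two=$(three_lines | sed '2,$s/nice/epic/')

echo "$line_two"
echo "---"
echo "$from_two"
```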

The command '''sed''' can be used to delete whole lines.
Prompt$ '''sed''' '2d' <FILE>
will delete the second line of <FILE>. The 'd' stands for delete.
Prompt$ '''sed''' '2,4d' <FILE>
will delete the second to fourth lines of <FILE>.
You can also search for a pattern and delete the lines in which the pattern occurs.
Prompt$ '''sed''' '/nope/d' <FILE>
will delete any line with the pattern 'nope' in it.
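
The three ways of deleting lines can be compared on a few invented lines of text. In this sketch the helper function sample_lines and the variable names are made up.

```shell
# Hypothetical helper that prints four lines of sample text.
sample_lines() { printf 'one\ntwo\nnope three\nfour\n'; }

without_second=$(sample_lines | sed '2d')        # delete line 2
without_2_to_3=$(sample_lines | sed '2,3d')      # delete lines 2 to 3
without_nope=$(sample_lines | sed '/nope/d')     # delete lines containing 'nope'

echo "$without_second"
echo "---"
echo "$without_2_to_3"
echo "---"
echo "$without_nope"
```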

Lastly, it is also possible to do multiple substitutions using the syntax,
Prompt$ '''sed''' 's/good/better/g ; s/nice/epic/g' <FILE>
which will replace all instances of 'good' and 'nice' with 'better' and 'epic' respectively.

=== tr ===
The command '''tr''' stands for translate and does exactly this, however, it can only be used to translate one character at a time. It doesn't support regex, but some of the syntax is similar.
A common way of using it, is to translate lowercase characters to uppercase characters.

Prompt$ '''tr''' '[a-z]' '[A-Z]' '''<''' <FILE>
will translate occurrences of lowercase characters to uppercase characters in <FILE>.
Prompt$ echo "Tabs for spaces please" | '''tr''' '[:space:]' '\t'
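will translate occurrences of spaces to tabs. Note that '[:space:]' matches all whitespace characters and not only spaces, so the newline that '''echo''' adds at the end is translated to a tab as well.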
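
A few more translations in the same style are sketched below, similar to the ones in the figure. The example texts and variable names are made up. In the last command the second set is shorter than the first; the common versions of '''tr''' (GNU and BSD) then reuse its last character, so every digit becomes '*'.

```shell
# Lowercase to uppercase.
upper=$(echo "hello unix" | tr 'a-z' 'A-Z')

# Spaces to underscores (only the space character, so the newline is untouched).
underscored=$(echo "tabs for spaces" | tr ' ' '_')

# Every digit to '*'.
masked=$(echo "room 101" | tr '0-9' '*')

echo "$upper"
echo "$underscored"
echo "$masked"
```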

[[File:Tr example.png|frame|none|'''Figure 5.3 Using the tr command:''' In the first line, the contents of tr_example.txt is displayed using '''cat''', and in the second line, lowercase characters in tr_example.txt are translated to uppercase characters. In the third line spaces are translated to '_', and in fourth line digits are translated to '*'.]]

=== Sort ===
The '''sort''' command is used to sort the lines in files and output them to stdout in a particular order. By default, without any options given, it will sort according to what's called the ASCII (American Standard Code for Information Interchange) table. In the ASCII table, characters like 'a', 'y', 'n', '4', '6' have certain values which can be given in binary, octal, decimal and hexadecimal. It is based upon these values that '''sort''' orders the lines in files. (On some systems the language settings make '''sort''' use a dictionary-like order instead; running it as '''LC_ALL=C sort''' gives the ASCII order described here.) Because '''sort''' sorts according to values in the ASCII table, it has the following features:

* Lines starting with numbers appear before lines starting with letters
* Lines starting with letters will appear in alphabetical order
* Lines starting with uppercase letters appear before lines starting with lowercase letters

These sorting rules are illustrated in '''figure 5.4''', where the '''sort''' command is used on the file, sort1_testfile.txt.

[[File:Sort1 example.png|none|frame|'''Figure 5.4 Using the sort command:''' The lines in the sort1_testfile.txt are sorted according to the values in the ASCII table. Characters with lowest value in ASCII table will appear first, for example, as '''!''' has the lowest value it appears first]]
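
The ordering rules can be reproduced with a handful of invented lines. This is a sketch under one assumption: '''LC_ALL=C''' is put in front of '''sort''' to make sure that the plain ASCII order is used regardless of the language settings of your system.

```shell
# Five invented lines starting with a symbol, a digit, uppercase and lowercase letters.
sorted=$(printf 'banana\nApple\n42\n!bang\napple\n' | LC_ALL=C sort)

echo "$sorted"
```

The lines come out in the order '!bang', '42', 'Apple', 'apple', 'banana', since '!' has the decimal value 33, '4' has 52, 'A' has 65, 'a' has 97 and 'b' has 98 in the ASCII table.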

You can sort files in reverse ordering by using the '''r''' command option.
Prompt$ '''sort -r''' <FILE>
will sort the file in reverse order and output to stdout.

When dealing with numerical data, you can use the '''n''' command option.
Prompt$ '''sort -n''' <FILE>
will sort the file numerically and output to stdout. This can be combined with the '''r''' command option.
Prompt$ '''sort -nr''' <FILE>
will sort the file numerically in the reverse order and output to stdout.

You can check whether a file has already been sorted by using the '''c''' command option.
[[File:sort2 example.png|frame|none|'''Figure 5.5 Using the sort command:''' If a file isn't sorted, a message will appear that notifies the user of a disorder in the file. If nothing appears then the file is already sorted]]

If you want to sort a file while also removing duplicates you can use the '''u''' command option.
Prompt$ '''sort -u''' <FILE>
will sort the file and remove any duplicates.
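
The options above can be compared side by side on a few invented numbers. In this sketch the helper function numbers and the variable names are made up, and '''LC_ALL=C''' is only used to force plain ASCII ordering for the first comparison.

```shell
# Hypothetical helper that prints four unsorted numbers, one of them twice.
numbers() { printf '10\n9\n100\n9\n'; }

as_text=$(numbers | LC_ALL=C sort)        # character by character: 10 100 9 9
numeric=$(numbers | sort -n)              # by numeric value:       9 9 10 100
numeric_reverse=$(numbers | sort -nr)     # numeric, reversed:      100 10 9 9
unique=$(numbers | sort -nu)              # numeric, no duplicates: 9 10 100

# 'sort -c' prints a message and fails if the input is not sorted.
if numbers | sort -n -c 2>/dev/null; then order_check=sorted; else order_check=unsorted; fi

echo "$numeric"
echo "input is $order_check"
```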

Lastly, you can sort lines in a file according to the values of one column with the '''k''' command option. For instance, if you wanted to sort according to column 4, you would type '''sort -k4''' <FILE> in the command line. In figure 5.6, we show how one can sort numerically and according to a column.

[[File:sort3 example.png|frame|none|'''Figure 5.6 Using the sort command:''' Here, sort2_testfile.txt is sorted numerically and according to column 2 by combining command options '''r''' and '''k2'''.]]
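
Sorting by a column can also be tried on an invented two-column table. The helper function table, the contents of the table and the variable names are made up for this sketch and are not the contents of sort2_testfile.txt.

```shell
# Hypothetical helper that prints a small table with a name and a number per line.
table() { printf 'geneA 10\ngeneB 9\ngeneC 100\n'; }

# Sort numerically according to column 2.
by_column_2=$(table | sort -n -k2)

# The same, but in reverse order.
by_column_2_reverse=$(table | sort -nr -k2)

echo "$by_column_2"
echo "---"
echo "$by_column_2_reverse"
```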

=== ASCII table and numeral systems ===
To understand how '''sort''' works, we need to clarify what is meant by binary, octal, decimal, hexadecimal systems and finally how this relates to the ASCII table.

You're already familiar with the decimal system, as it's the system most commonly used for math and anything to do with numbers. As you know, it consists of 10 unique characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. The amount of unique characters in a numeral system is called its base or radix. Here's a video that gives a quick explanation of base systems, and how a binary system is different from a decimal system. 
[https://www.youtube.com/watch?v=LpuPe81bc2w Base systems and binary] 
When a number exceeds what you can write with these 10 characters, you simply add another slot. For instance with the number, 16, you've added the '1' to the second slot and the second slot represents 10's. The reason why the decimal system is so widely used today is most likely because we humans tend to use our fingers to count, and since we have 10 fingers, the decimal system was the most logical choice.

Binary (2), octal (8) and hexadecimal (16) base systems are simply systems that have different bases. The binary value system consists of 2 characters, 0 and 1, and this is the system that all computers use. In computing, the 0 often corresponds to a unit being turned off, and 1 corresponds to a unit being turned on. Because the binary system only consists of 2 characters, you have to change slots more often than you would in the decimal system. In the binary system, these slots are in fact what's called ''bits'', something you might have heard about without actually knowing what it meant. The number of bits can vary, for instance, you might've heard about operating systems being 32 bit or 64 bit.

Let's learn by example by translating decimal values to binary values. The ASCII table uses 7 bits, which we'll use as well. The values of the bits in a binary system are: 64(7), 32(6), 16(5), 8(4), 4(3), 2(2), 1(1), where the number in parentheses is the position of the bit counted from the right. To clarify, these bit values correspond to the slot values 1,000,000(7), 100,000(6), 10,000(5), 1000(4), 100(3), 10(2), 1(1) in the decimal value system. 

In the table underneath there are 4 example ASCII characters with their corresponding decimal values and binary combinations. When the bit values are summed, every binary combination results in a unique decimal value.
{| class="wikitable"
|-
!Binary value (7 bits): 64 32 16 8 4 2 1
!Decimal value
!Character in ASCII table
|-
|0 1 0 0 0 0 1
|0+32+0+0+0+0+1=33
|!
|-
|1 0 0 0 0 0 1
|64+0+0+0+0+0+1=65
|A
|-
|0 1 1 0 1 1 0
|0+32+16+0+4+2+0=54
|6
|-
|1 1 1 1 1 1 1
|64+32+16+8+4+2+1=127
|Del
|}
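
The summation in the table can be sketched as a small shell function. The function name bin_to_dec is made up for this example. Instead of adding the bit values one by one, it reads the bits from left to right and doubles the running value for every new bit, which gives the same result.

```shell
# Hypothetical helper: translate a string of 0s and 1s to its decimal value.
bin_to_dec() {
    bits=$1
    value=0
    while [ -n "$bits" ]; do
        first=${bits%"${bits#?}"}        # the leftmost remaining bit
        bits=${bits#?}                   # drop that bit from the string
        value=$((value * 2 + first))     # shift what we have and add the new bit
    done
    printf '%s\n' "$value"
}

bin_to_dec 1000001    # the character A
bin_to_dec 0110110    # the character 6
bin_to_dec 1111111    # the character Del
```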

The ASCII table has 128 characters, a limit that is set by it having 7 bits, which amounts to 128 combinations (the decimal values 0 to 127). It includes the characters A-Z, a-z, 0-9 and other characters which you can see by using the '''man''' command,
Prompt$ '''man''' ascii
This will show you a table of all 128 characters, the binary values however, are not shown. Instead only the octal, decimal and hexadecimal values are shown.

The octal system consists of 8 unique characters: 0, 1, 2, 3, 4, 5, 6, 7. Therefore, a value exceeding this, for instance 8 in decimal value, would be translated to 010 in the octal system. The octal system is used in computer software to simplify binary input, but interestingly, it has also been used by the Yuki people, an indigenous American people who used the spaces between their fingers to count.

The hexadecimal system consists of 16 unique characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. As the hexadecimal system uses 6 more unique characters than the decimal system, changing slots happens less frequently. For instance, a decimal value of 12 would translate to 0C in the hexadecimal system.
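
The shell can do these conversions for you, which is handy for checking your own calculations. The sketch below uses the '''printf''' command, where '%d', '%o' and '%X' ask for decimal, octal and hexadecimal output, and shell arithmetic, where a leading 0 marks an octal number and a leading 0x marks a hexadecimal number. The variable names are made up.

```shell
# A character written after a single quote is turned into its value in the ASCII table.
code_of_A=$(printf '%d' "'A")

# Decimal 8 written in octal, padded to 3 characters.
eight_in_octal=$(printf '%03o' 8)

# Decimal 12 written in hexadecimal, padded to 2 characters.
twelve_in_hex=$(printf '%02X' 12)

# And back again to decimal.
from_octal=$((010))
from_hex=$((0x0C))

echo "A = $code_of_A, 8 = $eight_in_octal (octal), 12 = $twelve_in_hex (hexadecimal)"
echo "octal 010 = $from_octal, hexadecimal 0C = $from_hex"
```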

The ASCII table only covers 128 characters, but many more characters than this exist. Therefore, another system with more bits called ''Unicode'' is often used. If you're interested, here's a five minute introductory video explaining ASCII and Unicode [https://www.youtube.com/watch?v=5aJKKgSEUnY Introductory video on ASCII and Unicode].

== Exercise 1: Extracting and sorting data from a Gene Bank files ==
You've been given the task to extract and sort data from some genebank files, [https://teaching.healthtech.dtu.dk/material/unix/Genebankfiles.gb Genebank files]. More specifically, you need to extract the authors, accession numbers and the names of the organisms and save them in 3 different files.

1. Extract the lines with authors from Genebankfiles.gb, sort them and save the output to one file. 
2. Extract the lines with accession numbers from Genebankfiles.gb, sort them and save the output to a second file. 
3. Extract the lines with organisms from Genebankfiles.gb, sort them and save the output to a third file. 
4. As you don't know when you're going to need to do this again, you want to write a shell script that does the functions of questions 1-3. Make a simple shell script that appends authors, accession numbers and organisms to the files you made in questions 1-3.

== Exercise 2: Translating ASCII characters to binary and decimal values ==
In this exercise, you'll be working with [https://teaching.healthtech.dtu.dk/material/unix/ASCII_chars.dat ASCII character file], which contains ASCII characters, and [https://teaching.healthtech.dtu.dk/material/unix/Binary.dat Binary data file], which contains the corresponding binary data. The ASCII characters and their corresponding decimal values are listed below: 
'''{''' --> '''125''' '''a''' --> '''97''' '''p''' --> '''112''' '''X'''--> '''88''' '''+''' --> '''43''' '''/''' --> '''47''' '''$''' --> '''36''' 

1. Translate all the ASCII characters to decimal values and save the output to Decimals.dat (See '''Hint 1'''). 
2. Merge ASCII_chars.dat, Binary.dat and Decimals.dat so that column 1; ASCII chars, column 2; Binary data and column 3; Decimal values. Save the output to Merge.dat and then delete Decimals.dat. 
3. Sort Merge.dat based on the decimal values. 
'''Hint 1:''' This is a tedious problem, as there are a lot of ASCII characters that need to be translated. The best way to do this with your current skill level is to use '''sed''' 's/blah1/blah2/g ; s/blah3/blah4/g ; ... s/blah98/blah99/g' ASCII_chars.dat '''>''' Decimals.dat. Also, remember to use '''\''' for special characters like '''$'''.

Filtering and regular expressions

2024-03-20T12:11:29Z

WikiSysop: /* Exercise 1: Extracting and sorting data from a Gene Bank files */

__NOTOC__
You can think of regular expressions (regex for short) as a pattern language that can be used to match patterns and filter data. In this section, we'll be learning so-called ''filter commands'', some of which utilize regular expressions in order to find and replace patterns. It is therefore important to have a basic understanding of regular expressions. To get you started, follow this link to an introductory video, [https://www.youtube.com/watch?v=KJG1dETacLI Basic Regular Expression Introduction Video], which gives a basic understanding of the concepts of regular expressions.

Next, you can follow this link [https://regexone.com/lesson/introduction_abcs Regex Introduction Exercises], where there are some exercises on basic regex. The 'Practice problems' are a bit more complex, but should still be doable. They are, however, optional.

[https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/pdf/ Regex cheat sheet]

== Introduction to commands ==
Now that you've been initiated in regular expressions, we'll take a look at some Unix commands that can use regular expressions. Underneath we list commands and syntax for the commands that we'll be using in this section.
{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''grep''' [OPTION] <PATTERN> <FILE>
|Global regular expression print.
|Uses regular expressions to select the lines in a file that match the pattern.
|-
|'''sed''' [OPTION] <SCRIPT> <FILE>
|Stream editor
|Lets you edit the contents of files using regular expressions, without opening the files in an editor.
|-
|'''tr''' [OPTION] <SET1> <SET2>
|Translate
|Translates characters from the standard input and writes to the standard output.
|-
|'''sort''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Sorts the content of a file.
|}

Datafile 1: [https://teaching.healthtech.dtu.dk/material/unix/Pseudomonas_Aeruginosa_16SrRNA.gb Pseudomonas Aeruginosa 16S rRNA Genebank file] 
Datafile 2: [https://teaching.healthtech.dtu.dk/material/unix/Genebankfiles.gb Genebank files] 
Datafile 3: [https://teaching.healthtech.dtu.dk/material/unix/ASCII_chars.dat ASCII character file] 
Datafile 4: [https://teaching.healthtech.dtu.dk/material/unix/Ex1_7bit_binarydata.dat Binary data file] 

=== grep ===
The '''grep''' command uses regular expressions as search patterns, finds the lines in files that match them and outputs those lines to stdout. It has the syntax,
Prompt$ '''grep''' [OPTION] <PATTERN> <FILE>
In '''figure 5.1''', the '''grep''' command is used to capture the line containing the authors of a text, which is then redirected to a text file.
[[File:Grep example.png|none|frame|'''Figure 5.1 Using the grep command:''' Here, '''grep''' is used to capture the line containing the authors of the file and save it to <AUTHORFILE.txt>.]]

Before you start using this section's commands, you should know that '''grep''' uses basic regex and not extended regex by default. For example, if you wanted to search for occurrences of 'AUTHORS' or 'authors' with
Prompt$ '''grep''' 'AUTHORS|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
there would most likely be no results, because in basic regex '|' is an ordinary character, so '''grep''' searches for the literal text 'AUTHORS|authors'. There are 3 solutions to this problem.
Prompt$ '''grep''' 'AUTHORS\|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
Prompt$ '''egrep''' 'AUTHORS|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
Prompt$ '''grep -E''' 'AUTHORS|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
In the first solution, we use '\' to tell '''grep''' that '|' is to be interpreted as a special character (this works in GNU '''grep'''). The second and third solutions are equivalent, as both make '''grep''' interpret the pattern as extended regex. The command '''egrep''' is an older name for '''grep -E''', and recent versions of GNU '''grep''' print a warning when it is used. Most current versions of '''sed''' also accept '''-E'''; without it you have to use '''\''' in '''sed''' as well. The commands '''tr''' and '''sort''' don't take regular expressions at all. Alternatively, in this instance where we're interested in the occurrence of uppercase and lowercase versions of a string, '''grep''' has a command line option for this.
Prompt$ '''grep -i''' 'authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
would capture both 'authors' and 'AUTHORS'. It also captures mixed-case variants such as 'Authors' and 'aUTHORS'.
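
To see the difference between basic and extended regex for yourself, here is a minimal sketch. The sample lines and the temporary file are made up for the illustration and are not taken from the course data files.

```shell
# Build a small sample file so the example is self-contained (made-up content).
sample=$(mktemp)
printf '%s\n' 'AUTHORS   Smith,J.' 'TITLE     Direct Submission' \
              'authors   Jones,K.' 'Authors   Lee,M.' > "$sample"

# Basic regex: '|' is an ordinary character, so no line matches.
# grep -c prints the number of matching lines; '|| true' hides grep's
# non-zero exit status when that number is 0.
basic=$(grep -c 'AUTHORS|authors' "$sample" || true)

# Extended regex: '|' means "or".
extended=$(grep -c -E 'AUTHORS|authors' "$sample")

# Case-insensitive search also matches 'Authors'.
anycase=$(grep -c -i 'authors' "$sample")

echo "basic=$basic extended=$extended anycase=$anycase"
rm -f "$sample"
```

With this sample, the script prints basic=0 extended=2 anycase=3.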

=== sed ===
The command '''sed''' stands for 'stream editor', and is typically used to substitute or delete patterns in files. 
Prompt$ '''sed''' 's/good/better/' <FILE>
substitutes the first occurrence of 'good' in each line with 'better' in <FILE>. The 's' stands for 'substitute'. You can also instead type,
Prompt$ '''sed''' 's/nice/epic/g' <FILE>
which substitutes all occurrences of 'nice' with 'epic' in each line. The 'g' stands for global replacement.

In the above cases, the changes aren't saved to <FILE>; the result is only written to stdout, which is the terminal. To save the changes, you can use the command option '''i''' or redirect the output to a new file, as shown in '''figure 5.2'''.

[[File:Sed example.png|frame|left|'''Figure 5.2 Using the sed command:''' The '''sed''' command is used to substitute occurrences of 'good' with 'better' and 'better' with 'the best'. The command option, -i, allows you to edit the file in place so that changes are saved. Otherwise, the changes are simply written to the terminal. You might be thinking that you could instead write '''sed''' 's/word1/word2/' sed_example.txt > sed_example.txt but it won't work. The shell processes redirections before it runs the command, so '> sed_example.txt' is handled first and a new, empty sed_example.txt is created. This overwrites the original file, and '''sed''' ends up processing an empty file. Empty files of this kind can become a problem on supercomputers with giant disk systems, as large numbers of them slow the server down. This can happen when running automated processes with many intermediate files, where one failed subprocess results in an empty file, causing other subprocesses to produce a multitude of empty files. Therefore, it's good practice to give intermediate files file extensions that make them easy to locate and delete if something goes wrong. The last example is just another way of doing '''sed -i''', which is shown because not all versions of '''sed''' handle the '''-i''' command option in the same way (the '''sed''' on macOS, for example, requires a backup suffix as an argument to '''-i''').]]
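
The safe alternative to redirecting onto the same file can be sketched as follows. The file content is made up for the illustration, and the '.tmp' extension is just one possible naming choice for the intermediate file.

```shell
# Made-up sample file in a temporary directory.
workdir=$(mktemp -d)
file="$workdir/sed_example.txt"
printf '%s\n' 'this is good' 'good is good' > "$file"

# Write the edited text to an intermediate file first. Only if sed
# succeeds (&&) is the original replaced by the intermediate file.
sed 's/good/better/g' "$file" > "$file.tmp" && mv "$file.tmp" "$file"

result=$(cat "$file")
echo "$result"
rm -rf "$workdir"
```

Because the original is only replaced after '''sed''' has finished, a failed command leaves the original file untouched.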

You don't have to use '''/''' as the separator that separates pattern from substitution. The '''sed''' command just uses whatever character follows the '''s''' as a separator, and '''/''' just happens to be the most commonly used. You could instead write,

Prompt$ '''sed''' 's|good|better|g' <FILE>
which would work perfectly fine. 

It is also possible to specify which lines you would like to have replaced in a file. The line number or range is written inside the quotes, directly in front of the '''s'''.
Prompt$ '''sed''' '666s/nice/epic/g' <FILE>
substitutes all occurrences of 'nice' with 'epic' in line 666.
Prompt$ '''sed''' '55,$s/nice/epic/g' <FILE>
substitutes all occurrences of 'nice' with 'epic' from line 55 to the last line of <FILE>. The last line of <FILE> is indicated by the symbol, '''$'''. 

The command '''sed''' can be used to delete whole lines.
Prompt$ '''sed''' '2d' <FILE>
will delete the second line of <FILE>. The 'd' stands for delete.
Prompt$ '''sed''' '2,4d' <FILE>
will delete the second to fourth lines of <FILE>.
You can also search for a pattern and delete the lines in which the pattern occurs.
Prompt$ '''sed''' '/nope/d' <FILE>
will delete any line with the pattern 'nope' in it.

Lastly, it is also possible to do multiple substitutions using the syntax,
Prompt$ '''sed''' 's/good/better/g ; s/nice/epic/g' <FILE>
which will replace all instances of 'good' with 'better' and all instances of 'nice' with 'epic'.
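
The line addresses and the delete command can be tried out on a few lines of made-up text. This is only a sketch; the variable names are chosen for the illustration.

```shell
# Four made-up input lines.
lines=$(printf '%s\n' 'nice one' 'nice two' 'nope three' 'nice four')

# Substitute only on line 2.
only_second=$(printf '%s\n' "$lines" | sed '2s/nice/epic/')

# Substitute from line 2 to the last line ($).
from_second=$(printf '%s\n' "$lines" | sed '2,$s/nice/epic/')

# Delete every line that contains 'nope'.
without_nope=$(printf '%s\n' "$lines" | sed '/nope/d')

printf '%s\n' "$only_second" '---' "$from_second" '---' "$without_nope"
```

Note that the script is in single quotes, so the shell leaves the '''$''' alone and passes it on to '''sed'''.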

=== tr ===
The command '''tr''' stands for translate and does exactly this. However, it works on single characters only, replacing one character with another. It doesn't support regex, but some of the syntax is similar.
A common way of using it is to translate lowercase characters to uppercase characters.

Prompt$ '''tr''' '[a-z]' '[A-Z]' '''<''' <FILE>
will read <FILE> and write its contents to stdout with lowercase characters translated to uppercase characters. <FILE> itself is not changed.
Prompt$ echo "Tabs for spaces please" | '''tr''' ' ' '\t'
will translate occurrences of spaces to tabs. The character class '[:space:]' could be used instead of ' ', but it also matches the newline at the end of the line.

[[File:Tr example.png|frame|none|'''Figure 5.3 Using the tr command:''' In the first line, the contents of tr_example.txt are displayed using '''cat''', and in the second line, lowercase characters in tr_example.txt are translated to uppercase characters. In the third line, spaces are translated to '_', and in the fourth line, digits are translated to '*'.]]
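
Here is a small sketch of '''tr''' at work on made-up input. The last command uses the '''d''' command option, which deletes characters instead of translating them; it is not covered above and is included only as an illustration.

```shell
# Lowercase to uppercase.
upper=$(echo 'Tabs for spaces please' | tr 'a-z' 'A-Z')

# Every space becomes an underscore.
underscored=$(echo 'Tabs for spaces please' | tr ' ' '_')

# -d deletes the listed characters, here all digits.
no_digits=$(echo 'Room 101, floor 3' | tr -d '0-9')

printf '%s\n' "$upper" "$underscored" "$no_digits"
```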

=== sort ===
The '''sort''' command is used to sort the lines in files and write them to stdout in a particular order. By default, without any options given, it compares lines character by character. In the C locale (for example, when run as '''LC_ALL=C sort'''), the order follows what's called the ASCII (American Standard Code for Information Interchange) table; with other locale settings the order can differ, for instance in how uppercase and lowercase letters are ranked. In the ASCII table, characters like 'a', 'y', 'n', '4' and '6' have certain values which can be given in binary, octal, decimal and hexadecimal. It is based upon these values that '''sort''' sorts lines in files. When '''sort''' sorts according to values in the ASCII table, it has the following features:

* Lines starting with numbers appear before lines starting with letters
* Lines starting with letters will appear in alphabetical order
* Lines starting with uppercase letters appear before lines starting with lowercase letters

These sorting rules are illustrated in '''figure 5.4''', where the '''sort''' command is used on the file, sort1_testfile.txt.

[[File:Sort1 example.png|none|frame|'''Figure 5.4 Using the sort command:''' The lines in sort1_testfile.txt are sorted according to the values in the ASCII table. Characters with the lowest value in the ASCII table will appear first; for example, as '''!''' has the lowest value, it appears first.]]

You can sort files in reverse ordering by using the '''r''' command option.
Prompt$ '''sort -r''' <FILE>
will sort the file in reverse order and output to stdout.

When dealing with numerical data, you can use the '''n''' command option.
Prompt$ '''sort -n''' <FILE>
will sort the file numerically and output to stdout. This can be combined with the '''r''' command option.
Prompt$ '''sort -nr''' <FILE>
will sort the file numerically in the reverse order and output to stdout.

You can check whether a file has already been sorted by using the '''c''' command option, as shown in '''figure 5.5'''.
[[File:sort2 example.png|frame|none|'''Figure 5.5 Using the sort command:''' If a file isn't sorted, a message will appear that notifies the user of a disorder in the file. If nothing appears, then the file is already sorted.]]

If you want to sort a file while also removing duplicates you can use the '''u''' command option.
Prompt$ '''sort -u''' <FILE>
will sort the file and remove any duplicate lines.

Lastly, you can sort lines in a file according to the values of one column with the '''k''' command option. For instance, if you wanted to sort according to column 4, you would type '''sort -k4''' <FILE> in the command line. In '''figure 5.6''', we show how one can sort numerically and according to a column.

[[File:sort3 example.png|frame|none|'''Figure 5.6 Using the sort command:''' Here, sort2_testfile.txt is sorted numerically and according to column 2 by combining command options '''n''' and '''k2'''.]]
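
The options described above can be tried out on a few made-up lines. In this sketch, LC_ALL=C is set so that the order follows the ASCII table regardless of the language settings of your system, and '-k2,2' restricts the sort key to column 2 only; both are choices made for the illustration.

```shell
# Made-up sample data: a word and a number on each line.
data=$(printf '%s\n' 'pear 10' 'apple 9' 'Zebra 100' 'apple 9' '42 units')

# ASCII order: digits first, then uppercase, then lowercase letters.
ascii_order=$(printf '%s\n' "$data" | LC_ALL=C sort)

# Same order, with the duplicate line removed.
unique=$(printf '%s\n' "$data" | LC_ALL=C sort -u)

# Numeric sort on column 2, largest value first.
by_number=$(printf '%s\n' "$data" | LC_ALL=C sort -k2,2nr)

# sort -c only checks the order and reports the result in its exit status.
if printf '%s\n' "$data" | LC_ALL=C sort -c 2>/dev/null; then
    sorted=yes
else
    sorted=no
fi

printf '%s\n' "$ascii_order" '---' "$unique" '---' "$by_number" '---' "sorted: $sorted"
```

In the numeric sort, the line '42 units' ends up last because its second column is not a number and is therefore counted as 0.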

=== ASCII table and numeral systems ===
To understand how '''sort''' works, we need to clarify what is meant by binary, octal, decimal, hexadecimal systems and finally how this relates to the ASCII table.

You're already familiar with the decimal system, as it's the system most commonly used for math and anything to do with numbers. As you know, it consists of 10 unique characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. The number of unique characters in a numeral system is called its base or radix. Here's a video that gives a quick explanation of base systems, and how a binary system is different from a decimal system. 
[https://www.youtube.com/watch?v=LpuPe81bc2w Base systems and binary] 
When a number exceeds what you can write with these 10 characters, you simply add another slot. For instance, in the number 16 the '1' sits in the second slot, and the second slot represents 10's. The reason why the decimal system is so widely used today is most likely that we humans tend to use our fingers to count, and since we have 10 fingers, the decimal system was the most logical choice.

Binary (2), octal (8) and hexadecimal (16) base systems are simply systems that have different bases. The binary system consists of 2 characters, 0 and 1, and this is the system that all computers use. In computing, the 0 often corresponds to a unit being turned off, and 1 corresponds to a unit being turned on. Because the binary system only consists of 2 characters, you have to change slots more often than you would in the decimal system. In the binary system, these slots are in fact what's called ''bits'', something you might have heard about without actually knowing what it meant. The number of bits can vary; for instance, you might've heard about operating systems being 32 bit or 64 bit.

Let's learn by example by translating decimal values to binary values. The ASCII table uses 7 bits, which we'll use as well. The values of the bits in a 7-bit binary number are: 64(7), 32(6), 16(5), 8(4), 4(3), 2(2), 1(1), where the number in parentheses is the position of the bit counted from the right. To clarify, these bit values correspond to the slot values 1,000,000(7), 100,000(6), 10,000(5), 1,000(4), 100(3), 10(2), 1(1) in the decimal system. 

In the table underneath there are 4 example ASCII characters with their corresponding decimal values and binary combinations. When summed, every binary combination results in a unique decimal value.
{| class="wikitable"
|-
!Binary value (7 bits): 64 32 16 8 4 2 1
!Decimal value
!Character in ASCII table
|-
|0 1 0 0 0 0 1
|0+32+0+0+0+0+1=33
|!
|-
|1 0 0 0 0 0 1
|64+0+0+0+0+0+1=65
|A
|-
|0 1 1 0 1 1 0
|0+32+16+0+4+2+0=54
|6
|-
|1 1 1 1 1 1 1
|64+32+16+8+4+2+1=127
|Del
|}
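
The sums in the table can also be computed in the shell. The following is a minimal sketch; the function name to_decimal is made up for the illustration, and it assumes that its argument consists of exactly 7 characters that are all 0 or 1.

```shell
# Convert a 7-bit binary string to its decimal value by adding up
# the bit values 64, 32, 16, 8, 4, 2 and 1 from left to right.
to_decimal() {
    bits=$1
    weight=64
    total=0
    while [ -n "$bits" ]; do
        digit=${bits%"${bits#?}"}   # first character of what is left
        bits=${bits#?}              # drop that character
        total=$(( total + digit * weight ))
        weight=$(( weight / 2 ))
    done
    echo "$total"
}

to_decimal 0100001   # prints 33, the value of '!'
to_decimal 1000001   # prints 65, the value of 'A'
to_decimal 0110110   # prints 54, the value of '6'
to_decimal 1111111   # prints 127, the value of Del
```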

The ASCII table has 128 characters (with decimal values 0 to 127), a limit that is set by it having 7 bits, which amounts to 2<sup>7</sup> = 128 combinations. It includes the characters A-Z, a-z, 0-9 and other characters which you can see by using the '''man''' command,
Prompt$ '''man''' ascii
This will show you a table of all 128 characters. The binary values, however, are not shown; only the octal, decimal and hexadecimal values are.
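
If you only need the value of a single character, you don't have to read through the whole table. The sketch below uses the '''printf''' command for this; the function names char_code and to_char are made up for the illustration.

```shell
# Print the decimal ASCII value of a character. A leading quote in the
# argument makes printf output the numeric value of the next character.
char_code() {
    printf '%d\n' "'$1"
}

# Print the character that belongs to a decimal value, by first writing
# the value as a 3-digit octal number and then using it as an escape.
to_char() {
    printf "\\$(printf '%03o' "$1")"
}

char_code 'A'    # prints 65
char_code '!'    # prints 33
to_char 65       # prints A (without a trailing newline)
```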

The octal system consists of 8 unique characters: 0, 1, 2, 3, 4, 5, 6, 7. Therefore, a value exceeding this, for instance 8 in decimal value, would be translated to 010 in the octal system. The octal system is used in computer software to simplify binary input, but interestingly, it has also been used by the indigenous American Yuki people, who used the spaces between their fingers to count.

The hexadecimal system consists of 16 unique characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. As the hexadecimal system uses 6 more unique characters than the decimal system, changing slots happens less frequently. For instance, a decimal value of 12 would translate to 0C in the hexadecimal system.
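
You can let the shell do these conversions for you with '''printf'''. This is a small sketch; the variable names are chosen for the illustration.

```shell
# Decimal to octal and hexadecimal, padded with a leading zero.
octal=$(printf '%03o' 8)       # 010
hex=$(printf '%02X' 12)        # 0C

# And back again: a leading 0 marks an octal number,
# a leading 0x marks a hexadecimal number.
from_octal=$(printf '%d' 010)  # 8
from_hex=$(printf '%d' 0x0C)   # 12

echo "$octal $hex $from_octal $from_hex"
```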

The ASCII table only covers 128 characters, but many more characters than this exist. Therefore, another system called ''Unicode'', which uses more bits, is often used. If you're interested, here's a five minute introductory video explaining ASCII and Unicode [https://www.youtube.com/watch?v=5aJKKgSEUnY Introductory video on ASCII and Unicode].

== Exercise 1: Extracting and sorting data from Gene Bank files ==
You've been given the task to extract and sort data from some genebank files, [https://teaching.healthtech.dtu.dk/material/unix/Genebankfiles.gb Genebank files]. More specifically, you need to extract the authors, accession numbers and names of the organisms and save them in 3 different files.

1. Extract the lines with authors from Genebankfiles.gb, sort them and save the output to one file. 
2. Extract the lines with accession numbers from Genebankfiles.gb, sort them and save the output to a second file. 
3. Extract the lines with organisms from Genebankfiles.gb, sort them and save the output to a third file. 
4. As you don't know when you're going to need to do this again, you want to write a shell script that performs the steps of questions 1-3. Make a simple shell script that appends authors, accession numbers and organisms to the files you made in questions 1-3.

== Exercise 2: Translating ASCII characters to binary and decimal values ==
In this exercise, you'll be working with the [http://teaching.healthtech.dtu.dk/material/unix/ASCII_chars.dat ASCII character file], which contains ASCII characters, and the [http://teaching.healthtech.dtu.dk/material/unix/Binary.dat Binary data file], which contains the corresponding binary data. The ASCII characters and their corresponding decimal values are listed below: 
'''{''' --> '''123''' '''a''' --> '''97''' '''p''' --> '''112''' '''X''' --> '''88''' '''+''' --> '''43''' '''/''' --> '''47''' '''$''' --> '''36''' 

1. Translate all the ASCII characters to decimal values and save the output to Decimals.dat (See '''Hint 1'''). 
2. Merge ASCII_chars.dat, Binary.dat and Decimals.dat so that column 1 holds the ASCII characters, column 2 the binary data and column 3 the decimal values. Save the output to Merge.dat and then delete Decimals.dat. 
3. Sort Merge.dat based on the decimal values. 
'''Hint 1:''' This is a tedious problem, as there are a lot of ASCII characters that need to be translated. The best way to do this at your current skill level is to use '''sed''' 's/blah1/blah2/g ; s/blah3/blah4/g ; ... s/blah98/blah99/g' '''>''' Decimals.dat. Also, remember to use '''\''' for special characters like '''$'''.

File:Sort3 example.png

2024-03-20T12:10:58Z

WikiSysop:

File:Sort2 example.png

2024-03-20T12:10:38Z

WikiSysop:

File:Sort1 example.png

2024-03-20T12:10:18Z

WikiSysop:

File:Tr example.png

2024-03-20T12:09:54Z

WikiSysop:

File:Sed example.png

2024-03-20T12:09:31Z

WikiSysop:

File:Grep example.png

2024-03-20T12:09:08Z

WikiSysop:

Filtering and regular expressions

2024-03-20T12:08:33Z

WikiSysop: /* Introduction to commands */

__NOTOC__
You can think of regular expressions (regex for short) as a pattern language that can be used to match patterns and filter data. In this section, we'll be learning so-called ''filter commands'', some of which utilize regular expressions in order to find and replace patterns. It is therefore important to have a basic understanding of regular expressions. To get you started, follow this link to an introductory video, [https://www.youtube.com/watch?v=KJG1dETacLI Basic Regular Expression Introduction Video], which gives a basic understanding of the concepts of regular expressions.

Next, you can follow this link [https://regexone.com/lesson/introduction_abcs Regex Introduction Exercises], where there are some exercises on basic regex. The 'Practice problems' are a bit more complex, but should still be doable. They are, however, optional.

[https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/pdf/ Regex cheat sheet]

== Introduction to commands ==
Now that you've been initiated in regular expressions, we'll take a look at some Unix commands that can use regular expressions. Underneath we list commands and syntax for the commands that we'll be using in this section.
{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''grep''' [OPTION] <PATTERN> <FILE>
|Global regular expression print.
|Uses regular expressions to select the lines in a file that match the pattern.
|-
|'''sed''' [OPTION] <SCRIPT> <FILE>
|Stream editor
|Lets you edit the contents of files using regular expressions, without opening the files in an editor.
|-
|'''tr''' [OPTION] <SET1> <SET2>
|Translate
|Translates characters from the standard input and writes to the standard output.
|-
|'''sort''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Sorts the content of a file.
|}

Datafile 1: [https://teaching.healthtech.dtu.dk/material/unix/Pseudomonas_Aeruginosa_16SrRNA.gb Pseudomonas Aeruginosa 16S rRNA Genebank file] 
Datafile 2: [https://teaching.healthtech.dtu.dk/material/unix/Genebankfiles.gb Genebank files] 
Datafile 3: [https://teaching.healthtech.dtu.dk/material/unix/ASCII_chars.dat ASCII character file] 
Datafile 4: [https://teaching.healthtech.dtu.dk/material/unix/Ex1_7bit_binarydata.dat Binary data file] 

=== grep ===
The '''grep''' command uses regular expressions as search patterns, finds the lines in files that match them and outputs those lines to stdout. It has the syntax,
Prompt$ '''grep''' [OPTION] <PATTERN> <FILE>
In '''figure 5.1''', the '''grep''' command is used to capture the line containing the authors of a text, which is then redirected to a text file.
[[File:Grep example.png|none|frame|'''Figure 5.1 Using the grep command:''' Here, '''grep''' is used to capture the line containing the authors of the file and save it to <AUTHORFILE.txt>.]]

Before you start using this section's commands, you should know that '''grep''' uses basic regex and not extended regex by default. For example, if you wanted to search for occurrences of 'AUTHORS' or 'authors' with
Prompt$ '''grep''' 'AUTHORS|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
there would most likely be no results, because in basic regex '|' is an ordinary character, so '''grep''' searches for the literal text 'AUTHORS|authors'. There are 3 solutions to this problem.
Prompt$ '''grep''' 'AUTHORS\|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
Prompt$ '''egrep''' 'AUTHORS|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
Prompt$ '''grep -E''' 'AUTHORS|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
In the first solution, we use '\' to tell '''grep''' that '|' is to be interpreted as a special character (this works in GNU '''grep'''). The second and third solutions are equivalent, as both make '''grep''' interpret the pattern as extended regex. The command '''egrep''' is an older name for '''grep -E''', and recent versions of GNU '''grep''' print a warning when it is used. Most current versions of '''sed''' also accept '''-E'''; without it you have to use '''\''' in '''sed''' as well. The commands '''tr''' and '''sort''' don't take regular expressions at all. Alternatively, in this instance where we're interested in the occurrence of uppercase and lowercase versions of a string, '''grep''' has a command line option for this.
Prompt$ '''grep -i''' 'authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
would capture both 'authors' and 'AUTHORS'. It also captures mixed-case variants such as 'Authors' and 'aUTHORS'.

=== sed ===
The command '''sed''' stands for 'stream editor', and is typically used to substitute or delete patterns in files. 
Prompt$ '''sed''' 's/good/better/' <FILE>
substitutes the first occurrence of 'good' in each line with 'better' in <FILE>. The 's' stands for 'substitute'. You can also instead type,
Prompt$ '''sed''' 's/nice/epic/g' <FILE>
which substitutes all occurrences of 'nice' with 'epic' in each line. The 'g' stands for global replacement.

In the above cases, the changes aren't saved to <FILE>; the result is only written to stdout, which is the terminal. To save the changes, you can use the command option '''i''' or redirect the output to a new file, as shown in '''figure 5.2'''.

[[File:Sed example.png|frame|left|'''Figure 5.2 Using the sed command:''' The '''sed''' command is used to substitute occurrences of 'good' with 'better' and 'better' with 'the best'. The command option, -i, allows you to edit the file in place so that changes are saved. Otherwise, the changes are simply written to the terminal. You might be thinking that you could instead write '''sed''' 's/word1/word2/' sed_example.txt > sed_example.txt but it won't work. The shell processes redirections before it runs the command, so '> sed_example.txt' is handled first and a new, empty sed_example.txt is created. This overwrites the original file, and '''sed''' ends up processing an empty file. Empty files of this kind can become a problem on supercomputers with giant disk systems, as large numbers of them slow the server down. This can happen when running automated processes with many intermediate files, where one failed subprocess results in an empty file, causing other subprocesses to produce a multitude of empty files. Therefore, it's good practice to give intermediate files file extensions that make them easy to locate and delete if something goes wrong. The last example is just another way of doing '''sed -i''', which is shown because not all versions of '''sed''' handle the '''-i''' command option in the same way (the '''sed''' on macOS, for example, requires a backup suffix as an argument to '''-i''').]]

You don't have to use '''/''' as the separator that separates pattern from substitution. The '''sed''' command just uses whatever character follows the '''s''' as a separator, and '''/''' just happens to be the most commonly used. You could instead write,

Prompt$ '''sed''' 's|good|better|g' <FILE>
which would work perfectly fine. 

It is also possible to specify which lines you would like to have replaced in a file. The line number or range is written inside the quotes, directly in front of the '''s'''.
Prompt$ '''sed''' '666s/nice/epic/g' <FILE>
substitutes all occurrences of 'nice' with 'epic' in line 666.
Prompt$ '''sed''' '55,$s/nice/epic/g' <FILE>
substitutes all occurrences of 'nice' with 'epic' from line 55 to the last line of <FILE>. The last line of <FILE> is indicated by the symbol, '''$'''. 

The command '''sed''' can be used to delete whole lines.
Prompt$ '''sed''' '2d' <FILE>
will delete the second line of <FILE>. The 'd' stands for delete.
Prompt$ '''sed''' '2,4d' <FILE>
will delete the second to fourth lines of <FILE>.
You can also search for a pattern and delete the lines in which the pattern occurs.
Prompt$ '''sed''' '/nope/d' <FILE>
will delete any line with the pattern 'nope' in it.

Lastly, it is also possible to do multiple substitutions using the syntax,
Prompt$ '''sed''' 's/good/better/g ; s/nice/epic/g' <FILE>
which will replace all instances of 'good' with 'better' and all instances of 'nice' with 'epic'.

=== tr ===
The command '''tr''' stands for translate and does exactly this. However, it works on single characters only, replacing one character with another. It doesn't support regex, but some of the syntax is similar.
A common way of using it is to translate lowercase characters to uppercase characters.

Prompt$ '''tr''' '[a-z]' '[A-Z]' '''<''' <FILE>
will read <FILE> and write its contents to stdout with lowercase characters translated to uppercase characters. <FILE> itself is not changed.
Prompt$ echo "Tabs for spaces please" | '''tr''' ' ' '\t'
will translate occurrences of spaces to tabs. The character class '[:space:]' could be used instead of ' ', but it also matches the newline at the end of the line.

[[File:Tr example.png|frame|none|'''Figure 5.3 Using the tr command:''' In the first line, the contents of tr_example.txt are displayed using '''cat''', and in the second line, lowercase characters in tr_example.txt are translated to uppercase characters. In the third line, spaces are translated to '_', and in the fourth line, digits are translated to '*'.]]

=== sort ===
The '''sort''' command is used to sort the lines in files and write them to stdout in a particular order. By default, without any options given, it compares lines character by character. In the C locale (for example, when run as '''LC_ALL=C sort'''), the order follows what's called the ASCII (American Standard Code for Information Interchange) table; with other locale settings the order can differ, for instance in how uppercase and lowercase letters are ranked. In the ASCII table, characters like 'a', 'y', 'n', '4' and '6' have certain values which can be given in binary, octal, decimal and hexadecimal. It is based upon these values that '''sort''' sorts lines in files. When '''sort''' sorts according to values in the ASCII table, it has the following features:

* Lines starting with numbers appear before lines starting with letters
* Lines starting with letters will appear in alphabetical order
* Lines starting with uppercase letters appear before lines starting with lowercase letters

These sorting rules are illustrated in '''figure 5.4''', where the '''sort''' command is used on the file, sort1_testfile.txt.

[[File:Sort1 example.png|none|frame|'''Figure 5.4 Using the sort command:''' The lines in sort1_testfile.txt are sorted according to the values in the ASCII table. Characters with the lowest value in the ASCII table will appear first; for example, as '''!''' has the lowest value, it appears first.]]

You can sort files in reverse ordering by using the '''r''' command option.
Prompt$ '''sort -r''' <FILE>
will sort the file in reverse order and output to stdout.

When dealing with numerical data, you can use the '''n''' command option.
Prompt$ '''sort -n''' <FILE>
will sort the file numerically and output to stdout. This can be combined with the '''r''' command option.
Prompt$ '''sort -nr''' <FILE>
will sort the file numerically in the reverse order and output to stdout.

You can check whether a file has already been sorted by using the '''c''' command option, as shown in '''figure 5.5'''.
[[File:sort2 example.png|frame|none|'''Figure 5.5 Using the sort command:''' If a file isn't sorted, a message will appear that notifies the user of a disorder in the file. If nothing appears, then the file is already sorted.]]

If you want to sort a file while also removing duplicates you can use the '''u''' command option.
Prompt$ '''sort -u''' <FILE>
will sort the file and remove any duplicate lines.

Lastly, you can sort lines in a file according to the values of one column with the '''k''' command option. For instance, if you wanted to sort according to column 4, you would type '''sort -k4''' <FILE> in the command line. In '''figure 5.6''', we show how one can sort numerically and according to a column.

[[File:sort3 example.png|frame|none|'''Figure 5.6 Using the sort command:''' Here, sort2_testfile.txt is sorted numerically and according to column 2 by combining command options '''n''' and '''k2'''.]]

=== ASCII table and numeral systems ===
To understand how '''sort''' works, we need to clarify what is meant by binary, octal, decimal, hexadecimal systems and finally how this relates to the ASCII table.

You're already familiar with the decimal system, as it's the system most commonly used for math and anything to do with numbers. As you know, it consists of 10 unique characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9. The number of unique characters in a numeral system is called its base or radix. Here's a video that gives a quick explanation of base systems, and how a binary system is different from a decimal system. 
[https://www.youtube.com/watch?v=LpuPe81bc2w Base systems and binary] 
When a number exceeds what you can write with these 10 characters, you simply add another slot. For instance, in the number 16 the '1' sits in the second slot, and the second slot represents 10's. The reason why the decimal system is so widely used today is most likely that we humans tend to use our fingers to count, and since we have 10 fingers, the decimal system was the most logical choice.

Binary (2), octal (8) and hexadecimal (16) base systems are simply systems that have different bases. The binary system consists of 2 characters, 0 and 1, and this is the system that all computers use. In computing, the 0 often corresponds to a unit being turned off, and 1 corresponds to a unit being turned on. Because the binary system only consists of 2 characters, you have to change slots more often than you would in the decimal system. In the binary system, these slots are in fact what's called ''bits'', something you might have heard about without actually knowing what it meant. The number of bits can vary; for instance, you might've heard about operating systems being 32 bit or 64 bit.

Let's learn by example by translating decimal values to binary values. The ASCII table uses 7 bits, which we'll use as well. The values of the bits in a 7-bit binary number are: 64(7), 32(6), 16(5), 8(4), 4(3), 2(2), 1(1), where the number in parentheses is the position of the bit counted from the right. To clarify, these bit values correspond to the slot values 1,000,000(7), 100,000(6), 10,000(5), 1,000(4), 100(3), 10(2), 1(1) in the decimal system. 

In the table underneath there are 4 example ASCII characters with their corresponding decimal values and binary combinations. When summed, every binary combination results in a unique decimal value.
{| class="wikitable"
|-
!Binary value (7 bits): 64 32 16 8 4 2 1
!Decimal value
!Character in ASCII table
|-
|0 1 0 0 0 0 1
|0+32+0+0+0+0+1=33
|!
|-
|1 0 0 0 0 0 1
|64+0+0+0+0+0+1=65
|A
|-
|0 1 1 0 1 1 0
|0+32+16+0+4+2+0=54
|6
|-
|1 1 1 1 1 1 1
|64+32+16+8+4+2+1=127
|Del
|}

The ASCII table has 128 characters (with decimal values 0 to 127), a limit that is set by it having 7 bits, which amounts to 2<sup>7</sup> = 128 combinations. It includes the characters A-Z, a-z, 0-9 and other characters which you can see by using the '''man''' command,
Prompt$ '''man''' ascii
This will show you a table of all 128 characters. The binary values, however, are not shown; only the octal, decimal and hexadecimal values are.

The octal system consists of 8 unique characters: 0, 1, 2, 3, 4, 5, 6, 7. Therefore, a value exceeding this, for instance 8 in decimal value, would be translated to 010 in the octal system. The octal system is used in computer software to simplify binary input, but interestingly, it has also been used by the indigenous American Yuki people, who used the spaces between their fingers to count.

The hexadecimal system consists of 16 unique characters: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. As the hexadecimal system uses 6 more unique characters than the decimal system, changing slots happens less frequently. For instance, a decimal value of 12 would translate to 0C in the hexadecimal system.

The ASCII table only covers 128 characters, but many more characters exist. Therefore, another standard called ''Unicode'', which uses more bits per character, is often used. If you're interested, here's a five minute introductory video explaining ASCII and Unicode [https://www.youtube.com/watch?v=5aJKKgSEUnY Introductory video on ASCII and Unicode].

== Exercise 1: Extracting and sorting data from GenBank files ==
You've been given the task of extracting and sorting data from some GenBank files, [http://teaching.healthtech.dtu.dk/material/unix/Genebankfiles.gb Genebank files]. More specifically, you need to extract the authors, accession numbers and names of the organisms and save them in 3 different files.

1. Extract the lines with authors from Genebankfiles.gb, sort them and save the output to one file. 
2. Extract the lines with accession numbers from Genebankfiles.gb, sort them and save the output to a second file. 
3. Extract the lines with organisms from Genebankfiles.gb, sort them and save the output to a third file. 
4. As you don't know when you're going to need to do this again, you want to write a shell script that performs the steps of questions 1-3. Make a simple shell script that appends authors, accession numbers and organisms to the files you made in questions 1-3.

== Exercise 2: Translating ASCII characters to binary and decimal values ==
In this exercise, you'll be working with [http://teaching.healthtech.dtu.dk/material/unix/ASCII_chars.dat ASCII character file] , which contains ASCII characters, and [http://teaching.healthtech.dtu.dk/material/unix/Binary.dat Binary data file] which contains corresponding binary data. The ASCII characters --> corresponding decimal values are listed hereunder: 
'''{''' --> '''125''' '''a''' --> '''97''' '''p''' --> '''112''' '''X'''--> '''88''' '''+''' --> '''43''' '''/''' --> '''47''' '''$''' --> '''36''' 

1. Translate all the ASCII characters to decimal values and save the output to Decimals.dat (See '''Hint 1'''). 
2. Merge ASCII_chars.dat, Binary.dat and Decimals.dat so that column 1 contains the ASCII characters, column 2 the binary data and column 3 the decimal values. Save the output to Merge.dat and then delete Decimals.dat. 
3. Sort Merge.dat based on the decimal values. 
'''Hint 1:''' This is a tedious problem, as there are a lot of ASCII characters that need to be translated. The best way to do this with your current skill level is to use '''sed''' 's/blah1/blah2/g ; s/blah3/blah4/g ; ... s/blah98/blah99/g' '''>''' Decimals.dat. Also, remember to use '''\''' for special characters like '''$'''.

Filtering and regular expressions

2024-03-20T12:07:51Z

WikiSysop: Created page with "__NOTOC__ You can think of regular expressions (regex for short) as a pattern language that can be used to match patterns and filter data. In this section, we'll be learning so-called ''filter commands'', some of which utilize regular expressions in order to find and replace patterns. It is therefore important to have a basic understanding of regular expressions. To get you started, follow this link to an introductory video, [https://www.youtube.com/watch?v=KJG1dETacLI B..."

__NOTOC__
You can think of regular expressions (regex for short) as a pattern language that can be used to match patterns and filter data. In this section, we'll be learning so-called ''filter commands'', some of which utilize regular expressions in order to find and replace patterns. It is therefore important to have a basic understanding of regular expressions. To get you started, follow this link to an introductory video, [https://www.youtube.com/watch?v=KJG1dETacLI Basic Regular Expression Introduction Video], which gives a basic understanding of the concepts of regular expressions.

Next, you can follow this link [https://regexone.com/lesson/introduction_abcs Regex Introduction Exercises], where there are some exercises on basic regex. The 'Practice problems' are a bit more complex, but should still be doable. They are, however, optional.

[https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/pdf/ Regex cheat sheet]

== Introduction to commands ==
Now that you've been initiated in regular expressions, we'll take a look at some Unix commands that can use regular expressions. Underneath we list commands and syntax for the commands that we'll be using in this section.
{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''grep''' [OPTION] <PATTERN> <FILE>
|Global regular expression print.
|Uses regular expressions to select the lines in a file that match the pattern.
|-
|'''sed''' [OPTION] <SCRIPT> <FILE>
|stream editor
|Allows the user to edit files using regular expressions without opening them in a text editor.
|-
|'''tr''' [OPTION] <SET1> <SET2>
|Translate
|Translates characters from the standard input and writes to the standard output.
|-
|'''sort''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Sorts the content of a file.
|}

Datafile 1: [http://teaching.healthtech.dtu.dk/material/unix/Pseudomonas_Aeruginosa_16SrRNA.gb Pseudomonas Aeruginosa 16S rRNA Genebank file] 
Datafile 2: [http://teaching.healthtech.dtu.dk/material/unix/Genebankfiles.gb Genebank files] 
Datafile 3: [http://teaching.healthtech.dtu.dk/material/unix/ASCII_chars.dat ASCII character file] 
Datafile 4: [http://teaching.healthtech.dtu.dk/material/unix/Ex1_7bit_binarydata.dat Binary data file] 

=== grep ===
The '''grep''' command uses regular expressions as search patterns to find matching lines in files and outputs them to stdout. It has the syntax,
Prompt$ '''grep''' [OPTION] <PATTERN> <FILE>
In '''figure 5.1''', the '''grep''' command is used to capture the line containing the authors of a text, which is then redirected to a text file.
[[File:Grep example.png|none|frame|'''Figure 5.1 Using the grep command:''' Here, '''grep''' is used to capture the line containing the authors of the file and save it to <AUTHORFILE.txt>.]]

But before you start using this section's commands, you should know that '''grep''' uses basic regex and not extended regex by default. For example, if you wanted to search for occurrences of 'AUTHORS' or 'authors'
Prompt$ '''grep''' 'AUTHORS|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
there would be no results, as '''grep''' doesn't interpret '|' as a special character in basic regex. There are 3 solutions to this problem.
Prompt$ '''grep''' 'AUTHORS\|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
Prompt$ '''egrep''' 'AUTHORS|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
Prompt$ '''grep -E''' 'AUTHORS|authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
In the first solution, we use '\' to designate that '|' is to be interpreted as a special character (this works in GNU '''grep''', but not necessarily in every version of '''grep'''). The second and third solutions are equivalent, as they both make '''grep''' interpret the pattern as extended regex. Of the other commands in this section, '''sed''' also uses basic regex by default and accepts '''-E''' for extended regex in current GNU and BSD versions, while '''tr''' and '''sort''' don't use regular expressions at all. Alternatively, in this instance where we're interested in the occurrence of uppercase and lowercase versions of a string, '''grep''' actually has a command line option for this.
Prompt$ '''grep -i''' 'authors' Pseudomonas_Aeruginosa_16S_rRNA.gb
would capture both 'authors' and 'AUTHORS'. It also has the added effect of capturing variants like 'Authors', 'aUTHORS', etc.
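
The difference between basic regex, extended regex and the '''-i''' option can be tried out on a small test file. The following is a minimal sketch; the file authors_demo.txt and its contents are made up for illustration and are not one of the course data files.

```shell
# Create a small sample file (hypothetical content, for illustration only).
printf '%s\n' 'AUTHORS   Smith,J.' 'authors   Jones,K.' \
              'Authors   Brown,L.' 'TITLE     Some title' > authors_demo.txt

# Basic regex: '|' is an ordinary character, so no line matches.
# (grep exits with a non-zero status when nothing matches, hence '|| true'.)
basic_count=$(grep -c 'AUTHORS|authors' authors_demo.txt || true)

# Extended regex: '|' means "or", so two lines match.
ere_count=$(grep -E -c 'AUTHORS|authors' authors_demo.txt)

# Case-insensitive search also catches 'Authors'.
nocase_count=$(grep -i -c 'authors' authors_demo.txt)

echo "basic: $basic_count, extended: $ere_count, ignore case: $nocase_count"
```

The '''-c''' option used here makes '''grep''' print the number of matching lines instead of the lines themselves.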

=== sed ===
The command '''sed''' stands for 'stream editor', and is typically used to substitute or delete patterns in files. 
Prompt$ '''sed''' 's/good/better/' <FILE>
substitutes the first occurrence of 'good' in each line of <FILE> with 'better'. The 's' stands for 'substitute'. You can also type,
Prompt$ '''sed''' 's/nice/epic/g' <FILE>
which substitutes all occurrences of 'nice' with 'epic' in each line. The 'g' stands for global replacement.

In the above cases, the changes aren't saved to <FILE>; the result is only written to stdout, which is the terminal. To save the changes, you can use the command option '''-i''' or redirect the output to a new file, as shown in '''figure 5.2'''.

[[File:Sed example.png|frame|left|'''Figure 5.2 Using the sed command:''' The '''sed''' command is used to substitute occurrences of 'good' with 'better' and 'better' with 'the best'. The command option, '''-i''', allows you to edit the file in place so that changes are saved. Otherwise, the changes are simply written to the terminal. You might be thinking that you could instead write '''sed''' 's/word1/word2/' sed_example.txt > sed_example.txt but it won't work. The shell processes redirections before it runs the command, so '> sed_example.txt' is handled first and a new, empty sed_example.txt is created. This effectively overwrites the original file and '''sed''' ends up processing an empty file. This sort of thing, where empty files are created, actually poses a problem for supercomputers with giant disk systems, as it slows the server down. This can happen when running automated processes with many intermediate files, where one failed subprocess results in an empty file, causing other subprocesses to produce a multitude of empty files. Therefore, it's good practice to give intermediate files file extensions that make them easy to locate and delete if something goes wrong. The last example is just another way of doing '''sed -i''', which is shown because '''-i''' doesn't work the same way in all versions of '''sed''' (the version on Mac OS, for example, expects a backup suffix as an argument after '''-i''').]]

You don't have to use '''/''' as the separator between pattern and substitution. The '''sed''' command uses whatever character follows '''s''' as the separator, and '''/''' just happens to be the most commonly used. You could instead write,

Prompt$ '''sed''' 's|good|better|g' <FILE>
which would work perfectly fine. 

It is also possible to specify which lines you would like to have replaced in a file by putting a line number or a range of lines in front of the 's'.
Prompt$ '''sed''' '666s/nice/epic/g' <FILE>
substitutes all occurrences of 'nice' with 'epic' in line 666.
Prompt$ '''sed''' '55,$s/nice/epic/g' <FILE>
substitutes all occurrences of 'nice' with 'epic' from line 55 to the last line of <FILE>. The last line of <FILE> is indicated by the symbol, '''$'''. The single quotes keep the shell from interpreting the '''$''' itself.

The command '''sed''' can also be used to delete whole lines.
Prompt$ '''sed''' '2d' <FILE>
will delete the second line of <FILE> from the output. The 'd' stands for delete.
Prompt$ '''sed''' '2,4d' <FILE>
will delete the second to fourth lines of <FILE>.
You can also search for a pattern and delete the lines in which the pattern occurs.
Prompt$ '''sed''' '/nope/d' <FILE>
will delete any line with the pattern 'nope' in it.

Lastly, it is also possible to do multiple substitutions using the syntax,
Prompt$ '''sed''' 's/good/better/g ; s/nice/epic/g' <FILE>
which will replace all instances of 'good' and 'nice' with 'better' and 'epic', respectively.
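
The '''sed''' commands from this section can be tried out on a small test file. This is only a sketch; the file sed_demo.txt and its three lines are assumptions made up for illustration. Note that a line number or range is written as part of the quoted script, directly in front of the 's'.

```shell
# Three-line sample file (hypothetical content, for illustration only).
printf '%s\n' 'good nice good' 'nope' 'nice nice' > sed_demo.txt

# Replace only the first 'good' on each line.
sed 's/good/better/' sed_demo.txt

# Replace 'nice' only on line 3; lines 1 and 2 are left untouched.
sed '3s/nice/epic/g' sed_demo.txt

# Replace 'nice' from line 2 to the last line ($).
sed '2,$s/nice/epic/g' sed_demo.txt

# Delete every line containing 'nope'.
sed '/nope/d' sed_demo.txt

# Two substitutions in one script.
sed 's/good/better/g ; s/nice/epic/g' sed_demo.txt
```

None of these commands change sed_demo.txt itself, since the output is only written to the terminal.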

=== tr ===
The command '''tr''' stands for translate and does exactly this; however, it can only translate one character at a time. It doesn't support regex, but some of the syntax is similar.
A common way of using it is to translate lowercase characters to uppercase characters.

Prompt$ '''tr''' '[a-z]' '[A-Z]' '''<''' <FILE>
will translate occurrences of lowercase characters to uppercase characters in <FILE>.
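Prompt$ echo "Tabs for spaces please" | '''tr''' ' ' '\t'
will translate occurrences of spaces to tabs. If you use the character class '[:space:]' as the first set instead, every whitespace character is translated, including the newline at the end of the line.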

[[File:Tr example.png|frame|none|'''Figure 5.3 Using the tr command:''' In the first line, the contents of tr_example.txt is displayed using '''cat''', and in the second line, lowercase characters in tr_example.txt are translated to uppercase characters. In the third line spaces are translated to '_', and in fourth line digits are translated to '*'.]]
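
The translations described above can be reproduced directly in the terminal. A minimal sketch, using a made-up input string rather than tr_example.txt:

```shell
# Lowercase to uppercase.
echo 'hello world 42' | tr 'a-z' 'A-Z'     # prints: HELLO WORLD 42

# Spaces to underscores.
echo 'hello world 42' | tr ' ' '_'         # prints: hello_world_42

# Digits to '*'. The second set is shorter than the first here; GNU and BSD
# versions of tr handle this by repeating its last character.
echo 'hello world 42' | tr '0-9' '*'
```

Because '''tr''' only reads from standard input, the text is passed to it with a pipe (or with '''<''' when it comes from a file).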

=== sort ===
The '''sort''' command is used to sort lines in files, arrange them in a particular order and output to stdout. By default, without any options given, it is described here as sorting according to what's called the ASCII (American Standard Code for Information Interchange) table. Strictly speaking, the default order depends on the locale settings of your system; the ASCII order described here is what you get in the C locale, which you can force by putting LC_ALL=C in front of the command. In the ASCII table, characters like 'a', 'y', 'n', '4', '6' have certain values which can be given in binary, octal, decimal and hexadecimal. It is based upon these values that '''sort''' sorts lines in files. Because '''sort''' sorts according to values in the ASCII table, it has the following features:

* Lines starting with numbers appear before lines starting with letters
* Lines starting with letters will appear in alphabetical order
* Lines starting with uppercase letters appear before lines starting with lowercase letters

These sorting rules are illustrated in '''figure 5.4''', where the '''sort''' command is used on the file, sort1_testfile.txt.

[[File:Sort1 example.png|none|frame|'''Figure 5.4 Using the sort command:''' The lines in sort1_testfile.txt are sorted according to the values in the ASCII table. Characters with the lowest value in the ASCII table will appear first; for example, as '''!''' has the lowest value, it appears first.]]

You can sort files in reverse ordering by using the '''r''' command option.
Prompt$ '''sort -r''' <FILE>
will sort the file in reverse order and output to stdout.

When dealing with numerical data, you can use the '''n''' command option.
Prompt$ '''sort -n''' <FILE>
will sort the file numerically and output to stdout. This can be combined with the '''r''' command option.
Prompt$ '''sort -nr''' <FILE>
will sort the file numerically in the reverse order and output to stdout.

You can check whether a file has already been sorted by using the '''c''' command option, as shown in '''figure 5.5'''.
[[File:sort2 example.png|frame|none|'''Figure 5.5 Using the sort command:''' If a file isn't sorted, a message will appear that notifies the user of a disorder in the file. If nothing appears then the file is already sorted]]

If you want to sort a file while also removing duplicates you can use the '''u''' command option.
Prompt$ '''sort -u''' <FILE>
will sort the file and remove any duplicates.

Lastly, you can sort lines in a file according to the values of one column with the '''k''' command option. For instance, if you wanted to sort according to column 4, you would type '''sort -k4''' <FILE> in the command line. In figure 5.6, we show how one can sort numerically and according to a column.

[[File:sort3 example.png|frame|none|'''Figure 5.6 Using the sort command:''' Here, sort2_testfile.txt is sorted numerically and according to column 2 by combining command options '''r''' and '''k2'''.]]
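
The '''sort''' options from this section can be combined and tried out on a small file. The sketch below uses a made-up file, sort_demo.txt, and puts LC_ALL=C in front of '''sort''' so that the ordering follows the ASCII values regardless of the locale settings on your system. The form '''-k2,2''' is used to restrict the sort key to column 2 only.

```shell
# Sample file with a letter in column 1 and a number in column 2
# (hypothetical content, for illustration only).
printf '%s\n' 'b 10' 'a 9' 'B 100' 'a 9' > sort_demo.txt

# Byte-value (ASCII) order: uppercase 'B' comes before lowercase 'a'.
LC_ALL=C sort sort_demo.txt

# Same order, with the duplicate 'a 9' line removed.
LC_ALL=C sort -u sort_demo.txt

# Numerical sort on column 2 only.
LC_ALL=C sort -k2,2n sort_demo.txt

# Numerical sort on column 2 in reverse order.
LC_ALL=C sort -k2,2nr sort_demo.txt

# Check whether the file is already sorted (sort -c reports the first
# line that is out of order and exits with a non-zero status).
if LC_ALL=C sort -c sort_demo.txt 2>/dev/null; then
    echo 'sorted'
else
    echo 'not sorted'
fi
```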

=== ASCII table and numeral systems ===
To understand how '''sort''' works, we need to clarify what is meant by binary, octal, decimal, hexadecimal systems and finally how this relates to the ASCII table.

You're already familiar with the decimal system, as it's the system most commonly used for math and anything to do with numbers. As you know, it consists of 10 unique characters; 0,1,2,3,4,5,6,7,8,9. The number of unique characters in a numeral system is called its base or radix. Here's a video that gives a quick explanation of base systems, and how a binary system is different from a decimal system. 
[https://www.youtube.com/watch?v=LpuPe81bc2w Base systems and binary] 
When a number exceeds what you can write with these 10 characters, you simply add another slot. For instance, in the number 16, the '1' is in the second slot, and the second slot represents 10's. The reason why the decimal system is so widely used today is most likely that we humans tend to use our fingers to count, and since we have 10 fingers, the decimal system was the most logical choice.

Binary (2), octal (8) and hexadecimal (16) systems are simply numeral systems with different bases. The binary system consists of 2 characters, 0 and 1, and this is the system that all computers use. In computing, the 0 often corresponds to a unit being turned off, and 1 corresponds to a unit being turned on. Because the binary system only consists of 2 characters, you have to change slots more often than you would in the decimal system. In the binary system, these slots are called ''bits'', a term you might have heard without knowing exactly what it means. The number of bits can vary; for instance, you might've heard about operating systems being 32-bit or 64-bit.

Let's learn by example by translating decimal values to binary values. The ASCII table uses 7 bits, which we'll use as well. The values of the bits in a 7-bit binary number are 64 (bit 7), 32 (bit 6), 16 (bit 5), 8 (bit 4), 4 (bit 3), 2 (bit 2) and 1 (bit 1). These bit values correspond to the slot values 1,000,000 (slot 7), 100,000 (slot 6), 10,000 (slot 5), 1,000 (slot 4), 100 (slot 3), 10 (slot 2) and 1 (slot 1) in the decimal system.

In the table underneath there are 4 example ASCII characters with their corresponding binary and decimal values. Summing the values of the bits that are set to 1 gives the decimal value, and every binary combination results in a unique decimal value.
{| class="wikitable"
|-
!Binary value (7 bits): 64 32 16 8 4 2 1
!Decimal value
!Character in ASCII table
|-
|0 1 0 0 0 0 1
|0+32+0+0+0+0+1=33
|!
|-
|1 0 0 0 0 0 1
|64+0+0+0+0+0+1=65
|A
|-
|0 1 1 0 1 1 0
|0+32+16+0+4+2+0=54
|6
|-
|1 1 1 1 1 1 1
|64+32+16+8+4+2+1=127
|Del
|}

The ASCII table has 128 characters (decimal values 0 to 127), a limit set by its 7 bits, which give 2 to the power of 7, that is 128, combinations. It includes the characters A-Z, a-z, 0-9 and other characters, which you can see by using the '''man''' command,
Prompt$ '''man''' ascii
This will show you a table of all 128 characters. The binary values are not shown, however; only the octal, decimal and hexadecimal values are.

The octal system consists of 8 unique characters; 0,1,2,3,4,5,6,7. A value exceeding 7, for instance 8 in decimal, therefore needs a second slot and is written as 10 in the octal system (or 010 in the three-digit form used by '''man''' ascii). The octal system is used in computing as a compact way of writing binary values, since one octal digit corresponds to exactly 3 bits. Interestingly, it has also been used by the indigenous Yuki people of California, who counted the spaces between their fingers.

The hexadecimal system consists of 16 unique characters; 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F. As the hexadecimal system uses 6 more unique characters than the decimal system, changing slots happens less frequently. For instance, a decimal value of 12 translates to 0C in the hexadecimal system (A is 10, B is 11 and C is 12).

The ASCII table only covers 128 characters, but many more characters exist. Therefore, another standard called ''Unicode'', which uses more bits per character, is often used. If you're interested, here's a five minute introductory video explaining ASCII and Unicode [https://www.youtube.com/watch?v=5aJKKgSEUnY Introductory video on ASCII and Unicode].
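
The conversions between numeral systems described in this section can be checked in the terminal with '''printf'''. The following is a minimal sketch; the helper function to_binary7 is a hypothetical name introduced here for illustration, and it builds the 7-bit binary value by repeatedly dividing by 2 and collecting the remainders.

```shell
# Decimal value of a character: a leading single quote in the argument
# makes printf use the numeric value of the character that follows.
printf '%d\n' "'A"     # prints: 65

# Decimal to octal and decimal to hexadecimal.
printf '%o\n' 8        # prints: 10
printf '%X\n' 12       # prints: C

# Hexadecimal to decimal.
printf '%d\n' 0x7F     # prints: 127

# Decimal to 7-bit binary (hypothetical helper, for illustration only).
to_binary7() {
    n=$1
    bits=''
    i=0
    while [ "$i" -lt 7 ]; do
        bits="$((n % 2))$bits"   # the remainder is the next bit, from the right
        n=$((n / 2))
        i=$((i + 1))
    done
    echo "$bits"
}

to_binary7 65          # prints: 1000001  (the character A)
to_binary7 54          # prints: 0110110  (the character 6)
```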

== Exercise 1: Extracting and sorting data from GenBank files ==
You've been given the task of extracting and sorting data from some GenBank files, [http://teaching.healthtech.dtu.dk/material/unix/Genebankfiles.gb Genebank files]. More specifically, you need to extract the authors, accession numbers and names of the organisms and save them in 3 different files.

1. Extract the lines with authors from Genebankfiles.gb, sort them and save the output to one file. 
2. Extract the lines with accession numbers from Genebankfiles.gb, sort them and save the output to a second file. 
3. Extract the lines with organisms from Genebankfiles.gb, sort them and save the output to a third file. 
4. As you don't know when you're going to need to do this again, you want to write a shell script that performs the steps of questions 1-3. Make a simple shell script that appends authors, accession numbers and organisms to the files you made in questions 1-3.

== Exercise 2: Translating ASCII characters to binary and decimal values ==
In this exercise, you'll be working with [http://teaching.healthtech.dtu.dk/material/unix/ASCII_chars.dat ASCII character file] , which contains ASCII characters, and [http://teaching.healthtech.dtu.dk/material/unix/Binary.dat Binary data file] which contains corresponding binary data. The ASCII characters --> corresponding decimal values are listed hereunder: 
'''{''' --> '''125''' '''a''' --> '''97''' '''p''' --> '''112''' '''X'''--> '''88''' '''+''' --> '''43''' '''/''' --> '''47''' '''$''' --> '''36''' 

1. Translate all the ASCII characters to decimal values and save the output to Decimals.dat (See '''Hint 1'''). 
2. Merge ASCII_chars.dat, Binary.dat and Decimals.dat so that column 1 contains the ASCII characters, column 2 the binary data and column 3 the decimal values. Save the output to Merge.dat and then delete Decimals.dat. 
3. Sort Merge.dat based on the decimal values. 
'''Hint 1:''' This is a tedious problem, as there are a lot of ASCII characters that need to be translated. The best way to do this with your current skill level is to use '''sed''' 's/blah1/blah2/g ; s/blah3/blah4/g ; ... s/blah98/blah99/g' '''>''' Decimals.dat. Also, remember to use '''\''' for special characters like '''$'''.

File:Read ShellScript.png

2024-03-20T12:07:08Z

WikiSysop:

File:UserArguments.png

2024-03-20T12:06:47Z

WikiSysop:

File:Variables ShellScripts.png

2024-03-20T12:06:27Z

WikiSysop:

File:Shebangs1.png

2024-03-20T12:06:02Z

WikiSysop:

Setting up your shell script

2024-03-20T12:05:18Z

WikiSysop: Created page with "__NOTOC__ Last section we did some simple shell scripting. These shell scripts contained multiple commands and the script could be executed with '''bash'''. In this section, we go through some of the basics of shell scripts. Before getting started, however, you should know that when writing shell scripts in text editors, it's important that it supports bash. For example, running Sublime on the Windows OS will cause errors. If you're using Vim editor, you should have no p..."

__NOTOC__
In the last section we did some simple shell scripting. These shell scripts contained multiple commands and the script could be executed with '''bash'''. In this section, we go through some of the basics of shell scripts. Before getting started, however, you should know that when writing shell scripts in a text editor, it's important that the editor supports bash. For example, running Sublime on the Windows OS will cause errors. If you're using the Vim editor, you should have no problems.

{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''which''' <COMMAND>
|<nowiki>-</nowiki>
|Shows the full path to the command.
|-
|'''read''' [OPTION] [input1] [input2] [input3]
|<nowiki>-</nowiki>
|Can be used to prompt user for input and save them as variables [input1]..[input3].
|}

=== Shebang '#!' ===
When making a shell script, the file extension '.sh' lets other users know that the file is a shell script. However, this sort of file extension doesn't actually affect how your system interprets the file. You could write whatever you want as your file extension and it wouldn't matter. What's important is to give your file something called a 'shebang' (the symbols '#!') as the first line. The shebang is followed, optionally after a space, by the file path to the interpreter you would like to use for your script. This interpreter could be a shell but it could also be a programming language interpreter. In one case ('''1''' in '''figure 1'''), we want to use bash as our interpreter. But we can change the interpreter to the python3 interpreter ('''2''' in '''figure 1'''). This, however, results in a syntax error because the command '''echo''' 'Hello' is not valid Python.

[[File:Shebangs1.png|frame|none|'''Figure 1 Using #! to set up your interpreter''' ]]

Also, bash is your default interpreter so if you were to delete the 'shebang' and just execute the file,
Prompt$ ./testfile.blabla
it would also work. However, if you were working on a server where bash wasn't the default interpreter, executing the file without a shebang might cause problems. Essentially, a shebang ensures that your script is always executed with the intended interpreter, which in turn ensures that the file is executed properly.

Well almost always, as it's possible to forego 'shebangs' and use whatever interpreter you like from the command line. We've already been doing this earlier with '''bash'''.
Prompt$ '''bash''' <shell_script>
This would result in a file being interpreted by bash regardless of what has been specified in the file. This means that if you know which language a script is written in, you don't have to add a 'shebang'. But it's still a good practice. Just imagine a scenario, where you need to use multiple scripts written in different languages.

When making a 'shebang', you need to know where your interpreters are located. For shells, they'll typically be located in the directory /bin. You can check which shells are available on your system by looking in the file /etc/shells,
Prompt$ cat /etc/shells

You can always check where commands/programs, such as interpreters, are located with the '''which''' command,
Prompt$ '''which''' [COMMAND]
which outputs the path to the specified Unix command. On a side note, it's worth knowing that when '''which''' and commands in general are executed, the shell searches the directories listed in the environment variable $PATH from left to right and uses the first match. This means that if you have two programs with the same name, it is the one in the directory that appears first in $PATH that is used.
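
The steps above can be sketched as follows: a script with a shebang is created, made executable and run directly. The file name hello_demo.sh is made up for illustration, and the sketch assumes that bash is located at /bin/bash on your system.

```shell
# Write a two-line script whose first line is the shebang.
cat > hello_demo.sh <<'EOF'
#!/bin/bash
echo 'Hello'
EOF

# Make the file executable and run it; the system reads the shebang
# and starts /bin/bash to interpret the rest of the file.
chmod +x hello_demo.sh
./hello_demo.sh

# Show where the bash program that the shell finds first is located.
command -v bash

# Show the list of directories that the shell searches for commands.
echo "$PATH"
```

Here '''command -v''' is a shell built-in that does roughly the same as '''which''', so it also works on systems where '''which''' isn't installed.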

=== Comments ===
In shell scripting, you should make comments that explain your shell script. This can be done by entering the symbol '#' at the start of a line, which causes the rest of that line to be disregarded by your interpreter. So if you were to type,
# Comment for my shell script.
on a line in your shell script, your interpreter won't treat it as code.

=== Variables ===
When writing a shell script, you might want to define variables in your code. There are user and system variables. In '''figure 2''', both types of variables are shown. A user variable like var1 can be assigned the value, 1, with the syntax,
var1=1
Values can be extracted by putting a '''$''' in front of the variable name. So typing,
'''echo''' $var1
in a shell script will output '1'.

It's also possible to save the output of Unix commands as variables by using backtick symbols '''``''' (or the equivalent, more modern syntax $(...)). This is shown in '''figure 2''', with the command '''cal''' that displays a nice calendar.
System variables, often called 'environment variables', are simply variables defined by your system or your shell. Examples of these are $SHELL, $BASH, $BASH_VERSION, $USER, $IFS (internal field separator), $HOME, etc. The environment variable, $SHELL, points to the shell that is used by default on your system. $BASH points to the execution path for '''bash''' (/bin/bash). These variables can change depending on which shell you're using; $BASH and $BASH_VERSION will, for example, not be set when you're not using a bash shell. The $HOME variable indicates where your home directory is located in the file system.
You can list the environment variables by typing '''env''' in your terminal. Variables that bash keeps to itself, such as $BASH_VERSION and $IFS, are not shown by '''env''' but can be listed with '''set'''.
Prompt$ '''env'''
[[File:Variables ShellScripts.png|frame|center|'''Figure 2 Variables:''' Firstly, the bash version is outputted by using the system variable, BASH_VERSION. Hereafter, user variables are set to various values and then extracted with $ in an '''echo''' command. The command line option '''-e''' is an option that enables interpretation of backslash characters (in this case the newline character '\n').]]

Variables are of course not limited to shell scripts, and you can also define variables in your terminal. The difference, however, is that the variables you set in your terminal persist until your terminal session ends. When running a shell script, all variables that were set during the process are discarded when the process completes.
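
The behaviour of variables described above can be sketched as follows. The variable names and the file scope_demo.sh are made up for illustration. The sketch also shows that a variable set in a script run with '''bash''' disappears when the script finishes, whereas running the same file in the current shell (with '''.''', which is synonymous with '''source''') keeps it.

```shell
# A user variable and its value.
var1=1
echo "$var1"

# Save the output of a command in a variable (command substitution).
this_year=$(date +%Y)
echo "Year: $this_year, home directory: $HOME"

# A one-line script that only sets a variable.
cat > scope_demo.sh <<'EOF'
inner_var='set inside the script'
EOF

# Run as a separate process: inner_var is gone afterwards.
bash scope_demo.sh
after_bash=${inner_var:-unset}
echo "after bash: $after_bash"

# Run in the current shell: inner_var is still set afterwards.
. ./scope_demo.sh
echo "after source: $inner_var"
```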

== Reading user input ==

Shell scripts can receive user input from the terminal, in the form of user arguments. It's quite simple to do as shown in '''figure 3'''.
[[File:UserArguments.png|frame|none|'''Figure 3 User arguments: ''' Four user arguments are saved as variables in the shell script. The user arguments are by default saved as 1, 2, 3 and 4. These values are extracted with '''$''' and outputted with '''echo'''. All of the user arguments are also saved to the variable 'Fourelements', which is then outputted with '''echo'''. Lastly, a for loop is used to output all elements in 'Fourelements' sequentially. If you are unfamiliar with for loops, you can check this link out [https://www.cyberciti.biz/faq/bash-for-loop/ For loops], which has a good explanation of how for loops work and their syntax in bash.]]

There's no fixed limit to how many user arguments can be passed to a script. They're saved as '''$1''', '''$2''', ... in the shell script; from the tenth argument onwards, curly brackets are needed, as in '''${10}'''. Furthermore, all user arguments are stored in the special variable, '''@'''. So by typing,
Prompt$ '''echo''' $@
in your shell script, all user arguments will be outputted.

You can save all these user arguments to one array variable,
var=("$@")
and extract user arguments separately using the syntax,
${var[i]}
which will extract the user argument at position '''i''' in var. Positions are counted from 0, so ${var[0]} is the first user argument.
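
A minimal sketch of a script that stores its user arguments and picks them out one by one is shown below. The script name args_demo.sh and the array name args are made up for illustration. Note that in bash a plain assignment such as var=$@ stores all arguments as one single string; to pick out individual elements with an index, the arguments are stored in an array with parentheses.

```shell
cat > args_demo.sh <<'EOF'
#!/bin/bash
# Store all user arguments in a bash array (hypothetical name: args).
args=("$@")
echo "count: $#"
echo "first: ${args[0]}"     # array positions are counted from 0
echo "third: ${args[2]}"
# Loop over all arguments; the quotes keep 'two words' together.
for a in "${args[@]}"; do
    echo "arg: $a"
done
EOF

bash args_demo.sh alpha 'two words' gamma
```

The script prints the number of arguments, the first and third argument, and then each argument on its own line.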

Shell scripts can also be made to prompt the user for input or multiple inputs with the command '''read''', as shown in '''figure 4'''.
[[File:Read ShellScript.png|none|frame|'''Figure 4 Read command:''' The command '''read''' is used to obtain user input and save it as variables within the shell script. This is done with different command options. The '''p''' option allows us to create a prompt (in this case 'What is your last name and age?') without a trailing newline character. The '''s''' option ensures that what is typed is not displayed on the screen.]]

It's worth knowing that if you don't specify a variable with '''read''', bash saves the user input in a shell variable called ''$REPLY''. For example,
Prompt$ '''read && echo''' $REPLY
will output whatever you entered from your keyboard.
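
The following sketch shows '''read''' with and without variable names. The script name read_demo.sh and the input are made up for illustration; to keep the example self-contained, the input is sent to the script through a pipe instead of being typed on the keyboard.

```shell
cat > read_demo.sh <<'EOF'
#!/bin/bash
# The first word of the input line goes to last_name, the rest to age.
read -r last_name age
echo "Last name: $last_name, age: $age"
# Without a variable name, the whole line is stored in REPLY.
read -r
echo "Reply: $REPLY"
EOF

printf 'Hansen 42\nanything\n' | bash read_demo.sh
```

The '''-r''' option makes '''read''' treat backslashes in the input as ordinary characters. When you run the script without the pipe, it waits for you to type the two lines yourself.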

== Exercise 1: User info file ==
Make a shell script that prompts the user for a username and password (make it stealthy). The information should be stored inside a file called <user_info.txt>. The username and password need to be stored neatly on separate lines like so:
username:blabla
password:blabla

File:Setting up Sublime3.png

2024-03-20T12:04:35Z

WikiSysop:

Text editors and some shell scripting

2024-03-20T12:03:54Z

WikiSysop: /* Exercise 1: Making a simple shell script with vim. */

__NOTOC__
In the last sections we learned how to use commands that move and work with files, but we haven't learned how to actually edit files, which is what we'll be learning in this section. Text editors are used for editing files and there are many text editors to choose from. You've probably already used some text editors, like ''Notepad'' or ''TextEdit'', as these are the default text editors on a Windows or Mac OS. In this section, we'll be taking a look at a lot of different text editors. It's a good idea to be familiar with several text editors, because if you're working on a server that doesn't have the text editor you're used to, it's good to have another text editor that you're somewhat familiar with.

The first text editor we'll be looking at is the 'vim' editor, as it's used ubiquitously across different OS's and servers.
Next, we'll briefly introduce some other text editors; Nedit, Gedit, Pico and Emacs. Becoming proficient in all of these text editors is a bit excessive, and it's not a requirement for this course. It is, however, a good idea just to know that many text editors exist. If you're ever in need of using one of these text editors, it's pretty straightforward to 'google' your way to a guide, or use some of the links presented in this section.

In this section, we'll also take a sneak peek at shell scripting. At the simplest level, shell scripts are multiple commands saved within a file. When the file is executed, the commands within are run. This is useful when you need to repeat the same workflow for multiple tasks, which can otherwise be quite labor intensive.
On a more complex level, shell scripts can be equipped with syntax like ''for'' and ''while'' loops, and you can actually do programming in shell scripts. However, writing large programs as shell scripts isn't recommended, as BASH and other shell syntaxes can be quite difficult to learn and read. Instead, you'd be better off using a programming language such as 'Python', 'C++', 'Java', etc.

Also, when making shell scripts it's a good idea to keep reproducibility in mind. This can be done by giving your script a proper name that adequately explains what the script does. It's also paramount to write comments in your script, which makes it easier for others (and yourself in half a year) to understand and use it. Comments also make it easier to change the script later. In the next section, we'll go into detail with comments and the other basics of shell scripts.
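To make the ideas above concrete, here is a minimal sketch of a commented shell script that repeats the same commands for several files with a ''for'' loop. The script name, the file names and the scratch directory are all made up for illustration; it is only meant to show the general shape of a small, documented script.

```shell
#!/usr/bin/env bash
# count_lines.sh (hypothetical name): print the number of lines
# in each .txt file of a scratch directory.

# Work in a throwaway directory so the sketch does not touch real files.
workdir=$(mktemp -d)
printf 'a\nb\n' > "$workdir/first.txt"
printf 'c\n' > "$workdir/second.txt"

# Loop over the files; the same commands are run once per file.
report=""
for file in "$workdir"/*.txt; do
    # tr removes the padding that some versions of wc add
    lines=$(wc -l < "$file" | tr -d ' ')
    report="$report$(basename "$file"): $lines
"
done
printf '%s' "$report"

# Clean up the scratch directory again.
rm -r "$workdir"
```

Lines starting with '#' are comments and are ignored by the shell, so they cost nothing at run time but make the script much easier to revisit.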

In the last bit of the section, we present a guide on installing 'Sublime 3', a nice, user-friendly text editor for programming. We'll set it up so it can be run from the command line on your Windows or Mac computer.

Underneath are some of the commands we'll be using in this section.
{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''alias''' <nowiki><alias_name>=<The stuff you want to make an alias for></nowiki>
|<nowiki>-</nowiki>
|Creates an alias called alias_name for what you've inserted on the right side of '='
|-
|'''source''' <FILE>
|<nowiki>-</nowiki>
|Executes the contents of a file in the current shell. Changes made when the file is run (variables, aliases, the working directory) stay in effect for the rest of the shell session, or until you change them again. It is synonymous with Prompt$ '''.''' <FILE>.
|-
|'''bash''' <FILE>
|Bourne again shell
|'''Bash''' will execute <FILE> as a different process. This way, changes that occur while the file is being executed cannot affect your shell.
|}

The command '''alias''' is useful when you want to simplify long commands, or if you often need to go to a faraway directory with a long file path. You can, for instance, make an alias that allows you to go to your desktop quickly.
Prompt$ alias desktop='cd filepath_to_Desktop/Desktop'
This will create an alias called 'desktop', which will change your directory to your Desktop. The alias, however, is only temporary, and the next time you open your terminal it won't work. In order to make it permanent we have to edit what's called the ''.bashrc'' file, which is located in your home directory,
Prompt$ '''cd''' ~
Prompt$ '''ls -a''' <home directory>
It's a hidden file, indicated by the leading dot '.', which is why you have to use the command option '''-a''' to see it. On macOS, the file to edit is usually '''.bash_profile''' instead. The ''.bashrc'' file contains the commands that are run when you start a bash shell, and the 'rc' in ''.bashrc'' is actually short for 'run commands'. Therefore, to make the alias permanent you simply need to append the alias to ''.bashrc''. Note the double quotes around the whole line, which keep the inner single quotes intact when the line is written to the file.
Prompt$ echo "alias desktop='cd filepath_to_Desktop/Desktop/'" >> .bashrc
This will ensure that the next time you open a terminal, the alias command will be run. 
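The following is a small sketch of the append step, under the assumption that we use a temporary file as a stand-in for ''.bashrc'' and a temporary directory in place of the Desktop, so nothing in your real home directory is changed. The variable names rc_file and target_dir are made up for the example.

```shell
# Stand-in for ~/.bashrc so the real file is left untouched (assumption).
rc_file=$(mktemp)
# Hypothetical directory playing the role of the Desktop.
target_dir=$(mktemp -d)

# Double quotes outside, single quotes inside:
# the inner single quotes end up in the file unchanged.
echo "alias desktop='cd $target_dir'" >> "$rc_file"

# Show what was written, then load it into the current shell.
cat "$rc_file"
. "$rc_file"    # in bash, 'source "$rc_file"' does the same

# Ask the shell for the definition it now knows about.
alias desktop
```

Keep in mind that bash only expands aliases in interactive shells by default, so an alias like this is meant for typing at the prompt rather than for use inside scripts.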

Instead of closing and reopening your terminal, you can type,
Prompt$ '''source''' .bashrc
which runs the commands within ''.bashrc'' in your current shell. Alternatively one could write ''''.''' .bashrc', which is synonymous with ''''source''' .bashrc'.

The '''bash''' command, which is the command we'll be using to execute shell scripts, is somewhat similar to '''source''' because it also executes files. The difference is that '''bash''' executes the file in a separate process from your current shell. Remember, the shell running in your terminal is a program itself and an ongoing process. This way, commands in the file that change settings (variables, aliases, the working directory) only have an effect while the file is being executed, and are gone once it finishes. Therefore you would never write,
Prompt$ bash .bashrc
when reloading your bash settings. In figure 3.1 we show the difference between '''source''' and '''bash'''. The file, test.sh, contains a variable called Variable1. Variables are assigned with the syntax variable_name=variable_value, and '''$''' is a special character that lets the shell know that the subsequent string is the name of a variable. If the '''$''' wasn't used, '''echo''' would output the literal text 'Variable1'. In the figure we see that when '''bash''' is used to execute ''test.sh'', the variable is not saved. Conversely, when '''source''' is used, the variable is saved.
[[File:Bash vs source.png|none|frame|'''Figure 3.1 Bash command vs source command:''']]
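The comparison can also be tried out directly. The sketch below recreates a small test.sh in a scratch directory; its exact contents are an assumption based on the description above (it sets Variable1 and prints it), and the helper variables after_bash and after_source are only there to make the result easy to inspect.

```shell
# Recreate a small test.sh in a scratch directory (contents are assumed).
workdir=$(mktemp -d)
cat > "$workdir/test.sh" <<'EOF'
Variable1="set by test.sh"
echo "inside test.sh: $Variable1"
EOF

# Start from a clean state.
unset Variable1

# Run in a separate process: the assignment is lost when that process ends.
bash "$workdir/test.sh"
after_bash="${Variable1:-not set}"
echo "after bash:   $after_bash"

# Run in the current shell: the assignment persists.
. "$workdir/test.sh"    # in bash, 'source' is a synonym for '.'
after_source="${Variable1:-not set}"
echo "after source: $after_source"

rm -r "$workdir"
```

After the '''bash''' call the variable is still unset in the calling shell, while after the '''.''' (or '''source''') call it holds the value assigned in test.sh.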

== Vim editor ==
The vim editor is one of the most widely available text editors. It's by default installed on most systems and therefore worth knowing.
Start by following this link to a couple of introductory videos on the Vim/vi editor [https://www.youtube.com/watch?v=SI8TeVMX8pk&list=PLPyiwIbA1EVmT3QIzltwGszhdSVEr1DkD&index=4 Vim/vi editor tutorial videos].

To open a file with the vim editor simply type,
Prompt$ '''vim''' <FILE>
An important feature of the vim editor is that it has multiple modes, from which different commands can be issued. The 3 most important modes are command mode (also called normal mode), insert mode and visual mode. After opening a file with vim, you'll start out in command mode. You can move your cursor around with the arrow keys. There are other, more elaborate ways of navigating, such as jumping to the start or end of a line, but unless you're constantly using vim this sort of navigation can be difficult to remember. From command mode you can issue commands, like entering insert and visual mode. Before we do that, however, it's important to know how to exit the vim editor. While in command mode, type
''':wq'''
which will appear at the bottom of the vim editor display as you type it. Press 'Enter', and this will save your file and exit. If you simply want to exit, type
''':q'''
which will exit as long as there are no unsaved changes. If you have made changes and want to discard them, type ''':q!''' instead. If none of this works, it's probably because you're not in command mode. You can always return to command mode by pressing the 'Esc' key.

While in command mode you can type 'i' and 'v' to enter insert and visual mode respectively. Within insert mode, you can write and delete text as you would in any text editor. Visual mode is used for highlighting text that you want deleted, copied or replaced. From command mode you can actually enter 3 types of visual mode: visual ('v'), visual line ('shift-v') and visual block ('ctrl-v'). These visual modes are simply different ways in which text can be highlighted. First set your cursor to where you want to start highlighting, then enter one of the 3 visual modes. You can then choose the text that you want highlighted by moving the cursor. Visual mode is also where you copy and cut content of the file. To do this, first highlight the text you want cut or copied by entering one of the 3 visual modes. From here, press 'd' to cut or 'y' to copy, which returns you to command mode. Subsequently, move with the arrow keys to where you want the content pasted, and press 'P' to paste before the cursor or 'p' to paste after the cursor.

If you've done something that you shouldn't have done within vim, you can undo your previous action by typing 'u' while in command mode. Also, you can move your cursor to a specific line by typing the line number followed by 'G' (that is, 'shift-g') while in command mode, for example '12G' to go to line 12. The list of utilities in the vim editor is long, and it's likely that you'll never have use for many of them. For now, what's important is that you're able to open a file, edit the file in insert mode, copy/paste content with visual mode and save/exit the file.

=== Exercise 1: Making a simple shell script with vim. ===
Datafile 1: [https://teaching.healthtech.dtu.dk/material/unix/Pseudomonas_Aeruginosa_16SrRNA.gb Pseudomonas aeruginosa 16S rRNA GenBank file] 
Datafile 2: [https://teaching.healthtech.dtu.dk/material/unix/ex1.acc ex1.acc] 
Datafile 3: [https://teaching.healthtech.dtu.dk/material/unix/ex1.dat ex1.dat] 
1. Download the 3 datafiles. 
2. Create a file with the ending '.sh', which by convention marks it as a shell script, in your current directory. 
3. Open it with vim and go into insert mode. 
4. While in insert mode type in a command that will create a sub-directory to your current directory. 
5. On the second line, type in a command that copies the 3 data files to this subdirectory. 
6. On the third line, type in a command that merges the 3 data files and saves the result as merged.dat. 
7. On the fourth line, type in a command that will delete the subdirectory that you created. 
8. Exit the file and execute the file from the command line by typing,
Prompt$ bash <file.sh>

=== .vimrc file ===
The .vimrc, where the 'rc' is short for "run commands", is a file that contains commands that are run whenever you start vim. Your personal .vimrc is a hidden file within your home directory; if you're on macOS, there should also be a system-wide vimrc located at /usr/share/vim/vimrc. You can think of it as a settings file for the vim editor. By default your .vimrc file is typically empty or missing (in which case you can simply create it), and how you customize your vim editor is completely up to you. Customizing .vimrc is simple: just open it with vim as you would any other file. Underneath is a link to a video that explains some basic settings you could put in your .vimrc file.
[https://www.youtube.com/watch?v=-jB6i--_XrU Tutorial on some basic .vimrc settings]
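
As a rough idea of what such a settings file might contain, the sketch below writes a handful of commonly used vim settings to a scratch file. The file name vimrc_example and the choice of settings are assumptions made for illustration, not a recommended configuration, and your real .vimrc is not touched.

```shell
# Write a few common settings to a scratch file rather than the real ~/.vimrc.
# The file name vimrc_example is hypothetical.
workdir=$(mktemp -d)
cat > "$workdir/vimrc_example" <<'EOF'
" Lines starting with a double quote are comments in a vimrc
" Turn on syntax highlighting
syntax on
" Show line numbers
set number
" Display a tab character as 4 columns and indent by 4 columns
set tabstop=4
set shiftwidth=4
" Insert spaces when the Tab key is pressed
set expandtab
EOF

# Show the result.
cat "$workdir/vimrc_example"
```

To try such a file without changing your own configuration, vim can be started with the '''-u''' option followed by the path to the file, which makes it read that file instead of the default vimrc.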

You can actually also download a premade .vimrc file that gives your vim editor a lot of extra utility. However, as you might someday be working on computers where these utilities aren't available, it's a good idea that you first become efficient with the default vim editor.

== Other text editors; Nedit, Gedit, Pico and Emacs ==
Here, we present a very short introduction to some other text editors that you might run into.
*Nedit
Nedit, short for 'Nirvana editor', has an interface similar to that of text editors found natively on Windows and Mac computers. It has some functionalities that it excels at, which are listed and explained in this link [https://blog.ostermiller.org/nedit/ Nedit].

[https://sourceforge.net/projects/nedit/ nedit download]
*Gedit
Gedit is the default text editor in GNOME desktop environments. We haven't talked about the 'GNU project' and 'GNOME' yet, so this might be a good time. 'GNU' is a recursive acronym for 'GNU's Not Unix', and it's also another name for the wildebeest (remember those things that killed the lion king, yep), which is the mascot of the project. Essentially, the 'GNU project' is a mass-collaboration project that was announced on September 27, 1983 with the goal of giving computer users more control of their computers by providing free software. GNOME, originally short for 'GNU Network Object Model Environment', was one of the results of this project, and it's the default desktop environment in many Linux distributions, for example Ubuntu. GNOME has provided a help guide that can help you get started if you want to learn gedit: [https://help.gnome.org/users/gedit/stable/ Help guide to gedit]

[https://sourceforge.net/projects/gedit/ gedit download]
*Pico
Pico, short for 'Pine composer', is a text editor for UNIX-based systems. The good thing about Pico is that it's simple and easy to learn. The downside is that it doesn't have as many features as, for example, the 'vim editor' covered above. Pico also has a clone called 'nano', which was created as part of the GNU project because pico's redistribution terms were unclear. So if you've learned 'pico' then you've also learned nano. If you're interested in learning pico, a good start would be to watch this video [https://www.youtube.com/watch?v=o5IY1dMUpc0 Pico text editor introduction].

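Pico itself is distributed together with the Pine/Alpine mail client, while its clone nano can be installed through the package manager of most Linux distributions.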

*Emacs
Emacs was initially released in 1976, and has since undergone many tweaks and updates. Among programmers, there exists something called the 'Editor war', which is a long-running rivalry between 'emacs' and 'vim' [https://en.wikipedia.org/wiki/Editor_war Editor war]. Many find 'emacs' considerably more difficult to learn than 'vim'. To get started with emacs, you should have no trouble finding various tutorials on YouTube.

To install emacs on Ubuntu and other systems that use '''apt''', it's best to use the commands '''sudo''' and '''apt''' to ensure that it's installed properly. Here's what you would type in your terminal,
Prompt$ '''sudo apt -y install''' emacs

== Sublime 3 ==

=== Windows ===
Sublime is a pretty cool text editor with some nice features, especially for programming purposes. There are two good ways of setting up Sublime 3 in a Unix terminal on a Windows computer. One way is to use a program called ''Xming'' coupled with the use of the Unix commands '''apt''' and '''sudo'''. But as we haven't covered these commands yet (they will be presented in the section '''Advanced Packaging tools'''), we won't be using this method.

For the other method, start off by downloading the Windows version of Sublime Text by following this link [https://www.sublimetext.com/3 Sublime download link]. When installing Sublime 3, make sure you know where you're saving it (by default, Windows will save it in ''\Program Files''). Once you've successfully installed Sublime 3, start your Ubuntu WSL terminal.
From your terminal, you can directly access sublime text by specifying the filepath to 'subl.exe',

Prompt$ /mnt/c/Program\ Files/Sublime\ Text\ 3/subl.exe <FILE>

will open <FILE> with the Sublime Text editor. The '\ ' is how a space inside a path is written on the command line; without the backslash, the shell would treat the space as a separator between arguments. This can be confusing, so spaces are often avoided in filenames by replacing them with '_'. We can avoid having to write such a long filepath every time we use Sublime by using the command '''alias''',

Prompt$ '''alias''' subl='/mnt/c/Program\ Files/Sublime\ Text\ 3/subl.exe'
which creates an alias for what's specified after '='. This way, typing '''subl''' <file> will open the file in the Sublime Text editor. This alias, however, is not permanent, and the next time you start your terminal you'll need to write it again. In order to make it a permanent alias we have to edit what's called the .bashrc file, which is located in your home directory,
Prompt$ '''cd''' ~
Prompt$ '''echo''' "alias subl='/mnt/c/Program\ Files/Sublime\ Text\ 3/subl.exe'" >> .bashrc
Now every time you start bash the alias command will be run. In figure 3.2, we show the commands that need to be executed in the terminal.
[[File:Setting up Sublime3.png|frame|none|'''Figure 3.2 Setting up Sublime 3:''' The directory is changed to the home directory by typing '''cd''' ~, and all files within the home directory are listed using the command '''ls -a'''. The alias is appended to .bashrc using '''echo''', and afterwards .bashrc is executed using the command '''source''']]
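
Since paths with spaces are a common source of errors, here is a small sketch of the two usual ways of writing them: escaping each space with a backslash, or quoting the whole path. It uses a scratch directory named 'Program Files' and a tiny stand-in script called fake_subl.sh, both of which are made up for the example and have nothing to do with the real Sublime installation.

```shell
# Build a scratch directory whose name contains a space, mimicking "Program Files".
workdir=$(mktemp -d)
mkdir "$workdir/Program Files"

# fake_subl.sh is a hypothetical stand-in for subl.exe: it just reports its argument.
printf 'echo "opened: $1"\n' > "$workdir/Program Files/fake_subl.sh"

# Option 1: escape the space with a backslash.
escaped=$(sh "$workdir"/Program\ Files/fake_subl.sh notes.txt)

# Option 2: put the whole path in quotes.
quoted=$(sh "$workdir/Program Files/fake_subl.sh" notes.txt)

echo "$escaped"
echo "$quoted"

rm -r "$workdir"
```

Both forms hand the same single path to the command. Which one to use is mostly a matter of taste, although quoting tends to be easier to read when a path contains several spaces.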
=== Mac ===
For Mac computers, the installation method for Sublime 3 is similar to the Windows installation. You can download the Mac version of Sublime 3 by following the link [https://www.sublimetext.com/3 Sublime download link].

File:Bash vs source.png

2024-03-20T12:02:40Z

WikiSysop:

Text editors and some shell scripting

2024-03-20T12:02:08Z

WikiSysop: Created page with "__NOTOC__ In the last sections we learned how to use commands that move and work with files, but we haven't learned how to actually edit files, which is what we'll be learning in this section. Text editors are used for editing files and there are many text editors to choose from. You've probably already used some text editors, like ''Notepad'' or ''TextEdit'', as these are the default text editors on a Windows or Mac OS. In this section, we'll be taking a look at a lot..."


File:Tee example1.png

2024-03-20T12:01:23Z

WikiSysop:

File:Tee command.png

2024-03-20T12:01:02Z

WikiSysop:

File:Echo&cat.png

2024-03-20T12:00:37Z

WikiSysop:

File:Operator Example2.png

2024-03-20T12:00:16Z

WikiSysop:

File:Standard streams.png

2024-03-20T11:59:23Z

WikiSysop:

File:Standard streams wcexample2.png

2024-03-20T11:58:54Z

WikiSysop:

Standard streams and working with files

2024-03-20T11:57:44Z

WikiSysop: /* Datafiles */

__NOTOC__
In the last section we learned how to make directories and move around in the file system, but we didn't actually learn how to work with files. So in this section we'll be doing just that. 

Many of the commands you'll be learning in this course can receive data from the standard input and write to something called the standard output, so later in this section we'll introduce the concept of standard streams. Lastly we'll look at how we can change the direction of the standard output and standard input with re-directional operators and pipelines.

== Introduction to commands ==
Here we list Unix commands and their main function but it's important to keep in mind that Unix commands are versatile and it is possible to complete the same tasks using different commands.
{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''touch''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Touches a file. If the file doesn't exist, an empty file with the specified name is created. If it already exists, its modification time is updated.
|-
|'''mv''' [OPTION] <FILE> <destination directory or another filename>
|Move
|Moves a file to a specified directory. It can also be used to rename files.
|-
|'''rm''' [OPTION] <FILE>
|Remove
|Removes the specified file. With the '''-r''' option it can also remove directories together with their contents.
|-
|'''cp''' [OPTION] <FILE> <destination directory or another file>
|Copy
|Works a lot like mv, but leaves the original in place and creates a copy instead. Can also be used to copy the content of one file to another file.
|-
|'''cat''' [OPTION] <FILE>
|Concatenate
|Concatenates files and writes the result to standard output. If used on one file, the content of that file is displayed in the command line interface.
|-
|'''head''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Outputs the first part of a file
|-
|'''tail''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Outputs the last part of a file
|-
|'''less''' <FILE>
|<nowiki>-</nowiki>
|Shows one screenful of the file at a time. This is a useful command for viewing big files, as it loads small segments at a time. q --> quit, space --> scroll forward one page, b --> scroll backward one page. Arrow keys can be used to scroll up and down one line at a time.
|-
|'''wc''' [OPTION] <FILE>
|Word count
|Counts the lines, words and bytes in the file/files. Options select which of these counts are shown.
|-
|'''paste''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Merges lines from different files.
|-
|'''cut''' <nowiki>[OPTION]</nowiki> <FILE>
|<nowiki>-</nowiki>
|Extracts selected parts of each line of a file (for example certain columns), depending on what is specified in the option.
|-
|'''echo''' [OPTION] <STRING>
|<nowiki>-</nowiki>
|Outputs the string to your command line interface. In computer language, a string is just a sequence of characters.
|-
|'''wget''' [OPTION] <URL>
|web get
|A non-interactive network downloader used to download files located at the URL.
|-
|'''curl''' [OPTION] <URL>
|client URL
|Similarly to '''wget''', it is used to download files at the specified URL. This is an alternative for macOS users, where '''wget''' is not installed by default.
|-
|'''tee''' [OPTION] <FILE>
|It's named after the 'T-splitter' used in plumbing.
|Splits output so that it can be outputted to both the terminal and a file.
|}

As an introduction, you can watch this youtube video on the use of some of the commands, [https://www.youtube.com/watch?v=VgI4UKyL0Lc Unix Commands for working with files]. The video introduces some of the Unix commands for navigation that you learned in the last section but it also introduces the commands: '''touch''', '''mv''' and '''cp'''.

=== Datafiles ===
Below are download links for this section's datafiles. You can download them by right-clicking and then choosing the option 'Download link'. 
Datafile 1: [https://teaching.healthtech.dtu.dk/material/unix/Pseudomonas_Aeruginosa_16SrRNA.gb Pseudomonas Aeruginosa 16S rRNA GenBank file] 
Datafile 2: [https://teaching.healthtech.dtu.dk/material/unix/ex1.acc ex1.acc] 
Datafile 3: [https://teaching.healthtech.dtu.dk/material/unix/ex1.dat ex1.dat] 
You can also right click, copy the link address and type
Prompt$ '''wget''' <the link address you copied>
in your Unix terminal. You might have trouble pasting the link into your terminal because the keyboard shortcut is not necessarily '''Ctrl-V'''. On Ubuntu WSL, the shortcut for pasting is simply right-clicking and for copying it's '''Ctrl-Shift-C'''. On MobaXterm it should be Shift-Ins or the middle mouse button (the one you'd normally use for scrolling) if you have one of those. It might be set differently on your MobaXterm, however, and you can check this under Settings --> Keyboard shortcuts --> Paste in terminal. You can copy text in MobaXterm by left-clicking and marking the text you want copied.

'''A little background on the files:''' Datafile 1 is a GenBank file that contains information about the 16S rRNA DNA sequence of the pathogenic bacterium Pseudomonas aeruginosa. The DNA sequence of 16S rRNA is a highly conserved region in bacteria and is often used to identify bacteria. Datafile 2 contains 3 tab-separated columns of numerical data and datafile 3 contains 2 tab-separated columns of accession numbers. A tab is a single character that the terminal usually displays as several spaces, and accession numbers are unique identifier tags for DNA sequence records.

=== Examples of how this section's Unix commands can be used ===

All of the commands below are executed in the command line interface. For example, when using the '''wc''' command your terminal should look like '''figure 2.1'''. In '''figure 2.1''', the command is executed in Ubuntu on the Windows Subsystem for Linux (WSL), so the colouring might be different on your computer. The syntax, however, is the same.
[[File:Standard streams wcexample2.png|none|frame|'''Figure 2.1 Using wc command in a Unix environment:''' The '''wc''' command is executed in Ubuntu on the Windows Subsystem for Linux (WSL), so it might look a little different if you're using a different Unix environment.]] 

'''cat''', short for concatenate, is often used to display file contents.
Prompt$ '''cat''' <FILE>
outputs the content of the file. You can also combine '''cat''' with options for different functionalities.
Prompt$ '''cat -n''' <FILE>
outputs the file content with each line prefixed by its line number. '''cat''' has other useful applications as well, but these require an understanding of ''redirectional operators'', so we'll save them for later in this section.

If you're interested in the file content at the top or bottom of a file, you can use the commands '''head''' and '''tail'''. 
Prompt$ '''head -3''' <FILE>
outputs the first 3 lines of a file.
Prompt$ '''tail -3''' <FILE>
outputs the last 3 lines of a file.

If you want to know the number of words, lines or characters in a file you can use the '''wc''' command.
Prompt$ '''wc''' <FILE>
outputs the number of lines, words and bytes, in that order. This output is always followed by the filename. You can also use options for a more specific functionality.
Prompt$ '''wc -l''' <FILE>
outputs the number of lines in a file.
Prompt$ '''wc -m''' <FILE>
outputs the number of characters in a file.

Keep in mind that commands '''cat''', '''head''', '''tail''' and '''wc''' can all take multiple <FILE> arguments as shown in '''figure 2.1'''.
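To try these commands without touching the datafiles, a small session along the following lines can be used. It is only a sketch: the file sample.txt and its contents are made up for the illustration, and the '''>''' used to create the file is a redirectional operator that is explained later in this section.

```shell
# Work in a throwaway directory so nothing existing is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Create a five-line sample file (sample.txt is a made-up name).
printf 'line %s\n' 1 2 3 4 5 > sample.txt

cat -n sample.txt     # all five lines, each prefixed with its line number
head -2 sample.txt    # the first two lines
tail -2 sample.txt    # the last two lines

wc sample.txt         # lines, words and bytes, followed by the filename
wc -l sample.txt      # only the number of lines
wc -m sample.txt      # only the number of characters
```

Giving '''wc''' several filenames at once prints one row per file plus a final row with the totals.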

You can use '''echo''' to write stuff in the command line interface.
Prompt$ '''echo''' <Whatever you want outputted to the command line interface>
outputs just about anything to the command line interface.

The introductory video should have given you a basic idea of how the commands '''mv''', '''cp''' and '''rm''' work, but there are some extra tricks that are good to know.
Prompt$ '''mv''' file1 /filepath/file2
will move file1 to the directory /filepath and rename it to file2 in the same step.
Prompt$ '''mv -t''' <DIRECTORY> file1..file99
will move any number of files to an existing directory. The '''cp -t''' command works in the same way. Note that '''-t''' is an option of the GNU versions of '''mv''' and '''cp''' found on Linux; the versions shipped with macOS do not have it.
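The following sketch runs these variants in a scratch directory. All file and directory names are invented for the example, and '''-t''' assumes the GNU versions of '''mv''' and '''cp''' that come with Linux and WSL.

```shell
# Work in a throwaway directory so nothing existing is touched.
workdir=$(mktemp -d)
cd "$workdir"

mkdir archive
touch notes.txt a.dat b.dat c.dat

# Move and rename in one step: notes.txt becomes archive/notes_old.txt.
mv notes.txt archive/notes_old.txt

# Copy several files into an existing directory; the originals stay put.
cp -t archive a.dat b.dat

# Move a file; with -t the target directory comes before the files.
mv -t archive c.dat

ls archive
```

Without '''-t''' the same result is obtained by putting the directory last, as in '''mv''' c.dat archive.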

== Exercise 1: Working with datafiles ==
# Download the 3 datafiles if you haven't already.
# Create 3 new files. You can call them whatever you like.
# Create two directories, called test and data.
# Delete two of the files you created and move the remaining file along with the data files to the test directory.
# Move all the files from the test directory to the data directory. Delete the test directory.
# Rename the file you created to mydatafile.gb.
# Copy the content of datafile 1 to mydatafile.gb and check that they're identical.
# Display the content of datafile 1, datafile 2 and datafile 3.
# Count the total number of bytes in the datafiles. ('''Hint:''' Check out the different command line options for '''wc''').

== Standard Streams ==
[[File:Standard streams.png|right|frame|'''Figure 2.2: Standard Streams:''' This figure illustrates the concept of standard streams. You can think of the green box as the interface that you can interact with. This is what we called the command line interface in the previous section. Recall that there are 2 types of user interfaces: the command line interface (CLI) and the graphical user interface (GUI). The yellow box represents the process in which '''cat''' is run on the hardware of your computer. The shell (the command line interpreter) and the kernel oversee this process. The regular output of the process is sent to what's called the standard output (stdout), while error messages are sent to the standard error (stderr). The standard input often originates from the keyboard (as it does when you type '''cat''' in your command line), which is why it's shown in the figure.]]

Now that we have a practical idea of this section's Unix commands, let's discuss the concept of ''standard streams''. This will give you an idea of what exactly is going on when these commands are executed from the command line. 
Standard streams are the channels through which a running program receives its input and delivers its output. It's important to emphasize that there are many streams of data in your computer, but the standard streams are the ones that the user has the most control over. There are 3 types of ''standard streams'': ''standard input (stdin)'', ''standard output (stdout)'' and ''standard error (stderr)''. We'll go through what each term means by using the command '''cat''' as an example. 

Use the Unix command '''cat''' by typing in
Prompt$ '''cat'''
in the terminal. '''cat''' will now wait for input in the form of stdin directly from your keyboard. Simply type something and press 'ENTER'. '''cat''' processes the stdin that you've given it and outputs it as what's called the ''standard output (stdout)''. In this case stdout is just whatever you typed, and it is by default connected to the terminal, which is why it appears there. To end the input and exit the process, press '''Ctrl''' and '''d''' simultaneously. If something goes wrong, an error message is written to the ''standard error (stderr)'' instead. Depending on the error you made, different error messages can appear. If you, for example, type in ''''eccho''' Hello', the error message 'bash: eccho: command not found' might appear on stderr. The stderr is also connected to the terminal by default. Sometimes nothing is written to stdout, because some commands produce no output when they succeed. You've already experienced this in the last section with commands like mkdir, rmdir, rm, cd and so on.

When supplying '''cat''' with a file by typing,
Prompt$ '''cat''' <FILE>
in your command line, it will output the file contents as stdout. It is, however, important to understand that <FILE> is not being fed as stdin to '''cat'''. When you type a command on your command line and the command's program file is present on your system (you can find many of these files in the directory '''/bin'''), the words on the command line, separated by spaces or tabs, are passed to that program as arguments. '''cat''' then opens <FILE> itself. The arguments are definitely data handed to the program, but they are not the stdin.

The stdin is connected to your keyboard, and stdout and stderr are connected to the terminal by default. We can, however, take control of these streams by using redirectional operators, pipelines and the command '''tee'''.

'''Supplementary material on standard output and standard input''' 
[http://www.linfo.org/standard_output.html Standard output] 
[http://www.linfo.org/standard_input.html Standard input]

== Re-directional operators and pipelines ==
Operators are symbols which behave like functions within the shell. The easiest to understand might be ''arithmetic operators'', which use symbols like '''+''' for addition, '''-''' for subtraction, '''=''' for assigning values to variables, and so on. In this section we'll be learning about redirectional operators.
=== Stdout redirectional operators, > and >> ===
'''>''' operator is used to redirect stdout to a file (stderr is redirected separately, with '''2>'''). Here's one way of using it: 
Prompt$ '''cat''' file1 '''>''' file2 
This will redirect stdout of '''cat''' file1 to file2, which is the same as writing the file contents of file1 to file2.
An important feature of '''>''' is that it overwrites the content of the file it is directed to with the output that it receives. So in this case, the file content of file2 will be overwritten with the file content of file1. This means you have to be careful not to overwrite your work when using it.

'''>>''' operator redirects stdout in the same way as '''>''', but appends the output to the file instead of overwriting it (the stderr counterpart is '''2>>'''). For example, 
Prompt$ '''cat''' file1 '''>>''' file2 
would append the file content of file1 to file2.

Here's how the use of these operators would look in a Unix terminal.[[File:Operator Example2.png|none|frame|'''Figure 2.3 Redirectional Operators:''' Here we use the Unix command '''echo''' which simply outputs whatever text you input in the command line. In the example, this output either appends or replaces the file content of Operator_Example.txt]]
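The difference between overwriting and appending, and between stdout and stderr, can be sketched in a scratch directory as follows. The file names are made up for the example. The notation '''2>''' redirects stderr, whereas a plain '''>''' only affects stdout.

```shell
# Work in a throwaway directory so nothing existing is touched.
workdir=$(mktemp -d)
cd "$workdir"

echo "first"  > log.txt    # creates log.txt
echo "second" > log.txt    # overwrites it: log.txt now holds only "second"
echo "third" >> log.txt    # appends: log.txt now holds "second" and "third"
cat log.txt

# ls writes the listing to stdout and its complaint about the missing
# file to stderr. > catches the first, 2> catches the second.
# ls reports a failure here, so "|| true" lets a strict script carry on.
ls log.txt no_such_file > found.txt 2> errors.txt || true

cat found.txt     # the name of the file that exists
cat errors.txt    # the error message about the one that does not
```

Because the two streams end up in separate files, error messages do not get mixed into the data you want to keep.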

'''cat''' can be used in a similar fashion as '''echo''' <STRING> '''>''' <FILE>, to add text to files.
Prompt$ '''cat''' '''>''' <FILE>
will wait for stdin from the user and write it to <FILE>. This is because the stdin is connected to the keyboard by default. To exit, simply hold the '''Ctrl''' key while pressing '''d'''. After entering your text and before exiting, it's a good idea to press Enter, or else the command line will look a bit weird: the prompt and the text you just entered will end up on the same line, which you might find confusing. This application works for the '''>>''' operator as well.
[[File:Echo&cat.png|none|frame|'''Figure 2.4 Using cat or echo to add text to files''']]
=== < operator ===
'''<''' is the redirectional operator for the standard input, and it is used to feed the contents of a file to a command as stdin. This is useful for commands that require additional input from the user, and we'll take a look at such commands in later sections. To give an example, when you're downloading and installing packages, you'll be prompted for stdin to confirm that it's okay for the package to use the stated amount of space on your device. In such a case, you need to type 'y' for yes or 'n' for no. If you're doing many time-consuming package installations, it can be quite annoying to have to be around just to press 'y' once in a while. Therefore, it is super-handy that you can use the '''<''' operator to supply the stdin that you need. Note that '''<''' expects a file on its right-hand side, not a command, so the answer is first written to a file with '''echo''' (answer.txt is just an example name),

Prompt$ '''echo''' 'y' '''>''' answer.txt
Prompt$ '''apt install''' <package> '''<''' answer.txt

The '''apt''' command is short for 'Advanced Package Tool' and is the standard packaging tool on Debian-based Linux distributions such as Ubuntu. We'll learn more about this in the section 'File compression and advanced packaging tools', so don't worry about it now.
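As a sketch of how '''<''' behaves, the example below compares giving '''wc''' a filename argument with feeding it the same file through stdin. The file name answer.txt is made up for this illustration, and '''read''' is a shell built-in (not covered in this section) that stores one line of stdin in a variable.

```shell
# Work in a throwaway directory so nothing existing is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Put the answer 'y' into a file.
echo 'y' > answer.txt

# With a filename argument, wc opens the file itself and prints its name.
wc -l answer.txt

# With <, the shell opens the file and connects it to stdin.
# wc never learns the name, so it prints only the count.
wc -l < answer.txt

# read takes one line from stdin, here supplied from the file
# instead of from the keyboard.
read -r reply < answer.txt
echo "reply was: $reply"
```

The last two commands show the general pattern: anything that would otherwise wait for keyboard input can be given prepared input from a file.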

=== Pipelines ===
Making pipelines or 'pipelining' as it is sometimes called, is similar to the concept of redirectional operators. Pipelines are used to redirect the stdout of one Unix command as the stdin to another Unix command. A good example of this is:

Prompt$ '''cat''' <some big file> '''| less'''

will feed the stdout of '''cat''' <some big file> as stdin to '''less'''. If you just write '''cat''', all of the contents of the file will rapidly be displayed on your screen, and it can be a real pain to scroll all the way to the top in order to read the text. But by piping '''cat''' into '''less''', you can scroll through the file in small segments at a time (see the Unix command table or google '''man less''' for instructions on how to scroll through the file).

Not all commands read from stdin. An example of this is '''echo''', which only outputs its command line argument <STRING>. If you tried to pipe the stdout of another command into '''echo''', it wouldn't work.
Prompt$ '''cat''' <FILE> '''| echo'''
outputs only a blank line.
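The contrast between a command that reads stdin and one that ignores it can be sketched like this. The file words.txt is made up for the example.

```shell
# Work in a throwaway directory so nothing existing is touched.
workdir=$(mktemp -d)
cd "$workdir"

# A three-line sample file.
printf 'alpha\nbeta\ngamma\n' > words.txt

# wc reads stdin, so it can sit at the receiving end of a pipe
# and counts the three lines it is fed.
cat words.txt | wc -l

# echo ignores stdin and prints only its arguments. There are none,
# so the result is a single blank line.
cat words.txt | echo
```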

=== Some Examples: Simple Piping ===
These examples aren't necessarily useful, but they should give you a better idea of what pipelines are and how they can be constructed. Try them out yourself on this section's datafiles.

Prompt$ '''head -5''' <datafile> '''| tail -2'''
The first 5 lines are extracted from datafile and fed to '''tail -2''', which extracts the last 2 of them and outputs them to the command line interface. 
Prompt$ '''tail -10''' <datafile> '''| wc -c >''' 10_tail_chars.txt
The last 10 lines are extracted from datafile and fed to '''wc -c''', which counts the bytes (use '''wc -m''' to count characters). The count is redirected and saved to 10_tail_chars.txt. 
Prompt$ '''head -5''' <datafile> '''| cut -f 2-4 >>''' columns2to4.txt
The first 5 lines are extracted from datafile and fed to '''cut -f 2-4''', which extracts columns 2 to 4. These columns are then appended to columns2to4.txt. If you wanted to cut out only columns 2 and 4, you could instead write '''cut -f 2,4'''.
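If you want to see exactly what each stage does, it can help to run similar pipelines on a tiny file whose contents you know. The sketch below builds such a file; table.tsv and its contents are invented for the illustration and are not one of the course datafiles.

```shell
# Work in a throwaway directory so nothing existing is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Six rows with four tab-separated columns each (made-up data).
for i in 1 2 3 4 5 6; do
    printf 'r%s\ta%s\tb%s\tc%s\n' "$i" "$i" "$i" "$i"
done > table.tsv

# Rows 4 and 5: the last two of the first five.
head -5 table.tsv | tail -2

# Columns 2 and 4 of the first five rows.
head -5 table.tsv | cut -f 2,4

# Byte count of the last three rows, saved to a file.
tail -3 table.tsv | wc -c > tail_bytes.txt
cat tail_bytes.txt
```

Because every row follows the same pattern, it is easy to check by eye that each pipeline picked the rows and columns you expected.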

=== tee command ===
The function of the '''tee''' command is to copy its stdin both to stdout and to a file. The command is named after the T-splitter used in plumbing and the T-shape that is illustrated in '''figure 2.5'''.

[[File:Tee command.png|frame|none|'''Figure 2.5 Tee command''']]

The '''tee''' command is useful if you're making a long pipeline and want to save intermediary results into files. It's good general practice when building long pipelines: if something is wrong in the final output, you can check where it went wrong by looking at your intermediary files.

[[File:Tee example1.png|frame|none|'''Figure 2.6 Tee example:''' A file called header_file.gb is made using the command '''touch''' and then a pipeline is constructed. When the pipeline is executed from the command line, the header of the GenBank file is saved to a file, and the number of bytes is outputted to the terminal. The contents of header_file.gb are then displayed with '''cat'''.]]

This command has a couple of command options which you can check out with the '''man''' command. One of the more useful command options is the '''-a''' option, short for append,
Prompt$ '''tee -a''' <FILE>
will append to the file instead of overwriting it. This is useful in a scenario where you want to save multiple intermediary outputs in the same file.
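A minimal sketch of both uses of '''tee''' is shown below. The file names input.txt, step1.txt and steps.log are made up for the example.

```shell
# Work in a throwaway directory so nothing existing is touched.
workdir=$(mktemp -d)
cd "$workdir"

# A four-line sample file.
printf 'one\ntwo\nthree\nfour\n' > input.txt

# Save the first three lines to step1.txt and, in the same pipeline,
# pass them on to wc, which counts them.
head -3 input.txt | tee step1.txt | wc -l

# With -a the file is appended to, so steps.log collects the
# intermediate output of both pipelines.
head -1 input.txt | tee -a steps.log | wc -c
tail -1 input.txt | tee -a steps.log | wc -c

cat steps.log
```

Without '''-a''', the second pipeline would have replaced the contents of steps.log instead of adding to them.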
== Exercise 2: Re-directional operators and Pipelines ==
# Merge the lines of datafiles 2 & 3 and save them to mergefile.dat. Try displaying its content and make sure that accession numbers are on the left and the data on the right. It doesn't matter whether the files are of equal length: if one file runs out of lines, empty fields will simply be added instead.
# Take the first 5 lines of mergefile.dat, cut out the first and third column and save it as columns1and3.dat.
# Count the number of characters in ALL of the files and append the results to a file called charsinfiles.dat.
# Make a pipeline that saves the bottom part of datafile1 in extracted_data.gb and the number of bytes in another file, bytefile.dat.
# Make a pipeline that saves the header of datafile1, appends it to extracted_data.gb and then appends the number of bytes to bytefile.dat.
== Exercise 3: Moving and removing files across the file system ==
This exercise is a repetition of what you learned in the last section about navigating the file system combined with the commands you learned in this section.
[[File:Directory Branch.png|frame|'''Figure 2.7 Ex3 Branch of directories''']]
# Make a branch of directories like the one shown in '''figure 2.7'''.
# Move datafiles 1,2 and 3 to directory AB.
# Make a copy of datafile 1 in A5 called datafile_copy1, a copy of datafile 2 in A7 called datafile_copy2 and a copy of datafile 3 in B7 called datafile_copy3.
# Move datafile_copy1 and datafile_copy2 back to AB. You should do this without making A5 and A7 your current working directory ('''Hint 1 ''')
# Make two files in B7 called extra_copy1 and extra_copy2.
# Move datafile_copy3, extra_copy1 and extra_copy2 to AB.
# Rename extra_copy2 as extra_copy1.
# Remove datafile_copy1, datafile_copy2, datafile_copy3 and extra_copy1.

'''Hint 1:''' You can move files in other directories than the one you're in by specifying an absolute path.

Standard streams and working with files

2024-03-20T11:56:18Z

WikiSysop: Created page with "__NOTOC__ In the last section we learned how to make directories and move around in the file system, but we didn't actually learn how to work with files. So in this section we'll be doing just that. Many of the commands you'll be learning in this course can receive data from the standard input and write to something called the standard output, so later in this section we'll introduce the concept of standard streams. Lastly we'll look at how we can change the direct..."

__NOTOC__
In the last section we learned how to make directories and move around in the file system, but we didn't actually learn how to work with files. So in this section we'll be doing just that. 

Many of the commands you'll be learning in this course can receive data from the standard input and write to something called the standard output, so later in this section we'll introduce the concept of standard streams. Lastly we'll look at how we can change the direction of the standard output and standard input with re-directional operators and pipelines.

== Introduction to commands ==
Here we list Unix commands and their main function but it's important to keep in mind that Unix commands are versatile and it is possible to complete the same tasks using different commands.
{| class="wikitable"
|-
!Unix Command
!Acronym translation
!Description
|-
|'''touch''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Touches a file. If the file doesn't exist, an empty file with the specified name is created. If it already exists, its modification time is updated.
|-
|'''mv''' [OPTION] <FILE> <destination directory or another filename>
|Move
|Moves a file to a specified directory. It can also be used to rename files.
|-
|'''rm''' [OPTION] <FILE>
|Remove
|Removes the specified file. With the '''-r''' option it can also remove directories together with their contents.
|-
|'''cp''' [OPTION] <FILE> <destination directory or another file>
|Copy
|Works a lot like mv, but leaves the original in place and creates a copy instead. Can also be used to copy the content of one file to another file.
|-
|'''cat''' [OPTION] <FILE>
|Concatenate
|Concatenates files and writes the result to standard output. If used on one file, the content of that file is displayed in the command line interface.
|-
|'''head''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Outputs the first part of a file
|-
|'''tail''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Outputs the last part of a file
|-
|'''less''' <FILE>
|<nowiki>-</nowiki>
|Shows one screenful of the file at a time. This is a useful command for viewing big files, as it loads small segments at a time. q --> quit, space --> scroll forward one page, b --> scroll backward one page. Arrow keys can be used to scroll up and down one line at a time.
|-
|'''wc''' [OPTION] <FILE>
|Word count
|Counts the lines, words and bytes in the file/files. Options select which of these counts are shown.
|-
|'''paste''' [OPTION] <FILE>
|<nowiki>-</nowiki>
|Merges lines from different files.
|-
|'''cut''' <nowiki>[OPTION]</nowiki> <FILE>
|<nowiki>-</nowiki>
|Extracts selected parts of each line of a file (for example certain columns), depending on what is specified in the option.
|-
|'''echo''' [OPTION] <STRING>
|<nowiki>-</nowiki>
|Outputs the string to your command line interface. In computer language, a string is just a sequence of characters.
|-
|'''wget''' [OPTION] <URL>
|web get
|A non-interactive network downloader used to download files located at the URL.
|-
|'''curl''' [OPTION] <URL>
|client URL
|Similarly to '''wget''', it is used to download files at the specified URL. This is an alternative for macOS users, where '''wget''' is not installed by default.
|-
|'''tee''' [OPTION] <FILE>
|It's named after the 'T-splitter' used in plumbing.
|Splits output so that it can be outputted to both the terminal and a file.
|}

As an introduction, you can watch this youtube video on the use of some of the commands, [https://www.youtube.com/watch?v=VgI4UKyL0Lc Unix Commands for working with files]. The video introduces some of the Unix commands for navigation that you learned in the last section but it also introduces the commands: '''touch''', '''mv''' and '''cp'''.

=== Datafiles ===
Below are download links for this section's datafiles. You can download them by right-clicking and then choosing the option 'Download link'. 
Datafile 1: [http://teaching.bioinformatics.dtu.dk/material/unix/Pseudomonas_Aeruginosa_16SrRNA.gb Pseudomonas Aeruginosa 16S rRNA GenBank file] 
Datafile 2: [http://teaching.bioinformatics.dtu.dk/material/36610/ex1.acc ex1.acc] 
Datafile 3: [http://teaching.bioinformatics.dtu.dk/material/36610/ex1.dat ex1.dat] 
You can also right click, copy the link address and type
Prompt$ '''wget''' <the link address you copied>
in your Unix terminal. You might have trouble pasting the link into your terminal because the keyboard shortcut is not necessarily '''Ctrl-V'''. On Ubuntu WSL, the shortcut for pasting is simply right-clicking and for copying it's '''Ctrl-Shift-C'''. On MobaXterm it should be Shift-Ins or the middle mouse button (the one you'd normally use for scrolling) if you have one of those. It might be set differently on your MobaXterm, however, and you can check this under Settings --> Keyboard shortcuts --> Paste in terminal. You can copy text in MobaXterm by left-clicking and marking the text you want copied.

'''A little background on the files:''' Datafile 1 is a GenBank file that contains information about the 16S rRNA DNA sequence of the pathogenic bacterium Pseudomonas aeruginosa. The DNA sequence of 16S rRNA is a highly conserved region in bacteria and is often used to identify bacteria. Datafile 2 contains 3 tab-separated columns of numerical data and datafile 3 contains 2 tab-separated columns of accession numbers. A tab is a single character that the terminal usually displays as several spaces, and accession numbers are unique identifier tags for DNA sequence records.

=== Examples of how this section's Unix commands can be used ===

All of the commands below are executed in the command line interface. For example, when using the '''wc''' command your terminal should look like '''figure 2.1'''. In '''figure 2.1''', the command is executed in Ubuntu on the Windows Subsystem for Linux (WSL), so the colouring might be different on your computer. The syntax, however, is the same.
[[File:Standard streams wcexample2.png|none|frame|'''Figure 2.1 Using wc command in a Unix environment:''' The '''wc''' command is executed in Ubuntu on the Windows Subsystem for Linux (WSL), so it might look a little different if you're using a different Unix environment.]] 

'''cat''', short for concatenate, is often used to display file contents.
Prompt$ '''cat''' <FILE>
outputs the content of the file. You can also combine '''cat''' with options for different functionalities.
Prompt$ '''cat -n''' <FILE>
outputs the file content with each line prefixed by its line number. '''cat''' has other useful applications as well, but these require an understanding of ''redirectional operators'', so we'll save them for later in this section.

If you're interested in the file content at the top or bottom of a file, you can use the commands '''head''' and '''tail'''. 
Prompt$ '''head -3''' <FILE>
outputs the first 3 lines of a file.
Prompt$ '''tail -3''' <FILE>
outputs the last 3 lines of a file.

If you want to know the number of words, lines or characters in a file you can use the '''wc''' command.
Prompt$ '''wc''' <FILE>
outputs the number of lines, words and bytes, in that order. This output is always followed by the filename. You can also use options for a more specific functionality.
Prompt$ '''wc -l''' <FILE>
outputs the number of lines in a file.
Prompt$ '''wc -m''' <FILE>
outputs the number of characters in a file.

Keep in mind that commands '''cat''', '''head''', '''tail''' and '''wc''' can all take multiple <FILE> arguments as shown in '''figure 2.1'''.

You can use '''echo''' to write stuff in the command line interface.
Prompt$ '''echo''' <Whatever you want outputted to the command line interface>
outputs just about anything to the command line interface.

The introductory video should have given you a basic idea of how the commands '''mv''', '''cp''' and '''rm''' work, but there are some extra tricks that are good to know.
Prompt$ '''mv''' file1 /filepath/file2
will move file1 to the directory /filepath and rename it to file2 in the same step.
Prompt$ '''mv -t''' <DIRECTORY> file1..file99
will move any number of files to an existing directory. The '''cp -t''' command works in the same way. Note that '''-t''' is an option of the GNU versions of '''mv''' and '''cp''' found on Linux; the versions shipped with macOS do not have it.

== Exercise 1: Working with datafiles ==
# Download the 3 datafiles if you haven't already.
# Create 3 new files. You can call them whatever you like.
# Create two directories, called test and data.
# Delete two of the files you created and move the remaining file along with the data files to the test directory.
# Move all the files from the test directory to the data directory. Delete the test directory.
# Rename the file you created to mydatafile.gb.
# Copy the content of datafile 1 to mydatafile.gb and check that they're identical.
# Display the content of datafile 1, datafile 2 and datafile 3.
# Count the total number of bytes in the datafiles. ('''Hint:''' Check out the different command line options for '''wc''').

== Standard Streams ==
[[File:Standard streams.png|right|frame|'''Figure 2.2: Standard Streams:''' This figure illustrates the concept of standard streams. You can think of the green box as the interface that you can interact with. This is what we called the command line interface in the previous section. Recall that there are 2 types of user interfaces: the command line interface (CLI) and the graphical user interface (GUI). The yellow box represents the process in which '''cat''' is run on the hardware of your computer. The shell (the command line interpreter) and the kernel oversee this process. The regular output of the process is sent to what's called the standard output (stdout), while error messages are sent to the standard error (stderr). The standard input often originates from the keyboard (as it does when you type '''cat''' in your command line), which is why it's shown in the figure.]]

Now that we have a practical idea of this section's Unix commands, let's discuss the concept of ''standard streams''. This will give you an idea of what exactly is going on when these commands are executed from the command line. 
Standard streams are the channels through which a running program receives its input and delivers its output. It's important to emphasize that there are many streams of data in your computer, but the standard streams are the ones that the user has the most control over. There are 3 types of ''standard streams'': ''standard input (stdin)'', ''standard output (stdout)'' and ''standard error (stderr)''. We'll go through what each term means by using the command '''cat''' as an example. 

Use the Unix command '''cat''' by typing in
Prompt$ '''cat'''
in the terminal. This will prompt you, the user, to give '''cat''' some input in the form of stdin directly from your keyboard. Simply type something and press 'ENTER'. To exit the process, press '''Ctrl''' and '''d''' simultaneously. The command '''cat''' processes the stdin that you've given it and outputs it as what's called the ''standard output (stdout)''. In this case stdout is just whatever you typed, and it is by default connected to the terminal, which is why it appears there. If the process wasn't successful, a message is written to the ''standard error (stderr)'' instead. Depending on the error you made, different error messages can appear. If you, for example, type in '''eccho''' Hello, the error message 'bash: eccho: command not found' might be written to stderr. The stderr is also connected to the terminal by default. Sometimes nothing appears on stdout, because some commands write nothing to stdout when they succeed. You've already experienced this in the last section with commands like mkdir, rmdir, rm, cd and so on.
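
The difference between the two output streams can be made visible with a small sketch. It assumes a POSIX-style shell; the names <code>no_such_file</code>, <code>error_message.txt</code> and <code>new_directory</code> are made up for the illustration, and the <code>2></code> notation simply means "send stderr to this file".

```shell
# Work in a throwaway directory so none of your own files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# echo writes its text to stdout. $(...) captures stdout in a variable.
out=$(echo "Hello")
echo "stdout was: $out"

# cat on a missing file writes its complaint to stderr, not stdout.
# 2> sends stderr (stream number 2) to a file so we can look at it.
cat no_such_file 2> error_message.txt || echo "cat failed with exit status $?"
cat error_message.txt

# mkdir succeeds silently: nothing is written to stdout.
silent=$(mkdir new_directory)
echo "mkdir printed: '$silent'"
```

The exact wording of the error message depends on your system, which is why none is shown here.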

When supplying '''cat''' with a file by typing,
Prompt$ '''cat''' <FILE>
in your command line, it will output the file contents as stdout. It is, however, important to understand that <FILE> is not being fed as stdin to '''cat'''. When you type a command on your command line and the command file is present on your system (you can find most of these files by going to the directory '''/bin'''), the words that follow the command name on the command line are passed to this file as command line arguments. That is certainly data given to the command, but it's not the stdin.

The stdin is connected to your keyboard, and stdout and stderr are directed to the terminal by default. We can, however, take control of these streams by using redirectional operators, pipelines and the command '''tee'''.

'''Supplementary material on standard output and standard input''' 
[http://www.linfo.org/standard_output.html Standard output] 
[http://www.linfo.org/standard_input.html Standard input]

== Re-directional operators and pipelines ==
Operators are symbols which behave like functions within the shell. The easiest to understand might be ''arithmetic operators'', which use symbols like '''+''' for addition, '''-''' for subtraction, '''=''' for assigning values to variables, and so on. In this section we'll be learning about redirectional operators.
=== Stdout redirectional operators, > and >> ===
'''>''' operator is used to redirect stdout. (To redirect stderr instead, write '''2>'''.) Here's one way of using it: 
Prompt$ '''cat''' file1 '''>''' file2 
This will redirect the stdout of '''cat''' file1 to file2, which is the same as writing the file contents of file1 to file2.
An important feature of '''>''' is that it overwrites the content of the file it is directed to with the output that it receives. So in this case, the file content of file2 will be overwritten with the file content of file1. This means you have to be careful not to overwrite your work when using it.

'''>>''' operator redirects stdout in the same way as '''>''' (and '''2>>''' does the same for stderr), but appends the output to the file instead of overwriting it. For example, 
Prompt$ '''cat''' file1 '''>>''' file2 
would append the file content of file1 to file2.

Here's how the use of these operators would look in a Unix terminal.[[File:Operator Example2.png|none|frame|'''Figure 2.3 Redirectional Operators:''' Here we use the Unix command '''echo''' which simply outputs whatever text you input in the command line. In the example, this output either appends or replaces the file content of Operator_Example.txt]]
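
The same behaviour can be tried out with the following sketch, which runs in a temporary directory. The file names <code>operator_example.txt</code> and <code>copy.txt</code> are chosen for the illustration only.

```shell
# Work in a throwaway directory so none of your own files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# > creates the file, or overwrites it if it already exists.
echo "first line" > operator_example.txt
echo "second line" > operator_example.txt
cat operator_example.txt     # "first line" has been overwritten

# >> appends to the end of the file instead.
echo "third line" >> operator_example.txt
cat operator_example.txt

# The same two operators applied to the stdout of cat.
cat operator_example.txt > copy.txt
cat operator_example.txt >> copy.txt
wc -l copy.txt               # copy.txt now holds the two lines twice
```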

'''cat''' can be used in a similar fashion as '''echo''' <STRING> '''>''' <FILE>, to add text to files.
Prompt$ '''cat''' '''>''' <FILE>
will wait for stdin from the user and write it to <FILE>. This is because the stdin is connected to the keyboard by default. To exit, simply hold the '''Ctrl''' key while pressing '''d'''. After entering your text and before exiting, it's a good idea to press 'ENTER'. Otherwise the command line prompt and the text you just entered will end up on the same line, which you might find confusing. This application works for the '''>>''' operator as well.
[[File:Echo&cat.png|none|frame|'''Figure 2.4 Using cat or echo to add text to files''']]
=== < operator ===
'''<''' is the redirectional operator for the standard input. It makes a command read its stdin from a file instead of from the keyboard. This is useful for commands that require additional input from the user and we'll take a look at such commands in later sections. To give an example, when you're downloading and installing packages, you'll often be prompted for stdin to confirm that it's okay for the package to use a given amount of space on your device. In such a case, you need to type 'y' for yes or 'n' for no. If you're doing many time-consuming package installations, it can be quite annoying to have to be around just to press 'y' once in a while. Therefore, it is super-handy that you can use the '''<''' operator to supply the stdin that is needed. One way is to first save the answer in a file with '''echo''' and then redirect that file to the command,

Prompt$ '''echo''' 'y' '''>''' <FILE>

Prompt$ '''apt install''' <package> '''<''' <FILE>

The '''apt''' command is short for 'Advanced Package Tool' and is the standard packaging tool on Debian-based Linux distributions such as Ubuntu. We'll learn more about this in the section 'File compression and advanced packaging tools', so don't worry about it now.
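
Installing packages requires administrator rights, so here is a harmless sketch of the same idea. It assumes a POSIX-style shell and uses the shell built-in '''read''' as a stand-in for a command that waits for an answer; the file names <code>words.txt</code> and <code>answer.txt</code> are made up for the illustration.

```shell
# Work in a throwaway directory so none of your own files are touched.
workdir=$(mktemp -d)
cd "$workdir"
printf 'alpha\nbeta\ngamma\n' > words.txt

# File name as argument: wc opens the file itself and prints its name.
wc -l words.txt

# File as stdin: the shell opens words.txt, so wc only sees the data.
wc -l < words.txt

# Answering a prompt from a file: read takes one line from stdin.
echo 'y' > answer.txt
read reply < answer.txt
echo "The command received: $reply"
```

Comparing the two '''wc''' calls shows the difference between passing a file as a command line argument and feeding it as stdin.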

=== Pipelines ===
Making pipelines or 'pipelining' as it is sometimes called, is similar to the concept of redirectional operators. Pipelines are used to redirect the stdout of one Unix command as the stdin to another Unix command. A good example of this is:

Prompt$ '''cat''' <some big file> '''| less'''

This feeds the stdout of '''cat''' <some big file> as stdin to '''less'''. If you just write '''cat''', all of the contents of the file will rapidly be displayed on your screen and it can be a real pain to scroll all the way to the top in order to read the text. But by piping '''cat''' into '''less''', you can scroll through the file a small segment at a time (see the Unix command table or google '''man less''' for instructions on how to scroll through the file).

Just as some commands write nothing to stdout, not all commands read from stdin. An example of this is '''echo''', which only outputs its command line argument <STRING>. If you pipe the stdout of another command into '''echo''', the piped data is simply ignored.
Prompt$ '''cat''' <FILE> '''| echo'''
This will only output a blank line.
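
A quick way to convince yourself of this is to send the same text through a pipe to '''echo''' and to '''cat''' and compare what comes out. This is only a sketch; the variable names are made up for the illustration.

```shell
# echo never reads stdin, so the piped text is lost.
piped_to_echo=$(echo "hello" | echo)

# cat does read stdin, so the piped text passes straight through.
piped_to_cat=$(echo "hello" | cat)

echo "echo gave: '$piped_to_echo'"
echo "cat gave:  '$piped_to_cat'"
```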

=== Some Examples: Simple Piping ===
These examples aren't necessarily useful, but they should give you a better idea of what pipelines are and how they can be constructed. Try them out yourself on this section's datafiles.

'''head -5''' <datafile> '''| tail -2 '''
The first 5 lines are extracted from datafile and fed to '''tail -2''', which extracts the last 2 lines and outputs to the command line interface. 
'''tail -10''' <datafile> '''| wc -c > 10_tail_chars.txt'''
The last 10 lines are extracted from datafile and fed to '''wc -c''', which counts the bytes (for plain English text this equals the number of characters). The count is redirected and saved to 10_tail_chars.txt. 
'''head -5''' <datafile> '''| cut -f 2-4 >> columns2to4.txt'''
The first 5 lines are extracted from datafile and fed to '''cut -f 2-4''', which extracts columns 2 to 4. Columns 2 to 4 are then appended to columns2to4.txt. If you wanted to cut out only columns 2 and 4, you could instead write '''cut -f 2,4'''.
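
If you don't have the datafiles at hand, the sketch below builds a small tab-separated file and runs the same kind of pipelines on it. The file names <code>datafile.tsv</code> and <code>tail_bytes.txt</code> are made up for the illustration, and the line counts are written as '''-n 5''' rather than '''-5''', which is the more portable spelling of the same option.

```shell
# Work in a throwaway directory so none of your own files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small tab-separated file: 8 lines with 4 columns each.
for i in 1 2 3 4 5 6 7 8; do
    printf 'id%s\ta%s\tb%s\tc%s\n' "$i" "$i" "$i" "$i"
done > datafile.tsv

# The last 2 of the first 5 lines, i.e. lines 4 and 5.
head -n 5 datafile.tsv | tail -n 2

# Byte count of the last 3 lines, saved to a file.
tail -n 3 datafile.tsv | wc -c > tail_bytes.txt
cat tail_bytes.txt

# Columns 2 to 4 of the first 5 lines, appended to a file.
head -n 5 datafile.tsv | cut -f 2-4 >> columns2to4.txt
cat columns2to4.txt
```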

=== tee command ===
The '''tee''' command reads from stdin and writes what it receives both to stdout and to a file, in effect splitting the stream. The command is named after the T-splitter used in plumbing, the T-shape that is illustrated in '''figure 2.5'''.

[[File:Tee command.png|frame|none|'''Figure 2.5 Tee command''']]

The '''tee''' command is useful if you're making a long pipeline and you want to save intermediary results into files. It's actually just good general practice when making long pipelines. This way, if something is wrong in the final output, you can check where it went wrong by looking at your intermediary files.

[[File:Tee example1.png|frame|none|'''Figure 2.6 Tee example:''' A file called header_file.gb is made using the command '''touch''' and then a pipeline is constructed. When the pipeline is executed from the command line, the header of the GenBank file is saved to a file, and the number of bytes is outputted to the terminal. The contents of header_file.gb are then displayed with '''cat'''.]]

This command has a couple of command options which you can check out with the '''man''' command. One of the more useful options is '''-a''', short for append,
Prompt$ '''tee -a''' <FILE>
which will append to the file instead of overwriting it. This is useful in a scenario where you want to save multiple intermediary outputs in the same file.
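
As a minimal sketch of both uses, the pipelines below save an intermediary result with '''tee''' while still passing it on to '''wc'''. The file names <code>record.txt</code> and <code>header.txt</code> and the four example lines are made up for the illustration.

```shell
# Work in a throwaway directory so none of your own files are touched.
workdir=$(mktemp -d)
cd "$workdir"
printf 'LOCUS\nDEFINITION\nACCESSION\nORIGIN\n' > record.txt

# tee saves the first 2 lines in header.txt AND passes them on to wc.
header_bytes=$(head -n 2 record.txt | tee header.txt | wc -c)
echo "Bytes in the first 2 lines: $header_bytes"

# tee -a appends the last line to the same file instead of overwriting.
tail -n 1 record.txt | tee -a header.txt | wc -c

# header.txt now holds both intermediary results.
cat header.txt
```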
== Exercise 2: Re-directional operators and Pipelines ==
# Merge the lines of datafiles 2 & 3 and save them to mergefile.dat. Try displaying its content and make sure that accession numbers are on the left and the data on the right. It doesn't matter whether the files are of equal length; if there are no more lines in one file, blank lines will simply be added instead.
# Take the first 5 lines of mergefile.dat, cut out the first and third column and save it as columns1and3.dat.
# Count the number of characters in ALL of the files and append the results to a file called charsinfiles.dat.
# Make a pipeline that saves the bottom part of datafile1 in extracted_data.gb and the number of bytes in another file, bytefile.dat.
# Make a pipeline that saves the header of datafile1, appends it to extracted_data.gb and then appends the number of bytes to bytefile.dat.
== Exercise 3: Moving and removing files across the file system ==
This exercise is a repetition of what you learned in the last section about navigating the file system combined with the commands you learned in this section.
[[File:Directory Branch.png|frame|'''Figure 2.7 Ex3 Branch of directories''']]
# Make a branch of directories like the one shown in '''figure 2.7'''.
# Move datafiles 1,2 and 3 to directory AB.
# Make a copy of datafile 1 in A5 called datafile_copy1, a copy of datafile 2 in A7 called datafile_copy2 and a copy of datafile 3 in B7 called datafile_copy3.
# Move datafile_copy1 and datafile_copy2 back to AB. You should do this without making A5 and A7 your current working directory ('''Hint 1 ''')
# Make two files in B7 called extra_copy1 and extra_copy2.
# Move datafile_copy3, extra_copy1 and extra_copy2 to AB.
# Rename extra_copy2 as extra_copy1.
# Remove datafile_copy1, datafile_copy2, datafile_copy3 and extra_copy1.

'''Hint 1:''' You can move files in other directories than the one you're in by specifying an absolute path.

File:Directory Branch.png

2024-03-20T11:55:38Z

WikiSysop:

File:Navigating from home&root.png

2024-03-20T11:55:08Z

WikiSysop:

File:Ubuntu CLI.png

2024-03-20T11:54:36Z

WikiSysop:

File:MobaXterm CLI Structure.png

2024-03-20T11:54:15Z

WikiSysop:

File:File structure2.png

2024-03-20T11:53:42Z

WikiSysop: