File compression and advanced packaging tools

You might've encountered files with extensions like '.tar', '.gz' and '.zip' when downloading files from the internet. These extensions mean that the file has been archived or compressed, i.e. its size in bytes has been reduced so that it requires less disk space. A byte is simply a sequence of 8 bits, and we've previously seen how bytes are used to store ASCII characters. Files that have been compressed can also be decompressed, which makes them readable again but takes up more bytes on your disk. In this section, we first discuss how it's possible to reduce the byte size of a file and afterwards restore it to its original size. Next we discuss the file extensions '.tar', '.gz' and '.zip': what they mean and how they differ. In short, tar doesn't really compress files, and there are different ways in which files can be compressed, hence the different file extensions like '.gz' and '.zip'. We will also introduce the commands tar, gzip and zip and show how they can be used to compress and decompress files.

Lastly, we discuss package managers and how the command apt, short for 'Advanced Package Tool', is used. If you're running a Unix terminal on Mac OS, you won't be able to run the command apt, because it's not supported there. In order to use it, you'd need to either use a virtual machine (for example with VirtualBox) or connect to a server running a Debian-based Linux OS such as Ubuntu. We haven't talked about connecting to remote servers, and you'd have to find such a server as well, so if you want to try these commands out, the easiest option would be to use VirtualBox (there's a link to a guide in the section 'Course Introduction'). You could also try out the Mac OS counterpart of apt, which is called 'Homebrew'. Here's an introductory video for this package manager: Homebrew Guide.

At first it might seem rather mysterious that files can be reduced in byte size and decompressed back to their original byte size. The concept, however, is actually quite simple. Files, especially text files, contain many patterns that are redundant and appear multiple times. The idea is therefore to make a dictionary that assigns each of these patterns a short bit value. During compression, every occurrence of a pattern is replaced with its bit value, and during decompression the dictionary is used to put the original patterns back. This type of compression is called lossless compression, and for an intuitive understanding of it, you can check out this video by Crash Course: File compression (0:00-6:26).
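The effect of redundancy can be tried out directly. The following is a minimal sketch, assuming a POSIX-like shell and that gzip (introduced later in this section) is installed; the file names are made up for illustration. It compresses a highly repetitive text file and a file of random bytes of the same size, and prints the resulting sizes.

```shell
#!/bin/sh
# Sketch: redundant data compresses far better than random data.
# File names are hypothetical; everything happens in a temporary directory.
workdir=$(mktemp -d)

# 10000 identical lines: extremely redundant.
yes "the same line, over and over" | head -n 10000 > "$workdir/redundant.txt"
original_size=$(( $(wc -c < "$workdir/redundant.txt") ))

# The same number of random bytes: almost no repeated patterns.
head -c "$original_size" /dev/urandom > "$workdir/random.bin"

# gzip -c writes the compressed data to stdout, so we can count its bytes.
redundant_gz=$(( $(gzip -c "$workdir/redundant.txt" | wc -c) ))
random_gz=$(( $(gzip -c "$workdir/random.bin" | wc -c) ))

echo "original size of each file: $original_size bytes"
echo "redundant.txt compressed:   $redundant_gz bytes"
echo "random.bin compressed:      $random_gz bytes"
```

The repetitive file should shrink to a small fraction of its size, while the random file stays at roughly its original size.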

For obvious reasons, files that have many redundancies can be compressed significantly more than files with few repeated patterns, such as music, video and picture files. This, however, is only true if you want the decompressed file to have exactly the same shade of blue or sound frequency as the original file. The type of compression we've been discussing up till now is what's called lossless compression. As the name implies, none of the file content is lost during this type of compression, and the original file can be recreated perfectly. But there's actually another type of compression called lossy compression. The rest of the Crash Course video linked above is about this sort of compression (6:26-12:47). Lossy compression operates by removing unnecessary bits of information. After all, the human ear and eye are not sensitive enough to perceive small differences in color shades and sound frequencies. Software that uses lossy compression sets nearly identical color values of pixels, or frequencies of sounds, to the same value, which reduces the byte size while keeping the result practically indistinguishable to us.

What are tar, gzip and zip

Here, we list the commands we'll be using in this section, which include both those used for compression/decompression of files and those needed for package management.

Unix Command | Acronym translation | Description
tar [OPTION] <ARCHIVE> <FILES> | Tape archive | Archive utility, used to create and extract archives. Archives are simply multiple files that have been combined into one file.
gzip [OPTION] <FILE> | GNU zip | Compresses <FILE>.
gunzip [OPTION] <FILE> | GNU unzip | Decompresses <FILE>.
zip [OPTION] <ARCHIVE> <FILES> | (not an acronym) | Archive and compression utility. Used to make a compressed archive, <ARCHIVE>, of <FILES>.
unzip [OPTION] <ARCHIVE> | (not an acronym) | Decompresses and extracts <ARCHIVE>.
apt [OPTION] <PACKAGE> | Advanced Package Tool | Package manager for Ubuntu and other Debian-based systems, with many utilities.

Tar is the oldest of the 3 utilities tar, gzip and zip. Unlike gzip and zip, tar doesn't actually compress files, but rather bundles them into archives, which are given the file extension '.tar'. So using tar on 100 files of 10 kB each won't make the resulting '.tar' file smaller than 1 MB. In fact, it will be slightly larger, because tar stores a small header with the name, permissions and other metadata in front of every file. The archive may still occupy a little less space on disk than the separate files, since file systems allocate disk space in whole blocks. The main reason for using tar is to create a single file out of multiple files, a so-called 'archive', which makes for easier portability and storage.

Prompt$ tar cvf <ARCHIVE> <DIRECTORY>

will create an archive called <ARCHIVE> from the files in <DIRECTORY>. The options c, v and f are short for 'create archive', 'verbose' and 'file' respectively, where f tells tar that the next argument is the name of the archive file. The tar command has a lot of options, some of which are listed below,

  • c : Create archive
  • x : Extract archive
  • f : Use the given archive file name
  • t : List the files in an archive
  • u : Update, i.e. only append files that are newer than the copy in the archive
  • v : Verbose
  • A : Concatenate archive files
  • z : Compress or decompress the archive with gzip
  • r : Append files/directories to an already existing .tar file
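As a sketch of the c, f and t options, the following creates 100 files of 10 kB each in a temporary directory, bundles them with tar and lists the first entries of the archive. The directory and file names are made up for illustration. It also prints the size of the archive next to the combined size of the files, to show that tar on its own doesn't compress.

```shell
#!/bin/sh
# Sketch: archive 100 files of 10 kB each and compare sizes.
# All names are hypothetical; everything happens in a temporary directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir Example

i=1
while [ "$i" -le 100 ]; do
    # Each file holds 10000 zero bytes; the content is irrelevant to tar.
    head -c 10000 /dev/zero > "Example/file$i.dat"
    i=$((i + 1))
done

# c: create, f: the next argument is the archive file name.
tar cf Example.tar Example

# t: list the archive contents without extracting (first 5 entries only).
tar tf Example.tar | head -n 5

files_total=$((100 * 10000))
archive_size=$(( $(wc -c < Example.tar) ))
echo "files together: $files_total bytes"
echo "Example.tar:    $archive_size bytes"
```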

You can extract files from the archive by typing,

Prompt$ tar xvf <ARCHIVE> 

which will extract the files in <ARCHIVE> into the current directory, restoring them to their original state.
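A quick way to convince yourself that extraction restores the files is a round trip. This is a minimal sketch with made-up names; it uses the C option of tar (not listed above, but available in both GNU and BSD tar) to extract into a separate directory, so that the result can be compared with the original.

```shell
#!/bin/sh
# Sketch: create an archive, extract it elsewhere, compare with the original.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir -p Example/notes
echo "first file" > Example/a.txt
echo "second file" > Example/notes/b.txt

tar cf Example.tar Example          # create the archive

mkdir restored
tar xf Example.tar -C restored      # extract into the directory 'restored'

# diff -r compares two directory trees; no output means they are identical.
if diff -r Example restored/Example; then
    roundtrip="identical"
else
    roundtrip="different"
fi
echo "original and extracted copy are $roundtrip"
```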

After having created an archive using tar, the gzip utility can be used to compress the archive. The file extension for gzipped files is '.gz'. However, if you've used the tar utility on the files to make an archive, the file extension is '.tar.gz' or the abbreviated form, '.tgz'. From the command line, you can compress files by typing,

Prompt$ gzip <FILE>

which will compress <FILE> to <FILE>.gz. The gzip command automatically adds the .gz file extension and removes the original, uncompressed file. If you want to keep the original file as well, you can use the k command option,

Prompt$ gzip -k <FILE>

which will compress <FILE> to <FILE>.gz and leave <FILE> in place.
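The difference between plain gzip and gzip -k is easy to see in a scratch directory. A minimal sketch, assuming a gzip version that supports the k option (GNU gzip 1.6 or later); the file names are hypothetical.

```shell
#!/bin/sh
# Sketch: gzip replaces the original file, gzip -k keeps it.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

seq 1 1000 > plain.txt
seq 1 1000 > kept.txt

gzip plain.txt      # plain.txt is replaced by plain.txt.gz
gzip -k kept.txt    # kept.txt.gz is created, kept.txt stays

ls                  # shows which files exist afterwards
```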

You can also just as easily compress multiple files,

Prompt$ gzip file1 file2 file3

which will compress the files file1, file2 and file3 into file1.gz, file2.gz and file3.gz. You can also use gzip to compress every file in a directory, but you need to use the recursive command option, r, otherwise you will get the error 'gzip: DIRECTORY/ is a directory -- ignored',

Prompt$ gzip -r <DIRECTORY> 

which will go through <DIRECTORY> and compress every file, the end result being a directory filled with compressed files. If you want all of the files to be compressed into one file, you'd need to first create an archive with tar and then compress it with gzip. Lastly, you can set the level at which you want to compress your files, ranging from 1 to 9. For instance,

Prompt$ gzip -1 <FILE>

will compress <FILE> at the lowest level, which is the fastest but gives the largest compressed file, whereas

Prompt$ gzip -9 <FILE>

will compress <FILE> at the highest level, which gives the smallest compressed file but is noticeably slower. By default, gzip compresses at level 6. A compressed file can be extracted with gunzip,

Prompt$ gunzip <FILE>

which will decompress <FILE> and return it to its original state. Just like with gzip, you can decompress multiple files at once,

Prompt$ gunzip -k file1.gz file2.gz file3.gz

which will decompress file1.gz, file2.gz and file3.gz into file1, file2 and file3. Here the k option keeps the compressed files, which would otherwise be removed. You can also decompress every file in a directory by using the r command option,

Prompt$ gunzip -r <DIRECTORY> 
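The recursive option can be tried safely on a throwaway directory. The following is a minimal sketch with made-up directory and file names; it compresses every file below Example with gzip -r and then restores them with gunzip -r.

```shell
#!/bin/sh
# Sketch: recursively compress and decompress every file in a directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir -p Example/sub
echo "top level" > Example/a.txt
echo "nested" > Example/sub/b.txt

gzip -r Example
echo "after gzip -r:"
find Example -type f | sort     # every file now ends in .gz

gunzip -r Example
echo "after gunzip -r:"
find Example -type f | sort     # the original file names are back
```

Note that the directory structure itself is untouched; only the individual files are replaced by their compressed versions and back.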

One of the command options for the tar command allows for simultaneously creating an archive and compressing the files using gzip,

Prompt$ tar cvzf <ARCHIVE> <FILES>  

will create a compressed archive, <ARCHIVE>, from <FILES>. A compressed archive can be extracted by typing,

Prompt$ tar xvzf <Compressed_ARCHIVE> 

which extracts the files from the compressed archive, returning them to their original state.
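The following sketch puts the z option to work on a small, made-up directory. It creates a plain archive and a gzip-compressed one for comparison, lists the contents of the compressed archive and extracts it into a separate directory (using the C option of tar, which is an addition of this example).

```shell
#!/bin/sh
# Sketch: create and extract a gzip-compressed archive in one step each.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir Example
seq 1 20000 > Example/numbers.txt
echo "hello" > Example/hello.txt

tar cf Example.tar Example        # plain archive, for size comparison
tar czf Example.tar.gz Example    # archive and compress with gzip

plain_size=$(( $(wc -c < Example.tar) ))
compressed_size=$(( $(wc -c < Example.tar.gz) ))
echo "Example.tar:    $plain_size bytes"
echo "Example.tar.gz: $compressed_size bytes"

tar tzf Example.tar.gz            # list contents of the compressed archive

mkdir restored
tar xzf Example.tar.gz -C restored
diff -r Example restored/Example && echo "round trip OK"
```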

You might be wondering what the point of using the gzip command is if the tar command can compress files as well. One good reason is that when you let tar call gzip through the z option, gzip runs with its default settings. gzip offers a range of compression levels from 1 to 9; 1 offers the fastest compression speed but a lower compression ratio, and 9 offers the highest compression ratio but a lower speed. The gzip application uses level 6 by default. Depending on the file type, there's often no significant difference in file size between the compression levels. Conversely, there can be big differences in the time it takes to compress the data, hence it's often not worth the wait to compress files at the highest level.
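To get a feeling for the levels, you can compress the same file at levels 1, 6 and 9 and compare the sizes. This is a minimal sketch; the sample data generated with seq is only a stand-in for one of your own files, and the file names are made up.

```shell
#!/bin/sh
# Sketch: compare the output size of gzip at levels 1, 6 (default) and 9.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Some compressible sample data (hypothetical; use your own file instead).
seq 1 200000 > sample.txt
original_size=$(( $(wc -c < sample.txt) ))
echo "original: $original_size bytes"

for level in 1 6 9; do
    # -c writes to stdout and leaves sample.txt untouched.
    gzip -"$level" -c sample.txt > "sample.$level.gz"
    size=$(( $(wc -c < "sample.$level.gz") ))
    echo "level $level: $size bytes"
done

size_fastest=$(( $(wc -c < sample.1.gz) ))
size_best=$(( $(wc -c < sample.9.gz) ))
```

In an interactive shell such as bash, you can put time in front of a gzip call to see how long each level takes on a large file.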

If you want to archive a directory and choose the compression level yourself, you can pipe the output of tar to gzip.

Figure 7.1 Piping tar with gzip: The directory Example is archived using tar c, whose stdout is piped to gzip -9, after which the stdout of gzip is redirected to the file Example.tgz.
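The command described in Figure 7.1 can be written out as the following sketch. It passes f - to tar (a dash as file name stands for stdout), because not every tar implementation writes to stdout by default; that detail is an assumption of this sketch and not taken from the figure. The contents of the directory Example are made up.

```shell
#!/bin/sh
# Sketch: archive with tar, compress with gzip -9 through a pipe.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir Example
seq 1 20000 > Example/numbers.txt

# 'f -' makes tar write the archive to stdout, which is piped to gzip -9;
# gzip writes the compressed data to its stdout, redirected to Example.tgz.
tar cf - Example | gzip -9 > Example.tgz

# Check the result: gzip -t tests integrity, tar tzf lists the contents.
gzip -t Example.tgz && echo "Example.tgz is a valid gzip file"
tar tzf Example.tgz
```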

The zip utility actually came out before gzip. It was developed commercially by the firm PKWARE, whereas gzip was written as free software for the GNU project, to replace the older Unix utility compress, whose compression algorithm was encumbered by patents. Both zip and gzip use the DEFLATE method, which is based on the 'LZ77 algorithm', for compression/decompression, but unlike gzip, zip also creates archives of files, which are given the file extension '.zip'. For more information on tar, zip and gzip you can follow this link: Additional information on tar, zip, gzip and some other file compressions

To use zip and unzip you first need to download and install them, and for this you can use the package managers apt or Homebrew (for Mac OS users).
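Once zip and unzip are installed, a round trip looks much like the one with tar and gzip. The following is a minimal sketch with made-up names; it checks whether the two commands are available and otherwise only prints a message, so it can be run before they are installed.

```shell
#!/bin/sh
# Sketch: archive and compress a directory with zip, extract it with unzip.
# Skips the zip part if zip/unzip are not installed yet.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir Example
seq 1 20000 > Example/numbers.txt

zip_status="round trip failed"
if command -v zip >/dev/null 2>&1 && command -v unzip >/dev/null 2>&1; then
    zip -r Example.zip Example          # -r: include the directory contents
    unzip -l Example.zip                # -l: list contents without extracting
    unzip -q Example.zip -d restored    # -d: extract into 'restored'
    diff -r Example restored/Example && zip_status="round trip OK"
else
    zip_status="zip or unzip not installed"
fi
echo "$zip_status"
```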

Exercise 1: tar and gzip

1. Download files or find files on your computer which you think can be compressed significantly. Put them together in a folder.
2. Archive and compress the folder at default compression level.
3. Uncompress the folder.
4. Archive and compress it at the lowest compression level in one line.

Package management

Package managers are software tools that automate installing, updating, configuring and removing packages. An important theme on Linux-based systems is that every program does one simple thing, but does it well. Larger programs on a Linux OS are pieced together from smaller programs and therefore depend on them. This is essentially what's meant by the 'dependencies' of a program, and for the Linux OS many of these small programs have been developed in parallel by different organizations. Package managers keep all of these dependencies updated and in check, ensuring functionality and compatibility of all programs. Microsoft is the sole owner of the Windows OS, and there are therefore well-defined procedures for installing programs. Larger programs are also designed to be independent, and for the most part they don't need to be combined with other small programs for full functionality. There are, however, some exceptions on a Windows OS that need to be installed separately and that you've probably heard of: Java, Adobe Flash Player, Wizard etc. You might think that having programs without dependencies is a good idea, but a major weakness is that there'll be a lot of redundancy and wasted disk space.

The package manager used for Ubuntu is apt, short for 'Advanced Package Tool'. In order to install or remove packages with it, you need superuser privileges, which you can get by using the command sudo. The command sudo executes the command that follows it with elevated privileges, as far as the user is permitted to do so. If you're executing this on your own computer, you'll be granted root privileges, which are the highest privileges you can get.

Prompt$ sudo apt install <PROGRAM>

The command apt will download and install <PROGRAM>. You can also remove packages,

Prompt$ sudo apt remove <PROGRAM>

which will remove <PROGRAM>. If you want to remove package configuration files as well,

Prompt$ sudo apt purge <PROGRAM> 

will remove <PROGRAM> along with its configuration files. Configuration files contain information about the initial parameters and settings for a program. If you want to update your currently installed packages,

Prompt$ sudo apt upgrade 

will upgrade all of your currently installed packages to the newest versions that apt knows about. It is, however, often the case that the package lists stored on your computer are outdated. You can refresh them from the configured sources,

Prompt$ sudo apt update

which will download the latest package lists from all configured sources. It is therefore common to run it before sudo apt upgrade. The sources are configured in the directory '/etc/apt', typically in files with the ending '.list'. You can remove packages that were installed automatically as dependencies and are no longer required on your computer by using the autoremove option,

Prompt$ sudo apt autoremove 

You can use the search option to search for packages with a specific feature you might need,

Prompt$ apt search <REGEX>

will search the names and descriptions of packages for matches to the regular expression pattern. Lastly, the command option show displays information about a package in the terminal,

Prompt$ apt show <PACKAGE>

Neither search nor show changes anything on the system, so sudo isn't needed for them.

Installing Emacs package example

Here we demonstrate how the package for the text editor, emacs, can be installed with apt.

Figure 7.2 Updating configured sources and upgrading packages: The command sudo apt update && sudo apt upgrade will execute sudo apt update and sudo apt upgrade consecutively, refreshing the package lists from all configured sources and installing available upgrades for installed packages. This is equivalent to waiting for sudo apt update to complete and, if it succeeded, executing sudo apt upgrade. The && is an operator that means 'and': the second command is only run if the first one succeeds.
Figure 7.3 Installing a package: The command sudo apt install emacs is used to install the package emacs. In this case, we show what it would look like if emacs was already installed, as it's not possible to show the whole installation process anyway. The installation process might also take a while, and you'll have to wait for its completion before you can do anything else. There's a way to avoid this, which we'll learn in the next section when we introduce the concept of background processes.
Figure 7.4 Removing a package: The command sudo apt remove -y emacs will remove the package emacs. When the command option -y is used, you won't be prompted to type the additional 'Y' for the removal to proceed.
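The commands from Figures 7.2 to 7.4 can be collected in a small script. The sketch below is deliberately a dry run: the helper function run and the variable DRY_RUN are made up for this example and are not part of apt. With DRY_RUN=1 (the default here) each command is only printed, so the script can be tried on any system; with DRY_RUN=0 it really executes the commands, which requires Ubuntu and sudo rights. It also demonstrates the && operator with the standard commands true and false.

```shell
#!/bin/sh
# Sketch: the apt commands from the figures, wrapped in a dry-run helper.
# 'run' and DRY_RUN are hypothetical helpers, not part of apt.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# && executes the right-hand command only if the left-hand one succeeded.
true  && echo "printed, because 'true' succeeded"
false && echo "never printed, because 'false' failed"

# Figure 7.2: refresh the package lists, then upgrade installed packages.
run sudo apt update && run sudo apt upgrade

# Figure 7.3: install the package emacs.
run sudo apt install emacs

# Figure 7.4: remove emacs without being asked for confirmation (-y).
run sudo apt remove -y emacs
```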

Exercise 2: Installing and using zip

1. Use the apt command to install zip and unzip.
2. Use zip to archive and compress the folder you used in exercise 1, and unzip to extract it again.