Archiving and compressing files

From CC Doc


This article is a draft

This is not a complete article: This is a Draft, a work in progress that is intended to be published into an article, which may or may not be ready for inclusion in the main wiki. It should not necessarily be considered factual or authoritative.



Parent page: Storage and file management

Archiving means creating one file that contains a number of smaller files within it. Archiving data can improve the efficiency of file storage, and of file transfers. It is faster for the secure copy protocol (scp), for example, to transfer one archive file of a reasonable size than thousands of small files of equal total size.

Compressing means encoding a file such that the same information is contained in fewer bytes of storage. The advantage for long-term data storage should be obvious. For data transfers, the time spent for compressing data can be balanced against the time saved moving fewer bytes as described in this discussion of data compression and transfer from the US National Center for Supercomputing Applications.

Use tar to archive files and directories

The primary archiving utility on all Linux and Unix-like systems is the tar command. It will bundle a bunch of files or directories together and generate a single file, called an archive file or tar-file. By convention an archive file has .tar as the file name extension. When you archive a directory with tar, it will, by default, include all the files and sub-directories contained within it, and sub-sub-directories contained in those, and so on. So the command tar --create --file project1.tar project1 will pack all the content of directory project1 into the file project1.tar. The original directory will remain unchanged, so this may double the amount of disk space occupied!

You can extract files from an archive using the same command with a different option: tar --extract --file project1.tar. If there is no directory with the original name, it will be created. If a directory of that name exists and contains files of the same names as in the archive file, they will be overwritten. A further option, -C, specifies the destination directory into which the archive's content is extracted.
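The create and extract steps can be tried safely in a scratch directory. The following is a minimal sketch, assuming GNU tar; the directory and file names (project1, README, restored) are hypothetical sample data, and --directory is the long form of the -C option.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample data standing in for a real project.
mkdir -p project1/src
echo "hello" > project1/README
echo "code"  > project1/src/main.c

# Create the archive; project1/ itself is left unchanged.
tar --create --file project1.tar project1

# Extract into a separate directory, which must already exist.
mkdir restored
tar --extract --file project1.tar --directory restored

# Compare the original and the restored copy; diff is silent when they match.
diff -r project1 restored/project1 && echo "identical"
```

Extracting into a separate directory avoids overwriting files in the original project1 directory.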

Compress and uncompress tar files

The tar archiving utility can compress an archive file at the same time it creates it. There are a number of compression methods to choose from. We recommend either xz or gzip, which can be used as follows:

[user_name@localhost]$ tar --create --xz --file project1.tar.xz project1
[user_name@localhost]$ tar --extract --xz --file project1.tar.xz
[user_name@localhost]$ tar --create --gzip --file project1.tar.gz project1
[user_name@localhost]$ tar --extract --gzip --file project1.tar.gz

Typically, --xz will produce a smaller compressed file (a "better compression ratio") but takes longer and uses more RAM while working [1]. --gzip does not typically compress as well, but may be used if you encounter difficulties due to insufficient memory or excessive run time during tar --create. A third option, --bzip2, is also available; it typically does not compress as well as xz but takes longer than gzip.

You can also run tar --create first without compression and then use the commands xz or gzip in a separate step, although there is rarely a reason to do so. Similarly, you can run xz -d or gzip -d to decompress an archive file before running tar --extract, but again there is rarely a reason to do so.

The commands gzip or xz can be used to compress any file, not just archive files:

[user_name@localhost]$ gzip bigfile
[user_name@localhost]$ xz bigfile

These commands will produce the files bigfile.gz and bigfile.xz respectively.
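By default gzip replaces the original file with the compressed one. If you would rather keep the original, one possible approach is to write the compressed stream to standard output with -c. The sketch below uses a hypothetical, highly repetitive sample file so that the size difference is easy to see.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample file with very repetitive content.
yes "some repetitive line of text" | head -n 10000 > bigfile

# -c writes the compressed data to standard output, so bigfile is kept.
gzip -c bigfile > bigfile.gz

# Compare the sizes in bytes.
wc -c bigfile bigfile.gz

# Decompress to standard output and confirm the content is unchanged.
gzip -dc bigfile.gz | cmp - bigfile && echo "round trip OK"
```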

Common options for tar command

Here are the most common options for this command. There are two synonymous forms for each, a single-letter form prefixed with a single dash, and a whole-word form prefixed with a double dash:

  • -c or --create: Create a new archive.
  • -f or --file=: Following is the archive file name.
  • -x or --extract: Extract files from archive.
  • -t or --list: List the content of an archive file.
  • -J or --xz: Compress or uncompress with xz.
  • -z or --gzip: Compress or uncompress with gzip.

Single-letter options can be combined with a single dash, so for example: tar -cJf project1.tar.xz project1 is equivalent to tar --create --xz --file=project1.tar.xz project1.

There are many more options for tar, and these may depend on the version you are using. You can get a complete list of the options available on your system with man tar or tar --help. Note in particular that some older systems might not support --xz compression.
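To check that the two option forms really do the same thing, you can build one archive with each and compare the listings. This is a small sketch with hypothetical sample data; it uses gzip rather than xz only because gzip is more widely installed.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir project1
echo "data" > project1/file.txt   # hypothetical sample content

# The same operation written with long and with short options.
tar --create --gzip --file=long.tar.gz project1
tar -czf short.tar.gz project1

# List both archives and compare the listings.
tar --list --gzip --file=long.tar.gz > long.lst
tar -tzf short.tar.gz > short.lst
cmp long.lst short.lst && echo "same contents"
```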

Archiving and Compressing Examples

To illustrate the use of archiving and compressing utilities, let us imagine that your working directory contains the files and sub-directories shown here:

[user_name@localhost]$ ls -F
bin/  documents/  jobs/  new.log.dat  programs/  report/  results/  tests/  work/

Archive files and directories

Use tar to archive a given directory

Perhaps the most common use of tar is to create an archive of a single directory. Let us take, for example, a directory named results and create from it an archive file named results.tar. On your terminal type:

[user_name@localhost]$ tar -cvf results.tar results
results/
results/log1.dat
results/Res-01/
results/Res-01/log.15Feb16.1
results/Res-01/log.15Feb16.4
results/Res-02/
results/Res-02/log.15Feb16.balance.b.4

Using the ls command, we can see that the new tar file results.tar is there:

[user_name@localhost]$ ls -F
bin/  documents/  jobs/  new.log.dat  programs/  report/  results/  results.tar  tests/  work/

In this example, we invoked the tar command with the options c (for create), v (for verbosity) and f (for file). We chose the name results.tar for the tar file. This name could be something else but it is best to have a similar name to the directory being archived. It makes it easier to recognize your data later.

More than one directory or file can be placed in a tar file. For example, to place the directories results, report and documents into an archive file called full_results.tar, we can proceed as follows:

[user_name@localhost]$ tar -cvf full_results.tar results report documents/
results/
results/log1.dat
results/Res-01/
results/Res-01/log.15Feb16.1
results/Res-01/log.15Feb16.4
results/Res-02/
results/Res-02/log.15Feb16.balance.b.4
report/
report/report-2016.pdf
report/report-a.pdf
documents/
documents/1504.pdf
documents/ff.doc

Since the v option was used, the files added to the archive are shown. Those details may be hidden by omitting the v.

To check out the created archive, use ls:

[user_name@localhost]$ ls
bin/  documents/  full_results.tar  jobs/  new.log.dat  programs/  report/  results/  results.tar  tests/  work/

Archive files or directories that start with a given letter

In our working directory, we have two directories that start with "r" (report, results). In this example, we put together the content of these directories into one single archive, archive.tar.

[user_name@localhost]$ tar -cvf archive.tar r*
report/
report/report-2016.pdf
report/report-a.pdf
results/
results/log1.dat
results/Res-01/
results/Res-01/log.15Feb16.1
results/Res-01/log.15Feb16.4
results/Res-02/
results/Res-02/log.15Feb16.balance.b.4

This trick uses a mechanism called shell globbing. Similar commands can be used to pick out files or directories that have any common chain of characters ("substring") in their name, like *Feb* or *.log.
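Because the shell expands the pattern before tar runs, you can preview what tar will receive by putting echo in front of the command. The sketch below assumes a hypothetical directory layout similar to the one in this example.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir bin report results work   # hypothetical layout

# The shell expands r* first; echo prints the command tar would receive.
echo tar -cvf archive.tar r*

# Quoting the pattern suppresses expansion, so the literal r* is passed on.
echo tar -cvf archive.tar 'r*'
```

The first echo prints the command with report and results in place of r*; the second prints r* unchanged.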

Append or add files to the end of an archive

The -r option is used to add a file or files to an existing archive without having to unpack the archive and run tar again to create a new one. In the following example we add the file new.log.dat to the archive results.tar, then demonstrate the result with -t.

[user_name@localhost]$ tar -rf results.tar new.log.dat
[user_name@localhost]$ tar -tvf results.tar
drwxrwxr-x name name        0 2016-11-20 11:02 results/
-rw-r--r-- name name    10905 2016-11-16 16:31 results/log1.dat
drwxrwxr-x name name        0 2016-11-16 19:36 results/Res-01/
-rw-r--r-- name name    11672 2016-11-16 15:10 results/Res-01/log.15Feb16.1
-rw-r--r-- name name    11682 2016-11-16 15:10 results/Res-01/log.15Feb16.4
drwxrwxr-x name name        0 2016-11-16 19:37 results/Res-02/
-rw-r--r-- name name    34117 2016-11-16 15:10 results/Res-02/log.15Feb16.balance.b.4
-rw-r--r-- name name    10905 2016-11-20 11:16 new.log.dat

Note: You cannot use -r to add files to compressed archives (*.gz, *.xz, *.bz2). Files can only be added to plain tar archives.

The -r option to tar can also be used to add a directory or directories to an existing tar file. Let us add the directory report to the archive results.tar from our previous example:

[user_name@localhost]$ tar -rf results.tar report/
[user_name@localhost]$ tar -tvf results.tar
drwxrwxr-x name name        0 2016-11-20 11:02 results/
-rw-r--r-- name name    10905 2016-11-16 16:31 results/log1.dat
drwxrwxr-x name name        0 2016-11-16 19:36 results/Res-01/
-rw-r--r-- name name    11672 2016-11-16 15:10 results/Res-01/log.15Feb16.1
-rw-r--r-- name name    11682 2016-11-16 15:10 results/Res-01/log.15Feb16.4
drwxrwxr-x name name        0 2016-11-16 19:37 results/Res-02/
-rw-r--r-- name name    34117 2016-11-16 15:10 results/Res-02/log.15Feb16.balance.b.4
-rw-r--r-- name name    10905 2016-11-20 11:16 new.log.dat
drwxrwxr-x name name        0 2016-11-20 11:02 report/
-rw-r--r-- name name   924729 2015-11-20 04:14 report/report-2016.pdf
-rw-r--r-- name name   924729 2015-11-20 04:14 report/report-a.pdf

Again, the -v option is not necessary if you do not want to show the details of the files.
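The append step, and the restriction on compressed archives, can be reproduced with a few throwaway files. This is a sketch assuming GNU tar, which refuses to update a compressed archive; the file names are hypothetical sample data.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir results
echo "log" > results/log1.dat   # hypothetical sample files
echo "new" > new.log.dat

tar -cf results.tar results
tar -rf results.tar new.log.dat      # append to the plain archive
tar -tf results.tar                  # new.log.dat is now the last entry

# Appending to a compressed archive is refused; tar exits with an error.
tar -czf results.tar.gz results
if tar -rzf results.tar.gz new.log.dat 2>/dev/null; then
    echo "append succeeded (unexpected)"
else
    echo "cannot append to a compressed archive"
fi
```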

Add two archive files with concatenate

Just as one can add a file to an archive with -r, one can add an archive to another archive with -A. Let us add the archive report.tar to the existing archive results.tar.

Here are the contents of the existing archive:

[user_name@localhost]$ ls
bin/  documents/  jobs/  new.log.dat  programs/  report/  report.tar  results/  results.tar  tests/  work/
[user_name@localhost]$ tar -tvf results.tar
drwxr-xr-x name name        0 2016-11-20 16:16 results/
-rw-r--r-- name name    10905 2016-11-20 16:16 results/log1.dat
drwxr-xr-x name name        0 2016-11-20 16:16 results/Res-01/
-rw-r--r-- name name    11682 2016-11-20 16:16 results/Res-01/log.15Feb16.4
drwxr-xr-x name name        0 2016-11-20 16:16 results/Res-02/
-rw-r--r-- name name    34117 2016-11-20 16:16 results/Res-02/log.15Feb16.balance.b.4

Now, we add report.tar to it and list the contents of the new archive:

[user_name@localhost]$ tar -A -f results.tar report.tar
[user_name@localhost]$ tar -tvf results.tar
drwxr-xr-x name name        0 2016-11-20 16:16 results/
-rw-r--r-- name name    10905 2016-11-20 16:16 results/log1.dat
drwxr-xr-x name name        0 2016-11-20 16:16 results/Res-01/
-rw-r--r-- name name    11682 2016-11-20 16:16 results/Res-01/log.15Feb16.4
drwxr-xr-x name name        0 2016-11-20 16:16 results/Res-02/
-rw-r--r-- name name    34117 2016-11-20 16:16 results/Res-02/log.15Feb16.balance.b.4
drwxrwxr-x name name        0 2016-11-20 11:02 report/
-rw-r--r-- name name   924729 2015-11-20 04:14 report/report-2016.pdf
-rw-r--r-- name name   924729 2015-11-20 04:14 report/report-a.pdf

The options -A, --catenate, and --concatenate are equivalent, although some older systems may not support all three synonyms. Use man tar to see what options are available on your system.
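A self-contained way to try concatenation is sketched below, assuming GNU tar and hypothetical sample files. Note that the second archive is read but not modified.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir results report
echo "log" > results/log1.dat      # hypothetical sample files
echo "pdf" > report/report-a.pdf

tar -cf results.tar results
tar -cf report.tar report

# Add the members of report.tar to the end of results.tar.
tar --concatenate --file=results.tar report.tar

# results.tar now lists members from both archives.
tar -tf results.tar
```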

Exclude particular files when creating an archive

Suppose you have certain files that do not need to be archived, such as intermediate or temporary files (like *.o or core.*). Let us create an archive results.tar for the directory results but without the files that have .dat as extension. We do this with the option --exclude=*.dat. Notice that the pattern *.dat is just like the pattern used for globbing above.

[user_name@localhost]$ ls
bin/  documents/  jobs/  new.log.dat  programs/  report/  results/  tests/  work/
[user_name@localhost]$ ls results/
log1.dat  log5.dat  Res-01/  Res-02/
[user_name@localhost]$ tar -cvf results.tar --exclude='*.dat' results/
results/
results/Res-01/
results/Res-01/log.15Feb16.4
results/Res-02/
results/Res-02/log.15Feb16.balance.b.4
[user_name@localhost]$ tar -tvf results.tar
drwxr-xr-x name name        0 2016-11-20 16:16 results/
drwxr-xr-x name name        0 2016-11-20 16:16 results/Res-01/
-rw-r--r-- name name    11682 2016-11-20 16:16 results/Res-01/log.15Feb16.4
drwxr-xr-x name name        0 2016-11-20 16:16 results/Res-02/
-rw-r--r-- name name    34117 2016-11-20 16:16 results/Res-02/log.15Feb16.balance.b.4
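When trying the exclude option yourself, it is a reasonable precaution to quote the pattern, so that the shell passes it to tar unexpanded, and to put --exclude before the directory name, since some GNU tar releases may warn when it comes after. The sketch below uses hypothetical sample files.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p results/Res-01
echo "a" > results/log1.dat              # hypothetical sample files
echo "b" > results/Res-01/log.15Feb16.4

# Quoted pattern, placed before the directory to be archived.
tar -cf results.tar --exclude='*.dat' results/

# The listing should contain no .dat files.
tar -tf results.tar
```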

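Handle symbolic links with the tar command

By default, tar stores a symbolic link in the archive as a link, not as a copy of the file it points to. If your existing directory structure includes symbolic links and you are copying an archive file in order to re-create that structure on a new system where the link targets may not exist, use the h option (--dereference) so that tar follows each link and archives the file it points to, like this: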

[user_name@localhost]$ tar -cvhf results.tar results/
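To see what the h option changes in practice, you can archive the same directory with and without it and compare the long listings. This is a sketch assuming GNU tar, with a hypothetical link and target.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir results
echo "shared data" > data.txt          # hypothetical file outside results/
ln -s ../data.txt results/link.txt     # symbolic link pointing to it

tar -cf links.tar results      # without -h: the link itself is stored
tar -chf files.tar results     # with -h: the file the link points to is stored

# A stored link is listed as "l... results/link.txt -> ../data.txt";
# a dereferenced copy is listed as a regular file ("-...").
tar -tvf links.tar
tar -tvf files.tar
```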

Compress files and archives

Compress a file

Compressing and archiving are two different processes. Archiving, or creating a tar file, puts several files or directories together into a single file. Compressing is a process applied to a single file (possibly a tar file) in order to reduce its size. This is done with compression utilities like gzip or xz. In the following example, we compress the files new.log.dat and results.tar using gzip:

[user_name@localhost]$ ls
bin/  documents/  jobs/  new.log.dat  new_results/  programs/  report/  results/  results.tar  tests/  work/
[user_name@localhost]$ gzip new.log.dat results.tar
[user_name@localhost]$ ls
bin/  documents/  jobs/  new.log.dat.gz  new_results/  programs/  report/  results/  results.tar.gz  tests/  work/

Notice how the uncompressed files are replaced by compressed files with a new extension, .gz. Similarly, using xz:

[user_name@localhost]$ ls
bin/  documents/  jobs/  new.log.dat  new_results/  programs/  report/  results/  results.tar  tests/  work/
[user_name@localhost]$ xz new.log.dat results.tar
[user_name@localhost]$ ls
bin/  documents/  jobs/  new.log.dat.xz  new_results/  programs/  report/  results/  results.tar.xz  tests/  work/

One can compress an archive file while the archive is being created, without creating an intermediate .tar file or overwriting the original data. To do so, use the z or --gzip option for gzip compression, and J or --xz for xz compression. The extension of the file name does not really matter, but .tar.gz and .tgz are common extensions for files compressed with gzip.

[user_name@localhost]$ ls
bin/  documents/  jobs/  new.log.dat  programs/  report/  results/  tests/  work/
[user_name@localhost]$ tar -cvzf results.tar.gz results/
results/
results/log1.dat
results/Res-01/
results/Res-01/log.15Feb16.4
results/Res-02/
results/Res-02/log.15Feb16.balance.b.4
[user_name@localhost]$ ls
bin/  documents/  jobs/  new.log.dat  programs/  report/  results/  results.tar.gz  tests/  work/

.tar.xz and .txz are common extensions for files compressed with xz.

[user_name@localhost]$ tar -cvJf results.tar.xz results/
results/
results/log1.dat
results/Res-01/
results/Res-01/log.15Feb16.4
results/Res-02/
results/Res-02/log.15Feb16.balance.b.4
[user_name@localhost]$ ls
bin/  documents/  jobs/  new.log.dat  programs/  report/  results/  results.tar.xz  results.tar.gz  tests/  work/

Add files to compressed archives (tar.gz/tar.xz)

You cannot add a file to a compressed archive in a single step. To do so, first uncompress the archive using gunzip or unxz. Then add the files by invoking tar -r as described above. Then compress again using gzip or xz.
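The three steps above can be sketched as follows, using hypothetical sample files and gzip compression:

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir results
echo "log" > results/log1.dat   # hypothetical sample files
echo "new" > new.log.dat
tar -czf results.tar.gz results

gunzip results.tar.gz                  # leaves results.tar
tar -rf results.tar new.log.dat        # append to the plain archive
gzip results.tar                       # back to results.tar.gz

# Confirm the new file is in the compressed archive.
tar -tzf results.tar.gz
```

Keep in mind that the uncompressed archive exists on disk between the first and last steps, so enough free space is needed for it.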

Unpack compressed files and archives

Extract the whole archive

To extract all files from an archive, or "unpack" it, use -x (for extract) with -f (for file). If you want to watch the progress of the operation, you can add -v (for verbose).

Let us extract everything from the archive file results.tar. By default, all files will be placed in a directory with the same name as the one from which they were taken; in this example, a directory named results. If there is already a directory with this name and it contains files with the same names as the ones in the archive, the extracted files will overwrite (replace) them. To avoid this, we can direct the extracted data to another directory by adding the option -C, making sure that the destination directory already exists before unpacking the archive. For example, we create a directory recover and extract the data from the archive results.tar to this directory.

[user_name@localhost]$ mkdir recover
[user_name@localhost]$ tar -xvf results.tar -C recover
results/
results/log1.dat
results/Res-01/
results/Res-01/log.15Feb16.1
results/Res-01/log.15Feb16.4
results/Res-02/
results/Res-02/log.15Feb16.balance.b.4
new.log.dat 
report/
report/report-2016.pdf
report/report-a.pdf
[user_name@localhost]$ ls recover
new.log.dat  report/  results/

Uncompress gz, bz2, or xz files

For files with ".gz" extension, we use gunzip as follows:

[user_name@localhost]$ ls results*
results.tar.gz 
[user_name@localhost]$ gunzip results.tar.gz
[user_name@localhost]$ ls results*
results.tar

For files with ".bz2" extension, we use bunzip2 in place of gunzip. For files with ".xz" extension, we use unxz.

Extract a compressed archive file

A compressed tar file can be uncompressed and extracted in a single step by adding the z, j or J option for .gz, .bz2 or .xz files, respectively. If the example is the same as the previous one but with a compressed archive file named results.tar.gz, do:

[user_name@localhost]$ tar -xvzf results.tar.gz -C recover
...

Notes:

  • Remember that filename extensions like .gz, .bz2 and .xz are human conventions and not dictated by the software. It is not uncommon for a compressed tar file to carry the extension .tgz instead of .tar.gz, for example.
  • When using the option -C to give a destination directory, first make sure that the destination directory exists since tar will not create it for you and will fail if it does not exist.
  • The option v for verbosity causes the file and directory names to be displayed during extraction. If you want to display more details, like the file dates and permissions, add a second v option like so: tar -C new_results/ -xvvzf results.tar.gz.

Extract one file from an archive or a compressed archive

Let us consider again the same example as previously. First we create the archive results.tar for the directory results and list all the files in it. Then we will extract one file into the directory new_results:

[user_name@localhost]$ ls
bin/  documents/  jobs/  new.log.dat  new_results/  programs/  report/  results/  results.tar  tests/  work/
[user_name@localhost]$ tar -tvf results.tar
drwxrwxr-x name name        0 2016-11-20 11:02 results/
-rw-r--r-- name name    10905 2016-11-16 16:31 results/log1.dat
drwxrwxr-x name name        0 2016-11-20 15:16 results/Res-01/
-rw-r--r-- name name    11682 2016-11-16 15:10 results/Res-01/log.15Feb16.4
drwxrwxr-x name name        0 2016-11-16 19:37 results/Res-02/
-rw-r--r-- name name    34117 2016-11-16 15:10 results/Res-02/log.15Feb16.balance.b.4
[user_name@localhost]$ ls new_results/
[user_name@localhost]$ tar -C ./new_results/ --extract --file=results.tar results/Res-01/log.15Feb16.4
[user_name@localhost]$ ls new_results/results/Res-01/log.15Feb16.4
new_results/results/Res-01/log.15Feb16.4

In this example, we have extracted the file results/Res-01/log.15Feb16.4 from the archive without unpacking the whole archive, by naming the file after the --extract option. The command creates the same directories as in the archive, but inside the destination directory.

Notes:

  • Use -C {destination directory} to choose where the file is extracted. Without it, tar re-creates the file's path (here results/Res-01/) under the current directory, creating the directories if they do not exist and overwriting an existing file of the same name.
  • The same approach extracts a single file or a whole directory, provided you give its path exactly as it appears in the archive.
  • The same command can be used to extract multiple files by adding the full path of each, as in the previous example:
[user_name@localhost]$ tar -C ./new_results/ --extract --file=results.tar "results/Res-01/log.15Feb16.4" "file2" "file3"

The same command can also be used to extract a file from a compressed tar file. From *.gz files:

[user_name@localhost]$ tar -C ./new_results/ --extract -z --file=results.tar.gz results/Res-01/log.15Feb16.4
[user_name@localhost]$ ls new_results/results/Res-01/log.15Feb16.4
new_results/results/Res-01/log.15Feb16.4

From *.bz2 file:

[user_name@localhost]$ tar -C ./new_results/ --extract -j --file=results.tar.bz2 results/Res-01/log.15Feb16.4
[user_name@localhost]$ ls new_results/results/Res-01/log.15Feb16.4
new_results/results/Res-01/log.15Feb16.4
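Here is a self-contained sketch of extracting two named members from a gzip-compressed archive, using hypothetical sample files. Each member is named exactly as it appears in the output of tar -t.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p results/Res-01 results/Res-02 new_results
echo "one"   > results/log1.dat                        # hypothetical sample files
echo "two"   > results/Res-01/log.15Feb16.4
echo "three" > results/Res-02/log.15Feb16.balance.b.4
tar -czf results.tar.gz results

# Extract only the two named members into new_results/.
tar -C new_results -xzf results.tar.gz \
    results/log1.dat results/Res-01/log.15Feb16.4

# Only the requested files, with their parent directories, are created.
find new_results -type f | sort
```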

Extract multiple files using wildcards

[user_name@localhost]$ tar -C ./new_results/ -xvf results.tar --wildcards "results/*.dat"
[user_name@localhost]$ ls new_results/results/
log1.dat

With the above command, we have extracted the files in the directory results that have the extension .dat.

Note: The command is also valid with the j or z options for compressed archives, as we have seen previously. Continuing our previous example, we can extract all the files whose names start with log:

[user_name@localhost]$ tar -C ./new_results/ -xvf results.tar --wildcards "results/log*"
[user_name@localhost]$ ls new_results/results/
log1.dat
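Wildcards can also appear in the directory part of a member name. The sketch below assumes GNU tar (the --wildcards option) and hypothetical sample files; the pattern is quoted so that tar, not the shell, interprets it.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p results/Res-01 results/Res-02 new_results
echo "one"   > results/log1.dat                        # hypothetical sample files
echo "two"   > results/Res-01/log.15Feb16.4
echo "three" > results/Res-02/log.15Feb16.balance.b.4
tar -cf results.tar results

# Extract the log files found in any Res-* sub-directory.
tar -C new_results -xf results.tar --wildcards 'results/Res-*/log.*'

find new_results -type f | sort
```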


Content of archive files

List the content of an archive file

What if you have a tar file and don't remember what is in it? In this case, you can just list its content without unpacking the file. This can be achieved using tar -t:

[user_name@localhost]$ tar -tf results.tar
results/ 
results/log1.dat 
results/Res-01/
results/Res-01/log.15Feb16.1
results/Res-01/log.15Feb16.4
results/Res-02/ 
results/Res-02/log.15Feb16.balance.b.4

In addition, the use of -v option will also give "metadata" about the files, like permissions, date of last change, owner, just like you would see with ls -l on unarchived files:

[user_name@localhost]$ tar -tvf results.tar
drwxrwxr-x name name        0 2016-11-20 11:02 results/
-rw-r--r-- name name   10905 2016-11-16 16:31 results/log1.dat
drwxrwxr-x name name       0 2016-11-16 19:36 results/Res-01/
-rw-r--r-- name name   11672 2016-11-16 15:10 results/Res-01/log.15Feb16.1
-rw-r--r-- name name   11682 2016-11-16 15:10 results/Res-01/log.15Feb16.4
drwxrwxr-x name name       0 2016-11-16 19:37 results/Res-02/
-rw-r--r-- name name   34117 2016-11-16 15:10 results/Res-02/log.15Feb16.balance.b.4

If you are interested in the number of entries in a given tar file, you can combine one of the previous commands with a pipe { | } and wc -l { word count, with the -l option to count only the number of lines }. wc -l counts the number of lines in the output of the command before the pipe symbol.

[user_name@localhost]$ tar -tvf results.tar | wc -l
7

Or:

[user_name@localhost]$ tar -tf results.tar | wc -l
7

From this example, we have a total of 7 entries in the tar file. This number includes all the files and sub-directories in the directory results, as well as the directory itself. Note that the details of the files are not shown even though the -v option was used. This is because the output of the first command is passed through wc -l, which displays only the number of lines and not their content.
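Since the count includes directories, you may want the number of regular files only. One possible approach, sketched here with hypothetical sample files, relies on the fact that directory entries end with a slash in the tar -t listing.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p results/Res-01
echo "a" > results/log1.dat              # hypothetical sample files
echo "b" > results/Res-01/log.15Feb16.4
tar -cf results.tar results

# All entries: files and directories.
total=$(tar -tf results.tar | wc -l)

# Count only the lines that do not end with a slash, i.e. not directories.
files=$(tar -tf results.tar | grep -vc '/$')

echo "entries: $total, files: $files"
```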

The options in the previous commands can be invoked separately. For example:

  • The option -tvf is equivalent to -t -v -f
  • The option -v is equivalent to --verbose
  • The option -t is equivalent to --list
  • The option --file=results.tar is equivalent to -f results.tar

Note: The option -f or --file= must always come immediately before the name of the tar file.

Search for a given file in archive file without unpacking it

We have seen previously how to list the files in the archive. It is also possible to look for a particular file by combining the list commands with a pipe and the grep command. For example, let us see if we can find the file log.15Feb16.4 (the path to this file is results/Res-01/log.15Feb16.4).

[user_name@localhost]$ tar -tf results.tar | grep -a log.15Feb16.4
results/Res-01/log.15Feb16.4
[user_name@localhost]$ tar -tvf results.tar | grep -a log.15Feb16.4
-rw-r--r-- name name   11682 2016-11-16 15:10 results/Res-01/log.15Feb16.4

Now, let us see if we can find another file called, for example, pbs_file (this file does not exist in our archive):

[user_name@localhost]$ tar -tf results.tar | grep -a pbs_file
[user_name@localhost]$ tar -tvf results.tar | grep -a pbs_file

As you can see, the output of the commands is empty meaning that the file does not exist in the archive.

If you want to list all the files in the archive whose names contain, for example, log (or any other chain of characters), type on your terminal:

[user_name@localhost]$ tar -tf results.tar | grep -a log
results/log1.dat 
results/Res-01/log.15Feb16.1
results/Res-01/log.15Feb16.4
results/Res-02/log.15Feb16.balance.b.4

Or add the -v option for more details:

[user_name@localhost]$ tar -tvf results.tar | grep -a log
-rw-r--r-- name name    10905 2016-11-16 16:31 results/log1.dat
-rw-r--r-- name name    11672 2016-11-16 15:10 results/Res-01/log.15Feb16.1
-rw-r--r-- name name    11682 2016-11-16 15:10 results/Res-01/log.15Feb16.4
-rw-r--r-- name name    34117 2016-11-16 15:10 results/Res-02/log.15Feb16.balance.b.4

Note: The command more can also be invoked after the pipe symbol to list the files in the archive or the compressed file.
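In a script, it is often more convenient to test whether a file is in the archive than to read the output. The following sketch, with hypothetical sample files, uses grep -q, which prints nothing and only sets the exit status, and grep -F, which treats the name as a fixed string so that the dots are not interpreted as regular-expression wildcards.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p results/Res-01
echo "a" > results/Res-01/log.15Feb16.4   # hypothetical sample file
tar -cf results.tar results

# Report whether each name occurs in the archive listing.
for name in log.15Feb16.4 pbs_file; do
    if tar -tf results.tar | grep -qF "$name"; then
        echo "$name: found"
    else
        echo "$name: not in archive"
    fi
done
```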

List the content of a compressed file (*.gz or *.bz2)

As with the plain tar file we have seen previously, it is possible to combine the tar command with the z option to list the content of an archive compressed with gzip without uncompressing the file, or with the j option to list the content of an archive compressed with bzip2. For *.gz files:

[user_name@localhost]$ tar -tvzf results.tar.gz
drwxrwxr-x name name        0 2016-11-20 11:02 results/
-rw-r--r-- name name    10905 2016-11-16 16:31 results/log1.dat
drwxrwxr-x name name        0 2016-11-16 19:36 results/Res-01/
-rw-r--r-- name name    11672 2016-11-16 15:10 results/Res-01/log.15Feb16.1
-rw-r--r-- name name    11682 2016-11-16 15:10 results/Res-01/log.15Feb16.4
drwxrwxr-x name name        0 2016-11-16 19:37 results/Res-02/
-rw-r--r-- name name    34117 2016-11-16 15:10 results/Res-02/log.15Feb16.balance.b.4
-rw-r--r-- name name    10905 2016-11-20 11:16 new.log.dat
drwxrwxr-x name name        0 2016-11-20 11:02 report/
-rw-r--r-- name name   924729 2015-11-20 04:14 report/report-2016.pdf
-rw-r--r-- name name   924729 2015-11-20 04:14 report/report-a.pdf

For *.bz2 files:

[user_name@localhost]$ tar -tvjf results.tar.bz2
drwxrwxr-x name name        0 2016-11-20 11:02 results/
-rw-r--r-- name name    10905 2016-11-16 16:31 results/log1.dat
drwxrwxr-x name name        0 2016-11-16 19:36 results/Res-01/
-rw-r--r-- name name    11672 2016-11-16 15:10 results/Res-01/log.15Feb16.1
-rw-r--r-- name name    11682 2016-11-16 15:10 results/Res-01/log.15Feb16.4
drwxrwxr-x name name        0 2016-11-16 19:37 results/Res-02/
-rw-r--r-- name name    34117 2016-11-16 15:10 results/Res-02/log.15Feb16.balance.b.4

Notes:

  • Again, in this example the option v is used to display all the details, but it is not required.
  • The two previous commands can also be combined with the pipe { | } and wc, or the pipe { | } and grep, as we have seen previously.

Other Useful Utilities

Check the size of a file, directory or archive

From your terminal, you can use the command du -sh [your_file ...] to see the size:

[user_name@localhost]$ du -sh results.tar work tests
112K results.tar
58K  work
48K  tests

Knowing the size of your files or directories, you can decide how to split them among several archives, if necessary, to avoid handling huge files. Splitting also works for archive files. A big file or tar file can be divided into smaller parts using the following syntax:

split -b <Size-in-MB> <file or tar-file-name> <prefix-name>
[user_name@localhost]$ split -b 100MB results.tar small-res

The option b sets the size of each part, and prefix-name is the prefix for the names of the parts. The above command splits the file results.tar, in the current working directory, into parts of 100 MB each, named small-resaa, small-resab, small-resac, small-resad, and so on.

To recover the original file, we use the cat command as follows:

[user_name@localhost]$ cat small-res* > your_archive_name.tar

With the split command, you can divide a large file into smaller parts by giving the size you want {-b size in MB} and then transfer the parts. Once all the parts are transferred, use the cat command to recover the original file or archive. If you want numeric suffixes instead of alphabetic ones, add the -d option to the split command above.
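The split and reassemble cycle can be checked on a small scale before you rely on it for a transfer. In this sketch a hypothetical 250 kB file of random bytes stands in for a large archive, and the part size is given in bytes.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical 250 kB file standing in for a large archive.
head -c 250000 /dev/urandom > results.tar

# Split into parts of 100000 bytes: small-resaa, small-resab, small-resac.
split -b 100000 results.tar small-res
ls small-res*

# Reassemble; the shell expands small-res* in alphabetical order.
cat small-res* > rebuilt.tar

# cmp is silent when the two files are byte-for-byte identical.
cmp results.tar rebuilt.tar && echo "files are identical"
```

Comparing the rebuilt file with the original, or comparing checksums after a transfer, guards against a missing or corrupted part.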

Reminder of common commands

  • The pwd {print working directory} command shows the current working path.
  • The ls {list} command shows the files and the sub-directories.
  • The command du -sh {disk usage} shows the size of files, directories or sub-directories.
  • Important note: Applying gzip or bzip2 to a file [your_file or your_archive.tar] requires some free space, as does the tar command, to create the final compressed file [your_file.gz or your_file.bz2] or [your_archive.tar.gz or your_archive.tar.bz2]. These commands will fail if there is no space left on the device or if you are over quota. On the CC clusters, use the command quota, or quota -s for more human-readable output, to see whether you have enough space to write additional data.
  • Apply tar to one directory [results]:
[user_name@localhost]$ tar -cvf results.tar results
  • Apply tar to multiple files or directories in order to put them all together into a final single archive file.
[user_name@localhost]$ tar -cvf your_archive.tar dir1 dir2 dir3 dir4 dir5 file1 file2 file3 file4 file5
  • Apply tar to all files or directories that start with a given letter r [or have a given chain of characters]:
[user_name@localhost]$ tar -cvf your_archive.tar r*
  • List the content of a tar file [results.tar] including the details:
[user_name@localhost]$ tar -tvf results.tar
  • List the content of a tar file [results.tar] without details:
[user_name@localhost]$ tar -tf results.tar
  • Count the number of entries in the tar file:
[user_name@localhost]$ tar -tvf results.tar | wc -l 
[user_name@localhost]$ tar -tf results.tar | wc -l
  • Search for a given file [file_name_you_search] in a tar archive file [your_archive.tar] without un-tarring the archive:
[user_name@localhost]$ tar -tf your_archive.tar | grep -a file_name_you_search
[user_name@localhost]$ tar -tvf your_archive.tar | grep -a file_name_you_search
  • List only the files ending with, starting with, or containing a certain pattern in their names, for example files containing log:
[user_name@localhost]$ tar -tf your_archive.tar | grep -a log
[user_name@localhost]$ tar -tvf your_archive.tar | grep -a log
  • Append a file or files or add a new file [new_file] to the end of a tar file [your_archive.tar]:
[user_name@localhost]$ tar -rf your_archive.tar new_file

Note: Files cannot be added to compressed archives [*.gz or *.bz2]. Files can only be added to plain tar archives [*.tar].

  • Add a directory [new_dir] to an existing tar file [your_archive.tar]:
[user_name@localhost]$ tar -rf your_archive.tar new_dir
  • Add one archive [archive_02.tar] to another [archive_01.tar] with concatenate (-A option):
[user_name@localhost]$ tar -A -f archive_01.tar archive_02.tar
  • Extract the whole archive file [your_archive.tar]:
[user_name@localhost]$ tar -xvf your_archive.tar
  • Extract the whole archive file [your_archive.tar] into a specified directory [destination_dir]:
[user_name@localhost]$ tar -xvf your_archive.tar -C destination_dir
  • Compress a file [file0], several files [file1 file2 file3 file4 file5] or an archive file [your_archive.tar] using the gzip command:
[user_name@localhost]$ gzip file0
[user_name@localhost]$ gzip file1 file2 file3 file4 file5
[user_name@localhost]$ gzip your_archive.tar
  • Compress a file [file0], several files [file1 file2 file3 file4 file5] or an archive file [your_archive.tar] using the bzip2 command:
[user_name@localhost]$ bzip2 file0
[user_name@localhost]$ bzip2 file1 file2 file3 file4 file5
[user_name@localhost]$ bzip2 your_archive.tar
  • Compress while archiving, with the z option for gzip or the j option for bzip2:
[user_name@localhost]$ tar -cvzf results.tar.gz results
[user_name@localhost]$ tar -cvjf results.tar.bz2 results/
[user_name@localhost]$ tar -cvzf results.tgz results
[user_name@localhost]$ tar -cvjf results.tbz results/
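  • Exclude particular files [for example, object files matching *.o] while creating a tar file. Quote the pattern so that the shell does not expand it, and give the option before the directory name:
[user_name@localhost]$ tar -cvf your_archive.tar --exclude='*.o' your_directory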
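To check that an exclusion pattern does what you expect before archiving a large directory, you can try it on a small test tree and list the result. The following sketch uses invented file names and only shows one way to verify the outcome.

```shell
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

mkdir your_directory
printf 'int main(void) { return 0; }\n' > your_directory/main.c
printf 'placeholder object file\n'     > your_directory/main.o

# The quotes keep the shell from expanding *.o before tar sees it.
tar -cf your_archive.tar --exclude='*.o' your_directory

archived=$(tar -tf your_archive.tar)
object_count=$(printf '%s\n' "$archived" | grep -c '\.o$')

printf '%s\n' "$archived"
echo "object files in archive: $object_count"
```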
  • Uncompress *.gz or *.bz2 compressed files:

For files with the .gz extension, use gunzip as follows:

[user_name@localhost]$ gunzip your_file.gz
[user_name@localhost]$ gunzip your_archive.tar.gz

For files with the .bz2 extension, use bunzip2 as follows:

[user_name@localhost]$ bunzip2 your_file.bz2
[user_name@localhost]$ bunzip2 your_archive.tar.bz2
  • List the content of a compressed archive [*.tar.gz or *.tar.bz2]:
[user_name@localhost]$ tar -tvzf your_archive.tar.gz
[user_name@localhost]$ tar -tvjf your_archive.tar.bz2

Note: As before, the v option only displays the details and is not required. Both commands can also be piped to wc or grep, as shown previously.

  • Extract a compressed archive file in another directory:
[user_name@localhost]$ tar -xvzf your_archive.tar.gz -C destination_dir
[user_name@localhost]$ tar -xvjf your_archive.tar.bz2 -C destination_dir
[user_name@localhost]$ tar -C destination_dir -xvzf your_archive.tar.gz
[user_name@localhost]$ tar -C destination_dir -xvjf your_archive.tar.bz2
  • Extract data from a compressed archive file in two steps:

For the *.bz2 files:

[user_name@localhost]$ bunzip2 your_archive.tar.bz2
[user_name@localhost]$ tar -C destination_dir -xvf your_archive.tar

For the *.gz files:

[user_name@localhost]$ gunzip your_archive.tar.gz
[user_name@localhost]$ tar -C destination_dir -xvf your_archive.tar
  • Extract one or more files from an archive into another directory:
[user_name@localhost]$ tar -C ./destination_dir/ --extract --file=your_archive.tar path-to-your-file
[user_name@localhost]$ tar -C ./destination_dir/ --extract --file=results.tar "file1" "file2" "file3"

Note: The path of the file to extract must be given exactly as it appears in the archive listing (tar -tf).

  • The previous command can also be used to extract a file from a compressed tar file:

From a gz file:

[user_name@localhost]$ tar -C ./destination_dir/ --extract -z --file=your_archive.tar.gz path-to-your-file

From a bz2 file:

[user_name@localhost]$ tar -C ./destination_dir/ --extract -j --file=your_archive.tar.bz2 path-to-your-file
  • Extract multiple files using wildcards [for example files with *.dat]:
[user_name@localhost]$ tar -C ./destination_dir/ -xvf your_archive.tar --wildcards "path-to-files/*.dat"
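The selective extraction commands above can be tested on a small compressed archive. This sketch assumes GNU tar for the --wildcards option, and all names in it are invented for the example. The member paths passed to tar are the ones printed by tar -tf.

```shell
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

mkdir -p results/run1 destination_dir
printf '1\n'     > results/run1/a.dat
printf '2\n'     > results/run1/b.dat
printf 'notes\n' > results/run1/notes.txt
tar -czf results.tar.gz results

# Extract a single member, named exactly as in the listing.
tar -C destination_dir -xzf results.tar.gz results/run1/notes.txt
after_single=$(find destination_dir -type f | wc -l)

# Extract every member matching a quoted wildcard pattern (GNU tar).
tar -C destination_dir -xzf results.tar.gz --wildcards 'results/run1/*.dat'
after_wildcard=$(find destination_dir -type f | wc -l)

echo "files after single extraction: $after_single"
echo "files after wildcard extraction: $after_wildcard"
```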
  • By default, tar stores symbolic links as links. To archive the files that the links point to instead, use the h option (--dereference):
[user_name@localhost]$ tar -cvhf your_archive.tar your_directory
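The effect of the h option is easiest to see by archiving the same directory with and without it and comparing the detailed listings. In the sketch below, the archive and file names are made up; a symbolic link appears in the listing with an arrow pointing to its target.

```shell
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

mkdir your_directory
printf 'payload\n' > target.txt
ln -s ../target.txt your_directory/link.txt

# Without -h the link itself is stored.
tar -cf links_kept.tar your_directory

# With -h the content of the link target is stored as a regular file.
tar -chf links_followed.tar your_directory

links_in_default=$(tar -tvf links_kept.tar | grep -c -- '->')
links_with_h=$(tar -tvf links_followed.tar | grep -c -- '->')

echo "links stored by default: $links_in_default"
echo "links stored with -h: $links_with_h"
```

Keep in mind that following links can make the archive much larger if they point to big files or directories outside the archived tree.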
  • Add files to compressed archives [*.tar.gz or *.tar.bz2]:

Files cannot be added directly to compressed archives. Instead, uncompress the archive file, add the files as shown previously, then compress it again.

  • Determine the size of the files or directories:
[user_name@localhost]$ du -sh your_file your_archive.tar dir1 dir2 dir3
  • Split a file or a tar file:
[user_name@localhost]$ split -b <Size-in-MB> <tar-file-name>.<extension> prefix-name

For example, use 1000 MB for each piece:

[user_name@localhost]$ split -b 1000MB your_archive.tar small-res

To retrieve the original file:

[user_name@localhost]$ cat small-res* > your_archive_name.tar
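After reassembling the pieces, it is worth checking that the result is identical to the original before deleting anything. The sketch below shows one possible check with cmp on a small test archive. The piece size is given in bytes and kept small so that the example runs quickly; the file names are invented.

```shell
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Create about 2.5 MB of test data and archive it.
dd if=/dev/urandom of=data.bin bs=1024 count=2500 2>/dev/null
tar -cf your_archive.tar data.bin

# Split into pieces of 1000000 bytes: small-res.aa, small-res.ab, ...
split -b 1000000 your_archive.tar small-res.

# The shell expands small-res.* in alphabetical order, which is the
# order in which split wrote the pieces.
cat small-res.* > rebuilt.tar

if cmp -s your_archive.tar rebuilt.tar; then
    verify="identical"
else
    verify="different"
fi
echo "rebuilt archive is $verify"
```

For files transferred between machines, comparing checksums (for example with sha256sum) on both sides serves the same purpose.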