Tar (file format)

Tar
File extension: .tar
MIME type: application/x-tar
Uniform Type Identifier: public.tar-archive
Magic: ustar at byte 257
Type of format: file archive
Container for: Misc. files (often source code)
Contained by: compress, gzip, bzip2

In computing, the tar file format (the name is derived from tape archive) is a type of archive bitstream or file format. The format is traditionally produced by the Unix command tar, and was standardized by POSIX.1-1988 and later POSIX.1-2001. Initially used for tape backup, it is now commonly used to collect many files into one larger file for distribution or archiving, while preserving file system information such as user and group permissions, dates, and directory structures.

The format was originally developed as a raw format for sequential access devices such as tape drives, specifically for backup purposes. It is now almost always used for general file archiving. tar's linear roots can still be seen in its ability to work on any data stream and in its slow partial extraction: the archive has no central index, so extracting only the final file requires reading through the whole archive. A tar file (somefile.tar) that is subsequently compressed with a compression utility such as gzip or bzip2 yields a compressed tar file with a compound filename extension (e.g. somefile.tar.gz, somefile.tar.bz2). A .tar file containing GNU or other program source code is commonly referred to as a tarball, whether or not it is compressed.

As is common with Unix utilities, tar is a single specialist program. It follows the Unix philosophy in that it can "do only one thing" (archive), "but do it well". Since tar has no built-in data compression facilities, it is most commonly used in tandem with an external compression utility such as gzip, bzip2 or, formerly, compress. These utilities generally compress only a single file, hence the pairing with tar, which produces a single file from many. For convenience, the BSD and GNU versions of tar support the command line options -z (gzip), -j (bzip2), and -Z (compress), which compress or decompress the archive being processed, although even then the (de)compression is still performed by an external program. Compression is sometimes avoided for long-term storage, because damage to a small part of a compressed archive can make far more of the data unrecoverable than the same damage to an uncompressed one.

Usage

On the command line

  • To pack tar files, use the following commands:
    • for an uncompressed tar file:
      tar -cf packed_files.tar file_to_pack1 file_to_pack2 ...
    • to pack and compress (one step at a time):
      tar -cf packed_files.tar file_to_pack1 file_to_pack2 ...
      gzip packed_files.tar
    • to pack and compress all at once:
      tar -cf - file_to_pack1 file_to_pack2 ... | gzip -c > packed_files.tar.gz
    • to create a tar from a directory and its subdirectories:
      tar -cvf packed_files.tar dir_to_pack
  • To unpack tar files, use the following commands:
    • for an uncompressed tar file:
      tar -xvf file_to_unpack.tar
    • to decompress and unpack one step at a time:
      gunzip packed_files.tar.gz
      tar -xf packed_files.tar
    • to decompress and unpack all at once:
      gunzip -c packed_files.tar.gz | tar -xf -
  • To list the contents of a tar file, use the following command:
    tar -tvf file_to_list.tar

To use bzip2 instead of gzip, replace gzip with bzip2 and gunzip with bunzip2 in the commands above; the compressed files then conventionally carry the .bz2 extension in place of .gz.
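
As a rough illustration of the same pack, list, and unpack steps outside the shell, the sketch below uses Python's standard tarfile module on in-memory archives. The helper names pack, list_names, and unpack are assumptions made for this example; they are not part of tar or of any library.

```python
import io
import tarfile


def pack(files):
    """Return an uncompressed tar archive as bytes (roughly: tar -cf).

    `files` maps member names to their contents as bytes.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_names(blob):
    """Return the member names in archive order (roughly: tar -tf)."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as archive:
        return archive.getnames()


def unpack(blob):
    """Return a dict of regular-file contents (roughly: tar -xf, in memory)."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as archive:
        return {m.name: archive.extractfile(m).read() for m in archive if m.isfile()}


if __name__ == "__main__":
    blob = pack({"file_to_pack1": b"first\n", "file_to_pack2": b"second\n"})
    print(list_names(blob))
```

Working on bytes rather than on files in the current directory keeps the sketch free of side effects; a real script would normally pass a filename to tarfile.open instead.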

Compression options

BSD and GNU tar have compression flags that make it possible to create or extract gzipped, bzipped or compressed tarballs in one step. The following commands take advantage of this:

  • To pack and compress:
    tar -czf packed_files.tgz file_to_pack1 file_to_pack2 ...
    tar -cjf packed_files.tbz2 file_to_pack1 file_to_pack2 ...
    tar -cZf packed_files.tar.Z file_to_pack1 file_to_pack2 ...
    • using some other arbitrary compression utility that works as a filter:
      tar --use-compress-program=name_of_program -cf packed_files.tar.XXX file_to_pack1 file_to_pack2 ...
  • To uncompress and unpack:
    • a gzip compressed tar file:
      tar -xzf file_to_unpack.tar.gz
    • a gzip compressed tar file (extracting to a specific directory):
      tar -xzf file_to_unpack.tar.gz -C /directory_to_extract_to
    • a bzip2 compressed tar file:
      tar -xjf file_to_unpack.tar.bz2
    • a compress compressed tar file:
      tar -xZf file_to_unpack.tar.Z
    • an arbitrary-compression-utility-compressed tar file:
      tar --use-compress-program=name_of_program -xf file_to_unpack.tar.XXX

Some versions of tar use the -y switch to invoke bzip2 rather than -j.
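
For comparison, a one-step pack-and-compress can be sketched with Python's standard tarfile module, whose "w:gz" and "w:bz2" modes play roughly the role of -z and -j. Unlike the tar behaviour described above, this module compresses in-process rather than through an external program. The function names are assumptions made for this example, and the -Z (compress) case is not covered.

```python
import io
import tarfile


def pack_compressed(files, compression):
    """Return a compressed tar archive as bytes.

    `compression` is "gz" (compare tar -czf) or "bz2" (compare tar -cjf).
    `files` maps member names to their contents as bytes.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:" + compression) as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def unpack_any(blob):
    """Read an archive, letting tarfile detect the compression ("r:*")."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as archive:
        return {m.name: archive.extractfile(m).read() for m in archive if m.isfile()}


if __name__ == "__main__":
    blob = pack_compressed({"file_to_pack1": b"data" * 100}, "gz")
    print(sorted(unpack_any(blob)))
```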

Historical tricks

The following syntax (not related to archiving) was used almost universally before the -d, -R, -p and -a options were added to the cp command.

  • To copy directories precisely:
    tar -cf - one_directory | (cd another_directory && tar -xpf - )

In graphical user interfaces

Within graphical user interfaces (GUIs), tar files can often be created and extracted without the tar command line, using graphical file archivers or the archiving capabilities increasingly built into file managers.

In desktop environments such as KDE or GNOME, clicking a downloaded tar file with the secondary button of a pointing device opens a context menu from which the user can choose to extract the file; clicking with the primary button lets the user view its contents. Similar context-menu options exist for archiving a group of selected files.

In this way, a user more accustomed to GUIs can extract or create tar files without knowing the sometimes cryptic tar command line options.

Ark is an example of such a tool for KDE. On Windows systems, most common archiving tools can handle tar files.

Filename extensions

The following is a list of common filename extensions for uncompressed and compressed tar archives:

  • tar file:
    • .tar
  • gzipped tar file:
    • .tar.gz
    • .tgz
    • .tar.gzip
    • .war Konqueror Web ARchive file
  • bzipped tar file:
    • .tar.bz2
    • .tar.bzip2
    • .tbz2
    • .tbz
  • tar file compressed with compress:
    • .tar.Z
    • .taz

Internet Content Type

The Internet content type, media type or MIME type for tar archives is application/x-tar.

Format details

A limitation of early tape drives was that data could only be written to them in 512-byte blocks. As a result, data in tar files is arranged in 512-byte blocks.

A tar file is the concatenation of one or more files. Each file is preceded by a header block. The file data is written unaltered, except that its length is rounded up to a multiple of 512 bytes and the extra space is zero-filled. The end of an archive is marked by at least two consecutive zero-filled blocks.
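
The block arithmetic described above can be sketched as follows. The helper names are assumptions made for this example. The last helper builds a one-file archive with Python's standard tarfile module so that the layout can be inspected; real tools may write more than the two required end blocks (tarfile, for instance, pads the archive further).

```python
import io
import tarfile

BLOCK = 512


def padded_size(length):
    """Round a data length up to a whole number of 512-byte blocks."""
    return (length + BLOCK - 1) // BLOCK * BLOCK


def minimal_archive_size(file_lengths):
    """One header block plus padded data per file, plus two zero-filled end blocks."""
    return sum(BLOCK + padded_size(n) for n in file_lengths) + 2 * BLOCK


def build_archive(name, data):
    """Build a one-file archive in memory so its layout can be inspected."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as archive:
        info = tarfile.TarInfo(name=name)
        info.size = len(data)
        archive.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    blob = build_archive("hello.txt", b"hello")
    print("header block:  bytes 0-511")
    print("data block:   ", blob[512:517], "followed by zero padding")
    print("minimal size: ", minimal_archive_size([5]), "actual size:", len(blob))
```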

File header

The file header block contains metadata about a file. To ensure portability across different architectures with different byte orderings, the information in the header block is encoded in ASCII. Thus if all the files in an archive are ASCII text files, the archive is essentially an ASCII file, apart from the zero bytes used as padding.

The fields defined by the original Unix tar format are listed in the table below. When a field is unused it is zero-filled. The header is padded with zero bytes to make it up to a 512-byte block.

Field offset  Field size  Field
0             100         File name
100           8           File mode
108           8           Owner user ID
116           8           Group ID
124           12          File size in bytes
136           12          Last modification time
148           8           Checksum for header block
156           1           Link indicator
157           100         Name of linked file
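
A minimal sketch of decoding these fields from a 512-byte header block is shown below, assuming the numeric fields hold octal ASCII digits terminated by a NUL or a space, as this section describes. The names FIELDS, NUMERIC, and parse_header are assumptions made for this example, and the binary size extension used by some implementations is not handled. The demonstration header is generated with Python's standard tarfile module.

```python
import tarfile

# (field, offset, size) for the original Unix tar header, following the table above.
FIELDS = [
    ("name", 0, 100),
    ("mode", 100, 8),
    ("uid", 108, 8),
    ("gid", 116, 8),
    ("size", 124, 12),
    ("mtime", 136, 12),
    ("chksum", 148, 8),
    ("linkflag", 156, 1),
    ("linkname", 157, 100),
]
NUMERIC = {"mode", "uid", "gid", "size", "mtime", "chksum"}


def parse_header(block):
    """Decode the original tar header fields from one 512-byte block."""
    if len(block) != 512:
        raise ValueError("a header block is exactly 512 bytes")
    header = {}
    for field, offset, size in FIELDS:
        raw = block[offset:offset + size]
        if field in NUMERIC:
            # Octal digits, padded or terminated by spaces and NULs.
            digits = raw.strip(b" \0").decode("ascii")
            header[field] = int(digits, 8) if digits else 0
        else:
            # Text fields end at the first NUL (or fill the whole field).
            header[field] = raw.split(b"\0", 1)[0].decode("ascii")
    return header


if __name__ == "__main__":
    info = tarfile.TarInfo("etc/passwd")
    info.mode = 0o644
    info.size = 1134
    for key, value in parse_header(info.tobuf(format=tarfile.USTAR_FORMAT)).items():
        print(key, repr(value))
```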

The Link indicator field can have the following values:

Value           Meaning
0               Normal file
(ASCII NUL)[1]  Normal file
1               Hard link
2               Symbolic link[2]
3               Character special
4               Block special
5               Directory
6               FIFO
7               Contiguous file[3]

A directory is also indicated by having a trailing slash (/) in the name.

For historical reasons numerical values are encoded in octal with leading zeroes. The final character is either a NUL or a space. Thus although 12 bytes are reserved for storing the file size, only 11 octal digits can be stored, which limits archived files to just under 8 gigabytes (8^11 bytes, or 8 GiB). To overcome this limitation some versions of tar, including the GNU implementation, support an extension in which the file size is encoded in binary. Additionally, versions of GNU tar from 1999 and before pad the values with space characters instead of zero characters.
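
The size limit follows directly from the field width, as the short sketch below works through. The names MAX_OCTAL_SIZE and encode_size are assumptions made for this example, and the NUL terminator is chosen arbitrarily over the space that the text also allows.

```python
SIZE_FIELD_BYTES = 12

# 11 octal digits remain once the terminating character is accounted for.
MAX_OCTAL_SIZE = 8 ** (SIZE_FIELD_BYTES - 1) - 1


def encode_size(size):
    """Encode a file size as 11 octal digits with leading zeroes plus a NUL."""
    if not 0 <= size <= MAX_OCTAL_SIZE:
        raise ValueError("size does not fit in 11 octal digits")
    return ("%011o" % size).encode("ascii") + b"\0"


if __name__ == "__main__":
    print(MAX_OCTAL_SIZE + 1 == 8 * 2 ** 30)  # 8^11 bytes is exactly 8 GiB
    print(encode_size(1134))
```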

The checksum is calculated by taking the sum of the byte values of the header block, with the eight checksum bytes taken to be ASCII spaces (value 32). It is stored as a six-digit octal number with leading zeroes, followed by a NUL and then a space.
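
A minimal sketch of this calculation follows, with the checksum field taken to occupy bytes 148 to 155 as in the field table. The function names are assumptions made for this example; the demonstration header is generated with Python's standard tarfile module and its stored checksum is compared with the recomputed one.

```python
import tarfile


def header_checksum(block):
    """Sum all header bytes, counting the eight checksum bytes as spaces."""
    if len(block) != 512:
        raise ValueError("a header block is exactly 512 bytes")
    return sum(block[:148]) + 8 * ord(" ") + sum(block[156:])


def checksum_field(block):
    """Format the checksum as six octal digits, a NUL, and a space."""
    return ("%06o" % header_checksum(block)).encode("ascii") + b"\0 "


if __name__ == "__main__":
    block = tarfile.TarInfo("etc/passwd").tobuf(format=tarfile.USTAR_FORMAT)
    print("stored:    ", block[148:156])
    print("recomputed:", checksum_field(block))
```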

USTAR format

Most modern tar programs read and write archives in the new USTAR (Uniform Standard Tape Archive) format, which has an extended header definition as defined by the POSIX (IEEE P1003.1) standards group. Older tar programs will ignore the extra information, while newer programs will test for the presence of the "ustar" string to determine if the new format is in use. The USTAR format allows for longer file names and stores extra information about each file.

Field offset  Field size  Field
0             156         (as in old format)
156           1           Type flag
157           100         (as in old format)
257           6           USTAR indicator
263           2           USTAR version
265           32          Owner user name
297           32          Owner group name
329           8           Device major number
337           8           Device minor number
345           155         Filename prefix
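
The test for the "ustar" string and the use of the filename prefix can be sketched as below. The function names are assumptions made for this example, and joining the prefix and the name with a slash follows the POSIX definition of the format. GNU-specific uses of the header are not handled. The demonstration header is generated with Python's standard tarfile module.

```python
import tarfile


def field_text(block, offset, size):
    """Return a NUL-terminated ASCII text field from a header block."""
    return block[offset:offset + size].split(b"\0", 1)[0].decode("ascii")


def is_ustar(block):
    """Check for the "ustar" indicator at offset 257."""
    return block[257:262] == b"ustar"


def full_name(block):
    """Combine the filename prefix (offset 345) with the name (offset 0)."""
    name = field_text(block, 0, 100)
    if is_ustar(block):
        prefix = field_text(block, 345, 155)
        if prefix:
            return prefix + "/" + name
    return name


def owner_names(block):
    """Return the (user name, group name) pair stored in a USTAR header."""
    return field_text(block, 265, 32), field_text(block, 297, 32)


if __name__ == "__main__":
    info = tarfile.TarInfo("d" * 120 + "/" + "f" * 50)  # too long for the 100-byte name field
    info.uname = info.gname = "root"
    block = info.tobuf(format=tarfile.USTAR_FORMAT)
    print(is_ustar(block), len(full_name(block)), owner_names(block))
```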

Example

The example below shows the hex dump of a header block from a tar file created using the GNU tar program. It was dumped with the od program. The "ustar" magic string can be seen, meaning that the tar file is in USTAR format.

0000000   e   t   c   /   p   a   s   s   w   d nul nul nul nul nul nul
0000020 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0000140 nul nul nul nul   0   1   0   0   6   4   4 nul   0   0   0   0
0000160   0   0   0 nul   0   0   0   0   0   0   0 nul   0   0   0   0
0000200   0   0   4   1   3   5   5 nul   1   0   1   5   5   0   6   1
0000220   1   0   5 nul   0   1   1   5   5   6 nul  sp   0 nul nul nul
0000240 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0000400 nul   u   s   t   a   r  sp  sp nul   r   o   o   t nul nul nul
0000420 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
0000440 nul nul nul nul nul nul nul nul nul   r   o   o   t nul nul nul
0000460 nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0001000

Note that the tar in OpenBSD 3.7 does not write the two space characters after ustar; it writes NUL characters instead.

Tarbombs

Tarbomb is derogatory hacker slang used to refer to a tarball containing files that untar to the current directory instead of untarring into a directory of their own. Such behaviour is often considered bad etiquette on the part of the archive's creator.

For example:

$ tar zxvf urchin5703_redhat_ent3.tar.gz
gunzip
inspector
install.sh
install.txt
license.txt
README
urchin.tar.gz
$ pwd
/home/suso

In the above example, all the files are untarred into the user's current directory, when they should have been untarred into a directory called 'urchin' or something similar. This is a problem if the archive overwrites files of the same name in the current directory. It is also inconvenient for the user, who must then find and delete the extracted files scattered among the directory's other files. This often happens in the user's home directory.

To guard against tarbombs, list the files contained in the archive with the -t option before extracting with the -x option. For example:

tar ztvf tarfile.tar.gz
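
The same check can be automated over the listed member names. The sketch below is a simple heuristic, not a feature of tar: it flags an archive whose members do not share a single top-level entry. The function names are assumptions made for this example, and the names would come from a listing such as the one above.

```python
from pathlib import PurePosixPath


def top_level_entries(names):
    """Return the set of first path components among the member names."""
    tops = set()
    for name in names:
        parts = PurePosixPath(name).parts  # a leading "./" is dropped here
        if parts:
            tops.add(parts[0])
    return tops


def looks_like_tarbomb(names):
    """True when the members would land in more than one top-level entry."""
    return len(top_level_entries(names)) > 1


if __name__ == "__main__":
    listing = ["gunzip", "inspector", "install.sh", "install.txt",
               "license.txt", "README", "urchin.tar.gz"]
    print(looks_like_tarbomb(listing))
    print(looks_like_tarbomb(["randomsig-1.9/", "randomsig-1.9/README"]))
```

An archive holding a single file at the top level is not flagged by this heuristic, since it cannot scatter files across the directory.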

To avoid creating a tarbomb, place all the files and directories intended for the archive in a single working directory, then change to its parent directory and run the following:

tar zcvf tarfile.tar.gz directory_name_you_want_to_tar

For programs, a common convention is to name the directory after the program, followed by a hyphen and the program's version. For example:

mkdir randomsig-1.9
mv workingdir/* randomsig-1.9/
tar zcvf randomsig-1.9.tar.gz randomsig-1.9

This way, when a user downloads the program and untars it, it will not overwrite older versions of the same program or other important files in the current directory.


Tarpit

Tarpit is a term for a method of revision control in which a tar file is used to capture the state of development of a software module at a particular point in time. Through the use of descriptive names, a tarpit loosely mirrors the tagging and branching features of revision control software.

Notes

  1. This is probably a workaround for buggy tar implementations (the byte 0x00 is ASCII NUL).
  2. GNU tar's headers mark this field as "Reserved".
  3. Apparently relevant on an operating system called RTU, where this would be a normal file written in one contiguous section on disk. GNU tar's headers mark this field as "Reserved", and such items will probably be extracted as normal files on other operating systems.
