ZFS

From Wikipedia, the free encyclopedia

For other uses, see ZFS (disambiguation).
ZFS
Developer Sun Microsystems
Full name Zettabyte File System
Introduced November 2005 (OpenSolaris)
Structures
Directory contents Extensible Hash table
Limits
Max file size 16 exabytes
Max number of files 2⁴⁸
Max volume size 16 exabytes
Features
Forks Yes (called extended attributes)
Attributes POSIX
File system permissions POSIX
Transparent compression Yes
Transparent encryption No
Supported operating systems Solaris

ZFS is a free, open-source file system produced by Sun Microsystems for its Solaris operating system. It is notable for its high capacity, integration of the concepts of filesystem and volume management, novel on-disk structure, lightweight filesystems, and easy storage pool management.

History

ZFS was designed and implemented by a team at Sun led by Jeff Bonwick. It was announced on September 14, 2004.[1] Source code for the final product was integrated into the main trunk of Solaris development on October 31, 2005[2] and released as part of build 27 of OpenSolaris on November 16, 2005. Sun announced that ZFS was integrated into the 6/06 update to Solaris 10 in June 2006, one year after the opening of the OpenSolaris community.[3]

The name originally stood for "Zettabyte File System", but is now a pseudo-initialism.[4]

Capacity

ZFS is a 128-bit file system, which means it can store 18 billion billion (18 quintillion) times more data than current 64-bit systems. The limits of ZFS are designed to be so large that they will never be encountered in practice. Project leader Bonwick said, "Populating 128-bit file systems would exceed the quantum limits of earth-based storage. You couldn't fill a 128-bit storage pool without boiling the oceans."[1]

Some theoretical limits in ZFS are:

  • 2⁴⁸ — Number of snapshots in any file system (about 2.8 × 10¹⁴)
  • 2⁴⁸ — Number of files in any individual file system (about 2.8 × 10¹⁴)
  • 16 exabytes — Maximum size of a file system
  • 16 exabytes — Maximum size of a single file
  • 16 exabytes — Maximum size of any attribute
  • 3 × 10²³ petabytes — Maximum size of any zpool
  • 2⁵⁶ — Number of attributes of a file (actually constrained to 2⁴⁸, the number of files in a ZFS file system)
  • 2⁵⁶ — Number of files in a directory (actually constrained to 2⁴⁸, the number of files in a ZFS file system)
  • 2⁶⁴ — Number of devices in any zpool
  • 2⁶⁴ — Number of zpools in a system
  • 2⁶⁴ — Number of file systems in a zpool

As an example of how large these numbers are, if 1,000 files were created every second, it would take about 9,000 years to reach the limit of the number of files.
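
As a sanity check on that figure, the arithmetic is easy to reproduce; the short Python sketch below simply restates the numbers from the text above.

    # Creating 1,000 files per second against the 2^48 file limit.
    files_limit = 2 ** 48
    rate = 1000                                # files created per second
    seconds = files_limit / rate
    years = seconds / (60 * 60 * 24 * 365.25)
    print(f"about {years:,.0f} years")         # ~8,900 years, consistent with the text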

In reply to a question about filling up ZFS without boiling the oceans, Bonwick said:[5]

"Although we'd all like Moore's Law to continue forever, quantum mechanics imposes some fundamental limits on the computation rate and information capacity of any physical device. In particular, it has been shown that 1 kilogram of matter confined to 1 liter of space can perform at most 1051 operations per second on at most 1031 bits of information [see Seth Lloyd, "Ultimate physical limits to computation." Nature 406, 1047-1054 (2000)]. A fully populated 128-bit storage pool would contain 2128 blocks = 2137 bytes = 2140 bits; therefore the minimum mass required to hold the bits would be (2140 bits) / (1031 bits/kg) = 136 billion kg.
To operate at the 1031 bits/kg limit, however, the entire mass of the computer must be in the form of pure energy. By E=mc2, the rest energy of 136 billion kg is 1.2x1028 J. The mass of the oceans is about 1.4x1021 kg. It takes about 4,000 J to raise the temperature of 1 kg of water by 1 degree Celsius, and thus about 400,000 J to heat 1 kg of water from freezing to boiling. The latent heat of vaporization adds another 2 million J/kg. Thus the energy required to boil the oceans is about 2.4x106 J/kg * 1.4x1021 kg = 3.4x1027 J. Thus, fully populating a 128-bit storage pool would, literally, require more energy than boiling the oceans."
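
Bonwick's back-of-the-envelope figures check out; the Python sketch below reproduces them, using c ≈ 3 × 10⁸ m/s and the heat constants quoted in the passage.

    # Reproducing the quote's arithmetic step by step.
    bits = 2 ** 140                            # bits in a full 128-bit pool
    mass = bits / 1e31                         # kg, at Lloyd's 10^31 bits/kg bound
    energy = mass * (3e8) ** 2                 # rest energy in joules, E = mc^2
    ocean_mass = 1.4e21                        # kg of water in the oceans
    boil = ocean_mass * (4000 * 100 + 2e6)     # J: heat 0->100 C, then vaporize
    print(f"mass:  {mass:.3g} kg")             # ~1.4e11 kg (Bonwick's 136 billion)
    print(f"pool:  {energy:.3g} J")            # ~1.25e28 J
    print(f"ocean: {boil:.3g} J")              # ~3.4e27 J, an order of magnitude less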

Storage pools

Unlike a traditional file system, which resides on a single device and thus requires a volume manager to use more than one device, ZFS is built on top of virtual storage pools called zpools. A pool is constructed from virtual devices (vdevs), each of which is either a raw device, a mirror (RAID 1) of one or more devices, or a RAID-Z group of two or more devices. The storage capacity of all vdevs is then available to all of the file systems in the zpool.

To limit the amount of space a file system can occupy, a quota can be applied, and to guarantee that space will be available to a specific file system, a reservation can be set.
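
The relationship between pools, vdevs, quotas, and reservations can be made concrete with a toy model. The Python sketch below is an illustration only; the class names and the accounting rules are simplifications invented for this example, not ZFS's actual implementation.

    class Vdev:
        """A virtual device contributing its capacity to a pool."""
        def __init__(self, capacity):
            self.capacity = capacity

    class Filesystem:
        def __init__(self, name, quota=None, reservation=0):
            self.name = name
            self.quota = quota              # ceiling on space used, or None
            self.reservation = reservation  # space guaranteed to this filesystem
            self.used = 0

    class Zpool:
        def __init__(self, vdevs):
            self.vdevs = vdevs
            self.filesystems = []

        def capacity(self):
            # Every filesystem draws from the combined capacity of all vdevs.
            return sum(v.capacity for v in self.vdevs)

        def available_to(self, fs):
            used = sum(f.used for f in self.filesystems)
            # Unused reservations held for *other* filesystems are off-limits.
            held = sum(max(f.reservation - f.used, 0)
                       for f in self.filesystems if f is not fs)
            return self.capacity() - used - held

        def write(self, fs, nbytes):
            if fs.quota is not None and fs.used + nbytes > fs.quota:
                raise IOError(f"{fs.name}: quota exceeded")
            if nbytes > self.available_to(fs):
                raise IOError(f"{fs.name}: no space left in pool")
            fs.used += nbytes

    pool = Zpool([Vdev(100), Vdev(100)])       # 200 units of pooled capacity
    home = Filesystem("home", quota=50)        # may never occupy more than 50
    logs = Filesystem("logs", reservation=30)  # always has 30 available
    pool.filesystems += [home, logs]
    pool.write(home, 40)                       # fine: under quota, space free
    print(pool.available_to(home))             # 130 = 200 - 40 used - 30 held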

Copy-on-write transactional model

ZFS uses a copy-on-write, transactional object model. All block pointers within the filesystem contain a 256-bit checksum of the target block which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, and then any metadata blocks referencing it are similarly read, reallocated, and written. To reduce the overhead of this process, multiple updates are grouped into transaction groups, and an intent log is used when synchronous write semantics are required.
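
A heavily simplified sketch may make the mechanism concrete. In the Python model below, the block format, the choice of SHA-256, and the helper names are all inventions of this example; a real ZFS block pointer carries its 256-bit checksum alongside much more state.

    import hashlib, json

    disk = {}          # address -> bytes; blocks are written once, never changed
    next_addr = 0

    def put(data):
        """Allocate a fresh block: copy-on-write never overwrites in place."""
        global next_addr
        addr, next_addr = next_addr, next_addr + 1
        disk[addr] = data
        # The *pointer* to a block records the checksum of its contents.
        return {"addr": addr, "sha256": hashlib.sha256(data).hexdigest()}

    def get(ptr):
        data = disk[ptr["addr"]]
        if hashlib.sha256(data).hexdigest() != ptr["sha256"]:
            raise IOError("checksum mismatch: corruption detected on read")
        return data

    def put_meta(ptrs):  # a metadata block is just a serialized list of pointers
        return put(json.dumps(ptrs).encode())

    def get_meta(ptr):
        return json.loads(get(ptr))

    # A tiny two-level tree: one root metadata block over two data blocks.
    root = put_meta([put(b"hello"), put(b"world")])

    # Updating a block allocates a new data block and a new root; the old
    # root still describes the previous, fully consistent state of the tree.
    children = get_meta(root)
    children[1] = put(b"WORLD")
    new_root = put_meta(children)

    assert get(get_meta(root)[1]) == b"world"      # old tree is unchanged
    assert get(get_meta(new_root)[1]) == b"WORLD"  # new tree sees the update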

Snapshots

The ZFS copy-on-write model has another powerful advantage: when ZFS writes new data, instead of releasing the blocks containing the old data, it can instead retain them, creating a snapshot version of the file system. ZFS snapshots are created very quickly, since all the data comprising the snapshot is already stored; they are also space efficient, since any unchanged data is shared among the file system and its snapshots.

Writable snapshots ("clones") can also be created, resulting in two independent file systems that share a set of blocks. As changes are made to any of the clone file systems, new data blocks are created to reflect those changes, but any unchanged blocks continue to be shared, no matter how many clones exist.
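
Continuing the toy copy-on-write model from the previous section (and reusing its put, put_meta, and get_meta helpers), a snapshot falls out as nothing more than a retained root pointer, and a clone as a second writable tree seeded from that pointer.

    snapshot = root                        # taken instantly: no data is copied

    # A clone starts from the snapshot's pointers and diverges as it is written.
    clone_children = get_meta(snapshot)
    clone_children[0] = put(b"HELLO")      # only the changed block gets new storage
    clone_root = put_meta(clone_children)

    # The unchanged block is still shared: both trees name the same address.
    assert get_meta(snapshot)[1]["addr"] == get_meta(clone_root)[1]["addr"]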

Dynamic striping

ZFS dynamically stripes data across all devices to maximize throughput. As additional devices are added to the zpool, the stripe width automatically expands to include them; all disks in a pool are thus used, and the write load is balanced across them.
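
A cartoon of this behavior is sketched in Python below; real ZFS allocation is far more sophisticated, but the principle that new devices immediately absorb new writes is the same.

    devices = [[], []]                 # two disks, each holding a list of blocks

    def write_block(block):
        # Steer each write to the least-full device so load stays balanced.
        min(devices, key=len).append(block)

    for i in range(6):
        write_block(f"block-{i}")      # spread across the original two disks

    devices.append([])                 # the administrator adds a third disk...
    for i in range(6, 12):
        write_block(f"block-{i}")      # ...and new writes flow onto it at once

    print([len(d) for d in devices])   # [4, 4, 4]: the stripe widened itself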

Variable block sizes

ZFS uses variable-sized blocks of up to 128 kilobytes. The currently available code allows the administrator to tune the maximum block size used, as certain workloads do not perform well with large blocks. Automatic tuning to match workload characteristics is contemplated.

If compression is enabled, variable block sizes are used: if a block can be compressed to fit into a smaller block size, the smaller size is used on disk, consuming less storage and improving I/O throughput (though at the cost of increased CPU use for the compression and decompression operations).
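
The size decision can be sketched in a few lines of Python, using zlib as a stand-in compressor; the candidate block sizes and the store helper are assumptions of this example rather than ZFS's actual policy.

    import zlib

    BLOCK_SIZES = [512 * 2 ** i for i in range(9)]   # 512 bytes up to 128 KB

    def store(data):
        compressed = zlib.compress(data)
        payload = min(data, compressed, key=len)     # keep whichever is smaller
        # The smallest block size that still holds the chosen payload.
        size = next(s for s in BLOCK_SIZES if s >= len(payload))
        return payload, size

    text = b"abcdef" * 20000                         # 120 KB, highly compressible
    payload, size = store(text)
    print(len(text), "->", len(payload), "bytes in a", size, "byte block")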

Lightweight filesystem creation

Creating a new filesystem within a ZFS storage pool is extremely quick and easy; the time and effort required are closer to those for making a new directory than those for formatting a volume with a traditional filesystem.

Additional capabilities

  • Explicit I/O priority with deadline scheduling
  • Globally optimal I/O sorting and aggregation
  • Multiple independent prefetch streams with automatic length and stride detection
  • Parallel, constant-time directory operations

Limitations

ZFS is not currently available as a root filesystem, since there is no ZFS boot support; the ZFS Boot project is working on adding root filesystem support.

ZFS lacks transparent encryption (à la NTFS), and presently only n+1 redundancy is possible; n+2 redundancy (RAID level 6) exists only in the development branch, via the OpenSolaris distribution. These omissions in the production branch of Solaris (as of the current Solaris 6/06 release) diminish ZFS's attractiveness in several of the situations at which it is targeted.

ZFS does not support per-user or per-group quotas.

Platforms

ZFS is part of Sun's own Solaris operating system and is thus available on both SPARC- and x86-based systems. Since the code for ZFS is open source, a port to other operating systems and platforms can be produced without Sun's involvement.

Nexenta OS, a complete GNU-based open source operating system built on top of the OpenSolaris kernel and runtime, includes a ZFS implementation, added in version alpha1.

Apple also appears to be interested in porting ZFS to its Mac OS X operating system, according to a post by a Sun employee on the opensolaris.org zfs-discuss mailing list.[6]

Porting ZFS to Linux is complicated by incompatibilities between the CDDL, the license under which its source is released, and the GPL, the license which governs the Linux kernel. To work around this problem, the Google Summer of Code program is sponsoring a port of ZFS to Linux's FUSE framework,[7] so that the filesystem will run in userspace instead. However, running a file system outside the kernel on Linux carries a significant performance penalty.

There are no plans to port ZFS to HP-UX or AIX.[8]

Matt Dillon started porting ZFS to DragonFly BSD as part of the plans for its 1.5 release,[9] and work is currently under way on a FreeBSD port as well, headed by developer Pawel Jakub Dawidek.[10] ZFS for FreeBSD will most likely first appear in a 7.x release.

Adaptive Endianness

Pools and their associated ZFS file systems can be moved between different platform architectures, even between systems implementing different byte orders. The ZFS block pointer format allows for filesystem metadata to be stored in an endian-adaptive way; individual metadata blocks are written with the native byte order of the system writing the block. When reading, if the stored endianness doesn't match the endianness of the system, the metadata is byte-swapped in memory.

This does not affect the stored data itself: as is usual in POSIX systems, files appear to applications as simple arrays of bytes, so applications creating and reading data remain responsible for doing so in a way independent of the underlying system's endianness.
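
The byte-swap-on-read idea can be demonstrated with Python's struct module. The one-byte order flag and the record layout below are inventions of this example; real ZFS keeps the endianness indicator in the block pointer itself.

    import struct

    BIG, LITTLE = 1, 0

    def write_meta(value, writer_is_big_endian):
        # The writer uses its native byte order and records which one it used.
        fmt = ">Q" if writer_is_big_endian else "<Q"
        flag = BIG if writer_is_big_endian else LITTLE
        return bytes([flag]) + struct.pack(fmt, value)

    def read_meta(block):
        flag, payload = block[0], block[1:]
        # If the stored order differs from the host's, struct swaps the bytes
        # in memory; the block on disk is never rewritten.
        fmt = ">Q" if flag == BIG else "<Q"
        return struct.unpack(fmt, payload)[0]

    blk = write_meta(0x0123456789ABCDEF, writer_is_big_endian=True)  # SPARC-style
    assert read_meta(blk) == 0x0123456789ABCDEF  # readable on any architecture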

References

  1. ^ a b ZFS: the last word in file systems. Sun Microsystems (September 14, 2004). Retrieved on 2006-04-30.
  2. ^ Jeff Bonwick (October 31, 2005). ZFS: The Last Word in Filesystems. Jeff Bonwick's Blog. Retrieved on 2006-04-30.
  3. ^ Sun Celebrates Successful One-Year Anniversary of OpenSolaris. Sun Microsystems (June 20, 2006).
  4. ^ Jeff Bonwick (2006-05-04). You say zeta, I say zetta. Jeff Bonwick's Blog. Retrieved on 2006-09-08.
  5. ^ Jeff Bonwick (September 25, 2004). 128-bit storage: are you high?. Sun Microsystems. Retrieved on 2006-07-12.
  6. ^ Porting ZFS to OSX. zfs-discuss (April 27, 2006). Retrieved on 2006-04-30.
  7. ^ Ricardo Correia (May 26, 2006). Announcing ZFS on FUSE/Linux. Retrieved on 2006-07-15.
  8. ^ Fast Track to Solaris 10 Adoption: ZFS Technology. Solaris 10 Technical Knowledge Base. Sun Microsystems. Retrieved on 2006-04-24.
  9. ^ Dillon, Matt (December 17, 2005). Plans for 1.5. Retrieved on 2006-04-24.
  10. ^ Dawidek, Pawel Jakub (August 22, 2006). Porting ZFS file system to FreeBSD. Retrieved on 2006-08-22.
