Journaling file system

A journaling (or journalling) file system is a file system that logs changes to a journal (usually a circular log in a specially-allocated area) before actually writing them to the main file system.

Rationale

File systems tend to be very large data structures; updating them to reflect changes to files and directories usually requires many separate write operations. This makes it possible for an interruption (such as a power failure or system crash) between writes to leave the data structures in an invalid intermediate state.

For example, deleting a file on a Unix file system involves two steps:

  1. removing its directory entry
  2. marking the file's inode as free in the map of free inodes

If only step 1 is performed before a crash, there will be an orphaned inode and hence a storage leak. If, on the other hand, only step 2 is performed before the crash, the not-yet-deleted inode will be marked free and possibly be overwritten by something else.
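
The two failure modes can be illustrated with a small sketch (hypothetical Python; the names and structures are illustrative and do not correspond to any real on-disk format):

  # Toy model of the two-step deletion and what a crash in between leaves behind.
  directory = {"report.txt": 7}               # file name -> inode number
  free_inodes = set()                         # inodes available for reuse
  inode_table = {7: "metadata of report.txt"}

  def delete(name, crash_after_step=None):
      inum = directory.pop(name)              # step 1: remove the directory entry
      if crash_after_step == 1:
          return                              # crash here: inode 7 is orphaned (storage leak)
      free_inodes.add(inum)                   # step 2: mark the inode as free
      # Performing step 2 first and crashing before step 1 would be worse: the
      # directory would still point at an inode that may be reused by a new file.

  delete("report.txt", crash_after_step=1)    # simulate the crash after step 1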

One way to recover is to do a complete walk of the file system's data structures when it is next mounted to detect and correct any inconsistencies. This is traditionally performed by the fsck program on Unix-like systems. It can be very slow for large file systems, and is likely to become slower yet, given that the ratio of storage capacity to I/O bandwidth on modern mass storage devices is rising.
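
In outline, such a check walks every structure and cross-references them. A toy version (hypothetical Python, limited to finding the orphaned inodes from the deletion example above) might read:

  # Toy consistency check in the spirit of fsck: find inodes that are
  # allocated but referenced by no directory entry, and reclaim them.
  def check(inode_table, directories, free_inodes):
      referenced = {inum for d in directories for inum in d.values()}
      for inum in inode_table:
          if inum not in referenced and inum not in free_inodes:
              print("orphaned inode", inum, "- reclaiming")
              free_inodes.add(inum)

  # After the interrupted deletion above: inode 7 exists, but no directory refers to it.
  check({7: "metadata of report.txt"}, [{}], set())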

Another way to recover is for the file system to keep a journal of the changes it intends to make, ahead of time. Recovery then simply involves re-reading the journal and replaying the changes logged in it until the file system is consistent again. In this sense, the changes are said to be atomic (or indivisible) in that they will either:

  • succeed (have succeeded originally or be replayed completely during recovery), or
  • not be replayed at all.
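
The write-ahead idea can be sketched as follows (hypothetical Python, not any real journal format): a group of changes is written to the journal with a commit marker before the main structures are modified, and recovery replays only fully committed groups.

  # Toy write-ahead journal: log the whole transaction, then apply it.
  journal = []     # stands in for the circular log in a reserved area
  main_fs = {}     # stands in for the main file system structures

  def commit(changes):
      journal.append({"changes": changes, "committed": True})   # journal write completes first
      for block, data in changes.items():
          main_fs[block] = data                                  # then the real structures

  def recover():
      for tx in journal:
          if tx.get("committed"):         # complete transactions are replayed
              for block, data in tx["changes"].items():
                  main_fs[block] = data
          # a partially written transaction has no commit marker and is ignored

  commit({"dir_block_12": "entry removed", "inode_bitmap": "inode 7 free"})
  recover()        # after a crash, replay restores a consistent main_fs

Because replaying a committed transaction simply writes the same data again, replay is safe to repeat even if recovery is itself interrupted.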

Some file systems allow the journal to grow, shrink and be re-allocated just as would a regular file; most, however, put the journal in a contiguous area or a special hidden file that is guaranteed not to change in size while the file system is mounted.

A physical journal is one which simply logs verbatim copies of blocks that will be written later. ext3, for example, does this. A logical journal is one which logs metadata changes in a special, more compact format. This can improve performance by drastically reducing the amount of data that needs to be read from and written to the journal in large, metadata-heavy operations (for example, deleting a large directory tree). XFS keeps a logical journal.
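
The difference in journal traffic can be seen in a rough sketch (hypothetical Python; the real ext3 and XFS record formats differ in detail): for the same metadata change, a physical journal stores the entire affected block, while a logical journal stores a short description of the change.

  # What each journal type records for one small metadata change.
  BLOCK_SIZE = 4096
  updated_block = bytearray(BLOCK_SIZE)       # full image of the block after the change

  physical_record = bytes(updated_block)      # physical journal: verbatim 4096-byte copy

  logical_record = {"op": "clear_inode",      # logical journal: compact change record,
                    "inode": 7}               # a few dozen bytes for the same update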

Log-structured file systems are those for which the journal is itself the entire file system. As of 2005, none of the most popular general-purpose file systems are log-structured, although WAFL and Reiser4 borrow some techniques from log-structured file systems.

Databases use more rigorous versions of the same journaling techniques to ensure data integrity.

Metadata-only journaling

Journaling can have a severe impact on performance because it requires that all data be written twice. Metadata-only journaling is a compromise between reliability and performance that stores only changes to file metadata (which is usually relatively small and hence less of a drain on performance) in the journal. This still ensures that the file system can recover quickly when next mounted, but leaves an opportunity for data corruption because unjournaled file data and journaled metadata can fall out of sync with each other.

For example, appending to a file on a Unix file system typically involves three steps:

  1. Increasing the size of the file in its inode.
  2. Allocating space for the extension in the free space map.
  3. Actually writing the appended data to the newly-allocated space.

In a metadata-only journal, it would not be clear after a crash whether step 3 was completed, because it is not logged. If step 3 was not done but steps 1 and 2 are replayed during recovery, the file will gain a tail of garbage.
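
A sketch of the hazard (hypothetical Python, with the data write simply skipped to stand in for the crash):

  # Metadata (size, allocation) is journaled; file data is not.
  data_blocks = {10: b"\xde\xad\xbe\xef"}     # stale bytes left over in block 10
  inode = {"size": 0, "blocks": []}
  metadata_journal = []

  def append(payload, crash_before_data_write=False):
      metadata_journal.append({"size": len(payload), "alloc": 10})   # steps 1 and 2 journaled
      if crash_before_data_write:
          return                                                     # step 3 never happens
      data_blocks[10] = payload                                      # step 3: write the data

  def replay():
      for record in metadata_journal:         # recovery replays the metadata changes
          inode["size"] = record["size"]
          inode["blocks"].append(record["alloc"])

  append(b"new data", crash_before_data_write=True)
  replay()
  # The inode now claims 8 bytes in block 10, but the block still holds the
  # stale bytes b"\xde\xad\xbe\xef": the file has gained a tail of garbage.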

The write cache in most operating systems traditionally orders its writes using an elevator sort (or some similar scheme) to maximize throughput. To avoid an out-of-order write hazard with a metadata-only journal, writes for file data must be ordered so that they are committed to storage before their associated metadata. This can be tricky to implement because it requires coordination within the operating system kernel between the file system driver and the write cache.
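
One way to express the constraint is to force data writes ahead of the metadata writes that depend on them when the cache is flushed, as in this simplified sketch (hypothetical Python; real implementations track per-block dependencies and still apply elevator ordering within each class):

  # Toy flush: data blocks are committed before dependent metadata blocks.
  pending = [
      {"kind": "metadata", "block": 2,  "depends_on": [10]},   # inode update
      {"kind": "data",     "block": 10, "depends_on": []},     # the file data itself
  ]

  def flush(writes):
      for w in sorted(writes, key=lambda w: w["kind"] != "data"):   # data first
          print("writing", w["kind"], "block", w["block"])

  flush(pending)   # writes data block 10, then metadata block 2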

FFS implementations typically do without a journal by carefully ordering their metadata writes to disk so that the file system remains recoverable. This is reliable as long as the disk does not "lie" about the status of its internal write cache and can write blocks atomically. Soft updates are a variation of this approach: metadata writes are issued asynchronously, but only in an order that either leaves the on-disk file system consistent or limits any inconsistency to a storage leak.
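
The ordering rule behind soft updates can be illustrated in miniature (hypothetical Python; real soft updates track dependencies among many on-disk structures): a pointer is never written before the thing it points to, so the worst outcome of a crash is an unreferenced, leaked resource.

  # Soft-updates-style ordering for file creation: inode first, directory entry second.
  disk = {"inodes": {}, "directory": {}}

  def create_file(name, inum):
      disk["inodes"][inum] = {"initialized": True}   # write the inode first
      # a crash here leaves an allocated but unreferenced inode: only a leak
      disk["directory"][name] = inum                 # then the directory entry pointing to it
      # the reverse order could leave an entry pointing at an uninitialized
      # inode, which is a genuine inconsistency rather than a leak

  create_file("notes.txt", 11)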

See also