Lustre (file system)
Lustre is a parallel distributed file system, generally used for large-scale cluster computing. The name Lustre is a portmanteau of Linux and cluster. The project aims to provide a file system for clusters of tens of thousands of nodes with petabytes of storage capacity, without compromising speed or security. Lustre is available under the GNU GPL.
Lustre is designed, developed and maintained by Sun Microsystems, Inc. with input from many other individuals and companies. Sun completed its acquisition of Cluster File Systems, Inc., including the Lustre file system, on October 2, 2007, with the intention of bringing the benefits of Lustre technologies to Sun's ZFS file system and the Solaris operating system.
Lustre file systems are used in computer clusters ranging from small workgroup clusters to large-scale, multi-site clusters. Fifteen of the top 30 supercomputers in the world use Lustre file systems, including the world's fastest supercomputer, the Blue Gene/L at Lawrence Livermore National Laboratory (LLNL). Other supercomputers that use the Lustre file system include systems at Oak Ridge National Laboratory, Pacific Northwest National Laboratory, and Los Alamos National Laboratory in North America, the largest system in Asia at Tokyo Institute of Technology, and the largest system in Europe at CEA.
Lustre file systems can support up to tens of thousands of client systems, petabytes (PBs) of storage and hundreds of gigabytes per second (GB/s) of I/O throughput. Businesses ranging from Internet service providers to large financial institutions deploy Lustre file systems in their datacenters. Due to the high scalability of Lustre file systems, Lustre deployments are popular in the oil and gas, manufacturing, rich media and finance sectors.
History
The Lustre file system architecture was developed as a research project in 1999 by Peter Braam, who was a Senior Systems Scientist at Carnegie Mellon University at the time. Braam went on to found his own company, Cluster File Systems, which released Lustre 1.0 in 2003. Cluster File Systems was acquired by Sun Microsystems in October 2007.
The Lustre file system was first installed for production use in March 2003 on the MCR Linux Cluster at LLNL, one of the largest supercomputers at the time.
Lustre 1.2.0, released in March 2004, provided Linux kernel 2.6 support, a "size glimpse" feature to avoid lock revocation on files undergoing write, and client side write-back cache accounting.
Lustre 1.4.0, released in November 2004, provided protocol compatibility between versions, InfiniBand network support, and support for extents/mballoc in ldiskfs.
Lustre 1.6.0, released in April 2007, supported mount configuration ("mountconf"), allowing servers to be configured with "mkfs" and "mount"; supported dynamic addition of object storage targets (OSTs); enabled Lustre distributed lock manager (LDLM) scalability on symmetric multiprocessing (SMP) servers; and supported free space management for object allocation.
The Lustre file system and associated open source software have been adopted by many partners. Both Red Hat and SUSE (Novell) offer Linux kernels with Lustre patches for easy deployment. Currently, over 5,000 downloads of the Lustre file system software occur each month.
Architecture
A Lustre file system has three major functional units:
- A single metadata target (MDT) per filesystem that stores metadata, such as filenames, directories, permissions, and file layout, on the metadata server (MDS)
- One or more OSTs that store file data on one or more object storage servers (OSSs). Depending on the server’s hardware, an OSS typically serves between 2 and 8 targets, each target up to 8 terabytes (TBs) in size. The capacity of a Lustre file system is the sum of the capacities provided by the targets.
- Client(s) that access and use the data. Lustre presents all clients with standard POSIX semantics and concurrent read and write access to the files in the filesystem.
The MDT, OST, and client can be on the same node or on different nodes, but in typical installations these functions are on separate nodes, with two to four OSTs per OSS node, communicating over a network. Lustre supports several network types, including TCP/IP on Ethernet, InfiniBand, Myrinet, Quadrics, and other proprietary technologies. Lustre can take advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.
The storage attached to the servers is partitioned, optionally organized with logical volume management (LVM), and formatted as file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.
An OST is a dedicated filesystem that exports an interface to byte ranges of objects for read/write operations. An MDT is a dedicated filesystem that controls file access and tells clients which object(s) make up a file. MDTs and OSTs currently use a modified version of ext3 to store data. In the future, Sun's ZFS/DMU will also be used to store data.
When a client accesses a file, it completes a filename lookup on the MDS. As a result, a file is created on behalf of the client or the layout of an existing file is returned to the client. For read or write operations, the client then passes the layout to a logical object volume (LOV), which maps the offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSTs. With this approach, bottlenecks for client-to-OST communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem.
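The mapping step performed by the LOV can be made concrete with a short example. The following C sketch is a minimal, hypothetical model of round-robin (RAID-0-style) striping that shows how a byte range of a file could be split into per-object pieces; the type and function names are invented for this illustration, do not correspond to actual Lustre client code, and real Lustre layouts carry additional state.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical layout: round-robin striping across stripe_count objects,
     * moving to the next object after every stripe_size bytes. */
    struct layout {
        uint64_t stripe_size;   /* bytes stored on one object before moving on */
        unsigned stripe_count;  /* number of objects (one per OST) */
    };

    /* Map a file offset to (object index, offset within that object). */
    static void map_offset(const struct layout *lo, uint64_t file_off,
                           unsigned *obj_idx, uint64_t *obj_off)
    {
        uint64_t stripe_no = file_off / lo->stripe_size;  /* which stripe unit */
        *obj_idx = (unsigned)(stripe_no % lo->stripe_count);
        /* Offset inside the object: complete rounds already stored there,
         * plus the position inside the current stripe unit. */
        *obj_off = (stripe_no / lo->stripe_count) * lo->stripe_size
                 + file_off % lo->stripe_size;
    }

    int main(void)
    {
        struct layout lo = { .stripe_size = 1 << 20, .stripe_count = 4 };

        /* Walk a 5 MB request starting at file offset 3 MB, one stripe unit
         * at a time, and show which object each piece would touch. */
        uint64_t off = 3ULL << 20, end = off + (5ULL << 20);
        while (off < end) {
            unsigned obj;
            uint64_t obj_off;
            map_offset(&lo, off, &obj, &obj_off);

            /* This piece runs at most to the end of the current stripe unit. */
            uint64_t len = lo.stripe_size - off % lo.stripe_size;
            if (off + len > end)
                len = end - off;

            printf("file [%llu, %llu) -> object %u, offset %llu\n",
                   (unsigned long long)off, (unsigned long long)(off + len),
                   obj, (unsigned long long)obj_off);
            off += len;
        }
        return 0;
    }

Because each piece lands on a different object, the corresponding read or write operations can be issued to the OSTs in parallel, which is the source of the near-linear bandwidth scaling described above.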
Clients do not directly modify the objects on the OST filesystems, but, instead, delegate this task to OSSs. This approach ensures scalability for large-scale clusters and supercomputers, as well as improved security and reliability. In contrast, shared block-based filesystems such as Global File System and OCFS must allow direct access to the underlying storage by all of the clients in the filesystem.
Implementation
In a typical Lustre installation on a Linux client, a Lustre filesystem driver module is loaded into the kernel and the filesystem is mounted like any other local or network filesystem. Client applications see a single, unified filesystem even though it may be composed of tens to thousands of individual servers and MDT/OST filesystems.
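Because the mounted filesystem behaves like any other POSIX filesystem, applications need no Lustre-specific code for ordinary I/O. The short C program below writes a file and reads it back using only standard POSIX calls; the path /mnt/lustre is an assumed mount point chosen for illustration.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Assumed mount point; any path under a mounted Lustre filesystem
         * is accessed exactly like a path on a local filesystem. */
        const char *path = "/mnt/lustre/example.txt";
        const char msg[] = "hello from a Lustre client\n";
        char buf[sizeof(msg)];

        int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        if (write(fd, msg, sizeof(msg) - 1) != (ssize_t)(sizeof(msg) - 1)) {
            perror("write"); close(fd); return 1;
        }

        /* Read the data back through the same ordinary POSIX interface. */
        if (lseek(fd, 0, SEEK_SET) < 0) { perror("lseek"); close(fd); return 1; }
        ssize_t n = read(fd, buf, sizeof(buf) - 1);
        if (n < 0) { perror("read"); close(fd); return 1; }
        buf[n] = '\0';

        printf("read back: %s", buf);
        close(fd);
        return 0;
    }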
On some massively parallel processor (MPP) installations, computational processors can access a Lustre file system by redirecting their I/O requests to a dedicated I/O node configured as a Lustre client. This approach was used in the LLNL Blue Gene installation[1]. Although this is the easiest approach to implement, it provides poor overall performance.
A slightly more complicated implementation that provides excellent overall performance uses the liblustre library. Liblustre is a user-level library that allows computational processors to mount and use the Lustre file system as a client. Using liblustre, the computational processors can access a Lustre file system even if the service node on which the job was launched is not a Lustre client. Liblustre allows data movement directly between application space and the Lustre OSSs without requiring an intervening data copy through the kernel, providing low-latency, high-bandwidth access from the computational processors to the Lustre file system. Its performance characteristics and scalability make this approach the most suitable for using Lustre with MPP systems. Liblustre is the most significant design difference between Lustre implementations on MPPs such as the Cray XT3 and Lustre implementations on conventional clustered computational systems.
Currently, Lustre OSS and MDS servers run the Linux operating system. Lustre 1.8, which is expected to be released in Q2 2008, will run on both Linux and Solaris.[2]
Lustre technology has been integrated in the HP StorageWorks Scalable File Share product.
Data objects and file striping
In a traditional UNIX disk file system, an inode data structure contains basic information about each file, such as where the data contained in the file is stored. The Lustre file system also uses inodes, but inodes on MDSs point to one or more objects associated with the file rather than to data blocks. These objects are implemented as files on the OSSs. When a client opens a file, the file open operation transfers an object pointer from the MDS to the client, so that the client can directly interact with the OSS node where the object is stored, allowing the client to perform I/O on the file.
If only one object is associated with an MDS inode, that object contains all the data in the Lustre file. When more than one object is associated with a file, data in the file is “striped” across the objects. Striping provides significant benefits. When striping is used, the maximum file size is not limited by the size of a single target. Also, capacity and aggregate I/O bandwidth scale with the number of OSSs.
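The relationship between an MDS inode and its objects can be pictured with a small data structure. The sketch below is purely conceptual: the type names and fields are invented for illustration and do not match Lustre's actual on-disk or wire formats.

    #include <stdint.h>
    #include <stdio.h>

    /* Conceptual view of the layout an MDS hands to a client on file open:
     * instead of pointing at data blocks, the metadata points at objects,
     * each identified by the OST holding it and an object id on that OST. */
    struct object_ref {
        uint32_t ost_index;    /* which object storage target stores the object */
        uint64_t object_id;    /* identifier of the object on that OST */
    };

    #define MAX_STRIPES 4      /* fixed bound, purely to keep the example short */

    struct file_layout {
        uint64_t stripe_size;                   /* bytes per stripe unit */
        uint32_t stripe_count;                  /* number of objects in the file */
        struct object_ref objects[MAX_STRIPES]; /* one entry per object */
    };

    int main(void)
    {
        /* A file striped over three objects, held on OSTs 0, 2, and 5. */
        struct file_layout lo = {
            .stripe_size  = 1 << 20,
            .stripe_count = 3,
            .objects      = { {0, 1001}, {2, 1002}, {5, 1003} },
        };

        for (uint32_t i = 0; i < lo.stripe_count; i++)
            printf("stripe %u -> OST %u, object %llu\n", (unsigned)i,
                   (unsigned)lo.objects[i].ost_index,
                   (unsigned long long)lo.objects[i].object_id);
        return 0;
    }

Given such a layout, a client can compute which object, and which offset within it, holds any byte of the file (as in the striping sketch in the Architecture section) and perform its I/O against the corresponding OSSs directly.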
Networking
In a cluster with a Lustre file system, the system network connecting the servers and the clients is implemented using Lustre Networking (LNET), which provides the communication infrastructure required by the Lustre file system. Disk storage is connected to the Lustre file system MDSs and OSSs using traditional storage area network (SAN) technologies.
LNET supports many commonly used network types, such as InfiniBand and IP networks, and allows simultaneous availability across multiple network types, with routing between them. RDMA is used when it is supported by the underlying network, as it is by Quadrics, Myrinet, and InfiniBand. High availability and recovery features enable transparent recovery in conjunction with failover servers.
LNET provides end-to-end throughput over Gigabit Ethernet (GigE) networks in excess of 100 MB/s[3], throughput up to 1.5 GB/s over InfiniBand double data rate (DDR) links, and throughput over 1 GB/s across 10GigE interfaces.
High availability
Lustre file system high availability features include a robust failover and recovery mechanism, making server failures and reboots transparent. Version interoperability between successive minor versions of the Lustre software enables a server to be upgraded by taking it offline (or failing it over to a standby server), performing the upgrade, and restarting it, while all active jobs continue to run, merely experiencing a delay.
Lustre MDSs are configured as an active/passive pair, while OSSs are typically deployed in an active/active configuration that provides redundancy without extra overhead. Often the standby MDS is the active MDS for another Lustre file system, so no nodes are idle in the cluster.
ZFS integration
Lustre 1.8 will allow users to choose between ZFS and ldiskfs as back-end storage.[4]
References
- ^ “DataDirect Selected as Storage Tech Powering BlueGene/L”, HPC Wire, October 15, 2004: Vol. 13, No. 41.
- ^ Video: Lustre File System presented at Sun HPC Consortium Reno. Sun Microsystems. Retrieved on 2008-02-18.
- ^ Lafoucrière, Jacques-Charles. “Lustre Experience at CEA/DIF”, HEPiX Forum, April 2007.
- ^ Architecture ZFS for Lustre. Sun Microsystems. Retrieved on 2008-02-18.