GridFTP is an extension of the standard File Transfer Protocol (FTP) for use with Grid computing.[1] It is defined as part of the Globus toolkit, under the organisation of the Global Grid Forum (specifically, by the GridFTP working group).
The aim of GridFTP is to provide more reliable, higher-performance file transfer for Grid computing applications. This is necessary because Grid computing places increased demands on data transmission: very large files must frequently be transferred, and this needs to be done quickly and reliably.
GridFTP also addresses the problem of incompatibility between storage and access systems. Previously, each data provider would make its data available in its own specific way, providing its own library of access functions. This made it difficult to obtain data from multiple sources, since each required a different access method, and so divided the total available data into partitions. GridFTP provides a uniform way of accessing the data, encompassing functions from all the different modes of access, building on and extending the universally accepted FTP standard. FTP was chosen as a basis because of its widespread use, and because it has a well-defined architecture for protocol extensions (which may be dynamically discovered).
GridFTP offers several advantages over plain FTP, including faster transfers and built-in security. It achieves these through the following alterations to normal FTP.[2]
The Grid Security Infrastructure (GSI) is another part of the Globus toolkit; it provides authentication and encryption for file transfers, with user-specified levels of confidentiality and data integrity. FTP itself is inherently insecure: it sends data unencrypted and is thus open to packet sniffing and eavesdropping, and it has traditionally relied on external mechanisms such as SSH and SSL for security.
A useful feature of FTP is that it allows remote transfer between servers to be initiated by a local client. GridFTP builds on this, and adds security and authentication for the local initiator. This feature is similar to File eXchange Protocol (FXP) in FTP terminology.
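The mechanics of a server-to-server transfer in plain FTP can be sketched as follows: the client asks one server to listen passively (PASV), then hands the resulting address to the other server in a PORT command, so that the data flows directly between the two servers. This is only an illustration of the standard FTP commands; the helper names (`parse_pasv_reply`, `port_command`, `third_party_plan`) are hypothetical, and the security negotiation that GridFTP adds on top is not shown.

```python
import re


def parse_pasv_reply(reply: str) -> tuple[str, int]:
    """Extract (host, port) from an FTP '227' PASV reply."""
    m = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if m is None:
        raise ValueError(f"not a PASV reply: {reply!r}")
    h1, h2, h3, h4, p1, p2 = (int(g) for g in m.groups())
    # The port is encoded as two bytes: high byte, then low byte.
    return f"{h1}.{h2}.{h3}.{h4}", p1 * 256 + p2


def port_command(host: str, port: int) -> str:
    """Build the PORT command that points a server at (host, port)."""
    p1, p2 = divmod(port, 256)
    return "PORT " + ",".join(host.split(".") + [str(p1), str(p2)])


def third_party_plan(pasv_reply_from_dest: str, filename: str) -> dict:
    """Commands a client would send on each control connection.

    Assumes PASV has already been sent to the destination server and
    its reply is passed in (hypothetical helper for illustration only).
    """
    host, port = parse_pasv_reply(pasv_reply_from_dest)
    return {
        "destination": [f"STOR {filename}"],
        "source": [port_command(host, port), f"RETR {filename}"],
    }
```

For example, `third_party_plan("227 Entering Passive Mode (192,168,1,2,19,137)", "data.bin")` yields a `PORT 192,168,1,2,19,137` command for the source server, directing it to connect to the destination's listening port.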
GridFTP makes much better use of available bandwidth by allowing multiple simultaneous TCP streams. Files can be downloaded in pieces simultaneously from multiple sources, or in separate parallel streams from the same source, which still uses the bandwidth more fully than a single stream. Striped and interleaved transfers, again from either multiple or single sources, allow further speed increases.
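The idea of fetching a file in pieces over parallel streams can be illustrated with a small simulation. This is a minimal sketch, not GridFTP's wire protocol: the "streams" are threads slicing an in-memory byte string, and the function names (`split_into_ranges`, `parallel_fetch`) are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_ranges(file_size: int, n_streams: int) -> list[tuple[int, int]]:
    """Divide [0, file_size) into up to n_streams (offset, length) ranges."""
    if file_size < 0 or n_streams < 1:
        raise ValueError("file_size must be >= 0 and n_streams >= 1")
    base, extra = divmod(file_size, n_streams)
    ranges = []
    offset = 0
    for i in range(n_streams):
        # The first `extra` ranges get one additional byte each.
        length = base + (1 if i < extra else 0)
        if length == 0:
            continue  # fewer bytes than streams: skip empty ranges
        ranges.append((offset, length))
        offset += length
    return ranges


def parallel_fetch(source: bytes, n_streams: int) -> bytes:
    """Simulate parallel streams: each worker fetches one range, and the
    blocks are written back at their own offsets."""
    ranges = split_into_ranges(len(source), n_streams)

    def fetch(rng: tuple[int, int]) -> tuple[int, bytes]:
        offset, length = rng
        return offset, source[offset:offset + length]

    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        blocks = list(pool.map(fetch, ranges))

    out = bytearray(len(source))
    for offset, data in blocks:
        out[offset:offset + len(data)] = data
    return bytes(out)
```

Because every block carries its own offset, the receiver can place blocks correctly regardless of the order in which the streams deliver them.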
Although FTP has the ability to resume an interrupted file transfer from a specific point in a file, it does not support the transmission of only a certain portion of a file. GridFTP allows a subset of a file to be sent. Such a feature is useful in applications where only small sections of a very large data file are required for processing (a motivating example being the processing of data from a high energy physics experiment, a traditional use of Grid technology).
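The difference between resuming from an offset and requesting an arbitrary portion of a file can be shown with a local stand-in for a remote file. This is a sketch under stated assumptions: `read_from_offset` and `read_partial` are hypothetical helpers operating on a seekable Python stream, not GridFTP commands.

```python
import io


def read_from_offset(stream: io.BufferedIOBase, offset: int) -> bytes:
    """Restart-style access: everything from `offset` to the end of the file."""
    stream.seek(offset)
    return stream.read()


def read_partial(stream: io.BufferedIOBase, offset: int, length: int) -> bytes:
    """Partial access: exactly `length` bytes starting at `offset`
    (fewer if the file ends first)."""
    if offset < 0 or length < 0:
        raise ValueError("offset and length must be non-negative")
    stream.seek(offset)
    return stream.read(length)


# Usage: pull a 4-byte record out of the middle of a larger file.
remote_file = io.BytesIO(b"HEADERrec1rec2rec3TRAILER")
record = read_partial(remote_file, offset=10, length=4)
```

With only restart-style access, a client wanting the second record would have to receive everything after byte 10 and discard most of it; a partial request transfers just the bytes needed.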
GridFTP provides a fault tolerant implementation of FTP, to handle network unavailability and server problems. Transfers can also be automatically restarted if a problem occurs.
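Automatic restart can be sketched as a loop that remembers how many bytes have arrived and resumes from that point after a failure. This is a simulation under stated assumptions: `FlakySource` is a hypothetical stand-in for an unreliable server, and `transfer_with_restart` is an illustrative helper rather than part of any GridFTP implementation.

```python
class FlakySource:
    """Hypothetical data source that fails once when a read covers `fail_at`."""

    def __init__(self, data: bytes, fail_at: int):
        self.data = data
        self.fail_at = fail_at
        self.failed = False

    def read(self, offset: int, size: int) -> bytes:
        if not self.failed and offset <= self.fail_at < offset + size:
            self.failed = True
            raise ConnectionError("simulated network drop")
        return self.data[offset:offset + size]


def transfer_with_restart(source, total_size: int, chunk_size: int = 4,
                          max_retries: int = 3) -> tuple[bytes, int]:
    """Fetch `total_size` bytes, resuming from the last good offset on failure.

    Returns the received data and the number of restarts that were needed.
    """
    received = bytearray()
    retries = 0
    while len(received) < total_size:
        try:
            # len(received) acts as the restart marker.
            chunk = source.read(len(received), chunk_size)
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise
            continue  # retry from the same offset; nothing is re-sent
        if not chunk:
            raise EOFError("source ended before total_size bytes arrived")
        received.extend(chunk)
    return bytes(received), retries
```

The point of the restart marker is that a failure late in a very large transfer costs only the chunk in flight, not the data already received.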
The underlying TCP connection in FTP has numerous settings, such as window size and buffer size. GridFTP allows automatic (or manual) negotiation of these settings to provide optimal transfer speeds and reliability. The best settings for transferring a single large file are likely to differ from those for transferring a large group of files.
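A common rule of thumb for choosing a TCP buffer size is the bandwidth-delay product: the amount of data that can be in flight on the link at once. The sketch below computes it and requests a matching buffer on a socket using Python's standard `socket` module. The function names and the example link figures are assumptions for illustration; this is not how GridFTP itself negotiates settings, and the operating system may clamp or adjust the requested size.

```python
import socket


def bandwidth_delay_product(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bytes in flight on a link: bandwidth (bits/s) x round-trip time (ms)."""
    if bandwidth_bps < 0 or rtt_ms < 0:
        raise ValueError("bandwidth and RTT must be non-negative")
    # bits/s * ms / 1000 gives bits; / 8 gives bytes.
    return bandwidth_bps * rtt_ms // 8000


def make_tuned_socket(buffer_bytes: int) -> socket.socket:
    """Create a TCP socket and request send/receive buffers of buffer_bytes.

    The kernel treats this as a request and may apply its own limits.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buffer_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buffer_bytes)
    return sock


# Example figures (assumed): a 1 Gbit/s link with a 50 ms round-trip time.
bdp = bandwidth_delay_product(1_000_000_000, 50)
```

For the assumed link above, the product is 6,250,000 bytes, far larger than typical default buffers, which is why untuned transfers over long, fast links often fall well short of the available bandwidth.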