RDMA over Converged Ethernet

RDMA over Converged Ethernet (RoCE) is a network protocol that allows remote direct memory access (RDMA) over an Ethernet network. There are two RoCE versions, RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol and hence allows communication between any two hosts in the same Ethernet broadcast domain. RoCE v2 runs on top of UDP/IP, which means that RoCE v2 packets can be routed. Although the RoCE protocol benefits from the characteristics of a converged Ethernet network, it can also be used on a traditional or non-converged Ethernet network.[1][2][3][4]

Background

Network-intensive applications such as networked storage or cluster computing need a network infrastructure with high bandwidth and low latency. The advantages of RDMA over other network application programming interfaces such as Berkeley sockets are lower latency, lower CPU load and higher bandwidth.[5] The RoCE protocol allows lower latencies than its predecessor, the iWARP protocol.[6] RoCE host channel adapters (HCAs) with a latency as low as 1.3 microseconds exist,[7][8] while the lowest known iWARP HCA latency in 2011 was 3 microseconds.[9]

RoCE Header format

RoCE v1

The RoCE v1 protocol is an Ethernet link layer protocol with EtherType 0x8915.[1] This means that the frame length limits of the Ethernet protocol apply: a payload of up to 1500 bytes for a regular Ethernet frame and up to 9000 bytes for a jumbo frame.
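As a rough illustration of the EtherType check described above, the following Python sketch parses the 14-byte header of an untagged Ethernet II frame and tests for 0x8915. The function names, MAC addresses and payload are invented for the example, and VLAN-tagged frames are deliberately not handled.

```python
import struct

ROCE_V1_ETHERTYPE = 0x8915  # EtherType given in the text for RoCE v1


def parse_ethernet_header(frame: bytes):
    """Return (dst_mac, src_mac, ethertype) of an untagged Ethernet II frame."""
    if len(frame) < 14:
        raise ValueError("frame shorter than the 14-byte Ethernet header")
    # 6-byte destination MAC, 6-byte source MAC, 2-byte EtherType (big endian)
    return struct.unpack("!6s6sH", frame[:14])


def is_roce_v1(frame: bytes) -> bool:
    """True when the EtherType marks the frame as RoCE v1 (assumes no VLAN tag)."""
    return parse_ethernet_header(frame)[2] == ROCE_V1_ETHERTYPE


# Hypothetical frame: locally administered MAC addresses and a zero-filled payload.
example_frame = (
    bytes.fromhex("020000000002")            # destination MAC (invented)
    + bytes.fromhex("020000000001")          # source MAC (invented)
    + struct.pack("!H", ROCE_V1_ETHERTYPE)   # EtherType field
    + bytes(64)                              # placeholder payload
)

print(is_roce_v1(example_frame))
```

A real capture tool would also need to skip 802.1Q tags before reading the EtherType; that step is left out to keep the sketch short.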

RoCE v2

The RoCE v2 protocol exists on top of either the UDP/IPv4 or the UDP/IPv6 protocol.[2] The UDP destination port number 4791 has been reserved for RoCE v2.[10] Since RoCE v2 packets are routable, the RoCE v2 protocol is sometimes called Routable RoCE[11] or RRoCE.[3] Although in general the delivery order of UDP packets is not guaranteed, the RoCE v2 specification requires that packets with the same UDP source port and the same destination address must not be reordered.[3] In addition, RoCE v2 defines a congestion control mechanism that uses the IP ECN bits for marking and Congestion Notification Packet (CNP)[12] frames for the acknowledgment notification.[13] Software support for RoCE v2 is still emerging. Mellanox OFED 2.3 or later supports RoCE v2, as does Linux kernel v4.5.[14]
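The Python sketch below illustrates, under stated assumptions, how a packet could be recognised as RoCE v2 and checked for an ECN congestion mark. It handles UDP/IPv4 only and does not parse the InfiniBand headers that follow the UDP header. All function names are invented for the example. The `pick_path` helper is a hypothetical illustration of one way a multipath network could keep packets with the same UDP source port and destination address on one path; it is not taken from the RoCE v2 specification.

```python
import struct
import zlib

ROCE_V2_UDP_PORT = 4791   # UDP destination port reserved for RoCE v2 (from the text)
IPPROTO_UDP = 17          # IPv4 protocol number for UDP
ECN_CE = 0b11             # "congestion experienced" codepoint of the IP ECN field


def inspect_ipv4_packet(packet: bytes) -> dict:
    """Classify a raw IPv4 packet; IPv6 and fragmentation are out of scope."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    header_len = (packet[0] & 0x0F) * 4       # IHL is counted in 32-bit words
    if header_len < 20 or len(packet) < header_len:
        raise ValueError("malformed IPv4 header")
    ecn = packet[1] & 0b11                    # low two bits of the DSCP/ECN byte
    is_udp = packet[9] == IPPROTO_UDP
    src_port = dst_port = None
    if is_udp:
        if len(packet) < header_len + 8:
            raise ValueError("truncated UDP header")
        src_port, dst_port = struct.unpack("!HH", packet[header_len:header_len + 4])
    return {
        "roce_v2": is_udp and dst_port == ROCE_V2_UDP_PORT,
        "congestion_experienced": ecn == ECN_CE,
        "src_port": src_port,
        "dst_addr": packet[16:20],            # IPv4 destination address
    }


def pick_path(src_port: int, dst_addr: bytes, n_paths: int) -> int:
    """Hypothetical path choice keyed only on the fields named in the ordering rule."""
    key = struct.pack("!H", src_port) + dst_addr
    return zlib.crc32(key) % n_paths


def build_example_packet(src_port: int, dst_port: int, ecn: int) -> bytes:
    """Invented test packet: documentation addresses, zero checksums, no payload."""
    udp = struct.pack("!HHHH", src_port, dst_port, 8, 0)
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, ecn & 0b11, 20 + len(udp), 0, 0, 64,
                     IPPROTO_UDP, 0, bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]))
    return ip + udp


marked = inspect_ipv4_packet(build_example_packet(49152, ROCE_V2_UDP_PORT, ECN_CE))
print(marked["roce_v2"], marked["congestion_experienced"])
```

Because `pick_path` depends only on the source port and destination address, every packet of such a flow maps to the same path index, which is one simple way to avoid reordering within the flow.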

RoCE versus InfiniBand

RoCE defines how to perform RDMA over Ethernet, while the InfiniBand architecture specification defines how to perform RDMA over an InfiniBand network. RoCE was expected to bring InfiniBand applications, which are predominantly based on clusters, onto a common Ethernet converged fabric.[15] Others expected that InfiniBand would keep offering higher bandwidth and lower latency than is possible over Ethernet.[16]

The technical differences between the RoCE and InfiniBand protocols are:

RoCE versus iWARP

While the RoCE protocols define how to perform RDMA using Ethernet and UDP/IP frames, the iWARP protocol defines how to perform RDMA over a connection-oriented transport like the Transmission Control Protocol (TCP). RoCE v1 is limited to a single Ethernet broadcast domain. RoCE v2 and iWARP packets are routable. The memory requirements of a large number of connections, along with TCP's flow and reliability controls, lead to scalability and performance issues when using iWARP in large-scale datacenters and for large-scale applications (such as large-scale enterprises, cloud computing and web 2.0 applications[20]). Also, multicast is defined in the RoCE specification while the current iWARP specification does not define how to perform multicast RDMA.[21][22][23]

Criticism

Some aspects that could have been defined in the RoCE specification have been left out. These are:

References

  1. "InfiniBand™ Architecture Specification Release 1.2.1 Annex A16: RoCE". InfiniBand Trade Association. 13 April 2010.
  2. "InfiniBand™ Architecture Specification Release 1.2.1 Annex A17: RoCEv2". InfiniBand Trade Association. 2 September 2014.
  3. Ophir Maor (December 2015). "RoCEv2 Considerations". Mellanox.
  4. Ophir Maor (December 2015). "RoCE and Storage Solutions". Mellanox.
  5. Cameron, Don; Regnier, Greg (2002). Virtual Interface Architecture. Intel Press. ISBN 978-0-9712887-0-6.
  6. Feldman, Michael (22 April 2010). "RoCE: An Ethernet-InfiniBand Love Story". HPC wire.
  7. "End-to-End Lowest Latency Ethernet Solution for Financial Services" (PDF). Mellanox. March 2011.
  8. "RoCE vs. iWARP Competitive Analysis Brief" (PDF). Mellanox. 9 November 2010.
  9. "Low Latency Server Connectivity With New Terminator 4 (T4) Adapter". Chelsio. 25 May 2011.
  10. Diego Crupnicoff (17 October 2014). "Service Name and Transport Protocol Port Number Registry". IANA.
  11. InfiniBand Trade Association (November 2013). "RoCE Status and Plans" (PDF). IETF.
  12. Ophir Maor (December 2015). "RoCEv2 CNP Packet Format". Mellanox.
  13. Ophir Maor (December 2015). "RoCEv2 Congestion Management". Mellanox.
  14. "Kernel GIT". January 2016.
  15. Merritt, Rick (19 April 2010). "New converged network blends Ethernet, InfiniBand". EE Times.
  16. Kerner, Sean Michael (2 April 2010). "InfiniBand Moving to Ethernet?". Enterprise Networking Planet.
  17. Mellanox (2 June 2014). "Mellanox Releases New Automation Software to Reduce Ethernet Fabric Installation Time from Hours to Minutes". Mellanox.
  18. "SX1036 - 36-Port 40/56GbE Switch System". Mellanox. Retrieved April 21, 2014.
  19. "IS5024 - 36-Port Non-blocking Unmanaged 40Gb/s InfiniBand Switch System". Mellanox. Retrieved April 21, 2014.
  20. Rashti, Mohammad (2010). "iWARP Redefined: Scalable Connectionless Communication over High-Speed Ethernet" (PDF). International Conference on High Performance Computing (HiPC).
  21. H. Shah; et al. (October 2007). "Direct Data Placement over Reliable Transports". RFC 5041. Retrieved May 4, 2011.
  22. C. Bestler; et al. (October 2007). "Stream Control Transmission Protocol (SCTP) Direct Data Placement (DDP) Adaptation". RFC 5043. Retrieved May 4, 2011.
  23. P. Culley; et al. (October 2007). "Marker PDU Aligned Framing for TCP Specification". RFC 5044. Retrieved May 4, 2011.
  24. Dreier, Roland (6 December 2010). "Two notes on IBoE". Roland Dreier's blog.
  25. Cohen, Eli (26 August 2010). "IB/core: Add VLAN support for IBoE". kernel.org.
  26. Cohen, Eli (13 October 2010). "RDMA/cm: Add RDMA CM support for IBoE devices". kernel.org.
  27. Crawford, M. (1998). "RFC 2464 - Transmission of IPv6 Packets over Ethernet Networks". IETF.
  28. Malhi, Upinder (4 September 2013). "PATCH Cisco VIC RDMA Node and Transport". linux-rdma mailing list.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.