Storage virtualization

Storage virtualization refers to the process of abstracting logical storage from physical storage. The term is today used to describe this abstraction at any layer in the storage software and hardware stack.

Key Concepts

Address Space Remapping

Virtualization of storage helps achieve location independence by abstracting the physical location of the data. The virtualization system presents the user with a logical space for data storage and itself handles the process of mapping it to the actual physical location.

The actual form of the mapping depends on the chosen implementation. Some implementations may limit the granularity of the mapping, which may in turn limit the capabilities of the device. Typical granularities range from a single physical disk down to some small subset (multiples of megabytes or gigabytes) of the physical disk.

In a block-based storage environment, a single block of data is addressed using a logical unit identifier (LUN) and an offset within that LUN, known as a logical block address (LBA). The address space mapping is between a logical disk, usually referred to as a virtual disk (vdisk), and a logical unit presented by one or more storage controllers. (Note that the LUN itself is likely to be a logical disk and may even be a virtual disk.)
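
As an illustration, here is a minimal Python sketch of this kind of address-space remapping. The extent size, table contents and function name are invented for the example, not taken from any particular product:

    # Hypothetical sketch of block-address remapping. Each fixed-size extent
    # of a virtual disk (vdisk) maps to an extent on some physical LUN.
    EXTENT_BLOCKS = 2048  # assumed mapping granularity, in blocks

    # Mapping table: (vdisk id, extent index) -> (physical LUN, extent index)
    mapping = {
        (1, 0): (7, 0),
        (1, 1): (3, 42),
    }

    def to_physical(vdisk_id, lba):
        """Translate a logical (vdisk, LBA) address into a physical (LUN, LBA)."""
        extent, offset = divmod(lba, EXTENT_BLOCKS)
        physical_lun, physical_extent = mapping[(vdisk_id, extent)]
        return physical_lun, physical_extent * EXTENT_BLOCKS + offset

    print(to_physical(1, 32))   # -> (7, 32)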

Meta-data

The virtualization software/device is responsible for maintaining a consistent view of all the mapping information for the virtualized storage. This mapping information is usually called meta-data and is stored as a mapping table.

The address space may be limited by the capacity needed to maintain the mapping table. This is directly influenced by the granularity of the mapping information.
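
To illustrate that influence, the following back-of-envelope calculation (all numbers are assumptions chosen for the example) shows how the mapping table grows as the extent size shrinks:

    # Illustrative arithmetic only: mapping-table size vs. granularity
    # for a fixed virtual address space.
    virtual_capacity = 100 * 2**40   # assume 100 TiB of virtual storage
    entry_size = 16                  # assume 16 bytes per mapping entry

    for extent in (1 * 2**20, 16 * 2**20, 1 * 2**30):   # 1 MiB, 16 MiB, 1 GiB
        entries = virtual_capacity // extent
        print(f"{extent >> 20:5d} MiB extents: {entries:>12,} entries, "
              f"{entries * entry_size / 2**20:10,.0f} MiB of meta-data")

Finer granularity gives more flexibility but a larger table to store, update and protect.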

I/O Redirection

The virtualization software/device uses the meta-data to re-direct I/O requests. It receives an incoming I/O request containing information about the location of the data in terms of the logical disk (vdisk) and translates this into a new I/O request to the physical disk location.

For example, the virtualization device may (a code sketch follows the list):

  • Receive a read request for vdisk LUN ID=1, LBA=32
  • Perform a meta-data look-up for LUN ID=1, LBA=32, and find that this maps to physical LUN ID=7, LBA=0
  • Send a read request to physical LUN ID=7, LBA=0
  • Receive the data back from the physical LUN
  • Send the data back to the originator as if it had come from vdisk LUN ID=1, LBA=32
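
The same flow as a minimal Python sketch; read_physical() is an invented placeholder for whatever back-end interface a real implementation uses:

    # Minimal sketch of the redirection flow above.
    metadata = {(1, 32): (7, 0)}   # (vdisk LUN, LBA) -> (physical LUN, LBA)

    def read_physical(lun, lba):
        # Placeholder for the real back-end read.
        return f"data from physical LUN {lun}, LBA {lba}"

    def virtual_read(vdisk_lun, lba):
        """Receive a logical read, look up the mapping, redirect, return data."""
        physical_lun, physical_lba = metadata[(vdisk_lun, lba)]  # meta-data look-up
        data = read_physical(physical_lun, physical_lba)         # redirected I/O
        return data   # returned to the originator as if it came from the vdisk

    print(virtual_read(1, 32))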

Capabilities

Most implementations allow for heterogeneous management of multi-vendor storage devices, within the scope of a given implementation's support matrix. This means that the following capabilities are not limited to a single vendor's devices (as with similar capabilities provided by specific storage controllers) and are in fact possible across different vendors' devices.

Replication

Data replication techniques are not limited to virtualization appliances and as such are not described here in detail. However, most implementations will provide some or all of these replication services.

Note: when storage is virtualized, these services must be implemented above the software/device that is performing the virtualization, because it is only above the virtualization layer that a true and consistent image of the logical disk (vdisk) can be copied. This limits the services that some implementations can offer, or makes them seriously difficult to implement. If the virtualization is implemented in the network or higher, it renders any replication services provided by the underlying storage controllers useless.

  • Remote Data Replication for Disaster Recovery (the difference between the two modes is sketched in code after this list)
    • Synchronous Mirroring - where I/O completion is only returned when the remote site acknowledges the completion. Applicable for shorter distances (<200 km)
    • Asynchronous Mirroring - where I/O completion is returned before the remote site has acknowledged the completion. Applicable for much greater distances (>200 km)
  • Point-In-Time Snapshots to copy or clone data for diverse uses
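
A minimal sketch of the difference between the two mirroring modes, assuming invented local_write() and remote_write() placeholders for the actual back-end and remote-site writes:

    import threading

    def local_write(block, data):
        pass   # placeholder for the local back-end write

    def remote_write(block, data):
        pass   # placeholder for the (possibly distant, slow) remote-site write

    def synchronous_write(block, data):
        """Completion is returned only once the remote site has acknowledged."""
        local_write(block, data)
        remote_write(block, data)         # caller waits out the round trip
        return "complete"

    def asynchronous_write(block, data):
        """Completion is returned before the remote site acknowledges."""
        local_write(block, data)
        threading.Thread(target=remote_write, args=(block, data)).start()
        return "complete"                 # remote copy catches up later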

Pooling

The physical storage resources are aggregated into storage pools, from which the logical storage is created. More storage systems, which may be heterogeneous in nature, can be added as and when needed, and the virtual storage space will scale up by the same amount. This process is fully transparent to the applications using the storage infrastructure.
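
A minimal sketch of pooling, with invented device names and a simple free-extent list standing in for a real allocator:

    # Hypothetical sketch of pooling: free extents from several (possibly
    # heterogeneous) devices form one pool from which vdisks are carved.
    pool = []   # free extents, as (device, extent index) pairs

    def add_device(device, extent_count):
        """Adding a device grows the virtual space by the same amount."""
        pool.extend((device, e) for e in range(extent_count))

    def allocate_vdisk(extents_needed):
        """Carve a logical disk out of whatever free extents are available."""
        if extents_needed > len(pool):
            raise RuntimeError("pool exhausted")
        return [pool.pop() for _ in range(extents_needed)]

    add_device("vendor-A-array", 1000)
    add_device("vendor-B-array", 500)
    vdisk = allocate_vdisk(1200)   # transparently spans both devices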

Disk management

The software/device providing storage virtualization becomes a common disk manager in the virtualized environment. Logical disks (vdisks) are created by the virtualization software/device and are mapped (made visible) to the required host or server, thus providing a common place and way of managing all volumes in the environment.

Enhanced features are easy to provide in this environment:

  • Thin Provisioning to maximize storage utilization
    • This is relatively easy to implement, as physical storage is only allocated in the mapping table when it is used (a sketch follows this list).
  • Disk expansion / shrinking
    • More physical storage can be allocated by adding to the mapping table (assuming the using system can cope with online expansion)
    • Similarly, disks can be reduced in size by removing some physical storage from the mapping (uses for this are limited, as there is no guarantee of what resides on the areas removed)
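
A minimal sketch of the thin provisioning idea from the list above, assuming an invented store() placeholder for the physical write:

    # Hypothetical sketch of thin provisioning: the mapping table starts
    # empty and a physical extent is allocated only on first write.
    free_extents = list(range(1000))   # physical extents actually available
    mapping = {}                       # (vdisk, logical extent) -> physical extent

    def store(physical_extent, data):
        pass   # placeholder for the physical write

    def write(vdisk, extent, data):
        key = (vdisk, extent)
        if key not in mapping:             # first touch: allocate now
            mapping[key] = free_extents.pop()
        store(mapping[key], data)          # write to the physical extent

    # A "very large" vdisk can be presented from day one; nothing physical
    # is consumed until the using system actually writes.
    write("bigdisk", 0, b"first block")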

Benefits

Non-disruptive Data Migration

One of the major benefits of abstracting the host or server from the actual storage is the ability to migrate data while maintaining concurrent I/O access.

The host only knows about the logical disk (vdisk), so any changes to the meta-data mapping are transparent to the host. This means you can concurrently make a copy of, or move, the actual data from one physical location to another. When the data has been copied or moved, the meta-data can simply be updated to point to the new location, thereby freeing up the physical storage at the old location.

The process of moving the physical location is known as data migration. Most implementations allow for this to be done in a non-disruptive manner, that is, concurrently while the host continues to perform I/O to the logical disk (vdisk).

The mapping granularity dictates how quickly the meta-data can be updated, how much extra capacity is required during the migration, and how quickly the previous location is marked as free. The smaller the granularity, the faster the update, the less extra space required, and the quicker the old storage can be freed up.
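
A minimal sketch of migrating a single extent under these rules; copy_extent() and free_extent() are invented placeholders, and a real implementation must also deal with writes that arrive during the copy:

    import threading

    # Copy the data first, then atomically swing the mapping, then free
    # the old copy, all while host I/O to the vdisk continues.
    map_lock = threading.Lock()
    mapping = {("vdisk1", 0): ("old-lun", 5)}

    def copy_extent(src, dst):
        pass   # placeholder: physically copy the data

    def free_extent(location):
        pass   # placeholder: return the extent to the pool

    def migrate(key, new_location):
        old_location = mapping[key]
        copy_extent(old_location, new_location)   # host I/O continues meanwhile
        with map_lock:                            # atomic meta-data update
            mapping[key] = new_location
        free_extent(old_location)                 # old physical space reclaimed

    migrate(("vdisk1", 0), ("new-lun", 9))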

There are many day-to-day tasks a storage administrator has to perform that can be simply and concurrently performed using data migration techniques.

  • Moving data off an over-utilised storage device.
  • Moving data onto a faster storage device as needs require
  • Implementing an Information Lifecycle Management policy
  • Migrating data off older storage devices (either being scrapped or coming off-lease)

Improved Utilization

Utilization can be increased by virtue of the pooling, migration and Thin Provisioning services.

When all available storage capacity is pooled, system administrators no longer have to search for disks that have free space to allocate to a particular host or server. A new logical disk can simply be allocated from the available pool, or an existing disk can be expanded.

Pooling also means that all the available storage capacity can potentially be used. In a traditional environment, an entire disk would be mapped to a host. This may be larger than is required, thus wasting space. In a virtual environment, the logical disk (vdisk) is assigned the capacity required by the using host.

Storage can be assigned where it is needed at that point in time, reducing the need to guess how much a given host will need in the future. Using Thin Provisioning, the administrator can create a very large thin-provisioned logical disk, so the using system believes it has a very large disk from day one.

Fewer Points of Management

With storage virtualization, multiple independent storage devices, which may be scattered over a network, appear to be a single monolithic storage device that can be managed centrally.

However, traditional storage controller management is still required. That is, the creation and maintenance of RAID arrays, including error and fault management.

Risks

Backing out a failed implementation

Once the abstraction layer is in place, only the virtualizer knows where the data actually resides on the physical medium. Backing out of a virtual storage environment therefore requires the reconstruction of the logical disks as contiguous disks that can be used in a traditional manner.

Most implementations will provide some form of back-out procedure, and with the data migration services it is at least possible, but time-consuming.

Interoperability/Vendor Support

Interoperability is a key enabler for any virtualization software/device. It applies to the actual physical storage controllers and the hosts, their operating systems, multi-pathing software and connectivity hardware.

Interoperability requirements differ based on the implementation chosen. For example, virtualization implemented within a storage controller adds no extra overhead to host-based interoperability, but will require additional support for other storage controllers if they are to be virtualized by the same software.

Switch-based virtualization may not require specific host interoperability if it uses packet-cracking techniques to redirect the I/O.

Network-based appliances have the highest level of interoperability requirements, as they have to interoperate with all devices, storage and hosts.

Complexity

Complexity affects several areas:

  • Management of Environment: although a virtual storage infrastructure benefits from a single point of logical disk and replication service management, the physical storage must still be managed. Problem determination and fault isolation can also become complex, due to the abstraction layer.
  • Infrastructure Design: traditional design ethics may no longer apply; virtualization brings a whole range of new ideas and concepts to think about (as detailed in this article).
  • The software/device itself: some implementations are more complex to design and code, network-based in-band (symmetric) designs in particular, since these implementations actually handle the I/O requests and so latency becomes an issue.

Meta-Data Management

Data is one of the most valuable assets in today's business environments. Once virtualized, the meta-data is the glue in the middle. If the meta-data is lost, so is all of the actual data, as it would be virtually impossible to reconstruct the logical drives without the mapping information.

Any implementation must ensure it protects, provides back-ups and can reconstruct the meta-data in the event of a catastrophic failure.

The meta-data management also has implications for performance. Any virtualization software/device must be able to keep all the copies of the meta-data atomic and quickly updateable. Some implementations restrict the ability to provide certain fast-update functions, such as point-in-time copies and caching, where super-fast updates are required to ensure minimal latency to the actual I/O being performed.
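
One common way to meet the protection and reconstruction requirements is a write-ahead journal; the following sketch is illustrative only, with an invented record format and simple string keys such as "vdisk1:0":

    import json

    # The change is made durable in the journal before the in-memory table
    # is touched, so the table can be rebuilt after a failure.
    def update_mapping(journal_path, mapping, key, value):
        with open(journal_path, "a") as journal:
            journal.write(json.dumps({"key": key, "value": value}) + "\n")
            journal.flush()           # persist the intent first
        mapping[key] = value          # then apply the update in memory

    def rebuild(journal_path):
        """Reconstruct the mapping table by replaying the journal."""
        mapping = {}
        with open(journal_path) as journal:
            for line in journal:
                record = json.loads(line)
                mapping[record["key"]] = record["value"]
        return mapping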

Performance/Scalability

In some implementations the performance of the physical storage can actually be improved, mainly due to caching. Caching, however, requires visibility of the data contained within the I/O request and so is limited to in-band/symmetric virtualization software/devices. These implementations also directly influence the latency of an I/O request (on a cache miss), due to the I/O having to flow through the software/device. Assuming the software/device is efficiently designed, this impact should be minimal when compared with the latency associated with physical disk accesses.

Due to the nature of virtualization, the mapping of logical to physical requires some processing power and lookup tables. Therefore every implementation will add some small amount of latency.

In addition to response-time concerns, throughput has to be considered. The bandwidth into and out of the meta-data lookup software directly impacts the available system bandwidth. In asymmetric implementations, where the meta-data lookup occurs before the data is read or written, bandwidth is less of a concern, as the meta-data is a tiny fraction of the actual I/O size. In-band, symmetric, flow-through designs are directly limited by their processing power and connectivity bandwidths.
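
To make "tiny fraction" concrete with illustrative numbers only: fetching a 16-byte mapping entry for a 64 KiB I/O moves about 0.02% as much meta-data as data, so an asymmetric meta-data path needs roughly 1/4000th of the data-path bandwidth; an in-band device, by contrast, must carry the full 64 KiB itself.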

Most implementations provide some form of scale-out model, where the inclusion of additional software/device instances provides increased scalability and potentially increased bandwidth. The performance and scalability characteristics are directly influenced by the chosen implementation.

Implementation approaches

There are three main implementation approaches:

  • Host-based
  • Storage Device-based
  • Network-based

Host-based

Host-based virtualization requires specific software or device drivers. Here, physical disks are presented to the host system, and a software layer above the physical device driver intercepts the I/O requests, performing the meta-data lookup and I/O redirection.

It is only recently that forms of logical volume management have been referred to as virtualization, but essentially every operating system has its own form of logical volume manager (Disk Management in Windows, LVM in AIX, and so on).

Pros

  • No additional hardware or infrastructure requirements
  • Simple to design / code

Cons

  • Storage utilisation optimised only on a per-host basis
  • Replication and data migration only possible locally to that host
  • Software is unique to each operating system
  • No easy way of keeping host instances in sync with other instances

Storage Device-Based

As with host-based virtualization, several categories have existed for years and have only recently been classed as virtualization. While RAID controllers do provide a logical-to-physical abstraction, they generally do not provide the benefits of data migration or replication across heterogeneous storage. The exception is a new breed of RAID controllers that allow the downstream attachment of other storage devices.

RAID systems are described elsewhere; for the purposes of this article we will only discuss the latter style, which does actually virtualize other storage devices.

Concept

A primary storage controller provides the virtualization services and allows the direct attachment of other storage controllers. Depending on the implementation these may be from the same or different vendors.

The primary controller will provide the pooling and meta-data management services. It may also provide replication and migration services across those controllers which it is virtualizing.

Pros

  • No additional hardware or infrastructure requirements
  • Provides most of the benefits of storage virtualization

Cons

  • Storage utilisation optimised only across the connected controllers
  • Replication and data migration only possible across the connected controllers, and only with the same vendor's devices for long-distance support
  • Downstream controller attachment limited to the vendor's support matrix

Network-based

Storage virtualization is most commonly thought of as a network-based device using Fibre Channel networks connected as a SAN. These types of device are the most commonly available and implemented form of virtualization.

The virtualization device sits in the SAN and provides the layer of abstraction between the hosts performing the I/O and the storage controllers providing the storage capacity.

Pros

  • True heterogeneous storage virtualization
  • Caching of data (performance benefit) is possible when in-band
  • Single management interface for all virtualized storage
  • Replication services across heterogeneous devices

Cons

  • Complex interoperability matrices, limited by vendor support
  • Difficult to implement fast meta-data updates in switch-based devices
  • Out-of-band requires specific host-based software
  • In-band may add latency to I/O
  • In-band is the most complicated to design and code

Appliance-based vs. Switch-based

There are two commonly available implementations of network-based storage virtualization: appliance-based and switch-based. Both models can provide the same services: disk management, meta-data lookup, data migration and replication. Both models also require some processing hardware to provide these services.

Appliance-based devices are dedicated hardware devices that provide SAN connectivity of one form or another. These sit between the hosts and storage and, in the case of in-band (symmetric) appliances, can provide all of the benefits and services discussed in this article. I/O requests are targeted at the appliance itself, which performs the meta-data mapping before redirecting the I/O by sending its own I/O request to the underlying storage. The in-band appliance can also provide caching of data, and most implementations provide some form of clustering of individual appliances to maintain an atomic view of the meta-data as well as the cached data.

Switch-based devices, as the name suggests, reside in the physical switch hardware used to connect the SAN devices. These also sit between the hosts and storage but may use different techniques to provide the meta-data mapping, such as packet cracking to snoop on incoming I/O requests and perform the I/O redirection. It is much more difficult to ensure atomic updates of meta-data in a switched environment, and services requiring fast updates of data and meta-data may be limited in switched implementations.

In-band vs. Out-of-band

In-band, also known as symmetric, virtualization devices actually sit in the data path between the host and storage. All I/O requests and their data pass through the device. Hosts perform I/O to the device directly and never interact with the storage itself; only the device sends I/O to the storage. Caching of data, statistics about data usage, replication services, data migration and thin provisioning are all easily implemented in an in-band device.

Out-of-band, also known as asymmetric, virtualization devices are sometimes called meta-data servers. These devices only perform the meta-data mapping functions. This requires additional software in the host, which knows to first request the location of the actual data. An I/O request from the host is therefore intercepted before it leaves the host, a meta-data lookup is requested from the meta-data server (this may be through an interface other than the SAN), and the server returns the physical location of the data to the host. The data is then retrieved through an actual I/O request to the storage. Caching is not possible, as the data never passes through the device.
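
The contrast between the two data paths can be sketched as follows; lookup() and read_storage() are invented placeholders rather than any product's API:

    metadata = {("vdisk1", 0): ("lun7", 0)}

    def lookup(vdisk, lba):
        return metadata[(vdisk, lba)]

    def read_storage(location):
        return b"block data"   # placeholder back-end read

    # In-band (symmetric): the host talks only to the virtualization device,
    # which does the lookup and performs the back-end I/O itself.
    def inband_read(vdisk, lba):
        return read_storage(lookup(vdisk, lba))   # data flows through the device

    # Out-of-band (asymmetric): host-side software asks the meta-data server
    # for the location, then the host reads the storage directly.
    def outofband_read(vdisk, lba):
        location = lookup(vdisk, lba)    # small control exchange with the server
        return read_storage(location)    # data path bypasses the server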
