Best practices for implementing disk-to-disk backup, Part 3

Virtual tape libraries vs. commonality factoring devices

The previous article discussed the pros and cons of hardware-based disk libraries (HBDL). This final article in the series will focus on virtual tape libraries (VTL) and the newer commonality factoring devices.

In the case of VTLs, the application sees the disk and tape as a single entity. A VTL is an integrated solution that combines virtual tape software, typically running on an appliance, with disk from any manufacturer and tape libraries from most manufacturers. These products are typically bought as components and then integrated at the customer site, although some manufacturers now provide complete solutions from a single source, often a tape library with a highly tuned ATA array installed on a shelf inside the library itself.

As with HBDLs, these task-specific units are designed for backup. They are tuned for the write-intensive nature of the environment, they support a single, very large file system, and they look like a tape library to the backup application. The shareability of the library in a storage-area network is also very similar to that of HBDLs.

The Pros of VTLs

The major advantage is that because the tape library and the disk component are managed as one, the extra backup jobs that the backup application needs when using standard ATA arrays or HBDLs are not required. VTLs automatically move data from disk to tape as the disk runs out of space, or according to a single global policy. The media created when data is moved is "application compatible," meaning the backup software sees the media as if it had created it. This all leads to nondisruptive implementations with little impact on day-to-day administration time.
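
To make the idea of a single global policy concrete, here is a minimal sketch in Python of how an appliance might decide which backup images to stage from disk to tape. It is only an illustration under assumed rules: the names (`BackupImage`, `select_for_migration`) and the high-water-mark and retention thresholds are hypothetical, not any vendor's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class BackupImage:
    name: str               # virtual cartridge / backup image label
    size_gb: float
    written: datetime
    on_tape: bool = False


# Hypothetical policy values -- not taken from any particular vendor.
HIGH_WATER_PCT = 85                   # start moving data at 85% pool usage
LOW_WATER_PCT = 60                    # stop once usage would fall to 60%
DISK_RETENTION = timedelta(days=14)   # images older than this go to tape


def select_for_migration(images: List[BackupImage],
                         pool_capacity_gb: float,
                         used_gb: float,
                         now: datetime) -> List[BackupImage]:
    """Pick the disk-resident images the appliance would copy to tape next."""
    on_disk = sorted((img for img in images if not img.on_tape),
                     key=lambda img: img.written)              # oldest first

    # Rule 1: anything past its retention-on-disk window is due regardless.
    due = [img for img in on_disk if now - img.written > DISK_RETENTION]

    # Rule 2: if the pool is over the high-water mark, queue the oldest
    # remaining images until projected usage drops to the low-water mark.
    if used_gb / pool_capacity_gb * 100 >= HIGH_WATER_PCT:
        target_gb = pool_capacity_gb * LOW_WATER_PCT / 100
        projected = used_gb - sum(img.size_gb for img in due)
        for img in on_disk:
            if projected <= target_gb:
                break
            if img not in due:
                due.append(img)
                projected -= img.size_gb
    return due
```

In a shipping VTL, logic of this kind runs entirely inside the appliance, which is why no extra duplication jobs have to be defined in the backup application.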

In addition to the write performance increase, VTLs have another significant advantage over standard ATA-based arrays: an embedded path to tape. This is a private data transfer segment that feeds directly from the disk component to the tape drives, and it has two immediate benefits. The first is performance-related: because this is a private, local, high-speed path to tape, the integrated array can stream the tape devices consistently at their maximum sustainable speed. Second, the backup and device servers are not involved in the movement to tape, which improves the reliability of the disk-to-tape transfer and frees performance on those servers for other work.

VTLs come as close as possible to the ideal of disk-to-disk backup: a write-tuned array with a private path to tape. They also cause the least disruption to the installation and day-to-day management of the environment.

The Cons of VTLs

The disadvantages depend on which deployment method is chosen. Choosing the separate components and integrating them in the data center provides greater flexibility, but it also increases installation effort and possible operational impact.

A concern often raised by providers of HBDLs is that a VTL masks the exact location of the data from the backup application: the application assumes the backup data is on tape, when it may still be on disk. Another concern is what happens if there is a media error on the tape drive during the transfer from disk to tape. The backup data is not lost, it is still on disk and can be written to a new piece of media, but there is no way to alert the backup application to the change in bar code or tape label, which creates a mismatch. The error must be caught manually (or from an alert raised by the VTL appliance), and the second cartridge must be imported back into the backup application. This is not an impossible task, but it is still challenging and not very automated.

Commonality Factoring Devices

Working to change the rules of disk-to-disk backup are the commonality factoring devices (CFD). These devices claim to use advanced techniques to squeeze 20TB of data into 4TB of physical space. How they accomplish this is fairly technical: at a high level, they use a content-addressable storage file system similar to EMC Centera's, but focused on backup.

As data is sent to the CFD appliance, it is analyzed at the byte level. The byte streams are compared to previously stored byte streams, looking for duplicates. When a duplicate byte stream is found, instead of storing that data again, a pointer is established back to the initial byte stream. For example, assume you had received two versions of a presentation with two different file names, but only one slide had changed. When the two files are backed up to the CFD appliance, the first file is backed up in full; for the second, only the bytes of data representing the changed slide are stored. In situations like this, there can be a significant reduction in the amount of data actually stored on the unit. This process is typically called commonality factoring. Commonality factoring also helps customers who want to keep several full backups of the same servers on disk, since repeated full backups of the same set of servers show a great deal of commonality.
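
To illustrate the mechanism (and only the mechanism; this is not any vendor's algorithm), the sketch below hashes fixed-size chunks and stores each unique chunk once, keeping per-file lists of hashes as pointers. Real CFDs generally work on variable-length, content-defined byte streams, which is what lets them catch the changed-slide case even when the edit shifts the bytes that follow it.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4096   # illustrative fixed-size chunks; real devices typically
                    # use variable, content-defined boundaries


class ChunkStore:
    """Toy commonality-factoring store: unique chunks plus per-file recipes."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}        # hash -> unique chunk data
        self.recipes: Dict[str, List[str]] = {}   # file name -> list of hashes

    def backup(self, name: str, data: bytes) -> int:
        """Store a file; return how many new bytes were actually kept."""
        new_bytes = 0
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:     # first time this chunk is seen
                self.chunks[digest] = chunk
                new_bytes += len(chunk)
            recipe.append(digest)             # duplicate -> pointer only
        self.recipes[name] = recipe
        return new_bytes

    def restore(self, name: str) -> bytes:
        """Rebuild the original file from its chunk references."""
        return b"".join(self.chunks[d] for d in self.recipes[name])
```

If an edit leaves the rest of a file's byte offsets unchanged, the second backup in the presentation example stores only the chunks covering the modified slide; everything else in its recipe points back at chunks already stored for the first file.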

There are certain environments where this concept will deliver more bang for the buck than others. User home directories (file servers), imaging applications and even databases should do quite well. The higher the server count, the better the reduction should be, partly because there is a lot of commonality in the operating system files across servers, and partly because the odds of duplicate or near-duplicate general files increase as more servers are added. Other environments will see little or no benefit, such as anyone creating large single data files that are not incrementally added to; the geophysical and film/video markets are good examples.

There is a debate in the CFD market over where the commonality factoring should be done. One strategy is to do it on the client; another is to do it on the appliance. The advantage of the client-side technique is that less physical data is sent across the network, because only the unique byte streams leave the client, which makes it better suited to wide-area backup. But there are two big negatives to this approach. First, it requires a total replacement of the backup application. Second, in our testing of this method, the commonality factoring placed a significant CPU load on the client, and the client software had a tendency to become unstable if it could not reach the appliance.
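
A rough way to picture why the client-side method sends so little data across the network: before shipping a chunk, the client asks the appliance whether it already holds that chunk's hash, and only the misses travel over the WAN. The sketch below reuses the fixed-size chunking from the previous example; the `appliance.have`, `appliance.put` and `appliance.put_recipe` calls are hypothetical stand-ins for whatever protocol a real product uses.

```python
import hashlib

CHUNK_SIZE = 4096   # same illustrative fixed-size chunking as above


def client_side_backup(data: bytes, appliance) -> int:
    """Send only the chunks the appliance does not already hold.

    `appliance` is a hypothetical stand-in offering:
      have(digest) -> bool    does the appliance already store this chunk?
      put(digest, chunk)      ship one new chunk across the network
      put_recipe(hashes)      record the list of hashes making up this backup
    Returns the number of data bytes actually sent.
    """
    sent = 0
    recipe = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        recipe.append(digest)
        if not appliance.have(digest):      # only unique byte streams travel
            appliance.put(digest, chunk)
            sent += len(chunk)
    appliance.put_recipe(recipe)            # tiny compared with the data itself
    return sent
```

The hashing inside this loop is also where the client-side CPU load mentioned above comes from.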

The appliance-side approach, on the other hand, leverages the current backup application and appears to it as an NFS or CIFS mount point that is backed up via the software's disk backup option. The downside is that the whole data set must be transferred to the device before the commonality factoring is done, which makes a disk backup appliance such as Data Domain unsuitable outside the data center. The plus is that the product leverages the existing backup software and infrastructure rather than replacing it, and no additional load is placed on the client beyond the backup agent already there.

Which approach is better? From a pure technologist's point of view, the client-side products are superior, and they are also the better choice if the customer intends to do wide-area backups. From a practical perspective, the appliance-side approach wins. It is difficult to justify replacing your current backup application, and if something fails in the appliance-side scenario, backups can roll to tape as they did before the product was purchased. If the client-side product fails, the entire backup process goes down with it.

The core difference between CFDs and the various VTL solutions is most evident in the amount of data stored: CFDs can reduce the amount of storage required by as much as 80%. While CFDs lack many VTL features, they are a compelling product category to examine.

When trying to decide the best product for your environment, we recommend weighing the desired goals against the available budget. Each of the above options improves recovery speed and can lower day-to-day management time, but each also gets progressively more expensive.

If improving the time to recovery for a single file or group of files is a primary concern, then any of the solutions will work. If the goal is to increase reliability and improve performance without increasing administration time, then VTLs are probably the most viable solution. If the goal is long-term storage of data on disk without the need for tape, then a CFD may be the best route.

George Crump is vice president of technology solutions at SANZ Inc., an Englewood, Colo.-based data storage consulting and system integration company focused on the design, deployment and support of intelligent data management.

Copyright © 2005 IDG Communications, Inc.

  