Deduplication

Types of deduplication

Source deduplication

Source deduplication uses client software to compare new data blocks on the primary storage device with previously backed up data blocks. Blocks that have already been stored are not transmitted again. Source-based deduplication therefore uses less bandwidth for data transmission, but it increases the workload on the server being backed up and can lengthen the time it takes to complete backups.
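
The comparison step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fixed 4 KB block size, the use of SHA-256, and the names split_blocks, source_side_backup and known_hashes are all hypothetical and do not come from any particular backup product.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, in bytes


def split_blocks(data: bytes, size: int = BLOCK_SIZE) -> list[bytes]:
    """Split data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def source_side_backup(data: bytes, known_hashes: set[str]) -> list[tuple[str, bytes]]:
    """Return only the (hash, block) pairs that have not been backed up yet.

    known_hashes stands in for the index of previously backed up blocks;
    it is updated in place as new blocks are selected for transmission.
    """
    to_send = []
    for block in split_blocks(data):
        digest = hashlib.sha256(block).hexdigest()
        if digest not in known_hashes:
            known_hashes.add(digest)
            to_send.append((digest, block))
    return to_send


known: set[str] = set()
first = source_side_backup(b"A" * 8192 + b"B" * 4096, known)
second = source_side_backup(b"A" * 8192 + b"C" * 4096, known)
print(len(first), len(second))  # 2 1
```

The first backup holds three blocks but only two are unique, so two are sent. The second backup shares its "A" blocks with the first, so only the new "C" block crosses the network. The hashing work happens on the client, which is where the extra workload comes from.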

Target deduplication

Target deduplication removes redundant data in the backup appliance, typically a NAS device or virtual tape library (VTL). Target dedupe reduces the storage capacity required for backup data, but does not reduce the amount of data sent across a LAN or WAN during backup. “A target deduplication solution is a purpose-built appliance, so the hardware and software stack are tuned to deliver optimal performance,” Whitehouse said. “So when you have large backup sets or a small backup window, you don’t want to degrade the performance of your backup operation. For certain workloads, a target-based solution might be better suited.”

Target deduplication may also fit your environment better if you use multiple backup applications and some do not have built-in dedupe capabilities.

Inline deduplication

Another option to consider is when the data is deduplicated. Inline deduplication removes redundancies in real time as the data is written to the storage target. Software-only products tend to use inline processing because the backup data doesn’t land on a disk before it’s deduped. Like source deduplication, inline increases CPU overhead in the production environment but limits the total amount of data ultimately transferred to backup storage. Asigra Inc.’s Cloud Backup and CommVault Systems Inc.’s Simpana are software products that use inline deduplication.

Post-process deduplication

Post-process deduplication writes the backup data into a disk cache before it starts the dedupe process. It doesn’t necessarily write the full backup to disk first; the dedupe process can begin as soon as data starts to land on the disk. Because deduping is separate from the backup process, you can dedupe the data outside the backup window without degrading your backup performance. Post-process deduplication also gives you quicker access to your most recent backup. “So on a recovery that might make a difference,” Whitehouse said.
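
To make the difference in timing concrete, here is a minimal sketch contrasting the two approaches. The class names InlineStore and PostProcessStore, the in-memory dictionaries, and the choice of SHA-256 are assumptions made for illustration; real products also typically start post-process dedupe while the backup is still running, which this sketch simplifies into an explicit dedupe() call.

```python
import hashlib


class InlineStore:
    """Dedupes each block as it is written; duplicates never land on disk."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}  # hash -> unique block
        self.recipe: list[str] = []         # ordered hashes to rebuild the backup

    def write(self, block: bytes) -> None:
        digest = hashlib.sha256(block).hexdigest()  # hashing on the write path
        self.blocks.setdefault(digest, block)
        self.recipe.append(digest)


class PostProcessStore:
    """Lands raw blocks in a cache first; dedupe runs as a separate step."""

    def __init__(self) -> None:
        self.landing: list[bytes] = []      # raw, not yet deduplicated data
        self.blocks: dict[str, bytes] = {}
        self.recipe: list[str] = []

    def write(self, block: bytes) -> None:
        self.landing.append(block)          # no hashing on the write path

    def dedupe(self) -> None:
        """Drain the landing area into the deduplicated store."""
        for block in self.landing:
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)
            self.recipe.append(digest)
        self.landing.clear()


backup = [b"x" * 512, b"y" * 512, b"x" * 512]
inline, post = InlineStore(), PostProcessStore()
for blk in backup:
    inline.write(blk)
    post.write(blk)
print(len(inline.blocks), len(post.landing))  # 2 3
post.dedupe()
print(len(post.blocks), len(post.landing))    # 2 0
```

Both stores end up holding the same two unique blocks. The inline store pays the hashing cost during the backup, while the post-process store temporarily needs room for all three raw blocks and does the hashing later.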

Global deduplication

Global deduplication removes backup data redundancies across multiple devices, whether you are using several target-based appliances or multiple clients with source-based products. It allows you to add nodes that talk to each other across multiple locations to scale performance and capacity. Without global deduplication capabilities, each device dedupes just the data it receives. Some global systems can be configured in two-node clusters, such as FalconStor Software’s FDS High Availability Cluster. Other systems use grid architectures to scale to dozens of nodes, such as ExaGrid Systems’ DeltaZone and NEC’s Hydrastor.

The more backup data you have, the more global deduplication can increase your dedupe ratios and reduce your storage capacity needs. Global deduplication also introduces load balancing and high availability to your backup strategy, and allows you to efficiently manage your entire backup data storage environment. Users with large amounts of backup data or multiple locations will gain the most benefits from the technology.
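
The effect of a shared index can be illustrated with a small sketch. It assumes each node receives a list of blocks and that blocks are identified by SHA-256; the function unique_blocks_stored and the sample data are hypothetical and do not model any vendor's cluster or grid protocol.

```python
import hashlib


def unique_blocks_stored(node_streams: list[list[bytes]], shared_index: bool) -> int:
    """Count blocks stored across all nodes, with or without a shared index."""
    global_index: set[str] = set()
    stored = 0
    for stream in node_streams:
        # Without global dedupe, every node starts from its own empty index.
        index = global_index if shared_index else set()
        for block in stream:
            digest = hashlib.sha256(block).hexdigest()
            if digest not in index:
                index.add(digest)
                stored += 1
    return stored


streams = [
    [b"os-image", b"site-a-data"],  # blocks received by node 1
    [b"os-image", b"site-b-data"],  # blocks received by node 2
]
print(unique_blocks_stored(streams, shared_index=False))  # 4
print(unique_blocks_stored(streams, shared_index=True))   # 3
```

With per-node indexes, the common "os-image" block is stored once on each node. With a shared index it is stored once overall, and the saving grows as more nodes hold overlapping data.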

Block-level deduplication

Block-level deduplication uses a hash table to track every storage block. Even though the hash table is smaller than the data itself, there eventually comes a point where the hash table becomes unwieldy. ExaGrid estimates that the hash table must track approximately a billion storage blocks for each 10 TB of data referenced.
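
A rough back-of-the-envelope calculation shows how a figure of that order can arise. The 8 KiB block size, the 32-byte digest and the 8-byte block pointer below are assumptions chosen for illustration; the source does not state which values ExaGrid used, and the result is an upper bound that treats every block as unique.

```python
TB = 10**12  # decimal terabyte, in bytes


def index_entries(data_bytes: int, block_bytes: int) -> int:
    """Number of hash-table entries if every block gets one entry."""
    return -(-data_bytes // block_bytes)  # ceiling division


def index_size_bytes(entries: int, hash_bytes: int = 32, pointer_bytes: int = 8) -> int:
    """Rough index footprint: one digest plus one block pointer per entry."""
    return entries * (hash_bytes + pointer_bytes)


entries = index_entries(10 * TB, block_bytes=8 * 1024)  # 8 KiB blocks: an assumption
print(f"{entries:,} entries")                            # 1,220,703,125 entries
print(f"{index_size_bytes(entries) / 10**9:.1f} GB of index")  # 48.8 GB of index
```

Under these assumptions, 10 TB of data produces roughly 1.2 billion entries and an index of tens of gigabytes, which is small next to the data but large enough to become awkward to keep in memory as capacity grows.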

Zone-level deduplication

Zone-level deduplication, which is proprietary to ExaGrid, has two main advantages over block-level deduplication:

  • It examines chunks of data that are significantly larger than blocks. Performing deduplication at a less granular level effectively reduces the size of the hash table. ExaGrid estimates the hash table size to be 1,000 times smaller than it would be if block-level deduplication were used.
  • It supports scale-out architectures. If an organization begins to outgrow its front-end controller, it can simply add another controller rather than trying to upgrade to a larger controller. Subsequent controllers can be added on an as-needed basis.
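
The first point is a matter of simple proportion: the number of index entries falls in line with the size of the unit being tracked. The sketch below only illustrates that arithmetic and is not ExaGrid's algorithm, which is proprietary. The 8 KiB block size is an assumption, and the zone size is deliberately set to 1,000 times the block size to mirror the estimate quoted above.

```python
def index_entries(data_bytes: int, chunk_bytes: int) -> int:
    """Number of index entries needed to track data in chunks of a given size."""
    return -(-data_bytes // chunk_bytes)  # ceiling division


DATA = 10 * 10**12     # 10 TB of backup data
BLOCK = 8 * 1024       # assumed block size (8 KiB)
ZONE = 1000 * BLOCK    # assumed zone size, chosen to mirror the 1,000x estimate

block_entries = index_entries(DATA, BLOCK)
zone_entries = index_entries(DATA, ZONE)
ratio = block_entries / zone_entries

print(f"block-level entries: {block_entries:,}")
print(f"zone-level entries:  {zone_entries:,}")
print(f"reduction: about {ratio:.0f}x")  # reduction: about 1000x
```

A smaller index is easier to hold in memory and to share between controllers, which is one plausible reason coarser tracking pairs well with a scale-out design.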

Zone-level deduplication is intended to address the scalability problems encountered with block-level deduplication. Although it is a proprietary technology, it is designed to be vendor-agnostic, to the point that it will work with any backup application.