Steven J. Schwartz
In the past several months I’ve written a couple of times about Deduplication, mainly about my feeling that it is a feature and not a product, and more recently about NetApp’s implementation of A-SIS. I also mentioned the announcement of a newer blog, DEDUPEMATTERS.com, which is run by Data Domain.

It isn’t that I’ve been ahead of the curve with Deduplication; it just continues to come up as a checkbox item in every storage discussion. Why? Mainly because of the exponential growth of storage, and storage retention requirements for both public and private companies. (Please ignore the current economic downturn in the United States; while it has a short-term impact, I believe long-term projections will still be accurate.)
So where am I going with this train of thought? I honestly believe, as I’ve stated before, that Deduplication is a feature, and primarily a feature of backup storage/suites and server applications. This is a personal belief for the following reasons:
- The act of Deduplication on a data set, in both online and post-processing modes, is a compute- and storage-intensive process.
- The act of data re-hydration, or re-duplication, is also an intensive process, and above all it can balloon the storage capacity required.
- Deduplication forces a data set into a consolidated format, which in certain instances can cause disk hot spots and lower data access performance.
- Current methodologies for Deduplication are based only on capacity savings, and not on the importance of data access or application performance.
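To make the points above concrete, here is a minimal sketch of block-level Deduplication in Python. It is my own illustration, not any vendor's implementation: the `DedupStore` class, the fixed 4 KB block size, and the use of SHA-256 as the fingerprint are all assumptions. It shows where the compute goes (hashing every block), what re-hydration means (expanding a recipe of pointers back into full data), and how heavily shared blocks become candidates for hot spots.

```python
import hashlib
from collections import Counter

BLOCK_SIZE = 4096  # assumed fixed block size; real products vary


class DedupStore:
    """Toy block-level dedup store (hypothetical, for illustration only)."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes, each stored once
        self.files = {}   # file name -> ordered list of fingerprints

    def write(self, name, data):
        recipe = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            # Compute cost: every block written must be fingerprinted.
            digest = hashlib.sha256(block).hexdigest()
            # Only the first copy of a block is physically stored.
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        self.files[name] = recipe

    def read(self, name):
        # Re-hydration: follow the pointers and rebuild the full data.
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())

    def logical_bytes(self):
        return sum(len(self.blocks[d])
                   for recipe in self.files.values() for d in recipe)

    def reference_counts(self):
        # Blocks referenced by many files are where hot spots can form.
        return Counter(d for recipe in self.files.values() for d in recipe)


store = DedupStore()
store.write("a.dat", b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE)
store.write("b.dat", b"A" * BLOCK_SIZE * 2 + b"C" * BLOCK_SIZE)
print(store.logical_bytes(), store.stored_bytes())
```

In this toy case 28,672 logical bytes are held in 12,288 stored bytes, and the single "A" block is referenced five times. Re-hydrating both files to a target that does not deduplicate would need the full 28,672 bytes again, which is the ballooning effect described above.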
What does this all mean regarding features vs. products? How does this apply to your implementation of Deduplication? What does this mean for Deduplication of primary storage volumes? Let’s explore this:
Product vs. Feature
I would like to draw a parallel here to Storage Virtualization. Some time ago, there was a plethora of “heterogeneous” Storage Virtualization products/appliances. The biggest issue with these products/appliances was a very common IT dilemma, what I call the IT Triangle: performance, resiliency, and cost. There is no way to get ALL three without one of the corners suffering. If you want the highest performance and the highest resiliency, you end up with the HIGHEST COST. So in order for Virtualization products/appliances to stay cost effective and provide “heterogeneous” storage support, they sacrificed performance and/or reliability. So the market dictated that this level of functionality should live within the storage devices, and that the “flexibility” of true heterogeneous support would become less of a priority.
Deduplication products/appliances typically have the same problem; however, the requirements will differ greatly depending on the target they are deduplicating. I will touch upon this shortly. So the real question is: what sacrifices are you willing to make with a deduplication product/appliance in your environment? Are you willing to pay extra for an additional product in the IT infrastructure for deduplication of the backup stream? Do you want to run your NAS environment on a 3rd party solution in order to take advantage of block-based deduplication, when file-level deduplication might be built into your current file serving solution? Would you be willing to place an in-data-path appliance between your application servers and your primary storage in order to leverage block-based deduplication, knowing that it may bring significant storage savings, but at a cost to application performance?
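The difference between file-level and block-based deduplication in the questions above can be shown with a small Python sketch. The function names, the 4 KB block size, and the sample files are hypothetical and only meant to illustrate the granularity trade-off, not how any particular NAS or appliance works.

```python
import hashlib


def file_level_unique_bytes(files):
    """Bytes kept if only whole identical files are collapsed."""
    seen = {}
    for data in files.values():
        seen.setdefault(hashlib.sha256(data).digest(), len(data))
    return sum(seen.values())


def block_level_unique_bytes(files, block_size=4096):
    """Bytes kept if identical fixed-size blocks are collapsed."""
    seen = {}
    for data in files.values():
        for off in range(0, len(data), block_size):
            block = data[off:off + block_size]
            seen.setdefault(hashlib.sha256(block).digest(), len(block))
    return sum(seen.values())


# Eight distinct 4 KB blocks make up the base file (made-up sample data).
base = b"".join(bytes([i]) * 4096 for i in range(8))
files = {
    "report_v1": base,
    "report_copy": base,                           # exact duplicate
    "report_v2": base[:-4096] + b"\xff" * 4096,    # last block changed
}
print(file_level_unique_bytes(files), block_level_unique_bytes(files))
```

Of 98,304 logical bytes, file-level deduplication keeps 65,536 (only the exact copy collapses), while block-level keeps 36,864 because the edited version shares seven of its eight blocks. The finer granularity saves more, but it also means far more fingerprints to compute and look up, which is where the performance question comes from.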
How Deduplication is used in Environments Today!
YOU ARE ALREADY USING DEDUPLICATION TECHNOLOGIES!!!!! You might not even know it! Several technologies that ARE Deduplication technologies are present in MOST datacenters today.
- Are you running Exchange 2000, 2003 or 2007?
- Are you utilizing Windows Storage Servers?
  - Well, starting with Windows Storage Server 2003 R2, there is file-level Deduplication within volumes, enabled per volume.
- Are you utilizing any pointer-based snapshot technology within your storage system, or VSS within Windows?
  - Once again, this is a form of data Deduplication, specifically around data protection. Storage arrays that utilize a pointer-based snapshot technology allow virtual backup copies of a volume set. This is the case when utilizing VSS within Windows as well; it is just handled at the OS level rather than the disk storage level. (Some storage providers can utilize VSS functionality to use disk-based snapshot technology to take OS- and application-consistent snapshots at the hardware layer, rather than the default software layer.)
- Do you utilize COTS applications running on a Database?
  - Many database applications utilize record linking in order to minimize multiple copies of the same data rows/columns/table spaces.
So what do the above examples show? Application/OS-based deduplication is a feature of a larger application set, not a product unto itself. The same goes for primary storage features such as snapshots, which over several years have become relatively mainstream. (Note: NOT ALL Snapshots are created equal!)
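To show why a pointer-based snapshot counts as a form of deduplication, here is a minimal Python sketch. It is a simplified redirect-on-write model of my own; the `Volume` class and its methods are hypothetical and do not represent how VSS or any particular storage array implements snapshots. The point is only that a snapshot copies pointers, not data, so unchanged blocks are shared between the live volume and every snapshot.

```python
class Volume:
    """Toy volume with pointer-based snapshots (illustration only)."""

    def __init__(self, num_blocks):
        self.block_map = [None] * num_blocks  # logical index -> block id
        self.store = {}                       # block id -> block bytes
        self.snapshots = {}                   # name -> frozen block map
        self._next_id = 0

    def write(self, index, data):
        # Redirect-on-write: new data goes to a new physical block, so
        # blocks still referenced by snapshots are never overwritten.
        self.store[self._next_id] = data
        self.block_map[index] = self._next_id
        self._next_id += 1

    def snapshot(self, name):
        # Only the pointer table is copied; no block data is duplicated.
        self.snapshots[name] = list(self.block_map)

    def read(self, index, snapshot=None):
        table = self.block_map if snapshot is None else self.snapshots[snapshot]
        block_id = table[index]
        return None if block_id is None else self.store[block_id]


vol = Volume(4)
for i in range(4):
    vol.write(i, f"v1-{i}".encode())
vol.snapshot("monday")        # still 4 physical blocks
vol.write(2, b"changed")      # now 5 physical blocks, not 8
print(len(vol.store), vol.read(2), vol.read(2, "monday"))
```

The sketch never frees blocks and ignores consistency entirely, which is part of why not all snapshots are created equal: real implementations differ in how they track, reclaim, and protect the shared blocks.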
Most backup packages also offer deduplication features to help reduce the footprint of the backup environment.
Deduplication of Primary Storage Volumes
So Primary Storage Volumes seem to be the next logical discussion point. Coming back to my earlier questions: virtualization appliances gave heterogeneous storage support and cross-platform data services, but with degraded performance and at additional cost. Most customers I’ve come across in recent years are so concerned with performance that detailed application assessments and deep technical dives into storage performance were required in order to drive purchase decisions. The number of saved perfmon exports and iostat redirects that I’ve looked at and analyzed through tool sets continues to grow. So, as Stephen Foskett recently put it, “deduplication is not yet ready for prime time in primary storage applications”; it is, however, readily present and ready for production use in other areas.
So applications with high-IO, low-latency storage requirements need to be looked at seriously as applications that aren’t ready for deduplication as a “storage hardware feature”. Applications can typically be more intelligent about deduplication, minimizing the performance impact for a very specific data set, and that is something that just hasn’t been seen yet in the storage industry’s feature set.
Final Thoughts
So I am going to contradict myself. Several years ago I was actually a very big fan of Virtualization appliances; they were an imperfect stop-gap for the storage industry. My customers wanted strong storage services, like Snapshots, site-to-site replication/mirroring/archiving, and heterogeneous storage pooling. They were willing to make an investment in products like SANsymphony, IPStor, and SVC in order to gain an agnostic storage approach, re-deploy older storage, and leverage cheaper featureless storage arrays. The storage vendors caught up, however, and began offering better performance with the same feature sets, and, in general, the virtualization appliance went away. I believe the same is occurring with deduplication appliances. They are a good stop-gap until the application providers and storage vendors come up with better native deduplication technologies and support. So yes, while I STRONGLY feel that Deduplication is a feature of either applications or storage hardware, for the time being deduplication appliances will continue to be prevalent. Just a stop-gap, though!