Dumb and Dumber – Why don’t we have intelligent deduplication?
Steven J. Schwartz
I’ve been following the deduplication market and its products for a good while, and what amazes me is the lack of thought behind today’s solutions and the recent blind adoption of deduplication on primary data sets. I’m not referring to the application-level deduplication I’ve discussed before, but to appliance- or storage-based deduplication products.
The fundamental problem I see with primary volume deduplication on “live” data sets is the complete lack of intelligence in the deduplication service. It doesn’t matter how often a data set is read, or by how many applications; current products treat all data the same. So what is my primary concern? “Hot spots” on the disk subsystems: once many logical copies collapse into one physical copy, every read of any of them lands on the same blocks. Hot spots have long been a concern, and a reality, for databases, and they will continue to grow as an application problem.
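To make the hot spot concern concrete, here is a minimal Python sketch. It is a toy model, not how any shipping product works: the function name, the volume layout, and the eight-VM workload are all assumptions made up for illustration. It counts how many reads land on each physical block when identical blocks are stored separately versus when they are collapsed into one copy by content fingerprint.

```python
import hashlib
from collections import Counter


def physical_read_counts(volumes, reads, dedup):
    """Count reads per physical block.

    volumes: dict of volume name -> list of block contents (bytes)
    reads:   list of (volume name, block index) read requests
    dedup:   if True, identical contents share one physical block
    """
    counts = Counter()
    for vol, idx in reads:
        data = volumes[vol][idx]
        if dedup:
            # One physical block per unique content fingerprint.
            key = hashlib.sha256(data).hexdigest()
        else:
            # Every logical block has its own physical location.
            key = (vol, idx)
        counts[key] += 1
    return counts


# Hypothetical workload: 8 volumes that all hold the same OS block,
# plus one private block each; every volume reads both blocks once.
volumes = {f"vm{i}": [b"shared-os-block", f"private-{i}".encode()] for i in range(8)}
reads = [(vol, idx) for vol in volumes for idx in range(2)]

plain = physical_read_counts(volumes, reads, dedup=False)
deduped = physical_read_counts(volumes, reads, dedup=True)

print(max(plain.values()))    # busiest block without dedup: 1
print(max(deduped.values()))  # busiest block with dedup: 8
```

The total number of reads is identical in both runs; deduplication only changes where they land. In this toy case the busiest block goes from one read to eight, and the ratio grows with the number of volumes sharing the block.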
Where else have I seen disk hot spots recently? NFSroot in HPC environments, where clusters are configured with diskless nodes and a single boot image. The idea is that most of a Linux OS consists of the same binaries, so you should be able to boot an entire cluster off a single image, plus some extra space each node needs for swap and configuration information. Did anyone honestly think that some storage vendor (slide 6) or x86-based virtualization engine came up with the idea of a single boot image?
The problem with this method, and the area that needs to be addressed, is how we handle moments of massive access against the same data set. In the case of NFSroot, when a cluster is booting or when applications change, the NFSroot-loaded operating system puts a serious strain on the storage subsystem.
The primary problem with traditional storage systems is that they have only two levels of service for a given volume at a given time: cache, and the HDDs the volume is configured on. Cache is typically limited, both by cost and by the need to maintain data state in the event of a power failure. Locking a volume into a single type of HDD, however, is a thing of the past. Several years ago SAM-FS was able to use its file system for HSM (Hierarchical Storage Management), also known as ILM (Information Lifecycle Management). However, these technologies were really tied to applications for open systems, and typically offered one-way data movement, from expensive production-quality disk to archive disk or active tape. A few years ago FalconStor had, within its IPStor product, something called HotZone®, which created a virtual extended disk cache via RAM or SSD; it was more of a caching head than a full tiering solution. Products like Dell | EqualLogic PS-series storage, with automated load balancing and storage tiering, and Compellent’s Data Progression™ give a new option for data access. In both cases, when areas of data require higher levels of performance, these solutions can migrate that data to a higher class of storage, thus eliminating or minimizing hot spots within the data set.
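The general idea of moving hot data to a faster class of storage can be sketched in a few lines of Python. To be clear, this is my own simplified illustration and not how the EqualLogic or Compellent products actually decide what to move; the function name, the extent ids, and both thresholds are hypothetical.

```python
def plan_promotions(access_counts, fast_tier_slots, min_reads):
    """Pick the hottest extents to place on the fast tier.

    access_counts:   mapping of extent id -> reads in the last interval
    fast_tier_slots: how many extents fit on the fast tier (assumed)
    min_reads:       extents below this count are never promoted (assumed)
    """
    hot = [(ext, n) for ext, n in access_counts.items() if n >= min_reads]
    # Hottest first; ties broken by extent id so the plan is deterministic.
    hot.sort(key=lambda item: (-item[1], item[0]))
    return [ext for ext, _ in hot[:fast_tier_slots]]


# Hypothetical read counts gathered over one monitoring interval.
counts = {"ext-a": 900, "ext-b": 15, "ext-c": 450, "ext-d": 3}
print(plan_promotions(counts, fast_tier_slots=2, min_reads=100))
# ['ext-a', 'ext-c']
```

A real tiering engine would also have to demote data that has cooled off and account for the cost of the migration itself; the sketch only covers the selection step.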
It may sound like I’m plugging products in the previous paragraph. What I’m trying to point out, however, is that deduplication is prone to quickly cause hot spots within applications, and could be especially risky for virtualized OSs running on deduplicated storage. Enhanced virtualization within the storage layer can help reduce these areas of contention. So, what is my solution?
- Products that offer deduplication services for primary production volumes will at some point need to address hot spots.
- As virtualization software continues to move towards single-image storage (and deduplication at the application layer), storage vendors will have to keep up with ever-changing storage performance requirements.
- Deduplication intelligence needs to mature to the point where it is aware that some data, regardless of sameness, might still need to be excluded from deduplication.
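As a rough illustration of that last point, here is a sketch of what a heat-aware deduplication decision might look like. Nothing here describes an existing product: the `BlockStats` fields, the `dedup_decision` function, and the read-rate cap are all hypothetical names and numbers I picked for the example.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    fingerprint: str      # content hash shared by the identical blocks
    referrers: int        # number of logical blocks with this content
    reads_per_sec: float  # aggregate read rate across all referrers


def dedup_decision(stats, max_reads_per_copy):
    """Return 'dedupe' or 'exclude' for a group of identical blocks."""
    if stats.referrers < 2:
        return "exclude"  # nothing to share
    if stats.reads_per_sec > max_reads_per_copy:
        # Sharing would funnel too many reads onto one physical copy.
        return "exclude"
    return "dedupe"


cold = BlockStats("aa11", referrers=200, reads_per_sec=5.0)
hot = BlockStats("bb22", referrers=50, reads_per_sec=4000.0)

print(dedup_decision(cold, max_reads_per_copy=500))  # dedupe
print(dedup_decision(hot, max_reads_per_copy=500))   # exclude
```

The point is only that the decision takes access patterns into account, not just sameness. A smarter policy could keep a few copies of a hot block instead of excluding it outright.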
I would love to hear people’s thoughts on this!
1 Comment

October 7th, 2008 at 10:49 am
For deduplicated VMware and virtualised storage, hot spots are all goodness.
With NetApp’s PAM module (which makes the cache very large indeed), deduplicated blocks that get hot will be served from cache. There is, after all, only one block that is shared multiple times; why re-read it and make the disks hot? For instance, this can be used to alleviate problems caused by VMware boot storms in VDI implementations.
Systems that don’t virtualise storage and don’t dedupe can’t do this at all. Forget caching or hot storage disks without it; deduplication is the prerequisite to fixing this problem.
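The caching argument in this comment can be sketched with a toy LRU cache keyed by physical block. This is a simplified model under stated assumptions, not a description of NetApp’s PAM: the function name, the 50-VM boot storm, the 100-block image, and the 200-block cache are all made up for illustration.

```python
from collections import OrderedDict


def disk_reads(requests, cache_blocks):
    """Count reads that miss an LRU cache keyed by physical block id."""
    cache = OrderedDict()
    misses = 0
    for block in requests:
        if block in cache:
            cache.move_to_end(block)  # refresh recency on a hit
            continue
        misses += 1
        cache[block] = True
        if len(cache) > cache_blocks:
            cache.popitem(last=False)  # evict the least recently used block
    return misses


vms, image_blocks = 50, 100
# Boot storm: every VM reads every block of its boot image once.
without_dedup = [(vm, b) for vm in range(vms) for b in range(image_blocks)]
with_dedup = [("shared", b) for vm in range(vms) for b in range(image_blocks)]

print(disk_reads(without_dedup, cache_blocks=200))  # 5000
print(disk_reads(with_dedup, cache_blocks=200))     # 100
```

In this model the deduplicated image fits in cache, so only the first VM’s reads reach the disks. The benefit depends on the shared working set fitting in cache; anything that falls out of cache still lands on a single set of disks.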