NetApp Data De-duplication, in response to a comment I received!
Steven J. Schwartz
I love when the A-SIS functionality is brought up for NetApp. The following is directly from NetApp’s Tech Report on Data De-duplication: (http://media.netapp.com/documents/tr-3505.pdf)
1. “In its initial release, deduplication primarily had a focus on data retention/archiving of file system data on secondary storage NetApp systems. Substantial storage savings can be achieved with deduplication in some tier 2 primary storage environments as well.”
What does that mean? Not for PRODUCTION STORAGE VOLUMES.
2. “Is a background process that can be configured to run automatically, scheduled, or run manually through the command-line interface.”
Post-processing data de-duplication is a thing of the past. Most vendors who specialize in dedup are doing online dedup, why is this a concern? It means that the overhead required for NetApp to do de-duplication was so high they couldn’t keep it as a running process.
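To make the distinction concrete, here is a toy Python sketch of the two approaches. It is not how Data ONTAP or any other vendor actually implements dedup. The SHA-256 fingerprint, the 4 KB block size, and the names InlineStore and PostProcessStore are all my own assumptions, purely for illustration.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Content fingerprint of a block (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(block).hexdigest()


class InlineStore:
    """Hypothetical inline dedup: every write pays the fingerprint cost up front."""

    def __init__(self):
        self.blocks = {}    # fingerprint -> block data, stored once
        self.pointers = []  # logical blocks, recorded as fingerprints

    def write(self, block: bytes) -> None:
        fp = fingerprint(block)
        self.blocks.setdefault(fp, block)  # keep only the first copy
        self.pointers.append(fp)

    def physical_blocks(self) -> int:
        return len(self.blocks)


class PostProcessStore:
    """Hypothetical post-process dedup: writes land as-is, a later pass cleans up."""

    def __init__(self):
        self.physical = []  # physical blocks (None means freed)
        self.pointers = []  # logical block -> index into self.physical

    def write(self, block: bytes) -> None:
        # No fingerprinting in the write path.
        self.physical.append(block)
        self.pointers.append(len(self.physical) - 1)

    def dedup_pass(self) -> int:
        """Scan every block, repoint duplicates, return how many blocks were freed."""
        seen = {}  # fingerprint -> index of the copy we keep
        freed = 0
        for logical, idx in enumerate(self.pointers):
            fp = fingerprint(self.physical[idx])
            if fp in seen and seen[fp] != idx:
                self.pointers[logical] = seen[fp]  # redirect the pointer
                self.physical[idx] = None          # free the redundant block
                freed += 1
            else:
                seen.setdefault(fp, idx)
        return freed

    def physical_blocks(self) -> int:
        return sum(1 for b in self.physical if b is not None)


data = [b"a" * 4096, b"b" * 4096, b"a" * 4096]  # assumed 4 KB blocks

inline = InlineStore()
post = PostProcessStore()
for block in data:
    inline.write(block)
    post.write(block)

print(inline.physical_blocks())  # 2
print(post.physical_blocks())    # 3, duplicates are still on disk
print(post.dedup_pass())         # 1 block freed by the background pass
print(post.physical_blocks())    # 2
```

Both end up with the same footprint. The difference is when the work happens, and how much duplicate data sits on disk until the scheduled pass runs.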
3. “Is enabled on a per flexible volume basis.”
Ah, here is a serious limitation. It is PER VOLUME ONLY, which means that if you already have several volumes in an existing FAS environment, it will be much less effective. Also, NetApp’s architecture is built on clustered filer pairs, and most larger companies have several, if not dozens, of “pairs.” Each pair owns its own volumes, so you are stuck with much less de-duplication effectiveness.
So now we get into even better details. If you are running the average-size filers that NetApp continues to push into the Enterprise space, you are very limited in flexible volume size. Why is this a problem? Data de-duplication can only occur within a single volume. So your expectation of running your entire VMware environment on a single NFS share goes out the window, along with all of that supposed disk space savings. Oh, by the way, these numbers are valid as of OnTap 7.4.
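Here is a rough Python sketch of why the per-volume scope matters. It is a contrived model, not NetApp’s implementation: the volume names, the 4 KB block size, the SHA-256 fingerprint, and the idea of two identical guest OS images per volume are all assumptions I made up for the example.

```python
import hashlib


def unique_blocks(blocks):
    """Number of distinct blocks in an iterable, judged by content fingerprint."""
    return len({hashlib.sha256(b).hexdigest() for b in blocks})


def per_volume_footprint(volumes):
    """Blocks stored when duplicates are only removed inside each volume."""
    return sum(unique_blocks(blocks) for blocks in volumes.values())


def global_footprint(volumes):
    """Blocks stored if duplicates could be removed across all volumes."""
    return unique_blocks(b for blocks in volumes.values() for b in blocks)


# Hypothetical layout: two copies of the same 10-block guest OS image
# on each of three volumes, plus one block of unique data per volume.
os_image = [bytes([i]) * 4096 for i in range(10)]
volumes = {
    "vol_a": os_image * 2 + [b"A" * 4096],
    "vol_b": os_image * 2 + [b"B" * 4096],
    "vol_c": os_image * 2 + [b"C" * 4096],
}

logical = sum(len(blocks) for blocks in volumes.values())
print(logical)                        # 63 blocks written
print(per_volume_footprint(volumes))  # 33 blocks kept with per-volume dedup
print(global_footprint(volumes))      # 13 blocks kept if dedup spanned volumes
```

In this made-up layout the common OS image is stored once per volume instead of once overall, so the more volumes (or filer pairs) the same data is spread across, the smaller the savings.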
5. NetApp licensing:
Deduplication is included in Data ONTAP and just needs to be licensed. Add the deduplication license using the following command: license add <a_sis>
If you want to run deduplication on any of the FAS platforms you will also need to add the nearstore_option license: license add <nearstore_option>
Deduplication is a licensed option behind the NearStore option license. Hence, in a clustered environment, both nodes must have the NearStore option and deduplication licensed.
Well, that is interesting: A-SIS isn’t just a license on its own, it is a sub-license of NearStore. So watch out for those up-front licensing costs as well as the ongoing support costs.
6. Snapshot usage changes:
- Previous Snapshot copies will expire, and as they do some small savings will be realized, but they too will probably be pretty low.
- During this period of old Snapshot copies expiring, it is fair to assume new data is being created on the flexible volume and Snapshot copies being created.
- Thus the storage savings may stay rather flat (that is, very low).
I think this is one of the most powerful statements in the entire tech report. YOU WILL LOSE ALL SNAPSHOTS when running the de-duplication process. What does this mean for you? It means forget about data recovery in a production environment if you want A-SIS running as well.
7. Contradiction of CPU usage within the same white paper:
On Page 16:
Deduplication is tightly integrated with Data ONTAP and the WAFL file structure. Because of this, deduplication is performed with extreme efficiency. Complex hashing algorithms and look-up tables are not required. Instead, deduplication is able to leverage the internal characteristics of Data ONTAP to create and compare digital fingerprints, redirect data pointers, and free up redundant data areas, all with a minimal amount of performance impact.
On Page 18:
- If there is very little new data, run deduplication infrequently, because it doesn’t make sense to unnecessarily consume CPU resources. How often you run it will depend on the change rate of the data in the flexible volume.
- Stagger deduplication schedules for the flexible volumes so it runs on alternative days.
So which is it? Is it “extremely efficient”? Or is it such an overhead hog that it needs to be treated with mittens, carefully scheduled, and not overused?
So, this is just information dissected from a public NetApp document. There are plenty of other personal concerns I have over the A-SIS solution, but that wasn’t the point of this blog entry.
5 Comments »

September 20th, 2008 at 9:39 am
Your analysis of why we did post-processing is flawed.
You start off by saying everyone else does it differently so we must be wrong:
“Most vendors who specialize in dedup are doing online dedup, why is this a concern?”
and then assert that we couldn’t figure something out
“It means that the overhead required for NetApp to do de-duplication was so high they couldn’t keep it as a running process.”
You’re wrong on both counts.
Our approach to dedup was architected to enable deduplication on primary workloads that are latency sensitive. Such an architecture must move the compute-intensive dedup operation outside of the IO path and ensure that data, once deduped, is efficient for normal workloads.
And since our solution uniquely works with primary workloads, our architecture will be different.
For a more detailed discussion check out my blog posting:
http://blogs.netapp.com/extensible_netapp/2008/09/a-little-digres.html
September 20th, 2008 at 9:53 am
I wasn’t implying that the NetApp architecture is wrong. I was implying that “post-processing” is not the same as target-based processing. Most vendors that are utilizing target-based processing are still processing in real time, not as an off-line operation.
I want to be clear that my post wasn’t intended to beat up on A-SIS, just point out the limitations.
I agree with the NetApp position that de-duplication is a feature, as A-SIS is implemented as a feature, and that it doesn’t justify a “product”.
I guess my point is, A-SIS is not a silver bullet for everything, which is how it has been positioned by the NetApp VAR community.
September 21st, 2008 at 8:59 am
Hi!
Yes, FAS Dedup (formerly known as ASIS) is not the solution to all problems, but it does reduce storage costs in a way that no other technology does.
Not to be a nit-pick, but your other points are confusing as well.
What do you mean by Flexible Volume limitation?
And your comments about NFS and VMware are inscrutable. Why would anyone want to run all of their VMware environment on one NFS Share?
There are a bunch of things that are problematic with this post.
September 30th, 2008 at 4:01 pm
[...] in regards to my feeling that it is a feature and not a product, and more recently looking at NetApp’s implementation of A-SIS. I also mentioned the announcement of a newer blog DEDUPEMATTERS.com, which is run by Data [...]
October 1st, 2008 at 5:34 pm
For what it’s worth, ASIS is a free license now (just have to request it).
Some good info here from Scott Lowe as well.
http://blog.scottlowe.org/2008/03/31/quick-guide-to-setting-up-netapp-deduplication/
http://blog.scottlowe.org/2008/04/24/using-netapp-deduplication-with-block-storage/