Data Compliance: Guilty Until Proven Tamperproof

Feb 22, 2008 (07:02 PM EST)


How certain are you that the electronic data your team retrieves in response to discovery requests is complete and unaltered? Recent rulings have framed electronic records as on par with audio recordings and digital photos in terms of reliability, as judges recognize that a clever cheat could modify an e-mail to remove a critical "not" before submitting it into evidence. IT groups that have yet to implement systems that store data in nonmodifiable form are behind the curve.

InformationWeek Reports

Long-term data-retention mandates are a minefield as well. Organizations covered by OSHA regs must keep physical exam records for 30 years after an employee's termination, while HIPAA requires that medical facilities retain records for 20 years or more. Just keeping copies of end-of-month or end-of-year backup tapes doesn't cut it. Even if the tape hasn't degraded, it's unlikely you'll have a drive that can read it.

Storage vendors such as Caringo, EMC, Hitachi Data Systems, Permabit Technology, and Nexsan Technologies offer a variety of technologies to store fixed content data. These systems aren't cheap, but neither is litigation. And, as the space expands, IT will have more to choose from. We asked vendors about the latest in tamperproof content-addressable storage (CAS) and locked NAS gear, as well as services for those who don't want to maintain their own archives.

As for a business driver, if you can empower counsel to say, "This message was intercepted before the user had access to it by our e-mail archiving system, which saved it to a nonmodifiable archive at 4:02:03 p.m. on 13 February," you're a rock star.

"This e-mail sat for nine months in the user's in-box, where he could have changed it at any time," not so much.


Highly regulated industries like securities brokers have long maintained records in a nonrewriteable and nonerasable format, called write once, read many, or WORM.

Optical WORM disks should provide dependable data storage for 30 years or more. We advise most organizations to go the WORM route for fixed-content archives.

Besides WORM storage, you'll also need e-mail and file archiving apps to identify which data should be saved. Of course, that's easier said than done, especially in e-mail. Vendors like EMC, Symantec, and Zantaz can help separate ham from spam, but expect to store some grocery lists. Other applications, like medical and check imaging, write data directly to the fixed content store.

Plasmon's ultradensity optical WORM disks, each with a capacity of up to 60 GB, are state of the art for organizations seeking long media life. Like all WORM disks, they need WORM-aware archiving software to write to them. Plasmon's current archiving system, the Enterprise Active Archive, uses a server running Nexsan's Assureon CAS software as a front end. Data is typically written to a RAID array when initially stored, then migrated once long-term storage becomes more important than access time.

All popular tape formats, from LTO in the midrange to Sun Microsystems' T10000 at the high end, have firmware in the drive that identifies special WORM cartridges, and once data is written to them, prevents overwriting or erasure. With capacities of 800 GB per cartridge, WORM tape, especially if used behind a RAID cache, is the lowest-cost, and greenest, solution for very large archives where IT can deal with file access times measured in minutes. RAID, or even MAID, uses power when not being accessed. Optical disks take lots of floor space. High density and no need for power when not being accessed make tape the new green.


Rather than use a file's name and location in a hierarchy of directories as its primary identifier, CAS systems generate a globally unique identifier, or GUID, for each file as it's saved, using a hash function like MD5 or SHA-1. The file is stored based on that GUID. If the CAS device provides a CIFS or NFS interface--and most do--it does a database lookup to find the GUID for the full file path, then retrieves the file. One advantage here is that CAS systems automatically provide single-instance storage. When a saved file has exactly the same contents as a file already in the system, the new file will generate the same hash value. Because the hash value GUID is the primary key for storage, the system won't save two files with the same GUID; rather, it notes that one file has been referenced in the system multiple times. Single-instance storage slashes space requirements.
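To make the idea concrete, here's a minimal Python sketch of a content-addressed store. It's a toy that lives in memory, not any vendor's implementation: the class and method names are our own invention, and we've used SHA-256 rather than the MD5 or SHA-1 mentioned above. The hash of the contents is the storage key, a path-to-hash table stands in for the database lookup a CIFS or NFS front end would do, and a second file with identical contents simply bumps a reference count.

```python
import hashlib


class ContentAddressedStore:
    """Toy in-memory CAS: objects are keyed by the hash of their contents."""

    def __init__(self):
        self._objects = {}   # digest -> bytes, one copy per unique content
        self._paths = {}     # file path -> digest (the NAS-style lookup table)
        self._refcount = {}  # digest -> number of paths that reference it

    def put(self, path, data):
        """Archive data under path and return its content hash."""
        if path in self._paths:
            # Fixed content is write-once: an archived path is never overwritten.
            raise PermissionError(f"{path} is already archived")
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self._objects:
            self._objects[digest] = data  # first time this content is seen
        self._refcount[digest] = self._refcount.get(digest, 0) + 1
        self._paths[path] = digest
        return digest

    def get(self, path):
        """Look up the hash for a path, then fetch the object by hash."""
        return self._objects[self._paths[path]]

    def unique_objects(self):
        return len(self._objects)

    def references(self, digest):
        return self._refcount.get(digest, 0)


store = ContentAddressedStore()
first = store.put("/legal/memo.txt", b"Do not shred the files.")
second = store.put("/hr/copy-of-memo.txt", b"Do not shred the files.")
```

After the two calls, `first` and `second` are the same hash, the store holds one object, and that object carries two references. That's single-instance storage falling out of the addressing scheme rather than being bolted on.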

Just as with hash-based data deduplication, some CIOs have expressed concerns about hash collisions resulting in two different files being sent to their CAS systems, but only one being saved. The odds against this are astronomical--1 in 10 to the 25th for even the most basic hash functions--but steps vendors are taking to ease our minds range from using hash functions that are much more resistant to collisions, like SHA-512, to employing byte-by-byte comparisons of files that generate the same hash values before declaring them identical.
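The byte-by-byte safeguard is easy to picture in code. The Python sketch below is ours, with hypothetical names, and it deliberately uses a laughably weak one-byte hash so that collisions are guaranteed to show up in a demo; a real system would use a full-strength digest. The point is the comparison step: a matching hash alone never causes a file to be treated as a duplicate.

```python
import hashlib


def weak_hash(data):
    """Demo-only hash: the first byte of SHA-256, so collisions are common."""
    return hashlib.sha256(data).hexdigest()[:2]


class VerifyingStore:
    """Toy store that confirms a hash match with a full content comparison."""

    def __init__(self, hash_fn=weak_hash):
        self._hash_fn = hash_fn
        self._buckets = {}  # digest -> list of distinct contents with that digest

    def put(self, data):
        """Store data and return a (digest, index) key for retrieving it."""
        digest = self._hash_fn(data)
        bucket = self._buckets.setdefault(digest, [])
        for index, existing in enumerate(bucket):
            if existing == data:      # byte-by-byte comparison
                return digest, index  # a true duplicate: keep one copy
        bucket.append(data)           # same hash, different bytes: keep both
        return digest, len(bucket) - 1

    def get(self, key):
        digest, index = key
        return self._buckets[digest][index]
```

With only 256 possible hash values, feeding this store 300 different messages must produce collisions, yet every message comes back intact because colliding contents are kept side by side instead of being merged.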

Real-world CAS implementations add the ability to store user metadata along with each object and provide a mechanism for enforcing data retention, preventing anyone, including the system administrator, from deleting files until their retention periods expire.
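Retention enforcement boils down to a check that nobody, admin included, gets to skip. Here's a rough Python sketch under our own assumptions: the class, the exception, and the `now` parameter (there only so the clock can be faked in a demo) are all hypothetical, and a real product enforces this below the level any administrator can reach, not in application code.

```python
from datetime import datetime, timedelta, timezone


class RetentionError(Exception):
    """Raised when an operation would violate an object's retention period."""


class RetentionStore:
    """Toy store that keeps user metadata and refuses early deletes."""

    def __init__(self):
        self._objects = {}  # object id -> (data, metadata, retain_until)

    def put(self, object_id, data, retain_for, metadata=None, now=None):
        if object_id in self._objects:
            raise RetentionError(f"{object_id} already exists and cannot be modified")
        now = now or datetime.now(timezone.utc)
        self._objects[object_id] = (data, dict(metadata or {}), now + retain_for)

    def metadata(self, object_id):
        return dict(self._objects[object_id][1])

    def delete(self, object_id, now=None):
        """Delete an object, but only after its retention period has expired."""
        now = now or datetime.now(timezone.utc)
        retain_until = self._objects[object_id][2]
        if now < retain_until:
            # Note there is no "force" flag and no administrator override.
            raise RetentionError(
                f"{object_id} is under retention until {retain_until.isoformat()}"
            )
        del self._objects[object_id]
```

Store an exam record with `retain_for=timedelta(days=365 * 30)` and any `delete` call in the next three decades raises `RetentionError`; the same call succeeds once the clock passes the expiry date.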

EMC's Centera was the first commercially available CAS system and remains the market-share leader. The Centera redundant array of independent nodes (RAIN) architecture uses access nodes, through which applications store and retrieve files, and storage nodes that include disks and additional processing power. Centera protects data by storing a copy of each object either on two storage nodes or in an object-based parity scheme, rather than relying on conventional RAID controllers. Centera clusters can also replicate data over IP networks.

Hitachi Data Systems' Content Archive Platform, a product of Hitachi's acquisition of Archivas last year, takes a different approach to CAS, using a file's location as its primary identifier and generating hash tokens after data is stored. CAP uses three or more diskless front-end nodes to store files on attached Fibre Channel arrays, which also can be used for other data. Organizations may add back-end storage or front-end compute nodes to boost capacity and/or speed. Rather than rely on custom APIs, data can be written to or retrieved from CAP using HTTP, NFS, CIFS, and WebDAV. Archive applications can specify retention times, the number of copies of data to store, and other metadata by writing simple text and/or XML files for each folder.

Because Hitachi runs single-instance storage, indexing and data integrity checking as background tasks, data ingestion rates aren't controlled by how fast the system can hash and index. Data is encrypted at rest on archive disks, in flight across the SAN, and when being replicated to another CAP cluster at a remote site. CAP directly supports Network Data Management Protocol (NDMP) to back up archives to tape in addition to having multiple replicas.

Permabit's CAS system, built from a RAIN of 1U servers in access and storage node configurations, adds data deduplication, full text indexing from Fast Search & Transfer on a dedicated node, and a flexible NAS interface that can automatically retain and track multiple versions of files as they're saved. Problem is, with just 1 TB of usable storage on each node, a large archive could devour a lot of rack space and power. Microsoft's purchase of Fast shouldn't affect Fast's many OEM deals--at least not right away.

Nexsan's Assureon lets IT easily add RAID arrays for simple storage or nodes with compute and storage capabilities. Assureon also includes data dedupe and MAID technologies to reduce the amount of storage needed and power consumption. Assureon can act as a RAID cache in front of optical disk or WORM tape libraries and includes a Windows file system watcher that will automatically copy files from any Windows file store when they're closed or reach an age that implies they're complete; if the system guesses wrong, you'll archive several drafts.

Finally, Caringo's CAStor software turns standard Intel-based PC servers into a CAS cluster. Unlike EMC Centera, CAStor uses HTTP rather than a proprietary API as its primary interface, with CIFS/NFS access available as an add-on. CAStor has the basic set of CAS features most organizations are looking for, including local and wide area replication, data retention, and replication depth definable at the object level. Still, while the idea of building a CAS cluster from standard servers and disk has a certain appeal, we don't think most enterprises will be comfortable rolling their own CAS systems.


For all its sexiness, CAS is a complicated solution to the problem of preventing users and admins from deleting or modifying files. Several vendors, including Network Appliance through its optional SnapLock for filers running OnTap and Sun's StorageTek division through its StorEdge Compliance Archiving software, have added software-managed WORM to their NAS appliances. Organizations can use the same NAS architectures, even the same appliances, as their primary file stores and still have a WORM archive. One system for backup, replication, and management saves money and complexity.

Locked NAS is also easy on your developers. Rather than having to integrate a new XML-based API, they can simply write to the locked NAS via CIFS or NFS. Data-retention periods can be defined on a folder-by-folder, or even a file-by-file, basis by setting the "file last accessed" time attribute to the end of the retention period and then flagging the file as read only.
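From the developer's chair, committing a file looks like two ordinary file system calls. The Python sketch below is ours and the function names are hypothetical; it shows only the client side of the convention. On an ordinary disk these calls just change a timestamp and some permission bits, which anyone with the right privileges can undo. It's the locked NAS appliance that treats them as a commitment and enforces it.

```python
import os
import stat
import time


def commit_to_worm(path, retain_seconds):
    """Set the last-accessed time to the retention expiry, then mark read-only.

    Returns the expiry as a Unix timestamp. Only a WORM-capable filer will
    actually enforce the retention period this encodes.
    """
    info = os.stat(path)
    retain_until = int(time.time()) + retain_seconds
    # os.utime takes (atime, mtime); the modification time is left as it was.
    os.utime(path, (retain_until, info.st_mtime))
    # Clear every write bit so the file is flagged read-only.
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, stat.S_IMODE(info.st_mode) & ~write_bits)
    return retain_until


def retention_expiry(path):
    """Read back the retention expiry stored in the last-accessed attribute."""
    return int(os.stat(path).st_atime)
```

Because the whole protocol rides on attributes every file system already has, an archiving app needs no vendor SDK: write the file over CIFS or NFS, call something like `commit_to_worm(path, 365 * 24 * 3600)`, and move on.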

Now that Network Appliance has rolled out its proprietary Advanced Single Instance Storage (A-SIS) subfile data deduplication technology, a NetApp filer running SnapLock can one-up the CAS vendors' single-instance storage, eliminating not just duplicate files but also duplicate data within files, ensuring that those five corporate positioning slides that appear in almost every PowerPoint presentation will be stored only once.
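Here's the difference between file-level and subfile dedupe in a few lines of Python. This is an illustrative sketch, not a description of how A-SIS works internally: the class name, the 4-KB block size, and the use of SHA-256 are all our assumptions. Each file is cut into fixed-size blocks, each block is stored once under its hash, and a file becomes an ordered list of block hashes.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, chosen only for illustration


class BlockDedupStore:
    """Toy subfile dedupe: identical blocks are stored once across all files."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self._blocks = {}  # block digest -> block bytes
        self._files = {}   # file name -> ordered list of block digests

    def put(self, name, data):
        digests = []
        for offset in range(0, len(data), self.block_size):
            block = data[offset:offset + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self._blocks.setdefault(digest, block)  # store only unseen blocks
            digests.append(digest)
        self._files[name] = digests

    def get(self, name):
        """Reassemble a file from its list of block digests."""
        return b"".join(self._blocks[d] for d in self._files[name])

    def stored_bytes(self):
        return sum(len(block) for block in self._blocks.values())


# Two "decks" that share five blocks of boilerplate and differ in the sixth.
boilerplate = b"".join(bytes([i]) * BLOCK_SIZE for i in range(5))
dedup = BlockDedupStore()
dedup.put("sales.ppt", boilerplate + b"A" * BLOCK_SIZE)
dedup.put("finance.ppt", boilerplate + b"B" * BLOCK_SIZE)
```

The two decks total 12 blocks, but the store holds only seven: the five shared ones plus one unique block apiece. A file-level scheme would see two different hashes and keep both decks in full. One caveat with fixed-size blocks is that shared data is caught only when it lands on the same block boundaries in each file.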

Compared with CAS, locked NAS does lack a mechanism for storing metadata about objects. How big a problem that is depends on how good your archiving software is. CAS systems provide an XML interface for storing file metadata, but organizations selecting locked NAS as their compliance stores will need to look to their archiving software or enterprise content management systems as a metadata store.

