Monthly Archive for September, 2009

Silent Errors

What is a silent error?

A silent error occurs when data is lost or corrupted but the failure goes unnoticed by the storage system. In many cases this is worse than recognized data loss, as the application reading the bad data can react in undefined ways, and the invalid data may propagate to a point where further errors are caused throughout higher layers of the system.

Another particularly insidious property of silent errors is that backup copies do nothing to address or prevent them. If the error were noticeable, the user or application could automatically fail over to a secondary valid copy. However, because the error was by definition not noticed (silent), the user or application would not realize it had received invalid data, and therefore the backup copy is never accessed.

Causes of silent errors

Silent errors arise for two main reasons. The first is due to the nature of RAID systems, while the second is related to the behavior of the underlying disks. RAID works by spreading reads and writes across a set of independent disks. However, because each disk spins independently, it is impossible to perfectly synchronize the writing of data to the physical media. As the eminent database researcher Jim Gray said, “Update in Place is a Poison Apple”, referring to the danger in assuming atomicity of disk writes. This is especially problematic for RAID systems because, in the event of an OS or disk controller crash, the different disks may end up in a mixed state, where some contain the newly written data while others contain the old data, and neither state is readable.

The other way silent errors can manifest is through a combination of Uncorrectable Read Errors (UREs) and a feature known as Time-Limited Error Recovery (TLER). UREs are latent errors which stem either from data that was improperly written, or from data which has become corrupted to the point where it is no longer readable. Normally, when a URE manifests, the disk can take a minute or longer attempting to re-read and/or re-map the sector. However, most RAID controllers will kick a drive from an array if it takes longer than 8 seconds to respond to a request. For this reason, Western Digital and other manufacturers have created a feature called Time-Limited Error Recovery, which forces disks to respond within 7 seconds so as not to be considered failed and ejected from the RAID array. When the disk is forced to respond in this manner, since it was unable to read the valid data and it must return something, it will return junk data, perhaps all zeros. The error may be recorded in the drive’s SMART data; however, this will be invisible to the application or user in most cases. Disk manufacturers report 1 URE for every 10 – 100 TB of data read, and so the risk is significant for any system that processes large amounts of data.
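To put the quoted URE rate in perspective, here is a rough back-of-the-envelope sketch. It assumes errors are independent and evenly spread over the data read (a simplification, modeled with a Poisson approximation); the function name and the 50 TB workload are illustrative choices, not figures from any manufacturer.

```python
import math


def prob_at_least_one_ure(tb_read: float, tb_per_ure: float) -> float:
    """Chance of hitting one or more UREs while reading `tb_read` terabytes.

    Assumes independent errors at an average rate of one per `tb_per_ure`
    terabytes, so the expected error count is tb_read / tb_per_ure.
    """
    expected_errors = tb_read / tb_per_ure
    return 1.0 - math.exp(-expected_errors)


# Compare both ends of the 10-100 TB range for a hypothetical 50 TB read.
for tb_per_ure in (10, 100):
    p = prob_at_least_one_ure(tb_read=50, tb_per_ure=tb_per_ure)
    print(f"1 URE per {tb_per_ure} TB, reading 50 TB: P(at least one URE) = {p:.1%}")
```

Even at the optimistic end of the range, a system that reads tens of terabytes should expect to encounter UREs regularly.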

Conventional coping methods

The most popular technique for mitigating silent errors is to use hardware RAID which is backed by Non-Volatile RAM (NVRAM). Since the memory is non-volatile, it retains its state across power cycles, and therefore can reissue queued writes once power is restored. There are two downsides to this approach. The first is cost: NVRAM requires specialized hardware, which can add hundreds of dollars per server. The second is that it only protects against one specific route to inconsistent writes: power loss. It provides no protection in the event the RAID card itself fails, or in the case of an uncorrectable read error.

Sun created a variant of RAID called RAID-Z to address the problems that arise due to the non-atomic nature of other RAID systems. Instead of overwriting data, RAID-Z writes a new copy to an unused area on all disks. Once the write is complete across all disks, a pointer to this new data is updated. In this way writes become atomic (either all succeed or none succeed) across the array. RAID-Z also incorporates integrity checking to detect when disks return bad data. While these features are a step in the right direction, they break down in the event of a disk failure. When one disk is down, new data cannot be written, and if a URE were to occur there would be data loss.

Cleversafe’s solution

Cleversafe, through a combination of several unique strategies, is able to completely eliminate the issue of silent errors. These strategies include:

  • Transactional write semantics (atomic commit/rollback)
  • Write thresholds
  • Revisions
  • Up to a 512-bit data source integrity check value
  • Use of CRC-32 checksums on storage servers

Cleversafe’s storage system is unique in that it must coordinate writes not only across different disks, but across disks which are in different machines, stored at different sites, perhaps across the globe. This required special consideration for ensuring data is written reliably, yet without the strict limitation RAID-Z has that data must be written to every disk. Had we adopted this limitation, the outage of a single slice server would have made the system unusable.

To address the dual requirements of writing data reliably and maintaining high availability, we instituted the concept of a write threshold. For a 16/10 dsNet, a write threshold of 14 would mean that the data must be written to 14 of the 16 slice servers to be considered successful. Therefore in this case, even if two slice servers are down, data can continue to be written, and the reliability is still maintained: since only 10 of the 14 written slices are needed, up to 4 simultaneous disk failures could occur and the data would remain recoverable. Whether the client commits or rolls back the transaction depends on whether it is able to successfully write to at least a write-threshold number of slice servers.
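The commit-or-rollback decision described above can be sketched as follows. This is a minimal in-memory illustration, not Cleversafe’s implementation: the `SliceServer` class, its `stage`/`commit`/`rollback` methods, and `write_with_threshold` are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SliceServer:
    """Hypothetical in-memory stand-in for a slice server."""
    online: bool = True
    staged: dict = field(default_factory=dict)     # written but not yet committed
    committed: dict = field(default_factory=dict)  # visible to readers

    def stage(self, name, slice_data):
        """Accept a slice tentatively; an offline server accepts nothing."""
        if not self.online:
            return False
        self.staged[name] = slice_data
        return True

    def commit(self, name):
        if name in self.staged:
            self.committed[name] = self.staged.pop(name)

    def rollback(self, name):
        self.staged.pop(name, None)


def write_with_threshold(servers, name, slices, write_threshold):
    """Stage one slice per server, then commit only if enough servers accepted."""
    accepted = [srv for srv, data in zip(servers, slices) if srv.stage(name, data)]
    if len(accepted) >= write_threshold:
        for srv in accepted:
            srv.commit(name)
        return True
    for srv in accepted:
        srv.rollback(name)
    return False


# A 16-wide system with two servers down still meets a write threshold of 14.
servers = [SliceServer() for _ in range(16)]
servers[0].online = False
servers[1].online = False
slices = [f"slice-{i}".encode() for i in range(16)]
print(write_with_threshold(servers, "example-object", slices, write_threshold=14))
```

If a third server were offline, only 13 servers would accept the write, so the staged slices would be rolled back and nothing would be committed.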

One might wonder: What about the old data that is left over on slice servers which were off-line during the write? To ensure consistency, all data written together is assigned a unique revision number which is stored alongside the data. Therefore when reading data back, the client can differentiate old data from new, and exclude old data when reassembling the data source.
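A minimal sketch of how a reader might use revision numbers to discard stale slices. It assumes, beyond what is stated above, that revision numbers increase with each write and that a reader picks the newest revision for which it holds enough slices; the function name and tuple layout are hypothetical.

```python
def select_current_slices(responses, read_threshold):
    """Pick the slices belonging to the newest usable revision.

    `responses` is a list of (revision, slice_index, data) tuples, one per
    slice server that answered. Slices from older revisions are ignored.
    """
    by_revision = {}
    for revision, index, data in responses:
        by_revision.setdefault(revision, {})[index] = data

    # Assumption: a higher revision number means a more recent write.
    for revision in sorted(by_revision, reverse=True):
        slices = by_revision[revision]
        if len(slices) >= read_threshold:
            return revision, slices
    raise LookupError("no revision has enough slices to rebuild the data source")


# Fourteen servers hold revision 2; two servers that were offline still hold revision 1.
responses = [(2, i, b"new") for i in range(14)] + [(1, 14, b"old"), (1, 15, b"old")]
revision, slices = select_current_slices(responses, read_threshold=10)
print(revision, len(slices))
```

The two leftover revision-1 slices never reach the reassembly step, so they cannot be mixed into the rebuilt data source.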

Lastly, to address the issue of UREs, Cleversafe slice servers compute and store integrity check values for each slice they keep. The integrity values are periodically checked for correctness by a background process, and additionally, the slice server will check the integrity of any requested slice prior to returning it to the client. If the slice is found to be invalid, the server will respond as if it does not have the slice, thereby preventing the corruption from propagating to a higher level. As a last line of defense, a data-source-level integrity check value is computed and compared by the client after it has reassembled a data source. This integrity check value can be up to 512 bits in length, providing an immensely high probability that should any corruption have occurred, it will not be silent, and bad data will never reach the application or end user.
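The two layers of checking can be illustrated with Python’s standard library. This is a sketch under stated assumptions: `SliceStore` and `verify_data_source` are hypothetical names, and SHA-512 is used only as an example of a 512-bit check value, since the text above does not name the algorithm used at the data source level.

```python
import hashlib
import zlib


class SliceStore:
    """Hypothetical slice server that keeps a CRC-32 next to each slice."""

    def __init__(self):
        self._slices = {}  # name -> (mutable slice bytes, stored CRC-32)

    def put(self, name, data: bytes):
        self._slices[name] = (bytearray(data), zlib.crc32(data))

    def get(self, name):
        """Return the slice, or None if it is missing or fails its CRC check."""
        entry = self._slices.get(name)
        if entry is None:
            return None
        data, stored_crc = entry
        if zlib.crc32(data) != stored_crc:
            return None  # respond as if the slice were not present
        return bytes(data)

    def corrupt_first_byte(self, name):
        """Helper for the example only: simulate a disk returning bad data."""
        self._slices[name][0][0] ^= 0xFF


def verify_data_source(data: bytes, expected_digest: bytes) -> bool:
    """Client-side check after reassembly (SHA-512 is an assumed choice)."""
    return hashlib.sha512(data).digest() == expected_digest


store = SliceStore()
store.put("slice-0", b"some slice payload")
store.corrupt_first_byte("slice-0")
print(store.get("slice-0"))  # the corrupted slice is withheld from the client
```

Because the server withholds a slice that fails its check, the client simply treats it like any other unavailable slice and rebuilds from the remaining ones.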

Storage efficiency

Storage efficiency drivers
In times of economic growth, companies focus less on saving money than on making money, so IT organizations concentrate on supporting the business by getting applications and storage online to help bring in revenue. When the economy tightens revenues, enterprises naturally have to look internally to tighten the belt, which explains why we are seeing a focus on making IT more efficient.

Related to storage, two trends in the last five years have increased storage for enterprises. These trends are disaster recovery, and unstructured content.

Legislation such as Sarbanes-Oxley and HIPAA, and events such as 9/11 and Hurricane Katrina, forced enterprises to become more diligent about protecting their digital assets. Protection comes at a cost, and that cost is an increase in storage, since typical solutions rely on replication technology to back up data offsite, resulting in multiple copies of data.

Further exacerbating storage growth is the increase in unstructured content – videos, images, audio files, PowerPoint presentations, etc. These types of files have changed the game of storage for enterprises, and are straining the storage systems and technologies many enterprises have today.

Coupling the increase in unstructured content with replication has resulted in expensive storage systems. In examining their storage systems, enterprises are realizing that the cost to operate and manage the storage exceeds the original hardware purchase price.

Considerations for Enterprises
At a business level, IT organizations need to work with their business units to understand and rate the importance of data. By rating data on its importance to daily operations, its strategic importance, and other characteristics, enterprises can set better policies for data retention. The truth is that documents and data are what drive the business, so enterprises are not going to go back to their end users and say “store less”. But they can ask what is important, and then start treating data differently based on the requirements they gather.

Enterprises also benefit from a tiered storage approach, meaning primary storage, nearline storage, and archive storage. Each tier is progressively less costly for storage and management. Primary storage is for highly transactional, frequently accessed data. Nearline storage is for less frequently accessed data that doesn’t need lightning-fast retrieval, but is important to have accessible across the enterprise. Archive storage, which is typically a tape library, is for data that isn’t expected to be accessed, and for which much longer access times (think hours or days instead of seconds) are acceptable.

With a tiered storage approach, enterprises can examine what they are storing on each tier, and look for low hanging fruit of data that can be moved down a tier. Enterprises can also consider what policies they should put in place to automate moving of data between tiers.

Approaches to cut the amount of storage
Most enterprises are working several angles to get a better handle on their storage and storage management.

One approach is to implement virtualization. Historically, enterprise storage has been directly mapped to applications, resulting in inefficiencies because each of the independent applications isn’t necessarily fully utilizing the related storage. Virtualization basically operates above the storage appliances to abstract the location of the data from the physical hardware. This leads to better efficiencies and utilization of storage.

Another technique is to implement a Hierarchical Storage Management (HSM) approach. HSM typically uses policies based on how frequently files are used, and automatically moves data between high-performing primary storage, nearline disk storage, and tape libraries. Enterprises also need to revisit any “store everything forever” decisions, and change policies to examine the importance of data. For data that will not be leveraged, the policy should include deleting it once the enterprise is no longer legally required to store it.
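As an illustration of the kind of policy HSM applies, the sketch below picks a tier from the time since a file was last accessed. The thresholds (30 days and 365 days) and the function name are hypothetical; real policies would be set from the business rating of the data.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds chosen only for this example.
NEARLINE_AFTER = timedelta(days=30)
ARCHIVE_AFTER = timedelta(days=365)


def choose_tier(last_accessed: datetime, now: datetime) -> str:
    """Return the tier a file belongs on, based on how long it has been idle."""
    idle = now - last_accessed
    if idle >= ARCHIVE_AFTER:
        return "archive"
    if idle >= NEARLINE_AFTER:
        return "nearline"
    return "primary"


now = datetime(2009, 9, 30)
for last_accessed in (datetime(2009, 9, 20), datetime(2009, 7, 1), datetime(2008, 1, 1)):
    print(last_accessed.date(), "->", choose_tier(last_accessed, now))
```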

Another area companies are looking at is their approach to data protection using RAID and replication. Replication does work, but when coupled with RAID, can result in a 3-5 times increase of the raw capacity necessary. RAID 5 increases the raw storage required by 25% (with a 4+1 configuration), to 125% of usable capacity; if that is copied offsite once using replication, the result is 250% raw storage required. Some organizations are making two offsite copies and are up at 375% raw storage required.

Enterprises are looking for more efficient methods, such as dispersal, to lower the ratio of raw to usable storage. Dispersal results in 160% raw storage with a 16/10 configuration, meaning 16 slices are created and 10 are required for bit-perfect recreation of the data. Dispersal provides superior data protection since it can tolerate multiple site failures, drive failures, and network outages that replication and RAID cannot.
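The percentages quoted for RAID with replication and for dispersal follow from simple ratios of raw to usable capacity. The short sketch below reproduces them; the function names are made up for the example.

```python
def raid_replicated_overhead(data_disks: int, parity_disks: int, copies: int) -> float:
    """Raw capacity per unit of usable capacity for RAID kept in `copies` locations."""
    return (data_disks + parity_disks) / data_disks * copies


def dispersal_overhead(width: int, threshold: int) -> float:
    """Raw capacity per unit of usable capacity when `threshold` of `width` slices rebuild the data."""
    return width / threshold


print(f"RAID 5 (4+1), one offsite copy:   {raid_replicated_overhead(4, 1, copies=2):.0%}")
print(f"RAID 5 (4+1), two offsite copies: {raid_replicated_overhead(4, 1, copies=3):.0%}")
print(f"Dispersal 16/10:                  {dispersal_overhead(16, 10):.0%}")
```

Note that one offsite copy means two copies in total (the original plus the replica), which is why `copies=2` yields the 250% figure.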

Another technology geared at making storage more efficient is deduplication. Deduplication focuses on eliminating redundant data by finding duplicates, only storing one copy, and storing references to the one copy in all other instances.
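The idea can be sketched with a content hash used as the key for the single stored copy. This is a simplified whole-file illustration; the `DedupStore` class is hypothetical, and SHA-256 is just one reasonable choice of hash, not something the text above specifies.

```python
import hashlib


class DedupStore:
    """Hypothetical store that keeps one copy of identical content."""

    def __init__(self):
        self._blocks = {}  # content digest -> the single stored copy
        self._refs = {}    # file name -> content digest (the reference)

    def put(self, name: str, content: bytes):
        digest = hashlib.sha256(content).hexdigest()
        self._blocks.setdefault(digest, content)  # store only if not seen before
        self._refs[name] = digest

    def get(self, name: str) -> bytes:
        return self._blocks[self._refs[name]]

    def stored_bytes(self) -> int:
        return sum(len(block) for block in self._blocks.values())


store = DedupStore()
attachment = b"quarterly-report" * 1000
store.put("alice/report.ppt", attachment)
store.put("bob/report-copy.ppt", attachment)  # duplicate content, new reference only
print(store.stored_bytes() == len(attachment))
```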

Looking at storage from the operational cost perspective, enterprises look at data center consolidation or outsourcing as another method to reduce cost. Also, as disk drives become more efficient, replacing existing storage across each tier can reduce costs. Many storage products today utilize 1-terabyte drives and are much more power efficient than products available even 3 years ago. Replacing older equipment can reduce power consumption and floor space.

Most enterprises realize that no single one of these techniques is a silver bullet, and typically use them together for maximum efficiency. So, an example storage system may use virtualization for efficiency in application-based storage, HSM to manage movement of data across storage tiers, deduplication to reduce the volume of data, and dispersal for data protection and for moving the data offsite.

Cleversafe announces dsNet™ Object Store

Today, Cleversafe announced the dsNet™ Object Store. The dsNet Object Store is targeted towards large content stores for digital content such as video, images and audio files.

What is object storage?
Object storage splits the metadata from the content and stores the two on different servers. Typically the metadata is stored by an application that is optimized for the digital content type. The application handles creating, accessing, and deleting objects from the dsNet Object Store, which utilizes Dispersal to store data with increased storage efficiency, security, and data protection.

Why is object storage important?
Object storage can scale larger than traditional block or file based storage systems. An object store is independent of a computer’s operating and file systems, so it bypasses their limitations in terms of scale. Clearly the associated application needs to know how to handle the object, but it doesn’t need to use the file system for searching, sorting, or accessing objects.

When the majority of storage needs were for structured data (typically text-based), the information found in the file system’s metadata (file name, creation date, folder location, etc.) was sufficient to efficiently locate information. Today, however, file systems are proving less relevant for digital content storage, since the metadata for objects is typically housed in specialized applications rather than in file systems.

A simple example is managing photos from a digital camera. Users will not use the file system to find photos, since they are stored under computer-generated names from the camera, and the photo application allows them to see, sort and tag photos much more effectively. Further, users aren’t sorting digital content into folders within the file system; instead they are using applications to make collections and relationships between content. So the actual storage of the objects can be a single flat container instead of requiring the tree structure of a file system.

What are the interfaces and use cases?
The dsNet Object Store has several interfaces and access protocols. The underlying dsNet storage pool can be shared and jointly accessible by multiple access protocols, giving additional flexibility.

For use cases where an application is housing metadata, and only object storage is required, the Simple Object interface can be accessed with either a Java SDK, or an HTTP/REST API. In this use case, simple PUT, GET, DELETE commands are used to access Objects, and the resulting Object ID can be stored directly within the application.
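The division of labor in this use case can be sketched as follows. This is a purely illustrative in-memory stand-in, not the dsNet Java SDK or HTTP/REST API: the `SimpleObjectStore` class, its method names, and the use of a random UUID as the Object ID are all assumptions made for the example.

```python
import uuid


class SimpleObjectStore:
    """Hypothetical flat object container with PUT, GET and DELETE operations."""

    def __init__(self):
        self._objects = {}  # object ID -> content; no folders, no file names

    def put(self, content: bytes) -> str:
        object_id = uuid.uuid4().hex  # assumed ID scheme, for illustration only
        self._objects[object_id] = content
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]

    def delete(self, object_id: str) -> None:
        del self._objects[object_id]


# The application keeps the metadata and stores only the returned Object ID.
store = SimpleObjectStore()
photo_catalog = {}  # application-side metadata, keyed by photo title

object_id = store.put(b"...jpeg bytes...")
photo_catalog["Beach sunset"] = {"tags": ["vacation", "2009"], "object_id": object_id}

content = store.get(photo_catalog["Beach sunset"]["object_id"])
store.delete(object_id)
```

The store never sees titles or tags; everything needed to find an object lives in the application, which is what allows the container itself to stay flat.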

For use cases where a file system is desired to store the metadata (rather than the application), or where integration is not desired, a file system interface on top of the Object Store can be accessed via dsFTP, a software client which can read and write files from and to the dsNet Slicestor appliances without needing an Accesser appliance. dsFTP provides an interface similar to other command line FTP programs, with both argument-based and interactive modes.

Cleversafe still offers Block Storage using standard interfaces such as iSCSI and NFS for use cases where standard storage interfaces are required.

What storage appliances does the dsNet Object Store run on?
The dsNet Object Store and its associated storage vaults can be built and managed using Cleversafe’s existing products. Slicestor® appliances, which act as storage nodes in a dsNet system, are configured as either block or object devices. (Existing block Slicestor appliances can be redeployed as object devices.)

Once the Slicestor appliances are configured, they are available to be provisioned within vaults (like a LUN). Vaults can be configured as either block vaults or object vaults, and leverage the associated Slicestor appliances. Cleversafe’s Omnience™ Storage Management can manage both block and object vaults under the same unified management system.

When directly integrated into applications, the Object Store can be accessed from the software client rather than through the Accesser appliance. This provides greater scalability in performance, since there can be many simultaneous readers and writers that do not depend on a gateway appliance.

Conclusion
Cleversafe’s unique combination of object storage and dispersal allows for both unlimited scale, which is crucial as unstructured content continues to expand, and superior data protection. It’s a viable method for building large-scale systems in the multi-terabyte to petabyte range.