What is a silent error?
A silent error occurs when data is lost or corrupted but the failure goes unnoticed by the storage system. In many cases this is worse than recognized data loss, as the application reading the bad data can react in undefined ways, and the invalid data may propagate to a point where further errors are caused throughout higher layers of the system.
Another particularly insidious property of silent errors is that backup copies do nothing to prevent or repair them. If the error were noticeable, the user or application could automatically fail over to a secondary valid copy. However, because the error is by definition not noticed (silent), the user or application does not realize it received invalid data, and therefore the backup copy is never accessed.
Causes of silent errors
Silent errors arise for two main reasons. The first stems from the nature of RAID systems, while the second relates to the behavior of the underlying disks. RAID works by spreading reads and writes across a set of independent disks. Because each disk spins and seeks on its own, it is impossible to perfectly synchronize the writing of data to the physical media. As the eminent database researcher Jim Gray said, “Update in Place is a Poison Apple”, referring to the danger of assuming that disk writes are atomic. This is especially problematic for RAID systems because, in the event of an OS or disk controller crash, the disks may end up in a mixed state, where some contain the newly written data while others contain the old data, and neither the old nor the new version can be read back in full.
The other way silent errors can arise is from a combination of Uncorrectable Read Errors (UREs) and a feature known as Time-Limited Error Recovery (TLER). UREs are latent errors that stem either from data that was improperly written or from data that has degraded to the point where it is no longer readable. Normally, when a URE occurs, the disk can spend a minute or longer attempting to re-read and/or remap the sector. However, most RAID controllers will kick a drive from an array if it takes longer than 8 seconds to respond to a request. For this reason, Western Digital and other manufacturers have created a feature called Time-Limited Error Recovery, which forces disks to respond within 7 seconds so as not to be considered failed and ejected from the RAID array. When the disk is forced to respond in this manner, it was unable to read the valid data yet must return something, so it will return junk data, perhaps all zeros. The error may be recorded in the drive’s SMART log; however, in most cases this will be invisible to the application or user. Disk manufacturers report 1 URE for every 10 – 100 TB of data read, so the risk is significant for any system that processes large amounts of data.
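To put the quoted rate in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes UREs occur independently at a constant rate (a Poisson approximation, which is our simplification and not a manufacturer specification), and the function names are purely illustrative.

```python
import math


def expected_ures(tb_read: float, tb_per_ure: float) -> float:
    """Expected number of UREs when reading tb_read terabytes."""
    return tb_read / tb_per_ure


def prob_at_least_one_ure(tb_read: float, tb_per_ure: float) -> float:
    """Poisson approximation (an assumption): P(>=1 URE) = 1 - exp(-expected)."""
    return 1.0 - math.exp(-expected_ures(tb_read, tb_per_ure))


if __name__ == "__main__":
    # Both ends of the 10 - 100 TB range quoted above, for 50 TB of reads.
    for tb_per_ure in (10.0, 100.0):
        p = prob_at_least_one_ure(50.0, tb_per_ure)
        print(f"1 URE per {tb_per_ure:.0f} TB: P(>=1 URE in 50 TB) = {p:.3f}")
```

At the pessimistic end of the range (1 URE per 10 TB), reading 50 TB gives an expected 5 UREs, which is why the risk cannot be ignored at scale.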
Conventional coping methods
The most popular technique for mitigating silent errors is to use hardware RAID backed by Non-Volatile RAM (NVRAM). Because the memory is non-volatile, it retains its state across power cycles and can therefore reissue queued writes once power is restored. There are two downsides to this approach. First, NVRAM can be quite expensive: it requires specialized hardware that can cost hundreds of dollars per server. Second, it only protects against one specific route to inconsistent writes: power loss. It provides no protection in the event the RAID card itself fails, or in the case of an uncorrectable read error.
Sun created a variant of RAID called RAID-Z to address the problems that arise from the non-atomic nature of other RAID systems. Instead of overwriting data, RAID-Z writes a copy to an unused area on all disks. Once the write is complete across all disks, a pointer is updated to refer to the new data. In this way writes become atomic (either all succeed or none succeed) across the array. RAID-Z also incorporates integrity checking to detect when disks return bad data. While these features are a step in the right direction, they break down in the event of a disk failure. When one disk is down, new data cannot be written, and if a URE occurs, data will be lost.
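The copy-then-update-pointer idea described above can be illustrated with a toy in-memory model. This is only a sketch of the general technique; it is not ZFS’s actual on-disk format or API, and the class name and the `crash_before_commit` flag are hypothetical.

```python
class CopyOnWriteStore:
    """Toy model: new data goes to a fresh block; one pointer swap publishes it."""

    def __init__(self, initial: bytes):
        self._blocks = {0: initial}  # block id -> contents
        self._next_id = 1
        self._live = 0               # pointer to the current valid block

    def read(self) -> bytes:
        return self._blocks[self._live]

    def write(self, data: bytes, crash_before_commit: bool = False) -> None:
        new_id = self._next_id
        self._next_id += 1
        self._blocks[new_id] = data  # step 1: write to an unused area
        if crash_before_commit:
            return                   # simulated crash: the pointer never moves
        old_id = self._live
        self._live = new_id          # step 2: a single pointer update commits
        del self._blocks[old_id]     # the old block can now be reclaimed


store = CopyOnWriteStore(b"old")
store.write(b"new", crash_before_commit=True)
assert store.read() == b"old"        # readers never see a half-finished write
```

Because the old block is never overwritten, a crash before the pointer update leaves the previous version fully intact.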
Cleversafe’s solution
Cleversafe, through a combination of several unique strategies, is able to completely eliminate the issue of silent errors. These strategies include:
- Transactional write semantics (atomic commit/rollback)
- Write thresholds
- Revisions
- Up to a 512-bit data source integrity check value
- Use of CRC-32 checksums on storage servers
Cleversafe’s storage system is unique in that it must coordinate writes not only across different disks, but across disks that are in different machines, stored at different sites, perhaps across the globe. This required special care to ensure data is written reliably, yet without RAID-Z’s strict requirement that data be written to every disk. Had we kept that requirement, the outage of a single slice server would make the system unusable.
To address the dual requirements of writing data reliably and maintaining high availability, we instituted the concept of a write threshold. For a 16/10 dsNet, a write threshold of 14 means that the data must be written to 14 of the 16 slice servers to be considered successful. In this case, even if two slice servers are down, data can continue to be written, and reliability is still maintained: because any 10 of the 14 written slices are enough to recover the data, up to 4 simultaneous disk failures could occur and the data would remain recoverable. Whether the client commits or rolls back the transaction depends on whether it is able to successfully write to at least a write-threshold number of slice servers.
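As a rough illustration of the commit-or-rollback decision, the sketch below stages one slice per server and commits only if at least a write-threshold number of servers accepted it. It is a simplified model under our own assumptions: the `SliceServer` class, its methods, and `write_with_threshold` are hypothetical names, not the actual dsNet protocol.

```python
from dataclasses import dataclass, field


@dataclass
class SliceServer:
    """Hypothetical in-memory stand-in for a slice server."""
    name: str
    online: bool = True
    staged: dict = field(default_factory=dict)     # uncommitted slices
    committed: dict = field(default_factory=dict)  # durable slices

    def stage(self, key, slice_data) -> bool:
        if not self.online:
            return False
        self.staged[key] = slice_data
        return True

    def commit(self, key) -> None:
        if key in self.staged:
            self.committed[key] = self.staged.pop(key)

    def rollback(self, key) -> None:
        self.staged.pop(key, None)


def write_with_threshold(servers, key, slices, write_threshold) -> bool:
    """Stage one slice per server; commit only if enough servers accepted."""
    accepted = [s for s, data in zip(servers, slices) if s.stage(key, data)]
    if len(accepted) >= write_threshold:
        for s in accepted:
            s.commit(key)
        return True
    for s in accepted:
        s.rollback(key)  # too few acknowledgements: undo the staged slices
    return False


servers = [SliceServer(f"s{i}") for i in range(16)]
servers[0].online = servers[1].online = False        # two servers are down
slices = [bytes([i]) for i in range(16)]
print(write_with_threshold(servers, "obj", slices, write_threshold=14))
```

With two of the 16 servers offline the write still reaches the threshold of 14 and commits; a third outage would drop it to 13 acknowledgements and the transaction would roll back.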
One might wonder: what about the old data left over on slice servers that were offline during the write? To ensure consistency, all data written together is assigned a unique revision number, which is stored alongside the data. Therefore, when reading data back, the client can differentiate old data from new and exclude old data when reassembling the data source.
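One way a client could use revision numbers on the read path is sketched below. It assumes that revision numbers are ordered so that larger means newer, and that a revision is usable once a threshold number of its slices has been collected (10 in a 16/10 configuration). Both are assumptions made for illustration, and `select_current_slices` is a hypothetical name.

```python
def select_current_slices(responses, read_threshold):
    """Pick the newest revision that has enough slices to reassemble.

    responses: iterable of (revision, slice_index, data) tuples, one per
    slice server that answered. Stale slices from older revisions are ignored.
    """
    by_revision = {}
    for revision, index, data in responses:
        by_revision.setdefault(revision, {})[index] = data
    for revision in sorted(by_revision, reverse=True):  # newest first
        slices = by_revision[revision]
        if len(slices) >= read_threshold:
            return revision, slices
    raise LookupError("no revision has enough slices to reassemble")


# 14 servers hold revision 2; two servers that were offline still hold revision 1.
responses = [(2, i, b"new") for i in range(14)] + [(1, 14, b"old"), (1, 15, b"old")]
revision, slices = select_current_slices(responses, read_threshold=10)
assert revision == 2 and len(slices) == 14
```

The two stale slices are never mixed into the reassembled data source, because they carry a different revision number.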
Lastly, to address the issue of UREs, Cleversafe slice servers compute and store integrity check values for each slice they keep. The integrity values are periodically checked for correctness by a background process, and in addition the slice server checks the integrity of any requested slice before returning it to the client. If the slice is found to be invalid, the server responds as if it does not have the slice, thereby preventing the corruption from propagating to a higher level. As a last line of defense, a data-source-level integrity check value is computed and compared by the client after it has reassembled a data source. This integrity check value can be up to 512 bits in length, providing an immensely high probability that, should any corruption have occurred, it will not be silent, and bad data will never reach the application or end user.
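The two layers of checking described above might look roughly like the following. This is a minimal sketch, not Cleversafe’s implementation: the `SliceStore` class and `verify_source` function are hypothetical, and the choice of SHA-512 for the 512-bit data-source check is our assumption, since the text does not name the algorithm. CRC-32 comes from Python’s standard `zlib` module.

```python
import hashlib
import zlib


class SliceStore:
    """Hypothetical slice server storage that keeps a CRC-32 next to each slice."""

    def __init__(self):
        self._slices = {}  # name -> (data, crc32)

    def put(self, name: str, data: bytes) -> None:
        self._slices[name] = (data, zlib.crc32(data))

    def get(self, name: str):
        entry = self._slices.get(name)
        if entry is None:
            return None
        data, stored_crc = entry
        if zlib.crc32(data) != stored_crc:
            return None  # corrupted: respond as if the slice does not exist
        return data

    def flip_bit(self, name: str) -> None:
        """Test helper: simulate on-disk corruption by flipping one bit."""
        data, crc = self._slices[name]
        self._slices[name] = (bytes([data[0] ^ 0x01]) + data[1:], crc)


def verify_source(data: bytes, expected_digest: bytes) -> bytes:
    """Client-side check on the reassembled data source (SHA-512 assumed)."""
    if hashlib.sha512(data).digest() != expected_digest:
        raise ValueError("data source failed integrity check")
    return data


store = SliceStore()
store.put("slice-0", b"hello slice")
store.flip_bit("slice-0")
assert store.get("slice-0") is None  # bad data is never handed to the client
```

Treating a corrupt slice as missing lets the client fall back on the remaining slices instead of reassembling from bad data.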

A couple of notes relating to the comments about ZFS: You can still write to a degraded RAID-Z zpool, just like you can write to a RAID-5 array that has lost a drive. Additionally, it bears mentioning that ZFS supports double- and triple-parity RAID – RAID-Z2 and RAID-Z3.