Monthly Archive for October, 2009

Sidekick’s Data Loss

There’s been much press lately regarding the loss of data for users of T-Mobile’s Sidekick phone.  One of the phone’s differentiating features was the ability to automatically restore data from remote servers after a reset, so there was presumably no need to manually back up the phone’s data.  Unfortunately, that presumption turned out to be premature and perhaps undeserved, given that Sidekick users had little visibility into how and where their data was stored, or into what level of reliability and privacy protection was in place for it.

The data loss event was traced to a server failure at Danger, a Microsoft subsidiary.  Microsoft released a statement reporting that most users should see their data restored and, in an update as of a few days ago, said it was still in the process of recovering data.  We wish them luck and hope that all user data is recovered.

While we don’t know the exact details of the failure and what caused it, we can’t help but think that it could have been avoided had the storage system not possessed a single point of failure.  This is one of the primary benefits that dispersed storage provides; no single server failure can cause data loss.  If there is any positive outcome from this incident, it might be that users will be more vigilant when it comes to investigating how their data is protected and who their cloud storage providers are, because as this event shows, not all clouds are the same.

Dispersal: Getting better all the time

When the Reed-Solomon algorithm used in Information Dispersal was first invented in 1960, computers of the day were not even powerful enough to implement it.  For 20-some years, the algorithm existed largely on paper, until in 1982 it saw its first mass-market application in the Compact Disc, where it was used to keep data recoverable despite the ease with which the disc surface could be scratched.

According to Moore’s Law, the amount of computer processing power and memory that one can buy with a fixed quantity of money roughly doubles every 24 months.  Between 1960 and 1982 there were therefore 11 such doublings, and computers became 2^11 (or 2048) times faster over that period.  While much faster than computers of the 60s, computers in the early 1980s were still very slow by today’s standards.  Common personal computers of the time were the Commodore 64 and Apple IIe, and while it was possible to install a modem into them, there was no publicly accessible dial-up Internet until 1989.  Theoretically, one could have created a dispersed storage network among a group of BBSes, but with the 300 bps modems of the day it would have taken about 6 hours to disperse a single 720 KB floppy disk.  Before information dispersal could take off, the price-to-performance ratio of bandwidth would need to improve massively.
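The arithmetic in this paragraph can be checked with a short sketch. The function names are illustrative, 720 KB is taken as 720 × 1024 bytes, and the 10-bits-per-byte case is an assumption about serial framing (one start and one stop bit per byte). The estimate ignores any expansion that dispersal itself would add.

```python
def doublings(start_year, end_year, months_per_doubling=24):
    """Whole doublings that fit between two years."""
    return (end_year - start_year) * 12 // months_per_doubling


def transfer_hours(size_bytes, line_bps, bits_per_byte=8):
    """Hours needed to push size_bytes through a link of line_bps bits/second."""
    return size_bytes * bits_per_byte / line_bps / 3600


n = doublings(1960, 1982)   # doublings at one per 24 months
speedup = 2 ** n            # resulting growth factor

floppy_bytes = 720 * 1024   # assumption: 1 KB = 1024 bytes
raw_hours = transfer_hours(floppy_bytes, 300)                        # payload bits only
framed_hours = transfer_hours(floppy_bytes, 300, bits_per_byte=10)   # assumed start/stop bits

print(n, speedup)
print(round(raw_hours, 1), round(framed_hours, 1))
```

Under these assumptions the transfer takes between roughly five and a half and seven hours, which is consistent with the "about 6 hours" figure.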

Luckily, the bandwidth capacity of networks and the Internet in general has been growing at an even faster rate than Moore’s Law, doubling approximately every 12 months.  We’ve gone from 56 Kbps modems in 1998 to 30 Mbps fiber-optic connections in 2007.  That represents a speed increase of 548 times, roughly equal to 9 doublings.  With the proliferation of broadband Internet connections, and with network speed greatly outstripping the growth of processing power and memory, using the network for storage operations makes more sense with each passing year.
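As a quick check of the bandwidth figures, the sketch below converts the speed ratio into doublings. It assumes the 548 figure comes from counting 1 Mbps as 1024 Kbps; the variable names are illustrative only.

```python
import math

old_kbps = 56
new_kbps = 30 * 1024          # assumption: 1 Mbps counted as 1024 Kbps

ratio = new_kbps / old_kbps
bandwidth_doublings = math.log2(ratio)   # doublings needed to reach that ratio
years = 2007 - 1998                      # one doubling per year would give this many

print(int(ratio), round(bandwidth_doublings, 1), years)
```

Nine doublings over nine years matches the stated rate of one doubling roughly every 12 months.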

By 2004, the year Cleversafe was founded, there had been another 11 doublings in processing power since 1982, and 22 since 1960.  Computers were then millions of times more powerful than those that were around when Reed-Solomon was first invented, and multi-Mbps connections to the Internet were common.  However, there is a disturbing trend related to Moore’s law which is causing havoc for storage administrators around the world: the total amount of data stored in the world also doubles every 24 months.  With each bit stored comes the risk of losing that bit.  This means that as storage requirements double so does the chance of losing one of those stored bits.  In other words, without compensating by making more and more copies of the data, the chance of an organization suffering a data loss event will double every 24 months.
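The claim that risk doubles with stored data can be illustrated with a simple independence model. This is only a sketch: the per-unit loss probability `p` below is a hypothetical number chosen for illustration, and real failures are not fully independent.

```python
import math


def loss_probability(stored_units, per_unit_loss):
    """P(at least one unit is lost), assuming independent losses.

    Computes 1 - (1 - p)^N in a numerically stable way.
    """
    return -math.expm1(stored_units * math.log1p(-per_unit_loss))


p = 1e-9                                   # hypothetical per-unit loss probability
small = loss_probability(1_000_000, p)     # original data set
large = loss_probability(2_000_000, p)     # after one doubling of stored data
growth = large / small

print(small, large, growth)
```

While the overall probability stays small, doubling the amount stored very nearly doubles the chance of losing something.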

The only way one can cope with an exponential problem is to have a solution which likewise increases exponentially in effectiveness.  Dispersal takes advantage of exponentially increasing CPU resources to combat the decreased reliability that comes with increasing storage requirements.  In a dispersed storage network, the level of fault tolerance can be dialed up as needed, for example by going from a 15/10 configuration to a 16/10 configuration.  With every additional tolerable failure, the reliability of the dispersed storage network increases 1000-fold, which provides a sufficient level of reliability for the next 10 doublings (roughly 20 years at current growth rates).  At that time, the 16/10 could migrate to a 16/9 or a 17/10.  The increased CPU requirements in going from a 16/10 to a 17/10 represent a 16% increase, but by the time this becomes necessary, CPUs will be 1,000 times more powerful.  Thus dispersal provides a solution which only becomes better through Moore’s law.  RAID- or replication-based systems, on the other hand, must keep increasing the number of copies, exacerbating the difficulty of keeping up with exponentially growing storage requirements.
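One way to see why each extra tolerable failure helps so much is a binomial model of appliance availability. The sketch below is not Cleversafe's reliability model: the function name and the per-appliance unavailability `p` are assumptions for illustration, and failures are treated as independent.

```python
from math import comb


def dispersal_loss_probability(width, threshold, p_unavailable):
    """P(data unreadable) for a width/threshold configuration.

    Data is lost when more than (width - threshold) appliances are
    unavailable at once; failures are assumed independent.
    """
    max_tolerable = width - threshold
    return sum(
        comb(width, f) * p_unavailable**f * (1 - p_unavailable) ** (width - f)
        for f in range(max_tolerable + 1, width + 1)
    )


p = 0.001                                   # hypothetical per-appliance unavailability
loss_15_10 = dispersal_loss_probability(15, 10, p)
loss_16_10 = dispersal_loss_probability(16, 10, p)
improvement = loss_15_10 / loss_16_10

print(loss_15_10, loss_16_10, improvement)
```

In this model the improvement from one additional tolerable failure scales roughly with 1/p, so the exact factor depends on how reliable each appliance is assumed to be; with the value above it comes out at several hundred-fold.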

Dispersal shifts rebuilding expectations

Rebuilding times are a hot topic, particularly in light of the industry adoption of 1TB drives and pending 2TB drives. The problem is that as storage density increases, the likelihood of encountering an Unrecoverable Read Error (URE) during a rebuild also increases, since URE rates have not improved beyond 1 in 10^14 to 1 in 10^15 bits read. (Read “RAID’s days may be numbered” for more details.)
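To make the URE concern concrete, the sketch below estimates the chance of hitting at least one URE while reading a given amount of data. It assumes errors are independent, the rate applies per bit read, and 1 TB means 10^12 bytes; the function name is illustrative.

```python
import math


def ure_probability(bytes_read, ure_rate_per_bit):
    """P(at least one unrecoverable read error) while reading bytes_read."""
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-ure_rate_per_bit))


TB = 10**12                                      # assumption: decimal terabyte

one_drive = ure_probability(1 * TB, 1e-14)       # reading one full 1 TB drive
seven_drives = ure_probability(7 * TB, 1e-14)    # reading 7 surviving drives of an 8-drive array
better_spec = ure_probability(1 * TB, 1e-15)     # same drive at the better URE rate

print(round(one_drive, 3), round(seven_drives, 3), round(better_spec, 4))
```

Because a rebuild must read every surviving drive in full, the probability grows with both drive capacity and array width, which is why larger drives make the rebuild window riskier.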

Rebuilding concerns revolve around two key issues:

  • Am I vulnerable to data loss while the disk is rebuilding?
  • Can I balance rebuilding time with typical I/O demands on the system?

Rebuilding with RAID
RAID typically stores data in arrays of drives, and the bottleneck for rebuilding is the read/write speed of the drives. If a drive fails, its contents are rebuilt by reading the data and parity stored on the remaining drives. All of the drives are within the same storage appliance. (See figure 1a)

Based on many published studies on RAID rebuilding times, it would typically take 1-3 days to rebuild a 1 terabyte drive with an idle system. The rebuilding time would be 3-5 times that with a heavier I/O activity load.

Rebuilding using Dispersal
The main difference between rebuilding with RAID and rebuilding with Dispersal is that RAID stripes its data across multiple drives in a single hardware appliance, whereas Dispersal disperses the Slices across multiple drives in multiple hardware appliances. When rebuilding a drive, each Slice stored on the failed drive is rebuilt by reading a threshold number of Slices from the remaining Slicestor appliances. (See figure 1b)

With Dispersal, network speed plays an important role in rebuilding, as the network can become a bottleneck in the rebuild process. This means that rebuilding using Dispersal may actually take longer than with RAID if the network has insufficient throughput to keep the disk writing at full speed.
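The bottleneck described here can be sketched as a throughput-only lower bound. Since each rebuilt Slice requires reading a threshold number of Slices over the network, the effective rebuild rate is modeled as the smaller of the disk write rate and the network rate divided by the threshold. The function name and the disk and network speeds are hypothetical values, not measurements.

```python
def rebuild_hours(data_bytes, disk_bytes_per_s, network_bytes_per_s=None, threshold=1):
    """Idealized lower bound on rebuild time in hours.

    With no network rate given, the rebuild is local and disk-bound.
    Otherwise each rebuilt byte needs `threshold` bytes pulled over
    the network, so the usable rate is capped at network / threshold.
    """
    rate = disk_bytes_per_s
    if network_bytes_per_s is not None:
        rate = min(rate, network_bytes_per_s / threshold)
    return data_bytes / rate / 3600


TB = 10**12
disk = 100e6            # hypothetical 100 MB/s sustained write
gigabit = 125e6         # 1 Gbps expressed in bytes per second
ten_gigabit = 1.25e9    # 10 Gbps expressed in bytes per second

local = rebuild_hours(1 * TB, disk)
slow_net = rebuild_hours(1 * TB, disk, gigabit, threshold=10)
fast_net = rebuild_hours(1 * TB, disk, ten_gigabit, threshold=10)

print(round(local, 1), round(slow_net, 1), round(fast_net, 1))
```

These bounds ignore competing I/O and are far below the multi-day times reported in practice, but they show the shape of the tradeoff: on the slower link the network dominates, while on the faster link the disk is the limit again.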

Figure 1: RAID and Dispersal Rebuilding

Comparing RAID and Dispersal Data Protection
In a 16-wide, 10-threshold Dispersal configuration (16/10), data is sliced and dispersed across 16 Slicestor appliances, with any 10 sufficient to perfectly recreate the data. So, up to six simultaneous appliance failures can occur without data loss. With the 16 appliances spread evenly across 4 geographically dispersed locations, the system would tolerate both an entire site failure (4 appliances) and 2 additional appliance failures, while still providing seamless access to data.

In a typical RAID 6 configuration of eight drives, with two drives’ worth of capacity used for parity, only two simultaneous errors can be tolerated. Any further error (drive failure or URE) will result in data loss.

Comparing RAID 6 and Dispersal 16/10: a 16/10 Dispersal system could encounter four simultaneous errors and still be at the same starting point, in terms of data protection, as a healthy RAID 6 system. Both would then be able to tolerate two more simultaneous errors.

As this example shows, Dispersal can tolerate three times the number of simultaneous errors, which points to why rebuilding times are less relevant. After losing two drives, the white-knuckled rebuild required with RAID 6 isn’t a pressing concern with Dispersal, since four additional errors could still be tolerated, and that many further errors are statistically unlikely.
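The failure counts used in this comparison follow directly from the width and threshold. The sketch below reproduces them; treating RAID 6 on eight drives as an "8/6" configuration is a simplification for comparison, and the function names are illustrative.

```python
def tolerable_failures(width, threshold):
    """Simultaneous failures a width/threshold layout can absorb."""
    return width - threshold


def spare_after_site_loss(width, threshold, sites):
    """Failures still tolerable after losing one whole site.

    Assumes appliances are spread as evenly as possible across sites;
    a negative result means a site loss alone causes data loss.
    """
    per_site = -(-width // sites)   # ceiling division
    return tolerable_failures(width, threshold) - per_site


dispersal = tolerable_failures(16, 10)                 # 16/10 Dispersal
raid6 = tolerable_failures(8, 6)                       # eight-drive RAID 6, simplified
spare = spare_after_site_loss(16, 10, sites=4)         # after one of 4 sites is lost
after_four_failures = dispersal - 4                    # Dispersal after four errors

print(dispersal, raid6, spare, after_four_failures)
```

After four failures the 16/10 system has the same remaining margin as a healthy RAID 6 array, and the overall margin is three times larger.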

Comparing the years without data loss for a 1 petabyte system (see Figure 2) shows that Dispersal can tolerate much longer rebuild times while still delivering higher levels of data protection than RAID 5 or RAID 6.
For example, if RAID 6 rebuilding took 10 hours, Dispersal could tolerate over 6000 rebuilding hours while providing an equal level of data protection. This is an illustrative example only; clearly, rebuilding would be prioritized to occur in a shorter time period.

Figure 2: Years without data loss

RAID rebuilding performance
With RAID rebuilding, there is typically a choice of how much performance degradation is acceptable while rebuilding the drive. Other factors affecting the rebuild time include RAID stripe size, number of drives, and drive capacity.

When setting rebuild priority, the tradeoff is between using system resources for rebuilding and for other I/O activities. When rebuilding takes precedence, rebuild times are faster, but no other activities can occur, potentially resulting in lost business productivity. When other I/O activities are prioritized, rebuild times are longer, rebuilding may occur only during off-peak hours, and data may be vulnerable to loss if additional errors are encountered.

Dispersal rebuilding performance
Looking back to the two issues when rebuilding – data protection, and balancing rebuilding with work productivity – Dispersal effectively addresses both concerns. Dispersal provides extremely high levels of data protection since it is fault tolerant by design.

Regarding work productivity, IT staff can simply dedicate more system resources to other I/O activities and have rebuilding prioritized during off-peak hours. This may seem counter-intuitive, since it suggests making rebuild times even longer. It is not, though, because the data protection levels are so much higher than with RAID.

Dispersal rebuilds only data
Dispersal also performs its rebuilding differently than most RAID systems. Typically, when a drive is replaced in a RAID array, the rebuild process rebuilds the entire drive regardless of whether there is actually data stored on it. This means the rebuild time is longer than necessary.

Dispersal, on the other hand, uses both CRC values on reads and a background scrubbing process to determine which data needs to be rebuilt. Further, when a drive is replaced, a scan is performed to determine how much data was actually stored, and only that data is rebuilt. This shortens the rebuild time when compared to rebuilding an entire drive.  It should be noted that dispersal rebuilding requires reading more data per restored byte than a RAID system.
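The general idea of CRC-checked reads plus a scrubbing pass can be sketched as follows. This is not the actual Slicestor implementation: the record layout, the function names, and the use of CRC-32 from Python's `zlib` module are all assumptions made for illustration.

```python
import zlib


def write_slice(payload):
    """Store a payload alongside the CRC computed at write time."""
    return {"payload": payload, "crc": zlib.crc32(payload)}


def is_intact(record):
    """Recompute the CRC on read and compare it with the stored value."""
    return zlib.crc32(record["payload"]) == record["crc"]


def scrub(store):
    """Background pass: list the slices whose CRC no longer matches."""
    return sorted(name for name, record in store.items() if not is_intact(record))


# Hypothetical store holding three slices.
store = {
    name: write_slice(name.encode() * 4)
    for name in ("slice-a", "slice-b", "slice-c")
}

# Simulate silent corruption by flipping a single bit in one slice.
damaged = bytearray(store["slice-b"]["payload"])
damaged[0] ^= 0x01
store["slice-b"]["payload"] = bytes(damaged)

to_rebuild = scrub(store)
print(to_rebuild)
```

Only the slices flagged by the scrub need to be rebuilt, which is what lets the rebuild skip healthy data and empty space.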

Conclusion
Dispersal is much more fault tolerant than RAID 5 or RAID 6, and is far less dependent on the fast rebuild times that RAID requires. As such, rebuilding can occur without significant performance degradation or risk to data integrity.