Monthly Archive for December, 2009

SearchStorage – top 10 enterprise data storage news stories of 2009

SearchStorage’s Beth Pariseau recently posted the Top 10 enterprise data storage news stories of 2009. They were:

10. Data deduplication branches out
9. Object-based data storage re-invented
8. Solid-state drives hit market, but adoption slow
7. The NAS renaissance
6. Big vendors stack up
5. Clouds everywhere
4. vSphere 4 adds long-awaited data storage features
3. The IBM/Sun/Oracle love triangle
2. EMC and NetApp in bidding war for Data Domain
1. The economy

Here are some thoughts on number 9, and on why object storage is being reinvented…

Provisioning for virtualized storage

Virtualized storage promises shared storage resources that can be parceled out as needed. One of the downsides of block-based storage is that when a LUN is provisioned, the disk drives are actually formatted to accommodate the provisioned space. This leads to inefficiencies because LUNs are often underutilized, and having to earmark the space up front leads storage administrators to over-provision, since LUNs cannot grow in size. Thin provisioning solutions address this problem, but they are basically a layer of abstraction on top of block storage.

With object storage, storage containers can be provisioned without actually allocating physical disk. This means thin provisioning is inherent to the storage mechanism, not an add-on. As private and public cloud storage (yes, a trendy word, but it does help communicate the point) emerges, being able to provision virtually is key to making these systems efficient and scalable.
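To make the contrast with an up-front LUN allocation concrete, here is a minimal sketch of a thinly provisioned container. The class name `ThinContainer` and its methods are hypothetical and purely illustrative; they do not describe any particular product's API.

```python
class ThinContainer:
    """Toy container: provisioning sets a limit, only stored objects consume space."""

    def __init__(self, provisioned_bytes: int):
        # Provisioning records a limit; nothing is allocated or formatted here.
        self.provisioned_bytes = provisioned_bytes
        self._objects = {}

    @property
    def used_bytes(self) -> int:
        # Physical consumption is simply the sum of the stored objects.
        return sum(len(data) for data in self._objects.values())

    def put(self, key: str, data: bytes) -> None:
        # Account for overwriting an existing object with the same key.
        projected = self.used_bytes - len(self._objects.get(key, b"")) + len(data)
        if projected > self.provisioned_bytes:
            raise ValueError("container limit exceeded")
        self._objects[key] = data


container = ThinContainer(provisioned_bytes=10**12)  # "1 TB" provisioned
container.put("report.pdf", b"x" * 1000)
print(container.used_bytes, "bytes used of", container.provisioned_bytes)
```

In this sketch, a freshly provisioned container uses no space at all, and raising the limit later would just mean changing one number.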

Data mobility

With object storage, a unique object ID is assigned to each object. This means that objects can move across physical hardware without causing file references to get out of sync. The benefit is that data becomes more mobile, which is an important requirement for automated tiering and for hardware refreshes in large-scale systems. Data migration will become much simpler to accomplish with objects.
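A minimal sketch of the idea follows, assuming a simple in-memory lookup table from object ID to current location. The `ObjectStore` class and the node names are hypothetical; a real system would keep this mapping in a distributed index.

```python
import uuid


class ObjectStore:
    """Toy object store: clients hold object IDs, never physical locations."""

    def __init__(self):
        self._location = {}  # object ID -> name of the node currently holding it
        self._nodes = {}     # node name -> {object ID: data}

    def put(self, data: bytes, node: str) -> str:
        object_id = uuid.uuid4().hex  # unique ID, independent of where data lives
        self._nodes.setdefault(node, {})[object_id] = data
        self._location[object_id] = node
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._nodes[self._location[object_id]][object_id]

    def location_of(self, object_id: str) -> str:
        return self._location[object_id]

    def migrate(self, object_id: str, new_node: str) -> None:
        # Move the bytes, then update the single ID -> location entry.
        old_node = self._location[object_id]
        data = self._nodes[old_node].pop(object_id)
        self._nodes.setdefault(new_node, {})[object_id] = data
        self._location[object_id] = new_node


store = ObjectStore()
oid = store.put(b"quarterly results", node="old-array")
store.migrate(oid, "new-array")   # e.g. a hardware refresh
print(store.location_of(oid), store.get(oid))
```

The reference the client holds (`oid`) is unchanged by the migration, which is the property that keeps references from going out of sync.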

Scale without file system constraints

Imagine listing out the file system for Google. It is not doable and, furthermore, it would not be valuable for end users. Google has proven that searches returning instantaneous, relevant results are more important than traversing a file system folder structure to find data.

Here’s another post related to Cleversafe’s object storage release from September that has further discussion.

Three Trends pointing towards Network Storage in 2010

As 2009 comes to a close we thought it was a good time to step back and evaluate larger trends in the industry and what they might mean for storage technology in the coming year. We identified three clear trends which point strongly towards 2010 being a turning point year for networked storage, and in particular dispersed storage. The three primary reasons for this approaching turning point are:

1. With 10 Gbps Ethernet, network speeds have unquestionably eclipsed primary storage speeds
2. Internet access is becoming increasingly ubiquitous, through technologies such as WiMAX, 3G, and LTE
3. Bandwidth is cheap enough that its cost is insignificant compared to the cost of the storage

Network Speed

Most hard drives sold today spin at 7200 RPM and support read and write rates of 80 – 100 MB/s. Special high-performance drives, such as Western Digital’s Raptor drives, have performance up to 150 MB/s. Enterprise-grade SAS drives have been able to reach 200 MB/s, and SSD flash drives have achieved speeds between 200 and 300 MB/s. All but the ordinary 7200 RPM drives exceed the maximum transmission rate of a 1 Gbps Ethernet port, which can achieve at most 125 MB/s (1 billion bits is 125 million bytes).

Today 1 Gbps Ethernet is common, while 10 Gbps connections are emerging and relatively rare. However, within a few years 10 Gbps will likely be quite common. In what ways might storage evolve as network speed catches up with and overtakes primary storage speeds? 10 Gbps Ethernet is equivalent to 1250 MB/s, which is more than 10 times faster than today’s consumer hard drives and about 4 times faster than today’s solid state drives.
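The conversions above can be checked with a few lines of Python. This is only a sketch: it uses decimal units, ignores protocol overhead, and takes the upper drive figures quoted earlier (100 MB/s and 300 MB/s) as the points of comparison.

```python
def gbps_to_mb_per_s(gbps: float) -> float:
    """Convert a line rate in Gbps to MB/s (decimal units, overhead ignored)."""
    return gbps * 1e9 / 8 / 1e6


# Upper figures from the drive speeds quoted above.
drive_rates_mb_s = {"7200 RPM hard drive": 100, "solid state drive": 300}

for link_gbps in (1, 10):
    link_mb_s = gbps_to_mb_per_s(link_gbps)
    for name, rate in drive_rates_mb_s.items():
        print(f"{link_gbps} Gbps = {link_mb_s:.0f} MB/s, "
              f"{link_mb_s / rate:.1f}x a {name}")
```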

Today, the hard drive’s role in the computer is that of a fast cache of programs, documents, content, etc. Accessing a movie stored on your hard drive today is much faster than accessing it over the network, but what role will the hard drive serve when that is no longer the case? If streaming a movie from the network could support higher resolutions and frame rates, then why stream from a disk? If hard drives and flash are not made completely obsolete by a new, much faster storage technology, they will have to exist as highly parallelized banks of storage, in which write and read requests are striped across many individual hard drives or SSDs.

If networked storage like this is what the future holds, then information dispersal offers two clear advantages over replication and RAID-based systems. The first is that replication works against parallelization when it comes to gaining performance. For example, imagine replicating a 1 GB file to five locations. Even though the load is shared across five sites, each site has to handle a full GB of I/O. With dispersal, each location receives only 1/threshold of the load, so in a 10-of-15 configuration each site receives 1/10th of a GB, or 100 MB. Therefore, if each Slicestor processed data at 80 MB/s, the set of 15 stores could process data at 800 MB/s (10 × 80 MB/s, close to the speed of 10 Gbps Ethernet). Notice that the higher the threshold of the configuration, the higher the achievable throughput.
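The arithmetic in this comparison can be sketched as follows. The function names are ours and purely illustrative; the model assumes every site works in parallel, each site is the only bottleneck, and 1 GB is 1000 MB as in the example above.

```python
def replication_io_per_site_mb(file_mb: float) -> float:
    """With replication, every site stores and transfers the whole file."""
    return file_mb


def dispersal_io_per_site_mb(file_mb: float, threshold: int) -> float:
    """With dispersal, each site handles a slice of size file / threshold."""
    return file_mb / threshold


def dispersal_throughput_mb_s(per_store_mb_s: float, threshold: int) -> float:
    """Each store moves 1/threshold of the data, so the aggregate rate of
    original data is threshold times the per-store rate."""
    return per_store_mb_s * threshold


file_mb = 1000  # the 1 GB file from the example
print("replication, per site:", replication_io_per_site_mb(file_mb), "MB")
print("10-of-15 dispersal, per site:", dispersal_io_per_site_mb(file_mb, 10), "MB")
print("aggregate at 80 MB/s per store:", dispersal_throughput_mb_s(80, 10), "MB/s")
```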

This same advantage applies to RAID systems, in that in a 4+1 RAID 5 array each disk receives 1/4th of the data. However, due to RAID’s limited fault tolerance, it is dangerous to put too many drives in the same array. Imagine a 20+1 RAID array: it would have the amazing performance of 20 disks reading or writing in parallel, but if just 2 of the 21 disks in the array fail there will be data loss. Dispersal can use arbitrarily wide configurations and dial up the fault tolerance to match. For example, a 20-of-25 dispersal configuration could tolerate 5 simultaneous failures, as opposed to the 1 of RAID 5 or the 2 of RAID 6.
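A short sketch of the fault-tolerance comparison, under the simplifying assumption that a RAID array with d data disks and p parity disks can be treated like a d-of-(d+p) configuration. The function names are hypothetical.

```python
def failures_tolerated(width: int, threshold: int) -> int:
    """Slices (or disks) that can be lost while the data stays readable."""
    return width - threshold


def survives(width: int, threshold: int, failed: int) -> bool:
    """True if at least `threshold` of the `width` pieces remain available."""
    return width - failed >= threshold


configs = {
    "RAID 5 (4+1)": (5, 4),
    "RAID 5 (20+1)": (21, 20),
    "RAID 6 (4+2)": (6, 4),
    "dispersal 20-of-25": (25, 20),
}

for name, (width, threshold) in configs.items():
    print(f"{name}: tolerates {failures_tolerated(width, threshold)} failure(s), "
          f"survives 2 failures: {survives(width, threshold, 2)}")
```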

The second advantage dispersal offers for networked storage is inherent security and privacy. In a world where computers no longer have built-in hard drives, people need to trust their data to the hands of someone else. Not only does this mean the data has to be stored reliably, but it must also be stored securely and privately. Dispersal, through its SecureSlice(TM) technology, ensures that data remains private. If you don’t trust any single party to hold your data, you could split storage between multiple providers, and no single one of them could gain enough information to compromise the privacy of your data.

Network Omnipresence

Another trend which is pointing toward network storage is the omnipresence of network connections and Internet access. Airports, hotels, coffee shops, and even grocery stores frequently provide wireless Internet. Many cell phone companies offer Internet access through 3G, and soon a 100 Mbps standard known as LTE will be available. At this speed it becomes possible to stream high definition movies.

If the network is ubiquitous, why should the storage capacity of one’s iPod or cell phone be constrained by the amount of memory that physically fits within that device? Imagine that instead of being limited to a few GB in your hand-held device, you could access your entire music or movie collection over the network from anywhere. This is what some companies, such as Cablevision, are beginning to do with Network DVR. Access to one’s data would no longer be restricted by what physical things one could carry. The potential benefits of network storage combined with network omnipresence are too great to overlook.

Network Cost

The cost of transferring data over the Internet keeps falling. Today, depending on location, a network storage provider can purchase a 1 Gbps Internet connection for roughly $7,000 a month. With this amount of throughput, they could transfer 328,717 GB per month both in and out of the site. This works out to just over 2 cents for each GB written and read.
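Here is a sketch of that calculation. The post does not state which month length it uses, so the average month of 365.2422 / 12 days below is our assumption (it gives a figure within about 1 GB of the one quoted). The sketch also assumes the link is fully saturated all month and uses decimal GB.

```python
# Assumption: an average month of 365.2422 / 12 days.
SECONDS_PER_MONTH = 365.2422 / 12 * 24 * 3600


def monthly_transfer_gb(link_gbps: float) -> float:
    """GB moved in one direction over a month on a fully saturated link."""
    gb_per_second = link_gbps / 8  # 1 Gbps = 0.125 GB/s in decimal units
    return gb_per_second * SECONDS_PER_MONTH


def cost_per_gb(monthly_price_usd: float, link_gbps: float) -> float:
    """Dollars per GB if the monthly price is spread over the full transfer."""
    return monthly_price_usd / monthly_transfer_gb(link_gbps)


print(f"{monthly_transfer_gb(1):,.0f} GB per month in each direction")
print(f"${cost_per_gb(7000, 1):.4f} per GB")
```

A link that is only partly utilized would raise the cost per GB proportionally, so the 2 cent figure is best read as a lower bound.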

Assuming the average drive uses about 13.5 watts of power to operate, and assuming an electricity cost of $0.12 per kilowatt-hour, it would cost $1.17 a month to power a hard drive. Assuming a 1 TB drive costs $100 and has a useful life of 5 years, the total cost of the drive, including capital and operational expenses, works out to $2.83 a month.
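The drive figures can be reproduced with the sketch below. The 30-day month is our assumption; it is the value that yields the $1.17 quoted above. The function names are illustrative only.

```python
def monthly_power_cost(watts: float, usd_per_kwh: float, days: int = 30) -> float:
    """Electricity cost of running a device continuously for one month."""
    kwh = watts / 1000 * 24 * days
    return kwh * usd_per_kwh


def monthly_drive_cost(purchase_usd: float, life_years: float,
                       watts: float, usd_per_kwh: float) -> float:
    """Capital cost spread evenly over the drive's life, plus power."""
    capital = purchase_usd / (life_years * 12)
    return capital + monthly_power_cost(watts, usd_per_kwh)


power = monthly_power_cost(13.5, 0.12)
total = monthly_drive_cost(100, 5, 13.5, 0.12)
print(f"power: ${power:.2f}/month, total: ${total:.2f}/month")
```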

If, instead of using a hard drive, one used network storage and paid 2 cents per GB of transfer, one could write and read up to 141 GB of data per month ($2.83 / $0.02) for less than the price of buying and powering a hard drive. This amount of data is equal to 100 days of listening to MP3s at 1 MB/minute, or 17.1 straight days of streaming movies (assuming 350 MB/hour). Few people watch 400 hours of TV or movies each month, and therefore the cost of the network in network storage is negligible compared to the cost of the storage in the majority of use cases.
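A sketch of the break-even arithmetic follows. The post's playback figures appear to assume 1 GB = 1024 MB, so that is the default used here; this is our inference, not something the post states. With 1 GB = 1000 MB the results come out slightly lower.

```python
def break_even_gb(drive_usd_per_month: float, transfer_usd_per_gb: float) -> float:
    """GB of transfer per month that costs the same as owning the drive."""
    return drive_usd_per_month / transfer_usd_per_gb


def days_of_playback(gb: float, mb_per_hour: float, mb_per_gb: int = 1024) -> float:
    """Continuous playback time, in days, that a given amount of data provides."""
    hours = gb * mb_per_gb / mb_per_hour
    return hours / 24


gb = int(break_even_gb(2.83, 0.02))  # the post rounds down to whole GB
print(gb, "GB per month")
print(f"MP3 at 1 MB/minute: {days_of_playback(gb, 60):.1f} days")
print(f"movies at 350 MB/hour: {days_of_playback(gb, 350):.1f} days")
```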

Trends in the advancement of storage virtualization: Management systems

Part 4: Management systems focused on data protection
In the first post of this series, advanced data virtualization, we discussed how once storage virtualization is a reality, the question of “Where is my data?” will shift to “Is my data protected?” This post will discuss how management systems will need to change in order to answer that question.

Current management systems focus on managing physical hardware and giving visibility into health and heartbeat. With data stored as full copies on storage nodes (even when using RAID for data virtualization), monitoring physical hardware makes sense, because whether a storage node is available determines whether the data residing on it is available.

Once advanced data virtualization techniques come into play, data will no longer reside on individual storage nodes in its entirety. This will require a shift in the focus of management systems, toward providing correlated information across multiple locations and hardware.

For example, if a series of storage nodes is unavailable, the management system will need to display the impact on access to data across different storage containers and applications. There may actually be no impact on access.

Today, data protection is set by RAID configuration and replication rules. In the future, data reliability and availability characteristics would be configured instead. (This was also mentioned in the post on self-organizing systems, in discussing how Storage Administrators will define the requirements for tiers such as QoS, data reliability, and performance.)

For example, a Storage Administrator would configure rules for failure tolerances, such as how many drives can fail and how many sites can be unavailable while still providing seamless access to data. No replication settings would be required, as the system would manage the fault tolerance rather than an administrator managing the rules for physically moving the data. The management system would then provide alerts when those rules are at risk of being compromised, along with tools for troubleshooting and recommendations for automated data migration as necessary.
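As a rough illustration of what such a rule check might look like, here is a minimal sketch. Everything in it is hypothetical: the `Container` fields, the `min_margin` rule, and the three status strings are our own illustrative choices, and the model assumes one slice per storage node.

```python
from dataclasses import dataclass


@dataclass
class Container:
    name: str
    width: int       # slices written per object, one per storage node
    threshold: int   # slices needed to read an object
    min_margin: int  # administrator's rule: further failures that must stay survivable


def assess(container: Container, unavailable_nodes: int) -> str:
    """Report the impact of node outages on a container, not on the hardware."""
    available = container.width - unavailable_nodes
    if available < container.threshold:
        return "unavailable"
    margin = available - container.threshold  # additional failures still tolerated
    if margin < container.min_margin:
        return "at risk"
    return "ok"


vault = Container(name="archive", width=15, threshold=10, min_margin=2)
for down in (0, 3, 4, 6):
    print(f"{down} node(s) down -> {assess(vault, down)}")
```

Note that several nodes can be down while the container still reports "ok", which is the case described earlier where a hardware outage has no impact on access.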

So, management systems will need to be the intermediary between the physical world of disks, devices, and sites, and the virtual world of storage containers that can dynamically grow and shrink. With the shift to virtualization, today’s controls for managing data protection will change from RAID and replication to failure tolerance.