The potential of SSD for millions of IOPS on increasingly larger databases

NOTE: There is an update to this article at http://blogs.msdn.com/b/microsoftbob/archive/2010/11/08/pcie-ssd-update-optimizing-sql-backup-performance.aspx regarding the throughput I was able to achieve on my own configuration with the Fusion-io backup process.  Through some parameter changes on the backup command, I was able to double the throughput of the backup to the RAID-10 array and more than triple the speed to the NUL device.

It seems that the computer industry is finally coming to recognize that someday electro-mechanical disk storage will be obsolete. One of the obstacles has been that, until recently, solid-state devices were very limited in capacity. So even though the devices are known to be many times faster than traditional storage, particularly for random access, many have assumed that the limited storage space would prevent widespread adoption for large databases without prohibitive costs.

However, we are now seeing SSD devices begin to rival traditional platter-based storage in capacity while maintaining ever-improving performance metrics. Key enablers have been the refinement of multi-level cell (MLC) technology to increase reliability, shrinking NAND flash chip footprints, and software-based approaches that optimize both longevity and performance. The initial wave of enterprise-class SSDs was based on SLC, with MLC deemed too unreliable for enterprise applications. However, MLC technology is improving, and many believe SLC will be obsoleted by MLC's lower cost per GB and comparable performance.

A major player in the MLC market as well as SLC is Fusion-io. Fusion-io is now producing a 1.2 TB PCIe SSD, the IoDrive Duo, based on MLC, with 1.1 GB/sec sequential write throughput (1.5 GB/sec on the 320 GB SLC version), 1.5 GB/sec read throughput, a read latency of 2 microseconds, a write latency of 30 microseconds (26 microseconds on the 320 GB SLC version), 185,000 read IOPS (261,000 on the 320 GB SLC version), and 278,000 write IOPS (http://www.fusionio.com/products/iodriveduo/).

Before getting too far into this and just in case somebody doesn’t make it to the bottom, I need to issue the following disclaimer:

DISCLAIMER: I am a Microsoft employee and not a representative of Fusion-io.  The numbers relating to performance, price, and all other metrics related to the product are estimates only.  I do not guarantee the accuracy of any of the numbers, charts, or statements regarding the Fusion-io products.  My analysis should not be considered an endorsement of Fusion-io or its products and is provided for informational purposes only.

I am not a hardware guru, but hopefully the estimates below are close enough to give a rough idea of the trade-offs between PCIe SSD and traditional drives.  Manufacturer sites and testing sites show significant discrepancies for enterprise-class 15K SAS drives.  To be fair to the SAS drives, I've taken what appear to be the best numbers for the best drives, but only for the 73GB size – it takes a lot of spindles to reach parity with an IoDrive, so there's no sense in using larger drives.  For Fusion-io, I am using their data sheet, which other sites have verified to be realistic.  Here's an example of a test in which HP used 5 IoDrive Duos to achieve 1 million IOPS:  http://www.eweek.com/c/a/Data-Storage/Fusionio-HP-Team-Up-for-1M-IOPs-8GBs-Throughput-Ben-362772/

Here’s another example of a test of a 320GB IoDrive: http://www.techeye.net/hardware/fusion-io-solid-state-drive-has-a-whole-lot-to-give

The big disadvantage of platter-based storage is that even the fastest enterprise-class SAS drives don't provide more than about 250 IOPS under the best conditions.  This means it would take about 900 SAS drives to match the averaged read/write IOPS of a single Fusion-io Duo.  On the other hand, it only takes about 15 73GB SAS drives to match the Duo's sequential read/write throughput (sequential throughput increases with drive density; 73GB drives typically deliver around 100 MB/sec).
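
For what it's worth, here is the back-of-the-envelope math behind that figure, using the data-sheet IOPS numbers above and the optimistic 250 IOPS per spindle (a rough sketch, estimates only):

-- Rough parity math using the estimates quoted above (not a benchmark).
DECLARE @fusionReadIOPS  int = 185000,   -- IoDrive Duo read IOPS (data sheet)
        @fusionWriteIOPS int = 278000,   -- IoDrive Duo write IOPS (data sheet)
        @sasIOPS         int = 250;      -- best-case IOPS for a single 15K SAS spindle

SELECT (@fusionReadIOPS + @fusionWriteIOPS) / 2              AS AvgFusionIOPS,    -- ~231,500
       ((@fusionReadIOPS + @fusionWriteIOPS) / 2) / @sasIOPS AS SASDrivesNeeded;  -- ~926, i.e. "about 900"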

The other significant advantage of the Fusion-io device is that it is a PCIe 2.0 card, which means that on the Intel platform, processors can reach the storage device at the full QuickPath Interconnect (QPI) speed, which is 6.4 GT/sec for higher-end Intel Xeon processors.  I'm sure AMD has comparable numbers, but I am not proficient with their hardware.  The PCIe approach also gets around the SATA/SAS bus bottleneck of 600 MB/sec, even for 6 Gb/sec SATA/SAS.  To get around that limitation you take on the added expense of a SAN with HBAs, Fibre Channel, or iSCSI, and even those technologies max out around 4 GB/sec.

Below are charts comparing one Fusion-io MLC 1.2 TB IoDrive Duo with various numbers of 73GB 15K SAS drives, depending on the goal.  To simplify the charts, I average throughput and IOPS across a 50/50 read/write mix, since the read and write numbers for these items are not dramatically different.  I could not find the price of the Fusion-io card, but I'm guessing around $20K based on the fact that a 640GB version costs nearly $12K.  Hopefully I'll get some feedback if I'm way off and will update this post.  I'm using a $300.00 price for a Dell 73GB 2.5-inch SAS drive.  I'm using the 2.5-inch form factor rather than 3.5-inch because 1U servers typically don't support 3.5-inch drives and the trend is toward the smaller form factor, especially considering the infrastructure savings in energy and space.  That becomes more apparent when you see the enormous number of drives realistically required to match a single Fusion-io device.

Let's start with the least impressive application of the IoDrive: maximizing sequential throughput and storage.  If all you are after is sequential throughput and/or capacity, going the Fusion-io route will cost you about 4 times as much, so you probably don't want one of these devices just for that.  As you'll see in the next chart, the big advantage of the Fusion-io concerns random access.

[Chart: cost of matching the IoDrive Duo's sequential throughput and storage with SAS drives]

The second chart shows what it would take to get the same average IOPS from SAS drives as from the Fusion-io card.  This is a somewhat ridiculous comparison, since 900 drives means a significant infrastructure investment in SANs and HBAs.  I ball-parked $1,000 for every 16 drives to cover controllers and storage arrays.  If you really want IOPS, traditional storage is going to be significantly more expensive from both an initial-cost and an operational point of view.  The estimated initial price to achieve the IOPS of the Fusion-io IoDrive Duo using traditional storage is over 16 times higher!
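
For transparency, here is the rough math behind the "over 16 times" figure, using the assumptions above (about 900 drives at $300 each, $1,000 of controller/array cost per 16 drives, and my $20K guess for the IoDrive Duo):

-- Back-of-the-envelope cost comparison using the estimates stated above.
DECLARE @drives        int   = 900,        -- drives needed to match the Duo's IOPS
        @pricePerDrive money = 300.00,     -- estimated Dell 73GB 2.5-inch SAS price
        @infraPer16    money = 1000.00,    -- ball-parked controller/array cost per 16 drives
        @fusionPrice   money = 20000.00;   -- guessed price of the 1.2 TB IoDrive Duo

SELECT @drives * @pricePerDrive + CEILING(@drives / 16.0) * @infraPer16                  AS EstimatedSASCost,    -- ~$327,000
       @fusionPrice                                                                      AS EstimatedFusionCost,
       (@drives * @pricePerDrive + CEILING(@drives / 16.0) * @infraPer16) / @fusionPrice AS CostRatio;           -- ~16x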

[Chart: cost of matching the IoDrive Duo's IOPS with SAS drives]

The final chart compares throughput, storage, read latency, write latency, and IOPS when spending the same amount on both storage technologies.

Here you can see the significant advantages of the IoDrive Duo at the same price point for IOPS and read latency, near equivalence for write latency, and advantages for the SAS drives in capacity and throughput.

[Chart: throughput, storage, latency, and IOPS at equal spend]

So, it really comes down to what type of access you are doing (sequential versus random), whether you need fault tolerance, and whether you need lots of storage or lots of speed.  My approach has been to use the Fusion-io devices for my database files and TempDB, and a RAID-10 array of traditional drives for the log.  However, flash capacity appears to be increasing at a faster rate than that of platter-based storage, so it does appear that platter-based storage will eventually go the way of the dinosaur within a few years.

I am currently running a pair of Fusion-io SLC cards in a Dell R610 along with 6 SAS drives on one of my servers at home as part of my non-Microsoft outside research work. Two of the SAS drives are mirrored for the OS and the other four are in a RAID-10 configuration. The Fusion-io drives are software-partitioned into a couple of volumes (I'm experimenting with software-based RAID-1 as well as RAID-0). The machine has dual quad-core 2.66 GHz Xeon processors and 48GB of RAM, with 24GB reserved for SQL Server and the remaining memory used by several virtual machines. It contains, among other things, a database used to store stock histories, including intraday data and simulations. Normally I back up the database across a 1Gb/sec network to another server. I thought it would be interesting to see the statistics for backing up the database from the SSD devices to the 4 x SAS RAID-10 volume.

Below is a quick test of a database backup. Although the database is only around 150GB with compression, it is still much larger than the 24GB allocated to SQL Server for caching. I managed to get 361MB per second on an R610 dual-Xeon quad-core with 48GB of RAM; the limiting factor was the write speed of the receiving RAID-10 device, which only had 4 spindles.

BACKUP DATABASE [tp_v5] TO DISK = N'H:\sqlbackup\folibackuptest.bak'
WITH NOFORMAT, NOINIT, NAME = N'tp_v5-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, COMPRESSION, STATS = 10
GO

10 percent processed.
20 percent processed.
30 percent processed.
40 percent processed.
50 percent processed.
60 percent processed.
70 percent processed.
80 percent processed.
90 percent processed.
Processed 40944 pages for database ‘tp_v5’, file ‘TradingOptimizer_v2’ on file 1.
Processed 159088 pages for database ‘tp_v5’, file ‘History09’ on file 1.
Processed 66336 pages for database ‘tp_v5’, file ‘History01’ on file 1.
Processed 62792 pages for database ‘tp_v5’, file ‘History02’ on file 1.
Processed 67280 pages for database ‘tp_v5’, file ‘History03’ on file 1.
Processed 60064 pages for database ‘tp_v5’, file ‘History04’ on file 1.
Processed 69616 pages for database ‘tp_v5’, file ‘History05’ on file 1.
Processed 84704 pages for database ‘tp_v5’, file ‘History06’ on file 1.
Processed 102512 pages for database ‘tp_v5’, file ‘History07’ on file 1.
Processed 131456 pages for database ‘tp_v5’, file ‘History08’ on file 1.
Processed 1144248 pages for database ‘tp_v5’, file ‘HistoryData2a’ on file 1.
Processed 567240 pages for database ‘tp_v5’, file ‘LoadData2’ on file 1.
Processed 395288 pages for database ‘tp_v5’, file ‘LoadData01’ on file 1.
Processed 287648 pages for database ‘tp_v5’, file ‘MiscData2’ on file 1.
Processed 232264 pages for database ‘tp_v5’, file ‘MiscData3’ on file 1.
Processed 148624 pages for database ‘tp_v5’, file ‘History10’ on file 1.
Processed 64 pages for database ‘tp_v5’, file ‘Load00’ on file 1.
Processed 66824 pages for database ‘tp_v5’, file ‘Load01’ on file 1.
Processed 167816 pages for database ‘tp_v5’, file ‘History09b’ on file 1.
Processed 182648 pages for database ‘tp_v5’, file ‘History09c’ on file 1.
Processed 199664 pages for database ‘tp_v5’, file ‘History09d’ on file 1.
Processed 190000 pages for database ‘tp_v5’, file ‘History10a’ on file 1.
Processed 203648 pages for database ‘tp_v5’, file ‘History10b’ on file 1.
Processed 199712 pages for database ‘tp_v5’, file ‘History10c’ on file 1.
Processed 84376 pages for database ‘tp_v5’, file ‘History10d’ on file 1.
Processed 160 pages for database ‘tp_v5’, file ‘Archive01’ on file 1.
Processed 65920 pages for database ‘tp_v5’, file ‘Load02’ on file 1.
Processed 77208 pages for database ‘tp_v5’, file ‘Load03’ on file 1.
Processed 73488 pages for database ‘tp_v5’, file ‘Load04’ on file 1.
Processed 72728 pages for database ‘tp_v5’, file ‘Load05’ on file 1.
Processed 80248 pages for database ‘tp_v5’, file ‘Load06’ on file 1.
Processed 78896 pages for database ‘tp_v5’, file ‘Load07’ on file 1.
Processed 80944 pages for database ‘tp_v5’, file ‘Load08’ on file 1.
Processed 90816 pages for database ‘tp_v5’, file ‘Load09’ on file 1.
Processed 95600 pages for database ‘tp_v5’, file ‘Load10’ on file 1.
Processed 85848 pages for database ‘tp_v5’, file ‘Load11’ on file 1.
Processed 96328 pages for database ‘tp_v5’, file ‘Load12’ on file 1.
Processed 82880 pages for database ‘tp_v5’, file ‘Load13’ on file 1.
Processed 72600 pages for database ‘tp_v5’, file ‘Load14’ on file 1.
Processed 89216 pages for database ‘tp_v5’, file ‘Load15’ on file 1.
Processed 83320 pages for database ‘tp_v5’, file ‘Load16’ on file 1.
Processed 83472 pages for database ‘tp_v5’, file ‘Load17’ on file 1.
Processed 85880 pages for database ‘tp_v5’, file ‘Load18’ on file 1.
Processed 238520 pages for database ‘tp_v5’, file ‘Load19’ on file 1.
Processed 224080 pages for database ‘tp_v5’, file ‘Load20’ on file 1.
Processed 169024 pages for database ‘tp_v5’, file ‘Load21’ on file 1.
Processed 176960 pages for database ‘tp_v5’, file ‘Load22’ on file 1.
Processed 29040 pages for database ‘tp_v5’, file ‘Load23’ on file 1.
Processed 16 pages for database ‘tp_v5’, file ‘Load24’ on file 1.
Processed 16 pages for database ‘tp_v5’, file ‘Sim42’ on file 1.
Processed 334904 pages for database ‘tp_v5’, file ‘Intraday01’ on file 1.
Processed 321640 pages for database ‘tp_v5’, file ‘Intraday’ on file 1.
100 percent processed.
Processed 1 pages for database ‘tp_v5’, file ‘TradingOptimizer_v2_log’ on file 1.
BACKUP DATABASE successfully processed 7804609 pages in 169.215 seconds (360.331 MB/sec).

To confirm that the RAID-10 array was the bottleneck, I ran a backup to the NUL device.
BACKUP DATABASE [tp_v5] TO DISK = 'NUL:'
-- N'g:\sqlbackup\folibackuptest.bak'
WITH NOFORMAT, NOINIT, NAME = N'tp_v5-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, COMPRESSION, STATS = 10
GO

10 percent processed.
… (No sense boring you again with SQL output)
100 percent processed.
Processed 1 pages for database ‘tp_v5’, file ‘TradingOptimizer_v2_log’ on file 1.
BACKUP DATABASE successfully processed 8042633 pages in 148.289 seconds (423.720 MB/sec).

This got the speed to 423 MB/sec.  I was surprised it wasn't higher; I don't know whether this was a quirk of backing up to the NUL device, or whether it was slowed down because the transaction log is also on the RAID-10 device.  I'm still not sure why it wasn't faster, and hopefully I can figure that out.  However, even at that speed, if we project the performance out using 4 Fusion-io IoDrive Duo cards (http://www.fusionio.com/products/iodriveduo/) in RAID-0 on a 4U Dell R900, we should get roughly 6 times the throughput, as long as we have an equivalently equipped receiving server (also with 4 Duos) and a fast network (i.e., InfiniBand) between them.  That would give us over 3 GB/sec backup speed, which is in the upper range of the backup throughputs publicly recorded for a production SQL Server database.

NOTE: See my update at http://blogs.msdn.com/b/microsoftbob/archive/2010/11/08/pcie-ssd-update-optimizing-sql-backup-performance.aspx for the latest figures.  I was able to get the backup-to-disk speed over 618 MB/sec and the backup to NUL to 1.42 GB/sec.
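
The linked follow-up post has the actual settings; purely as an illustration of the kind of knobs involved, BACKUP lets you raise the buffer count and transfer size and stripe the backup across multiple files.  The values and paths below are placeholders, not the settings from that post:

-- Illustrative only: larger buffers/transfers and striping across files are the usual
-- levers for backup throughput. Paths and values here are hypothetical.
BACKUP DATABASE [tp_v5]
TO DISK = N'H:\sqlbackup\folibackuptest_1.bak',
   DISK = N'H:\sqlbackup\folibackuptest_2.bak'
WITH COMPRESSION,
     BUFFERCOUNT = 64,            -- number of I/O buffers in flight
     MAXTRANSFERSIZE = 4194304,   -- 4 MB per transfer (the maximum allowed)
     STATS = 10
GO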

One argument against using PCIe SSDs is that they cannot be used to store the database in a failover cluster, because local PCIe storage cannot be failed over.  However, consider the following excerpt from http://www.storagesearch.com/infiniband.html:

Fusion-io SSDs deployed in nuclear safety program

Editor:- June 15, 2010 – today it was announced that Lawrence Livermore National Laboratory has deployed over 100TB of Fusion-io’s dual 320GB enterprise MLC ioMemory modules deployed in ioSAN carrier cards that connect over Infiniband in a testbed project to develop high-performance simulation computing capabilities needed to ensure the safety, security and reliability of the nation’s aging nuclear deterrent without underground testing.
With the addition of Fusion-Powered I/O, the Hyperion testbed will deliver over 40 million IOPS and 320GB/s of bandwidth from just 80 1U appliances (2 racks) compared to 43 racks and more power if implemented in an HDD based system.

So, it does not appear we are that far away from Fusion-io modules being usable within the context of a SAN.  There is also the option of allowing TempDB to be stored on local storage rather than requiring it to be on SAN storage for cluster support.  Storing TempDB on shared storage does not add any fault tolerance over local storage, since TempDB is recreated whenever a SQL instance starts up, which is exactly what happens on failover.  There are a few folks pushing for support of this (including yours truly), as it seems more of an installation/setup problem than an actual operational problem, and it can have a significant impact for applications that make heavy use of complex queries that can only be resolved in TempDB.  There are ways around this, and some people are using TempDB on local storage (including RAM disks or SSDs) on SQL clusters, but since this is not supported, you'll have to Bing for them and I can't provide links.
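
To illustrate why this feels like a setup restriction rather than an operational one: relocating TempDB is a one-time metadata change that takes effect the next time the instance starts, which is also when TempDB is rebuilt (for example, after a failover).  Here is a minimal sketch with hypothetical local-SSD paths – and again, not a supported configuration on a cluster at the time of writing:

-- Point the TempDB files at a local SSD volume (hypothetical paths).
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'S:\tempdb\tempdb.mdf');
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'S:\tempdb\templog.ldf');
GO
-- The new locations are used at the next instance start, when TempDB is recreated from scratch.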

SSD is going to have an impact, especially in PCIe or InfiniBand flavors that allow the speed of flash memory to be fully leveraged.  I have posted earlier about how some queries in SQL Server improve dramatically when the optimizer chooses more random-access-style plans (seeks versus scans) on SSD.  Relational database products, including SQL Server, will need to get "smarter" about what the underlying storage subsystem is capable of in order to optimize query plans correctly.

There are many good articles beyond the ones already referenced here.  Fusion-io has made these devices available to a few select individuals in the SQL community who have blogged about their experiences.  Here are a few:

http://www.brentozar.com/archive/2010/03/fusion-io-iodrive-review-fusionio/

http://www.sqlskills.com/BLOGS/PAUL/post/New-hardware-to-play-with-Fusion-io-SSDs.aspx

http://www.mysqlperformanceblog.com/2010/01/11/fusionio-320gb-mlc-benchmarks/

DISCLAIMER: I am a Microsoft employee and not a representative of Fusion-io.  The numbers relating to performance, price, and all other metrics related to the product are estimates only.  I do not guarantee the accuracy of any of the numbers, charts, or statements regarding the Fusion-io products.  My analysis should not be considered an endorsement of Fusion-io or its products and is provided for informational purposes only.
