cx380 RAID Group and file system can't keep up


rondebbs (MIS)
Dec 28, 2005
Hello,

We are running a statistical application called SAS on AIX 5.3, connected to a cx380 with 8GB of cache on each SP. There is a temporary/work file system called /saswork. Users kick off huge queries and much of the sorting, etc., is done in /saswork. Currently this file system sits on a 3+1 RAID 5. At times the file system does over 1200 IOPS; the drives cannot keep up, cache gets flooded, and I end up with far too many forced flushes. This seems to be impacting other applications that share the array.

If each of my 146GB 10K drives can do an average of 120 IOPS, then I should have at least 10 drives to handle 1200 IOPS without flooding cache.
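
A rough back-of-the-envelope check of that arithmetic, as a Python sketch - the 120 IOPS/drive figure is the assumption above, and this ignores any RAID write penalty, which comes up further down the thread:

import math

target_iops = 1200       # peak observed on /saswork
per_drive_iops = 120     # assumed average for a 146GB 10K drive

# Naive spindle count, ignoring the RAID write penalty
print(math.ceil(target_iops / per_drive_iops))   # 10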

The file system is made up of 4 LUNs, but LUN16 gets most of the IOs. It is the first 100GB LUN in the file system, and it seems like most jobs can get all their work done on this first LUN without needing the others. This is by far the busiest LUN on the array and it tends to bog down SPA. When I redo this I will use AIX Logical Volume striping so that IOs are spread equally across all four LUNs - two on SPA and two on SPB.

I have 12 available 146GB drives (I also have thirty 300GB drives, but I don't know if I want to use those big guys as I don't need much space). Should I do a 10+1 RAID 5? That seems like a lot of drives. Should I do two separate 5+1 RAID groups? If I get a few more drives, maybe I could do two 6+1 or 7+1.

I have never used multiple RGs for one file system. Are there performance impacts? I could have LUN1 and LUN3 in the first RG, and LUN2 and LUN4 in the second. As heavy writes occur, they would be going to all four LUNs in both RGs. Does this make sense, or should I stick to a single RG?

Thanks

Brad
 
Is your IO all read or write (or what percentages of each)?

If you are doing heavy reads, then RAID 5 will be better, as you'll get more IOPS for the number of disks. If you are doing heavy writes, then RAID 10 will give you better performance than RAID 5, as RAID 10 doesn't have to calculate parity.

You might be better off using the 300 Gig drives, as you can put more spindles behind the file system.

Denny
MCSA (2003) / MCDBA (SQL 2000)
MCTS (SQL 2005 / Microsoft Windows SharePoint Services 3.0: Configuration / Microsoft Office SharePoint Server 2007: Configuration)
MCITP Database Administrator (SQL 2005) / Database Developer (SQL 2005)

My Blog
 
I have used 3 RGs for one file system, with the AIX LVs set up for maximum range and no striping, for good load balancing.

Tony ... aka chgwhat

When in doubt,,, Power out...
 
A little more info - because this is a "work" file system, it is a little strange. It looks like 55% writes and 45% reads. I'm guessing that an initial step writes data out to /saswork, where it is sorted and manipulated for use later. Each job usually cleans up/deletes its files at the end.

The RAID 5 must be incurring a significant write penalty. Maybe I'm better off with a RAID 10 6+6. It just seems like it will only use the power of 6 spindles and then mirror that to another six. Am I really getting the performance of 12 spindles, or just 6?

I may be able to get 4 more drives later. Can I simply add/expand the 4 drives to my 6+6 without any problems? This would give me an 8+8.

Also, once I have my 8+8, does it make sense to add a second 4+4 or 8+8 RG using some of the 300GB drives if Analyzer shows that I still need more drives? Maybe I would bind new LUNs in the 300GB drive RG and then do a striped metaLUN expansion of the original 8+8 LUNs.

It seems like I may need all these drives because of the amount of writes.


Thanks - Brad
 
It definitely sounds like RAID 10 will be your best bet. For writes, you will only get the performance of the 6 drives, not the full 12.

I believe that to add disks to the RAID group you'll need to break the RAID group and recreate it. That being the case, you could do a LUN migration to the 300 Gig disks, make the change, and then LUN migrate back to the new 8+8 RAID 10.

Denny
MCSA (2003) / MCDBA (SQL 2000)
MCTS (SQL 2005 / Microsoft Windows SharePoint Services 3.0: Configuration / Microsoft Office SharePoint Server 2007: Configuration)
MCITP Database Administrator (SQL 2005) / Database Developer (SQL 2005)

My Blog
 
At the most basic level, the IO capacity of the underlying disk needs to match that of the application workload you place upon it.


Your application, doing sorts to temporary working files, appears at face value to be write intensive. You could verify this by collecting the appropriate disk IO counters for your platform. While you're at it, quantify the overall load in terms of IOPS, read/write ratio, and average IO size.


Application workloads that write infrequently and mostly read have a high read/write ratio, >4:1. An example of such an application workload would be that of a file server. At the other end of the spectrum, application workloads that write as much as they read have a low read/write ratio, an example of which would be the database backend for an OLTP system.

RAID 5 has a write penalty of 4. This means that every write operation actually consists of 4 disk operations: read the data, read the parity, write the data, then calculate and write the new parity. Each read operation on RAID 5 takes one IO: read the data.

RAID 1, 10, or 0+1 has a write penalty of 2: for each write you must write the data and then write the mirror. RAID 1 reads require one operation: read the data.

A single spindle or a RAID 0 stripe has a write penalty of 1 (none). Each write takes one IO operation, and each read takes one IO operation as well.


RAID 1/10/0+1 provides redundancy and can survive multiple disk failures as long as either the data or the mirror survives for any element. RAID 5 provides redundancy against single drive failures by reconstructing the data from the remaining elements and parity. RAID 6, which we'll go ahead and mention here, is similar to RAID 5 but can survive a double disk failure in a RAID set because there is an additional parity element.

The data protection against disk failure isn't free. RAID 1/10/0+1 costs you half your raw capacity. Where N is the number of spindles in a RAID 5 array, RAID 5 provides space equal to N-1 spindles; it costs you one spindle's worth of raw capacity.


Where P is the performance of a spindle in IOPS at the target response time, and N is the number of spindles in the array:

RAID 1/10/0+1 write performance = P*N/2
RAID 1/10/0+1 read performance = P*N
RAID 5 write performance = P*(N-1)/4
RAID 5 read performance = P*(N-1)

If my spindle can sustain 100 IOPS at the target response time, and the read/write ratio is 1:1, then a 4-spindle array will provide:


RAID 10/0+1 = ~267 IOPS
RAID 5 = 160 IOPS
RAID 0 = 400 IOPS
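
To make those numbers reproducible, here is a minimal Python sketch of the arithmetic. It assumes the blended cost model implied above (each read costs one back-end IO, each write costs the RAID type's write penalty) and, like the worked figures, counts all N spindles for every RAID type:

def host_iops(p, n, write_penalty, read_frac=0.5):
    # p = IOPS per spindle at the target response time, n = number of spindles
    backend_capacity = p * n
    cost_per_host_io = read_frac * 1 + (1 - read_frac) * write_penalty
    return backend_capacity / cost_per_host_io

for name, penalty in [("RAID 10/0+1", 2), ("RAID 5", 4), ("RAID 0", 1)]:
    print(name, round(host_iops(100, 4, penalty)))
# RAID 10/0+1 267, RAID 5 160, RAID 0 400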

That should give you an idea of what the RAID type is costing you in terms of PERFORMANCE, rather than simply focusing on space.


Now, with all of that said, explained, and out of the way, let's revisit your application: "There is a temporary/work file system called /saswork. Users kick off huge queries and much of the sorting etc is done in /saswork"

If this is truly a temporary scratch working area, why is redundancy required at all? RAID 0 will give you the best performance for a write intensive workload for a given number of spindles. RAID 10 will give you about 66% of the performance of RAID 0, and will give you protection against multiple drive failures (as long as either the data or the mirror for a given element survives). RAID 5 comes in dead last at a paltry 40% of the performance of RAID 0 and gives you less fault tolerance than RAID 10.

If you need the fault tolerance, go with RAID 10; if you don't need it, go with RAID 0 for the best performance.

All of the above discussion can be distilled into a simple rule of thumb for selecting RAID types. I recommend you follow it in the future:

IF THE WRITE PENALTY OF A GIVEN RAID TYPE IS GREATER THAN THE READ/WRITE RATIO OF YOUR APPLICATION WORKLOAD, THEN THAT RAID TYPE IS A POOR FIT FOR YOUR APPLICATION WORKLOAD.
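
As a sanity check, that rule can be written as a one-line Python predicate; the sketch below uses the /saswork mix (roughly 45:55 reads to writes) and the file-server example from earlier:

def poor_fit(read_write_ratio, write_penalty):
    # Poor fit when the write penalty exceeds the workload's read/write ratio
    return write_penalty > read_write_ratio

print(poor_fit(45 / 55, 4))   # True:  write-heavy /saswork on RAID 5
print(poor_fit(45 / 55, 2))   # True:  even RAID 10 pays a price here
print(poor_fit(5.0, 4))       # False: a read-heavy file server is fine on RAID 5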


Unless your storage subsystem is drastically reshaping the workload at the virtualization layer, the rule holds for SAN as well. (It doesn't on a Clariion, and only minimally on a Symmetrix; the virtualization layer is just not robust enough on any EMC platform. You're limited in what you can do by the degrees of freedom at the virtualization layer. All EMC has is IO delay strategies to optimize seeks during overwrites, plus cache. That might, in the most optimal conditions, knock a write penalty of 4 down to 3.8. Other vendors do a far better job, at least in one case completely eliminating the write penalty.)


XMSRE
 
In addition, don't forget to correctly align your disk partitions. That will help reduce some of the additional penalties you otherwise incur on both reads and writes.

XMSRE brings up an excellent point: since /saswork is just scratch space, RAID 0 might be an excellent option in your case.

Denny
MCSA (2003) / MCDBA (SQL 2000)
MCTS (SQL 2005 / Microsoft Windows SharePoint Services 3.0: Configuration / Microsoft Office SharePoint Server 2007: Configuration)
MCITP Database Administrator (SQL 2005) / Database Developer (SQL 2005)

My Blog
 
A long time ago, back when hard drives had a fixed number of sectors per track (remember 9GB hard drives?), disk alignment made a big difference. Today, hard drives have a variable number of sectors per track, from 128 to 256ish as you go across the surface, and disk alignment has a lot more to do with aligning cache slots than anything else. If your vendor doesn't already do it for you (no, EMC doesn't), it's worth a try; it's just that you don't get the same bang you used to. It does help make your controller cache more efficient, though, and you'll probably see a small percentage performance increase. YMMV.

 
Thanks all for great info. Xmsre, excellent lesson.
 
One more question. In my current 3+1 RAID 5, for each write, parity is calculated and also written to disk. Does the parity go on a different disk for each write of the data? I never really understood how that works.
 
In RAID 4, parity is on a dedicated parity disk. In RAID 5, parity is dispersed across all the spindles in the set. For a given slice, data and parity never reside on the same spindle; for consecutive slices, the parity is on a different spindle.
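
A toy Python sketch of that rotation, just to make it concrete (the exact layout a Clariion uses may differ, but the idea is the same):

def raid5_layout(num_disks, num_stripes):
    # Toy left-symmetric layout: parity rotates one disk per stripe
    rows = []
    for stripe in range(num_stripes):
        parity_disk = (num_disks - 1 - stripe) % num_disks
        row, d = [], 0
        for disk in range(num_disks):
            if disk == parity_disk:
                row.append("P")
            else:
                row.append("D%d.%d" % (stripe, d))
                d += 1
        rows.append(row)
    return rows

for row in raid5_layout(4, 4):   # a 3+1 set
    print(row)
# Parity lands on a different disk for each consecutive stripe,
# never on the same spindle as that stripe's data.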
 
Well, one more thing just came up as I started to create my new RG with the 12 available 146GB drives. It looks like 4 of them might be vault drives - 000, 001 etc.

Hmmm, I'd rather not use those drives for this busy RG. I do have 30 300GB drives available. Earlier I preferred the 146GB drives, as I did not need all the space on the 300GB drives. If I use 12 of the 300GB drives I will be burning over 3TB of disk to get my 400GB file system. Now, because of the vault drive issue, I may need to use the 300GB drives.

However this gives me some flexibility because now I'm not limited to only 12 drives. I could create two 6+6 RGs and stripe expand the luns in the first RG to the luns in the second RG using the 300GB drives.

I believe the 146 and 300GB drives are both 10K. Will the response time be the same for either drive size, or will the 300GB drives be slower for some reason? If I made one RG with 146GB drives and stripe expanded to LUNs in a 300GB RG, would that matter?

Thanks again.
 
vault disks:

Essentially the root volume for the storage processor: code, and I believe it persists cache there when you power down the SP. Not a lot of IO, really.

IOPS/Spindle:

For the same speed and interface, the IOPS will be roughly the same (there's a bit of variance between disk manufacturers). For every disk, there is an IOPS/response time curve: as the IOPS per spindle increase, so does the response time. The maximum for a 10K spindle is about 180 random IOPS, but this number assumes unlimited response time (read: more than 1 second). Sequential IO numbers are higher because the track-to-track time is lower than the average seek time. Some vendors float this number, but they're not being completely honest with the customer. Would your application tolerate 1 second response times? IOPS numbers are essentially meaningless without details concerning IO size, response time, and IO pattern.

A 10K spindle generally gets about 100 random 8K IOs at a 20ms response time. From that you have to subtract the overhead of your filesystem; for most journaling filesystems that's 15% or so off the top. Figure 85.
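
Putting those figures together with the workload numbers from earlier in the thread, a rough Python sketch (the RAID 10 write penalty of 2 is an assumption here, and this is an illustration rather than a sizing recommendation):

import math

usable_per_spindle = 100 * (1 - 0.15)   # ~85 IOPS left after journaling overhead

host_iops = 1200                        # peak on /saswork
write_frac = 0.55
backend_iops = host_iops * ((1 - write_frac) + write_frac * 2)   # RAID 10 penalty = 2

print(round(backend_iops))                           # ~1860 back-end IOPS
print(math.ceil(backend_iops / usable_per_spindle))  # rough spindle count at 20ms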


Striped/expanded 300s:

I'd have to go back and look at the manual, but I think if you do that, the 300s get a disproportionate percentage of the overall IO. That could lead to slower response times because you're packing more IOPS onto those spindles - see above.
 