Newbie Question About FastT


mjldba
Technical User
Oct 29, 2003
Hi everyone,

We've been using RS/6000 servers for many years & this is our first experience with a pSeries 630 & a FastT200 SAN connected with 1 Gb/sec Fibre Channel.

The throughput is terrible!!

I'm using AIX 5.1 & Informix IDS 9.30.UC3 on both a model 620 and a model 630. It takes 12 minutes to do a level 0 backup of 12GB of data on the 620, but it takes 77 minutes to do a level 0 backup of 24GB of data on the 630.
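For comparison, those figures work out to roughly 12288 MB / 720 sec ≈ 17 MB/sec effective on the 620 versus 24576 MB / 4620 sec ≈ 5.3 MB/sec on the 630.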

Using the SMClient performance monitor, I see steady throughput of 4.8 - 5.5MB/sec, which are pathetic numbers for a 1 Gb/sec Fibre Channel connection. I've had one isolated burst of 7MB/sec .... WOW!!!

I get the same throughput measurements when I copy a 750MB directory of files from datavg (FastT200) to rootvg (630) so I'm pretty sure it's not a database tuning problem.

Can anyone suggest where I can begin trying to resolve this?

PS I'm more of a "goo" than a guru. Zero AIX/IDS training but it was my turn to "accept more responsibility"

Thanks


 
How is your FastT LUN RAID setup? Is the LUN in a RAID1 or RAID5 configuration?

What are the queue depths set to on the volumes?

What type of HBA's are you using?

What AIX ML are you at?

What version are your FC disk drivers?
 
Hi mjldba,

How do you take level 0 backups and what is your backup device?

While we were using Informix 7.31, it took about 4 hours to do a level 0 backup with ontape (90 GB of data) to a DLT 8000 on HP-UX.

Do you back up to the same devices from the 620 and the 630?

Do the tables on the 630 have more extents than the ones on the 620? A high number of extents per table can increase the backup time.
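One quick way to compare extent counts, assuming dbaccess can reach the sysmaster database (the query below is only a sketch, not something from this thread):

#dbaccess sysmaster - <<'EOSQL'
select dbsname, tabname, count(*) num_extents
from sysextents
group by dbsname, tabname
order by 3 desc;
EOSQL

oncheck -pt <database>:<table> should also report extent information for a single table.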

'I get the same throughput measurements when I copy a 750MB directory of files from datavg (FastT200) to rootvg (630) so I'm pretty sure it's not a database tuning problem.'

I suppose that your rootvg is on your local disks, not on the FastT, so the slowness could be related to your internal disks. Try creating a large file on the FastT with the 'dd' command.
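For example, a rough sequential test might look like this sketch (the mount point /fastt_fs is just a placeholder for a filesystem that lives on the FastT; timing a 1GB write and read gives an easy MB/sec figure):

#time dd if=/dev/zero of=/fastt_fs/ddtest bs=1024k count=1024
#time dd if=/fastt_fs/ddtest of=/dev/null bs=1024k
#rm /fastt_fs/ddtest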


Baanman


 
Hi Comtec17,

The FastT LUN was set up as RAID5 but this was recently changed to RAID10 at the suggestion of an IBM VAR.

Queue_depth for hdisk2 is 16 and num_cmd_elems is 1024 for the FC adapter; I found that information on this BB.

Don't know what an HBA is...please explain.

AIX is 5.1 ML4

The FC disks are 73.6GB 10K drives and the firmware is B353; am I answering your question?

Hi Baanman,

Level 0 backups are done using the ontape command native to IDS; the backup device on both machines is an Ultrium drive.

Each AIX box has its own dedicated 4mm drive for logical backups and an Ultrium drive for level 0 backups & nightly file system backups via cron.

I defrag DBs as needed. Presently the DBs on the 620 (PeopleSoft HRMS) are not heavily fragmented, with no table over 10 extents. The 630 (PeopleSoft Financials) has some tables at 20+ extents (most are < 5) because the PeopleSoft-provided Upgrade Assistant product is constantly doing table rebuilds & I can't specify an initial extent size that would be suitable for all tables without wasting a ton of space.

I tested the speed of copying a 750MB directory structure from datavg to rootvg (thereby creating the directory on rootvg), and throughput measured with the SMClient performance monitor was < 7MB/sec. I then performed the same copy in reverse (rootvg to datavg) and performance was still < 7MB/sec.

Can you give me an idea of what type of throughput I should be seeing between the 630 & the FastT?

Comparing the 620 to the 630 is like comparing a Pentium4 to a 386, it's that bad.

Thanks
 
Something to check: run iostat -a 1 50.
The -a option will show you the total throughput on the adapter...

Also, if this is a new system, change the block size on your tape drive; it is usually a problem. Check the 620 too.

Adam

Also, the HBA is your fibre card. Are you running 2Gb or 1Gb, dual adapters with SDD?
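For reference, the tape block size can be checked and changed per drive, along these lines (rmt0 is just an example device; block_size=0 means variable-length blocks, and a larger fixed size often helps streaming drives, but verify what your Ultrium and 4mm drives support before changing anything):

#lsattr -El rmt0 -a block_size
#chdev -l rmt0 -a block_size=262144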
 
Hi Adam,

Thanks for responding.

Block size is 512 for all our tape devices on all the AIX boxes. I've never had a problem before, so I haven't modified any tape device parameters. What block size would you suggest?

I have 2 HBAs; the FastT200 uses 1Gb. The IBM VAR set up the FastT with 1 LUN using 1 controller, and the second controller provides failover protection. I'm looking into breaking up the single LUN into 2 LUNs and dedicating one controller to each LUN, thereby using both controllers and splitting the load. I was also told (by IBM) that they can be set up to provide failover protection for each other, so if one fails the other will carry the full load until repairs are done.

The consultants are gone for the weekend so the 630 is dormant. I can generate activity by updating statistics on one of the DBs, and I get overall I/O wait states of 45 to 55%, Kbps of 4800 to 5300, tps of 900 to 1300, Kb_Read of 4800 to 5200, and Kb_Wrtn of 0 to 220 on hdisk2 (the FastT RAID10 array) and fcs0.

We had a RAID5 array striped across four 73GB disks; now we have a RAID10 array striped across two 73GB disks and mirrored on the other two 73GB disks. Performance is the same with either setup. The drives are 10K.

In hindsight, I would have ordered 8 to 10 36GB 15K drives, giving me more spindles and lower seek times, but I have to live with the reality of the situation until either IBM or the IBM VAR admits they sold us an unworkable solution.

We have the same "RAID5 on four 73GB drives" setup on the 620 (different architecture) and it performs great, so I don't know if this is a contributing factor to the slug-like performance on the 630/FastT setup.
 
Maybe more information can help ...

Is the FastT 200 connected only to the p630, or does it go through a switch so that the FastT 200 is shared with other servers?

if you check
lsdev -C
How many darX and dacX devices do you have (dar0, dar1, dac0, dac1)?

What are the parameters of the FastT 200 (list profile in SMclient)?
Start cache flushing
Stop cache flushing
Cache block size
Media scan duration (in days)
Default host type: (it should be AIX)

Logical drives (with the values I have)
Segment size: 64
Read cache: Enabled
Write cache: Enabled
Write cache with mirroring: Enabled
Write cache without batteries: Disabled
Flush write cache after (in seconds): 10.00
Cache read ahead multiplier: 0
Enable background media scan: Enabled
Media scan with redundancy check: Disabled

I assume you don't have any alerts (state = Optimal) shown in the SMclient.
 
The FastT is connected directly to the 630; there are no switches between these two devices and the FastT is not shared with any other devices. We're treating it like an expansion cabinet for the 630.

We have one router, dar0, and one controller, dac0. We have two controllers but cfgmgr is only picking up controllerA. IBM is working on this but I doubt it's the root cause of my problems because controllerB is set up as a failover device for controllerA. I realize this is not an optimal setup & this will be corrected when there's time.
Right now I have three expensive consultants being less than productive.

Start cache flushing - 80
Stop cache flushing - 80
Cache block size is 4KB
Media Scan Duration -
Default host type is AIX

Logical drive values are the same as yours with the exception of segment size. I chose 8K because the documentation stated it was best for DB's and Informix IDS uses a page size of 4KB; IBM agreed with my settings.

Also, I have a read-ahead multiplier of 2. It was 0 but IBM suggested using 2.

Everything is optimal; we have not had a critical event since 10/10/03, but I have over 1000 green "I"s & I interpret these as (I)nformational.

 
Hi mjldba,

Try changing the parameters mentioned below; this should help improve performance.

Performance improvement for FASTt
-------------------------------------

This info is generally applicable to AIX systems
using FAStT storage.

Config/tuning parameters for AIX using FAStT:

FC adapter attribute changes:

lg_term_dma from 0x200000 to 0x1000000
(from 2 MB to 16 MB)

#lsattr -El fcs0
#chdev -l fcs0 -a lg_term_dma=0x1000000 -P
#cfgmgr
#lsattr -El fcs0

num_cmd_elems from 200 to 1024

#lsattr -El fcs0
#chdev -l fcs0 -a num_cmd_elems=1024 -P
#cfgmgr
#lsattr -El fcs0

DAR attribute changes:

load_balancing was set to yes

#lsattr -El dar0
#chdev -l dar0 -a load_balancing=yes
#lsattr -El dar0

autorecovery was set to yes

#lsattr -El dar0
#chdev -l dar0 -a autorecovery=yes
#lsattr -El dar0

hdisk attribute changes: [VG must be varied-off]

queue_depth from the default to 16

#lsattr -El hdiskX
#chdev -l hdiskX -a queue_depth=16
#lsattr -El hdiskX

prefetch_mult from 0 to 8

#lsattr -El hdiskX
#chdev -l hdiskX -a prefetch_mult=8
#lsattr -El hdiskX

---------------------------------------------------------
Also try changing the vmtune parameters; setting the values listed below can help performance.

These can be set with the command /usr/samples/kernel/vmtune:
-f minfree: 120 x N (default); with 4 CPUs, 480
-F maxfree: 128 x N (default); with 4 CPUs, 512
-R maxpgahead: 32 (so that maxfree - minfree = 32)
-p minperm: 5
-P maxperm: 10
-s 1
-b 200
-B 800

where N is the number of CPUs.
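Putting those together for a 4-CPU box, the call would be something like the sketch below (the values are simply the ones listed above, not tested here; note that vmtune changes do not survive a reboot, and running vmtune with no arguments displays the current settings):

#/usr/samples/kernel/vmtune -f 480 -F 512 -R 32 -p 5 -P 10 -s 1 -b 200 -B 800
#/usr/samples/kernel/vmtune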

Regards

arvibm
 
Are the TAPEBLK parameter values equal in both onconfig files?

Increasing the TAPEBLK value can decrease the backup time, but if you change it you cannot restore from backups made with the old TAPEBLK size, so take a new level 0 backup afterwards.
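A quick way to compare the two instances, assuming $INFORMIXDIR and $ONCONFIG are set in the environment as is usual for IDS:

#egrep 'TAPEDEV|TAPEBLK|LTAPEBLK' $INFORMIXDIR/etc/$ONCONFIG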



 
I've got a case open with IBM & they cannot get AIX to recognize dac1. This is not the root cause of the miserable performance problem, but I think I could have a defective FastT ... I'm pushing them for new hardware.

arvibm - Thanks for your suggestions but I've already changed these settings & saw no improvement in performance.

I found that these values get reset to defaults when I bounce the server, so should I put the commands that modify device parameters in /etc/rc once I get things sorted out?
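As a rough sketch of one common approach (not specific to this thread): chdev attribute changes are stored in the ODM and should normally persist across reboots on their own, while vmtune values do reset at boot and are usually reapplied from an rc script or an inittab entry, for example:

#mkitab "fastttune:2:once:/usr/samples/kernel/vmtune -f 480 -F 512 -R 32 -p 5 -P 10 >/dev/console 2>&1"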

baanman - TAPEBLK parameters are 512 for both rmt0 (the internal 4mm device used for logical logging) and rmt1 (the external Ultrium drive used for level 0 backups & nightly file system backups). Thanks for your input.

I have other, more significant, problems than the length of time it takes to get a level 0 backup; it's just one of the baseline tests that I use to measure performance.

All things being equal I would expect a level 0 backup on the 630 to take about 25 minutes.
 
mjldba wrote:

"We have one router, dar0, and one controller, dac0. We have two controllers but cfgmgr is only picking up controllerA. IBM is working on this but I doubt it's the root cause of my problems because controllerB is set up as a failover device for controllerA. I realize this is not an optimal setup & this will be corrected when there's time."

Maybe you can get rid of this non-optimal configuration if you delete all the devices related to the FastT (definitions included: rmdev -dl dac0), meaning dar0, dac0 and the hdiskXX devices, and then run cfgmgr to recreate them.
You will need to vary off and export the volume groups first.
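Roughly, the sequence might look like this sketch (assuming datavg is the only volume group on the FastT and hdisk2 its only disk; unmount its filesystems first and confirm the device names with lsdev -C before removing anything):

#varyoffvg datavg
#exportvg datavg
#rmdev -dl hdisk2
#rmdev -dl dac0
#rmdev -dl dar0
#cfgmgr
#lsdev -C | egrep 'dar|dac'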

Let me know if you can see dac1 then.

I know it is not much, but in some FastT installations this has helped me.
 
This may not provide an answer for my performance problem, but IBM was able to reproduce my "only one controller" problem in the lab & they've found that both controllers will not be detected by cfgmgr if there is only one LUN.

I blew away the RAID10 array & created one RAID5 array with two LUNs, and cfgmgr has detected dac0 & dac1.

I'm loading data & will test later this morning.

Thanks everyone for your help & suggestions, I'll post performance results when available.
 
IBM has been able to duplicate our problem in the lab & they tell me that they see 7MB/sec throughput between the AIX box & the FastT .... surely this can't be as good as it gets going across a 1 Gb/sec Fibre Channel link?

I know 1 Gb/sec is a lab-produced stat, but is 100MB/sec expecting too much?
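As a rough sanity check: 1 Gb/sec Fibre Channel uses 8b/10b encoding, so the usable payload rate is roughly 1 Gbit/sec ÷ 10 bits per encoded byte ≈ 100 MB/sec. At that rate a 24 GB level 0 would need only about 24576 MB ÷ 100 MB/sec ≈ 4 minutes of raw transfer time, so even allowing for tape, RAID and filesystem overhead, a steady 5 MB/sec is a small fraction of what the link can carry.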

I've tried some of the tweaks everyone has suggested and there has been no improvement.

Please post what kind of throughput your FastT is giving you & a brief description of your FastT setup (number of spindles, drive speed, RAID level, segment size if it's for an RDBMS application, etc...).

Thanks
 
We have a FastT 200 talking to a 6H1, dual controllers (an HA setup, they call it), through a Brocade switch.

Our HBA is a 6228. I'm seeing 14-20 MB/sec for filesystem writes, JFS2 I think.

We have 10 LUNs over 10 drives, RAID5. The one controller we're talking to sees the first 5 LUNs.

One thing I did that boosted the throughput a bit was turn off 'enable write caching with mirroring' on each LUN. I would only do that if it fits your situation; we're just in test mode here right now, non-production.

The 14-20 meg is still poor in my book. Maybe it's just because the 200 is the low-end FastT, or maybe I haven't found the right tuning combination yet. We see much better rates out of our 3rd-party fibre channel drives.
 
breslau, thanks for your response.

IBM has been able to duplicate our FastT performance issue in their lab & they're working the problem.

I'll post IBM's solution when it's available.
 
An update:

At IBM's suggestion I tore everything down (for the 6th or 7th time) and created one RAID5 array with 64KB segment size and two equal size LUNs, thereby creating hdisk2 and hdisk3.

Then I removed half of the logical volumes I created with hdisk2 and recreated them using hdisk3.

Now I see activity on both controllers (yeah, baby!!) but performance is unaffected; I'm still seeing throughput in the low-to-mid 5MB/sec range with occasional bursts into the low 6MB/sec range .... pathetic performance.
 
Normally, to achieve the fastest response time from a group of disks, the key is to make them "speak" in parallel.
Your Fibre Channel provides plenty of bandwidth, so you should not have a saturation problem.
A RAID 5 configuration spread across four drives should give you the highest read transfer rate because of the striping.
You can probably use a bigger blocking factor.
 
Some other thoughts:

You mentioned Informix in the original post; are your data files on JFS, JFS2, or raw logical volumes?

JFS2 relies more on physical memory for I/O performance; if your host is low on RAM it could have some effect.

In any case, I would look at your logical partition sizing in conjunction with your I/O patterns. For instance, if your file sizes are small, on the order of tens of megs or less, you'll want to push the LP size lower and perhaps use JFS2. This will force you to create more LUNs on your RAID, since AIX allows a maximum of 1016 partitions per drive. If you go this way, make sure you tell smit to make the LV use the maximum number of drives at creation time; this will hopefully spread your I/O over all the LUNs.

If your file sizes are large (hundreds of megs or more), then you can have fewer LUNs and you won't be hurt so much by having a large disk (LUN). Use JFS here with 'large file enabled' checked.

Overall I've found that creating LUNs of just under 64GB is the best compromise between the two ends, but as you gravitate more to one side you can certainly adjust to your need. Having a transaction-based system with small files on big LPs will suck wind. If you have to, create separate filesystems for large and small files and tune them accordingly. I have gotten good results out of that on a 3rd-party SAN. 'postmark' is a decent tool for testing small-file performance.

Change your queue depth as mentioned above too, especially if you have small files.
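As an illustration of spreading a logical volume across all the LUNs in a volume group and enabling large files on JFS (the names datalv, datavg and /data are placeholders, and 200 is an arbitrary number of logical partitions; -e x asks for the maximum inter-disk allocation):

#mklv -y datalv -t jfs -e x datavg 200
#crfs -v jfs -d datalv -m /data -a bf=true
#mount /data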
 