System hangs when writing to JFS2

Status
Not open for further replies.

nwardez

Technical User
Oct 2, 2002
174
FR
I have an AIX 5.1 server (32-bit kernel) that runs Oracle 8 databases on JFS2 filesystems. I've recently observed that the system hangs while creating large datafiles (2 GB and more).

While testing, I found that the dd command does the same: the shell prompt comes back very quickly (just as the create datafile statement does), but the system freezes until all blocks are actually written to disk (I suppose). CPU activity climbs to "95% wait I/O", disk activity is 100% (~9500 KB written per second), and new processes are queued even if they do not need this disk.
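
For anyone who wants to reproduce this kind of dd test, here is a rough sketch. The helper name write_test_file, the block size and any target path are placeholders chosen for illustration, not the exact values used in the test described above.

```shell
# Hypothetical helper: write SIZE_MB megabytes of zeros to TARGET with dd.
# Both arguments are supplied by the caller; nothing here comes from the post.
write_test_file() {
    target=$1
    size_mb=$2
    # bs=1024k writes 1 MB blocks; count is the number of such blocks
    dd if=/dev/zero of="$target" bs=1024k count="$size_mb" 2>/dev/null
}
```

Calling it with a size of 2048 would produce a 2 GB file; running vmstat 1 in a second session while the write is in progress shows the wait I/O figure.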

On the other hand, the cp command on this big file works fine (it does not give the shell prompt back very quickly, but it does not disturb other processes).

This JFS2 filesystem is on hdisk2, which is a 128 GB LUN. We use an FC link to a StorageTek array. On this array, the LUN is defined as 16 x (8 GB disk, model 3390) (virtual, as in fact this RAID 5 array is physically made of 18 GB hard disks ...) :p

Of course, the IBM, StorageTek and Oracle support teams can't help arguing with each other, and nothing comes of it.


My questions are:

1. I've run tests and noticed that the system hangs appear on whatever physical storage I use when it's a JFS2 filesystem (even on a lightly loaded system). Has anyone already noticed this, and what did you do? I have just ordered ML 3 and I'd like advice from people who may have encountered this problem.

2. Which values should I use to best configure the AIO servers? Should I compute total aio = (number of aio servers per disk) x (number of disks) based on:
-a- the number of LUNs (=1), quite small
-b- the physical number of disks in use in the RAID array (=7), which is what I did
-c- the number of model 3390 disks defined in the RAID array (=16), quite high
What about adjusting the priority of the aio processes?
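
To make the arithmetic behind the three options concrete, here is a small sketch. The per-disk figure of 10 is only an assumed placeholder, not a recommendation, and aio_total is a hypothetical helper name.

```shell
# total AIO servers = (servers per disk) x (number of disks)
aio_total() {
    servers_per_disk=$1
    disks=$2
    echo $(( servers_per_disk * disks ))
}

PER_DISK=10   # assumption for illustration only

# Options -a-, -b- and -c- from the question: 1 LUN, 7 physical disks, 16 model 3390 disks
for disks in 1 7 16; do
    echo "disks=$disks total=$(aio_total "$PER_DISK" "$disks")"
done
```

With these assumed inputs the three options give totals of 10, 70 and 160, so the choice of disk count changes the result by more than an order of magnitude.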


Any ideas are useful. Thanks in advance.
 
Solution:

Since disk I/O goes to 100%: what kind of FC card do you have? Check with StorageTek. Run the filemon command when I/O is at 100% to find out which file is the most active.

The problem will most likely be in the hardware connection.
WHY DID YOUR COMPANY BUY A SOLUTION FROM STORAGETEK?

THE MAIN QUESTION: WHAT IS THE DEFINITION OF "HANG"? Does the system respond to ping? Can you do things at the console? If only Oracle hangs, then it is a problem with Oracle.

YOU ARE NOT GOING TO LIKE IT, BUT:

Move Oracle from JFS2 to JFS.

IBM does not have to test Oracle on JFS2.

It is Oracle that has to test JFS2.

If Oracle does not support Oracle on JFS2 then you cannot use it, just as Oracle does not support AIX 5.1 with the 64-bit kernel.

 
Hi,

I'm not sure the problem is with JFS2.
We've just been through several months of investigating a "100% disk usage" phenomenon.
We had a DDN fiber RAID (2.2 terabytes) connected to an AIX 4.3.3 machine over an Emulex HBA.

Under heavy load, several LUNs could sit at 100% I/O for a long time when nothing was actually running on them.
Sometimes the system would recover by itself, sometimes not ...
We did not manage to convince either DDN or IBM that the problem was theirs.

Eventually, we worked around it by changing the HBA from an Emulex to a Cambex one, and the problem was gone.

So it can be a specific fiber HBA driver behaviour that causes it on AIX.

"Long live king Moshiach !"
 
Many thanks for your replies.

The FC card is an Emulex LP9000 (according to lscfg), provided by IBM when the server (a pSeries 6F0) was bought.

Why did we choose a solution from StorageTek? No good reason (no technical one, at least). Just because we already have 2 S/390 mainframes using it, and as there was free space remaining, someone (the man who signs the invoices) said: why not?

Sorry for my bad English. What I mean by "system hang" is that the system responds very, very slowly. For example, when I telnet in at that point, it can take up to 2 or 3 minutes between the login and password prompts. It is not only the Oracle processes that freeze.

Of course I don't like "move Oracle from JFS2 to JFS", as it's now a production server that I can hardly stop. The DBA said we'll have to deal with datafiles > 2 GB and might have "very big" files (?!?). I had no AIX experience at that time, and the IBM consultant chose to create JFS2 as it improves cache performance with files > 2 GB (as I understood it).

We don't have an Enterprise licence for Oracle's products, so I have to keep the AIX 32-bit kernel. I found nothing about JFS2 in Oracle's prerequisites.

I found a thread in this forum where aixat suggests searching for "oracle+avadh", and I got some tips on tuning asyncio, but the bottleneck still remains.

I'll see if I can convince IBM to run a test with another FC card.
 
Hello,

I want to congratulate you on searching this database. By the way, avadh and aixat are one and the same.

Do these steps for me.

1) Run lsdev -Cc adapter and post the output.
2) Is the adapter configured as an IBM adapter or as an Emulex adapter? I need to know this.
3) lscfg -vp fC*
4) Make sure the microcode is at the latest level, and that the filesets are at the latest level.
5) I have listed tuning for memory for Oracle and also for asynchronous I/O.
6) In the configuration file for TCP/IP there is a line for telnet; put the -c flag on it.

What does it do?
It stops reverse name resolution from taking place, and it will give you a fast telnet.
 
Hi Dr Jekyll/Mr Hyde,

1)
fcs0 Available 27-08 FC Adapter
fcs1 Available 3A-08 FC Adapter
(One FC card is in use, the second is a spare)
2)
I "think" it's configured as IBM (filesets 5.1.0.10 & 15) ... How can I be sure? Maybe I didn't understand your question and/or what to check ...
3)
fcs0 27-08 FC Adapter

Part Number.................03N2452
EC Level....................D
Serial Number...............1D1450C190
Manufacturer................001D
FRU Number..................09P0102
Network Address.............10000000C92ADAA4
ROS Level and ID............02C03891
Device Specific.(Z0)........1002606D
Device Specific.(Z1)........00000000
Device Specific.(Z2)........00000000
Device Specific.(Z3)........02000909
Device Specific.(Z4)........FF401050
Device Specific.(Z5)........02C03891
Device Specific.(Z6)........06433891
Device Specific.(Z7)........07433891
Device Specific.(Z8)........20000000C92ADAA4
Device Specific.(Z9)........CS3.82A1
Device Specific.(ZA)........C1D3.82A1
Device Specific.(ZB)........C2D3.82A1
Device Specific.(YL)........P1-I8/Q1

fcs1 3A-08 FC Adapter

Part Number.................03N2452
EC Level....................D
Serial Number...............1D1450C26B
Manufacturer................001D
FRU Number..................09P0102
Network Address.............0000000800000001
ROS Level and ID............02C03891
Device Specific.(Z0)........1002606D
Device Specific.(Z1)........00000000
Device Specific.(Z2)........00000000
Device Specific.(Z3)........02000909
Device Specific.(Z4)........FF401050
Device Specific.(Z5)........02C03891
Device Specific.(Z6)........06433891
Device Specific.(Z7)........07433891
Device Specific.(Z8)........2000000800000001
Device Specific.(Z9)........CS3.82A1
Device Specific.(ZA)........C1D3.82A1
Device Specific.(ZB)........C2D3.82A1
Device Specific.(YL)........P1-I9/Q1

hdisk2 3A-08-01 Other FC SCSI Disk Drive

Manufacturer................STK
Machine Type and Model......V960
Part Number.................
ROS Level and ID............30323131
Serial Number...............60380072
EC Level....................
Device Specific.(Z0)........000003128B000002
Device Specific.(Z1)........00000000
Device Specific.(Z2)........
Device Specific.(Z3)........
Device Specific.(Z4)........
Device Specific.(Z5)........

Name: fibre-channel
Model: LP9000
Node: fibre-channel@1
Device Type: fcp
Physical Location: P1-I8/Q1

Name: fibre-channel
Model: LP9000
Node: fibre-channel@1
Device Type: fcp
Physical Location: P1-I9/Q1
4) I'll check.

5) Good job ;-)

6) The delay between login and password in telnet was just an example. If I run "man ls" or "who" or "vi myfile" or "ping somebody_else" or any other command, it is slow. The CPUs show 99% Wait / 0% Idle, and ALL processes are affected.

7) Also, although ML 3 addresses issues with "Use of JFS2 on the 32-bit Kernel" & "Stack Overflow Limitation", installing it didn't improve anything.

8) I found tips on decreasing the priority of write processes, but that degrades overall performance. Quite bad. As creating large Oracle datafiles isn't the DBA's main activity (I hope), it's probably not the answer!
 