Segmentation Fault error in a Parallel Program. Suggestions?

angel84 · Jan 29, 2013

Hi everybody,

I get the following error message when I run a parallel program:

Bash:

--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 21451 on node node342 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Below are the relevant lines of the code:

Code:

c --------------------------------------------------------------------
c                  JOBNUMBER is an integer which counts the iterations
c                  RANK is a number which counts inside a single iteration
c                       (its range varies from 0 to (size of communicator - 1))

        call mpi_comm_rank(mpi_comm_world,rank,ierr)
c -------------------------       convert RANK & JOBNUMBER to characters
        write(rankp,'(i4.4)') rank
        write(njob,'(i4.4)') jobnumber

c----------------------------------------------------------------------
c                                           wait for the output to be
c                                           written
       lexist=.false.
       do while (.NOT.lexist)
       inquire(file='MarmAIT'//njob//rankp//
     &          '-Observed-Time.ttr',exist=lexist)
      end do
c --------------------------------------------------------------------
c                                        keep reading the file 
c                                        NJOB//RANKP//'.out' until it finds
c                                        the string 
c                                        'no need to wait for final data
c                                        to be written'. It means that the
c                                        computation of the external program is terminated
                                        
   5    open(unit=123,file=njob//rankp//'.out')
        rewind(123, iostat=stat)
   10   read(123, '(A)', end = 20,iostat=stat) line
        ind=index(line,'no need to wait for final data')
        if (ind .ne. 0) then
         write(*,*)  
         print *, line
         goto 30
        endif
        goto 10
  20    close (123)
        goto 5
  30    continue
        close (123)

Could you suggest a strategy for attacking the problem?
For example, how could I find out which part of the code generates the segmentation fault error?

Thank you very much for your support, and please ask me more questions if this thread is not complete or not adequately clear.

Angelo

gummibaer · Jan 29, 2013

From what I understand of your code, I take it your prog has to wait till the prog running in parallel has completed execution.
Why not use system() for this job? This starts a program from within a running application with a command line and waits till this has executed.

However, I do not understand your code.

In this do loop you inquire whether a certain file exists. I assume that this file is created by the program running in parallel as an indicator that it has completed.
Why then do you also check the file njob//rankp//'.out' for this text? Is this file shared between the two progs running in parallel?

Norbert

The optimist believes we live in the best of all possible worlds - the pessimist fears this might be true.

angel84 · Jan 30, 2013

Hi Norbert,

Thank you for your reply. I will try to answer your questions. It is hard for me to answer satisfactorily, because the entire code is quite complex, it performs several tasks, and I do not have a clue in which part of the code the error could be.

I mainly have two Fortran codes: the first one performs a global search inside a model space (I call it Researcher), and the second one simulates the propagation (I call it Simulator). The Researcher contains some problem-dependent subroutines, which I edited (the lines of code in my previous post are parts of these subroutines); the remaining part of the code should not be edited.

The Researcher calls the Simulator. Since the Simulator is a parallel program, it is managed via a batch job system (PBS Pro), which takes the job and places it in a specific queue. Here comes the problem: the function system() returns as soon as the Simulator has been assigned to a queue, and it doesn't wait until the actual job is completed. That is the reason for the last lines of the code.
First I check that the output file ('MarmAIT'//njob//rankp//'-Observed-Time.ttr') has been written. That is not enough, because this file keeps being modified after it is created. So I also check a log file (njob//rankp//'.out'), which guarantees that the output data is complete. Both check loops are necessary: the first one guarantees that the log file exists, and the second one guarantees that the job is terminated.

I posted these lines of the code because I thought they would be the most likely to fail when moving from a serial implementation to a parallel one, but I still don't have a clue about where the actual segmentation fault arises.

Keep asking me questions, if my answer is not clear enough, or if you have any ideas about how to attack the issue.

Angelo

gummibaer · Jan 30, 2013

Most of the time problems seem complex but may be simple when you abstract from the special requirements of your project. Or when you simplify your requirements on the comfort of handling your progs, at least for a start.

Why do you want to start your Simulator as a batch job? (Note, I am not familiar with this sort of thing, so my question may sound stupid ;-))
As far as I understand you can have Researcher wait until Simulator completes, right? So actually you may not need to have them run in parallel. Maybe if you can drop this feature, you may simplify your life some.

To my knowledge, segmentation faults can mean anything, so they are not easy to hunt down. But my first guess would be the parallel access to your files from Simulator and Researcher at the same time. If you know, or at least can guess, how long your Simulator will take to execute, just put in a pause a little bit longer than that before you open the output file in Researcher. This may not be the final solution, but it gives you an idea whether the error may be found there.

BTW, I trust you have checked your type declarations: line must be declared with at least character*30, and if it is exactly 30 characters long, the string must start at the first character of the line in the output file, otherwise your test would not work. By adding the iostat clause to your read statement you suppress I/O error messages, and I do not know how 'end' and 'iostat' work together.
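
To make the point about declarations and iostat concrete, here is a minimal sketch of a line scan that does not rely on 'end=' at all. It is only an illustration, not the code from the original program: the module name marker_scan, the function file_has_marker and the 256-character buffer are my own assumptions. It compares iostat against iostat_end from the intrinsic module iso_fortran_env, so a normal end of file can be told apart from a real read error instead of both being silently ignored. Because index() searches the whole buffer, the marker may start in any column as long as line is declared long enough to hold it.

```fortran
module marker_scan
  use, intrinsic :: iso_fortran_env, only: iostat_end
  implicit none
contains
  ! Returns .true. if any line of the file `path` contains `marker`.
  ! ios_out is 0 after a clean scan; otherwise it holds the nonzero
  ! iostat code reported by open or read.
  logical function file_has_marker(path, marker, ios_out) result(found)
    character(len=*), intent(in) :: path, marker
    integer, intent(out) :: ios_out
    character(len=256) :: line   ! assumed maximum useful line length
    integer :: u, ios

    found = .false.
    ios_out = 0
    open(newunit=u, file=path, status='old', action='read', iostat=ios)
    if (ios /= 0) then           ! file missing or not readable
      ios_out = ios
      return
    end if
    do
      read(u, '(A)', iostat=ios) line
      if (ios == iostat_end) exit        ! normal end of file
      if (ios /= 0) then                 ! genuine read error: report it
        ios_out = ios
        exit
      end if
      if (index(line, marker) /= 0) then ! marker found anywhere in the line
        found = .true.
        exit
      end if
    end do
    close(u)
  end function file_has_marker
end module marker_scan
```

A caller would invoke something like file_has_marker(njob//rankp//'.out', 'no need to wait for final data', ios) inside its polling loop and stop with a message when ios is nonzero, which at least turns a silent I/O problem into a visible one.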

Other idea:
If you have access both to Simulator and Researcher, then maybe you can change the structure like this:

Code:

program Simulator
...
open output file
...
do evaluation and save data to file
...
close output file
...
! the last action Simulator performs before it exits
open (unit = 10, file = 'imready.txt')
write (10,*) 'done'
close (unit = 10)
end program
!--------------------------------------------------
program Researcher
...
start Simulator
...
lexist = .false.
do while (.not. lexist)
    call sleep (1)    ! wait one second (sleep is a compiler extension, e.g. GNU)
    inquire (file = 'imready.txt', exist = lexist)
enddo
! delete the marker file so that the next run starts clean
open (unit = 10, file = 'imready.txt')
close (unit = 10, status = 'delete')
...
end program

The idea is to avoid opening and closing this output file of yours all the time, and to introduce some waiting time.
Just generating some ideas, not having a solution ready.
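
For what it is worth, the waiting side of this idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the module name sentinel_wait, the routines wait_for_file and remove_file, and the use of execute_command_line('sleep 1'), which is a standard Fortran 2008 intrinsic but presumes a POSIX shell with a sleep command. The upper bound on polls is there so that Researcher cannot hang forever if the batch job dies before it writes the marker file.

```fortran
module sentinel_wait
  implicit none
contains
  ! Polls about once per second until `path` exists or until
  ! `max_seconds` polls have been spent waiting.
  ! `found` tells the caller which of the two happened.
  subroutine wait_for_file(path, max_seconds, found)
    character(len=*), intent(in) :: path
    integer, intent(in) :: max_seconds
    logical, intent(out) :: found
    integer :: waited

    waited = 0
    do
      inquire(file=path, exist=found)
      if (found .or. waited >= max_seconds) exit
      call execute_command_line('sleep 1')   ! assumes a POSIX shell
      waited = waited + 1
    end do
  end subroutine wait_for_file

  ! Deletes the marker file, if present, so the next job starts clean.
  subroutine remove_file(path)
    character(len=*), intent(in) :: path
    integer :: u, ios
    open(newunit=u, file=path, status='old', iostat=ios)
    if (ios == 0) close(u, status='delete')
  end subroutine remove_file
end module sentinel_wait
```

Researcher would call wait_for_file('imready.txt', limit, found) after submitting the job, stop with an error message if found comes back false, and call remove_file('imready.txt') before starting the next job.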

Norbert

