View Issue Details

ID: 0000296
Project: ThirdParty
Category: Bug
View Status: public
Last Update: 2011-10-07 09:11
Reporter: andras
Assigned To: user4
Priority: high
Severity: major
Reproducibility: always
Status: resolved
Resolution: fixed
Platform: x86_64
OS: CentOS
OS Version: 5.4u3
Summary: 0000296: mpirun -np NUMPROCS not working
Description: Trying to run a simple tutorial in parallel does not work.

OpenFOAM-2.0.1 (gcc-4.5.1, gmp-5.0.1, mpc-0.8.1, mpfr-2.4.2)
openmpi-1.5.3 (configure-options += --with-sge)
Steps To Reproduce: Run e.g. icoFoam in parallel.
Additional Information: An error like this is produced:

--%<--
[n201:31552] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[n201:31552] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ahorvath@n201 cavity]$
[2]
FOAM parallel run exiting
[2]
[3]
[3]
[3] --> FOAM FATAL IO ERROR:
[3] incorrect first token, expected <int> or '(', found on line 0 the word 'z'
[3]
[3] file: IOstream at line 0.
[3]
[3] From function operator>>(Istream&, List<T>&)
[3] in file /hpc_home/ahorvath/OpenFOAM/OpenFOAM-2.0.1/src/OpenFOAM/lnInclude/ListIO.C at line 149.
[3]
FOAM parallel run exiting
[3]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 3 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 31555 on
node n201 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
--%<--
Tags: No tags attached.

Activities

andras

2011-09-19 16:36

reporter   ~0000659

Parallel runs also don't work with openmpi-1.4.3 (the latest stable release).
The error messages are the same.

user4

2011-09-23 17:14

  ~0000669

Did you try without the --with-sge?

Can you attach the whole output?
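
(Building Open MPI without SGE support just means omitting that flag at configure time. A minimal sketch, using the ThirdParty install prefix from this report as an example path:)

cd openmpi-1.5.3
./configure --prefix=$HOME/OpenFOAM/ThirdParty-2.0.1/platforms/linux64Gcc/openmpi-1.5.3   # note: no --with-sge
make -j4
make install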

andras

2011-09-24 09:55

reporter   ~0000670

Yes, I first compiled openmpi without "--with-sge". The results were the same for both the latest stable release (1.4.3) and the beta version (1.5.3) that gets installed with OF-2.0.1. Anyhow, for a local run SGE is not invoked.


--%<--
[ahorvath@n201] . .bashrc

[ahorvath@n201 ~]$ foam

[ahorvath@n201 OpenFOAM-2.0.1]$ pwd

/hpc_home/ahorvath/OpenFOAM/OpenFOAM-2.0.1

[ahorvath@n201 OpenFOAM-2.0.1]$ which mpirun

~/OpenFOAM/ThirdParty-2.0.1/platforms/linux64Gcc/openmpi-1.4.3/bin/mpirun

[ahorvath@n201 OpenFOAM-2.0.1]$ ldd `which mpirun`

libopen-rte.so.0 => /hpc_home/ahorvath/OpenFOAM/ThirdParty-2.0.1/platforms/linux64Gcc/openmpi-1.4.3/lib/libopen-rte.so.0 (0x00002b016dc55000)
libopen-pal.so.0 => /hpc_home/ahorvath/OpenFOAM/ThirdParty-2.0.1/platforms/linux64Gcc/openmpi-1.4.3/lib/libopen-pal.so.0 (0x00002b016dea5000)
libdl.so.2 => /lib64/libdl.so.2 (0x000000356d400000)
libnsl.so.1 => /lib64/libnsl.so.1 (0x0000003570800000)
libutil.so.1 => /lib64/libutil.so.1 (0x000000357d200000)
libm.so.6 => /lib64/libm.so.6 (0x000000356d800000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x000000356dc00000)
libc.so.6 => /lib64/libc.so.6 (0x000000356d000000)
/lib64/ld-linux-x86-64.so.2 (0x000000356cc00000)

[ahorvath@n201 OpenFOAM-2.0.1]$ run

[ahorvath@n201 run]$ ls

cavity chtMultiRegionFoam

[ahorvath@n201 run]$ cd cavity/

[ahorvath@n201 cavity]$ ls

0 constant processor0 processor1 processor2 processor3 system

[ahorvath@n201 cavity]$ mpirun -np 4 icoFoam -parallel 2>&1 > log.iF &

[1] 6969
[ahorvath@n201 cavity]$ [0] [1]
[1]
[1] --> FOAM FATAL IO ERROR:
[2]
[2]
[2] --> FOAM FATAL IO ERROR:
[2] incorrect first token, expected <int> or '(', found on line 0 the word 'z'
[2] [3]
[3]
[3] --> FOAM FATAL IO ERROR:
[3] incorrect first token, expected <int> or '(', found on line 0 the word 'z'
[3]
[3] file: [1] error in IOstream "IOstream" for operation operator>>(Istream&, List<T>&) : reading first token
[1]
[1] file: IOstream at line 0.
[1]
[1]
[2] file: IOstream at line 0.
[2]
[2] From function operator>>(Istream&, List<T>&)
[2] in file /hpc_home/ahorvath/OpenFOAM/OpenFOAM-2.0.1/src/OpenFOAM/lnInclude/ListIO.CIOstream at line 0.
[3]
[3] From function operator>>(Istream&, List<T>&)
[3] in file /hpc_home/ahorvath/OpenFOAM/OpenFOAM-2.0.1/src/OpenFOAM/lnInclude/ListIO.C at line 149.
[3]
FOAM parallel run exiting
[3]
at line 149.
[2]
FOAM parallel run exiting
[2]
From function IOstream::fatalCheck(const char*) const
[1] in file db/IOstreams/IOstreams/IOstream.C at line 114.
[1]
FOAM parallel run exiting
[1]

[0]
[0] --> FOAM FATAL IO ERROR:
[0] error in IOstream "IOstream" for operation operator>>(Istream&, List<T>&) : reading first token
[0]
[0] file: IOstream at line 0.
[0]
[0] From function IOstream::fatalCheck(const char*) const
[0] in file db/IOstreams/IOstreams/IOstream.C at line 114.
[0]
FOAM parallel run exiting
[0]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun has exited due to process rank 2 with PID 6972 on
node n201 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
[n201:06969] 3 more processes have sent help message help-mpi-api.txt / mpi-abort
[n201:06969] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
--%<--

nsf

2011-10-04 17:28

reporter   ~0000679

Has there been any progress on this?

I've come across the same issue when running in parallel on SLES10 SP2. The compilation of OpenFOAM-2.0.x is fine (I've seen no errors in the log) but when running in parallel I get the same error as andras. I've tested both scotch and simple as decomposition methods.

For the compilation of OpenFOAM I used
gcc-4.3.3
gmp-4.2.4
mpfr-2.4.1
cmake-2.8.2 (currently recompiling OF with cmake-2.8.4).

OpenFOAM-1.7.x runs fine when compiled with the above ThirdParty apps.

I turned on the IOobject debug flag. Here's an excerpt of the output:
...
IOobject::readHeader(Istream&) : reading header for file "/home/nico/OpenFOAM/nico-2.0.x/run/test/pitzDaily/processor0/constant/polyMesh/neighbour"
 .... read
IOobject::readHeader(Istream&) : reading header for file "/home/nico/OpenFOAM/nico-2.0.x/run/test/pitzDaily/processor0/constant/polyMesh/neighbour"
 .... read
IOobject::readHeader(Istream&) : reading header for file "/home/nico/OpenFOAM/nico-2.0.x/run/test/pitzDaily/processor0/constant/polyMesh/boundary"
 .... read
IOobject::readHeader(Istream&) : reading header for file "/home/nico/OpenFOAM/nico-2.0.x/run/test/pitzDaily/processor0/constant/polyMesh/boundary"
 .... read
This is where the crash occurs.
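
For reference, the IOobject debug flag is controlled by the DebugSwitches dictionary in the global controlDict (in a standard install, $WM_PROJECT_DIR/etc/controlDict); a minimal excerpt:

DebugSwitches
{
    IOobject    1;    // report every file header that is read
}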

From a working version (on ubuntu 11.04) I can tell that the next file to be read is .../processor0/../system/fvSchemes.

Do you have any advice as to what I can test to find where the error lies?


Best Regards

Nicolas

user4

2011-10-04 18:09

  ~0000680

- Check that boundary file on all processors (see the sketch after this list).

- Set FOAM_ABORT to 1 to get a traceback at the location of the error.

- Make sure your hostnames and user IDs are valid words, i.e. they do not start with a number and do not contain invalid characters.
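
A rough sketch of the first two checks, run from inside the case directory (solver name and process count follow the cavity case from this report):

# inspect the boundary file in every processor directory
for d in processor*; do echo "== $d"; head -n 30 "$d"/constant/polyMesh/boundary; done

# abort with a traceback at the point of error
export FOAM_ABORT=1
mpirun -np 4 icoFoam -parallel > log.iF 2>&1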

nsf

2011-10-05 17:22

reporter   ~0000687

I checked the boundary files and they looked fine. The crash occurred even if I decomposed with OpenFOAM-1.7.x. I could run the same case just fine with pisoFoam from 1.7.x. Perhaps I should mention that I tested the incompressible pitzDaily tutorial case in parallel.

However, after a thorough clean ("rm -rf .../OpenFOAM-2.0.x/platforms" and "find .../OpenFOAM-2.0.x -name '*.so' -or -name '*.dep' -or -name '*.o' | xargs rm"; spelled out below) and then recompiling OpenFOAM with cmake-2.8.4, I can't reproduce the error.
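
The clean-and-rebuild sequence, spelled out (a sketch; it assumes the source tree sits at ~/OpenFOAM/OpenFOAM-2.0.x and uses the standard Allwmake build script):

cd ~/OpenFOAM/OpenFOAM-2.0.x
rm -rf platforms                                                  # drop all previously built binaries and libraries
find . -name '*.so' -or -name '*.dep' -or -name '*.o' | xargs rm  # remove stray object and dependency files
./Allwmake                                                        # full rebuild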

So I'm not sure whether I fixed it by cleaning and rebuilding or by switching from cmake-2.8.2 to 2.8.4. I think the former is more likely.

Perhaps andras would also benefit from recompiling (and cleaning!) again?

Best Regards

Nicolas

user4

2011-10-05 17:58

  ~0000688

It must be the cleanout. Could it be that some files were compiled with a different mpi version?

nsf

2011-10-05 18:15

reporter   ~0000689

Yes, that is one probable cause: the first time around I tried with the MPI version (1.5.3) that's supplied in ThirdParty. Seeing the error, I switched back to the version (1.4.1) that was supplied with ThirdParty-1.7. I recompiled but still had the error. When recompiling I didn't clean as thoroughly (meaning not at all; I thought the script would automagically do it for me).

I'm still not sure why it didn't work the first time around. I probably messed up one way or another. Unless there's a good reason to upgrade to 1.5.3, I'm content to stay with this version.

/Nicolas

andras

2011-10-06 21:36

reporter   ~0000695

Cleaning everything as described by nsf and running Allwmake again resolved the issue. I am using mpirun 1.4.3 (stable) now.

Thanks guys...


Cheers,
Andras

Issue History

Date Modified Username Field Change
2011-09-19 15:41 andras New Issue
2011-09-19 16:36 andras Note Added: 0000659
2011-09-23 17:14 user4 Note Added: 0000669
2011-09-24 09:55 andras Note Added: 0000670
2011-10-04 17:28 nsf Note Added: 0000679
2011-10-04 18:09 user4 Note Added: 0000680
2011-10-05 17:22 nsf Note Added: 0000687
2011-10-05 17:58 user4 Note Added: 0000688
2011-10-05 18:15 nsf Note Added: 0000689
2011-10-06 21:36 andras Note Added: 0000695
2011-10-07 09:11 user2 Status new => resolved
2011-10-07 09:11 user2 Resolution open => fixed
2011-10-07 09:11 user2 Assigned To => user4