|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0001330||OpenFOAM||[All Projects] Bug||public||2014-06-22 15:50||2014-06-28 19:23|
|Priority||normal||Severity||minor||Reproducibility||have not tried|
|Target Version||Fixed in Version|
|Summary||0001330: Scalability issue: Exchange of message sizes|
|Description||Within the EU project PRACE (http://www.prace-ri.eu/) we have been working with OpenFOAM 2.2.x to carry out FSI (fluid-structure interaction) simulations for aircraft designs.|
In the course of our work we identified a scalability issue caused by the exchange of buffer sizes, which is done in Pstream::exchange() (more precisely, in combineReduce()). Our findings are described in detail in Section 4.4 of the following whitepaper: http://www.prace-ri.eu/IMG/pdf/wp172.pdf
Note that we implemented a workaround that calls MPI_Alltoall() directly instead of combineReduce() to communicate the buffer sizes, which resulted in a significant scalability improvement. Attached is the unified diff for this workaround (based on OpenFOAM 2.2.x, commit 6c399a448855e18d335202434b4304203de65).
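To illustrate the data movement behind the workaround, here is a minimal serial model of an all-to-all exchange with one integer per rank pair. It is not the attached diff and contains no MPI calls; the names `alltoallSizes`, `sendSizes` and `recvSizes` are hypothetical and chosen only for this sketch. The point it shows is that each rank ends up with only the sizes of the messages addressed to it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial model (assumption: one int per rank pair) of what an all-to-all
// exchange of message sizes delivers.
// sendSizes[src][dst] = size of the message rank src will send to rank dst.
// Result: recvSizes[dst][src] = size rank dst will receive from rank src.
// Each rank only needs its own row of the result, i.e. O(P) values.
std::vector<std::vector<int>> alltoallSizes
(
    const std::vector<std::vector<int>>& sendSizes
)
{
    const std::size_t nProcs = sendSizes.size();
    std::vector<std::vector<int>> recvSizes(nProcs, std::vector<int>(nProcs, 0));

    for (std::size_t src = 0; src < nProcs; ++src)
    {
        // Every rank must provide exactly one size per destination rank
        assert(sendSizes[src].size() == nProcs);

        for (std::size_t dst = 0; dst < nProcs; ++dst)
        {
            recvSizes[dst][src] = sendSizes[src][dst];
        }
    }

    return recvSizes;
}
```

In a real MPI program the same result for a single rank would come from one MPI_Alltoall() call with a count of one integer per process; the loop above only mimics that on a single process.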
|Additional Information||Note that the workaround applies to just one of the overloads of the method PstreamBuffers::finishedSends() (the one without the sizes output parameter).|
It has not been investigated in which cases the second overload, which replicates the complete matrix of message sizes on all MPI ranks, is really needed and how far it can be optimized by a similar approach, using appropriate collective MPI routines directly.
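As a rough illustration of why the second overload is the more expensive one, the sketch below counts how many size entries each rank holds after the exchange, assuming one entry per rank pair. The function names are hypothetical and not part of OpenFOAM.

```cpp
#include <cstdint>

// Assumption: the size table has one entry per (sender, receiver) pair.

// Second overload: every rank holds the complete P x P matrix of sizes.
constexpr std::uint64_t entriesPerRankFullMatrix(std::uint64_t nProcs)
{
    return nProcs*nProcs;
}

// First overload with the workaround: every rank holds only the P sizes
// of the messages addressed to it.
constexpr std::uint64_t entriesPerRankOwnSizes(std::uint64_t nProcs)
{
    return nProcs;
}
```

For 4096 ranks this gives 16,777,216 entries per rank for the full matrix versus 4096 when each rank keeps only its own receive sizes.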
|Tags||No tags attached.|
|Attached Files||changes.diff (7,320 bytes) 2014-06-22 15:50|
Thanks for the detailed analysis. Have you tried any other MPI implementations? With MPI_Alltoall in place, what is the next bottleneck?
edited on: 2014-06-24 12:34
No, we haven't tried other MPI implementations. We only used Bullxmpi 1.1.16 (a derivative of OpenMPI).
For our specific test case, the simple beam example with uniform motion diffusion (the right-hand plot of Figure 5 in the above-mentioned whitepaper), with MPI_Alltoall used in one of the two overloads of PstreamBuffers::finishedSends(), 40% of the scaling overhead from 2048 to 4096 processes (i.e. the difference between the actual runtime and the theoretical runtime under optimal scaling) is caused by the second, unmodified overload of PstreamBuffers::finishedSends(), and more precisely again by combineReduce().
I could also send you the corresponding HPCToolkit profiles (~7MB) by email in case you are interested.
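The definition of scaling overhead used in this note (actual runtime minus the runtime expected under optimal scaling) can be written down as a small helper. This is only a sketch: the function names are hypothetical, and any runtimes passed to it are placeholders, not measurements from the whitepaper.

```cpp
// Runtime expected on nProcs processes if the run on baseProcs processes
// (which took baseRuntime) scaled perfectly.
double idealRuntime(double baseRuntime, int baseProcs, int nProcs)
{
    return baseRuntime*static_cast<double>(baseProcs)/static_cast<double>(nProcs);
}

// Scaling overhead: actual runtime minus the ideal runtime.
double scalingOverhead
(
    double baseRuntime,
    int baseProcs,
    double actualRuntime,
    int nProcs
)
{
    return actualRuntime - idealRuntime(baseRuntime, baseProcs, nProcs);
}

// Part of the overhead attributed to one routine, given its share
// (e.g. 0.4 for the 40% quoted above).
double overheadShare(double overhead, double fraction)
{
    return overhead*fraction;
}
```

With the placeholder values of 100 s on 2048 processes and 60 s on 4096 processes, the ideal runtime is 50 s and the overhead 10 s; a 40% share would then correspond to 4 s.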
That might be interesting. I'm m.janssens at the company email.
|2014-06-22 15:50||tponweis||New Issue|
|2014-06-22 15:50||tponweis||File Added: changes.diff|
|2014-06-24 11:12||mattijs||Note Added: 0003137|
|2014-06-24 12:01||tponweis||Note Added: 0003138|
|2014-06-24 12:34||tponweis||Note Edited: 0003138||View Revisions|
|2014-06-24 13:14||mattijs||Note Added: 0003139|