View Issue Details
ID: 0001330
Project: OpenFOAM
Category: [All Projects] Bug
View Status: public
Date Submitted: 2014-06-22 15:50
Last Update: 2014-06-28 19:23
Reporter: tponweis
Assigned To:
Priority: normal
Severity: minor
Reproducibility: have not tried
Status: new
Resolution: open
Platform:
OS:
OS Version:
Product Version:
Target Version:
Fixed in Version:
Summary: 0001330: Scalability issue: Exchange of message sizes
Description: In the framework of the EU project PRACE (http://www.prace-ri.eu/) we have been working with OpenFOAM 2.2.x to carry out FSI simulations for aircraft designs.

In the course of our work we identified a scalability issue due to the exchange of buffer sizes, which is done in Pstream::exchange() (more precisely in combineReduce()). Our findings are described in detail in Section 4.4 of the following whitepaper: http://www.prace-ri.eu/IMG/pdf/wp172.pdf

Note that we implemented a workaround which uses MPI_Alltoall() directly instead of combineReduce() for the buffer size communication, resulting in a significant scalability improvement. Attached is the unified diff for this workaround (based on OpenFOAM 2.2.x, commit 6c399a448855e18d335202434b4304203de65).
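For illustration, a minimal sketch of the idea (this is not the attached patch; the function and variable names are hypothetical): each rank contributes one integer per destination rank, and a single MPI_Alltoall call tells every rank how much data it will receive from each peer, without first assembling the full size matrix.

    #include <mpi.h>
    #include <vector>

    // sendSizes[proc] = number of bytes this rank intends to send to 'proc'.
    // Returns recvSizes[proc] = number of bytes 'proc' will send to this rank.
    std::vector<int> exchangeMessageSizes
    (
        const std::vector<int>& sendSizes,
        MPI_Comm comm
    )
    {
        int nProcs = 0;
        MPI_Comm_size(comm, &nProcs);

        std::vector<int> recvSizes(nProcs, 0);

        // One int per destination rank; the collective effectively transposes
        // the distributed size matrix so each rank ends up with its own column.
        MPI_Alltoall
        (
            const_cast<int*>(sendSizes.data()),  // const_cast for pre-MPI-3 bindings
            1, MPI_INT,
            recvSizes.data(), 1, MPI_INT,
            comm
        );

        return recvSizes;
    }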

 
Additional Information: Note that the workaround applies to just one of the overloads of the method PstreamBuffers::finishedSends() (the one without the sizes output parameter).

We have not investigated in which cases the second overload, which replicates the complete matrix of message sizes on all MPI ranks, is really needed, nor how far it could be optimized by a similar approach using appropriate collective MPI routines directly.
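If the complete matrix is in fact required, one conceivable approach along the same lines (not investigated, names again hypothetical) would be a single MPI_Allgather in which each rank contributes its row of send sizes:

    #include <mpi.h>
    #include <vector>

    // sendSizes[proc] = number of bytes this rank sends to 'proc'. The result
    // is the complete nProcs x nProcs size matrix, stored row-major, with row r
    // holding the sizes sent by rank r -- the same information the
    // combineReduce()-based overload assembles.
    std::vector<int> gatherSizeMatrix
    (
        const std::vector<int>& sendSizes,
        MPI_Comm comm
    )
    {
        int nProcs = 0;
        MPI_Comm_size(comm, &nProcs);

        std::vector<int> sizeMatrix(nProcs*nProcs, 0);

        // Every rank contributes its own row; MPI_Allgather replicates all
        // rows on all ranks in one collective call.
        MPI_Allgather
        (
            const_cast<int*>(sendSizes.data()),  // const_cast for pre-MPI-3 bindings
            nProcs, MPI_INT,
            sizeMatrix.data(), nProcs, MPI_INT,
            comm
        );

        return sizeMatrix;
    }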

Tags: No tags attached.
Attached Files: changes.diff (7,320 bytes) 2014-06-22 15:50

- Relationships

- Notes
(0003137)
mattijs (manager)
2014-06-24 11:12

Thanks for the detailed analysis. Have you tried any other MPI implementations? With the use of MPI_Alltoall, what is the next bottleneck?

Mattijs
(0003138)
tponweis (reporter)
2014-06-24 12:01
edited on: 2014-06-24 12:34

Dear Mattijs!

No, we haven't tried other MPI implementations. We only used Bullxmpi 1.1.16 (a derivative of OpenMPI).

Considering our specific test case, the simple beam example with uniform motion diffusion (corresponding to the right-hand plot of Figure 5 in the above-mentioned whitepaper): when MPI_Alltoall is used within one of the two overloads of PstreamBuffers::finishedSends(), 40% of the scaling overhead from 2048 to 4096 processes (i.e. the difference between the actual runtime and the theoretical runtime under optimal scaling) is caused by the second (unmodified) overload of PstreamBuffers::finishedSends(), and more precisely again by combineReduce().

I could also send you the corresponding HPCToolkit profiles (~7 MB) via email in case you are interested.

Thomas

(0003139)
mattijs (manager)
2014-06-24 13:14

Hi Thomas,

that might be interesting. I'm m.janssens at the company email.

- Issue History
Date Modified Username Field Change
2014-06-22 15:50 tponweis New Issue
2014-06-22 15:50 tponweis File Added: changes.diff
2014-06-24 11:12 mattijs Note Added: 0003137
2014-06-24 12:01 tponweis Note Added: 0003138
2014-06-24 12:34 tponweis Note Edited: 0003138
2014-06-24 13:14 mattijs Note Added: 0003139