View Issue Details
ID:               0000027
Project:          OpenFOAM
Category:         [All Projects] Bug
View Status:      public
Date Submitted:   2010-09-06 12:42
Last Update:      2010-09-13 17:39
Reporter:         cbeck
Assigned To:      mattijs
Priority:         high
Severity:         crash
Reproducibility:  sometimes
Status:           resolved
Resolution:       fixed
Platform:         Intel Nehalem, IB interlink
OS:               custom linux
OS Version:       ?
Product Version:  Other
Target Version:
Fixed in Version: 1.7.x
Summary: 0000027: Error during writing of fields in massive parallel simulations
Description: I am performing LES simulations on a big cluster using ~1000 cores. In some cases, during writing of the fields, half of the processor directories get written for time step n and the other half for time step n+1. This becomes apparent when trying to use the stored solutions for restarts.



Steps To Reproduce:
- Wall clock time per time step is ~0.5 s.
- Version 1.6 from the git repository, from ~Dec 2009.
- A Lustre file system is used.
Additional Information: Some modified settings:

minBufferSize=300000000

OptimisationSwitches
{
    fileModificationSkew 10;
    commsType            nonBlocking; //scheduled; //blocking;
    floatTransfer        0;
    nProcsSimpleSum      0;
}
Tags: Input/output
Attached Files: writeIssue.tar.gz (3,009 bytes) 2010-09-07 13:20

- Relationships

-  Notes
(0000026)
mattijs (manager)
2010-09-07 10:49

Could you send system/controlDict and also an 'ls' of the processor directories that shows the problem (i.e. a time directory present in some but not in others)? It might be a time-precision issue.

- Are you sure that it is not a Lustre problem?
- How far apart (in real time) are the dumps? Is it less than e.g. the 10 s of fileModificationSkew?
(0000027)
cbeck (reporter)
2010-09-07 13:33

The data you requested is in the attached archive. It contains the output of:
find processor* -name '0.00741825' >00741825
find processor* -name '0.0074185' >0074185
and the controlDict.

It is in my opinion very unlikely that this is a Lustre problem.

The dumps are ~30 s apart (see below):

ls processor0 -ltr --full-time

drwxr-xr-x 3 4096 2010-09-01 18:58:10.000000000 +0200 constant
drwxr-xr-x 2 4096 2010-09-01 19:04:51.000000000 +0200 0
drwx------ 3 4096 2010-09-03 01:10:34.000000000 +0200 0.0026285
drwx------ 3 4096 2010-09-03 04:00:10.000000000 +0200 0.004997
drwx------ 3 4096 2010-09-03 06:50:03.000000000 +0200 0.0074185

ls processor1 -ltr --full-time

drwxr-xr-x 3 4096 2010-09-01 18:58:11.000000000 +0200 constant
drwxr-xr-x 2 4096 2010-09-01 19:04:52.000000000 +0200 0
drwx------ 3 4096 2010-09-03 01:08:57.000000000 +0200 0.00262825
drwx------ 3 4096 2010-09-03 03:58:53.000000000 +0200 0.00499675
drwx------ 3 4096 2010-09-03 06:49:34.000000000 +0200 0.00741825
(0000028)
mattijs (manager)
2010-09-07 13:53

My guess is that the problem is the

 writeControl clockTime;

which checks the time since the start of the run; on different processors this check can occasionally give different answers. As a workaround you might want to use one of the other writeControl modes.
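To illustrate the workaround, a controlDict write section using one of the deterministic modes might look like the fragment below. The interval value is a placeholder, not taken from the attached case:

```
writeControl    timeStep;  // step-count based: every processor decides identically
writeInterval   200;       // dump every 200 time steps (placeholder value)
```

Step-count (timeStep) and simulated-time (runTime/adjustableRunTime) modes depend only on data that is identical on all processors, so they cannot produce the split dumps seen here.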
(0000029)
mattijs (manager)
2010-09-07 16:00

Added a reduce of elapsedCpuTime and elapsedClockTime before using them.

commit 4160af412a8df26be1a7c284000f888f9dbe0c89

- Issue History
Date Modified Username Field Change
2010-09-06 12:42 cbeck New Issue
2010-09-07 09:24 andy Assigned To => mattijs
2010-09-07 09:24 andy Status new => assigned
2010-09-07 10:49 mattijs Note Added: 0000026
2010-09-07 13:20 cbeck File Added: writeIssue.tar.gz
2010-09-07 13:33 cbeck Note Added: 0000027
2010-09-07 13:53 mattijs Note Added: 0000028
2010-09-07 16:00 mattijs Note Added: 0000029
2010-09-07 16:00 mattijs Status assigned => resolved
2010-09-07 16:00 mattijs Fixed in Version => 1.7.x
2010-09-07 16:00 mattijs Resolution open => fixed
2010-09-13 17:39 andy Tag Attached: Input/output