View Issue Details
ID:               0000027
Project:          OpenFOAM
Category:         [All Projects] Bug
View Status:      public
Date Submitted:   2010-09-06 12:42
Last Update:      2010-09-13 17:39
Reporter:         cbeck
Assigned To:      mattijs
Priority:         high
Severity:         crash
Reproducibility:  sometimes
Status:           resolved
Resolution:       fixed
Platform:         Intel Nehalem, IB interlink
OS:               custom linux
OS Version:       ?
Product Version:  Other
Target Version:
Fixed in Version: 1.7.x
Summary: 0000027: Error during writing of fields in massive parallel simulations
Description: I am performing LES simulations on a big cluster using ~1000 cores. In some cases, during writing of the fields, half of the processor directories get written for time step n and the other half for time step n+1. This becomes apparent when trying to use the stored solutions for restarts.



Steps To Reproduce:
- Wall clock time per time step is ~0.5 s.
- Version 1.6 from the git repository, from ~Dec 2009.
- A Lustre file system is used.
Additional Information: Some modified settings:

minBufferSize=300000000

OptimisationSwitches
{
    fileModificationSkew 10;
    commsType            nonBlocking; //scheduled; //blocking;
    floatTransfer        0;
    nProcsSimpleSum      0;
}
Tags: Input/output
Attached Files: writeIssue.tar.gz (3,009 bytes) 2010-09-07 13:20

- Relationships

-  Notes
(0000026)
mattijs (manager)
2010-09-07 10:49

Could you send system/controlDict and also an 'ls' of the processor directories that shows the problem (i.e. a time directory present in some but not in others)? It might be a time-precision issue.

- Are you sure that it is not a Lustre problem?
- How far apart (in real time) are the dumps? Is it less than e.g. the 10 s of fileModificationSkew?
(0000027)
cbeck (reporter)
2010-09-07 13:33

The data you requested is in the attached archive. It contains the output of:
find processor* -name '0.00741825' >00741825
find processor* -name '0.0074185' >0074185
and the controlDict.

It is in my opinion very unlikely that this is a Lustre problem.

The dumps are ~30 s apart (see below):

ls processor0 -ltr --full-time

drwxr-xr-x 3 4096 2010-09-01 18:58:10.000000000 +0200 constant
drwxr-xr-x 2 4096 2010-09-01 19:04:51.000000000 +0200 0
drwx------ 3 4096 2010-09-03 01:10:34.000000000 +0200 0.0026285
drwx------ 3 4096 2010-09-03 04:00:10.000000000 +0200 0.004997
drwx------ 3 4096 2010-09-03 06:50:03.000000000 +0200 0.0074185

ls processor1 -ltr --full-time

drwxr-xr-x 3 4096 2010-09-01 18:58:11.000000000 +0200 constant
drwxr-xr-x 2 4096 2010-09-01 19:04:52.000000000 +0200 0
drwx------ 3 4096 2010-09-03 01:08:57.000000000 +0200 0.00262825
drwx------ 3 4096 2010-09-03 03:58:53.000000000 +0200 0.00499675
drwx------ 3 4096 2010-09-03 06:49:34.000000000 +0200 0.00741825
(0000028)
mattijs (manager)
2010-09-07 13:53

My guess is that the problem is the

 writeControl clockTime;

which checks the time since the start of the run; on different processors this check can occasionally give different answers. As a workaround you might want to use one of the other writeControl modes.
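To illustrate the workaround, a controlDict write section using one of the deterministic modes might look like the fragment below. The interval value is a placeholder, not taken from the attached case:

```
writeControl    timeStep;  // step-count based: every processor decides identically
writeInterval   200;       // dump every 200 time steps (placeholder value)
```

Step-count (timeStep) and simulated-time (runTime/adjustableRunTime) modes depend only on data that is identical on all processors, so they cannot produce the split dumps seen here.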
(0000029)
mattijs (manager)
2010-09-07 16:00

Added a reduce of elapsedCpuTime and elapsedClockTime before using them.

commit 4160af412a8df26be1a7c284000f888f9dbe0c89

- Issue History
Date Modified Username Field Change
2010-09-06 12:42 cbeck New Issue
2010-09-07 09:24 andy Assigned To => mattijs
2010-09-07 09:24 andy Status new => assigned
2010-09-07 10:49 mattijs Note Added: 0000026
2010-09-07 13:20 cbeck File Added: writeIssue.tar.gz
2010-09-07 13:33 cbeck Note Added: 0000027
2010-09-07 13:53 mattijs Note Added: 0000028
2010-09-07 16:00 mattijs Note Added: 0000029
2010-09-07 16:00 mattijs Status assigned => resolved
2010-09-07 16:00 mattijs Fixed in Version => 1.7.x
2010-09-07 16:00 mattijs Resolution open => fixed
2010-09-13 17:39 andy Tag Attached: Input/output