| View Issue Details | |||||||
| ID | Project | Category | View Status | Date Submitted | Last Update | |||
| 0000027 | OpenFOAM | [All Projects] Bug | public | 2010-09-06 12:42 | 2010-09-13 17:39 | |||
| Reporter | cbeck | |||||||
| Assigned To | mattijs | |||||||
| Priority | high | Severity | crash | Reproducibility | sometimes | |||
| Status | resolved | Resolution | fixed | |||||
| Platform | Intel Nehalem, IB interlink | OS | custom linux | OS Version | ? | |||
| Product Version | Other | |||||||
| Target Version | Fixed in Version | 1.7.x | ||||||
| Summary | 0000027: Error during writing of fields in massive parallel simulations | |||||||
| Description | I am performing LES simulations on a large cluster using ~1000 cores. In some cases, when the fields are written, half of the processor directories are written for time step n and the other half for time step n+1. The problem becomes apparent when the stored solutions are used later, e.g. for a restart. | |||||||
| Steps To Reproduce | - Wall-clock time per time step is ~0.5 s. - Version 1.6 from the git repository, from ~Dec 2009. - A Lustre file system is used. | |||||||
| Additional Information | Some modified settings: minBufferSize=300000000 OptimisationSwitches { fileModificationSkew 10; commsType nonBlocking; //scheduled; //blocking; floatTransfer 0; nProcsSimpleSum 0; } | |||||||
| Tags | Input/output | |||||||
| Attached Files | ||||||||
Notes |
|
|
(0000026) mattijs (manager) 2010-09-07 10:49 |
Could you send system/controlDict and also an 'ls' of the processor directories that shows the problem (i.e. a time directory present in some but not in others)? It might be a time precision issue. - Are you sure that it is not a Lustre problem? - How far apart (in real time) are the dumps? Is it less than e.g. the 10 s of fileModificationSkew? |
|
(0000027) cbeck (reporter) 2010-09-07 13:33 |
The data you requested is in the attached archive. It contains the controlDict and the output of:
find processor* -name '0.00741825' >00741825
find processor* -name '0.0074185' >0074185
In my opinion it is very unlikely that this is a Lustre problem. The dumps are ~30 s apart (see below):
ls processor0 -ltr --full-time
drwxr-xr-x 3 4096 2010-09-01 18:58:10.000000000 +0200 constant
drwxr-xr-x 2 4096 2010-09-01 19:04:51.000000000 +0200 0
drwx------ 3 4096 2010-09-03 01:10:34.000000000 +0200 0.0026285
drwx------ 3 4096 2010-09-03 04:00:10.000000000 +0200 0.004997
drwx------ 3 4096 2010-09-03 06:50:03.000000000 +0200 0.0074185
ls processor1 -ltr --full-time
drwxr-xr-x 3 4096 2010-09-01 18:58:11.000000000 +0200 constant
drwxr-xr-x 2 4096 2010-09-01 19:04:52.000000000 +0200 0
drwx------ 3 4096 2010-09-03 01:08:57.000000000 +0200 0.00262825
drwx------ 3 4096 2010-09-03 03:58:53.000000000 +0200 0.00499675
drwx------ 3 4096 2010-09-03 06:49:34.000000000 +0200 0.00741825 |
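The check done with find in the note above (which time directories exist in some processor directories but not in others) could also be scripted. The following is only a sketch using the C++17 standard library, not an OpenFOAM utility: the function name incompleteTimeDirs is hypothetical, and it assumes that any directory whose name starts with a digit is a time directory.

```cpp
#include <cctype>
#include <cstddef>
#include <filesystem>
#include <map>
#include <set>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper (not part of OpenFOAM): for every time directory that
// is missing from at least one processor directory, return the set of
// processor directories that do contain it.
std::map<std::string, std::set<std::string>> incompleteTimeDirs(const fs::path& caseDir)
{
    std::map<std::string, std::set<std::string>> owners;
    std::size_t nProcs = 0;

    for (const auto& proc : fs::directory_iterator(caseDir))
    {
        const std::string procName = proc.path().filename().string();
        // Only look at directories named processor*
        if (!proc.is_directory() || procName.rfind("processor", 0) != 0)
        {
            continue;
        }
        ++nProcs;

        for (const auto& entry : fs::directory_iterator(proc.path()))
        {
            const std::string name = entry.path().filename().string();
            // Assumption: a directory whose name starts with a digit is a
            // time directory (this skips e.g. "constant").
            if
            (
                entry.is_directory()
             && !name.empty()
             && std::isdigit(static_cast<unsigned char>(name[0]))
            )
            {
                owners[name].insert(procName);
            }
        }
    }

    // Drop the time directories that every processor directory has.
    for (auto it = owners.begin(); it != owners.end();)
    {
        if (it->second.size() == nProcs)
        {
            it = owners.erase(it);
        }
        else
        {
            ++it;
        }
    }
    return owners;
}
```

Run on a case like the one listed above, such a check would be expected to report 0.0074185 and 0.00741825 (among others) as present in only part of the processor directories, while 0 would not be reported.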
|
(0000028) mattijs (manager) 2010-09-07 13:53 |
My guess is that the problem is writeControl clockTime, which checks the time since the start of the run; on different processors this check might occasionally give different results. As a workaround you might want to use one of the other writeControl modes. |
|
(0000029) mattijs (manager) 2010-09-07 16:00 |
Added a reduce of elapsedCpuTime and elapsedClockTime before using them, so that all processors act on the same value. Commit 4160af412a8df26be1a7c284000f888f9dbe0c89 |
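The effect described in the last two notes can be illustrated without MPI. The sketch below is not the OpenFOAM source: the function names are hypothetical, the per-rank elapsed times are held in a plain vector, and taking the maximum stands in for the parallel reduce (the note does not say which reduce operation the commit uses). It shows that ranks deciding from their own clocks can disagree when the elapsed time is close to a write interval boundary, whereas ranks that first agree on one value cannot.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: index of the write interval an elapsed time falls into.
long writeIndex(double elapsedSeconds, double writeInterval)
{
    return static_cast<long>(elapsedSeconds / writeInterval);
}

// Each rank decides from its own clock; element i is the decision of rank i.
std::vector<bool> decideLocally
(
    const std::vector<double>& elapsed,
    double writeInterval,
    long lastIndex
)
{
    std::vector<bool> write;
    write.reserve(elapsed.size());
    for (double t : elapsed)
    {
        write.push_back(writeIndex(t, writeInterval) > lastIndex);
    }
    return write;
}

// All ranks first agree on one elapsed time, then decide from it.
// Assumption: the maximum is used here as a stand-in for the parallel reduce.
std::vector<bool> decideAfterReduce
(
    const std::vector<double>& elapsed,
    double writeInterval,
    long lastIndex
)
{
    if (elapsed.empty())
    {
        return {};
    }
    const double agreed = *std::max_element(elapsed.begin(), elapsed.end());
    const bool write = writeIndex(agreed, writeInterval) > lastIndex;
    return std::vector<bool>(elapsed.size(), write);
}

// True if every rank reached the same decision.
bool allAgree(const std::vector<bool>& decisions)
{
    for (std::size_t i = 1; i < decisions.size(); ++i)
    {
        if (decisions[i] != decisions[0])
        {
            return false;
        }
    }
    return true;
}
```

As an illustration with made-up numbers: for four ranks with elapsed times 9999.98, 10000.01, 9999.99 and 10000.02 s, a write interval of 10000 s and a last written index of 0, decideLocally lets only the second and fourth rank write in this time step, while decideAfterReduce makes all four write together.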
Issue History |
| Date Modified | Username | Field | Change |
| 2010-09-06 12:42 | cbeck | New Issue | |
| 2010-09-07 09:24 | andy | Assigned To | => mattijs |
| 2010-09-07 09:24 | andy | Status | new => assigned |
| 2010-09-07 10:49 | mattijs | Note Added: 0000026 | |
| 2010-09-07 13:20 | cbeck | File Added: writeIssue.tar.gz | |
| 2010-09-07 13:33 | cbeck | Note Added: 0000027 | |
| 2010-09-07 13:53 | mattijs | Note Added: 0000028 | |
| 2010-09-07 16:00 | mattijs | Note Added: 0000029 | |
| 2010-09-07 16:00 | mattijs | Status | assigned => resolved |
| 2010-09-07 16:00 | mattijs | Fixed in Version | => 1.7.x |
| 2010-09-07 16:00 | mattijs | Resolution | open => fixed |
| 2010-09-13 17:39 | andy | Tag Attached: Input/output | |