Restore nobigmat option?
When running with StarPU and StarPU's NUMA support enabled, allocating the whole matrix in one chunk fails if the matrix does not fit in NUMA node zero. The nobigmat option of the previous testing infrastructure was useful because it let StarPU allocate tiles on the fly, by passing CHAMELEON_MAT_ALLOC_TILE to CHAMELEON_Desc_Create. Could we restore this option?
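To make the difference between the two allocation modes concrete, here is a minimal sketch in plain C. It is not Chameleon or StarPU code: the tiled_matrix type and both functions are hypothetical names made up for illustration, and malloc() stands in for whatever allocator the runtime would really use. The only point is that each tile gets its own allocation, so no single contiguous block the size of the whole matrix is ever requested.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical tile container, for illustration only (not a Chameleon type). */
typedef struct {
    size_t mt, nt;      /* number of tile rows and tile columns */
    size_t tile_bytes;  /* size of one tile in bytes */
    void **tiles;       /* mt*nt independently allocated tiles */
} tiled_matrix;

/* Allocate every mb-by-nb tile with its own malloc() call.
 * Returns 0 on success, -1 if any allocation fails (nothing is leaked). */
static int tiled_matrix_alloc(tiled_matrix *A, size_t mt, size_t nt,
                              size_t mb, size_t nb, size_t elt_size)
{
    A->mt = mt;
    A->nt = nt;
    A->tile_bytes = mb * nb * elt_size;
    A->tiles = calloc(mt * nt, sizeof *A->tiles);
    if (A->tiles == NULL) {
        return -1;
    }
    for (size_t k = 0; k < mt * nt; k++) {
        A->tiles[k] = malloc(A->tile_bytes);
        if (A->tiles[k] == NULL) {
            /* Roll back the tiles allocated so far. */
            while (k-- > 0) {
                free(A->tiles[k]);
            }
            free(A->tiles);
            A->tiles = NULL;
            return -1;
        }
    }
    return 0;
}

/* Release all tiles and the pointer table. */
static void tiled_matrix_free(tiled_matrix *A)
{
    if (A->tiles == NULL) {
        return;
    }
    for (size_t k = 0; k < A->mt * A->nt; k++) {
        free(A->tiles[k]);
    }
    free(A->tiles);
    A->tiles = NULL;
}
```

With a global allocation, the equivalent would be one malloc(mt * nt * tile_bytes), which is the request that cannot be satisfied when the matrix is larger than the memory it is bound to.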
- Owner
Yes we can :). It is just an omission.
- Mathieu Faverge mentioned in merge request !199 (merged)
- Mathieu Faverge closed via merge request !199 (merged)
- Developer
Thank you for the fix. It seems to be working better now.
Before:
% mpirun -DNMAD_DRIVER=tcp -DSTARPU_RESERVE_NCPU=2 -n 2 -nodelist henri0,henri1 ~/chameleon/build/testing/chameleon_stesting -o potrf -H -n 4800:200000:16000
# nm_strat_prio: init- max = 2
# nm_strat_prio: init- max = 2
Id Function threads gpus P Q mtxfmt  nb uplo       n    lda      seedA         time       gflops
0  spotrf        34    0 1 2      0 320 Upper   4800   4800  846930886 2.622355e-01 1.406199e+02
1  spotrf        34    0 1 2      0 320 Upper  20800  20800 1681692777 4.460911e+00 6.724755e+02
2  spotrf        34    0 1 2      0 320 Upper  36800  36800 1714636915 1.383951e+01 1.200381e+03
3  spotrf        34    0 1 2      0 320 Upper  52800  52800 1957747793 2.884257e+01 1.701214e+03
# henri0: nmad: WARNING- nm_core_flush()-nm_core_flush()- pack_submissions list not empty after flush.
# henri0: nmad: WARNING- nm_core_flush()-nm_core_flush()- completed_pws list not empty after flush.
4  spotrf        34    0 1 2      0 320 Upper  68800  68800  424238335 4.718862e+01 2.300468e+03
5  spotrf        34    0 1 2      0 320 Upper  84800  84800  719885386 7.096592e+01 2.864337e+03
6  spotrf        34    0 1 2      0 320 Upper 100800 100800 1649760492 1.004855e+02 3.397531e+03
7  spotrf        34    0 1 2      0 320 Upper 116800 116800  596516649 1.349071e+02 3.937115e+03
8  spotrf        34    0 1 2      0 320 Upper 132800 132800 1189641421 1.738165e+02 4.491453e+03
9  spotrf        34    0 1 2      0 320 Upper 148800 148800 1025202362 2.184553e+02 5.027237e+03
10 spotrf        34    0 1 2      0 320 Upper 164800 164800 1350490027 2.693636e+02 5.538797e+03
CHAMELEON ERROR: chameleon_desc_mat_alloc(): malloc() failed
CHAMELEON ERROR: chameleon_desc_check(): NULL matrix pointer
CHAMELEON ERROR: CHAMELEON_Desc_Create_User(): invalid descriptor
CHAMELEON ERROR: chameleon_desc_check(): invalid matrix type
CHAMELEON ERROR: CHAMELEON_splgsy_Tile(): invalid descriptor
CHAMELEON ERROR: chameleon_desc_check(): invalid matrix type
CHAMELEON ERROR: CHAMELEON_spotrf_Tile_Async(): invalid descriptor
^C/home/pswartva/pm2/soft/x86_64/bin/padico-launch : ligne 793 : 6261 Complété ( rc=127; trap 'echo ${rank} ${BASHPID} ${rc} >> ${session_tmp_dir}/completions ; kill $(jobs -p) > /dev/null 2>&1 ' EXIT; ${console} ${padico_d} ${this_node_args} -- "$@"; rc=$? )
/home/pswartva/pm2/soft/x86_64/bin/padico-launch : ligne 793 : 6316 Complété ( rc=127; trap 'echo ${rank} ${BASHPID} ${rc} >> ${session_tmp_dir}/completions ; kill $(jobs -p) > /dev/null 2>&1 ' EXIT; ${console} ${PADICO_RSH} ${m} ${padico_d} ${this_node_args} -- "${quote_args}"; rc=$? )
Now (all the RAM was used for the last matrix size, so I interrupted the run):
% mpirun -DNMAD_DRIVER=tcp -DSTARPU_RESERVE_NCPU=2 -n 2 -nodelist henri0,henri1 ~/chameleon/build/testing/chameleon_stesting -o potrf -H -n 4800:200000:16000 --mtxfmt 1
# nm_strat_prio: init- max = 2
# nm_strat_prio: init- max = 2
Id Function threads gpus P Q mtxfmt  nb uplo       n    lda      seedA         time       gflops
0  spotrf        34    0 1 2      1 320 Upper   4800   4800  846930886 2.696713e-01 1.367425e+02
1  spotrf        34    0 1 2      1 320 Upper  20800  20800 1681692777 4.442065e+00 6.753286e+02
2  spotrf        34    0 1 2      1 320 Upper  36800  36800 1714636915 1.366303e+01 1.215886e+03
3  spotrf        34    0 1 2      1 320 Upper  52800  52800 1957747793 2.786577e+01 1.760848e+03
4  spotrf        34    0 1 2      1 320 Upper  68800  68800  424238335 4.712330e+01 2.303657e+03
5  spotrf        34    0 1 2      1 320 Upper  84800  84800  719885386 7.147869e+01 2.843789e+03
6  spotrf        34    0 1 2      1 320 Upper 100800 100800 1649760492 1.007229e+02 3.389521e+03
7  spotrf        34    0 1 2      1 320 Upper 116800 116800  596516649 1.350397e+02 3.933248e+03
8  spotrf        34    0 1 2      1 320 Upper 132800 132800 1189641421 1.742032e+02 4.481483e+03
9  spotrf        34    0 1 2      1 320 Upper 148800 148800 1025202362 2.183721e+02 5.029152e+03
10 spotrf        34    0 1 2      1 320 Upper 164800 164800 1350490027 2.838414e+02 5.256281e+03
11 spotrf        34    0 1 2      1 320 Upper 180800 180800  783368690 3.436960e+02 5.731961e+03
^C/home/pswartva/pm2/soft/x86_64/bin/padico-launch : ligne 793 : 6743 Complété ( rc=127; trap 'echo ${rank} ${BASHPID} ${rc} >> ${session_tmp_dir}/completions ; kill $(jobs -p) > /dev/null 2>&1 ' EXIT; ${console} ${padico_d} ${this_node_args} -- "$@"; rc=$? )
/home/pswartva/pm2/soft/x86_64/bin/padico-launch : ligne 793 : 6800 Complété ( rc=127; trap 'echo ${rank} ${BASHPID} ${rc} >> ${session_tmp_dir}/completions ; kill $(jobs -p) > /dev/null 2>&1 ' EXIT; ${console} ${PADICO_RSH} ${m} ${padico_d} ${this_node_args} -- "${quote_args}"; rc=$? )
Maybe you could set mtxfmt to 1 by default (at least for StarPU), since it works better that way?
- Owner
That's probably an issue with int instead of size_t. I'll have a look.
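As an illustration of the kind of int versus size_t problem suspected here, the sketch below computes the footprint of a full single-chunk matrix with the operands widened to size_t before multiplying, and checks whether the element count alone would still fit in an int. The function names are hypothetical, the code assumes a platform with a 64-bit size_t, and it does not claim to locate the actual faulty expression in Chameleon.

```c
#include <limits.h>
#include <stddef.h>

/* Bytes needed for an lm-by-ln matrix. Each operand is cast to size_t
 * first, so the product is computed in size_t rather than in int. */
static size_t matrix_bytes(int lm, int ln, size_t elt_size)
{
    return (size_t)lm * (size_t)ln * elt_size;
}

/* Returns 1 if the element count lm*ln is representable as an int,
 * 0 if computing it in int arithmetic would overflow. */
static int count_fits_in_int(int lm, int ln)
{
    return (long long)lm * (long long)ln <= (long long)INT_MAX;
}
```

For the largest size in the runs above, n = lda = 180800 in single precision, the full matrix has 32,688,640,000 elements, so matrix_bytes(180800, 180800, sizeof(float)) gives 130,754,560,000 bytes while count_fits_in_int(180800, 180800) returns 0: any intermediate product kept in an int would be wrong well before the allocation call.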
- Author Developer
"Maybe you could set mtxfmt to 1 by default (at least for StarPU), since it works better that way?"
At least when StarPU is run with STARPU_USE_NUMA set to 1, we should probably do this by default. Otherwise StarPU can't actually know what is where.
- Owner
It's not always the best choice, and it would be specific to StarPU, so I'm not sure it's a good idea. I would prefer to avoid, as much as possible, references to the runtime system outside the runtime directory.