Regression in starpu_mpi_task_insert (_starpu_mpi_task_build_v) since b7c27d1e
Hi,
It appears that since commit b7c27d1e, calling starpu_mpi_task_insert with a codelet using the can_execute field results in a segfault with the following backtrace:
Thread 1 received signal SIGSEGV, Segmentation fault.
0x00007fffcca55a4b in _starpu_worker_exists_and_can_execute (task=0x555555ac3090, arch=STARPU_CPU_WORKER) at ../../src/core/workers.c:329
329 workers->init_iterator(workers, &it);
(gdb) backtrace
#0 0x00007fffcca55a4b in _starpu_worker_exists_and_can_execute (task=0x555555ac3090, arch=STARPU_CPU_WORKER) at ../../src/core/workers.c:329
#1 0x00007fffcca55c12 in _starpu_worker_exists (task=0x555555ac3090) at ../../src/core/workers.c:415
#2 0x00007ffff7e75897 in _starpu_mpi_task_build_v (comm=comm@entry=0x55555556baa0 <ompi_mpi_comm_world>, codelet=codelet@entry=0x55555556cec0 <fill_cl<float>>,
task=task@entry=0x7fffffffdfd0, xrank_p=xrank_p@entry=0x7fffffffdfc4, descrs_p=descrs_p@entry=0x7fffffffdfd8, nb_data_p=nb_data_p@entry=0x7fffffffdfc8,
prio_p=0x7fffffffdfcc, varg_list=0x7fffffffe060) at ../../../mpi/src/starpu_mpi_task_insert.c:629
[...]
I tried to understand what exactly causes the segfault and found that starpu_mpi_task_insert calls _starpu_mpi_task_insert_v, which was changed in b7c27d1e.
The problem seems to be that this function now calls _starpu_worker_exists to check whether at least one worker can execute the task, while the task has not been assigned a scheduling context ID (sched_ctx) yet. The function _starpu_worker_exists does have a check in place to exit if the scheduling context ID is that of the default context (0), but not if it is still the default value given to a new task (STARPU_NMAX_SCHED_CTXS, assigned in starpu_task_init). The function then calls _starpu_worker_exists_and_can_execute, which calls _starpu_get_sched_ctx_struct with that default sched_ctx value. This returns a pointer to an uninitialized object (the last element of _starpu_config.sched_ctxs), and the program segfaults when the function tries to use that empty object.
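To make the failure path concrete, here is a small standalone model of it, together with one possible guard. This is only a sketch: every model_ name is hypothetical, the structures are simplified stand-ins rather than StarPU's real definitions, and the value returned by the early exit for ID 0 is an arbitrary choice for the model.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: nothing here is StarPU's actual code. */
#define MODEL_NMAX_SCHED_CTXS 4 /* stand-in for STARPU_NMAX_SCHED_CTXS */

struct model_workers {
    int (*count)(void); /* stand-in for the worker collection callbacks */
};

struct model_sched_ctx {
    struct model_workers *workers; /* stays NULL until the context is set up */
};

/* One extra slot, as described for _starpu_config.sched_ctxs. */
static struct model_sched_ctx model_ctxs[MODEL_NMAX_SCHED_CTXS + 1];

/* A ready-made worker collection, used to set up a context by hand. */
static int model_one_worker(void) { return 1; }
static struct model_workers model_ready = { model_one_worker };

enum model_result {
    MODEL_NO_CONTEXT = -1, /* the task has no usable context yet */
    MODEL_NO_WORKER = 0,
    MODEL_WORKER_OK = 1
};

static enum model_result model_worker_exists(unsigned sched_ctx)
{
    if (sched_ctx == 0)
        return MODEL_WORKER_OK; /* models the existing early exit for ID 0 */

    /* Guard that the report finds missing: a fresh task still carries
     * MODEL_NMAX_SCHED_CTXS as its context ID. */
    if (sched_ctx >= MODEL_NMAX_SCHED_CTXS)
        return MODEL_NO_CONTEXT;

    /* Without the guard above, this would read the never-initialized
     * last slot and then call through a NULL pointer. */
    struct model_workers *workers = model_ctxs[sched_ctx].workers;
    if (workers == NULL || workers->count == NULL)
        return MODEL_NO_CONTEXT;

    return workers->count() > 0 ? MODEL_WORKER_OK : MODEL_NO_WORKER;
}
```

In this model, calling model_worker_exists(MODEL_NMAX_SCHED_CTXS) reports MODEL_NO_CONTEXT instead of dereferencing the empty slot, which is the situation the backtrace shows at workers->init_iterator.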
I think there might be a first logic error in starpu_task_init (called by starpu_task_create), where the default sched_ctx ID is set to STARPU_NMAX_SCHED_CTXS, yet the _starpu_config.sched_ctxs array has a length of STARPU_NMAX_SCHED_CTXS + 1. So instead of representing an invalid ID (past the end of the array), this value represents the last possible scheduling context. This means that _starpu_get_sched_ctx_struct can't detect the invalid ID and returns a pointer to the last element of the array instead of a null pointer. That said, fixing this alone would not have changed much, because _starpu_worker_exists_and_can_execute doesn't even check the pointer returned by _starpu_get_sched_ctx_struct.
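The sentinel issue can be illustrated in isolation. The sketch below assumes that the lookup only rejects indices past the end of the array, which is my reading of the behaviour and not something I verified line by line. All sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NMAX 4              /* stand-in for STARPU_NMAX_SCHED_CTXS */
#define SKETCH_NO_CTX SKETCH_NMAX  /* default ID given to a fresh task */

struct sketch_ctx {
    int initialized;
};

/* NMAX + 1 slots, so SKETCH_NO_CTX is a valid index into the table. */
static struct sketch_ctx sketch_table[SKETCH_NMAX + 1];

/* Bounds check only: the sentinel is inside the array, so it is
 * not rejected and the last (unused) slot is returned. */
static struct sketch_ctx *sketch_get_bounds_only(unsigned id)
{
    return id < SKETCH_NMAX + 1 ? &sketch_table[id] : NULL;
}

/* Variant that treats the sentinel itself as "no context". */
static struct sketch_ctx *sketch_get_checked(unsigned id)
{
    return id < SKETCH_NO_CTX ? &sketch_table[id] : NULL;
}
```

With the first lookup, a task that was never given a context silently resolves to the last slot. With the second one, the caller gets a null pointer, but it still has to check for it.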
Note that this problem only occurs if a can_execute function pointer is set for the given codelet (because of the !task->cl->can_execute check in _starpu_worker_exists_and_can_execute).
Alexis.