data coherency and starpu_data_cpy
Steps to reproduce
Build and run this MPI example using starpu_mpi_data_cpy. At step i, it increments a value on rank i % MPI_Comm_size, sends the result to every other process, and repeats until INC_COUNT (i.e. 50) increments have been performed. It starts correctly, then the value becomes random.
This is a reproducer for the issue that prevents Chameleon from using user workspaces instead of StarPU temporaries.
/* StarPU --- Runtime system for heterogeneous multicore architectures.
*
* Copyright (C) 2015-2022 Université de Bordeaux, CNRS (LaBRI UMR 5800), Inria
*
* StarPU is free software; you can redistribute it and/or modify
* it under the terms of the GNU Lesser General Public License as published by
* the Free Software Foundation; either version 2.1 of the License, or (at
* your option) any later version.
*
* StarPU is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
*
* See the GNU Lesser General Public License in COPYING.LGPL for more details.
*/
/*
 * Reproducer: at each step, increment a value on rank i % size,
 * then copy it to every other rank with starpu_mpi_data_cpy()
 */
#include <starpu_mpi.h>
#include "../helper.h"
#define DATA_TAG 666
#define INC_COUNT 50
void func_cpu(void *descr[], void *_args)
{
	int *value = (int *)STARPU_VARIABLE_GET_PTR(descr[0]);
	fprintf(stderr, "value in %d\n", *value);
	(*value)++;
	fprintf(stderr, "value out %d\n", *value);
}
struct starpu_codelet mycodelet =
{
	.cpu_funcs = {func_cpu},
	.nbuffers = 1,
	.modes = {STARPU_RW},
	.model = &starpu_perfmodel_nop,
};
int main(int argc, char **argv)
{
	int size;
	int rank;
	int ret;
	int value = 0;
	starpu_data_handle_t *data;
	int thread_support;

	if (MPI_Init_thread(&argc, &argv, MPI_THREAD_SERIALIZED, &thread_support) != MPI_SUCCESS)
	{
		fprintf(stderr, "MPI_Init_thread failed\n");
		exit(1);
	}
	if (thread_support == MPI_THREAD_FUNNELED)
		fprintf(stderr, "Warning: MPI only has funneled thread support, not serialized, hoping this will work\n");
	if (thread_support < MPI_THREAD_FUNNELED)
		fprintf(stderr, "Warning: MPI does not have thread support!\n");

	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	MPI_Comm_size(MPI_COMM_WORLD, &size);

	ret = starpu_mpi_init_conf(NULL, NULL, 0, MPI_COMM_WORLD, NULL);
	STARPU_CHECK_RETURN_VALUE(ret, "starpu_mpi_init_conf");

	/* One handle per rank; rank i owns data[i], backed by its local "value" */
	data = (starpu_data_handle_t *)malloc(size * sizeof(starpu_data_handle_t));
	for (int i = 0; i < size; i++)
	{
		if (i == rank)
			starpu_variable_data_register(&data[i], STARPU_MAIN_RAM, (uintptr_t)&value, sizeof(int));
		else
			starpu_variable_data_register(&data[i], -1, (uintptr_t)NULL, sizeof(int));
		starpu_mpi_data_register_comm(data[i], DATA_TAG + i, i, MPI_COMM_WORLD);
	}

	for (int i = 0; i < INC_COUNT; i++)
	{
		/* Increment the value on rank i % size... */
		starpu_mpi_task_insert(MPI_COMM_WORLD, &mycodelet,
				       STARPU_RW, data[i % size],
				       0);
		/* ...then copy it to every other rank */
		for (int j = 0; j < size; j++)
			starpu_mpi_data_cpy(data[j], data[i % size], MPI_COMM_WORLD, 1, NULL, NULL);
	}

	starpu_task_wait_for_all();

	for (int i = 0; i < size; i++)
		starpu_data_unregister(data[i]);
	free(data);

	STARPU_ASSERT_MSG(value == INC_COUNT, "value %d is not the expected value %d\n", value, INC_COUNT);

	starpu_mpi_shutdown_comm(MPI_COMM_WORLD);
	MPI_Finalize();
	return 0;
}
Obtained behavior
Computation starts correctly, then the value becomes random (the step at which the value gets corrupted can change):
run -l -n 2 ~/chameleon/starpu-build-intel-18_openmpi-4.1.6_ucx-1.15.0_mkl_rocm-6.0.2/mpi/examples/comm/mpi_data_cpy
0: [starpu][o164][initialize_lws_policy] Warning: you are running the default lws scheduler, which is not a very smart scheduler, while the system has GPUs or several memory nodes. Make sure to read the StarPU documentation about adding performance models in order to be able to use the dmda or dmdas scheduler instead.
1: [starpu][o164][initialize_lws_policy] Warning: you are running the default lws scheduler, which is not a very smart scheduler, while the system has GPUs or several memory nodes. Make sure to read the StarPU documentation about adding performance models in order to be able to use the dmda or dmdas scheduler instead.
0: [starpu][o164][_starpu_mpi_print_thread_level_support] MPI has been initialized with MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
0: value in 0
0: value out 1
1: [starpu][o164][_starpu_mpi_print_thread_level_support] MPI has been initialized with MPI_THREAD_SERIALIZED; Multiple threads may make MPI calls, but only one at a time.
1: value in 1
1: value out 2
0: value in 2
0: value out 3
1: value in 3
1: value out 4
0: value in 4
0: value out 5
1: value in 5
1: value out 6
0: value in 6
0: value out 7
1: value in 7
1: value out 8
0: value in 8
0: value out 9
1: value in 9
1: value out 10
0: value in 10
0: value out 11
1: value in -1094795586
1: value out -1094795585
0: value in -1094795585
0: value out -1094795584
1: value in -1094795584
1: value out -1094795583
0: value in -1094795583
0: value out -1094795582
1: value in -1094795582
1: value out -1094795581
0: value in -1094795581
0: value out -1094795580
1: value in -1094795580
1: value out -1094795579
0: value in -1094795579
0: value out -1094795578
1: value in -1094795578
1: value out -1094795577
0: value in -1094795577
0: value out -1094795576
1: value in -1094795576
Expected behavior
value should be the same on all processors at the end and be equal to 50 (INC_COUNT)
Configuration
/home_nfs/blacostex/chameleon/starpu/configure --prefix=/home_nfs/blacostex/chameleon/starpu-install-intel-18_openmpi-4.1.6_ucx-1.15.0_mkl_rocm-6.0.2 CFLAGS=-fPIE --disable-build-doc --disable-build-tests --disable-starpufft --disable-mlr --disable-hdf5 --disable-fortran --disable-opencl --disable-cuda --enable-hip --enable-maxnumanodes=8 --enable-maxhipdev=8 --enable-max-sched-ctxs=32 --enable-fxt --disable-parallel-worker --disable-simgrid --disable-build-examples --enable-fxt
Configuration result
Distribution
RHEL 8.8
Version of StarPU
Git revision 75506046
Version of GPU drivers
Not used, but StarPU was built with HIP support (ROCm 6.0.2)