StarPU merge requests
https://gitlab.inria.fr/starpu/starpu/-/merge_requests (feed updated 2023-12-08T11:09:19+01:00)

!124 Draft: Fix assertion about partitioning lazy-allocated handles
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/124 (Nathalie Furmento, updated 2023-12-08T11:09:19+01:00)
Fix assertion about partitioning lazy-allocated handles.
Whether the child is already allocated or not, async partitioning is not yet supported for automatically-allocated handles.
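
For context, a minimal sketch of the scenario this assertion is concerned with: a vector handle registered for lazy allocation (home node -1, NULL pointer) going through the asynchronous partitioning API. The code is illustrative only and not taken from the MR; whether this combination is accepted is exactly what the assertion guards.

```c
/* Illustrative sketch (not from the MR): asynchronous partitioning of a
 * lazily-allocated vector handle. */
#include <starpu.h>

#define N     1024
#define PARTS 4

int main(void)
{
	starpu_data_handle_t handle;
	starpu_data_handle_t children[PARTS];

	if (starpu_init(NULL) != 0)
		return 1;

	/* Home node -1 with a NULL pointer: StarPU allocates the buffer lazily,
	 * only when a task actually needs it. */
	starpu_vector_data_register(&handle, -1, (uintptr_t)NULL, N, sizeof(float));

	struct starpu_data_filter f =
	{
		.filter_func = starpu_vector_filter_block,
		.nchildren = PARTS,
	};

	/* Plan the partitioning, then submit it asynchronously; tasks may then
	 * be submitted on the children handles. */
	starpu_data_partition_plan(handle, &f, children);
	starpu_data_partition_submit(handle, PARTS, children);

	/* ... submit tasks working on children[i] ... */

	starpu_data_unpartition_submit(handle, PARTS, children, STARPU_MAIN_RAM);
	starpu_data_partition_clean(handle, PARTS, children);
	starpu_data_unregister(handle);
	starpu_shutdown();
	return 0;
}
```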

!123 Draft: Partition
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/123 (Nathalie Furmento, updated 2023-12-08T09:05:51+01:00)

!122 Draft: Partition
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/122 (Nathalie Furmento, updated 2023-12-08T09:00:00+01:00)

!98 Draft: starpu_abstract_comm
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/98 (Nathalie Furmento, updated 2023-06-09T11:17:07+02:00)
We need to import modules etc. before getting stuck in starpu_init.

!97 add missing newline on _STARPU_MSG and _STARPU_DISP
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/97 (Loris, updated 2023-06-08T14:57:53+02:00)
Hopefully I didn't forget any or do more than necessary?

!87 Draft: fix python with master slave mode
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/87 (Nathalie Furmento, updated 2023-03-14T10:25:39+01:00)

!86 Perfmodels generalization
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/86 (Loris, updated 2023-09-19T08:52:51+02:00)
The goal is to make the bus calibration phase general for any driver.
This is a WIP merge request so we can comment early on what needs to be changed.
~~So far I've only tested the CUDA driver and made no change to the other drivers.~~
Tested on CUDA and HIP drivers, basic testing on the OpenCL driver. The MPI and TCPIP drivers have been left untouched.

!80 Ndim
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/80 (Nathalie Furmento, updated 2023-01-25T07:22:53+01:00)

!62 Add new simulation allocation flag: UNIQUE
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/62 (LEANDRO NESI Lucas, updated 2022-09-26T13:36:23+02:00)
This provides a new flag (STARPU_MALLOC_SIMULATION_UNIQUE) for the function starpu_malloc_flags() to indicate that, when StarPU is using simgrid, the allocation for a block of a particular size can be unique, i.e. instead of using SIMGRID_SHARED_MALLOC per block allocation and generating a new ptr mapped to the same region, actually provide the same ptr. This removes some pressure from kernel memory management.
Unlike with STARPU_MALLOC_SIMULATION_FOLDED alone, the same address will be given for all mallocs of that particular size.
Some differences using 9 homogeneous nodes with Chameleon LU:
![unique_mem](/uploads/805a9797f99f21ea8b42d61503ce320e/unique_mem.png)
With even a small improvement in real-life simulation time.
And it generates the same simulated execution:
![unique_gantt](/uploads/2e424e268885bb898700cfaa38525344/unique_gantt.png)
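
A hedged usage sketch: the flag name below is the one introduced by this MR, but its exact semantics and whether it is meant to be combined with STARPU_MALLOC_SIMULATION_FOLDED are assumptions on my part; starpu_malloc_flags()/starpu_free_flags() are the existing public API.

```c
/* Assumed usage of the STARPU_MALLOC_SIMULATION_UNIQUE flag proposed here,
 * on top of the existing folded-allocation flag, when running under simgrid. */
#include <starpu.h>

#define BLOCK_SIZE (64UL * 1024 * 1024)

static void *alloc_simulated_block(void)
{
	void *ptr = NULL;
	/* With the new flag, every allocation of this particular size is expected
	 * to return the same pointer instead of a fresh mapping of the shared
	 * region, reducing kernel memory-management pressure. */
	if (starpu_malloc_flags(&ptr, BLOCK_SIZE,
	                        STARPU_MALLOC_SIMULATION_FOLDED | STARPU_MALLOC_SIMULATION_UNIQUE) != 0)
		return NULL;
	return ptr;
}

static void free_simulated_block(void *ptr)
{
	starpu_free_flags(ptr, BLOCK_SIZE,
	                  STARPU_MALLOC_SIMULATION_FOLDED | STARPU_MALLOC_SIMULATION_UNIQUE);
}
```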

!61 Draft: Fortran fix add sync task
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/61 (Nathalie Furmento, updated 2022-07-26T15:48:14+02:00)

!58 Draft: allow toggling of mpi stats aggregation
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/58 (Antoine Jego, updated 2022-07-04T14:35:12+02:00)
Toggling stats aggregation can be useful when measuring communication statistics while removing certain operations (e.g. preliminary redistribution).

!57 Draft: llvm_openmp_novariant
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/57 (Nathalie Furmento, updated 2022-07-01T11:18:29+02:00)

!54 Draft: WIP: Asynchronous HIP driver
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/54 (Mathis Fuentes, updated 2022-06-08T08:10:00+02:00)
This is a WIP of the asynchronous version of the HIP driver for StarPU.
Began testing on 2 different NVIDIA platforms, awaiting AMD GPUs to further test the driver.

!49 WIP: First working version of the HIP driver (synchronous)
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/49 (Mathis Fuentes, updated 2022-04-29T18:55:41+02:00)
This is a WIP of the HIP driver for StarPU. Will be working on a better configuration/compilation for it, which has been made **specific for the environment it has been developed on** (temporarily and for ease of use).
Developed and tested with ROCm 5.0.1 (rocm-5.0.1/lib64 and rocm-5.0.1/hip/lib have been added to LD_LIBRARY_PATH manually).
Only ported the synchronous version called hip0, mirroring cuda0.
Compiled using Clang 13.0.0 (from AOCC).
Tests carried out on a single node with 8 MI100 GPUs:
- `/examples/basic_examples/vector_scal`: added hip kernel => **1 task ran successfully on GPU returning the correct result**
- `/examples/basic_examples/block`: added hip kernel => **1 task ran successfully on GPU returning the correct result**
- `/examples/basic_examples/mult`: specific kernel written to perform a matrix multiplication with multiple tasks on multiple GPUs => **16 tasks ran on all 8 GPUs returning the correct result** (compared to the computation made fully on CPU)
On top of a potential update to the configuration/compilation of HIP, the main focus is adding asynchronous support for HIP, which is currently in the works (taking cuda1 as an example).
Only driver_hip_init.c, driver_hip.h and driver_hip0.c are important in the driver folder.
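
To illustrate the kind of HIP kernel added to the basic examples, here is a hypothetical vector_scal port mirroring the CUDA version. The helper and field names (starpu_hip_get_local_stream, hip_funcs) are assumptions about the HIP driver this MR introduces, not verified against it.

```c
/* Hypothetical sketch of a HIP kernel for vector_scal, mirroring the CUDA one.
 * starpu_hip_get_local_stream() and the hip_funcs codelet field are assumed
 * names, not taken from this MR. Such a file would be compiled with hipcc. */
#include <starpu.h>
#include <hip/hip_runtime.h>

__global__ void vector_mult_hip(unsigned n, float *val, float factor)
{
	unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		val[i] *= factor;
}

static void scal_hip_func(void *buffers[], void *cl_arg)
{
	float factor = *(float *)cl_arg;
	unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
	float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);

	unsigned threads_per_block = 64;
	unsigned nblocks = (n + threads_per_block - 1) / threads_per_block;

	/* Launch on the worker's stream, as the CUDA driver does. */
	hipLaunchKernelGGL(vector_mult_hip, dim3(nblocks), dim3(threads_per_block),
	                   0, starpu_hip_get_local_stream(), n, val, factor);
	STARPU_ASSERT_MSG(hipGetLastError() == hipSuccess, "HIP kernel launch failed");
}

static void scal_cpu_func(void *buffers[], void *cl_arg)
{
	float factor = *(float *)cl_arg;
	unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
	float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
	for (unsigned i = 0; i < n; i++)
		val[i] *= factor;
}

/* The codelet declares the HIP implementation next to the CPU one,
 * the same way cuda_funcs is declared for the CUDA driver. */
static struct starpu_codelet scal_cl =
{
	.cpu_funcs = { scal_cpu_func },
	.hip_funcs = { scal_hip_func },
	.nbuffers = 1,
	.modes = { STARPU_RW },
};
```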

!44 Remove useless commented code
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/44 (Philippe SWARTVAGHER, updated 2022-03-30T10:53:17+02:00)
Not sure why this code was still here, so this MR to be sure I'm not making a mistake. :)

!35 Draft: fortran interface fix
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/35 (Antoine Jego, updated 2021-11-11T11:05:57+01:00)
This merge request adds some missing fortran interfaces and corrects the name of `fstarpu_arbiter_destroy`.
_To be discussed:_
It also adds a new subroutine to create "buffered" synchronization tasks, `fstarpu_task_create_sync`: its implementation is a little hacky, as it only sets the 0-th buffer of a task to the handle given as an argument, along with the task's codelet. The codelet handed out by the user is expected to set `where` to `STARPU_NOWHERE` (otherwise the codelet executes).
`starpu_create_sync_task` is not fitted for our purpose because we require a synchronization task that accesses a handle, to leverage sequential consistency. This "fortran-sync-task" is a synchronization "in the past", as it expects tasks to depend on it (it actually synchronizes through `end_dep` routines).
The use case comes from `qr_mumps` and the need to detect collective communications (cf. `nmad-coop-mcast`).
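
As a C-level illustration of the mechanism described above (the MR itself only adds the Fortran wrapper), here is a minimal sketch of a synchronization task whose codelet runs nowhere but still accesses a handle; names are illustrative, not taken from the MR.

```c
/* Illustrative sketch: a synchronization task that never executes
 * (where = STARPU_NOWHERE) but accesses a handle through its 0-th buffer,
 * so it participates in sequential consistency on that handle. */
#include <starpu.h>

static struct starpu_codelet sync_cl =
{
	.where = STARPU_NOWHERE,   /* never run on any worker */
	.nbuffers = 1,
	.modes = { STARPU_RW },
};

static int submit_sync_task(starpu_data_handle_t handle)
{
	struct starpu_task *task = starpu_task_create();
	task->cl = &sync_cl;
	task->handles[0] = handle; /* 0-th buffer set to the given handle */
	return starpu_task_submit(task);
}
```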

!25 WIP: hierarchical dags
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/25 (Nathalie Furmento, updated 2021-07-20T17:16:59+02:00)

!16 Add more priorities/options to data requests
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/16 (Lucas Nesi, updated 2021-03-04T02:45:24+01:00)
This:
- Changes the data request traces
- Adds priorities on data requests from MPI
  - by adding a priority argument to starpu_data_acquire_on_node_cb_sequential_consistency_sync_jobids
  - by adding the new function starpu_mpi_irecv_detached_prio (see the sketch after this list)
- Adds an option (STARPU_MPI_EARLYDATA_ALLOCATE) to the MPI driver to do early data request allocations and avoid blocking too much
- Adds an option (STARPU_CUDA_ONLY_FAST_ALLOC_OTHER_MEMNODES) so that CUDA workers do not do slow allocations on other memnodes (pinned RAM memory allocations)
  - During the beginning of the execution, the CUDA workers will not be slowed down
- Removes datawizard_progress from fetch_data_on_node as it can fail
- Adds priorities on data requests from _starpu_fetch_task_input
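
A hedged sketch of the new detached receive with priority: the function name is from this MR, but its exact signature is assumed here to mirror starpu_mpi_irecv_detached with an extra prio argument.

```c
/* Assumed signature: starpu_mpi_irecv_detached with an additional priority. */
#include <starpu.h>
#include <starpu_mpi.h>

static void recv_done(void *arg)
{
	(void)arg;
	/* called once the detached receive has completed */
}

static void post_prio_recv(starpu_data_handle_t handle, int source, starpu_mpi_tag_t tag)
{
	/* Higher priority: the corresponding data request should be served before
	 * lower-priority requests competing for the same memory node. */
	int prio = 10;
	starpu_mpi_irecv_detached_prio(handle, source, tag, prio,
	                               MPI_COMM_WORLD, recv_done, NULL);
}
```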

!11 mpi: perform reduction only over contributing nodes
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/11 (Antoine Jego, updated 2021-02-19T11:34:07+01:00)
- added a long to describe a redux_map of contributing nodes
- added a test program to assess the behaviour

fixes #3
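
For context, a sketch of the StarPU-MPI reduction pattern whose distributed step this MR refines so that only contributing nodes take part; the codelets and values are illustrative and the redux_map itself is internal, so none of this code is taken from the MR.

```c
/* Illustrative StarPU-MPI reduction: per-node REDUX contributions followed by
 * starpu_mpi_redux_data(). With this MR, the final exchange is expected to
 * involve only the nodes that actually contributed (tracked via a redux_map),
 * instead of every node in the communicator. */
#include <starpu.h>
#include <starpu_mpi.h>

static void init_cpu(void *buffers[], void *cl_arg)
{
	(void)cl_arg;
	*(double *)STARPU_VARIABLE_GET_PTR(buffers[0]) = 0.0; /* neutral element */
}

static void redux_cpu(void *buffers[], void *cl_arg)
{
	(void)cl_arg;
	double *dst = (double *)STARPU_VARIABLE_GET_PTR(buffers[0]);
	double *src = (double *)STARPU_VARIABLE_GET_PTR(buffers[1]);
	*dst += *src;
}

static void contrib_cpu(void *buffers[], void *cl_arg)
{
	(void)cl_arg;
	*(double *)STARPU_VARIABLE_GET_PTR(buffers[0]) += 1.0;
}

static struct starpu_codelet init_cl    = { .cpu_funcs = { init_cpu },    .nbuffers = 1, .modes = { STARPU_W } };
static struct starpu_codelet redux_cl   = { .cpu_funcs = { redux_cpu },   .nbuffers = 2, .modes = { STARPU_RW, STARPU_R } };
static struct starpu_codelet contrib_cl = { .cpu_funcs = { contrib_cpu }, .nbuffers = 1, .modes = { STARPU_REDUX } };

int main(int argc, char **argv)
{
	int rank, size;
	double sum = 0.0;
	starpu_data_handle_t handle;

	starpu_mpi_init_conf(&argc, &argv, 1, MPI_COMM_WORLD, NULL);
	starpu_mpi_comm_rank(MPI_COMM_WORLD, &rank);
	starpu_mpi_comm_size(MPI_COMM_WORLD, &size);

	starpu_variable_data_register(&handle, rank == 0 ? STARPU_MAIN_RAM : -1,
	                              rank == 0 ? (uintptr_t)&sum : (uintptr_t)NULL,
	                              sizeof(double));
	starpu_mpi_data_register(handle, 42, 0); /* owned by rank 0 */
	starpu_data_set_reduction_methods(handle, &redux_cl, &init_cl);

	/* Only even ranks contribute here; the point of this MR is that the
	 * reduction should then only involve those ranks. */
	for (int node = 0; node < size; node += 2)
		starpu_mpi_task_insert(MPI_COMM_WORLD, &contrib_cl,
		                       STARPU_REDUX, handle,
		                       STARPU_EXECUTE_ON_NODE, node, 0);

	/* Combine the per-node contributions back onto the owner. */
	starpu_mpi_redux_data(MPI_COMM_WORLD, handle);

	starpu_task_wait_for_all();
	starpu_data_unregister(handle);
	starpu_mpi_shutdown();
	return 0;
}
```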

!2 Draft: WIP: Nmad coop coll dynamic interface
https://gitlab.inria.fr/starpu/starpu/-/merge_requests/2 (Philippe SWARTVAGHER, updated 2022-06-29T13:53:30+02:00)
A big one!
- add dynamic broadcasts in nmad backend
- add better clock synchronization for traces (when used with nmad backend)
- add some benchmarks related to distributed StarPU
- add priority and type (collective or point-to-point) of communications in comms.rec trace file
- and some other little stuff I don't remember :smiley:
Remaining TODOs:
- [ ] documentation
- [ ] ChangeLog
- [ ] (comments of this merge request)
- [ ] test branch with backend MPI on buildbot
- [ ] test branch with backend MPI on ci.inria.fr

(starpu 1.4, Nathalie Furmento)