V0 - Performance improvements
This post keeps track of the changes made to improve performance.
The results below were obtained on the PlaFRIM cluster.
Config:
int thedeg = 4;
int theraf = 5;
The reported durations do not include the compilation of the OpenCL kernels (see #26bc8834).
Node 4 × K40 GPUs:
- Before:
Temps total (no memory transfer) =20.000000
- Now:
Temps total (no memory transfer) =12.500000
Node 2 × P100 GPUs:
- Before:
Temps total (no memory transfer) =23.500000
- Now:
Temps total (no memory transfer) =14.000000
This is a nice speedup: about 1.6× on the K40 node (20 down to 12.5) and about 1.7× on the P100 node (23.5 down to 14). Here is the list of changes that have been applied.
8bec277e
Original state
There are synchronizations and lots of red parts on the GPUs.
2513cc08
Removing the synchronizations
In the function RK2_SPU, there appears to be a call to starpu_task_wait_for_all() inside the while loop.
I therefore moved the wait to just after the loop, on the assumption that StarPU will manage the dependencies correctly.
This needs confirmation that it is correct and has no side effects.
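The restructuring can be sketched as follows. This is only a minimal sketch: submit_step_tasks() and wait_for_all() are hypothetical stand-ins (so that the example compiles without StarPU) for the task submissions of one time step and for starpu_task_wait_for_all(), and the counters exist only to make the difference visible. Moving the wait should be safe under two assumptions: all data touched by the tasks is accessed through registered StarPU handles, so that StarPU's default sequential consistency can infer the dependencies from the submission order, and the loop condition does not read values produced by the tasks.

```c
/* Hypothetical stand-ins: they only count what happens, so that the sketch
   compiles without StarPU. In the real code, submit_step_tasks() would be the
   task submissions of one time step and wait_for_all() would be
   starpu_task_wait_for_all(). */
static int pending_tasks = 0;  /* steps submitted but not yet waited for */
static int barrier_count = 0;  /* number of global synchronizations */
static int max_pending = 0;    /* largest number of steps in flight */

static void submit_step_tasks(void)
{
    pending_tasks++;
    if (pending_tasks > max_pending)
        max_pending = pending_tasks;
}

static void wait_for_all(void)
{
    pending_tasks = 0;
    barrier_count++;
}

static void reset_counters(void)
{
    pending_tasks = 0;
    barrier_count = 0;
    max_pending = 0;
}

/* Before: a global barrier at every iteration, so the runtime never sees
   more than one time step at once. Returns the number of barriers. */
int run_barrier_in_loop(int nsteps)
{
    reset_counters();
    for (int step = 0; step < nsteps; step++) {
        submit_step_tasks();
        wait_for_all();
    }
    return barrier_count;
}

/* After: submit every step, then wait once after the loop.
   Returns the number of barriers. */
int run_barrier_after_loop(int nsteps)
{
    reset_counters();
    for (int step = 0; step < nsteps; step++)
        submit_step_tasks();
    wait_for_all();
    return barrier_count;
}

/* Largest number of steps that were in flight during the last run. */
int max_tasks_in_flight(void)
{
    return max_pending;
}
```

With 10 steps, the first version synchronizes 10 times and never has more than one step in flight, while the second synchronizes once and lets the runtime see all 10 steps.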
cfb8f159
COMMUTE and degree of parallelism
COMMUTE is used in several codelets, and that is really great! However, StarPU is not clear about how commutative dependencies are managed. I know (because I partially implemented it) that an arbiter (a mutex) is needed to get a real commute, because StarPU needs some kind of global lock to make sure that it can select the right task.
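To illustrate why some global lock is needed, here is a toy model of what an arbiter does. It is not StarPU's implementation: the names toy_data, toy_arbiter_try_acquire and toy_arbiter_release are hypothetical, and the model is single-threaded (a real arbiter would hold its mutex around both functions). The idea is that a task commuting on several pieces of data gets all of them or none, which avoids the dining philosophers situation where two tasks each hold part of what the other needs. In StarPU itself, an arbiter is created with starpu_arbiter_create() and attached to a data handle with starpu_data_assign_arbiter().

```c
#include <stddef.h>

/* Hypothetical toy type: one piece of data accessed in commute mode. */
typedef struct {
    int busy; /* 1 while a commuting task holds this piece of data */
} toy_data;

/* Try to take all n pieces of data for one task.
   Returns 1 if every piece was free (all are now held), 0 if at least one
   was busy (nothing is taken). A real arbiter would run this under its
   mutex so that the check and the acquisition are atomic. */
int toy_arbiter_try_acquire(toy_data *data[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (data[i]->busy)
            return 0;
    for (size_t i = 0; i < n; i++)
        data[i]->busy = 1;
    return 1;
}

/* Give back all n pieces of data once the task has finished. */
void toy_arbiter_release(toy_data *data[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i]->busy = 0;
}
```

For example, with a task T1 on (a, b), a task T2 on (b, c) and a task T3 on (c): while T1 holds its data, T2 is refused and keeps nothing, so T3 can still take c and run. T2 becomes runnable once T1 and T3 have released their data.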
Scheduling
I connected my scheduler Heteroprio, using the version that is already inside StarPU (#6d82d899). Then I connected my work-in-progress scheduler laHeteroprio (#07b93c6d). This seems to give very good results.
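For reference, one way to pick a scheduler without changing the code around starpu_init() is the STARPU_SCHED environment variable, which StarPU reads at initialization. The sketch below is an assumption about how this could be wired in, not what the commits above do: select_starpu_scheduler is a hypothetical helper, and Heteroprio may additionally need its priorities configured, which is not shown.

```c
#include <stdlib.h>

/* Hypothetical helper: ask StarPU to use the given scheduling policy by
   setting the STARPU_SCHED environment variable. It has to be called
   before starpu_init(). Returns 0 on success and -1 on failure. */
int select_starpu_scheduler(const char *policy_name)
{
    if (policy_name == NULL || policy_name[0] == '\0')
        return -1;
    /* The last argument (1) overwrites any value already set. */
    return setenv("STARPU_SCHED", policy_name, 1);
}
```

Calling select_starpu_scheduler("heteroprio") before starpu_init() has the same effect as launching the program with STARPU_SCHED=heteroprio in the environment.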
Config problem
I cannot use the following larger configuration:
int thedeg = 5;
int theraf = 6;
With these values, the run fails with:
testlaura_spu: /projets/schnaps/schnaps/src/interpolation.c:277: ref_ipg: Assertion `ic[2] >=0 && ic[2]<nraf[2]' failed
What's next?
- Use larger thedeg and theraf
- Use multiple runners on the GPUs (but this may only work with CUDA and not with OpenCL)