Starpu/alloc on the fly
Change how the workspaces are allocated so that they can be allocated on the fly.
The objectives are:
- Allocate less memory in QR-like algorithms, since only the T tiles that are actually used get allocated.
- Be more asynchronous in some algorithms, such as QR and the norm computations, by removing the sequence_wait that was required at the end of the call before the allocated workspaces could be freed. This is also used in the upcoming SUMMA algorithms.
The changes are:
- switch from geadd to axpy in the norm computations, as axpy may be optimized by MKL
- switch workspaces from global allocation to tile allocation
- update the QR kernels that generate the T tiles so that they set them to 0 first. This can no longer be done through a global memset, and to avoid allocating the complete matrix, the initialization is moved into the codelets so that only the touched tiles are initialized.
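
For illustration only, the per-tile allocation and zero-on-first-touch idea can be sketched in plain C, without any runtime system. Everything here is hypothetical and not taken from this merge request: the `lazy_ws_t` type, the `lazy_ws_*` functions, and the access pattern (tiles on and below the diagonal), which is only an assumption standing in for "the touched tiles". In the real code the allocation is left to the runtime and the zeroing happens inside the codelets.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical tiled workspace: one pointer per tile, NULL until first use. */
typedef struct {
    int mt, nt;      /* number of tile rows and tile columns */
    int ib, nb;      /* each tile holds ib x nb doubles */
    double **tiles;  /* mt * nt tile pointers, column-major order */
    int allocated;   /* number of tiles allocated so far */
} lazy_ws_t;

/* Create the table of tile pointers; no tile storage is allocated yet. */
static int lazy_ws_init(lazy_ws_t *ws, int mt, int nt, int ib, int nb)
{
    size_t count = (size_t)mt * (size_t)nt;
    ws->mt = mt; ws->nt = nt; ws->ib = ib; ws->nb = nb;
    ws->allocated = 0;
    ws->tiles = malloc(count * sizeof(double *));
    if (ws->tiles == NULL) {
        return -1;
    }
    for (size_t i = 0; i < count; i++) {
        ws->tiles[i] = NULL;
    }
    return 0;
}

/* Return tile (m, n). On first access the tile is allocated and set to 0,
 * which replaces a global memset over the whole workspace. */
static double *lazy_ws_get(lazy_ws_t *ws, int m, int n)
{
    assert(m >= 0 && m < ws->mt && n >= 0 && n < ws->nt);
    double **slot = &ws->tiles[(size_t)n * (size_t)ws->mt + (size_t)m];
    if (*slot == NULL) {
        *slot = calloc((size_t)ws->ib * (size_t)ws->nb, sizeof(double));
        if (*slot != NULL) {
            ws->allocated++;
        }
    }
    return *slot;
}

/* Assumed access pattern: touch tiles (m, k) with m >= k only.
 * Returns the number of tiles visited. */
static int lazy_ws_touch_lower(lazy_ws_t *ws)
{
    int visited = 0;
    int kmax = ws->mt < ws->nt ? ws->mt : ws->nt;
    for (int k = 0; k < kmax; k++) {
        for (int m = k; m < ws->mt; m++) {
            double *tile = lazy_ws_get(ws, m, k);
            if (tile != NULL) {
                tile[0] = 1.0; /* stand-in for the kernel writing the tile */
            }
            visited++;
        }
    }
    return visited;
}

/* Free the tiles that were allocated, then the pointer table. */
static void lazy_ws_destroy(lazy_ws_t *ws)
{
    size_t count = (size_t)ws->mt * (size_t)ws->nt;
    for (size_t i = 0; i < count; i++) {
        free(ws->tiles[i]); /* free(NULL) is a no-op */
    }
    free(ws->tiles);
    ws->tiles = NULL;
    ws->allocated = 0;
}
```

With a 4 x 4 tile grid, calling `lazy_ws_touch_lower` under this assumed pattern allocates 10 of the 16 tiles, and each tile is zeroed at the moment it is first requested rather than up front.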
Edited by Mathieu Faverge