Starpu/alloc on the fly
Change how the workspaces are allocated so that they can be allocated on the fly.
The objectives are:
- Allocate less memory in QR-like algorithms, as only the useful T tiles are allocated.
- Be more asynchronous in some algorithms, such as QR again or the norms, by avoiding the sequence_wait that was required at the end of the call before freeing the allocated workspaces. This is also used in the upcoming SUMMA algorithms.
The changes are:
- switch geadd to axpy in the norm computations, as axpy may be optimized by MKL
- switch workspaces from global allocation to tile allocation
- update the QR kernels that generate the T tiles to set them to 0 first. This can no longer be done through a global memset, and to avoid a complete allocation of the matrix, the initialization is moved into the codelets so that only the touched tiles are initialized.
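
To illustrate the idea behind the tile allocation and the in-kernel zeroing, here is a minimal, standalone sketch in plain C. It is not the code of this change and it does not use the StarPU API: the `lazy_ws_t` type and the `lazy_ws_*` functions are hypothetical names introduced only for this example. The descriptor holds one pointer per tile, and a tile is allocated and set to 0 the first time a kernel touches it, so untouched tiles never consume memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical workspace descriptor (illustration only, not the real one):
 * one pointer per tile, NULL until the tile is first used. */
typedef struct {
    int mt, nt;      /* number of tile rows / tile columns */
    int mb, nb;      /* dimensions of one tile */
    double **tiles;  /* mt * nt tile pointers, column-major order */
} lazy_ws_t;

/* Create the descriptor only: no tile memory is allocated here. */
int lazy_ws_init(lazy_ws_t *ws, int mt, int nt, int mb, int nb)
{
    size_t ntiles = (size_t)mt * (size_t)nt;
    ws->mt = mt; ws->nt = nt; ws->mb = mb; ws->nb = nb;
    ws->tiles = malloc(ntiles * sizeof(double *));
    if (ws->tiles == NULL) return -1;
    for (size_t i = 0; i < ntiles; i++) ws->tiles[i] = NULL;
    return 0;
}

/* Return tile (m, n). On first access the tile is allocated and set to 0,
 * which replaces the former global memset of the whole workspace. */
double *lazy_ws_get(lazy_ws_t *ws, int m, int n)
{
    assert(m >= 0 && m < ws->mt && n >= 0 && n < ws->nt);
    double **slot = &ws->tiles[(size_t)n * (size_t)ws->mt + (size_t)m];
    if (*slot == NULL) {
        size_t nelt = (size_t)ws->mb * (size_t)ws->nb;
        *slot = malloc(nelt * sizeof(double));
        if (*slot == NULL) return NULL;
        for (size_t i = 0; i < nelt; i++) (*slot)[i] = 0.0;
    }
    return *slot;
}

/* Count the tiles that have actually been allocated. */
int lazy_ws_count(const lazy_ws_t *ws)
{
    int count = 0;
    size_t ntiles = (size_t)ws->mt * (size_t)ws->nt;
    for (size_t i = 0; i < ntiles; i++)
        if (ws->tiles[i] != NULL) count++;
    return count;
}

/* Free every allocated tile, then the pointer table. */
void lazy_ws_destroy(lazy_ws_t *ws)
{
    size_t ntiles = (size_t)ws->mt * (size_t)ws->nt;
    for (size_t i = 0; i < ntiles; i++) free(ws->tiles[i]);
    free(ws->tiles);
    ws->tiles = NULL;
}

/* Usage: a 4x4 grid of 8x8 tiles where only the diagonal tiles are touched.
 * Returns the number of tiles allocated, or -1 on allocation failure. */
int lazy_ws_demo_diagonal(void)
{
    lazy_ws_t ws;
    if (lazy_ws_init(&ws, 4, 4, 8, 8) != 0) return -1;
    for (int k = 0; k < 4; k++) {
        double *T = lazy_ws_get(&ws, k, k);
        if (T == NULL) { lazy_ws_destroy(&ws); return -1; }
        T[0] = 1.0;  /* stand-in for the kernel writing into the T tile */
    }
    int count = lazy_ws_count(&ws);
    lazy_ws_destroy(&ws);
    return count;
}
```

In this sketch, `lazy_ws_demo_diagonal()` ends up with 4 allocated tiles out of the 16 in the grid. In a task-based runtime the allocation and the release would be driven by the runtime and its data handles rather than by explicit calls like these, which is what removes the need to wait for the whole sequence before freeing the workspace.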