Add a generic lacpy codelet on CPU/CUDA workers
Add a generic copy codelet, used when m == n and displA = displB = 0, to perform copies on CPU and GPU through the interface data copy function.