ENH: improve the performance of the OpenCL transpose dot
In the OpenCL implementation, the matrix multiplication of the transposed linear operator performs poorly compared to that of the regular linear operator.
This commit speeds up the transposed dot by using OpenCL vector types in the kernel, improving performance by a factor of 2 or more. As a consequence, the generator length must now be a multiple of 4. The linear operator tests were updated to satisfy this new restriction.
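
To illustrate why the length restriction appears, here is a minimal pure-Python sketch of a dot product accumulated four elements at a time. It assumes the kernel uses a 4-wide vector type such as float4, which this message does not state explicitly. The names VECTOR_WIDTH, check_generator_length and dot_in_chunks_of_four are hypothetical and do not come from the code base; the actual kernel may be organised differently.

```python
# Illustrative sketch only: emulates 4-wide vector accumulation in plain Python.
# All names here are hypothetical, not taken from the actual implementation.

VECTOR_WIDTH = 4  # assumed width of the OpenCL vector type (e.g. float4)


def check_generator_length(n):
    """Raise if n cannot be split into whole 4-element chunks."""
    if n % VECTOR_WIDTH != 0:
        raise ValueError(
            f"generator length must be a multiple of {VECTOR_WIDTH}, got {n}"
        )


def dot_in_chunks_of_four(a, b):
    """Dot product with one partial sum per vector lane."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    check_generator_length(len(a))

    acc = [0.0] * VECTOR_WIDTH  # one accumulator per lane
    for start in range(0, len(a), VECTOR_WIDTH):
        # Each iteration stands in for one vector multiply-accumulate.
        for lane in range(VECTOR_WIDTH):
            acc[lane] += a[start + lane] * b[start + lane]

    # Horizontal reduction of the four lanes into a scalar.
    return sum(acc)
```

A length that is not a multiple of 4 would leave a partial chunk at the end. Rejecting such lengths up front, rather than handling a remainder, is the trade-off this sketch assumes.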