This project aims to design a small DSL for representing shuffles and permutations in vector registers. Ideally, the DSL should expose a clean search space and be architecture-independent while still admitting a reasonable implementation on any architecture.
The goal is to generate vector-level transpositions. The project can currently generate transpositions within the designed space for specific dimensions, namely any M * VSize or VSize * M matrix.
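
To make the idea of a shuffle-based transposition concrete, here is a minimal sketch in Python. It is not this project's DSL or generated code: the `shuffle` primitive, the `transpose` function, and the modelling of a register as a Python list are all assumptions made for illustration. It also covers only the square VSize * VSize case, with VSize a power of two, using the classic butterfly scheme in which each stage swaps off-diagonal blocks between pairs of rows.

```python
def shuffle(a, b, idx):
    """Hypothetical two-input shuffle: lane k of the result is (a + b)[idx[k]]."""
    both = a + b
    return [both[k] for k in idx]


def transpose(rows):
    """Transpose a VSize x VSize matrix held as VSize 'registers' (lists).

    Illustrative only: each stage pairs row i with row i + step and
    exchanges the off-diagonal step x step blocks using two shuffles.
    """
    v = len(rows)
    # This sketch assumes a square matrix whose size is a power of two.
    assert v > 0 and (v & (v - 1)) == 0
    assert all(len(r) == v for r in rows)

    rows = [list(r) for r in rows]
    step = 1
    while step < v:
        # Index patterns into the concatenation lo + hi (length 2 * v).
        lo_idx = [j if (j & step) == 0 else v + j - step for j in range(v)]
        hi_idx = [j + step if (j & step) == 0 else v + j for j in range(v)]
        for i in range(v):
            if (i & step) == 0:
                lo, hi = rows[i], rows[i + step]
                rows[i] = shuffle(lo, hi, lo_idx)
                rows[i + step] = shuffle(lo, hi, hi_idx)
        step *= 2
    return rows


matrix = [[4 * r + c for c in range(4)] for r in range(4)]
result = transpose(matrix)
```

Each stage uses the same two index patterns for every row pair, so a VSize * VSize transposition in this sketch needs log2(VSize) stages of VSize shuffles each. Whether a given index pattern maps to a single instruction depends on the target architecture.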
Code generation is a work in progress (at least for Intel). Getting it to work on Intel should not take much effort, and the design seems to fit Arm perfectly.