Added example CGBN GPU code
This will be followed by a new stage 1 implementation, which is 2-3x faster and supports a wide range of input sizes (512 to 32K bits).
I rebased and pushed after !19 was merged.
This is 80% of the code. The remaining 20% covers dynamically sized kernels, better comments, cudacommon.h, and a few other things.
Edited by Seth Troisi