Autodetect the device type, rather than hard-code NVidia.
Add GNU command line options to the makefile, and adjust the "restrict"
extension usage. For now, we assume the toolchain is only configured for one
accelerator.
This allows each model to initialise their arrays with a parallel
approach, which yields the first touch required for good performance
on NUMA architectures.
Using integers for maths gets unstable past 38 interations even
in double precision. Using the original values/10 is safe up to
the default 100 iterations.