Weird behavior when using StarPU
@fpruvost and @ltaief reported that, when using StarPU, the actual running time of the Chameleon timing programs is much longer than the time printed in their output. The problem does not appear when using Quark. (@agullo, @thibault)
With StarPU and 15 threads on a 32-core machine at KAUST:
$ time numactl --interleave=all ./time_dpotrf_tile --n_range=10000:10000:10000 --k=10000 --nb=320 --threads=15 --nowarmup
Profiling throught FxT has not been enabled in StarPU runtime (configure StarPU with --with-fxt)
#
# CHAMELEON 0.9.1, ./time_dpotrf_tile
# Nb threads: 15
# Nb GPUs: 0
# NB: 320
# IB: 32
# eps: 1.110223e-16
#
# M N K/NRHS seconds Gflop/s Deviation
10000 10000 10000 0.778 428.57 +- 0.00
real 0m0.937s
user 0m12.296s
sys 0m0.426s
And with 31 threads:
$ time numactl --interleave=all ./time_dpotrf_tile --n_range=10000:10000:10000 --k=10000 --nb=320 --threads=31 --nowarmup
Profiling throught FxT has not been enabled in StarPU runtime (configure StarPU with --with-fxt)
#
# CHAMELEON 0.9.1, ./time_dpotrf_tile
# Nb threads: 31
# Nb GPUs: 0
# NB: 320
# IB: 32
# eps: 1.110223e-16
#
# M N K/NRHS seconds Gflop/s Deviation
10000 10000 10000 0.462 721.70 +- 0.00
real 0m5.742s
user 2m53.004s
sys 0m3.008s
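To make the discrepancy concrete, the following minimal sketch subtracts the time reported by `time_dpotrf_tile` from the `real` time reported by `time` for the two StarPU runs above. The helper names `to_seconds` and `untimed` are illustrative only and are not part of Chameleon or StarPU.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time` field such as '2m53.004s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time field: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def untimed(reported: float, real: str) -> float:
    """Wall-clock seconds not covered by the benchmark's own timer."""
    return to_seconds(real) - reported


# (threads, seconds printed by time_dpotrf_tile, `real` from time),
# copied from the two StarPU runs above.
runs = [
    (15, 0.778, "0m0.937s"),
    (31, 0.462, "0m5.742s"),
]

for threads, reported, real in runs:
    gap = untimed(reported, real)
    print(f"{threads} threads: {gap:.3f} s outside the timed region")
```

With the figures above, the untimed part grows from about 0.16 s with 15 threads to about 5.28 s with 31 threads, although the reported factorization time drops.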
By contrast, the problem does not appear with Quark. The run below uses PLASMA rather than Chameleon, but the behavior is similar:
$ time numactl --interleave=all ./time_dgemm_tile --n_range=10000:10000:10000 --k=10000 --nb=320 --threads=31 --nowarmup
#
# PLASMA 2.8.0, ./time_dgemm_tile
# Nb threads: 31
# NB: 320
# IB: 32
# eps: 1.110223e-16
#
# M N K/NRHS seconds Gflop/s Deviation
10000 10000 10000 2.282 876.24 0.00
real 0m2.516s
user 1m8.134s
sys 0m1.449s
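The `user` figures give a second view of the same three runs. The sketch below divides the `user` CPU time by the thread-time available (threads times `real`); the function name `busy_fraction` is illustrative, and the `user` values are converted to seconds by hand from the `time` output above.

```python
def busy_fraction(threads: int, real: float, user: float) -> float:
    """Fraction of the available thread-time (threads * real) spent in user code."""
    return user / (threads * real)


# (label, threads, `real` seconds, `user` seconds), copied from the runs above.
runs = [
    ("StarPU", 15, 0.937, 12.296),
    ("StarPU", 31, 5.742, 2 * 60 + 53.004),  # user 2m53.004s
    ("PLASMA", 31, 2.516, 1 * 60 + 8.134),   # user 1m8.134s
]

for label, threads, real, user in runs:
    per_thread = user / threads
    fraction = busy_fraction(threads, real, user)
    print(f"{label} {threads} threads: {per_thread:.3f} s user per thread, "
          f"{fraction:.0%} of thread-time busy")
```

In the StarPU run with 31 threads, each thread accumulates roughly 5.6 s of `user` time against 0.462 s of reported computation, so the threads appear to be busy rather than idle during the untimed part of the run. The figures alone do not say what they are doing.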