lingen tuning problems
I suspect that the "auto-tuning" performed by lingen goes wrong. I have a test folder in /srv/storage/grvingt@storage5.nancy.grid5000.fr/bouillaguet/fukuoka74/bwc/ for those who want to reproduce. All the experiments described below are run on the grvingt cluster. I have a test matrix of size 5.5M with 5.5G entries (give or take). I used n=256 and m=1024.
The script:
#!/bin/bash
#OAR --queue production
#OAR --property cluster='grvingt'
#OAR --resource walltime=4
CADOPATH=$HOME/cado-nfs/build/grvingt-1.nancy.grid5000.fr/
WDIR=/srv/storage/grvingt@storage5.nancy.grid5000.fr/bouillaguet/fukuoka74/bwc/
$CADOPATH/linalg/bwc/lingen_u64k1 split-output-file=1 afile=A0-256.0-39424 ffile=F wdir=$WDIR prime=2 n=256 m=1024 thr=4x8 tuning_thresholds=recursive:128,ternary:6400,cantor:6400
This terminates in 11500s (about 3h12min). Almost no tuning is done since the thresholds are given (these are the "default" values supplied by bwc.pl).
Now, running the same thing without explicitly providing the thresholds:
$CADOPATH/linalg/bwc/lingen_u64k1 split-output-file=1 afile=A0-256.0-39424 ffile=F wdir=$WDIR prime=2 n=256 m=1024 thr=4x8 tuning_log_filename=lingen_tuning.log tuning_schedule_filename=lingen_tuning.schedule tuning_timing_cache_filename=lingen_tuning.cache
This runs the full "auto-tuning procedure", which takes about 14h. It concludes that using the quadratic algorithm all the time is the best option (see lingen_tuning.log, lingen_tuning.schedule and lingen_tuning.cache). Obviously, the subsequent execution takes forever (more than 12h). So the result of the auto-tuning is worse than the default parameters.
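To make such wall-clock comparisons easier to repeat, here is a minimal sketch of a bash helper that records how long a command took. The function name timed_run and the file name timings.tsv are hypothetical choices for illustration; they are not part of lingen or cado-nfs, and the stand-in command should be replaced by the actual invocation.

```bash
#!/bin/bash
# timed_run LABEL CMD [ARGS...]
# Runs CMD, then appends "LABEL<TAB>elapsed_seconds<TAB>exit_status" to
# $TIMING_LOG. Both the helper and the log file name are illustrative
# assumptions, not something lingen produces.
TIMING_LOG=${TIMING_LOG:-timings.tsv}

timed_run() {
    local label=$1
    shift
    local start=$SECONDS          # bash built-in: seconds since shell start
    "$@"
    local status=$?               # exit status of the timed command
    printf '%s\t%d\t%d\n' "$label" "$((SECONDS - start))" "$status" >> "$TIMING_LOG"
    return "$status"
}

# Stand-in command; replace `sleep 1` with the lingen command line
# from the script above.
timed_run demo sleep 1
```

Since SECONDS has one-second resolution, this is only meaningful for long runs such as the ones discussed here.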
Now, if one goes MPI with 4 nodes, the situation is a bit different. This OAR script:
#!/bin/bash
#OAR --queue production
#OAR --property cluster='grvingt'
#OAR --resource nodes=4,walltime=4
CADOPATHMPI=$HOME/cado-nfs/build/grvingt-11.nancy.grid5000.fr.mpi/
WDIR=/srv/storage/grvingt@storage5.nancy.grid5000.fr/bouillaguet/fukuoka74/bwc/
mpiexec --map-by ppr:1:node --hostfile=$OAR_NODE_FILE $CADOPATHMPI/linalg/bwc/lingen_u64k1 split-output-file=1 afile=A0-256.0-39424 ffile=F wdir=$WDIR prime=2 n=256 m=1024 mpi=2x2 thr=4x8 tuning_thresholds=recursive:128,ternary:6400,cantor:6400
This does the job in 8400s, a meager 1.37x speedup over the single-node run (using 4 nodes!). If one removes the tuning_thresholds=... part and runs:
mpiexec --map-by ppr:1:node --hostfile=$OAR_NODE_FILE $CADOPATHMPI/linalg/bwc/lingen_u64k1 split-output-file=1 afile=A0-256.0-39424 ffile=F wdir=$WDIR prime=2 n=256 m=1024 mpi=2x2 thr=4x8 tuning_log_filename=lingen_tuning.mpi.log tuning_schedule_filename=lingen_tuning.mpi.schedule tuning_timing_cache_filename=lingen_tuning.mpi.cache
then it runs the tuning procedure before doing the job. The first run took 14180s; the second one (reusing the tuning results) took 13380s. So the tuning itself takes negligible time, as opposed to the single-node case. But again, the result of the tuning is worse than the default parameters (see lingen_tuning.mpi.cache, lingen_tuning.mpi.log and lingen_tuning.mpi.schedule). This time, the threshold between quadratic and recursive is at 1200.
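For reference, the ratios implied by these timings can be recomputed with a few lines of bash. This is only a sketch: the variable names are made up for illustration, the numbers are the wall-clock times quoted in this report, and the figures are recomputed from the raw timings, so they may differ slightly from rounded values quoted in the text. Treating the difference between the first and second MPI runs as "tuning time" is an assumption that ignores run-to-run variation.

```bash
#!/bin/bash
# Wall-clock timings in seconds, copied from the report.
single_default=11500    # 1 node, explicit tuning_thresholds
mpi_default=8400        # 4 nodes (mpi=2x2), explicit tuning_thresholds
mpi_tuned_first=14180   # 4 nodes, auto-tuning, first run
mpi_tuned_second=13380  # 4 nodes, auto-tuning, reusing tuning results

# ratio A B: print A/B rounded to two decimals.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

echo "speedup, 4 nodes vs 1 node (default thresholds): $(ratio "$single_default" "$mpi_default")"
echo "parallel efficiency on 4 nodes: $(ratio "$single_default" "$((4 * mpi_default))")"
echo "slowdown, auto-tuned vs default thresholds (MPI): $(ratio "$mpi_tuned_second" "$mpi_default")"
echo "rough tuning overhead (MPI), in seconds: $((mpi_tuned_first - mpi_tuned_second))"
```

With the numbers above this gives a speedup of 1.37 (about 34% parallel efficiency on 4 nodes), a 1.59x slowdown of the auto-tuned MPI run relative to the default thresholds, and roughly 800s of tuning overhead.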