las uses both OpenMP and pthreads, which causes several issues.
On 33815b67, the clang build is in some cases more than twice as slow as the gcc build, while in other situations both are equally bad. For stupid reasons.
```shell
eval $(make show)
if ! [ -f /tmp/c120.roots.gz ] ; then
    $build_tree/sieve/makefb -poly parameters/polynomials/c120.poly \
        -lim 5500000 -maxbits 12 -out /tmp/c120.roots.gz -t 4
fi
```
With a gcc-9.3.0 build and a clang-9.0.1 build respectively, I get (on my home machine):
```shell
localhost $ ./build/localhost/sieve/las -poly parameters/polynomials/c120.poly -I 12 -q0 4000000 -q1 4001000 -lim0 3000000 -lim1 5500000 -lpb0 27 -lpb1 27 -mfb0 54 -mfb1 54 -ncurves0 14 -ncurves1 19 -fb1 /tmp/c120.roots.gz -t auto -production | tail -n 1
# Total 3533 reports [0.00393s/r, 49.8r/sq] in 3.63 elapsed s [382.1% CPU]
localhost $ ./build/localhost.clang/sieve/las -poly parameters/polynomials/c120.poly -I 12 -q0 4000000 -q1 4001000 -lim0 3000000 -lim1 5500000 -lpb0 27 -lpb1 27 -mfb0 54 -mfb1 54 -ncurves0 14 -ncurves1 19 -fb1 /tmp/c120.roots.gz -t auto -production | tail -n 1
# Total 3533 reports [0.00783s/r, 49.8r/sq] in 8.7 elapsed s [317.9% CPU]
```
Another test is on grvingt. Here, things go really badly: in both cases there is a long wait at the beginning of the computation, between the lines `# Reading side-1 factor base took 0.1s (0.1s real)` and `# polynomial has no roots for xxx of the yyy primes that were tried`.
The culprit is the mix of pthreads (or other kinds of tailor-made threads) and OpenMP threading in las.
las is programmed to use the machine fully (with `-t auto` at least), and at any rate this is the usage we have in mind. There are (to my knowledge) at least two places that las reaches as "utility" code, and that use OpenMP (while las proper does not):
- the `mpz_poly` layer in `utils/mpz_poly.cpp`
- the product tree code in `sieve/ecm/batch.cpp`
Unfortunately, the OpenMP runtime eagerly spawns as many threads as it sees fit, and those threads seem to keep taxing the CPU continuously, leading to very inefficient code. YMMV, which is why, in the test on my home machine above, the gcc build appears not to be affected. In some cases, however, we pay a very high price.
There are several possible runtime workarounds.
- run with `OMP_NUM_THREADS=1`; it is probably fine to do so, at least as far as the `utils/mpz_poly.cpp` code is concerned. While it is useful to have it OpenMP'ed in certain cases, that is not the case with las. The situation with the batch code is a bit different, and I'm not sure what we should do there.
- run with `OMP_DYNAMIC=true`; it may or may not be a good idea, but I really don't like it. Results are not deterministic, and what the OpenMP runtime decides to do is bound to be based on heuristics that we cannot control.
However, I think that we should rather fix this in the code.