The client handling the first polyselect workunit dies silently, and the server waits forever.
Context: current git version 353f9cf7, on a laptop with 4 virtual cores running Ubuntu 20.04, with default compile options.
A basic factorization of a c30 sometimes stalls at an early stage. To reproduce, run a loop like:
while true; do ./cado-nfs.py 30434402051897144953953380371; done
On my setup, this quickly gets stuck at the following stage:
Info:Complete Factorization / Discrete logarithm: Factoring 30434402051897144953953380371
Info:HTTP server: serving at https://rillettes:34609 (0.0.0.0)
Info:HTTP server: For debugging purposes, the URL above can be accessed if the server.only_registered=False parameter is added
Info:HTTP server: You can start additional cado-nfs-client.py scripts with parameters: --server=https://rillettes:34609 --certsha1=12de84a7951da6234b3b8a8788bd18bc9dbe6e1b
Info:HTTP server: If you want to start additional clients, remember to add their hosts to server.whitelist
Info:Client Launcher: Starting client id localhost on host localhost
Info:Client Launcher: Starting client id localhost+2 on host localhost
Info:Client Launcher: Starting client id localhost+3 on host localhost
Info:Client Launcher: Starting client id localhost+4 on host localhost
Info:Client Launcher: Running clients: localhost (Host localhost, PID 327089), localhost+2 (Host localhost, PID 327092), localhost+3 (Host localhost, PID 327095), localhost+4 (Host localhost, PID 327098)
Info:Polynomial Selection (size optimized): Starting
Info:Polynomial Selection (size optimized): 0 polynomials in queue from previous run
Info:Polynomial Selection (size optimized): Adding workunit c30_polyselect1_0-2500 to database
Info:Polynomial Selection (size optimized): Adding workunit c30_polyselect1_2500-5000 to database
Info:HTTP server: 127.0.0.1 Sending workunit c30_polyselect1_0-2500 to client localhost+2
Info:HTTP server: 127.0.0.1 Sending workunit c30_polyselect1_2500-5000 to client localhost
Info:Polynomial Selection (size optimized): Parsed 7 polynomials, added 5 to priority queue (has 2)
Info:Polynomial Selection (size optimized): Worst polynomial in queue now has exp_E 10.300000
Info:Polynomial Selection (size optimized): Marking workunit c30_polyselect1_2500-5000 as ok (50.0% => ETA Unknown)
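Since the server simply stops printing at this point, the stall can be caught automatically by putting a time limit on each run of the reproduction loop. This is a minimal sketch only: `run_until_hang` is a hypothetical helper name, and the 300-second limit in the usage comment is an assumed value, chosen to be well above the duration of a normal c30 run. It relies on GNU `timeout`, which exits with status 124 when the limit is reached.

```shell
# Hypothetical helper: rerun a command until one run exceeds the time limit.
# Usage: run_until_hang LIMIT_SECONDS COMMAND [ARGS...]
run_until_hang() {
    limit=$1
    shift
    runs=0
    while :; do
        runs=$((runs + 1))
        # GNU timeout returns 124 when it had to stop the command.
        timeout "$limit" "$@" >/dev/null 2>&1
        rc=$?
        if [ "$rc" -eq 124 ]; then
            echo "run $runs exceeded ${limit}s"
            return 0
        fi
    done
}

# Assumed usage for this report (300 s is an arbitrary limit):
# run_until_hang 300 ./cado-nfs.py 30434402051897144953953380371
```

The helper discards the command's output, so the working directory of the stuck run has to be inspected separately.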
The problem is that the client in charge of the first WU (localhost+2) is dead. Its log file ends with the following lines:
rillettes> tail localhost+2.log
Downloading https://localhost:34609/cgi-bin/getwu?clientid=localhost+2 to /tmp/cado.0q50rwxa/client/download/WU.localhost+2808927621 (cafile = /tmp/cado.0q50rwxa/client/download/server.12de84a7951da6234b3b8a8788bd18bc9dbe6e1b.pem)
Running env LC_ALL=C curl --silent --show-error --fail --output /tmp/cado.0q50rwxa/client/download/WU.localhost+2808927621 --cacert /tmp/cado.0q50rwxa/client/download/server.12de84a7951da6234b3b8a8788bd18bc9dbe6e1b.pem --connect-timeout 10 https://localhost:34609/cgi-bin/getwu?clientid=localhost+2
[Thu Jun 18 15:31:44 2020] Subprocess has PID 327099
spin=0 is_wu=False blog=0
Downloading https://localhost:34609/polyselect to /tmp/cado.0q50rwxa/client/download/polyselect623172891 (cafile = /tmp/cado.0q50rwxa/client/download/server.12de84a7951da6234b3b8a8788bd18bc9dbe6e1b.pem)
Running env LC_ALL=C curl --silent --show-error --fail --output /tmp/cado.0q50rwxa/client/download/polyselect623172891 --cacert /tmp/cado.0q50rwxa/client/download/server.12de84a7951da6234b3b8a8788bd18bc9dbe6e1b.pem --connect-timeout 10 https://localhost:34609/polyselect
[Thu Jun 18 15:31:44 2020] Subprocess has PID 327103
Setting executable flag for /tmp/cado.0q50rwxa/client/download/polyselect
Result file /tmp/cado.0q50rwxa/client/localhost+2.work/c30.polyselect1.0-2500 does not exist
Running /tmp/cado.0q50rwxa/client/download/polyselect -P 1000 -N 30434402051897144953953380371 -degree 3 -t 2 -admin 0 -admax 2500 -incr 20 -nq 3
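One way to confirm which clients are gone is to probe the PIDs printed on the "Running clients" line of the server log. The sketch below uses `kill -0`, which sends no signal and only checks that the process exists; `is_alive` is a hypothetical helper name. It assumes the clients run under the same user, since `kill -0` also fails on processes owned by someone else.

```shell
# Hypothetical helper: print "alive" if a process with this PID exists
# and can be signalled by the current user, "dead" otherwise.
is_alive() {
    if kill -0 "$1" 2>/dev/null; then
        echo alive
    else
        echo dead
    fi
}

# PIDs copied from the "Running clients" line of the server log above.
for pid in 327089 327092 327095 327098; do
    echo "client PID $pid: $(is_alive "$pid")"
done
```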
Could there be a race condition between the two clients, where the freshly downloaded binary is replaced by the one downloaded by the other client? But then, why is there no error message anywhere? I'm confused...
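To make the suspected interleaving concrete, here is a deterministic shell sketch of it. It does not use cado-nfs-client.py and does not claim the client behaves this way: the file names `bin`, `bin.A` and `bin.B` are hypothetical stand-ins for the shared `polyselect` path and the two temporary download names. It only shows that if a second process moves its own fresh download over the file after the first process has set the executable flag, the first process can no longer execute it.

```shell
dir=$(mktemp -d)

# Client A: download to a temporary name, move into place, set the exec flag.
printf '#!/bin/sh\necho ok\n' > "$dir/bin.A"
mv "$dir/bin.A" "$dir/bin"
chmod +x "$dir/bin"

# Client B: its own download replaces A's file before A has run it.
# The new file has not had its exec flag set yet.
printf '#!/bin/sh\necho ok\n' > "$dir/bin.B"
mv "$dir/bin.B" "$dir/bin"

# Client A now tries to run the binary it believes it just prepared.
if "$dir/bin"; then
    status=ran
else
    status=failed
fi
echo "client A: $status"

rm -rf "$dir"
```

In this sketch the shell does report the failed exec on stderr, so if something similar happens in the client, the open question is where that error ends up.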