Commit 80d86bbe authored by Emmanuel Thomé

still work in progress

parent f1ce13c7
# DLP-240
This repository contains information to reproduce the DLP-240 discrete
logarithm record.
@@ -42,7 +42,7 @@ A typical command line for an individual work unit was:
$CADO_BUILD/polyselect/dlpolyselect -N 124620366781718784065835044608106590434820374651678805754818788883289666801188210855036039570272508747509864768438458621054865537970253930571891217684318286362846948405301614416430468066875699415246993185704183030512549594371372159029285303 -df 4 -dg 3 -area 2.0890720927744e+20 -Bf 34359738368.0 -Bg 34359738368.0 -bound 150 -modm 1000003 -modr 42 -t 8
```
where `-modr 42` gives the index of the task, and all tasks between 0 and
1000002 were run.
A ranking of all the computed pairs was based on MurphyE as computed
by `dlpolyselect` with the parameters
@@ -63,18 +63,18 @@ makes little sense to report the total number of CPU-years really used.
The calendar time was 18 days.
When a node of `grvingt` is fully loaded with 8 jobs of 8 threads, one
task as above is processed in 1200 wall clock seconds on average.
This must be multiplied by the 4 physical cores it uses to get
the number of core-seconds per modr value. And we have 10^6 of them to
process. This adds up to 152 core.years for the whole polynomial selection.
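As a quick sanity check, this arithmetic can be redone in a few lines of
Python using only the figures quoted above (1200 s per task, 4 physical
cores per task, 10^6 tasks):
```
# Sanity check of the polynomial selection cost, using the figures quoted above.
wall_clock_per_task = 1200        # seconds per task, node loaded with 8 jobs of 8 threads
cores_per_task = 4                # physical cores used by one 8-thread job
number_of_tasks = 10**6           # modr values 0..1000002, roughly 10^6 tasks

core_seconds = wall_clock_per_task * cores_per_task * number_of_tasks
print(core_seconds / (3600 * 24 * 365))   # about 152 core.years
```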
Some sample-sieving was done on the top 100 polynomials according to
MurphyE. Although there is a clear correlation between the efficiency of
a polynomial pair and its MurphyE value, the ranking is definitely not perfect.
In particular, the best-ranked polynomial pair according to MurphyE finds
10% fewer relations than the (truly) best ones.
Additional sample sieving was performed on the few best candidates. With
a test on 128,000 special-q, 3 polynomials could not be separated.
We ended up using the following [`dlp240.poly`](dlp240.poly):
@@ -105,8 +105,8 @@ Here is what it gives with final parameters used in the computation. Here,
gzip soon becomes the limiting factor).
```
$CADO_BUILD/sieve/makefb -poly dlp240.poly -side 0 -lim 536870912 -maxbits 16 -t 16 -out $DATA/dlp240.fb0.gz
$CADO_BUILD/sieve/makefb -poly dlp240.poly -side 1 -lim 268435456 -maxbits 16 -t 16 -out $DATA/dlp240.fb1.gz
```
These files have sizes 209219374 and 103814592 bytes, respectively. They
@@ -118,7 +118,7 @@ number of unique relations per special-q matters. The timing does not
matter.
```
$CADO_BUILD/sieve/las -poly dlp240.poly -fb0 $DATA/dlp240.fb0.gz -fb1 $DATA/dlp240.fb1.gz -lim0 536870912 -lim1 268435456 -lpb0 35 -lpb1 35 -q0 150e9 -q1 300e9 -dup -dup-qmin 150000000000,0 -sqside 0 -A 31 -mfb0 70 -mfb1 70 -lambda0 2.2 -lambda1 2.2 -random-sample 1024 -allow-compsq -qfac-min 8192 -qfac-max 100000000 -allow-largesq -bkmult 1.10 -t auto -v -fbc /tmp/dlp240.fbc
```
In less than half an hour on our target machine `grvingt`, this gives
@@ -139,33 +139,34 @@ number_of_sq = 3.67e9
tot_rels = ave_rel_per_sq * number_of_sq
print (tot_rels)
```
This estimate of 2.2e9 relations can be made more precise by increasing
the number of special-q that are sampled for sieving. It is also possible
to have different nodes sampling different sub-ranges of the global range
to get the result faster. We consider that sampling 1024 special-qs is
enough to get a reliable estimate.
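If different nodes sample different sub-ranges, as suggested above, the
partial estimates simply add up. A minimal sketch of that bookkeeping,
with made-up placeholder numbers rather than measured ones, could look as
follows:
```
# Hypothetical bookkeeping when several nodes sample distinct special-q sub-ranges.
# The (ave_rel_per_sq, number_of_sq) pairs below are placeholders, not measured values;
# in practice they come from the las sampling output and from the special-q density
# of each sub-range.
subranges = [
    (0.62, 1.9e9),   # e.g. a node sampling the lower part of [150e9, 300e9]
    (0.58, 1.8e9),   # e.g. a node sampling the upper part
]
tot_rels = sum(ave * nsq for ave, nsq in subranges)
print(tot_rels)      # of the same order as the 2.2e9 estimate above
```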
## Estimating the cost of sieving
In production there is no need to activate the on-the-fly duplicate
removal, which is supposed to be cheap but may not be negligible. It is
also important to emulate the fact that the cached factor base (the
`/tmp/dlp240.fbc` file) is precomputed and hot (i.e., cached in memory by
the OS and/or the hard drive), because this is the situation in
production; for this, it suffices to start a first run and interrupt it
as soon as the cache has been written. Of course, we use the batch
smoothness detection on side 1, so we have to precompute the product of
all primes to be extracted. This means that, on the other hand, the file
`$DATA/dlp240.fb1.gz` is _not_ needed in production (we only used it for
the estimation of the number of unique relations).
```
$CADO_BUILD/sieve/ecm/precompbatch -poly dlp240.poly -lim1 0 -lim0 536870912 -batch0 /dev/null -batch1 $DATA/dlp240.batch1 -batchlpb0 29 -batchlpb1 28
```
Then a typical benchmark is as follows:
```
time $CADO_BUILD/sieve/las -v -poly dlp240.poly -t auto -fb0 $DATA/dlp240.fb0.gz -allow-compsq -qfac-min 8192 -qfac-max 100000000 -allow-largesq -A 31 -lim1 0 -lim0 536870912 -lpb0 35 -lpb1 35 -mfb1 250 -mfb0 70 -batchlpb0 29 -batchlpb1 28 -batchmfb0 70 -batchmfb1 70 -lambda1 5.2 -lambda0 2.2 -batch -batch1 $DATA/dlp240.batch1 -sqside 0 -bkmult 1.10 -q0 150e9 -q1 300e9 -fbc /tmp/dlp240.fbc -random-sample 2048
```
On our sample machine, the result of the above line is:
@@ -174,8 +175,8 @@ real 22m19.032s
user 1315m10.459s
sys 5m56.262s
```
Then the `22m19.032s` value must be appropriately scaled in order to
convert it into physical core-seconds. For instance, in our
case, since there are 32 cores and we sieved 2048 special-qs, this gives
`(22*60+19.0)*32/2048=20.9` core.seconds per special-q.
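Combining this per-special-q cost with the estimate of roughly 3.67e9
special-q for the whole range gives a back-of-the-envelope figure for the
total sieving effort; the short computation below only reuses numbers
already given above and is not an exact accounting:
```
# Back-of-the-envelope extrapolation of the total sieving cost from the benchmark above.
wall_clock = 22 * 60 + 19.0    # seconds for 2048 sampled special-q on one 32-core node
cost_per_sq = wall_clock * 32 / 2048          # about 20.9 core.seconds per special-q
number_of_sq = 3.67e9                         # estimate for the global q-range (see above)
print(cost_per_sq * number_of_sq / (3600 * 24 * 365))   # roughly 2400 core.years
```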
@@ -213,7 +214,8 @@ random distribution). This reports an anticipated time of about 2.65
seconds per iteration (running on 4 nodes of the `grvingt` cluster).
To obtain timings in a different way, the following procedure can also
be used, maybe as a complement to the above, to generate a complete fake
matrix of the required size with the
[`generate_random_matrix.sh`](generate_random_matrix.sh) script (which
takes well over an hour), and measure the time for 128 iterations (which
takes only a few minutes). Within the script
@@ -225,52 +227,51 @@ adjusted to the users' needs.
DATA=$DATA CADO_BUILD=$CADO_BUILD MPI=$MPI nrows=37000000 density=250 nthreads=32 ./dlp240-linalg-0a-estimate_linalg_time_coarse_method_b.sh
```
This second method reports about 3.1 seconds per iteration. Allowing for
some inaccuracy, these experiments are sufficient to build confidence
that the time per iteration in the krylov (a.k.a. "sequence") step of
block Wiedemann is close to 3 seconds. The time per iteration in the
mksol (a.k.a. "evaluation") step is in the same ballpark. The time for
krylov+mksol can then be estimated as the product of this timing with
`(1+n/m+1/n)*N`, with `N` the number of rows, and `m` and `n` the block
Wiedemann parameters (we chose `m=48` and `n=16`). Applied to our use
case, this gives an anticipated cost of
`(1+n/m+1/n)*N*3*4*32/3600/24/365=628` core-years for Krylov+Mksol (4 and
32 representing the fact that we used 4-node jobs with 32 physical cores
per node).
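The cost formula can be evaluated directly; in the short computation
below, `N` is taken to be the 37M row count used in the fake-matrix
experiment above, and 3 seconds per iteration on a 4-node,
32-cores-per-node job is the figure just discussed:
```
# Anticipated core.years for krylov+mksol, evaluating the formula given above.
N = 37_000_000            # number of rows (cf. nrows=37000000 in the fake-matrix test)
m, n = 48, 16             # block Wiedemann parameters
seconds_per_iteration = 3 # approximate timing per iteration of a 4-node job
cores_per_job = 4 * 32    # 4 nodes, 32 physical cores each

iterations = (1 + n / m + 1 / n) * N
print(iterations * seconds_per_iteration * cores_per_job / (3600 * 24 * 365))   # about 628
```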
The "lingen" (linear generator) step of block Wiedemann was perceived as
the main potential stumbling block for the computation. We had to ensure
that it would be doable with the resources we had. To this end, a
"tuning" of the lingen program can be done with the `--tune` flag, so as
to get an advance look at the cpu and memory requirements for that step.
These tests were sufficient to convince us that we had several possible
parameter settings to choose from, and that this computation was doable.
## Validating the claimed sieving results
The benchmark command lines above can be used almost as-is for
reproducing the full computation. It is just necessary to remove the
`-random-sample` option and to adjust the `-q0` and `-q1` parameters in
order to create many small work units that in the end cover exactly the
global q-range.
Since we do not expect anyone to spend the same amount of computing
resources just to redo exactly the same computation, we provide in the
[`dlp240-rel_count`](dlp240-rel_count) file the count of how many
(non-unique) relations were produced for each 1G special-q sub-range.
We can then have a visual plot of this data, as shown in
[`dlp240-plot_rel_count.pdf`](dlp240-plot_rel_count.pdf), where the
x-coordinate denotes the special-q (in multiples of 1G). The plot is
very regular except for special-q's around 150G and 225G. The
irregularities in these areas correspond to the beginning of the
computation, when we were still adjusting our scripts. We had two
independent servers in charge of distributing sieving tasks, one dealing
with [150G,225G], and the other one dealing with [225G,300G].
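To take a quick look at these counts without regenerating the PDF, a
short script along the following lines can be used. The two-column
format assumed here (sub-range in multiples of 1G, relation count) is
only a guess; the parsing must be adapted to the actual layout of the
`dlp240-rel_count` file.
```
# Hypothetical sketch: plot the per-1G relation counts and sum them up.
# The assumed two-column format (sub-range index, relation count) may not match
# the actual dlp240-rel_count file; adjust the parsing accordingly.
import matplotlib.pyplot as plt

xs, ys = [], []
with open("dlp240-rel_count") as f:
    for line in f:
        fields = line.split()
        if len(fields) >= 2:
            xs.append(float(fields[0]))   # special-q sub-range, in multiples of 1G
            ys.append(float(fields[1]))   # (non-unique) relations in that sub-range
plt.plot(xs, ys)
plt.xlabel("special-q (in multiples of 1G)")
plt.ylabel("relations per 1G sub-range")
plt.savefig("rel_count_check.pdf")
print("total relations:", sum(ys))
```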
In order to validate our computation, it is possible to recompute only
one of the sub-ranges (not one in the irregular areas) and check that the
number of relations is the one we report. This still requires significant
resources. If only a single node is available for the validation, it is
@@ -286,34 +287,68 @@ follows.
The filtering follows the same general workflow as in the [rsa-240
case](../rsa240/filtering.md), with some notable changes:
- not one, but two programs must be used to generate important companion
  files beforehand:
```
$CADO_BUILD/numbertheory/badideals -poly dlp240.poly -ell 62310183390859392032917522304053295217410187325839402877409394441644833400594105427518019785136254373754932384219229310527432768985126965285945608842159143181423474202650807208215234033437849707623496592852091515256274797185686079514642651 -badidealinfo $DATA/dlp240.badidealinfo -badideals $DATA/dlp240.badideals
$CADO_BUILD/sieve/freerel -poly dlp240.poly -renumber $DATA/dlp240.renumber.gz -lpb0 35 -lpb1 35 -out $DATA/dlp240.freerel.gz -badideals $DATA/dlp240.badideals -lcideals -t 32
```
- the command-line flags `-dl -badidealinfo $DATA/dlp240.badidealinfo` must be added to the `dup2` program.
- the `merge` and `replay` programs must be replaced by `merge-dl` and
  `replay-dl`, respectively.
- the `replay-dl` command line lists an extra output file
`dlp240.ideals` that is extremely important for the rest of the
computation.
### Duplicate removal
Duplicate removal was done in the default cado-nfs way. We did several
filtering runs as relations kept arriving. For each of these runs, the
integer shell variable `$EXP` was increased by one (starting from 1).
In the command below, `$new_files` is expected to expand to a file
containing a list of file names of new relations (relative to `$DATA`) to
add to the stored set of relations.
```
mkdir -p $DATA/dedup/{0..3}
$CADO_BUILD/filter/dup1 -prefix dedup -out $DATA/dedup/ -basepath $DATA -filelist $new_files -n 2 > $DATA/dup1.$EXP.stdout 2> $DATA/dup1.$EXP.stderr
grep '^# slice.*received' $DATA/dup1.$EXP.stderr > $DATA/dup1.$EXP.per_slice.txt
```
This first pass takes about 3 hours. Numbers of relations per slice are
printed by the program and must be saved for later use (hence the
`$DATA/dup1.$EXP.per_slice.txt` file).
The second pass of duplicate removal works independently on each of the
non-overlapping slices (the number of slices can thus be used as a sort
of time-memory tradeoff).
```
for i in {0..3} ; do
  nrels=`awk '/slice '$i' received/ { x+=$5 } END { print x; }' $DATA/dup1.*.per_slice.txt`
  $CADO_BUILD/filter/dup2 -nrels $nrels -renumber $DATA/dlp240.renumber.gz $DATA/dedup/$i/dedup*gz -dl -badidealinfo $DATA/dlp240.badidealinfo > $DATA/dup2.$EXP.$i.stdout 2> $DATA/dup2.$EXP.$i.stderr
done
```
### "purge", a.k.a. singleton and "clique" removal.
Here is the command line of the last filtering run that we used (revision `492b804fc`), with `EXP=7`:
```
nrels=$(awk '/remaining/ { x+=$4; } END { print x }' $DATA/dup2.$EXP.[0-3].stderr)
colmax=$(awk '/INFO: size = / { print $5 }' $DATA/dup2.$EXP.0.stderr)
$CADO_BUILD/filter/purge -out $DATA/purged$EXP.gz -nrels $nrels -outdel $DATA/relsdel$EXP.gz -keep 3 -col-min-index 0 -col-max-index $colmax -t 56 -required_excess 0.0 $DATA/dedup/*/dedup*gz
```
This took about 7.5 hours on the machine wurst, with 575GB of peak memory.
### The "merge" step
The merge step can be reproduced as follows (still with `EXP=7` for the
final experiment):
```
$CADO_BUILD/filter/merge-dl -mat $DATA/purged$EXP.gz -out $DATA/history250_$EXP -target_density 250 -skip 0 -t 28
```
This took about 20 minutes on the machine wurst, with a peak memory of 118GB.
### The "replay" step
Finally, the replay step can be reproduced as follows:
```
$CADO_BUILD/filter/replay-dl -purged $DATA/purged$EXP.gz -his $DATA/history250_$EXP.gz -out $DATA/dlp240.matrix$EXP.250.bin -index $DATA/dlp240.index$EXP.gz -ideals $DATA/dlp240.ideals$EXP.gz
```
## Estimating linear algebra time more precisely, and choosing parameters
@@ -84,7 +84,7 @@ provide guidance for all possible setups.
## Searching for a polynomial pair
See the [`polyselect.md`](polyselect.md) file in this repository.
And the winner is the [`rsa240.poly`](rsa240.poly) file:
@@ -225,11 +225,10 @@ real 75m54.351s
user 4768m41.853s
sys 43m15.877s
```
Then the `75m54.351s=4554.3s` value must be appropriately scaled in
order to convert it into physical core-seconds. For instance, in our
case, since there are 32 physical cores and we sieved 1024 special-qs,
this gives `4554.3*32/1024=142.32` core.seconds per special-q.
Finally, it remains to multiply by the number of special-q in this
subrange. We get (in Sagemath):
@@ -11,15 +11,15 @@ versions of cado-nfs changed the format of this file.)
## duplicate removal
Duplicate removal was done with revision `50ad0f1fd` of cado-nfs.
cado-nfs proceeds through two passes. We used the default cado-nfs
setting which, on the first pass, splits the input into `2^2=4`
independent slices, with no overlap. cado-nfs supports doing this step
in an incremental way, so we assume below that the shell variable
`EXP` expands to an integer indicating the filtering experiment number.
In the command below, `$new_files` is expected to expand to a file
containing a list of file names of new relations (relative to `$DATA`) to
add to the stored set of relations.
```
mkdir -p $DATA/dedup/{0..3}
$CADO_BUILD/filter/dup1 -prefix dedup -basepath $DATA -filelist $new_files -out $DATA/dedup/ -n 2 > $DATA/dup1.$EXP.stdout 2> $DATA/dup1.$EXP.stderr
```