This repository contains information to reproduce the DLP-240 discrete
logarithm record.
There are several subsections:
* [Software prerequisites, and reference hardware configuration](#software-prerequisites-and-reference-hardware-configuration)
* [Searching for a polynomial pair](#searching-for-a-polynomial-pair)
## Software prerequisites, and reference hardware configuration
This is similar to
[RSA-240](../rsa240/README.md#software-prerequisites-and-reference-hardware-configuration). For full reproducibility of the
computation, 10TB of data is perhaps a bit small; 20TB would be a more
comfortable setup.
We used [this commit of cado-nfs](https://gitlab.inria.fr/cado-nfs/cado-nfs/commit/8a72ccdde) as
a baseline. In some cases we provide exact commit numbers for specific
commands as well.
We also reiterate this important paragraph from the RSA-240 documentation.
Most (if not all) information boxes in this document rely on two shell
variables, `CADO_BUILD` and `DATA`, being set and `export`-ed to shell
subprocesses (as with `export CADO_BUILD=/blah/... ; export DATA=/blah/...`).
The `CADO_BUILD` variable should point to
a successful cado-nfs build directory. The `DATA` variable, which is
used by some scripts, should point to a directory with plenty of storage,
possibly on some shared filesystem. Storage is also needed for the
temporary files holding collected relations. Overall, a full reproduction of
the computation would need in the vicinity of 20TB of storage. All
scripts provided in this repository expect to be run from the directory where
they are placed, since they also try to access companion data
files.
## Searching for a polynomial pair
We searched for a polynomial pair _à la_ Joux-Lercier, using the
program `dlpolyselect` of `cado-nfs`, with parameters `-bound 150`,
`-degree 4` and `-modm 1000003`. We also tried to search for a skewed
pair of polynomials with the `-skewed` option, but this did not seem to
give a better polynomial for a fixed amount of time compared to plain
flat polynomials.
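As an illustration of how this search decomposes into independent tasks, the
sketch below generates one `dlpolyselect` command line per residue class
modulo 1000003 (selected with `-modr`); the binary path under `$CADO_BUILD` is
an assumption, and this is not a verbatim reproduction of the command lines
that were actually distributed.
```python
# Illustration only: one dlpolyselect task per residue class modulo 1000003.
# The binary path below is an assumption, not the exact invocation used.
modm = 1000003
bound, degree = 150, 4

def task_command(modr, cado_build="$CADO_BUILD"):
    """Command line handling the residue class modr (0 <= modr < modm)."""
    return (f"{cado_build}/polyselect/dlpolyselect -bound {bound} "
            f"-degree {degree} -modm {modm} -modr {modr}")

# print the first three of the ~10^6 work units that have to be processed
for r in range(3):
    print(task_command(r))
```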
The MurphyE values were computed by `dlpolyselect` with the
default parameters automatically computed by the server
script `cado-nfs.py` that we used for distributing the computation. These
might be imprecise, but the ranking is fairly stable when parameters
are changed.
This polynomial selection took place on a variety of computer resources,
but more than half of it was done on the `grvingt` cluster that we take
as a reference. The program was improved during the computation, so it
makes little sense to report the total number of CPU-years actually used.
The calendar time was 18 days.
When a node of `grvingt` is fully occupied with 8 jobs of 8 threads,
one task as above is processed in 1200 wall-clock seconds on average.
This must be multiplied by the 4 physical cores it uses to get
the number of core-seconds per `modr` value. We have 10^6 of them to
process. This adds up to 152 core-years for the whole polynomial selection.
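The arithmetic behind the 152 core-year figure can be checked with the numbers
quoted in this paragraph:
```python
# Back-of-the-envelope check of the polynomial selection cost quoted above.
wall_clock_per_task = 1200   # seconds for one task, on one 8-thread job
cores_per_task = 4           # physical cores occupied by that job
tasks = 10**6                # one task per modr value

core_seconds = wall_clock_per_task * cores_per_task * tasks
core_years = core_seconds / (3600 * 24 * 365)
print(f"{core_years:.0f} core-years")   # about 152
```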
Some sample-sieving was done on the top 100 polynomials according to
MurphyE. Although there is a clear correlation between the efficiency of
a polynomial pair and its MurphyE value, the ranking is definitely not perfect.
In particular, the best-ranked polynomial pair according to MurphyE finds
10% fewer relations than the actual best-performing ones.
Additional sample sieving was performed on the few best candidates. With
a test on 128,000 special-q, 3 polynomials could not be separated.
To estimate the number of relations produced by a set of parameters:
- We compute the corresponding factor bases.
- We randomly sample in the global q-range, using sieving instead of batch:
this produces the same relations. This is slower but `-batch` is
incompatible (with the version of cado-nfs we used) with on-line duplicate removal.
Here is the result with the final parameters used in the computation. The
`-t 16` option specifies the number of threads (more is essentially useless, since
gzip soon becomes the limiting factor).
```shell
$CADO_BUILD/sieve/makefb -poly dlp240.poly -side 1 -lim 268435456 -maxbits 16 -t
```
These files have size 209219374 and 103814592 bytes, respectively. They
take less than a minute to compute.
We can now sieve for randomly sampled special-q, removing duplicate
relations on the fly. In the output of the command line below, only the
number of unique relations per special-q matters; the timing does not.
```shell
$CADO_BUILD/sieve/las -poly dlp240.poly -fb0 $DATA/dlp240.fb0.gz -fb1 $DATA/dlp2
```
In less than half an hour on our target machine `grvingt`, this yields
an estimate of 0.61 unique relations per special-q with
these parameters. (Note that this test can also be done in
parallel over several nodes, using the `-seed [[seed value]]` argument in
order to vary the random choices.)
In order to deduce an estimate of the total number of (de-duplicated)
relations, we need to multiply the average number of relations per
special-q obtained during the sample sieving by the number of
special-q in the global q-range. This number of composite special-q can
be computed exactly or estimated using the logarithmic integral function.
For the target interval, there are 3.67e9 special-q.
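A minimal sketch of this extrapolation, using only the two figures quoted
above (0.61 relations per special-q and 3.67e9 special-q); the special-q count
itself would be recomputed from the q-range, as explained above:
```python
# Extrapolation from the sample sieving to the whole q-range.
rels_per_sq = 0.61   # unique relations per special-q (sample sieving above)
nb_sq = 3.67e9       # composite special-q in the global q-range

tot_rels = rels_per_sq * nb_sq
print(f"about {tot_rels / 1e9:.1f}G relations")   # about 2.2G
```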
This estimate (2.2G relations) can be made more precise by increasing the
number of special-q that are sampled for sieving. It is also possible to
have different nodes sample different sub-ranges of the global range to
get the result more quickly. We consider sampling 1024 special-qs to be
enough to get a reliable estimate.
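As a rough illustration of why a sample of this size is sufficient (this
argument is not from the original write-up, and the Poisson-like spread below
is an optimistic assumption), the standard error of the mean over 1024 samples
is only a few percent:
```python
# Illustration only: optimistic Poisson-like spread of the per-special-q yield.
from math import sqrt

mean_yield = 0.61
samples = 1024
std_err = sqrt(mean_yield) / sqrt(samples)
print(f"standard error ~{std_err:.3f} ({100 * std_err / mean_yield:.0f}% relative)")
```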
## Estimating the cost of sieving
In production there is no need to activate the on-the-fly duplicate
removal, which is supposed to be cheap but perhaps not negligible, and it is
important to emulate the fact that the cached factor base (the
`/tmp/dlp240.fbc` file) is precomputed and hot (i.e., cached in memory by
the OS and/or the hard-drive), because this is the situation in production.
In our benchmark, sieving 2048 special-q on one 32-core node gave the following timings:
```
real 22m19.032s
user 1315m10.459s
sys 5m56.262s
```
Then the `22m19.032s` value must be appropriately scaled in order to
convert it to physical core-seconds. For instance, in our
case, since there are 32 cores and we sieved 2048 special-qs, this gives
`(22*60+19.0)*32/2048=20.9` core-seconds per special-q.
Finally, we need to multiply by the number of special-q in this
subrange. We get (in Sagemath):
```python
cost_in_core_years=cost_in_core_hours/24/365
print (cost_in_core_hours, cost_in_core_years)
```
With this experiment, we get 20.9 core-seconds per special-q, and therefore
we obtain about 2430 core-years for the total sieving time.
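Chaining the numbers end to end gives a quick consistency check (using the
3.67e9 special-q count from the estimate above; the small difference with the
2430 figure is rounding):
```python
# 22m19s of wall-clock time on 32 cores for 2048 special-q, scaled up to
# the 3.67e9 special-q of the global q-range.
core_sec_per_sq = (22 * 60 + 19.0) * 32 / 2048
nb_sq = 3.67e9

core_years = core_sec_per_sq * nb_sq / (3600 * 24 * 365)
print(f"{core_sec_per_sq:.1f} core-seconds per special-q")   # 20.9
print(f"about {core_years:.0f} core-years of sieving")       # ~2430, up to rounding
```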
## Estimating the linear algebra time (coarsely)
We assume that cado-nfs was compiled with MPI
enabled (i.e., the `MPI` shell variable was set to the path of your MPI
installation), and that `CADO_BUILD` points to the directory where the
corresponding binaries were built.
To estimate in advance the linear algebra time for a sparse binary
matrix with (say) 37M rows/columns and 250 non-zero entries per row, it
is possible to _stage_ a real set-up, just for the purpose of
measurement. cado-nfs has a useful _staging_ mode precisely for this
purpose. In the RSA-240 context, we advise against its use because of
bugs, but these bugs seem to be less of a hurdle in the DLP-240 case. The
only weirdness concerns the random distribution of the generated matrices.
The block Wiedemann parameters we chose were `m=48` and `n=16`.
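As a rough back-of-the-envelope, not taken from the original document, the
usual block Wiedemann sequence length of about N/n + N/m matrix-times-vector
products per sequence gives an idea of the Krylov workload for these
parameters:
```python
# Standard block Wiedemann sequence length ~ N/n + N/m (back-of-the-envelope).
N = 37_000_000   # matrix dimension (about 37M rows/columns)
m, n = 48, 16    # block Wiedemann blocking parameters

iters_per_sequence = N // n + N // m
print(f"~{iters_per_sequence / 1e6:.1f}M matrix-times-vector products per sequence")
```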
The "lingen" (linear generator) step of block Wiedemann was seen as
the main potential stumbling block for the computation. We had to ensure
that it would be possible with the resources we had. To this end, a
"tuning" of the lingen program can be done with the `--tune` flag, so as
to get an advance look at the CPU and memory requirements for that step.
These tests were sufficient to convince us that we had several possible
parameter settings to choose from, and that this computation was feasible.
## Validating the claimed sieving results
The benchmark command lines above can be used almost as is to reproduce
the full computation. It is only necessary to remove the `-random-sample`
option and to adjust the `-q0` and `-q1` parameters in order to create
many small work units that in the end cover exactly the global q-range.
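A minimal sketch of such a splitting is shown below; the q-range endpoints and
the work-unit width are placeholders, not the values used for the record:
```python
# Cut a q-range [q0, q1) into consecutive work units of fixed width.
def work_units(q0, q1, width):
    """Yield (start, end) pairs covering [q0, q1) exactly, with no overlap."""
    q = q0
    while q < q1:
        yield q, min(q + width, q1)
        q += width

# placeholder endpoints and width, for illustration only
for a, b in work_units(q0=1_000_000_000, q1=1_000_500_000, width=100_000):
    print(f"-q0 {a} -q1 {b}")
```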
Since we do not expect anyone to spend this much computing power just to
redo exactly the same computation, we provide the count
of how many (non-unique) relations were produced for each 1G special-q
sub-range in the [`dlp240-rel_count`](dlp240-rel_count) file.
We can then visually plot this data, as shown in
[`dlp240-plot_rel_count.pdf`](dlp240-plot_rel_count.pdf), where the
x-coordinate denotes the special-q (in multiples of 1G). The plot is
very regular except for special-q around 150G and 225G.
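The sketch below sums the per-sub-range counts; the exact format of
`dlp240-rel_count` is an assumption here (one line per 1G sub-range, with the
relation count as the last field):
```python
# Sum the per-sub-range relation counts (file format assumed: one line per
# 1G special-q sub-range, count in the last whitespace-separated field).
counts = []
with open("dlp240-rel_count") as f:
    for line in f:
        line = line.split("#")[0].strip()
        if line:
            counts.append(int(line.split()[-1]))

print(f"{len(counts)} sub-ranges, {sum(counts) / 1e9:.2f}G relations (with duplicates)")
```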
## Reproducing the filtering results
All relation files collected during sieving were collated into a more
manageable number of large files (150 files of 3.2GB each). These had to
undergo filtering in order to produce a linear system. The process is as
follows.
The filtering follows roughly the same general workflow as in the RSA-240 case.
### Duplicate removal
Duplicate removal was carried out in the default cado-nfs way. We did several
filtering runs as relations kept arriving. For each of these runs, the
integer shell variable `$EXP` was increased by one (starting from 1).
In the command below, `$new_files` is expected to expand to a file listing the new relation files.
```shell
$CADO_BUILD/filter/dup1 -prefix dedup -out $DATA/dedup/ -basepath $DATA -filelis
grep '^# slice.*received' $DATA/dup1.$EXP.stderr > $DATA/dup1.$EXP.per_slice.txt
```
This first pass takes about 3 hours on the full data set.
The number of relations per slice is printed by the program and must be
saved for later use (hence the `$DATA/dup1.$EXP.per_slice.txt` file).
The second pass of duplicate removal works independently on each of the 4 slices created by the first pass:
```shell
for i in {0..3} ; do
$CADO_BUILD/filter/dup2 -nrels $nrels -renumber $DATA/dlp240.renumber.gz -dl -badidealinfo $DATA/dlp240.badidealinfo $DATA/dedup/$i/dedup*gz > $DATA/dup2.$EXP.$i.stdout 2> $DATA/dup2.$EXP.$i.stderr
done
```
(Note: in newer versions of cado-nfs, after June 2020, the `-badidealinfo
$DATA/dlp240.badidealinfo` arguments to the `dup2` program must be
replaced by `-poly dlp240.poly`.)
However, on the very coarse-grain level we focus on two of them:
* _how much_ we sieve;
* _how dense_ we want the final matrix to be.
Sieving more is expected to have a beneficial impact on the matrix size,
but this benefit can become marginal, eventually reaching a point of
diminishing returns. Allowing a denser matrix also makes it possible
to have fewer rows in the final matrix, which is good for various
memory-related concerns.
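For a sense of scale, here is a rough size estimate for such a matrix,
assuming about 4 bytes per nonzero entry for the column index alone (an
assumption; the actual on-disk and in-memory formats differ):
```python
# Rough size estimate for a 37M x 37M matrix with ~250 nonzeros per row,
# counting only ~4 bytes per entry for the column index (an assumption).
rows = 37_000_000
density = 250

nnz = rows * density
size_gib = nnz * 4 / 2**30
print(f"{nnz / 1e9:.2f}G nonzero entries, ~{size_gib:.0f} GiB of column indices")
```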
where the last 16 lines (steps `3-krylov`) correspond to the 16 "sequences" (vector blocks
numbered `0-1`, `1-2`, until `15-16`). These sequences can
be run concurrently on different sets of nodes, with no synchronization
needed. Each of these 16 sequences needs about 90 days to complete (in
practice, we used a different platform than the one we report timings
for, but the timings and calendar time were in the same ballpark). Jobs
can be interrupted, and may be restarted exactly
where they left off. E.g., if the latest of the `V1-2.*` files in
`$DATA` is `V1-2.86016`, then the job for sequence 1 can be restarted
from iteration 86016.
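A small helper along these lines (hypothetical, not part of cado-nfs) can
locate the latest checkpoint index for a given sequence:
```python
# Find the largest k among the V1-2.k checkpoint files in $DATA
# (hypothetical helper, not part of cado-nfs).
import glob
import os

data = os.environ["DATA"]
indices = [int(p.rsplit(".", 1)[-1])
           for p in glob.glob(os.path.join(data, "V1-2.*"))
           if p.rsplit(".", 1)[-1].isdigit()]

print(max(indices) if indices else "no checkpoint found")   # e.g. 86016
```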
Cheap sanity checks can be done periodically with the following script,
which performs all of the checks it can (note that the command is also
happy if it finds _no_ check to do!)
```shell
export matrix=$DATA/dlp240.matrix7.250.bin
```
Appending the Schirokauer maps yields the uncompressed files `$DATA/purged7_withsm.txt` and `$DATA/relsdel7_withsm.txt`:
```shell
$MPI/bin/mpiexec [[your favorite mpiexec args]] $CADO_BUILD/filter/sm_append -ell 62310183390859392032917522304053295217410187325839402877409394441644833400594105427518019785136254373754932384219229310527432768985126965285945608842159143181423474202650807208215234033437849707623496592852091515256274797185686079514642651 -poly $HERE/p240.poly -b 4096 -in "/grvingt/zimmerma/dlp240/filter/purged7.gz" -out "${HERE}/purged7.withsm.txt"
```
We did this in 8 hours on 16 grvingt nodes. Note that the files
`$DATA/purged7_withsm.txt` and `$DATA/relsdel7_withsm.txt` are quite
big: 158G and 2.3TB, respectively.
The final step, computing individual logarithms by descent, is slow and required a machine with a huge amount of memory.
In the file [`howto-descent.md`](howto-descent.md), we explain what we did to make our lives
simpler with this step. We do not claim full reproducibility here, since
this is admittedly hacky (we also give a small C program that searches
in the database file without having an in-memory image). In any case,
this step cannot be done without the database file
`dlp240.reconstructlog.dlog`, which is too large (465 GB) to include in this
repository.