Commit 3075decd authored by Nadia Heninger

Pass over README.md

parent 523c9566
......@@ -2,7 +2,7 @@
This repository contains information to reproduce the RSA-240 factoring record.
-Several chapters are covered.
+There are several subsections.
* [Software prerequisites, and reference hardware configuration](#software-prerequisites-and-reference-hardware-configuration)
* [Searching for a polynomial pair](#searching-for-a-polynomial-pair)
......@@ -17,14 +17,14 @@ Several chapters are covered.
## Software prerequisites, and reference hardware configuration
This documentation relies on [commit 8a72ccdde of
-cado-nfs](https://gitlab.inria.fr/cado-nfs/cado-nfs/commit/8a72ccdde) as
+cado-nfs](https://gitlab.inria.fr/cado-nfs/cado-nfs/commit/8a72ccdde) as a
baseline. In some cases we provide exact commit numbers for specific
commands as well.
The cado-nfs documentation should be followed, in order to obtain a
complete build. Note in particular that some of the experiments below
require the use of the [hwloc](https://www.open-mpi.org/projects/hwloc/)
-library, and also some MPI implementation. [Open
+library, as well as an MPI implementation. [Open
MPI](https://www.open-mpi.org/) is routinely used for tests, but cado-nfs
also works with Intel MPI, for instance. The bottom line is that although
these external pieces of software are marked as _optional_ for cado-nfs,
......@@ -52,8 +52,8 @@ Some memory intensive tasks were performed on a dedicated machine called
As above, we provide the output of the [`lstopo --of
xml`](lstopo.wurst.xml) and [`dmidecode`](dmidecode.wurst.out) commands.
-As regards the compilation of cado-nfs, the user-level configuration is
-done in a file called `local.sh`. An [example `local.sh` file](local.sh)
+With respect to the compilation of cado-nfs, the user-level configuration is
+in a file called `local.sh`. An [example `local.sh` file](local.sh)
is provided, and you should probably adjust it to your needs. In all our
experiments, cado-nfs was compiled with up-to-date software (updated
debian 9 or debian 10). Typical software used were the GNU C compilers
......@@ -67,26 +67,26 @@ a successful cado-nfs build directory. The `DATA` variable, which is
used by some scripts, should point to a directory with plenty of storage,
possibly on some shared filesystem. Storage is also needed to store the
temporary files with collected relations. Overall, a full reproduction of
-the computation would need in the whereabouts of 10TB of storage. All
+the computation would need in the vicinity of 10TB of storage. All
scripts provided in this repository expect to be run from the directory where
they are placed, since they are also trying to access companion data
files.
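As an illustration only (the paths below are placeholders, not the ones used in the actual computation), the environment that the scripts expect can be set up along these lines:

```shell
# Placeholder paths; adjust to your own build and storage locations.
export CADO_BUILD=$HOME/cado-nfs/build        # a successful cado-nfs build directory
export DATA=/scratch/$USER/rsa240             # needs on the order of 10TB of free space
mkdir -p $DATA/log                            # log files of some scripts go there
cd /path/to/this/repository                   # scripts must be run from their own directory
```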
There is a considerable amount of biodiversity in the possible computing
environments in HPC centers. You might encounter difficulties in
-matching the idiosyncrasies of your particular job scheduler with the way
+matching the idiosyncrasies of your particular job scheduler to the way
cado-nfs expects to use the hardware. As a general rule, pay attention to
the fact that all the important steps of NFS in cado-nfs use all
available cores, and jobs expect to have exclusive control of the nodes
they run on. Any situation where you notice that this does not happen
is a hint at the fact that something has gone wrong. We cannot possibly
-provide guidance for all possible setups.
+provide guidance for every possible setup.
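With Slurm, for instance (one scheduler among many; yours and its options may differ), requesting whole nodes in exclusive mode is one way to ensure that each job controls all cores of its node:

```shell
# A sketch only: job-script.sh is a placeholder for an actual sieving or
# linear algebra job; --exclusive prevents other jobs from sharing the node.
sbatch --exclusive --nodes=1 job-script.sh
```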
## Searching for a polynomial pair
See the [`polyselect.md`](polyselect.md) file in this repository.
-And the winner is the [`rsa240.poly`](rsa240.poly) file:
+The winner is the [`rsa240.poly`](rsa240.poly) file:
```shell
cat > rsa240.poly <<EOF
......@@ -102,17 +102,17 @@ EOF
## Estimating the number of (unique) relations
To estimate the number of relations produced by a set of parameters:
-- We compute the corresponding factor bases. In our case, the side 0 is
+- We compute the corresponding factor bases. In our case, side 0 is
rational, so there is no need to precompute the factor base. (Note that
this precomputed factor base is different from what we call the "factor
base cache", and which also appears later.)
- We create a "hint" file where we tell which strategy to use for which
special-q size.
-- We random-sample in the global q-range, using sieving and not batch:
+- We random sample in the global q-range, using sieving and not batch:
this produces the same relations. This is slower but `-batch` is
currently incompatible with on-line (on-the-fly) duplicate removal.
-Here is what it gives with the parameters that were used in the computation.
+Here is the result with the parameters that were used in the computation.
The computation of the factor base is done with the following command.
Here, `-t 16` specifies the number of threads (more is practically
......@@ -140,7 +140,7 @@ EOF
```
We can now sieve for random-sampled special-q, and remove duplicate
-relations on-the-fly. In the output of the command line below, only the
+relations on the fly. In the output of the command line below, only the
number of unique relations per special-q matters. The timing does not
matter.
......@@ -154,8 +154,8 @@ based on these parameters. (Note that this test can also be done in
parallel over several nodes, using the `-seed [[seed value]]` argument in
order to vary the random picks.)
-In order to deduce an estimate of the total number of (de-duplicated)
-relations, it remains to multiply the average number of relations per
+In order to derive an estimate of the total number of (de-duplicated)
+relations, it is necessary to multiply the average number of relations per
special-q as obtained during the sample sieving by the number of
special-q in the global q-range. The latter can be precisely estimated
using the logarithmic integral function as an approximation of the number
......@@ -173,7 +173,7 @@ print (tot_rels)
This estimate (5.9G relations) can be made more precise by increasing the
number of special-q that are sampled for sieving. It is also possible to
have different nodes sample different sub-ranges of the global range to
-get the result faster. We consider that sampling 1024 special-qs is
+get the result faster. Sampling 1024 special-qs should be
enough to get a reliable estimate.
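As a sketch of this estimate (the values below are placeholders, to be replaced by the actual global q-range bounds and the per-special-q average measured above), the computation can be done in Sagemath as follows:

```shell
# Prints the approximate number of special-q in [q0,q1] and the estimated
# total number of unique relations. Placeholder values, not the real parameters.
sage -c 'q0=1e9; q1=7.4e9; rels_per_sq=20.0; nb_sq=log_integral(q1)-log_integral(q0); print (nb_sq, nb_sq*rels_per_sq)'
```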
## Estimating the cost of sieving
......@@ -188,14 +188,14 @@ we do measurements "under lab conditions":
functioning;
- no slow disk access due to competing I/Os.
-On the `grvingt` cluster that we use for the measure, these perfect
-conditions were reached during production as well, most of the time.
+On the `grvingt` cluster that we use for the measurement, these perfect
+conditions were reached during production most of the time as well.
In production there is no need to activate the on-the-fly duplicate
-removal which is supposed to be cheap but maybe not negligible. There is
-also no need to pass the hint file, since we are going to run the siever
-on different parts of the q-range, and on each of them the parameters are
-constant. Finally, during a benchmark, it is important to emulate the
+removal, which is supposed to be cheap but maybe not negligible. There is
+also no need to pass in the hint file, since we are going to run the siever
+on different parts of the q-range, and the parameters are
+constant on each of them. Finally, during a benchmark, it is important to emulate the
fact that the cached factor base (the `/tmp/rsa240.fbc` file) is
precomputed and hot (i.e., cached in memory by the OS and/or the
hard-drive), because this is the situation in production; for this, it
......@@ -217,8 +217,8 @@ may run the command with `-random-sample 1024` replaced by
`-random-sample 0` first, which will _only_ create the cache file. Then
run the command above.)
-While `las` tries to print some running times, some start-up or finish
-tasks might be skipped; furthermore the CPU-time gets easily confused by
+While `las` tries to print some running times, some startup or finish
+tasks might be skipped; furthermore the CPU time becomes easily confused by
the hyperthreading. Therefore, it is better to rely on `time`, since this
gives the real wall-clock time exactly as it was taken by the
computation.
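For instance (a sketch, where `sleep 2` stands in for the actual benchmark command shown above), the wall-clock time can be captured and converted to core-seconds per special-q like this:

```shell
# Requires GNU time; %e is the elapsed wall-clock time in seconds.
/usr/bin/time -f "%e" -o wct.txt sh -c 'sleep 2'
# 32 physical cores and 1024 sampled special-q, as in the example below.
python3 -c "print(float(open('wct.txt').read()) * 32 / 1024)"
```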
......@@ -234,7 +234,7 @@ to convert it into physical core-seconds. For instance, in our case,
since there are 32 physical cores and we sieved 1024 special-qs, this
gives `4554.3*32/1024=142.32` core.seconds per special-q.
-Finally, it remains to multiply by the number of special-q in this
+Finally, we need to multiply by the number of special-q in this
subrange. We get (in Sagemath):
```python
......@@ -245,7 +245,7 @@ cost_in_core_years=cost_in_core_hours/24/365
print (cost_in_core_hours, cost_in_core_years)
```
-With this experiment, we get therefore about 279 core.years for this sub-range.
+With this experiment, we estimate about 279 core.years for this sub-range.
#### Cost of 1-sided sieving + batch in the q-range [2.1e9,7.4e9]
......@@ -284,19 +284,19 @@ is as follows:
```shell
./rsa240-sieve-batch.sh -q0 2100000000 -q1 2100100000
```
-The script prints on stdout the start and end date, and in the output of
-`las`, which can be found in `$DATA/log/las.${q0}-${q1}.out`, the number
-of special-qs that have been processed can be found. From this
-information one can again deduce the cost in core.seconds to process one
-special-q and then the overall cost of sieving the q-range [2.1e9,7.4e9].
+The script prints the start and end date on stdout. The number
+of special-qs that have been processed can be found in the output of
+`las`, which is written to `$DATA/log/las.${q0}-${q1}.out`. One can again
+deduce the cost in core-seconds to process one special-q from this
+information, and then the overall cost of sieving the q-range [2.1e9,7.4e9].
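For example (a sketch with made-up values), the elapsed time between the printed dates and the special-q count from the log can be turned into a per-special-q cost:

```shell
# Placeholders: paste the dates printed by rsa240-sieve-batch.sh and the
# special-q count found in $DATA/log/las.${q0}-${q1}.out; requires GNU date.
start="2019-11-01 10:00:00"; end="2019-11-01 14:00:00"; nb_sq=4600; cores=32
wct=$(( $(date -d "$end" +%s) - $(date -d "$start" +%s) ))
echo "$(( wct * cores / nb_sq )) core-seconds per special-q (approximately)"
```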
-The design of this script imposes to have a rather long range of
+The design of this script imposes a rather long range of
special-q to handle for each run of `rsa240-sieve-batch.sh`. Indeed,
-during the last minutes, the `finishbatch` jobs need to take care of the
-last survivor files while `las` is no longer running, so that the node is
+during the final minutes, the `finishbatch` jobs need to take care of the
+last survivor files while `las` is no longer running, so the node is
not fully occupied. If the `rsa240-sieve-batch.sh` job takes a few hours,
this fade-out phase takes negligible time. Both for the benchmark and in
-production it is then necessary to have jobs taking at least a few hours.
+production it is thus necessary for the jobs to take at least a few hours.
On our sample machine, here is an example of a benchmark:
```shell
......@@ -326,7 +326,7 @@ we obtain about 510 core.years for this sub-range.
## Estimating the linear algebra time (coarsely)
Linear algebra works with MPI. For this section, as well as all linear
-algebra-related sections, we assume that you built cado-nfs with MPI
+algebra-related sections, we assume that you have built cado-nfs with MPI
enabled (i.e., the `MPI` shell variable was set to the path of your MPI
installation), and that `CADO_BUILD` points to the directory where the
corresponding binaries were built.
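For illustration (the path is a placeholder), enabling MPI amounts to pointing the `MPI` variable at your MPI installation before building, either in the environment or in `local.sh`:

```shell
# Placeholder prefix; use the installation prefix of Open MPI, Intel MPI, etc.
export MPI=/opt/openmpi
make -j32    # rebuild so that the MPI-enabled binaries are produced
```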
......@@ -335,8 +335,8 @@ The matrix size for RSA-240 is about 282M, with density 200 per row.
However, it is possible, and actually useful, to have an idea of the
computational cost of linear algebra before the matrix is actually ready,
just based on a rough prediction of its size. Tools that we developed for
-simulating the filtering, although quite fragile, can be used to this
-end. See the script
+simulating the filtering, although quite fragile, can be used for this purpose.
+See the script
[`scripts/estimate_matsize.sh`](https://gitlab.inria.fr/cado-nfs/cado-nfs/-/blob/8a72ccdde/scripts/estimate_matsize.sh)
available in the cado-nfs repository, for example.
......@@ -362,7 +362,7 @@ takes only a few minutes). Within the script
[`rsa240-linalg-0a-estimate_linalg_time_coarse_method_b.sh`](rsa240-linalg-0a-estimate_linalg_time_coarse_method_b.sh),
several implementation-level parameters are set, and should probably be
adjusted to the users' needs. Along with the `DATA` and `CADO_BUILD`
-variables, the script below also requires that the `MPI` shell variable
+variables, the script below also requires the `MPI` shell variable to
be set and `export`-ed, so that `$MPI/bin/mpiexec` can actually run MPI
programs. In all likelihood, this script needs to be tweaked depending on
the specifics of how MPI programs should be run on the target platform.
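Concretely, a run follows the same pattern as the other linear algebra scripts in this repository (the paths below are placeholders):

```shell
export DATA=/scratch/$USER/rsa240        # placeholder
export CADO_BUILD=$HOME/cado-nfs/build   # placeholder
export MPI=/opt/openmpi                  # placeholder
./rsa240-linalg-0a-estimate_linalg_time_coarse_method_b.sh
```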
......@@ -384,29 +384,29 @@ Krylov+Mksol (8 and 32 representing the fact that we used 8-node jobs
with 32 physical cores per node).
Because the parallel code for the "lingen" (linear generator) step of
-block Wiedemann was not ready when the computation started, we did no
+block Wiedemann was not ready when the computation started, we made no
attempt to anticipate the timing. We were confident that the cost was
-going to be minor anyway.
+going to be small in any case.
## Validating the claimed sieving results
The benchmark command lines above can be used almost as is to reproduce
-the full computation. It is just necessary to remove the `-random-sample`
+the full computation. It is only necessary to remove the `-random-sample`
option and to adjust the `-q0` and `-q1` to create many small work units
that in the end cover exactly the global q-range.
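For instance (a sketch; the work-unit width and the lower bound of the q-range are placeholders), the list of work units can be generated with a simple loop:

```shell
# Placeholders: qmin/qmax are the global q-range bounds, width the work-unit size.
qmin=800000000; qmax=7400000000; width=100000000
for q0 in $(seq $qmin $width $((qmax - width))); do
  q1=$((q0 + width))
  echo "work unit: -q0 $q0 -q1 $q1"
done
```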
-Since we do not expect anyone to spend again as much computing resources
-to perform again exactly the same computation, we provide in the
-[`rsa240-rel_count`](rsa240-rel_count) file the count of how many
-(non-unique) relations were produced for each 100M special-q sub-range.
+Since we do not expect anyone to spend as many computing resources
+to perform exactly the same computation again, we provide the count
+of how many (non-unique) relations were produced for each 100M special-q
+sub-range in the [`rsa240-rel_count`](rsa240-rel_count) file.
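Assuming one sub-range per line with the relation count in the last field (the exact format is not reproduced here), the total can be recovered with something like:

```shell
# Assumption about the file format: sum the last field of every line.
awk '{ s += $NF } END { print s }' rsa240-rel_count
```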
-We can then have a visual plot of this data, as shown in
+We can then plot this data visually, as shown in
[`rsa240-plot_rel_count.pdf`](rsa240-plot_rel_count.pdf) where we see the
drop in the number of relations produced per special-q when changing the
strategy (the x-coordinate is the special-q value, divided by 100M). The
plot is very regular on the two sub-ranges, except for outliers
corresponding to the q-range [9e8,1e9]. This is due to a crash of a
-computing facilities, and it was easier to allow duplicate computations
+computing facility, and it was easier to allow duplicate computations
(hence duplicate relations) than to sort out exactly which special-q's
did not complete.
......@@ -460,7 +460,7 @@ few iterations of each, in order to guide the final choice. For this,
a single command line is sufficient. For consistency with the other
scripts, it is placed as well in a script in this repository, namely
[`rsa240-linalg-0b-test-few-iterations.sh`](rsa240-linalg-0b-test-few-iterations.sh).
-This script needs the `MPI`, `DATA`, `matrix`, and `CADO_BUILD` to be
+This script needs the `MPI`, `DATA`, `matrix`, and `CADO_BUILD` environment variables to be
set. It can be used as follows, where `$matrix` points to one of the
matrices that have been produced by the filter code (after the `replay`
step).
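A sketch following the pattern of the other scripts in this repository (placeholder paths):

```shell
export DATA=/scratch/$USER/rsa240               # placeholder
export CADO_BUILD=$HOME/cado-nfs/build          # placeholder
export MPI=/opt/openmpi                         # placeholder
export matrix=$DATA/rsa240.matrix11.200.bin     # a matrix produced after the replay step
./rsa240-linalg-0b-test-few-iterations.sh
```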
......@@ -481,7 +481,7 @@ iteration on the `grvingt` platform, subject to some variations.
## Reproducing the linear algebra results
-The scripts above are of course part of a more general picture that does
+The scripts above are of course part of a more general picture that runs
the full block Wiedemann algorithm.
We decided to use the block Wiedemann parameters `m=512` and `n=256`,
......@@ -505,8 +505,8 @@ where the last 4 lines (steps `3-krylov`) correspond to the 4 "sequences"
(vector blocks numbered `0-64`, `64-128`, `128-192`, and `192-256`).
These sequences can be run concurrently on different sets of nodes, with
no synchronization needed. Each of these 4 sequences needs about 25 days
-to complete. Jobs can be interrupted, and must simply be restarted
-exactly from where they left off. E.g., if the latest of the `V64-128.*`
+to complete. Jobs can be interrupted, and can simply be restarted
+exactly from the point where they left off. E.g., if the latest of the `V64-128.*`
files in `$DATA` is `V64-128.86016`, then the job for sequence 1 can be
restarted with:
```shell
......@@ -514,7 +514,7 @@ restarted with:
```
Cheap sanity checks can be done periodically with the following script,
-which does all checks it can do (note that the command is happy if it
+which does all of the checks it can do (note that the command is happy if it
finds _no_ check to do as well!)
```shell
export matrix=$DATA/rsa240.matrix11.200.bin
......@@ -524,7 +524,7 @@ export MPI
./rsa240-linalg-4-check-krylov.sh
```
-Once this is done, data must be collated before being processed by the
+Once this is finished, data must be collated before being processed by the
later steps. After step `5-acollect` below, a file named
`A0-256.0-1654784` with size 27111981056 bytes will be in `$DATA`. Step
`6-lingen` below runs on 16 nodes, and completes in slightly less than 10
......@@ -560,7 +560,7 @@ follows
(the size above is the final size. For a quick test, a size of
`64*64/8*32768=16777216` bytes would be enough.)
-After having successfully followed the steps above, a file named
+After the steps above have been successfully followed, a file named
`W.sols0-64` will be in `$DATA` (with a symlink to it called `W`). This
file represents a kernel vector.
......@@ -572,7 +572,7 @@ We used the following command on the machine `wurst`:
```shell
$CADO_BUILD/linalg/characters -poly rsa240.poly -purged $DATA/purged11.gz -index $DATA/rsa240.index11.gz -heavyblock $DATA/rsa240.matrix11.200.dense.bin -out $DATA/rsa240.kernel -ker $DATA/W -lpb0 36 -lpb1 37 -nchar 50 -t 56
```
-This gave after a little more than one hour 21 dependencies
+This gave 21 dependencies after a little more than one hour
(`rsa240.dep.000.gz` to `rsa240.dep.020.gz`).
## Reproducing the square root step
......