Commit ed257c04 authored by Emmanuel Thomé

still some more rsa240

parent ba2c0178
@@ -59,15 +59,18 @@ experiments, cado-nfs was compiled with up-to-date software (updated
Debian 9 or Debian 10). Typical software used included the GNU C compilers
versions 6 to 9 and Open MPI versions 4.0.1 to 4.0.3.
Most (if not all) information boxes in this document rely on two shell
variables, `CADO_BUILD` and `DATA`, being set and `export`-ed to shell
subprocesses (as with `export CADO_BUILD=/blah/... ; export
DATA=/foo/...`). The `CADO_BUILD` variable is assumed to be the path to
a successful cado-nfs build directory. The `DATA` variable, which is
used by some scripts, should point to a directory with plenty of storage,
possibly on some shared filesystem. Storage is also needed for the
temporary files with collected relations. Overall, a full reproduction of
the computation would need roughly 10TB of storage. All scripts provided
in this repository expect to be run from the directory where they are
placed, since they also try to access companion data files.
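For instance, a minimal preamble could look like the following (the paths
below are placeholders, to be adapted to the local build and storage
locations):
```
export CADO_BUILD=$HOME/cado-nfs/build     # placeholder path
export DATA=/scratch/$USER/rsa240-data     # placeholder path
mkdir -p $DATA
```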
There is a considerable amount of biodiversity in the possible computing
environments in HPC centers. You might encounter difficulties in
@@ -115,7 +118,7 @@ computation of the factor base is done with the following command. Here,
gzip soon becomes the limiting factor).
```
$CADO_BUILD/sieve/makefb -poly rsa240.poly -side 1 -lim 2100000000 -maxbits 16 -t 16 -out $DATA/rsa240.fb1.gz
```
The resulting file has size 786781774 bytes, and its computation takes less than 4 minutes
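As a quick sanity check, the size of the resulting factor base file can
be compared against the figure above:
```
ls -l $DATA/rsa240.fb1.gz    # expected size: 786781774 bytes
```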
@@ -141,7 +144,7 @@ number of unique relations per special-q matters. The timing does not
matter.
```
$CADO_BUILD/sieve/las -poly rsa240.poly -fb1 $DATA/rsa240.fb1.gz -lim0 1800000000 -lim1 2100000000 -lpb0 36 -lpb1 37 -q0 8e8 -q1 7.4e9 -dup -dup-qmin 0,800000000 -sqside 1 -A 32 -mfb0 72 -mfb1 111 -lambda0 2.2 -lambda1 3.2 -random-sample 1024 -t auto -bkmult 1,1l:1.15,1s:1.4,2s:1.1 -v -bkthresh1 90000000 -adjust-strategy 2 -fbc /tmp/rsa240.fbc -hint-table rsa240.hint
```
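To turn such a sampling run into a per-special-q yield, a
back-of-the-envelope computation of the following kind can be used (the
relation count below is a made-up placeholder, to be replaced by the
number of unique relations actually reported for the 1024 sampled
special-qs):
```
unique_rels=20000    # placeholder: total unique relations over the sample
echo "scale=2; $unique_rels / 1024" | bc    # average relations per special-q
```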
In slightly more than an hour on our target machine `grvingt`, the sampling command above gives
@@ -189,7 +192,7 @@ removal which is supposed to be cheap but maybe not negligible. There is
no need to pass the hint file, since we are going to run the siever on
different parts of the q-range, and on each of them the parameters are
constant. Finally, during a benchmark, it is important to emulate the
fact that the cached factor base (the `/tmp/rsa240.fbc` file) is precomputed and
hot (i.e., cached in memory by the OS and/or the hard-drive),
because this is the situation in production; for this, it suffices
to start a first run and interrupt it as soon as the cache is written (or
@@ -201,16 +204,16 @@ In order to measure the cost of sieving in the special-q subrange where
sieving is used on both sides, the typical command-line is as follows:
```
time $CADO_BUILD/sieve/las -poly rsa240.poly -fb1 $DATA/rsa240.fb1.gz -lim0 1800000000 -lim1 2100000000 -lpb0 36 -lpb1 37 -q0 8e8 -q1 2.1e9 -sqside 1 -A 32 -mfb0 72 -mfb1 111 -lambda0 2.2 -lambda1 3.2 -random-sample 1024 -t auto -bkmult 1,1l:1.15,1s:1.4,2s:1.1 -v -bkthresh1 90000000 -adjust-strategy 2 -fbc /tmp/rsa240.fbc
```
(note: the first time this command line is run, it takes some time
to create the "cache" file `/tmp/fbc`. If you want to avoid this, you may
to create the "cache" file `/tmp/rsa240.fbc`. If you want to avoid this, you may
run the command with `-random-sample 1024` replaced by `-random-sample 0`
first, which will _only_ create the cache file. Then run the command
above.)
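Concretely, one way to organize this is to first run `las` with
`-random-sample 0`, which only writes the factor base cache, and then
start the timed command from the box above with the cache hot. This is
only a sketch; the options are exactly the ones used above.
```
# First populate the factor base cache; with -random-sample 0, las writes
# /tmp/rsa240.fbc and processes no special-q at all.
$CADO_BUILD/sieve/las -poly rsa240.poly -fb1 $DATA/rsa240.fb1.gz \
    -lim0 1800000000 -lim1 2100000000 -lpb0 36 -lpb1 37 -q0 8e8 -q1 2.1e9 \
    -sqside 1 -A 32 -mfb0 72 -mfb1 111 -lambda0 2.2 -lambda1 3.2 \
    -random-sample 0 -t auto -bkmult 1,1l:1.15,1s:1.4,2s:1.1 -v \
    -bkthresh1 90000000 -adjust-strategy 2 -fbc /tmp/rsa240.fbc
# Then re-run the timed command from the box above, unchanged
# (-random-sample 1024), and use its wall-clock time.
```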
While `las` tries to print some running times, some start-up or finish
tasks might be skipped; furthermore, the reported CPU time is easily
skewed by hyperthreading. Therefore, it is better to rely on `time`, since this
gives the real wall-clock time exactly as it was taken by the
@@ -251,7 +254,7 @@ option is mandatory, even if for our parameters, no file is produced on
side 1.)
```
$CADO_BUILD/sieve/ecm/precompbatch -poly rsa240.poly -lim0 0 -lim1 2100000000 -batch0 $DATA/rsa240.batch0 -batch1 $DATA/rsa240.batch1 -batchlpb0 31 -batchlpb1 30
```
Then, we can use the [`sieve-batch.sh`](sieve-batch.sh) shell-script
@@ -263,7 +266,8 @@ given in this repository. This launches:
and produce relations.
The script takes two command-line arguments `-q0 xxx` and `-q1 xxx`,
which describe the range of special-q to process. Temporary files are put
in the `/tmp` directory by default.
In order to run it on your own machine, there are some variables to
adjust at the beginning of the script. Two examples are already given, so
@@ -271,12 +275,14 @@ this should be easy to imitate. The number of instances of `finishbatch`
can also be adjusted depending on the number of cores available on the
machine.
When the paths are properly set (either by having `CADO_BUILD` and
`DATA` set correctly, or by tweaking the script), a typical invocation
is as follows:
```
./sieve-batch.sh -q0 2100000000 -q1 2100100000
```
The script prints the start and end dates on stdout. The number of
special-qs that have been processed can be found in the `las` output, in
`$DATA/log/las.${q0}-${q1}.out`. From this one can again deduce the cost
in core.seconds to process one special-q, and then the overall cost of
sieving the q-range [2.1e9,7.4e9].
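As a sketch of that bookkeeping (all figures below are placeholders, to
be replaced by the values observed in the logs), the per-special-q cost
can be computed as follows; multiplying the result by the number of
special-qs in the sub-range then gives the total cost.
```
wct=3600     # placeholder: wall-clock seconds reported for the run
cores=32     # placeholder: number of physical cores of the machine
nsq=1000     # placeholder: number of special-qs processed (from the las log)
echo "scale=2; $wct * $cores / $nsq" | bc   # core.seconds per special-q
```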
@@ -313,20 +319,24 @@ we obtain about 510 core.years for this sub-range.
## Estimating linear algebra time (coarsely)
The matrix size for RSA-240 is about 282M, with density 200 per row.
However, it is possible, and actually useful, to have an idea of the
computational cost of linear algebra before the matrix is actually ready,
just based on a rough prediction of its size. Tools that we developed for
simulating the filtering, although quite fragile, can be used to this
end. See the script
[`scripts/estimate_matsize.sh`](https://gitlab.inria.fr/cado-nfs/cado-nfs/-/blob/8a72ccdde/scripts/estimate_matsize.sh)
available in the cado-nfs repository, for example.
As an illustration of how it is possible to determine the linear algebra
cost ahead of time, let us assume that some advance prediction tells us
that the expected size of the sparse binary matrix is (say) 300M
rows/columns and 200 non-zero entries per row. (As we know after the
fact, this is not too far from reality, but the whole point is that the
_real_ matrix is not known at this point, of course.) Based on these
characteristics, it is possible to _stage_ a real set-up, just for the
purpose of measurement. cado-nfs has a useful _staging_ mode precisely
for that purpose, but it seems to misbehave (as of commit
[8a72ccdde](https://gitlab.inria.fr/cado-nfs/cado-nfs/commit/8a72ccdde)
at least), and cannot be used (see the script
[`rsa240-linalg-0a-estimate_linalg_time_coarse_method_a.sh`](rsa240-linalg-0a-estimate_linalg_time_coarse_method_a.sh)
@@ -339,10 +349,13 @@ takes well over an hour), and measure the time for 128 iterations (which
takes only a few minutes). Within the script
[`rsa240-linalg-0a-estimate_linalg_time_coarse_method_b.sh`](rsa240-linalg-0a-estimate_linalg_time_coarse_method_b.sh),
several implementation-level parameters are set, and should probably be
adjusted to the users' needs. Along with the `DATA` and `CADO_BUILD`
variables, the script below also requires that the `MPI` shell variable
be set and `export`-ed, so that `$MPI/bin/mpiexec` can actually run MPI
programs. In all likelihood, this script needs to be tweaked depending on
the specifics of how MPI programs should be run on the target platform.
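For instance, the environment could be prepared as follows before
launching the script (the paths are placeholders):
```
export CADO_BUILD=$HOME/cado-nfs/build    # placeholder path
export DATA=/scratch/$USER/rsa240-data    # placeholder path
export MPI=/path/to/openmpi               # placeholder: must contain bin/mpiexec
```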
```
nrows=300000000 density=200 nthreads=32 ./rsa240-linalg-0a-estimate_linalg_time_coarse_method_b.sh
```
This reports about 1.3 seconds per iteration. Allowing for some
@@ -365,7 +378,7 @@ going to be minor anyway.
## Validating the claimed sieving results
The benchmark command lines above can be used almost as-is for
reproducing the full computation. It is just necessary to remove the
`-random-sample` option and to adjust the `-q0` and `-q1` to create many
small work units that in the end cover exactly the global q-range.
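A minimal sketch of how such work units could be generated is given
below; the sub-range width (here 1e7) is purely illustrative, and each
printed pair is meant to be passed to the `las` command from the
benchmark boxes above, with `-random-sample` removed.
```
# Illustrative only: cut the global q-range into work units of width 1e7.
q0=800000000 ; qmax=7400000000 ; step=10000000
while [ $q0 -lt $qmax ] ; do
    q1=$((q0 + step))
    [ $q1 -gt $qmax ] && q1=$qmax
    echo "-q0 $q0 -q1 $q1"
    q0=$q1
done
```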
@@ -394,7 +407,15 @@ extrapolate.
## Reproducing the filtering results
Filtering in cado-nfs proceeds through several steps:
- duplicate removal.
- "purge", a.k.a. singleton and "clique" removal; also sometimes
referred to as only "filtering".
- "merge", which computes a sequence of row combinations.
- "replay", which replays the above sequence to produce a small matrix.
The file [`filtering.md`](filtering.md) in this repository gives more
information on these steps.
## Estimating linear algebra time more precisely, and choosing parameters
@@ -432,7 +453,7 @@ set. It can be used as follows, where `$matrix` points to one of the
matrices that have been produced by the filter code (after the `replay`
step).
```
export matrix=$DATA/rsa240.matrix11.200.bin
export DATA
export CADO_BUILD
export MPI
@@ -530,7 +551,7 @@ Let `W` be the kernel vector computed by the linear algebra step.
The characters step transforms this kernel vector into dependencies.
We used the following command on the machine `wurst`:
```
$CADO_BUILD/linalg/characters -poly rsa240.poly -purged $DATA/purged11.gz -index $DATA/rsa240.index11.gz -heavyblock $DATA/rsa240.matrix11.200.dense.bin -out $DATA/rsa240.kernel -ker $DATA/W -lpb0 36 -lpb1 37 -nchar 50 -t 56
```
After a little more than one hour, this gave 21 dependencies
(rsa240.dep.000.gz to rsa240.dep.020.gz).
@@ -540,7 +561,7 @@
The following command line can be used to process dependencies `start` to
`start+t-1`, using `t` threads (one thread for each dependency):
```
$CADO_BUILD/sqrt/sqrt -poly rsa240.poly -prefix $DATA/rsa240.dep.gz -side0 -side1 -gcd -dep $start -t $t
```
The `stdout` file contains one line per dependency, either FAIL or a
non-trivial factor of RSA-240.
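A quick way to post-process this output is sketched below: keep the
lines that are not FAIL, and check that a reported factor indeed divides
the RSA-240 modulus. The file name `sqrt.stdout` and the variable `N`
(assumed to hold the decimal value of RSA-240) are placeholders.
```
grep -v FAIL sqrt.stdout | sort -u
p=123456789          # placeholder: paste one of the reported factors here
echo "$N % $p" | bc  # must print 0 if p really divides N
```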
# Additional info on filtering for rsa240
A first step of the filtering process in cado-nfs is to create the
so-called "renumber table", as follows.
```
$CADO_BUILD/sieve/freerel -poly rsa240.poly -renumber $DATA/rsa240.renumber -lpb0 36 -lpb1 37 -out $DATA/rsa240.freerel -t 32
```
where `-t 32` specifies the number of threads. This was done with revision
`30a5f3eae` of cado-nfs, and takes several hours. (Note that newer
versions of cado-nfs changed the format of this file.)
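Several steps below refer to specific cado-nfs revisions. A sketch of
how such a revision can be checked out and built is given here; the exact
build procedure may differ slightly depending on the revision and on the
local configuration.
```
git clone https://gitlab.inria.fr/cado-nfs/cado-nfs.git
cd cado-nfs
git checkout 30a5f3eae    # or 50ad0f1fd, 8e651c0a6, 1369194a5, as needed
make -j 32                # then point CADO_BUILD at the resulting build directory
```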
## Duplicate removal
Duplicate removal was done with revision `50ad0f1fd` of cado-nfs, and
proceeds through two passes. We used the default cado-nfs setting which,
on the first pass, splits the input into `2^2=4` independent slices, with
no overlap. cado-nfs supports doing this step in an incremental way, so
below we assume that the shell variable `EXP` expands to an integer
indicating the filtering experiment number. In the command below,
`$new_files` is expected to expand to a file containing a list of
file names of new relations (relative to `$DATA`) to add to the stored
set of relations.
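As an illustration of what `$new_files` could contain, it could have been
generated beforehand along the following lines (the directory layout
`rels.$EXP/` is made up for the example; the important point is that the
listed paths are relative to `$DATA`, which is passed as `-basepath`):
```
# Hypothetical layout: the newly collected relation files sit in $DATA/rels.$EXP/
( cd $DATA ; ls rels.$EXP/*.gz ) > $DATA/new_files.$EXP
new_files=$DATA/new_files.$EXP
```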
```
mkdir -p $DATA/dedup/{0..3}
$CADO_BUILD/filter/dup1 -prefix dedup -basepath $DATA -filelist $new_files -out $DATA/dedup/ -n 2 > $DATA/dup1.$EXP.stdout 2> $DATA/dup1.$EXP.stderr
grep '^# slice.*received' $DATA/dup1.$EXP.stderr > $DATA/dup1.$EXP.per_slice.txt
```
The second pass of duplicate removal works independently on each of the
non-overlapping slices (the number of slices can thus be used as a sort
of time-memory tradeoff).
```
for i in {0..3} ; do
    nrels=`awk '/slice '$i' received/ { x+=$5 } END { print x; }' $DATA/dup1.*.per_slice.txt`
    $CADO_BUILD/filter/dup2 -nrels $nrels -renumber $DATA/rsa240.renumber $DATA/dedup/$i/dedup*gz > $DATA/dup2.$EXP.$i.stdout 2> $DATA/dup2.$EXP.$i.stderr
done
```
## "purge", a.k.a. singleton and "clique" removal.
This step was done with revision `50ad0f1fd` of cado-nfs. We assume below
that `$EXP` is consistent with the latest pass of duplicate removal that
was done following the steps above.
```
nrels=$(awk '/remaining/ { x+=$4; } END { print x }' $DATA/dup2.$EXP.[0-3].stderr)
colmax=$(awk '/INFO: size = / { print $5 }' $DATA/dup2.$EXP.0.stderr)
$CADO_BUILD/filter/purge -out purged$EXP.gz -nrels $nrels -keep 160 -col-min-index 0 -col-max-index $colmax -t 56 -required_excess 0.0 $DATA/dedup/*/dedup*gz
```
An excerpt of the output is:
```
...
Step 0: only singleton removal
Sing. rem.: begin with: nrows=6011911051 ncols=6334109673 excess=-322198622 at 23480.64
...
Final values:
nrows=1175353278 ncols=1175353118 excess=160
weight=27090768157 weight*nrows=3.18e+19
Total usage: time 379175s (cpu), 36492s (wct) ; memory 3675M, peak 1314500M
```
## The "merge" step
The merge step used revision `8e651c0a6` of cado-nfs, which implements the
algorithm described in the article:

Charles Bouillaguet and Paul Zimmermann, _Parallel Structured Gaussian Elimination for the Number Field Sieve_, April 2019, [`https://hal.inria.fr/hal-02098114`](https://hal.inria.fr/hal-02098114)
```
$CADO_BUILD/filter/merge -out history$EXP -t 56 -target_density 200 -mat purged$EXP.gz -skip 32
```
An excerpt of the output is:
```
...
# Done: Read 1175353278 relations in 2444.8s -- 69.6 MB/s -- 480749.2 rels/s
Time for filter_matrix_read: 4786.54s
...
Total usage: time 126269s (cpu), 6386s (wct) ; memory 703913M, peak 795576M
After cleaning memory:
Total usage: time 126883s (cpu), 6480s (wct) ; memory 536456M, peak 795576M
Total usage: time 131669s (cpu), 8926s (wct) ; memory 536456M, peak 795576M
```
## The "replay" step
The replay step used revision `1369194a5` of cado-nfs:
```
$CADO_BUILD/filter/replay -purged purged$EXP.gz -his history$EXP -out rsa240.matrix$EXP.200.bin -index rsa240.index$EXP.gz
```
An excerpt of the output is:
```
...
# Done: Read 1175353278 relations in 2635.0s -- 64.6 MB/s -- 446058.1 rels/s
The biggest index appearing in a relation is 8460702945
...
Sparse submatrix: nrows=282336644 ncols=282336484
# Writing matrix took 10146.6s
# Weight of the sparse submatrix: 56524556188
Total usage: time 32650s (cpu), 32502s (wct) ; memory 802686M, peak 915127M
```
From [`sieve-batch.sh`](sieve-batch.sh):
#!/bin/bash
: ${DATA?missing}
: ${CADO_BUILD?missing}
: ${wdir="/tmp"}
: ${result_dir="$DATA"}
# batch that number of survivors per file, to be sent to finishbatch
# 16M survivors creates a product tree 0.75 times the product tree of the primes
@@ -83,9 +68,9 @@ loop_finishbatch() {
# run finishbatch on it
echo -n "[ $id ]: Starting finishbatch on $file.$id at "; date
$CADO_BUILD/sieve/ecm/finishbatch -poly \
$DATA/rsa240.poly -lim0 0 -lim1 2100000000 \
-lpb0 36 -lpb1 37 -batch0 $DATA/rsa240.batch0\
-batchlpb0 31 -batchmfb0 72 -batchlpb1 30 -batchmfb1 74 -doecm\
-ncurves 80 -t 8 -in "$workdir/running/$file.$id" \
> "$resdir/$file"
@@ -107,10 +92,10 @@ done
echo -n "Starting las at "; date
$CADO_BUILD/sieve/las \
-poly $DATA/rsa240.poly \
-fb1 $DATA/rsa240.fb1.gz \
-fbc $wdir/rsa240.fbc \
-lim0 0 -lim1 2100000000 -lpb0 36 -lpb1 37 -sqside 1 -A 32 \
-mfb0 250 -mfb1 74 -lambda0 5.0 -lambda1 2.0 \
-bkmult 1,1l:1.15,1s:1.5,2s:1.1 -bkthresh1 50000000 \