[las] [at work] Split factor bases for threads in blocks of 2^17
Imported issue: Initially reported by Alexander Kruppa in https://gforge.inria.fr/tracker/?group_id=2065&aid=16855
When sieving with n threads, the factor base primes are partitioned into n sets, each of which is processed by one thread executing fill_in_buckets().
Currently, the primes are partitioned in a round-robin fashion: if p_i is the smallest factor base prime with p_i >= bucket_thresh, then with 2 threads the two thread-FBs receive
thread 1: p_i, p_{i+2}, p_{i+4}, ...
thread 2: p_{i+1}, p_{i+3}, p_{i+5}, ...
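The round-robin assignment can be sketched like this (an illustrative standalone version; the real partitioning happens when las builds the per-thread factor bases, and `round_robin_split` is a hypothetical name):

```python
def round_robin_split(primes, nthreads):
    """Assign the i-th bucket-sieved prime to thread i mod nthreads."""
    return [primes[t::nthreads] for t in range(nthreads)]

# With 2 threads, consecutive primes alternate between the thread-FBs,
# which roughly balances the sieve hits (smaller primes hit more often).
primes = [100003, 100019, 100043, 100049, 100057, 100069]
print(round_robin_split(primes, 2))
```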
This is done mostly to balance the number of sieve hits each thread has to process, so that no thread takes much longer than the others and forces everyone to wait before the bucket updates can be applied.
After applying the bucket updates, we purge the buckets to remove updates that don't correspond to sieve survivors, then we sort the entries in order of increasing x-coordinate. This is done to speed up dividing out bucket primes when we process the sieve survivors in order of increasing x.
During the purge (before sorting, so the entries are still in order of increasing FB prime), we try to reconstruct the FB prime from the prime hint, which is the low 16 bits of (p-1)/2. When the hint of one entry is smaller than the hint of the previous entry, we increment a counter by 2^17 (= BUCKET_P_WRAP); the counter and the hint together yield the reconstructed prime, as p = counter + 2*hint + 1.
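A standalone sketch of this reconstruction (the names and the exact formula p = counter + 2*hint + 1 are my reading of the description above, not the actual las code):

```python
BUCKET_P_WRAP = 1 << 17

def make_hint(p):
    # The hint is the low 16 bits of (p-1)/2, so one hint period
    # corresponds to an interval of length 2^17 in p.
    return ((p - 1) // 2) & 0xffff

def reconstruct(hints, first_p):
    """Recover primes from hints, assuming consecutive entries are never
    more than 2^17 apart (i.e., no hint period is skipped)."""
    counter = (first_p // BUCKET_P_WRAP) * BUCKET_P_WRAP
    prev = 0
    primes = []
    for h in hints:
        if h < prev:            # hint decreased: p crossed a 2^17 boundary
            counter += BUCKET_P_WRAP
        prev = h
        primes.append(counter + 2 * h + 1)
    return primes
```

For example, the hints of 65537, 131101, 262147 reconstruct correctly, since consecutive entries are less than 2^17 apart. Dropping the middle entry leaves a gap larger than 2^17 and the last prime is reconstructed as 131075 instead of 262147, which is exactly the failure mode discussed next.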
If during bucket sieving it ever occurs that one bucket receives no hits between one prime p and another p' >= p+2^17, then p' and all following primes in that bucket will be reconstructed incorrectly. We try to detect such incorrectly reconstructed primes in divide_primes_from_bucket(), by testing that p divides the norm, and if not, incrementing p by BUCKET_P_WRAP until it is both a prime and divides the norm. This may become quite expensive if incorrectly reconstructed primes occur often.
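The recovery loop might look like this (an assumption based on the description of divide_primes_from_bucket(), not the real code; the trial-division primality test is only for illustration):

```python
BUCKET_P_WRAP = 1 << 17

def is_prime(n):
    # Naive trial division, good enough for a sketch.
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True

def correct_prime(p, norm):
    """Step a misreconstructed prime forward by BUCKET_P_WRAP until it is
    both prime and divides the norm."""
    while not (is_prime(p) and norm % p == 0):
        p += BUCKET_P_WRAP
    return p
```

E.g., if 262147 was misreconstructed as 131075 (one missed wrap), a single step of BUCKET_P_WRAP recovers it, but each step costs a primality test and a division, which adds up when misreconstructions are frequent.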
The problem is exacerbated by how we split the factor base among threads. With two threads, each thread-FB gets only half of the factor base primes from each interval of length 2^17, and thus the expected value for the number of hits in a bucket is only half of what it would be with 1 thread, increasing the probability that there is a bucket with no hits between two primes more than 2^17 apart.
The probability of a bucket receiving no hits becomes quite large for large factor base bounds. With 1 thread it grows from about 1% for primes around 100M to 43% for primes around 500M. With 2 threads it is already ~10% at 100M and 65% at 500M; with 4 threads, it is already 41% at 100M (computed with pr_no_hits.gp, attached). This assumes a BUCKET_REGION of 2^16 and that each prime p hits a given bucket region with probability BUCKET_REGION/p.
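The attached pr_no_hits.gp is the authoritative computation; as a rough cross-check, a Poisson approximation gives numbers in the same ballpark (this sketch is my own, not the gp script). An interval of length 2^17 near p contains about 2^17/log(p) primes by the prime number theorem, each thread owns 1/nthreads of them under the round-robin split, and each hits a given bucket region with probability BUCKET_REGION/p, so the number of hits is roughly Poisson with mean lam and the no-hit probability is exp(-lam):

```python
from math import exp, log

BUCKET_REGION = 1 << 16
INTERVAL = 1 << 17

def pr_no_hits(p, nthreads):
    # Expected hits from one thread's share of a 2^17 interval near p.
    lam = (INTERVAL / log(p)) * (BUCKET_REGION / p) / nthreads
    return exp(-lam)

for p in (100e6, 500e6):
    for n in (1, 2, 4):
        print(f"p ~ {p:.0e}, {n} threads: {pr_no_hits(p, n):.0%}")
```

This reproduces the quoted 1-thread and 2-thread figures (about 1% and 10% at 100M, 43% and 65% at 500M) fairly closely; the exact gp computation may differ somewhat from this crude approximation.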
We could remove the effect of multiple threads by distributing primes to threads in intervals of length 2^17. I.e., for factor base primes greater than some limit (say, 2^24), we could split them like
thread 1: 16777259, 16777289, 16777291, ..., 16908263 [1], 17039381, 17039401, ...
thread 2: 16908293, 16908323, 16908329, ..., 17039339 [2],
At [1], after the prime 16908263, the first interval [2^24, 2^24+2^17[ of length 2^17 ends; all primes from this first interval go to thread 1. At [2], after the prime 17039339, the second interval of length 2^17 ends; all primes from this second interval go to thread 2. The primes from the third interval of length 2^17 go to thread 1 again, etc.
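The proposed assignment can be sketched as follows (names and the 2^24 threshold are taken from the description above; the code itself is hypothetical, and threads are 0-indexed here where the example above uses 1 and 2):

```python
SPLIT_BASE = 1 << 24   # "some limit (say, 2^24)"
INTERVAL = 1 << 17

def thread_for_prime(p, nthreads):
    """Thread owning prime p >= SPLIT_BASE: intervals of length 2^17 are
    handed out to threads round-robin, so all primes of one interval go
    to the same thread."""
    return ((p - SPLIT_BASE) // INTERVAL) % nthreads

# All primes in [2^24, 2^24 + 2^17) go to thread 0, all primes of the
# next interval to thread 1, the third interval to thread 0 again, etc.
```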
One thread now processes either all the primes from an interval of length 2^17, or none of them. We know that no wrap-around in the prime hint occurs among the primes of one interval. If purge finds a wrap-around, it now must increase the value of the counter by nr_threads * 2^17. Since a thread processes all primes of an interval of length 2^17, it is less likely that such an interval will contain no hits, so it is less likely that primes are reconstructed incorrectly.
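The purge-time reconstruction would then change as follows (again an assumption based on the paragraph above, not existing code): consecutive intervals owned by one thread are nr_threads intervals apart, so each wrap advances the counter by nr_threads * 2^17.

```python
BUCKET_P_WRAP = 1 << 17

def make_hint(p):
    # Low 16 bits of (p-1)/2, as in the existing code.
    return ((p - 1) // 2) & 0xffff

def reconstruct_interval_split(hints, first_p, nthreads):
    """Recover primes under the interval-wise split: a wrap means this
    thread moved to its next owned interval, nthreads * 2^17 further on."""
    counter = (first_p // BUCKET_P_WRAP) * BUCKET_P_WRAP
    prev = 0
    primes = []
    for h in hints:
        if h < prev:
            counter += nthreads * BUCKET_P_WRAP
        prev = h
        primes.append(counter + 2 * h + 1)
    return primes
```

With 2 threads, thread 0's entries jump from the first interval [2^24, 2^24+2^17[ straight to the third, e.g. from 16777259 to 17039381, and a single wrap of 2 * 2^17 reconstructs both correctly.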
I still need to get an estimate of how much time we lose due to incorrectly reconstructed primes before we put any effort into coding.