#+TITLE: Fabulous notes and ideas scratch pad
#+AUTHOR: Thomas Mijieux

* MEETING 2017-May-14
* Resources Links
** spack
   - source repository:
   - tutorial for hiepacs solvers:
** chameleon
   The user guide and documentation may be outdated.
   It is best to regenerate them from the sources.

   - source repository:
   - chameleon tutorial:
   - user guide:
   - chameleon directory:
** fabulous
   Documentation for the latest versions is only available by generating it from the sources.

   - source repository:
   - source repository (old):
   - documentation (branch "ib"):

  #+BEGIN_SRC c++
  int main()
  {
      // ... (setup of k, N, X, ldx, B and ldb elided)
      auto dr = fabulous::deflated_restart(/*nb_eigen_value=*/k, /*target=*/0.0);
      auto eq = fabulous::equation(N, X, ldx, B, ldb);
      auto arn = fabulous::bgmres::ib();
      auto ortho = fabulous::orthogonalization(/*iteration_count=*/3)
          + fabulous::OrthoType::RUHE
          + fabulous::OrthoScheme::IMGS;
      auto solution = fabulous::bgmres::solve(eq, arn, ortho, dr);
      return 0;
  }
  #+END_SRC

* fabulous linking with lapacke/cblas kernels
  When a parallel blas/lapack implementation is available,
  all examples in the test_basic/ folder are linked with the parallel blas
  implementation. Examples in test_cham/ are linked with chameleon, and
  chameleon is linked with a sequential blas/lapack implementation (because the
  multi-threaded parallelism is generated directly by chameleon itself),
  so these examples are also linked with the sequential cblas/lapack
  implementation. It therefore makes little sense to compare the performance of
  kernels other than gels or the incremental qr factorization between the
  examples in test_basic/ and those in test_cham/, because these are the only
  kernels that use chameleon.
  Comparing other kernels means comparing multi-threaded kernels against
  sequential kernels.

  A solution exists to link chameleon with a parallel blas/lapack implementation:
  the application using chameleon (fabulous in our case) must set the OpenMP
  number of threads to 1 before calling chameleon kernels and reset it to
  the number of available cores after the chameleon kernel calls have completed.
  This way chameleon can theoretically work correctly and the other lapack/blas
  kernels can take advantage of multi-threaded parallelism. For this to work in
  practice, OpenMP thread binding must be performed correctly.
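
  The thread-count switch described above can be sketched as a small RAII
  guard. This is only an illustration under stated assumptions:
  SequentialRegion and call_sequential are hypothetical names that belong
  neither to fabulous nor to chameleon, and the stubs in the #else branch exist
  only so that the sketch also builds without OpenMP. The only real API used is
  omp_get_max_threads / omp_set_num_threads.

  #+BEGIN_SRC c++
  #include <cassert>

  #if defined(_OPENMP)
  #include <omp.h>
  #else
  // Fallback stubs (assumption): let the sketch build without OpenMP.
  static int fallback_num_threads = 1;
  inline int omp_get_max_threads() { return fallback_num_threads; }
  inline void omp_set_num_threads(int n) { fallback_num_threads = n; }
  #endif

  // Hypothetical guard: forces one OpenMP thread for its lifetime and
  // restores the previous thread count when it goes out of scope.
  class SequentialRegion {
  public:
      SequentialRegion() : saved_(omp_get_max_threads())
      {
          assert(saved_ >= 1);
          omp_set_num_threads(1);
      }
      ~SequentialRegion() { omp_set_num_threads(saved_); }
      SequentialRegion(const SequentialRegion&) = delete;
      SequentialRegion& operator=(const SequentialRegion&) = delete;
  private:
      int saved_;
  };

  // Hypothetical wrapper: runs `kernel` with the OpenMP thread count set to 1.
  template <class Kernel>
  auto call_sequential(Kernel&& kernel) -> decltype(kernel())
  {
      SequentialRegion guard;
      return kernel();
  }
  #+END_SRC

  A chameleon kernel call would then be wrapped in a lambda passed to
  call_sequential; once it returns, the other blas/lapack kernels see the
  previous thread count again.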

  In particular, it must be ensured that the threads are not all bound to the
  same core during chameleon kernel calls, otherwise the whole point of
  multi-threaded parallelism is lost (because all threads would execute on the
  same core, one after the other). For this, at least one kernel with
  omp_num_threads set to the maximum number of available cores must be called
  before calling the first chameleon kernel (in order to bind the threads
  correctly in the first parallel region).
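
  A minimal sketch of such a warm-up call is given below, assuming that
  opening one parallel region with the full team is enough to trigger the
  binding. warm_up_parallel_region is a hypothetical helper, not a fabulous or
  chameleon function. Whether the threads actually end up on distinct cores
  still depends on the OpenMP runtime settings (for instance the OMP_PROC_BIND
  and OMP_PLACES environment variables).

  #+BEGIN_SRC c++
  #include <cassert>

  // Hypothetical helper: opens one parallel region with the default (maximum)
  // team size and returns how many threads took part in it.
  // Built without OpenMP, the pragma is ignored and the function returns 1.
  inline int warm_up_parallel_region()
  {
      int team_size = 0;
      #pragma omp parallel reduction(+ : team_size)
      {
          team_size += 1;  // each thread of the team contributes one
      }
      assert(team_size >= 1);
      return team_size;
  }
  #+END_SRC

  The helper would be called once, with the thread count still at its maximum,
  before the first chameleon kernel.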

  This solution was put into practice by Terry Cojean (ask him for details).

* QR factorization and DeflatedRestarting
** description
  The QRDR algorithm is a little different from the others.
  The factorize_last_column(incremental qr) call was put in
  The reason for this is that if it was not put there, no other call would
  factorize the first block column of the Hessenberg (H1new). This is not a
  problem in QRIBDR since solve is called on F1new in order to evaluate the R
  criterion and detect inexact breakdown.

  The problem with this is that the actual QR factorization is
  not performed during the call that measures the least-squares time,
  i.e. the time is not measured properly.

**  TODO ? IDEA1: solve this problem by adding a notify_restart_end call to Hessenbergs!??
   This does not solve the problem: the factorization time is still not measured properly.
**  TODO ? IDEA2: add the code to measure time directly in the Hessenberg classes
   Drawback: code must be added to every Hessenberg class.
   If not done correctly, this could be problematic in IBDR and QRIBDR because
   IB_update is done one iteration after the corresponding R_criterion.
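
   As an illustration of IDEA2, the sketch below accumulates the time spent
   inside a factorization method with a scope timer. ScopedTimer and
   ToyHessenberg are hypothetical names; the toy class only stands in for a
   real Hessenberg class and performs no QR factorization.

   #+BEGIN_SRC c++
   #include <cassert>
   #include <chrono>

   // Hypothetical scope timer: adds the lifetime of the object, in seconds,
   // to an accumulator owned by the caller.
   class ScopedTimer {
   public:
       explicit ScopedTimer(double& total)
           : total_(total), start_(std::chrono::steady_clock::now()) {}
       ~ScopedTimer()
       {
           const std::chrono::duration<double> elapsed =
               std::chrono::steady_clock::now() - start_;
           assert(elapsed.count() >= 0.0);  // steady_clock never goes backwards
           total_ += elapsed.count();
       }
       ScopedTimer(const ScopedTimer&) = delete;
       ScopedTimer& operator=(const ScopedTimer&) = delete;
   private:
       double& total_;
       std::chrono::steady_clock::time_point start_;
   };

   // Toy stand-in for a Hessenberg class (not the fabulous implementation).
   struct ToyHessenberg {
       double least_square_time = 0.0;  // seconds spent in factorization
       int factorized_columns = 0;

       void factorize_last_column()
       {
           ScopedTimer timer(least_square_time);
           ++factorized_columns;  // placeholder for the incremental QR step
       }
   };
   #+END_SRC

   With this layout the factorization time is charged to the least-squares
   counter wherever factorize_last_column is called from, at the cost of
   repeating the timer in every Hessenberg class.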

* IB+DR with inexact breakdown on R0 and RHS update
  The IB+DR algorithm theoretically handles inexact breakdown on R0 before the
  first restart (after the first restart there is no R0 anymore).

  In the IB-only algorithm, inexact breakdown on R0 typically occurs
  after several restarts (usually when IB occurred during a previous restart).

  In order to test inexact breakdown on R0 for IB+DR, linearly dependent
  right-hand sides must be passed on purpose as input to the algorithm.

  The fact that the algorithm can handle both inexact breakdown on R0 and
  IB+DR is the reason why there are two ways to update the right-hand sides
  of the local GELS problem.
  When there is inexact breakdown on R0 (before the 1st restart), init_phi is
  called but _restarted is set to false, so compute_Lambda performs the "inexact
  breakdown on R0" computation: Lambda <- Phi * Lambda_1.
  After a restart, init_phi_restarted is called,
  _restarted is set to true and compute_Lambda performs the other computation:
  Lambda <- [[eye(p1);zeros(nj+p-p1,p1)], Phi] * Lambda_1
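
  The two updates can be sketched on plain dense matrices as follows. This is
  an illustration of the formulas only, not the fabulous code: Matrix and
  compute_lambda_sketch are hypothetical names, and the block layout (identity
  block on the first p1 rows, Phi on the remaining columns) is read directly
  from the expression above.

  #+BEGIN_SRC c++
  #include <cassert>
  #include <cstddef>
  #include <vector>

  // Minimal column-major dense matrix (hypothetical helper).
  struct Matrix {
      std::size_t rows, cols;
      std::vector<double> data;
      Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}
      double& operator()(std::size_t i, std::size_t j) { return data[i + j * rows]; }
      double operator()(std::size_t i, std::size_t j) const { return data[i + j * rows]; }
  };

  // restarted == false: Lambda <- Phi * Lambda_1
  // restarted == true : Lambda <- [[eye(p1); zeros(..., p1)], Phi] * Lambda_1
  Matrix compute_lambda_sketch(const Matrix& phi, const Matrix& lambda1,
                               std::size_t p1, bool restarted)
  {
      // Rows of Lambda_1 consumed by the identity block (none before a restart).
      const std::size_t offset = restarted ? p1 : 0;
      assert(lambda1.rows == offset + phi.cols);
      assert(offset <= phi.rows);

      Matrix lambda(phi.rows, lambda1.cols);
      for (std::size_t j = 0; j < lambda1.cols; ++j) {
          // Identity block: copies the first p1 rows of Lambda_1.
          for (std::size_t i = 0; i < offset; ++i)
              lambda(i, j) = lambda1(i, j);
          // Phi block: multiplies the remaining rows of Lambda_1.
          for (std::size_t k = 0; k < phi.cols; ++k)
              for (std::size_t i = 0; i < phi.rows; ++i)
                  lambda(i, j) += phi(i, k) * lambda1(offset + k, j);
      }
      return lambda;
  }
  #+END_SRC

  The restarted case therefore costs one extra copy of p1 rows on top of the
  product with Phi; the zero block contributes nothing.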

* IB+DR+QR double update on (local GELS) RHS
  The IB+DR algorithm implies an update of the right-hand sides of the
  GELS problem (either inexact breakdown on R0, or the update after an IB+DR
  restart).

  The QR versions also imply an update of the right-hand sides.
  The solution adopted to handle this double update is to compute
  Lambda the same way as in the IB+DR versions:
  (IB on R0) Lambda <- Phi * Lambda_1;
  (restart)  Lambda <- [[eye(p1);zeros(nj+p-p1,p1)], Phi] * Lambda_1;

  and then to apply all the Q^{H} or Q^{T} transformations coming
  from the incremental QR, at each iteration.
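
  The second step can be sketched for the real case as follows, assuming that
  the incremental QR stores Q as a product of Householder reflectors
  H = I - tau * v * v^T. That storage choice, the Reflector type and
  apply_qt_sketch are assumptions made for the illustration, not a description
  of the fabulous implementation.

  #+BEGIN_SRC c++
  #include <cassert>
  #include <cstddef>
  #include <vector>

  // Hypothetical stored reflector H = I - tau * v * v^T acting on the rows
  // [offset, offset + v.size()) of the right-hand side.
  struct Reflector {
      std::size_t offset;
      std::vector<double> v;
      double tau;
  };

  // Applies Q^T = H_k ... H_1 to one column x, i.e. H_1 first, then H_2, ...
  void apply_qt_sketch(const std::vector<Reflector>& reflectors,
                       std::vector<double>& x)
  {
      for (const Reflector& h : reflectors) {
          assert(h.offset + h.v.size() <= x.size());
          double dot = 0.0;  // v^T * x on the rows touched by this reflector
          for (std::size_t i = 0; i < h.v.size(); ++i)
              dot += h.v[i] * x[h.offset + i];
          for (std::size_t i = 0; i < h.v.size(); ++i)
              x[h.offset + i] -= h.tau * dot * h.v[i];
      }
  }
  #+END_SRC

  Each column of Lambda would go through this routine after being computed
  with one of the two formulas above.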
* Flops counter                                                        :NOTE:
  - IDEA: user methods (from the user matrix callback object) could return
    the number of flops they performed
  CLOCK: [2017-05-10 Wed 18:22]--[2017-05-10 Wed 18:24] =>  0:02
[2017-05-10 Wed 18:22]
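
  A minimal sketch of this idea, assuming a callback whose apply method
  returns its own flop count. DiagonalOperator and its apply signature are
  hypothetical and do not reflect the actual fabulous callback interface.

  #+BEGIN_SRC c++
  #include <cassert>
  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Hypothetical user callback object: a diagonal operator A = diag(d).
  struct DiagonalOperator {
      std::vector<double> diag;

      // Computes y <- A * x and returns the number of flops performed
      // (one multiplication per entry).
      std::int64_t apply(const std::vector<double>& x, std::vector<double>& y) const
      {
          assert(x.size() == diag.size());
          y.resize(x.size());
          for (std::size_t i = 0; i < x.size(); ++i)
              y[i] = diag[i] * x[i];
          return static_cast<std::int64_t>(x.size());
      }
  };
  #+END_SRC

  The solver would then add the returned value to its own flop counter after
  each callback invocation.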