algo/core/kmerstore.h · b0e3045debbc2a08aeea17849939b0942a9dc754 · vidjil / vidjil

core/kmerstore.h: ignore all k-mers with extended nucleotides when updating index · b0e3045d

Mathieu Giraud authored Mar 11, 2015

The germline sequences contain some 'N' and other extended nucleotides.
As we store in the indexes both the k-mers and their reverse complements, and as
we handle extended nucleotides almost randomly (see tools:nuc_to_int()),
results may differ slightly between a read and its reverse complement.
Ignoring such k-mers makes the indexing more deterministic: a (pure ACGT) read
and its reverse complement now give the same results.
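
To illustrate the idea, here is a minimal sketch of such a filter. It is not
the actual code of core/kmerstore.h: the names ToyIndex, is_acgt(), revcomp()
and insert_sequence() are hypothetical, the index is a plain std::map, and
sequences are assumed to be upper-case.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical helper: true only for the four unambiguous nucleotides.
static bool is_acgt(char c) {
    return c == 'A' || c == 'C' || c == 'G' || c == 'T';
}

// Hypothetical helper: reverse complement of a pure-ACGT string.
static std::string revcomp(const std::string& s) {
    std::string r(s.rbegin(), s.rend());
    for (char& c : r) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
        }
    }
    return r;
}

// Toy index (an assumption, not the Vidjil data structure): k-mer -> label.
using ToyIndex = std::map<std::string, std::string>;

// Insert every k-mer of seq and its reverse complement,
// skipping any k-mer that contains a non-ACGT character.
void insert_sequence(ToyIndex& index, const std::string& seq,
                     std::size_t k, const std::string& label) {
    if (k == 0 || seq.size() < k) return;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        const std::string kmer = seq.substr(i, k);
        if (!std::all_of(kmer.begin(), kmer.end(), is_acgt))
            continue;  // extended nucleotide: ignore this k-mer
        index[kmer] = label;
        index[revcomp(kmer)] = label;
    }
}
```

With k = 3, the sequence ACGNTTA only contributes ACG and TTA (and their
reverse complements CGT and TAA); the three k-mers overlapping the 'N' are
never stored.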

Another option (harder to implement) would be to add several k-mers to the index
for each extended nucleotide, but this would decrease the effective weight of the seed.

Note that we should also improve the analysis of reads that include extended nucleotides.
