Commit b0e3045d authored by Mathieu Giraud

core/kmerstore.h: ignore all k-mers with extended nucleotides when updating index

There are some 'N' and other extended nucleotides in the germline sequences.
As we store in the indexes both the k-mers and their reverse complement, and as
we handle extended nucleotides almost randomly (see tools:nuc_to_int()),
we may have slight differences when analyzing some reads and their reverse complement.
Ignoring such k-mers thus makes the index more deterministic: we get the same
results on a (pure ACGT) read and its reverse complement.

Another option (harder to implement) could be to add several k-mers in the index,
but this would decrease the effective weight of the seed.

Note that we should also improve the analysis of reads that include extended nucleotides.
parent 7a2239cd
@@ -187,6 +187,10 @@ void IKmerStore<T>::insert(const seqtype &sequence,
                           const string &label){
  for(size_t i = 0 ; i + s < sequence.length() + 1 ; i++) {
    seqtype kmer = spaced(sequence.substr(i, s), seed);
    if (has_extended_nucleotides(kmer))
      continue;
    int strand = 1;
    if (revcomp_indexed && T::hasRevcompSymetry()) {
      seqtype rc_kmer = revcomp(kmer);
......
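
The filtering step in the hunk above can be illustrated in isolation. The sketch below is not the project's code: this has_extended_nucleotides() is a hypothetical stand-in that treats anything other than A, C, G or T (in either case) as an extended nucleotide, and indexable_kmers() is an invented helper that uses contiguous k-mers instead of the spaced seed applied in IKmerStore<T>::insert().

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the helper called in the diff: returns true
// as soon as one character is not a plain A, C, G or T (either case).
bool has_extended_nucleotides(const std::string &kmer) {
  for (char c : kmer) {
    switch (c) {
      case 'A': case 'C': case 'G': case 'T':
      case 'a': case 'c': case 'g': case 't':
        break;
      default:
        return true;
    }
  }
  return false;
}

// Invented helper: collects the contiguous k-mers of `sequence` that
// would be kept for indexing, skipping every k-mer that overlaps an
// extended nucleotide. No spaced seed is applied here.
std::vector<std::string> indexable_kmers(const std::string &sequence,
                                         std::size_t k) {
  std::vector<std::string> kmers;
  for (std::size_t i = 0; i + k <= sequence.length(); i++) {
    std::string kmer = sequence.substr(i, k);
    if (has_extended_nucleotides(kmer))
      continue;
    kmers.push_back(kmer);
  }
  return kmers;
}
```

Under these assumptions, indexable_kmers("ACGNTTA", 3) keeps only ACG and TTA: the three k-mers overlapping the N are dropped, so no arbitrary encoding of N can reach the index.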