SequenceAligner: C++ and Python code to align sequences

SequenceAligner is a C++ class that aligns two (string) sequences and calculates metrics such as the word error rate (WER). Pretty-printing enables human-readable logging of alignments and metrics. The code is a translation of the Java version by Brian Romanowski.

This class is intended to reproduce the main functionality of the NIST sclite tool.
The Sphinx 4 source for the class edu.cmu.sphinx.util.NISTAlign was referenced when writing the SequenceAligner code.
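
To make the idea of aligning two sequences concrete, here is a minimal sketch of the classic token-level edit distance (Levenshtein) computed by dynamic programming. It is a generic textbook illustration, not the SequenceAligner implementation; the function name `edit_distance` is hypothetical, and the comparison is case-sensitive.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Minimum number of substitutions, insertions and deletions needed to turn
// `ref` into `hyp`. Illustrative sketch only, not the library's code.
inline std::size_t edit_distance(const std::vector<std::string>& ref,
                                 const std::vector<std::string>& hyp)
{
    const std::size_t n = ref.size();
    const std::size_t m = hyp.size();
    // dist[i][j] = distance between the first i reference tokens
    // and the first j hypothesis tokens
    std::vector<std::vector<std::size_t>> dist(n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = 0; i <= n; ++i) dist[i][0] = i; // i deletions
    for (std::size_t j = 0; j <= m; ++j) dist[0][j] = j; // j insertions
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const std::size_t sub = dist[i - 1][j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
            const std::size_t del = dist[i - 1][j] + 1;
            const std::size_t ins = dist[i][j - 1] + 1;
            dist[i][j] = std::min({sub, del, ins});
        }
    }
    return dist[n][m];
}
```

Dividing this distance by the reference length gives the WER: for the reference "pepin de landen" and the hypothesis "pepin de landen le premier", the distance is 2 and the WER is 2/3, about 0.67.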

Christian Raymond [christian.raymond@irisa.fr]

Details

This code is licensed under one of the BSD variants; see LICENSE.txt for full details.

Example

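#include "SequenceAligner.hpp"
#include <algorithm>
#include <cstdlib>    // std::atoi, EXIT_SUCCESS
#include <iostream>
#include <iterator>
#include <stdexcept>  // std::invalid_argument
#include <string>
#include <vector>

int main(int argc, char* argv[])
{
    if(argc<2) throw std::invalid_argument("size of batch to test missing");

    // batch() is not defined in this excerpt; it is assumed to come from the full example source
    batch(std::atoi(argv[1]));
    //return 0;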
    
    std::vector<std::string> sentence1={"the", "quick", "brown", "cow", "jumped", "over", "moon"};
    std::vector<std::string> sentence2={"the","quick", "brown", "cows", "jumped", "way", "over", "the", "moon", "dude"};
    std::vector<std::string> sentence3={"Pepin","de", "Landen"};
    std::vector<std::string> sentence4={"pepin","de", "landen","le","premier"};
    const bool case_sensitive=false;
    SequenceAligner<std::string> werEval(case_sensitive);
    std::cout<<"\n----------SINGLE ALIGNMENT--------\n";
    //align 2 sentences
    auto sentence_alignement = werEval.align(sentence3,sentence4);
    //print stats and alignment
    std::cout<<sentence_alignement<<"\n";
    
    //get the 2 sentences aligned
    std::cout<<"\nREF=[";
    std::copy(sentence_alignement.reference.cbegin(),sentence_alignement.reference.cend(),std::ostream_iterator<std::string>(std::cout,"\t"));
    std::cout<<"]\nHYP=[";
    std::copy(sentence_alignement.hypothesis.cbegin(),sentence_alignement.hypothesis.cend(),std::ostream_iterator<std::string>(std::cout,"\t"));
    std::cout<<"]";
    
    std::cout<<"\n----------BATCH ALIGNMENT---------\n";
    //align a batch of sentence pairs: the first list holds the references, the second the hypotheses
    auto batch_alignement1 = werEval.align({sentence1,sentence2},{sentence3, sentence4});
    //get metrics about the batch of alignments
    const SequenceAligner<std::string>::SummaryStatistics ss(batch_alignement1);
    std::cout<<"\n\n"<<ss<<std::endl;
    
    return EXIT_SUCCESS;
}

Produces the output:

Align sentences: [100%] |██████████████████████████████████████████████████| 1/1 [ 00:00<00:00 1000.00it/s ] 

# seq   # ref   # hyp   cor     sub     ins     del     WER     SER
1       8       8       0.88    0       0.12    0.12    0.25    1

----------SINGLE ALIGNMENT--------
        # seq   # ref   # hyp   # cor   # sub   # ins   # del   acc     WER     # seq cor
STATS:  1       3       5       3       0       2       0       1       0.67    0
-----   -----   -----   -----   -----   -----   -----   -----   -----   -----   -----
REF:    Pepin   de      Landen  **      *******
HYP:    pepin   de      landen  le      premier

REF=[Pepin      de      Landen                  ]
HYP=[pepin      de      landen  le      premier ]
----------BATCH ALIGNMENT---------
Align sentences: [100%] |██████████████████████████████████████████████████| 2/2 [ 00:00<00:00 2000.00it/s ] 

# seq   # ref   # hyp   cor     sub     ins     del     WER     SER
2       17      8       0       0.47    0       0.53    1       1
  • The top portion of each alignment output gives the statistics for the given pair of reference/hypothesis sentences.
  • Insertions and deletions are filled with empty objects (built by the constructor with no argument; for a string, this is the empty string).
  • operator<< prints "*" characters to mark insertions and deletions.

Align Any Object with the C++ version

The C++ version is a template and can align any objects that have the following properties:

  1. a default constructor that constructs an empty (invalid) object
  2. an empty() method that returns whether the object is empty (invalid)
  3. a size() method that returns the number of characters used to print the object
  4. an operator==() to compare objects
  5. an operator<< that prints the object using size() characters
  6. an operator- that favors substitution over insertion/deletion when different objects are similar: it returns the percentage of similarity

Example

#include <algorithm> // std::min, std::max
#include <limits>    // std::numeric_limits
#include <ostream>
#include <string>    // std::to_string

class Integer
{
    int _i;
    public:
    Integer(): _i(std::numeric_limits<int>::max()) {} //the default constructor must mark the object as empty in some way
    Integer(const int i): _i(i){}
    int size() const {return static_cast<int>(std::to_string(_i).length());} //size() must return the number of characters needed to print the object
    bool empty() const {return _i==std::numeric_limits<int>::max();} //empty() must return true if the object is empty (built by the default constructor)
    bool operator==(const Integer& g) const {return _i==g._i;} //an equality comparator is required
    //operator- favors a substitution over an insertion/deletion when operator== returns false but the objects are similar:
    //it returns a similarity PERCENTAGE (always return 0 to disable this behavior); this simple ratio assumes strictly positive values
    int operator-(const Integer& g) const {return std::min(_i,g._i)*100/std::max(_i,g._i);}
    friend std::ostream& operator<<(std::ostream& o,const Integer& i) {return o<<i._i;}
};

We can then align Integer objects:

    std::vector<Integer> num1={1,2,3};
    std::vector<Integer> num2={2,3,400};

    SequenceAligner<Integer> numEval;
    std::cout<<"\n----------NUMERIC ALIGNMENT--------\n";
    //align 2 sequences of Integer
    auto numeric_alignement = numEval.align(num1,num2);
    //print stats and alignment
    std::cout<<numeric_alignement<<"\n";

This prints:

----------NUMERIC ALIGNMENT--------
        # seq   # ref   # hyp   # cor   # sub   # ins   # del   acc     WER     # seq cor
STATS:  1       3       3       2       0       1       1       0.67    0.67    0
-----   -----   -----   -----   -----   -----   -----   -----   -----   -----   -----
REF:    1       2       3       ***
HYP:    *       2       3       400