SequenceAligner: C++ and Python code to align sequences
SequenceAligner is a C++ class that aligns two (string) sequences and calculates metrics such as word error rate (WER). Pretty-printing enables human-readable logging of alignments and metrics. The code is a translation of the Java version by Brian Romanowski.
This class is intended to reproduce the main functionality of the NIST sclite tool.
The Sphinx 4 source for the class edu.cmu.sphinx.util.NISTAlign was referenced when writing the SequenceAligner code.
Christian Raymond [christian.raymond@irisa.fr]
Details
This code is licensed under one of the BSD variants; see LICENSE.txt for full details.
Example
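#include "SequenceAligner.hpp"
#include <algorithm>
#include <cstdlib>    // std::atoi, EXIT_SUCCESS
#include <iostream>
#include <iterator>
#include <stdexcept>  // std::invalid_argument
#include <string>
#include <vector>
int main(int argc, char* argv[])
{
if(argc<2) throw std::invalid_argument("size of batch to test missing");
//batch() is not defined in this excerpt; it produces the first block of the output below
batch(std::atoi(argv[1]));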
std::vector<std::string> sentence1={"the", "quick", "brown", "cow", "jumped", "over", "moon"};
std::vector<std::string> sentence2={"the","quick", "brown", "cows", "jumped", "way", "over", "the", "moon", "dude"};
std::vector<std::string> sentence3={"Pepin","de", "Landen"};
std::vector<std::string> sentence4={"pepin","de", "landen","le","premier"};
const bool case_sensitive=false;
SequenceAligner werEval(case_sensitive);
std::cout<<"\n----------SINGLE ALIGNMENT--------\n";
//align 2 sentences
auto sentence_alignement = werEval.align(sentence3,sentence4);
//print stats and alignment
std::cout<<sentence_alignement<<"\n";
//get the 2 sentences aligned
std::cout<<"\nREF=[";
std::copy(sentence_alignement.reference.cbegin(),sentence_alignement.reference.cend(),std::ostream_iterator<std::string>(std::cout,"\t"));
std::cout<<"]\nHYP=[";
std::copy(sentence_alignement.hypothesis.cbegin(),sentence_alignement.hypothesis.cend(),std::ostream_iterator<std::string>(std::cout,"\t"));
std::cout<<"]";
std::cout<<"\n----------BATCH ALIGNMENT---------\n";
//align a batch of sentence pairs: the first argument holds the references, the second the hypotheses
auto batch_alignement1 = werEval.align({sentence1,sentence2},{sentence3, sentence4});
//get metrics over the whole batch of alignments
const SequenceAligner<std::string>::SummaryStatistics ss(batch_alignement1);
std::cout<<"\n\n"<<ss<<std::endl;
return EXIT_SUCCESS;
}
Produces the output:
Align sentences: [100%] |██████████████████████████████████████████████████| 1/1 [ 00:00<00:00 1000.00it/s ]
# seq # ref # hyp cor sub ins del WER SER
1 8 8 0.88 0 0.12 0.12 0.25 1
----------SINGLE ALIGNMENT--------
# seq # ref # hyp # cor # sub # ins # del acc WER # seq cor
STATS: 1 3 5 3 0 2 0 1 0.67 0
----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
REF: Pepin de Landen ** *******
HYP: pepin de landen le premier
REF=[Pepin de Landen ]
HYP=[pepin de landen le premier ]
----------BATCH ALIGNMENT---------
Align sentences: [100%] |██████████████████████████████████████████████████| 2/2 [ 00:00<00:00 2000.00it/s ]
# seq # ref # hyp cor sub ins del WER SER
2 17 8 0 0.47 0 0.53 1 1
- The top portion of each block of output gives the statistics for the given reference/hypothesis pair.
- Insertions and deletions are filled with empty objects (built by the constructor with no argument; for a string, this is the empty string).
- operator<< prints "*" characters to mark an insertion or a deletion.
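The WER and acc columns in the STATS rows shown in this document are consistent with the usual definitions WER = (sub + ins + del) / ref and acc = cor / ref. The sketch below recomputes them from raw counts. `EditCounts`, `wordErrorRate` and `accuracy` are hypothetical names introduced for illustration and are not part of SequenceAligner; the acc formula is an assumption inferred from the printed values.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper type (not part of SequenceAligner): raw edit counts
// for one alignment, as printed in the "# cor # sub # ins # del" columns.
struct EditCounts {
    std::size_t correct;
    std::size_t substitutions;
    std::size_t insertions;
    std::size_t deletions;
};

// Every reference word is either correct, substituted, or deleted,
// so the reference length can be recovered from the counts.
inline std::size_t referenceLength(const EditCounts& c) {
    return c.correct + c.substitutions + c.deletions;
}

// WER = (substitutions + insertions + deletions) / reference length.
inline double wordErrorRate(const EditCounts& c) {
    const std::size_t n = referenceLength(c);
    if (n == 0) throw std::invalid_argument("reference is empty");
    const std::size_t errors = c.substitutions + c.insertions + c.deletions;
    return static_cast<double>(errors) / static_cast<double>(n);
}

// Assumed definition of the "acc" column: correct words / reference length.
inline double accuracy(const EditCounts& c) {
    const std::size_t n = referenceLength(c);
    if (n == 0) throw std::invalid_argument("reference is empty");
    return static_cast<double>(c.correct) / static_cast<double>(n);
}
```

With the counts of the single alignment above (3 correct, 0 substitutions, 2 insertions, 0 deletions), `wordErrorRate` gives 2/3, which rounds to the printed 0.67, and `accuracy` gives 1. Note that WER can exceed 1 when the hypothesis contains many insertions.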
Align Any Object with the C++ version
The C++ version is a template and can align objects of any type that provides the following:
- a default constructor that constructs an empty (invalid) object
- an empty() method that reports whether the object is such an empty (invalid) object
- a size() method that returns the number of characters used to print the object
- an operator==() to compare objects
- an operator<< that prints the object using size() characters
- an operator- that returns the percentage of similarity between two objects; it is used to favor a substitution over an insertion/deletion when different objects are similar
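These requirements can be checked at compile time before handing a type to the aligner. The following is a minimal sketch using C++11 type traits; `has_alignment_ops`, `is_alignable` and the `Tag` type are hypothetical names introduced here and are not provided by SequenceAligner. It only checks that the expressions compile, not that they behave as the aligner expects.

```cpp
#include <cstddef>
#include <ostream>
#include <string>
#include <type_traits>
#include <utility>

// Primary template: the required expressions are not all valid for T.
template <class T, class = void>
struct has_alignment_ops : std::false_type {};

// Chosen only when every listed expression compiles for T.
template <class T>
struct has_alignment_ops<T, decltype(
        static_cast<bool>(std::declval<const T&>().empty()),
        static_cast<std::size_t>(std::declval<const T&>().size()),
        static_cast<bool>(std::declval<const T&>() == std::declval<const T&>()),
        static_cast<double>(std::declval<const T&>() - std::declval<const T&>()),
        static_cast<void>(std::declval<std::ostream&>() << std::declval<const T&>()),
        void())> : std::true_type {};

// Hypothetical trait: default constructible plus the operations above.
template <class T>
struct is_alignable : std::integral_constant<bool,
        std::is_default_constructible<T>::value && has_alignment_ops<T>::value> {};

// Hypothetical example type: a label whose empty name marks the empty object.
struct Tag {
    std::string name;
    Tag() = default;                                        // empty object
    explicit Tag(std::string n) : name(std::move(n)) {}
    bool empty() const { return name.empty(); }             // true for the empty object
    std::size_t size() const { return name.size(); }        // printed width
    bool operator==(const Tag& o) const { return name == o.name; }
    int operator-(const Tag&) const { return 0; }           // 0 % similarity: feature unused
    friend std::ostream& operator<<(std::ostream& os, const Tag& t) { return os << t.name; }
};

static_assert(is_alignable<Tag>::value, "Tag provides every required operation");
static_assert(!is_alignable<int>::value, "int has no empty() or size()");
```

A failed `static_assert` reports the missing operation at the point of use, which is usually easier to read than an error from deep inside a template instantiation.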
Example
#include <algorithm> // std::min, std::max
#include <limits>    // std::numeric_limits
#include <ostream>
#include <string>    // std::to_string
class Integer
{
int _i;
public:
Integer(): _i(std::numeric_limits<int>::max()) {} //the default constructor must mark the object as empty in some way
Integer(const int i): _i(i){}
int size() const {return std::to_string(_i).length();} //number of characters needed to print the object
bool empty() const {return _i==std::numeric_limits<int>::max();} //true if the object is empty (built by the constructor without argument)
bool operator==(const Integer& g) const {return _i==g._i;} //an equality comparison is required
//PERCENTAGE of similarity, used to favor a substitution over an insertion/deletion when operator== is false but the objects are similar; always return 0 to ignore this feature. This formula assumes strictly positive values.
int operator-(const Integer& g) const {return std::min(_i,g._i)*100/std::max(_i,g._i);}
friend std::ostream& operator<<(std::ostream& o,const Integer& i) {return o<<i._i;}
};
So we can align Integer objects:
std::vector<Integer> num1={1,2,3};
std::vector<Integer> num2={2,3,400};
SequenceAligner<Integer> numEval;
std::cout<<"\n----------NUMERIC ALIGNMENT--------\n";
//align 2 sequences of Integer
auto numeric_alignement = numEval.align(num1,num2);
//print stats and alignment
std::cout<<numeric_alignement<<"\n";
which prints:
----------NUMERIC ALIGNMENT--------
# seq # ref # hyp # cor # sub # ins # del acc WER # seq cor
STATS: 1 3 3 2 0 1 1 0.67 0.67 0
----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
REF: 1 2 3 ***
HYP: * 2 3 400
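The counts in the STATS rows can be understood through a plain edit-distance alignment. The following is a minimal sketch of that technique, assuming unit costs for substitution, insertion and deletion; it is not the SequenceAligner implementation and ignores the similarity score returned by operator-. `Counts` and `countEdits` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical result type: number of correct words and of each edit kind.
struct Counts {
    std::size_t cor = 0, sub = 0, ins = 0, del = 0;
};

// Levenshtein alignment with unit costs, followed by a backtrace that
// classifies each step. T only needs operator== here.
template <class T>
Counts countEdits(const std::vector<T>& ref, const std::vector<T>& hyp) {
    const std::size_t n = ref.size(), m = hyp.size();
    // cost[i][j] = minimum number of edits turning ref[0..i) into hyp[0..j)
    std::vector<std::vector<std::size_t>> cost(n + 1, std::vector<std::size_t>(m + 1, 0));
    for (std::size_t i = 1; i <= n; ++i) cost[i][0] = i;   // delete every reference item
    for (std::size_t j = 1; j <= m; ++j) cost[0][j] = j;   // insert every hypothesis item
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            const std::size_t diag = cost[i - 1][j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
            cost[i][j] = std::min({diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1});
        }
    }
    // Walk back from the bottom-right corner, preferring matches, then
    // substitutions, then deletions, then insertions.
    Counts c;
    std::size_t i = n, j = m;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 && ref[i - 1] == hyp[j - 1] && cost[i][j] == cost[i - 1][j - 1]) {
            ++c.cor; --i; --j;
        } else if (i > 0 && j > 0 && cost[i][j] == cost[i - 1][j - 1] + 1) {
            ++c.sub; --i; --j;
        } else if (i > 0 && cost[i][j] == cost[i - 1][j] + 1) {
            ++c.del; --i;
        } else {
            ++c.ins; --j;
        }
    }
    return c;
}
```

Called on `{1, 2, 3}` as reference and `{2, 3, 400}` as hypothesis, `countEdits` returns 2 correct, 0 substitutions, 1 insertion and 1 deletion, the same counts as the STATS row above. When several alignments have the same cost, the tie-breaking order in the backtrace decides which one is reported, so other inputs may differ from what the library prints.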