COMPRISE library for weakly supervised training of Speech-to-Text (STT) models
This library provides three main components which represent the approaches proposed in COMPRISE, namely
- STT Error Detection driven Training (Err2Unk)
- Weakly Supervised Training based on Dialogue States
- Confusion Network based Language Model Training (CN2LM)
The first two components focus on obtaining reliable transcriptions of un-transcribed speech data which can be used for training both the STT Acoustic Model (AM) and Language Model (LM). The AM can be of any type, although we use the state-of-the-art Chain models in our examples. The third component covers training of a statistical n-gram LM and a Recurrent Neural Network (RNN) LM from alternative and uncertain STT hypotheses obtained on un-transcribed speech data.
Readers interested in the high-level design and experimental evaluation of these components are directed to the COMPRISE D4.2 and D4.4 deliverable reports. This README provides details on typical usage of these components.
Video installation guide
An installation & usage guide video can be found on YouTube here.
Prerequisites
- This library re-uses binaries and scripts from the Kaldi toolkit, so you should have Kaldi pre-installed on your system. Optionally, you can use the setup script to install Kaldi and also set up this library.
- If you are not using the setup script then you will need the Kaldi LM toolkit
- Additionally you need to install Kaldi helpers
- Speech datasets for training STT models, including:
- (a small amount of) transcribed speech data. As demonstrated in the COMPRISE D4.2 deliverable report, it could be an existing read speech corpus or a few hours of a domain/application-specific speech corpus.
- (more) un-transcribed speech data.
- a dev set containing application specific transcribed speech data
- Err2Unk based training requires:
- Python 3.X (Python 2.7 not supported.)
- Python sklearn library
- Keras v2.3.1 and Tensorflow v2.0.0 Python libraries to train neural network models for STT error detection. (An upcoming version will move this to Pytorch.)
- the kenlm Python module to extract language model related features for the error detector.
- Dialogue state based training requires:
- the SRILM tool. If you have installed Kaldi, you can install the SRILM tool with the tools/extras/install_srilm.sh script in your Kaldi installation.
- a speech dataset from a human-machine dialogue system where the dialogue state corresponding to each human utterance is already available, for example the Let's Go dataset. One could also use human-human conversations or any other speech dataset with any kind of weak (but relevant) utterance-level labels.
- CN2LM training requires:
- Numpy 1.19
- Pytorch 1.5.1 to train CN2LM RNN version.
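If you want to install the Python dependencies listed above in one go, a pip command along the following lines should work. This is a hedged sketch only; exact package names and versions may differ on your platform, and the kenlm module is sometimes installed from its GitHub archive rather than PyPI.
pip install scikit-learn keras==2.3.1 tensorflow==2.0.0 numpy==1.19.* torch==1.5.1 kenlm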
Setup
- You can use the setup script to install Kaldi and setup this library.
- If you are not using setup.sh then please take care of the following.
- Ensure that you have a working Kaldi installation.
- Modify the softlinks steps and utils in this directory to point to egs/wsj/s5/steps/ and egs/wsj/s5/utils/, respectively, in your Kaldi installation.
- Modify the path of KALDI_ROOT, and modify (or remove) the paths to the kaldi_lm, SRILM and sox tools, in path.sh (a sketch of these edits follows below)
- Modify cmd.sh if you are using a different execution queue for Kaldi.
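As a minimal sketch of the manual setup above, assuming Kaldi is installed under /opt/kaldi (replace with your own installation path):
ln -sfn /opt/kaldi/egs/wsj/s5/steps steps
ln -sfn /opt/kaldi/egs/wsj/s5/utils utils
# then edit path.sh so that KALDI_ROOT=/opt/kaldi and the kaldi_lm, SRILM and sox entries match your setup (or remove the ones you do not need)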
Typical Usage Steps
Err2Unk based training
Err2Unk based semi-supervised training of STT models will typically go through the following steps.
Step 1. Train seed STT models
- Supervised training data with reliable speech-transcript pairs are used to train the seed AM and LM. (Note that this step can be skipped if you already have pre-trained AM and LM).
- A sample Kaldi recipe to train the seed AM and LM on a subset of the Let's Go dataset is made available in the egs/ directory.
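As an illustration only, assuming the sample recipe follows the usual Kaldi convention of a top-level run.sh (check the recipe directory in egs/ for the actual entry point and data paths):
cd egs/letsgo-15d   # recipe directory referenced in Step 2; the entry point below is assumed
bash run.sh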
Step 2. Prepare for STT Error Detection
- The seed AM and LM are used to decode the unsupervised speech and dev set into STT lattices. (Sample script in egs/letsgo-15d if you are relying on the sample recipe from Step 1.)
- Obtain STT confusion networks from the lattices decoded on the unsupervised speech and dev set. The COMPRISE library assumes confusion networks are in Kaldi sausage format. Assuming your lattices are generated by Kaldi (as lat.*.gz), you can use our script to generate STT confusion networks as follows:
bash local/err2unk/getSaus.sh lattice_dir graph_dir lm_wt
'graph_dir' is the one used by the Kaldi decoder,
'lm_wt' is the LM weight which gives the best dev set WER.
Note that STT confusion networks, aka sausages, are generated in the <lattice_dir>/sau/ directory, referred to as 'saus_dir' in the next steps. A concrete example invocation is sketched below.
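For example, with hypothetical paths from a chain-model decode of the unsupervised speech (adapt these to your own decode and graph directories) and an LM weight of 10:
bash local/err2unk/getSaus.sh exp/chain/tdnn1a/decode_unsup exp/chain/tree_sp/graph 10
# sausages are then written to exp/chain/tdnn1a/decode_unsup/sau/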
Step 3. Train STT Error Detector
- Align the dev set confusion networks to the corresponding reference transcriptions
bash local/err2unk/sausAlign.sh saus_dir graph_dir ref_text
'ref_text' is the reference transcription in Kaldi format
Note that the output is saved to the text file <saus_dir>/saus_bin-best_with-heps.hyp.align, referred to as 'saus_ref_align' in the next set of commands.
- Extract relevant features and labels from the dev set confusion networks
python local/err2unk/errdet/saus_feats_for_train.py saus_dir saus_ref_align lm_arpa graph_dir dev_saus_feats_n_labs
'lm_arpa' is the LM in arpa format,
'dev_saus_feats_n_labs' is the output file containing features and labels extracted from the confusion networks, which will be used in the next command.
Note that the error detector is trained on the application specific dev set.
- Train a Bi-directional Long Short-Term Memory (BLSTM) based error tagger
python local/err2unk/errdet/train_3c_error_tagger_on_dev.py dev_saus_feats_n_labs err_model_dir
'err_model_dir' must be created by the user and will store the resulting error tagger model
Note that a feed-forward neural network based error detector can also be tried with the Python script local/err2unk/errdet/train_3c_error_mlp_on_dev.py.
Step 4. Get unsupervised speech transcripts
- Extract relevant features from the unsupervised speech confusion networks obtained in Step 2.
python local/err2unk/errdet/saus_feats_for_predict.py saus_dir lm_arpa graph_dir unsup_saus_feats
'unsup_saus_feats' is the output file containing features extracted from the confusion networks, which will be used in the next command.
- Tag STT errors on the unsupervised speech confusion networks
python local/err2unk/errdet/tag_with_3c_tagger.py err_model_dir unsup_saus_feats unsup_error_preds
'unsup_error_preds' is a text file containing the error predictions.
- Get Err2Unk unsupervised speech transcripts
bash local/err2unk/getErr2UnkTranscripts.sh saus_dir graph_dir unsup_error_preds > unsup_text
'unsup_text' is the output transcription in Kaldi format
Step 5. Retrain STT models
- Prepare a new data directory, combining supervised and unsupervised data, for training new models
bash local/err2unk/prepareNewDataDir.sh unsup_text old_sup_data_dir old_unsup_data_dir new_data_dir
'old_sup_data_dir' contains wav.scp and utt2spk used for training seed AM in Step 1,
'old_unsup_data_dir' contains wav.scp and utt2spk used for decoding unsupervised speech in Step 2,
'new_data_dir' will contain the new combined data directory for training new models.
Note that this script can be extended to combine feats.scp and cmvn.scp to avoid repeating feature extraction (see the sketch after this list).
- Train a new AM and LM on the combined data directory using a Kaldi recipe similar to Step 1.
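The sketch below illustrates the feature-reuse extension mentioned above. It is not part of prepareNewDataDir.sh; it simply shows how Kaldi's standard utilities could merge already extracted features instead of re-extracting them, assuming both input directories already contain feats.scp and cmvn.scp:
utils/combine_data.sh new_data_dir old_sup_data_dir old_unsup_data_dir
utils/fix_data_dir.sh new_data_dir
# combine_data.sh carries over feats.scp and cmvn.scp when they are present in the input directories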
Dialogue State based training
Dialog state based weakly supervised training of STT models will typically go through the following steps.
Step 1. Train seed STT models
- Supervised training data with reliable speech-transcript pairs are used to train the seed AM and LM. (Note that this step can be skipped if you already have pre-trained AM and LM).
- A sample Kaldi recipe to train the seed AM and LM on a subset of the Let's Go dataset is made available in the egs/ directory.
Step 2. Decode unsupervised speech to lattices
- The seed AM and LM are used to decode the unsupervised speech into Kaldi STT lattices (lat.*.gz). (Sample script in egs/ if you are relying on sample recipe from Step 1.)
Step 3. Train dialogue state LMs
- Train dialog state specific LMs
bash local/dsLMs/trainDialogStateLMs.sh old_lang_test_dir utt_dialog_state_csv ds_lm_dir
'old_lang_test_dir' was created during training of seed models (in Step 1) and should contain files words.txt and G.fst,
'utt_dialog_state_csv' is the train set 3-column csv file of the form utterance_id,transcript,dialog_state (see the example after this list),
'ds_lm_dir' will contain the dialog state specific LMs.
Note that the above script uses the unk symbol and a count threshold minDsCnt on the minimum number of utterances in a dialog state. Dialogue states with fewer utterances than this count are ignored, and their utterances fall back to the seed LM (G.fst) in 'old_lang_test_dir'.
- Train interpolated dialog state specific LMs
bash local/dsLMs/trainInterpolatedDialogStateLMs.sh old_lang_test_dir old_lm_arpa ds_lm_dir int_ds_lm_dir
'old_lang_test_dir' is the lang_test directory of the seed models,
'old_lm_arpa' is the arpa LM corresponding to the seed LM,
'ds_lm_dir' from previous step,
'int_ds_lm_dir' will contain the interpolated dialog state specific LMs.
Note that the interpolated dialog state specific LMs perform better than the dialog state specific LMs created by the previous command, but the previous command is still needed to obtain 'ds_lm_dir'.
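A purely hypothetical example of the 'utt_dialog_state_csv' file (the utterance IDs, transcripts and dialogue state names are made up; as noted in Step 4, the transcript column is left empty for unsupervised utterances):
utt_20100315_0001,leaving from forbes avenue,request_departure_place
utt_20100315_0002,the next sixty one c please,request_bus_route
utt_20100315_0003,,request_travel_time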
Step 4. Rescore unsupervised lattices
- Reorganise old lattice archives into dialog state specific lattice archives
bash local/dsLMs/reorgLattices.sh data_dir utt_dialog_state_csv old_lat_dir int_ds_lm_dir new_lat_dir
'data_dir' should contain the wav.scp file,
'utt_dialog_state_csv' is a 3 column csv file of form utterance_id,transcript,dialog_state (without any transcript contents for unsupervised speech),
'old_lat_dir' was created after decoding with the seed models and should contain Kaldi format lattice archives (lat.*.gz),
'int_ds_lm_dir' was created in Step 3,
'new_lat_dir' will contain the reorganised lattice archives ready for rescoring with Kaldi.
- Rescore unsupervised lattices with interpolated dialog state specific LMs
bash local/dsLMs/rescoreDsLattices.sh old_lang_test_dir int_ds_lm_dir data_dir new_lat_dir rescored_lat_dir
'old_lang_test_dir' was created during training of seed models (in Step 1) and should contain files words.txt and G.fst,
'int_ds_lm_dir' was created in Step 3,
'data_dir' should contain the reference transcription in Kaldi format if you want to compute WER,
'new_lat_dir' contains the reorganised lattice archives ready for rescoring,
'rescored_lat_dir' will contain the dialog state LM rescored lattice archives.
- Get best path transcripts on unsupervised speech
bash local/dsLMs/getBestPathTranscripts.sh rescored_lat_dir words_file lm_wt word_ins_penalty unsup_text
'words_file' is words.txt used by the seed models (for example in 'old_lang_test_dir'),
'lm_wt' is the LM weight which gives the best dev set WER,
'word_ins_penalty' is 0.0, 0.5 or 1.0, whichever gives the best dev set WER (a sketch of such a sweep follows this list),
'unsup_text' will contain the best path transcripts on the unsupervised speech
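Choosing 'lm_wt' and 'word_ins_penalty' amounts to a small sweep on the dev set. A hedged sketch, assuming you have also rescored the dev set lattices into a (hypothetical) dev_rescored_lat_dir and have a Kaldi-format dev reference dev_ref_text:
for lmwt in 8 9 10 11 12; do
  for wip in 0.0 0.5 1.0; do
    bash local/dsLMs/getBestPathTranscripts.sh dev_rescored_lat_dir old_lang_test_dir/words.txt $lmwt $wip dev_hyp_${lmwt}_${wip}
    # compute-wer is a standard Kaldi binary; pick the (lmwt, wip) pair with the lowest WER
    compute-wer --text --mode=present ark:dev_ref_text ark:dev_hyp_${lmwt}_${wip}
  done
done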
Step 5. Retrain STT models
- Prepare unsupervised data for training new models
bash local/err2unk/prepareNewUnsupDataDir.sh unsup_text old_sup_data_dir old_unsup_data_dir new_data_dir
'old_sup_data_dir' contains wav.scp and utt2spk used for training seed AM in Step 1,
'old_unsup_data_dir' contains wav.scp and utt2spk used for decoding unsupervised speech in Step 2,
'new_data_dir' will contain the new combined data directory for training new models.
Note that this script can be extended to combine feats.scp and cmvn.scp to avoid repeating feature extraction.
- Train a new AM and LM on the combined data directory using a Kaldi recipe similar to Step 1.
CN2LM training
CN2LM training will typically go through the following steps.
Step 1. Train seed STT models
- Supervised training data with reliable speech-transcript pairs are used to train the seed AM and LM. (Note that this step can be skipped if you already have pre-trained AM and LM).
- A sample Kaldi recipe to train the seed AM and LM on a subset of the Let's Go dataset is made available in the egs/ directory.
Step 2. Prepare Confusion Networks
- The seed AM and LM are used to decode the unsupervised speech and dev set into STT lattices. (Sample script in egs/local/ if you are relying on the sample recipe from Step 1.)
- Obtain STT confusion networks from the lattices decoded on the unsupervised speech and dev set. The COMPRISE library assumes confusion networks are in Kaldi sausage format. Assuming your lattices are generated by Kaldi (as lat.*.gz), you can use our script to generate STT confusion networks as follows:
bash local/err2unk/getSaus.sh lattice_dir graph_dir lm_wt
'graph_dir' is the one used by the Kaldi decoder,
'lm_wt' is the LM weight which gives the best dev set WER.
Note that STT confusion networks, aka sausages, are generated in the <lattice_dir>/sau/ directory, referred to as 'saus_dir' in the next steps.
Step 3. Train CN2LM 3-gram LM
- A 3-gram LM can be trained on the combined supervised training speech transcripts and confusion networks obtained on the unsupervised speech as follows:
python local/cn2lm/ngramlm/build_cn2lm_arpa.py asr_vocab_file sup_text unsup_saus_dir out_3glm_dir
'asr_vocab_file' is the vocabulary following Kaldi's words.txt format,
'sup_text' is the supervised reference transcription in Kaldi format,
'unsup_saus_dir' is the unsupervised speech confusion networks directory generated in the previous step,
'out_3glm_dir' is the output directory to store the 3-gram arpa LM (see the usage sketch at the end of this step)
Note that this CN2LM component has built-in options to train interpolated modified-KN smoothed 3-gram LMs only on reference transcriptions or only on confusion networks. It can also make use of error predictions on the confusion networks to prune the confusion networks in regions not predicted as errors; check local/cn2lm/ngramlm/build_cn2lm_arpa.py for the relevant modifications. It also has options to limit the maximum number of arcs kept per confusion bin; check the global MAX_ARCS in local/cn2lm/ngramlm/data.py.
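Once the 3-gram arpa LM is built, a common way to use it for Kaldi decoding is to compile it into a new G.fst with the standard arpa2fst tool. A rough sketch, assuming the arpa file written to 'out_3glm_dir' is named lm.arpa (the actual file name may differ) and reusing the seed models' words.txt:
arpa2fst --disambig-symbol=#0 --read-symbol-table=old_lang_test_dir/words.txt out_3glm_dir/lm.arpa new_lang_test_dir/G.fst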
Step 4. Train CN2LM RNN LM
- An RNN LM can be trained on the combined supervised training speech transcripts and confusion networks obtained on the unsupervised speech as follows:
python local/cn2lm/rnnlm/train_cn2lm_rnn.py asr_vocab_file sup_text unsup_saus_dir dev_saus_dir dev_text out_rnnlm_dir
'asr_vocab_file' is the vocabulary following Kaldi's words.txt format,
'sup_text' is the supervised reference transcription in Kaldi format,
'unsup_saus_dir' is the unsupervised speech confusion networks directory generated in the previous step,
'dev_saus_dir' is the dev set confusion networks directory generated in the previous step,
'dev_text' is the dev set reference transcription in Kaldi format,
'out_rnnlm_dir' is the output directory to store the RNN LM model in Pytorch's pth format
Note that this CN2LM component has built-in options to train an LSTM or GRU RNN LM, to share the input-output word embedding layers, and to use different pooling schemes over the confusion bin arcs. It also has options to limit the maximum number of arcs kept per confusion bin. Check the globals defined in local/cn2lm/rnnlm/models.py and local/cn2lm/rnnlm/data.py.
- Support is provided to convert a CN2LM GRU RNN LM to Kaldi RNN LM format as follows:
bash local/cn2lm/rnnlm/kaldi_support/pytorch_rnnlm_to_kaldi.sh asr_vocab_file kaldi_gru_lm_template_file pytorch_model out_kaldi_model_dir
'asr_vocab_file' is the vocabulary following Kaldi's words.txt format,
'kaldi_gru_lm_template_file' is a Kaldi nnet3 format template file; a template for a single layer RNN LM with shared input-output embeddings and a Pytorch GRU cell is provided in local/cn2lm/rnnlm/kaldi_support/torch_gru.raw.tmp.txt,
'pytorch_model' is the Pytorch format RNN LM trained in the previous step,
'out_kaldi_model_dir' will store the Kaldi compatible RNN LM files
Note that this step currently supports only a single layer RNN LM with shared input-output embeddings and a Pytorch GRU cell. Support for more RNN layers, LSTM cells, etc. can easily be added if a suitable kaldi_gru_lm_template_file is created.
License
Each of the components in the COMPRISE library for weakly supervised training of Speech-to-Text (STT) models is licensed separately. Refer to the COPYING file in the individual components.
The source code header of each file specifies the individual authors and source material for that file, as well as the corresponding copyright notice.