Convert Kaldi* ASpIRE Chain Time Delay Neural Network (TDNN) Model to the Intermediate Representation
You can download a pre-trained model for the ASpIRE Chain Time Delay Neural Network (TDNN) from the official Kaldi* project website.
Convert ASpIRE Chain TDNN Model to IR
To generate the Intermediate Representation (IR) of the model, run the Model Optimizer with the following parameters:
python3 ./mo_kaldi.py --input_model exp/chain/tdnn_7b/final.mdl --output output
The IR will have two inputs: input for data and ivector for ivectors.
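To confirm that the generated IR exposes both inputs, you can inspect the IR .xml file directly. The following is a minimal sketch using only the Python standard library. It assumes that input layers are stored as layer elements with type "Parameter" (or "Input" in older IR versions); the helper name list_ir_inputs and the SAMPLE_IR string are hypothetical and only stand in for a real IR file.

```python
import xml.etree.ElementTree as ET


def list_ir_inputs(xml_text):
    """Return the names of the input layers found in IR XML text."""
    root = ET.fromstring(xml_text)
    # Assumption: inputs are layers of type "Parameter" (newer IRs)
    # or "Input" (older IRs).
    return [
        layer.get("name")
        for layer in root.iter("layer")
        if layer.get("type") in ("Parameter", "Input")
    ]


# Hand-written stand-in for an IR file, for illustration only.
SAMPLE_IR = """<net name="final" version="10">
  <layers>
    <layer id="0" name="input" type="Parameter"/>
    <layer id="1" name="ivector" type="Parameter"/>
    <layer id="2" name="output" type="Result"/>
  </layers>
</net>"""

print(list_ir_inputs(SAMPLE_IR))
```

For a real model, read the text of the generated .xml file and pass it to the helper instead of SAMPLE_IR.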
Example: Run ASpIRE Chain TDNN Model with the Speech Recognition Sample
These instructions show how to run the converted model with the Speech Recognition sample. In this example, the input data contains one utterance from one speaker.
To follow the steps described below, you must first do the following:
1. Download the Kaldi repository.
2. Build it following the instructions in README.md in the repository.
3. Download the model archive from the Kaldi website.
4. Extract the downloaded model archive to the egs/aspire/s5 folder of the Kaldi repository.
To run the ASpIRE Chain TDNN Model with the Speech Recognition sample:
- Prepare the model for decoding. Refer to the README.txt file from the downloaded model archive for instructions.
- Convert data and ivectors to .ark format. Refer to the corresponding sections below for instructions.
Prepare Data
If you have a .wav data file, you can convert it to .ark format using the following command:
<path_to_kaldi_repo>/src/featbin/compute-mfcc-feats --config=<path_to_kaldi_repo>/egs/aspire/s5/conf/mfcc_hires.conf scp:./wav.scp ark,scp:feats.ark,feats.scp
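The command above reads its input from wav.scp, a Kaldi script file with one "<utterance-id> <path-to-wav>" entry per line. A minimal sketch for generating such a file is shown below. The helper names and the utterance ID "utt1" are hypothetical, and the sketch assumes plain .wav paths without piped commands.

```python
import os


def wav_scp_line(utt_id, wav_path):
    """Format one wav.scp entry: '<utterance-id> <absolute path to wav>'."""
    return f"{utt_id} {os.path.abspath(wav_path)}\n"


def write_wav_scp(entries, scp_path="wav.scp"):
    """Write (utterance id, wav path) pairs to a Kaldi wav.scp file."""
    with open(scp_path, "w") as scp:
        for utt_id, wav_path in entries:
            scp.write(wav_scp_line(utt_id, wav_path))


# Show what a single entry looks like without touching the file system.
print(wav_scp_line("utt1", "audio/utt1.wav"), end="")
```

Calling write_wav_scp([("utt1", "audio/utt1.wav")]) in the working directory creates a wav.scp file that the compute-mfcc-feats command can consume.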
Edit feats.scp so that it refers to feats.ark by its absolute path. Otherwise, later commands that read feats.scp from a different directory cannot find the archive.
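Rewriting the archive path by hand is easy for one utterance, but it can also be scripted. The sketch below assumes that each feats.scp line has the form "<utterance-id> <ark-path>:<offset>" and that paths are POSIX-style; the helper names are hypothetical.

```python
import os


def absolutize_scp_line(line, base_dir):
    """Rewrite '<utt> <ark>:<offset>' so that <ark> is an absolute path."""
    utt_id, rxfilename = line.rstrip("\n").split(" ", 1)
    ark_path, sep, offset = rxfilename.rpartition(":")
    if not sep:
        # No ':<offset>' suffix: treat the whole field as the path.
        ark_path, offset = rxfilename, ""
    # os.path.join keeps ark_path unchanged if it is already absolute.
    abs_path = os.path.abspath(os.path.join(base_dir, ark_path))
    return f"{utt_id} {abs_path}{sep}{offset}\n"


def absolutize_scp(scp_path):
    """Rewrite every entry of an .scp file in place."""
    base_dir = os.path.dirname(os.path.abspath(scp_path))
    with open(scp_path) as scp:
        lines = [absolutize_scp_line(l, base_dir) for l in scp if l.strip()]
    with open(scp_path, "w") as scp:
        scp.writelines(lines)


print(absolutize_scp_line("utt1 feats.ark:5\n", "/work"), end="")
```

Run absolutize_scp("feats.scp") in the directory that contains feats.ark, before copying feats.scp elsewhere.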
Prepare Ivectors
To prepare ivectors for the Speech Recognition sample, do the following:
- Copy the feats.scp file to the egs/aspire/s5/ directory of the built Kaldi repository and navigate there:
cp feats.scp <path_to_kaldi_repo>/egs/aspire/s5/
cd <path_to_kaldi_repo>/egs/aspire/s5/
- Extract ivectors from the data:
./steps/online/nnet2/extract_ivectors_online.sh --nj 1 --ivector_period <max_frame_count_in_utterance> <data folder> exp/tdnn_7b_chain_online/ivector_extractor <ivector folder>
To simplify the preparation of ivectors for the Speech Recognition sample,
set --ivector_period to the maximum number of frames in any utterance
so that only one ivector is produced per utterance.
To get the maximum number of frames across utterances, run the following command:
../../../src/featbin/feat-to-len scp:feats.scp ark,t: | cut -d' ' -f 2 - | sort -rn | head -1
As a result, in <ivector folder>, you will find the ivector_online.1.ark file.
- Go to the <ivector folder>:
cd <ivector folder>
- Convert the ivector_online.1.ark file to text format using the copy-feats tool. Run the following command:
<path_to_kaldi_repo>/src/featbin/copy-feats --binary=False ark:ivector_online.1.ark ark,t:ivector_online.1.ark.txt
- For the Speech Recognition sample, the .ark file must contain an ivector for each frame. You must copy the ivector frame_count times. To do this, you can run the following script in the Python* command prompt:
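import subprocess

# Write "<utterance> <frame_count>" for each utterance to feats_length.txt.
subprocess.run(["<path_to_kaldi_repo>/src/featbin/feat-to-len", "scp:<path_to_kaldi_repo>/egs/aspire/s5/feats.scp", "ark,t:feats_length.txt"], check=True)

# The input contains a single utterance, so the first line holds its frame count.
with open("feats_length.txt", "r") as length_file:
    frame_count = int(length_file.readline().split()[1])

with open("ivector_online.1.ark.txt", "r") as f, open("ivector_online_ie.ark.txt", "w") as g:
    for line in f:
        if "[" in line:
            # Header line ("<utterance>  ["): copy it as is.
            g.write(line)
        else:
            # Ivector row: drop the closing bracket and repeat the row for every frame.
            row = line.replace("]", " ")
            for _ in range(frame_count):
                g.write(row)
    # Close the matrix.
    g.write("]")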
- Create an .ark file from .txt:
<path_to_kaldi_repo>/src/featbin/copy-feats --binary=True ark,t:ivector_online_ie.ark.txt ark:ivector_online_ie.ark
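Two points in the steps above lend themselves to a quick check: the value passed to --ivector_period and the number of rows in the expanded ivector file. The following sketch shows one way to compute both from text files. It assumes that feat-to-len text output has one "<utterance> <frames>" pair per line and that the text archive holds a single matrix whose header line ("<utterance>  [") carries no data; the helper names and sample strings are hypothetical.

```python
def parse_feat_lengths(text):
    """Parse feat-to-len text output into {utterance: frame_count}."""
    lengths = {}
    for line in text.splitlines():
        if line.strip():
            utt_id, frames = line.split()
            lengths[utt_id] = int(frames)
    return lengths


def count_matrix_rows(ark_text):
    """Count the data rows of a single matrix in a Kaldi text archive."""
    rows = 0
    for line in ark_text.splitlines():
        if "[" in line:
            # Header line '<utterance>  [': assumed to carry no data.
            continue
        if line.replace("]", "").strip():
            rows += 1
    return rows


# Illustrative inputs only; real data comes from feats_length.txt
# and ivector_online_ie.ark.txt.
lengths = parse_feat_lengths("utt1 3\n")
expanded = "utt1  [\n  0.1 0.2  \n  0.1 0.2  \n  0.1 0.2  \n]"

# Same value as the feat-to-len | cut | sort | head pipeline.
max_frames = max(lengths.values())
rows = count_matrix_rows(expanded)
print(max_frames, rows, rows == lengths["utt1"])
```

If the row count differs from the frame count of the utterance, regenerate ivector_online_ie.ark.txt before converting it back to binary .ark format.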
Run the Speech Recognition Sample
Run the Speech Recognition sample with the created ivector .ark file as follows:
speech_sample -i feats.ark,ivector_online_ie.ark -m final.xml -d CPU -o prediction.ark -cw_l 17 -cw_r 12
Results can be decoded as described in the "Use of Sample in Kaldi* Speech Recognition Pipeline" chapter of the Speech Recognition Sample description.