Training Hidden Markov Model/Artificial Neural Network (HMM/ANN) Hybrids for Automatic Speech Recognition (ASR)


John-Paul Hosom, Jacques de Villiers, Ron Cole, Mark Fanty, Johan Schalkwyk, Yonghong Yan, Wei Wei
Center for Spoken Language Understanding (CSLU)
OGI School of Science & Engineering (OGI)
Oregon Health & Science University (OHSU)

Version 1.1: February, 1999
Version 2.0: February, 2006

Contents

1. Introduction
1.1 Setup
1.2 Additional Information
2. General Concepts and Notation
2.1 Quick Overview of Frame-Based Speech Recognition
2.2 Specifying Categories
2.3 Example of Specifying Categories
2.4 Finding Examples to Train On
2.4.1 Overfitting and Datasets
2.4.2 Filtering
2.4.3 Finding Categories
2.4.4 Number of Examples per Category
2.5 Training the Network
2.5.1 Generating Data
2.5.2 Number of Hidden Nodes
2.5.3 Negative Penalty
2.5.4 Number of Training Iterations
2.5.5 Re-Training on Force-Aligned Data
2.6 Evaluation
2.6.1 Word-Level Evaluation
2.6.2 Choosing the Best Iteration
2.6.3 Testing
3. Overall Procedure
3.1 Create Descriptions
3.2 Find Data
3.3 Select Data for Training
3.4 Train and Evaluate
3.5 Re-Train
3.6 Evaluate Test Set
4. Complete Example
5. File Formats
wav files
txt files
label files
corpora file
info file
grammar file
lexicon file
parts file
spec file
files file
dur file
counts file
examples file
vec file
neural-network files
summary file
ali files
6. Script and Program Usage
asr.tcl
checkvec.exe
fa.tcl
find_dur.tcl
find_files.tcl
gen_catfiles.tcl
gen_spec.tcl
gen_examples.tcl
nntrain.exe
pick_examples.tcl
revise_spec.tcl
select_best.tcl

1. Introduction

This tutorial describes one method used at Oregon Health & Science University’s Center for Spoken Language Understanding (CSLU) for creating automatic speech recognition (ASR) systems called Hidden Markov Model/Artificial Neural Network (HMM/ANN) hybrids, using the CSLU Toolkit.  The CSLU Toolkit contains tools for speech recognition, speech synthesis, facial animation, audio I/O, and other interface tools.  This Toolkit performs all lower-level operations using “C” code, and higher-level operations using a scripting language called “Tcl.”  This allows a balance of speed and flexibility that would not be possible with any one programming language. Included in this tutorial are some general concepts behind training such a recognizer, step-by-step instructions on how to train a recognizer, and a description of Tcl scripts that can be used to automate parts of this process.

1.1 Setup

In order to use the scripts mentioned in this tutorial, you must have the CSLU Toolkit installed on your machine. Currently, the CSLU Toolkit is only supported in a Windows environment.

The “Path” environment variable should be modified as follows (assuming Windows XP): click Start » Settings » Control Panel » System. Then click on the “Advanced” tab, and click on the “Environment Variables” button. Under the “System variables” heading, select the “Path” variable. Click on the “Edit” button, which pops up a new window called “Edit System Variable.” In this new window, there is an area for the “Variable value”. In the corresponding entry field, there are a number of paths, each separated by a semicolon. Assuming that you have installed the CSLU Toolkit into the default directory (C:\Program Files\CSLU), add the following paths to this list (separated by semicolons):
C:\Program Files\CSLU\Tcl80\bin
C:\Program Files\CSLU\Toolkit\2.0\bin
C:\Program Files\CSLU\Toolkit\2.0\shlib
C:\Program Files\CSLU\Toolkit\2.0\script\sview_1.0
C:\Program Files\CSLU\Toolkit\2.0\script\training_1.0
Click on “OK” on these windows to finalize the settings.

The .tcl extension can be associated with either a command-line Tcl script, or a GUI Tk script. The default association is with a Tk script, but this should be modified to execute a command-line Tcl script as follows: click Start » Settings » Control Panel » Folder Options.  Then click on the “File Types” tab and scroll down until the “TCL” extension is visible.  Highlight this extension with a single mouse click, so that the bottom of this window shows “Details for TCL Extension.”  Click on the “Advanced” button, which pops up a window called “Edit File Type.”  Single-click on the action labeled “Tclsh”, and then click on the “Set Default” button to the right.  Then click on the “Edit” button just above the “Set Default” button.  This pops up yet another window called “Editing action for type: Tcl/Tk”.  In this window, there is an entry box labeled “Application used to perform action:”.  Make sure that this entry box contains the following (assuming that CSLU Toolkit installation was in the default directory):
“C:\Program Files\CSLU\tcl80\bin\tclsh80.exe” “%1” %*%
(note especially the %*% at the end).  Click on “OK” on all of these windows to finalize the settings.

The training process uses commands entered from a DOS prompt; a DOS command window can be found on Windows XP from Start » Programs » Accessories » Command Prompt.  Since it will be used often, it is recommended that the command window be resized for a width of 80 characters, screen buffer height of 2000 lines, window size height of 40 or 50 lines, and screen text font color of pure white for maximum visibility.  A command window can be added to the Start menu for easy access.

In order to follow the examples in this tutorial, you may want to use the same data files.  These files have been put into a ZIP file containing all waveform and transcription files.  This file is available by clicking here.  The size of this compressed ZIP file is 7.6 MB, and the size of the data files is about 10 MB.  In addition, the configuration files used in the tutorial have been put in a ZIP file.  This ZIP file is available here. Some of these configuration files will need to be modified to reflect your system’s path information and other relevant information.

1.2 Additional Information

In this document, phonetic symbols are represented using  the American English subset of Worldbet, which is an ASCII encoding of the International Phonetic Alphabet (IPA) [J. Hieronymus, 1995]. This tutorial has been supported by an NSF “Graduate Research Traineeships” award (grant number 9354959) and the CSLU Member companies. The views expressed in this tutorial do not necessarily represent those of the sponsoring agency and companies.

2. General Concepts and Notation

The general steps to creating an HMM/ANN speech recognition system are:

    1. Specify the (sub-)phonetic categories that the neural network will classify.
    2. Find examples of each of these categories in the speech data.
    3. Perform a number of iterations of neural-network training.  The output of each neural network is an estimation of the probabilities of each of the specified categories given a single time-point within a speech waveform. Select the best neural network (and adjust other system parameters) by evaluating each network on a small partition of the speech data that is not used for training or testing.  Evaluation is performed by using the estimated probabilities obtained from a neural network within a Hidden Markov Model framework.
    4. Evaluate the selected best network on a test set of speech data.

2.1 Quick Overview of Frame-Based Speech Recognition

Frame-based speech recognition has the following steps, illustrated in Figure 1:

Overview of Speech Recognition
Figure 1. Overview of frame-based speech recognition using a Hidden Markov Model/Artificial Neural Network (HMM/ANN) architecture.

  1. Divide the speech waveform into  frames, where each frame is a small segment of speech that contains an equal number of waveform samples. In this tutorial, we will assume a frame size of 10 msec.
  2. Compute features for each frame. These features can be thought of as a representation of the spectral envelope of the speech at that frame, and at a small number of surrounding frames called the “context window”.
  3. Classify the features in each frame into phonetic-based categories using a neural network. The outputs of the neural network are estimates of the probability of each phonetic category, given the speech features at this frame.  When the neural network is used to classify all frames, this creates a matrix of probabilities, with F columns and C rows, where F is the number of frames and C is the number of categories.
  4. Use the matrix of probabilities, a set of pronunciation models, and a priori information about each category’s duration to determine the most likely word(s) using a Viterbi search.

For a much more detailed explanation about HMMs and HMM/ANN hybrids, lecture notes are available.
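To make step 4 a little more concrete, the sketch below (hypothetical Tcl, not the Toolkit's search code) scores one fixed left-to-right sequence of categories against the frame-by-category probability matrix described in step 3; the real Viterbi search does this over all word sequences allowed by the grammar, and also applies per-category duration limits.

# Minimal sketch of Viterbi scoring (hypothetical code, not the Toolkit's
# search).  probList holds one list of C category probabilities per frame;
# path is the ordered list of category indices that one pronunciation
# allows.  The best log-score of reaching the final category is returned;
# backpointers (and hence the actual alignment) are omitted for brevity.
proc viterbi_score {probList path} {
    set nFrames [llength $probList]
    set nStates [llength $path]
    for {set s 0} {$s < $nStates} {incr s} { set delta($s) -1.0e30 }
    set p [lindex [lindex $probList 0] [lindex $path 0]]
    set delta(0) [expr {log($p + 1e-10)}]
    for {set f 1} {$f < $nFrames} {incr f} {
        set frame [lindex $probList $f]
        # update states from last to first so that delta(s-1) is still
        # the value from the previous frame
        for {set s [expr {$nStates - 1}]} {$s >= 0} {incr s -1} {
            set best $delta($s)
            if {$s > 0 && $delta([expr {$s - 1}]) > $best} {
                set best $delta([expr {$s - 1}])
            }
            set p [lindex $frame [lindex $path $s]]
            set delta($s) [expr {$best + log($p + 1e-10)}]
        }
    }
    return $delta([expr {$nStates - 1}])
}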

2.2 Specifying Categories

In order to determine the categories that the network will classify, the following three things need to be done:

  1. The designer of the recognizer needs to determine the pronunciations for each of the words that will be recognized. More accurate pronunciation models will generally yield better recognition rates.
  2. Quite often, context-dependent phoneme models are used, which means that the model for a phoneme varies depending on the phonemes that precede or follow this phoneme.  For example, the phoneme /aI/ preceded by a /w/ will have a different model name than an /aI/ preceded by an /n/.  The surrounding contexts (/w/ and /n/ in this example) may be specific phonemes, or groups (clusters) of phonemes.  The grouping of phonemes into similar clusters is specified by the person designing the recognizer.
  3. Finally, when constructing context-dependent phoneme models, each phoneme to be recognized is divided into one, two, or three sub-sections or segments. Each sub-phonetic segment corresponds to one category to be recognized. If a phoneme is specified as having only one segment, then it is used without the context of surrounding phonemes.  If a phoneme is specified as having two segments, then the left segment (sub-phonetic category) is dependent on the preceding phoneme, and the right segment is dependent on the following phoneme.  If a phoneme is specified as having three segments, then the first segment is dependent on the preceding phoneme, the middle segment is independent of surrounding phonemes, and the third segment is dependent on the following phoneme.  If a phoneme is specified as being “right-dependent,” it has only one segment, but this segment is dependent on the context of the following phoneme.  Right-dependent categories are typically used for oral stop phonemes.  This method of constructing context-dependent models is somewhat different from the standard triphone or biphone model, focusing the context dependencies on the regions of the phoneme that are most affected by that context.  The number of parts to split each phoneme into is specified by the person designing the recognizer.

Figure 2 shows an illustration of this kind of context-dependent modeling. In this figure, an example is given for the modeling of the word “lines”, written in Worldbet as /l aI n z/. Here, the /l/ is split into two parts, the /aI/ is split into three parts, /n/ into two parts, and the /z/ is modeled using one part. There are eight clusters (or groups) of phonemes used for contexts; each cluster represents a broad class of sounds. For this recognizer, the /l/ phoneme is assigned to the “$lat” cluster (for lateral phonemes), the /aI/ phoneme is assigned both to the $bck_r cluster of back vowels occurring in a right-hand context and to the $fnt_l cluster of front vowels occurring in a left-hand context, and both /n/ and /z/ are assigned to the $alv (alveolar) cluster. (The /aI/ phoneme is unusual in that it begins as a back vowel and ends as a front vowel; therefore, it cannot be grouped into only one of the $bck (back-vowel) or $fnt (front-vowel) clusters.  The solution is to consider whether the /aI/ is occurring to the left of a phoneme (a left context), in which case it is always a front vowel, or if it is occurring to the right of a phoneme (a right context), in which case it is always a back vowel.)

Context-Dependent Modeling
Figure 2. Context-Dependent Modeling

The context-dependent phonetic categories that the network will be trained on can be determined from the phonetic-level pronunciation models, the groupings of phonemes into clusters of similar phones, and the number of parts to split each phoneme into.

2.3 Example of Specifying Categories

To give an example of how these pronunciation models, clustering, and parts can be determined, we’ll use the example of recognizing the isolated words “three”, “tea”, “zero”, and “five”.  (For this example, isolated words (words that are surrounded by pauses or silence) will be used; the CSLU Toolkit can be used for recognizing continuous speech.)

First, we come up with some initial pronunciations:

     word       pronunciation
     three      T 9r i:
     tea        tc th i:
     zero       z i: 9r oU
     five       f aI v

We may want to modify these pronunciations, because the /i:/ in “zero” is often pronounced differently from the /i:/ in “three” and “tea”. To account for this difference in pronunciation, we can use our own symbol, /i:_x/, to represent the front vowel in “zero”. Making this change gives us the following pronunciation models:

     word       pronunciation
     three      T 9r i:
     tea        tc th i:
     zero       z i:_x 9r oU
     five       f aI v

Next, we will determine the number of parts to use for each phoneme. In the table below, “1” means that the phoneme will be context-independent, “2” means that the phoneme will be split into two parts, “3” means that the phoneme will be split into three parts, and “r” means that the phoneme will be “right-dependent”:

     phone      parts
     T          1
     9r         2
     i:         3
     tc         1
     th         r
     z          1
     i:_x       2
     oU         3
     f          1
     aI         3
     v          1
     .pau       1

The /.pau/ symbol is used for the pause that is assumed to occur between words.  Now, let’s look at the spectrograms of the vowel /i:/ in “three” and “tea”. In this case, the vowel /i:/ is the same, but it looks very different when it is preceded by a /9r/ compared to when it is preceded by a /th/ (see Figure 3).


Figure 3. Example of vowel /i:/ in different phonetic contexts.

In this case, we make the initial third of the /i:/ (since it is split into three parts) dependent on a preceding retroflex (/9r/) in one case and dependent on a preceding alveolar sound (/th/ or /z/) in the other case. We usually group the phonemes in a left or right context according to their broad phonetic category; for example, the following groupings can be used (the dollar sign indicates a variable that represents the group of listed phones):

  group   phonemes in group    description
  $bck   oU   back vowels
  $fnt   i: i:_x   front vowels
  $ret   9r   retroflex sounds
  $alvden   T v th z   dentals, labiodentals, and alveolars
  $sil   .pau tc /BOU /EOU   silence or closure

Notice the two symbols /BOU and /EOU in the “phonemes in group” column.  These are two special symbols defined by the recognizer; /BOU stands for “beginning of utterance” and /EOU stands for “end of utterance.”  They are not symbols that need to be trained on, but they are put into context clusters so that the recognizer knows what context-dependent category to assign to the first and last phonemes in an utterance.

This general scheme is relatively straightforward.  However, notice that it then becomes difficult to classify diphthongs such as /aI/, because the phoneme starts as a back vowel and ends as a front vowel. The solution is to modify the categories in the following way:

  group   phonemes in group    description
  $bck_l   oU   back vowels to the left of a target phoneme
  $bck_r   oU aI   back vowels to the right of a target phoneme
  $fnt_l   i: i:_x aI   front vowels to the left of a target phoneme
  $fnt_r   i: i:_x   front vowels to the right of a target phoneme
  $ret   9r   retroflex sounds
  $alvden   T v th z   dentals, labiodentals, and alveolars
  $sil   .pau tc /BOU /EOU   silence or closure

First, we have added “_l” and “_r” suffixes to the variable names in question, to indicate whether the phonemes in this grouping occur on the left or right side of the phoneme being classified. Then, because /aI/ has the characteristics of a back vowel when it appears in a right-hand context, it has been put in the grouping $bck_r; because /aI/ has the characteristics of a front vowel in a left-hand context, /aI/ has also been put in the grouping $fnt_l. This method of grouping into left or right contexts is illustrated in Figure 4:


Figure 4. Illustration of labeling a diphthong in the word “five”.

The format for specifying different categories is [left_context]<phone>[right_context], so for example the category for /.pau/ will be <.pau>, the category for the initial third of /i:/ preceded by a phoneme in the $alvden cluster will be $alvden<i:, the middle third of /i:/ will be <i:>, and the final third of /i:/ followed by silence will be i:>$sil.

Given all this information, it can easily (if tediously) be determined that the 23 categories we need to train on are:

<.pau>          $alvden<9r      $fnt_l<9r       9r>$fnt_r       9r>$bck_r
<T>             f<aI            <aI>            aI>$alvden      <f>
$ret<i:         $alvden<i:      <i:>            i:>$sil         $alvden<i:_x
i:_x>$ret       $ret<oU         <oU>            oU>$sil         <tc>
th>$fnt_r       <v>             <z>

In the following sections, a Tcl script called “gen_spec.tcl” is described; this script can be used to automate the process of determining categories and creating a specification file of what categories a classifier must use.
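As a rough illustration of the naming scheme that gen_spec.tcl automates, the hypothetical helper below (not the gen_spec.tcl code itself) builds the category names for one phoneme from its number of parts and the cluster names of its neighbors:

# Hypothetical helper: return the category names for one phoneme, given
# its parts setting ("1", "2", "3", or "r") and the cluster names of the
# preceding and following phonemes.
proc categories {phone parts left right} {
    switch -- $parts {
        1 { return [list "<$phone>"] }
        r { return [list "$phone>$right"] }
        2 { return [list "$left<$phone" "$phone>$right"] }
        3 { return [list "$left<$phone" "<$phone>" "$phone>$right"] }
    }
}

# The /i:/ of "three" (preceded by /9r/ in $ret, followed by silence):
puts [categories i: 3 {$ret} {$sil}]
# prints: $ret<i: <i:> i:>$sil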

Different settings for the “parts” and “context clusters” can yield significantly different word-level performance in the final recognizer.  The values that yield best results will depend on a number of factors, including the vocabulary size and grammar.  The general goal is to create categories that will have enough examples for training (maximize the number of examples per category) and also maximize the difference of models for different words.

2.4 Finding Examples to Train On

2.4.1 Overfitting and Datasets
As a neural network is trained, the weights of the network are adjusted to minimize the classification error on the training data. For each adjustment of the weights, we have a new iteration (or epoch) of the training process. We can keep generating new iterations until the error no longer decreases. At this point, we have learned the training data to the extent that it is possible.

However, when we train a neural network, we aren’t really interested in learning as much as possible about the training data. Instead, we are interested in learning as much as possible about the general properties of the training data, so that when we evaluate on test data, our model is still accurate. By learning the general properties of the data instead of the details that are specific to the training data, we are best able to classify a new utterance that is not in the training set.

In order to determine which iteration of network weights has best learned the general properties of the data, we use a separate (usually smaller) set of data to evaluate each iteration.  Evaluation is conducted at the word level, meaning that the network is used in combination with a Viterbi search to perform word-level recognition; the set of network weights that maximizes word-level accuracy is selected as the “best” and final network.  This second set of data is called the “development” set (or cross-validation set). Because this development set has not been used to adjust the network weights during training, it can be used to evaluate the network’s ability to recognize phonetic categories, as opposed to the classifier’s ability to recognize (possibly irrelevant) details of the training set. The larger this development set is, the more confidence we can have in the general classification properties of the network.

Once we have determined the best network, we need to evaluate its performance on a test set. In order to have an honest evaluation, the data in the test set must not occur in either the training set or the development set.  In addition, for a speaker-independent recognizer, none of the speakers in the test set must have utterances in the training or development sets.

This means that given a corpus containing our target words, we must divide it into at least three parts: one part for training, one for development, and one for testing. If we have a large enough corpus, we may further divide the development set into subsets, so that as we evaluate and make modifications to our recognizer, we are not tuning performance to one set of development data.

2.4.2 Filtering
When selecting data for training, development, and testing, we can apply various filters to selectively reduce the amount of data. In one case, we may have utterances in our corpus that don’t occur in our target vocabulary. In this case, we may want to filter so that words not in our vocabulary list are not included in our datasets. For example, if we are training a digits recognizer and we are using the CSLU Numbers corpus for training, we may want to remove out-of-vocabulary utterances that contain numbers such as “first”, “twelve”, and “fifty”. In another case, we may have so much data that training or evaluation would take too long. In this case, we can filter so that we take every Nth utterance for use in our datasets, where N is some integer greater than 1. For example, we may want to take every sixth waveform for training our digits recognizer, because there are over 6000 utterances available for training on digits. Filtering in this way will still leave over 1000 utterances (or approximately 500 examples of each spoken digit) available for training.
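A minimal sketch of the “every Nth utterance” idea (a hypothetical helper; in practice the selection is controlled by the “filter:” field of the info file, described in Section 5):

# Hypothetical helper: keep every Nth file from a list.
proc every_nth {files n} {
    set kept {}
    for {set i 0} {$i < [llength $files]} {incr i $n} {
        lappend kept [lindex $files $i]
    }
    return $kept
}
# e.g. every_nth $allDigitFiles 6 keeps roughly one sixth of the utterances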

2.4.3 Finding Categories
Once we know which files we’ll use for training, we need to find examples of each context-dependent phonetic category that we’ll train on. This can be done in one of two ways: using data that has been hand-labeled at the phonetic level, or by using forced alignment.

Hand-Labeled Data
A number of speech corpora, such as the OGI Numbers corpus, OGI Stories corpus, or TIMIT corpus, have been labeled with phoneme identities, as well as the beginning time and ending time of each phoneme.  If training is to be done on this hand-labeled data, then the labels must be mapped from the phonetic level to the (context-dependent) category level. For example, a hand-labeled file for the isolated digit “three” might contain this information:

0    53    .pau
53   113   T
113  170   9r
170  229   i:
229  273   .pau

where the first column is the start time in milliseconds, the second column is the end time in milliseconds, and the third column is the phoneme label. In order to train a context-dependent recognizer on these data, the labels need to be mapped onto the following set of time-aligned categories:

0    53    <.pau>
53   113   <T>
113  142   $alvden<9r
142  170   9r>$fnt_r
170  190   $ret<i:
190  209   <i:>
209  229   i:>$sil
229  273   <.pau>

A set of Tcl scripts to automate this process will be described later. Also, some general modifications may be made to the hand-labeled data so that the data is more suited for training; for example, we may want to ignore very short pauses. Again, there are scripts described below that will automate this for us.
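The mapping itself is largely a matter of splitting each labeled phoneme’s time span across its categories. A hypothetical sketch of that step (gen_catfiles.tcl does the real work, including the remapping and pause handling mentioned above, and may place boundaries somewhat differently):

# Hypothetical helper: split one hand-labeled segment (times in msec)
# evenly across its ordered list of category names.
proc split_segment {start end catNames} {
    set n [llength $catNames]
    set dur [expr {$end - $start}]
    set lines {}
    for {set i 0} {$i < $n} {incr i} {
        set s [expr {$start + ($dur * $i) / $n}]
        set e [expr {$start + ($dur * ($i + 1)) / $n}]
        lappend lines [list $s $e [lindex $catNames $i]]
    }
    return $lines
}

# The /i:/ of "three" above (170-229 msec), split into its three categories:
puts [split_segment 170 229 {{$ret<i:} <i:> {i:>$sil}}]
# prints: {170 189 $ret<i:} {189 209 <i:>} {209 229 i:>$sil}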

Force-Aligned Data
Often, the corpus we want to train on has text (word-level) transcriptions but no time-aligned phonetic labels. In this case, we can create either phonetic labels or category labels using a process called “forced alignment”.

Forced alignment is the process of using an existing recognizer to recognize a training utterance, where the grammar and lexicon are restricted to be the correct result. (The correct result is the word-level transcription, which must be known). The result of forced alignment is a set of time-aligned labels that give the existing recognizer’s best alignment of the correct phonemes or categories. If the existing recognizer has high accuracy, then the labels will have good time alignments. These labels can then be used for training a new recognizer.
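For example (an illustration, not one of the tutorial files): if the transcription of a training utterance is “three five”, the grammar used during forced alignment would allow only that word sequence, so that the Viterbi search can only choose where the phoneme and category boundaries fall, e.g. something like:

$grammar = [sep*%%] three [sep*%%] five [sep*%%];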

2.4.4 Number of Examples per Category
Finally, the designer of a recognizer must decide how many examples (10-msec frames of speech that have been associated with a particular context-dependent phonetic category) of each category to train on. Networks with decent performance can be trained using up to 500 examples per category, but sometimes 2000 or more examples are used. In order to get best performance, generally all examples in the training set should be used. However, training with all examples may be very time-consuming.

If some categories have very few or no training examples, then there are two options. The first option is to use an additional corpus that contains examples of these infrequent classes. The second option is to “tie” these infrequent categories to phonetically similar categories that do have enough training examples. Categories tied in this way will not be trained on, and during recognition their probabilities will be set equal to the probabilities of the categories that they were tied to.

2.5 Training the Network

2.5.1 Generating Data
Once the examples to train on have been found, and the number of training examples per category has been determined, the actual data that will be trained on are collected and stored in a “vector file”. This vector file contains, for each training example, the acoustic features that will be input to the neural network and the target category that the network is supposed to learn. (One set of training features and the target category is called a “vector”; it can also be called an “example”.)

2.5.2 Number of Hidden Nodes
At CSLU, we use 3-layer feed-forward networks. The number of input nodes is the number of acoustic features, and the number of output nodes is the number of categories to be trained on. The designer of a recognizer must decide how many hidden nodes the network should have; in general, we have found 200 to 300 hidden nodes to be a reasonable number.

2.5.3 Negative Penalty
When using a large number of examples per category, it is nearly inevitable that some categories will have much fewer examples than others, making it difficult to learn these sparse categories. This difficulty in training is due to the fact that there are many more negative examples than positive examples for a sparse category, where negative examples are examples for which the category being trained on has a target value of 0, and positive examples are examples for which the category being trained on has a target value of 1. As a result, these sparse categories often have very small output values that don’t reflect the actual posterior probabilities that we want to obtain. To adjust for this, the amount that each negative example contributes to the total error is weighted by a value proportional to the number of examples in that negative category; this value is called a “negative penalty”. Training can be done either with or without this negative penalty. A more thorough discussion of the negative penalty can be found in a paper by Wei and van Vuuren from ICASSP-98, “Improved Neural Network Training of Inter-Word Context Units for Connected Digit Recognition.”

2.5.4 Number of Training Iterations
It is almost never necessary to continue training until the training error stops decreasing; the best performance on the development set will almost always occur at an earlier iteration. Often, best performance on the development set occurs after about 20 to 30 iterations, and so training is done for a fixed number of iterations, usually between 30 and 45.

2.5.5 Re-Training on Force-Aligned Data
As described above, forced alignment can be used to generate labels for training. In order to generate initial labels using forced alignment, we usually use a general-purpose recognizer. We can also use forced alignment to re-train a network; in this case, we use our current-best network to generate the forced-alignment labels and then train again using these new labels. This re-training often yields better results.

2.6 Evaluation

2.6.1 Word-Level Evaluation
Once we have trained for, say, 30 iterations, we need to determine which iteration has the best performance on the development set. To do this, we recognize each utterance in the development set using the network weights from each iteration and a Viterbi search. We evaluate the performance at each iteration in terms of substitution errors, insertion errors, and deletion errors.  The overall accuracy of a network iteration is defined to be 100% – (Sub + Ins + Del), where Sub is the percentage of substitution errors, Ins is the percentage of insertion errors, and Del is the percentage of deletion errors. We can also measure the “sentence-level accuracy”, which is the number of utterances (or entire waveforms) recognized correctly divided by the total number of utterances in the development set.
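A worked example of the accuracy figure (all numbers hypothetical): with 1000 reference words, 23 substitutions, 5 insertions, and 12 deletions, word accuracy is 100% - (2.3% + 0.5% + 1.2%) = 96.0%. The same calculation as a small Tcl helper:

# Hypothetical helper: word accuracy from error counts and the number of
# reference words.
proc word_accuracy {nWords nSub nIns nDel} {
    return [expr {100.0 * ($nWords - $nSub - $nIns - $nDel) / $nWords}]
}
# word_accuracy 1000 23 5 12  ->  96.0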

2.6.2 Choosing the Best Iteration

Usually, we choose the network iteration with the best word-level accuracy; in the case of equal word-level accuracies, we select the iteration with the greater sentence-level accuracy.
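A sketch of this selection rule over hypothetical per-iteration results (select_best.tcl reports the real numbers):

# Each entry is {iteration wordAccuracy sentenceAccuracy}; all numbers
# here are made up for illustration.
set results {{28 96.1 74.2} {29 96.4 74.8} {30 96.4 75.1}}
set best [lindex $results 0]
foreach r $results {
    set wa [lindex $r 1];  set sa [lindex $r 2]
    set bwa [lindex $best 1];  set bsa [lindex $best 2]
    if {$wa > $bwa || ($wa == $bwa && $sa > $bsa)} { set best $r }
}
puts "best iteration: [lindex $best 0]"
# prints: best iteration: 30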

2.6.3 Testing
Once we have finished developing a recognizer, we evaluate the final performance on the test set, in terms of word-level and sentence-level accuracy. It is important, however, that once evaluation is done on the test set, the recognizer is not further modified based on these test-set results. In order to ensure that such modifications are not done, the test set is usually reserved until just before the recognizer is put into general-purpose use (or just before publishing results in a journal or at a conference).

3. Overall Procedure

Given the background described in the previous section, the process of training a recognizer becomes relatively simple. This section gives the “recipe” for this training process.

3.1 Create Descriptions

The first step is to create a description of the recognizer and describe how the data will be selected for training. The files that need to be created are:

corpora file
Create a “corpora.txt” file if one doesn’t yet exist. The corpora.txt file contains a master list of each corpus, and the location and format of the files in that corpus. The format of this file is given below; there is no automated way of generating this file, but it is easy to modify by hand. The same corpora file can be used for all training tasks.
info files
Create “info” files for training, development, and testing. These info files must be created by hand; the format is given below in Section 5. An info file contains all of the information that is necessary to find examples for training, development, or testing. This info file includes the partition (train, develop, test), how to select the data for the required partition (i.e. filtering parameters, as described above), the basename of the recognizer, the minimum number of examples requested for each category, and corpus-dependent information. One info file is required for each of the tasks of training, re-training using forced alignment, development, and testing.
grammar file
Create a “grammar” file that specifies the grammar that will be used to recognize words.  The format of a grammar file is a modification of the ABNF format published by the W3C.  The exact format used here is described in the Statenet documentation.
lexicon file
Create a “lexicon” file that specifies the pronunciation of each word in the grammar.  The format of a lexicon file is given below.
parts file
Create a “parts” file, which specifies how many parts to split each phoneme into, and what context clusters to use. Once again, this must be created by hand, and the format is given in Section 5.

3.2 Find Data

Given the files created above, the scripts to use in order to find data files for training are:

find_files.tcl
Use “find_files.tcl” to find files for training, development, and testing. This script must be called once for each set of files. At this stage, any filters are applied and the corpus is searched for files that are appropriate for the given partition (such as training or testing).
gen_spec.tcl
Use “gen_spec.tcl” to generate a specification file that contains a list of the categories to train on.  This script uses the info, grammar, lexicon, and parts files to create a “spec” file.  The specification file contains, in addition to the categories used by the recognizer for training and recognition, the specific frame size, sampling rate, the location of code used to compute acoustic features, the context clusters, and any phonetic mappings.
gen_catfiles.tcl
Use “gen_catfiles.tcl” to create time-aligned categories from text transcriptions or from phonetic time-aligned transcriptions. These categories are written to separate files with the extension “.cat”, which are put in sub-directories that mirror the directory structure of the corpus (or corpora) being used.
revise_spec.tcl
Use “revise_spec.tcl” to (a) tie categories that don’t have enough training examples to categories that do have sufficient examples, and (b) update the minimum and maximum duration parameters for each category.  “gen_catfiles.tcl” creates output files that indicate the number of examples available for each category, as well as the duration information.  The output of this script is a modified “spec” file.

3.3 Select Data for Training

Once the files have been selected, the category files have been created, and the spec file is correct, then we can use the following scripts and programs to select frames for training:

pick_examples.tcl
Use “pick_examples.tcl” to select examples to train on.  The output of this script is an “examples” file, which is used directly by the next script, gen_examples.tcl
gen_examples.tcl
Use “gen_examples.tcl” to create acoustic feature vectors and their associated category information, for each frame to be trained on.  This script creates a binary file with the extension “.vec” (for vectors of features).
checkvec.exe
Use “checkvec” to make sure that the data in the .vec file are valid.  This program also prints out the number of categories and the number of examples of each category.  The number of categories is needed when running nntrain.exe.

3.4 Train and Evaluate

nntrain.exe
Use “nntrain” to train the neural network iterations using the vector file as training data.
select_best.tcl
Use “select_best.tcl” to find the best iteration of the network using the set of development files.

3.5 Re-Train

Create force-aligned data using the best iteration of the network that was just trained. To do this, create an info file for forced alignment that specifies a new directory in which to put the category files and a forced-alignment script to use to create the new .cat files. Then use “find_files.tcl“, “gen_spec.tcl“, “gen_catfiles.tcl“, and “revise_spec.tcl” to generate the force-aligned labels and create a new .spec file. Then repeat Sections 3.3 and 3.4 to create a network trained on this force-aligned data.

3.6 Evaluate Test Set

Use “select_best.tcl” to evaluate the final best network’s performance on the test set. These are the final results that are acceptable for publication.

4. Complete Example

To illustrate the procedure described above, the example of training a continuous-speech digits recognizer is given in this section. All commands should be entered using a DOS command window.  First make sure that the environment is properly set up as described in Section 1.1.   Text given in bold indicates commands that are typed from a command window; text in fixed-width font indicates the output from this command. In DOS, all commands must be entered on one line; if a backslash is used in the examples below to continue the command on another line, this must be typed as one line with no backslash when using DOS. The parameters for each script and program are explained in Section 6. The data files that are used in this example are located in a zip file available for downloading (make sure that you preserve the directory structure of the files in the zip file).  The configuration files (and two scripts, “fa.tcl” and “remap_tutorial.tcl”) used in this tutorial have been put in a ZIP file, available here.  You may need to change some information in these files to reflect your directory structure or other information.  The changes that are needed should be clear as the tutorial progresses.  Section 5 describes the format of these files so that you can change them or create them from scratch later on, in order to train on another task or train using different parameters.

If you are familiar with the previous version of this training process, note that there are several differences.  The .vocab file has been replaced by two files, a .lexicon file and a .grammar file.  The format of the .lexicon file is similar to, but slightly different from, the format of the .vocab file, in order to be more consistent with ABNF style.  The .olddesc and .desc files have been replaced with a new format, called a .spec (specification) file.  The use of hscript.exe is no longer necessary.  There are other significant differences as well, but these may not be as noticeable.  If you successfully used the old version, but are having difficulties with the new version, please read the instructions carefully, as there may be subtle changes in the procedure.

[Step 1] In this initialization step, set up the directory structure that you will use. It is recommended that you create one directory for each “project”, where a project contains all of the files created during the training of a network. For this example, we will be using a project directory called \tutorial\digit. Note that some files (vector files in particular) may take up a large amount of disk space; you may want to delete these files after you are finished using them. Now is a good time to make sure that your path contains the locations of the training scripts as well as the stand-alone C programs used for training. To check this, if you type “gen_spec.tcl” in your project directory, you should get the following:

gen_spec.tcl
Usage: gen_spec.tcl <.info file> <.grammar file>
<.lexicon file> <.parts file> <.spec file>
[-start <startToken>]
where <startToken> is the token at which compilation
of the grammar starts; default is '$grammar'.

and if you type “checkvec” in your project directory, you should get the following:

checkvec
give vec file

If you don’t get these responses, contact the person who installed the CSLU Toolkit to find the location of the “script\training_1.0” directory and the “bin” directory within the Toolkit directory hierarchy.  The default locations are C:\Program Files\CSLU\Toolkit\2.0\script\training_1.0 and C:\Program Files\CSLU\Toolkit\2.0\bin.  Modify your “path” environment variable to include the correct paths, as described in Section 1.1.

[Step 2] Create a corpora file, called “corpora.txt“. For this tutorial, the corpora.txt file might look like this (assuming that the tutorial data are stored in \tutorial\data):

type corpora.txt
corpus: numbers
    wav_path    /tutorial/data/speechfiles
    txt_path    /tutorial/data/txtfiles
    phn_path    /tutorial/data/phnfiles
    format      {NU-([0-9]+)\.[A-Za-z0-9_]+}
    wav_ext     wav
    txt_ext     txt
    phn_ext     phn
    cat_ext     cat
    ID:         {regexp $format $filename filematch ID}

The “format” field specifies the format of files in this corpus, using a regular expression.  The parentheses are used in combination with the “ID:” field to determine the speaker ID associated with a file.  (And, in turn, the speaker ID is used to make sure that the three partitions of training, development, and test data are speaker-independent.)  It is probably also a good idea to make sure that your filenames have the same format as specified in the “corpora” file; the format is case sensitive, so NU-78.zipcode.wav is different from nu-78.zipcode.wav.  Also, note that the path names are specified using a forward slash (unix style) instead of a backslash (MS style).
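To see how the “format” and “ID:” fields work together, the same regular expression can be tried by hand in a Tcl shell (tclsh); the filename below is the one mentioned above:

set format {NU-([0-9]+)\.[A-Za-z0-9_]+}
set filename NU-78.zipcode.wav
regexp $format $filename filematch ID
puts $ID
# prints: 78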

[Step 3] Create info files for training, development, and testing. They will be called digit.train.info, digit.dev.info, and digit.test.info. We will only request up to 200 examples per category, so that this tutorial doesn’t take more time than necessary to run through. If one wanted to maximize accuracy, it would be better to use all available examples. To specify all examples, use the keyword ALL instead of 200 in the “want:” field in digit.train.info.

For the digit.train.info file, we are specifying that we want training data from the Numbers corpus, and we will put time-aligned category labels in the numbers_train subdirectory (specified in the “partition:”, “name:”, and “cat_path:” fields). We require the presence of waveform, phonetically-labeled, and text transcription files in order to do this (specified in the “require:” field with “w” to require the waveform, “p” to require the phonetically-labeled files, and “t” to require the word-level text transcription files), and we’ll use 3/5 of available files (specified in the “partition:” field, where the first part {expr $ID % 5} maps the speaker ID to one of five values (0 through 4), and the second part {0 1 2} selects values 0, 1, and 2 for training). We won’t skip over any files (specified in the “filter:” field, where “1+1” takes all files), but we will require that all of the words in the text transcription file are words that we want to recognize (specified in the “lexicon:” field with the lexicon file that contains all of the target words). We will remap the hand-labeled phonetic files (which can have a high degree of variability in the phoneme identities used to represent a word) to a consistent set of phonemes using the remap_tutorial.tcl script (specified in the “remap:” field, which specifies that “remap_tutorial.tcl” will be executed to do this remapping).  In addition, we specify that the sampling frequency of the waveforms is 8000 Hz, and the recognizer will use a 10-msec frame rate (in the “sampling_freq:” and “frame_size:” fields).  The “min_samp:” field has no effect when using only one corpus; this field, and all other fields, are explained in more detail in the description of the info file format.

type digit.train.info
basename:      digit;
partition:     train;
sampling_freq: 8000;
frame_size:    10;
min_samp:      100;

corpus: name:      numbers
cat_path:  numbers_train
require:   wpt
partition: "{expr $ID % 5} {0 1 2}"
filter:    1+1
lexicon:   digit.lexicon
remap:     remap_tutorial.tcl
want:      200;

type digit.dev.info
partition:  dev;
basename:   digit;

corpus: name:      numbers
require:   wt
partition: "{expr $ID % 5} {3}"
filter:    1+1
lexicon:   digit.lexicon ;

type digit.test.info
partition:  test;
basename:   digit;

corpus: name:      numbers
require:   wt
partition: "{expr $ID % 5} {4}"
filter:    1+1
lexicon:   digit.lexicon;
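The “partition:” fields above split speakers across the three sets by speaker ID: the ID modulo 5 gives a value from 0 to 4; values 0, 1, and 2 go to training, 3 to development, and 4 to test. A small sketch of the idea (not the find_files.tcl code):

# Speaker 78 (from NU-78.zipcode.wav): 78 % 5 == 3, so this speaker's
# utterances land in the development set.
set ID 78
set slot [expr {$ID % 5}]
if {[lsearch {0 1 2} $slot] >= 0} {
    puts train
} elseif {$slot == 3} {
    puts dev
} else {
    puts test
}
# prints: dev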

[Step 4] Create a grammar file, called digit.grammar.  This file contains the grammar that the recognizer will use.  In this case, the grammar specifies that a digit is any one of the words “zero”, “oh”, “one”, … “nine”.  It also specifies that the top-level grammar (using the default symbol $grammar) allows an optional “separator” word called “sep*” (which may be pause or “garbage”), followed by one or more repetitions of a digit followed by an optional separator, and finally ending with an optional separator.

type digit.grammar
$digit   = zero | oh | one | two | three | four | five | six |
seven | eight | nine;
$grammar = [sep*%%] ($digit [sep*%%])<+> [sep*%%];


[Step 5] Create a lexicon file, called digit.lexicon. This file contains the target words and their pronunciations.  Here you can see that the “sep*” word has been defined as pause, followed by optional garbage, followed by another pause.  Also, the remapping script will map all occurrences of the phoneme sequence /oU 9r/ (which occurs in the word “four”) to the symbol />r/, because these two phonemes are heavily coarticulated and may be better represented as one phoneme.  Because the “>” is a pre-defined symbol that can be used in the grammar (to specify a repeat operator, among other things), the .lexicon file and .parts file must precede this symbol with a backslash to indicate that it is a phoneme symbol and not a grammar symbol, leading to the symbol “\>r” for the representation of the vowel and final consonant in the word “four”.

type digit.lexicon
zero        = z I 9r oU              ;
oh          = oU                     ;
one         = w ^ n [&]              ;
two         = tc th u                ;
three       = T 9r i:                ;
four        = f \>r                  ;
five        = f aI v                 ;
six         = s I kc kh s            ;
seven       = s E v (I | ^) n [&]    ;
eight       = ei tc [th]             ;
nine        = n aI n [&]             ;

sep*        = .pau [.garbage] .pau   ;

[Step 6] Create a parts file, called digit.parts. This contains the number of parts that each phoneme will be split into, the groupings of phonemes into clusters of similar phonemes, and mappings from one phoneme to another symbol.  In this case, the unvoiced closures /tc/ and /kc/ are mapped to the single symbol /uc/, which we hereby define as a “generic” unvoiced closure.  We then train on the /uc/ symbol, although we specify word pronunciations using /tc/ and /kc/.

type digit.parts
i:        3 ;
I         3 ;
E         3 ;
u         3 ;
^         3 ;
&         2 ;
ei        3 ;
aI        3 ;
oU        3 ;
\>r       3 ;
9r        3 ;
w         2 ;
n         2 ;
T         2 ;
f         2 ;
s         3 ;
v         2 ;
z         3 ;
th        r ;
kh        r ;
uc        1 ;
.pau      1 ;
.garbage  1 ;

$sil   = .pau uc .garbage /BOU /EOU ;
$fnt_l = i: I E ei aI ;
$fnt_r = i: I E ei  ;
$bck_l = u ^ & oU w  ;
$bck_r = u ^ & oU w aI \>r ;
$ret_l = 9r \>r ;
$ret_r = 9r ;
$alv   = n s z th ;
$den   = f T v ;
$vel   = kh ;

map uc = tc kc ;

[Step 7] Run find_files.tcl in order to find files suitable for training. The output is written to digit.train.numbers.files; this filename is constructed from the basename, the partition, and the corpus. The reason that the user doesn’t specify the output filename on the command line is that it is possible, when using several corpora, to create several output files; it seems easier to have the filenames automatically determined than to have the user specify one filename for each corpus.

find_files.tcl digit.train.info corpora.txt
Basename: digit
Partition: train
Corpus: numbers
cat_ext:    cat
txt_ext:    txt
format:    NU-([0-9]+)\.[A-Za-z0-9_]+
lexicon:    digit.lexicon
partition:    {expr $ID % 5} {0 1 2}
cat_path:    numbers_train
txt_path:    W:/digit/tutorial/tutorial/data/txtfiles
id:    regexp $format $filename filematch ID
remap:    remap_tutorial.tcl
phn_ext:    phn
filter:    1+1
wav_ext:    wav
name:    numbers
phn_path:    W:/digit/tutorial/tutorial/data/phnfiles
want:    200
wav_path:    W:/digit/tutorial/tutorial/data/speechfiles
require:    wpt

W:/digit/tutorial/tutorial/data/speechfiles...
W:/digit/tutorial/tutorial/data/speechfiles/0...
W:/digit/tutorial/tutorial/data/speechfiles/1...
W:/digit/tutorial/tutorial/data/speechfiles/10...
W:/digit/tutorial/tutorial/data/speechfiles/100...
W:/digit/tutorial/tutorial/data/speechfiles/101...
(etc)
NU-596.streetaddr.wav
NU-596.zipcode.wav
NU-597.streetaddr.wav
Final count of 552 files for this corpus
Done.

Then, run find_files.tcl a second and third time to find files suitable for development and testing:

find_files.tcl digit.dev.info corpora.txt
find_files.tcl digit.test.info corpora.txt 

[Step 8] Run gen_spec.tcl to determine the context-dependent categories that will be classified by the recognizer. The input files are the info, grammar, lexicon, and parts files. The output file is the spec file; this specification file contains not only the list of the context-dependent categories, but also some other information about the recognizer that we will be creating.

gen_spec.tcl digit.train.info digit.grammar digit.lexicon digit.parts digit.orig.spec
Basename: digit
Partition: train
Corpus: numbers
lexicon:    digit.lexicon
partition:    {expr $ID % 5} {0 1 2}
cat_path:    numbers_train
remap:    remap_tutorial.tcl
filter:    1+1
name:    numbers
want:    200
require:    wpt

There are 22 unique phonemes.
& (2)  <- n
-> /EOU f T uc w oU .pau s z n ei
$alv<& &>$sil &>$den &>$bck_r &>$alv
&>$fnt_r

.pau (1)  <- & .pau v n s {\>r} i: u oU uc th .garbage /BOU
-> /EOU f T uc w oU .pau .garbage s z n ei
<.pau>

(etc.)

{$alv<&} {&>$sil} {&>$den} {&>$bck_r} {&>$alv} {&>$fnt_r} <.pau> {$den<9r} {$fnt_l<9r} <9r> {9r>$fnt_r} {9r>$bck_r} {$alv<E} <E> {E>$den} {$alv<I} {$den<I} <I> {I>$sil} {I>$ret_r} {I>$alv} {$bck_l<T} {$fnt_l<T} {$sil<T} {$alv<T} {$den<T} {$ret_l<T} {T>$ret_r} {$den<\>r} {<\>r>} {\>r>$sil} {\>r>$den} {\>r>$bck_r} {\>r>$alv} {\>r>$fnt_r} {$bck_l<^} {$den<^} <^> {^>$alv} {$den<aI} {$alv<aI} <aI> {aI>$den} {aI>$alv} {$alv<ei} {$sil<ei} {$bck_l<ei} {$den<ei} {$fnt_l<ei} {$ret_l<ei} <ei> {ei>$sil} {$bck_l<f} {$ret_l<f} {$sil<f} {$alv<f} {$den<f} {$fnt_l<f} {f>$bck_r} {$ret_l<i:} <i:> {i:>$sil} {i:>$den} {i:>$bck_r} {i:>$alv} {i:>$fnt_r} {kh>$alv} {$bck_l<n} {$fnt_l<n} {$sil<n} {$den<n} {$alv<n} {$ret_l<n} {n>$sil} {n>$den} {n>$bck_r} {n>$alv} {n>$fnt_r} {$bck_l<oU} {$alv<oU} {$sil<oU} {$den<oU} {$fnt_l<oU} {$ret_l<oU} <oU> {oU>$sil} {oU>$den} {oU>$bck_r} {oU>$alv} {oU>$fnt_r} {$vel<s} {$bck_l<s} {$alv<s} {$sil<s} {$ret_l<s} {$fnt_l<s} {$den<s} <s> {s>$sil} {s>$den} {s>$bck_r} {s>$fnt_r} {s>$alv} {th>$sil} {th>$den} {th>$bck_r} {th>$alv} {th>$fnt_r} {$alv<u} <u> {u>$sil} {u>$den} {u>$bck_r} {u>$alv} {u>$fnt_r} <uc> {$fnt_l<v} {v>$sil} {v>$den} {v>$bck_r} {v>$alv} {v>$fnt_r} {$bck_l<w} {$alv<w} {$sil<w} {$den<w} {$fnt_l<w} {$ret_l<w} {w>$bck_r} {$bck_l<z} {$sil<z} {$alv<z} {$den<z} {$fnt_l<z} {$ret_l<z} <z> {z>$fnt_r} <.garbage>

[Step 9] Run gen_catfiles.tcl to take the list of files for training (digit.train.numbers.files) and create time-aligned labels of categories to train on. The input file (other than digit.train.info and corpora.txt) is digit.train.numbers.files. If specified in digit.train.info, the script in the “remap:” field will be used, or the script in the “force_cat:” or “force_phn:” fields will be used (in this case, we haven’t specified the “force_cat:” or “force_phn:” fields because we are not yet doing forced alignment). The category label files that are created are stored in the directory that is specified in digit.train.info in the “cat_path:” field.  The gen_catfiles.tcl script also creates two other output files: the “dur” file and the “counts” file. The dur file contains minimum and maximum duration limits for each category, as determined from the category label files; the counts file lists the number of occurrences (and total time in msec) of each category.

gen_catfiles.tcl digit.train.info digit.parts digit.orig.spec corpora.txt digit.train.dur digit.train.counts
Basename: digit
Partition: train
Corpus: numbers
cat_ext:    cat
txt_ext:    txt
format:    NU-([0-9]+)\.[A-Za-z0-9_]+
lexicon:    digit.lexicon
partition:    {expr $ID % 5} {0 1 2}
cat_path:    numbers_train
txt_path:    W:/digit/tutorial/tutorial/data/txtfiles
id:    regexp $format $filename filematch ID
remap:    remap_tutorial.tcl
phn_ext:    phn
filter:    1+1
wav_ext:    wav
name:    numbers
phn_path:    W:/digit/tutorial/tutorial/data/phnfiles
want:    200
wav_path:    W:/digit/tutorial/tutorial/data/speechfiles
require:    wpt

READING digit.train.numbers.files
Created file NU-25.zipcode.cat
Created file NU-30.zipcode.cat
Created file NU-46.streetaddr.cat
Created file NU-47.zipcode.cat
Created file NU-51.other2.cat
Created file NU-51.other3.cat
Created file NU-51.zipcode.cat
(etc)
Created file NU-596.streetaddr.cat
Created file NU-596.zipcode.cat
Created file NU-597.streetaddr.cat
Sorting durations... taking lowest 2% and top 100% of durations
Done.

This script may generate messages such as “Merging 996 1094 .glot with right (oU)” or “** Warning: phoneme '.tc' not known to recognizer”. These are simply messages to the user that some labels are being merged, deleted, or ignored when converting from hand labels to categories. These messages come from the remapping script, in this case remap_tutorial.tcl. No action needs to be taken by the user. At the end, for each category, the duration that is at the bottom 2nd percentile of all durations for that category is written to the dur file as the minimum duration, and the longest duration of the category is written to the dur file as the maximum duration. These limits help the Viterbi search refrain from inserting very short or very long words during recognition.

[Step 10] Run revise_spec.tcl to make sure that we have enough examples of each category to train on, and to add duration limits to the spec file. If there are not enough examples of a category, this script allows us to tie these categories to categories with more examples. This is the only interactive script in the entire training and recognition process. The input files are the input spec file, the counts file, and the dur file. The output of this script is a new spec file that contains category tie information and duration limits information.

revise_spec.tcl digit.orig.spec digit.train.dur digit.train.counts digit.train.spec
minimum number of occurrences per category: 5
Available Categories:
$alv<&      &>$sil      <.pau>      $den<9r     $fnt_l<9r   <9r>
9r>$fnt_r   9r>$bck_r   $alv<E      <E>         E>$den      $alv<I
$den<I      <I>         I>$sil      I>$ret_r    I>$alv      $bck_l<T
$fnt_l<T    $sil<T      $alv<T      $den<T      $ret_l<T    T>$ret_r
$den<\>r    <\>r>       \>r>$sil    \>r>$den    \>r>$bck_r  \>r>$alv
\>r>$fnt_r  $bck_l<^    $den<^      <^>         ^>$alv      $den<aI
$alv<aI     <aI>        aI>$den     aI>$alv     $alv<ei     $sil<ei
$bck_l<ei   $fnt_l<ei   $ret_l<ei   <ei>        ei>$sil     $bck_l<f
$ret_l<f    $sil<f      $alv<f      $den<f      $fnt_l<f    f>$bck_r
$ret_l<i:   <i:>        i:>$sil     i:>$den     i:>$bck_r   i:>$alv
kh>$alv     $bck_l<n    $fnt_l<n    $sil<n      $den<n      $alv<n
$ret_l<n    n>$sil      n>$den      n>$bck_r    n>$alv      n>$fnt_r
$bck_l<oU   $alv<oU     $sil<oU     $den<oU     $fnt_l<oU   $ret_l<oU
<oU>        oU>$sil     oU>$den     oU>$bck_r   oU>$alv     oU>$fnt_r
$vel<s      $bck_l<s    $alv<s      $sil<s      $ret_l<s    $fnt_l<s
$den<s      <s>         s>$sil      s>$den      s>$bck_r    s>$fnt_r
s>$alv      th>$sil     th>$den     th>$bck_r   th>$alv     $alv<u
<u>         u>$sil      u>$den      u>$bck_r    u>$alv      u>$fnt_r
<uc>        $fnt_l<v    v>$sil      v>$den      v>$bck_r    v>$alv
v>$fnt_r    $bck_l<w    $alv<w      $sil<w      $den<w      $fnt_l<w
w>$bck_r    $bck_l<z    $sil<z      $alv<z      $fnt_l<z    $ret_l<z
<z>         z>$fnt_r
Tie '&>$den'      (  0 examples) to: &>$sil
Tie '&>$bck_r'    (  0 examples) to: &>$sil
Tie '&>$alv'      (  0 examples) to: &>$sil
Tie '&>$fnt_r'    (  0 examples) to: &>$sil
Tie '$den<ei'     (  4 examples) to: $alv<ei
Tie 'i:>$fnt_r'   (  4 examples) to: i:>$bck_r
Tie 'th>$fnt_r'   (  3 examples) to: th>$bck_r
Tie '$ret_l<w'    (  4 examples) to: $bck_l<w
Tie '$den<z'      (  3 examples) to: $alv<z

In this case, we have tied the schwa vowel in the context of various subsequent phonemes to the schwa in the context of following silence; /ei/ in the context of a preceding dental to /ei/ in the context of a preceding alveolar; and other changes.  In general, if in doubt and a context-independent category exists (e.g. <ei>), then it is acceptable to tie to this context-independent category.  Because these are categories that are infrequent in the training data, it is also not very likely for these categories to be used in recognition, and so the recognizer should not be very sensitive to which tie categories are selected.

[Step 11] Run pick_examples.tcl to select frames for training, from the files created by gen_catfiles.tcl. The input files are digit.train.info, corpora.txt, digit.train.spec, and digit.train.numbers.files.  The output of this script is the file digit.train.examples, which contains an ASCII list of files, the frames to be used in each file, and the categories corresponding to these frames.

pick_examples.tcl digit.train.info corpora.txt digit.train.spec digit.train.examples
Basename: digit
Partition: train
Corpus: numbers
cat_ext:    cat
txt_ext:    txt
format:    NU-([0-9]+)\.[A-Za-z0-9_]+
lexicon:    digit.lexicon
partition:    {expr $ID % 5} {0 1 2}
cat_path:    numbers_train
txt_path:    W:/digit/tutorial/tutorial/data/txtfiles
id:    regexp $format $filename filematch ID
remap:    remap_tutorial.tcl
phn_ext:    phn
filter:    1+1
wav_ext:    wav
name:    numbers
phn_path:    W:/digit/tutorial/tutorial/data/phnfiles
want:    200
wav_path:    W:/digit/tutorial/tutorial/data/speechfiles
require:    wpt

digit.train.numbers.files, want=200, min_want=100

--------- numbers --------
$alv<&         122
&>$sil         116
<.pau>         200
$den<9r        200
$fnt_l<9r      200
<9r>           200
(etc.)
$alv<z         151
$fnt_l<z        26
$ret_l<z        34
<z>            200
z>$fnt_r       200

Selected from a total of 552 files
Randomizing order of frames...
Writing output file...
Done

[Step 12] Run gen_examples.tcl to compute acoustic features for all of the frames given in digit.train.examples. The input files are digit.train.info, digit.train.spec, and digit.train.examples. The computed features and the associated target category values are stored in the binary output file digit.train.vec. Note that if you want to use features that are different from the standard features, you can write the Tcl code used to create the new features; the location of your code can be specified in the info file using the “featuresURI” and “contextURI” fields.  Also, the description of the format of the vector file given in Section 5 may be of interest.

gen_examples.tcl digit.train.info digit.train.spec digit.train.examples digit.train.vec
Basename: digit
Partition: train
Corpus: numbers
lexicon:    digit.lexicon
partition:    {expr $ID % 5} {0 1 2}
cat_path:    numbers_train
remap:    remap_tutorial.tcl
filter:    1+1
name:    numbers
want:    200
require:    wpt

Sampling freq: 8000
Frame size:    10
0: W:/digit/tutorial/tutorial/data/speechfiles/16/NU-1641.other1.wav
1: W:/digit/tutorial/tutorial/data/speechfiles/10/NU-1035.other1.wav
2: W:/digit/tutorial/tutorial/data/speechfiles/17/NU-1792.zipcode.wav
3: W:/digit/tutorial/tutorial/data/speechfiles/5/NU-591.streetaddr.wav
4: W:/digit/tutorial/tutorial/data/speechfiles/2/NU-237.streetaddr.wav
5: W:/digit/tutorial/tutorial/data/speechfiles/11/NU-1116.streetaddr.wav

(etc)

[Step 13] Run checkvec.exe to make sure that the vector file that has been created has the correct format, and that every category has at least one example to train on. The numbers in the left column are the values corresponding to each category (from 1 to the total number of categories), and the numbers in the right column are the number of examples for each category. The input file is digit.train.vec; the only output goes to the screen for the user to check, but it may be redirected to a file by adding “> checkvec_output.txt” to the end of the DOS command.

checkvec.exe digit.train.vec
  1:    122
2:    116
3:    199
4:    200
5:    200
6:    200

(etc)
123:    200
124:    151
125:     26
126:     34
127:    200
128:    200

20703 vectors with 130 features

[Step 14] Run nntrain.exe to train the neural network on the vector file digit.train.vec. This program creates a weights file at each iteration; we will select the best weights file after training for 30 iterations. The -l option indicates that the negative penalty will be adjusted to compensate for varying numbers of examples per category; -sn 88 and -sv 88 are random-number seeds; -f digitnet specifies the basename “digitnet” for the output files; -a 3 130 200 128 specifies the architecture of the net: 3 layers, with 130 nodes in the first layer, 200 nodes in the hidden layer, and 128 nodes in the output layer. The value 30 specifies training for 30 iterations, and the last parameter is the vector file to use for training.  Output files will, in this case, be called digitnet.0, digitnet.1, digitnet.2, … , digitnet.30, with one output file for each iteration.

nntrain -l -sn 88 -sv 88 -f digitnet -a 3 130 200 128 30 digit.train.vec
creating net with seed 88
negpen 0 is 0.752830
negpen 1 is 0.715597
negpen 2 is 1.000000
negpen 3 is 1.000000
negpen 4 is 1.000000
negpen 5 is 1.000000
 (etc)
negpen 125 is 0.208912
negpen 126 is 1.000000
negpen 127 is 1.000000
3 layers: 131 200 128
learning rate 0.050000
momentum 0.000000
negative weight 1.000000
training file digit.train.vec
numvec: 20703; tau: 103515.000000
vectors chosen in 1 blocks of 20703 with seed 88
time:8 learn_rate 0.041667; total error is 74753.695313
time:9 learn_rate 0.035714; total error is 56642.347656
time:8 learn_rate 0.031250; total error is 50545.117188
time:9 learn_rate 0.027778; total error is 46603.933594
time:9 learn_rate 0.025000; total error is 43362.011719
time:8 learn_rate 0.022727; total error is 40807.339844
time:9 learn_rate 0.020833; total error is 38606.566406
time:9 learn_rate 0.019231; total error is 36725.898438
 (etc)

Notes: For specifying the architecture, note that the number of nodes in the first layer will always be 130 for the standard feature set. The number of hidden nodes is decided by the user, but 200 is a reasonable number. The number of output nodes (128 in this case) must match the largest value in the left column of the output of checkvec.exe.  The number 128 used in this example may change, depending on the number of states that have been tied and the information in the .grammar, .lexicon, and .parts files.

The only input file to nntrain.exe is the vector file; the output files are the neural-network weights files for each iteration (the default names are nnet.X, where X is an integer from 0 to the number of iterations).

[Step 15] Run select_best.tcl to evaluate the performance of each iteration (weight file) on the development-set data. This script may take a long time, especially if there are many files in the development set. This script calls two other scripts, “asr_multi.tcl” and “eval_results.tcl”, which are located in the same directory as select_best.tcl, namely ...\CSLU\Toolkit\2.0\script\training_1.0.  The input files are the neural-network files created by nntrain.exe, the digit.dev.numbers.files file and all waveform and text files specified in digit.dev.numbers.files, the grammar file, the lexicon file, and the spec file. The output files are ali files (with basename “wrdalign_digitnet” in this example) and a summary file.  The summary file shows the performance on each iteration (the WrdAcc% column shows word-level accuracy, and the SntCorr column shows the percentage of “sentences” (entire digit sequences in this case) correctly recognized), as well as the resulting best iteration.

select_best.tcl digitnet digit.dev.numbers.files digit.grammar digit.lexicon \
digit.train.spec digit.dev.summary -g 10 -b 15

Beginning at iteration 15, stopping after iteration 30
Evaluating every 1 iterations
Garbage value is 10
Basename for the .ali files is wrdalign_digitnet
Summary file is digit.dev.summary
Starting Iteration 30...
Starting Iteration 29...
Starting Iteration 28...
Starting Iteration 27...
Starting Iteration 26...
Starting Iteration 25...

(etc)
Starting Iteration 18...
Starting Iteration 17...
Starting Iteration 16...
Starting Iteration 15...
Itr #Snt  #Words  Sub%    Ins%    Del%   WrdAcc% SntCorr
30  100    444   3.83%   1.13%   0.90%  94.14%  78.00%
29  100    444   3.38%   0.90%   1.13%  94.59%  79.00%
28  100    444   3.83%   0.90%   0.68%  94.59%  79.00%
27  100    444   3.60%   0.90%   0.68%  94.82%  79.00%
26  100    444   3.83%   0.90%   1.13%  94.14%  78.00%
25  100    444   3.60%   0.90%   1.13%  94.37%  78.00%
24  100    444   3.60%   0.90%   0.90%  94.59%  79.00%
23  100    444   3.38%   0.90%   1.13%  94.59%  79.00%
22  100    444   4.05%   0.90%   1.35%  93.69%  77.00%
21  100    444   3.38%   0.90%   1.13%  94.59%  79.00%
20  100    444   3.83%   0.90%   0.90%  94.37%  78.00%
19  100    444   3.83%   1.13%   1.13%  93.92%  78.00%
18  100    444   4.28%   1.13%   0.90%  93.69%  78.00%
17  100    444   2.93%   0.68%   1.13%  95.27%  81.00%
16  100    444   3.60%   1.13%   1.13%  94.14%  80.00%
15  100    444   3.15%   0.90%   1.35%  94.59%  79.00%
Best results (95.27, 81.00) with network digitnet.17
Evaluated 16 networks

Note that training on only 200 examples per category has a negative influence on results; when trained using all available examples, instead of 200 per category (using the keyword “ALL” instead of 200 in digit.train.info), results on the same development data were 97.07% word accuracy and 88.00% sentence accuracy. The drawback to training on all examples is that pick_examples.tcl, gen_examples.tcl, and especially nntrain.exe take longer to execute.

[Step 16] Now we have finished the first cycle of training. If we are happy with the level of performance on the development set, we can stop the training process and evaluate on the test set (Step 19).  If we want to try to improve performance on the development set, we can do another cycle of training using force-aligned data. We can create another .info file for doing forced alignment, using the training file as a template. This new file will be called digit.trainfa.info:

copy digit.train.info digit.trainfa.info
edit digit.trainfa.info
type digit.trainfa.info
basename:      digit;
partition:     trainfa;
sampling_freq: 8000;
frame_size:    10;
min_samp:      100;

corpus: name:      numbers
cat_path:  numbers_trainfa
require:   wt
partition: "{expr $ID % 5} {0 1 2}"
filter:    1+1
lexicon:   digit.lexicon
force_cat: "fa.tcl digitnet.17 digit.train.spec digit.lexicon
WAV TXT c OUT"
want:      200;

(Note that in the “force_cat:” field, the script and associated parameters are specified on two lines. No special marker (such as a backslash) is required.)

We have changed the partition name (to “trainfa”) and the path for category files (to “numbers_trainfa”). Also, by specifying “require: wt“, we will now require the existence of .wav files and .txt files but not .phn files (because we will create time-aligned phonetic labels from the text transcriptions using the lexicon file and forced alignment). We also add a new field to the corpus description, indicating that we want to do forced alignment and create labels at the “category” level (as opposed to the phoneme level). Also note that this line specifies using iteration 17 from the training we just finished, since iteration 17 had the best word-level performance. Because we are doing forced alignment, it is no longer necessary to use the remapping script that re-maps labels created by hand to the set of labels used by our recognizer.

[Step 17] Now we once again find the files we want to use for training by running find_files.tcl, and then we generate category-level time-aligned labels by running gen_catfiles.tcl. As part of the process of creating category-level label files, we also automatically create new dur and counts files. Finally, we create a new spec file with the new information in the dur and counts files using revise_spec.tcl.  In this case, we don’t tie categories with 3 or more occurrences, more to demonstrate the options available than for any theoretically sound reason.

find_files.tcl digit.trainfa.info corpora.txt
gen_catfiles.tcl digit.trainfa.info digit.parts digit.train.spec \
corpora.txt digit.trainfa.dur digit.trainfa.counts
revise_spec.tcl digit.orig.spec digit.trainfa.dur digit.trainfa.counts \
digit.trainfa.spec -min 3

[Step 18] Then, we repeat the training steps to train and select the best force-aligned network:

pick_examples.tcl digit.trainfa.info corpora.txt \
digit.trainfa.spec digit.trainfa.examples
gen_examples.tcl digit.trainfa.info digit.trainfa.spec \
digit.trainfa.examples digit.trainfa.vec
checkvec.exe digit.trainfa.vec
nntrain.exe -l -sn 88 -sv 88 -f digitfanet -a 3 130 200 134 30 digit.trainfa.vec
select_best.tcl digitfanet digit.dev.numbers.files digit.grammar digit.lexicon \
digit.trainfa.spec digit.devfa.summary -g 10 -b 15

Note that the neural network weights files have the basename “digitfanet”. The results from select_best are:

Itr #Snt  #Words  Sub%    Ins%    Del%   WrdAcc% SntCorr
30  100    444   2.70%   1.35%   0.45%  95.50%  85.00%
29  100    444   2.93%   1.35%   0.45%  95.27%  84.00%
28  100    444   3.38%   1.13%   0.23%  95.27%  84.00%
27  100    444   2.93%   1.35%   0.45%  95.27%  84.00%
26  100    444   3.15%   0.90%   0.23%  95.72%  85.00%
25  100    444   3.15%   0.90%   0.23%  95.72%  85.00%
24  100    444   3.38%   0.68%   0.45%  95.50%  85.00%
23  100    444   3.15%   1.13%   0.23%  95.50%  85.00%
22  100    444   3.38%   0.90%   0.23%  95.50%  85.00%
21  100    444   3.60%   1.35%   0.23%  94.82%  81.00%
20  100    444   3.38%   1.35%   0.23%  95.05%  82.00%
19  100    444   3.15%   1.13%   0.23%  95.50%  84.00%
18  100    444   3.15%   1.13%   0.23%  95.50%  83.00%
17  100    444   3.15%   0.68%   0.23%  95.95%  84.00%
16  100    444   3.83%   0.90%   0.23%  95.05%  81.00%
15  100    444   3.38%   0.90%   0.45%  95.27%  83.00%
Best results (95.95, 84.00) with network digitfanet.17

With the development set, it is acceptable to vary the system parameters to try to maximize performance.  For example, the insertion rate is slightly higher than the deletion rate in this example.  So, performance might improve by reducing the value of the garbage parameter from 10 to, say, 7, so that the insertion rate will decrease.  (This will often cause the deletion rate to increase… the objective here is to get the lowest combined error rate, which often occurs when the insertion and deletion error rates are nearly equal.)  If we try it:

select_best.tcl digitfanet digit.dev.numbers.files digit.grammar \
digit.lexicon digit.trainfa.spec digit.devfa.summary -g 7 -b 15

we see that the best word-level accuracy of 95.27% is slightly worse than the accuracy of 95.95% with garbage value of 10.  So, we leave the garbage value set at 10.

[Step 19] The resulting network, digitfanet.17, is the final network. The last step is to evaluate this network on the test set.  Here we run select_best.tcl again, but only evaluate on network iteration 17 using the “-o 17” option.

select_best.tcl digitfanet digit.test.numbers.files digit.grammar digit.lexicon \
digit.trainfa.spec digit.test.summary -g 10 -o 17

In this case, the output is

Itr #Snt  #Words  Sub%    Ins%    Del%   WrdAcc% SntCorr
17   19    111   0.00%   3.60%   0.00%  96.40%  78.95%
Best results (96.40, 78.95) with network digitfanet.17

Here, the final result of 96.40% word accuracy is slightly better than the word accuracy on the development set, and the sentence-level accuracy of 78.95% is slightly worse than the sentence-level accuracy on the development set.  Usually, the performance on the test set is slightly worse than the performance on the development set for both word-level and sentence-level accuracy, because the development-set performance is the maximum performance over a number of iterations and/or garbage values, while test-set performance reflects a single evaluation meant to indicate performance that can be expected of a final system on unseen data.

5. File Formats

In the following file formats, text in fixed-font bold is a keyword that must be used verbatim. Italicized items in brackets <> must be substituted with the proper values.
wav file
A wav file contains the speech waveform that is to be trained on or recognized. The format for wav files may be either Microsoft .wav format or NIST Sphere ulaw format.
txt file
A txt file contains a text transcription of the words in a speech waveform. This file is simply an ASCII file containing the words separated by spaces, and it can be created by any text editor that outputs ordinary .txt files.
label files (.phn, .cat, .wrd)
Label files, which usually have the extension .phn, .cat, or .wrd, contain time-aligned labels of a waveform utterance. If the file has the extension .phn, then the labels are phonetic labels; if the file has the .cat extension, then the labels are neural-network output categories (usually context-dependent sub-phonetic units); and if the file has the extension .wrd, then the labels are words. A label file has the following format:

MillisecondsPerFrame: <value>
END OF HEADER
<begin_time_1> <end_time_1> <label_1>
<begin_time_2> <end_time_2> <label_2>

<begin_time_n> <end_time_n> <label_n>

where:

<value>
is the number of milliseconds in one frame of speech (usually this value is 1.0).
<begin_time>
is the time at which <label> starts
<end_time>
is the time at which <label> ends
<label>
is the word, phone, or category label for the segment of speech

The values for <begin_time> and <end_time> are measured in frames (so if <value> is 1.0, then time is measured in milliseconds; if <value> is 10.0, then time is measured in centi-seconds). The <end_time> of one label is usually the same as the <begin_time> of the next label.
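
As an illustration, a hypothetical .phn file for a short utterance of the word “one” might look like the following (the times and labels here are invented purely for illustration):

MillisecondsPerFrame: 1.0
END OF HEADER
0 110 .pau
110 180 w
180 260 ^
260 370 n
370 450 .pau

Here time is measured in milliseconds, so the phoneme /w/, for example, begins at 110 msec and ends at 180 msec.
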
corpora file
The corpora file contains descriptions of all corpora:

<corpus1 description>
<corpus2 description>
<corpus3 description>

where a <corpus description> has the following format:

corpus: <corpus_name>
    wav_path   <path_to_wav_files>
    phn_path   <path_to_phn_files>
    txt_path   <path_to_txt_files>
    format     <regular_expression_for_parsing_filenames>
    wav_ext    <extension_for_wav_files>
    phn_ext    <extension_for_phn_files>
    txt_ext    <extension_for_txt_files>
    cat_ext    <extension_for_cat_files>
    ID:        <Tcl_code_for_determining_caller_ID>

where:

<corpus_name>
is a name used to describe the corpus. The format for <corpus_name> is the same as for any Tcl variable name.
<path_to_wav_files>
is the full path to the directory containing waveform files. It is assumed that in this directory will be sub-directories, and that the actual files will be in these sub-directories.
<path_to_phn_files>
is the full path to the directory containing time-aligned phonetic label files. It is assumed that in this directory will be sub-directories, and that the actual files will be in these sub-directories.
<path_to_txt_files>
is the full path to the directory containing text transcription files. It is assumed that in this directory will be sub-directories, and that the actual files will be in these sub-directories.
<regular_expression_for_parsing_filenames>
is a regular expression, enclosed in curly braces {}, that will succeed when used to parse the base name of a file that belongs in the corpus. It can also be used to extract the call number from the filename, for use in determining the caller ID.
<extension_for_wav_files>
is the filename extension for waveform files. Usually, the value is “wav”.
<extension_for_phn_files>
is the filename extension for time-aligned phonetic label files. Usually, the value is “phn”.
<extension_for_txt_files>
is the filename extension for text transcription files (without time alignment). Usually, the value is “txt”.
<extension_for_cat_files>
is the filename extension for time-aligned category files, where the categories correspond to outputs of the neural network (such as $nas<E). These files are usually generated automatically.
<location_of_cull_file>
is the full path and filename for the cull file, if one exists for this corpus.
<Tcl_code_for_determining_caller_ID>
is Tcl code, enclosed in curly braces {}, for determining the caller ID. In order to make this possible, this code can reference two variables: $format, which is the regular expression given above, and $filename, which is the base name of a waveform file. This code must store the result in the variable “ID”.
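
Using the values echoed by pick_examples.tcl in Section 4, an entry in corpora.txt for the numbers corpus might look roughly like the following (the actual corpora file used in the tutorial is not reproduced here, so treat this as an illustration):

corpus: numbers
    wav_path   W:/digit/tutorial/tutorial/data/speechfiles
    phn_path   W:/digit/tutorial/tutorial/data/phnfiles
    txt_path   W:/digit/tutorial/tutorial/data/txtfiles
    format     {NU-([0-9]+)\.[A-Za-z0-9_]+}
    wav_ext    wav
    phn_ext    phn
    txt_ext    txt
    cat_ext    cat
    ID:        {regexp $format $filename filematch ID}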

info file
An info file has the following format:

basename:       <base_name> ;
partition:      <partition_name> ;
min_samp:       <minimum_number_of_examples> ;
frame_size:     <frame_size> ;
sampling_freq:  <sampling_frequency> ;
featuresURI:    <URI_for_feature_code> ;
contextURI:     <URI_for_context_code> ;

corpus:      <corpus1 information>;
corpus:      <corpus2 information>;
corpus:      <corpus3 information>;

The number of corpora specified in an info file is theoretically unlimited, but there must be at least one. Note that each field has a semicolon at the end.

<base_name>
is the basename (filename without path or extension information) of a number of files associated with the recognizer, such as the “.files” files. Typically, this name will indicate the task being trained on (“digits”, for example).
<partition_name>
is a description of how the data will be used. Typical partition names are “train”, “dev”, “trainfa” (for training with forced alignment), and “test”.
<minimum_number_of_examples>
is the minimum number of examples (or vectors) that are requested for each category. This is only meaningful when more than one corpus is being used, because if there is only one corpus, then the scripts will automatically try to find the desired number of examples. If there is more than one corpus, the scripts will try to obtain at least the minimum number that is specified. If only one corpus is being used, this field may be omitted.
<frame_size>
is the frame size that the recognizer will use.  The default is 10 msec.
<sampling_frequency>
is the sampling frequency that the recognizer will use.  The default is 8000 Hz. 
<URI_for_feature_code>
A “URI”, or Uniform Resource Identifier, is a universal way of specifying the location of something, in this case the location of a .tcl file and procedure that computes acoustic features.  The current implementation, however, only supports local files, and so the URI is simply the path to the .tcl file, a ‘#’, and the name of the procedure.  This procedure should contain code for computing acoustic features.  This procedure must take the following three arguments: the input waveform, the output features, and the desired sampling frequency.
<URI_for_context_code>
A “URI”, or Uniform Resource Identifier, is a universal way of specifying the location of something, in this case the location of a .tcl file and procedure that selects a context window of features.  The current implementation, however, only supports local files, and so the URI is simply the path to the .tcl file, a ‘#’, and the name of the procedure.  This procedure should contain code for selecting a context window.  This procedure must take the following four arguments: the input waveform, the input acoustic features at each frame, the desired sampling frequency, and the output context window of features.  If the code that computes features already takes a context window, or if the context window consists of only the center frame, then this procedure must create a copy of the input features and return this as the context-window feature output.
<corpus information>
contains information about how to use files in the particular corpus. This field has the following format:
name:      <corpus_name>
cat_path:  <path_to_cat_files>
require:   <type_of_files_that_are_required>
wav_list:  <list_of_wav_files_to_use>
phn_list:  <list_of_phn_files_to_use>
txt_list:  <list_of_txt_files_to_use>
partition: <Tcl_code_and_list>
filter:    <filter_for_skipping_files>
lexicon:   <lexicon_file>
force_phn: <script_for_forced_alignment_at_phonetic_level>
force_cat: <script_for_forced_alignment_at_category_level>
remap:     <script_for_remapping_hand_labels>
want:       <desired_number_of_examples_per_category>
where all fields are optional except for “name:” and “partition:“.
<corpus_name>
is the name of the corpus to be trained on. This name must match one of the corpus names in the corpora file.
<path_to_cat_files>
is the full path to the directory where time-aligned categories will be stored. This directory will be created during the training process.
<type_of_files_that_are_required>
specifies which kind of files must exist in order for that utterance to be used. The possible kinds of files are waveform, phonetic, text, and category files. To specify that the waveform file must exist, the <type_of_files_that_are_required> parameter must contain the letter “w”; to specify that phonetic files are required, this parameter must contain the letter “p”. To specify that text files are required, this parameter must contain the letter “t”, and to specify that category files are required, this parameter must contain the letter “c”. Typical values for training are “wp” (when using hand-labeled time-aligned phonetic data) and “wt” (when using forced alignment for training). The typical value for the development and test partitions is “wt”.
<list_of_wav_files_to_use>
is a filename that contains a list of waveform files to be used, instead of searching through the corpus directory and applying the filter. This field is usually used when searching through the corpus directory would be time-consuming or difficult.
<list_of_phn_files_to_use>
is a filename that contains a list of phonetic files to be used, instead of searching through the corpus directory and applying the filter.
<list_of_txt_files_to_use>
is a filename that contains a list of text transcription files to be used, instead of searching through the corpus directory and applying the filter.
<Tcl_code_and_list>
contains, in quotes, Tcl code in curly braces for generating a number or string based on a file’s ID, and a list in curly braces such that if the generated number or string is in this list, the file is in the specified (training, dev, test) partition. At CSLU, we use the following:
    for training:  “{expr $ID % 5} {0 1 2}”
    for fa:          “{expr $ID % 5} {0 1 2}”
    for fb:          “{expr $ID % 5} {0 1 2}”
    for dev:       “{expr $ID % 5} {3}”
    for test:       “{expr $ID % 5} {4}”
The determination is therefore based on the file ID mod 5: if this is 0, 1, or 2, the file is used for training, forced alignment, and forward-backward training; if it is 3, the file is used for development; and if it is 4, the file is used for testing. For example, the file NU-1641.other1.wav used in this tutorial has ID 1641, and 1641 mod 5 is 1, so this file falls in the training (and fa and fb) partitions.
<filter_for_skipping_files>
is a specification of how many files to skip. The format for this field is “X+Y”, where X is the Nth file to use, and Y is the offset from the beginning. For example, “1+1” is the specification to use all files, “3+1” is the specification to use every third file, starting with the first file, and “10+4” is the specification to use every tenth file, starting with the fourth file.
<lexicon_file>
is the name of the lexicon file for this task. If this field is specified, then a filter will be applied to make sure that all files that are selected contain only the words in the lexicon file.  This will be accomplished by reading the .txt files associated with each .wav file.
<script_for_forced_alignment_at_phonetic_level>
is the name of the script and its arguments for doing forced alignment and generating phonetic-level results. In the argument list, the keyword “TXT” will be substituted with the text being aligned, “WAV” will be substituted with the name of the waveform file, and “OUT” will be substituted with the name of the phonetic-level output file. A general-purpose program can be used, in the following way:
    “fa.tcl <nnet> <spec> <lexicon> WAV TXT p OUT”
where <nnet> is the neural network to use, <spec> is the specifications file associated with the neural network, and <lexicon> is the lexicon file.
<script_for_forced_alignment_at_category_level>
is the name of the script and its arguments for doing forced alignment and generating category-level results. In the argument list, the keyword “TXT” will be substituted with the text being aligned, “WAV” will be substituted with the name of the waveform file, and “OUT” will be substituted with the name of the category-level output file. A general-purpose program can be used, in the following way:
     “fa.tcl <nnet> <spec> <lexicon> WAV TXT c OUT”
where <nnet> is the neural network to use, <spec> is the specifications file associated with the neural network, and <lexicon> is the lexicon file.
<script_for_remapping_hand_labels>
is the name of a script for automatically adjusting hand-labeled time-aligned phonetic labels and remapping some phone symbols. There are several remapping scripts currently available, or you can create your own. The scripts available are:
    for general-purpose use: remap_genpur.tcl
    for the digits corpus: remap_digits.tcl
    for the alphadigit corpus: remap_ad.tcl
    for the spell-spoken corpus: remap_spellspoken.tcl
Only the name of the script is necessary; it is assumed that the two arguments are <input_phonetic_file> and <output_phonetic_file>.
<desired_number_of_examples_per_category>
is the number of examples per output category that we want from the specified corpus. Typical values range from 500 to 2000, but the keyword “ALL” will be substituted with a very large number. Also, if the corpus is being used only to find examples that are infrequent in our main corpus, then the value of 0 will result in examples being selected from this corpus only when the previous corpora have less than <minimum_number_of_examples> examples.

grammar file
The grammar file has the format specified in the Statenet documentation, specifically the grammar format documentation.

lexicon file
The lexicon file has the following format:

oneLevelExpansion 1;
<word>    = <pronunciation symbols> ;
<word>    = <pronunciation symbols> ;
<word>    = <pronunciation symbols> ;

<word>    = <pronunciation symbols> ;

where <word> is a word that will be recognized and <pronunciation symbols> are the symbols used to represent how the word is pronounced. In this tutorial, these symbols are Worldbet symbols, although any symbol set can be used. The keywords “oneLevelExpansion 1;” ensure that, if a word and a phoneme have the same symbol (e.g. the word “I” and the phoneme /I/), the expansion of a word into its string of phonetic symbols happens only once.  If these keywords are not used, there is the risk that the expansion will occur multiple times, and the following lexicon

I       = aI ;
am      = @ m ;
sitting = s I tc th I N ;

will yield the following pronunciations:

    word “I” has pronunciation /aI/
    word “am” has pronunciation /@ m/
    word “sitting” has pronunciation /s aI tc th aI N/

parts file
The parts file indicates how many parts each phoneme should be split into (1, 2, 3, or “r”), as well as the grouping of phonemes for broad-category phonetic clusters and mappings of one or more phonemes onto different symbols. The format of this file is:

<phoneme>    <number_of_parts> ;
<phoneme>    <number_of_parts> ;

<phoneme>    <number_of_parts> ;
$<cluster> = <phoneme> <phoneme> <phoneme> … <phoneme> ;
$<cluster> = <phoneme> <phoneme> <phoneme> … <phoneme> ;

$<cluster> = <phoneme> <phoneme> <phoneme> … <phoneme> ;
map <new_phoneme> <old_phoneme> ;
map <new_phoneme> <old_phoneme> ;

map <new_phoneme> <old_phoneme> ;

where <phoneme> is a phoneme in the pronunciations given in the lexicon file. <number_of_parts> is the number of parts that the phoneme should be split into. Legal values are 1, 2, 3, and r. <cluster> is a variable name describing the cluster of a group of phonemes. <new_phoneme> is what <old_phoneme> will be replaced with, when encountered in the labeled data. <old_phoneme> is a phoneme in the labeled data that should be mapped to a new symbol. A common use of the “map” command is to map all unvoiced closures to the symbol “uc”, and all voiced closures to the symbol “vc”.
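
As a hedged illustration (the actual digit.parts file is not reproduced in this tutorial, so the entries below are invented), a fragment of a parts file might look like the following:

9r      3 ;
aI      3 ;
.pau    1 ;
$alv = s z n t d ;
$den = T D ;
map uc tc ;
map uc kc ;
map vc dc ;

Here /9r/ and /aI/ are each split into three parts (as reflected in categories such as $fnt_l<9r, <9r>, and 9r>$bck_r in Section 4), $alv and $den are broad-category clusters used for context, and unvoiced and voiced closures are mapped to the symbols “uc” and “vc”.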

spec file
The spec file has the format specified in the Statenet documentation, specifically the recognizer spec format documentation.

files file
The “files” file contains an ASCII list of the files that will be used in a given partition. This file has the following format:

<wav_file_1> <phn_file_1> <cat_file_1> <txt_file_or_text_1>
<wav_file_2> <phn_file_2> <cat_file_2> <txt_file_or_text_2>
    ...
<wav_file_n> <phn_file_n> <cat_file_n> <txt_file_or_text_n>

where:

<wav_file>
is the name of a waveform file
<phn_file>
is the name of the phn file corresponding to the waveform file
<cat_file>
is the name of the cat file (containing category-level alignment information) corresponding to the waveform file.
<txt_file_or_text>
is either (a) the name of the text file corresponding to the waveform file, OR (b) the text transcription of the waveform file, surrounded by double-quotation marks (“) (for example, "one seven two three".)

If a file does not exist (and in the case of the last field if there is no text transcription), then the keyword NULL is used to indicate that the file (or transcription) doesn’t exist. Therefore, this file will always have four fields per line.
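
As an illustration (the paths and transcription below are invented), one line of a files file might look like:

C:/data/speechfiles/16/NU-1641.other1.wav C:/data/phnfiles/16/NU-1641.other1.phn NULL "one six four one"

Here no category file exists yet (so NULL is used), and the transcription is given directly in quotes rather than as a .txt filename.
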
dur file
The dur file contains the minimum and maximum durations for each category. This information is used by the Viterbi search when doing recognition, in order to prevent the search from inserting very short words or very long words. This file is created by gen_catfiles.tcl, and it has the following format:

Category        MinDur         MaxDur
<category_1>    <min_dur_1>    <max_dur_1>
<category_2>    <min_dur_2>    <max_dur_2>
    ...
<category_n>    <min_dur_n>    <max_dur_n>

where:

<category>
is a context-dependent sub-phonetic category recognized by the neural network
<min_dur>
is the minimum duration, in msec, of the specified category
<max_dur>
is the maximum duration, in msec, of the specified category
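
For example (with invented duration values), the first few lines of a dur file might look like:

Category        MinDur         MaxDur
<.pau>          30             2500
$alv<n          20             180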

counts file
The counts file contains the number of occurrences and total time (in msec) for each category. This information is used by revise_spec.tcl, in order to determine categories that have a small number of examples and may need to be tied to other categories. This file is created by gen_catfiles.tcl, and it has the following format:

Category        Occur        TotalTime(msec)
<category_1>    <occur_1>    <time_1>
<category_2>    <occur_2>    <time_2>
     ...
<category_n>    <occur_n>    <time_n>

where:

<category>
is a context-dependent sub-phonetic category recognized by the neural network
<occur>
is the number of occurrences, or segments, of the specified category (independent of how long each segment is).
<time>
is the total amount of time, in msec, for which there are examples of <category>
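
For example (with invented counts), the first few lines of a counts file might look like:

Category        Occur        TotalTime(msec)
<.pau>          552          331200
$alv<n          140          8400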

examples file
The examples file is an ASCII file that describes the location (filename and frame number) of each category that will be trained on. This file is created by pick_examples.tcl and read by gen_examples.tcl. The format of the examples file is:

<filename>
<category> <frame>
<category> <frame>
    ...
-1 -1
<filename>
<category> <frame>
<category> <frame>
    ...
-1 -1

where <filename> is the filename, including path, for a waveform.  <frame> is a particular frame within the waveform; this is an integer value.  <category> is the numeric value of the category that occurs at the associated frame; <category> is an integer value, and the mapping between category names and their numeric values can be determined from the spec file.
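
For example (with invented category numbers and frame indices), an examples file might contain:

W:/digit/tutorial/tutorial/data/speechfiles/16/NU-1641.other1.wav
3 12
3 13
87 45
-1 -1

meaning that frames 12 and 13 of this waveform are training examples for category number 3, frame 45 is an example for category number 87, and the “-1 -1” line ends the list of frames for this file.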

vec file
The vec file is a binary file that contains the feature vectors for training and the category numbers corresponding to the given features. The format is:

<header>
<category_number_1> <features_1>
<category_number_2> <features_2>
...
<category_number_n> <features_n>

where:

<header>
is a 4-byte integer value containing the number of vectors in the vec file.  A vector consists of one category number and its associated features.  
<category_number>
is a 4-byte integer value containing the numeric value of the category that represents the given features. This is the same as the number of the target output node of the neural network.
<features>
is an array of 4-byte float values that contains the features used during training. The default size of the feature vector is 130 features (13 MFCC coefficients plus their delta values, with a 5-frame context window), so the default size of <features> is 520 bytes.
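
The vec file is normally inspected with checkvec.exe, as shown in Section 4. As a minimal sketch of the layout described above (this is not part of the toolkit, and it assumes the 4-byte values are written in the native little-endian byte order of an Intel machine, which the tutorial does not state), the header could be read from Tcl as follows:

# sketch: print the number of vectors stored in a .vec file
set f [open digit.train.vec r]
# switch the channel to raw binary mode (no newline translation)
fconfigure $f -translation binary
# read the 4-byte header as a 32-bit little-endian integer (an assumption)
binary scan [read $f 4] i numvec
puts "number of vectors: $numvec"
# each record that follows is one 4-byte category number plus 130 4-byte floats (520 bytes)
close $f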

neural-network files
The neural network files are generated by nntrain.exe.   Each neural network file contains the neural network architecture and the weight values of the neural network. These files are binary files, with each value containing 32 bits in TCP/IP network byte order. The extensions of these files are numbers, starting with number 0; each number represents one iteration of the training process. The format is:

<iteration_number>
<number_of_layers>
<size_of_layer_1> <size_of_layer_2> … <size_of_layer_n>
<weight_values>

where:

<iteration_number>
is the iteration of training at which this network file was created (represented as unsigned int).
<number_of_layers>
is the number of layers in the network (represented as unsigned int). The usual value is 3.
<size_of_layer>
is the size of a layer of nodes (represented as unsigned int).
<weight values>
is an array of all weight values, represented as floats (when generated by nntrain or hnntrain) or doubles (when generated by train_nnet.tcl).

summary file
The summary file contains a summary of the performance of different iterations of network training. This file is created by select_best.tcl and gives the recognition performance based on recognizing a set of waveform files in either the development or test set. The format is:

 Itr     #Snt     #Words   Sub%    Ins%    Del%    WrdAcc%  SntCorr
 <itr>   <snt>    <wrds>   <sub>%  <ins>%  <del>%  <wacc>%  <sacc>%
 <itr>   <snt>    <wrds>   <sub>%  <ins>%  <del>%  <wacc>%  <sacc>%
 ...
 <itr>   <snt>    <wrds>   <sub>%  <ins>%  <del>%  <wacc>%  <sacc>%
Best results (<wacc>, <sacc>) with network <network>

where:

<itr>
is an iteration for which results are given
<snt>
is the number of sentences (waveform files) evaluated
<wrds>
is the number of words evaluated
<sub>
is the percentage of substitution errors
<ins>
is the percentage of insertion errors
<del>
is the percentage of deletion errors
<wacc>
is the word accuracy of the given network iteration on the given set of waveform files. This is computed as 100% – (<sub> + <ins> + <del>). For example, for iteration 30 in Section 4, this is 100% – (3.83% + 1.13% + 0.90%) = 94.14%.
<sacc>
is the sentence (waveform file) accuracy of the given network iteration on the given set of waveform files. This is a ratio of the number of waveform files recognized correctly divided by the total number of waveform files in the evaluation
<network>
is the network with the best word-level performance.

ali files
The ali files are created by select_best.tcl and contain sentence-by-sentence recognition results. The extension of an ali file is a number corresponding to the network iteration that was evaluated. These files can be evaluated by the eval_ali.tcl script. The format is simply an ASCII list of the correct words in an utterance, the recognized words in an utterance, and a blank line, as follows:

<correct_word_1a> <correct_word_1b> ... <correct_word_1x>
<recognized_word_1a> <recognized_word_1b> ... <recognized_word_1x>

<correct_word_2a> <correct_word_2b> ... <correct_word_2y>
<recognized_word_2a> <recognized_word_2b> ... <recognized_word_2y>
 ...
<correct_word_na> <correct_word_nb> ... <correct_word_nz>
<recognized_word_na> <recognized_word_nb> ... <recognized_word_nz>

If a word is mis-recognized, then both the correct word and the recognized word will be surrounded by asterisks (*). If an insertion or deletion occurred, then the alignment of the words is preserved by using pound signs (#) to represent the missing word.
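
For example (the digit strings below are invented, and the exact marking of insertions and deletions may vary), an ali file containing one substitution in the first utterance and one deletion in the second might look like:

three five *seven* two
three five *nine* two

one six four
one six #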

 

6. Script and Program Usage

In the following descriptions, items in angle brackets <> must have the appropriate value substituted for the description, and items in square brackets [] are optional.

asr.tcl <grammar file> <lexicon file> <spec file> <neural network file> <wav file> <wrd file> <phn file> <cat file> [-garbage <N>]
<grammar file>
is the file containing the grammar to be used in recognition
<lexicon file>
is the file containing the lexicon of words and their pronunciations
<spec file>
is the recognizer specification file
<neural network file>
is the file containing the neural network weights to be used during phoneme classification
<wav file>
is a file containing the single waveform to be recognized
<wrd file>
is the output of recognition, at the time-aligned word level
<phn file>
is the output of recognition, at the time-aligned phoneme level
<cat file>
is the output of recognition, at the time-aligned phonetic category level
-garbage <N>
sets the garbage value to N (default is 5)

checkvec <.vec file>

<.vec file>
is the vector file created by gen_examples.tcl.
fa.tcl <neural network file> <spec file> <lexicon file> <wav file> <txt file> {w,p,c} <output file> [-g <N>]
<neural network file>
is the file containing the neural network weights to be used during phoneme classification
<spec file>
is the recognizer specification file
<lexicon file>
is the file containing the lexicon of words and their pronunciations
<wav file>
is a file containing the single waveform to be recognized
<txt file>
is an ASCII file containing the words in the waveform
{wpc}
is w to specify word-level output, p to specify phoneme-level output, or c to specify phonetic category-level output
<output file>
is the output of forced alignment, which will be the time-aligned words, phonemes, or phonetic categories, depending on the previous option.
-g <N>
sets the garbage value to N (default is 5)

find_dur.tcl <.info_file> <corpora_file> <.dur_file> <.count_file>

<.info_file>
is the info file for this task
<corpora_file>
is the name of the corpora file.
<.dur_file>
is the duration file created by searching through the cat_path specified in <.info_file> for minimum and maximum durations.
<.count_file>
is the counts file created by searching through the cat_path specified in <.info_file> for occurrences of each category.

find_files.tcl <.info_file> <corpora_file>

<.info_file>
is the info file for this task
<corpora_file>
is the corpora file.
Output is written to <basename>.<partition>.<corpus>.files, where <basename> is the basename given in the info file, <partition> is the partition given in the info file, and <corpus> is the corpus specified in the info file.

gen_catfiles.tcl <.info_file> <.parts_file> <.spec_file> <corpora_file> <dur_file> <counts_file>

<.info_file>
is the info file for this task.
<.parts_file>
is the parts file for this task.
<.spec_file>
is the spec file for this task.
<corpora_file>
is the corpora file.
<dur_file>
is an output file containing durations of each category.
<counts_file>
is an output file containing counts of each category.
The category files that are created are put in the location specified in the info file with the field “cat_path”.

gen_spec.tcl <.info_file> <.grammar_file> <.lexicon_file> <corpora_file> <.parts_file> <.spec_file> [-start <counts_file>]

<.info_file>
is the info file for this task.
<.grammar_file>
is the grammar file for this task.
<.lexicon_file>
is the lexicon file for this task.
<corpora_file>
is the corpora file.
<.parts_file>
is the parts file for this task.
<.spec_file>
is the output spec file that is created.

gen_examples.tcl <.info_file> <.spec_file> <.examples_file> <.vec_file>

<.info_file>
is the info file for this task.
<.spec_file>
is the spec file for this task.
<.examples_file>
is the examples file for this task.
<.vec file>
is the output vec file that is created.
If the .vec file already exists, gen_examples.tcl will not overwrite the existing file, but will print out a message telling the user to manually delete the existing file and then re-run the script. This is to avoid accidental overwriting of .vec files.

nntrain.exe [<options>] <num iter> <train file>

[<options>] is one of:
-i  count     frequency with which to dump iterations
-l            balance category frequency with negative penalty
-c  iter      continue training from iteration c
-b  size      number of vectors in memory
-n  weight    weight for negative training
-t  tau       learn rate annealing factor
-r  rate      learn rate [0.050000]
-m  momentum  default is 0.0
-a <num layers> <size 1> ... <size n>  architecture of network
-sn seed      random seed for setting initial weight values
-sv seed      random seed for order of vector evaluation
-f  basename  basename for output weights files (default nnet)
<num iter>
is the number of iterations to train for (usually 30 to 45)
<train file>
is the vector file for training

pick_examples.tcl <info file> <corpora file> <spec file> <examples file>

<.info_file>
is the info file for this task.
<corpora file>
is the corpora file for this task.
<.spec_file>
is the spec file for this task.
<.examples_file>
is the output examples file that is created.
This script will create an “examples” file that contains the locations of all frames to be trained on. This examples file is then input to gen_examples.tcl.

revise_spec.tcl <input .spec file> <dur file> <counts file> <output .spec file> [-min <min_occur>]

<input .spec_file>
is the spec file that is input, having no ties or duration information
<dur file>
is the durations file created by gen_catfiles.tcl.
<counts>
is the counts file created by gen_catfiles.tcl.
<output .spec_file>
is the spec file that is created based on the input spec file, durations, counts, and user tie information
<min_occur>
is the minimum number of occurrences a category must have in order not to be considered for tying to another category.
select_best.tcl <nnet_basename> <file of test files> <grammar file> <lexicon file> <spec file> <summary file> [-garbage <N>] [-begin <B>] [-end <E>] [-only <O>] [-ali <wrdalign_basename>]
<nnet_basename>
is the base name for the neural networks
<file of test files>
is the file containing filenames to test; this file is usually generated by find_files.tcl
<grammar file>
is the grammar file for this task
<lexicon file>
is the lexicon file for this task
<spec file>
is the spec file for this recognizer
<summary file>
is a file that is created containing a summary of all evaluations
-garbage <N>
sets the garbage value to N (default is 5)
-begin <B>
starts evaluation at network iteration B
-end <E>
stops evaluation after network iteration E
-only <O>
evaluates only iteration O (equivalent to -begin O -end O)
-ali <wrdalign_basename>
writes alignment files using a basename of <wrdalign_basename>. The default is wrdalign_<nnet_basename>
Source:
http://www.cslu.ogi.edu/tutordemos/nnet_training/tutorial.html
