
STRUT User's Guide. Jean-Marc Boite, Laurent Couvreur, Geoffrey Wilfart.


Contents

1. [Console screenshots from section 6.6, Command Line Help: Figure 6.1 (General Help) followed by a Konsole session. The session shows that calling recognize with an invalid parameter value prints diagnostics such as "Bad value for parameter verbose: should be a keyword in the list 0|1|f|false|n|no|t|true|y|yes" and "parameter output required but not set", while recognize verbose=yes help=yes prints the detailed help for each parameter. The screenshot text itself is not reproduced here.]
2. [Appendix C listing (continued): state and transition-probability definitions for the phoneme-based HMM set (entries for phonemes such as ae, ay, v, w, k, bcl, dcl, tcl, kcl, ah, ao, ow, hh, iy, s, sh). The flattened numerical table is not reproduced here.]
3. [Figure 3.3: ASR Results (sclite alignment output with substitution counts and confidence intervals; the flattened numbers are not reproduced here).]

Chapter 4: Tools

This chapter contains a description of the tools that allow you to train models and test recognition.

4.1 Database Handling

4.1.1 General Purpose

strutify: raw-to-STRUT conversion tool. Its main task is to add a STRUT header in which the user can specify the fields and their values. The program is able to skip a fixed-length header.

edit header: allows you to add, remove or edit fields in the header of a STRUT file.

create archive: reads STRUT files (note that a wav file can be considered as a STRUT file) from a list of directories and packs all the data into a single file, creating a STRUT archive. It understands the data that it reads, so it is able to optionally code the data.
4. [Appendix C listing (continued): state and transition-probability definitions for the remaining phoneme-based HMMs (er, uw, ax); the flattened numerical table is not reproduced here.]

Bibliography

[1] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.
[2] R. Boite, H. Bourlard, T. Dutoit, J. Hancq and H. Leich, Traitement de la Parole, 2nd edition, Presses Polytechniques Universitaires Romandes, 2000.
[3] J. W. Picone, "Signal Modeling Techniques in Speech Recognition", Proceedings of the IEEE, vol. 81, no. 9, pp. 1214-1247, Sep. 1993.
[4] H. Hermansky, "Perceptual Linear Predictive (PLP) Analysis of Speech", Journal of the Acoustical Society of America, vol. 87, no. 4, pp. 1738-1752, Apr. 1990.
[5] H. Hermansky and N. Morgan, "RASTA Processing of Speech", IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 578-589, Oct. 1994.
[6] S. Furui, "Cepstral Analysis Technique for Automatic Speaker Verification", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 29, no. 2, pp. 254-272, Apr. 1981.
[7] S. F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 27, no. 2, pp. 113-120, Apr. 1979.
[8] J. S. Lim and A. V. Oppenheim, "Enhancement and Bandwidth Compression of Noisy Speech", Proceedings of the IEEE, vol. 67, no. 12, pp. 1586-1604, Dec. 1979.
5. Lexicon. As we mentioned earlier, the lexicon consists of a list of the transcriptions of words in terms of speech units. For example, figure 1.4(a) gives a lexicon for recognition of English digits where the speech units are actually words, i.e. digits. Likewise, figure 1.4(b) gives the lexicon for the same words where the speech units are phonemes. The format of lexicon files is simple: each line contains a word followed by a sequence of symbols corresponding to speech units. Clearly, these speech units should exist in the HMM topology file. During the decoding, it is possible to consider several different transcriptions for a given word. This helps to handle variations with respect to the canonical transcription, like accents, co-articulation effects or mispronunciations. There exist syntax rules to concisely generate alternatives from a reference transcription by means of brackets. Besides, it is sometimes also possible to generate the transcriptions automatically. See the Reference Guide of the Application Compiler for a complete description of the lexicon syntax rules and the use of automatic phonetization.

Language Model. In the framework of STRUT, the language model takes the form of a finite state grammar (FSG). This grammar defines, in a concise hierarchical way, all the possible sequences of words. Figure 1.5 shows a FSG file for recognition of English digit sequences. The file begins with the tag FSG, which identifies it as a FSG file.
6. sample rate 8000 database id enu database version aurora2 and the archive file is obtained once we have removed the 08 files gt create archive output samples test sam format samples dir samples test This file can be directly used as input for the decoding program recognize yet you can compute the features beforehand to save computational time during the decoding process Create a Language Model The next step of the testing procedure is to create the language models in the form of a FSG file Figure 1 5 gives the FSG file for recognizing English digit sequences including a garbage model This file has typically the file extension fsg and is located in the application subdirectory In our example we name it aurora2 fsg Compile an Application File As we mentioned in section 1 3 4 the FSG file and the lexicon are compiled to obtain an appli cation file to be used by the decoding program recognize This is done with the application compiler compile asr gt compile asr phonemes application aurora2 words hmm user dictionary application aurora2 words dic 21 22 CHAPTER 3 TESTING PROCEDURE mode fsg syntax application aurora2 fsg output application aurora2 words app Based on the HMM topology file aurora2 words hmm given in Appendix C 0 2 the lex icon aurora2 words dic of figure 1 4 a and the FSG file aurora2 fsg all located in the application directory compiler asr builds a state graph which defines all the all
7. use macros one just has to click on the Record a macro command execute normally the sequence of tasks to be recorded into the macro and finish recording the macro by clicking on the Stop recording macro command The user must pay attention however to several things e Sound buffer manipulations such as paste cut or add to 5 2 2 involve a buffer of samples and a selection cursor position that are ALSO recorded into the macro e Undo and Redo commands only undo redo the last operation Be careful that undo redo operations executed BEFORE the recording of macro are not taken into account e Previous Utterance and Next Utterance may not change utterance in case the first resp the last utterance is selected Once the user has finished recording macro he she has to register it and give it name The name then appears in the Macros menu and the macro can be called by clicking on its name When StrutSurfer is closed macros are written into a file so that they remain available through sessions Part Il STRUT Reference Guide Chapter 6 The Strut User Interface 6 1 Introduction The general syntax of command is command parameter value For all programs typing command help yes of for almost all of them those which have mandatory arguments typing the command without any parameters gives an explanation on the command parameters On line help is also available in html format There is also a TCL TK interfa
8. codebook_count -i 3
codebook_format -s4 0012
data_format -s9 BigEndian
codebook0 -s3 plp
codebook0_size -i 512
codebook1 -s9 delta_plp
codebook1_size -i 128
codebook2 -s31 delta_energy_delta_delta_energy
codebook2_size -i 64
end_head

The labels must be stored with the minimal storage requirement possible. For a codebook size up to 256, the labels can be stored on one byte; for greater codebook sizes, two bytes are necessary. The codebook_format field in the header shows how the labels are stored. For example, 0012 means that codebook 0 is stored on the first two bytes (don't forget that data_format tells the order of those two bytes), that codebook 1 is stored on the third byte and that codebook 2 is stored on the fourth.

7.6 Segmentation

In STRUT1 there was no segmentation format per se. One uses the more general label format, which labels each frame with the MLP output index. This unfortunately makes the file dependent on the frame shift. A segmentation format will be defined which will allow storing the segment boundaries in milliseconds. A block will read those segmentation files and transform them into label streams for use in the training programs.

7.7 Probabilities

STRUT_1A
   1024
file_type -s13 probabilities
database_id -s5 TIMIT
database_version -s3 1.0
models_id -s6 dtimit
models_version -s3 2.1
utterance_id -s11 jjb0_si1277
frame_count -i 230
data_format -s9 BigEndian
probability
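To make the codebook_format convention above concrete, the following short sketch (an illustration based on the description, not code taken from STRUT) unpacks one four-byte label record laid out as 0012 with big-endian byte order:

import struct

# One frame of packed labels laid out according to codebook_format "0012":
# bytes 0-1 -> codebook 0 (512 entries, needs two bytes), byte 2 -> codebook 1,
# byte 3 -> codebook 2.  The record below is a hypothetical example value.
record = bytes([0x01, 0x2A, 0x17, 0x05])

label0, label1, label2 = struct.unpack(">HBB", record)  # ">" = BigEndian, as in data_format
print(label0, label1, label2)   # 298 23 5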
9. if you did not open a language model The command allow to assign an application to the current recognizer Recognize Once a recognizer is properly set it has been assigned at least a language and an application the command is enabled and it is possible to perform recognition Recognizer Settings allow to change the parameters of the current recognizer 5 4 The Utils Menu Besides simply recognizing a buffer of samples StrutSurfer can be used to manipulate the sound i e adding different types of noise filtering or enhancing the speech signal All these features are accessible via the Utils menu The menu currently contains 3 submenus one for each operation adding noise filtering or enhancing speech each submenu being composed of two commands the operation itself and the setting of the operation parameters As an example we show below a screenshot of the Noise settings command which controls the parameters of additive noise 5 5 Advanced Features In this section we review the advanced features features in Research menu and provided by specific plugins 5 5 1 Computing Features and Probabilities StrutSurfer can easily compute features and probabilities as soon as a recognizer is provided with acoustic models and an application This allows in particular to compare a reference stream to an online computed stream to test changes in the code or simply the results returned by two models or to detect phonemes that are abn
10. process we compile the HMM topology file the lexicon and the FSG file into an application file This file is obtained by using the program compile asr Note that the lexicon can be embedded in the FSG file For example the transcriptions given either in figure 1 4 a or in figure 1 4 b can be simply cut and pasted at the end of the FSG file See Reference Guide of Application Compiler for a complete description about the generation of application files 1 4 What Next If you have never used STRUT you should now read the chapter 6 which explain the general syntax of the STRUT programs Otherwise you can go on the next chapter a tutorial of the training procedure 14 CHAPTER 1 INTRODUCTION FSG lt START gt seq rep alt lt DIGIT gt UNK alt rep seq lt DIGIT gt alt zero oh two three four five six seven eight nine 281 Figure 1 5 Finite State Grammar for English digit recognition file aurora2 fsg Chapter 2 lraining Procedure 2 1 STRUT components The STRUT programs allow to create and to manipulate STRUT files Along the process of training and testing an ASR system in the STRUT framework it may sometimes be necessary to check that files are not corrupted display strut can be used as a validation tool This program displays the contain of a STRUT file utterance by utterance and some information about every utterance Ifa STRUT file cannot be displayed it is likely to be corrupt
11. train train ptk is Perl Tk script that trains an mlp It calls the following 3 programs mlp init computes the feature normnalization parameters mlp train performs one iteration of the mlp traning mlp cross validate does a cross validation on a test set to evaluate the performance of the MLP Chapter 5 StrutSurfer StrutSurfer is a sound edition tool based on KTH s WaveSurfer dedicated to the STRUT toolkit The user is invited to read the WaveSurfer documentation for more information StrutSurfer can not only view edit and play a huge variety of sound files including STRUT files and archives but can also be used to compute and visualize features and or probabilities segmentations alignments and much more 5 1 The StrutSurfer Window The StrutSurfer window looks much like WaveSurfer window Based on WaveSurfer it inherits from all WaveSurfer facilities and mouse gestures 5 2 Basic Functions 5 2 1 The File Menu The File menu contains the usual functions Open to open a sample file New to create a new window Save to save the current utterance Save as to save the current utterance Quit If you want that operation now you can skip this chapter Additionally the File menu contains the 5 most recent files and a File Associations entry that allows the user to define custom file types based on file extension When StrutSurfer en counters an unknown file type it asks the user information about the sampling fre
12. weights are then updated in order to minimize the error between the actual outputs and the suited ones. In STRUT, the training procedure is performed by running a Perl [10] script, which can be generated via a user-friendly interface written in PerlTk [11], train.ptk. Note that an acoustic model is always developed for a given language and for a given type of acoustic features (sampling rate, frame shift and length, etc.).

Remark: The ASR Chicken-and-Egg Problem. The performance of an ASR system will depend on the quality of the acoustic model. In order to train an acoustic model accurately, a valid state-by-state segmentation of the training sequence of acoustic vectors is required. Such a segmentation can be generated manually, but this work is labor intensive and tedious, since the training speech database can be several hours long and sometimes the states do not mean anything (e.g. for word-based HMMs). Alternatively, one can generate the segmentation automatically, but this requires an existing acoustic model. We generally resort to an iterative procedure: the MLP weights and the segmentation are recomputed alternately until convergence (see figure 1.3). The procedure is called embedded training: the training of the weights, which is itself an iterative process, is embedded within several re-alignment steps.

[Figure 1.3]

There exist many ways to initialize the procedure. See section 2.6 for more information.
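As an illustration of this supervised set-up (one-hot state targets, error minimisation), the sketch below performs a single gradient step on a tiny one-hidden-layer network. It is a didactic approximation only; the sizes and the learning rate are arbitrary and it does not reflect the actual mlp train implementation.

import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_states = 26, 50, 33        # illustrative sizes only

x = rng.standard_normal((8, n_features))           # a mini-batch of acoustic vectors
targets = rng.integers(0, n_states, size=8)        # their states, taken from the segmentation
t = np.eye(n_states)[targets]                      # ideal outputs: one for the suited state, zero elsewhere

w1 = rng.standard_normal((n_features, n_hidden)) * 0.1
w2 = rng.standard_normal((n_hidden, n_states)) * 0.1

h = np.tanh(x @ w1)                                # hidden layer
z = h @ w2
y = np.exp(z - z.max(axis=1, keepdims=True))
y /= y.sum(axis=1, keepdims=True)                  # softmax: estimated state posteriors

cross_entropy = -np.mean(np.sum(t * np.log(y), axis=1))

lr = 0.1
dz = (y - t) / len(x)                              # gradient of the cross-entropy w.r.t. z
grad_w2 = h.T @ dz
grad_w1 = x.T @ ((dz @ w2.T) * (1.0 - h ** 2))
w2 -= lr * grad_w2
w1 -= lr * grad_w1
print("cross-entropy before the update: %.3f" % cross_entropy)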
13. 34 sample max i 1735 sample n bytes i 2 sample byte format 2 01 sample sig bits i 16 sample coding s none alaw ulaw shorten end head Note file type can be omitted in this kind of files so that the sample files coming from LDC are compatible The sample coding field explains how the samples must be interpreted Usually the lossless shorten format will be used 7 4 Features The first step of a recognition chain right after the segmentation of the input stream by a voice activity detector is to compute acoustic features SIRUT 1024 file type s8 features feature type s8 Cepstrum database id sb TIMIT database version s3 1 0 utterance 14 s11 jjb0_s11277 features s9 rasta plp frame count i 230 frame rate i 10 window size i 20 46 CHAPTER 7 THE FILES feature dimension i 26 data format s9 BigEndian end head You will find also a lot of optional fields depending on the feature extraction type that will allow to re compute the features with exactly the same parameters Feature type can be one of Cepstrum CMSCepstrum LifteredCepstrum LifteredC MSCepstrum Plp RastaPlp 7 5 Labels Although discrete models are seldom used labels files are described here because they were the sole segmentation files in STRUT A typical header SIRUT 1A 1024 file type s6 labels database id s5 TIMIT database version s3 1 0 utterance 14 s11 jjb0_s11277 frame count i 230 clustering s3 LBG
14. [Appendix C listing (continued): state and transition-probability definitions for the word-based HMMs (oh, two, zero); the flattened numerical table is not reproduced here.]

We give here a typical HMM topology file for phoneme-based HMMs of English phonemes (aurora2.phonemes.hmm).

C.0.3 Phoneme-Based HMM set

[Flattened listing of the phoneme-based HMM definitions (entries for ih, the silence state, f, th, ey and others); not reproduced here.]
15. [Appendix C listing (continued): state and transition-probability definitions for the word-based HMMs (five, seven, four, six, three, nine); the flattened numerical table is not reproduced here.]
16. Markov Models HMM 1 are the most popular acoustic models According to the HMM formalism any se quence of acoustic vectors is a piecewise stationary stochastic process for which each stationary segment is associated with a specific state and has statistical properties depending on that state The sequence of states is controlled by a Markov process The acoustic models have to be trained before the recognition takes place Practically many sound examples of the words should be presented to the training system Except when the number of words is limited e g digit recognition it is practically not possible to train accurately word based acoustic models It would require too large speech databases It is generally preferred to represent speech with speech units smaller than words Commonly used speech units are phonemes In most languages any word can be represented by a sequence of phonemes drawn from a set of about 30 40 phonemes Hence it is sufficient to train phoneme based acoustic models word based acoustic models are then obtained by concatenating the phoneme based acoustic models according to the phonetic transcriptions contained in a lexicon 1 2 WHAT IS STRUT NOT 7 Lexicon When words are used as speech units the lexicon is just equivalent to the list of words that can be recognized When speech units are not words the lexicon maps the words to sequences of speech units It indicates how the acoustic models of the spee
17. RUT archive sample file new window pops up and you are asked to select an utterance for example fac_13a Then you can create a new pan and display the spectrogram Likewise we can display a color map of the acoustic features We finally load the corresponding word based and phoneme based segmentations See chapter 5 for more information Once we have verified that the sample files and the segmentation files are valid and coherent we can start training the acoustic models As we said earlier the training process consists in estimating a MLP that classifies the acoustic vectors with respect to the states given in the segmentation It is performed by running a Perl script which can be easily generated with the interface strut train ptk You select the different pages and fill the fields Finally you create the training Perl script run Create script The interface will tell you if any parameter is missing If you have filled in all the mandatory parameters an new window will apppear where you can edit the script Once you are happy with it you save it script is saved can be executed to generate the Figure 2 2 shows the script we use to train a MLP for MEL acoustic features starting with a models file Scripts for other training conditions can be generated easily by editing this script See the YET TO BE WRIT TEN User Guide for train ptk for a complete description on the MLP training options 2 6 Realignment Process A
18. STRUT User's Guide
Jean-Marc Boite, Laurent Couvreur, Geoffrey Wilfart
November 4, 2005

Contents

Part I: STRUT in a Nutshell
1 Introduction
  1.2 What is STRUT not
  1.3 What is STRUT
    1.3.1 Overview
    1.3.2 Front End
    1.3.3 Back End
    1.3.4 Application Compiler
  1.4 What Next
2 Training Procedure
  2.1 STRUT Components
  2.2 Create a Sample File
  2.3 Compute a Feature File
  2.4 Obtain a Segmentation File
  2.5 Train an acoustic model
  2.6 Realignment Process
3 Testing Procedure
4 The Tools
  4.1 Database Handling
    4.1.1 General Purpose
  4.2 Training and Testing of Models
19. ackslash at the end of the line setenv This can be used to store text in variables for later use la pmake T he value of a variable may be retrieved by enclosing the variable name in parenthesis and preceding the whole with a dollar sign For example you can put in your setup setenv PHONEMES home strut data phonemes english phonemes PHONEMES Variables used inside another variable are expanded whenever the outer variable is expanded in the example setenv STRUT DIR home strut setenv PHONEMES DIR data phonemes english first PHONEMES setenv 5 DIR home zorglub second PHONEMES first will receive the value home strut data phonemes english while second will receive home zorglub data phonemes english 6 5 MULTIPLE SPECIFICATION OF AN ARGUMENT A1 e The variable values can be put in the environment prior to execution of the program so you can type in your shell setenv MYDATABASE digits 1 0 Or export MYDATABASE digits 1 0 and put in your setup weights STRUT DIR MYDATABASE weights 130 200 66 Three variables are already defined if you specify the database id in the command line DATABASE ID DATABASE VERSION DATABASE which is DATABASE ID DATABASE VERSION If a parameter has a default value it can contain variables The default value are expanded the command line and the setup files are read so when variables are set 6 5 Multiple Specification
20. array param 0 2 1 6 96 float array array param 0 2 3 4 3 float array array param 0 2 3 7 Of course using and can cause problem with your shell so array values will preferably specified in setup files see section 6 4 Values are Booleans The boolean will be set to true if the param is specified as 1 tT or yY to false otherwise input values are valid Integers They can be restricted to be only positive or only negative or min and max values can be specified Floats They be restricted to be only positive or only negative or to be between 0 and 1 Strings Any ASCII string Don t forget to put the string between quotes if there is a space or any special character interpreted by your shell Filenames filenames required can be input or output files Keyword Only a keyword is allowed Menu The menus introduce a hierarchy in the command line They can take some keyword values and according to the keyword chosen introduce a new parameter list specified between parenthesis command db dbid foo bar pari yes par2 toto command db dbid foo quick par3 no Often those menus are introduced only for clarity and no keyword is associated with them command db dbid foo pari yes par2 toto The full syntax of the menu is not displayed To get help specify help yes as parameter command foo help yes 40 CHAPTER 6 THE STRUT USER INTERFACE 6 4 Common Parameters Be
21. [Figure 5.7: Setting the characteristics of the noise to be added (screenshot of the noise-settings dialog: noise type, sample rate, frame length, SNR, normalization, reverb and modulation parameters; widget labels not reproduced here).]

[Figure 5.8: Displaying the features and the probabilities computed by the MLP.]

contain a single utterance if the current sound has no utterance ID (i.e. was not extracted from an archive). In case the current sound has been extracted from an archive of samples, and a corresponding archive of features (resp. probabilities) exists and respects the asr database directory structure, then this archive is automatically loaded (the asr database directory structure is described in chapter 2.1). It is also possible to open a segmentation file corresponding to an utterance. If the utterance is part of a STRUT archive, the segmentation file is automatically loaded as described above. In other cases, or if the user wants to specify an alternate segmentation file, he or she just has to open a
22. ave been defined Almost all of them can be arrays or arrays of arrays If you need to enter an array you can either 37 38 CHAPTER 6 THE STRUT USER INTERFACE e Enclose the values between and separate values with commas or spaces commas are only valid for numeric parameters float array param 0 621 1 4142136 2 7182818 3 1415927 When you specify int float or bool params the parenthesis are there for clarity only If you separate values by commas you don t need them float array param 0 621 1 4142136 2 7182818 3 1415927 When specifying string array include the strings between quotes string array param si s2 Specify the individual values alone or grouped float array param 0 621 float array param 1 21 4142136 float array param 2 23 22 7182818 float array param 24 23 1415927 float array param 25 27 23 1415927 Some hybrid notations are also available but without guaranty float array param 0 621 1 4142136 2 7182818 3 1415927 float array param 1 20 71 The rule is simple float array param size 28 and float array param 0 621 1 4142136 2 7182818 3 1415927 both specify the size of the array And construct like float array param 1 20 71 modify one element enter arrays of arrays the allowed syntaxes are similar 6 3 ARGUMENT TYPES 39 float array array param 2 3 7 2 3113 4 5 6 7 8 float array array param 2 0 3 float array array param 0 2 1 9 float array
23. can be activated rasta log gt extract features input samples train shn output features train plp feature type plp rasta log 2 4 Obtain a Segmentation File As we explained previously the main problem for training a MLP is to obtained a first segmen tation In our example we assume that a segmentation is available That is every utterance from the training set corresponds to a sequence of English digits hence a sequence of HMMs hence a sequence of states The segmentation file contains the boundaries of every state for every utterance The segmentation files are located in the segmentation subdirectory The segmentations are available for both word based HMMs segmentation train words seg and phoneme based HMMs segmentation train phonemes seg See section 2 6 for more infor mation about obtaining a segmentation 2 4 OBTAIN SEGMENTATION FILE 17 StrutSurfer Hle Edit Strut Research Macros Utils train sam fac 13a spectrogram 00 333 60Hz 106 6946 WaveBar 00 000 01 159 Figure 2 1 Screen shot of the StrutSurfer interface 18 CHAPTER 2 TRAINING PROCEDURE 2 5 Train an acoustic model Before training a MLP it is worth verifying that your segmentation really correspond to your samples To do so you can use the graphical interface StrutSurfer fig 2 1 First we open the training sample file Select samples train sam in the File Open menu The file is automatically recognized as ST
24. ce but it is in a very preliminary state There is also another form of a command which is obsolete and should not be used any more command database id database version parameter value STRUT used to be a collection of different programs that performed dedicated actions this is not the case any more STRUT is now a single program with all the functionality This is transparent for the user on Unix like systems where you have symbolic links program reads its first argument the program name to know which functionality it is supposed to implement and how it is supposed to parse the remaining arguments If you don t have symbolic links you can always directly call the main program with the name of the component as the very first argument strut recognize setup recognize stp output 6 2 Case Dependency and Shell Interaction There is no distinction between lower case and upper case letters in arguments names Also there is no distinction between dashes and underscores _ To allow filename completion and wildcard expansion any number of space character can be put between the equal sign and the value of the parameter extract features setup lpc cepstrum is valid but not extract features setup lpc cepstrum After all the parameters can follow a list of filenames That s why command line is not parsed beyond the parameter with no equal sign 6 3 Argument Types This section will describe all the argument types that h
25. ch units should be linked to form the acoustic models of the words Language Model As we mentioned earlier the most likely sequence of words is searched during the decoding process In order to limit the number of hypotheses language model is used It contains information about the allowable word sequences 1 2 What is STRUT not This is not a dictation system 1 9 What is STRUT In this section we present how STRUT performs the tasks involved in an ASR system Readers who are already familiar with the components of STRUT can proceed directly to chapter 2 STRUT is research tool designed for speech and speaker recognition While Gaussian Mixture Models GMM can be trained hybrid Multi layer Perceptron Hiden Markov Models HMM MLP have been particularily developed 1 3 1 Overview STRUT consists in a set of stand alone programs which can be either run as command lines or included in scripts in order to perform the different tasks required in an ASR system STRUT has been developed as a research tool It aims at being user friendly providing comprehensive help and allowing to access all intermediate results of the ASR process When used in a normal mode the components of STRUT exchange information via files with a particular format Here is the structure of such a file a sample file in this case SIRUT 2A 1152 file type s7 samples database id s6 enu database version s3 aurora2 strut release s8 strut2 0 st
26. Appendix C.0.3 describes the entire set of English phoneme-based HMMs.

Next, we have to estimate the statistical distributions of every state. Many approaches have been proposed to estimate the state statistical distributions [1, 2]. In STRUT, we have adopted the hybrid Hidden Markov Model / Artificial Neural Network (HMM/ANN) paradigm [9]. More especially, we consider a special class of ANNs, namely Multi-Layer Perceptrons (MLP). Once they have been properly trained, such statistical tools allow to estimate the a posteriori state probability for any acoustic vector. When the recognition takes place, a sequence of acoustic vectors is first computed along the speech utterance to be recognized. Next, the MLP estimates the a posteriori state probabilities (all the states at the same time), resulting in a sequence of probability vectors. Finally, the decoding process searches the resulting lattice of probabilities for the most likely path, i.e. the path with the highest probability under the constraints defined by the topologies of the HMMs, the lexicon and the language model.

Such a MLP can be efficiently trained in a supervised mode. Given a speech database, we compute the acoustic vectors and we present the MLP with the acoustic vectors and the suited states (equivalently, an ideal probability vector with entries equal to one for the suited state and zero for the others).
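The decoding idea just described, finding the best state path through a lattice of per-frame probabilities, can be illustrated with a bare-bones Viterbi sketch. This is the generic textbook formulation, not the pruned decoder implemented in recognize.

import numpy as np

def viterbi(log_probs, log_trans, log_init):
    # log_probs: (T, S) frame-by-state scores (e.g. log MLP posteriors)
    # log_trans: (S, S) log transition probabilities, log_init: (S,) initial log probabilities
    T, S = log_probs.shape
    delta = log_init + log_probs[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)       # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # scores[from, to]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_probs[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 3 states, left-to-right topology with 0.5 self-loops, 5 frames of posteriors
trans = np.array([[0.5, 0.5, 0.0],
                  [0.0, 0.5, 0.5],
                  [0.0, 0.0, 1.0]])
posteriors = np.array([[0.8, 0.1, 0.1],
                       [0.6, 0.3, 0.1],
                       [0.2, 0.7, 0.1],
                       [0.1, 0.6, 0.3],
                       [0.1, 0.2, 0.7]])
with np.errstate(divide="ignore"):
    print(viterbi(np.log(posteriors), np.log(trans), np.log([1.0, 0.0, 0.0])))
# -> [0, 0, 1, 1, 2]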
27. data there is asr database mul direc tory which will contain multilingual database like AURORA or home made test databases 50 A 2 FILE SYSTEM ORGANIZATION 51 A 2 1 Database Subdirectory Organization In database sub directory you find those files or directories README an ASCII file explaining the database contents README should be written in capital letters so it will appear before the directory names in a sorted list Please don t explain the directory structure if it follows the conventions You must only describe what differs from the standard application all the files that are needed to build an application for a recognizer and the application file themselves The following conventions have bee adopted for the filenames extensions app application files the output of compile asr fsg a Finate State Grammar file This file contains only the FSG and will be rarely used Ixd the same as an FSG file with aditional information to help building the application such as the language and some phonetic transcriptions wpg word pair grammar words that can follow each other in the YO format dic dictionaries hmm phoneme topologies features the database after feature extraction The training program is normally able to com pute the feature on line so this directory should usually be empty However it sometimes can be useful to pre process the database for instance when you are experimen
28. databases follow that format. A header is a succession of fields that describes the data format or that gives any information on how the data has been obtained. Each line consists of the field name, followed by a dash and a letter specifying the type of the value, and then the field value itself:
-i integer
-s23 string of 23 characters
-f float

- Microsoft wave file format.

- STRUT file format. A single file can contain a complete database. The header is similar to the NIST header, with extensions to arrays and arrays of arrays, as in:
mlp_cross_validation_score -f 85.2903 86.1208 87.2948
feature_dimension -i 13
feature_multipliers 0 -f 0 0 1 1 1 1 1 1 1 1 1 1 1
feature_multipliers 1 -f 1 1 1 1 1 1 1 1 1 1 1 1 1
feature_multipliers 2 -f 1 0 0 0 0 0 0 0 0 0 0 0 0

7.1 Strut File Headers

In a STRUT header, no field is mandatory and there is no restriction on field names. However, some fields have a special meaning and can affect the way STRUT reads the data.

file_type -s: this field explains what is in the file. It defaults to samples.

7.2 Strut File Types

You will find in this section some examples of file types. This is not an exhaustive list, as new file types can be added at will.

7.3 Samples

Example from the TIMIT database:

NIST_1A
   1024
file_type -s7 samples
database_id -s5 TIMIT
database_version -s3 1.0
utterance_id -s11 jjb0_si1277
channel_count -i 1
sample_count -i 36864
sample_rate -i 16000
sample_min -i 16
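As an illustration of the field/type/value layout described in section 7.1, here is a small parser sketch for the ASCII part of such a header. It is an assumption for illustration only: it ignores arrays, the fixed header size and the end_head marker, which a real reader would have to honour.

def parse_header_line(line):
    # One header line is "name -tN value": -i integer, -f float, -sNN string of NN characters
    name, type_tag, value = line.split(None, 2)
    if type_tag == "-i":
        return name, int(value)
    if type_tag == "-f":
        return name, float(value)
    if type_tag.startswith("-s"):
        return name, value
    raise ValueError("unknown field type: " + type_tag)

header_text = "file_type -s7 samples\ndatabase_id -s5 TIMIT\nsample_rate -i 16000"
fields = dict(parse_header_line(line) for line in header_text.splitlines())
print(fields["sample_rate"])   # 16000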
29. e the 08 files have been removed gt create archive output segmentation train seg format samples dir segmentation train With the first segmentation we can train a first acoustic model as described in the previous sections The acoustic model can be refined by realigning the segmentation To do so we use recognize itself Indeed if you recognize speech utterance with a lexicon containing only the pronounced words and a grammar limited to the pronounced sentence the recognition is straightforward and a byproduct of the recognition will be the state segmentation This procedure is generally called forced alignment First we build the phonetic file T his file contains a special application file for every utterance to align which allows only the actual sentence Such a file can be build with the perl Tk interface Select the Compile page Then you can compute alternatively a new segmentation and a new acoustic model See Training Perl Scripts Chapter 3 lesting Procedure Create a Sample File Like the training procedure the testing procedure begins with creating archive file with the test utterances First we copy the 1001 test files speechdata testa cleani 08 from cdrom 1 4 of the Aurora2 collection into the samples test directory Then the files are strutified for example gt strutify input samples test fak 1b 08 output MDATADIR samples test fak 1b data format big endian data size 2 file type samples
30. [Contents (continued); the recoverable entries are:]
  7.1 Strut File Headers
  7.2 Strut File Types
  7.3 Samples
  7.4 Features
  7.5 Labels
  7.6 Segmentation
  7.7 Probabilities
Training Perl Scripts
A Install STRUT
  A.1 Software Installation
    A.1.1 Install From Sources
    A.1.2 Install From CVS
    A.1.3 Install Binary Packages
  A.2 File System Organization
    A.2.1 Database Subdirectory Organization
B Environment Variables
C Examples of HMM topology
  C.0.2 A Word-Based HMM set
  C.0.3 Phoneme-Based HMM set

Part I: STRUT in a Nutshell

Chapter 1: Introduction

This document has been designed to provide a complete step-by-step guide to the Speech Training and Recognition Unified Tool (STRUT). The guide assumes that the user has no previous training with the STRUT software.
31. ean default 0 debug boolean default 0 ztream header debug ztream header routines boolean default 0 boiteBaragorn 5035 sce on Edi View Settrge Hep bni triman rennes the poet mnm select probabilities Cinclusive this value intemr default 0 Pelee Lhe nd tho list 0 1 all dataochr dautacdir waridblo holp htmli botox n no perl tol y yc yes tocfaou perenni pe ifall 05 melad puesta babies Parum Lis value interar default 1 itehia Ak ll 1A pr shot PF hal e 437 gt Figure 6 3 Help on the garbage sub menu Detailed Menu If parameter has a menu you can get help about the menu with command like recognize verbose yes help yes As the menu depends on the parameter value you have to specify a value see fig 6 2 Nested Menus Sometimes menu will be nested like in fig 6 3 Chapter 7 The Files STRUT has defined its own database format inspired by the NIST SPhere wave file format an ASCII header describes what s in the file and the data follows in binary format Extensions and adaptations have bee made to the format STRUT is able to handle e SPHERE headered files These are samples files with ASCII header Each file contains one utterance Many speech
32. ed probably because of a bug in the program that created the file If you want to visualize your data you can also use StrutSurfer This graphical interface is based on TclTk 12 and Python 13 These scripting languages should be installed StrutSurfer allows to navigate seamlessly in a STRUT archive file to visualize and process samples features or segmentation related to any utterance See chapter 5 for more information A detailed list of the STRUT programs can be found in chapter 4 2 2 Create a Sample File The initial step of the training procedure consists in creating the speech sample archive file Let s assume that we first copy the 8440 speech data files speech data train clean 08 from cdrom 2 4 of the Aurora2 collection T hese files are English digit sequences sampled at 8000 Hz and pronounced by 110 male and female speakers Next we transform these RAW speech files into STRUT sample files by adding a header with the command strutify For example gt strutify input samples train fac_13a 08 output samples train fac_13a data format big endian data size 2 file type samples sample rate 8000 database id enu database version aurora2 The input and output define the RAW speech file and the STRUT sample file respectively In our example every speech sample is binary coded on two bytes data size 2 with the left most byte being most significant data format big endian This information allows strutify to interpret proper
33. The main macro <START> defines the structure of any sentence allowed by the grammar. The definition relies on a set of nested rules. They indicate that the grammar allows sentences containing any number of times an element of type <DIGIT>. The macro <DIGIT> is defined afterwards: it can contain any of the ten digits. An alternative unit can be recognized instead of a digit, namely the garbage. This particular unit can be seen as a dummy word which models anything except a digit. It is very useful to handle out-of-lexicon words, which will be displayed as UNK during recognition. See the Reference Guide of the Application Compiler for a complete description of the generation of FSG files.

[Figure 1.4: (a) Word-based lexicon file aurora2.words.dic, where each digit is transcribed as itself (one line per word: "oh oh", "zero zero", "one one", ..., "nine nine"), and (b) phoneme-based lexicon file aurora2.phonemes.dic, where each digit is transcribed as a sequence of phonemes (e.g. "three th r iy", "five f ay v", "nine n ay n"), for English digit recognition.]

1.3.4 Application Compiler

The HMM topology file, the lexicon and the FSG file are human-readable files which entirely define all the sentence models with respect to the decoding process. In order to prepare the decoding
34. he data For instance the default behaviour when it handles samples is to code them into the SHORTEN format unstrutify takes a STRUT archive and split the data into one file per utterance select utterance allows to create an archive that is a subset of another one Note that the STRUT programs are normally able to do it on the fly StrutSurfer derives from WaveSurfer Please see chapter 5 for details 4 1 22 Samples Some programs allow to manipulate sample database convert samples is able to modify the sampling rate modify the sample format and so on add noise takes sample file and adds random or database noise to get a given sample to noise ratio wiener is supposed to enhance the speech quality by spectral filtering 29 26 CHAPTER 4 THE TOOLS 4 2 Training and Testing of Models compile asr takes an input vocabulary and or grammar together with phonetizer and cre ates an application file for the recognizer A special format of the output allows to keep backtracking information for the phonemes thus allowing to turn the recognition into a forced alignment recognize reads data in given format samples features probabilities and writes them into another format features probabailitiues or words It is also able to provide a phoneme segmentation align is a script that repeatedly calls recognize to produce a segmentation of a database and post process that segmentation into a suitable format for mlp
35. hesis file (what has been recognized) with a reference file (what should be recognized) and count the number of errors. Figure 3.1 shows some lines of the hypothesis file test.phonemes.mel.hyp, and Figure 3.2 the corresponding lines of the reference file test.ref. The format of both hypothesis files and reference files is simple: a word string followed by the utterance identifier. The reference file typically has the .ref extension and is located in the reference subdirectory.

Three types of errors are classically considered: substitution, deletion and insertion errors. The Perl script sclite_score.pl allows to compute such statistics. First, it aligns every recognized word string from the hypothesis file to the corresponding word string from the reference file. Then it counts the errors and derives error rates as well as confidence intervals.

> sclite_score.pl -r reference/test.ref -h results/test.phonemes.mel.hyp -M 01

[Table 3.1: Word error rate (substitution, deletion, insertion) for the phoneme-based HMM systems using MEL and PLP acoustic features; the flattened figures are not reproduced here.]

[Figure 3.1: The hypothesis file test.phonemes.mel.hyp. Each line contains a recognized word string followed by the utterance identifier, for example "nine six oh fgb_9600a", "five three three mle_933a", "four one mjh_419a", "two oh six eight mhm_268a".]

Thanks to a mask option (-M 01), it is possible to filter the utterance ids and to p
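The alignment and error counting performed by sclite_score.pl follow the standard edit-distance formulation. The sketch below, a generic illustration rather than the actual script, counts substitutions, deletions and insertions for one hypothesis/reference pair (the hypothesis here is made up for the example).

def align_errors(ref, hyp):
    # Levenshtein alignment: returns (substitutions, deletions, insertions)
    R, H = len(ref), len(hyp)
    # cost[i][j] = (total errors, subs, dels, ins) for ref[:i] against hyp[:j]
    cost = [[None] * (H + 1) for _ in range(R + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        cost[i][0] = (i, 0, i, 0)                     # only deletions
    for j in range(1, H + 1):
        cost[0][j] = (j, 0, 0, j)                     # only insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            s = 0 if ref[i - 1] == hyp[j - 1] else 1
            diag, up, left = cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            cost[i][j] = min(
                (diag[0] + s, diag[1] + s, diag[2], diag[3]),   # match or substitution
                (up[0] + 1, up[1], up[2] + 1, up[3]),           # deletion
                (left[0] + 1, left[1], left[2], left[3] + 1),   # insertion
            )
    return cost[R][H][1:]

ref = "two seven five oh".split()
hyp = "two seven five five oh".split()
sub, dele, ins = align_errors(ref, hyp)
print(sub, dele, ins, "WER = %.1f%%" % (100.0 * (sub + dele + ins) / len(ref)))
# 0 0 1 WER = 25.0%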
36. high-pass filtering and multiplied by a Hamming window. Then a Fourier analysis is performed over every sample frame and the power spectrum is computed. Next, an auditory spectrum is obtained by applying a non-uniform filterbank. Finally, cepstral coefficients are derived from the auditory spectrum.

Additional processing is possible to obtain acoustic features which are less sensitive to additive and convolutional noises during operation. So far it includes logRASTA and jahRASTA processing [5] as well as Cepstral Mean Subtraction (CMS) [6], Spectral Subtraction (SS) [7] and Wiener filtering [8].

1.3.3 Back End

The back-end processing in STRUT is implemented in the program recognize. Actually, it is able to realize the whole ASR process, even computing the front end. However, you can also compute the front end separately [4] by applying mel cepstrum or rasta plp to a sample and feeding recognize with the resulting feature file. Then recognize applies the acoustic model to the acoustic vectors to obtain probability vectors and performs the decoding under the constraints of the lexicon and the language model. The decoding is based on the Viterbi algorithm with pruning techniques [2]. We describe in the following what exactly the acoustic model, the lexicon and the language model consist in.

Acoustic Model

As mentioned earlier, we assume that any sequence of acoustic vectors has been emitted by a certain sequence of HMMs.
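To make the front-end chain sketched in section 1.3.2 concrete, here is a deliberately simplified example (pre-emphasis, Hamming window, power spectrum, a crude uniform filterbank, log compression and DCT). The constants are arbitrary and the filterbank is not the auditory one used by STRUT; it is an illustration only, not the extract features code.

import numpy as np

def simple_cepstral_features(samples, frame_len=240, frame_shift=80, n_fft=256,
                             n_bands=20, n_ceps=13):
    # Pre-emphasis: a crude first-order high-pass filter
    emphasized = np.append(samples[0], samples[1:] - 0.97 * samples[:-1])
    window = np.hamming(frame_len)
    # Split the power-spectrum bins into equal bands; a real front end would use
    # a mel- or bark-spaced (non-uniform) filterbank here.
    bands = np.array_split(np.arange(n_fft // 2 + 1), n_bands)
    # DCT-II matrix used to turn log band energies into cepstral coefficients
    k = np.arange(n_bands)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_bands))
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    feats = []
    for i in range(n_frames):
        frame = emphasized[i * frame_shift:i * frame_shift + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        log_energy = np.log(np.array([power[b].sum() for b in bands]) + 1e-10)
        feats.append(dct @ log_energy)
    return np.array(feats)

# One second of white noise at 8 kHz, 30 ms frames shifted by 10 ms
x = np.random.default_rng(0).standard_normal(8000)
print(simple_cepstral_features(x).shape)   # (98, 13): 98 frames of 13 coefficients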
37. ile file type the sampling frequency in Hz sample rate the data byte ordering data format the file history step_ lines or the number of utterances utterance count Indeed STRUT supports archive files which can contain data for several speech utterances In this case every utterance is uniquely identified by an utterance identifier The list of utterance identifiers is given after the header Then the file contains the utterance data offsets sequence of binary coded integers which locate the beginning of every utterance in the data block of the file Finally we find the sample data which are stored sequentially for all the utterances In the following we describe how to process that file to actually perform the ASR tasks More details can be found in section 2 1 5 WHAT IS STRUT 9 1 3 2 Front End There exists several approaches to extract acoustic features from the speech samples Most popular front ends compute cepstral like coefficients for every analysis frame More especially two sets of coefficients are often used namely Mel frequency cepstral coefficients MEL 3 and Perceptual Linear Predictive coefficients PLP 4 Those coefficients can be computed by means of the extract features program The processing is also integrated in programs like mlp train and recognize so you don t need to pre compute the features First every speech utterance from the sample file is sliced into frames which are pre emphasized by
38. [9] H. Bourlard and N. Morgan, Connectionist Speech Recognition, Kluwer Academic Publishers, 1994.
[10] Perl language website, http://www.perl.org
[11] Perl/Tk website, http://www.perltk.org
[12] Tcl developer site, http://www.tcl.tk
[13] Python language website, http://www.python.org
[14] Aurora 2.0 webpage, http://www.elda.fr/proj/aurora2.html
39. ly the byte stream and prevent wrong byte ordering Next we specify the type of STRUT files to create namely sample file file type samples sampled at 8000 Hz 15 16 CHAPTER 2 TRAINING PROCEDURE gt sample rate 8000 Finally we provide two informative parameters which identify the language database id and the database database version After removing all the 08 files we create the archive sample file train sam by merging all the files of type samples located in the directory samples train gt create archive output samples train sam format samples dir samples train Sometimes sample files can be quite big We may want to use the lossless shorten compression algorithm A clean signal can be compressed by a factor 2 gt convert samples input samples train sam output samples train shn coding shorten 2 9 Compute a Feature File The second step of the training procedure consists in computing the acoustic vectors Usually this step will be embedded in the training program but we will do it seperately here for didactic reasons In our example we have chosen to compute the PLP features They can be obtained with the following command gt extract features input samples train shn output features train plp feature type plp which computes by default 19 PLP cepstral coefficients with frames of 30 ms shifted by 10 ms Many other setups are possibles for computing the acoustic features For example logRASTA processing
40. [Contents (continued); the recoverable entries are:]
  4.2 Training and Testing of Models
5 StrutSurfer
  5.1 The StrutSurfer Window
  5.2 Basic Functions
    5.2.1 The File Menu
    5.2.2 The Edit Menu
  5.3 ASR Functions
  5.4 The Utils Menu
  5.5 Advanced Features
    5.5.1 Computing Features and Probabilities
    5.5.2 Performing Alignment

Part II: STRUT Reference Guide
6 The Strut User Interface
  6.1 Introduction
  6.2 Case Dependency and Shell Interaction
  6.3 Argument Types
  6.4 Common Parameters
  6.5 Multiple Specification of an Argument
  6.6 Command Line Help
7 The Files
41. of an Argument

If you specify an argument that you have already specified, the value is simply overwritten. When you specify menus (for reasons of clarity in setup files), you can split them over several lines:

weights file=afac.weights
weights scratch=yes

The values you specified are not reset unless you specify a different keyword to introduce the menu. If, after saying

menu read file=null type=ascii

you say

menu write type=ascii

the file argument has been reset, as it may not be relevant for the write menu.

6.6 Command Line Help

The program that will be presented here is recognize, which has been chosen because it has many options. The operating system used here is Linux, but it works mutatis mutandis under Windows or MacOS.

General Help

If you type recognize help=yes, or simply recognize, you will get something like in figure 6.1.

[Figure 6.1 and the surrounding console screenshots show the general help printed by recognize: for each parameter it lists the name, the expected type (boolean, integer, float, filename, keyword, ...), the allowed keyword values and the default value. The screenshot text itself is not reproduced here.]
42. In order to find the most likely word string, the decoding process searches the most likely sequence of HMMs given the sequence of acoustic vectors, and outputs the corresponding word string.

The HMMs have to be trained beforehand. First, we define the topologies of the HMMs: the number of states and the transition probabilities between the states. Though these parameters can be automatically inferred, we commonly use the left-to-right topology. Figure 1.2(a) depicts the topology of the HMM for the word "eight". The left-to-right structure models the sequential mechanism of speech production. The figures on the arcs represent the transition probabilities for leaving a state to another; they sum to one. The state labels indicate which statistical distribution should be considered. The total number of states represents the minimum duration of any sequence of acoustic vectors emitted by that HMM; self-transitions model the stationary segments. To build a complete ASR application, we need several word HMMs. Their topologies are described in a HMM topology file. Appendix C.0.2 gives the HMM topology file of word-based HMMs for English digit recognition. It is sometimes preferred to use phoneme-based HMMs: any word HMM is obtained by concatenating several phoneme HMMs. Figure 1.2(b) shows the topology of the HMM for the first phoneme of the word "eight".

[Figure 1.2: Topology of a left-to-right HMM for (a) a word ("eight") and (b) a phoneme; the numbers on the arcs (0.5, 1.0) are transition probabilities.]
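The left-to-right topology with 0.5 self-loops can be written down as a transition matrix. The sketch below builds such a matrix for an arbitrary number of states (an illustration of the idea, not a STRUT data structure) and checks that each row sums to one.

import numpy as np

def left_to_right_transitions(n_states, self_loop=0.5):
    # Each state either stays where it is (self_loop) or moves to the next state;
    # the last state only loops on itself until the model is exited.
    A = np.zeros((n_states, n_states))
    for s in range(n_states - 1):
        A[s, s] = self_loop
        A[s, s + 1] = 1.0 - self_loop
    A[n_states - 1, n_states - 1] = 1.0
    return A

A = left_to_right_transitions(3)
print(A)
print(A.sum(axis=1))   # every row sums to one, as required

Because every transition either stays in the current state or moves one state forward, an HMM with N states cannot account for fewer than N frames, which is the minimum-duration property mentioned above.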
43. ormally recognized, leading to bad recognition results.

To compute features or probabilities, the user must provide a recognizer with acoustic models and an application, and then execute the appropriate command in the Research menu. Once the features or probabilities have been computed, the streams can be visualized by creating a new pane (fig. 5.8). It is also possible to load features or probabilities from a file. The file must contain the utterance ID corresponding to the current sound if this sound comes from a STRUT archive, or

[Figure 5.6: Setting the parameters of the recognizer (screenshot of the recognizer-settings dialog: models, vocabulary, beam width, word hypotheses, speech-detection thresholds, garbage model, word entrance penalty, divide-by-priors option, sample coding and frame skipping; widget labels not reproduced here).]
44. outputs You will find such files as hyp the hypothesis the output of recognize sgml the results processed by sclite html an html document describing the experiments and the results samples these are the samples files As much as possible they should be stored in the shorten format Since this files seldom change and can be pretty big this directory is never backed up The following conventions can help to describe the format of the file sam a STRUT archive with an undefined sample coding shn the same but with samples coded with the shorten algorithm pcm raw data in PCM format wav samples in Micro oft wave file format They can be read by any STRUT program script any script related to the database training script database handling scripts segmentation the segmentation of the database They can come in two formats seg the STRUT2 segmentation format which is independent of the frame shift and the sampling frequency the STRUT format which labels the frames Unfortunately this kind of file depends on the frame shift setup any setup of a STRUT program text files coming from miscellaneous sources database cdroms or internet unprocessed Appendix Environment Variables Appendix Examples of HMM topology The list and the topology of HMMs is given in the HMM topology file hmm which is typ ically located in the DATADIR application directory Two examples of HMM topolog
45. owable state sequences application file has typically the app extension and it is located in the application directory Likewise the application file for phoneme based HMMs is generated as follows gt compile asr phonemes application aurora2 phonemes hmm user dictionary application aurora2 phonemes dic mode fsg syntax application aurora2 fsg output application aurora2 phonemes app with the HMM topology file aurora2 phonemes hmm and the lexicon aurora2 phonemes dic given in Appendix C 0 2 and figure 1 4 b respectively and the same FSG file aurora2 fsg Recognize a Sample File Finally we can recognize the test set with the program recognize This program decodes a set of test utterances with respect to an application file and For example we can recognize the test set test sam with the ASR system based on MEL acoustic features and phoneme based HMMs gt recognize input samples test sam output type words output results test phonemes mel hyp models yes file models enu08 ff 234 0600 033 mel mlp decode yes application application aurora2 phonemes app The results of the recognition process are contained in the hypothesis file test phonemes mel hyp An hypothesis file has typically the hyp extension and is located in the results subdirectory Assess System Performance In order to assess an ASR system it is common to compute statistics on its recognition perfor mance To do so we compare an hypot
train expert top frame_selection        0 3 6 9 12 15 18
train expert top hidden_layer_size      1000
train expert top initial_learning_rate  8
train expert top learning_rate_rate     50
train expert top output_function        Sigmoid X Entropy
train expert top window_offset          9
train expert top window_width           19
train train_from_models

[The script then, under the comment "Include options from scripts specified in the command line", walks through the remaining command-line arguments: debug and verbose set the corresponding train flags, and any other argument is executed (do) as a Perl script so that it can contribute further options. The debug value is normalized with ParseArgs truearg, a few overrides are applied (utterance_count 100, utterance_count_xval 10, learning_rate_schedule 0.015 0.01 0.008), and finally an EmbeddedMlpTrain object is built from the settings and its process method is called.]

Figure 2.2: Training Perl script

> strutify input segmentation train fac_13a 08 output segmentation train fac_13a
      add field float name analysis_frame_length_ms value 30
                float name analysis_frame_shift_ms value 10
                string name label_coding value segments
                int name frame_size value 1
                int name label_count value 1
                string name label_format value 0
                int name label_size value 33
                string name label_type value Segment
      database id enu
      database version aurora2

and the resulting STRUT segmentation files are merged into an archive file once ...
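A rough way to automate this over a whole training set is sketched below. It is only an illustration: the directory names and path separators, the create_archive spelling (suggested by the 14-character program name recorded in the chapter 1 header example) and its format segmentation value are assumptions, and the strutify options are abridged from the command above.

#!/usr/bin/perl
# Sketch: strutify every raw segmentation file of the training set, then pack
# the strutified files into a single archive.  Paths, the create_archive name
# and the "format segmentation" value are assumptions, not documented behaviour.
use strict;
use warnings;

for my $raw (glob "segmentation/train_raw/*") {          # hypothetical input directory
    my ($utt) = $raw =~ m{([^/]+)$};                      # utterance id = file basename
    my $cmd = "strutify input $raw output segmentation/train/$utt"
            . " add field float name analysis_frame_length_ms value 30"
            . " float name analysis_frame_shift_ms value 10";   # remaining fields as above
    system($cmd) == 0 or die "strutify failed for $utt\n";
}
system("create_archive dir segmentation/train format segmentation"
     . " output segmentation/train_archive") == 0
    or die "create_archive failed\n";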
... frequency, sample coding, etc. The user may choose to associate the file extension with those settings. When the File Associations button is clicked, a window displaying the currently associated extensions pops up and provides the possibility to remove any of these extensions.

[Figure 5.1: A StrutSurfer window, with its menu bar (File, Edit, Strut, Research, Macros, Utils, Help) and its sound and WaveBar panes.]

[Figure 5.2: The File Associations setting panel, with fields for sample rate, sample encoding, channels, byte order and skip offset (bytes), an "Associate extension with these values" option and Load settings / Save settings / Cancel buttons. The surrounding window also shows the pane pop-up menu (Create Pane, Apply Configuration, Save Configuration, Properties) and the available pane types (Data Plot, Waveform, Spectrogram, Pitch Contour, Power Plot, Formant Plot, VAD, Print recognition words/phonemes/states, Display features, Display probabilities, Display segmentation, Time Axis, Transcription).]

[Figure 5.3: The File Associations management panel, with a Remove button.]

Opening a STRUT Archive

When StrutSurfer opens a STRUT archive, a window pops up with the list of all available utterances. The user can select an utterance by simply double-clicking on the utterance ID
or by typing it (fig. 5.4).

5.2.2 The Edit Menu

The Edit menu contains the following commands:

Undo and Redo: undo/redo the last command.
Copy: copy the selected region.
Cut: cut the selection.
Paste: paste the contents of the clipboard.
Add to: add the samples in the clipboard to the current sound. This can be used, for instance, to add noise to a signal (fig. 5.5).
Previous Utterance: switch to the previous utterance in the archive.
Next Utterance: switch to the next utterance in the archive.

[Figure 5.4: A small panel displaying the utterance IDs of a STRUT archive.]

[Figure 5.5: Adding the samples in the clipboard to an utterance: a StrutSurfer window showing the Waveform and WaveBar panes for utterance spkr_fr0301 (pcm), with the mouse bindings B1 Scroll, B2 Shift, B1 Zoom, Ctrl-B2 Zoom full out.]

5.3 ASR Functions

5.3.1 The Strut Menu

The Strut menu provides the basic ASR functionalities. The user can open and set up a recognizer, choose acoustic models and applications, and perform a recognition.

New Recognizer: creates a new recognizer object with default parameters, without models and application.
Open Language: lets you choose acoustic models and assign them to the currently selected recognizer.
Open Application: is not active ...
train end_with_trailing_silence     1
train feature_parameter_source      asr database frf bref features bref 16kHz mel
train input                         asr database frf bref samples bref120 16kHz part1 sam
train input_models                  asr database frf bref models frf16 ff 180 1000 036 plp lograsta mlp
train max_word_hypothesis           8
train models_dir                    dirname 0
train models_name_pattern           frf16 ff ANN SIZE plp lograsta mlp
train output_function               Sigmoid X Entropy
train output_layer_dir              dirname 0
train percentage_training           94
train remove_output_layer           1
train shuffle                       1
train skip_frames_cross             2
train skip_frames_train             4
train strut                         dev null
train expert nlda epochs            6
train expert nlda feature_multipliers 0   1 12
train expert nlda feature_multipliers 1   0 12
train expert nlda feature_multipliers 2   0 0
train expert nlda formula 0         1
train expert nlda formula 1         2 1 0 1 2
train expert nlda formula 2         2 1 2 2 2 1 2
train expert nlda frame_selection   0 3 6 9 12
train expert nlda hidden_layer_size 500 24
train expert nlda initial_learning_rate 32
train expert nlda learning_rate_rate 50
train expert nlda output_function   Sigmoid X Entropy
train expert nlda window_offset     6
train expert nlda window_width      13
train expert top epochs             6
[Settings panel (continued): feature precision (floating point), MLP precision (floating point), feature extraction (plp lograsta), an optional "start training loop from" field, utterance selection, the number of input sentences for training, and the cross-validation set size.]

Appendix A. Install STRUT

A.1 Software Installation

A.1.1 Install From Sources

A.1.2 Install From CVS

A.1.3 Install Binary Package

A.2 File System Organization

This chapter describes how the databases are organized at the STRUT development site. Although keeping the same hierarchy is not mandatory, it is always better to follow the guidelines. You can choose the name of the database and replace asr database by whatever you want. The databases are installed in subdirectories of asr database, following these rules:

- The main directory name reflects the language. By convention, languages are specified by three characters: the first two define the language (e.g. fr for French, en for English), and the third one is the first letter of the English name of a country. Examples: frf (French spoken in France), enu (English spoken in the US), eng (English spoken in Great Britain), sws (Swedish spoken in Sweden), tut (Turkish spoken in Turkey).
- Then comes the database id. So there will be directories like asr database frf bref, asr database eng wsjcam0, asr database enu wsj, asr database enu rm1 and asr database enu rm2.
- ... purposes of a good organization of related ...
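To make these conventions concrete, here is a sketch of how such a tree might look. The subdirectory names under each database id are those described in this appendix (samples, features, probabilities, models, phonetic, application, segmentation, reference, results, script, setup, text); the slash notation and the exact nesting are assumptions, not a prescribed layout.

asr database/
    frf/
        bref/
            samples/   features/   probabilities/   models/   phonetic/
            application/   segmentation/   reference/   results/
            script/   setup/   text/
    enu/
        wsj/   rm1/   rm2/
    eng/
        wsjcam0/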
... provide statistics by group of utterances. In our example, we use the first letter of the utterance IDs (the left parenthesis is ignored) to group the utterances, separating female speakers from male speakers.

Figure 3.3 gives the output of the above command. It shows that the ASR system based on MEL acoustic features and phoneme-based HMMs works fairly well, with a word error rate equal to 1.83%. For the sake of comparison, we also give the word error rates for the ASR systems based on PLP acoustic features and word-based HMMs in table 3.1.

[Figure 3.2: The reference file test ref: one line per utterance, giving the orthographic transcription (digit strings such as "five five four", "nine six oh", "two seven five oh", "five three three", "four one", "two oh six eight") followed by the utterance ID in parentheses (fgb_9600a, fba_270a, mle_933a, ...).]

[Figure 3.3: Scoring output for results test phonemes mel hyp against reference test ref: tables of correct, insertion and error percentages with their confidence intervals, broken down by speaker group, for example CORRECT 97.40 < 98.19 < 98.75 and INSERTION 0.41 < 0.75 < 1.32.]
strut version              s8   internal
data format                s12  LittleEndian
data offset                i    1204
data size                  i    49867
machine                    s4   i686
nodename                   s17  mutlu multitel be
release                    s11  2 4 8 26mdk
os version                 s32  1 Sun Sep 23 17 06 39 CEST 2001
sample_coding              s    shorten
sample_n_bytes             1
sample rate                i    8000
step count                 i    2
sysname                    s5   Linux
utterance count            i    3
utterance start offset     i    1188
step 1 command line        s153 data format big endian data size 2 database id enu database version aurora2 file type samples sample rate 8000 input fkk 8492339 08 output fkk 849z339a
step 1 program date        s44  internal executed Tue Dec 3 14 27 32 2002
step 1 program name        s8   strutify
step 2 command line        s45  dir aurora2 format samples output aurora2 sam
step 2 program date        s44  internal executed Tue Dec 3 14 30 05 2002
step 2 program name        s14  create archive
end head
[padding digits, the utterance index (mlt 9242a, mfc 7521a, fkk 849z339a) with the utterance data offsets, and the block data follow]

We present in section 2 how to obtain such a file. It begins with an ASCII header. This header starts with the STRUT 2A tag, which identifies the file as a STRUT file; then it gives the header size in bytes and a set of fields which characterize the information contained in the file. For instance, the fields can define the type of f ...
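The field encoding visible above is a field name, a type code (s followed by the string length, or i for an integer), and the value. As a rough illustration only, the following Perl sketch collects such an ASCII header up to the end head marker; it assumes one whitespace-separated field per line, which is a simplification of the real layout, and the archive name is hypothetical.

#!/usr/bin/perl
# Sketch: collect the ASCII header fields of a STRUT file into a hash.
# Stops at "end head"; assumes "name  typecode  value" per line (simplified).
use strict;
use warnings;

open my $fh, '<', 'aurora2.sam' or die $!;     # hypothetical archive name
my %field;
while (my $line = <$fh>) {
    chomp $line;
    last if $line =~ /^end head/;
    next unless $line =~ /^(.+?)\s+(s\d*|i|f)\s+(.*)$/;   # name, type code, value
    $field{$1} = { type => $2, value => $3 };
}
close $fh;
printf "%-25s %s\n", $_, $field{$_}{value} for sort keys %field;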
Naturally, this transition has a probability equal to one. The exit state, located at position 1 in the list of state identifiers, has no possible transition. The first active state, located at position 2 in the list of state identifiers, has only one transition, to the second active state. The second active state has two transitions, one to the third active state and a self transition; both are equally likely. And so on. Figure 1.2(a) depicts a graphical view of this HMM.

Likewise, we can describe the topology of the phoneme-based HMM for the phoneme ey, given in figure C.1(b) and depicted in figure 1.2(b).

[Figure C.1: Topology of a left-to-right HMM for (a) the word eight and (b) the phoneme ey. The eight entry starts with the header line 0 13 eight and the state identifier list 1 2 0 0 1 1 2 2 3 3 4 4 5.]

C.0.2 Word-Based HMM Set

We give here a typical HMM topology file for word-based HMMs of English digits (aurora2 words hmm):

[Listing of aurora2 words hmm: the PHONE header giving the number of HMMs, a SilenceState declaration, and the entries for the digit models, among them 0 13 eight with its state identifier list and transition lines.]
... samples are fed into the ASR system, which outputs a word string. Typically, an ASR system is divided into a front-end part and a back-end part.

1.1.2 Front End

The front end computes, along the speech samples, information that is relevant to the ASR process. Practically, a set of coefficients is computed from a frame of speech samples which is typically 20-30 ms long. These coefficients, also called acoustic features, are gathered into the so-called acoustic vector. This vector is intended to represent the spectral content of the current frame. Next, the frame is shifted by 10-20 ms and the computation is repeated. Eventually, we obtain a sequence of acoustic vectors which represents the temporal evolution of the spectral content of a given sequence of speech samples.

1.1.3 Back End

The back end interprets the sequence of acoustic vectors to find out the pronounced words. This processing is generally called the decoding process. Clearly, the problem is very complex, since the number of words and the word boundaries are unknown. The back end relies on three sources of information to perform its decoding, namely the acoustic model, the lexicon and the language model.

Acoustic Model

Given a sequence of acoustic vectors, the back end searches for the sequence of words which has most likely produced it. To do so, a stochastic model of every possible word in terms of acoustic vectors, the so-called acoustic model, is required. Currently, Hidden Markov Models ...
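To fix orders of magnitude for the front end of section 1.1.2, take a 25 ms frame and a 10 ms shift (one possible choice inside the ranges above) at a sampling rate of 8 kHz, the rate used for the aurora2 data elsewhere in this guide: a frame then contains 0.025 × 8000 = 200 samples, the shift corresponds to 80 samples, and a 1 s utterance yields about (1.0 − 0.025)/0.010 + 1 ≈ 98 acoustic vectors, i.e. roughly one hundred vectors per second.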
As we mentioned in section 1.3.3, the training of an acoustic model requires a state-by-state segmentation. If we assume that such a segmentation is available, it is straightforward to obtain an acoustic model as described above. However, the segmentation may be approximate, and the resulting acoustic model inaccurate. In order to refine the acoustic model, the segmentation should be recomputed: this is the so-called realignment process.

Assume we have a coarse first segmentation, not necessarily in STRUT format. First we need to strutify the segmentation. To do so, we prepare for every utterance a file, for example located in the segmentation train directory, in which we store in a binary format the following information:

<1st state label>  <1st state right boundary>   int2  float4
<2nd state label>  <2nd state right boundary>   int2  float4
<3rd state label>  <3rd state right boundary>   int2  float4
...

The segmentation for every utterance is then strutified.

#!/usr/bin/perl
push @INC, '/usr/local/asr/strut/lib/perl modules';
use File::Basename;
require ParseArgs;
require ann::EmbeddedMlpTrain;

train alignment_iteration        2
train alignment_stop_criterion   iteration
train application                asr database frf bref application bref120 train app
train beam_width                 100
train bunch_size                 32
train cache_size                 25000
train divide_by_priors           0
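A rough Perl sketch of writing such a per-utterance file is given below. It assumes that each record stores the state label as a 2-byte integer (int2) followed by the right boundary as a 4-byte float (float4), in native byte order, and the file name is hypothetical; check these assumptions against what strutify actually expects before relying on it.

#!/usr/bin/perl
# Sketch: dump a coarse segmentation as binary <int2 label><float4 right boundary>
# records, one file per utterance.  Record layout and byte order are assumptions.
use strict;
use warnings;

my @states = ( [ 7, 0.31 ], [ 19, 0.74 ], [ 2, 1.12 ] );   # hypothetical (label, boundary) pairs

open my $fh, '>:raw', 'segmentation/train/utt_0001' or die $!;   # hypothetical file name
print {$fh} pack('s f', @$_) for @states;    # s = int2, f = float4 (both native)
close $fh;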
... s8 discrete
probability count      i 41
phoneme set id         s5 TIMIT
probability coding     s4 none
end head

probability: (s) probability description, discrete or mlp.
probability coding: (s) none or log.
probability count: (i) number of output probabilities.

Chapter 8. Training Perl Scripts

As MLPs are trained via an iterative use of programs like mlp train, mlp cross validate and recognize, a tool is provided to automatically generate Perl scripts that will call those programs with appropriate parameters (fig. 8.1). This tool also checks that the files specified are consistent: for instance, every utterance in the input sample file must appear in the segmentation file and/or the application file. You create an initial Perl script, then modify it if you need any customization for your experiment, and finally run it. If you are using zsh, you might run your script like this:

> my training script pl &!

The ampersand tells the shell to run the command in the background, and the exclamation mark tells the terminal to disown the job. This way you can close your window and shut down your laptop; if you run the script on a remote machine, the job will continue to run in the background.

[Figure 8.1: The script-generation dialog, with the run options and the action to perform (train), the training procedure files, the MLP and alignment settings, the language (French) and country (France), the sample rate and the feature settings.]
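If your shell does not support zsh's "&!" disowning shortcut, a rough equivalent using the standard nohup utility is shown below; the script name is a placeholder.

nohup perl my_training_script.pl > train.log 2>&1 &

This keeps the job running after the terminal is closed, with its output captured in train.log.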
... segmentation pane and choose a file by right-clicking into this pane.

5.5.2 Performing Alignment

It is fairly easy to perform an alignment, or any kind of alternate recognition, with StrutSurfer. The user only needs to open a segmentation pane and load an application. StrutSurfer then automatically generates a recognizer with the same settings as the current recognizer, except for the application, which is the one the user just loaded, and performs the recognition. The recognition results are then displayed in the segmentation pane. Note that you need to have acoustic models associated with your recognizer; if this is not the case, StrutSurfer warns you. By default, if several segmentations are available (words, phonemes and/or states, controlled by the application), the most detailed one is used.

You can also perform an alignment using an archive of applications. As usual, in this case StrutSurfer looks for the current utterance ID, if any. If the sound buffer was not extracted from an archive, then the archive of applications must contain only one utterance. If the correct utterance cannot be found, StrutSurfer silently returns.

5.6 Macros

StrutSurfer provides a macro mechanism so that the user does not have to continuously repeat the same operations. Combined with the configurations provided by the underlying WaveSurfer layer, it makes it very fast and easy to perform the same task on several files or utterances.
Besides the parameters individually defined for each command, a few parameters are common.

help: the help parameters allow getting help on a command, or general help on the command-line parsing. The value is a keyword in the list:
  help: print this text;
  yes: print help on the command or the sub-menu;
  all: print all the help available for this command (current level plus sub-menus);
  latex: print the help in LaTeX format;
  html: print the help in HTML format;
  tcl: create and run a Tcl/Tk script to ease the specification of the parameters. You need Tcl/Tk installed on your system.

data dir variable: print the name of the environment variable which tells where the partially specified files are to be found.

data dir: print the contents of the data dir variable.

verbose: this menu controls the setting of some debugging variables:
  oobp: the larger this integer, the more verbose the oobp mechanism;
  memory: print information about the memory in use by the program;
  stream header: print debugging information about what is found in the header of the files;
  print params: print parameter values after command-line parsing.

setup: the appearance of such an argument causes the program to read the specified filename. The file is supposed to contain arguments to the program, with the same syntax as on the command line, except that empty lines, or lines beginning with the comment character, are ignored. Setup files can be nested to any level. For good readability, long lines can be split: simply put a backslash ...
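As an illustration of such a setup file, the lines below reuse the recognize parameters from chapter 3; the "#" comment character and the trailing backslash continuation are assumptions to be checked against your STRUT version, and the file name is arbitrary.

# recognize.setup: parameters for a recognition experiment (hypothetical example)
input samples test sam
output type words
output results test phonemes mel hyp
models yes file models enu08 ff 234 0600 033 mel mlp \
    decode yes
application application aurora2 phonemes app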
... ting new features. Also, in order to compute the features, mlp train needs a template STRUT file from which it reads the header; this is a good place to put such files. The extension should reflect the main feature extraction parameters, like plp lograsta or lpc jrasta. Different options must be separated by a dash:

plp: Perceptual Linear Predictive speech analysis;
mel: mel-frequency scale cepstral coefficients;
lpc: linear predictive coefficients;
rasta: RASTA filtering;
lograsta: log-RASTA;
jrasta: J-RASTA;
spsub: spectral subtraction;
wiener: Wiener filtering.

probabilities: temporary files containing state probabilities (output of the MLP).

models: the filename should reflect as much as possible what this model is good for, for instance frf08 180 0600 036 plp jrasta mlp:
mlp: multilayer perceptron weights file;
gmm: Gaussian mixture models.

phonetic: the files needed to perform an alignment. STRUT1 old phonetic files should go there, but it is expected that they will slowly disappear. Here are the conventions for the STRUT2 files:
app: the archive of app files, as needed by the program recognize to perform a segmentation (state alignment) of a database;
lxd: the files used to create the app files.

reference: the references for the test databases:
ref: orthographic description of the test sentences.

result: the speech recognition experiments' outputs.
... ware but simply has an installed version of the package. Having some experience with speech recognition is of course a big advantage: this user's guide does not provide any theoretical information, and there are good tutorials in the literature. The user is assumed to have a good knowledge of Unix, since it is the preferred operating system to run the programs. Users should complete the guide sequentially, and within a short time they will have all the background knowledge needed to train and run an automatic speech recognizer.

1.1 What Is ASR?

In this section we give an overview of what an automatic speech recognition (ASR) system consists of. Advanced readers who are already familiar with ASR can proceed directly to section 1.3. Beginner readers can find more detailed information in the reference books [1, 2].

[Figure 1.1: A typical ASR system, consisting of a microphone, a front end and a back end: the speech samples are turned into acoustic vectors, which are decoded into words using the acoustic model, the lexicon and the language model.]

1.1.1 Overview

Figure 1.1 outlines the generic structure of an ASR system. For a given speech waveform, the ASR system produces the word string that is most likely associated with that waveform. The speech waveform is recorded by means of a microphone, which converts acoustic pressure into an electrical signal. The speech signal is then sampled. The resulting speech sample ...
... files are presented in the following sections: word-based HMMs and phoneme-based HMMs, respectively. We describe here the format of these files.

Every HMM topology file begins with the tag PHONE, directly followed by the number of HMMs in the file. Then comes an optional line which indicates the silence state identifier; this line is required if the silence state identifier is different from 0. The remainder of the file contains the description of the topologies of the HMMs.

Figure C.1(a) shows the topology of the word-based HMM for the word eight. First we have the line

0 13 eight

This HMM has 0 as identifier and eight as label. It counts 13 states. Then we find a list of 13 state identifiers:

1 2 0 0 1 1 2 2 3 3 4 4 5

The first two state identifiers correspond to the entry (1) and exit (2) states. They are dummy states, not associated with any statistical distribution for the emission of acoustic vectors, which serve as connecting states between HMMs. Next we find the state identifiers of the active states. Every state identifier corresponds to a position in the probability vector given by the acoustic model (the output of the MLP). Then we find the description of the connections between the states. In this example, the entry state, located at position 0 in the list of state identifiers, allows a single transition to the state located at position 2 in the list of state identifiers.
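Putting this together with the interpretation given around figure C.1 earlier in this appendix, each transition line can be read as a state position, the number of transitions leaving it, and then one (target position, probability) pair per transition. A reconstruction of the first few connection lines of the eight entry that is consistent with that reading, but is not a verbatim copy of the file, would be:

0 13 eight
1 2 0 0 1 1 2 2 3 3 4 4 5
0 1 2 1.0
2 1 3 1.0
3 2 3 0.5 4 0.5

Here 0 1 2 1.0 says that the state at position 0 (the entry state) has a single transition, to position 2, with probability 1.0; 2 1 3 1.0 moves from the first active state to the second; and 3 2 3 0.5 4 0.5 gives the second active state its self loop and its forward transition, each with probability 0.5. How the exit state (position 1), which has no outgoing transition, is represented is not specified by the description above, so it is left out of this sketch.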
