Contents

1. for each cluster (source phrase), read its sub-DPR model W and predict the DPR probabilities (normalised) for each instance. (overloaded) for each cluster (source phrase), read its sub-DPR model W and predict the DPR probabilities (normalised) for each instance. given a test corpus, extract all source phrases that appear in it, store the source features in the file outputFileName and return a sourcePositionMap dictionary.
   Name: probPredictionFunction.h (.cpp)
   Function: contain functions to generate DPR probabilities for phrase options of each develop/test sentence.
   Public functions (continued):
   • void smt_createSourceCluster(char inputFileName, phraseNgramDict ngramDictFR, phraseNgramDict ngramDictEN, phraseNgramDict tagsDict_fr, phraseNgramDict tagsDict_en, wordClassDict wordDict_fr, wordClassDict wordDict_en, int maxPhraseLength, int maxNgramSize, int zoneConf, relabelFeature relabelDict, phraseTranslationTable trainingPhraseTable, char outputFileName, sourcePositionMap sourcePositionDict)
   • void smt_createSourceCluster(string sourceSentence, phraseNgramDict ngramDictFR, phraseNgramDict ngramDictEN, phraseNgramDict tagsDict_fr, phraseNgramDict tagsDict_en, wordClassDict wordDict_fr, wordClassDict wordDict_en, int maxPhraseLength, int maxNgramSize, int zoneConf, relabelFeature relabelDict, phraseTranslationTable trainingPhraseTable
2. target word Pos); map: the target-to-source alignment (target word Pos -> source word Pos).
   Public functions:
   • alignArray(): constructor, create an empty alignment file.
   • alignArray(string alignmentString): get the word alignments from a string.
   • vector<int> getFRtoEN_alignment(int sourcePos): return the corresponding target POSs for a source POS.
   • vector<int> getENtoFR_alignment(int targetPos): return the corresponding source POSs for a target POS.
   • bool checkFRtoEN_alignment(int sourcePos): check if the source POS is null aligned.
   • bool checkENtoFR_alignment(int targetPos): check if the target POS is null aligned.
   4.3 Constructing and processing a sample (phrase pair) pool
   Name: phraseConstructionFunction.h (.cpp)
   Function: contain functions to construct the sample pool.
   Public functions:
   • bool smt_construct_phraseNgramDict(char inputCorpusFile, char ngramDictFile, int maxNgram, int minPrune): construct the ngram dictionary for the source/target word/word class (tag) corpus.
   • phraseNgramDict smt_construct_phraseNgramDict(char inputCorpusFile, char ngramDictFile, int maxNgram, int minPrune, bool overloadFlag): (overloaded) construct the ngram dictionary for the source/target word/word class (tag) corpus.
   • bool smt_construct_wordDict(char wordClassDictFile, char inputCorpus: construct the word class dictionary and create the tag corpus for the source
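The alignArray interface above can be illustrated with a minimal sketch. It assumes the alignment string uses the "source-target" index-pair format written by the MOSES training scripts (for example "0-0 1-2 2-1"); the class name AlignSketch and its method names are hypothetical and are not the package's own code.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for alignArray; names and layout are assumptions.
class AlignSketch {
public:
    // Parse a line such as "0-0 1-2 2-1" (source-target position pairs).
    explicit AlignSketch(const std::string& alignmentString) {
        std::istringstream in(alignmentString);
        std::string pair;
        while (in >> pair) {
            const std::string::size_type dash = pair.find('-');
            if (dash == std::string::npos) continue;  // skip tokens without a dash
            const int src = std::stoi(pair.substr(0, dash));
            const int tgt = std::stoi(pair.substr(dash + 1));
            frToEn_[src].push_back(tgt);
            enToFr_[tgt].push_back(src);
        }
    }

    // Target positions aligned to a source position (empty if none).
    std::vector<int> targetsOf(int sourcePos) const { return lookup(frToEn_, sourcePos); }
    // Source positions aligned to a target position (empty if none).
    std::vector<int> sourcesOf(int targetPos) const { return lookup(enToFr_, targetPos); }

    bool sourceIsNullAligned(int sourcePos) const { return frToEn_.count(sourcePos) == 0; }
    bool targetIsNullAligned(int targetPos) const { return enToFr_.count(targetPos) == 0; }

private:
    typedef std::map<int, std::vector<int> > AlignMap;

    static std::vector<int> lookup(const AlignMap& m, int pos) {
        const AlignMap::const_iterator it = m.find(pos);
        return it == m.end() ? std::vector<int>() : it->second;
    }

    AlignMap frToEn_;  // source word position -> target word positions
    AlignMap enToFr_;  // target word position -> source word positions
};
```

Keeping both directions in separate maps makes the null-alignment checks a single lookup, at the cost of storing every link twice.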
3. Public functions (continued):
   • string getClusterName()
   • unsigned long long writeWeightCluster(ofstream& outputFile)
   • void getWeightCluster(ifstream& inputFile, int numClass, unsigned long long startPos)
   • void structureLearningW(vector<vector<int>> phraseTable, int maxRound, float step, float eTol)
   • vector<float> structureLearningConfidence(vector<int> featureList)
   • vector<float> structureLearningConfidence(vector<int> sourceFeature, vector<int> targetFeature)
   Descriptions, in the order given: constructor, create an empty sub-DPR model (i.e. a weight cluster); constructor, read a sub-DPR model from an input file; get the number of classes; get the name (source phrase) of the sub-DPR model; output the sub-DPR model to the file outputFile; read the sub-DPR model from an input file; train a sub-DPR model using the structured learning algorithm; return the confidence W x for each class; (overloaded) return the confidence W^T x for each class.
   Name: relabelFeature.h (.cpp)
   Function: store the relabel dictionary for ngram features.
   Members:
   • featureRelabel (map): the relabel dictionary of ngram features.
   • countFeatureRelabel (int): the number of features in the relabel dictionary.
   Public functions: relabelFeature(); relabelFeature(char relabelFilename); int insertFeature(int featureIndex); int getRelabeledFeature(int featureIndex); int getNumFeature(); void writeRelabelFeat
4. the maximum length of ngrams used in the ngram feature dictionary (usually 3 or 4; default 4).
   20. windowSize: the window size around the source phrases (usually 3 or 4; default 3). See (Ni et al., 2009) for details.
   21. minPrune: prune the ngram features that occur fewer than minPrune times (default 1). See (Ni et al., 2009) for details.
   22. minTrainingExample: prune the source phrases that occur fewer than minTrainingExample times (default 10), because the discriminative model does not work well when the training set is too small.
   23. maxRound: the maximum number of iterations (default 500). See (Ni et al., 2009) for details.
   24. step: the step size (learning rate) of the DPR model (default 0.05). See (Ni et al., 2009) for details.
   25. eTol: the error tolerance for training the DPR model (default 0.001). See (Ni et al., 2009) for details.
   2.3.4 Generating training samples for the DPR model
   After completing the configuration file, generating training samples for the DPR model is straightforward. Just execute the command
   smt_mainProcess_construct_phraseDB myConfigurationFile
   It will generate the following files for training the DPR model:
   • SourceCorpusFile.tags: the word class tags for the source corpus (each line is a sentence).
   • TargetCorpusFile.tags: the word class tags for the target corpus (each line is a sentence).
   • SourceCorpusF
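To make the roles of maxNgramSize and minPrune concrete, here is a minimal sketch of ngram counting with frequency pruning. It is an illustration under assumptions (whitespace tokenisation, ngrams keyed by their space-joined words); the function name countNgrams is hypothetical and the package's phraseNgramDict may store its entries differently.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Count every word ngram of length 1..maxNgramSize in a corpus (one sentence
// per string), then drop the ngrams seen fewer than minPrune times.
std::map<std::string, int> countNgrams(const std::vector<std::string>& corpus,
                                       int maxNgramSize, int minPrune) {
    std::map<std::string, int> counts;
    for (const std::string& sentence : corpus) {
        // Whitespace tokenisation (an assumption about the corpus format).
        std::istringstream in(sentence);
        std::vector<std::string> words;
        for (std::string w; in >> w;) words.push_back(w);

        for (std::size_t start = 0; start < words.size(); ++start) {
            std::string ngram;
            for (int n = 0; n < maxNgramSize && start + n < words.size(); ++n) {
                if (n > 0) ngram += ' ';
                ngram += words[start + n];
                ++counts[ngram];  // count the ngram of length n + 1
            }
        }
    }
    // Prune rare ngrams; with minPrune == 1 nothing is removed.
    for (std::map<std::string, int>::iterator it = counts.begin(); it != counts.end();) {
        if (it->second < minPrune) it = counts.erase(it);
        else ++it;
    }
    return counts;
}
```

With the documented default minPrune = 1 every observed ngram is kept; raising it trades feature coverage for a smaller dictionary.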
5. the sample extraction module. The red block denotes the main process for this module (i.e. main.cpp), the blue block denotes the function library containing all functions needed in this module, and the black blocks are the classes. An arrow from block A to block B indicates that block B directly calls functions or uses classes in block A.
   [Figure 4.2 blocks: smt_mainProcess_generatePhraseOption.cpp; probPredictionFunction.h (.cpp); phraseConstructionFunction.h (.cpp); phraseTranslationTable.h (.cpp); sentencePhraseOption.h (.cpp); corpusPhraseDB.h (.cpp); sentenceArray.h (.cpp); weightMatrix.h (.cpp); wordClassDict.h (.cpp); phraseNgramDict.h (.cpp); phraseReorderingTable.h (.cpp); relabelFeature.h (.cpp); alignArray.h (.cpp)]
   FIGURE 4.2: The relationships among classes, function libraries and main processes in the DPR probability generation module. The red block denotes the main process for this module (i.e. main.cpp), the blue block denotes the function library containing all functions needed in this module, and the black blocks are the classes. An arrow from block A to block B indicates that block B directly calls functions or uses classes in block A.
   - phraseReorderingTable.h (.cpp): Store phrase pairs with their reordering distances (orientation class).
   - phraseTranslationTable.h (.cpp): Store source phrases a
6. you can change the mode of the file by using chmod: chmod u+x your_file
   • /bin/sh: ./check-dependencies.pl: /usr/bin/perl^M: bad interpreter: No such file or directory. This is due to the different encoding of line endings (the CR, carriage return) between Windows and Linux/Unix, which breaks the script check-dependencies.pl in the directory MOSES_tools/scripts. You can use the Perl script delDots.pl to solve the problem. Just do the following:
     perl delDots.pl check-dependencies.pl check-dependencies1.pl
     delete check-dependencies.pl
     mv check-dependencies1.pl check-dependencies.pl
   • ERROR: Cannot find mkcls, GIZA++, & snt2cooc.out in . Did you install this script using 'make release'? at moses-script/scripts-20100427-2119/training/train-factored-phrase-model.perl line 152. This might happen when you use train-factored-phrase-model.perl to train a MOSES system. The solution is to search for "my $BINDIR" in train-factored-phrase-model.perl and modify the line to my $BINDIR = "your_directory_to_GIZA". The file is in the directory MOSES_tools
   Chapter 3 Preliminary results
   We now test the new MT system (MOSES with DPR) on an MT task: French-to-English translation. The EuroParl corpus (French-English) was used, from which we extracted sentence pairs where both sentences had between 1 and 100 words and where the ratio of the lengths was no more than 2:1. The training set had 50K se
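The "bad interpreter" error in the troubleshooting list above comes from a carriage return at the end of the script's first line. The script delDots.pl is not reproduced in this excerpt; assuming its effect is to remove those carriage returns, the same clean-up can be sketched in C++ as follows (the function name stripCarriageReturns is hypothetical).

```cpp
#include <algorithm>
#include <string>

// Remove every carriage return ('\r') so that Windows line endings ("\r\n")
// become Unix line endings ("\n"). Lone '\r' characters are removed as well.
std::string stripCarriageReturns(std::string text) {
    text.erase(std::remove(text.begin(), text.end(), '\r'), text.end());
    return text;
}
```

Applied to the contents of check-dependencies.pl, this turns a first line such as "#!/usr/bin/perl\r\n" into "#!/usr/bin/perl\n", which the shell can resolve.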
7. P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. Moses: open source toolkit for statistical machine translation. In Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, 2007.
   Y. Ni, C. Saunders, S. Szedmak, and M. Niranjan. Handling phrase reorderings for machine translation. In Proceedings of the joint conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing (ACL-IJCNLP 2009), pages 241-244, Singapore, 2009.
   F. J. Och. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL 2003), Japan, September 2003.
8. char outputFileName, sourcePositionMap sourcePositionDict): (overloaded) given a test corpus, extract all source phrases that appear in it, store the source features in the file outputFileName and return a sourcePositionMap dictionary.
   Name: probPredictionFunction.h (.cpp)
   Function: contain functions to generate DPR probabilities for phrase options of each develop/test sentence.
   Public functions (continued):
   • sentencePhraseOption smt_collectPhraseOptions(char inputFileName, phraseNgramDict ngramDictFR, phraseNgramDict ngramDictEN, phraseNgramDict tagsDict_fr, phraseNgramDict tagsDict_en, wordClassDict wordDict_fr, wordClassDict wordDict_en, int maxPhraseLength, int maxNgramSize, int zoneConf, relabelFeature relabelDict, phraseReorderingTable trainingPhraseTable, char weightFileName, weightMatrixW weightMatrix): create phrase options for each test sentence. Format: sentenceIndex, [left_boundary, right_boundary], target translations, reordering probabilities.
   • sentencePhraseOption smt_collectPhraseOptions(char inputFileName, phraseNgramDict ng: (overloaded) create phrase options for each test sentence. Format: sentenceIndex, [left_boundary, right_boundary], target translations, reordering probabilities.
9. number of translations for a source phrase (default 100).
   • tableFilterLabel: 0: the MOSES phrase table has not been filtered; 1 (default): the MOSES phrase table has been filtered.
   • (output) weightMatrixFile: the filename of the DPR model.
   • weightMatrixTrainLabel: 0: you do not need to train a DPR model (e.g. you have trained it before); 1 (default): train a DPR model.
   • (output) phraseOptionFile: a file that stores the phrase options (phrase pairs with their DPR probabilities) for each sentence in TestFile. Line i contains the phrase options for sentence i. This file will then be used by a MOSES decoder.
   • TestFile: the file containing the source test sentences. The phrase options (with their DPR probabilities) will be generated for these sentences only.
   • batchOutputLabel: 0: collect phrase options for one sentence at a time (uses less memory but is very slow); 1 (default and recommended): collect phrase options for all sentences at once (uses a lot of memory but is very fast).
   For the DPR parameter settings:
   16. maxPhraseLength: the phrase pairs up to length maxPhraseLength (default 7) will be extracted.
   17. classSetup: the class setup of the DPR model. Currently the model only supports the 3-class setup and the 5-class setup. See (Ni et al., 2009) for details.
   18. distCut: prune the phrase pairs whose reordering distances are longer than distCut (default 15), to avoid some alignment errors caused by GIZA++.
   19. maxNgramSize:
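The defaults listed above can be gathered in one place, together with the pruning rule they imply. This is a sketch under assumptions: the struct and function names are hypothetical, the name maxTranslation for the first (truncated) entry is taken from the code guide's input list, the length limit is assumed to apply to both sides of a phrase pair, and the reordering distance is assumed to be compared by absolute value.

```cpp
#include <cstdlib>

// Hypothetical container for the settings described above, with the
// documented default values.
struct DprSettingsSketch {
    int maxTranslation = 100;        // translations kept per source phrase
    int tableFilterLabel = 1;        // 1: the MOSES phrase table has been filtered
    int weightMatrixTrainLabel = 1;  // 1: train a DPR model
    int batchOutputLabel = 1;        // 1: collect options for all sentences at once
    int maxPhraseLength = 7;         // longest phrase (in words) to extract
    int distCut = 15;                // largest reordering distance to keep
};

// Decide whether an extracted phrase pair survives the length and distance
// limits (assumption: both phrase lengths and |distance| are checked).
bool keepPhrasePair(const DprSettingsSketch& s, int sourceLength,
                    int targetLength, int reorderingDistance) {
    return sourceLength <= s.maxPhraseLength &&
           targetLength <= s.maxPhraseLength &&
           std::abs(reorderingDistance) <= s.distCut;
}
```

For example, with the defaults a pair of lengths 3 and 4 at distance -15 is kept, while the same pair at distance 16 is pruned.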
10. word class ngram dictionaries; the word class dictionary for source words; the word class dictionary for target words; the phrase pairs extracted for the DPR model; the relabel dictionary for ngram features; currently only two class setups (3 and 5) are supported; the max length of ngram features; prune ngram features that occur fewer than minPrune times; the window size of the environment for feature extraction; extract phrases up to length maxPhraseLength; cut examples whose reordering distances are longer than distCut; maximum iteration for training weight matrix W, see (Ni et al., 2009); the learning rate of the PSL algorithm, see (Ni et al., 2009); the error tolerance for training weight matrix W, see (Ni et al., 2009); the phrase translation table from MOSES (using Moses's filtered translation table is recommended).
    • filterLabel: 1: the phrase translation table has been filtered; 0: otherwise.
    • batchLabel: 1: store all sentence options first, then output them at once (uses a lot of memory but is fast); 0: collect and output phrase options for one sentence at a time (uses less memory but is slower).
    • maxTranslation: the maximum number of translations for each source phrase; if 0, use all translations.
    • minTrainingExample: the minimum number of training examples required.
    Outputs:
    • fout_weightMatrix (weightMatrixFile): the output file for the DPR model.
    • fout_phraseOptionDB (phraseOptionFile): the phrase option database for test sentences.
11. UNIVERSITY OF SOUTHAMPTON
    Distance phrase reordering for MOSES: User Manual and Code Guide
    by Yizhao Ni, Mahesan Niranjan, Craig Saunders and Sandor Szedmak
    Technical Report
    Faculty of Engineering, Science and Mathematics, School of Electronics and Computer Science
    April 30, 2010
    ABSTRACT
    We describe the implementation of a novel distance phrase reordering (DPR) model for a public domain statistical machine translation (SMT) system, MOSES (http://www.statmt.org/moses/). The model mainly focuses on the application of machine learning (ML) techniques to a specific problem in machine translation: learning the grammatical rules and content-dependent changes, which are simplified as phrase reorderings. This document serves two purposes: a user manual for the functions of the DPR model and a code guide for developers.
    Contents
    Acknowledgements
    1 Introduction
      1.1 Distance phrase reordering
      1.2 Copyright announcement
    2 User manual
      2.1 Source code
      2.2 Compilation
      2.3 How to use
        2.3.1 Training a MOSES system
        2.3.2 Prerequisite
        2.3.3 Generating a parameter configuration file
        2.3.4 Generating training samples f
12. ame: weightMatrix.h (.cpp)
    Function: train and store the weight matrix (matrices) of the DPR model. The file contains two classes.
    Class weightMatrixW
    Members:
    • weightMatrix (map): store the start positions of all sub-DPR models (one for each source phrase) in a weight matrix database. Format: source phrase, its DPR model in the database.
    • numCluster (int): the number of clusters (source phrases).
    Public functions: weightMatrixW(): constructor, create an empty dictionary. Further descriptions, in the order given: constructor, read the dictionary from an input file; get the number of clusters (source phrases); output the position dictionary to the file outputFileName (format: source phrase, start position); insert the start position of a new sub-DPR model; get the start position of a sub-DPR model for a source phrase.
    Name: weightMatrix.h (.cpp), continued
    Class weightClusterW
    Members:
    • weightCluster (map): store orientation, ngram features, feature values.
    • numOrientation (int): the number of orientation classes.
    • sourcePhrase (string): the source phrase for this sub-DPR model.
    • distMatrix (float matrix): the distance matrix for structured learning.
    Public functions: weightClusterW(string source, int numClass); weightClusterW(ifstream& inputFile, int numClass, unsigned long long startPos); int getNumOrientation
13. ar inputFileName, int maxTranslations): constructor, read the phrase pairs from an input file; for each phrase, extract the top maxTranslations translations.
    • phraseTranslationTable(char inputFileName, corpusPhraseDB testPhraseDB): constructor, read the phrase pairs that appear in testPhraseDB from an input file.
    • phraseTranslationTable(char inputFileName, corpusPhraseDB testPhraseDB, int maxTranslations): constructor, read the phrase pairs that appear in testPhraseDB from an input file; for each phrase, extract the top maxTranslations translations.
    • vector<string> getClusterNames(): get all source phrases in the phrase table.
    • int getNumCluster(): get the number of clusters (unique source phrases).
    • int getNumPhrasePair(): get the number of phrase pairs in the phrase table.
    • vector<string> getTargetTranslation(string sourcePhrase): get target translations for a source phrase.
    • int getNumberofTargetTranslation(string sourcePhrase): get the number of target translations for a source phrase.
    The descriptions of the two preceding constructors are: constructor, create an empty phrase table; constructor, read the phrase pairs from an input file.
    4.4 Constructing a DPR model
    Public functions of weightMatrixW: weightMatrixW(char inputFileName); int getNumCluster(); void writeWeightMatrix(char outputFileName); void insertWeightCluster(string sourcePhrase, unsigned long long startPos); unsigned long long getWeightClusterPOS(string sourcePhrase)
    N
14. d mert-moses.pl in the directory MOSES_tools/scripts/training. To see these modifications, simply search for "DPR" in the files. A class DPR_reordering.h (.cpp) in the directory MOSES_tools/moses/src is also created as an interface between the DPR model and the MOSES decoder.
    Name: DPR_reordering.h (.cpp)
    Function: an interface between the DPR model and the MOSES decoder.
    Members:
    • m_dprOptionStartPOS (vector): store start positions of phrase options for each sentence (i.e. start positions of each line in the sentence option file).
    • sentenceOptionFile (ifstream): the ifstream handle of the sentence option file.
    • sentenceID (long int): the test sentence ID.
    • sentencePhraseOption (map): store phrase options for each sentence.
    • classSetup (int): the number of orientations.
    • unDetectProb (float): the constant DPR probability for a phrase pair which is not in the sentence options.
    • WDR_cost (vector): the word distance-based reordering costs.
    Public functions:
    • DPR_reordering(ScoreIndexManager& scoreIndexManager, const string filePath, const string classString, const vector<float>& weights)
    • size_t GetNumScoreComponents() const
    • string GetScoreProducerDescription() const
    • string GetScoreProducerWeightShortName() const
    • FFState Evaluate(const Hypothesis& cur_hypo, const FFState prev_state, ScoreComponentCollection accumulator) const
    • const FFState EmptyHypoth
15. decoder/branches/DPR_MOSES MOSES_tools
    This will copy all source code (MOSES with DPR) to your local machine in the directory MOSES_tools.
    2.2 Compilation
    To compile the MOSES system, the readers are referred to the MOSES user guide (Koehn and Hoang, 2009). Note that the directory created in this report (i.e. MOSES_tools) is equivalent to the directory tools/moses mentioned in (Koehn and Hoang, 2009). To compile the DPR model, you need to go to the directory MOSES_tools/DPR_model and execute the following command:
    makeFile
    If the program is compiled successfully, it will generate three executables:
    • smt_mainProcess_configuration
    • smt_mainProcess_construct_phraseDB
    • smt_mainProcess_generatePhraseOption
    2.3 How to use
    The DPR package consists of two modules: a sample extraction module (smt_mainProcess_construct_phraseDB) and a DPR probability generation module (smt_mainProcess_generatePhraseOption). The former is used to extract all samples (phrase pairs) for training a DPR model, while the latter is then used to generate the DPR probabilities for different phrase pairs.
    2.3.1 Training a MOSES system
    Since the DPR model requires some outputs from MOSES, you need to train a MOSES system before training a DPR model. The MOSES user guide will help you to complete this step.
    2.3.2 Prerequisite
    The DPR model requires the following outputs from a MOSES system:
    • Th
16. e guide
    4.2 Processing a sentence
    Name: sentenceArray.h (.cpp)
    Function: store the words for a sentence.
    Members:
    • sentence (string array): store the words of a sentence.
    • sentenceLengh (int): store the sentence length.
    Public functions:
    • sentenceArray(): constructor, create an empty sentence.
    • sentenceArray(string sentenceString): constructor, get words from a sentence string.
    • sentenceArray(string sentenceString, wordClassDict wordDict): constructor, get the words and transform them to tags.
    • string getPhraseFromSentence(int startPos, int endPos): return the phrase sentence[startPos, endPos].
    • string getPhraseFromSentence(int startPos): return the word sentence[startPos].
    • int getSentenceLength(): return the length of the sentence.
    Name: wordClassDict.h (.cpp)
    Function: store the word class label for each word.
    Members:
    • wordClassDictionary (map): store the words (string) and the word class labels (int).
    • readDictCheck: 0: cannot find the dictionary file; 1: otherwise.
    • numWords: the number of words in the dictionary.
    Public functions:
    • wordClassDict(char dictFileName): constructor, read a dictionary file.
    • bool checkReadFileStatus(): check the read status of the dictionary.
    • void createWCFile(char inputFile, char outputFile): output the dictionary to the file outputFile.
    • int getNumWords(): get the size of the dictionary.
    • int getWordClass(string word): get the word class label of a word.
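The sentenceArray interface above amounts to splitting a sentence into words and returning contiguous spans. A minimal sketch follows, assuming whitespace tokenisation, 0-based positions and an inclusive endPos; the excerpt does not state these conventions, and the class name SentenceSketch is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for sentenceArray (0-based, inclusive positions are
// assumptions).
class SentenceSketch {
public:
    explicit SentenceSketch(const std::string& sentenceString) {
        std::istringstream in(sentenceString);
        for (std::string w; in >> w;) words_.push_back(w);
    }

    int length() const { return static_cast<int>(words_.size()); }

    // Words startPos..endPos joined by single spaces; empty string when the
    // requested span falls outside the sentence.
    std::string phrase(int startPos, int endPos) const {
        if (startPos < 0 || endPos >= length() || startPos > endPos) return "";
        std::string out = words_[startPos];
        for (int i = startPos + 1; i <= endPos; ++i) out += ' ' + words_[i];
        return out;
    }

private:
    std::vector<std::string> words_;
};
```

For the sentence "la maison bleue est grande", phrase(1, 2) yields "maison bleue" and length() yields 5.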
17. e source/target word class dictionary. After training a MOSES system, two files named fr.vcb.classes and en.vcb.classes are located in a local directory root_directory/corpus. Alternatively, you can use mkcls to train more accurate word class dictionaries (e.g. by increasing training rounds, using a different number of word classes, etc.).
    • The word alignment file. A file named aligned.grow-diag-final-and, which is in the directory root_directory/model.
    • The phrase table generated by MOSES. A file named phrase-table.gz is located in the directory root_directory/model, and you need to unzip it before using it. Alternatively, to reduce the processing time of DPR, it is highly recommended to use a filtered phrase table. That is, use the MOSES script filter-model-given-input.pl to filter the phrase table and use the filtered table instead.
    Footnotes: Read the paragraph under Section "Get the Latest Moses Version" in (Koehn and Hoang, 2009). The root_directory is the directory defined by the option --root-dir when training a MOSES system.
    2.3.3 Generating a parameter configuration file
    To construct a DPR model, the first step is to generate a parameter configuration file by calling
    smt_mainProcess_configuration myConfigurationFile
    A file named myConfigurationFile will then be created, which contains all the information needed for the rest of the process. You need to fill in all item
18. esisState() const
    • void clearSentencePhraseOption() const
    • void constructSentencePhraseOption() const
    • float generateReorderingProb(size_t boundary_left, size_t boundary_right, size_t prev_boundary_right, string targetPhrase) const
    • int createOrientationClass(int dist) const
    Descriptions, in the order given: constructor, read the sentence option file; return the number of score components (i.e. 1); return the name of the DPR model; return the short name of the weight for the DPR model (i.e. wDPR); compute DPR probabilities for the current extending phrase pair; initialisation function; clear the phrase options in the option database; construct phrase options for the current translating sentence; generate DPR probabilities for a phrase pair; create the orientation class using the reordering distance.
    Bibliography
    C. Callison-Burch, C. Fordyce, P. Koehn, C. Monz, and J. Schroeder. (Meta-) evaluation of machine translation. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 136-158, Prague, Czech Republic, June 2007.
    P. Koehn, A. Axelrod, A. B. Mayne, C. Callison-Burch, M. Osborne, and D. Talbot. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT 2005), Pittsburgh, PA, October 2005.
    P. Koehn and H. Hoang. Moses installation and training run-through. In http://www.statmt.org/moses_steps.html, December 2009.
19. ex> reordering probabilities
    Public functions:
    • sentencePhraseOptionSTR(): constructor, create an empty phrase option list.
    • sentencePhraseOptionSTR(char inputFileName): constructor, read phrase options from the file inputFileName.
    • void outputPhraseOption(char outputFileName): output all phrase options to the file outputFileName.
    • void outputPhraseOption(ofstream& outputFile): (overloaded) output all phrase options to the file outputFileName.
    • void createPhraseOption(int sentenceIndex, unsigned short phrase_boundary, mapTargetProbOptionSTR targetProbs): compute the DPR probabilities for a phrase pair and update the phrase option list.
    • void createPhraseOption(unsigned short phrase_boundary, mapTargetProbOptionSTR targetProbs): (overloaded) compute the DPR probabilities for a phrase pair and update the phrase option list.
    • vector<float> getPhraseProbs(int sentenceIndex, unsigned short phrase_boundary, string targetPhrase, int numClass): get the target translations and their DPR probabilities for a source phrase.
    4.6 The configuration process
    Name: smt_configuration.cpp
    Function: generate a configuration file for the DPR model.
    4.7 Other modifications on MOSES
    To integrate the DPR model into the MOSES decoder, modifications are made to the MOSES files Parameter.cpp, StaticData.h (.cpp) and Makefile.am in the directory MOSES_tools/moses/src an
20. hapter 2, the DPR package consists of two modules: a sample extraction module (smt_mainProcess_construct_phraseDB) and a DPR probability generation module (smt_mainProcess_generatePhraseOption). The relationships among classes, function libraries and main processes are illustrated in Figure 4.1 and Figure 4.2. In the following we provide a summary of the package framework.
    • The main processes: smt_mainProcess_construct_phraseReorderingDB.cpp and smt_mainProcess_generatePhraseOption.cpp.
    • Processing a sentence:
      - sentenceArray.h (.cpp): Store the words (or word class tags) for a sentence.
      - wordClassDict.h (.cpp): Store the word class label for each word.
      - phraseNgramDict.h (.cpp): Store the word/word class ngram features.
      - alignArray.h (.cpp): Store the word alignments for each sentence pair.
    • Constructing and processing a sample (phrase pair) pool:
      - phraseConstructionFunction.h (.cpp): Contain functions to construct the sample (phrase pair) pool.
      - corpusPhraseDB.h (.cpp): Store the source phrases that appear in the train/test corpus.
    [Figure 4.1 blocks: smt_mainProcess_construct_phraseReorderingDB.cpp; phraseConstructionFunction.h (.cpp); corpusPhraseDB.h (.cpp); sentenceArray.h (.cpp); wordClassDict.h (.cpp); phraseNgramDict.h (.cpp); relabelFeature.h (.cpp); alignArray.h (.cpp)]
    FIGURE 4.1: The relationships among classes, function libraries and main processes in
21. Name: phraseNgramDict.h (.cpp)
    Function: store the word/word class ngram features.
    Members:
    • phraseDict (map): store each phrase (ngram), its feature label, length and frequency.
    • readDictCheck: 0: cannot find the dictionary file; 1: otherwise.
    • ngramIndex: the ngram label used when constructing the dictionary.
    Public functions:
    • phraseNgramDict(char dictFileName): constructor, read a dictionary file.
    • phraseNgramDict(): constructor, create an empty dictionary file.
    • void insertNgram(string key, int ngramLength): insert a new ngram feature.
    • void deleteNgram(string key): delete an ngram feature.
    • int getNgramIndex(string key): get the label of an ngram feature.
    • int getNgramOccurance(string key): get the frequency of an ngram feature.
    • int getNgramLength(string key): get the length of an ngram feature.
    • vector<int> getNgramItems(string key): get the label, length and frequency of an ngram feature.
    • bool findNgram(string key): search for an ngram feature in the dictionary.
    • bool checkReadFileStatus(): check the read status of a dictionary.
    • void outputNgramDict(char dictFileName, int minOccurenceCut): output the dictionary to the file dictFileName.
    • int getNumFeature(): get the number of features in this dictionary.
    Name: alignArray.h (.cpp)
    Function: store the word alignments for each sentence.
    Members: align_FRtoEN, align_ENtoFR. map: the source-to-target alignment (source word Pos
22. ile.ngramDict: the ngram feature dictionary constructed using the source word corpus.
    • TargetCorpusFile.ngramDict: the ngram feature dictionary constructed using the target word corpus.
    • SourceCorpusFile.tagsDict: the ngram word class dictionary constructed using the source word class corpus.
    • TargetCorpusFile.tagsDict: the ngram word class dictionary constructed using the target word class corpus.
    • phraseTableFile: the file containing all extracted samples (phrase pairs) for training the DPR model.
    • phraseTableFile.featureRelabel: the relabel dictionary for the ngram features.
    2.3.5 Training the DPR model and generating DPR probabilities
    The final step is to execute the command
    smt_mainProcess_generatePhraseOption myConfigurationFile
    and the following files will be generated:
    • weightMatrixFile: the DPR model.
    • weightMatrixFile.startPosition: the start position of each sub-DPR model (one for each unique source phrase).
    • phraseOptionFile: the phrase options (each line is a sentence) for the TestFile corpus.
    • phraseOptionFile.startPosition: the start position of each line in phraseOptionFile.
    The phrase option files (i.e. phraseOptionFile and phraseOptionFile.startPosition) will then be used by the MOSES decoder.
    2.3.6 Integrating the DPR model into MOSES
    To integrate the DPR model into MOSES, you need to use the MOSES software package we pro
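The start-position files described above let a reader jump straight to the options of one sentence instead of scanning the whole file. The following sketch shows the idea with standard stream offsets. It is an illustration only: the on-disk format of the .startPosition files is not given in this excerpt, and the function names are hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Record the stream offset at which every line begins.
std::vector<std::streampos> lineStartPositions(std::istream& in) {
    std::vector<std::streampos> starts;
    std::string line;
    while (true) {
        const std::streampos pos = in.tellg();  // offset before reading the line
        if (!std::getline(in, line)) break;     // no more lines
        starts.push_back(pos);
    }
    in.clear();  // reset the end-of-file state so the stream can be reused
    return starts;
}

// Jump to a recorded offset and read exactly one line (e.g. the phrase
// options of one sentence).
std::string readLineAt(std::istream& in, std::streampos start) {
    in.clear();
    in.seekg(start);
    std::string line;
    std::getline(in, line);
    return line;
}
```

With an in-memory stream such as std::istringstream holding "a b\nc d e\nf\n", the second recorded offset leads readLineAt to return "c d e"; the same two functions work unchanged on a std::ifstream.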
23. maxPhraseLength; testCorpusFile (TestFileName).
    Descriptions, in the order given: the source corpus (text file); the target corpus (text file); the word alignment file (text file) from GIZA++; the word class dictionary for source words; the word class dictionary for target words; the max length of ngram features; prune ngram features that occur fewer than minPrune times; the window size of the environment for feature extraction; extract phrases up to length maxPhraseLength; (optional) the source test corpus used to filter the phrase DB.
    Outputs:
    • fout_phraseDB (phraseTableFile): the output file of the phrase DB. Format: source phrase, target phrase, reordering dist, features.
    • fout_relabelDB: the relabel dictionary of ngram features.
    Name: smt_mainProcess_generatePhraseOption.cpp
    Function: A. Learn the sub-DPR model for each source cluster. B. Construct the phrase option database.
    Inputs: soucreCorpus (TestFile); sourceCorpus_tr (SourceCorpusFile); targetCorpus_tr (TargetCorpusFile); wordClassFile_fr (SourceWordClassFile); wordClassFile_en (TargetWordClassFile); extractPhraseTable (phraseTableFile); relabelDict; classSetup; maxNgramSize; minPrune; windowSize; maxPhraseLength; distCut; maxRound; step; eTol; phraseTranslationTable.
    Descriptions, in the order given: the source test corpus; the name of the source training corpus, for reading word/word class ngram dictionaries; the name of the target training corpus, for reading word
24. ment, int zoneConf, int maxPhraseLength, int maxNgramSize, relabelFeature featureRelabelDB, ofstream& fout, corpusPhraseDB testPhraseDB): (overloaded) extract all consistent phrase pairs up to length maxPhraseLength that appear in testPhraseDB for a sentence pair, using the word alignments. Time complexity O N
    • void smt_constructPhraseReorderingDB(char sourceCorpusFile, char targetCorpusFile, char wordAlignmentFile, char tagsSourceFile, char tagsTargetFile, char phraseDBFile, phraseNgramDict ngramDictFR, phraseNgramDict ngramDictEN, phraseNgramDict tagsDictFR, phraseNgramDict tagsDictEN, int zoneConf, int maxPhraseLength, int maxNgramSize, char featureRelabelDBFile): extract all consistent phrase pairs with their reordering distances and ngram features for all sentences in sourceCorpusFile.
    • void smt_constructPhraseReorderingDB(char sourceCorpusFile, char targetCorpusFile, char wordAlignmentFile, char tagsSourceFile, char tagsTargetFile, char phraseDBFile, phraseNgramDict ngramDictFR, phraseNgramDict ngramDictEN, phraseNgramDict tagsDictFR, phraseNgramDict tagsDictEN, int zoneConf, int maxPhraseLength, int maxNgramSize, char featureRelabelDBFile, char testFileName): extract all consistent phrase pairs that appear in testFileName, with their reordering distances and ngram features, for all sentences in sourceCorpusFile.
    Name: corp
25. n and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
    • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
    • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
    • Redistributions of source code for commercial purposes should contact the copyright holder.
    If you use this software in your scientific work, please cite the work (Ni et al., 2009).
    Chapter 2 User manual
    The purpose of this chapter is to offer a step-by-step example of downloading, compiling and constructing a DPR model and its related integrating framework, i.e. the MOSES decoder (Koehn et al., 2007) and the minimum error rate training (MERT) (Och, 2003).
    2.1 Source code
    The DPR model is integrated into MOSES as a feature function. Therefore, you also need a MOSES software package to run the program. A MOSES package including the DPR model is available at the following location (the additional metadata named DPR_MOSES.zip):
    http://eprints.ecs.soton.ac.uk/20939/
    Alternatively, the source code is also available via Subversion from Sourceforge by executing the following commands:
    mkdir MOSES_tools
    svn co https://mosesdecoder.svn.sourceforge.net/svnroot/moses
26. ncy of machine translation. Different from the lexicalized reordering model used in MOSES (Koehn et al., 2005), this model considers the sentence context as well as the relationships between phrase movements, by means of a newly emerging structured learning paradigm. As observed by the authors, the DPR model works well on some language pairs that contain many differences in word ordering (e.g. Chinese to English). This document does not describe the underlying framework in depth, and readers are referred to Ni et al. (2009) for more details about the model.

1.2 Copyright announcement

Copyright (c) 2010 Yizhao Ni. All rights reserved.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Redistributio
27. nd their translations from a phrase table generated by Moses, to ensure the consistency between the two phrase pair databases.
• Constructing a DPR model
  weightMatrix.h (.cpp): train and store the weight matrix (matrices) of the DPR model.
  relabelFeature.h (.cpp): store the relabel dictionary for ngram features, to reduce the size of the feature expression.
• Generating DPR probabilities
  probPredictionFunction.h (.cpp): contain functions to generate DPR probabilities for phrase options of each develop/test sentence.
  sentencePhraseOption.h (.cpp): store phrase options, including target translations and DPR probabilities, for each develop/test sentence.
• The configuration process
  smt_configuration.cpp
• Other modifications on MOSES
  DPR_reordering.h (.cpp): an interface between the DPR model and the MOSES decoder.
  Parameter.cpp, StaticData.h (.cpp), Makefile.am, mert-moses.pl

The following sections specify the members and public functions for each class, function library and main process.

4.1 The main processes

Name: smt_mainProcess_construct_phraseReorderingDB.cpp
Function: extract samples (phrase pairs) for training a DPR model.
Inputs: soucreCorpus SourceCorpusFile, targetCorpus TargetCorpusFile, wordAlignmentFile alignmentFile, wordClassFile_fr SourceWordClassFile, wordClassFile_en TargetWordClassFile, maxNgramSize, minPrune, windowSize
28. ntences, whilst the develop and the test sizes were fixed at 1K sentences. For parameter tuning, minimum error rate training (MERT) (Och, 2003) was applied. Experiments were repeated three times to assess variance, and the performance was evaluated by four standard MT measurements, namely word error rate (WER), BLEU, NIST and METEOR (see Callison-Burch et al., 2007 for details). Table 3.1 shows the translation results. In most cases, importing a DPR model improved the translation quality, especially the METEOR score.

Table 3.1: Evaluations for MT experiments. Bold numbers refer to the best results.

System              BLEU         WER          NIST          METEOR
MOSES LR WDR        26.1 ± 0.1   39.0 ± 0.4   6.67 ± 0.04   48.7 ± 0.3
MOSES DPR LR WDR    26.5 ± 0.3   39.0 ± 0.1   6.68 ± 0.04   50.9 ± 0.2
MOSES DPR WDR       26.3 ± 0.1   38.9 ± 0.3   6.68 ± 0.04   50.7 ± 0.1
MOSES DPR           26.3 ± 0.1   39.1 ± 0.2   6.66 ± 0.04   50.8 ± 0.1

The corpus can be downloaded at http://www.statmt.org/europarl

Chapter 4 Code guide

This chapter gives an overview of the code. The DPR model is implemented using object-oriented principles, and developers can gain a general idea of its class organisation from this chapter. All source code is in the directory MOSES_tools/DPR_model, and each class, function library and main process contains a brief description of its members and functions at the beginning of its .h (.cpp) file. As mentioned in C
29. or the DPR model
2.3.5 Training the DPR model and generating DPR probabilities
2.3.6 Integrating the DPR model into MOSES
2.3.7 Minimum error rate training (MERT)
2.3.8 Decoding
2.4 Trouble shooting
3 Preliminary results
4 Code guide
4.1 The main processes
4.2 Processing a sentence
4.3 Constructing and processing a sample phrase pair pool
4.4 Constructing a DPR model
4.5 Generating DPR probabilities
4.6 The configuration process
4.7 Other modifications on MOSES
Bibliography

Acknowledgements

The work was supported by the PASCAL Network, the School of Electronics and Computer Science, University of Southampton, and the European Commission under the IST Project SMART (FP6-033917). Moreover, particular thanks are owed to Assistant Prof. Philipp Koehn and Dr Hieu Hoang from the University of Edinburgh, who provided valuable suggestions during this circuitous process.

Chapter 1 Introduction

1.1 Distance phrase reordering

The distance phrase reordering (DPR) model mainly focuses on the application of machine learning (ML) techniques to a specific problem in machine translation: learning the grammatical rules and content-dependent changes, which are simplified as phrase reorderings. It models the problem with a classification framework and aims at improving the flue
30. ramDictFR phraseNgramDict ngramDictEN phraseNgramDict tagsDict fr phraseNgramDict tagsDict_en wordClassDict wordDict fr word ClassDict wordDict_en int maxPhraseLength int maxNgramSize int zoneConf relabelFeature relabelDict phraseTranslationTable trainingPhraseTable char weightFileName weight MatrixW weightMatrix int classSetup Chapter 4 Code guide 31 Name probPredictionFunction h cpp Function contain functions to generate DPR probabilities for phrase options of each develop test sentence Public Functions Continued void smt_collect PhraseOptions char inputFileName phraseNgramDict ngramDictFR phraseNgramDict ngramDictEN phraseNgramDict tagsDict fr phraseNgramDict tagsDict_en wordClassDict wordDict fr word ClassDict wordDict_en int maxPhraseLength int maxNgramSize int zoneConf relabelFeature relabelDict phraseTranslationTable trainingPhraseTable char weightFileName weightMatrixW weightMatrix int classSetup char outPhraseOptionFileName overloaded create phrase options for each test sentence Format sentenceIndex left_boundary right_boundary target translations reordering probabilities 32 Chapter 4 Code guide Name sentencePhraseOption h cpp Function store phrase options including target translations and DPR probabilities for each develop test sentence The file contains
31. s listed below. Note that certain items have been assigned default values.

General part:
1. SourceCorpusFile: the source corpus for the training; each line is a sentence.
2. TargetCorpusFile: the target corpus for the training; each line is a sentence.
3. SourceWordClassFile: the source word class dictionary from MOSES (or mkcls), i.e. fr.vcb.classes.
4. TargetWordClassFile: the target word class dictionary from MOSES (or mkcls), i.e. en.vcb.classes.

For extracting samples (phrase pairs) for the DPR model:
5. alignmentFile: the word alignment file generated by MOSES (e.g. aligned.grow-diag-final-and).
6. output phraseTableFile: the file containing all samples (phrase pairs) for the DPR model.
7. TestFileName: only source phrases appearing in this file will be extracted from the training corpus to form the sample pool. To facilitate the training process, it is highly recommended to define this file as the combination of the develop and the test sets, i.e. a text containing all source sentences from the develop and the test sets.

For generating the DPR probabilities:
8. PhraseTranslationTable: the phrase table generated by MOSES (i.e. the unzipped phrase-table.gz). It is highly recommended to use the filtered phrase table (see Part V, "Filtering Test Data", in Koehn and Hoang, 2009).
9. maxTranslations: the maximum
32. target char tagsCorpus corpus wordClassDict smt_construct_wordDict overloaded construct the word class char wordClassDictFile char inputCorpus dictionary and create the tag corpus for char tagsCorpus bool overloadF lag the source target corpus vector lt int gt smt_extract_ngramFeature extract ngram features around a source or sentenceArray sentence phraseNgramDict target phrase ngramDictionary int zoneL int zoneR int flag int maxNgramSize void smt_consistPhrasePair extract all consistent phrase pairs upto sentenceArray sentenceFR length maxPhraseLength for a sentence pair sentenceArray sentenceEN using the word alignments sentenceArray tagFR sentenceArray tagEN Time complexity O N phraseNgramDict ngramDictFR phraseNgramDict ngramDictEN phraseNgramDict tagsDictFR phraseNgramDict tagsDictEN alignArray sentenceAlignment int zoneConf int maxPhraseLength int maxNgramSize relabelFeature featureRelabelDB ofstream amp fout Chapter 4 Code guide 21 Name phraseConstructionFunction h cpp Function contain functions to construct the sample pool Public Functions Continued void smt_consistPhrasePair sentenceArray sentenceFR sentenceArray sentenceEN sentenceArray tagFR sentenceArray tagEN phraseNgramDict ngramDictFR phraseNgramDict ngramDictEN phraseNgramDict tagsDictFR phraseNgramDict tagsDictEN alignArray sentenceAlign
33. tory_to_moses/moses-cmd/src/moses your_directory_to_model/model/moses.ini --working-dir your_working_directory --rootdir your_directory_to_scripts --decoder-flags "-v 0" --lambdas=wDPR:0.5,0.1-1.5 --activate d_1,lm,tm,w,wDPR

The command tells MERT that the initial weight for the DPR model is 0.5 (you can also define weights for other parameters such as d, lm, tm and w) and that the weight ranges between 0.1 and 1.5. Five weights need tuning: d_1 (the word distance-based reordering model), lm (the language model), tm (the phrase translation model), w (the word penalty) and wDPR (the DPR model).

2.3.8 Decoding

Once you have obtained the tuned parameters for the MOSES decoder, use the following command to decode the test sentences:

your_directory_to_moses/moses-cmd/src/moses -config your_directory_to_model/model/moses.ini -input-file your_directory_to_source/your_source_test 1> your_directory_to_target/your_target_translation 2> your_directory_to_log/log_file

The translations will be written to the file your_target_translation, and a log file log_file will also be created. Now enjoy the distance phrase reordering model!

2.4 Trouble shooting

When you compile the files or execute the commands, you might meet the following errors:
• Permission denied: make sure the file is executable
34. two classes.

Class: sentencePhraseOption
Members:
  phraseOption (map): store the phrase options. Format: sentenceID -> left_boundary right_boundary -> target translations index -> reordering probabilities
  numSen (int): store the number of sentences
Public Functions:
  sentencePhraseOption(): constructor; create an empty phrase option list
  void createPhraseOption(int sentenceIndex, unsigned short phrase_boundary, mapTargetProbOption targetProbs): compute the DPR probabilities for a phrase pair and update the phrase option list
  void createPhraseOption(unsigned short phrase_boundary, mapTargetProbOption targetProbs): (overloaded) compute the DPR probabilities for a phrase pair and update the phrase option list
  void outputPhraseOption(ofstream& outputFile, int sentenceIndex, sentenceArray sentence, phraseTranslationTable trainingPhraseTable): output all phrase options to the outputFile file
  int getNumSentence(): get the number of sentences

Name: sentencePhraseOption.h (.cpp) (continued)
Function: store phrase options, including target translations and DPR probabilities, for each develop/test sentence. The file contains two classes.

Class: sentencePhraseOptionSTR
Members (also members inherited from sentencePhraseOption):
  phraseOption (map): store the phrase options. Format: sentenceID -> left_boundary right_boundary -> target translations ind
35. up int distCut
  Constructors: the first creates an empty phrase table; the second reads a phrase table from the inputFileName file.
  createOrientationClass(int dist, int classSetup): create the orientation class from the reordering distance of a phrase pair
  int getClusterMember(string sourcePhrase): get the number of examples in this cluster
  vector<string> getClusterNames(): get all source phrases in the phrase table
  int getNumCluster(): get the number of clusters (unique source phrases)
  int getNumPhrasePair(): get the number of phrase pairs in the phrase table
  int getNumOrientatin(): get the class setup
  vector<vector<int>> getExamples(string sourcePhrase, ifstream& inputFile): get the examples with their ngram features, stored in a vector
  vector<unsigned long long> getPositionIndex(): get the start positions (in a position file) of ngram features for all phrase pairs

Name: phraseTranslationTable.h (.cpp)
Function: store source phrases and their translations from a phrase table generated by Moses.
Members:
  phraseTranslationTable (map): a phrase table; store source phrases -> target translations
  numCluster (int): the number of clusters (unique source phrases) in the phrase table
  numPhrasePair (int): the number of phrase pairs in the phrase table
Public Functions:
  phraseTranslationTable
  phraseTranslationTable char inputFileName
  phraseTranslationTable ch
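To make the role of createOrientationClass, classSetup and distCut more concrete, here is a minimal C++ sketch of a distance-to-class mapping. The manual only states that the class setup is 3 or 5 and that the class is derived from the reordering distance. The thresholds, the class numbering, the default distCut value of 5 and the function name orientationClass are all assumptions made for illustration, not the behaviour of the DPR source code.

```cpp
// Hypothetical mapping from a reordering distance to an orientation class.
// The class ids and thresholds are illustrative assumptions only.
int orientationClass(int dist, int classSetup, int distCut = 5) {
    if (classSetup == 3) {
        if (dist < 0) return 0;   // phrase moved backward
        if (dist == 0) return 1;  // monotone, no reordering
        return 2;                 // phrase moved forward
    }
    // Five-class setup: split each direction into a near and a far class.
    if (dist <= -distCut) return 0;  // far backward
    if (dist < 0) return 1;          // near backward
    if (dist == 0) return 2;         // monotone
    if (dist < distCut) return 3;    // near forward
    return 4;                        // far forward
}
```

A finer class setup lets the classifier distinguish short-range from long-range movements, at the cost of fewer training examples per class.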
36. ures char dictFileName constructor create an empty relabel dictionary constructor read a relabel dictionary from an input file relabel and insert an ngram feature return an ngram feature s relabeled feature return the number of relabeled features output the relabel dictionary to dictFileName file 28 Chapter 4 Code guide 4 5 Generating DPR probabilities Name probPredictionFunction h cpp Function contain functions to generate DPR probabilities for phrase options of each develop test sentence Public Functions void smt_sourceClusterPrediction weightClusterW wt ifstream amp sourceFeatureFileName phraseFeaturePositionMap sourceFeaturePosition targetFeatureMap target Translation sentencePhraseOption phraseOption void smt_sourceClusterPrediction weightClusterW wt ifstream amp sourceFeatureFile phraseFeaturePositionMap sourceFeaturePosition source TargetFeatureMapSTR const_iterator sourceTarget Found sentencePhraseOptionSTR phraseOption void smt_createSourceCluster char inputFileName phraseNgramDict ngramDictFR phraseNgramDict ngramDictEN phraseNgramDict tagsDict fr phraseNgramDict tagsDict_en wordClassDict wordDict fr word ClassDict wordDict_en int maxPhraseLength int maxNgramSize int zoneConf relabelFeature relabelDict phraseReorderingTable trainingPhraseTable char outputFileName sourcePositionMap sourcePositionDict
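The idea of a relabel dictionary, which replaces each ngram feature by a compact label to reduce the size of the feature expression, can be sketched in a few lines of C++. The class name RelabelDictSketch, its method names and the choice of consecutive integer labels are assumptions made for illustration. The actual relabelFeature class may store and number its labels differently.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical relabel dictionary: maps each ngram feature string to a
// compact integer label, assigned in order of first insertion.
class RelabelDictSketch {
public:
    // Relabel and insert a feature; an already known feature keeps its label.
    int insert(const std::string& feature) {
        auto it = ids_.find(feature);
        if (it != ids_.end()) return it->second;
        int id = static_cast<int>(ids_.size());
        ids_.emplace(feature, id);
        return id;
    }

    // Return the label of a feature, or -1 if it was never inserted.
    int lookup(const std::string& feature) const {
        auto it = ids_.find(feature);
        return it == ids_.end() ? -1 : it->second;
    }

    // Number of relabeled features stored so far.
    int size() const { return static_cast<int>(ids_.size()); }

private:
    std::unordered_map<std::string, int> ids_;
};
```

Storing small integers instead of the ngram strings keeps the feature files and weight vectors compact, which is the purpose the manual gives for the relabel dictionary.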
37. usPhraseDB.h (.cpp)
Function: store the source phrases that appear in the train/test corpus.
Members:
  phraseDB (map): store the phrases that appear in the corpus
  numPhrase (int): the number of phrases
  maxPhraseLength (int): the maximum phrase length in this phrase DB
Public Functions:
  corpusPhraseDB(): constructor; create an empty phrase DB
  corpusPhraseDB(char inFileName, int MAXPLENGTH): constructor; create a phrase DB for an input corpus
  corpusPhraseDB(char inFileName, int MAXPLENGTH, bool readDict): constructor; read the phrase DB from a DB file
  bool checkPhraseDB(string phrase): check if a phrase appears in the phrase DB
  int getNumPhrase(): return the number of phrases
  int getMaxPhraseLength(): return the maximum phrase length
  void outAllPhrases(char outFileName): output all phrases to the outFileName file. Format: phrase phraseIndex

Name: phraseReorderingTable.h (.cpp)
Function: store the phrase pairs with their reordering distances (orientation class).
Members:
  phraseTable (map): store the source phrases with the orientation classes and the ngram features
  numCluster (int): the number of clusters (source phrases)
  numPhrasePair (int): the number of phrase pairs stored
  positionIndex (vector): store the start position of ngram features for each phrase pair (in a position file)
Public Functions:
  sourceReorderingTable
  sourceReorderingTable char inputFileName int classSet
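As an illustration of what a phrase DB such as corpusPhraseDB holds, the following C++ sketch collects every word sequence of up to maxPhraseLength words from one sentence. The function name collectPhrases, whitespace tokenisation and the use of std::set are assumptions made for this example. The real class also assigns a phrase index and can read and write DB files, which the sketch omits.

```cpp
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: return all phrases (contiguous word sequences) of at
// most maxPhraseLength words found in a whitespace-tokenised sentence.
std::set<std::string> collectPhrases(const std::string& sentence,
                                     int maxPhraseLength) {
    std::istringstream in(sentence);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);

    std::set<std::string> phrases;
    const std::size_t maxLen = static_cast<std::size_t>(maxPhraseLength);
    for (std::size_t i = 0; i < words.size(); ++i) {
        std::string phrase;
        // Grow the phrase one word at a time, starting at position i.
        for (std::size_t j = i; j < words.size() && j - i < maxLen; ++j) {
            if (j > i) phrase += ' ';
            phrase += words[j];
            phrases.insert(phrase);
        }
    }
    return phrases;
}
```

Checking whether a phrase appears in the corpus, as checkPhraseDB is described as doing, then amounts to a lookup such as phrases.count("b c") > 0 on the collected set.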
38. vided, as some MOSES source code has been modified (see Section 4.7). Meanwhile, the following lines should be added to the file root_directory/model/moses.ini:

DPR file your_directory_to_phraseOptionFile/phraseOptionFile
wDPR the weight for the DPR model (e.g. 0.5)
class DPR the class for the DPR model (choose 3 or 5, depending on the DPR model trained)

This tells the MOSES decoder where the DPR probability file is and what the weight for the DPR model is.

2.3.7 Minimum error rate training (MERT)

To use MERT, you need to use the MOSES scripts package we provide, as some source code of the scripts has been modified (see Section 4.7). The scripts package is in the directory MOSES_tools/scripts, and the command is:

your_directory_to_scripts/training/mert-moses.pl your_directory_to_source/your_source_file your_directory_to_target/your_target_file your_directory_to_moses/moses-cmd/src/moses your_directory_to_model/model/moses.ini --working-dir your_working_directory --rootdir your_directory_to_scripts --decoder-flags "-v 0"

If you would like to switch the DPR model or other reordering models on or off, you can use the configurations lambdas and activate. For example:

your_directory_to_scripts/training/mert-moses.pl your_directory_to_source/your_source_file your_directory_to_target/your_target_file your_direc
