Home

SHOW User Manual - silico.biotoul.frsilico.biotoul.fr

1. 0 209391 0 790609 0 174707 0 825293 0 0927552 0 907245 0 081693 0 918307 0 0806847 0 919315 0 125776 0 874224 0 161663 0 838337 0 235672 0 764328 0 221932 0 778068 0 178636 0 821364 0 114087 0 885913 0 0881827 0 911817 0 0877433 0 912257 O 0 925877 0741231 4 The show_viterbi executable The show_viterbi executable performs the Viterbi algorithm on a sequence or a set of sequences 15 4 1 Viterbi algorithm The Viterbi algorithm enables the computation of the most likely hidden state path s given the observed sequence xj and the model parameters 0 a b s arg max P ST s XT x7 arg max Pa ST s1 XT 27 st st To overcome numerical problems we present a version of Viterbi by taking the logarithm of all the probabilities involved in the algorithm Starts with t 1 with log Po S1 v 11 log by x1 a v For t increasing from 2 to n 1 compute by recurrence the logarithm of the maximal probability of the most probable path enabling to generate the observations 71 2441 that ends in position t 1 in state S44 v t _ ot EN t 1 max log Pa Si 51 541 V T 51 max 1og bo a1 24 7 1 a u v max log Pista ul si Find s s s3 s by backtracing S arg max max log Polo b 5s 7 Sn v 21 Sq st arg max log by 2141 2f 11 alu st max log PS s Si ul 1 The same memory saving approximation has been implemented as for forward backward al
2. the value of ptrans will evolve during iterations of EM The keyword label can be set on the first line of the transition description It enables to choose an identifier which can be used with the keyword tied_to when setting up models with tied parameters The example below shows how to use this feature BEGIN_STATE state_id cdsi _3 BEGIN_TRANSITIONS label trans_cds _3 identifier is trans_cds _3 type 1 state cds1 _1 ptrans 0 99 type 1 state stop _1 ptrans 0 01 END_TRANSITIONS BEGIN_OBSERVATIONS seq genomic_dna type 1 order 2 pobs random excepted TGA TAG TAA END_OBSERVATIONS END_STATE BEGIN_STATE state_id cds2 _3 BEGIN_TRANSITIONS tied_to trans_cds _3 state cds2 _1 P cds2 _3 gt cds2 _1 state stop _1 P cds2 _3 gt stop _1 END_TRANSITIONS BEGIN_OBSERVATIONS seq genomic_dna type 1 order 2 pobs random excepted TGA TAG TAA END_OBSERVATIONS END_STATE P cds1 _3 gt cds1 _1 P cds1 _3 gt stop _1 In this example the states cds1 _3 and cds2 _3 correspond to the third codon position of a model allowing two types of compositions for coding sequences The outgoing transitions of this two states are tied they are identical and simultaneously estimated when running show_emfit This feature can be used to ensure that the state transition probabilities will be the same for two or more hidden states and can also be useful to decrease the number of parameters and then give an easier and bette
3. OF THE MODEL AND PARAMETERS ESTIMATION m lt modelid gt c 2c 3c 4c 1c_si 2c_si 3c_si 4c_si Number of coding types to use si stands for short intergenic default is 4c_si rna lt rnafile gt Optional fasta file containing DNA sequences of structural RNA genes used only in order to compute nucleotides frequencies rbs lt modelid gt mO mi double_m0 default is m0 em lt niter epsi niter_sel nb_sel eps_sel gt Parameters for the EM algorithm default is 20 0 01 50 10 10 2 USE OF AN ALREADY ESTIMATED MODEL fm lt showmodel gt A bacterial gene detection model already fitted by SHOW If used any option of the SECTION 1 is ignored TEMP FILES AND PARAMATERS OF THE GENE PREDICTION OUTPUT d lt tmpdir gt Location of the temporary directory where SHOW 1 0 used by the Perl script will be located default is tmp w I cdst lt float gt Probability threshold for CDS prediction default is 0 5 startt lt float gt Probability threshold for multiple starts prediction default is 0 1 rbst lt float gt Probability threshold for RBS prediction default is 0 1 19 Overlap 2nt Ne E y EN FR PRA e ej a J Toole FIG 4 Structure of a HMM dedicated to bacterial coding sequences detection used by bactgeneSHOW Colors are used in text in order to identify parts of the model 7 3 HMM for bacterial gene detection Figure 4 displays the structure
4. X EM consists in alternating two steps the so called E step for Expectation and M step for Maximi zation 3 1 1 E step forward backward algorithm The E step on a HMM is also named Baum Welch or forward backward algorithm It computes the probability values Pam y S u St 1 X x for each position of the sequence t 1 n 1 and each couple of hidden states u v 1 q Values of Po m 1 Sy u X x are further deduced from Pym St u St 1 v XP vi 10 The forward processing of the sequence starts with t 1 and use recursively t 2 n both following equations until the calculation of the value Poim 1 Sn vla Predictive equation 1 q if t gt 1 Bea Sous y Dal D u 0 Poem Ssi u at u 1 else Po m 1 S1 v a y Filtring equation 2 py x4 time Si v db Pra S4 v z 1 z Cee CN m1 For the first r 1 positions of the sequence the probabilities b t att are used instead of the transitions 0 eres The backward processing of the sequence starts from t n and use recursively t n 1 1 the equations Smoothing equation 3 Pym St u St v 7 Patm 1 St v ar Equation 4 q Po m1 Str u 7 S 7 Pym Sia u St v 11 v 1 3 1 2 M step M step consists in updating the parameter 0 by choosing 0 arg max Egon log Po XT St Xi Le the condition
5. been designed Circles represent the considered states of the HMM and arrows the allowed transitions between states This graph modelises the alternation of intergenic regions with CoDing Sequences CDS on the both strands of a DNA molecule The model contains 23 hidden states grouped in 7 groups in dotted line Here follows a short description of the biological meaning of this graph we only give details of the model structure When the actual hidden state corresponds to an intergenic region at position t the arrows indicate that we can stay in this region at position t 1 or leave it towards the first position of a start codon on the strand triplets atg gtg ttg or towards a last position of a stop codon on the strand inverse complementary of tga tag taa Intergenic state can only be reached from the last position of a stop codon on the strand or the first position of a start codon on the strand Leaving the third position of a start codon on the strand the hidden path goes through a CDS which is a succession of codons Codons are modelised by using a cycle of three hidden states one for each position inside the codon This modelisation ensures that the length of the CDS will be a multiple of three and enables to take into account distinct compositions of the DNA according to the codon position defined by the emission transitions b which are not described here At the third codon position on the
6. logl 67903 5 diff 24 9115 iter 6 logl 67886 6 diff 16 8449 iter 7 logl 67874 3 diff 12 3817 8 8 3 14 3 3 2 select likelihoods file This file summarizes the final likelihoods obtained from each starting point and reports the model which has been further fitted during the final stage of the estimation model O loglikelihood 67851 4 model 1 loglikelihood 67883 3 model 2 loglikelihood 67901 best model found 0 loglikelihood 67851 4 3 3 3 select models file This file contains the definition of the models obtained from each starting point Models are described using the same format as in the model lt file gt 3 3 4 trace file This file contains likelihoods of the models at each iteration of the EM algorithm running from the model of the model lt file gt if random initializations are not used or from the best model found after random initializations 3 3 5 model file This file contains the final model obtained after the estimation of the parameters model is des cribed using the same format as in the model lt file gt 3 3 6 e file These files are only generated if an output lt file gt is specified One file with the e suffix is created for each of the sequence files referenced in the seq lt file gt For each sequence the e file contains output of the forward backward algorithm as defined in the output lt file gt Each line corresponds to a position along the sequence tt hello hello2
7. strand it is necessary to forbid the appearance of an in frame stop codon Thus the emission model associated with the third position needs to verify bcos _3 a tg P X a St CDS 3 Xf tg 0 beps 3 9 ta P X g CDS 3 Xi ta 0 and beps _3 a ta P X a S CDS _3 Xi ta 0 The first position of the stop codon on the strand can only be reached from the third position of a translated codon and corresponds to the first nucleotide of the triplets tga tag taa From the third stop codon position on the strand the path goes through intergenic From the intergenic state the third position of a stop codon on the strand could be reached This transition enable the beginning of a CDS on the strand In the next section we will present the syntax allowing the definition of such HMM 2 3 HMM specification file model lt file gt Model specification consists in the specification of each hidden state state and emission observation transitions It corresponds to the description of each of the nodes of the graph This file does not only contain the description of the model structure but also indicates which parameters of the model band a are fixed or must be estimated when using the model as input of the show_emfit executable When running show_emfit values found in this file for the parameters described as to be estimated are used as the starting point of the iterative EM
8. 07 0 172784 0 214286 0 322822 255216 0 193496 0 177495 0 373793 326314 0 198495 0 157725 0 317467 260414 0 258744 0 223338 0 257503 22166 0 186155 0 233092 0 359093 184989 0 211447 0 218147 0 385417 334371 0 134578 0 155746 0 375305 345444 0 155348 0 195448 0 30376 328272 0 126919 0 235736 0 309073 ooooooooooooooo 0 204668 0 155128 0 192557 0 447647 tta ttg ttc ttt END_OBSERVATIONS END_STATE Hidden state description begins with the BEGIN STATE keyword and ends with the END STATE keyword It contains the identifier of the state that is used in the description of the outgoing transitions and that must be unique The state transition description is separated from the description of the emission observation transitions Outgoing transition description from the hidden state begins with the BEGIN TRANSITIONS keyword and ends with the END_TRANSITIONS keyword It contains the description of each allowed transition from the hidden state The type must be specified for each outgoing transition 0 means no estimation of the parameter and 1 means estimation by the EM algorithm The state refers to the identifier of the state to which the transition is allowed The ptrans keyword preceeds a numeric value of the state outgoing transition probability a u v This numerical value is fixed when the type is equal to 0 and corresponds to the initial value required to run EM when the type is set to 1 in this last case
9. 4 7 3 4 RBS modelling Taking into account that the ribosome binding site RBS position has been shown to improve precise start site prediction this can also help to predict short genes where CDS composition does not provide sufficient information The script enables the user to choose between distinct RBS models according to the rbs lt modelid gt options rbs mO RBS sequences are modelled by a positional compo sitional matrix of Markov order 0 and length 14 rbs m1 same as rbs m0 but with Markov order 1 rbs double_m0 the positional compositional matrix is duplicated enabling to model two distinct consensus for RBSs proposed in Besemer et al Nucleic Acids Res 2001 The RBS is followed by a spacer sequence of minimal length 1 Hidden states corresponding to RBS and spacer are colored in blue in the Figure 4 The HMM does not enforce each CDS to be preceeded by RBS the probability to find a RBS before a start codon is estimated The addition of two states corresponding to short intergenic regions could improve this estimation These two states are the same as the central intergenic state in terms of composition tied observations but their transitions differ They are dedicated to model short intergenic regions separating CDSs on the same strand one for CDSs on the forward strand and the other for CDSs on the backward strand RBS model is not reachable from the state which corresponds to short intergenic regions on the for
10. AGAATGGGCAAGTTCGTTAAACTTAAGATGGTACAAAGGTTTCTTTG GCACAAGTGTGCTTATTTTATGCTAGATGAATTCCTTACATACTATGATAATCTTTGTTT ATCAGATGACGGTAATAGTATTTGTCAAGTTTTCTCTTTAGGAAGAACTATCACTTGATT If the model includes a bound state the output sequence is a concatenation of sequences of total length corresponding to the length specified in simul lt file gt Appearances of bound states delimit individual sequences bound state positions along the concatenated sequence is given in the simulated hidden_states file in the same way than other hidden states and in the simulated_0 dna file by X in the simulated sequence 18 6 Some precisions concerning the design of the source code These considerations are dedicated to users interested in extending the source code The C programming language has been chosen to implement this program because it is widely used it allows object oriented programming and it is efficient for applications requiring intense numeric calculus The object oriented design is well adapted for HMM based programs in particular because of the modular nature of the HMM However an object oriented design implies numerous calls to function methods and then could slow down the execution Thus a balanced design must be chosen in order to keep modularity and efficiency properties The three programs show_emfit show_viterbi and show_simul share widely the same object components These objects are well designed for some further extensio
11. N_STATE state_id motif_1 1 BEGIN_TRANSITIONS type 1 state motif_2 1 ptrans 0 5 type 1 state motif_2 2 ptrans 0 5 END_TRANSITIONS BEGIN_OBSERVATIONS seq genomic_dna type 1 order 0 pobs random END_OBSERVATIONS END_STATE BEGIN_STATE state_id motif_1 1 BEGIN_TRANSITIONS type 1 state motif_2 1 ptrans 0 5 type 1 state motif_2 2 ptrans 0 5 END_TRANSITIONS BEGIN_OBSERVATIONS seq genomic_dna type 1 order 0 pobs random END_OBSERVATIONS END_STATE BEGIN_STATE state_id motif_10 1 BEGIN_TRANSITIONS type 0 state bound ptrans 1 END_TRANSITIONS BEGIN_OBSERVATIONS seq genomic_dna type 1 order 0 pobs random END_OBSERVATIONS END_STATE BEGIN_STATE state_id motif_10 2 BEGIN_TRANSITIONS type 0 state bound ptrans 1 END_TRANSITIONS BEGIN_OBSERVATIONS seq genomic_dna type 1 order 0 pobs random END_OBSERVATIONS END_STATE 2 4 Observed sequences file list seq lt file gt The SHOW executables can work on a single sequence or a set of sequences The sequences to process are referenced in the seq lt file gt An example of such a file is given below It must contain the three keywords seq_identifier seq_type and seq_files seq_identifier genomic_dna seq_type dna seq_files Contigl dna ContigII dna Keyword seq_identifier refers to any chosen identifier of the sequence that must be the same as in the model lt file gt keyword seq in observation emission descript
12. SHOW User Manual Pierre NICOLAS 1 2 Anne Sophie TOCQUET 2 and Florence MURI MAJOUBE 2 26 janvier 2004 1 Laboratoire de Math matique Informatique et G nome INRA F 78350 Jouy en Josas cedex 2 Laboratoire de Statistique et G nome CNRS Tour Evry2 523 place des terrasses de l Agora F 91034 Evry Table des mati res 1 Introduction 3 2 Hidden Markov Models HMMs 3 2 1 SHOW s HMMs for DNA sequences 2 000 ee a 3 2 2 Example of a simple model for gene detection o e o 4 2 3 HMM specification file model lt file gt e ee 5 2 3 1 Hidden state definition eee eee 5 2 3 2 Two distinct modelisations of the boundaries of the sequence 8 2 4 Observed sequences file list seq lt file gt eee 10 3 The show_emfit executable 10 3 1 EM algorithm Baum Welch algorithm 10 3 1 1 E step forward backward algorithm o o o o o ooo o 10 Sul 2 Mest pes Budo eee eS eo Bee Heen A E A a 11 3 1 3 Computation of the loglikelihood 0 o e e 12 3 1 4 Stopping criteria for EM e 12 3 1 5 Memory saving approximation 0 000 2 ee 12 3 1 6 Bypassing local maxima of the likelihood function 12 3 2 Input files d siret ai dE NE A pate a re Se eos 12 32 1 modeli EES te orn Sate wok gant og vinde Buk Beide regan ah dive aes Ae 12 ID Sem EEE stean an hore arke ed Beet NS eo eo ae B
13. Some precisions concerning the design of the source code 19 7 bactgeneSHOW a Perl script invoking SHOW for bacterial gene detection 19 ek MOUI 04 tee ouer Sapte be ado en AP A Rola A ven Meola ee a 19 7 2 The bactgeneSHOW command line e 19 7 3 HMM for bacterial gene detection 2 eee 20 7 31 Intergenic sequences 2 eee eee eee 20 7 3 2 Coding Sequences ee 21 7 3 3 Overlap between coding sequences e 21 TaM RBS modelling 2p ya vee Ske a SE o ae ae Gp Ae 21 7 3 5 Structural RNA modelling aaa 22 7 3 6 CDSs on the complementary strand 0 00002 eee eee 22 7 4 What does bactgeneSHOW een 22 7 5 How to retrieve a fitted model for use with the fm lt showmodel gt option 22 8 Acknowledgments 23 1 Introduction SHOW stands for Structured HOmogeneities Watcher It is a set of programs implementing dif ferent uses of Hidden Markov Models HMMs for DNA sequences SHOW enables self learning of HMM on a set of sequences sequence segmentation based on the Baum Welch or the Viterbi algo rithms and sequence simulation under a given HMM We have designed these programs to allow the user to specify any highly structured model and also to process large sets of sequences To date it has been successfully used in diverse tasks such as DNA segmentation in homogeneous segments bacterial gene prediction and human splice sites detection The three following programs are available sh
14. al expectation to the current candidate and the sequence X of the complete loglikelihood This maximization step uses the segmentation obtained during the E step and leads to the intuitive estimators Equation 5 iO Gea Niza Palm Sti u S v 7 i 2 Pomo Sia u 21 and De Pym St 0 a Lo KITA w Meo LE G Ditra Pomo St 0 ap ext w In the special case of a state u with an observation emission Markov process with pseudo order See Section the maximization of b could not be performed analytcally Thus this maximization is performed using a multidimensional maximization routine of the GSL GNU Scientific Library 11 3 1 3 Computation of the loglikelihood The computation of the incomplete loglikelihood is performed recursively for t 1 n during the E step using equation q log Poona Xt zi log DC bD ay a Poma vlet u 1 log Pym X af Precautions must be taken for the r 1 first positions 3 1 4 Stopping criteria for EM It can be shown that every limit point of the sequence OCD 0 generated by EM alternating the E and M steps previously described satisfies the incomplete log likelihood equations and that 0 n gt 0 converges towards the maximum likelihood estimate MLE if the starting point 9 is not too far from the true value From a practical point of view EM can thus converge to a local maximum The E and M steps are alternated until an it
15. algorithm 2 3 1 Hidden state definition The HMM definition file is organized as a succession of hidden state definitions that makes it highly modular easy to edit by copy paste operations and makes an existing model easy to extend The following shows an example of a hidden state which could be the intergenic hidden state of the figure 1 definition BEGIN_STATE state_id intergenic identifier of the state BEGIN_TRANSITIONS type 1 allows estimation of the transition state start _1 transition towards start _1 state ptrans 0 00432589 probability of this transition type 1 state stop _3 1 ptrans 0 00426073 type 1 state stop _3 2 ptrans 0 000892905 type 1 state intergenic ptrans 0 99052 END_TRANSITIONS BEGIN_OBSERVATIONS seq genomic_dna a sequence id which must be the same as in the seq lt file gt type 1 allows estimation of the observation distribution order 2 markov order of the observation distribution pobs 0 314514 0 183587 0 185804 0 316094 ta gct 385992 0 177465 0 146873 0 28967 aa ag ac at 327029 0 217127 0 207901 0 247942 ga gg gc gt 310991 0 158126 0 221846 0 309037 23778 0 185229 0 190936 0 386055 oooo 44321 0 181248 0 140779 0 234764 aaa aag aac aat 33124 0 25435 0 217354 0 197056 gaa gag gac gat 382986 0 161845 0 2005 0 25467 303587 0 204104 0 178921 0 313388 38761 0 202831 0 138278 0 27128 342915 0 224382 0 201373 0 231331 2901
16. d value As likelihood rapidly increases during the first iterations of the algorithm it is interesting to stop the maximization after a few iterations and then to choose the model for which the likelihood maximization procedure is performed further The figure 3 shows the likelihood increase during such a maximization performed from ten random starting points 3 2 Input files 3 2 1 model lt file gt The syntax of this file is described in the section 2 3 12 133400 133600 Ke le Ko D 2 133800 134000 iteration number Fic 3 Example of likelihood increase during EM iterations The loglikelihood maximization begins with 10 starting points during 50 iterations only the likelihood of the best model is performed further than 50 iterations 3 2 2 em lt file gt This file contains the information required to initialize and run the EM algorithm An example of such a file is given below estep_segment 20000 estep_overlap 1000 niter 1000 epsi 0 00001 Keywords estep_segment and estep_overlap refer to the length of the segment used for the memory saving approximation during the forward backward algorithm E step For a given model and set of sequences the memory needed by the program grows linearly with estep segment niter and epsi define the stopping criteria of the EM algorithm EM iterations stop when the loglikelihood increase between two consecutive iterations is low
17. ectory in the location specified by the d lt tmpdir gt option This directory contains all the intervening files in particular input and output files of the show_emfit executable We will now describe the different steps performed by bactgeneSHOW The DNA sequence is copied in the temporary directory An initial model corresponding to the model specified by the user is copied in the directory for instance gene_4c_si model This file is a model lt file gt for show_emfit see section 2 3 containing flags between and or between two These flags are used by the script to determin respectively where to add some hidden states and which states must be used for gene prediction states modelling coding regions If rna lt rnafile gt option is used Fasta file containing structural RNA gene sequences is copied in the directory and a file having the suffix invcomp containing the reverse complementary sequences is created Composition of structural RNA is computed over these two files and the two corresponding states are added in the HMM the new model file have rna_ prefix for instance rna_gene_4c_si model Input files for show_emfit are created em desc see section 3 2 2 according to the em lt niter epsi niter_sel nb_sel eps_sel gt option start set see section 2 4 show_emfit estimates the parameters of the initial model the following files are created start select traces start select likelihoods start se
18. en no sufficient context is known to use the parameters b x x ED These values correspond to the starting point required to process EM when type is set to 1 and are fixed otherwise A random choice of the pobs values can be done by setting pobs to the keyword random In this case the model cannot be used directly for viterbi reconstruction of the hidden path or for simulation but must first be estimated using show_emfit Note that when using directly show_viterbi or show simul the pobs and ptrans values will be considered as fixed for these executables even if the type of these parameters is set to 1 Thus show_emfit must first be used to allow some parameter estimation Finally the keyword excepted can be used to forbid the emission of some words this keyword was used in the previous example to forbid an in frame stop codon tag tga and taa The length of the forbidden word must be gt r 1 If the length of a word forbidden by the excepted keyword is greater than r 1 then SHOW uses a special kind of Markov model for the observation emission process This Markov model corresponds to a model of order r conditionally on that some words of length l do not appear the resulting model is in fact a Markov model of order 1 but with the same number of parameters than a Markov model of order r Then we refer to this kind of Markov model as a Markov model of pseudo order lw 1 The following matrix gives an example of the
19. ences HMMs are basically implemented in the same way in SHOW and RHOM Nicolas et al Nucleic Acids Res 2002 We note X X1 X2 Xn the observed DNA sequence with X Y a g c t and S S1 S2 Sn the corresponding hidden state path each S taken from a finite set S 1 q defined by the user The S are generated according to a first order Markov chain of transitions a u v P S v S u uves The initial distribution of the chain is atu P S1 v wes Unlike the RHOM software designs of the algorithms implemented in SHOW are optimized for large sparse HMMs where most of the transitions between hidden states are null The X are generated according to a markov model of order r which depends on the actual hidden state S u Transitions from the letters x_ 1 to the letter x in state u are by x w P X x Se u XE w welt ver ues t Tu For the first r 1 positions of the sequence we will use for 0 lt t lt ru by a w P X z S1 u Xt w wert rE ues In the source code the state transitions a and the emission observation transitions b will be denoted respectively by ptrans and pobs 2 2 Example of a simple model for gene detection Stop Start CDS Fic 1 Example of a simple HMM dedicated to bacterial coding sequences detection Figure 1 displays the structure of a simple HMM represented by the hidden state transitions for which SHOW has
20. er than epsi or the maximal number of iterations niter has been reached If parameters of the emission observation process associated to some hidden states are randomly initialized i e model definition file contains pobs random supplementary keywords nb sel niter_sel and eps sel are expected in the em lt file gt These supplementary keywords are used in the following example estep_segment 20000 estep_overlap 1000 nb_sel 3 niter_sel 100 eps_sel 0 01 niter 1000 epsi 0 0001 13 nb_sel corresponds to the number of random starting points for the EM algorithm niter_sel refers to the maximal number of iterations performed from each starting point eps sel corresponds to the stopping criteria of the EM algorithm running from each starting point 3 2 3 seq lt file gt The syntax of this file is described in the section 2 4 3 2 4 Optional output lt file gt This file is optional and is required for the output e file generation The output lt file gt contains the output of the last iteration J of the forward backward algorithm i e the probabilities Pym St u Xp and Pym St u Siy1 Xj for each state u v and each position t of the sequences An output file containing these probabilities for all considered states can be very large and difficult to use and analyze Then a summary of selected states can be done The following example gives the syntax of such an output e file rhom1 rhom2 rhom3 r
21. eration M for which convergence can be stated The stopping rule is log Pym 1 X log Pym X lt e for a given e the value of e is fixed by the user The parameters 9 are then estimated by 921 aM and YM and for all positions t in the sequence the probability of the state S to be v 1 q is estimated by Pg S v X 3 1 5 Memory saving approximation As the forward backward algorithm requires the storage of the probabilities Pp m 1 S4 u X Fook and Po m 1 St u X during the forward step this procedure can consume a large amount of active memory when processing large sequences with a big model A solution could be to use hard memory files to store this probabilities Another solution that we think more effective in terms of execution speed have been implemented in SHOW It consists in approximating Pam 1 S u XT by Pym St u X where s and s are boundaries of overlapping segments of the sequence This approximation is mathematically justified because the dependence between two positions in a Markov chain decreases geometrically with the distance separating the two positions 3 1 6 Bypassing local maxima of the likelihood function One of the major drawbacks of the EM algorithm is that it can be stuck in local maxima of the likelihood function The implemented solution consists in starting EM with distinct initial value 00 and to choose the best path which leads to the highest likelihoo
22. et aad MR Se 13 Id seg le Aka taal Mok bata BS AE Eel en owt hee 14 3 2 4 Optional output lt file gt eeen 14 3 39 Output AGS he a ten rar ao eran ada Te be ee eed ee eS ha 14 Sid Select traces Menor eG Ien aten ar ndi bree den ln REA gi e de ieee e 14 3 3 2 select likelihoods file gt ren ae a eee a ie be ede ad rn Bee eee a 2 15 3 93 81 Select models file sos ene ate ey rls Be a ee del ey a EE aan 3 15 SoA trace Tle eb n yada Ue ee en Noe aA ee Aa oe Bea en A A ede 5 15 dis model rc ne eto eter bia ee Dil Rar ee ede Behe ee B ie 15 3 30 eme dia A A paa a enea 15 4 The show_viterbi executable 15 4d Viterbialgorithm lv tias trad AN Rak dy denn Seba de a 16 42 oput mles mys pma ss at ds a ie hand a 16 ADL modela a sen ia A al a 16 AD SMe 2 3 vo ca Sa AS ole BA A in ett ws 16 AD 3 sed Shless edel sek Soe Bde ea eek Bae oe ASA A AR 16 43 Outputailes osn foo er Ae end a a ee een es Ok ee os EU a 17 5 The show_simul executable 17 bil Simulatingan HMM sm aar aoe a we aS ee ea ke ae eee A 17 52 Input les gc ets Araceae Sate ee Oe ee A en ele eee a OA Ee ERA 17 Dieke model SAE a ee er a A EE ee EE he te 2 17 B22 simuladas a oe Ba Ete te ye ge bat tee i de ha eS ae it 18 Dida Seq tle Shek ee ee eet Ae ee ee ak 18 5 3 OQu tput fles oc anr ra Bean Se Ge eg gee ad Se gee ee a eden 18 5 3 1 simulated hidden_states fille aaa 18 5 8 2 simulated 0 dna file sn ee been we edn Beale Hogt ee eee i 18 6
23. gorithm See section 3 1 5 4 2 Input files 4 2 1 model lt file gt The syntax of this file is described in the section 2 3 Keep in mind that all the parameters of the model must be fully specified i e the file must not contain any pobs random 4 2 2 vit lt file gt This file contains the information concerning the parameters of the memory saving approximation An example of such a file is given below vit_segment 20000 vit_overlap 4000 4 2 3 seq lt file gt The syntax of this file is described in the section 2 4 16 4 3 Output files Only one kind of output file is generated by the show_viterbi executable One output file with the vit suffix is created for each sequence file referenced after the seq_files keyword in the seq lt file gt For each sequence file the vit output file contains the most probable hidden path An example of such a file is given below the mapping between numbers and hidden state identifiers is given in the header of the file each line of the file corresponds to a position along the sequence excepted in the header viterbi reconstruction O hello 1 hello2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PRRRROOO 5 The show_simul executable The show_simul executable enables to simulate an HMM given a fully specified HMM 5 1 Simulating an HMM The simulation of an HMM is easy because it only consists in simulating the hidden path given its markov model and simulat
24. hom2 gt rhom3 If rhom1 rhom2 rhom3 are three hidden states the output file will contain three columns of length n corresponding to Pam S rhoml Xi Pam S rhom2 XT Pym St rhom3 X7 and Pam S rhom2 St rhom3 X7 Note that a same probability cannot be summed in two columns 3 3 Output files Some files are created in the current directory when running show_emfit The names of these files are generated by adding suffixes to the name of seq lt file gt excepted output files with the suffix e which are created using the names of the individual sequence files referenced in seq lt file gt Files with suffixes containing select are generated only if random initializations of the EM algorithm are used Files with the suffix e are generated only if an output lt file gt is specified 3 3 1 select traces file This file contains likelihoods of the models at each iteration of the EM algorithm running from each starting point ok ok ok ok ok ok ok ok ok ok ok ok ok ok ak dd ok ak k ak ok ak ok ak ak ok logl 67864 iter 9 logl 67857 iter 10 logl 67851 4 FE CSAS akak ak ak ak ae ak 3K 2k ak ak a akk 2k ak k KK GI I III IE I AK model 1 iter 0 logl 71490 3 iter 1 logl 68347 4 diff 3142 88 diff 9 50504 diff 7 44326 model 0 iter 0 logl 71378 6 iter 1 logl 68449 2 diff 2929 44 iter 2 logl 68067 4 diff 381 803 iter 3 logl 67971 1 diff 96 317 iter 4 logl 67928 4 diff 42 6526 iter 5
25. ing the observed sequence conditionally on the hidden path S Mj a 1 a 2 a q St Sti u Mi a u 1 a u 2 a u q Ke Set Kien Ti Mi Oe ep Os eee belde 5 2 Input files 5 2 1 model lt file gt The syntax of this file is described in the section 2 3 Keep in mind that all the parameters of the model must be fully specified i e the file must not contain any pobs random 17 5 2 2 simul lt file gt This file contains only the length of the sequence to be simulated lg 10000 5 2 3 seq lt file gt The syntax of this file is described in the section 2 4 It may surprise that such a file should be used by a sequence simulation program However the sequence referenced in this file is absolutely not used for the simulation of the new sequence but the seq_type information of the seq lt file gt is necessary 5 3 Output files 5 3 1 simulated hidden_states file This file conforms to the same format as output files of the show_viterbi executable An example of such a file is given below hidden states simulation 0 statel 1 state2 2 state3 3 states 1 1 1 1 1 1 1 1 WWW 5 3 2 simulated_0 dna file A DNA sequence in Fasta format gt dna observations simulation TTAGATCAAAAACAGCAGGTAAATAAAAGTTCTATTACTGAAGCGAAGGTGTTTCAACAC CAAAAATGTTCTAAGGATATTCCTGACCTTAGTGGACAGCATCAAATGATATCTTTTCAT GCAAATATACAAGTCTAGAAAAATTATTCATTTAATTGCAAAACTAGAAGGCTTTTTCTG CCAGCGCCTACTA
26. ion The seq_type corresponds to the nature of the observed sequence currently only dna is properly supported The seq files are the name of the files containing sequences to be analyzed Sequences in GenBank and Fasta format are accepted In order to process a sequence set it is possible to store all the sequences in the same Fasta file or to use multiple files each of them referenced in the seq lt file gt When processing a set of sequences the sequences are considered as independent realisations of a same HMM 3 The show emfit executable The show_emfit executable simultaneously enables to estimate the parameters of the model by likelihood maximization and to segment the sequence using the EM algorithm The show_emfit execu table can also be used with fixed parameters to segment the sequences according to the Baum Welch algorithm or to compute the likelihood of a model for a given sequence 3 1 EM algorithm Baum Welch algorithm This section describes the EM algorithm implemented in SHOW We denote by the whole pa rameters of the HMM state and observation transitions a and b that we want to estimate some parameters can of course be fixed Given a starting value 6 of the parameters the EM algorithm is an iterative procedure that produces an approximation of the maximum likelihood estimation MLE 9 that is updated at each iteration m This procedure ensures the increase of the likelihood at each iteration m Pam y X gt Pom
27. lect model start trace start model see section 3 3 The initial model is enriched after estimation start model hidden states modelling overlaps between CDSs and RBS are added the resulting model file is named startbis model show_emfit estimates the parameters of the enriched model startbis model it produces the output files final select traces final select likelihoods final select model final trace final model see section 3 3 The file final model contains the final fitted model which will be used for gene prediction An output description file named outputfromparse desc see section 3 2 4 is created by parsing final model according to the flags between two show_emfit performs forward backward algorithm using the files final model outputfromparse desc and em final desc and produces an output file with suffix e see section 3 3 6 The e output file is parsed to produce the annotation file either in GFF format or in GenBank feature format 7 5 How to retrieve a fitted model for use with the fm lt showmodel gt option Copy the file named final model from the temporary directory created during a preceding run of bactgeneSHOW 22 8 Acknowledgments We thank Antoine MARIN Gregory NUEL and Vincent MIELE for their help in accessing powerful computers which enables us to try numerous HMM for varied biological problems We are grateful to Mark HOEBEKE for advices in C object oriented development
28. nce the sequence begins when going out from the bound state and ends when reaching back the bound state No observations are described in the bound state description The description of the outgoing transitions from the bound state follows the same rules as any other state An application example of such modelisation is how to distinguish false and true sites corresponding to some signals When presenting a potential site the likelihood of a true and false model is computed and the decision of predicting a true or a false site is taken according to the likelihood ratio Figure 2 displays the graph corresponding to a HMM which can be used to predict a ten nucleotide length signal This model enables to take into account some kinds of correlations along the motif corresponding to the signal It could be learned using show_emfit on a learning set The prediction will be done by computing the likelihood of the models corresponding to the true and false sites given the sequence when running show_emfit with the two models 10 A ait E ALIAS Fic 2 Example of a HMM dedicated to a ten nucleotides length motif detection The description of the beginning and the end of the model corresponding to the figure 2 is given below BEGIN_STATE state_id bound the bound state contains no observation description BEGIN_TRANSITIONS type 1 state motif_1 1 ptrans 0 5 type 1 state motif_1 2 ptrans 0 5 END_TRANSITIONS END_STATE BEGI
29. ns but not for all Here we present what we think possible or not Concerning observation modelisation extends of the program enabling to deal with other kinds of sequences such as amino acids sequences codon sequences conditioning the model with the trans lation are possible The use of multiple sources of observation that could be taken into account for hidden path reconstruction has been considered and could be easily implemented This is the reason why the seq identifier is used in the observation distribution description Concerning the estima tion procedure we think that HMM estimation algorithm based on bayesian methodology could be done In contrast we do not think that the design of the source code could allow extension to hidden semi Markovian models generalized hidden Markov model 7 bactgeneSHOW a Perl script invoking SHOW for bacterial gene detection 7 1 Motivation SHOW is a generic program that uses various HMMs to analyze biological sequences especially DNA The design of the HMM and the processing of the output files is a sizeable task Thus we propose a Perl script that enables to use SHOW for bacterial gene detection using a single command line 7 2 The bactgeneSHOW command line Usage bactgeneSHOW i lt dnafile gt o lt outputfile gt options lt dnafile gt Fasta file containing the DNA sequences to be analyze lt outputfile gt Annotation file containing the results of the gene detection OPTIONS 1 CHOICE
30. of the models used by bactgeneSHOW 7 3 1 Intergenic sequences Intergenic sequences are modelled using a single back looped hidden state emitting the observed DNA sequence according to a second order Markov chain The structure of the model displayed in Figure 4 contains a single intergenic state at the center of the graph However if m lt modelid gt refers to a model identifier containing si short intergenic two supplementary states are added for short intergenic regions modelling The purpose of the use of these states is explained in the paragraph concerning RBS modelling 20 7 3 2 Coding sequences Coding sequences have a three periodic composition core represented by a cycle of three hid den states corresponding to the three positions within a codon Each of these hidden states emits nucleotides according to a second order Markov chain In frame stop codons are prevented by a zero probability of emitting a stop codon in the third state of the cycle All states concerned with this constraint are filled in Figure 4 In order to ensure that start and stop codons delimit CDSs appro priate sub models are added upstream and downstream from the core of the CDS Heterogeneities of coding sequences composition have been shown to be important features of the bacterial chromosome composition In particular taking into account the atypical a t rich composition of some horizontally transferred genes has been shown to greatly imp
31. ow_emfit enables to fit an HMM on sequences using EM algorithm learning and to reconstruct the hidden state path using forward backward algorithm segmentation When used with fixed parameters show_emfit only produces the sequence segmentation with the forward backward algorithm show_viterbi implements the viterbi algorithm to find the most probable hidden path given the observed sequence segmentation The HMM parameters can first be learned with show_emfit show_simul enables to simulate a hidden state sequence and a DNA sequence under a specified HMM All these programs share a same format for the HMM specification file which is presented in the first section The three following sections present detailed explanations of the different executables and how to use them Section 6 intends to deal with the source code design in order to facilitate further developments of the software Finally section 7 presents a Perl script making easy the use of SHOW for bacterial gene finding The source code of SHOW is freely available this software is protected by the GNU Public Licence Installation instruction can be found in the INSTALL file of the distribution Keywords DNA segmentation Hidden Markov Models maximum likelihood estimation EM al gorithm Viterbi algorithm Baum Welch algorithm forward backward algorithm HMM simulation gene detection DNA sequence heterogeneity 2 Hidden Markov Models HMMs 2 1 SHOW s HMMs for DNA sequ
32. parameters of a Markov model of pseudo order 1 corresponding to a Markov model of order 0 conditionally on that ag dinucleotide does not appear bu a bulo but bu a bu c bu t bu a bu c bu t bula bu c bu t veg WO Bulg Bue ball by a bu g c bu t bu a bu g bu c bu t The example below shows how to use the keywords label and tied_to in the observation descrip tion BEGIN_STATE state_id start _1 BEGIN_TRANSITIONS type 0 state start _2 ptrans 1 END_TRANSITIONS BEGIN_OBSERVATIONS seq genomic_dna label start _1_obs type 1 order 0 pobs 0 30 300 4 a g ort END_OBSERVATIONS END_STATE BEGIN_STATE state_id start _1 BEGIN_TRANSITIONS type 0 state intergenic ptrans 1 END_TRANSITIONS BEGIN_OBSERVATIONS seq genomic_dna tied_to start _1_obs type 3 END_OBSERVATIONS END_STATE 2 3 2 Two distinct modelisations of the boundaries of the sequence SHOW allows two distinct modelisations of the sequence s boundaries The first one enables to work conditionally on the length of the sequence In this case the sequence length is not modelised and the sequence begins and ends in any of the hidden states This is the default modelisation when no bound state is specified The alternative is to modelise the length of the sequence This is done by imposing a state of which the identifier is set to bound The state named bound corresponds to the beginning and the end of the seque
33. r estimation Observation emission transition probabilities description conditionally on the hidden state begins with the keyword BEGIN_ OBSERVATIONS and ends with the keyword END_OBSERVATIONS The first keyword found in the observation description must be seq It gives an identifier that must be the same as the identifier given in the file referenced by the seq argument of the command line see the section 6 for more explanations The keyword type indicates how the observation emission transition probabilities should be proces sed during estimation Type 0 stands for no estimation constant value and type 1 means estimation Types 2 and 3 must be used only for tied observations 2 means identical to the referenced observation emission distribution while 3 means complementary to the referenced one Type 3 can only be used when the order r of the referenced observation emission distribution is 0 The observation emission transition probabilities will be estimated when type is set to 2 or 3 only if the referenced one are estimated The keyword order is used to indicate the order r of the markov chain of the observation emission distribution The keyword des preceedes numerical values given by the user for the observation transition probabilities by 1 Z a After pobs Tuta 4t numerical values must be set to the values of by x x ze for t increasing En 0 to ru The T bulz Bae are used only at the beginning of the sequence i e wh
34. rove gene detection sensitivity In order to take these features into account different sub models corresponding to distinct coding types are reachable downstream from a start codon Moreover composition type can change within genes This is more general than using disjoint gene types and also more realistic in particular with regard to the existence of hydrophobic regions in proteins The model displayed in Figure 4 contains two types of coding regions The user can choose the number of coding sequence compositions of the HMM according to the m lt modelid gt option Model identifiers containing Ic 2c 3c and 4c refer respectively to HMM with a single two three and four types of coding compositions 7 3 3 Overlap between coding sequences Overlaps between CDSs are an important feature of the bacterial genome organization Our model distinguishes the very frequent short overlaps of 1 2 or 4 nucleotides from the rare longer overlaps Composition of long overlaps is modelled in the same way as non overlapping CDSs As not so much data is available in a single genome for their composition estimation we use for these overlaps an emission process of order 1 In order to prevent from the appearence of in frame stop codons we used a pseudo second order model corresponding to a Markov model of order 1 conditionally on the absence of stop codons Hidden states and transitions used to model the overlaps are colored in red in Figure
35. ward strand reciprocally short intergenic regions state is not reachable from RBS model on the backward strand One of the main interest of the use of these states is to distinguish intergenic regions too short to contain RBS from the others and thus to enable a better estimation of the probability of finding a RBS before a CDS when intergenic is not so short 21 7 3 5 Structural RNA modelling Structural RNA composition differ from intergenic composition In particular it has been shown to be related to environmental conditions for the organism such as temperature In order to prevent prediction of CDSs within rRNA genes and to improve the estimates for intergenic parameters the user can choose to add two states corresponding to rRNA texture one on the forward the other on the complementary strand by the use of the rna lt rnafile gt option These states are not displayed in Figure 4 Parameters of the composition associated to this state are computed on the sequence in Fasta format contained in the lt rnafile gt 7 3 6 CDSs on the complementary strand Genes exist on both strands and therefore are read on both the direct and complementary strands With minor modifications the complementary strand model can be derived from the direct strand one hidden states modelling the complementary strand are displayed in the lower half part of Figure 4 7 4 What does bactgeneSHOW The bactgeneSHOW script creates a new temporary dir
36. which unfortunatly have not always been followed 23

SHOW User Manual - silico.biotoul.frsilico.biotoul.fr

Contents

Download Pdf Manuals

Related Search

Related Contents