Home

HMMER User's Guide

1. h Help print a brief reminder of command line usage and all available options c Emit a consensus sequence instead of sampling a sequence from the profile HMM s probability distribution The consensus sequence is formed by selecting the maximum probability residue at each match state O lt f gt Direct the output sequences to file lt f gt rather than to stdout p Sample sequences from the implicit profile not from the core model The core model consists only of the homologous states between the begin and end states of a HMMER Plan7 model The profile includes the nonhomologous N C and J states local glocal and uni multihit algorithm configuration and the target length model Therefore sequences sampled from a profile may include nonhomologous as well as homologous sequences and may contain more than one homologous sequence segment By default the profile is in multihit local mode and the target sequence length is configured for L 400 To change these defaults see Options Controlling Emission from Profiles below N lt n gt Sample lt n gt sequences rather than just one Options controlling emission from profiles All these options require that the p option is also set L lt n gt Configure the profile s target sequence length model to generate a mean length of approximately lt n gt rather than the default of 400 local Configure the profile for multihit local alignment unilocal Configure the profile fo
2. You d think that except for the E values which depend on database search space sizes you should get exactly the same scores domain number domain coordinates and alignment every time you do a search of the same HMM against the same sequence And this is actually the case but in fact it s actually not so obvious this should be so and HMMER is going out of its way to make it so HMMER uses stochastic sampling algorithms to infer some parameters and also to infer the exact domain number and domain boundaries in certain difficult cases If HMMER ran its stochastic samples properly it would see different samples every time you ran a program and all of you would complain to me that HMMER was weird and buggy because it gave different answers on the same problem To suppress run to run variation HAMMER seeds its random number generator s identically every time you do a sequence comparison If you re an expert and you really want to see the proper stochastic variation that results from any sampling algorithms that got run you can pass a command line argument of seed 0 to programs that have this property hmmbuild and the four search programs Creating multiple alignments with hmmalign The file tutorial globins45 fa is a FASTA file containing 45 unaligned globin sequences To align all of these to the globins4 model and make a multiple sequence alignment gt hmmalign globins4 hmm tutorial globins45 fa The output of this is a Stoc
3. algorithm is the new algorithm in HMMER3 It essentially calculates the HMM equivalent of BLAST s sum score an optimal sum of ungapped high scoring alignment segments Unlike BLAST it does this calculation directly without BLAST s word hit or hit extension step using a SIMD vector parallel algorithm By default HMMER3 is configured to allow sequences with a P value of lt 0 02 through the MSV score filter thus if the database contained no homologs and P values were accurately calculated the highest scoring 2 of the sequences will pass the filter Here about 4 of the database got through the MSV filter A quick check is then done to see if the target sequence is obviously so biased in its composition that its unlikely to be a true homolog This is called the bias filter If you don t like it it can occasionally be overaggressive you can shut it off with the nobias option Here 15923 sequences pass through the bias filter The Viterbi filter then calculates a gapped optimal alignment score This is a bit more sensitive than the MSV score but the Viterbi filter is about four fold slower than MSV By default HMMER3 lets sequences with a P value of lt 0 001 through this stage Here because there s a little over a thousand true globin homologs in this database much more than that gets through 2207 sequences Then the full Forward score is calculated which sums over all possible alignments of the profile to the
4. recognized GR annotations ss Secondary structure consensus See Gc SS_cons above sa Surface accessibility consensus See Gc SA_cons above pp Posterior probability for an aligned residue This represents the probability that this residue is assigned to the HMM state corresponding to this alignment column as opposed to some other state Note that a residue can be confidently unaligned a residue in an insert state or flanking N or C state may have high posterior probability The posterior probability is encoded as 11 possible characters 0 9x 0 0 lt p lt 0 05 is coded as 0 0 05 lt p lt 0 15 is coded as 1 and so on 0 85 lt p lt 0 95 is coded as 9 and 0 95 lt p lt 1 0 is coded as Gap characters appear in the PP line where no residue has been assigned A2M multiple alignment format HMMER s Easel library routines are capable of writing alignments in UC Santa Cruz A2M alignment to model format the native input format for the UCSC SAM profile HMM software package To select A2M format use the format code a2m for example to reformat a Stockholm alignment to A2M gt esl reformat a2m myali sto Easel currently does not read A2M format and it currently only writes in what UCSC calls dotless A2M format The most official documentation for A2M format appears to be at http compbio soe ucsc edu a2m desc html You may refer to that document if anything in the brief description below is un
5. eff_nseq relent info p relE compKL h The index of this profile numbering each on in the file starting from 1 The name of the profile The optional accession of the profile or if there is none The number of sequences that the profile was estimated from The effective number of sequences that the profile was estimated from after HM MER applied an effective sequence number calculation such as the default entropy weighting The length of the model in consensus residues match states Mean relative entropy per match state in bits This is the expected mean score per consensus position This is what the default entropy weighting method for effective sequence number estimation focuses on so for default HMMER3 models you expect this value to reflect the default target for entropy weighting Mean information content per match state in bits Probably not useful Information content is a slightly different calculation than relative entropy Mean positional relative entropy in bits This is a fancier version of the per match state relative entropy taking into account the transition insertion deletion proba bilities it may be a more accurate estimation of the average score contributed per model consensus position Kullback Leibler distance between the model s overall average residue composition and the default background frequency distribution The higher this number the more biased the residue
6. Here s a tutorial walk through of some small projects with HMMERS This should suffice to get you started on work of your own and you can at least temporarily skip the rest of the Guide such as all the nitty gritty details of available command line options The programs in HMMER Single sequence queries new to HMMER3 phmmer Search a sequence against a sequence database BLASTP like jackhmmer Iteratively search a sequence against a sequence database PSIBLAST like Replacements for HMMER2 s functionality hmmbuild Build a profile HMM from an input multiple alignment hmmsearch Search a profile HMM against a sequence database hmmscan Search a sequence against a profile HMM database hmmalign Make a multiple alignment of many sequences to a common profile HMM Other utilities hmmconvert Convert profile formats to from HMMER3 format hmmemit Generate sample sequences from a profile HMM hmmfetch Get a profile HMM by name or accession from an HMM database hmmpress Format an HMM database into a binary format for hmmscan hmmstat Show summary statistics for each profile in an HMM database The quadrumvirate of hmmbuild hmmsearch hmmscan hmmalign is the core functionality for protein do main analysis and annotation pipelines for instance using profile databases like Pfam or SMART The phmmer and jackhmmer programs are new to HMMERS They searches a single sequence against a sequence database akin to BLASTP and PSIBLAST respec
7. Sets the tail mass fraction to fit in the simulation that estimates the location param eter tau for Forward evalues Default is 0 04 33 Other options mpi informat lt s gt seed lt n gt stall Run as a parallel MPI program Each alignment is assigned to a MPI worker node for construction Therefore the maximum parallelization cannot exceed the num ber of alignments in the input msafile This is useful when building large profile libraries This option is only available if optional MPI capability was enabled at compile time Declare that the input msafile is in format lt s gt Currently the accepted multiple alignment sequence file formats only include Stockholm and SELEX Default is to autodetect the format of the file Seed the random number generator with lt n gt an integer gt 0 If lt n gt is nonzero any stochastic simulations will be reproducible the same command will give the same results If lt n gt is 0 the random number generator is seeded arbitrarily and stochastic simulations will vary from run to run of the same command The default seed is 42 laplace Experimental only use a Laplace 1 prior in place of the default mixture Dirichlet prior For debugging MPI parallelization arrest program execution immediately after start and wait for a debugger to attach to the running process and release the arrest 34 hmmconvert convert profile file to a HMMER format Synopsis hmmconve
8. Some of the more important holes for us are DNA sequence comparison HMMER3 s search pipeline is somewhat specialized to protein protein com parison Specifically the pipeline works by filtering individual sequences winnowing down to a subset of the sequences in a database that need close attention from the full heavy artillery of Bayesian inference This strategy doesn t work for long DNA sequences it doesn t filter the human genome much to say there s a hit on chromosome 1 The algorithms need to be adapted to identify high scoring regions of a target sequence rather than filtering by whole sequence scores You can chop a DNA sequence into overlapping windows and HMMER3 would work fine on such a chopped up database but that s a disgusting kludge and don t want to know about it Translated comparisons We d of course love to have the HMM equivalents of BLASTX TBLASTN and TBLASTX They ll come Profile profile comparison A number of pioneering papers and software packages have demonstrated the power of profile profile comparison for even more sensitive remote homology detection We will aim to develop profile HMM methods in HMMER with improved detection power and at HAMER3 speed We also have some technical and usability issues we will be addressing Real Soon Now More sequence input formats HMMER3 will work fine with FASTA files for unaligned sequences and Stockholm and aligned FASTA files for multiple sequen
9. The next field is the cs annotation for this node If cs was yes then this is a single character representing the consensus structure at this match state otherwise this field is Insert emission line The K fields on this line are the insert emission scores one per symbol in alphabetic order State transition line The seven fields on this line are the transitions for node k in the order shown by the transition header line Mi gt Miri Ik Dk 1 Ik gt Meri Ik Dr gt Meri Deas For transitions from the final node M match state M 1 is interpreted as the END state E and there is no delete state M 1 therefore the final M Dx4i and Di Dr transitions are always zero probability and the final Dy gt Mz41 transition is always 0 0 probability 1 0 Finally the last line of the format is the record separator Stockholm the recommended multiple sequence alignment format The Pfam and Rfam Consortiums have developed a multiple sequence alignment format called Stockholm format that allows rich and extensible annotation Most popular multiple alignment file formats can be changed into a minimal Stockholm format file just by adding a Stockholm header line and a trailing terminator STOCKHOLM 1 0 seql ACDEF GHIKL seq2 ACDEF GHIKL seq3 EFMNRGHIKL seql MNPOTVWY seq2 MNPOTVWY seq3 MNPOT The first line in the file must be STOCKHOLM 1 x where x is a minor versio
10. The phmmer program is for searching a single sequence query against a sequence database much as BLASTP Or FASTA would do phmmer works essentially just like hmmsearch does except you provide a query sequence instead of a query profile HMM Internally HMMER builds a profile HMM from your single query sequence using a simple position independent scoring system BLOSUM62 scores converted to probabilities plus a gap open and gap extend probability The file tutorial HBB_HUMAN is a FASTA file containing the human globin sequence as an example query If you have a sequence database such as uniprot sprot fasta make that your target database otherwise USe tutorial globins45 fa as a small example gt phmmer tutorial HBB_HUMAN uniprot_sprot fa or gt phmmer tutorial HBB_HUMAN tutorial globins45 fa Everything about the output is essentially as previously described for hmmsearch Iterative searches using jackhmmer The jackhmmer program is for searching a single sequence query iteratively against a sequence database much as PSI BLAST would do The first round is identical to a phmmer search All the matches that pass the inclusion thresholds are put in a multiple alignment In the second and subsequent rounds a profile is made from these results and the database is searched again with the profile Iterations continue either until no new sequences are detected or the maximum number of iterations is reached By default the maximum number
11. beta That continues until the inclusion threshold is reached at which point you see a tagline inclusion thresh old indicating where the threshold was set 0 00076 24 1 0 0 0 00083 24 0 0 0 1 0 1 sp P02180 MYG_BALPH 0 00087 23 9 0 0 0 0009 2319 0 0 1 0 1 sp P02148 MYG_PONPY 325 inclusion threshold 0 0013 23 3 03 0 021 19 4 0 2 2 1 1 sp P81044 HBAZ_MACEU 0 0021 22 7 0 0 0 0022 22 6 0 0 1 0 1 sp P02182 MYG_ZIPCA OS Homo sapien OS Pan paniscu OS Pan troglod OS Gorilla gor OS Hylobates 1 OS Semnopithec Myoglobin OS Balaenoptera physalus GN Myoglobin OS Pongo pygmaeus GN MB PE 1 Hemoglobin subunit zeta Fragments OS Myoglobin OS Ziphius cavirostris GN MB The domain output and search statistics are then shown just as in phmmer At the end of this first iteration you ll see some output that starts with ee this is a simple tag that lets you search through the file to find the end of one iteration and the beginning of another New targets included 894 New alignment includes 895 subseqs was 1 including original query Continuing to next round ee Round 2 Included in MSA 895 subsequences query 894 subseqs from 894 targets Model size 146 positions ee This obviously is telling you that the new alignment contains 895 sequences your query plus 894 significant matches For round two it s built a new model from this alignment Now for round two it fires off what s essenti
12. lt f gt is a positive real number giving the relative weight for a sequence usually used to compensate for biased representation by downweighting similar sequences Usually the weights average 1 0 e g the weights sum to the number of sequences in the alignment but this is not required Either every sequence must have a weight annotated or none of them can Ac lt s gt Accession lt s gt is a database accession number for this sequence Compare the cF Ac markup which gives an accession for the whole alignment One word DE lt s gt Description lt s gt is one line giving a description for this sequence Compare the cF DE markup which gives a description for the whole alignment recognized GC annotations RF Reference line Any character is accepted as a markup for a column The intent is to allow labeling the columns with some sort of mark SS_cons Secondary structure consensus For protein alignments DSSP codes or gaps are accepted as markup HGIEBTSCX _ where H is alpha helix G is 3 10 helix is p helix E is extended strand B is a residue in an isolated b bridge T is a turn S is a bend C is a random coil or loop and X is unknown for instance a residue that was not resolved in a crystal structure SA_cons Surface accessibility consensus 0 9 gap symbols or X are accepted as markup 0 means lt 10 accessible residue surface area 1 means lt 20 9 means lt 100 etc X means unknown structure 71
13. 0 7 1 2e 16 1 2e 16 1 85 1799 1890 1799 1891 65 0 91 8 17 8 0 0 1 8e 07 1 8e 07 6 TA 1904 TOGE ras 1901 LITO 0150290 91 12 8 0 0 6 6e 06 6 6e 06 1 86 1993 2107 sxe 1993 2107 0 89 Domains are reported in the order they appear in the sequence not in order of their significance The or symbol indicates whether this domain does or does not satisfy both per sequence and per domain inclusion thresholds Inclusion thresholds are used to determine what matches should be considered to be true as opposed to reporting thresholds that determine what matches will be reported often including the top of the noise so you can see what interesting sequences might be getting tickled by your search By default inclusion thresholds usually require a per sequence E value of 0 01 or less and a per domain conditional E value of 0 01 or less except jackhmmer which requires a more stringent 0 001 for both and reporting E value thresholds are set to 10 0 The bit score and bias values are as described above for sequence scores but are the score of just one domain s envelope The first of the two E values is the conditional E value This is an odd number and it s not even clear we re going to keep it Pay attention to what it means It is an attempt to measure the statistical significance of each domain given that we ve already decided that the target sequence is a true homolog It is the expected number of additional domains we d find w
14. 1 the observed score distribution 2 the maximum like linood fitted distribution 3 a maximum likelihood fit to the location parameter mu tau while assuming lambda log_2 Output the bit scores as a binary array of double precision floats 8 bytes per score to file lt f gt Programs like Easel s esl histplot can read such binary files This is useful when generating extremely large sample sizes Options controlling model configuration mode H3 only uses multihit local alignment fs mode and this is where we believe the statistical fits Unihit local alignment scores Smith Waterman sw mode also obey our statistical conjectures Glocal alignment statistics either multihit or unihit are still not adequately understood nor adequately fitted ts SW Is Collect multihit local alignment scores This is the default alignment as fragment search mode Collect unihit local alignment scores The H3 J state is disabled alignment as Smith Waterman search mode Collect multihit glocal alignment scores In glocal global local alignment the en tire model must align to a subsequence of the target The H3 local entry exit tran sition probabilities are disabled Ils comes from HMMER2 s historical terminology for multihit local alignment as local search mode Collect unihit glocal alignment scores Both the H3 J state and local entry exit tran sition probabilities are disabled s comes fr
15. IBM Intel Sun Microsystems Silicon Graphics Hewlett Packard Paracel and nVidia have provided generous hardware support that makes this possible owe a large debt to the free software community for the development tools use an incomplete list includes GNU gcc gdb emacs and autoconf the amazing valgrind the indispensable Subversion the ineffable perl LaTeX and TeX PolyglotMan and the UNIX and Linux operating systems Finally will cryptically thank Dave Mr Frog Pare and Tom Chainsaw Ruschak for a totally unrelated free software product that was historically instrumental in HMMER s development for reasons that are best not discussed while sober 75 References Altschul S F Gish W Miller W Myers E W and Lipman D J 1990 Basic local alignment search tool J Mol Biol 215 403 410 Altschul S F Madden T L Schaffer A A Zhang J Zhang Z Miller W and Lipman D J 1997 Gapped BLAST and PSI BLAST A new generation of protein database search programs Nucl Acids Res 25 3389 3402 Barton G J 1990 Protein multiple sequence alignment and flexible pattern matching Meth Enzymol 183 403 427 Bashford D Chothia C and Lesk A M 1987 Determinants of a protein fold Unique features of the globin amino acid sequences J Mol Biol 196 199 216 Churchill G A 1989 Stochastic models for heterogeneous DNA sequences Bull Math Biol 51 79 94 Dur
16. Set the lower bound on the tail mass distribution The default is 0 02 for the default single tail mass tmax lt x gt _ Set the upper bound on the tail mass distribution The default is 0 02 for the default single tail mass tpoints lt n gt Set the number of tail masses to sample starting from tmin and ending at tmax The default is 1 for the default 0 02 single tail mass tlinear Sample a range of tail masses with uniform linear spacing The default is to use uniform logarithmic spacing Options controlling h3 parameter estimation methods H3 uses three short random sequence simulations to estimating the location parameters for the expected score distributions for MSV scores Viterbi scores and Forward scores These options allow these simula tions to be modified EmL lt n gt Sets the sequence length in simulation that estimates the location parameter mu for MSV E values Default is 200 EmN lt n gt Sets the number of sequences in simulation that estimates the location parameter mu for MSV E values Default is 200 EvL lt n gt Sets the sequence length in simulation that estimates the location parameter mu for Viterbi E values Default is 200 EvN lt n gt Sets the number of sequences in simulation that estimates the location parameter mu for Viterbi E values Default is 200 EfL lt n gt Sets the sequence length in simulation that estimates the location parameter tau for Forward E values Defa
17. above it would be unusual to use a single bit score threshold in hmmscan 41 Options for model specific score thresholding Curated profile databases may define specific bit score thresholds for each profile superseding any thresh olding based on statistical significance alone To use these options the profile must contain the appropriate GA TC and or NC optional score threshold annotation this is picked up by hmmbuild from Stockholm format alignment files Each thresholding option has two scores the per sequence threshold lt x1 gt and the per domain threshold lt x2 gt These act as if T lt x1 gt incT lt x1 gt domT lt x2 gt incdomT lt x2 gt has been applied specifically using each model s curated thresholds cut ga Use the GA gathering bit scores in the model to set per sequence GA1 and per domain GA2 reporting and inclusion thresholds GA thresholds are generally considered to be the reliable curated thresholds defining family membership for example in Pfam these thresholds define what gets included in Pfam Full align ments based on searches with Pfam Seed models cut_nc Use the NC noise cutoff bit score thresholds in the model to set per sequence NC1 and per domain NC2 reporting and inclusion thresholds NC thresholds are generally considered to be the score of the highest scoring known false posi tive cut_tc Use the NC trusted cutoff bit score thresholds in the model to set per sequence TC
18. an alignment format that has no reference annotation line only Stockholm and SELEX formats support reference annota tion In this case Easel internally generates a reasonable guess at consensus columns using essentially the same procedure that HMMER s hmmbuild program uses by default sequence fragments sequences lt 50 of the mean sequence length in the alignment overall are ignored and for the remaining sequences any column containing gt 50 residues is considered to be a consensus column 73 6 Acknowledgements and history HMMER 1 was developed on slow weekends in the lab at the MRC Laboratory of Molecular Biology Cam bridge UK while was a postdoc with Richard Durbin and John Sulston thank the Human Frontier Science Program and the National Institutes of Health for their remarkably enlightened support at a time when was really supposed to be working on the genetics of neural development in C elegans HMMER 1 8 the first public release of HUMMER came in April 1995 shortly after moved to Washington University in St Louis A few bugfix releases followed A number of more serious modifications and improvements went into HMMER 1 9 code but 1 9 was never released Some versions of HMMER 1 9 inadvertently escaped St Louis and make it to some genome centers but 1 9 was never documented or supported HMMER 1 9 collapsed under its own weight in 1996 HMMER 2 was a nearly complete rewrite based on the new Plan 7 model
19. are determined they are normalized to sum to a total effective sequence number eff_nseg This number may be the actual number of sequences in the alignment but it is almost always smaller than that The default entropy weighting method eent reduces the effective sequence num ber to reduce the information content relative entropy or average expected score on true homologs per consensus position The target relative entropy is controlled by a two parameter function where the two parameters are settable with ere and esigma 32 eent eclust enone eset lt x gt ere lt x gt esigma lt x gt eid lt x gt Adjust effective sequence number to achieve a specific relative entropy per position see ere This is the default Set effective sequence number to the number of single linkage clusters at a spe cific identity threshold see eid This option is not recommended it s for experi ments evaluating how much better eent is Turn off effective sequence number determination and just use the actual number of sequences One reason you might want to do this is to try to maximize the relative entropy position of your model which may be useful for short models Explicitly set the effective sequence number for all models to lt x gt Set the minimum relative entropy position target to lt x gt Requires eent Default depends on the sequence alphabet for protein sequences it is 0 59 bits positi
20. best scoring alignment However sequence alignments are uncertain and the more distantly related sequences are the more uncertain alignments become HMMER3 calculates complete pos terior alignment ensembles rather than single optimal alignments Posterior ensembles get used for a variety of useful inferences involving alignment uncertainty For example any HMMER3 sequence alignment is accompanied by posterior probability annotation representing the degree of confidence in each individual aligned residue Sequence scores not alignment scores Alignment uncertainty has an important impact on the power of sequence database searches It s precisely the most remote homologs the most difficult to identify and potentially most interesting sequences where alignment uncertainty is greatest and where the statistical approximation inherent in scoring just a single best alignment breaks down the most To maximize power to discriminate true homologs from nonhomologs in a database search statistical inference theory says you ought to be scoring sequences by integrating over alignment uncertainty not just scoring the single best alignment HMMER3 s log odds scores are sequence scores not just optimal alignment scores they are integrated over the posterior alignment ensemble Speed A major limitation of previous profile HMM implementations including HMMER2 was their slow performance HMMER3 implements a new heuristic acceleration algorithm For mo
21. composition of the profile This was used in exploring some of the effects of biased composition Turn the H3 target sequence length model off Set the self transitions for N C J and the null model to 350 351 instead this emulates HMMER2 Not a good idea in general This was used to demonstrate one of the main H2 vs H3 differences Set the nu parameter for the MSV algorithm the expected number of ungapped local alignments per target sequence The default is 2 0 corresponding to a E gt J transition probability of 0 5 This was used to test whether varying nu has significant effect on result it doesn t seem to within reason This option only works if msv is selected it only affects MSV and it will not work with fast because the optimized implementations are hardwired to assume nu 2 0 Set the filter P value threshold to use in generating filter power files with ffile The default is 0 02 which would be appropriate for testing MSV scores since this is the default MSV filter threshold in H3 s acceleration pipeline Other appropriate choices matching defaults in the acceleration pipeline would be 0 001 for Viterbi and 1e 5 for Forward 52 hmmstat display summary statistics for a profile file Synopsis hmmstat options lt hmmfile gt Description The hmmstat utility prints out a tabular file of summary statistics for each profile in lt hmmfile gt The columns are Options idx name accession nseq
22. composition of the profile is Highly biased profiles can slow the HMMERS acceleration pipeline by causing too many nonhomologous sequences to pass the filters Help print a brief reminder of command line usage and all available options 53 jackhmmer iteratively search sequence s against a protein database Synopsis jackhmmer options lt seqfile gt lt seqdb gt Description jackhmmer iteratively searches each query sequence in lt segfile gt against the target sequence s in lt seqdb gt The first iteration is identical to a phmmer search For the next iteration a multiple align ment of the query together with all target sequences satisfying inclusion thresholds is assembled a profile is constructed from this alignment identical to using hmmbuild on the alignment and profile search of the lt seqdb gt is done identical to an hmmsearch with the profile The output format is designed to be human readable but is often so voluminous that reading it is impractical and parsing it is a pain The tblout and domtblout options save output in simple tabular formats that are concise and easier to parse The o option allows redirecting the main output including throwing it away in dev null Options h Help print a brief reminder of command line usage and all available options N lt n gt Set the maximum number of iterations to lt n gt The default is 5 If N 1 the result is equivalent to a phmmer search Option
23. eg A A AS DS PA Aa AA eee AN ee e query HMM file globins4 hmm target sequence database uniprot_sprot fasta per seq hits tabular output globins4 tbl per dom hits tabular output globins4 domtbl number of worker threads 2 da Ea a Sa aa aa A AA as A ee Oe Query globins4 M 149 Scores for complete sequences score includes all domains The second section is the sequence top hits list It is a list of ranked top hits sorted by E value most significant hit first formatted in a BLAST like style full sequence best 1 domain dom E value score bias E value score bias exp N Sequence Description 6e 65 222 7 34 2 6 7e 65 222 6 2 2 1 0 1 sp P02185 MYG_PHYCA 3 1e 63 217 2 0 1 3 4e 63 217 0 0 0 1 0 1 sp P02024 HBB_GORGO 4 5e 63 216 6 0 0 5e 63 216 5 0 0 1 0 1 sp P68871 HBB_HUMAN 4 5e 63 216 6 0 0 5e 63 216 5 0 0 1 0 1 sp P68872 HBB_PANPA 4 5e 63 216 6 0 0 5e 63 216 5 0 0 1 0 1 sp P68873 HBB_PANTR 6 4e 63 216 1 340 7 1le 63 216 0 2 0 1 0 1 sp P02177 MYG_ESCGI Myoglobin OS Physeter catodon GN MB PE Hemoglobin subunit beta OS Gorilla gor Hemoglobin subunit beta OS Homo sapien Hemoglobin subunit beta OS Pan paniscu Hemoglobin subunit beta OS Pan troglod Myoglobin OS Eschrichtius gibbosus GN The last two columns obviously are the name of each target sequence and optional description The most important number here is the first one the sequence E value This is the statistical significance of the match
24. filter threshold set the P value threshold for the Viterbi filter step The default is 0 001 F3 lt x gt Third filter threshold set the P value threshold for the Forward filter step The default is 1e 5 nobias Turn off the bias filter This increases sensitivity somewhat but can come at a high cost in speed especially if the query has biased residue composition such as a repetitive sequence region or if it is a membrane protein with large regions of hydrophobicity Without the bias filter too many sequences may pass the filter with biased queries leading to slower than expected performance as the compu tationally intensive Forward Backward algorithms shoulder an abnormally heavy load Options controlling profile construction later iterations These options control how consensus columns are defined in multiple alignments when building profiles By default jackhmmer always includes your original query sequence in the alignment result at every iteration and consensus positions are defined by that query sequence that is a default jackhmmer profile is always the same length as your original query at every iteration fast Define consensus columns as those that have a fraction gt symfrac of residues as opposed to gaps See below for the symfrac option Although this is the default profile construction option elsewhere in hmmbuild in particular it may have undesirable effects in jackhmmer because a profile could itera
25. fraction lt x gt times the mean sequence length of all the sequences in the alignment then the sequence is handled as a fragment The default is 0 5 Options controlling relative weights HMMER uses an ad hoc sequence weighting algorithm to downweight closely related sequences and up weight distantly related ones This has the effect of making models less biased by uneven phylogenetic representation For example two identical sequences would typically each receive half the weight that one sequence would These options control which algorithm gets used Wpb Use the Henikoff position based sequence weighting scheme Henikoff and Henikoff J Mol Biol 243 574 1994 This is the default wgsc Use the Gerstein Sonnhammer Chothia weighting algorithm Gerstein et al J Mol Biol 235 1067 1994 wblosum Use the same clustering scheme that was used to weight data in calculating BLO SUM subsitution matrices Henikoff and Henikoff Proc Natl Acad Sci 89 10915 1992 Sequences are single linkage clustered at an identity threshold default 0 62 see wid and within each cluster of c sequences each sequence gets rela tive weight 1 c wnone No relative weights All sequences are assigned uniform weight wid lt x gt Sets the identity threshold used by single linkage clustering when using wblosum Invalid with any other weighting scheme Default is 0 62 Options controlling effective sequence number After relative weights
26. gt as the per domain inclusion threshold in targets that have already satisfied the overall per target inclusion threshold The default is 0 01 incdomT lt x gt Instead of using E values use a bit score of gt lt x gt as the per domain inclusion threshold 45 Options for model specific score thresholding Curated profile databases may define specific bit score thresholds for each profile superseding any thresh olding based on statistical significance alone To use these options the profile must contain the appropriate GA TC and or NC optional score threshold annotation this is picked up by hmmbuild from Stockholm format alignment files Each thresholding option has two scores the per sequence threshold lt x1 gt and the per domain threshold lt x2 gt These act as if T lt x1 gt incT lt x1 gt domT lt x2 gt incdomT lt x2 gt has been applied specifically using each model s curated thresholds cut ga Use the GA gathering bit scores in the model to set per sequence GA1 and per domain GA2 reporting and inclusion thresholds GA thresholds are generally considered to be the reliable curated thresholds defining family membership for example in Pfam these thresholds define what gets included in Pfam Full align ments based on searches with Pfam Seed models cut_nc Use the NC noise cutoff bit score thresholds in the model to set per sequence NC1 and per domain NC2 reporting and inclusion thresholds NC t
27. have been used for years in speech recognition HMMs had been used in biology before the Krogh Haussler work notably by Gary Churchill Churchill 1989 but the Krogh paper had a dramatic impact because HMM technology was Nothing should go wrong so well suited to the popular profile methods for searching databases using multiple sequence alignments instead of single query sequences Profiles had been introduced by Gribskov and colleagues Gribskov et al 1987 1990 and several other groups introduced similar approaches at about the same time such as flexible patterns Barton 1990 and templates Bashford et al 1987 Taylor 1986 The term profile has stuck All profile meth ods including PSI BLAST Altschul et al 1997 are more or less statistical descriptions of the consensus of a multiple sequence alignment They use position specific scores for amino acids or nucleotides and position specific penalties for opening and extending an insertion or deletion Traditional pairwise alignment for example BLAST Altschul et al 1990 FASTA Pearson and Lipman 1988 or the Smith Waterman algorithm Smith and Waterman 1981 uses position independent scoring parameters This property of profiles captures important information about the degree of conservation at various positions in the multiple alignment and the varying degree to which gaps and insertions are permitted The advantage of using HMMs is t
28. not 16 165 5 have these options hmmbuild builds a core profile which the search and alignment programs configure as they need to And at least for the moment they always configure for local alignment Step 2 search the sequence database with hmmsearch Presumably you have a sequence database to search Here lIl use the Uniprot 15 7 Swissprot FASTA format flatfile not provided in the tutorial because of its large size uniprot_sprot fasta If you don t have a sequence database handy run your example search against tutorial globins45 fa instead which is a FASTA format file containing 45 globin sequences hmmsearch accepts any FASTA file as input It also accepts EMBL Uniprot text format It will automat ically determine what format your file is in you don t have to say An example of searching a sequene database with our globins4 hmm model would look like gt hmmsearch globins4 hmm uniprot_sprot fasta gt globins4 out Depending on the database you search the output file globins4 out should look more or less like the example of a Uniprot search output provided in tutorial globins4 out The first section is the header that tells you what program you ran on what and with what options hmmsearch search profile s against a sequence database HMMER 3 0 March 2010 http hmmer org Copyright C 2010 Howard Hughes Medical Institute Freely distributed under the GNU General Public License GPLv3 A es
29. of a Options for specifying the alphabet a noaoo es Options controlling profile construction o o o Options controlling relative weights o o Options controlling effective sequence number ee ee ees Options controlling e value calibration 1 o Other Options ses ue teeter Beebe tye dee Ge ER ae A car ede e eee hmmconvert convert profile file to a HMMER format o o a SYNOPSIS Le A MAE Beck dat Meine debe Os Des Bo ee a DeSCripliOn sn oa a a ee e e ta ee a tot eo a as ee RR OPINAS 5s i818 o Seely A a othe Ge BAe ee 35 hmmemit sample sequences from a profile HMM 0000222 36 SYVMOPSIS gt iste RR Ge fas nls te aed fete Ge SHS 36 DESPON dat Be tcs Ae od od oc tht oat oe WOE Ase oe Se ae ae 36 COMMON OPTIONS vd eee EES ha a Pe ee eee ee EGE y a 36 Options controlling emission from profiles 0 0 0 00 eee ee ee eee 36 Other options durene i aaa ee 37 hmmfetch retrieve profile HMM s from a file 2 0000 cee eee ee eee 38 SYNOPSIS ii E A Pale an ee Wee Be Gh 38 DeSCHptiOM 4 tosenn att a eee wn EG be de A Ee Ee Te Ghia aie Gee de ees 38 OPUONS gt 2 2 hs aaya daa hale ees ERE SS 38 hmmpress prepare an HMM database forhmmscan 0 a 39 SINOPSIS Si ete Sa aan Do ye ae a OR RA eet ohh et Bee tee a AE 39 Pgserniption z ien ae A ee eee eee aia ee S 39 OPUONS saa Ba
30. of iterations is 5 you can change this with the n option Your original query sequence is always included in the multiple alignments whether or not it appears in the database The consensus columns assigned to each multiple alignment always correspond exactly to the residues of your query so the coordinate system of every profile is always the same as the numbering of residues in your query sequence 1 L for a sequence of length L Assuming you have Uniprot or something like it handy here s an example command line for a jackhmmer search SIf it is in the database it will almost certainly be included in the internal multiple alignment twice once because it s the query and once because it s a significant database match to itself This redundancy won t screw up the alignment because sequences are downweighted for redundancy anyway 26 gt jackhmmer tutorial HBB_HUMAN uniprot_sprot fa One difference from phmmer output you ll notice is that jackhmmer marks new sequences with a and lost sequences with a New sequences are sequences that pass the inclusion threshold s in this round but didn t in the round before Lost sequences are the opposite sequences that passed the inclusion threshold s in the previous round but have now fallen beneath yet are still in the reported hits it s possible though rare to lose sequences utterly if they no longer even pass the reporting threshold s In the first roun
31. recognized GF annotations ID lt s gt Identifier lt s gt is a name for the alignment e g rrm One word Unique in file AC lt s gt Accession lt s gt is a unique accession number for the alignment e g PF00001 Used by the Pfam database for instance Often a alphabetical prefix indicating the database e g PF followed by a unique numerical accession One word Unique in file DE lt s gt Description lt s gt is a free format line giving a description of the alignment e g RNA recogni tion motif proteins One line Unique in file AU lt s gt Author lt s gt is a free format line listing the authors responsible for an alignment e g Bateman A One line Unique in file GA lt f gt lt f gt Gathering thresholds Two real numbers giving HMMER bit score per sequence and per domain cutoffs used in gathering the members of Pfam full alignments NC lt f gt lt f gt Noise cutoffs Two real numbers giving HMMER bit score per sequence and per domain cutoffs set according to the highest scores seen for unrelated sequences when gathering members of Pfam full alignments TC lt f gt lt f gt Trusted cutoffs Two real numbers giving HMMER bit score per sequence and per domain cutoffs set according to the lowest scores seen for true homologous sequences that were above the GA gathering thresholds when gathering members of Pfam full alignments recognized GS annotations wT lt f gt Weight
32. that it s even better than that That s entirely due to Goran Ceric HMMER3 testing now spins up thousands of processors at a time an unearthly amount of computing power Over the years the MRC LMB computational molecular biology discussion group contributed many ideas to HMMER In particular thank Richard Durbin Graeme Mitchison Erik Sonnhammer Alex Bate 74 man Ewan Birney Gos Micklem Tim Hubbard Roger Sewall David MacKay and Cyrus Chothia The UC Santa Cruz HMM group led by David Haussler and including Richard Hughey Kevin Karplus Anders Krogh now back in Copenhagen and Kimmen Sj lander has been a source of knowledge friendly competition and occasional collaboration All scientific competitors should be so gracious The Santa Cruz folks have never complained at least in my earshot that HMMER started as simply a re implementation of their original ideas just to teach myself what HMMs were In many places I ve reimplemented algorithms described in the literature These are too numerous to credit and thank here The original references are given in the comments of the code However I ve borrowed more than once from the following folks that l d like to be sure to thank Steve Altschul Pierre Baldi Phillip Bucher Warren Gish Steve and Jorja Henikoff Anders Krogh and Bill Pearson HMMER is primarily developed on GNU Linux and Apple Macintosh machines but is tested on a variety of hardware Over the years Compaq
33. the number of match states in the model Mandatory Symbol alphabet type For biosequence analysis models lt s gt iS amino DNA Or RNA case insensitive There are also other accepted alphabets for purposes be yond biosequence analysis including coins dice and custom This determines the symbol alphabet and the size of the symbol emission probability distributions If amino the alphabet size K is set to 20 and the symbol alphabet to ACDE FGHIKLMNPQRSTVWY alphabetic order if pna the alphabet size K is set to 4 and the symbol alphabet to ACGT if RNa the alphabet size K is set to 4 and the symbol alphabet to ACGU Mandatory Reference annotation flag lt s gt is either no or yes case insensitive If yes the reference annotation character field for each match state in the main model see below is valid if no these characters are ignored Reference column annotation is picked up from a Stockholm alignment files GC RF line It is propagated to alignment outputs and also may optionally be used to define consensus match columns in profile HMM construction Optional assumed to be no if not present Consensus structure annotation flag lt s gt is either no or yes case insensitive If yes the consensus structure character field for each match state in the main model see below is valid if no these characters are ignored Consensus structure an notation is picked up from a Stockholm file s GC SS_cons lin
34. the residue is aligned to one model position versus others but also confounded with whether a residue should be considered to be homologous aligned to the model somewhere versus not homologous at all 4 These domain table and per domain alignment reports for each sequence then continue for each se quence that was in the per sequence top hits list Finally at the bottom of the file you ll see some summary statistics For example at the bottom of the globins search output you ll find something like Internal pipeline statistics summary Query model s ki 149 nodes Target sequences 497293 175274722 residues Passed MSV filter 19416 0 0390434 expected 9945 9 0 02 Passed bias filter 15923 0 0320194 expected 9945 9 0 02 Passed Vit filter 2207 0 00443803 expected 497 3 0 001 Passed Fwd filter 1076 0 00216371 expected 5 0 le 05 Initial search space Z 497293 actual number of targets Domain search space domZ 1075 number of targets reported over threshold CPU time 5 66u 0 07s 00 00 05 73 Elapsed 00 00 02 29 Mc sec 11354 75 This gives you some idea of what s going on in HMMERS3 s acceleration pipeline You ve got one query HMM and the database has 497 293 target sequences Each sequence goes through a gauntlet of three scoring algorithms called MSV Viterbi and Forward in order of increasing sensitivity and increasing com putational requirement MSV the Multi ungapped Segment Viterbi
35. 0 Lal 7 9e 02 10 36 1209 1230 1203 1259 0 82 4 24 3 0 0 1 2e 06 0 00084 14 80 1313 1380 1304 1386 0 82 5 47 2 0 7 8 3e 14 5 8e 11 1 85 Es 1799 1890 1799 18 94 wa 0 091 6 17 8 0 0 0 00013 0 091 6 TA 1904 19 66 qe 1901 LOTO 0 5 02 90 71 1248 0 0 0 0047 33 1 86 1993 2107 0 1993 2107 gt 6 07 89 Notice that almost everything s the same it s the same target sequence after all except for what hap pens with E values The independent E value is calculated assuming a search space of all 497293 se quences For example look at the highest scoring domain domain 5 here domain 7 above When we only looked at a single sequence its score of 47 2 bits has an E value of 5 8e 11 When we search a database of 497293 sequences a hit scoring 47 2 bits would be expected to happen 497293 times as often 1 2e 16 x 497293 5 97e 11 it s showing 5 8e 11 because of roundoff issues the 1 2e 16 in fact isn t exactly 1 2e 16 inside HMMER In this Uniprot search 711 sequences were reported in the top hits list with E values lt 10 If we were to assume that all 711 are true homologs x out the domain s that made us think that and then went looking for additional domains in those 711 sequences we d be searching a smaller database of 711 sequences the expected number of times we d see a hit of 47 2 bits or better is now 1 2e 16 x 711 8 3e 14 That s where the conditional E value c Evalue is coming from Notice that a couple of domains dis
36. 00 15 If your input file had contained more than one alignment you d get one line of output for each model For instance a single hmmbuild command suffices to turn a Pfam seed alignment flatfile such as Pfam A seed into a profile HMM flatfile such as Pfam hmm The information on these lines is almost self explanatory The globins4 alignment consisted of 4 se quences with 171 aligned columns HMMER turned it into a model of 149 consensus positions which means it defined 22 gap containing alignment columns to be insertions relative to consensus The 4 se quences were only counted as an effective total sequence number eff nseq of 0 96 The model ended up with a relative entropy per position re pos information content of 0 589 bits This output format is rudimentary HMMER3 knows quite a bit more information about what it s done to build this HMM Some of this information is likely to be useful to you the user As H3 testing and development proceeds were likely to expand the amount of data that hmmbuild reports The new HMM was saved to globins4 hmm If you were to look at this file and you don t have to it s intended for HMMER s consumption not yours you d see something like HMMER3 b 3 0 March 2010 NAME globins4 LENG 149 ALPH amino RF no cS no MAP yes DATE Sun Mar 28 09 50 46 2010 NSEQ 4 EFFN 0 964844 CKSUM 2027839109 STATS LOCAL MSV 9 9014 0 70957 STATS LOCAL VITERBI 10 7224 0 70957 STAT
37. 1 and per domain TC2 reporting and inclusion thresholds TC thresholds are generally considered to be the score of the lowest scoring known true positive that is above all known false positives Control of the acceleration pipeline HMMERS searches are accelerated in a three step filter pipeline the MSV filter the Viterbi filter and the Forward filter The first filter is the fastest and most approximate the last is the full Forward scoring al gorithm There is also a bias filter step between MSV and Viterbi Targets that pass all the steps in the acceleration pipeline are then subjected to postprocessing domain identification and scoring using the Forward Backward algorithm Changing filter thresholds only removes or includes targets from considera tion changing filter thresholds does not alter bit scores E values or alignments all of which are determined solely in postprocessing max Turn off all filters including the bias filter and run full Forward Backward postpro cessing on every target This increases sensitivity somewhat at a large cost in speed F1 lt x gt Set the P value threshold for the MSV filter step The default is 0 02 meaning that roughly 2 of the highest scoring nonhomologous targets are expected to pass the filter F2 lt x gt Set the P value threshold for the Viterbi filter step The default is 0 001 F3 lt x gt Set the P value threshold for the Forward filter step The default is 1e 5 nob
38. 120 characters per line which helps in displaying the output cleanly on terminals and in editors but can truncate target profile description lines textw lt n gt Set the main output s line length limit to lt n gt characters per line The default is 120 40 Options for reporting thresholds Reporting thresholds control which hits are reported in output files the main output tblout and domtblout E lt x gt T lt x gt domE lt x gt domT lt x gt In the per target output report target profiles with an E value of lt lt x gt The default is 10 0 meaning that on average about 10 false positives will be reported per query so you can see the top of the noise and decide for yourself if it s really noise Instead of thresholding per profile output on E value instead report target profiles with a bit score of gt lt x gt In the per domain output for target profiles that have already satisfied the per profile reporting threshold report individual domains with a conditional E value of lt lt x gt The default is 10 0 A conditional E value means the expected number of additional false positive domains in the smaller search space of those comparisons that already satisfied the per profile reporting threshold and thus must have at least one homologous domain already Instead of thresholding per domain output on E value instead report domains with a bit score of gt lt x gt Options fo
39. 2 lt x gt Second filter threshold set the P value threshold for the Viterbi filter step The default is 0 001 F3 lt x gt Third filter threshold set the P value threshold for the Forward filter step The default is 1e 5 nobias Turn off the bias filter This increases sensitivity somewhat but can come at a high cost in speed especially if the query has biased residue composition Such as a repetitive sequence region or if it is a membrane protein with large regions of hydrophobicity Without the bias filter too many sequences may pass the filter with biased queries leading to slower than expected performance as the compu tationally intensive Forward Backward algorithms shoulder an abnormally heavy load 63 Options controlling e value calibration Estimating the location parameters for the expected score distributions for MSV filter scores Viterbi filter scores and Forward scores requires three short random sequence simulations EmL lt n gt EmN lt n gt EvL lt n gt EvN lt n gt EfL lt n gt EfN lt n gt Eft lt x gt Other options nonull2 Z lt x gt domZ lt x gt seed lt n gt qformat lt s gt tformat lt s gt cpu lt n gt Sets the sequence length in simulation that estimates the location parameter mu for MSV filter E values Default is 200 Sets the number of sequences in simulation that estimates the location parameter mu for MSV filter E va
40. 3 2098 that Uniprot currently doesn t annotate but these are pretty plausible domains given that the extracellular domain of Sevenless is pretty much just a big array of 100aa fibronectin repeats Under the domain table an optimal posterior accuracy alignment Holmes 1998 is computed within each domain s envelope and displayed For example skipping domain 1 because it s weak and uncon vincing fibronectin lll domain 2 in your 7LESS_DROME output is shown as domain 2 score 40 7 bits conditional E value 1 3e 14 CEEEEEEECTTEEEEEEE S SS SEEEEEEEETTTCCGCEEEEEETTTSEEEEES TT EEEEEEEEEETTEE E CS fn3 2 saPenlsvsevtstsltlsWsppkdgggpitgYeveyqekgegeewgevtvprtttsvtltgLepgteYefrVqavngagegp 84 saP 1 4 4W p tgpitgY e vpt s L gt Y n gegp 7LESS_DROME 439 SAPVIEHLMGLDDSHLAVHWHPGRFTNGPIEGYRLRLSSSEGNA TSEOLVPAGRGSYIFSOLOAGTNYTLALSMINKOGEGP 520 78999999999 KKK KKK KKK KK KK RK KKK KKK AK IIIS KKK KR RK RK KK KK KK KKK KIIIT PP The initial header line starts with a as a little handle for a parsing script to grab hold of The rest of that line we ll probably put more information on eventually If the model had any consensus structure or reference line annotation that it inherited from your multiple alignment GC SS_cons GC RF annotation in Stockholm files that information is simply regurgitated as CS Or RF annotation lines here The n3 model had a consensus structure annotation l
41. 3 61503 0 00338 6 08833 6 81068 0 61958 0 77255 0 48576 0 95510 86 3 03720 5 94099 3 75455 2 96917 5 26587 2 91682 3 66571 4 11840 4 99111 121 E 2 68618 4 42225 2 77519 2 73123 3 46354 2 40513 3 72494 3 29354 3 61503 0 00227 6 08723 0 61958 0 77255 0 00000 11 An HMM file consists of one or more HMMs Each HMM starts with the identifier HMMER3 b and ends with on a line by itself The identifier allows backward compatibility as the HMMER software evolves it tells the parser this file is from HMMER3 s save file format version b The 3 a format was used in alpha test versions of HMMERS The closing allows multiple HMMs to be concatenated The format is divided into two regions The first region contains textual information and miscalleneous parameters in a roughly tag value scheme This section ends with a line beginning with the keyword Hmm The second region is a tabular whitespace limited format for the main model parameters All probability parameters are all stored as negative natural log probabilities with five digits of precision to the right of the decimal point rounded For example a probability of 0 25 is stored as log 0 25 1 38629 The special case of a zero probability is stored as Spacing is arranged for human readability but the parser only cares that fields are separated by at least one space character A more detailed description of the format follows header section The header section is
42. 65 5 File formats HMMER profile HMM files The file tutorial f n3 hmm gives an example of a HMMER3 ASCII save file An abridged version is shown here where mark deletions made for clarity and space HMMER3 b 3 0b2 June 2009 NAME fn3 ACC PFO0041 12 DESC Fibronectin type III domain LENG 86 ALPH amino RF no cs yes MAP yes DATE Mon Jun 15 09 39 41 2009 NSEQ 106 EFFN 11 415833 CKSUM 72638555 GA 8 00 7 20 TC 8 00 7 20 NC 7 90 7 90 STATS LOCAL MSV 9 4043 0 71847 STATS LOCAL VITERBI 9 7737 0 41847 STATS LOCAL FORWARD 3 8341 0 71847 HMM A C D E F G H I Y m gt m m gt i m gt d i gt m E e Y d gt m d gt d COMPO 2 70271 4 89246 3 02314 2 64362 3 59817 2 82566 3 74147 3 08574 3 22607 2 68618 4 42225 2 77519 2 73123 3 46354 2 40513 3 72494 3 29354 3 61503 0 00338 6 08833 6 81068 0 61958 0 77255 0 00000 1 3 16986 5 21447 4 52134 3 29953 4 34285 4 18764 4 30886 3 35801 3 93889 Dre 2 68629 4 42236 2 77530 2 73088 3 46365 2 40512 3 72505 3 29365 3 61514 0 09796 2 38361 6 81068 0 10064 2 34607 0 48576 0 95510 2 2 70230 5 97353 2 24744 2 62947 5 31433 2 60356 4 43584 4 79731 4 25623 q gt 2 68618 4 42225 2 77519 2 73123 3 46354 2 40513 3 72494 3 29354 3 61503 0 00338 6 08833 6 81068 0 61958 0 77255 0 48576 0 95510 o 85 2 48488 5 72055 3 87501 1 97538 3 04853 3 48010 4 51877 3 51898 3 43366 120 B 2 68618 4 42225 2 77519 2 73123 3 46354 2 40513 3 72494 3 29354
43. 7 257 286 Smith T F and Waterman M S 1981 Identification of common molecular subsequences J Mol Biol 147 195 197 Sonnhammer E L L Eddy S R and Durbin R 1997 Pfam A comprehensive database of protein families based on seed alignments Proteins 28 405 420 Taylor W R 1986 Identification of protein sequence homology by consensus template alignment J Mol Biol 188 233 258 77
44. A a A Le eed a a ee S Makefile targets n aa a o e a ae Sap handed aa eed Workarounds for some unusual configure compilation problems Tutorial The programs In AMMER i a O a A O a ja a a Files usediinsthe tutorial irala d a a e Gk o a aa a See ee an a Ta lar a ee ee Geer e Searching a sequence database with a single profile HMM anaua anaana Step 1 build a profile HMM with hmmbuild 0 0 00000 eee eee eee Step 2 search the sequence database with hmmsearch 50055 Searching a profile HMM database with a query Sequence 0 0 000000 ee Step 1 create an HMM database flatfile 0 o eee Step 2 compress and index the flatfile with hmmpress Step 3 search the HMM database withhmmscan o o Creating multiple alignments with hmmalign o o o o ooo Single sequence queries usingphmmer e Iterative searches usingjackhmmer 0 o Manual pages hmmalign align sequences to a profile HMM o o e SYNOPSIS ici ida he a aed pt aa ote be ar A de de a DeSCription a ai dd a taa ar a4 OPUS n pre a o ht e Bd Pak a en ta aai it hmmbui 1d construct profile HMM s from multiple sequence alignment s SYNOPSIS iio do a A AW Ra D SChiplON simi aa e te ek RRO OO Oe we ee ee ee eee Raa OPONS sii ahah anes E Y he Se A eed a ok aan Anke tee a
45. For domain there then follows a domain table and alignment output just as in hmmsearch The fn3 annotation for example looks like Domain and alignment annotation for each model gt gt fn3 Fibronectin type III domain score bias c Evalue i Evalue hmmfrom hmm to alifrom ali to envfrom env to acc t S13 0 0 0 33 0 5 61 74 396 409 395 411 0 85 at 40 7 0 0 2 6e 14 3 8e 14 2 84 is 439 B20 isla 437 O21 tear 0 95 301 14 4 0 0 4 1e 06 6 1e 06 13 85 ua 836 O13 wos 826 914 0 73 41 Sal 0 0 0 0032 0 0048 10 DO sa 1209 1235 a 1203 1239 6 0582 24 Dt 24 3 0 0 3 4e 09 5e 09 14 80 1313 1380 1304 1386 0 82 6 0 0 0 0 0 13 0 19 58 TE es 1754 LIGG ave 17 39 11 69 089 P 47 2 0 7 2 3e 16 3 5e 16 1 85 1799 1890 1799 TBO wp 05 91 Bot LIB 0 0 3 7e 07 5 5e 07 6 74 1904 1966 1901 LOVE A 0 90 al 12 8 0 0 1 3e 05 2e 05 ab 86 1993 2107 1993 2107 0 89 and an example alignment of that second domain again domain 2 score 40 7 bits conditional E value 2 6e 14 CEEEEEEECTTEEEEEEE S SS SEEEEEEEETTTCCGCEEEEEETTTSEEEEES TT EEEEEEEEEETTEE E CS fn3 2 saPenlsvsevtstsltlsWsppkdgggpitgYeveyqekgegeewgevtvprtttsvtltgLepgteYefrVqavngagegp 84 saP 1 4 4W p tgpitgY e vpt s L gt Y n gegp 7LESS_DROME 439 SAPVIEHLMGLDDSHLAVHWHPGRFTNGPIEGYRLRLSSSEGNA TSEOLVPAGRGSYIFSOLOAGTNYTLALSMINKOGEGP 520 RELE RELE EEE KKK KKK KK RK KKK KKK KKK KKKKKIOIR RK KKK KK KKK RK KKK KKK KKK KKK KKKKKKKIIIT PP
46. HMMER User s Guide Biological sequence analysis using profile hidden Markov models http hmmer org Version 3 0 March 2010 Sean R Eddy for the HMMER Development Team Janelia Farm Research Campus 19700 Helix Drive Ashburn VA 20147 USA http eddylab org Copyright C 2010 Howard Hughes Medical Institute Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are retained on all copies HMMER is licensed and freely distributed under the GNU General Public License version 3 GPLv3 For a copy of the License see http www gnu org licenses HMMER is a trademark of the Howard Hughes Medical Institute Contents 1 Introduction How to avoid reading this manual 2 2 ee How to avoid using this software links to similar software 0 0 o What profile HMMs are o e o Applications of profile HMMs 0 o e Design goals of HMMER3 ee What s still missing in HMMER3 o e How to learn more about profile HMMS o o o o Installation Quick installation instructions ooo System requirements a ary ata e mdd ra a a eee A Multithreaded parallelization for multicores is the default 0 MPI parallelization for clusters is optional o oo ee Using build difectories 0 2200 ee
47. HMMs in a hurry and HMMER s ASCII flatfiles are bulky To accelerate this hmmscan uses binary compression and indexing of the flatfiles To use hmmscan you 23 must first compress and index your HMM database with the hmmpress program gt hmmpress minifam This will quickly produce Working done Pressed and indexed 3 HMMs 3 names and 2 accessions Models pressed into binary file minifam h3m SSI index for binary model file minifam h3i Profiles MSV part pressed into minifam h3f Profiles remainder pressed into minifam h3p and you ll see these four new binary files in the directory The tutorial minifam example has already been pressed so there are example binary files tutorial minifam h3 m i f p included in the tutorial Their format is proprietary which is an open source term of art that means both I haven t found time to document them yet and I still might decide to change them arbitrarily without telling you Step 3 search the HMM database with hmmscan Now we can analyze sequences using our HMM database and hmmscan For example the receptor tyrosine kinase 7LESS_DRoME not only has all those fibronectin type III domains on its extracellular face it s got a protein kinase domain on its intracellular face Our minifam database has models of both n3 and Pkinase as well as the unrelated globins4 model So what happens when we scan the 7LESS_DROME Sequence gt hmmscan minifam tutorial 7
48. KHAKSF QVDPOYFKVLAAVIADTVAAG DAGFEKLMSMICILL HBB_HUMAN AHKYH HBA_HUMAN TSKYR MYG_PHYCA AAKYKELGYOG GLB5_PETMA RSAY 1I Most popular alignment formats are similar block based formats and can be turned into Stockholm format with a little editing or scripting Don t forget the stTocKHOLM 1 0 line at the start of the alignment nor the at the end Stockholm alignments can be concatenated to create an alignment database flatfile containing many alignments The hmmbuild command builds a profile HMM from an alignment or HMMs for each of many alignments in a Stockholm file and saves the HMM s in a file For example type gt hmmbuild globins4 hmm tutorial globins4 sto and you ll see some output that looks like Tit expects Stockholm by default To read aligned FASTA files which HMMER calls afa format specify informat afa on the command line of any program that reads an input alignment 15 hmmbuild profile HMM construction from multiple sequence alignments HMMER 3 0 March 2010 http hmmer org Copyright C 2010 Howard Hughes Medical Institute Freely distributed under the GNU General Public License GPLv3 input alignment file globins4 sto output HMM file globins4 hmm 3E ode de de e e ROE idx name nseq alen mlen eff_nseq re pos description Si i A A AS ees et AR pe A AS Sees eee ee 1 globins4 4 171 149 0 96 0 589 CPU time 0 13u 0 00s 00 00 00 13 Elapsed 00 00
49. LESS_DROME The header and the first section of the output will look like hmmscan search sequence s against a profile database HMMER 3 0 March 2010 http hmmer org Copyright C 2010 Howard Hughes Medical Institute Freely distributed under the GNU General Public License GPLv3 Sept ad ASA ee SER ge RP ee E Pe eat ls ay a ks A a A os eee AA A SS ots NA See Ca ee Ee query sequence file 7LESS_DROME target HMM database minifam per seq hits tabular output 7LESS tbl per dom hits tabular output 7LESS domtbl A A E A O RR E NR Query 7LESS_DROME L 2554 Accession P13368 Description RecName Full Protein sevenless EC 2 7 10 1 Scores for complete sequence score includes all domains full sequence best 1 domain dom E value score bias E value score bias exp N Model Description 5 6e 57 178 0 0 4 3 5e 16 47 2 0 7 9724 5 2 90 Ena Fibronectin type III domain Lele 43 137 2 0 0 1 7e 43 136 5 0 0 1 3 1 Pkinase Protein kinase domain The output fields are in the same order and have the same meaning as in hmmsearch s output The size of the search space for hmmscan is the number of models in the HMM database here 3 for a Pfam search on the order of 10000 In hmmsearch the size of the search space is the number of sequences in the sequence database This means that E values may differ even for the same individual profile vs sequence comparison depending on how you do the search
50. MER is used to make a model of the seed search the database for homologs and automatically produce the full alignment by aligning every sequence to the seed consensus You may find other applications as well Using hidden Markov models to make a linear consensus model of a bunch of related strings is a pretty generic problem and not just in biological sequence analysis HMMER3 has already been used to model mouse song Elena Rivas personal communication and in the past HMMER has reportedly been used to model music speech and even automobile engine telemetry If you use it for something particularly strange ld be curious to hear about it but never ever want to see my error messages showing up on the console of my Saab Design goals of HMMER3 In the past profile HMM methods were considered to be too computationally expensive and BLAST has remained the main workhorse of sequence similarity searching The main objective of the HMMERS project begun in 2004 is to combine the power of probabilistic inference with high computational speed We aim to upgrade some of molecular biology s most important sequence analysis applications to use more powerful statistical inference engines without sacrificing computational performance Specifically HMMER3 has three main design features that in combination distinguish it from previous tools Explicit representation of alignment uncertainty Most sequence alignment analysis tools report only a single
51. S LOCAL FORWARD 4 1637 0 70957 HMM A G D E F G H I aioe W Ye m gt m m gt i m gt d i gt m i gt i d gt m d gt d COMPO 2 36553 4 52577 2 96709 2 70473 3 20818 3 02239 3 41069 2 90041 4 55393 3 62921 2 68640 4 42247 2 77497 2 73145 3 46376 2 40504 3 72516 3 29302 4 58499 3 61525 0 57544 1 78073 1 31293 1 75577 0 18968 0 00000 1 1 70038 4 17733 3 76164 3 36686 3 72281 3 29583 4 27570 2 40482 5 32720 4 10031 2 68618 4 42225 2 77519 2 73123 3 46354 2 40513 3 72494 3 29354 4 58477 3 61503 0 03156 3 86736 4 58970 0 61958 0 77255 0 34406 1 23405 149 2 92198 5 11574 3 28049 2 65489 4 47826 3 59727 2 51142 3 88373 5 42147 4 18835 2 68634 4 42241 2 77536 2 73098 3 46370 2 40469 3 72511 3 29370 4 58493 3 61418 0 22163 1 61553 x 1 50361 0 25145 0 00000 The HMMER3 ASCII save file format is defined in Section 5 If you re used to HMMER2 you may now be expecting to calibrate the model with H2 s hmmcalibrate program HMMER3 models no longer need a separate calibration step We ve figured out how to calculate the necessary parameters rapidly bypassing the need for costly simulation Eddy 2008 The determination of the statistical parameters is part of hmmbuild These are the parameter values on the three lines marked STATS You also may be expecting to need to configure the model s alignment mode as in HMMER2 hmmbui ld f option for building local fragment search alignment models for example HMMER3 s hmmbuild does
52. ad instead The first whitespace delimited field on each non blank non comment line of the lt keyfile gt is used as a lt key gt and any remaining data on the line is ignored this allows a variety of whitespace delimited datafiles to be used as lt keyfile gt s The lt keyfile gt argument can be a dash character in which case the keyfile is read from standard input this allows unix ninjas to construct sophisticated invocations using pipes By default the HMM is printed to standard output in HMMERS format Options h Help print a brief reminder of command line usage and all available options f The second commandline argument is a lt keyfile gt instead of a single lt key gt The first field on each line of the lt keyfile gt is used as a retrieval lt key gt an HMM name or accession Blank lines and comment lines that start with a character are ignored 0 lt f gt Output HMM s to file lt f gt instead of to standard output O Output HMM s to individual file s named lt key gt instead of standard output With the f option this can result in many files being created index Instead of retrieving one or more profiles from lt hmmfile gt index the lt hmmfile gt for future retrievals This creates a lt hmmfile gt ssi binary index file 38 hmmpress prepare an HMM database for hmmscan Synopsis hmmpress options lt hmmfile gt Description Starting from a profile database lt hmmfile gt
53. algo rithm slowest but most accurate There is also a bias filter step between MSV and Viterbi Targets that pass all the steps in the acceleration pipeline are then subjected to postprocessing domain identification and scoring using the Forward Backward algorithm Essentially the only free parameters that control HMMER s heuristic filters are the P value thresholds controlling the expected fraction of nonhomologous sequences that pass the filters Setting the default thresholds higher will pass a higher proportion of nonhomologous sequence increasing sensitivity at the expense of speed conversely setting lower P value thresholds will pass a smaller proportion decreasing sensitivity and increasing speed Setting a filter s P value threshold to 1 0 means it will passing all sequences and effectively disables the filter Changing filter thresholds only removes or includes targets from consideration changing filter thresholds does not alter bit scores E values or alignments all of which are determined solely in postprocessing max Maximum sensitivity Turn off all filters including the bias filter and run full For ward Backward postprocessing on every target This increases sensitivity slightly at a large cost in speed F1 lt x gt First filter threshold set the P value threshold for the MSV filter step The default is 0 02 meaning that roughly 2 of the highest scoring nonhomologous targets are expected to pass the filter F
54. ally an hmmsearch of the target database with this new model Scores for complete sequences score includes all domains full sequence best 1 domain dom E value score bias E value score bias exp N Sequence 1 5e 67 231 0 0 2 1 7e 67 230 8 0 1 1 0 1 sp P02055 HBB_MELME Description Hemoglobin subunit beta OS Meles meles 2 3e 67 2 7e 67 3 3e 67 230 230 229 4 2 9 0 0 0 4 3 2 28 Za 35 6e 67 9e 67 7e 67 230 230 229 2 0 7 0 3 1 0 0 2 10 On 1 0 sp P81042 HBE_MACEU sp P15449 HBB_MELCA sp P68046 HBB_ODORO Hemoglobin subunit epsilon OS Macropus Hemoglobin subunit beta OS Mellivora c Hemoglobin subunit beta OS Odobenus ro If you skim down through this output you ll start seeing newly included sequences marked with s such as 9 8e 30 1 4e 29 1 5e 29 3e 29 4e 29 9 3 e 29 1 3e 28 2e 28 2 8e 28 7 9e 26 It s unusual to see sequences get lost and marked with but it can globin example 108 107 107 106 106 105 104 104 103 9535 YDRFPAYNWAIDMW o0oo0oooooooo or oonoouwo a Ww OWN le 29 De 29 2e 29 3e 29 4e 29 le 28 6e 28 4e 28 1e 28 8e 26 108 107 107 106 106 105 104 103 103 95 AunwoRONDA WIN oooooooooo Orbooroonno 00000o0ooo0oo sp sp sp sp sp sp sp sp sp sp After round 2 many more globin sequences have been found New targe
55. alphabet for protein sequences it is 0 59 bits position Sets the minimum relative entropy contributed by an entire model alignment over its whole length This has the effect of making short models have higher relative entropy per position than ere alone would give The default is 45 0 bits Sets the fractional pairwise identity cutoff used by single linkage clustering with the eclust option The default is 0 62 Options controlling e value calibration Estimating the location parameters for the expected score distributions for MSV filter scores Viterbi filter scores and Forward scores requires three short random sequence simulations EmL lt n gt EmN lt n gt EvL lt n gt EvN lt n gt EfL lt n gt EfN lt n gt Eft lt x gt Other options nonull2 Z lt x gt Sets the sequence length in simulation that estimates the location parameter mu for MSV filter E values Default is 200 Sets the number of sequences in simulation that estimates the location parameter mu for MSV filter E values Default is 200 Sets the sequence length in simulation that estimates the location parameter mu for Viterbi filter E values Default is 200 Sets the number of sequences in simulation that estimates the location parameter mu for Viterbi filter E values Default is 200 Sets the sequence length in simulation that estimates the location parameter tau for Forward E values Default is 100 Sets the num
56. annoying behavior that you may notice if you look systematically sometimes you ll see a model that runs at something more like HMMER2 speed 100x or so slower than an average query This needless to say Will Be Fixed More processor support One of the attractive features of the HMMER3 MSV acceleration algorithm is that it is a very tight and efficient piece of code which ought to be able to take advantage of recent advances in using massively parallel GPUs graphics processing units and other specialized processors such as the Cell processor or FPGAs We have prototype work going on in a variety of processors but none of this is far along as yet But this work combined with the parallelization is partly why we expect to wring significantly more speed out of HMMER in the future How to learn more about profile HMMs Cryptogenomicon http cryptogenomicon org is a blog where lIl be talking about issues as they arise in HMMER3 and where you can comment or follow the discussion Reviews of the profile HMM literature have been written by myself Eddy 1996 1998 and by Anders Krogh Krogh 1998 For details on how profile HMMs and probabilistic models are used in computational biology see the pioneering 1994 paper from Krogh et al Krogh et al 1994 or our book Biological Sequence Analysis Probabilistic Models of Proteins and Nucleic Acids Durbin et al 1998 To learn more about HMMs from the perspective of the speech recog
57. appeared in the Uniprot search because now in this larger search space size they re not significant Domain 1 the one with the score of 1 3 bits got a conditional E value of 0 17 x 711 121 and domain 6 with a bit score of 0 0 got a c Evalue of 0 063 x 711 45 These fail the default reporting threshold of 10 0 The domain with a score of 5 1 bits also shifted from being above to below the default inclusion thresholds so now it s marked with a instead of a Operationally e If the independent E value is significant lt lt 1 that means that even this single domain by itself is such a strong hit that it suffices to identify the sequence as a significant homolog with respect to the size of the entire original database search You can be confident that this is a homologous domain e Once there s one or more high scoring domains in the sequence already sufficient to decide that the sequence contains homologs of your query you can look with some caution at the conditional E value to decide the statistical significance of additional weak scoring domains In the Uniprot output for example l d be pretty sure of four of the domains 1 4 5 and maybe 6 each of which has a strong enough independent E value to declare 7LESS_DRoME to be an fnlll domain containing protein Domains 2 and 7 wouldn t be significant if they were all saw in the sequence but once decide that 7LESS_DROME contains fn3 domains on the basis of the othe
58. arch fn3 hmm 7LESS_DROME Pkinase sto The Pfam 22 0 Pkinase seed alignment of protein kinase domains minifam An example HMM flatfile database containing three models globins4 fn3 and Pkinase minifam h3 m i f p Binary compressed files corresponding to minifam produced by hmmpress HBB_HUMAN A FASTA file containing the sequence of human 3 hemoglobin used as an exam ple query for phmmer and jackhmmer Searching a sequence database with a single profile HMM Step 1 build a profile HMM with hmmbuild HMMER starts with a multiple sequence alignment file that you provide Currently HMMER3 can read Stockholm or aligned FASTA alignments The file tutorial globins4 sto is an example of a simple Stockholm file It looks like this STOCKHOLM 1 0 HBB_HUMAN VHLTPEEKSAVTALWGKV NVDEVGGEALGRLLVVYPWTORFFESFGDLSTPDAVMGNPKVKAHGKKVL HBA_HUMAN VLSPADKTNVKAAWGKVGA HAGEYGAEALERMFLSFPTTKTYFPHF DLS HGSAQVKGHGKKVA MYG_PHYCA oo o VLSEGEWOLVLHVWAKVEA DVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVL GLB5_PETMA PIVDTGSVAPLSAAEKTKIRSAWAPVYS TYETSGVDILVKFFTSTPAAQEFFPKFKGLTTADOLKKSADVRWHAERII HBB_HUMAN GAFSDGLAHL D NLKGTFATLSELHCDKL HVDPENFRLLGNVLVCVLAHHFGKEFTPPVOAAYOKVVAGVANAL HBA_HUMAN DALTNAVAHV D DMPNALSALSDLHAHKL RVDPVNEKLLSHCLLVILAAHLPAEFTPAVHASLDKFLASVSTVL MYG_PHYCA TALGAILKK K GHHEAELKPLAQSHATKH KIPIKYLEFISEAI IHVLHSRHPGDFGADAQGAMNKALELFRKDI GLB5_PETMA NAVNDAVASM DDTEKMSMKLRDLSG
59. architecture Implementation was begun in November 1996 thank the Washington University Dept of Genetics the NIH National Human Genome Research Institute and Monsanto for their support during this time Also thank the Biochemistry Academic Contacts Committee at Eli Lilly amp Co for a gift that paid for the trusty Linux laptop on which much of HMMER 2 was written The laptop was indispensable Far too much of HMMER was written in coffee shops airport lounges transoceanic flights and Graeme Mitchison s kitchen The source code still contains a disjointed record of where and when various bits were written HMMER then settled into a comfortable middle age like its primary author still actively maintained though dramatic changes seemed increasingly unlikely HMMER 2 1 1 was the stable release for three years from 1998 2001 HMMER 2 2g was intended to be a beta release but became the de facto stable release for two more years 2001 2003 The final release of the HMMER2 series 2 3 was assembled in spring 2003 The last bugfix release 2 3 2 came out in October 2003 If the world worked as hoped and expected the combination of the 1998 Durbin Eddy Krogh Mitchison book Biological Sequence Analysis and the existence of HMMER2 as a widely used proof of principle should have motivated the widespread adoption of probabilistic modeling methods for sequence analysis particularly database search We would declare Victory and move on Richard D
60. at of a substitution matrix lt mxfile gt is the standard format accepted by BLAST FASTA and other sequence analysis software Options controlling reporting thresholds Reporting thresholds control which hits are reported in output files the main output tblout and domtblout In each iteration sequence hits and domain hits are ranked by statistical significance E value and output is generated in two sections called per target and per domain output In per target output by default all se quence hits with an E value lt 10 are reported In the per domain output for each target that has passed per target reporting thresholds all domains satisfying per domain reporting thresholds are reported By default these are domains with conditional E values of lt 10 The following options allow you to change the default E value reporting thresholds or to use bit score thresholds instead E lt x gt Report sequences with E values lt lt x gt in per sequence output The default is 10 0 T lt x gt Use a bit score threshold for per sequence output instead of an E value threshold any setting of E is ignored Report sequences with a bit score of gt lt x gt By default this option is unset 55 Z lt x gt Declare the total size of the database to be lt x gt sequences for purposes of E value calculation Normally E values are calculated relative to the size of the database you actually searched e g the number of sequence
61. bda Slope parameter as determined by H3 s estimation procedures pE 10 The E value calculated for the 10th ranked score using pmu plambda At the end of this table one more line is printed starting with and summarizing the overall CPU time used by the simulations Some of the optional output files are in xmgrace xy format xmgrace is powerful and freely available graph plotting software Miscellaneous options h Help print a brief reminder of command line usage and all available options a Collect expected Viterbi alignment length statistics from each simulated sequence This only works with Viterbi scores the default see vit Two additional fields are printed in the output table for each model the mean length of Viterbi alignments and the standard deviation v_ Verbose Print the scores too one score per line L lt n gt Set the length of the randomly sampled nonhomologous sequences to lt n gt The default is 100 N lt n gt Set the number of randomly sampled sequences to lt n gt The default is 1000 mpi Run in MPI parallel mode under mpirun Itis parallelized at the level of sending one profile at a time to an MPI worker process so parallelization only helps if you have more than one profile in the lt hmmfile gt and you want to have at least as many profiles as MPI worker processes Only available if optional MPI support was enabled at compile time Options controlling output O lt f gt Sav
62. ber of sequences in simulation that estimates the location parameter tau for Forward E values Default is 200 Sets the tail mass fraction to fit in the simulation that estimates the location param eter tau for Forward evalues Default is 0 04 Turn off the null2 score corrections for biased composition Assert that the total number of targets in your searches is lt x gt for the purposes of per sequence E value calculations rather than the actual number of targets seen 59 domZ lt x gt seed lt n gt qformat lt s gt tformat lt s gt cpu lt n gt mpi Assert that the total number of targets in your searches is lt x gt for the purposes of per domain conditional E value calculations rather than the number of targets that passed the reporting thresholds Seed the random number generator with lt n gt an integer gt 0 If lt n gt is gt 0 any stochastic simulations will be reproducible the same command will give the same results If lt n gt is 0 the random number generator is seeded arbitrarily and stochastic simulations will vary from run to run of the same command The default seed is 42 Declare that the input query_seqfile is in format lt s gt Accepted sequence file formats include FASTA EMBL Genbank DDBJ Uniprot Stockholm and SELEX Default is to autodetect the format of the file Declare that the input target_seqdb is in format lt s gt Accepted sequence file for mats incl
63. bin R Eddy S R Krogh A and Mitchison G J 1998 Biological Sequence Analysis Probabilistic Models of Proteins and Nucleic Acids Cambridge University Press Cambridge UK Eddy S R 1996 Hidden Markov models Curr Opin Struct Biol 6 361 365 Eddy S R 1998 Profile hidden Markov models Bioinformatics 14 755 763 Eddy S R 2008 A probabilistic model of local sequence alignment that simplifies statistical significance estimation PLoS Comput Biol 4 e1000069 Finn R D Mistry J Tate J Coggill P Heger A Pollington J E Gavin O L Ceric G Forslund K Holm L Sonnhammer E L L Eddy S R and Bateman A 2010 The Pfam protein families database Nucleic Acids Research in press Gribskov M Luthy R and Eisenberg D 1990 Profile analysis Meth Enzymol 183 146 159 Gribskov M McLachlan A D and Eisenberg D 1987 Profile analysis Detection of distantly related proteins Proc Natl Acad Sci USA 84 4355 4358 Holmes 1998 Studies in Probabilistic Sequence Alignment and Evolution PhD thesis University of Cambridge Karlin S and Altschul S F 1990 Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes Proc Natl Acad Sci USA 87 2264 2268 Karlin S and Altschul S F 1993 Applications and statistics for multiple high scoring segments in molec ular sequences Proc Natl Aca
64. cal bin and man pages in usr local man man1 You can change usr local to any directory you want using the configure prefix option as in configure prefix the directory you want That s it You can keep reading if you want to know more about customizing a HMMER3 installation or you can skip ahead to the next chapter the tutorial System requirements Operating system HMMER is designed to run on POSIX compatible platforms including UNIX Linux and MacOS X The POSIX standard essentially includes all operating systems except Microsoft Windows Our distribution tarball includes precompiled binaries for Linux These were compiled with the Intel C compiler icc on an x86_64 Intel platform running Red Hat Enterprise Linux AS4 We believe they should be widely portable to different Linux systems We have tested most extensively on Linux and on MacOS X because these are the machines we develop on We test on a variety of other POSIX compliant systems in our compile farm We aim to be portable to all POSIX platforms We currently do not develop or test on Windows Processor HMMER3 depends on vector parallelization methods that are supported on most modern processors H3 requires either an x86 compatible IA32 IA64 or Intel64 processor that supports the SSE2 vector instruction set or a PowerPC processor that supports the Altivec VMX instruction set SSE2 is supported on Intel processors from Pentium 4 on and AMD processors from K8 Athl
65. ce alignments It has parsers for a handful of other formats Genbank EMBL and Uniprot flatfiles SELEX format alignments that we ve tested somewhat It s particularly missing parsers for some widely used alignment formats such as Clustal format so using HMMERS on the MSAs produced by many popular multiple alignment programs MUSCLE or MAFFT for example is harder than it should be because it may require a reformat to Stockholm or aligned FASTA format More alignment modes HMMER3 only does local alignment HMMER2 also could do glocal alignment align a complete model to a subsequence of the target and global alignment align a complete model to a complete target sequence The E value statistics of glocal and global alignment remain poorly understood HMMERS relies on accurate significance statistics far more so than HMMER2 did because HMMER3 s acceleration pipeline works by filtering out sequences with poor P values More speed Even on x86 platforms HMAMER3 s acceleration algorithms are still on a nicely sloping bit of their asymptotic optimization curve still think we can accelerate the code by another two fold or so Additionally for a small number of HMMs lt 1 of Pfam models the acceleration core is performing relatively poorly for reasons we pretty much understand having to do with biased composition most of these pesky models are hydrophobic membrane proteins but which are nontrivial to work around This ll produce an
66. cible Any other positive integer will give different but also reproducible results A choice of 0 uses a randomly chosen seed Assert that the query sequence file is in format lt s gt Accepted formats include fasta embl genbank ddbj uniprot stockholm pfam a2m and afa The default is to autodetect the format of the file Set the number of parallel worker threads to lt n gt By default HMMER sets this to the number of CPU cores it detects in your machine that is it tries to maximize the use of your available processor cores Setting lt n gt higher than the number of avail able cores is of little if any value but you may want to set it to something less You can also control this number by setting an environment variable HMMER_NCPU This option is only available if HAMER was compiled with POSIX threads support This is the default but it may have been turned off at compile time for your site or machine for some reason For debugging the MPI master worker version pause after start to enable the developer to attach debuggers to the running master and worker s processes Send SIGCONT signal to release the pause Under gdb gdb signal SIGCONT Only available if optional MPI support was enabled at compile time Run in MPI master worker mode using mpirun Only available if optional MPI support was enabled at compile time 47 hmmsim collect score distributions on random sequences Synopsis hmmsim options hmmfil
67. clear An example A2M file This alignment seql ACDEF GHIKLMNPOTVWY seg2 ACDEF GHIKLMNPOTVWY seg3 EFmnrGHIKLMNPOQT is encoded in A2M format as gt seql Sequence 1 description ACDEFGHIKLMNPOTVWY gt seg2 Sequence 2 description ACDEFGHIKLMNPOTVWY gt seq3 Sequence 3 description EFmnrGHIKLMNPQT A2M format looks a lot like aligned FASTA format A crucial difference is that the aligned sequences in a dotless A2M file do not necessarily all have the same number of characters The format distinguishes between consensus columns where residues are in upper case and gaps are a dash and insert columns where residues are in lower case and gaps are dots that aren t explicitly shown in the for mat hence dotless A2M The position and number of gaps in insert columns dots is implicit in this representation An advantage of this format is its compactness This representation only works if all insertions relative to consensus are considered to be unaligned characters That is how insertions are handled by profile HMM implementations like SAM and HMMER and profile SCFG implementations like Infernal Thus every sequence must have the same number of consensus columns upper case letters plus characters and may have additional lower case letters for insertions 72 Legal characters A2M and SAM do not support some special characters such as the not a residu
68. d Sci USA 90 5873 5877 Krogh A 1998 An introduction to hidden Markov models for biological sequences In Salzberg S Searls D and Kasif S editors Computational Methods in Molecular Biology pages 45 63 Elsevier Krogh A Brown M Mian I S Sj lander K and Haussler D 1994 Hidden Markov models in compu tational biology Applications to protein modeling J Mol Biol 235 1501 1531 Letunic l Copley R R Pils B Pinkert S Schultz J and Bork P 2006 SMART 5 Domains in the context of genomes and networks Nucl Acids Res 34 D257 D260 Mulder N J Apweiler R Attwood T K Bairoch A Barrell D Bateman A Binns D Biswas M Bradley P Bork P Bucher P Copley R R Courcelle E Das U Durbin R Falquet L Fleischmann W Griffiths Jones S Haft D Harte N Hulo N Kahn D Kanapin A Krestyaninova M Lopez R Letunic l Lonsdale D Silventoinen V Orchard S E Pagni M Peyruc D Ponting C P Selengut J D Servant F Sigrist C J Vaughan R and Zdobnov E M 2003 The InterPro database 2003 brings increased coverage and new features Nucl Acids Res 31 315 318 76 Pearson W R and Lipman D J 1988 Improved tools for biological sequence comparison Proc Natl Acad Sci USA 85 2444 2448 Rabiner L R 1989 A tutorial on hidden Markov models and selected applications in speech recognition Proc IEEE 7
69. d everything above the inclusion thresholds is marked with a and nothing is marked with a For example the top of this output looks like jackhmmer iteratively search a protein sequence against a protein database HMMER 3 0 March 2010 http hmmer org Copyright C 2010 Howard Hughes Medical Institute Freely distributed under the GNU General Public License GPLv3 E y tee A ee et ee A ne A A AS ee oe es eS ee ea ee ee query sequence file HBB_HUMAN target sequence database uniprot_sprot fasta per segq hits tabular output hbb Jjack tbl per dom hits tabular output hbb jack domtbl A A A AS A Si A A A et he pe en ee RS ee pes ce Query HBB_HUMAN L 146 Description Human beta hemoglobin Scores for complete sequences score includes all domains full sequence best 1 domain dom E value score bias E value score bias exp N Sequence 2 3e 98 331 4 0 0 2 5e 98 331 2 0 0 1 0 1 sp P68871 HBB_HUMAN 2 3e 98 331 4 0 0 2 5e 98 331 2 0 0 1 0 1 sp P68872 HBB_PANPA 2 3e 98 331 4 0 0 2 5e 98 331 2 0 0 1 0 1 sp P68873 HBB_PANTR 9 1le 98 329 4 0 0 le 97 329 3 0 0 1 0 1 sp P02024 HBB_GORGO 2e 96 325 1 0 0 2 2e 96 324 9 0 0 1 0 1 sp P02025 HBB_HYLLA 2e 95 321 8 0 0 2 2e 95 321 7 0 0 1 0 1 sp P02032 HBB_SEMEN Description Hemoglobin Hemoglobin Hemoglobin Hemoglobin Hemoglobin Hemoglobin subunit subunit subunit subunit subunit subunit beta beta beta beta beta
70. ding the GNU C compiler gcc We test the code using both the gcc and icc compilers We find that icc produces somewhat faster code at present Libraries and other installation requirements HMMER includes a software library called Easel which it will automatically compile during its installation process By default HMMER3 does not require any additional libraries to be installed by you other than standard ANSI C99 libraries that should already be present on a system that can compile C code Bundling Easel instead of making it a separate installation requirement is a deliberate design decision to simplify the installation process Configuration and compilation use several UNIX utilities Although these utilities are available on all UNIX Linux MacOS systems old versions may not support all the features the configure script and Makefiles are hoping to find We aim to build on anything even old Ebay ed junk but if you have an old system you may want to hedge your bets and install up to date versions of GNU tools such as GNU make and GNU grep Multithreaded parallelization for multicores is the default The four search programs and hmmbuild support multicore parallelization using POSIX threads By default the configure script will identify whether your platform supports POSIX threads almost all platforms do and will automatically compile in multithreading support If you want to disable multithreading at compile time recompile from source a
71. e Description The hmmsim program generates random sequences scores them with the model s in hmmfile and outputs various sorts of histograms plots and fitted distributions for the resulting scores hmmsim is not a mainstream part of the HMMER package Most users would have no reason to use it It is used to develop and test the statistical methods used to determine P values and E values in HMMERS For example it was used to generate most of the results in a 2008 paper on H3 s local alignment statistics PLoS Comp Bio 4 e1000069 2008 http www ploscompbiol org doi pcbi 1000069 Because it is a research testbed you should not expect it to be as robust as other programs in the package For example options may interact in weird ways we haven t tested nor tried to anticipate all different possible combinations The main task is to fit a maximum likelihood Gumbel distribution to Viterbi scores or an maximum likelihood exponential tail to high scoring Forward scores and to test that these fitted distributions obey the conjecture that lambda _ log_2 for both the Viterbi Gumbel and the Forward exponential tail The output is a table of numbers one row for each model Four different parametric fits to the score data are tested 1 maximum likelihood fits to both location mu tau and slope lambda parameters 2 assuming lambda log_2 maximum likelihood fit to the location parameter only 8 same but assuming an edge corrected lambda
72. e and propagated to alignment displays Optional assumed to be no if not present Map annotation flag lt s gt is either no or yes case insensitive If set to yes the map annotation field in the main model see below is valid if no that field will be ignored The HMM alignment map annotates each match state with the index of the alignment column from which it came It can be used for quickly mapping any subsequent HMM alignment back to the original multiple alignment via the model Optional assumed to be no if not present Date the model was constructed lt s gt is a free text date string This field is only used for logging purposes Optional Command line log lt n gt counts command line numbers and lt s gt is a one line com mand There may be more than one com line per save file each numbered starting from n 1 These lines record every HMMER command that modified the save file This helps us reproducibly and automatically log how Pfam models have been constructed for example Optional HMMER does not use dates for any purpose other than human readable annotation so it is no more prone than you are to Y2K Y2038 or any other date related eschatology 67 NSEQ EFFN CKSUM GA lt f gt TC lt f gt NC lt f gt STATS lt sl gt lt s2 gt COMPO lt f gt xK lt d gt lt f gt lt d gt lt f gt lt f gt lt f gt Sequence number lt d gt is a nonzero positive i
73. e or missing data characters Easel outputs these characters as gaps either in a consensus column or nothing in an insert column The SAM software parses only a subset of legal ambiguity codes for amino acids and nucleotides For amino acids it only reads BXZ in addition to the 20 standard one letter codes For nucleotides it only reads NRY in addition to ACGTU With one crucial exception it treats all other letters as X or N The crucial exception is O SAM reads an O as the position of a free insertion module FIM a concept specific to SAM style profile HMMs This has no impact on nucleic acid sequences where O is not a legal character But in amino acid sequences O means pyrrolysine one of the unusual genetically encoded amino acids This means that A2M format alignments must not contain pyrrolysine residues lest they be read as FIMs For this reason Easel converts O residues to X when it writes an amino acid alignment in A2M format Determining consensus columns Writing A2M format requires knowing which alignment columns are supposed to be considered consensus and which are considered inserts If the alignment was produced by HMMER or Infernal then the alignment has so called reference annotation what appears as a GC RF line in Stockholm format marking the consensus columns Often an alignment has no reference annotation for example if it has been read from
74. e multiple instantiations of Easel lying around The Easel API is not yet stable enough to decouple it from the applications that use it like HMMER and Infernal 11 a program under mpirun on N nodes you ll be running N independent duplicate commands not a single MPl enabled command Don t do that The MPI implementation for hmmbuild scales well up to hundreds of processors and hmmsearch scales all right The other search programs hmmscan phmmer and jackhmmer scale poorly and probably shouldn t be used on more than tens of processors at most Improving MPI scaling is one of our goals Using build directories The configuration and compilation process from source supports using separate build directories using the GNU standard VPATH mechanism This allows you to maintain separate builds for different processors or with different configuration compilation options All you have to do is run the configure script from the directory you want to be the root of your build directory For example gt mkdir my hmmer build gt cd my hmmer build gt path to hmmer configure gt make This assumes you have a make that supports VPATH If your system s make does not you can always install GNU make Makefile targets al1 Builds everything Same as just saying make check Runs automated test suites in both HMMER and the Easel library clean Removes all files generated by compilation by make Configuration files generated by config
75. e the main output table to a file lt f gt rather than sending it to stdout afile lt f gt When collecting Viterbi alignment statistics the a option for each sampled se quence output two fields per line to a file lt f gt the length of the optimal alignment and the Viterbi bit score Requires that the a option is also used efile lt f gt Output a rank vs E value plot in XMGRACE xy format to file lt f gt The x axis is the rank of this sequence from highest score to lowest the y axis is the E value cal culated for this sequence E values are calculated using H3 s default procedures i e the pmu plambda parameters in the output table You expect a rough match between rank and E value if E values are accurately estimated 49 ffile lt f gt pfile lt f gt xfile lt f gt Output a filter power file to lt f gt for each model a line with three fields model name number of sequences passing the P value threshold and fraction of se quences passing the P value threshold See pthresh for setting the P value threshold which defaults to 0 02 the default MSV filter threshold in H3 The P values are as determined by H3 s default procedures the pmu plambda param eters in the output table If all is well you expect to see filter power equal to the predicted P value setting of the threshold Output cumulative survival plots P S gt x to file lt f gt in XMGRACE xy format There are three plots
76. ed in the hmmfile The multiple alignment in lt f gt will be exactly reproduced in its consensus columns as defined by the profile but the displayed alignment in insert columns may be altered because insertions relative to a profile are considered by conven tion to be unaligned data trim Trim nonhomologous residues assigned to N and C states in the optimal align ments from the resulting multiple alignment output amino Specify that all sequences in segfile are proteins By default alphabet type is autodetected from looking at the residue composition dna Specify that all sequences in segfile are DNAs rna Specify that all sequences in seqfile are RNAs 29 informat lt s gt outformat lt s gt Declare that the input segfile is in format lt s gt Accepted sequence file formats include FASTA EMBL Genbank DDBJ Uniprot Stockholm and SELEX Default is to autodetect the format of the file Specify that the msafile is in format lt s gt Currently the accepted multiple alignment sequence file formats only include Stockholm and SELEX Default is to autodetect the format of the file 30 hmmbuild construct profile HMM s from multiple sequence alignment s Synopsis hmmbuild options hmmfile msafile Description Build a profile HMM for each multiple sequence alignment in msafile and save it to a new file Ammfile Options h Help print a brief reminder of command line usage and all available o
77. ences with a bit score of gt lt x gt domE lt x gt In the per domain output for target sequences that have already satisfied the per profile reporting threshold report individual domains with a conditional E value of lt lt x gt The default is 10 0 A conditional E value means the expected number of additional false positive domains in the smaller search space of those comparisons that already satisfied the per target reporting threshold and thus must have at least one homologous domain already domT lt x gt Instead of thresholding per domain output on E value instead report domains with a bit score of gt lt x gt Options for inclusion thresholds Inclusion thresholds are stricter than reporting thresholds Inclusion thresholds control which hits are con sidered to be reliable enough to be included in an output alignment or a subsequent search round or marked as significant as opposed to questionable in domain output incE lt x gt Use an E value of lt lt x gt as the per target inclusion threshold The default is 0 01 meaning that on average about 1 false positive would be expected in every 100 searches with different query sequences incT lt x gt Instead of using E values for setting the inclusion threshold instead use a bit score of gt lt x gt as the per target inclusion threshold By default this option is unset incdomE lt x gt Use a conditional E value of lt lt x
78. ent file formats stockholm aligned fasta and others See the gformat option for a complete list The lt hmmdb gt needs to be press ed using hmmpress before it can be searched with hmmscan This creates four binary files suffixed h3 fimp The output format is designed to be human readable but is often so voluminous that reading it is impractical and parsing it is a pain The tblout and domtblout options save output in simple tabular formats that are concise and easier to parse The o option allows redirecting the main output including throwing it away in dev null Options h Help print a brief reminder of command line usage and all available options Options for controlling output o lt f gt Direct the main human readable output to a file lt f gt instead of the default stdout tblout lt f gt Save a simple tabular space delimited file summarizing the per target output with one data line per homologous target model found domtblout lt f gt Save a simple tabular space delimited file summarizing the per domain output with one data line per homologous domain detected in a query sequence for each homologous model acc Use accessions instead of names in the main output where available for profiles and or sequences noali Omit the alignment section from the main output This can greatly reduce the output volume notextw Unlimit the length of each line in the main output The default is a limit of
79. equence in a sequence alignment block lt s gt ls aligned to the residues in that sequence and has the same length as the rest of the block Typically GR lines are placed immedi ately following the aligned sequence they annotate semantics of Stockholm markup Any Stockholm parser will accept syntactically correct files but is not obligated to do anything with the markup lines It is up to the application whether it will attempt to interpret the meaning the semantics of the markup in a useful way At the two extremes are the Belvu alignment viewer and the HMMER profile hidden Markov model software package Belvu simply reads Stockholm markup and displays it without trying to interpret it at all The tag types GF etc are sufficient to tell Belvu how to display the markup whether it is attached to the whole file sequences columns or residues HMMER uses Stockholm markup to pick up a variety of information from the Pfam multiple alignment database The Pfam consortium therefore agrees on additional syntax for certain tag types so HMMER can parse some markups for useful information This additional syntax is imposed by Pfam HMMER and 70 other software of mine not by Stockholm format per se You can think of Stockholm as akin to XML and what my software reads as akin to an XML DTD if you re into that sort of structured data format lingo The Stockholm markup tags that are parsed semantically by my software are as follows
80. es of Stockholm markup annotation for per file per sequence per column and per residue annotation GF lt tag gt lt s gt Per file annotation lt s gt is a free format text line of annotation type lt tag gt For example GF DATE April 1 2000 Can occur anywhere in the file but usually all the cr markups occur in a header GS lt seqname gt lt tag gt lt s gt Per sequence annotation lt s gt is a free format text line of annotation type tag associated with the sequence named lt seqname gt For ex ample GS seql SPECIES_SOURCE Caenorhabditis elegans Can occur anywhere in the file but in single block formats e g the Pfam distribution will typically follow on the line after the sequence itself and in multi block formats e g HMMER output will typically occur in the header preceding the alignment but following the cr annotation GC lt tag gt lt s gt Per column annotation lt s gt is an aligned text line of annotation type lt tag gt GC lines are associated with a sequence alignment block lt s gt is aligned to the residues in the alignment block and has the same length as the rest of the block Typically Gc lines are placed at the end of each block GR lt seqname gt lt tag gt lt s gt Per residue annotation lt s gt is an aligned text line of annotation type lt tag gt associated with the sequence named lt seqname gt GR lines are associated with one s
81. f documentation you must be joking just want to know that the software compiles runs and gives apparently useful results before read some 75 exhausting pages of someone s documentation For cynics that have seen one too many software packages that don t work e Follow the quick installation instructions on page 10 An automated test suite is included so you will know immediately if something went wrong e Go to the tutorial section on page 14 which walks you through some examples of using HMMER on real data Everything else you can come back and read later How to avoid using this software links to similar software Several implementations of profile HMM methods and related position specific scoring matrix methods are available Software URL HMMER http hmmer org SAM http www cse ucsc edu research compbio sam html PSI BLAST http www ncbi nim nih gov Education BLASTinfo psil htm PFTOOLS http www isrec isb sib ch profile profile html The UC Santa Cruz SAM software is the closest relative of HMMER What profile HMMs are Profile HMMs are statistical models of multiple sequence alignments or even of single sequences They capture position specific information about how conserved each column of the alignment is and which residues are likely Anders Krogh David Haussler and co workers at UC Santa Cruz introduced profile HMMs to computational biology Krogh et al 1994 adopting HMM techniques which
82. formats include fasta embl genbank ddbj uniprot stockholm pfam a2m and afa Set the number of parallel worker threads to lt n gt By default HMMER sets this to the number of CPU cores it detects in your machine that is it tries to maximize the use of your available processor cores Setting lt n gt higher than the number of available cores is of little if any value but you may want to set it to something less You can also control this number by setting an environment variable HM MER_NCPU This option is only available if HMMER was compiled with POSIX threads support This is the default but it may have been turned off for your site or machine for some reason For debugging the MPI master worker version pause after start to enable the developer to attach debuggers to the running master and worker s processes Send SIGCONT signal to release the pause Under gdb gdb signal SIGCONT Only available if optional MPI support was enabled at compile time Run in MPI master worker mode using mpirun Only available if optional MPI support was enabled at compile time 43 hmmsearch search profile s against a sequence database Synopsis hmmsearch options lt hmmfile gt lt seqdb gt Description hmmsearch is used to search one or more profiles against a sequence database For each profile in lt hmmfile gt use that query profile to search the target database of profiles in lt seqdb gt and output ranked lists of
83. fter giving the disable threads flag to configure By default our multithreaded programs will use all available cores on your machine You can control the number of cores each HMMER process will use for computation with the cpu lt x gt command line option or the HMMER_NCPU environment variable Even with a single processing thread cpu 1 HMMER will devote a second execution thread to database input resulting in significant speedup over serial execution If you specify cpu 0 the program will run in serial only mode with no threads This might be useful if you suspect something is awry with the threaded parallel implementation MPI parallelization for clusters is optional The four search programs and hmmbuild now also support MPI Message Passing Interface parallelization on clusters To use MPI you first need to have an MPI library installed such as OpenMPI www open mpi org We use Intel MPI at Janelia MPI support is not enabled by default and it is not compiled into the precompiled binaries that we supply with HMMER To enable MPI support at compile time give the enable mpi option to the configure command To use MPI parallelization each program that has an MPl parallel mode has an mpi command line option This option activates a master worker parallelization mode Without the mpi option if you run 2If you install more than one package that uses the Easel library it may become an annoyance you ll hav
84. hat HMMs have a formal probabilistic basis We use probability theory to guide how all the scoring parameters should be set Though this might sound like a purely academic issue this probabilistic basis lets us do things that more heuristic methods cannot do easily One of the most important is that HMMs have a consistent theory for setting position specific gap and insertion scores The methods are consistent and therefore highly automatable allowing us to make libraries of hundreds of profile HMMs and apply them on a very large scale to whole genome analysis One such database of protein domain models is Pfam Sonnhammer et al 1997 Finn et al 2010 which is a significant part of the Interpro protein domain annotation system Mulder et al 2003 The construction and use of Pfam is tightly tied to the HMMER software package Profile HMMs do have important limitations One is that HMMs do not capture any higher order corre lations An HMM assumes that the identity of a particular position is independent of the identity of all other positions Profile HMMs are often not good models of structural RNAs for instance because an HMM cannot describe base pairs Applications of profile HMMs HMMER can be used to replace BLASTP and PSI BLAST for searching databases with single query se quences With HMMER3 we now include two programs for searching databases with single query se quences phmmer and jackhmmer where jackhmmer is an iterative search aki
85. hat the input lt segfile gt is in format lt s gt Accepted formats include fasta embl genbank ddbj uniprot stockholm pfam a2m and afa The default is to autodetect the format of the file Declare that the input lt segdb gt is in format lt s gt Accepted formats include fasta embl genbank ddbj uniprot stockholm pfam a2m and afa The default is to autodetect the format of the file Set the number of parallel worker threads to lt n gt By default HMMER sets this to the number of CPU cores it detects in your machine that is it tries to maximize the use of your available processor cores Setting lt n gt higher than the number of avail able cores is of little if any value but you may want to set it to something less You 64 mpi can also control this number by setting an environment variable HMMER_NCPU This option is only available if HMMER was compiled with POSIX threads support This is the default but it may have been turned off at compile time for your site or machine for some reason stall For debugging the MPI master worker version pause after start to enable the developer to attach debuggers to the running mas ter and worker s processes Send SIGCONT signal to release the pause Under gdb gdb signal SIGCONT Only available if optional MPI support was enabled at compile time Run in MPI master worker mode using mpirun Only available if optional MPI support was enabled at compile time
86. he same Rarely you might see that they re wildly different and this would usually be a sign that the target sequence is so highly repetitive that it s confused the H3 domain postprocessors Such sequences aren t likely to show up as significant homologs to any sensible query in the first place The sequence top hits output continues until it runs out of sequences to report By default the report includes all sequences with an E value of 10 0 or less Then comes the third output section which starts with Domain annotation for each sequence and alignments Now for each sequence in the top hits list there will be a section containing a table of where HUMER3 thinks all the domains are followed by the alignment inferred for each domain Let s use the n3 vs 7LESS_DROME example because it contains lots of domains and is more interesting in this respect than the globin4 output The domain table for 7LESS_DRomE looks like gt gt 7LESS_DROME RecName Full Protein sevenless EC 2 7 10 1 score bias c Evalue i Evalue hmmfrom hmm to alifrom ali to envfrom env to acc 1 TE 0 0 0519 017 61 74 396 409 395 All se 0 85 2A 40 7 0 0 1 3e 14 1 3e 14 2 84 439 520 437 lte a E 0 95 34 14 4 0 0 2e 06 2e 06 TS 85 836 913 826 914 0 73 4 54d 0 0 0 0016 0 0016 10 36 1209 T2335 a 1203 T259 gt 0 0 82 5 24 3 0 0 1 7e 09 1 7e 09 14 Ae 1313 1380 1304 1386 0 82 6 0 0 0 0 0 063 0 063 58 Tanis 1754 1768 1739 1769 0 89 7 47 2
87. hresholds are generally considered to be the score of the highest scoring known false posi tive cut_tc Use the NC trusted cutoff bit score thresholds in the model to set per sequence TC1 and per domain TC2 reporting and inclusion thresholds TC thresholds are generally considered to be the score of the lowest scoring known true positive that is above all known false positives Options controlling the acceleration pipeline HMMERS searches are accelerated in a three step filter pipeline the MSV filter the Viterbi filter and the Forward filter The first filter is the fastest and most approximate the last is the full Forward scoring al gorithm There is also a bias filter step between MSV and Viterbi Targets that pass all the steps in the acceleration pipeline are then subjected to postprocessing domain identification and scoring using the Forward Backward algorithm Changing filter thresholds only removes or includes targets from considera tion changing filter thresholds does not alter bit scores E values or alignments all of which are determined solely in postprocessing max Turn off all filters including the bias filter and run full Forward Backward postpro cessing on every target This increases sensitivity somewhat at a large cost in speed F1 lt x gt Set the P value threshold for the MSV filter step The default is 0 02 meaning that roughly 2 of the highest scoring nonhomologous targets are expected to pass
88. ias Turn off the bias filter This increases sensitivity somewhat but can come at a high cost in speed especially if the query has biased residue composition such as a repetitive sequence region or if it is a membrane protein with large regions of hydrophobicity Without the bias filter too many sequences may pass the filter with biased queries leading to slower than expected performance as the compu tationally intensive Forward Backward algorithms shoulder an abnormally heavy load 42 Other options nonull2 Z lt x gt domZ lt x gt seed lt n gt qformat lt s gt cpu lt n gt stall mpi Turn off the null2 score corrections for biased composition Assert that the total number of targets in your searches is lt x gt for the purposes of per sequence E value calculations rather than the actual number of targets seen Assert that the total number of targets in your searches is lt x gt for the purposes of per domain conditional E value calculations rather than the number of targets that passed the reporting thresholds Set the random number seed to lt n gt Some steps in postprocessing require Monte Carlo simulation The default is to use a fixed seed 42 so that results are exactly reproducible Any other positive integer will give different but also reproducible results A choice of 0 uses an arbitrarily chosen seed Assert that the query sequence file is in format lt s gt Accepted
89. ights Whenever a profile is built from a multiple alignment HMMER uses an ad hoc sequence weighting algorithm to downweight closely related sequences and upweight distantly related ones This has the effect of making models less biased by uneven phylogenetic representation For example two identical sequences would typically each receive half the weight that one sequence would and this is why jackhmmer isn t concerned about always including your original query sequence in each iteration s alignment even if it finds it again in the database you re searching These options control which algorithm gets used Wpb Use the Henikoff position based sequence weighting scheme Henikoff and Henikoff J Mol Biol 243 574 1994 This is the default wgsc Use the Gerstein Sonnhammer Chothia weighting algorithm Gerstein et al J Mol Biol 235 1067 1994 wblosum Use the same clustering scheme that was used to weight data in calculating BLO SUM subsitution matrices Henikoff and Henikoff Proc Natl Acad Sci 89 10915 1992 Sequences are single linkage clustered at an identity threshold default 0 62 see wid and within each cluster of c sequences each sequence gets rela tive weight 1 c wnone No relative weights All sequences are assigned uniform weight wid lt x gt Sets the identity threshold used by single linkage clustering when using wblosum Invalid with any other weighting scheme Default is 0 62 Options contr
90. in standard HMMER3 format construct binary compressed datafiles for hmmscan The hmmpress step is required for hmmscan to work Four files are created lt hmmfile gt h3m lt hmmfile gt h3i lt hmmfile gt h3f and lt hmmfile gt h3p The lt hmmfile gt h3m file contains the profile HMMs and their annotation in a binary format The lt hmmfile gt h3i file is an SSI index for the lt hmmfile gt h3m file The lt hmmfile gt h3f file contains precomputed data structures for the fast heuristic filter the MSV filter The lt hmmfile gt h3p file contains precomputed data structures for the rest of each profile Options h Help print a brief reminder of command line usage and all available options f Force overwrites any previous hmmpress ed datafiles The default is to bitch about any existing files and ask you to delete them first 39 hmmscan search sequence s against a profile database Synopsis hmmescan options lt hmmdb gt lt seqfile gt Description hmmscan is used to search sequences against collections of profiles For each sequence in lt segfile gt use that query sequence to search the target database of profiles in lt hmmdb gt and output ranked lists of the profiles with the most significant matches to the sequence The lt segfile gt may contain more than one query sequence It can be in FASTA format or several other common sequence file formats genbank embl and uniprot among others or in alignm
91. ine The line starting with n3 is the consensus of the query model Capital letters represent the most con served high information content positions Dots in this line indicate insertions in the target sequence with respect to the model The midline indicates matches between the query model and target sequence A indicates positive score which can be interpreted as conservative substitution with respect to what the model expects at that position The line starting with 7LESS_DRoME is the target sequence Dashes in this line indicate deletions in the target sequence with respect to the model The bottom line is new to HMMERS This represents the posterior probability essentially the expected accuracy of each aligned residue A O means 0 5 1 means 5 15 and so on 9 means 85 95 and a means 95 100 posterior probability You can use these posterior probabilities to decide which parts of the 3Not to mention one mercifully rare bug artifact that I m betting is so unusual that testers don t even see an example of it but we ll see 21 alignment are well determined or not You ll often observe for example that expected alignment accuracy degrades around locations of insertion and deletion which you d intuitively expect You ll also see expected alignment accuracy degrade at the ends of an alignment this is because alignment accuracy posterior probabilities currently not only includes whether
92. is theory doesn t apply to integrated log odds sequence scores HMM Forward scores The statistical significance expectation values or E values of HMMER3 sequence scores is determined by using recent theoretical conjectures about the statistical properties of integrated log odds scores which have been supported by numerical simulation experiments Eddy 2008 And as far as speed goes there s really nothing new about HMMER3 s speed either Besides Kar lin Altschul statistics the main reason BLAST has been so useful is that it s so fast Our design goal in HMMERS was to achieve rough speed parity between BLAST and more formal and powerful HMM based methods The acceleration algorithm in HMMER3 is a new heuristic It seems likely to be more sensitive than BLAST s heuristics on theoretical grounds It certainly benchmarks that way in practice at least in our hands Additionally it s very well suited to modern hardware architectures We expect to be able to take good advantage of GPUs and other parallel processing environments in the near future What s still missing in HMMER3 Even though this is a stable public release that we consider suitable for large scale production work genome annotation Pfam analysis whatever at the same time HMMER3 remains a work in progress We think the codebase is a suitable foundation for development of a number of significantly improved approaches to classically important problems in sequence analysis
93. ith a domain score this big in the set of sequences reported in the top hits list if those sequences consisted only of random nonhomologous sequence outside the region that sufficed to define them as homologs The second number is the independent E value the significance of the sequence in the whole database search if this were the only domain we had identified It s exactly the same as the best 1 domain E value in the sequence top hits list The different between the two E values is not apparent in the 7LESS_DRoME example because in both cases the size of the search space as 1 sequence There s a single sequence in the target sequence database that s the size of the search space that the independent best single domain E value depends on There s one sequence reported as a putative homolog in the sequence top hits list that s the size of the search space that the conditional E value depends on A better example is to see what happens when we search Uniprot 15 7 contains 497293 sequences with the fn3 model 19 gt hmmsearch f n3 hmm uniprot_sprot fasta If you don t have Uniprot and can t run a command like this don t worry about it I ll show the relevant bits here Now the domain report for 7LEss_DROME looks like score bias c Evalue i Evalue hmmfrom hmm to alifrom ali to envfrom env to acc E 40 7 0 0 9 1e 12 6 4e 09 2 84 439 520 437 321 0 95 21 14 4 0 0 0 0014 1 13 85 836 913 826 914 0 73 s 34 1 0
94. kholm format multiple alignment file The first few lines of it look like STOCKHOLM 1 0 MYG_ESCGI VLSDAEWOLVLNIWAKVEADVAGHGODILIRLFKGHPETLEKFDKFKH GR MYG_ESCGI 9 PR 69 RR RR IR RR I RR IOI RR IO I I Ok MYG_HORSE g LSDGEWOOVLNVWGKVEAD IAGHGOQEVLIRLFTGHPETLEKFDKFKH GR MYG_HORSE PP 8 8 9 RR RRR e e e RIOR OR IO MYG_PROGU g LSDGEWOLVLNVWGKVEGDLSGHGOEVLIRLFKGHPETLEKFDKFKH GR MYG_PROGU PP 8 B89xxxxxx xx k xk xk xx xxXxkXXKAK RR IO AR RR MYG_SAISC g LSDGEWOLVLNIWGKVEAD IPSHGOQEVLISLFKGHPETLEKFDKFKH GR MYG_SAISC PP 8 89 RR IR RR I IOI RA A MYG_LYCPI g LSDGEWOIVLNIWGKVETDLAGHGOQEVLIRLFKNHPETLDKFDKFKH H GR MYG_LYCPI PP 8 89 mR RR I IR IR RI I IORI RR IO I Ok MYG_MOUSE g LSDGEWOLVLNVWGKVEADLAGHGOEVLIGLFKTHPETLDKFDKFKN GR MYG_MOUSE PP 8 89 RR RI IR RR I IO IR RR IO I I Ok MYG_MUSAN Ves DWEKVNSVWSAVESDLTAIGONILLRLFEOYPESONHFPKEKN and so on Notice those pp annotation lines That s posterior probability annotation as in the single sequence alignments that hmmscan and hmmsearch showed This essentially represents the confidence that each residue is aligned where it should be hmmalign Currently has a feature that we re aware of Recall that HUMER3 only does local alignments Here we know that we ve provided full length globin sequences and globins4 is a full length globin model We d probably like hmmalign to produce a global alignment It can t currently do that If it doesn t quite manage to extend its loca
95. l alignment to the full length of a target globin sequence you ll get a weird looking effect as the nonmatching termini are pulled out to the left or right For example look at the N terminal g 25 in MYG_HORSE above H3 is about 80 confident that this residue is nonhomologous though any sensible person would align it into the first globin consensus column Look at the end of that first block of Stockholm alignment where you ll see HBBL_RANCA v HWTAEEKAVINSVWOKV DVEQODGHEALTRLFIVYPWTORYFSTFGD GR HBBL_RANCA PP 6 6799xx xx xx x xxx KXKXXKXKXKKAKA KAR ARA HBB2_TRICR VHLTAEDRKEIAAILGKV NVDSLGGOCLARLIVVNPWSRRYFHDFGD GR HBB2_TRICR PP 69xxx x x xkxxxxx X KAKI GC PP_cons OT DR I I OOOO IO IO IO IO IO IO TO IO I I ke GC RE XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX The Gc PP_cons line is Stockholm format consensus posterior probability annotation for the entire column It s calculated simply as the arithmetic mean of the per residue posterior probabilities in that col umn This should prove useful in phylogenetic inference applications for example where it s common to mask away nonconfidently aligned columns of a multiple alignment The PP cons line provides an objective measure of the confidence assigned to each column The cc RF line is Stockholm format reference coordinate annotation with an x marking each column that the profile considered to be consensus Single sequence queries using phmmer
96. l tails for Forward scores A values must be positive All three lines or none of them must be present when all three are present the model is considered to be calibrated for E value statistics Optional Flags the start of the main model section Solely for human readability of the tabular model data the symbol alphabet is shown on the uw line aligned to the fields of the match and insert symbol emission distributions in the main model below The next line is also for human readability providing column headers for the state transition probability fields in the main model section that follows Though unparsed after the nmm tag the presence of two header lines is mandatory the parser always skips the line after the Hmm tag line The first line in the main model section may be an optional line starting with compo these are the model s overall average match state emission probabilities which are used as a background residue composition in the filter null model The K fields on this line are log probabilities for each residue in the appropriate biosequence alphabet s order Optional All the remaining fields are mandatory The first two lines in the main model section are atypical They contain information for the core model s BEGIN node This is stored as model node 0 and match state 0 is treated as the BEGIN state The begin state is mute so there are no match emission probabilities The first line is the insert 0 emissions The sec
97. lues Default is 200 Sets the sequence length in simulation that estimates the location parameter mu for Viterbi filter E values Default is 200 Sets the number of sequences in simulation that estimates the location parameter mu for Viterbi filter E values Default is 200 Sets the sequence length in simulation that estimates the location parameter tau for Forward E values Default is 100 Sets the number of sequences in simulation that estimates the location parameter tau for Forward E values Default is 200 Sets the tail mass fraction to fit in the simulation that estimates the location param eter tau for Forward evalues Default is 0 04 Turn off the null2 score corrections for biased composition Assert that the total number of targets in your searches is lt x gt for the purposes of per sequence E value calculations rather than the actual number of targets seen Assert that the total number of targets in your searches is lt x gt for the purposes of per domain conditional E value calculations rather than the number of targets that passed the reporting thresholds Seed the random number generator with lt n gt an integer gt 0 If lt n gt is gt 0 any stochastic simulations will be reproducible the same command will give the same results If lt n gt is 0 the random number generator is seeded arbitrarily and stochastic simulations will vary from run to run of the same command The default seed is 42 Declare t
98. lusion thresholds is that inclusion thresholds control which hits actually get used in the next iteration or the final output multiple alignment if the A option is used whereas reporting thresholds control what you see in output Reporting thresholds are generally more loose so you can see borderline hits in the top of the noise that might be of interest incE lt x gt Include sequences with E values lt lt x gt in subsequent iteration or final alignment output by A The default is 0 001 incT lt x gt Use a bit score threshold for per sequence inclusion instead of an E value thresh old any setting of incE is ignored Include sequences with a bit score of gt lt x gt By default this option is unset incdomE lt x gt Include domains with conditional E values lt lt x gt in subsequent iteration or fi nal alignment output by A in addition to the top scoring domain per significant sequence hit The default is 0 001 incdomT lt x gt Use a bit score threshold for per domain inclusion instead of an E value threshold any setting of incT is ignored Include domains with a bit score of gt lt x gt By default this option is unset Options controlling acceleration heuristics HMMER3 searches are accelerated in a three step filter pipeline the MSV filter the Viterbi filter and the Forward filter The first filter is the fastest and most approximate the last is the full Forward scoring algo rithm slowest bu
99. mains even if no single domain is solidly significant on its own On the other hand if the target sequence happened to be a piece of junk consisting of a set of identical internal repeats and one of those repeats accidentially gives a weak hit to the query model all the repeats will sum up and the sequence score might look significant which mathematically alas is the correct answer the null hypothesis we re testing against is that the sequence is a random sequence of some base composition and a repetitive sequence isn t random So operationally e if both E values are significant lt lt 1 the sequence is likely to be homologous to your query e if the full sequence E value is significant but the single best domain E value is not the target sequence is probably a multidomain remote homolog but be wary and watch out for the case where it s just a repetitive sequence OK the sharp eyed reader asks if that s so then why in the globin4 output all of which have only a single domain do the full sequence bit scores and best single domain bit scores not exactly agree For example the top ranked hit myc_pHyca sperm whale myoglobin if you re curious has a full sequence score of 222 7 and a single best domain score of 222 6 What s going on What s going on is that the position and alignment of that domain is uncertain in this case only very slightly so but nonetheless uncertain The full sequence score is summed
100. mbuild wouldn t know how to name it correctly To build a multi model database from a multi MSA flatfile the alignments have to be in Stockholm format no other MSA format that I m aware of supports having more than one alignment per file and each alignment must have a name on a GF 1D line But if you happen to have a Pfam seed alignment flatfile fam A seed around an example command would be gt hmmbuild Pfam A hmm Pfam A seed This would take about two or three hours to build all 10 000 models or so in Pfam To speed the database construction process up hmmbuild supports MPI parallelization As far as HMMER s concerned all you have to do is add mpi to the command line for hmmbuild assuming you ve compiled support for MPI into it see the installation instructions You ll also need to know how to invoke an MPI job in your particular environment with your job scheduler and MPI distribution We can t really help you with this different sites have different cluster environments With our scheduler SGE the Sun Grid Engine and our MPI distro Intel MPI an example incantation for building P am hmm from Pfam A seed is gt qsub N hmmbuild j y o errors out b y cwd V pe impi 128 mpirun np 128 hmmbuild mpi Pfam hmm Pfam A seed gt hmmbuild out This reduces the time to build all of Pfam to about 40 seconds Step 2 compress and index the flatfile with hmmpress The hmmscan program has to read a lot of profile
101. mfile gt lt seqgfile gt Description Perform a multiple sequence alignment of all the sequences in segfile by aligning them individually to the profile HMM in hmmfile The new alignment is output to stdout in Stockholm format The sequences in segfile are aligned in unihit local alignment mode Therefore they should already be known to contain a single domain they should not contain more than one domain They may be fragments The optimal alignment may assign some residues as nonhomologous N and C states in which case these residues are still included in the resulting alignment but shoved to the outer edges To trim these nonhomologous residues from the result see the trim option Options h Help print a brief reminder of command line usage and all available options o lt f gt Direct the output alignment to file lt f gt rather than to stdout allcol Include columns in the output alignment for every match consensus state in the hmmfile even if it means having all gap columns This is useful in analysis pipelines that need to be able to maintain a predetermined profile HMM architec ture with an unchanging number of consensus columns through an hmmalign step mapali lt f gt Merge the existing alignment in file lt f gt into the result where lt f gt is exactly the same alignment that was used to build the model in hmmfile This is done using a map of alignment columns to consensus profile positions that is stor
102. n number for the format specification and which currently has no effect on my parsers This line allows a parser to instantly identify the file format In the alignment each line contains a name followed by the aligned sequence A dash period un derscore or tilde but not whitespace denotes a gap If the alignment is too long to fit on one line the alignment may be split into multiple blocks with blocks separated by blank lines The number of sequences their order and their names must be the same in every block Within a given block each sub sequence 69 and any associated GR and Gc markup see below is of equal length called the block length Block lengths may differ from block to block The block length must be at least one residue and there is no maximum Other blank lines are ignored You can add comments anywhere to the file even within a block on lines starting with a All other annotation is added using a tag value comment style The tag value format is inherently exten sible and readily made backwards compatible unrecognized tags will simply be ignored Extra annotation includes consensus and individual RNA or protein secondary structure sequence weights a reference co ordinate system for the columns and database source information including name accession number and coordinates for subsequences extracted from a longer source sequence See below for details syntax of Stockholm markup There are four typ
103. n the main output The default is a limit of 120 characters per line which helps in displaying the output cleanly on terminals and in editors but can truncate target profile description lines textw lt n gt Set the main output s line length limit to lt n gt characters per line The default is 120 61 Options controlling scoring system The probability model in phmmer is constructed by inferring residue probabilities from a standard 20x20 substitution score matrix plus two additional parameters for position independent gap open and gap extend probabilities popen lt x gt Set the gap open probability for a single sequence query model to lt x gt The default is 0 02 lt x gt must be gt 0 and lt 0 5 pextend lt x gt Set the gap extend probability for a single sequence query model to lt x gt The default is 0 4 lt x gt must be gt 0 and lt 1 0 mxfile lt mxfile gt Obtain residue alignment probabilities from the substitution matrix in file lt mxfile gt The default score matrix is BLOSUME62 this matrix is internal to HMMER and does not have to be available as a file The format of a substitution matrix lt mxfile gt is the standard format accepted by BLAST FASTA and other sequence analysis software Options controlling reporting thresholds Reporting thresholds control which hits are reported in output files the main output tblout and domtblout Sequence hits and domain hits are ranked by sta
104. n to PSI BLAST Another application of HMMER is when you are working on a protein sequence family and you have carefully constructed a multiple sequence alignment Your family like most protein families has a number of strongly but not absolutely conserved key residues separated by characteristic spacing You wonder if there are more members of your family in the sequence databases but the family is so evolutionarily diverse a BLAST search with any individual sequence doesn t even find the rest of the sequences you already know about You re sure there are some distantly related sequences in the noise You spend many pleasant evenings scanning weak BLAST alignments by eye to find ones with the right key residues are in the right places but you wish there was a computer program that did this a little better Another application is the automated annotation of the domain structure of proteins Large databases of curated alignments and HMMER models of known domains are available including Pfam Finn et al 2010 and SMART Letunic et al 2006 in the Interpro database consortium Mulder et al 2003 Many top ten protein domains lists a standard table in genome analysis papers rely heavily on HMMER annotation Say you have a new sequence that according to a BLAST analysis shows a slew of hits to receptor tyrosine kinases Before you decide to call your sequence an RTK homologue you suspiciously recall that RTK s are like many protei
105. neous options a aoaaa aaa ee 49 Options controlling output 2 ee 49 Options controlling model configuration mode o oo o e 50 Options controlling scoring algorithm 2 2 2 00000 ee 50 Options controlling fitted tail masses for forward 51 Options controlling h3 parameter estimation methods 0 o 51 Popugding oplan s i a ok whe a a a a a ie es 52 Experimental Options aeea a h uda da 52 hmmstat display summary statistics for a profile file oaoa a 53 SYNOPSIS a felt AE A A phe A a Me E a a a a 53 DESEAPUIO oer kh a re Se ec Se 53 OPINAS 22 2 6 2 a A o do ats Se ht Seda ed eras ke OE sce iyi yikes 53 jackhmmer iteratively search sequence s against a protein database 54 SINOPSIS 2a So dte ed NA A a Ete e de a eS 54 DeSCMPION orar las addon ae 54 OPUONS scr tit a a dd ds aa la ae de do ls te aed tee E GS 54 Options controlling output 2 a 54 Options controlling single sequence scoring first iteration 55 Options controlling reporting thresholds 00 0 ee ee eee 55 Options controlling inclusion thresholds 0 0 00 eee eee ee ee eee 56 Options controlling acceleration heuristics osooso oo ee 56 Options controlling profile construction later iterations o ooo a o 57 Options controlling relative weights 2 ooa a 58 Options controlling effecti
106. ngth of each line in the main output The default is a limit of 120 characters per line which helps in displaying the output cleanly on terminals and in editors but can truncate target profile description lines textw lt n gt Set the main output s line length limit to lt n gt characters per line The default is 120 44 Options controlling reporting thresholds Reporting thresholds control which hits are reported in output files the main output tblout and domtblout Sequence hits and domain hits are ranked by statistical significance E value and output is generated in two sections called per target and per domain output In per target output by default all sequence hits with an E value lt 10 are reported In the per domain output for each target that has passed per target report ing thresholds all domains satisfying per domain reporting thresholds are reported By default these are domains with conditional E values of lt 10 The following options allow you to change the default E value reporting thresholds or to use bit score thresholds instead E lt x gt In the per target output report target sequences with an E value of lt lt x gt The default is 10 0 meaning that on average about 10 false positives will be reported per query so you can see the top of the noise and decide for yourself if it s really noise T lt x gt Instead of thresholding per profile output on E value instead report target se qu
107. nition community an excellent tutorial introduction has been written by Rabiner Rabiner 1989 gt How do I cite HMMER There is still no real paper on HMMER If you re writing for an enlightened url friendly journal the best reference is http hmmer org If you must use a paper reference the best one to use is my 1998 profile HMM review Eddy 1998 2 Installation Quick installation instructions Download hnmer 3 0 tar gz from http hmmer org or directly from ftp selab janelia org pub software hmmer3 hmmer 3 0 tar gz untar and change into the newly created directory hmmer 3 0 gt wget ftp selab janelia org pub software hmmer3 hmmer 3 0 tar gz gt tar xf hmmer 3 0 tar gz gt cd hmmer 3 0 The tarball includes precompiled binaries for x86 Linux platforms These are in the binaries directory You can just stop here if you like if you re on a x86 Linux machine and you want to try the programs out without installing them To compile new binaries from source do a standard GNUish build gt configure gt make To compile and run a test suite to make sure all is well you can optionally do gt make check All these tests should pass You don t have to install HMMER programs to run them The newly compiled binaries are now in the src directory you can run them from there To install the programs and man pages somewhere on your system do gt make install By default programs are installed in usr lo
108. ns composed of multiple functional domains and these domains are often found 2There has been agitation in some quarters to call all such models position specific scoring matrices PSSMs but certain small nocturnal North American marsupials have a prior claim on the name SThis is not strictly true There is a subtle difference between an HMMs state path a first order Markov chain and the sequence described by an HMM generated from the state path by independent emissions of symbols at each state promiscuously in proteins with a wide variety of functions Is your sequence really an RTK Or is it a novel sequence that just happens to have a protein kinase catalytic domain or fibronectin type III domain Another application is the automated construction and maintenance of large multiple alignment databases It is useful to organize sequences into evolutionarily related families But there are thousands of protein se quence families some of which comprise tens of thousands of sequences and the primary sequence databases continue to double every year or two This is a hopeless task for manual curation but on the other hand manual curation is really necessary for high quality biologically relevant multiple alignments Databases like Pfam Finn et al 2010 are constructed by distinguishing between a stable curated seed alignment of a small number of representative sequences and full alignments of all detectable homologs HM
109. nstead of thresholding per domain output on E value instead report domains with a bit score of gt lt x gt Options controlling inclusion thresholds Inclusion thresholds are stricter than reporting thresholds They control which hits are included in any output multiple alignment the A option and which domains are marked as significant as opposed to questionable in domain output 62 incE lt x gt Use an E value of lt lt x gt as the per target inclusion threshold The default is 0 01 meaning that on average about 1 false positive would be expected in every 100 searches with different query sequences incT lt x gt Instead of using E values for setting the inclusion threshold instead use a bit score of gt lt x gt as the per target inclusion threshold By default this option is unset incdomE lt x gt Use a conditional E value of lt lt x gt as the per domain inclusion threshold in targets that have already satisfied the overall per target inclusion threshold The default is 0 01 incdomT lt x gt Instead of using E values use a bit score of gt lt x gt as the per domain inclusion threshold By default this option is unset Options controlling the acceleration pipeline HMMERS searches are accelerated in a three step filter pipeline the MSV filter the Viterbi filter and the Forward filter The first filter is the fastest and most approximate the last is the full Forward scoring
110. nt known detectable domains in a given sequence It takes a single query sequence and an HMM database as input The HMM database might be Pfam SMART or TIGRFams for example or another collection of your choice Step 1 create an HMM database flatfile An HMM database flatfile is simply a concatenation of individual HMM files To create a database flat file you can either build individual HMM files and concatenate them or you can concatenate Stockholm alignments and use hmmbui 1d to build an HMM database of all of them in one command Let s create a tiny database called minifam containing models of globin fn3 and Pkinase protein kinase domains by concatenating model files gt hmmbuild globins4 hmm tutorial globins4 sto gt hmmbuild fn3 hmm tutorial fn3 sto gt hmmbuild Pkinase hmm tutorial Pkinase sto gt cat globins4 hmm fn3 hmm Pkinase hmm gt minifam We ll use minifam for our examples in just a bit but first a few words on other ways to build HMM databases especially big ones The file tutorials minifam is the same thing if you want to just use that Alternatively you can concatenate Stockholm alignment files together as Pfam does in its big P am A seed and Pfam A full flatfiles and use hmmbuild to build HMMs for all the alignments at once This won t work properly for our tutorial alignments because the globins4 sto alignment doesn t have an GF ID anno tation line giving a name to the globins4 alignment so hm
111. nteger the number of sequences that the HMM was trained on This field is only used for logging purposes Optional Effective sequence number lt f gt is a nonzero positive real the effective total num ber of sequences determined by hmmbuild during sequence weighting for combin ing observed counts with Dirichlet prior information in parameterizing the model This field is only used for logging purposes Optional Training alignment checksum lt d gt is a nonnegative unsigned 32 bit integer This number is calculated from the training sequence data and used in conjunction with the alignment map information to verify that a given alignment is indeed the alignment that the map is for Optional Pfam gathering thresholds GA1 and GA2 See Pfam documentation of GA lines Optional Pfam trusted cutoffs TC1 and TC2 See Pfam documentation of TC lines Op tional Pfam noise cutoffs NC1 and NC2 See Pfam documentation of NC lines Optional lt f1 gt lt 2 gt Statistical parameters needed for E value calculations lt s1 gt is the model s main model section alignment mode configuration currently only Local is recognized lt s2 gt is the name of the score distribution currently MSV VITERBI and FORWARD are recog nized lt 1 gt and lt 2 gt are two real valued parameters controlling location and slope of each distribution respectively y and A for Gumbel distributions for MSV and Viterbi scores and 7 and A for exponentia
112. olling effective sequence number After relative weights are determined they are normalized to sum to a total effective sequence number eff_nseg This number may be the actual number of sequences in the alignment but it is almost always smaller than that The default entropy weighting method eent reduces the effective sequence num ber to reduce the information content relative entropy or average expected score on true homologs per consensus position The target relative entropy is controlled by a two parameter function where the two parameters are settable with ere and esigma eent Adjust effective sequence number to achieve a specific relative entropy per position see ere This is the default 58 eclust enone eset lt x gt ere lt x gt esigma lt x gt eid lt x gt Set effective sequence number to the number of single linkage clusters at a spe cific identity threshold see eid This option is not recommended it s for experi ments evaluating how much better eent is Turn off effective sequence number determination and just use the actual number of sequences One reason you might want to do this is to try to maximize the relative entropy position of your model which may be useful for short models Explicitly set the effective sequence number for all models to lt x gt Set the minimum relative entropy position target to lt x gt Requires eent Default depends on the sequence
113. om HMMER2 s historical terminology for unihit glocal alignment Options controlling scoring algorithm vit fwd hyb MSV Collect Viterbi maximum likelihood alignment scores This is the default Collect Forward log odds likelihood scores summed over alignment ensemble Collect Hybrid scores as described in papers by Yu and Hwa for instance Bioin formatics 18 864 2002 These involve calculating a Forward matrix and taking the maximum cell value The number itself is statistically somewhat unmotivated but the distribution is expected be a well behaved extreme value distribution Gumbel Collect MSV multiple ungapped segment Viterbi scores using H3 s main accel eration heuristic 50 fast For any of the above options use H3 s optimized production implementation us ing SIMD vectorization The default is to use the implementations sacrifice a small amount of numerical precision This can introduce confounding noise into statis tical simulations and fits so when one gets super concerned about exact details it s better to be able to factor that source of noise out Options controlling fitted tail masses for forward In some experiments it was useful to fit Forward scores to a range of different tail masses rather than just one These options provide a mechanism for fitting an evenly spaced range of different tail masses For each different tail mass a line is generated in the output tmin lt x gt
114. on Sets the minimum relative entropy contributed by an entire model alignment over its whole length This has the effect of making short models have higher relative entropy per position than ere alone would give The default is 45 0 bits Sets the fractional pairwise identity cutoff used by single linkage clustering with the eclust option The default is 0 62 Options conirolling e value calibration The location parameters for the expected score distributions for MSV filter scores Viterbi filter scores and Forward scores require three short random sequence simulations EmL lt n gt EmN lt n gt EvL lt n gt EvN lt n gt EfL lt n gt EfN lt n gt Eft lt x gt Sets the sequence length in simulation that estimates the location parameter mu for MSV filter E values Default is 200 Sets the number of sequences in simulation that estimates the location parameter mu for MSV filter E values Default is 200 Sets the sequence length in simulation that estimates the location parameter mu for Viterbi filter E values Default is 200 Sets the number of sequences in simulation that estimates the location parameter mu for Viterbi filter E values Default is 200 Sets the sequence length in simulation that estimates the location parameter tau for Forward E values Default is 100 Sets the number of sequences in simulation that estimates the location parameter tau for Forward E values Default is 200
115. on 64 on we believe this includes almost all Intel processors since 2000 and AMD processors since 2003 Altivec VMX is supported on Motorola G4 IBM G5 and IBM PowerPC processors starting with the Power6 which we believe includes almost all PowerPC based desktop systems since 1999 and servers since 2007 There are add on products available for making Windows more POSIX compliant and more compatible with GNU ish configures and builds One such product is Cygwin http www cygwin com which is freely available Although we do not test on Windows platforms we understand HMMER builds fine in a Cygwin environment on Windows 10 If your platform does not support one of these vector instruction sets the configure script will revert to an unoptimized implementation called the dummy implementation This implementation is two orders of magnitude slower It will enable you to see H3 s scientific features on a much wider range of processors but is not suited for real production work We do aim to be portable to all modern processors The acceleration algorithms are designed to be portable despite their use of specialized SIMD vector instructions We hope to add support for the Sun SPARC VIS instruction set for example We believe that the code will be able to take advantage of GP GPUs and FPGAs in the future Compiler The source code is C conforming to POSIX and ANSI C99 standards It should compile with any ANSI C99 compliant compiler inclu
116. ond line contains the transitions from the begin state and insert state 0 These seven numbers are That is the first two lines after the optional COMPO line Don t be confused by the presence of an optional COMPO line here The COMPO line is placed in the model section below the residue column headers because it s an array of numbers much like residue scores but it s not really part of the model 68 BoM B gt Ip B gt Di lo gt Mi lo Ip then a 0 0 and a because by convention nonexistent transitions from the nonexistent delete state 0 are set to log 1 0 and log0 The remainder of the model has three lines per node for M nodes where M is the number of match states as given by the LENG line These three lines are K is the alphabet size in residues Match emission line The first field is the node number 1 M The parser verifies this number as a consistency check it expects the nodes to come in order The next K numbers for match emissions one per symbol in alphabetic order The next field is the map annotation for this node If map was yes in the header then this is an integer representing the alignment column index for this match state 1 alen otherwise this field is The next field is the RF annotation for this node If RF was yes in the header then this is a single character representing the reference annotation for this match state otherwise this field is
117. ormat 2 o 72 An example A2M file 0 o 72 Kegalicharacters 4 06 ok eo A ii RE Be de a Fee A Ss 73 Determining consensus columns 0000 0 ee 73 6 Acknowledgements and history 74 THANKS ate a tl a das e pest a git cama IE lea aio de dawn carta 74 1 Introduction HMMER is used to search sequence databases for homologs of protein sequences and to make protein sequence alignments HMMER can be used to search sequence databases with single query sequences but it becomes particularly powerful when the query is a multiple sequence alignment of a sequence family HMMER makes a profile of the query that assigns a position specific scoring system for substitutions insertions and deletions HMMER profiles are probabilistic models called profile hidden Markov models profile HMMs Krogh et al 1994 Eddy 1998 Durbin et al 1998 Compared to BLAST FASTA and other sequence alignment and database search tools based on older scoring methodology HMMER aims to be significantly more accurate and more able to detect remote homologs because of the strength of its underlying probability models In the past this strength came at a significant computational cost with profile HMM implementations running about 100x slower than comparable BLAST searches With HMMER3 HMMER is now essentially as fast as BLAST How to avoid reading this manual hate reading documentation If you re like me you re sitting here thinking 75 pages o
118. over all possible alignments of the globin model to the myc_PpHyca sequence When HMMERS identifies domains it identifies what it calls an envelope bounding where the domain s alignment most probably lies More on this later when we discuss the reported coordinates of domains and alignments in the next section of the output The single best dom score is calculated after the domain envelope has been defined and the summation is restricted only to the ensemble of possible alignments that lie within the envelope The fact that the two scores are slightly different is therefore telling you that there s a small amount of probability uncertainty that the domain lies somewhat outside the envelope bounds that HMMER has selected The two columns headed doms are two different estimates of the number of distinct domains that the target sequence contains The first the column marked exp is the expected number of domains according to HMMER s statistical model It s an average calculated as a weighted marginal sum over all possible 18 EC 2 7 alignments Because it s an average it isn t necessarily a round integer The second the column marked N is the number of domains that HAMER93 s domain postprocessing and annotation pipeline finally decided to identify annotate and align in the target sequence This is the number of alignments that will show up in the domain report later in the output file These two numbers should be about t
119. parsed line by line in a tag value format Each line type is either mandatory or optional as indicated HMMER3 b Unique identifier for the save file format version the b means that this is HMMER3 HMM file format version b When HMMER3 changes its save file format the revi sion code advances This way parsers may easily remain backwards compatible The remainder of the line after the HMmER3 b tag is free text that is ignored by the parser HMMER currently writes its version number and release date in brackets here e g 3 0b2 June 2009 in this example Mandatory 66 NAME lt s gt ACC lt s gt DESC lt s gt LENG lt d gt ALPH lt s gt CS lt s gt DATE lt s gt COM lt n gt lt s gt Model name lt s gt is a single word containing no spaces or tabs The name is normally picked up from the GF ID line from a Stockholm alignment file If this is not present the name is created from the name of the alignment file by removing any file type suffix For example an otherwise nameless HMM built from the alignment file rrm s1x would be named rrm Mandatory Accession number lt s gt is a one word accession number This is picked up from the GF Ac line in a Stockholm format alignment Optional Description line lt s gt is a one line free text description This is picked up from the GF DE line in a Stockholm alignment file Optional Model length lt a gt a positive nonzero integer is
120. pecify that all sequences in msafile are DNAs rna Specify that all sequences in msafile are RNAs Options controlling profile construction These options control how consensus columns are defined in an alignment fast Define consensus columns as those that have a fraction gt symfrac of residues as opposed to gaps See below for the symfrac option This is the default 31 hand Define consensus columns in next profile using reference annotation to the multiple alignment This allows you to define any consensus columns you like symfrac lt x gt Define the residue fraction threshold necessary to define a consensus column when using the default fast construction option The default for symfrac is 0 5 The symbol fraction in each column is calculated after taking relative sequence weighting into account and ignoring gap characters corresponding to ends of se quence fragments as opposed to internal insertions deletions Setting this to 0 0 means that every alignment column will be assigned as consensus which may be useful in some cases Setting it to 1 0 means that only columns that have no gap characters at all will be assigned as consensus fragthresh lt x gt We only want to count terminal gaps as deletions if the aligned sequence is known to be full length not if it is a fragment for instance because only part of it was sequenced HMMER uses a simple rule to infer fragments if the sequence length Lis less than a
121. pparent in the globin example because all the globins in this example consist of only a single globin domain So let s set up a second example using a model of a single domain that s commonly found in multiple domains in a single sequence Build a fibronectin type IIl domain model using the tutorial fn3 sto alignment this happens to be a Pfam seed alignment it s a good example of an alignment with complex Stockholm anno tation Then use that model to analyze the sequence tutorial 7LESS_DROME the Drosophila Sevenless receptor tyrosine kinase gt hmmbuild fn3 hmm tutorial fn3 sto gt hmmsearch fn3 hmm tutorial 7LESS_DROME gt fn3 out An example of what that output file will look like is provided in tutorial fn3 out The sequence top hits list says full sequence best 1 domain dom E value score bias E value score bias exp N Sequence Description 1 9e 57 178 0 0 4 1 2e 16 47 2 0 7 9 4 9 7LESS_DROME RecName Full Protein sevenless OK now let s pick up the explanation where we left off The total sequence score of 178 0 sums up all the fibronectin IIl domains that were found in the 7LESS_DRomE sequence The single best dom score and E value are the bit score and E value as if the target sequence only contained the single best scoring domain without this summation The idea is that we might be able to detect that a sequence is a member of a multidomain family because it contains multiple weakly scoring do
122. profiles and or sequences noali Omit the alignment section from the main output This can greatly reduce the output volume notextw Unlimit the length of each line in the main output The default is a limit of 120 characters per line which helps in displaying the output cleanly on terminals and in editors but can truncate target profile description lines textw lt n gt Set the main output s line length limit to lt n gt characters per line The default is 120 Options controlling single sequence scoring first iteration By default the first iteration uses a search model constructed from a single query sequence This model is constructed using a standard 20x20 substitution matrix for residue probabilities and two additional pa rameters for position independent gap open and gap extend probabilities These options allow the default single sequence scoring parameters to be changed popen lt x gt Set the gap open probability for a single sequence query model to lt x gt The default is 0 02 lt x gt must be gt 0 and lt 0 5 pextend lt x gt Set the gap extend probability for a single sequence query model to lt x gt The default is 0 4 lt x gt must be gt 0 and lt 1 0 mxfile lt mxfile gt Obtain residue alignment probabilities from the substitution matrix in file lt mxfile gt The default score matrix is BLOSUME62 this matrix is internal to HMMER and does not have to be available as a file The form
123. ptions n lt s gt Name the new profile lt s gt The default is to use the name of the alignment if one is present in the msafile or failing that the name of the hmmfile If msafile contains more than one alignment n doesn t work and every alignment must have a name annotated in the msafile as in Stockholm GF ID annotation o lt f Direct the summary output to file lt f gt rather than to stdout O lt f gt After each model is constructed resave annotated possibly modified source align ments to a file lt f in Stockholm format The alignments are annotated with a reference annotation line indicating which columns were assigned as consensus and sequences are annotated with what relative sequence weights were assigned Some residues of the alignment may have been shifted to accommodate restric tions of the Plan7 profile architecture which disallows transitions between insert and delete states Options for specifying the alphabet The alphabet type amino DNA or RNA is autodetected by default by looking at the composition of the msafile Autodetection is normally quite reliable but occasionally alphabet type may be ambiguous and autodetection can fail for instance on tiny toy alignments of just a few residues To avoid this or to increase robustness in automated analysis pipelines you may specify the alphabet type of msafile with these options amino Specify that all sequences in msafile are proteins dna S
124. r hits their conditional E values indicate that they are probably also fn3 domains too Domain 3 is too weak to be sure of from this search alone but would be something to pay attention to The next four columns give the endpoints of the reported local alignment with respect to both the query model hmm from and hmm to and the target sequence ali from and ali to It s not immediately easy to tell from the to coordinate whether the alignment ended internally in the query or target versus ran all the way as in a full length global alignment to the end s To make this more readily apparent with each pair of query and target endpoint coordinates there s also a little symbology meaning both ends of the alignment ended internally and means both ends of the alignment were full length flush to the ends of the query or target and and mean only the left or right end was flush full length The next two columns env from and env to define the envelope of the domain s location on the target sequence The envelope is almost always a little wider than what HMMER chooses to show as 20 a reasonably confident alignment As mentioned earlier the envelope represents a subsequence that encompasses most of the posterior probability for a given homologous domain even if precise endpoints are only fuzzily inferrable You ll notice that for higher scoring domains the coordinates of the en
125. r inclusion thresholds Inclusion thresholds are stricter than reporting thresholds Inclusion thresholds control which hits are con sidered to be reliable enough to be included in an output alignment or a subsequent search round In hmmscan which does not have any alignment output like hmmsearch or phmmer nor any iterative search steps like jackhmmer inclusion thresholds have little effect They only affect what domains get marked as significant or questionable in domain output incE lt x gt incT lt x gt incdomE lt x gt incdomT lt x gt Use an E value of lt lt x gt as the per target inclusion threshold The default is 0 01 meaning that on average about 1 false positive would be expected in every 100 searches with different query sequences Instead of using E values for setting the inclusion threshold instead use a bit score of gt lt x gt as the per target inclusion threshold It would be unusual to use bit score thresholds with hmmscan because you don t expect a single score threshold to work for different profiles different profiles have slightly different expected score distributions Use a conditional E value of lt lt x gt as the per domain inclusion threshold in targets that have already satisfied the overall per target inclusion threshold The default is 0 01 Instead of using E values instead use a bit score of gt lt x gt as the per domain inclusion threshold As with incT
126. r unihit local alignment Smith Waterman glocal Configure the profile for multihit glocal alignment uniglocal Configure the profile for unihit glocal alignment 36 Other options seed lt n gt Seed the random number generator with lt n gt an integer gt 0 If lt n gt is nonzero any stochastic simulations will be reproducible the same command will give the same results If lt n gt is 0 the random number generator is seeded arbitrarily and stochastic simulations will vary from run to run of the same command The de fault is 0 use an arbitrary seed so different hmmemit runs will generate different samples 37 hmmfetch retrieve profile HMM s from a file Synopsis hmmfetch options lt hmmtfile gt lt key gt retrieves HMM named lt key gt hmmfetch f foptions lt hmmfile gt lt keyfile gt retrieves all HMMs listed in lt keyfile gt hmmfetch index options lt hmmfile gt indexes lt hmmfile gt for fetching Description Quickly retrieves one or more profile HMMs from an lt hmmfile gt a large Pfam database for example The lt hmmfile gt must be indexed first using hmmfetch index The index is a binary file named lt hmmfile gt ssi The default mode is to retrieve a single profile by name or accession called the lt key gt For example hmmfetch Pfam A hmm Caudal_act hmmfetch Pfam A hmm PF00045 With the f option a lt keyfile gt containing a list of one or more keys is re
127. r xpg4 bin grep 12 Configuration warns that it s using the dummy implementation and H3 is going to be very slow This is what you get if your system has a processor that we don t yet support with a fast vector parallel im plementation We currently support Intel AMD compatible processors and PowerPC compatible processors H3 will revert to a portable but slow version on other processors Many make check tests fail We have one report of a system that failed to link multithread capable system C libraries correctly and instead linked to one or more serial only libraries We ve been unable to reproduce the problem here and are not sure what could cause it we optimistically believe it s a messed up system instead of our fault If it does happen it screws all kinds of things up with the multithreaded implementation A workaround is to shut threading off gt configure disable threads This will compile code won t parallelize across multiple cores of course but it will still work fine on a single processor at a time and MPI if you build with MPI enabled 3The telltale phenotype of this failure is to configure with debugging flags on and recompile run one of the failed unit test drivers such as easel easel_utest yourself and let it dump core and use a debugger to examine the stack trace in the core If it s failed in _errno_location it s linked a non thread capable system C library 13 3 Tutorial
128. ri tes cir bie Belton ari A etal te tet Pia 39 hmmscan search sequence s against a profile database o ooo 40 SYNOPSIS oe fo it id E A da E ta ek Es 40 DESCAPUOM ah Paar e eB eae ee ee e A 40 OPUIONS 2 arre do Se Ailend dela ERA A ibs 40 Options for controlling output o o 40 Options for reporting thresholds o o o 41 Options for inclusion thresholds o o o o 41 Options for model specific score thresholding 0 0002 o eee ee eee 42 Control of the acceleration pipeline o o 42 Othe options setos Bee Rint A lr rta 43 hmmsearch search profile s against a sequence database o o o o 44 SYNOPSIS na rt a a A ta A i 44 DeSCrplOM mii amp odode A A einen 44 Options cs a a A A io ee eee a 44 Options for controlling output o o 44 Options controlling reporting thresholds 0 ee es 45 Options for inclusion thresholds o o e 45 Options for model specific score thresholding 0 000 eee eee ee eee 46 Options controlling the acceleration pipeline o o e eee 46 Other Options fee Aa are Ee ee a ad EE Shee oe ay atin Gr ds ae Se a a ei 47 hmms im collect score distributions on random sequences o o o o o 000 48 SINOPSIS aci didn e Bt PR Piva ee ti 48 DESCHPUON s a fen a a i and ah ae te A ee eee ee sch Sac ee E 48 Miscella
129. rt options lt hmmfile gt Description The hmmconvert utility converts an input profile file to different HAMMER formats By default the input profile can be in any HMMER format including old obsolete formats from HMMER2 ASCII or binary the output profile is a current HMMER3 ASCII format Options h a 2 outfmt lt s gt Help print a brief reminder of command line usage and all available options Output profiles in ASCII text format This is the default Output profiles in binary format Output in legacy HMMER2 ASCII text format in Is glocal mode This allows HMMER3 models to be converted back to a close approximation of HMMER2 for comparative studies Output in a HMAMER3 ASCII text format other then the most current one Valid choices for lt s gt are 3 b or 3 a The current format is 3 b and this is the default There is a slightly different format 3 a that was used in some alpha test code 35 hmmemit sample sequences from a profile HMM Synopsis hmmemit options hmmfile Description The hmmemit program samples emit sequences from the profile HMM in hmmfile and outputs them The hmmfile should contain only a single HMM not a library of them only the first HMM will be read The default is to sample a sequence sequence from the core probability model Sampling sequences may be useful for a variety of purposes including creating synthetic true positives for benchmarks or tests Common options
130. s controlling output By default output for each iteration appears on stdout in a somewhat human readable somewhat parseable format These options allow redirecting that output or saving additional kinds of output to files including checkpoint files for each iteration 0 lt f gt Direct the human readable output to a file lt f gt A lt f gt After the final iteration save an annotated multiple alignment of all hits satisfying inclusion thresholds also including the original query to lt f gt in Stockholm format tblout lt f gt After the final iteration save a tabular summary of top sequence hits to lt f gt ina readily parseable columnar whitespace delimited format domtblout lt f gt After the final iteration save a tabular summary of top domain hits to lt f gt in a readily parseable columnar whitespace delimited format chkhmm lt prefix gt At the start of each iteration checkpoint the query HMM saving it to a file named lt prefix gt lt n gt hmm where lt n gt is the iteration number from 1 N chkali lt prefix gt At the end of each iteration checkpoint an alignment of all domains satisfying inclusion thresholds e g what will become the query HMM for the next iteration saving it to a file named lt checkpoint file prefix gt lt n gt sto in Stockholm format where lt n gt is the iteration number from 1 N 54 acc Use accessions instead of names in the main output where available for
131. s in target segdb In some cases for instance if you ve split your target sequence database into mul tiple files for parallelization of your search you may know better what the actual size of your search space is domE lt x gt Report domains with conditional E values lt lt x gt in per domain output in addi tion to the top scoring domain per significant sequence hit The default is 10 0 domT lt x gt Use a bit score threshold for per domain output instead of an E value threshold any setting of domT is ignored Report domains with a bit score of gt lt x gt in per domain output in addition to the top scoring domain per significant sequence hit By default this option is unset domZ lt x gt Declare the number of significant sequences to be lt x gt sequences for purposes of conditional E value calculation for additional domain significance Normally con ditional E values are calculated relative to the number of sequences passing per sequence reporting threshold Options controlling inclusion thresholds Inclusion thresholds control which hits are included in the multiple alignment and profile constructed for the next search iteration By default a sequence must have a per sequence E value of lt 0 001 see E option to be included and any additional domains in it besides the top scoring one must have a conditional E value of lt 0 001 see domE option The difference between reporting thresholds and inc
132. search the target database of sequences in lt seqdb gt and output ranked lists of the sequences with the most significant matches to the query The output format is designed to be human readable but is often so voluminous that reading it is impractical and parsing it is a pain The tblout and domtblout options save output in simple tabular formats that are concise and easier to parse The o option allows redirecting the main output including throwing it away in dev null Options h Help print a brief reminder of command line usage and all available options Options for controlling output O lt f gt Direct the main human readable output to a file lt f gt instead of the default stdout A lt f gt Save a multiple alignment of all significant hits those satisfying inclusion thresh olds to the file lt f in Stockholm format tblout lt f gt Save a simple tabular space delimited file summarizing the per target output with one data line per homologous target sequence found domtblout lt f gt Save a simple tabular space delimited file summarizing the per domain output with one data line per homologous domain detected in a query sequence for each homologous model acc Use accessions instead of names in the main output where available for profiles and or sequences noali Omit the alignment section from the main output This can greatly reduce the output volume notextw Unlimit the length of each line i
133. st queries it s about as fast as BLAST Individually none of these points is new As far as alignment ensembles and sequence scores go pretty much the whole reason why hidden Markov models are so theoretically attractive for sequence analysis is that they are good probabilistic models for explicitly dealing with alignment uncertainty The SAM profile HMM software from UC Santa Cruz has always used full probabilistic inference the HMM Forward and Backward algorithms as opposed to optimal alignment scores the HMM Viterbi algorithm HMMER2 had the full HMM inference algorithms available as command line options but used Viterbi alignment by default in part for speed reasons Calculating alignment ensembles is even more computationally intensive than calculating single optimal alignments One reason why it s been hard to deploy sequence scores for practical large scale use is that we haven t known how to accurately calculate the statistical significance of a log odds score that s been integrated over alignment uncertainty Accurate statistical significance estimates are essential when one is trying to discriminate homologs from millions of unrelated sequences in a large sequence database search The statistical significance of optimal alignment scores can be calculated by Karlin Altschul statistics Karlin and Altschul 1990 1993 Karlin Altschul statistics are one of the most important and fundamental advances introduced by BLAST However th
134. t most accurate There is also a bias filter step between MSV and Viterbi Targets that pass 56 all the steps in the acceleration pipeline are then subjected to postprocessing domain identification and scoring using the Forward Backward algorithm Essentially the only free parameters that control HMMER s heuristic filters are the P value thresholds controlling the expected fraction of nonhomologous sequences that pass the filters Setting the default thresholds higher will pass a higher proportion of nonhomologous sequence increasing sensitivity at the expense of speed conversely setting lower P value thresholds will pass a smaller proportion decreasing sensitivity and increasing speed Setting a filter s P value threshold to 1 0 means it will passing all sequences and effectively disables the filter Changing filter thresholds only removes or includes targets from consideration changing filter thresholds does not alter bit scores E values or alignments all of which are determined solely in postprocessing max Maximum sensitivity Turn off all filters including the bias filter and run full For ward Backward postprocessing on every target This increases sensitivity slightly at a large cost in speed F1 lt x gt First filter threshold set the P value threshold for the MSV filter step The default is 0 02 meaning that roughly 2 of the highest scoring nonhomologous targets are expected to pass the filter F2 lt x gt Second
135. target sequence The default allows sequences with a P value of lt 1075 through 1076 sequences passed All sequences that make it through the three filters are then subjected to a full probabilistic analysis using the HMM Forward Backward algorithms first to identify domains and assign domain envelopes then within each individual domain envelope Forward Backward calculations are done to determine posterior probabilities for each aligned residue followed by optimal accuracy alignment The results of this step are what you finally see on the output Recall the difference between conditional and independent E values with their two different search space sizes These search space sizes are reported in the statistics summary 4It may make more sense to condition the posterior probabilities on the assumption that the residue is indeed homologous given that how likely is it that I ve got it correctly aligned 22 Finally it reports the speed of the search in units of Mc sec million dynamic programming cells per second the CPU time and the elapsed time This search took about 2 29 seconds of elapsed wall clock time running with cpu 2 on two cores That s in the same ballpark as BLAST On the same machine also running dual core NCBI BLAST with one of these globin sequences took 2 3 seconds and WU BLAST took 4 8 seconds Searching a profile HMM database with a query sequence The hmmscan program is for annotating all the differe
136. the filter F2 lt x gt Set the P value threshold for the Viterbi filter step The default is 0 001 F3 lt x gt Set the P value threshold for the Forward filter step The default is 1e 5 nobias Turn off the bias filter This increases sensitivity somewhat but can come at a high cost in speed especially if the query has biased residue composition such as a repetitive sequence region or if it is a membrane protein with large regions of hydrophobicity Without the bias filter too many sequences may pass the filter with biased queries leading to slower than expected performance as the compu tationally intensive Forward Backward algorithms shoulder an abnormally heavy load 46 Other options nonull2 Z lt x gt domZ lt x gt seed lt n gt qformat lt s gt cpu lt n gt stall mpi Turn off the null2 score corrections for biased composition Assert that the total number of targets in your searches is lt x gt for the purposes of per sequence E value calculations rather than the actual number of targets seen Assert that the total number of targets in your searches is lt x gt for the purposes of per domain conditional E value calculations rather than the number of targets that passed the reporting thresholds Set the random number seed to lt n gt Some steps in postprocessing require Monte Carlo simulation The default is to use a fixed seed 42 so that results are exactly reprodu
137. the sequences with the most significant matches to the profile The lt hmmfile gt may contain more than one profile To build profiles from multiple alignments see hmm build The output format is designed to be human readable but is often so voluminous that reading it is impractical and parsing it is a pain The tblout and domtblout options save output in simple tabular formats that are concise and easier to parse The o option allows redirecting the main output including throwing it away in dev null Options h Help print a brief reminder of command line usage and all available options Options for controlling output O lt f gt Direct the main human readable output to a file lt f gt instead of the default stdout A lt f gt Save a multiple alignment of all significant hits those satisfying inclusion thresh olds to the file lt f gt tblout lt f gt Save a simple tabular space delimited file summarizing the per target output with one data line per homologous target sequence found domtblout lt f gt Save a simple tabular space delimited file summarizing the per domain output with one data line per homologous domain detected in a query sequence for each homologous model acc Use accessions instead of names in the main output where available for profiles and or sequences noali Omit the alignment section from the main output This can greatly reduce the output volume notextw Unlimit the le
138. tistical significance E value and output is generated in two sections called per target and per domain output In per target output by default all sequence hits with an E value lt 10 are reported In the per domain output for each target that has passed per target report ing thresholds all domains satisfying per domain reporting thresholds are reported By default these are domains with conditional E values of lt 10 The following options allow you to change the default E value reporting thresholds or to use bit score thresholds instead E lt x gt In the per target output report target sequences with an E value of lt lt x gt The default is 10 0 meaning that on average about 10 false positives will be reported per query so you can see the top of the noise and decide for yourself if it s really noise T lt x gt Instead of thresholding per profile output on E value instead report target se quences with a bit score of gt lt x gt domE lt x gt Inthe per domain output for target sequences that have already satisfied the per profile reporting threshold report individual domains with a conditional E value of lt lt x gt The default is 10 0 A conditional E value means the expected number of additional false positive domains in the smaller search space of those comparisons that already satisfied the per target reporting threshold and thus must have at least one homologous domain already domT lt x gt I
139. tively Internally they just produce a profile HMM from the query sequence then run HMM searches In the Tutorial section I ll show examples of running each of these programs using examples in the tutorial subdirectory of the distribution Files used in the tutorial The subdirectory tutorial in the HMMER distribution contains the files used in the tutorial as well as a number of examples of various file formats that HMMER reads The important files for the tutorial are globins4 sto An example alignment of four globin sequences in Stockholm format This align ment is a subset of a famous old published structural alignment from Don Bashford Bashford et al 1987 globins4 hmm An example profile HMM file built from globins4 sto in HMMER3 ASCII text format globins4 out An example hmmsearch output file that results from searching the globins4 hmm against Uniprot 15 7 14 fn3 sto An example alignment of 106 fibronectin type III domains This is the Pfam 22 0 n3 seed alignment It provides an example of a Stockholm format with more complex annotation We ll also use it for an example of hmmsearch analyzing sequences containing multiple domains n3 hmm A profile HMM created from n3 sto by hmmbuild 7LESS_DROME A FASTA file containing the sequence of the Drosophila Sevenless protein a recep tor tyrosine kinase whose extracellular region is thought to contain seven fibronectin type III domains fn3 out Output of hmmse
140. tively walk in sequence space away from your original query leaving few or no consensus columns corresponding to its residues hand Define consensus columns in next profile using reference annotation to the multiple alignment jackhmmer propagates reference annotation from the previous profile to the multiple alignment and thence to the next profile This is the default symfrac lt x gt Define the residue fraction threshold necessary to define a consensus column when using the fast option The default is 0 5 The symbol fraction in each 57 column is calculated after taking relative sequence weighting into account and ig noring gap characters corresponding to ends of sequence fragments as opposed to internal insertions deletions Setting this to 1 0 means that every alignment col umn will be assigned as consensus which may be useful in some cases Setting it to 0 0 is a bad idea because no columns will be assigned as consensus and you ll get a model of zero length fragthresh lt x gt We only want to count terminal gaps as deletions if the aligned sequence is known to be full length not if it is a fragment for instance because only part of it was sequenced HMMER uses a simple rule to infer fragments if the sequence length Lis less than a fraction lt x gt times the mean sequence length of all the sequences in the alignment then the sequence is handled as a fragment The default is 0 5 Options controlling relative we
141. to this sequence the number of hits we d expect to score this highly in a database of this size if the database contained only nonhomologous random sequences The lower the E value the more significant the hit The E value is based on the sequence bit score which is the second number This is the log odds score for the complete sequence Some people like to see a bit score instead of an E value because the bit score doesn t depend on the size of the sequence database only on the profile HMM and the target sequence The next number the bias is a correction term for biased sequence composition that s been applied to the sequence bit score The only time you really need to pay attention to this value is when it s large and on the same order of magnitude as the sequence bit score This might be a sign that the target sequence isn t really a homolog but merely shares a similar strong biased composition with the query model The 2The method that HMMER3 uses to compensate for biased composition is unpublished and different from HMMER2 We will write it up when there s a chance 17 biased composition correction usually works well but occasionally will not knock down a falsely significant nonhomologous hit as far as it should The next three numbers are again an E value score and bias but only for the single best scoring do main in the sequence rather than the sum of all its identified domains The rationale for this isn t a
142. ts included New alignment includes Continuing to next round Round Included in MSA Model size 167 1064 subseqs 1064 subsequences 146 positions P87497 P14399 P80017 P02022 Q9DEPO P14397 POC227 P09106 P14398 P23216 was 895 including original MYG_CHIRA MYG_MUSAN GLBD_CAUAR HBAM_RANCA MYG_CRYAN MYG_GALGA GLB_NERAL HBAT_PAPAN MYG_GALJA Myoglobin OS Chionodraco rastrospinosu Myoglobin OS Mustelus antarcticus GN m Globin D coelomic OS Caudina arenicol Hemoglobin heart muscle subunit alpha Myoglobin OS Cryodraco antarcticus GN Myoglobin OS Galeorhinus galeus GN mb Globin OS Nerita albicilla PE 1 SV 1 Hemoglobin subunit theta 1 OS Papio an Myoglobin OS Galeorhinus japonicus GN GLBP1_GLYDI Globin major polymeric component P1 O happen it doesn t happen in this query query 1063 subseqs from 1061 targets Because new sequences were included it keeps going to round three and then again to round four then again to round five After round five where this example has found 1113 included hits in the database the search ends quietly because there s a default maximum of five iterations and you get New targets included New alignment includes 1114 subseqs That marks the end of the results for one query 28 was 1113 including original query 4 Manual pages hmmalign align sequences to a profile HMM Synopsis hmmalign options lt hm
143. ude FASTA EMBL Genbank DDBJ Uniprot Stockholm and SELEX Default is to autodetect the format of the file Set the number of parallel worker threads to lt n gt By default HMMER sets this to the number of CPU cores it detects in your machine that is it tries to maximize the use of your available processor cores Setting lt n gt higher than the number of avail able cores is of little if any value but you may want to set it to something less You can also control this number by setting an environment variable HMMER_NCPU This option is only available if HAMER was compiled with POSIX threads support This is the default but it may have been turned off at compile time for your site or machine for some reason stall For debugging the MPI master worker version pause after start to enable the developer to attach debuggers to the running mas ter and worker s processes Send SIGCONT signal to release the pause Under gdb gdb signal SIGCONT Only available if optional MPI support was enabled at compile time Run in MPI master worker mode using mpirun Only available if optional MPI support was enabled at compile time 60 phmmer search protein sequence s against a protein sequence database Synopsis phmmer options lt seqfile gt lt seqdb gt Description phmmer is used to search one or more query protein sequences against a protein sequence database For each query sequence in lt seqfile gt use that sequence to
144. ult is 100 EfN lt n gt Sets the number of sequences in simulation that estimates the location parameter tau for Forward E values Default is 200 Eft lt x gt Sets the tail mass fraction to fit in the simulation that estimates the location param eter tau for Forward evalues Default is 0 04 51 Debugging options stall Seed lt n gt Experimental options For debugging the MPI master worker version pause after start to enable the developer to attach debuggers to the running master and worker s processes Send SIGCONT signal to release the pause Under gdb gdb signal SIGCONT Only available if optional MPI support was enabled at compile time Set the random number seed to lt n gt The default is 0 which makes the random number generator use an arbitrary seed so that different runs of hmmsim will almost certainly generate a different statistical sample For debugging it is useful to force reproducible results by fixing a random number seed These options were used in a small variety of different exploratory experiments bgflat bgcomp x no lengthmodel NU lt x gt pthresh lt x gt Set the background residue distribution to a uniform distribution both for purposes of the null model used in calculating scores and for generating the random se quences The default is to use a standard amino acid background frequency dis tribution Set the background residue distribution to the mean
145. urbin moved on to human genomics Anders Krogh moved on to pioneer a number of other probabilistic approaches for other biological sequence analysis problems Graeme Mitchison moved on to quantum computing moved on to noncoding RNAs Yet BLAST continued to be the most widely used search program HMMs seemed to be widely con sidered to be a mysterious and orthogonal black box rather than a natural theoretical basis for important programs like BLAST The NCBI in particular seemed to be slow to adopt or even understand HMM meth ods This nagged at me the revolution was unfinished When we moved the lab to Janelia Farm in 2006 had to make a decision about what we should spend our time on It had to be something Janelian something that would work on with my own hands something that would be difficult to accomplish under the usual reward structures of academic science and something that would make the largest possible impact on science decided that we should aim to replace BLAST with an entirely new generation of software The result is the HMMER3 project Thanks HMMER is increasingly not just my own work but the work of great people in my lab including Steve Johnson Alex Coventry Dawn Brooks Sergi Castellano Michael Farrar Travis Wheeler and Elena Rivas The current HMMER development team at Janelia Farm includes Sergi Michael and Travis as well as myself would call the Janelia computing environment world class except
146. ure is preserved distclean Removes all files generated by configuration by configure and by compilation by make Note that if you want to make a new configuration for example to try an MPI version by configure enable mpi make you should do a make distclean rather than a make clean to be sure old configuration files aren t used accidentally Workarounds for some unusual configure compilation problems Configuration or compilation fails when trying to use a separate build directory If you try to build in a build tree other than the source tree and you have any trouble in configuration or compilation try just building in the source tree instead Some make versions don t support the VPATH mechanism needed to use separate build trees Another workaround is to install GNU make Configuration fails complaining that the CFLAGS don t work Our configure script uses an Autoconf macro AX_cc_MAxoPT that tries to guess good optimization flags for your compiler In very rare cases we ve seen it guess wrong You can always set crLacs yourself with something like gt configure CFLAGS O Configuration fails complaining no acceptable grep could be found We ve seen this happen on our Sun Sparc Solaris machine It s a known issue in GNU autoconf You can either install GNU grep or you can insist to configure that the Solaris grep or whatever grep you have is ok by explicitly setting GREP gt configure GREP us
147. using current procedures in H3 Eddy 2008 and 4 using both parameters determined by H3 s current procedures The standard simple quick and dirty statistic for goodness of fit is E 10 the calculated E value of the 10th ranked top hit which we expect to be about 10 In detail the columns of the output are name Name of the model tailp Fraction of the highest scores used to fit the distribution For Viterbi MSV and Hybrid scores this defaults to 1 0 a Gumbel distribution is fitted to all the data For Forward scores this defaults to 0 02 an exponential tail is fitted to the highest 2 scores mu tau Location parameter for the maximum likelihood fit to the data lambda Slope parameter for the maximum likelihood fit to the data E 10 The E value calculated for the 10th ranked high score E 10 using the ML mu tau and lambda By definition this expected to be about 10 if E value esti mation were accurate mufix Location parameter for a maximum likelihood fit with a known fixed slope param eter lambda of log_2 0 693 E 10fix The E value calculated for the 10th ranked score using mufix and the expected lambda log_2 0 693 mufix2 Location parameter for a maximum likelihood fit with an edge effect corrected lambda 48 E 10fix2 The E value calculated for the 10th ranked score using mufix2 and the edge effect corrected lambda pmu Location parameter as determined by H3 s estimation procedures plam
148. ve sequence number 00000 ee ee ee eee 58 Options controlling e value calibration 2 0 e 59 Other options ore i a e BOE a ave oe ean bate a A e i 59 phmmer search protein sequence s against a protein sequence database 61 SINOPSIS avan irer e in edict te age ah oi eae ated ee eee ae Se ae Bd tetera Se hate ee 61 Description ee 61 Options Oe eek PE ae E A da A ct de A eal is 61 Options for controlling output 2 ee 61 Options controlling scoring system o e 62 Options controlling reporting thresholds 0 ee ee es 62 Options controlling inclusion thresholds o cee eee ee ee e 62 Options controlling the acceleration pipeline 000000 ee eee 63 Options controlling e value calibration 2 o 64 Other OPTIONS ett a a dt A e toda laada ae eu tada 64 5 File formats 66 HMMER profile HMM files o o ee 66 Reader Section 10 oa bbe a AO eee ee aed DE 66 Main Model SECTION a arau die aa a Ad dra tad 68 Stockholm the recommended multiple sequence alignment format 69 syntax of Stockholm markup 2 2 2 o 70 semantics of Stockholm markup 2 2 00 00 e 70 recognized GF annotations 2 a 71 recognized GS annotations 2 ee 71 recognized GC annotations 2 ee 71 recognized GR annotations ooa a 72 A2M multiple alignment f
149. velope and the inferred alignment will tend to be in tighter agreement corresponding to sharper posterior probability defining the location of the homologous region Operationally would use the envelope coordinates to annotate domain locations on target sequences not the alignment coordinates However be aware that when two weaker scoring domains are close to each other envelope coordinates can and will overlap corresponding to the overlapping uncertainty of where one domain ends and another begins In contrast alignment coordinates generally do not overlap though there are cases where even they will overlap The last column is the average posterior probability of the aligned target sequence residues effectively the expected accuracy per residue of the alignment For comparison current Uniprot consensus annotation of Sevenless shows seven domains FT DOMAIN 311 431 Fibronectin type III 1 FT DOMAIN 436 528 Fibronectin type III 2 FT DOMAIN 822 921 Fibronectin type III 3 FT DOMAIN 1298 1392 Fibronectin type III 4 FT DOMAIN 1680 1794 Fibronectin type III 5 FT DOMAIN 1797 1897 Fibronectin type III 6 FT DOMAIN 1898 1988 Fibronectin type III 7 These domains are a pretty tough case to call actually HMMER fails to see anything significant over lapping two of these domains 311 431 and 1680 1794 in the Uniprot search though it sees a smidgen of them when 7LESS_DRoME alone is the target HMMER3 sees two new domains 1205 1235 and 199

HMMER User's Guide

Contents

Download Pdf Manuals

Related Search

Related Contents