Home

HMMER3 beta test: User's Guide

1. hmmsearch search profile s against a sequence database HMMER 3 0b3 November 2009 http hmmer org Copyright C 2009 Howard Hughes Medical Institute Freely distributed under the GNU General Public License GPLv3 Srey ER es ee Re SE Se ec Re Ps E Seay aay Has WSR Wt fe os SPE Ree et ST SS ots as oh ee OE See Cee 8S query HMM file globins4 hmm target sequence database uniprot_sprot fasta za ee ee a eee Oe ee ee Oe ee Query globins4 M 149 Scores for complete sequences score includes all domains The second section is the sequence top hits list It is a list of ranked top hits sorted by E value most significant hit first formatted in a BLAST like style full sequence best 1 domain dom E value score bias E value score bias exp N Sequence Description 6e 65 222 7 6 7e 65 222 6 E 1 sp P02185 MYG_PHYCA Myoglobin OS Physeter catodon GN MB PE 3ele 63 217 2 3 4e 63 217 0 0 0 1 sp P02024 HBB_GORGO Hemoglobin subunit beta OS Gorilla gor 4 5e 63 216 6 0 0 5e 63 216 5 0 0 1 0 1 sp P68871 HBB_HUMAN Hemoglobin subunit beta OS Homo sapien 4 5e 63 216 6 0 0 5e 63 216 5 0 0 1 0 1 sp P68872 HBB_PANPA Hemoglobin subunit beta OS Pan paniscu 4 5e 63 216 6 0 0 5e 63 216 5 0 0 1 0 1 sp P68873 HBB_PANTR Hemoglobin subunit beta OS Pan troglod 6 4e 63 216 1 3 0 7 1e 63 216 0 2 0 1 0 1 sp P02177 MYG_ESCGI Myoglobin OS Eschrichtius gibbosus GN The last two columns obviously are the name of each target
2. Continuing to next round ee Round 2 Included in MSA 895 subsequences query 894 subseqs from 894 targets Model size 146 positions ee This obviously is telling you that the new alignment contains 895 sequences your query plus 894 significant matches For round two it s built a new model from this alignment Now for round two it fires off what s essentially an hmmsearch of the target database with this new model Scores for complete sequences score includes all domains full sequence best 1 domain dom E value score bias E value score bias exp N Sequence Description 1 5e 67 231 0 0 2 1 7e 67 230 8 0 1 1 0 1 sp P02055 HBB_MELME Hemoglobin subunit beta OS Meles meles 2 3e 67 230 4 0 4 2 6e 67 230 2 0 33 1 0 1 sp P81042 HBE_MACEU Hemoglobin subunit epsilon OS Macropus 2 7e 67 230 2 0 3 2 9e 67 230 0 0 2 1 0 1 sp P15449 HBB_MELCA Hemoglobin subunit beta OS Mellivora c 3 3e 67 229 9 0 2 3 7e 67 229 7 0 2 1 0 1 sp P68046 HBB_ODORO Hemoglobin subunit beta OS Odobenus ro If you skim down through this output you ll start seeing newly included sequences marked with s such as 9 8e 30 108 3 0 0 1 1le 29 108 2 0 0 si sp P87497 MYG_CHIRA Myoglobin OS Chionodraco rastrospinosu 1 4e 29 107 8 0 3 1 5e 29 107 7 0 2 0 sp P14399 MYG_MUSAN Myoglobin OS Mustelus antarcticus GN m 1 5e 29 107 7 0 3 2e 29 107 3 0 2 0 sp P80017 GLBD_CAUAR Globin D coelomic OS Caudina arenicol 3e 29 106 8 0 0 3 3e 29 106 6 0
3. seql MNPOTVWY seq2 MNPOTVWY seq3 MNPOT The first line in the file must be STOCKHOLM 1 x where x is a minor version number for the format specification and which currently has no effect on my parsers This line allows a parser to instantly identify the file format In the alignment each line contains a name followed by the aligned sequence A dash period un derscore or tilde but not whitespace denotes a gap If the alignment is too long to fit on one line the alignment may be split into multiple blocks with blocks separated by blank lines The number of sequences their order and their names must be the same in every block Within a given block each sub sequence 28 and any associated GR and Gc markup see below is of equal length called the block length Block lengths may differ from block to block The block length must be at least one residue and there is no maximum Other blank lines are ignored You can add comments anywhere to the file even within a block on lines starting with a All other annotation is added using a tag value comment style The tag value format is inherently exten sible and readily made backwards compatible unrecognized tags will simply be ignored Extra annotation includes consensus and individual RNA or protein secondary structure sequence weights a reference co ordinate system for the columns and database source information including name accession number and coordinates
4. of each distribution respectively u and A for Gumbel distributions for MSV and Viterbi scores and 7 and for exponential tails for Forward scores A values must be positive All three lines or none of them must be present when all three are present the model is considered to be calibrated for E value statistics Optional Flags the start of the main model section Solely for human readability of the tabular model data the symbol alphabet is shown on the uw line aligned to the fields of the match and insert symbol emission distributions in the main model below The next line is also for human readability providing column headers for the state transition probability fields in the main model section that follows Though unparsed after the nmm tag the presence of two header lines is mandatory the parser always skips the line after the Hmm tag line The first line in the main model section may be an optional line starting with compo these are the model s overall average match state emission probabilities which are used as a background residue composition in the filter null model The K fields on this line are log probabilities for each residue in the appropriate biosequence alphabet s order Optional All the remaining fields are mandatory The first two lines in the main model section are atypical They contain information for the core model s BEGIN node This is stored as model node 0 and match state 0 is treated as the BEGIN
5. on the order of 10000 In hmmsearch the size of the search space is the number of sequences in the sequence database This means that E values may differ even for the same individual profile vs sequence comparison depending on how you do the search For domain there then follows a domain table and alignment output just as in hmmsearch The fn3 annotation for example looks like Domain and alignment annotation for each model gt gt fn3 Fibronectin type III domain score bias c Evalue i Evalue hmmfrom hmm to alifrom ali to envfrom env to acc lyr Sli3 0 0 0 33 0 25 61 TAN 396 409 395 41 4 4 0 85 2 Al 40 7 0 0 2 6e 14 3 8e 14 2 84 439 520 as 437 O21 es 0295 3 0 14 4 0 0 4 1e 06 6 1e 06 13 BS oss 836 s a Re cass 826 914 aa 0 T3 4 Sel 0 0 0 0032 0 0048 10 BOF ios 1209 T235 ses 1203 1259 24 02 82 a t 24 3 0 0 3 4e 09 5e 09 14 80 1313 T3808 2 3 1304 1386 0 82 6 0 0 0 0 0 13 0 19 58 Tera 1754 1768 1739 1769 0 89 as al 47 2 0 7 2 3e 16 3 5e 16 q 85 3 1799 1890 ss 1799 1891 4 0 91 8 17 8 0 0 3 7e 07 5 5e 07 6 TEA 1904 1966 1901 19 76 65 090 g t 12 8 0 0 1 3e 05 2e 05 1 86 1993 ZOD wa 1993 2107 0 89 and an example alignment of that second domain again domain 2 score 40 7 bits conditional E value 2 6e 14 CEEEEEEECTTEEEEEEE S SS SEEEEEEEETTTCCGCEEEEEETTTSEEEEES TT BEEEEEEEEETTEE E CS Eng 2 saPenlsvsevtstsltlsWsppkdgggpitgYeveyqekgegeewgevtvprtttsvtltgLepgteYefrVqavngagegp 84
6. s a tutorial walk through of some small projects with HMMER3 This should suffice to get you oriented to a safe path through HMMER3 alpha test code that should work as advertised before you start boldly exploring later to be documented command line options that might or might not be doing anything sensible yet Files used in the tutorial The subdirectory tutorial in the HMMER distribution contains the files used in the tutorial as well as a number of examples of various file formats that HMMER reads The important files for the tutorial are globins4 sto An example alignment of four globin sequences in Stockholm format This align ment is a subset of a famous old published structural alignment from Don Bashford Bashford et al 1987 globins4 hmm An example profile HMM file built from globins4 sto in HMMER3 ASCII text format globins4 out An example hmmsearch output file that results from searching the globins4 hmm against Uniprot 15 7 fn3 sto An example alignment of 106 fibronectin type III domains This is the Pfam 22 0 fn3 seed alignment It provides an example of a Stockholm format with more complex annotation We ll also use it for an example of hmmsearch analyzing sequences containing multiple domains fn3 hmm A profile HMM created from fn3 sto by hmmbuild 7LESS_DROME A FASTA file containing the sequence of the Drosophila Sevenless protein a recep tor tyrosine kinase whose extracellular region is thought to con
7. 0 sp P02022 HBAM_RANCA Hemoglobin heart muscle subunit alpha 4e 29 106 3 0 0 4 4e 29 106 2 0 0 0 Sp Q9DEPO MYG_CRYAN Myoglobin OS Cryodraco antarcticus GN 9 3e 29 105 2 0 2 le 28 105 0 O 1 0 sp P14397 MYG_GALGA Myoglobin OS Galeorhinus galeus GN mb 1 3e 28 104 7 0 0 1 6e 28 104 4 0 0 0 sp P0C227 GLB_NERAL Globin OS Nerita albicilla PE 1 SV 1 2e 28 104 1 0 0 2 4e 28 103 8 0 0 0 sp P09106 HBAT_PAPAN Hemoglobin subunit theta 1 OS Papio an 2 8e 28 103 6 Oel 3 1e 28 103 5 Osea 0 sp P14398 MYG_GALJA Myoglobin OS Galeorhinus japonicus GN 7 9e 26 95 7 0 0 8 8e 26 95 5 0 0 s sp P23216 GLBP1_GLYDI Globin major polymeric component P1 O It s unusual to see sequences get lost and marked with but it can happen it doesn t happen in this globin example After round 2 many more globin sequences have been found New targets included 167 New alignment includes 1064 subseqs was 895 including original query Continuing to next round ee Round 3 Included in MSA 1064 subsequences query 1063 subseqs from 1061 targets Model size 146 positions ee Because new sequences were included it keeps going to round three and then again to round four then again to round five After round five where this example has found 1113 included hits in the database the search ends quietly because there s a default maximum of five iterations and you get New targets included 1 New alignment includes 11
8. 321 8 0 0 2 2e 95 321 7 0 0 1 0 1 sp P02032 HBB_ SEMEN Hemoglobin subunit beta That continues until the inclusion threshold is reached at which point you see a tagline inclusion thresh old indicating where the threshold was set 0 00076 24 1 0 00087 23 9 0 0 0 00083 24 0 0 0 1 0 0 0 0 0009 23 9 0 0 1 0 Sif it is in the database it will almost certainly be included in the internal multiple alignment twice once because it s the query and once because it s a significant database match to itself This redundancy won t screw up the alignment because sequences are downweighted for redundancy anyway 23 OS Homo sapien OS Pan paniscu OS Pan troglod OS Gorilla gor OS Hylobates 1 OS Semnopithec 1 sp P02180 MYG_BALPH Myoglobin OS Balaenoptera physalus GN 1 sp P02148 MYG_PONPY Myoglobin OS Pongo pygmaeus GN MB PE 1 SSeS inclusion threshold 0 0013 23 43 0 3 0 021 19 4 O22 2 1 1 sp P81044 HBAZ_MACEU Hemoglobin subunit zeta Fragments OS 0 0021 2257 0 0 0 0022 22 6 0 0 1 0 1 sp P02182 MYG_ZIPCA Myoglobin OS Ziphius cavirostris GN MB The domain output and search statistics are then shown just as in phmmer At the end of this first iteration you ll see some output that starts with ee this is a simple tag that lets you search through the file to find the end of one iteration and the beginning of another New targets included 894 New alignment includes 895 subseqs was 1 including original query
9. V pe openmpi 128 mpirun mca btl self openib sm np 128 hmmbuild mpi Pfam hmm Pfam A seed gt hmmbuild out This reduces the time to build all of Pfam to about 40 seconds Step 2 compress and index the flatfile with hmmpress The hmmscan program has to read a lot of profile HMMs in a hurry and HMMER s ASCII flatfiles are bulky To accelerate this hmmscan uses binary compression and indexing of the flatfiles To use hmmscan you must first compress and index your HMM database with the hmmpress program gt hmmpress minifam This will quickly produce Working done Pressed and indexed 3 HMMs 3 names and 2 accessions Models pressed into binary file minifam h3m SSI index for binary model file minifam h3i Profiles MSV part pressed into minifam h3f Profiles remainder pressed into minifam h3p and you ll see these four new binary files in the directory The tutorial minifam example has already been pressed so there are example binary files tutorial minifam h3 m i f p included in the tutorial Their format is proprietary which is an open source term of art that means both I haven t found time to document them yet and I still might decide to change them arbitrarily without telling you Step 3 search the HMM database with hmmscan Now we can analyze sequences using our HMM database and hmmscan For example the receptor tyrosine kinase 7LESS_DRoME not only has all those fibronectin type I
10. as markup 0 means lt 10 accessible residue surface area 1 means lt 20 9 means lt 100 etc X means unknown structure 30 recognized GR annotations ss Secondary structure consensus See Gc SS_cons above sa Surface accessibility consensus See Gc SA_cons above PP Posterior probability for an aligned residue This represents the probability that this residue is assigned to the HMM state corresponding to this alignment column as opposed to some other state Note that a residue can be confidently unaligned a residue in an insert state or flanking N or C state may have high posterior probability The posterior probability is encoded as 11 possible characters 0 9 0 0 lt p lt 0 05 is coded as 0 0 05 lt p lt 0 15 is coded as 1 and so on 0 85 lt p lt 0 95 is coded as 9 and 0 95 lt p lt 1 0 is coded as Gap characters appear in the PP line where no residue has been assigned 31 References Altschul S F Madden T L Schaffer A A Zhang J Zhang Z Miller W and Lipman D J 1997 Gapped BLAST and PSI BLAST A new generation of protein database search programs Nucl Acids Res 25 3389 3402 Bashford D Chothia C and Lesk A M 1987 Determinants of a protein fold Unique features of the globin amino acid sequences J Mol Biol 196 199 216 Durbin R Eddy S R Krogh A and Mitchison G J 1998 Biological Sequence Analysis Probabilistic Models of Protein
11. going on is that the position and alignment of that domain is uncertain in this case only very slightly so but nonetheless uncertain The full sequence score is summed over all possible alignments of the globin model to the myc_PHyca sequence When HMMER3 identifies domains it identifies what it calls an envelope bounding where the domain s alignment most probably lies More on this later when we discuss the reported coordinates of domains and alignments in the next section of the output The single best dom score is calculated after the domain envelope has been defined and the summation is restricted only to the ensemble of possible alignments that lie within the envelope The fact that the two scores are slightly different is therefore telling you that there s a small amount of probability uncertainty that the domain lies somewhat outside the envelope bounds that HMMER has selected The two columns headed doms are two different estimates of the number of distinct domains that the target sequence contains The first the column marked exp is the expected number of domains according to HMMER s statistical model It s an average calculated as a weighted marginal sum over all possible alignments Because it s an average it isn t necessarily a round integer The second the column marked N is the number of domains that HAMER93 s domain postprocessing and annotation pipeline finally decided to identify annotate and align in
12. it produces inconsistent wrong or confusing results or when it doesn t compile or run at all Some bugs already know about but I d still like to hear about them just to Know you care Cryptogenomicon http cryptogenomicon org is a blog where I ll be talking about issues as they arise in HMMER38 and where you can comment or follow the discussion 2Actually this option already exists max 2 Installation Quick installation instructions Download hmmer 3 0b3 tar gz from http hmmer org or directly from ftp selab janelia org pub software hmmer3 hmmer 3 0b3 tar gz untar and change into the newly created direc tory hmmer 3 0b3 gt wget ftp selab janelia org pub software hmmer3 hmmer 3 0b3 tar gz gt tar xf hmmer 3 0b3 tar gz gt cd hmmer 3 0b3 The beta test code includes precompiled binaries for x86 Linux platforms These are in the binaries directory You can just stop here if you like if you re on a x86 Linux machine and you want to try the programs out without installing them To compile new binaries from source do a standard GNUish build gt configure gt make To compile and run a test suite to make sure all is well you can optionally do gt make check All these tests should pass You don t have to install HMMER programs to run them The newly compiled binaries are now in the src directory you can run them from there To install the programs and man pages somewhere on your system
13. sequence and optional description The most important number here is the first one the sequence E value This is the statistical significance of the match to this sequence the number of hits we d expect to score this highly in a database of this size if the database contained only nonhomologous random sequences The lower the E value the more significant the hit The E value is based on the sequence bit score which is the second number This is the log odds score for the complete sequence Some people like to see a bit score instead of an E value because the bit score doesn t depend on the size of the sequence database only on the profile HMM and the target sequence The next number the bias is a correction term for biased sequence composition that s been applied to the sequence bit score The only time you really need to pay attention to this value is when it s large and on the same order of magnitude as the sequence bit score This might be a sign that the target sequence isn t really a homolog but merely shares a similar strong biased composition with the query model The biased composition correction usually works well but occasionally will not knock down a falsely significant nonhomologous hit as far as it should The next three numbers are again an E value score and bias but only for the single best scoring do main in the sequence rather than the sum of all its identified domains The rationale for this isn t appare
14. space character A more detailed description of the format follows header section The header section is parsed line by line in a tag value format Each line type is either mandatory or optional as indicated HMMER3 b Unique identifier for the save file format version the b means that this is HMMER3 HMM file format version b When HMMER3 changes its save file format the revi sion code advances This way parsers may easily remain backwards compatible The remainder of the line after the HMmER3 b tag is free text that is ignored by the parser HMMER currently writes its version number and release date in brackets here e g 3 0b2 June 2009 in this example Mandatory 25 NAME lt s gt ACC lt s gt DESC lt s gt LENG lt d gt ALPH lt s gt CS lt s gt DATE lt s gt COM lt n gt lt s gt Model name lt s gt is a single word containing no spaces or tabs The name is normally picked up from the GF ID line from a Stockholm alignment file If this is not present the name is created from the name of the alignment file by removing any file type suffix For example an otherwise nameless HMM built from the alignment file rrm s1x would be named rrm Mandatory Accession number lt s gt is a one word accession number This is picked up from the GF Ac line in a Stockholm format alignment Optional Description line lt s gt is a one line free text description This is picked up from the GF DE li
15. the target sequence This is the number of alignments that will show up in the domain report later in the output file These two numbers should be about the same Rarely you might see that they re wildly different and this would usually be a sign that the target sequence is so highly repetitive that it s confused the H3 domain postprocessors Such sequences aren t likely to show up as significant homologs to any sensible query in the first place The sequence top hits output continues until it runs out of sequences to report By default the report includes all sequences with an E value of 10 0 or less Then comes the third output section which starts with Domain annotation for each sequence and alignments Now for each sequence in the top hits list there will be a section containing a table of where HUMER3 thinks all the domains are followed by the alignment inferred for each domain Let s use the n3 vs 7LESS_DROME example because it contains lots of domains and is more interesting in this respect than the globin4 output The domain table for 7LESS_DRomE looks like gt gt 7TLESS_DROME RecName Full Protein sevenless EC 2 7 10 1 score bias c Evalue i Evalue hmmfrom hmm to alifrom ali to envfrom env to acc i SHS 0 0 0 17 One 61 74 396 409 395 ALN 3085 201 40 7 0 0 1 3e 14 1 3e 14 2 84 439 520 437 S27 4704 95 3 4 14 4 0 0 2e 06 2e 06 13 85 836 913 826 SIn O88 4 5 1 0 0 0 0016 0 0016 10 36 1209 1235743 120
16. these characters are ignored Consensus structure an notation is picked up from a Stockholm file s GC SS_cons line and propagated to alignment displays Optional assumed to be no if not present Map annotation flag lt s gt is either no or yes case insensitive If set to yes the map annotation field in the main model see below is valid if no that field will be ignored The HMM alignment map annotates each match state with the index of the alignment column from which it came It can be used for quickly mapping any subsequent HMM alignment back to the original multiple alignment via the model Optional assumed to be no if not present Date the model was constructed lt s gt is a free text date string This field is only used for logging purposes Optional Command line log lt n gt counts command line numbers and lt s gt is a one line com mand There may be more than one com line per save file each numbered starting from n 1 These lines record every HMMER command that modified the save file This helps us reproducibly and automatically log how Pfam models have been constructed for example Optional 1HMMER does not use dates for any purpose other than human readable annotation so it is no more prone than you are to Y2K Y2038 or any other date related eschatology 26 NSEQ EFFN CKSUM GA lt f gt TC lt f gt NC lt f gt STATS lt s1 gt lt s2 gt COMPO lt f gt K lt
17. to On the same machine also running dual core NCBI BLAST with one of these globin sequences took 3 8 seconds and WU BLAST took 7 4 seconds Searching a profile HMM database with a query sequence The hmmscan program is for annotating all the different known detectable domains in a given sequence It takes a single query sequence and an HMM database as input The HMM database might be Pfam SMART or TIGRFams for example or another collection of your choice Step 1 create an HMM database flatfile An HMM database flatfile is simply a concatenation of individual HMM files To create a database flat file you can either build individual HMM files and concatenate them or you can concatenate Stockholm alignments and use hmmbuild to build an HMM database of all of them in one command Let s create a tiny database called minifam containing models of globin fn3 and Pkinase protein kinase domains by concatenating model files gt hmmbuild globins4 hmm tutorial globins4 sto gt hmmbuild fn3 hmm tutorial fn3 sto gt hmmbuild Pkinase hmm tutorial Pkinase sto gt cat globins4 hmm fn3 hmm Pkinase hmm gt minifam We ll use minifam for our examples in just a bit but first a few words on other ways to build HMM databases especially big ones The file tutorials minifam is the same thing if you want to just use that Alternatively you can concatenate Stockholm alignment files together as Pfam does in its big P fam A seed and Pfam A full
18. 0b3 November 2009 NAME globins4 LENG 149 ALPH amino RF no CS no MAP yes DATE Sat Nov 14 10 47 39 2009 NSEQ 4 EFFN 0 964844 12 CKSUM 2027839109 STATS LOCAL MSV 9 9014 0 70957 STATS LOCAL VITERBI 10 7224 0 70957 STATS LOCAL FORWARD 4 1637 0 70957 HMM A Cc D E E G H I OER wW m gt m m gt i m gt d i gt m issi d gt m d gt d COMPO 2 36614 4 52553 2 96662 2 70479 3 20868 3 02051 3 41129 2 90113 4 55399 3 62918 2 68640 4 42247 2 77497 2 73145 3 46376 2 40504 3 72516 3 29302 4 58499 3 61525 0 57544 1 78073 1 31293 1 75577 0 18968 0 00000 1 1 70038 4 17733 3 76164 3 36686 3 72281 3 29583 4 27570 2 40482 5 32720 4 10031 e So 2 68618 4 42225 2 77519 2 73123 3 46354 2 40513 3 72494 3 29354 4 58477 3 61503 0 03156 3 86736 4 58970 0 61958 0 77255 0 34406 1 23405 149 2 92198 5 11574 3 28049 2 65489 4 47826 3 59727 2 51142 3 88373 5 42147 4 18835 165 2 68634 4 42241 2 77536 2 73098 3 46370 2 40469 3 72511 3 29370 4 58493 3 61418 0 22163 1 61553 x 1 50361 0 25145 0 00000 The HMMER3 ASCII save file format is defined in Section 4 If you re used to HMMER2 you may now be expecting to calibrate the model with H2 s hmmcalibrate program HMMER3 models no longer need a separate calibration step We ve figured out how to calculate the necessary parameters rapidly bypassing the need for costly simulation Eddy 2008 The determination of the statistical parameters is part of hmmbuild These are the par
19. 10 4 51877 3 51898 3 43366 120 B 2 68618 4 42225 2 77519 2 73123 3 46354 2 40513 3 72494 3 29354 3 61503 0 00338 6 08833 6 81068 0 61958 0 77255 0 48576 0 95510 86 3 03720 5 94099 3 75455 2 96917 5 26587 2 91682 3 66571 4 11840 4 99111 121 E 2 68618 4 42225 2 77519 2 73123 3 46354 2 40513 3 72494 3 29354 3 61503 0 00227 6 08723 0 61958 0 77255 0 00000 An HMM file consists of one or more HMMs Each HMM starts with the identifier HMMER3 b and ends with on a line by itself The identifier allows backward compatibility as the HMMER software evolves it tells the parser this file is from HMMER 9 s save file format version b The 3 a format was used in alpha test versions of HMMER3 The closing allows multiple HMMs to be concatenated The format is divided into two regions The first region contains textual information and miscalleneous parameters in a roughly tag value scheme This section ends with a line beginning with the keyword Hmm The second region is a tabular whitespace limited format for the main model parameters All probability parameters are all stored as negative natural log probabilities with five digits of precision to the right of the decimal point rounded For example a probability of 0 25 is stored as log 0 25 1 38629 The special case of a zero probability is stored as Spacing is arranged for human readability but the parser only cares that fields are separated by at least one
20. 14 subseqs was 1113 including original query That marks the end of the results for one query 24 4 File formats HMMER profile HMM files The file tutorial f n3 hmm gives an example of a HMMER3 ASCII save file An abridged version is shown here where mark deletions made for clarity and space HMMER3 b 3 0b2 June 2009 NAME fn3 ACC PFO0041 12 DESC Fibronectin type III domain LENG 86 ALPH amino RF no cs yes MAP yes DATE Mon Jun 15 09 39 41 2009 NSEQ 106 EFFN 11 415833 CKSUM 72638555 GA 8 00 7 20 TC 8 00 7 20 NC 7 90 7 90 STATS LOCAL MSV 9 4043 0 71847 STATS LOCAL VITERBI 9 7737 0 41847 STATS LOCAL FORWARD 3 8341 0 71847 HMM A C D E F G H Y m gt m m gt i m gt d i gt m E e Y d gt m d gt d COMPO 2 70271 4 89246 3 02314 2 64362 3 59817 2 82566 3 74147 3 08574 3 22607 2 68618 4 42225 2 77519 2 73123 3 46354 2 40513 3 72494 3 29354 3 61503 0 00338 6 08833 6 81068 0 61958 0 77255 0 00000 1 3 16986 5 21447 4 52134 3 29953 4 34285 4 18764 4 30886 3 35801 3 93889 Prti 2 68629 4 42236 2 77530 2 73088 3 46365 2 40512 3 72505 3 29365 3 61514 0 09796 2 38361 6 81068 0 10064 2 34607 0 48576 0 95510 2 2 70230 5 97353 2 24744 2 62947 5 31433 2 60356 4 43584 4 79731 4 25623 e Ea ieai 2 68618 4 42225 2 77519 2 73123 3 46354 2 40513 3 72494 3 29354 3 61503 0 00338 6 08833 6 81068 0 61958 0 77255 0 48576 0 95510 ee 85 2 48488 5 72055 3 87501 1 97538 3 04853 3 480
21. 3 1259 gt 2 0 82 5 24 3 0 0 1 7e 09 1 7e 09 14 BOs sie 1313 1380 1304 1386 0 82 6 0 0 0 0 0 063 0 063 58 i aera 1754 1768 17 39 1769 39 089 cP ul 47 2 0 7 1 2e 16 1 2e 16 1 85 1799 18902 au 1799 T8910 25 091 8 ETB 0 0 1 8e 07 1 8e 07 6 74 0 1904 1966 1901 197 6 vo 0790 Fai 12 8 0 0 6 6e 06 6 6e 06 1 86 1993 2104 ase 1993 2107 0 89 Domains are reported in the order they appear in the sequence not in order of their significance The or symbol indicates whether this domain does or does not satisfy both per sequence and per domain inclusion thresholds Inclusion thresholds are used to determine what matches should be considered to be true as opposed to reporting thresholds that determine what matches will be reported often including the top of the noise so you can see what interesting sequences might be getting tickled by 15 your search By default inclusion thresholds usually require a per sequence E value of 0 01 or less anda per domain conditional E value of 0 01 or less except jackhmmer which requires a more stringent 0 001 for both and reporting E value thresholds are set to 10 0 The bit score and bias values are as described above for sequence scores but are the score of just one domain s envelope The first of the two E values is the conditional E value This is an odd number and it s not even clear were going to keep it Pay attention to what it means It is an attempt to measure the s
22. FLASVSTVL MYG_PHYCA TALGAILKK K GHHEAELKPLAQSHATKH KIPIKYLEFISEAI IHVLHSRHPGDFGADAQGAMNKALELFRKDI GLB5_PETMA NAVNDAVASM DDTEKMSMKLRDLSGKHAKSF QVDPOYFKVLAAVIADTVAAG DAGFEKLMSMICILL HBB_HUMAN AHKYH HBA_HUMAN TORY s 4 3 ka MYG_PHYCA AAKYKELGYQG GLB5_PETMA RSAY Most popular alignment formats are similar block based formats and can be turned into Stockholm format with a little editing or scripting Don t forget the stocKHoLM 1 0 line at the start of the alignment nor the at the end Stockholm alignments can be concatenated to create an alignment database flatfile containing many alignments The hmmbuild command builds a profile HMM from an alignment or HMMs for each of many alignments in a Stockholm file and saves the HMM s in a file For example type gt hmmbuild globins4 hmm tutorial globins4 sto and you ll see some output that looks like hmmbuild profile HMM construction from multiple sequence alignments HMMER 3 0b3 November 2009 http hmmer org Copyright C 2009 Howard Hughes Medical Institute Freely distributed under the GNU General Public License GPLv3 Set Plast cee Be Ge hace ieee Pe tae ee Ge i eat ee Seay a See Rey eset ete Ee ee eS et ee Sep ee ae ag tee input alignment file globins4 sto output HMM file globins4 hmm SPs KES Pome Prone eae wees GPE et Rae tov Peet ee ees Fk aaa eS en ered Re em Es oe Sais os et gE GS Pe tie oe EES idx name nseq
23. HMMERS beta test Users Guide Biological sequence analysis using profile hidden Markov models http hmmer org Version 3 063 November 2009 Sean R Eddy Janelia Farm Research Campus Howard Hughes Medical Institute 19700 Helix Drive Ashburn VA 20147 USA http eddylab org Copyright C 2009 Howard Hughes Medical Institute Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are retained on all copies HMMER is licensed and freely distributed under the GNU General Public License version 3 GPLv3 For a copy of the License see http www gnu org licenses HMMER is a trademark of the Howard Hughes Medical Institute Contents 1 Introduction Design goals ofp HMMERS a 2 ee What made it into the HMMER3 0 testcode 0002 eee ee ee What s stillmisSinge ass aoc ts chk ae Bae Baha PS eee ke We a ea eS What I hope to accomplish in testing 0000 ee ee Installation Quick installation instructions 2 2 ee System requirements noaa ee Multithreaded parallelization for multicores by default 0 00200 000 MPI parallelization for clusters is optional aoaaa ee Using build directories 1 era ai Pe h a ee Makefile targets 2 ee Tutorial Files used inthe tutorial aoaaa aaa ee Searching a sequence database with a single profile HMM 0 Step 1 build a profile HM
24. II domains on its extracellular face it s got a protein kinase domain on its intracellular face Our mini fam database has models of both fn3 and Pkinase as well as the unrelated globins4 model So what happens when we scan the 7LESS_DROME Sequence gt hmmscan minifam tutorial 7LESS_DROME The header and the first section of the output will look like hmmscan search sequence s against a profile database HMMER 3 0b3 November 2009 http hmmer org Copyright C 2009 Howard Hughes Medical Institute Freely distributed under the GNU General Public License GPLv3 x eh by ES eee Sa HE eee Se Sp be ey a See a es SS ee Sy Se ee query sequence file 7LESS_DROME target HMM database minifam per seq hits tabular output 7LESS tbl per dom hits tabular output 7LESS domtbl 20 Query 7LESS_DROME L 2554 Accession P13368 Description RecName Full Protein sevenless EC 2 7 10 1 Scores for complete sequence score includes all domains full sequence best 1 domain dom E value score bias E value score bias exp N Model Description 5 6e 57 178 0 0 4 3 5e 16 47 2 0 7 9 4 9 n3 Fibronectin type III domain 1 1le 43 137 2 0 0 1 7e 43 136 5 0 0 1 3 1 Pkinase Protein kinase domain The output fields are in the same order and have the same meaning as in hmmsearch s output The size of the search space for hmmscan is the number of models in the HMM database here 3 for a Pfam search
25. M with hmmbuild 0 00000 eee ee eee Step 2 search the sequence database with hmmsearch 055 Searching a profile HMM database with a query sequence 0 000002 eee Step 1 create an HMM database flatfile 0 0 0 0 0 cee eee ee eee Step 2 compress and index the flatfile with hmmpress 220004 Step 3 search the HMM database withhmmscan 000 000 ee eee Creating multiple alignments with hmmalign 2 2 20 00 Single sequence queries usingphmmer 2 0 0 a Iterative searches using jackhmmer aoaaa ee File formats HMMER profile HMM files 2 ee header sectoon is si air ara larane eters dt hd bee dees ae ae ha hale ew ee ee ees main model section aoaaa aa ee Stockholm the recommended multiple sequence alignment format syntax of Stockholm markup 2 2 2 2 ee semantics of Stockholm markup 2 2 0 0 00 eee ee recognized GF annotations ooa a recognized GS annotations 2 recognized GC annotations ooo a recognized GR annotations ooa a 1 Introduction This is a user s guide to the HMMERS test distribution It really isn t meant for a new user will assume you are already familiar with profile hidden Markov models profile HMMs Krogh et al 1994 Eddy 1998 Durbin et al 1998 with the previous version of HMMER HMMER2 http hmmer org with other popular biological sequence comparison tools such a
26. ability for a given homologous domain even if precise endpoints are only fuzzily inferrable You ll notice that for higher scoring domains the coordinates of the envelope and the inferred alignment will tend to be in tighter agreement corresponding to sharper posterior probability defining the location of the homologous region Operationally would use the envelope coordinates to annotate domain locations on target sequences not the alignment coordinates However be aware that when two weaker scoring domains are close to each other envelope coordinates can and will overlap corresponding to the overlapping uncertainty of where one domain ends and another begins In contrast alignment coordinates generally do not overlap though there are cases where even they will overlap The last column is the average posterior probability of the aligned target sequence residues effectively the expected accuracy per residue of the alignment For comparison current Uniprot consensus annotation of Sevenless shows seven domains FT DOMAIN 311 431 Fibronectin type III 1 FT DOMAIN 436 528 Fibronectin type III 2 FT DOMAIN 822 921 Fibronectin type III 3 FT DOMAIN 1298 1392 Fibronectin type III 4 FT DOMAIN 1680 1794 Fibronectin type III 5 FT DOMAIN 1797 1897 Fibronectin type III 6 FT DOMAIN 1898 1988 Fibronectin type III 7 These domains are a pretty tough case to call actually HMMER fails to see anything significant over lapping two of t
27. alen mlen eff_nseq re pos description oa ak Se eae Pere ener Eley Roepe rat en Dien Rapa RE eee eee om ere teenie tegen elon eye Re 1 globins4 4 171 149 0 96 0 589 CPU time 0 55u 0 00s 00 00 00 55 Elapsed 00 00 00 56 If your input file had contained more than one alignment you d get one line of output for each model For instance a single hmmbuild command suffices to turn a Pfam seed alignment flatfile such as P fam A seed into a profile HMM flatfile such as Pfam hmm The information on these lines is almost self explanatory The globins4 alignment consisted of 4 se quences with 171 aligned columns HMMER turned it into a model of 149 consensus positions which means it defined 22 gap containing alignment columns to be insertions relative to consensus The 4 se quences were only counted as an effective total sequence number eff_nseq of 0 96 The model ended up with a relative entropy per position re pos information content of 0 589 bits This output format is rudimentary HMMER3 knows quite a bit more information about what it s done to build this HMM Some of this information is likely to be useful to you the user As H3 testing and development proceeds we re likely to expand the amount of data that hmmbuild reports The new HMM was saved to globins4 hmm If you were to look at this file and you don t have to it s intended for HMMER s consumption not yours you d see something like HMMER3 b 3
28. alue i Evalue hmmfrom hmm to alifrom ali to envfrom env to acc 1 40 7 0 0 9 1e 12 6 4e 09 2 84 439 520 437 521 0 95 2 si 14 4 0 0 0 0014 1 13 85 836 913 826 914 0 73 iter EFEN 0 0 Le 7 9e 02 10 36 1209 1235 1203 1259 0 82 4 24 3 0 0 1 2e 06 0 00084 14 80 1313 1380 1304 1386 0 82 5 47 2 0 7 8 3e 14 5 8e 11 1 85 1799 1890 1799 TEST ca O39 i 6 17 8 0 0 0 00013 0 091 6 74 0 1904 196 6 as 1901 197E as 0 90 ist 12 8 0 0 0 0047 343 1 86 1993 ZO Aas 1993 2107 a 0 89 Notice that almost everything s the same it s the same target sequence after all except for what hap pens with E values The independent E value is calculated assuming a search space of all 497293 se quences For example look at the highest scoring domain domain 5 here domain 7 above When we only looked at a single sequence its score of 47 2 bits has an E value of 5 8e 11 When we search a database of 497293 sequences a hit scoring 47 2 bits would be expected to happen 497293 times as often 1 2e 16 x 497293 5 97e 11 it s showing 5 8e 11 because of roundoff issues the 1 2e 16 in fact isn t exactly 1 2e 16 inside HMMER In this Uniprot search 711 sequences were reported in the top hits list with E values lt 10 If we were to assume that all 711 are true homologs x out the domain s that made us think that and then went looking for additional domains in those 711 sequences we d be searching a smaller database of 711 sequences the expected number of t
29. ameter values on the three lines marked STATS You also may be expecting to need to configure the model s alignment mode as in HUMER2 s hmmbuild f option for building local fragment search alignment models for example HMMER3 s hmmbuild does not have these options hmmbuild builds a core profile which the search and alignment programs configure as they need to And at least for the moment they always configure for local alignment Step 2 search the sequence database with hmmsearch Presumably you have a sequence database to search Here lIl use the Uniprot 15 7 Swissprot FASTA format flatfile not provided in the tutorial because of its large size uniprot sprot fasta If you don t have a sequence database handy run your example search against tutorial globins45 fa instead which is a FASTA format file containing 45 globin sequences hmmsearch accepts any FASTA file as input It also accepts EMBL Uniprot text format It will automat ically determine what format your file is in you don t have to say An example of searching a sequene database with our globins4 hmm model would look like gt hmmsearch globins4 hmm uniprot_sprot fasta gt globins4 out Depending on the database you search the output file globins4 out should look more or less like the example of a Uniprot search output provided in tutorial globins4 out The first section is the header that tells you what program you ran on what and with what options
30. cts available for making Windows more POSIX compliant and more compatible with GNU ish configures and builds One such product is Cygwin http www cygwin com which is freely available Although we do not test on Windows platforms we understand HMMER builds fine in a Cygwin environment on Windows Processor HMMER3 depends on vector parallelization methods that are supported on most modern processors H3 requires either an x86 compatible IA32 IA64 processor that supports the SSE2 vector instruction set or a PowerPC processor that supports the Altivec VMX instruction set SSE2 is supported on Intel processors from Pentium 4 on and AMD processors from K8 Athlon 64 on we believe this includes almost all Intel processors since 2000 and AMD processors since 2003 Altivec VMX is supported on Motorola G4 IBM G5 and IBM Power6 processors which we believe includes almost all PowerPC based desktop systems since 1999 and servers since 2007 If your platform does not support one of these vector instruction sets the configure script will revert to an unoptimized implementation called the dummy implementation This implementation is two orders of magnitude slower It will enable you to see H3 s features on a much wider range of processors but is not suited for real production work We do aim to be portable to all modern processors The acceleration algorithms are designed to be portable despite their use of specialized SIMD vector instructions We h
31. d gt lt f gt lt d gt lt f gt lt f gt lt f gt Sequence number lt a gt is a nonzero positive integer the number of sequences that the HMM was trained on This field is only used for logging purposes Optional Effective sequence number lt f gt is a nonzero positive real the effective total num ber of sequences determined by hmmbuild during sequence weighting for combin ing observed counts with Dirichlet prior information in parameterizing the model This field is only used for logging purposes Optional Training alignment checksum lt d gt is a nonnegative unsigned 32 bit integer This number is calculated from the training sequence data and used in conjunction with the alignment map information to verify that a given alignment is indeed the alignment that the map is for Optional Pfam gathering thresholds GA1 and GA2 See Pfam documentation of GA lines Optional Pfam trusted cutoffs TC1 and TC2 See Pfam documentation of TC lines Op tional Pfam noise cutoffs NC1 and NC2 See Pfam documentation of NC lines Optional lt fl gt lt f2 gt Statistical parameters needed for E value calculations lt s1 gt is the model s main model section alignment mode configuration currently only LocaL is recognized lt s2 gt is the name of the score distribution currently MSV VITERBI and FORWARD are recog nized lt 1 gt and lt f 2 gt are two real valued parameters controlling location and slope
32. do gt make install By default programs are installed in usr local bin and man pages in usr local man man1 You can change usr local to any directory you want using the configure prefix option as in configure prefix the directory you want If you have the Intel C compiler icc we strongly recommend that you use it instead of gcc for example for performance reasons by specifying CC icc either in your environment or on the gt configure command line For example on our systems we would do gt configure CC icc LDFLAGS static prefix usr local hmmer 3 0 gt make gt make check gt make install That s it You can keep reading if you want to know more about customizing a HMMERS installation or you can skip ahead to the next chapter the tutorial System requirements Operating system HMMER is designed to run on POSIX compatible platforms including UNIX Linux and MacOS X The POSIX standard essentially includes all operating systems except Microsoft Windows The alpha test code includes precompiled binaries for Linux These were compiled with the Intel C compiler icc on an x86_64 Intel platform running Red Hat Enterprise Linux AS4 We believe they should be widely portable to different Linux systems We have tested most extensively on Linux and to a lesser extent on MacOS X We aim to be portable to all other POSIX platforms We currently do not develop or test on Windows There are add on produ
33. e where it s common to mask away nonconfidently aligned columns of a multiple alignment The PP cons line provides an objective measure of the confidence assigned to each column The cc RF line is Stockholm format reference coordinate annotation with an x marking each column that the profile considered to be consensus Single sequence queries using phmmer The phmmer program is for searching a single sequence query against a sequence database much as BLASTP Or FASTA would do phmmer works essentially just like hmmsearch does except you provide a query sequence instead of a query profile HMM Internally HMMER builds a profile HMM from your single query sequence using a simple position independent scoring system BLOSUME62 scores converted to probabilities plus a gap open and gap extend probability The file tutorial HBB_HUMAN is a FASTA file containing the human 6 globin sequence as an example query If you have a sequence database such as uniprot_sprot fasta make that your target database otherwise use tutorial globins45 faas a small example gt phmmer tutorial HBB_HUMAN uniprot_sprot fa or gt phmmer tutorial HBB_HUMAN tutorial globins45 fa 22 Everything about the output is essentially as previously described for hmmsearch Iterative searches using jackhmmer The jackhmmer program is for searching a single sequence query iteratively against a sequence database much as PSI BLAST would do The first round is identica
34. e things even though already know about them also want to find out what glaring problems you find that I m not already losing sleep over Really What I hope to accomplish in testing The core of HMMER 9 s functionality seems stable to me but all the stuff wrapped around it the stuff you see like the applications command line options i o formats is prototypical and still fluid The main objective of the test period is for a small number of savvy power users to have the opportunity to give feedback while the user oriented layers of HMMER3 are still under development in particular before its basic feature set command line options and input and output formats get frozen You might want it to spit out XML or you like tab delimited format or you want this number or that number on such and such a line to make it really fit in your analysis pipeline or you really really need a command line option for slowing the search programs back down to HMMER2 speed so you have more time for coffee This is the kind of stuff ld most like to hear now while the code is still fluid An obvious corollary of this responsiveness to your feedback is don t write any heavy duty output parsers around HMMER3 just yet You should expect all the output formats to change at least slightly before a public release is finalized Of course since it s test code I d like to also hear about bugs how you manage to break it or when
35. f that line we ll probably put more information on eventually If the model had any consensus structure or reference line annotation that it inherited from your multiple alignment GC SS_cons GC RF annotation in Stockholm files that information is simply regurgitated as cs or RF annotation lines here The n3 model had a consensus structure annotation line The line starting with n3 is the consensus of the query model Capital letters represent the most con served high information content positions Dots in this line indicate insertions in the target sequence with respect to the model The midline indicates matches between the query model and target sequence A indicates positive score which can be interpreted as conservative substitution with respect to what the model expects at that position The line starting with 7LESS_DRoME is the target sequence Dashes in this line indicate deletions in the target sequence with respect to the model The bottom line is new to HUMERS3 This represents the posterior probability essentially the expected accuracy of each aligned residue A 0 means 0 5 1 means 5 15 and so on 9 means 85 95 and a means 95 100 posterior probability You can use these posterior probabilities to decide which parts of the alignment are well determined or not You ll often observe for example that expected alignment accuracy degrades around locations of insertion and deletion which you d intuit
36. flatfiles and use hmmbuild to build HMMs for all the alignments at once This won t work properly for our tutorial alignments because the globins4 sto alignment doesn t have an GF ID anno tation line giving a name to the globins4 alignment So hmmbuild wouldn t know how to name it correctly To build a multi model database from a multi MSA flatfile the alignments have to be in Stockholm format no 19 other MSA format that I m aware of supports having more than one alignment per file and each alignment must have a name on a cGF Ip line But if you happen to have a Pfam seed alignment flatfile pfam A seed around an example command would be gt hmmbuild Pfam A hmm Pfam A seed This would take about two or three hours to build all 10 000 models or so in Pfam To speed the database construction process up hmmbuild supports MPI parallelization As far as HMMER s concerned all you have to do is add mpi to the command line for hmmbuild assuming you ve compiled support for MPI into it see the installation instructions You ll also need to know how to invoke an MPI job in your particular environment with your job scheduler and MPI distribution We can t really help you with this different sites have different cluster environments With our scheduler SGE the Sun Grid Engine and our MPI distro OpenMPI an example incantation for building Pfam hmm from Pfam A seed is gt qsub N hmmbuild j y o errors out b y cwd
37. for subsequences extracted from a longer source sequence See below for details syntax of Stockholm markup There are four types of Stockholm markup annotation for per file per sequence per column and per residue annotation GF lt tag gt lt s gt Per file annotation lt s gt is a free format text line of annotation type lt tag gt For example GF DATE April 1 2000 Can occur anywhere in the file but usually all the cr markups occur in a header GS lt seqname gt lt tag gt lt s gt Per sequence annotation lt s gt is a free format text line of annotation type tag associated with the sequence named lt seqname gt For ex ample GS seql SPECIES_SOURCE Caenorhabditis elegans Can occur anywhere in the file but in single block formats e g the Pfam distribution will typically follow on the line after the sequence itself and in multi block formats e g HMMER output will typically occur in the header preceding the alignment but following the cr annotation GC lt tag gt lt s gt Per column annotation lt s gt is an aligned text line of annotation type lt tag gt GC lines are associated with a sequence alignment block lt s gt is aligned to the residues in the alignment block and has the same length as the rest of the block Typically Gc lines are placed at the end of each block GR lt seqname gt lt tag gt lt s gt Per residue annotation lt s gt is an aligned
38. goes through a gauntlet of three scoring algorithms called MSV Viterbi and Forward in order of increasing sensitivity and increasing com putational requirement MSV the Multi ungapped Segment Viterbi algorithm is the new algorithm in HMMER3 It essentially calculates the HMM equivalent of BLAST s sum score an optimal sum of ungapped high scoring alignment segments Unlike BLAST it does this calculation directly without BLAST s word hit or hit extension step using a SIMD vector parallel algorithm By default HMMER3 is configured to allow sequences with a P value of lt 0 02 through the MSV score filter thus if the database contained no homologs and P values 4lt may make more sense to condition the posterior probabilities on the assumption that the residue is indeed homologous given that how likely is it that I ve got it correctly aligned 18 were accurately calculated the highest scoring 2 of the sequences will pass the filter Here about 4 of the database got through the MSV filter A quick check is then done to see if the target sequence is obviously so biased in its composition that it s unlikely to be a true homolog This is called the bias filter If you don t like it it can occasionally be overaggressive you can shut it off with the nobias option Here 15929 sequences pass through the bias filter The Viterbi filter then calculates a gapped optimal alignment score This is a bit more sen
39. hen this is a single character representing the reference annotation for this match state otherwise this field is The next field is the cs annotation for this node If cs was yes then this is a single character representing the consensus structure at this match state otherwise this field is Insert emission line The K fields on this line are the insert emission scores one per symbol in alphabetic order State transition line The seven fields on this line are the transitions for node k in the order shown by the transition header line Mp gt Mp 1 Ik Dk 1 Ik gt Meri Ik Dr gt Mk 1 Dest For transitions from the final node M match state M 1 is interpreted as the END state E and there is no delete state M 1 therefore the final M Dp and D Dx transitions are always zero probability and the final Dy gt Mz41 transition is always 0 0 probability 1 0 Finally the last line of the format is the record separator Stockholm the recommended multiple sequence alignment format The Pfam and Rfam Consortiums have developed a multiple sequence alignment format called Stockholm format that allows rich and extensible annotation Most popular multiple alignment file formats can be changed into a minimal Stockholm format file just by adding a Stockholm header line and a trailing terminator STOCKHOLM 1 0 seql ACDEF GHIKL seq2 ACDEF GHIKL seq3 EFMNRGHIKL
40. hese domains 311 431 and 1680 1794 in the Uniprot search though it sees a smidgen of them when 7LESS_DROME alone is the target HMMER3 sees two new domains 1205 1235 and 1993 2098 that Uniprot currently doesn t annotate but these are pretty plausible domains given that the extracellular domain of Sevenless is pretty much just a big array of 100aa fibronectin repeats Under the domain table an optimal posterior accuracy alignment Holmes 1998 is computed within each domain s envelope and displayed For example skipping domain 1 because it s weak and uncon vincing fibronectin Ill domain 2 in your 7LESS_DROME output is shown as domain 2 score 40 7 bits conditional E value 1 3e 14 CEEEEEEECTTEEEEEEE S SS SEEEEEEEETTTCCGCEEEEEETTTSEEEEES TT BEEEEEEEEEETTEE E CS 3Not to mention one mercifully rare bug artifact that I m betting is so unusual that testers don t even see an example of it but we ll see 17 f n3 2 saPenlsvsevtstsltlsWsppkdgggpitgYeveyqekgegeewgevtvprtttsvtltgLepgteYefrVqavngagegp 84 saP 1 4 4 W p tgpitgY e vpt s 4 L gttY n gegp 7LESS_DROME 439 SAPVIEHLMGLDDSHLAVHWHPGRFTNGP LEGYRLRLSSSEGNA TSEQLVPAGRGSYIFSOQLOAGTNYTLALSMINKOQGEGP 520 T89999999I IIK KKK KKK KKK RK KK RK KKK KKK AK IIIS RK KKK KKK RK KKK KK KKK KK KK KKKIIIT PP The initial header line starts with a as a little handle for a parsing script to grab hold of The rest o
41. hmmsearch hmmpfam hmmalign in people s domain analysis and annotation pipelines for instance using profile databases like Pfam or SMART These four programs have already been subjected to some serious independent testing by Rob Finn of the Pfam Consortium who visited Janelia for several months in order to adopt HMMER3 at Pfam apparently by beating the living tar out of it haven t yet fixed all the issues the evil Rob has identified but have fixed the showstopping bugs The phmmer and jackhmmer programs are new to HMMERS They searches a single sequence against a sequence database akin to BLASTP and PSIBLAST respectively Internally they just produce a profile HMM from the query sequence then run HMM searches In the Tutorial section I ll show examples of running each of these programs using examples in the tutorial subdirectory of the distribution 11 think What s still missing Oh lots The most egregious lacunae include More processor support One of the attractive features of the MSV algorithm is that it is a very tight and efficient piece of code which ought to be able to take advantage of recent advances in using massively parallel GPUs graphics processing units and other specialized processors such as the Cell processor or FPGAs We have prototype work going on in a variety of processors but none of this is far along as yet But this work combined with the parallelization is partly why we expect to wring sign
42. ificant more speed out of HMMER in the future More speed Even on x86 platforms HMMER s acceleration algorithms are still on a nicely sloping bit of their asymptotic optimization curve still think can accelerate the code by another two fold or so Additionally for a small number of HMMs lt 1 of Pfam models the acceleration core is performing relatively poorly for reasons pretty much understand having to do with biased composition most of these pesky models are hydrophobic membrane proteins but which are nontrivial to work around This ll produce an annoying behavior that some testers are sure to notice if you look systematically sometimes you ll see a model that runs at something more like HMMER2 speed 100x or so slower than an average query This needless to say Will Be Fixed DNA sequence comparison HMMER s search pipeline is somewhat specialized to protein protein comparison specifically the pipeline works by filtering individual sequences winnowing down to a subset of the sequences in a database that need close attention from the full heavy artillery of Bayesian inference This strategy doesn t work for long DNA sequences it doesn t filter the human genome much to say there s a hit on chromosome 1 The algorithms need to be adapted to identify high scoring regions of a target sequence rather than filtering by whole sequence scores You can chop a DNA sequence into overlapping windows and HMMER3 would wor
43. imes we d see a hit of 47 2 bits or better is now 1 2e 16 x 711 8 3e 14 That s where the conditional E value c Evalue is coming from Notice that a couple of domains disappeared in the Uniprot search because now in this larger search space size they re not significant Domain 1 the one with the score of 1 3 bits got a conditional E value of 0 17 x 711 121 and domain 6 with a bit score of 0 0 got a c Evalue of 0 063 x 711 45 These fail the default reporting threshold of 10 0 The domain with a score of 5 1 bits also shifted from being above to below the default inclusion thresholds so now it s marked with a instead of a Operationally e If the independent E value is significant lt lt 1 that means that even this single domain by itself is such a strong hit that it suffices to identify the sequence as a significant homolog with respect to the size of the entire original database search You can be confident that this is a homologous domain 16 e Once there s one or more high scoring domains in the sequence already sufficient to decide that the sequence contains homologs of your query you can look with some caution at the conditional E value to decide the statistical significance of additional weak scoring domains In the Uniprot output for example l d be pretty sure of four of the domains 1 4 5 and maybe 6 each of which has a strong enough independent E value to declare 7LESS_DRomE to be an fnlll doma
44. in containing protein Domains 2 and 7 wouldn t be significant if they were all saw in the sequence but once decide that 7LESS_DROME contains fn3 domains on the basis of the other hits their conditional E values indicate that they are probably also fn3 domains too Domain 3 is too weak to be sure of from this search alone but would be something to pay attention to The next four columns give the endpoints of the reported local alignment with respect to both the query model hmm from and hmm to and the target sequence ali from and ali to It s not immediately easy to tell from the to coordinate whether the alignment ended internally in the query or target versus ran all the way as in a full length global alignment to the end s To make this more readily apparent with each pair of query and target endpoint coordinates there s also a little symbology meaning both ends of the alignment ended internally and means both ends of the alignment were full length flush to the ends of the query or target and and mean only the left or right end was flush full length The next two columns env from and env to define the envelope of the domain s location on the target sequence The envelope is almost always a little wider than what HMMER chooses to show as a reasonably confident alignment As mentioned earlier the envelope represents a subsequence that encompasses most of the posterior prob
45. ively expect You ll also see expected alignment accuracy degrade at the ends of an alignment this is because alignment accuracy posterior probabilities currently not only includes whether the residue is aligned to one model position versus others but also confounded with whether a residue should be considered to be homologous aligned to the model somewhere versus not homologous at all 4 These domain table and per domain alignment reports for each sequence then continue for each se quence that was in the per sequence top hits list Finally at the bottom of the file you ll see some summary statistics For example at the bottom of the globins search output you ll find something like Internal pipeline statistics summary Query model s 1 149 nodes Target sequences 497293 175274722 residues Passed MSV filter 19416 0 0390434 expected 9945 9 0 02 Passed bias filter 15929 0 0320314 expected 9945 9 0 02 Passed Vit filter 2207 0 00443803 expected 497 3 0 001 Passed Fwd filter 1076 0 00216371 expected 5 0 le 05 Initial search space Z 497293 actual number of targets Domain search space domZ 1075 number of targets reported over threshold CPU time 7 08u 0 09s 00 00 07 17 Elapsed 00 00 02 95 Mc sec 8852 86 This gives you some idea of what s going on in HMMER3 s acceleration pipeline You ve got one query HMM and the database has 497 293 target sequences Each sequence
46. k fine on such a chopped up database but that s a disgusting kludge and don t want to know about it Translated comparisons We d of course love to have the HMM equivalents of BLASTX TBLASTN and TBLASTX They ll come More sequence input formats HMMER3 will work fine with FASTA files for unaligned sequences and Stockholm files for multiple sequence alignments It has parsers for a handful of other formats Genbank EMBL and Uniprot flatfiles SELEX format alignments that we ve tested somewhat It s particularly missing parsers for some widely used alignment formats such as Clustal format so using HMMER3 on the MSAs produced by many popular multiple alignment programs MUSCLE or MAFFT for example is harder than it should be because it requires a reformat to Stockholm format More alignment modes HMMER3 only does local alignment HMMER2 also could do glocal alignment align a complete model to a subsequence of the target and global alignment align a complete model to a complete target sequence The E value statistics of glocal and global alignment remain poorly understood HMMERS relies on accurate significance statistics far more so than HMMER2 did because HMMER93 s acceleration pipeline works by filtering out sequences with poor P values Part of the reason for the test phase is to confirm that these points are just as annoying to you as they are to me and therefore important to fix asap Feel free to tell me you want thes
47. l to a phmmer search All the matches that pass the inclusion thresholds are put in a multiple alignment In the second and subsequent rounds a profile is made from these results and the database is searched again with the profile Iterations continue either until no new sequences are detected or the maximum number of iterations is reached By default the maximum number of iterations is 5 you can change this with the n option Your original query sequence is always included in the multiple alignments whether or not it appears in the database The consensus columns assigned to each multiple alignment always correspond exactly to the residues of your query so the coordinate system of every profile is always the same as the numbering of residues in your query sequence 1 L for a sequence of length L Assuming you have Uniprot or something like it handy here s an example command line for a jackhmmer search gt jackhmmer tutorial HBB_HUMAN uniprot_sprot fa One difference from phmmer output you ll notice is that jackhmmer marks new sequences with a and lost sequences with a New sequences are sequences that pass the inclusion threshold s in this round but didn t in the round before Lost sequences are the opposite sequences that passed the inclusion threshold s in the previous round but have now fallen beneath yet are still in the reported hits it s possible though rare to lose sequences utterly if they no
48. longer even pass the reporting threshold s In the first round everything above the inclusion thresholds is marked with a and nothing is marked with a For example the top of this output looks like jackhmmer iteratively search a protein sequence against a protein database HMMER 3 0b3 November 2009 http hmmer org Copyright C 2009 Howard Hughes Medical Institute Freely distributed under the GNU General Public License GPLv3 Shee ie Goat lt i Lda ep ime Sh sib ead eae ees el ge tie eee nee yee ee er cute ae oe ee eas te ee ie aes ce ee query sequence file HBB_HUMAN target sequence database uniprot_sprot fasta per seq hits tabular output hbb jack tbl per dom hits tabular output hbb jack domtbl SS ee ee ee a ee ee Query HBB_HUMAN L 146 Description Human beta hemoglobin Scores for complete sequences score includes all domains full sequence best 1 domain dom E value score bias E value score bias exp N Sequence Description 2 3e 98 331 4 0 0 2 5e 98 331 2 0 0 1 0 1 sp P68871 HBB_ HUMAN Hemoglobin subunit beta 2 3e 98 331 4 0 0 2 5e 98 331 2 0 0 1 0 1 sp P68872 HBB_PANPA Hemoglobin subunit beta 2 3e 98 331 4 0 0 2 5e 98 331 2 0 0 1 0 1 sp P68873 HBB_PANTR Hemoglobin subunit beta 9 1le 98 329 4 0 0 le 97 329 3 0 0 1 0 1 sp P02024 HBB_GORGO Hemoglobin subunit beta 2e 96 325 1 0 0 2 2e 96 324 9 0 0 1 0 1 sp P02025 HBB_HYLLA Hemoglobin subunit beta 2e 95
49. n x86 or PowerPC processor that supports SSE2 or Altivec VMX using an ANSI C99 compliant compiler please report the problem If it fails at all with the portable dummy implementation on any POSIX ANSI C99 compatible platform regardless of processor type please report the problem Multithreaded parallelization for multicores by default The four search programs support multicore parallelization This is implemented using POSIX threads By default the configure script will identify whether your platform supports POSIX threads almost all platforms do and will automatically compile in multithreading support If you want to disable multithreading at compile time recompile from source after giving the disable threads flag to configure 2Bjarne Knudsen has identified what s probably the main reason and it s not gcc s fault We ll be able to address the speed difference in the near future 3 f you install more than one package that uses the Easel library it may become an annoyance you ll have multiple instantiations of Easel lying around The Easel API is not yet stable enough to decouple it from the applications that use it like HMMER and Infernal By default the search programs will use all available cores on your machine You can control the number of cores each HMMER process will use with the cpu lt x gt command line option or the HMMER NCPU environment variable Accordingly you should observe about a 2x or 4
50. ne in a Stockholm alignment file Optional Model length lt a gt a positive nonzero integer is the number of match states in the model Mandatory Symbol alphabet type For biosequence analysis models lt s gt iS amino DNA Or RNA case insensitive There are also other accepted alphabets for purposes be yond biosequence analysis including coins dice and custom This determines the symbol alphabet and the size of the symbol emission probability distributions If amino the alphabet size K is set to 20 and the symbol alphabet to ACDE FGHIKLMNPQRSTVWY alphabetic order if pna the alphabet size K is set to 4 and the symbol alphabet to ACGT if RNa the alphabet size K is set to 4 and the symbol alphabet to ACGU Mandatory Reference annotation flag lt s gt is either no or yes case insensitive If yes the reference annotation character field for each match state in the main model see below is valid if no these characters are ignored Reference column annotation is picked up from a Stockholm alignment files GC RF line It is propagated to alignment outputs and also may optionally be used to define consensus match columns in profile HMM construction Optional assumed to be no if not present Consensus structure annotation flag lt s gt is either no or yes case insensitive If yes the consensus structure character field for each match state in the main model see below is valid if no
51. ns even if no single domain is solidly significant on its own On the other hand if the target sequence happened to be a piece of junk consisting of a set of identical internal repeats and one of those repeats accidentially gives a weak hit to the query model all the repeats will sum up and the sequence score might look significant which mathematically alas is the correct answer the null hypothesis were testing against is that the sequence is a random sequence of some base composition and a repetitive sequence isn t random So operationally e if both E values are significant lt lt 1 the sequence is likely to be homologous to your query 2The method that HMMER3 uses to compensate for biased composition is unpublished and different from HMMER2 We will write it up when there s a chance 14 EC 2 7 e if the full sequence E value is significant but the single best domain E value is not the target sequence is probably a multidomain remote homolog but be wary and watch out for the case where it s just a repetitive sequence OK the sharp eyed reader asks if that s so then why in the globin4 output all of which have only a single domain do the full sequence bit scores and best single domain bit scores not exactly agree For example the top ranked hit myc_pHyca sperm whale myoglobin if you re curious has a full sequence score of 222 7 and a single best domain score of 222 6 What s going on What s
52. nt in the globin example because all the globins in this example consist of only a single globin domain So let s set up a Second example using a model of a single domain that s commonly found in multiple domains in a single sequence Build a fibronectin type Ill domain model using the tutorial fn3 sto alignment this happens to be a Pfam seed alignment it s a good example of an alignment with complex Stockholm anno tation Then use that model to analyze the sequence tutorial 7LESS_DROME the Drosophila Sevenless receptor tyrosine kinase gt hmmbuild fn3 hmm tutorial fn3 sto gt hmmsearch f n3 hmm tutorial 7LESS_DROME gt fn3 out An example of what that output file will look like is provided in tutorial fn3 out The sequence top hits list says full sequence best 1 domain dom E value score bias E value score bias exp N Sequence Description 1 9e 57 178 0 0 4 1 2e 16 ATs2 0 7 9 4 9 7LESS_DROME RecName Full Protein sevenless OK now let s pick up the explanation where we left off The total sequence score of 178 0 sums up all the fibronectin IIl domains that were found in the 7LESS_DRomE sequence The single best dom score and E value are the bit score and E value as if the target sequence only contained the single best scoring domain without this summation The idea is that we might be able to detect that a sequence is a member of a multidomain family because it contains multiple weakly scoring domai
53. o that sort of structured data format lingo The Stockholm markup tags that are parsed semantically by my software are as follows recognized GF annotations ID lt s gt Identifier lt s gt is a name for the alignment e g rrm One word Unique in file Ac lt s gt Accession lt s gt is a unique accession number for the alignment e g PF00001 Used by the Pfam database for instance Often a alphabetical prefix indicating the database e g PF followed by a unique numerical accession One word Unique in file DE lt s gt Description lt s gt is a free format line giving a description of the alignment e g RNA recogni tion motif proteins One line Unique in file au lt s gt Author lt s gt is a free format line listing the authors responsible for an alignment e g Bateman A One line Unique in file GA lt f gt lt f gt Gathering thresholds Two real numbers giving HMMER bit score per sequence and per domain cutoffs used in gathering the members of Pfam full alignments NC lt f gt lt gt Noise cutoffs Two real numbers giving HMMER bit score per sequence and per domain cutoffs set according to the highest scores seen for unrelated sequences when gathering members of Pfam full alignments TC lt f gt lt f gt Trusted cutoffs Two real numbers giving HMMER bit score per sequence and per domain cutoffs set according to the lowest scores seen for true homologous sequences that were ab
54. onment variables For example in our environment we currently build HMMERS for MPI using gt setenv OMPI_CC icc gt setenv OMPI_CFLAGS 03 gt configure enable mpi prefix usr local hmmer3 gt make Using build directories The installation process from source now supports using separate build directories using the GNU standard VPATH mechanism This allows you to maintain separate builds for different processors or with different configuration compilation options All you have to do is run the configure script from the directory you want to be the root of the build directory For example gt mkdir my hmmer build gt cd my hmmer build gt path to hmmer configure gt make You d probably only use this feature if you re a developer maintaining several different builds for testing purposes Makefile targets al1 Builds everything Same as just saying make check Runs automated test suites in both HMMER and the Easel library clean Removes all files generated by compilation by make Configuration files generated by configure is preserved distclean Removes all files generated by configuration by configure and by compilation by make Note that if you want to make a new configuration for example to try an MPI version by configure enable mpi make you should do a make distclean rather than a make clean to be sure old configuration files aren t used accidentally 10 3 Tutorial Here
55. ope to add support for the Sun SPARC VIS instruction set for example We believe that the code will be able to take advantage of GP GPUs and FPGAs in the future Compiler The source code is C conforming to POSIX and ANSI C99 standards It should compile with any ANSI C99 compliant compiler including the GNU C compiler gcc We test the code using both the gcc and icc compilers If you compile HMMER from source we strongly recommend using the Intel C compiler icc rather than gcc icc is free for noncommercial use and heavily discounted for academic use GNU gcc generates HMMERS3 code that is significantly slower than icc code If you find yourself saying hey those guys said this program s supposed to be as fast as BLAST but it only seems half as fast odds are you recompiled from source with gcc Libraries and other installation requirements HMMER includes a software library called Easel which it will automatically compile during its installation process By default HMMERS3 does not require any additional libraries to be installed by you other than standard ANSI C99 libraries that should already be present on a system that can compile C code Bundling Easel instead of making it a separate installation requirement is a deliberate design decision to simplify the installation process One of the objectives of the alpha test is to identify portability issues If HMMERS fails to compile and or run fast under a POSIX compliant OS on a
56. ove the GA gathering thresholds when gathering members of Pfam full alignments recognized GS annotations wr lt f gt Weight lt f gt is a positive real number giving the relative weight for a sequence usually used to compensate for biased representation by downweighting similar sequences Usually the weights average 1 0 e g the weights sum to the number of sequences in the alignment but this is not required Either every sequence must have a weight annotated or none of them can ac lt s gt Accession lt s gt is a database accession number for this sequence Compare the cF Ac markup which gives an accession for the whole alignment One word DE lt s gt Description lt s gt is one line giving a description for this sequence Compare the cr DE markup which gives a description for the whole alignment recognized GC annotations RF Reference line Any character is accepted as a markup for a column The intent is to allow labeling the columns with some sort of mark SS_cons Secondary structure consensus For protein alignments DSSP codes or gaps are accepted as markup HGIEBTSCX _ where H is alpha helix G is 3 10 helix is p helix E is extended strand B is a residue in an isolated b bridge T is a turn S is a bend C is a random coil or loop and X is unknown for instance a residue that was not resolved in a crystal structure SA_cons Surface accessibility consensus 0 9 gap symbols or X are accepted
57. parity between BLAST and more formal and powerful HMM based methods The acceleration algorithm in HMMER3 is a new heuristic It seems likely to be more sensitive than BLAST s heuristics on theoretical grounds It certainly benchmarks that way in practice Eddy 2009 manuscript in preparation Additionally its very well suited to modern hardware architectures We expect to be able to take good advantage of GPUs and other parallel processing environments in the near future What made it into the HMMER3 0 test code Single sequence queries new to HMMER3 phmmer Search a sequence against a sequence database BLASTP like jackhmmer Iteratively search a Sequence against a sequence database PSIBLAST like Replacements for HMMER2 s functionality hmmbuild Build a profile HMM from an input multiple alignment hmmsearch Search a profile HMM against a sequence database hmmscan Search a sequence against a profile HMM database hmmalign Make a multiple alignment of many sequences to a common profile HMM Other utilities hmmconvert Convert profile formats to from HMMER3 format hmmemit Generate sample sequences from a profile HMM hmmfetch Get a profile HMM by name or accession from an HMM database hmmpress Format an HMM database into a binary format for hmmscan hmmstat Show summary statistics for each profile in an HMM database The quadrumvirate of hmmbuild hmmsearch hmmscan hmmalign replaces HMMER2 s core functional ity of hmmbuild
58. quence database searches It s precisely the most remote homologs the most difficult to identify and potentially most interesting sequences where alignment uncertainty is greatest and where the statistical approximation inherent in scoring just a single best alignment breaks down the most To maximize power to discriminate true homologs from nonhomologs in a database search statistical inference theory says you ought to be scoring sequences by integrating over alignment uncertainty not just scoring the single best alignment HMMER3 s log odds scores are sequence scores not just optimal alignment scores they are integrated over the posterior alignment ensemble Speed A major limitation of previous profile HMM implementations including HMMER2 was their slow performance HMMER3 implements a new heuristic acceleration algorithm For most queries it s about as fast as BLAST Individually none of these points is new As far as alignment ensembles and sequence scores go pretty much the whole reason why hidden Markov models are so theoretically attractive for sequence analysis is that they are good probabilistic models for explicitly dealing with alignment uncertainty The SAM profile HMM software from UC Santa Cruz has always used full probabilistic inference the HMM Forward and Backward algorithms as opposed to optimal alignment scores the HMM Viterbi algorithm HMMER2 had the full HMM inference algorithms available as command line op
59. s BLAST Altschul et al 1997 and with running sequence analysis tools on a UNIX or UNIX like command line If this isn t true of you you should probably not be using the HMMER3 code yet Instead you should wait for a later and more stable version when the user documentation will take less for granted Design goals of HMMER3 HMMER 3 s objective is to combine the power of probabilistic inference with high computational speed We aim to upgrade some of molecular biology s most important sequence analysis applications to use more powerful statistical inference engines without sacrificing computational performance Specifically HMMER3 has three main design features that in combination distinguish it from previous tools Explicit representation of alignment uncertainty Most sequence alignment analysis tools report only a single best scoring alignment However sequence alignments are uncertain and the more distantly related sequences are the more uncertain alignments become HMMERS calculates complete pos terior alignment ensembles rather than single optimal alignments Posterior ensembles get used for a variety of useful inferences involving alignment uncertainty For example any HMMER3 sequence alignment is accompanied by posterior probability annotation representing the degree of confidence in each individual aligned residue Sequence scores not alignment scores Alignment uncertainty has an important impact on the power of se
60. s and Nucleic Acids Cambridge University Press Cambridge UK Eddy S R 1998 Profile hidden Markov models Bioinformatics 14 755 763 Eddy S R 2008 A probabilistic model of local sequence alignment that simplifies statistical significance estimation PLoS Comput Biol 4 e1000069 Holmes 1998 Studies in Probabilistic Sequence Alignment and Evolution PhD thesis University of Cambridge Karlin S and Altschul S F 1990 Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes Proc Natl Acad Sci USA 87 2264 2268 Karlin S and Altschul S F 1993 Applications and statistics for multiple high scoring segments in molec ular sequences Proc Natl Acad Sci USA 90 5873 5877 Krogh A Brown M Mian I S Sj lander K and Haussler D 1994 Hidden Markov models in compu tational biology Applications to protein modeling J Mol Biol 235 1501 1531 32
61. s with hmmalign The file tutorial globins45 fa is a FASTA file containing 45 unaligned globin sequences To align all of these to the globins4 model and make a multiple sequence alignment gt hmmalign globins4 hmm tutorial globins45 fa The output of this is a Stockholm format multiple alignment file The first few lines of it look like 21 STOCKHOLM 1 0 MYG_ESCGI VLSDAEWOLVLNIWAKVEADVAGHGQDILIRLFKGHPETLEKFDKFKH GR MYG_ESCGI PR 69k RR kkk k kkk kkk kk kk kk kk k k kk RK k kk k kk k k k k k k k MYG_HORSE g LSDGEWQQVLNVWGKVEADIAGHGQEVLIRLFTGHPETLEKFDKFKH GR MYG_HORSE PP 8 89 RR IR RR IR IO IRR I I Ok MYG_PROGU g LSDGEWOLVLNVWGKVEGDLSGHGOEVLIRLFKGHPETLEKFDKFKH GR MYG_PROGU PP 8 89 RR k k k k k k k k k k k k k k k k k k k k k k k k kk k k kkk Ik MYG_SAISC g LSDGEWQLVLNIWGKVEADIPSHGQEVLISLFKGHPETLEKFDKFKH GR MYG_SAISC PP 8 BQkk kk kkk kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk MYG_LYCPI g LSDGEWQIVLNIWGKVETDLAGHGQEVLIRLFKNHPETLDKFDKFKH GR MYG_LYCPI PP 8 BQx k k kkk kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk MYG_MOUSE g LSDGEWQLVLNVWGKVEADLAGHGQEVLIGLFKTHPETLDKFDKFKN GR MYG_MOUSE PP 8 8BQk xkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk MYG_MUSAN v DWEKVNSVWSAVESDLTAIGONILLRLFEQYPESONHFPKFKN and so on Notice those pp annotation lines That s posterior probability annotation as in the single sequence alignments that hmmscan and hmmsearch showed This essentially represents the confidence that each residue is aligned
62. saP I 4 4W p tgpitgY t e vpt s 4 L gttY n gegp 7LESS_DROME 439 SAPVIEHLMGLDDSHLAVHWHPGRFTINGP LEGYRLRLSSSEGNA TSEQLVPAGRGSYIFSQLOAGTNYTLALSMINKOGEGP 520 1899999999 KKK KKK KKK KK RK KKK KKK KKK KK IIIS KKK RK RK KK KK KK KKK KKIIIT PP You d think that except for the E values which depend on database search space sizes you should get exactly the same scores domain number domain coordinates and alignment every time you do a search of the same HMM against the same sequence And this is actually the case but in fact it s actually not so obvious this should be so and HMMER is going out of its way to make it so HMMER uses stochastic sampling algorithms to infer some parameters and also to infer the exact domain number and domain boundaries in certain difficult cases If HMMER ran its stochastic samples properly it would see different samples every time you ran a program and all of you would complain to me that HMMER was weird and buggy because it gave different answers on the same problem To suppress run to run variation HUMMER seeds its random number generator s identically every time you do a sequence comparison If you re an expert and you really want to see the proper stochastic variation that results from any sampling algorithms that got run you can pass a command line argument of seed 0 to programs that have this property hmmbuild and the four search programs Creating multiple alignment
63. sitive than the MSV score but the Viterbi filter is about four fold slower than MSV By default HMMER3 lets sequences with a P value of lt 0 001 through this stage Here because there s a little over a thousand true globin homologs in this database much more than that gets through 2207 sequences Then the full Forward score is calculated which sums over all possible alignments of the profile to the target sequence The default allows sequences with a P value of lt 1075 through 1076 sequences passed All sequences that make it through the three filters are then subjected to a full probabilistic analysis using the HMM Forward Backward algorithms first to identify domains and assign domain envelopes then within each individual domain envelope Forward Backward calculations are done to determine posterior probabilities for each aligned residue followed by optimal accuracy alignment The results of this step are what you finally see on the output Recall the difference between conditional and independent E values with their two different search space sizes These search space sizes are reported in the statistics summary Finally it reports the speed of the search in units of Mc sec million dynamic programming cells per second the CPU time and the elapsed time This search took about 2 95 seconds of elapsed wall clock time running with cpu 2 on two cores That s in the same ballpark as BLAST depending on which BLAST you compare
64. state The begin state is mute so there are no match emission probabilities The first line is the insert 0 emissions The second line contains the transitions from the begin state and insert state 0 These seven numbers are 2That is the first two lines after the optional COMPO line Don t be confused by the presence of an optional COMPO line here The COMPO line is placed in the model section below the residue column headers because it s an array of numbers much like residue scores but it s not really part of the model 27 BoM B gt Ip B gt Dy Io gt My Io Ip then a 0 0 and a because by convention nonexistent transitions from the nonexistent delete state 0 are set to log1 0 and log0 w The remainder of the model has three lines per node for M nodes where M is the number of match states as given by the LENG line These three lines are K is the alphabet size in residues Match emission line The first field is the node number 1 M The parser verifies this number as a consistency check it expects the nodes to come in order The next K numbers for match emissions one per symbol in alphabetic order The next field is the map annotation for this node If map was yes in the header then this is an integer representing the alignment column index for this match state 1 alen otherwise this field is The next field is the RF annotation for this node If RF was yes in the header t
65. tain seven fibronectin type III domains fn3 out Output of hmmsearch fn3 hmm 7LESS_DROME Pkinase sto The Pfam 22 0 Pkinase seed alignment of protein kinase domains minifam An example HMM flatfile database containing three models globins4 fn3 and Pkinase minifam h3 m i f p Binary compressed files corresponding to minifam produced by hmmpress HBB_HUMAN A FASTA file containing the sequence of human 3 hemoglobin used as an exam ple query for phmmer and jackhmmer Searching a sequence database with a single profile HMM Step 1 build a profile HMM with hmmbuild HMMER starts with a multiple sequence alignment file that you provide Currently HMMER3 only reads Stockholm alignments The file tutorial globins4 sto is an example of a simple Stockholm file It looks like this 1m lying It can read more just don t trust its other parsers yet 11 STOCKHOLM 1 0 HBB_HUMAN VHLTPEEKSAVTALWGKV NVDEVGGEALGRLLVVYPWTORFFESFGDLSTPDAVMGNPKVKAHGKKVL HBA HUMAN VLSPADKTNVKAAWGKVGA HAGEYGAEALERMFLSFPTTKTYFPHF DLS HGSAQVKGHGKKVA MYGHPHYCA Anae Saves VLSEGEWOQLVLHVWAKVEA DVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVL GLB5_PETMA PIVDTGSVAPLSAAEKTKIRSAWAPVYS TYETSGVDILVKFFTSTPAAQEFFPKFKGLTTADOLKKSADVRWHAERII HBB_HUMAN GAFSDGLAHL D NLKGTFATLSELHCDKL HVDPENFRLLGNVLVCVLAHHFGKEF TPPVQAAYOKVVAGVANAL HBA_HUMAN DALTNAVAHV D DMPNALSALSDLHAHKL RVDPVNFKLLSHCLLVTLAAHLPAEF TPAVHASLDK
66. tatistical significance of each domain given that we ve already decided that the target sequence is a true homolog It is the expected number of additional domains we d find with a domain score this big in the set of sequences reported in the top hits list if those sequences consisted only of random nonhomologous sequence outside the region that sufficed to define them as homologs The second number is the independent E value the significance of the sequence in the whole database search if this were the only domain we had identified It s exactly the same as the best 1 domain E value in the sequence top hits list The different between the two E values is not apparent in the 7LESS_DROME example because in both cases the size of the search space as 1 sequence There s a single sequence in the target sequence database that s the size of the search space that the independent best single domain E value depends on There s one sequence reported as a putative homolog in the sequence top hits list that s the size of the search space that the conditional E value depends on A better example is to see what happens when we search Uniprot 15 7 contains 497293 sequences with the fn3 model gt hmmsearch f n3 hmm uniprot_sprot fasta If you don t have Uniprot and can t run a command like this don t worry about it I ll show the relevant bits here Now the domain report for 7LESS_DRomE looks like score bias c Ev
67. text line of annotation type lt tag gt associated with the sequence named lt seqname gt GR lines are associated with one sequence in a sequence alignment block lt s gt s aligned to the residues in that sequence and has the same length as the rest of the block Typically GR lines are placed immedi ately following the aligned sequence they annotate semantics of Stockholm markup Any Stockholm parser will accept syntactically correct files but is not obligated to do anything with the markup lines It is up to the application whether it will attempt to interpret the meaning the semantics of the markup in a useful way At the two extremes are the Belvu alignment viewer and the HMMER profile hidden Markov model software package Belvu simply reads Stockholm markup and displays it without trying to interpret it at all The tag types 4 GF etc are sufficient to tell Belvu how to display the markup whether it is attached to the whole file sequences columns or residues HMMER uses Stockholm markup to pick up a variety of information from the Pfam multiple alignment database The Pfam consortium therefore agrees on additional syntax for certain tag types so HMMER can parse some markups for useful information This additional syntax is imposed by Pfam HMMER and 29 other software of mine not by Stockholm format per se You can think of Stockholm as akin to XML and what my software reads as akin to an XML DTD if you re int
68. tions but used Viterbi alignment by default in part for speed reasons Calculating alignment ensembles is even more computationally intensive than calculating single optimal alignments One reason why it s been hard to deploy sequence scores for practical large scale use is that we haven t known how to accurately calculate the statistical significance of a log odds score that s been integrated over alignment uncertainty Accurate statistical significance estimates are essential when one is trying to discriminate homologs from millions of unrelated sequences in a large sequence database search The statistical significance of optimal alignment scores can be calculated by Karlin Altschul statistics Karlin and Altschul 1990 1993 Karlin Altschul statistics are one of the most important and fundamental advances introduced by BLAST However this theory doesn t apply to integrated log odds sequence scores HMM Forward scores The statistical significance expectation values or E values of HMMERS3 sequence scores is determined by using recent theoretical conjectures about the statistical properties of integrated log odds scores which have been supported by numerical simulation experiments Eddy 2008 And as far as speed goes there s really nothing new about HMMER3 s speed either Besides Kar lin Altschul statistics the main reason BLAST has been so useful is that it s so fast Our design goal in HMMER3 was to achieve rough speed
69. where it should be hmmalign Currently has a feature that we re aware of Recall that HMMER3 only does local alignments Here we know that we ve provided full length globin sequences and globins4 is a full length globin model We d probably like hmmalign to produce a global alignment It can t currently do that If it doesn t quite manage to extend its local alignment to the full length of a target globin sequence you ll get a weird looking effect as the nonmatching termini are pulled out to the left or right For example look at the N terminal g in MyG_HORSE above H3 is about 80 confident that this residue is nonhomologous though any sensible person would align it into the first globin consensus column Look at the end of that first block of Stockholm alignment where you ll see HBBL_RANCA v HWTAEEKAVINSVWOKV DVEQDGHEALTRLFIVYPWTOQRYFSTFGD GR HBBL_RANCA PP 6 6799 x amp RR RR RI I IRR IO IK I Ok HBB2_TRICR VHLTAEDRKE IAAI LGKV NVDSLGGOCLARLIVVNPWSRRYFHDFGD GR HBB2_TRICR PP 69 ko RRR I IRR I IR IO RK GC PP_cons OT I I FOR IO OR OR IO IO IO IO OR IO IO TOR IO I de GC RFE XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX The cc PP cons line is Stockholm format consensus posterior probability annotation for the entire column It s calculated simply as the arithmetic mean of the per residue posterior probabilities in that col umn This should prove useful in phylogenetic inference applications for exampl
70. x speedup on dual core or quad core machines relative to previous releases Even with a single CPU cpu 1 HMMER will devote a separate execution thread to database input resulting in significant speedup over serial execution If you specify cpu 0 the program will run in serial only mode with no threads This might be useful if you suspect something is awry with the threaded parallel implementation MPI parallelization for clusters is optional The four search programs and hmmbuild now also support MPI Message Passing Interface parallelization on clusters To use MPI you first need to have an MPI library installed such as OpenMPI www open mpi org MPI support is not enabled by default and it is not compiled into the precompiled binaries that we supply with HMMER To enable MPI support at compile time give the enable mpi option to the configure command To use MPI parallelization each program that has an MPl parallel mode has an mpi command line option This option activates a master worker parallelization mode Without the mpi option if you run a program under mpirun on N nodes you ll be running N independent duplicate commands not a single MPl enabled command Don t do that It s highly advantageous to get mpicc to use the Intel C compiler rather than its default which is of ten gcc Different MPI distributions may have different ways of selecting the C compiler and its options OpenMPI can be controlled by envir

HMMER3 beta test: User's Guide

Contents

Download Pdf Manuals

Related Search

Related Contents