MUSCLE User Guide - Welcome to Uhura

1. SUEFF
tree1, tree2 | File name | None | Save tree produced in first or second iteration to given file in Newick (Phylip-compatible) format.

Value option | Legal values | Default | Description
weight1, weight2 | none, henikoff, henikoffpb, gsc, clustalw, threeway | clustalw | Sequence weighting scheme. weight1 is used in iterations 1 and 2; weight2 is used for tree-dependent refinement. none = all sequences have equal weight; henikoff = Henikoff & Henikoff weighting scheme; henikoffpb = modified Henikoff scheme as used in PSI-BLAST; clustalw = CLUSTALW method; threeway = Gotoh three-way method.

Flag option | Set by default? | Description
anchors | yes | Use anchor optimization in tree-dependent refinement iterations.
clw | no | Write output in CLUSTALW format (default is FASTA).
clwstrict | no | Write output in CLUSTALW format with the "CLUSTAL W (1.81)" header rather than the MUSCLE version. This is useful when a post-processing step is picky about the file header.
core | yes in muscle, no in muscled | Do not catch exceptions.
fasta | yes | Write output in FASTA format. Alternatives include clw, clwstrict, msf and html.
group | yes | Group similar sequences together in the output. This is the default. See also stable.
html | no | Write output in HTML format (default is FASTA).
le | maybe | Use log-expectation profile score (VTML240). Alternatives are to use sp or sv. This is the default for amino acid sequences.
msf | no | Write output in MSF format (default is FASTA).
noanchors | no | Disab
2. of tables required in memory.

Each sequence starts with an annotation line, which is recognized by having a greater-than symbol (>) as its first character. There is no limit on the length of an annotation line (this is new as of version 3.5), and there is no requirement that the annotation be unique. The sequence itself follows on one or more subsequent lines, and is terminated either by the next annotation line or by the end of the file.

3.1.1 Amino acid sequences
The standard single-letter amino acid alphabet is used. Upper and lower case are both allowed; the case is not significant. The special characters X, B, Z and U are understood. X means "unknown amino acid", B is D or N, Z is E or Q. U is understood to be the 21st amino acid, Selenocysteine. White space (spaces, tabs and the end-of-line characters CR and NL) is allowed inside sequence data. Dots and dashes in sequences are allowed and are discarded unless the input is expected to be aligned (e.g., for the -refine option).

3.1.2 Nucleotide sequences
The usual letters A, G, C, T and U stand for nucleotides. The letters T and U are equivalent as far as MUSCLE is concerned. N is the wildcard meaning "unknown nucleotide". R means A or G, Y means C or T/U. Other wildcards, such as those used by RFAM, are not understood in this version and will be replaced by Ns. If you would like support for other DNA / RNA alphabets, please let me know.

3.1.3 Determining sequence type
By default
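The input rules in section 3.1 (annotation lines starting with ">", white space allowed inside sequence data, dots and dashes discarded unless the input is an alignment) can be sketched in a few lines of Python. This is only an illustration of the rules as stated in the text, not MUSCLE's own reader; the function name `read_fasta`, the `keep_gaps` parameter and the choice to upper-case letters are assumptions made for the example.

```python
def read_fasta(text, keep_gaps=False):
    """Parse FASTA text into a list of (annotation, sequence) pairs.

    Hypothetical helper: follows the rules described in section 3.1,
    not the actual MUSCLE source code.
    """
    records = []
    label, chunks = None, []
    for line in text.splitlines():  # handles NL, CR NL and CR line ends
        if line.startswith(">"):
            # A new annotation line terminates the previous sequence.
            if label is not None:
                records.append((label, "".join(chunks)))
            label, chunks = line[1:].strip(), []
        elif label is not None:
            for ch in line:
                if ch.isspace():
                    continue  # white space is allowed and ignored
                if ch in ".-" and not keep_gaps:
                    continue  # gaps are discarded for unaligned input
                chunks.append(ch.upper())  # case is not significant
    if label is not None:
        records.append((label, "".join(chunks)))
    return records
```

Annotations are not required to be unique, so the result is a list rather than a dictionary. Passing `keep_gaps=True` mimics the case where the input is expected to be aligned.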
3. so aligning every pair can take a while. MUSCLE uses a much faster, but somewhat more approximate, method to compute distances: it counts the number of short sub-sequences (known as k-mers, k-tuples or words) that two sequences have in common, without constructing an alignment. This is typically around 3,000 times faster than CLUSTALW's method, but the trees will generally be less accurate. We call this step "k-mer clustering".

The second step is to use the tree to construct what is known as a progressive alignment. At each node of the binary tree, a pair-wise alignment is constructed, progressing from the leaves towards the root. The first alignment will be made from two sequences. Later alignments will be one of the three following types: sequence-sequence, profile-sequence or profile-profile, where "profile" means the multiple alignment of the sequences under a given internal node of the tree. This is very similar to what CLUSTALW does once it has built a tree.

Now we have a multiple alignment, which has been built very quickly compared with conventional methods, mainly because of the distance calculation using k-mers rather than alignments. The quality of this alignment is typically pretty good: it will often tie or beat a T-Coffee alignment on our tests. However, on average, we find that it can be improved by proceeding through the following steps.

From the multiple alignment, we can now compute the pair-wise identities of each pair of sequence
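The idea of counting k-mers that two sequences have in common, without building an alignment, can be sketched as follows. This is a deliberately simplified illustration: the function names, the default k = 3 and the normalization by the shorter sequence are assumptions for the example, and MUSCLE's real distance measures (such as kmer6_6 or kbit20_3) are not reproduced here.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer (word of length k) in a sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """Alignment-free distance in [0, 1] based on shared k-mers.

    Hypothetical measure: 1 minus the fraction of k-mers in the
    shorter sequence that also occur in the other sequence.
    """
    possible = min(len(a), len(b)) - k + 1
    if possible <= 0:
        return 1.0  # too short to contain any k-mer
    # Counter intersection keeps the minimum count of each shared k-mer.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / possible
```

Each sequence is scanned once, so the cost per pair grows roughly linearly with sequence length, whereas a dynamic programming alignment grows with the product of the two lengths. That difference is the source of the speed-up described above.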
4. such as -msf. All options are a dash (not two dashes) followed by a long name; there are no single-letter equivalents. Value options must be separated from their values by white space in the command line. Thus, muscle does not follow Unix, Linux or Posix standards, for which we apologize. The order in which options are given is irrelevant unless two options contradict, in which case the right-most option silently wins.

4.3 The maxiters option
You can control the number of iterations that MUSCLE does by specifying the -maxiters option. If you specify 1, 2 or 3, then this is exactly the number of iterations that will be performed. If the value is greater than 3, then muscle will continue up to the maximum you specify or until convergence is reached, whichever happens sooner. The default is 16. If you have a large number of sequences, refinement may be rather slow.

4.4 The maxtrees option
This option controls the maximum number of new trees to create in iteration 2. Our experience suggests that a point of diminishing returns is typically reached after the first tree, so the default value is 1. If a larger value is given, the process will repeat until convergence or until this number of trees has been created, whichever comes first.

4.5 The maxhours option
If you have a large alignment, muscle may take a long time to complete. It is sometimes convenient to say "I want the best alignment I can get in 24 hours" rather than specifying a set of
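The option conventions in section 4.2 (a single dash, a long name, and white space between a value option and its value) map naturally onto an argument list. The sketch below builds such a list in Python; the helper name `build_muscle_command` and its parameters are hypothetical, and only option names that appear in this guide are used in the example. It constructs the command but does not run it.

```python
def build_muscle_command(in_path, out_path, value_options=None, flags=()):
    """Build an argument list for muscle (hypothetical helper).

    value_options: mapping of option name to value, e.g. {"maxiters": 2}
    flags: iterable of flag option names, e.g. ["sv"]
    """
    command = ["muscle", "-in", in_path, "-out", out_path]
    for name, value in (value_options or {}).items():
        # Value options: one dash, the long name, then the value as a
        # separate argument (i.e. separated by white space).
        command += ["-" + name, str(value)]
    for flag in flags:
        # Flag options just stand for themselves.
        command.append("-" + flag)
    return command
```

The resulting list could be passed to `subprocess.run` on a machine where muscle is installed. Keeping each option and its value as separate list items avoids shell quoting problems with file names.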
5. technique called k-mer extension to find diagonals. It is disabled by default because of the slight reduction in average accuracy, and can be turned on by specifying the -diags and -diags2 options.

4.8 Anchor optimization
Tree-dependent refinement (iterations 3, 4 ...) can be sped up by dividing the alignment vertically into blocks. Block boundaries are found by identifying high-scoring columns (e.g., a perfectly conserved column of Cs or Ws would be a candidate). Each vertical block is then refined independently before reassembling the complete alignment, which is faster because of the L^2 factor in dynamic programming (e.g., suppose the alignment is split into two vertical blocks, then 2 x 0.5^2 = 0.5, so the dynamic programming time is roughly halved). The -noanchors option is used to disable this feature. This option has no effect if -maxiters 1 or -maxiters 2 is specified. On benchmark tests, enabling anchors has little or no effect on accuracy, but if you want to be very conservative and are striving for the best possible accuracy then -noanchors is a reasonable choice.

4.9 Log file
You can specify a log file by using -log <filename> or -loga <filename>. Using -log causes any existing file to be deleted; -loga appends to any existing file. A message will be written to the log file when muscle starts and stops. Error and warning messages will also be written to the log. If -verbose is specified, then more infor
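The arithmetic behind anchor optimization (section 4.8) generalizes beyond two blocks: if dynamic programming time grows with the square of the alignment length, then splitting the alignment into blocks that cover fractions f1, f2, ... of its length costs roughly the sum of the squared fractions relative to the unsplit alignment. The Python sketch below works this out; the function name is hypothetical and the pure quadratic cost model is an assumption taken from the "L squared" remark in the text.

```python
def relative_dp_cost(block_fractions):
    """Relative dynamic programming time after splitting into blocks.

    block_fractions: fraction of the alignment length in each block;
    the fractions must add up to 1. Assumes cost proportional to L^2.
    """
    if abs(sum(block_fractions) - 1.0) > 1e-9:
        raise ValueError("block fractions must sum to 1")
    return sum(f * f for f in block_fractions)
```

Two equal blocks give 2 x 0.5^2 = 0.5, matching the example in the text. Four equal blocks give 0.25, and an uneven split such as 0.9 / 0.1 gives 0.82, so well-spaced anchors help the most.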
6. 2, as in the following example.
    muscle -in seqs.fa -out seqs.afa -maxiters 2

2.4 Fastest speed
If you want the fastest possible speed, then the following example shows the applicable options for proteins.
    muscle -in seqs.fa -out seqs.afa -maxiters 1 -diags1 -sv -distance1 kbit20_3
For nucleotides, use:
    muscle -in seqs.fa -out seqs.afa -maxiters 1 -diags1
At the time of writing, muscle with these options is faster than any other multiple sequence alignment program that I have tested. The alignments are not bad, especially when the sequences are closely related. However, as you might expect, this blazing speed comes at the cost of the lowest average accuracy of the options that muscle provides.

2.5 Huge alignments
If you have a very large number of sequences (several thousand), or they are very long, then the kbit20_3 option may cause problems because it needs a relatively large amount of memory. Better is to use the default distance measure, which is roughly 2x or 3x slower but needs less memory, like this:
    muscle -in seqs.fa -out seqs.afa -maxiters 1 -diags1 -sv

2.6 Accuracy: caveat emptor
Why do I keep using the clumsy phrase "average accuracy" instead of just saying "accuracy"? That's because the quality of alignments produced by MUSCLE varies, as do those produced by other programs such as CLUSTALW and T-Coffee. The state of the art leaves plenty of room for improvement. Sometimes the fastest speed options to muscle give alignme
7. MUSCLE User Guide
Multiple sequence comparison by log-expectation
by Robert C. Edgar
Version 3.5, August 2004
http://www.drive5.com/muscle
email: muscle at drive5.com

MUSCLE is updated regularly. Send me an e-mail if you would like to be notified of new releases.

Citation: Edgar, Robert C. (2004), MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research 32(5), 1792-97.

Table of Contents
1 Introduction
2 Quick Start
2.1 Installation
2.2 Making an alignment
2.3 Large alignments
2.4 Fastest speed
2.5 Huge alignments
2.6 Accuracy: caveat emptor
2.7 Pipelining
2.8 Refining an existing alignment
2.9 Profile-profile alignment
2.10 Sequence clustering
3 File Formats
3.1 Input files
3.1.1 Amino acid sequences
3.1.2 Nucleotide sequences
3.1.3 Determining sequence type
3.2 Output files
3.2.1 Sequence grouping
3.3 CLUSTALW format
3.4 MSF format
3.5 HTML format
8. MUSCLE looks at the first 100 letters in the input sequence data (excluding gaps). If 95% or more of those letters are valid nucleotides (AGCTUN), then the file is treated as nucleotides, otherwise as amino acids. This method almost always guesses correctly, but you can make sure by specifying the sequence type on the command line. This is done using the -seqtype option, which can take the following values:
    -seqtype protein    Amino acid
    -seqtype nucleo     Nucleotide
    -seqtype auto       Automatic detection (default)

3.2 Output files
By default, output is also written in FASTA format. All letters are upper-case and gaps are represented by dashes.

3.2.1 Sequence grouping
By default, MUSCLE re-arranges sequences so that similar sequences are adjacent in the output file. (This is done by ordering sequences according to a prefix traversal of the guide tree.) This makes the alignment easier to evaluate by eye. If you want the sequences to be output in the same order as the input file, you can use the -stable option.

3.3 CLUSTALW format
You can request CLUSTALW output by using the -clw option. This should be compatible with CLUSTALW, with the exception of the program name in the file header. You can ask MUSCLE to impersonate CLUSTALW by writing "CLUSTAL W (1.81)" as the program name by using -clwstrict. If you have problems parsing MUSCLE output with scripts designed for CLUSTALW, please let me know.

3.4 MSF format
MSF format, simila
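The automatic detection rule in section 3.1.3 is simple enough to restate as code. The Python sketch below follows the rule as worded in the text (first 100 letters, gaps excluded, 95% or more in AGCTUN means nucleotide). The function name, the parameter names and the behavior for empty input are assumptions; the real implementation may differ in details.

```python
def guess_seqtype(sequence_data, sample_size=100, threshold=0.95):
    """Guess "nucleo" or "protein" from raw sequence data.

    Hypothetical re-statement of the rule in section 3.1.3. Anything
    that is not a letter (dots, dashes, white space) is skipped.
    """
    letters = [c.upper() for c in sequence_data if c.isalpha()][:sample_size]
    if not letters:
        return "protein"  # assumption: no letters, fall back to amino acids
    nucleotide_like = sum(1 for c in letters if c in "AGCTUN")
    if nucleotide_like / len(letters) >= threshold:
        return "nucleo"
    return "protein"
```

A short peptide made up only of A, C, G and T residues would be misclassified by any rule of this kind, which is why the text suggests giving -seqtype explicitly when you want to be sure.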
9. 4 Using MUSCLE
4.1 How the algorithm works
4.2 Command-line options
4.3 The maxiters option
4.4 The maxtrees option
4.5 The maxhours option
4.6 The profile scoring function
4.7 Diagonal optimization
4.8 Anchor optimization
4.9 Log file
4.10 Progress messages
4.11 Running out of memory
4.12 Troubleshooting
4.13 Technical support
5 Command Line Reference

1 Introduction
MUSCLE is a program for creating multiple alignments of amino acid or nucleotide sequences. A range of options is provided that give you the choice of optimizing accuracy, speed, or some compromise between the two. Default pa
10. ault; -core is the default in muscled.

4.13 Technical support
I am happy to provide support. But I am busy, and am offering this program at no charge, so I ask you to make a reasonable effort to figure things out for yourself before contacting me.

5 Command Line Reference

Value option | Legal values | Default | Description
anchorspacing | Integer | 32 | Minimum spacing between anchor columns.
center | Floating point | [1] | Center parameter. Should be negative.
cluster1, cluster2 | upgma, upgmb, neighborjoining | upgmb | Clustering method. cluster1 is used in iteration 1 and 2, cluster2 in later iterations.
diaglength | Integer | 24 | Minimum length of diagonal.
diagmargin | Integer | 5 | Discard this many positions at ends of diagonal.
distance1 | kmer6_6, kmer20_3, kmer20_4, kbit20_3, kmer4_6 | kmer6_6 (amino) or kmer4_6 (nucleo) | Distance measure for iteration 1.
distance2 | kmer6_6, kmer20_3, kmer20_4, kbit20_3, pctid_kimura, pctid_log | pctid_kimura | Distance measure for iterations 2, 3 ...
gapopen | Floating point | [1] | The gap open score. Must be negative.
hydro | Integer | 5 | Window size for determining whether a region is hydrophobic.
hydrofactor | Floating point | 1.2 | Multiplier for gap open/close penalties in hydrophobic regions.
in | Any file name | standard input | Where to find the input sequences.
log | File name | None | Log file name, delete existing file.
loga | File name | None | Log file name, append to existing f
11. average of the pair-wise alignment score of every pair of sequences in the alignment. Bipartitions are chosen by deleting an edge in the guide tree; each of the two resulting subtrees defines a subset of sequences. This procedure is called "tree-dependent refinement". One iteration of tree-dependent refinement tries bipartitions produced by deleting every edge of the tree in depth order, moving from the leaves towards the center of the tree. Iterations continue until convergence or up to a specified maximum.

For convenience, the major steps in MUSCLE are described as "iterations", though the first three iterations all do quite different things and may take very different lengths of time to complete (the tree-dependent refinement iterations 3, 4 ... are true iterations and will take similar lengths of time).

Iteration | Actions
1 | Distance matrix by k-mer clustering, estimate tree, progressive alignment according to this tree.
2 | Distance matrix by pair-wise identities from current multiple alignment, estimate tree, progressive alignment according to new tree, repeat until convergence or specified maximum number of times.
3, 4 ... | Tree-dependent refinement. One iteration visits every edge in the tree one time.

4.2 Command-line options
There are two types of command-line options: value options and flag options. Value options are followed by the value of the given parameter, for example -in <filename>; flag options just stand for themselves,
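The way tree-dependent refinement picks its bipartitions (delete one edge of the guide tree, and the two resulting subtrees define two subsets of sequences) can be illustrated with a small Python sketch. The tree representation (nested 2-tuples with string leaves), the function names and the post-order visiting order are assumptions chosen for the example; they approximate, but are not guaranteed to match, the edge order MUSCLE uses.

```python
def leaves(node):
    """Return the leaf labels under a node of a nested-tuple binary tree."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


def bipartitions(tree):
    """List (subset, complement) pairs, one per edge of the tree.

    Edges are visited children-first, so splits near the leaves come
    before splits near the root.
    """
    everything = frozenset(leaves(tree))
    result = []

    def visit(node):
        if not isinstance(node, str):
            for child in node:
                visit(child)
        below = frozenset(leaves(node))
        if below != everything:  # the root has no edge above it
            result.append((below, everything - below))

    visit(tree)
    return result
```

For the tree `(("A", "B"), ("C", "D"))` this yields six splits, one for each edge. The two edges that meet at the root give the same split, {A, B} versus {C, D}, seen from either side. In the refinement step, a profile would be built for each side and the two profiles re-aligned, keeping the result only if the objective score improves.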
12. formation: http://www.drive5.com/muscle.

Check the input file to make sure it is in valid FASTA format. Try giving it to another sequence analysis program that can accept large FASTA files (e.g., the NCBI formatdb utility) to see if you get an informative error message. Try dividing the file into two halves and using each half individually as input. If one half fails and the other does not, repeat until the problem is localized as far as possible.

Use -log or -loga and -verbose and check the log file to see if there are any messages that give you a hint about the problem. Look at the peak memory requirements (reported in progress messages) to see if you may be exceeding the physical or virtual memory capacity of your computer.

If muscle crashes without giving an error message, or hangs, then you may need to refer to the source code or use a debugger. A "debug" version, muscled, may be provided. This is built from the same source code but with the DEBUG macro defined and without compiler optimizations. This version runs much more slowly (perhaps by a factor of three or more), but does a lot more internal checking and may be able to catch something that is going wrong in the code. The -core option specifies that muscle should not catch exceptions. When -core is specified, an exception may result in a debugger trap or a core dump, depending on the execution environment. The -nocore option has the opposite effect. In muscle, -nocore is the def
13. ile.
maxdiagbreak | Integer | 1 | Maximum distance between two diagonals that allows them to merge into one diagonal.
maxhours | Floating point | None | Maximum time to run in hours. The actual time may exceed the requested limit by a few minutes. Decimals are allowed, so 1.5 means one hour and 30 minutes.
maxiters | Integer 1, 2 ... | 16 | Maximum number of iterations.
maxtrees | Integer | 1 | Maximum number of new trees to build in iteration 2.
minbestcolscore | Floating point | [1] | Minimum score a column must have to be an anchor.
minsmoothscore | Floating point | [1] | Minimum smoothed score a column must have to be an anchor.
objscore | sp, ps, dp, xp, spf, spm | spm | Objective score used by tree-dependent refinement. sp = sum-of-pairs score; spf = sum-of-pairs score (dimer approximation); spm = sp for < 100 seqs, otherwise spf; dp = dynamic programming score; ps = average profile-sequence score; xp = cross profile score.
out | File name | standard output | Where to write the alignment.
root1, root2 | pseudo, midlongestspan, minavgleafdist | pseudo | Method used to root tree; root1 is used in iteration 1 and 2, root2 in later iterations.
seqtype | protein, nucleo, auto | auto | Sequence type.
smoothscoreceil | Floating point | [1] | Maximum value of column score for smoothing purposes.
smoothwindow | Integer | 7 | Window used for anchor column smoothing.
SUEFF | Floating point value between 0 and 1 | 0.1 | Constant used in UPGMB clustering. Determines the relative fraction of average linkage, SUEFF, vs. nearest-neighbor linkage, 1
14. le anchor optimization. Default is anchors.
nocore | no in muscle, yes in muscled | Catch exceptions and give an error message if possible.
quiet | no | Do not display progress messages.
refine | no | Input file is already aligned, skip first two iterations and begin tree-dependent refinement.
sp | no | Use sum-of-pairs protein profile score (PAM200). Default is le.
spn | maybe | Use sum-of-pairs nucleotide profile score (BLASTZ parameters). This is the only option for nucleotides, and is therefore the default.
stable | no | Preserve input order of sequences in output file. Default is to group sequences by similarity (group).
sv | no | Use sum-of-pairs profile score (VTML240). Default is le.
termgapsfull | no [1] | Terminal gaps penalized with full penalty. Not fully supported in this version.
termgapshalf | yes [1] | Terminal gaps penalized with half penalty. Not fully supported in this version.
termgapshalflonger | no [1] | Terminal gaps penalized with half penalty if gap relative to longer sequence, otherwise with full penalty. Not fully supported in this version.
verbose | no | Write parameter settings and progress messages to log file.
version | no | Write version string to stdout and exit.

Notes
[1] Default depends on the profile scoring function. To determine the default, use -verbose -log and check the log file.
15. mation will be written, including the command line used to invoke muscle, the resulting internal parameter settings and also progress messages. The content and format of verbose log file output is subject to change in future versions.

The use of a log file may seem contrary to Unix conventions for using standard output and standard error. I like these conventions, but never found a fully satisfactory way to use them. I like progress messages (see below), but they mess up a file if you re-direct standard error and there are errors or warning messages too. I could try to detect whether a standard file handle is a tty device or a disk file and change behavior accordingly, but I regard this as too complicated and too hard for the user to understand. On Windows, it can be hard to re-direct standard file handles, especially when working in a GUI debugger. Maybe one day I will figure out a better solution (suggestions welcomed).

I highly recommend using -verbose and -log[a], especially when running muscle in a batch mode. This enables you to verify whether a particular alignment was completed and to review any errors or warnings that occurred.

4.10 Progress messages
By default, muscle writes progress messages to standard error periodically so that you know it's doing something and get some feedback about the time and memory requirements for the alignment. Here is a typical progress message.
    00:00:23  25 Mb  Iter 2  87.20%  Build guide tree
The fields a
16. nd inserting columns of gaps where needed. Output is stored in both.afa.

MUSCLE does not compute a similarity measure or measure of statistical significance (such as an E-value), so this option is not useful for discriminating homologs from unrelated sequences. For this task, I recommend Sadreyev & Grishin's COMPASS program.

2.10 Sequence clustering
The first stage in MUSCLE is a fast clustering algorithm. This may be of use in other applications. Typical usage is:
    muscle -cluster -in seqs.fa -tree1 tree.phy
The sequences will be clustered, and a tree written to tree.phy. Options -weight1, -distance1, -cluster1 and -root1 can be applied if desired. Note that by default, UPGMA clustering is used. You can use neighborjoining if you prefer, but note that this is substantially slower than UPGMA for large numbers of sequences.

3 File Formats
MUSCLE uses FASTA format for both input and output. For output only, it also offers CLUSTALW, MSF and HTML formats, using the -clw, -msf and -html command-line options.

3.1 Input files
Input files must be in FASTA format. These are plain text files (word processing files such as Word documents are not understood!). Unix, Windows and DOS text files are supported (end-of-line may be NL or CR NL). There is no explicit limit on the length of a sequence; however, if you are running a 32-bit version of muscle then the maximum will be very roughly 10,000 letters due to maximum addressable size
17. nts that are better than T-Coffee, though the reverse will more often be the case. With challenging sets of sequences, it is a good idea to make several different alignments using different muscle options and to try other programs too. Regions where different alignments agree are more believable than regions where they disagree.

2.7 Pipelining
Input can be taken from standard input, and output can be written to standard output. This is the default, so our first example would also work like this:
    muscle < seqs.fa > seqs.afa

2.8 Refining an existing alignment
You can ask muscle to try to improve an existing alignment by using the -refine option. The input file must then be a FASTA file containing an alignment. All sequences must be of equal length; gaps can be specified using dots or dashes. For example:
    muscle -in seqs.afa -out refined.afa -refine

2.9 Profile-profile alignment
A fundamental step in the MUSCLE algorithm is aligning two multiple sequence alignments, each of which contains some of the input sequences. This operation is sometimes called "profile-profile alignment". If you have two existing alignments of related sequences, you can use the -profile option of MUSCLE to align those two alignments. Typical usage is:
    muscle -profile -in1 one.afa -in2 two.afa -out both.afa
The alignments in one.afa and two.afa, which must be in aligned FASTA format, are aligned to each other, keeping input columns intact a
18. options that will take an unknown length of time. This is done by using -maxhours, which specifies a floating-point number of hours. If this time is exceeded, muscle will write out the current alignment and stop. For example:
    muscle -in huge.fa -out huge.afa -maxiters 9999 -maxhours 24.0
Note that the actual time may exceed the specified limit by a few minutes while muscle finishes up on a step. It is also possible for no alignment to be produced if the time limit is too small.

4.6 The profile scoring function
Three different protein profile scoring functions are supported: the log-expectation score (-le option) and a sum-of-pairs score using either the PAM200 matrix (-sp) or the VTML240 matrix (-sv). The log-expectation score is the default as it gives better results on our tests, but it is typically two to three times slower than the sum-of-pairs score. For nucleotides, -spn is currently the only option (which is of course the default for nucleotide data, so you don't need to specify this option).

4.7 Diagonal optimization
Creating a pair-wise alignment by dynamic programming requires computing an L1 x L2 matrix, where L1 and L2 are the sequence lengths. A trick used in algorithms such as BLAST is to reduce the size of this matrix by using fast methods to find "diagonals", i.e. short regions of high similarity between the two sequences. This speeds up the algorithm at the expense of some reduction in accuracy. MUSCLE uses a
19. r to CLUSTALW, is requested by using the -msf option. As with CLUSTALW format, this is easier for people to read than FASTA.

3.5 HTML format
I've added an experimental feature starting in version 3.4. To get a Web page as output, use the -html option. The alignment is colored using a color scheme from Eric Sonnhammer's Belvu editor, which is my personal favorite. A drawback of this option is that the Web page typically contains a very large number of HTML tags, which can be slow to display in the Internet Explorer browser. The Netscape browser works much better. If you have any ideas about good ways to make Web pages, please let me know.

4 Using MUSCLE
In this section we give more details of the MUSCLE algorithm and the more important options offered by the muscle implementation.

4.1 How the algorithm works
We won't give a complete description of the MUSCLE algorithm here; for that, you will have to read the paper. But hopefully a summary will help explain what some of the command-line options do and how they might be useful in your work.

The first step is to calculate a tree. In CLUSTALW, this is done as follows. Each pair of input sequences is aligned and used to compute the pair-wise identity of the pair. Identities are converted to a measure of distance. Finally, the distance matrix is converted to a tree using a clustering method (CLUSTALW uses neighbor-joining). If you have 1,000 sequences, there are (1,000 x 999)/2 = 499,500 pairs,
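The pair count quoted in section 4.1 follows from the usual formula n(n - 1)/2 for the number of unordered pairs among n sequences. A short Python check (the function name is just for illustration) reproduces the figure and shows how quickly the number of pair-wise comparisons grows:

```python
def pair_count(n):
    """Number of distinct unordered pairs among n sequences."""
    return n * (n - 1) // 2


for n in (10, 100, 1000, 5000):
    print(n, "sequences ->", pair_count(n), "pairs")
```

For 1,000 sequences this gives 499,500 pairs, as in the text. At 5,000 sequences it is already 12,497,500, which is why the cost of each individual pair-wise comparison matters so much for the tree-building step.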
20. rameters are those that give the best average accuracy in our tests. Using versions current at the time of writing, my tests show that MUSCLE can achieve both better average accuracy and better speed than CLUSTALW or T-Coffee, depending on the chosen options.

2 Quick Start
The MUSCLE algorithm is delivered as a command-line program called muscle. If you are running under Linux or Unix, you will be working at a shell prompt. If you are running under Windows, you should be in a command window (nostalgically known to us older people as a DOS prompt). If you don't know how to use command-line programs, you should get help from a local guru.

2.1 Installation
Copy the muscle binary file to a directory that is accessible from your computer. That's it: there are no configuration files, libraries, environment variables or other settings to worry about. If you are using Windows, then the binary file is named muscle.exe. From now on, muscle should be understood to mean "muscle if you are using Linux or Unix, muscle.exe if you are using Windows".

2.2 Making an alignment
Make a FASTA file containing some sequences. (If you are not familiar with FASTA format, it is described in detail later in this Guide.) For now, just to make things fast, limit the number of sequences in the file to no more than 50 and the sequence length to no more than 500. Call the input file seqs.fa. (An example file named seqs.fa is distributed with the standard MUSCLE package.) Make su
21. re as follows.
    00:00:23  Elapsed time since muscle started.
    25 Mb     Peak memory use in megabytes (i.e., not the current usage, but the maximum amount of memory used since muscle started).
    Iter 2    Iteration currently in progress.
    87.20%    How much of the current step has been completed (percentage).
    Build     A brief description of the current step.
The -quiet command-line option disables writing progress messages to standard error. If the -verbose command-line option is specified, a progress message will be written to the log file when each iteration completes. So -quiet and -verbose are not contradictory.

4.11 Running out of memory
The muscle code tries to deal gracefully with low-memory conditions by using the following technique. A block of "emergency reserve" memory is allocated when muscle starts. If a later request to allocate memory fails, this reserve block is made available, and muscle attempts to save the current alignment. With luck, the reserved memory will be enough to allow muscle to save the alignment and exit gracefully with an informative error message.

4.12 Troubleshooting
Here is some general advice on what to do if muscle fails and you don't understand what happened. The code is designed to fail gracefully with an informative error message when something goes wrong, but there will no doubt be situations I haven't anticipated (not to mention bugs).

Check the MUSCLE web site for updates, bug reports and other relevant in
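If you capture standard error from a batch run, the progress fields described in section 4.10 can be pulled apart with a regular expression. The Python sketch below assumes a line laid out as elapsed time (hh:mm:ss), peak memory followed by "Mb", "Iter" and a number, a percentage, and a free-text step description. That layout is an assumption reconstructed from the field list, and the text itself warns that output formats may change between versions, so treat this as a starting point rather than a dependable parser.

```python
import re

# Assumed layout: "00:00:23  25 Mb  Iter 2  87.20%  Build guide tree"
PROGRESS_RE = re.compile(
    r"^\s*(?P<elapsed>\d+:\d{2}:\d{2})\s+"
    r"(?P<peak_mb>\d+)\s*Mb\s+"
    r"Iter\s+(?P<iteration>\d+)\s+"
    r"(?P<percent>\d+(?:\.\d+)?)%\s+"
    r"(?P<step>.*\S)\s*$"
)


def parse_progress(line):
    """Return the progress fields as a dict, or None if the line differs."""
    match = PROGRESS_RE.match(line)
    if match is None:
        return None
    hours, minutes, seconds = (int(x) for x in match.group("elapsed").split(":"))
    return {
        "elapsed_seconds": hours * 3600 + minutes * 60 + seconds,
        "peak_mb": int(match.group("peak_mb")),
        "iteration": int(match.group("iteration")),
        "percent": float(match.group("percent")),
        "step": match.group("step"),
    }
```

Returning None for lines that do not match lets a caller skip error and warning messages that may be mixed into the same stream.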
22. re the directory containing the muscle binary is in your path. (If it isn't, you can run it by typing the full path name, and the following example command lines must be changed accordingly.) Now type:
    muscle -in seqs.fa -out seqs.afa
You should see some progress messages. If muscle completes successfully, it will create a file seqs.afa containing the alignment. By default, output is created in "aligned FASTA" format (hence the .afa extension). This is just like regular FASTA except that gaps are added in order to align the sequences. This is a nice format for computers but not very readable for people, so to look at the alignment you will want an alignment viewer such as Belvu, or a script that converts FASTA to a more readable format. You can also use the -msf command-line option to request output in MSF format, which is easier to understand for people. If muscle gives an error message and you don't know how to fix it, please read the Troubleshooting section.

The default settings are designed to give the best accuracy, so this may be all you need to know.

2.3 Large alignments
If you have a large number of sequences (a few thousand), or they are very long, then the default settings may be too slow for practical use. A good compromise between speed and accuracy is to run just the first two iterations of the algorithm. On average, this gives accuracy equal to T-Coffee and speeds much faster than CLUSTALW. This is done by the option -maxiters
23. s. This gives us a new distance matrix, from which we estimate a new tree. We compare the old and new trees, and re-align subgroups where needed to produce a progressive multiple alignment from the new tree. If the two trees are identical, there is nothing to do; if there are no subtrees that agree (very unusual), then the whole progressive alignment procedure must be repeated from scratch. Typically we find that the tree is pretty stable near the leaves, but some re-alignments are needed closer to the root. This procedure (compute pair-wise identities, estimate new tree, compare trees, re-align) is iterated until the tree stabilizes or until a specified maximum number of iterations has been done. We call this process "tree refinement", although it also tends to improve the alignment.

We now keep the tree fixed and move to a new procedure which is designed to improve the multiple alignment. The set of sequences is divided into two subsets (i.e., we make a bipartition on the set of sequences). A profile is constructed for each of the two subsets based on the current multiple alignment. These two profiles are then re-aligned to each other using the same pair-wise alignment algorithm as used in the progressive stage. If this improves an "objective score" that measures the quality of the alignment, then the new multiple alignment is kept, otherwise it is discarded. By default, the objective score is the classic sum-of-pairs score that takes the (sequence-weighted)
