Home

MrBayes manual - the CCMAR Computational Cluster Facility: GYRA

1. 34 the probability of each partition or clade in the tree and the phylogram lower tree gives the branch lengths measured in expected substitutions per site Tarsius_syrichta 1 Lemur_catta 2 Pongo 6 Hylobates 7 Macaca_fuscata 8 M_mulatta 9 M_fascicularis 10 M_sylvanus 11 Saimiri_sciureus 12 a a Tarsius_syrichta 1 a ata Lemur_catta 2 Homo_sapiens 3 Pan 4 Gorilla 5 Niesacetcosses Hylobates 7 Macaca_fuscata 8 o Nea Pongo 6 M_mulatta 9 l M_fascicularis 10 M_sylvanus 11 baa ca E EE Saimiri_sciureus 12 0 200 expected changes per site In the background the sumt command creates five additional files The first is a parts file which contains the key to taxon bipartitions The second and third are the tstat and vstat files which contain the summaries of partition statistics and branch length statistics respectively The next is the con file which includes the consensus trees By default the consensus tree with all the relevant clade support values and branch length information is printed in a format suitable for FigTree Once the con file is opened in FigTree you can display statistics such as the posterior probability and its as
2. NO oP WNEF l 31 8 Macaca_fuscata 9 M_mulatta 10 M_fascicularis 11 M_sylvanus 12 Saimiri_sciureus Key to taxon bipartitions saved to file primates nex parts ID Partition TLR k kkk Qe SD R yn he BSH aR lettin ais CM kee yes 5 Kodin eas OLTP res E ee ities td oe OPE ahs ees xe Dea ee areas 10 35 os egies Ta E EEE T2 SS eia enia psd T3 SFP ea aai 2K KK LAs KRG Bo tees 15 RRR x TG Se kine P a a xk 17 RRR kk 18 kkkk 19 E 205 ERK Si ce sae DAS ne Cet AK Then it gives a table over the informative bipartitions the ones with more than one taxon included specifying the number of times the partition was sampled obs the probability of the partition Probab the standard deviation of the partition frequency Sd s across runs the min and max of the standard deviation across runs Min s and Max s and finally the number of runs in which the partition was encountered In our analysis there is overwhelming support for a single tree so all partitions in this tree have a posterior probability of 1 0 Summary statistics for informative taxon bipartitions 32 saved to file primates nex tstat ID obs 13 302 14 302 15 302 16 302 17 302 18 302 19 302 20 302 21 302 Probab 000000 000000 000000 000000 000000 000000 000000 000000 000000 Sd s 0 000000 0 000000 0 000000 0 000000 0 000000 0 000000 0 000000 0
3. The root of the p and t file names can be altered using the Filename setting The Printfreq parameter controls the frequency with which the state of the chains is printed to screen You can leave Printfreq at the default value print to screen every 100th generation 23 When you set up your model and analysis the number of runs and heated chains MrBayes creates starting values for the model parameters A different random tree with predefined branch lengths is generated for each chain and most substitution model parameters are set to predefined values For instance station ary state frequencies start out being equal and unrooted trees have all branch lengths set to 0 1 The starting values can be changed by using the Startvals command For instance user defined trees can be read into MrBayes by execut ing a Nexus file with a trees block and then assigned to different chains using the Startvals command After a completed analysis MrBayes keeps the pa rameter values of the last generation and will use those as the starting values for the next analysis unless the values are reset using mcmc starttrees random startvals reset Since version 3 2 MrBayes prints all parameter values of all chains cold and heated to a checkpoint file every Checkfreq generations by default every 100 000 generations The checkpoint file has the suffix ckp If you run an analysis and it is stopped prematurely you can restart it from the last checkpoi
4. bution There are two variants implemented in MrBayes and they are invoked using the lset omegavar option In the Ny98 model Nielsen and Yang 1998 there are three classes with potentially different w values 0 lt w lt 1 w2 1 and w3 gt 1 The M3 model also has three classes of w values but these values are less constrained in that they only have to be ordered w lt wo lt w3 If for instance you would like to invoke the M3 model use the command lset omegavar M3 When you use a model with variation in selection pressure across sites you probably want to infer the positively selected sites If you select report possel yes before you start your analysis MrBayes will calculate the probability of each site being in a positively selected omega class For the M3 model for instance the likelihood of the site is calculated under each of the three categories taking the category frequencies into account and then the likelihoods are summed to yield the total likelihood of the site Finally the proportion of this sum originating from categories that are positively selected those that have an w value larger than 1 this is the posterior probability of the site being positively selected Once the probabilities of each site being positively selected are printed to file they can be summarized using the standard sump command When interpreting the output from the Ny98 model it is helpful to know that pi pi N and pi are the frequ
5. 011955 033104 029295 OO O O OO OR OOO 660632 448005 072349 082690 081064 189667 233264 026377 037226 079370 103081 477699 331021 049417 060417 056059 140537 172907 015679 022580 056479 070148 FOOrRrRPRRRFRFORRFRO O O NUNNNNNNNNNDN length 12 0 433951 0 005270 0 305526 0 572054 0 length 13 0 248133 0 002680 0 148162 0 338245 0 length 14 0 029261 0 000142 0 008043 0 052766 0 length 15 0 273555 0 003600 0 163707 0 403779 0 length 16 0 035972 0 000125 0 016521 0 059122 0 length 17 0 118515 0 001761 0 044026 0 199364 0 length 18 0 124953 0 001162 0 052839 0 178939 0 length 19 0 057618 0 000424 0 017331 0 091541 0 length 20 0 082425 0 000486 0 051259 0 137057 0 length 21 0 049766 0 000398 0 018269 0 094330 0 426316 0 998 2 243948 0 997 2 028329 0 997 2 268913 1 011 2 035263 0 998 2 119746 0 998 2 122538 0 997 2 057151 1 000 2 080014 0 997 2 047661 0 997 2 This table is followed by two trees The clade credibility tree upper tree gives Clade credibility values a Aaa SEa P a a ee Yn ry yn EE E E E EE ES NAET EE le ese eee E N EPEE E eee oe SoS 100 100 eSRessos 100 100 VSS aes 100 100 100 ereeoss 100 100 yarra enuennenams Phylogram based on average branch lengths
6. 58 4 4 Rate Variation Across Sites By default MrBayes assumes that all characters evolve at the same rate measured in expected changes per site over the tree There are four ways in which you can allow rate variation across sites The simplest method is to assume that rates vary over sites according to a gamma distribution Yang 1993 The gamma model can be combined with spatial autocorrelation between the rates at adjacent sites the autocorrelated gamma model A completely different approach to rate variation across sites is to allow a proportion of sites to be invariable This model can be combined with the gamma model but not with the autocorrelated gamma model Finally it is possible to divide characters into groups evolving at different rates the partitioned or site specific rate model 4 4 1 Gamma distributed Rate Model The commonly used model of gamma shaped rate variation across sites is invoked using lset rates gamma The shape of the gamma distribution is determined by the socalled a alpha parameter When this parameter is small below 1 the distribution takes on an L shaped form with a few sites evolving rapidly while most sites are conserved Conversely when a is above 1 the distribution becomes similar to a normal distribution with less and less variation in rates across sites as a becomes larger In practice the gamma distribution is approximated using a small number of discrete rate categories Yang 1994 By de
7. MrBayes we have had problems with infinite loops when the quit command is not included at the end of the file This problem should have been solved in version 3 1 How are gaps and missing characters treated MrBayes uses the same method as most maximum likelihood programs it treats gaps and missing characters as missing data Thus gaps and missing char acters will not contribute any phylogenetic information There is no way in which you can treat gaps as a fifth state in MrBayes but see below for information on how you can use gap information in your analysis How do I use gap information in my analysis Often insertion and deletion events contain phylogenetically useful informa tion Although MrBayes 3 is not able to do statistical multiple sequence alignment treating the insertion deletion process under a realistic stochastic model there is nevertheless a way of using some of the information in the indel events in your MrBayes analysis Code the indel events as binary characters presence absence of the gap and include them as a separate binary restriction data partition in your analysis See more information on this possibility in the section on the binary model in this manual What do I do when it is difficult to get convergence There are several things you can do to improve the efficiency of your analysis The simplest is to just increase the length of the run However the computational cost of doing so may be prohibitive A be
8. suites Some performance gain can be achieved by using different compiler optimiza tion flags Keep in mind that not all options are safe to use A number of opti mization methods might give a slowdown instead of a speedup and some options relax numerical assumptions leading to a program that might produce incorrect results When you want to file a bug report make sure that you compiled the program with the default options 7 1 1 Compiling with GNU Make and configure The configure script that is distributed with the MrBayes source code examines your build environment for appropriate compiler flags and program settings In most cases you can run this script configure without any arguments and it will produce a Makefile which can be used to compile the MrBayes executable with the command make There are a number of options you can give the configure script to alter it s behavior These are enable mpi Use this option if you want to compile the parallel MPI version of MrBayes This option is off by default meaning that a serial version of MrBayes will be built See below for more information on how to compile and run the MPI version of MrBayes 80 enable fastlog With this option a fast approximation algorithm is used instead of the normal log math functions Since this approximation algorithm can actually slow down the program on some computer architectures this option is turned off by default enable debug If yo
9. we cannot compare the rate of loss of antennal articles to the rate at which a yellow patch evolves into a green patch If state labels are truly arbitrary then the stationary state frequencies of the binary model must be fixed to be equal such that the estimation of model parameters becomes independent of the labeling of character states An alternative is to consider the standard model which provides more sophisticated ways of dealing with arbitrary state labels When is the correction for ascertainment bias important This is strongly dependent on the size of the tree the sum of the branch lengths on the tree The larger the tree the less important the correction for ascertainment bias becomes In our experience when there are more than 20 30 taxa even the most severe bias only informative characters observed is associated with an insignificant correction of the likelihood values 4 2 5 Standard Discrete Morphology Model The model used by MrBayes for standard discrete data is based on the ideas originally presented by Lewis 2001 Essentially the model is analogous to a JC model except that it has a variable number of states from 2 to 10 For instance a three state standard character would be evolving according to the instantaneous rate matrix a a g Loe ae OF Ny ap Se ap ee S Because all rates are the same we can maintain the essential property of stan dard characters namely that state labels are arbitrary Thus the
10. 000000 0 000000 Min s 1 000000 1 000000 1 000000 1 000000 1 000000 1 000000 1 000000 1 000000 1 000000 Max s 1 000000 1 000000 1 000000 1 000000 1 000000 1 000000 1 000000 1 000000 1 000000 Nruns We then get a table summarizing branch and node parameters in our case the branch lengths The indices in this table refer to the key to partitions For instance length 14 is the length of the branch corresponding to partition ID 14 As we noted above this is the branch grouping humans and chimps The meaning of most of the values in this table is obvious The last two columns give a convergence diagnostic the Potential Scale Reduction Factor PSRF and the numnber of runs in which the partition was encountered The PSRF diagnostic is the same used for the regular parameter samples and it should approach 1 0 as runs converge Summary statistics for branch and node parameters saved to file primates nex vstat Parameter Variance 95 HPD Interval PSRF length 1 length 2 length 3 length 4 length 5 length 6 length 7 length 8 length 9 length 10 length 11 486115 335038 050689 060501 057754 143419 172066 016107 023164 056704 069330 007159 003829 000114 000144 000183 000537 001072 000031 000045 000147 000366 gt OOGO OR OOOO OR OO OOO OOOO 349117 222216 033718 039651 031723 100539 112808 005941
11. 3 4 and 5 for a six state character Because state labels are arbitrary in the standard model we cannot estimate unequal stationary state frequencies or substitution rates recall that the stationary state frequencies are an important factor in determining the latter However it is still possible to allow the state frequencies rates to vary over sites according to some appropriate distribution MrBayes uses a symmetric Dirichlet distribution for this purpose For binary data the analogy of the Dirichlet distribution is called the beta distribution MrBayes uses Dirichlet and beta interchangeably for the distribution depending on context The approach is similar to the one used to allow rate variation across sites according to a gamma distribution we calculate the likelihood of a site assuming different discrete categories of asymmetry and then we sum the values to obtain the site likelihood The symmetric Dirichlet distribution has one parameter that determines its 56 shape just like the alpha parameter determines the shape of the gamma distribu tion The larger the parameter of the symmetric Dirichlet the less transition rate stationary frequency asymmetry there is across sites By default the parameter is fixed to infinity prset symdirihyperpr fixed infinity this corresponds to the standard assumption of no transition rate asymmetry across sites the rate of going from 0 to 1 is equal to the rate of going from 1 to 0 for all c
12. Data into MrBayes First open up the Nexus data file in a text editor The DATA block of the Nexus file should look familiar but there are some differences compared to the primates nex file in the format statement Format datatype mixed Standard 1 166 DNA 167 3246 interleave yes gap missing First the datatype is specified as datatype mixed Standard 1 166 DNA 167 3246 This means that the matrix contains standard morphology characters in columns 1 166 and DNA characters in the remaining columns The mixed datatype is an extension to the Nexus standard This extension was originated by MrBayes 3 and may not be compatible with other phylogenetics programs Second the matrix is interleaved It is often convenient to specify mixed data in interleaved format with each block consisting of a natural subset of the matrix such as the morphological data or one of the gene regions 36 3 2 Dividing the Data into Partitions By default MrBayes partitions the data according to data type There are only two data types in the matrix so this model will include only a morphology stan dard and a DNA partition To divide the DNA partition into gene regions it is convenient to first specify character sets In principle this can be done from the command line but it is more convenient to do it in a MrBayes block in the data file With the MrBayes distribution we added a file cynmix run nex with a complete MrBayes block For this section
13. The Windows and Mac versions of MrBayes 3 1 are not multithreaded 76 so they will not take advantage of more than one processor on a single machine However you should be able to run two copies of MrBayes without noticeable decrease in performance on a dual processor machine provided you have enough RAM for both analyses How much memory is required You can calculate the amount of memory needed to store the conditional like lihoods for an analysis roughly as 2 taxa states in the Q matrix 4 gamma categories 4 bytes for the single precision float version of the code double the memory requirement for the double precision code The program will need slightly more memory for various book keeping purposes but the bulk of the memory required for an analysis is typically occupied by the conditional likeli hoods How do I fix the tree topology during an analysis In principle one can fix a tree topology by specifying constraints for all of the nodes in the tree However we do not recommend doing this because it is computationally very inefficient A better way is to set the proposal probability of all topology moves to 0 using the props command Then you need to switch on one proposal that changes branch lengths but not topology by increasing its proposal probability from 0 to some reasonable positive value like 5 The node slider is in our experience the best of these proposals 6 Differences Between Version 2 and V
14. and Hasegawa 1996 the Mtmam model Cao et al 1998 Yang Nielsen and Hasegawa 1998 the WAG model Wheland and Goldman 2001 the Rtrev model Dimmic et al 2002 the Cprev model Adachi et al 2000 the Vt model Muller and Vingron 2000 and the Blosum62 model Henikoff and Henikoff 1992 Each model is appropriate for a particular type of proteins For instance if you are analyzing mammal mitochon drial proteins you might want to use the Mtmam model Invoke that model by typing prset aamodelpr fixed mtmam 4 2 2 Estimating the Fixed Rate Model MrBayes allows a convenient way of estimating the fixed rate model for your amino acid data instead of specifying it prior to the analysis If you choose this option MrBayes will allow the MCMC sampler to explore all of the fixed rate models listed above including the Poisson model by regularly proposing new models When the MCMC procedure has converged each model will contribute to your results in proportion to its posterior probability For instance if you are analyzing mammal 51 mitochondrial proteins it is likely that the Mtmam model will contribute most to the posterior distribution but it is possible that some other models for instance the Mtrev model will contribute significantly too A nice feature of the MCMC model estimation is that the extra computational cost is negligible To allow so called model jumping between fixed rate amino acid models simply set the prio
15. character types other than restric tion sites For instance the model can be used for gap characters The presence and absence of gaps must be coded consistently for all characters let us assume here that absence of a gap is coded as 0 and presence as 1 Since the detection of gaps is typically contingent on observing some sequence length variation neither all absence nor all presence characters can be observed Thus the correct ascer tainment bias for gap characters is variable The parameters 7 and m would represent the rate at which insertions and deletions occur respectively assuming that state 0 denotes absence of a gap The binary model can also be used for ecological morphological or other binary characters of arbitrary origin However if the binary model is applied to more than one character then there is an implicit assumption that the state labels are not arbitrary for those characters That is the 0 state in one character must somehow be comparable to the 0 state in the others For instance 0 could mean absence or presence of a particular type of feature such as a wing vein a restriction site 54 or a gap in a DNA sequence It is not appropriate to apply the default binary model to a set of characters where the state labels are arbitrary as is true of most morphological characters Thus we can possibly estimate the rate of loss versus gain of wing veins over a set of consistently coded wing venation characters but
16. does not start the analysis Type mcmcp ngen 20000 to set the number of generations to 20 000 You can type help memc to confirm that the setting was changed appropriately By default MrBayes will run two simultaneous completely independent anal yses starting from different random trees Nruns 2 Running more than one analysis simultaneously allows MrBayes to calculate convergence diagnostics on the fly which is very helpful in determining when you have a good sample from the posterior probability distribution The idea is to start each run from different randomly chosen trees In the early phases of the run the two runs will sample very different trees but when they have reached convergence when they produce a good sample from the posterior probability distribution the two tree samples should be very similar To make sure that MrBayes compares tree samples from the different runs check that Mcmcdiagn is set to yes and that Diagnfreq is set to some reason able value such as every 1000th generation MrBayes will now calculate vari ous run diagnostics every Diagnfreq generation and print them to a file with the name lt Filename gt mcmc The most important diagnostic a measure of the sim ilarity of the tree samples in the different runs will also be printed to screen every Diagnfreq generation Every time the diagnostics are calculated either a fixed number of samples burnin or a percentage of samples burninfrac from the beginning
17. estimated There is no tratio option in the lset command of version 3 of MrBayes prset basefreqpr Now set using prset statefreqpr prset brlenpr Now set using prset brlenspr The options are now more complicated as well since the prior includes information both about the general type of the branch lengths clock non clock and the specific shape of the prior for example uniform or exponential prset siteratepr Now set using prset ratepr The options are fixed or variable Dirichlet The parameters of the Dirichlet can be set to reflect various types of prior information concerning the site rates prset qmatpr Now set using prset revmatpr and prset aarevmatpr for nucleotide and amino acid substitution rates respectively 79 set This command only controlled the autoclose option in MrBayes 2 In version 3 it controls a number of different things including the currently selected partition shownodes This command is no longer included in MrBayes 3 Use showtree to display the user tree 7 Advanced Topics 7 1 Compiling MrBayes Compiling the MrBayes executable from the source code can be done on several different compilers targeting all the common operating systems Macintosh Win dows and Unix The easiest way to build MrBayes is to use the included configure script and with a make tool for the Makefile this script generates One can also compile MrBayes with the Metroworks Codewarrior and Microsoft Visual Studio
18. file that you process after you have read in your data MrBayes will keep the data set in memory until you read in a new data block so you can have your different MrBayes blocks pertaining to the same data file distributed over as many separate Nexus files as you like We recommend that before you run your analysis you check the current model settings using the showmodel command This command will list all the active pa rameters and how they are linked across partitions as well as the priors associated with each parameter Finally we want to give you a warning Even though MrBayes allows you to easily set up extremely complex and parameter rich models and the Bayesian MCMC approach is good at handling such models think carefully about the pa rameters you introduce in your model There should be at least some reasonable 68 chance of estimating the parameters based on your data For instance a common mistake is to use a separate GTR model for a partition with so few substitutions that there is not a single observation for several rate categories Making sure there are at least some observations allowing you to estimate each parameter is good practice Over parameterized models often result in problems with convergence in addition to the excessive variance seen in the parameter estimates 4 7 Ancestral State Reconstruction MrBayes allows you to infer ancestral states at ancestral nodes using the full hierarchical Bayesian approach i
19. files primates nex and cynmix nex will be used in the tutorial sections of this manual sections 2 and 3 1 3 Getting Started Start MrBayes by double clicking the application icon or typing mb and you will see the information below MrBayes v3 2 cvs Bayesian Analysis of Phylogeny Distributed under the GNU General Public License Type help or help lt command gt for information on the commands that are available Type about for authorship and general information about the program MrBayes gt The order of the authors is randomized each time you start the program so dont be surprised if the order differs from the one above Note the MrBayes gt prompt at the bottom which tells you that MrBayes is ready for your commands 1 4 Changing the Size of the MrBayes Window Some MrBayes commands will output a lot of information and write fairly long lines so you may want to change the size of the MrBayes window to make it easier to read the output On Macintosh and Unix machines you should be able to increase the window size simply by dragging the margins On a Windows machine you cannot increase the size of the window beyond the preset value by simply dragging the margins but on Windows XP 2000 and NT you can change both the size of the screen buffer and the console window by right clicking on the blue title bar of the MrBayes window and then selecting Properties in the menu that appears Make sure the La
20. in a MrBayes block charset posi 1 3 charset pos2 2 3 charset pos3 3 3 partition by_codon 3 posi pos2 pos3 set partition by_codon The character sets are first defined using the dot sign to mark the last character in the data set and the 3 sequence to include every third character in the specified range Then a partitioning scheme called by_codon is defined using the previously named character sets Finally the partitioning scheme called by_codon is invoked using the set command When we process these commands in MrBayes using the execute command the characters are divided into three sets corresponding to the codon positions By default however all model parameters including the rate will be shared across partitions To allow the rates to differ across partitions we need to change the prior for rates using prset Specifically prset ratepr variable invokes partition specific rates The partition specific rate parameter is referred to as ratemult and the individual rates are labeled m 1 m 2 etc for the rate multiplier of character division 1 2 etc See below for more information on how to set up partitioned models 4 4 5 Inferring Site Rates When you are allowing rate variation across sites you may be interested in infer ring the rates at each individual site By default the site rates are not sampled during a MCMC run You need to request the sampling of these values using report siterate
21. it must be included in the constraint definition The probability value will be used by future versions of the program lset aamodel The amino acid model is now set using prset aamodelpr lset ancfile Ancestral states are now written to the p file s lset basefreq Whether this parameters is estimated or fixed to a particular value in version 3 is controlled by setting the prior with prset statefreqpr By default all parameters are estimated There is no basefreq option in the lset command of version 3 of MrBayes lset clock Whether the tree is a clock or a non clock tree is now set using prset brlenspr lset enforcecal Calibrations are not implemented yet in MrBayes 3 lset enforcecodon The type of nucleotide model is now set using lset nucmodel 4by4 doublet codon lset enforcecon Nowset using prset topologypr constraints lt list_of_constraints lset inferanc Now set using report ancstates lset inferpossel Now set using jttjreport posseljtt lset inferrates This is now set with report siterates lset ncat number of categories used to approximate the gamma distribution of site rates Now set using lset ngammacat to distinguish it from nbetacat the number of categories used to approximate the beta distribution of rate stationary state asymmetry across sites in the standard model lset nonrevmat Time irreversible models are not implemented in MrBayes lset omega Whether this parameters is estimated o
22. off state all rates are 0 R and Rs describe the switching process the diagonal elements are either 59 or 10 and R3 is the chosen model for the evolution in the on state The covarion like model can be described as a general case of the proportion of invariable sites model Huelsenbeck 2002 As the switching rates go to zero the proportion of these rates represented by the switch to the off state s10 becomes 63 identical to the proportion of invariable sites When the switching rates are zero there is no exchange between the on and off states and the characters in the off state remain off throughout the tree in other words they are invariable sites Note that the covarion like model implemented in MrBayes differs from the original covarion model in that sites switch completely independently of each other between the on and off states To invoke the covarion like model simply use lset covarion yes and then choose the desired nucleotide or amino acid model using the other lset and the prset options The covarion like model can be combined with the gamma model of rate variation across sites 4 6 Topology and Branch Length Models The topology and branch length models in MrBayes are set using the prset topologypr options which deal with the tree topology prior and the prset brlenspr options which deal with the branch lengths 4 6 1 Unconstrained and Constrained Topology There are two choices for the prior probability dis
23. often more convenient to use predefined character sets like we did above The final step is to tell MrBayes that we want to work with this partitioning of the data instead of the default partitioning We do this using the set command set partition favored Finally we need to add anend statement to close the MrBayes block The entire file should now look like this NEXUS begin mrbayes execute cynmix nex charset morphology 1 166 charset COI 167 1244 charset EFla 1245 1611 charset LWRh 1612 2092 charset 28S 2093 3246 partition favored 5 morphology COI EFla LWRh 285 set partition favored end When we read this block into MrBayes we will get a partitioned model with the first character division being morphology the second division being the COI gene etc Save the data file exit your text editor and finally launch MrBayes and type execute cynmix command nex to read in your data and set up the partitioning scheme Note that this command causes MrBayes to read in the data file because it contains the command execute cynmix nex 3 3 Specifying a Partitioned Model Before starting to specify the partitioned model it is useful to examine the default model Type showmodel and you should get this table as part of the output Active parameters Partition s Parameters 12 3 4 5 38 Statefreq 1 2 2 2 2 Topology 3 3 3 3 3 Brlens 4 444 4 There is a lot of other useful information in the output of sh
24. ror Tar In total there are eight free parameters since one of the stationary state frequencies and one of the substitution rates are determined by the others By default MrBayes uses a flat Dirichlet distribution with all distribution parameters set to 1 as the prior for both the stationary state frequencies and the substitution rates This is a reasonable unin formative prior that should work well for most analyses During the analysis both the stationary state frequencies and substitution rates are estimated If you want to fix the stationary state frequencies or sub stitution rates you can do that by using the prset command For instance assume that we want to fix the stationary state frequencies to be equal con verting the GTR model to the so called SYM model This is achieved by prset statefreqpr fixed 0 25 0 25 0 25 0 25 or more conveniently prset statefreqpr fixed 44 The substitution rate matrix now becomes A C G T A O25rac 0 25rag 0 25rar C 0 25rac si rc 02ror G 0 25rac 0 25rce Gror TI 0 25rar 0 25rcr 0 25rcr and the stationary state frequencies are no longer estimated during the analysis Similarly it is possible to fix the substitution rates of the GTR model using prset revmatpr fixed Assume for instance that we want to fix the substitu tion rates to be rac 0 11 rag 0 22 rar 0 12 rog 0 14 ror 0 35 rer 0 06 Then the correct statement would be prset revmatpr f
25. standard model assures that you will get the same results regardless of the way in which you label the states In morphology based parsimony analyses one sometimes distinguishes between 59 ordered and unordered characters In ordered characters evolution between some states is assumed to go through intermediate states MrBayes implements a stochastic model for such characters For a three state character assumed to be ordered by convention in the sequence 0 1 2 the instantaneous rate matrix is _ ls 1 0 ae J As 1 a oa We Note that the instantaneous rate of going between the two end states is 0 that is a transition from 0 to 2 or from 2 to 0 has to go through state 1 By default MrBayes treats all standard characters as unordered To change this use the ctype command For instance if you want to treat characters 3 and 4 as ordered you need to include the statement ctype ordered 3 4 in your MrBayes block or enter it using the command line if you prefer The number of states of each standard character is determined by MrBayes simply by looking at the state codes in your matrix Thus a three state model will be used for a three state character and a six state model for a six state character MrBayes does not check that all state codess are used it simply finds the largest state code in the matrix for each character Thus make sure that you use the state codes 0 1 and 2 for a three state character and state codes 0 1 2
26. the best settings you should not have to adjust proposal tuning parameters manu ally However if you have difficulties getting convergence you can try selecting a 42 different mix of topology moves than the one used by default For instance the random SPR move tends to do well on some data sets but it is switched off by default because in general it is less efficient than the default moves You can add and remove topology moves or change the frequency with which they are used by adjusting their relative proposal probabilities using the propset command Use showmoves allavailable yes first to see a list of all the available moves For more information and tips turn to the MrBayes web site http mrbayes net Fredrik s MrBayes resources page http www sc fsu edu fronquis mrbayes the MrBayes home on SourceForge http www sf net projects mrbayes and the MrBayes users email list 4 Evolutionary Models Implemented in MrBayes 3 MrBayes implements a wide variety of evolutionary models for nucleotide amino acid restriction site binary and standard discrete data In addition there are several different ways of modeling the process generating phylogenetic trees An overview of all the models is given in diagrammatic form in the Appendix Here we provide a brief description of each model with some comments on their imple mentation in MrBayes and advice on how you can apply them successfully to your data 4 1 Nucl
27. the Macintosh serial version If you decide to run the program under Unix Linux then you need to compile the program from the source code In the latter case simply unpack the file mrbayes3 2_src tar gz by typing gunzip MrBayes_3 2_src tar gz and then tar xf MrBayes_3 2_src tar The gunzip command unzips the compressed file and the tar xf command extracts all of the files from the tar archive that resulted from the unzip operation note that the gz suffix is dropped in the unzip operation You then need to compile the program We have included a Makefile that contains compiler instructions producing the serial version of the program You simply type make to compile the program according to these instructions A typical compile session would look like this ronquistg5 mrbayes gt l1s mrbayes3 2 _src tar gz ronquistg5 mrbayes gt gunzip MrBayes _3 2 _src tar gz ronquistg5 mrbayes gt 1s mrbayes3 2 _src tar ronquistg5 mrbayes gt tar xf MrBayes _3 2 _src tar ronquistg5 mrbayes gt make gcc DUNIX _VERSION 03 Wall Wno uninitialized c o mb o mb c gcc DUNIX _VERSION 03 Wall Wno uninitialized c o mcmc o mcmc c gcc DUNIX _VERSION 03 Wall Wno uninitialized c o bayes o bayes c gcc DUNIX _VERSION 03 Wall Wno uninitialized c o command o command c gcc DUNIX _VERSION 03 Wall Wno uninitialized c o mbmath o mbmath c gcc DUNIX _VERSION 03 Wall Wno uninitialized c o model o model c gcc DUNIX _VERSION 03 W
28. to 0 transitions is expressed in terms of the stationary state frequencies Thus if the stationary frequencies are 7 0 25 and m 0 75 then the rate of 0 to 1 transitions is 3 times as high as the rate of transitions in the other direction 7 79 3 A problem with some binary data sets notably restriction sites is that there is an ascertainment coding bias such that certain characters will always be missing from the observed data It is impossible for instance to detect restriction sites that are absent in all of the studied taxa MrBayes corrects for this bias by calculating likelihoods conditional on the unobservable characters being absent Felsenstein 1992 The ascertainment coding bias is selected using lset coding There are five options 1 there is no bias all types of characters could in principle be observed 1lset coding all 2 characters that are absent state 0 in all taxa cannot be observed lset coding noabsencesites 3 characters that are present state 1 in all taxa cannot be observed lset coding nopresencesites 4 characters that are constant either state 0 or 1 in all taxa cannot be observed lset coding variable and 5 only characters that are parsimony informative have been scored lset coding informative For restriction sites it is typically true that all absence sites cannot be observed so the correct coding bias option is noabsencesites The binary model is useful for a number of
29. we are going to create a command block from scratch but you can consult the cynmix run nex for reference In your favorite text editor create a new file called cynmix command nex in the same directory as the cynmix nex file and add the following new MrBayes block note that each line must be terminated by a semicolon NEXUS begin mrbayes execute cynmix nex charset morphology 1 166 charset COI 167 1244 charset EFla 1245 1611 charset LWRh 1612 2092 charset 28S 2093 3246 The first line is required to comply with the nexus standard With the execute command we load the data from the cynmix nex file and the charset command simply associates a name with a set of characters For instance the character set COI is defined above to include characters 167 to 1244 The next step is to define a partition of the data according to genes and morphology This is accomplished with the line add it after the lines above partition favored 5 morphology COI EFla LWRh 285 The elements of the partition command are 1 the name of the partitioning scheme favored 2 an equal sign 3 the number of character divisions in the scheme 5 4 a colon and 5 a list of the characters in each division separated by commas The list of characters can simply be an enumeration of the character numbers the above line is equivalent to partition favored 5 37 1 166 167 1244 1245 1611 1612 2092 2093 3246 but it is
30. 6460 6368 6295 6239 6122 6052 5985 5917 5885 5849 5835 5731 5728 5733 5730 5734 5736 5733 5734 5741 5746 5748 409 248 342 651 643 809 207 514 179 042 631 771 996 051 074 551 126 053 678 802 447 344 6335 6274 6170 6126 6054 5987 5962 5957 5947 5893 5882 5733 5742 5741 5731 5733 5727 5723 5725 5722 5716 5716 541 370 237 382 535 512 131 886 285 568 916 018 658 748 851 469 055 926 685 740 807 947 ooo ooo omomome ooo Mo Momo Momomome The first column lists the generation number The following four columns with negative numbers each correspond to one chain in the first run Each column corresponds to one physical location in computer memory and the chains actually shift positions in the columns as the run proceeds The numbers are the log likelihood values of the chains The chain that is currently the cold chain has its value surrounded by square brackets whereas the heated chains have their values surrounded by parentheses When two chains successfully change states they trade column positions places in computer memory If the Metropolis coupling works well the cold chain should move around among the columns this means that the cold chain successfully swaps states w
31. Advice 4 162 40 6 ad watele Ae oe Re a 4 Evolutionary Models Implemented in MrBayes 3 4 1 Nucleotide Models 2 44 44 26 8 So Oe Awe Ow WANS BAS 4 1 1 Simple Nucleotide Models 4 1 2 The Doublet WOE 42 2 216 g ap a 2 oa Oey he Pek had B13 SOCOM Mod ls ice 4 02 e eee ae he es Sk eA A eed 4 2 Amino acid Models lt 3 2 so eshte eg ed Soh ee Sed 2 4 2 1 Fixed Rate Models lt 2 wc Gee Ce EE AR 4 2 2 Estimating the Fixed Rate Model 4 2 3 Variable Rate Models 4 440 424 44 409 44 amp ese he 4 4 2 4 Restriction Site Binary Model 4 2 5 Standard Discrete Morphology Model 4 3 Parsimony Mod l s r pos a i eh we aed Be eS Be a EES 4 4 Rate Variation Across Sites 0 22000000 4 4 1 Gamma distributed Rate Model 4 4 2 Autocorrelated Gamma Model 4 4 3 Proportion of Invariable Sites 4 4 4 Partitioned Site Specific Rate Model 4 4 5 Inferring Site Rates vie goa o og de amp eed 68 eek a 4 5 Rate Variation Across the Tree The Covarion Model 4 6 Topology and Branch Length Models 4 6 1 Unconstrained and Constrained Topology 4 6 2 Non clock Standard Trees 4 6 3 Strict Glock Trees 6 ret ele seh eB eh eB a ee 4 6 4 Relaxed Clock Trees 2 2000 4 4 6 5 Partitioned Models ay kas oe ae eo eS ee eK ee ek 4 7 A
32. G_H c o mcmc o mcmc c mpicc 03 ffast math Wall DUSECONFIG_H c o model o model c mpicc 03 ffast math Wall DUSECONFIG_H c o plot o plot c mpicc 03 ffast math Wall DUSECONFIG_H c o sump o sump c mpicc 03 ffast math Wall DUSECONFIG_H c o sumt o sumt c mpicc 03 ffast math Wall DUSECONFIG_H c o tree o tree c mpicc 03 ffast math Wall DUSECONFIG_H c o utils o utils c mpicc 03 ffast math Wall DUSECONFIG_H lreadline lm mb c bayes o command o mbmath This produces an MPI enabled version of MrBayes called mb Make sure that the mpicc compiler is invoked It is perfectly normal if the build process stops for a few minutes on the mcmc c file this is the largest source file and it takes the compiler some time to optimize the code How you run the resulting executable depends on the MPI implementation on your cluster A simple approach would use LAM MPI First the LAM virtual machine is set up as usual with lamboot Then the parallel MrBayes job is started with a line such as mpirun np 4 mb batch nex gt log txt amp to have MrBayes process the file batch nex and run all analyses on four processors np 4 saving screen output to the file log txt If you keep both a serial and a parallel version of MrBayes on your system make sure you are using the parallel version with your mpirun command If your analysis takes a lot of time we advise you to turn on checkpointing by setting the checkpoint option mcmcp c
33. MrBayes version 3 2 Tutorial Fredrik Ronquist August 12 2010 Contents 1 Introduction 1 1 Conventions Used in this Manual soe wb gt we ale BAS 1 2 Acquiring and Installing MrBayes 3 Getting Started 2 0 ot Deni eat ate Baie ge PE Es ag 8 Oe OR Bee 1 4 Changing the Size of the MrBayes Window 15 Gettines Help 2 2 0 of gece bo ge eee a aT GR eo ee RAS 1 6 Reporting and Fixing Bugs 4 4 4 2p aed 2S ae A 1 7 License and Warrant ys 4 2 2 2 gi e e 2 2b ad ah a ek ed 2 Tutorial A Simple Analysis 2 QuickStart Version s scs sise ee eR ge Ee ay Bn Me ee Oe 2 2 Getting Data into MrBayes a ears 24 RE ee ee ee 2 3 Specifying a Model 25 2s i nae he Bac Ee aE Be RS ES 2 4 Setting the Priors 2 ue ew ee bw OS ee ke Ee we ke ee eS 2 5 Checkins the Wiel naa ii ay area fas ao ea Ge a GG a ae 2 6 Setting up the Amal ysis ea hue wa Gea AS ee 2 R nning the Analysis ede a i aere Yi Ak be ws ae a ee 2 8 When to Stop the Analysis i oa equi 2 Ge we OS ae we 2 9 Summarizing Samples of Substitution Model Parameters 2 10 Summarizing Samples of Trees and Branch Lengths 3 Analyzing a Partitioned Data Set 3 1 Getting Mixed Data into MrBayes 3 2 Dividing the Data into Partitions 2 4 lt lt 4 6 e265 eeS e es 3 3 Specifying a Partitioned Model 00 4 3 4 Running the ANALYSIS 66 hao ile BA Rad RA a ak Qewidde Hee 3 5 Some Practical
34. aca_fuscata M_mulatta 10 M_fascicularis 11 M_sylvanus 12 Saimiri_sciureus tree gen 1 amp U 12 0 100000 3 0 100000 4 0 100000 0 100000 OMAN OOS tree gen 20000 amp U 10 0 087647 8 0 013447 9 0 021186 0 030 end To summarize the tree and branch length information type sumt relburnin yes burninfrac 0 25 The sumt and sump commands each have separate burn in settings so it is necessary to give the burn in here again Most MrBayes settings are persistent and need not be repeated every time a command is executed but the settings are typically not shared across commands To make sure the settings for a particular command are correct you can always use help lt command gt before issuing the command The sumt command will output among other things summary statistics for the taxon bipartitions a tree with clade credibility posterior probability values and a phylogram if branch lengths have been saved The output first gives a key to each partition in the tree sample using dots for the taxa that are on one side of the partition and stars for the taxa on the other side For instance the 14th partition ID 14 in the output below represents the clade Homo taxon 3 and Pan taxon 4 since there are stars in the third and fourth positions and a dot in all other positions List of taxa in bipartitions Tarsius_syrichta Lemur_catta Homo_sapiens Pan Gorilla Pongo Hylobates
35. all Wno uninitialized c o plot o plot c gcc DUNIX _VERSION 03 Wall Wno uninitialized c o sump o sump c gcc DUNIX _VERSION 03 Wall Wno uninitialized c o sumt o sumt c gcc DUNIX _VERSION 03 Wall Wno uninitialized lm mb o bayes o command o mbmath o mcmc o model o plot o sump o sumt o o mb ronquistg5 mrbayes gt The compilation usually stops for several minutes at the mcmc c file this is perfectly normal This is the largest source file and optimization of the code takes quite a while We assume as the default C compiler gcc which is installed on most systems If you do not have gcc installed on your machine or you want to produce the MPI version or some other special version of the program you have to change the compiler information in the Makefile as described in section 7 of this manual The executable serial version of the program is called mb To execute the program simply type mb in the directory where you compiled the program The prefix is needed to tell Unix that you want to run the local copy of mb in your working directory If you run MrBayes often you will probably want to add the program to your path refer to your Unix manual or your local Unix expert for more information on this All three packages of MrBayes come with example data files These are intended to show various types of analyses you can perform with the program and you can use them as templates for your own analyses Two of the
36. analysis discarded the first 25 of the samples it makes sense to discard 25 of the samples when summarizing the parameter values Thus summarize the information in the p file by typing sump reburnin yes burnin 0 25 By default sump will summarize the information in the p file or files generated most recently but the filename can be changed if necessary The relburnin yes option specifies that we want to give the burn in in terms of a fraction relative burn in rather than as an absolute value The burninfrac option specifies the desired burn in fraction 28 The sump command will first generate a plot of the generation versus the log probability of the data the log likelihood values If we are at stationarity this plot should look like white noise that is there should be no tendency of increase or decrease over time The plot should look something like this 4 5717 71 2 1 1 2 1 1 22 1 1 1 22 2 1 12 1 22 2 1 1 1 PEs ae 1 2 11 1111 1 1 112 1 2 12 1 22 2 1 1 2 2 11 1 1 2 12 2 12 1 221 2 12 2 gt 2 2 2 1 2 1 2 12 2 1 i OC 909 2 1 1 1 2 211 1 22 2 2 2 122 2 2 2 1 1 2 pee LSS 5731 34 5000 20000 If you see an obvious trend in your plot either increasing or decreasing you very likely need to run the anal
37. ant to the analysis of your sequences it might be reasonable to use those numbers as a prior in your analysis In our analysis we will be cautious and leave the prior on state frequencies at its default setting If you have changed the setting according to the suggestions above you need to change it back by typing prset statefreqpr Dirichlet 1 1 1 1 or prs st Dir 1 1 1 1 if you want to save some typing Similarly we will leave the prior on the substitution rates at the default flat Dirichlet 1 1 1 1 1 1 distribution The Shapepr parameter determines the prior for the a shape parameter of the gamma distribution of rate variation We will leave it at its default setting a uniform distribution spanning a wide range of a values The prior for the pro portion of invariable sites is set with Pinvarpr The default setting is a uniform distribution between 0 and 1 an appropriate setting if we don t want to assume any prior knowledge about the proportion of invariable sites For topology the default Uniform setting for the Topologypr parameter puts equal probability on all distinct fully resolved topologies The alternative is to constrain some nodes in the tree to always be present but we will not attempt that in this analysis The Brlenspr parameter can either be set to unconstrained or clock constrained For trees without a molecular clock unconstrained the branch length prior can be set either to exponential or uniform The default expon
38. arpr for the proportion of invariable sites Topologypr for the The default prior probability density is a flat Dirichlet all values are 1 0 for 17 both Revmatpr and Statefreqpr This is appropriate if we want estimate these parameters from the data assuming no prior knowledge about their values It is possible to fix the rates and nucleotide frequencies but this is generally not rec ommended However it is occasionally necessary to fix the nucleotide frequencies to be equal for instance in specifying the JC and SYM models This would be achieved by typing prset statefreqpr fixed equal If we wanted to specify a prior that put more emphasis on equal nucleotide frequencies than the default flat Dirichlet prior we could for instance use prset statefreqpr Dirichlet 10 10 10 10 or for even more emphasis on equal frequencies prset statefreqpr Dirichlet 100 100 100 100 The sum of the numbers in the Dirichlet distribution determines how focused the distribution is and the balance between the numbers determines the expected proportion of each nucleotide in the order A C G and T Usually there is a connection between the parameters in the Dirichlet distribution and the observations For example you can think of a Dirichlet 150 100 90 140 distribution as one arising from observing roughly 150 A s 100 C s 90 G s and 140 T s in some set of reference sequences If the reference sequences are independent but clearly relev
39. ata file 2 Set the evolutionary model 3 Run the analysis 4 Summarize the samples In more detail each of these steps is performed as described in the following paragraphs 1 At the MrBayes gt prompt type execute primates nex This will bring the data into the program When you only give the data file name primates nex MrBayes assumes that the file is in the current directory If this is not the case you have to use the full or relative path to your data file for example execute taxa primates nex If you are running your own data file for this tutorial beware that it may contain some MrBayes commands that can change the behavior of the program delete those commands or put them in square brackets to follow this tutorial 2 At the MrBayes gt prompt type lset nst 6 rates invgamma This sets the evolutionary model to the GTR model with gamma distributed rate variation across sites and a proportion of invariable sites If your data are not DNA or RNA if you want to invoke a different model or if you want to use non default priors refer to the full MrBayes manual and its Appendix 3 1 At the MrBayes gt prompt type memc ngen 20000 This will ensure that you get at least 200 samples from the posterior probability distribution since the default sampling frequency is every 100th generation For larger data sets 11 you probably want to run the analysis longer and sample less frequently You can find the predicted remaining t
40. cify the evolutionary model that will be used in the analysis Usually it is also a good idea to check the model settings prior to the analysis using the showmodel command In general 1lset is 13 used to define the structure of the model and prset is used to define the prior probability distributions on the parameters of the model In the following we will specify a GTR I F model a General Time Reversible model with a proportion of invariable sites and a gamma shaped distribution of rates across sites for the evolution of the mitochondrial sequences and we will check all of the relevant priors We assume that you are familiar with the common stochastic models of molecular evolution In general a good start is to type help Iset Ignore the help information for now and concentrate on the table at the bottom of the output which specifies the current settings It should look like this Model settings for partition 1 Parameter Options Current Setting Nucmodel 4by4 Doublet Codon 4by4 Nst 1 2 6 1 Code Universal Vertmt Mycoplasma Yeast Ciliates Metmt Universal Ploidy Haploid Diploid Diploid Rates Equal Gamma Propinv Invgamma Adgamma Equal Ngammacat lt number gt 4 Usegibbs Yes No No Gibbsfreq lt number gt 100 Nbetacat lt number gt 5 Omegavar Equal Ny98 M3 Equal Covarion No Yes No Coding Al1 Variable Noabsencesites Nopresencesites All Parsmodel No Yes No First note that the table is headed by Model settings for partitio
41. culate each site s likelihood The weights are the prior probabilities that a site would belong to that category Because the gamma is discretized by breaking it up into ngammacat equal size chunks the probability that any site i would be in rate category c is simply 1 ngammacat and this quantity is independent of the site of the rate The likelihood for site i is a sum over all categories c of the quantity 1 ngammacat L i c where L i c is the likelihood of site i conditional upon it being in category c Typically at least for datasets with a large number of taxa one of the rate categories contributes the vast majority of the likelihood to this sum because its conditional likelihood is much larger than the conditional likelihoods from the other rate categories Thus the likelihoods under this model are close to 1 ngammacat times the conditional likelihood of the best fitting rate category for each site The fact that the site likelihoods are dominated by one term implies that it is wasteful to calculate the conditional likelihoods over all ngammacat rate categories most of the calculations will be lost in the rounding error of the final summation over rate categories The LSet UseGibbs yes option addresses this concern Instead of calculating a site s likelihood by summing over all of the hidden states the rate categories an indicator variable that corresponds to the rate category for a site is sampled during the MCMC The lik
42. duce an ASCII text version of the command reference at any time by giving the command manual to MrBayes Further help is available in a set of hyperlinked html pages produced by Jeff Bates and available on the MrBayes web site Finally you can get in touch with other MrBayes users and developers through the mrbayes users email list subscription information at www mrbayes net 1 6 Reporting and Fixing Bugs If you find a bug in MrBayes we are very grateful if you tell us about it using the bug reporting functions of SourceForge as explained on the MrBayes web site www mrbayes net When you submit a bug make sure that you upload a data file with the data set and sequence of commands that produced the error If the bug occurs during a MCMC analysis after issuing the memc command you can help us greatly by making sure the bug can be reproduced reliably using a fixed seed and swapseed for the mcmc command and ideally also with a small data set The Tracker software at SourceForge will make sure that you get email notification when the bug has been fixed in the source code on the MrBayes CVS repository at SourceForge Note however that there may be some time before new executables containing the bug fix will be released Advanced users may be interested in fixing bugs themselves in the source code Refer to section 7 of this manual for information on how to contribute bug fixes improved algorithms or expanded functionality to other us
43. e that you have concatenated nucleotide sequences from three genes in your data set of length 1962 1050 and 2082 sites respectively Then you create character sets for those three genes using charset gene1 1 1962 charset gene2 1963 3012 charset gene3 3013 in a MrBayes block Note the use of the dot as a synonym of the last site You can also use the backslash n sequence to include every nth character in the preceding range of characters see the description of the site specific rate model above Once the character sets are defined the partitioning scheme based on the genes is defined with the partition command and selected using the set command partition by_gene 3 genel gene2 gene3 set partition by_gene Here by_gene is the name we chose for the partitioning scheme The name is followed by an equal sign the number of partitions and then a comma separated list of characters to include in each partition Note that MrBayes requires the partitioning scheme to include all characters Say for instance that you wanted to run an analysis with only gene 1 and gene 2 Then define a two partition scheme and exclude the characters represented by gene 3 partition gene1 amp 2 2 genel gene2 gene3 exclude gene3 set partition gene1 amp 2 If the only purpose of the partition gene1 amp 2 is to allow exclusion of gene 3 then gene 3 can of course be included in either of the two partitions before being excluded O
44. e version 3 which is written in C this version will be written in C and our goal is to provide a cleaner faster and more extensively documented implementation of Bayesian MCMC phylogenetic analysis This means among other things that the code will be better organized and all important sections will be documented using http www doxygen org Doxygen for easy access to other developers You are welcome to examine this project as it develops by downloading the source code doxygen documentation or programming style directives from the MrBayes SVN repository at SourceForge 7 4 Advanced Options 7 4 1 LSet UseGibbs Option As described in the Gamma distributed rates section MrBayes can accommodate rate heterogeneity across sites using a discrete approximation to the Gamma distribution The discrete approximation to the Gamma distribution is a form of hidden Markov model in which there are ngammacat rate categories that any site can belong to There is a hidden state that identifies the appropriate rate category for each site Because we cannot see the rate category for a site we must treat it as an unknown variable The LSet UseGibbs option controls how this variable is handled When LSet UseGibbs no is in effect the likelihoods calculated by MrBayes will be comparable to the likelihood from other software and versions of MrBayes 85 earlier than 3 2 In this mode a weighted sum over all rate categories is performed to cal
45. eciationpr Extinctionpr Sampleprob Thetapr Nodeagepr Treeagepr Clockratepr Cppratepr Psigammapr Nupr Ratepr Fixed Mixed Dirichlet Fixed Dirichlet Fixed Beta Fixed Uniform Exponential Fixed Exponential Fixed Dirichlet Fixed Dirichlet Fixed Uniform Exponential Fixed Uniform Fixed Uniform Fixed Uniform Exponential Fixed Uniform Exponential Fixed Uniform Constraints Fixed Unconstrained Clock Fixed Exponential Gamma Uniform Exponential Fixed Uniform Exponential Fixed lt number gt Uniform Exponential Fixed Unconstrained Calibrated Fixed Uniform Offsetexponential Strict Cpp Bm Ibr Fixed Exponential Fixed Exponential Uniform Fixed Exponential Uniform Fixed Variable Dirichlet Fixed Poisson Dirichlet 1 0 1 0 Dirichlet 1 0 1 0 Beta 1 0 1 0 Exponential 1 0 Exponential Dirichlet 1 0 1 0 1 0 Dirichlet 1 0 1 0 1 0 1 0 Uniform 0 0 200 0 Uniform 1 0 1 0 Uniform 0 0 1 0 Uniform 0 0 100 0 Fixed Infinity Uniform Unconstrained Exp 10 0 Exponential 1 0 Uniform 0 0 10 0 Uniform 0 0 10 0 1 00 Uniform 0 0 10 0 Unconstrained Fixed 1 00 Strict Exponential 0 10 Fixed 1 00 Fixed 1 00 Fixed We need to focus on Revmatpr for the six substitution rates of the GTR rate topology and Brlenspr for the branch lengths matrix Statefreqpr for the stationary nucleotide frequencies of the GTR rate matrix Shapepr for the shape parameter of the gamma distribution of rate variation Pinv
46. efault MrBayes uses the universal code but you can select other codes by using the lset code command Note that the codon models are computationally demanding Whereas the computations for the simple four by four models need to deal with only 16 Q matrix and transition probability elements 4x4 the codon model computations need to process more than 3 600 Q matrix and transition probability elements resulting in these runs being roughly 200 times slower The codon models also require much more memory than the four by four models about 16 times as much The simplest codon model described above assumes that all amino acid sites are subject to the same level of selection the same w factor However MrBayes 49 also implements models accommodating variation in selection across sites This allows you to detect positively selected amino acid sites using a full hierarchical Bayes analysis that is an analysis that does not fix tree topology branch lengths or any substitution model parameters but instead integrates out uncertainty in all these factors The w variation models work much like the model accommodating rate varia tion across sites according to a gamma distribution The likelihood of each site is calculated under several different w values and then the values are summed to give the site likelihood A difference is that the stationary frequency of each omega category is estimated instead of being fixed as in the case of the gamma distri
47. el settings for partition 1 Parameter Options Current Setting Nucmodel 4by4 Doublet Codon 4by4 Nst 1 2 6 6 Code Universal Vertmt Mycoplasma Yeast Ciliates Metmt Universal Ploidy Haploid Diploid Diploid Rates Equal Gamma Propinv Invgamma Adgamma Invgamma Ngammacat lt number gt 4 Usegibbs Yes No No Gibbsfreq lt number gt 100 Nbetacat lt number gt 5 Omegavar Equal Ny98 M3 Equal Covarion No Yes No Coding Al1 Variable Noabsencesites Nopresencesites All Parsmodel No Yes No 2 4 Setting the Priors We now need to set the priors for our model There are six types of parameters in the model the topology the branch lengths the four stationary frequencies of the nucleotides the six different nucleotide substitution rates the proportion of invariable sites and the shape parameter of the gamma distribution of rate variation The default priors in MrBayes work well for most analyses and we will not change any of them for now By typing help prset you can obtain a list of the default settings for the parameters in your model The table at the end of the help information reads Model settings for partition 1 Parameter Options Current Setting Tratiopr Beta Fixed Beta 1 0 1 0 Revmatpr Dirichlet Fixed Dirichlet 1 0 1 0 1 0 1 0 1 0 1 0 16 Aamodelpr Aarevmatpr Omegapr Ny98omegalpr Ny98omega3pr M3omegapr Codoncatfreqs Statefreqpr Shapepr Ratecorrpr Pinvarpr Covswitchpr Symdirihyperpr Topologypr Brlenspr Treeheightpr Sp
48. elihood reported for each site is actually the conditional likelihood the likelihood of the site conditional upon it belonging to rate category Each site has its own value for the hidden state The fact that we are uncertain of each site s true rate category is accommodated by sampling with MCMC instead of summation Conveniently when you calculate the likelihood by summing over all rate cat egories then you are also in position to calculate the posterior probability that each site belongs to a particular category To update the hidden states for each site you have to 1 Calculate the conditional likelihood for each rate category 2 Calculate the full site likelihood by summing over all rate categories weighted by their prior probabilities 86 3 Calculate the posterior probability of each of the hidden state assignments by dividing the conditional likelihood for each rate category by the site like lihood so that the probabilities sum to one 4 Select a new hidden state for each site by choosing a randomly giving each rate category a probability of being chosen that is equal to the posterior probability that the site belongs to that category This is a Gibbs_sampling Gibbs Sampling update of the hidden state The LSet Gibbsfreq option controls how frequently the hidden states are updated Thus the likelihood for all ngammacat categories only has to be calculated in the iteration in which the hidden state
49. emp parameter Every Swapfreq generation two chains are picked at random and an attempt is made to swap their states For many analyses the default settings should work nicely If you are running many more than three heated chains however you may want to increase the number of swaps Nswaps that are tried each time the chain stops for swapping If the frequency of swapping between chains that are adjacent in temperature is low you may want to decrease the Temp parameter The Samplefreq setting determines how often the chain is sampled By default the chain is sampled every 100th generation and this works well for most analyses including ours If you have a large data set it may take longer to converge and you may want to sample less frequently or you will end up with very large files containing tree and parameter samples When the chain is sampled the current values of the model parameters are printed to file The substitution model parameters are printed to a p file in our case there will be one file for each independent analysis and they will be called primates nex runi p and primates nex run2 p The p files are tab delimited text files that can be imported into most statistics and graphing programs The topology and branch lengths are printed to a t file in our case there will be two files called primates nex runi t and primates nex run2 t The t files are Nexus tree files that can be imported into programs like PAUP and TreeView
50. encies of the negatively selected neutral and positively se lected categories respectively and omega omega N and omega are the corresponding w values The M3 model parameters are instead labeled pi 1 pi 2 and pi 3 for the category frequencies and omega 1 omega 2 and omega 3 for the w values The probability of a codon being positively selected is labeled by the site numbers in the original alignment Thus pr 16 17 18 is 50 the probability of the codon corresponding to the original nucleotide alignment sites 16 17 and 18 being in a positively selected omega category 4 2 Amino acid Models MrBayes implements a large number of amino acid models They fall in two dis tinct categories the fixed rate models and the variable rate models The former have both the stationary state frequencies and the substitution rates fixed whereas one or both of these are estimated in the latter 4 2 1 Fixed Rate Models The Poisson model Bishop and Friday 1987 is the simplest of the fixed rate models It assumes equal stationary state frequencies and equal substitution rates thus it is analogous to the JC model for nucleotide characters The rest of the fixed rate models have unequal but fixed stationary state frequencies and substi tution rates reflecting estimates of protein evolution based on some large training set of proteins These models include the Dayhoff model Dayhoff Schwartz and Orcutt 1978 the Mtrev model Adachi
51. ential prior with parameter 10 0 should work well for most analyses It has an expectation of 1 10 0 1 but allows a wide range of branch length values theoretically from 0 to infinity Be 18 cause the likelihood values vary much more rapidly for short branches than for long branches an exponential prior on branch lengths is closer to being uninformative than a uniform prior 2 5 Checking the Model To check the model before we start the analysis type showmodel This will give an overview of the model settings In our case the output will be as follows Model settings Datatype DNA Nucmodel 4by4 Nst 6 Substitution rates expressed as proportions of the rate sum have a Dirichlet prior 1 00 1 00 1 00 1 00 1 00 1 00 Covarion No States 4 State frequencies have a Dirichlet prior 1 00 1 00 1 00 1 00 Rates Invgamma Gamma shape parameter is uniformly dist ributed on the interval 0 00 200 00 Proportion of invariable sites is uniformly dist ributed on the interval 0 00 1 00 Gamma distribution is approximated using 4 categories Likelihood summarized over all rate categories in each generation Active parameters Parameters Revmat 1 Statefreq 2 Shape 3 Pinvar 4 Topology 5 Brlens 6 1 Parameter Revmat 19 Type Prior Parameter Type Prior Parameter Type Prior Parameter Type Prior Parameter Type Prior Subparam Parameter Type Prior Rates of reversib
52. eotide Models MrBayes implements a large number of DNA substitution models These models are of three different structures The 4by4 models are the usual simple models of nucleotide evolution The Doublet model is intended for stem regions of models ribosomal DNA where nucleotides evolve in pairs Finally the Codon group the nucleotides in triplets and model evolution based on these The type of nucleotide model is set in MrBayes with lset nucmodel for instance if you want to use the doublet model the command is lset nucmodel doublet The default setting is 4by4 43 4 1 1 Simple Nucleotide Models There has been more work based on the simple four by four nucleotide models than on any other type of evolutionary model for molecular data MrBayes 3 implements three main types of models you select among those by setting the number of substitution types using lset nst to 1 2 or 6 The widely used General Time Reversible GTR model has six substitution types lset nst 6 one for each pair of nucleotides The instantaneous rate matrix for the GTR model is note that we order the rows and columns alphabetically A C G T unlike some other authors A fc G fF A TCTAC TGTAG TTAT Q C tarac arog Trror G tarag Teroa Terror T marar Torcer Terer The GTR model Tavare 1986 has four stationary state frequencies 74 7 nG Tr and six rate parameters r AC TAG TAT TCG
53. er slightly because of stochastic effects but should show a similar trend In larger and more difficult analyses you will typically see the standard devi 26 ation of split frequencies come down much more slowly towards 0 0 the standard deviation can even increase temporarily especially in the early part of the run A rough guide is that an average standard deviation below 0 01 is very good indication of convergence while values between 0 01 and 0 05 may be adequate depending on the purpose of your analysis The sumt command see below allows you to examine the error standard deviation associated with each clade in the tree Typically most of the error is associated with clades that are not very well supported posterior probabilities well below 0 95 and getting accurate estimates of those probabilities may not be an important concern Given the extremely low value of the average standard deviation at the end of the run there appears to be no need to continue the analysis beyond 20 000 generations so when MrBayes asks Continue with analysis yes no stop the analysis by typing no Although we recommend using a convergence diagnostic such as the standard deviation of split frequencies there are also simpler but less powerful methods of determining when to stop the analysis The simplest technique is to examine the log likelihood values or more exactly the log probability of the data given the parameter values of the cold chain t
54. ers of MrBayes 1 7 License and Warranty MrBayes is free software you can redistribute it and or modify it under the terms of the GNU General Public License as published by the Free Software Foundation either version 2 of the License or at your option any later version The progr