Home
        ProtTest - Bioinformatics and Computational Biology Services
         Contents
1.    OB sodel RETTET TEEEEEEETETEETEEETTETETT    JTT F  Number of parametersS         s         58  19   39 branch length estimates   S aminoacid frequencies               observed  see above   EOC   2999 42  0h0m5s   ar S PE OE EA   JTHI          Number of parameters  proportion of invariable sites       40  1   39 branch length estimates   0 069           lNL  ssssssssooosossooososssoososso   2976 88  0h0m16s        Model    ccccccccccnccccnssecccccscce   JTT I F      gt  Number of parameters              59  20   39 branch length estimates           proportion of invariabl    _ aminoacid frequencies       0 068  observed  see above   2975 28  OhOmizs              Work progress  model  5 32   gt  Sort Dy   Overall comparison    Display  best  tree in   Newick format _  ASCII text   Close     Save      Figure 2  ProtTest   s console window       ProtTest at the command line   To run Prottest at the command line  open a shell window  terminal   change to the  directory where ProtTest is installed and run the script runProtTest  In Windows  use  the script runProtTest bat instead   specifying the following options            A  alignment file  required    E  tree file  optional   default  NJ tree    O  output file  optional   default  STDOUT    sort  A B C D  optional   default  A   A  AIC  B  BIC  C  ATCC  D  LLnL   all  T F  If true a 7 framework comparison table is displayed   default  true    S  optimization strategy mode   default  0     0  Fast  optimize branch lengths  am
2.   model averaged estimates of different parameters  Posada and Buckley 2004  and  calculates the importance of each of these parameters  ProtTest differs from its  nucleotide homolog Modeltest  Posada and Crandall 1998  in that it does not include  likelihood ratio tests  many models implemented in ProtTest are not nested      The program  using ProtTest    ProtTest  selection of models of protein evolution    ProtTest is written in java and takes advantage of the PAL library  Drummond and  Strimmer 2001  for manipulating trees and alignments  and of the Phyml program   Guindon and Gascuel 2003  for the computation of likelihoods and the estimation of  parameters  Given an alignment and a tree  provided by the user or calculated with the  BIONJ algorithm  Gascuel 1997    ProtTest currently computes the likelihood for each  one of 112 candidate models of protein evolution  matrices WAG  Dayhoff  JTT   mtREV  MtMam  MtArt  VT  RtREV  CoREV  Blosum62  LG  DCMut  HIVw  and HIVb  with the  l   G  and  F parameters  Then  the fit of the models can be estimated using  the AIC  AlCc  and BIC     Download and installation   ProtTest works in Mac OSX  Windows  and Linux  and requires a version of the java  runtime environment equal or posterior to 1 3  read section    Installing java    if you don   t  have it   ProTest is available from http   darwin uvigo es  After registration  download  the package and decompress it in any directory  Some examples are included     Installing java  
3.   there are some ways we can model these  constraints  we can consider that a fraction of the amino acids are invariable   commonly indicated with a     I    code in the name of the model   Reeves 1992   we can  consider some different categories of change  low  medium  high rate  etc   and assing  each site a probability to belong to each of these categories  usually indicated by a      G    code   Yang 1993   or we can include both in the model   I G   Also  we can use  as equilibrium aminoacid frequencies those observed in the alignment at hand   indicated as     F      Cao et al  1994      Statistics for model selection  Akaike Information Criterion and others   For a more detailed background on model selection  the user is referred to  Posada  and Crandall 2001  and Posada and Buckley  in press   Burnham and Anderson  2003   provide a very good description of the AIC framework and its use for model averaging     which they call multimodel inference      The fit of a model of protein evolution  M  to a given data set  D   given a tree  T  and  branch lenghts  B  is measured by the likelihood function  L      L  P D M 7  B   One could think that the best model is the one which results in the maximum likelihood   but this is not necessarily true  the more parameters the model includes  the higher its  advantage in fitting better the data  but also the higher the variance for the parameter  estimates  So how many parameters should the best model include     One way to answer
4.  Phylogenetics and sample size    What is the sample size of a protein alignment is very unclear  ProtTest offers different  criteria for sample size determination     e Alignment length  default    e Number of variable sites   e Shannon entropy summed over all alignment positions        Shannon Entropy   Shannon 1948   It   s a way to measure the disorder or  entropy     ShEn   X  Pilog  Pi     In our case if a position is completely conserved it takes the value of 0  If  completely disordered  frequency of every aminoacid equals 1 20  it takes  the value of 4 32              e Number of sequences x length of the alignment x normalized Shannon   s  entropy  The normalized Shannon   s entropy is calculated by summing the  entropies over all positions  dividing this quantity by the number of  positions  and dividing the resulting quantity by the maximum possible  entropy  4 32      13    ProtTest  selection of models of protein evolution    e Number of sequences x length of the alignment   e Users provided size     Akaike weights and the relative importance of parameters  The AIC  or AlCc  or BIC  differences can be used for calculating the Akaike weights     exp  1 2Ai   5 exp  1 2A         i      these weights can be interpreted as the probability that a model is the best AIC model     Parameter _importance  By summing the weights of the models that include a given parameter  for example the    gamma distribution  we get the relative importance of such parameter     R  wpa  
5.  X  w   M    where    1 if    is in model M   I  M     0 otherwise    Model averaged parameter estimates  We can also obtain an averaged estimation of any parameter by summing the different    estimates for the models that contain such parameter after multiplying them by the  Akaike weight of the corresponding model  For example  the model averaged estimate  of alpha  q    for R candidate models would be     R  5 wily  M  Qa  Pa  i l s       Wy   a     where    R  w     gt   Wily  M      and    1 if Qa is in model M   ly   M         0 otherwise     14    ProtTest  selection of models of protein evolution    Credits and acknowledgements    ProtTest takes advantage of the PAL library  Drummond and Strimmer 2001  for  manipulating alignments and trees  The core of the computation is carried out by the  Phyml program  slightly modified to acomplish some requirements and to include  additional models   Guindon and Gascuel 2003    which calculates the likelinoods and  optimizes the parameters  Phyml is also used for calculating BioNJ trees  The code of  ProtTest takes also benefit from other resources found at the WWW  as indicated in the  source java code     Very special thanks to Stephane Guindon  Phyml  and Matthew Goode  PAL  for being  so helpful and patient     This work was financially supported from a grant for research in bioinformatics  from the Fundacion BBVA     Given that ProtTest uses intensively Phyml and PAL  we encourage users to cite these  programs as well when 
6.  optimization of  model  branches and topology of the tree   see the    Optimization strategies     section below     4  Optionally  you might want to restrict the set of candidate models  For that you  should click the    models    button  Read first the section    Restricting the set of  candidate models       5  And click the    Start    button     ProtTest  selection of models of protein evolution    Protlest 2 0 11  About WWW Help                         ProtTest Tree   PAL  amp  Phyml based   Alignment   BION  tree  J User tree   required   Select file    Program options  Optimization strategy      Fast  optimiz         Set of candidate    models  Exit   Start            Figure 1  ProtTest main window     At this moment ProtTest starts computing the parameters for the different models  Be  aware that computation can take some time  maybe hours or even days  You can  watch the progress in a new window  the console  see figure 2   If there   s some error  related with the format of the alignment or the installation of the program  a warning will  appear in the console     When likelihood computations are finished  you ll be warned and prompted to select a  statistical framework  AIC  AlCc or BIC  for determining which of the candidate models  best fits your data  If you select AlCc or BIC you will be prompted to specify a criterion  to estimate the size of the sample  Additionally  you can ask ProtTest to display a  comparison of different selection scenarios  AIC  AlCc a
7.  orders  J Mol Evol 47  307 322    Dayhoff  M O   Schwartz  R M   and Orcutt  B C  1978  A model of evolutionary  change in proteins  In Atlas of Protein Sequence and Structure   ed  M O   Dayhoff   pp  345 352  National Biomedical Research Foundation  Washington   DC    Dimmic  M W   Rest  J S   Mindell  D P   and Goldstein  R A  2002  rtREV  an amino  acid substitution matrix for inference of retrovirus and reverse transcriptase  phylogeny  J Mol Evol 55  65 73    Drummond  A   and Strimmer  K  2001  PAL  an object oriented programming library  for molecular evolution and phylogenetics  Bioinformatics 17  662 663   Gascuel  O  1997  BIONJ  an improved version of the NJ algorithm based on a simple   model of sequence data  Mol Biol Evol 14  685 695    Gilbert  D  2001    Guindon  S   and Gascuel  O  2003  A simple  fast  and accurate algorithm to estimate  large phylogenies by maximum likelihood  Syst Biol 52  696 704    Henikoff  S   and Henikoff  J G  1992  Amino acid substitution matrices from protein  blocks  Proc Natl Acad Sci U S A 89  10915 10919    Jones  D T   Taylor  W R   and Thornton  J M  1992  The rapid generation of mutation  data matrices from protein sequences  Comp  Appl  Biosci  8  275 282    Le  S Q   and Gascuel  O  2008  An improved general amino acid replacement matrix   Mol Biol Evol 25  1307 1320    Maddison  W P   and Maddison  D R  1992  MacClade  analysis of phylogeny and  character evolution  Sinauer Associates Inc   Sunderland  Massachusetts  US
8.  s Shannon Entropy over the whole alignment  169 2    AICc BIC 3  sample size as  align  length x num sequences x averaged Sh  Entropy  822 2     10    ProtTest  selection of models of protein evolution    We can end to some conclusions  the empirical WAG matrix is clearly the one that fits  best the family of L5 proteins  However  modifying the matrices with the observed  amino acid frequencies  applying  F  allows RtREV models to better fit the data  compared to WAG  Since the use of these observed frequencies add extra parameters  to the model  AIC  AlCc and BIC interpretate this better fit of RtREV models as an  over fit  and penalizes them consequently  Including a gamma distribution to account  for different rates of change at different positions is always of some importance   ranging from 0 09 in AlCc 3 to 0 43 in BIC 3   Including an invariable sites distribution  alone is not  But both  G and  I together do it better     Known bugs    Many users have reported errors when running ProtTest under Windows  Such errors  are related to filenames and file paths conflicts and can be usually circumvented by  placing the ProtTest input files  the alignment and  optionally  the tree  in a lower  directory  such as    C          Program history    Version 2 4  September 2009   Bug fixed in the reading of the proportion of invariable  sites    Version 2 2  August 2009   Some new options added to the command line version of  the program  E g     numcat    Version 2 1  June 2009   
9.  this question is by using the Akaike Information Criterion  AIC    Akaike  1973      AIC    2LnL   2K    12    ProtTest  selection of models of protein evolution     LnL  log likelinood  K  number of parameters   The model with lowest AIC is expected  to be the closest model to the true model among the set of candidate models  Since  AIC is on a relative scale  it is useful to present also the AIC differences  or deltaAIC    For the ith model  the AIC difference is     Ai   AIC      min AIC  where min AIC is the smallest AIC among all candidate models   The AIC might not be accurate when the size of the sample is small compared to the  number of parameters  For these cases  it is recommended to use a second order  AIC or corrected AIC  AlCc in ProtTest   Sugiura 1978    which includes a penalty for    cases where the sample size is small     2K K  1   n K 1    AlCc   AIC      where n is the size of the sample  see below   If n is large with respect to K  the second  term is negligible  and AlCc behaves similar to AIC  The corrected AIC is  recommended when the relation n K is small  for example n K  lt  40  being K the  number of parameters of the most complex model among the set of candidate models      ProtTest also calculates the Bayesian Information Criterion  BIC   Schwarz 1978     which is another measure of model fit  The BIC is considered a good approximation of  the  very computationally demanding  Bayesian methods  and is formulated as     BIC   2LInL   K log n    
10. 1 September 2009       ProtTest  Selection of best fit models of protein evolution  version 2 4    Federico Abascal      Rafael Zardoya  and David Posada     1  Universidad de Vigo  Vigo  Spain  2  Museo Nacional de Ciencias Naturales  Madrid  Spain    abascal mncn csic es  dposada uvigo es  rafaz mnecn csic es    e What can I use ProtTest for      Introduction  e The program  using ProtTest  o Download and installation    Installing java  o Running ProtTest    ProtTest through its graphical Interface  GUI      ProtTest at the command line     ProtTest at the web  Optimization strategies  Alignment and tree formats  Restricting the set of candidate models  o ProtTest   s output  A guided example  the ribosomal L5 protein family  e Known bugs  e Program history  e BACKGROUND  o Models of protein evolution  o Statistics for model selection  Akaike Information Criterion and others    Phylogenetics and sample size    Akaike weights and the relative importance of parameters  e Parameter importance  e Model averaged parameter estimates  e Credits and acknowledgements  e References    o0 o    What can I use ProtTest for      Introduction    ProtTest is a bioinformatic tool for the selection of the most appropriate model of  protein evolution  among the set of candidate models  for the data at hand  ProtTest  makes this selection by finding the model with the smallest Akaike Information Criterion   AIC  or Bayesian Information Criterion  BIC  score  At the same time  ProtTest obtains
11. A   pp  398    Muller  T   and Vingron  M  2000  Modeling amino acid replacement  J Comput Biol 7   761 776    Nickle  D C   Heath  L   Jensen  M A   Gilbert  P B   Mullins  J I   and Kosakovsky  Pond  S L  2007  HIV specific probabilistic models of protein evolution  PLoS  ONE 2  e503    Posada  D   and Buckley  T R  2004  Model Selection and Model Averaging in  Phylogenetics  Advantages of AIC and Bayesian approaches over Likelihood  Ratio Tests  Systematic Biology 53  793 808    Posada  D   and Crandall  K A  1998  MODELTEST  testing the model of DNA  substitution  Bioinformatics 14  817 818     16    ProtTest  selection of models of protein evolution    Posada  D   and Crandall  K A  2001  Selecting the best fit model of nucleotide  substitution  Syst Biol 50  580 601    Reeves  J H  1992  Heterogeneity in the substitution process of amino acid sites of  proteins coded for by mitochondrial DNA  J Mol Evol 35  17 31    Schwarz  G  1978  Estimating the dimension of a model  Ann  Statist  6  461 464    Shannon  C E  1948  A mathematical theory of communication  Bell System Technical  Journal 27  379 423 and 623 656    Sugiura  N  1978  Further analysis of the data by Akaike s information criterion and the  finite correction  Comm  Statist  A Theory  Meth 7  13 26    Thorne  J L   and Goldman  N  2003  Probabilistic models for the study of protein  evolution  In Handbook of Statistical Genetics   ed  M B  D J  Balding  and C   Cannings   pp  209 226  John Wiley  amp  Sons  Lt
12. First of all  make sure that you have a Java Virtual Machine  JVM  properly installed in    your system  To test your JVM  1  Go to http   www java com en download help testvm jsp  2  Or in a terminal window  type    java    version      The JVM is also included in     Java Runtime Environment  JRE     Java 2 Platform Standard Edition  J2SE   More information on obtaining the JVM in     http   java sun com   To automatically download the JVM    http   java sun com webapps getjava BrowserRedirect    Running ProtTest  There are three different ways to run ProtTest  using the graphical interface   recommended   using the command line version  and through the web server     http   darwin uvigo es    ProtTest through its graphical Interface  GUI     Just double click the jar file  ProtTest jar      Note  Some Linux environments need to be configured to respond to jar double clicks  If  you are not able to set up Linux to do so  you can launch ProtTest by running the  runXProtTest script     A window like the one in figure 1 will appear in the screen  Now follow these steps     1  Input an alignment in phylip  recommended  or nexus sequential format  more  about accepted formats in the section    Alignment and tree formats    below     2  Optionally you can input a tree topology in newick format  if not  a BIONJ tree  will be calculated    3  Select the strategy  fast  fixed tree  optimization of model parameters and  branch lengths  this is the strategy used by Modeltest  or slow 
13. NS Sans ened see   JTT I  Number of parameters                  40  1   39 branch length estimates   proportion of invariable sites      0 069  LTT Saharan E E AE Talk AEE   2955 16  OhOm8s     When likelihoods and model parameters are estimated for all models  we will be  prompted to select a statistical framework for the selection of the best model  To  accomplish this we should select one of the options that can be found at the bottom of  the console  Even if raw likelihoods are not adequate for model selection we start by  selecting this option to illustrate some concepts  We   ll get something like this     KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK    Maximum Likelihood   lnL  framework  KKKEKKK KKK KK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKK KKKKEKK    Best model according to  lnL  WAG I G F  KKEKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK             odel deltaAIC AIC  inL  AICw  RtREV I Gt F 0 96 5908 73  2894 36 031  RtREV G F 2625 5910 02  2896 01 0 16  AG I G F 13 48 5921 25  2900 63 0 00  AG G F 16 46 5924 23  2903 12 0 00  AG I G 0 00 5907 77  2912 89 0 49  AG G 5x29 5913 06  2916 53 0 04  Blosum62 I G 9 84 5917 61  2917 81 0 00  CpREV I G F 51 66 5959 43  2 919 71 0 00        rest of lines omitted     We can see which model has the highest likelihood  the RtREV I G F  Some  information related with the AIC framework is also displayed  but forget it by now     Now we select the AIC framework and this is what we obtain     kkkkx
14. Updated to a new release of Phyml    Version 2 0  March 2009   Major update  LG and DCmut models included  Updated to  Phyml v3 version    Version 1 4  July 2007   HIVb and HIVw models added to ProtTest    Version 1 3  January 2006   Version tracking renumbered according to the release of  a new version    Version 1 2 16  November 2005   Minor aesthetic change  some information is printed  to the console when ProtTest is launched in the console mode    Version 1 2 12  July 2005   A bug in the    overall comparison    has been fixed  thanks  to Marc Elliot     Version 1 2 10  April 2005   New model in ProtTest  MtArt is a replacement matrix for  arthropod mitochondrial proteins  It has been estimated with Paml    Version 1 2 8  February 2005   bug corrected so ProtTest is now java 1 3 compatible   Version 1 2 6  January 2005   added the ability to specify the number of rate  categories for the gamma distribution    Version 1 2 4  January 2005   MtMam matrix added to ProtTest and Phynil    Version 1 2 2  December 2004   some adjustments to the calculation of model   averaged parameters and their importance    Version 1 2 0  November 2004   models based on VT  CpREV  RtREV and Blosum62  added to Phyml and ProtTest    Version 1 0 4  October 2004   updated to new version of Phyml  2 4 1   ProtTest now  checks for taxa name duplicates  this caused problems with Phyml     Version 1 0 2  October 2004   A problem with spaces in the path has been corrected   Some improvements in the co
15. a good candidate  RtREV I G F has the highest likelihood  but it has  more parameters  19 from the  F  and two from     l G     plus the ones for the branch  lengths  and consequently is penalized by the AIC  what situates WAG I G as the  best fit model  The worst models are the ones built upon the MtREV matrix     Below the table of models  we can see the relative importance of parameters  There  we find that including  I G is very important  the two best models use this distribution     G seems to be also important  but    alone seems to describe poorly the evolution of  these proteins  In this example  adding  F has some importance  0 47  because the  model RtREV I G F is the second best     Below we see a model averaged estimate of parameters  In this example the  averaged alpha shape of the models  I G has a value of 2 69  Why  The alpha of the  models  I G  mainly the top ranking  WAG I G and RtREV I G F  is averaged using  the weight of those models  Note that both the importance and the averaged estimate  of alpha are separated for models  G and models  I G  given the interdependence of     and  G parameters  The same stands for the models  I     Let   s try now the AlCc framework to see if the AIC model selection is affected by a not  enough large sample size  If we select AlCc at the bottom of the console a new window  will appear  prompting us to select a criterion to determine the sample size  We start  selecting  for example  the    Alignment length     As a resu
16. d   Chichester  England    Whelan  S   and Goldman  N  2001  A general empirical model of protein evolution  derived from multiple protein families using a maximum likelihood approach   Mol Biol Evol 18  691 699    Yang  Z  1993  Maximum likelihood estimation of phylogeny from DNA sequences  when substitution rates differ over sites  Mol Biol Evol 10  1396 1401     17    
17. ection  AIC   AlCc  BIC or LnL      If AlCc or BIC      sort C    and     sort B     respectively  are selected  you might want to  change the default criterion for sample size interpretation  To accomplish this  set the       sample    option to one of the values shown above  0 5      The optimization strategy  see the section    Optimization strategies    below  is specified  by the    S option followed by a 0 or 1  fast or slow      The     all    option is set to true by default  Since the command line version of ProtTest is  not interactive and you cannot play with the different frameworks once the likelihoods  are calculated  the  all option is useful for having a table in which you can see at one  sight how the best model selection is affected under seven different scenarios  AIC   AlCc and BIC with three different criteria for sample size    corresponding to 0  1 and 2  in the     sample    option       If  t1 or  t2 options are set to    true     the tree corresponding to the best model will be  displayed in the output  in newick format or ASCII representation  respectively    The order in which you specify the options doesn   t matter     Example        runProtTest  i alignment _file phylip  S 0  sort C sampl i  o  results txt  t1 T  MtArt F  MtREV F  MtMam F   F F       By running ProtTest with these options we will find the model that fits best the  alignment contained in    alignment_file phylip    according to the AlCc criterion and  taking as sample size the nu
18. kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk  Akaike Information Chriterion  AIC  framework  kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk  Best model according to AIC  WAG I G  kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk    Model deltaAIC  AIC AICw  1nL    ProtTest  selection of models of protein evolution    WAG I G 0 00 5907 77 0 49  2912 89  RtREV I G F 0 96 5908 73 0 31  2894 36  RtREV G F 2 25 5910 02 0 16  2896 01  WAG G 5 29 5913 06 0 04  2916 53  Blosum62 I G 9 84 5917 61 0 00  2917 81  WAG I G F 13 48 5921 25 0 00  2900 63  Blosum62 G 14 07 5921 84 0 00  2920 92  WAG G F 16 46 5924 23 0 00  2903 12   some lines omitted   MtREV G 402 75 6310 52 0 00  3115 26  MtREV I 584 89 6492 66 0 00  3206 33  MtREV 616 99 6524 76 0 00  3223 38    kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk    Relative importance of parameters  kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk    alpha   G   0 20  p inv   I   0 00  alpha p inv   I G   0 80  freqs   F   0 47    kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk    Model averaged estimate of parameters  kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk    alpha   G   1 67  p inv   I   0 07  alpha   I G   2 69  p inv   I G   0 05    The best model according to the AIC criterion is the WAG I G model  and the  probability that it is the best AIC model is 0 49  its Akaike weight   The second best  model is RtREV I G F and since the AIC difference is 0 96  with a 0 31 weight   this  model is also 
19. le interpretations  through a guided example  the case of the ribosomal L5  C terminal domain  protein  family  which you can find in the examples Ribosomal_L5 PF00673 directory  We  will use the graphical version of ProtTest     ProtTest  selection of models of protein evolution    First  we double click the ProtTest jar file  If we have the proper java version a window  will appear  Then  we enter the alignment  we can use the file included in the examples  folder at examples Ribosomal_L5 PF00673 alignment file   We leave the other  options as they are and press the    Start    button  A new window will appear  we will  refer to this window as the    console    window  to distinguish it from the main one   In the  console we can watch the progress of the analysis and check if everything is working  properly  We will see a header reporting some information about the alignment and  below the results for each model as they are being optimized  Something like     The header  bla  bla                 Modeles canre Dubie aha  TE a E ae 200  Goure a SIJ LT  Number of parameters                  39  0   39 branch length estimates   SHUT E eee arava kee Gueth Boos where eee ates   2980 51  OhOm15s   MOG iF se ehan ails ie oh be  sae E S Susie tere   JTT F  Number of parameters                  58  19   39 branch length estimates   aminoacid frequencies               observed  s above   eS TPT Mies toe ee ssa sate a a aa Mie eee eee ate aMe Ts   2977 23  OhOm6s   Modekai a aA aie a A
20. lt  the following will be  displayed in the console     ProtTest  selection of models of protein evolution    kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk  Second order AIC  AICc  framework  Sample size  Total number of characters  aligment length      113 00    kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk    Best model according to second order AIC  WAG G  kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk    Model deltaAICc  AICc AICcw  1nL   WAG I G 0 00 5956 28 0 76  2912 89  WAG G 2 34 5958 61 0 24  2916 53  Blosum62 I G 9 84 5966 12 0 01  2917 81  Blosum62 G 11 12 5967 39 0 00  2920 92  CpREV I G 19 31 5975 58 0 00  2922 54    Umm  now the RtREV I G F and RtREV G F models dissapear from the top of the  ranking  The AlCc tells us that the sample size isn   t large enough for supporting the  addition of the  F extra parameters in those models  Using this framework we could  say that WAG I G is the best fit model  and that the WAG G model is also an  interesting candidate     What if the alignment length is an underestimation of the size of the sample  If we try  other sample size criteria  we ll see that as sample size increases  the support given by  AlCc to more complex models is higher  the larger the sample size the most similar the  behaviour of AlCc compared to AIC is      What about the BIC framework  We may have a better perspective of the scenario if  we click the    overall comparison    button in the console  what resul
21. mber of sequences multiplied by the length of the  alignment and multiplied by a correction factor  the averaged Shannon   s entropy  normalized to a 0 1 scale   see more details in the    Phylogenetics and sample size     section   Results will be stored in a file named    results txt     The     t1 T    is telling    ProtTest  selection of models of protein evolution    ProtTest to return the tree optimized for the best fit model in newick format  This  command line excludes MtArt  MtMam and MtREV models from the analysis  as well  as the add on  F     To learn about the output of the program  and its interpretation  go to the section     ProtTest   s output  A guided example        New options   It is now possible to restrict the set of candidate models in the command line version of  ProtTest  For example  if we would like to discard the    mitochondrial models     we  should type      MtArt F  MtREV F  MtMam F       In addition  if we would like to discard  1 G models as well as  F models  just type       I G F   F F    To discard different models  proceed in a similar way     ProtTest at the web   ProtTest analysis can also be executed at its web site  http   darwin uvigo es    Functionality of the web version of ProtTest is similar to the graphical one  but the  ability of restricting the set of candidate models  and the ability to select interactivelly  different model selection criteria is not provided  as in the command line version    Enter the web page and just in
22. nd BIC with three different  criteria for sample size  by clicking the    overall comparison    button     If you want ProtTest to display the tree corresponding to the best model  select  between an ASCII representation and the tree in Newick format  Under the fast  strategy this tree becomes the topology of the initial tree  BIONJ or provided by the  user  with branch lengths optimized under the best fit model under the current selection  criterion  Under the slow strategy this tree is the best ML tree  both topology and  branch lengths will be optimized  under the best fit model under the current selection  criterion    ProtTest  selection of models of protein evolution    To learn about the output of the program  and its interpretation  go to the section     ProtTest   s output  A guided example              A080 ProtTest 1 0 6  About WWW Help  e080 ProtTest console  SE sequence  17  rL11 DROME     PAL  amp  Sequence  18  RL11 HUMAN    Sequence  19  RL11_CHLRE  k Sequence  20  R111 ARATH  r Alig Sequence  21  RL11 TETTH  Alignment contains 21 sequences of length 113    PERE EEE EEE EEE EEE EEE TEETER EEE EEE EE EEE EEE E EE ET  Observed number of invariable sites  8  Observed aminoacid frequencies     gt  aA  0 051 cC  0 009 D  0 061    E  0 062 F  0 070   G  0 091 H  0 019 I  0 074 K  0 078 L  0 082   7 OM  0 025 N  0 034 P  0 037 Q  0 021 R  0 087      S  0 042 T  0 052 v  0 066 W  0 006 Y  0 034    E E         39  0   39 branch length estimates          3003 21  0h0m7s
23. nsole interface    Version 1 0  October 2004   First release of ProtTest  Version tracking renumbered     11    ProtTest  selection of models of protein evolution    BACKGROUND  Models of protein evolution    Basically a model of protein evolution indicates the probability of change from a given  amino acid to another over a period of time  given some rate of change  Although  mecanistic models exist  Thorne and Goldman 2003   models of protein evolution are  preferentially based on empirical matrices for computational and data complexity  reasons  These matrices are constructed based on large datasets consisting of many  diverse protein families  The resulting matrices state the relative rates of replacement  from one aminoacid to another  The most common matrices  which are the ones  included in ProtTest  are the Dayhoff  Dayhoff et al  1978   JTT  Jones et al  1992    WAG  Whelan and Goldman 2001    mtREV  Adachi and Hasegawa 1996   MtMam   Cao et al  1998   VT  Muller and Vingron 2000   CpREV  Adachi et al  2000   RtREV   Dimmic et al  2002   MtArt  Abascal et al  2007   HIVb HIVw  Nickle et al  2007   LG   Le and Gascuel 2008   and Blosum62  Henikoff and Henikoff 1992  matrices     Conservation of protein function and structure imposes constraints on which positions  can change and which cannot  This evolutionary information can be inferred from a  multiple alignment but cannot be specified in a substitution matrix such as the empirical  ones described below  Fortunately
24. one of the accepted formats  using programs such as MacClade  Maddison and Maddison 1992  or ReadSeq   Gilbert 2001   The second can be easily used through its web version     http   bimas dcrt nih gov molbio readseq      You can get more information about alignment and tree formats at   http   workshop molecularevolution org resources fileformats     ProtTest  selection of models of protein evolution    http   workshop molecularevolution org resources fileformats tree formats php    Restricting the set of candidate models   This functionality is only available through ProtTest   s graphical interface  If the set of  models you are interested in is a subset of those offered by ProtTest  e g  you want to  select the best model for using a program that doesn   t support the Dayhoff matrix  you  can restrict the set of candidate models by clicking the    models    button  Then select  which empirical matrices  WAG  Dayhoff  JTT  MtREV  VT  RtREV  CpREV and  Blosum62  and improvements   l   G   F   l G  should be included in the analysis   see figure 3             A ProtTest  candidate models    Substitution matrices  Add ons     Mi JTT M  1  M   MtREV Mic  v Dayhoff v  14 G  M WAG MW  F  Mi RtREV   M CpREV   v Blosum62    MW vT      Cancel   OK         Figure 3  Restricting the set of candidate models     ProtTest   s output  A guided example  the ribosomal L5 protein family   using ProtTest version 1 2 2     In this section we ll explain the output of the program and its possib
25. p  model    1  Slow  optimize branch lengths  model  amp  topology    sample  sample size for AICc and BIC corrections  default  2    0  Shannon entropy Sum   1  Average  0 1 Shannon entropy x NxL   2  Total number of characters  aligment length           ProtTest  selection of models of protein evolution    3  Number of variable characters  4  Alignment length x num taxa  NxL   5  Specified by the user                                size  number specifying sample size  only for   sample 5    EL T F  If true  display best model s newick tree   default  false    E23 T F  If true  displya best model s ASCII tree   default  false    verbose  T F  true or false   default  true      model   T F  true or false   default  all models are set to true   model   JTT  LG  DCMut  MtREV   MtMam   MtArt  Dayhoff  WAG    RtREV   CpREV   Blosum62   VT  HIVb  HIVw      addon   T F  true or false         default  all model addons are set to true   addon   F   I   I G   G                The only required option is     i     and it must be followed by the name of the file that  contains the alignment  Other options have their own values set by default  which you  may want to change      If you want to specify a tree topology  its branch lengths are of no importance  they will  be optimized  use the     t    option followed by the tree file name  this option doesn   t  matter when     S 1    is also specified      Set the       sort    option to indicate what statistic should be used for model sel
26. put an alignment  and optionally a tree   select the  statistical criterion for model selection  and the other parameters  Your job will be sent  to a queue and you ll be notified by e mail when the analysis is finished     Optimization strategies   Ideally  one should optimize the tree topology  its branch lengths and the model  parameters  for each model  to assure maximum likelihood is achieved  This complete  optimization strategy can be performed by ProtTest when the    slow    option is selected  in the main window  However  model selection seems to be quite robust to topology as  long as this is a reasonable representation of the true phylogeny  Posada and Crandall  2001   Therefore a faster strategy  and the one implemented in the program Modeltest   Posada and Crandall 1998   is to estimate a    good    tree and make all likelihood  calculations for all models in this fixed tree  This strategy is named    fast    in ProtTest   Because it only optimizes branch lengths and model parameters it has the advantage  of being much faster     Alignment and tree formats   ProtTest is able to read  through the PAL library  the following alignment formats   phylip  interleaved or sequential  and nexus  sequential   The phylip format is  recommended since the nexus reader has some bugs  For reading trees  the newick  format is supported  You can find examples of these formats in the    formats examples     directory     If your data is in a different format you can convert it to 
27. ts in a table like the  following  where the ranking of models  the importance of parameters  and other  statistics are compared under seven frameworks     model AIC AICc 1 AICc 2 AICc 3 BIC 1 BIC 2 BIC 3   WAG I G 0 49 1  0 76 1  0 86 1  0 86 1  0 78 1  0 74 1  0 57 1    RtREV 1I G F 0 31 2  0 00 19  0 00 11  0 04 3  0 00 12  0 00 12  0 00 18    RtREV G F 0 16 3  0 00 16  0 00 10  0 02 4  0 00 11  0 00 11  0 00 16    WAG G 0 04 4  0 24 2  0 13 2  0 07 2  0 22 2  0 25 2  0 42 2    Blosum62 I G 0 00 5  0 01 3  0 01 3  0 01 5  0 01 3  0 01 3  0 00 4    WAG I GtF 0 00 6  0 00 23  0 00 13  0 00 7  0 00 16  0 00 17  0 00 23    Blosum62 G 0 00 7  0 00 4  0 00 4  0 00 6  0 00 4  0 00 4  0 01 3    some lines omitted    Relative importance of   parameters AIC AICc 1 AICc 2 AICc 3 BIC 1 BIC 2 BIC 3    G 0 20 0 24 0 13 0 09 0 22 0 25 0 43    I 0 00 0 00 0 00 0 00 0 00 0 00 0 00    I G 0 80 0 76 0 87 0 91 0 78 0 75 0 57    F 0 47 0 00 0 00 0 06 0 00 0 00 0 00   Model averaged estimate of   parameters AIC AICc 1 AICc 2 AICc 3 BIC 1 BIC 2 BIC 3   alpha   G  3 02 2 01 2 01 2 05 2 01 2 01 2 02   p inv   I  0 00 0 07 0 06 0 00 0 07 0 07 0 07   alpha   I G  2 68 3 10 3 10 3 10 3 10 3 10 3 09   p inv   I G  0 03 0 05 0 05 0 05 0 05 0 05 0 05   AIC   Akaike Information Criterion framework    AICc x  Second Order Akaike framework    BIC x   Bayesian Information Criterion framework    AICc BIC 1  sample size as  number of sites in the alignment  113 0    AICc BIC 2  sample size as  Sum of position
28. using ProtTest   o  ProtTest  Abascal F  Zardoya R  Posada D  2005  ProtTest  Selection of best   fit models of protein evolution  Bioinformatics 21 2104 2105   o  Phyml  Guindon S  Gascuel O  2003  A simple  fast  and accurate algorithm to  estimate large phylogenies by maximum likelihood  Syst Biol  52  696 704   o  PAL  Drummond A  Strimmer K  2001  PAL  An object oriented programming  library for molecular evolution and phylogenetics  Bioinformatics 17  662 663     15    ProtTest  selection of models of protein evolution    References    Abascal  F   Posada  D   and Zardoya  R  2007  MtArt  a new model of amino acid  replacement for Arthropoda  Mol Biol Evol 24  1 5    Adachi  J   and Hasegawa  M  1996  Model of amino acid substitution in proteins  encoded by mitochondrial DNA  J  Mol  Evol  42  459 468    Adachi  J   Waddell  P J   Martin  W   and Hasegawa  M  2000  Plastid genome  phylogeny and a model of amino acid substitution for proteins encoded by  chloroplast DNA  J Mol Evol 50  348 358    Cao  Y   Adachi  J   Janke  A   Paabo  S   and Hasegawa  M  1994  Phylogenetic  relationships among eutherian orders estimated from inferred sequences of  mitochondrial proteins  instability of a tree based on a single gene  Journal of  Molecular Evolution 39  519 527    Cao  Y   Janke  A   Waddell  P J   Westerman  M   Takenaka  O   Murata  S   Okada  N    Paabo  S   and Hasegawa  M  1998  Conflict among individual mitochondrial  proteins in resolving the phylogeny of eutherian
    
Download Pdf Manuals
 
 
    
Related Search
    
Related Contents
PRO-790  Enregistrement et codage des RCP du CHU sous aXigate  BELT DRIVE CD PLAYER  Kidde 154144 Brochure  Fre - unesdoc  2 - Dimplex  Fujitsu ETERNUS DX DX80  QUICK START GUIDE  Field Controls UV-1500C Air Cleaner User Manual  アイスツール 取扱説明書    Copyright © All rights reserved. 
   Failed to retrieve file