        PDF file 1.7 MB - Carleton University
         Contents
1. PARAMETERS CONFIGURATION

The interface to set the parameters used in a calculation can be launched using the toolbar button "Set Parameters" or via the "Options" menu.

[Screenshot: "Set Parameters for calculations" dialog. Fields: topographic root; distance cutoff for hydrophilic groups; cutoff of surface area; quantity of decimal numbers. Weighting procedures: topological distance cutoff for the autocorrelation and gravitational weighting procedures; maximum topological distance between terminal aa in the sequence windows of the Kier-Hall weighting procedure (sub). Aggregation operators: number of bins for the Information Theory based aggregation operators (bins). Buttons: Reset to default, Cancel, Accept.]

ProtDCal: A Program for Protein Descriptors Calculation

Organizing the output file

The feature (raw) matrix obtained after a calculation, stored in the output file <name>_Prot.txt, is a block matrix that by default organizes the descriptors in the hierarchical order index > group > aggregation operator (invariant). To change the order in this output file, the "Output Tags Order" button, located in the "Options" menu, provides two options: <index>_<group>_<invariant> (default) and <group>_<index
2. [Screenshot: Weka Explorer, "Select attributes" panel. The attribute evaluator list includes CostSensitiveSubsetEval, FilteredAttributeEval, FilteredSubsetEval, GainRatioAttributeEval, InfoGainAttributeEval, LatentSemanticAnalysis, OneRAttributeEval, PrincipalComponents, ReliefFAttributeEval, SVMAttributeEval, SymmetricalUncertAttributeEval and WrapperSubsetEval. The output reports a forward search over all input data, 1929 subsets evaluated, a merit of the best subset found of 0.291, and the selected attributes 1, 2, 3, 6, 7, 8, 9, 12, 15, 16, 17, 18, 24, 40, 69, 70, 79, 82, 96, 138.]

Once the extraction is finished, the reduced subset is saved and used to build the corresponding classifier over all training data, using a configuration similar to the one used during the Wrapper. The "Classify" panel of Weka offers options to automatically perform x-fold cross-validation, a hold-out prediction test by splitting the input data, and external prediction by providing a second set of test instances with the corresponding features and class attribute. This latter option was used to evaluate our final naive Bayes and random forest classifiers using the blind test data
3. N1 N3 Ar P2 M V CV Q3 K Q1 DE MI

F. This section specifies the parameter values needed to evaluate the indices and invariant aggregation operators. The parameter values are listed as follows:

parameters(t_cont s_cont A HydGroup n bins K SubG)
4.0 8.0 5.0 9.4 3.0 50 5 3

These parameters adopt default values. We do not recommend changing the numbers unless the user has an advanced knowledge of their influence on the requested features. Please contact the authors for further direction regarding this subject. The following table provides a brief description of the parameters.

t_cont: Topological cutoff for inter-residue contacts. Minimum value of sequence separation between pairs of residues in contact.
s_cont: Spatial cutoff for inter-residue contacts. Maximum value of distance between the Cα of pairs of residues in contact.
A: Cutoff of superficiality. Minimal percent of the total surface area of a given residue for being labeled as superficial.
HydGroup: Distance cutoff to identify hydrophilic groups of residues. This parameter is used by the thermodynamic indices Gw(F), DGw, W(F). Its value must vary between 7.6 and 10.6.
n: This parameter is used in the index "logarithm of the Folding Degree" (lnFD) as the order of the power to which the spatial distance (between the Cα of a pair of residues) is raised to compute their "compaction" (quotient between the
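As an illustration of section F, the sketch below pairs the parameter names with the values on the line that follows them. It assumes the values read as "4.0 8.0 5.0 9.4 3.0 50 5 3" (the decimal points are a reading of the extracted text), and `parse_parameters` is a hypothetical helper, not part of ProtDCal. The only validation applied is the 7.6 to 10.6 range that the table gives for HydGroup.

```python
# Hypothetical helper, not part of ProtDCal: pairs the parameter names of
# section F with the values listed on the following line of a project file.
PARAMETER_NAMES = ["t_cont", "s_cont", "A", "HydGroup", "n", "bins", "K", "SubG"]


def parse_parameters(values_line):
    """Return a dict mapping each parameter name to its numeric value."""
    tokens = values_line.replace(",", " ").split()
    if len(tokens) != len(PARAMETER_NAMES):
        raise ValueError(
            f"expected {len(PARAMETER_NAMES)} values, got {len(tokens)}"
        )
    params = {name: float(tok) for name, tok in zip(PARAMETER_NAMES, tokens)}
    # The manual states that HydGroup must lie between 7.6 and 10.6.
    if not 7.6 <= params["HydGroup"] <= 10.6:
        raise ValueError("HydGroup must lie between 7.6 and 10.6")
    return params


# Assumed reading of the default values shown in the manual.
params = parse_parameters("4.0 8.0 5.0 9.4 3.0 50 5 3")
for name in PARAMETER_NAMES:
    print(f"{name:>8} = {params[name]}")
```

A check of this kind can catch an out-of-range HydGroup before a long calculation is launched.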
4. ProtDCal: A Program for Protein Descriptors Calculation

USER MANUAL

ProtDCal 1.0: Protein Descriptors Calculation

Unit of Computer-Aided Molecular "Biosilico" Discovery and Bioinformatic Research (CAMD-BIR), Universidad Central "Marta Abreu" de Las Villas

CONTENT TABLE
ABOUT US
GETTING STARTED
WORKSPACE
BASIC ENVIRONMENT
[illegible entry]
[illegible entry]
Aggregation Operators Panel
DESCRIPTION OF MENUS
OUTPUT FILES
PARAMETERS CONFIGURATION
Organizing the Output file
PROJECTS
[illegible entry]
[illegible entry]
5. [Screenshot: RMSD "Configuration" panel with the dataset file list (File_1.pdb to File_21.pdb), a "Compare with" list, and an "Atom" selector offering Carbon Alfa, Backbone and All Atoms, with Calculate and Cancel buttons.]

In the "Advanced" panel, other options are available, such as:

• Use the best N residues: This option performs two iterations. The first superimposes the structures considering all the residues; the second considers only the N best-aligned pairs of residues to rebuild the superposition and compute the RMSD.

• Use specific ranges: This option superposes the structures and computes the RMSD among them using specific ranges of residues from the target, the decoys, or all the structures.

[Screenshot: RMSD "Advanced" panel with the option "Use specific ranges", chain ID and range fields for the target and the decoys, Add Range and Remove Range buttons, the "Atom" selector, and Calculate and Cancel buttons.]

OUTPUT FILES

The button "Set output for Results", in the toolbar or in the "File" menu, allows one to set the file path to save the results of a calculation. This button launches an explorer to set the path and name of the output files.

[Screenshot: "Save Results" dialog.]
6. [Screenshot residue: file dialog with Load and Cancel buttons.]

USER SPECIFIED INDICES

The button "Define new Property", located in the "Options" menu, permits the definition of specific property-based indices. This option launches the following window:

[Screenshot: "Add a new Property Index" window with a "Property List" panel and a "Residue Values" panel that lists the residues (ALA, ARG, ASN, ASP, CYS, ..., GLY, HIS, ILE, LEU, LYS, MET, PHE, ...) with an editable value for each.]

The panel "Property List" provides the list of available indices. The "Residue Values" panel permits editing the values assigned to each residue.

Once new indices have been defined, the option "Select new Property" in the "Options" menu permits selecting these indices for calculation.

[Screenshot: "Select Property Indices" window with Cancel and Accept buttons.]

USER SPECIFIED GROUPS

To create new groups, select "Define new group", located in the menu Options > Managing Groups, which launches the following window:

[Screenshot: "Group Manager" window with a "Group List" panel, a "Group specifications" panel (Description, Ranges), and Cancel and Save buttons.]

This option allows the definition of new groups of residues. These groups are created by extracting specific ranges of residues that can be fixed using the panel "Ranges". The ranges can be configured by se
7. Nth root

[Screenshot: Aggregation Operators panel. Distances: Manhattan distance (N1), Euclidean distance (N2), Minkowski distance (N3). Means: Arithmetic Mean (Ar), Harmonic Mean, Potential Mean (P2), Geometric Mean (G), Potential Mean 3 (P3). Statistics: Kurtosis (K), Variation Coefficient (CV), Percentile 25 (Q1), Range (RA), Standard Deviation (DE), Percentile 50 (Q2), Skewness (S), Minimum Value (MN), Percentile 75 (Q3), Variance, Maximum Value (MX), I50. Classics: Standardized Information Content (SI), Mean Information Content (MI), Total Information Content (TI).]

DESCRIPTION OF MENUS

File: This menu allows uploading and/or exporting the different files that are used by the program, e.g. projects and input or output files.

[Screenshot: "File" menu with the entries Open PDB (Ctrl+A), Open FASTA (Ctrl+F), Save Result (Ctrl+R) and Exit (Alt+F4).]

Loading either a FASTA or a PDB file can be performed by clicking the corresponding button, located in the toolbar or in the "File" menu. These buttons launch an explorer to select the files to upload.

[Screenshot: "Open PDB" and "Open FASTA" file dialogs listing example files such as 1AEY.pdb, 1APS.pdb, 1CSP.pdb and 1DIV_n.pdb.]
8. The classification accuracy is reported in the "Classifier output" section of the Weka environment.

Finally, the resulting classification model can be saved from the report in the left panel, as shown below.

[Screenshot: Weka Explorer, "Classify" panel.]

Classifier: FilteredClassifier -F "weka.filters.supervised.instance.Resample -B 1.0 -S 1 -Z 14.0 -no-replacement" -W weka.classifiers.bayes.NaiveBayes -- -D
Test options: Use training set; Supplied test set; Cross-validation (Folds: 10); Percentage split (66 %)
Result list (right-click for options): View in separate window; Save result buffer; Delete result buffer; Load model; Save model

Classifier output:

Time taken to build model: 0.02 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances      3194    91.049 %
Incorrectly Classified Instances     314     8.951 %
Kappa statistic                        0.5639
Mean absolute error                    0.1182
Root mean squared error                0.2757
Relative absolute error               69.4818 %
Root relative squared error          107.3439 %
Total Number of Instances           3508

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
0.906     0.028     0.998       0.906    0.95        0.937      N
0.972     0.094     0.441       0.972    0.607       0.937      P
0.91      0.033     0.958       0.91     0.925       0.937      (weighted average)

The saved model file can then be used to predict the glycosylation states
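The summary figures in the Weka report above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the accuracy from the counts of correctly and incorrectly classified instances, and the F-measure of each class from its reported precision and recall. The function names are illustrative only; the formulas are the standard definitions, not something taken from the ProtDCal manual.

```python
def accuracy(correct, incorrect):
    """Fraction of instances that were classified correctly."""
    return correct / (correct + incorrect)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Counts reported in the Weka summary.
correct, incorrect = 3194, 314
total = correct + incorrect
acc = accuracy(correct, incorrect)

print(f"total instances: {total}")
print(f"accuracy: {100 * acc:.3f} %")
print(f"error:    {100 * (1 - acc):.3f} %")

# Precision and recall per class, as reported in the table.
print(f"F-measure (N): {f_measure(0.998, 0.906):.3f}")
print(f"F-measure (P): {f_measure(0.441, 0.972):.3f}")
```

The low precision of class P (0.441) next to its high recall (0.972) is what pulls its F-measure down to about 0.607, even though the overall accuracy is above 91 %.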
9. a directory named "ExampleGly" within the "Projects" directory, the features can be computed by executing this command line:

java -jar ProtDCal.jar -p Projects/ExampleGly -o Outputs

This calculation generates two tab-delimited output files named <project name>_AA.txt and <project name>_Prot.txt, which hold feature matrices in the formats "AA vs. residue indices" and "sequence windows vs. features", respectively. We will use the file <project name>_Prot.txt, which summarizes the computed features for each sequence window.

Preparing the data file to be read by Weka

Weka can read csv files directly, and these are easily obtained from the tab-delimited files generated by ProtDCal. Additionally, one must append the class column at the end of each line of the file. This can be accomplished, for example, with a spreadsheet program such as MS Excel, by pasting the column with the class information after the last column of features. The column with the names of the instances should also be removed to prevent Weka from interpreting it as another attribute. Finally, the document must be saved in csv format.

Running filters and attribute selection approaches with Weka

In order to eliminate some trivial features that could be generated, it is recommended to first run the unsupervised Weka attribute filter called "RemoveUseless".

Weka
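The data preparation described above (tab-delimited to csv, drop the instance-name column, append the class column) can also be scripted instead of done in a spreadsheet. The following is a minimal sketch under stated assumptions: the first line of the _Prot.txt file holds the descriptor labels, the first column holds the instance names, and the class labels are supplied in the same order as the rows. The function name `prot_to_weka_csv` and the column name "class" are hypothetical.

```python
import csv


def prot_to_weka_csv(prot_path, csv_path, class_labels, class_name="class"):
    """Convert a tab-delimited _Prot.txt file into a csv file for Weka.

    Assumes the first row is the header and the first column holds the
    instance names; that column is dropped and the class is appended last.
    Returns the number of instances written.
    """
    with open(prot_path, newline="") as src:
        rows = [row for row in csv.reader(src, delimiter="\t") if row]
    header, data = rows[0], rows[1:]
    if len(data) != len(class_labels):
        raise ValueError(
            f"{len(data)} instances but {len(class_labels)} class labels"
        )
    with open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        # Drop the instance-name column, append the class column.
        writer.writerow(header[1:] + [class_name])
        for row, label in zip(data, class_labels):
            writer.writerow(row[1:] + [label])
    return len(data)
```

A call such as `prot_to_weka_csv("Outputs/ExampleGly_Prot.txt", "ExampleGly.csv", labels)` would then produce a file that Weka can open directly; the paths here are placeholders.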
10. [Screenshot residue: end of the selected attribute list (..., 9, 12, 15, 16, 17, 18, 24, 40, 69, 70, 79, 82, 96, 138).]

After uploading this reduced subset of features, it is advisable to finish by running the "WrapperSubsetEval" attribute selection approach. Depending on the number of features remaining in your data file, a genetic search may be used within the wrapper. However, if the number of attributes is too high (>100), a "BestFirst" search is preferable for a first reduction. The Wrapper should be executed with the same type of classifier that you intend to use later to evaluate your final model over the test data. For the study of N-glycosylation presented in the ProtDCal paper, a genetic search with 50 chromosomes per population and 500 generations was conducted. As the evaluator, a "FilteredClassifier" was used, which applies a "Resample" filter to the training data such that a class-balanced subset is sampled for each cross-validation fold. This subset is used to train a classifier (both NaiveBayes and RandomForest were considered), which is then evaluated on the hold-out set during the x-fold iteration of the Wrapper.

[Screenshot: Weka Explorer, "Select attributes" panel, showing the attribute evaluator list (CfsSubsetEval, ChiSquaredAttributeEval, ClassifierSubsetEval, ConsistencySubsetEval, CostSensitiveAttributeEval, ...).]
11. >_<invariant> (alternative).

[Screenshot: "Options" menu with the entries Set Parameters (Ctrl+S), Repair dataset(s), Windex Options, Output Tags Order (IDX_GROUPS_INV, GROUPS_IDX_INV), Convert Dataset, Manage Indices and Manage Groups.]

PROJECTS

Projects are text files that include all the options required to execute a calculation. To configure a project, one must set all the options of a calculation (i.e. loading the data set, indices, modification operators, groups, aggregation operators, and parameters); the project can then be exported by using the button "Save Project", located in the toolbar. The path to the dataset is kept as part of the project.

Project Structure

A ProtDCal project consists of several tags that identify each of the configuration parameters for a given calculation. The structure of a project is divided into seven sections:

A. Path of the directory containing the input file(s). This section comprises two lines, as illustrated below:

directory
F:\WORK\RESEARCH\ProtDCal\Datasets\Fasta_Protein_Format\prediction

B. This section summarizes the tags of the selected indices, separated by commas:

indices
Gw(U),Gs(U),W(U),Mw,HP,ECI,Vm,Z1,Z2,Z3,ISA,Pa,Pb,Pt

When using weighted topographic indices (wIdx), such as the weighted Contact Order (wCO), additional lines are needed to specify the selected weights, separated by com
12. hydrogen should be present to compute the indices HBd and wHBd.

• Save project files to keep a record of your calculations.

• Use the "Run Multi project" tool to execute several project files in batch mode.

• Increase the memory of the Java Virtual Machine before launching ProtDCal's graphical user interface when several concurrent calculations are going to be executed. Each calculation will run in a separate thread. An example using 2000M is as follows: java -Xmx2000m -jar ProtDCal.jar

• Use the command-line interface to execute ProtDCal from a terminal console.

Once a dataset is uploaded, the interface provides access to the available indices, depending on the input file type (PDB or FASTA format). The next figure depicts the interface with access to all types of indices, as obtained when PDB files are used.

[Screenshot: ProtDCal 1.0 main window (menus: File, Options, Thermo-kinetics, Analyze, Run, Help) with the index panels "Thermodynamic Indices of Folded Protein States", "Thermodynamic Indices of the Extended Protein State", "Topographic Indices of Folded Protein States", "Chemical-Physical and Structural Composition Indices", "TAE AminoAcid Descriptor" and "Other Properties Indices".]

The panels bel
13. [illegible entry]
USER SPECIFIED INDICES
USER SPECIFIED GROUPS
EXECUTING CALCULATIONS
BASIC MODELING WORKFLOW USING PROTDCAL AND WEKA
Exemplification in the prediction of N-glycosylation

ABOUT US

ProtDCal is a protein modeling platform developed and maintained in the Unit of Computer-Aided Molecular Discovery and Bioinformatics Research (CAMD-BIR) of the Universidad Central "Marta Abreu" de Las Villas (UCLV) and the Department of Systems & Computer Engineering of Carleton University (CU).

Project members

Yasser B. Ruiz-Blanco (yasserrb@uclv.edu.cu), UCLV
Waldo Paz-Rodriguez (waldopaz@uclv.cu), UCLV
Yovani Marrero-Ponce, Ph.D. (ymarrero77@yahoo.es), UCLV
James Green, Ph.D. (jrgreen@sce.carleton.ca), CU

Citations

Ruiz-Blanco, Y. B. et al., ProtDCal: A Program to Compute General-Purpose Numerical Descriptors for Sequences and 3D Structures of Proteins. BMC Bioinformatics, 2015 (Submitted).

Ruiz-Blanco, Y. B., et al., A Hooke's law-based approach to protein folding rate. Jou
14. [Screenshot: Weka Explorer, "Preprocess" panel. The unsupervised attribute filter list (MergeTwoValues, NominalToBinary, NominalToString, NumericCleaner, NumericToBinary, NumericToNominal, NumericTransform, Obfuscate, PartitionedMultiFilter, PKIDiscretize, PrincipalComponents, PropositionalToMultiInstance, RandomProjection, RandomSubset, RELAGGS, Remove, RemoveType, RemoveUseless, Reorder, ReplaceMissingValues, Standardize, StringToNominal) is open with RemoveUseless selected. The "Selected attribute" box shows a numeric attribute with 0 missing values, 20 distinct values, Maximum 31.87, Mean 19.516 and StdDev 7.761.]

This filter will eliminate all constant attributes that may be generated by ProtDCal following the project file.

Depending on particular interests and the desired number of attributes, other filters can be applied at this stage. It is recommended to perfo
15. [Figure residue: example rows of a <name>_AA.txt file for 1BNI_A.pdb (residues GLY9, VAL10, ALA11, ASP12, TYR13, LEU14), one row per residue followed by its index values.]

Similarly, the output file <name>_Prot.txt contains, in the first line, the labels of all computed descriptors, which are a combination of the indices, groups, and aggregation operators selected in the main interface. The figure below depicts an example of this type of file.

[Figure residue: example <name>_Prot.txt file. The header row starts with "PDB NAME", followed by descriptor labels that combine an index, a group (ALA, ARG, ...) and an aggregation operator (N1, N2, N3, ...). Each following row holds the values for one input file (1BNI_A.pdb, 1BTA.pdb, 1CSP.pdb, 1DIV_c.pdb, 1DIV_n.pdb, ...).]
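The way the header labels of a _Prot.txt file arise as combinations of indices, groups and aggregation operators can be sketched as follows. This is an illustration only: the underscore separator and the two orderings follow the "Output Tags Order" options described in this manual, but the function `descriptor_labels` is hypothetical, and real ProtDCal labels may carry extra fields (for example a weighting operator tag) that this sketch ignores.

```python
from itertools import product


def descriptor_labels(indices, groups, invariants, group_first=False):
    """Build one label per (index, group, aggregation operator) combination."""
    labels = []
    if group_first:
        # Alternative order: <group>_<index>_<invariant>
        for grp, idx, inv in product(groups, indices, invariants):
            labels.append(f"{grp}_{idx}_{inv}")
    else:
        # Default order: <index>_<group>_<invariant>
        for idx, grp, inv in product(indices, groups, invariants):
            labels.append(f"{idx}_{grp}_{inv}")
    return labels


labels = descriptor_labels(["Gw(U)", "HP"], ["ALA", "ARG"], ["N1", "N2", "N3"])
print(len(labels))  # 2 indices x 2 groups x 3 operators = 12 labels
print(labels[:4])
```

The number of descriptors grows as the product of the three selections, which is why the feature filtering steps described in the Weka workflow matter in practice.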
16. [Screenshot residue: "Open PDB" and "Open FASTA" file dialogs listing example input files, with Open and Cancel buttons.]

Options: This menu permits configuring the parameters used to evaluate the indices, fixing the number of significant digits in the output files, and, in particular, selecting the modification operator (Windex, weighted index) to be applied to the computed residue indices. After this operator is applied, the index values are updated, and the subsequent procedures (grouping and aggregation) use these new index values instead of the original, unmodified indices. Note that the selected operator is applied to all selected indices in the same manner. To evaluate different operators, a separate execution needs to be configured (rerun the GUI, or save and execute multiple projects using the different operators in batch mode). In addition, the "Options" menu permits defining new indices and grouping criteria.

[Screenshot: "Options" menu with the entries Set Parameters (Ctrl+S), Repair dataset(s), Windex Options, Output Tags Order, Convert Dataset, Manage Indices and Manage Groups.]

Functi
17. T HEX TRN RCL INT SUP PRT

ProtDCal Procedural Aggregation_operators ID:

> Distances: N1 N2 N3
> Means: Ar P2 P3 M G V
> Statistics: CV Q3 S RA MN K Q1 MX DE Q2 I50
> Information Theoretic Operators: SI MI TI

Below is a screenshot showing the structure of an actual project.

[Screenshot: project file with the sections directory, indices, functions, groups (ALA GLU GLY HIS MET PHE ARM PLR NCR SHT HEX TRN RCL INT SUP PRT), invariants (N1 N3 Ar P2 M V CV Q3 K Q1 DE MI), parameters(t_cont s_cont area dHSG n Int K Subgraph) and options(decimals armonicMeanType geometricMeanType windexID datasetType outputOrder).]

Loading a project

To load a project, use the button "Load Project", located in the toolbar. This button launches an explorer to select the desired project.

[Screenshot: "Load a Project" dialog showing the "Projects" folder with .proj files (e.g. groups.proj, ivan.proj, kier.proj, none.proj, auto.proj, tae.proj, elect.proj, gravi.proj).]
18. java -Xmx1000m -jar ProtDCal.jar -p <path to projects' directory> -o <path to outputs' directory>

If no option is specified, this line simply launches the graphical user interface.

ProtDCal's command-line options:

-p: Defines the path to the directory enclosing the projects to execute. All projects in this directory will be computed.

-o: Defines the location in which to create the output files. Each file takes the same name as the corresponding project.

-v: Defines whether to include the name of the project within the labels of the final descriptors: 0, no (default); 1, yes. This option is valuable when the same descriptors are computed but different parameters are evaluated each time (likely of interest only to advanced users).

BASIC MODELING WORKFLOW USING PROTDCAL AND WEKA

ProtDCal is intended to generate a wide variety of features describing a protein sequence and/or structure. By applying feature selection, an appropriate feature subset may be identified and used to create effective classifiers. Below, we detail the creation of a predictor of N-linked glycosylation based on protein sequence.

Prediction of N-linked glycosylation from protein sequence

Gathering the data set of instances

3508 sequence-unique windows of 15 aa, each centered on an Asn residue, were extracted from the 242 protein sequence targets of O-GLYCBASE. This data set can be found, in FASTA f
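For batch runs, the command-line options -p, -o and -v listed earlier in this excerpt can be assembled from a script. The sketch below only builds and prints the command; it assumes that the options are written with a leading dash, that -v takes its value as a separate argument ("-v 1"), and that ProtDCal.jar sits in the current directory. The helper name `protdcal_command` is hypothetical.

```python
import shlex


def protdcal_command(projects_dir, outputs_dir, include_project_name=False,
                     max_heap_mb=1000, jar="ProtDCal.jar"):
    """Return the argument list for a ProtDCal command-line run."""
    cmd = [
        "java", f"-Xmx{max_heap_mb}m", "-jar", jar,
        "-p", projects_dir,   # directory with the project files to execute
        "-o", outputs_dir,    # directory in which output files are created
    ]
    if include_project_name:
        # Assumed syntax for the -v option (0 = no, default; 1 = yes).
        cmd += ["-v", "1"]
    return cmd


cmd = protdcal_command("Projects/ExampleGly", "Outputs")
print(shlex.join(cmd))
# To actually run it (requires Java and ProtDCal.jar):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems when a project path contains spaces.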
19. cess calculates a distance value (using either the Manhattan, Euclidean or Minkowski (p = 3) distance) between all proteins, using standardized values of the available descriptors. This option is configured using the following interface:

[Screenshot: distance configuration dialog with a "Distances" panel (N1, N2, N3), a "Missing Values" panel, and Load and Cancel buttons.]

The distance matrix is computed from a file <name>_Prot, which must contain only the features that are going to be used to evaluate the distance metrics.

The panel "Missing Values" provides two options to deal with such data, which ProtDCal labels as -9999:

• Delete the descriptor: Removes any descriptor that contains at least one missing value.
• Geometric Mean: Replaces missing values (-9999) with the geometric mean of the other values of the descriptor.

The "Analyze" menu also implements a Root Mean Squared Deviation (RMSD) calculator, which uses the Kabsch algorithm, as implemented in the CDK (Chemistry Development Kit) library, to obtain the optimum structural alignment between protein conformations and the selected target. The RMSD can be evaluated among Cα, backbone or all the atoms of the proteins. This option is configured at

[Screenshot: RMSD "Configuration" panel with the dataset list (File_11.pdb, File_12.pdb, File_13.pdb, ...) and the target 4EBG.pdb.]
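The distance calculation described above (standardize the descriptors, handle missing values, then compute Manhattan, Euclidean or Minkowski p = 3 distances between all proteins) can be sketched in plain Python. Two points are assumptions rather than documented ProtDCal behaviour: standardization is taken to mean z-scores with the population standard deviation, and only the "Delete the descriptor" option for missing values is shown. All function names are illustrative.

```python
import math

MISSING = -9999.0  # value the manual gives for missing data


def drop_missing_descriptors(matrix):
    """'Delete the descriptor': remove columns with any missing value."""
    n_cols = len(matrix[0]) if matrix else 0
    keep = [j for j in range(n_cols)
            if all(row[j] != MISSING for row in matrix)]
    return [[row[j] for j in keep] for row in matrix]


def standardize(column):
    """z-scores of one descriptor (assumed meaning of 'standardized')."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    return [(x - mean) / sd if sd > 0 else 0.0 for x in column]


def minkowski(u, v, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def distance_matrix(matrix, p=2):
    """Pairwise distances between the rows (proteins) of a feature matrix."""
    clean = drop_missing_descriptors(matrix)
    cols = [standardize(list(col)) for col in zip(*clean)]
    rows = [[col[i] for col in cols] for i in range(len(matrix))]
    n = len(rows)
    return [[minkowski(rows[i], rows[j], p) for j in range(n)]
            for i in range(n)]


features = [
    [1.0, 10.0, MISSING],
    [2.0, 20.0, 5.0],
    [3.0, 30.0, 6.0],
]
for row in distance_matrix(features, p=2):
    print([round(d, 3) for d in row])
```

In this toy input the third descriptor is dropped because one protein lacks it, so the distances rest on the first two descriptors only.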
20. e according to the residues within a vicinity defined by the type of modification operator and its parameter value (e.g. for the autocorrelation operator with parameter k = 2, the neighbourhood of residue i comprises the residues in positions i ± 2). ProtDCal implements five modification operators that can be selected in the menu Options > Weighting operators.

iii. A third layer, named "Groups", is intended to select one or more groups of residues according to their ID or type. When a group of residues is selected, an array of index values is obtained, corresponding to the residues in the group. In addition to the implemented grouping approaches, an option is included by which users can define their own groups of residues (see the option Groups in the menu Options).

iv. A fourth layer comprises several aggregation operators that are used to combine an array of values (from a group of residues) into a single value (descriptor) reflecting the distribution of the index within that group. Some examples of these aggregation operators are the sum, average, variance, kurtosis, geometric mean, information content, etc.

The output of the calculation shows the full combination of indices, groups and aggregation operators selected in each panel. The input file formats of the software can be either PDB or FASTA: for PDB files, all indices can be computed, whereas for FASTA files, only the indices o
21. es

Groups Panel

The groups panel contains three subpanels enclosing groups formed by residue ID, chemical-physical properties and topographic features. In addition, the button "Other groups" makes it possible to select previously defined groups (see Define new groups).

[Screenshot: subpanels "Residues Based Groups", "Properties Based Groups" and "Topographic Groups", and the "Other groups" button.]

Aggregation Operators Panel

The panel of aggregation operators is divided into four categories: distances, central tendency, dispersion and information-theoretic metrics.

In the central tendency subpanel, specifically for the harmonic and geometric means, three variants are implemented to evaluate these metrics in order to avoid the undefined results associated with "0" values:

• IGNORE THE VALUES "0": With this option, all zero values are excluded only from the operations, but not from the analysis; i.e. the zero elements are still counted to obtain the value N, which refers to the size of the sample.

• PRINT -9999: This option prints the value -9999 for missing values. In the case of the geometric mean, this occurs when the group size is zero; in the case of the harmonic mean, it occurs when any of the elements is zero.

• DELETE THE VALUES "0": With this option, all zero values are excluded and are not taken into account when dividing by N nor when evaluating the
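The three zero-handling variants described above can be illustrated for the harmonic mean, N divided by the sum of reciprocals. This is a sketch of one reading of the text, not ProtDCal's code: the mode names are hypothetical, the values are assumed to be non-negative, and returning -9999 when no non-zero value remains is an added assumption.

```python
MISSING = -9999.0  # value printed for missing results, per the manual


def harmonic_mean(values, zero_mode="print"):
    """Harmonic mean N / sum(1/x) under three treatments of zero values.

    zero_mode:
      "ignore" - zeros are skipped in the sum but still counted in N
      "print"  - return -9999 if any element is zero
      "delete" - zeros are skipped in the sum and not counted in N
    """
    nonzero = [v for v in values if v != 0]
    if zero_mode == "print":
        if len(nonzero) != len(values):
            return MISSING
        n = len(values)
    elif zero_mode == "ignore":
        n = len(values)
    elif zero_mode == "delete":
        n = len(nonzero)
    else:
        raise ValueError(f"unknown zero_mode: {zero_mode!r}")
    if not nonzero:
        # Assumption: nothing left to average, so report a missing value.
        return MISSING
    return n / sum(1.0 / v for v in nonzero)


sample = [1.0, 2.0, 0.0, 4.0]
for mode in ("ignore", "print", "delete"):
    print(mode, harmonic_mean(sample, mode))
```

For the sample above the sum of reciprocals is 1.75, so "ignore" divides 4 by it and "delete" divides 3 by it, which shows how the choice of variant changes the descriptor value.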
22. es are grouped in three main classes:

Thermodynamics: almost all of these are novel indices designed in our laboratory, based on an empirical model of the main factors involved in the stability of protein structures. These indices are, in turn, divided into two panels: on one side, those that are defined for 3D folded structures and, on the other side, those based on information relating to the protein sequence. These indices refer to the contribution of the folded and unfolded (reference) states of a protein chain.

Topographic: these include many of the contact-based descriptors with proven correlation with the protein folding rate constant, e.g. the relative contact order (CO), the total contact distance (TCD), the cliquishness (CLQ), etc. These indices were originally defined as global metrics; however, they were modified to obtain a value for each residue of a protein. Each contact of the protein is weighted by a given residue property selected in this interface. The weighting procedure is conducted by multiplying the values of the selected property for both residues that are in contact.

Property-based indices: this final group encloses a number of chemical-physical and structural properties of each type of residue, such as hydrophobicity, electronic charge index, molar weight, volume, isotropic surface area, etc.

ii. Modification operators: these approaches are intended to modify the value of a selected index for a given residu
23. f the second (Thermodynamic indices for sequences) and fourth (Properties-based indices) panels can be evaluated. Multiple proteins may be input simultaneously. The output files of ProtDCal calculations are two tab-delimited text documents named <name>_AA.txt and <name>_Prot.txt, which store, respectively, all the descriptors for each residue of each protein and the descriptors for the combinations of indices, groups, and aggregation operators for each protein.

WORKSPACE

The ProtDCal workspace consists of the following program folders:

• Datasets: contains all the input data files in PDB or FASTA format.

• Outputs: contains the output files of the program (<name>_AA, <name>_Prot, etc.).

• Projects: contains all project files (<name>.proj).

• Help: contains all the documentation files about the program and descriptors.

BASIC ENVIRONMENT

When the application is executed, the following launch screen is displayed:

[Figure: ProtDCal 1.0 launch screen with the File, Options, Analyze, Run, and Help menus and the panel "Tips for best practices in using ProtDCal"]

• Check your FASTA files to prevent the presence of residue IDs other than the 20 natural ones.

• Use properly formatted PDB files. Check for the presence of TER, END, or ENDMDL lines. Secondary-structure-based groups (SHT, HEX, TRN, RCL) require the explicit definition of secondary structure ranges in the PDB files. Backbone s
24. for any other data set, for which the final features used in the model must be previously calculated using ProtDCal. The current (2015) distribution of ProtDCal contains the specific project files to compute each of the features entered in the models described in our report (Y. B. Ruiz-Blanco et al., BMC Bioinformatics, 2015) for N-linked glycosylation.

1. Ruiz-Blanco, Y. B.; Marrero-Ponce, Y.; Prieto, P. J.; Salgado, J.; Garcia, Y.; Sotomayor-Torres, C. M. Journal of Theoretical Biology 2015, 364, 407.

2. Ruiz-Blanco, Y. B.; Marrero-Ponce, Y.; Garcia, Y.; Puris, A.; Bello, R.; Green, J.; Sotomayor-Torres, C. M. Chemical Physics Letters 2014, 610-611, 135.

3. Ruiz-Blanco, Y. B.; Marrero-Ponce, Y.; Paz, W.; Garcia, Y.; Salgado, J. Journal of Theoretical Biology 2013, 321, 44.
25. ication operator to be used in the calculation.

Option tags: decimals, harmonicMeanType, geometricMeanType, windexID, datasetType, outputOrder

windexID: 0 = Autocorrelation; 1 = Gravitational; 2 = Kier-Hall; 3 = Ivanciuc-Balaban; 4 = Electrotopological State; -1 = none.

datasetType: Type of input files. True: PDB files; False: FASTA files.

outputOrder: Order of the block matrix of features in the output file. True: IDX_GROUP_INVARIANT; False: GROUP_IDX_INVARIANT.

NOTE: A project must not contain any empty lines or incorrect tags. It is strongly recommended to use the graphical user interface to configure the project initially. What follows is a list of valid tags for each section.

ProtDCal Indices tags:

>Thermodynamic Indices of Folded Protein States
Gc(F) Gw(F) Gs(F) W(F) DGs  HBd DGel DGw DGLJ DGtor

>Thermodynamic Indices of the Extended Protein State
Gw(U) Gs(U) W(U)

>Topographic Indices
A DA DAnp wSp  nfd wR2 wDHBd wNc wFLC wNLC wCO wLCO wRWCO wCTP wCLQ wPsiH wPsiS wPsil wPhiH wPhiS wPhil Phi Psi

>Property Based Indices
Mw HP IP ECI Vm Anp Z1 Z2 Z3 ISA At Ap Pa Pb Pt

ProtDCal Functions tags:

DGfold DGwat DGconf DGpack  In kf   DGscr  DGHBd

ProtDCal Groups tags:

>Residue Basic Group
ALA ARG ASN ASP CYS GLU GLN GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL

>Properties Based Group
RTR BSR AHR ALR NPR ARM PLR PCR NCR UCR UFR

>Topographic Group
SH
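The option codes listed above lend themselves to a small lookup when reading a project file by hand. The sketch below is illustrative only: it assumes the six option values are whitespace-separated in the tag order given above, and it assumes that the "none" code for windexID is -1 (code 1 is already taken by Gravitational). The function and dictionary names are hypothetical.

```python
WINDEX_NAMES = {
    0: "Autocorrelation",
    1: "Gravitational",
    2: "Kier-Hall",
    3: "Ivanciuc-Balaban",
    4: "Electrotopological State",
    -1: "none",  # assumed code for "no modification operator"
}

OPTION_TAGS = [
    "decimals", "harmonicMeanType", "geometricMeanType",
    "windexID", "datasetType", "outputOrder",
]


def decode_options(value_line):
    """Turn a line of option values into a readable dictionary."""
    fields = value_line.split()
    if len(fields) != len(OPTION_TAGS):
        raise ValueError(f"expected {len(OPTION_TAGS)} values, got {len(fields)}")
    raw = dict(zip(OPTION_TAGS, fields))
    is_pdb = raw["datasetType"].lower() == "true"
    index_first = raw["outputOrder"].lower() == "true"
    return {
        "decimals": int(raw["decimals"]),
        "harmonicMeanType": int(raw["harmonicMeanType"]),
        "geometricMeanType": int(raw["geometricMeanType"]),
        "modification": WINDEX_NAMES[int(raw["windexID"])],
        "input": "PDB" if is_pdb else "FASTA",
        "order": "IDX_GROUP_INVARIANT" if index_first else "GROUP_IDX_INVARIANT",
    }


print(decode_options("-1 0 0 2 false true"))
```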
26. ma, for each weighted index:

indices
A DA DAnp wSp InFD wR2 wDHBd wNc wFLC wNLC wCO wLCO wRWCO
wCO   ECI HP IP ISA Mw None Num_Atoms
wSp   ECI HP IP ISA Mw None Num_Atoms
wR2   ECI HP IP ISA Mw None
wDHBd   IP  ISA Mw None Num_Atoms
wRWCO   ECI HP IP Num_Atoms
wNc     HP IP ISA Mw None Num_Atoms
wNLC   ECI HP IP ISA Mw
wLCO   ECI HP IP ISA Mw None Num_Atoms
wFLC   ECI HP IP ISA Mw None Num_Atoms

C. This third section uses two lines to specify the tags of the functions, separated by commas:

functions
DGfold DGconf DGHBd

Each function corresponds to one of the models enclosed in the "Thermo&kinetics" menu, which corresponds to the empirical thermodynamic model defined in our laboratory to describe protein folding stability and kinetics.

D. The fourth section comprises two lines specifying the groups of residues selected for calculation. The tags of the groups are listed separated by commas:

groups
ALA GLY HIS  PHE ARM PLR NCR SHT HEX TRN  INT SUP PRT

Additionally, if a user creates and selects a new group of residues, the defined label is added to the list of other groups:

groups
ALA GLY HIS  PHE ARM PLR NCR SHT HEX TRN  INT SUP PRT USER 1 USER 2

E. This section summarizes the invariant aggregation operators selected to be applied to each group of residues. The tags of the operators are listed separated by commas:

invariants
27. ons

In addition to protein descriptors, ProtDCal implements the calculation of empirical thermodynamic and kinetic functions: folding free energy (ΔGfold), configurational free energy (ΔGconf), hydrophobic effect (ΔGwat), H-bond deficit free energy (ΔGHBd), close-packing interactions (ΔGpack), a scoring function for structural decoys (ΔGscr), as well as the logarithm of the folding rate constant (ln kf).

[Figure: Thermo&kinetics menu listing ΔGfold, ΔGwat, ΔGconf, ΔGpack, ΔGHBd, ln kf, and ΔGscr with their keyboard shortcuts]

Analyze

This menu gives access to three options to compare a set of protein structures or sequences.

[Figure: Analyze menu with the Graphs, Distance Matrix, and RMSD options]

First, one can plot profiles of indices and bar graphs according to the distribution of a given index along a sequence.

[Figure: Profile graph of the DAnp distribution; x-axis: amino acids, y-axis: index values]

[Figure: Bar graph of the absolute frequency of the index values in different ranges]

Distance Matrix: This option permits one to compute descriptor-based distance matrices among all proteins in an output file. This option compares different proteins by using previously computed descriptors. This pro
28. ormat, within the "Datasets" directory in the ProtDCal distribution, with the name "glyco 3508 fasta".

Generation of an initial set of features

It is known from the literature on N-glycosylation that this process is highly sensitive to the presence of specific amino acids at specific positions near the target Asn residue. The sequence motif most commonly associated with N-linked glycosylation is the "sequon" Asn-Xxx-Thr/Ser, which reflects the strong influence of a Thr or Ser residue at position Asn + 2. Therefore, it was decided to generate position-specific features for all the analysed sequence windows.

Please see the section "User specified groups" of this manual to learn how to define such groups. User-specified groups are saved in a text file named "groups.gdm" that appears in the main directory of the ProtDCal distribution. Each newly defined group is saved in this file using the following format:

RangeGroup <name>
<Comment line>
nOnOd
END Group

These four lines are summarized as follows: i) the name given to the group; ii) an optional description; iii) the starting and final positions of an inclusive range of residues gathered in the group, where n O n O means "the n residue of the first chain to the n residue of the first chain"; and iv) a marker ending the section of this group. This file can be edited directl
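The sequon mentioned in this section (an Asn followed two positions later by Thr or Ser) can be located in a one-letter sequence with a few lines of Python. This is a minimal sketch for illustration, not part of ProtDCal; the function name and the example sequence are made up, and the middle residue is left unrestricted, as in the motif stated in the text.

```python
def find_sequons(sequence):
    """Return the 0-based positions of Asn residues that start an N-X-S/T motif."""
    seq = sequence.upper()
    return [
        i
        for i in range(len(seq) - 2)
        if seq[i] == "N" and seq[i + 2] in "ST"
    ]


# Hypothetical peptide: Asn at positions 1 and 5 are followed by T/S at +2,
# while the Asn at position 8 is too close to the end to form a motif.
print(find_sequons("MNGTANASNP"))
```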
29. ow the toolbar represent three of the hierarchical levels described above (Modification Operators are accessed via the Options menu). When you mouse over each element, a brief text description pops up explaining its functionality. Panels (indices, groups, and aggregation operators) are subdivided, according to their nature, into several subpanels.

Indices Panel

The panel of indices is divided into four subpanels: Thermodynamic Indices for Structures, Thermodynamic Indices for Sequences, Topographic Indices, and Properties-based Indices.

[Figure: Indices panel with the subpanels "Thermodynamic Indices of Folded Protein States", "Thermodynamic Indices of the Extended Protein State", "Topographic Indices of Folded Protein States", and "Chemical Physical and Structural Composition Indices", plus the "TAE AminoAcid Descriptor" and "Other Indices" buttons]

In the Topographic Indices subpanel there are several weighted indices (starting with the letter "w") that can be calculated using one or several weights for inter-residue contacts. The following figure shows the window for selecting the properties to be used as weights for contacts. This window appears every time one of these weighted indices is selected, in such a way that different properties can be
30. rm a supervised attribute selection approach that analyses the relevancy and redundancy of the features. This can be carried out with a wide range of methods implemented within Weka. Here, we use the attribute selection method called "CfsSubsetEval" coupled with the "BestFirst" search method. The reduced data set can be obtained by right-clicking on the report name in the left panel of the window and selecting "Save reduced data".

[Figure: Weka Explorer, "Select attributes" tab, with CfsSubsetEval as attribute evaluator and BestFirst (-D 1 -N 5) as search method. The output reports a forward search started with no attributes and stopped after 5 node expansions, 1929 subsets evaluated, and a merit of 0.291 for the best subset found. The right-click menu of the result list includes "Save reduced data".]
31. rnal of Theoretical Biology, 2015, 364, p. 407-417.

Ruiz-Blanco, Y.B., et al., A physics-based scoring function for protein structural decoys: Dynamic testing on targets of CASP-ROLL. Chemical Physics Letters, 2014, 610-611, p. 135-140.

Ruiz-Blanco, Y.B., et al., Global Stability of Protein Folding from an Empirical Free Energy Function. Journal of Theoretical Biology, 2013, 321, p. 44-53.

GETTING STARTED

ProtDCal is a user-friendly software package developed to generate a variety of numeric descriptors for protein structures and sequences. This manual is intended to provide an overview of the main interfaces and functionalities of the program. As part of the current distribution of ProtDCal, one can find a similar tutorial and a theory section describing the formalism and parameters of the indices implemented in the program.

ProtDCal's feature generation strategy comprises four hierarchical levels:

[Figure: The four hierarchical levels of feature generation: indices (e.g. area, contact order, electrostatic free energy), modification by neighbourhood (e.g. Autocorrelation, Ivanciuc-Balaban, Electro-topological state), grouping by type of residue or by properties (e.g. hydrophilic, aromatic, alanine), and aggregation with invariant operators (e.g. sum, average, variance)]

i) An initial layer intended to select the type of indices to encode for each residue. These indic
32. selected for different indices. Alternatively, if many indices will use the same weighting properties, one can first select all the topographic indices at once by clicking the button "Topographic Indices of Folded Protein States", which launches the properties window once; the user may then deselect the undesired indices. These indices will be identified in the outputs as follows: index_name weight

[Figure: "Choose the weight(s) to obtain the wRWCO" window, with a list of selectable weights including None, Num_Atoms, Phi, Psi, and TopDist, and the Cancel and Accept buttons]

Other indices can be computed using the "TAE Amino acid Descriptor" and "Other Indices" buttons located at the end of the panel. The first option calculates the Transferable Atom Equivalent (TAE) indices, which are available at:

http   reccr chem rpi edu Software Protein Recon TAE doc

The second option computes user-defined properties (see "Creating new Properties") using the "Define new indices" option located in the menu Options > Manage Indices. This option activates the following window:

[Figure: "Select Property Indices" window, with a "Select indices" list and the Cancel and Accept buttons]

In this window the buttons  KI ALL and help to select previously defined indic
33. sequence separation and a power of the spatial distance.

bins: Number of bins used to compute the Shannon-entropy-based information-theoretic aggregation operators. The user should fix this value such that the number of residues per selected group is larger than the number of bins.

K: Parameter used by the Autocorrelation and Gravitational modification operators. This value corresponds to the sequence offset used to identify the residues that modify the initial value of the index. For example, when computing the autocorrelation modification for residue position i with K = 5, each index will be affected by the residue located 5 positions away from i.

SubG: Parameter used by the Kier-Hall modification operator. This value corresponds to the maximum length of the sub-graphs (of path type) used to modify the value of a given residue. For example, for a value of 3, all the sub-graphs of no more than 3 residues that contain residue i are used to modify its value.

G. This last section summarizes the values of the other general options of the project:

options decimals harmonicMeanType geometricMeanType windexID datasetType outputOrder
-1 0 0 -1 true true

Where:

decimals: Number of decimal places to use in the output file (-1: no rounding is done).

harmonicMeanType: Specifies the option used to deal with zeros when computing the harmonic mean.

geometricMeanType: Specifies the option used to deal with zeros when computing the geometric mean.

windexID: Specify the modif
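The role of the bins parameter can be illustrated with a histogram-based Shannon entropy. This is a sketch under stated assumptions, not ProtDCal's formula: equal-width bins spanning the observed range and a base-2 logarithm are choices made for the example, and the function name is hypothetical.

```python
import math


def binned_shannon_entropy(values, bins):
    """Shannon entropy (in bits) of values distributed over equal-width bins."""
    low, high = min(values), max(values)
    if high == low:
        return 0.0  # every value falls in the same bin
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the last bin, so clamp it.
        k = min(int((v - low) / width), bins - 1)
        counts[k] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


print(binned_shannon_entropy([0, 1, 2, 3], bins=2))     # two equally filled bins
print(binned_shannon_entropy([0, 0, 0, 3], bins=2))     # uneven occupancy
```

With more bins than residues in a group, most bins stay empty and the entropy mainly reflects the group size, which is consistent with the recommendation above to keep the number of residues per group larger than the number of bins.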
34. [Figure: Save dialog for the output text files, with the Save and Cancel buttons]

Two files result from ProtDCal calculations: <name>_AA.txt and <name>_Prot.txt. Given the input proteins, these files include the values of the descriptors for each residue and for each selected group, respectively.

In the file <name>_AA.txt, the second line contains the parameters used for the calculations, while the third line holds the labels of the requested indices. The first column, labeled "AA", represents the identifier of each residue in the proteins. This column is a combination of protein name, chain identifier, residue name, and residue number from the PDB file. The figure below depicts an example of this type of file:

[Figure: Example of a <name>_AA.txt file. The parameter line lists t_cont, n, a, dHSG, s_cont, and Windex; the label line lists AA, A, DA, DAnp, InFD, and the indices wSp, wDHBd, wNc, and wFLC weighted by ECI; each following row holds the values for one residue of the PDB entry.]
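The layout described above (parameters on the second line, index labels on the third, one residue per row with the identifier in the first column) can be read with a short script. This is a minimal sketch: it assumes tab-delimited fields and purely numeric descriptor columns, and the function name, the residue identifiers, and the numbers in the demo text are invented for illustration.

```python
def read_aa_table(text):
    """Split the contents of an <name>_AA.txt file into its three parts.

    Returns (parameter line, list of column labels, list of (residue id, values)).
    """
    lines = text.splitlines()
    parameters = lines[1]                 # second line: calculation parameters
    labels = lines[2].split("\t")         # third line: "AA" plus index labels
    rows = []
    for line in lines[3:]:
        if not line.strip():
            continue                      # skip blank trailing lines
        fields = line.split("\t")
        rows.append((fields[0], [float(x) for x in fields[1:]]))
    return parameters, labels, rows


# Hypothetical file contents, for demonstration only.
demo = "\n".join([
    "first line",
    "t_cont=4.0 n=3.0",
    "AA\tA\tDA",
    "prot_A_VAL_3\t1.5\t2.0",
    "prot_A_ILE_4\t0.5\t-1.0",
])
parameters, labels, rows = read_aa_table(demo)
print(labels)
for residue_id, values in rows:
    print(residue_id, dict(zip(labels[1:], values)))
```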
35. ttling the position of the initial and last residues as well as the identifier of the chain of each residue.

[Figure: "Define Range" window with the "Amino Acid" and "Chain ID" fields and the Accept and Cancel buttons]

The option "Select groups" permits selecting these new groups for subsequent calculations through the following interface:

[Figure: "Group selection" window with the group list and the current selection]

EXECUTING CALCULATIONS

ProtDCal permits carrying out a single calculation or running multiple projects in batch mode. The first option can be accessed directly by configuring a set of indices, groups, and aggregation operators. Alternatively, it can be executed by loading a single predefined project.

To execute several projects in batch mode, the "Run Projects" button, located in the toolbar, permits one to select a set of predefined projects through the following interface:

[Figure: "Project List" window listing project files (131.proj, 131Auto.proj, 131Elect.proj, 131Gravi.proj, 131Ivan.proj, 131Kier.proj, 132.proj, 132Auto.proj, 132Elect.proj) alongside the indices and groups specified in the selected project]

Alternatively, if a number of projects are configured, the user can execute ProtDCal in console mode as:

ProtDC
36. y by the user without the need of using the graphical interface. Fifteen new groups were defined, each corresponding to exactly one residue position within the 15-aa windows. These were named "1" through "15".

A number of residue indices were then selected to be computed for each of the 15 groups. These indices comprised distinct properties and thermodynamic indices, using the Kier-Hall modification operator (with a sub-graph parameter of "1") and the Minkowski norm "N1" as the aggregation operator. These options can be specified using the graphical user interface or by manually creating a project file with the following information (the comments, shown here in parentheses, are added to explain each entry and should not appear in the actual project file):

(path to input sequence window files)
directory
<Path to input sequence windows files or multi-FASTA file gathering all the sequence windows>

(which indices to compute for each group)
indices
Gw(U) Gs(U) W(U) HP IP ECI Z1 At Pb

(which groups to use, as defined in "groups.gdm")
groups
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

(specify aggregation operator to use)
invariants
N1

(default parameter values)
parameters t_cont s_cont A HydGroup n bins K SubG
4.0 8.0 5.0 9.4 3.0 50 5 1

(default options used)
options decimals harmonicMeanType geometricMeanType windexID datasetType outputOrder
-1 0 0 2 false true

Finally, by placing this project file in
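The aggregation step used in this example can be sketched with a generic Minkowski norm. Reading the tag "N1" as the norm of order p = 1 (the sum of absolute values) is an assumption based on the name alone; the function name and the sample values are hypothetical.

```python
def minkowski_norm(values, p=1):
    """Minkowski norm of order p: (sum of |x|**p) ** (1/p)."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


# With one residue per group, as in the 15 position-specific groups above,
# the order-1 norm reduces to the absolute value of that residue's index.
print(minkowski_norm([-2.5]))
print(minkowski_norm([1, -2, 3]))
```

Because each of the 15 groups holds a single window position, any aggregation operator here acts on one value, so the choice of norm mainly fixes the sign convention of the resulting feature.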
    