
PDF file 1.7 MB - Carleton University

1. PARAMETERS CONFIGURATION. The interface for setting the parameters used in a calculation can be launched with the Set Parameters toolbar button or via the Options menu. [Figure: "Set Parameters for calculations" dialog, with fields for the distance cutoff for hydrophilic groups, the surface-area cutoff, the number of decimals, the topological distance cutoff for the autocorrelation and gravitational weighting procedures, the maximum topological distance between terminal residues in the sequence windows of the Kier-Hall weighting procedure, and the number of bins for the information-theory-based aggregation operators; buttons: Reset to default, Cancel, Accept.] Organizing the output file. The feature (or raw) matrix written to the output file <name>_Prot.txt is a block matrix that, by default, orders the descriptors hierarchically as index > group > aggregation operator (invariant). To change this order, the Output Tags Order button in the Options menu provides two options: <index>_<group>_<invariant> (default) and <group>_<index
2. [Figure: Weka Explorer, Select attributes panel, showing the list of attribute evaluators and the output of a forward subset search (1929 subsets evaluated; merit of best subset found 0.291).] Once the selection is finished, the reduced subset is saved and used to build the corresponding classifier over all training data, using a configuration similar to the one used in the Wrapper. The Classify panel of Weka offers options to automatically perform x-fold cross-validation, hold-out testing by splitting the input data, and external prediction by providing a second set of test instances with the corresponding features and class attribute. This last option was used to evaluate our final naive Bayes and random forest classifiers using the blind test data
3. N1 N3 Ar P2 M V CV Q3 K Q1 DE MI
F. This section specifies the parameter values needed to evaluate the indices and the invariant (aggregation) operators. The parameter values are listed as follows:
parameters t_cont s_cont A HydGroup n bins K SubG
4.0 8.0 5.0 9.4 3.0 50 5 3
These parameters adopt default values. We do not recommend changing them unless the user has advanced knowledge of their influence on the requested features. Please contact the authors for further direction on this subject. The following table provides a brief description of the parameters:
t_cont: Topological cutoff for inter-residue contacts. Minimum sequence separation between pairs of residues in contact.
s_cont: Spatial cutoff for inter-residue contacts. Maximum distance between the Cα atoms of pairs of residues in contact.
A: Cutoff of superficiality. Minimal percent of the total surface area of a given residue for it to be labeled as superficial.
HydGroup: Distance cutoff to identify hydrophilic groups of residues. This parameter is used by the thermodynamic indices Gw(F), DGw and W(F). Its value must lie between 7.6 and 10.6.
n: This parameter is used in the index logarithm of the Folding Degree (lnFD) as the order of the power to which the spatial distance between the Cα atoms of a pair of residues is raised to compute their compaction quotient between the
4. ProtDCal A Program for Protein Descriptors Calculation. USER MANUAL. ProtDCal 1.0, Protein Descriptors Calculation. Unit of Computer-Aided Molecular "Biosilico" Discovery and Bioinformatic Research (CAMD-BIR), Universidad Central "Marta Abreu" de Las Villas. CONTENT TABLE: ABOUT US; GETTING [...]; WORKSPACE; BASIC ENVIRONMENT; [two illegible entries]; Aggregation Operators Panel; DESCRIPTION OF MENUS; OUTPUT FILES; PARAMETERS CONFIGURATION; Organizing the Output file; PROJECT [...]
5. [Figure: RMSD dialog, Configuration tab, listing the dataset files (File_1.pdb to File_21.pdb) with a "Compare with" list, the atom selection options Carbon Alfa, Backbone and All Atoms, and the buttons Calculate and Cancel.] In the Advanced panel, other options are available, such as:
- Use the best N residues: this option performs two iterations, one to superimpose the structures considering all the residues, and a second that considers only the N best-aligned pairs of residues to rebuild the superposition and compute the RMSD.
- Use specific ranges: this option superposes the structures and computes the RMSD among them using specific ranges of residues from the target, the decoys, or all the structures.
[Figure: RMSD dialog, Advanced tab, with "Use specific ranges" selected, chain ID and range fields for Target and Decoys, and the buttons Add Range, Remove Range, Calculate and Cancel.]
OUTPUT FILES. The button Set output for Results, in the toolbar or in the File menu, sets the file path where the results of a calculation are saved. This button launches an explorer to set the path and name of the output files. [Figure: Save Results dialog.]
6. [Figure residue: file dialog with the buttons Load and Cancel.] USER SPECIFIED INDICES. The button Define new Property, located in the Options menu, permits the definition of specific property-based indices. This option launches the following window: [Figure: "Add a new Property Index" window, with a Property List panel and a Residue Values panel listing one value per residue (ALA, ARG, ASN, ASP, CYS, ...).] The Property List panel provides the list of available indices. The Residue Values panel permits editing the values assigned to each residue. Once new indices are defined, the option Select new Property in the Options menu permits selecting them for calculation. [Figure: "Select Property Indices" dialog with the buttons Cancel and Accept.] USER SPECIFIED GROUPS. To create new groups, select Define new group, located in the menu Options > Managing Groups, which launches the following window: [Figure: "Group Manager" window, with a Group List panel and a Group specifications panel (Description, Ranges), and the buttons Cancel and Save.] This option allows the definition of new groups of residues. These groups are created by extracting specific ranges of residues, which can be fixed using the Ranges panel. The ranges can be configured by se
7. [Figure: Aggregation Operators panel. Distances: Manhattan distance (N1), Euclidean distance (N2), Minkowski distance (N3). Means: Arithmetic Mean (Ar), Harmonic Mean, Potential Mean 2 (P2), Geometric Mean (G), Potential Mean 3 (P3). Statistics: Kurtosis (K), Variation Coefficient (CV), Percentile 25 (Q1), Range (RA), Standard Deviation (DE), Percentile 50 (Q2), Skewness (S), Minimum Value (MN), Percentile 75 (Q3), Variance, Maximum Value (MX). Information content: Standardized Information Content, Mean Information Content (MI), Total Information Content (TI).] DESCRIPTION OF MENUS. File. This menu allows loading and/or exporting the files used by the program, e.g. projects and input or output files. [Figure: File menu with the entries Open PDB (Ctrl+A), Open FASTA (Ctrl+F), Save Result (Ctrl+R) and Exit (Alt+F4).] Either FASTA or PDB files can be loaded by clicking the corresponding buttons, which are located in the toolbar, or via the File menu. These buttons launch an explorer to select the files to load. [Figure: Open PDB and Open FASTA file dialogs.]
8. The classification accuracy is reported in the Classifier output section of the Weka environment. Finally, the resulting classification model can be saved from the result list in the left panel, as shown below.
[Figure: Weka Explorer, Classify panel. Classifier: FilteredClassifier with the filter weka.filters.supervised.instance.Resample (-B 1.0 -S 1 -Z 14.0 -no-replacement) and the base classifier weka.classifiers.bayes.NaiveBayes. Test option: stratified cross-validation, 10 folds. The right-click menu of the result list offers View in separate window, Save result buffer, Delete result buffer, Load model and Save model.]
Summary shown in the Classifier output:
Correctly Classified Instances: 3194 (91.049 %)
Incorrectly Classified Instances: 314 (8.951 %)
Kappa statistic: 0.5639
Mean absolute error: 0.1182
Root mean squared error: 0.2757
Relative absolute error: 69.4818 %
Root relative squared error: 107.3439 %
Total Number of Instances: 3508
Detailed Accuracy By Class:
Class | TP Rate | FP Rate | Precision | Recall | F-Measure | ROC Area
N | 0.906 | 0.028 | 0.998 | 0.906 | 0.95 | 0.937
P | 0.972 | 0.094 | 0.441 | 0.972 | 0.607 | 0.937
Weighted avg. | 0.91 | 0.033 | 0.958 | 0.91 | 0.925 | 0.937
The saved model file can then be used to predict the glycosylation states
9. a directory named ExampleGly within the Projects directory, the features can be computed by executing this command line: java -jar ProtDCal.jar -p Projects/ExampleGly -o Outputs. This calculation generates two tab-delimited output files, named <project name>_AA.txt and <project name>_Prot.txt, which contain feature matrices in the formats residues vs. residue indices and sequence windows vs. features, respectively. We will use the file <project name>_Prot.txt, which contains the computed features for each sequence window. Preparing the data file to be read by Weka. Weka can read csv files directly, and these are easily obtained from the tab-delimited files generated by ProtDCal. Additionally, one must append the class column at the end of each line of the file. This can be done, for example, with a spreadsheet program such as MS Excel, by pasting the column with the class information after the last column of features. The column with the names of the instances should also be removed, to prevent Weka from interpreting it as another attribute. Finally, the document must be saved in csv format. Running filters and attribute selection approaches with Weka. To eliminate trivial features that may be generated, it is recommended to first run the unsupervised Weka attribute filter called RemoveUseless Weka
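The data-preparation steps described above (tab-delimited to csv, remove the instance-name column, append the class column) can also be scripted instead of using a spreadsheet. The following is a minimal Python sketch, not part of ProtDCal or Weka. It assumes that the first column of the _Prot.txt file holds the instance names and that the class labels are supplied in the same order as the rows; the function name prot_to_weka_csv and the column name "class" are hypothetical.

```python
import csv
import io


def prot_to_weka_csv(prot_tsv, class_labels, out, class_name="class"):
    """Write a Weka-readable csv from a tab-delimited feature matrix.

    prot_tsv: open text file (or file-like object) with a header row.
    Assumption: the first column holds the instance names and is dropped.
    class_labels: one label per data row, in row order.
    """
    reader = csv.reader(prot_tsv, delimiter="\t")
    writer = csv.writer(out, lineterminator="\n")

    header = next(reader)
    # Drop the instance-name column and append the class column last.
    writer.writerow(header[1:] + [class_name])

    rows = [row for row in reader if row]  # skip empty lines
    if len(rows) != len(class_labels):
        raise ValueError("need exactly one class label per instance")
    for row, label in zip(rows, class_labels):
        writer.writerow(row[1:] + [label])


# Small in-memory demonstration with made-up values.
demo_in = io.StringIO("PDB_NAME\tf1\tf2\nwin_1\t0.1\t0.2\nwin_2\t0.3\t0.4\n")
demo_out = io.StringIO()
prot_to_weka_csv(demo_in, ["P", "N"], demo_out)
```

With real files, the two file-like objects would be replaced by open(..., newline="") handles for the _Prot.txt input and the csv output.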
10. [Figure residue: list of selected attribute numbers from the Weka attribute selection output.] After loading this reduced subset of features, it is advisable to finish by running the WrapperSubsetEval attribute selection approach. Depending on the number of features remaining in your data file, a genetic search may be used within the wrapper. However, if the number of attributes is too high (> 100), a BestFirst search is preferable for a first reduction. The Wrapper should be executed with the same type of classifier that you intend to use later to evaluate your final model on the test data. For the study of N-glycosylation presented in the ProtDCal paper, a genetic search with 50 chromosomes per population and 500 generations was conducted. As the evaluator, a FilteredClassifier was used, which applies a Resample filter to the training data such that a class-balanced subset is sampled for each cross-validation fold. This subset is used to train a classifier (both NaiveBayes and RandomForest were considered), which is then evaluated on the hold-out set during the x-fold iteration of the Wrapper. [Figure: Weka Explorer, Select attributes panel, with the Attribute Evaluator list open.]
11. >_<invariant> (alternative). [Figure: Options menu with the entries Set Parameters, Repair dataset(s), Windex Options, Output Tags Order (IDX_GROUPS_INV, GROUPS_IDX_INV), Convert Dataset, Manage Indices and Manage Groups.]
PROJECTS. Projects are text files that include all the options required to execute a calculation. To configure a project, one must set all the options of a calculation (i.e. load the dataset and select the indices, modification operators, groups, aggregation operators and parameters); the project can then be exported with the button Save Project, located in the toolbar. The path to the dataset is kept as part of the project.
Project Structure. A ProtDCal project consists of several tags that identify each of the configuration parameters of a given calculation. The structure of a project is divided into seven sections.
A. Path of the directory containing the input file(s). This section comprises two lines, as illustrated below:
directory
F:\WORK\RESEARCH\ProtDCal\Datasets\Fasta_Protein_Format\prediction
B. This section lists the tags of the selected indices, separated by commas:
indices
Gw(U), Gs(U), W(U), Mw, HP, ECI, Vm, Z1, Z2, Z3, ISA, Pa, Pb, Pt
When using weighted topographic indices (wIdx), such as the weighted Contact Order (wCO), additional lines are needed to specify the selected weights, separated by com
12. hydrogen should be present to compute the indices HBd and wDHBd.
- Save project files to keep a record of your calculations. Use the Run Multi-project tool to execute several project files in batch mode.
- Increase the memory of the Java Virtual Machine before launching ProtDCal's graphical user interface when several concurrent calculations are going to be executed. Each calculation runs in a separate thread. An example using 2000 MB is as follows: java -Xmx2000m -jar ProtDCal.jar
- Use the command line interface to execute ProtDCal from a terminal console.
Once a dataset is loaded, the interface provides access to the available indices, depending on the input file type (PDB or FASTA format). The next figure depicts the interface with access to all types of indices, as obtained when PDB files are used. [Figure: ProtDCal 1.0 main window (menus File, Options, Thermo & kinetics, Analyze, Run, Help) showing the index panels Thermodynamic Indices of Folded Protein States; Thermodynamic Indices of the Extended Protein State; Topographic Indices of Folded Protein States; Chemical, Physical and Structural Composition Indices; TAE AminoAcid Descriptor; Other Properties Indices.] The panels bel
13. [Content table, continued: illegible entry]; USER SPECIFIED INDICES; USER SPECIFIED GROUPS; EXECUTING CALCULATIONS; BASIC MODELING WORKFLOW USING PROTDCAL AND WEKA; Exemplification in the prediction of N-glycosylation. ABOUT US. ProtDCal is a protein modeling platform developed and maintained in the Unit of Computer-Aided Molecular Discovery and Bioinformatics Research (CAMD-BIR) of the Universidad Central "Marta Abreu" de Las Villas (UCLV) and the Department of Systems & Computer Engineering of Carleton University (CU). Project members: Yasser B. Ruiz-Blanco (yasserrb@uclv.edu.cu, UCLV); Waldo Paz Rodriguez (waldopaz@uclv.cu, UCLV); Yovani Marrero-Ponce, Ph.D. (ymarrero77@yahoo.es, UCLV); James Green, Ph.D. (jrgreen@sce.carleton.ca, CU). Citations: Ruiz-Blanco, Y. B. et al. ProtDCal: A Program to Compute General-Purpose Numerical Descriptors for Sequences and 3D Structures of Proteins. BMC Bioinformatics 2015 (submitted). Ruiz-Blanco, Y. B. et al. A Hooke's law-based approach to protein folding rate. Jou
14. [Figure: Weka Explorer, Preprocess panel, with the unsupervised attribute filter list open at RemoveUseless and the statistics of the selected numeric attribute shown on the right.] This filter eliminates all constant attributes that may be generated by ProtDCal following the project file. Depending on particular interests and the desired number of attributes, other filters can be applied at this stage. It is recommended to perfo
15. [Figure: excerpt of a <name>_AA.txt file, with one row per residue (here residues GLY9 to LEU14 of 1BNI_A.pdb) and one column per index.] Similarly, the output file <name>_Prot.txt contains, in its first line, the labels of all computed descriptors, which are combinations of the indices, groups and aggregation operators selected in the main interface. The figure below depicts an example of this type of file. [Figure: excerpt of a <name>_Prot.txt file with the header PDB_NAME, Gc(F)_ALA_N1, Gc(F)_ALA_N2, Gc(F)_ALA_N3, Gc(F)_ARG_N1 and one row per protein (1BNI_A.pdb, 1BTA.pdb, 1CSP.pdb, 1DIV_c.pdb, 1DIV_n.pdb, ...).]
16. [Figure: Open PDB and Open FASTA file dialogs listing example input files.] Options. This menu permits configuring the parameters used to evaluate the indices, fixing the number of significant digits in the output files and, in particular, selecting the modification operator (Windex, weighted index) to be applied to the computed residue indices. After this operator is applied, the index values are updated, and the subsequent procedures (grouping and aggregation) use these new values instead of the original, unmodified indices. Note that the selected operator is applied to all selected indices in the same manner. To evaluate different operators, a separate execution must be configured for each one: rerun the GUI, or save and execute multiple projects that use the different operators in batch mode. In addition, the Options menu permits defining new indices and grouping criteria. [Figure: Options menu with the entries Set Parameters, Repair dataset(s), Windex Options, Output Tags Order, Convert Dataset, Manage Indices and Manage Groups.] Functi
17. T HEX TRN RCL INT SUP PRT
ProtDCal Aggregation operators tags:
> Distances: N1 N2 N3
> Means: Ar P2 P3 M G V
> Statistics: CV Q3 S RA MN K Q1 MX DE Q2 150
> Information Theoretic Operators: SI MI TI
Below is a screenshot showing the structure of an actual project. [Figure: project file opened in a text editor, showing the section tags directory, indices, functions, groups, invariants, parameters and options, each followed by its values.]
Loading a project. To load a project, use the button Load Project, located in the toolbar. This button launches an explorer to select the desired project. [Figure: "Load a Project" dialog listing .proj files in the Projects folder.]
18. java -Xmx1000m -jar ProtDCal.jar -p <path to projects directory> -o <path to outputs directory>
If no option is specified, this line simply launches the graphical user interface. ProtDCal's command line options:
-p Defines the path to the directory containing the projects to execute. All projects in this directory will be computed.
-o Defines the location in which to create the output files. Each file takes the same name as the corresponding project.
-v Defines whether to include the name of the project within the labels of the final descriptors: 0 = no (default), 1 = yes. This option is valuable when the same descriptors are computed but different parameters are evaluated each time (likely of interest only to advanced users).
BASIC MODELING WORKFLOW USING PROTDCAL AND WEKA. ProtDCal is intended to generate a wide variety of features describing a protein sequence and/or structure. By applying feature selection, an appropriate feature subset may be identified and used to create effective classifiers. Below we detail the creation of a predictor of N-linked glycosylation based on protein sequence.
Prediction of N-linked glycosylation from protein sequence. Gathering the data set of instances. 3508 sequence-unique windows of 15 aa each, centered on an Asn residue, were extracted from the 242 protein sequence targets of O-GLYCBASE. This data set can be found in FASTA f
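The window extraction described above (15-residue windows centered on an Asn residue, kept sequence-unique) can be sketched as follows. This is an illustrative Python sketch and not the authors' script. The function names are hypothetical, and the handling of Asn residues closer than 7 positions to either terminus (they are skipped here) is an assumption, since the manual does not say how such cases were treated.

```python
def asn_windows(sequence, width=15, center="N"):
    """Return (1-based position, window) pairs for every `center` residue.

    Assumption: residues too close to a terminus to yield a full-width
    window are skipped rather than padded.
    """
    half = width // 2
    windows = []
    for i, residue in enumerate(sequence):
        if residue != center:
            continue
        if i - half < 0 or i + half >= len(sequence):
            continue  # incomplete window near a terminus
        windows.append((i + 1, sequence[i - half:i + half + 1]))
    return windows


def unique_windows(windows):
    """Keep the first occurrence of each window sequence, preserving order."""
    seen = set()
    unique = []
    for position, window in windows:
        if window not in seen:
            seen.add(window)
            unique.append((position, window))
    return unique


# Made-up sequence: the Asn at position 1 is too close to the N-terminus.
example = asn_windows("N" + "A" * 7 + "N" + "A" * 7)
```

Each resulting window could then be written as one FASTA record, so that ProtDCal computes one row of features per window.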
19. cess calculates a distance value, using either the Manhattan, Euclidean or Minkowski (p = 3) distance, between all proteins, based on standardized values of the available descriptors. This option is configured using the following interface: [Figure: Distance Matrix dialog with the distance options N1, N2 and N3, the Missing Values options, and the buttons Load Prot and Cancel.] The distance matrix is computed from a <name>_Prot file, which must contain only the features that are going to be used to evaluate the distance metrics. The Missing Values panel provides two options for handling missing data, which ProtDCal labels as 9999:
- Delete the descriptor: removes any descriptor that contains at least one missing value.
- Geometric Mean: replaces missing values (9999) with the geometric mean of the other values of the descriptor.
The Analyze menu also implements a Root Mean Squared Deviation (RMSD) calculator, which uses the Kabsch algorithm, as implemented in the CDK (Chemistry Development Kit) library, to obtain the optimum structural alignment between protein conformations and the selected target. The RMSD can be evaluated over the Cα atoms, the backbone, or all the atoms of the proteins. This option is configured at [Figure: RMSD dialog, Configuration tab, with the Dataset(s) and Target file lists.]
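The descriptor-based distance calculation described above can be illustrated with a short, standard-library Python sketch. It is not ProtDCal's implementation. It assumes z-score standardization with the population standard deviation (the manual does not state which standardization is used), applies the "Delete the descriptor" option to values labeled 9999, and uses the Minkowski distance of order p, where p = 1, 2 and 3 correspond to the Manhattan, Euclidean and Minkowski (p = 3) options. All function names are hypothetical.

```python
import statistics

MISSING = 9999.0  # label for missing values, as quoted in the manual


def drop_descriptors_with_missing(matrix):
    """'Delete the descriptor': drop columns with at least one missing value."""
    keep = [j for j in range(len(matrix[0]))
            if all(row[j] != MISSING for row in matrix)]
    return [[row[j] for j in keep] for row in matrix]


def standardize(matrix):
    """Column-wise z-scores (assumed); constant columns are mapped to 0."""
    columns = list(zip(*matrix))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    return [[(v - m) / s if s > 0 else 0.0
             for v, m, s in zip(row, means, sds)]
            for row in matrix]


def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def distance_matrix(matrix, p=2):
    """Pairwise distances between rows (proteins) of a feature matrix."""
    z = standardize(drop_descriptors_with_missing(matrix))
    return [[minkowski(a, b, p) for b in z] for a in z]


# Two proteins, three descriptors; the third descriptor has a missing value.
features = [[1.0, 10.0, MISSING],
            [3.0, 10.0, 5.0]]
d_euclidean = distance_matrix(features, p=2)
```

The "Geometric Mean" option would replace each missing value instead of dropping the column; it is left out here because the manual does not say how non-positive descriptor values are handled in that case.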
20. e according to the residues within a vicinity defined by the type of modification operator and its parameter value; e.g. for the autocorrelation operator with parameter k = 2, the neighbourhood of residue i comprises the residues in positions i ± 2. ProtDCal implements five modification operators, which can be selected in the menu Options > Weighting operators. iii. A third layer, named Groups, is intended to select one or more groups of residues according to their ID or type. When a group of residues is selected, an array of index values is obtained, corresponding to the residues in the group. In addition to the implemented grouping approaches, an option is included by which users can define their own groups of residues (see the option Groups in the Options menu). iv. A fourth layer comprises several aggregation operators, which are used to combine the array of values from a group of residues into a single-value descriptor reflecting the distribution of the index within that group. Some examples of these aggregation operators are the sum, average, variance, kurtosis, geometric mean, information content, etc. The output of the calculation contains the full combination of indices, groups and aggregation operators selected in each panel. The input file formats of the software can be either PDB or FASTA: for PDB files, all indices can be computed, whereas for FASTA files only the indices o
21. es. Groups Panel. The panel of groups contains three subpanels, enclosing groups formed by residue ID, chemical-physical properties, and topographic features. In addition, with the button Other groups it is possible to select previously defined groups (see Define new groups). [Figure: Groups panel with the subpanels Residues Based Groups, Properties Based Groups and Topographic Groups, and the button Other groups.] Aggregation Operators Panel. The panel of aggregation operators is divided into four categories: distances, central tendency, dispersion, and information-theoretic metrics. In the central tendency subpanel, specifically for the harmonic and geometric means, three variants are implemented to evaluate these metrics, in order to avoid possible indefinitions associated with 0 values:
- IGNORE THE VALUES 0: with this option, all zero values are excluded from the operations only, but not from the analysis; i.e. the zero elements are still counted to obtain the value N, which refers to the size of the sample.
- PRINT 9999: this option prints the value 9999 for missing values. In the case of the geometric mean, this occurs when the group size is zero; in the case of the harmonic mean, it occurs when any of the elements is zero.
- DELETE THE VALUES 0: with this option, all zero values are excluded and are not taken into account when dividing by N nor when evaluating the
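To make the difference between the three variants concrete, the following Python sketch applies them to the harmonic mean. It is one possible reading of the descriptions above, not ProtDCal's actual code: the mode names "ignore", "print" and "delete" are hypothetical, and returning 9999 when no nonzero value remains (or when the reciprocals sum to zero) is an added assumption.

```python
MISSING = 9999.0  # value printed for undefined results, as quoted above


def harmonic_mean(values, zero_mode="ignore"):
    """Harmonic mean N / sum(1/x) under three zero-handling variants.

    'ignore': zeros are left out of the sum but still counted in N.
    'print' : any zero (or an empty group) yields MISSING.
    'delete': zeros are left out of both the sum and N.
    """
    nonzero = [v for v in values if v != 0]
    if zero_mode == "print":
        if not values or len(nonzero) != len(values):
            return MISSING
        n = len(values)
    elif zero_mode == "ignore":
        n = len(values)
    elif zero_mode == "delete":
        n = len(nonzero)
    else:
        raise ValueError("unknown zero_mode: %r" % (zero_mode,))

    if not nonzero:
        return MISSING  # assumption: nothing left to average
    reciprocal_sum = sum(1.0 / v for v in nonzero)
    if reciprocal_sum == 0:
        return MISSING  # assumption: avoid dividing by zero
    return n / reciprocal_sum


group_values = [1.0, 2.0, 0.0, 4.0]
hm_ignore = harmonic_mean(group_values, "ignore")  # 4 / 1.75
hm_delete = harmonic_mean(group_values, "delete")  # 3 / 1.75
hm_print = harmonic_mean(group_values, "print")    # MISSING
```

For a group without zeros, the three variants give the same result, so the choice only matters for indices that can take the value 0.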
22. es are grouped in three main classes. Thermodynamic: these are almost all novel indices, designed in our laboratory and based on an empirical model of the main factors involved in the stability of protein structures. These indices are in turn divided into two panels, grouping on one side those defined for 3D folded structures and on the other side those based on information relating to the protein sequence. These indices refer to the contributions of the folded and unfolded reference states of a protein chain. Topographic: these include many of the contact-based descriptors with proven correlation with the protein folding rate constant, e.g. the relative contact order (CO), the total contact distance (TCD), the cliquishness (CLQ), etc. These indices were originally defined as global metrics; however, they were modified to obtain a value for each residue of a protein. Each contact of the protein is weighted by a residue property selected in this interface. The weighting is performed by multiplying the values of the selected property for both residues that are in contact. Property-based indices: this final group comprises a number of chemical, physical and structural properties of each type of residue, such as hydrophobicity, electronic charge index, molar weight, volume, isotropic surface area, etc. ii. Modification operators: these approaches are intended to modify the value of a selected index for a given residu
23. f the second (Thermodynamic indices for sequences) and fourth (Property-based indices) panels can be evaluated. Multiple proteins may be input simultaneously. The output files of ProtDCal calculations are two tab-delimited text documents, named <name>_AA.txt and <name>_Prot.txt, which store, respectively, all the descriptors for each residue of each protein, and the descriptors for the combinations of indices, groups and aggregation operators for each protein.
WORKSPACE. The ProtDCal workspace consists of the program folders:
- Datasets: containing all the input data files in PDB or FASTA format.
- Outputs: containing the output files of the program (<name>_AA, <name>_Prot, etc.).
- Projects: containing all project files (<name>.proj).
- Help: containing all the documentation files about the program and descriptors.
BASIC ENVIRONMENT. When the application is executed, the following launch screen is displayed: [Figure: ProtDCal 1.0 launch screen with the menus File, Options, Analyze, Run and Help, and the TIPS panel.]
TIPS. Tips for best practices in using ProtDCal:
- Check your FASTA files to prevent the presence of residue IDs other than the 20 natural ones.
- Use properly formatted PDB files. Check the presence of the lines TER, END or ENDMDL.
- Secondary-structure-based groups (SHT, HEX, TRN, RCL) require the explicit definition of secondary structure ranges in the PDB files.
- Backbone s
24. for any other data set, for which the final features used in the model must be previously calculated using ProtDCal. The current (2015) distribution of ProtDCal contains the specific project files to compute each of the features entered in the models described in our report (Y. B. Ruiz-Blanco et al., BMC Bioinformatics 2015) for N-linked glycosylation.
1. Ruiz-Blanco, Y. B.; Marrero-Ponce, Y.; Prieto, P. J.; Salgado, J.; Garcia, Y.; Sotomayor-Torres, C. M. Journal of Theoretical Biology 2015, 364, 407.
2. Ruiz-Blanco, Y. B.; Marrero-Ponce, Y.; Garcia, Y.; Puris, A.; Bello, R.; Green, J.; Sotomayor-Torres, C. M. Chemical Physics Letters 2014, 610-611, 135.
3. Ruiz-Blanco, Y. B.; Marrero-Ponce, Y.; Paz, W.; Garcia, Y.; Salgado, J. Journal of Theoretical Biology 2013, 321, 44.
25. ication operator to be used in the calculation:
decimals harmonicMeanType geometricMeanType windexID, where windexID takes the values 0: Autocorrelation; 1: Gravitational; 2: Kier-Hall; 3: Ivanciuc-Balaban; 4: Electrotopological State; -1: none.
datasetType: type of input files. True: PDB files; False: FASTA files.
outputOrder: order of the block matrix of features in the output file. True: IDX_GROUP_INVARIANT; False: GROUP_IDX_INVARIANT.
NOTE: A project must not contain any empty lines or incorrect tags. It is strongly recommended to use the graphical user interface to configure the project initially. What follows is a list of valid tags for each section.
ProtDCal Indices tags:
> Thermodynamic Indices of Folded Protein States: Gc(F) Gw(F) Gs(F) W(F) DGs HBd DGel DGw DGLJ DGtor
> Thermodynamic Indices of the Extended Protein State: Gw(U) Gs(U) W(U)
> Topographic Indices: A DA DAnp wSp lnFD wR2 wDHBd wNc wFLC wNLC wCO wLCO wRWCO wCTP wCLQ wPsiH wPsiS wPsiI wPhiH wPhiS wPhiI Phi Psi
> Property Based Indices: Mw HP IP ECI Vm Anp Z1 Z2 Z3 ISA At Ap Pa Pb Pt
ProtDCal Functions tags: DGfold DGwat DGconf DGpack ln kf DGscr DGHBd
ProtDCal Groups tags:
> Residue Basic Group: ALA ARG ASN ASP CYS GLU GLN GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL
> Properties Based Group: RTR BSR AHR ALR NPR ARM PLR PCR NCR UCR UFR
> Topographic Group: SH
26. ma for each weighted index:

    indices
    A, DA, DAnp, wSp, lnFD, wR2, wDHBd, wNc, wFLC, wNLC, wCO, wLCO, wRWCO
    wCO    ECI, HP, IP, ISA, Mw, None, Num_Atoms
    wSp    ECI, HP, IP, ISA, Mw, None, Num_Atoms
    wR2    ECI, HP, IP, ISA, Mw, None
    wDHBd  IP, ISA, Mw, None, Num_Atoms
    wRWCO  ECI, HP, IP, Num_Atoms
    wNc    HP, IP, ISA, Mw, None, Num_Atoms
    wNLC   ECI, HP, IP, ISA, Mw
    wLCO   ECI, HP, IP, ISA, Mw, None, Num_Atoms
    wFLC   ECI, HP, IP, ISA, Mw, None, Num_Atoms

C. The third section uses two lines to specify the function tags, separated by commas:

    functions
    DGfold, DGconf, DGHBd

Each function corresponds to one of the models listed in the Thermo & kinetics menu, which belong to the empirical thermodynamic model defined in our laboratory to describe protein folding stability and kinetics.

D. The fourth section comprises two lines specifying the groups of residues selected for calculation. The tag of each group is listed, separated by commas:

    groups
    ALA, GLY, HIS, PHE, ARM, PLR, NCR, SHT, HEX, TRN, INT, SUP, PRT

Additionally, if a user creates and selects a new group of residues, the defined label is added to the list after the other groups:

    groups
    ALA, GLY, HIS, PHE, ARM, PLR, NCR, SHT, HEX, TRN, INT, SUP, PRT, USER 1, USER 2

E. This section summarizes the invariant aggregation operators selected to be applied on each group of residues. The tag of each operator is listed, separated by commas: invariants
27. ons. In addition to protein descriptors, ProtDCal implements the calculation of empirical thermodynamic and kinetic functions: folding free energy (ΔGfold), configurational free energy (ΔGconf), hydrophobic effect (ΔGwat), H-bond deficit free energy (ΔGHBd), close-packing interactions (ΔGpack), a scoring function for structural decoys (ΔGscr), and the logarithm of the folding rate constant (lnkf).

[Figure: Thermo & kinetics menu listing the functions above with their Ctrl keyboard shortcuts]

Analyze

This menu gives access to three options to compare a set of protein structures or sequences: Graphs, Distance Matrix and RMSD.

Graphs: The first option plots the profile of an index along a sequence and bar graphs of the distribution of its values.

[Figure: Profile Graph of the DAnp index. Axes: index values versus amino acid position]

[Figure: Bar graph. Absolute frequency of the index values in different ranges]

Distance Matrix: This option permits one to compute descriptor-based distance matrices among all proteins in an output file. It compares different proteins by using previously computed descriptors. This pro
28. ormat within the Datasets directory in the ProtDCal distribution with the name glyco 3508 fasta.

Generation of an initial set of features

It is known from the literature on N-glycosylation that this process is highly sensitive to the presence of specific amino acids at specific positions near the target Asn residue. The most commonly used sequence motif associated with N-linked glycosylation is the sequon Asn-Xxx-Thr/Ser, which indicates the strong influence of a Thr or Ser residue at position Asn+2. Therefore, it was decided to generate position-specific features for all the analysed sequence windows. Please see the section User specified groups of this manual to learn how to define such groups.

User-specified groups are saved in a text file named groups.gdm that appears in the main directory of the ProtDCal distribution. Each newly defined group is saved in this file using the following format:

    RangeGroup <name>
    <Comment line>
    n 0 n 0
    END Group

These four lines are summarized as follows: (i) the name given to the group; (ii) an optional description; (iii) the starting and final positions of an inclusive range of residues gathered in the group, where n 0 n 0 means from residue n of the first chain to residue n of the first chain; and (iv) a marker ending the section of this group. This file can be edited directl
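The sequon described above can be located with a short script. The sketch below scans a sequence for Asn-Xxx-Thr/Ser and cuts a 15-residue window around each candidate Asn. Centring the Asn in the window, padding truncated ends with "-", and the name `sequon_windows` are assumptions made for this example; they are not taken from the manual or from the ProtDCal distribution.

```python
import re


def sequon_windows(sequence, flank=7):
    """Return (position, window) pairs for every Asn-Xxx-Thr/Ser sequon.

    position is the 1-based index of the Asn; window holds `flank` residues
    on each side of it (15 residues in total for flank=7).
    """
    windows = []
    # A lookahead is used so that overlapping sequons are all reported.
    for match in re.finditer(r"(?=N[A-Z][ST])", sequence):
        i = match.start()
        left = sequence[max(0, i - flank):i]
        right = sequence[i + 1:i + 1 + flank]
        # Assumption: pad with '-' so the Asn always sits at the window centre.
        window = left.rjust(flank, "-") + "N" + right.ljust(flank, "-")
        windows.append((i + 1, window))
    return windows


if __name__ == "__main__":
    for position, window in sequon_windows("MKNATLLNGSA"):
        print(position, window)
```

Note that the pattern follows the motif exactly as written in the text, so any residue is accepted at the Xxx position.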
29. ow the toolbar represent three of the hierarchical levels described above. Modification operators are accessed via the Options menu. When you mouse over each element, a brief text description pops up explaining its functionality. The indices, groups and aggregation operators panels are subdivided according to their nature into several subpanels.

Indices Panel

The panel of indices is divided into four subpanels: Thermodynamic Indices for Structures, Thermodynamic Indices for Sequences, Topographic Indices and Properties-based Indices.

[Figure: Indices panel with the subpanels Thermodynamic Indices of Folded Protein States, Thermodynamic Indices of the Extended Protein State, Topographic Indices of Folded Protein States, and Chemical, Physical and Structural Composition Indices, followed by the TAE AminoAcid Descriptor and Other Indices buttons]

In the Topographic Indices subpanel there are several weighted indices, starting with the letter w, that can be calculated using one or several weights for inter-residue contacts. The following figure shows the window intended for selecting the properties to be used as weights for contacts. This window appears every time one of these weighted indices is selected, in such a way that different properties can be
30. rm a supervised attribute selection approach that analyses the relevancy and redundancy of the features. This can be carried out with a wide range of methods implemented within Weka. Here we use the attribute selection method called CfsSubsetEval coupled with the BestFirst search method. The reduced data set can be obtained by right-clicking on the report name in the left panel of the window and selecting Save reduced data.

[Figure: Weka Explorer, Select attributes tab. Attribute Evaluator: CfsSubsetEval. Search Method: BestFirst -D 1 -N 5. Attribute Selection Mode: Use full training set. The output reports a forward best-first search starting with no attributes, a stale search after 5 node expansions, 1929 subsets evaluated and a merit of 0.291 for the best subset found. The right-click menu of the result list includes Visualize reduced data and Save reduced data]
31. rnal of Theoretical Biology 2015, 364, p. 407-417.
Ruiz-Blanco, Y. B. et al. A physics-based scoring function for protein structural decoys: Dynamic testing on targets of CASP-ROLL. Chemical Physics Letters 2014, 610-611, p. 135-140.
Ruiz-Blanco, Y. B. et al. Global Stability of Protein Folding from an Empirical Free Energy Function. Journal of Theoretical Biology 2013, 321, p. 44-53.

GETTING STARTED

ProtDCal is a user-friendly software package that was developed to generate a variety of numeric descriptors for protein structures and sequences. This manual is intended to provide an overview of the main interfaces and functionalities of the program. The current distribution of ProtDCal also includes a similar tutorial and a theory section describing the formalism and parameters of the indices implemented in the program. ProtDCal's feature generation strategy comprises four hierarchical levels:

[Figure: the four hierarchical levels. Indices (e.g. Area, Contact Order, Electrostatic free energy); Modification by neighbourhood (e.g. Autocorrelation, Ivanciuc-Balaban, Electrotopological state); Groups by type of residue or by properties (e.g. Hydrophilic, Aromatic, Alanine); Aggregation by invariant operators (e.g. Sum, Average, Variance)]

(i) An initial layer intended to select the type of indices to encode for each residue. These indic
32. selected for different indices. Alternatively, if many indices will use the same weighting properties, one can first select all the topographic indices at once by clicking the button Topographic Indices of Folded Protein States, which launches the properties window once, and then deselect the undesired indices. These indices are identified in the outputs as follows: index_name weight.

[Figure: window "Choose the weight(s) to obtain the wRWCO", listing the selectable weights None, Num_Atoms, Phi, Psi and TopDist, with Cancel and Accept buttons]

Other indices can be computed using the TAE Amino-acid Descriptor and Other Indices buttons located at the end of the panel. The first option calculates the Transferable Atom Equivalent (TAE) indices, which are documented on the RECCR site (reccr.chem.rpi.edu, Software section, Protein Recon TAE documentation). The second option computes user-defined properties (see Creating new Properties) using the Define new indices option located in the menu Option > Manage Indices. This option activates the following window:

[Figure: Select Property Indices window, with Cancel and Accept buttons]

In this window, the buttons ALL and help to select previously defined indic
33. sequence separation and a power of the spatial distance.

bins: Number of bins used to compute the Shannon-entropy-based (information-theoretic) aggregation operators. The user should fix this value such that the number of residues per selected group is larger than the number of bins.

K: Parameter used by the Autocorrelation and Gravitational modification operators. This value corresponds to the sequence offset that identifies the residues used to modify the initial value of the index. For example, with K = 5, when computing the autocorrelation modification for residue position i, each index will be affected by the residue at position i + 5.

SubG: Parameter used by the Kier-Hall modification operator. This value corresponds to the maximum length of the path-type sub-graphs used to modify the value of a given residue. For example, for a value of 3, all the sub-graphs of no more than 3 residues that contain residue i are used to modify its value.

G. This last section summarizes the values of the other general options of the project:

    options
    decimals, harmonicMeanType, geometricMeanType, windexID, datasetType, outputOrder
    1, 0, 0, 1, true, true

Where:
decimals: number of decimal places to use in the output file (1: no approximation is done).
harmonicMeanType: specifies how to deal with zeros when computing the Harmonic Mean.
geometricMeanType: specifies how to deal with zeros when computing the Geometric Mean.
windexID: Specify the modif
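To illustrate why the number of bins should stay below the number of residues in a group, the following sketch computes a Shannon entropy over a binned set of per-residue index values. Equal-width binning between the minimum and maximum value, the base-2 logarithm and the name `shannon_entropy` are assumptions made for this example; the manual does not state how ProtDCal builds its bins.

```python
import math


def shannon_entropy(values, bins):
    """Shannon entropy (in bits) of values discretised into equal-width bins."""
    if not values:
        raise ValueError("at least one value is required")
    if bins < 1:
        raise ValueError("bins must be a positive integer")
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # every value falls in the same bin
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is clamped into the last bin.
        k = min(int((v - lo) / width), bins - 1)
        counts[k] += 1
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


if __name__ == "__main__":
    group_values = [0.0, 1.0, 2.0, 3.0]  # hypothetical index values of one group
    print(shannon_entropy(group_values, bins=4))
    print(shannon_entropy(group_values, bins=50))
```

With more bins than residues, each residue tends to occupy its own bin, so the entropy saturates at the logarithm of the group size and no longer reflects the shape of the distribution.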
34. [Figure: save dialog for the output files, with Save and Cancel buttons]

Two files result from ProtDCal calculations: <name>_AA.txt and <name>_Prot.txt. Given the input proteins, these files include the values of the descriptors for each residue and for each selected group, respectively. In <name>_AA.txt, the second line lists the parameters used for the calculations, while the third line has the labels of the requested indices. The first column, labeled AA, holds the identifier of each residue in the proteins. This identifier is a combination of the protein name, chain identifier, residue name and residue number from the PDB file. The figure below depicts an example of this type of file.

[Figure: example <name>_AA.txt file. The parameter line is followed by the column labels AA, A, DA, DAnp, wSp, lnFD, wDHBd, wNc and wFLC (the weighted indices use ECI as the weight), and then by one row of values per residue]
35. tting the position of the initial and last residues, as well as the chain identifier of each residue.

[Figure: Define Range dialog with fields for amino acid position and chain ID, and Accept and Cancel buttons]

The option Select groups permits selecting these new groups for subsequent calculations through the following interface:

[Figure: Group selection dialog with a group list, and Cancel and Accept buttons]

EXECUTING CALCULATIONS

ProtDCal permits carrying out a single calculation or running multiple projects in batch mode. The first option can be accessed directly by configuring a set of indices, groups and aggregation operators. It can also be executed by loading a single predefined project. To execute several projects in batch mode, the button Run Projects, located in the toolbar, permits one to select a set of predefined projects through the following interface:

[Figure: Project List dialog. The left pane lists the available projects (131.proj, 131Auto.proj, 131Elect.proj, 131Gravi.proj, 131Ivan.proj, 131Kier.proj, 132.proj, 132Auto.proj, 132Elect.proj); the right pane shows the specifications (INDICES, GROUPS) of the selected project; Cancel and Run buttons]

Alternatively, if a number of projects are configured, the user can execute ProtDCal in console mode as ProtDC
36. y by the user without the need to use the graphical interface. Fifteen new groups were defined, each corresponding to exactly one residue position within the 15-aa windows. These were named 1 through 15. A number of residue indices were then selected to be computed for each of the 15 groups. These indices comprised distinct properties and thermodynamic indices, using the Kier-Hall modification operator with a sub-graph parameter of 1 and the Minkowski norm (N1) as the aggregation operator. These options can be specified using the graphical user interface or by manually creating a project file with the following information. The comments in parentheses (green italics in the original manual) are added here to explain each entry and must not appear in the actual project file.

    (path to input sequence window files)
    directory
    <Path to input sequence windows files or multi-FASTA file gathering all the sequence windows>
    (which indices to compute for each group)
    indices
    Gw(U), Gs(U), W(U), HP, IP, ECI, Z1, At, Pb
    (which groups to use, defined in groups.gdm)
    groups
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
    (specify aggregation operator to use)
    invariants
    N1
    (default parameter values)
    parameters
    t_cont, s_cont, A, HydGroup, n, bins, K, SubG
    4.0, 8.0, 5.0, 9.4, 3.0, 50, 5, 1
    (default options used)
    options
    decimals, harmonicMeanType, geometricMeanType, windexID, datasetType, outputOrder
    1, 0, 0, 2, false, true

Finally, by placing this project file in
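For readers unfamiliar with the N1 aggregation operator, the sketch below computes a Minkowski norm of order p over the index values of a residue group; with p = 1 it reduces to the sum of absolute values. This is the textbook definition and the name `minkowski_norm` is chosen for this example; whether ProtDCal applies any additional normalisation is not stated in this section, so the code should be read as an illustration only.

```python
def minkowski_norm(values, p=1):
    """Minkowski norm of order p: (sum(|v|**p)) ** (1/p)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


if __name__ == "__main__":
    # Hypothetical index values for the residues of two groups.
    groups = {
        "1": [0.42],               # single-position group, as in the tutorial
        "ALA": [1.0, -2.0, 3.0],   # group with several residues
    }
    for name, values in groups.items():
        print(name, minkowski_norm(values, p=1))
```

For the fifteen single-position groups used in this tutorial, each group holds one residue per window, so the N1 value is simply the absolute value of that residue's index.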
