SARpy - User's Manual
Contents
1. Figure 6.11: Use the arrow to move the class to the Active or Inactive side. The last step you can perform is filtering the dataset, i.e. selecting only some specific entries and discarding all the others that do not meet the specified property. In the example we want to throw away all the molecules with an ID number greater than 200 (figure 6.12). Figure 6.12: Check the Filtering option if you want to discard some dataset entries. Now that every parameter is set up, it is time to load the dataset in SARpy by simply clicking on the Load button. Read the info panel on the right to check how many molecular structures have been loaded, and check at the top of it that the file name is correct (figure 6.13).
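The filtering step above (discard every molecule with an ID greater than 200) can be illustrated outside the GUI. The following is a minimal sketch, not SARpy's own code: the function name `filter_rows` and the two "EXAMPLE" rows are hypothetical, and only the URETHANE row is taken from the example dataset.

```python
import csv
import io


def filter_rows(rows, attribute, keep):
    """Return only the rows whose `attribute` value satisfies `keep`."""
    return [row for row in rows if keep(row[attribute])]


# Tiny in-memory dataset; the EXAMPLE rows are made up for illustration.
sample = io.StringIO(
    "ID,NAME,CLASS\n"
    "17,URETHANE,Extreme\n"
    "200,EXAMPLE A,Low\n"
    "201,EXAMPLE B,Strong\n"
)
rows = list(csv.DictReader(sample))

# Keep the entries with ID <= 200, i.e. discard IDs greater than 200.
kept = filter_rows(rows, "ID", lambda value: int(value) <= 200)
kept_ids = [row["ID"] for row in kept]
print(kept_ids)
```

Passing the condition as a function keeps the sketch independent of how the GUI composes its filtering rule.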
2. SARpy User's Manual. Prepared under the PROSIL project by Dario Cattaneo, Alessio Mauro Franchi and Giuseppina Gini. 2015, April 27th. Istituto di Ricerche Farmacologiche Mario Negri. COPYRIGHT 2015 by Dario Cattaneo, Alessio Mauro Franchi and Giuseppina Gini, as part of the PROSIL project. If you intend to use the SARpy tool, please cite: Ferrari T., Cattaneo D., Gini G., Golbamaki Bakhtyari N., Manganaro A. and Benfenati E., "Automatic knowledge extraction from chemical structures: the case of mutagenicity prediction", SAR and QSAR in Environmental Research, Volume 24, Issue 5, pp. 365-383, DOI 10.1080/1062936X.2013.773376. ALL RIGHTS RESERVED. Contents: 1 Introduction. 2 Installing and starting SARpy. 3 Working with datasets: 3.1 Loading a dataset; 3.2 Managing the current dataset. 4 Working with rulesets: 4.1 Creating a model with SARpy. 5 Predicting and validating. 6 SARpy step by step: 6.1 Preparing the CSV dataset file; 6.2 Loading a dataset in SARpy; 6.3 Loading or computing a model in SARpy; 6.4 Predicting and validating. Introduction. Welcome to the SARpy user guide. This brief manual will introduce you to the SARpy tool, providing you with the basic knowledge for using this software. An illustrated step-by-step example is also provided in the last section. SARpy is an eas
3. Figure 6.10: In the next popup, set the numeric threshold. Now you can binarize your dataset; this step is totally optional. By binarizing we mean dividing all the compounds in the dataset into two classes, Active and Inactive, based on the mapping you specify in the three lists. In our case, as we want the Low and Moderate classes to become the Inactive class and the other two to represent the Active class, we simply move the four labels to the right or the left list accordingly (see figure 6.11). The same happens with a numeric threshold: you will have to decide whether the values above or below the threshold defined earlier are assigned to the Active or the Inactive class.
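The two binarization options described above can be summarised in a few lines of Python. This is only a sketch of the idea, not SARpy's implementation: the function names are hypothetical, and treating a value exactly equal to the threshold as "not above" is an assumption the manual does not settle.

```python
def binarize_by_label(label, active_labels):
    """Map a class label to ACTIVE if it is in `active_labels`, else INACTIVE."""
    return "ACTIVE" if label in active_labels else "INACTIVE"


def binarize_by_threshold(value, threshold, active_above=True):
    """Map a number to ACTIVE or INACTIVE depending on its side of the threshold.

    Assumption: a value equal to the threshold counts as "not above".
    """
    above = value > threshold
    return "ACTIVE" if above == active_above else "INACTIVE"


# Mapping used in the tutorial: Strong and Extreme become Active,
# Low and Moderate become Inactive.
active = {"Strong", "Extreme"}
labels = ["Low", "Moderate", "Strong", "Extreme"]
mapped = [binarize_by_label(label, active) for label in labels]
print(mapped)

# Numeric case with an illustrative threshold of 1.0, where the values
# below the threshold are the ones assigned to the Active class.
low_value = binarize_by_threshold(0.5, 1.0, active_above=False)
high_value = binarize_by_threshold(34.5, 1.0, active_above=False)
print(low_value, high_value)
```

The `active_above` flag mirrors the choice the GUI asks for: which side of the threshold should become the Active class.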
4. Figure 6.24: How to save the prediction result as a text file. These attributes are in addition to those saved by default. The file produced in this example is shown in figure 6.25: each compound is listed in a row with its SMILES, the predicted class (Active or Inactive), the likelihood ratio and the SMARTS, together with the LC50 value and the original class of the compound. [Content of the figure: a text file opened in Notepad, with the columns SMILES, Prediction, Training LR, SMARTS, LC50 mmol/l and CLASS, one compound per row.]
5. 7 or Windows 8. Any Intel or AMD processor, x86 or x64 (suggested: Intel Core i5-750 or greater). 512MB RAM (1GB or more for huge datasets). Up to 100MB of available hard disk space (dataset files not included). Administrator rights are not needed for the installation process, but are suggested. A compression/decompression software (Windows built-in tool, WinZip, WinRar). It is a recommended best practice to back up your system and data before you remove or install software. Chapter 2 Installing and starting SARpy. The SARpy tool is distributed as a ZIP package; this archive contains all the files the tool needs to work correctly. Once you have downloaded the package from the download area of the VEGA website (SARpy link), simply unpack it in any folder you like. To extract the files you can use the Windows integrated ZIP tool or any file compression software you prefer (e.g. WinZip or WinRar). Now enter the folder just created and double-click on the file SARpy.exe: the software will start automatically, showing its splash screen for a few seconds. After that, the main window should appear, focused on the Dataset Managing Tab, as shown in figure 2.1.
6. Figure 6.8: Select the activity attribute. The activity attribute may also be a numeric value; in this case SARpy will let you know that you probably want to set a threshold for the binary classification of compounds. Click the Ok button: a new popup will be displayed asking you for this numeric threshold (figure 6.10). Figure 6.9: A popup will warn you if you select a numeric activity attribute. The popup reads: "The selected attribute contains numbers. Since SARpy works only in classification, you probably want to set a threshold to split numbers into two classes. If no threshold is set, all numbers will be considered belonging to the same class."
7. Figure 6.20: When saving the ruleset, make sure the txt format is selected. A plain text file will be created. Its structure is quite simple and is shown in figure 6.21: each row is a rule and contains the fragment SMARTS, the training class (Active or Inactive) and the normalized likelihood ratio. [Content of the figure: the file ruleset_Fish opened in Notepad, with the columns SMARTS, Target and Training LR, one rule per row.]
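Because the saved ruleset is a plain text file with one rule per row, it is easy to read back with a script. The sketch below assumes that the columns are tab-separated and that the header row is "SMARTS", "Target", "Training LR" (the column names visible in the screenshot); the manual does not state the delimiter, so adjust it if your file differs. The function name `read_ruleset` and the two sample rules are hypothetical.

```python
import csv
import io
import math


def read_ruleset(handle, delimiter="\t"):
    """Read rules as (smarts, target, likelihood_ratio) tuples.

    Assumption: a header row with the columns SMARTS, Target, Training LR.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    rules = []
    for row in reader:
        # float() also accepts the text "inf", used for an infinite ratio.
        rules.append((row["SMARTS"], row["Target"], float(row["Training LR"])))
    return rules


# Hypothetical file content, for illustration only.
sample = io.StringIO(
    "SMARTS\tTarget\tTraining LR\n"
    "c1ccccc1\tACTIVE\tinf\n"
    "CCO\tINACTIVE\t1.32\n"
)
rules = read_ruleset(sample)
for smarts, target, lr in rules:
    print(smarts, target, lr)
```

Reading the likelihood ratio as a float keeps "inf" entries comparable with ordinary numbers, which is convenient when sorting or filtering rules.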
8. Figure 3.6: Here is the filtering tool: select the property you need to filter by and compose the filtering rule. Chapter 4 Working with rulesets. SARpy's Ruleset Managing Tab allows the user to create a SARpy ruleset, that is, a list of rules that establish relationships between various substructures and the selected activity classes. These rules are written using the SMILES format for the chemical structures and always follow this syntax: CC C O O cleccccl Developmental toxicant 1.06 (4.1). This syntax indicates that the selected fragment usually identifies the activity of a molecule as Developmental toxicant, with a likelihood ratio of 1.06. SARpy's rulesets are therefore models for molecular activity, and so they must be generated taking into account a wide range of conditions that allow users to tailor the overall reliability of each model to the specific target's characteristics. The general appearance of the Ruleset Managing Tab is shown in Figure 4.1.
9. [Content of Figure 6.1: the example spreadsheet, with the columns ID, NAME, SMILES, LC50 mg/l, LC50 mmol/l and CLASS, one compound per row.] Figure 6.1: The example dataset loaded as an Excel file. From Excel you can export a CSV file: just click on Save as and select the csv extension from the list available under the file name (see figure 6.2).
10. Figure 6.25: An example file containing the saved predictions. The last tool you may want to use is the validation, which computes the error rate and the confusion matrix of the model when applied to the loaded dataset (figure 6.26).
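To make the validation output easier to interpret, here is a small sketch of how an error rate, an unpredicted rate and a confusion matrix can be computed from actual and predicted classes. It is not SARpy's code: the function name `validate` is hypothetical, and dividing both rates by the total number of compounds is an assumption, since the manual does not give the formulas. An unpredicted compound is represented by `None`.

```python
from collections import Counter


def validate(actual, predicted):
    """Return (confusion matrix, error rate, unpredicted rate).

    Assumption: both rates are computed over all compounds.
    """
    matrix = Counter(zip(actual, predicted))
    total = len(actual)
    unpredicted = sum(1 for p in predicted if p is None)
    errors = sum(
        1 for a, p in zip(actual, predicted) if p is not None and a != p
    )
    return matrix, errors / total, unpredicted / total


# Illustrative data, not taken from the example dataset.
actual = ["ACTIVE", "ACTIVE", "ACTIVE", "INACTIVE", "INACTIVE"]
predicted = ["ACTIVE", "INACTIVE", None, "INACTIVE", "ACTIVE"]

matrix, error_rate, unpredicted_rate = validate(actual, predicted)
print("error rate:", error_rate)
print("unpredicted rate:", unpredicted_rate)
for (true_class, predicted_class), count in sorted(matrix.items(), key=str):
    print(true_class, "->", predicted_class, ":", count)
```

Each key of the matrix is an (actual, predicted) pair, which corresponds to one cell of the table shown in the info panel.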
11. external file are read, parsed and converted into the specific SARpy format, and then stored as a SARpy dataset. This method ensures the reliability of the dataset and a short processing time. SARpy accepts as external data sources only files saved with either the CSV or the SDF extension. Please be sure your file correctly meets the format specification. CSV format: all floating point values must use the dot as decimal separator; as the CSV format requires, values must be separated by commas; the file must be column-wise, i.e. each column is a property; the first row of the file must contain the property labels. SDF format: all floating point values must use the dot as decimal separator; each entry must be separated by the character sequence $$$$; each property name must be surrounded by "> <" and ">". To load the external file, simply select its format and then click on Browse (Figure 3.2); to check whether the operation was successful, check whether all the other functionalities in this tab are now available. Figure 3.2: How to load an external file: first select the file format and then browse for it on your computer. 3.2 Managing the current dataset. Once the set of molecules has been correctly loaded f
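The CSV requirements listed above (comma separator, dot as decimal separator, property labels in the first row) can be checked before loading the file. The following is a rough sketch using only the Python standard library; the function name `check_csv` is hypothetical, and the sample rows are adapted from the example dataset.

```python
import csv
import io


def check_csv(text, numeric_columns):
    """Return a list of problems found in comma-separated `text`."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    problems = []
    for name in numeric_columns:
        if name not in header:
            problems.append(f"column {name!r} not found in the first row")
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"line {number}: expected {len(header)} values, found {len(row)}"
            )
            continue
        for name in numeric_columns:
            if name in header:
                value = row[header.index(name)]
                try:
                    float(value)  # fails on text such as "308,22"
                except ValueError:
                    problems.append(f"line {number}: {name!r} is not a number")
    return problems


good = (
    "ID,NAME,SMILES,LC50 mmol/l,CLASS\n"
    "34,ETHANOL,CCO,308.2266117,Strong\n"
)
# Semicolon separator and decimal comma, as some locales export by default.
bad = (
    "ID;NAME;SMILES;LC50 mmol/l;CLASS\n"
    "34;ETHANOL;CCO;308,2266117;Strong\n"
)
good_problems = check_csv(good, ["LC50 mmol/l"])
bad_problems = check_csv(bad, ["LC50 mmol/l"])
print(good_problems)
print(bad_problems)
```

A file with the wrong separator shows up as a missing column and a mismatched value count, which is usually enough to spot the issue before opening SARpy.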
12. Figure 4.1: The Ruleset Managing Tab. Unlike datasets, which can only be loaded into the system from an external file, a ruleset (i.e. a model) can be saved and loaded again later in order to analyze molecules on the same endpoint. These models are saved by SARpy in a user-specified folder in plain text (TXT) format, and are therefore easily interpretable and ready to use in scientific publications and papers. To load or save a ruleset, the user must use the dedicated buttons and follow the instructions in the dialog window that opens. Otherwise, to create a brand new model, the user must have a dataset loaded and has to specify several parameters to extract meaningful rules. The loaded dataset must be valid and must contain at least two activity classes; if this is not the case, most of the options are disabled. 4.1 Creating a model with SARpy. SARpy has two ways of creating models; these are similar in the produced result and in the way they operate, but generate models that have different purposes. You can use SARpy as: A classifier, to predict a property: SARpy will generate a model that establishes relationships between each found substructure and all the activity classes specified during the dataset lo
13. regarding the model performance (see figure 5.6). Figure 5.5: Click on the Validate button to start the model validation process. Figure 5.6: When the validation process is over, the info panel will show its result (in the example: ERROR RATE 0.29, Unpredicted rate 0.08, followed by the confusion matrix). Chapter 6 SARpy step by step. This last section of the SARpy manual will guide you through the SARpy tool step by step, performing each action needed to load a dataset, to create or load a ruleset, and to predict and validate a result. The toy example proposed here is based on a public dataset from the Environmental Protection Agency (EPA), Mid-Continent Ecology Division, Duluth, MN. This dataset is Important note for the user: this is only an example of how to use the SARpy tool; we do not want to build a model. 6.1 Preparing the CSV dataset file. The first thing you need before opening SARpy is a csv or sdf file containing the compounds dataset. Here we will start from an xls file: the file must be column-wise, i.e. each column represents a property you have in the dataset (SMILES, Name, ID, class and so on), and the first row of the file must specify the property labels (see figure 6.1). Please note also that floating point values must use the dot as decim
14. Figure 6.6: Look for the csv file and click on Load; if you cannot find it, make sure the file format filter is set correctly. Now choose the SMILES attribute from the first dropdown list, so that SARpy knows which column contains the compounds' SMILES (figure 6.7). Figure 6.7: From the list, select the attribute for SMILES. Secondly, indicate to the software which column contains the activity attribute. In our example dataset we have added a column called CLASS, which specifies the class for each compound. Our classes are just string values, namely Low, Moderate, Strong and Extreme.
15. Figure 5.1: The Predict and Validate Tab. To predict the activity of the compounds listed in the testing dataset, you simply have to click on the Predict button at the top of this tab. The process will be quite fast, and you can check whether it has finished by reading the info panel on the right: when the process is over you should read "xxx structures matched", as shown in figure 5.3. Figure 5.2: To start the prediction process, click on the Predict button at the top of this tab. Figure 5.3: The info panel wi
16. Figure 6.14: The Ruleset Tab. Here you have just three simple parameters to set up. First of all, you have to indicate which classes you want to consider for model extraction; for normal application select All classes, as shown in figure 6.15. Figure 6.15: Select which class you want to consider for the ruleset extraction. Now it is time to specify the structural alerts options: you can set the minimum and maximum number of atoms of an alert and the number of occurrences it needs to be considered significant
17. [Content of Figure 6.4: the exported csv file opened in a text editor, with the columns ID, NAME, SMILES, LC50 mg/l, LC50 mmol/l and CLASS, one compound per row.] Figure 6.4: Edit the csv file just extracted and verify that it meets all the prerequisites. 6.2 Loadin
18. ading operation. Models generated with this modality consist of a list of rules that can be used to classify other molecules into one of the considered classes. Since the purpose is to predict the activity of new structures, rules generated in this modality are generally more detailed than rules produced in the Extractor mode. A knowledge extractor tool, to extract relevant substructures: the SARpy tool tries to generate new knowledge from the current dataset, analyzing it against a single activity class and establishing relationships between substructures and the specified activity class. This modality produces a list that is somewhat less specific than the one produced in the other mode, but gives sound information about substructures that are possibly related to a specific activity. Generally speaking, each SAR model works better for molecules that have similar characteristics; this means that SAR models should always be tailored to the specific batch of structures used to create them. So a fine tuning of a certain number of parameters is generally possible for SAR model generation. SARpy is no exception, as it provides some functionalities that affect the robustness, sensitivity and specificity of the developed models. Through the user interface, the user can regulate the models' precision and the definition of the structural alerts themselves. Model precision: SAs ex
19. al separator. [Content of Figure 6.1: the example spreadsheet, with the columns ID, NAME, SMILES, LC50 mg/l, LC50 mmol/l and CLASS, one compound per row; the CLASS values are Low, Moderate, Strong and Extreme.]
20. Figure 6.21: An example of the ruleset file as created by SARpy. 6.4 Predicting and validating. Once you have correctly loaded a dataset and a ruleset, you can switch to the last tab, the Predict and Validate tab. With this tab you can predict the activity of other unseen compounds, i.e. apply the ruleset to the dataset, and also validate the model, i.e. compute the prediction error and the confusion matrix. This tab is shown in figure 6.22. To predict the property of each compound in the selected dataset, simply click on the first button, Predict; the info panel will show you how many structures have been matched (figure 6.23).
21. Figure 6.2: To export your xls to csv, simply save the file and choose the csv extension from the list. It can happen that an information message appears stating that you will lose some file features: click on Yes. Figure 6.3: Click on Yes to close the message and continue. Now that you have your CSV file, open it with a text editor (Windows Notepad is fine): check that your data are not corrupted and also that the value separator is a comma; if it is not, you have to set the correct separator in the Windows Control Panel (Language section). The file should look like the one shown in figure 6.4.
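If Excel exported the file with a semicolon separator and decimal commas (a common result with some regional settings), an alternative to changing the Control Panel settings is to convert the file with a short script. This is a hedged sketch, not part of SARpy: the function name `to_comma_csv` is hypothetical, and the numeric columns must be named explicitly so that commas inside other fields, such as compound names, are left alone.

```python
import csv
import io


def to_comma_csv(text, numeric_columns, delimiter=";"):
    """Rewrite `text` with comma separators and dot decimal separators."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    output = io.StringIO()
    writer = csv.DictWriter(
        output, fieldnames=reader.fieldnames, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        for name in numeric_columns:
            # Only the listed numeric columns are touched.
            row[name] = row[name].replace(",", ".")
        writer.writerow(row)
    return output.getvalue()


exported = (
    "ID;NAME;SMILES;LC50 mmol/l\n"
    "34;ETHANOL;CCO;308,2266117\n"
)
converted = to_comma_csv(exported, ["LC50 mmol/l"])
print(converted)
```

Fields that themselves contain a comma are written in double quotes by the `csv` module; whether SARpy accepts quoted fields is not stated in the manual, so check the converted file in a text editor as described above.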
22. ect the number of SAs considered and the computation time, and the minimum number of occurrences needed for each SA to be considered valid (higher values correspond to more precision). Figure 4.3: After you have checked the Structural Alerts Options, it is possible to modify the parameters regarding the structure of the SAs. After every parameter is set up correctly, just click on the Extract and Validate button to generate the model; the info panel will show you the progress of the computation. Figure 4.4: Once every desired property of the model has been set, click on the Extract and Validate button to generate the model. Chapter 5 Predicting and validating. The last section of the SARpy tool is the Prediction and Validation Tab. Once you have correctly loaded a dataset in the system and created or loaded a set of rules, as seen in the two previous sections, through this tab you can predict the activity of other unseen compounds (i.e. apply the ruleset to the dataset) and also validate the model. This tab is shown in Figure 5.1.
23. f the model. To start the extraction and validation process, click on the Extract and validate button at the bottom of this tab. While the process is running, check the info panel (see figure 6.18). Figure 6.18: Click on Extract and validate to build the model. After some minutes (the computational time can
24. g a dataset in SARpy. With the csv file ready, you can open SARpy. A splash screen should appear, and after a few seconds the main window should open, focused on the Dataset Managing Tab (Figure 6.5). Figure 6.5: The SARpy tool as it opens. The first thing you have to do is to load the external csv file, after having selected the file format you are about to use (in this example we select csv). Then click the Browse button to search for the file on your computer (figure 6.6).
25. igure 6.23: Click on Predict to run the prediction process. If you need to, you can save the prediction results as a text file: just click on the Save predictions button, select the folder and the file name, and make sure the selected format is txt. Optionally, you can select some attributes of your dataset to be added to the prediction output. To do this, select those you want from the list (multiple selection is performed by pressing and holding the Shift key while clicking on attributes).
26. ined values: Max, which gives a more specific result by minimizing the error rate; Min, for a more sensitive result that minimizes the unpredicted rate; and Optimal, a trade-off between these two. Otherwise, with the second option, you can set the minimum likelihood ratio you like; increasing this parameter strengthens the precision of the model. For this tutorial example we selected Auto Max mode. [Screenshot: the Ruleset tab with the single alert precision options. ACTUAL DATASET: 129 structures (85 ACTIVE, 44 INACTIVE); ACTUAL RULESET: None.] Figure 6.17: Set the single alert precision you prefer; a higher LR will result in a higher precision o
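The effect of the minimum likelihood ratio can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the manual does not give the formula SARpy uses, so the sketch takes the standard positive likelihood ratio, (TP / positives) / (FP / negatives). The function names and the alert counts are hypothetical; only the class sizes (85 ACTIVE, 44 INACTIVE) come from the tutorial log.

```python
import math


def likelihood_ratio(tp, fp, n_pos, n_neg):
    """Standard positive likelihood ratio of one alert.

    tp: active training molecules matched by the alert
    fp: inactive training molecules matched by the alert
    n_pos, n_neg: number of active / inactive molecules (both must be > 0)
    Assumption: SARpy's own formula may differ from this textbook definition.
    """
    if tp == 0:
        return 0.0
    if fp == 0:
        return math.inf  # the alert never fires on an inactive molecule
    return (tp / n_pos) / (fp / n_neg)


def keep_alerts(alerts, n_pos, n_neg, min_lr):
    """Keep only the alerts whose likelihood ratio reaches min_lr."""
    return [
        a for a in alerts
        if likelihood_ratio(a["tp"], a["fp"], n_pos, n_neg) >= min_lr
    ]


# Hypothetical alerts (SMARTS strings and counts are made up for illustration).
alerts = [
    {"smarts": "[N+](=O)[O-]", "tp": 20, "fp": 2},
    {"smarts": "c1ccccc1", "tp": 30, "fp": 15},
    {"smarts": "C(Cl)(Cl)Cl", "tp": 5, "fp": 0},
]

# Class sizes taken from the tutorial log: 85 ACTIVE, 44 INACTIVE.
kept = keep_alerts(alerts, n_pos=85, n_neg=44, min_lr=3.0)
kept_smarts = [a["smarts"] for a in kept]
```

Raising `min_lr` discards the weaker alerts, which is the trade-off described above: fewer wrong predictions, but more molecules left unpredicted.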
27. [Screenshot: the Dataset tab with a constraint on the ID attribute (value 200). After loading, the info panel reports ACTUAL DATASET: 129 structures (datasetFish.csv).] Figure 6.13: Last, click the Load button to start the dataset creation process. 6.3 Loading or computing a model in SARpy. The second tab is the Ruleset Tab, which lets you load or create a new model, i.e. a set of rules, from the dataset you have just loaded in the tool. [Screenshot: the Ruleset tab. ACTUAL DATASET: 129 structures (85 ACTIVE, 44 INACTIVE); ACTUAL RULESET: None.]
28. ll show you when the prediction process is over. Now that the process has finished correctly, you can save the prediction result by clicking the Save prediction button; a text file containing the prediction result will be generated. The prediction information is row-wise, meaning that each row is the prediction for a specific compound found in the dataset. Each row contains, in this order, the following information: • The compound SMILES • The prediction as a label (standard values are ACTIVE and INACTIVE) • The likelihood ratio test value • The SMARTS. Optionally, it is possible to add other information from the dataset to this output; to do this, just select the attributes you want to add from the list before saving the prediction result (see Figure 5.4). Figure 5.4: Select from this list all the attributes you would like to add to the prediction result. The last step you can perform is model validation, which gives you information about the model performance and checks how accurately the model represents the real system. To start the validation of the generated model on the loaded dataset, just click the Validate button in the middle of this tab; when the process finishes, the info panel will give you the relevant information, i.e. the error rate and the confusion matrix
29. [Screenshot: the Predict and Validate tab. ACTUAL DATASET: 411 structures (datasetFish.csv); ACTUAL RULESET: 46 rules (ruleset_Fish.txt). Status line: Hit PREDICT first, then you can VALIDATE or SAVE the predictions.] Figure 6.22: The Predict and Validate tab. [Screenshot: the same tab after prediction and validation, with the list of additional attributes (SMILES, NAME, ID, LC50 columns, CLASS). The log reports ERROR RATE 0.30 and a confusion matrix.]
30. rom external file, you may manage the loaded data according to your own needs; once you have done that, simply click the Load button at the bottom of this tab so that SARpy can create its own internal dataset. If it has been created correctly, the filename of the external data source and the total structure count are reported on the right, just above the Info Dialog (Figures 3.3 and 3.4). Figure 3.3: Once you are ready with your dataset, just click the Load button at the bottom of this tab. Figure 3.4: The image shows the Info Panel, useful to check for errors on every load operation. Basically, the user can manage the current dataset in two ways: • Binarizing the current dataset (see section 3.2) • Filtering the current dataset (see section 3.2). Binarize a dataset. Even if SARpy may work properly using a non-binary classification, i.e. considering more than two activity classes, a lot of case studies divide the compounds using a binary classification scheme, generally labeling each compound with the generic Active or Inactive labels. The binarization tool lets you create this kind of classification from a multi-class one by relabeling the classes. The binarization operation requires you to specify which of the present classes are to be considered a
31. s active and which are instead considered as the inactive ones. Once the binarization operation is done, all molecular structures change their activity description according to the new parameters; the information about the previous activity labeling is not lost and can be retrieved simply by unchecking the Binarize checkbox. The binarize tool also works with datasets that use continuous values; in this case a proper threshold value within the range must be specified in order to split the set into Active and Inactive molecules. Please refer to Chapter 6 to learn more about this functionality. Filter a dataset. A second important tool provided in SARpy is for filtering data. Testing methodologies often need to split a dataset into several subsets, each with a particular property; a classical example is the division into training and testing sets, the first dedicated to the generation of the model, the second used only for validating the model. This constraint tool allows the user to apply one or more constraints on the whole dataset, splitting it into several parts and obtaining a reduced dataset that meets the given restrictions. Figure 3.5: To binarize a dataset, check the highlighted checkbox and move all the labels into the right new class.
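What binarizing and filtering do to a dataset can be sketched outside the GUI. The following is a minimal Python illustration, not SARpy's own code: the function names are hypothetical, the three rows are made up, and the column names (ID, SMILES, CLASS) and class labels (Low, Moderate, Strong) are the ones visible in the tutorial screenshots. Which side of a threshold counts as Active is an assumption exposed as a parameter.

```python
def binarize_by_label(rows, active_labels, column="CLASS"):
    """Relabel a multi-class column as ACTIVE / INACTIVE (rows are copied)."""
    result = []
    for row in rows:
        new_row = dict(row)
        is_active = row[column] in active_labels
        new_row[column] = "ACTIVE" if is_active else "INACTIVE"
        result.append(new_row)
    return result


def binarize_by_threshold(rows, column, threshold, active_if_below=True):
    """Split a continuous column into ACTIVE / INACTIVE at a threshold.

    Assumption: the caller decides whether low values mean ACTIVE.
    The label is written to a new CLASS key; the original value is kept.
    """
    result = []
    for row in rows:
        new_row = dict(row)
        below = float(row[column]) < threshold
        new_row["CLASS"] = "ACTIVE" if below == active_if_below else "INACTIVE"
        result.append(new_row)
    return result


def filter_rows(rows, column, keep):
    """Keep only the rows whose value in `column` satisfies `keep`."""
    return [row for row in rows if keep(row[column])]


# Made-up rows for illustration only.
rows = [
    {"ID": 1, "SMILES": "CCO", "CLASS": "Low"},
    {"ID": 150, "SMILES": "Oc1ccccc1", "CLASS": "Strong"},
    {"ID": 201, "SMILES": "CCN", "CLASS": "Moderate"},
]

binary = binarize_by_label(rows, active_labels={"Moderate", "Strong"})
subset = filter_rows(binary, "ID", lambda value: value <= 200)
subset_classes = [row["CLASS"] for row in subset]
```

Applying the complementary constraint (`value > 200`) yields the remaining rows, which is one way to obtain the training and testing sets mentioned above.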
32. [Screenshot log: ACTUAL DATASET: 411 structures (datasetFish.csv); ACTUAL RULESET: 46 rules (ruleset_Fish.txt). Predicting: 400 structures matched. Predictions saved. Validating (multiclass classification, ACTIVE / INACTIVE): ERROR RATE 0.30; Unpredicted rate 0.03. CONFUSION MATRIX (columns: predicted ACTIVE, INACTIVE, unknown): ACTIVE row 170, 95, [unreadable]; INACTIVE row 30, 105, 3.] Figure 6.26: To validate the model, click the Validate button and read the result in the info panel.
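The two rates in this log can be reproduced from the confusion matrix with a short Python sketch. Two assumptions are made here and are not stated in the manual: both rates are taken over all 411 structures (this denominator is chosen because it reproduces the 0.30 and 0.03 shown in the log), and the count that is unreadable in the screenshot is taken to be 8, the value needed for the cells to sum to 411. The function name is hypothetical.

```python
def validation_rates(matrix, unknown="unknown"):
    """Return (error_rate, unpredicted_rate) from a confusion matrix.

    matrix maps actual class -> {predicted class or `unknown`: count}.
    Assumption: both rates are divided by the total number of structures.
    """
    total = sum(sum(row.values()) for row in matrix.values())
    unpredicted = sum(row.get(unknown, 0) for row in matrix.values())
    errors = sum(
        count
        for actual, row in matrix.items()
        for predicted, count in row.items()
        if predicted != unknown and predicted != actual
    )
    return errors / total, unpredicted / total


# Counts from the tutorial log; the 8 is inferred (411 minus the other cells).
matrix = {
    "ACTIVE": {"ACTIVE": 170, "INACTIVE": 95, "unknown": 8},
    "INACTIVE": {"ACTIVE": 30, "INACTIVE": 105, "unknown": 3},
}

error_rate, unpredicted_rate = validation_rates(matrix)
print(round(error_rate, 2), round(unpredicted_rate, 2))  # prints 0.3 0.03
```

Dividing the 125 errors by the 400 predicted structures instead would give about 0.31, so the choice of denominator matters when comparing error rates across rulesets.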
33. the higher this last number is, the less restrictive but more precise the model will be. The suggested default values are two atoms min and eighteen atoms max. Changing these two numbers affects not only the number of SAs that will be present in the model but also has a strong influence on computation time. [Screenshot: the Ruleset tab with the Structural Alerts Options (atoms number, min occurrences). ACTUAL DATASET: 129 structures (datasetFish.csv; 85 ACTIVE, 44 INACTIVE); ACTUAL RULESET: None.] Figure 6.16: Set the values you prefer for the structural alerts options. The next parameter of the model is the single alert precision, i.e. the minimum likelihood ratio required by the model. This value affects its precision and accuracy. There are two main options: Auto, on the left, or Manual, on the right. The first lets the user select among three predef
34. tracted by SARpy are usually associated with numbers that define their precision; the user can regulate the level of sensitivity and specificity through various parameters that affect the alert precision. For quick tuning there is an Auto setting, in which the user just selects among three predefined values of precision: min means a more sensitive result, i.e. it minimizes the unpredicted rate, while max produces a more specific result, i.e. it minimizes the error rate. Alternatively, using the Manual regulation mode, the user can set the minimum likelihood ratio they prefer for each structural alert. An increase in this parameter corresponds to a higher precision of the model. Figure 4.2: From this panel it is possible to regulate the precision of the model to be generated. • Structural Alerts Options: the process of creating a new model implies the analysis of each molecular structure and the fragmentation of molecules into several substructures; these are matched with the information about activity in order to establish any possible relationship between substructures (i.e. structural alerts, SAs) and activities. SARpy lets the user define some parameters of the SAs: the maximum and the minimum number of atoms each SA is composed of, which aff
35. vary depending on the dataset loaded and on the hardware you have. Once the model is created, its accuracy and confusion matrix are displayed in the info panel. [Screenshot: the Ruleset tab after extraction. ACTUAL DATASET: 129 structures (datasetFish.csv); ACTUAL RULESET: 2 rules; status: RULESET EXTRACTED.] Figure 6.19: When the process is over, the info panel shows some information about the model. If you need to, you can save the resulting model so that you can load it later, maybe with a different dataset (see Figure 6.19). To do this, simply click the Save ruleset button, located to the right of the Extract button, and select where you want to store the file, as shown in figure
36. Figure 2.1: The main window of the SARpy tool. Chapter 3: Working with datasets. SARpy's Dataset Managing Tab allows the user to work with datasets; the general appearance of this tab is shown in Figure 3.1. With the word dataset we mean By this tab it is possible to load and manage a dataset, either for using it as a training set to generate models, or as a test set to validate models, or as a collection of untested molecules whose activity has to be predicted. Figure 3.1: The figure shows the Dataset Managing Tab; here you can load and customize a dataset. Basically, SARpy's Dataset Managing Tab allows the user to perform two main actions: • Loading a new dataset from external files (see section 3.1) • Managing and preparing the current dataset (see section 3.2). 3.1 Loading a dataset. SARpy is able to read only a specific dataset format based on SMILES, which is automatically created by the system itself using data coming from external sources such as text or Excel files. All molecular structures contained within the selected
37. y tool suitable for building SAR models for regulatory and other purposes. The tool lets you create a personalized model for a specific endpoint, obtaining a tailored ruleset from the chemical and property dataset. Furthermore, you can use this generated ruleset to make predictions about the activity of other, unseen compounds. SARpy is also useful for model validation: with this tool it is easy to separate the starting dataset into a training set and a testing set. A set of rules can be computed from the former, while the latter can be used for validation of the model. Before starting, please check the prerequisites section and make sure your computer is compatible with SARpy. Briefly, the second section will guide you through the installation process; in the third and fourth you will work with datasets and rulesets respectively; the last one will explain how to predict and validate the model. If you intend to use the SARpy tool, please cite: Ferrari T., Cattaneo D., Gini G., Golbamaki Bakhtyari N., Manganaro A., Benfenati E. Automatic knowledge extraction from chemical structures: the case of mutagenicity prediction. SAR and QSAR in Environmental Research, Volume 24, Issue 5, pp. 365-383. DOI: 10.1080/1062936X.2013.773376. Chapter 1: Prerequisites. The software and hardware prerequisites for installing and using the SARpy tool are as follows: • Microsoft Windows XP (SP1, SP2, SP3), Vista, Windows