ExPlain 3.0 manual

1. Chapter 7: Molecular networks analysis. 7.1.4 Key node search algorithm. The search for signaling molecules (key nodes) in the network vicinity of a gene list can be based on a single gene list, or on a primary gene list with an additional gene list treated as a secondary set. Genes of a secondary set are incorporated such that the key node algorithm is pushed to pass through the elements of this set. The network path is attracted by the secondary genes, so longer paths are often cheaper than shorter paths if they include molecules from the secondary set. The algorithm is a feed-forward approach that transforms the original weights of the network into new weights, such that the weights of the resulting network reflect the desired attraction power. Score: The significance score used for ranking the key node results counts the hits relative to the respective logarithmized Volume V_i that was required to reach every hit (Equation 7.1.1, Score calculation). Volume V_i: number of compounds reachable in total from the key node within distance i. Hits: number of targets reached by key node k within distance i. With growing distance i, the Volume V_i grows too. The maximum distance is limited by r. False Discovery Rate (FDR): Each individual key node g
2. Chapter 6: Composite module analysis and models. Background information. [Screenshot: Yes/No distribution plot and model match table. Legend columns: Matrix name, Impact; matrices V$AHR_Q5, V$CEBP_C, V$HNF3ALPHA_Q6, V$HNF4ALPHA_Q6, V$HNF3_Q6_01, V$OCT1_02. Yes set: 1435; background set: 515. Table columns: TRANSPro ID, Match display, Sequence score.] binding sites are located in close proximity (e.g. 300 nt) to each other. The software receives as input a set of positive promoters as well as a set of negative promoters. Figure 6.22: CM scanning score column. [Table: gene symbol, chromosomal position and scores for V$AHR_Q5, V$CEBP_C, V$HNF3ALPHA_Q6 and V$HNF4ALPHA_Q6; rows for ALDH1A1, LAMP2, CCNB1IP1, TP53TG1 and ZNF330.] Figure 6.23: Create interactions dialog window. Create interactions from
3. [Table: numeric values per sample for ZMYND8, NFATC1, SMAD4, OTUB1, TM2D1, RTF1 and TREX1 (Human).] CRC clustering can be carried out by clicking on CRC clustering in the Analyze drop-down menu. In the input form (Figure 11.20) you must specify the parameters to be used for clustering. If your set contains a large number of genes (more than 3000), it is strongly recommended that you perform a filtering step before clustering. Filtering is performed using a slightly modified form of the
4. Chapter 3: Gene sets. 3.2 Representation of the gene set data. Figure 3.5: Import data dialog with the saved BKL search results. [Screenshot: Import data dialog with Species: Human and Destination fields, and the lists "ExPlain predefined data" and "Your data sets from BIOBASE databases"; entries include Human housekeeping genes, Mouse housekeeping genes, Mouse promoters TRANSPRO 6.2a and 6.2b, Rat housekeeping genes, Rat promoters TRANSPRO 6.2a and 6.2b, known sites, BioMarker Cystic Fibrosis and PRF adipocyte specific.] The data set Known sites is the set of genome intervals that are experimentally proven to correspond to real sites from the TRANSFAC database. The right-side list in the Import data dialog window contains various data sets from BIOBASE GmbH databases. For example, if ExPlain is installed together with the BKL database, you can select saved search results from BKL to be imported as an ExPlain gene set. The right menu shows the names of the saved results that can be imported (Figure 3.5). The number of rows in each search result is shown in square brackets next to its name. User-defined profiles from a TRANSFAC installation are also displayed in the right-side list. For a description of how to create profiles in TRANSFAC Professional, e.g. on the basis of a datab
5. Chapter 1: Main components of the ExPlain user interface. [Menu screenshot, entry and description: Add column: from another gene set or result table. Compute from existing columns: add a column computing a mathematical function from existing columns. Manage columns in data set: show, hide, delete, reorder, rename columns. Rename/customize column: change the column's format. Intersect or subtract gene sets: extract unique genes; multiple subtraction. Join different interaction sets: merge into a single set of interactions. Functional classifications summary: compare classifications of several gene sets. Keynodes summary: compare keynodes network of several gene sets. System annotation: add columns with annotation from BKL. Assign levels to columns: group columns by experiment conditions etc.] 1.2 The menu and dialogs. The menu bar contains four constant items (File, View, Data and Analyze) and one variable item that differs depending on the selected tree node and contains actions specific to this node. Click on a menu item in the drop-down to start a process or open a dialog window. The dialog window contains several fields to configure program options and buttons to start an analysis or apply changes. The dialog window for the Enrichment analysis is shown in Figure 1.7. With the check boxes you can switch certain options on or off; with radio buttons you can select one Figure 1.7: The dialog window exemplified
6. Chapter 6: Composite module analysis and models. The CMA interface in ExPlain. NOTE: The CMA overall cut-off is shown in both plots by a vertical line. The Sites density distribution graph shows the number of sites lying at a specific sequence position relative to the TSS, divided by the total number of sequences. For your convenience, a smoothed graph is also displayed, with the site density averaged over 50 bp. Figure 6.9: Sites density distribution. [Screenshot: Sites density distribution plot with Export to RTF link; Expression score and Yes/No distribution.] By pressing the Show sites on sequences button shown below the link, you are provided with information about matrix and model matches on promoters. A more detailed description of the promoter extended model view is given in Section 6.2. The Model search result menu appears when you select the CMA node in the project tree. From the drop-down menu you can select the Get matrices of the model as gene set link to compile a new data set from the factors linked to the PWMs of the model. The link Save model adds a node for the model to the project tree. A model saved this way can be used for promoter classification as described in Section 6.6. The Edit link launches the model editor described in S
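The site density described in excerpt 6 can be illustrated with a minimal Python sketch. It is not ExPlain's implementation: the function names `site_density` and `smooth` are hypothetical, and reading the 50 bp smoothing as a centered moving average is an assumption, since the manual does not state how the smoothed graph is computed.

```python
def site_density(site_positions, n_sequences, start, end):
    """Sites per position (relative to the TSS) divided by the number of sequences."""
    density = [0.0] * (end - start + 1)
    for pos in site_positions:
        if start <= pos <= end:
            density[pos - start] += 1.0 / n_sequences
    return density


def smooth(density, window=50):
    """Centered moving average (an assumed smoothing); the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(density)):
        lo = max(0, i - half)
        hi = min(len(density), i + half + 1)
        smoothed.append(sum(density[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical data: four predicted sites on two promoters, window -3..0 around the TSS.
density = site_density([-3, -3, -1, 5], n_sequences=2, start=-3, end=0)
smoothed = smooth(density, window=2)
```

Dividing by the number of sequences makes densities comparable between sets of different size, which is presumably why the graph is normalized this way.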
7. Section 9.3 (genome intervals: ChIP-chip, TFBS, ChIP-seq, tiling arrays). Figure 9.5: Interval entries from a selected gene promoter. [Screenshot: ChIP-on-chip data chr22 chip; intervals lying on promoter HSA_29344_2; Gene symbol: TBC1D10A; Description: TBC1 domain family member 10A; TSS: Chr 22: 29021522; table columns: Chromosome position, Start position, End position, Score; two rows, Chr 22: 29024276 to 29024752 and Chr 22: 29030804 to 29031268.] Figure 9.6: Operation on intervals. [Screenshot: Create interval dialog. To create an interval from the active interval set, specify numerical expression conditions and interval parameters. First condition: Join on gaps, Max gap 40, P-value. Second condition: Min run 80 bp. Promoter window from 10000 to 1000. Automatically create subset from interval.] intervals that have a p-value greater than 0.3. If you have no signal or p-value information attached to your intervals data set, the above fields will be omitted. The Max gap field accepts the maximum length in bp between intervals. Intervals that are closer will be joined if the Join o
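The joining of nearby intervals described in excerpt 7 can be sketched as follows. This is only an illustration under stated assumptions: the helper names `join_on_gaps` and `filter_min_run` are hypothetical, the gap is taken as the distance between the end of one interval and the start of the next, a gap equal to Max gap is joined, and Min run is interpreted as a minimum interval length. The excerpt is truncated before it defines these details.

```python
def join_on_gaps(intervals, max_gap):
    """Join intervals on one chromosome whose gap is at most max_gap bp (assumed rule)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(interval) for interval in merged]


def filter_min_run(intervals, min_run):
    """Keep intervals spanning at least min_run bp (assumed meaning of Min run)."""
    return [(s, e) for s, e in intervals if e - s + 1 >= min_run]


# Hypothetical coordinates, using the Max gap 40 and Min run 80 values shown in the dialog.
joined = join_on_gaps([(100, 150), (170, 260), (400, 420)], max_gap=40)
kept = filter_min_run(joined, min_run=80)
```

Sorting first means each interval only needs to be compared with the most recently merged one.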
8. Chapter 4: The functional classification. [Table: Functional Analysis summary for several gene sets; rows for disease terms, including Alzheimer Disease, Arteriosclerosis, Arthritis Rheumatoid, Diabetic Nephropathies, Hodgkin Disease, Multiple Myeloma and Neovascularization Pathologic, each with a type (Preventative or Correlation), hit counts and P-values per set, and a sim/diff flag.] If the Create graphs option was selected, the output also contains several graphical comparisons, as shown below. 4.6 Functional Analysis algorithm. The Functional Analysis algorithm is designed to statistically analyze the representation of certain molecular classification groups in an input set. The input set is a list of genes. ExPlain can take any data set whose elements are automatically mapped to the gene level. Figure 4.20: FA summary graphs. [Screenshot: Hits by set; terms include Preventative Alzheimer Disease, Preventative Multiple Myeloma, Preventative Arthritis Rheumatoid, Preventative Arteriosclerosis and Correlation Diabetic Nephropathies.]
9. Entrez Protein: Wheeler D.L., Church D.M., Federhen S., Lash A.E., Madden T.L., Pontius J.U., Schuler G.D., Schriml L.M., Sequeira E., Tatusova T.A. and Wagner L. Database Resources of the National Center for Biotechnology. Nucl. Acids Res. 31:28-33 (2003). PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=12519941. Entrez Protein: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=protein. International Protein Index (IPI): Kersey P.J., Duarte J., Williams A., Karavidopoulou Y., Birney E. and Apweiler R. The International Protein Index: An integrated database for proteomics experiments. Proteomics 4:1985-1988 (2004). PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15221759. IPI: http://www.ebi.ac.uk/IPI/index.html. UniGene: Wheeler D.L., Church D.M., Federhen S., Lash A.E., Madden T.L., Pontius J.U., Schuler G.D., Schriml L.M., Sequeira E., Tatusova T.A. and Wagner L. Database Resources of the National Center for Biotechnology. Nucl. Acids Res. 31:28-33 (2003). PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=12519941. UniGene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene. UniProt: Bairoch A., Apweiler R., Wu C.H., Barker W.C., Boeckmann B., Ferro S., Gasteiger E., Huang H., Lopez R., Magrane M.
10. [Project tree node: Up Dis6 FDR KN.] columns that shall appear in the summary table. When the button is pressed, the process of summary creation will be launched. In the summary output (Figure 7.5), the key node analysis results selected for summary display are shown in one table. It is now possible to directly compare the results of different key node searches and to extract interesting molecules or genes via the option Key molecules: Get key molecules from selected rows as gene set of the summary-specific menu. Figure 7.5: Key node summary result. [Table: one row per molecule name, including Lyn, A-Raf, Fyn, Efs, IL-7R, beta3 integrin, Abl, CD45, Csk, ERK2, Src and CD19, with hit counts and scores for each key node search.]
11. Chapter 8: Profiles. 8.6 User matrices. appear in the matrices list of the profile creation dialog (see Section 8.2) and can be used together with TRANSFAC PWMs. 8.6.1 Creating new matrix. To create a matrix, launch the dialog by clicking on the Weight matrix link in the Create new data section of the File menu. Figure 8.13 shows the dialog window. Figure 8.13: Weight matrix creation dialog. [Screenshot: Create weight matrix dialog with the fields Enter a name for new matrix (the name can contain letters, digits, underscores and hyphens; anything else will be discarded), Select assigned factors, Specify a window size, Enter the starting position of the alignment, and Enter or paste your aligned sequences in FASTA, ClustalW or Gibbs format (six example sequences shown); button: Preview matrix.] To create a new matrix, specify a name using letters, digits, underscores and hyphens. Then enter aligned sequences in one of the supported formats. You can change the window size and starting position or use the default values. You can also assign known transcription factors to the matrix: select the desired factors in the factor selection control. Note that you can use the Shift key to select a range of factors and Ctrl k
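To illustrate how aligned sequences, a window size and a starting position relate to a matrix, here is a minimal Python sketch that builds a nucleotide count matrix. It is an assumption about the general technique, not ExPlain's actual matrix calculation: the function name `count_matrix`, the zero-based start position and the decision to skip `n` and gap characters are all hypothetical choices.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_matrix(aligned, start=0, window=None):
    """Return one {base: count} dict per alignment column inside the window."""
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    end = length if window is None else min(length, start + window)
    matrix = []
    for pos in range(start, end):
        column = Counter(seq[pos].upper() for seq in aligned)
        # Only A, C, G and T are counted; 'n' and gap characters are skipped.
        matrix.append({base: column.get(base, 0) for base in ALPHABET})
    return matrix


# Hypothetical alignment; 'n' marks padding, as in the dialog's example input.
example = ["AATGT", "CCTGA", "nnACG"]
matrix = count_matrix(example, start=1, window=3)
```

A weight matrix would normally be derived from such counts by a further normalization step, which is left out here because the excerpt does not describe it.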
12. Chapter 12: miRNA analysis. [Table: miRNA result table, 531 rows; columns include miRNA description, Genes list, Genes count, count ratio and P-value. Rows shown: MIR944 (microRNA 944), MIR548C (microRNA 548c), MIR374A (microRNA 374a) and MIR942 (microRNA 942), each with its list of target genes, e.g. ATP1B1, BHLHE40, BMP2, CAB39, CD47, CXCL5, GATA6, PTGS2, SELE, SMAD3 and VEGFA.]
13. Chapter 2: Analytical workflows and wizard mode. 2.3 Workflow mode. Figure 2.14 and Figure 2.15. [Project tree screenshots: workflow branch Up EB GCRMA test CEL files (Experiment/Control) with the nodes F-match vertebrate h0.01 600 SUP, Subset from F-match vertebrate h0.01 600 SUP, Up Dis6 FDR KN, visualize network and GO annotation, each with a timestamp.] Figure 2.16: Summary of several functional classifications. [Project tree screenshot: Full upstream analysis with the nodes Join functional classification results for GO, Join functional classification results for GOpub, Join functional classification results for KWs and Join functional classification results for p
14. Martin M.J., Natale D.A., O'Donovan C., Redaschi N. and Yeh L.S. The Universal Protein Resource (UniProt). Nucleic Acids Res. 33:D154-D159 (2005). PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15608167. UniProt: http://www.uniprot.org. Chapter 14: References (entries 22 to 30). Gene Ontology: The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nat. Genet. 25:25-29 (2000). PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=10802651. Gene Ontology: http://www.geneontology.org. Benjamini-Hochberg multiple testing correction: Benjamini Y. and Hochberg Y. Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Stat. Soc. 57:289-300 (1995). Eukaryotic Promoter Database (EPD): Schmid C.D., Perier R., Praz V. and Bucher P. EPD in its twentieth year: towards complete promoter coverage of selected model organisms. Nucleic Acids Res. 34:D82-D85 (2006). PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=16381980. EPD: http://www.epd.isb-sib.ch/index.html. DBTSS: Suzuki Y., Yamashita R., Sugano S. and Nakai K. DBTSS, DataBase of Transcriptional Start Sites: progress report 2004. Nu
15. [Project tree screenshot: nodes for composite models, profiles and gene sets (e.g. Human housekeeping genes, PXE, Rat housekeeping genes, sample_tab, Filtered sample_tab and cluster subsets), followed by the folders Genome intervals (ChIP-chip, TFBS, etc.) and Interactions (interactiontest, pxe inter, transpath genes inter); each node carries entity counts and a timestamp.] The project tree arranges all data into a hierarchy of nodes. The tree nodes are marked by different icons depending on their type. A complete list o
16. impact value is calculated, which roughly estimates the contribution of a given matrix or pair to the final fitness value of the whole CM. [Formulae: the impact value I(m_i) is expressed in terms of F(M) and F(M \ m_i); the exact expressions are not legible in this excerpt.] Here m_i is the i-th component (matrix or pair), I(m_i) is the corresponding impact value, M is the whole CM, M \ m_i is the CM without component m_i, and F is the fitness value. Thus, if the impact value is close to 1, the corresponding component contributes greatly to the overall fitness. Note that you may sometimes observe a negative impact value, which means that removing the corresponding component from the CM would increase the score. Usually this means that CMA did not run long enough to produce a good model. Nucleotide positions are given relative to the TSS, which is indicated by a kinked arrow. The Sequence score column shows the model score on the promoter. The table also contains columns with the TRANSPro ID of the sequence, the corresponding gene name, a description of the corresponding gene and the chromosomal location of the promoter. Columns linked from parent sets are available through the column hiding mechanism. The table enables users to select individual promoters in the checkbox column. In the Model search result menu you may save selected promoters as a gene set by clicking on the Get selected promoters as set menu link. By clicking on the Show text report on selected promoters menu link, you obtain a detailed text report of the selected
17. [Table: profile rows for matrices such as V$AHRARNT_02, V$AHRHIF_Q6 and V$ARNT_01, listing recognized factors (e.g. AhR, AhR2, AhR repressor, arnt, arnt2, HIF-1, HIF-1alpha), species, number of sites and two score thresholds.] The Weight matrices profile field can be used to change the default name New Profile to another name appropriate for the specific analysis. 8.4 Profile representation in the result table. The table contains one row for each PWM of the profile. Besides the Matrix name column, each row contains a list of factor names (Recognized factors column) and a Species column. The Number of sites column shows the number of real sequences (suggested binding sites, compiled sequences, TRANSCompel sites) used as the basis for matrix calculation. The CSS and MSS columns contain the score thresholds that were stored with the profile. ExPlain provides the option to export a profile as a tab-separated text table, Microsoft Excel table, Rich Text Format document or Match-compatible profile. 8.5 Profile menu options. If the profile node is active in the project tree, the button Profile appears in the main menu. After clicking on this button you will see eight menu links available. Section Set: Profile
18. sification analysis to find relevant pathways. The figures below show one of the branches in waiting state (Figure 2.14) and ready state (Figure 2.15). When you have several data sets in the analysis, ExPlain facilitates the comparison of the results by creating a summary table of all analyses of the same type. Figure 2.12. [Screenshot: Yes set (2 objects) and the workflow folder test_CEL_files with MAS4 and GCRMA Fold Change nodes.] Figure 2.13. [Screenshot: process list with Report on Full upstream analysis; Join functional classification results for GO, GOpub, KWs and pathway; Join network keynodes search results. Process status: waiting. The task is waiting for 23 processes to finish. One button postpones the task; press X to permanently delete it.]
19. [Table: site search hits for matrix V$RELBP52_01 (factors NF-kappaB2 p52, Rel) with positions, scores and matched sequences. Sites lying on promoter HSA_14475; Gene symbol: IFIH1; Description: interferon induced with helicase C domain 1.] 5.3 Summary set of several site search results. ExPlain provides a mechanism to compare several site search results. Using the Create summary set of several Match results option from the Analyze menu, launch the dialog displayed in Figure 5.25. Select several results of site search analysis (described in Chapter 5, Transcription factor site search analysis). Select the columns that should be present in the output. Note that the Sites density column will appear in the output data set if MATCH results without a background set were present among the selected results. Figure 5.25: MATCH summary dialog. [Screenshot: Create summary set dialog with Choose site search results and Choose columns to include in summary (Yes column, No column, Yes/No column, P-value column, Matched promoters Yes column, Matched promoters No column, Matched promoters P-value column, Sites density, Matched promoters); Cancel.] When you press the button, the process of summary set creation is started. The output set will contain matrices from all marked site searches and the selected columns, grouped by MATCH results. The figure below
20. [Table: MATCH result rows for matrices such as V$NFKB_Q6, V$CREL_01, V$NFKB_C, V$RELBP52_01, V$SREBP2_Q6, V$KID3_01 and V$CACCCBINDINGFACTOR_Q6, with numeric columns including site frequencies, their ratio and P-values.] 5.1.7 Pattern-based search. Patch searches for TF binding sites by comparing the analyzed sequences with a collection of known sites. To launch the Patch dialog window, click the Patch menu link in the Analyze menu. As for MATCH (see Section 5.1.2), you can select data sets to analyze and set promoter parameters. After setting up the required parameters, press the button to start the
21. [Table: gene set rows with descriptions, e.g. AAK1 (AP2 associated kinase 1, a protein serine/threonine kinase that acts in protein amino acid phosphorylation and protein import, regulates receptor-mediated endocytosis), AANAT (arylalkylamine N-acetyltransferase, acts in melatonin biosynthesis) and an alanyl tRNA synthetase entry.] Figure 3.8: Gene set annotation window. [Screenshot: Gene set: HUVEC GSE2639 example. Origin: Gene set was loaded from predefined set HUVEC GSE2639 example. Of 11840 accessions, 10035 were recognized and matched. The remaining 1805 accessions were ignored. Data build: Mammal 2007-11-21. User comments. Attach descriptions from Affymetrix DTT XML file or ARR file: Browse.] factors Time factor and Temperature are added to a data set in the figure below. To remove or change the assignment, launch the dialog again, remove unnecessary factors and/or add new ones. Figure 3.9: Gene set with assigned factors. [Table: CEL file columns (H0, H24, H08, H12) grouped under Time factor and Temperature; rows include ACO1 and ACTB.]
22. [Table: remaining rows of the gene set with assigned factors, e.g. ACTG1.] 3.3 Recombining gene sets. Syntax of naming recombined nodes: By default, node names in the project tree convey information about the way they were obtained. For all sets created within a project, the default name is prefixed with a number string, where the number increments chronologically starting with 1, and carries the number of entities in parentheses and a timestamp. Names of recombined sets are constructed from the names of the nodes involved and the applied operation. Here, Sx & Sy signifies a set created by union, Sx Sy (operator symbol not preserved in this excerpt) is a set that was created by intersection, and Sx - Sy is a set that was created by subtracting Sy from Sx. 3.3.1 Filter gene set by condition. The Filter gene set by condition dialog of the Data menu can be used to extract data entities based on a range of expression (numerical) values or text values, as well as distinct categories. This allows you to easily design experiments that compare different combinations of expression levels or changes. NOTE: This tool is not limited to a single column. You can choose up to three different columns with specific conditions on each of them. Figure 3.10 shows the Filter gene set by condition dialog with configured parameters. First, select the gene set that will be filtered. The field Objects to consider allows you to select the entities which wi
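The recombination operations and the multi-column filter from excerpt 22 can be pictured with plain Python sets and a small filter function. This is a sketch for illustration only: the gene sets, column names and expression values are made up, and `filter_gene_set` is a hypothetical helper, not part of ExPlain.

```python
# Hypothetical gene sets Sx and Sy.
sx = {"ACO1", "ACTB", "ACTG1"}
sy = {"ACTB", "ACTG1", "AAK1"}

union = sx | sy          # all genes from either set
intersection = sx & sy   # genes present in both sets
subtraction = sx - sy    # genes of Sx that are not in Sy


def filter_gene_set(rows, conditions):
    """Keep rows that satisfy every (column, predicate) condition; at most three columns."""
    if not 1 <= len(conditions) <= 3:
        raise ValueError("between one and three column conditions are expected")
    return [row for row in rows if all(test(row[col]) for col, test in conditions)]


# Hypothetical expression table with one row per gene.
rows = [
    {"gene": "ACO1", "fold": 2.9, "pvalue": 0.01},
    {"gene": "ACTB", "fold": 0.9, "pvalue": 0.40},
    {"gene": "ACTG1", "fold": 3.3, "pvalue": 0.20},
]
selected = filter_gene_set(rows, [
    ("fold", lambda value: value >= 2.0),
    ("pvalue", lambda value: value <= 0.05),
])
```

Combining the conditions with a logical AND is an assumption; the excerpt does not say how conditions on different columns are combined.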
23. [Table: trailing numeric rows of a result table (counts, ratios and P-values).] Chapter 13: Reports. 13.1 Report generation. ExPlain provides the ability to create a brief report on your data and analysis results. The report is available for a folder and for a gene set. Short descriptions of all data analyses performed and the main results are saved in a report tree node for the selected folder or gene set. Use the Generate report link within the Analyze menu to launch the report creation dialog. Figure 13.1: Simple report generation dialog window. [Screenshot: Generate report dialog; Generate report on folder or gene set: mir141 targets cons 0.3; Options: Site search result options and Functional analysis options, with P-value cut-off and Yes/no ratio cut-off fields.] Select the desired folder or gene set, depending on whether you want to collect information on the whole folder or on a gene set with its subnodes. In advanced dialog mode you can also set threshold values to restrict the output results. Pressing the button initiates the report generation. The output result is displayed below. 13.2 Graph report generation. The graph report is a tree item type which contains several graphs. 13.2.1 Generating a graph of the value distribution of a gene set. You can create a graphical representation of up-regulated, down-regulated or non-differentially expressed ge
24. [Table: row for AADACL1 (arylacetamide deacetylase-like 1).] 11.4.3 The Rank Product algorithm. The Rank Product procedure was developed by Hong and Breitling [32], based on the RankProduct method of Breitling et al. 2004 [33]. This procedure detects differentially expressed genes by ranking the genes according to the difference in expression between control and experiment. After calculating the ratio expression(experiment) / expression(control) for all k possible comparisons, the genes are sorted according to the fold-change ratios determined for each comparison, giving k separate lists. The occurrence of a gene g being differentially expressed is evaluated by calculating its rank product RP_g from the ranking positions of the gene in all sorted lists. Equation 11.4.1 (Rank Product): $RP_g = \prod_{i=1}^{k} r_{g,i}$, where RP_g is the RankProduct of gene g, r_{g,i} is the rank of gene g in comparison list i, and k is the total number of comparisons. To estimate the probability of observing a specific RP value, a p-value is determined in analogy to the e-value of the BLAST procedure [34]. First, the RP values of a large number of random permutations of the original expression values are calculated. Then these values are used to evaluate the likelihood of observing an RP value equal to or smaller than the one under consideration by random chance. Additionally, the percentage of false positives, or false discovery rate, when all genes with RP values smaller th
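The rank product and its permutation-based p-value from excerpt 24 can be sketched in Python as follows. This is a minimal illustration, not ExPlain's implementation: the function names are hypothetical, rank 1 is assumed to be the largest fold change, ties are broken by input order, and the p-value is estimated as the fraction of permuted RP values that are equal to or smaller than the observed one.

```python
import random


def ranks(values):
    """Rank 1 for the largest value; ties are broken by input order (an assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def rank_products(fold_changes):
    """fold_changes[i][g] is the ratio of gene g in comparison i; returns RP_g per gene."""
    products = [1] * len(fold_changes[0])
    for comparison in fold_changes:
        for g, r in enumerate(ranks(comparison)):
            products[g] *= r
    return products


def permutation_p_values(fold_changes, n_perm=1000, seed=0):
    """Fraction of permuted RP values that are <= the observed RP of each gene."""
    rng = random.Random(seed)
    observed = rank_products(fold_changes)
    n_genes = len(observed)
    counts = [0] * n_genes
    for _ in range(n_perm):
        # Shuffle each comparison independently to break the gene-to-value link.
        shuffled = [rng.sample(comp, len(comp)) for comp in fold_changes]
        permuted = rank_products(shuffled)
        for g, rp in enumerate(observed):
            counts[g] += sum(1 for value in permuted if value <= rp)
    return [count / (n_perm * n_genes) for count in counts]


# Hypothetical fold changes: two comparisons, three genes.
fold_changes = [[4.0, 1.0, 0.5], [3.0, 0.8, 1.2]]
rp = rank_products(fold_changes)
p_values = permutation_p_values(fold_changes, n_perm=200)
```

For large gene sets the inner comparison would usually be replaced by sorting the permuted values once per permutation; the quadratic loop is kept here for readability.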
25. [Screenshot: Composite Model search dialog with the option Choose % of allowed false positives; Cancel.] Figure 6.24: CM interaction output. [Table: Interaction profile reactionset, 36 rows; columns include Cost and ActRep; each row pairs two molecules with their identifiers, e.g. AChE (MO000068336) and eIF-4E (MO000020752).] together with MATCH binding site predictions. The first set may correspond to genes found to be differentially expressed during an experiment, whereas the second may be a set of genes one does not expect to respond to the same signal as the target genes, e.g. genes which did not exhibit differential expression during the same experiment. CMA then attempts to find a combinatorial motif model that best discriminates between positive and negative promoters. 6.8.1 CMA promoter models. The main component of a CMA model is the composite module (CM). Each composite module is defined by three parameters: the set of transcription factors regulating a promoter, a set M of PWMs used to predict binding sites for these factors, and a set of parameters defining the
26. Create profile from selection
Gene set: Create gene set using selected matrices

121 | 8.5 PROFILE MENU OPTIONS | CHAPTER 8 PROFILES

Figure 8.8 Output table of the New profile creation result
[Figure residue: weight matrices profile "immune cell specific", 128 rows; columns Matrix name, Recognized factors, Number of sites and two numeric columns. Rows shown: V$AML1_01 (AML1a; human; 57; 0.78; 0.828), V$AML1_Q6 (AML1; gibbon ape, human, mouse, rat; 9; 0.7; 0.772), V$AML_Q6 (AML1, AML1DeltaN, AML1a, AML1b, AML1c, AML2, AML3 and isoforms, PEBP2alphaA1, PEBP2alphaA2, PEBP2alphaB1, PEBP2alphaB2, PEBP2beta, Runx3; cattle, gibbon ape, human, mouse, rat; 17; 0.724; 0.786), V$AP1_01 (AP-1, FosB, Fra-1, JunB, JunD, c-Fos, c-Jun; 14 species from Japanese pufferfish to sheep; 47; 0.814; 0.779)]

Homologous matrices: Extend set of selected matrices by all homologs

Section Change cut-offs
minFN: Minimize false negatives
minFP: Minimize false positives
minSUM: Minimize sums of FP and FN
Custom: Set arbitrary cut-offs

Section Tools
Join profiles: Merge several profiles into one

8.5.1 Create profile from selection

When the profile node is active you c
27.
FT                   /number=1
FT   intron          720..997
FT                   /number=1
FT   exon            998..1567
FT                   /number=2
XX
SQ   Sequence 1733 BP; 277 A; 529 C; 526 G; 401 T; 0 other;
     aacctagatc cetetgetgt cccctgcact gccggtaaca tggcacagca gagcagggtt        60
     gtttgtgcac gqggcagctcc tgcagctgct gccgtcgccc accagcctcc tatgccaaac       120
     cccacatcct aactcaggaa cctctgagaa aaaacqggagc cctcgagggg cccagccttg       180
     gaagggtaac tggaccgctg ccgcctggtt gcctgggcca gaccagacat gcctgctgct       240
     ccttccggct taggaggagc acgcgtcccg ctcgcgcgca ctctccagcc ttttcctggc       300
     tgaggagggg ccgagcctcc ggtagggcgg gggccggatg aggcgggacc tcaggcccgg       360
     aaaactgeet gtgecacgtg acccgccgcc ggccagttaa aaggaggcgc ctgctggcct       420
     ccccttacag tgcttgttcg gggcgctccg ctggcttctt ggacaattgc gccatqtgtg       480
     ctgctcggct agcggcggcg gcggcccagt cggtgtatgc ettcteggeg cgcccqttgg       540

When this file is loaded into ExPlain, a new sequence node called example1 will appear in the project tree.

Figure 10.4 New item in project tree
[Figure residue: project tree showing the nodes "double IDs", "example1" (2009-10-18 14:52:11) and "foxa2 mouse top50"]

Each nucleotide sequence is considered as the sequence of one gene. When supporting information is provided, the identifier is taken from the ID line (Y00483 in this case) and is used as the gene symbol in subsequent analyses. In the figure below you can see the sequence entry loaded from the example file above. Once loaded EMBL formatted sequence can be retrieved using the ex
28. [Figure residue: process tree after a Full upstream analysis run on the GCRMA test_CEL_files data set (2009-10-22), with the Up, Down and NC subsets and, under them, nodes such as "Summary on TF binding sites search results", "F match vertebrate_h0.01 600 SUP", "Subset from F match", "FDR KN", "Visualize network", "GO annotation", "Transpath Pathways" and "SwissProt keywords"; a "Currently running processes" panel shows "Running Match". Copyright 2003 by BIOBASE GmbH.]
29. If you created a new model, a new model node will be added to the Composite elements folder in the process tree. If you were editing an already existing model, the new node will appear under the original model.

6.6 Classifying promoters

You can use CM models to classify promoters of other datasets. This functionality is available through the Scanning pre-defined CMs link of the Analyze menu. In the dialog window you must choose a gene set and a composite model. If the currently selected node is a CM saved from CMA, or one of those described in Section 6.4, it will be selected in the Promoter model field of the dialog window. If the current node is a gene set, it will automatically be selected as the main gene set.

Figure 6.19 CM scanning dialog window
[Figure residue: Model search, General parameters. Run model search on gene set: Up HUVEC GSE2639 example (333 promoters). Use background set: NC HUVEC GSE2639 example (3802 promoters). Promoter model: V$HNF1_Q6, V$ISRE_01 and further matrices]

You can launch the search by pressing the button. A further button opens the advanced dialog window, where you can set up more parameters. By default the parameters take the values assigned to the selected model. The length of the promoter sequences, the type of promoters to use, the size of the module and the fitness function parameters from CMA (see Section 6.8.2) can be adjusted here. The CM scanning output looks exactly the same as the one described
30. [Table residue from the key node output:]
SHP1 isoform2 | 31 | 4.83637 | 1.14539 | 0
5-HT-2A | membrane transducing components, receptors, GPCR, rhodopsin-like receptors type A, amine receptors, serotonin receptors | 27 | 4.808 | 1.95191 | 0.012

better rank position by random chance. Key nodes at the top of the result list that have small FDR values are thus very reliable, whereas key nodes with a large FDR value have a high likelihood of also appearing at a significant position in any other analysis and are therefore not very specific. The Z-Score serves as an additional measure of the key node reliability. Key nodes with a Z-Score above 1 can be regarded as statistically significant.

When the key node result is displayed, the custom menu Network search result appears in the menu bar. Corresponding icons are shown on the right side of the toolbar. You can obtain a combined presentation of several networks by marking key nodes in the checkbox column and choosing the Networks > Visualize networks from selected rows menu item or the corresponding toolbar button.

There are several ways to derive subsets from rows marked in the output table. The Hits > Get hits from selected rows as gene set command creates a subset from the input molecules connected to a key node. Key node molecules themselves can be exported with the Molecules > Get key molecules from selected rows as gene set option. Clicking the Networks All nodes between hits and key molecules f
31. [Table residue: composite model results. Modules M1 to M3 list matrices such as V$CBF_02, V$HFH3_01, V$HIF1_Q5, V$LFA1_Q6, V$MAZR_01, V$MEIS1AHOXA9_01, V$MMEF2_Q6, V$MYCMAX_02, V$P53_DECAMER, V$SRY_02 and V$VMYB_02. Score rows: PM3: nan, 0.69724, 0.761675, 0.724287, 0.291783, 0.729458; PM4: nan, 0.692148, 0.765806, 0.604772, 0.30374, 0.728977]

6.2 Composite modules on promoters

By pressing the Show sites on sequences button, shown below the Yes/No distribution chart, you obtain information about matrix and model matches on promoters. The button then changes to Hide sites on sequences, which returns to the view without promoters. The switch between main and background set (View: yes set / background set) also appears in the extended view.

The promoters table below the graphical description of the model is very similar to the one found in the MATCH report described in Section 5.2. The PWM color legend is followed by a table with one row for each promoter. The Match display column contains graphical presentations of matches along the promoter region, with CMA model matches shown as grey boxes. Single TRANSFAC matrix matches are displayed as single colored arrows. Pairs of sites are displayed as two linked arrows. In the PWM color legend, pairs are displayed as two arrows in the same cell. For each matrix or pair
32. a role in the signal transduction pathways. To change the view of your gene set and see all molecules connected with your data, click on the corresponding link.

The number of rows you see is exactly the number of corresponding entities for the current view. If your data set has 100 genes, you will see exactly 100 rows in the table, even if the source data file contained several different annotations for the same gene. In this case the corresponding annotation fields are merged into effective values. Effective values are those passed into analyses and used for sorting in the table. These effective values can be manipulated. For example, if you have two expression values for the same gene, you can specify whether the effective value is the minimum, maximum or average of the source expression values. Figure 3.7 shows an example of such a case. A gene with the symbol Abcc9 has two corresponding accessions in the initial file and thus two different fold change values. The effective value used for sorting is by default the average, but this can be changed in the column preferences. The content of the text, dblink and categorical columns is also modified to fit the current number of rows. All column settings for the current set can be controlled via the Rename/customize column dialog from the Data menu (Section 1.5.6).

Press the link near the gene set name to see additional information on the file origin. ExPlain allows annotation of projects through project notes as we
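The merging of several source values into one effective value per gene can be pictured with a short Python sketch. The function name, the rule names and the fold change numbers are hypothetical and serve only to illustrate the minimum, maximum and average rules described above; this is not ExPlain code.

```python
from statistics import mean

# Rules named in the text; the dictionary itself is an illustrative assumption.
AGGREGATORS = {"minimum": min, "maximum": max, "average": mean}

def effective_values(rows, rule="average"):
    """rows: iterable of (gene_symbol, value) pairs, possibly with repeated genes.
    Returns dict gene_symbol -> effective value under the chosen rule."""
    grouped = {}
    for gene, value in rows:
        grouped.setdefault(gene, []).append(value)
    aggregate = AGGREGATORS[rule]
    return {gene: aggregate(values) for gene, values in grouped.items()}

# Hypothetical input: Abcc9 appears under two accessions with different fold changes.
rows = [("Abcc9", 2.0), ("Abcc9", 4.0), ("Rel", 1.5)]
by_average = effective_values(rows)             # default rule
by_maximum = effective_values(rows, "maximum")
```

With the default rule the two Abcc9 values collapse into their average, so the table keeps exactly one row per gene.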
33. [Table residue from the GO analysis output: binding (gene symbols ADFP, AKR1A1, ALDOB, ANPEP, AP3B2, APRT, ...; 188; 12447; 153; 9.47439e-09) and GO:0009987 cellular process (ACE, ADRM1, AKR1A1, ALDOB, ANPEP, AOAH, AP3B2, APRT, ...; 181; 12076; 148; 2.14069e-07)]

55 | 4.3 FUNCTIONAL ANALYSIS OUTPUT TABLE | CHAPTER 4 THE FUNCTIONAL CLASSIFICATION

4.3.3 Organ/tissue analysis output

Each row of the table presents a matched organ or tissue. The columns contain, from left to right: the name of the organ or tissue with a link to its Cytomer page, the gene symbols, the number of input genes matching that group, the size of the matched group in Cytomer, the randomly expected number of hits, and the P-value of the match result.

Figure 4.8 Output table of Organ/Tissue expression analysis
[Figure residue: columns Organ/Tissue, Gene symbol, Hits in group, Group size, Hits expected, p-value; rows shown include epidermis and pancreatic islets]

4.3.4 BKL Disease analyses output

The table presents the matched disease in each row. The columns contain, from left to right: the MeSH ID with a link to its description page, the gene symbols, the descriptive name of the disease, the Biomarker associations type (correlation, causality or prevention), the number of input genes matching that group, the size of the matched group, the randomly expected number of hits, and the P val
34. are matched to a single gene, you can select to use the maximum, minimum, average, sum or extreme of all values as an effective expression value for this gene.

Text format: With the Merge duplicate strings option selected, only unique text strings will be shown. Each string will be shown on a new line.

DB link format: Here you can change the number of identifiers shown on the screen, the separation type, and the presence or absence of duplicate entries.

Categorical format: Within this dialog, categories can be freely assigned numbers in order to rank them. To modify an assignment, choose a category from the list of all currently assigned category-number pairs in the selected gene set and enter the new value in the editor field. You can also adjust effective value rules here.

1.5.7 Selecting rows and further actions

Table presentations of ExPlain usually contain a checkbox column that can be used to select data rows for different actions, such as extraction of subsets. Actions available for the selected rows, for example Get selected genes as a gene set for the Gene set or Show site maps for selected matrices for the MATCH result, can be found in the custom menu specific to each tree node as well as on the toolbar. More information is given in the corresponding documentation section. Additional checkboxes appear when the data table size exceeds the number of rows per page. All data rows on previous pages can be selected by a single checkbox in the to
35. are dynamically combined and can be balanced by the user.

7.1.2 Results of key node analysis

We describe the output of the Key Nodes analysis by an example computed with a set of transcription factors obtained in an F-Match analysis for up-regulated genes from the HUVEC GSE2639 example. The way in which the set of transcription factors used as input for the key node analysis was derived is shown in Section 5.1. Upstream reactions were searched with a distance threshold of 6, without the include expression/transregulation reactions and follow curated chains options. The results were filtered by a false discovery rate threshold of 0.05. Figure 7.2 shows the output frame of the network node created by the application.

Each row of the output table contains information about one key node. The molecule name and classification are given in the respective columns. The Hits in network values represent the number of input molecules that are connected with the key molecule within the required distance. The Hits list column contains the names of the input molecules that are connected to the described key node. Furthermore, there are three columns with the Score, Z-score and FDR values calculated for each key node, which allow an evaluation of the key node reliability (see below). Some additional columns (TRANSPATH IDs for key molecules and hits, molecule type, description, distance and non-relevant reachable nodes) are hidden. In the non relevant
36. are to be considered. With the Follow curated chains checkbox marked, the persistence reward controls the integration of reliable annotated pathways and reaction chains during the analysis.

Secondary set: You may choose a secondary gene set to run a two-set network analysis in the Use secondary set parameters block. The set of secondary genes is then included by the algorithm, and reaction paths that go through the elements of this gene set are given higher scores.

The key nodes searching algorithm can be used to identify common signaling molecules in the network vicinity of genes from your input list. The underlying application searches the network in a specified range, starting from each input molecule, in order to find the most proximal molecule that is connected to a maximal number of input molecules. This is achieved by scoring each node that was visited on the path from any input molecule. Since the resulting score may be inflated by a generally high level of connectivity of some molecule, the total number of connections reaching every node is also taken into account by the algorithm and penalized, in order to acquire a preference for molecules that are specific for your input genes. The application is also capable of taking a secondary gene set into account and finding reaction chains that use the molecules described in this set. Fully annotated pathways and reaction chains can be included in the search. Path predictions and expert annotation
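The basic idea of the search, counting the input molecules a candidate reaches within a given distance and penalizing candidates that reach a large part of the network, can be sketched in Python as follows. This is a simplified illustration and not the ExPlain algorithm: the graph representation, the function names and the scoring formula (hits divided by the logarithm of the number of reachable nodes) are assumptions, and secondary sets and curated chains are left out.

```python
from collections import deque
from math import log

def reachable_within(graph, start, max_distance):
    """Breadth-first search. Returns {node: distance} for every node that can be
    reached from start in at most max_distance steps (start itself excluded)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_distance:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    del dist[start]
    return dist

def rank_key_nodes(graph, input_molecules, max_distance):
    """graph: dict node -> list of downstream nodes (signal direction).
    Returns (candidate, hits, score) tuples, best candidate first."""
    targets = set(input_molecules)
    nodes = set(graph) | {n for outs in graph.values() for n in outs}
    results = []
    for candidate in nodes:
        reached = reachable_within(graph, candidate, max_distance)
        hits = targets & set(reached)
        if not hits:
            continue
        # Assumed penalty: candidates that reach many nodes are less specific.
        score = len(hits) / log(1 + len(reached))
        results.append((candidate, sorted(hits), score))
    results.sort(key=lambda r: (-r[2], r[0]))
    return results

# Hypothetical network: K regulates both inputs; the hub H reaches them too,
# but also reaches many unrelated molecules.
network = {"K": ["A", "B"], "H": ["K", "X1", "X2", "X3", "X4"]}
ranking = rank_key_nodes(network, ["A", "B"], max_distance=2)
```

Both K and H reach the two input molecules, but K is ranked first because it reaches far fewer unrelated nodes.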
37. be shown. The Interactive graphs limit option limits the maximum number of layers (different matrices, site types, pairs in site search results) in the matrix legend (see Section 5.2.1). If this number is exceeded, interactive mode is turned off.

WARNING: Displaying many layers in interactive mode may make your browser slow and use a lot of memory.

You can also have commas treated as thousand separators rather than decimal separators.

Figure 1.29 Preferences dialog
[Figure residue: Preferences dialog with the options Tree format (Show item statistics if available, Show date, Use Time Zone data from browser), date format "Long" (2004-12-20 10:37:45), Time zone, Show whole tree in tree control, Treat commas as thousand separator rather than decimal separator, and Interactive graphs limit (10, default)]

Chapter 2 Analytical workflows and Wizard mode

2.1 Workflow

The Wizard mode provides you with a set of analytical workflows that will allow you to:
- Perform comprehensive promoter analyses
- Predict prospective drug targets
- Reveal disease biomarkers
- Explain mechanisms of drug toxicity

To enter the wizard mode you can choose one of the following options: Press the D to
38. bp

Figure 5.17 Sites density distribution graph
[Figure residue: plot with axis label sites/seq]

5.2.1 Matrix legend

Each matrix from the report is represented by a colored arrow; matrices from the same TF family have identical colors. The length of the arrow corresponds to the length of the site recognized by the matrix. The legend also displays information on the site search analysis. By default the legend is placed above the table. After clicking the corresponding button, the legend will be placed in the top left corner of the frame when you scroll the page down. By clicking the button again you can place it back above the table.

76 | CHAPTER 5 TRANSCRIPTION FACTOR SITE SEARCH | 5.2 DETAILED GRAPHICAL REPORT OF MATRIX DISTRIBUTION IN PROMOTERS

Figure 5.18 Legend of selected PWMs

Matrix name | Yes sites/1000bp | No sites/1000bp | Yes/No | P-value | Matched promoters p-value
V$BACH1_01 | 0.1667 | 0.0478 | 3.4857 | 0.0039 | 0.0022
V$BACH2_01 | 0.1500 | 0.0444 | 3.3785 | 0.0070 | 0.0041
V$IRF_Q6 | 0.3157 | 0.1059 | 2.9910 | 3.2705e-04 | 0.0042
V$NFKAPPAB65_01 | 1.6833 | 0.5464 | 3.0805 | 722P4e 17 | 3.2912e-08
V$NFKB_Q6_01 | 0.9167 | 0.2391 | 3.8343 | 1.7253e-12 | 9.5913e-09
V$RELBP52_01 | 1.1333 | 0.3279 | 3.4567 | 1.5155e-13 | 1.2663e-08

5.2.2 Promoter table

Each line corresponds to a sequence analysed with the site search algorithm. The Picture column shows all hits of the selected matrices over the whole length of the promoter regions (600 bp in our example). Matches are represented by ar
39. circles. According to Affymetrix they should be about 1. GAPDH values that are considered potential outliers (ratio > 1.25) are coloured red; otherwise they are blue. Beta-actin 3'/5' ratios are plotted as triangles. Because this is a longer gene, the recommendation is for the 3'/5' ratios to be below 3; values below 3 are coloured blue, those above, red. The blue stripe in the image represents the range where scale factors are within 3-fold of the mean for all chips. Scale factors are plotted as a line from the centre line of the image. A line to the left corresponds to a down-scaling, to the right to an up-scaling. If any scale factors fall outside this 3-fold region, they are all coloured red; otherwise they are blue. Percent present and average background are listed to the left of the figure.

Page 4: Box plots of the intensities of the positive and negative control elements from the Affymetrix arrays. The average values and standard deviations of the intensities should be similar for all arrays of the data set.

Page 5: Center of intensity (COI) plots for positive and negative control elements. Ideally all COIs will be located in the center of the plot (0, 0). Some variation of the COI can, however, be expected and tolerated. If the COI of an array has a coordinate equal to or larger than 0.5, the corresponding dot is labelled with the array identifier. A significant shift of the COI far from the center indicates a spatial variation in
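The colour rules of this QC plot can be written down as a small Python sketch. It is an illustration only: the function name is hypothetical, the scale factor band is assumed to be 3-fold around the mean, and treating a beta-actin ratio of exactly 3 as blue is an assumption, since the text only covers values below and above 3.

```python
def qc_colours(gapdh_ratio, actin_ratio, scale_factors):
    """Return the plot colour for each QC measure of one data set.
    scale_factors: scale factors of all chips in the data set."""
    mean_sf = sum(scale_factors) / len(scale_factors)
    # Assumed band: every scale factor within 3-fold of the mean.
    within_band = all(mean_sf / 3 <= sf <= mean_sf * 3 for sf in scale_factors)
    return {
        # GAPDH 3'/5' ratio above 1.25 marks a potential outlier.
        "gapdh": "red" if gapdh_ratio > 1.25 else "blue",
        # Beta-actin 3'/5' ratio should stay below 3.
        "actin": "red" if actin_ratio > 3 else "blue",
        # One chip outside the band colours all scale factors red.
        "scale_factors": "blue" if within_band else "red",
    }

# Hypothetical values for two data sets.
ok_run = qc_colours(1.1, 3.4, [1.0, 1.2, 0.9])
bad_run = qc_colours(1.3, 2.0, [1.0, 1.0, 1.0, 10.0])
```

In the second call a single chip with a scale factor of 10 pushes the whole set outside the band, so all scale factors are reported red.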
40. first gene set

3.3.4 Extracting unique genes

You can select several gene sets to create subsets of genes which are unique within this group, i.e. present in only one set but absent in any other selected gene set. Open the Extract unique genes dialog window from the Data menu, specify the objects to consider (selecting at least two items), then press the button (Figure 3.14).

42 | CHAPTER 3 GENE SETS | 3.3 RECOMBINING GENE SETS

Figure 3.13 Join two gene sets dialog
[Figure residue: Join gene sets dialog. Dialog text: "Here you may join several gene sets. Note that result will be placed below the main set. Also main set column name and format is preferred when merging columns." Fields: Main gene set (HUVEC GSE2639 example), Other gene sets, Merge native columns (When name and type equals / When origin equals), OK, Cancel]

Figure 3.14 Extract unique genes dialog
[Figure residue: Extract unique genes dialog. Dialog text: "Here you can select several genesets to create subsets of genes which are unique within this group i.e. absent in any other selected gene set. Select at least two items." Selected objects include EGF_4h with its Down, Up and Up/Down subsets and a GSE4695 set]
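In set terms, the unique genes of one gene set are the genes that remain after removing everything found in the other selected sets. A minimal Python sketch, with a hypothetical function name and made-up gene sets, could look like this:

```python
def extract_unique_genes(gene_sets):
    """gene_sets: dict set_name -> set of gene symbols (at least two sets).
    Returns dict set_name -> genes present in that set and in no other set."""
    if len(gene_sets) < 2:
        raise ValueError("select at least two gene sets")
    unique = {}
    for name, genes in gene_sets.items():
        # Union of all other selected sets.
        others = set().union(*(g for n, g in gene_sets.items() if n != name))
        unique[name] = set(genes) - others
    return unique

# Hypothetical gene sets.
selected = {
    "Up EGF 4h": {"FOS", "JUN", "EGR1"},
    "Down EGF 4h": {"MYC", "JUN"},
    "Up p38": {"EGR1", "ATF3"},
}
unique_genes = extract_unique_genes(selected)
```

JUN and EGR1 each occur in two of the sets and are therefore dropped from every result subset.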
41. for example as a No set. Press the OK button.

67 | 5.1 SEARCHING FOR SITES IN PROMOTERS | CHAPTER 5 TRANSCRIPTION FACTOR SITE SEARCH

To extract a list of transcription factors, select the matrices with a good yes/no ratio and a low p-value and press the corresponding icon.

5.1.2 Sites search dialog window

Figure 5.1 The Match dialog window
[Figure residue: Run Match dialog with the fields Yes set (Up HUVEC GSE2639 example, 175 promoters), No set (background; NC HUVEC GSE2639 example, 932 promoters), Profile, Use high specific matrices with cut-offs from profile, Promoter window, If gene has multiple promoters use: Best supported, Optimize cut-off, Optimize window position with p-value threshold, Cancel]

Click the Match menu link in the Analysis menu to launch the dialog window. The dialog provides fields to select a profile of TRANSFAC matrices for the search and the promoter window around the TSS, and to specify a rule for promoter selection if several are available for an individual gene. Select the main set you want to analyze in the Yes set field. You can choose a gene set, an interval set or a set of loaded sequences. If an appropriate data set tree node is selected when the sites search analysis dialog is launched, the Yes set field is set by default to the current gene set. We recommend that you specify a No set to use as background set for calculations, but it is not strictly required. By defaul
42. [Dialog residue: from sequence start: 1; Cancel]

By pressing the button, sequences from the specified file will be loaded into ExPlain. When a recognized identifier is provided with the sequence, i.e. RefSeq, ENSEMBL or other identifiers are provided in the header of each FASTA entry, or supported EMBL features are provided as described below, ExPlain will try to identify the gene associated with each sequence. If no identifying information is found or recognized, ExPlain will create a new user gene entry for each sequence. The sequences will be automatically mapped to a gene set, and a new node will appear in the chosen folder on the process tree.

10.1.2 Supported file formats

ExPlain supports three types of sequence files: EMBL and FASTA formatted files, and raw nucleotide sequences. You can also upload a compressed archive with sequence files, such as a tar or gzip file.

135 | 10.1 LOADING SEQUENCE DATA INTO EXPLAIN | CHAPTER 10 SEQUENCES

For FASTA formatted files, only the identifier and the nucleotide sequence itself will be loaded into ExPlain. This format contains a one-line header followed by lines of sequence data. Sequences in FASTA formatted files are preceded by a line starting with a > symbol. The first word on this line is the name of the sequence. The rest of the line is a description of the sequence. For example, in the line >AATF apoptosis antagonizing transcription factor AATF is the name and apoptosis antagonizing
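The FASTA header convention described here (a line starting with >, the first word as the name, the rest as the description) can be illustrated with a short Python sketch. The parser is a generic illustration, not the loader used by ExPlain, and the nucleotide sequence in the example is made up.

```python
def parse_fasta(text):
    """Parse FASTA formatted text into a list of
    (name, description, sequence) tuples."""
    records = []
    name = description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, description, "".join(chunks)))
            # First word of the header is the name, the rest the description.
            parts = line[1:].split(None, 1)
            name = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, description, "".join(chunks)))
    return records

# Header taken from the text; the sequence lines are hypothetical.
example = """>AATF apoptosis antagonizing transcription factor
acgtacgtac
ggttaacc
"""
records = parse_fasta(example)
```

The two sequence lines are joined into a single sequence string for the AATF entry.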
43. g the set of categories I MI NC MD D

Numerical: A numerical column contains real-numbered expression data such as raw intensities, normalized intensities or expression change ratios.
Text: Text columns contain all sorts of text annotation.
Db link: The Db link column contains various database accession numbers. They are recognized by the system and are provided with links to the corresponding database entries.
Location: This category marks columns with chromosomal location information, like Chr 1 4534456 or Chr 1p21, so that the data can be sorted properly by this column.
Exclude: This category marks a column to be skipped in the ExPlain import process.

Table 3.1 provides a list of accession number sources, the required format and an example.

Table 3.1 Supported databases

Database | Example
Affymetrix | 41657_at
EMBL [13] / GenBank [14] / DDBJ [15] | U15637
Ensembl [16] | ENSG00000118046
Entrez Gene [17] | 5966
Entrez Protein [18] | AAC03365
HGNC | Sri
IPI [19] | IPI00029773
MCI | Amt
RefSeq | NM_002908
RGNC | Arha
TRANSFAC | T00168
TRANSPATH | MO000019368
TRANSPro | MMU_81230
UniGene [20] | Hs.279920
UniProt [21] | Q04864
mm IPRO 0001
SwissProt / TrEmbl | 1433B_HUMAN
DBTSS | 29R00318
Fantom, Other ID | T01F000E467B

The categories can be selected from boxes above each column (Figure 3.2). Column titles can be edited in line editors above the category boxes. After pressing the Advanced options link you can set several additional options. You can specify the
44. genes followed by identification of key nodes in signal transduction network.

Figure 2.11 Full upstream analysis parameters dialog
[Figure residue: Full upstream analyses: common regulators in signaling network. Fields: Yes set, No set (background); Classification into functional categories parameters (control False Discovery Rate, Minimal hits to group); Analysis of TF binding sites (Profile, use high specific matrices with cut-offs from profile, Promoter window); Analysis of pairs of TF binding sites (Distance in pair, Maximal number of pairs); Search for key nodes (Max radius, include expression/transregulation reaction); Report parameters and Site search result options (P-value cut-off, Yes/No ratio cut-off)]

display only a note on the number of scheduled processes and the progress bar of the currently running one. In the process tree you will see new nodes for all the scheduled processes.

Each data set will get a tree branch under it, which will include the Transcription Factor (TF) site prediction (F-Match) results, a selection of relevant TFs and the result of the upstream search in relevant signal transduction networks to find key molecules, as well as network visualization nodes. The best key molecules will then be extracted, combined with the main data set and passed to the pathway clas
45. [Equation residue, beginning truncated: i l AW L L Aw X max B ACGT w i B gt min BC ACGT w i B i i]

where B_{X_i} and B_{S_i} are the nucleotides in the i-th position of the subsequence X and the site S, respectively. The d-score ranges from 0.0 to 1.0, where 1.0 denotes an exact match of the nucleotide weights of the site S to the corresponding weights of the subsequence X. Similar to the Match algorithm, we compute two separate d-scores: d_matrix for the whole site and d_core for the core positions of the site, which are the five most conserved positions in the alignment. In the search we choose two independent cut-off values for these two d-scores in each site set; the cut-offs are the same for all sites in the set. Only those matches are reported for which both d-scores, d_matrix and d_core, exceed the corresponding cut-offs. The sets of aligned sites contain many sites that are similar to each other; out of several overlapping matches, those with the highest d-score are output. In addition, the high similarity of a match in the sequence under study to an existing known site in a promoter of another gene can give an idea about the function of the potential binding site found.

Three different d-score cut-offs are precalculated for each TRANSFAC matrix that has a site alignment set attached to it: i) to minimize the false negative rate (under-prediction error), ii) to minimize false p
46. in [0, 1]. If p = 0, the ES resembles a Kolmogorov-Smirnov statistic [27]. If p > 0, the algorithm takes into account the value of expression

63 | 4.7 GENE SET ENRICHMENT ANALYSIS | CHAPTER 4 THE FUNCTIONAL CLASSIFICATION

Figure 4.22 Enrichment of a sample data set with Transit peptide genes
[Figure residue: enrichment results table (373 rows) listing keywords such as Hormone, Nucleus, Acetylation, Transit peptide, Cytoplasm, Retinal protein, Photoreceptor protein, Chromophore, Transcription and DNA binding with their categories, scores, P-values and hit counts, followed by the detail view for keyword Transit peptide: category Domain, Score 0.335462, P-value 0.000190063, Percent
47. in Section 6.1.3, including the graphical description, calculation performance, statistical distribution plots and promoter table. The Match display column contains a graphical representation of matches along the promoter region, with CMA model matches as gray boxes and TRANSFAC matrix matches as colored arrows. The Sequence score column represents the calculation score of the sequence. This column is automatically added to the main set. If you select the main set node in the project tree, you will see the scanning results column, named after the CM.

6.7 Obtaining interactions between Transcription factors and their target genes

Once you have a list of putative targets of a promoter model obtained by CMA, you can generate a list of interactions between the transcription factors present in the model and the target genes where their sites are located.

Figure 6.20 CM scanning advanced dialog window
[Figure residue: Model search, Advanced parameters. Fields: Promoter window, If gene has multiple promoters use: Best supported, Size of module: 200, Fitness function components (Use T-test, Error rate, Control normality of fuzzy score, Penalize model complexity, Use regression by column: Fold change), Previous, Cancel]

This function allows you to include the regulatory interactions between the transcription factors found in a CMA run and their target genes T
48. most promising potential binding sites in the extended genomic DNA sequences.

Cut-off to minimize the sum of both error rates (minSUM): We compute a sum of both error rates to find cut-offs that give an optimal number of false positives and false negatives (Figure 5.28). To do so, we compute the number of matches found in promoter sequences for each matrix using a cut-off allowing 10% of false negative matches (minFN10). This number is defined as 100% of false positives. The sum of the corresponding percentages for false positives and false negatives is then computed for every cut-off ranging from minFN10 to minFP. We refer to the cut-off that gives the minimum sum as the minSUM cut-off.

Figure 5.28
[Figure residue: error rates in % plotted against the matrix score, with the minFN, minSUM and FN10 cut-offs marked. FP: false positive rate. FN: false negative rate, % of real sites that are not recognized. SUM: sum of both FP and FN rates.]

5.4.4 Calculation of theoretical P-values for TFBS predictions

In the following we describe an extension of the standard method of exact calculation of P-values for PWM scores (MSS, Section 5.4.3) concerning calculation of score densities considering a search over both sequence orientations and accor
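The minSUM procedure can be traced step by step in a small Python sketch. It follows the description above under several assumptions that are not taken from the manual: the function name, the way minFN10 is read off the sorted site scores, the evenly spaced grid of candidate cut-offs, and the example scores are all illustrative.

```python
def min_sum_cutoff(site_scores, promoter_scores, min_fp_cutoff, steps=100):
    """site_scores: matrix scores of known real sites.
    promoter_scores: scores of all matches found in promoter sequences.
    min_fp_cutoff: the minFP cut-off, used as the upper end of the scan.
    Returns (min_sum_cutoff, min_fn10_cutoff)."""
    sites = sorted(site_scores)
    # minFN10: cut-off that misses at most 10% of the real sites.
    min_fn10 = sites[int(0.10 * len(sites))]
    # Matches found at minFN10 are defined as 100% of false positives.
    fp_at_fn10 = sum(1 for s in promoter_scores if s >= min_fn10)
    if fp_at_fn10 == 0:
        raise ValueError("no promoter matches at the minFN10 cut-off")
    best_cutoff, best_sum = None, None
    for step in range(steps + 1):
        cutoff = min_fn10 + (min_fp_cutoff - min_fn10) * step / steps
        fn = 100.0 * sum(1 for s in sites if s < cutoff) / len(sites)
        fp = 100.0 * sum(1 for s in promoter_scores if s >= cutoff) / fp_at_fn10
        if best_sum is None or fn + fp < best_sum:
            best_cutoff, best_sum = cutoff, fn + fp
    return best_cutoff, min_fn10

# Hypothetical scores.
real_sites = [0.70, 0.80, 0.85, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96]
promoter_matches = [0.81, 0.82, 0.83, 0.84, 0.86, 0.99]
cutoff, fn10 = min_sum_cutoff(real_sites, promoter_matches, min_fp_cutoff=1.0)
```

In this toy data most promoter matches score below 0.86 while most real sites score 0.90 or higher, so the smallest sum of both error percentages is reached by a cut-off between these two values.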
49. [Dialog residue: If gene has multiple promoters use: Best supported. Maximum allowed promoter window size is 1000 bp. Cancel]

Figure 5.13 Seeder output table
[Figure residue: columns Matrix name, Seed, Seed p-value, Seed q-value, Motif information content; rows for the matrices V$M2_ACTTTC (seed ACTTTC, p-value 6.20202e-06, q-value 0.0127017, information content 5.86197), V$M3_ATACAA (seed ATACAA) and V$M4_GTACTT (seed GTACTT)]

Figure 5.14 Matrix created with Seeder
[Figure residue: matrix view with Accession ID, Matrix quality, Window size 10, Binding factors, the position count matrix and the consensus]

Figure 5.15 Sites search filtering dialog window
[Figure residue: Filter Match results dialog with the fields Source Match result, Interval set to use, Leave only sites for factor assigned to the interval, Filter background set, Expand interval by (bp), Cancel]

79 | 5.2 DETAILED GRAPHICAL REPORT OF MATRIX DISTRIBUTION IN PROMOTERS | CHAPTER 5 TRANSCRIPTION FACTOR SITE SEARCH

Figure 5.16 Sites search filtering results
[Figure residue: result rows for V$NFKB_Q6 (14.3067, 1.5480, 9.2421, 7.39475e-07, 3.2977e-09), V$AP2ALPHA_03 (29.7139, 9.8039, 3.0303, 1.4078e-06, 3.7730e-08) and V$NFKAPPAB50_01 (12.1056, 1.5480, 7.8202, 1.1684e-05
50. negatives (minFN); 2. Minimize sums of FP and FN (minSUM); 3. Minimize false positives (minFP). After clicking the menu link appropriate for your analysis, for example Minimize sums of FP and FN, a new profile will be created and the interface will change to the profile mode. [Chapter 8: Profiles. Section 8.4: Profile representation in the result table.] Figure 8.6: The result input/output table of the modified HUVEC up-regulated profile [screenshot]. Figure 8.7: Input/output table for the profile created from gene set [screenshot; columns include Matrix name, Recognized factors and Number of sites].
51. procedure employed by Tamayo et al. [36]. To filter the gene set, mark the checkbox Use most differentially expressed genes and specify the number of genes that shall be present after filtering. Genes passing the filtering step are those with a high difference in expression between experiments, i.e. these genes possess pronounced expression profiles. The exact calculation procedure employed in the filtering step, as well as details about CRC clustering, can be found in the section Algorithm details of CRC clustering. Parameters for the probability threshold and the maximal shift size must be specified. During the clustering process, the probability that a gene actually belongs to the assigned cluster is calculated. Probability values close to zero denote that the genes are only weakly associated with the assigned cluster, while values between 0.9 and 1 mean that it is very likely that the genes belong to the assigned cluster. If a posterior probability threshold is specified, all genes with probability values below this threshold will be excluded from the clustering results. In addition to clustering genes that exhibit a similar expression profile, the CRC clustering algorithm can also identify complex correlation relationships such as time-shift effects. If you want to find genes that show a similar but time-shifted profile, please specify the number of shifts (0, 1, 2, ...) that will be regarded in the analysis. If for example a shift size of 3 is used
52. random variable $Y(a)$ is associated with each word $a$ of seed length $k$, corresponding to the SMD between $a$ and a randomly selected background sequence from $B$. This $a$-specific distribution function is obtained empirically from $B$: for each word $a$ one sets $g_a(y) = \Pr(Y(a) = y) = |\{B_i : d(a, B_i) = y\}| / m$ for $y = 0, \dots, k$. Seed position weight matrix: for each word $a$, the sum of SMDs to the positive sequences, $S(a) = \sum_{j=1}^{n} d(a, P_j)$, is computed. Under the background model, the distribution function of this sum of $n$ independent and identically distributed random variables is $g_a^{*n}(y)$, the $n$-fold self-convolution of $g_a(y)$. The P value $p$ for word $a$ with sum $S(a)$, which is the probability of obtaining a sum lower than or equal to $S(a)$ under the assumption that the $P_j$ are random with respect to $a$, is (Equation 5.4.4): $p(S(a)) = \sum_{y=0}^{S(a)} g_a^{*n}(y)$. The word $a$ for which the P value $p(S(a))$ is minimal is retained. For each positive sequence in $P$, the set of one or more subsequences of length $k$ having the SMD to $a$ is retained. A PWM $P_0$ is built from this set of selected subsequences using standard procedures and pseudocounts proportional to $\sqrt{n}$, with the modification that when a sequence contains more than one match, each match subsequence weight is reduced proportionally. The subsequence associated with the highest score to $P_0$ is retained in each sequence, and the seed PWM $P_s$ is built from this optimal set of $n$ subsequences. Full-length motifs: the original s
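The P value calculation of entry 52 (empirical distance distribution, n-fold self-convolution, cumulative sum up to S(a)) can be sketched in plain Python. This is an illustrative sketch only: the function names are hypothetical, and the distances d(a, B_i) are assumed to be already computed as integers between 0 and k.

```python
def empirical_distribution(background_distances, k):
    """g_a(y): fraction of background sequences at distance y from word a."""
    m = len(background_distances)
    g = [0.0] * (k + 1)
    for d in background_distances:
        g[d] += 1.0 / m
    return g


def convolve(a, b):
    """Discrete convolution of two probability vectors."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def n_fold_convolution(g, n):
    """Distribution of the sum of n independent draws from g."""
    result = [1.0]  # distribution of an empty sum: all mass at 0
    for _ in range(n):
        result = convolve(result, g)
    return result


def seed_p_value(g, n, s):
    """Probability of a distance sum lower than or equal to s under the background model."""
    gn = n_fold_convolution(g, n)
    return sum(gn[: s + 1])


g = empirical_distribution([0, 1, 1, 2], k=2)   # toy background distances
p = seed_p_value(g, n=2, s=1)
```

The direct convolution costs O(n k^2 n) operations in total, which is fine for short seeds; a real implementation might prefer FFT-based convolution for large n.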
53. reachable nodes column, the number of molecules that could be reached but were not present in the input list is displayed. All key nodes (either molecule or gene) in the analysis results list are ordered by default by their significance score. This score reflects the relation of connected relevant nodes, i.e. the nodes that correspond to the molecule/gene list from your initial query, to nonrelevant nodes, i.e. molecules/genes that can be reached from the key node but are not in the initial list. Details about the score calculation can be found in the key node algorithm section below. The false discovery rate allows an evaluation of the probability of finding the respective key node at the same or a [Chapter 7: Molecular networks analysis. Section 7.1: Network key node analysis.] Figure 7.2: Output frame of a key node analysis [screenshot; columns include Molecule name, Molecule classification, hits in network and Score; example rows: SHP-1 and Caspase-6].
54. section below. Arrays with a low average intensity need to be thoroughly examined. The second plot shows the kernel density estimates of all PM intensities. Special attention should be paid to the shape of the density curves. Arrays which show a significantly different density profile are suspect. [Chapter 11: Statistical analysis of microarray data. Section 11.3: Quality control.] Figure 11.8: Plots of page 2 of the affyQCReport for datasets that are of reasonable quality (A) and bad quality (B) [plots; axis: log2 intensity]. Page 3: Plot of the 3'/5' ratios, percent present calls and average background levels. The 3'/5' ratios are derived for spiked-in and control genes specific to the array type. All data within the plot that do not fulfil the recommended criteria are marked in red; otherwise they are colored blue. A detailed description of the plot can be found in the document QC and Affymetrix data [30] from the documentation of the simpleaffy package [31]. Important parts of the description provided in this document are given as excerpts below. The figure is plotted from the bottom up, with the first chip being at the base of the diagram. Dotted horizontal lines separate the plot into rows, one for each chip. Dotted vertical lines provide a scale from -3 to 3. Each row shows the percent present, average background, scale factors and GAPDH/beta-actin ratios for an individual chip. GAPDH 3'/5' values are plotted as
55. statistical significance that each Yes promoter presents at least one site when compared with No promoters. Figure 5.11: Patch analysis output table [screenshot; 1082 rows; columns: Binding factors, Yes sites/1000bp, No sites/1000bp, Yes/No, P-value, Graphs, Matched promoters p-value]. Use the Site search result
56. structure of the CM, such as the window size, the number of single motifs, the number of pairs of motifs, and score cut-offs for pairs and singletons. CMA promoter models can combine multiple CMs in groups and can also be composed of more than one group. Promoter sequences are classified by sliding a window of the size of a CM along the sequence and scoring each window position according to the CM parameters, to finally yield a normalized score over all window positions. The normalized CM score is transformed to 0 (negative) or 1 (positive) according to a threshold defined for each CM of the model. These binary scores are then combined by a function of Boolean operators, which classifies a promoter as positive if the overall output is 1. 6.8.2 Model construction. CMA attempts to find a model that optimally discriminates between promoters of the positive and the negative set. In the absence of an analytic method to compute optimal parameters, this task is addressed by stochastic optimization with a Genetic Algorithm (GA). Briefly, a GA works iteratively, i.e. over several discrete generations, on a population of solutions. In the case of CMA, the solutions are promoter models defined by their parameters. Typically, a population of a new generation is created by selecting solutions (individuals) of a population according to their performance (fitness) and introducing variation mutation
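The final classification step of entry 56 (thresholding each normalized CM score to 0 or 1, then combining the bits with a Boolean function) can be illustrated with a short sketch. Everything named here is hypothetical: the manual does not state whether the threshold comparison is inclusive, so `>=` is an assumption, and `example_rule` is an invented Boolean function, not one taken from a CMA model.

```python
def binarize(cm_scores, thresholds):
    """Turn normalized CM scores into 0/1 using one threshold per CM.

    ASSUMPTION: a score equal to the threshold counts as positive.
    """
    return [1 if s >= t else 0 for s, t in zip(cm_scores, thresholds)]


def classify_promoter(cm_scores, thresholds, combine):
    """A promoter is positive if the Boolean combination of the bits is 1."""
    bits = binarize(cm_scores, thresholds)
    return combine(bits) == 1


def example_rule(bits):
    """Hypothetical model: (CM1 AND CM2) OR CM3."""
    return int((bits[0] and bits[1]) or bits[2])


thresholds = [0.5, 0.5, 0.5]
negative = classify_promoter([0.8, 0.2, 0.1], thresholds, example_rule)
positive = classify_promoter([0.8, 0.6, 0.1], thresholds, example_rule)
```

Passing the Boolean function as a parameter keeps the thresholding separate from the model structure, which is the part the Genetic Algorithm would vary.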
57. the command New folder from the File menu. 1.1.1 Search tab on the tree frame. The Search tab can be used to set constraints on the node name, its type and time of creation. Press the button to simply highlight appropriate nodes, or the button to highlight the nodes and switch to the Select tab in one step. Figure 1.3: Search tab of the project tree [screenshot; fields: Name substring, Type, Date from/to]. 1.1.2 Select tab on the tree frame. The Select tab is useful for manipulation of project tree nodes. It gives you the opportunity to manipulate several tree nodes at once. Choose several nodes, select the appropriate action and press the button. A convenient way to select and deselect ranges in the Select tab is to use the Shift key. The status of the last modified checkbox can be transferred to a whole range of rows by pressing the Shift key while selecting the top or bottom node of the desired range. Table 1.1: Project tree icons [icon glyphs not reproduced]. General icons: expands a node; collapses a node; expands all nodes; collapses all nodes; deletes a node from the tree; shows date of node creation; hides date of node creation; a system folder; a user created folder; a process about to be launched; an active process
58. the format of the linked column, rename it, and sort or filter its data. Figure 3.20: Link column from another data set dialog [screenshot; fields: Gene set you want to add a column to, Data set to link column from, Column to link]. Figure 3.21: Gene set with linked column [screenshot; columns: Gene symbol, BKL description, log FoldChange 4h, log FoldChange 12h].
59. the most probable binding sites. Seeder is used to find de novo motifs overrepresented in the analysed set. Promoters are automatically extracted by ExPlain from the TRANSPro database according to specified ranges upstream and downstream of the virtual transcription start site (TSS), taking into account the extraction rules applied to genes with multiple promoter entries. In these analyses, two sets of promoters, a query (positive) and a control (negative) set, are analyzed to enable you to identify factors with overrepresented sites or motifs in the query set. 5.1 Searching for sites in promoters. In order to find which transcription factors might control the genes in the dataset, ExPlain will find the TFBSs in this set and compare the frequencies of these sites with those in a control or background gene set (non-changed genes). This will produce a collection of Position Weight Matrices (PWMs) whose described sites are overrepresented in the promoter set under study. NOTE: You are not confined to a differential analysis that requires a control set. The data set may as well be a single group of genes of interest or a single sequence, and the report will then contain the absolute frequencies of the binding sites found. 5.1.1 How to run Match. Click the Match menu link in the Analysis menu or press the corresponding button to launch the dialog window. Select analysed set (upregulated genes, for example) as Yes set. Select background set (not changed genes
60. the selected profiles shall be used for creation of the joined profile, the process can be started by clicking on the button. Figure 8.11: Custom cut-offs dialog [screenshot; options: minFN, minSUM, minFP, custom CSS/MSS values, use a profile as template, p-value based cut-offs]. Note shown in the dialog: for p-value based cut-offs, matrices for which the minimal prediction rate could not be achieved will be removed from the profile, as well as user matrices. Figure 8.12: Join profiles dialog [screenshot; if a matrix appears in all profiles: take lowest, average or highest cut-offs]. 8.6 User matrices. ExPlain provides the ability to create custom positional weight matrices, which will be used to represent transcription factor binding sites. Using a site pattern or a sequence alignment, you can create the respective matrix and use it in profiles for site search analysis. To include a PWM in a profile, first create a matrix as described below. Your PWM will then
61. the stronger statistical significance of the assumption that the observation is not a random event. A probability threshold can be set in the P-value field of the ExPlain Functional Analysis interface. In addition to being selected as statistically over- or underrepresented according to the given P-value threshold, reported groups are filtered by the minimal number of hits from the input set. This corresponds to the minimal hits to group parameter of the Functional Analysis dialog window. 4.7 Gene Set Enrichment Analysis. 4.7.1 The GSEA interface. Click the Gene Set Enrichment Analysis menu link in the Analysis menu to launch the dialog window. Select a gene set that has at least one numerical column. Specify the numerical column to be used from the source gene set. Select the groups to be searched for enrichment: Functional categories (see Section 4.1), molecules from canonical TRANSPATH pathways (see Section 4.4.1), or genes from desired Gene sets. Press the OK button. You can also specify more parameters in the Advanced options block. The Weight genes power (p) parameter described in Equation 4.7.1 takes the values 0 or 1. To weight genes in the Source gene set using the expression value, set the p parameter to 1; otherwise set it to 0. To restrict this analysis to groups that are essentially connected to the source gene set, you can control the Minimal hits to
62. transcription factor will be the description. The remaining lines contain the sequence itself. Blank lines in a FASTA file are ignored, and so are spaces in a sequence. FASTA files containing multiple sequences follow the same rules, with one sequence listed right after another. For EMBL formatted files, the following features will be loaded into ExPlain when provided: ID (identifier), AC (accessions), DE (description), OS (species), FT promoter or FT exon (TSS position), and SQ (nucleotide sequence). For raw nucleotide sequences, an identifier will be generated by ExPlain. NOTE: Only the first exon start position is taken as the TSS. If both the promoter and the exon features are present in the file, the promoter will be preferred. 10.1.3 Copy/Paste sequence data. To directly copy and paste sequence data into ExPlain, use the New sequence menu link within the File menu. A dialog window will open which allows you to specify the destination folder for the loaded sequence(s), the desired sequence set name and the default TSS position to be used within the sequence(s). Paste sequences in FASTA, EMBL or raw nucleotide sequence format into the text area. By pressing the OK button, the sequences will be loaded into ExPlain in the same manner as described in Section 10.1.1 above. For more information about supported sequence formats, please see Section 10.1.2 above. Figure 10.2: New sequence dialog [screenshot].
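The FASTA handling rules stated in entry 62 (blank lines ignored, spaces inside a sequence ignored, several sequences listed one after another) can be sketched as a small parser. This is not ExPlain's loader: the function name is hypothetical, and splitting the header line into an identifier (first word) and a description (the rest) is an assumption based on common FASTA usage.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of {id, description, sequence} records."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            # ASSUMPTION: first word is the identifier, the rest the description.
            ident, _, desc = line[1:].strip().partition(" ")
            records.append({"id": ident, "description": desc, "sequence": ""})
        elif records:
            # Remove spaces inside the sequence before appending.
            records[-1]["sequence"] += "".join(line.split())
    return records


example = ">seq1 my factor\nACGT AC\n\nGG\n>seq2\nTTTT\n"
records = parse_fasta(example)
```

Sequence lines that appear before any header are silently skipped here; a stricter loader might treat such input as raw nucleotide sequence instead.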
63. 3.5.1 Exporting selected genes as BKL search result. Contents: 4 The Functional classification; 4.1 The Functional classification analysis; 4.1.1 How to run classification analysis; 4.1.2 Functional Analysis categories; 4.1.3 Functional analysis dialog window; 4.2 Example analysis with BKL curated GO annotation; 4.3 Functional Analysis output tables; 4.3.1 Expression (BKL manual curation); 4.3.2 Gene Ontology analysis (public and BKL curated); 4.3.3 Organ/tissue analysis output; 4.3.4 BKL Disease analyses output; 4.3.5 SwissProt analysis output; 4.3.6 Transcription Factor Classification analysis output; 4.3.7 TRANSPATH Molecule Classification analysis output; 4.3.8 Whole subsets from the tree analysis output; 4.4 Functional Analysis with reaction pathways; 4.4.1 Functional Analysis with TRANSPATH pathways; 4.4.2 Pathways results; 4.4.3 Functional Analysis with user defined interaction pathways; 4.5 Functi
64. Figure 6.5: Changing CMA stop condition on the fly [screenshot; parameters: Stop after (min), NC limit; Update button]. The model description is expanded in Figure 6.6. In this mode the graphic also displays the matrix cut-offs (C: 0.972500 for the first matrix) and the number of matrix matches expected in the module (N: 1). To switch between expanded and simple display, click the link above the model. Underneath the description, the false positive (FP: 13.07%) and false negative (FN: 30.95%) frequencies on the dataset as well as the CMA overall cut-off (Overall cut-off: 0.112918) for the model are given. The Value row of the Goal function calculation table shows the performance of the model with regard to individual fitness function components. The Weight row contains the weights of the individual components. These weights correspond to the values set in the Advanced GA options dialog of the application frame, normalized to sum to 1 (in our example the components T, E and P were selected). The total fitness of the model is computed as the sum of the weighted fitness values, as shown in the bottom row of the table, named Weighted value. The Expression score, Yes/No and Sites density distribution plots are available through the Expression score and Yes/No distribution buttons. The Expression score distribution indicates how well s
65. 5.2.3 Hiding matrices from view. Matrices can be excluded from the representation or included again using the checkboxes in the matrices box. You can see an example where three matrices out of five are excluded (Figure 5.21). The corresponding matches in the promoter table are hidden or revealed according to this selection. Figure 5.21: Hiding matrices from view [screenshot; matrix table columns include Matrix name, Yes sites/1000bp, No sites/1000bp and Yes/No]. 5.2.4 Detailed promoter view. Clicking on a picture of a promoter in the Picture area provides a detailed view of the promoter usi
66. 2.2.2 List with fold change. Use this option when you have a file with gene identifiers and precalculated fold change and (optionally) p-value columns in either tab-separated text or Excel format. If you upload a list of this kind, ExPlain will display information on the number of matched rows and the sizes of the Up- and Down-regulated gene sets which will be created and used in the analyses. Adjust the filter options to have a reasonable number of genes in these sets (between 20 and 500) and press >> Finish importing. [Chapter 2: Analytical workflows and wizard mode. Section 2.3: Workflow mode.] 2.2.3 Microarray series table. Use this option when you have a plain text file or an Excel table containing a list of genes with several fold change columns assigned. For each fold change column, two sets (Up- and Down-regulated genes) will be created. Figure 2.6: A data set with several pre-calculated Fold Change columns [screenshot].
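The splitting of a fold change list into Up- and Down-regulated sets, as described in entry 66, can be sketched as below. This is a hedged illustration rather than ExPlain's import code: the function names are hypothetical, and the default cut-offs of 0.5 and -0.5 on the log fold change are assumptions that stand in for the filter options of the import dialog.

```python
def split_by_fold_change(rows, up_cutoff=0.5, down_cutoff=-0.5):
    """Split (gene, log fold change) pairs into Up and Down gene sets.

    ASSUMPTION: the cut-offs are placeholders for the dialog's filter options.
    """
    up = [gene for gene, log_fc in rows if log_fc > up_cutoff]
    down = [gene for gene, log_fc in rows if log_fc < down_cutoff]
    return up, down


def size_is_reasonable(gene_set, low=20, high=500):
    """Check the set size against the 20 to 500 range recommended in the text."""
    return low <= len(gene_set) <= high


rows = [("Plk3", 4.6), ("Syk", 3.1), ("Hadhb", -3.0), ("Gapdh", 0.1)]
up, down = split_by_fold_change(rows)
```

In practice one would loosen or tighten the cut-offs until `size_is_reasonable` holds for both sets, which mirrors adjusting the filter options before pressing Finish importing.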
67. 1.5.8 Data export. The workspace provides you with the opportunity to export your data in different formats, such as tab-delimited text or a Microsoft Excel spreadsheet, for use outside of ExPlain. Certain tree nodes have additional export formats. For example, Genome intervals can be exported as a BED file. Figure 1.27: Data export links [screenshot; Export: Plain text, XLS, RTF]. Some ExPlain plots can be exported in RTF format as a vector graphics image, so that they can be resized in other applications such as Microsoft Word. Press the RTF link in the Export (RTF, XLS) option above the plot and save the file or open it using the File download dialog. The XLS link exports the plot to a Microsoft Excel spreadsheet as a set of values. 1.5.9 Interface preferences. The Preferences menu option of the View menu launches the adjustment dialog shown in Figure 1.29. You can change the way the tree is displayed: more or less space between tree entries, the item statistics and the date of creation. The date format and time zone can also be specified. [Chapter 1: Main components of the ExPlain user interface. Section 1.5: The workspace.] Figure 1.28: Data export links [screenshot of the File download dialog; file name: explaingraph.rtf]. If the Show whole tree in tree control option is chosen, the tree node selection control (Figure 1.8) will display all tree items; otherwise only items of acceptable type will
68. 1 Identification of miRNA targets; Reports; 13.2 Graph report generation; 13.2.1 Generating a graph of the value distribution of a gene set; 13.2.2 Graphical report on intervals; Appendix; References. Part I: User manual. Chapter 1: Main components of the ExPlain user interface. Figure 1.1: The ExPlain user interface [screenshot; menus: File, View, Data, Analyze, Gene set, Help].
69. Chapter 9: Genome intervals (ChIP-chip, TFBS, ChIP-Seq, Tiling arrays). 9.1 Loading of genome intervals. 9.1.1 Loading of genome intervals data from BED file. ExPlain recognizes only the intervals that fall into the range of -10000 to +1000 nucleotides with respect to the TSS of the gene. To load your interval data, select the Load intervals (BED file) option from the File menu. Specify a data file using the form shown in Figure 9.1. Besides the file name, you can specify a feature name and a genome build, or choose to have them read from the BED file. A list of all transcription factors from the TRANSFAC database is provided so that you can link them to your data; this information can be further used for filtering the site search results. Check the option Automatically create subset from the interval to create a set of genes covered by the intervals after loading. In the Destination field you can choose the name of the folder in the project tree where your data will be saved. Figure 9.1: Dialog for loading genome intervals from BED file [screenshot]. 9.1.2 Loading of genome intervals from CHP/BAR file. ExPlai
70. [Chapter 3: Gene sets. Section 3.1: Loading data into the ExPlain system.] 3.1.1.1 Supported species. At the moment the mammalian module of ExPlain supports data from human, rat and mouse. ExPlain Plant accepts data from Arabidopsis thaliana, Oryza sativa and Glycine max. 3.1.1.2 Supported file formats. ExPlain supports four data formats: Affymetrix <http://www.affymetrix.com> CHP files, Affymetrix Exon genecore CHP files, ASCII text files and Microsoft <http://www.microsoft.com> Excel files. Any of these files can also be loaded as an archive (zip or tar.gz). Within text and Excel files, data are assumed to be presented as a table with one entity per row. An arbitrary number of non-data header rows (e.g. containing notes or a column title bar) may be present at the top of the table. In ASCII text files, data columns should be delimited by a single tab character, and each row should contain the same number of tabs, so that columns with undefined values are never skipped in individual rows. Currently ExPlain does not support Unicode <http://www.unicode.org> encoding in Excel files, which affects e.g. Asian or Cyrillic characters. Such files should be saved with the Western encoding; otherwise proper processing of the data cannot be guaranteed. Furthermore, files from Microsoft <http://www.microsoft.com> Excel 2007 are presently not supported by the system. 3.1.2 Import options of geneset loading dialog T
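The tab-delimited layout required in entry 70 (every data row carries the same number of tabs, so empty values are kept as empty fields rather than skipped) can be checked before upload with a short script. This is a sketch under stated assumptions: the function name is hypothetical, and the number of header rows is passed in by the caller because the manual allows an arbitrary number of them.

```python
def find_ragged_rows(text, header_rows=1):
    """Return 1-based line numbers of data rows whose column count differs
    from that of the first data row.

    ASSUMPTION: the first data row has the correct number of columns.
    """
    lines = text.splitlines()
    data = lines[header_rows:]
    if not data:
        return []
    expected = len(data[0].split("\t"))
    bad = []
    for offset, line in enumerate(data):
        if len(line.split("\t")) != expected:
            bad.append(header_rows + offset + 1)
    return bad


table = "Gene\tFC\tP\nA\t1.0\t0.01\nB\t\t0.5\nC\t2.0\n"
bad_lines = find_ragged_rows(table)
```

Row B is accepted because its missing fold change is still delimited by a tab, while row C is reported because its last column was dropped entirely.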
71. Figure 1.5: Select tab on the tree frame [screenshot; actions: Create folder and move items there, Move items to folder, Delete selected items, Mark items as viewed, Mark items as non-viewed]. Figure 1.6: The menu [screenshot of the Data menu; sections: Filter (Filter by condition, Filter by gene sets, Up/Down/Non-change genes, Random sub-sets), Merge and summarize (Join gene sets, Match results summary, Generate report), Add columns (Link from another data set)].
72. 3.4.2 Linking a column from another data set. In order to compare different results, you can add columns from other data sets to your data set. In the example shown in Figure 3.20, we add the TRANSPro accession and Gene symbol columns from the Human Promoters set to the set of HUVEC genes. It is also possible to link columns from different analysis results. Choose the Link from another data set option from the Add columns section of the Data menu. When you select a dataset with the columns that you want to add, the form will be disabled while the system refreshes the list of available columns. Then select the column(s) and press the button. If you would like to link columns to more than one gene set, select Multi select mode in the Gene set you want to add a column to menu. Linked columns will be marked by an icon. Click on it to go to the original dataset. You can change
73. Figure 6.12: Columns displaying scores of individual matrices/pairs [screenshot]. 6.3 Predefined parameters of CMA. The parameter set saved in the CMA advanced dialog window appears in the project tree inside the CMA subfolder of the Preset folder. When you select a preset node in the tree, all parameters are displayed as a set of tables similar to the ones in the dialog window. ExPlain provides two types of predefined parameters. System presets are loaded at the first start of the ExPlain application and cannot be removed. The example below shows one of the system presets. The second type of presets can be created through the CMA advanced dialog window. When you launch the CMA dialog and
74. 8.6.3 Changing factors associated with matrix

To change the binding factors associated with a matrix, use the dialog window that can be launched by clicking the change link after the factors list or by using the Factors link in the matrix-specific menu. Select a matrix in the drop-down list. Factors already assigned to the matrix are displayed in the left list, and the right list contains other available factors. Select factors and move them between the two lists using the Add and Remove buttons, so that all factors you want to be associated with the selected matrix are collected in the left list.

[Figure 8.16: Change factors dialog. Fields: Select matrix (V$0 3_user_defined), Select factors (two lists with Add and Remove buttons), Cancel.]

8.6.4 Importing user matrices from TRANSFAC

Custom matrices created with the matrix generation tool in TRANSFAC MATCH can also be used in the ExPlain application. PWMs from TRANSFAC are imported automatically on the first run of ExPlain and on each run of the new profile dialog. To import matrices manually, select the Weight matrices folder in the tree and click Import matrices from TRANSFAC Match. Imported matrices have the same view as custom-created matrices and can be manipulated in the same way.
75. Back returns you to the promoters view.

[Figure 5.23: Detailed text sequence view. Sites lying on promoter HSA 3886 1. Gene symbol: NFKBIA. Description: nuclear factor of kappa light polypeptide gene enhancer in B-cells inhibitor, alpha. TSS: Chr 14: 34943663. The figure shows the promoter sequence annotated with matches of V$NFKAPPAB65_01, V$NFKB_Q6_01 and V$RELBP52_01, with scores between 0.88 and 1.00.]

By cl
76. [Table residue: result rows for GO:0016265 (death) and GO:0006915 (apoptosis), ontology Biological process.]

Instead of scrolling the table manually, you can use the filtering option (see Section 1.5.3) to emphasize groups of interest. The figure below shows highlighted immune groups.

[Figure 4.4: New subset with defense, immune and inflammatory response genes compiled by Functional Analysis. Filter: GO Term contains "immune" (3 of 164 rows). Columns: GO Identifier, Gene symbol, GO Term, Ontology. Rows: GO:0002376 (immune system process), GO:0002520, GO:0006955 (immune response), all Biological process.]

The available options of the Functional classification in the main menu are Gene set, which compiles a new subset containing entries from all selected groups, and Separate gene sets, which creates an individual set for each group. New subset nodes will be added below the Func
77. CHAPTER 11 STATISTICAL ANALYSIS OF MICROARRAY DATA, 11.4 HIGH LEVEL ANALYSIS

[Figure 11.15: Result of RankProduct analysis (Rank Product RP, QC Filtered MAS5 example2 CEL files). Table of 22283 rows, one row per probe set, e.g. 1007_s_at, 1053_at, 117_at, 121_at.]

Note that some analyses require at least two columns for each level.

[Figure 11.16: Statistical analysis dialog. Options: Fold Change, GLM/ANOVA, F-Test, Empirical Bayes (Prior 95, No. of iterations 1); Source Gene Set; Select factors to calculate Fold Changes for all pairs of levels.]

The resulting gene set contains the genes from the original set and columns with the calculated statistical data. A fragment of the result of the Empirical Bayes analysis is shown in Figure 11.17. The Rank Product procedure described in the CEL file high level analysis section can also be applied to up to three gene sets, provided that columns with absolute expression values for controls and experiments are assigned to the set. The input dia
78. [Figure: Calculation from existing numerical column dialog.]

Calculate column
You may add a new numerical column whose values are calculated from already existing columns.
Select gene set you want to add a column to: HUVEC GSE2639 example
Formula: log2 1
Available operations and functions: abs(x), sqrt(x), exp(x), log(x), log2(x), log10(x), sin(x), cos(x), sum(x), avg(x), max(x), min(x)
New column's name (optional): log2 Fold change
Available columns: 1 Fold change, 2 Known sites, 3 numerical, 4 Test column, 5 log2 Fold change

Figure 3.19: Data table with a logarithmically transformed fold change column

Gene symbol | BKL description | Fold change | log2 Fold change
A2M | Alpha-2-macroglobulin, binds to collectin, plays a role in cell proliferation and protein homotetramerization; upregulated in Alzheimer disease, sickle cell anemia, rheumatoid arthritis, multiple sclerosis and prostatic neoplasms | 1.02 | 0.0285692
A4GNT | Alpha-1,4-N-acetylglucosaminyltransferase, a glycosyltransferase that forms alpha-1,4-linked GlcNAc residues, especially in O-glycans, and is involved in synthesis of class III mucins | 1.05 | 0.0703893
AADAC | Arylacetamide deacetylase, may play a role in protein amino acid deacetylation and lipid metabolic process | 1.01 | 0.0143553
AAK1 | AP2-associated kinase 1, a protein serine/threonine kinase that acts in protein amino acid phosphorylation | 1.01 | 0.014
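The log2 Fold change values in the table fragment above can be reproduced outside ExPlain. The following is a minimal sketch in Python, not ExPlain's own implementation; the function name `log2_fold_change` is hypothetical, and it assumes the Fold change column holds strictly positive ratios.

```python
import math


def log2_fold_change(fold_change: float) -> float:
    """Return the base-2 logarithm of a strictly positive fold change."""
    if fold_change <= 0:
        raise ValueError("fold change must be positive")
    return math.log2(fold_change)


# Fold change values of A2M, A4GNT and AADAC taken from the table above.
for symbol, fc in [("A2M", 1.02), ("A4GNT", 1.05), ("AADAC", 1.01)]:
    print(symbol, round(log2_fold_change(fc), 7))
```

A fold change of 1 maps to 0, values above 1 become positive and values below 1 become negative, which makes up- and down-regulation symmetric around zero.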
79. [Table 11.2 residue: techniques available for each low-level analysis step. Recoverable entries include MAS, RMA, Invariant set, Quantiles, PM only, Li-Wong and Median polish.]

Table 11.2 lists the techniques available in ExPlain for each low-level analysis step. You can choose any combination of these techniques when performing low-level analysis. The output of a low-level analysis is a table containing the microarrays in the columns and the probe IDs in the rows. From this step onwards, the expression value tables generated in each step can be exported as an ExPlain gene set using the Gene set > Convert selected rows to gene set menu option.

[Figure 11.5: Low level analysis dialog (CEL low level analysis). Fields: CEL archive to analyze, Method, and Advanced parameters: Background correction, Normalize method, PM correction method, Summarization method.]

11.3 Quality control

This step enables you to exclude certain microarrays from further analysis based on quality control tests. BioConductor includes several methods to assist with quality control.

Background statistics: This procedure checks whether the background values across the arrays are comparable.

Global scaling factors: This standard normalization procedure involves globally rescaling the arrays to set the median probe intensity to the same level. The scaling fa
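The idea behind global scaling factors can be illustrated with a short sketch. This is an assumption-laden illustration in Python, not the BioConductor procedure used by ExPlain: the function names are hypothetical, and the choice of the common target level (the median of the per-array medians) is an assumption.

```python
from statistics import median


def global_scaling_factors(arrays, target=None):
    """Return one scaling factor per array.

    arrays: dict mapping an array name to a list of probe intensities.
    target: common median level; assumed here to default to the median
            of the per-array medians.
    """
    medians = {name: median(values) for name, values in arrays.items()}
    if any(m == 0 for m in medians.values()):
        raise ValueError("an array with median intensity 0 cannot be rescaled")
    if target is None:
        target = median(medians.values())
    return {name: target / m for name, m in medians.items()}


def rescale(arrays, factors):
    """Multiply every intensity of each array by that array's factor."""
    return {name: [v * factors[name] for v in values]
            for name, values in arrays.items()}


example = {"C1": [1.0, 2.0, 3.0], "T1": [2.0, 4.0, 6.0]}
factors = global_scaling_factors(example)
print(factors)
print(rescale(example, factors))
```

Arrays whose factor differs strongly from the others are the usual candidates for exclusion in a quality check of this kind.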
80. [Table residue: gene rows with BKL descriptions, including KRCC1, MBD4 and MCM3.]

Figure 1.15: Click on the name to rename the node.
Figure 1.16: Enter new name.

Origin: This field contains a brief description of the item's origin.

Data build: This field shows the data file version used when this item was created. This can be useful when you update the data on an existing installation and want to know whether you performed some analysis using an older or newer version of the file.

User comments: To store any notes in this field, enter them in the comment window and press the button.

1.5.3 Filtering table

Th
81. ExPlain 3.0 manual
explaining gene expression data
November 2, 2009
Copyright 2009 BIOBASE GmbH

Contents

I User manual
1 Main components of the ExPlain user interface
  1.1 The project tree
    1.1.1 Search tab on the tree frame
    1.1.2 Select tab on the tree frame
  1.3 The menu and dialogs
  1.4 Process monitor
  1.5 [title not recoverable]
    1.5.1 Renaming project tree nodes
    1.5.2 [title not recoverable]
    1.5.3 Filtering table
    1.5.4 Adjusting the number of visible table rows and page navigation
    1.5.5 Sorting, renaming and showing/hiding table columns
    1.5.6 Customization of the column content display
    1.5.7 Selecting rows and further actions
    1.5.8 [title not recoverable]
    1.5.9 Interface preferences
2 Analytical workflows and Wizard mode
3 Gene sets
[remaining entries not recoverable]
82. [Table residue: two result rows with site densities in the main and background sets, their ratio and p-values.]

Some additional information about the analysis results is available in hidden columns. You can read about hiding or showing columns in Section 1.5.5. Matched promoters in Yes and Matched promoters in No show the fraction of promoters where at least one site of the matrix was found in the main and background sets, respectively.

An example output table of an analysis done without a control set is shown below (Figure 5.3). Each row contains information about the performance of one matrix of the input PWM profile. Matrix names link to the corresponding matrix entries in the BKL database. The average number of matches per 1000 bp is given in the Sites density column and is visualized in the Graphs column.

[Figure 5.3: Match output for query set only. Table of matrices with their site densities.]

Use the Site search result menu to generate a graphical visualization of the results and various subsets of information from the output table for the selected rows. The Gene set menu link creates a gene set of the selected matrices; the Homologous matrices link extends the gene set by adding all the transcription factors linked to homologous matric
83. [Table residue: profile rows listing the factors linked to each matrix, e.g. V$PXR_Q2 with FXR, LXR, PXR and RXR-alpha factors from clawed frog, golden Syrian hamster, human, mouse and rat.]

8.5.2 Create gene set using selected matrices

Clicking the menu link Gene set > Create gene set using selected matrices starts the process of gene set creation from the matrices selected in the checkbox column (if nothing is selected, all matrices are included). You will be directed to the newly created gene set, which is added to the project tree under the current profile node.

8.5.3 Extend set of selected matrices by all homologous matrices

Clicking the menu link Homologous matrices > Extend set of selected matrices by all homologs starts the process of gene set creation by adding the homologous matrices to the selected ones (if nothing is selected, all profile matrices are included). You will be directed to the newly created gene set, which is added to the project tree under the current profile node.

[Figure 8.10: Extension of selected matrices set by homologous matrices, output. Gene set of 27 rows with columns Gene symbol, BKL description and Species.]
84. Fitness column. By revealing hidden columns, you can add to each row the set of matrices comprising a model, the gene symbols corresponding to the matrices, and a p-value based on the hypergeometric distribution. Entries of the Promoter model column link to comprehensive reports about the corresponding model, as described before. You can use the Model search result menu to operate on the models marked in the checkbox column. The Get matrices of the model as gene set link compiles a new data set from the factors linked to the PWMs of the selected models. The Save model link adds a separate node for each model to the project tree.

[Figure 6.10: CMA output table (Model prediction results, 51 rows). Columns: Promoter model, Model name, R, T, E, N, P, Fitness. Each row (PM1, PM2, ...) lists the matrices of the model modules M1, M2 and M3, e.g. V$HIF1_Q5, V$MAZR_01, V$CBF_02, V$GATA3_02.]
85. Process state icons (icons not reproduced):
A paused process
A postponed process
An analysis process that could not be completed

Data type icons (icons not reproduced):
A data set or subset
Results from the Function analysis
Results from the Function analysis in ExPlain Plant
Results from the Pathways analysis
A PWM profile
Results of a binding site search analysis
Motif search result
Saved report page of a binding site search analysis
Results of a promoter model construction with CMA
Model created by a user in the model editor or provided by TRANSCompel
A promoter model computed by CMA or results of a promoter model search in CMAcalc analysis
Summary report page
Graph report page
Results of a pathway network analysis
Visualization of a signalling network
Interaction profile
Genome intervals set
Enrichment analysis results
User created weight matrix
User defined data loading schema
Predefined set of CMA parameters
User defined set of CMA parameters
A set of CEL files in one archive
A single CEL file
Quality check options
Quality check filter
T test
Wilcoxon
GLM ANOVA
Empirical Bayes
Rank product test
miRNA search result

[Figure 1.4: Tree with the highlighted search results.]
86. [Table residue: example multi-column input table with a Probe Set ID column and three fold change columns (subst1/control, subst2/control, subst3/control), one row per probe set, e.g. 1552257_a_at.]

CEL files

Here you can load CEL files from one experiment, archived in ZIP, TAR or Gzip TAR format. After the file is uploaded, you will see a dialog that allows you to define groups of arrays for further comparison, as shown in Figure 2.7. When the name of the comparison factor is defined, you will need to assign ar
87. [Figure 7.6: Search for clusters dialog window. Fields: Gene set (Up HUVEC GSE2639 example, 211 molecules), Cluster separation degree, Distance threshold (3), Add user defined interactions.]

Cluster separation: The cluster separation degree influences the degree to which clusters are separated (divided) and thus determines the cluster size. The higher the degree, the more edges are removed. The edges are assigned a betweenness value, and edges with a high value are more likely to be removed. A low separation degree yields a single large cluster, which is sometimes difficult to visualize. A high separation degree can leave the input set unclustered. The size of the input list also influences this parameter: large inputs usually require higher separation degrees.

Distance threshold: The maximal search distance threshold defines the number of steps from each input molecule that are considered in the cluster calculation. Only molecules that are linked by a number of steps smaller than the one specified by the distance threshold are connected by the algorithm.

User defined interactions: In the Add user defined interactions list, you can select any uploaded interaction profile (see Section 7.4). The algorithm will include these molecule interactions during clustering.

7.2.2 Network cluster analysis results

Figure 7.7 shows an example output of a cluster analys
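The effect of the distance threshold described above can be sketched with a breadth-first search. This is only an illustration in Python under stated assumptions, not ExPlain's clustering algorithm: the network is assumed to be an undirected graph stored as an adjacency dictionary, and the function names are hypothetical.

```python
from collections import deque


def steps_from(graph, start, max_steps):
    """Breadth-first search: map each node reachable within max_steps to its step count."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] >= max_steps:
            continue  # do not expand beyond the allowed number of steps
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def connected_input_pairs(graph, inputs, distance_threshold):
    """Return pairs of input molecules linked by fewer steps than the threshold."""
    pairs = set()
    for a in inputs:
        reachable = steps_from(graph, a, distance_threshold - 1)
        for b in inputs:
            if a < b and b in reachable:
                pairs.add((a, b))
    return pairs


# Hypothetical chain A - B - C - D; only A, C and D are input molecules.
network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
print(connected_input_pairs(network, ["A", "C", "D"], distance_threshold=3))
```

With a threshold of 3, A and C (two steps apart) and C and D (one step apart) are connected, while A and D (three steps apart) are not.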
88. [Table residue.]

4.5 Functional Analysis summary

ExPlain provides the possibility to compare classification results from several gene sets. Click the Functional classifications summary menu link in the Data menu to launch the dialog window.

Select analysis results.
Select columns to include in the summary.
Select the checkbox for statistical graphs.
Press the OK button.

[Figure 4.18: FA summary dialog window (Create summary set). Fields: Choose functional analysis results (4 objects); Choose columns to include in summary (Hits in group, Hits expected value); Create graphs.]

Each row of the output table is a group from the FA analysis. The table contains common columns such as group size and group names, the selected columns grouped by analysis result, and the sim/diff column showing the similarity between groups.

[Figure 4.19: FA summary output table. Columns: Disease name, Biomarker associations, Group size; then Hits in group, Hits expected and p-value for each analysis result (Proteome BKL Disease view on Sample 1, on PXE, and on Human housekeeping genes); sim/diff. First row: D000544]
89. group and leave in the results only overrepresented groups. If checked, overrepresented groups are searched using the FA algorithm from Section 4.6 prior to GSEA. To obtain statistically significant results, the P-value threshold should be set close to zero. In the case of multiple groups testing (the most frequent case), you should use the Control False Discovery Rate option, so that the P-value threshold controls the rate of false positives in the results.

[Figure 4.21: GSEA user interface (Enrichment analysis). Fields: Source gene set (Sample 2, 852 genes); Select column (Expression); Find groups by: Functional categories (Expression BKL manual curation + GO annotation, BKL manual curation + GO annotation, public Organ/Tissue expression (Cytomer), Proteome BKL Disease view, Transcription Factor classification, Transpath molecule classification, Map on canonical pathways) or Gene sets; Advanced options: Weight genes power p (1), Minimal hits to group, Leave only overrepresented groups, P-value threshold, Control False Discovery Rate.]

4.7.2 GSEA example

Here we analyze a sample data set using an Expression column. We choose SwissProt keywords as a functional category. We set Weight genes power to 1 and leave the other parameters at their default values. The enrichment result is pre-filtered by p-value and you can clear the fil
90. [Figure 11.1: Load CEL files dialog. Fields: Destination (CEL files); select zip or tar.gz archive containing CEL files.]

[Figure 11.2: Loaded CEL files with assigned factors/levels. CEL archive: example CEL files. This archive contains 8 file(s): C1.CEL, C2.CEL, C3.CEL, C4.CEL, T1.CEL, T2.CEL, T3.CEL, T4.CEL. Link: Change configuration.]

11.1.1 Factor level assignment

The factor level assignment dialog (Figure 11.3) is designed for experiments having multiple factors and levels. Factors are added by clicking the Add factor button and entering the factor name in the popup window that appears. In order to enter levels for a factor, the factor must be selected in the Factor list box and then the Add levels button must be clicked. When a new factor is created, the popup for levels appears automatically. To assign CEL files to a particular factor level combination, select the factor and mark the cells in the assignment table, where CEL files are displayed as rows and levels are displayed as columns. For your convenience, when the cursor is over a cell, the corresponding file/level pair is highlighted. Note that only one level of a factor can be assigned to a file; this is controlled automatically. Additional factors and their levels can be added in a similar manner. To view the summary table of all assignm
91. [Figure 3.2: Import options dialog window (Load gene set: import options). Message: "File is uploaded successfully. Please take a look on it and adjust import parameters (column names etc.) if necessary." Controls: Load another file, Destination folder (Gene Sets), Finish importing, sheet tabs (Sheet 1 to Sheet 5), Advanced options. File: Geneset Example.xls, Sheet 1. Rows: 623 (header 1, filtered 0, unmatched 4, matched 618). Preview columns: Chromosome, start, stop, RefSeq Acc, Genesymbol, Description, control, each with a type (text, numerical, accession) and a filter setting. Example preview row: 1, 1624180, 1640729, NM_033534, CDC2L2.]

The first line above the data section is not completely discarded but is used to specify default column titles, which can then be cust
92. [Table residue: result row for the Human housekeeping genes set.]

4.4 Functional Analysis with reaction pathways

The representation of molecules from reaction pathways can be used to investigate data sets. ExPlain lets you choose either canonical TRANSPATH pathways or custom uploaded interaction sets.

4.4.1 Functional Analysis with TRANSPATH pathways

The Canonical pathways mapping option in the Analyze menu, or the corresponding toolbar button, can be used to investigate the representation of input molecules in annotated TRANSPATH pathways. As in the Functional classification, the output can be limited to pathways that received a minimal number of hits in the input set (minimal hits to group field) and to results with a minimal statistical significance (P-value threshold field). If there is an active data set in the project tree, it is displayed in the Gene set field. You can choose another data set in this field; in this case the project tree has to be in multi-select mode. Clicking the button starts the process.

[Figure 4.14: Pathways dialog. Fields: Gene set (Sample 2, 731 molecules), P-value threshold, minimal hits to group (2), Cancel.]

4.4.2 Pathways results

An example output is shown in Figure 4.15. Each row in the output table c
93. [Table residue: site search result rows for NF-kappaB related matrices (e.g. V$NFKAPPAB65_01, V$CREL_01, V$RELBP52_01) with site densities in the main and background sets, their ratio and p-values.]

5.2 Detailed graphical report of matrix distribution in promoters

You can inspect the detailed distribution of matrix matches on promoters by selecting one or more matrices in the checkbox column of the output table and pressing the Show site map for selected matrices menu link. Graphs are shown in a new window whose main component is a table containing the graphical representation of matches, one promoter per row. The View yes set / background set switch at the top of the frame allows you to view the promoters of the main or the background set. The Back to summary output link returns to the sites search result page. The report for the top five over-represented PWMs in Section 5.1.3 is taken as an example below.

The Sites density distribution graph (figure below) shows the number of sites lying at a specific sequence position relative to the TSS, divided by the total number of sequences. A smoothed graph is also displayed, which averages sites density in a window of 50
94. [Start page residue. Feature list: ChIP-seq / ChIP-chip data; promoter analysis and identification of transcription factors acting together to affect gene expression; identification of key regulators (potential therapeutic targets); biomarker analysis. Mammalian Module 3.0. Quick Start Guide.]

Wizard mode: Data loading, Workflows, Results

Gene set: Load a gene set (pure list or a table with expression values assigned).

Full expression data with calculated Fold Change: Load a full gene expression data file for one comparative study with pre-computed fold changes and p-values. Sets of up, down and non-changed genes will be created automatically.

Microarray series: Load a multi-column table with pre-computed fold changes or a set of CEL files. Statistical preprocessing of CEL files and calculation of fold changes will be performed automatically.

[Figure 2.2: Data loading page. The screenshot repeats the three data loading options listed above.]
95. [Figure 3.22: Add columns with system annotation dialog (Add annotation column). Dialog text: "ExPlain provides for you some annotation columns containing information about genes, molecules and other biological objects which you may want to attach to your gene set." Fields: Add column(s) to gene set (HUVEC GSE2639 example); Please select columns (e.g. Species, BKL description, BKL disease, BKL gene accession, Cell types, CodeLink, Description, EnsEMBL gene, EnsEMBL transcript, Entrez id).]

Figure 3.23: Gene set with two annotation columns

Gene symbol | Description | EnsEMBL gene | Fold change
A2M | alpha-2-macroglobulin | ENSG00000175899 | 1.02
A4GNT | alpha-1,4-N-acetylglucosaminyltransferase | ENSG00000118017 | 1.05
AADAC | arylacetamide deacetylase (esterase) | ENSG00000114771 | 1.01
AAK1 | AP2 associated kinase 1 | ENSG00000115977 | 1.01
AANAT | arylalkylamine N-acetyltransferase | ENSG00000129673 | 1.03
AARS | alanyl-tRNA synthetase | ENSG00000090861 | 1.01
AARSD1 | alanyl-tRNA synthetase domain containing 1 | ENSG00000108825 | 0.96
AATK | apoptosis-associated tyrosine kinase | ENSG00000181409 | 0.97
ABCA1 | ATP-binding cassette, sub-family A (ABC1), member 1 | ENSG00000165029 | 1.04
ABCA11P | ATP-binding cassette, sub-family A (ABC1), member 11 (pseudogene) | ENSG00000186777 | 1.04

3.5 Exporting gene sets to BKL

You can export gene sets from ExPlain to the BioKnowledge Library (BKL). To do so, choose the Expo
96. S is recalculated. A permutation test performs this procedure 1000 times and builds a histogram of the corresponding enrichment scores (the ES_null distribution histogram). The P-value is evaluated using the positive or negative tail of the distribution, corresponding to the sign of the observed ES.

Multiple S testing: The most common case is a permutation test with adjustment for variation in the size of multiple sets S and FDR control of false positives. For each set S, the ES_null(S) distribution histogram is built as described above. Each ES(S) is adjusted by dividing it by the positive mean, or by the absolute value of the negative mean, of the scores obtained in the permutations, depending on the sign of the observed ES(S). Positive and negative scores enter the positive and negative mean calculations separately. Using the internally built ES_null histograms, p-values are calculated and, by default, adjusted to control the False Discovery Rate (FDR) [23].

If ES is significantly greater than 0, hits of gene set S tend to be located at the top of the sorted list L; if ES is significantly lower than 0, hits of set S tend to be located at the bottom of L.

Chapter 5 Transcription factor site search

The site analysis module allows you to find putative transcription factor binding sites (TFBS) in the promoters of a gene set, in an interval set or in user-loaded sequences. Binding site predictions are obtained by running MATCH, Patch or P-Match, which use collections of known TFBS and positional weight matrices to identify
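The permutation-based P-value and the adjustment of ES(S) described in the GSEA fragment above can be sketched as follows. This is a minimal Python illustration, not ExPlain's implementation: the function names are hypothetical, the null scores are assumed to be already computed from the permutations, and taking the tail fraction within the same-sign part of the null distribution is an assumption about how the tail is evaluated.

```python
def permutation_p_value(observed_es, null_scores):
    """Fraction of same-sign null scores at least as extreme as the observed ES."""
    if observed_es >= 0:
        tail = [s for s in null_scores if s >= 0]
        extreme = sum(1 for s in tail if s >= observed_es)
    else:
        tail = [s for s in null_scores if s < 0]
        extreme = sum(1 for s in tail if s <= observed_es)
    # No same-sign null scores: the tail probability is undefined.
    return extreme / len(tail) if tail else float("nan")


def adjusted_es(observed_es, null_scores):
    """Divide ES by the mean magnitude of the null scores that share its sign."""
    if observed_es >= 0:
        same_sign = [s for s in null_scores if s >= 0]
    else:
        same_sign = [abs(s) for s in null_scores if s < 0]
    if not same_sign or sum(same_sign) == 0:
        raise ValueError("no usable null scores with the sign of the observed ES")
    return observed_es / (sum(same_sign) / len(same_sign))


# Hypothetical null distribution from five permutations (ExPlain uses 1000).
null = [0.2, 0.4, -0.1, -0.3, 0.6]
print(permutation_p_value(0.4, null), adjusted_es(0.4, null))
print(permutation_p_value(-0.3, null), adjusted_es(-0.3, null))
```

Dividing by the same-sign mean makes scores of gene sets of different sizes comparable before the FDR adjustment is applied.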
97. TSSs, which are then used as reference. A window of 1000 nt length is slid along the entire sequence fragment containing all TSSs, and a clustering score is calculated by summing up the contributions from each evidence point within the window, weighted according to their source. A score of 1 is added for each DBTSS or Ensembl entry, whereas EPD entries receive a score of 50, assuming higher reliability due to manual expert annotation. Scores of individual evidence points are multiplied by a distance factor ranging from 1 at the center of the window to 0 at the two outer positions. In between, the distance factor is computed by a cosine function, so that values decrease with growing distance to the window center.

Peaks of the score histogram are regarded as virtual TSSs if the corresponding sequence window contains at least 5% of all evidence points, so that multiple virtual TSSs per gene are accepted. However, for some genes only a handful of evidence points are available, yielding several virtual TSSs with few references each. To accommodate such cases, annotation is restricted to the 5'-most virtual TSS for genes with fewer than 20 evidence points.

Virtual TSSs defined by the method described above form the basis for the fully automatic extraction of TRANSPro promoter sequences. Whenever conflicts or inconsistencies occur, the respective gene is excluded from the TRANSPro database. The components of the virtual TSS definition method of TRANSPro are summa
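The clustering score described above can be sketched for a single window position. This is an illustrative Python sketch, not the TRANSPro implementation: the source weights (1 for DBTSS and Ensembl, 50 for EPD) and the 1000 nt window come from the text, while the exact cosine form of the distance factor, cos(pi/2 * d / half_width), and all function names are assumptions.

```python
import math

# Source weights as stated in the text above.
SOURCE_WEIGHTS = {"DBTSS": 1.0, "Ensembl": 1.0, "EPD": 50.0}
WINDOW_LENGTH = 1000  # nt


def distance_factor(offset, half_width):
    """Cosine taper: 1 at the window center, 0 at the window edges (assumed form)."""
    if abs(offset) >= half_width:
        return 0.0
    return math.cos(math.pi / 2 * abs(offset) / half_width)


def window_score(center, evidence, window_length=WINDOW_LENGTH):
    """Sum the weighted, distance-tapered contributions of all evidence points.

    evidence: list of (position, source) tuples.
    """
    half_width = window_length / 2
    return sum(
        SOURCE_WEIGHTS[source] * distance_factor(position - center, half_width)
        for position, source in evidence
    )


# Hypothetical evidence points for one gene.
points = [(100, "DBTSS"), (100, "EPD"), (350, "Ensembl"), (600, "DBTSS")]
print(window_score(100, points))
```

Sliding the center along the sequence and recording the score at each position yields the score histogram whose peaks are candidate virtual TSSs.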
98. [Table residue from the preceding output figure: one row listing histone genes annotated as nucleosome core component.]

4.3.6 Transcription Factor Classification analysis output

Each row of the table presents a matched class. The columns contain, from left to right: the transcription factor class identifier with a link to its TRANSFAC class browser, Gene symbols, the descriptive names of the factor classes, the number of input genes matching that group, the size of the matched group in TRANSFAC, the randomly expected number of hits, and the P-value of the match result.

Figure 4.11: Output table of Transcription Factor Classification analysis. [Screenshot; columns: Factor class identifier, Gene symbol, Factor class description, Hits in group, Group size, Hits expected, p-value. Example rows include basic domains, leucine zipper factors (bZIP), and helix-loop-helix / leucine zipper factors (bHLH-ZIP).]

4.3.7 TRANSPATH Molecule Classification analysis output

Each row in the table presents a matched class. The columns contain, from left to right: the Molecule class identifier with a link to its TRANSPATH page, Gene symbols (input genes matching that group), Molecule class d
99. ability density of score pairs up to position i 1. [Equation illegible in source.] The cumulative probability that determines the selection of the score threshold t is derived from the joint probability density of scores. We further extend the standard method by applying a dinucleotide background model. By conditioning score functions on the terminal residue j associated with a score pair, the convolution can also accommodate higher-order background models. [Equation illegible in source.]

The accuracy of theoretical P-values using a uniform background model is demonstrated in Figure 5.29. Sites were predicted at all score thresholds in 1000 real promoter sequences, and afterwards the empirical prediction rate corresponding to each score threshold was compared to the theoretical value. In case of perfect correspondence between theoretical and empirical P-values, points fall on the diagonal (the red line in the figure).

Figure 5.29: [Plot; axes: theoretical P-value (log10) and empirical P-value (log10).]

5.4.5 Sites search optimization with F-Match algorithm

The F-Match algorithm compares the number of sites found in a query sequence set against the background set. It is assumed, if a certain TF or factor family, alone or as a part of a cis-regulatory module, plays a significant role in the regulat
100. [Dialog residue: a factor-to-matrix list and cut-off options including custom CSS and MSS values.] After pressing the button you are redirected to the newly created profile, and the interface changes to the profile mode.

8.2.3 Profile modification

For profile modification, choose the profile under study in the tree. Click on the PWM profile link in the Create new data section of the File main menu. Factors (matrices) of the chosen profile are then preselected in the list of the dialog window.

Figure 8.5: Dialog window with the active HUVEC up-regulated profile node. [Screenshot of the Create new profile dialog: profile name, a High specificity matrices only option, the factor list with associated matrices (for example Sox, Sp1, Spi-B and Staf entries), and the profile cut-off options minFN, minSUM, minFP, and custom CSS/MSS.]

In the l
101. [Table residue: rows of an interactions data set pairing molecules with interaction partners, for example TNF-alpha, TNFR1, TRADD, and Cytochrome C.]

Figure 7.16: Join interactions dialog. [Screenshot: destination folder, a list for choosing the interaction profiles to join, and OK and Cancel buttons.]

pathways, such as hormones, enzymes, complexes, and transcription factors, are stored together with information about their interaction. In the pathways panel of ExPlain, the main components of the BKL database are the molecule, individual reactions, and full pathway or reaction chain data. The BKL definitions for these terms are given in the following list.

BKL ENTITIES

Molecule: Molecules interact with each other to build pathways. A molecule in BKL is anything that is subject to reactions. Most molecules have a mass, be it a small molecule like ATP or a protein. No distinction is made between receptors, enzymes, second messengers, transcription factors, or other special kinds of proteins. A molecule can also be a group o
102. als the results to be ready.

Figure 1.12: The process monitor with an active CMA process. [Screenshot: a running CMA process, a ready LoadData process, a link to further processes, and the Advanced button.]

Advanced: You can take direct control over the process management through the Advanced button. In the control window, actions can be chosen for the process selected in the list. The start button starts a waiting process, possibly causing other running processes to be paused. Conversely, any running process can be paused with the pause button. The remove button removes a process from the list. Finally, the two buttons on the right side of the list can be used to alter the order of processes in the queue. When a process replaces the process at the top of the list, the former top process will be paused.

Figure 1.13: The process control window. [Screenshot: the process queue with running, waiting, and ready processes and their parameters.]

1.5 The workspace

The workspace displays data provided as input to any analysis tool as well as the analysis results. The contents of this frame depend on the active node in the project tree. Here we describe the general functionality of the workspace. Some additional options spec
103. an the RP under consideration are regarded as differently expressed is estimated based on the permuted values.

11.5 CRC clustering

The weighted Chinese restaurant clustering algorithm developed by Zhaohui S. Qin [35] enables clustering of genes according to their expression values. This model-based clustering procedure has the advantage that, unlike many other clustering algorithms, the number of clusters is determined during cluster assignment, so that the expected number of clusters does not need to be specified in advance. The algorithm requires as input a gene set with expression data from different experiments (Figure 11.19), e.g. different treatments or a time series study. Expression values of biological replicates should be averaged before the clustering process is started. Average expression values can be calculated using the option Data > Calculate from existing columns in the ExPlain top menu.

Figure 11.19: Example of a data set that can be analyzed by CRC clustering. [Screenshot: human gene rows such as PLAC8, SAP30, RANBP9, FIG4, and PRKCZ with one numeric expression column per experiment.]
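The replicate-averaging step that precedes CRC clustering is done in ExPlain through the Data menu. For readers preparing data outside ExPlain, the same operation can be sketched as below; the column names and the `groups` mapping are hypothetical and serve only to illustrate the calculation.

```python
from statistics import mean

def average_replicates(row, groups):
    """Average replicate columns of one gene row.

    row    -- dict mapping column name to expression value
    groups -- dict mapping condition name to its replicate column names
    """
    return {
        condition: mean(row[column] for column in columns)
        for condition, columns in groups.items()
    }

# Hypothetical layout: two time points with two biological replicates each.
groups = {"t0": ["t0_rep1", "t0_rep2"], "t1": ["t1_rep1", "t1_rep2"]}
row = {"t0_rep1": 1.0, "t0_rep2": 3.0, "t1_rep1": 2.0, "t1_rep2": 4.0}
averaged = average_replicates(row, groups)
```

The result holds one value per condition, which is the form of input the clustering step expects.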
104. an select in the checkbox column all matrices necessary for the analysis. If nothing is selected, all available matrices will be included in the analysis by default. Clicking the menu link Profile > Create profile from selection starts the process of profile creation. The new profile is added to the Profiles folder of the project tree, and you will be automatically redirected to the profile entry. The name of the newly created profile can be customized in the User created profile field editor.

Figure 8.9: Creation of profile from selection, output. [Screenshot of the Weight matrices profile view: rename field, cut-offs set to minSUM, and a table of 4 rows listing matrix names, recognized factors with their species, and cut-off scores; example matrices include V$AHRARNT_02, V$DR3_Q4, and V$OCT1_02.]
105. analysis.

Figure 5.10: P-Match dialog window. [Screenshot of the Run Patch dialog: Yes set, No set (background), promoter window, promoter choice for genes with multiple promoters (Best supported), site set (Sites in vertebrata genes), minimum length of sites, mismatch penalty, lower score boundary, and maximum number of mismatches.]

In the output table, Yes refers to the main set and No refers to the background set. Each row contains information about the performance of one site. Site names link to the corresponding site entries in BKL. Binding factors for the sites are given in the Binding factors column. The average number of matched sites per 1000 bp is given for the main and the background sets in the Yes and No columns, respectively. Additionally, the Yes and No values are visualized in the Graphs column, where the red bar depicts the abundance of the PWM motif in promoters of the query set and the blue bar displays its abundance in the control set. The ratio of the two values is provided in the Yes/No column, where a number greater than one indicates overrepresentation of the motif in the analysed set. Significance of the representation value is measured by the P-value derived from a binomial distribution. Matched promoters p-value assess the
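The manual states that the significance of the Yes/No ratio is derived from a binomial distribution but does not give the exact formulation. One plausible reading, shown here purely as a sketch under that assumption, treats every matched site as a trial that falls into the Yes set with a probability equal to the Yes set's share of the total sequence length, and reports the upper tail. The function names and this null model are illustrative, not ExPlain's documented method.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def overrepresentation_p(yes_sites, no_sites, yes_bp, no_bp):
    """Upper-tail P-value for the number of sites in the Yes set.

    Assumed null model: each site lands in the Yes set with probability
    yes_bp / (yes_bp + no_bp), i.e. proportional to sequence length.
    """
    n = yes_sites + no_sites
    p = yes_bp / (yes_bp + no_bp)
    return binomial_upper_tail(yes_sites, n, p)
```

Under this model, equal site counts in equally long sets give a P-value above 0.5, while a threefold excess in the Yes set gives a small one.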
106. and Custom. Choosing any of the first three menu links will start the process of applying the current choice of score thresholds to the PWMs of the profile without creating a new one. As a result, you will see the same profile with altered CSS and MSS columns. After choosing the Custom option, a dialog window will appear where you can enter CSS and MSS thresholds of your choice. Furthermore, it is possible to set the cut-offs of the current profile according to the cut-offs of the matrices from another profile by selecting the use the following profile as template option and choosing a profile from the corresponding drop-down list. You can also set the MSS according to a specific p-value by selecting the p-value based MSS option of the Change profile cut-offs dialog and choosing an appropriate p-value and species from the respective lists. If the p-value based MSS option is used, the CSS will be set to zero. Please also note that, for p-value based cut-offs, matrices for which the minimal prediction rate could not be achieved will be removed from the profile, as will user matrices. You will be redirected to the profile with the new customized cut-offs.

8.5.5 Merge several profiles into one profile

Clicking on the menu link Join profiles > Merge several profiles into one will open a dialog window in which the profiles to be merged can be selected. After deciding if the highest, lowest, or average cutoffs of
107. ar will appear below the column titles in the table.

Figure 1.19: [Screenshot of a gene set table with the filter bar shown below the column titles.]

When the filter is active, the filter bar is switched on by default. To hide it, press the filter bar link again. To create a filter, click on the drop-down menu and specify the filtering condition. For numerical data, you can specify a lower or upper limit, or an inner or outer range, for the column. If you need to filter the column by an exact value, enter this value as the boundary of the in-range condition. For text data, you can specify a substring to be found in the corresponding cells. Click the OK button or press Enter, and the filter will be applied. You will now see the filtered table. The filter row shows the active filter and the number of rows that meet the filter conditions. If

Figure 1.20: [Screenshot of the filter drop-down menu for the BKL description text column, with the options none, contains, begins with, and ends with.]

Figure 1.21: [Screenshot of a filtered table.]
108. as to contain the core sequence of a matrix, i.e. the core sequence has to match with a score higher than or equal to the core similarity cut-off. In addition, only the matches that score higher than or equal to the matrix similarity threshold appear in the output. For the minFP, minFN, and minSUM cut-offs, first the core similarity score is calculated, and then the matrix similarity score is calculated for the selected positions according to the following equation:

$$\mathrm{mSS} = \frac{\mathrm{Current} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}, \qquad \mathrm{Current} = \sum_{i=1}^{L} I(i)\, f_{i,b_i}, \qquad \mathrm{Min} = \sum_{i=1}^{L} I(i)\, f_i^{\min}, \qquad \mathrm{Max} = \sum_{i=1}^{L} I(i)\, f_i^{\max}$$

where $f_{i,b_i}$ is the frequency of nucleotide $b$ at position $i$ of the matrix with width $L$, $f_i^{\min}$ the frequency of the rarest nucleotide at position $i$, and $f_i^{\max}$ the frequency of the most frequent nucleotide at position $i$. The information vector $I(i)$ describes the conservation of the nucleotides at position $i$ of the matrix:

$$I(i) = \sum_{B \in \{A, C, G, T\}} f_{i,B} \ln\left(4 f_{i,B}\right)$$

Cut-off to minimize false negative matches (minFN). The false negative rate that Match™ obtains with a matrix was measured on known genomic binding sites for the transcription factors associated with that matrix, as far as such sites are available. In case a sufficient number of genomic binding sites was not available (fewer than 10), SELEX sites or sets of generated oligonucleotides were used for estimating the cut-offs to minimize the false negative rate, using actual weight matrices to calculate the probability of a nucleotide occurring at a certain position of a
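The matrix similarity score defined in excerpt 108 can be illustrated in code. The sketch below assumes the information-weighted form known from the published description of the Match algorithm, in which the information vector is I(i) = sum over B of f(i,B) ln(4 f(i,B)) and the score is rescaled between the worst and the best possible site. The data layout (one dict of nucleotide frequencies per matrix position) and the function names are assumptions for illustration.

```python
import math

def information_vector(freqs):
    """I(i) for each position; zero-frequency nucleotides contribute nothing."""
    return [
        sum(f * math.log(4 * f) for f in column.values() if f > 0)
        for column in freqs
    ]

def matrix_similarity(freqs, site):
    """Score a site against a frequency matrix, scaled to the range 0..1.

    freqs -- list of dicts {"A": f, "C": f, "G": f, "T": f}, one per position
    site  -- string of the same length as freqs
    """
    info = information_vector(freqs)
    current = sum(i * col[base] for i, col, base in zip(info, freqs, site))
    lowest = sum(i * min(col.values()) for i, col in zip(info, freqs))
    highest = sum(i * max(col.values()) for i, col in zip(info, freqs))
    if highest == lowest:  # completely uninformative matrix
        return 0.0
    return (current - lowest) / (highest - lowest)

# Hypothetical three-position matrix; the last position is uninformative.
example = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
```

A uniform position has an information value of zero, so it does not influence the score, whereas strongly conserved positions dominate it.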
109. ase search result please consult the documentation for TRANSFAC and MATCH.

3.2 Representation of the gene set data

The data table is shown in the workspace by clicking its node link in the project tree. If more than 25% of the loaded gene identifiers were not matched by ExPlain, a warning message will appear at the top of the gene set page, stating how many of your identifiers could not be matched by the system. In the gene set table, next to the checkbox column, ExPlain adds several columns which were not present in the original data table. These are system annotation columns, which are taken from the ExPlain database and provide additional information on your original set. Some of them are hidden by default but can be made visible via the show columns control or the Manage columns in data set dialog (see Section 1.5.5). Entity matching information is given in the View menu, showing the numbers of genes, matrices, promoters, and molecules.

Figure 3.6: Entity information in View menu. [Screenshot: View mode entries Genes 496, Matrices 42, Promoters 483, Molecules 210.]

Please note that the numbers of genes, matrices, promoters, and molecules are not identical. In the example (Figure 3.6), for 496 genes there are 42 matrices that are able to bind transcription factors from the list of 496 proteins, 483 promoters (since one gene might have more than one promoter), and 210 molecules out of 456 that are known to play
110. Chapter 3 Gene sets

This chapter explains how to load gene set data into ExPlain and describes ways to manipulate data sets, as well as details of data representation.

3.1 Loading data into the ExPlain system

The first step in analysing any data set with ExPlain is to upload it to the ExPlain system. There are several ways to create a data set in ExPlain: you can load data from a file, create a data set by pasting a list of accession numbers into a text dialog window, create a subset from the results of another analysis previously run in ExPlain, or import search results from BKL.

3.1.1 Load gene set dialog

The process of data loading is guided step by step, ensuring that data are represented in an easy-to-understand relational schema. The first window of the Load gene set dialog is shown in Figure 3.1. To load a data set from a file, specify the file name. You will then automatically be directed to the detailed view of the data, where you can check and change the parameters recognized by the system.

Figure 3.1: Load gene set dialog. [Screenshot: prompt to specify the file containing your microarray data or gene list. Supported formats are tab-separated text files, Affy matrix CHP data files (chp), Microsoft Excel worksheets, and archives (zip, tar.gz) containing any of these files.]
111. [Table residue: rows of an annotated interval set listing gene symbol, description, identifier, position on chromosome 22, and species, for example ASCC2, COMT, GCAT, and GTPBP1.]

9.2 Recombining intervals

Once the set of intervals is uploaded, it can be reorganized in the following ways: intervals can be filtered by signal strength, neighboring intervals can be joined, or small intervals can be omitted.

9.2.1 Filtering intervals by conditions

Interval set operations are accessible by selecting an interval set node from the project tree and navigating to the Filter interval set item of the Intervals menu. Figure 9.6 shows the dialog window where you can specify signal conditions and interval parameters. The dialog includes two identical condition fields. The second condition is collapsed by default and is not taken into account. When both conditions are expanded, their parameters can be connected by "and" or "or" rules. You can use up to two columns to specify the intervals you wish to extract. Subsequent lists are used to set a requirement for each marked column. In our example we seek all
112. atures. Alternatively, the report is generated with the affyQCReport package from Craig Parman and Conrad Halling [29] and can also be downloaded by clicking on the Download button. The following description of the content of this report is based on the affyQCReport manual provided by the package authors.

Figure 11.6: Quality control dialog window. [Screenshot of the Quality Control Rules and Visuals dialog: selection of the CEL file data set; Rules: Background Stat, Scaling Factor, Percent Present, Control Ratios; Visuals: Boxplot, Histogram, RNA Degradation, Affy Quality Control PDF Report.]

Figure 11.7: Quality control output. [Screenshot: one row per CEL file (e1.CEL to e6.CEL) with a check box and PASSED or FAIL results in the columns Background Stat, Scaling Factor, Percent Present, and Control Ratios, followed by the Quality Report PDF section and plot links for Boxplot, Histogram, and RNA Degradation.]

Page 1: Overview of all analyzed arrays. A table giving the internal numbering used in the plots of the quality control report and the corresponding array names is displayed.

Page 2: The plot at the top of the page is a boxplot of all PM intensities, which enables an analysis of the overall probe intensities of the arrays (see also Boxplot
113. [Table residue from the filtered-table screenshot: gene rows such as AARS and FARSA with their BKL descriptions.]

filtering was performed on a text column, then the filtered substring is highlighted in each cell. You can add more conditions constraining the content of other columns. To clear the filter, click the clear link in the Filter row. To clear only one condition, click the corresponding filter drop-down menu, change the condition to none, and press OK. Filtering conditions persist when you select other tree items and after logout. When a filter is active, the Export options will export only the rows remaining after filtering, not the whole table. You may also want to mark all rows and use the Get selected items option in the menu as an alternative way to filter your data, apart from Filter gene set by condition. Note that the filter bar works for any data set that is represented as a table, as opposed to Filter gene set by condition, which wor
114. binding site. For each matrix, we applied the Match™ algorithm to these test sequence sets without using any matrix similarity cut-offs. Then we set the cut-off to a value that provides recognition of at least 90% of the oligonucleotides, i.e. we decided to tolerate an error rate of ten percent. We call this set of cut-offs minFN cut-offs. When applying the minFN cut-offs, the user will find most genomic binding sites, but a high rate of false positives should be expected as well. The minFN cut-offs are useful for the detailed analysis of relatively short DNA fragments.

Cut-off to minimize false positive matches (minFP). In order to estimate this cut-off, which reduces the number of random sites found by Match™, we applied the Match™ algorithm to promoter sequences from TRANSPro™ 2.1 (in the Matrix Generation tool, exon sequences are used). The score that gives 1% of hits in these sequences, relative to the number of hits received when using the minFN score calculated above, is defined as minFP. When a minFP cut-off is applied for searching a DNA sequence, the algorithm will find a relatively low number of matches per nucleotide. In the output, the user will only find putative sites with a good similarity to the weight matrix; however, some known genomic binding sites may not be recognized. This kind of cut-off is useful, for example, for searching the
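The two cut-off definitions in excerpt 114 can be made concrete with a small sketch. It assumes that a list of matrix similarity scores is already available for the known sites and for the background sequences; the function names, the rounding choices, and the handling of empty inputs are assumptions and are not taken from the manual.

```python
import math

def min_fn_cutoff(site_scores, recognized=0.90):
    """Highest cut-off that still recognizes at least 90% of the known sites."""
    ranked = sorted(site_scores, reverse=True)
    needed = math.ceil(recognized * len(ranked))
    return ranked[needed - 1]

def min_fp_cutoff(background_scores, min_fn, fraction=0.01):
    """Cut-off leaving about 1% of the background hits found at minFN."""
    hits = sorted((s for s in background_scores if s >= min_fn), reverse=True)
    if not hits:
        return min_fn  # no background hits at all; nothing to reduce
    allowed = max(1, math.floor(fraction * len(hits)))
    return hits[allowed - 1]
```

Used together, the first function is applied to scores of known binding sites and its result is passed to the second, which is applied to scores from promoter (or exon) sequences.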
115. [Table of contents fragment; page numbers and several entries are illegible in the source.]
[illegible entries]
Workflow mode
Loading data into the ExPlain system
Load gene set dialog
Supported species
Supported file formats
Import options of gene set loading dialog
New gene set dialog
[illegible entry]
Representation of the gene set data
Recombining gene sets
Filter gene set by condition
Filter gene set by other gene sets
[illegible entry]
[illegible entry]
Extract Up/Down/Non-change
Extracting random subsets
Adding columns to the gene set
Calculation from an existing numerical column
Linking a column from another data set
Adding a column with system annotations
[illegible entry]
116. [Screenshot residue: the Enrichment analysis dialog annotated with its control types: a tree node selector (Source gene set), a multi-select list (Find groups by, with categories such as Expression (BKL manual curation), Public GO annotation, Transcription Factor classification, TRANSPATH molecule classification, and Map on canonical pathways), an edit drop-down box, Minimal hits to group, Leave only overrepresented groups, P-value threshold, and Control False Discovery Rate.]

out of several options. Edit boxes are used to enter numerical or text values. Edit drop-down boxes provide you with a list of suitable values, but you can also enter your own value manually. The drop-down list can be used to select one of the list items. One specific case of the drop-down list is the tree node selector. By default, this control shows only the part of the tree that contains elements of a specific type. Nodes currently selected are shown in black; nodes that cannot be selected are shown in grey. This control can be in single-select or multi-select mode (see Figure 1.8). When both modes are available, you can switch from one to the other by using the Multi-select mode or Single-select mode links. In multiselect mo
117. case to reveal binding sites that are present in homologous promoters of all organisms. To run the analysis, press the Phylogenetic filtering menu link in the main menu Analyze. As you see below, the dialog options are almost the same as in Section 5.1.2. You should set up a query gene set, a control set, the promoter window, and the promoter type.

Figure 5.8: Site search with footprint dialog window. [Screenshot of the Phylogenetic filtering dialog: Yes set, No set (background), Profile (vertebrate all, minSUM), Use high specific matrices with cut-offs from profile, Promoter window, and promoter choice for genes with multiple promoters (Best supported).]

For each promoter of the main gene set, the algorithm keeps only those TF binding sites that were also found in homologous promoters of other species. First, the MATCH algorithm searches for TF binding sites in all sets of promoters. Then the search results are processed to get the intersection in terms of matrices. Thus there can be fewer matches for matrices in the output table in comparison with the regular sites search analysis result (see Figure 5.2).

Figure 5.9: Output table of footprint site search. [Screenshot; columns: Matrix name, Yes (sites/1000bp), No (sites/1000bp), Yes/No, Matched promoters, p-value.]
118. ces. Higher values correspond to loose filtering; in this case you will get more entries in the result. The Optimize window position option turns on an algorithm that finds the window maximizing the difference between the query and control sets. The initial window size is set to 300 bp and is increased by 100 bp on each iteration. The window slides from the downstream to the upstream position with a step of 100 bp. The p-value is calculated for all available combinations of window size and position, and the best one is taken as the result.

Figure 5.5: Sites search dialog window with optimization turned on. [Screenshot of the Run Match dialog: Yes set, No set (background), Profile (vertebrate_all, minSUM), Use high specific matrices with cut-offs from profile, Promoter window, promoter choice for genes with multiple promoters (Best supported), and the Optimize cut-off and optimize window position options with a p-value threshold.]

If the Optimize window position option was checked, the output table contains the optimized upstream and downstream positions in the From and To columns, respectively.

Figure 5.6: Optimized site search result table. [Screenshot.]
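The window optimization described in excerpt 118 amounts to an exhaustive search over window sizes and positions. The sketch below enumerates the candidate windows using the stated initial size (300 bp) and step (100 bp); the function names, the coordinate convention, and the `p_value_of` callback are hypothetical, since the manual does not specify how the p-value for each window is obtained internally.

```python
def candidate_windows(upstream, downstream, initial=300, step=100):
    """Yield (start, end) windows inside [upstream, downstream].

    Sizes grow from `initial` in increments of `step`; for each size the
    window slides from the downstream end towards the upstream end.
    """
    size = initial
    while size <= downstream - upstream:
        end = downstream
        while end - size >= upstream:
            yield (end - size, end)
            end -= step
        size += step

def best_window(upstream, downstream, p_value_of):
    """Return the window with the smallest p-value.

    p_value_of is a caller-supplied function mapping (start, end) to a p-value.
    """
    return min(candidate_windows(upstream, downstream), key=p_value_of)
```

For a promoter window from -500 to 100, this enumerates ten candidates, from (-200, 100) up to the full window (-500, 100).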
119. [Table residue from a functional classification output: gene rows (for example CD37) with BKL descriptions and matched GO identifiers.]

4.3.1 Expression (BKL manual curation)

Each row of the table presents a matched organ or tissue. The columns contain, from left to right: the name of the organ or tissue linked to its Cytomer page, Gene symbols, Location, Tumors, Cell types or Organs/tissues/fluids, the number of input genes matching that group, the size of the matched group in Cytomer, the randomly expected number of hits, and the P va
120. cleic Acids Res. 32, D78-D81, 2004. PubMed: <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=14681363>. DBTSS: <http://dbtss.hgc.jp>

The Gene Set Enrichment Analysis: Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander, and Jill P. Mesirov. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS 102(43), 15545-15550, 2005. PNAS: <http://www.pnas.org/cgi/content/full/102/43/15545>

The Kolmogorov-Smirnov-like statistic: Hollander, M. and Wolfe, D. A. (1999). Nonparametric Statistical Methods. Wiley, New York.

LibGD library: <http://www.libgd.org/Main_Page>

affyQCReport: Craig Parman and Conrad Halling (2005). affyQCReport: A Package to Generate QC Reports for Affymetrix Array Data. <http://www.bioconductor.org/packages/bioc/html/affyQCReport.html>

QC and Affymetrix data: Wilson, C., Pepper, S. D., and Miller, C. J. QC and Affymetrix data. Paterson Institute for Cancer Research, Christie Hospital NHS Trust, Manchester, UK.

Simpleaffy: Crispin J. Miller (2005). simpleaffy. <http://bioinformatics.picr.man.ac.uk/simpleaffy/index.jsp>

RankProd: Hong, F. and Breitling, R. (2008). A comparison of meta-analysis me
121. cores from the promoter model fit to the input sets (main and background set). In our example, the two sets were selected separately, and CMA assigned the expression value of 1 to the main set and -1 to the background. The separation between the positive and negative set is indicated by the horizontal line in the plot. In our example, the separation line coincides with the horizontal axis at the zero expression level. The Yes/No distribution plots the observed frequencies of model scores for the main (red) and background (blue) sets. This presentation reflects how well the model discriminates between promoter sequences of the two sets.

Figure 6.6: Top section of a promoter model report. [Screenshot: the promoter model in simple display, listing its matrices with their cut-off (C) and count (N) values, the model P-value, the false positive and false negative rates, the overall cut-off, and the goal function calculation table with Value, Weight, and Weighted value rows.]

Figure 6.7: Expression score distribution. [Plot; vertical axis: Expression.]

Figure 6.8: Yes/No distribution. [Plot.]
122. ctors should not differ by more than 3-fold.
Percent present calls: Arrays with extremely low (below about 30%) or high (above about 60%) values for the percentage of probes called present have potential quality problems.
3'/5' ratios: Affymetrix includes probes at the 3' and 5' ends of some control genes; the ratios should be less than 3.

The quality control tests are launched from the output of the low-level analysis through the menu option CEL > Quality Control. To run the quality control tests, you must specify a data set, rules and visuals, and then click on the button (Figure 11.6). With Visuals you can see the overall distribution of the feature intensities and the RNA degradation for your data. At the end of the quality control test it is possible to generate a quality control report in PDF format, which shows graphical representations of several quality measures. The output of the quality control tests is a table containing all CEL files as rows and the results of each quality control step as columns. To remove failed CEL files from further analysis, deselect the check boxes of the respective CEL files and click on the Submit button (Figure 11.7). The result of the quality control filters is a table similar to the Low level analysis output, but without the deselected microarrays. Clicking the View button in the Quality Report section below the result table opens a report in PDF format that displays several quality control fe
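The three rules listed at the start of this chunk can be sketched as a small check. This is only an illustration, not ExPlain's implementation: the function name `qc_report`, the input layout (one value per CEL file) and the reading of the truncated first rule as a rule about scaling factors are assumptions.

```python
def qc_report(scale_factors, percent_present, ratio_3_5):
    """Apply the three rules to per-array values keyed by CEL file name.

    All names here are hypothetical; thresholds are the ones quoted in the text.
    """
    # Rule 1 (assumed to refer to scaling factors): max/min ratio within 3-fold.
    spread = max(scale_factors.values()) / min(scale_factors.values())
    report = {"scale_factor_spread_ok": spread <= 3.0, "arrays": {}}
    for name in scale_factors:
        report["arrays"][name] = {
            # Rule 2: percent present calls between about 30% and about 60%.
            "percent_present_ok": 30.0 <= percent_present[name] <= 60.0,
            # Rule 3: 3'/5' ratio of control genes below 3.
            "ratio_3_5_ok": ratio_3_5[name] < 3.0,
        }
    return report


report = qc_report(
    {"a.CEL": 1.0, "b.CEL": 3.5},
    {"a.CEL": 45.0, "b.CEL": 25.0},
    {"a.CEL": 1.2, "b.CEL": 3.4},
)
print(report)
```

Arrays that fail a rule would then be the candidates to deselect before pressing Submit.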
123. cture and Growth Stages hierarchy.
Trait Ontology (rice only): Genes are matched to terms of the Trait Ontology hierarchy.
EO Environment Ontology (rice only): Genes are matched to terms of the Environment Ontology.
Whole subsets from the tree: Genes are matched to other gene sets in your project tree.

4.1.3 Functional analysis dialog window
The Functional analysis dialog window is shown in Figure 4.1. It contains a list to select the grouping category and fields to specify the P-value and minimal hits thresholds. The Control False Discovery Rate checkbox adds an additional verification step, based on a Bonferroni correction, to the p-value calculation. Several categories can be selected simultaneously; an individual process is then queued for each category, and the processes are launched one after another. When the algorithm is launched from the active data node, this node is selected in the Gene set field automatically. You can change it, or add more gene sets to the selected one by switching the project tree in the dialog to the multi-selection mode.

4.2 Example analysis with BKL curated GO annotation
This example demonstrates how to use the Functional Analysis classification by analyzing a sample data set of human genes. We will use it to compile a subset of protein synthesis genes according to GO annotation curated by BKL. The first step consists of selecting the
124. cture will change. Press this button again to return to the original window behavior.

In some dialogs, additional options can be hidden in expanding blocks. To reveal these options, click the link in brackets (Advanced options). The sign of the link changes for expanded options. To collapse the options again, click the link once more. The figure below shows an example of expanding blocks in a filtering dialog: the second filtering condition is expanded and the third one is collapsed.

Figure 1.9: Expanding blocks (filtering dialog with three conditions).

In some cases, when one control is changed, other controls may become grey-striped and the button is disabled (see the figure below). After ExPlain has loaded the appropriate values, these controls become available again.

Figure 1.10

1.3 Layouts
The ExPlain user interface provides several layouts to make work more convenient. You can switch between layouts using the Layout submenu in the View menu. The Classic layout is set by default. The Compact layout does not have the toolbar; it provides a little more vertical screen space for the data frame and may be useful if you want to view more table rows at once. Flip panels and Compact flip layo
125. d in Chapter 5, Transcription factor site search analysis.

8.1 Loading profiles
You can add a profile to ExPlain by:
1. Selecting one or several of the user-defined profiles from your TRANSFAC installation in the data import dialog described in Section 3.1.4.
2. Uploading a local file. When you choose the PWM profile option from the Load data from file section of the File menu, you can load profiles from local source files created from TRANSFAC or from other sources. For a description of how to create profiles in TRANSFAC Professional, e.g. on the basis of a database search result, please consult the documentation for TRANSFAC and MATCH.

Figure 8.1: Load profiles dialog (Destination field and the profile file to load, PRF or archive file).

Click on the button to start the loading procedure. Imported profiles are added to the Profiles folder on the project tree by default. You can select a different folder in the Destination field.

8.2 Creating a new profile
8.2.1 New profile dialog
The Create new profile dialog, which can be invoked by clicking on PWM profile in the Create new data section of the File menu, provides the functionality to create and modify profiles. Figure 8.2 shows the default mode dialog window, which appears when the PWM profile menu link is used and the active tree node is not a profile or gene set. In
126. d nucleic acids
KCND1 | Potassium voltage-gated channel, Shal-related subfamily member 1, a transmembrane transporter that is involved in potassium ion transport and response to peptide hormone stimulus; may act in cell volume homeostasis and regulation of heart contraction | 3.78529 | 1 | 2.26904e-08 | 1
PAGE2 | Protein of unknown function, has strong similarity to uncharacterized human PAGE2B | 3.73435 | 1 | 4.66172e-07 | 1
LRRC14 | Protein containing three leucine-rich repeats, which mediate protein-protein interactions; has low similarity to preferentially expressed antigen of melanoma (human PRAME), which is a tumor antigen overexpressed in leukemias | 3.72687 | 1 | 1.78645e-08 | 1

Figure 11.18: Rank Product analysis result

Symbol | Description | Fold Change | RankProduct | p-value | (no header)
A1BG | alpha-1-B glycoprotein | 0.835378 | 7317.66 | 0.6586 | 1.156
A1CF | APOBEC1 complementation factor | 0.958689 | 6569.4 | 0.5347 | 1.1283
A2BP1 | ataxin 2-binding protein 1 | 1.04371 | 6554.83 | 0.5321 | 1.1265
A2M | alpha-2-macroglobulin | 0.580866 | 10622 | 0.967 | 1.0575
A2ML1 | alpha-2-macroglobulin-like 1 | 0.885263 | 9403.71 | 0.9027 | 1.103
A4GALT | alpha 1,4-galactosyltransferase | 0.841252 | 5055.62 | 0.2741 | 0.952
A4GNT | alpha-1,4-N-acetylglucosaminyltransferase | 0.936106 | 6299.55 | 0.4878 | 1.1107
AAAS | achalasia, adrenocortical insufficiency, alacrimia (Allgrove, triple-A) | 2.46028 | 1444.53 | 0.0015 | 0.09
AACS | acetoacetyl-CoA synthetase | 1.05135 | 7420.5 | 0.6745 | 1.1602
AADAC | arylacetamide deacetylase (esterase) | 0.648968 | 10739.5 | 0.9707 | 1.0535
127. d treatment2 as a control. Create a new factor if you want to make several comparisons.

Figure 2.8: Factor level assignment in workflow (table assigning each CEL file column to the Experiment or Control level).
Figure 2.9: Statistical analyses parameters window (normalization method, fold change test, and the Log Fold change and P-value thresholds for extracting up-regulated, down-regulated and non-changed genes).

default parameters are selected. You can change the minimal number of genes to consider for the functional classification and the size of the promoter sequence to analyse. After pressing the button you will be redirected to the report page, which at the beginning will

Figure 2.10: Full upstream analysis link.
128. dds a new node to the project tree under the active data set and displays the status of the process in the process monitor. Results can be inspected as soon as the analysis is done, either by clicking on the link in the process monitor or in the project tree. Each row of the output table gives one matched term of the selected Function category. For GO ontologies the table presents, from left to right: the GO identifier, the gene symbols, the GO term description, the ontology (in general Biological Process, Molecular Function or Cellular Component), the number of hits from the input set in the ontology group, the size of the group in the database, the number of randomly expected hits, and the P-value of the observation. In our case the cell death term is the most significant one (P = 7.94531e-10).

Figure 4.3: Functional Analysis results with GO Biological Process groups.
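To make the relation between the columns hits in group, group size, expected hits and P-value more concrete, here is a minimal sketch. It assumes a hypergeometric model for drawing the input set from the database; the manual does not state which test ExPlain uses, and the function names and numbers below are hypothetical.

```python
from math import comb


def expected_hits(input_size, group_size, database_size):
    """Number of hits expected by chance when drawing input_size genes."""
    return input_size * group_size / database_size


def enrichment_p_value(hits, input_size, group_size, database_size):
    """P(X >= hits) under a hypergeometric model (an assumption, see lead-in)."""
    total = comb(database_size, input_size)
    upper = min(input_size, group_size)
    tail = sum(
        comb(group_size, x) * comb(database_size - group_size, input_size - x)
        for x in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes in the database, a group of 4, an input set of 3 with 2 hits.
print(expected_hits(3, 4, 10))
print(enrichment_p_value(2, 3, 4, 10))
```

A group is reported as enriched when the observed hits clearly exceed the expected hits, which is what a small P-value expresses.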
129. de you can check several nodes to use them in some action or to start several processes, depending on the dialog. You can use the Shift key to select or deselect a range of nodes: if the first clicked node of the range is selected, then all other nodes in the range will be selected too; the same applies for deselection.

Figure 1.8: Tree node selection control in single-select and multi-select modes.

The multiselect list lets you select one or several items. Click on a line with the Ctrl key pressed to select several items, click with the Shift key pressed to select or deselect a range of values, and single-click to select a single line and deselect all the others. The Help button leads you to the documentation chapter describing the data analysis or dialog window that was open at the moment you pressed the button. The dialog window will be closed when you press the button. To leave the window open, press the pin button the pi
130. desired data set from the project tree and navigating to the Functional classification option in the drop-down menu of the Analyze button in the main menu. Alternatively, the Classification option can be chosen first; the desired data set should then be selected in the Gene set drop-down menu in the FA dialog window. The desired category is selected from the category list. In our example we set the P-value threshold to 0.01 and the minimal number of hits to 20.

TIP: The hit number threshold should be considered on a case-by-case basis. While larger hit numbers typically also correspond to more significant P-values, groups which are populated with fewer genes than required by the threshold are missed. Thus a high value is most useful for finding groups clearly enriched in the input list, while a broad collection of all groups of interest should be sought with a smaller threshold value.

Figure 4.2: Example of the Functional classification analysis (gene set Sample 1 with 224 genes; grouping by GO annotation, BKL manual curation; P-value threshold 0.01; minimal hits to group 20).

After clicking the button ExPlain a
131. ding PWMs from a set of promoters; thus, if you choose a precomputed MATCH output in the Use Match output field, the respective promoter set will be chosen automatically. Note that you can only select the output of an analysis run against a background set. The Use preset analysis field provides the list of previously saved model constraints and GA parameters. Later in this chapter we will discuss how to save your own preset parameters.

You can further adjust the number of iterations or the running time, the NC limit, and the population size for the GA. The NC (No Change) limit specifies the number of iterations after which the program is stopped if the best fitness has not improved by more than 0.0001 during that interval. For instance, if CMA was set to run for 20 minutes with an NC limit of 200, the program can stop before reaching the time limit if no improvement was achieved over the last 200 iterations. Finally, the population size is the number of individual model solutions the algorithm can take into account during each iteration.

TIP: The population size and the running length parameters have a great effect on the model quality you can obtain. You should set both to high values if, for instance, you have a large number of target promoters or consider complex assemblies of single PWMs, PWM pairs and groups. In general, when the complexity of the optimization tas
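The NC limit rule described in this chunk can be illustrated with a small stopping criterion. This is a sketch under assumptions, not the CMA code: the function name is hypothetical, the genetic algorithm itself is replaced by a plain list of fitness values, and the time limit is left out.

```python
def run_with_nc_limit(fitness_per_iteration, nc_limit, min_improvement=1e-4):
    """Consume fitness values until the best one stalls for nc_limit iterations.

    Returns the best fitness seen and the number of iterations that were run.
    """
    history = []  # best fitness known after each iteration
    best = float("-inf")
    for fitness in fitness_per_iteration:
        best = max(best, fitness)
        history.append(best)
        t = len(history) - 1
        # Stop if the best fitness gained at most min_improvement
        # over the last nc_limit iterations.
        if t >= nc_limit and history[t] - history[t - nc_limit] <= min_improvement:
            break
    return best, len(history)


# The run stalls at 0.2, so it stops before the late jump to 0.5 is reached.
print(run_with_nc_limit([0.1, 0.2, 0.2, 0.2, 0.2, 0.5], nc_limit=3))
```

The example also shows the trade-off hinted at in the text: a small NC limit saves time but can end the run before a later improvement is found.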
132. ding to a 1st order Markov model. Let $Q = (q_{i,k})$ be the $W \times 4$ score matrix with $i = 1, \dots, W$ vectors (also called site positions), each with $k = 1, \dots, 4$ scores for the residues $R = \{r_1, \dots, r_4\} = \{A, C, G, T\}$. This matrix is used to score an alignment of the TFBS profile with a sequence segment of length $W$. The score function is typically additive, so that its value is calculated by summing the site position scores of the sequence residues. Further, let there be a mononucleotide background model of residue frequencies. Usually one considers both sequence orientations or, equivalently, both orientations of the PWM. A method for P-value calculation should take this into account and determine the probability that either score (forward or reverse orientation) is above a threshold. The reverse PWM $\bar{Q} = (\bar{q}_{i,k})$ is defined by

$$\bar{q}_{i,k} = q_{W+1-i,\,5-k}, \qquad i = 1, \dots, W, \quad k = 1, \dots, 4.$$

Given forward and reverse orientation, there is a pair of scores $(q, \bar{q})$ for each position of an alignment and subsequently a pair of scores $(s, \bar{s})$ for the sequence segment. As shown in [11], the probability distribution of PWM scores can be efficiently calculated by convolution. The latter can be extended to calculate the joint probability density $f(s, \bar{s})$. The equation below gives the joint probability density of scores up to position $i$ of the alignment by convolution of $g(q, \bar{q})$ and $h(s - q, \bar{s} - \bar{q})$, where $g$ is the function of score pairs at position $i$ of $Q$ and $\bar{Q}$ and $h$ is the joint prob
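The convolution over site positions described in this chunk can be sketched as follows. The sketch makes simplifying assumptions that are not taken from the manual: scores are integers so that score pairs can be used as dictionary keys, the background is a mononucleotide model with independent positions (not a Markov model), and all function names are hypothetical.

```python
from collections import defaultdict


def reverse_pwm(pwm):
    """Reverse-complement PWM: positions reversed, columns A,C,G,T reversed."""
    return [list(reversed(row)) for row in reversed(pwm)]


def joint_score_distribution(pwm, background):
    """Joint distribution of (forward score, reverse score) for a random segment."""
    rev = reverse_pwm(pwm)
    dist = {(0, 0): 1.0}
    for fwd_row, rev_row in zip(pwm, rev):
        nxt = defaultdict(float)
        # Convolve the distribution so far with the score pairs of this position.
        for (s, s_rev), prob in dist.items():
            for k, p_k in enumerate(background):
                nxt[(s + fwd_row[k], s_rev + rev_row[k])] += prob * p_k
        dist = dict(nxt)
    return dist


def p_value_either_orientation(pwm, background, threshold):
    """Probability that the forward or the reverse score reaches the threshold."""
    dist = joint_score_distribution(pwm, background)
    return sum(p for (s, s_rev), p in dist.items()
               if s >= threshold or s_rev >= threshold)


# Toy PWM that scores 2 only for "AC"; its reverse complement matches "GT".
toy = [[1, 0, 0, 0], [0, 1, 0, 0]]
print(p_value_either_orientation(toy, [0.25] * 4, threshold=2))
```

Working on the joint distribution avoids double counting segments that score above the threshold in both orientations, which simply adding two single-orientation P-values would do.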
133. down-regulated sub-set will contain exactly ND objects with the lowest expression, and the non-change sub-set will contain exactly NNC objects with the expression values closest to ENC. Note that if several genes have exactly the same expression value, the result may be unpredictable. For example, if you have gene A with expression value 3, gene B with expression value 2 and gene C with expression value 2, and set NU = 2, then gene A will certainly appear in the up-regulated sub-set together with one of the genes B and C, but you cannot predict which one. Thus this mode is less predictable than the cut-off values mode. Still, if you specify the same parameters on the same gene set twice, the result is guaranteed to be the same, even in number of objects mode.

Figure 3.16: Dialog parameters in number of objects mode.

In both modes, Up and down regulated genes is just the combination of the Up and Down sub-sets.

3.3.6 Extracting random subsets
You can create several random subsets from any gene set in the tree. Open the Random sub-sets dialog window from the Data menu and specify the source gene set, the number of random subsets to create, the number of genes in each subset and
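The number of objects mode described at the start of this chunk can be sketched as three sorted selections. The function and parameter names are hypothetical, and breaking ties by gene name is an assumption made here only to get a repeatable result; the manual guarantees repeatability but does not say how ties are resolved.

```python
def select_by_number(expression, n_up=0, n_down=0, n_nc=0, e_nc=0.0):
    """Pick the n_up highest, n_down lowest and n_nc values closest to e_nc."""
    # Sorting by name first gives ties a fixed order (Python sorts are stable).
    by_name = sorted(expression)
    up = sorted(by_name, key=lambda g: -expression[g])[:n_up]
    down = sorted(by_name, key=lambda g: expression[g])[:n_down]
    non_changed = sorted(by_name, key=lambda g: abs(expression[g] - e_nc))[:n_nc]
    return {"up": up, "down": down, "non_changed": non_changed}


# Genes B and C are tied at 2.0; with n_up=2 only one of them can be selected.
genes = {"C": 2.0, "A": 3.0, "B": 2.0, "D": -1.0, "E": 0.1}
print(select_by_number(genes, n_up=2, n_down=1, n_nc=1, e_nc=0.0))
```

With this tie rule the example from the text always yields A and B; a different rule could equally yield A and C.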
134. ds Res. 34, D6-D9 (2006). PubMed <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=16381940>. DDBJ <http://www.ddbj.nig.ac.jp/>.
Ensembl: Hubbard T., Andrews D., Caccamo M., Cameron G., Chen Y., Clamp M., Clarke L., Coates G., Cox T., Cunningham F., Curwen V., Cutts T., Down T., Durbin R., Fernandez-Suarez X.M., Gilbert J., Hammond M., Herrero J., Hotz H., Howe K., Iyer V., Jekosch K., Kahari A., Kasprzyk A., Keefe D., Keenan S., Kokocinski F., London D., Longden I., McVicker G., Melsopp C., Meidl P., Potter S., Proctor G., Rae M., Rios D., Schuster M., Searle S., Severin J., Slater G., Smedley D., Smith J., Spooner W., Stabenau A., Stalker J., Storey R., Trevanion S., Ureta-Vidal A., Vogel J., White S., Woodwark C. and Birney E. Ensembl 2005. Nucleic Acids Res. 33, D447-D453 (2005). PubMed <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15608235>. Ensembl <http://www.ensembl.org/index.html>.
Entrez Gene: Maglott D., Ostell J., Pruitt K.D. and Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 33, D54-D58 (2005). PubMed <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15608257>. Entrez Gene <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=gene>.
135. e Filter bar allows you to quickly filter any data table in ExPlain based on different conditions. Note that this feature only hides table rows that do not meet the condition. It does not change the data itself, and the filtering can be removed easily.

Figure 1.17: Item origin information, exemplified by MATCH results (site search parameters and the Origin text describing the promoter set, promoter window, profile and data build used).

To turn on the filter bar, press the filter bar link above the table.

Figure 1.18

Now the filter b
136. e gives access to all data sets and analysis results generated by ExPlain.
Workspace (Input/output frame): The workspace displays actual data such as gene sets, genomic intervals and the results of all analyses.
Menu: The menu provides access to all analysis and data manipulation options and actions. If an action needs additional information from the user, it opens a dialog that provides functionality for the specific task.
Toolbar: The toolbar gives fast access to the most important items of the menu.
Dialog window: All dialogs open in a separate window and provide functionality for specific tasks.
Process monitor: The process monitor displays information about running, waiting or finished processor jobs.

1.1 The project tree
The project tree is the central access point for all data, such as gene or protein data sets, derived subsets (see e.g. Chapter 3, Gene sets), analysis results, predefined or user-defined PWM profiles, promoter models or interaction profiles. When ExPlain is started in the browser, all data related to the corresponding user is compiled from the ExPlain database into an individual tree representation. The user's name is displayed in bold at the top of the tree.

Figure 1.2: The project tree.
137. eam to 100bp downstream using the PRF vertebrate_all minSUM profile. Background frequencies were calculated based on the promoters of NC HUVEC GSE2639 example 500 (63 932 275). After the search, matrix cut-offs were optimized. Only best supported promoters were used.

When the MATCH output is large and the result has not been activated for more than 3 days (as in the case of a run within the CMA analysis), ExPlain compresses it in order to save disk space. In this case some menu features are disabled, but it is still possible to view the main Site search output and to create a set or profile from the significant matrices. To restore full functionality, use the Restore full result option of the Site search specific menu or the Restore link on the Site search item page. The waiting period before a MATCH result is compressed can be adjusted by the server administrator.

5.1.4 Optimization mode
The Optimize cut off option turns on the profile optimization mode. In this mode, matrices which do not differ much in density between the positive and the negative set are removed from the result. The cut-offs of the remaining matrices are optimized to maximize the difference between the query and control sets. By adjusting the p-value threshold option you can vary the severity of this effect. Lower values of the p-value threshold correspond to strict filtering, so you will get only a few very significant matri
138. easure defined by Medvedovic and Sivaganesan (2000) [39]. The similarity measure between a pair of genes is the proportion of iterations in which both genes are assigned to the same cluster. To obtain the stability of a cluster, the average of the similarity measures of all gene pairs within the cluster is calculated.

Table 11.2 (columns: Background correction, Normalize method, PM correction method, Summarization method; entries include MAS, RMA, None, PM Only, Subtract MM, Liwong, Median Polish, Avgdiff, Playerout, Qspline and Robust).

Chapter 12 miRNA analysis
12.1 Identification of miRNA targets
This feature identifies putative miRNA binding sites in the upstream regions of genes based on TargetScan [41] data. Pre-calculated total context scores were taken from TargetScan <http://www.targetscan.org/cgi-bin/targetscan/data_download.cgi?db=vert_50>. Lower values correspond to sites of better quality. Use the miRNA sites option within the Analyze menu to launch the miRNA search dialog.

Figure 12.1: Find miRNA sites dialog window (gene set, background gene set and score limit fields).

First specify the query and background gene sets. Note that you can choose none at the top o
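The pairwise similarity and the cluster stability defined at the start of this chunk (CRC clustering) can be written down in a few lines. The data layout, one dictionary of gene-to-cluster labels per iteration, and the function names are assumptions made for illustration; they are not taken from ExPlain.

```python
from itertools import combinations


def similarity(assignments, gene_a, gene_b):
    """Proportion of iterations in which both genes share a cluster label."""
    same = sum(1 for labels in assignments if labels[gene_a] == labels[gene_b])
    return same / len(assignments)


def cluster_stability(assignments, cluster_genes):
    """Average pairwise similarity over all gene pairs of one cluster."""
    pairs = list(combinations(cluster_genes, 2))
    return sum(similarity(assignments, a, b) for a, b in pairs) / len(pairs)


# Four iterations of a sampler; cluster labels are arbitrary within an iteration.
iterations = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 0, "c": 0},
    {"a": 1, "b": 0, "c": 0},
    {"a": 2, "b": 2, "c": 1},
]
print(similarity(iterations, "a", "b"))
print(cluster_stability(iterations, ["a", "b", "c"]))
```

Only co-membership within one iteration is compared, so it does not matter that cluster labels are not consistent between iterations.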
139. ection 6.5. As mentioned before, the best model, i.e. the one with the highest fitness score, is shown as the CMA output. By clicking on the view model list link above the model description, you will see the list of models found by CMA, with one model per row. The Promoter model column describes the PWM composition of the models in a concise syntax. Each CM component of a model is identified by a module number (e.g. M1) and described by single matrices and matrix pairs enclosed in square brackets. For instance, the term

V$EN1_01 V$SOX9_B1 V$NFAT_Q6 V$RP58_01 > V$CHX10_01 V$GCM_Q2 V$CIZ_01

describes a module consisting of three single matrices and two matrix pairs. The single matrices are V$EN1_01, V$SOX9_B1 and V$NFAT_Q6. There is one pair of the matrices V$RP58_01 and V$CHX10_01, in which V$RP58_01 can only occur in direct orientation (>) and V$CHX10_01 can occur in both orientations, as well as one pair of the matrices V$GCM_Q2 and V$CIZ_01, which can both occur in either orientation. The CM definition part is followed by the definition of the complete promoter model by its modules (PM). Since all models in Figure 6.10 contain only one CM component, this definition is always PM M1. The Model name column assigns a unique name to each model. The columns R, T, E, N and P provide the values achieved in the corresponding fitness components described in Section 6.8.2. The total fitness of a model is given in the
140. ed <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=12824369>.
Source of the human housekeeping gene set: Eisenberg E. and Levanon E.Y. Human housekeeping genes are compact. Trends Genet. 19, 362-365 (2003). PubMed <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=12850439>.
RefSeq: Pruitt K.D., Tatusova T. and Maglott D.R. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33, D501-D504 (2005). PubMed <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15608248>. RefSeq <http://www.ncbi.nlm.nih.gov/RefSeq/>.
Human Gene Nomenclature Database (HGNC): Wain H.M., Lush M.J., Ducluzeau F., Khodiyar V.K. and Povey S. Genew: the Human Gene Nomenclature Database, 2004 updates. Nucleic Acids Res. 32, D255-D257 (2004). PubMed <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=14681406>. HGNC <http://www.gene.ucl.ac.uk/nomenclature/index.html>.
Mouse Genome Informatics (MGI): Eppig J.T., Bult C.J., Kadin J.A., Richardson J.E., Blake J.A. and the members of the Mouse Genome Database Group. The Mouse Genome Database (MGD): from genes to mice, a community
141. ed binding sites for transcription factors, position weight matrices, ChIP-on-chip and other information. Genes have been included to provide information about the last step in signal transduction pathways, the regulation of target genes by activated transcription factors. Thus BKL presents information about complete signaling pathways, starting with the activation of a receptor at the membrane, followed by a cascade of kinases into the nucleus, where a particular transcription factor is activated and regulates the expression of a set of target genes.

Pathway & Chain: Pathways reflect canonical pathways for specific signaling molecules, mostly ligands or receptors, and are made up of one or more chains. Chains are sets of consecutive reactions joined by common enzymes or metabolites. Chains can have bifurcations and even loops if they have a regulatory meaning.

BKL molecules are hierarchically classified into families and can be separately annotated as certain states, modified forms or components of molecular complexes. Based on their genes and species taxa, molecules are assigned to the groups described below.

TRANSPATH MOLECULE GROUPS
Orthofamily / Family: The prefix ortho is used when the family entry is not specific for a certain species or higher taxon. An orthofamily is a group entry for a homologous family or superfamily of molecules. A family is a species-specific protein set.
Orthogroup / Isogroup: For a single gene, different iso
142. eed PWM Ps is of width k, probably smaller than the full motif width l. It is extended to full motif width by adding null weights at (l - k)/2 positions upstream and downstream. The full-length PWM is then refined by iterating the following process: (i) sites, one per sequence in P, maximizing the score to the extended weight matrix are selected, and (ii) a revised full-length PWM is built from those sites. This process is repeated until convergence, i.e. until the sites maximizing the PWM score are fixed in all sequences, or for at most a default number of 10 iterations, which are often sufficient for the convergence of significant seeded motifs. For each motif predicted, a list of 4^k P-values is generated, which calls for a multiple testing correction. This is carried out by generating a list of q-values from the list of P-values associated with words of seed length k, using a general algorithm for estimating q-values. The statistical significance of a motif is evaluated with the q-value of the sum S, which is the expected proportion of false positives incurred when calling the sum significant, i.e. not likely to have occurred if the positive sequences were randomly selected.

Reference: Fauteux F., Blanchette M. and Stromvik M.V. Bioinformatics 24, 2303-2307 (2008).

Chapter 6 Composite module analysis and models
The composite module analysis allows you to derive promoter models for your set of target promoters. A promoter model characterizes pr
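The seed extension and refinement loop described at the start of this chunk can be sketched as follows. Several details are assumptions that the text does not specify: the revised PWM is built as a log-odds matrix with a pseudocount against a uniform background, an odd difference between full width and seed width puts the extra null column downstream, and all function names are hypothetical.

```python
import math

ALPHABET = "ACGT"


def pad_seed(seed_pwm, full_width):
    """Extend a seed PWM with null (zero) weights on both sides."""
    flank = (full_width - len(seed_pwm)) // 2
    rest = full_width - len(seed_pwm) - flank
    return ([[0.0] * 4 for _ in range(flank)]
            + [row[:] for row in seed_pwm]
            + [[0.0] * 4 for _ in range(rest)])


def best_site(seq, pwm):
    """Start of the highest-scoring window (first one wins on ties)."""
    width = len(pwm)

    def score(start):
        return sum(pwm[i][ALPHABET.index(seq[start + i])] for i in range(width))

    return max(range(len(seq) - width + 1), key=score)


def build_pwm(sites, pseudocount=1.0):
    """Log-odds PWM from aligned sites (assumed construction, see lead-in)."""
    pwm = []
    for i in range(len(sites[0])):
        counts = [pseudocount + sum(s[i] == b for s in sites) for b in ALPHABET]
        total = sum(counts)
        pwm.append([math.log2((c / total) / 0.25) for c in counts])
    return pwm


def refine(seed_pwm, sequences, full_width, max_iter=10):
    """Iterate site selection and PWM rebuilding until the sites are fixed."""
    pwm = pad_seed(seed_pwm, full_width)
    starts = None
    for _ in range(max_iter):
        new_starts = [best_site(s, pwm) for s in sequences]
        if new_starts == starts:  # convergence: same site in every sequence
            break
        starts = new_starts
        pwm = build_pwm([s[p:p + full_width] for s, p in zip(sequences, starts)])
    return pwm, starts


# Seed for the word GACG (k = 4), extended to a full motif width of 6.
seed = [[2.0 if b == w else -1.0 for b in ALPHABET] for w in "GACG"]
seqs = ["AAATGACGTCCC", "CCTGACGTAAAA", "GGGGTGACGTGG"]
pwm, starts = refine(seed, seqs, full_width=6)
print(starts)
```

In the toy run the flanking columns start with null weights and pick up the T residues next to the seed once the PWM is rebuilt from the selected sites.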
143. efined pathways mapping in the Analyze menu links to the analysis of the representation of input molecules in user-defined interaction pathways, as described in Section 7.4. The dialog has the same parameters as the one mentioned above (Figure 4.14). The output table contains the column Interaction pathways, where user interaction pathways are referenced by the pathway name. This column links to the pathway data items in ExPlain. The columns Molecule name, Hits in group, Group size, over/under-representation and p-value are identical to the ones from the canonical TRANSPATH pathways output.

Figure 4.16: Visualization of the CH000000693 pathway.
Figure 4.17: Interaction pathways output.
144. ences dialog item (see Section 1.5.9). Please note that the numbers of genes, matrices, promoters and molecules are not identical. In this example, for 126 genes there are:
20 matrices, which represent binding sites of transcription factors from the list of 126 proteins.
244 promoters. Note: A gene might have more than one promoter.
69 molecules. These 69 out of the 126 molecules in the set are known to play a role in signal transduction pathways.

3.1.3 New gene set dialog
When you click on Gene set in the Create new data section of the ExPlain File menu, a New gene set dialog window opens. In the text field of this dialog you can type or paste identifiers to load them as a gene set. Note that this form is for identifiers only, so anything unrecognizable will be discarded. If you want to add other data, such as expression values, please create a tab-separated file and load the data as a file. You can also specify the database and species of your accession numbers, set the destination folder and type a name for the gene set.

3.1.4 Import data
ExPlain allows the import of predefined system data sets as well as sets from other sources through the Import data dialog of the File menu. In the upper part of the dialog you can specify the folder into which the files will be imported. Below this option you can see the lists of available data. Note that the Import dialog allows multiple selections from both lists using the Shift and Ctrl keys, so you ca
145. ene set, only those promoters from the interval set that are present in the gene set will be used to plot the graphs. The Blur window option specifies the window width in which values will be averaged to create a smoother graph. When Display original distribution is checked, a non-smoothed graph is displayed, which is equivalent to a blur window of 0 bp. The result is a graph report tree item (Figure 13.5) displaying the distribution of interval counts, as well as the signal and p-value distributions if this information is attached to your interval set.

Figure 13.4: Graph report dialog (interval set, Filter using gene set, Blur window and Display original distribution fields).
Figure 13.5: Graph report (count distribution).

Part II Appendix

Chapter 14 References
[1] TRANSFAC and TRANSCompel: Matys V., Kel-Margoulis O.V., Fricke E., Liebich I., Land S., Barre-Dirrie A., Reuter I., Chekmenev D., Krull M., Hornischer K., Voss N., Stegmaier P., Lewicki-Potapov B., Saxel H., Kel A.E. and Wingender E. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34, D108-D110 (2006). PubMed http://www.ncbi.nlm.nih.g
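For the Blur window option of the graph report described in this chunk, a moving average gives a rough idea of the smoothing. This is a sketch under assumptions: the input is taken to be one count per base position, the window is centered on each position and clipped at the ends, and the function name is hypothetical. How ExPlain actually places the window is not documented here.

```python
def blur(values, window_bp):
    """Average each position over a centered window; 0 bp returns the input."""
    half = window_bp // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)               # clip the window at the left end
        hi = min(len(values), i + half + 1)  # and at the right end
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


counts = [0, 3, 0, 0]
print(blur(counts, 0))  # same as the original distribution
print(blur(counts, 2))  # the single spike is spread over its neighbours
```

A wider window gives a smoother curve at the cost of positional detail, which is why the original distribution remains available as an option.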
146. ents, click the Summary link. After all factor level assignments have been made, the button should be clicked to store the factor level information in the database. The assigned factor level information will be displayed as a table (see Figure 11.4). Clicking the Change configuration button will launch the factor level assignment dialog.

11.2 Low level analysis

You have the option to choose between pre-selected techniques for each method, such as MAS 4.0, dChip, RMA, GCRMA, or MAS 5.0, or to configure advanced options (see below). The following steps will be performed during the low-level analysis: background correction, normalization, PM correction, and summarization. Table 11.1 shows the individual techniques used for the pre-selectable methods MAS 4.0, dChip, MAS 5.0, and RMA.

Figure 11.3: Factor level assignment interface (Assign levels to columns; gene set: example CEL files; Summary; Delete factor, Add factor, Add levels; one column per file C1.CEL to C4.CEL and T1.CEL to T4.CEL).

Figure 11.4: Saved factor level information (table listing the assigned factor levels, including an "other factor" with levels low, medium, and high, for the files C1.CEL to C4.CEL and T1.CEL to T4.CEL; Change configuration button).

Table 11.1: Pre-selected method, Summarization me MAS 4.0 Avgdifi d
147. eq experiments, tag distribution along the genome could be modeled by a Poisson distribution. The advantage of this model is that one parameter, $\lambda_{BG}$, can capture both the mean and the variance of the distribution. After MACS shifts every tag by d/2, it slides 2d windows across the genome to find candidate peaks with a significant tag enrichment (Poisson distribution p-value based on $\lambda_{BG}$, default $10^{-5}$). Overlapping enriched peaks are merged, and each tag position is extended d bases from its center. The location with the highest fragment pileup, hereafter referred to as the summit, is predicted as the precise binding location.

In the control samples, we often observe tag distributions with local fluctuations and biases. Many possible sources for these biases include local chromatin structure, DNA amplification and sequencing bias, and genome copy number variation. Therefore, instead of using a uniform $\lambda_{BG}$ estimated from the whole genome, MACS uses a dynamic parameter, $\lambda_{local}$, defined for each candidate peak as

$$\lambda_{local} = \max(\lambda_{BG}, \lambda_{1k}, \lambda_{5k}, \lambda_{10k})$$

where $\lambda_{1k}$, $\lambda_{5k}$ and $\lambda_{10k}$ are estimated from the 1 kb, 5 kb or 10 kb window centered at the peak location in the control sample, or in the ChIP-Seq sample when a control sample is not available (in which case $\lambda_{1k}$ is not used). $\lambda_{local}$ captures the influence of local biases and is robust against occasional low tag counts at small local regions. MACS uses $\lambda_{local}$ to calculate the p-value of each candidate peak and remove
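The dynamic background rate and the Poisson p-value described above can be sketched as follows. This is an illustration only, not the MACS implementation: the function names are hypothetical, and it assumes that all rates have already been scaled to the expected number of tags in the candidate-peak window.

```python
import math


def poisson_sf(k, lam):
    """Upper-tail probability P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    if lam <= 0:
        return 0.0
    # Start at P(X = k), computed in log space, and sum the tail upwards.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i  # P(X = i) from P(X = i - 1)
    return min(total, 1.0)


def lambda_local(lam_bg, lam_5k, lam_10k, lam_1k=None):
    """Largest of the genome-wide and local rates.

    lam_1k is optional: per the text it is not used when the rates
    come from the ChIP-Seq sample because no control is available.
    """
    rates = [lam_bg, lam_5k, lam_10k]
    if lam_1k is not None:
        rates.append(lam_1k)
    return max(rates)


def peak_p_value(tag_count, lam_bg, lam_5k, lam_10k, lam_1k=None):
    """Poisson p-value of a candidate peak under the local rate."""
    return poisson_sf(tag_count, lambda_local(lam_bg, lam_5k, lam_10k, lam_1k))
```

Taking the maximum makes the test conservative: a candidate peak is only significant if it exceeds the most pessimistic of the background estimates.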
148. equences of every gene provided by TRANSPro, where each promoter corresponds to the best supported virtual TSS. You can otherwise change the restriction to use the 5'-most, the 3'-most, or all promoters.

After setting up the required parameters, press the button to start the analysis. To leave the dialog window open after starting a process and schedule a second one, press the pin button, modify the parameters as needed, and then send the analysis again.

NOTE: Do not use optimization options when you plan to run CMA on the results of this search.

5.1.3 The Match output

Here we show the MATCH output generated from the analysis of promoter sets extracted from the HUVEC GSE2639 example gene set. We extracted the top 100 up-regulated genes as the query and 500 genes with FC 1 as a background set. All parameters were set as described in Section 8.2.2. Results are accessible at the MATCH node added to the project tree below the node of the query set (Yes set). In the output table, Yes refers to the main set and No refers to the background set. Each row contains information about the performance of one matrix of the input PWM profile. Matrix names link to the corresponding matrix entries in the BKL database. The average number of putative binding sites per 1000 bp is given for the query and the background sets in the Yes and No columns, respectively. Additio
149. er GWUOOD01. Dialog fields: Sequence set name (New sequences); Type or paste sequences in EMBL, FASTA or RAW format here; Default promoter position from sequence start.

10.2 Sequences in ExPlain: example

To understand how loaded sequences are presented in ExPlain, let us consider an example in EMBL format. Consider the file example1.embl, which contains the following lines:

Figure 10.3: Example of sequence file fragment

ID   Y00483; SV 1; linear; genomic DNA; STD; HUM; 1733 BP.
AC   Y00483;
DT   02-APR-1988 (Rel. 15, Created)
DT   14-NOV-2006 (Rel. 89, Last updated, Version 8)
DE   Human gene for gluthathione peroxidase
KW   glutathione peroxidase; GSHPx gene; peroxidase.
OS   Homo sapiens (human)
FH   Key             Location/Qualifiers
FH
FT   source          1..1733
FT                   /organism="Homo sapiens"
FT                   /mol_type="genomic DNA"
FT                   /cell_type="leukocyte"
FT                   /db_xref="taxon:9606"
FT   mRNA            join(154..719,998..1567)
FT                   /product="glutathione peroxidase"
FT   CDS             join(474..719,998..1357)
FT                   /transl_except=(pos:612..614,aa:Cys)
FT                   /product="glutathione peroxidase"
FT                   /db_xref="UniProtKB/Swiss-Prot:P07203"
FT                   /protein_id="CAB37833.1"
FT                   /translation= MCAARLAAAAAQSVY AFSARPLAGGEPVSL GSLRGK VLL IENVAS
FT                   L CGTT VRD Y T QWHMEL QRRI GPRGL V VL GEPCNOFGHQENAKMEET LNSL KY VRPGGGFE
FT                   PNFMLFEKCEVNGAGAHPLFAFLREALPAPSDDAT ALMTDPKLTTWSPVCRNDV AWNFEE
FT                   KFLVGPDGVPLRRY SRREGTIDIEPDIEALLSQOGPSCA
FT   exon            154..718
150. er any group of this category is statistically over- or underrepresented in the data set. For example, for the GO Biological Process signal transduction (part of the Function category), the complete set consists of all genes which have a link to any group within this category. It also contains the maximal number of elements linked to the target group, such as signal transduction. This is analyzed separately for each group within the category.

Two P-values are computed for each set of hits to a category group: an overrepresentation and an underrepresentation P-value. They are derived from the hypergeometric distribution by Equation 4.6.1 and Equation 4.6.2. The Benjamini-Hochberg multiple testing correction [23] is applied to the list of obtained P-values.

Equation 4.6.1: Overrepresentation of a functional group

$$P_{over} = \sum_{i=k}^{\min(n,\,D)} \frac{\binom{D}{i}\binom{N-D}{n-i}}{\binom{N}{n}}$$

Equation 4.6.2: Underrepresentation of a functional group

$$P_{under} = \sum_{i=\max(0,\,n-(N-D))}^{k} \frac{\binom{D}{i}\binom{N-D}{n-i}}{\binom{N}{n}}$$

In the above equations, N is the number of genes linked to the chosen category, D is the number of genes in the given group of the category, n is the size of the input list, and k is the number of genes that matched a gene in the group of the category.

The smaller the probability of observing the given input list consisting of elements that are linked to the group or not
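The two hypergeometric P-values and the Benjamini-Hochberg correction named in this section can be sketched as below. The symbols N, D, n and k follow the definitions in the text; the function names are hypothetical, and this is a minimal illustration rather than the code ExPlain runs.

```python
from math import comb


def hypergeom_pmf(i, N, D, n):
    """Probability of exactly i group members in a list of n genes drawn
    from N category genes, D of which belong to the group."""
    return comb(D, i) * comb(N - D, n - i) / comb(N, n)


def overrepresentation_p(k, N, D, n):
    """P(X >= k): k or more hits by chance."""
    return sum(hypergeom_pmf(i, N, D, n) for i in range(k, min(n, D) + 1))


def underrepresentation_p(k, N, D, n):
    """P(X <= k): k or fewer hits by chance."""
    lowest = max(0, n - (N - D))
    return sum(hypergeom_pmf(i, N, D, n) for i in range(lowest, k + 1))


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        j = order[rank - 1]
        running_min = min(running_min, p_values[j] * m / rank)
        adjusted[j] = running_min
    return adjusted
```

For example, with N = 10, D = 4, n = 3 and k = 2, the overrepresentation P-value is (36 + 4) / 120 = 1/3.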
151. es. Note that in both actions, columns from the site search results will be added to the created set. The Site map menu link opens a new window with a report, as described in Section 5.2, for all selected matrices together. The Save matrices menu link creates detailed reports for all of the selected matrices. The Sites table displays the formatted MATCH output (see Figure 5.24) for the selected matrices and for all promoters used in the analysis. The Profile option creates a PWM profile with the cut-offs from the result. Matrices from the MATCH output can be exported to BKL as a search result with the BKL search result option; you can read more about export in Section 3.5.

Parameters of the search can be recalled by expanding the additional information (see Section 1.5.2). There are also some descriptions of the analysis and a field for comments.

Figure 5.4: Site search analysis additional information (Site search results with background; Run again; Yes set: Up HUVEC GSE2639 example 100; No set: background HUVEC GSE2639 example 500; profile: vertebrate_all minSUM; Use high specific matrices with cut-offs from profile; Promoter window from/to; If gene has multiple promoters use: Best supported; Optimize cut-off with p-value threshold 0.01; Optimize window position). Origin: Sites search was performed on the promoters of Up HUVEC GSE2639 example 100 78 175 86 from 500bp upstr
152. escription, the number of input genes matching that group, the size of the matched group in TRANSPATH, the randomly expected number of hits, and the P-value of the match result. Please note that classes are also molecule entries in TRANSPATH.

Figure 4.12: Output table of TRANSPATH Molecule Classification analysis (columns: Molecule class identifier, Gene symbol, Molecule class description, Hits in group, Group size, Hits expected, p-value).

4.3.8 Whole subsets from the tree analysis output

Each row of the table presents a matched primary set node. The columns contain, from left to right: an identifier composed of the name of the project node (Geneset) and the name of the primary set node, the gene symbols, the number of input genes found in the set, the size of the primary set, the randomly expected number of hits, and the P-value of the match result.

Figure 4.13: Output table of Whole subsets analysis (columns: Gene symbol, Hits in group, Group size, Hits expected, p-value; one of the rows shown is Human housekeeping genes).
153. ets a FDR value assigned, which represents the probability of occupying the observed rank or higher ranks by random chance. It is estimated on the fly by random sampling. The ranking of the key nodes is defined by sorting them according to the score described above in descending order. All key nodes that have an observed rank lower than 200 get assigned 1.0 as FDR value by definition, since their score is considered not to be sufficient. Molecules which do not have any hits get assigned the last rank, since the score is zero in this case.

Z-Score: In addition to the FDR, each key node gets a Z-Score, which measures the deviation of the observed rank of the key node from the expected rank in the random case, divided by the standard deviation.

Equation 7.1.2: Z-Score calculation

In this formula, the rank distribution is assumed to comply with the normal distribution. Key nodes with Z greater than 1.0 are considered significant.

7.2 Network cluster analysis

7.2.1 Cluster dialog window

The Network clusters analysis can be used to identify common subnetworks for molecules of a data set. The algorithm tries to connect each pair of molecules of the input set. To start the analysis, open the Network clusters dialog via the link in the Analyze menu and select the gene set you want to analyze in the Gene set field. By default, this field is set to the current tree node. Then select the parameters that will be used by the algorithm.
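As an illustration of the key node FDR and Z-Score from Section 7.1.4, the sketch below estimates both from ranks obtained by random sampling. Several points are assumptions that the text does not confirm: the function names; the sign convention (expected minus observed rank, so that a better-than-random rank gives a positive Z); the use of the population standard deviation; and reading "rank lower than 200" as a rank number above 200.

```python
from statistics import mean, pstdev

RANK_CUTOFF = 200  # assumed reading of "rank lower than 200"


def empirical_fdr(observed_rank, random_ranks):
    """Fraction of random samplings in which the key node reaches the
    observed rank or a better (numerically smaller) one."""
    if observed_rank > RANK_CUTOFF:
        return 1.0  # assigned by definition, per the text
    as_good = sum(1 for r in random_ranks if r <= observed_rank)
    return as_good / len(random_ranks)


def z_score(observed_rank, random_ranks):
    """Deviation of the observed rank from the mean random rank, in
    units of the standard deviation of the random ranks."""
    spread = pstdev(random_ranks)
    if spread == 0:
        return 0.0
    return (mean(random_ranks) - observed_rank) / spread
```

With random ranks 10, 20, 30 and 40, an observed rank of 5 gives an FDR of 0 and a Z-Score of about 1.79, which would count as significant under the Z > 1.0 rule.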
154. everal site search results
5.4 Sites search: theoretical background
5.4.1 The TRANSPro database
5.4.2 Computational definition of transcription start sites
5.4.3 (title illegible)
5.4.4 Calculation of theoretical P-values for TFBS predictions
5.4.5 Sites search optimization with F-Match algorithm
5.4.6 The P-Match algorithm
5.4.7 The Patch algorithm
5.4.8 The Seeder algorithm
6 Composite module analysis and models
6.1 The CMA interface in ExPlain
6.1.1 The CMA dialog window
6.1.2 The CMA advanced dialog window
6.1.3 (title illegible)
6.2 Composite modules on promoters
6.3 Predefined parameters of CMA
6.4 Composite element models
6.5 Model editor
6.6 Classifying promoters
6.7 Obtaining interactions between Transcription factors and their target genes
6.8 CMA (Composite Module Analyst): Background information
6.8.1 CMA pr
155. ey to select several separate factors. As an example, we create a matrix named DR3 user defined using a window size of 8 nucleotides. The alignment starts at the third nucleotide and has the factor VDR assigned. It is possible to preview the PWM and consensus sequence using the button. Figure 8.14 shows the dialog with the matrix preview. Using this dialog you can change the matrix name or return to the previous dialog window to change other parameters. After you set up all required parameters, press the button to launch the process of matrix creation. The cut-off values are calculated during the matrix creation process; this can take some time. New matrices are placed in the Weight matrices folder on the tree.

8.6.2 Representation of the user matrix

When you click on the matrix node in the project tree, the matrix will be displayed in the output frame of ExPlain. You can see an example of a newly created matrix in the figure below. The matrix identifier is displayed at the top of the output frame and cannot be changed. The identifier includes an indicator for one of six groups of biological species (V: vertebrates, I: insects, P: plants, F: fungi, N: nematodes, B: bacteria), followed by the matrix name defined by the user in the matrix creation dialog. The PWM is displayed as the nucleotide frequency matrix, with one row per nucleotide and one column for each position in the pattern. The derived IUPAC consensus is provided below the frequency mat
156. f main set background set and, assuming a binomial distribution of the sites between the two sets, we can calculate the p-value of finding the observed number of sites and higher (for over-represented matches) or lower (in the case of under-represented matches):

Equation 5.4.2:

$$\text{if } k > k_{exp}: \quad p\text{-value} = \sum_{i=k}^{n} \binom{n}{i} f^{i} (1-f)^{n-i}$$

$$\text{if } k < k_{exp}: \quad p\text{-value} = \sum_{i=0}^{k} \binom{n}{i} f^{i} (1-f)^{n-i}$$

giving the p-value of over- and under-representation of matches in the main promoter set. For a given significance level p (e.g. p = 0.001), F-Match finds the thresholds th_max and th_min that maximize and minimize, respectively, the ratio $k / k_{exp}$, provided that the p-value < p. If the required significance level cannot be reached for a given matrix, this matrix will not be considered.

5.4.6 The P-Match algorithm

P-Match combines pattern matching and weight matrix approaches, thus providing higher accuracy of recognition than each of the methods alone. The algorithm is based on the simultaneous use of a positional weight matrix (PWM) and the set of aligned TF binding sites used to construct this matrix. The P-Match search algorithm computes a d-score value, which measures the similarity between a sub-sequence X of length L in the DNA and a given TF site S from the site set. The d-score is calculated using the weights of the nucleotides in the individual positions of the site, taken from the corresponding weight matrix.

Equation 5.4.3: L Aw w i B X w i B S d
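The binomial test behind the F-Match thresholds (Equation 5.4.2) can be sketched as below. The symbol meanings are assumptions, since their definitions fall outside this excerpt: n is taken to be the total number of sites found in both sets, f the fraction expected to fall in the main set, and k the number observed there, so that the expected count is n times f. The function names are hypothetical.

```python
import math


def binomial_pmf(i, n, f):
    """P(X = i) for X ~ Binomial(n, f), computed in log space so that
    large site counts do not overflow. Requires 0 < f < 1."""
    log_p = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
             + i * math.log(f) + (n - i) * math.log(1.0 - f))
    return math.exp(log_p)


def representation_p_value(k, n, f):
    """Upper tail if k is at or above the expected count (over-represented
    matches), lower tail otherwise (under-represented matches)."""
    k_exp = n * f
    if k >= k_exp:
        return sum(binomial_pmf(i, n, f) for i in range(k, n + 1))
    return sum(binomial_pmf(i, n, f) for i in range(0, k + 1))
```

A threshold search in the spirit of F-Match would evaluate this p-value at each candidate cut-off and keep the cut-off with the most extreme ratio of observed to expected sites among those below the chosen significance level.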
157. f icons and descriptions of corresponding data types is given in Table 1.1. There is typically an active node, displayed in bold in the tree, whose data is currently displayed in the Workspace (see also Figure 1.2). All nodes referring to analysis results, or to any data subset created during the analysis, are children of the starting data node provided as input for the analysis. This makes it easy to trace the workflow of a project. All nodes that have child nodes are prefixed with an icon that shows whether they are in the collapsed or the expanded state, and clicking on the icon expands or collapses the tree accordingly.

System folders such as Composite Elements, Gene Sets, and Genome intervals are always present in the tree. The folder Gene Sets contains several data sets provided by Biobase as well as input data and results of your analyses. Other system folders contain additional data sets (shown below) which can be used in various analyses:

Composite Regulatory Element models for CMA [2], compiled from TRANSCompel composite elements [1]
TRANSFAC profiles for binding site analysis [1]
User-defined interactions for Key Node and Cluster analyses
Genome intervals from ChIP-on-chip or ChIP-seq experiments, for filtering of the site analyses results
Parameter presets for data loading and CMA analyses

New user folders can be created inside system folders using
158. f such entities, like a protein family; a state of such an entity, like the phosphorylated form; or a complex of several other molecules. And finally, a molecule can be part of another molecule, either non-covalently bound as in a complex, or covalently bound as in a structural motif of a protein. The reason for such a wide scope for this class is to catch anything that has a specific signaling behavior.

Reaction: BKL reactions model interactions, reactions, and relationships between molecules. A reaction is a term for all kinds of interactions between signaling entities in signaling or regulatory events. The character of the interaction is more closely defined in its effect field by a set of terms. Reactions, as processes, are not physical entities like molecules, yet they are the central point in the signal transduction database. By representing these reactions between molecules as separate nodes in the graph, it becomes possible to store their properties and annotate them. Since many reactions in signal transduction are catalyzed, and most catalyzed reactions are quasi-unidirectional, all reactions stored in the database are by default unidirectional. Equilibrium reactions are identified in the effect field.

Gene: Gene information is linked to the BKL Gene table, where you can find information about the structure of gene regulatory regions, including individual experimentally demonstrat
159. f the background menu if you wish to identify miRNA binding sites without comparing to a background set. Next, choose a score limit. If the total context score of predicted sites for a given miRNA on a given gene is greater than the selected limit, then these sites will not be considered. As the total context score doesn't exceed 1, a value of 1 in this field will turn this filter off. Pressing the button starts the miRNA target identification process. The output result is displayed below.

The result of this search contains a list of miRNA names, their descriptions, and lists of identified target genes, followed by the count of target genes, the total number of sites found, the average score, and the sum score. The average score represents the average value of the total context score values for all genes, while the sum score represents the sum of total context score values for all genes. The miRNA search result node opens specific options within the miRNA analysis results menu item, which allow you to extract new subsets of selected miRNA genes or their targets. The P-value is calculated based on the hypergeometric distribution, using the total and matched gene counts in the data set and background set.

Figure 12.2: miRNA search output (miRNA analysis result for Up HUVEC GSE2639 example; 531 rows in total; export as plain text, XLS or RTF).
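The score limit filter and the two summary columns described for the miRNA search result can be sketched as follows. The function name and the input layout (one total context score per target gene of a single miRNA) are hypothetical; only the filtering rule and the definitions of the average and sum scores come from the text.

```python
def summarize_targets(total_context_scores, score_limit=1.0):
    """Summarize the predicted targets of one miRNA.

    total_context_scores maps a gene symbol to its total context score.
    Genes whose score is greater than score_limit are not considered;
    with the limit left at 1 nothing is filtered out, because the total
    context score does not exceed 1.
    """
    kept = {gene: score
            for gene, score in total_context_scores.items()
            if score <= score_limit}
    sum_score = sum(kept.values())
    average_score = sum_score / len(kept) if kept else 0.0
    return {
        "targets": sorted(kept),
        "target_count": len(kept),
        "sum_score": sum_score,
        "average_score": average_score,
    }
```

For example, with scores of -0.4, -0.1 and 0.2 for three genes and a limit of 0.0, the third gene is dropped, leaving a sum score of -0.5 and an average score of -0.25.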
160. ffymetrix IDs present in all sets will be considered in the calculation. Therefore, the data sources should show significant overlap in the probe sets for which expression data is provided. The output of the Rank Product high-level analysis will be a table containing Affy IDs as well as fold change, rank product, p and fdr values that are calculated based on the input parameters (Figure 11.15). This output can also be converted to an ExPlain gene set via Gene set, Convert selected rows to gene set in the analysis-specific menu.

Figure 11.11: RNA Degradation plot for a data set that contains data from arrays with high RNA degradation (axes: Probe Number; Mean Intensity, shifted and scaled).

Figure 11.12: Fold Change High level analysis dialog window (tabs: Fold Change, GLM, ANOVA; Source: QC Filtered MAS4 example CEL files; Test: Student's t-test; Select factors to calculate Fold Changes for all pairs of levels: Other factor).

Figure 11.13: Output of High Level Analysis (table of Affymetrix probe set IDs with log FC and P-value columns).

Figure 11.14: RankProduct dialog
161. fined combination of parameters. You should mark the checkbox and specify the name of a new preset. When you launch CMA by pressing the button, the preset will be saved. It will appear in the project tree inside the CMA subfolder of the Preset folder. Use the Previous button to return to the previous step. The last set of adjustable options is available on the third CMA dialog screen, which appears after the button is pressed.

Figure 6.3: CMA advanced dialog window (Launch CMA, Advanced Options: No upper limit on FN/FP, Limit FP by 20%, Limit FN by 50%; Run mode: Run once, Run 10 times; Select profile with preferred matrices; Inject models to initial population; Fitness function components: Use T-test, Error rate, Control normality of fuzzy score, Penalize model complexity, Use regression by column [Fold change]; Save as; Previous; Cancel).

The FP/FN field allows you to customize the false positive and false negative weights in the E component of the fitness function. You can increase the value for either FP or FN to make the corresponding error rate affect the fitness of models more strongly. If you wish to give more importance to suppressing false positive errors, set the FP restriction to a higher level. This will simultaneously lower the FN restriction. You can further choose to run the program once
162. follow the option once or twice, you see a checkbox with the field specifying the name of the new preset to be saved. Mark this checkbox and start the CMA; the preset will then be saved to the project tree. User-created presets can be managed like other items in the tree.

6.4 Composite element models

The ExPlain application provides several kinds of models that can be used to classify promoters of some data sets. Composite Element models for CMA, which were compiled on the basis of TRANSCompel composite elements [1], can be found in the folder Composite Elements in the project tree. As system preloaded models, they cannot be removed or modified. Any model, when selected, is displayed graphically. The extended description is available after clicking on the Verbose display link above the model. The TransCompel model link button links to the TRANSCompel model description. The example below shows an extended view of the system model named CEPB NFkappaB. Models saved from the editor (see Section 6.5) also have a graphical representation and can be viewed in a simple and an extended mode.

Figure 6.13: CMA preset parameters (System CMA preset 1: 3 modules of 2-5 matrices; main parameters: Run CMA on promoter set, Use Match output, Stop after, MC limit, Population size; Composite Module / Boolean promoter Model, with min, avg and max values; Number of single matrices).
163. forms such as splice variants may exist. Sometimes in the literature a signaling activity is first attributed to a single molecule, and later it is discovered that there is a whole group of similar molecules. Therefore, a special type of molecule entry is used, which we label isogroup for taxon-specific entries and orthogroup for orthologous, non-species-specific ones. To these abstracted group entries, all the known information can be assigned when it is not known which specific isoform is involved.

Orthobasic / Basic: Molecules of the type basic contain data for a specific isoform, e.g. a splice variant, to which an amino acid sequence can be assigned. Again, the prefix ortho is used to generalize information for orthologous isoforms from different species.

Orthocomplex / Complex: An orthocomplex is a group entry for orthologous complexes consisting of non-covalently bound molecules, where a complex describes taxon-specific non-covalently bound molecules.

The unmodified form of a protein and all its modified forms are its states, where the modification can be by covalent binding, by complex formation, or by change of the environment. The protein per se is a concept that is based on the observation that there is only one gene coding for each protein sequence. All the states share the same gene and consequently part of their structure, the amino acid chain. They are functiona
164. h group is combined out of modules (group M1 or M2 or ...). Modules can include single matrices and matrix pairs. A repressing module is included in the model with the logical operator NOT.

There are two ways to run the model editor in ExPlain. After clicking on the New model link in the File menu, the model editor will be launched with an empty input window (see the figure below). If you choose the Edit option with a model (calculated by CMA or acquired in any other way) as the current active node, the current model will be opened in the model editor.

Figure 6.16: Model editor window (example matrices: V$HNF1_Q6, V$ISRE_01, V$CDP_02, V$NFKB_Q6_01, V$TEL2_Q6; editor box: Place matrix here to edit; fields: N, Matrix 1, Matrix 2, Cut-off, Distance, Cut-off).

To add a single PWM to a new model, first select it in the drop-down box Matrix 1. We chose V$AP1_01 as an example. The selected matrix appears in the box above. Edit the cut-off value in the Cut-off field and select N, the number of matrix matches in the module. If you move your mouse pointer on top of the matrix, you will see a floating blue box with the matrix parameters inside. When you are ready, drag the created matrix from the editor box up to create a new model.

Figure 6.17: Adding a single PWM (Drag matrix or pair up
165. he import options dialog window displays the first rows of the loaded data. In case you did not select the correct file or wish to load another file, you can return to the previous step by pressing the << Load another file button. When the source file name is correct, you can select the name of the tree folder where your data will be loaded in the Destination drop-down menu.

If you are loading an Excel file that contains several worksheets, each sheet will be represented by a sheet button (e.g. Sheet 2) in the upper part of the Import options dialog (Figure 3.2). The buttons are labeled with the sheet names and contain check boxes. To exclude worksheets from the data loading process, deselect the corresponding check boxes. If more than one worksheet is selected, they will be processed one after another; empty sheets will be skipped.

The dialog window also displays the head of your data table and asks you to specify the header/data separator. ExPlain guesses the location of header rows, such as a column title bar. If its suggestion is not correct, the header range can be adapted by moving the mouse pointer over the appropriate row separator of the data table and clicking on it. Rows to be excluded as part of the header are distinguished from the data section by a blue background.

TIP: Click on the separator above the first row if your file contains no header rows.
166. her by edges. The nodes can be either molecules, genes, or reactions. The node menu displays the accession link, name, and type of the node, and provides options to delete the node (Delete) or to add some other sets of molecules to the network. By clicking on a node set (e.g. Upstream molecules) you will get a list of the reactions which are upstream from the selected node.

Figure 7.10: Node options (node menu for the molecule Jak1 showing its accession, name and type, with the options Delete, Upstream molecules and Downstream molecules, and a reaction list with Select All, Clear selection, Load and Cancel).

Colors and shapes of the network nodes distinguish molecule types and functions. Molecules are represented by ellipses, while two vertical ovals represent a receptor. Ligands have a triangle shape and transcription factors a trapezium shape. Red molecules are key nodes and blue ones represent end targets. Several nodes placed one over another denote complexes. All this information is available by clicking on the Legend button. The Layout button refreshes the canvas, positioning the molecules hierarchically and adjusting the edges between them. This option should be used when you have made some changes to the network, like moving some molecules or edges, or adding or rem
167. his interaction profile can then be used to enrich Key Node analysis. This function can be launched by clicking on the icon that is visible when standing on a CMA search result node. In the dialog window it is necessary to choose a gene set and a composite model. If the currently selected node is a CM saved from CMA, or one of those described in Section 6.4, it will be selected in the Promoter model field of the dialog window. If the current node is a gene set, then it will automatically be selected as the main gene set. You can launch the search by pressing the button.

The program has two parameters to set: a) the result of searching a saved CMA model on a gene set; b) the percentage (%) of false positives to allow when setting the threshold (set to 10 for a relaxed threshold and to 1 for a stringent one). The result of this program is a table with pairs of molecules corresponding to the TFs from the CMA model and the genes targeted by these TFs. This table can be added as user interactions when running a key node analysis.

Algorithm: The interaction profile generator reads the score of the CMA model in each target gene, in both the background set used to run CMA and the set of new targets which are the product of the CMA search. The percentage of false positives passed as a parameter is used to decide the exact CMA score that would identify the given percentage of genes in the control set as being regulated by the CMA model, and use this as a
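The thresholding step of the interaction profile generator can be sketched as below. This is one possible reading of the description, not the program's code: the function names are hypothetical, and it assumes that higher CMA scores indicate regulation and that a gene counts as a target only when its score is strictly above the threshold.

```python
import math


def score_threshold(background_scores, fp_percent):
    """CMA score threshold that lets at most fp_percent of the
    background (control) genes pass as false positives."""
    ranked = sorted(background_scores, reverse=True)
    allowed = math.floor(len(ranked) * fp_percent / 100.0)
    if allowed >= len(ranked):
        return float("-inf")  # every background gene may pass
    # Scores strictly above the (allowed + 1)-th best background score
    # are exceeded by no more than `allowed` background genes.
    return ranked[allowed]


def predicted_targets(target_scores, threshold):
    """Genes from the new target set whose CMA score passes the threshold."""
    return sorted(gene for gene, score in target_scores.items()
                  if score > threshold)
```

With ten background scores 1 to 10, a relaxed setting of 10 gives a threshold of 9 (one background gene passes), while a stringent setting of 1 gives a threshold of 10 (none pass).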
168. icking the Sites table link, you can obtain a formatted text report on the selected promoters in MATCH format. The figure below shows an example of a detailed report. Matrix name, entry position, cut-offs, matched sequence, and assigned factors are displayed for every matrix hit on the selected promoters. The link Back returns you to the promoters view.

Figure 5.24: Detailed text report (sites lying on promoter HGSA 3886 1; gene symbol NFKBIA; description: nuclear factor of kappa light polypeptide gene enhancer in B-cells inhibitor, alpha; TSS on chromosome 14; table columns: matrix identifier, position and strand, core match, matrix match, sequence, factor name; the rows list hits of the matrices V$NFKAPPAB65_01, V$NFKB_Q6_01 and V$RELBP52_01).
169. [Figure: bar charts of log p-values for BKL disease terms (for example Preventative: Multiple Myeloma; Preventative: Arthritis, Rheumatoid; Correlation: Diabetic Nephropathies; Preventative: Hodgkin Disease; Correlation: Bronchiolitis Obliterans), each with Export links (RTF, XLS), compared across three sets: (1) Proteome BKL Disease View 1max 2min on Sample 1, (2) Proteome BKL Disease View 1max 2min on Human housekeeping genes, and (3) Proteome BKL Disease View 1max 2min on PXE. Panels: (a) Most similar terms; (b) Most different terms; (c).] For a given functional category, such as Gene Ontology's Biological Process, the algorithm tests wheth
170. if you set the maximal number of mismatches to 5, Patch™ searches only for sites which are longer than 11 bp. All shorter ones will be ignored. The default value for this parameter is 0. Mismatch penalty: When comparing a binding site search pattern with some part of the input sequence, each mismatching position receives a mismatch penalty. This penalty lowers the overall score for the match between the whole site search pattern and the input sequence. Each matching nucleotide receives a bonus weight of 100. The default value for the mismatch penalty is also 100, so the negative influence of a mismatching position equals the positive influence of a matching position. If you reduce the mismatch penalty, you will receive high-scoring sites containing mismatches in the Patch output. If you increase this parameter, high-scoring sites are not likely to contain mismatches. Lower score boundary: The lower score boundary is a cut-off defining which matches between a site search pattern and the input sequence will be listed in the output. The score estimated for every match has to be higher than or equal to this cut-off. The default value for the lower score boundary is 87.5. 5.4.8 The Seeder algorithm. Seeder is a discriminative seeding DNA motif discovery algorithm designed for fast and reliable prediction of cis-regulatory elements in eukaryotic promoters. The motif search starts by enumerati
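The interplay of match bonus, mismatch penalty, and lower score boundary can be illustrated with a small sketch. The bonus of 100 per matching nucleotide and the default penalty of 100 come from the text above; the normalization of the raw score to a percentage of the best possible score is an assumption made here so that the result can be compared with the boundary of 87.5, and the function name is hypothetical.

```python
def patch_like_score(pattern, window, mismatch_penalty=100.0, match_bonus=100.0):
    """Score one alignment of a site pattern against a sequence window.
    ASSUMPTION: the raw score (bonus per match minus penalty per
    mismatch) is expressed as a percentage of the maximal raw score;
    the manual excerpt does not state how Patch normalizes it."""
    if len(pattern) != len(window):
        raise ValueError("pattern and window must have equal length")
    matches = sum(p == w for p, w in zip(pattern.upper(), window.upper()))
    mismatches = len(pattern) - matches
    raw = matches * match_bonus - mismatches * mismatch_penalty
    return 100.0 * raw / (match_bonus * len(pattern))

def passes_boundary(score, lower_score_boundary=87.5):
    # A match is listed when its score is higher than or equal to the cut-off.
    return score >= lower_score_boundary
```

Under this assumed normalization, an 8 bp pattern with one mismatch scores 75 with the default penalty and is dropped, while the same match scores 87.5 with a penalty of 0 and is listed, which mirrors the statement that lowering the penalty admits sites containing mismatches.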
171. ific for certain tree nodes will be described in their corresponding sections. 1.5.1 Renaming project tree nodes. Each workspace provides a field to rename the active project tree node. To reach the editing field, select the node you want to rename in the project tree and click on its name at the top of the workspace. After you edit the name in the line editor and press the button, ExPlain updates the interface with the new name. 1.5.2 Item origin information. By clicking on the sign near the node name you can see additional information. Four fields can be present here. Parameters: This field is shown for all analysis results. All parameters used by the algorithm are shown here. By pressing the Run again link or button, new analyses with the same parameters can be launched. [Chapter 1: Main Components of the ExPlain User Interface; 1.5 The Workspace] Figure 1.14: Functions of the workspace. [Figure: workspace view of a gene set (species Human, 18 rows) with the filter bar, the rows-per-page selector, export links (Plain text, XLS, RTF), mark controls (Page, All, None, Invert), and the columns Gene symbol and BKL description.]
172. ion of the considered set of promoters, then the frequency of the corresponding sites found in these sequences should be significantly higher than expected by random chance. Often the stringency of the interaction of this TF with its target sequences in the considered promoters is not known, which leads to uncertainty in setting thresholds for site searches with the MATCH program. [5.4 Sites search: theoretical background] F-Match evaluates the set of promoters and, for each matrix, tries to find two thresholds: one, th_max, which provides the maximum ratio between the frequency of matches in the promoters in focus (query set) and in the background promoters (background set), giving over-represented sites; and a second threshold, th_min, which minimizes the same ratio, giving under-represented sites. As a result, for each weight matrix we obtain a set of predicted K sites and M sites in the two promoter sets, with the corresponding matrix scores. The F-Match algorithm makes an exhaustive search through the space of all scores observed in the sequence sets. Each observed score is taken as a threshold th, and the program computes the number of sites k found in the main promoter set and the number of sites m found in the background promoter set. Then the expected number of sites in the main set to be observed in the case of even distribution of sites between two sets will be (Equation 5.4.1, main set) kap f n k m
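The exhaustive threshold scan described in this excerpt can be sketched as below. It only reproduces the ratio-of-frequencies idea stated in the text; the statistical evaluation of Equation 5.4.1 is not recoverable from this excerpt and is left out. The function name, the per-promoter frequency definition, and the decision to skip thresholds with no background hit are assumptions for the illustration.

```python
def scan_thresholds(main_scores, background_scores, n_main, n_background):
    """Try every observed score as threshold th and return
    ((th_max, ratio), (th_min, ratio)).
    k = sites with score >= th in the main set, m = the same count in
    the background set; the ratio compares sites per promoter.
    ASSUMPTION: thresholds with m == 0 are skipped because the ratio is
    undefined there."""
    best_max = best_min = None
    for th in sorted(set(main_scores) | set(background_scores)):
        k = sum(s >= th for s in main_scores)
        m = sum(s >= th for s in background_scores)
        if m == 0:
            continue
        ratio = (k / n_main) / (m / n_background)
        if best_max is None or ratio > best_max[1]:
            best_max = (th, ratio)
        if best_min is None or ratio < best_min[1]:
            best_min = (th, ratio)
    return best_max, best_min
```

Here `main_scores` and `background_scores` are the matrix scores of all sites found for one weight matrix in the two promoter sets, and `n_main` and `n_background` are the numbers of promoters in each set.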
173. is conducted for the 300 most up-regulated genes from the HUVEC GSE2639 example, extracted in the same way as in Section 3.3.5, with cluster separation degree 5 and distance 3. The Hits in network values represent the number of input molecules that are present in a certain subnetwork. Names and TRANSPATH accession numbers of input molecules are listed in the Molecule name and Molecule acc columns, respectively. Figure 7.7: Network cluster output. [Figure: table with the columns Hits in network and Molecule name; each row lists the input molecules found in one subnetwork, for example IL-1alpha, IL-1beta, and IL-1RI in a subnetwork with 3 hits.] When a network cluster node is active, the custom menu Clusters search result, which provides specific actions for selected rows, appears in the menu bar. Corresponding icons appear at the right of the toolbar. You can obtain a representation of one or several clusters by marking key nodes in the checkbox column and pressing the Clusters Visualize selected clusters menu item or toolbar button. [7.3 Network visualization] The Hits Get hits from selected rows as gene set option creates a subset from the input molecules present in the selected subnetworks. The network of the top cluster identified (cf. Figure 7.7) is displa
174. ist box we extend the HUVEC UP profile by adding matrices of the Sp1 factor. New PWMs are added by scrolling to the corresponding list entries and selecting rows with the mouse pointer while pressing the Ctrl key. Since the current HUVEC UP profile was stored with default cut-offs (0.75 as CSS and 0.8 as MSS threshold), we also mark the minSUM radio button so that matrices of our profile will be configured with the respective cut-offs. Finally, we name the new profile HUVEC UP Sp1. The profile is created by pressing the button. The list box supports the use of the left mouse button in conjunction with Ctrl or Shift, as is standard in many applications. Keep the Ctrl key pressed while clicking on items of the list to select several, not necessarily consecutive, entries. Alternatively, pressing the Shift key marks the range from the previously selected item to the one currently underneath the mouse pointer. Figure 8.6 shows a table of the profile created by modification of the initial HUVEC UP profile. Each matrix row contains the respective predefined minSUM cut-offs in the CSS and MSS columns. There are additional rows corresponding to Sp1 matrices in the profile table that were not present in the HUVEC UP profile. 8.3 Creating profiles from gene sets. First select the gene set under study in the project tree. The Gene set menu will appear in the menu bar. This menu contains the following options: 1. Minimize false
175. iven set to other data sets on your project tree. If you have a list of genes compiled for your specific research interest, you can load it into ExPlain and extract genes of this list from any other gene set. 4.1 The Functional classification analysis. 4.1.1 How to run classification analysis. Figure 4.1: Dialog window of the Functional classification. [Figure: Functional Analysis dialog with a Gene set selector (Sample 1, 224 genes), a Find groups by list (Expression BKL manual curation, GO annotation BKL manual curation, Proteome BKL Disease view, SwissProt keywords, Transcription Factor classification, Transpath molecule classification, Whole subsets from the tree), a P-value threshold field, a control False Discovery Rate option, a Minimal hits to group field, and a Cancel button.] Click the Functional classification menu link in the Analysis menu or press the icon to launch the dialog window. Select the set to analyze. Select a functional category or categories. Set thresholds for the P-value and the minimal number of hits. Press the OK button. [Chapter 4: The Functional Classification] To extract a list of genes assigned to functional groups, select the groups and press the icon. Species in the Functional categories (human, mouse, rat) are defined automatically. 4.1.2 Functional Analysis categories. Expression (BKL manual curation): The annotated groups here are different organs, tissues, tumors, or cell types of the human organism. Mouse and
176. [Figure: site search result table with mark controls (Page, All, None, Invert) and the columns Matrix name, Graphs, Matched promoters, sites per 1000 bp, and p-value, one row per matrix; 31 more rows follow on the next pages.] The result of an optimized site search contains the optimized PWM cut-off values within the hidden columns. Use the Profile menu option to create a PWM profile with optimized cut-off values. 5.1.5 P-Match: combine pattern and matrix search. As an alternative to the MATCH algorithm, you can use P-Match, which combines pattern matching and weight matrix approaches (see Section 5.4.6). In comparison with the MATCH approach, P-Match generally provides superior recognition accuracy in the area of low false negative e
177. k increases, these two parameters can be used to optimize the algorithm toward finding the best model. Press the button to start the analysis after you have set up all parameters. 6.1.2 The CMA advanced dialog window. The advanced options of CMA are available through the button. Figure 6.2: CMA advanced dialog window. [Figure: Launch CMA dialog with two panels. Composite Module: number of single matrices (min, avg, max), number of pairs of sites, Distance in pair (bp), Optimize distance in pair, Consider orientation in pair, Size of module, Optimize factors impact. Boolean promoter model: Max number of groups, Max number of modules in a group, Allow repressing group. The dialog also has a Save as field and the buttons Previous, Run now, Next, and Cancel.] The Composite Module dialog contains fields to adjust the CM components of promoter models. You can express your assumptions about the abundance of single matrices and matrix pairs by setting minima and maxima, as well as the average number of these components, in the dialog window. For pairs you can also adjust the spacer range in the Distance in pair field. Furthermore, you can require CMA to optimize the allowed spacer range in steps of 6 bp and to take specific orientation into account. You can optimize the factors' impact by checking the corresponding option. In this mode, weights will be assigned to the factors according to their c
178. ks for gene sets only. 1.5.4 Adjusting the number of visible table rows and page navigation. The number of visible rows can be adjusted by selecting an appropriate value (for example 10, 20, 1000, or all) from the dropdown box above the data table. ExPlain then immediately updates the view accordingly. If the current view does not comprise all table rows, the interface provides a list of page links to navigate through the table. Hover the mouse pointer over a page number to see the corresponding value range of the sorting column. Figure 1.22: Page navigation. [Figure: Rows per page selector and page links above a table sorted by Gene symbol.] 1.5.5 Sorting, renaming, and showing or hiding table columns. Any table in the workspace can be sorted by the values of a column by clicking on the corresponding column title. If the table is sorted, an arrow next to the title indicates whether the items are arranged in ascending or descending order. Clicking the title of a sorted column switches to the opposite order. Furthermore, any column can be hidden by clicking on the icon that appears next to the column title when placing the mouse over it. Hidden columns can be reintroduced to the table by clicking on their names in the list on the right side of the header row. Figure 1.23: Hidden column selector. Show co
179. le archive is displayed again. For high-level analysis it is necessary to choose the factor that will be used for the fold change calculation. The output of the high-level analysis is a table containing the Affy IDs, fold change, and p-values that are calculated based on the input parameters and the factor level assignment (Figure 11.13). This output can also be converted to an ExPlain gene set via the menu option Gene set Convert selected rows to gene set of the analysis-specific menu. 11.4.1 Meta-analysis. In addition to the four high-level analysis methods that can be used for studying single data sets, ExPlain provides a procedure called Rank Product, which enables analyses of data from different origins (e.g. different laboratories). The procedure employs the RankProd algorithm developed by Hong and Breitling [32], which is based on the RankProduct method of Breitling et al. 2004 [33]. A brief description of the procedure is provided at the end of this section of the manual. The Rank Product input dialog (Figure 11.14) requires that you specify at least one data source and that you select two or more columns for both the control (baseline) and the experiment. The selection can be made based on previously specified factors and levels by choosing the corresponding factor name in the selection option factor drop-down menu. Note that up to three different sources can be processed by the procedure. Please also note that only data with a
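The core idea of a rank product statistic can be sketched in a few lines: each gene is ranked by fold change within every comparison, and the ranks are combined as a geometric mean. This is a generic illustration of that idea only; it is not the RankProd package, it omits the permutation-based significance estimates, and the function name and input layout are hypothetical.

```python
import math

def rank_product(fold_changes):
    """fold_changes maps gene -> list of fold changes, one per
    comparison (all lists must have the same length).
    Rank 1 is the most up-regulated gene in a comparison; the result
    is the geometric mean of a gene's ranks over all comparisons."""
    genes = list(fold_changes)
    n_comparisons = len(next(iter(fold_changes.values())))
    log_sum = {gene: 0.0 for gene in genes}
    for i in range(n_comparisons):
        ordered = sorted(genes, key=lambda g: fold_changes[g][i], reverse=True)
        for rank, gene in enumerate(ordered, start=1):
            log_sum[gene] += math.log(rank)  # sum of logs avoids huge products
    return {gene: math.exp(log_sum[gene] / n_comparisons) for gene in genes}
```

A gene that is consistently near the top of every comparison gets a small rank product, which is why the statistic is suited to combining data from different origins whose raw values are not directly comparable.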
180. lecting PWMs for the profile can be altered with the Matrices and Factors tabs above the list box, as well as with the High specificity matrices only checkbox. When the Factors tab is selected, transcription factors are shown. Upon profile creation, all PWMs linked to the selected factors are collected in the profile. An entry line contains the factor name (e.g. AhR) and one or more sample identifiers of corresponding PWMs in brackets. When the Matrices tab is selected, the listed entries are TRANSFAC PWM identifiers, as shown in Figure 8.3, thus allowing a direct matrix profile compilation. The High quality matrices only option limits the contents of the list to high-quality matrices, so that a profile can be created from such PWMs only. Use the button after setting all the necessary parameters to launch the process of profile creation. If this dialog window is used in conjunction with an active gene set or profile node, matrix or factor entries of the active profile are preselected in the list box. 8.2.2 Example of profile creation. This example demonstrates profile creation from a set of up-regulated genes extracted from the HUVEC GSE2639 example gene set in Section 3.3.1. We use the Create new profile dialog window to create a profile of all matrices of this gene set. To obtain the profile of matrices from the up-regulated gene set, open the Create new profile dialog window, select the rows of the input output table as sho
181. ll as storage of technical conditions of the microarray experiment extracted from the respective Affymetrix DTT file. For more information on the data table functionality (columns management, subset renaming, etc.), see Section 1.5. The Assign levels to columns option of the Data menu launches the Factor and Level assignment dialog described in Section 11.1.1. You can group dataset columns by levels within a category, referred to as a factor. After the assignment is done, it is displayed as additional lines above the data table two [Chapter 3: Gene Sets; 3.3 Recombining gene sets] Figure 3.7: Gene set representation. [Figure: the gene set HUVEC GSE2639 example (species Human, 7985 rows) shown as a table with the filter bar, the rows-per-page selector, export links (Plain text, XLS, RTF), mark controls (Page, All, None, Invert), and the columns Gene symbol, BKL description, and Fold change.]
182. ll be used to create the filtered set: genes, promoters, molecules, or matrices. Note, for example, that if you select the matrices view, genes without a corresponding position weight matrix will not appear in the result set even if they satisfy the conditions. Think of it as switching the gene set table to the corresponding view and then manually selecting some lines by checking checkboxes in the table. If you are not sure about this option, use genes. The filtering parameters can be set up using condition fields. The second and the third fields are collapsed by default. To expand other conditions, click the corresponding links. Several conditions can be connected by "and" or "or" rules. You can use up to three columns to specify the entities you wish to extract. Figure 3.10: Subset creation with expression value constraints. [Figure: Filter gene set dialog with a gene set selector (HUVEC GSE2639 example), an Objects to consider field (genes), a first and a second condition on the Fold change column joined by an and/or selector, a collapsed third condition, and a Cancel button.] Each condition consists of three fields. The column list contains the names of the columns of the selected set, regardless of their type and visibility. Note that columns with system annotation are marked in the list with one picture and other columns have a different picture on the left. Subsequent lists are used to set a requirement for a marked column. For a text or db link colum
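The condition logic of the filter dialog (up to three conditions, each a column, a comparison, and a value, joined by "and" or "or") can be sketched as below. The row layout, the operator names, and the left-to-right evaluation of the connectors are assumptions for the illustration; the excerpt does not say how ExPlain resolves mixed "and"/"or" chains.

```python
import operator

# Comparison operators offered in this sketch (hypothetical selection).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}

def filter_rows(rows, conditions, connectors=()):
    """rows: list of dicts (column name -> value).
    conditions: one to three (column, op, value) tuples.
    connectors: "and"/"or" strings joining consecutive conditions.
    ASSUMPTION: connectors are applied strictly left to right."""
    if not 1 <= len(conditions) <= 3:
        raise ValueError("between one and three conditions are supported")
    if len(connectors) != len(conditions) - 1:
        raise ValueError("need one connector between each pair of conditions")
    selected = []
    for row in rows:
        column, op, value = conditions[0]
        keep = OPS[op](row[column], value)
        for connector, (column, op, value) in zip(connectors, conditions[1:]):
            test = OPS[op](row[column], value)
            keep = (keep and test) if connector == "and" else (keep or test)
        if keep:
            selected.append(row)
    return selected
```

For example, `filter_rows(rows, [("Fold change", ">", 0.993), ("Fold change", "<", 1.001)], ["and"])` keeps only rows whose fold change lies between the two bounds, whereas the same conditions joined by "or" keep every row that satisfies at least one of them.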
183. lly related, often even reversibly transformable into each other. A basic molecule entry captures this concept and is the class of all states of a protein. These states are different molecules, and we store them as different entries. As molecules, they can be used in a pathway assembly. We store general information, like the amino acid sequence, in the basic molecule entry and link its states to it. In the simplest case there are only two states, an inactive one and an active one. In other cases there are more. For example, a transcription factor can be (1) de-phosphorylated in the cytosol, (2) phosphorylated in the cytosol, or (3) phosphorylated and bound to DNA in the nucleus, to name a few possibilities. The same protein will exhibit distinctly different signaling functions in these three states. For example, in the first state it will be susceptible to phosphorylation; in the second state, to translocation into the nucleus or dephosphorylation; and in the third state it will activate transcription. Hierarchical grouping relations and roles of molecules are summarized in Figure 7.17. Figure 7.17: Hierarchical grouping relations between BKL molecules. [Figure: diagram relating the entry types family, orthofamily, orthogroup, isogroup, orthocomplex, orthobasic, basic, and complex.] The number of states for a molecule is the product of the number of its modified forms and the number of locations where it is found. Only compounds which share the same location interact in nat
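The counting rule at the end of this excerpt (states = modified forms × locations) can be made concrete with a short sketch. The form and location names are illustrative placeholders, not BKL entries.

```python
from itertools import product

def enumerate_states(modified_forms, locations):
    """Return every (form, location) combination; the number of
    states is len(modified_forms) * len(locations)."""
    return list(product(modified_forms, locations))

# Illustrative values only: two modified forms in two compartments.
states = enumerate_states(["dephosphorylated", "phosphorylated"],
                          ["cytosol", "nucleus"])
```

Two forms in two locations give four candidate states; the transcription factor example above names three of the possibilities.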
184. [Table of contents fragment] 8.3 Creating profiles from gene sets; 8.4 Profile representation in the result table; 8.5 Profile menu options; 8.5.1 Create profile from selection; 8.5.2 Create gene set using selected matrices; 8.5.3 Extend set of selected matrices by all homologous matrices; 8.5.4 Change cut-offs; 8.5.5 Merge several profiles into one profile; 8.6 (title not recoverable); 8.6.1 Creation of new matrix; 8.6.2 Representation of the user matrix; 8.6.3 Changing factors associated with matrix; 8.6.4 Importing user matrices from TRANSFAC; 9 Genome intervals (ChIP-chip, TFBS, ChIP-Seq, Tiling arrays); 9.1 Loading of genome intervals; 9.1.1 Loading of genome intervals data from BED file; 9.1.2 Loading of genome intervals from CHP/BAR file; 9.1.3 Loading of genome intervals from Illumina BED file; 9.1.4 Intervals representation; 9.2 Recombining intervals; 9.2.1 Filtering inte
185. log window is identical to the one shown in Figure 11.14. The resulting gene set contains the genes present in all source sets and several columns with calculated statistical data. A fragment of the result of the Rank Product analysis is shown in Figure 11.18. [Chapter 11: Statistical Analysis of Microarray Data; 11.4 High-level analysis] Figure 11.17: The Empirical Bayes result. [Figure: result table with a gene symbol, a BKL description, and numeric statistics columns for each gene; the rows shown are CCDC11, HHATL, PPP1R10, AMH, and ZFP2.]
186. lue of the match result. Figure 4.6: Output table of Expression analysis. [Figure: table with the columns Term, Gene symbol, Location, Hits in group, Group size, Hits expected, and p-value; the rows shown include leukemia, other tumors, myeloid cells, and lymphoid cells.] 4.3.2 Gene Ontology analysis (public and BKL curated). Each row of the table presents a matched Gene Ontology term. The columns contain, from left to right: the GO identifier with a link to its description; gene symbols; the GO term description; the ontology (in general Biological Process, Molecular Function, or Cellular Component); the number of hits from the input set to the ontology group; the size of the group in the database; the number of randomly expected hits; and the P-value of the observation. Figure 4.7: Output table of Gene Ontology classification. [Figure: table with the columns GO Identifier, Gene symbol, GO Term, Ontology, Hits in group, Group size, Hits expected, and p-value; the rows shown include protein binding (Molecular Function) and biological process (Biological Process).]
187. [Figure: hidden column selector listing Matrix name, Molecule name, Molecule type, Species, TRANSPro ID, and more columns.] An advanced mechanism for hiding, removing, renaming, and sorting columns is available through the Manage columns in gene set menu link in the Data menu. The column management dialog is displayed in the figure below. Figure 1.24: Advanced column management. [Figure: Manage columns dialog listing the available columns of a data set with per-view visibility boxes and these instructions: To rename a column, double-click on a column name and type a new name. To change the column order, drag a column name and place it at the desired position in the list of columns. To hide or reveal a column, change a blue tick to a grey box with respect to the different views (G: genes, Mat: matrices, P: promoters, Mol: molecules, Sum: result summary, and so on). To delete a column in the gene set, move a column line to the right list.] First select an item to manage; note that not all item types allow column management. All available columns will be listed in the column control. Different views are designated in the top line. If you hold the mouse pointer over a button, a hint with the view name will appear. Marked check boxes mean that a column is visible in the corresponding view. You can easily change the visibility by selecti
188. [Figure: dialog for choosing the column with expression values (Fold change), with a Cancel button.] The expression will be displayed as a tube (see the figure below), colored in red for positive values and in green for negative ones. The height of the tube represents the numerical value of the expression. Figure 7.13: Section of a network with expression tubes. 7.4 User-defined interactions. 7.4.1 Loading of user-defined interactions. You have the option to load interactions between molecules into the system and to use these interactions in the search for signaling cascades, key nodes, and network clusters. Select the Interaction sets link from the Load data from file section of the File menu. In the opened dialog window (Figure 7.14), specify the file name and press the button. Only molecules that are present in the current BKL release will be loaded. Figure 7.14: Load interaction profile dialog. [Figure: Load interaction profile dialog with a Destination field (Interactions), a file field with a Browse button, a Cancel button, and this note: Please specify interactions using BioPAX or the simple pipe separator format described below: from molecule | to molecule | <cost> | <activation/inhibition>. <cost> and <activation/inhibition> are optional fields. You can also load an archive file containing any number of interaction profile files.] Supported file formats. Pipe-separated data files: An interactio
189. menu to show a graphical match visualization and various subsets of information from the output table for selected rows. The Site map menu link opens a new window with a report, as described in Section 5.2, for all selected sites together. The Save matrices link creates detailed reports for each of the selected sites. Columns with the number of sites found for each gene will be automatically added to the Yes set. The Sites table displays the formatted MATCH output (see Figure 5.24) for selected matrices and for all promoters used in the analysis. 5.1.8 De novo motifs identification. Use the Seeder algorithm to discover new motifs overrepresented in your dataset. To launch the Seeder dialog window, click the Seeder link in the Analyse menu. As in the case of MATCH (see Section 5.1.2), you can select datasets to analyse and promoter parameters. The length of the DNA sequences to analyse should not be larger than 50000 bp. A suitable input set would be up to 100 promoter sequences, each of 500-600 bp. ExPlain will automatically constrain the promoter window when the number of selected promoters is high. Typically, 50-60 sequences will show well-conserved motifs. If the expression of the corresponding genes is known, a selection of the top 50 upregulated genes will work best. There are no size restrictions regarding the background promoter set, yet a rich background provides good quality and a low false discovery rate. After setting up the required parameter
190. [Figure: CMA dialog with Composite Module fields (number of single matrices and number of pairs of sites as min, avg, max; Distance in pair; Optimize distance in pair; Consider orientation in pair; Size of module; Optimize factors impact), Boolean promoter model fields (Max number of groups, Max number of modules in a group, Allow repressing module), and Advanced Options (No upper limit on FN/FP, Limit FP by, Limit FN by; Run mode: Run once or Run 10 times; Select profile with preferred matrices; Fitness function components: T-test, Error rate, Control normality of fuzzy score, Penalize model complexity, Use regression by column), and a Run CMA button.] Figure 6.14: System model. [Figure: the system model CEBP NFkappaB shown as a promoter model in simple display, pairing the matrices V$CEBP_C and V$NFKB_C, with a TransCompel model link.] Figure 6.15: User-created model. [Figure: a user-created model shown as a promoter model in simple display, combining CREL, CREB, and AP-1 matrices.] Both system and user-created models have the Model menu, where you can edit any model using the Edit link. 6.5 Model editor. ExPlain gives you the opportunity to create a boolean promoter model with any parameters and any structure. The boolean model can be generally represented as a union group1 & group2 & ... where eac
191. n file should contain the following lines: from molecule | to molecule | cost | activation/inhibition. The direction of the reaction is interpreted as from the first to the second molecule. cost and activation/inhibition are optional fields. cost stands for the cost of the reaction (numerical); activation/inhibition can be either 1 or -1, indicating activation or inhibition, respectively. BioPAX data files (<http://www.biopax.org>): some examples can be found here: <http://pid.nci.nih.gov/PID/browse_pathways.shtml>. Supported databases are listed in Table 3.1. 7.4.2 Interaction representation. Molecule identifiers from BKL and identifiers from the loaded file are displayed as pairs (From, To) with corresponding numerical values, if present. 7.4.3 Joining interaction profiles. Once two or more interaction profiles are uploaded, they can be joined together. The Join different interaction sets operation is accessible from the Data menu. Check all interaction profiles you want to join in the dropdown menu and press the button to invoke the process of creating a joined interaction profile. 7.5 The BKL database. BKL is a database on gene regulatory and signaling reactions, pathways, and protein attributes such as expression, phenotype, disease, and drug associations. Elements of the relevant signal transduction Figure 7.15: Inter
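A file in the pipe-separated interaction format can be read with a few lines of code. The sketch below is an illustration under stated assumptions, not ExPlain's loader: it assumes one interaction per line, fields separated by the pipe character, optional third (cost) and fourth (activation/inhibition) fields that may be empty, and the reading 1 = activation and -1 = inhibition. The function name and the record layout are hypothetical.

```python
def parse_interactions(text):
    """Parse lines of the form
        from molecule | to molecule | cost | activation/inhibition
    where the last two fields are optional.
    ASSUMPTIONS: '|' is the separator, blank lines are skipped, and the
    effect is 1 (activation) or -1 (inhibition)."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2 or not fields[0] or not fields[1]:
            raise ValueError(f"line {line_no}: need a source and a target molecule")
        cost = float(fields[2]) if len(fields) > 2 and fields[2] else None
        effect = int(fields[3]) if len(fields) > 3 and fields[3] else None
        if effect not in (None, 1, -1):
            raise ValueError(f"line {line_no}: effect must be 1 or -1")
        # Direction is from the first molecule to the second.
        records.append({"from": fields[0], "to": fields[1],
                        "cost": cost, "effect": effect})
    return records
```

A call such as `parse_interactions("TNF-alpha|TNFR1|1.5|1\nA20|TRAF2||-1")` yields one activating interaction with a cost and one inhibiting interaction without a cost; the molecule names here are only sample input.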
192. n gaps checkbox is checked. The Min run field accepts the minimum length (in bp) of intervals; intervals that are shorter will be excluded. To select intervals within a specific position range, use the filtering by promoter option. Set up start and end positions from the TSS. Only intervals present in the required window will be shown in the result set. 9.2.2 Filtering intervals by gene set. A set of intervals can be filtered by a list of promoters from some gene set. Using the Filter interval by gene set menu option will launch the filtering dialog. Only intervals belonging to promoters from the selected gene set will remain in the result interval set. 9.3 Filtering of MATCH results using intervals. You can use interval sets (Section 9.1) to filter your site search results. [Chapter 9: Genome Intervals (ChIP-chip, TFBS, ChIP-Seq, Tiling arrays)] Figure 9.7: Operating on intervals. [Figure: Filter genome interval set dialog with the fields Interval set to filter and Gene set.] From the Analyze menu, select Filter site search results using interval set. If you are standing on a site search result node in the process tree, it will be chosen in the Source Match result field by default, but you can select any other result available from the drop-down tree list. The list of previously loaded intervals is available in the Interval set to use field (see Figure 9.8). After filtering you will get only binding si
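The two interval filters described at the start of this excerpt (minimum run length and a promoter window around the TSS) can be sketched together. The representation of an interval as a (start, end) pair relative to the TSS, the length computed as end minus start, and the rule that an interval must lie completely inside the window are assumptions for the illustration; the excerpt does not specify these details.

```python
def filter_intervals(intervals, min_run, window_start, window_end):
    """intervals: list of (start, end) positions relative to the TSS,
    negative values lying upstream.
    Keeps intervals that are at least min_run bp long and lie inside
    [window_start, window_end].
    ASSUMPTIONS: length = end - start, and only intervals fully
    contained in the window are kept."""
    kept = []
    for start, end in intervals:
        long_enough = (end - start) >= min_run
        inside_window = window_start <= start and end <= window_end
        if long_enough and inside_window:
            kept.append((start, end))
    return kept
```

For instance, with `min_run=100` and a window from -1000 to 100, an interval at (-500, -300) is kept, a 10 bp interval is dropped as too short, and an interval ending at +900 is dropped because it leaves the window.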
193. n import several data sets simultaneously. ExPlain provides a number of installed gene sets, interval data and profiles.

Note: Human, Mouse and Rat promoters and gene sets are not available in ExPlain Plant, while Arabidopsis, Rice and Soybean promoters are available exclusively in the ExPlain Plant system.

For example, to upload the HUVEC GSE2639 example, select it in the ExPlain predefined data (left) list of the Import data dialog window and press the button. This will launch a process to load the genes. It may take some time to load the data. ExPlain predefined profiles (you can read more on profiles in Chapter 8, Profiles) also appear in the same list and can be imported. These profiles are usually imported by default on the first run of ExPlain and can later be removed or modified. If you need a clean version of one of the system profiles, just import it again.

Figure 3.4: New gene set dialog. The dialog provides a destination folder, a gene set name and a text box for typing or pasting gene identifiers. The form is for identifiers only, so anything unrecognizable will be discarded. If you want to add annotation such as expression values, create a tab-separated file and load it instead. Advanced options: Match accession number by (default: Let system guess).
194. n recognizes only the intervals that fall into the range of 10000 1000 nt from the TSS of the gene. To load your interval data, select the Load intervals (CHP/BAR file) option from the File menu. In the dialog window (Figure 9.2) you can specify files with signal and/or p-value columns. You can specify a feature name and a genome build, or select them to be read from the BED file. A list of all transcription factors from the TRANSFAC database is provided to link it with your data; this information can be further used for filtering of the site search results. Check the option Automatically create subset from the interval to create a set of genes covered by the intervals after loading.

Figure 9.2: Load intervals (CHP/BAR file) dialog. Fields: destination folder, CHP/BAR signal file, p-value file, feature name, genome build and species (selected from a list or read from the file), maximal distance to TSS, and the Automatically create subset from interval checkbox.

9.1.3 Loading of genome intervals from Illumina BED file
ExPlain supports import and analysis of Next Generation Sequencing (NGS) data, such as output from the Illumina Genome Analyzer in BED format. For peak detection the novel Model based Analy
195. n you can require entities to contain, start or end with a particular string. For a numerical (expression) column you can require corresponding entries to be equal, lower, higher or not equal to the specified value. Categorical values can be required to be equal or not equal to the specified category.

As an example, we want to extract non-changed genes from the HUVEC GSE2639 example predefined dataset. We select the expression column named Fold change in the first and in the second condition fields. We seek all entities that have a fold change greater than 0.999 AND lower than 1.001 (Figure 3.10). Subset creation is invoked with the button. After the process is done, we have a new subset with non-changed genes placed under the original set in the project tree, as shown in Figure 3.11. The new set contains 777 genes.

Figure 3.11: Subset created with expression value constraints. Tree nodes shown (genes, matrices, promoters, molecules):
HUVEC GSE2639 example (7985, 505, 15477, 4563)
Fold change > 0.999 and Fold change < 1.001 (777, 123, 1478, 438)
Fold change > 2 (75, 58, 130, 61)

3.3.2 Filter gene set by other gene sets
In ExPlain you can filter a gene set, leaving only those genes and other objects which are present or absent in some other datasets. Choose the Filter by gene sets option in the Data menu. The active tree node will be selected by default in the tree c
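The subset condition used in this example (fold change greater than 0.999 AND lower than 1.001) amounts to a simple range filter. A minimal sketch, assuming each gene is held as a dictionary with a numerical "Fold change" entry; this in-memory representation and the helper name `subset_by_range` are illustrative assumptions, not ExPlain internals.

```python
def subset_by_range(genes, column, lower, upper):
    """Return the genes whose value in `column` is strictly between lower and upper."""
    return [g for g in genes if lower < g[column] < upper]


# Toy data, invented for illustration only.
genes = [
    {"symbol": "G1", "Fold change": 1.0005},
    {"symbol": "G2", "Fold change": 2.4},
    {"symbol": "G3", "Fold change": 0.9991},
]
unchanged = subset_by_range(genes, "Fold change", 0.999, 1.001)
```

Both comparisons are strict, matching the "greater than" and "lower than" conditions chosen in the dialog.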
196. nally, Yes and No values are visualized in the Graphs column at the right, where the red bar depicts the abundance of the PWM motif in promoters of the positive set and the blue bar displays its abundance in the control set. The ratio of the two values is provided in the Yes/No column, where a number greater than one indicates overrepresentation of the motif in the query set. Significance of the representation value is measured by the P-value derived from a binomial distribution. The Matched promoters p-value assesses the statistical significance of the number of promoters in the query set that have at least one predicted site, compared to that of promoters in the control set.

Figure 5.2: Site search analysis output table. Columns: Matrix name; Yes (sites/1000bp); No (sites/1000bp); Yes/No; Graphs; Matched sites p-value; Matched promoters p-value.
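The Yes/No ratio and a binomial tail probability can be sketched as below. The manual states only that the P-value is derived from a binomial distribution; the exact parameterization is not given, so the one-sided test here (k successes out of n trials against a background rate p estimated from the control set) is an assumption, and both function names are hypothetical.

```python
from math import comb


def yes_no_ratio(yes_density, no_density):
    """Ratio of site densities (sites per 1000 bp) in the query and control sets."""
    return yes_density / no_density


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of seeing at least k successes."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

With illustrative densities of 1.6833 and 0.5464 sites per 1000 bp, `yes_no_ratio` returns a value a little above 3, i.e. the motif is about three times as frequent in the query set. For the matched-promoters test, k could be the number of query promoters with at least one site, n the number of query promoters, and p the fraction of control promoters with at least one site.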
197. name of the database and species for matching accession numbers. If you made some changes and want to save them for future use, fill in a name for the preset and click on the button. It is also possible to use a previously saved preset by selecting it in the Load Preset drop-down menu. When an Excel file with several worksheets is loaded, the specified settings can be applied to all other worksheets by clicking on the Copy settings to other sheets button. After pressing the >> Finish importing button, you are redirected to the ExPlain interface and your data is uploaded to the system.

NOTE: Due to pre-processing for further analyses, data transfer may take some time. This is also influenced by the size of the data set or other jobs running on the same server.

The project tree now contains a new process node named after the uploaded file. Through the process monitor you are notified as soon as the data transfer has finished. Once the data loading is completed, the process node will turn into a gene set node.

Figure 3.3: Gene data set in the project tree

Node statistics (number of genes, TRANSFAC matrices, promoters and molecules, respectively) as well as the creation time are shown near the name. Display of this additional information can be adjusted via the Prefer
198. ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=10077610>

CRC Tutorial: Steve Qin, CRC tutorial (10/13/06). URL: http://www.sph.umich.edu/csg/qin/CRC/tutor.pdf

Chinese restaurant clustering: Pitman, J. (1996) Some Developments of the Blackwell-MacQueen Urn Scheme. IMS, Hayward, California.

CRC Similarity measure: Medvedovic, M. and Sivaganesan, S. (2002) Bayesian infinite mixture model based clustering of gene expression profiles. Bioinformatics 18, 1194-1206. PubMed <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstracts&list_uids=12217911>

40. FreeType: Portions of this software are copyright 2008 The FreeType Project <http://www.freetype.org>. All rights reserved.

41. TargetScan <http://www.targetscan.org>
199. nes. To do so, choose the Distribution of values option from the Analyze menu, select a gene set to be used, specify the objects to consider and an expression column, and press the button (Figure 13.3). A graph will appear under the original gene set in the tree. In addition, you have the option to export this graph as an RTF document.

13.2.2 Graphical report on intervals
Using the Graph report on interval feature, you can generate graphs on your interval set displaying the distribution of intervals by distance from the TSS.

Figure 13.2: Generated report (Report: Overrepresented Sites PXE). The report text reads:

Gene set PXE: User file PXE.txt was loaded. Of 193 accessions, 182 were recognized and matched. The remaining 11 accessions were ignored. The resulting gene set contains 188 genes, 16 matrices, 500 promoters and 91 molecules.

Gene set "PXE vs Co_FC > 0 and PXE vs Co_FC < 5": PXE (188, 16, 500, 91) was filtered by the condition PXE vs Co_FC > 0 and PXE vs Co_FC < 5. The resulting gene set contains 84 genes, 0 matrices, 199 promoters and 39 molecules.

Gene set "PXE vs Co_FC < 0 or PXE vs Co_FC > 5": PXE (188, 16, 500, 91) was filtered by the condition PXE vs Co_FC < 0 or PXE vs Co_FC > 5. The resulting gene set contains 106 genes, 16 matrices, 306 promoters and 52 molecules.

Site search results with backg
200. ng a site viewer. You can adjust the scale by moving the slider in the top left toolbar of the viewer. At the highest resolution, individual nucleotides can be distinguished. By moving the slider on the bottom toolbar you can move the window of the viewer along the promoter. Detailed information is available by clicking on a specific site. Exact chromosomal coordinates and the position of the TSS (denoted with dashed vertical lines) are shown on the top panel. Press the button in the top right corner to close the site viewer.

Figure 5.22: Detailed sequence view via site viewer

Furthermore, the table enables users to select individual promoters in the checkbox column. The Site search result menu provides the following options:
Get gene set: saves the selected promoters as a new gene set.
Save report: saves the current match presentation of one or more matrices in the project tree. A new node for the report appears under the current site search node.
Get matrices: creates gene sets of the matrices checked in the matrix legend.
Sequence report: produces a detailed text report on the selected promoters. An example of such a detailed view for the top promoter is shown in the figure below.
The link
201. ng all nucleotide combinations (words) of a given length (usually six). For each word, it calculates the Hamming distance (HD) between the word and its best matching sub-sequence (the substring minimal distance, SMD) in each sequence of a background set. This data is used to produce a word-specific background probability distribution for the SMD. For each word, it then calculates the sum of SMDs to sequences in a positive set. The P-value for this sum is calculated using the word-specific background probability distribution. The word for which the P-value is minimal is retained, and a seed PWM is built from the closest matches to this word found in every positive sequence. The seed PWM is extended to full motif width, and sites maximizing the score to the extended PWM are selected, one in each positive sequence. A new PWM is built from those sites, and the process is iterated until convergence or until a maximum number of iterations is reached.

Input data and parameters: the algorithm takes as input a set B = {B1, ..., Bm} of m background sequences of length L, a set P = {P1, ..., Pn} of n positive sequences of length L, the length k of the motif seed and the length of the full motif to discover.

The SMD distance d(a, b) between a short nucleotide sequence a and a longer sequence b is the minimal HD between a and a |a|-length substring of b.

Background model: a discrete
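The substring minimal distance defined above translates directly into code. The sketch below follows the stated definition (minimal Hamming distance between a word and any substring of the same length); the function names are illustrative, and only the forward strand is scanned, which is a simplification the source does not address.

```python
def hamming(a, b):
    """Hamming distance: number of mismatching positions between equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def smd(word, seq):
    """Substring minimal distance d(word, seq): best match of word anywhere in seq."""
    k = len(word)
    if len(seq) < k:
        raise ValueError("sequence is shorter than the word")
    return min(hamming(word, seq[i:i + k]) for i in range(len(seq) - k + 1))


def smd_sum(word, sequences):
    """Sum of SMDs of a word over a set of sequences (e.g. the positive set)."""
    return sum(smd(word, s) for s in sequences)
```

For instance, `smd("ACGT", "TTACGATT")` is 1, because the closest window "ACGA" differs from the word in a single position. The per-word sum over the positive set is the statistic whose P-value the algorithm evaluates against the background distribution.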
202. ng or deselecting check boxes. To change the column order, select a column and drag and drop it to the desired place in the column list, as shown in the figure. To rename a column, click twice on its name and type a new name in the text box. To remove a column from view, select it and drag it to the right column container (not available for all item types). To cancel this action, drag the column back.

Figure 1.25: Available columns

1.5.6 Customization of the column content display
The way your data is represented in the workspace can be altered individually for each column by pressing the icon which appears when hovering your mouse pointer over the column title, or for several columns via the Rename/customize columns dialog from the Data menu. Figure 1.26 shows options for the four column types available. Only one type will be displayed for each column.

Number format: Here you can select normal (e.g. 0.0015) or scientific (e.g. 1.5e-3) number format, the number of decimal digits and rules for the effective value calculation. The effective value is the one used for the analyses and sorting when a certain entity has several corresponding annotations. For example, when several Affymetrix identifiers with different expression values
203. nthesis and wound healing; human AREG is associated with asthma, multiple myeloma, HIV infection, psoriasis and several neoplasms (8.42203, 6.41408)

Aldh1a3: Aldehyde dehydrogenase 1 family member A3, plays a role in vitamin A metabolism, brain development and camera-type eye morphogenesis; may act in generation of neurons and alcohol metabolism; human ALDH1A3 is associated with psoriasis (7.26606, 0.841571)

Dusp5: Dual specificity phosphatase 5, a phosphoprotein phosphatase that acts in cytokine-mediated and glutamate signaling pathway, inhibits MAPK activity, involved in response to heat and stress; human DUSP5 is associated with thyroid neoplasm (7.11151, 5.70001)

3.4.3 Adding a column with system annotations
ExPlain provides annotation columns containing additional information such as identifiers of external and internal databases, information about molecule sub- or superfamilies, and data from the BioKnowledge Library (BKL) database about genes, molecules and other biological objects. This information can be attached to your gene sets. To do so, choose the System annotation option from the Add columns section of the Data menu, then select an appropriate gene set, choose one or several columns from the list and press the button. If you would like to add columns with system annotation to more than one gene set, select Multi-select mode in the Add column(s) to gene set menu (Figure 3.22).
204. Figure: GSEA result for one gene set, showing the number of hits (38), the list of hit genes and a plot of the running score against the rank in the ordered dataset.

difference of each gene. The P-value assesses the probability of getting a similar or better result by chance and should be restricted with a cut-off. The P-value calculation has 3 cases:

p = 0 and single S testing: The P-value is calculated precisely using a special dynamic programming algorithm that counts all possible cases in which the absolute value of ES is higher than the absolute value of the given ES if list L has another order, and divides the number of such cases by the number of all possible variants of order in L.

p > 0 and single S testing: Expression difference values r_i are reassigned to genes g_i randomly in

Equation 4.7.1: Enrichment Score. ES(S) is the maximum deviation from zero, across i, of the running sum

\[
ES(S, i) = P_{hit}(S, i) - P_{miss}(S, i), \qquad
P_{hit}(S, i) = \sum_{g_j \in S,\; j \le i} \frac{|r_j|^p}{N_R}, \qquad
P_{miss}(S, i) = \sum_{g_j \notin S,\; j \le i} \frac{1}{N - N_H}
\]

where \(N_R = \sum_{g_j \in S} |r_j|^p\), \(N\) is the number of genes in L and \(N_H\) is the number of hits of S in L.
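The running-sum enrichment score can be sketched in a few lines. The weighting below is the standard GSEA form (hits weighted by |r|^p, misses by a constant step); that ExPlain uses exactly this form is an assumption based on the surrounding definitions, and `enrichment_score` is an illustrative name.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Maximum deviation from zero of the running sum over a ranked gene list.

    ranked_genes: genes ordered by expression difference (list L).
    scores:       expression difference r_i for each gene, same order.
    gene_set:     the set S being tested.
    p:            weighting exponent; p = 0 gives equal weight to every hit.
    """
    hits = [g in gene_set for g in ranked_genes]
    n, n_hits = len(ranked_genes), sum(hits)
    if n_hits == 0 or n_hits == n:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    # Normalizer for hits; zero if p > 0 and every hit has score 0 (not handled here).
    n_r = sum(abs(r) ** p for r, h in zip(scores, hits) if h)
    miss_step = 1.0 / (n - n_hits)
    running, best = 0.0, 0.0
    for r, h in zip(scores, hits):
        running += abs(r) ** p / n_r if h else -miss_step
        if abs(running) > abs(best):
            best = running  # keep the signed value with the largest magnitude
    return best
```

A set concentrated at the top of the list yields a positive score close to 1, and one concentrated at the bottom yields a negative score; random permutations of the scores can then be used to estimate a P-value for the p > 0 case.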
205. Figure: ExPlain main window, with the project tree on the left and a gene set table in the workspace (columns: Gene symbol, Description, RefSeq accession, HGNC symbol, TRANSFAC gene). Copyright 2009 by BIOBASE GmbH.

The ExPlain interface has six major components:

Project tree: The project tre
206. oins any of the presently existing clusters or starts a new cluster, taking the current assignment of all other genes into account. b) Assignment of the gene to a cluster according to probability values, and update of variables (e.g. number of clusters). Steps a and b of 2 are carried out for each gene and repeated for a large number of cycles until convergence is reached. Details about the probability value calculation can be found in the article of Qin (2006) [35].

In order to obtain information about how tightly the genes are associated with a certain cluster, the Bayes ratio of the clusters is calculated. For each cluster, two likelihood values are determined: the likelihood that all genes in the cluster follow the same set of normal distributions, and the likelihood that each gene follows a set of normal distributions that differs from that of all other genes. Suppose that the expression levels of N genes from M experiments are collected. The expression data can be denoted as \(X = (x_{ij})\), \(i = 1, \dots, N\), \(j = 1, \dots, M\). Under the assumption that the first \(n_1\) genes belong to the same cluster, the Bayes ratio can be calculated by

Equation 11.5.1: Bayes ratio

\[
\frac{\prod_{j=1}^{M} \iint \prod_{i=1}^{n_1} P(x_{ij} \mid \mu, \sigma)\, P(\mu, \sigma)\, d\mu\, d\sigma}
     {\prod_{j=1}^{M} \prod_{i=1}^{n_1} \iint P(x_{ij} \mid \mu, \sigma)\, P(\mu, \sigma)\, d\mu\, d\sigma}
\]

The cluster tightness given in the ExPlain tree entries is the log Bayes ratio normalized by the number of genes in the respective cluster. Cluster stability is measured using the similarity m
207. olbar icon. Click the Workflow menu link in the View menu. Click on the Switch to workflow mode link on the start page.

The first part of the workflow will upload new data. You can choose to load your data in several common formats, as described in Section 2.2. After the data import is scheduled, you will be forwarded to the workflow screen, where newly loaded data will be preselected for the analyses. If you have already uploaded data on which you want to run one of the workflows, you can skip the data loading and proceed directly to the workflow section (Section 2.3).

2.2 Data loading
This chapter describes various data formats recognized by ExPlain via the Wizard. Choose the most suitable data structure based on the short descriptions present on the workflow icons. Clicking on any icon will expand it to show the list of specific data formats supported. Click on one of the links to upload a data file or copy and paste data manually.

2.2.1 Gene set
If you have a precompiled set of co-regulated genes or sequences, you can upload it using one of the gene set options. We recommend using a set of no more than 1000 genes; the optimal size is between 100 and 200 genes.

Figure 2.1: Data loading page
208. omatically.

Figure 2.3: Gene set formats (Load a gene set: pure list or a table with expression values assigned; options List, Table, Sequences)

List: copy and paste or type a list of identifiers separated by space, tab, comma, end of line, etc., as shown in the figure below.
Table: load a tab-separated text file or Excel worksheet that contains a single column with a list of gene identifiers, one identifier per row. When the table is uploaded you will see it on the screen. You can change column names and types and apply filters to shorten the list. Make sure that the column containing the identifiers is recognized correctly as accession and press the button.
Sequences: load a list of sequences in FASTA or EMBL format.

Figure 2.4: Gene set, List option (text box for typing or pasting gene identifiers)

Figure 2.5: Advanced options (preview of an uploaded table with column types accession, numerical and text, and per-column filters)
209. omized in the next step. If the header section exceeds the range of visible rows, so that the start of the data section cannot be specified in the initial presentation, click on the number 2 in the list of numbers above the data table to view the next entries. The final step involves categorization of the data columns and, if necessary, editing of column titles. Each column can be assigned to one of the following eight categories:

Accession: An accession column contains database accession numbers uniquely identifying the sequences corresponding to the data rows. Accession numbers can be derived from Affymetrix, Entrez Gene, Unigene, any BIOBASE database or others. A complete list of available sequence sources is in Table 3.1.
NOTE: Only one column may be assigned to this primary accession category, while several columns can be assigned to other categories. If you want to define an additional accession column, use the category additional accession.

Additional accession: This category allows you to define a column of the data set as an additional accession column that, besides the primary accession column required for all data sets, serves as a secondary accession set.

Categorical: A categorical column classifies expression values into a discrete set of categories. Typically such categories are integers or strings of one or two letters e
210. omoter models
7 Molecular networks analysis
7.1 Network key node analysis
7.1.1 Key nodes dialog window
7.1.2 Results of key node analysis
7.1.3 Summary of key node results
7.1.4 Key node search algorithm
7.2 Network cluster analysis
7.2.1 Cluster dialog window
7.2.2 Network cluster analysis results
7.3 Network visualization
7.3.1 Adding expression data to the network
7.4 User defined interactions
7.4.1 Loading of user defined interactions
7.4.2 Interaction representation
7.4.3 Joining interaction profiles
7.5 The BKL database
8 Profiles
8.1 Loading profiles
8.2 Creating a new profile
8.2.1 New profile dialog
8.2.2 Example of profile creation
8.2.3 Profile modificat
211. omoters by a set of binding motifs and by rules for their arrangement as single motifs and pairs. Promoter models are optimized with a Genetic Algorithm for maximal discrimination between target promoters and a control set. Models obtained through this method can be used to classify other promoter sequences or to investigate the signaling network upstream of the transcription factors binding to the motifs present in the promoters of a gene set.

6.1 The CMA interface in ExPlain
The CM (Composite Models) genetic algorithm link from the Analysis menu launches a dialog window providing an interface to the CMA program. The second way to activate the CMA dialog is to press the button on the toolbar. In the following example we show how to use the ExPlain interface to obtain CMA models for your target promoters. Section 6.6 explains how to use CMA models for promoter classification.

6.1.1 The CMA dialog window

Figure 6.1: CMA dialog window (Launch CMA, main parameters: Run CMA on promoter set; Use Match output; Use preset; Stop after N iterations; NC limit; Population size). The dialog notes that, to run CMA, you should run Match with a background set first.

The CMA dialog window allows for adjustment of general parameters. CMA retrieves information about binding site predictions and correspon
212. onal Analysis summary
4.6 Functional Analysis algorithm
4.7 Gene Set Enrichment Analysis
4.7.2 GSEA example
4.7.3 Enrichment Analysis theoretical basis
5 Transcription factor site search
5.1 Searching for sites in promoters
5.1.1 How to run Match
5.1.2 Sites search dialog window
5.1.4 Optimization modes
5.1.5 P-Match: combine pattern and matrix search
5.1.6 Searching for sites in promoters using phylogenetic filtering
5.1.7 Pattern-based search
5.1.8 De novo motifs identification
5.1.9 Filtering site search results using intervals
5.2 Detailed graphical report of matrix distribution in promoters
5.2.3 Hiding matrices from view
5.2.4 Detailed promoter view
5.3 Summary set of s
213. ontains information about one pathway. Each pathway is referenced by name (Pathway name column) and id, where the Pathway id column links to the corresponding TRANSPATH entries. In the Molecule name column, the molecules corresponding to a certain pathway are listed. The Hits in group column gives the number of input molecules found in a pathway, while values of the Group size column represent its total number of molecules. Results are sorted by P-value (p-value column). A Visualization column link will open a Flash-based network visualizer, described in Section 7.3. The canonical pathway will be shown with the hits highlighted, as shown in Figure 4.16.

Figure 4.15: Pathways output (columns: Pathway name, Molecule name, Pathway id, Group size, Hits expected, Hits in group, p-value, Visualization)

4.4.3 Functional Analysis with user defined interaction pathways
The User d
214. ontribution toward widening the difference between the positive and negative sets.

TIP: The distance optimization and orientation options can help to make your models more discriminative. For small target set sizes, however, the inclusion of these constraints should be carefully considered if you plan to use your models for classification of new promoters. Small target sets can promote overfitting because they are less likely to be representative of the whole promoter family you wish to model.

TIP: To require a fixed value in single matrix and pair number specifications, you can set minimum, average and maximum for this value, as well as for minimal and maximal spacer lengths in the distance field.

The Boolean promoter model dialog provides options that apply to more general features of CMA models. You can specify the maximal number of groups and the maximal number of CMs per group, as well as the length of the CM sliding window. Furthermore, you can allow for a repressing module. The repressing module is included in the model with the logical operator NOT. Scores of the repressing module should be inversely correlated with the expression values used for separation of the two input sets. This option makes sense for some specific analyses, such as comparison of up-regulated and down-regulated promoters.

The advanced dialog allows you to save a de
215. ontrol lists. Select the set to be filtered and the other sets by choosing absent or present to perform subtraction or intersection, respectively; specify the objects to consider and press the button.

Figure 3.12: Filter gene set by another gene set dialog (Filter dataset by other gene sets: filter set, objects to consider, and a tree of gene sets in which each set can be marked as present or absent)

3.3.3 Join two gene sets
Similarly, you can create a new set with a unique list of elements from several source sets. Open the Join gene sets dialog window from the Data menu. A dialog window with the subsets selected for joining is shown in Figure 3.13. If the gene sets were prepared from the same original data set, you will have the same number of columns in the resulting set. When joining sets of different origin, check Merge native columns when name and type equals to join columns with the same name and type, or leave it unchecked to keep all columns distinct. After pressing the button, ExPlain creates the recombined set as a child node of the
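Filtering by other gene sets and joining gene sets correspond to ordinary set operations: "present" is an intersection, "absent" is a subtraction, and joining produces a unique list of elements. A minimal sketch on plain collections of gene identifiers; the function names and the data representation are illustrative assumptions, not ExPlain internals.

```python
def filter_by_sets(target, constraints):
    """Filter `target` by other sets.

    constraints: list of (other_set, mode) pairs, where mode is
                 "present" (keep only shared elements, i.e. intersection) or
                 "absent"  (drop shared elements, i.e. subtraction).
    """
    result = set(target)
    for other, mode in constraints:
        if mode == "present":
            result &= set(other)
        elif mode == "absent":
            result -= set(other)
        else:
            raise ValueError("mode must be 'present' or 'absent'")
    return result


def join_sets(*gene_sets):
    """Join several sets into one list of unique elements, keeping first-seen order."""
    seen, joined = set(), []
    for gene_set in gene_sets:
        for gene in gene_set:
            if gene not in seen:
                seen.add(gene)
                joined.append(gene)
    return joined
```

Each constraint carries its own mode because the dialog lets you mark every other set individually as present or absent.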
216. or, a transcription regulator that acts in cell cycle, apoptosis, blood vessel patterning, circadian behavior and inflammatory response, protects from bacterial infections; human AHR is associated with breast and prostate cancers (Ahr, Mouse)

AHR (Human): Aryl hydrocarbon receptor, a transcription regulator that acts in cell cycle, apoptosis, blood vessel patterning, circadian behavior and inflammatory response, protects from bacterial infections; upregulated in breast and prostate cancers

Ahr (Rat): Aryl hydrocarbon receptor, a transcription regulator that acts in cell cycle, apoptosis, blood vessel patterning, circadian behavior and inflammatory response, protects from bacterial infections; human AHR is associated with breast and prostate cancers

Arnt (Rat): Aryl hydrocarbon receptor nuclear translocator, a transcriptional activator that acts in angiogenesis, cell fate determination and placenta development; human ARNT is associated with type 2 diabetes, breast neoplasms and leiomyoma

Arnt (Mouse): Aryl hydrocarbon receptor nuclear translocator, a transcriptional activator that acts in angiogenesis, cell fate determination and placenta development; human ARNT is associated with type 2 diabetes, breast neoplasms and leiomyoma

8.5.4 Change cut-offs
In the section Change cut-offs a user has four options:
Minimize false negatives (minFN)
Minimize sums of FP and FN (minSUM)
Minimize false positives (minFP)
217. or any specified number of times. For stochastic optimizers it is often recommended to run the program multiple times, compare the results of all runs afterwards, and settle on a consensus among the different runs. GAs work stochastically and may deliver different results even when invoked with the very same parameters. This is especially true for large, complex optimization tasks, where the algorithm is more likely to end up with different solutions. On the other hand, the runtime typically increases for larger assignments (generally larger populations, more iterations). TIP: For large assignments you can first increase the population size and optimization runtime, and afterwards examine not only the best but perhaps the 20 best models to see whether these deviate strongly from each other, for instance in the motifs that compose them. It is possible to define a set of PWMs which are preferred to be in the modules found. The CMA will search for modules containing at least one of the preferred PWMs as a single matrix, and at least one of the sites for a pair of sites should appear in the selected profile as well. Finally, you can also customize the five fitness components which are described above. Select the desired components using the checkboxes. For instance, if you find that normality of promoter scores is not important at all, you can deselect the corresponding line to make CMA ignore this model property. To save the combination of
218. ose display link above the model. Below the model you are provided with the maximal fitness in the current population, the fitness of the best model(s) and a plot of the best fitness development up to the current iteration. The CMA process can be stopped by the button. The calculation will then finish successfully, and the best model calculated up to that moment will be taken as the result. Parameters of the search as well as the fitness plot of the process can be recalled by expanding the additional information (see Section 1.5.2). You may also modify the stop condition while CMA is running by expanding the Parameters block and changing the parameters there. The Stop after and NC limit parameters have the same meaning as in the CMA dialog. New parameters will take effect only after the button is pressed. Note that if you set the running time to less than the time the CMA has already run, it will stop after the current iteration. After the calculations are completed, you will see a comprehensive report about the best model found. The presentation contains a graphical description of the model with its performance on the dataset and a table with details about the fitness calculation. [Figure 6.4: CMA progress information. The screenshot shows the Processing field with the NC status, the Parameters block and the current promoter model in Simple display.]
219. ositive rate (over-prediction error); (iii) to minimize the sum of both errors. 5.4.7 The Patch algorithm. The Patch algorithm is designed to search for potential transcription factor (TF) binding sites in any sequence which may be of interest. The patterns which Patch uses for searching are the TF binding sites of the TRANSFAC Professional database and the consensus sequences of the weight matrices of TRANSFAC Professional. A number of parameters are used to limit the results of the search. Minimum length: This parameter specifies the minimum length of the sites which are shown in the Patch output. With the default value of 10, only sites of length 10 or more will appear in the output. Please note that the maximum number of mismatches allowed also influences the minimum sequence length. For additional information please refer to the next paragraph. Maximum number of mismatches: It would be more precise to call this parameter the maximum number of local mismatches, as it specifies how many positions may differ when comparing a binding site (search pattern) with some part of the input sequence. A match between the whole site and the input sequence has been found if the actual number of local mismatches is lower than or equal to the maximum number of local mismatches. Please be careful when selecting this value. The maximum length of the sites searched for (search patterns) is restricted to 2 maximal_number_of_mismatches 1 That means
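The matching rule stated above (a site is reported when the number of mismatching positions does not exceed the allowed maximum) can be illustrated with a simplified scan. This is only a sketch under stated assumptions: it treats the search pattern as a plain string, scans one strand, counts mismatches over the whole pattern, and does not model IUPAC ambiguity codes or the length restriction mentioned in the snippet. The function names are hypothetical and do not reflect the actual Patch implementation.

```python
def count_mismatches(pattern, window):
    """Number of positions at which pattern and window differ."""
    return sum(1 for p, w in zip(pattern, window) if p != w)


def find_sites(sequence, pattern, max_mismatches=0, min_length=10):
    """Return (start, matched_text, mismatches) for every accepted match.

    Patterns shorter than `min_length` are skipped, mirroring the
    'Minimum length' parameter (default 10 in the manual).
    """
    if len(pattern) < min_length:
        return []
    hits = []
    for start in range(len(sequence) - len(pattern) + 1):
        window = sequence[start:start + len(pattern)]
        mismatches = count_mismatches(pattern, window)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits
```

Raising `max_mismatches` makes short patterns match almost anywhere, which is presumably why the manual asks for care when selecting this value.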
220. ot shows the shifted and scaled mean intensity from the 5' to the 3' end of the mRNA. Each line in the graph corresponds to one microarray of the analysed data set. Since the RNA is degraded from the 5' to the 3' end, intensities at the 5' end are lower than those at the 3' end. The slopes and shapes of the lines should be similar for all arrays. DNA chips with significantly different profiles and slopes should be carefully examined. High slopes of the plotted lines indicate poor quality due to RNA degradation or other factors that cause systematically elevated intensities at the 3' end compared to the 5' end. 11.4 High level analysis. High level analysis allows you to identify differentially expressed genes. Four high level analysis techniques for studying single data sets are available in ExPlain: ANOVA, Fold change, Empirical Bayes and Generalized Linear Model. [Figure 11.10: Kernel density plot of the probe intensities for a data set that is of bad quality (x axis: log intensity; y axis: density; one curve per CEL file).] Although fold change is presented as a separate technique, it is also computed during each of the other high level analysis steps. In the high level analysis input dialog (Figure 11.12) the Factor assignment chosen at the time of loading the CEL fi
221. ov entrez query fcgi cmd Retrieve amp db pubmed amp dopt Abstract amp list_uids 16381825 gt TRANSFAC lt http www biobase international com pages index php id 40 gt TRANSCompel M lt http www biobase international com pages index php id 112 gt 2 Composite Module Analyst CMA Kel A Konovalova T Valeev T Cheremushkin E Kel Margoulis O and Wingender E Composite Module Analyst a fitness based tool for identification of transcription factor binding site combinations Bioinformatics 22 1190 1197 2006 PubMed http www ncbi nlm nih gov entrez query fcgi cmd Retrieve amp db pubmed amp dopt Abstract amp list_uids 16473870 gt CMA lt http www gene regulation com pub programs html CMAnalyst gt 3 TRANSPATH Krull M Pistor S Voss N Kel A E Reuter I Kronenberg D Michael H Schwarzer K Potapov A Choi C Kel Margoulis O V and Wingender E TRANSPATH M An Information Resource for Storing and Visualizing Signaling Pathways and their Pathological Aberrations Nucleic Acids Res 34 D546 D551 2006 PubMed http www ncbi nlm nih gov entrez query fcgi cmd Retrieve amp db pubmed amp dopt Abstractklist u1ds 163981929 TRANSPATH M lt http www biobase international com pages index php id 39 gt 4 CYTOMER Michael H Chen X Fricke E Haubrock M Ricanek R and Wingender E Deriving an ontology for human gene expression sources from the CYTOMER database on human organs and cell t
222. oving some reactions and want the network to look in a more pleasant way. To map the network data to the CSML data format (Cell System Markup Language), use the csml1.9 or csml3.0 button, according to the required format version. The GIF button shows the network as an image in a new window. This image can then be saved to your machine. [Figure 7.11: Gif image.] 7.3.1 Adding expression data to the network. The Map gene expression option of the visualized network specific menu provides the ability to add expression information to the network. In a dialog window (Figure 7.12) you can add up to three expression columns from any data set to the visualized network. Select a tube number, a set and an expression column and press the button. To add more than one expression column, launch the dialog again or press the button. In the latter case the window will stay open and you will be able to make additional entries. If the selected tube was already assigned to some expression column, it will be reset to the new one. [Figure 7.12: The dialog window Map gene expression data on the network. Fields: Visualization you want to add a tube to; Tube number you want to use (you can add up to three expression tubes or replace any existing); Data set to get expression from.]
223. p row. Data rows on following pages are available through a checkbox in the bottom row. As a convenience, checkbox columns provide range selection and deselection, so that the status of the last modified checkbox can be transferred to a whole range of rows by pressing the Shift key while selecting the top or bottom row of the desired range. Marked data rows can then be extracted to a new subset node by pressing the appropriate menu link. [Figure 1.26: Customization of the column content display (Rename or customize column). Options: Source; New name; Description; Text format (Merge duplicate strings); Number format (Format: Auto; Decimal digits: Auto; Effective value when several values found: Mean; Effective value when no values found: 0); Category format (Merge duplicate categorical values; Separator: new line); DB link format (Merge duplicate IDs and sort alphabetically; Show all IDs in the original order; Show 0 items separated by comma).] Marking options right above the table provide additional possibilities. You can select all rows on the current page or, if the table has more than one page, all rows in the table, clear all selections, and invert the selection by turning selected rows to deselected and vice versa. If an action is applied with no rows selected, all rows from the table will be used as an input
224. parameters, select the save option and edit a name for the parameter set as described above. 6.1.3 The CMA output. We will consider the MATCH output created in Section 5.1.3. After having searched up-regulated genes with non-changed genes as a negative set, we would like to see whether we can find reasonable composite elements, i.e. pairs of factors that may act synergistically on promoters of up-regulated genes. The composite module parameters allow from zero to three single matrices, with two matrices on average, and one matrix pair. Instead of the CMA default spacer range of three to thirty, we allow a range of zero to forty and let the program optimize the pair spacer. We further allow one group, containing maximally five non-mutually-exclusive composite modules, with a window length of 100 nucleotides. Other parameters take their default values. We let the program optimize for maximally 20 minutes with an NC limit of 200 iterations and a population of 1000 solutions. When you select a node of a running CMA process in the project tree, you can inspect the current status of iterations, best model and fitness. The Processing field provides a progress bar followed by the portion of iterations already processed as well as the status of the NC limit. Below the Processing field, the best model calculated up to the moment is shown. You can save the current model using the Save model menu link or view the extended description by clicking on Verb
225. port link. This sequence (or sequence set, if you upload more than one) is now available in ExPlain for further use. 10.2 Sequences in ExPlain: example. [Figure 10.5: Sequence item (Sequence Y00483, human gene for glutathione peroxidase). Fields: Description, Gene symbol, Sequence length, Species, Promoters (Promoter ID, TSS), Nucleotide sequence.] Chapter 11 Statistical Analysis of Microarray Data. This chapter describes the statistical tools provided within ExPlain for the analysis of microarray data from Affymetrix CEL files. The following four steps are described in detail below: Loading of CEL files; Assignment of factors and levels; Low le
226. profiles Xin Xiz Aima Xo Xis Xim and i3 Xis Xia of gene i will be compared with each of the existing clusters to determine if there is a cluster that fits one of the profiles well. When the clustering process is completed, the clusters appear as separate gene sets in the ExPlain tree (Figure 11.21). Beside the name of the cluster and the parameters used for CRC, two values are given in brackets in the tree entry. The first value is the cluster tightness, which is a measure of the homogeneity of the genes within the cluster. This value is followed by the cluster stability measure, which reflects the similarity of the genes in the cluster. A high stability value means that the genes within the cluster are closely related. The generated gene sets contain an additional column listing the posterior probability calculated for each gene, so that after the clustering process has been completed the genes belonging to a cluster can be filtered by their probability of belonging to the cluster. [Figure 11.20: CRC input dialog window (CRC clustering, select gene set). Fields: Gene set for CRC clustering; Select columns; Use 1000 most differentially expressed genes; Posterior probability threshold (0.7); Maximal shift size (0).] Figure 11.21: Section of the results of a CRC clustering anal
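Filtering a cluster by the posterior probability column, as described above, amounts to a simple threshold on that column. The sketch below assumes each gene is a dictionary with a hypothetical `posterior` key; the default of 0.7 is the value visible in the CRC input dialog, and treating the threshold as inclusive is an assumption.

```python
def filter_cluster_by_posterior(genes, threshold=0.7):
    """Keep genes whose posterior probability of cluster membership
    is at least `threshold` (inclusive comparison is an assumption)."""
    return [gene for gene in genes if gene["posterior"] >= threshold]


# Illustrative cluster with made-up probabilities.
cluster = [
    {"symbol": "geneA", "posterior": 0.95},
    {"symbol": "geneB", "posterior": 0.70},
    {"symbol": "geneC", "posterior": 0.42},
]
confident = filter_cluster_by_posterior(cluster)
```

Raising the threshold after clustering gives a smaller, more reliable core of each cluster without rerunning CRC.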
227. promoters. Clicking on a picture of a promoter in the Picture area provides a detailed view of the promoter with the site viewer (see Section 5.2.4). Similar to the Match result, additional columns are added to represent the score contribution of each model component (matrix or pair). If there are no sites for a given component on some promoter, a gray zero is displayed. The sequence score is calculated as follows: for each module, the sum of the component scores within it is calculated; then, for each group, the minimal score is selected (fuzzy AND operator); finally, the maximal group score is selected (fuzzy OR operator). If a repressing group is present, its score is subtracted from 1 (fuzzy NOT operator). [Figure 6.11: Promoter description in the model report table. The screenshot lists the model components (matrices and matrix pairs) with their score contributions, followed by the promoter table with TRANSPro ID, picture, sequence score, gene symbol and description.]
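The fuzzy AND / OR / NOT combination described above can be written out in a few lines. This is a sketch under assumptions: a model is represented as a list of groups, each group as a list of modules, and each module as a list of component scores for one promoter; the minimum is taken over the modules of a group; and a repressing group takes part in the final maximum after inversion. The data layout and function names are illustrative, not CMA's internal representation.

```python
def module_score(component_scores):
    """Score of one module: sum of its component scores."""
    return sum(component_scores)


def group_score(modules, repressing=False):
    """Fuzzy AND over the modules of a group (minimum).

    For a repressing group the result is inverted (fuzzy NOT: 1 - score).
    """
    score = min(module_score(m) for m in modules)
    return 1.0 - score if repressing else score


def sequence_score(groups):
    """Fuzzy OR over all groups (maximum).

    `groups` is a list of (modules, repressing) tuples.
    """
    return max(group_score(modules, repressing) for modules, repressing in groups)


# One activating group with two modules and one repressing group.
example = [
    ([[0.2, 0.3], [0.4]], False),  # module sums 0.5 and 0.4 -> AND gives 0.4
    ([[0.9]], True),               # NOT gives about 0.1
]
score = sequence_score(example)
```

Using the minimum for AND means a promoter only scores well in a group when every module of that group is supported.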
228. ract Up/Down/Non-change dialog. [Dialog screenshot: Extract Up/Down/Non-change. Fields: Gene set; Objects to consider; Expression column; checkboxes for Up-regulated genes, Down-regulated genes, Up and down regulated genes, and Non-changed genes.] with expression between 0 and 10 and one gene with expression value 20, then the graph will be scaled to display expression values between 0 and 10 only. Below the graph you can specify which sub-sets you want to create by using the checkboxes. At the right side you can define how the genes or other objects will be distributed among the sub-sets. There are two ways to define this: either by specifying cut-off values or by entering the number of objects that will be selected. In the cut-off values mode you can specify the minimal expression value for the up-regulated sub-set, the maximal expression value for the down-regulated sub-set, and the mean value and range of expression values for the non-change sub-set. In the number of objects mode (see picture below) you can specify the number of genes or other objects which will be included in the up (NU), down (ND) and non-change (NNC) sub-sets. For the non-change sub-set you should also specify the mean expression value (ENC). Thus the up-regulated sub-set will contain exactly NU genes or other objects with the highest expression, the
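The two selection modes described above can be sketched as follows. Several details are assumptions, because the snippet is truncated: the cut-off comparisons are taken as strict, the non-change "range" is treated as a half-width around the mean, and in the number-of-objects mode the non-change sub-set is taken as the NNC objects closest to ENC. The function names and the dictionary input are illustrative only.

```python
def split_by_cutoffs(expression, up_min, down_max, nc_mean, nc_range):
    """Cut-off mode: `expression` maps gene -> expression value."""
    up = [g for g, v in expression.items() if v > up_min]
    down = [g for g, v in expression.items() if v < down_max]
    # Assumption: 'range' is the allowed distance from the mean.
    non_change = [g for g, v in expression.items() if abs(v - nc_mean) <= nc_range]
    return up, down, non_change


def split_by_counts(expression, n_up, n_down, n_nc, nc_mean):
    """Number-of-objects mode: NU highest, ND lowest, NNC nearest to ENC."""
    ranked = sorted(expression, key=expression.get, reverse=True)
    up = ranked[:n_up]
    down = ranked[-n_down:] if n_down else []
    # Assumption: non-change objects are those closest to the mean ENC.
    nearest = sorted(expression, key=lambda g: abs(expression[g] - nc_mean))
    return up, down, nearest[:n_nc]


values = {"a": 3.0, "b": 0.1, "c": 1.0, "d": 2.5, "e": 0.9}
by_cutoff = split_by_cutoffs(values, up_min=2.0, down_max=0.5, nc_mean=1.0, nc_range=0.2)
by_count = split_by_counts(values, n_up=1, n_down=1, n_nc=2, nc_mean=1.0)
```

The count mode is useful when downstream analyses need sub-sets of a fixed size, whereas the cut-off mode keeps a fixed biological threshold.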
229. rat genes are abstracted to ortholog level to be mapped. GO annotation (BKL manual curation): GO groups are terms of the respective Gene Ontology hierarchy, manually curated in the BKL. GO annotation (public): Gene Ontology hierarchy from files submitted by GO Consortium members to Gene Ontology (http://www.geneontology.org). Organ/Tissue expression (Cytomer): Genes associated to different organs or tissues of the human organism or, in the case of mouse and rat genes, matched to the closest ortholog. Proteome BKL Disease View: Genes are associated to diseases annotated in the BKL database. Genes are classified as connected to diseases in correlative, preventative or causal relationships. For mouse and rat they are matched to corresponding orthologs. SwissProt keywords: Genes are matched to terms from the UniProt Knowledgebase that include widely accepted biological ontologies, classifications and cross-references. Transcription Factor Classification: Genes are mapped to a classification of transcription factors (human, mouse or rat) annotated in the TRANSFAC Professional database. TRANSPATH Molecule Classification: Genes are mapped to a classification of molecules (human, mouse or rat) annotated in the TRANSPATH Professional database. GRO Plant Growth Stages (rice only): Genes are matched to terms of the Plant Growth Stages hierarchy. PO Plant Structure and Growth Stages (rice and Arabidopsis): Genes are matched to terms of the Plant Stru
230. rays to experiment or control groups. Later on in the workflow, ExPlain will analyze genes differentially expressed between these 2 groups. If you want to analyse several comparisons, for example treatment 1 vs control, treatment 2 vs control and treatment 1 vs treatment 2, create several factors, marking for each certain arrays as experiment or control. When the assignment is done, press >> Next. You will then see the parameters for the statistical processing of the CEL files. In most cases you can leave the default values as they are. FC filtering options indicate the threshold that will be used to define up- and down-regulated sets. 2.3 Workflow mode. The full upstream analysis (common regulators in signalling network) workflow will schedule a number of processes to find over-represented functional groups in your gene sets and to predict relevant transcription factors, upstream regulators and affected pathways. First you have to select the analyzed (Yes) and background (No) sets. If you went through a data loading step in the workflow, the results of the previous step will already be preselected. In the form below Figure 2.7: Factor level assignment in workflow CEL files loading. Dialog text: Mark the array groups that should be used for comparisons. For example, to analyse treatment1 vs treatment2, create factor comparison1 and mark all treatment1 arrays as experiment an
231. resource for mouse biology Nucleic Acids Res 33 D471 D475 2005 PubMed http www ncbi nlm nih gov entrez query fcgi cmd Retrieve amp db pubmed amp dopt Abstract amp list_uids 15608240 gt MGI http www informatics jax org Rat Gene Nomenclature Comittee RGNC RGNC http rgnc gen gu se RGNChem html EMBL Cochrane G Aldebert P Althorpe N Andersson M Baker W Baldwin A Bates K Bhattacharyya S Browne P Van Den Broek A Castro M Duggan K Eberhardt R Faruque N Gamble J Kanz C Kulikova T Lee C Leinonen R Lin Q Lombard V Lopez R McHale M McWilliam H Mukherjee G Nardone F Pastor M P Sobhany S Stoehr P Tzouvara K Vaughan R Wu D Zhu W and Apweiler R EMBL Nucleotide Sequence Database developments in 2005 Nucleic Acids Res 34 D10 D15 2006 PubMed http www ncbi nlm nih gov entrez query fcgi cmd Retrieve amp db pubmed amp dopt Abstract amp list_uids 16381823 gt EMBL http www ebi ac uk embl GenBank Benson D A Karsch Mizrachi I Lipman D J Ostell J Wheeler D L GenBank Nucleic Acids Res 33 D34 D38 2005 PubMed http www ncbi nlm nih gov entrez query fcgi cmd Retrieve amp db pubmed amp dopt Abstract amp list_uids 15608212 gt GenBank lt http www ncbi nih gov Genbank gt DDBJ Okubo K Sugawara H Gojobori T and Tateno Y DDBJ in preparation for overview of research activities behind data submissions Nucleic Aci
232. rithm; log2(x): binary logarithm; log10(x): logarithm to the base 10. Trigonometrical functions: sin(x): sine; cos(x): cosine. Aggregate functions: sum(x): sum of all values in the column; avg(x): average value in the column; max(x): maximal value in the column; min(x): minimal value in the column. Parentheses are used to group parts of the formula. Columns suitable for calculations are listed in the Available columns control of the Calculate column dialog, which can be called via the Compute from existing columns item of the Data menu. To use a column in a formula, you can type the column denotation (the sign and the number) or click on the column name in the list. The corresponding column symbol will then appear in the formula line. The formula should consist of column names, numbers and the functions listed above, with the symbol x replaced with a column denotation number. Numbers can be in normal format (123.456) or in scientific format (1.23456e+02). In the example shown in Figure 3.18, we selected the Fold change column, denoted as 1, and applied a logarithmic transformation to the absolute values of the column. As a result we obtain a dataset with the original column Fold change and a new one, log2 Fold change, with the transformed fold changes. Figure 3.19 shows a part of the data table with the newly added column. Figure 3.18
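The worked example above (a binary logarithm applied to the absolute values of the Fold change column) can be reproduced outside ExPlain as a quick check of the expected numbers. The sketch assumes rows are dictionaries keyed by column name; the function name is hypothetical, and the exact ExPlain formula syntax is not shown here because the column sign is lost in this snippet. Returning `None` for a zero fold change is a choice made for this sketch.

```python
import math


def add_log2_abs_column(rows, source="Fold change", target="log2 Fold change"):
    """Return copies of `rows` with an added column log2(|source|).

    A zero value has no logarithm, so None is stored in that case.
    """
    result = []
    for row in rows:
        new_row = dict(row)
        value = row[source]
        new_row[target] = math.log2(abs(value)) if value != 0 else None
        result.append(new_row)
    return result


table = [{"Fold change": 4.0}, {"Fold change": -0.5}, {"Fold change": 0.0}]
transformed = add_log2_abs_column(table)
```

The original column is kept next to the new one, matching the resulting dataset described in the text.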
233. rix. The last row displays the consensus sequence logo. General matrix information is displayed above the table, showing the matrix accession ID, matrix quality, window size and the transcription factors the matrix is associated with (see next paragraph on factors changing). An accession ID is given by the ExPlain or BKL matrix generator for imported matrices. Cut-off values and FP frequency are displayed below the table. [Figure 8.14: Weight matrix creation dialog (Matrix Preview, Create weight matrix). The dialog asks for a name for the new matrix, here DR3 user defined; a name can contain letters, digits, underscores and hyphens, anything else will be discarded.] [Figure 8.15: Weight matrix (User matrix DR3 user defined; Accession ID X18293; Window size 8; Binding factors VDR). Below the count table and consensus row, the screenshot lists cut-off values to minimize false negative matches, to minimize false positive matches, and to minimize the sum of both error rates.]
234. rized in Figure 5.27. The window of 1000 nt length, signified by the red bars, is slid along the genomic sequence segment, and the sum of scores of all evidence points within the window is computed. Ensembl and DBTSS entries receive a maximal score of 1, whereas EPD TSSs obtain a maximal score of 50. Scores are multiplied by a penalizing distance factor: a score in the center of the window is multiplied by 1, and the factor diminishes to 0 according to a cosine function the greater the distance to the center. In the figure, 3 evidence points within the window are indicated by green and yellow bars. The cosine is exemplified by the orange curve. Finally, a sum-of-evidence-scores histogram peaking at the current window position is shown by blue bars. [Figure 5.27: TSS definition in TRANSPro (1000 nt window over a genomic sequence).] 5.4.3 Cut-off values. In MATCH, cut-off values are separately defined for core and matrix similarity values [5]. The matrix similarity is a score that describes the quality of a match between a matrix and an arbitrary part of the input sequences. Analogously, the core similarity denotes the quality of a match between the core sequence of a matrix, i.e. the five most conserved positions within a matrix, and a part of the input sequence. A match h
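The sliding-window evidence score described above can be sketched as below. The manual only states that the distance factor is 1 in the window center and falls to 0 according to a cosine function; the concrete form `cos(pi/2 * d / half_window)` is an assumption that satisfies both end points, and the function names and the (position, score) input format are illustrative.

```python
import math


def window_score(center, evidence, window=1000):
    """Sum of evidence scores inside the window, weighted by a cosine
    factor that is 1 at the center and 0 at the window edge (assumed form).

    `evidence` is a list of (position, score) pairs, e.g. score 1 for an
    Ensembl or DBTSS entry and up to 50 for an EPD TSS.
    """
    half = window / 2
    total = 0.0
    for position, score in evidence:
        distance = abs(position - center)
        if distance <= half:
            total += score * math.cos(math.pi / 2 * distance / half)
    return total


def best_window_center(positions, evidence, window=1000):
    """Candidate position with the highest weighted evidence sum."""
    return max(positions, key=lambda c: window_score(c, evidence, window))


points = [(100, 1.0), (120, 1.0), (900, 1.0)]
peak = best_window_center(range(0, 1001, 10), points)
```

Because of the cosine weighting, closely clustered evidence points pull the peak between them, while isolated points far from the center contribute little.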
235. rom selected rows link results in a subset containing all nodes from the whole network. Figure 7.3 shows a part of the network of the top key node, SHP1. The key node is shown in pink, input molecules are indicated in blue, and the nodes in between are displayed in brown. Reaction types are represented by colored arrows and colored squares: a red square indicates inhibition, whereas a green square stands for activation. Gray arrows indicate semantic associations. Read Section 7.3 for more information about the Flash-based network visualization tool. 7.1.3 Summary of key node results. The results of several key node analyses can be combined in a summary table using the option Keynodes summary from the Merge and summarize section of the Data menu. In the create summary set dialog window (Figure 7.4) the key node results to be compared can be selected along with the [Figure 7.3: Network visualization.] [Figure 7.4: Create network summary dialog (Create summary set). Fields: Choose network analysis results; Choose columns to include in summary.]
236. round F match vertebrate non redundant. [Report screenshot: Sites search was performed on the promoters of the selected "PXE vs Co" fold-change subset, from 1000 bp upstream to 100 bp downstream, using the vertebrate non redundant minSUM profile. Background frequencies were calculated based on the promoters of a second "PXE vs Co" fold-change subset. After the search, matrix cut-offs were optimized. Only best supported promoters were used. The following significant matrices were found according to the condition P-value < 1e-11 and Yes/No > 1.2:]
Matrix name    Yes      No       Yes/No   P-value
V$POU3F2_02    4.1786   1.8558   2.2517   1.0544e-20
V$TBP_Q6       4.4405   2.4231   1.8326   3.7500e-14
VERAI DT       3.5714   1.9135   1.8665   3.1734e-12
[Figure 13.3: Generating graph on Up/Down regulated/Non differentially expressed genes (Generate Up/Down/Non-change graph report). Fields: Gene set; Objects to consider; Expression column.] Launch the graph report dialog using either the Generate graph on interval link from the Analyze menu, or the toolbar button, or the link from the Interval menu when standing on an interval node in the process tree. As shown in the dialog depicted in Figure 13.4, you may specify the interval set you want to analyze and optionally specify a gene set to filter your interval set. When specifying this g
237. rows which are located above the promoter line. If the matrix is identified on the positive strand of the promoter sequence, the arrow points to the right; otherwise the arrow points to the left. Transcription start sites are indicated by bent black arrows at position 1 of each promoter. The Number of sites column shows how many sites of the reported matrices were found during the analysis. Columns with system information (chromosomal position, short description, species name) as well as numerical columns from the main or background gene set, depending on the current view, are available in the hidden columns list. [Figure 5.19: Sites visualization. Columns: TRANSPro ID, Picture, Number of sites, Gene symbol.] Additional annotation columns with system information (chromosomal position, short description, gene accession number) as well as numerical columns with the number of sites for each matrix are available to the right of the graphical display. [Figure 5.20: Sites visualization for selected PWMs, annotation columns. Columns: Probe set, TSS, Description, and one site-count column per matrix.]
238. rrors (high sensitivity), while in the area of low FP errors MATCH performs better in site recognition accuracy. To launch the P-Match dialog window, click the P-Match menu link in the Analyse menu. As is the case for the MATCH algorithm, the dialog provides fields to select a profile of TRANSFAC matrices for the search, the cut-off level for matrices and the promoter window, and to specify a rule for promoter selection if several are available for an individual gene. After setting up the required parameters, press the button to start the analysis. NOTE: Only matrices that have a defined collection of sites will be used for the search, so the number of matrices in the result can be less than in the profile. [Figure 5.7: P-Match dialog window (Run P-Match). Fields: Yes set; No set (background); Profile; cut-offs; Promoter window; rule for genes with multiple promoters.] The output table and further analysis options for the P-Match results are the same as for MATCH (see Section 5.1.3). 5.1.6 Searching for sites in promoters using phylogenetic filtering. MATCH combined with the footprint method uses homology information between different species (Human, Mouse and Rat in our
239. rt gene sets to BKL option from the File menu, then select one or more gene sets to be exported, specify at least one entity from the list and press the button (Figure 3.24). All exported sets will appear in your Search Results in the User data section of the BKL page. 3.5.1 Exporting selected genes as BKL search result. You can export selected genes from ExPlain to the BioKnowledge Library (BKL) as BKL search results. To do so, click on the appropriate gene set in the tree, mark the genes you want to export and then choose the BKL search result option from the Gene set menu. All exported genes will appear in your Search Results in the User data section of the BKL page. [Figure 3.24: Exporting gene sets to BKL dialog (Export subsets to Biobase Knowledge Library). Select several gene sets and specify the entities to export, at least one of: Matrices, Promoters, Molecules.] 3.6 Statistics calculator. You can use the integrated statistics calculator to run classical statistical methods on your data. To do so, choose the Statistics calculator option from the Anal
240. rvals by conditions
9.2.4 Filtering intervals by gene set
9.3 Filtering of MATCH results using intervals
9.4 Theoretical background of Illumina BED files processing
10 Sequences
10.1 Loading sequence data into ExPlain
10.1.1 Load sequence data from a file
10.1.2 Supported file formats
10.1.3 Copy/Paste sequence data
10.2 Sequences in ExPlain: example
11 Statistical Analysis of Microarray Data
11.1 Loading of CEL files and assignment of factor level information
11.1.1 Factor level assignment
11.2 Low level analysis
11.3 Quality control
11.4 High level analysis
11.4.1 Meta-analysis
11.4.2 Statistical analyses of the gene expression data
11.4.3 The Rank Product algorithm
11.5 CRC clustering
11.5.1 Algorithm details of CRC clustering
12 miRNA analysis
12
241. s whereas the current best solution is additionally taken to the new population directly, without modification. Hence, a single iteration consists of creating a new population based on previous results and testing the performance of each solution. The initial population may be created either with randomly constructed solutions or with roughly estimated parameters. Like any stochastic optimization method, GAs require an externally defined termination criterion; the simplest one is to optimize for a certain number of iterations or a fixed time interval. CMA implements a five-component fitness function to evaluate the performance of its models. The fitness function components are described below. Each component enters with a specifiable weight to obtain the fitness of a model. CMA FITNESS FUNCTION COMPONENTS. R: The R component measures, by linear regression, how well CM scores fit the expression values (integers representing categories, e.g. NC = 0, or continuous numbers, e.g. fold change). T: The T component uses the t-test to assess the statistical significance of the difference between the distributions of fuzzy scores derived from the Boolean function for each of the two promoter sets. E: The E component controls the error rate of a model. False positive and false negative errors are considered, both derived from classification performance on the two promoter sets. N: The N component controls the normality of fuzzy scores (also used in the T componen
242. s, TSS chromosome location, gene symbol, promoter ID and statistical values are given on every line. An example is shown in Figure 9.4. The corresponding intervals are accessible through the link in the Entry count column. Clicking on the entry count number opens a new window (Figure 9.5) that shows all intervals for the current promoter, with their values and relative gene positions. The picture above the intervals graphically represents the distribution of score and/or p-value, if they are present in the initial data. Figure 9.3: Load intervals (Illumina BED file) dialog, with fields for the destination, the treatment file and optional control file, build and species, maximal distance to TSS, and the MACS algorithm parameters (p-value cut-off and MFOLD). Figure 9.4: Genome intervals dataset example (set of known TF binding sites, ChIP-on-chip data; columns: Gene symbol, Description, TRANSPro ID, TSS, Species, Entry count).
243. s press the button to start the analysis. The resulting table contains the selected number of motifs with the names of the corresponding matrices. Matrices are saved as separate tree nodes below the Seeder result. When the cut-off calculation procedure is finished, the newly created matrices can be included in a profile and used for a MATCH search. 5.1.9 Filtering site search results using intervals. The Filter Match results dialog window appears after clicking on the Filter sites using intervals link in the Analyze menu. This dialog provides fields for selecting the site search result that you want to filter. All previously uploaded intervals (see Section 9.1) are available through the Interval set to use drop-down box. After filtering, only binding sites that are found within an interval remain. You can choose to keep only binding sites located in an interval that is linked to the transcription factor binding this site. It is also possible to extend the intervals using the Expand interval by field. Filtering of the results obtained in the example mentioned above with a predefined interval set Figure 5.12: Seeder dialog window (Search for motifs), with fields for the Yes set, the No (background) set, motif width, number of motifs to find and promoter window.
244. 3.3.5 Extract Up/Down/Non-change. You can easily create up to four subsets from your gene set, representing up-regulated genes, down-regulated genes, genes with unchanged expression, and genes with changed expression (both up and down in the same set). You could obtain the same result by applying the Filter gene set by condition dialog several times, once for each resulting subset, but Extract Up/Down/Non-change does this in one single step. This dialog also displays a histogram of the distribution of expression values, so you can use it just to view this distribution without creating any subsets. First, select the gene set you want to create subsets from and the objects that will be used, as described above in Section 3.3.1. Next, specify the column that will be considered as the expression column. After that you will see the histogram, which reflects the distribution of expression values. The X axis represents the expression value, while the Y axis represents the number of objects in the selected view (the number of genes, if you selected genes before) having an expression value lying in the specific interval. The number of bars as well as the bar width is selected automatically. Also, up to 1% of minimal and 1% of maximal points may not be displayed if they are too far from the others. For example, if you have 99 genes Figure 3.15: Ext
245. s about 61 000 mouse promoters from TRANSPro, extracted with the same conditions as the human promoter set. Rat promoters (TRANSPro 6.2): This set contains over 21 000 rat promoters from TRANSPro, extracted with the same conditions as the human promoter set. ExPlain Plant only: Arabidopsis promoters (TRANSPro 6.2): Available in ExPlain Plant, this set contains over 27 000 Arabidopsis promoters from TRANSPro. Rice promoters (TRANSPro 6.2): Available in ExPlain Plant, this set contains over 29 000 rice promoters from TRANSPro, extracted with the same conditions as the Arabidopsis promoter set. Soybean promoters (TRANSPro 6.2): Available in ExPlain Plant, this set contains over 68 000 soybean promoters from TRANSPro. Chapter 4 The Functional classification. This chapter describes the Functional Analysis (FA) and Gene Set Enrichment Analysis (GSEA) algorithms. The FA module allows you to identify statistically relevant classification terms, including Gene Ontology [22], disease associations and specific organ/tissue expression annotations. The GSEA module enables you to explore statistical enrichment of functional annotations in your gene sets, taking into account the expression value, fold change or other numerical column associated with them. These analyses allow you to extract subsets of functionally related genes and to investigate the biological properties of any gene list. Furthermore, you can use these algorithms to compare a g
246. s located in the vicinity of a transcription start site (TSS). The extraction of reliable promoter sequences is usually a difficult task, demanding a huge amount of tedious manual work. To relieve this situation, the TRANSPro database was introduced as a module of the TRANSFAC suite, adding extensive annotation for upstream (5') sequences of human, mouse and rat genes. The emphasis is on the elements involved in gene regulation. Underlying sequence databases: TRANSPro is based on genomic sequence assemblies from the international sequencing consortia in the Ensembl database. Promoter sequences are extracted only for those genes for which both an Entrez Gene ID and a nomenclature accession number (HGNC for human, MGI for mouse and RGNC for rat) are defined. 5.4.2 Computational definition of transcription start sites. TRANSPro integrates TSS data from EPD [24], DBTSS [25] and Ensembl to derive virtual TSSs as reference points. DBTSS entries are assumed to be the first nucleotide of the one-pass mRNA sequences, whereas Ensembl TSSs represent the first nucleotide of the 5'-most exon of an Ensembl mRNA model. Consequently, the collection of TSSs for some genes may be scattered over a genomic sequence segment spanning several thousand nucleotides, sometimes even more than 100 kb. In order to acquire a reasonable number of TSSs for a given gene, an algorithm was designed to cluster TSS locations from the source databases (evidence points) around virtual T
247. s potential false positives due to local biases (that is, peaks significant under λBG but not under λlocal). Candidate peaks with p-values below a user-defined threshold (default 10^-5) are called, and ExPlain reports the quality score of each peak using the formula -10 × log10(p-value). Reference: Zhang et al., Genome Biology 2008, 9:R137. Chapter 10 Sequences. This chapter explains how to load and analyze nucleotide sequences obtained from any source organism within ExPlain. 10.1 Loading sequence data into ExPlain. There are two ways to add sequence data to ExPlain. The first is to upload a sequence data file via the Load custom sequences option within the File menu. The second is to copy and paste the sequence data directly into a dialog window via the New sequence option within the File menu. 10.1.1 Load sequence data from a file. To load sequences from a file, use the Load custom sequences link within the File menu. A dialog window will open which allows you to specify the destination folder for the loaded sequence(s), the data file to be loaded, and the default Transcription Start Site (TSS) position to be used within the sequences. The specified promoter position will be applied to those sequences from the file that do not contain promoter information. Figure 10.1: Load sequences dialog, with fields for the destination, the file to load (FASTA, EMBL or archive) and the default promoter position
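The peak score described above can be illustrated with a few lines of Python. This is a minimal sketch, assuming the formula is -10 times the base-10 logarithm of the p-value and that a peak is called when its p-value lies strictly below the threshold; the function names are hypothetical and are not part of ExPlain or MACS.

```python
import math


def peak_quality_score(p_value: float) -> float:
    """Return -10 * log10(p_value); smaller p-values give larger scores."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must lie in (0, 1]")
    return -10.0 * math.log10(p_value)


def is_called(p_value: float, threshold: float = 1e-5) -> bool:
    """A candidate peak is called when its p-value is below the threshold.

    The default of 1e-5 mirrors the default stated in the text.
    """
    return p_value < threshold
```

Under this reading, a peak sitting exactly at the default threshold would score about 50, and every called peak scores higher than that.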
248. shows a summary of results from Figure 5.3 and Figure 5.16. Rows with the most different and most similar p-values are marked in the sim/diff column. Figure 5.26: MATCH summary set (columns include Gene symbol, Matrix name and the sim/diff marker). 5.4 Sites search: theoretical background. 5.4.1 The TRANSPro database. Numerous analyses require actual promoter sequences, including promoter prediction tools and analyses of gene regulation in gene expression experiments. Being aware of the existence of several non-interchangeable definitions of the term promoter, we use it here for those sequence
249. sis of ChIP-Seq, the MACS algorithm is used. The input in BED format should include the 6th column containing the strand information, as required by MACS. The 4th and 5th columns are not used by MACS; nevertheless, some values should be present there. When there are replicate BED files from the same experiment, concatenating them into one single file works best. ExPlain supports import of archive data in ZIP or GZIP format, which significantly speeds up the process. To load your NGS data, select the Load intervals (Illumina BED file) option from the File menu. In the dialog window (Figure 9.3) you have to specify the genome build and the cutoff distance to the promoter TSS. Only the ChIP-Seq intervals that fall into the range of -10000 to +1000 from the TSS of the gene will be included in further calculations; however, the neighbor genes detected at the chosen cutoff can be exported. You might want to modify the cutoff p-value used in the peak detection, where more stringent criteria lead to detection of a smaller number of peaks. The MFOLD parameter is used to select the regions with MFOLD-fold tag enrichment against a background to build the peak model; the default is 32. The higher the MFOLD, the fewer candidate regions will be identified. If you see an ERROR or CRITICAL message from MACS, lowering this parameter is recommended. 9.1.4 Intervals representation. All intervals are grouped by corresponding gene promoters. Specie
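The TSS window rule can be sketched as follows. This is only an illustration under stated assumptions: "fall into the range" is read as full containment of the interval in the window, the window is mirrored for genes on the minus strand, and the function name and parameters are hypothetical rather than taken from ExPlain.

```python
def in_promoter_window(start: int, end: int, tss: int, strand: str = "+",
                       upstream: int = 10000, downstream: int = 1000) -> bool:
    """Return True if the interval [start, end] lies inside the TSS window.

    On the plus strand the window spans tss - upstream .. tss + downstream.
    On the minus strand upstream means higher coordinates, so it is mirrored.
    """
    if strand == "+":
        low, high = tss - upstream, tss + downstream
    else:
        low, high = tss - downstream, tss + upstream
    return low <= start and end <= high
```

For a TSS at position 28539402 on the plus strand, an interval at 28530000 to 28530200 would be kept, while one at 28541000 to 28541100 lies more than 1000 nt downstream and would be dropped.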
250. t), i.e. their resemblance to a normal distribution. This parameter is included to prevent over-fitting. P: The P component penalizes model complexity according to the number of CM units and the matrices and pairs each CM contains. It therefore also safeguards against over-fitting. For a full description of the CMA algorithm, please see: Waleev T, Shtokalo D, Konovalova T, Voss N, Cheremushkin E, Stegmaier P, Kel-Margoulis O, Wingender E, Kel A. Composite Module Analyst: identification of transcription factor binding site combinations using genetic algorithm. Nucleic Acids Res. 2006 Jul 1;34(Web Server issue):W541-5. PMID 16845066. Chapter 7 Molecular networks analysis. ExPlain provides information about signal transduction networks contained in the BKL database. Using the Search for key nodes option of ExPlain, you can search for common signaling molecules (key nodes) in the network vicinity of your gene set. The Network clusters tab can be used to identify subnetworks containing genes coherently connected by signal transduction reactions. 7.1 Network key node analysis. 7.1.1 Key nodes dialog window. Figure 7.1: Search for key nodes dialog window, with fields for the gene set, max radius, direction, FDR computation and threshold, expression/transregulation reactions, user interactions, curated chains and the secondary gene set. To s
251. t this field is set to none, or to the last gene set used as background before, if any. NOTE: Promoters present in both the Yes and No sets will be excluded from the background. The profile list contains all profiles from the whole project tree. To create a new profile or load an existing one from a file, use the Create/Load link. You can select which score thresholds of the PWMs from the profile to use in your analysis. The cut-offs from the original profile, minimized false negatives (minFN), minimized sum of FP and FN (minSUM) and minimized false positives (minFP) are available for selection. Note that the original profile will not be changed. To find an individual cut-off for each PWM that provides the best frequency ratio between the analysed and background sets, use the Optimize cut-off option (see Section 5.1.4). The maximal size of an available TRANSPro promoter sequence is 11000 nucleotides, ranging from position 10000 upstream to 1000 downstream of the TSS defined by TRANSPro. This means that the largest upstream value in from is 10000 and the largest downstream value in to is 1000. When you upload your own sequences, the maximal size range allowed is 200000 nucleotides. Within this range you can set the TSS arbitrarily at any position. To find an individual promoter window for each PWM providing the best separation of the query and background sets, use the Optimize window position option (see Section 5.1.4). By default ExPlain searches the promoter s
252. tart a key node analysis, open the key nodes dialog by clicking on the Key nodes link in the Network analysis section of the Analyze menu. First, select the gene set you want to analyze in the Gene set field. By default, the current tree node is selected as the input set. The dialog provides several further input options that can be used to specify the conditions for the analysis. Max radius: The maximal search distance threshold defines the number of steps from each input molecule that are considered by the algorithm. FDR: You have the option to compute the False Discovery Rate and to filter the results by an FDR threshold. Only key nodes with an FDR value below the specified threshold will be shown in the result list. Expression/transregulation reaction: If gene regulation and transregulation reactions should be considered in the search for key node molecules, use the Include expression/transregulation reaction option. If this option is not checked, all reactions including gene regulation/transregulation events will be excluded from the analyzed networks. User interactions: In the Add user defined interactions list you can select any uploaded interaction profile (see Section 7.4 for more details). The algorithm will consider the corresponding molecule interactions during the key node search. Follow curated chains: If annotated pathways and reaction chains
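The role of the Max radius option can be illustrated with a bounded breadth-first search. This is a rough sketch, not ExPlain's key node algorithm: it assumes the network is given as a plain dictionary from each molecule to its downstream molecules, ignores edge weights, FDR and secondary gene sets, and only counts how many input molecules a candidate reaches within the radius. All names are hypothetical.

```python
from collections import deque


def reachable_within(graph, start, max_radius):
    """Breadth-first search; returns {node: distance} for distance <= max_radius."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_radius:
            continue  # do not expand beyond the radius
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


def count_hits(graph, candidates, targets, max_radius):
    """For each candidate, count the targets reachable within max_radius steps.

    A candidate that is itself a target counts as a hit at distance 0.
    """
    targets = set(targets)
    return {
        candidate: len(targets.intersection(
            reachable_within(graph, candidate, max_radius)))
        for candidate in candidates
    }
```

With the toy network `{"K": ["A", "B"], "A": ["C"], "C": ["D"]}` and a radius of 2, candidate K reaches target C but not D, which is three steps away.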
253. ter to see all lines. The figure below shows the enrichment output and detailed view for the Transit peptide keyword. This result shows that genes associated with the Transit peptide function have low expression values in our experiment. 4.7.5 Enrichment Analysis: theoretical basis. The Gene Set Enrichment Analysis (GSEA) algorithm is designed to detect the enrichment of an input gene set with genes from certain molecular classification groups or from groups given by the user. It is assumed that the input set contains genes differentially expressed between two conditions (phenotypes, cell strains, before/after treatment, etc.). The input set D is a list of N genes, each with a value that reflects the difference in expression; it can be a fold change, a correlation with phenotype, or another measure. We call these values the expression difference. The GSEA algorithm obtains a list of genes L by ranking the set D according to the expression difference (descending sort). L is a set of N genes g_i ordered by values r_i. For each group S of the given category, the algorithm determines whether the hit genes of group S tend to be located at the top of the sorted list L, at the bottom, or are distributed randomly in L. GSEA computes the Enrichment Score ES(S), which ranges in [-1, 1], and a p-value, which ranges in [0, 1]. ES is the maximum deviation from zero of the so-called running sum (Equation 4.7.1) [26]. As we can see, ES depends on the parameter p that ranges
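The running sum behind the Enrichment Score can be sketched in Python. The code below follows the commonly published weighted running-sum statistic for GSEA, which is an assumption here, since Equation 4.7.1 is not reproduced in this excerpt: hits add |r_i|^p divided by the sum of |r|^p over all hits, misses subtract 1 divided by the number of misses. The function and argument names are hypothetical.

```python
def enrichment_score(values, gene_set, p=1.0):
    """Return the signed maximum deviation from zero of the running sum.

    values:   dict mapping gene -> expression difference r
    gene_set: collection of genes forming the group S
    p:        exponent weighting the contribution of each hit
    """
    # Rank genes by expression difference, largest first (the list L).
    ranked = sorted(values.items(), key=lambda item: item[1], reverse=True)
    is_hit = [gene in gene_set for gene, _ in ranked]
    n_hit = sum(is_hit)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene_set must contain some, but not all, ranked genes")
    hit_norm = sum(abs(r) ** p for (_, r), hit in zip(ranked, is_hit) if hit)
    if hit_norm == 0:
        raise ValueError("all hit genes have zero weight")

    running, score = 0.0, 0.0
    for (_, r), hit in zip(ranked, is_hit):
        running += abs(r) ** p / hit_norm if hit else -1.0 / n_miss
        if abs(running) > abs(score):
            score = running
    return score
```

A group whose genes all sit at the top of the ranking yields a score of 1, and a group concentrated at the bottom yields -1, matching the stated range of ES.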
254. termined for each gene regarding all experiments. Then relative (rel) and absolute (abs) difference values are calculated according to the formulas rel max min abs max min max mim. Subsequently, all genes for which abs > 1 1 are ranked according to their relative difference values, and the X top-ranked genes are used as the input set for CRC. CRC is a model-based clustering approach based on the Chinese restaurant process (CRP) [38]. The procedure is analogous to the random seating of customers that sequentially arrive at a restaurant with an infinite number of tables and an unlimited number of chairs. Each new customer that arrives will be seated according to the current seating scheme of the persons that have arrived at the restaurant before. Translated to the clustering of gene expression data, this means that each gene is assigned to a cluster depending on the existing constellation of gene-to-cluster assignments. The number of clusters and the parameters are determined in the course of the process, so that no initial information about the clusters is needed. The basic clustering procedure consists of the following steps: 1. Initialization: Genes are randomly assigned to an arbitrary number of clusters. 2. Re-assignment: Each gene is again assigned to a cluster by a) Removal of the gene from its current cluster and calculation of the probability that the gene j
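The seating analogy can be made concrete with the standard Chinese restaurant process, sketched below. This shows only the plain CRP prior, in which a new customer joins an existing table with probability proportional to its size and opens a new table with probability proportional to a concentration parameter. The weighted variant used by CRC additionally takes into account how well a gene's expression profile fits each cluster, which is not modelled here. The parameter name `alpha` and the function name are assumptions for illustration.

```python
import random


def crp_seating(n_customers, alpha=1.0, seed=0):
    """Simulate a Chinese restaurant process.

    Returns (assignment, table_sizes), where assignment[i] is the table
    index of customer i and table_sizes[k] is the occupancy of table k.
    """
    rng = random.Random(seed)
    table_sizes = []
    assignment = []
    for _ in range(n_customers):
        # Existing tables are weighted by occupancy, a new table by alpha.
        weights = table_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(table_sizes):
            table_sizes.append(1)   # open a new table
        else:
            table_sizes[choice] += 1
        assignment.append(choice)
    return assignment, table_sizes
```

Because the number of tables grows as customers arrive, the number of clusters does not have to be fixed in advance, which is the property the text attributes to CRC.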
255. tes that are found within the selected intervals. If the Leave only sites for factor assigned to the interval option is checked, the results will contain only sites predicted for the factor named at the corresponding interval. To keep sites that do not exactly fit the interval but are close to it, use the Expand interval option and enter the number of base pairs by which to lengthen the intervals to the left and right. The Filter background set (if any) option allows you to control background set filtering. Figure 9.8: Filter Match results dialog, with fields for the source Match result, the interval set to use, Leave only sites for factor assigned to the interval, Filter background set (if any) and Expand interval by. The button launches the filtering process. It creates a new node that contains the filtered MATCH results. 9.4 Theoretical background of Illumina BED files processing. ChIP-Seq tags represent the ends of fragments in a ChIP-DNA library and are often shifted towards the 3' direction to better represent the precise protein-DNA interaction site. The size of the shift is, however, often unknown to the experimenter. Since ChIP-DNA fragments are equally likely to be sequenced from both ends, the tag density around a true binding site should show a bimodal enrichment pattern, with Watson strand tags enriched upstream of binding and Crick strand tags enriched downs
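The interval filter described in this excerpt can be sketched as follows. It is a simplified illustration that assumes sites are single positions on one sequence and that intervals optionally carry a factor name; the `Site` and `Interval` types and the `filter_sites` function are hypothetical and do not reflect ExPlain's internal data model.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    start: int
    end: int
    factor: Optional[str] = None  # factor assigned to the interval, if any


class Site(NamedTuple):
    position: int
    factor: str  # factor predicted to bind this site


def filter_sites(sites, intervals, expand_by=0, match_factor=False):
    """Keep sites that fall inside at least one (optionally expanded) interval.

    expand_by:    base pairs added to both the left and right of each interval
    match_factor: if True, the interval's factor must equal the site's factor
    """
    kept = []
    for site in sites:
        for interval in intervals:
            inside = (interval.start - expand_by
                      <= site.position
                      <= interval.end + expand_by)
            factor_ok = (not match_factor) or interval.factor == site.factor
            if inside and factor_ok:
                kept.append(site)
                break  # one matching interval is enough
    return kept
```

A site 5 bp outside an interval is dropped by default but retained once the intervals are expanded by 10 bp, which is the effect of the Expand interval by field.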
256. the hybridization Figure 11.9: Page 3 of the affyQCReport. Plot of the 3'/5' ratios, percent present calls and average background levels for data sets that are of reasonable quality (A) and bad quality (B). By clicking on the plus buttons in the Plot(s) section below the quality control result table, the user can display graphs of different quality control features. Boxplot: This graph enables an analysis of the overall probe intensities of an individual array and a comparison of the intensities between the different arrays of the user's data set. The upper and lower borders of each box indicate the 75th and 25th percentiles of the intensity distribution, and the black bar within a box marks the median. The lines extending from each box show the spread of the intensity values. There should be no large differences in the levels of the raw probe intensities. Arrays with a significantly different range of intensity values or a low average intensity are suspect and need careful examination. Histogram: Kernel density plot of the probe intensities. The intensities of the arrays are represented by individual lines. The legend in the upper right corner shows the line-to-array assignment. Differences in the shape or center of the distribution indicate a need for thorough normalization. RNA Degradation: The pl
257. then press the button (Figure 3.17). If the non-intersecting option is checked, the resulting subsets will contain different sets of genes. Figure 3.17: Extract random subsets dialog, with fields for the source gene set, the number of random subsets to create, the number of genes in each subset and Create non-intersecting random subsets. 3.4 Adding columns to the gene set. 3.4.1 Calculation from an existing numerical column. For analyses that take expression values into account, it can be useful to transform the values beforehand. Logarithms, for instance, can be used to decrease differences between fold changes. Multiplication by -1 or taking absolute values can alter the way genes with negative fold changes are taken into account. Some statistical methods, such as normalization and mean shifting, can also be necessary for data processing. Where multiple numerical columns exist, the average, sum or product of values in different columns is sometimes meaningful. ExPlain provides a range of possibilities for such transformations of numerical and expression column values, summarized below. Arithmetic operations: add, subtract, multiply, divide, power. Advanced operations: abs(x) absolute value, sqrt(x) square root, exp(x) exponentiation, log(x) natural loga
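The column transformations listed above map directly onto elementary functions. The sketch below applies a few of them to a list of values; it assumes that "mean shifting" means subtracting the column mean, and the helper names are hypothetical rather than ExPlain syntax.

```python
import math
from statistics import mean

# Element-wise operations named after the ones listed in the text,
# plus "negate" for multiplication by -1.
OPERATIONS = {
    "abs": abs,
    "sqrt": math.sqrt,
    "exp": math.exp,
    "log": math.log,            # natural logarithm
    "negate": lambda x: -x,
}


def transform_column(values, operation):
    """Apply one named element-wise operation to every value in a column."""
    func = OPERATIONS[operation]
    return [func(value) for value in values]


def mean_shift(values):
    """Subtract the column mean so that the shifted column has mean zero."""
    centre = mean(values)
    return [value - centre for value in values]


def row_average(*columns):
    """Average several numerical columns row by row."""
    return [mean(row) for row in zip(*columns)]
```

For example, applying "abs" to fold changes of -2.0 and 4.0 treats up- and down-regulation alike, while "negate" reverses the ranking of the genes.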
258. this mode the dialog window can be used to create a profile from scratch by selecting PWMs from the list box and choosing predefined (minFN, minSUM or minFP) or custom (MSS and CSS) cut-offs. Figure 8.2: The New profile dialog window in default mode, with the Matrices and Factors tabs, the High specificity matrices only option, the Profile name field and the profile cut-offs (minFN, minSUM, minFP, or custom CSS and MSS). Create new profile in folder: Profiles is offered by default, but it is possible to choose any other folder. Furthermore, you can specify the Profile name. The contents of the list box and thereby the means of se
259. thods for detecting differentially expressed genes in microarray experiments. Bioinformatics 24, 374-82. PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=18204065
    RankProduct: Breitling R., Armengaud P., Amtmann A. and Herzyk P. (2004) Rank products: a simple, yet powerful, new method to detect differentially regulated genes in replicated microarray experiments. FEBS Lett. 573, 83-92. PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=153279805
    BLAST: Altschul S.F., Gish W., Miller W., Myers E.W. and Lipman D.J. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403-10. PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=2231712
    CRC: Zhaohui S. Qin (2006) Clustering microarray gene expression data using weighted Chinese restaurant process. Bioinformatics 22(16), 1988-1997. PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=167665612
    Filtering of gene expression data: Tamayo P., Slonim D., Mesirov J., Zhu Q., Kitareewan S., Dmitrovsky E., Lander E.S., Golub T.R. (1999) Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96, 2907-2912. PubMed: http www
260. threshold. The genes in the control set are, however, not included in the interaction table. Setting a large percentage implies a low CMA score threshold and will produce an interaction table that includes many of the real target genes. Using a smaller percentage will produce a higher score threshold and an interaction table with fewer genes. 6.8 CMA (Composite Module Analyst): Background information. This section provides an overview of the Composite Module Analyst (CMA) software. CMA is a software tool that produces a model of the common regulatory regions in a promoter set, described by combinations of single transcription factor binding sites as well as their pairs in composite modules. A composite module is considered to be a region within a promoter in which specific combinations of Figure 6.21: CM scanning output (promoter model in verbose display, goal function calculation with values, weights and weighted values, and expression/score distribution).
261. tion Analysis node in the project tree. The GO Identifier column with the ontology identifiers is hidden by default and can be shown, if needed, in the created subset. 4.3 Functional Analysis output tables. This section describes the output tables of the different functional categories: Organ/Tissue expression, BKL Disease, Transcription Factor Classification, TRANSPATH Molecule Classification, Gene Ontology and Whole subsets from the tree results. Note that only overrepresented groups are displayed in the result table for all types of classification. Figure 4.5: New subset with immune response and immune system development genes compiled by Functional Analysis (columns: Gene symbol, BKL description, GO Identifier).
262. to create new model. To add a pair of PWMs to a new model, select matrices in the Matrix 1 and Matrix 2 drop-down boxes, edit the cut-off values, and choose the orientation of the matrices in the pair, the spacer range and the number of pair matches in the module. Drag the created pair from the editor box upwards to add it to the model. The Boolean structure of the model is represented by colored boxes. A single matrix or matrix pair appears inside a gray box, and a single matrix or several matrices and pairs comprising a group appear within a light green group rectangle. An entire model appears in a yellow box consisting of groups containing matrices or matrix pairs. To add matrices or pairs to the model as part of a module, a group or the model itself, drag the gray rectangle with the corresponding data into the proper colored rectangle. Click the not button to make a module repressing. If you want to remove some elements, drag them to the trash can at the bottom left corner. Arranging all the elements can be difficult at first, but it becomes much easier with practice. Figure 6.18: An example of a model created in the editor. When you are satisfied with your model, press the save
263. tream. MACS takes advantage of this bimodal pattern to empirically model the shifting size and so better locate the precise binding sites. Given a sonication size (bandwidth) and a high-confidence fold-enrichment (MFOLD), MACS slides windows of twice the bandwidth across the genome to find regions with tags more than MFOLD-enriched relative to a random tag genome distribution. MACS randomly samples 1,000 of these high-quality peaks, separates their Watson and Crick tags, and aligns them by the midpoint between their Watson and Crick tag centers, if the Watson tag center is to the left of the Crick tag center. The distance between the modes of the Watson and Crick peaks in the alignment is defined as d, and MACS shifts all the tags by d/2 toward the 3' ends, to the most likely protein-DNA interaction sites. For experiments with a control, MACS linearly scales the total control tag count to be the same as the total ChIP tag count. Sometimes the same tag can be sequenced repeatedly, more times than expected from a random genome-wide tag distribution. Such tags might arise from biases during ChIP-DNA amplification and sequencing library preparation, and are likely to add noise to the final peak calls. Therefore, MACS removes duplicate tags in excess of what is warranted by the sequencing depth (binomial distribution p-value < 10^-5). With the current genome coverage of most ChIP S
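The tag shift and the control scaling described here are simple arithmetic, illustrated below. This sketch assumes that d has already been estimated, represents tags as (position, strand) pairs with "+" for Watson and "-" for Crick, and uses integer division for d/2; the function names are hypothetical and are not part of MACS.

```python
def shift_tags(tags, d):
    """Shift every tag by d/2 toward its 3' end.

    Watson ("+") tags move to higher coordinates and Crick ("-") tags to
    lower coordinates, so both converge on the presumed binding site.
    """
    half = d // 2
    return [
        (position + half if strand == "+" else position - half, strand)
        for position, strand in tags
    ]


def control_scale_factor(chip_total, control_total):
    """Linear factor that makes the total control tag count equal the ChIP total."""
    if control_total <= 0:
        raise ValueError("control_total must be positive")
    return chip_total / control_total
```

With d = 200, a Watson tag at position 100 and a Crick tag at position 300 both end up at position 200, the midpoint between the two strand-specific peaks.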
264. uced following expression of the ERBB2 receptor, which suggests a role in receptor signaling events | CDYS | 0.0696543 | 0.096747
CNOT8 | CCR4-NOT transcription complex subunit 8, a putative transcription factor that may play a role in the regulation of cell proliferation and transcription from RNA polymerase II promoter | 0.0595904 | 0.0728628
CYR61 | Cysteine-rich angiogenic inducer 61, promotes endothelial cell adhesion; upregulated in breast neoplasms; aberrant gene expression is associated with gliomas and lung carcinomas; rat Cyr61 is associated with neointimal hyperplasia in a rat injury model | 0.0596674 | 0.0908476
EIF5A | Eukaryotic translation initiation factor 5A, a translation factor that acts in p53-mediated signaling and apoptosis induction in response to DNA damage, and in mRNA and protein export from the nucleus; may play a role in skeletal muscle stem cell differentiation | 0.00391455 | 0.0201679
FUS | Fusion (involved in t(12;16) in malignant liposarcoma), a transcriptional cofactor that acts in recombination repair; gene translocation correlates with leukemia, myxoid liposarcoma, and fibrous histiocytoma; mutation causes amyotrophic lateral sclerosis | 0.0739477 | 0.0077211
KIAA0247 | KIAA0247, protein expression is induced following alteration of TP53-dependent microenvironmental components of the inflammatory response, including nitric oxide, hydrogen peroxide | 0.0773773 | 0.10146
265. ue of the match result. Figure 4.9: Output table of BKL Disease analysis (columns: Disease ID, Gene symbol, Disease name, Biomarker associations, Hits in group, Group size, Hits expected, p-value). 4.3.5 SwissProt analysis output. The table presents one UniProt Knowledgebase keyword in each row. The columns are the keyword ID with a link to its UniProt (http://www.uniprot.org) description, the gene symbols, the descriptive name of the keyword, its category, the number of input genes matching that keyword group, the size of the matched group, the randomly expected number of hits, and the P-value of the match result. Figure 4.10: Output table of SwissProt analysis (columns: Keyword, Gene symbol, Keyword name, Category, Hits in group, Group size, Hits expected, p-value).
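To make the "hits expected" and "p-value" columns more concrete, here is a minimal sketch of how such values are commonly computed for a gene-set match. The excerpt does not state which statistical test ExPlain uses, so the hypergeometric model, the function names, and the `universe_size` parameter (total number of annotated genes) are all assumptions for illustration only.

```python
from math import comb


def expected_hits(input_size: int, group_size: int, universe_size: int) -> float:
    """Randomly expected number of input genes falling into the group."""
    return input_size * group_size / universe_size


def enrichment_p_value(hits: int, input_size: int, group_size: int,
                       universe_size: int) -> float:
    """Upper-tail hypergeometric probability of seeing at least `hits` matches.

    Assumption: the input genes are treated as a random draw without
    replacement from `universe_size` genes, `group_size` of which belong
    to the keyword or disease group.
    """
    upper = min(input_size, group_size)
    total = comb(universe_size, input_size)
    tail = sum(
        comb(group_size, k) * comb(universe_size - group_size, input_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes overall, 4 in the group, 5 input genes, 3 of them hit.
print(expected_hits(5, 4, 10))
print(enrichment_p_value(3, 5, 4, 10))
```

In this toy case two hits are expected by chance, and three or more hits occur with probability 66/252, so the match would not be considered significant.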
266. ure. It is impractical to enter a separate state for each location. Most molecules can be found in several tissues, at several development stages, in several cellular compartments, several organs, and several cell types. Entering a state for each possible combination would lead to an explosion in the number of states and to redundancy in the reactions. This problem is circumvented by using a list of positive and negative locations that is linked to the basic molecule. In each state, the molecule is available only for a subset of all reactions for that molecule. Receiving a signal changes the molecule's state, usually leading to a new state from which reactions are triggered. Figure 7.18: State switching. Chapter 8: Profiles. This section explains how to create, modify, and import PWM profiles (sets of positional weight matrices of transcription factor binding sites and the included cut-offs). Cut-offs are threshold values assigned to matrices; they indicate the allowed variation when the matrix is used to predict a binding site. They are measured by comparing the predicted sites with experimentally proven ones. In ExPlain you can choose cut-offs minimizing false positives (overprediction), false negatives (underprediction), or the sum of both errors. The created or modified profiles can then be used to search a set of promoter sequences for binding sites as describe
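The idea of a cut-off that minimizes the sum of both errors can be illustrated with a small sketch. It is not the procedure ExPlain uses to derive its cut-offs: the function names, the candidate cut-off list, and the way error rates are estimated from scores of proven sites and of background sequence are assumptions for this example.

```python
from typing import Sequence, Tuple


def error_rates(cutoff: float,
                true_site_scores: Sequence[float],
                background_scores: Sequence[float]) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate) for one cut-off.

    A proven site scoring below the cut-off is missed (false negative);
    a background position scoring at or above it is overpredicted
    (false positive).
    """
    fn = sum(s < cutoff for s in true_site_scores) / len(true_site_scores)
    fp = sum(s >= cutoff for s in background_scores) / len(background_scores)
    return fp, fn


def min_sum_cutoff(candidates: Sequence[float],
                   true_site_scores: Sequence[float],
                   background_scores: Sequence[float]) -> float:
    """Pick the candidate cut-off with the smallest sum of both error rates."""
    def total_error(cutoff: float) -> float:
        fp, fn = error_rates(cutoff, true_site_scores, background_scores)
        return fp + fn
    return min(candidates, key=total_error)


# Hypothetical matrix scores for proven sites and for background sequence.
proven = [0.95, 0.90, 0.85, 0.80]
background = [0.90, 0.70, 0.60, 0.50]
print(min_sum_cutoff([0.60, 0.75, 0.92], proven, background))
```

A low cut-off such as 0.60 keeps every proven site but overpredicts heavily, a high one such as 0.92 does the opposite, and the intermediate value balances the two errors.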
267. uts are variants of Classic and Compact with the tree panel on the right side. The treeless layout is somewhat different. In this mode the tree is hidden by default, leaving all of the horizontal screen space for your data, which can be useful when you have many columns. Instead of the tree, a breadcrumb bar is shown below the toolbar; it displays the current item and all its parents, allowing you to easily go up the tree. For the most part this is enough for navigation, though you can still see the whole tree by pressing the Go to button on the left side of the breadcrumb bar. The only drawback of this mode is that tree functionality is limited: to search the tree or perform a mass action, you will have to switch layouts. Figure 1.11: Breadcrumb bar. 1.4 The process monitor. The process monitor provides an interface to the queue of ExPlain processes. The monitor window summarizes all running and waiting tasks, as well as completed jobs whose results are waiting to be inspected. For further details about a running or waiting process, click on its label in the monitor window. Processes can be cancelled by clicking on the icon associated with a label. The process interface can also be used to move to the result of a newly completed analysis by clicking on its label when the monitor sign
268. vel analysis, Quality Control, High-level analysis. Microarray data analysis. IMPORTANT NOTICE / WARNING: This chapter addresses highly specialized statistical tools; if you have doubts about their usage, we recommend leaving the default values provided by ExPlain. The statistical tools are available only if the R language software suite is installed on your system. This ExPlain program allows the analysis of microarray data from Affymetrix CEL files. The workflow includes four steps, described in detail below, which must be carried out in the following order: loading of CEL files and assignment of factors and levels, low-level analysis, quality control, and high-level analysis. 11.1 Loading of CEL files and assignment of factor level information. The CEL files must first be archived in ZIP, TAR, or gzipped TAR format. Correctly archived files can be loaded into ExPlain via the File > Load CEL files menu option. A dialog window will open in which you specify the destination folder and the file to be uploaded. Press the button to launch the process. The CEL files are extracted from the ZIP or TAR archive and the factor level assignment options are shown (Figure 11.2). You must assign factors and levels for the data by clicking on the Change configuration link provided. A dialog window will open in which the factors and levels for the data sets can be specified (Figure 11.3).
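Preparing the archive for the File > Load CEL files step can be done with any archiving tool; the sketch below shows one way to do it with Python's standard `zipfile` module. The function name `archive_cel_files` and the choice to store files flat, without directory prefixes, are assumptions for this example; the excerpt does not say whether ExPlain requires a particular layout inside the archive.

```python
import zipfile
from pathlib import Path
from typing import List


def archive_cel_files(source_dir: str, archive_path: str) -> List[str]:
    """Pack all .CEL files found directly in source_dir into a ZIP archive.

    Returns the file names that were stored. Files are stored flat
    (assumption: no sub-folders are expected inside the archive).
    """
    cel_files = sorted(
        p for p in Path(source_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".cel"
    )
    if not cel_files:
        raise FileNotFoundError(f"no CEL files found in {source_dir}")
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in cel_files:
            zf.write(path, arcname=path.name)
    return [p.name for p in cel_files]


if __name__ == "__main__":
    # Hypothetical paths; replace them with your own experiment folder.
    print(archive_cel_files("my_experiment", "my_experiment_cel.zip"))
```

The resulting ZIP file is what you would then select in the upload dialog.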
269. window. (Figure: Rank Product dialog with three source selectors, Source 1 to Source 3, each with Factor, Experiment, and Baseline fields, and a Cancel button.) 11.4.2 Statistical Analyses of the gene expression data. The statistical analysis techniques ANOVA, Fold change, Empirical Bayes, and Generalized Linear Model can be applied to a gene set as well as to filtered CEL data. The Statistical Analysis > Fold Change option of the Analyze menu provides a list of analysis methods. The dialog is displayed in Figure 11.16 below. Select a gene set, method parameters, and a factor for the Fold Change computation. Only gene sets with assigned factors can be used in this analysis. The assignment can be changed from within the dialog by clicking the Change configuration button, or by launching the dialog from the Data menu.
270. wn in Figure 8.4 (if you select nothing, by default the system will select all the rows) and press the button. Figure 8.3: PWM selector for matrix-based profile compilation. It is also possible to invoke the Create new profile dialog after selecting the up-regulated gene set. Then all PWMs associated with the gene set are preselected in the list, and no additional matrices need to be selected. Use the Profile name field to name the profile HUVEC_UP, and select the desired destination folder in the field Create new profile in folder. Figure 8.4: Input/output table for the creation of the example HUVEC up-regulated profile (dialog fields: Create new profile in folder, Profile name, and Profile cut-offs with the options minFN, minSUM, and minFP; tabs: Matrices, Factors, Profiles).
271. yed in Figure 7.8. In this visualization, pink nodes are the clustered input molecules. Figure 7.8: Section of a network cluster. 7.3 Network visualization. A Flash-based application provides you with tools to visualize and manipulate the network. The screen contains two parts: a navigation bar on the left and a main canvas. You can change the scale, scroll the displayed network, view the information associated with each node, or edit the network. Figure 7.9: Flash-based network visualization (with nodes table, Export options GIF, CSML 1.9, and CSML 3.0, Layout, and Legend). The network visualizer navigation bar consists of three navigation elements. The Navigation panel is a miniature scheme of the canvas, which shows the position of the visible window and moves it around big or zoomed-in networks. The panning buttons provide another tool to navigate stepwise within the canvas. The zoom panel has a slider and two buttons, one to zoom in and one to zoom out. Several export options are shown above the network picture. The area where the network is visualized is called the canvas. A network is a set of nodes which are connected to each ot
272. ypes. In Silico Biol. 5, 0007 (2004). PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15972006 CYTOMER: http://www.gene-regulation.com/pub/databases/cytomer Human Proteome Survey Database (HumanPSD): Hodges, P.E., Carrico, P.M., Hogan, J.D., O'Neill, K.E., Owen, J.J., Mangan, M., Davis, B.P., Brooks, J.E., and Garrels, J.I. Annotating the human proteome: the Human Proteome Survey Database (HumanPSD) and an in-depth target database for G protein-coupled receptors (GPCR-PD) from Incyte Genomics. Nucleic Acids Res. 30, 137-141 (2002). PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=11752275 HumanPSD: http://www.biobase-international.com/pages/index.php?id=71 TRANSPro: Chen, X., Wu, J.-m., Hornischer, K., Kel, A., and Wingender, E. TiProD: The Tissue-specific Promoter Database. Nucleic Acids Res. 34, D104-D107 (2006). PubMed: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=16381824 TRANSPro: http://www.biobase-international.com/pages/index.php?id=113 MATCH: Kel, A.E., Goessling, E., Reuter, I., Cheremushkin, E., Kel-Margoulis, O.V., and Wingender, E. MATCH: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 31, 3576-3579 (2003). PubM
273. ysis displayed in the ExPlain tree. (Figure: tree listing of the filtered data set followed by the resulting clusters, Cluster1 to Cluster12, each labelled with a 0.7 threshold and 0 shift and a creation time stamp of 2009-09-22.) 11.5.1 Algorithm details of CRC clustering. As recommended by Qin (2006) [37], genes showing little variation across all experiments can be removed by filtering the data set before the clustering procedure is started. In the filtering step, the minimum and maximum expression values are de
274. yze menu. Then, in the dialog window, specify the gene set to analyze, the objects to consider, and the two numerical columns to be compared. The figure below shows an example of such an analysis. Figure 3.25: Statistics calculator dialog window (inputs: Gene set, Objects to consider, Expression column 1, Expression column 2; per-column outputs: Min value, Max value, Sum, Average, Sum of squares, Standard deviation; pairwise outputs: Covariance, Correlation, Student t-test, Kolmogorov test, Wilcoxon signed rank test). Table 3.2: Predefined data sets. Human housekeeping genes: this set contains about 550 human housekeeping genes derived from a study described in [8]. Human promoters TRANSPro 6.2: this set contains over 40,000 human promoters extracted from TRANSPro. HUVEC GSE2639 example: this set contains over 7,500 human genes with an associated fold change value. Human umbilical vein wall cells were treated with TNF-alpha (tumor necrosis factor alpha) and a microarray experiment was done; the fold change is the ratio of the signals of treated-cell genes and control-cell genes. Mouse promoters TRANSPro 6.2: this set contain
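The per-column and pairwise values reported by the statistics calculator, and the fold change used in the HUVEC example set, follow standard definitions that can be sketched as below. The function names are made up for this example, and the use of the sample (n - 1) form for the standard deviation and covariance is an assumption; the excerpt does not say which form ExPlain reports.

```python
import math
from typing import Dict, Sequence


def column_summary(values: Sequence[float]) -> Dict[str, float]:
    """Min, max, sum, average, sum of squares, and standard deviation."""
    n = len(values)
    total = sum(values)
    mean = total / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample form
    return {
        "min": min(values),
        "max": max(values),
        "sum": total,
        "average": mean,
        "sum_of_squares": sum(v * v for v in values),
        "std_dev": math.sqrt(variance),
    }


def covariance(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample covariance of two equally long columns."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(x, y) / (column_summary(x)["std_dev"] * column_summary(y)["std_dev"])


def fold_change(treated_signal: float, control_signal: float) -> float:
    """Ratio of the treated-cell signal to the control-cell signal."""
    return treated_signal / control_signal


column1 = [1.0, 2.0, 3.0, 4.0]
column2 = [2.0, 4.0, 6.0, 8.0]
print(column_summary(column1))
print(covariance(column1, column2), correlation(column1, column2))
print(fold_change(6.0, 3.0))
```

Because the second toy column is exactly twice the first, the correlation comes out at 1 up to floating-point rounding, while a fold change of 2 means the treated signal is twice the control signal.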
