ARMADA user's guide

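1. Mahalanobis: $d_{ij}^2 = (x_i - x_j)\, V^{-1} (x_i - x_j)^T$, where $V$ is the sample covariance matrix.
City Block (Manhattan): $d_{ij} = \sum_{k=1}^{n} |x_{ik} - x_{jk}|$
Minkowski: $d_{ij} = \left( \sum_{k=1}^{n} |x_{ik} - x_{jk}|^{p} \right)^{1/p}$. It can be easily seen that for the special case of $p = 1$ the Minkowski metric gives the City Block metric, and for the special case of $p = 2$ the Minkowski metric gives the Euclidean distance.
Cosine: $d_{ij} = 1 - \dfrac{x_i x_j^T}{\sqrt{x_i x_i^T}\,\sqrt{x_j x_j^T}}$
Correlation: $d_{ij} = 1 - \dfrac{(x_i - \bar{x}_i)(x_j - \bar{x}_j)^T}{\sqrt{(x_i - \bar{x}_i)(x_i - \bar{x}_i)^T}\,\sqrt{(x_j - \bar{x}_j)(x_j - \bar{x}_j)^T}}$, where $\bar{x}_i = \frac{1}{n}\sum_{k} x_{ik}$ and $\bar{x}_j = \frac{1}{n}\sum_{k} x_{jk}$.
Hamming: $d_{ij} = \#(x_{ik} \neq x_{jk}) / n$
Jaccard: $d_{ij} = \dfrac{\#\left[(x_{ik} \neq x_{jk}) \wedge ((x_{ik} \neq 0) \vee (x_{jk} \neq 0))\right]}{\#\left[(x_{ik} \neq 0) \vee (x_{jk} \neq 0)\right]}$

D.2 Linkage algorithms

These linkage algorithms are based on different ways of measuring the distance between two clusters of objects. If $n_i$ is the number of objects in cluster $i$, $n_j$ is the number of objects in cluster $j$, and $x_{ir}$ is the $r$-th object in cluster $i$, the definitions of these various measurements are presented in the following table (algorithms: Single, Complete, Average, Centroid, Median, Ward).

Algorithm: Definition
Single: Single linkage is also called nearest neighbor and uses the smallest distance between objects in the two clusters. It is defined as $d(i,j) = \min\, d(x_{ir}, x_{js}),\ r \in (1, \dots, n_i),\ s \in (1, \dots, n_j)$.
Complete: Complete linkage is also called furthest neighbor and uses the largest distance between objects in the two clusters. It is defined as $d(i,j) = \max\, d(x_{ir}, x_{js}),\ r \in (1, \dots, n_i),\ s \in (1, \dots, n_j)$.
Average: Average linkage uses the average distance between all pairs of objects in cluster i and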
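The single, complete and average linkage definitions above can be checked on a small example. The following is a minimal Python sketch, not ARMADA's or MATLAB's implementation; the function names and the two toy clusters are hypothetical, and the Euclidean distance is assumed as the underlying metric.

```python
import math
from itertools import product


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pairwise(cluster_i, cluster_j, dist=euclidean):
    """All distances d(x_ir, x_js) between objects of the two clusters."""
    return [dist(x, y) for x, y in product(cluster_i, cluster_j)]


def single_linkage(cluster_i, cluster_j, dist=euclidean):
    """Nearest neighbor: smallest pairwise distance."""
    return min(pairwise(cluster_i, cluster_j, dist))


def complete_linkage(cluster_i, cluster_j, dist=euclidean):
    """Furthest neighbor: largest pairwise distance."""
    return max(pairwise(cluster_i, cluster_j, dist))


def average_linkage(cluster_i, cluster_j, dist=euclidean):
    """Mean of all n_i * n_j pairwise distances."""
    distances = pairwise(cluster_i, cluster_j, dist)
    return sum(distances) / (len(cluster_i) * len(cluster_j))


# Hypothetical toy clusters in two dimensions.
cluster_i = [(0.0, 0.0), (0.0, 1.0)]
cluster_j = [(3.0, 0.0), (4.0, 0.0)]

print(single_linkage(cluster_i, cluster_j))
print(complete_linkage(cluster_i, cluster_j))
print(average_linkage(cluster_i, cluster_j))
```

Passing a different `dist` function (for example a city block distance) changes the metric without touching the linkage rule, which mirrors how the two choices are made independently in hierarchical clustering.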
2. [Screenshot: check boxes Use median instead of mean and Plot distribution summaries after scaling]

The following table explains each option and parameter.

Option: Description
Use median instead of mean: If checked, the median of the ranked values will be used instead of the mean.
Plot distribution summaries after scaling: If checked, a plot presenting the gene expression distributions among all slides of the selected Analysis and the summary quantiles distribution will be displayed.

In the Statistical testing panel, the user can select the test to be performed, select a multiple test correction method and set the cutoff values that will determine which genes are statistically differentially expressed. The following table describes the available options in the Statistical testing panel.

Option: Description
Statistical test: Available statistical tests are 1-way ANOVA (ref here), Kruskal-Wallis (non-parametric equivalent of ANOVA), t-test and Time Course ANOVA. Time Course ANOVA should be used when each experimental configuration has its own control, e.g. when performing a time course experiment where one possible configuration is that each separate time point has its own control. In this case, ANOVA is performed among each point's fold changes instead of among expression values.
Multiple test correction: Available multiple testing correction methods are the following:
  None: No multiple testing correction.
  Bonferroni: Bonferroni multiple testi
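To illustrate what a multiple testing correction does to a list of p-values, the following minimal Python sketch compares the Bonferroni correction with the Benjamini-Hochberg FDR procedure. It is an illustration only and is not taken from ARMADA; the function names, the example p-values and the 0.05 cutoff are assumptions.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: each p is multiplied by the number
    of tests m and capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    m = len(pvalues)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five genes.
pvals = [0.001, 0.008, 0.02, 0.03, 0.6]
cutoff = 0.05

bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)

print("Bonferroni significant:", sum(p < cutoff for p in bonf))
print("Benjamini-Hochberg significant:", sum(p < cutoff for p in bh))
```

With these example values, fewer genes pass the cutoff after the Bonferroni correction than after the Benjamini-Hochberg procedure, which is the usual reason FDR-based methods are preferred for microarray data.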
3. cluster j. It is defined as $d(i,j) = \dfrac{1}{n_i n_j} \sum_{r=1}^{n_i} \sum_{s=1}^{n_j} d(x_{ir}, x_{js})$.
Centroid: Centroid linkage uses the Euclidean distance between the centroids of the two clusters. It is defined as $d(i,j) = \lVert \bar{x}_i - \bar{x}_j \rVert_2$, where $\bar{x}_i = \dfrac{1}{n_i} \sum_{r=1}^{n_i} x_{ir}$.
Median: Median linkage uses the Euclidean distance between weighted centroids of the two clusters. It is defined as $d(i,j) = \lVert \tilde{x}_i - \tilde{x}_j \rVert_2$, where $\tilde{x}_i$ and $\tilde{x}_j$ are weighted centroids for the clusters $i$ and $j$. If cluster $i$ was created by combining clusters $p$ and $q$, then $\tilde{x}_i$ is defined recursively as $\tilde{x}_i = \frac{1}{2}(\tilde{x}_p + \tilde{x}_q)$, and $\tilde{x}_j$ is defined similarly.
Ward: Ward's linkage uses the incremental sum of squares, that is, the increase in the total within-cluster sum of squares as a result of joining clusters $i$ and $j$. The within-cluster sum of squares is defined as the sum of the squares of the distances between all objects in the cluster and the centroid of the cluster. The equivalent distance is given by $d^2(i,j) = \dfrac{n_i n_j \lVert \bar{x}_i - \bar{x}_j \rVert_2^2}{n_i + n_j}$, where $\lVert \cdot \rVert_2$ is the Euclidean distance and $\bar{x}_i$, $\bar{x}_j$ are the centroids of clusters $i$ and $j$, as in the Centroid linkage, respectively.
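The statement that Ward's distance equals the increase in the total within-cluster sum of squares can be verified numerically. The following Python sketch is an illustration under stated assumptions, not ARMADA's code: the function names and the two toy clusters are hypothetical.

```python
def centroid(cluster):
    """Coordinate-wise mean of the objects in a cluster."""
    n = len(cluster)
    return tuple(sum(column) / n for column in zip(*cluster))


def within_ss(cluster):
    """Sum of squared Euclidean distances of all objects to the centroid."""
    c = centroid(cluster)
    return sum(sum((a - b) ** 2 for a, b in zip(x, c)) for x in cluster)


def ward_squared(cluster_i, cluster_j):
    """Squared Ward distance: n_i * n_j * ||c_i - c_j||^2 / (n_i + n_j)."""
    n_i, n_j = len(cluster_i), len(cluster_j)
    c_i, c_j = centroid(cluster_i), centroid(cluster_j)
    squared = sum((a - b) ** 2 for a, b in zip(c_i, c_j))
    return n_i * n_j * squared / (n_i + n_j)


# Hypothetical toy clusters in two dimensions.
cluster_i = [(0.0, 0.0), (0.0, 2.0)]
cluster_j = [(4.0, 1.0), (6.0, 1.0), (5.0, 4.0)]

# Increase in the total within-cluster sum of squares caused by the merge.
increase = (within_ss(cluster_i + cluster_j)
            - within_ss(cluster_i) - within_ss(cluster_j))

print(increase, ward_squared(cluster_i, cluster_j))
```

Both printed values agree up to floating point rounding, which is why the closed-form expression can be used instead of recomputing the sums of squares at every merge.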
4. The bootstrap is an iterative resampling procedure which is based on creating new data by drawing with replacement from an initial dataset. For more information on the bootstrap, the user should see [19] Efron, B. and Tibshirani, R. (1993) An introduction to the bootstrap. Chapman & Hall/CRC.

General options

Option: Description
Use always squared euclidean: Whether to always use the squared Euclidean distance to calculate the within-cluster pairwise distances, as the authors of [18] propose, or to use the metric used during the clustering process (which sometimes seems to work better).
Verbose output (command line): Display output messages on the operating system command line (or MATLAB's command window, if ARMADA is used under MATLAB), showing different stages of progress.
Use waitbar: Display a bar showing the progress of the calculations.
Show output plot: If checked, will generate a figure with two panels: the upper panel shows the within-cluster dispersion range for the original dataset against the range of the number of clusters. The bottom panel displays the Gap curve, which is the values of the Gap statistic against the range of the number of clusters. The user should also see [18].

When choosing the clustering algorithm, each corresponding preferences window contains a field for specifying the number of clusters, e.g. the field Number of centroids (k) in the k-means clustering preferences window. This choice is ignored as the number of clust
6. [Figures: misclassification error against the number of nearest neighbors (1 to 10), shown for the nearest and random rules, with one curve per distance metric (euclidean, cityblock, cosine, correlation).]
7. [Screenshot: gene list export preferences window, with check boxes grouped under Unnormalized ratios, Normalized ratios, Intensities and Statistics and general (including Fold change, Trust factors and CVs), and an Output file type selector (Text tab delimited or Excel)]

The following table explains the data types that are exported by checking each of the boxes in the gene list export preferences window (the term ratio denotes the ratio between channels, and intensity the intensity values calculated from the two channel signals).

Group: Options
Unnormalized ratios: Ratio (raw), Ratio (log), Mean ratio (raw), Mean ratio (log), Median ratio (raw), Median ratio (log), StDev ratio (raw), StDev ratio (log)
Normalized ratios: Ratio (raw), Ratio (log), Mean ratio (raw), Mean ratio (log), Median ratio (raw), Median ratio (log), StDev ratio (raw), StDev ratio (log)
Intensities: Intensity, Mean intensity, Median intensity, StDev intensity
Statistics and general: Slide positions, Gene names, p-values, q-values, FDR

Option: Description
Ratio (raw): The un-normalized ratio in natural scale for each replicate of each experimental condition.
Ratio (log): The un-normalized ratio in log scale for each replicate of each experimental condition.
Mean ratio (raw): The mean un-normalized ratio of the replicates for each condition in natural scale.
Mean ratio (log): The mean
8. [Screenshot: SVM classifier preferences window, with the panels SVM options, Polynomial kernel options, Sigmoid (MLP) kernel options, RBF kernel options, Model validation options and General options]

The following table explains the available options in the SVM options, Model validation options, General options, Polynomial kernel options, Sigmoid (MLP) kernel options and RBF kernel options panels.

Panel: Options
SVM options: Kernel, Normalize, Scale, Tolerance
Polynomial kernel: Gamma, Coefficient, Degree, Parameters
Sigmoid kernel: Gamma, Coefficient, Parameters
RBF kernel: Gamma, Parameters
Model validation: N-fold cross validation, Leave M out, Training and Test
General: Display evaluation plots

Option: Description
Kernel: The kernel function type used to build the classifier model. The following kernel types are available (X denotes the data matrix):
  Linear: The kernel function has the form $k(X) = wX + b = 0$
  Polynomial: The kernel funct
9. This box is available for all the options in the Plot options panel. If checked, the expression profile plots will also display a centroid, calculated using the mean expression of the selected genes or of the genes belonging to each gene cluster. Error bars are also created, displaying the expression standard deviation.

This box is enabled only for the Gene clusters option in the Plot options panel. If checked, expression profiles will be displayed in one figure with multiple plots, instead of multiple figures (one for each cluster).

This box, if checked, will display the expression of each gene with a different color, instead of using only one color for each gene. It should be checked when plotting gene clusters because it offers better visualization.

This box, if checked, will create a legend in the figure containing the GeneIDs that correspond to the different lines in the expression profile plot. For proper visualization, it should not be used when plotting a large number of genes.

By filling this field, the user can provide title(s) for plots. As with other plot preference windows in ARMADA, the number of titles should match the number of figures to be created (e.g. 5 titles for 5 figures of different clusters).

This option, if chosen, will produce expression profile plots where gene expression is calculated from the mean of all the arrays for each condition.

This option, if chosen, will produce expression profile plots using expression values f
10. can help the user identify several spatial hybridization effects or the presence of artifacts on arrays responsible for high background contamination, and perform quality control. To create array images, the user should select an array from the Arrays list and then click Plots → Array Images. The following window appears:

[Screenshot: Array Image Editor window, with the fields Display (e.g. Channel 1 Foreground Mean), Image, Title(s), Image colormap, Density (e.g. 64) and Display Colorbar]

From there, the user is able to select the type of available data to be displayed on the reconstructed image, the image dimensionality to be displayed (2D or 3D), the color settings (Image colormap) as well as the color density (e.g. a density of 64 will create 64 intermediate colors between the basic colors that define the colormap, while a density of 256 will create 256 intermediate variations), and whether to display a bar depicting the color-data correspondence for the data range that was used to create the image. The following picture depicts the supported colormaps (taken from MATLAB's help), apart from the Red and Green colormaps, which are created by ARMADA. The default colormap is the Jet colormap.

The following table presents the data types derived from the imported files that the user can use to create array images (choices may vary depending on the input data file type and the available data in the case of importing text tab delimited
11. To obtain results of better quality, the user should use the figure export setup, which is accessible by clicking File → Export Setup, where many more parameters can be set in order to optimize the graphical output. For more information, the user should consult http://www.mathworks.com/access/helpdesk/help/techdoc/matlab.html, in the Preparing Graphs for Presentation section under Graphics.

6.4 Exporting to .mat files

As ARMADA is addressed to both experienced and inexperienced users, the more experienced user can export the results from several analysis steps to a .mat file and import it to MATLAB for further processing with MATLAB's internal algorithms, or use specific functions from several toolboxes. To be able to read ARMADA .mat file exports, MATLAB 7.1 (R13SP3) should be installed on the user's machine and the Statistics Toolbox should be present.

Note: The purpose of ARMADA is to allow users not experienced with MATLAB to use the program, and to offer a free analysis tool which needs only the MATLAB Component Runtime to run and not necessarily MATLAB installed on the user's machine.

To export ARMADA results to .mat files, the user should click on File Export Settings MATLAB Workspace. The following window will appear:

Export to MATLAB window: Analysis list (Analysis 1, Analysis 2, Analysis 3, Analysis 4); Export options: Gene names, Raw data (image software output), Un-normalized log ratio i
12. button on the main window.

4.2 Fold Change Calculation

Apart from statistical testing, which leads to statistical score values, fold changes can provide useful estimations of how much a gene is differentiated compared to its control or other conditions. In order to calculate fold changes, the user should select an Analysis object from the Analysis Objects list and click Statistics → Fold Change Calculation, and the following window will appear:

[Screenshot: Fold Change Editor window, with the Select control (e.g. WT) and Select treated (e.g. D15) lists and the Calculate and Cancel buttons]

In this preferences window, the user should define pairs of experimental conditions so that fold changes can be calculated. The user can define pairs by using the Select control and Select treated lists, and using the Add >> and << Remove buttons to add or remove pairs, respectively. The Select control and Select treated lists contain the names of the experimental conditions of the Analysis object selected in the Analysis Object list. After finishing with pair assignment, the user should click Calculate to calculate fold changes based on the assigned pairs.

4.3 Clustering

The term cluster analysis (or clustering) encompasses a number of different algorithms and methods for grouping objects of similar kind into respective categories. A general question facing researchers in many areas of inquiry is how to organize observed data into meaningful structures. In oth
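The pair-based fold change calculation of section 4.2 can be sketched as follows. This is a minimal Python illustration and not ARMADA's implementation: it assumes that expression values are log2 ratios and that the fold change of a (control, treated) pair is the difference of the condition means; the function name and the example values are hypothetical, while the condition names WT and D15 are borrowed from the Fold Change Editor window.

```python
from statistics import mean


def fold_changes(expression, pairs):
    """Log2 fold change for each (control, treated) pair of conditions.

    expression: dict mapping a condition name to the list of replicate
                log2 expression values of one gene (assumed layout).
    pairs:      list of (control, treated) condition name tuples.
    """
    return {
        (control, treated): mean(expression[treated]) - mean(expression[control])
        for control, treated in pairs
    }


# Hypothetical replicate values for a single gene.
gene = {"WT": [1.0, 1.2, 0.8], "D15": [2.9, 3.1, 3.0]}
pairs = [("WT", "D15")]

log2_fc = fold_changes(gene, pairs)
print(log2_fc[("WT", "D15")])        # log2 fold change
print(2 ** log2_fc[("WT", "D15")])   # fold change in natural scale
```

Keeping the pairs as an explicit list mirrors the Add >> and << Remove workflow: each pair contributes one fold change value per gene.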
13. i.e. commonly mislabelling one as another.

[Figures: misclassification error against discriminant function type (linear, diaglinear, quadratic, diagquadratic, mahalanobis), with uniform and empirical curves, for three evaluation methods: 10-fold cross validation, leave-1-out validation, and training and test split (60% training and 40% test).]

If the user selects to display a classifier evaluation report by clicking Display output results, a window like the following will appear, presenting the classifier evaluation results:

[Screenshot: DA Classifier Tuning Results window, reporting for each discriminant function type (e.g. diaglinear) the misclassification error (e.g. 0.04), the classification accuracy and a confusion table for the classes setosa, versicolor and virginica]

If the use
14. There are two kinds of brief reports that can be displayed from ARMADA in separate windows: array reports and analysis reports. Array reports can be obtained by right-clicking on an array in the Arrays list and then selecting Report, or by clicking View → Array Report, while analysis reports can be obtained by right-clicking on an analysis object in the Analysis Object list and then selecting Report, or by clicking View → Analysis Report.

[Screenshot: Array Report window]
15. FWER methods are unsuitable for microarray data, mostly because they are too conservative: after correcting for multiple testing, no single gene may meet the threshold for statistical significance. In contrast, FDR methods, instead of adjusting p-values, seek to minimize the proportion of errors committed by falsely rejecting null hypotheses. As they are less stringent than FWER methods, they are considered more suitable for microarray data. However, a common drawback of both is that they do not assume general variable dependence, which is usually the case for microarrays because genes are involved in complicated interaction networks and pathways.

Appendix D: Distance metrics and linkage algorithms

This appendix describes the linkage algorithms used in hierarchical clustering (section 4.3.1) and the distance metrics used in several processes in ARMADA. The descriptions below are based on MATLAB's help.

D.1 Distance metrics

Let $X$ denote an $m \times n$ data matrix whose $m$ rows can be thought of as $m$ vectors, each consisting of $n$ elements (dimensions). The following table defines the various distances between two row vectors $x_i$ and $x_j$ of the matrix $X$.

Distance: Definition
Euclidean: $d_{ij}^2 = (x_i - x_j)(x_i - x_j)^T$
Standardized Euclidean: $d_{ij}^2 = (x_i - x_j)\, D^{-1} (x_i - x_j)^T$, where $D$ is the diagonal matrix with diagonal elements given by $v_k^2$, which denotes the variance of the variable $X_k$ over the $m$ objects.
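The difference between the Euclidean and the standardized Euclidean distance defined above can be seen on a small data matrix. The following Python sketch is an illustration only; the function names and the example matrix are hypothetical, and the use of the sample variance (division by m - 1) is an assumption.

```python
import math


def column_variances(X):
    """Sample variance of each column (variable) over the m rows of X."""
    m = len(X)
    columns = list(zip(*X))
    means = [sum(column) / m for column in columns]
    return [
        sum((value - mu) ** 2 for value in column) / (m - 1)
        for column, mu in zip(columns, means)
    ]


def euclidean(x, y):
    """Euclidean distance between two row vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def standardized_euclidean(x, y, variances):
    """Euclidean distance with each squared difference divided by the
    variance of the corresponding variable."""
    return math.sqrt(
        sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances))
    )


# Hypothetical 3 x 2 data matrix; the second variable has a larger scale.
X = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
variances = column_variances(X)

print(euclidean(X[0], X[1]))
print(standardized_euclidean(X[0], X[1], variances))
```

In the plain Euclidean distance the second variable dominates because of its scale, while the standardized version weights both variables comparably.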
16. [Data table with columns WT Rep 1, WT Rep 2, WT Rep 3 and WT Rep 4]
17. 12380
Creator=GenePix Pro 3.0.0.98
Temperature=1 53816
LaserPower=1 2396602 11066
LaserOnTime=272340268019

Block | Column | Row | Name | ID | X | Y | Dia | F635 Median | F635 Mean | F635 SD | B635 Median | B635 Mean | B635 SD | % > B635+1SD | % > B635+2SD | F635 % Sat | F532 Median | F532 Mean | F5
1 | 1 | 1 | Mus muscult | 834712 | 3410 | 12580 | 30 | 4122 | 4442 | 1673 | 1487 | 1895 | 1528 | 73 | 42 | | 3265 | 3716
1 | 2 | 1 | ESTs | 761302 | 3650 | 12580 | 30 | 8574 | 8474 | 1012 | 1510 | 1617 | 396 | 100 | 100 | 0 | 7103 | 7091
1 | 3 | 1 | ESTs | 761213 | 3870 | 12570 | 120 | 1705 | 1783 | 590 | 1124 | 1212 | 499 | 60 | 17 | 0 | 923 | 1252
1 | 4 | 1 | ESTs Moder | 304482 | 4120 | 12580 | 90 | 2044 | 2997 | 900 | 1168 | 1249 | 631 | 92 | 76 | 0 | 2951 | 29485
1 | 5 | 1 | myosin lb | 761018 | 4380 | 12580 | 30 | 1779 | 1925 | 701 | 1202 | 1279 | 424 | 57 | 32 | 0 | 1559 | 1611
1 | 6 | 1 | ESTs Highly | 305218 | 4620 | 12620 | 170 | 3441 | 4833 | 4571 | 1184 | 13892 | 1162 | 75 | 49 | 0 | 452 | 1754
1 | 7 | 1 | ESTs Moder | 761038 | 4850 | 12580 | 110 | 2307 | 2312 | 556 | 1085 | 1275 | 1221 | 47 | 2 | 0 | 2041 | 2066
1 | 8 | 1 | ESTs | 774996 | 5150 | 12550 | 140 | 924 | 983 | 316 | 900 | 1138 | 2617 | 0 | 0 | 0 | 232 | 252
1 | 9 | 1 | ESTs Highly | 779143 | 5340 | 12580 | 90 | 2403 | 2902 | 2179 | 1170 | 1639 | 4145 | 5 | 1 | 0 | 2029 | 1856
1 | 10 | 1 | | 780127 | 5600 | 12580 | 100 | 2527 | 2381 | 887 | 1088 | 1331 | 1228 | 55 | 6 | 0 | 2264 | 1969
1 | 11 | 1 | speckle type | 791503 | 5880 | 12540 | 140 | 1070 | 1208 | 541 | 1038 | 1367 | 1177 | 3 | 1 | 0 | 285 | 397
1 | 12 | 1 | ESTs Weakl | 959839 | 6120 | 12540 | 140 | 1105 | 1480 | 2614 | 999 | 1193 | 1078 | 7 | 1 | 0 | 253 | 332
1 | 13 | 1 | ESTs | 775407 | 6370 | 12550 | 100 | 1918 | 2062 | 1263 | 854 | 1190 | 1645 | 18 | 3 | 0 | 1647 | 1610
1 | 14 | 1 | ESTs Weakl | 778381 | 6620 | 12550 | 100 | 6424 | 6962 | 3644 | 939 | 1043 | 489 | 100 | 100 | 0 | 4152 | 4362
1 | 15 | 1 | ESTs | 779847 | 6860 | 12550 | 110 | 15322 | 1
19. (2001) Issues in cDNA microarray analysis: quality filtering, channel normalization, models of variations and assessment of gene effects. Nucleic Acids Research, 29, 2549-2557.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D. and Altman, R.B. (2001) Missing value estimation methods for DNA microarrays. Bioinformatics (Oxford, England), 17, 520-525.
Bolstad, B.M., Irizarry, R.A., Astrand, M. and Speed, T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics (Oxford, England), 19, 185-193.
Dudoit, S., Yang, Y.H., Callow, M.J. and Speed, T.P. (2002) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica, 12, 111-140.
Benjamini, Y. and Hochberg, Y. (1995) Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. J. R. Statist. Soc., 57, 289-300.
Storey, J.D. and Tibshirani, R. (2003) Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences of the United States of America, 100, 9440-9445.
Speed, T.P. (ed.) (2003) Statistical analysis of gene expression microarray data. Chapman & Hall/CRC.
Jain, A. and Dubes, R. (1988) Algorithms for clustering data. Prentice Hall, Englewood Cliffs.
Dembele, D. and Kastner, P. (2003) Fuzzy C-means method for clustering microarray data. Bioinformatic
20. Below there is an example of a volcano plot in data selection mode:

[Figure: volcano plot in data selection mode, showing -log10(p-value) against Fold change (effect), with Data, Up regulated and Down regulated points, the fold change and p-value cutoffs, and the selection menu]

If the user right-clicks inside the volcano plot area, the following menu will appear: Select Data, Export up-regulated, Export down-regulated, Export deregulated, Export unregulated, Export All.

The following table explains the functions of the items displayed in the menu appearing after right-clicking.

Name: Function
Select Data: Switches between data exploration and data selection modes.
Export Selected: While in selection mode, exports the data points defined by the rectangular selection area. The user must right-click on one of the edges of the selection area.
Export up-regulated: Exports up-regulated genes (red data points). Available only if fold change and/or p-value thresholds have been provided.
Export down-regulated: Exports down-regulated genes (green data points). Available only if fold change and/or p-value thresholds have been provided.
Export deregulated: Exports up- and down-regulated genes (red and green data points). Available only if fold change and/or p-value thresholds have been provided.
Export unregulated: Exports up u
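The grouping of genes behind the export items of the volcano plot menu can be sketched as a simple thresholding rule. The following Python sketch is an assumption about how such a rule could look, not ARMADA's actual logic: it requires both cutoffs to be met, treats fold changes as log2 values, and uses hypothetical function names, gene names and cutoff values.

```python
import math


def classify(log2_fc, p_value, fc_cutoff=1.0, p_cutoff=0.05):
    """Label one gene as 'up', 'down' or 'unregulated'.

    A gene is deregulated here only if it passes both the p-value cutoff
    and the absolute fold change cutoff (assumed rule).
    """
    if p_value < p_cutoff and log2_fc >= fc_cutoff:
        return "up"
    if p_value < p_cutoff and log2_fc <= -fc_cutoff:
        return "down"
    return "unregulated"


def volcano_point(log2_fc, p_value):
    """Plot coordinates of a gene: (fold change, -log10 p-value)."""
    return (log2_fc, -math.log10(p_value))


# Hypothetical genes: name -> (log2 fold change, p-value).
genes = {
    "geneA": (2.0, 0.001),
    "geneB": (-1.5, 0.01),
    "geneC": (0.3, 0.0001),
    "geneD": (3.0, 0.5),
}

labels = {name: classify(fc, p) for name, (fc, p) in genes.items()}
deregulated = [name for name, label in labels.items() if label != "unregulated"]

print(labels)
print(deregulated)
```

Under this rule, the deregulated set is the union of the up-regulated and down-regulated sets, matching the description of the Export deregulated item.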
21. Data type: Description
Normalized Ratio: Between-channels log ratio, as calculated after data normalization.
Unnormalized Ratio: Between-channels log ratio, as calculated prior to data normalization.

After setting the desired parameters, the user should click OK in order to create the images. Below there are two examples of 2- and 3-dimensional array (normalized or un-normalized) images created with different colormaps:

[Figures: 2D Image of Un-normalized ratio for array Wt_1r.txt (Colormap: Jet); 3D Image of Normalized ratio for array Wt_1r.txt (Colormap: Red-Green)]

If the user clicks on any of the images created, individual spot data are displayed, as in Normalized Image in ARMADA's main window. The user should also note that un-normalized spatial images are available only if grid coordinates and meta-coordinates are provided with the input files, and that if meta-coordinates exist the un-normalized images are available right after the data normalization step (the user should see also 3.4). At this point it should be noted that for better image exploration, as well as image saving, exporting and other figure operations, the user can utilize several figure controls and utilities which are provided by MATLAB's interface and are briefly explained in Appendix B.

5.3 Array plots

An array plot can depict the comparison of several input data e
22. [Figure: Normalized ratio distribution for arrays of condition WT (Normalization: Loess with Span 0.1); histograms of gene frequency against ratio, one panel per array.]
23. Ratio takes into consideration the signal-to-noise content of a signal, an established notion in systems theory and image processing [1], thus coming in line with the perception of the experimentalist about the quality of a signal, taking into account their interest in the strength of the signal compared to noise. The following example illustrates the strength of using the signal-to-noise ratio for the net signal estimation compared to background subtraction, which is the most common method used for background correction. Let $S$ denote the signal mean, $B$ the background noise mean and $\hat{S}$ the final signal estimation for a spot on the microarray. Let also $i$, $j$ denote two arbitrary genes on the microarray and let $S_i = 200$, $B_i = 100$, $S_j = 1100$ and $B_j = 1000$. Then the net signal estimation with each of the two methods would be:

Background subtraction: $\hat{S}_i = S_i - B_i = 200 - 100 = 100$, $\hat{S}_j = S_j - B_j = 1100 - 1000 = 100$
Signal-to-noise ratio: $\hat{S}_i = S_i / B_i = 200 / 100 = 2$, $\hat{S}_j = S_j / B_j = 1100 / 1000 = 1.1$

It can be seen that when using background subtraction, signals of different intensity range (and thus signal-to-noise ratio), with variation of different order, are assigned similar corrected values, a fact that could lead to misinterpretation of subsequent analysis. On the other hand, the signal-to-noise ratio provides a more rational scale of measurement, and it can be seen that when the signal distribution is too close to the background distribution, the signal-to-noise ratio is close to 1 and ca
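The worked example above can be reproduced in a few lines. The following Python sketch simply restates the two estimators with the values given in the text; the function and variable names are hypothetical.

```python
def background_subtraction(signal, background):
    """Net signal estimated as signal mean minus background mean."""
    return signal - background


def signal_to_noise(signal, background):
    """Net signal estimated as the ratio of signal mean to background mean."""
    return signal / background


# Values from the example: gene i and gene j as (signal, background).
spots = {"i": (200, 100), "j": (1100, 1000)}

subtracted = {g: background_subtraction(s, b) for g, (s, b) in spots.items()}
ratios = {g: signal_to_noise(s, b) for g, (s, b) in spots.items()}

print(subtracted)  # both genes receive the same corrected value
print(ratios)      # the ratio separates the two genes
```

The subtraction gives the same value for both spots, while the ratio distinguishes the spot whose signal is twice its background from the one that barely exceeds it.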
24. Selected. A new window will appear, prompting the user to select a location to place the new file that will contain information (GeneIDs and expression values) on the selected data points. Below there is an example of how the figure looks when data selection mode is on:

[Figure: Un-normalized MA plot for array d_3r.txt, showing Ratio (M) against Intensity (A), with MA data, normalization points and the fold change cutoff, and the menu items Export Selected and Export All]

The following table explains the functions of the items displayed in the menu appearing after right-clicking.

Name: Function
Select Data: Switches between data exploration and data selection modes.
Export Selected: While in selection mode, exports the data points defined by the rectangular selection area. The user must right-click on one of the edges of the selection area.
Export All: Exports all data points, regardless of mode status.

5.4.2 MA plots after normalization

The same things concerning image modes apply also in the case of MA plots after the normalization procedure, with the only difference that when the user right-clicks inside the image, the appearing menu is different: Export up-regulated, Export down-regulated, Export deregulated, Export unregulated, Export All. Below there is an example of an MA plot after normalization for a specific array:

[Figure: Normalized MA plot for array d_3r.txt (Normalization: Loess with Span 0.1), showing Ratio (M) with MA data and Up regulated
25. as well as slide probe coordinates could generate errors. If the user is not sure about the validity of these attributes, they should not be provided. ARMADA will still produce an image, but it will not represent the distribution of probes on the array.

slide is organized in blocks, the Blocks attribute should be provided.

Attribute: Description (Optional)
Rows: Probe coordinates (row) in each block, or simple probe coordinates. This attribute helps in the reconstruction of array images. If given with meta-coordinates, it helps in the reconstruction of blocks in the image. If meta-coordinates are not given, this attribute together with Columns are taken to be the array coordinates. (Optional: Yes)
Columns: Probe coordinates (column) in each block, or simple probe coordinates. This attribute helps in the reconstruction of array images. If given with meta-coordinates, it helps in the reconstruction of blocks in the image. If meta-coordinates are not given, this attribute together with Rows are taken to be the array coordinates. (Optional: Yes)
Gene Names: The column containing (ideally unique) gene identifiers, usually provided by manufacturers. (Optional: No)
Spot Flags: The column containing flags, manually or automatically produced by the image analysis software used, marking poor spots. Note that this column should contain only ones (1s) and zeros (0s), with 1 representing good spots and 0 poor quality spots. If you wish to provide this attribute, make sure that your files contain p (Optional: Yes)
26. be seen in the upper panel that the within-cluster dispersion measure W (red line) drops steeply until the number of clusters reaches 5 and then rises again. Similarly, the Gap curve (blue line) rises until the number of clusters is 5 and then drops. However, it can be observed that W drops again until the number of clusters becomes 8, and for a larger number of clusters presents small changes. Correspondingly, the Gap curve rises until the number of clusters is 8 and then drops very slightly. This fact can give a clue that the algorithm fell on a local minimum and returned 5 as the optimal number of clusters, while there might be another optimal solution. This does not mean that 5 is not a correct solution, but rather that there is more than one possible solution.

7.3 The Batch Programmer

Many times, depending on the nature of the experiment, it is required to perform several rounds of statistical selection procedures (e.g. when the experiment includes lots of possible contrasts) in order to extract different results corresponding to each case. Moreover, it is possible that the analyst would like to follow different analysis workflows concerning the statistical or the clustering procedures, in order to compare different methods or combine the results. As all these cases can take a considerable amount of time to perform, this section presents how multiple analysis steps can be programmed into a batch process through a simple interfac
27. can be imported to conduct statistical tests. Depending on the input data, some plots might be unavailable. Processed data files should have only one column with gene IDs (accession numbers, manufacturer's IDs, etc.), and all other columns should be numeric. If only ratio values are available, the file should have as many ratio columns as the number of microarrays in the experiment. Each column name should be a unique string, so that it can be used as the unique array identifier. For example, if the experiment consists of 20 arrays, the file should have 20 + 1 = 21 columns. In the case of ratio-intensity pairs, the file should have, apart from the gene IDs column, twice as many columns as the number of microarrays in the experiment. For example, if the experiment consists of 20 arrays, the file should have 20 + 20 + 1 = 41 columns. The user should make sure that there are no extra columns in such files, as they will generate an error. When importing ratio-intensity pairs, the user should make sure that every ratio column has its intensity pair column. Again, all columns should have a unique name. Additionally, any missing data (missing cells in the file columns) should be empty or contain the string NaN. Any other string, such as NULL, will generate an error. The user can easily replace any other strings with a text editor or s
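The layout rules above (column count, unique column names, numeric cells, empty or NaN for missing data) can be verified with a short script before importing. The following Python sketch is illustrative only and is not part of ARMADA; the function name check_processed_file is hypothetical, and it assumes a tab-delimited file with a header row and the gene IDs in the first column.

```python
import csv
import io


def check_processed_file(text, n_arrays, paired=False):
    """List the problems found in a tab-delimited processed-data table.

    Assumptions of this sketch: the first row holds the column names and
    the first column holds the gene IDs.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    # One gene ID column plus one (ratio) or two (ratio + intensity) per array.
    expected = 1 + (2 * n_arrays if paired else n_arrays)
    problems = []
    if len(header) != expected:
        problems.append(f"expected {expected} columns, found {len(header)}")
    if len(set(header)) != len(header):
        problems.append("column names are not unique")
    for line_no, row in enumerate(body, start=2):
        for cell in row[1:]:
            if cell in ("", "NaN"):  # allowed ways to mark missing data
                continue
            try:
                float(cell)
            except ValueError:
                problems.append(f"line {line_no}: non-numeric value {cell!r}")
    return problems


sample = "GeneID\tA1\tA2\ng1\t0.5\tNaN\ng2\tNULL\t1.2\n"
print(check_processed_file(sample, n_arrays=2))
```

For an experiment of 20 arrays, the expected column count computed by the sketch is 21 for ratio-only files and 41 for ratio-intensity pairs, matching the examples in the text.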
28. data fields specified. If the user wishes to change the number of fields exported, the export gene list preferences window can be used again. It should be noted that the tables presented in ARMADA's main window and described in sections 2.6.9 and 2.6.10 will contain the fields specified in the export gene list preferences window.

6.2 Exporting gene cluster lists

The gene cluster files contain the results of the clustering processes (the user should see 4.3) and their format is standard. To export gene cluster files, the user should click on File Export Data Gene Clusters List, or right-click on the selected Analysis from the Analysis Object list and select Export Clusters List, or click on the Export Clusters shortcut button on the main window. The user will then be prompted to select the storage location of the clusters file. As with the gene list files, the cluster output files can be either text tab-delimited or Excel files. The following table explains the meaning of each header in the cluster files.

Column name: Description
Slide Position: The numbers denoting each gene's unique position on the microarray slide (the user should see 2.5 and 2.5.2).
GeneID: The gene identifier names (usually the chip manufacturer's identification names), which serve as a textual identification for each gene.
ClusterNo: The ID of the cluster each gene belongs to.
Feature depending on: The value of a specific feature which can be different for
29. each algorithm clustering algorithm, e.g. for hierarchical clustering the Silhouette value is returned, while for k-means clustering the sum of distances from the cluster centroid is returned for each gene. The user should see 4.3 for further details.
p-value: The p-value (or adjusted p-value) returned from the statistical test applied.
Data columns: The rest of the columns, until the end of the file, contain normalized expression values according to the values the clustering is based on (e.g. means or replicates; the user should see 4.3). Additionally, in the case of fuzzy c-means clustering, each gene's membership coefficient (the user should see 4.3.3) is returned in the c columns following the data columns.

6.3 Exporting figures

While the user could utilize simple screen capture tools, or even simply press the PrtScn key, to capture the diagrams created by ARMADA, it is better to use MATLAB's figure saving and exporting controls, even if MATLAB is not present on the machine. The following figure is used as an example of image or diagram exporting.

[Figure: MATLAB figure window showing the Volcano Plot for WT vs D15, with the legend entries Data, Up regulated, Down regulated, Fold change cutoff and p-value cutoff]

By clicking on File Save As, the user is able to save the figure in any of the widely used image formats, e.g. jpg or png formats
30. experimental conditions to perform different statistical tests without re-performing the sometimes time-consuming step of normalization. The user is able to select different sets of experimental conditions and replicates and create different analysis objects by clicking Preprocessing Select Conditions. The following window appears:

[Figure: Select Conditions Editor window, with the Selection Settings checkboxes Use same preprocessing steps as 1st time if performed and Select all replicates for each condition, the Conditions, Replicates and Chosen replicates lists, and a Remove button]

The above interface helps the user to create a new analysis object by defining which experimental conditions, and which arrays from each experimental condition, should be included in the analysis object. This feature is helpful, for example, when experiment quality control or filtering has shown that an array from one of the conditions is of poor quality and should perhaps not be used in further analysis, but should still be imported to, or remain in, ARMADA so that data exploration can be performed on it (e.g. creating array images and comparing the signal with the background noise). If the user has chosen to perform the preprocessing procedures up to normalization for the whole dataset, without selecting any subset of experimental conditions at the beginning, the checkbox Use same preprocessing steps as 1st time if per
31. following < > 1 104 8 or white spaces will generate an error message and will be automatically replaced by valid names. After properly setting the parameters described above, the user should click on the button Select Files and will be prompted to select the directory where the data files are placed (it is not necessary to have all files in one directory; this step exists for the user's convenience). The following window will appear, which helps the user select the files of the experiment:

[Figure: Select files for condition Control window, with a File Filter set to Show All Files, the Current Directory and Selected Files lists showing raw data text files whose names end in Control, Knock1 or Knock2, the options Remove duplicates as per full path and Show full paths, and a Cancel button]
32. g Channel 1 mean signal vs Channel 2 mean signal, based on the input files, or can depict the comparison between different arrays for the same measurements, as well as log ratios or intensities if normalization has been performed. Such images can help the user identify several phenomena connected to the nature of the experiment, or identify correlations or differences between different dyes or between different arrays of the same or another experimental condition. To create array plots, the user should click Plots Array Plots and the following window will appear:

[Figure: Array Plot Editor window, with the All arrays and Normalized arrays lists, the Plot options Single array and Array vs array, the title fields Titles for all arrays and Title for normalized arrays, the Data to plot selector (e.g. log2 Ratio), and the General options Display correlation, Plot in log scale, Display cutoff lines and Cutoff level]

In the All arrays list, all the imported arrays of the experiment are displayed and the user can select which arrays to plot. The Normalized arrays list, on the other hand, displays the normalized arrays which correspond
33. not
Whether the 2 channel sample channel refers to Cy3 or Cy5.
Branch name for the statistical selection steps applied to the data of the project after preprocessing. Its children contain details on the steps performed.
The name of the Between Slide Normalization (BSN) that was utilized, if any.
The missing value imputation algorithm that was used to impute any missing value caused by the image processing or the filtering steps.
Impute missing values before or after between slide normalization.
The Trust Factor filter value.
The statistical test used in the statistical selection process.
The multiple testing correction procedure applied, if any.
The p-value (or FDR, depending on the multiple testing correction method) threshold applied in the process of statistical selection.
The number of Differentially Expressed genes found after the application of a statistical test.
Branch name for the clustering steps applied to the data of the project after statistical selection. Its children contain details on the steps performed.
The clustering algorithm utilized.
The linkage algorithm used (only in the case of hierarchical clustering).
The distance calculation metric used with the chosen clustering algorithm.
The initial cluster position (only in the case of k-means clustering).
The limitation method used to identify the number of clusters.
A p-value cutoff to filter the differentially expressed genes to be clustered after the statistical selection pro
34. [Figures: Non normalized ratio boxplot for Ratio and Normalized ratio boxplot for Ratio, with one box per array]

5.7 Volcano Plots

Volcano plots are useful for visualizing differentially expressed genes that have already been detected using a statistical test. In ARMADA, they plot the log fold change on the horizontal axis against the quantity -log10(p-value) on the vertical axis, where the p-value comes from a statistical test (the user should see section 4). The volcano plot can be used to visualize differentially expressed genes, and also to show that large fold changes do not necessarily imply statistical significance, or the opposite. Moreover, volcano plots can be created only for pairwise statistical comparisons (e.g. when performing a t-test between control and samples treated with a specific drug) and not in cases where the user seeks statistically significant genes among several conditions (e.g. using 1-way ANOVA to identify genes differentially expressed in at least one among five experimental configurations). Thus the Volcano Plots
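The coordinates of a volcano plot can be computed as in the following Python sketch. It is an illustration rather than ARMADA's code: the function name volcano_points and the labels are hypothetical, fold changes are assumed to be given in natural scale, and base 2 is assumed for the log fold change.

```python
import math


def volcano_points(fold_changes, p_values, fc_cutoff=2.0, p_cutoff=0.05):
    """Return (x, y, label) per gene for a volcano plot.

    x is log2(fold change) and y is -log10(p-value). Both the base-2 log
    and the labelling rule are assumptions of this sketch.
    """
    limit = math.log2(fc_cutoff)
    points = []
    for fc, p in zip(fold_changes, p_values):
        x = math.log2(fc)
        y = -math.log10(p)
        if p < p_cutoff and x >= limit:
            label = "up"
        elif p < p_cutoff and x <= -limit:
            label = "down"
        else:
            label = "unchanged"
        points.append((x, y, label))
    return points


# Third gene: large fold change but not significant, so it stays "unchanged".
for point in volcano_points([4.0, 0.25, 4.0], [0.01, 0.01, 0.5]):
    print(point)
```

The third gene in the example illustrates the remark above that a large fold change does not by itself imply statistical significance.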
35. rotating, querying and editing plots. The following picture shows the features available from this toolbar:

[Figure: figure toolbar with the callouts Enable plot edit mode, Zoom in/out, Insert color bar, Insert legend, Pan, Rotate 3-D, Data cursor, Hide/display plot tools]

Note that two other toolbars can be enabled from the View menu: the Camera Toolbar, which is used for manipulating 3-D views,

[Figure: Camera Toolbar with the callouts Camera Motion Controls, Principal Axis Selector, Scene Light, Projection Type, Reset and Stop]

and the Plot Edit Toolbar, which is used for annotation and setting object properties.

[Figure: Plot Edit Toolbar with the callouts Click this button to enable property editing of graphic objects; Pin object to data point; Display the object alignment tool; Fill color and line/edge color; Text color, font, bold or italics; Align text; Insert lines and arrows; Insert textarrow, text, rectangle and ellipse]

Generally, MATLAB's figure interface offers many possibilities for figure manipulation. Through the figure's menus, the user can add annotation components (textboxes, arrows, etc.) to figures, as well as change titles and axes titles, change colors and colormaps, draw elements, etc. The user is also able to export figures in many available formats. For more thorough information and examples on MATLAB's figure interfaces and possibilities, the user should check http://www.mathworks.com/access/helpdesk/help/techdoc/matlab.html under Graphics.

Appendix
36. \[ M = \operatorname{median}_i \left( \log_2 \frac{R_i}{G_i} \right) \]

LOWESS (linear fit): LOWESS normalization normalizes data on each microarray slide by local regression of log ratio against intensity, using weighted linear least squares and a 1st degree polynomial model. This model is used to calculate normalized expression values for each gene.

\[ N\left(\log_2 \frac{R_i}{G_i}\right) = \log_2 \frac{R_i}{G_i} - \mathrm{LOWESS}\left(\log_2 \sqrt{R_i G_i}\right) \]

Robust LOWESS (linear fit): Robust LOWESS normalization normalizes data on each microarray slide by local regression of log ratio against intensity, using weighted linear least squares and a 1st degree polynomial model. The robust version of LOWESS performs additional fitting iterations and assigns lower weight to outliers in the regression. The method assigns zero weight to data outside six mean absolute deviations. This model is used to calculate normalized expression values for each gene. Robust LOWESS needs more time to complete than simple LOWESS, but produces results that are more robust against possible outliers.

\[ N\left(\log_2 \frac{R_i}{G_i}\right) = \log_2 \frac{R_i}{G_i} - \mathrm{RobustLOWESS}\left(\log_2 \sqrt{R_i G_i}\right) \]

LOESS (quadratic fit): LOESS normalization normalizes data on each microarray slide by local regression of log ratio against intensity, using weighted linear least squares and a 2nd degree polynomial model. This model is used to calculate normalized expression values for each gene.

\[ N\left(\log_2 \frac{R_i}{G_i}\right) = \log_2 \frac{R_i}{G_i} - \mathrm{LOESS}\left(\log_2 \sqrt{R_i G_i}\right) \]

Method (continued): Robust LOESS (quadratic fit), Rank Invariant, No normalization

Robust LOESS n
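The idea behind LOWESS normalization (fit a local, weighted, degree-1 regression of log ratio against intensity, then subtract the fitted trend) can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not ARMADA's or MATLAB's implementation: it uses tricube weights, performs no robustness iterations, takes the intensity as log2 of the square root of R times G, and reads a span of at most 1 as a fraction of the points. All function names are hypothetical.

```python
import math


def tricube(u):
    """Tricube kernel: weight 1 at the centre, 0 at and beyond the window edge."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear_fit(x, y, span=0.5):
    """Fitted value at every x[i] from a weighted degree-1 local regression."""
    n = len(x)
    # Span <= 1 is read as a fraction of the points, otherwise as a point count.
    k = math.ceil(span * n) if span <= 1 else int(span)
    k = min(max(k, 3), n)
    fitted = []
    for xi in x:
        # Window half-width: distance to the k-th nearest point.
        h = sorted(abs(xj - xi) for xj in x)[k - 1] or 1e-12
        w = [tricube((xj - xi) / h) for xj in x]
        sw = sum(w)
        mx = sum(wi * xj for wi, xj in zip(w, x)) / sw
        my = sum(wi * yj for wi, yj in zip(w, y)) / sw
        sxx = sum(wi * (xj - mx) ** 2 for wi, xj in zip(w, x))
        sxy = sum(wi * (xj - mx) * (yj - my) for wi, xj, yj in zip(w, x, y))
        slope = sxy / sxx if sxx > 1e-12 else 0.0
        fitted.append(my + slope * (xi - mx))
    return fitted


def lowess_normalize(red, green, span=0.5):
    """Subtract the local trend of log ratio (M) against intensity (A)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [math.log2(math.sqrt(r * g)) for r, g in zip(red, green)]
    trend = local_linear_fit(a, m, span)
    return [mi - ti for mi, ti in zip(m, trend)]


green = [100.0 * 2 ** i for i in range(8)]
red = [2.0 * g for g in green]  # constant dye bias of one log2 unit
print(lowess_normalize(red, green))
```

In the example, the red channel is uniformly twice the green channel, so the fitted trend absorbs the whole bias and the normalized log ratios are close to zero. A robust variant would repeat the fit while down-weighting points with large residuals.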
37. spots between replicates for each condition
Common good spots between replicates for each condition
Common bad spots between replicates for all conditions
Common good spots between replicates for all conditions

Description
The union (all filtered spots from all replicates) of filtered spots for each experimental condition.
The union (all remaining spots from all replicates) of remaining spots for each experimental condition.
Filtered spots for each experimental condition and each individual replicate.
Remaining spots for each experimental condition and each individual replicate.
The conditional intersection (filtered spots that are common between replicates) of filtered spots for each experimental condition.
The conditional intersection (remaining spots that are common between replicates) of remaining spots for each experimental condition.
The total intersection (filtered spots that are common between conditions) of filtered spots among all experimental conditions.
The total intersection (remaining spots that are common between conditions) of remaining spots among all experimental conditions.

For example, checking Bad spots for each condition and replicate will return an Excel file (the user chooses the name and storage location of the file) containing the filtered spots (named with their GeneID) for each array replicate under each condition in separate columns, while checking Common bad spots between replicates for all conditio
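The union, conditional intersection and total intersection described above map directly onto set operations. The following Python sketch is only an illustration; the function name summarize_filtered and the data layout (a dictionary from condition name to one set of GeneIDs per replicate) are assumptions, as is taking the total intersection to be the intersection of the per-condition intersections.

```python
def summarize_filtered(spots):
    """Summarize filtered spots across replicates and conditions.

    spots maps a condition name to a non-empty list of sets, one set of
    GeneIDs per replicate (layout assumed for this sketch).
    """
    # Union: spots filtered in at least one replicate of the condition.
    union = {cond: set().union(*reps) for cond, reps in spots.items()}
    # Conditional intersection: spots filtered in every replicate of the condition.
    common = {cond: set.intersection(*reps) for cond, reps in spots.items()}
    # Total intersection: spots common to all conditions.
    total = set.intersection(*common.values())
    return union, common, total


filtered = {
    "WT": [{"g1", "g2"}, {"g2", "g3"}],
    "D15": [{"g2"}, {"g2", "g4"}],
}
union, common, total = summarize_filtered(filtered)
print(sorted(union["WT"]), sorted(common["WT"]), sorted(total))
```

The same function applies unchanged to the remaining (good) spots.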
38. the case of LOWESS/LOESS methods, the field Spanning Neighbourhood should also be completed or left with its default value. The span value modifies the running window size (the proportion of neighbouring points to the currently processed point) for the smoothing function. If the span value is less than 1, the window size is taken to be a fraction of the number of points in the data. If the span value is greater than 1, the running window contains as many data points as the value given. The table below briefly describes the currently supported normalization methods. The notation used is R for Red (or Cy5, or Channel 2), G for Green (or Cy3, or Channel 1), and i denotes gene i on the array under normalization. The function notation N(.) denotes normalized values.

Method: Global Mean, Global Median, LOWESS (linear fit), Robust LOWESS (linear fit), LOESS (quadratic fit)

Brief description

Global Mean: Global Mean normalization normalizes data on each microarray slide by calculating the mean expression of all genes present on the array and subtracting this value from each individual gene.

\[ N\left(\log_2 \frac{R_i}{G_i}\right) = \log_2 \frac{R_i}{G_i} - M, \qquad M = \frac{1}{n_{genes}} \sum_{i=1}^{n_{genes}} \log_2 \frac{R_i}{G_i} \]

Global Median: Global Median normalization normalizes data on each microarray slide by calculating the median expression of all genes present on the array and subtracting this value from each individual gene.

\[ N\left(\log_2 \frac{R_i}{G_i}\right) = \log_2 \frac{R_i}{G_i} - M, \]
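Global Mean and Global Median normalization amount to centring the log ratios of one array. A minimal Python sketch follows; it is not ARMADA's code, the function name global_normalize is hypothetical, and it assumes that the "expression" being averaged is the log2 ratio of the two channels.

```python
import math
import statistics


def global_normalize(red, green, use_median=False):
    """Centre the log2 ratios of one array on their mean (or median)."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    if use_median:
        centre = statistics.median(log_ratios)
    else:
        centre = statistics.mean(log_ratios)
    # Subtract the array-wide centre M from every gene's log ratio.
    return [m - centre for m in log_ratios]


red = [2.0, 4.0, 8.0]
green = [1.0, 1.0, 1.0]
print(global_normalize(red, green))
print(global_normalize(red, green, use_median=True))
```

After either normalization, the chosen centre of the array's log ratios is zero, which is what makes arrays with different overall dye bias comparable.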
39. the following window:

[Figure: Clustering Batch Editor window, listing each Analysis (Analysis 1, Analysis 2) next to a Choose clustering method list, e.g. Hierarchical]

For each Analysis object, the user can define a different clustering algorithm. Selecting an algorithm from the Choose clustering method list opens the corresponding preferences window (the user should see section 4.3), so that parameters can be set or the defaults left unchanged. After making the necessary selections, the user should click OK. If OK is pressed without making any selections, the default parameters (those displayed in the window, e.g. hierarchical clustering with default parameters) will be used for the batch process. If the user does not wish to perform clustering, Cancel should be pressed instead. At this point, the user can save all previous settings for the defined batch process by clicking File Save batch, or save the settings under a different name by clicking File Save batch as. Note that simply saving a batch saves only the batch settings, not the results produced by a batch procedure. To save the results after a batch process is complete, the user should click File Save batch as and choose ADROMEDA Project Files apj in the field Save as type. In this way, the results are exported as an ARMADA project, which can be opened from ARMADA in order to perform data exploration and exporting. After having set all t
40. to limit the spread of the invariant set, but should allow enough data points to determine the normalization relationship. Parameter values must lie between 0 and 1.

This parameter filters the invariant set of data points by excluding genes whose average rank between Channel 2 (or Red, or Cy5) and Channel 1 (or Green, or Cy3) is in the highest N ranked averages or the lowest N ranked averages. This parameter is useful if the user wishes to exclude rank invariant genes whose ranks are very high or very low respectively, in order to ensure that these genes won't affect the normalization curve as outliers.

This parameter stops the iterative process that defines the rank invariant set when the number of genes in the invariant set reaches x percent of the total number of array genes. If set to 0, the iterative process continues until no more genes from the array under process are eliminated.

This option controls the iterative process which determines the rank invariant set of genes. When checked, the rank invariant selection algorithm repeats the process until either no more genes are eliminated or a predetermined percentage of genes (Maximum dataset included in rank invariant set) is reached. When unchecked, the algorithm performs only one iteration of the process.

This option controls whether to display MA (Ratio-Intensity) plots of un-normalized and rank invariant normalized genes of the slide under process afte
41. [Figures: histograms of the ratio distribution for array Wt 1r txt (horizontal axis: Ratio, vertical axis: Frequency), including the panels "Normalized ratio distribution for array Wt 1r txt, Normalization: Loess with Span 0.1" and "Normalized ratio distribution for array Wt 1r txt"]
42. user should see 2.5 and 2.5.2.
The gene identifier names (usually the chip manufacturer's identification names), which serve as a textual identification for each gene.
The p-value (or adjusted p-value) scores for each gene, returned by the statistical test applied for the identification of differentially expressed genes (the user should see 4.1).
The q-values returned by the False Discovery Rate estimation procedure (the user should see 4.1).
The False Discovery Rate estimates returned (the user should see 4.1).
Fold change: The fold change estimates, as calculated by the process explained in section 4.2, for each condition.
Trust factors: The trust factor estimates, calculated by the process explained in section 4.1, for each condition.
CVs: The coefficients of variation (StDev/Mean) for each condition.

Output file type
Text tab delimited: The output files are of text tab-delimited format and can be opened by any text editor or spreadsheet editing program, such as MS Excel. Text tab-delimited files can also easily be imported to other tools for processing, or stored in local databases.
Excel: The output files are of Excel format. They can be opened and processed with MS Excel or Open Office tools, but they are larger than, and not as flexible as, text tab-delimited files.

After selecting the preferred fields to be exported, the user should click OK, and all files that are exported from that point forward will contain the
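The CVs field is defined above as StDev/Mean for each condition. A minimal Python sketch of that ratio follows; the function name is hypothetical, and using the sample standard deviation (n - 1 in the denominator) is an assumption, since the text does not say which estimator ARMADA applies.

```python
import math
import statistics


def coefficient_of_variation(values):
    """Return StDev / Mean for the replicate values of one condition.

    Assumption of this sketch: the sample standard deviation is used.
    Returns NaN when the mean is zero, where the ratio is undefined.
    """
    mean = statistics.mean(values)
    if mean == 0:
        return math.nan
    return statistics.stdev(values) / mean


# Hypothetical replicate values for one gene, per condition.
replicates = {"WT": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0], "D15": [1.0, 1.0, 1.0]}
for condition, values in replicates.items():
    print(condition, coefficient_of_variation(values))
```

A CV of zero, as for the second condition in the example, means the replicates agree exactly.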
43. value cutoff in order to cluster fewer genes than those determined by the statistical test. For example, if the statistical test was performed with a p-value cutoff of 0.05, the user can enter 0.01 to cluster fewer genes than those determined by the cutoff of 0.05.
If checked, the fuzzy parameter value will be optimized as proposed in [12].
CV constant: Coefficient of Variation of the set of distances between genes (the user should see [12]).
Tolerance: Allowed tolerance to be used in the fuzzy parameter optimization algorithm.
Maximum iterations: Maximum number of iterations for the convergence of the fuzzy parameter.

After setting the desired parameters or leaving the defaults, the user should click OK. Fuzzy c-means clustering will be performed and ARMADA will store the result. It should be noted that if the box Optimize fuzzy parameter is checked, the running time of the algorithm might increase considerably, depending on the size of the dataset and the number of clusters. Gene clusters and cluster memberships can be viewed by clicking the Cluster List button on the main window. The user can also consult section 5.7 on how to plot gene expression profiles for clusters formed with the fuzzy c-means algorithm. For a complete description of the parameters in the table above, as well as of the fuzzy c-means clustering algorithm, the user should see [12].

4.4 Classification

Apart from classical statistical meth
44. whole file set imported by pressing the Select Files button in the Data Import wizard. Therefore, the user should have checked before this step that all the text files in the dataset have exactly the same format (e.g. there is no file where the column named Row is placed 4th from the beginning of the file while in all other files it is placed 3rd). The following table explains the content of each required field.

Field Name: Description (Optional)
Gene Numbers (optional: Yes): A unique gene numbering present on the microarray slide (e.g. the column Gene Number on QuantArray files). This attribute allows the unique gene identification in the case of multiple clones of a transcript present on the slide. If not given, it will be assigned automatically.
Array Blocks (optional: Yes): The column with numbers from 1 to the number of blocks into which the probes on the array are organized. This attribute helps in the reconstruction of array images. If not given and the slide is organized in blocks, the Meta Rows and Meta Columns attributes should be provided.
Meta Rows (optional: Yes): Probe meta coordinates (row). This attribute helps in the reconstruction of array images. If not given and the slide is organized in blocks, the Blocks attribute should be provided.
Meta Columns (optional: Yes): Probe meta coordinates (column). This attribute helps in the reconstruction of array images. If not given and the

If not given properly, probe meta coordinates
45. 0 03684518 281 2xX00026P01 0 00720329 282 ZX00026L13 0 01843009 314 2x00027P13 0 00939236 327 ZXD0031L01 0 04891699 336 2X00030H13 0 03170576 341 ZxX00035D01 0 04204787 360 2x00048P13 0 04076538 361 2xX00048P01 0 02360243 379 KG00002P01 0 01715951 380 KGODOD2L13 0 04528013 400 KGO0006H13 0 01493967 428 CNTRL14D13 0 00514420 433 NPOOOO1LO1 0 02187174 468 2400002K01 0 02588446 495 2400004G13 0 01738170 563 ZxX00004K13 0 04297893 615 2x00013013 0 04985758 635 2xX00016K13 0 01345936 801 2x00048013 0 02960167 803 2x00048K13 0 02491239 805 2x000458G13 0 03490098 820 KG00002001 0 01904997 873 NPOOOO1Ki3 0 03787566 883 CNTRL13J01 4 51284386 903 CNTRL11B01 0 00737537 907 2400002N01 0 00740561 920 2400001613 0 00974793 940 ZA400003N13 0 00247012 lili Mormalized L 0 5140957 2 6015188 1 8539887 0 2701940 0 50308959 1 3615470 1 7264890 0 25709182 0 6567064 2 0133258 0 6718848 2 6423339 0 5319141 1 5854407 2 6647512 0 24750770 0 1867841 0 09877379 1 4906205 0 58975227 0 83270608 1 2617788 0 5315799 0 2021967 0 51401375 0 60455542 0 4593538 1 4555292 0 4236478 1 3648943 0 90568680 0 02343750 0 4032206 0 61085261 0 3397404 Normalized L Normalized L 0
47. 1 1 1 11 CNTRL12H01 2770 500 1235591 7936 9106 431 58249 32843378 220 39533 6700 50 099575 0 871332 0 975143 0 98288 37 520705 12 1 1 1 12 CNTRL12D12 2970 500 12702 746 8223 9854 526 5022 359 03543 217 53391 6700 50 099575 0 88513 0 972717 0 980637 35 380202 13 1 1 1 13 CNTRL12D01 3170 500 13819 344 8108 6719 754 2569 364 35492 211 25116 6700 50 099575 0 844908 0 957352 0 977417 37 928248 14 1 1 1 14 CNTRL11P12 3370 500 10274687 4831 3584 59313532 4248197 19314742 6700 50 099575 0 782242 0 967316 0 976959 24185993 15 1 1 1 15 CNTRL11P01 3570 500 9496 1641 5096 7461 73714728 331 1965 212 3033 6700 50 099575 0 683919 05966187 0 982941 28 572296 16 1 1 1 16 CNTRL11L12 3770 500 89026563 5183 4775 488 15201 34366885 214 6888 6700 50 099575 0 859011 0974533 0 981689 25 904751 17 1 1 1 17 CNTRL11L01 3970 500 7916 2539 4486 4028 34056451 276 11966 216 75458 6700 50 099575 0 814753 0 97879 0 985382 28 66965 18 1 1 1 18 CNTRL11H13 4170 500 7507 5073 3863 8359 42310931 258 10379 222 26494 6700 50 089575 0 814878 0 974594 0 985474 29 087164 19 1 1 1 19 CNTRL11H01 4370 500 5296 1792 2441 330 9722 248 04997 206 21884 6700 50 099575 0 852154 0 98111 0 986099 21 351259 20 1 1 1 20 CNTRL11D12 4570 500 4452582 1834 6418 375 99268 21216736 218 50971 6700 50 099575 0 864272 0 980972 0 987 20 985178 21 1 1 1 21 CNTRL11D01 4770 500 40591792 1575 8552 332 78253 198 21785 209 89069 6700 50 089575 0 875343 0 981781 0 98996 20 478374 22 1 1 2 1 ZA00003D12 760 730 4600 5371 1608 0299 639
48. 271535 1 1581745

For more information on importing processed data to ARMADA, the user should also consult section 2.5.3.

A.3 Files used for classification

There are three types of files that can be used as input to the classification methods supported in ARMADA: i) files containing new samples to be classified, ii) files containing class prior probabilities to be used with DA classifiers, and iii) kernel function parameter files to be used with SVM classifiers. In all cases, the files can be either text tab-delimited or Excel files. The following sub-sections give examples of these files.

A.3.1 New sample files

The 1st column of these files should contain as many rows as the number of features (genes) used to train the classifier, and each row of the 1st column should contain the name of a feature. The 1st row should contain sample names. An instance follows:

GeneID New 1 New 2 New 3 New 4 New 5 New 6
AFFX Murt 179 1 195 651 183 67 135
AFFX hum 4796 7 13042 13201 15404 13159 14220
AFFX Phe 97 8 15 22 44 115 74
AFFX HUM 247B8 1 5184 33506 34303 20086 29899
AFFX HUI 364 6 171 1495 2692 259 302
31317 ra 18252 2942 3538 5092 1825 3200
31324 at 478 8 275 419 850 148 372
31326_at 433 5 524 934 1638 439 1003
31331_at 75 9 92 170 122 22 34
31375_at 229 3 196 765 852 239 59
3138b at 511 1 215 871 958 21 109
31397_at 162 9 116 303 108 15 2
31399_at 230 1 Bau 1118 1269 496 707
31417_at 330 9 916 1011 749 417 487
31
49. [Figure: MA plot editor window, with the array list, the options Display normalization curve (before only) and Display fold change lines, the Fold change field set to 2, and a Cancel button]

In the Arrays list, the user can select one or multiple arrays for which to display MA plots. In the Plots panel, the user can select to display MA plots before the normalization procedure, after the normalization procedure, or both, and also supply the respective titles (one for each selected array, separated by a new line (Enter)) if desired. Leaving the title boxes as they are, or empty, will cause automatic plot title generation. In the Options panel, the user may choose to display the above selected plots for each part of the array (subgrid); this option is enabled only when subgrid normalization has been performed (the user should see 3.4). The user may also choose whether to display the normalization curve calculated by any of the normalization methods described in section 3.4 (this option applies only to MA plots presenting data before normalization), and finally choose whether or not to display fold change lines depicting desired thresholds in fold changes among channels. The threshold should be filled in the field Fold change. Note that while the user enters the fold change in natural scale, the fold change lines are presented in the figures in log scale, e.g. for a fold change of 2 the corresponding threshold lines are at 1 and -1, because log2(2) = 1. After setting the paramet
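The conversion from the natural-scale value typed in the Fold change field to the positions of the two threshold lines can be written as a one-line computation. The Python function below is a hypothetical illustration of that arithmetic, assuming base-2 logarithms as in the example above.

```python
import math


def fold_change_lines(fold_change):
    """Return the (upper, lower) positions of the fold change lines in log2 scale."""
    level = math.log2(fold_change)
    return level, -level


print(fold_change_lines(2))    # lines for a 2-fold change
print(fold_change_lines(1.5))  # lines for a 1.5-fold change
```

The lines are symmetric around zero because an n-fold decrease corresponds to a ratio of 1/n, whose logarithm is the negative of that of n.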
50. 30627249 0 1162949 0 17008064 0 26464372 068051506 0 29922796 0 48757770 0 74367381 0 68480947 NaN 0 33500772 NaN 0 39670203 0 2832726 0 45200798 NaN 0 03774172 NaN 0 18947641 NaN 0 1098476 NaN 0 19077192 NaN 0 0339772 NaN 0 06442995 NaN 0 35992535 2 6 10 Differentially Expressed genes List 0 15147798 0 1852966 0 17488408 0 1329239 0 46897245 0 1848757 0 1228075 0 2076341 0 3184901 0 3761002 0 3329277 0 2656781 0 3225438 0 1730056 0 4405987 0 00277324 0 24817430 0 0769048 0 1085416 0 06104292 0 20930017 0 2501061 0 01459660 0 0799106 0 3243329 0 4209556 NaN 0 0430242 0 4972774 0 5893595 Mean Norma 0 47963038 0 29755275 0 16366270 0 18762536 0 07161478 0 11214425 0 0200570 0 27951440 0 25935850 0 47625541 0 19071215 0 14183801 0 06743225 0 01858608 0 03270551 0 0325824 0 04609032 0 0741540 0 0391013 D0 16397347 0 0475890 0 04179834 0 08465083 0 1203050 0 2617669 0 2696916 0 1839220 0 2272147 0 04029063 0 01299374 0 0325818 0 09106833 0 2293021 0 1492991 0 3241140 StDev Norm 0 44632052 0 51481605 0 14247658 0 20282959 0 20705959 0 17293061 0 16466822 0 39377
51. 3909 0 30902624 0 26789376 0 56929696 0 47183698 0 58035475 Abccia Nal NaN NaN NaN NaN NaN Abccib 0 3641862 0 51209795 0 45298058 0 31001368 0 29066816 0 24840271 Abcc2 NaN NaN NaN NaN NaN NaN Abcc3 0 5527132 0 70345503 0 5788742 1 7148554 1 1241817 1 8294444 Abcc5a 1 349615 0 98534876 1 0953488 1 3140831 1 7990783 1 7097581 A2bcc6 NaN NaN NaN Nah NaN NaN Abecg NaN NaN NaN NaN NaN NaN Abcd1 1 0531837 0 7791386 1 0002424 1 5899274 1 1865817 1 5702535 Abcd2 NaN NaN NaN NaN NaN NaN Abcd3 1 0074773 1 0837747 1 0866295 1 3982096 0 907019 0 80726904 Abed4 0 9015269 0 5946508 0 3703065 0 78884095 0 740609 0 6657123 Abce1 0 9475926 0 62172276 D 74285856 0 45811826 0 7299897 0 44573998 Abcf1 0 50011426 0 3760132 0 48036027 0 47371346 0 454261 87 0 74983424 Abcf2 0 60555005 0 456831 72 0 61108875 0 40908033 0 48863953 0 9063739 Abcgi 0 2799792 0 34214392 0 3352024 3 8893406 3 5225077 4 3020043 Abcg2 2 0225651 1 2335484 1 7648885 1 1054775 1 2485987 0 6887799 Abcq3 NaN NaN NaN NaN NaN NaN Abegs NaN NaN NaN NaN NaN NaN Abcgqa 0 9826715 0 7399547 1 052818 0 8763301 0 7645741 0 91664545 2b11 0 9229394 1 0389895 0 87565833 1 2906467 1 0031981 1 0011867 A blimi NaN NaN NaN NaN NaN NaN Abp1 NaN NaN NaN NaN NaN NaN Abt1 0 7731711 0 5483241 0 55794396 0 71175885 0 6947127 0 76653383 Abth1 0 764732 0 81660393 0 6719965 1 2096801 0 6036046 2 0502825 Acadl 1 229991 0 4709537 0 9617183 0 5761441 2 213873 1 0743774 Acadm 1 1487359 0 719995 0 68005455 1 2773688 1
52. 429_at 1584 8 1735 1906 2979 1018 1112 31491 s c 14 2 101 318 523 132 121 31514 at 512 5 232 1254 2358 003 710 31515_at 2852 5 4026 3710 5900 2295 4154 31534_at 17 2 767 1179 1111 227 525 A.3.2 External class prior files for DA classification. The first column of these files should contain as many rows as the number of classes in the training dataset, and each row should contain one class name. The second column should contain as many rows as the number of classes, and each row should hold a number between 0 and 1 corresponding to the prior probability of that class. The probabilities should sum to 1. An example follows (one row per class): One 0.1; Two 0.2; Three 0.7. A.3.3 External kernel parameters files for SVM tuning. These files should contain as many columns as the number of parameters that each kernel type accepts. For example, a file with polynomial kernel parameters should contain 3 columns of numeric data, while a file with RBF kernel parameters should contain 1 column. There are no headers. An example follows: [example values illegible in the source]. The following table presents the proper order of the columns in the kernel parameters files, so that ARMADA interprets them correctly. Kernel: Column 1, Column 2, Column 3. Polynomial: Gamma, Coefficient, Degree. Sigmoid (MLP): Gamma, Coefficient. RBF: Gamma. Appendix B MATLAB's figure controls. This section is addressed to users not familiar with M
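As an illustration of the external class prior file format described in section A.3.2 above, the following Python sketch checks such a file before it is handed to ARMADA. It is not part of ARMADA; the function name is hypothetical, and it assumes that the two columns are separated by a tab or by spaces, which the text above does not state.

```python
import math


def read_class_priors(text):
    """Parse rows of 'class name <whitespace> probability' into a dict.

    Checks the two rules stated in A.3.2: every probability lies
    between 0 and 1, and the probabilities sum to 1.
    """
    priors = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on the last run of whitespace so class names may contain spaces.
        name, value = line.rsplit(None, 1)
        p = float(value)
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"prior for {name!r} is outside [0, 1]: {p}")
        priors[name] = p
    if not math.isclose(sum(priors.values()), 1.0, abs_tol=1e-9):
        raise ValueError("class priors do not sum to 1")
    return priors


# The example from A.3.2, written as a tab-delimited file body.
example = "One\t0.1\nTwo\t0.2\nThree\t0.7\n"
priors = read_class_priors(example)
```

Running such a check beforehand catches the two most likely mistakes (a missing class row and probabilities that do not add up to 1) before the file reaches the DA tuning or classification step.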
53. 58 86 9 9 CNTRL12L01 11596 16406 3108 417969 7158 343262 97 9403 495 098358 469 21109 337 72 10 10 CNTRL12H13 12259 83594 3781 567139 7814 507324 132 835815 681 328369 485 785553 320 9 11 11 CNTRL12H01 12355 91016 3479 626953 7936 910645 115 9403 431 582489 529 004211 328 4 12 12 CNTRL12D13 12702 74609 3441 0 8223 985352 96 970146 526 502197 669 60968 359 0 13 13 CNTRL12D01 13819 34375 3835 223877 8108 671875 112 970146 754 256897 654 447388 364 35 14 14 CNTRL11P13 10274 68652 3456 641846 4831 358398 98 925377 593 135315 450 316437 424 81 15 15 CNTRL11P01 9496 164063 3458 805908 5096 746094 78 074623 737 147278 749 285217 331 16 16 16 CNTRL11L13 8902 65625 3406 596924 5183 477539 77 238808 488 152008 542 04425 343 6 17 17 CNTRL11L01 7916 253906 3664 14917 4486 402832 96 611938 340 564514 630 924927 276 11 18 18 CNTRL11H13 7507 507324 3426 23877 3863 835938 72 388062 3423 109314 565 358948 258 1 19 19 CNTRL11HO1 5296 179199 3497 343262 2441 0 88 089554 330 972198 627 37616 248 04 20 20 CNTRL11D13 4452 582031 3233 537354 1834 641846 105 283585 375 992676 530 891907 212 1 zl 21 CNTRL11D01 4059 179199 3175 208984 1575 9552 93 179108 332 782532 507 173737 198 2 22 22 ZA00003D13 4600 537109 3422 179199 1608 029907 73 716415 639 640747 719 308838 202 23 23 ZADODO3DO1 11968 37305 4100 462891 2781 895508 101 686569 1667 217163 837 085693 311 2 24 24 2400002P 13 13039 83594 3601 596924 4738 044922 96 6268
54. [Screenshot: spreadsheet view of numeric data; recoverable column header: Mean Int] By clicking on the DE List (Differentially Expressed) button, ARMADA displays a spreadsheet-like view containing data for the differentially expressed genes derived from the statistical selection process. The data displayed correspond to the Analysis selected in the Analysis Objects list. The same data can be displayed by right-clicking on the selected Analysis in the Analysis Objects list and then clicking DE List, or by clicking View → DE Genes List. GeneID p value 24 zA00002P13 0 01553916 87 2x00003P01 0 04524206 171 2x00014H01 0 00911155 193 2xX00016P01 0 02539326 213 2xX00019L01 D 03234523 241 2x00021H01 0 01359739 253 ZxX00025D01
55. 5976 8029 1015 1553 2213 93 87 0 8855 3999 1 16 1 779824 7100 12550 100 7927 8574 4423 998 1717 2396 88 77 4483 4779 1 17 1 syntrophin 318957 7340 12550 110 3206 3201 1693 888 1033 748 77 68 0 1893 1982 1 18 1 ESTs 329367 7580 12530 140 937 1038 403 840 1076 1237 2 0 0 357 458 1 19 1 Mus muscult 316187 7900 12560 130 5766 9601 11361 1116 1693 2260 90 50 1 577 1559 1 20 1 ESTs 313529 3060 12540 120 2267 2709 1815 985 1155 767 74 36 0 1107 1263 1 21 1 ESTs Weakl 328553 8320 12540 110 2424 2554 1574 907 993 517 87 68 0 1289 1582 Most GenePix files have the following headers: Block, Column, Row, Name, ID, X, Y, Dia., F635 Median, F635 Mean, F635 SD, B635 Median, B635 Mean, B635 SD, % > B635+1SD, % > B635+2SD, F635 % Sat., F532 Median, F532 Mean, F532 SD, B532 Median, B532 Mean, B532 SD, % > B532+1SD, % > B532+2SD, F532 % Sat., Ratio of Medians, Ratio of Means, Median of Ratios, Mean of Ratios, Ratios SD, Rgn Ratio, Rgn R2, F Pixels, B Pixels, Sum of Medians, Sum of Means, Log Ratio, F635 Median - B635, F532 Median - B532, F635 Mean - B635, F532 Mean - B532, Flags. The user should make sure that at least the columns containing the main image quantitation information and the spot flags have the names mentioned above (e.g. F635 Mean, F635 Median, B635 Mean, Flags). It has been observed that in some GenePix files the wavelengths used for the two channels are slightly different. In ARMADA the two channels must be named w
56. [Screenshot: spreadsheet view of numeric data; recoverable column headers: Mean Norm, StDev Norm] 2.6.11 Cluster List. By clicking on the Cluster List button, ARMADA displays a spreadsheet-like view containing the gene cluster memberships derived from clustering procedures. The data displayed correspond to the Analysis selected in the Analysis Objects list. The same data can be displayed by right-clicking on the selected Analysis in the Analysis Objects list and then clicking on C
57. 64075 202 92952 205 75525 6700 22156874 0 934221 0 964981 0 9888 22 670615 23 1 1 2 2 Z400003D01 960 730 11968 373 2781 8955 1667 2172 311 36792 163 5177 6700 22 156874 0827513 0 902557 0 983002 38 438042 24 1 1 2 3 ZA400002P1 1160 730 13039 8536 4738 0449 1183 7947 559 46295 170 56825 6700 22156874 0 895738 0 932327 0 968094 23 307774 25 1 1 2 4 ZA00002P01 1360 730 11631 6867 4266 7314 724 36023 446 43207 200 74371 6700 22 156874 0 9142 0 960052 0 975891 26 054774 26 1 1 2 5 ZAD OO02L1 1560 730 11275 955 3380 2637 1380 069 377 3172 185 41164 6700 22 156874 0 861331 0 923889 0 979614 29 884551 27 1 1 2 6 Z400002L01 1760 730 1117591 4078 1641 990 36328 418 57253 165 45288 6700 22156874 0787832 0 948288 0 977585 26 680934 28 1 1 2 7 ZA400002H12 1960 730 12930104 5692 6865 1148 3503 337 69244 140 7085 6700 22 156874 0 870585 0 933029 0 98514 38 289588 29 1 1 2 8 ZA00002H01 2160 730 1137409 6303 1641 80245978 368 63205 200 10844 6700 22 156874 0 847684 0 956223 0 980865 30 554859 30 1 1 2 9 ZA00002D13 2360 730 11531 388 6767 8657 64256561 426 70816 204 04642 6700 22156874 0 828543 0 960739 0 977463 27 024062 Most QuantArray files have the following column headers in the file data section: Number, Array Row, Array Column, Row, Column, Name, X Location, Y Location, ch1 Intensity, ch1 Background, ch1 Intensity Std Dev, ch1 Background Std Dev, ch1 Diameter, ch1 Area, ch1 Footprint, ch1 Circularity, ch1 Spot Uniformity, ch1 Bkg Uniformity, ch1 Signal Noise R
58. [Screenshot: spreadsheet view of numeric array data (continued)] 2.6.9 Normalized List. By right-clicking on an Analysis object in the Analysis Objects list and selecting Normalized List, ARMADA displays a spreadsheet-like view containing normalized data for the selected Analysis. The same data can be displayed by clicking View → Normalized Data. The data elements displayed are customizable; the user should s
59. [Screenshot: expression profile preferences window, showing the Genes or Clusters list and the panels Plot options (All genes, DE genes, Gene clusters), Display options (Plot centroids in selections, Different color for each gene, Display legend (gene names), Title(s)) and Plot values (Condition means, All replicates)] In the left part of the expression profile preferences window, the Genes or Clusters list displays the GeneIDs or the gene cluster numbers for the Analysis selected in the Analysis Objects list, depending on the selection in the Plot options panel. The following table explains in detail all the user options in the Plot options, Display options and Plot values panels. Plot options. All genes: This option plots expression profiles, using normalized values, for several genes out of all the genes that passed the preprocessing filtering steps. The user may select genes by their GeneID from the Genes or Clusters list on the left of the expression profile preferences window. DE genes: This option plots expression profiles, using normalized values, for several genes, but only from the Differentially Expressed genes that were determined
60. 9 54033 2677 8837 A 1 1 1 2 BrightComer 1 6150 6499 215 7006 6252 193 6159 6513 191 6896 140 294 861091 63416 2582 7304 A 1 1 1 3 CGxSLv1 1 287 8791 189 8994 274 184 273 0799 154 6724 91 338 26197 64186 67 2122 A 1 1 1 4 A51 P185156 D 408 2833 189 8307 403 5 182 400 258 151 0588 120 319 48994 60556 89 2805 A 1 1 1 5 A51 P153113 D 321 6902 182 3016 309 172 311 054 144 7799 113 305 36351 55602 87 074 A 1 1 1 6 A51 P113182 D 296 8275 182 7348 284 5 176 5 299 9655 156 7735 116 298 34432 54455 56 071 A 1 1 1 7 A51 P335050 D 310 787 184 5632 295 177 290 9062 151 3007 108 332 33565 61275 79 8019 A 1 1 1 8 A51 P297810 D 1493 5312 192 2307 1493 179 1588 4509 158 2453 128 312 191812 59976 378 2059 A 1 1 1 9 A51 P256510 D 283 6 179 9329 275 177 277 054 172 105 313 29778 56319 57 9077 A 1 1 1 10 A 51 P347103 342 3145 182 1224 334 5 175 337 4473 152 4862 124 294 42447 53544 82 0096 Most ImaGene files have the following data column names: Field, Meta Row, Meta Column, Row, Column, Gene ID, Flag, Signal Mean, Background Mean, Signal Median, Background Median, Signal Mode, Background Mode, Signal Area, Background Area, Signal Total, Background Total, Signal Stdev, Background Stdev, Shape Regularity, Ignored Area, Spot Area, Ignored Median, Area To Perimeter, Open Perimeter, XCoord, YCoord, Diameter, Position offset, Offset X, Offset Y, Expected X, Expected Y, CM X, CM Y, CM Offset, CM Offset X, CM Offset Y, Min Diam, Max Diam, Control, Failed Control, Backgroun
61. A 1 1 215 105 End Field Dimensions Begin Measurement parameters Segmentation auto Signal Low D Signal High D Background Lo D Background Hi D Background BL 2 Background Wi 5 End Measurement parameters Begin Alerts Control Type Minimum thres If tested Percentage allc If failed Maximum thre If tested Percentage allc If failed CVthreshold If tested If failed BLANK D FALSE 1 00 FALSE 500 FALSE 0 10 FALSE 1 FALSE FALSE POSITIVE 1000 FALSE 0 10 FALSE 100000 FALSE 1 00 FALSE 1 FALSE FALSE End Alerts Begin Quality Flags Begin Flagging Settings Empty Spots TRUE Threshold 2 Poor Spots TRUE Begin Poor Spots Parameters Background co FALSE Threshold 0 9995 Background te TRUE Signal contami FALSE Threshold 0 9995 Signal contami FALSE Ignored percen TRUE Threshold 25 Open perimete TRUE Threshold 25 Shape regularit TRUE Threshold 0 6 Area To Perime FALSE Threshold 0 65 Offset flag TRUE Threshold 60 End Poor Spots Parameters Negative Spot TRUE End Flagging Settings Begin Flagged spots of Empty Spots 233 of Poor Spots 20 of Negative Spots O of Manually Flagged Spots 1075 End Flagged spots End Quality Flags End Header Begin Raw Data Field Meta Row Meta Column Row Column Gene ID Flag Signal Mean Background Mk Signal Median Background Mk Signal Mode Background hk Signal Area Background Ar Signal Total Background To Signal Stdew Bac A 1 1 1 1 Bright Comer 1 6312 0761 211 8941 6492 194 6425 7416 167 0144 144 255 90393
62. ARMADA: Automated Robust MicroArray Data Analysis, version 1.1. User's manual. Metabolic Engineering and Bioinformatics Group, Institute of Biological Research and Biotechnology, National Hellenic Research Foundation, © 2008. Contents: [entries before 2.1 illegible in the source]; 2.1 Installation requirements and instructions; 2.2 [illegible]; 2.3 Opening a previously saved project; 2.4 Saving a project; 2.5 Importing data; 2.5.1 Importing data directly from image analysis software: supported program outputs; 2.5.2 Importing data directly from image analysis software: text tab-delimited files; 2.5.3 Importing already processed data; 2.6 Exploring data; 2.6.1 ARMADA's main window; 2.6.2 to 2.6.8 [illegible]; 2.6.9 Normalized List; 2.6.10 Differentiall
63. ARMADA. The user should note that once data import has completed properly, no further data can be imported into the same project. This is part of the ARMADA workflow; if the user wishes to import other data into ARMADA, a new project should be created. 2.5.1 Importing data directly from image analysis software: supported program outputs. The first step of analyzing a dataset is the proper import of the image analysis software output files, or of text tab-delimited files, containing image quantitation data for each spot on the array. To import a dataset into the current project, the user should click File → Data Import → Raw image data and the following window will appear. [Screenshot: Data Import Editor, with the File Info panel (Image Analysis format: QuantArray) and the General Info panel (Number of Experimental Conditions: 3; Condition Names, separated by newline)] In this data import wizard, the user is first prompted to choose the software that was used to process the raw images and create the files containing image quantitation data. Currently 4 formats are supported: QuantArray (Perkin Elmer Inc.), ImaGene (BioDiscovery S.A.), GenePix (Molecular Devices), and simple text tab-delimited files containing image quantitation elements, optionally coupled with spatial information on how the spots are distributed on the microarray. This format is useful when datasets downloaded f
64. ATLAB's sequence number format, e.g. 1:10 results in the range 1 to 10 and 1:2:10 gives 1, 3, 5, 7, 9. The table covers the kNN tuning options (Distance, Ties rule), the Model validation options (N-fold cross validation, Leave M out, Training and Test) and the General options (Display evaluation plots, Display output results, Verbose output (command line)). Distance: The distance function used to calculate the distance of samples to their nearest neighbors. The following distances are available. Euclidean: the Euclidean distance. Cityblock: the city block (Manhattan) distance. Cosine: the cosine distance. Correlation: Pearson's correlation distance. Hamming: the Hamming distance; it can be used only with binary data, otherwise it will generate an error. For further information on distances, the user should see Appendix D. Ties rule: The tie-breaking rule. The following rules are available. Nearest: majority rule with nearest-point tie-break. Random: majority rule with random-point tie-break. Consensus: consensus rule; with this option, points for which not all of the k nearest neighbors belong to the same class are not assigned to any class. Because of this, it might generate errors when not used carefully. N-fold cross validation: The user should check this box to perform N-fold cross validation of the classifier and supply N. Leave M out: The user should check this box to perform Leave-M-out validation of the classifier and supply M. Training and Test: The user should check this box to perform Training and Test valida
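The MATLAB sequence number format mentioned at the top of this passage (1:10 and 1:2:10) can be illustrated with a short Python sketch that expands such a string into the values it stands for. This is not ARMADA code; the function name is hypothetical and the sketch handles integer sequences only, which is an assumption made for brevity.

```python
def expand_sequence(spec):
    """Expand a MATLAB-style 'start:stop' or 'start:step:stop' string.

    Both end points are inclusive, as in MATLAB's colon notation.
    Only integers are handled in this sketch.
    """
    parts = [int(p) for p in spec.split(":")]
    if len(parts) == 2:
        start, step, stop = parts[0], 1, parts[1]
    elif len(parts) == 3:
        start, step, stop = parts
    else:
        raise ValueError("expected 'start:stop' or 'start:step:stop'")
    if step == 0:
        raise ValueError("step must not be zero")
    values = []
    v = start
    # Walk from start towards stop, keeping every value that does not overshoot.
    while (step > 0 and v <= stop) or (step < 0 and v >= stop):
        values.append(v)
        v += step
    return values


k_values = expand_sequence("1:2:10")   # the values 1, 3, 5, 7, 9
```

Note that the stop value is included only when the step lands on it exactly, which is why 1:2:10 ends at 9 rather than 10.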
65. ATLAB and explains briefly several features that are available for figure control and exploration. It contains several parts of MATLAB's help concerning figures; for further information, the user should also check the website http://www.mathworks.com/access/helpdesk/help/techdoc/matlab.html, under the section Graphics, which explains several features more thoroughly. B.1 Figures. MATLAB offers a very strong interface for plotting, handling, exploring and exporting figures. The following pictures are taken from MATLAB's help and present a typical figure, explaining briefly several toolbars and controls. [Screenshot: a typical MATLAB figure window with the File, Edit, View, Insert, Tools, Desktop, Window and Help menus] Some of the components and tools of figure windows are called out below. [Screenshot: annotated figure window; callouts: One of the figure toolbars, Dock figure in MATLAB desktop, Axes in which MATLAB plots data, Line plots representing data] It should be noted that figure docking in the MATLAB desktop is available only when MATLAB is present; otherwise figures will be docked to a common figures window and will be accessible from there as different windows. B.2 Figure toolbars. Figure toolbars provide shortcuts to access commonly used features. These include operations such as saving and printing, plus tools for interactive zooming, panning
66. C Multiple testing correction issues. Hypothesis testing in statistics involves two kinds of errors. A Type I error occurs when the null hypothesis is incorrectly rejected; in the context of microarray experiments, a Type I error is committed when a gene is declared differentially expressed while it is not. A Type II error occurs when the null hypothesis is not rejected while it is false; in the context of microarrays, this happens when the test fails to identify a differentially expressed gene. When conducting multiple tests, the expected number of Type I errors increases in proportion to the number of tests. This is tolerable when the number of tests is small, but in the case of microarrays, where thousands of tests are performed, a large number of false positives is undesirable. For example, if an experiment involves testing 10000 genes and the p-value threshold for differential expression is set to 0.01, then about 100 false positives are expected by chance (10000 × 0.01), assuming that no gene is truly differentially expressed. Such unwelcome results necessitate the correction of statistical scores to adjust for multiple hypothesis testing. There are two main categories of multiple testing correction methods: those controlling the Family Wise Error Rate (FWER) and those controlling the False Discovery Rate (FDR). FWER procedures correct for multiple testing by adjusting p values. For example, the Bonferroni procedure multiplies each p value by (equivalently, divides the significance threshold by) the number of hypotheses n to be tested
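The arithmetic behind the example above and the Bonferroni adjustment can be sketched in Python as follows. This is a minimal illustration of the standard Bonferroni procedure, not ARMADA's implementation; the function names and the p values in the example are made up for illustration.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of Type I errors when every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p values: each p is multiplied by n and capped at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]


# Illustrative p values for four hypothetical genes (not real data).
p_values = [0.00001, 0.004, 0.03, 0.2]
adjusted = bonferroni_adjust(p_values)

# A gene is called differentially expressed if its adjusted p value
# does not exceed the chosen threshold (0.05 here).
significant = [q <= 0.05 for q in adjusted]
```

With four tests, the gene with p = 0.03 passes an uncorrected 0.05 threshold but fails after correction, because its adjusted p value becomes 0.12. Comparing adjusted p values with 0.05 gives the same calls as comparing the raw p values with 0.05 / n.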
67. Conditions (columns); Both rows and columns (Clustering variables). Description: If selected, the gene expression values to be clustered are the mean expression values among replicates for each condition of the selected Analysis. If selected, the gene expression values to be clustered are all the values from all array replicates of each condition of the selected Analysis. This option sometimes also serves as quality control of an experiment: if the replicates of an experimental condition are not all clustered together, the replicate that clusters apart might be of low quality. If selected, hierarchical clustering will be performed for genes, revealing clusters of genes with similar expression. If selected, hierarchical clustering will be performed for conditions or replicates (depending on the choice in the Clustering values panel), revealing clusters of conditions. If chosen, hierarchical clustering will be applied to both genes and conditions or replicates, thus constructing a tree clustering diagram (dendrogram) for both genes and replicates or conditions. The following table explains the available options in the bottom panel, Options. Linkage: The linkage algorithm to be used for data clustering; for further information, the user should see Appendix D on distances and linkages. Distance: The distance metric to be used for data clustering; for further information, the user should see Appendix D on distances and linkages. p val
68. [Figure: MA plot; recoverable legend and axis labels: Down regulated, Fold change cutoff, Intensity A] The user should note that red and green points will appear only if the Display fold change lines option in the MA plots preferences window has been enabled and a proper fold change threshold value has been provided. The following table explains the functions of the items in the menu that appears after right-clicking. Select Data: Switches between data exploration and data selection modes. Export Selected: While in selection mode, exports the data points inside the rectangular selection area. The user must right-click on one of the edges of the selection area. Export up regulated: Exports up-regulated genes (red data points). Available only if fold change thresholds have been provided. Export down regulated: Exports down-regulated genes (green data points). Available only if fold change thresholds have been provided. Export deregulated: Exports up- and down-regulated genes (red and green data points). Available only if fold change thresholds have been provided. Export unregulated: Exports unregulated genes (blue points). If fold change thresholds have not been provided, exports all data points. Export All: Exports all data points regardless of mode status. 5.4.3 MA plots before and after normalization. The same considerations concerning image modes also apply in the case of MA plots before and after the normalization procedure. In this case
69. [Screenshot: project file dialog; file type: NDROMEDA Project Files] The user is prompted to fill in the File name field with the desired project name and click Save for the new project to be created. For a new ARMADA session window, the user should click File → New → New Session (or Ctrl+I). 2.3 Opening a previously saved project. To open a previously saved project, the user should click File → Open (or Ctrl+O) and the following window will appear. [Screenshot: Open Project dialog listing .apj project files; Files of type: ANDROMEDA Project Files] From there, the user should select a previously created project and click Open. 2.4 Saving a project. To save the current project, the user should click File → Save (or Ctrl+S). To save a project under a different filename, the user should click File → Save As, enter a new filename for the project and click Save. 2.5 Importing data. The following sections describe how data files derived from the supported image analysis programs, or text tab-delimited files (e.g. downloaded from public repositories), can be imported for processing to
70. [Figure: grid of per-subgrid plots, with panels titled Subgrid 1 3 to Subgrid 12 1] 5.5 Expression Distributions. Slide expression distributions are histograms depicting the gene expression log ratio distributions for specific arrays. They are useful for determining the nature of the data (e.g. the normality or the bimodality of gene expression distributions) and subsequently decid
71. Tools → Annotator and the following window will appear. [Screenshot: Annotation Editor, with the Choose files panel (Gene list file(s), Annotation file) and the Choose annotation columns panel (Unique gene ID column in gene list, Unique gene ID column in annotation file, Columns); both ID columns are set to Slide Position] In the Choose files panel, the user should enter the locations of the required files. In the Gene list file(s) field, the user should provide the exact location of the file(s) to be annotated; the Browse button makes this easy. In the Annotation file field, the user should provide the location of the annotation file, which can also be done with the Browse button. Note that if multiple files are provided in the Gene list file(s) field, all of them must come from analyses using the same microarray chip (e.g. from the same project), as they all use the specified annotation file; otherwise the program will generate an error. The gene list and annotation files can be in text tab-delimited format, in Excel format, or a mix (e.g. the annotation file is an Excel file and the gene lists are in tab-delimited format). After the necessary files are selected, their column headers are used to fill the Unique gene ID column in gene list, the Unique gene ID column in annotation file and the Columns l
72. YSCANNER Date Wed Apr 06 11 52 47 2005 Experiment wt_1_1r Experiment Path C Program FilesiPackard BioChipV amp dministrator ExperimentSetswvt 1 1r Protocol CnikosdataWlicroarray Experiments YVART_VHiQuantArray Protocolswt_1_1r pro Version 3 Begin Protocol Info Units Microns Array Rows 12 Array Columns 4 Rows 21 Columns 21 Array Row Spacing 4500 Array Columns Spacing 4500 Spot Rows Spacing 200 Spot Columns Spacing 200 Spot Diameter 150 Interstitial D Dis off 1 is first one missing 2 is second one missing Spots Per Array 441 Total Spots 21168 Data is not crosstalk corrected Data is background subtracted Quantification Method Histogram Quality Confidence Calculation Minimum End Protocol Info Begin Tolerance and Weight Measurement Minimum Maximum Weight End Tolerance and Weight Begin Image Info Channel Image Fluorophor Barcode Units X Units Per P Y Units Per FX Offset Y Offset Status chi CAnikosdataWicroarray Experiments v ART v HI Microns 10 10 0 0 Control Image ch2 C nikosdataWicroarray Experiments ART_YHWI Microns 10 10 0 0 End Image Info Begin Measurements Number Array Row Array Column Row Column Name chi Ratio chi Percent ch2 Ratio ch2 Percent Ignore Filter 1 1 1 1 1 CNTRL13L01 1 69 76344 0 433416 30 23656 0 2 1 1 1 2 CNTRL13H12 1 52457754 0 601082 37 542245 0 3 1 1 1 3 CNTRL13H01 1 70 255147 0 423181 29 734853 0 File data section Begin Data Number Array Row Array Column Row Column Name X Loca
73. a DA classifier. ARMADA offers a batch process module that allows the user to tune the classifier (e.g. select the best discriminant function and type of class prior probabilities) for the specific problem studied, using a representative dataset. The following sub-sections describe the tuning and classification process using DA in ARMADA. 4.4.1.1 Linear Discriminant Analysis Tuning. In order to perform DA classifier tuning, the user should select an Analysis object from the Analysis Objects list and click Statistics → Classification → Discriminant Analysis → Tune. The following preferences window will appear. [Screenshot: DA Tuning Editor, with the DA tuning options panel (Discriminant function: Linear, Diagonal Linear, Quadratic, Diagonal Quadratic, Mahalanobis; Priors, including Empirical and External), the Model validation options panel (N-fold cross validation, Leave M out, Training and Test) and the General options panel (Display evaluation plots, Display output results, Verbose output (command line))] The following table explains the available options in the DA tuning options, Model validation options and General options panels. Discriminant function: The type of discriminant function. The following types are available (descriptions partially from MATLAB's help). Linear: Fits a multivariate normal density to each group, with a pooled estimate of covar
74. after the statistical selection process (section 4.1). The user may select genes by their GeneID from the Genes or Clusters list on the left of the expression profile preferences window. If statistical selection has not been performed, this option will not be available. The table continues with the Gene clusters option, the Display options (Plot centroids only, Plot centroids in selections, Multiple cluster plot, Different color for each gene, Display legend (gene names), Title(s)) and the Plot values (Condition means, All replicates). Gene clusters: This option plots expression profiles, using normalized values, for genes belonging to clusters determined by the clustering algorithm used after the statistical selection process (section 4.3). The user may select genes by their cluster number from the Genes or Clusters list on the left of the expression profile preferences window. If clustering has not been performed, this option will not be available. Plot centroids only: This box is enabled only for the Gene clusters option in the Plot options panel. If checked, expression profile plots are produced using, instead of all the genes of each cluster, only the cluster centroid expression pattern, which reflects the general deregulation motif of the genes belonging to each cluster. The Plot centroids only box is enabled only if k-means (section 4.3.2) or fuzzy c-means (section 4.3.3) clustering has been performed, and it disables the Plot values panel, as the centroids have been calculated by the clustering algorithm
75. al dataset, which provides valuable information about the experiment. Gene expression pattern analysis can also be performed using a clustering algorithm, in order to reveal genes belonging to groups with common expression. This section presents the statistical selection and clustering methods implemented in this platform, along with how these can help the user reach a conclusion. 4.1 Statistical Selection. This section presents the statistical tests supported by ARMADA for the extraction of differentially expressed gene lists, as well as the statistical analysis workflow, which includes Trust Factor filtering, between-slide normalization, imputation of possible missing values (caused by image analysis, user flagging or filtering steps) and multiple statistical testing correction methods. In order to perform statistical analysis, the user should click Statistics > Statistical Selection. This will bring up the Statistical Selection preferences window, which is shown below. [Figure: Statistical Selection Editor window, with the panels Select analysis(es) for testing, Before testing and Statistical testing.]
76. ameter set: none (linear kernel). Misclassification error: 0.041667. Classification accuracy: 95.53335. Confusion table. If the user right-clicks inside the report area, a context menu will appear that allows exporting the report to a tab-delimited text file or clearing the report window. 4.4.3.2 Support Vector Machines Training. In order to train an appropriate SVM classifier model (normally after a tuning session, so as to evaluate the best kernel and respective parameters for the dataset under examination), the user should click Statistics > Classification > Support Vector Machines > Train, and the following preferences window will appear: [Figure: SVM Train Editor window, with the panels SVM options (Kernel, Normalize, Scale, Tolerance), Polynomial kernel options (Gamma, Coefficient, Degree), MLP kernel options and RBF kernel options.] The options in the SVM training preferences window under the SVM options panel (Kernel, Normalize, Scale and Tolerance) are the same as in the case of tuning the SVM classifier, and their description can be found in section 4.4.3.1. After setting the desired parameters, the user should click OK. After the classification procedure, a confirmation dialog will appear stating that the classifier training has finished. ARMADA will store the classifier model for further use. 4.4.3.3 Support Vector Machines Classifying. In order to perform SVM classification model the us
77. analysis. [Figure: Filtering Editor window, with the panels General (Use Medians instead of Means if available, Export filtered spots in Excel format), Noise filtering (No filtering, Signal to Noise threshold, Signal-Noise distribution distance, Custom filter) and Outlier detection (Statistic, p-value cutoff, Display p-value histograms).] In the second filtering step, which is optional (spot filtering based on measurement reproducibility among replicates for each experimental condition), the user is able to select between a t-test (parametric) or a Wilcoxon (non-parametric) test that will verify that, for each spot, the ratio measurements of all condition replicates derive from a normal or a continuous symmetrical distribution with mean (median) equal to the average ratio for this spot among all replicates. This test tracks and excludes outliers among the replicate slides of an experimental condition. If the user does not wish to perform this test, the choice of Statistic in the Outlier detection panel should be set to None. Otherwise, the user should select the desired test to be performed, set a statistical threshold in the p-value cutoff field (a cutoff of 0.05 is usually considered neither very strict nor very loose) and choose whether or not to display histograms of p-values that depict the p-value range frequencies for all genes of an experimental co
78. and linkages choose the initial cluster centroid positions. The following methods are available. Sample: If selected, k observations from the data matrix to be clustered are selected randomly to be the initial centroids. Uniform: If selected, k random points are selected uniformly from the range of the data matrix to be clustered. Clusters: If selected, a preliminary clustering phase is performed on a random 10% subsample of the data matrix to be clustered. This preliminary phase is itself initialized using the Sample option. Repeat clustering: Number of times to repeat the clustering process, each with a new set of initial cluster centroid positions. The solution with the smallest distance between clusters is returned. Maximum iterations: Maximum centroid convergence iterations. p-value cutoff: A p-value cutoff additional to the statistical test p-value cutoff, in order to cluster fewer genes than those determined by the statistical test. For example, if the statistical test was performed with a p-value cutoff of 0.05, the user can enter 0.01 to cluster fewer genes than those determined by the cutoff of 0.05. After setting the desired parameters (or leaving the defaults), the user should click OK. k-means clustering will be performed and ARMADA will store the result. Gene clusters can be viewed by hitting the Cluster List button on the main window. The user can also consult section 5.7 on how to plot gene expression profile
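The Sample and Uniform initialization options described above can be illustrated with a minimal Python sketch. It is not ARMADA's MATLAB code: the function name `init_centroids` and its arguments are hypothetical, and the data matrix is assumed to be a list of equally long numeric rows.

```python
import random

def init_centroids(data, k, method="sample", rng=None):
    """Pick k initial centroids (hypothetical helper, not part of ARMADA).

    method="sample":  draw k distinct rows of the data matrix at random.
    method="uniform": draw k points uniformly from the per-column range of the data.
    """
    rng = rng or random.Random()
    if method == "sample":
        return [list(row) for row in rng.sample(data, k)]
    if method == "uniform":
        columns = list(zip(*data))
        lows = [min(col) for col in columns]
        highs = [max(col) for col in columns]
        return [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
                for _ in range(k)]
    raise ValueError("unknown initialization method: %r" % method)
```

With a seeded generator, for example `init_centroids(data, 3, "uniform", random.Random(0))`, the starting positions become reproducible, which helps when comparing repeated clustering runs.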
79. and the following window will appear: [Figure: Dye Swap Editor window, with the Dye Swap options panel (Conditions and Arrays lists), an instruction to use the list on the left to select the arrays for which Cy3 corresponds to Channel 2 (the Treated samples) for each condition, and the buttons Clear condition, Clear all and Cancel.] From there, the user can select which arrays from which experimental condition correspond to a dye swap hybridization. In other words, the user should select only the arrays for which Cy3 (or Green) corresponds to Channel 2, i.e. the channel used for treated samples. If the user makes a mistake in the selection of the arrays, the Clear condition and Clear all buttons will reset the arrays for the selected condition, or all the arrays, to the default, which is Cy3 corresponding to Channel 1 (the Control channel). After finishing with the selection of dye-swapped arrays, the user should click OK. Attention should be paid to the dye swap options, as they directly affect the calculation of the log ratio between channels and thus gene expression. Concerning the Other panel in the normalization window: if subgrid meta-coordinates are present on the slides, the user is given the choice to select subgrid normalization by checking the Subgrid normalization box, to possibly allow considering several spatially dependent properties, such as local background noise caused by the robotic printing of the arrays, in the normalization procedure instead of
80. ariety of visualization options (MA plots, boxplots, array images, clustering heatmaps, etc.), a module which allows multiple analyses to be performed in batch mode under a specific analysis workflow, and an annotation tool. Emphasis is given to the output data format, which is fully customizable and contains a substantial amount of useful information, such as detailed normalized and unnormalized expression values for each gene on each slide replicate, along with several statistics concerning expression values for each experimental condition. The ARMADA output files can be easily imported into spreadsheet-like software such as MS Excel, or into a database, for further processing and storage, and the analysis results can be saved as .mat files for further possible processing with MATLAB's built-in algorithms. Depending on the user's programming experience and analysis preferences, ARMADA can be used to perform analyses step by step through the GUI of the system, or as an automated analysis pipeline by using the batch programming module. For the most experienced user, ARMADA can also be invoked directly from MATLAB's command window, as the main routines that perform the analysis behind the GUI are designed to also run individually in command line mode with specific arguments (the user should see the help inside the .m files to perform command line analysis). ARMADA is a completely open source MATLAB-based platform, and the user may alter, adjust or extend each of the main func
81. array plots for several image quantitation types (e.g. Channel 1 vs Channel 2, etc.). Data are selectable and exportable as in MA or Volcano plots. • Added array vs array plots for several measurements (e.g. log ratio, dye quantitation, etc.). Data are selectable and exportable as in MA or Volcano plots. • Several bug fixes. 1.3 Bug reporting. If the user wishes to report a bug, it is recommended that the exact error message is included in the report (a screenshot of the error, taken with a screen capturing program or simply by hitting the PrtScn button on the keyboard, would be enough), together with a short description of the process the user tried to perform. If the bug appears during the data import process, and if the problem is not solved by following the import instructions described in this user's guide exactly, it is recommended that a sample of the files used for import is included in the report. 2 Basic Operations. This section of the user guide presents the installation requirements and installation process of ARMADA, and how the user can perform basic operations such as creating, saving and opening projects, opening new session windows and importing data to ARMADA in various ways. 2.1 Installation requirements and instructions. In order to run ARMADA, the user must have at least MATLAB 7.3 (R2006a) or higher, or the MATLAB Component Runtime (MCR) 7.6 (not higher), installed on their computer. ARMADA can be downloaded
82. assification. A.3.3 External kernel parameters files for SVM tuning. Appendix B: MATLAB's figure controls. Appendix C: Multiple testing correction issues. Appendix D: Distance metrics and linkage algorithms. D.1 Distance metrics. 1 Overview. 1.1 Program overview. Microarray technology allows gene expression profiling at a global level by measuring mRNA abundance. ARMADA (Automated Robust MicroArray Data Analysis) is a MATLAB-implemented program with a graphical user interface (GUI) which performs all steps of typical microarray data analysis, starting from importing raw data from several image analysis software outputs, as well as tab-delimited text files or already processed data that need to undergo statistical testing. ARMADA continues with processes including noise filtering, spot background correction, data normalization, statistical selection of differentially expressed genes based on parametric or non-parametric statistics, cluster analysis based on several widely used clustering methods (Hierarchical, k-means, Fuzzy C-means) and annotation steps, resulting in detailed lists of differentially expressed genes and formed clusters. Along with the user-friendly interface, ARMADA offers a v
83. atio, ch1 Confidence, ch2 Intensity, ch2 Background, ch2 Intensity Std Dev, ch2 Background Std Dev, ch2 Diameter, ch2 Area, ch2 Footprint, ch2 Circularity, ch2 Spot Uniformity, ch2 Bkg Uniformity, ch2 Signal Noise Ratio, ch2 Confidence, Ignore Filter. The user should make sure that at least the columns containing the main image quantitation information and the spot flags have the names mentioned above (e.g. ch1 Intensity, ch2 Intensity, ch1 Background, ch1 Background Std Dev, Ignore Filter, etc.). For more information on the necessary quantitation inputs, the user should see section 2.5.2. A.1.2 ImaGene file format. ImaGene files usually come in pairs, one file for each channel. ARMADA recognizes which channel each file corresponds to (channel 1, or Cy3, or Green, and channel 2, or Cy5, or Red) by its filename. The file corresponding to each channel should contain the string Cy3 or Cy5, depending on the channel, somewhere in its filename (not in the file extension). The user should also check section 2.5.1. Below there is an example of an ImaGene file section for one channel (files for the other channel are the same, but they contain different values for the main image quantitation types): Begin Header version 6 0 1 Date Thu Jan 26 11 10 20 CET 2006 Image File EMARRAY wvonneVgilentWouse lungiimages27772 Cy3 controll exp1 tif Page D Page Name Inverted FALSE Begin Field Dimensions Field Metarows Metacols Rows Cols
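The filename rule for ImaGene files can be sketched in Python as follows. This only illustrates the rule as stated in the guide, not ARMADA's own MATLAB routine: the function name `imagene_channel` is hypothetical, and the case-sensitive match and the error raised for ambiguous names are assumptions.

```python
import os

def imagene_channel(path):
    """Return 1 for a Cy3 (Green) file and 2 for a Cy5 (Red) file.

    Hypothetical helper: only the file name without its extension is
    inspected, because the guide says the extension does not count.
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    has_cy3 = "Cy3" in stem  # assumption: case-sensitive match
    has_cy5 = "Cy5" in stem
    if has_cy3 == has_cy5:
        # Neither string, or both strings: the channel cannot be decided.
        raise ValueError("cannot determine channel from file name: %r" % path)
    return 1 if has_cy3 else 2
```

For instance, a file named `slide7_Cy5.txt` would be assigned to channel 2, while a file whose name carries the dye only in its extension would be rejected.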
84. before and after scaling. In this window, the user can set up the parameters that best fit the analysis or leave the defaults. On the left side of the window there are two lists. The left one contains the pool of available Analysis objects created so far, and the right one contains the Analysis objects for which the user wishes to perform statistical testing. Different Analysis objects can be added to or removed from the list for testing by clicking on the Add >> or << Remove buttons. All the parameters can be set through the options available in the Before testing and Statistical testing panels. The following table describes the available options in the Before testing panel (Missing value imputation, Trust Factor cutoff, Between slide scaling, Impute values, Display boxplots before and after scaling). Option / Description. Missing value imputation: This option determines how missing values from the dataset will be imputed. If Average within the same condition is selected, then missing values for each gene are imputed per experimental condition by averaging the remaining present values of the gene of interest from the same experimental condition. If k-nearest neighbor (kNN) is selected, then missing values are imputed using a kNN-based value estimation over the whole dataset [5]. Trust Factor cutoff: The Trust Factor is defined for each experimental condition as TF = Appearances / Replicates. The number of appearances for each gene is determined by t
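The Trust Factor definition above (TF = Appearances / Replicates, per experimental condition) can be made concrete with a small Python sketch. It is illustrative only: the function names are hypothetical, missing values are assumed to be encoded as `None`, and both the rule that a gene must reach the cutoff in every condition and the example cutoff of 0.6 are assumptions rather than documented ARMADA behaviour.

```python
def trust_factor(replicate_values):
    """TF = Appearances / Replicates for one gene in one condition.

    A replicate counts as an appearance when its value is present
    (assumption: missing values are encoded as None).
    """
    appearances = sum(1 for value in replicate_values if value is not None)
    return appearances / len(replicate_values)

def is_trusted(values_per_condition, cutoff=0.6):
    """Assumed rule: the gene must reach the cutoff in every condition."""
    return all(trust_factor(values) >= cutoff for values in values_per_condition)
```

A gene measured in 3 of 4 replicates of a condition has TF = 0.75 there, so under these assumptions it would pass a cutoff of 0.6 for that condition.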
85. buttons to add or remove pairs, respectively. After setting all the desired parameters in the statistical selection preferences window, the user should click OK. It should be noted that the user can define a different workflow for each Analysis in the right list by selecting the Analysis object and setting the desired parameters. After clicking OK, ARMADA will start the statistical selection process for each of the selected Analysis objects. For each Analysis object, the following window will appear: [Dialog: Trust genes. "The number of trustworthy genes for this set of conditions is 18734. Do you accept?"] This is the result of the Trust Factor cutoff. The user should click Yes or No, depending on preference. If No is clicked, ARMADA will stop the process for the Analysis object under processing and will jump to the next one. If the user clicks Yes, the following window will appear after the application of a statistical test: [Dialog: DE genes found. "The number of differentially expressed genes for this set of conditions is 1858. Do you accept?"] This is the result of the statistical test concerning the number of differentially expressed genes found. If the user clicks Yes, ARMADA stores the results and continues with the next Analysis object. If the user clicks No, ARMADA will skip the result for the Analysis object under processing and will jump to the next one. Differentially expressed genes can be viewed by hitting the DE List
86. cation results. Support Vector Machines classification results. Kernel function type: Polynomial. Kernel parameters (chosen parameters): Gamma, Coefficient: 0, Degree: 3. Number of new samples: 6. The class(es) assigned to new data samples is (are): sample New_1 belongs to class ALL; sample New_2 belongs to class ALL; sample New_3 belongs to class AML; sample New_4 belongs to class AML; sample New_5 belongs to class MLL; sample New_6 belongs to class MLL. 5 Graphical data exploration. The following sections describe several graphics which are customizable and accessible in ARMADA through the Plots menu in the main window. Such graphics include 2- or 3-dimensional array images reconstructed from the given data, MA plots, expression distribution plots across different arrays, boxplots, volcano plots and expression profiles across different arrays or conditions. The user should note that not all plots are available at every stage of the analysis. For example, MA plots become available after normalization, and volcano plots become available after the statistical selection procedure and only if the selected Analysis has two conditions. 5.1 Array Images. An array image depicts a reconstructed spatial image of a microarray based on the data of the corresponding input file(s). The image can be created using several input data, e.g. the user can create an image based on Channel 1 mean signal or Channel 2 background median. Such images
87. cess. The number of clusters found. Branch name for the Support Vector Machine classifier constructed for classification purposes after statistical selection. The kernel function type of the SVM classifier. The parameter set for the SVM classifier. The arrays list is a list of all the files (arrays) that were imported to the project. By selecting an array from the list and right-clicking inside the list, a menu with the following options appears: [Figure: Arrays list with its context menu: Image, Data, Normalized Image, Report.] The functionalities of the submenu commands are explained in the following table. Command name / Description. Image: Displays an image for the selected array, which is reconstructed from the raw data. Data: Displays a data table reconstructed from the columns of the selected file. Normalized Image: Displays an image for the selected array, which is reconstructed from log ratios after the normalization procedure. Report: Displays information concerning the selected array in a separate window. 2.6.5 Analysis Object List. The analysis list is a list of all the different analyses that have been created in the current project. By selecting an analysis from the list and right-clicking inside the list, a menu with the following options appears: Analysis Obje
88. [Figure: kNN tuning evaluation plot of Misclassification Error against Number of Nearest Neighbors; the legend includes the cityblock, cosine and correlation distances.] If the user selects to display a classifier evaluation report by clicking Display output results, a window like the following will appear, presenting the classifier evaluation results: [Figure: kNN Classifier Tuning Results report window, listing dataset information (observations, variables, number of classes), the number of nearest neighbors (1) and the misclassification error (0.069444).] If the user right-clicks inside the report area, a context menu will appear that allows exporting the report to a tab-delimited text file or clearing the report window. 4.4.2.2 k-Nearest Neighbors Classifying. In order to perform kNN classification, the user should click Statistics > Classification > k-Nearest Neighbors > Classify, and the following preferences window will appear: [Figure: kNN Classify Editor window, with a sample file field and the kNN Classification Options panel (k Nearest Neighbors, Distance: Euclidean, Ties rule: Nearest).] The user should select the file that contains the new samples to be classified, using the data imported to ARMADA as training data. The file can be a tab-delimited text file or an Excel file, which should be structured as follows: the first column should contain variable names (e.g. gene names) that serve as unique variable identifiers. The f
89. command in the Plots menu is enabled only after a statistical test has been performed and when the selected Analysis in the Analysis Object list contains 2 experimental conditions. To create volcano plots, the user should select an Analysis from the Analysis Object list and click Plots > Volcano Plots. The following window will appear: [Figure: Volcano Plot Editor window, with the panels Options (Display p-value cutoff line, p-value cutoff 0.05, Display fold change cutoff line, Fold change cutoff 2, Title), Effect calculation and Correspondence (Reference, Treated).] In the Options panel, the user should choose whether or not to display a p-value threshold line by checking or unchecking the Display p-value cutoff line box, and provide a p-value threshold. The user can also select whether or not to display a fold change threshold line by checking or unchecking the Display fold change cutoff line box, and provide a fold change threshold. As in MA plots (5.3), the fold change threshold is provided in natural scale but is converted to log scale for the construction of the fold change lines. A title for the volcano plot can be given in the Title field; otherwise the field should be left empty (or as is) for automatic title generation. In the Correspondence panel, the user should tell ARMADA which condition corresponds to the reference condition an
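The conversion of the fold change threshold from natural scale to log scale, mentioned above, can be illustrated with a short Python sketch. The logarithm base 2 and the symmetric pair of lines are assumptions made for illustration, and `fold_change_lines` is a hypothetical name rather than an ARMADA function.

```python
import math

def fold_change_lines(cutoff):
    """Convert a natural-scale fold change cutoff to log2-scale line positions.

    Assumption: lines are drawn symmetrically at -log2(cutoff) and
    +log2(cutoff), marking down- and up-regulation respectively.
    """
    if cutoff <= 0:
        raise ValueError("fold change cutoff must be positive")
    position = abs(math.log2(cutoff))
    return (-position, position)
```

Under these assumptions a cutoff of 2 places the lines at -1 and +1, and a cutoff of 4 places them at -2 and +2.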
90. containing a single object. Hierarchical clustering may be represented by a two-dimensional diagram known as a clustergram or clustering heatmap, which illustrates the fusions or divisions made at each successive stage of the analysis. An example of such a clustergram is given below. [Figure: clustergram titled "Hierarchical Clustering, Single, euclidean, 5 Conditions, p-value 0.01".] For further information concerning hierarchical clustering, the user should see [10]. In order to perform hierarchical clustering, the user should select an Analysis object from the Analysis Objects list and click Statistics > Clustering > Hierarchical. The following preferences window will appear: [Figure: Hierarchical Clustering Editor window, with the panels Clustering values (Condition means, All replicates), Clustering variables (Genes (rows), Replicates/Conditions (columns), Both rows and columns), Options (Linkage, Distance, p-value cutoff, Determine cluster number by, Inconsistency coefficient, Maximum number of clusters, Cutoff, Calculate optimal dendrogram) and Heatmap (Display heatmap, Colormap, Colormap density, Heatmap title; leave empty for automated title generation).] The following table explains the available options in the two upper panels, Clustering values and Clustering variables. Option Condition means All replicates Clustering values Genes rows Replicates
91. cted data, where there are two distinguishable clouds of data points. Genes furthest from the center of the whole swarm can be thought of as genes whose expression is different and can separate different experimental configurations. The PCA tool becomes available after the statistical selection procedure. 7.2 The Gap Statistic tool. A major problem when trying to discover groups in data without the help of a response variable (e.g. when trying to discover groups of similarly deregulated genes without a prior idea of how many such groups exist) is how to estimate the optimal number of these groups, or clusters. One way to partially solve this problem is the Gap statistic, introduced by Tibshirani et al. in 2001, which has been applied to microarray data. The main idea is to use the pairwise inter-cluster distances to define a within-cluster dispersion measure for the original data and for a background distribution which reflects randomness, and then to use statistical measures to compare the within-cluster dispersion of the original data distribution with the dispersion in the random case. To ensure the random characteristics of the background distribution, the latter is estimated by averaging several instances of randomly generated data (reference data). For more information and details about the algorithm that estimates the number of clusters in a data matrix using the Gap statistic, the user should see [18]. When the user
92. cts [Figure: Analysis Objects list with its context menu: Normalized List, DE List, Cluster List, Export Normalized List, Export DE List, Export Cluster List, Delete, Report.] The functionalities of the submenu commands are explained in the following table. Command name / Description. Normalized List: Displays the list of genes of the selected analysis and their expression values (data displayed customizable) after the normalization procedure. DE List: Displays the list of Differentially Expressed genes of the selected analysis and their expression values (data displayed customizable) after the statistical selection procedure. Cluster List: Displays the list of clusters and the genes belonging to each cluster, coupled with their expression values, after the clustering procedure. Export Normalized List: Exports the list of genes of the selected analysis and their expression values (data exported and export format customizable) after the normalization procedure. Export DE List: Exports the list of Differentially Expressed genes of the selected analysis and their expression values (data exported and export format customizable) after the statistical selection procedure. Export Cluster List: Exports the list of clusters and the genes belonging to each cluster, coupled with their expression values, after the clustering procedure (export format customizable). Delete: Deletes the selected analysis and all its components. Report: Displays a brief repor
93. d contamination present, Signal contamination present, Ignored failed, Open perimeter failed, Shape regularity failed, Perim-to-area failed, Offset failed, Empty spot, Negative spot, Selected spot, Saturated spot. The user should make sure that at least the columns containing the main image quantitation information and the spot flags have the names mentioned above (e.g. Signal Mean, Background Mean, Flag, etc.). For more information on the necessary quantitation inputs, the user should see section 2.5.2. For more information on ImaGene headers, the user should also check http://www.biodiscovery.com/index/imagene. A.1.3 GenePix file format. GenePix contains quantitation data in one file. Below there is an example of GenePix output: ATF 1 24 43 Type GenePix Results 1 2 DateTime 2001 01 18 12 41 32 Settings C V amp xon Params 3KMOS 011100 aps GalFile2 C V amp xon Params 3K A xonMOS AI txt Scanner GenePix 40004 Comment PixelSize 10 ImageName 635 nmos32 nm FileName E Sean de la Roton 42 m05 104 1 17 tif tifOE Sean de la Rotonv amp 2 m05 104 1 17 tif tif PMTWolts 7700780 NormalizationFactor RatioOfMedians 1 05442 NormalizationFactor RatioOfMeans 1 02011 NormalizationFactor MedianOfRatios 1 00515 NormalizationFactor MeanOfRatios 0 715029 NormalizationFactor RegressionRatio 1 18375 Jpeglmage E Sean de la Roton41 m05 129 1 17 jpg RatioFormulation 41 V2 635 nm 532 nm Barcode ImageOrigin 2760 11790 JpegOrigin 3190
94. d which to the treated (e.g. with a drug) condition, so that proper fold changes can be calculated. The table below explains the different options in the Effect calculation panel (how the fold change is calculated in each case). Options listed in this table: Ratio treated, Ratio treated ratio control, Ratio treated ratio control. Option / Description. Ratio treated: This option should be chosen when the volcano plot should be created by defining the fold change as the log ratio between channels, in cases where there is only one experimental condition in the project or the current Analysis, and the Cy3 channel represents the reference samples while the Cy5 channel represents the treated samples. In such a case, it is also the only available option. It might also be useful in other case studies, depending on what the user wishes to see, e.g. when the analyst is interested in comparing the treated sample to the common reference but not to the control (which can be the 1st time point in a time point experiment). The fold change is thus FC = log2(Cy5/Cy3) for the treated condition. Ratio treated ratio control: This option is the default in volcano plots in ARMADA when the project or the current Analysis includes more than one experimental condition. In such cases, there is usually a control condition which has to be compared to several other conditions, and the common reference (labelled by Cy3) is common to all samples under examination. In these cases, the fold change is calculated as Cy5 Cy5 FC log y treat
95. deviation; Channel 2 signal standard deviation; Channel 1 background mean; Channel 2 background mean. ch1BackgroundMedian: Channel 1 background median. ch2BackgroundMedian: Channel 2 background median. ch1BackgroundStd: Channel 1 background standard deviation. ch2BackgroundStd: Channel 2 background standard deviation. IgnoreFilter: Spot flags. Indices: MATLAB indices created if meta-coordinates are provided, and used to create array images. Shape: MATLAB matrix defining subgrid (block) orientation. Un-normalized log ratio/intensity data: Un-normalized log2 ratio and intensity data for all arrays in the Analysis Object. Normalized log ratio/intensity data: Normalized log2 ratio and intensity data for all arrays in the Analysis Object. DE genes data (p-values, etc.): Statistics and data for differentially expressed genes. It should be noted that the option boxes in the export to MATLAB preferences window are available only if the corresponding procedures have been performed, e.g. DE genes data will not be available if statistical operations have not been performed. Also, the user may select different data to be exported for each Analysis Object. After making the necessary selections, the user should click Export and will be prompted to select a storage location for the .mat file. The .mat file which is created with the above procedure consists of a structure of length equal to the number of Analysis Objects in the project from wit
96. does not have prior knowledge of how many clusters the dataset can be grouped into, one choice is to use the Gap statistic implementation of ARMADA and, based on the estimate returned, perform clustering using the same algorithm and parameters as those used by the Gap statistic tool. To use the Gap statistic tool, the user should select an Analysis from the Analysis Object list and click Tools > Gap Statistic, and the following window will appear: [Figure: Gap Statistic window, with the panels Gap statistic options (Number of clusters range, Size of reference dataset, Method repetitions, Clustering algorithm, Reference data type, Use always squared euclidean) and General options (Verbose output (command line), Use waitbar, Show output plot).] By using this window, the user can set several parameters that will be used for the estimation of the optimal number of clusters. The following table describes each of the user options in the above preferences window. Option / Description. Number of clusters range: The range of numbers of clusters from which the optimal number of clusters should be estimated. Size of reference dataset: How many reference datasets should be randomly created and averaged for the estimation of the background distribution. Method repetitions: Due to the stochastic nature of the algorithm (randomness of the background distribution), the estimated optimal number of clusters may differ each time. By allowing several repetitions of th
97. e on which normalization method is suitable for specific datasets. They can also be used for quality control issues and for visualizing normalization effects. To create slide distribution histograms, the user should select an Analysis from the Analysis Object list and click Plots > Slide Distributions. The following window will appear: [Figure: Expression Distribution Editor window, with the Arrays list, the Options panel (Each array, Each condition), the Plots panel (Before normalization, After normalization, Before and after normalization) and a Title field.] In the Arrays list, the user can select one or multiple arrays for which to create gene expression distributions. The Options panel determines whether expression distributions will be created for individual arrays, by selecting Each array (in this case the Arrays list is activated and the user may select arrays from there), or for conditions, by selecting Each condition (in this case the Arrays list is deactivated and the user may select conditions from the Conditions list, which is activated). As with MA plots, the user may choose to plot expression histograms for data before normalization or after normalization, or to make combined diagrams with data before and after normalization. Below there are several examples of expression histograms. Non normalized ratio distribution for array Wt 1r
98. e provided with ARMADA. In order to use the batch programming module, the user should first import data into a new project in one of the ways described in section 2.5. After that, the user is able to program a batch procedure at any time during the analysis. To launch the Batch Programmer, the user should click Tools > Batch Programmer. The following window will appear:

[Screenshot: Batch Programmer window. Menus: File, Help. Program your workflow panel with the buttons Background Correction, Filtering, Normalization, Select Conditions, Statistical Selection, Clustering. Analysis objects list: Analysis 1, Analysis 2.]

In order to program a batch procedure, the user should first create a new batch file by clicking File > New batch. The user will be prompted to select a location for the batch settings file to be created. After this step, the Background Correction button will be activated. Clicking the Background Correction button opens the background correction preferences window (section 3.2), where the user should select the preferred background correction method. After setting this, the Filtering button will be activated. As with background correction, the filtering preferences window (section 3.3) will open and the user must set the desired parameters. The same applies to the Normalization button, which opens the normalization preferences window (section 3.4). At this point it should be mentioned that the preprocessing steps are common for the entire dataset. After properly se
99. e whole estimation, the program returns the number of clusters that was found to be optimal in the majority of the repetitions (the most frequent).
Clustering algorithm: The clustering algorithm to be used to cluster the data and estimate their within-cluster dispersion measures. It can be one of the clustering algorithms supported by ARMADA: Hierarchical, k-means, Fuzzy C-Means. After selecting the desired algorithm, the respective clustering preferences window will open (the user should see section 4.3), allowing parameter setting.
Reference data type: Reference dataset generation method. This option determines the method and the data source from which each reference dataset will be created. The user can select one of the following:
  Uniform: The reference dataset is created from uniformly distributed data with ranges taken from the columns (samples, arrays) of the original data matrix.
  Uniform PCA: The reference dataset is created from uniformly distributed data which were derived by taking into account the shape of the original data, using the principal components of the original data matrix.
  Bootstrap: The reference dataset is created by bootstrapping the original data matrix.
  Bootstrap PCA: The reference dataset is created by bootstrapping the data matrix derived by taking into account the shape of the original data, using the principal components of the original data matrix
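The Uniform and Bootstrap reference types, and the "most frequent" rule over repetitions, can be illustrated with a minimal Python sketch. This is not ARMADA's code: the function names, the toy data matrix and the list of per-repetition estimates are all hypothetical, and the PCA variants are left out.

```python
import random
from collections import Counter


def uniform_reference(data, rng):
    """Draw a reference dataset uniformly within each column's observed range."""
    columns = list(zip(*data))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
            for _ in range(len(data))]


def bootstrap_reference(data, rng):
    """Resample the rows of the original matrix with replacement."""
    return [list(rng.choice(data)) for _ in range(len(data))]


def most_frequent_k(estimates):
    """Return the cluster count that was estimated most often across repetitions."""
    return Counter(estimates).most_common(1)[0][0]


rng = random.Random(0)                      # fixed seed, only for reproducibility
data = [[0.1, 2.0], [0.4, 3.5], [0.9, 2.2], [0.3, 4.0]]   # hypothetical matrix
reference = uniform_reference(data, rng)
resampled = bootstrap_reference(data, rng)
best_k = most_frequent_k([5, 4, 5, 5, 6])   # hypothetical per-repetition estimates
```

Each reference dataset has the same shape as the original matrix, so it can be clustered with the same algorithm and parameters as the real data.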
100. eans option from the Plot values panel selected.

The following figure was created with exactly the same options, but with the Plot centroids only box checked and the Plot centroids in selections box unchecked.

[Figure: expression profiles plot showing cluster centroids only.]

As with most figures in ARMADA, if the user clicks on a specific data point, more information on that data point will be displayed. It should be noted that Expression Profiles become available in the Plots menu after the normalization process.

6 Exporting Data

This section presents the various data types that can be exported using ARMADA's exporting functionalities, as well as how to export and save figures. In summary, the subsections of this section explain how the user can customize and export normalized gene lists, differentially expressed gene lists and cluster lists, as well as how to export figures in various formats using MATLAB's figure interface and controls.

6.1 Exporting gene lists

After completing several steps of data analysis and exploration, the user can export two kinds of gene lists from ARMADA: normalized gene lists and differentially expressed gene lists. Both lists can contain the same data, apart from several statistics which are produced only after the application of statistical tests. Exporting normalized gene lists becomes available right after the normalization step, while exporting differentially expressed gene lists becomes available only after the statistical selection process. To e
101. ed 7 control [Equation garbled in extraction: it relates the treated and control ratios to the Cy5 and Cy3 (common reference) intensities] as derived from logarithm properties. This option should be used only when conducting statistical tests on non-log-transformed data (footnote 1) and should generally be avoided, as it will not produce valid plots when used with log-transformed data. After setting all the proper parameters, the user should click OK. Below there is an example of a volcano plot created with fold change calculated with the option Ratio treated ratio control, which was proper for the example used.

Footnote 1: ARMADA log-transforms data by default. However, it is possible to get non-log-transformed data when the user is importing from external data, by selecting Log ratio options in the external data import wizard (2.5.3). For these reasons this option is included in volcano plots.

[Figure: Volcano Plot for WT vs D7. Legend: Data, Up regulated, Down regulated, Fold change cutoff, p-value cutoff. Axes: Fold change (effect) and log10 p-value.]

As with MA plots (5.3), volcano plot figures exist in two modes: data exploration mode, where the user can click on any data point on the figure and more specific information will be displayed for that point (the GeneID, ratio value, p-value, etc.), and data selection mode, where the user can select several data points to export
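As a rough illustration of how one point on a volcano plot is obtained, the following Python sketch computes the two coordinates for a single gene. It assumes log2-transformed mean ratios (so the fold change effect is a difference of logs) and plots significance as -log10 of the p-value. The function names and the cutoff values are hypothetical and do not come from ARMADA.

```python
import math


def volcano_point(mean_treated_log2, mean_control_log2, p_value):
    """Return (fold change effect, significance) for one gene.

    Assumes both means are already log2 ratios, so the log of the
    treated/control ratio is simply their difference.
    """
    effect = mean_treated_log2 - mean_control_log2
    significance = -math.log10(p_value)
    return effect, significance


def regulation(effect, p_value, fold_cutoff=1.0, p_cutoff=0.05):
    """Label a gene using hypothetical fold change and p-value cutoffs."""
    if p_value >= p_cutoff or abs(effect) < fold_cutoff:
        return "unchanged"
    return "up" if effect > 0 else "down"


effect, significance = volcano_point(3.0, 1.0, 0.001)
label = regulation(effect, 0.001)
```

With non-log-transformed data the difference would have to be replaced by a ratio, which is the situation the option described above is meant for.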
102. ee section 5 [Screenshot: example of an exported gene list opened as a spreadsheet. The first column is GeneID (control spots such as CNTRL13H13 and CNTRL13D13, followed by other probe identifiers); the remaining columns are headed "Normalized L" (column titles cut off) and contain numeric values and NaN entries.]
103. er should click Statistics > Classification > Support Vector Machines > Classify, and the following window will appear, prompting the user to select the file that contains the new samples:

[Screenshot: Select New Samples file dialog, a standard file chooser with Files of type set to Text tab delimited files (txt).]

The file can be a tab-delimited text file or an Excel file, which should be structured as follows: the first column should contain variable names (e.g. gene names) that serve as unique variable identifiers. The first row should contain sample names that will be used to identify the new samples when they are assigned to classes. The rest of the data should be numeric. Care should be taken that the number of variables (features) is the same as the number of features used to build the classifier model. The classifier model that will be used is the one corresponding to the Analysis Object highlighted in the Analysis Object list. An example of a file of new samples to be classified can be found in Appendix A. After the classification process, a report window will appear:

sm Classifi
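The file layout described above (variable names in the first column, sample names in the first row, numbers elsewhere) can be checked before loading it into ARMADA. The sketch below is a hypothetical helper written in Python, not part of ARMADA; the example content and the expected feature count are made up for illustration.

```python
import csv
import io


def read_new_samples(handle):
    """Parse a tab-delimited new-samples file into names and a numeric matrix."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    sample_names = header[1:]            # first row: sample names
    variable_names, values = [], []
    for row in reader:
        if not row:
            continue                     # skip blank lines
        variable_names.append(row[0])    # first column: variable (gene) names
        values.append([float(cell) for cell in row[1:]])
    return sample_names, variable_names, values


def matches_model(variable_names, n_model_features):
    """Check that the file has as many variables as the classifier was built with."""
    return len(variable_names) == n_model_features


example = "GeneID\tNew 1\tNew 2\ngeneA\t0.5\t-1.2\ngeneB\t2.0\t0.1\n"
samples, variables, matrix = read_new_samples(io.StringIO(example))
```

A non-numeric cell raises a ValueError in this sketch, which is a convenient way to spot formatting problems early.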
104. er words, cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in such a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Given the above, cluster analysis can be used to discover structures in data without providing an explanation or interpretation. In other words, cluster analysis simply discovers structures in data without explaining why they exist. Clustering differentially expressed genes in a microarray experiment helps scientists to organize these genes into groups. Genes that belong to the same group might have something in common in how they are activated, or even in the way they interact with each other. For example, a group of similarly expressed genes might affect the transcription of proteins that catalyze reactions from a specific pathway, thus giving clues to the scientists to further examine this particular pathway of the cell cycle. There are several methods of clustering. ARMADA supports hierarchical, k-means and fuzzy C-means clustering. This section presents how the user can perform cluster analysis using one of the three mentioned methods.

4.3.1 Hierarchical clustering

In hierarchical clustering, the data are not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each
105. ers, the user should click OK to create the plots. It should be noted here that MA Plots in the Plots menu is activated only after the normalization procedure. The next three subsections describe several figure functionalities provided for each of the three categories of MA plots: before normalization, after normalization, and before and after normalization.

5.4.1 MA plots before normalization

Below there is an example of an MA plot before normalization for a specific array.

[Figure: Un-normalized MA plot for array d7 3r txt. Legend: MA data, Normalization points, Fold change cutoff. Horizontal axis: Intensity.]

The image created right after clicking OK in the MA plots preferences window is in data exploration mode. In this mode the user can click on any data point on the figure, and more specific information will be displayed for that point (the GeneID, ratio value, etc.). If the user right-clicks inside the figure, the following menu will appear:

[Screenshot: context menu, including the item Export All.]

If the user clicks on Select Data, the menu item will be check-marked and the figure will enter data selection mode (the cursor will become a cross). In this mode the user can select several data points (genes) by drawing a rectangular area which includes the desired data points with the help of the crosshair cursor. To export the selected data points in tab-delimited text format, the user should right-click on the edge of the specified rectangular area and select Export
106. ers range is given in the Number of clusters range field. After setting the desired parameters, the user should click OK. The tool will start estimating the optimal number of clusters (this might take some time).

[Screenshot: Gap Statistic progress window with the progress bars Overall progress (Method repetitions), Calculating Gap statistic, and Clustering reference dataset.]

It is recommended to also perform a graphical inspection by checking the Show output plot box in the Gap statistic preferences window, as the algorithm might fall into local minima and not return the true best number of clusters. Ideally, the optimal number of clusters should be the point on the horizontal axis where the within-cluster dispersion drops steeply and then remains relatively stable across different numbers of clusters. This should also be the point where the Gap curve rises before and drops steeply after. After completing the process, ARMADA will display the following message informing about the result:

Result: The optimal number of clusters was found to be 5. You may also perform Gap curve inspections.

The figure below displays the output plot of the Gap statistic module.

[Figure: two panels. Upper panel: Within sum of squares (W) vs number of clusters; axes Number of clusters (k) and Within clusters sum of squares (W). Lower panel: Gap curve; axes Number of clusters (k) and Gap statistic.]

It can
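For readers who want to see how a Gap curve can be turned into a single estimate, the sketch below follows the commonly cited rule of Tibshirani, Walther and Hastie: take the smallest k for which Gap(k) >= Gap(k+1) - s(k+1). Whether ARMADA applies exactly this rule is an assumption; the function names and all numbers are hypothetical.

```python
import math
from statistics import mean, pstdev


def gap_and_error(log_w_observed, log_w_references):
    """Gap value and its simulation error for one number of clusters.

    log_w_observed: log within-cluster dispersion of the real data.
    log_w_references: the same quantity for each reference dataset.
    """
    b = len(log_w_references)
    gap = mean(log_w_references) - log_w_observed
    error = pstdev(log_w_references) * math.sqrt(1 + 1 / b)
    return gap, error


def choose_k(ks, gaps, errors):
    """Smallest k whose Gap is within one error of the next Gap value."""
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - errors[i + 1]:
            return ks[i]
    return ks[-1]   # no k satisfied the rule inside the tested range


ks = [2, 3, 4, 5, 6]
gaps = [0.10, 0.25, 0.40, 0.62, 0.55]      # hypothetical Gap curve
errors = [0.03, 0.03, 0.03, 0.03, 0.03]    # hypothetical simulation errors
estimate = choose_k(ks, gaps, errors)
```

In this made-up curve the Gap rises up to k = 5 and falls afterwards, which is the shape the graphical inspection described above looks for.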
107. ess the user wishes to perform a comparison study or pursue any other purpose that involves un-normalized data. [Equation garbled in extraction.]

If Rank Invariant normalization is selected, the following window will appear, prompting the user to set certain parameters or leave the default values unchanged:

[Screenshot: Rank Invariant normalization options window. General panel: Rank thresholds (Lower, Upper), Exclusion rank threshold, Maximum dataset included in rank invariant set, Iterate until specified rank invariant set size reached, Display plots. Smoothing panel: Smoother (Lowess).]

In the General panel the user should set the parameters concerning the determination of the rank invariant set of genes, while in the Smoothing panel the user should set the type of smoothing curve that will be fitted using the set of rank invariant genes and used to normalize the genes on the array. The following table describes the parameters (Rank thresholds, Exclusion rank threshold, Maximum dataset included in rank invariant set, Iterate until specified rank invariant set size reached, Display plots, Smoother) and their values.

Parameter: Description
Rank thresholds: The thresholds for the lowest average rank and the highest average rank, which are used to determine the invariant set. The rank invariant set is a set of data points whose proportional rank difference is smaller than a given threshold. These two thresholds are usually set empirically
108. essary selections, the user should click OK. Below there are two examples of array plots. The data on the plots are viewable, selectable and exportable through the use of a right-click context menu. For further details on these operations the user should consult section 5.4.

[Figure: Channel 1 Foreground Mean vs Channel 2 Foreground Mean for array Wt 2r txt. Legend: Data, Above threshold, Below threshold, Threshold. Axes: Channel 1 Foreground Mean and Channel 2 Foreground Mean.]

[Figure: array-versus-array plot of log Ratio values.]

5.4 MA Plots

MA plots of microarray data [7] are plots of the log ratio of the two channel intensities versus the mean log expression of the two. MA plots are applied to the red (Channel 2 or Cy5) and green (Channel 1 or Cy3) channels and are representations of the data from single arrays, which can be useful in depicting the effects of various normalization methods and quality control issues. To create MA plots, the user should select an Analysis from the Analysis Object list and click Plots > MA Plots. The following window will appear:

[Screenshot: MA Plot Editor window. Arrays list; Plots panel (Before normalization, After normalization, Before and after normalization); Title field.]
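The two quantities plotted in an MA plot follow directly from the definition above: M is the log ratio of the two channels and A is the mean of their log intensities. The Python sketch below assumes base-2 logarithms and takes red as Channel 2 (Cy5) and green as Channel 1 (Cy3); the function name and the intensity values are hypothetical.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from its two channel intensities."""
    m = math.log2(red) - math.log2(green)          # log ratio
    a = 0.5 * (math.log2(red) + math.log2(green))  # mean log intensity
    return m, a


m, a = ma_values(1024.0, 256.0)   # hypothetical Cy5 and Cy3 intensities
```

A spot with equal intensity in both channels has M = 0, so points far from the horizontal zero line are the ones with a large fold change.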
109. ext box or be read from an external file by pressing the Read button. The file can be a tab-delimited text file or an Excel file. For details on its format the user should see Appendix A.
N-fold cross validation: The user should check this box to perform N-fold cross validation of the classifier and supply N.
Leave M out: The user should check this box to perform Leave M out validation of the classifier and supply M.
Training and Test: The user should check this box to perform Training and Test validation of the classifier and supply the percentage of the dataset that should be held out for testing.
Display evaluation plots: Displays classifier evaluation plots based on the tuning options and parameters (plots are based on the number of nearest neighbors, distances, tie break rules and validation methods).
Display output results: A report containing classification evaluation statistics and confusion tables is displayed in a separate window.
Verbose output (command line): Several messages are displayed during the classifier tuning procedure in the command line, or in MATLAB's command window if MATLAB is present.

After setting the parameters, the user should click Tune and classifier tuning will be performed. Depending on the SVM tuning results, the user can select the appropriate parameters to build a model that best fits the dataset under examination. If the user chooses to display classifier evaluation plots by checking the Display evaluation plots box, the following example depicts how these plots a
110. files
Data types listed in the table: Channel 1 Foreground Mean, Channel 2 Foreground Mean, Channel 1 Foreground Median, Channel 2 Foreground Median, Channel 1 Background Mean, Channel 2 Background Mean, Channel 1 Background Median, Channel 2 Background Median, Channel 1 Foreground Standard Deviation, Channel 2 Foreground Standard Deviation, Channel 1 Background Standard Deviation, Channel 2 Background Standard Deviation, Channel 1 Foreground Background Mean, Channel 2 Foreground Background Mean, Channel 1 Foreground Background Median, Channel 2 Foreground Background Median.

Data type: Description
Channel 1 Foreground Mean: The foreground spot signal mean for channel 1 (or Cy3 or Green).
Channel 2 Foreground Mean: The foreground spot signal mean for channel 2 (or Cy5 or Red).
Channel 1 Foreground Median: The foreground spot signal median (if available) for channel 1 (or Cy3 or Green).
Channel 2 Foreground Median: The foreground spot signal median (if available) for channel 2 (or Cy5 or Red).
Channel 1 Background Mean: The background noise spot mean for channel 1 (or Cy3 or Green).
Channel 2 Background Mean: The background noise spot mean for channel 2 (or Cy5 or Red).
Channel 1 Background Median: The background noise spot median (if available) for channel 1 (or Cy3 or Green).
Channel 2 Background Median: The background noise spot median (if available) for channel 2 (or Cy5 or Red).
Channel 1 Foreground Standard Deviation: The foreground spot signal standard deviation (if available) for channel 1 (or Cy3 or Green).
Channel 2 Foreground Standard Deviation: The foreground spot signal standard deviation (if available) for channel 2 (or Cy5 or Red).
Channel 1 Background Standard Deviation: The backgro
111. formed will be active, and the user may check it in order to define a new analysis object using part of the preprocessed whole dataset. If this option remains unchecked, a new clean analysis object will be created, where the user can perform the preprocessing steps from the beginning. It should be noted that keeping the same preprocessing steps does not mean that another set of preprocessing steps cannot be applied to the selected set of experimental conditions. Also, if the user has selected a subset of experimental conditions from the beginning (e.g. Analysis 1 consists of WT and D7 in the above example), then the option Use same preprocessing steps as 1st time (if performed) will not be available and the user should preprocess the data from the beginning for each analysis object created. If there is no reason for an array to be excluded from a condition subset, the user can check the Select all replicates for each condition option to select directly from the Conditions list, without also having to select individual arrays. Finally, the Add >> and << Remove buttons add or remove arrays in the selected experimental condition in the analysis object to be created. When finished with condition and array selection, the user should click OK and a new analysis object will be displayed in the Analysis Object list.

3.2 Background Correction

The first step in data preprocessing with ARMADA is the definition of a background correcti
112. from site here as MATLAB routines, for users who are experienced with MATLAB and have at least MATLAB 7.3 (R2006a) installed on their computer, or as an executable installer file which also contains MCR 7.6. Example datasets can also be downloaded from the above site. If the user chooses to download ARMADA as MATLAB routines, the downloaded file should be unzipped in a location of the user's preference, and then that location must be added to MATLAB's path, including its subfolders (in MATLAB: File > Set Path > Add with Subfolders). If the user chooses to download ARMADA as an executable installer file, the instructions provided through the installation process should be followed carefully. ARMADA is distributed under the Academic Free License version 3 (http://www.opensource.org/licenses/academic.php).

2.2 Creating a new project

To create a new project, the user should click on File > New > New Project (or Ctrl+N), and then the following window will appear:

[Screenshot: New Project dialog, a standard Save dialog listing existing project files (apj) with File name and Save as type fields.]
113. he following table explains the available options in the two upper panels, Clustering values and Clustering variables.

Panel, Option: Description
Clustering values, Condition means: If selected, the gene expression values to be clustered are the mean expression values among replicates for each condition of the selected Analysis.
Clustering values, All replicates: If selected, the gene expression values to be clustered are all the values from all array replicates from each condition of the selected Analysis.
Clustering variables, Genes (rows): If selected, hierarchical clustering will be performed for genes, revealing clusters of genes with similar expression.
Clustering variables, Replicates/Conditions (columns): If selected, hierarchical clustering will be performed for conditions or replicates (depending on the choice in the Clustering values panel), revealing clusters of conditions.

The following table explains the available options in the bottom panel, Options (General options): Number of clusters, Fuzzy parameter, Convergence tolerance, Maximum iterations, p-value cutoff, Optimize fuzzy parameter.

Option: Description
Number of clusters: The number of clusters that the genes should be grouped into.
Fuzzy parameter: The parameter in the fuzzy c-means clustering algorithm.
Convergence tolerance: The maximum error allowed between two consecutive values of the constrained fuzzy partition matrix (cluster membership matrix).
Maximum iterations: Maximum number of iterations for centroid convergence.
p-value cutoff: A p-value cutoff additional to the statistical test p
114. he initial filtering steps: for example, if one gene in a specific slide is found sensitive to any of the filters applied, then it is marked as absent. If one gene is filtered out from all replicates for a given condition, then the TF for this gene is zero. This gene is then marked as unreliable and is excluded from further analysis. The user is prompted to supply a cutoff or leave the default value.

This option determines the between-slide normalization method. If Median Absolute Deviation (MAD) is selected, then expression values among all arrays for the selected Analysis will be scaled using the MAD. MAD is defined as $\mathrm{MAD} = \operatorname{median}\left(|Y - \tilde{Y}|\right)$, where $\tilde{Y}$ is the median of the data and $|Y|$ is the absolute value of $Y$. This is a variation of the average absolute deviation that is less affected by extremes in a distribution tail, because the data in the distribution tails have less influence on the calculation of the median than they do on the mean. This value is calculated for each condition of the selected Analysis and then subtracted from it, in order to make the data easier to compare. If Quantile normalization is selected, then data are normalized between slides using the Quantile normalization algorithm [6]. If No scaling is selected, then no between-slide normalization is performed.

This option determines whether missing value imputation will take place before (Before scaling) or after (After scaling) between slide normaliza
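The MAD definition given above is short enough to verify by hand. The following Python sketch computes it with the standard library and contrasts it with the average absolute deviation on a small made-up sample containing one extreme value; the function names and data are hypothetical, and the sketch does not reproduce how ARMADA applies the value when scaling.

```python
from statistics import mean, median


def mad(values):
    """Median absolute deviation: the median of |Y - median(Y)|."""
    center = median(values)
    return median([abs(v - center) for v in values])


def average_absolute_deviation(values):
    """Mean of |Y - mean(Y)|, shown only for comparison."""
    center = mean(values)
    return mean([abs(v - center) for v in values])


sample = [1, 2, 3, 4, 100]            # hypothetical values with one extreme
robust_spread = mad(sample)           # deviations from the median 3: 2, 1, 0, 1, 97
classic_spread = average_absolute_deviation(sample)
```

The single extreme value dominates the average absolute deviation but leaves the MAD almost untouched, which is the robustness property the text refers to.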
115. he obligatory parameters (at least background correction and filtering), the Run button will be enabled. In order to start the batch process, the user should click the Run button. While the batch process is running, the Batch Programmer displays several output messages in the operating system command line, or in MATLAB's command window if ARMADA runs under MATLAB. Clicking File > Exit or the Close button terminates the Batch Programmer.

7.4 The Annotator

The ARMADA output files containing gene lists or gene clusters contain only the GeneIDs as gene identifiers. More annotation elements can easily be added to these files using the Annotator module. In order to use the Annotator module, the user should have a complete annotation file for the microarrays used. Such files should be in a spreadsheet-like format and contain the annotation elements in different columns. One of the columns in the annotation file must be either a slide position (if it is not already contained in the annotation elements, it can easily be created by assigning unique numbers to each spot using e.g. MS Excel) or the same textual identifier as the one imported in ARMADA (usually the chip manufacturer's gene identifier). For more information the user should see Appendix A on input file formats. Generally, such annotation files are provided by the chip manufacturer or can be created using public repositories. To launch the Annotator the user should click
116. hin the mat file was created. Each structure in the structure matrix has the following fields, displayed below in tree format:

Analysis
  GeneNames
  RawData
  UnNormalized
    Ratio
    Intensity
  Normalized
    Ratio
    Intensity
  DEGenesStats

The user can easily correlate the field names with the options described in the table above. Each leaf of the above tree, apart from the GeneNames field (which is a cell array of strings) and the RawData field (which is a cell array of structures with fields described in the table above), is a MATLAB object of class dataset. For more information on datasets and how they can be handled, the user should consult http://www.mathworks.com/access/helpdesk/help/toolbox/stats under the Organizing Data > Statistical Arrays > Dataset Arrays section. MATLAB includes internal functions to handle dataset objects and convert them to simple matrices, which can then be used with any MATLAB toolbox.

7 Other Tools

This section presents some additional analysis tools implemented in ARMADA. These tools are the principal component analysis tool, which allows pattern discovery between genes belonging to different experimental conditions; the Gap statistic, which allows the determination of the number of clusters in a dataset using one of the supported clustering algorithms; and the Batch Programmer module, which allows to perform multiple analy
117. iance.
  Diagonal Linear: Similar to Linear, but with a diagonal covariance matrix estimate (naive Bayes classifiers).
  Quadratic: Fits multivariate normal densities with covariance estimates stratified by group.
  Diagonal Quadratic: Similar to Quadratic, but with a diagonal covariance matrix estimate (naive Bayes classifiers).
  Mahalanobis: Uses Mahalanobis distances with stratified covariance estimates.
Priors: The way that prior class probabilities are defined. The following options are available:
  Uniform: Prior probabilities are derived from a uniform distribution (equal prior probabilities for each class).
  Empirical: Class prior probabilities are estimated from the group relative frequencies in the training dataset.
  External: User-defined class prior probabilities. In this case the prior probabilities are given in a tab-delimited text file or an Excel file.

The remaining rows of the table cover the Priors file and the options of the Model validation options and General options panels: N-fold cross validation, Leave M out, Training and Test, Display evaluation plots, Display output results, and Verbose output (command line).

Priors file: A tab-delimited text file or Excel file containing class prior probabilities, structured as follows: the 1st column should contain the class names and the 2nd the prior probabilities corresponding to each class in the 1st column (the user should also see Appendix A for an example).
N-fold cross validation: The user should check this box to perform N fold cross validation of the clas
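The three ways of defining priors described above can be made concrete with a short Python sketch. It is only an illustration: the function names are hypothetical, the class labels are borrowed from the example report elsewhere in this guide, and the two-column text stands in for an External priors file.

```python
from collections import Counter


def uniform_priors(labels):
    """Equal prior probability for every class present in the training labels."""
    classes = sorted(set(labels))
    return {c: 1 / len(classes) for c in classes}


def empirical_priors(labels):
    """Priors estimated from the relative class frequencies in the training set."""
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in counts}


def parse_priors(text):
    """Read 'class<TAB>prior' lines, as in a two-column priors file."""
    priors = {}
    for line in text.strip().splitlines():
        name, value = line.split("\t")
        priors[name] = float(value)
    return priors


training_labels = ["ALL", "ALL", "ALL", "AML", "AML", "MLL"]   # hypothetical
uniform = uniform_priors(training_labels)
empirical = empirical_priors(training_labels)
external = parse_priors("ALL\t0.5\nAML\t0.3\nMLL\t0.2")
```

Whichever option is used, the priors should sum to 1 across classes, which is easy to check on an external file before supplying it.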
118. ication algorithms supported in ARMADA are Linear Discriminant Analysis, k-Nearest Neighbor algorithms (kNN) and Support Vector Machines (SVM) classification. The performance of classifiers can be evaluated by several techniques. Three of them are supported in ARMADA:

i. N-fold cross validation: In this technique the p data used to train the classifier are randomly split into n independent datasets of size approximately p/n. Subsequently, n rounds of validation [5] follow, where in each round n - 1 datasets are used to train the classifier and 1 to test it. The misclassification error is the average number of misclassified instances.
ii. Leave m out: In this technique m rounds of classification are performed, where in each round m samples are left out from the dataset in order to be used later to validate the classifier built with the remaining p - m samples. A widely used value for m is m = 1. The misclassification error is the average number of misclassified instances.
iii. Training and test: In this technique a percentage of the dataset is held out and the rest is used to train the classifier. The misclassification error is the average number of misclassified instances from the held-out part of the dataset.

At this point the user should note that usually, prior to building a classifier using any classification technique, the number of variables used to discriminate among classes has to be reduced in order to remove noise. F
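The splitting step of N-fold cross validation can be sketched in a few lines of Python. The code below only produces the train and test index sets for each round; training a classifier and counting misclassifications are left out. The function name, the sample count and the seed are hypothetical.

```python
import random


def cross_validation_rounds(p, n, rng):
    """Split p sample indices into n folds and return (train, test) pairs.

    In each round one fold is held out for testing and the other
    n - 1 folds are used for training.
    """
    indices = list(range(p))
    rng.shuffle(indices)                       # random assignment to folds
    folds = [indices[i::n] for i in range(n)]  # sizes differ by at most one
    rounds = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        rounds.append((train, test))
    return rounds


rounds = cross_validation_rounds(10, 5, random.Random(1))
```

Every sample ends up in the test set exactly once, so averaging the per-round errors uses each sample once for evaluation.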
119. in section 4.4.1.1. After setting the desired parameters, the user should click OK. After the classification procedure, a report window will appear:

[Report window: DA Classification results]
Discriminant Analysis classification results
Discriminant function type: Diagonal Linear
Class prior probabilities: Empirical
Number of new samples: 6
The class(es) assigned to new data samples is(are):
sample New 1 belongs to class ALL
sample New 2 belongs to class ALL
sample New 3 belongs to class AML
sample New 4 belongs to class AML
sample New 5 belongs to class MLL
sample New 6 belongs to class MLL

4.4.2 k-Nearest Neighbors

The k-Nearest Neighbors (kNN) classification is a very simple yet powerful classification method. The key idea behind kNN classification is that similar observations belong to similar classes. Thus, one simply has to look for the class designators of a certain number of the nearest neighbors and weigh their class numbers to assign a class number to the unknown. The weighing scheme of the class numbers is often a majority rule, but other schemes are conceivable. The number of the nearest neighbors, k, should be odd in order to avoid ties, and it should be kept small, since a large k tends to create misclassifications unless the individual classes are well separated. One of the major drawbacks of kNN classifiers is that the classifier needs all available data. This may lead to considerable overhead if the tra
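The majority-rule idea behind kNN can be shown with a minimal Python sketch. It uses the Euclidean distance and a plain majority vote; the training points, labels and function name are hypothetical, and the other distances and tie-breaking rules offered by ARMADA are not reproduced.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Assign the class held by the majority of the k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


points = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5)]   # hypothetical samples
labels = ["ALL", "ALL", "ALL", "AML", "AML"]
near_origin = knn_predict(points, labels, (0.5, 0.5), k=3)
near_cluster = knn_predict(points, labels, (5.5, 5.0), k=3)
```

Note that the whole training set is scanned for every query, which is the overhead the text mentions for large training sets.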
120. in the form of the tree. It contains several summary data for each analysis in your project, such as the number of conditions and slides included in the analysis, which filtering or statistical methods were used, etc.

[Screenshot: project tree for a project named IPFNew, with branches such as Name, Filename, Date, NumberOfConditions, NumberOfSlides, Slides, and Analysis branches containing NumberOfConditions, NumberOfSlides, Slides, Preprocess, StatisticalSelection, DEGenes and Clustering.]

The following table presents the names of the tree branches (Name, Filename, Date, NumberOfConditions, NumberOfSlides, Slides, Analysis, Preprocess, BackgroundCorrection, UseEstimate).

Branch name: Description
Name: The name of the project.
Filename: The path and filename of the project.
Date: The date the project was created.
NumberOfConditions: When placed directly under the root of the tree, it represents the total number of experimental conditions in the project, while when placed under an Analysis branch it represents the number of conditions for that specific analysis.
NumberOfSlides: When placed direct
121. ining data set is large. Apart from building a kNN classifier, ARMADA offers a batch process module that allows the user to tune the classifier (e.g. select the best number of nearest neighbors in combination with the proper distance function and tie-breaking rule for the specific problem studied) using a representative dataset. The following subsections describe the tuning and classification process using kNN in ARMADA.

4.4.2.1 k-Nearest Neighbors Tuning

In order to perform kNN classifier tuning, the user should select an Analysis object from the Analysis Objects list and click Statistics > Classification > k-Nearest Neighbors > Tune. The following preferences window will appear:

[Screenshot: KNN Tune Editor window. KNN tuning options panel: k Nearest Neighbors range, Distance (Euclidean, Cityblock, Cosine, Correlation, Hamming), Ties rule (including Consensus). Model validation options panel: N-fold cross validation, Leave M out, Training and Test. General options panel: Display evaluation plot, Display output results, Verbose output (command line).]

The following table explains the available options in the KNN tuning options, Model validation options and General options panels.

Option: Description
k Nearest Neighbors range: A range of values that should be used as the number of nearest neighbors. It should be given as a series of numbers separated by commas or spaces, or in M
122. int inside the diagram in the upper panel corresponds to a gene. The percentage of the data variance captured by each principal component is given in parentheses on each axis label. The bottom panel presents the projection of the data matrix on the 2-dimensional plane defined by the 1st and 3rd principal components respectively. The user can use the popup lists to change the data projection plane by changing the principal components displayed. The user can also click on each data point (gene) to view its label and also select data as in MA or volcano plots. The names of the selected genes are displayed in the list on the right part of the figure and the user can export the selected genes, coupled with their expression values, in text tab delimited format by clicking the Export button. In addition, the user can move the rectangular area (window) over other data points and the list on the right will be updated each time with the genes that are inside the moving window. The main goal of PCA is to transform the original data in such a way as to reveal any possible patterns that can help the researcher to distinguish among different experimental conditions. In the example used to create the above figure, it is obvious that the first 2 principal components presented in the upper panel are able to account for a large percentage of variance in the dataset and this can also be seen by the shape of the proje
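To make the "percentage of variance captured by each principal component" more concrete, the following minimal sketch computes it for a dataset with only two variables, where the eigenvalues of the covariance matrix have a closed form. The function name and the toy data are assumptions made for illustration; a real expression matrix has many more variables and needs a general eigenvalue routine.

```python
import math


def explained_variance_2d(points):
    """Percentage of variance captured by the two principal components of 2-D data."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix via the characteristic polynomial.
    half_trace = (sxx + syy) / 2
    det = sxx * syy - sxy ** 2
    root = math.sqrt(max(half_trace ** 2 - det, 0.0))
    eigenvalues = [half_trace + root, half_trace - root]
    total = sum(eigenvalues)
    return [100 * value / total for value in eigenvalues]


# Hypothetical, strongly correlated two-variable dataset.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]
percentages = explained_variance_2d(points)
```

Each eigenvalue of the covariance matrix is the variance along one principal component, so dividing it by the sum of all eigenvalues gives the percentage shown on the axis labels.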
123. ion has the form $k(x, x') = (\mathrm{Gamma} \cdot x^{T} x' + \mathrm{Coefficient})^{\mathrm{Degree}}$.
Sigmoid (MLP): MLP stands for Multi-Layer Perceptron. The MLP kernel function has the form $k(x, x') = \tanh(\mathrm{Gamma} \cdot x^{T} x' + \mathrm{Coefficient})$.
RBF: RBF stands for Radial Basis Function. The kernel function has the form $k(x, x') = \exp(-\mathrm{Gamma} \cdot \lVert x - x' \rVert^{2})$.

Normalize input data matrix so that each column has mean 0 and standard deviation 1.
Scale input data matrix so that all data values lie within a given range. The upper and lower limits can be given through the corresponding text boxes.
Tolerance of termination criterion.
The gamma coefficient in the polynomial kernel model.
The correction coefficient in the polynomial kernel model.
The degree of the polynomial kernel model.
Triplets of parameters: they can be entered manually using the respective text boxes or be read from an external file by pressing the Read button. The file can be a text tab delimited or Excel file. For details on its format the user should see Appendix A.
The gamma coefficient in the sigmoid kernel model.
The correction coefficient in the sigmoid kernel model.
Doublets of parameters: they can be entered manually using the respective text boxes or be read from an external file by pressing the Read button. The file can be a text tab delimited or Excel file. For details on its format the user should see Appendix A.
The gamma coefficient in the RBF kernel model.
Parameter values: they can be entered manually using the respective t
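The three kernel functions named above can be written down directly. The sketch below is illustrative only: the function and argument names are chosen for the example, and the forms assumed are the usual ones, (Gamma x·x' + Coefficient)^Degree for the polynomial kernel, tanh(Gamma x·x' + Coefficient) for the sigmoid kernel and exp(-Gamma ||x - x'||^2) for the RBF kernel.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, gamma, coefficient, degree):
    # (Gamma * <u, v> + Coefficient) ^ Degree
    return (gamma * dot(u, v) + coefficient) ** degree


def sigmoid_kernel(u, v, gamma, coefficient):
    # tanh(Gamma * <u, v> + Coefficient)
    return math.tanh(gamma * dot(u, v) + coefficient)


def rbf_kernel(u, v, gamma):
    # exp(-Gamma * ||u - v||^2)
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


# One (gamma, coefficient, degree) triplet, one doublet and one single value,
# mirroring how the tuning parameters are grouped per kernel.
poly_value = polynomial_kernel([1.0, 2.0], [3.0, 4.0], gamma=1.0, coefficient=1.0, degree=2)
sigmoid_value = sigmoid_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.5, coefficient=0.0)
rbf_value = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.1)
```

The grouping of tuning parameters follows from the formulas: the polynomial kernel needs three values, the sigmoid kernel two and the RBF kernel only gamma.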
124. [Figure: file selection window listing the data files of the E-MEXP-817 dataset]

This window will appear as many times as the number of experimental conditions in the project. Each time the user will be prompted to select the files for the condition whose name is displayed in the window title, e.g. Select files for condition Control. This file selection window gives certain control over the file names and what is displayed, e.g. the user can filter what is displayed by a regular expression (the user should see for example http://www.regular-expressions.info), change the directory or display the full paths of the selected files. In the case of importing from ImaGene output, the user should also take into account that ImaGene files are produced in pairs: one file for the Cy3 (1st) channel and one file for the Cy5 (2nd) channel. Thus, the files are also selected in pairs. ARMADA distinguishes the channels for ImaGene files by searching for the text Cy3 or Cy5 in the filenames. The user should make sure that this string exists in the filenames and that each file corresponds to the proper channel. After selecting the array files for each condition, the user should click Done. To finish importing the dataset, click OK in the Data Import wizard window. If the user has chosen to import direct output files from one of the supported image analysis programs, they will be automatically imported for analysi
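The ImaGene channel convention described above (the text Cy3 or Cy5 must appear in each filename and files come in pairs) can be checked before importing with a few lines of code. This is a hypothetical helper, not part of ARMADA: the filenames are invented and the pairing rule (two files belong together when their names are identical apart from the Cy3/Cy5 text) is an assumption made for the sketch.

```python
def channel_of(filename):
    """Return 1 for a Cy3 file and 2 for a Cy5 file, based on the filename text."""
    has_cy3 = "Cy3" in filename
    has_cy5 = "Cy5" in filename
    if has_cy3 == has_cy5:
        raise ValueError("cannot determine the channel of " + filename)
    return 1 if has_cy3 else 2


def pair_files(filenames):
    """Group files into (Cy3, Cy5) pairs; assumes paired names differ only in the dye text."""
    pairs = {}
    for name in filenames:
        key = name.replace("Cy3", "").replace("Cy5", "")
        pairs.setdefault(key, {})[channel_of(name)] = name
    incomplete = [key for key, found in pairs.items() if len(found) != 2]
    if incomplete:
        raise ValueError("files without a partner: " + ", ".join(sorted(incomplete)))
    return [(found[1], found[2]) for _, found in sorted(pairs.items())]


# Hypothetical filenames for two replicates of a condition named Control.
selected = ["Control_1_Cy5.txt", "Control_1_Cy3.txt",
            "Control_2_Cy3.txt", "Control_2_Cy5.txt"]
paired = pair_files(selected)
```

Running such a check on the selected files makes missing partners or misnamed channels visible before the import starts.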
125. irst row should contain sample names that will be used to identify the new samples when they are assigned to classes. The rest of the data should be numeric. Care should be taken that the number of variables (features) is the same as the number of features used to build the classifier model. For an example of a file of new samples to be classified, the user should consult Appendix A. The rest of the options in the kNN Classify preferences window under the kNN classification options panel (k-Nearest Neighbors, Distance and Ties rule) are the same as in the case of tuning the kNN classifier and their description can be found in section 4.4.2.1. After setting the desired parameters, the user should click OK. After the classification procedure, a report window will appear.

[Figure: kNN Classification results window, with the following contents]
k-Nearest Neighbors classification results
Number of nearest neighbors:
Distance metric: Euclidean
Classification rule: Nearest
Number of new samples: 6
The class(es) assigned to new data samples is (are):
sample New 1 belongs to class ALL
sample New 2 belongs to class ALL
sample New 3 belongs to class AML
sample New 4 belongs to class AML
sample New 5 belongs to class MLL
sample New 6 belongs to class MLL

4.4.3 Support Vector Machines

The Support Vector Machines (SVM) classification method was developed by Vapnik [14] for binary classification and has been extensively used for microarray data classificatio
126. is filter is based on the formula $S/B < T$, where $T$ is the threshold below which noisy spots are filtered out from each array.

This filter is based on the distance between the signal and background distributions: a spot is robust against this filter if its signal and noise distributions are separated by a distance which is determined by the respective standard deviations. Sensitive spots are determined by the inequality $S - x\sigma_{S} < B + y\sigma_{B}$, where $x$ and $y$ are user defined parameters.

In this case the user can create his own custom filter using any of the operators <, >, & (logical AND) or | (logical OR), any positive real number and any of the expressions below:
SigMean: Signal Mean
BackMean: Background Mean
SigMedian: Signal Median (for ImaGene, GenePix, or other if medians information is present)
BackMedian: Background Median (for ImaGene, GenePix, or other if medians information is present)
SigStd: Signal Standard Deviation
BackStd: Background Standard Deviation
As an example of the custom filter case, the following filtering expressions are valid and are applied for each microarray in the selected set of conditions on both channels: SigMean 2 SigStd SigMedian BackMedian < 500 SigMean BackMean < 3 & SigMean < 1000

It should be noted that spot filters are applied to both channels and the union of poor spots from both channels is considered to be the set of poor spots and excluded from further
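The two built-in noise filters and the rule that a spot is poor when it fails on either channel can be sketched as below. This is an illustration under stated assumptions rather than ARMADA's code: the function names and spot values are invented, the signal to noise rule is taken to flag spots with S/B below T, and the distribution distance rule is taken to flag spots with S - x·std(S) < B + y·std(B).

```python
def snr_poor(spot, threshold):
    # Assumed signal to noise rule: the spot is noisy when S / B falls below T.
    return spot["SigMean"] / spot["BackMean"] < threshold


def distance_poor(spot, x, y):
    # Assumed distribution distance rule: S - x*std(S) < B + y*std(B).
    return spot["SigMean"] - x * spot["SigStd"] < spot["BackMean"] + y * spot["BackStd"]


def poor_spots(channel1, channel2, is_poor):
    # A spot is excluded when it is poor on at least one channel (union).
    return [is_poor(a) or is_poor(b) for a, b in zip(channel1, channel2)]


def spot(sig_mean, sig_std, back_mean, back_std):
    return {"SigMean": sig_mean, "SigStd": sig_std,
            "BackMean": back_mean, "BackStd": back_std}


# Hypothetical measurements for three spots on the two channels of one slide.
cy3 = [spot(1000, 100, 200, 50), spot(300, 80, 250, 40), spot(1500, 100, 200, 50)]
cy5 = [spot(900, 120, 850, 60), spot(2000, 150, 300, 30), spot(1800, 100, 300, 50)]

flags_snr = poor_spots(cy3, cy5, lambda s: snr_poor(s, threshold=2))
flags_distance = poor_spots(cy3, cy5, lambda s: distance_poor(s, x=1, y=1))
```

The first spot is acceptable on the Cy3 channel but fails on Cy5, so the union rule marks it as poor; only the third spot passes both filters on both channels.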
127. ists in the Choose annotation columns panel. The Columns list contains the column headers from the annotation file so the user can choose. The user should choose then the appropriate columns (which should contain the same data type, else an error will be generated) and then, from the Columns list, the user should choose the desired annotation elements to be added to the gene lists. After having made at least one selection from all the lists in the Choose annotation columns panel (even if having to reselect the default values), the Annotate button will be enabled. The user should click on the Annotate button and the Annotator will add annotation elements to the provided files. This process might take some time, depending on the number of files to be annotated, their type and their size.

References

de Jong S. and van der Meer F. (2002) Imaging spectrometry: basic principles and prospective applications. Kluwer Academic.
Yang Y.H., Dudoit S., Luu P., Lin D.M., Peng V., Ngai J. and Speed T.P. (2002) Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research, 30, e15.
Cleveland W.S., Grosse E. and Shyu W.M. (1992) In Chambers J.M. and Hastie T.J. (eds), Statistical Models in S. Wadsworth & Brooks/Cole.
Dormand J.R.
Tseng G.C., Oh M.K., Rohlin L., Liao J.C. and Wong W.H.
128. ith biological background. The purpose of this user's guide is to provide insights on the use of the platform and explain its capabilities as simply as possible, in the eyes of a biologist with little programming experience or little experience in statistical computing. If some points are unclear or not explained as explicitly as expected, please provide feedback and help the developers perform better on later versions of ARMADA. Please report feedback, comments, suggestions or possible bugs and malfunctions to Panagiotis Moulos pmoulos geie gr

[Figure: Analysis workflow of ARMADA]

1.2 Release changes

The following section is a description of the additions made in version 1.1 of ARMADA compared to the previous version 1.0:
• Users can now import external, already processed data directly for clustering or supervised learning training without having to perform statistical selection first
• Complete dye swap experiment support implemented during the normalization procedure under the Normalization preferences
• Added single
129. ith the column names above for GenePix. For example, 532 must correspond to Cy3 or Green and 635 must correspond to Cy5 or Red. For more information on the necessary quantitation inputs the user should see section 2.5.2. For more information on GenePix headers the user should also check http://www.moleculardevices.com/pages/software/gn_genepix_file_formats.html#gpr and also http://www.moleculardevices.com/pages/software/gn_gpr_format_history.html

A.1.4 Text tab delimited files

ARMADA can process other types of raw data which are derived from not yet supported image analysis software (e.g. ArrayVision, Imaging Research Inc.), as long as they have a minimum of quantitation types (section 2.5.2) for both channels and any software dependent headers have been removed, so that the file contains only a number of columns, with the first row of each column containing the name of the quantitation type. The user can also import text tab delimited files from other sources, such as public microarray databases, e.g. ArrayExpress (www.ebi.ac.uk/arrayexpress) or Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo).

A.2 Processed data

The user can import already processed data in ARMADA. Such data can be already calculated but not normalized expression (natural or log) ratio values or ratio-intensity pairs, which can be imported to ARMADA for normalization, and they can also be normalized data (ratios or ratio-intensity pairs) which
130. luster List or by clicking View Gene Clusters List.

[Figure: Gene Clusters List window; table columns: GeneID, ClusterNo, Sum of Dist, p value]

2.6.12 Reports
131. ly under the root of the tree, it represents the total number of microarrays in the project, while when placed under an Analysis branch, it represents the number of microarrays for that specific analysis.
Slides: A sub-tree containing the experimental conditions as branch names and the names of the data files as leaves.
Analysis: Different analyses of the project, e.g. using a different set of conditions or different preprocessing or statistical selection methods.
Preprocess: Branch name for the preprocessing steps used in the project. Its children contain details on the preprocessing methods used (the user should see Preprocessing data).
BackgroundCorrection: The background correction method used.
UseEstimate: The main signal estimation (mean or median).
Further branch names in the table: ChannelInfo, StatisticalSelection, BSN, Impute, When, TF, Test, Correction, Cut, DEgenes, Clustering, Algorithm, Linkage, Distance, Seed, Limit, PValue, Clusters, SVM, Kernel, Parameters.
FilterMethod: The spot filtering method used.
FilterParameter: The parameters used with the spot filtering method.
OutlierTest: The statistical test that was used for outlier detection among replicates of the same condition (if performed).
Normalization: The name of the within-slide normalization method that was utilized (if any).
Span: The spanning neighbourhood size for LOWESS/LOESS normalization methods.
Subgrid: Whether block dependent normalization was chosen to be performed or

2.6.4 Arrays list
132. m. It differs from hierarchical clustering in that the number of clusters k needs to be determined at the onset. The goal is to divide the objects into k clusters such that some metric relative to the centroids (cluster centers) of the clusters is minimized. These centroids should be placed carefully, because different locations lead to different results. The better choice is to place them as far away from each other as possible. The next step is to take each point belonging to the given data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an early grouping is done. At this point, k new centroids should be re-calculated based on the results of the first grouping. This procedure is repeated until the centroids stop changing positions. k-means clustering is very useful when there is an a priori estimation of the number of clusters that the data should be grouped into. For further information on k-means clustering the user should see [10].

[Footnote: Only one among Inconsistency coefficient and Maximum number of clusters can be provided to determine the number of returned clusters.]

In order to perform k-means clustering, the user should select an Analysis object from the Analysis Objects list and click Statistics → Clustering → k-means. The following preferences window will appear.

[Figure: kmeans Clustering Editor preferences window]
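The iterative procedure described above (assign every point to its nearest centroid, re-calculate the centroids, repeat until they stop moving) can be sketched in a few lines. This is a generic illustration of the k-means iteration, not ARMADA's implementation; the function names, the toy points and the hand-picked initial centroids are assumptions made for the example.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centroids, max_iter=100):
    """Plain k-means iteration starting from the given initial centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Step 1: associate each point with its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Step 2: re-calculate each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [mean_point(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        # Stop when the centroids no longer change position.
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


# Hypothetical data: two obvious groups; initial centroids chosen far apart.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, [(1.0, 1.0), (9.0, 8.0)])
```

Because the result depends on the starting centroids, the procedure is often run from several different starting positions and the best grouping is kept.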
133. n (e.g. [15]). Briefly, the optimal separating hyperplane between the two classes is computed by maximizing the margin between the classes' closest points. The points lying on the boundaries are called support vectors and the middle of the margin is the optimal separating hyperplane. The points on the wrong side of the discriminant margin are weighted down to reduce their influence. When a linear separator cannot be found, the points are projected into a higher dimensional space where they effectively become linearly separable. A program able to perform such optimization tasks is called a Support Vector Machine. Although SVMs were initially developed for binary classification problems, there exist several strategies that can deal with multiclass problems. An example is the one-against-one approach, in which k(k-1)/2 binary classifiers are trained and the appropriate class is found by a voting scheme. The SVM classifier implemented in ARMADA is based on the OSU SVM Toolbox (http://sourceforge.net/projects/svm), which supports multiclass classification. The following sub-sections describe the tuning and classification process using SVMs in ARMADA.

4.4.3.1 Support Vector Machines Tuning

In order to perform SVM classifier tuning, the user should select an Analysis object from the Analysis Objects list and click Statistics → Classification → Support Vector Machines → Tune. The following preferences window will appear.

[Figure: SVM Tune Editor preferences window]
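The one-against-one strategy mentioned above can be illustrated with a short voting sketch. The binary decision function used here is a hypothetical stand-in (a nearest class center rule on a single number) for a trained binary SVM; only the pairing of classes and the vote counting are the point of the example.

```python
from collections import Counter
from itertools import combinations


def one_against_one(classes, binary_decision, sample):
    """Run one binary decision per pair of classes and return the majority vote."""
    pairs = list(combinations(classes, 2))  # k(k-1)/2 pairs for k classes
    votes = Counter()
    for first, second in pairs:
        votes[binary_decision(first, second, sample)] += 1
    winner = votes.most_common(1)[0][0]
    return winner, len(pairs)


# Hypothetical stand-in for trained binary SVMs: pick the class whose
# (made-up) center is closer to the one-dimensional sample.
CENTERS = {"ALL": 0.0, "AML": 5.0, "MLL": 10.0}


def toy_decision(first, second, sample):
    if abs(sample - CENTERS[first]) <= abs(sample - CENTERS[second]):
        return first
    return second


winner, n_classifiers = one_against_one(["ALL", "AML", "MLL"], toy_decision, 4.2)
```

With three classes, three binary decisions are made; in this toy case AML wins both of its pairwise contests and therefore collects the majority of the votes.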
134. n be filtered easily and thus ensure a constant level of comparison across the whole range of intensities. After selecting the desired background correction method, the user should click OK. If the user skips the background correction step, ARMADA sets it automatically to No Correction.

3.3 Spot quality filtering

After setting the background correction method, the next step in the analysis consists of spot quality filtering to exclude spots with high background contamination. By clicking Preprocessing → Filtering, the following window will appear.

[Figure: Filtering Editor window with the General panel (Use Medians instead of Means if available, Export filtered spots in Excel format), the Noise filtering panel (No filtering, Signal to Noise threshold, Signal-Noise distribution distance, Custom filter) and the Outlier detection panel (Statistic: t-test, p-value cutoff: 0.05, Display p-value histograms)]

ARMADA uses by default the signal and background means to estimate the net signal that will be used to calculate expression for each spot. In the General panel the user can select to use the signal and background medians instead of the means, if available, or to export the genes sensitive to the filtering processes in Excel format. In the Noise filtering and Outlier detection panels the user is able to select the gene filtering methods and change the default thresholds if desired. ARMADA divides spot quality filtering in 2 parts 1
135. nd Un normalized
5.4.1 MA plots before normalization
5.4.2 MA plots after normalization
5.4.3 MA plots before and after normalization
5.4.4 Subgrid MA plots
5.8 Expression Profiles
7.1 The Principal Component Analysis tool
Appendix A
A.1 Raw data: Image analysis software output and tab delimited files
A.1.3 GenePix file formats
A.1.4 Text tab delimited files
A.2 Processed data
A.3 Files used for classification
A.3.2 External class prior files for DA cl
136. [Figure: histogram of t-test p-values; x-axis: p value, y-axis: Genes plurality]

After making the necessary selections or leaving the default settings, the user should click OK to perform the selected operations. At this point ARMADA filters the selected data based on the intensities of both channels and calculates the log ratio between channels as well as the intensity values for the genes that passed the filtering procedures. If the option Export filtered spots in Excel format is checked, the following window will appear, prompting the user to select which types of filtered spots should be exported.

[Figure: Export Filtered Spots window (Choose spot set) with the check boxes: Bad spots for each condition; Good spots for each condition; Bad spots for each condition and replicate; Good spots for each condition and replicate; Common bad spots between replicates for each condition; Common good spots between replicates for each condition; Common bad spots between replicates for all conditions; Common good spots between replicates for all conditions]

The following table describes which spots will be exported by checking each of the check boxes.

Option: Bad spots for each condition; Good spots for each condition; Bad spots for each condition and replicate; Good spots for each condition and replicate; Common bad
137. ng correction procedure [7].
FDR Benjamini Hochberg: Benjamini-Hochberg False Discovery Rate control procedure [8].
pFDR Storey Bootstrap: Storey positive False Discovery Rate control procedure. True null hypothesis is calculated from the tuning parameter based on bootstrap [9].
pFDR Storey Polynomial: Storey positive False Discovery Rate control procedure. True null hypothesis is calculated from the tuning parameter based on polynomial fit [9].
The user should consult Appendix C for more information on multiple testing correction issues.
p-value threshold: A p-value cutoff threshold (when no or Bonferroni multiple testing correction is chosen) to determine the number of differentially expressed genes.
FDR threshold: An FDR threshold value (when any other multiple testing correction method is chosen) to determine the number of differentially expressed genes.

In the case of selecting Time Course ANOVA in the Statistical test options list, the following window will appear.

[Figure: Time Course ANOVA Editor window, listing the defined fold change pairs together with the Select control and Select treated lists]

In this preferences window the user should define pairs of experimental conditions so that fold changes can be calculated for time course ANOVA to run. The user can define pairs by using the Select control and Select treated lists and using the Add >> and << Remove
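To show how the p-value and FDR thresholds above act on a list of test results, the following sketch computes Bonferroni and Benjamini-Hochberg adjusted p-values. It follows the standard textbook definitions of the two procedures and is not taken from ARMADA's code; the function names and the example p-values are chosen for illustration.

```python
def bonferroni(pvalues):
    """Bonferroni adjusted p-values: p * m, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values for four genes.
pvalues = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(pvalues)
fdr = benjamini_hochberg(pvalues)

# Genes called differentially expressed at an FDR threshold of 0.03.
selected = [i for i, q in enumerate(fdr) if q <= 0.03]
```

With these values the Bonferroni correction keeps only genes whose adjusted p-value stays below the p-value threshold, while the Benjamini-Hochberg procedure is less strict and is compared against the FDR threshold instead.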
138. nregulated genes (blue points). If fold change and p-value thresholds have not been provided, exports all data points.
Export All: Exports all data points regardless of mode status.

5.8 Expression Profiles

Expression profile plots allow the analyst to follow the expression of specific genes across different experimental conditions, or across different time points in a time course experiment, with the help of a graphic which displays the gene expression (the log channel ratio) against the different experimental conditions or time points. Expression profile plots are especially useful for the display of expression patterns after the application of a clustering algorithm (section 4.3) such as k-means clustering (section 4.3.2), as they allow the user to identify specific patterns and suggest directions for further research. Expression profile plots are also helpful for overlaying the expression of many genes across different conditions and identifying several phenomena, such as genes with reverse expression, early or late deregulated genes compared to each other, etc. To create expression profile plots, the user should select an Analysis from the Analysis Object list and click Plots → Expression Profiles. The following window will appear.

[Figure: Expression Profile Editor window with the Genes or Clusters list]
139. ns will return an Excel file containing the spots that were commonly filtered from all conditions and replicates.

Array spots which were found to be sensitive to any of the procedures described in this section are marked as non-informative (poor quality) spots and excluded from the dataset to be subsequently normalized, in order to relieve the normalization procedure from the impact of systematic measurement errors. It should also be noted that if the filtering part is skipped, ARMADA takes the No filtering option as default.

3.4 Normalization

The analysis part that follows poor quality spot filtering in the ARMADA workflow is data normalization for each slide, to compensate for systematic measurement errors. At this point it should be noted that normalization is performed on each microarray slide separately, using only the genes that passed the filtering tests for each slide. In order to select the normalization method and set various parameters, the user should click Preprocessing → Normalization and the following window will appear.

[Figure: Normalization Editor window with the Normalization panel (method, Spanning Neighbourhood), the Dye Swap options panel (Channel-Dye correspondence for all arrays, Launch Dye Swap Editor) and the Other panel (Subgrid normalization, Display timebar)]

In the Normalization panel the user should select among one of the currently supported normalization methods In
140. [Figure: export to MATLAB preferences window; the visible check boxes include Normalized log ratio/intensity data and DE genes data (p-values etc.)]

From there the user can select several data types to be exported to the .mat file, which can then be opened from within MATLAB for further processing. The following table describes the data exporting choices of the export to MATLAB preferences window.

Option: Description
Gene names: Exports the chip manufacturer's gene identification, which is determined from the input files (the user should see section 2.5 for further details).
Raw data (image software output): Exports the raw data provided with the input files in structure format. Each input file is a structure with the following fields (some might be missing depending on the type of the input files), which are described here very briefly, as the user can find information on these fields in section 2.5.

Field name: Short description
Header: The header of the input file
Blocks: Array subgrid blocks
ArrayRow: Row meta-coordinates
ArrayColumn: Column meta-coordinates
Row: Row coordinates
Column: Column coordinates
ColumnNames: File column names
Number: Gene numbering (slide positions)
GeneNames: Gene identifiers
ch1Intensity: Channel 1 signal mean
ch2Intensity: Channel 2 signal mean
ch1IntensityMedian: Channel 1 signal median
ch2IntensityMedian: Channel 2 signal median
Further field names in the table: ch2IntensityStd, ch1Background, ch2Background.
ch1IntensityStd: Channel 1 signal standard
141. o intensity pairs and ratios will be log transformed. If ratios are already log transformed, the user should choose Log ratio intensity pairs. Un-normalized ratios alone cannot be imported.

Such data should be structured in ratio-intensity pairs or ratios alone. Ratio is the ratio between channel 2 (sample) and channel 1 (reference) data, while intensity is an estimate of spot intensity based on the signals of the two channels (the user should see Appendix A). If the provided ratios are not log transformed, the user should choose Raw ratio intensity pairs or Raw ratio only, depending on the nature of the data, and ratios will be log transformed. If ratios are already log transformed, the user should choose Log ratio intensity pairs or Log ratio. The user should note that some of the data exploration plots will not be available in the case of only ratio data.

The list File columns contains the column headers found in the data file to be imported. The user should use this list and the buttons Add >> and << Remove to assign proper ratio-intensity pairs or ratios only to each of the conditions in the Conditions list. At any point the user can see the assigned ratios and intensities in the Ratios and Intensities lists. All columns of the file should be assigned to an experimental condition and the only column that will remain in the File columns list should contain a unique gene identifier. After properly set
142. ods used for the selection of differentially expressed genes, several other techniques are used, derived especially from the area of Machine Learning, for the classification of genes and the prognosis and prediction of new data. These methods are mainly classified into two categories: unsupervised learning and supervised learning. In unsupervised learning the algorithm is given a set of objects and tries to group them into classes without any prior knowledge of these classes or any labeled output. Classical unsupervised techniques are clustering techniques (the user should see section 4.3). Supervised learning algorithms make use of a set of classified examples and try, given this sample of input-output pairs, to determine the function that maps any input to any output such that disagreement with future input-output pairs is minimized. Supervised learning usually refers to classification problems and regression problems. The term classification usually refers to a prediction or learning problem in which the variable to be predicted assumes one of k unordered values c1, c2, ..., ck, arbitrarily relabeled as 1, 2, ..., k or sometimes 0, 1, ..., k-1. The k values correspond to k predefined classes, e.g. tumor class or bacteria type. Classification algorithms are given a set of samples and their class labels and try to predict the correct class for new data. In regression problems the output is a set of real numbers instead of class labels [10]. Classif
143. [Figure: ARMADA main window for the project IPFNew, which includes 5 experimental conditions and 19 slides; the project information panel lists the slide files of each condition and the background correction method (Signal to Noise ratio) of Analysis run 1]

2.6.2 History textbox

The history textbox is placed at the bottom right of the main window. It contains messages produced during different analysis steps in the project. These messages could be used in the production of reports.

[Figure: history textbox showing the messages of a Rank Invariant normalization step (average rank thresholds, maximum percentage of dataset points included in the rank invariant set, data smoothing with LOWESS and span 0.1)]

2.6.3 Tree view

The tree view is placed in the left side of the main window and displays analysis history
144. on method to correct for background image contamination caused by several factors, such as artifacts on the array surface (scratches) or non-specific hybridization. To choose the background correction method, the user should click Preprocessing → Background Correction and the following window will appear.

[Figure: Background Correction Editor window with the options Background Subtraction, Signal to Noise Ratio and No Correction]

The program uses one of three possible methods (Background Subtraction, Signal to Noise Ratio or No Correction) to correct for each spot's background contamination and calculate the pure signal value for each gene in each slide (replicate) for all conditions. Each background correction method implemented is summarized in the table below (notation: $S$ is the signal mean/median, $B$ is the background mean/median and $\hat{S}$ is the net signal estimation for each spot).

Method: Description
Background Subtraction: In this case the net signal for each spot is $\hat{S} = S - B$ and the log ratio between channels is $R = \log\frac{S_{Cy5} - B_{Cy5}}{S_{Cy3} - B_{Cy3}}$.
Signal to Noise Ratio: In this case the net signal for each spot is $\hat{S} = S / B$ and the log ratio between channels is $R = \log\frac{S_{Cy5} / B_{Cy5}}{S_{Cy3} / B_{Cy3}}$.
No Correction: In this case the net signal for each spot is $\hat{S} = S$ and the log ratio between channels is $R = \log\frac{S_{Cy5}}{S_{Cy3}}$.

It should be noted that the case of Signal to Noise
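The three background correction methods in the table above differ only in how the net signal of each channel is formed before the ratio is taken. The following sketch makes this explicit; the function names and spot values are invented for the example, and the use of a base 2 logarithm is an assumption, since the base is not stated in the text above.

```python
import math


def net_signal(signal, background, method):
    """Net signal of one channel for the chosen background correction method."""
    if method == "Background Subtraction":
        return signal - background
    if method == "Signal to Noise Ratio":
        return signal / background
    if method == "No Correction":
        return signal
    raise ValueError("unknown method: " + method)


def log_ratio(cy5, cy3, method):
    """Log ratio (Cy5 over Cy3) of the net signals; base 2 is assumed here."""
    net_cy5 = net_signal(cy5[0], cy5[1], method)
    net_cy3 = net_signal(cy3[0], cy3[1], method)
    if net_cy5 <= 0 or net_cy3 <= 0:
        # Subtraction can give a non-positive net signal, where the log is undefined.
        return float("nan")
    return math.log2(net_cy5 / net_cy3)


# Hypothetical (signal, background) pairs for one spot.
cy5_spot = (1100.0, 100.0)
cy3_spot = (600.0, 100.0)

r_subtraction = log_ratio(cy5_spot, cy3_spot, "Background Subtraction")
r_snr = log_ratio(cy5_spot, cy3_spot, "Signal to Noise Ratio")
r_none = log_ratio(cy5_spot, cy3_spot, "No Correction")
```

For this spot, background subtraction gives a net ratio of 1000/500, so the log ratio is exactly 1, while the other two methods give slightly smaller values because the background is not removed additively.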
145. on of signal quantitation for each spot of the 2nd channel.
The column containing background contamination quantitation for each spot for the 2nd channel. The title Cy5 is indicative and taken from the fact that the reference samples in 2-coloured microarray experiments are labelled with Cyanine 5 (red). In any case, this attribute should contain the background intensities for the sample dye channel. The title Mean is indicative. It depends on the quantitation algorithm of each image analysis software. However, this attribute is mandatory and should contain the main background quantitation for each spot.
The column containing the median of background quantitation for each spot of the 2nd channel.
The column containing the standard deviation of background quantitation for each spot of the 2nd channel.

After filling all the necessary fields by choosing from the lists, the user should click OK. Data importing should begin immediately.

2.5.3 Importing already processed data

If data which have been preprocessed with another analysis tool or downloaded from a public repository have to be imported to the project, the user should click File → Data Import → Processed data and the following window will appear, prompting the user to select a text tab delimited or MS Excel file containing the data to be imported.

[Figure: External Data Import Editor window with the Experiment info panel (Number of Conditions) and the Column assignment panel (File columns)]
146. or example, in a microarray experiment where microarrays contain several thousands of genes, not all of them are differentially expressed among different experimental conditions (classes) and they only add noise to the experiment. This procedure of noise removal in classification procedures is called feature selection. Feature selection can be performed in ARMADA by any of the statistical selection methods that are supported (section 4.1). For this reason, classification algorithms become available in ARMADA only after the statistical selection procedure. The rest of this section presents how the user can build and use classifiers in ARMADA. For more information on classification, feature selection and classifier evaluation techniques, the user should see [13].
4.4.1 Linear Discriminant Analysis
Discriminant Analysis (DA) may be used for two objectives: either to assess the adequacy of classification, given the group memberships of the objects under study, or to assign objects to one of a number of known groups of objects. DA may thus have a descriptive or a predictive objective. In both cases, some group assignments must be known before carrying out the DA. Such group assignments, or labelling, may be arrived at in any way. For example, in microarray classification studies one group might represent samples from healthy tissues while the other(s) from diseased tissues. For more information on DA, the user should see [10]. Apart from building
147. ormalization normalizes data on each microarray slide by local regression of log ratio against intensity, using weighted linear least squares and a 2nd degree polynomial model. The robust version of LOESS performs additional fitting iterations and assigns lower weight to outliers in the regression. The method assigns zero weight to data outside six mean absolute deviations. This model is used to calculate normalized expression values for each gene. Robust LOESS needs more time to complete than simple LOESS but produces results more robust against possible outliers. $R_{norm} = R - f_{RobustLOESS}(A)$, where $R$ is the log ratio and $A$ the intensity of a spot.
Rank Invariant normalization normalizes data on each microarray slide by selecting a number of genes which are non-differentially expressed, fitting a normalization curve through these genes and using this curve, coupled with interpolation methods, to normalize the genes present on the slide. Genes are ranked based on the signal intensities of the two channels, and the rank invariant set of genes is determined by those genes whose proportional rank difference is smaller than a given threshold. Rank Invariant normalization is useful especially when data on a microarray slide appear not to be very homogeneous (e.g. the histogram of expression is bimodal). $R_{norm} = R - f_{RankInvariant}(A)$
If this option is selected, ARMADA will continue without performing any data normalization. It is highly not recommended unl
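The selection of a rank invariant gene set described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ARMADA's implementation: the function names and the default threshold are hypothetical, and the proportional rank difference is assumed here to be the absolute rank difference divided by the number of genes.

```python
def ranks(values):
    """Rank position of each value (0 = smallest); ties keep input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0] * len(values)
    for position, index in enumerate(order):
        rank[index] = position
    return rank


def rank_invariant_set(channel1, channel2, threshold=0.05):
    """Indices of genes whose proportional rank difference is below threshold."""
    n = len(channel1)
    r1, r2 = ranks(channel1), ranks(channel2)
    return [i for i in range(n) if abs(r1[i] - r2[i]) / n < threshold]
```

The genes returned by `rank_invariant_set` would then be the points through which the normalization curve is fitted.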
148. performing normalization on the whole slide. Finally, if the user checks the Display timebar box, a timebar will be displayed presenting the progress and the remaining time of the normalization procedure. While a timebar gives an estimate of the time required for normalization, especially for large datasets, it consumes computer memory resources, causing the normalization procedure to take more time; thus the default choice is not to display a timebar. After making the necessary selections, the user should click OK to normalize the data. Depending on the normalization algorithm chosen, the normalization procedure might take some time. At this point it should be noted that the Statistics menu will not be enabled if normalization is not performed: the ARMADA workflow does not allow performing any statistical tests without having normalized the data first, in order to scale them and be able to make rational comparisons among different experimental conditions. This does not apply when the user has imported external data for processing, unless those data are not normalized.
4 Statistical Operations
After completing data preprocessing and normalization, which excludes poor quality spots and scales data within each array, a proper statistical test can reveal several genes which are statistically distinguished among different experimental configurations. The result of statistical selection is usually a far smaller set of genes compared to the initi
149. porter name, Condition Names (Control, Treated), Data info, Conditions, Normalization (Un-normalized data, Normalized data), Measurements (Raw ratio-intensity pairs, Log ratio-intensity pairs, Raw ratio only, Log ratio only), Cancel.
The user should see Appendix A for how this file should be structured. Briefly, each column (or pair of ratio-intensity columns) should correspond to measurements of a single array. The file should contain only one column with unique gene identifiers. As with the Data Import wizard, the user should properly fill in the number of experimental conditions of the dataset and proper condition names (the user should see 2.5.1 for information on proper condition names). After setting the above parameters, the user should provide some information on the contents of the file. The following table explains the types of processed data that can be imported to ARMADA.
Type: Description
Un-normalized data: Such data should be structured in ratio-intensity pairs. Ratio is the ratio between channel 2 (treated) and channel 1 (reference) data, while intensity is an estimate of spot intensity based on the signals of the two channels (the user should see Appendix A). If the provided ratios are not log-transformed, the user should choose Raw rati
150. preadsheet software such as MS Excel. Below there is an example of processed data containing only normalized ratio values.
Reporter name   MBA MEXP 3758 Normalized   MBA MEXP 3750 Normalized   MBA MEXP 3761 Normalized   MBA MEXP 3762 Normalized   MBA MEXP 3764 Normalized   MBA MEXP 3763 Normalized
A2bp1    NaN          NaN          NaN          NaN          NaN          NaN
22m      NaN          NaN          NaN          NaN          NaN          NaN
Aabp3    1.3828324    1.02274      1.313355     0.14139101   0.21289581   0.15157567
Aadac    NaN          NaN          NaN          NaN          NaN          NaN
Aanat    1.068118     1.0280066    0.90227524   1.4375744    1.0218852    0.87755398
Aatk     0.7039054    0.5187531    0.54615927   0.60616165   0.6087247    0.76790816
Abcat    1.1137171    0.89766896   1.0167725    1.1950728    1.121247     0.96731037
Abca2    4.0917857    0.9450024    0.93968105   0.9698576    0.9271849    0.9192679
Abca4    0.8601027    0.74625766   0.8129547    0.8905621    0.910623     0.6339242
Abca     1.1535889    1.1185063    1.0396552    0.9910633    1.0259838    0.8701742
Abcaa8   NaN          NaN          NaN          NaN          NaN          NaN
Abcb10   0.9933904    0.6052491    0.78872824   0.87024313   0.9813563    1.0724393
Abch11   0.9040716    1.3788409    0.9544007    0.98535085   0.9789672    0.8867802
Abch1a   0.6597014    0.51996917   0.70184016   0.74078214   0.5194845    1.1298094
Abcb1b   0.99493974   1.031255     1.0593528    0.93215114   0.9361901    0.6081227
Abch2    1.0212958    0.6207916    0.9986446    0.8787848    0.9318093    0.8729674
Abchs    0.25131285   0.28731525   0.23878798   0.7716271    0.5258987    1.0946274
Abch4    NaN          NaN          NaN          NaN          NaN          NaN
Abch6    0.9235208    1.1582047    0.933917     1.0521237    1.0382375    0.88397783
Abch     0.9666409    0.73462976   0.8835252    0.72145424   0.70432127   0.6726146
Abcbhg 0 2396
151. r right-clicks inside the report area, a context menu will appear allowing the user to export the report to a text tab-delimited file or clear the report window.
4.4.1.2 Linear Discriminant Analysis: Classifying
In order to perform DA classification, the user should click Statistics → Classification → Discriminant Analysis → Classify and the following preferences window will appear:
[Window: DA Classify Editor, with a sample file field and the DA classification options panel (Discriminant function: Linear; Priors: Uniform)]
The user should select the file that contains the new samples to be classified, using as training data the data imported to ARMADA. The file can be a text tab-delimited or Excel file, which should be structured as follows: the first column should contain variable names (e.g. gene names) that serve as unique variable identifiers. The first row should contain sample names that will be used to identify the new samples when they are assigned to classes. The rest of the data should be numeric. Attention should be paid so that the number of variables (features) is the same as the number of features used to build the classifier model. For an example of a file of new samples to be classified, the user should consult Appendix A. The rest of the options in the DA Classify preferences window, under the DA classification options panel (Discriminant function, Priors), are the same as in the case of tuning the DA classifier and their description can be found
152. r the completion of the procedure.
This option defines the method that will be used to fit a curve using the rank invariant set of genes determined by the rank invariant selection algorithm. The available methods are Lowess, Running Mean and Running Median. For a description of Lowess, the user should see also the LOWESS/LOESS normalization descriptions.
Span: The span value modifies the running window size (proportion of neighbouring points to the currently processed point) for the smoothing function. If the span value is less than 1, the window size is taken to be a fraction of the number of points in the data. If the span value is greater than 1, the running window contains as many data points as the value given.
For more details concerning Rank Invariant normalization, the user should see [4]. Concerning the Dye Swap options panel in the normalization window, the list Channel - Dye correspondence is used to determine which channel corresponds to which dye; for example, if the reference samples were labeled with Cy3 prior to hybridization, it should be assigned to Channel 1 and Cy5 to Channel 2. Typically, reference samples are labeled with Cy3 and treated samples with Cy5, which is also the default setting in ARMADA. If for any reason Cy3 corresponds to Channel 2, this should be declared by choosing Cy3 is channel 2. Alternatively, if there is a dye swap experimental design, the user can press the Launch Dye Swap editor button
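The way the Span value translates into a running window, and how a Running Median smoother might use it, can be sketched as follows. This is a hedged illustration rather than ARMADA's code: the function names are hypothetical, the input is assumed to be already ordered by intensity, the window is centred on each point and simply truncated at the edges, and a span of exactly 1 (which the text does not cover) is treated as a count of one point.

```python
import statistics


def window_size(span, n_points):
    """Number of points in the running window for a given span."""
    if span <= 0:
        raise ValueError("span must be positive")
    if span < 1:
        size = int(round(span * n_points))  # span is a fraction of the data
    else:
        size = int(span)                    # span is a number of points
    return max(1, min(size, n_points))


def running_median(values, span):
    """Median of a centred window around each point (truncated at the edges)."""
    half = window_size(span, len(values)) // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(statistics.median(window))
    return smoothed
```

With 20 points, a span of 0.5 gives a window of 10 points, while a span of 7 gives a window of 7 points.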
153. re presented according to selected validation methods.
[Figure: misclassification error per parameter set (Set 1 to Set 9) for the linear, polynomial, mlp and rbf kernels, in three panels: 5-fold cross-validation evaluation, leave-1-out validation evaluation, and training and test split (60% training and 40% test) evaluation]
If the user selects to display a classifier evaluation report by clicking Display output results, a window like the following will appear, presenting the classifier evaluation results.
Footnote: A confusion matrix is a visualization tool typically used in supervised learning. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. One benefit of a confusion matrix is that it is easy to see if the classifier is confusing two classes (i.e. commonly mislabelling one as another).
SVM Classifier Tuning Results: variables 531, Number of classes, Par
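The confusion matrix layout described in the footnote above (rows for actual classes, columns for predicted classes) and the misclassification error derived from it can be sketched in Python. The function names and the class labels in the usage note are hypothetical and serve only as an illustration.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def misclassification_error(matrix):
    """Fraction of samples that fall off the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total
```

For five samples with actual classes healthy, healthy, diseased, diseased, diseased and predicted classes healthy, diseased, diseased, diseased, healthy, the matrix is [[1, 1], [1, 2]] and the misclassification error is 0.4.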
154. responding slide. Normalized array images in the main window can also be created by right-clicking on the selected array in the Arrays list and selecting Normalized Image, or by clicking View → Normalized Image.
2.6.8 Array Raw Table
By clicking on the Array Raw Table button, ARMADA displays a spreadsheet-like view which contains basic data derived directly from the input file(s) for the selected array from the Arrays list. The same data can be displayed by right-clicking on a selected array from the Arrays list and then clicking on Data, or by clicking View → Raw Data.
[Screenshot: Array Raw Table listing, for each spot, the Gene ID and the Channel 1 and Channel 2 foreground (F) and background (B) values]
155. rom all replicates for each condition. Below there is an example of an expression profile plot for 11 genes selected from the list of differentially expressed genes after applying a statistical test. The plot was produced with the option DE genes in the Plot options panel chosen, the Different color for each gene and Display legend (gene names) boxes checked, and the All replicates option from the Plot values panel chosen.
Footnote: Cluster centroids are usually calculated by averaging the expression of all the genes belonging to a cluster, thus defining a meta-gene which reflects the expression pattern of the entire cluster.
[Figure: Expression Profile, DE Genes (11 genes); x-axis: Conditions; y-axis: Expression; legend: gene names]
To illustrate another example, the following figure presents an expression profile plot for 10 gene clusters presented in a graph with multiple plots. The plot was produced with the option Gene clusters in the Plot options panel chosen, the Plot centroids in selections and Multiple cluster plot boxes checked and the Condition m
156. rom public databases such as ArrayExpress (www.ebi.ac.uk/arrayexpress) or Gene Expression Omnibus, GEO (http://www.ncbi.nlm.nih.gov/geo) are to be imported. The user can also use this option when the dataset images have been processed with software not supported by ARMADA, after certain manual manipulation first (the user should see Appendix A for further information on file formats). Note that if the dataset files have been produced with ImaGene software, the user can check the Mark empty spot as poor (ImaGene) option. If this option is activated, then spots that are flagged by the user or the software as empty will be treated by ARMADA as poor quality spots and excluded from analyses. This option is included in case the user wishes to include empty spots in the analysis, for example as an estimation of noise in the images of the experiment. As a second step, the user is prompted to enter the number of different conditions of the experiment in the field Number of Experimental Conditions (e.g. if the experiment includes the experimental factors Control, Treatment 1, Treatment 2, the user should enter 3). Next, in the field Condition Names, the user should fill in the names of the experimental conditions (e.g. Control, Treatment 1, Treatment 2). It should be noted that many special characters are not allowed and the names should not start with numbers e g the condition names Control 1 T or 1Control will These characters are the
157. roper flags as defined above. If not, do not provide this attribute and the internal filters of ARMADA will be used to mark poor quality spots.
Cy3 Signal Mean (optional: No): The column containing signal quantitation for each spot for the 1st channel. The title Cy3 is indicative and is taken from the fact that the reference samples in 2-coloured microarray experiments are labelled with Cyanine 3 (green). In any case, this attribute should contain the foreground intensities for the reference dye channel. The title Mean is also indicative; it depends on the quantitation algorithm of each image analysis software. However, this attribute is mandatory and should contain the main signal quantitation for each spot.
Cy3 Signal Median (optional: Yes): The column containing the median of signal quantitation for each spot of the 1st channel.
Cy3 Signal Standard Deviation (optional: Yes): The column containing the standard deviation of signal quantitation for each spot of the 1st channel.
Cy3 Background Mean (optional: No): The column containing background contamination quantitation for each spot for the 1st channel. The title Cy3 is indicative and is taken from the fact that the reference samples in 2-coloured microarray experiments are labelled with Cyanine 3 (green). In any case, this attribute should contain the background intensities for the reference dye channel. The title Mean is also indicative; it depends on the quantitation algorithm of each image analysis software. However this at
158. [Screenshots: an array report showing a QuantArray file header (user, computer, date, experiment, protocol information) and an Analysis Report for Analysis 3 (General Information: Number of Conditions 2; Condition Names WT, D15; Number of Arrays 8)]
These reports display very briefly certain information on the array files imported (if header information is available when importing directly from the supported image analysis software, this information is displayed too) or on the analysis steps followed and the results obtained. By right-clicking inside the report window, the user can export the displayed information to a text file.
2.6.13 Deleting analysis objects
In order to delete an analysis object, the user should select the analysis to be deleted from the Analysis Object list, right-click on the selected analysis and click Delete. The selected analysis will be deleted and the number of
159. rrays selected from the Arrays list on the main window. Different titles should be separated by a new line (Enter). The field should remain empty for automatic title generation. After specifying the desired parameters, the user should click OK. Below there are some examples of 2- and 3-dimensional array images created with different colormaps.
[Figures: 2D Image of Channel 1 Foreground Mean for array Wt_1r.txt (Colormap: Green); 2D Image of Channel 1 Background Mean for array d15_1r.txt (Colormap: Jet); 3D Image of Channel 2 Foreground Mean for array d23_1r.txt (Colormap: Red); 3D Image of Channel 1 Foreground/Background Mean for array d23_1r.txt (Colormap: Cool)]
If the user clicks on any of the array images created, individual spot data are displayed as in Raw Image in ARMADA's main window. The user should also note that array spatial images are available only if grid coordinates and meta coordinates are provided with the input files, and that if meta coordinates exist, the array images are available right after importing the data files to ARMADA.
5.2 Normalized and Un-normalized images
An array normalized or un-normalized image depicts an array spatial image reconstructed using the normalized or un-normalized log ratio between the two channels. S
160. [Window: clustering preferences with the panels Clustering values, Clustering variables and Options (Number of centroids k, Repeat clustering, Distance: Squared Euclidean, Maximum iterations: 100, Seeds: Sample, p-value cutoff)]
The following table explains the available options in the two upper panels, Clustering values and Clustering variables.
Option: Description
Clustering values, Condition means: If selected, gene expression values to be clustered are the mean expression value among replicates for each condition of the selected Analysis.
Clustering values, All replicates: If selected, gene expression values to be clustered are all the values from all array replicates from each condition of the selected Analysis.
Clustering variables, Genes (rows): If selected, clustering will be performed for genes, revealing clusters of genes with similar expression.
Clustering variables, Replicates/Conditions (columns): If selected, clustering will be performed for conditions or replicates (depending on the choice in the Clustering values panel), revealing clusters of conditions.
The following table explains the available options in the bottom panel, Options.
Option: Description
Number of centroids k: The number of clusters that the genes should be grouped into.
Distance: The distance metric to be used for data clustering; for further information the user should see Appendix D on distances
Seeds: Method used to
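How the options above (number of centroids, Squared Euclidean distance, maximum iterations, seeds) interact can be sketched with a minimal k-means loop in Python. This is a generic textbook version, not ARMADA's implementation: the function names are hypothetical, the seeds are passed in explicitly, and an empty cluster simply keeps its previous centroid.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, seeds, max_iterations=100):
    """Assign each point to its nearest centroid until labels stop changing."""
    centroids = [list(seed) for seed in seeds]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(len(centroids)),
                key=lambda c: squared_euclidean(point, centroids[c]))
            for point in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids
```

The number of centroids k is the number of seeds supplied; repeating the clustering with different seeds and keeping the best run is a common way to reduce the dependence on the starting points.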
161. s (Oxford, England), 19, 973-980.
[13] Guyon, I. and Elisseeff, A. (2003) An introduction to variable and feature selection. Journal of Machine Learning Research, 3, 1157-1182.
[14] Vapnik, V.N. (1995) Statistical Learning Theory. Wiley.
[15] Furey, T.S., Cristianini, N., Duffy, N., Bednarski, D.W., Schummer, M. and Haussler, D. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics (Oxford, England), 16, 906-914.
[16] Tukey, J.W. (1977) Exploratory data analysis. Addison-Wesley, Reading, MA.
[17] Raychaudhuri, S., Stuart, J.M. and Altman, R.B. (2000) Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput, 455-466.
[18] Tibshirani, R., Walther, G. and Hastie, T. (2001) Estimating the number of clusters in a data set via the gap statistic. J R Statist Soc, 63, 411-423.
[19] Efron, B. and Tibshirani, R. (1993) An introduction to the bootstrap. Chapman & Hall/CRC.
Appendix A: Input file formats
This appendix describes some of the input file formats for ARMADA and provides some links for further user information.
A.1 Raw data: Image analysis software output and tab-delimited files
A.1.1 QuantArray file format
QuantArray files contain two main sections: the file header section and the file data section. Below there is one example from each section.
File header section: User Name Administrator Computer ARRA
162. s for clusters formed with the k-means algorithm.
4.3.3 Fuzzy C-means clustering
Hard clustering methods such as k-means or hierarchical clustering assign each gene to a single cluster and do not provide information about the influence of a given gene on the overall shape of clusters. Fuzzy partitioning methods, such as fuzzy c-means clustering, can solve this problem by attributing cluster membership values to genes. The cluster where the membership of a gene is highest is probably the cluster to which it belongs, but it is also possible to check its possible membership to other clusters by looking at the other membership values of that gene. For more information on fuzzy c-means clustering, as well as the algorithm implemented in ARMADA, the user should see [12]. In order to perform fuzzy c-means clustering, the user should select an Analysis object from the Analysis Objects list and click Statistics → Clustering → Fuzzy C-Means. The following preferences window will appear:
[Window: Fuzzy C-Means Clustering Editor, with fields including Clustering values (Condition mean, All replicates), Clustering variables (Genes (rows), Replicates/Conditions (columns)), Number of clusters (10), Fuzzy parameter (2), Tolerance, Convergence tolerance (0.00001), Maximum iterations, p-value cutoff and Optimize fuzzy parameter] T
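The idea of cluster membership values can be made concrete with the standard fuzzy c-means membership formula, $u_j = d_j^{-2/(m-1)} / \sum_k d_k^{-2/(m-1)}$, where $d_j$ is the distance of a gene to centroid $j$ and $m$ is the fuzzy parameter. The Python sketch below uses that textbook formula; whether the algorithm implemented in ARMADA matches it exactly is an assumption, and the function name is hypothetical.

```python
def memberships(distances, fuzziness=2.0):
    """Membership of one gene in each cluster, given its centroid distances.

    Uses the textbook fuzzy c-means formula; fuzziness must be > 1.
    """
    zero = [d == 0 for d in distances]
    if any(zero):
        # The gene coincides with one or more centroids: split full
        # membership among them (a convention chosen for this sketch).
        return [1.0 / sum(zero) if z else 0.0 for z in zero]
    exponent = 2.0 / (fuzziness - 1.0)
    inverse = [d ** -exponent for d in distances]
    total = sum(inverse)
    return [value / total for value in inverse]
```

A gene at distance 1 from one centroid and 2 from another, with fuzzy parameter 2, gets memberships 0.8 and 0.2: it most probably belongs to the first cluster, but its partial membership in the second remains visible.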
163. s in ARMADA. The case of tab-delimited text files that contain the image quantitation types is explained in the next paragraph.
2.5.2 Importing data directly from image analysis software: text tab-delimited files
In the case of selecting to import image quantitation data from text tab-delimited files (e.g. downloaded from a public repository), after clicking OK in the Data Import wizard window the following window will appear:
[Window: "Please select the columns corresponding to the required fields", with drop-down lists grouped as Array spatial information (Gene Numbers, Array Blocks, Meta Rows, Meta Columns), Feature coordinates (Rows, Columns), General spot information (Gene Names, Spot Flags) and Spot quantitative information (Signal Mean, Signal Median, Signal Standard Deviation, Background Mean, Background Median and Background Standard Deviation for each of Cy3 and Cy5)]
Each list contains all the column headers of the first file out of the
164. sifier and supply N.
The user should check this box to perform Leave-M-out validation of the classifier and supply M.
The user should check this box to perform Training and Test validation of the classifier and supply the percentage of the dataset that should be held out for testing.
Displays classifier evaluation plots based on the tuning options and parameters (plots are based on the discriminant function types, class prior probabilities and validation methods).
A report containing classification evaluation statistics and confusion tables is displayed in a separate window.
Several messages are displayed during the classifier tuning procedure in the command line (or in MATLAB's command window if MATLAB is present).
After setting the parameters, the user should click Tune and classifier tuning will be performed. Depending on the DA tuning results, the user can select the appropriate parameters to build a model that best fits the dataset under examination. If the user selects to display classifier evaluation plots by checking the box Display evaluation plots, the following example depicts how these plots are presented.
Footnote: A confusion matrix is a visualization tool typically used in supervised learning. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. One benefit of a confusion matrix is that it is easy to see if the classifier is confusing two classes
165. sis steps in an automated way, and the Annotator module, which allows the easy annotation of the gene and cluster lists created by ARMADA.
7.1 The Principal Component Analysis tool
Principal Component Analysis (PCA) is a statistical pattern analysis technique for determining the key variables in a multidimensional data set that explain the differences in the observations, and is very useful for analysis, simplification and visualization of multidimensional data sets. Given m observations (samples or arrays) on n variables (genes), which form an m × n data matrix, the goal of PCA is the reduction of the data matrix dimensionality by finding r new variables, where r is less than n. These r new variables are termed principal components, and together they account for as much of the variance in the original n variables as possible while remaining mutually uncorrelated. For more information on PCA for gene expression datasets derived from microarray experiments, the user should consult [17]. The user can conduct PCA in ARMADA by selecting an Analysis from the Analysis Object list and clicking on Tools → Principal Component Analysis. The following window will appear:
[Window: PCA Editor, with a Gene names list and the Data for PCA options Select genes and All genes]
From there the user is able to choose
166. spot filtering based on background noise and (ii) spot filtering based on measurement reproducibility among replicates (optional). In the first step, spots marked as poor (manually or by the image analysis software) are excluded for every replicate, and noise-sensitive genes are further isolated for each slide of each condition based on one of the available filters applied to both channels. At this point it should be noted that if data are imported directly from the supported image analysis programs, ARMADA automatically recognizes each program's spot flags and treats the flagged spots appropriately. On the other hand, if data are imported from other sources (tab-delimited files, public repositories) and they contain spot flags, the user is responsible for transforming the flags column into ARMADA-recognizable flags (the user should see Appendix A for file formats) or for letting ARMADA decide on each spot's quality by one of the supported filters. The available filters (No filtering; Signal to Noise threshold; Signal-Noise distribution distance, $m(S) - x \cdot \mathrm{std}(S) < m(B) + y \cdot \mathrm{std}(B)$; Custom filter) are described in the table below.
Filter: Description
No filtering: Data are filtered based only on the automatic flagging of image analysis programs or on the flags provided during importing, if they are valid flags. No other filters are applied. Th
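The Signal-Noise distribution distance filter named above can be sketched in Python as a simple per-spot test. The operators in the source expression were lost in extraction, so the form used here, m(S) - x·std(S) < m(B) + y·std(B), is an assumption, as are the function name and the default values of x and y.

```python
def is_noise_sensitive(signal_mean, signal_sd, background_mean, background_sd,
                       x=2.0, y=2.0):
    """True when the signal distribution is too close to the background one.

    Assumed form: m(S) - x*std(S) < m(B) + y*std(B).
    """
    lower_signal = signal_mean - x * signal_sd
    upper_background = background_mean + y * background_sd
    return lower_signal < upper_background
```

Under this reading, a spot with signal 1000 ± 100 over a background of 200 ± 50 passes the filter (800 is not below 300), while a spot with signal 400 ± 100 over the same background is marked as noise-sensitive (200 is below 300).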
167. t on analysis steps and results in a new window.
2.6.6 Raw Image
By hitting the Raw Image button in the main window, if array block coordinates are provided, an image is reconstructed based on these spatial data by overlaying raw signal data from both channels for the selected array from the Arrays list. If coordinates are not provided, ARMADA creates an image with grid size equal to the number of genes on the arrays multiplied by the number of slides in the project. In this way, each row of the reconstructed image presents a gene and each column the corresponding slide. Array images in the main window can also be created by right-clicking on the selected array in the Arrays list and selecting Image, or by clicking View → Raw Image.
[Screenshot: raw image with a spot tooltip showing the Channel 1 value, the Channel 2 value and the spot ID]
The user can click on raw images and view individual spot data. Zooming and scrolling are also possible.
2.6.7 Normalized Image
By hitting the Normalized Image button in the main window, if array block coordinates are provided, an image is reconstructed based on these spatial data using normalized log ratio data for the selected microarray from the Arrays list. If coordinates are not provided, ARMADA creates an image with grid size equal to the number of genes on the arrays multiplied by the number of slides in the project. In this way, each row of the reconstructed image presents a gene and each column the cor
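The fallback layout described above, used when no array coordinates are available (one image row per gene, one column per slide), can be sketched in Python. The data structure and function name are hypothetical; the sketch only shows how per-slide value lists would be arranged into the genes × slides grid.

```python
def fallback_image_grid(values_per_slide):
    """Arrange per-slide value lists into a grid: rows = genes, columns = slides.

    values_per_slide: list of lists, one inner list of per-gene values per slide.
    """
    n_genes = len(values_per_slide[0])
    if any(len(slide) != n_genes for slide in values_per_slide):
        raise ValueError("all slides must contain the same number of genes")
    # Transpose so that each row holds one gene across all slides.
    return [list(row) for row in zip(*values_per_slide)]
```

For 3 genes measured on 2 slides, the resulting grid has 3 rows and 2 columns, i.e. 6 cells, matching the number of genes multiplied by the number of slides.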
168. the following analysis in the list will be decreased in order to keep a continuous analysis object numbering. For example, if there are 5 analysis objects named Analysis 1, Analysis 2, ..., Analysis 5 and the user deletes Analysis 4, Analysis 5 will be renamed to Analysis 4.
3 Preprocessing Data
The first steps in analyzing data derived from microarray experiments consist of several preprocessing steps, which include quality control, data filtering and normalization, to assure the quality of the data that will be used to extract results and to compensate for systematic measurement errors among different arrays. The next sections describe the filtering and normalization methods implemented in ARMADA and explain the program interfaces. The user should also note that data preprocessing will not be available if the data import step is not completed properly.
3.1 Selecting subsets of experimental conditions
In order to create an analysis object, the user should first choose a subset of experimental conditions and replicates from the total set of the imported arrays. If this step is skipped for the first analysis, ARMADA will apply the chosen preprocessing procedures to the whole dataset. The application of the same preprocessing steps on the whole dataset is recommended when the user wishes to preprocess the data in the same manner (e.g. with the same normalization method) and then be able to choose different sets of
169. the output figure consists of two panels: the upper panel contains the MA plot for un-normalized data, while the bottom panel contains the MA plot for normalized data. If the user right-clicks inside one of the two panels, the menus that appear are the same as the ones in the cases of MA plots before normalization (5.3.1) and MA plots after normalization (5.3.2). Below there is an example of MA plots before and after normalization for a specific array.
[Figure: un-normalized MA plot for array d7_3r.txt (legend: MA data, normalization points, fold change cutoff; x-axis: Intensity A) above the normalized MA plot for the same array (Normalization: Loess with span 0.1; legend includes down-regulated genes and fold change cutoff; x-axis: Intensity A)]
5.4.4 Subgrid MA plots
If subgrid normalization is performed in the presence of array meta coordinates (the user should see 3.4), MA plots for each subgrid block are also possible by checking Display subgrid plots in the MA plots preferences window. However, the functionalities of simple MA plots, such as data selection and exporting, are not available in the case of subgrid MA plots. Subgrid MA plots consist of an image with as many blocks as the number of blocks in each slide. Below there are
170. ting all the above parameters, the user should click Import for the data to be imported to ARMADA. 2.6 Exploring data: main window. 2.6.1 ARMADA's main window. The image below depicts the main window of ARMADA. It consists of the history textbox, the project tree view, the array list, the analysis objects list, the image or data area and some shortcut buttons for the main functionalities of the program. The menu bar provides access to all the platform's functions and abilities. This and the following sections describe the functionalities that can be accessed directly from the main window and present the first steps of data exploration with ARMADA. Special attention should be paid during column assignment. It is of great importance for the relevance of the subsequent analysis. [Figure: the main window, titled ANDROMEDA v1.0, with the menus File, Preprocessing, Statistics, Plots, Tools and Help, the Project Explorer, the Arrays list (e.g. d7_3r.txt, d23_3r.txt, TNF_2r.txt, TNF_3r.txt) and the Analysis Objects list (Analysis 1 to Analysis 4) with the preprocessing details of the selected analysis.]
171. tion. It is recommended to perform missing value imputation after between slide normalization, as data are standardized. If this box is checked, a boxplot (the user should also see section 5.5) will be displayed, depicting the distributions of the data before and after between slide normalization. If k-nearest neighbor (KNN) is selected in the Missing value imputation options list, the following window will appear, prompting the user to set some parameters for the imputation. [Window: KNN Impute Options, with the fields Distance (Euclidean) and Nearest Neighbors and the checkbox Use median instead of weighted mean.] The following table explains each option and parameter.
Option: Description
Distance: Determines the distance metric that will be used to calculate the distances among gene vectors, which are determined by the gene expression values for all experimental conditions in the selected Analysis. For more information on distance metrics the user should see Appendix D.
Nearest Neighbors: The number of nearest neighbors that will be used to impute the missing values.
Use median instead of weighted mean: If checked, missing values will be imputed based on the median of the nearest neighbors' expression values instead of the weighted mean.
If Quantile normalization is selected in the Between slide scaling options list, the following window will appear, prompting the user to set some parameters for scaling. [Window: Quantile Options.]
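The KNN imputation options described above (distance metric, number of nearest neighbors, median versus weighted mean) can be illustrated with a minimal Python sketch. This is not ARMADA's MATLAB implementation: the function name `knn_impute`, the use of Euclidean distance over the columns observed in both genes, and the inverse-distance weighting are all assumptions made for illustration.

```python
import math
from statistics import median


def knn_impute(rows, k=2, use_median=False):
    """Fill None entries of each gene vector from its k nearest gene vectors.

    Hypothetical helper: distances are Euclidean over the columns observed
    in both vectors, and the weighted mean uses inverse-distance weights
    (an assumed weighting scheme).
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue  # a neighbour must have a value to donate
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if not nearest:
                continue  # nothing to impute from; leave the gap
            if use_median:
                imputed[i][j] = median(v for _, v in nearest)
            else:
                weights = [1.0 / (d + 1e-9) for d, _ in nearest]
                total = sum(w * v for w, (_, v) in zip(weights, nearest))
                imputed[i][j] = total / sum(weights)
    return imputed


# Rows are genes, columns are experimental conditions (illustrative values).
genes = [[1.0, 2.0, None], [1.0, 2.0, 3.0], [1.0, 3.0, 5.0], [9.0, 9.0, 9.0]]
print(knn_impute(genes, k=2))
print(knn_impute(genes, k=2, use_median=True))
```

With the weighted mean, a neighbour at distance zero dominates the imputed value, whereas the median treats the selected neighbours equally.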
172. [Screenshot: an imported array data file displayed as a table. The visible column headers, truncated in the image, include Y Location, ch1 Intensity, ch1 Backgro, ch1 Diameter, ch1 Area, ch1 Footprint, ch1 Circular, ch1 Spot Uni, ch1 Bkg Uni, ch1 Signal N and ch1 Cont. The visible rows list the control spots CNTRL13L01, CNTRL13H1, CNTRL13H01, CNTRL13D12, CNTRL13D01, CNTRL12P13, CNTRL12P01, CNTRL12L13, CNTRL12L01 and CNTRL12H12 with their per-spot measurements.]
173. tion of the classifier and supply the percentage of the dataset that should be held out for testing. Displays classifier evaluation plots based on the tuning options and parameters; plots are based on the number of nearest neighbors, distances, tie break rules and validation methods. A report containing classification evaluation statistics and confusion tables is displayed in a separate window. Several messages are displayed during the classifier tuning procedure in the command line, or in MATLAB's command window if MATLAB is present. After setting the parameters, the user should click Tune and classifier tuning will be performed. Depending on the kNN tuning results, the user can select the appropriate parameters to build a model that best fits the dataset under examination. If the user selects to display classifier evaluation plots by checking the box Display evaluation plots, the following examples depict how these plots are presented according to the selected validation method: 1) n-fold cross validation, 2) leave-m-out and 3) training and test. A confusion matrix is a visualization tool typically used in supervised learning. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. One benefit of a confusion matrix is that it is easy to see if the classifier is confusing two classes (i.e. commonly mislabelling one as another).
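The confusion matrix layout described above (rows for actual classes, columns for predicted classes) can be sketched in a few lines of Python. The function name `confusion_matrix` and the class labels are illustrative assumptions, not part of ARMADA.

```python
def confusion_matrix(actual, predicted, labels):
    """Count instances per (actual class, predicted class) pair.

    Rows follow the actual class and columns the predicted class,
    matching the convention described in the text.
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Hypothetical class labels and predictions for five held-out samples.
labels = ["WT", "TNF"]
actual = ["WT", "WT", "WT", "TNF", "TNF"]
predicted = ["WT", "TNF", "WT", "TNF", "TNF"]

matrix = confusion_matrix(actual, predicted, labels)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(actual)
print(matrix, accuracy)
```

Off-diagonal counts show which classes the classifier confuses: here one WT sample is mislabelled as TNF.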
174. tions or create new routines according to specific needs. It should be noted that ARMADA can be used in command line mode only if MATLAB is present on the computer where ARMADA is installed. Otherwise, the program is distributed with the MATLAB Component Runtime (MCR) and MATLAB is not required on the installation machine. In order to use ARMADA under MATLAB or in command line mode, and thus be able to export results to MATLAB's workspace for further processing with built-in algorithms, MATLAB 7.3 (R2006a) or higher should be installed on the target computer. In this case the platform consists only of MATLAB routines and not compiled files, and is platform independent. After downloading the routines, the user should place them in a folder of preference, maintaining the structure in the compressed file, and then add this folder, including its subfolders, to the MATLAB path. If MATLAB is not present on the target computer, MATLAB Component Runtime (MCR) 7.6 is required. The MCR is included in the program installer, which can be downloaded from URL here. If MCR 7.6 was previously installed for other reasons, it does not have to be installed again. Note that in this case the program will work ONLY with MCR 7.6 and NOT with older or newer versions of the MCR. One of the main advantages of ARMADA is that it enforces an analysis workflow by not allowing the user to perform arbitrary actions at any time, a feature that will prove valuable especially for users mostly w
175. to perform PCA on the genes selected from the list on the left, to perform PCA on all genes using their normalized gene expression values after the trust factor filtering step (the user should see 4.1), or to perform PCA on the differentially expressed genes selected after the application of a statistical test. After making the necessary selections, the user should click OK and the following figure will appear. (Footnote 14: The PCA module uses a slightly altered version of the mapcaplot function of the Bioinformatics Toolbox of MATLAB.) [Figure: Principal Component Visualization Tool, with scatter plots of Component 1 vs Component 2 and Component 1 vs Component 3 and a Selected Data list of gene identifiers.] This figure contains two main panels: the upper panel presents the projections of the data matrix on the 2-dimensional plane defined by the first two principal components, which account for the largest and the 2nd largest percentage of the variance observed in the data matrix, respectively (the data matrix has been defined using the PCA preferences window above). Each po
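The statement that each principal component accounts for a percentage of the observed variance can be made concrete with a small Python sketch. It is deliberately restricted to two variables, so that the eigenvalues of the covariance matrix have a closed form. The function name `explained_variance_2d` is hypothetical and the code is unrelated to the mapcaplot routine mentioned above.

```python
import math


def explained_variance_2d(x, y):
    """Percentage of variance carried by each principal component of (x, y).

    The eigenvalues of the 2x2 sample covariance matrix are computed in
    closed form; each eigenvalue divided by their sum is the share of the
    total variance explained by the corresponding component.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    first, second = half_trace + root, half_trace - root
    total = first + second
    return 100 * first / total, 100 * second / total


# Two perfectly correlated measurements: one component carries everything.
pc1, pc2 = explained_variance_2d([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(round(pc1, 3), round(pc2, 3))
```

For real expression data with many conditions, the same ratio is computed from all eigenvalues of the covariance matrix, which requires a numerical eigenvalue routine rather than this closed form.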
176. to the conditions of the analysis highlighted in the Analysis Object list on ARMADA's main window. In the Plot options panel, when the Single array choice is selected, the Normalized arrays list and the bottom drop-down list are deactivated; they are reactivated when the Array vs array choice is selected. When the Single array choice is selected, the user can choose which measurements to plot in a 2-D plot for each array by selecting from the drop-down lists Data to plot. Additionally, the letters H and V, which appear next to the lists when Single array is selected, represent the Horizontal and Vertical axis respectively. The user can also provide titles for the plots or leave the corresponding fields empty for automatic title generation. Concerning the measurements which are available for plotting, the user should see section 5.1. The Clear buttons below the array lists clear the selections, allowing the user to make new ones. The General options panel allows the user to display the Pearson correlation coefficient between the two measurements selected in the 2-D plot, as well as to plot in log2 scale, by checking the Display correlation or Plot in log scale boxes respectively. The user can also check the Display cutoff lines box. If selected, the resulting plots will also depict the line which passes through the origin of the axis system (y = x) and two lines parallel to it at a distance chosen by the Cutoff level number n (y = x + n and y = x - n respectively). After making the nec
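The two General options discussed above, the Pearson correlation coefficient and the cutoff lines, can be sketched in Python as follows. The sketch assumes the cutoff lines are y = x + n and y = x - n. The function names `pearson` and `outside_cutoff` are hypothetical and the values are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two measurement vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def outside_cutoff(x, y, n):
    """Indices of points lying above y = x + n or below y = x - n."""
    return [i for i, (a, b) in enumerate(zip(x, y)) if abs(b - a) > n]


# Illustrative log2 measurements of two arrays (array vs array plot).
array_a = [1.0, 2.0, 3.0, 4.0]
array_b = [1.1, 1.9, 5.0, 4.2]

print(pearson(array_a, array_b))
print(outside_cutoff(array_a, array_b, n=1.0))
```

Points returned by `outside_cutoff` are those that differ between the two measurements by more than the chosen cutoff level.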
177. tribute is mandatory and should contain the main background quantitation for each spot.
Cy3 Background Median: The column containing the median of background quantitation for each spot of the 1st channel. Optional: Yes.
It should be noted that if the Median attributes are not given, ARMADA's filtering methods will not be available to their full extent.
It should be noted that if the Standard Deviation attributes are not given, ARMADA's filtering methods will not be available to their full extent.
Cy3 Background Standard Deviation: The column containing the standard deviation of background quantitation for each spot of the 1st channel.
Cy5 Signal Mean: The column containing signal quantitation for each spot for the 2nd channel. The title Cy5 is indicative and taken from the fact that the reference samples in 2-coloured microarray experiments are labelled with Cyanine 5 (red). In any case, this attribute should contain the foreground intensities for the sample dye channel. The title Mean is indicative. It depends on the quantitation algorithm of each image analysis software. However, this attribute is mandatory and should contain the main signal quantitation for each spot.
Cy5 Signal Median: The column containing the median of signal quantitation for each spot of the 2nd channel.
Further attributes listed in the table: Cy5 Background Mean, Cy5 Background Median, Cy5 Background Standard Deviation.
Cy5 Signal Standard Deviation: The column containing the standard deviati
178. tting the preprocessing steps, the user can define several Analysis objects through the Select Conditions button (section 3.1) and then, for each object, define the desired statistical selection workflow and, optionally, the clustering to be performed. The Statistical Selection button will open a preferences window similar to the one of section 4.1, with some differences (figure below). [Figure: Statistical Selection Batch Editor, with the panels Analysis, Before testing (Between slide scaling, Missing value imputation), Statistical testing (Statistical test, Multiple test correction, Trust Factor cutoff 0.6, p-value threshold 0.05) and Fold change calculation (optional), in which a Control condition is paired with a Treated condition.] All options are the same as in section 4.1, with the addition that the user can define pairs of conditions for each Analysis object so that fold changes can be calculated. After making the necessary selections, the user should click OK. If OK is pressed without making any selections, the default parameters (what is displayed in the window) will be used for the batch process. If the user does not wish to perform statistical tests, Cancel should be pressed instead. The Clustering button will open
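The fold change calculation for a Control/Treated pair of conditions can be sketched in Python. The sketch assumes that expression values are already log2 ratios and that the fold change is the difference of the per-condition replicate means. ARMADA's exact formula is not given in this excerpt, so both the formula and the name `log2_fold_change` are assumptions.

```python
from statistics import mean


def log2_fold_change(control, treated):
    """Per-gene log2 fold change: mean(treated) minus mean(control).

    Both arguments map a gene identifier to the list of its log2 ratios
    across the replicates of one condition (assumed data layout).
    """
    return {gene: mean(treated[gene]) - mean(control[gene])
            for gene in control if gene in treated}


# Hypothetical replicate values for one Control/Treated pair.
control = {"geneA": [0.1, -0.1, 0.0], "geneB": [1.0, 1.2, 0.8]}
treated = {"geneA": [1.0, 1.2, 0.8], "geneB": [1.0, 1.0, 1.0]}

print(log2_fold_change(control, treated))
```

A log2 fold change of 1 corresponds to a two-fold increase in the treated condition relative to its control.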
179. two examples of subgrid MA plots: the first picture depicts a subgrid MA plot before data normalization, while the second depicts a subgrid MA plot after data normalization. [Figures: two grids of per-block MA plots, with panels titled Subgrid 1 1 to Subgrid 12 4.]
180. uch images may help the user identify individual characteristics of microarrays, identify possible differentially expressed transcripts and check the effects of normalization procedures, as well as compare normalized versus un-normalized images. To create un-normalized array images, the user should select an Analysis from the Analysis Object list and click Plots > Un-Normalized Images. The following window appears. [Window: (Un)Normalized Image Editor, with the Arrays list, a Display drop-down list (e.g. Normalized Ratio), a Title(s) field, the Image colormap options (e.g. Red-Green) and a Display Colorbar checkbox.] In the Arrays list, the un-normalized images preferences window displays the arrays from the currently selected Analysis in the Analysis Object list. The user can select one or multiple arrays for image creation and, as with the array images described in section 5.1, there are options on what data to use for image creation, the image color settings and the colormap density, as well as whether to display a colorbar or not. As before, the user may supply custom titles or leave the field Title(s) empty for automatic figure title generation. The following table presents the data types the user can use to create array images
181. ue cutoff: A p-value cutoff additional to the statistical test p-value cutoff, in order to cluster fewer genes than those determined by the statistical test. For example, if the statistical test was performed with a p-value cutoff of 0.05, the user can enter 0.01 to cluster fewer genes than those determined by the cutoff of 0.05.
Inconsistency coefficient: The inconsistency coefficient cutoff to determine the number of clusters based on the dendrogram [11].
Maximum number of clusters: The maximum number of clusters into which the dataset can be grouped.
Cutoff: Either the inconsistency coefficient cutoff or the maximum number of clusters.
Display heatmap: If checked, a clustering heatmap will be displayed.
Calculate optimal dendrogram: If checked, the dendrogram on the clustering heatmap will be optimized for better clustering results. However, this can take a considerable amount of time.
Colormap: The user should see section 5.1.
Colormap density: The user should see section 5.1.
Heatmap title: A title for the heatmap to be created, if chosen (Clustering Heatmap).
After setting the desired parameters or leaving the defaults, the user should click OK. Hierarchical clustering will be performed and ARMADA will store the result. Gene clusters can be viewed by clicking the Cluster List button on the main window. 4.3.2 k-means clustering. k-means clustering is one of the simplest algorithms that solve the well known clustering proble
182. un-normalized ratio of the replicates for each condition in log scale.
The median un-normalized ratio of the replicates for each condition in natural scale.
The median un-normalized ratio of the replicates for each condition in log scale.
The standard deviation of the replicates' un-normalized ratio for each condition in natural scale.
The standard deviation of the replicates' un-normalized ratio for each condition in log scale.
The normalized ratio in natural scale for each replicate of each experimental condition.
The normalized ratio in log scale for each replicate of each experimental condition.
The mean normalized ratio of the replicates for each condition in natural scale.
The mean normalized ratio of the replicates for each condition in log scale.
The median normalized ratio of the replicates for each condition in natural scale.
The median normalized ratio of the replicates for each condition in log scale.
The standard deviation of the replicates' normalized ratio for each condition in natural scale.
The standard deviation of the replicates' normalized ratio for each condition in log scale.
The intensity for each replicate of each experimental condition.
The mean intensity of the replicates for each condition.
The median intensity of the replicates for each condition.
The standard deviation of the replicates' intensity for each condition.
The numbers denoting each gene's unique positioning on the microarray slide the
183. und noise spot standard deviation (if available) for channel 1 (or Cy3 or Green).
The background noise spot standard deviation (if available) for channel 2 (or Cy5 or Red).
The difference between mean signal and background noise for channel 1 (or Cy3 or Green).
The difference between mean signal and background noise for channel 2 (or Cy5 or Red).
The difference between the medians (if available) of signal and background noise for channel 1 (or Cy3 or Green).
The difference between the medians (if available) of signal and background noise for channel 2 (or Cy5 or Red).
Channel 1 Foreground Background Mean: The signal to noise ratio between mean signal and background noise for channel 1 (or Cy3 or Green).
Channel 2 Foreground Background Mean: The signal to noise ratio between mean signal and background noise for channel 2 (or Cy5 or Red).
Channel 1 Foreground Background Median: The signal to noise ratio between the medians (if available) of signal and background noise for channel 1 (or Cy3 or Green).
Channel 2 Foreground Background Median: The signal to noise ratio between the medians (if available) of signal and background noise for channel 2 (or Cy5 or Red).
The user is also able to provide titles for the images to be created. Titles are given in the Title(s) panel and should be as many as the a
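The two kinds of derived per-spot measurements listed above, the difference between signal and background and the signal to noise ratio, can be sketched in Python. The sketch assumes the ratio is simply signal divided by background. The function names and values are illustrative and not taken from ARMADA.

```python
def background_corrected(signal, background):
    """Per-spot difference between signal and background noise."""
    return [s - b for s, b in zip(signal, background)]


def signal_to_noise(signal, background):
    """Per-spot ratio of signal to background (assumed definition).

    Spots with zero background are reported as infinity instead of
    raising a division error.
    """
    return [s / b if b else float("inf") for s, b in zip(signal, background)]


# Illustrative channel 1 (Cy3) mean signal and mean background per spot.
cy3_signal = [1200.0, 800.0, 300.0]
cy3_background = [200.0, 400.0, 0.0]

print(background_corrected(cy3_signal, cy3_background))
print(signal_to_noise(cy3_signal, cy3_background))
```

The same two functions apply unchanged to the median-based measurements and to channel 2.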
184. [Figures: ratio distribution histograms with Ratio on the horizontal axis; the second panel is titled Normalized ratio distribution for arrays of condition WT (Normalization: Loess with Span 0.1).] The user should note that Expression Distributions in the Plots menu become available only after the normalization procedure. 5.6 Boxplots. In descriptive statistics, a boxplot is a convenient and commonly used way of graphically presenting groups of numerical data. A boxplot also indicates which observations, if any, might be considered outliers and is able to visually show different types of populations without making any assumptions about the underlying statistical distribution. The spacings between the different parts of the box help indicate variance and skewness and identify outliers. For more information on boxplots the user should see [16]. In the case of microarray data, boxplots are used to summarize the gene expression distributions and identify their shape and several characteristics. They are useful for quality control, as well as for depicting differences between distributions among different slides and assessing the results of data normalization. Boxplots are available right after data importing. To create boxplots with ARMADA, the user should select an Analysis from the Analysis Object list and click Plots > Boxplots.
185. The following window will appear. [Window: Boxplot Editor, with the Arrays list, the Options panel (Each Slide, Each Condition), the Data to plot list (e.g. Channel Ratio), the Plots panel (Before normalization, After normalization, Before and after normalization) and a Title field.] The interface is similar to the interface of Expression Distributions (5.3), with the selections in the Options and Plots panels denoting exactly the same configurations as with Expression Distributions and the only difference being the list Data to plot. This list contains the types of data that the user can visualize by using boxplots. All data apart from the Channel Ratio can be plotted only before normalization. For a description of the data types for which boxplots can be created, the user should look at the table in section 5.1, as they are exactly the same. The channel ratio is the log ratio between the two channels, depicting gene expression. Below there are several examples of boxplots for data before and after normalization. [Figures: Boxplot for Channel 1 Foreground Mean; Un-normalized boxplot for Ratio.] By examining the boxplots before and after normalization, the effect of normalization is immediately seen: gene expression distributions are scaled and centered so that they can be compared using statistical tests. Normalized
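The quantities a boxplot summarizes (quartiles, median and candidate outliers) can be computed with a short Python sketch. The 1.5 times interquartile range rule used to flag outliers is a common convention and an assumption here; the excerpt does not state which rule ARMADA's boxplots use. The function name `boxplot_summary` is hypothetical.

```python
from statistics import quantiles


def boxplot_summary(values, whisker=1.5):
    """Quartiles and outliers of one distribution (e.g. one slide).

    Outliers are values further than `whisker` interquartile ranges
    from the box, an assumed convention.
    """
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": q2, "q3": q3, "outliers": outliers}


# Illustrative values for a single slide.
slide = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 30.0]
print(boxplot_summary(slide))
```

Comparing these summaries across slides before and after normalization gives the same information the boxplots convey visually: whether the distributions share a common center and spread.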
186. xport normalized gene lists, the user should select an Analysis from the Analysis Object list on the main window and click on File Export Data Normalized Genes List, or right-click on the selected Analysis in the Analysis Object list and select Export Normalized List. To export differentially expressed gene lists, the user should select an Analysis from the Analysis Object list and click File Export Data DE Genes List, or right-click on the selected Analysis in the Analysis Object list and select Export DE List, or click on the Export DE List shortcut button on the main window. In all cases the user will be prompted to specify a location where the output file will be saved. The normalized and differentially expressed gene lists are tab-delimited text or Excel files which contain data separated in different columns. The user is able to specify the output file format (Excel or tab-delimited text), as well as the data fields to be exported, by clicking on File Export Settings Gene Lists. The following preferences window will appear. [Window: Gene List Export Editor, with checkbox groups for Unnormalized ratios and Normalized ratios, each offering Ratio (raw), Ratio (log), Mean ratio (raw), Mean ratio (log), Median ratio (raw), Median ratio (log) and StDev ratio (raw), among others.]
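The tab-delimited text format with user-selected data fields can be sketched in Python with the standard csv module. The field names and the function `export_gene_list` are hypothetical; ARMADA's actual column headers may differ.

```python
import csv
import io


def export_gene_list(rows, fields):
    """Serialize gene records as tab-delimited text, keeping only `fields`.

    Keys not listed in `fields` are ignored, mirroring the idea of
    choosing which data fields to export.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical gene records; only two of the three fields are exported.
records = [
    {"Gene": "geneA", "Mean ratio (log)": 1.25, "p-value": 0.003},
    {"Gene": "geneB", "Mean ratio (log)": -0.5, "p-value": 0.04},
]
print(export_gene_list(records, ["Gene", "Mean ratio (log)"]))
```

Writing the returned string to a file with a .txt extension yields a list that spreadsheet programs can open directly.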
187. y Expressed Genes List
2.6.15 Delete analysis objects
3 Preprocessing Data
3.1 Selecting subjects of experimental conditions
3.4 Normalization
4.1 Statistical Selection
4.2 Fold change
4.3 Clustering
4.3.1 Hierarchical clustering
4.3.2 k-means clustering
4.4 Classification
4.4.1 Linear Discriminant Analysis
4.4.1.1 Linear Discriminant Analysis: Tuning
4.4.1.2 Linear Discriminant Analysis: Classifying
4.4.2 k-Nearest Neighbors
4.4.2.1 k-Nearest Neighbors: Tuning
4.4.2.2 k-Nearest Neighbors: Classifying
4.4.3 Support Vector Machines
4.4.3.1 Support Vector Machines: Tuning
4.4.3.2 Support Vector Machines: Training
4.4.3.3 Support Vector Machines: Classifying
5.2 Normalized a
