
Significance analysis of Microarrays: User guide and technical

1. This document does not answer my questions. Where should I look?
As we get asked new questions, we update this list of frequently asked questions with answers. Please visit the URL http://www-stat.stanford.edu/~tibs/SAM where you may find further information.

I get an error that says "cannot create ActiveX component".
This is usually due to the prerequisites not being met. Try downloading the Microsoft Java VM as indicated in section 5. We have seen this problem with Office XP especially.

Every time I run SAM, I get the message "Run-time error 429: ActiveX component can't create object". How do I fix this problem?
This error message can occur if the Microsoft Data Access Components are not available on your system. Follow the suggestions in section 5.

Where is the SAM manual?
It should be located in C:\Program Files\SAMVB\doc in the default installation. If you used a different directory, then it should be in the analogous place. In the worst case, search for the file sam.pdf.

Where are the examples?
They should be located in C:\Program Files\SAMVB\Examples in the default installation. If you used a different directory, then they should be in the analogous place. In the worst case, search for the file twoclass.xls.

What does the gene hyperlink lookup do? Does it mean that my identified genes are snooped by Stanford?
The web lookup facility is provided merely as a convenience. One doesn't have to use
2. [Figure 4: SAM results for 3 different datasets. Each panel plots observed score against expected score.]

Table 4: SAM false positive results for 3 scenarios

(A)
Δ      # false pos   # called   FDR
0.3    11.7          100        0.117
0.4    9.3           76         0.122
0.5    5.9           65         0.091
0.6    4.4           39         0.113
0.7    3.5           33         0.106
0.8    2.1           29         0.072
0.9    1.6           17         0.094
1.0    1.3           16         0.081

(B)
Δ      # false pos   # called   FDR
0.3    4.8           2          2.40
0.4    1.8           2          0.90
0.5    1.3           2          0.65
0.6    0.6           2          0.30
0.7    0.3           2          0.15
0.8    0.2           0          Inf
0.9    0.2           0          Inf
1.0    0.2           0          Inf

(C)
Δ      # false pos   # called   FDR
0.3    23.4          894        0.026
0.4    10.6          840        0.013
0.5    5.0           818        0.006
0.6    3.1           780        0.004
0.7    1.9           741        0.003
0.8    1.6           708        0.002
0.9    1.4           674        0.002
1.0    0.9           636        0.001

6. For a fixed threshold $\Delta$, starting at the origin and moving up to the right, find the first $i = i_1$ such that $d_{(i)} - \bar{d}_{(i)} > \Delta$. All genes past $i_1$ are called significant positive. Similarly, start at the origin, move down to the left and find the first $i = i_2$ such that $\bar{d}_{(i)} - d_{(i)} > \Delta$. All genes past $i_2$ are called significant negative. For each $\Delta$ define the upper cut-point $\mathrm{cut}_{up}(\Delta)$ as the smallest $d_i$ among the significant positive genes, and similarly define the lower cut-point $\mathrm{cut}_{low}(\Delta)$.

7. For a grid of $\Delta$ values, compute the tota
3. [Figure 1: Highlighting and invoking SAM]

[Figure 2: The SAM Dialog Box. The dialog is titled "Welcome to SAM Version 1.20, Significance Analysis of Microarrays, (C) Trustees of Leland Stanford Junior University, All Rights Reserved" (Academic version) and offers the following controls:
Choose Response Type: Quantitative Response; Two class, unpaired data; Censored Survival data; Multiclass Response; One class Response; Paired data
Data in Log Scale?: Logged (base 2); Unlogged
Web Link Option: Clone ID; Name; Accession No.; UniGene Cluster ID
Number of Permutations: 200
Additional Sheets: Sheet2
Imputation Engine: K-Nearest Neighbors Imputer (Number of Neighbors: 10); Row Average Imputer
Random Number Seed: 1234567; Generate Random Seed
OK; Cancel]

[Figure 3: The SAM Plot]

A dialog form (shown in figure 2) now pops up. You have to select the type of response variable and, if desired, change
4. 11 Handling Missing Data
12 Running SAM
12.1 Using data in Multiple Sheets
12.2 Format of the Significant gene list
13 Interpretation of SAM output
14 Technical details of the SAM procedure
14.1 Computation of $s_0$
14.2 Details of $r_i$ and $s_i$ for different response types
14.3 Details of Permutation Schemes
15 Frequently Asked Questions
15.1 General Questions
15.2 SAM Registration Questions
15.3 Installation/Uninstallation Questions
15.4 SAM Usage Questions

List of Figures
1 Highlighting and invoking SAM
2 The SAM Dialog Box
3 The SAM Plot
4 SAM results for 3 different datasets

List of Tables
1 [caption illegible]
2 Example Dataset for an unpaired problem
3 Example Dataset for a blocked unpaired problem
4 SAM false positive results for 3 scenarios

1 Important Announcement

To foster communication between SAM users and make new announcements, a new Yahoo group has been established. See http://groups.yahoo.com/group/sam-software

2 Summary of Changes

The following are changes since the initial release of SAM 1.0.

2.1 Changes in SAM 1.21

Two bugs were fi
5. • Stricter checks on response variable values are now performed.
• Several efficiency issues have been addressed.
• The web version of SAM is no longer under development. Hence we have removed it from this manual. The old version still works for the time being, and the version 1.0 manual documents it.
• Due to changes in the internals of SAM, results using SAM 1.10 will be close to, but not exactly, those obtained with SAM 1.0.
• We have also updated the FAQ with the latest information. See section 15.

3 Introduction

SAM (Significance Analysis of Microarrays) is a statistical technique for finding significant genes in a set of microarray experiments. It was proposed by Tusher, Tibshirani and Chu [4]. The software was written by Balasubramanian Narasimhan.

The input to SAM is gene expression measurements from a set of microarray experiments, as well as a response variable from each experiment. The response variable may be a grouping like untreated, treated (either unpaired or paired), a multiclass grouping (like breast cancer, lymphoma, colon cancer), a quantitative variable (like blood pressure) or a possibly censored survival time.

SAM computes a statistic $d_i$ for each gene $i$, measuring the strength of the relationship between gene expression and the response variable. It uses repeated permutations of the data to determine if the expression of any gene is significantly related to the response. The cutoff for significance is determined by
6. a tuning parameter delta, chosen by the user based on the false positive rate. One can also choose a fold change parameter, to ensure that called genes change at least a pre-specified amount. See section

4 Obtaining SAM

SAM is licensed software. Information on licensing of SAM can be obtained from Kirsten Leute, by email or by phone (650-725-9407), at the Stanford University Office of Licensing, http://otl.stanford.edu

5 System Requirements

SAM requires:

• The latest updates for your operating system, available from http://windowsupdate
To prevent any problems, access this and other Microsoft sites using Internet Explorer rather than Netscape. Clicking on the Product Updates link pops up a box that will automate the installation of the latest patches. Beware that several time-consuming reboots are usually needed, and you might need administrative privileges to install the patches. It is generally a good idea to update your system for security reasons anyway.

• The latest Microsoft Java Virtual Machine. This is freely available from the web site. To prevent any problems, access this and other Microsoft sites using Internet Explorer rather than Netscape. Choose the correct version for your operating system and install it. Windows XP and Office XP users should especially do so, as Microsoft doesn't distribute Java with its products anymore.

• The Microsoft Data Access Components. This is usually available on all newer Windows machines by de
7. an error that a library was not registered. However, at the end the program says that the installation was successful. Does this mean that SAM is installed correctly?
No. Anytime an error occurs, it means that SAM is not installed properly. The problem must be fixed before you can rely on SAM working for you. This often happens when the prerequisites are not met. It also happens if your system doesn't have Microsoft Data Access Components installed. See section 5.

7. I am using Office 97. Where can I download the service packs for it?
The last time we looked, it was at the following URL: http://office.microsoft.com/downloads/9798/sr2off97detail.aspx
If you don't find it there, search for the words "office97 service release" at the web site http://office.microsoft.com. Beware, these things keep being moved around.

15.4 SAM Usage Questions

1. SAM generates an error when I run it on my dataset. What should I do?
Most often, errors are due to improper data formats.
• Please make sure that your data is formatted exactly as described in section 10. Particular attention needs to be paid to the format of the response in the first row, as described in section 10.1.
• Please make sure that the response type you chose in the SAM dialog box shown in figure 2 matches the format of your response. In our testing, about 95% of the problems have been due to the wrong response format.
• Please make sure that you have chosen your data area appr
8. n input sheets contains missing data, please note that SAM will add n sheets named SAM Imputed Dataset, SAM Imputed Dataset 1, and so on.

12.2 Format of the Significant gene list

For reference, SAM numbers the original genes, in their original order, as 1, 2, 3, etc. In the output, this is the Row number. The output for the list of significant genes has the following format:

Row Number: The row in the selected data rectangle.
Gene Name: The gene name specified in the first column of the selected data rectangle. This is for the user's reference.
Gene Id: The gene id specified in the second column of the selected data rectangle. This is for the user's reference, but is also linked to the SOURCE web site for gene information.
SAM score (d): The T-statistic value.
Numerator: The numerator of the T-statistic.
Denominator ($s + s_0$): The denominator of the T-statistic.
q-value: This is the lowest False Discovery Rate at which the gene is called significant. It is like the familiar "p-value", adapted to the analysis of a large number of genes. The q-value measures how significant the gene is: as $d_i > 0$ increases, the corresponding q-value decreases.

The numerator, denominator and q-value are further explained in the technical section below. The list is divided into positive and negative genes, having positive or negative score d. Positive score means positive correlation with the response variable: e.g. for group response 1, 2, positive score means expression is higher for g
9. you used that generated the error.

References

[1] B. Efron, R. Tibshirani, J. D. Storey, and V. Tusher. Empirical Bayes analysis of a microarray experiment. J. Amer. Stat. Assoc., to appear.
[2] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York, 2001.
[3] J. D. Storey. A direct approach to false discovery rates. Submitted. Available at www-stat.stanford.edu/~jstorey.
[4] V. Tusher, R. Tibshirani, and C. Chu. Significance analysis of microarrays applied to ionizing radiation response. Proceedings of the National Academy of Sciences, 2001. First published April 17, 2001, 10.1073/pnas.091062498.
10. 7 and within the set 2, 4, 6, 8, but not across the two sets. We indicate the blocks (batches) as follows:

                1block1  1block2  2block1  2block2  1block1  1block2  2block1  2block2
GENE1   101     7.64     0.50     1.95     10.12    10.77    4.47     7.65     7.58
GENE2   102     38.10    4.86     7.87     13.59    9.79     13.46    8.91     5.07
GENE3   103     21.15    5.96     3.20     4.74     3.70     12.35    10.17    0.63
GENE4   104     187.21   23.81    16.76    14.10    99.76    89.11    10.92    5.52

Table 3: Example Dataset for a blocked unpaired problem

For example, 1block1 means treatment 1, block (or batch) 1; 1block2 means treatment 1, block (or batch) 2. In this example, there are 4! = 24 permutations within block 1 and 4! = 24 permutations within block 2. Hence the total number of possible permutations is 24 × 24 = 576. If the block information is not indicated in line 1, all permutations of the 8 samples would be allowed. There are 8! = 40320 such permutations.

Please note that block permutations cannot be specified with Paired response, as there is an implicit blocking already in force.

10.4 Normalization of experiments

Different experimental platforms require different normalizations. Therefore, the user is required to normalize the data from the different experiments (columns) before running SAM. SAM does not do any normalization.

For cDNA data, centering the columns of the expression matrix (that is, making each column's mean equal to zero) is often sufficient. For oligonucleotide data, a stronger calibration m
11. General Questions

1. How is SAM licensed? Whom should I contact?
SAM is distributed without cost to Academic Institutions for research purposes. Academic users of SAM should cite the article [4]. They can download the software after registration directly from http://www-stat.stanford.edu/~tibs/SAM. Commercial users of SAM should contact Kirsten Leute of the Stanford University Office of Licensing (http://otl.stanford.edu) via phone at 650-725-9407 or via email. A limited version of SAM is available for download from http://www-stat.stanford.edu/~tibs/SAM.

2. Is there a version of SAM that works on Macintosh computers?
Unfortunately, no. Since the Excel version of SAM makes extensive use of the Microsoft Component architecture on Windows (COM), it is not easy to port it to Macs. One suggestion that has been made is to use a Windows emulator on Macs, such as Virtual PC from Connectix Corporation. We have not confirmed that this works, although the folks at Connectix say it should do so with the 4.0.2 update of their Virtual PC product.

15.2 SAM Registration Questions

1. I registered for SAM and I have still not received an email confirming my registration.
This is most likely due to your email server being down. Hundreds of requests have been successfully sent out to people. Our registration server tries every hour to remail the pending requests. If you do not receive your registration user id and password within the day, you may alwa
12. SAM "Significance Analysis of Microarrays": Users guide and technical document

Gil Chu, Balasubramanian Narasimhan, Robert Tibshirani, Virginia Tusher

Gil Chu: Department of Biochemistry, Stanford University, Stanford CA 94305. Email: chu@cmgm.stanford.edu
Balasubramanian Narasimhan: Department of Statistics and Department of Health Research & Policy, Stanford University, Stanford CA 94305. Email: naras@stat.stanford.edu
Robert Tibshirani: Department of Health Research & Policy and Department of Statistics, Stanford University, Stanford CA 94305. Email: tibs@stat.stanford.edu
Virginia Tusher: Department of Biochemistry, Stanford University, Stanford CA 94305. Email: goss@cmgm.stanford.edu

Contents
1 Important Announcement
2 Summary of Changes
2.1 Changes in SAM 1.21
2.2 Changes in SAM 1.20
2.3 Changes in SAM 1.15
2.4 Changes in SAM 1.13
2.5 Changes in SAM 1.12
2.6 Changes in SAM 1.10
3 Introduction
4 Obtaining SAM
5 System Requirements
6 Installation
7 Uninstalling SAM
8 Documentation
9 Examples
10 Data Formats
10.1 Response Format
10.2 Example Input Data file for an unpaired problem
10.3 Block Permutations
10.4 Normalization of experiments
13. a should be put in an Excel spreadsheet. The first row of the spreadsheet has information about the response measurement; all remaining rows have gene expression data, one row per gene. The columns represent the different experimental samples.

• The first line of the file contains the response measurements, one per column, starting at column 3. This is further described below in section 10.1.
• The remaining lines contain gene expression measurements, one line per gene. We describe the format below.

Column 1: This should contain the gene name. It is for the user's reference.
Column 2: This should contain the gene ID, for the user's reference. Note that the gene ID column is the column that is linked to the SOURCE website by SAM. Hence, a unique identifier (e.g. Clone ID, Accession number or Gene Name/Symbol) should be used in this column if SOURCE web site gene lookup is desired.
Remaining Columns: These should contain the expression measurements as numbers. Missing expression measurements should be coded as NA. This is done easily in a good editor or in Excel. In Excel, to change blank fields to NA, choose all columns, pull down the Edit menu, choose Replace, and then replace nothing (blank) with NA.

10.1 Response Format

Table 1 shows the formats of the response for various data types. A look at the example files is also informative.

Response type            Coding
Quantitative             Real number, e.g. 27.4 or 45.34
Two class (unpaired)     Integer 1, 2
Mul
14. any of the values of the default parameters. For two-class and paired data, one has to specify whether the data is in the logged (base 2) scale or not. Click the OK button to do the analysis.

If you had any missing data in your spreadsheet, a new worksheet named SAM Imputed dataset, containing the imputed dataset, is added to the workbook. This data can be used in subsequent analyses to save time. If there is no missing data, this worksheet is not added.

The software adds three more worksheets to the workbook. There is one, which is hidden, called SAM Plot data, and it should be left alone. The sheet named SAM Plot contains the plot that the user can interact with. The sheet named SAM Output is used for writing any output.

Initially, a slider pops up along with the plot (shown in figure 3) that allows one to change the Δ parameter and examine the effect on the false positive rate. If you want a more stringent criterion, try setting a non-zero fold change parameter (see section 14 for details). Positive significant genes are labelled in red on the SAM plot; negative significant genes are green.

When you have settled on a value for Δ, click on the List Significant Genes button for a list of significant genes. The List Delta Table button lists the number of significant genes and the false positive rate for a number of values of Δ. Please note that all output tables are sent to the worksheet named SAM Output, erasing whatever was previously present in the worksheet.

While t
15. ay be necessary: for example, a linear normalization of the data for each experiment versus the row-wise average for all experiments.

11 Handling Missing Data

There are currently two options for imputing missing values in SAM.

Row Average: Each value is imputed with the average of non-missing values for that gene.

K-Nearest Neighbor: In the other (default) option, missing values are imputed using a k-nearest neighbor average in gene space (default k = 10):

1. For each gene i having at least one missing value:
(a) Let S be the samples for which gene i has no missing values.
(b) Identify all other genes G having no missing values for samples S.
(c) Find the k nearest neighbors to gene i among genes G, using only samples S to compute the Euclidean distance.
(d) Impute the missing values in gene i, using the averages of the non-missing entries from the k nearest neighbors for that sample.
2. If a gene still has missing values after the above steps, impute them with the average non-missing expression for that gene.

12 Running SAM

To begin, you highlight an area of the spreadsheet that represents the data by first clicking on the top left corner and then shift-clicking on the bottom right corner of the rectangle. Then click on the SAM button in the toolbar. See the illustration in figure 1.
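The k-nearest-neighbor imputation steps listed in section 11 can be sketched in a few lines of Python. This is only an illustration of the described procedure, not SAM's own code: the function name `knn_impute`, the use of `None` for a missing cell (SAM's spreadsheets use NA), and the list-of-lists layout (one row per gene, one column per sample) are all assumptions made for the example.

```python
import math


def knn_impute(data, k=10):
    """Impute missing cells (None) in a genes-by-samples list of lists.

    Hypothetical helper following steps 1(a)-(d) and 2 of section 11.
    """
    n_genes = len(data)
    n_samples = len(data[0])
    result = [row[:] for row in data]  # leave the input untouched

    for i, row in enumerate(data):
        missing = [j for j in range(n_samples) if row[j] is None]
        if not missing:
            continue
        # (a) S: samples for which gene i is observed
        observed = [j for j in range(n_samples) if row[j] is not None]
        if not observed:
            continue
        # (b) G: other genes with no missing values on S
        candidates = [g for g in range(n_genes)
                      if g != i and all(data[g][j] is not None for j in observed)]

        # (c) Euclidean distance to gene i, computed on S only
        def distance(g):
            return math.sqrt(sum((data[g][j] - row[j]) ** 2 for j in observed))

        neighbors = sorted(candidates, key=distance)[:k]
        # (d) average the neighbors' non-missing values for each missing sample
        for j in missing:
            values = [data[g][j] for g in neighbors if data[g][j] is not None]
            if values:
                result[i][j] = sum(values) / len(values)

    # Step 2: fall back to the gene's own average of observed values
    for i, row in enumerate(data):
        own = [v for v in row if v is not None]
        for j in range(n_samples):
            if result[i][j] is None and own:
                result[i][j] = sum(own) / len(own)
    return result


example = [
    [1.0, 2.0, None],      # gene with one missing value
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 5.0],
    [10.0, 20.0, 100.0],   # far away, so not among the 2 nearest
]
print(knn_impute(example, k=2)[0])
```

With k = 2 the two closest genes contribute 3.0 and 5.0, so the missing cell is filled with their average. A gene with no usable neighbors falls through to the row average of step 2.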
16. ducibility of results.

How large a dataset can SAM handle?
There is really no hard limit per se in SAM. Excel itself has some limit on the number of rows and columns it can handle. There are additional overheads involved in marshalling the data between Excel and the core of SAM. Therefore, the practical limit is lower. In general, the more memory you have, the larger the problems you can handle.

I set the value of fold change to some value and now I want to analyze my data without fold change. I seem to be unable to do so.
To analyze your data without using fold change, completely erase the value for the fold change and leave it empty. You can now hit Enter or move your delta slider to recompute the results.

Why does SAM take so long to show results when I change the value of fold change?
Whenever a new value is entered for fold change, SAM has to recompute the q-value bounds for each gene. This is computationally intensive.

Why doesn't Excel allow me to enter a response label like "1 4block1"? It seems to think it is a formula.
Use quotes around the response label to work around this problem. SAM strips off quotes at the ends of the label.

When I enter a different number for the fold change, it seems to have no effect on the number of significant genes.
This usually happens if one indicates that the data is logged when it is actually not logged. Make sure that you specify the correct scale for the data.
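A small Python sketch may help show why a wrongly declared scale can make the fold change setting appear to do nothing. It is a hedged illustration only: the function name `passes_fold_change` is hypothetical, and it assumes that data declared as logged (base 2) is converted back with 2**x before the group averages are compared, since the manual states that the fold change refers to averages of raw (unlogged) data. SAM's actual internals may differ.

```python
def passes_fold_change(group1, group2, t, logged_base2, direction):
    """Check the fold change criterion for one gene (illustrative only).

    group1, group2: expression values under the two conditions.
    t: fold change threshold, e.g. 2.0.
    logged_base2: whether the values are declared to be log2 values.
    direction: "positive" or "negative".
    """
    if logged_base2:
        # Assumption: undo the log so that averages are taken on raw data.
        group1 = [2.0 ** v for v in group1]
        group2 = [2.0 ** v for v in group2]
    mean1 = sum(group1) / len(group1)
    mean2 = sum(group2) / len(group2)
    ratio = mean2 / mean1
    if direction == "positive":
        return ratio > t
    return ratio < 1.0 / t


raw1, raw2 = [8.0, 8.0], [10.0, 10.0]   # raw values, true fold change 1.25

# Declared correctly as unlogged: 1.25 is not above 2, so the gene is filtered out.
print(passes_fold_change(raw1, raw2, 2.0, False, "positive"))  # False

# Declared (wrongly) as logged: 2**10 / 2**8 = 4, so the gene passes.
print(passes_fold_change(raw1, raw2, 2.0, True, "positive"))   # True
```

Under this assumption, raw values mistaken for log2 values are exponentiated, which inflates almost every ratio past the threshold, so changing the threshold removes few or no genes.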
17. e Addins and uncheck the box against the phrase Significance Analysis for Microarrays. Then use the Control Panel to uninstall the software. If you are asked whether shared components should be kept and not discarded, elect to keep them as a conservative measure, unless you are really hard pressed for space.

8 Documentation

This manual for SAM is available to authorized users from the SAM web site. After SAM has been installed, the manual is also available as a PDF file in the subdirectory doc of the SAM installation directory. If you don't already have a PDF reader installed, you can get one from the web site www.adobe.com.

9 Examples

Some examples of the use of SAM are in the directory C:\Program Files\SAMVB\Examples in the default installation. These examples are meant to familiarize users with the format in which SAM expects the data. We briefly describe the examples below.

twoclass: An example of two-class, unpaired data.
twoclassm: An example of two-class, unpaired data with missing data.
twoclassb: An example of two-class, unpaired data with experimental blocks defined.
oneclass: An example of one-class data.
multi: An example of multiclass response.
paired: An example of paired data.
censored: An example of censored survival data. Note the format of the labels in the first row.
quantitative: An example of quantitative data.

Instructions on using SAM on these examples are given in section 12.

10 Data Formats

The dat
18. er will contain a Setup program.

1. Double click on Setup. In some cases, Setup will complain that it needs to update your computer and reboot it before it can install SAM. It is safe to click OK to update your computer and run Setup once again after the reboot. Sometimes, the installation process might warn you that a version of a DLL that is being installed is older than one already on your computer. Elect to keep the existing version. SAM usually installs itself in C:\Program Files\SAMVB. Users can change this directory at the time of installation, although we recommend that only the drive letter be changed and not the name of the directory.

2. Fire up Excel and click on the Tools menu. Choose Addins and click on Browse. Select the directory where the setup process installed SAM (C:\Program Files\SAMVB if you chose the defaults) and click on the Addin subdirectory. Double click on the SAM file. The SAM addin will be loaded and the box against the phrase Significance Analysis for Microarrays will be checked. Once you click OK, you should see two buttons on your Excel toolbar, named SAM and SAM Plot Control.

This completes the installation.

Windows XP/Office XP users: Microsoft doesn't ship Java anymore with its operating system. You must install Java yourself as indicated in section 5. Otherwise SAM will not work.

7 Uninstalling SAM

Before uninstalling SAM, one has to fire up Excel and click on the Tools menu. Choos
19. fault, but older Windows NT installations might require you to install it. It is freely available from the web site http://www.microsoft.com/data. To prevent any problems, access this and other Microsoft sites using Internet Explorer rather than Netscape. Choose the correct version for your operating system and install it.

• Microsoft Excel 97 or higher. We recommend that users install the appropriate Microsoft Office service packs, which are available from http://officeupdate.microsoft.com. Office 97 users are especially encouraged to do so; there are two service packs for Office 97. The Office 97 service packs are not easy to find: one often has to search the Microsoft web site to access them.

Again, performance gets better with faster processors and more RAM. The size of the problem that one can handle is limited by the largest spreadsheet Excel can handle and the memory resources that are allocated to the Microsoft Java Virtual Machine.

6 Installation

If you received SAM on a CDROM, then inserting the CDROM into the drive will bring up the Setup program for installing the software. If for some reason that doesn't happen, you can access the CDROM by clicking on My Computer and double clicking on the CDROM drive. Then follow the steps below.

Otherwise, if you downloaded SAM from the web, you need to run a program like WinZip (which is freely available) on the file sam.zip and extract the contents to a suitable folder. This fold
20. he slider is present, all interaction with the workbook is only possible via the slider. It can be killed anytime and recreated by clicking on the SAM Plot Control button.

12.1 Using data in Multiple Sheets

The maximum number of columns one can have in an Excel worksheet is 256 (columns A through IV). If you have more than 256 samples, you can arrange the data in multiple sheets before invoking SAM.

For example, consider the situation where you have 5000 genes and 300 samples. Per the data format required by SAM, this means that the data set would contain 300 + 2 = 302 columns and 5001 rows. The extra two columns contain the gene name and identifier, and the top row contains the response labels.

One possibility is to put the first 256 columns in one sheet and the remaining 46 in another sheet. Or a 100 + 100 + 102 split over three (not necessarily contiguous) worksheets is also possible; it is your call. Then highlight the regions in each sheet as usual, by clicking on the top left corner of the rectangle and shift-clicking on the bottom right corner. Then switch back to the sheet that contains the gene names and ids. SAM must be invoked from the sheet that contains the gene names and ids. Failure to do so will result in all kinds of hell breaking loose.

The SAM dialog will offer you the option of choosing the additional sheets. Control-click on the sheets that contain the additional data. Proceed as usual after this point.

If any of the
21. icult to enter blocking information (see the section on block permutations) without confusing Excel: Excel thinks such entries are formulae. Therefore, SAM allows any response to be enclosed within quotes (not apostrophes) and strips the quotes off before doing any computation.

10.2 Example Input Data file for an unpaired problem

The response variable is 1 = untreated, 2 = treated. The columns are gene name, gene id, followed by the expression values. The first row contains the response values.

                1        1        2        2        1        1        2        2
GENE1   101     7.64     0.50     1.95     10.12    10.77    4.47     7.65     7.58
GENE2   102     38.10    4.86     7.87     13.59    9.79     13.46    8.91     5.07
GENE3   103     21.15    5.96     3.20     4.74     3.70     12.35    10.17    0.63
GENE4   104     187.21   23.81    16.76    14.10    99.76    89.11    10.92    5.52

Table 2: Example Dataset for an unpaired problem

Note that there are two blank cells at the beginning of line 1. The gene expression measurements can have an arbitrary number of decimal places.

10.3 Block Permutations

Response labels can be specified to be in blocks by adding the suffix blockN, where N is an integer, to the response labels. Suppose, for example, that in the two-class data of section 10.2, samples 1, 3, 5, 7 came from one batch of microarrays, and samples 2, 4, 6, 8 came from another batch. We call these batches "blocks". Then we might not want to mix up the batches in our permutations of the data, in order to control for the array differences. That is, we'd like to allow permutations of the samples within the set 1, 3, 5,
22. is on the next step below.

• Double click on My Computer, followed by C:, followed by sam and finally Setup.exe.

If even this doesn't work, send email to sam-bug@stat.stanford.edu with complete details, including:
(a) The error message.
(b) The system you are using: Windows 95, Windows 98, Windows ME, Windows NT or Windows 2000.
(c) Whether you have installed all the prerequisites mentioned in the SAM manual. In particular, please make sure you have installed the Microsoft Java Virtual Machine and the Data Access Components specified in section 5.
(d) The dataset you used that generated the error.

4. I would like to revert back to the old version of SAM. How should I go about it?
We strongly recommend against this. We have expended quite a bit of effort to make the new version of SAM bug-free and correct. However, if you really need to do so for other reasons, you must uninstall SAM as usual. Then you must locate the file called SAMProject.dll and remove it manually. This file tends to be left orphaned by the uninstallation process.

5. Is there an easier way to detect if the Microsoft Java Virtual Machine is installed on my computer?
We don't know of any easy way, except to set up an applet on our web site and sniff out your JVM when you access it using Internet Explorer. If there is enough demand, we might do so. The surest bet is to download the latest Microsoft JVM as indicated in section 5.

6. When I install SAM, I get
23. it. Just don't click on it. Please remember that all websites have logs, and surely your query gets recorded somewhere. But as to what happens to it, we cannot answer, as we really have no affiliation with that site. So the bottom line is that if you are really concerned, you should just refrain from using that feature.

14. Where can I go for help if I just cannot get SAM to work?
We are very interested in making SAM work for all users. However, before reporting problems or bugs, we'd really like you to make sure that the problem is really with SAM. The following checklist should help.
• Please make sure you have installed all the prerequisites. See the section on system requirements.
• If the problem is with SAM usage, please make sure that you have formatted your data exactly as mentioned in the SAM manual.
• If you are having problems with a particular type of data, please make sure that you have formatted the response labels appropriately and have chosen the correct applicable data type.

If you still cannot get SAM to work, send email to sam-bug@stat.stanford.edu with complete details, including:
(a) The error message.
(b) The system you are using: Windows 95, Windows 98, Windows ME, Windows NT or Windows 2000.
(c) Whether you have installed all the prerequisites mentioned in the SAM manual. In particular, please make sure you have installed the Microsoft Java Virtual Machine and the Data Access Components specified in section 5.
(d) The dataset
24. l number of significant genes (from the previous step), and the median number of falsely called genes, by computing the median number of values among each of the $B$ sets of $d^{*b}_{(i)}$, $i = 1, 2, \ldots, p$, that fall above $\mathrm{cut}_{up}(\Delta)$ or below $\mathrm{cut}_{low}(\Delta)$. Similarly for the 90th percentile of falsely called genes.

8. Estimate $\pi_0$, the proportion of true null (unaffected) genes in the data set, as follows:
(a) Compute $q25$, $q75$ = 25% and 75% points of the permuted $d$ values (if $p$ = # genes and $B$ = # permutations, there are $pB$ such $d$ values).
(b) Compute $\hat{\pi}_0 = \#\{d_i \in (q25, q75)\} / (0.5p)$ (the $d_i$ are the values for the original dataset; there are $p$ such values).
(c) Let $\hat{\pi}_0 = \min(\hat{\pi}_0, 1)$ (i.e., truncate at 1).

9. The median and 90th percentile of the number of falsely called genes from step 6 are multiplied by $\hat{\pi}_0$.

10. The user then picks a $\Delta$ and the significant genes are listed.

11. The False Discovery Rate (FDR) is computed as [median (or 90th percentile) of the number of falsely called genes] divided by [the number of genes called significant].

12. Fold change: Suppose $\bar{x}_{i1}$ and $\bar{x}_{i2}$ are the average expression levels of a gene $i$ under each of two conditions. These averages refer to raw (unlogged) data. Then if a nonzero fold change $t$ is also specified, a positive gene must also satisfy $\bar{x}_{i2}/\bar{x}_{i1} > t$ in order to be called significant, and a negative gene must also satisfy $\bar{x}_{i2}/\bar{x}_{i1} < 1/t$ to be called significant. When a fold change is specified, genes with either $\bar{x}_{i1}$
25. not be calculated, it is flagged with an NA for Not Applicable.

Changes in SAM 1.10

Bug fixes: a serious bug in the imputation was fixed. The bug caused some data to be imputed with the value 65535. A symptom of this bug was that the plot would have a strange appearance due to the scaling.

A new facility for block permutations has been added to handle different experimental conditions such as array batches. See section

In cases where the total number of possible permutations is small, the full set of permutations is used rather than a random sampling.

The threshold is now replaced by a fold change criterion, and now handles logged (base 2) and unlogged data appropriately. The fold change applies only to two class or paired data.

We have added a new output column to the significant gene list: the q-value for each gene. This is the lowest False Discovery Rate at which that gene is called significant. It is like the well-known p-value, but adapted to multiple testing situations.

The reported False Discovery Rates are now lower and more accurate than in Version 1.0. They are scaled by a factor $\hat{\pi}_0$ between 0 and 1 that is now displayed on all output. See Section 14 and reference B.

Significant gene ids are now linked directly to the Stanford SOURCE web database. Several options for search are provided. Default is by gene name.

For two class and paired data, one must now specify whether data is in log scale or not.
26. opriately as discussed in 12. It is easy to highlight the wrong area or accidentally highlight some blank cells.

• Is there a gene with only one or zero non-missing values? If so, the imputation will fail.

Sometimes SAM will run out of memory, especially if the dataset is large. The memory demands during the imputation phase, coupled with other demands during the SAM phase, can cause SAM to bomb. In such cases the imputation typically goes through. One can save the workbook, exit Excel, and then rerun SAM on the imputed data.

25. Why does the random number seed stay the same? Can you not generate a new seed automatically?

The random number seed allows one to reproduce an analysis. By default, it is set to 1234567. However, if one uses the default seed for every analysis, then the same sequence of permutations is generated. This is not always desirable. It would appear that generating a seed randomly, using the clock or some such mechanism, without bothering the user for input might be better. Not necessarily. If reproducibility is important, then asking the user to set the seed is preferable, so that any analysis can be rerun to confirm results. We have come down on the side of reproducibility. The user always has the choice of requesting a randomly generated seed based on the clock by clicking on the Generate Random Seed button. Please also note that the random number generator seed used in any analysis is always listed in the output to ensure repro
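The trade-off between a fixed seed and a clock-based seed can be illustrated with a short Python sketch. It is not SAM's own generator: `permuted_labels` is a hypothetical helper, and Python's `random.Random` stands in for whatever generator the add-in uses. Only the default seed value 1234567 comes from the text above.

```python
import random
import time

def permuted_labels(labels, n_perm, seed):
    """Return n_perm shuffled copies of labels, fully determined by seed."""
    rng = random.Random(seed)          # private generator, so the seed is explicit
    result = []
    for _ in range(n_perm):
        perm = list(labels)
        rng.shuffle(perm)
        result.append(perm)
    return result

labels = [1, 1, 1, 1, 2, 2, 2, 2]

# Same seed, same sequence of permutations: the analysis can be rerun exactly.
first = permuted_labels(labels, 5, seed=1234567)
second = permuted_labels(labels, 5, seed=1234567)
print(first == second)

# A clock-based seed gives a fresh sequence; recording it keeps the run reproducible.
clock_seed = int(time.time())
fresh = permuted_labels(labels, 5, seed=clock_seed)
print("seed used:", clock_seed)
```

Printing the seed alongside the results mirrors the manual's point that the seed is always listed in the output.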
27. roup 2 than group 1. For a survival time response, a positive score means people with higher expression have longer survival times. The statements are all reversed for negative scores.

13 Interpretation of SAM output

The three panels of figure 4 show the SAM plots for three different datasets. There are 1000 genes in each of the datasets, and 8 samples, 4 each in control and treatment conditions. We carried out SAM analysis using the unpaired 2-class option. The corresponding false positive tables are shown in table 4.

In dataset A, there are a number of genes above the band in the upper right and below the band in the bottom left. Looking at table 4, we chose $\Delta = 0.5$, producing about 65 significant genes and about 5.9 false positives on average. The choice of $\Delta$ is up to the user, depending on how many false positives he/she is comfortable with. Note that the SAM plots can be asymmetric, in that sometimes there will be significant genes in the top right but not the bottom left, or vice versa.

In dataset B, there may be no significant genes. With $\Delta = 0.5$ (shown in the plot), there are 2 called genes but about 1.3 false positive genes on average.

In dataset C, there are many significant genes. If $\Delta = 0.3$, then nearly 800 genes are called significant and there are only about 23 false positives on average.

This data was generated as

$$x_{ij} = z_{ij} + \mu_{ij} \tag{13.1}$$

for gene $i = 1, 2, \ldots, 1000$, sample $j = 1, 2, \ldots, 8$. The first four samples are from group 1
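28. sored.

Let $D$ be the indices of the $K$ unique death times $z_1, z_2, \ldots, z_K$. Let $R_1, R_2, \ldots, R_K$ be the indices of the observations at risk at these unique death times, that is, $R_k = \{i : t_i \ge z_k\}$. Let $m_k$ = # in $R_k$. Let $d_k$ be the number of deaths at time $z_k$, and let $x^*_{ik} = \sum_{t_j = z_k} x_{ij}$ and $\bar{x}_{ik} = \sum_{j \in R_k} x_{ij} / m_k$.

$$r_i = \sum_{k=1}^{K} \left[ x^*_{ik} - d_k \bar{x}_{ik} \right], \qquad s_i = \left[ \sum_{k=1}^{K} (d_k / m_k) \sum_{j \in R_k} (x_{ij} - \bar{x}_{ik})^2 \right]^{1/2} \tag{14.5}$$

Multiclass response: $y_j \in \{1, 2, \ldots, K\}$

Let $C_k$ = indices of observations in class $k$, $n_k$ = # in $C_k$, $\bar{x}_{ik} = \sum_{j \in C_k} x_{ij} / n_k$, and $\bar{x}_i = \sum_j x_{ij} / n$.

$$r_i = \left[ \left\{ \sum_k n_k \Big/ \prod_k n_k \right\} \sum_{k=1}^{K} n_k (\bar{x}_{ik} - \bar{x}_i)^2 \right]^{1/2} \tag{14.6}$$

$$s_i = \left[ \frac{1}{\sum_k (n_k - 1)} \cdot \left( \sum_k \frac{1}{n_k} \right) \sum_{k=1}^{K} \sum_{j \in C_k} (x_{ij} - \bar{x}_{ik})^2 \right]^{1/2} \tag{14.7, 14.8}$$

Paired data: $y_j \in \{-1, 1, -2, 2, \ldots, -K, K\}$

Observation $-k$ is paired with observation $k$. Let $j(d)$ be the index of the observation having $y_j = d$.

$$z_{ik} = x_{ij(k)} - x_{ij(-k)} \tag{14.9}$$

$$r_i = \sum_k z_{ik} / K \tag{14.10}$$

$$s_i = \left[ \sum_k (z_{ik} - r_i)^2 \Big/ \{K(K - 1)\} \right]^{1/2} \tag{14.11}$$

One class data: $y_j = 1 \;\; \forall j$

$$r_i = \bar{x}_i = \sum_j x_{ij} / n, \qquad s_i = \left[ \sum_j (x_{ij} - \bar{x}_i)^2 \Big/ \{n(n - 1)\} \right]^{1/2} \tag{14.12}$$

14.3 Details of Permutation Schemes

For unpaired, quantitative, Multiclass and Survival data, we do simple permutations of the $n$ values $y_j$. For Paired data, random exchanges are performed within each $-k$, $k$ pair. For One class data, the set of expression values for each experiment is multiplied by $+1$ or $-1$ with equal probability.

If blocks are specified, the permutations are restricted to be within blocks, as described earlier.

15 Frequently Asked Questions

15.1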
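The permutation schemes of section 14.3 can be sketched as follows. This is a minimal Python illustration with hypothetical function names, assuming the response is held in a plain list `y` (with the $-k$, $k$ coding for paired data) and the expression data in a list of per-gene rows; it is not taken from the SAM source.

```python
import random

def permute_unpaired(y, rng):
    """Unpaired, quantitative, multiclass, survival: shuffle the n response values."""
    perm = list(y)
    rng.shuffle(perm)
    return perm

def permute_paired(y, rng):
    """Paired data: exchange the members of each (-k, k) pair with probability 1/2."""
    swap = {k: rng.random() < 0.5 for k in sorted({abs(v) for v in y})}
    return [-v if swap[abs(v)] else v for v in y]

def flip_one_class(data, rng):
    """One class: multiply each experiment (column) by +1 or -1 with equal probability."""
    signs = [rng.choice((1, -1)) for _ in range(len(data[0]))]
    return [[x * sgn for x, sgn in zip(row, signs)] for row in data]

def permute_within_blocks(y, blocks, rng):
    """Blocked design: shuffle response values only among samples of the same block."""
    out = list(y)
    for b in sorted(set(blocks)):
        idx = [j for j, bj in enumerate(blocks) if bj == b]
        vals = [y[j] for j in idx]
        rng.shuffle(vals)
        for j, v in zip(idx, vals):
            out[j] = v
    return out

rng = random.Random(1234567)
print(permute_unpaired([1, 1, 2, 2], rng))
print(permute_paired([-1, 1, -2, 2, -3, 3], rng))
print(flip_one_class([[0.5, -1.2, 2.0]], rng))
print(permute_within_blocks([1, 2, 1, 2], ["a", "a", "b", "b"], rng))
```

Blocks and pair labels are visited in sorted order so that a given seed always produces the same permutations.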
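29. $< 0$ or $\bar{x}_{i2} < 0$, or both, are automatically left off the significant gene list, as their fold change cannot be unambiguously determined. When such fold changes are reported in output, they are indicated by NA.

14.1 Computation of $s_0$

1. Let $s^\alpha$ be the $\alpha$ percentile of the $s_i$ values. Let $d_i^\alpha = r_i / (s_i + s^\alpha)$.
2. Compute the 100 quantiles of the $s_i$ values, denoted by $q_1 < q_2 < \ldots < q_{100}$.
3. For $\alpha \in (0, 0.05, 0.10, \ldots, 1.0)$:
(a) Compute $v_j = \mathrm{mad}\{d_i^\alpha \mid s_i \in [q_j, q_{j+1})\}$, $j = 1, 2, \ldots, n$, where mad is the median absolute deviation from the median, divided by 0.64.
(b) Compute $cv(\alpha)$ = coefficient of variation of the $v_j$ values.
4. Choose $\hat{\alpha} = \mathrm{argmin}[cv(\alpha)]$. Finally compute $\hat{s}_0 = s^{\hat{\alpha}}$. $s_0$ is henceforth fixed at the value $\hat{s}_0$.

14.2 Details of $r_i$ and $s_i$ for different response types

Quantitative response: $r_i$ is the linear regression coefficient

$$r_i = \frac{\sum_j y_j (x_{ij} - \bar{x}_i)}{\sum_j (y_j - \bar{y})^2} \tag{14.2}$$

where $\bar{x}_i = \sum_j x_{ij} / n$, and $s_i$ is the standard error of $r_i$:

$$s_i = \frac{\hat{\sigma}_i}{\left[ \sum_j (y_j - \bar{y})^2 \right]^{1/2}} \tag{14.3}$$

and $\hat{\sigma}_i$ is the square root of residual error:

$$\hat{\sigma}_i = \left[ \frac{\sum_j (x_{ij} - \hat{x}_{ij})^2}{n - 2} \right]^{1/2}, \qquad \hat{x}_{ij} = \hat{\beta}_{i0} + r_i y_j, \qquad \hat{\beta}_{i0} = \bar{x}_i - r_i \bar{y} \tag{14.4}$$

Two class, unpaired data: $y_j = 1$ or $2$

Let $C_k = \{j : y_j = k\}$ for $k = 1, 2$. Let $n_k$ = # of observations in $C_k$. Let $\bar{x}_{i1} = \sum_{j \in C_1} x_{ij} / n_1$ and $\bar{x}_{i2} = \sum_{j \in C_2} x_{ij} / n_2$.

$$r_i = \bar{x}_{i2} - \bar{x}_{i1}$$

$$s_i = \left[ (1/n_1 + 1/n_2) \left\{ \sum_{j \in C_1} (x_{ij} - \bar{x}_{i1})^2 + \sum_{j \in C_2} (x_{ij} - \bar{x}_{i2})^2 \right\} \Big/ (n_1 + n_2 - 2) \right]^{1/2}$$

Censored survival data: $y_j = (t_j, \Delta_j)$. $t_j$ is time, $\Delta_j = 1$ if observation is a death, 0 if cen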
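The $s_0$ selection of section 14.1 can be sketched in Python as below. This is a rough illustration under several assumptions that the manual does not settle: percentiles use linear interpolation, the bins are the 100 intervals between consecutive quantiles of the $s_i$, bins with fewer than two genes are skipped, and the coefficient of variation is taken as standard deviation over mean. The names `percentile`, `mad` and `choose_s0` are hypothetical, and the input scores are simulated.

```python
import random
import statistics

def percentile(sorted_vals, frac):
    """Percentile of an already sorted list by linear interpolation (assumed rule)."""
    pos = frac * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def mad(values):
    """Median absolute deviation from the median, divided by 0.64 (step 3a)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values]) / 0.64

def choose_s0(r, s, n_bins=100):
    """Return (s0_hat, alpha_hat) minimising the coefficient of variation of the v_j."""
    s_sorted = sorted(s)
    edges = [percentile(s_sorted, k / n_bins) for k in range(n_bins + 1)]  # step 2
    best = None
    for step in range(21):                       # alpha = 0, 0.05, ..., 1.0
        alpha = step * 0.05
        s_alpha = percentile(s_sorted, alpha)
        d = [ri / (si + s_alpha) for ri, si in zip(r, s)]
        v = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = [di for di, si in zip(d, s) if lo <= si < hi]
            if len(in_bin) >= 2:                 # assumption: skip near-empty bins
                v.append(mad(in_bin))
        cv = statistics.stdev(v) / statistics.mean(v)
        if best is None or cv < best[0]:
            best = (cv, alpha, s_alpha)
    return best[2], best[1]

# Simulated scores for 500 genes (illustrative only).
rng = random.Random(1234567)
s = [rng.uniform(0.5, 2.0) for _ in range(500)]
r = [rng.gauss(0.0, si) for si in s]
s0_hat, alpha_hat = choose_s0(r, s)
print(s0_hat, alpha_hat)
```

The idea behind the criterion is that a good $s_0$ makes the spread of $d_i$ roughly the same across bins of $s_i$, so that the score does not depend on the gene's variance level.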
30. the second four from group 2. Here $z_{ij} \sim N(0, 1)$ (standard normal), $\mu_{ij} = 0$ for $j \le 4$, and $\mu_{ij} = \delta_i \sim N(0, 4)$ for $j > 4$. Hence all genes have a true change $\delta_i$ in expression from group 2 vs group 1, although it may be small.

In the interpretation of the SAM results, one should also look at the score $d_i$, which is the standardized change in expression. A value of $d_i = 0.5$, say, may be called statistically significant in example C, but is it biologically significant? That is up to the scientist. Another way to address this issue is to set a non-zero fold change for calling genes. With a moderate fold change, say 2, far fewer genes will be called in this example.

14 Technical details of the SAM procedure

The data is $x_{ij}$, $i = 1, 2, \ldots, p$ genes, $j = 1, 2, \ldots, n$ samples, and response data $y_j$, $j = 1, 2, \ldots, n$ ($y_j$ may be a vector). Here is the generic SAM procedure:

1. Compute a statistic

$$d_i = \frac{r_i}{s_i + s_0}; \quad i = 1, 2, \ldots, p \tag{14.1}$$

$r_i$ is a score, $s_i$ is a standard deviation, and $s_0$ is a fudge factor. Details of these quantities are given later in this note.

2. Compute order statistics $d_{(1)} \le d_{(2)} \le \ldots \le d_{(p)}$.

3. Take $B$ sets of permutations of the response values $y_j$. For each permutation $b$, compute statistics $d_i^{*b}$ and corresponding order statistics $d_{(1)}^{*b} \le d_{(2)}^{*b} \le \ldots \le d_{(p)}^{*b}$.

4. From the set of $B$ permutations, estimate the expected order statistics by $\bar{d}_{(i)} = (1/B) \sum_b d_{(i)}^{*b}$ for $i = 1, 2, \ldots, p$.

5. Plot the $d_{(i)}$ values versus the $\bar{d}_{(i)}$.
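Steps 1 to 5 can be sketched for the two class, unpaired case, using data simulated as in (13.1). This is an illustration only, under several assumptions: $N(0, 4)$ is read as variance 4 (standard deviation 2), the two class formulas for $r_i$ and $s_i$ from section 14.2 are used, $s_0$ is fixed at an arbitrary 0.1 instead of being estimated, and only 200 genes and 20 permutations are used to keep it fast. The function names are hypothetical.

```python
import random
from math import sqrt

def two_class_d(row, labels, s0):
    """d_i = r_i / (s_i + s0) for one gene, two class unpaired response (labels 1 or 2)."""
    g1 = [x for x, y in zip(row, labels) if y == 1]
    g2 = [x for x, y in zip(row, labels) if y == 2]
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
    s = sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (m2 - m1) / (s + s0)

def sam_order_statistics(data, labels, s0, n_perm, rng):
    """Steps 1-4: observed order statistics and their permutation expectation."""
    observed = sorted(two_class_d(row, labels, s0) for row in data)
    expected = [0.0] * len(data)
    for _ in range(n_perm):
        perm = list(labels)
        rng.shuffle(perm)                       # step 3: permute the response
        d_star = sorted(two_class_d(row, perm, s0) for row in data)
        for i, value in enumerate(d_star):
            expected[i] += value / n_perm       # step 4: average per rank
    return observed, expected

# Simulate x_ij = z_ij + mu_ij as in (13.1), with fewer genes than the manual's 1000.
rng = random.Random(1234567)
labels = [1, 1, 1, 1, 2, 2, 2, 2]
data = []
for _ in range(200):
    delta = rng.gauss(0.0, 2.0)                 # assumed: N(0, 4) means variance 4
    data.append([rng.gauss(0.0, 1.0) + (delta if y == 2 else 0.0) for y in labels])

observed, expected = sam_order_statistics(data, labels, s0=0.1, n_perm=20, rng=rng)
# Step 5 would plot observed against expected; here we print the extreme ranks.
print(observed[0], expected[0], observed[-1], expected[-1])
```

Genes whose observed score departs from the expected score by more than $\Delta$ at either end of the ranking are the candidates that the later steps call significant.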
31. Multiclass: Integer 1, 2, 3, ...
Paired: Integer -1, 1, -2, 2, etc. (e.g. - means before treatment, + means after treatment; -1 is paired with 1, -2 is paired with 2, etc.)
Survival data: (Time, status) pair like (50,1) or (120,0). First number is survival time, second is status (1 = died, 0 = censored).
One class: Integer, every entry equal to 1.

Table 1: Response Formats

A quantitative response is real-valued, such as blood pressure. Two class (unpaired) groups are two sets of measurements in which the experimental units are all different in the two groups, for example control and treatment groups with samples from different patients. With a Multiclass response, there are more than two groups, each containing different experimental units. This is a generalization of the unpaired setup to more than 2 groups. Paired groups are two sets of measurements in which the same experimental unit is measured in each group, for example samples from the same patient measured before and after a treatment. Survival data consists of a time until an event (such as death or relapse), possibly censored. In the One class problem, we are testing whether the mean gene expression differs from zero. For example, each measurement might be the log(red/green) ratio from two labelled samples hybridized to a cDNA chip, with green denoting before treatment and red after treatment. Here the response measurement is redundant and is set equal to all 1s. Sometimes it is diff
32. xed.
• A bug relating to what SAM perceives as a large number of permutations was fixed. The default was very naive.
• A bug in adding the imputed data sheets for multiple sheets was fixed. See last para in section 12.1.

2.2 Changes in SAM 1.20

• SAM can now handle a large number of samples. Input data can span several sheets, contiguous or non-contiguous. An example file named twoclassbig.xls is included with the distribution. For more details on using multiple sheets, see
• A bug in the calculation of FDR for paired data with a fold change specified was fixed.

Versions 1.16 to 1.19 were skipped.

2.3 Changes in SAM 1.15

Bug-fix release. A bug that caused SAM to bomb during the calculation of $\hat{\pi}_0$ was fixed.

2.4 Changes in SAM 1.13

Bug-fix release. A bug was fixed in the calculations for Censored Survival data. Everyone is advised to upgrade to this version.

2.5 Changes in SAM 1.12

This is mostly a bug-fix release. Users of SAM 1.10 should immediately upgrade to this release. Uninstall the previous version and install the new one per the instructions in section 7 2 6

Bug fix: An error in the calculation of the fold change was fixed. The criterion for applying fold change to significant genes was also corrected. We thank alert users for catching this.

By popular request, a new column called Fold Change has been added to the significant genes list. This applies only to Two class and Paired responses. Where the fold change can
33. ys register again and use another email address that works.

15.3 Installation/Uninstallation Questions

1. How do I uninstall SAM?

To uninstall, one pretty much reverses the steps in the install process. However, please make sure you do it in the following order:

(a) First you must unlink SAM from the list of Addins loaded into Excel. The list of addins is available by choosing the Addins item from the Tools menu.
(b) SAM can be uninstalled via the Control Panel. Double-click Add/Remove Programs and double-click on Significance Analysis of Microarrays.

2. How do I install a newly released version of SAM? Do I just install it on top of the old version?

Installing new software on top of old versions is a good way to hose your Windows machine. If you want to preserve the little sanity that Windows has, you must first uninstall the old version and then install the new version.

3. I just downloaded your SAM program from your website and am having difficulty installing it. When I try to run setup.exe, it says something about not finding a folder.

This is most likely due to the peculiarities of your computer.

• First, make sure that your computer has sufficient disk space. It's an easy thing to forget, especially with the amount of crud that Internet Explorer keeps piling up in temporary folders.
• Extract sam.zip to a directory on your C or D drive, say sam. You'll need an extractor like WinZip or equivalent. We'll assume th
