000000065.sbu - DSpace Home

1. Table 4.4: Splitting method comparison for head and neck cancer study

The score Random Forest (s-RF) is generated by running 200 score-CART (s-CART) trees. The average scores are reported in the table below. In comparison with s-CART, s-RF misclassified one subject (38) and improved the accuracy. Table 4.6 shows the comparison of eight classifiers in the head and neck cancer study. s-CART achieves a better classification than a single CART tree, while s-RF is better than Random Forest. The average score gives s-RF the advantage to achieve the best classification accuracy among the eight classifiers.

Table 4.5

ID   Truth  CART  s-CART  s-RF
1    1      1     0.667   0.84
2    1      1     1       0.74
3    1      1     1       0.69
4    1      1     1       0.78
5    0      1     0.333   0.47
6    0      0     0       0.20
7    0      0     0       0.03
8    0      0     0       0.02
9    0      0     0       0.00
10   0      0     0       0.41
11   0      0     0       0.46
12   0      0     0       0.37
13   1      0     0.667   0.74
14   0      0     0.167   0.10
15   0      0     0       0.12
16   0      0     0       0.45
17   0      1     0.667   0.39
18   0      0     0       0.09
19   0      0     0       0.25
20   0      0     0       0.03
21   0      0     0       0.07
22   0      0     0       0.01
23   0      0     0       0.43
24   1      1     1       0.58
25   0      0     0.167   0.31
26   0      0     0       0.17
27   1      1     0.667   0.83
28   0      0     0       0.34
29   1      1     1       0.86
30   1      1     1       0.50
31   0      0     0       0.11
32   1      1     1       0.95
33   0      0     0       0.12
34   0      0     0       0.22
35   0      0     0       0.17
36 1 1 1 0
2. [Figure 6.1: Automated gene-protein integration system. Flow-chart labels: Gene Sequences; blastx; tblastn; RefSeq NP Accession; Protein Sequences; blastp against RefSeq Amino Acid Sequences; NCBI GI Accession]

would have the freedom to access the module online and integrate their own gene-protein database with ease. Figure 6.1 illustrates the flow chart for developing a fully automatic web-based integration for matching gene-protein data. It will be done by co-referencing the microarray data and LC-tandem MS (also referred to as LC-MS/MS) data from the same study to the NCBI reference sequence database.

Bibliography

[Adam02] Adam, B.L., Qu, Y., Davis, J.W., Ward, M.D., Clements, M.A., Cazares, L.H., Semmes, O.J., Schellhammer, P.F., Yasui, Y., Feng, Z., and Wright, G.L. Jr. (2002). Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Research 62(13), pp. 3609-14.
[Adl81] Adler, R.J. (1981). The Geometry of Random Fields. John Wiley & Sons, New York.
[Alba04] Alba, R., Fei, Z., Payton, P., Liu, Y., Moore, S.L., Debbie, P., Cohn, J., D'Ascenzo, M., Gordon, J.S., Rose, J.K.C., Martin, G., Tanksley, S.D., Bouzayen, M., Jahn, M.M., and Giovannoni, J. (2004). ESTs, cDNA microarrays, and gene expression profiling: tools for dissecting plant physiology and development. Plant J. 39, pp. 697-714.
[Alt90] Altschul, S.F., Gish, W., Mi
3.
gi113606   NP_000025.1   58.5   1, 2
gi113607   NP_000025.1   40     1, 2
gi113608   NP_000025.1   46.5   1, 2
gi113609   NP_000025.1   33     2

Table 5.3: Taking the average to get the final gene and protein abundances

The protein abundances are denoted by the number of peptides, which is normalized by the median of the experiment. The mRNA abundances are denoted by the normalized signal intensities of the microarray gene chips. For the 148 genes, the protein abundances range from 0.29 to 118.36 and the average gene expressions of 5 platelet chips range from 0.76 to 16.75.

5.3 Correlation analysis

We investigate the gene-protein correlation using three different methods: Pearson correlation, Spearman rank correlation, and the usual canonical correlation. Among 143 gene-protein pairs there are only 120 whose protein abundances can be detected in both runs of the proteomic mass spectrometer. To calculate the canonical correlation we will focus on these 120 pairs. In Figure 5.5, neither protein nor gene data has a normal distribution; thus we perform the Box-Cox transformation.

[Figure 5.4: 143 gene-protein pairs. Axes: mRNA abundance (x), protein abundance (log) (y)]

[Figure residue: panels with the labels "hemoglobin gamma G" and "mRNA abundance"]
4.
1.4 Thesis structure and overview
2 Data Acquisition and Quality Control
  2.1 Data acquisition
  2.2 Data quality control
3 Data Preprocessing, Biomarker Detection and Classification
  3.1 Data preprocessing
  3.2 Biomarker detection
  3.3 Classification methods
  3.4 Results
  3.5 Extension to multiple group classification
4 Scoring Method for CART and Random Forest
  4.1 Classification and regression trees
    4.1.1 Tree growing
    4.1.2 Tree pruning
  4.2 Random forests
    4.2.1 Bagging sampling
    4.2.2 Random forests generation
    4.2.3 Variable importance
  4.3 score-CART and score-Random Forest
    4.3.1 From s-CART to s-RF
    4.3.2 Test results
5 Correlation of Proteomic and Genomic Data
  5.1 Data acquisition
  5.2 Integration of gene and protein database
  5.3 Correlation analysis
  5.4 Codon adaptation index
  5.5 Tryptic adjustment
5. If a marker is related to the disease, then we have

\[ X_i \overset{iid}{\sim} (\mu_1, \sigma^2), \quad i = 1, \ldots, n_1 \quad \text{(Disease Stage 1)} \]
\[ X_i \overset{iid}{\sim} (\mu_2, \sigma^2), \quad i = n_1 + 1, \ldots, n_1 + n_2 \quad \text{(Disease Stage 2)} \]
\[ X_i \overset{iid}{\sim} (\mu_3, \sigma^2), \quad i = n_1 + n_2 + 1, \ldots, N \quad \text{(Normal Control)} \]

The expected sample variance of a disease-unrelated marker is \(\sigma^2\). The expected sample variance of a disease-related marker is

\[ E(S^2) = \sigma^2 + \Delta > \sigma^2, \]

where, with \(n_3 = N - n_1 - n_2\),

\[ \Delta = \frac{n_1 n_2 (\mu_1 - \mu_2)^2 + n_1 n_3 (\mu_1 - \mu_3)^2 + n_2 n_3 (\mu_2 - \mu_3)^2}{N (N - 1)}. \]

In summary, we propose a novel approach to identify proteomic biomarkers using the variance component analysis method. Our approach is suitable not only for two-group but also for multi-group classification. Furthermore, it can be utilized to examine the consistency between the known data and the blinded data by comparing the pooled variance at each marker between the testing and the training data sets. This would indicate whether it is reasonable to classify the testing data using the given training data.

Chapter 4
Scoring Method for CART and Random Forest

The tree-based classification and regression method is called CART. It learns to extract the hidden patterns in the training data and can provide predictive information for future data. Random Forest (RF) combines many classification trees. Conventionally, those two classifiers give binary classification results. In this chapter we first introduce CART and RF brief
6. [Figure 3.2: Data preprocessing. x-axis: m/z]

In the head and neck cancer study, each raw mass spectrum consists of 34,378 mass-to-charge ratio (m/z) values ranging from 0 to 100,000. The m/z range of 2,000 to 20,000 is selected because the lower MS range is too noisy and the signal is too sparse in the higher MS zone. These mass spectra were also standardized and smoothed using the method developed by Zhu and colleagues (2003) (Figure 3.2). Now the mass spectra are aligned on a common scale and ready for the next two steps of analysis.

3.2 Biomarker detection

We will present three algorithms. All of them are based on the statgram. Method 1 detects the biomarkers over the entire m/z range. Method 2 employs a peak detection algorithm and looks for the significant biomarkers at the peak with maximum intensity. The focus of Method 3 is on those disease-related markers that appear frequently in the peak region. The new biomarker is the peak area instead of a single marker intensity. This method is applied together with the variance component analysis. In the last section, the Head and Neck data is investigated by the three methods. There are 73 samples that have head and neck squamous cell carcinoma (HNSCC) and 76 that are normal controls. In the validation set, 49 samples (22 HNSCC and 27 control) will be classified using the detected biomarkers.

Method 1: Zhu's continuous biomarker approach
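The m/z range selection described in this excerpt can be sketched as follows. This is a minimal illustration, not the proteoExplorer implementation: the function name `select_mz_range` and the toy spectrum are hypothetical, the bounds come from the text, and treating the bounds as inclusive is an assumption.

```python
def select_mz_range(mz, intensity, low=2000.0, high=20000.0):
    """Keep only the points whose m/z lies in [low, high] (inclusive: an assumption)."""
    kept = [(m, y) for m, y in zip(mz, intensity) if low <= m <= high]
    return [m for m, _ in kept], [y for _, y in kept]


# Toy spectrum (hypothetical values): two points fall outside the selected range.
mz = [500.0, 2000.0, 8195.01, 20000.0, 45000.0]
intensity = [9.1, 3.2, 7.5, 0.4, 0.0]

sel_mz, sel_int = select_mz_range(mz, intensity)
print(sel_mz)
print(sel_int)
```

Smoothing, baseline correction and normalization would then be applied to the retained points only.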
7. 20,000 is selected because there is noise in the range of m/z below 2,000 and the signal is almost zero for m/z above 20,000.

[Screenshot: Preprocessing window, Parameter Selection page]

Click Save Profile, select the directory, and type the name of the file to save the preprocessing parameters. The file has the extension .prp. Should you decide to create and rename a new directory, please press the ENTER key on your keyboard after typing the name of the new folder to confirm the new folder name. For example, create a new directory para and save the parameters to proteoExplorer\para\prep1.prp. Note: you only need to type prep1; the .prp extension will be added automatically. Click OK in the Choose a save location dialog. The Parameter Selection Page will change to the Batch Processing Page automatically, and the most recently saved parameter filename will appear in the Profile Setting textbox automatically as well.

[Screenshot: Preprocessing window, Batch Processing page]

We will now perform preproce
8.
ID   Truth  entropy  index  ratio  entropy†  index†  ratio†  s-CART
24   1      1        1      1      1         1       1       1
25   0      0        0      0      1         0       0       0.167
26   0      0        0      0      0         0       0       0
27   1      1        0      0      1         1       1       0.667
28   0      0        0      0      0         0       0       0
29   1      1        1      1      1         1       1       1
30   1      1        1      1      1         1       1       1
31   0      0        0      0      0         0       0       0
32   1      1        1      1      1         1       1       1
33   0      0        0      0      0         0       0       0
34   0      0        0      0      0         0       0       0
35   0      0        0      0      0         0       0       0
36   1      1        1      1      1         1       1       1
37   0      0        0      0      0         0       0       0
38   1      0        0      0      1         1       0       0.333
39   1      1        1      1      1         1       1       1
40   1      1        1      1      1         1       1       1
41   1      1        0      0      1         1       1       0.667
42   1      1        1      1      1         1       1       1
43   1      1        1      1      1         1       1       1
44   1      1        1      1      1         1       1       1
45   1      1        1      1      1         1       1       1
46   1      1        1      1      1         1       1       1
47   1      1        1      1      1         1       1       1
48   1      1        1      1      1         1       1       1
49   0      1        0      0      1         1       1       0.667

Table 4.3: Classification results on testing samples of different CART trees constructed by different splitting methods. ID is the testing sample index; entropy is Quinlan's entropy information gain method; index is the Gini diversity index; ratio is the Gini ratio; entropy† is entropy information gain with Marshall correction; index† is the Gini diversity index with Marshall correction; ratio† is the Gini ratio with Marshall correction.

Splitting method               Node Number  Classification Accuracy (%)
Entropy Information Gain       11           83.67
  + Marshall Correction        23           87.76
Gini Index                     3            87.76
  + Marshall Correction        13           91.84
Gini Ratio                     3            89.90
  + Marshall Correction        13           91.84
s-CART                         64           93.88
9. [Screenshot: proteoExplorer main window with the Preprocessing sub-window open]

File > Open New Spectrum: open a spectrum from the directory dialog directly.

Display Options > Show All: show all spectra at each preprocessing step. The current preprocessed spectrum is highlighted. In the following example, one spectrum is smoothed with the parameter 0.003 and baseline corrected with the parameter 3. The three spectra (raw, smoothed, and smoothed baseline-corrected) are displayed simultaneously in the visualization window. One can tune the Display Toolbar to highlight the desired spectrum.

The raw spectrum:

[Screenshot: Preprocessing window with the raw spectrum highlighted]

The smoothed spectrum:

[Screenshot: Preprocessing window with the smoothed spectrum highlighted]
10. Evaluation of two-dimensional electrophoresis and liquid chromatography tandem mass spectrometry for tissue-specific protein profiling of laser-microdissected plant samples. Electrophoresis, Jul; 26(14), pp. 2729-38.
[Sor03] Sorace, J.M., Zhan, M. (2003). A data review and re-assessment of ovarian cancer serum proteomic profiling. BMC Bioinformatics 4(1), pp. 24.
[Sri02] Srinivas, P.R., Verma, M., Zhao, Y., and Srivastava, S. (2002). Proteomics for cancer biomarker discovery. Clinical Chemistry 48, pp. 1160-1169.
[Tam00] Tamhane, A.C. and Dunlop, D.D. (2000). Statistics and Data Analysis: from elementary to intermediate. Prentice Hall, Upper Saddle River, NJ.
[Tay02] Taylor, J., R.D. King, T. Altmann, and O. Fiehn (2002). Application of metabolomics to plant genotype discrimination using statistics and machine learning. Bioinformatics 18, pp. S241-S248.
[Vee04] Veenstra, T.D., Prieto, D.A., Conrads, T.P. (2004). Proteomic patterns for early cancer detection. Drug Discovery Today 9(20), pp. 889-97.
[Ver01] Verma, M., Wright, G.L. Jr., Hanash, S.M., Gopal-Srivastava, R., Srivastava, S. (2001). Proteomic approaches within the NCI early detection research network for the discovery and identification of cancer biomarkers. Ann. N. Y. Acad. Sci. 945, pp. 103-15.
[Wads04] Wadsworth, J.T., Somers, K.D., Cazares, L.H., Malik, G., Adam, B.L., Stack, B.C. Jr., Wright, G.L. Jr., Semmes, O.J. (2004). Serum protein profiles to i
11. F, where α is the corrected experimentwise error rate, u and v are the degrees of freedom of the F statistic, f is the threshold of the test, and FWHM determines the Gaussian kernel; it is a constant indicating the number of biomarkers averaged in the smoothing.

2. Stepwise Discriminant Analysis

It begins, like forward selection, with no variables in the model. At each step the model is examined. If the variable in the model that contributes least to the discriminatory power of the model (as measured by the rule below) fails to meet the criterion to stay, then that variable is removed. Otherwise, the variable not in the model that contributes most to the discriminatory power of the model is entered. When all variables in the model meet the criterion to stay and none of the other variables meets the criterion to enter, the stepwise selection process stops. During the stepwise selection, only one variable can be entered into the model at each step. The selection process does not take into account the relationships between variables that have not yet been selected.

The rule is the sequential F test based on a fixed α level. Suppose that individuals belong to one of the two groups G1 and G2, and z = (z_1, ..., z_p) represents a full set of p measurement variables. Assume that the prior probabilities of group membership are equal and that in G_k, z has a p-variate normal distribution with mean vector μ_k and positive definite covarian
12. Feng, Z. (2003). Journal of Biomedicine and Biotechnology 4, pp. 242-248.
[Zhu03] Zhu, W., Wang, X., Ma, Y., Rao, M., Glimm, J., and Kovach, J.S. (2003). Detection of cancer-specific markers amid massive mass spectral data. Proceedings of the National Academy of Sciences 100(25), pp. 14666-14671.
[Zhu05] Zhu, W., Zhang, Y., Neophytou, N., Pradhan, K., Chen, J., Wu, M., Xu, B. (2005). proteoExplorer: Interactive Analysis of Mass Spectrometry-based Proteomics. Copyright application through the State University of New York.
[Zhu05a] Zhu, W., Zhang, Y., Neophytou, N., Pradhan, K., Chen, J., Wu, M., Xu, B. (2005). ViStaMS™: Software for Gene Microarray and SAGE Analysis. Copyright application through the State University of New York.

Appendix A
User Manual of proteoExplorer

Copyright 2005 Research Foundation of State University of New York at Stony Brook. All rights reserved.

Preface

proteoExplorer™ is a customized software for the analysis and visualization of large-scale proteomic mass spectrum datasets. The combination of modern mathematical/statistical methodologies and advanced computer graphical technologies provides the user with a novel environment for an informative and enjoyable data mining experience. This user manual is prepared for both experienced data analysts and novices. Detailed data examples and screenshots are provided for each functionality. In particular, a flow cha
13. HBA1 alpha 2 globin 5 HBB beta globin AKR7A2 aldo keto reductase family 7 member A2 5 HBE1 actinin alpha 1 6 LDHA lactate dehydrogenase A 7 TXN thioredoxin 7 PCMT1 protein L isoaspartate D aspartate 8 CAPZA1 F actin capping protein alpha 1 subunit 9 EHD3 myosin heavy polypeptide 2 skeletal muscle adult

Table 5.9: The gene symbols and names in nine clusters

Chapter 6
Conclusion and Future Work

This thesis has focused on the discovery of genomics and proteomics knowledge by mining bioinformatics literature. In the last few years there has been a lot of interest within the scientific community in sorting through this ever-growing volume of literature to find the information most relevant and useful for specific analysis tasks. We extend the available knowledge and provide new strategies in device data acquisition, biomarker detection, classifier combination, and data integration.

6.1 Original contribution to knowledge

This thesis makes the following original contributions to knowledge:
1. A new data acquisition algorithm for proteomic ProteinChip SELDI data.
2. F random field theory to determine the threshold for the reproducibility test.
3. Majority k-nearest neighbor classification method. It loops over all possible values for k. Based on the Mahalanobis distance, it takes the majority vote and improves on the classic k-NN method.
4. Total variance analysis is a novel metho
14.
  5.6 Quadrant analysis and clustering
    5.6.1 Quadrant analysis
    5.6.2 Clustering
6 Conclusion and Future Work
  6.1 Original contribution to knowledge
  6.2 Future works
Bibliography
Appendix A User Manual of proteoExplorer
  A.1 Introduction
  A.2 Visualization
    A.2.4 Display features
    A.2.5 Display options
  A.3
    A.3.1 Data preprocessing
    A.3.2 Biomarker detection
    A.3.3 Classification/Prediction
    A.3.4 Visualized biomarker pattern
  A.4 Example of head and neck data
    A.4.1 Data description
    A.4.2 Data preprocessing
    A.4.3 Biomarker selection
    A.4.4 Classification/Prediction

List of Figures
2.1 ProteinChip SELDI Protocol (modified by William E. Grizzle, O. John Semmes, et al., with permission from Ciphergen Biosystems Inc.)
2.2 Cold spots and Hot spots
2.3 m/z 5997.97 and m/z 8195.01
3.2 Data preprocessing
3.3 Biomarker comparison
4.3 s-RF mechanism
5.1 Platelet study: the process of establishing and integrating the gene
15. [Qui86]. The Gini Ratio of a split at node T is based on the Information Gain (Eq. 4.2):

\[ GR(T, X, Q) = \frac{IG(T, X, Q)}{-\sum_{i=1}^{k} P(X = x_i \mid T) \log P(X = x_i \mid T)} \tag{4.5} \]

Marshall Correction. In comparison to the Gini Ratio, Marshall Correction favors attributes which split the examples evenly and avoids those that produce small splits. It multiplies the splitting method by the product of the row totals x_i. Thus it will be the maximum when the row totals are equal.

\[ \text{MarshallCorrection} = x_1 \cdot x_2 \cdots x_k \tag{4.6} \]

Besides the above four common split methods, there are several other methods, such as the misclassification rate, χ² statistic, F statistic, G statistic, Twoing criterion, etc.

The stopping criteria of CART growing are as follows:
1. A certain tree depth is reached.
2. The number of samples at a node is less than a predefined threshold.
3. The node is pure: all samples in the node are in the same category.
4. All potential splits of the node are nonsignificant; an F statistic is given as the measure: SS n 1 SS SS n 2

The tree depth, the leaf node size, and the threshold for the F statistic are control parameters to avoid overfitting a tree. Machine learning is the ideal procedure to find such parameters through the study of the training data.

4.1.2 Tree pruning

One should not make more assumptions than the minimum needed. Thus tree pruning is an important step. It means we require the t
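The gain-ratio idea and the even-split preference of the Marshall correction described in this excerpt can be illustrated with a small sketch. It rests on assumptions that the excerpt does not confirm: the information gain of Eq. 4.2 is taken to be the usual entropy-based gain, logarithms are base 2, and the row totals in the correction are expressed as proportions of the node size. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    info_gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return info_gain / split_info if split_info > 0 else 0.0


def marshall_factor(row_totals):
    """Product of the row totals as proportions (assumed form of the correction)."""
    n = sum(row_totals)
    factor = 1.0
    for t in row_totals:
        factor *= t / n
    return factor


perfect = gain_ratio(["low", "low", "high", "high"], [0, 0, 1, 1])
useless = gain_ratio(["low", "low", "high", "high"], [0, 1, 0, 1])
print(perfect, useless)
print(marshall_factor([2, 2]), marshall_factor([1, 3]))
```

Under these assumptions the factor is largest for an even split (0.25 for two equal branches against 0.1875 for a 1:3 split), which matches the stated preference for attributes that split the examples evenly.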
16. Window and then select Analysis > Read Latest Biomarker Pattern to visualize the pattern. The biomarkers are displayed in red bars.

A.3.4 Visualized biomarker pattern

The user can visualize the selected biomarker pattern with the individual spectrum or the average spectrum. Open the spectrum in the Main Window before reading the biomarker file and then select File > Read Biomarker Pattern. The biomarkers are displayed in red bars.

[Screenshot]

A.4 Example of head and neck data

A.4.1 Data description

                    Training        Training         Testing
Status              HNSCC disease   Normal control   Blinded test
Number of Subjects  73              76               49

Each spectrum has 34,378 data points. All spectra should be saved in the same parent directory.

A.4.2 Data preprocessing

Select Analysis > Preprocessing to open the preprocessing sub-window, then display a spectrum by selecting File > Open New Spectrum. For instance, we open the spectrum with the sample ID 11 in the disease group. The Description Textbox displays the location of the file: proteoExplorer\demodata\disease\HN011.txt

[Screenshot: Preprocessing window, Parameter Selection page]
17. as the input directories. They are the same as the output directories in the preprocessing step.

Input Dir:
  Head & Neck Cancer (disease): proteoExplorer\demodata\Preprocessed\disease
  Normal Control (control): proteoExplorer\demodata\Preprocessed\control
  Blinded (test): proteoExplorer\demodata\Preprocessed\test

The refined and aligned maximum peak intensity data spectra are saved to subdirectories named PeakAligned followed by the parameters set in the Parameter Selection page.

Output Dir:
  Head & Neck Cancer (disease): disease PeakAligned 40 10 4 10 0
  Normal Control (control): control PeakAligned 40 10 4 10 0
  Blinded (test): test PeakAligned 40 10 4 10 0

Peak Area: Same as Maximum Peak Intensity, except there is one more parameter to choose. Input the interval width that determines the area in the Peak Area textbox.

[Screenshot: Biomarker Detection window, Parameter Selection page. Window Size (pts): 40; Noise Window: 10; Noise Coefficient: 4.00; Alignment window (m/z): 10.00; Peak Area (m/z): 20; Biomarker type: area]

Save Profile: Save the profile to a pek file, and the output directories will be as follows:
  Head & Neck Cancer (disease): disease PeakAligned 40 10 4 10 20
  Normal Contro
18. cancer data as the study object. The data is described in Chapter 3. Forty-seven biomarkers are selected by proteoExplorer™ (see the Appendix for the software manual). In Table 4.3, the classification results are shown on the testing samples of these different CART trees. s-CART takes the proportion of the vote as the score. If the score is greater than 0.5, the subject has the disease; otherwise it is normal. Three samples are misclassified: the disease subject 38 is classified as normal, and the normal subjects 17 and 49 are classified as disease. Table 4.4 shows the number of nodes and the classification accuracy of each splitting method. s-CART combines all methods and gives the best accuracy of 93.88%.

ID   Truth  entropy  index  ratio  entropy†  index†  ratio†  s-CART
1    1      1        0      0      1         1       1       0.667
2    1      1        1      1      1         1       1       1
3    1      1        1      1      1         1       1       1
4    1      1        1      1      1         1       1       1
5    0      1        0      0      0         0       1       0.333
6    0      0        0      0      0         0       0       0
7    0      0        0      0      0         0       0       0
8    0      0        0      0      0         0       0       0
9    0      0        0      0      0         0       0       0
10   0      0        0      0      0         0       0       0
11   0      0        0      0      0         0       0       0
12   0      0        0      0      0         0       0       0
13   1      1        0      0      1         1       1       0.667
14   0      0        0      0      0         1       0       0.167
15   0      0        0      0      0         0       0       0
16   0      0        0      0      0         0       0       0
17   0      1        0      1      1         0       1       0.667
18   0      0        0      0      0         0       0       0
19   0      0        0      0      0         0       0       0
20   0      0        0      0      0         0       0       0
21   0      0        0      0      0         0       0       0
22   0      0        0      0      0         0       0       0
23   0      0        0      0      0         0       0       0
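The s-CART scoring rule stated in this excerpt (the score is the proportion of trees voting for disease, with a cut-off of 0.5) can be reproduced from the table rows. The sketch below uses the six per-tree votes of subjects 1, 5 and 17 as listed in Table 4.3; the function names are hypothetical.

```python
def s_cart_score(votes):
    """Proportion of trees that vote 'disease' (1) for one subject."""
    return sum(votes) / len(votes)


def classify(score, threshold=0.5):
    """Disease (1) if the score is greater than the threshold, otherwise normal (0)."""
    return 1 if score > threshold else 0


# subject ID -> (truth, votes of the six CART trees in table column order)
rows = {
    1: (1, [1, 0, 0, 1, 1, 1]),
    5: (0, [1, 0, 0, 0, 0, 1]),
    17: (0, [1, 0, 1, 1, 0, 1]),
}

for subject_id, (truth, votes) in rows.items():
    score = s_cart_score(votes)
    print(subject_id, truth, round(score, 3), classify(score))
```

Subject 17 receives four of six votes for disease, so its score of 0.667 exceeds 0.5 and the normal subject is misclassified, as the text reports.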
19. cleavage enlightened us to make an adjustment. The new protein abundance is the peptide hits divided by the number of trypsin fragments.

1. Before tryptic adjustment:

\[ \text{protein abundance} = \text{peptide hits} \]

[Figure 5.12: Crystal structure of trypsin]

[Figure 5.13: Example of tryptic fragments for profilin. Legend: trypsin cleavage sites; detected peptides]

2. After tryptic adjustment:

\[ \text{protein abundance} = \frac{\text{peptide hits}}{\text{number of trypsin fragments}} \]

In the table below, the Pearson correlation increases from 0.02 to 0.31 and the Spearman correlation increases from 0.04 to 0.27. Both correlations are statistically significant after the tryptic adjustment. There is a small change for the canonical correlation, from 0.53 to 0.55, and it is still significant.

            Before Tryptic Adjustment   After Tryptic Adjustment
            Correlation   P-value       Correlation   P-value
Pearson     0.02          > 0.2         0.31          < 0.05
Spearman    0.04          0.68          0.27          0.0014
Canonical   0.53          < 0.01        0.55          < 0.01

Table 5.6: Tryptic adjustment comparison for the correlation of 120 gene-protein pairs. P-values are calculated by bootstrapping.

A hypothesis test on the change of the correlations is performed, and both p-values for the Pearson and Spearman correlations are smaller than 0.01, which
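The adjustment in this excerpt divides the peptide hits by the number of trypsin fragments. The sketch below shows one way such a fragment count could be obtained. The cleavage rule used here (cut after K or R unless the next residue is P) is the commonly quoted trypsin rule and is an assumption, since the excerpt does not state how fragments were counted; the sequence, the hit count and the function names are hypothetical.

```python
def tryptic_fragments(sequence):
    """In-silico digest: cleave after K or R, except when followed by P (assumed rule)."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])
    return fragments


def adjusted_abundance(peptide_hits, sequence):
    """Peptide hits divided by the number of tryptic fragments."""
    return peptide_hits / len(tryptic_fragments(sequence))


toy_sequence = "MKAAGRPLLKTR"  # hypothetical peptide, not a real protein
print(tryptic_fragments(toy_sequence))
print(adjusted_abundance(6, toy_sequence))
```

A protein that yields many fragments offers more chances for a peptide hit, so dividing by the fragment count puts long and short proteins on a more comparable scale.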
20. corrected mass spectrum and is also necessary for all cases. Use the Preprocessing Window to:
1. Open a single mass spectrum and tune the parameters for each processing step by visualizing the effects of different parameter settings on the given spectrum.
2. Save the selected parameters into a profile for the subsequent batch processing.
3. Choose the dataset folder one wishes to format using the selected parameter setting.
4. Apply the saved preprocessing parameter profile to the chosen dataset folder and format the entire dataset folder using the given parameter setting automatically.

[Screenshot: layout of the Preprocessing Window, with the Spec Display Window, Description Textbox, Display Toolbar, and Parameter Setting areas labeled]

For the above 4 steps, the parameter setting (steps 1 & 2) is done using the Parameter Selection sub-page, and the batch processing (steps 3 & 4) is done with the Batch Processing sub-page. Details are given below. First we introduce the layout of the Preprocessing Window.

Display a single spectrum:

File > Open Last Selected: the selected spectrum is highlighted when multiple spectra are displayed in the Main Window; display that highlighted spectrum in the Preprocessing Window.

[Screenshot: proteoExplorer Preprocessing window]
21. correlation without the normality transformation, indicating that the correlations are uniformly significant, is incorrect (Figure 5.10a). The Pearson correlation with the normality transformation done on both gene and protein data indicates that the correlations are uniformly insignificant (Figure 5.10b). The Pearson correlation with the normality transformation done on the gene data only indicates that the correlations are uniformly significant again (Figure 5.10c). So which one should we report? Although both Figure 5.10b and Figure 5.10c are correct, Figure 5.10b is too conservative, because only one of the two variables is required to be normal for valid statistical results. Thus the correct answer is to report the findings in Figure 5.10c: the Pearson correlations sorted by the top genes are uniformly significant.

The canonical correlations aim to gauge the relationship between two sets of variables directly. Canonical correlation is essentially the Pearson correlation between the linear combination of variables in one set and the linear combination of variables from another set. The pair of linear combinations having the largest correlation is determined first. Next, the pair of linear combinations having the largest correlation among all pairs uncorrelated with

[Figure 5.9: Pearson, Spearman and canonical correlations between gene-protein expression da
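The contrast between the Pearson and Spearman correlations used in this chapter can be seen on a small example. The sketch below is a plain-Python illustration with made-up data, not the thesis code: Pearson measures linear association on the raw values, while Spearman is the Pearson correlation of the ranks and is therefore unchanged by monotone transformations such as a log or Box-Cox transform.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up data: y grows monotonically but not linearly with x.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(r_pearson, r_spearman)
```

Here the rank correlation is exactly 1 while the Pearson correlation is slightly below 1, which is why a normality transformation can change Pearson-based conclusions but not Spearman-based ones.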
22. cross-validation or by more esoteric methods that are not well known in the neural net literature (Specht 1991; Rutkowski 2004).

Support vector machine (SVM)

SVM is a supervised learning method used for classification and regression. The observed m/z data for the ith subject form a vector X_i in R^n. A binary classifier constructs a hyperplane separating cancer subjects from normal subjects in this space. The algorithm we applied here is described by Chang and Lin (2003).

We calculate a score for each classifier. The score is usually a classification probability and is always bounded between 0 and 1. If a binary decision must be given, the subject is classified as diseased when the score is greater than 0.5 and as normal when the score is less than 0.5. To combine the decisions from the four classifiers, we take the median of the four scores. The binary decision is derived from the median score using the same threshold of 0.5.

3.4 Results

The training set consists of 73 patients with cancer and 76 normal controls. The training data is randomly split into two equal parts; we train the classifiers using one part (37 of the cancer cases and 38 of the normal cases) and test using the remainder. We repeat this procedure a thousand times. The average classification sensitivity and specificity are reported in Table 1. We then train the classifiers using the entire tra
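The combination rule in this excerpt (take the median of the four classifier scores, then apply the 0.5 threshold) is easy to sketch. The scores below are made up, the function name is hypothetical, and treating a median of exactly 0.5 as normal is an assumption, since the text only covers scores above and below 0.5.

```python
from statistics import median


def combine_scores(scores, threshold=0.5):
    """Return the median score and the binary decision derived from it."""
    m = median(scores)
    decision = 1 if m > threshold else 0  # a tie at the threshold is treated as normal (assumption)
    return m, decision


# Four made-up classifier scores for two subjects.
diseased_like = combine_scores([0.75, 0.25, 1.0, 0.5])
normal_like = combine_scores([0.25, 0.5, 0.0, 0.75])
print(diseased_like)
print(normal_like)
```

With four scores the median is the mean of the two middle values, so a single extreme classifier cannot flip the decision on its own.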
23. desired average spectrum, for example the group average. In this step the user can:
1. detect any abnormal-looking mass spectra (outliers);
2. check the reproducibility of repeated measures;
3. compare group average spectra, e.g. the diseased group versus the control group;
4. examine whether any processing step, such as baseline correction, has been performed.

Step 2: Data Processing

Data processing usually goes in the order of smoothing, baseline correction, and normalization. In all cases, smoothing is necessary as the first processing step to filter out noise. Baseline correction is an optional processing step, depending on whether the baseline was already corrected during the generation of the mass spectrum, which is implied by the absence of negative intensity values in the mass spectrum. Normalization should be performed on the smoothed and baseline-corrected mass spectrum and is also necessary for all cases. Use the Analysis > Preprocessing Window to:
1. tune the parameters for each processing step by visualizing the effects of different parameters;
2. save the selected parameters into a profile;
3. apply the saved profile to the datasets to be analyzed and run the preprocessing batches automatically.

Step 3: Select Biomarker Type

The user can choose and generate two types of biomarkers for the ensuing classification and prediction: Maximum Peak Intensity and Peak Area. For the choice of Maximum Peak Intensity, one needs to generate the correspond
24. entire range to take the average. The default value is 2,000 - 20,000, which means we use data points with 2,000 < m/z < 20,000 only.

Save Profile: the preprocessing parameters in the Description Textbox will be saved in a .prp file. The file will be used later in the Batch Processing Page to preprocess an entire dataset folder.

Batch Processing Page

[Screenshot: Preprocessing window, Batch Processing page, with the Profile Setting, Disease Dir, Control Dir, Blinded Dir and Output Root Dir fields and the Start Batch button]

Profile Setting: the location of the .prp file with the preprocessing parameters. By default it is the file saved most recently in the Parameter Selection Page.

Disease Dir: the directory of the training data set of subjects with a certain disease or abnormality.

Control Dir: the directory of the training data set of normal control subjects.

Blinded Dir: the directory of the blinded testing data set, with a blinded mixture of diseased and control subjects.

Output Root Dir: by default it has the same parent directory as the Input Dir. A subdirectory of the Output Root Dir is created according to the preprocessing steps. The three groups of preprocessed spectra will be output to this subdirectory.

For example, there are three groups of spectra: A, B and C. The preprocessing step is smoothing with the parameter 0.003. Then Disease Dir is dir1\A, Control Dir is dir1\B, Blinded Dir is dir1\C, and Output Root Dir dir1 Preproc
25. finish for obtaining data from DNA mi croarrays IT Nature Genetics 32 pp 481 489 Hut93 Hutchens T W and Yip T T 1993 New desorption strategies for the mass spectrometric analysis of macromolecules Rapid Commun Mass Spectrom 7 pp 576 580 Jan03 Jansen R Bussemaker HJ Gerstein M Revisiting the codon adapta tion index from a whole genome perspective analyzing the relationship between gene expression and codon occurrence in yeast using a variety of models 2003 Nucleic Acids Res 31 pp 2242 2251 Joh04 Johann D J McGuigan M D Tomov S Fusaro V A Ross S Conrads T P Veenstra T D Fishman D A Whiteley G R Petricoin E F and Liotta L A 2004 Novel approaches to visualization and data mining reveal diagnostic information in the low amplitude region of serum mass spectra from ovarian cancer patients Disease Markers 19 pp 197 207 95 Joh04a Johann D J McGuigan M D Patel A R Tomov S Ross S Conrads T P Veenstra T D Fishman D A Whiteley G R Petricoin E F and Liotta L A Clinical proteomics and biomarker discovery 2004 Annals of the New York Academy of Sciences 1022 pp 295 306 Joo04 Joo J Ahn H Lombardo F Hadjiargyrou M Zhu W 2004 Statis tical Approaches in the Analysis of Gene Expression Data Derived from Bone Regeneration Specific cDNA Microarrays J Biopharm Stat 14 pp 607 28 Kar88 Karas M and Hillenkamp F 1988 A
26. match, and NP_004479.1 belongs to the subset with 236 protein sequences.
Table 5.1: An example of tblastn
NCBI Accession | RefSeq Accession | Nucleotide by tblastn | Actual Match | Protein Name
213183011 | NP_004479.1 | NM_000173 | NM_004488.1 | glycoprotein V precursor
21121531 | NP_000164.3 | NM_000173 | NM_000173 | glycoprotein Ib alpha
Similarly, Table 5.2 shows that if we start from the nucleotide sequences, NM_000419 is one of 143 sequences and NM_003637 is among 368 sequences which do not have the same match as in the RefSeq database.
Table 5.2: An example of blastx
Affymetrix Probeset No. | RefSeq Accession | Protein by blastx | Actual Match | Gene Name
216956_s_at | NM_000419 | NP_000410.1 | NP_000410.1 | integrin alpha 2b
206766_at | NM_003637 | NP_000410.1 | NP_003628 | integrin alpha 10
Both protein and mRNA sequences were transformed to reference sequences. If one reference sequence has multiple corresponding protein or mRNA sequences, we take the average of their abundances. For instance, four protein sequences with NCBI accessions gi|13606, gi|13607, gi|13608 and gi|13609 have the same reference sequence, with accession NP_000025.1. The abundance of this protein sequence is 45, which is the average of the four numbers of peptide hits. The entry "1, 2" in the column Run means that peptides are found in both runs, and the hit in the column No. of Peptide is the average. NCBI accession | RefSeq accession | No. of Peptide | Run
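The averaging rule described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the thesis code: the function name abundance_by_refseq, the tuple layout, the source labels and the four individual peptide-hit counts are all hypothetical. Only the accession NP_000025.1 and the resulting mean of 45 come from the text.

```python
from collections import defaultdict


def abundance_by_refseq(hits):
    """Average abundances that map to the same reference sequence.

    hits: iterable of (source_accession, refseq_accession, abundance).
    Returns {refseq_accession: mean abundance}.
    """
    grouped = defaultdict(list)
    for _source, refseq, abundance in hits:
        grouped[refseq].append(abundance)
    return {refseq: sum(values) / len(values) for refseq, values in grouped.items()}


# Hypothetical peptide-hit counts: only the RefSeq accession and the mean of 45
# come from the text; the source labels and individual counts are made up.
example = [
    ("protein_a", "NP_000025.1", 40),
    ("protein_b", "NP_000025.1", 44),
    ("protein_c", "NP_000025.1", 46),
    ("protein_d", "NP_000025.1", 50),
]
averaged = abundance_by_refseq(example)
```

Grouping by the reference accession first keeps the rule symmetric for protein and mRNA inputs, since both are keyed by a RefSeq identifier after the BLAST step.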
27. means there are significant changes. 5.6 Quadrant analysis and clustering. 5.6.1 Quadrant analysis. First, the set of 120 proteins was ranked by protein abundance, and the correlation was calculated by including the 15 most abundant proteins and then successively adding the remaining 105 in order of decreasing abundance. In Figure 5.14, the top 18 highly abundant proteins have the maximum correlation of 0.44. On the other hand, the set of 120 genes was ranked by mRNA abundance (gene expression). The correlation was calculated by including the 15 most highly expressed genes and then successively adding the remaining 105 pairs in order of decreasing expression. As shown in Figure 5.15, the 20 most highly expressed genes have the largest correlation of 0.84 with the proteins. Figure 5.14: Effect of highly abundant proteins on the Spearman correlation coefficient for mRNA and protein abundance in platelet (x-axis: top sequences with the largest number of peptides found). The top 18 highly abundant proteins have the largest correlation of 0.44. Then we can divide all 120 genes into four groups, shown as four quadrants in Figure 5.16. The 18 most highly abundant proteins are in quadrants 1 and 2, and the 20 most highly expressed genes are in quadr
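The cumulative correlation procedure described above can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the code behind Figures 5.14 and 5.15: the function names are hypothetical, ties are given average ranks, and the toy data in the usage note are made up.

```python
import math


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def cumulative_correlation(ranking, other, start=15):
    """Correlation over the top-k pairs, ranked by `ranking`, for k = start..n."""
    order = sorted(range(len(ranking)), key=lambda i: ranking[i], reverse=True)
    curve = []
    for k in range(start, len(order) + 1):
        top = order[:k]
        rho = spearman([ranking[i] for i in top], [other[i] for i in top])
        curve.append((k, rho))
    return curve
```

For example, cumulative_correlation([5, 4, 3, 2, 1], [50, 40, 30, 10, 20], start=3) returns one (k, rho) pair for each of k = 3, 4, 5; the k with the largest rho plays the role of the "top 18" or "top 20" cut-off in the text.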
28. on the maximum intensity of each peak. First, open a preprocessed spectrum and tune the parameters in the Parameter Selection Page to detect, refine and align the peaks. The spectrum with the sample ID 011 in the disease group is used as an example. Since we recommend using the preprocessed spectra, open the spectrum HN011.txt in the directory proteoExplorer\demodata\Preprocessed\disease. [Screenshot: Pick a Spectrum file dialog.] Click Peak Identification to identify peaks. The green squares indicate the identified peaks, and their display size is tunable. [Screenshot: Biomarker Detection window, Parameter Selection page, with the fields Window Size (pts), Marker Display Size, Noise Window, Noise Coef, Alignment Window (m/z) and Biomarker type.] Click Peak Refinement to refine peaks. The yellow line indicates the noise level, and the peaks below the noise level are denoted by red squares and are discarded for the ensuing c
29. parameter profile to the chosen dataset folder and format the entire dataset folder automatically using the given parameter setting. Of the above 4 steps, the parameter-setting steps (1 & 2) are done using the Parameter Selection sub-page, and the batch-processing steps (3 & 4) are done with the Batch Processing sub-page. Further details are given below. [Screenshot: Biomarker Detection window, Parameter Selection page, with the fields Window Size (pts), Marker Display Size, Noise Window, Noise Coef, Alignment Window (m/z) and Biomarker type, and a Save Profile button.] Parameter Selection Page. Peak Identification: within the neighborhood of each m/z, identify the local maximum (rise and fall) as a peak. Window Size means the number of points within the neighborhood. Click the Peak Identification button and the peaks are displayed as green squares; you can tune the Marker Display Size. Peak Refinement: the noise level is calculated from the points in the Noise Window. You need to input a percentage; the number of points in the Noise Window is the input percentage × the total number of points. At each m/z, noise = mean + Noise Coef × standard deviation, where Noise Coef is proportional to the signal-to-noise ratio. Click Peak Refinement and a yellow noise boundary line will appear. Peaks below the noise level are represented by red squares and discarded for the ensuing classification predicti
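The two peak steps above can be sketched in Python. This is a rough illustration, not proteoExplorer's implementation: the function names are hypothetical, the neighborhood is assumed to be centered on each point, and the noise rule is read as mean + Noise Coef × standard deviation over a window centered on the candidate peak.

```python
import statistics


def find_peaks(intensity, window_size):
    """Indices that are the maximum of their neighborhood of about window_size points.

    A flat neighborhood is not a peak; a plateau may yield several indices.
    """
    half = window_size // 2
    peaks = []
    for i, value in enumerate(intensity):
        neighborhood = intensity[max(0, i - half):i + half + 1]
        if value == max(neighborhood) and value > min(neighborhood):
            peaks.append(i)
    return peaks


def noise_level(intensity, center, noise_window_pct, noise_coef):
    """mean + noise_coef * standard deviation of the points around `center`.

    The window holds roughly noise_window_pct percent of all points
    (assumption: the window is centered on the candidate peak).
    """
    n_points = max(2, round(noise_window_pct / 100 * len(intensity)))
    half = n_points // 2
    window = intensity[max(0, center - half):center + half + 1]
    return statistics.fmean(window) + noise_coef * statistics.pstdev(window)


def refine_peaks(intensity, window_size, noise_window_pct, noise_coef):
    """Keep only the peaks that rise above their local noise level."""
    return [
        i for i in find_peaks(intensity, window_size)
        if intensity[i] > noise_level(intensity, i, noise_window_pct, noise_coef)
    ]
```

With the toy trace [0, 1, 0, 0, 5, 0, 0, 1, 0, 0] and a window of 3 points, identification returns three candidate peaks, and refinement with a noise coefficient of 1 keeps only the large one, which mirrors the green and red squares in the interface.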
30. protein 2 isoform a 4 PFN1 profilin 1 4 CTTN cortactin isoform a 4 CFL1 cofilin 1 non muscle 4 ARPC1B actin related protein 2 3 complex subunit 1B 4 TUBAI tubulin alpha 1 4 K ALPHA 1 tubulin alpha ubiquitous 4 TUBB4 tubulin beta4 4 TUBB2 tubulin beta2 4 MYL9 myosin regulatory light polypeptide 9 isoform a 4 4 PCBP1 poly rC binding protein 1 Continued on next page 84 Table 5 9 continued from previous page Cluster Symbol Name 4 TLN1 talin 1 4 CAP1 adenylyl cyclase associated protein 4 MRCL3 myosin regulatory light chain MRCL3 4 TALDO1 transaldolase 1 4 YWHAE polypeptide 4 YWHAQ polypeptide 4 CALM1 calmodulin 1 phosphorylase kinase delta 4 SUMO3 small ubiquitin like modifier protein 3 4 STXBP2 syntaxin binding protein 2 4 HSPCB microtubule associated protein RP EB family 4 MAPRE1 osteoclast stimulating factor 1 4 RSU1 coronin actin binding protein 1C 4 COROI1C EH domain containing 3 4 MYH2 cytochrome c 4 CYCS PDZ and LIM domain 1 elfin 4 PDLIM1 ubiquitin C 4 UBC smooth muscle and non muscle myosin alkali 4 TMSB4 X peptidylprolyl isomerase A isoform 1 4 PPIA coactosin like 1 4 COTLI1 SH3 domain binding glutamic acid rich 4 SH38BGRL3 tubulin alpha 6 Continued on next page 85 Table 5 9 continued from previous page Cluster Symbol Name 4 MRLC2 ras homolog gene family member C 4 RHOC tubulin beta polypeptide 5 NP G gamma globin 5
31. protien database 00002 0000 56 5 2 BLAST tool ao ea cae Bue eR Ola es bE e ee 57 5 3 Result of integrating platelet proteomic and genomic datasets 59 5 4 143 gene protein pairs 2 2 2020202 eee 62 5 5 Distributions of protein and mRNA abundances 62 Eee 63 re ee 64 ad ae eee oe ey ee 65 eee 67 5 10 Pearson correlation between the original gene protein expres sion data a the normality transformed data on both gene and protein b and the normality transformed data on gene only c 68 5 11 Box plot of CAI for highest and lowest expressed platelet tran Berrie it Gow op Ay Bo ea aE aS week eee A Bates we e ed 71 5 12 Crystal structure of tripsin oao oa een ee He aes 73 5 13 Example of triptic fragments for proflin 2 73 eea 7 ficient for mRNA and protein abundance in platelet Top 2 TT T6 E 5 17 Hierarchical clustering average Link distance l r 78 5 18 Top 9 clusters for hierarchical clustering 79 5 19 Top 9 clusters shown in the plot of mRNA abundance vs pro a ke eee ae ee ee 80 6 1 Automated gene protein integration system 89 xii List of Tables 2 1 Proportion of 165 shots that have intensities lt 6 11 2 2 Description of rats data ooa a sae tbe ees 14 2 3 Result of the reproducibility test a a aa aa 15 3 1 Head and neck cancer data ooa a eos eas 16 3 2 Comparison of MKNN and classi
32. than 40 out of 13,500 markers at which the null hypothesis is rejected. Thus the rat data are reproducible. However, when the rats are 14 weeks old, the mass spectra are relatively less reproducible than those of rats at other ages. This difference can also be seen in the F-Map (Figure 2.5), where the red line is the F threshold given by Gaussian random field theory.
Table 2.3: Result of the reproducibility test
Data Set | 1st d.f. | 2nd d.f. | F threshold | No. of Markers Rejecting H0
8 Weeks | 1 | 21 | 26.85 | 0
10 Weeks | 2 | 42 | 12.94 | 11
12 Weeks | 2 | 42 | 12.94 | 2
14 Weeks | 1 | 21 | 26.85 | 37
21 Weeks | 2 | 42 | 12.94 | 3
Figure 2.5: F-map of the reproducibility test (x-axis: M/Z).
Chapter 3: Data Preprocessing, Biomarker Detection and Classification. In this chapter we use the head and neck cancer data set (Table 3.1) to illustrate the three steps in proteomic biomarker analysis. The flow chart of the whole procedure is shown in Figure 3.1. For biomarker detection we developed a novel method based on variance analysis. In comparison with two previous methods, it improved the classification results. We proposed a new classification method called majority k-nearest neighbor, which performs better than the traditional k-nearest neighbor method. A new classifier-combination scoring system is also developed. Head & Neck Data Set: M/Z Range 0 to 100,000; M/Z 34,378; HSNCC 73; Normal Control 76; Blinded
33. z 8195.01, more than half of the shots for sample number 1 are noise (Figure 2.3). We should not use those noisy shots to generate mass spectra. After eliminating the noise, we take the average of the shots between the 25th percentile and the 75th percentile at each m/z. This algorithm considers only the stable shots, after excluding the noise with small intensities. Therefore the mass spectra have higher intensities and are more accurate. In Figure 2.4, Regular means taking the average of all 165 shots and then subtracting the baseline. Improved means eliminating the instrument noise, taking the average of the shots between the 25th percentile and the 75th percentile, and finally subtracting the baseline. Figure 2.4: Comparison between the regular and improved methods on sample 3 (plot labels: MZ 5997.97, MZ 8195.01). 2.2 Data quality control. In many mass spectrometry datasets, each protein serum sample is generated multiple times. If the spectra of the same serum sample are not reproducible, we cannot trust them or use them for further analysis. One-way repeated-measures ANOVA is implemented to perform the reproducibility test. Method: Suppose we have N protein serum samples and the mass spectrum of each sample contains intensities at M markers (mass-to-charge ratio, or m/z). The intensity of each sample has the model $Y_{ij} = \alpha_i + \beta_j + \epsilon_{ij}$, $i = 1, \ldots, N$, $j = 1, \ldots, M$, where $\alpha_i$ is the ith subject effect (random effect), $\beta_j$ is t
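The "improved" averaging of shots at a single m/z can be sketched as below. This is a minimal sketch, not the thesis algorithm: the text does not say how noise shots are recognized, so the noise_floor cutoff is a hypothetical stand-in, and the percentile convention (the "inclusive" method of Python's statistics module) is also an assumption.

```python
import statistics


def interquartile_mean(shots, noise_floor=0.0):
    """Average the non-noise shots lying between the 25th and 75th percentiles.

    shots: intensities recorded at one m/z across repeated laser shots.
    noise_floor: hypothetical cutoff; shots at or below it are treated as noise.
    """
    kept = sorted(s for s in shots if s > noise_floor)
    if len(kept) < 2:
        # Too few stable shots to compute quartiles: return what is left.
        return float(kept[0]) if kept else 0.0
    q1, _median, q3 = statistics.quantiles(kept, n=4, method="inclusive")
    middle = [s for s in kept if q1 <= s <= q3]
    return statistics.fmean(middle)
```

Restricting the mean to the middle half makes the result insensitive to a few extreme shots: interquartile_mean([1, 2, 3, 4, 5, 6, 7, 8, 900]) gives the same value as the same list with 9 in place of 900.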
34. 0 1 Geo02 George E Glimm J Li X Marchese A Xu Z A Comparison of Exper imental Theoretical and Numerical Simulation Rayleigh Taylor Mixing Rates 2002 Proc National Academy of Sci 99 pp 2587 2592 Gev00 Gevaert K and Vandekerckhove J 2000 Protein identification methods in proteomics Electrophoresis 21 6 pp 1145 54 Gna03 Gnatenko DV Dunn JJ McCorkle SR et al 2003 Transcript pro filing of human platelets using microarray and serial analysis of gene ex pression Blood 101 6 pp 2285 93 Gna05 Gnatenko DV Cupit LD Huang EC Dhundale A Perrotta PL Ba hou WF 2005 Platelets express steroidogenic 17beta hydroxysteroid de hydrogenases Distinct profiles predict the essential thrombocythemic phe notype Thromb Haemost Aug 94 2 412 21 Gyg99 Gygi SP Rochon Y Franza BR Aebersold R 1999 Correlation be tween protein and mRNA abundance in yeast Mol Cell Biol 19 pp 1720 1730 94 Har04 Hardiman G 2004 Microarray platforms comparisons and con trasts Pharmacogenomics 5 pp 487 502 HL03 B Lausen and T Hothorn 2003 Double Bagging Combining Classi fiers by Bootstrap Aggregation Pattern Recognition 36 6 pp 1303 309 HLBR04 T Hothorn B Lausen A Benner and M Radespiel Troger 2004 Bagging Survival Tree Statistics in Medicine 23 1 pp 77 91 Hol02 Holloway AJ van Laar RK Tothill RW and Bowtell DDL 2002 Options available from start to
35. 1 Statgram (t-Map). A two-independent-samples t-test (z-test) was performed at each m/z value to compare the intensities between the two training samples (disease and normal control). The null hypothesis is that the intensities are equal between the two groups for each particular biomarker, and the alternative is that they are different. For each biomarker we calculated a test statistic (t-value) and then generated the t-Map by plotting the t-values versus the m/z values. Suppose $n_1$ and $n_2$ samples are drawn from the disease group X and the control group Y, respectively. The samples are independent within and between groups. At each biomarker m, the test statistic is $t(m) = \frac{\bar{X}(m) - \bar{Y}(m)}{\sqrt{S_X^2(m)/n_1 + S_Y^2(m)/n_2}}$, where $\bar{X}(m)$, $\bar{Y}(m)$, $S_X^2(m)$ and $S_Y^2(m)$ are the sample means and variances of the training samples. When both samples are large ($n_1 > 30$ and $n_2 > 30$), by the central limit theorem the test statistic approximately follows the standard normal distribution under the null hypothesis. Because multiple tests are performed, there is also a false positive problem. Namely, we need to determine a suitable significance level for each test such that at least 95% of all significant differences identified are real. Traditional methods such as Tukey or Bonferroni tend to be conservative. Thus a less conservative correction method, based on Gaussian random field theory, is applied. The threshold t is given by [equation not recoverable from the extracted text]
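A t-Map of this kind can be computed with a few lines of Python. The sketch below assumes the unpooled (large-sample) form of the statistic, with each group's sample variance divided by its own sample size; the function names and the layout of spectra as lists on a shared m/z grid are hypothetical, and the Gaussian random field threshold is not implemented.

```python
import math
import statistics


def t_statistic(x, y):
    """Two-sample statistic with unpooled variances at a single m/z.

    x, y: intensities of the disease and control training samples.
    """
    n1, n2 = len(x), len(y)
    standard_error = math.sqrt(
        statistics.variance(x) / n1 + statistics.variance(y) / n2
    )
    return (statistics.fmean(x) - statistics.fmean(y)) / standard_error


def t_map(disease, control):
    """t-value at every m/z.

    disease, control: lists of spectra; each spectrum is a list of
    intensities on the same m/z grid.
    """
    return [
        t_statistic(list(xs), list(ys))
        for xs, ys in zip(zip(*disease), zip(*control))
    ]
```

Plotting the returned list against the m/z grid gives the t-Map; markers whose absolute t-value exceeds the chosen threshold are the candidate biomarkers.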
36. The smoothed and baseline-corrected spectrum: [Screenshots: Preprocessing window with Smoother window 0.003, Fitted length 3, and Baseline Starting/Ending points (m/z) 2000.0 and 20000.0, with Smooth, Baseline, Normalize and Save Profile buttons.] The preprocessed spectra with different parameters can also be displayed simultaneously. By comparing those spectra, you can determine the best parameter profile. Parameter Selection Page. Description Textbox: identifies the spectrum, the preprocessing steps and the parameters. Display Toolbar: selects the target spectrum. It is highlighted and its location is displayed in the Description Textbox. Parameter Setting. Smoothing: input the percentage of all data points. It determines the width of the Gaussian Smoother Window at each m/z. Baseline Correction: input the Fitted Length for the convex hull algorithm used to fit the baseline. Normalization: given the Starting/Ending Points of the m/z range, each spectrum is divided by the average intensity over this range. If the Starting Point is zero and the Ending Point is larger than the maximum m/z, use all data points in the
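The normalization rule above amounts to one division per spectrum. The following sketch is an illustration only: the function name normalize is hypothetical, the default range mirrors the 2000.0 and 20000.0 values shown in the window, and treating the range bounds as strict inequalities is an assumption.

```python
def normalize(mz, intensity, start=2000.0, end=20000.0):
    """Divide a spectrum by its average intensity over start < m/z < end.

    mz, intensity: equal-length lists describing one spectrum.
    """
    in_range = [y for x, y in zip(mz, intensity) if start < x < end]
    if not in_range:
        raise ValueError("no data points inside the normalization range")
    average = sum(in_range) / len(in_range)
    if average == 0:
        raise ValueError("average intensity in the range is zero")
    return [y / average for y in intensity]
```

Note that the whole spectrum is rescaled, including points outside the range; the range only decides which points contribute to the average.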
37. 49. Table 3.1: Head and neck cancer data. Figure 3.1: Flow chart of the proteomic mass spectrometry analysis. 3.1 Data preprocessing. Preprocessing is an important step for mass-spectra-based data analysis. The goal is to remove experimental noise and adjust the mass spectra baseline. 1. Calibration and smoothing: each original mass spectrum has to be externally calibrated to be in the same coordinate system and smoothed via a Gaussian filter. 2. Baseline subtraction: eliminate the baseline signal, caused mostly by chemical noise from matrix molecules, without contamination of true protein or peptide peaks. The result is a spectrum with a baseline signal hovering slightly above zero, with protein peaks maintaining their true intensity. 3. Normalization: adjust for the system effects between samples due to varying amounts of protein, degradation over time in the sample, or variation in the instrument detector sensitivity. Each spectrum is divided by the average intensity. [Figure: Raw, Smoothed and Baseline Corrected spectra.]
38. 49 Continued on next page 82 Table 5 9 continued from previous page Cluster Symbol Name 4 FYN protein tyrosine kinase fyn isoform a 4 GAPDH glyceraldehyde 3 phosphate dehydrogenase 4 LDHB lactate dehydrogenase B 4 MPP1 palmitoylated membrane protein 1 4 MSN moesin 4 MYH9 myosin heavy polypeptide 9 non muscle 4 PF4 platelet factor 4 4 PF4V1 platelet factor 4 variant 1 4 PFDN5 prefoldin 5 isoform alpha 4 PGAM1 phosphoglycerate mutase 1 brain 4 PKM2 pyruvate kinase 3 isoform 1 4 LEK pleckstrin 4 PRGI1 proteoglycan 1 4 CCL5 small inducible cytokine A5 precursor 4 SH3BGRL SH3 domain binding glutamic acid rich 4 SPARC secreted protein acidic cysteine rich 4 THBS1 thrombospondin 1 precursor 4 TPM4 tropomyosin 4 4 TPT1 tumor protein translationally controlled 1 4 VCL vinculin isoform VCL 4 YWHAH tyrosine 3 tryptophan 5 monooxygenase Continued on next page 83 Table 5 9 continued from previous page Cluster Symbol Name 4 TAGLN2 tyrosine 3 tryptophan 5 monooxygenase CAPZA2 capping protein muscle Z line alpha 2 4 SNX3 sorting nexin 3 isoform a 4 SNAP23 synaptosomal associated protein 23 4 ST13 heat shock 70kD protein binding protein 4 ACP1 acid phosphatase 1 isoform c 4 GSTO1 glutathione S transferase omega 1 4 PRDX6 peroxiredoxin 6 4 CAPZB F actin capping protein beta subunit 4 LIMS1 LIM and senescent cell antigen like domains 1 4 PCBP2 poly rC binding
39. Figure 3.3: Biomarker comparison. The biomarkers selected by the three different methods are shown in Figure 3.3. Method I is Zhu's approach, Method II is by Yasui and colleagues, and Method III is our newly proposed method. The continuous markers for Methods I and III are not necessarily located at the most prominent peak region. Yasui's peak method selects only the peak apex as a potential biomarker. 3.3 Classification methods. After selecting the biomarker pattern in the previous section, we need to validate the pattern by applying classification methods to distinguish the disease-related group from the disease-unrelated group. Majority k-nearest neighbor (MKNN): the MKNN classifier is a generalization of the k-nearest neighbor classifier. The kNN classifier uses only one integer parameter, k. Given an input x (a real-valued vector), it finds the k nearest neighbors of x in the training set and then predicts the label of x as the most frequent one among the k neighbors. Extended to the multi-category case, the principle of kNN is to use the majority vote of the neighbors' labels to assign a label to x. MKNN extends kNN by using the majority vote over a range of k rather than just one k. Table 3.2 shows that MKNN has a sensitivity of 82% and a specificity of 96%, which are much better than the results of the original kNN classifier. Columns (in %): Sensitivity, Specificity, Accuracy. Average kNN: 68.18, 88.89, 79.59. Majority kNN: 81.82
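The MKNN idea, a majority vote over the predictions of several kNN classifiers with different k, can be sketched in a few lines. This is an assumption-laden illustration rather than the thesis implementation: the function names are hypothetical, Euclidean distance is assumed, and ties are broken in favor of the label encountered first.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, x, k):
    """Label of x by majority vote among its k nearest training points."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], x))[:k]
    return Counter(label for _point, label in nearest).most_common(1)[0][0]


def mknn_predict(train_x, train_y, x, ks):
    """Majority vote over the kNN predictions for every k in ks."""
    votes = Counter(knn_predict(train_x, train_y, x, k) for k in ks)
    return votes.most_common(1)[0][0]


# Toy one-dimensional data (made up): label 0 clusters near 1, label 1 near 11.
train_x = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
train_y = [0, 0, 0, 1, 1, 1]
prediction = mknn_predict(train_x, train_y, (10.5,), ks=[1, 3, 5])
```

Voting over a range of k removes the need to commit to a single k, at the cost of running the neighbor search once per value; sorting once and reusing the ordering would avoid the repeated work.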
40. 83
Table 4.5 (continued). Columns: ID | Truth | CART | s-CART | s-RF
37 | 0 | 0 | 0 | 0.13
38 | 1 | 0 | 0.333 | 0.21
39 | 1 | 1 | 1 | 0.73
40 | 1 | 1 | 1 | 0.58
41 | 1 | 1 | 0.667 | 0.74
42 | 1 | 1 | 1 | 0.91
43 | 1 | 1 | 1 | 0.93
44 | 1 | 1 | 1 | 0.98
45 | 1 | 1 | 1 | 0.91
46 | 1 | 1 | 1 | 0.63
47 | 1 | 1 | 1 | 0.60
48 | 1 | 1 | 1 | 0.52
49 | 0 | 1 | 0.667 | 0.38
Table 4.5: Comparison on head and neck cancer testing samples by different methods. ID is the testing sample ID; CART is the classification result of the original CART; s-CART is the classification result of score CART; s-RF is the score Random Forest classification given by this thesis.
Head & Neck Data Set. Columns: Classifier | Sensitivity | Specificity | Total Accuracy
MKNN | 95.91 | 88.89 | 91.84
MLPNN | 86.36 | 74.07 | 79.59
GRNN | 86.36 | 88.89 | 79.59
SVM | 90.91 | 85.19 | 87.76
CART | 90.91 | 88.89 | 89.80
RF | 86.36 | 92.59 | 89.80
s-CART | 95.45 | 92.59 | 89.90
s-RF | 95.45 | 100.00 | 97.96
Table 4.6: Comparison of sensitivity and specificity of eight classifiers in the head and neck cancer study.
Chapter 5: Correlation of Proteomic and Genomic Data. Only mRNA expression levels were considered for most of the pathway models analyzed, due to the lack of protein expression data [Bay04]. Variables representing protein concentrations were either excluded or substituted with the corresponding mRNA expression levels. With the newly emerging LC-MS/MS technol
41. 96.30, 89.80. Table 3.2: Comparison of MKNN and classic kNN. Multi-layer perceptron neural network (MLPNN): the multi-layer perceptron is a hierarchical structure of several perceptrons and overcomes the shortcomings of single-layer networks. It is an artificial neural network that learns nonlinear function mappings. The multi-layer perceptron is capable of learning a rich variety of nonlinear decision surfaces. Nonlinear functions can be represented by multi-layer perceptrons with units that use nonlinear activation functions. Multiple layers of cascaded linear units still produce only linear mappings. General regression neural network (GRNN): GRNN is Donald Specht's term for Nadaraya-Watson kernel regression, also reinvented in the NN literature by Schioler and Hartmann. Kernels are also called Parzen windows. One can view GRNN as a normalized RBF network in which there is a hidden unit centered at every training case. These RBF units are called kernels and are usually probability density functions such as the Gaussian. The hidden-to-output weights are just the target values, so the output is simply a weighted average of the target values of the training cases close to the given input case. The only weights that need to be learned are the widths of the RBF units. These widths (often a single width is used) are called smoothing parameters or bandwidths and are usually chosen by
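The description of GRNN as Nadaraya-Watson kernel regression translates directly into code. The sketch below assumes a Gaussian kernel with a single shared bandwidth; the function name grnn_predict and the toy data in the usage note are hypothetical.

```python
import math


def grnn_predict(train_x, train_y, x, bandwidth):
    """Nadaraya-Watson estimate: kernel-weighted average of training targets.

    Each training case acts as one hidden unit centered at that case, and
    the hidden-to-output weights are the target values themselves.
    """
    weights = [
        math.exp(-math.dist(center, x) ** 2 / (2 * bandwidth ** 2))
        for center in train_x
    ]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; increase the bandwidth")
    return sum(w * y for w, y in zip(weights, train_y)) / total
```

With binary targets (0 for control, 1 for diseased) the output lies between 0 and 1 and can be read as a score: grnn_predict([(0.0,), (2.0,)], [0, 1], (1.0,), bandwidth=1.0) returns 0.5 because the input is equally far from both training cases.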
42. Stony Brook University. The official electronic file of this thesis or dissertation is maintained by the University Libraries on behalf of The Graduate School at Stony Brook University. All Rights Reserved by Author. Joint Analysis of Gene and Protein Data. A Dissertation Presented by Chen Ji to The Graduate School in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in Applied Mathematics and Statistics, Stony Brook University, August 2007. Stony Brook University, The Graduate School. Chen Ji. We, the dissertation committee for the above candidate for the Doctor of Philosophy degree, hereby recommend acceptance of this dissertation. Wei Zhu, Associate Professor, Department of Applied Mathematics and Statistics, Stony Brook University, Dissertation Advisor. Nancy Mendell, Professor, Department of Applied Mathematics and Statistics, Stony Brook University, Chairperson of Defense. Esther Arkin, Professor, Department of Applied Mathematics and Statistics, Stony Brook University. Wadie Bahou, Professor, Department of Hematology, School of Medicine, Stony Brook University, Outside Member. This dissertation is accepted by the Graduate School. Lawrence Martin, Dean of the Graduate School. Abstract of the Dissertation: Joint Analysis of Gene and Protein Data, by Chen Ji, Doctor of Philosophy in Applied Mathematics and Statistics, Stony Brook University, 2007. Early detection is critic
43. TP Computational Systems Biology of the Neuronal Cell December pp 6 10 Trieste Italy DF03 S Dudoit and J Fridlyand 2003 Bagging to improve the accuracy of a clustering procedure Bioinformatics 19 9 pp 1090 99 Die00 T G Dietterich 2000 An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees Bagging Boosting and Randomization Machine Learning 40 2 pp 139 58 Fenn89 Fenn JB Mann M Meng CK Wong SF and Whitehouse CM 1989 Science 246 pp 64 71 Fie00 Fiehn O J Kopka P Dormann T Altmann R N Trethewey and L Willmitzer 2000 Metabolite profiling for plant functional genomics Nature Biotech 18 pp 1157 1161 For02 Forster J A K Gombert and J Nielsen 2002 A functional genomics approach using metabolomics and in silico pathway analysis Biotechnol ogy and Bioengineering 79 pp 703 712 For03 Forster J I Famili P Fu B O Palsson and J Nielsen 2003 Genome scale reconstruction of the Saccharomyces cerevisiae metabolic network Genome Research 13 pp 244 253 93 Fri04 Friberg M von Rohr P Gonnet G 2004 Limitations of codon adapta tion index and other coding DNA based features for prediction of protein expression in Saccharomyces cerevisiae Yeast 21 pp 1083 1093 Fun02 Fung E T and Enderwick C 2002 ProteinChip clinical proteomics computational challenges and solutions Biotechnique Suppl 34 8 pp 4
44. al in the successful treatment of life-threatening diseases such as cancer. A vital component of this research is the identification and correlation of disease-related genetic and proteomic biomarkers, based on gene microarray data and proteomic mass spectra data from diseased and control subjects. Such knowledge is crucial in discovering the underlying genetic disease pathways, in drug development and in early diagnosis. In this work we first propose a quality control algorithm to improve proteomic data acquisition from the mass spectrometer. We then demonstrate a novel variance component approach for biomarker detection and for population homogeneity examination. A major contribution of this thesis is the development of the scoring method that would yield the predictive disease probability rather than the traditional crude binary (yes/no) diagnosis. We present the s-CART and s-RF classifiers, the improved scoring variants of the binary classification and regression tree (CART) and Random Forest (RF) classifiers. Finally, we illustrate the biological and statistical process of integrating the genomic and proteomic data through a human platelet study conducted at the Stony Brook University Medical Center. To my parents. Contents: List of Figures; List of Tables; Acknowledgements; 1 Introduction; 1.1 Genomics and proteomics; 1.2 Microarray technology; 1.3 Mass spectrometry
45. and functional units of heredity. Genes are specific sequences of bases that encode instructions on how to make proteins. Genes comprise only about 2% of the human genome; the remainder consists of non-coding regions, whose functions may include providing chromosomal structural integrity and regulating where, when and in what quantity proteins are made. The human genome is estimated to contain 30,000 to 40,000 genes. Although genes get a lot of attention, it is the proteins that perform most life functions and even make up the majority of cellular structures. Proteins are large, complex molecules made up of smaller subunits called amino acids. Chemical properties that distinguish the 22 commonly occurring amino acids cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell. Whilst humans are estimated to have between 30,000 and 40,000 genes, potentially encoding 40,000 different proteins, alternative RNA splicing and post-translational modification may increase this number to in the region of 2 million proteins or protein fragments. The constellation of all proteins in a cell is called its proteome. Unlike the relatively unchanging genome, the dynamic proteome changes from minute to minute in response to tens of thousands of intra- and extracellular environmental signals. A protein's chemistry and behavior are specified by the gene sequence and by the number and identitie
46. ant 2 and 3. Table 5.7 shows that the three groups in Q1, Q3 and Q4 have very significant correlations (p < 0.01). Figure 5.15: Effect of highly abundant genes on the Spearman correlation coefficient for mRNA and protein abundance in platelet (x-axis: most highly expressed genes; y-axis: correlation). The top 20 highly abundant genes have the largest correlation of 0.84.
Table 5.7: Correlations of the groups in the four quadrants
Quadrant | Number of Genes | Spearman Correlation | P-value
Q1 | 14 | 0.36 | 0.1015
Q2 | 4 | 0 | 0.54
Q3 | 84 | 0.33 | 0.0012
Q4 | 16 | 0.91 | <0.0001
Figure 5.16: Four quadrants. Q1: highly abundant in protein but low abundant in gene; Q2: highly abundant in both gene and protein; Q4: highly abundant in gene but low abundant in protein.
5.6.2 Clustering. Co-regulated genes/proteins are expected to have correlated expression patterns. Thus, when submitted to cluster analysis with a suitable threshold for the similarity measure, they tend to be clustered together. Figure 5.17 shows the hierarchical clustering result. The distance between subjects is 1 - r, where r is the correlation between the gene and the protein. The top nine clusters are illustrated in Figure 5.18 and Figure 5.19. Cluster 4 is the largest one, with 92 subjects. The
47. apole-type mass analyzer. Another ionization technique, matrix-assisted laser desorption/ionization (MALDI), involves co-crystallizing the sample with an organic matrix which strongly absorbs UV laser light. Upon irradiation under vacuum, there is an energy transfer from matrix to peptide analyte, which produces gaseous ions that are typically measured by a time-of-flight (TOF) mass analyzer. The advent of these ionization techniques has extended the application of MS to the study of proteins in complex biological systems. The MALDI-MS method is one of the main contemporary analytical methods, reviewed at length in [Gev00]. Surface-enhanced laser desorption/ionization (SELDI), originally described by [Hut93], overcomes many of the problems associated with sample preparation inherent in MALDI-MS. Ciphergen Biosystems (Fremont, CA) has developed the SELDI ProteinChip MS technology, which brings to the field of proteomics a user-friendly methodology. It is rapid, highly sensitive and readily adaptable to a diagnostic format. With the help of these biological technologies and analytical methods, researchers have been able to study the pathology of diseases and show a path to a cure. [Pet02] applied the SELDI technology to the early detection of ovarian cancer; other studies applied SELDI to identify serum biomarkers for the detection of breast cancer, or focused on prostate cancer and head and neck cancer. A concise summary on proteomic pattern recognition m
48. ased on this split selection. Both of them use the ANOVA F statistic to find the split variable, choosing the variable whose F statistic is largest. Then FACT uses linear discriminant analysis (LDA), while QUEST uses modified quadratic discriminant analysis (mQDA), to find the split point. Both of the above approaches seek the global best split variable among all M input independent variables. Seeking a partial best split instead introduces the second source of randomness in Random Forests. At each node, only a partial group of input variables is randomly selected to find the split rule. These are called random features. There are two types of Random Forests, based on the complexity of the random features. 1. Forest-RI is the simplest type of random features. At each node, a partial best split is found by the impurity measure (the same as in CART) from the selected group of variables. The tree is grown recursively until it reaches the maximum size. The number of variables F in the group is pre-defined, usually $\log_2 M + 1$. The selection space of Forest-RI is $\binom{M}{F}$. 2. Forest-RC is suitable for data sets that consist of a small number of independent variables M. There are two problems when using Forest-RI on such data. First, the chance of repeated random features increases significantly, which reduces randomness. Second, the number of variables F in the group may take a big fraction of M, which leads to much higher correlation. And such wi
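The random feature selection of Forest-RI can be illustrated as follows. This sketch covers only the choice of candidate variables at one node, not the split search or the tree growing; the function names are hypothetical, and taking the logarithm in base 2 and truncating to an integer are assumptions about the "log M + 1" rule quoted above.

```python
import math
import random


def forest_ri_feature_count(n_variables):
    """Size F of the random variable group, taken here as int(log2(M) + 1)."""
    return int(math.log2(n_variables) + 1)


def sample_node_features(n_variables, rng):
    """Indices of the F variables examined at one tree node."""
    count = forest_ri_feature_count(n_variables)
    return sorted(rng.sample(range(n_variables), count))


# Example: one node of a tree built on M = 100 input variables.
rng = random.Random(0)
candidates = sample_node_features(100, rng)
```

For M = 100 the group has 7 variables, a small fraction of M; for a small M such as 8 it has 4, half of all variables, which is the situation the text describes as problematic.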
49. ation based on the training data sets and the subsequent prediction on the testing (blinded) dataset. For the current test version, only option 1 is provided for simplicity. The proteoExplorer software implements the following 8 classifiers. Each classifier provides both binary classification outputs (e.g., 0 for control and 1 for diseased) and scores indicating the disease risk probabilities for all testing samples. The classifiers are:
1. Majority K-Nearest Neighbor (MKNN)
2. Linear Discriminant Analysis (LDA)
3. Logistic Regression (LOGIT)
4. Generalized Regression Neural Network (GRNN)
5. Multiple Layer Perceptron Neural Network (MLPNN)
6. Support Vector Machine (SVM)
7. Spherical Support Vector Machine (SSVM)
8. Classification and Regression Tree (CART)
Our experience indicates that no single classifier is dominantly superior to the others in protein/proteomic data analysis. The performance of a classifier depends to a large degree on the characteristics of the specific dataset. This motivated us to combine the decisions from all classifiers into a single, more robust decision. Several approaches have been developed by our team. In this test version we have included the mean score approach, which yields the combined decision across all classifiers. In the output HTML file you will see the combined decision labeled as Averaged in the summary table of the train
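The mean score approach can be sketched as below. This is a minimal sketch under stated assumptions rather than the proteoExplorer code: the function name, the dictionary layout, and the 0.5 decision threshold are all hypothetical choices made for illustration.

```python
def combine_mean_score(scores_by_classifier, threshold=0.5):
    """Average per-sample scores across classifiers and threshold them.

    scores_by_classifier maps a classifier name to a list of scores in
    [0, 1], one per testing sample (same order for every classifier).
    The 0.5 threshold is an assumption, not taken from the software.
    """
    names = list(scores_by_classifier)
    num_samples = len(scores_by_classifier[names[0]])
    averaged = [
        sum(scores_by_classifier[name][i] for name in names) / len(names)
        for i in range(num_samples)
    ]
    # 1 = diseased, 0 = control, following the coding used in the text.
    labels = [1 if score >= threshold else 0 for score in averaged]
    return averaged, labels


if __name__ == "__main__":
    demo_scores = {
        "LDA": [0.9, 0.2],
        "SVM": [0.7, 0.4],
        "CART": [1.0, 0.0],
    }
    print(combine_mean_score(demo_scores))
```

Averaging scores rather than counting votes keeps the combined output continuous, so it can still be read as a risk score for each subject.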
50. [Screenshot: file selection dialog listing spectrum files] To open multiple spectra, hold down the Ctrl key while choosing the spectra. [Screenshot: Pick files to READ from dialog] Select the color in the color panel; the default color is yellow. [Screenshot: color panel] The spectrum is shown in both the Main Window and the Map Window. [Screenshot: proteoExplorer main window with one spectrum loaded] Open an Entire Directory: to display all spectra in the same directory, click File > Read Files, locate the directory, and press Ctrl+A. All files in the directory will be chosen and opened. [Screenshot: Pick files to READ from dialog]
51. biquitin conjugating enzyme E2L 3 isoform 1
Table 5.9 (continued). Columns: Cluster | Symbol | Name
2 | ACTR3 | ARP3 actin-related protein 3 homolog
2 | OSTF1 | ras suppressor protein 1 isoform 1
3 | GSTP1 | glutathione transferase
3 | RGS10 | regulator of G-protein signaling 10 isoform a
3 | PPBP | pro-platelet basic protein precursor
  | TIMP1 | tissue inhibitor of metalloproteinase 1 precursor
3 | DNCL1 | dynein light chain 1
3 | MYL6 | thymosin beta 4
4 | ALDOA | aldolase A
4 | F13A1 | coagulation factor XIII A1 subunit precursor
4 | GP1BA | platelet glycoprotein Ib alpha polypeptide precursor
4 | GSN | gelsolin isoform a
4 | NP | purine nucleoside phosphorylase
4 | PGK1 | phosphoglycerate kinase 1
4 | SNCA | alpha-synuclein isoform NACP140
4 | TPI1 | triosephosphate isomerase 1
4 | GPX1 | glutathione peroxidase 1 isoform 1
4 | FKBP1A | FK506 binding protein 1A
4 | ZYX | zyxin
4 | SEPT7 | cell division cycle 10 isoform 2
4 | HSPCA | heat shock 90kDa protein 1 alpha isoform 1
4 | ACTB | beta actin
4 | ACTN1 | actinin alpha 1
4 | ARHGDIB | Rho GDP dissociation inhibitor (GDI) beta
4 | CLIC1 | chloride intracellular channel 1
4 | ENO1 | enolase 1
4 | FHL1 | four and a half LIM domains 1
4 | FLNA | filamin 1 (actin-binding protein 280)
4 | GDI | GDP dissociation inhibitor 1
4 | HSPB1 | heat shock 27kDa protein 1
4 | ACTG1 | actin gamma 1 propeptide isoform 4
4 | ARF3 | ADP-ribosylation factor 3
4 | RHOA | ras homolog gene family member A
4 | ENO2 | enolase 4
4 | EPB49 | erythrocyte membrane protein band
52. c KNN
3.3 Training classification via cross validation. Method I is Zhu's approach, Method II is Yasui's, and ours is Method III. Sen: Sensitivity; Spe: Specificity.
3.4 Testing classification on blinded data (information disclosed after analysis). Method I is Zhu's method, Method II is Yasui's, and ours is Method III.
4.1 Recursive tree growing schema for CART
4.2 Variable importance schema for RF
4.3 Classification results on testing samples of different CART trees constructed by different splitting methods. ID is the testing sample index; entropy is Quinlan's entropy information gain method; index is the Gini diversity index; ratio is Gini ra [...] is entropy information gain with Marshall correction; [...] is the Gini diversity index with Marshall correction.
4.4 Splitting method comparison for head and neck cancer study
5.1 An example of tblastn
5.2 An example of blastx
5.4 Correlation of gene data
5.5 Correlation of 120 gene-protein pairs before the triptic adjustment; p-values are calculated by bootstrapping
5.6 Triptic adjustment comparison for the correlation of 120 gene-protein pairs; p-values are calculated by bootst
53. ce matrix $\Sigma$. The reference samples yield measurements $x_{ki}$, $i = 1, \ldots, n_k$, $k = 1, 2$, with sample means $\bar{x}_k$ and pooled sample covariance matrix $S$, where $n_1 + n_2 - 2 > p$. Let $\Delta_q^2$ be the corresponding $q$-variate Mahalanobis distance between the two groups, given by
$\Delta_q^2 = (\mu_{1(q)} - \mu_{2(q)})^T \Sigma_{(q)}^{-1} (\mu_{1(q)} - \mu_{2(q)})$,
and let
$D_q^2 = (\bar{x}_{1(q)} - \bar{x}_{2(q)})^T S_{(q)}^{-1} (\bar{x}_{1(q)} - \bar{x}_{2(q)})$
be the usual estimate of $\Delta_q^2$. Test the sequential hypotheses $H_q: \Delta_{q+1}^2 = \Delta_q^2$, $q = 0, 1, \ldots, p - 1$, using
$F_q = \dfrac{(n_1 + n_2 - q - 2)\, n_1 n_2\, (D_{q+1}^2 - D_q^2)}{(n_1 + n_2)(n_1 + n_2 - 2) + n_1 n_2 D_q^2}$.
The best subset is either the full set or the set of the first $q$ variables, where $q$ is the first step at which $F_q < F_\alpha(1, n_1 + n_2 - q - 2)$. The Monte Carlo results showed that a fixed $\alpha$ level between 10% and 25% performs better than a much larger or a much smaller significance level.
Method 2: Yasui's peak extraction method
1. Peak detection (Yasui et al., 2003). Define peaks by judging at each m/z point whether or not the intensity at that point is the highest within its nearest N-point neighborhood set. Select the peaks above the noise level. At each m/z point, count the total number of peaks in all samples that fall within the window of potential shift for that point. The m/z point with the highest total number of peaks within its window of potential shift is entered in the new m/z set as a calibrated m/z value. Construct the calibrated
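The peak definition in step 1 (highest intensity within the nearest N-point neighborhood, and above the noise level) can be sketched as follows. This is an illustrative reading only: the function name is hypothetical, the neighborhood is taken as a symmetric window of `half_window` points on each side, and the noise level is passed in as a single constant rather than estimated from the spectrum.

```python
def detect_peaks(intensity, half_window, noise_level):
    """Return indices of points that are local maxima above the noise.

    A point counts as a peak when its intensity equals the maximum over
    the window [i - half_window, i + half_window] (clipped at the ends)
    and exceeds noise_level. On a flat-topped peak every tied point is
    reported; a real implementation would need a tie-breaking rule.
    """
    peaks = []
    for i, value in enumerate(intensity):
        lo = max(0, i - half_window)
        hi = min(len(intensity), i + half_window + 1)
        if value > noise_level and value == max(intensity[lo:hi]):
            peaks.append(i)
    return peaks


if __name__ == "__main__":
    spectrum = [0, 1, 5, 1, 0, 2, 8, 2, 0, 0.5, 0]
    print(detect_peaks(spectrum, half_window=2, noise_level=1.0))
```

The calibration step that follows in the text (counting peaks inside each window of potential shift across samples) would operate on the index lists this function returns, one list per spectrum.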
54. d sample of control and diseased subjects than markers unrelated to the disease. Suppose we have $N$ subjects, among which $n_1$ are from the control group and $n_2$ are from the disease group. The intensity for subject $i$ at marker $j$ is $X_{ij}$, $i = 1, \ldots, N$, $j = 1, \ldots, M$, where $M$ is the number of markers. For a marker unrelated to the disease, it is sensible to assume a common distribution for both the control and the diseased subjects:
$X_i \sim \text{iid}\,(\mu, \sigma^2)$, $i = 1, \ldots, N$ (all subjects).
For a marker related to the disease, however, it is logical to assume that the distribution differs between the two groups:
$X_i \sim \text{iid}\,(\mu_1, \sigma_1^2)$, $i = 1, \ldots, n_1$ (control);
$X_i \sim \text{iid}\,(\mu_2, \sigma_2^2)$, $i = n_1 + 1, \ldots, N$ (disease).
The expected value of the sample variance is then
$E(S^2) = \sigma^2$ for a marker unrelated to the disease, and
$E(S^2) = \dfrac{n_1 \sigma_1^2 + n_2 \sigma_2^2}{N} + \dfrac{n_1 n_2 (\mu_1 - \mu_2)^2}{N(N-1)}$ for a marker related to the disease.
In the special case $\sigma_1 = \sigma_2 = \sigma$, the expected variance for a marker related to the disease reduces to
$E(S^2) = \sigma^2 + \dfrac{n_1 n_2 (\mu_1 - \mu_2)^2}{N(N-1)}$.
Thus the disease-related markers have larger variance, and the discrepancy is proportional to the squared difference in mean signal intensity between the groups. It is therefore reasonable to apply the variance component analysis to identify disease-related biomarkers. [Figure panels: Method, Method II; horizontal axis ticks from 4450 to 4490]
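The expected total variance of a disease-related marker, $E(S^2) = (n_1\sigma_1^2 + n_2\sigma_2^2)/N + n_1 n_2 (\mu_1 - \mu_2)^2 / (N(N-1))$, can be checked numerically. The sketch below is illustrative only; the function name and the sample numbers are hypothetical. With zero within-group variance the formula must agree exactly with the ordinary sample variance of the pooled data, which gives a simple sanity check.

```python
from statistics import variance


def expected_total_variance(n1, n2, mu1, mu2, sigma1, sigma2):
    """Expected pooled sample variance for a two-group marker."""
    n = n1 + n2
    within = (n1 * sigma1 ** 2 + n2 * sigma2 ** 2) / n
    between = n1 * n2 * (mu1 - mu2) ** 2 / (n * (n - 1))
    return within + between


if __name__ == "__main__":
    # Noise-free pooled sample: 3 controls at 1.0 and 5 diseased at 4.0.
    pooled = [1.0] * 3 + [4.0] * 5
    print(variance(pooled))
    print(expected_total_variance(3, 5, 1.0, 4.0, 0.0, 0.0))
    # Equal means: the between-group term vanishes, leaving sigma^2.
    print(expected_total_variance(10, 10, 2.0, 2.0, 1.5, 1.5))
```

The between-group term grows with the squared mean difference, which is why ranking markers by total variance tends to bring disease-related markers to the top.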
55. d to detect biomarker patterns. In comparison with previous biomarker detection approaches, such as stepwise discriminant analysis and the traditional peak detection strategy, we found that the new variance component approach better distinguishes cancer from non-cancer cases, with a sensitivity of 86% and a specificity of 96%.
5. Classifier combination improves the classification result obtained with the new biomarker pattern.
6. Conventional CART and random forest are extended to s-CART and s-RF. The scoring system improves the binary classifiers.
7. Integration of gene and protein data in platelets. A significant correlation is found.
6.2 Future works
In our study, the data set has only two groups, disease and normal. The analysis can be extended to multiple disease categories for cross-sectional classification and longitudinal profiling. We can also correlate proteomic markers with other covariates, such as age and gender. The limitation of the gene-protein database generation and integration process is that it was done half manually and for one platelet study only. One would have to repeat the entire time- and labor-intensive process for another study. Our goal is therefore to establish a customized software module that automates this process. For any future gene-protein integration study, the researchers [Figure 6.1 residue: Gene (Affymetrix, NimbleGen); RefSeq NM accession; NCBI GI accession]
56. dataset that consists of the intensities of each sample at the points in the new m/z set. For each sample i and each point j in the new m/z set, we take the maximum intensity of sample i within the window of potential shift for point j as the intensity at the calibrated m/z point j.
2. Statgram (t-Map). Same as in Method 1. The significant peak maximums are the new biomarkers.
Classification example: the SELDI-TOF spectrometry ProteinChip system was used to screen for differentially expressed proteins in serum from 73 patients with HNSCC and 76 normal controls. The mass spectrometer is a QSTAR, which has high resolution. The data was preprocessed. We applied the three methods to detect biomarkers on the 149 training samples. There are 49 serum samples in the validation set, of which 22 are from patients with HNSCC and 27 are from normal controls. Support Vector Machines are applied for classification, and the sensitivity and specificity are reported.
Method 3: Marker selection via the variance component analysis
A good biomarker must lie in a peak area and be related to the disease, which means it can differentiate the disease group from the control group. We use the total variance over all subjects and an independent t/z test to detect the disease-related markers at peaks. The idea behind the variance component method for marker selection is that disease-related biomarkers tend to have larger variance over the poole
57. dentify head and neck cancer. Clin Cancer Res 10(5), pp. 1625-32.
[WAF03] Wu B, Abbott T, Fishman D, McMurray W, Mor G, Stone K, Ward D, Williams K, Zhao H (2003). Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 19(13), pp. 1636-43.
[Wang06] Wang X, Zhu W, Pradhan K, Ji C, Ma Y, Semmes OJ, Glimm J, Mitchell J (2006). Feature Extraction in the Analysis of Proteomic Mass Spectra. Proteomics, Apr; 6(7), pp. 2095-100.
[Woo04] Woo Y, Affourtit J, Daigle S, Viale A, Johnson K, Naggert J, Churchill G (2004). A comparison of cDNA, oligonucleotide, and Affymetrix GeneChip gene expression microarray platforms. J Biomol Tech 15, pp. 276-284.
[Wright21] Wright (1921). Correlation and Causation. Journal of Agricultural Research 20, pp. 557-585.
[Wu05] Wu G, Culley DE, Zhang W (2005). Predicted highly expressed genes in the genomes of Streptomyces coelicolor and Streptomyces avermitilis and the implications for their metabolism. Microbiology, Jul; 151(Pt 7), pp. 2175-87.
[Yasui03] Yasui Y, Pepe M, Thompson ML, Adam B-L, Wright GL Jr, Qu Y, Potter JD, Winget M, Thornquist M, and Feng Z (2003). A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics 4, pp. 449-463.
[Yasui03] Yasui Y, McLerran D, Adam B-L, Winget M, Thornquist M, and
58. details for clustering is shown in Table 5.9. This is very useful to biologists and chemists for further discussion.
Figure 5.17: Hierarchical clustering (average link, distance = 1 - r).
Table 5.8: Clustering result (columns: Cluster No., Number of protein-gene pairs).
Figure 5.18: Top 9 clusters for hierarchical clustering (panels plot protein against mRNA).
Figure 5.19: Top 9 clusters shown in the plot of mRNA abundance vs. protein abundance.
Table 5.9. Columns: Cluster | Symbol | Name
1 | EEF1G | eukaryotic translation elongation factor 1 gamma
1 | GDI2 | GDP dissociation inhibitor 2
1 | RAB11A | Ras-related protein Rab-11A
1 | ARPC3 | actin-related protein 2/3 complex subunit 3
1 | ZNF185 | heat shock 90kDa protein 1 beta
1 | TUBA6 | myosin regulatory light chain MRCL2
2 | CA2 | carbonic anhydrase II
2 | UBE2L3 | u
59. egion and zoom in by clicking the right mouse button (click the left mouse button to enlarge or move it, and the right mouse button to shrink it). You may also click and drag the left mouse button on the grey axes area to enlarge or shrink the spectrum. In the Map Window, which is linked to the Main Window, the user can resize (right mouse button) and move (left mouse button) the yellow rectangular selection bar along the horizontal and vertical axes to reveal details of the selected region in the Main Window. The File Directory contains the directory of the opened file(s). The Display Toolbar (click or drag with the left mouse button) allows the user to look at each spectrum when multiple spectra are displayed. The Color Setting allows the user to change the color of a spectrum: click Change Color and choose the desired color in the color panel. The Transparency Toolbar is for the display of multiple spectra; the transparency is defined from 0 to 1.
A.2.2 Loading files
Open Single/Multiple File(s): to open a single spectrum, go to File > Read Files, locate the directory, choose the spectrum, and click OK. [Screenshot: Pick files to READ from dialog listing spectrum files]
60. ertain disease or abnormality.
Control Dir: the directory of the training data set of normal control subjects.
Blinded Dir: the directory of the blinded testing data set, a blinded mixture of diseased and control subjects.
Use the output directories from the Biomarker Detection step, i.e., the PeakAligned directories embedded inside the preprocessed spectra directories. Example (following the previous example):
Disease Dir: dirl Preprocessed Smoothed 3 e 005 A PeakAligned
Control Dir: dirl Preprocessed Smoothed 3 e 005 B PeakAligned
Blinded Dir: dirl Preprocessed Smoothed 3 e 005 C PeakAligned
The PeakAligned directories have a suffix, which is the value of the Peak Area size chosen in the Biomarker Detection step. For example, peak data generated using the Maximum Peak Intensity method will have output directories labeled PeakAligned 0. Peak data generated using the Peak Area method with a chosen area size of 10 will have output directories labeled PeakAligned 10.
Biomarker Selection: set the parameters for the Z/T test to select the significant biomarkers.
Number of Total Biomarkers: the number of biomarkers detected. It is determined from any input directory. For the Maximum Peak Intensity or Peak Area method, this is the number of refined peaks identified in the Biomarker Detection step.
Significant Level: the significance level of the Z/T test. It is 0.05 by default. This significa
61. essed. The preprocessed spectra are output to the subdirectory of the Output Root Dir:
Disease: dirl Preprocessed Smoothed 3 e 005 A
Control: dirl Preprocessed Smoothed 3 e 005 B
Blinded: dirl Preprocessed Smoothed 3 e 005 C
A.3.2 Biomarker detection
Select Analysis > Biomarker Detection. There are two types of biomarkers: Maximum Peak Intensity and Peak Area. After you choose either Maximum Peak Intensity or Peak Area, the Biomarker Detection sub-window pops up automatically. Like the Preprocessing sub-window, it has two parts: the top portion for spectrum visualization and the bottom portion for parameter selection. Peak detection consists of three steps: Identification, Refinement, and Alignment. First, the rise and fall within the neighborhood of each m/z point is identified as a peak. Then the noise level is determined within the noise window. A peak above the noise level is called a refined peak. After performing peak identification and refinement on each spectrum, the program aligns the peaks across all spectra in the training and test data sets. Use the Biomarker Detection Window to:
1. Open a single mass spectrum and tune the parameters for peak detection.
2. Save the selected parameters into a profile for the subsequent batch processing.
3. Choose the dataset folder one wishes to format using the selected parameter setting.
4. Apply the saved
62. ethods and their applications for early cancer diagnostics can be found in [Vee04]. Despite the rapid progress in proteomic mass spectrometry technology, there is substantial room for improvement in the following areas: (1) high-quality acquisition of mass spectra data, and (2) identification of significant and meaningful biomarkers. The most commonly used instrument for acquiring proteomic mass spectra is the ProteinChip Biomarker System II (PBS II). It has relatively high sensitivity but low resolution and mass accuracy.
1.4 Thesis structure and overview
In Chapter 2, we present a new algorithm to improve the quality of mass spectra acquired with the PBS II. We also propose a systematic approach for examining the reproducibility of mass spectrometer results, using repeated-measures ANOVA for the point-wise reproducibility test and random field theory for multiple test correction. To date, many statistical groups have proposed proteomic biomarker identification strategies. Two notable ones are [Zhu03], which proposed a continuous marker detection method using random field theory for multiple test correction, and [Yasui03], which developed a data-analytic approach that detects biomarkers based on peaks from the mass spectrum only. In Chapter 3, we propose a new strategy for significant biomarker selection that examines the total variance of each data point along the mass spectrum. Comparisons are made between the
63. etrix ProbeSet [Figure 5.1 residue: platelet protein data (2416 accession sequences, e.g. gi|31563309; 1640 amino acid sequences; protein RefSeq ID, e.g. NP_009224, via blastp against human RefSeq; protein sequence via NCBI RefSeq; 526 unique sequences) and platelet mRNA data (Affymetrix ProbeSet, e.g. 200015_s_at; nucleotide RefSeq ID, e.g. NM_001008491, via Affymetrix; full-length sequence via NCBI RefSeq; 1603 and 1240 unique sequences), linked through BlastN and BlastX between the Platelet Protein Database (526 sequences) and the Platelet mRNA Database (1240 sequences)]
Figure 5.1: Platelet study: the process of establishing and integrating the gene-protein database.
[Figure 5.2 residue: the BLAST search algorithm, showing a query word (W = 3), neighborhood words, the neighborhood score threshold, and a high-scoring segment pair (HSP)]
Figure 5.2: BLAST tool.
assigned to any sequence, not for protein only. A total of 2,604 unique NCBI protein accessions were identified during 2DLC-MS/MS analysis. Each sequence was queried against the protein RefSeq database for human using blastp (protein-protein BLAST, which takes a query amino acid sequence and finds similar sequences in protein databases) prog
64. f the outcome is big, removing the variable causes a high misclassification rate, so it plays an important role. Conversely, a smaller outcome means lower importance.
4.3 score-CART and score-Random Forest
4.3.1 From s-CART to s-RF
In [Bre84], the Gini Diversity Index is used in CART as the splitting method to construct the tree. However, there are several other splitting methods for growing a tree. Each splitting method has different strengths and generates a different tree. On general data sets, no method has a significant advantage over another. We design a new scoring method that benefits from the variation in performance among the splitting methods. It gathers and combines the decisions from the different CART trees to give the score. Because all trees use the same tree generation technique, it can be viewed as an internal multi-classifier system. Some splitting methods are described in Section 4.1.1. As in [Bre96, Bre01], a vote system usually produces a more accurate classification than each individual classifier. Also, with the vote system, a probability
Require: tree number TN > 0, variables M, training sample size X, category number of dependent variable C
Ensure: Variable Importance array VI[1..M]
1: Variable Importance(tree number TN, variable number M)
2: initialize: ME saves the classification results
3: times counts how often sample x has been selected in OOB
4: ME[X, TN, M] = 0, times[X] = 0
5: f
65. fic DNA sequences, tagged or labelled such that they can be independently identified in solution. The traditional solid-phase array is a collection of microscopic DNA spots attached to a solid surface, such as a glass, plastic, or silicon chip. The affixed DNA segments are known as probes (although some sources, such as journalists, use different nomenclature), thousands of which can be placed in known locations on a single DNA microarray. Microarray technology evolved from Southern blotting, whereby fragmented DNA is attached to a substrate and then probed with a known gene or fragment.
1.3 Mass spectrometry
The most widely used techniques for the characterization of proteins are two-dimensional gel electrophoresis (2-DGE), amino acid composition analysis, peptide sequence tagging, and mass spectrometry (MS). In particular, protein mass spectrometry technology, nicknamed protein chips, has given a major impetus to proteomics, being the sole high-throughput technology for protein identification and sequencing. It spans the vast expanse of proteomics and drug discovery. Three unique ionization techniques facilitated the characterization of proteins by MS. One is electrospray ionization (ESI), where a liquid solution of the peptide is sprayed through a fine capillary held at a high potential. This produces charged droplets that are then rapidly desolvated, producing charged ions of the peptide, which are in turn directed into a quadr
66. he jth repeated-measure effect (fixed effect), and $\epsilon_{ij}$ is the random error. The null hypothesis of the test is that the data is reproducible, which means the repeated-measure effects are equal: $H_0: \beta_1 = \beta_2 = \cdots = \beta_m$. It is rejected if the F statistic exceeds the threshold. This test is performed at each marker. Because of the interactions among markers, multiple test correction is needed when we calculate the F threshold. The threshold is derived from Gaussian random field theory [threshold equation in M, FWHM, v, w, and f; not legible in the extraction], where f is the threshold, $\alpha$ is the significance level, FWHM is the smoothing kernel, and v and w are the degrees of freedom, $v = N - 1$, $w = (N - 1)(M - 1)$.
[Table residue: Inputs 4 19 47; Classes 3 7 2]
Rat Age | Subjects | Replicates | Total
8 Weeks | 22 | 2 | 44
10 Weeks | 22 | 3 | 66
12 Weeks | 22 | 3 | 66
14 Weeks | 22 | 2 | 44
21 Weeks | 22 | 3 | 66
Table 2.2: Description of rats data
Example: five groups of mass spectra are generated from twenty-two wild-type rats at different ages, from 8 weeks to 21 weeks. The data is provided by the Department of Pharmacology, SUNY at Stony Brook (Table 2.2). Each rat sample is divided into two or three equivalent parts, which are randomly assigned to the ProteinChip arrays. The m/z range is from 0 to 20,000, and there are about 13,500 m/z values for each sample. We test whether those two or three replicates are reproducible for the rats at each age. There are less
67. iew > Show Map Window: the user may choose whether or not to show the Map Window.
Reset View: View > Reset View sets the spectra on display back to their original scale.
A.2.6 Reset and start over
File > Unload Spectrum: release one selected spectrum.
File > Reset: release all spectra and return to the state when the software is first opened, with no spectrum on display.
A.2.7 Workspace
Save all the spectra opened in the Main Window together with the display options, such as zoom, color, and transparency. Loading the workspace makes it convenient to recover the display options without setting them again. The workspace is in XML format.
A.3 Data analysis
A.3.1 Data preprocessing
To perform preprocessing, click Analysis > Preprocessing in the menu bar of the main GUI to open the Preprocessing sub-window. The Preprocessing sub-window has two parts: the top portion is for spectra visualization and the bottom portion is for preprocessing parameter selection. Preprocessing consists of three sequential steps: smoothing, baseline correction, and normalization. In all cases, smoothing is a required first processing step to filter out noise. Baseline correction is an optional step, depending on whether the baseline has already been corrected during the generation of the mass spectrum, which is indicated by the presence of negative intensity values in the mass spectrum. Normalization should be performed on smoothed and baseline
68. ing data, and Combined for the prediction of the status of each subject in the testing (blinded) data set.
Step 5: Reading the Analysis Output
The final analysis output is in HTML format for the user's review. It can be opened by clicking Analysis > Display Classification/Prediction Results. There are four parts in the output:
1. Analysis Profile
2. Summary of cross-validation results based on the Training data
3. Classification results on the Testing (Blinded) Set
4. Biomarker Pattern: the significant biomarkers used for the classification/prediction
Step 6: Visualize Biomarker Pattern
Finally, the user can visually examine the set of significant biomarkers used in the above classification/prediction analysis by clicking Analysis > Read Latest Biomarker Pattern. Please note that you must open some mass spectra in the main window first. The biomarker pattern used in the latest classification/prediction analysis is then superimposed as red vertical lines on the opened spectrum (spectra).
A.2 Visualization
A.2.1 Overview
Start the software by running proteoExplorer.bat. The following is a screenshot of the main Graphical User Interface with one spectrum loaded. [Screenshot: main GUI with labeled toolbars and File Directory]
The Main Window displays a single spectrum or multiple spectra. In the Main Window you can set up a target r
69. ing peak data using three sequential steps: peak identification, peak refinement, and peak alignment. The newly generated peak data has two measurements: peak center, and maximum peak intensity or peak area. In detail, the steps of peak data generation implemented in proteoExplorer are:
1. Peak detection: detect all possible peaks by local maximums.
2. Peak refinement: refine peaks above the local noise level.
3. Peak alignment and generation: align the refined peaks across all spectra in the data sets to be analyzed and calculate the corresponding biomarker value (maximal peak intensity).
If Peak Area is chosen, then in addition to repeating all the steps for Maximum Peak Intensity, one needs to select the area width in peak alignment and generation. This step is performed using the Analysis > Biomarker Detection Window.
Step 4: Classification/Prediction Analysis
Once the biomarkers are determined and/or generated in Step 3, one can perform the ensuing classification and prediction analysis on the given training/testing data sets. This is done with the Analysis > Classification/Prediction Window. Methods for choosing significant biomarkers include:
1. Z/T test
2. Total variance test 1 2
3. Scoring system
4. Clustering
5. Stepwise Discriminant Analysis
Depending on the necessity, select all or part of the above methods to trim the biomarker pattern. The final model is applied to the classific
70. ining set and classify a blinded data set of 49 subjects. The prediction sensitivity and specificity for the blinded data are shown in Table 2
Training | Method I | Method II | Method III
Classifier | Sen Spe | Sen Spe | Sen Spe
MKNN | 82 89 | 84 96 | 15 96
GRNN | 91 78 | 93 93 | 96 93
MLPNN | 91 85 | 93 95 | 89 94
SVM | 91 89 | 93 93 | 93 93
Score | 87 91 | 96 96 | 92 95
Table 3.3: Training classification via cross validation. Method I is Zhu's approach, Method II is Yasui's, and ours is Method III. Sen: Sensitivity; Spe: Specificity.
Testing | Method I | Method II | Method III
Classifier | Sen Spe | Sen Spe | Sen Spe
MKNN | 82 89 | 82 96 | 82 96
GRNN | 86 78 | 86 81 | 82 89
MLPNN | 86 89 | 86 81 | 86 89
SVM | 86 85 | 86 [illegible] | 86 96
Score | 86 85 | 86 81 | 86 96
Table 3.4: Testing classification on blinded data (information disclosed after analysis). Method I is Zhu's method, Method II is Yasui's, and ours is Method III.
For the training dataset, our method is better than the other two for GRNN only. For the testing data using blinded subjects, however, our method gives the best combined result, with a sensitivity of 86% and a specificity of 96%.
3.5 Extension to multiple group classification
Our approach can be easily extended to the multiple group classification problem. For example, if we have two disease stages and one set of normal controls, a marker unrelated to the disease would be
$X_i \sim \text{iid}\,(\mu, \sigma^2)$, $i = 1, \ldots, N$, $N = n_1 + n_2 + n_3$ (all subjects)
71. l control: control PeakAligned 40 10 4 10 20
Blinded test: test PeakAligned 40 10 4 10 20
A.4.4 Classification/Prediction
Select Analysis > Classification/Prediction. First choose the Result Dir for the output results. Then choose the directories of the three groups of spectra. [Screenshot: Classification/Prediction window with the three PeakAligned 40 10 4 10 0 directories selected via Browse]
We choose Maximum Peak Intensity as the biomarkers, so the input directories are the same as the output directories of the corresponding Biomarker Detection step:
Disease Dir: disease PeakAligned 40 10 4 10 0
Control Dir: control PeakAligned 40 10 4 10 0
Blinded Dir: test PeakAligned 40 10 4 10 0
No. of Total Biomarkers is 47, which means there are 47 refined and aligned peaks. Select the Significant Level (alpha), which is 0.05 (2-sided) by default. Click Classic or Bonferroni to determine the corresponding Critical Value for the Z/T test. The Critical Value is 3.273 if we choose Bonferroni's method to ensure an experimentwise significance level of 0.05 (2-sided). The No. of Final Biomarkers entered is 10, which means we wish to use the top 10 biomarkers with the largest absolute Z/T values as our final model. If we want to select all significant markers inp
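The Bonferroni critical value quoted above can be reproduced with a short calculation. This is a sketch under stated assumptions, not the proteoExplorer code: the function name is hypothetical, and it uses the standard normal (Z) quantile only; a T-based critical value would additionally need the degrees of freedom.

```python
from statistics import NormalDist


def z_critical_value(alpha, num_tests=1, two_sided=True):
    """Normal critical value, optionally Bonferroni-corrected.

    num_tests=1 corresponds to the Classic option; num_tests equal to
    the number of biomarkers corresponds to the Bonferroni option.
    """
    tail = alpha / num_tests      # Bonferroni: split alpha across tests
    if two_sided:
        tail /= 2                 # half of the level in each tail
    return NormalDist().inv_cdf(1 - tail)


if __name__ == "__main__":
    print(round(z_critical_value(0.05), 3))                # Classic
    print(round(z_critical_value(0.05, num_tests=47), 3))  # Bonferroni
```

With 47 biomarkers and alpha = 0.05 (2-sided), each tail gets 0.05 / 94, and the resulting normal quantile agrees with the 3.273 shown in the manual.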
72. l Society, Series B, 26, pp. 211-246.
[Bre84] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone (1984). Classification and regression trees. Stanford University.
[Bre01] L. Breiman (2001). Random forests. Machine Learning 45(1), pp. 5-32.
[Bre01a] L. Breiman (2001). Statistical modeling: the two cultures. Statistical Science 16, pp. 199-215.
[Bre96] L. Breiman (1996). Out-of-bag estimation. Wadsworth International Group.
[Bre03] L. Breiman (2003). RF/TOOLS: A Class of Two-eyed Algorithms. SIAM Workshop, Statistics Department, UC Berkeley.
[CAI] http://www.evolvingcode.net/codon/cai/cais.php
[Car03] Cartieux F, Thibaud M-C, Zimmerli L, Lessard P, Sarrobert C, David P, Gerbaud A, Robaglia C, Somerville S, Nussaume L (2003). Transcriptome analysis of Arabidopsis colonized by a plant-growth promoting rhizobacterium reveals a general effect on disease resistance. Plant J 36, pp. 177-188.
[Cha03] Chang C-C and Lin C-J (2003). Software package LIBSVM v2.3. http://www.csie.ntu.edu.tw/~cjlin/libsvmtools
[CST00] Cristianini N and Shawe-Taylor J (2000). An introduction to Support Vector Machines and other kernel-based methods. Cambridge University Press.
[Cox05] Cox B, Kislinger T, Emili A (2005). Integrating gene and protein expression data: pattern analysis and profile mining. Methods 35(3), pp. 303-14.
[Di04] Di Bernardo D (2004). Modeling genetic networks from expression profiling. SISSA IC
73. lassification/prediction analysis. The green squares are refined peaks that are saved for future analysis.

[Screenshot: Biomarker Detection, Parameter Selection page, showing Window Size (pts), Marker Display Size, Noise Window, Noise Coef, Alignment Window (m/z), Biomarker type, and the Save Profile button.]

Set the parameter Alignment Window, the allowed peak shift width, to 10 m/z. The alignment window for each peak is indicated by two grey vertical lines.

Click Save Profile to save the parameter settings into a file with the extension pek. We save this file to proteoExplorer para peak1 pek. The Parameter Selection Page now changes to the Batch Processing Page, and the location of this peak parameter file appears in the Profile Setting textbox automatically.

[Screenshot: Biomarker Detection, Batch Processing page, with the Profile Setting, Disease Dir, Control Dir, and Blinded Dir fields and the Start Batch button.]

Choose the directory of preprocessed spectra
74. lit selection method v plays a very important role in tree growing. There are over ten different methods. The most commonly used are Entropy (Information Gain), Gini Index, Gini Ratio, and the Marshall Correction [Min89].

Entropy (Information Gain)

Entropy (Information Gain) is used by Quinlan in the ID3 and C4.5 decision trees. The entropy of a node T is

$$\mathrm{entropy}(T) = -\sum_{j=1}^{J} P(j \mid T) \log P(j \mid T) \tag{4.1}$$

where T is the node, J is the number of response categories, and $P(j \mid T)$ is the probability of observing an outcome in the j-th category in node T, with the convention $0 \log 0 = 0$.

The Information Gain (IG) of a split at node T is

$$\mathrm{IG}(T, X, Q) = \mathrm{entropy}(T) - \sum_{k=1}^{K} P(q_k \mid X, T)\, \mathrm{entropy}(T_k) \tag{4.2}$$

where X is the split attribute; Q is the branch set of node T on the split attribute X, which leads to the child nodes generated from node T; K is the number of child nodes of T (e.g., 2 in a binary split); $T_k$ is the k-th child node; and $P(q_k \mid X, T)$ is the probability of descending to the k-th branch from T.

Gini Index

The Gini Index is also called the Gini Diversity Index. It is the main split criterion used in CART. The Gini Index of a node T is

$$\mathrm{gini}(T) = 1 - \sum_{j=1}^{J} P(j \mid T)^2 \tag{4.3}$$

The Gini Index of a split at node T is

$$\mathrm{GI}(T, X, Q) = \mathrm{gini}(T) - \sum_{k=1}^{K} P(q_k \mid X, T)\, \mathrm{gini}(T_k) \tag{4.4}$$

In Eq. 4.3 and Eq. 4.4, all symbols are the same as in Eq. 4.1 and Eq. 4.2.

Gini Ratio

The Gini Ratio was developed to counteract the bias caused by unbalanced data
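The entropy and Gini formulas in this snippet can be sketched in a few lines. This is a minimal illustration, assuming class probabilities are estimated by relative frequencies in the node, the branch probabilities by the share of cases sent to each child, and logarithms taken to base 2 (the source does not state the base); all function names are illustrative.

```python
import math
from collections import Counter


def class_probabilities(labels):
    """Relative frequency of each class present in a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def entropy(labels):
    """Node entropy; classes with zero count never appear, so 0 log 0 = 0 holds."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))


def gini(labels):
    """Gini diversity index of a node."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))


def split_gain(parent, children, impurity):
    """Impurity of the parent minus the weighted impurity of its children.

    With impurity=entropy this is the information gain; with impurity=gini
    it is the Gini index of the split.
    """
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


# A perfectly separating binary split of a balanced two-class node.
parent = [1, 1, 1, 0, 0, 0]
children = [[1, 1, 1], [0, 0, 0]]
info_gain = split_gain(parent, children, entropy)
gini_gain = split_gain(parent, children, gini)
```

For the balanced node in the example, the parent entropy is 1 bit and the parent Gini index is 0.5, and a perfect split recovers all of it.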
75. ll cause the accuracy reduction. In Forest-RC, a random feature is no longer a single variable selected from the group; it is a linear combination of several variables. Two parameters, L and F, are introduced to control the search scope. From all M independent variables, L variables are selected at random. Then F linear combinations of these L variables are generated, with coefficients drawn uniformly at random from the range [-1, 1]. We then use the same idea of impurity reduction as in CART and Forest-RI to find the best combination as the split rule. In [Bre01], L is suggested as 3 and F is suggested as 2 and 8.

4.2.3 Variable importance

Our study considers not only the accuracy of predicting a new case but also the importance of the variables. Since the OOB samples can serve as a testing data set, we can derive a variable ranking from the change in classification error. That is, after each tree is generated, we randomly permute all values of variable m in the OOB samples. We then classify the permuted OOB samples on the tree to get the error rate. This procedure is repeated for all variables and all trees. The variable ranking is then the average of the error rate over all trees. The pseudo code of the algorithm is given in Table. When viewing the outcome of a variable, the value is the average of the margin misclassification rate. This rate is raised by permuting the variable, so it shows the role of the variable in classification. I
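The permutation idea behind the variable importance can be illustrated with a small sketch. It is a simplification under stated assumptions: a single fixed classifier stands in for one fitted tree, a small hand-made data set stands in for that tree's OOB samples, and importance is measured as the increase in error rate after permuting one variable. The names `classify`, `error_rate`, and `permutation_importance` are hypothetical.

```python
import random


def error_rate(classify, rows, labels):
    """Fraction of cases the classifier gets wrong."""
    wrong = sum(1 for row, y in zip(rows, labels) if classify(row) != y)
    return wrong / len(rows)


def permutation_importance(classify, rows, labels, var_index, rng):
    """Increase in error rate after randomly permuting one variable."""
    baseline = error_rate(classify, rows, labels)
    column = [row[var_index] for row in rows]
    rng.shuffle(column)  # break the link between this variable and the labels
    permuted = [
        row[:var_index] + [value] + row[var_index + 1:]
        for row, value in zip(rows, column)
    ]
    return error_rate(classify, permuted, labels) - baseline


def classify(row):
    """Hypothetical stand-in for one fitted tree: it only looks at variable 0."""
    return 1 if row[0] > 0.5 else 0


# Hypothetical stand-in for the OOB samples of that tree.
rows = [[0.1, 5.0], [0.2, 1.0], [0.8, 5.0], [0.9, 1.0], [0.3, 2.0], [0.7, 2.0]]
labels = [0, 0, 1, 1, 0, 1]

rng = random.Random(0)
importance_var0 = permutation_importance(classify, rows, labels, 0, rng)
importance_var1 = permutation_importance(classify, rows, labels, 1, rng)
```

Because the stand-in classifier never reads variable 1, permuting it cannot change any prediction and its importance is exactly zero; in a forest, the same quantity would be averaged over all trees and their own OOB samples.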
76. ller W, Myers EW, Lipman DJ (1990). Basic local alignment search tool. J Mol Biol 215, pp. 403-410.
[Ars91] Arshad M and W T Frankenberger (1991). Microbial production of plant hormones. Dordrecht, the Netherlands: Kluwer Academic Publishers.
[Banfi06] Banfi C, Brioschi M, Wait R, Begum S, Gianazza E, Fratto P, Polvani G, Vitali E, Parolari A, Mussoni L, Tremoli E (2006). Proteomic analysis of membrane microdomains derived from both failing and non-failing human hearts. Proteomics 2006 Feb 13 (Epub ahead of print).
[Bar04] Barac T, Taghavi S, Borremans B, Provoost A, Oeyen L, Colpaert J, Vangronsveld J, van der Lelie D (2004). Engineered endophytic bacteria improve phytoremediation of water-soluble, volatile, organic pollutants. Nature Biotech 22, pp. 583-8.
[Bash97] Bashan Y and G Holguin (1997). Azospirillum-plant relationships: environmental and physiological advances (1990-1996). Can J Microbiol 43, pp. 103-121.
[Bay02] Bay SD, Shrager J, Pohorille A, Langley P (2002). Revising regulatory networks: from expression data to linear causal models. J Biomed Inform Oct-Dec; 35(5-6), pp. 289-97.
[Bay04] Bay SD, Chrisman L, Pohorille A, Shrager J (2004). Temporal aggregation bias and inference of causal regulatory networks. J Comput Biol 11(5), pp. 971-85.
[Bla] http://www.ncbi.nlm.nih.gov/BLAST
[Box64] Box, George E P; Cox, D R (1964). An analysis of transformations. Journal of Royal Statistica
77. lored green, yellow, and red for distinction.

[Screenshot: three overlaid spectra.]

The transparency can be tuned from 0 to 1 using the Transparency Toolbar. The transparency is 1 in the plot above, which means all three spectra are shown with the same maximum clarity.

By tuning the Display Toolbar, one can select a particular spectrum of interest. In the following example, the green spectrum is the chosen spectrum and its file name appears in the File Directory. By tuning the transparency down to 0.2, the other, unselected spectra (red and yellow) fade away, as seen in the screen shot below.

[Screenshot: the selected green spectrum with the red and yellow spectra faded.]

Change Display Order of Opened Spectra

View > Reverse Display Order: change the display order. In the previous example, three spectra are opened and the colors are set in the order of green, yellow, and red. Thus the green one is always shown on top.

[Screenshot: spectra in the default display order.]

Checking Reverse Display Order reverses the display order, which means the red spectrum will be on top and the green one will be on the bottom.

[Screenshot: spectra in the reversed display order.]

Hide/Show Grid

View > Show Grid: if this option is unchecked, the grid will disappear.

Hide/Show Map Window

V
78. lso derived from the vote; it will be regarded as the score in the scoring system.

[Figure 4.2: s-CART mechanism. The training data are split with Method 1, Method 2, ... to grow CART Tree 1, CART Tree 2, ...; the testing data pass through all trees, and majority voting of the sub-classifiers gives the classification with score.]

The scoring method is more accurate because (1) it may generate different scores for different cases even if they fall into the same node of one tree, since they may fall into different nodes in another tree; and (2) it utilizes more information from the internal characteristics of each case when computing the score. Each case travels through several different CART trees, so its internal characteristics are checked and utilized several times.

[Figure 4.3: s-RF mechanism. Classification 1, Classification 2, ..., Classification N are combined into the score Random Forest.]

The score Random Forest is developed based on score CART. In the first step, score Random Forest applies the same OOB technique as Random Forest in generating samples. Unlike Random Forest, an s-CART is grown on each sample instead of a CART. Each s-CART gives a score as its classification result. The score of the Random Forest is derived by averaging the scores of all s-CARTs. This is a simple idea, but it builds on the strength of s-CART and therefore has more classification power.

4.3.2 Test results

We use the Head Neck
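The two scores described in this snippet can be sketched as follows. This is a minimal illustration, assuming that the s-CART score is the fraction of sub-trees voting for the disease class (coded 1) and that a case is called diseased when its score exceeds 0.5; the threshold and all function names are assumptions, not taken from the source.

```python
def s_cart_score(tree_votes):
    """Score of one s-CART: the fraction of its CART trees voting for class 1."""
    return sum(tree_votes) / len(tree_votes)


def s_rf_score(s_cart_scores):
    """Score of the s-RF: the average of the scores of all its s-CARTs."""
    return sum(s_cart_scores) / len(s_cart_scores)


def classify(score, threshold=0.5):
    """Turn a score into a binary call; the 0.5 threshold is an assumption."""
    return 1 if score > threshold else 0


# Six trees (one per splitting method), four of which vote "disease".
votes = [1, 1, 1, 1, 0, 0]
score = s_cart_score(votes)

# Averaging the scores of several s-CARTs gives the forest-level score.
forest_score = s_rf_score([1.0, 0.5, 0.0])
```

With six trees the possible s-CART scores are multiples of 1/6, so four votes out of six give a score of about 0.667.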
79. ly in Section and Section. Then the scoring methods to improve those two classifiers are presented in Section 4.3. In Section 4.3.2 we compare and show the results.

4.1 Classification and regression trees

CART has two steps: recursive partitioning to grow the tree, and pruning to select the right-sized tree.

4.1.1 Tree growing

The tree growing step of CART is a top-down, divide-and-conquer procedure. A binary decision tree grows by learning the hidden pattern of the training samples.

Require: node n, dataset D, split selection measure v
Build classification tree T
GrowTree(node n, dataset D, split selection measure v)
    If n meets the stop criteria
        label of n = the majority class label of D
    Else
        apply v to D to find the best split attribute X for node n
        partition D into D_l, D_r by X
        create children nodes n_l with D_l, n_r with D_r
        label the edge (n, n_l) with predicate q(n, n_l) and the edge (n, n_r) with predicate q(n, n_r), based on split attribute X
        GrowTree(n_l, D_l, v)
        GrowTree(n_r, D_r, v)
    End If
End GrowTree

Table 4.1: Recursive tree growing schema for CART

In Table 4.1, n is the input root node and D is the training data set. CART generates a binary tree, so this schema shows only two children after each split, but it can be modified slightly to describe other decision tree algorithms (CHAID, C4.5, FACT) that can generate multiple children at each split. The sp
80. mic data generation, which means the protein data is reproducible. In Table we notice that the five microarrays correlate very well. The correlations are all above 0.8, except those involving the 3rd array, which are all above 0.7.

          Array 1   Array 2   Array 3   Array 4   Array 5
Array 1   1         0.9154    0.732     0.9105    0.8076
Array 2             1         0.7636    0.9717    0.9168
Array 3                       1         0.7178    0.8775
Array 4                                 1         0.8654
Array 5                                           1

Table 5.4: Correlation of gene data

Figure 5.9 shows the gene-protein correlation result using three different methods: Pearson correlation, Spearman rank correlation, and the canonical correlation.

[Figure 5.8: Correlation of the protein data. Scatter plot titled "Correlation between two runs for protein abundance"; horizontal axis: Run 1.]

It is evident that the Spearman rank correlation is the least powerful and the canonical correlation is the most powerful. Even for the same method, correct and incorrect usage yield drastically different results. Figure depicts the Pearson correlation for the platelet study without the Box-Cox normality transformation (Figure 5.10a), with the normality transformation performed on both gene and protein data (Figure 5.10b), or with the normality transformation performed on the gene data only (Figure 5.10c). Since both the gene and protein data were found to be non-normal, the Pearson
81. n abundance

Figure 5.5: Distributions of protein and mRNA abundances

[Figure 5.6: Box-Cox transformation of protein abundances. Panels: histogram of protein data; Box-Cox normality plot (log-likelihood); transformed data; normal probability plot.]

Select $\lambda$ to maximize the logarithm of the likelihood function

$$f(x, \lambda) = -\frac{n}{2} \log\left[\sum_{i=1}^{n} \frac{\left(x_i(\lambda) - \bar{x}(\lambda)\right)^2}{n}\right] + (\lambda - 1) \sum_{i=1}^{n} \log(x_i) \tag{5.2}$$

where $\bar{x}(\lambda) = \frac{1}{n} \sum_{i=1}^{n} x_i(\lambda)$ is the mean of the transformed data.

In Figure and Figure we notice that the mRNA data is normal but the protein data is still not normal after the transformation. We can nevertheless calculate p-values and confidence intervals for the Pearson correlation and the canonical correlation by applying the bootstrap method.

[Figure 5.7: Box-Cox transformation of mRNA abundances. Panels: histogram of mRNA data; Box-Cox normality plot (log-likelihood); transformed data; normal probability plot.]

Before we calculate the canonical correlation, it is necessary to check the reproducibility of the five microarrays and two proteomic runs for the 120 gene-protein pairs. Figure shows there is a high correlation (0.9) between the 2 runs for proteo
82. n be given by the resubstitution error rate $R^{s}(T)$. A criterion to estimate the variance of the error rate is the 1-SE Rule:

$$\mathrm{SE}\left(R^{s}(T)\right) = \sqrt{\frac{R^{s}(T)\left(1 - R^{s}(T)\right)}{N_2}}$$

This rule can also be used to select the right-sized tree. The purpose of the selection is to (1) reduce the instability in pruning and (2) select the simplest tree with comparable accuracy. gives a more detailed description.

4.2 Random forests

A random forest is a classifier consisting of a collection of tree-structured classifiers [Bre01]. The random forest algorithm is based on CART and bagging sampling.

Bagging sampling is the first source of randomness in the random forests algorithm. The second source of randomness is the choice of variables used to select the best split in each tree. There are two random forest methods: Forest-RI, which uses random input selection, and Forest-RC, which uses linear combinations of inputs. A voting system is used for the multi-classifier system of Random Forest.

4.2.1 Bagging sampling

Bagging is the acronym of bootstrap aggregating. It was introduced by L. Breiman in [Bre96]. In recent years bagging has become as popular as other sampling methods such as boosting (including AdaBoost), v-fold cross-validation, leave-one-out cross-validation, randomization, etc. [HLBR04]. It has two steps:
• sampling: Each tree is constructed on a different training data set. Each training sample is drawn with replacement from the original training set L; about one third of the sam
83. n the protein data and each individual set of gene expression data. Thus, as long as one set of gene data is of good quality, the canonical correlation will be preserved and will prevail. In addition, the major Principal Components can be obtained to replace the original variables to magnify the significance of the canonical correlation.

Table 5.5 shows the three correlations for the 120 gene-protein pairs. The Pearson and Spearman correlations are very small and not significant. The canonical correlation is 0.53, with a significant p-value less than 0.01. We will show the adjustment technique using the number of trypsin fragments in Section 5.5. It improves the Pearson and Spearman correlations substantially.

120 gene-protein pairs   Correlation   P-value
Pearson                  0.02          > 0.2
Spearman                 0.04          0.68
Canonical                0.53          < 0.01

Table 5.5: Correlation of 120 gene-protein pairs before the tryptic adjustment (p-values are calculated by bootstrapping)

5.4 Codon adaptation index

Codon usage could be used as a tool to predict the expression level of a particular protein or a group of proteins. The degeneracy of the genetic code enables the same amino acid sequence to be encoded and translated in many different ways. Alternative codon usage is not purely random: systematic bias of degenerate codon usage appears at different levels of genetic organization. It became accepted that biased codon usage could regulate the expression levels of individ
84. nal Chem 60, pp. 2299-2301.
[Kell02] Kell D B (2002). Metabolomics and machine learning: explanatory analysis of complex metabolome data using genetic programming to produce simple, robust rules. Molecular Biology Reports 29, pp. 237-241.
[Kur91] Kurland CG (1991). Codon bias and gene expression. FEBS Lett 285, pp. 165-169.
[LS97] Wei-Yin Loh and Yu-Shan Shih (1997). Split selection methods for classification trees. Statistica Sinica 7, pp. 815-40.
[LZR02] Li J, Zhang Z, Rosenzweig J, Wang Y Y, and Chan D W (2002). Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clinical Chemistry 48(8), pp. 1296-1304.
[Mar86] R. Marshall (1986). Partitioning methods for classification and decision making in medicine. Statistics in Medicine 5, pp. 517-526.
[Mer00] Merchant M and Weinberger S R (2000). Recent advancements in surface-enhanced laser desorption/ionization time-of-flight mass spectrometry. Electrophoresis 21, pp. 1164-1167.
[Min89] J. Mingers (1989). An empirical comparison of selection measures for decision-tree induction. Machine Learning 3(4), pp. 319-342.
[Per04] Perrotta PL and Bahou WF (2004). Proteomics in platelet science. Curr Hematol Rep 3(6), pp. 462-9.
[Pet02] Petricoin E F III, Ardekani A M, Hitt B A, Levine P J, Russo V A, Steinberg S M, Mills G B, Simone C, Fishman D A, Kohn E C, and Liotta L A (2002). Use
85. nce level: refers to either the level of a single test at each selected biomarker or the experimentwise significance level for all selected biomarkers, depending on whether you click the Classic or Bonferroni button below.

Critical Value: select biomarkers above the critical value of the Z/T tests. There are two methods to calculate the critical value: Classic, for the single-marker test, and Bonferroni, for the multiple-test correction that controls the experimentwise error rate of all selected biomarkers. Clicking either button sets the corresponding critical value automatically.

Number of Final Biomarkers: the number of biomarkers with the largest absolute Z/T values. They will be in your final model. On rare occasions, the number of biomarkers exceeding the critical value threshold might be less than the number of markers you have entered. This could occur when you use the Bonferroni threshold. In this case, the minimum of the number of available markers and your chosen number will be used for the subsequent classification/prediction.

Output Description

The output is in the directory Result Dir. The summary is in an HTML file entitled ClassificationReport.htm. A suffix with the date and time the report was generated is attached to the file name to avoid confusion. Select Analysis > Display Classification/Prediction Results to open it. The biomarkers are saved in a file named Biomarkers.pat. Open a spectrum in the Main
86. new strategy and those of and [Yasui03], using the head and neck data as an example.

In Chapter 4, we develop the scoring method that yields the predictive disease probability rather than the traditional crude binary (yes/no) diagnosis. We present the s-CART and s-RF classifiers, the improved scoring variants of the binary classification and regression tree (CART) and Random Forest (RF) classifiers.

In Chapter 5, we examine how the integration of transcriptomics and proteomics improves the efficiency of protein identification, and we study the correlation between mRNA and protein expression for a thoroughly selected group of genes.

Finally, we give the concluding remarks and discuss future work in chapter

Chapter 2 Data Acquisition and Quality Control

2.1 Data acquisition

Ciphergen's ProteinChip technology is the most common pre-chromatography step prior to mass spectrometry analysis. Patterns are derived from surface-enhanced laser desorption and ionization (SELDI) protein mass spectra. The most common analytical platform comprises a ProteinChip Biomarker System II (PBS II), a low-resolution time-of-flight mass spectrometer. We present a new algorithm for PBS II to generate a mass spectrum and show its advantage by an example.

A typical SELDI experiment is illustrated in Figure 2.1. Chip processing consists of adding the protein sample, washing, and adding the energy-absorbing molecule (EAM). The chips are then processed in the mass reader, where the b
87. nked to the SwissProt database (Version XX, containing XXX proteins). Information provided by the MS analysis included: (1) protein gi accession number; (2) run, indicating whether it was found in the 1st or 2nd run; (3) protein name; (4) confidence in the protein match, which is based on the distance to the next metric; and (5) number of spectral (peptide) counts found, which represents the total number of MS/MS spectra corresponding to a particular protein accession. Spectral (peptide) counts were used as a simple semi-quantitative means of establishing protein abundance among complex MS data sets [Sand05]. All peptides with confidence levels greater than 70 were used for integrated proteomic abundance determinations. To ensure compatibility between both runs, spectral counts were normalized by global scaling to the average spectral count detected per protein sample; spectral counts in each experiment were then scaled to ensure compatibility across data sets.

Platelet transcripts (Gene Microarray analysis)

Microarray data were derived from a subset of previously reported mRNA profiles of human platelets [Gna03]. Platelets were collected from volunteer donors (N = 5) by apheresis to obtain sufficient RNA for hybridization to the Affymetrix U133A gene chip (Affymetrix), and expression data were analyzed using Genespring 7.0 software (Silicon Genetics, Redwood City, CA). A transcript was considered platelet-expressed if it was presen
88. ntly different

[Figure 5.11: Box plot of CAI for highest and lowest expressed platelet transcripts. Groups: Highest Expressed, Lowest Expressed.]

At the protein level, we detected 22 proteins belonging to the group of 50 highest expressed platelet transcripts. For the lowest expressed transcript group, only 12 proteins were detected.

It is evident that, for individual genes, the correlation between protein and mRNA expression is low. Since correlation depends on the distributions of both parameters compared, it is possible that the transcript and protein abundances follow different types of distributions (Figure). It may also indicate that our method of measuring protein abundance (number of peptide hits per protein) is not optimal for this type of analysis.

In summary, CAI analysis could be used as a tool to predict or compare protein expression levels for a group of proteins, but it requires extra caution if applied to individual gene products.

5.5 Tryptic adjustment

Trypsin is a serine protease found in the digestive system, where it breaks down proteins. It is used for numerous biotechnological processes. Figure 5.12 shows the crystal structure of a trypsin. In Figure 5.13, the trypsin fragments of the protein Profilin are illustrated.

We use the number of peptide hits per protein to measure the protein abundance in the previous correlation analysis. This may not be optimal and the tripsin
89. of proteomic patterns in serum to identify ovarian cancer. Lancet 359, pp. 572-577.
[Pet02a] Petricoin E F III, Ardekani A M, Hitt B A, Levine P J, Russo V A, Steinberg S M, Mills G B, Simone C, Fishman D A, Kohn E C, and Liotta L A (2002). Proteomic patterns in serum and identification of ovarian cancer. Lancet 360, pp. 169-171.
[Pru05] Pruitt KD, Tatusova T, Maglott DR (2005). NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 33 (Database issue), pp. D501-4.
[Qui86] J. Ross Quinlan (1986). Induction of decision trees. Machine Learning 1(1), pp. 81-106.
[Sand05] Sandhu C, Michael Connor, Thomas Kislinger, Joyce Slingerland, and Andrew Emili (2005). Global Protein Shotgun Expression Profiling of Proliferating MCF-7 Breast Cancer Cells. Journal of Proteome Research 4(5), pp. 674-689.
[Sch05] Schad M, Lipton MS, Giavalisco P, Smith RD, Kehr J (2005). Evaluation of two-dimensional electrophoresis and liquid chromatography tandem mass spectrometry for tissue-specific protein profiling of laser-microdissected plant samples. Electrophoresis Jul; 26(14), pp. 2729-38.
[Sha87] Sharp PM, Li WH (1987). The codon Adaptation Index: a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 15(3), pp. 1281-95.
[Scha05] Schad M, Lipton MS, Giavalisco P, Smith RD, Kehr J (2005)
90. ogy, the protein expression data can now be readily obtained [Banfi06], including from plants such as Arabidopsis thaliana [Sch05]. Since the technique is much more sensitive, significantly lower sample amounts are required for LC-MS/MS than for 2-D protein gel electrophoresis. Our gene-protein integration software module will enable the automated matching of mRNAs and their corresponding protein products.

A fundamental and pressing question is the correspondence of transcriptional responses (mRNA level) to cellular protein abundance, which is also influenced by translational and post-translational mechanisms [Cox05]. Quantification of the gene product (mRNA and protein) correlation, concordance strength, and difference in abundance would offer a unique insight into how the information encoded by a myriad of gene products is integrated at the molecular, cellular, and organism levels. However, the few comparison studies published have yielded inconsistent results. The integration of gene and protein data would reveal the correspondence of cellular protein abundance to transcriptional responses and provide insight into the molecular pathways that determine and link gene and protein expression patterns.

In this chapter, we first explain how to obtain the proteomic MS data and gene microarray data (Section 5.1) and build the correspondence between the gene and protein data using the human platelet example (Section). In Section 5.3, three correla
91. omarkers detection and the ensuing classification/prediction analysis. Thus the above input directories should be the output directories in Preprocessing. A subdirectory named PeakAligned, followed by the selected parameters, is created automatically in each input directory. The spectra with detected biomarkers are output to this subdirectory.

Example (following the example in Data Preprocessing):

Disease Dir: dirl Preprocessed Smoothed 3 e 005 A
Control Dir: dirl Preprocessed Smoothed 3 e 005 B
Blinded Dir: dirl Preprocessed Smoothed 3 e 005 C

The output directories are:

Disease: dirl Preprocessed Smoothed 3 e 005 A PeakAligned
Control: dirl Preprocessed Smoothed 3 e 005 B PeakAligned
Blinded: dirl Preprocessed Smoothed 3 e 005 C PeakAligned

A.3.3 Classification/Prediction

Select Analysis > Classification/Prediction. We perform the Z/T test to select significant biomarkers. Bonferroni's method is applied for multiple-test correction to determine the experimentwise critical value. Using the significant biomarkers, we train the classifiers with the training sets (e.g., disease and control) and then predict the identity of those in the blinded testing data set (e.g., test).

Batch Processing Directory

[Screenshot: Classification/Prediction dialog.]

Result Dir: the directory of the output results.
Disease Dir: the directory of the training data set of subjects with c
92. on analysis. Peaks above the noise level are represented by green squares and are termed refined peaks. The refined peaks will be used for further classification/prediction.

Peak Alignment: align peaks across all samples within the Alignment Window. This parameter is the window size and should be a positive number.

Peak Area: if you choose Peak Area in the Biomarker Detection menu, this item will be activated. Input the width of the interval used to calculate the peak area. If you input zero, it is equivalent to detecting the Maximum Peak Intensity.

Save Profile: all parameters will be saved in a pek file. The file will be used later in the Batch Processing Page to perform peak detection on an entire dataset folder.

Batch Processing Page

[Screenshot: Biomarker Detection, Batch Processing page, with the Profile Setting, Disease Dir, Control Dir, and Blinded Dir fields (each with a Browse button) and the Start Batch button.]

Profile Setting: the location of the pek file with the peak detection parameters. By default it is the file saved most recently in the Parameter Selection Page.

Disease Dir: the directory of the training data set of subjects with a certain disease or abnormality.

Control Dir: the directory of the training data set of normal control subjects.

Blinded Dir: the directory of the blinded testing data set, with a blinded mixture of diseased and control subjects.

We recommend that the user use the preprocessed spectra for bi
93. or i = 1 to TN
    T_i = RF tree construction
    for m = 1 to M
        OOB_m = array of all OOB sample values at variable m
        OOB' = randomly permute OOB_m
        Classify OOB' on T_i; cl_i(x) = predicted category for case x
        for all x such that x in OOB
            ME(x, i, m) = cl_i(x)    (count as majority vote)
            times(x) = times(x) + 1
        end
    end
end
for m = 1 to M
    for x = 1 to X
        initialize: cc is the category counter to sum the classification results
        cc(C) = 0
        for i = 1 to TN
            cc(ME(x, i, m)) = cc(ME(x, i, m)) + 1
        end
        ct = true category of x
        cm = maximum category in cc
        Proportion_ct = cc(ct) / times(x)
        Proportion_cm = cc(cm) / times(x)
        for any m, sum the misclassification rate over all X:
        VI(m) = VI(m) + Proportion_cm - Proportion_ct
    end
    VI(m) = VI(m) / X    (average)
end
End Variable Importance

Table 4.2: Variable importance schema for RF

will be generated from the votes. Figure 4.2 shows how the s-CART system works. In this thesis, we adopt Information Gain, Gini index, Gini ratio, and their Marshall Correction algorithms as splitting methods. Six different trees are generated using the different splitting methods. When a new case is input, it travels down all trees to get the classification results. Besides the majority vote that gives the final classification of the case, the probability will be a
94. ound proteins are liberated by ionization and fly through a time-of-flight tube, where they separate based on mass and charge. The ProteinChip Software then converts the TOF data to generate a mass spectrum profile. The two useful formats for viewing the data are the raw spectrum and the grey scale.

[Figure 2.1: ProteinChip SELDI Protocol. Modified by William E. Grizzle, O. John Semmes, et al., with permission from Ciphergen Biosystems, Inc.]

We always analyze the raw spectrum, which has the markers' mass-to-charge ratio (m/z) values on the horizontal axis and the intensity on the vertical axis.

There are eight samples on each protein chip. The analytical platform PBS II fires a laser beam repeatedly on the middle stripe of each sample. Each sample can be accessed through 100 different positions; position 1 is at the bottom and position 100 is at the top. The positions that contain important information are called hot spots, and those that contain no useful information are called cold spots. Ideally the laser would fire on the hot spots only, but this is impossible because hot spots and cold spots are not easy to distinguish. To extract as much information as possible from the hot spots, PBS II fires the laser beam seve
95. ples are left out. The left-out samples form the testing data set, called the out-of-bag (OOB) samples.
• voting: Suppose the predictor of the classifier is $\varphi(x, L)$; the vote is

$$\varphi_B(x) = \begin{cases} \mathrm{av}_k\, \varphi(x, L_k), & y \text{ is a numerical variable} \\ \mathrm{vote}_k\, \varphi(x, L_k), & y \text{ is a categorical variable} \end{cases}$$

Step 1 is the kernel and the first source of randomness in Random Forests. In [Die00], bagging is simplified to only its first phase, the sampling phase, and this usage has been widely accepted.

Improved accuracy and estimation of the generalization error (PE) are two major advantages of using bagging. Out-of-bag (OOB) is the most exciting technique developed in Random Forest because it can be used for many purposes, such as generalization error estimation, outlier detection, variable importance ranking, scaling coordinates, etc. Each bootstrap sample contains only about two thirds of the original training data set, and the left-out samples are collected as the OOB data set. Since the error rate decreases as the number of tree predictions combined increases, the out-of-bag estimates tend to overestimate the real error rate on the testing sample. In [Bre96], the empirical study of error estimates for bagged classifiers shows that OOB is as accurate as using a test set of the same size as the training set.

After generating hundreds of trees, random forest needs to apply them to predict the new case. Each individual tree classifies the new case independently. [Bre01] uses majority vo
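The "about one third left out" figure follows from sampling with replacement: a given case is missed by all N draws with probability (1 - 1/N)^N, which approaches 1/e, roughly 0.368, as N grows. The sketch below illustrates one bagging draw and this expectation; the function names are illustrative and the seed is arbitrary.

```python
import random


def bootstrap_split(n, rng):
    """Draw n case indices with replacement; return (in-bag, out-of-bag) indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    oob = [i for i in range(n) if i not in chosen]
    return in_bag, oob


def expected_oob_fraction(n):
    """Probability that a given case is never drawn in n draws with replacement."""
    return (1.0 - 1.0 / n) ** n


rng = random.Random(1)
in_bag, oob = bootstrap_split(1000, rng)
observed_fraction = len(oob) / 1000
print(observed_fraction, expected_oob_fraction(1000))
```

Each tree in the forest would be grown on its own `in_bag` sample and evaluated on the matching `oob` cases.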
96. ral times at each chosen position, and Ciphergen's ProteinChip software takes the average of all shots at the chosen positions; this average is the final mass spectrum of the sample.

[Figure 2.2: Cold spots and hot spots. Labels: laser beam, cold spot, hot spot.]

However, the average of all shots is not good if the laser beam fired on too many cold spots: garbage information is included, which is not acceptable. We use an adjusted mean to generate a more accurate mass spectrum:
1. Eliminate the instrument noise. For PBS II, the intensities without sample on the protein chip are below 6.
2. Take the average of all shots between the 25th percentile and the 75th percentile at each m/z value.

Example: Eight wild-type rats are on one protein chip. The laser beam fires from position 19 to position 79 in steps of 6 positions, which gives 11 positions. The laser fires 15 times at each position. Therefore the total number of shots is 11 × 15 = 165. The m/z range is 0-20,000. There is much instrument noise at each m/z value.

Sample   m/z 5997.97   m/z 8195.01
1        72            59
2        37            15
3        38            25
4        27            0
5        2             4
6        3             8
7        10            0
8        1             3

Table 2.1: Proportion of 165 shots that have intensities < 6

For example, five samples have more than 10 shots below the noise level at m/z 5997.97 and 3 samples have same situation at m
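The two-step adjusted mean described in this snippet can be sketched for the shots recorded at a single m/z value. This is one possible reading, not the thesis implementation: it assumes that "eliminating the instrument noise" means discarding shots with intensity below 6, and that the percentiles are computed on the remaining shots with linear interpolation. The function name and the example intensities are hypothetical.

```python
from statistics import mean, quantiles


def adjusted_mean(shots, noise_level=6.0):
    """Interquartile mean of the shots at one m/z value after noise removal."""
    # Step 1: drop shots below the instrument noise level (assumed reading).
    signal = [s for s in shots if s >= noise_level]
    if not signal:
        return 0.0
    if len(signal) < 2:
        return mean(signal)
    # Step 2: average the shots between the 25th and 75th percentiles.
    q1, _, q3 = quantiles(signal, n=4, method="inclusive")
    middle = [s for s in signal if q1 <= s <= q3]
    return mean(middle)


# Hypothetical intensities at one m/z value: two cold-spot shots and one outlier.
shots = [1, 2, 10, 20, 30, 40, 50, 1000]
print(adjusted_mean(shots), mean(shots))
```

In this example the plain average is dominated by the single extreme shot, while the adjusted mean stays close to the bulk of the signal.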
97. ram. The related RefSeq sequences (NP IDs) are used to build the protein database. 2416 of these have RefSeq accessions by using blastp against the human NCBI RefSeq database, and 526 among them are unique. The target nucleotide sequences for each Affymetrix probe set were downloaded from the Affymetrix analysis web database. 1640 of the 22,215 platelet transcripts were represented on the Affymetrix U133A microarray. These non-full-length sequences were then used to download full-length platelet nucleotide sequences from RefSeq, a curated and non-redundant collection of sequences representing genomic data, transcripts and proteins [Pru05]. Full-length sequences were available for 1,603 of the 1,640 Affymetrix accessions, of which 1,240 represented unique non-redundant sequences. Those 1,240 sequences were used for all subsequent platelet transcript analyses. Finally, we derived two databases: the platelet protein database consists of 526 sequences, and there are 1,240 sequences in the platelet nucleotide database. Protein sequences were then queried against the platelet nucleotide sequence database using tblastn in BLAST, which allows comparison of platelet protein amino acid sequences to the six-frame translations of the platelet nucleotide database. On the other hand, nucleotide sequences were queried against the platelet protein database using blastx, which compares the six-frame conceptual translation products of a nucleotide query sequence bo
98. rapping. 5.7 Correlations of the group in four quadrants. 5.8 Clustering result. 5.9 The gene symbols and names in nine clusters. Acknowledgements: I cannot begin but by expressing my endless gratitude to my adviser, Professor Wei Zhu, not only for her valuable advice and support but also for her warm understanding. I would have been nowhere without them. I am also deeply indebted to the support of Doctor Wadie Bahou from the School of Medicine. This thesis would not be possible without his guidance and unquestioning support. I would like to thank Professor Nancy Mendell and Professor Estie Arkin, from whom I have learned many important scientific and mathematical skills. I would like to thank my dear parents for constantly standing beside me and for keeping alive the place that I will always call home. My thoughts go to you in all I do. Many good friends from Stony Brook and some old friends from Boston have smiled and bestowed me with various graces through good times and rough. Dr. Dmitri Gnatenko, Peter Perotta and Melissa Monaghan: I learned a lot from you, especially on microarray technologies and biological knowledge. Dr. Jim Ma, Dr. Bin Xu, Dr. Xuena Wang, Dr. Valentin Polishchuk: for your listening and advice. Kith Pradhan, Xiangfeng Wu, Meimei Wu, Yue Zhang and Yue Wang: thank you for your suggestions and help. My aunt Ye Wu and her family: for their care and optimism. Thank you, Lar
99. ree as simple as possible. Usually the misclassification rate will decrease when the tree grows, but it will increase again if the tree continues to grow and gets too big (Figure). Pruning will use the Minimal Cost Complexity criterion. The key is to find the weakest-link cutting (WLC). It generates a decreasing sequence of subtrees $T_1 > T_2 > T_3 > \cdots > t$, where $t$ is the tree which contains the root node only. It has been proved that the results are the minimum-cost subtrees for a given number of terminal nodes [Bre84]. [Figure 4.1: Tree pruning for head and neck cancer data. Misclassification rates on the training set and on the test set, plotted against the size of the tree.] The cost complexity measure $R_\alpha(T)$ is defined as $R_\alpha(T) = R(T) + \alpha|T|$ (4.7), where $R(T)$ is the misclassification rate of tree $T$, $|T|$ is the number of leaf nodes (also considered as the tree size), and $\alpha$ is the complexity cost. There are two methods for seeking the minimal cost complexity: • Independent testing samples, if an independent data set is given or the original training data set is big enough to draw out an independent testing set. • The v-fold cross-validation method, if the data set is small. When the best tree is found by the tree growing and pruning, its misclassification rate ca
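To make the cost complexity measure in (4.7) concrete, here is a small sketch that evaluates R(T) + alpha * |T| for a pruning sequence and picks the subtree with the smallest value. The function names and all numbers are hypothetical and chosen only for illustration; they are not results from the head and neck cancer data.

```python
def cost_complexity(misclassification_rate, n_leaves, alpha):
    """R_alpha(T) = R(T) + alpha * |T|, as in equation (4.7)."""
    return misclassification_rate + alpha * n_leaves


def best_subtree(subtrees, alpha):
    """Return the index of the subtree with minimal cost complexity.

    subtrees: list of (misclassification_rate, n_leaves) pairs, e.g. the
    decreasing sequence produced by weakest-link cutting.
    """
    return min(
        range(len(subtrees)),
        key=lambda k: cost_complexity(*subtrees[k], alpha),
    )


if __name__ == "__main__":
    # Hypothetical pruning sequence, from the largest tree down to the root.
    sequence = [(0.10, 8), (0.12, 5), (0.20, 2), (0.40, 1)]
    for alpha in (0.0, 0.01, 0.5):
        print(alpha, best_subtree(sequence, alpha))
```

With alpha = 0 the largest tree wins, and as alpha grows the selection moves toward the root-only tree, which is the trade-off the pruning step is meant to control.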
100. [Screenshot: file selection dialog listing the demo spectra.] The program will prompt you to select a color for all spectra to be opened. They will be in the same color, but you can change the color of any individual spectrum later on by selecting that particular spectrum using the Display Toolbar and then clicking Change Color to reset its color. Move the Display Toolbar below the Map Window to see each single spectrum. The location of each spectrum can be seen in the File Directory: d:\proteoExplorer\demodata\control\HN001.txt. Alternatively, you can hold the Ctrl button on the keyboard and click the left mouse button to choose the desired spectrum. A.2.3 Average files. Display and Output Average Spectrum of Multiple Files: Choose File -> Average Files; the average of all selected spectra will be calculated and displayed in the Main Window. To take the average of all spectra in a directory, choose the target directory and press Ctrl+A on the keyboard. The File Directory will display that this spectrum is an average: "Average of same files starting with d:\proteoExplorer\demodata\control\HN001.txt". Select File -> Save Spec to save this average file. A.2.4 Display features. Zoom In/Out: Left click to zoom in, right click to zoom out. To select a target region in the Main Window or Map Window, right click the mouse, hold it, and drag the yellow rectang
101. rt of routine analysis and visualization procedures is provided to guide new users. Behind this demo version, the software is still under development to add more functionality, to implement the latest developed algorithms, and to be more robust and user friendly. We thank the State University of New York at Stony Brook for sponsoring the development of proteoExplorer. A.1 Introduction. Structure of proteoExplorer: [Figure: flow chart of the structure of proteoExplorer; the text inside the chart is not recoverable.] Analysis Procedure. Step 1: Data Quality Control by Visualization. Use the visualization tool provided by the software to easily monitor the mass spectrum individually, by group, among repeated measures, or by any other experimental factors. A simple function taking the average of multiple files or an entire directory is also implemented in this step, both for visualization and for creating the
102. ry and Dan, for all you are. My thanks, my love to all. Chapter 1 Introduction. 1.1 Genomics and proteomics. The fundamental working units of every living system are defined as cells. All the instructions needed to direct their activities are contained within the chemical DNA (deoxyribonucleic acid). Whilst DNA from all organisms is made up of the same chemical and physical components, the DNA sequence is the particular side-by-side arrangement of bases along the DNA strand (e.g., ATTCCGGA). This order spells out the exact instructions required to create a particular organism with its own unique traits. The genome is an organism's complete set of DNA. Genomes vary widely in size: the smallest known genome for a free-living organism (a bacterium) contains about 600,000 DNA base pairs, while human and mouse genomes have some 3 billion. Except for mature red blood cells, all human cells contain a complete genome. DNA in the human genome is arranged into 23 distinct chromosomes, physically separate molecules that range in length from about 50 million to 250 million base pairs. A few types of major chromosomal abnormalities, including missing or extra copies or gross breaks and rejoinings (translocations), can be detected by microscopic examination. Most changes in DNA, however, are more subtle and require a closer analysis of the DNA molecule to find perhaps single-base differences. Each chromosome contains many genes, the basic physical
103. s of other proteins made in the same cell at the same time and with which it associates and reacts. Studies to explore protein structure and activities, known as proteomics, will be the focus of much research for decades to come and will help elucidate the molecular basis of health and disease. Specifically, it enables correlations to be drawn between the range of proteins produced by a cell or tissue and the initiation or progression of a disease state. As a consequence, the proteome is far more complex than the genome. In order to enable the diagnosis of an insidious disease producing few symptoms in early stages, such as ovarian cancer, proteomics is employed to detect the protein marker pattern from the database of proteomic mass spectrometry and to gain a better understanding of the molecular mechanisms of cancer development. Proteomics is a scientific discipline which detects proteins that are associated with a disease by means of their altered levels of expression between control and disease states. Whilst humans are estimated to have between 30,000 and 40,000 genes, potentially encoding 40,000 different proteins, alternative RNA splicing and post-translational modification may increase this number to about 2 million proteins or protein fragments. Proteins, which carry out and modulate the vas
104. ssing on the three groups of spectra. Click the Browse button to choose the directory for each group:
Head & Neck Cancer (disease): proteoExplorer\demodata\disease
Normal Control (control): proteoExplorer\demodata\control
Blinded (test): proteoExplorer\demodata\test
The Output Root Dir will be proteoExplorer\demodata\Preprocessed automatically. Click Start Batch to start the preprocessing procedure. [Screenshot: Preprocessing sub-window with the Disease Dir, Control Dir, Blinded Dir and Output Root Dir fields and the Start Batch button.] A subdirectory of proteoExplorer\demodata\Preprocessed is created and named "Smoothed 1e-003 Normalized 2e+003 2e+004". The name contains the parameters for smoothing and normalization. The preprocessed spectra are saved in this subdirectory as follows:
Head & Neck Cancer (disease): proteoExplorer\demodata\Preprocessed\...\disease
Normal Control (control): proteoExplorer\demodata\Preprocessed\...\control
Blinded (test): proteoExplorer\demodata\Preprocessed\...\test
A.4.3 Biomarker selection. Maximum Peak Intensity: Select Analysis -> Biomarker Selection -> Maximum Peak Intensity to open the sub-window to generate the peak data based
105. t or marginal in 4 of 5 platelet samples. Using these strict criteria, 1640 mRNAs were expressed at significant levels by platelets [Gna03]. Relative transcript abundance was established by rank-ordering the unique set of non-redundant mRNAs by the mean normalized signal intensities across the individual arrays, using computational algorithms as previously described. 5.2 Integration of gene and protein database. We use a comprehensive bioinformatic approach to integrate the platelet proteomic and transcriptomic datasets, as in Figure. Here we applied the BLAST algorithm for sequence comparison; as Figure 5.2 shows, it is a heuristic search method that seeks words of length W that score at least T when aligned with the query and scored with a substitution matrix. Words in the database that score T or greater are extended in both directions in an attempt to find a locally optimal ungapped alignment, or HSP (high-scoring pair), with a score of at least S or an E-value lower than the specified threshold. HSPs that meet these criteria will be reported by BLAST, provided they do not exceed the cutoff value specified for the number of descriptions and/or alignments to report. Amino acid sequences for each accession number identified by LC-MS/MS were downloaded from the NCBI database. NCBI accession could be [Figure residue: 2D LC-MS/MS of platelet; microarray of platelet mRNA; NCBI protein accession; 2604 sequences; Affym
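The word-seeding and ungapped-extension idea described above can be illustrated with a toy sketch. It is not the BLAST implementation: it assumes a simple +1/-1 identity scoring in place of a real substitution matrix, scans every position pair instead of using a word index, and extends to the end of the sequences without any drop-off cutoff. All function names are hypothetical.

```python
def pair_score(a, b, match=1, mismatch=-1):
    """Toy substitution score (assumption: identity scoring, not BLOSUM/PAM)."""
    return match if a == b else mismatch


def find_seeds(query, subject, W, T):
    """Find word pairs of length W that score at least T when aligned."""
    seeds = []
    for i in range(len(query) - W + 1):
        for j in range(len(subject) - W + 1):
            s = sum(pair_score(query[i + k], subject[j + k]) for k in range(W))
            if s >= T:
                seeds.append((i, j, s))
    return seeds


def best_gain(pairs):
    """Best cumulative score reachable by extending along aligned pairs."""
    running = best = 0
    for a, b in pairs:
        running += pair_score(a, b)
        best = max(best, running)
    return best


def extend_seed(query, subject, i, j, W, seed_score):
    """Extend a seed in both directions without gaps; return the HSP score."""
    right = zip(query[i + W:], subject[j + W:])
    left = zip(reversed(query[:i]), reversed(subject[:j]))
    return seed_score + best_gain(right) + best_gain(left)


def high_scoring_pairs(query, subject, W, T, S):
    """Report (score, query_pos, subject_pos) for seeds extending to >= S."""
    hits = []
    for i, j, s in find_seeds(query, subject, W, T):
        total = extend_seed(query, subject, i, j, W, s)
        if total >= S:
            hits.append((total, i, j))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    print(high_scoring_pairs("ACGTACGT", "TTACGTACGTTT", W=4, T=4, S=8))
```

Several seeds on the same diagonal extend to the same alignment here; a real implementation would merge them and would also convert the score into an E-value before applying the threshold.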
106. t majority of chemical reactions that together constitute life, are the direct links to diseases and abnormalities. The proteome reflects both the intrinsic genetic program of the cell and the impact of its immediate environment. Proteomics is the study of proteins, and one of its central themes is the development of proteomic biomarker-based tests using easily accessible biological fluids, such as urine, blood, feces, sputum, and bladder or bronchioalveolar lavage, to identify potential diseases and to monitor the progress of certain therapeutic treatments. 1.2 Microarray technology. A DNA microarray (also commonly known as gene or genome chip, DNA chip, or gene array) is a collection of microscopic DNA spots, commonly representing single genes, arrayed on a solid surface by covalent attachment to chemically suitable matrices. DNA arrays are different from other types of microarray: they either measure DNA or use DNA as part of their detection system. Qualitative or quantitative measurements with DNA microarrays utilize the selective nature of DNA-DNA or DNA-RNA hybridization under high-stringency conditions and fluorophore-based detection. DNA arrays are commonly used for expression profiling, i.e., monitoring expression levels of thousands of genes simultaneously, or for comparative genomic hybridization. Arrays of DNA can either be spatially arranged, as in the commonly known gene or genome chip, DNA chip, or gene array, or can be speci
107. ta for the platelet study. [Figure 5.10: Pearson correlation between the original gene/protein expression data (a), the normality-transformed data on both gene and protein (b), and the normality-transformed data on gene only (c).] the initially selected pair is identified, and so on. The pairs of linear combinations are called the canonical variables, and their correlations are called the canonical correlations. The first canonical correlation, which is often the only significant one, as in our case, is usually adopted to describe the inter-class correlation. Here we will report the first canonical correlation, its test statistic (Wilks' Lambda), the equivalent F statistic, and the p-value. In our study, there are five sets of gene microarray data and one set of protein LC tandem MS data. The Pearson correlation and the Spearman correlation can only gauge the relationship between the protein data and one set of the gene expression data, e.g., the average of the 5 sets of gene data. Thus they will be influenced by the quality of all the data sets involved. If one set of gene data is of poor quality and thus fails to reflect the true nature of the presumably highly correlated mRNA-protein relationship, both Pearson and Spearman correlations will be less than optimal. On the other hand, the first canonical correlation, or its nonparametric counterpart based on the ranks, will be larger than the largest Pearson or Spearman correlation betwee
108. te for gathering these internal predictions and giving its final classification. Besides the majority vote, the weighted vote can also be applied. It applies the out-of-bag estimate to the combination of tree decisions. Since the out-of-bag estimate is unbiased, it is used in research for estimating the strength of each tree [Bre01]. In this thesis, we take it as the weight in voting to combine the predictions of the trees. 4.2.2 Random forests generation. Random forests is a multi-classifier system consisting of numerous trees as sub-classifiers or internal classifiers. Each tree is an unpruned CART. The advantage of using an unpruned tree rather than a pruned one is that it decreases the correlation among trees. The unpruned tree has less strength, but the reduced correlation improves the final accuracy after combining all trees. Without pruning, each tree generation will be much simpler and quicker. Tree generation is a partition process at each node. There are two approaches for split selection in each partition [LS97]: 1. For the training data set, all possible splits on each independent variable will be examined. The split with the largest impurity reduction will be selected as the best split and used for partition. There are many impurity measures, such as entropy, information gain, Gini diversity index, Gini ratio, etc., as discussed in Section 4.1.1. 2. Split rule $f(X) < c$, where $f$ is a linear combination function. FACT and QUEST are b
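The difference between the majority vote and the weighted vote can be sketched as below. This is an illustrative assumption about how the weights enter, not the thesis code: each tree's vote is multiplied by a weight (for instance an out-of-bag accuracy estimate), the weights are summed per class, and ties are broken toward the smallest label.

```python
def majority_vote(predictions):
    """Return the class predicted by most trees (ties: smallest label)."""
    return weighted_vote(predictions, [1.0] * len(predictions))


def weighted_vote(predictions, weights):
    """Return the class with the largest total weight.

    predictions: one class label per tree.
    weights: one non-negative weight per tree, e.g. an OOB-based strength
    estimate (hypothetical choice for this sketch).
    """
    totals = {}
    for label, weight in zip(predictions, weights):
        totals[label] = totals.get(label, 0.0) + weight
    # Iterate labels in sorted order so ties resolve deterministically.
    return max(sorted(totals), key=totals.get)


if __name__ == "__main__":
    tree_predictions = [1, 0, 0]
    print(majority_vote(tree_predictions))
    print(weighted_vote(tree_predictions, [0.9, 0.3, 0.3]))
```

In the example, two weak trees outvote one strong tree under the majority rule, while the weighted rule follows the strong tree.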
109. th strands against a protein sequence database. [Figure 5.3: Result of integrating platelet proteomic and genomic datasets. 379 protein sequences have corresponding nucleotide reference sequences; 511 nucleotide sequences have corresponding protein reference sequences.] As a result, shown in Figure 5.3, 379 of 526 proteins (73%) have corresponding nucleotide reference sequences (E-value < 0.001), while 511 of 1240 mRNA transcripts (41%) have corresponding protein reference sequences (E-value < 0.001). 143 sequences have the same match results between nucleotide and protein reference sequences as in the NCBI RefSeq database for human. The reported E-values provide an estimate of the statistical significance of the match between protein and nucleotide, or nucleotide and protein. An E-value of less than 0.001 was considered statistically significant. Unless otherwise stated, all relational database analyses are derived using E-values < 0.001. For example, in the query using the tblastn program, the sequence with RefSeq accession NP_004479.1 has the same corresponding nucleotide sequence as the sequence with accession NP_000164.3 (Table). However, the corresponding nucleotide accession of NP_004479.1 is NM_004488.1 in the full RefSeq database. The nucleotide sequence with accession NM_004488.1 is not in our mRNA database with 1240 sequences. Thus NP_000164.3 is one of the 143 sequences that has the same RefSeq
110. tions are calculated in correlation analysis. A codon adaptation index (CAI) is also introduced as a tool to predict the expression level of a particular protein or a group of proteins (Section 5.4). In Section 5.5, we propose a new method that uses the tryptic number to adjust the measurement of protein abundance, which is shown to be useful in improving the correlation. Finally, we applied two techniques to cluster protein-gene pairs in Section. 5.1 Data acquisition. Mass spectrometric analysis: Platelet samples were drawn from four different donors and then pooled for proteomic studies. The studies were completed in duplicate using liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS), in which the LC steps are interfaced with a fused silica capillary to maximize peptide resolution and detection sensitivity by tandem MS/MS. The mass spectrometric analysis was completed using a QSTAR Pulsar i quadrupole TOF MS (Applied Biosystems, Foster City, CA) equipped with a nano-electrospray source. The loading and elution of the peptides to and from the cation exchange column, to the reverse-phase column, and to the mass spectrometer were fully automated, and individual sample runs were completed in 24 to 36 hours. MS/MS acquisition was completed in a data-dependent manner by operating the ion trap instrument using dynamic exclusion lists. Automated protein identifications were obtained using Pro ID Software 1.0 (Applied Biosystems li
111. [Screenshot residue: Preprocessing sub-window controls.] Click Smooth at the bottom to do smoothing; the default parameter is 0.1, which means the smoothing window contains 34 points (0.1% × 34378 ≈ 34.37). The description of this preprocessing step, "0: type Smoothed, param 0.001", is displayed in the Description Textbox. Select Display Options -> Show All; one can zoom in to see the change in the preprocessed spectrum. All the spectra in each preprocessing step will be displayed simultaneously, and the most recent preprocessed spectrum (smoothed) is highlighted. The difference can also be seen by tuning the Display Toolbar below the Description Textbox. [Screenshot: Preprocessing sub-window showing the preprocessing steps for a disease spectrum, with the controls Smoother window 0.1 (Smooth), Fitted length (m/z) 3.0 (Baseline), Starting/Ending points (m/z) 2000.0 and 20000.0 (Normalize), and Save Profile.] Next we can perform the baseline correction and the normalization by clicking Baseline and Normalize respectively. In this case, we notice that the baseline is already corrected and there are some negative values. Thus we perform normalization only. The m/z range between 2,000 and
112. ual genes by modulating the rates of polypeptide elongation. Historically, the relationship between codon usage and protein/mRNA expression has been most extensively studied in yeast2. To date, several gene-sequence-based computer algorithms are available to calculate the codon usage for a particular organism or tissue (EMBOSS, JCat, etc.). We applied codon usage analysis to platelets to predict the correlation between mRNA and protein abundances. Sharp and Li [Sha87] proposed to use the CAI (codon adaptation index) to evaluate how well a gene is adapted to the translational machinery. CAI is a single-value measurement that summarizes the codon usage of a gene relative to the codon usage of a reference set of genes. A higher CAI value usually suggests that the gene of interest is likely to be highly expressed. The 50 highest platelet-expressed transcripts were taken as the initial reference set in our studies. We calculated CAI for the 156 highest expressed platelet transcripts and for the 156 lowest expressed [Wu05]. The CAI distribution of the 156 highest expressed platelet transcripts is left-skewed, and the median (0.77) is greater than the mean (0.76). Similarly, the CAI distribution of the lowest expressed platelet transcripts is right-skewed, and the median (0.73) is less than the mean (0.74). The mean CAIs for these two groups of genes were 0.76 and 0.74, respectively. The p-value of the two-sample t-test is 0.003, which means the two means are significa
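As a rough illustration of how a CAI value is obtained, the sketch below follows the general Sharp and Li idea: each codon gets a relative adaptiveness weight (its count in the reference set divided by the count of the most frequent synonymous codon), and the CAI of a gene is the geometric mean of the weights of its codons. The simplifications are assumptions of this sketch: the caller supplies the synonymous codon families, and codons with no weight or a zero weight are skipped rather than given a pseudo-count. Function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter


def split_codons(sequence):
    """Split a coding sequence into consecutive codons, dropping any remainder."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[k:k + 3] for k in range(0, usable, 3)]


def relative_adaptiveness(reference_genes, synonymous_families):
    """Weight of each codon = its count / count of the best synonymous codon."""
    counts = Counter()
    for gene in reference_genes:
        counts.update(split_codons(gene))
    weights = {}
    for family in synonymous_families:
        top = max(counts[codon] for codon in family)
        if top > 0:
            for codon in family:
                weights[codon] = counts[codon] / top
    return weights


def cai(gene, weights):
    """Geometric mean of the weights of the gene's codons (zero weights skipped)."""
    logs = [
        math.log(weights[codon])
        for codon in split_codons(gene)
        if weights.get(codon, 0) > 0
    ]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


if __name__ == "__main__":
    # Two real two-codon families: lysine (AAA, AAG) and glutamate (GAA, GAG).
    families = [("AAA", "AAG"), ("GAA", "GAG")]
    reference = ["AAGAAGAAAGAGGAGGAGGAA"]  # toy reference set, made up
    w = relative_adaptiveness(reference, families)
    print(cai("AAGGAG", w), cai("AAAGAG", w))
```

A gene built only from the preferred codons of the reference set scores 1, and every less-preferred codon pulls the value down, which is why a higher CAI is read as a sign of likely high expression.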
113. ular box to zoom in. [Screenshot: proteoExplorer main window with the File, View, Analysis, Workspace and Help menus, the Change Color button and the Transparency control.] The target region is resizable in the Map Window by clicking and dragging its edges using the left mouse button. To zoom out, simply right click the mouse continuously or click the R button in the lower-right corner of the Main Window. The R button is also available in the Preprocessing sub-window and the Biomarker Detection sub-window. Move Spectrum: By moving the rectangular bar in the Map Window horizontally, the view of the mass spectrum in the Main Window will move simultaneously as well. [Screenshot: main window after moving the view.] Bring a Specific Spectrum to the Front of Multiple Spectra: While holding Ctrl on your keyboard, left click the target spectrum to bring it to the front among the multiple spectra on display. Change Color: The spectrum is displayed in white by default. The color can be changed by clicking Change Color and choosing the color in the color panel. [Screenshot: color panel.] The color of the spectrum will change from yellow to green. A.2.5 Display options. Single/Multiple Spectrum Display: View -> Multi Spec. Check this option to display multiple spectra at the same time. The following example has three spectra on display. They are co
114. ut the maximum number (47 in this example) in No. of Final Biomarkers, and the number of significant biomarkers can be seen in the output file. Now we are ready to click Start Batch to perform the classification and prediction using the given training and testing data sets. An html file entitled ClassificationReport htm will be output to the Result Dir; select Analysis -> Display Classification/Prediction Results to open it. Please note that is the generation date and time of this html output file, to avoid confusion. In the output file, the summary of the training result is listed in the first table, and the detailed classification result of each spectrum is in the second table. The selected biomarkers are also given. [Screenshot: classification report.] A file named Biomarker.pat is generated in Result Dir to save the biomarker pattern. To look at the positions of the selected biomarkers, open any spectrum first and then select Analysis -> Read Latest Biomarker Pattern to open Biomarker.pat, or select File -> Read Biomarker Pattern to locate the file. In the figure below, the average spectra of the two groups are displayed: the green one is for disease and the white one for control; the red bars denote the selected biomarkers.
