CRAN GMD: User's Guide (0.3.3)

Contents

1. [Figure 4: Graphical output of "case-chipseq.R". x-axis: Position.]
    [Plot: Heatmap of ChIP-seq data (human CD4+ T cells), with color key, row clusters (e.g. Row group 2, n = 17) and an elbow plot.]
2. [Figure 7: Graphical output of "example-ghist.R". Axes: Bin, Count.]

    A.2.2 Examples using iris data

    "case-iris.R" (Source Code 5) is a study on how to obtain and visualize histograms, using Fisher's iris data set.

        ## case-iris.R
        ## load library
        require("GMD")
        ## load data
        data(iris)
        ## create common bins
        n <- 30                                      # the number of bins
        breaks <- gbreaks(iris[, "Sepal.Length"], n) # the boundary of bins
        ## create a list of histograms
        Sepal.Length <- list(
          setosa     = ghist(iris[iris$Species == "setosa", "Sepal.Length"], breaks = breaks),
          versicolor = ghist(iris[iris$Species == "versicolor", "Sepal.Length"], breaks = breaks),
          virginica  = ghist(iris[iris$Species == "virginica", "Sepal.Length"], breaks = breaks)
        )
        ## convert to a hist object
        x <- as.mhist(Sepal.Length)
        ## get bin-wise summary statistics
        summary(x)
        ## visualize the histograms
        plot(x, beside = FALSE,
             main = "Histogram of Sepal.Length of iris", xlab = "Sepal.Length (cm)")

    [Plot: Histogram of Sepal.Length of iris, one panel each for setosa, versicolor and virginica; x-axis: Sepal.Length (cm), y-axis: Count.]

    Figure 8: Graphical
3. The GMD package provides classes and methods for computing GMD in R [5]. The algorithm has been implemented in C to interface with R for efficient computation. The package also includes downstream cluster analysis in the function css (A.4 on page 23), which uses a pairwise distance matrix to make partitions given variant criteria, including the "elbow" rule as discussed in [7] or a desired number of clusters. In addition, the function heatmap.3 (A.5 on page 25) integrates the visualization of the hierarchical clustering in a dendrogram, the distance measure in a heatmap, and graphical representations of summary statistics of the resulting clusters or the overall partition. For more flexibility, the function heatmap.3 can also accept plug-in functions defined by end users for custom summary statistics.

    The motivation to write this package was born with the project [7] on characterizing Transcription Start Site (TSS) landscapes using high-throughput sequencing data, where a non-parametric distance measure was developed to assess the similarity among distributions of high-throughput sequencing reads from biological experiments. However, it is possible to use the method for any empirical distributions of categorical data.

    The package is available on CRAN. The source code is available at http://CRAN.R-project.org/package=
4. [Output of names(cagel), end: names of the form "gene (tag cluster ID)", e.g. "Cd72 (T04R028B8BC9)", "Centg2 (T01F055392D1)".]

    B.3 ChIP-seq data: chipseq_mES and chipseq_hCD4T

        > help(chipseq)
        > data(chipseq_mES)
        > class(chipseq_mES)
        [1] "list"
        > length(chipseq_mES)
        [1] 6
        > names(chipseq_mES)
        [Output: six histone modification names, e.g. "H3K27me3", "H3K36me3", "H3K4me1".]
        > data(chipseq_hCD4T)
        > names(chipseq_hCD4T)
        [Output: names of histone modifications and factors, e.g. "CTCF", "H2AZ", "H2BK20ac", "H3K27ac", "H3K36me3", "H3K79me2", "H3R2me1", "H4K5ac", "H4K20me1".]

    References

    [1] Artem Barski, Suresh Cuddapah, Kairong Cui, Tae-Young Roh, Dustin E. Schones, Zhibin Wang, Gang Wei, Iouri Chepelev, and Keji Zhao. High-resolution profiling of histone methylations in the human genome. Cell, 129(4):823-837, May 2007.

    [2] Piero Carninci, Albin Sandelin, Boris Lenhard, Shintaro Katayama, Kazuro Shimokawa, Jasmina Ponjavic, Colin A. M. Semple, Martin S. Taylor, Pär G. Engström, Martin C. Frith, Alistair Forre
5. [Figure 11: Graphical output of "example-gdist.R". x-axis: Observations; hclust (*, "complete").]

    [Figure 12: Graphical output of "example-gdist.R". Cluster Dendrogram of USJudgeRatings data; x-axis: Variables, y-axis: Height; hclust (*, "complete").]

    A.4 css: Clustering Sum-of-Squares and the "elbow" plot: determining the number of clusters in a data set

    A good clustering yields clusters where the total within-cluster sum-of-squares (WSSs) is small (i.e. cluster cohesion, measuring how closely related the objects in a cluster are) and the total between-cluster sum-of-squares (BSSs) is high (i.e. cluster separation, measuring how distinct or well-separated one cluster is from the others).

    "example-css.R" (Source Code 8) is an example of how to make a correct choice of k using the "elbow" criterion. A good k is selected according to a) how much of the total variance in the whole data the clusters can explain, and b) how large a gain in explained variance we obtain by using this many clusters compared to one less or one more (the so-called "elbow" criterion).

    The optimal choice of k will strike a balance between maximum compression of the data using a single cluster
6. A.2 ghist: Generic construction and visualization of histograms

    A.2.1 Examples using simulated data

    "example-ghist.R" (Source Code 4) is an example of how to construct a histogram object from raw data and make a visualization based on it.

        ## example-ghist.R
        ## load library
        require("GMD")
        ## create two normally-distributed samples
        ## with unequal means and unequal variances
        set.seed(2012)
        v1 <- rnorm(1000, mean = -5, sd = 10)
        v2 <- rnorm(1000, mean = 10, sd = 5)
        ## create common bins
        n <- 20                          # desired number of bins
        breaks <- gbreaks(c(v1, v2), n)  # bin boundaries
        x <- list(ghist(v1, breaks = breaks, digits = 0),
                  ghist(v2, breaks = breaks, digits = 0))
        mhist.obj <- as.mhist(x)
        ## plot histograms side-by-side
        plot(mhist.obj, mar = c(1.5, 1, 1, 0),
             main = "Histograms of simulated normal distributions")
        ## plot histograms as subplots, with corresponding bins aligned
        plot(mhist.obj, beside = FALSE, mar = c(1.5, 1, 1, 0),
             main = "Histograms of simulated normal distributions")

    [Figure 6: Graphical output of "example-ghist.R". Histograms of simulated normal distributions, plotted side by side; y-axis: Count.]

    [Plot: Histograms of simulated normal distributions, one panel per histogram with corresponding bins aligned.]
7. [Figure 17: Graphical output of "example-heatmap3a.R". Heatmap of the mtcars data (rows: car models) with two side plots: "Gross horsepower" for row individuals and "Mean features" for column individuals.]
8. 4 Case study

    Studies have demonstrated that the spatial distributions of read-based sequencing data from different platforms often indicate functional properties and expression profiles (reviewed in [6] and [8]). Analyzing the distributions of DNA reads is therefore often meaningful. To do this systematically, a measure of similarity between distributions is necessary. Such measures should ideally be true metrics, have as few parameters as possible, be computationally efficient and also make biological sense to end users. Case studies were made in sections 4.1 and 4.2 to demonstrate the applications of GMD using distributions of CAGE and ChIP-seq reads.

    4.1 CAGE: measuring the dissimilarities among TSSDs

    In this section we demonstrate how GMD is applied to measure the dissimilarities among TSSDs, histograms of transcription start sites (TSSs) that are made of CAGE tags, with option sliding on. The spatial properties of TSSDs vary widely between promoters and have biological implications in both regulation and function. The raw data were produced by CAGE and downloaded from [2], and CAGE sequence reads were preprocessed as in [7].

    The following code, "case-cage.R" (Source Code 2), is sufficient to perform both pairwise GMD calculation by function gmdp and to construct a GMD distance matrix by function gmdm. A handful of options are available for control and f
9. and maximum accuracy by assigning each data point to its own cluster. More importantly, an ideal k should also be relevant in terms of what it reveals about the data, which typically cannot be measured by a metric but only by a human expert.

    Here we present a way to measure such performance of a clustering model, using squared Euclidean distances. The evaluation is based on a pairwise distance matrix and is therefore more generic, in that it does not involve computing the centers of the clusters in the raw data, which are often not available or hard to obtain.

        ## example-css.R
        ## load library
        require("GMD")
        ## simulate data around 12 points in Euclidean space
        pointv <- data.frame(x = c(1, 2, 2, 4, 4, 5, 5, 6, 7, 8, 9, 9),
                             y = c(1, 2, 8, 2, 4, 4, 5, 9, 9, 8, 1, 9))
        set.seed(2012)
        mydata <- c()
        for (i in 1:nrow(pointv)) {
          mydata <- rbind(mydata, cbind(rnorm(10, pointv[i, 1], 0.1),
                                        rnorm(10, pointv[i, 2], 0.1)))
        }
        mydata <- data.frame(mydata); colnames(mydata) <- c("x", "y")
        plot(mydata, type = "p", pch = 21, main = "Simulated data")
        ## determine a "good" k using elbow
        dist.obj <- dist(mydata[, 1:2])
        hclust.obj <- hclust(dist.obj)
        css.obj <- css.hclust(dist.obj, hclust.obj)
        elbow.obj <- elbow.batch(css.obj)
        print(elbow.obj)
        ## make partition given the "good" k
        k <- elbow.obj$k; cutree.obj <- cutree(hclust.obj, k = k)
        mydata$cluster <- cutree.obj
        ## draw an elbow plot and label the data
        dev.new(width = 12, height = 6)
        par(mfcol = c(1, 2), mar = c(4, 5, 3, 3),
10. [Figure 19: Graphical output of "example-heatmap3b.R". Side plots of x against y for each row group (e.g. Row group 1, n = 20; Row group 4, n = 15).]

    B Data

    B.1 GMD dataset overview

        > data(package="GMD")
        Data sets in package 'GMD':
        cage            CAGE Data
        cagel           CAGE Data
        chipseq_hCD4T   ChIP-seq Data
        chipseq_mES     ChIP-seq Data

    B.2 CAGE data: cage and cagel

        > help(cage)
        > require(GMD)
        > data(cage)
        > class(cage)
        [1] "list"
        > length(cage)
        [1] 20
        > names(cage)
        [Output: 20 names of the form "gene (tag cluster ID)", e.g. "Rab5c (T11R05FBC6C4)", "Pfkfb3 (T02R00AEC2D8)", "Cd72 (T04R028B8BC9)".]
        > data(cagel)
        > names(cagel)
        [Output: names of the same form; the listing continues on the next page.]
11.       main = main, revC = TRUE, kr = elbow.obj$k, kc = elbow.obj$k)
        ## side plots
        normalizeVector <- function(v) { v / sum(v) }  # a function to normalize a vector
        dev.new(width = 12, height = 8)
        ## summary of row clusters
        expr1 <- list(quote(op <- par(mar = par("mar") * 2)),
                      quote(plot(mhist.summary(as.mhist(i.x)), if.plot.new = FALSE)),
                      quote(par(op)))
        ## summary of row clustering
        expr2 <- list(quote(tmp.clusters <- cutree(hclust(dist.row), k = kr)),
                      quote(tmp.css <- css(dist.row, tmp.clusters)),
                      quote(print(tmp.css)),
                      quote(tmp.wev <- tmp.css$wss / tmp.css$tss),
                      quote(names(tmp.wev) <- as.character(unique(tmp.clusters))),
                      quote(tmp.wev <- tmp.wev[order(unique(tmp.clusters))]),
                      quote(barplot(tmp.wev, main = "Cluster Explained Variance",
                                    xlab = "Cluster", ylab = "EV",
                                    col = "white", border = "black",
                                    ylim = c(0, max(tmp.wev) * 1.1), cex.main = 1)))
        expr3 <- list(quote(op <- par(mar = par("mar") * 2)),
                      quote(plot.elbow(css.multi.obj, elbow.obj, if.plot.new = FALSE)),
                      quote(par(op)))
        heatmap.3(dist.obj,
                  main = main, cex.main = 1.25, revC = TRUE,
                  kr = elbow.obj$k, kc = elbow.obj$k,
                  keysize = 1, mapsize = 4.5,
                  row.data = lapply(chipseq_hCD4T, normalizeVector),
                  plot.row.clusters = TRUE, plot.row.clusters.list = list(expr1),
                  plot.row.clustering = TRUE, plot.row.clustering.list = list(expr2, expr3))

    [Plot: Optimal alignments among distributions (without sliding).]
12.     omi = c(0.75, 0, 0, 0))
        plot(mydata$x, mydata$y, pch = as.character(mydata$cluster),
             col = mydata$cluster, cex = 0.75, main = "Clusters of simulated data")
        plot.elbow(css.obj, elbow.obj, if.plot.new = FALSE)
        ## clustering with more relaxed thresholds (resulting in a smaller "good" k)
        elbow.obj2 <- elbow.batch(css.obj, ev.thres = 0.90, inc.thres = 0.05)
        mydata$cluster2 <- cutree(hclust.obj, k = elbow.obj2$k)
        dev.new(width = 12, height = 6)
        par(mfcol = c(1, 2), mar = c(4, 5, 3, 3), omi = c(0.75, 0, 0, 0))
        plot(mydata$x, mydata$y, pch = as.character(mydata$cluster2),
             col = mydata$cluster2, cex = 0.75, main = "Clusters of simulated data")
        plot.elbow(css.obj, elbow.obj2, if.plot.new = FALSE)

    [Figure 13: Graphical output of "example-css.R". Left: Clusters of simulated data (mydata$x against mydata$y). Right: Elbow plot. A "good" k=7 (EV=0.98) is detected when the EV is no less than 0.95 and the increment of EV is no more than 0.01 for a bigger k.]

    [Plot: Left: Clusters of simulated data (mydata$x against mydata$y). Right: Elbow plot (k=5).] A "good" k=5 (EV=0.94) is detected whe
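The elbow rule quoted in the Figure 13 caption can be sketched in base R. This is only an illustration under stated assumptions: `explained_variance` and `elbow_k` are hypothetical helpers, not the package's css or elbow.batch functions, and the sums of squares are derived from squared Euclidean distances as (1/(2n)) times the sum of squared pairwise distances within a group. The defaults 0.95 and 0.01 mirror the thresholds named in the caption.

```r
# Hypothetical helpers for illustration; not the GMD implementation.

# Explained variance (between-cluster share of the total sum of squares),
# computed from a pairwise distance object only.
explained_variance <- function(d, clusters) {
  d2 <- as.matrix(d)^2
  # sum of squares of a group = sum of squared pairwise distances / (2 * n)
  ss <- function(idx) sum(d2[idx, idx]) / (2 * length(idx))
  tss <- ss(seq_len(nrow(d2)))
  wss <- sum(vapply(split(seq_along(clusters), clusters), ss, numeric(1)))
  1 - wss / tss
}

# Smallest k whose EV reaches ev_thres and whose gain from one more
# cluster is at most inc_thres; NA if no k qualifies below k_max.
elbow_k <- function(d, tree, ev_thres = 0.95, inc_thres = 0.01, k_max = 20) {
  ev <- vapply(seq_len(k_max),
               function(k) explained_variance(d, cutree(tree, k = k)),
               numeric(1))
  inc <- diff(ev)  # inc[k] is the gain from k to k + 1 clusters
  ok <- which(ev[-k_max] >= ev_thres & inc <= inc_thres)
  if (length(ok) == 0) NA_integer_ else ok[1]
}

# Three tight, well-separated groups of 10 points each
set.seed(1)
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
pts <- centers[rep(1:3, each = 10), ] + matrix(rnorm(60, sd = 0.1), ncol = 2)
d <- dist(pts)
tree <- hclust(d)
k_good <- elbow_k(d, tree)
print(k_good)
```

Because only the distance matrix is used, the same helpers would work with any metric for which squared distances are meaningful, which is the point the text makes about not needing cluster centers.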
13.       x, main = main, kr = 2, kc = 2)  # show partition by predefined number of clusters
        ## show partition by elbow
        css.multi.obj <- css.hclust(x, hclust(x))
        elbow.obj <- elbow.batch(css.multi.obj, ev.thres = 0.90, inc.thres = 0.05)
        heatmap.3(x, main = main, revC = TRUE, kr = elbow.obj$k, kc = elbow.obj$k)
        heatmap.3(x, main = main, sub = sub("\n", " ", attr(elbow.obj, "description")),
                  cex.sub = 1.25, revC = TRUE,
                  kr = elbow.obj$k, kc = elbow.obj$k)  # show elbow info as subtitle
        ## side plot for every row cluster
        dev.new(width = 10, height = 10)
        expr1 <- list(quote(plot(do.call(rbind, i.x), xlab = "x", ylab = "y",
                                 xlim = range(ruspini$x), ylim = range(ruspini$y))))
        heatmap.3(x, main = main, revC = TRUE, kr = elbow.obj$k, kc = elbow.obj$k,
                  trace = "none", row.data = as.list(data.frame(t(ruspini))),
                  plot.row.clusters = TRUE, plot.row.clusters.list = list(expr1))
        ## side plot for every col cluster
        dev.new(width = 10, height = 10)
        expr2 <- list(quote(plot(do.call(rbind, i.x), xlab = "x", ylab = "y",
                                 xlim = range(ruspini$x), ylim = range(ruspini$y))))
        heatmap.3(x, main = main, revC = TRUE, kr = elbow.obj$k, kc = elbow.obj$k,
                  trace = "none", col.data = as.list(data.frame(t(ruspini))),
                  plot.col.clusters = TRUE, plot.col.clusters.list = list(expr2))

    [Plot: Heatmap of Ruspini data. The subtitle reports that EV=0.93 is detected when the EV is no less than 0.9 and the increment of EV is no more than 0.05 for a bigger k. Side plots of x against y are shown for the row clusters (e.g. Row group 2, n = 23; Row group 3, n = 17).]
14. [Output of names(cagel), continued: names of the form "gene (tag cluster ID)", e.g. "Rab5c (T11R05FBC6C4)", "Csf1 (T03R0672174D)", "Hig1 (T09R0743763C)", "Pfkfb3 (T02R00AEC2D8)".]
15. A.2.3 Examples using nottem data
    A.3 gdist: Generic construction and visualization of distances
    A.4 css: Clustering Sum-of-Squares and the "elbow" plot
    A.5 heatmap.3: Visualization in cluster analysis, with evaluation
        A.5.1 Examples using mtcars data
        A.5.2 Examples using ruspini data
    B Data
        B.1 GMD dataset overview
        B.2 CAGE data: cage and cagel
        B.3 ChIP-seq data: chipseq_mES and chipseq_hCD4T

    1 Introduction

    Similar to the Earth Mover's distance, the Generalized Minimum distance of Distributions (GMD), based on MDPA (Minimum Difference of Pair Assignment [3]), is a true metric of the similarity between the shapes of two histograms (*). Considering two normalized histograms A and B, GMD measures their similarity by counting the necessary "shifts" of elements between the bins that have to be performed to transform distribution A into distribution B.

    (*) In statistics and many other fields, "histogram" refers to a graphical representation of category frequencies in the data. Here we use this term in a more mathematical sense, defined as a function that counts categorical data or a result returned by such a function.
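The shift counting described in the Introduction can be illustrated with a few lines of base R. This is a minimal sketch, assuming two histograms with the same number of ordered bins; `shift_distance` is a hypothetical helper, not a function of the GMD package, and it uses the cumulative-difference form of the ordinal histogram distance rather than the package's C implementation.

```r
# Hypothetical helper for illustration; not part of the GMD package.
# Assumes a and b have the same number of ordered bins.
shift_distance <- function(a, b) {
  stopifnot(length(a) == length(b))
  a <- a / sum(a)  # normalize both histograms to unit mass
  b <- b / sum(b)
  # After each bin, the running surplus is the mass that still has to be
  # carried across the boundary to the next bin; summing its absolute
  # value counts every one-bin shift.
  sum(abs(cumsum(a - b)))
}

h1 <- c(0, 2, 1, 0)
h2 <- c(0, 0, 1, 2)
d12 <- shift_distance(h1, h2)
print(d12)  # 4/3: two thirds of the mass moves two bins to the right
```

The distance is zero for identical shapes and symmetric in its arguments, and it grows with how far mass has to travel, which is the behavior the text attributes to GMD and to the Earth Mover's distance.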
16. CRAN GMD: User's Guide (0.3.3)
    Generic histogram construction, generic distance measure, cluster analysis with evaluation and visualization

    Xiaobei Zhao and Albin Sandelin
    Bioinformatics Centre, University of Copenhagen
    Modified: 2014-08-26. Compiled: 2014-8-27.

    You may find the latest version of GMD and this documentation at http://CRAN.R-project.org/package=GMD

    Keywords: histogram, distance, metric, non-parametric, cluster analysis, hierarchical clustering, sum-of-squares, heatmap.3

    Abstract

    The purpose of this GMD vignette is to show how to get started with the R package GMD. GMD denotes Generalized Minimum Distance between distributions, which is a true metric that measures the distance between the shapes of any two discrete numerical distributions (e.g. histograms). The vignette includes a brief introduction, an example to illustrate the concepts and the implementation of GMD, and case studies that were carried out using classical data sets (e.g. iris) and high-throughput sequencing data (e.g. ChIP-seq) from biology experiments. The appendix on page 14 contains an overview of package functionality and examples using primary functions in histogram construction (the ghist function), how to measure distance between distributions (the gdist function), cluster analysis with evaluation (the "elbow" method in the css function) and visualization (the heatmap.3 function).

    Contents

    1 Introduction
    2 Minimal Example: H
17. GMD under GPL license.

    2 Minimal Example: Hello, GMD!

        ## hello-GMD.R
        ##' Check GMD's sanity and start up
        ##' @name hello-GMD
        ## GMD at CRAN, for source code download and installation
        ## http://cran.r-project.org/web/packages/GMD/index.html
        ## load GMD
        library(GMD)
        ## version of GMD and description
        packageVersion("GMD")
        packageDescription("GMD")
        ## view GMD vignette
        vignette("GMD-vignette", package = "GMD")
        ## list the available data sets in GMD
        data(package = "GMD")
        ## list all the objects in GMD
        ls("package:GMD")
        ## help info on GMD
        help(package = "GMD")
        ## run a demo
        demo("GMD-demo")
        ## cite GMD in publications
        citation("GMD")

    "hello-GMD.R" (Source Code 1) is a minimal example to load GMD and check that your installation works. It also includes code for viewing the package information and this vignette, checking the data sets provided by GMD, starting a demo and listing the citation of GMD.

    3 An example to understand GMD

    This example, based on simulated data, is designed to illustrate the concepts and the implementation of GMD by stepping through the computations in detail.

    3.1 Histogram construction and visualization

    3.1.1 Load library and simulate data

        > require("GMD")  # load library
        > ## create two normally-distributed samples
        > ## with unequal means and unequal variances
        > set.seed(2012)
        > x1 <- rnorm(1000, mean = -5, sd = 10)
        > x2 <- rnor
18. [Figure 5: Graphical output of "case-chipseq.R". Heatmap of histone modifications with side plots for row clusters (e.g. Row group 1, n = 16; Row group 3, n = 7) and a bar plot "Cluster Explained Variance" (x-axis: Cluster, y-axis: EV).]

    A Functionality

    A.1 An overview

    Table 1: Functions of the GMD R package

        Function    Description
        ghist       Generalized Histogram Computation and Visualization
        gdist       Generalized Distance Matrix Computation
        css         Computing Clustering Sum-of-Squares and evaluating the clustering by the "elbow" method
        heatmap.3   Enhanced Heatmap Representation with Dendrogram and Partition
        gmdp        Computation of GMD on a pair of histograms
        gmdm        Computation of GMD Matrix on a set of histograms
19.       duals.list = list(expr3))

    [Figure 15: Graphical output of "example-heatmap3a.R". Heatmap of the mtcars data with Color Key (Count against Value); columns include carb, wt, drat, gear, mpg and qsec.]

    [Figure 16: Graphical output of "example-heatmap3a.R". Heatmap with Color Key; rows: car models; columns: am, vs, carb, wt, drat, gear, qsec, mpg, hp, disp.]
20. ello, GMD!

    Xiaobei Zhao: Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, 450 West Dr, Chapel Hill, NC 27599, USA. xiaobei@binf.ku.dk
    Albin Sandelin: Bioinformatics Centre, University of Copenhagen, Department of Biology and Biotech Research and Innovation Centre, Ole Maaløes Vej 5, DK-2200, Copenhagen, Denmark. albin@binf.ku.dk

    3 An example to understand GMD
        3.1 Histogram construction and visualization
            3.1.1 Load library and simulate data
            3.1.2 Construct histograms
            3.1.3 Save histograms as multiple-histogram (mhist) object
            3.1.4 Visualize an mhist object
        3.2 Histogram distance measure and alignment
            3.2.1 Measure the pairwise distance between two histograms by GMD
            3.2.2 Show alignment
    4 Case study
        4.1 CAGE: measuring the dissimilarities among TSSDs
        4.2 ChIP-seq: measuring the similarities among histone modification patterns
    A Functionality
        A.1 An overview
        A.2 ghist: Generic construction and visualization of histograms
            A.2.1 Examples using simulated data
            A.2.2 Examples using iris data
21.       eric(as.factor(brands))
        RowIndividualColors <- rainbow(max(brands.index))[brands.index]
        heatmap.3(x, scale = "column", RowIndividualColors = RowIndividualColors)
        ## coloring attributes (column features) randomly (just for a test)
        heatmap.3(x, scale = "column", ColIndividualColors = rainbow(ncol(x)))
        ## add a single plot for all row individuals
        dev.new(width = 12, height = 8)
        expr1 <- list(quote(plot(row.data[rowInd, "hp"], 1:nrow(row.data),
                                 xlab = "hp", ylab = "",
                                 main = "Gross horsepower", yaxt = "n")),
                      quote(axis(2, 1:nrow(row.data), rownames(row.data)[rowInd], las = 2)))
        heatmap.3(x, scale = "column", plot.row.individuals = TRUE, row.data = x,
                  plot.row.individuals.list = list(expr1))
        ## add a single plot for all col individuals
        dev.new(width = 12, height = 8)
        expr2 <- list(quote(plot(colMeans(col.data)[colInd], xlab = "", ylab = "Mean",
                                 xaxt = "n", main = "Mean features", cex = 1, pch = 19)),
                      quote(axis(1, 1:ncol(col.data), colnames(col.data)[colInd], las = 2)))
        heatmap.3(x, scale = "column", plot.col.individuals = TRUE, col.data = x,
                  plot.col.individuals.list = list(expr2))
        ## add another single plot for all col individuals
        dev.new(width = 12, height = 8)
        expr3 <- list(quote(op <- par(mar = par("mar") * 1.5)),
                      quote(mytmp.data <- apply(col.data, 2, function(e) e / sum(e))),
                      quote(boxplot(mytmp.data[, colInd], xlab = "", ylab = "Value",
                                    main = "Boxplot of normalized column features")),
                      quote(par(op)))
        heatmap.3(x, scale = "column", plot.col.individuals = TRUE, col.data = x,
                  plot.col.indivi
22. gure 10: Graphical output of "case-nottem.R".

    A.3 gdist: Generic construction and visualization of distances

    "example-gdist.R" (Source Code 7) is an example of how to measure distances using a user-defined metric, such as correlation distance and GMD.

        ## example-gdist.R
        ## load library
        require("GMD")
        require(cluster)
        ## compute distance using Euclidean metric (default)
        data(ruspini)
        x <- gdist(ruspini)
        ## see a dendrogram result by hierarchical clustering
        dev.new(width = 12, height = 6)
        plot(hclust(x),
             main = "Cluster Dendrogram of Ruspini data",
             xlab = "Observations")
        ## convert to a distance matrix
        m <- as.matrix(x)
        ## convert from a distance matrix
        d <- as.dist(m)
        stopifnot(d == x)
        ## use correlations between variables "as distance"
        data(USJudgeRatings)
        dd <- gdist(x = USJudgeRatings, method = "correlation.of.variables")
        dev.new(width = 12, height = 6)
        plot(hclust(dd),
             main = "Cluster Dendrogram of USJudgeRatings data",
             xlab = "Variables")

    [Plot: Cluster Dendrogram of Ruspini data.]
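For readers without the package at hand, the idea of using correlations between variables as a distance can be approximated in base R. The sketch below assumes the dissimilarity is one minus the Pearson correlation between columns; that definition is an assumption made here for illustration, and the package's "correlation of variables" method may be defined differently.

```r
# Base-R sketch; the 1 - correlation definition is an assumption,
# not necessarily what gdist() computes.
data(USJudgeRatings)

r <- cor(USJudgeRatings)    # Pearson correlations between the 12 rating variables
dd <- as.dist(1 - r)        # highly correlated variables end up close together
tree <- hclust(dd)          # complete linkage is the hclust() default

# Two groups of variables from the resulting dendrogram
groups <- cutree(tree, k = 2)
print(groups)
```

Calling plot(tree) on the result draws a dendrogram comparable in spirit to the one for the USJudgeRatings variables in Figure 12.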
23. lexibility; in particular, the option sliding is enabled by default to allow partial alignment.

        ## case-cage.R
        require("GMD")  # load library
        data(cage)      # load data
        ## measure pairwise distance
        x <- gmdp(cage[["Pfkfb3 (T02R00AEC2D8)"]], cage[["Csf1 (T03R0672174D)"]])
        print(x)                 # print a brief version by default
        print(x, mode = "full")  # print a full version
        ## show alignment
        plot(x, labels = c("Pfkfb3", "Csf1"), beside = FALSE)
        ## show another alignment
        plot(gmdp(cage[["Hig1 (T09R0743763C)"]], cage[["Cd72 (T04R028B8BC9)"]]),
             labels = c("Hig1 (T09R0743763C)", "Cd72 (T04R028B8BC9)"),
             beside = FALSE)
        ## construct a distance matrix and visualize it
        short.labels <- gsub("(.+) \\(.+", "\\1", names(cage))  # get short labels
        x <- gmdm(cage[1:6], labels = short.labels[1:6])
        plot(x)

    [Figure 1: Graphical output of "case-cage.R". Optimal alignment between distributions (with sliding), GMD=7.881; panels: Pfkfb3 (T02R00AEC2D8) and Csf1 (T03R0672174D); x-axis: Position, y-axis: Fraction.]

    [Plot: Optimal alignment between distributions (with sliding), GMD=3.992; panels: Hig1 (T09R0743763C) and Cd72 (T04R028B8BC9); x-axis: Position, y-axis: Fraction.]
24.       m(1000, mean = 10, sd = 5)

    3.1.2 Construct histograms

        > ## create common bins
        > n <- 20                          # desired number of bins
        > breaks <- gbreaks(c(x1, x2), n)  # bin boundaries
        > ## make two histograms
        > v1 <- ghist(x1, breaks = breaks, digits = 0)
        > v2 <- ghist(x2, breaks = breaks, digits = 0)

    3.1.3 Save histograms as multiple-histogram (mhist) object

        > x <- list(v1, v2)
        > mhist.obj <- as.mhist(x)

    3.1.4 Visualize an mhist object

        > ## plot histograms side-by-side
        > plot(mhist.obj, mar = c(1.5, 1, 1, 0), main = "Histograms of simulated normal distributions")

    [Plot: Histograms of simulated normal distributions, side by side; y-axis: Count.]

        > ## plot histograms as subplots, with corresponding bins aligned
        > plot(mhist.obj, beside = FALSE, mar = c(1.5, 1, 1, 0),
        +      main = "Histograms of simulated normal distributions")

    [Plot: Histograms of simulated normal distributions, one panel per histogram; y-axis: Count.]

    3.2 Histogram distance measure and alignment

    Here we measure the GMD distance between the shapes of two histograms with option sliding on.

    3.2.1 Measure the pairwise distance between two histog
25. n the EV is no less than 0.9 and the increment of EV is no more than 0.05 for a bigger k.

    Figure 14: Graphical output of "example-css.R".

    A.5 heatmap.3: Visualization in cluster analysis, with evaluation

    A.5.1 Examples using mtcars data

    "example-heatmap3a.R" (Source Code 9) is an example of how to make a heatmap with summary visualization of observations.

        ## example-heatmap3a.R
        require("GMD")          # load library
        data(mtcars)            # load data
        x <- as.matrix(mtcars)  # data as a matrix
        dev.new(width = 10, height = 8)
        heatmap.3(x)                                  # default, with reordering and dendrogram
        heatmap.3(x, Rowv = FALSE, Colv = FALSE)      # no reordering and no dendrogram
        heatmap.3(x, dendrogram = "none")             # reordering without dendrogram
        heatmap.3(x, dendrogram = "row")              # row dendrogram with row (and col) reordering
        heatmap.3(x, dendrogram = "row", Colv = FALSE)  # row dendrogram with only row reordering
        heatmap.3(x, dendrogram = "col")              # col dendrogram
        heatmap.3(x, dendrogram = "col", Rowv = FALSE)  # col dendrogram with only col reordering
        heatmapOut <- heatmap.3(x, scale = "column")  # scaled by column
        names(heatmapOut)                             # view the list that is returned
        heatmap.3(x, scale = "column", x.center = 0)  # colors centered around 0
        heatmap.3(x, scale = "column", trace = "column")  # turn "trace" on
        ## coloring cars (row observations) by brand
        brands <- sapply(rownames(x), function(e) strsplit(e, " ")[[1]][1])
        names(brands) <- c()
        brands.index <- as.num
26. na Ponjavic, Yoshihide Hayashizaki, and David A. Hume. Mammalian RNA polymerase II core promoters: insights from genome-wide studies. Nat Rev Genet, 8(6):424-436, Jun 2007.

    [7] Xiaobei Zhao, Eivind Valen, Brian J. Parker, and Albin Sandelin. Systematic clustering of transcription start site landscapes. PLoS ONE, 6(8):e23409, August 2011.

    [8] Vicky W. Zhou, Alon Goren, and Bradley E. Bernstein. Charting histone modifications and the functional organization of mammalian genomes. Nat Rev Genet, 12(1):7-18, Jan 2011.
27. ons around the TSS are associated with transcription regulation and expression variation of genes. Comparing the chromatin modification profiles (originally produced by [1] and [4], and preprocessed by [7]), the sliding option is disabled for fixed alignments at the TSSs and the flanking regions. The GMD measure indicates how well profiles are co-related to each other. In addition, the downstream cluster analysis is visualized with function heatmap.3, which uses the GMD distance matrix to generate clustering dendrograms and make partitions given variant criteria, including the "elbow" rule (discussed in [7]) or a desired number of clusters.

        ## case-chipseq.R
        require("GMD")       # load library
        data(chipseq_mES)    # load data
        data(chipseq_hCD4T)  # load data
        ## pairwise distance and alignment based on GMD metric
        plot(gmdm(chipseq_mES, sliding = FALSE))
        ## clustering on spatial distributions of histone modifications
        x <- gmdm(chipseq_hCD4T, sliding = FALSE, resolution = 10)
        heatmap.3(x, revC = TRUE)
        ## determine the number of clusters by "Elbow" criterion
        main <- "Heatmap of ChIP-seq data (human CD4+ T cells)"
        dist.obj <- gmdm2dist(x)
        css.multi.obj <- css.hclust(dist.obj, hclust(dist.obj))
        elbow.obj <- elbow.batch(css.multi.obj, ev.thres = 0.90, inc.thres = 0.05)
        heatmap.3(dist.obj, main = main, revC = TRUE, kr = elbow.obj$k, kc = elbow.obj$k)
        ## more strict threshold
        elbow.obj <- elbow.batch(css.multi.obj, ev.thres = 0.75, inc.thres = 0.1)
        heatmap.3(dist.obj,
28. output of "case-iris.R".

    A.2.3 Examples using nottem data

    "case-nottem.R" (Source Code 6) is a study on how to draw histograms side by side and to compute and visualize a bin-wise summary plot, using air temperature data at Nottingham Castle.

        ## case-nottem.R
        ## load library
        require("GMD")
        ## load data
        data(nottem)
        class(nottem)       # a time-series (ts) object
        x <- ts2df(nottem)  # convert ts to data.frame
        mhist1 <- as.mhist(x[1:3, ])
        ## plot multiple discrete distributions side-by-side
        plot(mhist1, xlab = "Month", ylab = "Degree Fahrenheit",
             main = "Air temperatures at Nottingham Castle")
        ## make summary statistics for each bin
        mhist2 <- as.mhist(x)
        ms <- mhist.summary(mhist2)
        print(ms)
        ## plot bin-wise summary statistics with
        ## confidence intervals over the bars
        plot(ms, main = "Mean air temperatures at Nottingham Castle (1920-1939)",
             xlab = "Month", ylab = "Degree Fahrenheit")

    [Figure 9: Graphical output of "case-nottem.R". Air temperatures at Nottingham Castle for 1920, 1921 and 1922; x-axis: Month (Jan to Dec), y-axis: Degree Fahrenheit.]

    [Plot: Mean air temperatures at Nottingham Castle (1920-1939); x-axis: Month (Jan to Dec), y-axis: Degree Fahrenheit.]

    Fi
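A bin-wise summary of the kind plotted for the nottem data can be reproduced approximately with base R alone. The sketch below treats each month as a bin and each year as one histogram; the t-based 95% interval is an assumption chosen for illustration, and the package's mhist.summary function may compute its summary and intervals differently.

```r
# Base-R sketch of a bin-wise summary; not the mhist.summary() implementation.
# nottem ships with R: monthly mean temperatures, January 1920 to December 1939.
temps <- matrix(nottem, ncol = 12, byrow = TRUE,
                dimnames = list(1920:1939, month.abb))  # rows: years, cols: months

bin_mean <- colMeans(temps)                         # mean per month (bin)
bin_se <- apply(temps, 2, sd) / sqrt(nrow(temps))   # standard error per month
half_width <- qt(0.975, df = nrow(temps) - 1) * bin_se

summary_tab <- data.frame(month = month.abb,
                          mean  = bin_mean,
                          lower = bin_mean - half_width,
                          upper = bin_mean + half_width)
print(summary_tab, row.names = FALSE)
```

A bar plot of the mean column with the lower and upper columns drawn as error bars would give a picture similar to the summary figure for 1920-1939.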
29. rams by GMD

        > gmdp.obj <- gmdp(v1, v2, sliding=TRUE)
        > print(gmdp.obj)                    # print a brief version by default
        [1] 1.334
        > print(gmdp.obj, mode="detailed")   # print a detailed version
        GM-Distance: 1.334
        Sliding: TRUE
        Number of hits: 1
        Gap: Hit1
        Resolution: 1
        > print(gmdp.obj, mode="full")       # print a full version
        Distribution of v1:
        1 2 10 19 28 46 59 101 109 119 133 108 90 77 42 29 17 6 3 1
        (After normalization)
        0.001 0.002 0.01 0.019 0.028 0.046 0.059 0.101 0.109 0.119 0.133 0.108 0.09 0.077 0.042 0.029 0.017 0.006 0.00
        Distribution of v2:
        0 0 0 0 0 0 0 0 0 1 5 30 94 206 258 199 139 58 8 2
        (After normalization)
        0 0 0 0 0 0 0 0 0 0.001 0.005 0.03 0.094 0.206 0.258 0.199 0.139 0.058 0.008 0.002
        GM-Distance: 1.334
        Sliding: TRUE
        Number of hits: 1
        Gap: Hit1
        Resolution: 1

    3.2.2 Show alignment

    Now let's have a look at the alignment by GMD, with a distance of 1.334 and a shift of 5 in the 1st distribution. It is important to note that the specific features (the values in this case) of the original bins in the histograms are ignored with sliding on. To keep the original bin-to-bin correspondence, set sliding to FALSE (see the examples in section 4.2).

        > plot(gmdp.obj, beside=FALSE)

    [Figure: "Optimal alignment between distributions (with sliding)", GMD = 1.334, gap = c(5,0); x-axis: Position; y-axis: Fraction.]
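The reported distance can be reproduced by hand from the two count vectors printed in full mode. The sketch below uses base R only and is not the package's implementation. It assumes that the gap c(5, 0) shown in the title of the alignment plot means the first distribution is shifted right by five bins, and that the distance is the sum of absolute cumulative differences between the normalized, zero-padded distributions (the ordinal histogram distance described by Cha and Srihari [3]). The helper `pad` is hypothetical.

```r
# Bin counts as printed by print(gmdp.obj, mode = "full")
v1 <- c(1, 2, 10, 19, 28, 46, 59, 101, 109, 119,
        133, 108, 90, 77, 42, 29, 17, 6, 3, 1)
v2 <- c(rep(0, 9), 1, 5, 30, 94, 206, 258, 199, 139, 58, 8, 2)

# Normalize counts to fractions that sum to 1
p1 <- v1 / sum(v1)
p2 <- v2 / sum(v2)

# Assumed reading of gap = c(5, 0): prepend 5 empty bins to v1, none to v2,
# then pad both on the right to a common length
gap <- c(5, 0)
n <- max(length(p1) + gap[1], length(p2) + gap[2])
pad <- function(p, g, n) c(rep(0, g), p, rep(0, n - g - length(p)))
a <- pad(p1, gap[1], n)
b <- pad(p2, gap[2], n)

# Sum of absolute cumulative differences between the aligned distributions
d <- sum(abs(cumsum(a - b)))
print(round(d, 3))   # 1.334
```

Under these assumptions the result agrees with the GM-Distance printed above, which is why the shift of 5 is reported together with the distance: the value only makes sense relative to that alignment.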
30. st, Wynand B. Alkema, Sin Lam Tan, Charles Plessy, Rimantas Kodzius, Timothy Ravasi, Takeya Kasukawa, Shiro Fukuda, Mutsumi Kanamori-Katayama, Yayoi Kitazume, Hideya Kawaji, Chikatoshi Kai, Mari Nakamura, Hideaki Konno, Kenji Nakano, Salim Mottagui-Tabar, Peter Arner, Alessandra Chesi, Stefano Gustincich, Francesca Persichetti, Harukazu Suzuki, Sean M. Grimmond, Christine A. Wells, Valerio Orlando, Claes Wahlestedt, Edison T. Liu, Matthias Harbers, Jun Kawai, Vladimir B. Bajic, David A. Hume, and Yoshihide Hayashizaki. Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet, 38(6):626-635, Jun 2006.

    [3] Sung-Hyuk Cha and Sargur N. Srihari. On measuring the distance between histograms. Pattern Recognition, 35(6):1355-1370, 2002.

    [4] Tarjei S. Mikkelsen, Manching Ku, David B. Jaffe, Biju Issac, Erez Lieberman, Georgia Giannoukos, Pablo Alvarez, William Brockman, Tae-Kyung Kim, Richard P. Koche, William Lee, Eric Mendenhall, Aisling O'Donovan, Aviva Presser, Carsten Russ, Xiaohui Xie, Alexander Meissner, Marius Wernig, Rudolf Jaenisch, Chad Nusbaum, Eric Lander, and Bradley E. Bernstein. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature, 448(7153):553-560, Aug 2007.

    [5] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. ISBN 3-900051-07-0.

    [6] Albin Sandelin, Piero Carninci, Boris Lenhard, Jasmi
31. Figure 2: Graphical output of case-cage.R.

    [Figure 3: Graphical output of case-cage.R. Panel "Optimal alignments among distributions (with sliding)"; x-axis: Position.]

    4.2.2 ChIP-seq: measuring the similarities among histone modification patterns

    In this section we demonstrate how GMD is applied to measure the dissimilarities between histone modifications represented by ChIP-seq reads. Distinctive patterns of chromatin modificati
32. [Figure 18: Graphical output of example-heatmap3a.R. Heatmap with rows labelled "Individuals" (car models, Cadillac Fleetwood to Honda Civic) and a panel "Boxplot of normalized column features" (wt, drat, hp, disp, carb).]

    A.5.2 Examples using ruspini data: example-heatmap3b.R

    Source Code 10 is an example of how to make a heatmap with a summary visualization of clusters.

    example-heatmap3b.R

        ## load library
        require("GMD")
        require(cluster)

        ## load data
        data(ruspini)

        ## heatmap on a "dist" object
        x <- gdist(ruspini)
        main <- "Heatmap of Ruspini data"
        dev.new(width=10, height=10)
        heatmap.3(x, main=main, mapratio=1)   # default with a title and a map in square
        heatmap.3(x, main=main, revC=TRUE)    # reverse column for a symmetric look
        heatmap.3
