Home

introduction to the R Project for Statistical Computing

image

Contents

1. 24 4 6 Arrays and matriceS 1 0 ee eee ee tee te es 25 C Jatalrtames esaouresesueuEuEGREG ew ne i aos 30 45 PaAClOUS 3 Sor eas eon Ww SS See Se hk EO a we ee aoi a cE 34 4 9 Selecting Subsets ouo Ye eee Bont ee ui 36 Version 4 1 Copyright D G Rossiter 2003 2014 All rights reserved Non commercial reproduction and dissemination of the work as a whole not parts freely permitted if this original copyright notice is included Sale or placement on a web site where payment must be made to access this document is strictly prohibited To adapt or translate please contact the author 4 9 1 Simultaneous operations on subsets Z IORearrangme dalas o suk ao De a weno ee Oe ee 4 11 Random numbers and simulation 4 12 Character SUgs gcgexe 9 e Romo a Pe g 4130bjects and classes 4 dd be ba bee Re ewm 4 13 1 The S3 and SA class systems 4 14 Descriptive statistics 1 0 ee ee es 4 15 Classification tables ssa umm bd edd REA Sellos S oa 6 4 0 tren a ee ew Bree ete we Se ee ea ee Be 4 17 Statistical modelsinS 4 17 1 Models with categorical predictors 4 17 2 Analysis of Variance ANOVA 4 18 Model output mo emu ee al ew ux 42 T6 I Madel diggnostibS a ox 4 64 ase a dee ee 4 18 2 Model based prediction 4 19 Advanced statistical modelling 4 20 Missing values i scd ox dae
2. c n 4 Regression diagnostic plots LL Binding the closest point 2E kale ee deg b Detaultscatterplot i aac mee deerat eR dus 7 Plotting symbols ecce kem xx m xS E Rea 8 Customscatterplot leen 9 Scatterplot with math symbols legend and model lines 10 Some interesting base graphics plots 104 106 106 107 108 108 110 110 110 111 112 112 113 115 115 117 118 120 121 122 124 126 126 127 128 131 133 138 iv List of Tables 11 12 13 14 15 16 17 18 19 20 21 BwWN eR Trellis density plots 0 6 0s a Rx Rx Rn 78 Trellisscather PIOUS o aea wx we Ab eek eh Pax 79 Trellis tavariate Plots 2 22 23 6 ce wee SR oo eS 81 Trellis scatter plot with some added elements 82 Available colours s s c croso 04466600 m x x oso 88 Example of a colour ramp ln 90 R graphical data ditor 229 ereenn raae hes 93 Example PDF produced by Sweave and BIpX 108 Results of an RSeek search 2 0 ee ee ei ani 113 Results of an R site search 1 2 ee 114 Visualising the variability of small random samples 130 Methods for adding to an existing base graphics plot 72 Base graphics plot types a seca sase eee ee ees 76 Xrellis graphies plot tyDes irc uomo EGG s 83 Packages in the base R distribution for Windows 125 O If you are impatient 1 Whatis R 1 Install R and
3. colours 655 1 yellow3 gt palette 1 black red green3 blue cyan magenta 7 yellow gray gt palette 4 1 blue V V VN N The named colours can be visualised as a bar graph Figure 15 tmp lt seq 1 length colors plot tmp rep 1 length tmp type h lwd 2 col colors tmp ylim c 0 1 xlab Colour number Then an individual colour number can be identified interactively by left clicking on the vertical colour bar at the midpoint right click anywhere in the graph when done abline h20 5 1wd 23 to aim at selected lt identify tmp rep 0 5 length tmp selected colour number colors selected colour name The Red Green Blue of any colour can be examined with the col2rgb method 87 f 00 T T 200 300 4 Colour number 500 600 0 100 Figure 15 Available colours gt col2rgb yellow3 1 red 205 green 205 blue 0 Single colours can be created with the rgb method specifying Red Green and Blue contributions each in the range 0 1 completely absent saturated gt rgb 0 25 0 5 0 1 408000 There are several built in colour ramps sequences of colours that give a pleasing visual impression these are returned by the heat colors terrain colors topo colors and cm colors methods another palette is provided by the bpy colors method of the sp package These all r
4. 11 2 Why didn t my command s do what I expected Because R does what you said not what you meant Some ideas to make the two match 1 Review the on line documentation for the command to see what the command actually does Pay especial attention to the arguments and any defaults All methods have examples experiment with these to make sure you understand what the method really does 2 Look for the same commands in a tutorial or text and follow the examples 3 Break down the command into smaller parts make sure each part does what you think 4 Experiment with test data or toy examples to understand how the command really works 5 Look at the data structures with str and class sometimes it has a different structure than you thought 6 Look at your data with summary head or print Maybe your data is not what you thought 7 A common problem occurs when a variable defined in the workspace also called a local variable has the same name as a field in a data frame The local variable is found by R when it looks for the name and masks the field name data trees str trees data frame 31 obs of 3 variables Girth num 8 3 8 6 8 8 10 5 10 7 10 8 11 11 11 1 Height num 70 65 63 72 81 83 66 75 80 75 Volume num 10 3 10 3 10 2 16 4 18 8 19 7 15 6 18 2 gt Volume sample 1 31 1 6 7 32122 124 91114 191320 8 5302931 19 18 10 23 15 12 27 16 2 26 17 28 4 25 gt attach trees gt cor Volume
5. Model y 2 Bis Box k 1 R 0 982 S formula Im SW SA spp 1 0 50 100 150 Stem area cm Figure 9 Scatterplot with math symbols legend and model lines equidist logi TRUE attr class chr histogram gt hist trees Volume mids 1 15 25 35 45 55 65 75 5 1 3 Types of base graphics plots Table 2 lists the principal plot types see the help for each one for more details Figure 10 shows examples of a boxplot a conditioning plot a pairwise scatterplot and a star plot all applied to the Anderson iris dataset boxplot Petal Length Species horizontal T col lightblue boxwex 5 xlab Petal length cm ylab Species main Grouped box plot coplot Petal Width Petal Length Species col as numeric Species pch as numeric Species pairsCiris 1 4 col as numeric Species main Pairwise scatterplot stars iris 1 4 key loc c 2 35 mar c 2 2 10 2 VtvVtVt 4V 75 assocp lot Association Plots barplot Bar Plots boxplot Box Plots contour Contour Plots coplot Conditioning Plots dotchart Cleveland Dot Plots filled contour Level Contour Plots fourfoldplot Fourfold Plots hist Histograms image Display a Colour Image matplot Plot Columns of Matrices mosaicplot Mosaic Plots pairs Scatterplot Matrices persp Perspective 3D Plots plot Generic X Y Plotting stars Star Spider Radar Plots stem Stem and Leaf Plots stripchart 1 D Sc
6. approx x y THEREHRRERHRHEERHERRRHRHRHRIRRHRHRHHRIHHREHRRIHRRRHRRNIE m f chunk number 24 Iw EAHA REA EE ERETTE Computes the Akaike information criterion or AIC si u Al l Lin 157 453 Col 1 Normal mode smNormal Size 11 43 KB Tinn R hotkeys inactive Rn Console a ole x 1 37 gt max meuse lead i 654 gt median meuse lead 1 123 gt mean meuse lead 1 153 36 gt range meuse lead 1 37 654 gt var meuse lead 1 12392 gt sd meuse lead 1 111 32 gt quantile meuse lead os 25 50 75 100 37 0 72 5 123 0 207 0 654 0 gt quantile meuse lead seq 0 1 1 0t 10 20 30 40 50 60 70 80 90 100 37 0 50 0 65 6 77 2 87 0 123 0 148 0 182 6 226 4 290 4 654 0 EM 3 P ig Posvak FR2R fo AEE S Microsof Beecne Gcyvewin Texopat 4 cipboar Figure 2 The Tinn R screen with the R command line interface also visible character to explain to yourself or others what the script is doing and why Here s a step by step description of how to create and run a simple script which draws two random samples from a normal distribution and com putes their correlation 1 if you are using RStudio or Tinn R open a new script If using RGui open a new document in a plain text editor i e one that does not insert any formatting Under MS Windows you can use Notepad or Wordpad 2 Type in the following lines in the s
7. gt str optionsO List of 45 70 http www rosuda org 71 As of the date of these notes their author 126 prompt chr gt continue chr htmlhelp logi TRUE gt options O digits 1 5 Here is an example Rprofile which sets some options and loads some libraries that will always be used in a project options show signif stars FALSE options htm help TRUE options digits 5 options prompt R gt options continue R options timeout 20 library gstat library lattice optional function to run at startup First lt function print Welcome to R you ve made a wise choice optional function to run at shutdown Last lt function graphics offQ print Get a life A 3 Rin different human languages R s menus and messages have been translated to several common lan guages the codes of these are listed as folders under folder share locale in the R installation MS Windows If your system is set up to work in a non English language e g Spanish code es R GUI will automatically display its menus and R its messages in that language If you d like R to work in a different language than your overall system the simplest method is to 1 Create a desktop shortcut to R Gui probably created during installa tion 2 Open its properties 3 In the the Target field add the environment variable LANGUAGE and the code for the language you want for exa
8. 180 sigma 20 mean 180 66 sdev 19 98 Pr t 0 86 o 4 2 o 2 Ww N o r T T T T T 1 120 140 160 180 200 220 240 Sample 3 Frequency Frequency mu 180 sigma 20 mean 184 31 sdev 17 63 Pr t 0 19 f T T T T T l 120 140 160 180 200 220 240 Sample 2 mu 180 sigma 20 mean 181 4 sdev 19 47 Pr t 0 7 r T T T T T l 120 140 160 180 200 220 240 Sample 4 Figure 21 A visualisation of the variability of small random samples Each sample of 30 has been divided into ten histogram bins on 130 230 The blue solid normal curves all have u 180 and 20 the red dashed normal curves are estimated from each sample Note the bias left or right of the blue curves and variances narrower or wider than the blue curves 130 C Anexample function A more powerful approach than writing and sourcing a script is to write a function which is loaded into the workspace with the source function and then run as if it were a built in R function The main advantage is that you can make the function adaptable with a set of arguments parameters that can be sent to the function so you don t have to edit the script Here we have converted the script of Appendix B into a function with only one required argument the sample size and six optional arguments which control aspects of the display for example the number of samples to compare on one plot You prep
9. Bpar ER 4 E 3 a 05 080 T T T T T T T NW A 1 2 3 4 5 6 7 s r 1 2 3 4 5 5 7 Petal length cm Petal Length Pairwise scatterplot 20 25 30 35 40 05 10 15 20 25 y pee iue pw t j c 28 P om Ln Star plot of individuals n i a Be Bact Sepall Width Sepal Length E E d Bo por 3 Petal Length Sepal Length s 2 8 Petal Width eG dp i rd P w Po 99 o a si hoa bot d B d a a 5 od d 2 f 9 10 1 42 3 o sr Bb b 4 a d 1s Bo P d 13 14 16 7 18 19 20 4 2g 23 24 Ld P 9 8 oo t t 3 M3 des o Sepal Width lao o E 25 26 27 28 29 30 31 32 3 34 35 36 pe ogee BOW OY 4W o3 us o 3 in E Well E 37 38 39 40 41 42 43 44 45 46 47 48 3 2 E yo c 434 3 Fx z TH 49 50 51 52 53 54 55 56 57 58 59 60 df e Sg c Pag et e yve F amp n K Eu RHET S 61 62 63 64 65 66 67 68 69 70 71 72 SP cx Petal Length p 73 7 4 75 76 77 78 79 80 81 82 83 84 S lo E 4 9 4 4 D 85 86 87 88 89 90 91 92 93 94 95 96 E aad o Cc E i z 4 P F Pq 97 98 99 100 101 102 103 104 105 106 107 108 Taal ue Er Pppp 3 apr E d 109 110 116 119 120 n Fabi oo Ede P lt T PT a EE QS Petal Width 121 122 128 181 182 4 o 134 140 143 144 S 9 5 sic EN P Figure 10 Some interesting base graphics plots This is quite useful for identifying unusual points in the
10. Finally we specify the dimensions and format of the image thereby pro moting it to a matrix with the matrix function By default R interprets a matrix in column major order each column is a vector but the image is stored by convention in row major order therefore we specify the optional byrow T argument gt img lt matrix img nrow ncol byrow T gt class img 1 matrix gt str img num 1 12 1 12 141 143 147 151 153 153 153 157 159 159 98 7 Exporting from R Data frames are easily exported from R For all the possibilities see the R Data Import Export manual Here we explain the most common operation exporting a data frame to a CSV file This format can be read into Excel ILWIS and many other programs Prior to this package exchange with Excel was via CSV files as explained in the next section 7 1 Importing data from a CSV file The Comma Separated Values or CSV file is a common interchange for mat which can be prepared in many ways For example a CSV file can be created directly as a text file using Notepad or some other plain text edi tor However it is common to have data already in a spreadsheet such as Excel In this case the procedure is as follows 1 Prepare the data as an Excel spreadsheet with named columns 2 Export from Excel to a CSV file using Excel s File Save As menu item 3 Import into R with the read csv function 4 Adjust data types in R if necessar
11. Note that the trees are first sorted by height then any ties in height are sorted by girth Some variables are categorical they can take only a defined set of values In S these are called factors and are of two types unordered nominal and ordered ordinal An example of the first is a soil type of the second soil structure grade from none through weak to strong and very strong there is a natural order in the second case but not in the first Many analyses in R depend on factors being correctly identified some such as table 84 15 only work with categorical variables Factors are defined with the factor and ordered functions They may be converted from existing character or numeric vectors with the as factor and as ordered function these are often used after data import if the read table or related functions could not correctly identify factors see 87 1 for an example The levels of an existing factor are extracted with the levels function For example suppose we have given three tests to each of three students and we want to rank the students We might enter the data frame as fol lows see also 86 1 34 gt student lt rep 1 3 3 gt score lt c 9 6 5 8 8 7 5 6 9 5 8 7 gt tests lt data frame cbind student score gt str tests data frame 9 obs of 2 variables student num 12312312 3 score num 96 5 887 569 5 87 We have the data but the student is just listed by a number t
12. another web page will appear You will probably want to start with the section Appendix A A sample session scroll down the web page to find this in the table of contents This will give you some familiarity with the style of R sessions and more importantly some instant feedback on what actually happens Don t worry if you don t understand everything this is just to give you a feel for how R works and what it can do For individual commands it is always best to look at its help topic The introductory tutorial or similar has been translated to many languages including Chinese Croatian Farsi French German Hungarian Italian 47 http cran r project org web views index html 48 http cran r project org web views Spatial html 49 http cran r project org web views Multivariate htm 50 http cran r project org other docs htm1 lhttp cran r project org doc contrib Ricci refcard regression pdf 52 http cran r project org doc contrib Ricci refcard ts pdf 110 Japanese Spanish Polish Portuguese and Vietnamese these can be ac cessed by following the CRAN link on the R Project home page and then Contributed documentation and other languages A good introduction to R concepts is the 100 page R for Beginners by Emmanuel Paradis of the University of Montpellier F 30 This is also availble in his native French 32 and in a Spanish translation 31 Correa amp Gonzalez 9 is a Span
13. this is defined as On ll vi i l n where 7 is the length of the vector v but is more reliably computed by taking logarithms dividing by the length and exponentiating gt hm lt function v exp sum log v length v gt hm 1 99 1 37 6231 gt mean 1 99 1 50 As it stands this function does not check its arguments it only makes sense for a vector of positive numbers gt hn cC 1 2 1 2 1 NaN Warning message NaNs produced in log x To correct this behaviour we write a multi line function with a conditional statement and the return function to leave the function gt hm lt function v if Clis numeric v 1 print Argument must be numeric return NULL else if any v OD 66 printC All elements must be positive return NULL else return exp sum log v length v gt classChm 1 function gt hm function v gt hm letters 1 Argument must be numeric NULL gt hn cC 1 2 1 25 1 All elements must be positive NULL gt hm 1 99 1 37 6231 Note the use of the any function to check if any elements of the argument vector are not positive Also note how simply typing the function name lists the function object this is shorthand for the print function to call the function you must supply the argument list Note also the use of the if else control structure for conditional ex ecution these can be nested i e the statemen
14. 7 1 Importing data from a CSV file 2 20 2 Importing mapes Pe oe ee we ee a E ERU ee R 8 Exporting from R 9 Reproducible data analysis 9 1 The NoWeb document i08 660045664004 be we 4 8 5 The Pigs document i253 sede Rue ce Geaux e ends 9 3 The PDF document 2 222x e t xem Xe REX EUR e 04 Graphicsin 5WeaVe dude 369 REST ERG Eis kisesi 10 Learning R TOA Task VIEWS 66 eee Se Re REGE Gen amp sc x 10 2 R tutorials and introductions 0 55 10 3 Textbooks Using Re se ect te a woe aoa 10 4 Technical notes using Rov we ak a ee i a a 10 5 Web Paves to learn R s ou Rem ee e a a 10 6 Keeping up with developments in R 11 Frequently asked questions 11 1 Help I got an error what didi do wrong 11 2 Why didn t my command s do what I expected 11 3 How do I find the method to do what I want 11 4 Memory problems eee eee eae 11 5 What version of R am I running 11 6 What statistical procedure should I use A Obtaining your own copy of R A l Installing new packages llle A 2 Customizing your installation A3 Rin different human languages 2 46 bs E ERE ERG B An example script C An example function References Index of R concepts L Whe RSUdIOSCEOelr creirien sw OR ae ee 2 The Timn R Screen 2 4 ko ke ko x oy oe ea ee 3 TheR Commander screen
15. Verzani J 2004 Using R for Introductory Statistics Chapman amp Hall CRC Press URL http www math csi cuny edu UsingR 111 Wickham H 2009 ggplot2 Elegant Graphics for Data Analysis Use R Springer ISBN 0387981403 69 Xie Y 2011 knitr Elegant flexible and fast dynamic report genera tion with R URL http yihui name knitr Index of R concepts operator 38 formula operator 53 Operator 19 27 formula operator 53 54 operator 19 formula operator 53 54 operator 19 formula operator 54 operator 19 formula operator 53 operator 22 language symbol 64 operator 37 operator 20 operator 20 gt operator 38 12 13 Devices 10 operator 47 L 1 operator 30 32 operator 22 36 37 89 operator 28 55 operator 19 operator 19 amp operator 38 A formula operator 53 A operator 19 23 formula operator 52 78 formula operator 78 operator 38 operator 24 32 47 operator 43 language symbol 64 language symbol 64 abline 71 73 abs 19 AIC 59 anova 57 any 61 67 aov 57 apply 29 49 61 arrows 65 72 as 95 as character 43 as data frame 93 99 as factor 34 45 as matrix 96 as numeric 45 71 97 as ordered 34 asin 20 asp trellis graphics argument 89 assocp lot 76 83 attach 32 attr 26 AttributeList package sp class 47 attributes 26 96 auto key tr
16. amp Hall CRC 111 112 17 Faraway J J 2006 Extending the linear model with R generalized lin ear mixed effects and nonparametric regression models Boca Raton Chapman amp Hall CRC 111 112 18 Fox J 1997 Applied regression linear models and related methods Newbury Park Sage 56 58 59 111 19 Fox J 2002 An R and S PLUS Companion to Applied Regression New bury Park Sage 111 119 20 Ihaka R amp Gentleman R 1996 R A language for data analysis and graphics Journal of Computational and Graphical Statistics 5 3 299 314 1 21 Knuth D E 1992 Literate programming CSLI lecture notes 27 Stan ford CA Center for the Study of Language and Information 106 22 Kopka H amp Daly P W 2004 Guide to ATEX Boston Addison Wesley Ath edition 106 23 Lamport L 1994 BTrEX a document preparation system user s guide and reference manual Reading MA Addison Wesley 2nd edition 106 24 Leisch F 2002 Sweave part I Mixing R and TpX R News 2 3 28 31 URL http CRAN R project org doc Rnews 106 108 25 Leisch F 2006 Sweave User s Manual Vienna A TU Wein 2 7 1 edition URL http www stat uni muenchen de leisch Sweave 106 108 26 Metcalfe A V amp Cowpertwait P S 2009 Introductory Time Series with R Use R Springer DOI 10 1007 978 0 387 88698 5 111 27 Mevik B H 2006 The pls package R News 6 3 12 17 62 28 Murrell P 2006 R graphics Computer
17. gt str trees complete data frame 30 obs of 3 variables Girth num 8 6 8 8 10 5 10 7 10 8 1111 Height num 65 63 72 81 83 66 75 80 75 79 Volume num 10 3 10 2 16 4 18 8 19 7 15 6 The first observation with the missing volume was removed 4 21 Control structures and looping S is a powerful programming language with Algol like syntax and control structures Single statements may be grouped between the braces and separated either by new lines or the command separator The value of the and expression is the value of its last statement The if else control structure allows conditional execution These can be nested i e the statement for either the if or else can itself contain a if else control structure The for while and repeat functions allow looping Note however that because of R s many vectorized functions looping is much less common in R than in non vectorized languages such as C For example to convert the sample confusion matrix from counts to pro portions you might be tempted to try this cmp cm gt for i in 1 length cm cmp i lt cm i sumCcm gt cmp 1 2 3 4 1 0 21472393 0 08588957 0 06748466 0 006134969 2 0 02453988 0 06748466 0 01840491 0 000000000 3 0 07361963 0 05521472 0 23312883 0 024539877 4 0 01226994 0 03067485 0 07361963 0 012269939 But you can get the same result with a vectorized operation gt cmp cm sum cm S
18. plot single SW single SA pch as numeric single spp xlab expression paste Stem area plain cm 2 y lab expression paste Sapwood area plain cm A2 xlim c 0 150 ylim c 0 90 abline O coefficients m all sw sa spp i st 1 title main Sapwood area vs stem area smaller trees for i in 2 9 1 abline O coefficients m all sw sa spp i st 1 coefficients m all sw sa spp i st i lty i 1 legend x 0 y 90 legend spp pch c 2 length spp 1 Ity c 2 length spp 1 text 150 0 S formula JImCSW SA spp 1 pos 2 text 150 6 substitute plain R A2 x list x round summary m all sw sa spp i st adj r squared 3 pos 2 text 150 12 expression paste plain Model y beta 1 s sum beta 2 x k k 1 n epsilon pos 2 5 1 2 Returning results from graphics methods gt gt gt Li FAA t t t Many high level graphics methods return results which may be assigned to an S object or used directly in an expression For example the hist method returns the break and mid points counts and densities data trees h hist trees Volume str h st of 7 breaks num 1 8 10 20 30 40 50 60 70 80 counts int 1 7 10951501 intensities num 1 7 0 03226 0 02903 0 01613 0 00323 0 01613 density num 1 7 0 03226 0 02903 0 01613 0 00323 0 01613 mids num 1 7 15 25 35 45 55 65 75 xname chr trees Volume 74 Sapwood area vs stem area smaller trees Sapwood area cm
19. read csv dlv csv Error in file file r unable to open connection In addition Warning message cannot open file dlv csv The unable to open connection message means that the file could not be found Check the current working directory with the getwd method and see if the file is there with the list files method or you could look for the file in Windows Explorer gt getwd O 1 Users rossiter ds DavisGeostats gt list files pattern dlv csv character 0 In fact this file is in another directory One way is to give the full path name note the use of the front slash as in Unix R interprets this correctly for MS Windows gt dlv lt read csv Users rossiter ds DLV dlv csv gt dim dlv 1 88 33 Another way is to change the working directory with the setwd method gt setwd Users rossiter ds DLV 115 gt getwd 1 Users rossiter ds DLV gt dlv read csv dlv csv gt dim dlv 1 88 33 3 A common problem is the attempt to use a method that is in an unloaded package For example suppose we try to run a resistant regression with the lqs method gt lgs lead om data meuse Error couldn t find function lqs This error has two possible causes either there is no such method anywhere for example if it had been misspelled 1qr or the method is available in an unloaded package To find out if the method is available in any package search for the topic with th
20. www stata com links stat_software htm1 9 http www gstat org Excel was not designed by statisticians There are also some commer cial add on packages for Excel that provide more sophisticated statistical analyses Excel s default graphics are easy to produce and they may be customized via dialog boxes but their design has been widely criticized Least squares fits on scatterplots give no regression diagnostics so this is not a serious linear modelling tool OpenOffice includes an open source and free spreadsheet Open Office Calc which can replace Excel 2 3 5 Applied mathematics programs MATLAB is a widely used applied mathematics program especially suited to matrix maniupulation as is R see 4 6 which lends itself naturally to programming statistical algorithms Add on packages are available for many kinds of statistical computation Statistical methods are also pro grammable in Mathematica 10 http www openoffice org 3 UsingR There are several ways to work with R with the RStudio integrated development environment IDE 3 1 this is recommended for the typical UT ITC user with the R console GUI 83 2 with the Tinn R editor 83 4 from another IDE such as JGR from a command line R interface CLI 83 3 from the ESS Emacs Speaks Statistics module of the Emacs editor Of these RStudio is the best choice for most ITC users it contains an R command line interface but wit
21. 215 221 222 224 228 230 234 232 222 211 206 211 215 212 218 227 232 234 235 240 241 234 222 211 218 224 230 238 244 245 247 245 234 222 211 222 230 235 241 250 256 256 245 234 223 214 This format is documented by ESRI there are six header lines with self explanatory names followed by one row per image row This specific format can be read directly into a spatial object with the read asciigrid function of the sp package however here we show how to read it using basic R commands The scan function which we used above 6 1 to read from the keyboard can also be used to read from text files if the first argument is the optional file Lines can be skipped with the optional skip argument and the num ber of lines to read can be specified with the optional nlines argument The value to take as the missing values NA in R is specified with the op tional na strings argument The separator between values is by default white space that can be changed with the optional sep argument Finally scan expects to read double precision numbers this can be changed with the optional what argument In this example we first read in the dimensions of the image The header lines have both character strings e g NCOLS and numbers e g 12 we read them into a character vector and then extract the number with the as numeric function Note that we read the second line by specifying 102 skip 1 and then n
22. 38 3 42 6 51 0 51 5 55 4 55 7 58 3 77 0 rank trees Volume 1 2 5 2 5 1 0 5 0 7 0 9 0 4 0 6 0 15 0 10 0 11 16 0 11 0 13 0 12 0 8 0 14 0 21 0 19 0 18 0 17 0 22 0 22 20 0 23 0 24 0 25 0 28 0 29 0 30 0 27 0 26 0 31 0 Note how rank averages tied ranks by default this can be changed by the optional ties method argument This example also illustrates the operator for extracting fields from dataframes see 84 7 4 5 Vector and list data structures Many S functions create complicated data structures whose structure must be known in order to use the results in further operations For example the sort function sorts a vector when called with the optional index TRUE argument it also returns the ordering index vector gt SS sort samp index TRUE gt str ss List of 2 x num 1 10 1 2392 0 0377 2 2405 3 7608 ix int 1 10 12 365478 10 9 This example shows the very important str function which displays the S structure of an object Lists In this case the object is a list which in S is an arbitrary collection of other objects Here the list consists of two objects a ten element vector of sorted values ss x and a ten element vector of the indices ss ix which are the positions in the original list where the corresponding sorted value was found We can display just one element of the list if we want gt ss ix 1 12 3 6 5 4 7 810 9 This shows the syntax for accessing named components of a data frame or l
23. 38 4 4 2 5 12 2 gt colnames cm lt c A B C D rownames cm LETTERS 1 4 gt cm V OnUu2 m N te UJ oo 35 1411 1 gt cm A C 1 11 Note the use of the byrow optional argument to the matrix function to indicate that we entered the data by row also the use of the rownames and colnames functions to label the matrix with upper case letters supplied conveniently by the LETTERS built in constant 6 2 A GUI data editor The fix function provides a simple graphical data editor for data frames It can be used to correct data entry mistakes or to build new data frames 39 This matrix is also used as an example in 84 6 92 For example to edit the sample trees dataset gt data trees gt fix trees This opens the data editor window Figure 17 Warning any changes made to the frame will be stored in the object You can change data values add rows add columns new variables change column names and if you give the optional edit row names T argument change row names R Data Editor Jog Figure 17 R graphical data editor To edit a new data frame you must first create it as a null data frame using the as data frame function and then call the fix function gt myNewFrame as data frame NULL gt fix myNewFrame 6 3 Importing data from Microsoft Excel The XLConnect package uses an interface to Java to allow R to directly read and wr
24. 61 86 46 38 k piv Workspace History et3 C Load A Saver _ 4 import Dataset yf Clear All Data dlv 88 obs of 33 variables dlv r 87 obs of 33 variables Values clay b aov aov 13 r geo cols character 7 Files Plots Packages Help a 49 Zoom Ej Exporte Q s Clear All Lake Valencia e Region 2 o Region 3 S _ Region6 hd 9 Region 7 e o o o 8 S 2 eo 2 27 so 5 ose S o B o o 8B 4 rr i ae es 640000 650000 660000 670000 UTME Subsoil bulk density by geomorphic region Figure 1 The RStudio screen stallation will create a Start menu item and a desktop shortcut Within the ITC network R can be installed with the ITC Software Manager 3 2 1 Running the R console GUI This section 83 2 1 is only relevant if you run R by itself not within an IDE such as RStudio R GUI for Windows is started like any Windows program from the Start menu from a desktop shortcut or from the application s icon in Explorer By default R starts in the directory where it was installed which is not where you should store your projects So you will probably want to change your workspace as explained in 83 2 2 You can also create a desktop shortcut or Start menu item for R also as explained in 3 2 2 To stop an R session type q at the command prompt or select the File Exit menu item in the Windows GUL 3 2 2 Setting up a workspace in Windows An important c
25. 71 64 78 80 74 72 77 81 82 80 29 80 80 87 gt get one element gt trees 1 2 1 70 311 ft 28 3168 dm 31 gt trees 1 Height 1 70 trees 1 Height 1 70 The forms like Height use the operator to select a named field within the frame The forms like 1 2 show that this is just a matrix with col umn names leading to forms like trees 1 Height The forms like trees 1 Height show that each row observation case can be consid ered a list with named items The forms like trees 2 show that the data frame is also a list whose elements can be accessed with the operator Attaching data frames to the search path To simplify access to named columns of data frames S provides an attach function that makes the names visible in the outer namespace gt attach trees gt Height 1 5 1 70 65 63 72 81 V MM N FOONDUBRWNE Building a data frame S provides the data frame function for creating data frames from smaller objects usually vectors As a simple example suppose the number of trees in a plot has been measured at five plots in each of two transects on a regular spacing We enter the x co rdinate as one list the y co rdinate as another and the number of trees in each plot as the third x lt rep seq 182 183 length 5 2 1000 y rep c 381000 310300 5 n trees c 10 12 22 4 12 15 7 18 2 16 ds data frame x y n trees x y n trees 182000 381000 10 18
26. NA NA NA NA NA NA 8 12 2 11 9 11 5 NA NA 13 11 9 11 5 11 5 11 11 4 11 08 11 07 11 08 11 06 10 97 23 10 54 10 82 10 94 10 75 10 93 28 When the 28 was displayed the user pressed the Enter key and R re alised the list was complete The three vectors are then gathered a data frame and the local copies are discarded By default the fields in the frame are named from the local 38 In RGui select menu command Help Manuals R Data Import Export 91 variables but new names can be given with the fieldname variable syntax For example gt oly 100 lt data frame year yr men women gt str oly 100 data frame 27 obs of 3 variables year num 1900 1904 1908 1912 1916 men num 11 11 10 8 10 8 NA 10 8 10 6 10 8 10 3 10 3 women num NA NA NA NA NA NA NA 12 2 11 9 11 5 gt rm yr men women To enter a matrix first enter the data with the scan function or as a list with the c function and then place it in matrix form with the matrix cre ate a matrix function We illustrate this with the sample confusion ma trix of Congalton et al 6 also used as an example by Skidmore 53 and Rossiter 43 This example also illustrates the rownames and colnames functions to assign or read the row and column names of a matrix gt cm scan 1 3514111411301293842 5 122 17 gt cm matrix cm 4 4 byrow T gt cm 1 2 3 4 1 35 14 11 1 2 4 11 3 0 3 12 9
27. RStudio on your MS Windows Mac OS X or Linux sys tem 8A 2 Run RStudio this will automatically start R within it 3 Follow one of the tutorials 810 2 such as my Using the R Environ ment for Statistical Computing An example with the Mercer amp Hall wheat yield dataset 48 4 Experiment 5 Use this document as a reference R is an open source environment for statistical computing and visualisa tion It is based on the S language developed at Bell Laboratories in the 1980 s 20 and is the product of an active movement among statisti cians for a powerful programmable portable and open computing en vironment applicable to the most complex and sophsticated problems as well as routine analysis without any restrictions on access or use Here is a description from the R Project home page R is an integrated suite of software facilities for data manip ulation calculation and graphical display It includes an effective data handling and storage facility a suite of operators for calculations on arrays in partic ular matrices alarge coherent integrated collection of intermediate tools for data analysis graphical facilities for data analysis and display either on screen or on hardcopy and a well developed simple and effective programming lan guage which includes conditionals loops user defined re cursive functions and input and output facilities The last point has resulted in
28. The R Foundation for Statistical Computing 2 15 0 2012 03 30 edition URL http cran r project org doc manuals R data pdf 91 R Development Core Team 2012 R Language Definition Vienna The R Foundation for Statistical Computing 2 15 0 2012 03 30 draft edition URL http cran r project org doc manuals R lang pdf 19 67 Ribeiro Jr P J amp Diggle P J 2001 geoR A package for geostatistical analysis R News 1 2 14 18 URL http CRAN R project org doc Rnews 2 111 Ripley B D 1981 Spatial statistics New York John Wiley and Sons 2 Ritz C amp Streibig J C 2008 Nonlinear regression with R Use R New York Springer URL http CRAN R project org package nIrwr 111 Rossiter D G 2004 Technical Note Optimal partitioning of soil transects with R Enschede NL unpublished online URL http www itc nl personal rossiter teach R R OptPart pdf 112 Rossiter D G 2004 Technical Note Statistical methods for accuracy assesment of classified thematic maps Enschede NL International Institute for Geo information Science amp Earth Observation ITC URL http www itc nl personal rossiter teach R R_ac pdf 25 92 112 Rossiter D G 2005 Technical Note Fitting rational functions to time series in R Enschede NL International Institute for Geo information Science amp Earth Observation ITC 135 45 46 47 48 49 50 51 52 53 URL http www itc nl p
29. also review graphics within the Windows R GUI itself Create the first graph bring the graphics window to foreground and then select the menu command History Recording After this all graphs are auto matically saved within R and you can move through them with the up and down arrow keys You can also write your graphics commands directly to a graphics file in many formats e g PDF or JPEG You do this by opening a graphics device writing the commands and then closing the device You can get a list of graphics devices formats available on your system with Devices note the upper case D For example to write a PDF file we open a PDF graphics device with the pdf function write to it and then close it with the dev of f function pdf figurel pdf h 6 w 6 hist rnorm 100 main 100 random values from N 0 1 dev of FQ Note the use of the optional height and width arguments here abbre viated h and w to specifiy the size of the PDF file in US inches this affects the font sizes The defaults are both 7 inches 17 18 cm 10 3 3 Working with the R command line These instructions apply to the simple R GUI and the R command line interface window within RStudio One of the windows in these interfaces is the command line also called the R console It is also possible to work directly with the command line and no GUI Under Linux and Mac OS X at the shell prompt just type R there are various startup options wh
30. be written more simply gt attach trees 33 Nicely illustrating the risks of extrapolating outside the range of calibration this data set only has trees from 63 to 87 feet tall so the fitted relation says nothing about shorter trees 52 gt model lt Im Volume Height but even if not the variable names can be referred to a frame with the data named argument gt model lt Im Volume Height data trees Additive effects More complicated models include additive effects using the formula operator gt model lt Im Volume Height Girth data trees Note this is not an arithmetic addition but rather a special use in the model notation Here the tree volume is explained by by both its height and girth considered as independent predictors Interactions The formula operator is used to indicate interactions usu ally these are used in addition to additive terms gt model lt Im Volume Height Girth Height Girth data trees Here the tree volume is explained by both its height and girth as well as their interaction i e that the effect of girth is different at different heights The formula operator is shorthand for all linear terms and interactions of the named independent variables so that the previous example could have been more simply written as gt model lt Im Volume Height Girth data trees The A formula operator is used to indicate predictor crossing to the speci f
31. because the sample is random 42 gt sort round rnorm 10 170 15 1 155 167 170 177 181 182 185 186 188 199 To start a simulation at the same point e g for testing use the set seed function gt set seed 61921 gt sort round rnorm 10 170 15 1 129 157 157 166 168 170 173 175 185 193 gt set seed 61921 gt sort round rnorm 10 170 15 1 129 157 157 166 168 170 173 175 185 193 Now the results are the same every time 4 12 Character strings R can work with character vectors also known as strings These are often used in graphics as labels titles and explanatory text A string is created by the quote operator gt Clabel lt A good graph 1 A good graph Strings can be built from smaller pieces with the paste function parts can be extracted or replaced with the substring function strings can be split into pieces with the strsplit function paste label 15 x 20 cm 1 A nice graph 15 x 20 cm gt labels lt paste B 1 8 sep 1 B1 B2 B3 B4 B5 B6 B7 B8 gt substring label 1 4 1 A go gt substring label 3 nice label 1 A nice graph gt strsplit label E111 1 A nice graph unlist strsplit label 1 A nice graph gt unlist strsplit label 3 1 graph Note the use of the unlist function to convert the list of one element returned by strsplit into a vector Numbers or f
32. does this look like as a parse language object gt parse text paste m b lt mean B b sep expression m 4 lt mean B4 gt initialise the results vector gt m lt NULL gt evaluate this one expression gt eval parse text paste m b lt mean B b sep Error in mean B4 object B4 not found gt must define this variable to compute its mean gt B4 lt 0 100 gt eval parse text paste m b lt mean B b sep gt result so far gt m 1 NA NA NA 50 gt apply to all 256 variables need B1 B256 defined gt for b in 1 256 eval parse text paste m b lt mean B b sep Note that the for expression in this example does not need to be written on separate lines it is so written here for clarity The result of the parse function is an expression it is also possible to create an R expression ready to be evaluated with the eval function with the expression function For example gt tmp lt expression 2 pi r gt r lt 3 gt eval tmp 1 2 0944 68 5 R graphics R provides a rich environment for statistical visualisation 28 There are at least four graphics systems 1 the base R graphics system in the graphics package loaded by de fault when R starts 2 the Trellis system in the lattice package 3 the Grid system 28 in the grid package 4 the Grammer of Graphics g
33. frame num 1 num 1 num 1 num 1 Factor frame num 1 num 1 num 1 Species NOR Uu NJ 4 amp Uu Q0 ORR onus PRU setosa str ir s 50 obs riabl 50 5 1 50 3 5 50 1 4 50 0 2 O w 3 level of 9 3 4 2 S 50 obs of 6 2 5 5 s f 8 iab 4 6 5 3 3 6 1 25 ds 22 0 QJ d QU Un AA 9 3 1 7 0 4 versi setosa ve 5 variables S 50 7 6 50 3 2 50 4 7 50 1 4 1 w 3 level 50 obs 50 6 3 5 50 3 3 2 7 50 6 5 1 5 4 3 4 Dwwsh e UJ hJ Oo non 2 Setosa ve 5 Variables 2 fags o NOUWN Bune Hooo Oo u w w M 0 wo NONU NONGO RPWuN Petal Width Species 50 2 5 1 9 w 3 levels num 1 Factor setosa 3 3 25 The split function builds a list of data frames named by the level of the factor on which the original data frame was split Here the original 150 observations have been split into three lists of 50 one for each species These can be accessed by name gt summary Cir s setosa Petal Length Min 1st Qu Median Mean 3rd Qu 1 00 1 40 1 50 1 46 1 58 Max 1 90 4 9 1 Simultaneous operations on subsets We often want to apply some computation to all subsets of a data frame For example to compute the mean petal length of the iris data set for each species separately we could first split the set as shown in the pre vious section 84 9 compute each subset s mean and join them in one vector Th
34. generically as Weaving or literate programming 21 and has been implemented for R as Sweave 24 25 and more recently as knitr 61 implemented as the knitr pack age This approach has three steps 1 preparing the source NoWeb document 2 compiling this with Sweave into a BIEX source file 3 compiling the IATEX source into a PDF document Following is a brief introduction for an extended discussion see Rossiter 47 9 1 The NoWeb document This is plain text file with extension Rnw R NoWeb It contains two kinds of commands 1 AIpXcommandas as in a standard AIX document 2 R commands to be executed The R commands are placed in code chunks delimited with the special symbols to begin a chunk and to end one Here is a minimal example suppose this is in a text file named example Rnw documentclass article usepackage pdftex final graphicx usepackage Sweave begin document title Sweaving for reproducible data analysis author A Nonymous date 14 November 2010 maketitle Here is a simple example of R and Sweave lt lt gt gt data trees str trees summary 1m Volume Height sGirth data trees adj r squared end document 44 http www latex project org 45 http yihui name knitr 106 The BIEX commands begin with e g documentclass The R code is written as if at the R console but between the lt lt gt gt beginning a cod
35. if the first argument is the optional file Lines can be skipped with the optional skip argument and the num ber of lines to read can be specified with the optional nlines argument The value to take as the missing values NA in R is specified with the op tional na strings argument The separator between values is by default white space that can be changed with the optional sep argument Finally scan expects to read double precision numbers this can be changed with the optional what argument In this example we first read in the dimensions of the image The header lines have both character strings e g NCOLS and numbers e g 12 we read them into a character vector and then extract the number with the as numeric function Note that we read the second line by specifying skip 1 and then nlines 1 gt ncol lt scan file image grd nlines 1 what character gt ncol 1 NCOLS 12 gt ncol lt as numeric ncol 2 gt ncol 1 12 gt nrow lt as numeric scan file image grd skip 1 nlines 1 what character 2 1 12 Now we read the image data into a matrix as one vector gt img scan file image grd skip 6 na strings 9999 Read 144 items gt summary Cimg Min 1st Qu Median Mean 3rd Qu Max 141 176 202 200 222 256 gt class img 1 numeric 97 Although this small image has no missing values in general we may expect them so we need to specify the format of the missing values
36. in Sweave Since R produces graphs Sweave can also produce them and incorporate them into documents See Rossiter 47 Leisch 24 25 and the Sweave 108 web sitet for many examples 46 http www stat uni muenchen de leisch Sweave 109 10 Learning R The present document explains the basics of using R 3 the S language 4 R graphics 5 and the on line help system 3 3 2 There are many other resources for learning and applying R this section explains the most useful 10 1 Task views Many applications are covered in on line Task Views These are a summary by a task maintainer of the facilities in R to accomplish cer tain tasks For example Roger Bivand maintains a task view Analysis of Spatial Data which discusses how to represent spatial data in R how to read and write it how to analyze point patterns and geostatistical analy sis Another useful task view is Multivariate Statistics maintained by Paul Hewson The contributed documentation at CRAN has many introductions and reference cards for specific kinds of analysis for example regression anal ysis and time series analysis 10 2 R tutorials and introductions Translations From the R console you can follow the introductory course An Introduc tion to R as follows 1 Select menu command Help Html help a web page will appear in your web browser 2 In this web page select the link An Introduction to R
37. is extensive help within the editor to select the proper commands and arguments Commands can be sent directly from the editor to R or saved in a file and sourced 3 5 Writing and running scripts After you have worked out an analysis by typing a sequence of commands you will probably want to re run them on edited data new data subsets etc This is easy to do by means of scripts which are simply lists of com mands in a file written exactly as you would type them at the command line interface They are run with the source function A useful feature of scripts is that you can include comments lines that begin with the 20 from the OED meaning a brief verbal description of a person place etc a short descriptive or evocative episode in a play etc 21 http ww sciviews org Tinn R 13 ee E A Tinn R O XCourseDataVE learning DEGeostatisticslCDlexercisesWRcodelex11 R Eile project Edit Format Search Options Tools R view Window Web Help 28x Q 8u8u8 amp 8 23 e9 GQGuati1isjgs utpe ise missum 00 R complex X F aleu 2 J 4 o Fenn sAnne y ex41 R ex42 R ext Ri median meuse lead mean meuse lead range meuse lead E 1 W 4 var meuse lead Operators arithmetic l sd meuse lead Operators logical quantile meuse lead n quantile meuse lead seq 0 1 1 uu Computer Project R card R explorer gt AIC f l
38. o S 7 of 9 3 3 5 o N N g v id 5 o g 14 96 c 0 5 S w E 9 7 o cg a E is ae 241 eI os oe Cook sglistance o 18 ie S T T T T T T T T T T T T T 10 20 30 40 50 60 70 80 0 0 0 1 0 2 0 3 0 4 0 5 Fitted values Leverage Figure 4 Regression diagnostic plots gt text h ifelse h gt 3 mean h seq 1 length h pos 2 A direct way to access diagnostics is with the influence measures func tion applied to the fitted model gt infl lt influence measures model gt strCinfl List of 3 infmat num 1 31 1 8 0 23213 0 06511 0 15754 0 00173 0 02461 attr dimnames List of 2 chr 1 31 1 2 3 4 chr 1 8 dfb 1 dfb Hght dfb Grth dfb Hg G is inf logi 1 31 1 8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE attr dimnames List of 2 chr 1 31 1 2 3 4 LL chr 1 8 dfb 1 dfb Hght dfb Grth dfb Hg G call language Im formula Volume Height Girth data trees attr class chr infl 60 The key field here is is inf which specifies which observations were es pecially influential in the fit using eight different measures of influence which are listed as attributes of this field gt attr infl infmat dimnames 21 1 7 gt infl is inf which applyCinfl is inf 1 any dfb 1_ dfb Hght dfb Grth dfb Hg G dffit cov r cook d hat FALSE FALSE FALSE FALSE FALSE
39. plot 5 2 Trellis graphics The Trellis graphics system is a framework for data visualization devel oped at Bell Labs where S originated based on the ideas in Cleveland 5 It was implemented in S PLUS and then in R with the lattice package its design and basic use are well explained by its author Sarkar 50 in the R Newsletter It is harder to learn than R base graphics but can produce 77 higher quality graphics especially for multivariate visualisation when the relationship between variables changes with some grouping factor this is called conditioning the graph on the factor It uses formulae similar to the statistical formulae introduced in 4 17 to specify the variables to be plotted and their relation in the plot Customization of Trellis graphics parameters for example default background and symbol colours is ex plained in 5 2 6 5 2 1 Univariate plots As a simple example consider the iris dataset To produce a kernel density plot a sophisticated histogram on the whole dataset use the densityplot method gt densityplot Petal Length data iris The operator here has no left hand side since there is no dependent variable in the plot it is univariate The petal length is the independent variable and we get one plot this is shown shown on the left side of Figure All species Split by species i L L 2 4 6 L L L L L 1 L L setosa versicolor virginica 0 25 4 c 25 4
40. several lines giving information on the file contents followed by the data values as one long vector or else with each row of the imagery as a row in the text file An example is the ESRI asciigrid format Here is a small example of a 12 by 12 image stored as ASCII text file image grd We display it within R with the file show function gt file show image grd NCOLS 12 NROWS 12 XLLCORNER 180000 YLLCORNER 331000 CELLSIZE 40 NODATA VALUE 9999 96 141 148 159 166 176 186 192 195 201 206 211 215 143 153 160 170 177 188 198 204 206 212 218 222 147 154 164 172 182 189 199 209 215 218 224 230 151 159 166 176 183 193 201 211 221 227 230 235 153 163 170 177 188 195 205 212 222 232 238 241 153 164 175 182 189 199 206 217 224 234 244 250 153 164 176 186 193 201 211 218 228 235 245 256 157 164 176 188 198 205 212 222 230 240 247 256 159 169 176 188 199 209 217 224 234 241 245 245 159 170 180 188 199 211 221 228 232 234 234 234 163 170 182 192 199 211 221 222 222 222 222 223 166 175 182 193 204 209 211 211 211 211 211 214 This format is documented by ESRI there are six header lines with self explanatory names followed by one row per image row This specific format can be read directly into a spatial object with the read asciigrid function of the sp package however here we show how to read it using basic R commands The scan function which we used above 6 1 to read from the keyboard can also be used to read from text files
41. str tests data frame 9 obs of 2 variables 35 student Factor w 3 levels John Paul 123123123 score num 9 6 588 7 569 587 gt table tests score student 6 6 57 7 5 8 9 9 5 John 0 00 011 1 Paul 0 10 110 0 George 1 01 010 0 The levels have the same internal numbers but different labels This use of factor with the optional labels argument does not change the order of the factors which will be presented in the original order If for example we want to present them alphabetically we need to re order the levels ex tracting them with the levels function in the order we want as indicated by subscripts and then setting these in the new order with the optional levels argument to factor gt tests student lt factor tests student levels levels tests student c 3 1 2 gt str tests data frame 9 obs of 2 variables student Factor w 3 levels George John 231231231 score num 96 5 887 569 5 87 gt table tests score student 6 6 57 7 5 89 9 5 George 1 01 010 0 John 0 00 011 1 Paul 0 10 110 0 Now the three students are presented in alphabetic order because the underlying codes have been re ordered The car package contains a useful recode function which also allows grouping of factors Factors require special care in statistical models see 84 17 1 4 9 Selecting subsets We often need to examine subsets of our data for example to perform a separate analysis for several strata
42. with the expression 1 Or the origin can be named explicitly with the formula operator with the expression 0 For example it s certainly true that a tree with no girth has no height so if we want to force the regression of height on girth to go through 0 0 gt model lt Im Height Girth 1 data trees gt model lt Im Height 0 Girth data trees V V M Note Although this seems logical in the range of timber trees this may give a poorer predictive relation than allowing a physically impossible intercept and only predicting in the range of the calibration data Arithmetic operations in formulas Since the characters and have special meaning in formulas they must be quoted with the I op erator if they are to interpreted as arithmetic operators For example to model tree volume from the height to girth ratio model lt Im Volume I Height Girth data trees To model volume as the square of girth model lt Im Volume I Girth 2 data trees This is only needed if there is a danger of mis interpretation most func tions can be used directly in formulas e g the log function to compute natural logarithms For example to fit a log log regression of tree height by width model lt 1m log Height log Girth For further description of model formulae see the help topic formula 54 The design matrix For full control of linear modelling R offers the ability to ext
43. 0 20 4 L 2 0 4 b 0 15 L 2 15 2 o 2 2 2 I 5 10 4 L a 0 10 1 0 F 0 05 F 05 L 0 00 4 GuxeDO O OGBEREDDDEDODODODODOOORORED cD CDO 0 0 Cammo T T T T T T T T T T T T 0 2 4 6 8 2 4 6 2 4 6 Petal Length Petal Length Figure 11 Trellis density plots without and with a conditioning factor To add conditioning we use the operator which can be read as condi tioned on the variable s named on its right side densityplot Petal Length Species data iris 78 Here there is one panel per species this is shown shown on the right side of Figure 11 We can clearly see that the multi modal distribution of the entire data set is due to the different distributions for each species 5 2 2 Bivariate plots The workhorse here is the xyplot method now with a dependent y axis and independent x axis variable this can also be conditioned on one or more grouping factors gt xyplot Petal Width Petal Length data iris groups Species auto key T gt xyplot Petal Width Petal Length Species data iris groups Species These are shown in Figure 12 Note the use of the groups argument to specify a different graphic treatment in this case colour for each species and the auto key argument to get a simple key to the colours used for each species All species Split by species setosa o versicolor o 12345 67 virginica TE CNN HRE RE OES NT DT NS LU OE NE
44. 014 Reduce Rex Xo dede cx s 3 2 2 Disadvantages ipe 43 40 ata ehh ade eee mum ba 4 2 9 JAlterraliVesS os cus RS Ree ae ee ds deu 4 Pu Se PLUS asi ndun d Smo ee pene d Shed boe OR 4 2 3 2 Statistical packages 22 0 00000000 5 2 3 3 Special purpose statistical programs 5 2 3 4 Spreadsheets lle 5 2 3 5 Applied mathematics programs 6 3 UsingR 7 3 1 The RStudio integrated development environment 7 212 JRoonsole GUI uo ime yoke s E X oW ue 7 3 2 1 Running the R console GUI 8 3 2 2 Setting up a workspace in Windows 9 3 2 3 Saving your analysis steps 10 3 2 4 Saving your graphs len 10 3 3 Working with the R command line 11 3 3 1 Thecommand prompt 11 3 5 2 On line help iNR ice ux omm Rex EEG Se 12 a4 The Tini R code editor 6 40 4 da wa ea ee ea XY 13 2 Waung and running Scripts lt ri seci 2 kem ORE XY 13 36 The BemurGUlLl 2 20 4 ax bua I OR a ee au 15 3 7 Loading optional packages ss ka eee t mx mx 16 3 8 Sampledastasets 1 22 urea a A ee ee ee ee c 17 4 The S language 19 4 1 Command line calculator and mathematical operators 19 4 2 Creating new objects the assignment operator 20 4 3 Functions and their arguments less 21 4 4 Vectorized operations and recycding 22 4 5 Vector and list data structures
45. 1 while 64 windows 76 84 wireframe package lattice 79 83 write table 99 writeWorksheet 93 141 xlab graphics argument 70 XLConnect package 93 xyplot package lattice 79 83 86 89 ylab graphics argument 70 142
46. 1 0 97285 end Soutput end Schunk end document Note how the input code has been parsed and placed in the special Sinput IATEX environment e g 107 begin Sinput gt summary 1m Volume Height Girth data trees adj r squared end Sinput The code was executed by R and the code that would produce output on the console was added to the document again in a special ATX environ ment Soutput begin Soutput 1 0 97285 end Soutput There is no cut and paste here The code was executed by R and produced the output What you asked for you got The BIX environments are defined in the Sweave TFX package this was specified in the preamble 9 3 The PDF document The BIEX document produced by Sweave is then processed as any BIX document usually to to produce a PDF Figure 18 shows the output from this example Sweaving for reproducible data analysis A Nonymous 14 November 2010 Here is a simple example of R and Sweave R gt data trees R gt str trees data frame 31 obs of 3 variables Girth num 8 3 8 6 8 8 10 5 10 7 10 8 11 11 11 1 11 2 Height num 70 65 63 72 81 83 66 75 80 75 Volume num 10 3 10 3 10 2 16 4 18 8 19 7 15 6 18 2 22 6 19 9 R gt summary 1m Volume Height Girth data trees adj r squared 1 0 97285 Figure 18 Example PDF produced by Sweave and BIEX Notice how the R input is formatted differently from the R output 9 4 Graphics
47. 2 Sarkar D 2002 Lattice R News 2 2 19 23 URL http CRAN R project org doc Rnews 17 77 Sarkar D 2008 Lattice multivariate data visualization with R Use R New York Springer URL http 1mdvr r forge r project org 111 Shumway R H amp Stoffer D S 2006 Time Series Analysis and Its Applications with R examples Springer Texts in Statistics Springer 2nd edition URL http www stat pitt edu stoffer tsa2 index html 112 Skidmore A K 1999 Accuracy assessment of spatial information In Stein A Meer F v d amp Gorte B G F eds Spatial statistics for remote sensing pp 197 209 Dordrecht Kluwer Academic 25 92 136 54 55 56 57 58 59 60 61 Spector P 2008 Data manipulation with R Use R New York Springer 111 Tatem A J Guerra C A Atkinson P M amp Hay S I 2004 Momen tous sprint at the 2156 Olympics Women sprinters are closing the gap on men and may one day overtake them Nature 431 525 91 Venables W N amp Ripley B D 2000 S Programming Springer ISBN 0 387 98966 8 48 Venables W N amp Ripley B D 2002 Modern applied statistics with S New York Springer Verlag 4th edition URL http www stats ox ac uk pub MASS4 17 56 58 62 111 119 Verzani J 2002 simpleR Using R for Introductory Statistics volume 2003 New York CUNY 0 4 edition URL http www math csi cuny edu Statistics R simpleR index html 111
48. 2 for some reason 1 is not used The results of any plotting commands will be displayed in this window Now we open another window it becomes the current window gt windows gt dev listQ windows windows 2 3 gt dev cur windows 3 At this point any plot commands will go to the most recently opened win dow i e number 3 5 3 1 Switching between windows 5 4 The dev set method specifies which graphics device to use for the next graphics output For example to compare scatterplots of tree volume vs two possible predictors height and girth in adjacent windows gt dev set 2 gt plot Height Volume gt dev set 3 gt plot Girth Volume Multiple graphs in the same window This depends on the graphics system base or trellis 85 2 as implemented in the lattice R package You can determine which system is used by a given graphics command at the top of its help page For example boxplot shows the page title as boxplot graphics indicating that it s in the base graphics package whereas gt xyplot shows the page titleas xyplot lattice indicating that it s a trellis plot 85 5 4 1 Base graphics VVVVV Vv The parameters of the active graphics device are set with the par method One of these parameters is the number of rows and columns of indivual plots in the device By default this is 1 1 i e one plot per device You can set up a matrix of any size with the par mfrow or
49. 2250 310300 12 182500 381000 22 182750 310300 4 183000 381000 12 182000 310300 15 182250 381000 7 182500 310300 18 182750 381000 2 0 183000 310300 16 Note the use of the rep function to repeat a sequence Also note that an arithmetic expression in this case 1000 can be applied to an entire vector in this case rep seq 182 183 length 5 2 In practice this data frame would probably be prepared outside R and then imported see 86 32 Adding rows to a data frame The rbind row bind function is used to add rows to a data frame and to combine two data frames with the same structure For example to add one more trees to the data frame V x 182000 182250 182500 182750 183000 182000 182250 182500 182750 10 183000 11 183500 WO ONDA UU RWN H y 381000 310300 381000 310300 381000 310300 381000 310300 381000 310300 381000 ds lt rbind ds c 183500 381000 15 n trees 10 12 22 4 12 15 7 18 2 16 15 This can also be accomplished by directly assigning to the next row gt ds 12 gt ds x 182000 182250 182500 182750 183000 182000 182250 182500 182750 10 183000 11 183500 12 183400 WOONA UN RWN H c 183400 381200 18 y 381000 310300 381000 310300 381000 310300 381000 310300 381000 310300 381000 381200 n trees 10 12 22 4 12 15 Adding fields to a data frame A vector with the same number of rows as an existing data frame may be
50. 30 48 0 27709 8 7 1 2 0 Ah 6 7 7 7 181390 333260 791 0 364067 7 8 1 2 0 Ga 6 5 6 8 7 Note that There is one line per observation record Each record consists of the same number of fields which are sepa rated by commas The first line has the same number of fields as the others but consists of the field names Suppose this file is named example csv To read into R we first change to the directory where the file is stored and then read into an R object with the read csv function gt ds read csv example csv gt str ds data frame 6 obs of 10 variables x int 181072 181025 181165 181298 y dnt 333611 333558 333537 333484 cadmium num 11 7 8 6 6 5 2 6 2 8 3 elev num 7 91 6 98 7 80 7 66 7 48 dist num 0 00136 0 01222 0 10303 0 19009 om num 13 6 14 13 8 8 7 7 8 ffreq int 111 222 111 soil int 111 40 read csv is a specialised wrapper to the very general read table function which can deal with tables in many other formats 94 lime int 111000 landuse Factor w 2 levels Ah Ga 1112 12 R could determine that landuse is a factor categorical variable because it was non numeric It could also determine the variable names from the first row The other factors were not recognized and in fact they have different R data types which we now assign using the as functions to change data types gt ds soil lt as fa
51. 89 4 32 4 69 trt2 num 6 31 5 12 5 54 5 5 5 37 5 29 4 92 6 15 5 8 5 26 gt max pg ctr1 1 6 11 The names of the groups in the unstacked frame become the names of the variables in the stacked frame the formula weight group told the unstack function that group was the column with the new column names This process also works in reverse when we have a frame with several variables to make into one we use the stack function gt pg stacked lt stack pg str pg stacked data frame 30 obs of 2 variables values num 4 17 5 58 5 18 6 11 4 5 4 61 5 17 4 53 ind Factor w 3 levels ctrl trt1 111111 gt names pg stacked lt c weight group str pg stacked data frame 30 obs of 2 variables weight num 4 17 5 58 5 18 6 11 4 5 4 61 5 17 4 53 group Factor w 3 levels ctrl trt1 111111 The stacked frame has two columns with default names value for the variables which were combined and ind for their names these can be changed with the names function A more general function for data re shaping is reshape 4 11 Random numbers and simulation R includes functions to evaluate a cumulative distribution function CDF the probability density function PDF and the quantiles and to draw ran dom samples from a large number of distributions including the uniform R name uni f normal R name norm Student s t R name t binomial R 41 name binom Poisson R name pois and many oth
52. Girth 1 0 30725 This is not the expected value it is very low and we know that a tree s volume should be fairly well correlated with its girth What is wrong Listing the workspace gives a clue gt IsO 1 Volume trees The name Volume occurs twice once as a local variable visible with 1sQ and once as a field name visible with str trees Even though the trees frame is attached the Volume field is masked by the Volume local variable which in this case is just a random permutation of the integers from 1 to 31 so the cor method gives an incorrect result One way around this problem is to name the field explicitly within its frame using e g trees Volume gt cor trees Volume Girth 1 0 96712 Another way is to delete the workspace variable with rm this makes the field name in the attached frame visible gt rm Volume gt cor Volume Girth 1 0 96712 Another way is to use names for local variables that do not conflict with field names in the attached data frames 11 3 How do I find the method to do what I want R has a very rich set of methods and there are often several ways to ac complish the same thing especially with contributed packages 1 Look at the help pages for methods you do know they often list related methods For example the help page for the linear models method 1m gives related methods for prediction summary regres sion diagnostics analysis of variance and generali
53. Introduction to the R Project for Statistical Computing for use at ITC D G Rossiter University of Twente Faculty of Geo information Science amp Earth Observation ITC Enschede NL http ww itc nl personal rossiter June 12 2014 Actual Frequency Actual vs modelled straw yields E e NE e e e 3 e ce g ms T T T T T T 4 5 6 y 8 9 Modelled Frequency histogram Meuse lead concentration 53 0 100 200 300 400 500 600 700 lead concentration mg kg 1 Counts shown above bar actual values shown with rug plot Grain yield Ibs per plot v Ld e f p eA wap 315000 320000 325000 330000 335000 340000 L f L T 9 o o i E e o T iy a i m o T1 PINE o v B ES cee IT TT T m DN T x o r M e iM TH Bgle rH M o M aq Ub o i Hd ier eo gp ire o i o A a a i E o Li i IT TTTTTITTITITTTTTITITTUTTTTT 1 3 5 7 9 1 13 15 17 19 21 23 25 Column number GLS 2nd order trend surface VG NeN ATS M ae I Z HA ae es oa Cee Contents O If you are impatient 1 1 Whatis R 1 2 WhyR for ITC 3 2 1 Advantages 4460404468
54. RStudio or R console modify it as you like for example change the sample size n or the number of replications save it as a command file and run it with the source function as explained in 83 5 You prepare this with a plain text editor as a text file for example plot4 R and then source it into R which executes the commands immediately source plot4 R If you want to change the plotting parameters you have to change the script and re source it The next section offers a better solution A note on R style It is good practice to make all parameters into vari ables at the beginning of the script so they can be easily adjusted If you find yourself repeating the same number in many expressions it probably should be converted to a variable Examples here are the sample size and parameters of the normal distribution You could write v rnorm 30 180 20 hist v breaks seq 800 280 by 20 3 points x dnorm x 180 20 30 20 3 72 http cran r project org bin macosx RMacOSX FAQ htm 73 http www loc gov standards iso639 2 php English list php 74 The R file extension is customary for R scripts 128 but it is more elegant to write n 30 mu 180 sd lt 20 bin width sd 3 v rnorm n mu sd hist v breaks seq mu 5 sd mu 5 sd by bin width points x dnorm x mu sd n bin width visualise the variability of small random samples sample size n 30 numbe
55. T EN EN NE MS LN NT setosa versicolor virginica l l i L l 253 ee Lr 2 5 4 e ee ee Pry s 204 ee 20 4 ccoo ee E e o S g4 Saig P 154 ose ooo L 5 iba cee Se Eg gt MM E Ead S an us L iod conem es 0 5 4 s L 0 5 e e wd NM I T T T T T T T T T T T T T TT T T T T T P T T T T T T 1 2 3 4 5 6 7 123 45 6 7 123 45 6 7 Petal Length Petal Length Figure 12 Trellis scatter plots without and with a conditioning factor 5 2 3 Triivariate plots The most useful plots here are the levelplot and contourplot meth ods for 2D plotting of one response variable on two continuous depen dent variables for example elevation vs two co rdinates the wi reframe 79 method for a 3D version of this and the cloud method for a 3D scatter plot of three variables All can be conditioned on a factor Figure 13 shows some examples produced by the following code pli lt cloud Sepal Length Petal Length Petal Width groups Species data iris pch 20 main Anderson Iris data all species screen list z 30 x 60 data volcano pl2 lt wireframe volcano shade TRUE aspect c 61 87 0 4 light source c 10 0 10 zoom 1 1 box F scales list draw F xlab ylab zlab main Wireframe plot Maunga Whau Volcano Auckland pl3 levelplot volcano col regions gray 0 16 16 main Levelplot Maunga Whau Volcano Auckland pl4 lt contourplot volcano at seq floor min vol
56. TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE 2 3 18 20 31 dfb 1 dfb Hght dfb Grth dfb Hg G dffit cov r cook d hat The most common use here is to list the observations for which any of the influence measures were excessive This example illustrates the use of the which function to return the in dices in a vector where a logical condition is true and the any function to perform a logical or on a set of logical vectors in this case the eight diagnostics In this example observation 31 has a high leverage 18 a high DFFITS value and all five high COVRATIO Notice also the use of the apply function to apply the any is any element of the vector TRUE function to each row 4 18 2 Model based prediction The fitted function only gives values for the observations often we want to predict at other values For this the predict generic function is used in the case of linear models this specialises to the predict 1m function The first argument is the fitted model and the second is a data frame in which to look for variables with which to predict these must have the same names as in the model formula Both confidence and prediction intervals may be requested with a user specified confidence level For example to predict tree volumes for all combi
57. Von Mises with the RSiteSearch method RSiteSearch von Mises RSiteSearch von Mises restrict functions The second form restricts the search to just functions by default documentation and the R help mailing list archives are also searched This opens the search page http search r project org in a browser In this case several matches are shown including two packages A review of their contents shows that the circular package for circu lar statistics is the more complete so it should be installed on your system A 1 Once the new package is installed its contents are available to be searched and this time the term von Mises is found Several meth ods in the circular package are relevant 63 http cran 64http cran 6 http cran 66 http cran r project r project r project r project 67 http cran r project org web vi ews index html org other docs html org doc contrib Ricci refcard regression pdf org doc contrib Ricci refcard ts pdf oOrg 119 gt help search von Mises dmixedvonmises circular Mixture of von Mises Distributions mle vonmises circular von Mises Maximum Likelihood Estimates mle vonmises bootstrap ci circular Bootstrap Confidence Intervals pp plot circular von Mises Probability Probability Plot dvonmi ses circular von Mises Density Function You can get help on any of these by loading the package and viewing the help gt libraryCcircul
58. actors can be converted to strings with the as character function however this conversion is performed automatically by many functions so an explicit conversion is rarely needed 43 4 13 Objects and classes S is an object oriented computer language everything in S including vari ables results of expressions results of statistical models and functions is an object each with a class which says what the object is and also con trols the way in which it may be manipulated The class of an object may be inspected with the class function gt class 1m 1 function gt class letters 1 character gt class seq 1 10 1 integer gt class seq 1 10 by 01 1 numeric gt class diag 10 1 matrix gt class iris 1 data frame gt class iris Petal Length 1 numeric gt class iris Species 1 factor gt class iris Petal Length gt 2 1 logical gt classCImCiris Petal Width iris Petal Length 1 Im gt class hist iris Petal Width 1 histogram class table iris Species 1 table The letters built in constant in this example is a convenient way to get the 26 lower case Roman letters for upper case use LETTERS As the last three examples show many S functions create their own classes These then can be used by generic functions such as summary to determine appropriate behaviour gt summary Ciris Petal Length Min 1st Qu Median Mean 3rd Qu M
59. added to it with the cbind column bind function For example we could compute a height to girth ratio for the trees a measure of a tree s shape and add it as a new field to the data frame we illustrate this with the trees example dataset introduced in 84 7 gt attach trees gt HG Ratio lt Height Girth str HG Ratio num 1 31 8 43 7 56 7 16 6 86 7 57 gt trees lt cbind trees HG Ratio str trees data frame 31 Girth Height Volume hum hum hum HG Ratio num gt rm HG Ratio obs of 4 variables 8 3 8 6 8 8 10 5 10 7 10 8 70 65 63 72 81 83 66 75 80 75 10 3 10 3 10 2 16 4 18 8 19 7 8 43 7 56 7 16 6 86 7 57 33 Note that this new field is not visible in an attached frame the frame must be detached with the detach function and re attached gt summary HG Ratio Error Object HG Ratio not found gt detach trees attach trees gt summary HG Ratio Min 1st Qu Median Mean 3rd Qu Max 4 22 4 70 6 00 5 99 6 84 8 43 Sorting a data frame This is most easily accomplished with the order function naming the field s on which to sort and then using the returned indices to extract rows of the data frame in sorted order gt trees order trees Height trees Girth Girth Height Volume 3 8 8 63 10 2 20 13 8 64 24 9 2 8 6 65 10 3 4 10 5 72 16 4 24 16 0 72 38 3 16 12 9 74 22 2 23 14 5 74 36 3 18 13 3 86 27 4 31 20 6 87 77 0 4 8 Factors
60. age is given in the R Language Definition 38 9 and a comprehensive introduction is given in the Introduction to R 36 This section reviews the most outstanding features of S All the functions packages and datasets mentioned in this section as well as the rest of this note are indexed C for quick reference 4 1 Command line calculator and mathematical operators The simplest way to use R is as an interactive calculator For example to compute the number of radians in one Babylonian degree of a circle gt 2 pi 360 1 0 0174533 As this example shows S has a few built in constants among them pi for the mathematical constant rr The Euler constant e is not built in it must be calculated with the exp function as exp 1 If the assignment operator explained in the next section is not present the expression is evaluated and its value is displayed on the console S has the usual arithmetic operators and some less common ones like modulus and integer division Expressions are evaluated in accordance with the usual operator precedence parentheses may be used to change the precedence or make it explicit gt 3 2A2 2 pi 1 7 03319 gt C3 2 A2 2 pi 1 13 3518 Spaces may be used freely and do not alter the meaning of any S expres sion Common mathematical functions are provided as functions see 84 3 in cluding 10g 10g10 and 10g2 functions to compute logarithms exp for e
61. al cokriging point or bm2 genes nig ac jp RGM2 pkg php p gstat PoF The meuse data set a tutorial for the gstat R package File Format PDF Adobe Acrobat View as HTML block co kriging or simulation for each of the varieties kriging local kriging or cokriging but provide others e g package geoR and cran r project org web packages gstat vignettes gstat pdf R Creates gstat Objects objects that hold all the information necessary for univariate or multivariate geostatistical predicti imple ordinary or universal co kriging www stat ucl ac be ISdidactique Rhelp library gstat html gstat htm PoF introduction to the R Project for Statistical Computing File Format PDF Adobe Acrobat co kriging 34 fitting rational functions to time series 31 Technical Note Co kriging with the gstat pack cran r ntri Rintro ITC pdf s All 39 versions Related Tools Introduction to the R Project for Statistical These include general data analysis logistic regression confusion matrices co kriging partioning transects and fitting rational functions www itc nl rossiter teach R Rintro_ov pdf 1 More results Figure 19 Results of an RSeek search Addicted to R graphics gallery also with source code 10 6 Keeping up with developments in R R is a dynamic environment with a large number of dedicated scientists working to make it both a rich statistical c
62. and p is the number of coefficients The coefficient of determination R both the unadjusted fraction of variance explained by the model R 1 Residual SS Total SS and 58 the coefficient adjusted for the number of parameters in the model 1 m 1 p 1 R The AIC Akaike s An Information Criterion function is often used to com pare models the lower AIC is better gt AICCIm Volume Height Girth 1 155 47 gt AICCIm Volume Height Girth 1 176 91 gt AlCCIm Volume Girth 1 181 64 gt AlC Im Volume 1 1 264 53 In this example the successively more complicated models have lower AIC i e provide more information 4 18 1 Model diagnostics V VN N The model summary 84 18 shows the overall fit of the model and a sum mary of the residuals but gives little insight into difficulties with the model e g non linearity heteroscedascity of the residuals A graphical presentation of some common diagnostic plots is given by the plot func tion applied to the model object note that the generic plot function in this case automatically calls the specific plot 1m function model lt 1m Volume Height Girth data trees par mfrow c 2 2 set up a 2x2 matrix for the 4 diagnostic graphs plot model par mfrow c 1 1 reset the graphics device to show just 1 graph This shows four plots Figure 4 1 residuals vs fitted 2 normal QQ plot of the residuals 3 s
63. anic matter proportion are known these will be imported as missing values symbol ized in R by NA Matrices A matrix can also be prepared as a CSV file perhaps exported from Excel and imported into R with the read csv function The matrix can also be prepared with row and column labels For example here is the CSV file for a small confusion matrix as prepared 41 These are not the numbers 1 2 and 3 95 in a text editor the first row has the column names reference classes and each of the other rows begins with the row name mapped class there is no entry for the very first column of the first row AB et DY A 35 14 11 1 B 4 11 3 0 C 12 9 38 4 D 2 5 12 2 Suppose this file is named cm csv it can be read in as follows note that we inform R that the row names are in the first column and convert the default class data frame to a matrix with the as matrix function The attributes function shows the correct data type and attributes here the dimension names for the rows and columns gt cm as matrix read csv cm csv row names 1 gt cm A B CD A35 14 11 1 B 411 30 C12 9384 D 2 5122 gt class cm 1 matrix gt attributes cm dim 1 4 4 dimnames dimnames 1 1 AU Bg d a D dimnames 2 1 A B C p 6 5 Importing images Some data such as satellite imagery is naturally organized as a matrix of data values on a regular grid Many export formats write a header of
64. another major feature Practising statisticians have implemented hundreds of spe l http www itc nl personal rossiter pubs list html pubs m R item 2 http www r project org cialised statistical produres for a wide variety of appli cations as contributed packages which are also freely available and which integrate directly into R A few examples especially relevant to ITC s mission are the gstat geoR and spatial packages for geostatistical analysis contributed by Pebesma 33 Ribeiro Jr amp Diggle 39 and Ripley 40 respectively the spatstat package for spatial point pattern analysis and simula tion the vegan package of ordination methods for ecology the circular package for directional statistics the sp package for a programming interface to spatial data the rgdal package for GDAL standard data access to geographic data sources There are also packages for the most modern statistical techniques such as sophisticated modelling methods including generalized linear mod els principal components factor analysis bootstrapping and robust regression these are listed in 84 19 wavelets wavelet neural networks nnet non linear mixed effects models n1me recursive partitioning rpart Splines splines random forests randomForest 2 Why R for ITC ITC is an abbreviation for University of Twente Faculty of Geo information Science amp Earth O
65. ar gt dvonmises 11 4 Memory problems Information on memory usage is returned by the object size method for individual objects and the gc garbage collection method for the whole workspace object size meuse 1 23344 gt gcO used Mb gc trigger Mb max used Mb Ncells 486087 13 0 818163 21 9 818163 21 9 Vcells 326301 2 5 905753 7 0 905753 7 0 If you work with large objects such as images you may well exceed R s default memory limits gt tmp matrix 1 10000 100000 Error in matrix matrix 1 10000 1e 05 cannot allocate vector of length 1000000000 There are other similar messages you may see This topic is discussed in the help page for memory limits gt help Memory limits MS Windows limits the total memory that can be allocated to R this can be reviewed with the memory size method on Windows only with the optional max TRUE argument it shows the maximum vector than can be allocated the memory limit method on Windows only shows the total memory that can be allocated for all objects gt running on MS Windows gt memory size max T 1 917176320 gt memory limit 1 1038856192 120 This is about 874 Mb for one object and almost 1 Gb total Memory can be increased up to about 2 Gb under Windows and indefinitely under better designed operating systems see the help gt Memory The easiest way to increase memory in Windows is by adding a command line option to the T
66. ara ison operator and then the select array elements operator to extract the array elements that are selected gt tr trees Volume 60 Note that there is no condition for the second subscript so all columns are selected To make a list of the observation numbers where a certain condition is met use the which function gt which trees Volume gt 60 1 31 gt trees which trees Volume gt 60 Girth Height Volume 31 20 6 87 77 37 Logical expressions may be combined with logical operators such as amp log ical AND and logical OR and their truth sense inverted with logical NOT For example to select trees with volumes between 20 and 40 cubic feet gt tr lt trees Volume gt 20 amp Volume lt 40 N BPWN H Note that amp like S arithmetical operators is vectorized i e it operates on each pair of elements of the two logical vectors separately Parentheses should be used if you are unclear about operator precedence Since the logical comparaison operators e g gt have precedence over bi nary logical operators e g amp the previous expression is equivalent to tr lt trees Volume gt 20 amp Volume lt 40 Another way to select elements is to make a subset with the subset func tion tr small subset trees Volume 18 Girth Height Volume 1 1 8 kr O co 3 70 10 3 6 65 10 3 8 63 10 2 z5 72 16 4 0 66 15 6 Selecting rando
67. are this with a plain text editor in the same way as a script for example creating a file named plot normals R and then read it into R with the source function This is a script as it is executed it defines the function Once the function is defined in the workspace which you can verify with the 1s function you run it just like a built in R function gt source plot_normals R gt IsO plot normals gt plot normals 60 gt plot normals 60 mu 100 sd 15 gt plot normals 60 rows 3 cols 3 mu 100 sd 15 function to visualise the variability of small random samples pl pa re bi X y fo j pa required arguments n sample size arguments with reasonable defaults rows cols dimensions of display mu sd mean s d of normal distribution to sample bsd histogram bins to represent each s d sdd number of s d to display ot normals lt function n rows 2 cols 2 mu 0 sd 1 bsd 2 sdd 3 5 set up graphic display r mfrow c rows cols number of random samples ps lt rows cols histogram bin width n width sd bsd scale x axis min lt mu sdd sd x max lt mu sdd sd scale y axis max dnorm 1 0 0 3989 adjust to sample and bin sizes and normalize by s d and leave room for higher bars Sh Hk Gk Gk max lt n 0 5 bin width sd compute and display each graph r i in 1 reps v lt rnorm n mu sd hist v xlim c x min x m
68. arget field of a desktop shortcut or Start menu item For example C Program File R R 2 5 1 bin Rgui exe LANGUAGE de max mem size 2Gb Just for fun we specify a German language GUI with the LANGUAGE flag After re starting R from the shortcut gt memory limitQ 1 21474835648 If you are doing serious image processing MS Windows is unlikely to be a satisfactory platform Fortunately almost all your R code will run un changed in R running under an OS that does not limit memory such as some variety of Unix 11 5 What version of R am I running The R version system variable contains this information if you want to report a possible program error it is always necessary to specify this gt R version platform x86 64 apple darwin9 8 0 arch x86 64 os darwin9 8 0 system x86_64 darwin9 8 0 status major 2 minor 15 0 year 2012 month 03 day 30 svn rev 58871 language R version string R version 2 15 0 2012 03 30 To determine the version of an installed package whether loaded or not use the help method with the optional package argument gt help package sp Information on package sp 121 Description Package sp Version 0 9 98 Date 2012 03 26 Title classes and methods for spatial data Author Edzer Pebesma lt edzer pebesma uni muenster de gt Roger Bivand lt Roger Bivand nhh no gt Depends R gt 2 10 0 methods graphics Suggests lattice RColorBrewer rgdal rgeo
69. argument Plotting symbols can be specified either by a single character e g for an asterisk or an integer code for one of a set of graphics symbols Figure 70 7 shows the symbols and their codes Note that symbols 21 26 also have a fill background colour specified by the bg argument the main colour specified by the col argument specifies the border 10 6 V 11 XX 16 e 210 50 109 15 W 20 e 25 Vy Figure 7 Plotting symbols See 85 5 for details on how to specify colours We now produce a customized plot showing the species gt plot Petal Length Petal Width pch 21 23 as numeric Species cex 1 2 xlab Petal length cm ylab Petal width cm main Anderson Iris data col cC slateblue firebrick darkolivegreen as numeric Species 4 4 It s clear that the species are of different sizes Setosa smallest Versicolor in the middle Virginica largest but that the ratio of petal length to width is about the same in all three Note the use of the as numeric method to coerce the Species which is a factor to the corresponding level number here 1 2 and 3 and then the use of this number to index a list of printing characters pch argument and colours col argument The plot generic method is an example of a high level plotting method which begins a new graph Once the co rdinate system is set up by plot several mid level plotting methods are availa
70. atistics Others are more specialised land cover change with logistic regression 49 assessing map accuracy 43 co kriging 45 fitting rational functions to time series 44 optimal partitioning of soil transects 42 Web Pages to learn R http www rseek org The RSeek website is a unified portal to R discussion lists books FAQ R Journal user contributed documentation etc Figure 19 shows the results of an RSeek search http www maths bath ac uk jjf23 book Practical Regression and Anova in R by Julian Faraway University of Bath UK this has been updated into two commerical textbooks 16 17 http rgm2 lab nig ac jp RGM2 R Graphical Manual a huge collection of graphs produced with R all with source code http addictedtor free fr graphiques 38 http www stat pitt edu stoffer tsa2 index html 39 http www itc nl personal rossiter pubs list htmlZpubs m R 112 RSeek org R project Search Engine aj G A l Al 4 Q nhttp www rseek org 7cx 01092314434370259875 0 Q7 Google QA mydel icio us post to del icio us ITCv ASv Rv RXv Sciencev XinHuay DGRw Searchy Moneyv NLv cokriging Search Results 1 10 for cokriging 0 61 seconds Introductions Support Lists Functions R code Books R Graphical Manual Google The function provides the following prediction methods simple ordinary custom Search and universal kriging simple ordinary and univers
71. atter Plots sunflowerplot Sunflower Scatter Plots Table 2 Base graphics plot types main Star plot of individuals frame T Some packages have implemented their own variations of these and other plots for example scatterplot and scatterplot matrix in the car package and truehist in the MASS package 5 1 4 Interacting with base graphics plots If the output graphics device is a screen e g as initialised with the windows method it is possible to query the graph with the identify method for scatterplots This reads the position of the graphics pointer when the left mouse button is pressed and searches the co rdinates given as its for the closest point in the plot If this point is close enough to the pointer its index is added to a list to be returned once the right mouse button is pressed The coordinate pairs for identify is normally the same as the scatterplot gt plot Petal Length Petal Width gt p lt identify Petal Length Petal Width 1 44 65 99 gt iris p Sepal Length Sepal Width Petal Length Petal Width Species 44 5 0 3 5 1 6 0 6 setosa 65 5 6 2 9 3 6 1 3 versicolor 99 5 1 2 5 3 0 1 1 versicolor 76 Given Species Grouped box plot virginica 1 io de ES o4 ae 0 l Lp n RR t Re i E NUM PUPA e 5 a 5 8 T T re 3 8 a 8 e Sg rz Petal Width 25 fi setosa L
72. ax 1 00 1 60 4 35 3 76 5 10 6 90 gt summary Ciris Species setosa versicolor virginica 50 50 50 gt summary ImCiris Petal Width iris Petal Length Call ImCformula iris Petal Width iris Petal Length Residuals Min 1Q Median 3Q Max 0 565 0 124 0 019 0 133 0 643 44 Coefficients Estimate Std Error t value Pr gt t Intercept 0 36308 0 03976 9 13 4 7e 16 iris Petal Length 0 41576 0 00958 43 39 lt 2e 16 Residual standard error 0 206 on 148 degrees of freedom Multiple R Squared 0 927 Adjusted R squared 0 927 F statistic 1 88e 03 on 1 and 148 DF p value lt 2e 16 gt summary table iris Species Number of cases in table 150 Number of factors 1 S has functions for testing if an object is in a specific class or mode and for converting modes or classes as long as such a conversion makes sense These have the form is test or as convert followed by the class name For example gt is factor iris Petal Width 1 FALSE gt is factor iris Species 1 TRUE as factor iris Petal Width 1 0 2 0 2 0 2 0 2 0 2 0 4 0 3 0 2 0 2 0 1 0 2 0 2 0 1 0 1 145 2 5 2 3 1 92 2 3 1 8 22 Levels 0 10 2 0 3 0 40 5 0 6 1 1 1 1 2 1 3 1 4 2 5 as numeric iris Species 1 1211111111111111111111111111 149 3 3 The is factor and is numeric class test functions return a logical value TRUE or FALSE depending on the class their argument The as factor class conversion function determined the u
73. ax ylim c 0 y max breaks seq mu 5 sd mu 5 sd by bin width main paste mu mu sigma sd xlab paste Sample i col Jightblue border gray freq TRUE X lt seq x min x max length 120 true normal distribution points x dnorm x mu sd n bin width type 1 col blue lty 1 1wd 1 8 normal dist estimated from sample points x dnorm x mean v sd v n bin width type 1 col red lty 2 l1wd 1 8 print sample params and Pr Type I error text x min 0 9 y max paste mean round mean v 2 pos 4 text x min 0 8 y max paste sdev round sd v 2 pos 4 text x min 0 7 y max paste Pr t round t test v musmu p value 2 pos 4 clean up r mfrow c 1 1 132 References 1 Albert J 2007 Bayesian computation with R Use R New York Springer 111 2 Bivand R S Pebesma E J amp G mez Rubio V 2008 Applied Spatial Data Analysis with R UseR Springer URL http www asdar book org 111 3 Chambers J M 1998 Programming with Data New York Springer ISBN 0 387 98503 4 URL http cm bell Tlabs com cm ms departments sia Sbook 48 4 Christensen R 1996 Plane answers to complex questions the theory of linear models New York Springer 2nd edition 55 56 58 5 Cleveland W S 1993 Visualizing data Murray Hill N J AT amp T Bell Laboratories Hobart Press 77 6 Congalton R G Oderwald R G amp Mead R A 1983 Assessing
74. ble to add elements to the graph such as lines points and text Table 1 lists the principal methods see the help for each one for more details For example to add horizontal and vertical lines at the mean and median centroids use the abline method gt abline v mean Petal Length lty 2 col red 71 gt gt gt V V abline Add a Straight Line arrows Add Arrows axis Add an Axis box Draw a framing Box grid Add a Grid legend Add Legends lines Add Connected Line Segments mtext Write Text into the Margins points Add Points polygon Draw Polygons rect Draw Rectangles rug Add a Rug segments Add Line Segments symbols Draw Symbols text Add Text title Plot Annotation Table 1 Methods for adding to an existing base graphics plot ablineCh mean Petal Width lty 2 col red abline v median Petal Length lty 2 col blue abline hemedian Petal Width lty 2 col blue The lty argument specifies the line type style These can be specified as a code O blank 1 solid 2 dashed 3 dotted 4 dotdash 5 longdash 6 twodash or as a descriptive name blank solid dashed dotted dotdash longdash or twodash To add light gray dotted grid lines at the axis ticks use the grid method gridO To add the mean and median centroids as large filled diamonds use the points method points mean Petal Length mean Petal Width cex 2 pch 23 col black bg red points media
75. bservation It is a faculty of the University of Twente lo cated in Enschede the Netherlands with a thematic focus on geo information science and earth observation in support of development Thus the two pillars on which ITC stands are development related and geo information R supports both of these 2 1 Advantages R has several major advantages for a typical ITC student or collaborator item It is completely free and will always be so since it is issued un der the GNU Public License It is freely available over the internet via a large network of mirror servers see Appendix A for how to ob tain R It runs on many operating systems Unix and derivatives in cluding Darwin Mac OS X Linux FreeBSD and Solaris most flavours of Microsoft Windows Apple Macintosh OS and even some main frame OS It is the product of international collaboration between top computational statisticians and computer language designers It allows statistical analysis and visualisation of unlimited sophis tication you are not restricted to a small set of procedures or op tions and because of the contributed packages you are not limited to one method of accomplishing a given computation or graphical presentation It can work on objects of unlimited size and complex ity with a consistent logical expression language It is supported by comprehensive technical documentation and user contributed tuto rials 810 There are also several good textbooks on sta
76. cale vs location of residuals 4 residuals vs leverage The intepretation of these is explained in many regression textbooks The par function sets up a matrix of graphs in one figure here there are four diagnostic graphs produced by the default plot 1m function so a 2x2 grid is set up See 85 4 for more information Many other diagnostics are available There are several functions to di rectly access these from the model itself dfbetas dffits covratio cooks distance and hatvalues methdods rstandard and rstudent return the standardized and Studentized residuals respectively These diagnostic measures are explained in regression texts e g 14 18 many of these were developed or expanded by Cook amp Weisberg 8 For example to plot the hat values leverage for each point with the number of the most influential observations gt h lt hatvalues model gt plot h type h 39 Residuals vs Fitted Normal Q Q o4 a 4 280 280 t o 5 o o o o o o o an 0 o an o9 o o S o S o 3 o ee ae o o o 9 56 90000 o N Bow 2 g o o o o S T o RE T o 7 o b oe 9 4 91 o 18 018 eo i T T T T T T T T T T T T 10 20 30 40 50 60 70 80 2 1 0 1 2 Fitted values Theoretical Quantiles Scale Location Residuals vs Leverage 018 xz e a 4 x 028 7 T 1 280 o Os gt gt Jos 016 gt eel o g E 9 o o S o T o o o 2 3 S o 3 9 3 o o
77. cano 10 10 cei ling max volcano 10 10 by 10 main Contourplot Maunga Whau Volcano Auckland sub contour interval 10 m region T col regions terrain colors 100 print pl1 split c 1 1 2 2 more T print pl2 split c 2 1 2 2 more T print pl3 split c 1 2 2 2 more T print pl4 split c 2 2 2 2 more F rm pli p12 p13 p14 VYVYVVV V V vyvv 4v Note that the volcano data set is just a matrix of elevations gt str volcano num 1 87 1 61 100 101 102 103 104 105 105 106 107 108 The levelplot method converts this into one response variable the z values and two predictors i e the row and column of the matrix the x and y values This example shows that high level lattice methods do not themselves draw a graph they return an object of class trellis which can be printed with the print method R s default behaviour when working interactively at the console is to print the results of any expression except an assign ment so the casual user doesn t see this behaviour It is however quite useful to place multiple graphs on the same page as illustrated here and explained in more detail in 85 4 5 2 4 Panel functions A Trellis plot must be constructed in one go unlike in the base graphics package where elements can be added later Each additional element be yond the default is specified by a so called panel function For example suppose we want to add a least squares regression line as well a
78. chaser must agree to the li cense terms These often have common analyses built in as menu choices these can be convenient but it is tempting to use them without fully un derstanding what choices they are making for you SAS is a commercial competitor to S PLUS and is used widely in industry It is fully programmable with a language descended from PL I used on IBM mainframe computers 2 3 3 Special purpose statistical programs Some programs adress specific statistical issues e g geostatistical analysis and interpolation SURFER gslib GEO EAS ecological analysis FRAG STATS and ordination CONOCO The algorithms in these programs have or can be programmed as an R package examples are the gstat program for geostatistical analysis 35 which is now available within R 33 and the vegan package for ecological statistics 2 3 4 Spreadsheets Microsoft Excel is useful for data manipulation It can also calculate some statistics means variances directly in the spreadsheet This is also an add on module menu item Tools Data Analysis for some common statistical procedures including random number generation Be aware that 7 There are differences in the language definitions of S R and S PLUS that are important to programmers but rarely to end users There are also differences in how some algorithms are implemented so the numerical results of an identical method may be somewhat different 8 See the list at http
79. class SpatialPointsDataFrame but to ob jects of the generalised class from which class SpatialPointsDataFrame is specialised namely class SpatialPoints and class Spatial 47 To determine the methods that can apply to a class review the help using the class lt class name gt syntax gt class SpatialPointsDataFrame gt class SpatialPoints gt class Spatial As this example shows classes may be organised in an inheritance struc ture so that the behaviour of a more generalised class is automatically inherited by a more specialised class these are said to extend the base class For example class Spatial PointsDataFrame extends class Spatial Points which in turn extends the base class Spatial These have appropriate slots and methods Spatial Only has a bounding box slot bbox and projection slot proj4string no spatial objects as such SpatialPoints Also has coordinates slot coords of each point at this level of hier archy there are also class SpatialLines and class Spatial Polygons SpatialPointsDataFrame Also has attributes slot data for each point SpatialGridDataFrame Points are on a regular grid inherited from class GridTopology as well as class SpatialPointsDataFrame The real power of this approach is seen when generic methods are applied to objects of the various classes each class then specialises the generic method appropriately For example the spplot plotting method gives a different kind of plot fo
80. complicated model is significantly better than the simpler one gt model 1 aov Volume Height Girth data trees gt model 2 aov Volume Height Girth data trees anova model 1 model 2 Analysis of Variance Table Model 1 Volume Height Girth Model 2 Volume Height Girth Res Df RSS Df Sum of Sq F PrF 1 28 422 2 27 198 1 224 30 5 7 5e 06 In this case the interaction term of the more complicated model is highly significant 4 18 Model output The result of a 1m linear models function is a data structure with detailed information about the model how it was fitted and its results It can be viewed directly with the str function but it is better to access the model with a set of extractor functions coefficients to extract a list with the model coefficients fitted to extract a vector of the fitted values what the model predicts for each observation residuals to extract a vector of the residuals at each observation and formula to extract the model formula gt model lt Im Volume Height Girth data trees gt coefficients model Intercept Height Girth Height Girth 69 39632 1 29708 5 85585 0 13465 fitted model 1 2 3 4 5 au 57 8 2311 9 9974 10 8010 16 3186 18 3800 gt residuals model 1 2 068855 2 3 4 Digit aes 0 302589 0 600998 0 081368 0 420047 gt formula model Volume Height Girth The results are best reviewed with the summary generic fun
81. cript editor x lt rnorm 100 180 20 y rnorm 100 180 20 What is the expected value of this correlation cofficient 14 plot x y cor test x y 3 Save the file with the name test R in a convenient directory 4 In RStudio click the run script button in RGui select menu com mand File Source R code 5 In the file selection dialog locate the file test R that you just saved changing directories if necessary and select it R will run the script 6 Examine the output You can source the file directly from the command line Instead of steps 5 and 6 above just type source test R at the R command prompt assuming you ve switched to the directory where you saved the script Appendix B contains an example of a more sophisticated script For serious work with R you should use a more powerful text editor The R for Windows R for Mac OS X and JGR interfaces include built in edi tors another choice on Windows is WinEdt And both RStudio 83 1 and Tinn_R 83 4 have code editors For the ultimate in flexibility and power try emacs with the ESS Emacs Speaks Statistics module learn ing emacs is not trivial but rather an investment in a lifetime of efficient computing 3 6 The Rcmdr GUI The Rcmdr add on package written by John Fox of McMaster University provides a GUI for common data management and statistical analysis tasks It is loaded like any other package with the requi re f
82. ction which for linear models is specialized into summary 1m gt summary model Call 1mCformula Residuals Min Volume Height Girth data trees 1Q Median 3Q Max 6 582 1 067 0 303 1 564 4 665 Coefficients Estimate Std Error t value Pr gt t Intercept 69 3963 23 8358 2 91 0 00713 Height 1 2971 0 3098 4 19 0 00027 Girth 5 8558 1 9213 3 05 0 00511 Height Girth 0 1347 0 0244 5 52 7 5e 06 Residual standard error 2 71 on 27 degrees of freedom Multiple R Squared 0 976 Adjusted R squared 0 973 F statistic 359 on 3 and 27 DF p value lt 2e 16 This provides the most important information about the model for more insight consult textbooks which explain linear modelling e g Venables amp Ripley 57 Fox 18 Christensen 4 or Draper amp Smith 14 The model formula with which 1m was called A summary of the residuals by definition the mean is zero so is not reported The model coefficients first column The standard error associated with each coefficient second column these can be used to construct confidence intervals The significance level of each coefficient third column this is the probability that rejecting the null hypothesis for the listed coefficient would be a mistake The residual standard error which is defined as the square root of the estimated variance of the random error o 1 n p X vi where 7 is one of the n residuals
83. ctly Class Spatial by class SpatialPoints distance 2 Known Subclasses Class SpatialPixelsDataFrame directly with explicit coerce This shows that this class inherits from class SpatialPoints and in turn can be extended as class SpatialPixelsDataFrame S4 classes have named slots marked by the sign in the output of str and also shown in the output of getClass these can also be listed with the slotNames function gt slotNames meuse 1 data coords nrs coords bbox proj4string The contents of these slots are explained in the help for the S4 class gt SpatialPointsDataFrame class Each slot is either a primitive S object e g slot bbox which is a matrix or another S4 class e g slot data which is an object of class AttributeList also defined by the sp package gt class meuseGbbox 1 matrix gt class meuse data 1 AttributeList attr package 1 sp Slots may be accessed directly with the operator just as fields in data frames are accessed with the operator For example to extract the bounding box limiting co rdinates of this spatial data set gt meuse bbox min max x 178605 181390 y 329714 333611 gt meuse bbox x min 1 178605 However it is better practice to use access functions gt bbox meuse min max x 178605 181390 y 329714 333611 Each S4 class has a set of methods that apply to it bbox is an example that applies not only to objects of
84. ctor ds soil gt ds ffreq as ordered ds ffreq gt ds lime lt as logical ds lime gt str ds data frame 6 obs of 10 variables x int 181072 181025 181165 181298 y int 333611 333558 333537 333484 cadmium num 11 7 8 6 6 5 2 6 2 8 3 elev num 7 91 6 98 7 80 7 66 7 48 dist num 0 00136 0 01222 0 10303 0 19009 om num 13 6 14 13 8 8 7 7 8 ffreq Ord factor w 3 levels 1 2 3 1 soil Factor w 3 levels 1 2 3 111 lime logi TRUE TRUE TRUE FALSE FALSE FALSE landuse Factor w 2 levels Ah Ga 1112 12 Using the correct data type is important for many modelling methods here we inform R that lime was either applied logical TRUE associated by R with the number 1 or not logical FALSE associated by R with the number 0 that there are three soil types arbitrarily named 1 2 and 3 not in any order and that there are three flood frequency classes named 1 2 and 3 and these are in this order from most to least flooded n b we know this from the data documentation not from anything we can see in the CSV file Missing values Missing values are expressed in CSV files by the two let ters NA without quotes For example Xx y cadmium elev dist om ffreq soil lime landuse 181072 333611 NA 7 909 0 00135803 NA 1 1 1 Ah 181025 333558 8 6 6 983 0 0122243 14 1 1 1 Ah In the first record neither the cadmium concentration nor org
85. d gt cm cm 1 2 3 4 1 1225 196 121 1 2 16 121 9 0 3 144 81 1444 16 4 4 25 144 4 gt cm cm 1 2 3 4 1 1415 748 857 81 2 220 204 191 16 3 920 629 1651 172 4 238 201 517 54 gt Cm 26C 1 2 3 4 1 1 100 2 1 35 3 160 4 56 As the last example shows also multiplies matrices and vectors Matrix A matrix can be inverted with the solve function usually with little ac inversion curacy loss in the following example the round function is used to show that we recover an identity matrix gt solve cm 1 2 3 4 1 0 034811530 0 03680710 0 004545455 0 008314856 2 0 007095344 0 09667406 0 018181818 0 039911308 3 0 020399113 0 02793792 0 072727273 0 135254989 4 0 105321508 0 37250554 0 386363636 1 220066519 gt solve cm cm 1 2 3 4 1 1 000000e 00 4 683753e 17 7 632783e 17 1 387779e 17 2 1 110223e 16 1 000000e 00 2 220446e 16 1 387779e 17 3 1 665335e 16 1 110223e 16 1 000000e 00 5 551115e 17 4 8 881784e 16 1 332268e 15 1 776357e 15 1 000000e 00 gt round solve cm 6s 6cm 10 1 2 D 31 4 1 1 0 0 0 2 0 1 0 0 3 0 0 1 0 4 0 0 0 1 Solving The same solve function applied to a matrix A and column vector b solves linear the linear equation b Ax for x equations gt b lt c 1 2 3 4 gt x solve cm b 1 0 08569845 0 29135255 0 28736142 3 08148559 gt c
86. dL ERR x e a 4 21 Control structures and looping Ls 4 22 User defined functions c s srastao cess Deea eens 4 23 Computing on the language cco 4 2 See ch ae ee 5 R graphics 5 1 Base R prapHiCS s s 6 6 ae a ee nla A we ew De owe m 5 1 1 Mathematical notation in base graphics 5 1 2 Returning results from graphics methods 5 1 3 Types of base graphics plots lt o s soe sec areenaa 5 1 4 Interacting with base graphics plots o2 Trellssraphies 22 49m Rm Row eee x 5 2 1 Univariate Plots cue dede XPRESS 3 E eR x EL 5 2 7 Bivariat plots lt cusa versum ERU EUR oe 5 2 3 Tritvariat Plots coe 2404 b04 588 64450 ha os 5 2 4 Panel functions crevne terasta naa IS SS 5 2 5 Types of Trellis graphics plots 6 see 5 2 6 Adjusting Trellis graphics parameters 5 3 Multiple graphics windows 5 3 1 Switching between windows 5 4 Multiple graphs in the same window DAL Basepraphies 23239 mde b e e A 5 4 2 Trellis graphics 00002 eee eee 5 9 COLOURS ns ima ee o ROS RR eG Be wo 6 Preparing your own data for R 6 1 Preparing data directly im R 22 oor 6 2 AGU dataeditor 2 6 64 66 6 bias x GG EG we 6 3 Importing data from Microsoft Excel 6 4 Importing datafromaCSVfile l l 6 5 Importing ages uso be om roh om komo mod m Po om wad 7 Exporting from R List of Figures
87. defined by some factor or to exclude outliers defined by some criterion Selecting known elements If we know the observation numbers we sim ply name them as the first subscript using the select array elements operator gt trees 1 3 Girth Height Volume 1 8 3 70 10 3 2 8 6 65 10 3 3 8 8 63 10 2 gt trees c 1 3 5 Girth Height Volume 36 1 8 3 70 10 3 3 8 8 63 10 2 5 10 7 81 18 8 gt trees seq 1 31 by 10 Girth Height Volume 1 8 3 70 10 3 11 11 3 79 24 2 21 14 0 78 34 5 31 20 6 87 77 0 A negative subscript in this syntax excludes the named rows and includes all the others gt trees 1 27 Girth Height Volume 28 17 9 80 58 3 29 18 0 80 51 5 30 18 0 80 51 0 31 20 6 87 77 0 Selecting with a logical expression The simplest way to select subsets is with a logical expression on the row subscript which gives the criterion For example in the trees example dataset introduced in 4 7 there is only one tree with volume greater than 58 cubic feet and it is substantially larger we can see these in order with the sort function gt attach trees gt sort Volume 1 10 2 10 3 10 3 15 6 16 4 18 2 18 8 19 1 19 7 11 19 9 21 0 21 3 21 4 22 2 22 6 24 2 24 9 25 7 21 27 4 31 7 33 8 34 5 36 3 38 3 42 6 51 0 51 5 31 55 4 55 7 58 3 77 0 To analyze the data without this unusual tree we use a logical expression to select rows observations here using the less than logical comp
88. e turn a vector of colours from defined endpoints according to the number of levels requested gt terrain colors 5 1 00A600 E6E600 EAB64E EEB99F F2F2F2 gt terrain colors 10 1 400A600 2DB600 63C600 A0D600 E6E600 E8C32E 7 4EBB25E EDB48E FOC9CO F2F2F2 gt terrain colors 10 1 1 00A600 The hexidecimal codes here represent Red Green and Blue from 00 no colour to FF full colour thus there are 2562 16 777 216 possible colours Grey scales use the slightly different gray method its argument is a vector 37 Note the American spelling of colour 88 of values between 0 1 giving the gray level gt gray seq 0 1 by 125 1 4000000 4202020 404040 606060 4808080 49F9F9F 7 BFBFBF DFDFDF FFFFFF gt gray 0 8 8 1 4000000 202020 404040 606060 4808080 49F9F9F 7 BFBFBF 4DFDFDF 4FFFFFF gt gray c O 2 3 1 1 4000000 4333333 4D4D4D FFFFFF gt col2rgb gray 0 4 1 red 102 green 102 blue 102 Custom colour ramps can be produced with the hsv and rainbow meth ods see their help for details All of these ramps can be indexed to get individual colours They are most useful however when these colours are linked to data values For example to plot soil sample points in the meuse soil pollution data set with the points coloured in order of their rank from least to most
89. e chunk and the ending it e g data trees Everything else is text e g Here is a simple example This is processed in R with the Sweave function which takes as its argu ment the name of the NoWeb file gt Sweave example Rnw The result is a PIX input file with the standard file extension tex and the same first part of the name where the NoWeb chunks have been evaluated and formatted as both input and output In this example the file is named example tex 9 2 The MITEX document This is now just a standard BIFX input file R code has been formatted in a special Schunk BIEX enviroment provided by the BIEX Sweave package as specificed in the usepackage Sweave declaration in the preamble In this example we have the following file documentclass article usepackage pdftex final graphicx usepackage Sweave begin document title Sweaving for reproducible data analysis author A Nonymous date 14 November 2010 maketitle Here is a simple example of R and Sweave begin Schunk begin Sinput R gt data trees R gt str trees end Sinput begin Soutput data frame 31 obs of 3 variables Girth num 8 3 8 6 8 8 10 5 10 7 10 8 11 11 11 1 11 2 Height num 70 65 63 72 81 83 66 75 80 75 Volume num 10 3 10 3 10 2 16 4 18 8 19 7 15 6 18 2 22 6 19 9 end Soutput begin Sinput R gt summary Im Volume Height Girth data trees adj r squared end Sinput begin Soutput
90. e help search method gt help search 1qs In this case the list of matches to the search term 1qs shows two both associated with package MASS Once the required package is loaded the method is available gt library MASS gt lgs lead om data meuse Coefficients Intercept om 19 9 15 4 Scale estimates 46 9 45 9 4 If the command which produced the error is compound break it down into small pieces beginning with the innermost command and then working outwards 5 Review the documentation for the command it may explain situ ations in which an error will be produced For example if we try to compute the non parametric correlation between lead and organic matter in the Meuse data set we get an error gt library gstat data meuse gt cor meuse lead meuse om method spearman Error in cor lead om missing observations in cov cor gt cor The help for cor says If use is all obs then the presence of missing observations will produce an error in the usage section it 116 shows that use all obs is the default so we must have some missing values We can check for these with the is na method gt sum is na lead sum is na om 1 0 1 2 There are two missing values of organic matter but none of lead Then we must decide how to deal with them one way is to compute the correlation without the missing cases gt cor lead om method Spearman use complete 1 0 59759
91. e trees ex ample dataset We repeat that here suppose this is a subsample perhaps for calibration of some model We select 2 3 of the trees for this set at random rounded down to the nearest integer with the floor function gt dim trees 1 1 31 3 gt floor dim trees 1 2 3 1 20 gt tr calib lt trees sort sample 1 dim trees 1 floor dim trees 1 2 3 Girth Height Volume 2 8 6 65 10 3 4 10 5 72 16 4 30 18 0 80 51 0 If we now want to select the validation set i e the remaining trees we use the setdiff function on the two sets of row names extracted with the rownames function gt rownames trees 1 1 2 3 4 og g 7 g 9g 1o 11 12 13 14 15 16 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 31 gt rownames tr calib 1 2 4 5 g OM 11 12 13 14 16 18 20 21 22 23 16 26 27 28 29 30 gt setdiff rownames trees rownames tr calib 1 1 3 g 7 9g 15 17 19 24 25 31 gt tr valid trees setdiff rownames trees rownames tr calib Girth Height Volume 1 8 3 70 10 3 3 8 8 63 10 2 6 10 8 83 19 7 31 20 6 87 77 0 51 gt dim tr calib 1 20 3 gt dim tr valid 1 11 3 The dataframe has been split into mutually exclusive frames for calibra tion 20 observations and validation 11 observations 4 17 Statistical models in S Statistical models in S are speci
92. ecies Fach list is a vector with named elements the dimnames attribute 4 10 Rearranging data As explained above 84 7 the data frame is the object class on which most analysis is performed Sometimes the same data must be arranged different ways into data frames depending on what we consider the ob servations and columns A typical re arrangement is stacking and its inverse unstacking In stack ing several variables are combined into one coded by the original variable name unstacking is the reverse For example consider the data from a small plant growth experiment 40 gt data PlantGrowth str PlantGrowth data frame 30 obs of 2 variables weight num 4 17 5 58 5 18 6 11 4 5 4 61 5 17 4 53 group Factor w 3 levels ctrl trt1 1111111 There were two treatments and one control and the weights are given in one column If we want to find the maximum growth in the control group we could select just the controls and then find the maximum gt max PlantGrowth weight PlantGrowth group ctrl1 1 6 11 But we could also unstack this two column frame into a frame with three variables one for each treatment and then find the maximum of one new column for this we use the unstack function gt pg unstack PlantGrowth weight group str pg data frame 10 obs of 3 variables ctrl num 4 17 5 58 5 18 6 11 4 5 4 61 5 17 4 53 5 33 5 14 trtl num 4 81 4 17 4 41 3 59 5 87 3 83 6 03 4
93. ed some S4 classes to make the spatial nature of the data explicit sp also provides a coordinates function to set the spatial co rdinates and thereby create explict spatial data gt coordinates meuse lt x y gt alternate command format coordinates meuse lt c x y gt class meuse 1 SpatialPointsDataFrame attr package 1 sp gt str meuse Formal class SpatialPointsDataFrame package sp with 5 slots Q data Formal class AttributeList package sp with 1 slots Q att List of 12 cadmium num 1 155 11 7 8 6 6 5 2 6 2 83 3 2 2 8 2 4 1 6 e ee dist m num 1 155 50 30 150 270 380 470 240 120 240 420 coords nrs int 1 2 12 Q coords num 1 155 1 2 181072 181025 181165 181298 181307 attr dimnames List of 2 NULL e ee chr 1 2 x y bbox num 1 2 1 2 178605 329714 181390 333611 attr dimnames List of 2 chr 1 2 x y e chr 1 2 min max Q proj4string Formal class CRS package sp with 1 slots projargs chr NA The object meuse has been promoted to an S4 class SpatialPointsDataFrame which is shown as being a formal class defined by sp The class hierarchy may be examined with the getClass function gt getClass SpatialPointsDataFrame Slots Name data coords nrs coords bbox proj4string Class data frame numeric matrix matrix CRS Extends 46 Class SpatialPoints dire
94. ee Chapter 9 of 36 for more details An example of both the for and if else control structures is the following code to find the closest sample point to a given point This also illustrates the data frame function to create a new data frame and the runif function to draw random numbers by default in the range 0 1 35 also used in C and its derivatives such as C and Java of Algol it has been aptly said that it was a great improvement over its successors 64 Note also the use of the ifelse function to select one of two alternatives in this case the colour to plot based on some truth condition We also plot the sample points and target point showing the nearest point and shift This uses the plot points text and arrows base graphics functions The result is shown in Figure 5 simulate ten sample points n pts lt 10 sample pts data frame x runif n pts y runif n pts simulate the given point new pt lt data frame x runif 1 y runif 1 print paste Target point x round new pt x 4 1 Target point x 0 0756 y 0 8697 gt the distance to the nearest point can t be further on a 1x1 grid gt min dist lt sqrt 2 close pt lt NULL for pt in 1 n pts d lt sqrt sample pts pt x new pt x A2 sample pts pt y new pt y 2 if d min dist min dist d closest pt pt print paste Point else print paste Point pt is not closer 1 Point 1 i
95. el passes through all the extra arguments specified previously e g data iris Within the un named panel function several pre defined panel functions are called to add graphic elements These all have names beginning with panel First we use panel fill to change the main plot background then panel xyplot to draw the points note that this must be called ex plicitly if there is a panel function Then we add the regression lines with panel abline Note the use of the for loop to add one line for each species 82 See panel functions and the explanation of the panel parameter in xyp lot for details 5 2 5 Types of Trellis graphics plots Table 3 lists the principal plot types see the help for each one for more details Univariate assocp lot Association Plots barchart bar plots bwp lot box and whisker plots densityplot kernel density plots dotplot dot plots histogram histograms qqmath quantile plots against mathematical dis tributions stripplot 1 dimensional scatterplot Bivariate qq q q plot for comparing two distribu tions xyplot scatter plots Trivariate levelplot level plots contourplot contour plots cloud 3 D scatter plots wireframe 3 D surfaces Hypervariate splom scatterplot matrix parallel parallel co rdinate plots Miscellaneous rfs residual and fitted value plot tmd Tukey Mean Difference plot Table 3 Trellis graphics plot types 5 2 6 Adjusting Trell
96. ellis graphics argument 79 axis 72 73 barchart package lattice 83 barplot 76 bbox package sp 47 bg graphics argument 71 87 biplot 62 boot package 62 box 72 boxp lot 52 76 bpy colors package sp 88 bwplot package lattice 83 by 29 39 c 22 91 92 car package 36 76 cbind 33 ceiling 19 cex graphics argument 70 chisq test 50 chol 30 circular package 2 114 class 44 45 cloud package lattice 80 83 cm colors 88 coefficients 57 col 27 col graphics argument 70 71 87 col2rgb 87 colnames 92 colours 87 comp lete cases 63 contour 76 contourplot package lattice 79 83 contr helmert 56 contr poly 56 138 contr sum 56 contr treatment 56 contrasts 56 cooks distance 59 coordinates package sp 46 99 coplot 76 cor 63 111 113 cov 63 covratio 59 cut 45 data 17 18 data frame 32 64 datasets package 17 densityplot package lattice 78 83 det 30 detach 34 dev off 10 dev set 85 dfbetas 59 dffits 59 diag 27 29 dim 26 38 dnorm 42 dotchart 76 dotplot package lattice 83 duplicated 51 dvonmises package circular 114 e1071 package 50 eigen 30 eval 68 74 exp 19 expression 68 74 factor 34 36 file show 96 filled contour 76 fitted 57 61 fix 92 93 floor 19 51 for 22 64 82 formula 57 fourfoldplot 76 function 29 65 gc 115 geoR package 2 106 getClass 46 47 ge
97. ers see Chapter 8 of 36 for a complete list The names of the functions are built from two parts 1 A prefix indicating the type of function p for the CDF d for the density q for the quantile and r for random samples 2 The R name of the distribution as listed above e g norm for the normal distribution So the functions for the normal distribution are p pnorm for the CDF d dnorm for the density q qnorm for the quantiles inverse probability r rnormto draw random numbers For example to find the proportion of people in a normally distributed population with mean height 170 cm and standard deviation 15 cm shorter than 200 cm use pnorm gt pnorm 200 170 15 1 0 9772499 To plot the bell shaped normal curve use dnorm gt q lt seq 3 3 by 05 gt plot q dnorm q type 1 xlab z ylab Prob z To find which normal score corresponds to a list of critical Type I error probabilities use qnorm gt alpha c 0 1 0 05 0 01 0 001 gt qnorm 1 alpha 2 1 1 644854 1 959964 2 575829 3 290527 Finally to simulate sampling ten individuals from a normally distributed population with mean height 170 cm and standard deviation 15 cm with a measurement precision of 1 cm use rnorm draw the sample and then round the results to the nearest integer gt sort round rnorm 10 170 15 1 147 159 166 169 169 174 176 180 183 185 Each time this command is issued it will give different results
98. ersonal rossiter teach R R rat pdf 112 Rossiter D G 2007 Technical Note Co kriging with the gstat pack age of the R environment for statistical computing Enschede NL International Institute for Geo information Science amp Farth Observa tion ITC 2 1 edition URL http www itc nl personal rossiter teach R R_ck pdf 112 Rossiter D G 2010 Technical Note An example of data analysis using the R environment for statistical computing Enschede NL In ternational Institute for Geo information Science amp Earth Observation ITC 1 2 edition URL http ww itc nl personal rossiter teach R R_ corregr pdf 112 Rossiter D G 2012 Technical Note Literate Data Analysis Interna tional Institute for Geo information Science amp Earth Observation ITC 1 3 edition 29 pp URL http www itc nl personal rossiter teach R LDA pdf 106 108 Rossiter D G 2012 Tutorial Using the R Environment for Statistical Computing An example with the Mercer amp Hall wheat yield dataset Enschede NL International Institute for Geo information Science amp Earth Observation ITC 2 51 edition URL http www itc nl personal rossiter teach R R mhw pdf 1 112 Rossiter D G amp Loza A 2012 Technical Note Analyzing land cover change with R Enschede NL International Institute for Geo information Science amp Farth Observation ITC 2 32 edition 67 pp URL http www itc nl personal rossiter teach R R LCC pdf 11
99. esedempcUnee o edeneoxbssste s A miam mee I Auc RR aee RM k ee a Sd Se mm oig e 00 8 009 6 opc oo eo 7 os 1 2 3 4 5 6 7 Petal length cm Centroids mean green and median gray Figure 8 Custom scatterplot 5 1 1 Mathematical notation in base graphics To include mathematical notation in one of the graphics functions that draw text for example text axis or title there is a special syntax ex 73 V V MN V MN V plained in the help for the plotmath function The trick is to turn the argu ment into an R expression by using the expression function in this case the text functions interpret the argument as a mathematical expression and the output will be formatted according to TpX like rules It is possible to produce many mathematical symbols sub or superscripts fractions operators greek letters etc run demo plotmath to see examples The expression function turns its argument into an R expres sion ready to be evaluated with the eval function more on this in 4 23 Figure 9 shows an example of a graph with superscripts e g R and math notation e g Greek letters such as f and a summation produced with the following code Note the use of the substitute function to place a numeric value computed on the fly from a linear model result into a text string along with math symbols Note also the alignment of the text with the pos position argument to the text function
100. fied in symbolic form with model formulae These formulae are arguments to many statistical functions most notably the 1m linear models and glm generalised linear models functions but also in base graphics 85 1 functions such as plot and boxplot and trellis graphics 85 2 functions such as levelplot The simplest form is where a mathematically dependent variable is math ematically explained by one mathematically independent variable like all model formulae this uses the formula operator to separate the left de pendent from the right independent sides of the expression gt model Im trees Volume trees Height Call ImCformula trees Volume trees Height Coefficients Intercept trees Height 87 12 1 54 This is to be read like a mathematical function where the left hand side is the result dependent and the right hand side is the expression inde pendent In other words a formula is read as a response variable left hand side 1s explained by the symbol a formula including one or more predictor variables In this example the tree volume is to be explained as a linear function of its height this is just a first order linear regression with the best fit least squares line y 87 12 1 54x for every foot increase in height the volume increases by 1 54 ft also a zero height tree would have a negative volume If the data frame has been attached this can
101. ge 83 6 provides a reasonable GUI for common tasks and there are various development environments for R such as RStudio 83 1 The user must decide on the analysis sequence and execute it step by step However it is easy to create scripts with all the steps in an analysis and run the script from the command line or menus 83 5 scripts can be preared in code editors built into GUI versions of R or separate front ends such as Tinn R 83 5 or RStudio 83 1 A major advantage of this approach is that intermediate results can be reviewed and scripts can be edited and run as batch processes The user must learn a new way of thinking about data as data frames 84 7 and objects each with its class which in turn supports a set of methods 84 13 This has the advantage common to object oriented languages that you can only operate on an object according to methods that make sense and methods can adapt to the type of object The user must learn the S language 84 both for commands and the notation used to specify statistical models 84 17 The S statis tical modelling language is a lingua franca among statisticians and provides a compact way to express models There are many ways to do computational statistics this section discusses them in relation to R None of these programs are open source meaning that you must trust the company to do the computations correctly 2 3 1 S PLUS S PLUS is a commercial program dis
102. ge is that the same plot can be printed in a group of several plots alone or on different devices without having to be recomputed For example hl lt histogram rnorm 100 col lightblue h2 lt histogram rbinom 100 20 5 col snow3 h3 lt histogram rpois 100 1 col springgreen1 h4 lt histogram runif 100 col red4 print hl split c 1 1 2 2 more T print h2 split c 2 1 2 2 more T 86 gt print h3 split c 1 2 2 2 more T gt gt print h4 split c 2 2 2 2 more F rm hl h2 h3 h4 If you are printing to a file more F will end the page The next call to the print method will start a new page This is useful to produce a multi page PDF file 5 5 Colours gt Both base 5 1 and Trellis 5 2 graphics have many places where colours can be specified often with the col foreground colour or bg background colour optional arguments to graphics methods such as plot Colours may be specified either by name by code colour specification or by their numeric position in the active colour palette There are a large number of named colours but only eight of these in the default palette To get a list of possible colour names use the colours method to see the numeric colours in the active palette use the palette method colours 1 white aliceblue antiquewhi te 4 antiquewhitel antiquewhite antiquewhite3 655 yellow3 yellow4 yellowgreen gt
103. gments 72 seq 21 91 sessionInfo 117 set seed 43 setdiff 51 setequal 51 setwd 110 show settings package lattice 84 showMethods 48 sin 19 slotNames 47 solve 28 55 sort 23 24 37 source 123 126 sp package 2 45 47 50 88 94 97 Spatial package sp class 47 48 spatial package 2 SpatialLines package sp class 48 SpatialPixelsDataFrame package sp class 47 SpatialPoints package sp class 47 48 SpatialPointsDataFrame package sp class 46 48 99 SpatialPolygons package sp class 48 spatstat package 2 splines package 2 split 38 39 splom package lattice 83 spplot package sp 48 sqrt 19 stack 41 stars 76 stem 76 step 62 str 24 45 47 55 57 stripchart 76 stripplot package lattice 83 strsplit 43 subset 38 substitute 74 substring 43 sum 23 29 summary 4 12 44 49 58 63 110 summary Im 58 sunflowerplot 76 svd 30 Sweave 102 symbols 72 t 4 26 55 table 34 35 50 tapply 29 terrain colors 88 text 65 72 74 title 72 73 tmd package lattice 83 topo colors 88 89 trees dataset 30 33 37 51 55 63 93 trellis par get package lattice 83 trellis par set package lattice 83 truehist package mass 76 trunc 19 union 51 unique 51 unlist 43 unstack 41 var 23 variogram package gstat 17 vegan package 2 6 vignette 13 volcano dataset 80 wavelet package 2 whi ch 37 6
104. gplot system 60 in the ggplot2 pack age The first of these is easiest to begin with the second is especially suited to plots grouped by a factor the third is for low level graphic manipulation and the last is more difficult to learn but very powerful Only base R graph ics are loaded by default when R starts The lattice package is generally included in initial R distributions the other two have to be installed before use R graphics are highly customizable see each method s help page for de tails and for base graphics the help page for graphics parameters par Except for casual use it s best to create a script 3 5 with the graphics commands this can then be edited and it also provides a way to produce the exact same graph later Multiple graphs can be placed in the same window for display or printing see 5 4 and several graphics windows can be opened at the same time see 5 3 To get a quick appreciation of R graphics run the demostration programs gt demo graphics gt demo lattice Packages grid and ggplot2 have no demos see the respective texts for an introduction 5 1 Base R graphics A technical introduction to base R graphics is given in Chapter 12 of 36 Here we give an example of building up a sophisticated plot step by step starting with the defaults and customizing The example is a scatter plot of petal length vs width from the iris data set A default scatterplot of two var
105. h a code editor help text a workspace browser and graphic output 3 1 The RStudio integrated development environment RStudio is an excellent cross platform integrated development envi ronment IDE for R A screenshot is shown in Figure 1 This environment includes the command line interface a code editor output graphs his tory help workspace contents and package manager all in one atttractive interface The typical use is 1 open a script or start a new script 2 change the working directory to this script s location 3 write R code in the script 4 pass lines of code from the script to the command line in terface and evaluate the output 5 examine any graphs and save for later use If you chose to use RStudio you first install R and then RStudio 3 2 R console GUI The default interface for both Windows and Mac OS X is a simple GUI We refer to these as R console GUI because they provide an easy to use interface to the R command line a simple script editor graphics output and on line help they do not contain any menus for data manipulation or statistical procedures R for Linux has no GUI however several independent Linux programs provide a GUI development environment an example is RStudio 83 1 You can download and install R as instructed in 8A On Windows the in I http www rstudio org 1 Windows Mac OS X Linux 13 from http www rstudio org 14 http www linuxlinks com art
106. he R Commander screen Menu bar at the top a top panel showing commands submitted to R by the menu commands a bottom panel showing the results after execution byR When using Rcmdr observe the commands it formats in response to your menu and dialog box choices Then you can modify them yourself at the R command line or in a script Rcmdr also provides some nice graphics options including scatterplots 2D and 3D where observations can be coloured by a classifying factor 3 7 Loading optional packages R starts up with a base package which provides basic statistics and the R language itself There are a large number of optional packages for specific 16 statistical procedures which can be loaded during a session Some of these are quite common e g MASS Modern Applied Statistics with S 57 and lattice Trellis graphics 50 85 2 Others are more specialised e g for geostatistics and time series analysis such as gstat Some are loaded by default in the base R distribution see Table 4 If you try to run a function from one of these packages before you load it you will get the error message Error object not found You can see a list of the packages installed on your system with the library function with an empty argument gt libraryQO To see what functions a package provides use the library function with the named argument For example to see what s in the geostatistical package gstat gt library Che
107. he table function won t work and if we try to predict the score from the student using the Im function see 84 17 we get nonsense gt Im score student data tests Coefficients Intercept student 9 556 0 917 The problem is that the student is considered as a continuous variable when in fact it is a factor We do much better if we make the appropriate conversion gt tests student as factor tests student gt str tests data frame 9 obs of 2 variables student Factor w 3 levels 1 2 3 123123123 score num 9 6 588 7 5 69 5 87 gt Im score student data tests Coefficients Intercept student2 student3 8 83 1 50 1 83 This is a meaningful one way linear model showing the difference in mean scores of students 2 and 3 from student 1 the intercept Factor names can be any string so to be more descriptive we could have assigned names with the labels argument to the factor function gt tests student factor tests student labels c Harley Doyle JD gt str tests data frame 9 obs of 2 variables student Factor w 3 levels Harley Doyle 123123123 score num 9 6 5887 569 5 87 gt table tests score student 6 6 5 Harley 0 O0 Doyle O 1 JD 1 0 7 ROON Obrov HH H oo O OH o OO Hn u An existing factor can be recoded with an additional call to factor and different labels gt tests student factor tests student labels c John Paul George gt
108. hich runs on any platform with industry standard Java Run time Engine including Mac OS X and most Windows systems This includes the iplots interactive graphics package A 1 Installing new packages The ITC network installation includes the optional packages requested by ITC users If you need to install packages on your own copy of R use the Packages Install Package s from CRAN menu item while connected to the Internet You need administrator privledges on your system if you can install the base program you can install packages A brief description and full documentation of the available packages is available from the CRAN home page click on the link for Packages If you need a new package on the ITC network contact the person who maintains the R distribution A 2 Customizing your installation Many aspects of R s interactive behaviour can be changed to suit your pref erences For example you can set up your own copy of R to load libraries at startup and you can also change many default options You do this by creating a file named Rprofile either in your home directory or in a working directory from which you will start R or both The second of these is used if it exists otherwise the master copy in the home directory To see the current options setings use the options function without any arguments it s easier to review these by viewing its structure Individual settings can then be viewed by their field name
109. iables is produced by the plot default method which is automatically used by the generic plot command if two 69 equal length vectors are given as arguments gt data iris gt str iris data frame 150 obs of 5 variables Sepal Length num 5 1 4 9 4 7 4 6 5 5 4 4 6 5 4 4 4 9 Sepal Width num 3 5 3 3 2 3 1 3 6 3 9 3 4 3 4 2 9 3 1 Petal Length num 1 4 1 4 1 3 1 5 1 4 1 7 1 4 1 5 1 4 1 5 Petal Width num 0 2 0 2 0 2 0 2 0 2 0 4 0 3 0 20 2 0 1 Species Factor w 3 levels setosa versicolor 1111111 gt attach iris plot Petal Length Petal Width In this form the x and y axes are given by the first and second arguments respectively The same plot can be produced using a formula 84 17 show ing the dependence in which case the y axis is given as the dependent variable on the left hand side of the formula gt plot Petal Width Petal Length fe ai 7 o o o ooo ooo o oo o O o e 4 coco o o ai o0 o oo oooo o D oo o o S _ o 000 O00 3 T o oc o z O o0 S 000 0 o o co 7 000 a o o o o 000 O ooo o oo0000 O o T T T T T T T 1 2 3 4 5 6 7 Petal Length Figure 6 Default scatterplot This default plot is shown in Figure 6 It is informative but not so attrac tive We can customize the plotting symbol pch argument its colour and size col and cex arguments the axis labels xlab and ylab arguments and graph title main
110. ich you can see with R help Under Windows find and open the executable Rterm this opens R in a terminal non graphics window 3 3 1 The command prompt You perform most actions in R by typing commands in a command line interface window in response to a command prompt which usually looks like this The is a prompt symbol displayed by R not typed by you This is R s way of telling you it s ready for you to type a command Type your command and press the Enter or Return keys R will execute your command If your entry is not a complete R command R will prompt you to complete it with the continuation prompt symbol R will accept the command once it is syntactically complete in particular the parentheses must balance Once the command is complete R then presents its results in the same command line interface window directly under your command If you want to abort the current command i e not complete it press the Esc escape key For example to draw 500 samples from a binomial distribution of 20 trials with a 40 chance of success you would first use the rbinom function and then summarize it with the summary function as follows 16 An alternative for some analyses is the Rcmdr GUI explained in 3 6 17 This simulates for example the number of women who would be expected by chance to present their work at a conference where 20 papers are to be presented if the women make up 40 of the possib
111. icle 20110306113701179 GUIsforR htm1 ds DLV RStudio 9j F E EJ Q rgeo R ec B JSeuceonsaee Q Run 9 Source i 1 2 3 div lt read csv dlv csv 4 5 div r lt div dlv r geo gt 1 length dlv 6 acii 7 8 De eK cC red green3 blue magenta cme plot jitter dlv r 2 4 jitter dlv 4 cex dlv r dens b max dlv r dens b p ILE title main Lake Valencia sub Subsoil bulk density by geomorphic region 11 Legend topleft paste Region c 2 3 6 7 5 pch 21 pt bg c red green3 12 pte 13 14 par mfcol c 2 2 col main darkblue IESE boxplot clay b r geo dlv r main Clay X B xlab Geomorphic Region IE boxplot ph b r geo dlv r main pH B xlab Geomorphic Region 17 boxplotCdens b r geo dlv r main Bulk Density B xlab Geomorphic Region 18 boxplot vfs b r geo dlv r main Very fine sand B xlab Geomorphic Region 19 PAGE 20 21 par mfcol c 2 2 col main darkblue 22 boxplot clay a r geo dlv r mai ay X A xlab Geomorphic Region TERE boxplot ph a r geo dlv r main pH A xlab Geomorphic Region ELEME boxplot dens a r geo dlv r main Bulk Density A xlab Geomorphic Region 25 boxplot vfs a r geo dlv r main Very fine sand X A xlab Geomorphic Region 26 Dti 27 13 1 Top Level R Script Console d V ex 2 2 2 9 0 7 0 2 4 4 43 50 38 19 50 32800 15 23 1435 25 2118568 205 87 500 16 630 58
112. ied degree gt model lt Im Volume Height Girth A2 data trees Here the A2 expands to all interactions between the named predictors since there are only two this is equivlent to Height Girth Height Girth which in this two predictor case is also the same as Height Girth Removing terms Sometimes it is convenient to specify a model and then remove a term from it with the formula operator As a somewhat arti ficial example to model tree volume by only tree girth and its interaction with height gt model lt Im Volume Height Girth Height data trees This is equivalent to 53 gt model lt Im Volume Girth Girth Height data trees The formula operator is often used to remove the intercept see below Nested models The operator is used to specify that the second named predictor is nested within the first named predictor i e levels of the nested predictor are not independent factors Nested models are often used in de signed experiments such as split plot designs or replicated measurements within an experimental unit This is also used for the analysis of covari ance ANCOVA where the covariates are nested within the treatment in this case the intercept should be removed see next 4 No intercept The intercept term e g the mean is implicit in model for mulas For regression through the origin it must be explicitly removed with the formula operator in this case the implied intercept
113. imultaneous Confidence Intervals mvtnorm Multivariate Normal and T Distribution nlme Linear and nonlinear mixed effects models nnet Feed forward neural networks rgl 3D visualization device system OpenGL rpart Recursive partitioning SparseM Sparse Linear Algebra spatial Functions for Kriging and point pattern analysis splines Regression Spline Functions and Classes stats The R Stats Package includes classical tests exploratory data anlysis smoothing and local methods for regres sion multivariate analysis non linear least squares time series analysis stepfun Step Functions including Empirical Distributions survival Survival analysis including penalised likelihood utils R Utilities Table 4 Packages in the base R distribution for Windows libraries loaded when R starts are shown in boldface 125 It also installs six manuals in both PDF and HTML format 1 An Introduc tion to R 2 R Installation and Administration 3 R Language Definition 4 Reference Index 5 R Data Import Export 6 Writing R Extensions Other operating systems For Mac OS X or other Unix based system fol low the appropriate links from the CRAN home page and read the instal lation instructions The GUI for the Mac OS X version has a code editor data editor data browser and native Quartz graphics Cross platform RStudio is explained in 83 1 The JGR project at the University of Augsburg D has developed a Java based GUI w
114. is can be accomplished in one step using the by function gt by Petal Length Species mean INDICES setosa 1 1 462 INDICES versicolor 1 4 26 39 INDICES virginica 1 5 552 In this example we applied the mean function to the Petal Length field in the attached data frame grouping the petal lengths by the Species categorical factor A function can be applied to several fields at the same time and the results can be saved to the workspace gt iris m lt byCiris 1 4 Species mean gt class iris m 1 by gt str iris m List of 3 setosa Named num 1 4 5 006 3 428 1 462 0 246 attr names chr 1 4 Sepal Length Sepal Width versicolor Named num 1 4 5 94 2 77 4 26 1 33 attr names chr 1 4 Sepal Length Sepal Width virginica Named num 1 4 6 59 2 97 5 55 2 03 attr names chr 1 4 Sepal Length Sepal Width attr dim 2 int 3 attr dimnames List of 1 Species chr 1 3 setosa versicolor virginica attr call language by data frame data iris 1 4 INDICES Species FUN mean non attr class chr by gt iris m setosa Sepal Length Sepal Width Petal Length Petal Width 5 006 3 428 1 462 0 246 gt iris m setosa 3 Petal Length 1 462 gt iris m setosa Petal Length Petal Length 1 462 As this example shows the result is one list for each level of the grouping factor here the Iris sp
115. is graphics parameters The Trellis graphics environment as implemented in the lattice package sets reasonable defaults for its graphics parameters based on the output device screen PDF file Changing these requires three steps 1 re trieve the parameters into a data structure in memory 2 modify them 3 write the modified parameters back as permanent options Parameters are retrieved with the trellis par get method and set with the trellis par set method In the following example we change the superposition symbol which is used in the xyplot method to show groups of points as in the example just above 83 The current settings are shown graphically by the show settings method gt show settingsQ gt options lt trellis par get gt names options 1 grid pars fontsize background 4 clip add line add text 7 plot polygon box dot box rectangle 10 box umbre1la dot line dot symbo 13 plot line plot symbol reference line 16 strip background strip shingle strip border 19 superpose line superpose symbol superpose polygon 22 regions shade colors axis line 25 axis text axis components layout heights 28 layout widths box 3d par xlab text 31 par ylab text par zlab text par main text 34 par sub text gt options superpose symbol alpha 1 1111111 cex 1 0 8 0 8 0 8 0 8 0 8 0 8 0 8 col 1 40080ff ffOOff darkg
116. ish language introduction to R graphics Dal gaard 11 is a clearly written introductory statistics textbook using R in all examples Another useful introduction is simpleR by John Verzani of the City University of New York 58 also available as a set of web pages 10 5 This shows how to use R for the usual univariate and bivariate statistics and tests linear regression and ANOVA It has since been up dated as a commercial textbook 59 His See A LAE G R x BO RSI wE RBS KE E 99 10 3 Textbooks using R e books in the Crawley 10 is a good introduction to many kinds of analysis using R with ITC library Other books in the ITC library UseR series emphasis on statistical modelling The introductory text of Dalgaard 11 was mentioned above Venables amp Ripley 57 present a wide variety of up to date statistical methods includ ing spatial statistics with algorithms coded in S for both R and S PLUS There are a variety of other texts using S or S PLUS which are mostly ap plicable to R Fox 19 explains how to use R for regression analysis in cluding advanced techniques this is a companion to his text 18 A more mathematically sophisticated approach but with a heavy emphasis on R techniques is the free text by Faraway 15 which was expanded into two commercial texts 16 17 Springer publishes the UseR series with 39 titles which includes p
117. ist using the operator object component where the indicates that the component or field is to be found within the named object We can combine this with the vector indexing operation 24 gt ss ix length ss ix 1 9 So the largest value in the sample sequence is found in the ninth position This example shows how expressions may contain other expressions and S evaluates them from the inside out just like in mathematics In this case The innermost expression is ss ix which is the vector of indices in object ss The next enclosing expression is length the length function returns the length of its argument which is the vector ss ix the innermost expression The next enclosing expression is ss ix which converts the result of the expression length ss ix to a subscript and extracts that element from the vector ss ix The result is the array position of the maximum element We could go one step further to get the actual value of this maximum which is in the vector ss x gt samp ss ix length ss ix2 1 9 599777 but of course we could have gotten this result much more simply with the max function as max ss x or even max samp 4 6 Arrays and matrices An array is simply a vector with an associated dimension attribute to give its shape Vectors in the mathematical sense are one dimensional arrays in S matrices are two dimensional arrays higher dimensions are possible For example c
118. it the vec tor 1 2 10 The syntax of seq 10 is explained in the next section Multiple assignments are allowed in the same expression 20 gt mu lt theta lt pi 2 1 1 5708 The final value of the expression in this case the value of mu is printed because the parentheses force the expression to be evaluated as a unit Removing objects from the workspace You can remove objects when they are no longer needed with the rm function gt rm s gt sS Error Object s not found 4 3 Functions and their arguments In the command s seq 10 seq is an example of an S function by analogy with mathematical functions which has the form function name arguments Some functions do not need arguments e g to list the objects in the workspace use the 1s function with an empty argument list gt IsO Note that the empty argument list i e nothing between the and is still needed otherwise the computer code for the function itself is printed Optional arguments Most functions have optional arguments which may be named like this gt S lt seq from 20 to 0 by 2 gt sS 1 20 18 16 1412 10 8 6 4 2 0 Named arguments have the form name value Arguments of many functions can also be positional that is their meaning depends on their position in the argument list The previous command could be written gt S lt seq 20 0 by 2 s 1 2018 16 14 1210 8 6 4 2 0 because the seq functi
119. ite Microsoft Excel worksheets It includes a loadworkbook function to load or create a workbook a readworksheet function to read all or part of one sheet of a workbook as a data frame and a comparalbe writeWorksheet function Prior to this package exchange with Excel was via CSV files as explained in the next section 6 4 Importing data from a CSV file The Comma Separated Values or CSV file is a common interchange for mat which can be prepared in many ways For example a CSV file can be created directly as a text file using Notepad or some other plain text edi tor However it is common to have data already in a spreadsheet such as 93 Excel In this case the procedure is as follows 1 Prepare the data as an Excel spreadsheet with named columns 2 Export from Excel to a CSV file using Excel s File Save As menu item 3 Import into R with the read csv function 4 Adjust data types in R if necessary We illustrate this with a simplified version of the meuse dataset from the gstat and sp packages which has been prepared to illustrate some issues with import Here is a small CSV file written by Excel and viewed in a plain text editor such as Notepad x y cadmium elev dist om ffreq soil lime landuse 181072 333611 11 7 7 909 0 00135803 13 6 1 1 1 Ah 181025 333558 8 983 0 0122243 14 1 1 1 Ah 181165 333537 6 8 0 103029 13 1 1 1 Ah 181298 333484 2 655 0 190094 8 1 2 0 Ga 2 4 3 1 181307 3333
120. ith to see just how the function works If you don t know the function name you can search the help for relevant functions using the help search function gt help search binomial 1 also available via the Help Search help menu item 12 will show a window with all the functions that include this word in their description along with the packages where these functions are found whether already loaded or not In the list shown as a result of the above function we see the Binomial stats topic we can get more information on it with the function this is written as the character immediately followed by the function name gt Binomial This shows the named topic which explains the rbinom among other functions Vignettes Packages have a long list of functions each of which has its own documen tation as explained above Some packages are documented as a whole by so called vignettes for now most packages do not have one but more will be added over time You can see a list of the vignettes installed on your system with the vignette function with an empty argument gt vignette and then view a specific vignette by naming it gt vignette sp 3 4 The Tinn R code editor For Windows user the Tinn R code editor for Windows is recommended for those who do not choose to use RStudio This is tightly integrated with the Windows R GUI as shown in Figure 2 R code is syntax highlighted and there
121. kage 2 nls 62 nnet package 2 object size 115 options 56 121 order 34 ordered 34 pairs 76 palette 87 panel trellis graphics argument 82 panel abline package lattice 82 panel fill Cpackage lattice 82 panel xyplot package lattice 82 par 59 86 parallel package lattice 83 parse 67 68 paste 31 43 67 pch graphics argument 70 pdf 10 persp 76 pi constant 19 plot 4 52 59 65 69 71 72 76 84 87 plot default 69 plot 1m 59 plotmath 74 pls package 62 pnorm 42 points 65 72 polygon 72 prcomp 62 predict 61 predict 1m 61 print package lattice 86 87 print 67 80 q 9 qnorm 42 qq Cpackage lattice 83 qqmath package lattice 83 qr 30 quanti le 49 R version 116 rainbow 89 randomForest package 2 rank 23 24 89 rbind 33 rbinom 12 Rcmdr package 15 read asciigrid package sp 97 read csv 94 95 110 read table 34 94 readworksheet 93 recode 36 rect 72 rep 32 91 repeat 64 require 15 17 45 reshape 41 residuals 57 return 66 rfs package lattice 83 rgb 88 rgdal package 2 rim package mass 62 rm 21 31 rnorm 23 42 round 19 28 42 99 140 row 27 rownames 51 92 rpart package 2 RSiteSearch 109 114 rstandard 59 rstudent 59 rug 72 runif 64 sample 38 sapply 29 scale 29 scan 91 92 97 scatterplot package car 76 scatterplot matrix package car 76 screeplot 62 se
122. l c or catenate make a chain function which makes a list out of its arguments in this case the two integers representing the indices of the first and last elements in the vector 4 4 Vectorized operations and re cycling A very powerful feature of S is that most operations work on vectors or matrices with the same syntax as they work on scalars so there is rarely any need for explicit looping commands which are provided xe g for 22 These are called vectorized operations As an example of vectorized oper ations consider simulating a noisy random process gt sample seq 1 10 rnorm 10 1 0 1878978 1 6700122 2 2756831 4 1454326 5 5 8902614 7 1992164 9 1854318 7 5154372 9 8 7372579 8 7256403 This adds a random noise using the rnorm function with mean 0 and standard deviation 1 the default to each of the 10 numbers 1 10 Note that both vectors have the same length 10 so they are added element wise the first to the first the second to the second and so forth If one vector is shorter than the other its elements are re cycled as needed gt samp seq 1 10 rnorm 5 1 1 23919739 0 03765046 2 24047546 4 89287818 5 4 59977712 3 76080261 5 03765046 7 24047546 9 9 89287818 9 59977712 This perturbs the first five numbers in the sequence the same as the sec ond five A simple example of re cycling is the computation of sample variance di rectly from the definition rather than with the
123. landsat classification accuracy using discrete multivariate analysis sta tistical techniques Photogrammetric Engineering amp Remote Sensing 49 12 1671 1678 25 92 7 Cook D Swayne D F amp Buja A 2007 Interactive and Dynamic Graphics for Data Analysis with R and GGobi Use R New York Springer Verlag 111 8 Cook R amp Weisberg S 1982 Residuals and influence in regression New York Chapman and Hall 59 9 Correa J C amp Gonz lez N 2002 Gr ficos Estadisticos con R Medel l n Colombia Universidad Nacional de Colombia Sede Medell n Pos grado en Estad stica 111 10 Crawley M J 2007 The R book Chichester Wiley amp Sons URL http www dawsonera com depp reader protected external AbstractView S9780470515068 111 11 Dalgaard P 2002 Introductory Statistics with R Springer Verlag 111 119 12 Davis J C 2002 Statistics and data analysis in geology New York John Wiley amp Sons 3rd edition 119 13 Diggle P J amp Ribeiro Jr P J 2007 Model based geostatistics Springer 111 119 14 Draper N amp Smith H 1981 Applied regression analysis New York John Wiley 2nd edition 56 58 59 15 Faraway J J 2002 Practical Regression and Anova using R Depart ment of Mathematical Sciences University of Bath UK self published web URL http www maths bath ac uk jjf23 book 111 119 133 16 Faraway J J 2005 Linear models with R Boca Raton Chapman
124. le presenters 18 Your output will probably be somewhat different why 11 gt x lt rbinom 500 20 4 gt summary x Min 1st Qu Median Mean 3rd Qu Max 2 000 7 000 8 000 8 232 10 000 15 000 This could also have been entered on several lines gt x lt rbinom 500 20 4 You can use any white space to increase legibility except that the assign ment symbol lt must be written together lt rbinom 500 20 0 4 R is case sensitive that is function rbinom must be written just that way not as Rbinom or RBINOM these might be different functions Variables are also case sensitive x and X are different names Some functions produce output in a separate graphics window gt hist x 3 3 2 On line help in R Both the base R system and contributed packages have extensive help within the running R environment In RStudio the Help tab in the lower right window pane gives access to the help system and via the home button to the R manuals From the command prompt you can also get help on any function with the function this is a shorthand for the help function For example if you don t know the format of the rbinom function used above Either of these two forms gt rbinom gt help rbinom Searching for functions will display a text page with the syntax and options for this function There are examples at the end of many help topics with executable code that you can experiment w
125. levels Ah Ga 1112 12 VVVYV R could determine that landuse is a factor categorical variable because it was non numeric It could also determine the variable names from the first row The other factors were not recognized and in fact they have different R data types which we now assign using the as functions to change data types ds soil as factor ds soil ds ffreq as ordered ds ffreq ds lime as logical ds lime str ds data frame 6 obs of 10 variables x int 181072 181025 181165 181298 y int 333611 333558 333537 333484 cadmium num 11 7 8 6 6 5 2 6 2 8 3 elev num 7 91 6 98 7 80 7 66 7 48 dist num 0 00136 0 01222 0 10303 0 19009 om num 13 6 14 13 8 8 7 7 8 ffreq Ord factor w 3 levels 1 2 3 1 soil Factor w 3 levels 1 2 3 111 lime logi TRUE TRUE TRUE FALSE FALSE FALSE landuse Factor w 2 levels Ah Ga 1112 12 Using the correct data type is important for many modelling methods here we inform R that lime was either applied logical TRUE associated by R with the number 1 or not logical FALSE associated by R with the number 0 that there are three soil types arbitrarily named 1 2 and 3 not in any order and that there are three flood frequency classes named 1 2 and 3 and these are in this order from most to least flooded n b we know this from the data documentation not from anything we ca
126. lines 1 gt ncol lt scan file image grd nlines 1 what character gt ncol 1 NCOLS 12 gt ncol lt as numeric ncol 2 gt ncol 1 12 gt nrow lt as numeric scan file image grd skip 1 nlines 1 what character 2 1 12 Now we read the image data into a matrix as one vector gt img scan file image grd skip 6 na strings 9999 Read 144 items gt summary img Min 1st Qu Median Mean 3rd Qu Max 141 176 202 200 222 256 gt class img 1 numeric Although this small image has no missing values in general we may expect them so we need to specify the format of the missing values Finally we specify the dimensions and format of the image thereby pro moting it to a matrix with the matrix function By default R interprets a matrix in column major order each column is a vector but the image is stored by convention in row major order therefore we specify the optional byrow T argument gt img matrix img nrow ncol byrow T gt class img 1 matrix gt str img num 1 12 1 12 141 143 147 151 153 153 153 157 159 159 103 8 Exporting from R VVVVVVVV VY Data frames are easily exported from R For all the possibilities see the R Data Import Export manual Here we explain the most common operation exporting a data frame to a CSV file This format can be read into Excel ILWIS and many other programs The XLConnect package uses an interface
127. lp gstat To load a package simply give its name as an argument to the require function for example gt require gstat Once it is loaded you can get help on any function in the package in the usual way For example to get help on the variogram method of the gstat package once this package has been loaded gt variogram 3 8 Sample datasets R comes with many example datasets part of the default datasets pack age and most add in packages also include example datasets Some of the datasets are classics in a particular application field an example is the iris dataset used extensively by R A Fisher to illustrate multivariate methods To see the list of installed datasets use the data method with an empty argument gt dataQ To see the datasets in a single add in package use the package argument gt data package gstat 17 To load one of the datasets use its name as the argument to the data method gt data iris The dataframe representing this dataset is now in the workspace 18 4 The S language R is a dialect of the S language which has a syntax similar to ALGOL like programming languages such as C Pascal and Java However S is object oriented and makes vector and matrix operations particularly easy these make it a modern and attractive user and programming environment In this section we build up from simple to complex commands and break down their anatomy A full description of the langu
128. ly ordered from least to most flooding So we might want to change the data type data meuse gt str meuse ffreq Factor w 3 levels 1 2 3 1111111 contrasts meuse ffreq 23 UJ hJ H 00 10 01 gt ImClog meuse lead meuse ffreq Coefficients Intercept ffreq2 ffreq3 5 106 0 666 0 626 gt meuse ffreq lt as ordered meuse ffreq gt str meuse ffreq Ord factor w 3 levels 1 2 3 1111111111 gt contrasts meuse ffreq L i 1 7 0711e 01 0 40825 2 9 0738e 17 0 81650 3 7 0711e 01 0 40825 gt ImClog meuse lead meuse ffreq Coefficients Intercept meuse ffreq L meuse ffreq Q 4 675 0 443 0 288 The unordered factor has treatments contrasts sometimes called dummy 56 variables whereas the ordered factor has orthogonal polynomial con trasts These result in different fitted coefficients 4 17 2 Analysis of Variance ANOVA The results of linear models can be expressed in the traditional language of ANOVA as found in many textbooks with the aov function this calls 1m and formats its results in a traditional ANOVA table gt model lt aov Volume Height Girth data trees gt class model 1 aov Im gt summary model Df Sum Sq Mean Sq F value Pr gt F Height 1 2901 2901 193 4 5e 14 Girth 1 4783 4783 317 2e 16 Residuals 28 422 15 Two models on the same dataset may be compared with the anova func tion this is one way to test if the more
129. m 926596 x 1 1 1 2 2 28 4 4 Applying The apply function applies a function to the margins of a matrix i e the functions rows 1 or columns 2 For example to compute the row and column to matrix sums of the confusion matrix use apply with the sum function as the margins function to be applied gt rsum lt apply cm 1 sum 1 61 18 63 21 gt csum lt apply cm 2 sum 1 53 39 64 7 These can be used along with the diag function to compute the pro ducer s and user s classification accuracies since di ag cm gives the correctly classified observations gt pa round diag cm csum 2 1 0 66 0 28 0 59 0 29 gt ua lt round diag cm rsum 2 1 0 57 0 61 0 60 0 10 The apply function has several variants lapply and sapply to apply a user written or built in function to each element of a list tapply to apply a function to groups of values given by some combination of factor levels and by a simplified version of this For example here is how to standardize a matrix by hand i e subtract the mean of each column and divide by the standard deviation of each column gt applyCcm 2 function x sapply x function i Ci mean x sd x 1 2 L3 4 1 1 381930 0 10742 0 24678 0 689 2 0 087464 1 39642 0 44420 0 053 3 0 297377 0 32225 1 46421 1 431 4 0 997089 0 96676 0 77323 0 689 The outer apply applies a function to the second margin i e c
130. m elements of an array Random elements of a vector can be selected with the sample function trees sort sample 1 dim trees 1 5 Girth Height Volume 11 4 76 21 4 13 3 86 27 4 14 2 80 31 7 14 5 74 36 3 17 3 81 55 4 Each call to sample will give a different result By default sampling is without replacement so the same element can not be selected more than once for sampling with replacement use the replace T optional argument In this example the command dim trees uses the dim function to give the dimensions of the data frame rows and columns the first element of this two element list is the number of rows dim trees 1 Splitting on a factor Another common operation is to split a dataset into several strata defined by some factor For this S provides the split func 38 tion which we illustrate with the iris dataset which has one factor the species of Iris gt data iris str ir data frame 150 ob Sepal Length num Sepal Width num Petal Length num Petal Width num Species gt attach iris gt ir s lt splitCiris List of 3 setosa data Sepal Length Sepal Width Petal Length Petal Width Species versicolor data Sepal Length Sepal Width Petal Length Petal Width Species virginica data Sepal Length Sepal Width Petal Length is s frame num 1 num 1 num 1 num 1 Factor
131. matrix where the rows are cases called observations by S whether or not they were field observations and the columns are the variables In standard database terminology these are records and fields respectively Rows are generally accessed by the row number although they can have names and columns by the vari able name although they can also be accessed by number A data frame can also be considerd a list whose members are the fields these can be accessed with the list access operator Sample data R comes with many example datasets 83 8 organized as data frames let s load one trees and examine its structure and several ways to access its components gt trees gt data trees gt str trees data frame 31 obs of 3 variables Girth num 8 3 8 6 8 8 10 5 10 7 10 8 11 Height num 70 65 63 72 81 83 66 75 80 75 Volume num 10 3 10 3 10 2 16 4 18 8 19 7 The help text tells us that this data set contains measurements of the girth diameter in inches measured at 4 5 ft height height feet and timber 29 1 inch 2 54 cm 301 ft 30 48 cm 30 volume cubic feet in 31 felled black cherry trees The data frame has 31 observations rows cases records each of which has three variables columns attributes fields Their names can be retrieved or changed by the names function For example to name the fields Var 1 Var 2 etc we could use the paste function to build the names int
132. mple to force English LANGUAGE en Note the format upper case variable name language name exactly as in the list of available languages no quotes the whole thing after the program name See the R for Windows FAQ Questions 2 2 and 2 15 for more details this FAQ is available via the Help menu in RGui 127 Mac OS X The R for Mac OS X FAQ explains that R detects the language settings from the Language amp Text section of the System Preferences and presents translated messages and GUI if they are available in the selected language The Formats information for numbers is also used This can be over ridden on a system wide basis not per session by set ting the force LANG defaults setting For example to force the US English text on a system running another language open the Terminal i e a shell and enter this command defaults write org R project R force LANG en US UTF 8 Other languages and encodings can be specified with the International Or ganization for Standardization ISO standard for the identification of lan guages ISO 639 1 or 2 and locales ISO 3166 1 B Anexample script This is an example of a moderately complicated script which gives both a numerical and a visual impression of the variability of small random sam ples The output is shown in Figure 21 on the following page If you want to experiment with the script cut and paste it into a text editor for exam ple Tinn R or the built in editor of
133. n gt trees Volume 1 lt NA gt str trees data frame 31 obs of 3 variables Girth num 8 3 8 6 8 8 10 5 10 7 10 8 11 11 Height num 70 65 63 72 81 83 66 75 80 75 Volume num NA 10 3 10 2 16 4 18 8 19 7 15 6 gt summary trees Volume Min 1st Qu Median Mean 3rd Qu Max NA s 10 2 19 8 24 5 30 8 37 8 77 0 1 0 As the example shows some functions like summary can deal with NA s but others can t For example if we try to compute the Pearson s cor relation between tree volume and girth using the cor function with the missing value included gt cor trees Volume trees Girth Error in cor trees Volume trees Girth missing observations in cov cor This message is explained in the help for cov where several options are given for dealing with missing values within the function the most com mon of which is to include a case in the computation only if no variables are missing gt cor trees Volume trees Girth use complete obs 1 0 967397 To find out which cases have any missing value S provides the complete cases function gt complete cases trees 1 FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 14 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE 27 TRUE TRUE TRUE TRUE TRUE This shows that the first case record is not complete The na omit function removes cases with a missing value in any variable 63 gt trees complete lt na omit trees
134. n Petal Length median Petal Width cex 2 pch 23 col black bg blue Titles and axis labels can be added with the title method if they were not already specified as arguments to the plot method title sub Centroids mean green and median gray Text can be added anywhere in the plot with the text method the first two arguments are the co rdinates as shown on the axes the third argument is the text and optional arguments specify the position of the text relative to the co rdinates the colour text size and font 72 gt text 1 2 4 Three species of Iris pos 4 col navyblue A special kind of text is the legend added with the legend method V legend 1 2 4 levels Species pch 21 23 bty n colec slateblue firebrick darkolivegreen The abline method can also add lines computed from a model for ex ample the least squares regression using the Im method and a robust regression using the 1qs method of the MASS package V ablineCIm Petal Width Petal Length lty longdash col red require MASS for lqs function abline lqs Petal Width Petal Length Ity 2 col blue VV This customized plot is shown in Figure 8 Anderson Iris data q ee yt Three species of Iris i i 9 Ea setosa i 0000 999 amp m versicolor oo g 9 virginica ROY as 00 OO 999 9 99 94909 9 l ie pe EE m e m oS mmm m o u mm e 2 qpsscsescn
135. n see in the CSV file 42 read csv is a specialised wrapper to the very general read table function which can deal with tables in many other formats 43 These are not the numbers 1 2 and 3 100 Missing values Missing values are expressed in CSV files by the two let ters NA without quotes For example X y cadmium elev dist om ffreq soil lime landuse 181072 333611 NA 7 909 0 00135803 NA 1 1 1 Ah 181025 333558 8 6 6 983 0 0122243 14 1 1 1 Ah In the first record neither the cadmium concentration nor organic matter proportion are known these will be imported as missing values symbol ized in R by NA Matrices A matrix can also be prepared as a CSV file perhaps exported from Excel and imported into R with the read csv function The matrix can also be prepared with row and column labels For example here is the CSV file for a small confusion matrix as prepared in a text editor the first row has the column names reference classes and each of the other rows begins with the row name mapped class there is no entry for the very first column of the first row A B C p A 35 14 11 1 B 4 11 3 0 C 12 9 38 4 D 2 5 12 2 Suppose this file is named cm csv it can be read in as follows note that we inform R that the row names are in the first column and convert the default class data frame to a matrix with the as matrix function The attributes function shows the correct data type and attributes here
136. nations of heights in 10 foot increments and girths in 5 inch radius increments along with the 99 confidence intervals of this prediction gt model lt Im Volume Height Girth data trees V new data data frame expand grid Height seq 50 100 by 10 Girth seq 5 25 by 5 gt pred lt predict model new data interval prediction V level 0 99 add the predictor values for easy interpretation pred cbind new data pred gt str pred 34 Some of these combinations would result in strange looking trees 61 data frame 30 obs of 5 variables Height Girth fit Iwr upr num 50 60 70 80 90 100 50 60 70 80 num 55555 5 10101010 num 8 93 2 69 3 55 9 79 16 03 num 7 37 9 37 12 74 18 89 27 88 num 25 222 14 743 5 639 0 685 4 167 gt fits for trees 50 feet tall gt pred pred Height 50 fit lwr upr Height Girth 1 8 9265 7 3694 25 222 50 5 7 13 3109 3 0322 23 590 50 10 13 17 6952 5 7180 29 672 50 15 19 22 0796 2 6126 41 547 50 20 25 26 4639 2 0348 54 963 50 25 4 19 Advanced statistical modelling The Im function is the workhorse of modelling in S because of the im portance of linear models and its versatility However R has many other modelling functions including Imimplements weighted least squares if the weights are specified as an optional argument loess for local fitting glm for generalised li
137. near models r1m and 1qs in the MASS package for robust fitting of linear models m ridge also in the MASS package for ridge regression nls for non linear least squares fitting step for stepwise regression this is a dangerous procedure when applied blindly step can select the best model based on AIC us ing forward or backward selection and a user specified stopping and starting points The pls package 27 implements partial least squares regression PLSR and principal components regression PCR these are often used in spectroscopy and chemometrics The boot package provides bootstrapping functions Principal components of multivariate matrices are computed by the prcomp function the results can be visualised with the biplot and screeplot functions There are many other modelling functions see 811 3 for some ideas on how to find the one you want The advanced text of Venables amp Ripley 57 62 4 20 has chapters on many sophisticated functions all with theory references and S code Many of these functions are implemented in the MASS package Missing values A common problem in a statistical dataset is that not all variables are recorded for all records R uses a special missing value value for all data types represented as NA which stands for not available Within R it may be assigned to a variable For example suppose the volume of the first tree in the trees dataset is unknow
138. nique values of petal length 22 and then coded each observation this is not too useful here in prac tice you would use the cut function The as numeric conversion function extracts the level number of the factor for each object this can be useful if we want to give a numeric argument derived from the factor 4 13 1 The S3 and S4 class systems S has two class systems referred to as 3 old style and S4 new style Most R functions still use S3 classes as presented in the previous section but most new and rewritten packages use the more powerful and modern S4 classes The difference between the two systems is readily apparent in the output of both the class and str functions A good example of S4 classes is the class structure of the sp spatial statis tics package 34 First we look at an R object that is an old style S3 class Note the use of the require function instead of library this ensures that the library is loaded only once 45 gt require sp gt data meuse gt class meuse 1 data frame gt str meuse data frame 155 obs of 14 variables x num 181072 181025 181165 181298 181307 y num 333611 333558 333537 333484 333330 cadmium num 11 7 8 6 6 5 2 6 2 8 3 3 2 2 8 2 41 6 dist m num 50 30 150 270 380 470 240 120 240 420 The sample data meuse is imported as an S3 class namely a data frame Notice that the co rdinates of each point are listed as fields However sp has defin
139. o a list and then assign this list to the names attribute of the data frame gt saved names names trees 1 Girth Height Volume gt names trees lt paste Var 1 dim trees 2 sep 1 Var 1 Var 2 Var 3 gt names trees 1 Perimeter gt names trees 1 Perimeter V 2 v3 gt names trees saved names 1 Girth Height Volume gt rm saved names Note in the paste function how the shorter vector Var was re cycled to match the longer vector 1 dim trees 2 This was just an example of how to name fields at the end we restore the original names which we had saved in a variable which since we no longer need it we remove from the workspace with the rm function The data frame can accessed various ways gt most common by field name gt trees Height 1 70 65 63 72 81 83 66 75 80 75 79 76 76 69 15 75 74 85 86 71 64 78 80 74 72 77 81 82 80 29 80 80 87 gt the result is a vector can select elements of it gt trees Height 1 5 1 70 65 63 72 81 gt but this is also a list element gt trees 2 1 70 65 63 72 81 83 66 75 80 75 79 76 76 69 15 75 74 85 86 71 64 78 80 74 72 77 81 82 80 29 80 80 87 gt trees 2 1 5 1 70 65 63 72 81 gt as a matrix first by row gt trees 1 Girth Height Volume 1 8 3 70 10 3 gt then by column gt trees 2 gt trees 2 1 70 65 63 72 81 83 66 75 80 75 79 76 76 69 15 75 74 85 86
140. odels as well as for models using only continuous predictors The categorical variables must be ordered or unordered S fac tors this can be checked with the is factor function or examined directly with the str function 55 Factors are included in the design matrix as contrasts which divide the observations according to the classifying factors This is quite a techni cal subject treated thoroughly in standard linear modelling texts such as those by Venables amp Ripley 57 Fox 18 Christensen 4 and Draper amp Smith 14 The practical importance of contrasts is mainly the interpreta tion of the results that is possible with a given contrast and secondly in the computational stability One of R s environment options is the default contrast type for unordered and ordered factors these can be viewed and changed with the options function Contrasts for specific factors can be viewed and set with the contrasts function using the contr helmert contr poly contr sum and contr treatment functions to build contrast matrices gt options contrasts contrasts unordered ordered contr treatment contr poly Polynomial contrasts assume equal feature space distance between levels of the ordered predictor this may not be justified and so you may want to change the contrast type For example the meuse soil pollution dataset includes a factor for flooding frequency this is an unordered factor but the three levels are natural
141. olumns The function is defined with the function command The structure is clearer if braces are used optional here because each function only has one command and the command is written on several lines to show the matching braces and parentheses applyCcm 2 function x i gt sapply x function i 4 i mean x sd x end function end sapply end function end apply This particular result could have been better achieved with the scale 29 Other matrix functions scale a matrix function which in addition to scaling a matrix column wise with or without centring with or without scaling returns attributes showing the central values means and scaling values standard devia tions used gt scale cm 1 2 L3 4 1 381930 0 10742 0 24678 0 689 2 0 087464 1 39642 0 44420 0 053 3 0 297377 0 32225 1 46421 1 431 4 0 997089 0 96676 0 77323 0 689 attr scaled center 1 15 25 4 50 15 75 5 25 attr scaled scale 1 14 2916 4 6547 15 1959 4 7170 There are also functions to compute the determinant det eigenvalues and eigenvectors eigen the singular value decomposition svd the OR decomposition qr and the Choleski factorization chol these use long standing numerical codes from LINPACK LAPACK and EISPACK 4 7 Data frames The fundamental S data structure for statistical modelling is the data frame This can be thought of as a
142. omputing environment and a modern programming language Almost every day brings new and modi fied packages added to CRAN New versions of the R base appear regularly about twice a year This means the R user needs to invest some time to keep up with developments Read the R Journal This is issued about four times a year and is announced on R home page It is an attractive PDF document with news announcements tutorials programmer s tips bibliographies and much more Prior to 2009 it was known as the R Newsletter this is archived along with the Journal Many important developments in R were and are explained and illustrated with examples in the Journal and Newsletter Subscribe to one or more mailing lists follow the Mailing Lists link on the R Project home page The most relevant for most ITC users are R announce major announcements e g new versions 60 http journal r project org 113 R packages announcements of new or updated packages R help discussion about problems using R and their solutions The R gurus monitor this list and reply as necessary A search through the archives is a good way to see if your problem was already discussed You can access this search via the Search link on the R home page or from within R with the RSiteSearch method gt RSiteSearch logistic and ROC restrict Rhelp02a sortby date late A search query has been submitted to http search r projec
143. on expects its first un named argument to be the starting point of the vector and its second to be the end 21 The command separator This example shows the use of the command separator This allows several commands to be written on one line In this case the first command computes the sequence and stores it in an object and the second displays this object This effect can also be achieved by enclosing the entire expression in parentheses because then S prints the value of the expression which in this case is the new object gt s lt seq from 20 to 0 by 2 1 20 18 16 14 12 10 8 6 4 2 0 Named arguments give more flexibility this could have been written with names gt s lt seq to 0 from 20 by 2 1 20 18 16 141210 8 6 4 2 0 but if the arguments are specified only by position the starting value must be before the ending value For each function the list of arguments both positional and named and their meaning is given in the on line help gt seq Any element or group of elements in a vector can be accessed by using subscripts very much like in mathematical notation with the select array elements operator gt samp 1 1 1 239197 samp 1 3 1 1 23919739 0 03765046 2 24047546 gt samp c 1 10 1 1 239197 9 599777 The notation 1 3 using the sequence operator produces the sequence from 1 to 3 The catenate function The notation c 1 10 is an example of the very usefu
144. oncept in R is the workspace which contains the local data and procedures for a given statistics project Under Windows this is usually determined by the folder from which R is started Under Windows the easiest way to set up a Statistics project is 1 Create a shortcut to RGui exe on your desktop 2 Modify its properties so that its in your working directory rather than the default e g P R bin Now when you double click on the shortcut it will start R in the directory of your choice So you can set up a different shortcut for each of your projects Another way to set up a new statistics project in R is 1 Start R as just described double click the icon for program RGui exe in the Explorer 2 Select the File Change Directory menu item in R 3 Select the directory where you want to work 4 Exit R by selecting the File Exit menu item in R or typing the qO command R will ask Save workspace image Answer y Yes This will create two files in your working directory Rhistory and RData The next time you want to work on the same project 1 Open Explorer and navigate to the working directory 2 Double click on the icon for file RData R should open in that directory with your previous workspace already loaded If R does not open instead Explorer will ask you what programs should open files of type RData navigate to the program RGui exe and select it If you don t see the file RData in y
145. onsider the sample confusion matrix of Congalton et al 6 also used as an example by Skidmore 53 and Rossiter 43 2 Reference Class A B C D A 35 14 11 1 Mapped B 4 11 3 0 Class C 12 9 38 4 D 2 5 12 2 This can be entered as a list in row major order gt cm c 35 14 11 1 4 11 3 0 12 9 38 4 2 5 12 2 gt cm 8 This matrix is also used as an example in 86 1 25 1 35 1411 1 411 3 012 9 38 4 2 5 122 gt dim cm NULL Initially the list has no dimensions these may be added with the dim function gt dim cm c 4 4 gt cm 1 2 3 4 1 35 4 12 2 2 14 11 9 5 3 11 3 38 12 4 1 0 4 2 gt dim cm 1 4 4 attributes cm dim 1 4 4 attr cm dim 1 4 4 The attributes function shows any object s attributes in this case the object only has one its dimension this can also be read with the attr or dim function Note that the list was converted to a matrix in column major order follow ing the usual mathematical convention that a matrix is made up of column vectors The t transpose function must be used to specify row major or der gt cm t cm gt cm 1 2 3 4 1 35 14 11 1 2 4 11 3 0 3 12 9 38 4 4 2 5 12 2 A new matrix can also be created with the matrix function which in its simplest form fills a matrix of the specified dimensions rows columns with the value of its firs
146. or distribution References Dobson A J 1990 An Introduction to Generalized Linear Models London Chapman and Hall Hastie T J and Pregibon D 1992 Generalized linear models Chapter 6 of Statistical Models in S eds J M Chambers and T J Hastie Wadsworth amp Brooks Cole McCullagh P and Nelder J A 1989 Generalized Linear Models London Chapman and Hall Venables W N and Ripley B D 2002 Modern Applied Statistics with S New York Springer You can also look up technical terms in your favourite statistics textbook One of R s strongest points is that you are not limited to textbook or routine methods you can use the most modern techniques that you see referenced in papers usually with contributed packages As a last resort you can program a procedure yourself 123 A Obtaining your own copy of R Within the ITC computing environment R can be installed via the Software Manager From anywhere in the world everything R is found via the R Project Home Page which has links to Download R and additional packages from CRAN The Comprehen sive R ArchiveNetwork the first time you attempt to download you will be asked to select a mirror i e one of the many servers that host the R distribution Installation instructions Manuals Frequently Asked Questions FAQ The R Journal including innovative statistical applications clever uses of R and R programming If you wan
147. our Explorer this is because Windows 15 This is a special case of the q function Revealing hid considers any file name that begins with to be a hidden file You need den files in to select the Tools Folder options in Explorer then the View tab Windows and click the radio button for Show hidden files and folders You must also un check the box for Hide file extensions for known file types 3 2 3 Saving your analysis steps The File Save to file menu command will save the entire con sole contents i e both your commands and R s response to a text file which you can later review and edit with any text editor This is useful for cutting and pasting into your reports or thesis and also for writing scripts to repeat procedures 3 2 4 Saving your graphs In the Windows version of R you can save any graphical output for in sertion into documents or printing If necessary bring the graphics win dow to the front e g click on its title bar select menu command File Save as and then one of the formats Most useful for insertion into MS Word documents is Metafile most useful for BIFX is Postscript most useful for PDFLaTeX and stand alone printing is PDF You can later review your saved graphics with programs such as Windows Picture Editor If you want to add other graphical elements you may want to save as a PNG or JPEG however in most cases it is cleaner to add annotations within R itself You can
148. par mfcol commands The difference is the order in which figures will be drawn either row or column wise For example to set up a two by two matrix of four histograms and fill them from left to right top to bottom the par method refers to the active device par mfrow c 2 2 hist rnorm 100 hist rbinom 100 20 5 hist rpois 100 1 hist runif 100 par mfrow c 1 1 next plot will fill the window 5 4 2 Trellis graphics VVVYV VVVVV Vv A trellis 5 2 graphics window can also be split but in this case the print method of the lattice package must be used on an object of class trellis which might be built in the same command using the split optional argument to specify the position x and y within a matrix of plots For all but the last plot on a page the more T argument must be specified Repeating the previous example print histogram rnorm 100 split c 1 1 2 2 more T print histogram rbinom 100 20 5 split c 2 1 2 2 more T print histogram rpois 100 1 split c 1 2 2 2 more T print histogram runif 100 split c 2 2 2 2 more F A more elegant way to do this is to create plots with any lattice method including levelplot xyplot and histogram but instead of displaying them directly saving them using the assignment operator lt as trellis objects this is the object type created by lattice graphics methods and then print them with lattice s print method The advanta
149. polluted by cadmium gt library sp gt data meuse gt attach meuse gt xyplot y x data meuse asp iso pch 20 cex 2 col topo colors length cadmium rank cadmi um This plot is shown in Figure 16 Note the use of the asp argument to xyplot to specify the aspect ratio of axes in this case proportional to the two scales isometric The key methods here are topo colors to return a range of colours from dark blue through light pink length to find the number of points and from this set the length of the colour ramp and rank to return the rank of each sample based on its cadmium content the lowest pollution is rank 1 and so on to the most polluted with rank equal to the number of samples Then the correct colour for each sample is extracted from the vector with the subscript operator 89 Soil samples Meuse colour ramp by Cd value Peo mp 333000 4 y e so ao e e e eee o 3 o e 332000 4 sae o e o 4 nt cdo e oo eoo 331000 A PEU JM E e e 9 e e e e e o ge e a a e 330000 e o e e b e T T T T T r 178500 179000 179500 180000 180500 181000 X Figure 16 Example of a colour ramp 90 6 Preparing your own data for R R comes with many interesting example datasets 3 8 but of course you are most interested in your own data There are many ways to prepare this for R a comprehensive guide is given in the online R Data Impor
150. r each class that inherits from class Spatial we can see this with the showMethods method gt showMethods spplot Function spplot obj SpatialPixelsDataFrame obj SpatialGridDataFrame obj SpatialPolygonsDataFrame obj SpatialLinesDataFrame obj SpatialPointsDataFrame For more on the S4 class system see Chambers 3 Ch 7 amp 8 and Venables amp Ripley 56 Ch 5 4 14 Descriptive statistics Numeric vectors can be described by a set of functions with self evident names e g min max median mean length gt data trees attach trees 48 gt min Volume max Volume median Volume mean Volume length Volume 1 10 2 1 77 1 24 2 1 30 17097 1 31 Another descriptive function is quanti le quantile Volume 0 25 50 75 100 10 2 19 4 24 2 37 3 77 0 gt quantile Volume 1 10 15 6 gt quantile Volume seq 0 1 by 1 0 10 20 30 40 50 60 70 80 90 100 10 2 15 6 18 8 19 9 21 4 24 2 27 4 34 5 42 6 55 4 77 0 The summary function applied to data frames combines several of these descriptive functions gt summary Volume Min 1st Qu Median Mean 3rd Qu Max 10 20 19 40 24 20 30 17 37 30 77 00 Some summary functions are vectorized and can be applied to an entire data frame gt mean trees Girth Height Volume 13 24839 76 00000 30 17097 gt summary trees Girth Height Volume Min 8 30 Min 163 Min 2 10 20 Ist Qu 11 05 Ist Qu 72 Ist Q
151. r of plot rows columns rows lt 2 cols lt 2 reps lt rows cols parameters of the normal distribution mu 180 sd 20 set up graphic display par mfrow c rows cols number of s d s for histogram display sdd 3 5 compute bin width from s d and the number of bars for each bin width sd 3 scale x axis x min lt mu sdd sd x max lt mu sdd sd scale y axis y max lt nx0 5 bin width sd compute and display each graph for i in 1 reps v rnorm n mu sd hist v xlim c x min x max ylim c O y max breaks seq mu 5 sd mu 5 sd by bin width main xlab paste Sample i X seq x min x max length 120 true normal distribution points x dnorm x mu sd n xbin width type 1 col blue lty 1 l1wd 1 8 distribution estimated from sample points x dnorm x mean v sd v n bin width type 1 col red lty 2 l1wd 1 8 print sample params and Pr Type I error text x min 0 9 y max paste mean round mean v 2 pos 4 text x min 0 8 y max paste sdev round sd v 2 pos 4 text x min 0 7 y max paste Pr t round t test v musmu p value 2 pos 4 clean up par mfrow c 1 1 rm n rows cols reps mu sd v i sdd bin width x min x max y max x 129 mu 180 sigma 20 mean 177 sdev 16 45 I Pr t 0 33 o 44 o 9 m aH o d I T T T T T 1 120 140 160 180 200 220 240 Sample 1 mu
152. rac tical texts with many examples of R code among the topics are time series 26 data prepration 54 spatial data analysis 2 Bayesian data anlysis 1 Lattice graphics 51 dynamic visualization 7 wavelet analysis 29 and non linear regression 41 It is increasingly common for sophisticated statistics texts to use R to illustrate their concepts an example is the geostatistics text of Diggle amp Ribeiro Jr 13 which uses the authors geoR package 39 So theory and practice go hand in hand 53 http www r project org 54 http www biosino org R R doc 5 http rbbs biosino org Rbbs forums list page 56 http www springer com series 6991 detai lsPage titles as of 13 August 2012 The text of Shumway amp Stoffer 52 on time series analysis uses R code for illustration they also provide an on line tutorial at the book s supporting website 10 4 Technical notes using R 10 5 I have written several technical notes on statistical topics using R to com pute and graph these are all available as PDF files on line If you work through these and use them as starting points for your own analysis you will have a good basis in R One note 48 is designed as a tutorial on R and the S language Another 46 is designed to show as many R tech niques as possible exploratory data analysis univariate statistics bivari ate correlation and regression multivariate analysis including PCA and some geost
153. ract or build design matrices of linear models this is discussed in most regression texts for example Christensen 4 The design matrix of a model is extracted with the model matrix func tion gt model lt Im Volume Height Girth data trees gt X lt model matrix model Intercept Height Girth 1 1 70 8 3 2 1 65 8 6 30 1 80 18 0 31 1 87 20 6 This matrix contains the values of the predictor variables for each obser vation This provides a good check on your understanding of the model structure The matrix can be used to directly compute the least squares linear solution B X X X Y using the t matrix transpose and solve matrix inversion function and the matrix multiplication operator For example to directly com pute the regression coefficients for the model of tree volume predicted by height and girth in the trees dataset gt Y lt trees Volume gt beta lt solve t X 96x96 X t X 96x96 Y 1 Intercept 57 98766 Height 0 33925 Girth 4 70816 gt check this is the same result as from ImQ gt Im trees Volume trees Height trees Girth Coefficients Intercept trees Height trees Girth 57 988 0 339 4 708 The direct computation may be numerically unstable and is certainly slow 1m uses more sophisticated numerical functions 4 17 1 Models with categorical predictors The 1m and glm functions are also used for models with categorical pre dictors and for mixed m
154. reen 4ff0000 orange 6 400ff00 brown font 1 1111111 pch 1 1111111 fill 1 transparent transparent 5 transparent transparent gt options superpose symbol pch 1 1111111 options superpose symbol pch 20 options superpose symbol col lt c blue green red magenta cyan black grey options superpose symbol cex 1 4 trellis par set superpose symbol options superpose symbol xyplot Sepal Width Sepal Length group Species iris non non non transparent transparent transparent non non V VN BF M There are a large number of options each with sub options For example the superposition symbol has a character code pch a vector of colours col a vector of fonts font and a character expansion fraction cex These can all be set and then written back as shown Subsequent graphs use the changed parameters Colour options are often specified with colour ramps see 85 5 for details 5 3 Multiple graphics windows To open several graphics windows at the same time use the windows method R opens the first graphics window automatically in response to the first graphics method such as plot or hist in the following example we assume no such commands have yet been given 84 gt dev listQ NULL gt windows gt dev listQ windows 2 gt dev cur windows 2 At this point there is only one window and it is of course the current graphics device i e number
155. req 1 ffreq 1 2 3 54 2 31 0 14 8 gt table ffreq landuse landuse ffreq Aa Ab Ag Ah Am B Bw DEN Fh Fw Ga SPO STA Tv W 108 519 6 3 3 015 3 0 0030 210 01411 0 1 1 04 0 1 2 112 3 10065 02 00100 0 0 8 The last example is the cross classification or contingency table showing which land uses are associated with which flood frequency classes The x test for conditional independence may be performed with the chisq test function gt chisq test ffreq lime Pearson s Chi squared test data ffreq and lime 3 No I don t know what the name stands for either 50 X squared 26 81 df 2 p value 1 508e 06 Both the x statistic and the probability that a value this large could occur by chance in the null hypothesis of no association is true the p value are given the second is from the x table with the appropriate degrees of freedom df Here it is highly unlikely meaning the flood frequency and liming are not independent factors 4 16 Sets S has several functions for working with lists including vectors as sets i e a collection of elements these include the is element union intersect setdiff setequal functions The unique function removes duplicate el ements from a list the duplicated function returns the indices of the duplicate elements in the original list The setdiff function can be used to select the complement of a defined set For example in 84 9 we selected random elements of th
156. required by the write table function To export this data frame we use the write table function specifying that a comma separate the fields by using the optional sep argument gt write table round as data frame kxy 4 file KrigeResult csv sep quote T row names F col names c E N LPb LPb var In this example the precision of the output was limited with the round function and named the fields with the col names optional argument to write table We also specified that no row names be written with the optionaol row names F argument and that any strings be quoted with the 104 optional quote T argument Here are the first few lines of the file KrigeResult csv viewed in a plain text editor such as Notepad E N LPb LPb var 181180 333740 5 7214 0 2932 181140 333700 5 6772 0 2254 181180 333700 5 6064 0 2391 105 9 Reproducible data analysis Since R is a programming language it can be used to write scripts 3 5 which can then be run at any time to reproduce the analysis This is especially useful if the data changes or to apply the same analysis to a different dataset It also allows the exact same figure to be produced A step further is to integrate the R analysis into a text document Rather than cutting and pasting a better approach is to prepare the text doc ument with BTrX 23 22 and incorporate executable R code into the document This ingenious approach is known
157. s gt 0 1 8 Imports utils lattice grid Index CRS class Class CRS of coordinate reference system arguments DMS class Class DMS for degree minute decimal second values zerodist find point pairs with equal spatial coordinates To find all packages loaded at a given time use the sessionInfo com mand gt sessionInfo R version 2 15 0 2012 03 30 Platform x86 64 apple darwin9 8 0 x86 64 64 bit locale 1 C attached base packages 1 stats graphics grDevices utils datasets methods 7 base other attached packages 1 gstat 1 0 10 spacetime 0 6 2 xts 0 8 6 4 zoo 1 7 7 sp 0 9 98 loaded via a namespace and not attached 1 grid 2 15 0 lattice 0 20 6 tcltk 2 15 0 tools 2 15 0 11 6 What statistical procedure should I use This of course depends on your application field your research questions and your dataset You should always refer to textbooks and research pa pers R is a computer program to carry out your computations not a statistical wizard to design them Within R each help page gives references to textbooks or articles which you should consult if you are unsure about the theory behind the method or its options For example gt help glm 122 gim package stats R Documentation Fitting Generalized Linear Models Description gim is used to fit generalized linear models specified by giving a symbolic description of the linear predictor and a description of the err
158. s closer d 0 4066 V MV NN N y round new pt y 4 pt is closer d round d 4 4 4 V 1 Point 2 is not closer 1 Point 3 is closer d 0 1478 1 Point 4 is not closer 1 Point 5 is not closer 1 Point 6 is not closer 1 Point 7 is not closer 8 1 Point is not closer 1 Point 9 is not closer 1 Point 10 is not closer gt print paste Closest point x round sample pts closest pt x 4 y round sample pts closest pt y 4 1 Closest point x 0 2061 y 0 8003 plot sample pts xlim c 0 1 ylim c 0 1 pch 20 col ifelse row names sample pts closest pt main Finding the closest point text sample pts row names sample pts pos 3 points new pt pch 20 col blue cex 1 2 text new pt target point pos 3 col blue arrows new pt x new pt y sample pts closest pt x sample pts closest pt y length 0 05 red green VVVV V 4 22 User defined functions An R user can use the function function to define functions with argu ments including optional arguments with default values See the example in C and a good introduction in Chapter 10 of 36 These then are objects in the workspace which can be called For example here s a function to compute the harmonic mean of a vector 65 Finding the closest point 1 0 target point 8 3 0 6 eu 0 4 eo 0 2 Figure 5 Finding the closest point
159. s regres sion lines for each species to the scatterplot of petal width and length 80 Anderson Iris data all species Wireframe plot Maunga Whau Volcano Auckland Sepal Length Petal Length 60 50 40 20 Levelplot Maunga Whau Volcano Auckland Contourplot Maunga Whau Volcano Auckland 200 60 r 200 180 50 40 30 column 20 20 40 60 80 row contour interval 10 m row Figure 13 Trellis trivariate plots gt xyplot Petal Width Petal Length data iris colec darkgreen navyblue firebrick as numeric iris Species pch 20 xlab Petal length ylab Petal width main Anderson Iris data panel function x y t1 panel fi11 Ccol antiquewhite3 panel xyplot x y panel abline Im y x col black for lvl in 1 length levels Species 1 81 tet ett panel ablineClm Cy x subset Species levels Species lvl col cC darkgreen navyblue firebrick 1v1 ty 2 D This plot is shown in Figure 14 Anderson Iris data Petal width Petal length Figure 14 Trellis scatter plot with some added elements The panel argument is used whenever we want to control the appear ance of the panel beyond the default plot Here it is a function which takes as arguments the dummy variables x and y which represent the two variables of the x y plot The argument to the pan
160. science and data analysis series Chapman amp Hall CRC ISBN 0849316227 69 29 Nason G P 2008 Wavelet methods in statistics with R Use R New York London Springer 111 30 Paradis E 2002 R for Beginners Montpellier F University of Mont pellier URL http cran r project org doc contrib rdebuts_en pdf 111 31 Paradis E 2002 R para Principiantes Montpellier F University of Montpellier URL http cran r project org doc contrib rdebuts_es pdf 111 32 Paradis E 2002 R pour les d butants Montpellier F University of Montpellier 134 33 34 35 36 37 38 39 40 41 42 43 44 URL http cran r project org doc contrib rdebuts fr pdf 111 Pebesma E J 2004 Multivariable geostatistics in S the gstat package Computers amp Geosciences 30 7 683 691 2 5 Pebesma E J amp Bivand R S 2005 Classes and methods for spatial data in R R News 5 2 9 13 URL http CRAN R project org doc Rnews 45 Pebesma E J amp Wesseling C G 1998 Gstat a program for geostatis tical modelling prediction and simulation Computers amp Geosciences 24 1 17 31 URL http www gstat org 5 R Development Core Team 2012 An Introduction to R Vienna The R Foundation for Statistical Computing 2 15 0 2012 03 30 edition URL http cran r project org doc manuals R intro pdf 19 42 64 65 69 R Development Core Team 2012 R Data Import Export Vienna
161. sed linear models with hyperlinks in the HTML help to directly access their help 2 Search for keywords For example help search sequence lists methods to generate sequences vectors of sequences and sequences of dates for time series analysis 118 Look at the Task Views These are a summary of the facilities in R to accomplish certain tasks multivariate methods spatial statistics including the names of the applicable packages and methods Review the tutorials 810 2 they cover many common methods The contributed documentation at CRAN has many introductions and reference cards for specific kinds of analysis for example the brief reference cards by Vito Ricci for regression analysis and time series analysis 96 Many textbooks use R to illustrate their discussion 810 3 e g 11 19 57 15 13 you can adapt their examples to your needs If you don t find what you re looking for perhaps the method is in a contributed package which has not yet been installed on your system You can search for it on CRAN the R archive For example the von Mises distribution is the circular analogue of the normal distribution 12 p 322 On R with the default installation a search for this term with the help search method will give no results gt help search von Mises No help files found with alias or concept or title matching von Mises However you can search the R archive for the term
162. t Export manual also provided as a PDF file 37 6 1 Preparing data directly in R A small dataset may be entered directly in R in several ways Here we illustrate this with the winning times from the 100 m footrace in the mod ern Olympic games 55 which we will assemble into a data frame with the cases being each year and the fields being the year number the men s winning time and the women s winning time One method is to assign a vector directly to a variable In the case of reg ular data this can be easily accomplished with the seq or rep functions For irregular data the c function must be used to create a list yr seq 1900 2004 4 produces 1900 1904 1908 2004 men c 11 11 10 8 10 8 NA 10 8 10 6 10 8 10 3 10 3 NA NA 10 3 10 4 10 5 10 2 10 9 95 10 14 10 06 10 25 9 99 9 92 9 96 9 84 9 87 9 85 VV Note the continuation prompt from R as long as the list was not com pleted not enough to match the open Q R could determine that the expression was not complete and prompted for more input Also note the use of the special NA value to indicate a missing value no Olympic games were held in 1916 1940 or 1944 so there is no time for those years Vectors can also be entered with the scan function This waits for you to type at the console until a blank line is entered It shows the number of the next data item as a continuation prompt gt women lt scan 1 NA
163. t argument gt m lt matrix O 5 3 1 2 L 3 1 0 0 0 2 0 0 0 3 0 0 0 4 0 0 0 5 0 0 0 This value may also be a vector gt m lt matrix 1 15 5 3 byrow T 1 2 3 26 1 1 2 3 2 4 5 6 3 7 8 9 4 10 11 12 5 13 14 15 gt m matrix 1 5 5 3 1 2 D 3 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 In this last example the shorter vector 1 5 is re cycled as many times as needed to match the dimensions of the matrix in effect it fills each column with the same sequence A matrix element s rows and column are given by the row and col func tions which are also vectorized and so can be applied to an entire matrix gt col m 1 2 3 1 1 2 3 2 1 2 3 3 1 2 3 4 1 2 3 5 1 2 3 The diag function applied to an existing matrix extracts its diagonal as a vector gt d lt diag cm 1 35 11 38 2 The diag function applied to a vector creates a square matrix with the vector on the diagonal d diag seq 1 4 1 2 3 4 1 1 0 0 0 2 0 2 0 0 3 0 0 3 0 4 0 0 0 4 And finally diag with a scalar argument creates an indentity matrix of the specified size gt d lt diag 3 1 2 3 1 1 0 0 2 0 1 0 3 0 0 1 Arithmetic operators such as operate element wise on matrices as on 27 any vector if matrix multiplication is desired the operator must be use
164. t for either the 1f or else can itself containa if else control structure 4 23 Computing on the language As explained in the R Language Definition R belongs to a class of programming languages in which sub routines have the ability to modify or construct other subrou tines and evaluate the result as an integral part of the language itself 38 Ch 6 This may seem quite exotic but it has some practical applications even for the non programmer R user in addition to the deeper applications explained in the Definition For example consider the problem of summarising a set of variables that are named B1 B2 B256 To avoid writing m 1 lt mean B1 m 2 mean B2 etc we d like to loop through the numbers and form the variable name with the B prefix and a number and perform the operation We do this in three steps 1 Build up a syntactically correct string to be evaluated using the paste function 2 Parse this into an R language object with the parse function 36 Perhaps reflectances from a hyperspectral sensor 67 3 Evaluate it with the eval function gt demonstrate how the string is built up gt paste m b lt mean B b sep Error in paste m b lt mean B b sep object b not found gt must define a value for b to see how this works gt b lt 4 gt paste m b lt mean B b sep 1 m 4 lt mean B4 gt what
165. t meta name Date content 2007 08 09 lt style type text css Date Fri 31 Aug 2007 20 14 08 0500 R Systematically biased count data regression model This message Message body More options Related messages gallon li R compute ROC curve Next message Felix Andrews R Sea http finzi psych upenn edu R RhelpO2a archive 106735 html 19 480 bytes Figure 20 Results of an R site search Attend the useR user s conference every two years since 2004 The papers from these meetings are available on line 61 links to http finzi psych upenn edu search htm 8 http ww r project org conferences htm 114 11 Frequently asked questions 11 1 Help I got an error what did I do wrong 1 Read the error message carefully Often it says exactly what is wrong For example gt x lt rnorm 100 gt summary X Error in summary X Object X not found This means exactly what it says there is no object named X in the workspace nor in attached data frames so R could not find it when it tried to execute the summary method In this case the solution is clear the variable is a lower case x not an upper case X gt summary x Min 1st Qu Median Mean 3rd Qu Max 2 750 0 561 0 070 0 014 0 745 2 350 2 If the command involves an external file make sure your R session is in the right directory or else use the full path name For example the read csv method requires a file name gt dlv
166. t org The results page should open in your browser shortly Figure 20 shows the results of this search nnn R site search lt logistic and ROC gt 4 4 A A ce R http search r project org cgi bin namazu cgi query logistic and ROC Qr Go gle QA mydelicious postto delicious ITCv AS Wikipedia Scholarpedia Rv Xv Sciencey Money XinHuav DGRY R Site Search Note more than two search terms may fail Query logistic and ROC Search How to search Display 20 Description normal Sort by date latest on top B Target C3 Functions Documents V R help 2002 Rhelp 1997 2001 Results References logistic 2939 ROC 379 Total 50 documents matching your query 1 R pseudo R square and istic in R logistic regression from Chuck Cleland on 2008 03 27 stdin score 1 Author Chuck Cleland ccleland Date Tue 01 Apr 2008 08 47 48 0500 R pseudo R square and or C statistic in R logistic regression This message Message body More options Related messages Tribo Laboy R Rule for accessing attributes Next message http finzi psych upenn edu R RhelpO2a archive 125833 html 8 951 bytes 2 R Systematically biased count data regression model from Steven McKinney on 2007 08 09 stdin score 1 Author Steven McKinney smckinney gt lt meta name Subject content R Systematically biased count data regression model gt l
167. t to use an IDE such as RStudio it should be installed after installing R itself Installing R for Windows To install R on Windows download the setup program from CRAN by following the links for Windows then the base package then selecting the setup program the exact name depending on the version Download the file about 32 Mb and run it this will install R and its base packages This link will redirect to the current Windows binary release http mirrors dotsrc org cran bin windows base release htm Note for Windows system managers the R Windows FAQ in the same directory as the setup program has extensive infor mation on administering R for Windows The setup installs the base R system and some of the most common li braries Table 4 68 http www r project org 69 http cran r project org 124 base The R Base Package chron Chronological objects which can handle dates and times class Functions for classification cluster Functions for clustering datasets The R Datasets Package foreign Read data stored by Minitab S SAS SPSS Stata graphics The R Graphics Package grDevices The R graphics devices s upport for colours and fonts grid The Grid Graphics Package KernSmooth Functions for kernel smoothing lattice Lattice Graphics MASS Main Library of Venables and Ripley s MASS methods Formal Methods and Classes mle Maximum likelihood estimation multcomp Multiple Tests and S
168. the dimension names for the rows and columns gt cm as matrix read csv cm csv gt cm A B CD A 35 14 11 1 B 411 30 C12 9384 D 2 5122 gt class cm 1 matrix gt attributes cm dim 1 4 4 dimnames dimnames 1 1 A Bg ch Dp dimnames 2 1 A B C p row names 1 101 7 2 Importing images Some data such as satellite imagery is naturally organized as a matrix of data values on a regular grid Many export formats write a header of several lines giving information on the file contents followed by the data values as one long vector or else with each row of the imagery as a row in the text file An example is the ESRI asciigrid format Here is a small example of a 12 by 12 image stored as ASCII text file image grd We display it within R with the file show function gt file show image grd NCOLS 12 NROWS 12 XLLCORNER 180000 YLLCORNER 331000 CELLSIZE 40 NODATA VALUE 9999 141 143 147 151 153 153 153 157 159 159 163 166 148 153 154 159 163 164 164 164 169 170 170 175 159 160 164 166 170 175 176 176 176 180 182 182 166 170 172 176 177 182 186 188 188 188 192 193 176 177 182 183 188 189 193 198 199 199 199 204 186 188 189 193 195 199 201 205 209 211 211 209 192 198 199 201 205 206 211 212 217 221 221 211 195 204 209 211 212 217 218 222 224 228 222 211 201 206
169. tistical meth ods that use R or S for illustration Every computational step is recorded and this history can be saved for later use or documen tation It stimulates critical thinking about problem solving rather than a push the button mentality It is fully programmable with its own sophisticated computer language 84 Repetitive procedures can easily be automated by user written scripts 83 5 It is easy to write your own functions SB and not too difficult to write whole packages if you invent some new analysis All source code is pub lished so you can see the exact algorithms being used also expert statisticians can make sure the code is correct It can exchange data in MS Excel text fixed and delineated formats e g CSV so that ex isting datasets are easily imported 86 and results computed in R are easily exported 88 Most programs written for the commercial S PLUS program will run unchanged or with minor changes in R 82 3 1 3 http www gnu org copyleft gpl html 2 2 Disadvantages R has its disadvantages although every disadvantage has its advantage 2 3 Alternatives 18 The default Windows and Mac OS X graphical user interface GUI 83 2 is limited to simple system interaction and does not include statistical procedures The user must type commands to enter data do analyses and plot graphs This has the advantage that the user has complete control over the system The Rcmdr add on packa
170. to Java to allow R to directly read and write Microsoft Excel worksheets It includes a loadworkbook function to create a workbook and a writeWorksheet function to write all or part of one sheet of a data frame to a worksheet A common reason for exporting is that you have computed some result in R which you want to use in another program For example suppose you ve computed a kriging interpolation with the krige function of the gstat package load sp library to support spatial classes load gstat for variograms and kriging library sp library gstat load sample data and interpolation grid data meuse data meuse grid convert these to spatial objects coordinates meuse lt x y coordinates meuse grid x y krige using a known variogram model kxy lt krigeClog lead x y locations meuse newdata meuse grid maxdist 800 model vgm 0 34 Sph 1140 0 08 using universal kriging gt str as data frame kxy data frame 3103 obs of 4 variables x y num 181180 181140 181180 181220 181100 num 333740 333700 333700 333700 333660 varl pred num 5 72 5 68 5 61 5 54 5 67 varl var num 0 293 0 225 0 239 0 257 0 173 Note the use of the coordinates function to promote the meuse and meuse grid data frames to objects of class Spatial PointsDataFrame as required by the krige function and the as data frame function to extract the data frame from the kriged object as
171. tributed by the Insightful corporation and is a popular choice for large scale commerical statistical computing Like R it is a dialect of the original S language developed at Bell Laborato For example the t transpose function only can be applied to matrices For example the summary and plot methods give different results depending on the class of object 6 http www insightful com ries S PLUS has a full graphical user interface GUI it may be also used like R by typing commands at the command line interface or by running scripts It has a rich interactive graphics environment called Trellis which has been emulated with the lattice package in R 85 2 S PLUS is licensed by local distributors in each country at prices ranging from moderate to high depending factors such as type of licensee and application and how many computers it will run on The important point for ITC R users is that their expertise will be immediately applicable if they later use S PLUS in a commercial setting 2 3 2 Statistical packages There are many Statistical packages including MINITAB SPSS Statistica Systat GenStat and BMDP which are attractive if you are already familiar with them or if you are required to use them at your workplace Although these are programmable to varying degrees it is not intended that special ists develop completely new algorithms These must be purchased from local distributors in each country and the pur
172. twd 110 ggplot2 package 69 glm 52 55 62 graphics package 69 gray 88 grid 72 grid package 69 GridTopology package sp class 48 groups trellis graphics argument 79 gstat package 2 6 17 94 99 hatvalues 59 heat colors 88 help 12 116 help search 13 111 114 hist 74 76 84 histogram package lattice 83 86 hsv 89 I 54 identify 76 if else 64 67 ifelse 65 image 76 influence measures 60 intersect 51 iplots package 121 iris dataset 17 39 69 75 78 is element 51 is factor 45 55 is na 112 is numeric 45 krige package gstat 99 lapply 29 lattice package 5 17 69 77 80 83 85 86 legend 72 73 length 23 25 48 89 LETTERS constant 44 92 letters constant 44 levelplot package lattice 52 79 83 86 levelplot 80 levels 34 36 library 17 45 lines 72 list files 110 1m 35 52 55 57 58 62 73 Im ridge package mass 62 139 1oadWorkbook 93 loess 62 log 19 54 10910 19 log2 19 lqs Cpackage mass 62 111 lqs 73 1s 21 lty graphics argument 72 main graphics argument 70 MASS package 17 62 63 73 76 111 matplot 76 matrix 26 92 98 max 25 48 mean 48 median 48 memory limit 115 memory size 115 meuse dataset 46 50 56 89 94 99 meuse grid dataset 99 min 48 mle vonmises package circular 114 model matrix 55 mosaicplot 76 mtext 72 na omit 63 names 31 41 nlme pac
173. u 19 40 Median 12 90 Median 76 Median 24 20 Mean 713 25 Mean 76 Mean 130 17 3rd Qu 15 25 3rd Qu 80 3rd Qu 37 30 Max 20 60 Max 87 Max 77 00 Others are not but can be applied to margins of a matrix or data frame with the apply function gt apply trees 2 median Girth Height Volume 12 9 76 0 24 2 The margin is specified in the second argument 1 for rows 2 for columns fields in the case of data frames 49 Surprisingly base R has no functions for skewness and kurtosis these are provided by the e1071 package written by TU Wien gt require e1071 gt skewness trees Volume 1 1 0133 gt kurtosis trees Volume 1 0 24604 The kurtosis here is excess kurtosis i e subtracting 3 from the kurtosis 4 15 Classification tables For data items that are classified by one or more factors the table func tion counts the number of observations at each factor level or combination of levels We illustrate this with the meuse dataset included with the sp package This dataset includes four factors gt require sp data meuse str meuse data frame 155 obs of 14 variables ffreq Factor w 3 levels 1 2 3 1 111111111 soil Factor w 3 levels 1 2 3 1112222112 lime Factor w 2 levels 0 1 2221111111 landuse Factor w 15 levels Aa Ab Ag 44 4 11 4 11 attach meuse gt table ffreq ffreq 1 2 3 84 48 23 gt round 100 table ffreq length ff
174. unction gt require Rcemdr As it is loaded it starts up in another window with its own menu system You can run commands from these menus but you can also continue to type commands at the R prompt Figure 3 shows an R Commander screen shot To use Rcmdr you first import or activate a dataset using one of the commands on Rcmdr s Data menu then you can use procedures in the Statistics Graphs and Models menus You can also create and graph probability distributions with the Distributions menu 23 http www winedt com 24 http en wikipedia org wiki Emacs 25 http stat ethz ch ESS 15 E R Commander oog Fie Edit Data Statistics Graphs Models Distributions Help R Data set trees Ed Index plot No active model Submit oe Histogram as MACHIOEERA PAGKA RETINENS attach trees Bios n names trees Quantile comparison plot summary trees Scatterplot Scatterplot matrix 3D scatterplot Line graph Plot of means Bar graph Pie chart Save graph to file gt data trees package datasets gt attach trees gt names trees 1 Girth Height Volume gt summary trees Girth Height Volume Min sr 8 3 Min 63 Min 10 2 1st Quril11 1 ist Qu 72 ist Qu 19 4 Median 12 9 Median 76 Median 24 2 Mean 213 4 Mean 76 Mean 30 2 3rd Qu 15 3 3rd Qu 80 drd Qu 37 3 Max 220 6 Max Bu Figure 3 T
175. var function gt sample seq 1 8 1 12345678 gt sample mean sample 1 3 5 2 5 1 5 0 5 0 5 1 5 2 5 3 5 gt sample mean sample 2 1 12 25 6 25 2 25 0 25 0 25 2 25 6 25 12 25 gt sum sample mean sample 2 1 42 gt sum sample mean sample 2 length sample 1 1 6 var sample 1 6 In the expression sample mean sample the mean mean sample a scalar is being subtracted from sample a vector The scalar is a one element vector it is shorter than the eight element sample vector so it is re cycled the same mean value is subtracted from each element of the sample vector in turn the result is a vector of the same length as the sample Then this entire vector is squared with the operator this also is applied element wise The sum and length functions are examples of functions that summarise a vector and reduce it to a scalar Other functions transform one vector into another Useful examples are sort which sorts the vector and rank which returns a vector with the rank order of each element of the original vector 23 gt data trees gt trees Volume 1 10 3 10 3 10 2 16 4 18 8 19 7 15 6 18 2 22 6 19 9 11 24 2 21 0 21 4 21 3 19 1 22 2 33 8 27 4 25 7 24 9 34 5 22 31 7 36 3 38 3 42 6 55 4 55 7 58 3 51 5 51 0 77 0 gt sort trees Volume 1 10 2 10 3 10 3 15 6 16 4 18 2 18 8 19 1 19 7 19 9 11 21 0 21 3 21 4 22 2 22 6 24 2 24 9 25 7 27 4 31 7 33 8 22 34 5 36 3
176. x ponentiation sqrt to extract square roots abs for absolute value round ceiling floor and trunc for rounding and related operations trigono metric functions such as sin and inverse trigonometric functions such as 6 m RGui menu command Help Manuals R Language Manual 27 n RGui menu command Help Manuals R Introduction 19 asin gt log 10 10g10 10 10g2 10 1 2 3026 1 1 1 3 3219 gt round log 10 1 2 gt sqrt 5 1 2 2361 sin 45 pi 180 1 0 7071 gt asin 1 pi 180 1 90 4 2 Creating new objects the assignment operator New objects in the workspace are created with the assignment operator which may also be written as gt mu 180 gt mu 180 The symbol on the left side is given the value of the expression on the right side creating a new object or redefining an existing one here named mu in the workspace and assigning it the value of the expression here the scalar value 180 which is stored as a one element vector The two character symbol lt must be written as two adjacent characters with no spaces Now that mu is defined it may be printed at the console as an expression gt print mu 1 180 gt mu 1 180 and it may be used in an expression gt mu pi 1 57 2958 More complex objects may be created gt S lt seq 10 gt sS 1 1 2 3 4 5 6 7 8 910 This creates a new object named s in the workspace and assigns
177. y We illustrate this with a simplified version of the meuse dataset from the gstat and sp packages which has been prepared to illustrate some issues with import Here is a small CSV file written by Excel and viewed in a plain text editor such as Notepad x y cadmium elev dist om ffreq soil lime landuse 181072 333611 11 7 7 909 0 00135803 13 6 1 1 1 Ah 181025 333558 8 983 0 0122243 14 1 1 1 Ah 181165 333537 6 8 0 103029 13 1 1 1 Ah 181298 333484 2 655 0 190094 8 1 2 0 Ga 2 4 3 1 181307 333330 48 0 27709 8 7 1 2 0 Ah 6 6 5 7 6 7 8 7 181390 333260 3 7 791 0 364067 7 8 1 2 0 Ga Note that There is one line per observation record Each record consists of the same number of fields which are sepa rated by commas The first line has the same number of fields as the others but consists 99 of the field names Suppose this file is named example csv To read into R we first change to the directory where the file is stored and then read into an R object with the read csv function gt ds read csv example csv gt str ds data frame 6 obs of 10 variables x int 181072 181025 181165 181298 y int 333611 333558 333537 333484 cadmium num 11 7 8 6 6 5 2 6 2 8 3 elev num 7 91 6 98 7 80 7 66 7 48 dist num 0 00136 0 01222 0 10303 0 19009 om num 13 6 14 13 8 8 7 7 8 ffreq int 111111 soil int 111222 lime int 111000 landuse Factor w 2

Download Pdf Manuals

image

Related Search

Related Contents

Nokia Bluetooth Carkit CK-7W  Simrad RGC12 Boating Equipment User Manual  1. - Factory RV Surplus  avvertenza importante - important warning avertissement  Samsung SGH-P300 Vartotojo vadovas  EUR-USA v8.0 USER MANUAL  User Manual Manuel d`utilisation Manual del usario  Simpli Home AXCAVA006 Instructions / Assembly  

Copyright © All rights reserved.
Failed to retrieve file