S-PLUS 7 Enterprise Developer User's Guide

1. [Figure 4.4: P8.dataNz.bd in the Data Viewer. Summary pane: 72 columns in total, all numeric (0 factor, string, date, or other columns); 32165 rows.]

   Displaying a Bar Plot

   In this exercise, using the normalized data from the section Transforming the Data on page 99, produce a single bar plot to show the national average of female and male age distributions for the whole population. This bar plot shows females to the left of 0 and males to the right.

   To display the gender bar plot:

   1. In the Commands window, type:

          barplot(rbind(P8.dataN.mean[1:18], P8.dataN.mean[19:36]), horiz=T)

   2. Examine the plot, and notice the baby boom ages and the subsequent boomlet. Also note the difference in population between genders at greater ages.

   [Figure 4.5: Bar plot of age and gender data.]

   For more information about the graph functions available for large data sets, see Chapter 5, Creating Graphical Displays of Large Data Sets.

   Joining Columns

   In the course of our data processing, the data and the geographic information have become separated. In this exercise, join the normalized data row by row with the informational columns.

   To join columns:

   1. In this step, combine the transformed gender and age data set, P8.dataNz.bd, with the latitude and longitude data set, P8.supp1.b
2. Data Modeling

   For more information and usage examples, see the functions' individual help topics.

   Table A.10: Fitting functions
   Function name: bdCluster, bdGlm, bdLm, bdPrincomp

   Table A.11: Other modeling utilities
   Function name: bd.model.frame.and.matrix, bs, ns, spline.des, contrasts, contrasts<-

   Model Methods

   The following table identifies functions implemented for generalized linear modeling, linear regression, principal components modeling, and clustering. The cross-hatch indicates the function is implemented for the corresponding modeling type.

   Table A.12: Modeling and Clustering Functions
   Columns: Function name; Generalized linear modeling (bdGlm); Linear regression (bdLm); Principal components (bdPrincomp); bdCluster
   Function names: AIC, all.equal, anova, BIC, coef, deviance, durbinWatson, effects, family, fitted, formula, kappa, labels, loadings, logLik, model.frame, model.matrix, plot, predict, print, prin
3. Note: The steps in this section simply demonstrate converting big data; they are not required for the remainder of the example. You have already imported the data as a big data object in the earlier section To import the data set.

   To read census data as a data frame and then convert to a bdFrame:

   1. Load the Big Data library, if it is not already loaded.

   2. Read the census data without setting the bigdata=T argument:

          small.Census <- importData(paste(getenv("SHOME"), "/samples/bigdata/census/census.csv", sep=""), type="ASCII", stringsAsFactors=F)
          larger.Census <- as.bdFrame(small.Census, bigdata=T)

   3. View the resulting data in the Data Viewer. In the Commands window, type:

          bd.data.viewer(larger.Census)

   Likewise, you can convert an S-PLUS vector to a bdVector.

   To convert between an S-PLUS vector and a bdVector subclass:

   1. In the Commands window, type:

          ZCTA.bv <- as.bdCharacter(P8.bd$ZCTA5)

   Note: While you can use either bd.coerce or functions like as.bdCharacter and as.bdFrame to convert standard objects to big data objects, you must use bd.coerce to convert big data objects to standard objects. Having a single function for converting big data objects to standard data objects makes it easier to track where big data is coerced to standard, and easier to write code that scales to handle arbitrarily large data.
4.     cluster.pm.bd <- bd.aggregate(x=cluster.p.bd, by.columns="PREDICT.membership", input.columns=column.names.Nz, summary.fns="mean")
       cluster.pc.bd <- bd.aggregate(x=cluster.p.bd, by.columns="PREDICT.membership", input.columns=1, summary.fns="count")

   2. Optionally, you can display the changed data in the data viewer:

          bd.data.viewer(cluster.pc.bd)

   3. Assign the mean cluster group to cluster.pm.df as a data frame:

          cluster.pm.df <- bd.coerce(cluster.pm.bd)

   4. Assign the count cluster group to cluster.pc.df as a data frame:

          cluster.pc.df <- bd.coerce(cluster.pc.bd)

   5. Assign both cluster group data frames to cluster.pmc.df:

          cluster.pmc.df <- merge(cluster.pm.df, cluster.pc.df)

   6. For a more systematic display, re-order by number of members within each cluster:

          cluster.pmc.df <- cluster.pmc.df[rev(order(cluster.pmc.df$ZCTA5.count)), ]

   7. Assign the cluster ID column of the data frame, PREDICT.membership, to the character vector PREDICT.membership.ordered:

          PREDICT.membership.ordered <- as.character(cluster.pmc.df$PREDICT.membership)

   To prepare the graph display:

   1. Set the color for the histograms to cycle through the 16-color list:

          index16 <- 1 + (0:200) %% 16

   2. Prepare the graph display. The function graph.setup is defined in the included file graph.setup.q. The function my.vbar is defined in the included file my.vbar.q. See the section Loading Supporting Source Files on page 94 for more
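The re-ordering in step 6 above can be tried on a small, ordinary data frame. This is only a sketch: clusters.df and its values are made up for illustration and are not part of the census sample, and it uses only base functions (order, rev) rather than anything from the Big Data library.

```r
# Hypothetical per-cluster summary; the values are made up for illustration.
clusters.df <- data.frame(cluster = c("1", "2", "3", "4"),
                          count = c(120, 950, 40, 310))

# order() returns the row positions in increasing order of count;
# rev() reverses them, so the cluster with the most members comes first.
clusters.df <- clusters.df[rev(order(clusters.df$count)), ]

# Cluster IDs in their new order, as a character vector.
ordered.ids <- as.character(clusters.df$cluster)
ordered.ids
```

Indexing the rows with rev(order(...)) leaves the columns untouched, which is why the merged means and counts stay aligned with their cluster IDs.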
5. Plotting by Summarizing Data

   The qqline chart is displayed as follows:

   [Figure 5.10: Graph using qqline in a qqplot chart. Axes: Quantiles of Standard Normal (x) and fuel.bd$Mileage (y).]

   The following examples demonstrate functions that summarize data in a plot-specific manner to plot big data objects. These functions do not use hexagonal binning. Because the plots for these functions are always monotonically increasing, hexagonal binning would obscure the results. Rather, summarizing provides the appropriate information.

   Create a Box Plot

   The following example creates a simple box plot from fuel.bd. To create a Trellis box and whisker plot, see the following section.

   To create a sample box plot, in the Commands window, type the following:

       fuel.bd <- as.bdFrame(fuel.frame)
       boxplot(split(fuel.bd$Fuel, fuel.bd$Type), style.bxp="att")

   The box plot is displayed as follows:

   [Figure 5.11: Graph using boxplot. Groups shown: Compact, Large, Medium, Small, Sporty.]

   Create a Trellis Box and Whisker Plot

   The box and whisker plot provides graphical representation showing the center and spread of a distribution.

   To create a sample box and whisker plot in a Trellis graph, in the Commands window, type the fol
6. Date

   In the Data View tab, you can scroll horizontally and vertically to examine the data. You can also change the size of the Data Viewer window to show more or less of the data table. Note that the bottom pane of the Data Viewer provides summary information about the data set, including the numbers of rows and columns, and identifies the types of columns in the data set.

   MANIPULATING DATA: CENSUS EXAMPLE

   In this section, we begin manipulating the data in the P8.bd example, used throughout the first half of this guide, to demonstrate working with a large data set using the Big Data library functions. For practical reasons, this data set is not particularly large (about 33,000 rows and 40 columns); however, it is illustrative of a typical data set and of the type of problem solving that users typically must perform.

   Note: The entire sample script can be found in the default installation directory, in samples/bigdata/census/census.demo.ssc. You can work through the example demonstrations below, or you can open the script and review it or run it.

   Overview of Census Sample

   After you import the data, your next task is to manipulate the data using standard and Big Data library functions. The Big Data library Census sample reads in the pre-processed file census.csv, which came from the Census Level 3 data. All data is binned by ZIP code tabulation area (ZCTA), using 5-digit zip codes a
7. Plotting using hexagonal bins is available for a single plot, a matrix of plots, and conditioned single or matrix plots.

   The Census example, in the section Displaying in a Simple Plot on page 98 of Chapter 4, Exploring and Manipulating Large Data Sets, demonstrates plotting using hexagonal binning. When you create a plot showing a distribution of zip codes by latitude and longitude, the following simple plot is displayed:

   [Figure 5.1: Example of graph showing hexagonal binning. Axes: P8.suppl.bd$Lon (x) and P8.suppl.bd$Lat (y); legend: Counts.]

   The functions listed in Table 5.1 support big data objects by using hexagonal binning. This section shows examples of how to call these functions for a big data object.

   Create a Pair-wise Scatter Plot

   The pairs function creates a figure that contains a scatter plot for each pair of variables in a bdFrame object.

   To create a sample pair-wise scatter plot for the fuel.frame bdFrame object, in the Commands window, type the following:

       pairs(as.bdFrame(fuel.frame))

   The pair-wise scatter plot appears as follows:

   [Scatter plot matrix; recoverable panel names: Weight, Disp., Mileage, Fuel.]

   Figure 5.2
8. the Workspace. This design varies from the traditional S-PLUS project design, where the .Data directory is associated with a single project and contains objects only for that project.

   [Figure 2.1: Workspace directory showing .Data directory, .metadata directory, and project directories.]

   Important: By default, the S-PLUS Workbench reads objects from the Workspace's .Data directory, while traditional S-PLUS reads objects from the project's .Data directory. Therefore, if you create a project using the S-PLUS Workbench and then open that project in the traditional S-PLUS GUI, you must attach the Workspace's .Data directory to see its objects. The reverse is also true: if you open a project in the S-PLUS Workbench that you have previously opened in the traditional S-PLUS GUI, you must attach the project's .Data directory to see its objects. By default, the traditional S-PLUS 7 working directory is C:\Program Files\Insightful\splus70\users\username\.Data.

   When working with S-PLUS Workbench projects:

   • Never store your project files directly in your Workspace directory. Project files (including the .project file) should be in their own directory.
   • Avoid nesting projects th
9. 2. To view the ZIP codes as strings, type:

          ZCTA.bv

      To view the bdVector data in the data viewer, type:

          bd.data.viewer(ZCTA.bv)

      The ZCTAs are stored as strings in this table; click the Strings tab in the Data Viewer to see the results.

   3. Next, you can coerce the ZCTA data to numbers:

          # Zip Code Tab Areas
          Num.bv <- as.bdNumeric(P8.bd$ZCTA5)

      Examine the results in the Data Viewer, if you choose.

   Note: ZIP codes are best imported as character strings; otherwise, S-PLUS truncates the leading 0 for east coast ZIP codes (e.g., 02139 becomes 2139).

   Manipulating Rows

   When you examined the sample census data in the Data Viewer, you might have seen that the data set contains several rows of uninformative data: rows showing ZCTAs containing letters and the population bins all showing 0. In the following exercise, examine and filter those rows, and then re-display the data set in the Data Viewer.

   To filter the rows:

   1. To only keep rows where P008001 is greater than 0:

          P8.bd <- bd.filter.rows(P8.bd, expr="P008001 > 0")

      bd.filter.rows has a logical argument, include, with a default of TRUE. If you add include=F to the above call, you would drop all rows where P008001 is greater than 0.

   2. Show the data set in the Data Viewer:

          bd.data.viewer(P8.bd)

      Note that the data set now contains 32,165 rows. The filtering function removed 1
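The note above about leading zeros can be reproduced with ordinary character and numeric vectors. The following is a minimal sketch that uses only base functions, not the Big Data library; the vector names and the second ZIP code are made up for illustration, and the re-padding step assumes every code is a whole number below 100000.

```r
# ZCTA codes held as character strings keep their leading zero.
zcta.chr <- c("02139", "98109")

# Coercing to numeric drops the leading zero: "02139" becomes 2139.
zcta.num <- as.numeric(zcta.chr)
zcta.num

# One way to rebuild a five-character code for display:
# prepend zeros, then keep only the last five characters.
padded <- paste("00000", zcta.num, sep = "")
zcta.back <- substring(padded, nchar(padded) - 4)
zcta.back
```

Keeping the codes as strings from the start avoids the round trip altogether, which is why the guide recommends importing them as character data.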
10. [Figure 2.13: S-PLUS Workbench Output view.]

    Problems View

    The Problems View is a standard Eclipse view that displays errors as you edit and save code. For example, if you forget a bracket or a quotation mark and then save your work, the description appears as a syntax error in the Problems View.

    Note: Syntax problems appear in the Problems View only after you save the file.

    If your code has a problem that is displayed in the Problems View, and the view is not the active view, the Problems View tab title appears as bold text.

    To open the Script editor at the location of the problem, double-click the error in the Problems View.

    You can use the Problems View control menu (click the control menu button) to perform the following tasks:

    • Display the Sorting dialog box to sort the problems displayed in the view, either in ascending or descending order, and according to the problems' characteristics.
    • Display the Filters dialog box to specify properties for filtering problems.
11. bd.by.window
        Apply an arbitrary S-PLUS function to multiple data blocks defined by a moving window over the input dataset.
    bd.coerce
        Converts an object from a standard data frame to a bdFrame, or vice versa.

    Table A.5: Data manipulation functions (Continued)

    Function name / Description

    bd.create.columns
        Creates columns based on expressions.
    bd.duplicated
        Determine which rows in a dataset are unique.
    bd.filter.columns
        Removes one or more columns from a data set.
    bd.filter.rows
        Filters rows that satisfy the specified expression.
    bd.join
        Creates a composite data set from two or more data sets. For each data set, specify a set of key columns that defines the rows to combine in the output. Also, for each data set, specify whether to output unmatched rows.
    bd.modify.columns
        Changes column names or types. Can also be used to drop columns.
    bd.normalize
        Centers and scales continuous variables. Typically, variables are normalized so that they follow a standard Gaussian distribution (means of 0 and standard deviations of 1). To do this, bd.normalize subtracts the mean or median, and then divides by either the range or standard deviation.
    bd.partition
        Randomly samples the rows of
12. mdy, mean, median, min, minutes, months, plot, quantile, quarters, range, seconds, seriesLag

    Table A.15: Time Date and Series Functions (Continued)
    Columns: Function; bdTimeDate; bdTimeSpan; bdSeries; bdSignalSeries; bdTimeSeries
    Function names: shiftPositions, show, sort, sort.list, split, start, substring<-, sum, Summary, summary, timeConvert, trunc, var, wdydy, weekdays, yeardays, years
13. vectors. The boxplots can be made to display the variability of the median, and can have variable widths to represent differences in sample size.

    bwplot
        Produces a box and whisker Trellis graph, which you can use to compare the distributions of several data sets.
    plot(density)
        density returns x and y coordinates of a non-parametric estimate of the probability density of the data.
    densityplot
        Produces a Trellis graph demonstrating the distribution of a single set of data.
    hist
        Creates a histogram.
    histogram
        Creates a histogram in a Trellis graph.
    qq
        Creates a Trellis graphic object comparing the distributions of two sets of data.

    Table 5.3: Functions that summarize in plot-specific manner (Continued)

    Function / Description

    qqmath
        Creates normal probability plot for only one data object in a Trellis graph. qqmath can also make probability plots for other distributions. It has an argument, distribution, whose input is any function that computes quantiles.
    qqnorm
        Creates normal probability plot in a Trellis graph. qqnorm can accept a single bdVector object.
    qqplot
        Creates normal probability plot in a Trellis graph. Can accept two bdVector objects. In qqplot, each vector or bdVector is taken as a sample, for the x- and y-axis values of an empiric
14. 1. Set the time to calculate the betas using this approach:

           t0 <- proc.time()[3]

    2. Initialize the vector of betas:

           betas1.b <- structure(numeric(length(constituentNames)), names=constituentNames)

    3. Loop through the stocks and calculate the beta directly as a regression coefficient:

           for (constituentName in constituentNames) {
               beta <- lsfit(seriesData(dailyReturns.ts)[, "SP500"], seriesData(dailyReturns.ts)[, constituentName])
               betas1.b[constituentName] <- beta$coef[2]
           }
           timeBetas1.b <- proc.time()[3] - t0
           print(all.equal(betas1.a, betas1.b))

    To calculate betas using Approach 2:

    1. Set the time to calculate the betas using this approach:

           t0 <- proc.time()[3]
           stdevs <- colStdevs(dailyReturns.ts)

    2. Calculate betas without an explicit loop, by adjusting the correlation coefficients:

           corSP500.bd <- bd.cor(seriesData(dailyReturns.ts), y.columns=constituentNames, x.columns="SP500")
           betas2 <- unlist(corSP500.bd[1, -1, drop=T]) * stdevs[constituentNames] / stdevs["SP500"]
           timeBetas2 <- proc.time()[3] - t0

    Comparing techniques

    Compare the answers from Approaches 1 and 2.

    To check both techniques:

    1. Examine both betas:

           print(all.equal(betas1.b, betas2))

    Plot the beta

    Plot the 10-year return against the beta calculated over that period.

    To plot the beta:

    1. Create an object for the 10-year return:

           tenyrReturn <- unlist(seriesD
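Approach 2 above rests on the fact that the slope of a simple least-squares regression equals the correlation coefficient multiplied by the ratio of the two standard deviations. The sketch below checks that identity on a handful of made-up returns. The vectors market and stock are illustrative assumptions, not the stock sample data, and only base functions (lsfit, cor, var, all.equal) are used.

```r
# Small made-up daily returns; the values are illustrative only.
market <- c(0.010, -0.004, 0.007, -0.012, 0.003, 0.009, -0.006, 0.001)
stock  <- c(0.015, -0.002, 0.006, -0.020, 0.001, 0.014, -0.011, 0.004)

# Approach 1: beta as the slope of a least-squares fit of stock on market.
# The first coefficient is the intercept, the second is the slope.
beta.fit <- as.numeric(lsfit(market, stock)$coef[2])

# Approach 2: beta as correlation times the ratio of standard deviations.
beta.cor <- cor(market, stock) * sqrt(var(stock)) / sqrt(var(market))

# The two values agree up to floating-point error.
all.equal(beta.fit, beta.cor)
```

Because the correlation matrix can be computed in one pass over the data, the second form avoids fitting a separate regression for every stock.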
16. Class(es) / Description

    bdFrame
        Big data frame
    bdLm, bdGlm, bdCluster, bdPrincomp
        Rich model objects
    bdVector
        Big data vector
    bdCharacter, bdFactor, bdLogical, bdNumeric, bdTimeDate, bdTimeSpan
        Vector type subclasses
    bdTimeSeries, bdSignalSeries
        Series objects

    Functions

    In addition to the standard S-PLUS functions that are available to call on large data sets, the Big Data library includes functions specific to big data objects. These functions include the following:

    • Big vector generating functions.
    • Data exploration and manipulation functions.
    • Traditional and Trellis graphics functions.
    • Modeling functions.

    The functions for these general tasks are listed in the Appendix.

    Data Import and Export

    Two of the most frequent tasks using S-PLUS are importing and exporting data. The functions are described in Table A.1 of the Appendix. You can perform these tasks either from the Commands window or from the S-PLUS import and export dialog boxes. For more information about importing and exporting large data sets, see the section Importing Existing Data and the section Exporting Data in Chapter 4, Exploring and Manipulating Large Data Sets.

    Big Vector Generation

    To generate a vector for a large data set, call one of the S-PLUS functions described in Table A.3 in the
17. • Display the Filters dialog box to specify properties for filtering tasks.

    For more information about the basic Eclipse Tasks View, see the Workbench User's Guide.

    [Figure 2.16: S-PLUS Workbench Tasks view. The view lists two items for census.demo.ssc in the Census folder: "TODO Print the plot for the big meeting" (line 73) and "FIXME Find the code error" (line 35).]

    SCRIPT EDITOR

    The S-PLUS Workbench Script Editor is a text editor. It is similar to the Script Editor in S-PLUS; however, it contains additional script-authoring features, such as syntax coloring and integration with the other views in the IDE.

    Text Editing Assistance

    To help you write efficient, easy-to-follow scripts, the Script Editor provides the following features:

    • Displays keywords and function arguments in customizable colors. (See the section Setting the Project's Preferences on page 44.)
    • Displays code line numbers in a column adjacent to the code.
    • Provides automatic code indentation and parenthesis matching. (See the Eclipse documentation for more information on the editor's standard features.)
    • Provides customized menu items to control text layout and integration in the Script editor.
    • Activates the Script Outline View wh
18. Graph using pairs for a bdFrame.

    This scatter plot looks similar to the one created by calling pairs(fuel.frame); however, close examination shows that the plot is composed of hexagons.

    Create a Single Plot

    The plot function can accept a hexbin object, a single bdVector, two bdVectors, or a bdFrame object. The following example plots a simple hexbin plot using the weight and mileage vectors of the fuel.bd object.

    To create a sample single plot, in the Commands window, type the following:

        fuel.bd <- as.bdFrame(fuel.frame)
        plot(hexbin(fuel.bd$Weight, fuel.bd$Mileage))

    The hexbin plot is displayed as follows:

    [Figure 5.3: Graph using single hexbin plot for fuel.bd. Legend: Counts.]

    Create a Multi-Panel Scatterplot Matrix

    The function splom creates a Trellis graph of a scatterplot matrix. The scatterplot matrix is a good tool for displaying measurements of three or more variables.

    To create a sample multi-panel scatterplot matrix, where you create a hexbin plot of the columns in fuel.bd against each other, in the Commands window, type the following:

        fuel.bd <- as.bdFrame(fuel.frame)
        splom(~., data=fuel.bd)

    Note: Trellis functions in the Big Data Library require the data argument. You cannot use formulas that refer to bdVectors that are not in a specified bdFrame.

    Notice that the '.' is interpr
19. information. This code uses the appropriate display device for both the Windows and Unix platforms.

        graph.setup(Name="Histograms")
        par(mfrow=c(5,10))
        Nplot <- 30
        for (k in 1:Nplot) my.vbar(cluster.pmc.df, k=k, plotcols=2:37, Nreport.col=38, col=index16[k])

    [Figure 6.5: Histograms displaying clusters.]

    3. Select the columns to determine the data you want to appear in the histogram, and assign them to the data frame cluster.psub.df:

           cluster.psub.df <- bd.coerce(bd.select.rows(x=cluster.p.bd, columns=c("Lat", "Lon", "PREDICT.membership")))

       Optionally, you can view this three-column data set in the data viewer. Observe that it still has 32,165 rows:

           bd.data.viewer(cluster.psub.df)

       Create a vector to contain the data set's latitudes:

           Lat.vec <- cluster.psub.df$Lat

       Create a vector to contain the data set's longitudes:

           Lon.vec <- cluster.psub.df$Lon

       Create a character vector to contain the data set's predicted membership:

           Memb.vec <- as.character(cluster.psub.df$PREDICT.membership)

       Create a vector of the column PREDICT.membership:

           Memb.vec <- cluster.p.bd$PREDICT.membership

    Creating a Multi-tabbed Sheet

    In the following exercise, use the data you sorted and filtered in the previous exercise to create a multi-tabbed sheet, one for each of the first 20 clusters of your 40-cluster set. Each sheet shows black dots for all but th
20.     lines(loess.smooth(fuel.bd$Weight, fuel.bd$Mileage), lty=2)

    The resulting chart is displayed as follows:

    [Figure 5.8: Graph using loess.smooth in a hexbin plot. Axes: fuel.bd$Weight (x) and fuel.bd$Mileage (y); legend: Counts.]

    Add a Smoothing Spline

    Use lines(smooth.spline) to add a smoothing spline to a scatter plot.

    To add a smoothing spline to a sample plot, in the Commands window, type the following:

        fuel.bd <- as.bdFrame(fuel.frame)
        hexbin.out <- plot(fuel.bd$Weight, fuel.bd$Mileage)   # displays a hexbin plot
        add.to.hexbin(hexbin.out, lines(smooth.spline(fuel.bd$Weight, fuel.bd$Mileage), lty=3))

    The resulting chart is displayed as follows:

    [Figure 5.9: Graph using smooth.spline in a hexbin plot. Axes: fuel.bd$Weight (x) and fuel.bd$Mileage (y).]

    Add a Least Squares Line to an xyplot

    To add a reference line to an xyplot, set lmline=T. Alternatively, you can call panel.lmline or panel.loess. See the section Create a Conditioning Plot or Scatter Plot on page 128 for an example.

    Add a qqplot Reference Line

    The function qqline fits and plots a line through a normal qqplot.

    To add a qqline reference line to a sample qqplot, in the Commands window, type the following:

        fuel.bd <- as.bdFrame(fuel.frame)
        qqnorm(fuel.bd$Mileage)
        qqline(fuel.bd$Mileage)
21. predict all work with Big Data glms.

    Note: At this time, the gamma family does not work with bigdata glms.

    For a list of functions implemented for big data linear modeling and generalized linear modeling, and a short description of each, see Table A.12 in the Appendix, Big Data Library Functions. For more detailed information about each function, see its help file.

    Fitting Data for a Linear Model

    The following example uses the Boston housing data to fit a linear model. As well as fitting the linear model, the example demonstrates tasks covered in earlier chapters, including:

    • importing data
    • manipulating data
    • creating simple graphs
    • adding data columns

    Boston Housing Linear Regression Example

    The Boston Housing example data set is included in the example directory of your S-PLUS installation (samples/bigdata/boston). The text below gives brief descriptions of each of the variables in the data set.

    This data set contains the Boston house-price data of Harrison and Rubinfeld (1978) that was subsequently analyzed in Belsley et al. (1980). The table in Belsley et al. (p. 244) has various transformations already applied to the data that are not included in the bostonhousing.txt file.

    The main variable of interest in the bostonhousing.txt data is MEDV, the median value of owner-occupied homes (given in the thousands of dollars). We use this as the response varia
22. Size Considerations

        r * c * 8 * 4.5 = number of bytes required for the data

    where:

    • r = number of rows in the input file.
    • c = number of columns in the input file.
    • 8 = bytes per entry required for numeric data.
    • 4.5 = average number of data copies that the S language creates while processing the data.

    This formula can give you an idea of the amount of dynamic memory needed in standard S-PLUS. For example, using this formula, you can see that a dataset with 98672 rows and 507 columns of numeric data requires about 1.8 GB of RAM in the processing machine:

        98672 * 507 * 8 * 4.5 = 1,800,961,344 bytes, or approximately 1.8 GB

    On a Windows machine, 1.8 GB is approaching the limits of the 32-bit operating system, so you should set bigdata=T when importing this data set.

    Physical RAM vs. Virtual RAM

    For efficient operations, it is best to have space for all of your data in physical memory. If your data requires 1.2 GB of memory according to the above formula, and you have only 512 MB of RAM and 2 GB of swap space (virtual memory), then performance will likely suffer. For example, a Windows machine in this situation will often appear to hang. Whenever your memory requirement is more than available physical RAM, you can benefit from moving to out-of-memory processing techniques, such as using the Big Data library.

    For more information about how S-PLUS allocates memory and how to use i
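The size estimate above is simple enough to wrap in a small helper for checking other data sets before you import them. This is only a sketch: bytes.needed is a hypothetical name chosen for illustration, not a function supplied with S-PLUS, and the result is a rough guide under the assumptions stated in the formula (8 bytes per numeric entry, about 4.5 copies).

```r
# Rough memory estimate for standard (in-memory) processing.
# bytes.needed is a hypothetical helper, not part of S-PLUS.
bytes.needed <- function(n.rows, n.cols, bytes.per.entry = 8, copies = 4.5) {
    n.rows * n.cols * bytes.per.entry * copies
}

# The worked example from the text: 98672 rows by 507 numeric columns.
est <- bytes.needed(98672, 507)
est          # 1800961344 bytes
est / 1e9    # about 1.8, taking 1 GB as 10^9 bytes
```

If the estimate approaches or exceeds the physical RAM available, that is the signal to import with bigdata=T rather than as a standard data frame.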
23. your data set to partition it into three subsets for training, testing, and validating your models.

    bd.relational.difference
        Get differing rows from two input data sets.
    bd.relational.divide
        Given a Value column and a Group column, determine which values belong to a given Membership as defined by a set of Group values.
    bd.relational.intersection
        Join two input data sets, ignoring all unmatched columns, with the common columns acting as key columns.
    bd.relational.join
        Join two input data sets with the common columns acting as key columns.
    bd.relational.product
        Join two input data sets, ignoring all matched columns, by performing the cross product of each row.
    bd.relational.project
        Remove one or more columns from a data set.
    bd.relational.restrict
        Select the rows that satisfy an expression. Determines whether each row should be selected by evaluating the restriction. The result should be a logical value.

    Table A.5: Data manipulation functions (Continued)

    Function name / Description

    bd.relational.union
        Retrieve the relational union of two data sets. Takes two inputs (bdFrame or data frame). The output contains the common columns and includes the rows from both inputs, with duplicate rows eliminated.
    bd.remove.missing
        Drops rows with missing values, or replaces missing values with the column mean, a constant, or values generat
24. Appendix. When you set the bigdata flag to TRUE, the standard S-PLUS functions generate a bdVector object of the specified type. For example:

        # sample of size 2000000 with mean 10*0.5 = 5
        rbinom(2000000, 10, 0.5, T)

    Data Exploration Functions

    After you import your data into S-PLUS and create the appropriate objects, you can use the functions described in Table A.4 in the Appendix to compare, correlate, crosstabulate, and examine univariate computations.

    Data Manipulation Functions

    After you import and examine your data in S-PLUS, you can use the data manipulation functions to append, filter, and clean the data. For an overview of these functions, see Table A.5 in the Appendix. For a more in-depth discussion of these functions, see Chapter 4, Exploring and Manipulating Large Data Sets.

    Graph Functions

    The Big Data library supports graphing large data sets intelligently, using the following techniques to manage many thousands or millions of data points:

    • Hexagonal binning. (That is, functions that create one point per observation in standard S-PLUS create a hexagonal binning plot when applied to a big data object.)
    • Plot-specific summarizing. (That is, functions that are based on data summaries in standard S-PLUS compute the required summaries from a big data object.)
    • Preprocessing data, using table, tapply, loess, or aggregate.
    • Preprocessing using interp or hist2d.

    Note: The Windows GUI editable graphics do not support big data objects. To use these
25. Click to sort items displayed in the Outline View alphabetically. Click again to return the items to the order in which they appear in the script.

    Click to display a menu showing all buttons available on the button bar. You can toggle these selections either using the menu or on the button bar.

    [Figure 2.12: S-PLUS Workbench Outline view.]

    Output View

    The Output View displays the code you run, and the results of that code, when you either click Run on the toolbar or press F9. The text displayed in the Output View is replaced each time you click Run or press F9. (That is, unlike the Console View, the Output View does not store and display previously run commands.) Also unlike the Console View, the Output View is not editable; however, you can select and copy lines of text in the Output View. You can also print or clear the entire contents of the Output View.

    You can use the Output View control menu (click the control menu button) to perform the following tasks:

    • Clear the contents of the view.
    • Copy the selected text.
    • Find a string.
    • Select all text.
    • Save the view contents to a file.
    • Print the view contents.
26. ENTER. Copy an individual command or blocks of commands from the script editor, using the Copy to Console menu item, to run them in the Console View. Note that you do not need to select Paste: Copy to Console copies your selected text in the Script Editor and pastes it into the Console View.

You can use the Console View control menu (click the control menu button) to perform the following tasks:
• Clear the contents of the console.
• Copy the selected text.
• Cut the selected text.
• Paste text from the clipboard to the console.
• Find a string.
• Select all text.
• Save the console contents to a file.
• Print the console contents.

For exercises using the S-PLUS Workbench Console View, see the section Copying Script Code to the Console on page 52. For more information about the S-PLUS Commands window, see Chapter 10, Using the Commands Window, in the S-PLUS User's Guide.

Figure 2.9: S-PLUS Workbench Console view.

Note: If your script contains a command to open a graph or the data viewer, these windows launch externally to the S-PLUS Workbench. Note that these windows op
27. Figure 2.10: S-PLUS Workbench History view.

The Objects View is similar to the Object Explorer in the S-PLUS GUI. It displays all objects for projects associated with the Workspace. See the section S-PLUS Workspace on page 16 for more information about the Workspace .Data database. The S-PLUS Workbench Objects View also provides a list of the names and types of objects in S-PLUS databases. The Objects View includes the following information about each object: name, data class, storage mode, extent, size, and creation or change date.

You can use the Objects View control menu (click the control menu button) to perform the following tasks:
• Select another database.
• Refresh the view on the currently active database.
• Remove the selected object from the currently active database.

Note: When you run code that creates objects in an S-PLUS script, the Objects View is not automatically refreshed to display the new objects. To refresh the Objects View and display newly created objects, right-click the Objects View (or click the control menu button), and then, from the menu, click Refresh.
28. S-PLUS displays a warning with detailed information on the number of altered values after the operation is completed.

Truncation and Level Overflow Errors

You can set the following options to make an error occur immediately when a string truncation or level overflow occurs:

    bd.options(error.on.string.truncation=T)
    bd.options(error.on.level.overflow=T)

The default for both options is F. If one of these is set to T, an error occurs with a short error message. Because not all of the data has been processed at that point, it is impossible to determine how many values might be affected. These options are useful in situations where you are performing a lengthy operation, such as importing a huge data set, and you want to terminate it immediately if there is a possible problem.

STORING AND RETRIEVING LARGE S OBJECTS

When you work with very large data, you might encounter a situation where an object or collection of objects is too large to fit into available memory. The Big Data library offers two functions to manage storing and retrieving large data objects:
• bd.pack.object
• bd.unpack.object

This topic contains examples of using these functions.

Managing Large Amounts

Suppose you want to create a list containing thousands of model objects, and a single list containing all of the models is too large to fit in your available memory. By using
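The two behaviors described in excerpt 28 (finish the operation and report how many values were altered, or stop at the first problem) can be contrasted in a small Python sketch. This is an illustration of the trade-off only, not how the Big Data library stores strings; the function store_strings and its error_on_truncation argument are hypothetical names chosen to echo the option above.

```python
def store_strings(values, width, error_on_truncation=False):
    """Store strings in fixed-width slots of `width` characters.

    Hypothetical helper: by default it truncates and counts altered values
    so a single warning can be issued at the end; with
    error_on_truncation=True it stops at the first overlong string.
    """
    stored = []
    altered = 0
    for value in values:
        if len(value) > width:
            if error_on_truncation:
                # Fail fast: the remaining values are never examined, so the
                # total number of affected values cannot be reported.
                raise ValueError("string truncation")
            altered += 1
            value = value[:width]
        stored.append(value)
    return stored, altered
```

With the default setting, the caller learns the exact count of altered values but pays for the whole pass; with the fail-fast setting, a long import ends at the first overlong string.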
29. Save Perspective As. To restore an unsaved perspective's default settings, click Window > Reset Perspective. To open another perspective, click Window > Open Perspective, and then select a perspective from the Select Perspective dialog box.

Figure 2.8: S-PLUS Workbench window.

VIEWS

A view is a visual component in the workbench. Views support the script editor by providing alternate means of navigating through, working with, and examining the elements of the project. All views except the Outline View feature their own context (right-click) menus, with menu items that act on the type of data displayed in the view. Each view contains a control menu listing actions that apply specifically to the view. The control menu is displayed either when you click the drop down
30. Figure 2.15: S-PLUS Workbench Search Path view.

Tasks View

The Tasks View is a standard Eclipse IDE view, which is customized in S-PLUS to provide three levels of tasks.

Table 2.4: S-PLUS Workbench Tasks

Task     Description
FIXME    Defines high-priority tasks. The task appears with an exclamation mark in the Tasks view.
TODO     Defines medium-priority tasks.
XXX      Defines low-priority tasks.

You can change these tasks, or you can add your own custom tasks. For more information about changing task settings, see the section To Set the Example Preferences on page 44.

The Tasks View also contains a button bar that displays the following buttons.

Table 2.5: Tasks View buttons

Button descriptions:
• Click to display the Add Task dialog box to add a custom task.
• Click to delete the selected custom task. Note that you cannot use this button to delete tasks identified in the script.
• Click to display the Filters dialog box to specify properties for filtering the tasks.

You can use the Tasks View control menu (click the control menu button) to perform the following tasks:
• Display the Sorting dialog box to sort the tasks displayed in the view, either in ascending or descending order, and according to the tasks' characteristics
31. a new project, consider the following scenarios and then review the S-PLUS Workbench options.

Table 2.6: S-PLUS Workbench project scenarios

Scenario: You are starting an empty project with no existing files.
S-PLUS Workbench option: In the New Project wizard, specify a project name and accept the default project directory location. Your project is created as a subdirectory in the Workspace directory. The Navigator View displays the project resource but no existing project files.

Scenario: You have one or more project(s) and you want to work with the files at their existing location.
S-PLUS Workbench option: In the New Project wizard, specify a project name, clear the Use default check box, and then browse to the location of the project files. S-PLUS Workbench works with the files at the specified location. The Navigator View displays the project resource and all files in the project directory.

Scenario: You have an existing project and you want to copy selected files to a Workspace directory (perhaps, for example, because they are at a remote location, are read-only, or you do not w
S-PLUS Workbench option: In the New Project wizard, specify a project name and accept the default project directory location. An empty project subdirectory is created in the Workspace directory. You can then import your project files. See the section
32. and quantile function diff digamma dim dimnames a bdFrame has no row names dimnames lt a bdFrame has no row names dlnorm Density CDF and quantile function dlogis Density CDF and quantile function dmvnorm Density and CDF function dnbinom Density CDF and quantile function dnorm Density CDF and quantile function dnrange Density CDF and quantile function dpois Density CDF and quantile function Big Data Library Functions Table A 7 Functions implemented for bdVector and bdFrame Continued Function Name bdVector bdFrame Optional Comment dt Density CDF and quantile function dunif Density CDF and quantile function duplicated Density CDF and quantile function durbinWatson Density CDF and quantile function dweibull Density CDF and quantile function dwilcox Density CDF and quantile function floor format formula grep hist hist2d histogram html table 4 intersect 219 Appendix Big Data Library Functions 220 Table A 7 Functions implemented for bdVector and bdFrame Continued Function Name bdVector Optional Comment is all white 4 is element t is finite is infinite s na s nan is number is rectangular kurtosis Handles bdNu
33. create columns

Chapter 7: Advanced Programming Information

INTRODUCTION

As an S-PLUS Big Data library user, you might encounter unexpected or unusual behavior when you manipulate blocks of data or work with strings and factors. This section includes warnings and advice about such behavior, and provides examples and further information for handling these unusual situations. Alternatively, you might need to implement your own big data algorithms using out-of-memory techniques.

BIG DATA BLOCK SIZE ISSUES

Big data objects represent very large amounts of data by storing the data in external files. When a big data object is processed, pieces of this data are read into memory and processed as data blocks. For most operations, this happens automatically. This section describes situations where you might need to understand the processing of individual blocks.

Block Size Options

When processing big data, the system must decide how much data to read and process in each block. Each block should be as big as possible, because it is more efficient to process a few large blocks rather than many small blocks. However, the available memory limits the block size. If space is allocated for a block that is larger than the physical memory on the computer, either it uses virtual memory to store the block (which slows all operations), or the memory allocation operation fails. The size of the block
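The block-by-block processing described in excerpt 33 can be sketched in Python. This is a conceptual illustration under simple assumptions (the data is an in-memory list standing in for an external file, and the names blocks and blockwise_mean are hypothetical); it is not the Big Data library's block mechanism. It shows why only one block needs to be in memory at a time: the algorithm carries forward a small running state instead of the data itself.

```python
def blocks(rows, block_size):
    """Yield consecutive slices of at most block_size rows."""
    for start in range(0, len(rows), block_size):
        yield rows[start:start + block_size]


def blockwise_mean(rows, block_size):
    """Compute the mean while holding only one block's worth of rows."""
    total = 0.0
    count = 0
    for block in blocks(rows, block_size):
        # Only the running total and count survive between blocks.
        total += sum(block)
        count += len(block)
    return total / count
```

A larger block_size means fewer iterations and less per-block overhead, which matches the guidance above that a few large blocks are more efficient than many small ones, as long as a block still fits in physical memory.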
34. create columns x a b d copy F c

Note that in the last function above, specifying copy=F creates a new column without copying the old columns.

APPENDIX: BIG DATA LIBRARY FUNCTIONS

Introduction
Big Data Library Functions
Data Import and Export
Object Creation
Big Vector Generation
Big Data Library Functions
Data Frame and Vector Functions
Graph Functions
Data Modeling
Time, Date, and Series Functions

INTRODUCTION

The Big Data library is supported by many standard S-PLUS functions, such as basic statistical and mathematical functions, properties functions, densities and quantiles functions, and so on. For more information about these functions, see their individual help topics. (To display a function's help topic, in the Commands window, type help(functionname).)

The Big Data library also contains functions specific to big data objects. These functions include the following:
• Import and export functions.
• Object creation functions.
• Big vector generating functions.
• Data exploration and manipulation functions.
• Traditional and Trellis graphics functions.
• Modeling functions.

These functions are described further in the following section.

BIG DATA LIBRARY FUNCTIONS

The following tables list the functions that are implemented in the Big Data library.

Data Im
35. directory at samples/bigdata/census. Select the directory, and then click OK. The directory name appears in the left pane, and all of the project's files appear in the right pane. Click Select All, and then click Finish to add the files to your project.

Hint: You can select just the .ssc file to import, if you prefer, because the script itself references the data in these files. For the purposes of this part of the exercise, we import all files.

Adding a Second Project

In this exercise, use the Boston Housing example, one of the examples provided with the S-PLUS Enterprise edition. This exercise demonstrates adding a new project at a different location, rather than importing the files.

To Add a Project
1. Click File > New > Project.
2. In the New Project wizard, select S-PLUS Project, and then click Next.
3. In the Project name text box, type Boston Housing, and then clear the Use default check box.
4. Browse to the location of the Boston Housing sample directory (by default, in the samples/bigdata directory of your S-PLUS installation). Select the boston directory, and then click OK.
5. Click Finish to add the project.

In the Navigator View, the Boston Housing directory appears. This directory contains all of the files in that sample directory location. You won't be using this project for the remainder of the t
36. graphics, create a data frame containing either all of the data or a sample of the data. For a more detailed discussion of graph functions available in the Big Data library, see Chapter 5, Creating Graphical Displays of Large Data Sets.

Modeling Functions

Algorithms for large data sets are available for the following statistical modeling types:
• Linear regression
• Generalized linear regression
• Clustering
• Principal components

See the section Models on page 75 for more information about the modeling objects. See Table 3.7 for an overview of the big data modeling architecture.

If the data argument for a modeling function is a big data object, then S-PLUS calls the corresponding big data modeling function. The modeling function returns an object with the appropriate class, such as bdLm. See Table A.12 in the Appendix for a list of the modeling functions that return a model object.

Generally, methods for a large data modeling class, such as bdLm, correspond to the methods for the standard modeling class, such as lm; however, the Big Data library supports only a subset of the following:
• Modeling methods
• Function arguments
• Formulas

If you request an unsupported option for a big data object, the algorithm stops with an error message. Reviewing the Big Data library modeling methods, functions, and formulas in the documentation can help avoid these errors.

Table 3.7: Bi
37. number of rows. The library also offers support for elementwise and matrix multiplication. Matrix multiplication is available for two bdFrames with the appropriate dimensions.

Cross Product Function

When applied to two bdFrames, the cross product function crossprod returns a bdFrame that is the cross product of the given bdFrames. That is, it returns the matrix product of the transpose of the first bdFrame with the second.

Summary

In this section, we've provided an overview of the Big Data library architecture, including the new data types, classes, and functions that support managing large data sets. For more detailed information and lists of functions that are included in the Big Data library, see the Appendix, Big Data Library Functions. In the next chapter, Chapter 4, Exploring and Manipulating Large Data Sets, we provide examples for working with data sets using the types, classes, and functions described in this chapter.

Chapter 3: The Big Data Library

EXPLORING AND MANIPULATING LARGE DATA SETS

Introduction
Working in the S-PLUS Environment
Command line functions
Dialog box support
Data Viewer
Manipulating Data: Census Example
Overview of Census Sample
Overview of Data Manipulation Functions
Work with the Census Example
Displaying in a Simple Plot
Displaying a Bar Plot
Exporting Data
Summary
Manipulating Data: Stock Sample
Preparing
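The cross product defined in excerpt 37 (the transpose of the first matrix multiplied by the second) can be written out in a few lines. The following Python sketch uses plain lists of rows rather than bdFrames and is an illustration only; the function name crossprod is borrowed from the text, and the row-at-a-time accumulation is an assumption made here to show that the result can be built without holding the transpose in memory.

```python
def crossprod(x, y):
    """Return t(x) multiplied by y for two row-major matrices.

    x and y must have the same number of rows. The result has one row per
    column of x and one column per column of y.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of rows")
    ncol_x, ncol_y = len(x[0]), len(y[0])
    out = [[0.0] * ncol_y for _ in range(ncol_x)]
    # Each pair of rows contributes an outer product to the result, so the
    # rows can be consumed one at a time (or one block at a time).
    for x_row, y_row in zip(x, y):
        for i, xv in enumerate(x_row):
            for j, yv in enumerate(y_row):
                out[i][j] += xv * yv
    return out
```

Because the output size depends only on the number of columns, the result stays small even when both inputs have millions of rows.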
38. pie chart appears as follows:

Figure 5.35: Graph using aggregate to create a Trellis pie chart (axis label: Barley Yield (bushels/acre)).

Create a Trellis Wireframe Plot

A surface plot is an approximation to the shape of a three-dimensional data set. Surface plots are used to display data collected on a regularly spaced grid; if gridded data is not available, interpolation is used to fit and plot the surface. The Trellis function that displays surface plots is wireframe. For big data sets, wireframe requires a preprocessing function such as loess.

To create a sample Trellis surface plot using loess to preprocess the data, in the Commands window, type the following:

    environ.bd <- as.bdFrame(environmental)
    ozo.m <- loess((ozone^(1/3)) ~ wind * temperature * radiation,
      data = environ.bd, parametric = c("radiation", "wind"),
      span = 1, degree = 2)
    w.marginal <- seq(min(environ.bd$wind),
      max(environ.bd$wind), length = 50)
    t.marginal <- seq(min(environ.bd$temperature),
      max(environ.bd$temperature), length = 50)
    r.marginal <- seq(min(environ.bd$radiation),
      max(environ.bd$radiation), length = 4)
    wtr.marginal <- list(wind = w.marginal,
      temperature = t.marginal, radiation = r.marginal)
    grid <- expand.grid(wtr.marginal)
    grid[, "fit"] <- c(predict(ozo.m, grid))
    print(wireframe(fit ~ wind * temperature | radiation,
      data = grid, xlab = "Wind Speed (mph)",
      ylab = "Temperature (F)", main = "Cube
39. pie charts are most useful when the emphasis is on an individual item's relation to the whole; in these cases, the sizes of the pie wedges are naturally interpreted as percentages. Calling pie directly on a big data object can result in a pie with thousands of wedges; therefore, preprocess the data using table to reduce the number of wedges.

To create a sample pie chart using table to preprocess the data, in the Commands window, type the following:

    fuel.bd <- as.bdFrame(fuel.frame)
    pie(table(fuel.bd$Type), names = levels(fuel.bd$Type),
      sub = "Count")

The pie chart appears as follows:

Figure 5.34: Graph using table to create a pie chart.

Create a Trellis Pie Chart

The function piechart creates a pie chart in a Trellis graph.
• If your data contains a small number of cases, convert the data to a standard data frame before calling piechart.
• If your data contains a large number of cases, first use aggregate, and then use bd.coerce to create the appropriate small data set.

To create a sample Trellis pie chart using aggregate to preprocess the data, in the Commands window, type the following:

    barley.bd <- as.bdFrame(barley)
    temp.df <- bd.coerce(aggregate(barley.bd$yield,
      list(year = barley.bd$year, variety = barley.bd$variety), sum))
    piechart(variety ~ x | year, data = temp.df,
      xlab = "Barley Yield (bushels/acre)")

The Trellis
40. preprocessing function such as loess.

The following example creates a contour plot of predictions from loess. To create a sample Trellis contour plot using loess to preprocess data, in the Commands window, type the following:

    environ.bd <- as.bdFrame(environmental)
    ozo.m <- loess((ozone^(1/3)) ~ wind * temperature * radiation,
      data = environ.bd, parametric = c("radiation", "wind"),
      span = 1, degree = 2)
    w.marginal <- seq(min(environ.bd$wind),
      max(environ.bd$wind), length = 50)
    t.marginal <- seq(min(environ.bd$temperature),
      max(environ.bd$temperature), length = 50)
    r.marginal <- seq(min(environ.bd$radiation),
      max(environ.bd$radiation), length = 4)
    wtr.marginal <- list(wind = w.marginal,
      temperature = t.marginal, radiation = r.marginal)
    grid <- expand.grid(wtr.marginal)
    grid[, "fit"] <- c(predict(ozo.m, grid))
    print(contourplot(fit ~ wind * temperature | radiation,
      data = grid, xlab = "Wind Speed (mph)",
      ylab = "Temperature (F)",
      main = "Cube Root Ozone (cube root ppb)"))

The Trellis contour plot is displayed as follows:

Figure 5.27: Graph using loess to create a Trellis contour plot (title: Cube Root Ozone (cube root ppb); axes: Wind Speed (mph), Temperature (F)).

Create a Dot Chart

When you create a dot chart, you can use a grouping variable and group summary, along with other options. The function dotchart can be p
41. rf, rgamma, rgeom, rhyper, rlnorm, rlogis, rmvnorm, rnbinom, rnorm, rnrange, rpois, rstab, rt, runif, rweibull, rwilcox, seq

Table 4.1: Data manipulation tasks and their associated functions

Each task is followed by its function names.

Displaying and exploring bdFrame data
    bd.cor, bd.crosstabs, bd.univariate, show, summary, bd.data.viewer

Manipulating data in blocks
    bd.block.apply, bd.by.group, bd.by.window

Manipulating time series data
    print, summary, aggregateSeries, align, diff, seriesMerge

Cleaning existing data
    bd.remove.missing, bd.normalize, bd.duplicated, bd.unique

Splitting data
    bd.split, bd.split.by.group, bd.split.by.window

Appending data sets, either by rows or by columns
    bd.append, bd.join

Manipulating rows
    bd.filter.rows, bd.partition, bd.relational.restrict, bd.sample, bd.select.rows, bd.shuffle, bd.sort, rowMaxs, rowMeans, rowMins, rowRanges, rowStdevs, rowSums, rowVars

Manipulating columns
    bd.aggregate, bd.bin, bd.create.columns, bd.filter.columns, bd.modify.columns, bd.relational.divide, bd.relational.project, bd.reorder.columns, bd.transpose, bd.stack, bd.unstack, colMaxs, colMeans, colMins, colRanges, colStdevs, colSums, colVars

Exporting data
    exportData

Relational operations
    bd.relational.difference, bd.relational.intersection
42. system default. Select Custom Color, and then click the color button to display the Color dialog box and choose a different background color.
• Choose Input Color / Choose Output Color. By default, the Console View displays input and output as blue and red, respectively. You can select a custom color by clicking the color button and then, in the Color dialog box, selecting a color for the input or output.

Figure 2.4: S-PLUS Console Options dialog box.

S-PLUS Workbench options

These options control general settings for the S-PLUS Workbench.

Run code on startup. Select this option, and then provide any code that you want the S-PLUS Workbench to run when it starts up. Note that this box is cleared by default, so no additional libraries (including the Big Data library) are loaded by default.

Note: If you clear the Run code on startup box, or if you remove the option to load the Big Data library on startup, and then later open a project that uses the bigdata library, you could see unexpected results when you try to perform actions. If your projects typica
43. tiled across the bottom of the window. The Script Editor pane is empty.

To Customize the S-PLUS Workbench Default Perspective
1. Click the Outline View tab and drag the view beside the Navigator View. The Outline View now tiles with the Navigator View.
2. Click the History view tab and drag the view to the right; it now tiles with the other views.
3. Right-click the Tasks view tab and select Fast View. The Tasks view minimizes and appears as an icon in the window's status bar.
4. Click the Console view tab to select it.
5. Click Window > Save Perspective As.
6. In the Name box, type Sample Exercise, and then click OK. The Sample Exercise perspective button appears on the toolbar.

Figure 2.19: Sample exercise perspective button.

To return to the S-PLUS Workbench default, click the perspective button to the left of the Sample Exercise button, and then click Other. In the Select Perspective dialog box, select S-PLUS default, and then click OK. The perspective returns to its previous layout.

You can select other views to display in your perspective.

To Change the Displayed Views
1. To change the views, or to display the list of available views, on the menu click Window > Show View.
2. From the submenu, select the view to display. Alternatively, if you do not see the view you want to disp
44. to a higher or lower number, depending on the data set size and the degree of accuracy you want for the clusters.

    NK <- 40

Set the random number generator seed for reproducibility.

    set.seed(21)

Call bdCluster, passing your normalized large data set as the data argument. Provide column names and the cluster number. Assign the resulting object to cluster.bd.

    cluster.bd <- bdCluster(P8.Nz.bd, columns = column.names.Nz, k = NK)

Extract the predicted cluster groups from the cluster object with the predict function, and then call cbind to bind the resulting prediction to your normalized data set. Assign the resulting object to cluster.p.bd.

    cluster.p.bd <- cbind(P8.Nz.bd, predict(cluster.bd))

Display the resulting data in the data viewer. Your data set should contain 32,165 rows.

    bd.data.viewer(cluster.p.bd)

Analyze and Graph the Resulting Clusters

In the next section, analyze the clusters that you created above. During this exercise, produce a series of histograms that illustrate each cluster's age distributions by gender. At this point, you use no geographical information. In this example, you will produce two different summaries of the clusters:
• The mean histogram within each cluster group.
• The number (count) of members of each cluster group.

Aggregate and order the cluster group
1. Code the cluster ID into the variable PREDICT.membership
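The two cluster summaries named in excerpt 44 (the member count and the mean profile of each cluster group) amount to a grouped aggregation. The Python sketch below illustrates that computation on plain lists; it is not the S-PLUS aggregation code from the exercise, and the function name summarize_clusters and its arguments are hypothetical.

```python
def summarize_clusters(cluster_ids, rows):
    """Return {cluster_id: (count, column_means)} in one pass over the rows.

    cluster_ids[i] is the predicted cluster of rows[i]; every row is a list
    of numbers with the same length.
    """
    sums = {}
    counts = {}
    for cid, row in zip(cluster_ids, rows):
        counts[cid] = counts.get(cid, 0) + 1
        acc = sums.setdefault(cid, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return {
        cid: (counts[cid], [s / counts[cid] for s in sums[cid]])
        for cid in counts
    }
```

For example, with cluster IDs [1, 1, 2] and rows [[1.0, 3.0], [3.0, 5.0], [0.0, 2.0]], cluster 1 has two members with mean profile [2.0, 4.0], and cluster 2 has one member. Each mean profile is what a per-cluster "mean histogram" would plot.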
45. Figure 6.1: Plot showing economic status to housing value (x axis: boston.housing.bd$LSTAT).

4. Compute the logarithm of MEDV and add it to the boston.housing.bd object:

    boston.housing.bd$LMEDV <- log(boston.housing.bd$MEDV)

This requires two passes over the data: one to compute the log, and one to add the new variable (called LMEDV) to the original data object. A more efficient method is to use the bd.create.columns function:

    boston.housing.bd <- bd.create.columns(boston.housing.bd,
      exprs = "log(MEDV)", names = "LMEDV")

5. To see the relationship between distance to employment centers and the logarithm calculated, use the plot command:

    plot(boston.housing.bd$DIS, boston.housing.bd$LMEDV)

Figure 6.2: Plot of distance to employment centers (x axis: boston.housing.bd$DIS; y axis: boston.housing.bd$LMEDV).

6. Based on scatterplots of log(housing values) versus the other predictors (not shown here), we decide to account for the nonlinear relationships by transforming five of the predictor variables. Use bd.create.columns to create all the new variables in one pass through the data:

    boston.housing.bd <- bd.create.columns(boston.housing.bd,
      exprs = c("log(RAD)", "log(LSTAT)", "NOX^2", "log(DIS)", "RM^2"),
      names = c("LRAD", "LLSTAT", "NOX2", "LDIS", "RM2"))

7. Open the data viewer and examine the new columns:

    bd.data.viewer(boston
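The advantage described in excerpt 45, creating several derived columns in a single pass, can be shown with a small Python sketch. It is a conceptual stand-in for bd.create.columns, not its implementation: rows are dictionaries, and the function create_columns and its exprs argument (a mapping from new column name to a function of the row) are hypothetical.

```python
import math


def create_columns(rows, exprs):
    """Add every derived column named in exprs during one pass over rows.

    exprs maps a new column name to a function that computes its value from
    the original row. The input rows are left unchanged.
    """
    out = []
    for row in rows:
        new_row = dict(row)
        for name, fn in exprs.items():
            # All expressions see the original row, so their order does not
            # matter and the data is read only once.
            new_row[name] = fn(row)
        out.append(new_row)
    return out


# Usage sketch with two of the transformations from the exercise.
rows = [{"MEDV": 1.0, "RAD": math.e, "NOX": 0.5}]
derived = create_columns(rows, {
    "LMEDV": lambda r: math.log(r["MEDV"]),
    "LRAD": lambda r: math.log(r["RAD"]),
    "NOX2": lambda r: r["NOX"] ** 2,
})
```

Evaluating all expressions inside the same loop is what makes this one pass; adding the columns one assignment at a time would read the data once per new column.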
46. 013 rows containing uninteresting data Big Data Viewer P8 bd amp lol x File Edit Rounding Help Factor Date numeric numeric numeric numeric 41 476 291 oo 320 00 198 00 al x oo 763 oo 578 oo 410 00 3 1 456 00 1 163 oo 1 146 oo 720 00 4 00 56 oo Total number columns 43 Numeric columns 42 Total number rows 32165 Factor columns 0 String columns 1 Date columns 0 Other columns 0 Figure 4 2 P8 bd Other functions that provide cleaning filtering and compiling are bd partition rows bd sample rows bd select rows bd shuffle rows bd sort rows As part of your data manipulation you can separate the data set according to the types of data in its columns add reference columns and add columns representing values manipulated to provide more usable information 97 Chapter 4 Exploring and Manipulating Large Data Sets Displaying in a Simple Plot 98 To create reference and data columns e Create separate data sets to hold the reference data columns and the data columns assigning the gender and age bins to the object P8 ref bd lt P8 bd c 1 4 41 43 ref cols P8 data bd lt P8 bd 5 40 data cols The reference data columns contain the ZCTA information the population totals and the housing information e The data columns contain the gender and age bins To create and transform columns of existing data 1 Add to the re
47. 108 Cleaning the Stock Data Working with Time Series Data Creating the Time Series Manipulating Data Stock Sample names closeSP500 bd 1 lt SP500 3 Join the index series with those of the conglomerate stocks closePrices bd lt bd join list closePrices bd closeSP500 bd key columns DATE 4 View the univariate summaries summary closePrices bd When you examine closePrices bd notice that most of the stocks as well as the S amp P 500 Index have 97 NA values These NA values represent the days the market was closed for holidays over the 10 year period In the next steps drop these NA values To drop the NA values 1 Identify the missing days for the entire index and remove those days represented by NAs in the S amp P 500 Index closePrices bd lt bd remove missing closePrices bd columns SP500 method drop 2 Examine the whole data set in the Data Viewer bd data viewer closePrices bd Of the 24 stocks notice that only RTK TVIN KOR MITSY still have NAs These constituents were not listed in the S amp P 500 for the entire observation period In the following steps using the stock sample data remove the shorter term constituents and then create a time series representing the conglomerate stock closing price returns In the previous section you discovered that the stocks RTK TVIN KOR MITSY have a shorter history than the other stocks In this analysis consider only the const
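The "drop" treatment of missing values described in excerpt 47 (remove every row in which a chosen column is NA) can be sketched in Python. This illustrates the rule only and is not bd.remove.missing itself; the function name remove_missing is hypothetical, rows are dictionaries, and None stands in for NA.

```python
def remove_missing(rows, columns):
    """Keep only the rows that have a value in every listed column.

    None plays the role of NA. Columns that are not listed are not checked,
    so missing values elsewhere in a row survive.
    """
    return [
        row for row in rows
        if all(row.get(col) is not None for col in columns)
    ]


# Usage sketch: drop market holidays, identified by a missing index value.
prices = [
    {"DATE": 1, "SP500": 10.0, "KOR": None},
    {"DATE": 2, "SP500": None, "KOR": None},
]
trading_days = remove_missing(prices, ["SP500"])
```

Filtering on the index column alone removes the market holidays while leaving rows in which only a shorter-history stock is missing, which is why some constituents still show NAs after this step.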
48. 79213 0.66704590
    4 0.8345442 1.726552 0.09256986 0.10535579
    5 6.6856195 2.087905 0.42910847 0.08836129
    ... 495 more rows ...

Note: To increase the number of rows of output data displayed, increase the print.bdFrame.rows value using bd.options; for example, bd.options(print.bdFrame.rows=15).

6. To display the standard deviations and observation information, just print the object:

    print(prim4.bdp)

    bdPrincomp(x = prim4.bd)
    Standard deviations:
      Comp.1   Comp.2    Comp.3    Comp.4
    5.133588 2.533057 0.9316154 0.8374292
    The number of variables is 4 and the number of observations is 500

7. To get more details on the components, use the summary function:

    summary(prim4.bdp)

    Importance of components:
                              Comp.1    Comp.2     Comp.3    Comp.4
    Standard deviation     5.1335885 2.5330575 0.93161540 0.8374292
    Proportion of Variance 0.7674509 0.1868524 0.02527446 0.0204223
    Cumulative Proportion  0.7674509 0.9543032 0.97957770 1.0000000

Clustering

Cluster analysis segments observations into classes or clusters so that the degree of similarity is strong between members of the same cluster and weak between members of different clusters. If you are involved in market research, you could use clustering to group respondents according to their buying preferences. If you are performing medical research, you may be able to better determine treatment if diseases are properly grouped. Purchases, economic background, an
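The "Proportion of Variance" and "Cumulative Proportion" rows in the summary output of excerpt 48 follow directly from the standard deviations: each component's variance is its standard deviation squared, divided by the sum of all the variances. The short Python check below reproduces that arithmetic from the printed values; the function name proportion_of_variance is hypothetical.

```python
from itertools import accumulate


def proportion_of_variance(sdevs):
    """Return each component's share of the total variance."""
    variances = [s * s for s in sdevs]
    total = sum(variances)
    return [v / total for v in variances]


# Standard deviations as printed by summary(prim4.bdp) above.
sdevs = [5.1335885, 2.5330575, 0.93161540, 0.8374292]
props = proportion_of_variance(sdevs)
cumulative = list(accumulate(props))
```

The first two components together account for roughly 95% of the variance, which is the usual basis for deciding how many components to keep.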
49. Directory
1. Right-click the Search Path View.
2. From the menu, click Attach Directory.
3. In the Attach dialog box, in the Directory to attach text box, browse to the directory location.
4. In the Label text box, type Projects.
5. In the Position text box, type 4.
6. Click OK and examine the Search Path View. The label you provided should appear at position 4.

From the Search Path View, you can detach a database from your current session.

To Detach a Database
1. In the Search Path View, right-click bigdata.
2. In the right-click menu, select Detach.
3. Examine the Search Path View. The Big Data library is no longer attached.

When you refresh the view, any changes to the search path that are not yet shown in the Search Path View are displayed. For example, if you add a library by calling the load function in an S-PLUS script, the change is not immediately displayed in the Search Path View.

To Refresh the View
1. Using the Console View, reattach the Big Data library. In the Console View, type:

    library(bigdata, first=T)

2. Right-click the Search Path View. In the right-click menu, click Refresh. Notice that the Big Data library appears as attached in the first position.

Creating a Script

You can create a new S-PLUS script file, or you can import an existing script file. The following two examples demonstrate bo
50. History View tab to give it focus.
3. Right-click the first line of code and click Select input. The code is copied to the Console View. You must return to the Console View and press ENTER to run the code. Alternatively, double-click the code in the History View to copy it to the Console View.

You can scroll through the individual entries in the History View; as you scroll, the selection appears in the Console View. To run a selected item, switch from the History View to the Console View and press ENTER at the end of the code line.

Running Code and Reviewing the Output
You can run code directly from the Script Editor by using the Run feature.

To Run Code
1. Select the Output View tab.
2. In the Script Editor, select the code to run (or, to run the whole script, select nothing) and press F9 or, on the toolbar, click Run. The Output View displays the run code and any S-PLUS messages.

Fixing Problems in the Code
Introduce a programmatic problem in the script to examine the results in the Problems View.

To Examine Problems
1. In the Script Editor, on line 9 of the script, remove the closing parenthesis.
2. Save the file. Note that the Problems View tab shows bold text.
3. Click the Problems View tab to display the view.
4. Click the problem description. Note that the Script Editor highlights the line where the code is broken.
5. In the Script E
51. Insightful S-PLUS 7 Enterprise Developer User's Guide, April 2005. Insightful Corporation, Seattle, Washington.

Proprietary Notice
Insightful Corporation owns both this software program and its documentation. Both the program and documentation are copyrighted with all rights reserved by Insightful Corporation. The correct bibliographical reference for this document is as follows: S-PLUS 7 Enterprise Developer User's Guide, Insightful Corporation, Seattle, WA. Printed in the United States.

Copyright Notice
Copyright 1987-2005, Insightful Corporation. All rights reserved. Insightful Corporation, 1700 Westlake Avenue N, Suite 500, Seattle, WA 98109-3044, USA.

Trademarks
Insightful, Insightful Corporation, the Insightful logo, S-PLUS, Insightful Miner, S+FinMetrics, S+SeqTrial, SpatialStats, S+ArrayAnalyzer, S+EnvironmentalStats, S+Wavelets, S-PLUS Graphlets, and Graphlet are either trademarks or registered trademarks of Insightful Corporation in the United States and/or other countries. Intel and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Microsoft, Windows, MS-DOS, and Windows NT are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries. All product names mentioned herein may be trademarks or registered trademarks of their respective companies.

Acknowledgments ACKNO
52. N Proportion of residential land zoned for lots over 25,000 square feet

Source: The data are available from the University of California, Irvine, Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html).

Note: The entire script for this example can be found in the sample directory on the S-PLUS installation CD. By default, this sample is samples/bigdata/boston.

Import the data
1. In the Commands window, type:
boston.housing.bd <- importData(paste(getenv("SHOME"), "/samples/bigdata/boston/bostonhousing.txt", sep=""), stringsAsFactors=F, bigdata=T)
In this example, we change the default stringsAsFactors from TRUE to FALSE because this example does not use levels. If you do not need to use levels, setting stringsAsFactors to FALSE can improve the speed of your data import.
2. Open the data viewer to examine the data:
bd.data.viewer(boston.housing.bd)

Summarize, manipulate, and plot the data
1. To see a summary of the data, at the Command prompt, type:
summary(boston.housing.bd)
2. To see a correlation matrix of the data, at the command prompt, type:
bd.cor(boston.housing.bd)
3. To see how the percentage of lower economic status relates to housing value, create a scatterplot:
plot(boston.housing.bd$LSTAT, boston.housing.bd$MEDV)
53. PLUS function. When you call the timeCalendar function with any big data arguments, then a bdTimeDate object is created.
timeSeq: Standard S-PLUS function; to use with a large data set, set the bigdata argument to TRUE.

In the following table, the cross-hatch indicates that the function is implemented for the corresponding class. If the table cell is blank, the function is not implemented for the class. This list includes bdVector objects (bdTimeDate and bdTimeSpan) and bdSeries classes (bdSignalSeries, bdTimeSeries).

Table A.15 Time Date and Series Functions
Columns: Function, bdTimeDate, bdTimeSpan, bdSeries, bdSignalSeries, bdTimeSeries
Functions listed: align, all.equal, Arith, as.bdFrame, as.bdLogical, bd.coerce, ceiling, coerce, cor, cumsum, cut, data.frameAux, days, deltat, diff, end, floor, hms, hours, match, Math, Math2, max
54. Root Ozone (cube root ppb)
The surface plot is displayed as follows:
Figure 5.36: Graph using loess to create a surface plot

Unsupported Functions
Using the functions that add to a plot, such as points and lines, results in an error message.

MODELING LARGE DATA SETS
Introduction
Overview of Modeling
Building a Model
  Linear Regression and Generalized Linear Modeling
  Principal Components
  Clustering
Predicting from the Model
  Predicting on Big Data from Small Data Models

INTRODUCTION
In Chapter 4, Exploring and Manipulating Large Data Sets, you graphed the filtered Census data. In this chapter, the functions available for modeling large data sets are reviewed. In this chapter, you will perform:
• A linear regression
• Principal components reduction
• K-means clustering and predicting
• Prediction from small data

OVERVIEW OF MODELING
The Big Data library provides modeling functions on big data sets for linear models, generalized linear models (logistic regression, loglinear models, and so on), principal components, and K-means clustering. In addition, you can also do prediction (scoring) with big data sets using almost any standard S-PLUS model object
55. S-PLUS Enterprise Developer includes the S-PLUS Workbench, the S-PLUS customization of the Eclipse Integrated Development Environment. It also includes the premier data analysis package and the ability to handle both small and large data sets. Programmers familiar with the S language will be comfortable immediately with the Big Data library's object-oriented design and syntax. It is designed to work with existing S-PLUS functions, and many functions available in the S-PLUS engine also work with large data sets. Conversely, Big Data library functions work with small data sets. For a comprehensive list of the Big Data library functions, see the Appendix.

Note: The Big Data library loads by default only in the Windows S-PLUS GUI and from S-PLUS BATCH. The Big Data library is not loaded by default when you start S-PLUS from the S-PLUS Workbench, the Unix or Windows command line, or as a Console application. Always load the Big Data library if you work with big data projects. If you start a big data project without having loaded the Big Data library, you will see errors when you run your script. To set the option to start S-PLUS without loading the Big Data library in the Windows S-PLUS GUI, on the menu, click Options > General Settings, and then click the Startup tab. Clear the Load Bigdata library check box.

Note: When you work with large data sets using S-PLUS, use the Commands window and Script window. See the S PLUS U
56. Tasks on page 41.

Editor: An integrated code/text editor that includes support for syntax coloring, text formatting, and integration with the other views. Analogous to the Script Editor in the traditional S-PLUS GUI. See Script Editor. To practice using the Script Editor, see the section Editing Code in the Script Editor on page 49.

If you are not familiar with the Eclipse IDE, once you start the S-PLUS Workbench, take the first few minutes to learn the basic concepts and IDE layout by working through the basic tutorial in the Workbench User Guide.

To View the Eclipse Getting Started Tutorial
1. From the Workbench main menu, click Help > Help Contents.
2. In the right pane, expand the table of contents by clicking Workbench User Guide.
3. Click Getting Started, and then click Basic tutorial.
The Workbench User Guide opens in a separate window; you can toggle back and forth between the Workbench application and the User Guide.

Additional S-PLUS Customizations
The S-PLUS Workbench includes the following additional customizations to the basic Eclipse UI.

Customized Menus: The S-PLUS Workbench provides customizations to Eclipse menu options. For more information, see the section Menu Options on page 39.

Function Help: The S-PLUS Workbench provides access to function help topics:
• In the Console View, type help(functionname), where functionname is the function for whic
57. TimeDate object is created. See Table A.14 in the Appendix.

Note: bdTimeDate always assumes the time as Greenwich Mean Time (GMT); however, S-PLUS stores no time zone with an object. You can convert to a time zone with timeZoneConvert, or specify the zone in the bdTimeDate constructor.

Time Conversion Operations
To convert time and date values, apply the standard S-PLUS time conversion operations to the bdTimeDate object, as listed in Table A.14 in the Appendix.

Matrix Operations
The Big Data library does not contain separate equivalents to matrix and data.frame. Standard S-PLUS matrix operations are available for bdFrame objects, including:
• matrix algebra (+, -, *, /, !, &, |, >, <, ==, !=, <=, >=)
• matrix multiplication (%*%)
• Crossproduct (crossprod)
solve does not support big data objects in version 7.

In algebraic operations, the operators require the big data objects to have appropriately corresponding dimensions. Rows or columns are not automatically replicated.

Basic algebra
You can perform addition, subtraction, multiplication, division, logical (!, &, and |), and comparison (>, <, ==, !=, <=, >=) operations between:
• A scalar and a bdFrame.
• Two bdFrames of the same dimension.
• A bdFrame and a single-row bdFrame with the same number of columns.
• A bdFrame and a single-column bdFrame with the same
58. Trellis Graphics in the Application Developer's Guide.

Create a Simple Histogram
A histogram displays the number of data points that fall in each of a specified number of intervals. A histogram gives an indication of the relative density of the data points along the horizontal axis. For this reason, density plots are often superposed with scaled histograms.

To create a sample hist chart of a full dataset for a numeric vector, in the Commands window, type the following:
fuel.bd <- as.bdFrame(fuel.frame)
hist(fuel.bd$Weight)
The numeric hist chart is displayed as follows:
Figure 5.15: Graph using hist for numeric data

To create a sample hist chart of a full dataset for a factor column, in the Commands window, type the following:
fuel.bd <- as.bdFrame(fuel.frame)
hist(fuel.bd$Type)
The factor hist chart is displayed as follows:
Figure 5.16: Graph using hist for factor data

Create a Trellis Histogram
The histogram function for a Trellis graph is histogram. To create a sample Trellis histogram, in the Commands window, type the following:
singer.bd <- as.bdFrame(singer)
histogram(~ hei
59. US code to sequential data blocks. User code called within bd.block.apply should not be written to depend on having a particular block size, because the block size is different when the input data has different numbers and types of columns. When developing such code, test it with several small values of bd.options("block.size") to ensure that it does not depend on the block size.

BIG DATA STRING AND FACTOR ISSUES
Big data columns of types character and factor have limitations that are not present for regular data.frame objects. Most of the time, these limitations do not cause problems, but in some situations, warning messages can appear, indicating that long strings have been truncated, or factors with too many levels had some values changed to NA. This section explains why these warnings may appear, and how to deal with them.

String Column Widths
When a bdFrame character column is initially defined, before any data is stored in it, the maximum number of characters (or string width) that can appear in the column must be specified. This restriction is necessary for rapid access to the cache file. Once this is specified, an attempt to store a longer string in the column causes the string to be truncated and generates a warning. It is important to specify this maximum string width correctly. All of the big data opera
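The advice above about not depending on a particular block size can be illustrated without the Big Data library. The following is a minimal sketch in plain S using only base functions; it does not call bd.block.apply, and blockwise.mean is a hypothetical helper written for this illustration. It keeps only a running total and count between blocks, so the answer is the same however the data is split.

```r
# Hypothetical helper (not part of the Big Data library): computes a mean by
# visiting x in consecutive blocks of block.size elements, carrying only a
# running total and a running count from one block to the next.
blockwise.mean <- function(x, block.size) {
  total <- 0
  n <- 0
  starts <- seq(1, length(x), by = block.size)
  for (s in starts) {
    e <- min(s + block.size - 1, length(x))
    block <- x[s:e]
    total <- total + sum(block)
    n <- n + length(block)
  }
  total / n
}

x <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)

# Try several small block sizes, as the text recommends; every result
# should agree with the ordinary mean of x.
results <- sapply(c(1, 3, 4, 10), function(b) blockwise.mean(x, b))
print(results)
```

A version that, say, averaged the per-block means would give different answers whenever the last block is shorter than the others, which is exactly the kind of dependence that testing with several block sizes is meant to expose.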
60. WLEDGMENTS
S-PLUS would not exist without the pioneering research of the Bell Labs S team at AT&T (now Lucent Technologies): John Chambers, Richard A. Becker (now at AT&T Laboratories), Allan R. Wilks (now at AT&T Laboratories), Duncan Temple Lang, and their colleagues in the statistics research departments at Lucent: William S. Cleveland, Trevor Hastie (now at Stanford University), Linda Clark, Anne Freeny, Eric Grosse, David James, José Pinheiro, Daryl Pregibon, and Ming Shyu.

Insightful Corporation thanks the following individuals for their contributions to this and earlier releases of S-PLUS: Douglas M. Bates, Leo Breiman, Dan Carr, Steve Dubnoff, Don Edwards, Jerome Friedman, Kevin Goodman, Perry Haaland, David Hardesty, Frank Harrell, Richard Heiberger, Mia Hubert, Richard Jones, Jennifer Lasecki, W. Q. Meeker, Adrian Raftery, Brian Ripley, Peter Rousseeuw, J. D. Spurrier, Anja Struyf, Terry Therneau, Rob Tibshirani, Katrien Van Driessen, William Venables, and Judy Zeh.

CONTENTS
Acknowledgments
Chapter 1 Introduction
  Welcome to the S-PLUS Enterprise Developer User's Guide
  Analyzing Large Data Sets
  Advanced Programming
Chapter 2 The S-PLUS Workbench
  Introduction
  Starting the S-PLUS Workbench
  S-PLUS Perspective
  Views
  Script Editor
  S-PLUS Workbench Tasks
  Commonly Used Features in Eclipse
Chapter 3 The Big Data Library
  Introduction
  Working with a Large Data S
61. a hexbin plot using two bdVectors.

The quantile-quantile plot is a good tool for determining a good approximation to a data set's distribution. In a qqplot, the ordered data are graphed against quantiles of a known theoretical distribution. To create a sample two-vector qqplot, in the Commands window, type the following:
fuel.bd <- as.bdFrame(fuel.frame)
qqplot(fuel.bd$Mileage, runif(length(fuel.bd$Mileage), bigdata=T))
Note that in this example, the required y argument for qqplot is runif(length(fuel.bd$Mileage)), the random generation for the uniform distribution for the vector fuel.bd$Mileage. Also note that using runif with a big data object requires that you set the runif argument bigdata=T.
The qqplot plot is displayed as follows:
Figure 5.21: Graph using qqplot

Create a One Dimensional Scatter Plot
The function stripplot creates a Trellis graph similar to a box plot in layout; however, the individual data points are shown instead of the box plot summary. To create a sample one-dimensional scatter plot, in the Commands window, type the following:
singer.bd <- as.bdFrame(singer)
stripplot(voice.part ~ jitter(height), data=singer.bd, aspect=1, xlab="Height (inches)")
62. al probability plot.
stripplot: Creates a Trellis graphic object similar to a box plot in layout; however, it displays the density of the datapoints as shaded boxes.

The following functions are used to preprocess large data sets for graphing:

Table 5.4 Functions used for preprocessing large data sets
Function        Description
aggregate       Splits up data by time period or other factors and computes summary for each subset.
hexbin          Creates an object of class hexbin. Its basic components are a cell identifier and a count of the points falling into each occupied cell.
hist2d          Returns a structure for a 2-dimensional histogram, which can be given to a graphics function such as image or persp.
interp          Interpolates the value of the third variable onto an evenly spaced grid of the first two variables.
loess           Fits a local regression model.
loess.smooth    Returns a list of values at which the loess curve is evaluated.
lsfit           Fits a (weighted) least squares multivariate regression.
smooth.spline   Fits a cubic B-spline smooth to the input data.
table           Returns a contingency table (array) with the same number of dimensions as arguments given.
tapply          Partitions a vector according to one or mor
63. ame max.block.mb: The maximum number of megabytes used for block processing buffers. If the specified block size requires too much space, the number of rows is reduced so that the entire buffer is smaller than this size. This prevents unexpected out-of-memory errors when processing wide data with many columns. The default value is 10.

The function bd.options contains other optional arguments for controlling column string width, display parameters, factor level limits, and overflow warnings. See its help topic for more information.

The Big Data library also contains functions that you can use to control block-based computations. These include the functions in Table 3.2. For more information and examples showing how to use these functions, see their help topics.

Table 3.2 Block-based computation functions
Function name   Description
bd.aggregate    Use bd.aggregate to divide a data object into blocks according to the values of one or more of its columns, and then apply aggregation functions to columns within each block. bd.aggregate takes two required arguments: data, which is the input data set, and by.columns, which identifies the names or numbers of columns defining how the input data is divided into blocks. Optional arguments include columns, which identifies the names or numbers of columns to be summarized, and methods, which is a vector of summary methods to be
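The grouping idea described for bd.aggregate above (split the rows by the values of a column, then summarize another column within each group) can be sketched on small in-memory data. The example below does not call bd.aggregate; it uses the standard tapply function instead, and the type and weight values are made up for illustration.

```r
# Made-up example data: one grouping column and one column to summarize
type <- c("Small", "Small", "Van", "Van", "Van")
weight <- c(2560, 2345, 3735, 3415, 3185)

# Split weight by the values of type and apply a summary to each group
group.means <- tapply(weight, type, mean)
group.counts <- tapply(weight, type, length)

print(cbind(mean = group.means, count = group.counts))
```

Here type plays the role that the by.columns argument plays in the description above, weight plays the role of columns, and mean and length stand in for the summary methods.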
64. ant to work with the original files. See Importing Files on page 43 for more information.

In the following sections, create an empty project and then import the Census project files.

To Create the Example Project
1. Click File > New Project.
2. In the New Project dialog box, expand the S-PLUS node and select S-PLUS Project. Click Next.
3. Provide the friendly project name, Census.
4. Accept the option Use default. This option creates the project directory in the default Workspace location.
5. Click Finish to create the project.
Figure 2.18: New Project dialog box

Note: When you create a project, you see in the Navigator View the .project resource. This resource is created by Eclipse and contains information that Eclipse uses to manage your project. You should not edit this file.

Importing Files
In this exercise, use the Census example, one of the examples provided with the S-PLUS Enterprise Developer edition.

To Import Files
1. With the Census Project node selected in the Navigator View, click File > Import.
2. In the Import Select dialog box, select File system, and then click Next.
3. In the Import File system dialog box, browse to the location of the census project, by default in your installation
65. ar association between the y-axis variable and the x-axis variable, you might want to display a straight line that has been fit to the data. Call lsfit to perform a least squares regression, and then use that regression to plot a regression line. The following example draws an abline on the chart that plots fuel.bd weight and mileage data. First, create a hexbin object and plot it, and then add the abline to the plot.

To add a regression line to a sample plot, in the Commands window, type the following:
fuel.bd <- as.bdFrame(fuel.frame)
hexbin.out <- plot(fuel.bd$Weight, fuel.bd$Mileage)
# displays a hexbin plot
# use add.to.hexbin to keep the abline within the hexbin area.
# If you just call abline, then the line might draw outside of
# the hexbin and interfere with the label.
add.to.hexbin(hexbin.out, abline(lsfit(fuel.bd$Weight, fuel.bd$Mileage)))
The resulting chart is displayed as follows:
Figure 5.7: Graph drawing an abline in a hexbin plot

Add a Loess Smoother
Use lines(loess.smooth()) to add a smooth curved line to a scatter plot. To add a loess smoother to a sample plot, in the Commands window, type the following:
fuel.bd <- as.bdFrame(fuel.frame)
hexbin.out <- plot(fuel.bd$Weight, fuel.bd$Mileage)
# displays a hexbin plot
add.to.hexbin(hexbin.out
66. ask added directly to the Tasks View displays a check box for marking the task complete in the Tasks View's first column. It does not display a reference to a resource, a directory, or a location.

To Add a Task in the Script File
In the script file, scroll to line 6.
1. Type the following text:
# FIXME Remove the comment markers to display the viewer
2. Save the script file. Note that the FIXME comment appears in the Tasks View as a high-level task, with a red exclamation mark in its second column. The task also displays information about its resource, directory, and line location. You can go directly to any task in your script by double-clicking it in the Tasks View.
3. In the Script Editor, change the level of the task by changing FIXME to TODO, and save the file. Note that the exclamation mark disappears, and the task becomes a normal-level task.

You can run your S-PLUS script code directly from Eclipse in two ways:
• Copy a selected block of code from the Script Editor to the Console View.
• Run the selected code (or all code, if none is selected) by clicking Run or pressing F9.

The Console View is an editable view; in other words, you can type commands and run them by pressing ENTER. Therefore, when you copy script contents to the Console View, you must include the line return, or the script will not run. This behavior is consistent with the S-PLUS Commands window in the S-PLUS GUI, which also requires a line return t
67. at is, create one project in a subdirectory of another project.
Figure 2.2: Workspace Launcher dialog box

Setting the Workspace
When you first launch the S-PLUS Workbench, you are prompted to supply the path to your S-PLUS Workspace.

To Set the Workspace
1. In the Workspace Launcher dialog box, specify the directory location where the Workspace .Data and .metadata databases will be stored.
2. Indicate whether you want to be prompted in future sessions to identify a Workspace using this dialog box.

Changing the S-PLUS Workspace
You can switch to another Workspace from within the S-PLUS Workbench user interface.

To Open a Different Workspace in S-PLUS Workbench
1. Save your work.
2. Click File > Switch Workspace.
3. In the Workspace Launcher dialog box, provide the new Workspace location.

Note: When you switch workspaces during an S-PLUS Workbench session, the current session closes, and a new session of S-PLUS Workbench starts, using the new Workspace location.

S-PLUS Preferences
When you open the S-PLUS Workbench, the IDE defaults are set to the default S-PLUS perspecti
68. at sheet's salient cluster, which is superimposed with the color assigned for that sheet.

To create the multi-tabbed histogram sheet
1. Set the vector to 1:NK, where NK is the cluster number:
Kvec <- 1:NK
2. Set up and name the histogram graph setup Name USA
3. Plot 20 clusters, one per tab, to create maps displaying age and gender population distribution for each cluster. Note that the histogram legend showing age and gender distribution appears on each tab.
par err 1 for k in 1 20 k index Memb vec PREDICT membership ordered k par plt c 1 1 1 13 plot Lon vec Lat vec pch 1 cex 0 3 col 1 xlim c 125 70 ylim c 25 50 xlab Lon ylab Lat points Lon vec k index Lat vec k index col lt indexl6 k cex 0 4 pch 16 par new T par plt ct 1 3 1 37 my vbar cluster pmc df k k plotcols 2 37 Nreport col 38 col 1 index16 k box
Figure 6.6: Sample population distribution histogram

PREDICTING FROM THE MODEL
Other books in the S-PLUS documentation discuss at length predicting from a model, including predicting from a linear model, a generalized linear model, a generalized additive model, principal components, and clustering. For more information about predicting, see the S-PLUS Guide to Statistics, Volume 1. The S-PLUS Big Data library provides support for predicting for most model types using big data as
69. ata(cumulativeReturns.ts)[lastObs, constituentNames, drop=T]

# Plot the 10-year return
plot(betas2, tenyrReturn, main="10-yr Return vs. Beta", xlab="beta", ylab="return", pch=16)
text(betas2 + 0.015, tenyrReturn, constituentNames, cex=0.7, adj=0)
points(1, seriesData(cumulativeReturns.ts)[lastObs, "SP500"], pch=16, col=3)
text(1 + 0.015, seriesData(cumulativeReturns.ts)[lastObs, "SP500"], "SP500", cex=0.7, col=3, adj=0)

Figure 4.7: Plot of 10-year return vs. the beta

Summary
In this chapter, you practiced exploring and manipulating big data sets using common Big Data library functions, including:
• Importing and viewing data.
• Coercing data to a smaller data set and back to a big data set.
• Sorting and filtering data.
• Creating columns.
• Appending data sets.
• Joining rows.
• Transforming data.
• Rendering graphs.
• Comparing calculation techniques.
• Plotting a time series.

In the next chapter, review the graph and chart functions that the Big Data library supports, using small, stand-alone data examples to call each graph function and display a different graph or chart type.

CREATING GRAPHICAL DISPLAYS OF LARGE DATA SETS
Introduction
Overview of Graph Funct
70. aty Industries Incorporated

Table 4.2 Stock Data used in Example
Stock symbol   Company name
lgl            Lynch Corporation
mitsy          Mitsui & Co., LTD
mmm            3M Company
ppg            PPG Industries Incorporated
quix           Quixote Corporation
rok            Rockwell Automation
rtk            Rentech Incorporated
sxi            Standex International Corporation
tfx            Teleflex Incorporated
tmo            Thermo Electron Corporation
tvin           TVI Corporation
txt            Textron Incorporated
tyc            Tyco International LTD
utx            United Technologies Corporation

Importing the Data
This example contains 25 csv files: one for each of the represented 24 conglomerate stocks, and one for the S&P 500 index. First, specify an object to contain the stock IDs, and then import each of the 24 conglomerate stock files.

Prepare the conglomerate stock data
1. Specify the constituent stock IDs and assign them to the object stockNames:
stockNames lt Tebet er dov for ge gx hoir Ase kor TKE al witsy mim ppg Tarix poka TEER Teki EFX rhor Evin TEST tye Uts
2. Import the corresponding source files:
srcFileNames <- paste(getenv("SHOME"), "/samples/bigdata/stocks/", paste(stockNames, ".csv", sep=""), sep="")

Manipulating the
In this section, create a
71. average demographic profile in these age/gender groups. The bd.create.columns function accepts values for the new columns and the expressions to form them. It is often convenient to pre-form these character vectors before the actual call, as shown here:
column.exprs <- paste(names(P8.dataN.bd), paste("P8.dataN.mean[", 1:36, "]", sep=""), sep="/")
column.names.N <- names(P8.dataN.bd)
column.names.Nz <- paste(column.names.N, "z", sep="")
P8.dataNz.bd <- bd.create.columns(P8.dataN.bd, exprs=column.exprs, names=column.names.Nz, row.language=F)

Note: The row.language argument above is set to F because the expressions contain the subset operator ([), which requires S-PLUS in its evaluation.

Display the new data in the Data Viewer. Note that the variable has a z appended to indicate that this is the normalized data:
bd.data.viewer(P8.dataNz.bd)
# 32,165 rows

This table shows the ratio of the population for each group compared to the national average. For example, the value of M05 in ZIP code 07043 is 1.2, meaning that this region has proportionally 20% more males in this age group than the national average.
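The normalization described above divides each region's value for an age/gender group by the average for that group, so a result of 1.2 reads as "20% above average". The following is a minimal sketch of that arithmetic in plain S on a tiny in-memory matrix. It does not call bd.create.columns; the region labels and proportions are made up, and the plain column mean used here is an assumption standing in for however the national average was actually computed.

```r
# Made-up proportions for three regions (rows) and two age/gender groups
props <- matrix(c(0.06, 0.05, 0.04,
                  0.07, 0.07, 0.07),
                nrow = 3,
                dimnames = list(c("A", "B", "C"), c("M00", "M05")))

# Average of each column, standing in for the national average
col.means <- apply(props, 2, mean)

# Divide every column by its own average
ratios <- sweep(props, 2, col.means, "/")

# Append z to the column names, as the text does for normalized columns
dimnames(ratios)[[2]] <- paste(dimnames(props)[[2]], "z", sep = "")

print(round(ratios, 2))
```

In this made-up data, region A has a ratio of 1.2 in the first group, while every region has a ratio of 1 in the second group because the proportions there are identical.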
72. bd.relational.join, bd.relational.product, bd.relational.union
Identifying and removing orphan caches: bd.cache.cleanup, bd.cache.info
Store and retrieve objects: bd.pack.object, bd.unpack.object

In the following exercises, import, filter, and manipulate the Census data.

This section describes importing the example data set from a data source using the importData command in the Commands window. For more information about importData, see its help topic.

To import the data set
1. In the Commands window, type:
P8.bd <- importData(paste(getenv("SHOME"), "/samples/bigdata/census/census.csv", sep=""), stringsAsFactors=F, startRow=1, bigdata=T)

Note: When you import data, you have the option to set the flag stringsAsFactors to T or F; the default is T. S-PLUS imposes a limit of 500 levels for bdFactors.

2. Display the resulting data set in the data viewer:
bd.data.viewer(P8.bd)

Each cell in the rectangular big data object displayed in the Viewer contains the count of either males or females within 5-year age bins, shown for each ZCTA. Each ZCTA is shown as a separate row; each male or female age bin is shown as a separate column. The columns labeled M00, M05, M10, and so on represent the number of males from 0 to 4 years, 5 to 9 years, 10 to 14 years. The columns labeled F00, F05, F10, and so on
73. ble in our model and attempt to predict its values based on the other thirteen variables in the data set. For a description of the other variables, see Table 6.1.

Size: The data set is fairly small (506 rows and 14 columns); however, to demonstrate the Big Data library modeling features, we import the data set as big data. This example would work without modification on a dataset of millions of rows.

Variables: The following table lists the bostonhousing.txt variables.

Table 6.1 bostonhousing.txt variables
Variable name   Description
AGE             Proportion of owner-occupied units built prior to 1940.
B               1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town.
CHAS            Indicates whether the property bounds the Charles River (1 if a tract bounds the river, 0 otherwise).
CRIM            Per capita crime rate by town.
DIS             Weighted distances to five Boston employment centers.
INDUS           Proportion of non-retail business acres per town.
LSTAT           Percentage of the population that is of lower economic status.
MEDV            Median value of owner-occupied homes in $1000s.
NOX             Nitric oxides concentration (parts per 10 million).
PTRATIO         Pupil-teacher ratio by town.
RAD             Index of accessibility to radial highways.
RM              Average number of rooms per dwelling.
TAX             Full-value property-tax rate per $10,000.
Z
74. button located in the upper right corner of each view, or when you right-click the view. Each view action also has a quick key sequence to perform the action. For example, to clear the text in the console, with the Console View active, type CTRL+L. When you modify an item in a view, it is saved immediately. Normally, only one instance of a particular type of view can exist in the Workbench window. Customized views in the S-PLUS Workbench include the following:

Table 2.2: S-PLUS Workbench views and exercise references
View               Practice exercise
Console View       See the exercise in the section To Run Copied Script Code on page 53.
History View       See the exercise in the section To Examine the History on page 53.
Objects View       See the exercise in the section To Examine the Objects on page 51.
Outline View       See the exercise in the section To Examine the Outline on page 50.
Output View        See the exercise in the section To Run Code on page 54.
Problems View      See the exercise in the section To Examine Problems on page 54.
Search Path View   See the exercise in the section Adding a Database on page 46 and section Detaching a Database on page 47.
Tasks View         See the exercise in the section To Add a Task in the Script File on page 52 and section To Add a Task Directly t
75. by processing large data sets using scalable algorithms and data streaming. Instead of loading the contents of a large data file into memory, S-PLUS creates a special binary cache file of the data on the user's hard disk and then refers to the cache file on disk. This out-of-memory design requires relatively small amounts of RAM, regardless of the total size of the data.

Chapter 3 The Big Data Library

Scalable Algorithms
Although the large data set is stored on the hard drive, the scalable algorithms of the Big Data library are designed to optimize access to the data, reading from disk a minimum number of times. Many techniques require a single pass through the data, and the data is read from the disk in blocks, not randomly, to minimize disk access times. These scalable algorithms are described in more detail in the section The Big Data Library Architecture on page 69.

Data Streaming
S-PLUS operates on the data binary cache file directly, using streaming techniques, where data flows through the application rather than being processed all at once in memory. The cache file is processed on a row-by-row basis, meaning that only a small part of the data is stored in RAM at any one time. It is this out-of-memory data processing technique that enables S-PLUS to process data sets hundreds of megabytes, or even gigabytes, in size without requiring large quantities of RAM.

New Data Type
S-PLUS Enterprise Dev
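The single-pass, block-at-a-time idea described above can be sketched in a few lines. This is an illustration in Python, not the S-PLUS cache format or API: the CSV file stands in for the binary cache, and the names iter_blocks, streaming_column_means, and write_demo_file are hypothetical helpers invented for the example.

```python
import csv
import os
import tempfile


def iter_blocks(path, block_rows):
    """Yield (header, block) pairs; each block holds at most block_rows numeric rows."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        if header is None:
            return
        block = []
        for row in reader:
            block.append([float(value) for value in row])
            if len(block) == block_rows:
                yield header, block
                block = []
        if block:
            yield header, block


def streaming_column_means(path, block_rows=1000):
    """Compute column means in one sequential pass, holding one block in memory."""
    header, totals, count = None, None, 0
    for header, block in iter_blocks(path, block_rows):
        if totals is None:
            totals = [0.0] * len(header)
        for row in block:
            for index, value in enumerate(row):
                totals[index] += value
        count += len(block)
    if count == 0:
        raise ValueError("no data rows found")
    return {name: total / count for name, total in zip(header, totals)}


def write_demo_file(n_rows):
    """Write a small two-column CSV (x, y = 2x) to a temporary file; return its path."""
    handle, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(handle, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["x", "y"])
        for i in range(n_rows):
            writer.writerow([i, 2 * i])
    return path


if __name__ == "__main__":
    demo_path = write_demo_file(10)
    print(streaming_column_means(demo_path, block_rows=3))
    os.remove(demo_path)
```

Only the running totals and the current block are held in memory, so the memory needed depends on the block size rather than on the number of rows in the file.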
76. cal 77 bdNumeric 77 bdPrincomp 77 bdSignalSeries 77 bdTimeDate 77 239 Index 240 bdTimeSeries 77 bdTimeSpan 77 bdVector 77 coef 76 Commands window 2 comparing versions 56 components Principal Components 170 console options 19 Console View 26 28 copying from script to console 15 create a Workbench project 41 Creating 109 crossprod 82 custom color setting 19 20 21 D data frame 73 data streaming 62 debugging 15 drop NA values 109 E Eclipse 11 edit code 49 empty project creating 41 evaluating expressions 40 existing files creating a project for 41 existing project importing files for 42 exporting data 104 external files opening 23 57 F file associations 19 filtering files 57 fitted 76 format code 40 formula 76 function help 14 functions to watch 21 G generalized linear modeling 162 graphical display 7 graphical user interface 4 graphics functions 78 GUI support Big Data library 2 4 H help displaying 38 History View 26 29 53 I import data 6 importing data 6 Boston housing example 181 importing multiple files 107 Stock example 107 importing files 42 J join columns 103 K K means 173 L linear model 163 linear regression 162 Boston Housing example 163 line numbers 49 50 displaying 21 38 loading Big Data library by default 2 M manipulate data 6 metadata 63 model 75 modeling functions 79 multiple projects 43 N Navigator View 27 48
77. calculated for columns. See the help topic for bd.aggregate for a list of the summary methods you can specify for methods.

Table 3.2: Block-based computation functions (continued)
Function name        Description
bd.block.apply       Run an S-PLUS script on blocks of data, with options for reading multiple input data sets, generating multiple output data sets, and processing blocks in different orders. See the help topic for bd.block.apply for a discussion on processing multiple data blocks.
bd.by.group          Apply the specified S-PLUS function to multiple data blocks within the input data set.
bd.by.window         Apply the specified S-PLUS function to multiple data blocks defined by a moving window over the input data set. Each data block is converted to a data frame and passed to the specified function. If one of the data blocks is too large to fit in memory, an error occurs.
bd.split.by.group    Divide a data set into multiple data blocks and return a list of these data blocks.
bd.split.by.window   Divide a data set into multiple data blocks defined by a moving window over the data set and return a list of these data blocks.

For a detailed discussion on advanced topics, such as block size issues and increasing efficiency, see Chapter 7, Advanced Programming Information.

Data Types
Data Frames
S-PLUS Ent
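The difference between group blocks and window blocks in the table above can be shown with a small sketch. It is written in Python for illustration only; split_by_group, split_by_window, and apply_by_group are hypothetical names, and the window parameters (width and step) are assumptions that do not reflect the actual arguments of the bd.* functions.

```python
from collections import defaultdict


def split_by_group(rows, key):
    """Divide rows (dictionaries) into blocks that share the same value of key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def split_by_window(rows, width, step=1):
    """Divide rows into blocks defined by a window of width rows moved by step rows."""
    return [rows[start:start + width]
            for start in range(0, len(rows) - width + 1, step)]


def apply_by_group(rows, key, func):
    """Apply func to each group block and collect the results by group value."""
    return {value: func(block) for value, block in split_by_group(rows, key).items()}


def mean_value(block):
    """Example block function: mean of the 'value' field."""
    return sum(row["value"] for row in block) / len(block)


if __name__ == "__main__":
    data = [{"group": "a", "value": 1.0},
            {"group": "b", "value": 10.0},
            {"group": "a", "value": 3.0}]
    print(apply_by_group(data, "group", mean_value))
    print(split_by_window([1, 2, 3, 4, 5], width=3))
```

Each block is handed to the function as an ordinary in-memory object, which is why a single group or window that is too large for memory is a problem even when the data set as a whole is processed in blocks.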
78. cally instantaneous, regardless of the data set size. For example:
   mean(census.data$Income)
   range(census.data$Age)

No 64-Bit Solution
Are out-of-memory data analysis techniques still necessary in the 64-bit age? While S-PLUS Enterprise Developer is available on some 64-bit systems, the out-of-memory techniques described above are still required to analyze truly large data sets. 64-bit systems increase the amount of memory that the system can address. This can help in-memory algorithms handle larger problems, provided that all of the data can be in physical memory. If the data and the algorithm require virtual memory, page swapping (that is, accessing the data in virtual memory on the disk) can have a severe impact on performance. With data sets now in the multiple-gigabyte range, out-of-memory techniques are essential. Even on 64-bit systems, out-of-memory techniques can dramatically outperform in-memory techniques when the data set exceeds the available physical RAM.

SIZE CONSIDERATIONS
While the Big Data library imposes no predetermined limit for the number of rows allowed in a big data object or the number of elements in a big data vector, your computer's hard drive must contain enough space to hold the data set and create the data cache. Given sufficient disk space, the big data object can be created and processed by any scalable function. The speed of mos
79. cause tree models are invariant to monotone re-expression of the predictor variables. This property, along with the ease of interpretation of the resulting tree, is among the reasons why tree models are popular. However, a disadvantage of tree models is their sensitivity: if you repeat the above sample-fit exercise, you will most likely get quite different trees. One way to overcome this problem is to aggregate multiple trees by averaging predictions from many different trees. See the literature on bagging and boosting of trees; Hastie et al. (2001) has a good overview of this technique.

References
Belsley, D., Kuh, E., and Welsch, R. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons, New York.
Harrison, D. and Rubinfeld, D. L. (1978). Hedonic prices and the demand for clean air. Journal of Environmental Economics and Management, 5, 81-102.

Chapter 6 Modeling Large Data Sets

ADVANCED PROGRAMMING INFORMATION
Introduction
Big Data Block Size Issues
  Block Size Options
  Group or Window Blocks
Big Data String and Factor Issues
  String Column Widths
  String Widths and importData
  String Widths and bd.create.columns
  Factor Column Levels
  String Truncation and Level Overflow Errors
Storing and Retrieving Large S Objects
  Managing Large Amounts of Data
Increasing Efficiency
  bd.select.rows
  bd.filter.rows
  bd
80. ctions for plotting big data using hexagonal binning
Function   Comment
pairs      Can accept a bdFrame object.
plot       Can accept a hexbin, a single bdVector, two bdVectors, or a bdFrame object.
splom      Creates a Trellis graphic object of a scatterplot matrix.
xyplot     Creates a Trellis graphic object, which graphs one set of numerical values on a vertical scale against another set of numerical values on a horizontal scale.

Functions Adding Reference Lines to Plots
The following functions add reference lines to hexbin plots.

Table 5.2: Functions that add reference lines to hexbin plots
Function                 Type of line
abline(lsfit())          Regression line.
lines(loess.smooth())    Loess smoother.
lines(smooth.spline())   Smoothing spline.
panel.lmline             Adds a least squares line to an xyplot in a Trellis graph.
panel.loess              Adds a loess smoother to an xyplot in a Trellis graph.
qqline()                 QQ-plot reference line.
xyplot(lmline=T)         Adds a least squares line to an xyplot in a Trellis graph.

Graph Functions Summarizing Data
The following functions summarize data in a plot-specific manner to plot big data objects.

Table 5.3: Functions that summarize in plot-specific manner
Function   Description
boxplot    Produces side-by-side boxplots from a number of
81. d to get one data set. Later, using this combined data set, you can plot gender and age information on a map. In the Commands window, type:
   P8.Nz.bd <- bd.join(list(P8.supp1.bd, P8.dataNz.bd))
2. Display the results in the Data Viewer. Note the latitude and longitude variables:
   bd.data.viewer(P8.Nz.bd)

Exporting Data
This optional step demonstrates exporting data to an ASCII text file. You can skip this step and continue to the next chapter. In the Commands window, type:
   exportData(P8.bd, file="exportedfile.txt", type="ASCII")
These options indicate that the data set is exported as an ASCII text file. The file name and location are specified by file. See the help file for exportData for a description of all options for exporting to a database.

Summary
The next steps in working with the Census example are to perform cluster modeling. These steps and discussion are continued in Chapter 6, Modeling Large Data Sets. The next section in this chapter provides further practice importing, manipulating, and plotting time series data for a sample financial data set.

MANIPULATING DATA: STOCK SAMPLE
In this section, we work with a different data set, a financial data set, using the script stock.ssc and associated .csv files provided in the default S-PLUS installation sample directory. Again, for practical reasons, this
82. d spending habits are just a few examples of information that can be grouped, and once these objects are grouped, you can then apply this knowledge to reveal patterns and relationships on a large scale. K-means is one of the most widespread clustering methods. It was originally developed for situations in which all variables are continuous and the Euclidean distance is chosen as the measure of dissimilarity. There are several variants of the K-means clustering algorithm, but most variants involve an iterative scheme that operates over a fixed number of clusters while attempting to satisfy the following properties:
- Each class has a center, which is the mean position of all the samples in that class.
- Each object is in the class whose center it is closest to.
The Big Data library clustering function, bdCluster, applies a K-means algorithm that performs a single scan of a data set while using a fixed-size buffer for points from the data set. Categorical data is handled by expanding categorical columns into m indicator columns, where m is the number of unique categories in the column. The K-means algorithm selects k of the objects, each of which initially represents a cluster mean or centroid. Each of the remaining objects is assigned to the cluster it resembles the most, based on the distance of the object from the cluster mean. The algorithm then computes the new mean for each cluster. This process iterates until the function conver
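The assign-then-update iteration described above can be sketched as follows. This is the standard in-memory K-means iteration written in Python for illustration; it is not the single-scan, buffered variant that bdCluster applies, and choosing the first k points as the initial centers is an assumption made to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100):
    """Return (centers, assignment) after iterating until assignments stop changing."""
    # Assumption for the sketch: the first k points serve as the initial centers.
    centers = [tuple(point) for point in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(point, centers[c]))
            for point in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: each center becomes the mean position of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (10, 10), (10, 11), (1, 0), (11, 10)]
    print(k_means(sample, k=2))
```

When the loop stops, both properties listed above hold: every center is the mean of its members, and every point belongs to the cluster whose center is nearest.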
83. d.stack                 Combines or stacks separate columns of a data set into a single column, replicating values in other columns as necessary.

Table A.5: Data manipulation functions (continued)
Function name            Description
bd.string.column.width   Returns the maximum number of characters that can be stored in a big data string column.
bd.transpose             Turns a set of columns into a set of rows.
bd.unique                Removes all duplicated rows from the data set so that each row is guaranteed to be unique.
bd.unstack               Separates one column into a number of columns based on a grouping column.

Appendix Big Data Library Functions

Programming
Table A.6: Programming functions
Function name       Description
bd.cache.cleanup    Cleans up cache files that have not been deleted by the garbage collection system. This is most likely to occur if the entire system crashes.
bd.cache.info       Analyzes a directory containing big data cache files and returns information about cache files, reference counts, and unknown files.
bd.options          Controls S-PLUS options used when processing big data objects.
bd.pack.object      Packs any object into an external cache.
bd.split.by.group   Divides a data set into multiple data blocks and returns a list of these data blocks.
bd
84. d wtr marginal gridi fit lt c predict ozo m grid print levelplot fit wind temperature radiation data grid xlab Wind Speed mph ylab Temperature F main Cube Root Ozone cube root ppb The level plot is displayed as follows Cube Root Ozone cube root ppb Temperature F Wind Speed mph Figure 5 32 Graph using loess to create a level plot The persp function creates a perspective plot given a matrix that represents heights on an evenly spaced grid For more information about persp see section Perspective Plots on page 96 of the Application Developer s Guide To create a sample persp graph using hist2d to preprocess the data in the Commands window type the following fuel bd lt as bdFrame fuel frame persp hist2d fuel bd Weight fuel bd Mileage 153 Chapter 5 Creating Graphical Displays of Large Data Sets The persp graph is displayed as follows Figure 5 33 Graph using hist2d to create a perspective plot Hint Using persp of interp might produce a more attractive graph Create a Pie A pie chart shows the share of individual values in a variable relative Chart to the sum total of all the values Pie charts display the same information as bar charts and dot plots but can be more difficult to interpret This is because the size of a pie wedge is relative to a sum and does not directly reflect the magnitude of the data value Because of this
85. data set is not particularly large (26 columns, 2729 rows); however, it illustrates features and tasks of working with a typical large data set that contains financial data, including time series information and missing data. This example can easily be run with a data set of millions of rows without requiring additional RAM. In this stock analysis example, you will:
- Manipulate the data (join, filter, remove missing data, create columns, and so on).
- Create a time series object.
- Plot the time series.
- Use different methods to analyze the betas using linear modeling.
- Compare the analysis methods.
Note: The entire sample script can be found in the default installation directory, samples/bigdata/stocks/stock.ssc. You can work through the example demonstrations below, or you can open the script and review it or run it.

Preparing the Stock Sample Script
This example examines the daily close prices of 24 conglomerate stocks and the S&P 500 index from 01/01/1994 to 11/01/2004.

Table 4.2: Stock Data used in Example
Stock symbol   Company name
cbe            Cooper Industries
cr             Crane Company
dov            Dover Corporation
fo             Fortune Brands Incorporated
ge             General Electric Company
               GenCorp Incorporated
hon            Honeywell International
hsc            Harsco Corporation
kor            Koor Industries LTD
kt             K
86. default, the objects are displayed sorted by name.
2. Right-click the Objects View, and in the right-click menu, click bigdata. The Big Data library objects are displayed in the Objects View. It might take a few seconds to display all of the objects.
3. Re-sort the objects by any property displayed in the Objects View by clicking the property's column title.

To Select Another Object Database
1. Right-click the Objects View, and in the right-click menu, click your default object directory (the first database in the list, by default found in your installation directory at users/yourname). The project objects are displayed in the Objects View. It might take a few seconds to display all of the objects.

The Tasks View displays outstanding project tasks. As discussed in the section Setting the Project's Preferences on page 44, the indicators for task levels are stored in the Preferences dialog box. Click Window > Preferences to display them. You can add a task in one of two ways:
- Add the task directly to the Tasks View.
- Add the task to the script file.

To Add a Task Directly to the Tasks View
1. Click the Tasks View tab to give it focus.
2. Right-click the view, and then click Add Task.
3. In the Add Task dialog box, provide the description and priority level of the task.
4. Click OK to save and display the new task. A t
87. ditor replace the missing parenthesis and save your file Note that the problem disappears from the Problems View The S PLUS Workbench maintains a list of your active projects in the Navigator View even after you close all associated files To Close the Project l Click File gt Close All 2 Examine the views and note that the views all still contain data The views continue to show project information The S PLUS Workbench stores information in many views even after you close the interface For example the Objects View continues S PLUS Workbench Tasks to store information about all your projects objects and the Tasks View and Problems View continue to display outstanding issues These features can help you track outstanding work even between sessions 55 Chapter 2 The S PLUS Workbench COMMONLY USED FEATURES IN ECLIPSE 56 The core Eclipse IDE contains many additional features that you might find helpful in managing your projects The following table lists a few of these features along with references to the Eclipse Workbench User Guide to help you learn how to use them effectively Table 2 7 Eclipse Tasks and Features Task Eclipse Feature Description Comparing files with previous versions The Compare With Local History menu item is available from the control menu in Navigator View Using this feature you can compare the current version of the selected file with previously stored
88. e categorical indices. The following functions do not accept a big data object directly to create a graph; rather, they require one of the specified preprocessing functions.

Table 5.5: Functions requiring preprocessors for graphing large data sets
Function      Preprocessors              Description
barchart      table, tapply, aggregate   Creates a bar chart in a Trellis graph.
barplot       table, tapply, aggregate   Creates a bar graph.
contour       interp, hist2d             Makes a contour plot and possibly returns coordinates of contour lines.
contourplot   loess                      Displays contour plots and level plots in a Trellis graph.
dotchart      table, tapply, aggregate   Plots a dot chart from a vector.
dotplot       table, tapply, aggregate   Creates a Trellis graph displaying dots and labels.
image         interp, hist2d             Creates an image, under some graphics devices, of shades of gray or colors that represent a third dimension.
levelplot     loess                      Displays a level plot in a Trellis graph.
persp         interp, hist2d             Creates a perspective plot given a matrix that represents heights on an evenly spaced grid.
pie           table, tapply, aggregate   Creates a pie chart from a vector of data.
piechart      table, tapply, aggr        Creates a pie chart in a Trellis graph.
89. e name of the data set, P8.bd. For File Name, type census.csv. For Files of Type, select ASCII file - comma delimited (csv). Click OK to export the data set.

The Data Viewer is a multi-page tabbed dialog box providing summaries of the different column types and a noneditable, scrollable grid view of the data. The Data Viewer is available only with the Enterprise Developer version of S-PLUS and requires that you have the Big Data library loaded. You can use the Data Viewer for both large data frames (bdFrames) and standard data frames.

To view the example data set in the Data Viewer
You can display the Data Viewer from the Commands window. At the Commands window prompt, type:
   bd.data.viewer(P8.bd)

[Figure 4.1: Data Viewer displaying P8.bd. Summary shown in the viewer: Total number columns 43; Numeric columns 42; Total number rows 32165; Factor columns 0; String columns 1; Date columns 0; Other columns 0]

The Data Viewer contains the following tabs. The first tab, Data View, contains a table of the data. The remaining five contain summary information about the corresponding data type. Data View, Numeric, Factor, String
90. ecause summary information (metadata) is computed and stored for each column, the number of columns is slightly limited. The current implementation supports tens of thousands of columns on a typical computer.
- Function names starting with bd.internal are not intended to be used directly.

Data sets that are much larger than the system memory are manipulated by processing one block of data at a time. That is, if the data is too large to fit in RAM, then the data is broken into multiple data sets and the function is applied to each of the data sets. As an example, a 1,000,000-row by 10-column data set of double values is 76MB in size, so it could be handled as a single data set on a machine with 256MB RAM. If the data set were 10,000,000 rows by 100 columns, it would be 7.4GB in size and would have to be handled as multiple blocks.

Table 3.1 lists a few of the optional arguments for the function bd.options that you can use to set limits for caching and for warnings.

Table 3.1: bd.options block-based computation arguments
bd.options argument   Description
block.size            The block size (in number of rows); the number of bytes in the cache to be converted to a data frame.
max.convert.bytes     The maximum size (in bytes) of the big data cache that can be converted to a data frame.
convert.warn          If T, generates a warning whenever a big data cache is converted to a data fr
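The sizes quoted in the example above follow from 8 bytes per double value, with MB and GB read as binary units (2^20 and 2^30 bytes). The short Python check below reproduces the arithmetic; the 50,000-row block size used at the end is an arbitrary value chosen for illustration, not an S-PLUS default.

```python
BYTES_PER_DOUBLE = 8


def data_size_bytes(rows, columns):
    """Raw size of a rows x columns table of double-precision values."""
    return rows * columns * BYTES_PER_DOUBLE


def blocks_needed(total_rows, block_rows):
    """Number of blocks required to cover total_rows, rounding up."""
    return -(-total_rows // block_rows)


if __name__ == "__main__":
    small = data_size_bytes(1_000_000, 10)
    large = data_size_bytes(10_000_000, 100)
    print(f"small: {small / 2**20:.1f} MB")   # 80,000,000 bytes
    print(f"large: {large / 2**30:.2f} GB")   # 8,000,000,000 bytes
    # Hypothetical block size of 50,000 rows, chosen only for this example.
    print(blocks_needed(10_000_000, 50_000), "blocks")
```

The first data set comes to about 76.3MB and the second to about 7.45GB, which matches the figures in the text.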
91. ed as follows:

[Figure 5.6: Graph using xyplot with lmline=T (axes: ozone, radiation)]

Trellis functions in the Big Data Library handle continuous "given" variables differently than standard data Trellis functions: they are sent through equal.count rather than factor.

Add a Regression Line
You can add a regression line or scatterplot smoother to hexbin plots. The regression line or smoother is a weighted fit based on the binned values. The following functions add the following types of reference lines to hexbin plots:
- A regression line with abline.
- A loess smoother with loess.smooth.
- A smoothing spline with smooth.spline.
- A line to a qqplot with qqline.
- A least squares line to an xyplot in a Trellis graph.
For smooth.spline and loess.smooth, when the data consists of bdVectors, the data is aggregated before smoothing. The range of the x variable is divided into 1000 bins, and then the mean for x and y is computed in each bin. A weighted smooth is then computed on the bin means, weighted based on the bin counts. This computation results in values that differ somewhat from those where the smoother is applied to the unaggregated data. The values are usually close enough to be indistinguishable when used in a plot, but the difference could be important when the smoother is used for prediction or optimization.

When you create a scatterplot from your large data set and you notice a line
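The aggregate-then-fit procedure described above can be sketched in Python. The binning step follows the description (equal-width bins over the range of x, per-bin means of x and y, bin counts as weights); the fitting step here is a weighted least squares line, swapped in as a simple stand-in for the loess or spline smoother. The function names are hypothetical, and the sketch assumes the x values are not all identical.

```python
def bin_means(xs, ys, n_bins=1000):
    """Return (mean_x, mean_y, count) for each non-empty equal-width bin of x."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0
    sums = {}
    for x, y in zip(xs, ys):
        # Clamp so that the maximum x value falls in the last bin.
        index = min(int((x - lo) / width), n_bins - 1)
        sx, sy, n = sums.get(index, (0.0, 0.0, 0))
        sums[index] = (sx + x, sy + y, n + 1)
    return [(sx / n, sy / n, n) for _, (sx, sy, n) in sorted(sums.items())]


def weighted_line(binned):
    """Weighted least squares line through the bin means; weights are bin counts."""
    total = sum(w for _, _, w in binned)
    mean_x = sum(w * x for x, _, w in binned) / total
    mean_y = sum(w * y for _, y, w in binned) / total
    sxx = sum(w * (x - mean_x) ** 2 for x, _, w in binned)
    sxy = sum(w * (x - mean_x) * (y - mean_y) for x, y, w in binned)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    xs = [i / 10 for i in range(5000)]
    ys = [2 * x + 1 for x in xs]
    print(weighted_line(bin_means(xs, ys)))
```

For data that lie exactly on a line, the bin means lie on the same line, so the weighted fit recovers it. For noisy or curved data, the fit on bin means differs slightly from a fit on the raw points, which is the discrepancy the text warns about.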
92. ed from an empirical distribution based on the observed values.
bd.reorder.columns   Changes the order of the columns in the data set.
bd.sample            Samples rows from a data set using one of several methods.
bd.select.rows       Extracts a block of data, as specified by a set of columns, start row, and end row.
bd.shuffle           Randomly shuffles the rows of your data set, reordering the values in each of the columns as a result.
bd.sort              Sorts the data set rows according to the values of one or more columns.
bd.split             Splits a data set into two data sets according to whether each row satisfies an expression.

Table A.5: Data manipulation functions (continued)
Function name        Description
bd.sql               Specifies data manipulation operations using SQL syntax.
                     - The Select, Insert, Delete, and Update statements are supported.
                     - The column identifiers are case sensitive.
                     - SQL interprets periods in names as indicating fields within tables; therefore, column names should not contain periods if you plan to use bd.sql.
                     - Mathematical functions are allowed for aggregation (avg, min, max, sum, count, stdev, var).
                     The following functionality is not implemented:
                     - distinct
                     - mathematical functions in set or select, such as abs, round, floor, and so on
                     - natural join
                     - union, merge, between
                     - subqueries
b
93. edures to accomplish the tasks. For nearly every investigation, understanding the problem and planning for its solution can save you time, energy, and money later. Defining the problem is key to determining the type of information you want to derive from your data and the best strategy for extracting the information and demonstrating the answer.
- Determine the question you are trying to answer. What is the objective of your inquiry?
- Identify the variables in your data that can answer the question. Your data set might contain much more information than you need to determine the answer to your question, and it also might include blank fields or errors. You must filter and remove factors that are not essential to your answer.
- Design the model. You can create a model that predicts behavior.

Chapter 1 Introduction

Import the Data
Using S-PLUS and the Big Data library, you can import data from the source types listed in Table 5.1 in Chapter 5, Importing and Exporting Data, in the S-PLUS User's Guide. The easiest way to import a large data set is by typing the importData command directly in the S-PLUS Commands window or the Console View in the S-PLUS Workbench, specifying the argument bigdata=T. For more information about using the Commands window, see the S-PLUS User's Guide. For more information about importData, see its help topic. For more information about deciding when to set bigdata=T, see the sectio
94. ee the section Creating a Project on page 41 for more information about specifying the project file location You can use the Eclipse editor to edit non project files in the S PLUS Workbench To open a non project file on the File menu click Open External File and then browse to the location of the file to edit For more information about editing files in Eclipse see the Eclipse User s Guide 23 Chapter 2 The S PLUS Workbench S PLus PERSPECTIVE Changing the S PLus Workbench Perspective 24 The perspective provides functionality aimed at accomplishing specific types of tasks or working with specific types of resources The S PLUS Workbench perspective defines the appearance and behavior for using S PLUS including the editor views menus and toolbars You can change the perspective to suit your development style by moving hiding or closing views For more information about customizing the views within the perspective see the section Customizing the Perspective s Views on page 27 For practice exercises customizing the perspective see the section Customizing the S PLUS Workbench Default Perspective and Views on page 45 To customize the default S PLUS perspective on the menu click Window gt Customize Perspective The Customize Perspective dialog box has two pages Shortcuts and Commands Each of these pages describes global changes you can make to the perspective To save a changed perspective click Window gt
95. egate
wireframe     loess     Displays a three-dimensional wireframe plot in a Trellis graph.

EXAMPLE GRAPHS
The examples in this chapter require that you have the Big Data Library loaded. The examples are not large data sets; rather, they are small data objects that you convert to big data objects to demonstrate using the Big Data Library graphing functions.

Plotting Using Hexagonal Binning
Hexagonal binning plots are available for:
- Single plot (plot)
- Matrix of plots (pairs)
- Conditioned single or matrix plots (xyplot)
Functions that evaluate data over a grid in standard S-PLUS aggregate the data over the grid (such as binning the data and taking the mean in each grid cell) and then plot the aggregated values when applied to a big data object. Hexagonal binning is a data grouping or reduction method typically used on large data sets to clarify a spatial display structure in two dimensions. Think of it as partitioning a scatter plot into larger units to reduce dimensionality while maintaining a measure of data clarity. Each unit of data is displayed with a hexagon and represents a bin of points in the plot. Hexagons are used instead of squares or rectangles to avoid misleading structure that occurs when edges of the rectangles line up exactly. Plotting using hexagonal binning is the standard technique used when a plotting function that currently plots one point per row is applied to a big data object.
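To make the binning step concrete, here is a minimal Python sketch of one common way to assign points to hexagonal cells: the hexagon centers are treated as two interleaved rectangular lattices, and each point goes to the nearer of its two candidate centers. This illustrates the general technique only; it is not the S-PLUS hexbin implementation, and the function names and the cell-width parameter dx are assumptions.

```python
import math
from collections import Counter


def hex_cell(x, y, dx):
    """Return integer cell coordinates of the hexagon containing (x, y).

    Hexagon centers lie at (cx * dx / 2, cy * dx * sqrt(3) / 2), where cx and cy
    are both even (lattice A) or both odd (lattice B).
    """
    dy = dx * math.sqrt(3.0)
    # Nearest center on lattice A: (i * dx, j * dy).
    ai, aj = round(x / dx), round(y / dy)
    # Nearest center on lattice B: ((i + 0.5) * dx, (j + 0.5) * dy).
    bi, bj = math.floor(x / dx), math.floor(y / dy)
    dist_a = (x - ai * dx) ** 2 + (y - aj * dy) ** 2
    dist_b = (x - (bi + 0.5) * dx) ** 2 + (y - (bj + 0.5) * dy) ** 2
    if dist_a <= dist_b:
        return (2 * ai, 2 * aj)
    return (2 * bi + 1, 2 * bj + 1)


def cell_center(cell, dx):
    """Convert integer cell coordinates back to the hexagon center."""
    return (cell[0] * dx / 2.0, cell[1] * dx * math.sqrt(3.0) / 2.0)


def hexbin_counts(points, dx):
    """Count the points that fall in each hexagonal cell."""
    return Counter(hex_cell(x, y, dx) for x, y in points)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.1), (-0.1, 0.05), (0.5, 0.87)]
    for cell, count in hexbin_counts(pts, dx=1.0).items():
        print(cell_center(cell, 1.0), count)
```

A plot then draws one hexagon per occupied cell, shaded or sized by its count, so the amount of drawing depends on the number of cells rather than on the number of rows.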
96. eloper introduces the large data frame, an object of class bdFrame. A big data frame object is similar in function to standard S-PLUS data frames, except its data is stored in a cache file on disk rather than in RAM. The bdFrame object is essentially a reference to that external file. While you can create a bdFrame object that represents an extremely large data set, the bdFrame object itself requires very little RAM. For more information on bdFrame, see the section Data Frames on page 73. S-PLUS Enterprise Developer also introduces time date (bdTimeDate), time span (bdTimeSpan), and series (bdSeries, bdSignalSeries, and bdTimeSeries) support for large data sets. For more information, see the section Time Date Creation on page 235 in the Appendix.

The Big Data library provides reading, manipulating, and analyzing capability for large data sets using the familiar S programming language. Because most existing data frame methods work in the same way with bdFrame objects as they do with data frame objects, the style of programming is familiar to S-PLUS programmers. Much existing code from previous versions of S-PLUS runs without modification in the Big Data library, and only minor modifications are needed to take advantage of the big data capabilities of the pipeline engine.

Balancing Scalability with Performance
While accessing data on disk rather than in RAM allows for scalable statistical co
97. en separate from the S PLUS Workbench so multiple instances launch without focus hidden behind the S PLUS Workbench window History View The History View is similar to the Commands History dialog box in S PLUS for Windows The History View is a scrollable list of commands that have previously been run in the Console View Commands that you run by clicking Run or pressing F9 do not appear in the History View See the section Output View on page 33 e When you select a command in the History View the pending text in the Console View changes to the selected text You can then press ENTER or you can double click the text in the History View to execute the command You can select only one line at a time in the History View When you scroll up or down through previously run commands in the Console View the corresponding command is highlighted in the History View Note While S PLUS uses the key F10 to run a selected command the S PLUS Workbench uses the key F9 to run a selected command You can use the History View control menu click 7 to select input displayed in the History View and copy it to the Console View 29 Chapter 2 The S PLUS Workbench Objects View 30 By default the History View holds up to 150 000 lines of commands History view 3 Outline ml Add the additional code source source paste getenv SHOME se plot P8 bdSINTPTLON P8 bd SINTPT bd data viewer P8 bd
98. en you edit a script. Displays the help topic for documented functions when you select the function name and then type F1.
Figure 2.17: S-PLUS Workbench Script Editor.
Note: You can use the Eclipse editor to edit non-project files in the S-PLUS Workbench. To open a non-project file, on the File menu, click Open External File, and then browse to the location of the file to edit. For more information about editing files in Eclipse, see the Eclipse User's Guide.
View integration
The Script Editor is closely integrated with the views in the S-PLUS Workbench. This integration includes the following: When you type a task keyword in the editor, it is automatically added to the Tasks View. See the section Tasks View on page 36 for
99. enu, the S-PLUS Language Reference displays the topic for the help function.
The Source menu contains the following submenus:
Format: Applies S-PLUS-consistent formatting and line indentation to the entire script.
Toggle Comment: Designates the selected text in the Script Editor as a comment, or, if the selected text already is a comment, removes the comment designation.
Shift Right: Moves the selected text four character spaces to the right.
Shift Left: Moves the selected text four character spaces to the left.
Open S-PLUS Help File: Opens the S-PLUS Language Reference to the topic for the selected function. If you have no documented function selected, the help function topic is displayed.
This menu item is available from the right-click menu in the Script Editor. Selecting this menu option parses and then evaluates each expression in the given file, displaying the results in the Console View.
S-PLUS WORKBENCH TASKS
The following topics demonstrate the basic tasks for the S-PLUS Workbench user. For information about basic Eclipse IDE tasks, see the Eclipse Workbench User's Guide.
Creating a Project
Before you begin working with files in the S-PLUS Workbench, you must create a project. The S-PLUS Workbench project is a resource containing text files, scripts, and other associated files. You can use the project to control build, version, sharing, and resource management. Before you create
100. erprise Developer introduces the following new data types, described in more detail below.
Table 3.3: New data types and data names for S-PLUS
Big Data class                                                               Data type
bdFrame                                                                      Data frame
bdVector, bdCharacter, bdFactor, bdLogical, bdNumeric, bdTimeDate, bdTimeSpan   Vector
bdLm, bdGlm, bdPrincomp, bdCluster                                           Models
bdSeries, bdTimeSeries, bdSignalSeries                                       Series
The main object to contain your large data set is the big data frame, an object of class bdFrame. Most methods commonly used for a data.frame are also available for a bdFrame. Big data frame objects are similar to standard S-PLUS data frames, except in the following ways:
• A bdFrame object stores its data on disk, while a data.frame object stores its data in RAM. As a result, a bdFrame object has a much smaller memory footprint than a data.frame object.
• A bdFrame object does not have row labels, as a data.frame object does. While this means that you cannot refer to the rows of a bdFrame object using character row labels, this design reduces storage requirements and improves performance by eliminating the need to maintain unique row labels.
• A bdFrame object can contain columns of only types double, character, factor, timeDate, timeSpan, or logical. No other column types (such as matrix objects or user-defined classes) are allowed. By limiting the allowed column types, S-PLUS ensures that the binary cache file representing the data is as compact as po
101. et 61
  Size Considerations 65
  The Big Data Library Architecture 69
Chapter 4 Exploring and Manipulating Large Data Sets 85
  Introduction 86
  Working in the S-PLUS Environment 87
  Manipulating Data: Census Example 91
  Manipulating Data: Stock Sample 105
Chapter 5 Creating Graphical Displays of Large Data Sets 115
  Introduction 116
  Overview of Graph Functions 117
  Example Graphs 123
Chapter 6 Modeling Large Data Sets 159
  Introduction 160
  Overview of Modeling 161
  Building a Model 162
  Predicting from the Model 180
Chapter 7 Advanced Programming Information 185
  Introduction 186
  Big Data Block Size Issues 187
  Big Data String and Factor Issues 191
  Storing and Retrieving Large S Objects 197
  Increasing Efficiency 199
Appendix: Big Data Library Functions 201
  Introduction 202
  Big Data Library Functions 203
INTRODUCTION
Welcome to the S-PLUS Enterprise Developer User's Guide Analyzing Large Data Sets Out of Memory Data Storage Big Data Library Options in the S-PLUS Environment Working with Large Data Sets Advanced Programming More Advanced Programming Concepts and Tasks
WELCOME TO THE S-PLUS ENTERPRISE DEVELOPER USER'S GUIDE
The Big Data library is a significant addition to the S-PLUS family of libraries. It provides objects, classes, and functions to manipulate, model, and explore large data sets using the S language
102. eted as all columns in the data set specified by data.
The splom plot is displayed as follows:
Figure 5.4: Graph using splom for fuel.bd.
To remove a column, use -term. To add a column, use +term. For example, the following code replaces the column Disp. with its log:
fuel.bd <- as.bdFrame(fuel.frame)
splom(~ . - Disp. + log(Disp.), data = fuel.bd)
Figure 5.5: Graph using splom to designate a formula for fuel.bd.
For more information about splom, see its help topic.
Create a Conditioning Plot or Scatter Plot
The function xyplot creates a Trellis graph, which graphs one set of numerical values on a vertical scale against another set of numerical values on a horizontal scale. To create a sample conditioning plot, in the Commands window, type the following:
xyplot(data = as.bdFrame(air), ozone ~ radiation | temperature, shingle.args = list(n = 4), lmline = T)
The variable on the left of the ~ goes on the vertical (or y) axis, and the variable on the right goes on the horizontal (or x) axis.
Adding Reference Lines
The function xyplot contains the default argument lmline = T to add the approximate least squares line to a panel quickly. This argument performs the same action as panel.lmline in standard S-PLUS. The xyplot plot is display
103. ference data set columns containing the adjusted scale of latitude and longitude (Lat and Lon), and assign the resulting data set to P8.supp1.bd. The original latitude and longitude values (INTPTLAT and INTPTLON) were stored as large integer values.
1. Create new Lat and Lon variables on the correct scale, and then save them along with the original reference data in a new data set, P8.supp1.bd:
P8.supp1.bd <- bd.create.columns(P8.ref.bd, exprs = c("INTPTLAT/1e6", "INTPTLON/1e6"), names = c("Lat", "Lon"), types = "continuous", copy = T)
2. Open the data viewer and examine the latitude and longitude variables:
bd.data.viewer(P8.supp1.bd)
In the next section, plot the ZIP code distribution to examine its density.
In this exercise, use the data set P8.supp1.bd, with the adjusted latitude and longitude values, to display the distribution of ZIP code locations in a simple hexbin plot. This simple plot maps the density of ZIP code locations in the United States and Puerto Rico.
To display ZIP code density:
1. In the Commands window, type:
plot(P8.supp1.bd$Lon, P8.supp1.bd$Lat)
Note that the plot function produces the hexbin plot by default for big data objects, rather than a scatter plot.
[Figure: hexbin plot of P8.supp1.bd$Lat against P8.supp1.bd$Lon, with a Counts legend]
104. from File dialog box.
2. Under File name, click Browse, and in the Select file to import dialog box, browse to the census directory, by default located in your installation directory at samples/bigdata/census.
3. Select census.csv.
4. In File format, select ASCII file - comma delimited (csv).
5. Select the Import as Big Data check box.
6. In the Data set text box, type P8.bd.
7. Click the Options tab.
8. Clear the Strings as factors check box.
Note: When you import data, you have the option to set the flag stringsAsFactors to T or F (the default is T). S-PLUS imposes a limit of 500 levels for bdFactors.
9. To preview the data, click the Data Specs tab, and then click Update Preview.
10. Click OK to import the data set.
Export Data dialog box
To export a large data set from S-PLUS using the S-PLUS GUI in Microsoft Windows, from the File menu, click Export Data > To File or Export Data > To Database.
Note: From the command line, export the data using the exportData function.
For a list of the data file types in the S-PLUS for Windows GUI, click Help > Available Help > S-PLUS Help, and then in the Index, find the topic Export File Type.
To export the census example data set using the S-PLUS GUI in Microsoft Windows:
1. From the File > Export Data menu, open the Export to File dialog box. For Data frame, provide th
105. g Data library modeling architecture
Primary modeling function   Class
glm                         bdGlm
bdCluster                   bdCluster
lm                          bdLm
princomp                    bdPrincomp
See Tables A.10 through A.13 in the Appendix for lists of the functions available for large data set modeling. See the S-PLUS Language Reference for more information about these functions.
Formula operators
The Big Data library supports using the formula operators +, -, *, :, %in%, and ^.
Time Classes
The following classes support time operations in the Big Data library. See the Appendix for more information.
Table 3.8: Time classes
Class name       Comment
bdSignalSeries   A bdSignalSeries object from positions and data.
bdTimeDate       A bdVector class.
bdTimeSeries     See the section Time Series Operations for more information.
bdTimeSpan       A bdVector class.
Time Series Operations
Time series operations are available through the bdTimeSeries class and its related functions. The bdTimeSeries class supports the same methods as the standard S-PLUS library's timeSeries class. See the S-PLUS Language Reference for more information about these classes.
Time and Date Operations
• When you create a time object using timeSeq, and you set the bigdata argument to TRUE, then a bdTimeDate object is created.
• When you create a time object using timeDate or timeCalendar, and any of the arguments are big data objects, then a bd
106. ges. A second scan through the data assigns each observation to the cluster it is closest to, where closeness is measured by the Euclidean distance. When you perform K-means clustering, the number of cluster iterations you specify determines the accuracy of each cluster. That is, the higher the iteration number, the more accurately the observations are assigned. The clustering function bdCluster includes the optional arguments listed in Table 6.3 for using the K-means algorithm.
Table 6.3: bdCluster algorithm arguments
Optional argument   Description
columns             The names of columns to use in clustering. The default uses all columns.
iter.max            The maximum number of iterations to run within a block. This is the number of iterations of the standard K-means algorithm applied to the combined new data from the block, the retained set, and the current centers.
k                   The number of clusters. You might know this number based on the subject matter. For example, you know in advance you expect to find three species groups in a particular data set. Often, however, clustering is an exploratory technique, and the number of clusters is unknown. Try a number of cluster runs with varying numbers of clusters, and see which setting provides meaningful results.
retain              The number of rows in the retained set. As each block of data is processed, observations that do not cluster well are kept in the retained set. At
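The assignment scan described above (each observation goes to the cluster whose center is nearest in Euclidean distance) can be pictured with a small sketch. This is an illustration in Python rather than S-PLUS, and it is not the bdCluster implementation; the function names and the sample points are hypothetical.

```python
import math

def nearest_center(point, centers):
    """Return the index of the center closest to point (Euclidean distance)."""
    distances = [math.dist(point, center) for center in centers]
    return distances.index(min(distances))

def assign_clusters(points, centers):
    """Assign every observation to its closest center, as in the second scan."""
    return [nearest_center(point, centers) for point in points]

# Hypothetical data: two centers and three observations in two dimensions.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 1.0), (9.0, 8.0), (4.0, 4.0)]
labels = assign_clusters(points, centers)
```

Running more iterations of a K-means procedure alternates this assignment step with recomputing the centers, which is why a higher iteration limit generally gives more stable clusters.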
107. ght | voice.part, data = singer.bd, nint = 17, endpoints = c(59.5, 76.5), layout = c(2, 4), aspect = 1, xlab = "Height (inches)")
The Trellis histogram chart is displayed as follows:
Figure 5.17: Graph using histogram.
For more information about Trellis histograms, see Chapter 3, Traditional Trellis Graphics, in the Application Developer's Guide.
The functions qq, qqmath, qqnorm, and qqplot create an ordinary x-y plot of 500 evenly spaced quantiles of data.
The function qq creates a Trellis graph comparing the distributions of two sets of data. Quantiles of one data set are graphed against corresponding quantiles of the other data set. To create a sample qq plot, in the Commands window, type the following:
fuel.bd <- as.bdFrame(fuel.frame)
qq((Type == "Compact") ~ Mileage, data = fuel.bd)
The factor on the left side of the ~ must have exactly two levels. (fuel.bd$Type has more than two levels, which is why the example uses the logical Type == "Compact".)
The qq plot is displayed as follows:
Figure 5.18: Graph using qq.
Note that in this example, by setting Type to the logical Compact, the labels are set to FALSE and TRUE on the x and y axis, respectively.
Create a QQ Plot Using a Theoretical or Empirical Distribution
The function qqmath creates a normal probability plot in a Trellis graph; that is, the ordered data are graphed against quantiles of the standard normal distributi
108. gle column and all other columns, rather than computing the full correlation/covariance matrix.
bd.crosstabs     Produces a series of tables containing counts for all combinations of the levels in categorical variables.
bd.data.viewer   Displays the data viewer window, which displays the input data in a scrollable window, as well as information about the data columns (names, types, means, and so on).
bd.univariate    Computes a wide variety of univariate statistics. It computes most of the statistics returned by PROC UNIVARIATE in SAS.
Data Manipulation Functions
Table A.5: Data manipulation functions
Function name    Description
bd.aggregate     Divides a data object into blocks according to the values of one or more columns, and then applies aggregation functions to columns within each block.
bd.append        Appends one data set to a second data set.
bd.bin           Creates new categorical variables from continuous variables by splitting the numeric values into a number of bins. For example, it can be used to include a continuous age column as ranges (<18, 18-24, 25-35, and so on).
bd.block.apply   Executes an S-PLUS script on blocks of data, with options for reading multiple input data sets, generating multiple output data sets, and processing blocks in different orders.
bd.by.group      Applies an arbitrary S-PLUS function to multiple data blocks within the input data set
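To make the binning idea in the bd.bin description concrete, here is a minimal sketch of turning a continuous age column into the ranges named above. It is written in Python for illustration only and does not call or reproduce bd.bin; the bin edges come from the example ranges, while the final ">35" label and the sample ages are assumptions added here.

```python
import bisect

# Upper bounds of the ranges named in the table: <18, 18-24, 25-35.
UPPER_BOUNDS = [17, 24, 35]
# The last label is a hypothetical catch-all standing in for "and so on".
LABELS = ["<18", "18-24", "25-35", ">35"]

def age_range(age):
    """Return the label of the range that contains a whole-number age."""
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, age)]

# Hypothetical sample ages, two per range.
ages = [4, 17, 18, 24, 25, 35, 36, 70]
ranges = [age_range(age) for age in ages]
counts = {label: ranges.count(label) for label in LABELS}
```

The resulting labels behave like a categorical column, so they can be counted or used as grouping values in the same way the aggregation functions in this table group by column values.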
109. h functions in S-PLUS, review the following chapters in the Application Developer's Guide:
• Chapter 1, Editable Graphics Commands
• Chapter 2, Traditional Graphics
• Chapter 3, Traditional Trellis Graphics
Implementing plotting and graph functions to support large data sets requires an intelligent way to handle thousands of data points. To address this need, the graph functions to support big data are designed in the following categories:
• Functions to plot big data objects without preprocessing, including:
  • Functions to plot big data objects by hexagonal binning.
  • Functions to plot big data objects by summarizing data in a plot-specific manner.
• Functions providing the preprocessing support for plotting big data objects.
• Functions requiring preprocessing support to plot big data objects.
The following sections list the functions organized into these categories. For an alphabetical list of graph functions supporting big data objects, see the Appendix, Big Data Library Functions.
Using cloud or parallel results in an error message. Instead, sample or aggregate the data to create a data.frame that can be plotted using these functions.
Graph Functions using Hexagonal Binning
The following functions can plot a large data set (that is, can accept a big data object without preprocessing) by plotting large amounts of data using hexagonal binning.
Table 5.1: Fun
110. h these modeling algorithms, you can create and evaluate statistical models on very large data sets. The low-level modeling functions in the Big Data library return a big data model object. This object contains a reference to the bdFrame used to fit the model and a reference to a description of the model.
A model object is available for each of the following statistical analysis model types.
Table 3.5: Big Data library model objects
Model Type                      Model Object
Linear regression               bdLm
Generalized linear models       bdGlm
Clustering                      bdCluster
Principal Components Analysis   bdPrincomp
When you perform statistical analysis on a large data set with the Big Data library, you can use familiar S-PLUS modeling functions and syntax, but you supply a bdFrame object as the data argument instead of a data frame. This forces out-of-memory algorithms to be used, rather than the traditional in-memory algorithms. When you apply the modeling function lm to a bdFrame object, it produces a model object of class bdLm. You can apply the standard predict, summary, plot, residuals, coef, formula, anova, and fitted methods to these new model objects. For more information on statistical modeling, see Chapter 6, Modeling Large Data Sets.
Series Objects
The standard S-PLUS library contains a series object with two subclasses: timeSeries and signalSeries. The series object contains:
• A data comp
111. h you want help.
• In the Script Editor, highlight the function for which you want help, and then type F1.
• Use the S-PLUS Workbench menu options. In the Script Editor, select the function for which you want help, and then on the menu, click either:
  • Source > Open S-PLUS Help File
  • Help > S-PLUS Help
Note: If you click either menu option with no function selected in the Script Editor, the S-PLUS Workbench displays the help function topic.
Script Running Options
The S-PLUS Workbench provides the following customized solutions for running your scripts:
• Copy to Console: Available from the right-click menu in the Script Editor, this option copies the selected code and pastes it into the Console View. See the section Copying Script Code to the Console on page 52.
• Run: Available by pressing F9, on the toolbar, or from the right-click menu in the Script Editor, this option runs the selected code (or all code, if none is selected) and displays output in the Output View. See the section Running Code and Reviewing the Output on page 54 for more information.
S-PLUS Features Changed and Eclipse Features Not Supported by the S-PLUS Workbench
In the traditional S-PLUS GUI, you use F10 to run code. Eclipse reserves F10 to switch focus to the main menu; therefore, the S-PLUS Workbench specifies F9 to run code. The S-PLUS Workbench does not implement the Eclipse Run menu item. Selecting th
112. height in the z direction from the corresponding three-dimensional surface. A level plot is essentially identical to a contour plot, but it has default options that allow you to view a particular surface differently. The following example creates a contour plot from fuel.bd, using interp to preprocess data. For more information about interp, see the section Visualizing Three-Dimensional Data on page 94 of the Application Developer's Guide.
Like density, interp and loess summarize the data. That is, when the data is a bdVector, the data is aggregated before smoothing. The range of the x variable is divided into 1000 bins, and the mean for x is computed in each bin. See the section Create a Density Plot on page 135 for more information.
To create a sample contour plot using interp to preprocess the data, in the Commands window, type the following:
fuel.bd <- as.bdFrame(fuel.frame)
contour(interp(fuel.bd$Weight, fuel.bd$Disp., fuel.bd$Mileage))
The contour plot is displayed as follows:
Figure 5.26: Graph using interp to create a contour plot.
The function contourplot creates a Trellis graph of a contour plot. For big data sets, contourplot requires a
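The aggregation step mentioned above (dividing the range of x into a fixed number of bins and averaging within each bin before smoothing) can be sketched as follows. This is a generic Python illustration of the idea, not the S-PLUS implementation of density, interp, or loess; the function name, the handling of empty bins, and the choice to average y alongside x are assumptions made for the sketch.

```python
def bin_means(xs, ys, n_bins=1000):
    """Divide the range of xs into n_bins equal bins and return the
    (mean x, mean y) pair for every bin that contains data."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range
    sums = {}
    for x, y in zip(xs, ys):
        # The maximum value falls on the upper edge, so clamp it into the last bin.
        index = min(int((x - lo) / width), n_bins - 1)
        sum_x, sum_y, count = sums.get(index, (0.0, 0.0, 0))
        sums[index] = (sum_x + x, sum_y + y, count + 1)
    return [(sx / n, sy / n) for _, (sx, sy, n) in sorted(sums.items())]

# Hypothetical data, reduced to two bins so the result is easy to follow.
summary = bin_means([0, 1, 2, 3], [0, 10, 20, 30], n_bins=2)
```

The point of such a step is that the smoother then works on at most n_bins summary points, however many rows the original data holds.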
113. housing.bd
Fit the linear regression:
boston.lm <- lm(LMEDV ~ CRIM + ZN + INDUS + CHAS + AGE + TAX + PTRATIO + B + LRAD + LLSTAT + NOX2 + LDIS + RM2, data = boston.housing.bd)
Look at the model results by typing in the Commands window:
boston.lm
Look at some diagnostic plots for the model:
plot(boston.lm)
Figure 6.3: One diagnostic plot.
Call summary for a longer synopsis of the model:
summary(boston.lm)
Call: bdLm(formula = LMEDV ~ CRIM + ZN + INDUS + CHAS + AGE + TAX + PTRATIO + B + LRAD + LLSTAT + NOX2 + LDIS + RM2)
Residuals:
     Min    Mean     Max   StDev
 -0.7118  0.0000  0.7978  0.1801
Coefficients:
               Value  Std. Error   t value  Pr(>|t|)
(Intercept)   4.5578      0.1544   29.5116    0.0000
CRIM         -0.0119      0.0012   -9.5320    0.0000
ZN            0.0001      0.0005    0.1585    0.8741
INDUS         0.0002      0.0024    0.1013    0.9193
CHAS          0.0914      0.0332    2.7527    0.0061
AGE           0.0001      0.0005    0.1724    0.8632
TAX          -0.0004      0.0001   -3.4261    0.0007
PTRATIO      -0.0311      0.0050   -6.2081    0.0000
B             0.0004      0.0001    3.5271    0.0005
LRAD          0.0957      0.0191    5.0021    0.0000
LLSTAT       -0.3712      0.0250  -14.8406    0.0000
NOX2         -0.6380      0.1131   -5.6393    0.0000
LDIS         -0.1913      0.0334   -5.7275    0.0000
RM2           0.0063      0.0013    4.8226    0.0000
Notice that ZN, INDUS, and AGE are not significant predictors. If we were building a model for this data, we would likely refit several other candidate models and examine them more fully. For investigation invo
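When reading the coefficient table above, it helps to remember that each t value is the coefficient's Value divided by its Std. Error. The short Python sketch below checks that relationship for two rows of the printed output; the helper name is hypothetical, and small differences from the printed t values are expected because the table shows rounded figures.

```python
def t_value(value, std_error):
    """t statistic for one coefficient: estimate divided by its standard error."""
    return value / std_error

# Rounded figures copied from the (Intercept) and CHAS rows of the summary output.
rows = {"(Intercept)": (4.5578, 0.1544), "CHAS": (0.0914, 0.0332)}
t_values = {name: t_value(value, se) for name, (value, se) in rows.items()}

for name, t in t_values.items():
    print(name, round(t, 2))
```

A large absolute t value corresponds to a small Pr(>|t|), which is why ZN, INDUS, and AGE, with t values well below 1, are flagged as not significant.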
114. implemented for bdVector and bdFrame (Continued)
Function Name   bdVector   bdFrame   Optional   Comment
pnbinom     Density, CDF, and quantile function.
pnorm       Density, CDF, and quantile function.
pnrange     Density, CDF, and quantile function.
ppois       Density, CDF, and quantile function.
print
pt          Density, CDF, and quantile function.
punif       Density, CDF, and quantile function.
pweibull    Density, CDF, and quantile function.
pwilcox     Density, CDF, and quantile function.
qbeta       Density, CDF, and quantile function.
qbinom      Density, CDF, and quantile function.
qcauchy     Density, CDF, and quantile function.
Table A.7: Functions implemented for bdVector and bdFrame (Continued)
Function Name   bdVector   bdFrame   Optional   Comment
qchisq      Density, CDF, and quantile function.
qexp        Density, CDF, and quantile function.
qf          Density, CDF, and quantile function.
qgamma      Density, CDF, and quantile function.
qgeom       Density, CDF, and quantile function.
qhyper      Density, CDF, and quantile function.
qlnorm      Density, CDF, and quantile function.
qlogis      Density, CDF, and quantile function.
qnbinom     Density, CDF, and quantile function.
qnorm       Density, CDF, and quantile function.
qnrange     Density, CDF, and quantile function.
115. import the data and seeing the error message "Unable to Obtain Requested Dynamic Memory."
Memory Requirements for In-Memory Calculations
For standard S-PLUS, the absolute upper limit on the size of data sets it can work with is set by the maximum amount of memory that S-PLUS can address. On 32-bit systems, this theoretical limit is 2^32 bytes, or approximately 4 GB. There is also a practical limit that is determined by the operating system; that is, the operating system requires some of the aforementioned 4 GB. For example, a 32-bit Windows system without special configuration reduces available virtual memory to about 1.5 GB.
In addition to considering the initial size of the data set, you must also consider the number of copies that S-PLUS makes while processing the data. The underlying S language that is part of S-PLUS makes between four and five temporary copies of a data set in memory. Memory requirements depend on the following:
• The size of data, including the number of rows and columns in the raw data file.
• Column types; that is, numeric data requires 8 bytes per value, while character data consisting of long strings requires more than 8 bytes per value.
• The data operations to be performed. During data operations, the data needs to be copied, on average, 4.5 times.
To determine approximately how much total memory (physical and virtual) a data set requires, use the following formula
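The manual's own formula is cut off in this excerpt, so the following Python sketch is only a rough estimate built from the figures stated above: 8 bytes per numeric value, an average of 4.5 copies during data operations, and a 32-bit address limit of 2^32 bytes. The function name and the way the figures are combined are assumptions, and character columns, which need more than 8 bytes per value, are ignored.

```python
BYTES_PER_NUMERIC = 8           # stated above for numeric data
AVERAGE_COPIES = 4.5            # stated above as the average number of copies
ADDRESS_LIMIT_32BIT = 2 ** 32   # theoretical limit on 32-bit systems, about 4 GB

def estimate_memory_bytes(rows, numeric_columns, copies=AVERAGE_COPIES):
    """Rough in-memory footprint of an all-numeric data set, in bytes."""
    raw_bytes = rows * numeric_columns * BYTES_PER_NUMERIC
    return raw_bytes * copies

# Hypothetical data set: one million rows and ten numeric columns.
needed = estimate_memory_bytes(1_000_000, 10)
fits_in_32bit_address_space = needed < ADDRESS_LIMIT_32BIT
```

Under these assumptions, the example data set needs about 360 MB while it is being processed, even though the raw values occupy only 80 MB, which is the kind of gap that makes out-of-memory storage attractive for larger data.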
116. ing methods, set the bigdata argument to TRUE to generate a bdVector. This instruction applies to all functions in this table. For more information and usage examples, see the functions' individual help topics.
Table A.3: Vector generation methods for large data sets
Method name
rbeta, rbinom, rcauchy, rchisq, rep, rexp, rf, rgamma, rgeom, rhyper, rlnorm, rlogis, rmvnorm, rnbinom, rnorm, rnrange, rpois, rstab, rt, runif, rweibull, rwilcox
Big Data Library Functions
The Big Data library introduces a new set of "bd" functions designed to work efficiently on large data. For best performance, it is important that you write code minimizing the number of passes through the data. The Big Data library functions minimize the number of passes made through the data. Use these functions for the best performance. For more information and usage examples, see the functions' individual help topics.
Data Exploration Functions
Table A.4: Data exploration functions
Function name   Description
bd.cor          Computes correlation or covariances for a data set. In addition, computes correlations or covariances between a sin
bd.crosstabs
117. ions 117
  Functions Supporting Graphs 117
Example Graphs 123
  Plotting Using Hexagonal Binning 123
  Adding Reference Lines 128
  Plotting by Summarizing Data 133
  Creating Graphs with Preprocessing Functions 144
  Unsupported Functions 157
INTRODUCTION
This chapter includes information on the following:
• An overview of the graph functions available in the Big Data Library, listed according to whether they take a big data object directly or require a preprocessing function to produce a chart.
• Procedures for creating plots, traditional graphs, and Trellis graphs.
Note: In Microsoft Windows, editable graphs in the graphical user interface (GUI) do not support big data objects. To use these graphs, create an S-PLUS data.frame containing either all of the data or a sample of the data.
OVERVIEW OF GRAPH FUNCTIONS
Functions Supporting Graphs
The Big Data Library supports most (but not all) of the traditional and Trellis graph functions available in the S-PLUS library. The design of graph support for big data can be attributed to practical application. For example, if you had a data set of a million rows or tens of thousands of columns, a cloud chart would produce an illegible plot. This section lists the functions that produce graphs for big data objects. If you are unfamiliar with plotting and grap
118. is menu option does nothing. The S-PLUS Workbench does not support Eclipse's Project > Build menu items. Currently, the S-PLUS Workbench does not support Eclipse's Debug perspective. To debug S-PLUS scripts in the Script Editor, use the S-PLUS debugging functions, such as inspect, browser, debugger, and others. For more information, see Chapter 7, Debugging Your Functions, in the S-PLUS Programmer's Guide.
STARTING THE S-PLUS WORKBENCH
The S-PLUS Workbench user interface is the same in both Microsoft Windows and Unix platforms.
From Microsoft Windows
In Microsoft Windows, click the Start menu > All Programs > S-PLUS 7.0 > S-PLUS Workbench.
From Unix
In Unix, at the command prompt, type Splus -w or type Splus -workbench.
S-PLUS Workspace
The S-PLUS Workspace is the directory where the S-PLUS Workspace Data and Eclipse metadata databases are stored. You should never touch these files. Optionally, the Workspace directory can also store your project directories. The S-PLUS Workspace is the default directory specified for the project's directory in the New Project wizard. See the section New Project Wizard on page 23 for more information. The S-PLUS Workspace Data directory is associated with the Workspace, not individual projects. That is, the Workspace Data directory stores all objects for all project directories associated with
119. ituents with the 10 years of history in our date range. In this section, remove the shorter-term constituents, create the time series of returns, and then compute the daily returns time series.
To create the time series:
1. Create a bdTimeSeries object, removing the stock IDs that do not have the entire history:
keepIds <- !is.element(colIds(closePrices.bd), c("DATE", "RTK", "TVIN", "KOR", "MITSY"))
closePrices.ts <- bdTimeSeries(data = closePrices.bd[, keepIds], positions = closePrices.bd$DATE)
print(class(closePrices.ts))
2. Compute the daily returns time series and assign it to the object dailyReturns.ts:
dailyReturns.ts <- diff(log(closePrices.ts))
Plotting the Time Series
In this section, create a time series object of the cumulative returns, and then plot them. Add a label to each series.
To plot the cumulative returns:
1. Create a time series object of the cumulative returns:
cumulativeReturns.ts <- cumsum(dailyReturns.ts)
2. Plot the cumulative returns:
plot(cumulativeReturns.ts, main = "Cumulative returns of SP500 Index and 20 Stocks", ylab = "Returns")
3. Annotate each series:
lastObs <- positions(cumulativeReturns.ts) == max(positions(cumulativeReturns.ts))
text(rep(1, numCols(dailyReturns.ts)), unlist(seriesData(cumulativeReturns.ts)[lastObs, , drop = T]), colIds(dailyReturns.ts), col = 3, cex = 0.5)
Analyzing the betas Manipu
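The two transformations used above, differencing the logged prices to get daily returns and taking a running sum to get cumulative returns, can be checked on a tiny series. The Python sketch below only illustrates the arithmetic and does not use the bdTimeSeries class; the function names and the three prices are hypothetical.

```python
import math

def log_returns(prices):
    """Daily log returns: the difference of consecutive logged prices."""
    return [math.log(later) - math.log(earlier)
            for earlier, later in zip(prices, prices[1:])]

def cumulative_sum(values):
    """Running total of the values, one entry per input value."""
    totals, running = [], 0.0
    for value in values:
        running += value
        totals.append(running)
    return totals

# Hypothetical closing prices that rise 10 percent on each of two days.
prices = [100.0, 110.0, 121.0]
daily = log_returns(prices)
cumulative = cumulative_sum(daily)
```

Because log returns add up, the last cumulative value equals the log of the final price divided by the first price, which is what makes the cumulative series convenient to plot and compare across stocks.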
120. ject file and census.demo.ssc.
Figure 2.20: Navigator view after deleting files.
The S-PLUS script is a text file that you can edit in the Script Editor. In this exercise, just edit census.demo.ssc using the menu items provided specifically for S-PLUS.
To Edit Script Code
1. In the Navigator View, double-click the file census.demo.ssc to open it in the Script Editor, and examine the script. Note that:
• The comment text appears in the Script Editor as green. You can change this default color in the Preferences dialog box. See the Eclipse User's Guide and the section Setting the Project's Preferences on page 44 for more information.
• The line that has focus appears highlighted.
• The line numbers appear to the left of the script text.
2. Scroll to line 12, and highlight the line and the next line:
stringsAsFactors = F, startRow = 1, bigdata = T)
3. Click Source > Shift Left. The code shifts four character spaces to the left.
4. Click Source > Format. This command formats the entire script. Note that the formatting change you made in the previous step has been reverted. Also note that the line numbers for formatted functions are highlighted.
Hint: The line numbers for any line changed in your script are highlighted until the next time you save your work.
5. Scroll to the line containing the comment bd.data.viewer(P8.bd).
6. Clic
121. ject many models packed x which plots 3
The above example shows a space difference of only a few MB (6MB to 2MB), which is probably not a large enough saving to take the time to pack the object. However, if each of the model objects were very large, and the whole list were too large to represent, the packed version would be useful.
INCREASING EFFICIENCY
The Big Data library offers several alternatives to standard S-PLUS functions to provide greater efficiency when you work with a large data set. Key efficiency functions include:
Table 7.1: Efficient Big Data library functions
Function name       Description
bd.select.rows      Use to extract specific columns and a block of contiguous rows.
bd.filter.rows      Use to keep all rows for which a condition is TRUE.
bd.create.columns   Use to add columns to a data set.
The following section provides comparisons between these Big Data library functions and their standard S-PLUS function equivalents.
bd.select.rows
Using bd.select.rows to extract a block of rows is much more efficient than using standard subscripting. Some standard subscripting and bd.select.rows equivalents include the following:
Table 7.2: bd.select.rows efficiency equivalents
Standard S-PLUS subscripting function   bd.select.rows equivalent
x[, "Weight"]                           bd.select.rows(x, columns = "Weight")
x[1:1000, c(1, 3)]                      bd.select.rows(x, from = 1
122. k Source > Toggle Comment to remove the comment character. Alternatively, you can just delete the comment character.
7. Notice that the script text color changes to indicate that the line is no longer a comment.
8. Scroll to line 187. Select all rows from 187 through the end of the script, and then click Source > Toggle Comment. (The graphsheet will not launch from the S-PLUS Workbench.)

Examining the Outline

The Outline View displays all of the items (objects, functions, and so on) that are contained in the open script. Outline View is not editable.

To Examine the Outline

1. Examine the objects that appear in the Outline View. Note that set.seed appears with a yellow arrow next to it because, in the section "Setting the Project's Preferences" on page 44, you indicated that set.seed was a function to watch.
2. Scroll through the Outline View list and highlight an object. Note that the Script Editor scrolls to and highlights the line where the object appears.

Examining Objects

Details about your project's objects and all objects in your database appear in the Objects View. Objects View is not editable; however, you can refresh the contents or change the view to another attached database. To refresh the view, right-click the Objects View and click Refresh.

To Examine the Objects

1. Select the Objects View tab to display the objects and their details. By
123. l the rest of the variables as predictors, and then examine its summary:

    tree.boston <- tree(MEDV ~ ., data = boston.housing.sample)
    summary(tree.boston)

5. Use the tree model to predict median housing values for all observations in the boston.housing data set. Plot the observed versus predicted housing values. The plot is drawn as a hexbin plot because the predicted values, as well as the observed values, are big data objects.

    predict.boston <- predict(tree.boston, boston.housing)
    plot(predict.boston, boston.housing$MEDV)

This model could be applied to a data model that included millions of points. A tree model is one example of a model that cannot be fit on bigdata objects, although the resulting model can be used to predict all observations.

In the above example, we made just a single call to the tree function to create our tree model object. In a real modeling situation, you would likely consider several different tree models and use some of the associated tree functions, such as cv.tree, prune.tree, and plot.tree, to select an appropriate model. For more information about tree models, see Chapter 19, Classification and Regression Trees, in the Guide to Statistics, Volume 2.

In the above example, we did not transform any of the predictor variables for the tree model, as we did when fitting the linear regression model to the same data earlier in this chapter. The transformations are not necessary be
124. lating Data Stock Sample

[Figure: Cumulative returns of SP500 Index and 20 Stocks; y-axis: Returns; x-axis: 1994 to 2005]
Figure 4.6: Plot of cumulative returns

The beta is one way of measuring how returns on an asset change when the market changes. In this example, the market is represented by the S&P 500 Index. This analysis shows two separate techniques for analyzing the betas. The second technique (Approach 2) is slightly faster.

To calculate betas using Approach 1a

1. Capture column IDs of the stocks:

    constituentNames <- colIds(dailyReturns.ts)[colIds(dailyReturns.ts) != "SP500"]

2. Set the process time for calculating the betas using this approach:

    t0 <- proc.time()[3]

3. Initialize the vector of betas:

    betas1.a <- structure(numeric(length(constituentNames)), names = constituentNames)

4. Loop through the stocks and calculate the beta directly as a regression coefficient:

    for (constituentName in constituentNames) {
        lmFormula <- paste(constituentName, "~ SP500")
        beta <- lm(lmFormula, data = dailyReturns.ts@data)
        betas1.a[constituentName] <- coef(beta)[2]
    }
    timeBetas1.a <- proc.time()[3] - t0

To calculate betas using Approach 1b
125. lay from the Show View menu, click Other, and then select a view from the Show View dialog box.
• If the view is not currently visible in the UI, selecting it displays the view and gives it focus in the UI.
• If the view is available, selecting it gives it focus in the UI.

S-PLUS recognizes libraries, modules, and directories as legitimate object databases. You can add any of these types of databases to the Search Path View, or detach them from it. By default, the Search Path View displays the full path of the working database and all of the attached S-PLUS data libraries. Objects existing in a recognized, active database appear in the Objects View. Objects in an added database appear in Objects View when you refresh the view to that database. See the section "Examining Objects" on page 51.

To Add a Library

1. Right-click the Search Path View.
2. From the right-click menu, click Add Library.
3. In the Attach Library dialog box, type MASS. Clear the "Attach at top of search list" check box to indicate that you want to add the library to the bottom position.
4. Click OK, and examine the Search Path View for the change.

To Add a Module

1. From the right-click Search Path menu, click Add Module.
2. In the Attach Module dialog box, provide the module name and indicate whether to add it to the first position.
3. Click OK, and examine the Search Path View for the change.

To Add a
126. ld be important when density is used for prediction or optimization.

To plot density, use the plot function. To create a sample density plot from fuel.bd, in the Commands window, type the following:

    fuel.bd <- as.bdFrame(fuel.frame)
    plot(density(fuel.bd$Weight), type="l")

The density plot is displayed as follows:

[Figure: density curve; x-axis: density(fuel.bd$Weight)$x, 1500 to 4000; y-axis: density(fuel.bd$Weight)$y]
Figure 5.13: Graph using density

Create a Trellis Density Plot

The following example creates a Trellis graph of a density plot, which displays the shape of a distribution. You can use the Trellis density plot for analyzing a one-dimensional data distribution. A density plot displays an estimate of the underlying probability density function for a data set, allowing you to approximate the probability that your data fall in any interval.

To create a sample Trellis density plot, in the Commands window, type the following:

    singer.bd <- as.bdFrame(singer)
    densityplot(~ height | voice.part, data = singer.bd,
        layout = c(2, 4), aspect = 1, xlab = "Height (inches)", width = 5)

The Trellis density plot is displayed as follows:

[Figure: Trellis density plot; x-axis: Height (inches)]
Figure 5.14: Graph using densityplot

For more information about Trellis density plots, see Chapter 3, Traditional
127. leage, nx=9, ny=9))

The image plot is displayed as follows:

[Figure: image plot; x-axis: 2000 to 4000; y-axis: 15 to 35]
Figure 5.31: Graph using hist2d to create an image plot

The levelplot function creates a Trellis graph of a level plot. For big data sets, levelplot requires a preprocessing function such as loess.

A level plot is essentially identical to a contour plot, but it has default options so you can view a particular surface differently. Like contour plots, level plots are representations of three-dimensional data in flat, two-dimensional planes. Instead of using contour lines to indicate heights in the z direction, level plots use colors. The following example produces a level plot of predictions from loess.

To create a sample Trellis level plot using loess to preprocess the data, in the Commands window, type the following:

    environ.bd <- as.bdFrame(environmental)
    ozo.m <- loess((ozone^(1/3)) ~ wind * temperature * radiation,
        data = environ.bd, parametric = c("radiation", "wind"),
        span = 1, degree = 2)
    w.marginal <- seq(min(environ.bd$wind), max(environ.bd$wind), length = 50)
    t.marginal <- seq(min(environ.bd$temperature), max(environ.bd$temperature), length = 50)
    r.marginal <- seq(min(environ.bd$radiation), max(environ.bd$radiation), length = 4)
    wtr.marginal <- list(wind = w.marginal, temperature = t.marginal,
        radiation = r.marginal)
    grid <- expand.gri
128. list of close price series for the stocks. If you are working with a large number of stocks, this list object is potentially quite large; however, when the expression is evaluated, the component bdFrame objects are not loaded into virtual memory.

To create a list of close price series

1. Read close price series for the stocks from file sources:

    closePricesList <- lapply(srcFileNames, function(fileName)
        importData(fileName, keep = c("DATE", "CLOSE"), bigdata = T))
    names(closePricesList) <- casefold(stockNames, upper = T)

2. Combine the close columns into one data set. Note that this function works even if the series items do not all have the same date column.

    closePrices.bd <- bd.join(closePricesList, key.columns = "DATE",
        suffixes = paste(".", names(closePricesList), sep = ""))

3. Remove the CLOSE column name markers:

    colIds(closePrices.bd) <- substituteString(CLOSE me TH, colIds(closePrices.bd))

Importing the S&P 500 Index Data

The data for the S&P 500 Index is drawn from the same date range, 01/01/1994 to 11/01/2004.

To import and join the S&P 500 Index data

1. Read close price series for the S&P 500 Index from the index data file inx.csv:

    closeSP500.bd <- importData(paste(getenv("SHOME"), "samples", "bigdata",
        "stocks", "inx.csv", sep = dirSeparator),
        keep = c("DATE", "CLOSE"), bigdata = T)

2. Edit the S&P 500 Index column names to identify the column as S&P 500 data
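
The join of the close price series described earlier in this excerpt keys every series on DATE, even when the series do not all cover the same dates. The following is a minimal sketch of that idea in plain Python. It is not S-PLUS code and does not show how bd.join is implemented. It assumes that a date missing from one series is kept in the result and filled with a missing value; check the bd.join help topic for the actual behavior. The function join_on_date, the stock names, and the prices are all hypothetical.

```python
# Illustrative sketch only: plain Python, not S-PLUS, and not the bd.join implementation.
# Assumption: a date missing from one series is kept and filled with None.

def join_on_date(series_by_name):
    """Combine {name: {date: close}} into one row per date seen in any series."""
    all_dates = sorted({d for series in series_by_name.values() for d in series})
    names = sorted(series_by_name)
    rows = []
    for date in all_dates:
        row = {"DATE": date}
        for name in names:
            # Column label follows a CLOSE.<NAME> pattern (hypothetical naming).
            row["CLOSE." + name] = series_by_name[name].get(date)
        rows.append(row)
    return rows

# Hypothetical close prices; the two series do not share all dates.
prices = {
    "AAA": {"1994-01-03": 10.0, "1994-01-04": 10.5},
    "BBB": {"1994-01-04": 20.0, "1994-01-05": 21.0},
}
joined = join_on_date(prices)
for row in joined:
    print(row)
```

Keying on the union of dates keeps every observation, at the cost of missing values that later steps must handle explicitly.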
129. lly include large data sets, then select this option to always load the bigdata library when you start the S-PLUS Workbench.

Show Anonymous Functions in Outline
By default, the S-PLUS Script Editor shows anonymous functions in the outline.

Functions to Watch
Contains a predefined list of S-PLUS functions to identify in the Outline View. You can add your own functions to this list using the New button. You can also remove functions from the list or reorder the list.

Figure 2.5: S-PLUS Workbench Options dialog box

• S-PLUS Editor Options. These options control settings for the S-PLUS Workbench Script Editor.

Show Line Numbers
By default, the S-PLUS Script Editor shows line numbers.

Background Color
By default, the S-PLUS Script Editor uses the system background color. Select Custom Color, and then click the color button to display the Color dialog box and choose a different background color.
130. local versions. For more information, see the topic "Local history" in the Eclipse Workbench User Guide.

Table 2.7: Eclipse Tasks and Features (Continued)

Task: Replacing files with a previous version
Eclipse Feature Description: The Replace With Local History and Replace With Previous from Local History menu items are available from the control menu in Navigator View. Using these features, you can replace the current version of the selected file with one of the previously stored local versions. Replace With Previous from Local History displays no selection dialog box; it just replaces the file. To choose a previous state in the Local History list, use Replace With Local History. For more information, see the topic "Replacing a resource with local history" in the Eclipse Workbench User Guide.

Task: Finding a word in a project or a term in a Help topic
Eclipse Feature Description: Using the Search > File menu item, you can find all occurrences of a word in a project or Help topic. For more information, see the topic "File search" in the Eclipse Workbench User Guide.

Task: Filter files in the Navigator View
Eclipse Feature Description: Using the Working Sets menu option on the control menu in Navigator View, you can create subsets of files to display or hide. For more information, see the topics "Working Sets" and "Showing or hiding files in the Navigator View" in the Eclipse Workbench User Guide.

Task: Vie
131. lowing:

    bwplot(Type ~ Fuel, data = as.bdFrame(fuel.frame))

The box and whisker plot is displayed as follows:

[Figure: box and whisker plot; y-axis: Type (Sporty, Medium, Large, Compact)]
Figure 5.12: Graph using bwplot

For more information about bwplot, see Chapter 3, Traditional Trellis Graphics, in the Application Developer's Guide.

Create a Density Plot

The density function returns x and y coordinates of a non-parametric estimate of the probability density of the data. Options include the choice of the window to use and the number of points at which to estimate the density. Weights may also be supplied.

Density estimation is essentially a smoothing operation. Inevitably, there is a trade-off between bias in the estimate and the estimate's variability: wide windows produce smooth estimates that may hide local features of the density.

For big data, density summarizes the data. That is, when the data is a bdVector, the data is aggregated before smoothing. The range of the x variable is divided into 1000 bins, and the mean for x is computed in each bin. A weighted density estimate is then computed on the bin means, weighted based on the bin counts. This calculation gives values that differ somewhat from those when density is applied to the unaggregated data. The values are usually close enough to be indistinguishable when used in a plot, but the difference cou
132. lving a large number of observed variables, it is often useful to simplify the analysis by considering a smaller number of linear combinations of the original variables. PCA is one method for this data reduction. It finds linear combinations of the data that are orthogonal and, taken together, explain all of the variance of the original data. The linear combinations from PCA can be ordered based on the variability in the original data that each one explains. It might be possible, due to redundancy in the variables, to reduce the dimension of the data by using PCA, yet still retain most of the original variability in the data.

Using principal components, you can reduce the number of predictor variables and compute values to use as predictors in a logistic regression. Take care when using the principal components as predictors for a response variable, because the principal components are computed independently of the response variable. Retaining the principal components that have the highest variance is not the same as choosing those principal components that have the highest correlation with the dependent variable.

Note: The signs of the loadings might differ between princomp and bdPrincomp because the signs are not uniquely determined.

The Big Data library provides the Principal Component functions listed below. For more detailed informatio
133. mation about using the Commands window, see Chapter 10, Using the Commands Window, in the S-PLUS User's Guide.

The Big Data library provides dialog box support for the following two functions, in Microsoft Windows only:
• importData
• exportData

For more information about importing and exporting data, including a list and descriptions of supported file types, see Chapter 5, Importing and Exporting Data, in the S-PLUS User's Guide.

If you are using Microsoft Windows, you can use the GUI dialog boxes for importing data. To import the data as a large data set using either the Import From File or Import from Database dialog boxes, select the Import as Big Data checkbox. For more information about using the Import Data dialog box in Windows, click Help > Available Help > S-PLUS Help, and then see the topic "Importing Data Files."

Note: From the command line, import the data using the importData function. For more information on importing data from the command line, see the section "Importing the Data" on page 107.

S-PLUS 7 includes Census and Stock big data examples. The example files are installed in the samples directory in your S-PLUS program directory. In the following section, import the census example data.

To import the Big Data census example data set using the S-PLUS GUI in Microsoft Windows

1. From the File > Import Data menu, open the Import
134. meric

Table A.7: Functions implemented for bdVector and bdFrame (Continued)

    Function Name     Optional Comment
    length
    levels            Handles bdFactor.
    levels<-          Handles bdFactor.
    mad
    match
    Math              Operand function.
    Math2             Operand function.
    matrix
    mean
    median
    merge
    na.exclude
    na.omit
    names             bdVector cannot have names.
    names<-           bdVector cannot have names.
    nchar             Handles bdCharacter, not bdFactor.
    ncol
    notSorted
    nrow
    numberMissing
    Ops
    pairs
    pbeta             Density, CDF, and quantile function.
    pbinom            Density, CDF, and quantile function.
    pcauchy           Density, CDF, and quantile function.
    pchisq            Density, CDF, and quantile function.
    pexp              Density, CDF, and quantile function.
    pf                Density, CDF, and quantile function.
    pgamma            Density, CDF, and quantile function.
    pgeom             Density, CDF, and quantile function.
    phyper            Density, CDF, and quantile function.
    plnorm            Density, CDF, and quantile function.
    plogis            Density, CDF, and quantile function.
    plot
    pmatch
    pmvnorm           Density and CDF function.
135. more information.
• When you make an error and save your script file, the error shows in the Problems View. See the section "To Examine Problems" on page 54 for more information.
• When you create a new object in the script, it appears in the Objects View with its properties. The object also appears in the Outline View.

S-PLUS customizes the basic Eclipse menu and right-click menus to include the following Script Editor control menu items.

This menu item is available only through the right-click menu. Use this command to copy the text selected in the Script Editor to the Console View. When you copy text to the Console View, S-PLUS runs the command. See the section "Copying Script Code to the Console" on page 52 for more information.

This menu item is available through the right-click menu. It is also available as a button on the toolbar and by pressing F9 when the Script Editor is in focus. Use this command either to run the entire script or to run the selected commands in the Script Editor. When you run the script, you can observe the results in the Output View. See the section "Running Code and Reviewing the Output" on page 54 for more information.

Note: The S-PLUS Workbench does not implement the core Eclipse Run menu item.

S-PLUS Help

This menu item is available from the Help menu. When you open S-PLUS Help from the Help m
136. mputing, some compromises are inevitable. The most obvious of these is computation speed.

The Big Data library in the S-PLUS Enterprise Developer provides scalable algorithms that are designed to minimize disk access, and therefore provide optimal performance with out-of-memory data sets. This makes S-PLUS Enterprise Developer a reliable workhorse for processing very large amounts of data. When your data is small enough for traditional S-PLUS, it's best to remember that in-memory processes are faster than out-of-memory processes.

If your data set size is not extremely large, all of the S-PLUS traditional in-memory algorithms remain available, so you need not compromise speed and flexibility for scalability when it's not needed.

To optimize performance, S-PLUS stores certain calculated statistics as metadata with each column of a bdFrame object, and updates the metadata every time the data changes. These statistics include the following:
• Column mean (for numeric columns).
• Column maximum and minimum (for numeric and date columns).
• Number of missing values in the column.
• Frequency counts for each level in a categorical column.

Requesting the value of any of these statistics (or a value derived from them) is essentially a free operation on a bdFrame object. Instead of processing the data set, S-PLUS just returns the precomputed statistic. As a result, calculations on columns of bdFrame objects such as the following examples are practi
137. n "When to Set bigdata=T" on page 65.

Alternatively, in Microsoft Windows, import data using the Import Data dialog box in the S-PLUS GUI. For more information about using the Import Data dialog box in Windows, click Help > Available Help > S-PLUS Help, and then see the topic "Importing Data Files."

Once your data is imported into S-PLUS, you can view and manipulate the data. Chapter 4, Exploring and Manipulating Large Data Sets, contains more in-depth discussions of the following data manipulation tasks:
• Converting data.
• Generating a large data vector.
• Displaying data as a big data frame (bdFrame).
• Exploring data.
• Cleaning data.
• Splitting data.
• Appending data sets.
• Manipulating and filtering rows and columns.
• Manipulating time series objects.
• Exporting data.

Build a Graphical Display

Once your data has been cleaned, sorted, and filtered in S-PLUS, you can optionally build a graphical display as an initial step towards assessing trends in your data. Chapter 5, Creating Graphical Displays of Large Data Sets, contains more in-depth discussions of the following graphics tasks:
• Plotting using hexagonal bins.
• Creating a traditional graph.
• Creating a Trellis graph.
• Evaluating and aggregating data over a grid using traditional or Trellis graphs.
• Creating time series graphs.

Create a Model

After you have examined an initial graph of your data, you can decide how you plan to model the data. The Big Data library contains support f
138. n on each function, see its help topic.

Table 6.2: Principal components functions

    Function name        Description
    loadings             Returns the loadings component of an object.
    predict              Computes principal component variables for new observations.
    print                Prints the input.
    screeplot or plot    Produces a barplot of the variances of the derived variables.
    summary              Provides a summary of principal components.

This example uses the data set provided with S-PLUS, Prim4. Prim4 is a relatively small data set (500 rows and 4 columns), but for demonstration purposes, convert it to a big data object.

1. Convert Prim4 to a big data object:

    prim4.bd <- as.bdFrame(prim4)

2. Create a princomp object from prim4.bd. princomp returns an object of class bdPrincomp, containing the standard deviations of the principal components, the loadings, and, optionally, the scores.

    prim4.bdp <- princomp(prim4.bd)

3. Get the loadings for prim4.bdp:

    loadings(prim4.bdp)

4. Produce a plot:

    plot(prim4.bdp)

The plot displays as follows:

[Figure: bar plot of Variances for Comp.1, Comp.2, Comp.3, and Comp.4]
Figure 6.4: prim4.bdp

5. Call predict to extract the fitted values:

    predict(prim4.bdp)
    bdFrame: 500 rows, 4 columns
         Comp.1   Comp.2     Comp.3     Comp.4
    1 9.6113930 1.257928 0.48919465 0.87537112
    2 4.8931668 3.164171 0.29226528 0.68005429
    3 4.9597341 2.940688 0.230
139. nd includes information from the following census tables:
• Table P8: Contains the total ZCTA population data (P008001), with each column separated by gender and age, with the ages aggregated into 5-year bins (M.00, M.05, M.10, ... F.00, F.05, F.10, and so on). This table also includes the latitude (INTPTLAT) and longitude (INTPTLON) information for each ZCTA, which we use for plotting purposes.
• Table H7: Contains ZCTA tenancy information, including:
  Total number of occupied housing units (H007001)
  Number of owned homes (H007002)
  Number of rented homes (H007003)

Overview of Data Manipulation Functions

The table below lists some common tasks for working with large data objects. Corresponding to the tasks is a list of functions that apply to the task. Each function is described in further detail, with an example showing how to use it, in its corresponding help topic, which you can access easily from the command line by typing help(functionname). The tasks that apply to the census data set are described in more detail, with procedures and example code, later in this chapter.

Table 4.1: Data manipulation tasks and their associated functions

    Task                                                            Function names
    Importing data                                                  importData
    Converting data (for example, from big data to a data frame)    bd.coerce
    Generating a vector of random numbers                           rbeta, rbinom, rcauchy, rchisq, rep, rexp
140. nning the entire file by explicitly setting bd.options("default.string.column.width") before the call to importData:

    bd.options(default.string.column.width=200)
    dat <- importData(f, type="ASCII", stringsAsFactors=F, bigdata=T, scanLines=10)
    bd.string.column.width(dat)
    strsize str
    200

This string truncation does not occur when S-PLUS reads long strings as factors, because there is no limit on factor level string length.

One more point to remember when you import strings: the low-level importData and exportData code truncates any strings (either character strings or factor levels) that have more than 254 characters. If bigdata=T, importData generates a warning when it encounters such strings.

You can use one of the following techniques for setting string column widths explicitly:
• To set the default width (if it is not determined some other way), use bd.options("default.string.column.width").
• To override the default column string widths in bd.block.apply, specify the out1.column.string.widths list element when IM$test is T, or when outputting the first non-NULL output block.
• To set the width for new output columns, use the string.column.width argument to bd.create.columns.

When you use bd.create.columns to create a new character column, you must set the column string width. You can set this width explicitly with the string column wid
141. ns

Table A.7: Functions implemented for bdVector and bdFrame (Continued)

    Function Name    Optional Comment
    attr<-
    attributes
    attributes<-
    bdFrame          Constructor. Inputs can be bdVectors, bdFrames, or ordinary objects.
    boxplot          Handles bdNumeric.
    by
    casefold
    ceiling
    coerce
    colIds
    colIds<-
    colMaxs
    colMeans
    colMins
    colRanges
    colSums
    colVars
    concat.two
    cor
    cut
    dbeta            Density, cumulative distribution (CDF), and quantile function.
    dbinom           Density, CDF, and quantile function.
    dcauchy          Density, CDF, and quantile function.
    dchisq           Density, CDF, and quantile function.
    density
    densityplot
    dexp             Density, CDF, and quantile function.
    df               Density, CDF, and quantile function.
    dgamma           Density, CDF, and quantile function.
    dgeom            Density, CDF, and quantile function.
    dhyper           Density CDF
142. [Figure: Problems view reporting 1 error, 0 warnings, 0 infos: a syntax error in census.demo.ssc, folder Census, line 37]
Figure 2.14: S-PLUS Workbench Problems view

Search Path View

The Search Path View displays the names (or full pathname, in the case of the working data) and search path position of all the attached S-PLUS databases. By right-clicking the Search Path View, you can:
• Attach a library.
• Attach a module.
• Attach a directory.
• Detach the currently selected database in the view.
• Refresh the current view.

Note: When you use the control menu to add a library, module, or directory to the Search Path View, or to remove one from it, the view automatically refreshes. When you run code to add or remove a library, module, or directory, the view is not automatically refreshed. To refresh the view, right-click the Search Path View (or click the control menu button), and then from the menu, click Refresh.

The databases that are in your search path determine the objects that are displayed in Objects View. That is, if a database is in your search path, the objects in that database appear in the Objects View. See the section "Examining Objects" on page 51. For more information about working with the Search Path View, see the section "Changing Attached Databases" on page 46.
143. o run code. Also like the S-PLUS Commands window, the Console View concatenates the code that runs throughout your S-PLUS Workbench session, so you can review and save it.

To Run Copied Script Code

1. Select lines 1 through 9 in the script. Be sure to select the line return at the end of line 9.
2. Right-click the code, and click Copy to Console. The selected code is copied immediately to the Console View and runs. (You do not need to paste it in the Console View.)
3. Repeat steps 1 and 2 for line 10.
4. Finally, repeat steps 1 and 2 for lines 11-13.

You can select all of the code (lines 1-13), but if you do so, it appears in the History View as one line. By following the steps above, the History View reflects the three different calls to run the code. See the section "Examining the History View" on page 53 for more information.

Examining the History View

This exercise uses the script code run in the section "Copying Script Code to the Console" on page 52.

The History View reflects the code run in the Console View. Note that the History View displays each selection you make (even if it is more than one command) on one line, and if the line extends beyond about 50 characters, the History View displays an ellipsis to indicate more code. To display each line of code in the History View, you must run the lines individually.

To Examine the History

1. To examine and rerun code from the History View
2. Click the
144. o the Tasks View" on page 51.

These views are discussed in the following sections, and corresponding exercises for using the views are listed above.

S-PLUS also uses the default Navigator View, which displays project directories and all files associated with the project. The Eclipse IDE contains other views, described in the Eclipse Workbench User's Guide.

The default perspective settings control the views that open by default in preset locations in the Workbench UI; however, you can customize the view appearance and then save the resulting perspective. See the section "Customizing the S-PLUS Workbench Default Perspective and Views" on page 45 for more information.

Using the standard Eclipse IDE features, you can:
• Close a view by clicking the X icon on the view tab.
• Reposition a view by clicking its tab and dragging it to another part of the UI.
• Set a selected view to Fast View. This option hides the view to free space in the Workbench window, and places a minimized icon (which you can click to open the view) on the shortcut bar.
• Change the views you see in the perspective. See the section "To Change the Displayed Views" on page 46.

S-PLUS Workbench Console View

The S-PLUS Workbench Console View is an editable view, analogous to the Commands window in the S-PLUS GUI. Using the Console View, you can:
• Run individual S-PLUS commands by typing them and pressing
145. of bd.options("default.string.column.width").

Because of the way that bdFrame factor columns are represented, a factor cannot have an unlimited number of levels. The number of levels is restricted to the value of the option bd.options("max.levels"). The default is 500.

If you attempt to create a factor with more than this many levels, a warning is generated. For example:

    dat <- bd.create.columns(data.frame(num=1:2000), se EH aa ee factor
    Warning messages:
      "CreateColumnsEngineNode (0): output column f has 1500 NA values
      due to categorical level overflow (more than 500 levels): you may
      want to change this column type from categorical to string" in:
      bd.internal.exec.node(engine.class = engine.class, node.props = node.props
    summary(dat)
          num              f
     Min.   :   1.0   x99    :   1
     1st Qu.: 500.8   x98    :   1
     Median :1001.0   x97    :   1
     Mean   :1001.0   x96    :   1
     3rd Qu.:1500.0   x95    :   1
     Max.   :2000.0   (Other): 495
                      NA's   :1500

You can increase the max.levels option up to 65,534, but factors with so many levels should probably be represented as character strings instead.

Note: Strings are used for identifiers (such as street addresses or social security numbers), while factors are used when you have a limited number of categories (such as state names or product types) that are used to group rows for tables, models, or graphs.

String Normally, if strings are truncated or factor levels overflow
146. on qqmath can also make probability plots for other distributions. It has an argument, distribution, whose input is any function that computes quantiles. The default for distribution is qnorm. If you set distribution = qexp, the result is an exponential probability plot. To create a sample qqmath plot, in the Commands window, type the following:
    singer.bd <- as.bdFrame(singer)
    qqmath(~ height | voice.part, data = singer.bd, layout = c(2, 4), aspect = 1, xlab = "Unit Normal Quantile", ylab = "Height (inches)")
The qqmath plot is displayed as follows:
[Figure 5.19: Graph using qqmath. x-axis: Unit Normal Quantile; y-axis: Height (inches).]
Create a Single Vector QQ Plot
The function qqnorm creates a plot using a single bdVector object. The following example creates a plot from the mileage vector of the fuel.bd object. To create a sample qqnorm plot, in the Commands window, type the following:
    fuel.bd <- as.bdFrame(fuel.frame)
    qqnorm(fuel.bd$Mileage)
The qqnorm plot is displayed as follows:
[Figure 5.20: Graph using qqnorm. x-axis: Quantiles of Standard Normal.]
Create a Two Vector QQ Plot
The function qqplot creates
147. on a big data object, call dotplot after using aggregate to reduce the size of the data. In the following example, sum the barley yields over sites to get the total yearly yield for each variety. To create a sample dot plot, in the Commands window, type the following:
    barley.bd <- as.bdFrame(barley)
    temp.df <- bd.coerce(aggregate(barley.bd$yield, list(year = barley.bd$year, variety = barley.bd$variety), sum))
    dotplot(variety ~ x | year, data = temp.df, aspect = 0.4, xlab = "Barley Yield (bushels/acre)")
The resulting Trellis dot plot appears as follows:
[Figure 5.30: Graph using aggregate to create a dot chart. x-axis: Barley Yield (bushels/acre); one row per barley variety.]
Create an Image Graph Using hist2d
The following example creates an image graph using hist2d to preprocess data. The function image creates an image, under some graphics devices, of shades of gray or colors that represent a third dimension. To create a sample image plot using hist2d to preprocess the data, in the Commands window, type the following:
    fuel.bd <- as.bdFrame(fuel.frame)
    image(hist2d(fuel.bd$Weight, fuel.bd$Mi
148. onent that is typically a data frame.
• A positions component that is a timeDate or timeSequence object (timeSeries), or a bdNumeric or numericSeries object (signalSeries).
• A units component that is a character vector with information on the units used in the data columns.
The Big Data library equivalent is a bdSeries object with two subclasses: bdTimeSeries and bdSignalSeries. They contain:
• A data component that is a bdFrame.
• A positions component that is a bdTimeDate object (bdTimeSeries) or bdNumeric object (bdSignalSeries).
• A units component that is a character vector.
For more information about using large time series objects and their classes, see the section Time Classes on page 81 and the section Working with Time Series Data on page 109 of Chapter 4, Exploring and Manipulating Large Data Sets.
Classes
The Big Data library follows the same object-oriented design as the standard S-PLUS Sv4 design. For a review of object-oriented programming concepts, see Chapter 8, Object-Oriented Programming in S-PLUS, in the Programmer's Guide. Each object has a class that defines methods that act on the object. The library is extensible: you can add your own objects and classes, and you can write your own methods. The following classes are defined in the Big Data library. For more information about each of these classes, see their individual help topics.
Table 3.6: Big Data classes
149. or the following model types:
• Linear Regression Model
• Generalized Linear Model
• Principal Components
• Clustering
Chapter 6, Modeling Large Data Sets, contains more in-depth discussions of the following tasks:
• Building a Model
• Predicting from Small Data Models
ADVANCED PROGRAMMING
Whether you are new to S-PLUS or an experienced user, consider taking advantage of the more advanced features available with the Big Data library.
More Advanced Programming Concepts and Tasks
Chapter 7, Advanced Programming Information, discusses more complicated issues, including:
• Enhancing performance
• Splitting and aggregating data
• Performing by-block computations
• Creating your own Big Data subclasses
• Writing your own Big Data functions, classes, and methods
• Writing general computations with bd.block.apply
THE S-PLUS WORKBENCH
• Introduction
• Starting the S-PLUS Workbench
• S-PLUS Workspace
• S-PLUS Preferences
• New Project Wizard
• S-PLUS Perspective
• Changing the S-PLUS Workbench Perspective
• Views
• Customizing the Perspective's Views
• S-PLUS Workbench Console View
• History View
• Objects View
• Outline View
• Output View
• Problems View
• Search Path View
• Tasks View
• Script Editor
• Text Editing Assistance
• View Integration
• Menu Options
• S-PLUS Workbench Tasks
• Creating a Project
• Setting the Project's Preferences
• Customizing the S-PLUS Workbench Default Perspective and Views
• Changing Attached Databa
150. p t, tabulate [Handles bdNumeric], tapply, trigamma, union, unique, var, which.infinite, which.na, which.nan, xy2cell, xyCall, xyplot.
(Table A.7: Functions implemented for bdVector and bdFrame, continued. Columns: Function Name, bdVector, bdFrame, Optional Comment. Comments are shown in brackets in source order.)
Graph Functions
For more information and examples for using the traditional graph functions, see their individual help topics, or see Chapter 5, Creating Graphical Displays of Large Data Sets.
Table A.8: Traditional graph functions
Function name: barplot, boxplot, contour, dotchart, hexbin, hist, hist2d, image, interp, pairs, persp, pie, plot, qqnorm, qqplot.
For more information about using the Trellis graph functions, see their individual help topics, or see Chapter 5, Creating Graphical Displays of Large Data Sets.
Table A.9: Trellis graph functions
Function name: barchart, contourplot, densityplot, dotplot, histogram, levelplot, piechart, qq.
Note: The cloud and parallel graphics functions are not implemented for bdFrames.
151. The stripplot plot is displayed as follows:
[Figure 5.22: Graph using stripplot for singer.bd. x-axis: Height (inches); one row per voice part, from Soprano 1 to Bass 2.]
Creating Graphs with Preprocessing Functions
The functions discussed in this section do not accept a big data object directly to create a graph; rather, they require a preprocessing function, such as those listed in the section Functions Providing Support to Preprocess Data for Graphing on page 120.
Create a Bar Chart
Calling barchart directly on a large data set produces a large number of bars, which results in an illegible plot. If your data contains a small number of cases, convert the data to a standard data frame before calling barchart. If your data contains a large number of cases, first use aggregate, and then use bd.coerce to create the appropriate small data set. In the following example, sum the yields over sites to get the total yearly yield for each variety. To create a sample bar chart, in the Commands window, type the following:
    barley.bd <- as.bdFrame(barley)
    temp.df <- bd.coerce(aggregate(barley.bd$yield, list(year = barley.bd$year, variety = barley.bd$variety), sum))
    barchart(variety ~ x | year, data = temp.df, aspect = 0.4, xlab = "Barley Yield (bushels/acre)")
The resulting bar chart a
152. pl bd Lon
[Figure 4.3: ZIP code concentration plot]
2. Examine the plot, and notice the concentration of ZIP codes in the Northeast, Ohio valley, and upper mid-west, along with a relatively smaller concentration on the California coast and other urban population centers.
For more information about the graph functions available for large data sets, see Chapter 5, Creating Graphical Displays of Large Data Sets.
To compare the distribution of age-gender groups across different ZCTAs, you must adjust the values for the total population count within the ZCTA. The simplest adjustment is to divide each age-gender population value by the total population for that ZCTA. This procedure yields the fraction of the population for that ZCTA in each age-gender group. This transformation makes column comparisons meaningful when you do a cluster analysis in Chapter 6, Modeling Large Data Sets.
To transform the data:
1. Divide each of the data columns by the total population for each row in the reference data set, which is contained in the column named P008001, and then store this transformed data in a new big data object:
    P8.dataN.bd <- P8.data.bd / P8.ref.bd[, "P008001"]
2. Modify this new object by appending an N to the column labels to signify they've been normalized. (Both the name of the new big data object and its variables contain N.)
    names(P8.dataN.bd) <- pa
153. port
For more information and usage examples, see the functions' individual help topics.
Table A.1: Import and export functions
• data.dump: Creates a file containing an ASCII representation of the objects that are named.
• data.restore: Puts data objects that had previously been put into a file with data.dump into the specified database.
• exportData: Exports a bdFrame to the specified file or database format. Not all standard S-PLUS arguments are available when you import a large data set. See exportData in the S-PLUS Language Reference for more information.
• importData: When you set the bigdata flag to TRUE, imports data from a file or database into a bdFrame. Not all standard S-PLUS arguments are available when you import a large data set. See importData in the S-PLUS Language Reference for more information.
Object Creation
The following methods create an object of the specified type. For more information and usage examples, see the functions' individual help topics.
Table A.2: Big Data library object creation functions
Function: bdCharacter, bdCluster, bdFactor, bdFrame, bdGlm, bdLm, bdLogical, bdNumeric, bdPrincomp, bdSignalSeries, bdTimeDate, bdTimeSeries, bdTimeSpan.
Big Vector
For the follow
154. ppears as follows:
[Figure 5.23: Graph using barchart. x-axis: Barley Yield (bushels/acre); one bar per barley variety in each panel.]
Create a Bar Plot
The following example creates a simple bar plot from fuel.bd, using table to preprocess data. To create a sample bar plot using table to preprocess the data, in the Commands window, type the following:
    fuel.bd <- as.bdFrame(fuel.frame)
    barplot(table(fuel.bd$Type), names = levels(fuel.bd$Type), ylab = "Count")
The bar plot is displayed as follows:
[Figure 5.24: Graph using barplot. y-axis: Count; bars: Compact, Large, Medium, Small, Sporty, Van.]
To create a sample bar plot using tapply to preprocess the data, in the Commands window, type the following:
    fuel.bd <- as.bdFrame(fuel.frame)
    barplot(tapply(fuel.bd$Mileage, fuel.bd$Type, mean), names = levels(fuel.bd$Type), ylab = "Average Mileage")
The bar plot is displayed as follows:
[Figure 5.25: Graph using tapply to create a bar plot. y-axis: Average Mileage; bars: Compact, Large, Medium, Small, Sporty, Van.]
Create a Contour Plot
A contour plot is a representation of three-dimensional data in a flat, two-dimensional plane. Each contour line represents a
155. puter does not have enough RAM to hold the data, your computer returns an out of memory message. S-PLUS Enterprise Developer includes the Big Data library, which provides functions to store and manipulate data out of memory. For a more in-depth discussion about how the Big Data library uses out-of-memory data storage to help solve this problem, see Chapter 3, The Big Data Library.
The S-PLUS graphical user interface (GUI) in Microsoft Windows and the S-PLUS Workbench in Microsoft Windows or Unix platforms provide limited support for working with large data frames. You can use the S-PLUS GUI in Microsoft Windows to import, export, and view data in the Data Viewer. Otherwise, you must call Big Data Library functions by typing them at the prompt in the Commands window. For more information about importing or exporting large data sets, see Chapter 3, The Big Data Library.
Working with Large Data Sets
When you work with a large data set, you can perform any or all of the tasks illustrated in Figure 1.1.
[Figure 1.1: Big Data tasks. Define Problem, Import Data, Manipulate and Graph Data, Create a Model.]
You might just be importing or manipulating data, or building a graphical display, or modeling data. This section outlines these high-level tasks, first discussing the concepts behind defining your data problem, and then dividing each high-level task into proc
156. (Table A.7: Functions implemented for bdVector and bdFrame, continued. Columns: Function Name, bdVector, bdFrame, Optional Comment. Comments are shown in brackets in source order.)
qpois [Density, CDF, and quantile function], qq, qqmath, qqnorm, qqplot, qt [Density, CDF, and quantile function], quantile, qunif [Density, CDF, and quantile function], qweibull [Density, CDF, and quantile function], qwilcox [Density, CDF, and quantile function], range, rank, replace, rev, rle, row.names [Always NULL], row.names<- [Does nothing], rowIds [Always NULL], rowIds<- [Does nothing], rowMaxs, rowMeans, rowMins, rowRanges, rowSums, rowVars, runif, sample, scale, setdiff, shiftPositions, show, skewness [Handles bdNumeric], sort, split, stdev [Handles bdCharacter] sub, sub<-, substring, substring<-, Summary [Operand function], summary, swee
157. represent the number of females in those age groups. The last bin contains males or females age 85 and older. For example, the first ZCTA shown is 00601, and there are 712 males from 0-4 years old in this ZCTA. Although these raw counts are interesting in their present form, the data for two ZCTAs cannot be compared directly, because the total populations in the ZCTAs vary greatly. The objective in this section is to demonstrate using several big data functions by manipulating this data set.
The census example uses some customized functions. To continue working with the example, provide references to these source files.
To reference the supporting function files for the example:
1. Open the Commands window.
2. At the command prompt, type:
    source(paste(getenv("SHOME"), "/samples/bigdata/census/my.vbar.q", sep = ""))
    source(paste(getenv("SHOME"), "/samples/bigdata/census/graph.setup.q", sep = ""))
Note: graph.setup.q runs graphsheet on the Windows platform and java.graph on Unix platforms. If you are working with this example in the S-PLUS Workbench, remember that Eclipse does not work with java.graph on the Unix platform.
Converting Data
You can convert a standard data frame object to a bdFrame object. In the following procedure, you can load the census data set as a standard S-PLUS data frame, and then convert it to a bdFrame. Later, you can convert the bdFrame to a data frame
158. reprocessed using either table or tapply. To create a sample dot chart using table to preprocess data, in the Commands window, type the following:
    fuel.bd <- as.bdFrame(fuel.frame)
    dotchart(table(fuel.bd$Type), labels = levels(fuel.bd$Type), xlab = "Count")
The dot chart is displayed as follows:
[Figure 5.28: Graph using table to create a dot chart. x-axis: Count; one row per vehicle type.]
To create a sample dot chart using tapply to preprocess data, in the Commands window, type the following:
    fuel.bd <- as.bdFrame(fuel.frame)
    dotchart(tapply(fuel.bd$Mileage, fuel.bd$Type, median), labels = levels(fuel.bd$Type), xlab = "Median Mileage")
The dot chart is displayed as follows:
[Figure 5.29: Graph using tapply to create a dot chart. x-axis: Median Mileage; one row per vehicle type.]
Create a Dot Plot
The function dotplot creates a Trellis graph that displays dots and gridlines to mark the data values in dot plots. The dot plot reduces most data comparisons to straightforward length comparisons on a common scale. When using dotplot
159. rkbench
Foreground: You can select a custom color for each of the text types listed in the Foreground box by selecting the text type and then clicking Choose Color. Select the color for the text type from the Color dialog box.
[Figure 2.6: S-PLUS Editor Options dialog box. Options shown: Show Line Numbers; Background Color (System Default or Custom); Foreground text types (Default, Comment, Keyword, String, Constant) with a Choose Color button.]
• Task Options: Lists the three pre-defined default task tags. See the section Tasks View on page 36 for more information.
[Figure 2.7: S-PLUS Task Options dialog box. Task tags shown: FIXME (HIGH), TODO (NORMAL), XXX (LOW).]
New Project Wizard
When you start a new S-PLUS project in the S-PLUS Workbench, you see the New Project wizard, where you specify the location of your project files. S
160. rms and concepts that vary from the traditional S-PLUS Windows GUI and Java GUI. The Eclipse IDE contains extensive in-depth documentation for its user interface. For information about basic Eclipse IDE functionality, see the integrated documentation, the Workbench User Guide.
Note: If you are using the Eclipse IDE on a Unix platform from a Windows machine, using a Windows X-server software package, you might notice that Eclipse runs slowly, similar to the S-PLUS Java GUI. See the Release Notes for more information and recommendations for improving UI performance.
Table 2.1: Important terms and concepts
• Perspective: Defines the preferences, settings, and views for working with Eclipse projects. The S-PLUS perspective is conceptually equivalent to the traditional S-PLUS Windows GUI or Java GUI. Use the S-PLUS perspective as the primary perspective for interactive command-line use of S-PLUS. For an example of changing the perspective, see the section Customizing the S-PLUS Workbench Default Perspective and Views on page 45.
• Workspace: A physical directory on your machine that manages S-PLUS Workbench resources, such as projects and other options. On your machine's hard drive, the Workspace directory contains the S-PLUS .Data database and the Eclipse .metadata database. You should never touch these resources. Notice that the .Data databa
161. s. If the block size is large enough, we can use bd.by.group to process each of the GENDER groups of 500 rows:
    BIG.GROUPS <- data.frame(GENDER = rep(c("MALE", "FEMALE"), length = 1000), NUM = rnorm(1000))
    bd.options(block.size = 5000)
    bd.by.group(BIG.GROUPS, by.columns = "GENDER", FUN = function(df) data.frame(GENDER = df$GENDER[1], NROW = nrow(df)))
      GENDER NROW
    1 FEMALE  500
    2   MALE  500
If the block size is set below the size of the groups, this same operation will generate an error:
    bd.options(block.size = 10)
    bd.by.group(BIG.GROUPS, by.columns = "GENDER", FUN = function(df) data.frame(GENDER = df$GENDER[1], NROW = nrow(df)))
    Problem in bd.internal.exec.node(engine.class ...: BDLManager$BDLSplusScriptEngineNode (0): Problem in bd.internal.by.group.script(IM, function ...: can't process block with 500 rows for group [FEMALE]: can only process 10 rows at a time: check bd.options() values for block.size and max.block.mb
    Use traceback() to see the call stack
In this case, bd.split.by.group could be called to divide the data into a list of multiple bdFrame objects and process them individually:
    BIG.GROUPS.LIST <- bd.split.by.group(BIG.GROUPS, by.columns = "GENDER")
    data.frame(GENDER = names(BIG.GROUPS.LIST), NROW = sapply(BIG.GROUPS.LIST, nrow, simplify = T), row.names = NULL)
      GENDER NROW
    1 FEMALE  500
    2   MALE  500
Another function where block size is a concern is bd.block.apply, which applies user-specified S-PL
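Before calling bd.by.group as in excerpt 161, it can help to compare the largest group against the block size. The following is a minimal sketch in plain S, run on an ordinary in-memory data frame and without any Big Data library calls; block.size here is an ordinary variable that mirrors the value set in the failing example, not a query of the actual option.

```r
# Same test data as in the excerpt: 1000 rows, alternating MALE/FEMALE.
BIG.GROUPS <- data.frame(GENDER = rep(c("MALE", "FEMALE"), length = 1000),
                         NUM = rnorm(1000))

# Rows per group, computed in memory with split and sapply.
group.rows <- sapply(split(BIG.GROUPS$NUM, BIG.GROUPS$GENDER), length)
group.sizes <- data.frame(GENDER = names(group.rows),
                          NROW = as.vector(group.rows))

# Assumption: an illustrative stand-in for the block.size option value.
block.size <- 10
fits.in.block <- max(group.rows) <= block.size
fits.in.block    # FALSE: each group has 500 rows, more than 10
```

When the check fails, the excerpt's alternatives apply: raise the block size, or split the data by group and process the pieces individually.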
162. [Figure 2.11: S-PLUS Workbench Objects view. A table listing each object in the working database (for example, coldata, colref, graphsheet, my.vbar, and P8.bd) with its class, storage mode, extent, size, and date.]
Outline View
The Outline View displays an outline of the elements in the script open in the script editor. In the S-PLUS Workbench, Outline View displays functions and objects in the order they appear in the script editor. Items that you have identified to watch in the Functions to watch text box of the Preferences dialog box appear in the Outline View with an arrow. You can jump to the definition of a function or object, or other structure element, by clicking it in Outline View.
The Outline View contains a menu bar that displays the following toggle buttons:
Table 2.3: Outline View buttons
• Click to hide all standard functions displayed in the Outline View. Click again to display standard functions.
• Click to hide all functions that you have designated to watch displayed in the Outline View.
• Click to hide all anonymous functions displayed in the Outline View.
• Click to hide all variables in the Outline View.
163. s used is controlled by two options:
• bd.options("block.size"): The option block.size specifies the maximum number of rows to be processed at a time when executing big data operations. The default value is 1e9; however, the actual number of rows processed is determined by this value, adjusted downwards to fit within the value specified by the option max.block.mb.
• bd.options("max.block.mb"): The option max.block.mb places a limit on the maximum size of the block, in megabytes. The default value is 10.
When S-PLUS reads a given bdFrame, it sets the block size initially to the value passed in block.size, and then adjusts downward until the block size is no greater than max.block.mb. Because the default for block.size is set so high, this effectively ensures that the size of the block is around the given number of megabytes. The resulting number of rows in a block depends on the types and numbers of columns in the data. Given the default max.block.mb of 10 megabytes, a bdFrame with a single numeric column could be read in blocks of 1,250,000 rows. A bdFrame with 200 numeric columns could be read in blocks of 6,250 rows. The column types also enter into the determination of the number of rows in a block.
Changing Block Size Options
There is rarely a reason to change bd.options("max.block.mb"); however, if you increase it, do not set i
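The block sizes quoted in excerpt 163 follow from simple arithmetic. The sketch below is plain S code, not part of the Big Data library: rows.per.block is a hypothetical helper, and it assumes 8 bytes per numeric value and 1 megabyte = 10^6 bytes, which is what the quoted figures of 1,250,000 and 6,250 rows imply. Other column types would change the bytes per row.

```r
# Hypothetical helper, not a Big Data library function.
# Assumptions: numeric columns only, 8 bytes per value, 1 MB = 1e6 bytes.
rows.per.block <- function(n.numeric.cols, max.block.mb = 10, block.size = 1e9)
{
    bytes.per.row <- 8 * n.numeric.cols
    rows.that.fit <- floor((max.block.mb * 1e6) / bytes.per.row)
    # block.size is an upper bound that is adjusted downward to fit the size limit.
    min(block.size, rows.that.fit)
}

rows.per.block(1)      # 1250000
rows.per.block(200)    # 6250
```

With the very large default block.size, the megabyte limit is always the binding constraint, which matches the excerpt's remark that blocks end up at roughly the given number of megabytes.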
164. se is associated with the Workspace rather than the S-PLUS Project. This design is different from the association you notice when you work in S-PLUS in its other environments. When you start the S-PLUS Workbench, you are prompted to create or identify the Workspace. See the section S-PLUS Workspace on page 16.
Table 2.1: Important terms and concepts (Continued)
• Project: A resource containing text files, scripts, and associated files. The S-PLUS Workbench project is used for build and version management, sharing, and resource management. Before you begin working with any files in the S-PLUS Workbench, you must create a project. You can: (1) create an empty new directory located in your specified Workspace directory, and then either create a new script or import an existing project directory (i.e., copy the files); or (2) select an existing directory containing project files at an alternate location (i.e., work with the files at the specified location). See the section Creating a Project on page 41.
• View: Integrated windows, containing their own menus and commands, that display specific parts of your data and projects and provide tools for data manipulation. Includes the Console View, History View, Objects View, Outline View, Output View, Problems View, Search Path View, and Tasks View. For practice exercises working with views, see the section S-PLUS Workbench
165. ser's Guide for more information on these user interfaces; there are no equivalent GUI functions available in this release. Using S-PLUS and the Big Data library, you can:
• Import a large data set from a text file or a database.
• Convert data frames to big data objects, and vice versa.
• Manage projects and code files in the S-PLUS Workbench.
• View data in the Data Viewer.
• Split or append a data set.
• Clean, sort, and filter rows and columns in a data set.
• Create plots using hexbin plotting.
• Fit models to large data sets.
• Export the large data set.
• Create your own functions that use large data sets.
ANALYZING LARGE DATA SETS
This section includes:
• The architecture of the Big Data library.
• A description of the options in the S-PLUS environment for working with large data sets.
• An outline of the tasks associated with importing, manipulating, modeling, and plotting large data sets, mapped to the procedures in the outline of this manual.
Out-of-Memory Data Storage
The S language was originally designed to store data objects in memory to provide the fastest data analysis possible. For example, when you create a data frame object as follows:
    mydata <- read.table("datafile.txt")
all of the data in the object mydata is manipulated in random access memory (RAM). If your com
166. ses
• Creating a Script
• Editing Code in the Script Editor
• Running Code
• Fixing Problems in the Code
• Closing the Project
• Commonly-Used Features in Eclipse
INTRODUCTION
S-PLUS provides a plug-in, or customization, of the Eclipse Integrated Development Environment (IDE) called the S-PLUS Workbench. You can use the S-PLUS Workbench and the basic Eclipse IDE features to manage your project files, provide source control for shared project files, edit your code, run S-PLUS commands, and troubleshoot problems with S-PLUS projects. The S-PLUS Workbench is a stand-alone application that runs the S-PLUS engine. When you run the S-PLUS Workbench, you do not need to run any other version of S-PLUS (for example, the console or traditional Windows or Java GUI).
Caution: If you run two or more simultaneous sessions of S-PLUS (including one or more in the S-PLUS Workbench), take care to use different working directories. Using the same working directory for multiple sessions can cause conflicts, and possibly even data corruption.
This chapter contains descriptions of the features and a task-centered tutorial for the S-PLUS implementation of Eclipse, the S-PLUS Workbench. Before you begin using the S-PLUS Workbench, you should understand key te
167. ships between variables that are readily described by straight lines, or their generalizations in multiple dimensions. If you are new to linear regression and generalized linear modeling, you might want to review their different uses:
• Use linear regression to predict a continuous response as a linear function of predictors, using a least-squares fitting criterion.
• Use generalized linear modeling to predict a general response as a linear combination of the predictors, using maximum likelihood.
For more information about model types, see Chapter 10, Regression and Smoothing for Continuous Response Data, in Guide to Statistics, Volume 1.
In S-PLUS, linear regression (lm) and generalized linear modeling (glm) share many function names. Table A.12 in the Appendix: Big Data Library Functions identifies these functions as implemented for either large data linear modeling (bdLm), large data generalized linear modeling (bdGlm), or both. Implemented functions are marked with a hash mark in the model type(s) column.
Building a Model
The Big Data library includes generalized linear models. Like the Big Data linear models, the Big Data generalized linear models are invoked through a call to the glm function when the data argument is a Big Data object (a bdFrame). The standard arguments to glm (formula, family, data, subset, weights, na.action) work with Big Data. The standard model methods (residuals, fitted, coef, print, summary, plot, anova
168. split.by.window: Divide a dataset into multiple data blocks, defined by a moving window over the dataset, and return a list of these data blocks.
• bd.unpack.object: Unpacks a bdPackedObject object that was previously stored in the cache using bd.pack.object.
Data Frame and Vector Functions
The following table lists the functions for both data frames (bdFrame) and vectors (bdVector). The cross-hatch (#) indicates that the function is implemented for the corresponding object type. The Comment column provides information about the function, or indicates which bdVector-derived class(es) the function applies to. For more information and usage examples, see the functions' individual help topics.
Table A.7: Functions implemented for bdVector and bdFrame
(Columns: Function Name, bdVector, bdFrame, Optional Comment. Comments are shown in brackets in source order.)
[operator rows illegible in source], abs, aggregate, all, all.equal, any, anyMissing, append, apply, Arith, as.bdCharacter, as.bdFactor, as.bdFrame, as.bdLogical [Handles all bdVector-derived object types] as.bdVector, attr
169. ssible and can be efficiently accessed. If you use the $ operator to refer to a column name that is not a syntactic name in S, you must surround it in quotes. For example:
    my.bdFrame$"Return percent"
• The print function works differently on a bdFrame object than it does for a data frame. It displays only the first few rows and columns of data instead of the entire data set. This design prevents accidentally generating thousands of pages of output when you display a bdFrame object at the command line.
Note: You can specify the numbers of rows and columns to print using the bd.options function. See bd.options in the S-PLUS Language Reference for more information.
• The summary function works differently on a bdFrame object than it does for a data frame. It calculates an abbreviated set of summary statistics for numeric columns. This design is for efficiency reasons: summary displays only statistics that are precalculated for each column in the big data object, making summary an extremely fast function, even when called on a very large data set.
Some data frame methods are not defined for bdFrame objects. To use these methods, you must convert your data to a regular data frame. To learn how to convert your data, see the section Converting Data on page 95 of Chapter 4, Exploring and Manipulating Large Data Sets.
Vectors
The S-PLUS Big Data library also introduces bdVector and six s
170. ste(names(P8.dataN.bd), "N", sep = "")
Alternatively, you can use the bd.modify.columns function:
    P8.dataN.bd <- bd.modify.columns(P8.dataN.bd, names(P8.dataN.bd), paste(names(P8.data.bd), "N", sep = ""))
You can use bd.modify.columns for more extensive column manipulation, such as changing column types and identifying columns to keep or drop, as well as changing column names. For more information about bd.modify.columns, see its help topic.
Display the resulting normalized data set:
    bd.data.viewer(P8.dataN.bd)
Note that the values are no longer integer counts, but fractions between 0 and 1.
To transform by average per bin
You can now directly compare the transformed data (P8.dataN.bd) across all 32,165 ZCTAs. We use clustering methods to seek geographic patterns of interesting groups of populations. Before proceeding to the clustering step, though, perform one further transformation of the data. We want a factor-of-2 change in population to be as significant in the 80-year bin (a very small bin) as it is in the 30-year bin (a very large bin). Just as you adjusted for differing populations across ZCTAs, now adjust for the differing numbers across age-gender groups.
1. Calculate the mean for each age-gender group (column):
    P8.dataN.mean <- colMeans(P8.dataN.bd)
2. Create a new series of columns by dividing by this national average value per group. The resulting object contains the national
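The two normalizations in the census example (divide each row by its total population, then divide each column by its national average) can be tried on a tiny in-memory data frame first. This is a minimal sketch in plain S with made-up counts and hypothetical column names; it uses standard functions only and makes no claim about how the Big Data library performs the same steps on a bdFrame.

```r
# Made-up counts for two areas and two age-gender groups (illustration only).
counts <- data.frame(M0.4 = c(700, 300), F0.4 = c(700, 100))
total  <- c(1400, 400)                 # total population of each area (row)

# Step 1: fraction of each area's population in each group.
fractions <- counts / total
names(fractions) <- paste(names(fractions), "N", sep = "")

# Step 2: scale each column by its average, so that small and large
# groups are compared on the same relative scale.
col.means <- colMeans(fractions)
scaled <- sweep(as.matrix(fractions), 2, col.means, "/")
colMeans(scaled)                       # each column now averages 1
```

After the second step a value of 2 means "twice the average share for this group", regardless of whether the group itself is large or small.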
171. stomize 45; views, changing display 46; virtual memory limitations 61; W: Workbench Project 13; Workbench project, creating 41; Workbench Script Editor 13; Workbench User Guide 13; Workbench View 13; Workspace 12; workspace 16, 18; changing 18
172. t("strsize,str\n", file=f)
    for(x in 1:30) {
        str <- paste(rep("abcd ", x), collapse="")
        cat(nchar(str), ",", str, "\n", sep="", append=T, file=f)
    }
Importing this file with the default scanLines value (256) detects that the maximum string has 150 characters, and sets this column string length correctly:
    dat <- importData(f, type="ASCII", stringsAsFactors=F, bigdata=T)
    dat
    ** bdFrame: 30 rows, 2 columns **
      strsize str
    1       5 abcd
    2      10 abcd abcd
    3      15 abcd abcd abcd
    4      20 abcd abcd abcd abcd
    5      25 abcd abcd abcd abcd abcd
    ... 25 more rows ...
    bd.string.column.width(dat)
    strsize str
         -1 150
In the above output, the strsize value of -1 represents the value for non-character columns. If you import this file with the scanLines argument set to scan only the first few lines, the column string width is set too low. In this case, the column string width is set to 45 characters, so longer strings are truncated and a warning is generated:
String Widths and bd.create.columns
    dat <- importData(f, type="ASCII", stringsAsFactors=F, bigdata=T, scanLines=10)
    Warning messages:
    ReadTextFileEngineNode (0): output column str has 21 string values truncated because they were longer than the column string width of 45 characters maximum string size before truncation was 150 characters in: bd.internal.exec.node(engine.class = engine.class
You can read this data correctly without sca
173. t Big Data library operations is proportional to the number of rows in the data set: if the number of rows doubles, then the processing time also doubles. The amount of RAM in a machine imposes a predetermined limit on the number of columns allowed in a big data object, because column information is stored in the data set's metadata. This limit is in the tens of thousands of columns. If you have a data set with a large number of columns, remember that the cost of some operations (especially statistical modeling functions) increases at a greater-than-linear rate as the number of columns increases. Doubling the number of columns can have a much greater effect than doubling the processing time. This is important to remember if processing time is an issue.
Note: When you import data, you have the option to set the flag stringsAsFactors to T or F (the default is T). S-PLUS imposes a limit of 500 levels for bdFactors.
When to Set bigdata=T
When you get ready to import data into an S-PLUS session, you might find yourself considering whether to use the Big Data library or the standard S-PLUS library. Using the standard S-PLUS library can give you faster processing times, because working in memory is more efficient than streaming the data when virtual memory is not in use. However, if your data is large, then when you try to import the data the process is very slow because of necessary swapping or, worse, you run the risk of trying to
174. t effectively. See Chapter 16, Using Less Time and Memory, in the Application Developer's Guide. By bringing together flexible programming and big data capability, the S-PLUS Enterprise Developer is a data analysis environment that provides both rapid prototyping of analytic applications and a scalable production engine capable of handling datasets hundreds of megabytes, or even gigabytes, in size. In the next section, we provide an overview of the Big Data library architecture, including data types, functions, and naming conventions.
THE BIG DATA LIBRARY ARCHITECTURE
Block-based Computations
The Big Data library is a separate library from the S-PLUS engine library. It is designed so that you can work with large data objects the same way you work with existing S-PLUS objects, such as data frames and vectors. The library uses terminology familiar to S-PLUS users and follows these conventions:
• Class names are mixed-case delimited and prepended with the designation bd, such as bdFrame.
• Function names are period delimited, except when they match an existing function, such as importData, or if they refer to a class, such as as.bdVector.
• Functions start with bd, such as bd.compare, unless the function uses the same syntax as functions available for non-big-data functions, such as the predict or summary function.
• Big Data library functions do not restrict the number of rows in the data B
175. t summary qqnorm residuals ca screeplot step summary
Predict from Small Data Models
This table lists the small data models that support the predict function. For more information and usage examples, see the functions' individual help topics.
Table A.13: Predicting from small data models
Small data models using the predict function: arima.mle, bs, censorReg, coxph, coxph.penal, discrim, factanal, gam, glm, gls, gnls, lme, lmList, lmRobMM, loess, loess.smooth, mlm, nlme, nls, ns, princomp, safe.predict.gam, smooth.spline, smooth.spline.fit, survreg, survReg, survReg.penal, tree
Time Date and Series Functions
The following tables include time date creation functions and functions for manipulating time and date, time span, time series, and signal series objects.
Time Date Creation
Table A.14: Time date creation functions
Function name    Description
bdTimeDate       The object constructor. Note that when you call the timeDate function with any big data arguments, a bdTimeDate object is created.
timeCalendar     Standard S
176. t to be larger than the physical memory on the computer. Likewise, there is little need to change bd.options("block.size"). One exception is if you are developing and debugging new code for processing big data. Consider developing code that calls bd.block.apply to process very large data in a series of chunks. To test whether this code works when the data is broken into multiple blocks, set block.size to a very small value, such as bd.options(block.size=10). By following this technique, you can test processing multiple blocks quickly with very small data sets.
Note that the block size determined by these options and the data is distinct from the blocks defined in the functions bd.by.group, bd.by.window, bd.split.by.group, and bd.split.by.window. These functions divide their input data into subsets to process, as determined by the values in certain columns or a moving window. S-PLUS imposes a limit on the size of the data that can be processed in each block by bd.by.group and bd.by.window: if the number of rows in a block is larger than the block size determined by bd.options("block.size") and bd.options("max.block.mb"), an error is displayed. This limitation does not apply to the functions bd.split.by.group and bd.split.by.window.
To demonstrate this restriction, consider the code below. The variable BIG.GROUPS contains a 1,000-row data frame with a column GENDER with factor values MALE and FEMALE, split evenly between the row
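The small-block testing technique described in this item can be sketched as follows. This is an illustration under stated assumptions, not code from the guide: it assumes the Big Data library is loaded, uses the standard fuel.frame sample data (60 rows) as a stand-in for real data, and uses bd.filter.rows with an arbitrary threshold as the operation under test. The object names small.bd and heavy are hypothetical.

```s
# Force multi-block processing even for a tiny data set
# (block.size=10 is the value suggested in the text above).
bd.options(block.size=10)

# fuel.frame has 60 rows, so as a bdFrame it now spans several blocks.
small.bd <- as.bdFrame(fuel.frame)

# Run the code under test. Here the operation is a simple row filter;
# the threshold 3000 is an arbitrary value chosen for illustration.
heavy <- bd.filter.rows(small.bd, "Weight > 3000")

# Printing a bdFrame shows only the first few rows and columns.
heavy
```

Because the setting persists for the session, restore block.size to its previous value once testing is finished.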
177. th argument. If you set it smaller than the maximum string generated, then this will generate a warning:
    bd.create.columns(as.bdFrame(fuel.frame), "Type+Type", "t2", "character", string.column.width=6)
    Warning in: bd.internal.exec.node(engine.class = engi
    CreateColumnsEngineNode (0): output column t2 has 53 string values truncated because they were longer than the column string width of 6 characters maximum string size before truncation was 14 characters
    ** bdFrame: 60 rows, 6 columns **
      Weight Disp. Mileage     Fuel  Type     t2
    1   2560    97      33 3.030303 Small SmallS
    2   2345   114      33 3.030303 Small SmallS
    3   1845    81      37 2.702703 Small SmallS
    4   2260    91      32 3.125000 Small SmallS
    5   2440   113      32 3.125000 Small SmallS
    ... 55 more rows ...
If the character column width is not set with the string.column.width argument, the value is estimated differently depending on whether the call.splus argument is true or false. If row.language=T, the expression is analyzed to determine the maximum length string that could possibly be generated. This estimate is not perfect, but it works well enough most of the time. If row.language=F, the first time that the S-PLUS expression is evaluated, the string widths are measured and the new column's string width is set from this value. If future evaluations produce longer strings, they are truncated and a warning is generated. Whether row.language is T or F, the estimated string widths will never be less than the value
178. th techniques.
To Create a New Script File
1. Click File > New > Other.
2. In the New dialog box, expand the Simple node and select File. Click Next.
3. In the New File dialog box, select the parent directory (the Stock Project directory).
4. In the File name text box, type Sample.ssc.
5. Click Finish to create the file.
We won't work with this file for this exercise, so you can either disregard the file or delete it from your project. Alternatively, you can open the file, add some S-PLUS code, and save it in the project.
The Navigator View displays the project files. In Windows, if you have Microsoft Excel installed, you can open a CSV file in an external window. In this project, only the files identified in Window > Preferences, in the File Extensions page, open in the Script editor. Because the project script imports the data in the files from their installation directory in S-PLUS, you don't need to have them all in the project. However, removing an imported file deletes it from your project directory, so remove individual files with care.
To Remove a File
1. In the Navigator View, select all files except the census demo .ssc script file.
2. Right-click the selected files, and then click Delete.
3. In the Confirm Resource Delete dialog box, click OK to remove the files from the project.
The Navigator View should now just display the Census Project directory, the pro
179. that has a predict method. The Big Data linear model, generalized linear modeling, and principal components functions are implemented using the same standard S-PLUS modeling functions: lm, glm, and princomp, respectively. If the data argument to any of these functions is a big data object (a bdFrame), then S-PLUS uses the big data algorithms. Using this design, you can switch easily between working with standard and big data sets. These big data modeling functions create objects of a new class, for example bdLm. Most of the standard S-PLUS methods used with modeling functions (for example print, summary, plot, predict, fitted, and residuals) work on this new class of objects.
BUILDING A MODEL
This section provides:
• An overview of linear regression, generalized linear modeling, and principal components, specifically the S-PLUS functions as they apply to large data sets.
• A list of the functions provided in the S-PLUS Big Data library for modeling.
• Exercises so you can practice modeling sample data sets.
Linear Regression and Generalized Linear Modeling
In linear regression, you model the response variable as a linear function of a set of predictor variables. Examples of response variables include sales figures and bank balances. This type of model is one of the most fundamental in nearly all applications of statistics. It has an intuitive appeal, in that it explores relation
180. the Stock Sample Script; Working with Time Series Data; Summary
Chapter 4 Exploring and Manipulating Large Data Sets
INTRODUCTION
This chapter includes information on the following topics for working with the S-PLUS Big Data library:
• Working from the command line.
• S-PLUS GUI support in Microsoft Windows (dialog boxes and data viewer).
• Manipulating data, demonstrated using census and stock examples.
• Creating graphs for large data sets.
WORKING IN THE S-PLUS ENVIRONMENT
Command line functions
Dialog box support
Import Data dialog box
When you use the Big Data library, you must perform all operations in the Commands window, except for importing and exporting data in the Windows environment. The Import Data, Select Data, and Export Data dialog boxes accommodate big data objects. Start the Commands window, and then type expressions and call big data functions at the command prompt. Remember that S-PLUS is case sensitive, and while many functions in the Big Data library are similar to standard S-PLUS functions, their case designation might be slightly different. For more information on the naming conventions in the Big Data library, see the section The Big Data Library Architecture on page 69 of Chapter 3, The Big Data Library, and the Appendix, Big Data Library Functions. For more infor
181. the data to predict for. The predict functions include the following:
Table 6.4: Big Data library predict functions
Function            Predicts for this model object
predict.bs          Basis matrix for polynomial splines
predict.censorReg   Regression model for censored data
predict.discrim     Normal (Gaussian) linear or quadratic discriminant function
predict.factanal    Factor analysis model (factanal object)
predict.gam         Generalized additive model
predict.gls         Generalized least squares model
predict.gnls        Nonlinear model using generalized least squares
predict.lm          Linear model
predict.lme         Linear mixed-effects models
predict.lmList      List of linear model objects
predict.lmRobMM     Robust fit of a linear regression model, as estimated by the lmRobMM function
predict.loess       Local regression model
predict.mlm         Multiple response linear least squares model
predict.nlme        Nonlinear mixed-effects model
predict.nls         Nonlinear regression model via least squares
predict.ns          Basis matrix for natural splines
predict.princomp    Principal components
predict.survreg     Parametric survival regression model
predict.survReg     Survival model using parametric regression
Predicting on Big Data from Small Data Models
Many of the modern modeling methods in S-PLUS do not
182. the function bd.pack.object, you can store each model in an external cache and create a list of the smaller packed models. You can then use bd.unpack.object to restore the models to manipulate them.
Creating a Packed Object with bd.pack.object
In the following example, use the data object fuel.frame to create 1000 linear models. The resulting object takes about 6MB. In the Commands window, type the following:
    # Create the linear models:
    many.models <- lapply(1:1000, function(x) lm(Fuel ~ Weight + Disp., sample(fuel.frame, size=30)))
    # Get the size of the object:
    object.size(many.models)
    [1] 6210981
You can make a smaller object by packing each model. While this exercise takes longer, the resulting object is smaller than 2MB. In the Commands window, type the following:
    # Create the packed linear models:
    many.models.packed <- lapply(1:1000, function(x) bd.pack.object(lm(Fuel ~ Weight + Disp., sample(fuel.frame, size=30))))
    # Get the size of the packed object:
    object.size(many.models.packed)
    [1] 1880041
Restoring a Packed Object with bd.unpack.object
Remember, if you use bd.pack.object, you must unpack the object to use it again. The following example code unpacks some of the models within the many.models.packed object and displays them in a plot. In the Commands window, type the following:
    for(x in 1:5) plot(bd.unpack.ob
183. the next step in the algorithm, the observations are added to the new chunk of data, and the K-means clustering is run on this combined set.
start: The method for selecting starting values for centers.
• Specify firstSample to use a random sample of K rows from the first block of data as the initial centers.
• Specify kPoints to use the first unique K rows of data as the initial centers.
• Specify hClustFirstBlock to compute the initial centers from the first block of the dataset using the hierarchical clustering method.
• Specify entireSample to compute the initial centers from a sample of the entire dataset using the hierarchical clustering method.
Census Clustering Example
In this section, practice performing clustering on the census data example that you filtered and graphed in the previous chapters.
Note: This exercise picks up from the manipulated Census data set from the end of Chapter 4. If you are starting this example at this point, without having worked through the previous chapter's exercises, you can load and run the previous exercise steps of the example script from the S-PLUS sample directory, by default installed at your installation directory in samples/bigdata/census.
To perform K-means cluster analysis
1. Set the number of clusters to solve for. In this case, we set the cluster number to 40. When you model your own large data set, you can set it
184. tions attempt to estimate this width, but there are situations where this estimated value is incorrect. In these cases, it is possible to explicitly specify the column string width. To retrieve the actual column string widths used in a particular bdFrame, call the function bd.string.column.width. Unless the column string width is explicitly specified in other ways, the default string width for newly created columns is set with the following option (the default value is 32):
    bd.options("string.column.width")
When you convert a data frame with a character column to a bdFrame, the maximum string width in the column data is used to set the bdFrame column string width, so there is no possibility of string truncation. When you import a big data object using importData for file types other than ASCII text, S-PLUS determines the maximum number of characters in each string column and uses this value to set the bdFrame column string width.
When you import ASCII text files, S-PLUS measures the maximum number of characters in each column while scanning the file to determine the column types. The number of lines scanned is controlled by the argument scanLines. If this is too small, and the scan stops before some very long strings, it is possible for the estimated column width to be too low. For example, the following code generates a file with steadily longer strings:
    f <- tempfile()
    ca
185. to 1000 columns c 1 3
Using bd.filter.rows is equivalent to subscripting rows with a logical vector. By default, bd.filter.rows uses an expression language that provides quick evaluation of row-oriented expressions. Alternatively, you can use the full range of S-PLUS row functions by setting the bd.filter.rows argument row.language=F, but the computation is less efficient. Some standard subscripting and bd.filter.rows equivalents include the following:
Table G.3: bd.filter.rows efficiency equivalents
Standard S-PLUS subscripting     bd.filter.rows equivalent
x[x$Weight > 100, ]              bd.filter.rows(x, "Weight > 100")
x[pnorm(x$stat) > 0.5, ]         bd.filter.rows(x, "pnorm(stat) > 0.5", row.language=F)
bd.create.columns
Like bd.filter.rows, bd.create.columns offers you a choice of using the more efficient expression language or the more flexible general S-PLUS functions. Some standard subscripting and bd.create.columns equivalents include the following:
Table G.4: bd.create.columns efficiency equivalents
Standard S-PLUS subscripting     bd.create.columns equivalent
x$d <- (x$a + x$b)/x$c           x <- bd.create.columns(x, "(a+b)/c", "d")
x$pval <- pnorm(x$stat)          x <- bd.create.columns(x, "pnorm(stat)", "pval", row.language=F)
y <- (x$a + x$b)/x$c             y <- bd
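The equivalences in Tables G.3 and G.4 can be tried on the fuel.frame sample data. The following is a sketch rather than code from the guide: it assumes the Big Data library is loaded, and the threshold 3000, the expression "Weight/Mileage", and the names x, heavy, heavy2, and wt.per.mpg are hypothetical choices made for illustration. Only calls and arguments that appear in this document are used.

```s
# Convert the in-memory sample data to a bdFrame.
x <- as.bdFrame(fuel.frame)

# Row filter using the default expression language (row.language=T).
heavy <- bd.filter.rows(x, "Weight > 3000")

# The same filter evaluated as a general S-PLUS expression:
# more flexible, but less efficient.
heavy2 <- bd.filter.rows(x, "Weight > 3000", row.language=F)

# Add a derived numeric column; arguments are data, expression, new name.
x <- bd.create.columns(x, "Weight/Mileage", "wt.per.mpg")
```

Both filter calls should select the same rows; the difference lies in how the expression is evaluated. The expression language is the better default, with row.language=F reserved for functions it does not cover.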
186. ubclasses, which represent new vector types to support very long vectors. Like a bdFrame object, the big vector object stores data out of memory as a cache file on disk, so you can create very long big vector objects without needing a lot of RAM. You can extract an individual column from a bdFrame object (using the $ operator) to create a large vector object. Alternatively, you can generate a large vector using the functions listed in Table A.3 in the Appendix. Like bdFrame objects, the actual data is stored out of memory as a cache file on disk, so you can create very long big vector objects without worrying about fitting them into RAM. The Big Data library vector data types are listed in Table 3.4, along with their corresponding S-PLUS types.
Table 3.4: bdVector data types
Big Data library vector data types    Analogous classes in S-PLUS
bdCharacter                           character
bdNumeric                             double
bdFactor                              factor
bdLogical                             logical
bdTimeDate                            timeDate
bdTimeSpan                            timeSpan
You can use standard vector operations, such as selections and mathematical operations, on these data types. For example, you can create new columns in your data set as follows:
    census.data$adjusted.income <- log(census.data$income - census.data$tax)
Models
The S-PLUS Enterprise Developer Big Data library introduces scalable modeling algorithms to process big data objects using out-of-memory techniques. Wit
187. uld have encountered an error similar to the following:
    Problem in read.table: Unable to obtain requested dynamic memory
This error occurs because S-PLUS requires the operating system to provide a block of memory large enough to contain the contents of the data file, and the operating system responds that not enough memory is available. While S-PLUS can access data contained in virtual memory, the maximum size of data files depends on the amount of virtual memory available to S-PLUS, which depends in turn on the user's hardware and operating system. In typical environments, virtual memory limits your data file size; when that limit is exceeded, S-PLUS returns an out-of-memory error. Finally, you can also encounter an out-of-memory error after successfully reading in a large data object, because many S functions require one or more temporary copies of the source data in RAM for certain manipulation or analysis functions.
S programmers with large data sets have historically dealt with memory limitations in a variety of ways. Some opted to use other applications, and some divided their data into digestible batches and then recompiled the results. For S programmers who like the flexibility and elegant syntax of the S language, and the support provided to owners of an S-PLUS license, the option to analyze and model large data sets in S has been a long-awaited enhancement. The Big Data library, available in S-PLUS Enterprise Developer, provides this enhancement
188. utorial, so right-click the directory and then select Delete. In the Confirm Delete Project dialog box, select Do not delete contents. (Otherwise, you will delete the sample from your installation directory.) Click Yes to remove the project.
Customizing the S-PLUS Workbench
S-PLUS provides customizations to the Eclipse IDE to accommodate the specific needs of the S-PLUS programmer.
To Set the Example Preferences
1. On the Window menu, click Preferences.
2. In the Preferences dialog box, expand the Workbench node and examine the dialog box pages.
3. Click File Associations and review the file types that the Script Editor recognizes.
4. Click the S-PLUS node.
5. Click New.
6. In the Add New Function to Watch dialog box, add set.seed. Click OK.
7. Review the list in the Functions to Watch dialog box. Note that set.seed has been added to the list.
8. Click Task Tags.
9. Highlight the items to change in the S-PLUS Task Options text box or, using the New, Remove, Up, and Down buttons, edit the available tasks.
10. Click OK or Apply to save your changes, or click Restore Defaults to return the task options to their default state.
11. Click OK to save your changes.
Default Perspective and Views
The default layout of the S-PLUS Workbench presents the Navigator View, Outline View, and History View on the left side of the window. The Console View, Objects View, Output View, Tasks View, and Problems View are
189. ve. The default perspective preferences include project type, window appearance, editor preferences, menu options, and file associations. You can change these preferences, and any other default Eclipse preferences, in the Preferences dialog box, which is available from the Window menu: click Window > Preferences. For more information about setting preferences, see the Eclipse Workbench User Guide. The S-PLUS Workbench sets default preferences in the following areas:
• File Associations: S-PLUS recognized file types include .q, .ssc, and .t. Any of these files, which are associated with the S-PLUS Script editor, are checked for syntax errors and scanned for task tags. S-PLUS also recognizes plain text, or .txt, files.
Figure 2.3: S-PLUS File Associations dialog box
• S-PLUS Console Options: These options control settings for the S-PLUS Workbench Console View.
• Background Color: By default, the S-PLUS Console View uses the
190. w a file that is not part of your project. Use the File > Open External File menu item to open a file that is not part of your project.
THE BIG DATA LIBRARY
Contents: Introduction; Working with a Large Data Set; Finding a Solution; No 64-Bit Solution; Size Considerations; When to Set bigdata=T; Summary; The Big Data Library Architecture; Block-based Computations; Data Types; Classes; Functions; Summary
Chapter 3 The Big Data Library
INTRODUCTION
In this chapter, we discuss the history of the S language and large data sets, and describe improvements that the Big Data library presents. This chapter discusses data set size considerations, including when to use the Big Data library. The chapter also describes in further detail the Big Data library architecture: its data objects, classes, functions, and advanced operations.
WORKING WITH A LARGE DATA SET
Finding a Solution
Out-of-Memory Processing
When it was first developed, the S programming language was designed to hold and manipulate data in memory. Historically, this design made sense: it provided faster and more efficient calculations and modeling by not requiring the user's program to access information stored on the hard drive. Data size has outstripped the rate at which RAM size increased; consequently, S program users co
191. work on big data objects in version 7. Often, the algorithms for these models require all the data to be in memory at once. An approach to using these in-memory models is to sample from the large data set and fit the model to the in-memory sample. The fitted model can then be used to predict all observations, since the predict methods work on big data objects. In this exercise, we sample from the boston housing data (even though it is small), fit a tree model to predict the median housing value, and then use that model to predict median housing values for all observations in the data set. While the boston data does not require out-of-memory model fitting, for the purpose of this example we set the max block size to 100 to process the data in blocks.
Fitting the model
To fit the model
1. If you have not done so, create the boston.housing bdFrame by importing the data from the samples directory:
    boston.housing <- importData(paste(getenv("SHOME"), "/samples/bigdata/boston/bostonhousing.txt", sep=""), stringsAsFactors=F, bigdata=T)
2. Set the max block size to 100 to process the data in blocks:
    bd.options(max.block.size=100)
3. Create a random sample of size 200 from the big data object and convert it to a data frame:
    boston.housing.sample <- bd.coerce(bd.sample(boston.housing, n=200))
4. Fit a tree model to predict median housing value using al
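The sample, fit, and predict workflow described in this item can be sketched end to end as follows. This is a hedged illustration, not the guide's own code for step 4, which is cut off above. It assumes the boston.housing bdFrame from step 1 exists, and it assumes a response column named MEDV; that column name does not appear in this document and is hypothetical, as are the object names housing.tree and housing.pred.

```s
# Draw a 200-row random sample and convert it to an in-memory data frame
# (same call as step 3 above).
boston.housing.sample <- bd.coerce(bd.sample(boston.housing, n=200))

# Fit an in-memory tree model on the sample.
# MEDV is an assumed name for the median housing value column.
housing.tree <- tree(MEDV ~ ., data=boston.housing.sample)

# Predict for every observation in the full big data object;
# predict methods for small data models accept bdFrame input.
housing.pred <- predict(housing.tree, boston.housing)
```

Only the fitting step needs the data in memory, so the sample size can be tuned to available RAM while prediction still covers the full data set.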
