Data Modeling with Regress+
Contents
1. Each bootstrap sample is treated as though it were the original data. The output is a matrix of parameter vectors, with one row for each bootstrap sample. When a column of this matrix, representing one parameter, is sorted from low to high, it yields an empirical distribution for that parameter, given the model and sample size. At this point, the simplest procedure is to determine the central confidence interval by using the confidence limits that define the requisite tails of this distribution. When appropriate, and with a lot of additional effort, more accurate confidence limits can be determined from the same empirical distribution by correcting for bias and skewness [4]. Goodness of fit is tested in this fashion as well. Usually, there are at least 1,000 rows (samples). Note that, in a Bayesian context, which we shall not discuss, there are other ways to estimate parameter precision; see, for instance, this online calculator.

7.2 An Example

A good illustration of the varying precision of ML parameters can be seen in the salaries example we considered earlier (Fig. 4.10). The model is a weighted mixture of two Normal distributions, as follows:

PDF = p N(μ₁, σ₁) + (1 − p) N(μ₂, σ₂)    (7.1)

The ML values for the five parameters are listed in Table 7.1.

Table 7.1: ML Parameters for Salaries Data and Model
μ₁ = 370.201    σ₁ = 53.5986    μ₂ = 481.630    σ₂ = 92.5531    p = 0.549763

The ML values shown
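The mixture in Equation 7.1 is easy to evaluate directly from the tabulated ML values. The sketch below is a minimal illustration in Python (NumPy/SciPy assumed; the book itself uses none of these) of the model form only, not of how Regress+ computes it.

    import numpy as np
    from scipy.stats import norm

    # ML parameters from Table 7.1 (salaries, in hundreds of dollars)
    mu1, sig1 = 370.201, 53.5986
    mu2, sig2 = 481.630, 92.5531
    p = 0.549763

    def mixture_pdf(x):
        """Weighted mixture of two Normal components (Equation 7.1)."""
        return p * norm.pdf(x, mu1, sig1) + (1.0 - p) * norm.pdf(x, mu2, sig2)

    x = np.linspace(200.0, 900.0, 8)
    print(np.column_stack([x, mixture_pdf(x)]))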
2. CHAPTER 4 MODELS IN THE REAL WORLD 28 This last step was the really hard part The protocol required completing all three titrations before doing any computations which meant that you had no idea what sort of answer to expect and therefore no bias when doing the next titration If the three answers did not match to several significant figures then the day was wasted and you had to do it all over again As I said I really needed those half drops No doubt even this brief synopsis of the experiment sounds a bit long winded and so it should Molecules even large protein molecules are much too small to see and when human senses fail we need something to take their place In our Fisheries Laboratory that something was chemical theory This theory enabled me to follow the chain of connections linking a color indicating the endpoint of a titration all the way back to the percentage of protein in a piece of fish There are a great many links in this chain and none of them are visible they exist only in our imagination Most things in Nature lie far beyond the senses of human beings so in order to examine and or test them we need something that we can sense or at least manipulate We need a model 4 1 Models A model is a symbolic description of some real world behavior that is observable directly or indirectly The symbols used can be mathematical or just ordinary words of a spoken language In this document we shall consider mathematical m
3. ... error. Alas, in real life, there is no back of the book in which to look up the correct answer. All you have is your own experience and expertise and, sometimes, a little software assistance.

3.1 An Experiment

Take two protons and one electron and bring them together in a space small enough that each can tell that the other particles are present. The protons will repel each other, since their electric charges have the same sign. Each of the protons will be attracted to the electron, and vice versa, since opposites attract. Question: Will the combination of all three stay together, or just wander off? In other words, do they form a stable molecule?

This can be more than just a thought experiment. Schrödinger's Equation describes the basic laws of quantum mechanics for this system (H₂⁺), at least to a non-relativistic approximation, which is quite accurate in this case. This equation is usually difficult to solve, but this particular example is relatively easy since it has so few elements. The whole computation can be done ab initio, from first principles, using a sophisticated numerical technique called Diffusion Monte Carlo (DMC). Here, the computation was carried out ten times. The results are plotted in Figure 3.1.

Figure 3.1: Ab initio Results for H₂⁺ (10 replicates) — total energy vs. H–H distance (Å)

This plot gives the total energy of the system in electron volts
4. Figure 6.3: Salaries Probability Plot for Gaussian Model (Unacceptable) — percentile vs. academic salaries (hundreds of dollars)

6.2 Deterministic Models

Maximum-likelihood deterministic models are almost always estimated using the least-squares procedure. To assess goodness of fit, the minimum SSE is compared to TSS, the total sum of squares:

TSS = Σ_{k=1}^{N} (y_k − ȳ)²    (6.2)

TSS quantifies the total variation of the data from its average. If the model explains all of this variation, then there will be nothing left to explain and SSE will equal zero. An intuitive goodness-of-fit metric is then the fraction of TSS explained, termed R²:

R² = 1 − SSE/TSS    (6.3)

Consequently, a good deterministic model will have R² close to one. Typically, one would like to see R² values of 0.99 or better. (Note that TSS/N equals the variance of y.)

There is one note of caution here. Even when R² is close to one, there is still the possibility of systematic error in the model. After all, one assumes that, taken collectively, the model residuals, y_k − g(x_k), are N(0, σ), i.e., that they are random errors. If they are random, then they should be scattered about the model curve randomly. In particular, the signs of the residuals should be random. There should not be any obvious pattern to consecutive runs of positive and negative residuals. This is something that can and should be tested separately, e.g., using t
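Both quantities are straightforward to compute. The following sketch (Python/NumPy, assumed purely for illustration; the function names are inventions) evaluates R² from Equations 6.2–6.3 and counts sign runs in the residuals, the quantity one would inspect for the systematic-error check just described.

    import numpy as np

    def r_squared(y, y_pred):
        """R^2 = 1 - SSE/TSS (Equation 6.3)."""
        sse = np.sum((y - y_pred) ** 2)
        tss = np.sum((y - np.mean(y)) ** 2)
        return 1.0 - sse / tss

    def sign_runs(y, y_pred):
        """Count runs of consecutive residuals having the same sign."""
        signs = np.sign(y - y_pred)
        signs = signs[signs != 0]                      # ignore exact zeros
        return 1 + int(np.sum(signs[1:] != signs[:-1]))

A formal runs test would compare the observed run count with its expectation under randomness; too few runs is the classic signature of systematic error.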
5. Were the PDF a continuous function, this summation (a weighted average) would become a definite integral. Either way, the PDF makes it easy to compare theoretical predictions with observed data. As we shall see, it has many other uses as well.

Graphing Relationships

When the data are not random variates but value pairs describing a relationship, other kinds of graph are possible. These pairs are traditionally termed x (the independent variable) and y (the dependent variable), and the relationship is characterized by saying that y is a function of x. That is, the value of x somehow determines the value of y. Symbolically, we write this as follows:

y = f(x)    (2.7)

Here, f denotes some deterministic, but unspecified, relationship between x and y. (In this dataset, we have at most two significant figures.) Functional relationships can be plotted in several different ways, depending upon the coordinates used to label the axes. Most familiar are Cartesian coordinates (x, y). For instance, the data in Table 1.4 may be plotted as in Figure 2.6.

Figure 2.6: Daytime vs. Day — daytime (min) vs. day number

In this plot, the axes have been scaled to fit the data. For other purposes, different scales might be more appropriate, e.g., when comparing this dataset to another. Whatever kind of graph is drawn to visualize a dataset, its primary purpose is to exhibit a comparison between the data and some explanation
7. How Precise are the Model Parameters?

In traditional frequentist statistics, maximum-likelihood parameters are almost always considered optimal, and in Chapter 5 we discussed ways in which such parameters can be estimated from the data and model. However, a given dataset is just a sample of data, and different samples will give different ML parameters; this variation should be taken into account. We finish our discussion of modeling by considering the precision of our ML parameters.

The usual way to describe this precision is to provide a confidence interval for each of the model parameters. In a frequentist context, this interval is interpreted to mean the continuous interval within which there is a specified probability, P, of finding the true value. Usually, P = 95 percent, implying that we are 95 percent confident that the true value lies inside the interval. In a central confidence interval, the remaining 1 − P probability (the probability of being wrong) is split equally between the two tails outside the interval. One good way to estimate a confidence interval is with a bootstrap analysis.

7.1 Bootstrap Analysis

In a bootstrap analysis, a large number of random bootstrap samples are synthesized. In a parametric bootstrap, they are created using a model; in a non-parametric bootstrap, they are created from the data, via selection with replacement. In both cases, the bootstrap sample is the same size as the original data.
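For readers who want to try this outside Regress+, a minimal non-parametric bootstrap of the kind just described might look like the following (Python/NumPy; the bootstrap_ci name and the generic fit argument are hypothetical). It resamples with replacement, refits each bootstrap sample, and reads central confidence limits from the sorted empirical distribution.

    import numpy as np

    def bootstrap_ci(data, fit, n_boot=1000, conf=0.95, rng=None):
        """Central (percentile) confidence intervals from a non-parametric bootstrap.
        `fit` maps a sample to a 1-D array of parameter estimates."""
        rng = np.random.default_rng() if rng is None else rng
        n = len(data)
        boot = np.empty((n_boot, len(fit(data))))
        for i in range(n_boot):
            resample = rng.choice(data, size=n, replace=True)   # same size as the original data
            boot[i] = fit(resample)
        alpha = (1.0 - conf) / 2.0                              # equal probability in each tail
        return np.quantile(boot, [alpha, 1.0 - alpha], axis=0)

    # Example: 95% CI for the mean of an exponential sample
    sample = np.random.default_rng(1).exponential(scale=4.5, size=500)
    print(bootstrap_ci(sample, lambda s: np.array([np.mean(s)])))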
8. While this is the purpose of Regress there are hidden dangers The first is that one can use this capability unthinkingly As an analyst you should be familiar with your data and the likely form that a valid model should take However when sample sizes are small the statistical power of all tests decreases and in a given instance you might find any number of acceptable models solely because there is so little informa tion available in your dataset that even coarse distinctions are not feasible Alternatively it may be that no model form suggests itself a priori A second related danger arises from the plethora of models available in Regress As noted earlier it is tempting to keep trying one after the other until a good fit is achieved or to flip from one optimization to another for the same reason or to make the even more elementary error of disregarding the number of parameters Regress does not perform model comparison explicitly That is it does not enable you to determine the goodness of fit of different models with possibly different numbers of parameters and return a metric telling you whether one model is significantly better than another This you must do for yourself Lastly there is the all too common error of ignoring the context of the task and con fusing what is significant with what is meaningful If for example you have a dataset and histogram of 1 000 variates which given their origin should be Gaussian and look l
9. above were estimated to six figures, but these figures are not all significant. If we perform a non-parametric bootstrap analysis with 1,000 bootstrap samples, and carry out the corrections referenced above, we find that the 95 percent central confidence limits on these parameters are those shown below.

Table 7.2: Salaries — 95% Confidence Limits for ML Parameters

Parameter   Lower Limit   Upper Limit
μ₁          362.546       383.975
σ₁          45.3251       62.0710
μ₂          457.440       542.227
σ₂          67.9438       100.934
p           0.407607      0.767566

These intervals are quite wide, in contrast to the precision implied by the values in Table 7.1, even though this was a fairly large dataset (N = 1,161). In particular, the weight parameter, p, is especially uncertain, given that it ranges only over (0, 1). (Weight parameters for binary mixtures tend to be estimated rather loosely.) Estimating confidence limits requires much more effort than finding ML parameters but, unless this part of the modeling is carried out, the parameter values reported will be deceptive with regard to their precision.

Chapter 8: Summary

...and should be modeled. All of the topics discussed deserve considerably more attention, and some, such as multivariate analysis, have not been discussed at all. Nevertheless, the basic concepts have been covered, and these will suffice for a very large fraction of simple data-modeling tasks. The hyperlinks provided all contain references leading to further material that might be
11. by drawing N random variates from the optimum modeled distribution. These are processed in exactly the same way as was the original sample, including the computation of the goodness-of-fit metric. When a large number of the latter are sorted from low to high, the sorted vector comprises an empirical sampling distribution for this metric, given all the other conditions in the problem. As noted earlier, the one-sided percentile of the value of the original fit metric is computed directly from this empirical distribution. Any percentile less than 90 is considered acceptable.

B.3.2 Non-parametric Bootstrap

A non-parametric bootstrap sample of size N is synthesized by drawing N values from the original dataset, with replacement. With stochastic models, this involves selecting the datapoints directly. With deterministic models, the procedure is based on the error model of the residuals. Thus, each draw is made from the original vector of residuals, with replacement, and these N random residuals are added to the Y values of the original dataset. If the regression is weighted, then each weighted residual is first unweighted according to the uncertainty in the datapoint from which it came. The resulting normalized residual is re-weighted when added to a Y value, using the uncertainty of the latter. In Regress+, the empirical distributions of parameter values resulting from a non-parametric bootstrap are not used directly because it is known that
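A bare-bones version of the parametric-bootstrap test described in the first paragraph is sketched below (Python with NumPy/SciPy; this is not the Regress+ implementation). An exponential model and the K-S statistic are assumed here simply to make the example concrete.

    import numpy as np
    from scipy.stats import expon, kstest

    def gof_percentile(data, n_boot=1000, rng=None):
        """One-sided percentile of the observed K-S statistic, under a
        parametric bootstrap from the fitted exponential model."""
        rng = np.random.default_rng() if rng is None else rng
        lam = np.mean(data)                                 # ML estimate of the exponential scale
        d_obs = kstest(data, expon(scale=lam).cdf).statistic
        d_boot = np.empty(n_boot)
        for i in range(n_boot):
            synth = rng.exponential(scale=lam, size=len(data))   # synthesize from the optimum model
            lam_b = np.mean(synth)                               # refit each synthetic sample
            d_boot[i] = kstest(synth, expon(scale=lam_b).cdf).statistic
        return 100.0 * np.mean(d_boot <= d_obs)             # a percentile below 90 is acceptable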
12. CDF(x) = 1 − exp(−x/λ)    (4.7)

For our model, the CDF is shown in Figure 4.4.

Figure 4.4: Exponential CDF with λ = Empirical Mean — CDF vs. C-14 decay interval (s)

Using Equation 4.7, CDF(10) = 0.8920 and CDF(5) = 0.6714, the difference of which gives the answer shown in Equation 4.6. (Compare this to the small-sample empirical value in Equation 2.6.) Looking further, the maximum value in this dataset is 42.864 s. The probability that a random decay interval is greater than this value is 1 − CDF(42.864) = 0.00007. (Obviously, a CDF value cannot be greater than one.) Such a region under the PDF is called an upper tail; a tail on the left is, of course, called a lower tail. The probability that a sample of N = 10,000 would have such an extreme value as this is 1 − (1 − 0.00007)^10,000 ≈ 0.503. That is roughly a 50-50 chance, implying that we should not be surprised to see such a big value in such a big sample. However, we would not expect to see such a big value in the small sample shown in Table 1.1. An unexpected extreme value, large or small, is called an outlier and suggests that either the datapoint or the model might be invalid. However, identifying outliers reliably is a difficult task.

Quantiles

For example, using Equation 4.7, it is very easy to show that the median of an exponential distribution is given
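These tail calculations are easy to reproduce. The sketch below (Python/NumPy) takes λ ≈ 4.49 s — a value inferred from the CDF figures quoted above, not one stated explicitly in the text — and repeats the arithmetic.

    import numpy as np

    lam = 4.49       # mean decay interval (seconds), inferred from the quoted CDF values
    cdf = lambda x: 1.0 - np.exp(-x / lam)

    print(cdf(10.0) - cdf(5.0))                  # P(5 s < interval < 10 s), cf. Equation 4.6
    p_tail = 1.0 - cdf(42.864)                   # upper-tail probability beyond the sample maximum
    print(p_tail)                                # about 0.00007
    print(1.0 - (1.0 - p_tail) ** 10_000)        # chance of at least one such value when N = 10,000
    print(lam * np.log(2.0))                     # median of an exponential distribution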
13. experiment is currently staffed by 2 900 physicists from dozens of coun tries all working to test Nature at a smaller scale than ever before all intently focused on very tiny things To be successful in any endeavor of this magnitude only the very best data will suffice These data are the product of considerable effort They are in other words precious in a literal sense The same is true of many such efforts not only in science but in any analysis that is CHAPTER 1 SOMETHING ABOUT IT 10 Figure 1 3 ATLAS Detector Magnets genuinely important If you really want to say something about it then your data have to CHAPTER 1 SOMETHING ABOUT IT 11 be the very best 1 2 An Imperfect World When one looks at Table 1 5 the level of accuracy achievable with modern instrumentation is bound to be very impressive Nevertheless even the best instrumentation and the best experiments are not perfect Hence the data they output are likewise not perfect There will always be some uncertainty associated with them Understanding and quantifying uncertainty then handling it properly is the underlying theme of this text Chapter 2 Data Summaries Statistics and Graphs ahead and it is time to put away all personal belongings and fasten seat belts The runway is about 150 feet wide and as we left Washington D C it was so far away that it subtended only ten seconds of arc somewhere over the horizon Nevertheless w
14. from Table 1.1. From this dataset we can construct the frequency histogram shown in Figure 2.3.

Figure 2.3: Carbon-14 Decay Intervals — Frequency (N = 50)

Half of the data fall in the first bin. Therefore, using the definition above, we could say (predict) that, based on this single sample, the probability that a similar decay interval would be less than three seconds = 1/2. We could go on to make analogous predictions for the remaining bins. Obviously, such predictions (probabilities) are only approximate. After all, if we wait long enough, we are virtually certain to get a decay interval greater than 21 s, but there is no such bin on our graph because we did not observe such an interval yet. Nevertheless, we could compute an approximate probability for each bin and construct a histogram with probability on the ordinate instead of frequency. This graph is shown in Figure 2.4.

Figure 2.4: Carbon-14 Decay Intervals — Probability (N = 50)

You can check for yourself that the bin probabilities in this graph add up to one, so in this figure we are denying the possibility of larger decay intervals. On the other hand
15. in a longer Report. Regress+ Report files have the extension .out.

All of the numerical results from Regress+ computations, except lists and samples, are summarized in its Report. This includes the optimum parameters and the values for whatever optimization criteria are relevant. With stochastic models, the goodness-of-fit results are included as well. Results for Confidence Intervals, if any, are shown next and are similar in appearance to the parametric bootstrap results shown in Figure 12.7. When there are prediction requests, the results for these are appended to the bottom of the Report. In this example, there are four percentiles estimated for the Y values requested, with confidence intervals if selected.

BattingAvg_pred.in — Model: y ~ Gumbel(A, B) — 137 points

Regress+ converged after 61 iterations.
Using the maximum-likelihood criterion, the optimum parameters are as follows:
  A = 3.47396e-01
  B = 2.44539e-02

Summary Statistics (one-sided, 1000 bootstrap samples)
  Log-likelihood = 2.96087e+02 — estimated to be in the 37th percentile
  K-S statistic  = 0.0569502   — estimated to be in the 70th percentile
  Goodness of fit is ACCEPTABLE.

Two-sided parametric (percentile) confidence intervals for this distribution:
  A   90% -> (3.43770e-01, 3.51302e-01)
      95% -> (3.43130e-01, 3.51921e-01)
      99% -> (3.42059e-01, 3.52934e-01)
  B   90% -> (2.18957e-02, 2.70326e-02)
16. interesting things from it besides the mean. And what, if anything, does all of this tell us about C-14 beta decay? It turns out that a really good model can tell us quite a lot — indeed, most of what we might want to know.

We have just seen that, using the PDF, we can get the weighted average for any function of the random variable. This is how we modeled the first moment, the mean. We can model other moments in analogous fashion. To illustrate, we shall compute the variance using our model, then compare that answer to the empirical variance, 20.1357 s².

The variance of x, Var(x), was defined in Equation 2.3. We already have the average of x, namely λ; we now need the average of x²:

⟨x²⟩ = ∫₀^∞ x² (1/λ) exp(−x/λ) dx = 2λ²    (4.4)

Substituting,

Var(x) = 2λ² − λ² = λ²    (4.5)

The value of λ² is 20.1862 s². Comparing this to the observed variance, we have a discrepancy of 0.0505, a relative error of 0.25 percent. This unusually good match is the result of a good model and a large sample size. As shown in Chapter 3, random errors tend to cancel out more and more as the sample size increases.

Deriving moments from a PDF usually requires more mathematics than this. In a few famous cases, the PDF is very easy to interpret and manipulate. Such PDFs are utilized in many situations. One of the most famous PDFs, the Normal (Gaussian) distribution, will feature prominently in much of our modeling, es
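As a numerical sanity check of Equation 4.4 (a sketch only, in Python/SciPy, with λ taken as the square root of the value quoted above), quadrature confirms that the second moment of this exponential PDF is 2λ², so the model variance is λ².

    import numpy as np
    from scipy.integrate import quad

    lam = 4.4929                                   # mean decay interval (s), so lam**2 ~ 20.19 s^2
    pdf = lambda x: np.exp(-x / lam) / lam

    second_moment, _ = quad(lambda x: x * x * pdf(x), 0.0, np.inf)
    var = second_moment - lam ** 2                 # Var(x) = <x^2> - <x>^2
    print(second_moment, 2.0 * lam ** 2)           # both ~40.37
    print(var, lam ** 2)                           # both ~20.19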
17. is the one shown in Equation 4.14:

ŷ = A [1 − exp(−B (x − C))]² + D    (4.14)

Using its best five parameters, we get the plot shown below.

Figure 4.12: H₂⁺ Data and Model — energy (eV) vs. H–H distance (Å)

In this plot, the error bars are so small that they do not extend much beyond the dot used for the data. We shall see examples later with larger errors and more obvious error bars. It is now time to discuss what is meant by the best model parameters and how they are estimated.

Chapter 5: Optimizing the Model

BEFORE we can describe the procedures for computing the best model parameters, we first must define what we mean by best. In frequentist statistics, the paradigm underlying the Regress+ software, the best model parameters are deemed to be those which maximize the likelihood of observing the data that were actually observed. That is, with any other set of parameters, the joint probability of the observed data would be smaller. Such parameters are termed the maximum-likelihood (ML) parameters. Maximum-likelihood parameters are computed directly when the model is stochastic and indirectly for deterministic models.

5.1 Stochastic Models

Starting with the PDF for a stochastic model, f(y), and assuming that the N datapoints are independent, the likelihood of a given dataset, {y}, is as follows:

L({y}) = ∏_{k=1}^{N} f(y_k)    (5.1)

In the usual calculus procedure, the ML parameters are found by differentiating
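In practice, the maximization is nearly always done numerically rather than by calculus. The sketch below (Python/SciPy; not the optimizer used in Regress+) fits the single parameter of an exponential model by minimizing the negative log-likelihood and confirms that the numerical answer matches the closed-form ML estimate, the sample mean.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    y = rng.exponential(scale=4.5, size=1000)      # stand-in for the decay-interval data

    def neg_log_likelihood(lam):
        """-log L for an exponential PDF f(y) = exp(-y/lam)/lam."""
        if lam <= 0:
            return np.inf
        return len(y) * np.log(lam) + np.sum(y) / lam

    fit = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
    print(fit.x, np.mean(y))                       # numerical ML estimate vs. the sample mean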
18. model remains to be seen At least theory tells us the correct model form But how can any theory tell us that These 10 000 time intervals are random variates How can one say anything definitive about numbers which are supposed to be random Is that not an oxymoron Yes and no If I told you the first 9 999 numbers and asked you to predict the last one then you could not do that The greatest expert in the world could not do that the numbers are truly random However we seek a model for 10 000 numbers not for one number We want to model the sample as a whole so that we can make valid inferences about the decay of C in general That is usually the case when modeling random variates and it is not only possible but sometimes easy to do This is one of those times Analytic Form Whenever one models random variates one is seeking a formula for the PDF describing the data Here Figure 4 1 depicts one representation of this PDF a histogram However this is only a crude approximation since the data are binned The graph shows 30 bins so in each bin there are an average of 10000 30 datapoints all represented by the same PDF value Clearly this is a very low res picture With continuous data one should expect to see a continuous PDF The model required by theory is just such a continuous function This function models describes how the data are distributed along the abscissa how many near the origin how many far away etc Consequently
19. polynomial that returns all of these points exactly. For example, the integer values shown below were generated randomly, with X in (0, 50) and Y in (0, 20), uniformly distributed in both cases.

x: 49  36  41   9  50  15   2   5  30  43
y: 19   1   7  12  17  16  10  14  18  13

The desired polynomial (Equation C.1) is the unique ninth-degree polynomial through these ten points; its exact rational coefficients, with numerators and denominators running to some twenty digits, are omitted here. The fit, of course, is perfect; see Figure C.1. However, such a polynomial is useless as a model. For instance, it is extremely unlikely that it would predict additional datapoints with acceptable accuracy.

Figure C.1: One Example of the Weierstrass Theorem

Bibliography

[1] ATLAS Experiment. http://atlas.ch
[2] Baseball-Reference.com. http://www.baseball-reference.com
[3] Bell, J. S. Speakable and Unspeakable in Quantum Mechanics: Collected Papers on Quantum Philosophy, second ed. Cambridge University Press, 2004.
[4] Efron, B., and Tibshirani, R. J. An Introduction to the Bootstrap. Chapman & Hall/CRC, 1993.
20. support it. The data in Table 1.1 were actually the first 50 datapoints from a larger sample, collected by recording C-14 decays for about 12 hours. The histogram for this big sample (N = 10,000) is shown in Figure 4.1. Our goal will be to develop a model for these data.

Figure 4.1: Carbon-14 Decay Intervals — Big Sample (N = 10,000)

Often, one does not know what analytic form is most appropriate for modeling some data. In that case, one tries different forms based on past experience and/or the appearance of a histogram. Here, the situation is just the opposite. Beta decays follow a general law of Nature that is very well known. There is no doubt at all about the formula describing decay intervals. However, this formula contains a parameter which changes from one radioactive isotope to another. Even knowing the analytic form, we must still determine the value of this parameter.

Saying "determine the value" is overstating the situation. All we have is a single dataset, so the best we can do is to estimate this parameter. How well we can do this depends upon how much information we have in our dataset. We believe it to be a large sample, but whether it is large enough to develop a good
21. that are reasonably well known in advance. If this is true, then a useful technique is to set these parameter values Constant temporarily and let the remaining parameters converge. Then Restart, releasing one of the constant parameters so that it can attain a better value. If this is done one constant parameter at a time, convergence to the correct optimum is usually achieved. Rarely, when the global optimum is very hard to find, it might be necessary to set two parameters Constant alternately; that is, make one Constant, then the other, toggling back and forth. This procedure does not always work, but sometimes it does.

13.3 Poor-quality Graphs

A significant amount of effort was expended to try to produce graphs of publication quality. Here, too, the result is not always satisfactory, especially with Probability Plots. With plots of this kind, the mathematics sometimes gets in the way of nice graphics. With left-bounded models, for example, the tick marks on the ordinate can get so compressed that they are illegible. There is no good solution for problems of this kind because the mathematical requirements are dominant. In other cases, the poor quality results from poor coding of the data. This is easily fixed by proper coding before using Regress+.

13.4 Systematic Error

As discussed in Chapter 6, residuals from a deterministic model should be random, usually Normal(0, σ). When they are not, this indicates that there is some systematic error present
22. this distribution is both biased and skewed. To correct this, the BCa technique [4] is employed. Correcting for skewness requires a complete jackknife computation. The BCa procedure computes new indices into the empirical sampling distribution for the confidence limits at the desired percentile. Rarely, a BCa index will be outside the range of the empirical vector from which it was computed. When this happens, the Regress+ Report will show that confidence limit bounded with a paren instead of a bracket, and the Display will have fewer than six such limit indicators. Sometimes it is possible to correct this situation by increasing the number of bootstrap samples in the Setup dialog.

(Notes: the Regress+ default is 1,000 bootstrap samples; the pseudo-random number generator is MT19937; obviously, small datasets give rise to samples with several duplicate values; during this process, the Regress+ Display shows the message "Initializing Confidence Intervals".)

Appendix C: Illustration of Weierstrass' Theorem

The Weierstrass Theorem guarantees that you can find a model that will fit any dataset perfectly, if you try hard enough. All you need to reproduce N points is a polynomial with N parameters. Then you will fit everything, noise included. One can easily illustrate (not prove) this theorem by generating random values for X and Y, pretending that these variates constitute ordered pairs, then showing that it is possible to find a
23. times to ensure repeatability. It has also done a goodness-of-fit test to determine whether the model is acceptable. In this case, it has completed the confidence-interval computation as well. For this small dataset, all of this is virtually instantaneous.

The Gumbel distribution (see Compendium) has two parameters: a location parameter, A, and a scale parameter, B. Their optimum (here, maximum-likelihood) values are shown in the Display. Goodness of fit was estimated using a parametric bootstrap, similar to that discussed previously (Ch. 7) but with the 1,000 bootstrap samples synthesized from the optimum model. Model acceptability depends on the one-sided percentile of the worse of the two observed fit statistics, using both optimization criteria, as shown in Table 12.1. For this dataset and model, the fit is acceptable.

Confidence intervals, if any, are pictured in the Display as a set of three nested intervals surrounding the optimum parameter value, with the outermost given numerical values. The indicated limits correspond to the central 90, 95, and 99 percentile confidence intervals. (These intervals are often very skewed.) Precision equals six significant figures, or fewer, depending on the shape of the parameter space. Numerical values will be shown in the Report.

Display summary for BattingAvg with the Gumbel model:
  A = 3.47396E-01  [3.42E-01, 3.53E-01]
  B = 2.44539E-02  [2.11E-02, 2.84E-02]
  Finished — ACCEPTABLE
24. up with a partial separation, with information still contaminated with some error.

3.2 Another Experiment

It often happens that the data we collect are not the result of any sort of equation; they might be random quantities, such as those listed in Table 1.1. For instance, we might ask, "What is the average distance between two points in a unit circle?" and then try to determine the answer by selecting random pairs of points and measuring their separations. The easiest way to do this is to select, repeatedly, two points in a 2 x 2 square and use them only when both are inside the inscribed unit circle. A simple computer simulation will suffice, and Table 3.1 shows the results for one such experiment.

Table 3.1: Distance Between Two Points in a Unit Circle (experimental)

Trials            Result          |Error|
10                0.8246413693    0.0807734180
100               0.9229038963    0.0174891089
1,000             0.9025456727    0.0028691147
10,000            0.9057814948    0.0003667074
100,000           0.9053029327    0.0001118547
1,000,000         0.9052005706    0.0002142168
10,000,000        0.9052150149    0.0001997725
100,000,000       0.9054222979    0.0000075105
1,000,000,000     0.9054073486    0.0000074387

The middle column in this table lists the average as the number of trials increases. It is clearly converging. If we have done everything correctly, it is converging to the correct answer. Fortunately, this answer can be computed exactly with a little calculus. The true average separation is 128/(45π) = 0.9054147873, to ten decimal places. Thus, we can
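The simulation itself takes only a few lines. A minimal sketch (Python/NumPy; the function name is an invention) using the same rejection scheme — candidate points drawn in the 2 x 2 square, pairs kept only when both points fall inside the unit circle — is shown below.

    import numpy as np

    def mean_separation(n_pairs, rng):
        """Average distance between two random points inside the unit circle,
        selected by rejection from the enclosing 2 x 2 square."""
        total, count = 0.0, 0
        while count < n_pairs:
            p = rng.uniform(-1.0, 1.0, size=(n_pairs, 2, 2))        # (pair, point, xy) candidates
            inside = np.all(np.hypot(p[..., 0], p[..., 1]) <= 1.0, axis=1)
            d = np.hypot(*(p[inside, 0] - p[inside, 1]).T)          # distances for accepted pairs
            take = min(len(d), n_pairs - count)
            total += d[:take].sum()
            count += take
        return total / n_pairs

    rng = np.random.default_rng(42)
    print(mean_separation(1_000_000, rng))        # Monte Carlo estimate
    print(128.0 / (45.0 * np.pi))                 # exact value, 0.9054147873...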
25. useful when and if analyses turn out to be not so simple.

WE have presented here a very brief and incomplete overview of how data can...

Part III: Regress+ User Guide

Chapter 9: Overview

THE User Guide describes the basics for installing and using Regress+. All of the computations discussed in Part II (Modeling), and more, can be accomplished using built-in Regress+ functionality. A wide selection of common models is hard-coded, including 21 equations and 59 distributions. In addition, a User-defined equation can be defined. The Regress+ interface has been designed to be intuitive and to hide the math as much as possible. For some technical details, see Appendix B.

Regress+ installation is described in this chapter. Succeeding chapters describe the following topics:

• Input: creating a Regress+ input file — what is valid and what is not.
• Setup: the File menu; options available through the Setup and Parameter Constraint dialogs; User-defined equations.
• Other Menus and Output: Graphs, Report, List file, Sample file.

It is assumed that the user is familiar with standard Mac operations.

9.1 Installation

Regress+ is downloaded as a zip archive, which might or might not be uncompressed automatically. If not, then double-clicking the file icon will uncompress it. The result will be the Regress+ folder. Installation involves simply dragging the Regress+ folder to your Applications folder. Two further steps are optional but are recommended
26. will be utilized again for finding the best parameters for deterministic models.

5.2 Deterministic Models

A deterministic model relates one or more independent variables to a dependent variable. In general, the modeling exhibits some error. Typically, this error is a combination of measurement error (imperfect observation) and modeling error (imperfect model). For all practical purposes, this error is random (unpredictable), so the natural way to model the error itself is with a stochastic model. By far the most common model used for this purpose is a Gaussian (Normal) distribution with a zero mean, N(0, σ). Whatever the error model, the process of finding the best equation by maximizing the likelihood of the modeling errors is called regression.

Suppose that we have some deterministic model, g(x), such as the sine wave we used for the daytime data shown in (4.13). Recalling that data = information + error, we can express observation k as the sum of a model prediction plus an error:

y_obs,k = y_pred,k + e_k = g(x_k) + e_k    (5.6)

If all e_k are described by the same model, e.g., e_k ~ N(0, σ), then we have an unweighted regression. If this is not true, then we get a weighted regression, e.g., e_k ~ N(0, σ_k). (Here, "~" is read "is distributed as".) The parameters of g(x) are its ML parameters iff the parameters of the error model are also ML.

5.2.1 Unweighted Regression

We want to find the ML parameters
27. Report (continued):

  B   95% -> (2.14818e-02, 2.76506e-02)
      99% -> (2.05249e-02, 2.88295e-02)
  LL  90% -> (2.77918e+02, 3.07559e+02)
      95% -> (2.74963e+02, 3.10906e+02)
      99% -> (2.68714e+02, 3.18288e+02)
  KS  90% -> (3.53319e-02, 7.29195e-02)
      95% -> (3.33414e-02, 7.73535e-02)
      99% -> (2.85199e-02, 9.06869e-02)

Parametric Bootstrap — Jun 28, 2013, 3:08:57 PM
Mean values for parameters (A, B): 3.47525e-01, 2.43732e-02
Covariance Matrix:
  4.96387e-06   1.19964e-06
                2.52903e-06
Correlation Matrix:
  1.00000e+00   3.38581e-01
                1.00000e+00
Predicted Percentiles:
  Y        Percentile
  0.30      0.09624
  0.35     40.6982
  0.40     89.0164
  0.45     98.5054

Figure 12.7: Default Report for BattingAvg_pred.in

Other output files include the List file, associated with deterministic models, and the Sample file, when random samples are created. Both of these have the extension .let.

12.3.2 Graphs

All of the graphs discussed above are likewise documents. When saving, the standard File dialog is modified so that the file can be saved either in PDF format (the default) or PNG format. The former is of higher quality but might not be compatible with all commercial software. The SaveAs command (Shift-Command-S) brings up a File dialog similar to that shown in Figure 12.8.
28. Figure 12.5: Default Graph Dialog for Weighted Hale-Bopp Exponential Model

Moreover, the Axes dialog now shows checkboxes for making axes logarithmic (base 10). Regress+ enables this option automatically, and only when the axis spans at least two orders of magnitude (Fig. 12.6). The remaining checkboxes should be obvious.

Finally, with graphs for deterministic models, Regress+ might adjust the displayed values of the independent variable, X, when this quantity is poorly encoded. Such adjustments consist of factoring out a constant or subtracting a constant. These modifications are reflected in the abscissa label and may not be edited out. This behavior is the result of the limited space available to draw the tick labels and the requirement for a publication-quality plot; it affects the labels but not the actual values. (It is preferable to encode data correctly before using Regress+.)

Figure 12.6: Hale-Bopp Model with Logarithmic Y-axis — CN rate vs. distance (AU)

12.3 Files

Regress+ can create several output files, depending on the model and Setup options. All of these are considered documents and can therefore be saved and/or printed directly. Note that Regress+ cannot open any of its own files; they are purely output files.

12.3.1 Report

The primary output file is the Report (see File menu). The default Report for input file BattingAvg_pred.in is shown in full in Figure 12.7. Additional Setup choices will result
29. (Table 1.4, continued)

            335   555              701   554              1066   554
  22/12/95  356   540    21/12/96  721   540    21/12/97  1086   540
   1/1/98  1097   545

association between two or more variables. A classic example is the relationship between the height and weight of adult humans. It is not really correct to say that height is the cause of weight, or vice versa. However, they certainly exhibit a strong association: when one is small, the other tends to be small as well, and so on. Quite often, this means that both are linked to one or more common, perhaps unknown, factors.

The foregoing datasets are not meant to comprise an exhaustive categorization of data in general, merely a few examples to illustrate some of the possibilities. There are many kinds of data but, in this text, we are going to focus on numerical, univariate data, meaning one variable in stochastic cases and just two variables in deterministic cases. This will suffice to explain all of the fundamental ideas appropriate to an introductory discussion and should provide a useful starting point for readers interested in pursuing this subject, as well as a basis for further study regarding multivariate data.

1.1 A Precious Resource

Back in 1989, John Allen Paulos garnered more than his 15 minutes of fame by lasting almost five months on the New York Times Review of Books best-seller list for an enjoyable little volume entitled Innumeracy [10]. Needless to say, he was writing about the population at large, not himself. It is a sad fact th
30. 61 337 336 341 350 344 368 335 378 409 420 367 356 327 326 318 332 343 349 330 374 380 344 390 403 356 328 337 321 350 361 363 326 354 440 376 368 378 363 327 341 323 364 343 359 372 371 405 308 369 393 349 309 340 321 359 368 356 331 429 388 410 358 386 378 388 353 353 316 333 357 358 347 387 372 424 350 383 398 371 343 388 326 388 363 347 363 358 344 385 324 382 379 349 369 328 301 333 366 339 328 357 373 410 377 384 369 381 343 353 332 333 339 357 365 CHAPTER 1 SOMETHING ABOUT IT 5 The original data 2 were recorded as decimal fractions with four significant figures in deference to their exactness as discrete values Typically batting averages are reported to three significant figures as in this table implying precision to one part in a thousand However since baseball players do not get 1 000 at bats in one season even this much precision is a bit fictitious Table 1 3 illustrates one more feature common to datasets The observations are coded by multiplying each average by 1 000 Coding is usually a linear transformation that simplifies data presentation by removing redundant digits It can also improve an analysis by focusing on the range exhibited by the data In this case that range is 325 440 spanning just 116 Consequently any analysis that compares these batting averages to one another cannot ju
31. Figure 12.1: Batting Average Display Dialog for Gumbel Model

Table 12.1: Goodness-of-fit Percentiles

Percentile (P)    Assessment
P < 90            acceptable
90 < P < 95       marginally acceptable
95 < P < 99       unacceptable
P > 99            very unacceptable

In some cases, the entire set of six confidence limits might not be displayed. (This is possible since these are BCa intervals [4], not percentile intervals.) This can happen when the estimated confidence limits fall outside the range of the vector of bootstrap values for the goodness-of-fit statistic. The uncertainty of an optimum parameter is also indicated by the thickness of the horizontal line: thinner is better. (Of course, none of this applies if the parameter is set Constant.) With deterministic models, the Display dialog has the same appearance, except that the goodness-of-fit assessment is replaced with the value of the optimization criterion, usually R-squared.

12.2 Graphs

With continuous stochastic models, three graphs are available via the Graph item in the Output menu (Command-G). This command brings up the default Graph window (Fig. 12.2).

Figure 12.2: Default Graph Dialog for Continuous Stochastic Models

The first graph is the PDF, shown above, and the second is the CDF. The third is what is sometimes called a probability plot. In Regress+, this kind
32. ... + D
Log:               A*log(B*x) + C
LogPoly1:          A*log(B*x) + C*x + D*x^2 + E
LogPoly2:          A*log(B*x + C*x^2)
Pow:               A*x^B + C
PowExpo:           A*x^B * exp(C*x) + D
Sin:               A*sin(2*Pi*B*(x + C)) + D
SinExpo:           A*sin(2*Pi*B*(x + C)) * exp(D*x) + E
Cos:               A*cos(2*Pi*B*(x + C)) + D
CosExpo:           A*cos(2*Pi*B*(x + C)) * exp(D*x) + E
Michaelis-Menten:  A*x/(B + x)
Logistic:          A*B/(B + (A - B)*exp(-C*x))
ConsecFirstOrder:  A*(1 + (B*exp(-C*x) - C*exp(-B*x))/(C - B))
Conic:             A/(1 + B*cos(2*Pi*C*(x + D))) + E
Catenary:          A*cosh(B*(x + C)) + D
Gaussian:          C*exp(-log(2)*((x - A)/B)^2) + D
Lorentz:           C/(((x - A)/B)^2 + 1) + D
Gaussian & Lorentz: p*Gaussian + (1 - p)*Lorentz
  where A = peak position, B = half-width at half-height, C = peak height above baseline,
  D = baseline, and p = fraction due to the Gaussian component.

Appendix B: Technical Details

This appendix provides some low-level details that might be of interest to expert users who want to know about the algorithms, etc., utilized in Regress+.

B.1 Software

Regress+ is a Macintosh (Cocoa) application (12,500 LOC) developed using Xcode 4.6. The language is a combination of Objective-C, Objective-C++, C, C++, Flex, and Bison. The Xcode target is Mac OS X 10.7 (Lion). This program is adaptively multi-threaded: given k effective CPUs, one is reserved for the top-level (main) thread, and all other computations are partitioned among the remaining k − 1 CPUs. This is particularly useful for bootstrapping, which is an SIMD process. Regress+ uses some functionality from th
33. Data Modeling with Regress+
Michael P. McLaughlin

Data Modeling with Regress+, Second Edition (v2.7)
Copyright 2013 by Michael P. McLaughlin. All rights reserved.

DISCLAIMER: Regress+ is intended solely as a tool for the convenience of users. It makes no guarantee or warranty, of any kind, that its use and/or its output are appropriate for any purpose. All such decisions are at the discretion of the user. Any brand names and product names included in this work are trademarks, registered trademarks, or trade names of their respective holders.

This document was created using LaTeX and TeXShop. Figures were produced by Regress+ or Mathematica.

For deeds do die, however nobly done,
And thoughts of men do as themselves decay;
But wise words taught in numbers for to run,
Recorded by the Muses, live for ay.
— E. Spenser, 1591

When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science.
— Lord Kelvin, 1891

Contents

Preface
I. Data
1. Something About It
   1.1 A Precious Resource
   1.2 An Imperfect World
2. Data Summaries: Statistics and Graphs
   2.1 Statistics
   2.2 Graphs
34. [5] Hopper, V. D., and Laby, T. H. The electronic charge. Proceedings of the Royal Society of London, Ser. A 178, 974 (1941), 243-272.
[6] Menzel, D. H., Ed. Fundamental Formulas of Physics. Dover Publications, Inc., 1960.
[7] Millikan, R. A. On the elementary electrical charge and the Avogadro constant. Physical Review 2, 2 (1913), 109-143.
[8] Millikan, R. A. The most probable 1930 values of the electron and related constants. Physical Review 35 (1930), 1231-1237.
[9] National Institute of Standards and Technology (NIST). Fundamental physical constants, 1986, 1998, 2002, 2006, 2010.
[10] Paulos, J. A. Innumeracy. Hill & Wang, 1988.
[11] Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. Numerical Recipes, 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.
[12] Rauer, H., et al. Optical observations of comet Hale-Bopp (C/1995 O1) at large heliocentric distances before perihelion. Science 275 (1997), 1909.
[13] Thomson, J. J. Nobel lecture, December 1906.
[14] Weast, R. C., Ed. Handbook of Chemistry and Physics, 56th Edition. CRC Press, Inc., 1975.
[15] Wikimedia. http://commons.wikimedia.org
35. Figure 12.8: SaveAs Dialog for Regress+ Graphs

All of the standard operations associated with saving files apply to Regress+ files. However, the graphs are of fixed size unless modified using additional software (vector graphics vs. bitmap graphics). Resizing PNG files sometimes results in a degradation in quality.

12.4 Menus

Here is a summary of the menu items specific to Regress+:

File
  Restart (Option-Command-R): begin again with the Setup dialog.
Output
  Report (Command-R): create the Report.
  Graph (Command-G): create a graph.
  Listing (Command-L): create a document with the List Data results.
  Flip Display (Command-F): the default Display shows only parameters A-E; if there are more parameters, from a User-defined model, this menu item toggles between A-E and F-J.

Chapter 13: What Could Possibly Go Wrong

THE internal algorithms in Regress+ go to considerable lengths to try to be foolproof but, as you might expect, this goal is not completely realized. There are still several different kinds of things that can go wrong, even when you know what you are doing. (This last, incidentally, is taken for granted; all bets are off if this is not the case.) In this chapter, we describe a few of the problems that might arise.
36. Figure 11.1: Initial Setup Dialog for BattingAvg Example (137 points; Model: y ~ Normal(A, B); Optimization Criterion: Maximum Likelihood or K-S Statistic; Bootstrap Samples; Options: Confidence Intervals, Percentiles, Generate Sample(s), Positive Data)

Figure 11.2: Setup Dialog with Gumbel Model (137 points; Model: y ~ Gumbel(A, B))

WARNING: Given the ease with which Regress+ performs modeling, it is extremely tempting to try one model after another until the results look good. This is a major mistake. Standard goodness-of-fit tests, to be discussed in the next chapter, work only when they are used just once. When applied repeatedly to the same data, their output is unreliable. One should have a priori reasons for choosing a model, and this choice should be made before modeling the data.

11.1.2 Optimization Criterion

There is a choice of two optimization criteria. By far the most common is maximum likelihood, which is always available. With continuous data, the alternative is to minimize the K-S statistic instead. This option is trivial for Regress+ but is rarely used in the literature. Still, it makes for an interesting compariso
37. a good approximation, day number determines daytime. Alternatively, one could say that daytime determined day number, at least in a given year. However, one variable is almost always considered to be determined by (dependent on) the rest. This is the so-called dependent, or response, variable — in this case, daytime. The others are independent variables, also termed covariates. This language arose because, in an experiment, one typically has good control over the independent variables but no control over the dependent variable. Measuring the latter is nearly always a goal of the experiment.

When a relationship is causal, it is obvious which variable is dependent, but a deterministic relationship does not necessarily imply cause and effect. It might indicate merely an

(Footnotes: that is, being able to distinguish 1,000 from 999 reliably; adding a constant, multiplying by a constant, or both; rounded off to the nearest minute; apologies to the astrophysics community, where this statement is less true.)

Table 1.4: Daytime (min) in Boston, MA

Date      Day   Daytime    Date      Day   Daytime    Date      Day   Daytime
1/1/95      1     545      1/1/96     366     544      1/1/97     732     545
           32     595                 397     595                 763     597
           60     669                 426     671                 791     671
           91     758                 457     760                 822     760
          121     839                 487     840                 852     839
          152     901                 518     902                 883     902
21/6/95   172     915      21/6/96    538     915      21/6/97    903     915
          182     912                 548     912                 913     912
          213     867                 579     865                 944     865
          244     784                 610     782                 975     783
          274     700                 640     698                1005     699
          305     616                 671     614                1036     615
38. accomplished by making a graph which contains the line described in the last paragraph, the abscissa (drawn horizontally), plus another, perpendicular line, the ordinate. These two lines are joined together at their respective origins to produce the familiar Cartesian axes, so named in honor of René Descartes. The abscissa supplies locations for our data, but what does the ordinate supply? There are several choices, each giving a different kind of graph. The different graphs describe different aspects of the data. (Some authors consider the peak of any clump to be a mode, while others define only the tallest peak as the mode.)

Graphing Random Variates

When the data are random variates, the simplest kind of graph is a histogram. To create a histogram, the data must be either discrete or binned into discrete categories. (Even discrete data may be further binned.) To keep the math simple, bins should be of equal width. For this batting-average dataset, we shall define five bins, with binwidth = 25 and the following six bin boundaries:

boundaries = {325, 350, 375, 400, 425, 450}

There must be enough bins to contain all of the data. In this example, the data do not go as high as 450, but that does not matter.

The next step is to fill the bins with the data, putting each datapoint into its correct bin, after which each bin will contain zero or more datapoints. The number of datapoints
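Binning is a one-liner in most environments. The sketch below (Python/NumPy; the handful of data values is an illustrative subset quoted from Table 1.3, not the full dataset) fills the five bins defined above and prints the frequency in each.

    import numpy as np

    # A few batting-average maxima, coded by multiplying by 1,000 (illustrative subset of Table 1.3)
    data = np.array([337, 336, 341, 350, 344, 368, 390, 403, 356, 429, 440, 388])

    boundaries = np.arange(325, 451, 25)           # 325, 350, ..., 450: six boundaries, five bins
    counts, edges = np.histogram(data, bins=boundaries)
    for lo, hi, c in zip(edges[:-1], edges[1:], counts):
        print(f"[{lo}, {hi}): {c}")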
39. also list the magnitude of the error third column What this experiment shows is that 1 even random measurements can contain error but that 11 inherently random quantities still can be described with precision CHAPTER 3 DATA VS INFORMATION 25 3 3 Separating Information from Error We shall see in the next chapter that information can often be separated from error when ever you have a good way to describe one or the other or both Part II Modeling Chapter 4 Models in the Real World there was What you had to do if you were right handed was to position the burette so that the stopcock was pointing to the right Then you could wrap your left hand around the bottom of the burette in order to manipulate the stopcock while swirling the flask counterclockwise with your right hand Near the endpoint you could set the flask down and use two hands to twist the stopcock very quickly so as to get half drops I really needed those half drops because I really needed the job This was late June between my Junior and Senior undergraduate years and I had found summer employment working in a chemistry laboratory for the U S Fish and Wildlife Service We tested fish and sometimes water I was lucky to get work at all that year and particularly fortunate to find something that matched my college major Getting paid for doing chemistry was the best I could have hoped for and a far cry from my first summer job at age eleven picking strin
40. and model, usually the likelihood. For instance, robust regression sometimes describes errors as Laplacian instead of Gaussian. However, such procedures are far less common than least squares. The latter is very conservative in its assumptions about the nature of the errors, which is usually seen as desirable. (SSE is also called the sum of squared residuals, SSR.)

5.2.2 Weighted Regression

The only difference between weighted and unweighted regression is that the former takes into account the fact that different datapoints have different precision, typically because they have different measurement errors. Consider the following data, observed for the Hale-Bopp comet of 1996-1997 [12].

Table 5.2: Rate of Production of CN in Comet Hale-Bopp

Rate                      Distance from Sun    Uncertainty in Rate
(molecules per second)    (AU)                 (molecules per second)
130                       2.9                  40
190                       3.1                  70
 90                       3.3                  20
 60                       4.0                  20
 20                       4.6                  10
 11                       5.0                   6
  6                       6.8                   3

Here, the uncertainty in rate is a large fraction of the rate itself. Were one unaware of this, or if it were ignored, then one would expect to get incorrect values for the optimum parameters, whatever the model.

The hardest part of accounting for variable precision is simply knowing what weights to use in the regression formula. Usually, with Gaussian errors, the weight on a datapoint is the reciprocal of some constant multiple of σ_k², typically 1/σ_k². The sigma itself is then a measure of uncertainty
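A weighted fit of this kind can be sketched as follows (Python/SciPy; the exponential form of the model is assumed here for illustration — the exact parameterization used in the book is not reproduced). Supplying the rate uncertainties as sigma makes the fit weight each datapoint by 1/σ².

    import numpy as np
    from scipy.optimize import curve_fit

    # Table 5.2: CN production rate vs. heliocentric distance, with rate uncertainties
    rate    = np.array([130.0, 190.0, 90.0, 60.0, 20.0, 11.0, 6.0])
    dist_au = np.array([2.9, 3.1, 3.3, 4.0, 4.6, 5.0, 6.8])
    sigma   = np.array([40.0, 70.0, 20.0, 20.0, 10.0, 6.0, 3.0])

    def model(r, a, b):
        """Assumed exponential decline of rate with distance: a * exp(-b * r)."""
        return a * np.exp(-b * r)

    popt, pcov = curve_fit(model, dist_au, rate, p0=(1000.0, 1.0),
                           sigma=sigma, absolute_sigma=True)     # weights ~ 1/sigma^2
    print(popt)                       # optimum parameters
    print(np.sqrt(np.diag(pcov)))     # their standard uncertainties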
41. ary to determine its values under some standard conditions. These values are listed in a table and published in handbooks of various kinds. One looks in the table to find the probability that a measured value of the statistic will be as large (or small) as that observed. Usually, with goodness-of-fit statistics, large is bad. By convention, a model is considered poor if its goodness-of-fit statistic has a probability of less than 5 percent. (See Bootstrap Analysis in Chapter 7.) The catch in all of this is the validity of the standard conditions for the model and data in question. Nevertheless, this is the usual procedure for testing continuous distributions.

In the salaries example, the empirical K-S value = 0.0173174. This value falls in the 85th percentile of its sampling distribution, so it is not large enough to reject the adequacy of the model. We conclude, therefore, that this model is acceptable.

A good way to visualize the data versus a continuous model is with what is sometimes called a probability plot. This is similar to a Q-Q plot, except that the abscissa shows the variates themselves and the ordinate shows percentiles. The probability plot for this example is in Figure 6.2. The model is the gray line, and the dots represent the datapoints. This plot shows a good result, even though the upper tail drifts off a bit. Tails usually deviate from the bulk of the data because they constitute a small, extreme subset. One needs to be familiar
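The K-S statistic itself is simple to compute: it is the largest absolute difference between the empirical CDF and the model CDF. A minimal sketch (Python/NumPy; the exponential model in the example is an arbitrary stand-in) follows.

    import numpy as np

    def ks_statistic(data, model_cdf):
        """Maximum absolute difference between the empirical CDF and a model CDF."""
        x = np.sort(np.asarray(data, dtype=float))
        n = len(x)
        cdf = model_cdf(x)
        ecdf_hi = np.arange(1, n + 1) / n           # empirical CDF just after each datapoint
        ecdf_lo = np.arange(0, n) / n               # ... and just before
        return max(np.max(ecdf_hi - cdf), np.max(cdf - ecdf_lo))

    # Example: sample tested against an exponential model with mean 4.49 s
    rng = np.random.default_rng(3)
    sample = rng.exponential(scale=4.49, size=200)
    print(ks_statistic(sample, lambda x: 1.0 - np.exp(-x / 4.49)))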
42. at the proverbial man on the street cannot do long division without a calculator and would not know a logarithm from a lol lipop No surprise then that anything numerical leaves most people more than willing to change the subject This deficiency also helps explain why science especially is so CHAPTER 1 SOMETHING ABOUT IT 7 poorly understood and appreciated except when it is erroneously equated to technology or medicine We are concerned here with data and most data are expressed in numbers since they have an unlimited capacity for accuracy and precision Lord Kelvin justly renowned for his work in thermodynamics had it exactly right and his thoughts on the matter are most apt Knowledge may originate with casual observations but it does not mature until those observations give way to accurate measurements which as any experimentalist will attest require a great deal of talent and experience just to collect Any nontrivial experiment or data collection effort is something that is difficult to do well and usually very expensive Consequently good data are truly a precious resource and merit analysis of equal quality Obtaining good data requires considerable care when making and recording measure ments so as to maximize accuracy and precision while at the same time avoiding biases Some definitions are in order Accuracy Closeness to the truth which in turn is defined by Nature Precision Roughly speaking the nu
43. ata are discrete CHAPTER 11 SETUP 66 more such samples and save the output to a file When this option is selected it is the only analysis that is done the usual computations are not performed Therefore the initial parameters chosen see below are not modified in any way Also since any model must be valid the input data used to reach this point in the Setup must be compatible with the desired model The dialog for creating samples is shown in Figure 11 3 Random Variate Sample s Cenerate sample s of size 100 Cancel lt Figure 11 3 Sample Dialog Samples are saved as tab delimited columns Thus the default is to save one column vector with 100 rows 11 1 6 Positive Data When Regress creates a graph see next chapter it always draws axes that can show all of the data as well as most of the model curve In some cases the model is poor and its plot may extend into the region where variates are negative even though the data are and perhaps must be all positive This setup option forces Regress to start the abscissa at zero when it otherwise would not 11 1 7 Parameter Dialog Clicking OK in the Setup dialog brings up the Parameter dialog Fig 11 4 Regress makes initial guesses for the parameter values but these can be changed If they are changed to invalid values this dialog will be re shown until they are acceptable Sometimes it is desired that one or more possibly all of the parameters be c
44. bservable today e Predict future data or events A common use of a model is to make a prediction of an unobserved quantity This might be an extrapolation such as predicting tomorrow s weather or it could be just a need to fill in some missing data a process called imputation Whatever the reason every model prediction will contain error since no model is perfect Quantifying that error is important but often difficult e Make inferences or decisions Models are very often used to make inferences and decisions In fact almost any time a decision is made as the result of examining some data it is not made using the raw data alone but in accordance with some model that was produced using that data The model is interpretable the dataset is often just a collection of numbers because 1 your money ran out 2 you no longer have access to the equipment 3 your test subjects all died etc CHAPTER 4 MODELS IN THE REAL WORLD 30 All of the reasons listed above for developing a model presuppose that given a dataset it is possible to construct a model that describes it There are a number of ways to do this depending on the needs of the analysis In particular models for stochastic data are developed and optimized using methods very different from those for deterministic data Therefore we consider these two cases separately 4 2 Stochastic Models We shall start with an easy example one with a lot of good data and reliable theory to
45. capability requires special pre processing all desired options must be selected before computations commence there is no way to pause and change your mind or add an option partway through without Cancelling the analysis We shall begin by describing the Setup for stochastic models then for deterministic models In the latter section we shall focus mainly on the differences between the two Output will be discussed in the next chapter O an input file has been successfully loaded the first thing that appears is the 11 1 Stochastic Models Assume that we have opened the file BattingAvg in This input is continuous and the Setup dialog will appear as shown in the screenshot in Figure 11 1 Although most of the available options should be obvious we shall go through them one by one 11 1 1 Model The Model button brings up a tabbed dialog with which the model may be changed if desired For theoretical reasons the most likely model of those available is a Gumbel distribution and so we choose it The Setup dialog then reappears as shown in Figure 11 2 Here everything is the same except the model In this example all continuous models are valid Were even one datapoint less than or equal to zero then several models would be invalid and consequently disabled Note also that Regress can do only one analysis with one input file at a time Available stochastic models are described in the aforementioned Compendium 63 CHA
46. cold but something was most definitely happening Chemistry was happening You cannot go around hitting things at over 7 000 km hr 4 300 mph without serious consequences In this case these neutrons are continually hitting nitrogen atoms in the air The consequence albeit invisible to human eyes is truly spectacular nothing less than the old alchemists dream of the transmutation of elements It can be written as follows T view from what I could glimpse through several layers of thick plastic was as in YN 4C 1H 1 1 Granted this is not exactly lead into gold but it is nitrogen into carbon which is just as fundamental a change The whole point about chemical elements is that they are for all practical purposes immutable It takes an extraordinary amount of energy to force one to change into another A collision at 2 km s does however provide sufficient kinetic energy for one nitrogen atom to change into one carbon atom There is a proton left over to balance things out As you can imagine Equation 1 1 is not your typical chemical reaction but the prod uct carbon 14 is genuine carbon and behaves as such Chemistry is determined by the number of protons in an atom not the number of neutrons Although carbon 14 has eight neutrons in it instead of the usual six or seven this does not affect how it reacts with other CHAPTER 1 SOMETHING ABOUT IT 3 atoms and molecules Carbon 14 mixes thoroughly with the
47. d a decay interval less than zero really is impossible, so that part of Figure 2.4 is correct. In this figure, probability is given by the height of each bin. Later, when we discuss modeling, most of the math will involve continuous relationships, and we shall discover that it is much more convenient if probability were given as an area, not a height. If the binwidth in Figure 2.4 were equal to one, then the height and the area of a bin would be the same. Since the binwidth = 3, it is not. Therefore, we define one final type of histogram in which the ordinate measures probability density, that is, probability per unit binwidth. To get probability density, we define a probability density function (PDF) having units equal to the reciprocal of the random variate:

    \mathrm{PDF} = \frac{\text{probability}}{\text{binwidth}}    (2.5)

The PDF histogram is shown in Figure 2.5. It has a total area of one and is therefore said to be normalized. It is more flexible than the histogram in Figure 2.4 because it can be used to compute the probability for any range(s) of the random variate.

[Figure 2.5: Carbon-14 Decay Intervals (PDF) -- probability density vs. C-14 decay interval (s), N = 50, binwidth = 3]

For example, the probability P that a decay interval is in the range 5-10 can be computed as follows, the sum running over the bins that cover this range:

    P = \sum_{k} \mathrm{PDF}[k] \times \text{binwidth}    (2.6)
48. e found it How we found it how any large aircraft knows its location is quite a tale At the hub of an aviation navigation system is a device known as a ring laser gyro an expensive analogue of the toy gyroscope that you might have received as a present once upon a time This toy works by pulling on a string wound around a heavy wheel forcing the wheel to rotate rapidly The Law of Conservation of Angular Momentum then keeps the gyroscope at a constant position until it starts to slow down A ring laser gyro does something similar but without wheels Instead it sends two beams of laser light around a closed loop in opposite directions When the light comes back together again it generates an interference pattern Einstein s Theory of Relativity guarantees that this interference pattern creates its own self calibrating inertial frame of reference In other words it is immune to acceleration Consequently it can act as a fixed zero baseline against which accelerations can be accurately measured If you were paying attention in calculus class then you can integrate these accelerations to get a sequence of velocities then integrate these velocities to determine your current 3 D position However this is a long and difficult mathematical process If a pilot had to go through all of these computations explicitly the airplane would run out of fuel and crash There are times when the full analytical procedure must be set aside in favor of a quick c
49. e 11 1 User model Functionality symbol description abs absolute value acos inverse cosine acosh inverse hyperbolic cosine asin inverse sine asinh inverse hyperbolic sine atan inverse tangent atanh inverse hyperbolic tangent ceil ceiling cos cosine cosh hyperbolic cosine exp exponential floor floor log natural logarithm logl0 common logarithm sin sine sinh hyperbolic sine sqrt square root tan tangent tanh hyperbolic tangent atan2 inverse tangent two parameters Pi 3 14159 In all of the above angles are assumed to be in radians For the HJ example the user model is written Ax 1 exp Bx x C 2 E 1 70 Chapter 12 Output and Menus as the menu items We first describe the Regress Display dialog common to all models Thereafter we describe various plots and files and the different outputs available for stochastic models and deterministic models Finally we present a synopsis of Regress menus most of which should be obvious to experienced users T chapter discusses the various displays and files created by Regress as well 12 1 Display As an example we choose the input file BattingAvg pred in discussed in Chapter 2 along with a Gumbel model We also choose the Confidence Intervals option Convergence is almost instantaneous and a Display dialog is presented as shown in Figure 12 1 By the time this dialog appears Regress has completed the initial optimization at least three
50. e Cephes library B 2 Optimization To estimate parameters Regress utilizes the Nelder Mead simplex method exclusively Thus it converges in almost all cases without requiring derivatives Also this algorithm is not entirely greedy it has some tendency to move away from a local optimum 11 With the maximum likelihood criterion all computations are carried out in log space to avoid numerical overflow For the initial solution Regress carries out three replicate iterations until the best of the three occurs at least two times If three of these triple runs fail a convergence failure message is sent Convergence requires six significant figures for all non constant parameters but this can be overridden if the response surface is too flat In these rare instances fewer than six significant figures will be reported 84 APPENDIX B TECHNICAL DETAILS 85 B 3 Bootstrapping Bootstrapping consists of estimating the variance of statistical measures by generating a large number of synthetic random bootstrap samples of the same size as the original sample and using these in addition to the original sample Regress employs a parametric bootstrap to assess goodness of fit for stochastic models and a non parametric bootstrap to estimate confidence intervals In general the accuracy of bootstrapping increases with bootstrap sample size B 3 1 Parametric Bootstrap A parametric bootstrap sample of size N is synthesized
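Both flavors of bootstrap just described can be sketched in a few lines of Python (illustrative only: this is not Regress+'s implementation, and the exponential model, sample size, and seed below are arbitrary placeholders):

    import numpy as np

    rng = np.random.default_rng(7)
    data = rng.exponential(scale=4.5, size=100)       # stand-in for an observed sample
    A_hat = data.mean()                               # ML estimate of A for an exponential model
    B = 1000                                          # number of bootstrap samples

    param, nonparam = np.empty(B), np.empty(B)
    for b in range(B):
        # Parametric bootstrap: draw a synthetic sample from the *fitted model*
        param[b] = rng.exponential(scale=A_hat, size=data.size).mean()
        # Non-parametric bootstrap: resample the *data* with replacement
        nonparam[b] = rng.choice(data, size=data.size, replace=True).mean()

    # Central 95-percent confidence limits from the sorted empirical distributions
    print(np.percentile(param,    [2.5, 97.5]))
    print(np.percentile(nonparam, [2.5, 97.5]))

When the model is adequate, the two intervals come out roughly the same, which is exactly the behavior described for the Confidence Intervals option.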
51. e model The estimate with the largest residual is flagged with an asterisk Sinput file h2p in Built in deterministic models are listed in Appendix A 10For an example with predictions see Hale_Bopp CN pred in CHAPTER 11 SETUP 68 h2p 28 points y A Btx C xA2 D xA3 E xA4 Model Optimization Criterion Bootstrap Samples K Least Squares 1 4 Y _ Minimum Deviation Options __ Confidence Intervals Simulated Annealing _ Test Residuals _ List Data Y Use Weights Predictions Figure 11 5 Initial Setup Dialog for H Example 11 2 3 Simulated Annealing With deterministic models the initial parameter values default to one but this choice often does not lead to an acceptable fit especially when there are more than three parameters Moreover it might be difficult to make good initial guesses for the parameters To facilitate this process the Simulated Annealing option permits entry of a finite range of values for the parameters using a Constraint dialog instead of the usual Parameter dialog If this option is chosen here then the Constraint dialog appears Figure 11 6 shows this dialog after suitable guesses have been entered If lower bounds are set equal to upper bounds then the corresponding parameters will be set Constant Regress uses these constraints to perform an adaptive simulated annealing analysis The benefit of simulated annealing is that it is not a greedy algorithm That is it tri
52. ectron volts as a function of the separation in Angstroms between the two protons It clearly shows that the energy goes through a minimum at a little over 1 A 107 m When the protons get closer than that the energy rises sharply due to their mutual repulsion As their separation gets larger and larger the energy levels off At that point the system is just a neutral hydrogen atom plus a distant proton Thus the three particles will stay together However the energy minimum is very shallow indicating that their tendency to stay together is not particularly strong and so this chemical bond is easy to break What spoils this nice result is the fact that DMC is not exact Each of these replicate computations gave a slightly different energy for the same H H distance As noted on the previous page data information error and here we can actually see the error but we cannot measure it since we do not know the true answer easy to see if you enlarge this document CHAPTER 3 DATA VS INFORMATION 24 So we have a new problem separating the information from the error Only the former will tell us what we want to know namely the behavior of H In order to proceed we have to know either something about the information or something about the error Whatever this something is we might be able to use it to effect some separation It is unlikely however that we will be 100 percent successful in any case and will end
53. ed for ease of use The first is to drag the Regress app to the Dock so that it is readily available The second is to open the 58 CHAPTER 9 OVERVIEW 59 Examples folder select any file in the Input folder and select File Get Info Command D In the Get Info dialog set Regress as the app associated with the selected file Click the Change All button Thereafter double clicking any file with extension in will open Regress with that file as input Regress requires Mac OS 10 7 Lion or greater These installation instructions are duplicated in the README txt file Release notes are in the RELEASE _ NOTES txt file 9 2 Examples All of the examples cited in this User Guide have corresponding input and output files in the Examples folder Chapter 10 Input can be created with any software that will output plain unstyled text However Regress does expect its contents to be formatted in a way that the parser will understand As a reminder of this all Regress input files must have the extension in Otherwise the file will disabled in the file dialog and it will be rejected when using drag and drop or double clicking There must be at least seven points in the input file and perhaps more for some models This restriction is required so that the internal processing that Regress carries out given Setup options will work all the time With deterministic input the seven points must be unique see below Ts is nothing rea
54. ent This can occur even though the value of R is very close to one its maximum If systematic error is present with 99 percent confidence then a warning to this effect is added to the Report 13 5 Overparametriztion Occasionally a model will contain two parameters where there should be only one For instance if two parameters appear only as a ratio then that ratio should be a single param eter When this is the case there will be an infinite number of parameter pairs that give the same global optimum and there will be no unique solution Regress might converge but it will converge to an arbitrary combination of the two parameters The only solution for situations like this is to rewrite the model typically a User model in a different form with fewer parameters 2 Another possibility is a scale parameter in a denominator with the whole fraction raised to an exponent another parameter CHAPTER 13 WHAT COULD POSSIBLY GO WRONG 82 13 6 Wishful Thinking Finally there is you It is always possible that good intentions notwithstanding when it comes to modeling your level of expertise might be insufficient to ensure success Regress makes a lot of difficult things easy For example entire books have been written on methods for finding optimum parameters for a Weibull distribution yet with a few mouse clicks you can not only find these parameters but assess their variability and the goodness of fit of the model as well
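To make the overparametrization problem of Section 13.5 concrete, here is a tiny Python sketch (a hypothetical model, purely illustrative) in which two parameters enter only as a ratio, so very different (A, B) pairs give exactly the same optimum:

    import numpy as np

    x = np.linspace(1.0, 10.0, 20)
    y = 2.5 * x                                   # "data" generated with A/B = 2.5

    def sse(A, B):
        r = y - (A / B) * x                       # A and B appear only as a ratio
        return np.sum(r * r)

    # All three pairs have A/B = 2.5, so the sum-squared error is zero for each:
    print(sse(2.5, 1.0), sse(5.0, 2.0), sse(50.0, 20.0))

No optimizer can prefer one of these pairs over another; the only remedy is to rewrite the model with the ratio as a single parameter.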
55. ers when the error model is the same for all points. We shall assume that we have independent Gaussian errors. Then the likelihood L is given by

    L = \prod_{k=1}^{N} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\varepsilon_k - \mu}{\sigma}\right)^2\right]    (5.7)

In log space the product again becomes a sum, and the likelihood of the errors will be maximized iff this sum is maximized. First, find the log-likelihood:

    \mathrm{logLik} = -N \log\left(\sigma\sqrt{2\pi}\right) - \frac{1}{2} \sum_{k=1}^{N} \left(\frac{\varepsilon_k}{\sigma}\right)^2    (5.8)

logLik will be maximized iff the second sum is minimized. However, \varepsilon_k = y_k - g(x_k). Therefore, since \sigma > 0,

    \max(\mathrm{logLik}) \iff \min \sum_{k=1}^{N} \varepsilon_k^2 \iff \min\left[\sum_{k=1}^{N} \left(y_k - g(x_k)\right)^2\right]    (5.9)

In other words, an unweighted deterministic model with unbiased (\mu = 0) Gaussian errors will have ML parameters if and only if the last bracketed expression, the so-called sum-squared errors (SSE), is minimized. For this reason, the procedure described here is termed least squares. With all but the simplest models, the computation is done numerically. As one example, the ML parameters for the daytime model (4.13) are as follows:

Table 5.1: ML Parameters for Daytime Model

    A          B             C          D
    183.325    0.00273605    1.39082    728.424

These parameters were found by searching the parameter space for an SSE minimum. The quality of the SSE value for this dataset and model (767.92) will be discussed later but, obviously, it is quite good (see Figure 4.11). Other error models are sometimes used. In general, each will give an optimum set of parameters by maximizing some function of the data
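As a numerical sanity check of Equation 5.9, the following minimal Python sketch (not Regress+ output; the synthetic data, noise level, and starting values are assumptions) fits the daytime model of Equation 4.13 by minimizing the SSE with the Nelder-Mead method. Because the errors are Gaussian, the same parameters also maximize the log-likelihood of Equation 5.8.

    import numpy as np
    from scipy.optimize import minimize

    # Synthetic "daytime" data: the sine-wave model of Eq. 4.13 plus Gaussian noise
    rng = np.random.default_rng(1)
    day = np.arange(1, 366, dtype=float)
    true_p = (183.325, 0.00273605, 1.39082, 728.424)      # Table 5.1 values

    def model(p, x):
        A, B, C, D = p
        return A * np.sin(2 * np.pi * B * x + C) + D

    y = model(true_p, day) + rng.normal(0.0, 2.0, day.size)

    def sse(p):                                           # sum-squared errors (Eq. 5.9)
        r = y - model(p, day)
        return np.sum(r * r)

    fit = minimize(sse, x0=[180.0, 0.003, 1.0, 700.0], method="Nelder-Mead",
                   options={"maxiter": 20000, "maxfev": 40000})

    # For Gaussian errors, logLik = -SSE/(2*sigma^2) + const (Eq. 5.8), so the SSE
    # minimum found here is also the likelihood maximum.
    print(fit.x, fit.fun)          # parameters should land near the Table 5.1 values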
56. es are measured again and again with improving accuracy and precision An example is shown in Table 1 5 Here in chronological order are the best experimental values for the charge on an electron one of the most fundamental of physical constants The precision of these 5The full moon is about 30 minutes of arc in diameter CHAPTER 1 SOMETHING ABOUT IT OVADRANS MVRALIS Ad da cd Abd Figure 1 1 Mural Quadrant of Tycho Brahe 1598 values is indicated by one or two digits given in parentheses after the value These par enthetical digits correspond to the estimated uncertainty in the rightmost digit s of the reported measurement We shall have a lot more to say about the quantitative meaning of this uncertainty but for now what matters is that the precision in this table is obviously CHAPTER 1 SOMETHING ABOUT IT 9 getting better It is also true that accuracy is improving as well but there is no way to tell that just by looking at the numbers Table 1 5 Measured Values of the Electronic Charge Year Charge C x10 Ref 1906 1 0 13 1913 1 592 3 7 1930 1 591 2 8 1941 1 6015 4 5 1960 1 60154 3 14 1975 1 6021892 46 6 1986 1 60217733 49 9 1998 1 602176462 63 9 2002 1 60217653 14 9 2006 1 602176487 40 9 2010 1 602176565 35 9 Figure 1 1 and Table 1 5 demonstrate that the need for the best possible data is a con tinuing concern One reason for this is that t
57. es to find a global optimum rather than just the closest optimum Parameter spaces are often very convoluted with many optima That is why setting all parameters to one might not converge to the desired result Simulated annealing is not guaranteed to do better but it usually does provided that the constraints supplied are reasonable Once this phase has terminated the constraints are released and Regress converges to the optimum set of parameter estimates in its usual fashion Note that simulated annealing is used only to find the optimum parameters not for the Confidence Intervals analysis if any ll These defaults do not work for this example CHAPTER 11 SETUP 69 Starting Constraints y User Model Lower Bound Upper Bound A 0 5 B 0 5 c 0 5 D 50 0 E 0 So Cancel ok Figure 11 6 Constraint Dialog with New Values 11 3 User Model With deterministic models Regress permits the selection of a user defined equation The initial User dialog appears as follows User Model y JA B x Cancel HORES Figure 11 7 Initial User Dialog This dialog allows the entry of a user defined RHS for the model Primitive operators are the same as in the C language plus an additional exponential operator Parameters are A J and must be used in that order Further functionality is listed in Table 11 1 The dependent variable y may not appear on the RHS Everything is case sensitive CHAPTER 11 SETUP Tabl
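For readers who want to experiment outside Regress, a bounded global search in the same spirit as the Simulated Annealing option can be sketched with SciPy's dual_annealing routine (an analogy only: this is not the adaptive algorithm Regress uses, and the model, data, and bounds below are placeholders):

    import numpy as np
    from scipy.optimize import dual_annealing

    x = np.linspace(0.5, 5.0, 28)
    rng = np.random.default_rng(0)
    y = 3.0 * np.exp(-1.2 * x) + 0.5 + rng.normal(0.0, 0.01, x.size)

    def sse(p):                     # placeholder model: y = A*exp(B*x) + C
        A, B, C = p
        return np.sum((y - (A * np.exp(B * x) + C)) ** 2)

    # Finite lower/upper bounds for each parameter, analogous to the Constraint dialog
    bounds = [(0.0, 10.0), (-5.0, 0.0), (0.0, 5.0)]
    result = dual_annealing(sse, bounds, seed=1)
    print(result.x, result.fun)

As with Regress, the bounded search only has to land in the right basin; a local optimizer can then refine the answer to full precision.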
58. etc required by the general user In addition some technical details are provided in an appendix so that experts in the field may have the opportunity to assess Regress methodology in the light of their own experience A related document the Compendium of Common Probability Distributions has been published separately This is an encyclopedia with 59 entries including all of those built into Regress MICHAEL P MCLAUGHLIN MCLEAN VA JUNE 2013 MPMCL AT CAUSASCIENTIA ORG Vil Part I Data Chapter 1 lt something about it stark and monotonous as ever making me wonder why I habitually chose a window seat We were about halfway between Washington D C and Dallas Fort Worth flying at 40 000 feet and all I could see were the mounded tops of clouds and the sky Not a bright blue sky worthy of this clear November morning but a sky of a more somber hue an indigo intimation of the blackness lying in wait far above us It was bleak and freezing out there Rather boring as well if you didn t know better I knew better At this altitude where the troposphere thins out to become the strato sphere free neutrons were whizzing about at some 2 km s and smashing into everything in sight their sight not mine From their sub nano perspective it was far from boring Think Bob Dylan Ballad of a Thin Man something is happening here but you don t know what it is do you Mister Jones The atmosphere was thin and
59. g beans on farms for three dollars a day The titration alluded to above was the final step in a day long experiment to determine the percentage of protein in a fish It began by carefully weighing three small samples of the fish Our procedures were very rigorous with protocols worthy of a forensic lab so we did everything in triplicate at a minimum Each sample was placed in a Kjeldahl flask along with measured amounts of concentrated sulfuric acid sodium sulfate and mercuric oxide to act as a catalyst This mixture was allowed to boil for about half an hour until the entire solution turned crystal clear and colorless After the solutions had cooled to room temperature excess sodium hydroxide was added to each flask which was quickly stoppered with a tube running into an Erlenmeyer flask containing a known excess amount of dilute hydrochloric acid HCl The Kjeldahl contents were then boiled some more to force all of the liberated ammonia gas into the HCl where it was neutralized Since the amount of HCl was in excess of the ammonia there was some HC left over The final step back titrating with standard sodium carbonate solution was done to quantify this excess and by subtraction determine the amount of HCl that it took to neutralize all of the ammonia generated from the protein in a known quantity of fish Ti was a trick to it of course I knew that there must be one and sure enough thanks to a boiling point gt 400 C 27
60. han the first, it is customary to shift the origin to the mean and consider moments about the mean instead of about zero. These central moments, M_k, are defined in Equation 2.2, where the overbar denotes an expectation:

    M_k = \overline{(y - \bar{y})^k}    (2.2)

[1] Occasionally a vector or matrix.

Here the fulcrum is the mean of the data instead of zero, so the datapoints are being compared to the mean, not to zero. With a little algebra, the second moment about the mean, known as the variance, can also be written in terms of the raw moments, as shown in Equation 2.3:

    M_2 = m_2 - m_1^2 = \overline{y^2} - \bar{y}^2    (2.3)

In other words, the variance is equal to the average of the squares minus the square of the average. Mean and variance are by far the most common statistics used to summarize a dataset. The mean describes, in its own way, the location of the data on the real axis. Were it unknown or undetermined, it would be thought of as a location parameter. The variance describes the spread of the data about the mean. This can be seen immediately by looking at two datasets, d1 and d2, with the same mean but different variances:

    d1 = {1, 2, 3, 4, 5, 6, 7, 8, 9}
    d2 = {4, 5, 6}

Both datasets have mean = 5, but Var(d1) = 20/3 and Var(d2) = 2/3. As measured by the variance statistic, the spread of d1 about its mean is ten times greater than that of d2. Variance denotes scale (size), not location. When unknown or u
61. he effects one is seeking by making measure ments are not necessarily large If they were then they would be easy to find but once you find the large effects you must then focus on smaller and smaller effects It might be tempting to ignore small effects but in science and other disciplines small exceptions are not always insignificant Very often the opposite is true The old saying that It is the exception that proves the rule is somewhat confusing to modern listeners because the meaning of prove is not what it used to be The English verb to prove originally meant to test as in the term proving ground so what this old proverb is really saying is that an exception tests whether a rule is valid or not If you find an exception however small it tells you that the rule is defective If that rule is thought to be a physical law then an exception indicates that the law is not a law after all and that the relevant theory is in need of adjustment This is not something that can be ignored Just as Tycho Brahe went to great effort and expense to collect his data contemporary scientists must often do the same Figure 1 2 is a computer generated schematic showing the ATLAS detector of the Large Hadron Collider LHC As you can appreciate judging by the four humans in the figure this detector is an extremely large complex and expensive instrument Figure 1 3 shows another internal view during construction 1 The ATLAS
62. he runs test.

6.3 Is One Model Better Than Another

It is often necessary to decide whether one model fits the data better than some other model. It would be nice if there were a good way to answer this question. However, in the context of traditional frequentist inference, there is no really good method (statistic) that will provide an unambiguous answer. One statistic that is often recommended is the Akaike information criterion (AIC), which is derived from information theory and which utilizes the log-likelihood of the data given the model:

    \mathrm{AIC} = 2k - 2\log(\mathcal{L})    (6.4)

where k is the number of parameters. An improved variant is the corrected AIC metric, AICc, which works well for small datasets as well as large ones:

    \mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{N - k - 1}    (6.5)

where N is the number of datapoints. How one computes the likelihood in this formula depends upon the specific analysis. For instance, the likelihood in an unweighted least-squares regression is just the likelihood of the residuals which, for Gaussian errors, is a function of 2\pi and SSE/N. To decide which of two models is better, compute AICc in each case and choose the model with the smaller value of AICc. Unfortunately, there are no robust criteria to decide how much smaller AICc needs to be in order to be meaningful in a given situation. This criterion takes the number of parameters into account. This is essential. You can, after all, fit any dataset perfectly if you employ enough parameters (see Appendix C).

Chapter 7
63. ike they are and Regress reports that a Gaussian model is very unacceptable then this result must not be overinterpreted In such a case Regress is saying only that the discrepancies from Normality are real not that they are of some practical consequence A sample of 1 000 independent observations contains enough information to make fine distinctions often distinction that you may discount with impunity Whatever model you choose you must be prepared to defend it More often than not there will be other with conflicting ideas If you declare that some errors are Laplacian not Gaussian then eventually you might have to provide an argument why this must be the case Merely to reply that Regress or some other software package says so will not prove a sufficient rebuttal for an expert audience Beware of wishful thinking Points that appear more or less linear are not necessarily so Likewise a histogram that is vaguely symmetrical with a hump in the middle is not necessarily Gaussian even if your textbook talks about root sum squares and nothing else There is a real Universe out there with real answers A good analyst will try to find them Appendix A Deterministic Models Table A 1 Built in Deterministic Models Name Formula Poly A B x C x 2 D x 3 E x 4 Expo Axexp Bx xx C ExpoPoly1 Axexp Bex Cx D x 2 E ExpoPoly2 Axexp Bex x C x72
64. is the Poisson model (blue), with empirical theta, superimposed on the data.

[Figure 4.6: Decay Counts Modeled as Poisson(13.61), Binwidth = 1 -- PDF vs. C-14 decay counts, N = 120]

[11] The Floor function is needed here because the Gamma functions take real arguments.
[12] For integer n, \Gamma(n+1) = n!.

Here is the same result but with a different histogram. The moral of this comparison is to be wary of using histograms to assess goodness of fit.

[Figure 4.7: Decay Counts Modeled as Poisson(13.61), Binwidth = 3 -- PDF vs. C-14 decay counts, N = 120]

No description of models for random variates would be complete without at least one example of the famous normal (Gaussian) distribution. There are many reasons why this continuous distribution is famous, but the main reason is that it is used so often to model so many things. The PDF of the normal distribution can be written in simple closed form (4.12):

    f(y) = \mathrm{GaussianPDF}(y;\,\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2\right]    (4.12)

where y is the random variate, \mu is the mean, and \sigma is the standard deviation.
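A quick numerical check of Equation 4.12 (a minimal Python sketch, not part of Regress): the density integrates to one and peaks at y = mu.

    import numpy as np

    def gaussian_pdf(y, mu, sigma):
        """Normal PDF of Eq. 4.12."""
        z = (y - mu) / sigma
        return np.exp(-0.5 * z * z) / (sigma * np.sqrt(2.0 * np.pi))

    mu, sigma = 0.0, 1.0
    y = np.linspace(-8.0, 8.0, 200001)
    dy = y[1] - y[0]
    area = np.sum(gaussian_pdf(y, mu, sigma)) * dy     # ~1.0 (normalized)
    mode = y[np.argmax(gaussian_pdf(y, mu, sigma))]    # ~mu (the mode)
    print(area, mode)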
65. l CDF with Empirical Mean 35 Carbon 14 Decays in One Minute 0 36 Decay Counts Modeled as Poisson 13 61 Binwidth 1 37 Decay Counts Modeled as Poisson 13 61 Binwidth 3 38 Standard Normal Gaussian Distribution 39 Synthetic Normal 0 1 Data N 1 000 39 A Binary Mixture Model o e 002 2000 40 Sine wave Model for Daytime Data o 41 Hi Data and Model lt lt lt lt lt lt lt 2 0 42 Hale Bopp Model unweighted o o 47 Hale Bopp Model weighted o e 47 CDF Plot for Salaries Data and Model 50 Salaries Probability Plot for Mixture Model dl Salaries Probability Plot for Gaussian Model Unacceptable 52 Initial Setup Dialog for BattingAvg Example 64 Setup Dialog with Gumbel Model 00 64 iv 11 3 11 4 11 5 11 6 11 7 12 1 12 2 12 3 12 4 12 5 12 6 12 7 12 8 Cl Sample Dialog 22 622 24242884 OF 4S Ss SSK ESS OSE Es 66 Parameter Dialog os 24 44 4565 o be oe bee we ed 67 Initial Setup Dialog for H Example 004 68 Constraint Dialog with New Values o 69 Initial User Dialog 2 464 4 4 a4 RR A 69 Batting Average Display Dialog for Gumbel Model 12 Default Graph Dialog for Continu
66. lly special about a Regress input file It is just a textfile and 10 1 Input Format In general there are three kinds of records lines that are acceptable to Regress data comments and prediction requests Blank lines are always ignored Comments must begin with a semicolon and can be either full line comments in column 1 or appended to a data record Comments terminate at the end of a record Data records and prediction requests vary depending on whether the model is stochastic or deterministic These cases are discussed separately below It should go without saying that it is the user s responsibility to assure that all input is valid However Regress does some checking of its own Bad input files will generate an error message but there are many reasons why a file might be bad 10 2 Stochastic Input Stochastic input can be either continuous or discrete The latter can also be ungrouped or grouped 1 ASCII UTF 8 or UTF 16 with no accented characters 60 CHAPTER 10 INPUT 61 10 2 1 Continuous Data Continuous data are input as a column vector one value per line Regress recognizes that the data are continuous when there is a decimal point in at least one datum Otherwise Regress will assume that the data are integers and continuous models will be disabled If continuous data happen to be recorded as integers append 0 to one or more of them This will tell Regress that the data are meant to be continuou
67. mber of significant digits in the measured value indicating how many of them are reliably repeatable in replicate experiments Bias An offset from the truth often fairly constant We shall have much more to say about these terms in due course but the intuitive descrip tions above will be enough for now They are essentially correct The literature is replete with examples showing the extent to which scientists and others will persevere in their quest for the best possible data Even centuries ago the need for accuracy was well understood as the picture shown in Figure 1 1 makes clear This is an illustration 15 of the Great Mural Quadrant an astronomical instrument built in Denmark by Tycho Brahe in the late sixteenth century and used to determine the positions of stars and planets Brahe was nearly obsessed with a desire for accuracy and this quadrant was his most ambitious undertaking in pursuit of that goal It had a precision of six seconds of arc when measuring declinations elevations This plus a very good clock to measure distance along the perpendicular dimension as the Earth rotated gave celestial positions accurate to about one minute of arc which was world class at the time Celestial positions are important because they are the basis for making annual calen dars so getting them as accurate as possible is worth a lot of effort The same can be said of a large number of physical quantities As technology improves these quantiti
68. me combination of measurement error and modeling error By minimizing it one can get a better idea of what errorless data might look like i e the information e Quantify goodness of fit The process of quantifying goodness of fit does two things It tells us how good the model is that is how well it can act as a substitute for observation Also it allows us to compare two or more models to each other It is important to know when one model is good while another is of lower quality e Test an hypothesis To the degree that a model is good one may query the model instead of collecting additional data Therefore an hypothesis may be tested using the model This can be especially important when there is no possibility of collecting additional data A good model will likely suggest further experiments as well e Perform what if experiments Occasionally it is interesting to wonder what would happen if something contrary to experience were actually true This is one example of a what if experiment If the desired experimental conditions cannot be met then obviously one cannot do the experiment but one might be able to insert these conditions into a model The model output in such a case can sometimes be very illuminating Another purpose of a what if experiment is to test our understanding of the situa tion or phenomenon by considering a scenario that is thought to have occurred in the distant past and is no longer o
69. n When the model is good the K S value will be roughly the same regardless of the optimization criterion With discrete data the alternative is to minimize the chi square statistic This too is rarely done in the literature 11 1 3 Bootstrap Samples Regress makes considerable use of bootstrapping see Appendix B By default the boot strap sample size is 1 000 However for more precision this can be increased using the counter shown 11 1 4 Confidence Intervals Central confidence intervals can be estimated for parameters and predictions if any With stochastic models goodness of fit is determined by default and the precision of model parameters is therefore known assuming that the model is correct This computation is a parametric bootstrap If the Confidence Intervals option is chosen Regress carries out a non parametric bootstrap analysis that does not assume that the data are correctly modeled When the model is good these confidence limits will be roughly the same as those estimated by the parametric bootstrap analysis Confidence intervals vary slightly from run to run Their precision can be improved by using a larger number of bootstrap samples Details are in Appendix B 11 1 5 Generate Sample s As part of a parametric bootstrap analysis Regress must generate random samples from the optimum model This capability is exposed to the user as an option to create one or gt This choice appears only when the d
70. n by

    \mathrm{Median} = A\,\log(2)    (4.8)

where log, here and elsewhere, denotes the natural logarithm. In our model, the median is predicted to be 3.114 s. The observed median is 3.080 s. Once again, this small relative error indicates that we have a good model.

Further Examples

We have described a model for the intervals between 14C decays, but suppose instead that our data were recorded differently. If, instead of decay intervals, suppose we had recorded the number of decays in some fixed interval, e.g., one minute. Two hours' worth of such data would look something like this:

[Figure 4.5: Carbon-14 Decays in One Minute -- PDF histogram of C-14 decay counts, N = 120]

This dataset is discrete (integer values only). Therefore, it must be described by a discrete PDF. It can be shown that, if interarrival times are exponential, then counts per fixed interval will be Poisson (4.9):

    \mathrm{PoissonPDF}(x;\,\theta) = \frac{1}{x!}\,\theta^x \exp(-\theta)    (4.9)

where x is a positive integer (including zero) and \theta is a parameter. Since

    \bar{x} = \exp(-\theta) \sum_{x=0}^{\infty} \frac{x\,\theta^x}{x!} = \theta    (4.10)

we find that the parameter \theta is, once again, the mean of this distribution. Also,

    \mathrm{PoissonCDF}(x;\,\theta) = \frac{\Gamma(\lfloor x + 1 \rfloor,\ \theta)}{\Gamma(\lfloor x + 1 \rfloor)}    (4.11)

where \Gamma(\cdot) is the complete Gamma function, \Gamma(\cdot,\cdot) the incomplete Gamma function, and \lfloor\cdot\rfloor the Floor function. Here
71. ndetermined it is therefore a scale parameter In a similar fashion M3 describes the skewness or lopsidedness of a dataset about its mean and M describes the kurtosis or pointedness of the data These two statistics shape parameters will make more sense after we look at some Graphs in the following section Altogether these four central moments provide a rough summary of a dataset Quartiles Another group of statistics based on rank order becomes available once the data y are sorted from low to high Rank statistics do not use the values of the datapoints directly All that matters are the rankings To illustrate we shall use the data in Table 1 3 N 137 One set of rank statistics are the quartiles Given the sorted data the first quartile is the value that is 1 4 of the way from the beginning of the sorted list The second and third quartiles are 1 2 and 3 4 of the way along The second quartile is also called the median and is sometimes used as an alternative to the mean although they are not equivalent For instance if the highest batting average in Table 1 3 were 900 instead of 440 the mean would increase substantially but the median would not change at all Likewise the difference between the first and third quartiles called the interquartile range is often used as an alternative to the variance in order to describe the spread of the data Dividing the sorted data into four parts is traditional but other divisions are po
72. need for the capabilities that it provides In particular I required an application that would handle equations and probability distributions equally well with reliable estimates for goodness of fit and confidence intervals Moreover I wanted one that was user friendly For a Mac user developer that was de rigueur It also seemed probable that such a package might gain a broader audience Since its initial publication in 1998 Regress has been downloaded nearly 50 000 times by researchers students professors in 157 countries so the hypothesis appears valid This document is a sibling that was born out of a desire to elucidate the functionality of a new and much enhanced Regress 2 7 Data analysis includes much that is a obscure to most practitioners few of whom are certified professionals in the discipline I include myself in the majority I am a scientist not a statistician or mathematician a fact that will likely be apparent in the pages to follow In this book I have tried to explain various things in the manner in which I wish someone had explained them to me The book is divided into the three parts specified in the title Part I discusses data and Part II discusses mathematical modeling in general The latter was written for a broad audience so while the examples utilize Regress the software is not otherwise mentioned Part III is the Regress User Manual Here you will find all necessary description for input output options
73. new what it represented. Model parameters do not always have a simple interpretation, but this one does. To understand what A represents, you have to understand a bit more about what a PDF represents. One way to think about a continuous PDF is to imagine it to be a PDF histogram with infinitesimally narrow binwidth, symbolized dx, describing a sample of infinite size. Then the height of the PDF curve for any x equals the probability density of x which, in turn, is proportional (not equal) to the probability of x for that distribution. Note that probability density can be greater than one, often much greater, but it cannot be less than zero. If a PDF, f(x), is normalized as described in Chapter 2, then it can be used as a weighting function for the purpose of computing a continuously weighted average. For any arbitrary function of x, g(x), the expectation (mean) of g(x) would then be given by the definite integral over all x shown in Equation 4.2:

    \overline{g(x)} = \int_{-\infty}^{\infty} g(x)\,f(x)\,dx    (4.2)

In Equation 4.2, the product f(x) dx is the probability of x, so the integrand is a weighted probability of g(x) for any x, and the integral adds up an infinite number of these probabilities. If g(x) = 1, then this sum equals the total area under the curve (= 1).

[5] Technically, the probability that x is in the infinitesimal interval (x, x + dx).

Were the PDF discrete, this definite integral would be replaced wi
74. ng the RHS of this equation with respect to each parameter, setting the respective derivatives equal to zero, and solving the resulting set of nonlinear equations, taking care to select the solution that gives a maximum. Sometimes this is easily done, especially in log space. For instance, consider the exponential distribution defined earlier (4.1). Here there is only one parameter, A, so in this case the math is very easy:

    f(y) = \frac{1}{A} \exp\left(-\frac{y}{A}\right)    (5.2)

Now write down the likelihood of N exponential variates in log space:

    \log\mathcal{L}(y\,|\,A) = -N\log(A) - \frac{1}{A} \sum_{k=1}^{N} y_k    (5.3)

Differentiate (5.3) with respect to A and set the derivative equal to zero:

    \frac{\partial \log\mathcal{L}}{\partial A} = -\frac{N}{A} + \frac{1}{A^2} \sum_{k=1}^{N} y_k = \frac{1}{A}\left[\frac{1}{A}\sum_{k=1}^{N} y_k - N\right] = 0    (5.4)

Since A > 0, (5.4) will be equal to zero iff the bracketed factor equals zero. Hence,

    A_{\mathrm{ML}} = \frac{1}{N} \sum_{k=1}^{N} y_k = \bar{y}    (5.5)

It can be shown, by substitution or from the second derivative, that this value gives a maximum of the likelihood, not a minimum or a saddle point. Therefore, in the exponential distribution, the ML value for A is just the mean of the variates. It seldom happens that an ML parameter value is a simple function of the data. Usually, one must find ML values by solving simultaneous equations numerically. The exponential distribution is an exception, as are the Gaussian, Binomial, and Poisson distributions (q.v.). For stochastic models, ML parameters are considered optimal in the sense described above. This property
75. odels models expressed in the language of mathematics and refer to them hereinafter simply as models There are several reasons why one might wish to devise a model e Describe the data mathematically Why are my numbers not all the same It is difficult to overstate the degree to which the language of mathematics enables us to understand Nature Here we note two features of particular importance Analytic form The analytic form of a model the formula provides a huge amount of infor mation about the data The fact that one model gives a good description and a similar model does not is usually highly suggestive Parameter values The values of the model parameters are likewise informative Quite often these parameters represent constants of Nature and many models are developed in order to determine these constants and interpret them in the context of some physical theory e Summarize the data A model might be used solely as a simple formula for regenerating an observed dataset In this role it could also be used for interpolation Extrapolating a model to an unobserved part of its domain however is very risky CHAPTER 4 MODELS IN THE REAL WORLD 29 e Minimize error Once a model of some chosen form is optimized meaning that it now contains whatever parameters best reproduce empirical data then the amount of variation not explained by the model is minimized This residual variation is typically so
76. of plot shows the optimum model as a gray line and the data as black dots Were the model perfect the data would fall on the line exactly In general the extremes of the data are usually near the line but not on it The latter two graphs are shown in Figures 12 3 12 4 In the CDF and probability plots the label on the abscissa has been edited from its default Y to that shown using the Axes button in the Graph window Axes labels are in this case freely editable However the range and tick marks are not editable This is partly to ensure that nothing is hidden in Regress output In very rare cases the top of a graph might have to be truncated 6This is not always true with deterministic models CHAPTER 12 OUTPUT AND MENUS 74 a A O 0 28 0 47 Maximum Batting Average Figure 12 3 Batting Average Gumbel CDF 99 9 e e e e 2 T o o 90 D a 0 28 0 47 Maximum Batting Average Figure 12 4 Batting Average Gumbel Probability Plot CHAPTER 12 OUTPUT AND MENUS 13 With discrete stochastic models only a PDF graph is available cf Examples Hyphens in Here the model is shown in black superimposed on the histogram With deterministic models the graphs are different especially 1f the data are weighted With the weighted Hale_Bopp CN pred in data modeled as Expo Eq 5 11 the default graph window is as follows O O Hale_Bopp CN pred V Show Model M Weights Axes 300 gt 0 2 7 x Figure 1
77. oints in a bin is called the bin frequency At last we have something to put on that second axis frequency Doing so produces one type of histogram a frequency histogram With this dataset we get the histogram shown in Figure 2 1 50 40 Frequency 3 N o 10 325 350 375 400 425 450 BattingAverageMax x 1000 Figure 2 1 Frequency Histogram for MLB Batting average Maxima Figure 2 1 and Table 1 3 present the same data in two different ways The table gives actual values while the figure shows a picture derived from those values One cannot do much analysis given only a picture but as a summary it can provide useful information For instance Figure 2 1 shows the location and spread of the data as well as some sense of relative frequency Batting average maxima in the range 350 375 are common this is the tallest bin roughly the mode However maxima of 425 or more are uncommon What is not clear from Figure 2 1 is that it is somewhat arbitrary We defined five bins but we could have defined more or fewer In fact we could have put all the data into one bin or at the other extreme created 116 bins one for each value in the observed range Both of these choices produce valid but useless graphs There is no universally accepted rule of thumb for selecting the number of bins in a histogram One simple approach is to pick a binwidth near the square root of the number 6By convention the left bin boundary is inside the bin b
78. om errors described by the same distribution then all points are equally weighted otherwise each point must have its own weight which must be known Once again we shall defer parameter optimization to the next chapter and simply present the results for two examples To illustrate a model for equally weighted i e unweighted data consider the data shown in Figure 2 6 This looks a lot like a sine wave although it is more complicated than that If we ignore the complications and model it as a sine wave then the model is that given in 4 13 y A sin 2r Bz C D 4 13 where A amplitude B frequency 1 period C phase and D offset A plot of the data and model with optimum parameters is shown in Figure 4 11 Clearly this sine wave model is not a bad fit at all and we would probably not hesitate to use it in many applications e g to determine the period 1000 Daytime min 500 1 1100 Day Figure 4 11 Sine wave Model for Daytime Data CHAPTER 4 MODELS IN THE REAL WORLD 42 As an example with weighted datapoints that is datapoints of differing precision we can use the dataset shown in Figure 3 1 for the H experiment The modeling error for point k is distributed as Normal 0 o We can use the average of the 10 replicates as the kt datapoint and the empirical 1 0 as an appropriate weight for that point A very good model for this dataset albeit more complicated than those in the literature
79. onsequence of the weak force and the laws of Quantum Mechanics which have been experimentally validated to a dozen decimal places require that no one will ever figure it out In fact for reasons that you can read elsewhere it is not even a meaningful question 3 If you observe a gram of pure carbon freshly extracted from the environment and record the time intervals between successive carbon 14 decays you will get a dataset much like that shown in Table 1 1 As described above these numbers are random unpredictable Table 1 1 Carbon 14 Decay Intervals s in 1 g of Natural Carbon 31 149 12 16 09 37 74 35 29 80 10 52 12 78 67 0 8 135 11 15 6 7 08 11 19 31 75 71 94 26 08 2 7 2 0 83 79 160 0 5 18 33 24 10 0 8 0 1 63 88 19 43 06 02 42 184 10 3 These observations are also independent In non mathematical language this says that they do not have any influence on each other Note that independence is distinct from randomness one does not imply the other Yet another property of the observations measurements in Table 1 1 is that they are continuous meaning that there is no limit as to how close they can be to each other A continuous value can be any real number In contrast discrete measurements are usually integers most often starting at zero A discrete dataset can be obtained by repeating the CHAPTER 1 experiment above but this time just counting how many carbon 14 atoms in the sample decay during a fixed period If you do thi
80. onsidered constant If the corresponding box is checked the initial values will not be changed and the number of parameters will be reduced accordingly Confidence intervals cannot be computed for constant parameters This is one way to see what a sample from the optimum model should look like 5which means that they must be valid This capability is intended primarily for aesthetic purposes 7Regress computations will begin as soon as the OK button in the Parameter dialog is clicked CHAPTER 11 SETUP 67 Starting Parameters y Gumbel A B Parameter Constant A B 2 19229E 02 C n a D n a E n a Cancel OK j Figure 11 4 Parameter Dialog 11 2 Deterministic Models As an example of a deterministic model consider again that used to describe the energy of the hydrogen molecule ion 4 12 5 This example requires a user defined model discussed in the following section The Setup dialog show below Fig 11 5 is what is available for deterministic models when there are no prediction requests The optimization criteria are different as are the predictions However the Confi dence Intervals option is the same as previously described There are three new options 11 2 1 Test Residuals The residuals of the model should be random as described in Section 6 2 However the test is optional 11 2 2 List Data This option generates an output file listing the original data along with the Y values esti mated by th
81. onvenient summary It is the latter that a pilot sees in the cockpit The same is true in data analysis There are no real shortcuts for the analyst but when reporting to a broader audience analytical results must be summarized in a way that is easy to understand Even the analysis itself can benefit by considering various summaries of the data Summaries may be quantitative statistics or pictorial graphs Both can be used to describe stochastic data as well as deterministic data Pe approach for the first of two flights I am halfway there Our runway is dead unless it is tilted from vertical in which case it remains at a constant angle while it precesses around the vertical axis 12 CHAPTER 2 DATA SUMMARIES STATISTICS AND GRAPHS 13 2 1 Statistics A statistic is a number that can be computed from the data alone using a formula that contains no parameters unspecified quantities in the model adjusted somehow to suit the analysis Every statistic is designed to quantify some aspect of the data giving a perspective with a clear physical interpretation Thus a judicious collection of statistics will provide a short facile description of the entire dataset The literature contains a huge number of statistics of various kinds Some of these are highly specialized and were proposed for use in very narrow circumstances Others are extremely common not only because they are easy to understand but also because they arise naturally
82. or bin 430-440. This line of thought leads eventually to the concept of probability. Probability may be defined in more than one way but, consistent with the foregoing discussion, we shall adopt the frequentist approach, with which one imagines, rightly or wrongly, that somewhere out there is a parent population of potential experiments or measurements with as-yet unknown outcomes. These putative experiments might be real, something that could actually be done, or simply thought experiments. The set of all possible outcomes of these experiments contains a subset, perhaps empty, that corresponds to some predefined event, \mathcal{E}, which is of particular interest. The probability of \mathcal{E}, in the frequentist sense, is then defined as follows:

    \mathrm{Prob}(\mathcal{E}) = \frac{\#\ \text{of outcomes corresponding to}\ \mathcal{E}}{\#\ \text{of all possible outcomes}}    (2.4)

This definition is not mathematically rigorous and, strictly speaking, it is true only in the limit as the denominator approaches infinity, but it captures the essence of the concept. Probability quantifies the chances of a hypothetical event. By convention, a probability of zero means that the event is impossible, while a probability of one means that it is certain. Consequently, probability is a real number in the range [0, 1]. Moreover, the sum of the probabilities for all possible outcomes must add up to one, often expressed as 100 percent. To illustrate, consider the data
83. ous Stochastic Models 13 Batting Average Gumbel CDF o 74 Batting Average Gumbel Probability Plot 74 Default Graph Dialog for Weighted Hale Bopp Exponential Model 75 Hale Bopp Model wiith Logarithmic Y axis 76 Default Report for BattingAvg pred in ooo ee TI SaveAs Dialog for Regress Graphs o a 78 One Example of the Weierstrass Theorem 88 List of Tables 1 1 Carbon 14 Decay Intervals s in 1 g of Natural Carbon 3 1 2 Carbon 14 Decays in 1 min oaoa o 4 1 3 U S MLB Batting average Maxima 1876 2012 4 1 4 Daytime min in Boston MA 0 022000007 6 1 5 Measured Values of the Electronic Charge 9 3 1 Distance Between Two Points in Unit Circle experimental 24 5 1 ML Parameters for Daytime Model 45 5 2 Rate of Production of CN in Comet Hale Bopp 46 5 3 ML Parameters for Hale Bopp Regressions 48 7 1 ML Parameters for Salaries Data and Model 53 7 2 Salaries 95 Confidence Limits for ML Parameters 55 11 1 User model Functionality 2 526452 eb ee eee ee eee es 70 12 1 Goodness of fit Percentiles oaoa 20000 T2 A l Built in Deterministic Models aaa aaa e 83 vi Preface The original motivation for creating the Regress modeling package was my personal
84. out of the mathematics of analysis theory and are especially robust and trustworthy. We shall see examples of both kinds. In this section, we discuss statistics that quantify the overall extent of a dataset, that is, how some specific data compare to numbers in general. This means examining a given set of numbers apart from any relationships they might have to other numbers. Therefore, we shall introduce statistics as they might apply to random variates.

Moments

The term moment is borrowed from the domain of physics, specifically mechanics. There it refers to the tendency of a force to rotate an object. Numerically, it is equal to the product of the size of the force times the distance between the point where the force is applied and a fulcrum about which rotation might be possible. In statistics, there is the corresponding notion of a raw moment. Here the fulcrum is zero (the origin) and the force comes from the numbers in the dataset. There are an infinite number of raw moments, but only the first four, m_1, m_2, m_3, m_4, are of any interest. The kth moment is defined in Equation 2.1, where N is the number of points:

    m_k = \frac{1}{N} \sum_{i=1}^{N} y_i^k    (2.1)

The first raw moment, m_1, is just the arithmetic mean of the data. For reasons that will become clear later, the mean is also called the expectation (the expected value) of y, a random variate. In general, the expectation (average) of y^k is equal to the kth raw moment of y. For moments higher t
85. pecially with deterministic data.

A PDF can also tell us the value for the mode, assuming for now that there is only one. The exponential distribution has its mode at zero but, in general, with continuous PDFs you find the mode by setting the derivative of the PDF to zero and solving for the root of that equation. We shall demonstrate this later when we discuss the Normal distribution.

Cumulative Distribution

The area under any portion of a PDF equals the probability that the variate will be found in the corresponding range. If the PDF is continuous, then this area is found by integrating the PDF over the range(s) of interest. For instance, in our present example, the probability P that a ¹⁴C decay interval x will be observed in the range 5 < x < 10 seconds is computed by integrating our PDF over that range:

P = ∫_5^10 (1/λ) exp(−x/λ) dx ≈ 0.2206, with λ = 4.4929 s    (4.6)

or about 22 percent. Check Figure 4.3. Does this answer look right? In the actual dataset, there are 2,185 observations in this range (21.9 percent).

⁷Even when they are not particularly good models.

If instead we integrate a PDF from its theoretical minimum, x_min, to some arbitrary x > x_min, we obtain the cumulative distribution function (CDF), sometimes called simply the distribution. For any random variate X, CDF(x) is the probability that X < x. For the exponential distribution,

CDF(x) = ∫_0^x (1/λ) exp(−t/λ) dt = 1 − exp(−x/λ)
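Since the exponential integral in (4.6) has a closed form, the quoted value is easy to verify; a minimal check using the λ given above:

```python
import math

lam = 4.4929                         # mean decay interval, seconds
# P(5 < x < 10) = exp(-5/lam) - exp(-10/lam) for an exponential distribution
p = math.exp(-5.0 / lam) - math.exp(-10.0 / lam)
print(round(p, 4))                   # ~0.2206, i.e. about 22 percent
```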
86. r with this kind of plot, with various models and sample sizes, to appreciate when the results are good or bad.

Strictly speaking, a non sequitur, but again this is the usual procedure.

Figure 6.2: Salaries Probability Plot for Mixture Model (Percentile vs. Academic Salaries, hundreds of dollars)

To see what an unacceptable model looks like, we can use a Gaussian distribution with this dataset instead of the mixture model. Even with the ML parameters, the K-S value is now 0.0648411, and this falls in the 99th percentile of its sampling distribution, indicating a very poor fit. The probability plot is shown in Figure 6.3.

6.1.2 Discrete Models

The most common metric for assessing the fit of a discrete model to some data X is the Chi-squared statistic (6.1):

χ²_ν = Σ_{k=1}^{N} (x_obs,k − x_exp,k)² / x_exp,k    (6.1)

where obs = observed, exp = expected (from the model), and ν = degrees of freedom = N − 1.

As one example, we can use the ¹⁴C decay counts, which we modeled as Poisson (Fig. 4.6). This looks like a reasonable fit judging by the histogram, but the Chi-square test is a much better criterion. Here, χ² = 14.8359. This value falls in the 16th percentile of its sampling distribution, which is not significant (improbably large) at all, and so we accept the ML Poisson model as valid.

The expected value for Chi-square is ν.
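A sketch of the same Chi-square computation, using synthetic Poisson counts rather than the actual ¹⁴C data; the rate of 13.6 counts per minute and the sample size of 120 are assumptions for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
counts = rng.poisson(lam=13.6, size=120)          # synthetic decays-per-minute counts

lam_hat = counts.mean()                           # ML estimate of the Poisson parameter
k = np.arange(counts.min(), counts.max() + 1)
observed = np.array([(counts == ki).sum() for ki in k])
expected = counts.size * stats.poisson.pmf(k, lam_hat)

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = len(k) - 1                                  # nu = N - 1, following Eq. 6.1
percentile = 100 * stats.chi2.cdf(chi2, dof)      # percentile of the sampling distribution
print(chi2, dof, percentile)
```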
87. rest of the carbon on Earth and does what carbon does all the time. For instance, it forms carbon dioxide, which is taken up by plants, which are then eaten by animals, etc. With no effort at all, ¹⁴C gets spread evenly throughout the biosphere, along with the far more common ¹²C and ¹³C.

Still, that eighth neutron is, in some respects, one neutron too many and, as a result, carbon-14 is not stable. It will slowly decay, all by itself, back into nitrogen-14:

¹⁴C → ¹⁴N + e⁻ + ν̄    (1.2)

Moreover, it will do so even when it is part of a molecule, any molecule, anywhere. If your body has a mass of 80 kg (176 lbs), then it contains about 14 kg of carbon, an extremely small fraction of which is carbon-14. Of course, atoms are also extremely small, so your body contains a huge number of ¹⁴C atoms in spite of their rarity. Adding everything up, the reaction above (Eq. 1.2) is occurring inside of you more than 3,000 times per second. On Earth, living organisms are all radioactive.

When an organism dies, the carbon-14 it contains continues to decay but is no longer replenished by eating or respiring. This fact, as you probably know, is the physical basis of the carbon-dating technique. More noteworthy for our purposes is that this beta decay is an example of a process that is inherently random. The time until a particular carbon-14 atom decays is completely unpredictable in principle, not just because nobody is smart enough to have figured it out. This decay is a c
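The 3,000-per-second figure can be checked with a rough back-of-the-envelope calculation; the ¹⁴C/C atom ratio and half-life below are standard textbook values assumed for this sketch, not quantities taken from the text:

```python
import math

carbon_kg = 14.0                           # carbon in an 80 kg body (from the text)
ratio_c14 = 1.2e-12                        # assumed 14C/C atom ratio in living tissue
half_life_s = 5730 * 365.25 * 24 * 3600    # assumed 14C half-life, in seconds

atoms_c = carbon_kg * 1000 / 12.011 * 6.022e23   # total carbon atoms
atoms_c14 = atoms_c * ratio_c14                  # carbon-14 atoms
rate = atoms_c14 * math.log(2) / half_life_s     # decays per second
print(round(rate))                               # roughly 3,000 per second
```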
88. rical constraints of double-precision numbers.

10.3.1 Data

Unweighted input consists of a matrix with two columns. The first column contains the dependent (response) variable, y, and the second the independent variable, x. An example is Daytime.in.

Weighted input requires an additional column. Here, the third column contains not a weight but rather a measure of the uncertainty in the corresponding y value, typically a 1-sigma estimate of that uncertainty. A good example is Hale_Bopp_CN.in.

10.3.2 Predictions

Regress+ can predict the value of y given an x, assuming, as always, that the x value is valid. The prediction request format is the same as with stochastic input: val, with the symbol again in column 1. Example: Hale_Bopp_CN_pred.in uses the same Hale-Bopp data as above, but with three requests for predicted y.

⁵In general, Regress+ refers to any value in the first input column as y, whether the input is deterministic or stochastic.
⁶In general, a k-sigma estimate, provided that k is constant. This is especially useful when the Confidence Intervals option has been chosen (see Ch. 11).

Chapter 11: Setup

Setup dialog, showing relevant options.

The general appearance of this dialog, and the options available in it, vary depending on the nature of the input. Regress+ makes appropriate changes and/or disables user choices, here and elsewhere, whenever they are not applicable. Since much of Regress
89. rtainty, and these uncertainties are often shown on the plot as error bars. Accounting for variable weights requires only a slight change to (5.9), since σ is no longer the same for each point, as follows:

max logLik → min Σ_k [(y_k − ŷ_k)/σ_k]²    (5.10)

Here, the deviation of y from the model is normalized against its uncertainty, giving a weighted SSE.

Looking at the table above, a likely model for rate as a function of distance is a simple exponential model:

y = A exp(−Bx)    (5.11)

If we ignore the uncertainties shown in the table and do an unweighted regression, we get the results plotted in Figure 5.1. If we account for the uncertainties, the weighted results yield Figure 5.2 (with error bars). The unweighted regression treats all points equally, even though the second point, in particular, should not get as much weight, since it has an unusually large error. In the second plot, the curve is much farther from this point but still close to its error bar.

Figure 5.1: Hale-Bopp Model, unweighted (CN Rate vs. Distance, AU)
Figure 5.2: Hale-Bopp Model, weighted (CN Rate vs. Distance, AU)

Before leaving this example, there is one further issue that should be discussed. It is tempting to look at (5.11) and note that, if you take logs of both sides, you get a linear equation for log y:

log y = log A − Bx    (5.12)

Since (5.12) is algebraicall
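A hedged sketch of the weighted-vs-unweighted comparison discussed above, using synthetic data and a generic least-squares routine rather than Regress+ itself; the distances, rates, and uncertainties are invented for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, A, B):
    return A * np.exp(-B * x)

rng = np.random.default_rng(0)
x = np.linspace(3.0, 6.0, 8)                       # hypothetical distances (AU)
sigma = np.array([5, 40, 5, 5, 5, 5, 5, 5.0])      # one point with an unusually large error
y = model(x, 2800.0, 1.0) + rng.normal(0, sigma)   # synthetic "rates" with noise

p_unw, _ = curve_fit(model, x, y, p0=(1000, 0.5))
p_wtd, _ = curve_fit(model, x, y, p0=(1000, 0.5), sigma=sigma, absolute_sigma=True)
print("unweighted:", p_unw)
print("weighted:  ", p_wtd)    # the noisy point pulls the unweighted fit more
```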
90. s. For an example of continuous input, see BattingAvg.in.

10.2.2 Discrete Data

Discrete data may also be input as a column vector. All values must be positive integers. An example is Binomial.in.

Discrete data may also be grouped. In that case, the format for a data record is: val freq, where val is the value of the datum and freq is its frequency. The symbol must be in column 1. It is permissible to have the same val more than once in the file; Regress+ will expand grouped data to the ungrouped equivalent before processing. Grouped and ungrouped data must not be mixed together. For an example of grouped data, see Hyphens.in.

10.2.3 Predictions

With continuous stochastic input, Regress+ can predict the percentile of a given value, val, once processing is complete. Requests for predictions should follow the data, but need not. A prediction request is formatted as follows: val, with the symbol in column 1. For an example of stochastic input with prediction requests, see BattingAvg_pred.in.

10.3 Deterministic Input

Deterministic data can be unweighted or weighted. As will be shown later (Ch. 11), the weights need not be used. As noted above, there must be at least seven unique points with deterministic data. This is interpreted to mean seven unique values of x, although there is no limit as to how small the difference may be.

Zero is considered positive.
³The space after the symbol is optional.
⁴Within the nume
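A minimal sketch of the grouped-to-ungrouped expansion described above; the leading marker symbol required in column 1 is omitted here, and the records are hypothetical:

```python
def expand_grouped(records):
    """Expand (val, freq) records into the ungrouped equivalent."""
    data = []
    for val, freq in records:
        data.extend([val] * freq)
    return data

# Hypothetical grouped input: value 3 seen twice, value 5 seen four times, etc.
grouped = [(3, 2), (5, 4), (3, 1)]        # the same val may appear more than once
print(expand_grouped(grouped))             # [3, 3, 5, 5, 5, 5, 3]
```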
91. s-of-fit metric is the likelihood itself. However, it turns out that this metric has relatively little power. In other words, it does not detect bad models very well unless there are outliers in the data. A good statistic is one that has, inter alia, sufficient power to do its job adequately. The statistics we describe here are utilized for just that reason.

There are different metrics for continuous and discrete models. We treat these cases separately.

6.1.1 Continuous Models

The most common metric used to test the goodness of fit of a continuous distribution to some data is the Kolmogorov-Smirnov (K-S) statistic. Consider again the salaries example and the mixture model used earlier (Fig. 4.10). The corresponding CDF plot is shown in Figure 6.1.

Figure 6.1: CDF Plot for Salaries Data and Model (CDF vs. Academic Salaries, hundreds of dollars)

In this plot, the empirical CDF is a gray, stepwise curve (with so many points, the steps are very small) and the model is a smooth, black curve. Whenever a model is a poor fit, there will be some relatively large separations between these two curves. The largest separation, in absolute value, is equal to the K-S statistic.

The question now is: How large is too large? This is a tricky question, and the usual answer is determined by what is termed the sampling distribution of the statistic. When a statistic is first developed, it is necess
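A small sketch of how the K-S statistic is computed as the largest separation between the empirical CDF and the model CDF; it uses synthetic exponential data rather than the salaries dataset:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = np.sort(rng.exponential(scale=4.5, size=1000))    # synthetic sample
model_cdf = stats.expon.cdf(data, scale=4.5)              # CDF of the candidate model

n = len(data)
ecdf_hi = np.arange(1, n + 1) / n                          # ECDF just after each point
ecdf_lo = np.arange(0, n) / n                              # ECDF just before each point
ks = max(np.abs(ecdf_hi - model_cdf).max(),
         np.abs(model_cdf - ecdf_lo).max())                # largest separation
print(ks, stats.kstest(data, "expon", args=(0, 4.5)).statistic)   # the two should agree
```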
92. s with 1-minute periods for 2 hours, you will get something like the results shown in Table 1.2. Once again, these observations are random and independent.

Table 1.2: Carbon-14 Decays in 1 min
18 21 18 20 12 16 12 15 9 9 22 12 15 15 17 10 21 14 12 10 13 10 15 13
13 19 16 9 14 15 11 13 17 12 10 13 18 9 17 20 11 8 7 13 11 17 15 17
9 26 23 12 13 9 7 12 10 14 11 9 8 12 20 11 11 10 16 9 20 14 13 22
15 15 15 18 17 14 13 14 16 11 12 14 11 8 11 17 16 14 19 13 16 13 8 14
9 17 14 13 12 14 13 17 16 10 10 13 16 12 14 10 14 15 6 13 17 11

Sometimes data do not fall neatly into continuous/discrete categories because the observations, although discrete, are so close together that, for all practical purposes, they can be treated as though they were continuous. The data in Table 1.3 illustrate this quite well. These data list the seasonal maximum batting averages for U.S. Major League Baseball over more than a century. The averages are actually discrete fractions but can be very close in value, since a batter can get hundreds of at-bats during a season.

Table 1.3: U.S. MLB Batting-average Maxima, 1876-2012 (columns: Season, Maximum × 1000)
1900 1950 2000
360 336 381 384 407 381 352 354 320 329 390 329 372 359 399 340 426 420 394 390 406 344 3
93. ss are quite robust, and the software goes to great

13.1 Failure to Converge

With stochastic models, Regress+ starts with good initial values for the parameters, and it nearly always converges to the correct, global optimum. However, this is not the case with deterministic models, for which the default initial values (1) are almost always poor. This is especially true if a User-defined model is entered. If Regress+ does not converge, there will be an error message to that effect.

One possible solution is simply to Restart, with or without different initial values for one or more parameters, or use the Simulated Annealing option. Other possible solutions are analogous to those described in the next section. No progress can be made unless/until Regress+ converges.

13.2 Convergence to an Incorrect Solution

Sometimes Regress+ converges, but the solution found is not the true, global optimum (assuming that the latter is unique). There will be no error message in this case, and it is up to the user to recognize the fault. As discussed in the previous section, one can try restarting with new initial parameter values, or simply Restart from where Regress+ left off, keeping the existing parameter values.

Models with trigonometric functions almost always have multiple global optima, all equally good.

Alternatively, if the model is a familiar one, then some of its parameters should have values
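One generic way to guard against an incorrect local optimum, analogous to restarting Regress+ with different initial values, is a multi-start strategy. The sketch below uses a general-purpose optimizer and synthetic data; it is not Regress+'s actual algorithm:

```python
import numpy as np
from scipy.optimize import minimize

def sse(params, x, y):
    A, B = params
    return np.sum((y - A * np.exp(-B * x)) ** 2)

rng = np.random.default_rng(3)
x = np.linspace(0.5, 5.0, 20)
y = 3.0 * np.exp(-0.8 * x) + rng.normal(0, 0.05, x.size)   # synthetic data

# Restart the local optimizer from several random initial values and keep the best result.
best = None
for _ in range(20):
    p0 = rng.uniform([0.1, 0.01], [10.0, 3.0])
    res = minimize(sse, p0, args=(x, y), method="Nelder-Mead")
    if best is None or res.fun < best.fun:
        best = res
print(best.x)    # should be near (3.0, 0.8)
```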
94. ssible. With 10 divisions, one would have deciles and, with 100 divisions, we get the special case of percentiles. In general, such divisions are known as quantiles.

³If the quartile position falls in between two values, then you split the difference.

Mode

The mode is meant to be the value in the dataset that occurs most often. Of course, with continuous data, it is likely that no value occurs more than once. Nevertheless, values tend to clump together more often than not, and the peak of the highest clump then becomes the mode. This will become clearer in the following section.

2.2 Graphs

The preceding section defining various statistics contains approximately a thousand words, and it is said that one picture is worth a thousand words. Unfortunately, this old proverb does not tell you how to draw that picture.

We begin with the realization that even the simplest picture is two-dimensional. If we take a dataset such as that in Table 1.3, we have only a set of numbers, all of which, in this case, fall on the real number line. In fact, were they not coded, they would all fall between zero and one, by definition. We could draw a short line, label the left end 0 and the right end 1, then put a dot on it at the appropriate location for each batting average, but the result would be one-dimensional, a messy line, which is not very useful. We need to utilize the second dimension. This is
95. stifiably claim a precision any better than one part in 116. The results would then be reported to two significant figures plus, by convention, one or two uncertain digits, with the uncertainty shown explicitly.

Random numbers, such as we see in these three tables, are often called random variates, since they are variables (not constants) and unpredictable. Generally, they are unpredictable because we do not know how, or have enough information, to predict them. In rare cases, they are intrinsically unpredictable, regardless of how expert one might be.

Of course, making predictions is a primary purpose of data analysis. Therefore, it is fortunate that random (stochastic) data are the exception, not the rule. In the majority of datasets, there is a (possibly causal) relationship between two or more variables, such that one seems to be determined by the other(s). Such relationships are said to be deterministic. An example is provided by the time series shown in Table 1.4.

Table 1.4 records the duration between sunrise and sunset in Boston, MA (USA) over three years, starting on 1 January 1995. The datapoints correspond to the first day of each month, plus the minimum and maximum in each year. It is apparent that, for a fixed location, daytime is not at all random. It varies regularly and periodically, so that a given day of the year has roughly the same daytime every year. Likewise, the shortest and longest times occur on or about the same date, year after year. To
96. t for the first component. The data consist of academic salaries (in hundreds of dollars) collected in 1993-1994, N = 1,161. How the five parameters were optimized will be discussed in the next chapter. For now, we just show the PDF result (Fig. 4.10).

Figure 4.10: A Binary Mixture Model (PDF vs. Academic Salaries, hundreds of dollars)

The model and histogram are bimodal. That is, there are two separate modes, even though one peak is buried under a larger one. Real-world data can get very messy.

Compendium of Common Probability Distributions

One good reference for stochastic models is the compendium that is published along with this documentation. It can be found and downloaded here. This compendium describes the distributions found in Regress+. The parametrization is also the same.

⁸Also included with this documentation. Please honor the copyright.

4.3 Deterministic Models

A deterministic model is an equation describing a relationship between one or more independent variables and a dependent (response) variable. This is by far the most common sort of mathematical model, with an enormous supporting literature. In this document, we shall assume that there is only one independent variable. Even so, with the usual methodology, there are at least two kinds of modeling that can be done, depending upon the errors associated with the datapoints.

If all of the points have rand
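For reference, a binary Normal mixture of the kind shown in Figure 4.10 can be written as a PDF in just a few lines; the parameters below are made up for illustration and are not the fitted salary values:

```python
import numpy as np
from scipy import stats

def mixture_pdf(x, p, mu1, sigma1, mu2, sigma2):
    """Weighted mixture of two Normal PDFs: p*N(mu1, sigma1) + (1-p)*N(mu2, sigma2)."""
    return (p * stats.norm.pdf(x, mu1, sigma1)
            + (1 - p) * stats.norm.pdf(x, mu2, sigma2))

# Illustrative parameters only: two overlapping components of unequal spread.
x = np.linspace(-3, 8, 6)
print(mixture_pdf(x, p=0.4, mu1=0.0, sigma1=1.0, mu2=3.0, sigma2=2.0))
```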
97. th a possibly infinite summation (see Further Examples).

Now suppose that g(x) = x. In that case, Equation 4.2 will tell us the mean of x. Substituting our exponential model for f(x), we get

⟨x⟩ = ∫_0^∞ x (1/λ) exp(−x/λ) dx = λ    (4.3)

Thus, λ is the mean of x, a nice, easy interpretation. We can easily guess a good value for λ because we have a sample of 10,000 variates, which is enough to estimate any mean with decent accuracy. The mean of our sample, to five significant figures, is 4.4929 s. Substituting this value for λ, we get the model curve (blue) shown in Figure 4.3.

Figure 4.3: Exponential Model with λ = Empirical Mean (PDF vs. ¹⁴C Decay Interval, s; N = 10,000)

This model does not describe our dataset perfectly but, then, we do not have a sample of infinite size, so there is bound to be some experimental error, even with a valid model. Still, the fit looks extremely good. We shall discuss how to quantify goodness-of-fit in Chapter 6.

Playing with the PDF

We have not considered whether our model is optimal, the best it could possibly be; that topic will be discussed in Chapter 5. However, it is clearly a very good model; Figure 4.3 shows that much for certain. Therefore, we may legitimately ask: What might we do with this model? Can we extract other
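Equation 4.3 suggests an immediate sanity check: the sample mean of exponential variates should recover λ. A minimal sketch with synthetic draws (not the actual decay data):

```python
import numpy as np

rng = np.random.default_rng(4)
lam = 4.5                                     # assumed true mean for this sketch
sample = rng.exponential(scale=lam, size=10_000)
print(sample.mean())                          # close to lam, as Eq. 4.3 predicts
```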
98. this theoretical PDF is termed a distribution. For our example, the analytic form of the PDF is given below (4.1):

x ~ (1/λ) exp(−x/λ)    (4.1)

where x is the decay interval in seconds, λ is a parameter, and ~ is read "is distributed as". This particular model is called the exponential distribution (single-parameter version).

Here, the units of x are seconds. Since the argument of a transcendental function, such as the exponential function exp, must be dimensionless, the units of λ must be seconds as well. Hence, the overall units for this PDF are s⁻¹. In general, the units for any PDF are the reciprocal of the units of the variates it describes. This is a good rule to remember. It provides a necessary (but not sufficient) check on the algebraic correctness of complicated PDFs.

Figure 4.2 shows the dataset histogram together with five different exponential models (λ = 1, 2, 3, 4, 5, from left to right, resp.). Away from the mode, a model curve should pass through the center of the tops of each histogram bin. Judging by this figure, the correct value of λ should lie somewhere between 4 (red curve) and 5 (green curve).

Figure 4.2: Exponential Model with Five λ Values (PDF vs. ¹⁴C Decay Interval, s; N = 10,000)

We might be able to estimate the value of λ if we k
99. tion of why the data look the way they do, and there are lots of possible reasons for that.

Chapter 3: Data vs. Information

Final approach once again, and the end of a long day of traveling. My destination, Colorado Springs, clearly visible roughly a thousand feet below, appeared a bit dry and hot to anyone accustomed to life near an ocean. The city was small and spread out to the East, of necessity, since the West was cut off by Cheyenne Mountain, which looked to me rather tall and equally parched. This mountain is well known as the location of several facilities constructed during the Cold War for national defense. However, its most interesting feature, from my point of view, was the many antennas sprouting up from its top.

Looking over at all of these sensors, with all their different shapes and sizes, I couldn't help but see them as a concrete metaphor for the fundamental difference between data and information. In colloquial English, these two terms are often considered equivalent. However, when one attempts any serious data analysis, the difference becomes very obvious very quickly. Any student in a high-school physics lab, trying to predict the final temperature for a mixture of hot and cold water, knows only too well that what you observe and what Nature says you should observe are almost never the same. The problem, of course, is that empirical measurements contain error, sometimes quite a lot of error. Consequently
100. ut the right boundary is in the next bin, if the latter exists. Symbolically, our first bin in this example is [325, 350) and the last is [425, 450).

⁷Empty bins are sometimes unavoidable, especially if an empty bin is between other bins.

of datapoints, together with some appropriate minimum and maximum for the number of bins. This strategy tends to split the precision of the graph evenly between the two axes. Here we have N = 137, so a binwidth of 12 should be reasonable, but that would give unaesthetic bin boundaries. A compromise would be to start at 320 and set binwidth = 10, giving Figure 2.2 (legend added).

Figure 2.2: Another Frequency Histogram (Frequency vs. BattingAverageMax × 1000)

Notice that Figure 2.2 has a very different shape from that of Figure 2.1. The latter is fairly smooth, but the figure above is bumpy. It is generally true that the shape of a histogram is quite sensitive to the binwidth and bin boundaries, so you should not read too much into it. This sensitivity decreases as N becomes very large.

Probability

It is natural to wonder what the chances are for a future datapoint to fall into one of the existing bins of a frequency histogram. Obviously, the answer depends upon which bin you have in mind. Looking at Figure 2.2, one would expect that the chances are a lot better for bin [350, 360) than f
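The binning choice described above (start at 320, binwidth 10, right boundary belonging to the next bin) is easy to reproduce; this sketch uses stand-in data, not the actual Table 1.3 values:

```python
import numpy as np

rng = np.random.default_rng(5)
averages = rng.normal(370, 25, size=137).round()   # stand-in for the 137 batting maxima
edges = np.arange(320, 461, 10)                     # bin edges 320, 330, ..., 460
freq, _ = np.histogram(averages, bins=edges)
for left, f in zip(edges[:-1], freq):
    print(f"[{left}, {left + 10}): {f}")            # frequency in each bin
```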
101. y equivalent to (5.11), one might think that they would give the same parameters once the transform was undone. This is a common mistake. Table 5.3 lists the A and B parameters for the unweighted, weighted, and unweighted log-transform models. They are quite different.

Table 5.3: ML Parameters for Hale-Bopp Regressions

Regression    A          B
unweighted    2763.39    0.978253
weighted      2926.11    1.04642
transform     1936.35    0.911981

A nonlinear transform, such as the log transform, affects large values more than small values. In this case, points close to the Sun are affected more than those farther away. It is possible to undo nonlinear transforms correctly, but it requires a lot more work. In contrast, linear transforms (adding a constant, multiplying by a constant, or both) are generally acceptable, but the parameters will have different units.

³And ubiquitous on pocket calculators with regression capability.

Chapter 6: How Good is the Model?

No modeling task is completely finished until you determine whether the model is good or not. At a minimum, the model must describe the data, and this requires some quantitative goodness-of-fit metric (statistic). There are a variety of such metrics for both stochastic and deterministic models. In this chapter, we describe those most commonly used, as well as one special kind of plot.

6.1 Stochastic Models

Since stochastic models are optimized by maximizing the likelihood of the data, one obvious goodnes
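Returning to the log-transform point above and to Table 5.3: the effect is easy to demonstrate with synthetic data, since fitting y = A exp(−Bx) directly and fitting a straight line to log y generally return different parameters. A hedged sketch using generic fitting routines (not Regress+):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(6)
x = np.linspace(1.0, 6.0, 30)
y = 2000 * np.exp(-1.0 * x) * (1 + rng.normal(0, 0.15, x.size))   # noisy, strictly positive

# Direct nonlinear fit of y = A*exp(-B*x)
(A_fit, B_fit), _ = curve_fit(lambda x, A, B: A * np.exp(-B * x), x, y, p0=(1000, 0.5))

# Log-transform fit: log y = log A - B*x, then undo the transform
slope, intercept = np.polyfit(x, np.log(y), 1)
A_log, B_log = np.exp(intercept), -slope

print(A_fit, B_fit)    # the two parameter sets generally differ,
print(A_log, B_log)    # because the transform reweights the residuals
```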
102. y var. Note that (4.12) has the correct units and is normalized. The graph of this PDF is the familiar bell-shaped curve, shown below in its standard form (µ = 0, σ = 1).

Figure 4.8: Standard Normal (Gaussian) Distribution

Considering how ubiquitous the use of Normal distributions is in routine analysis, it is surprisingly difficult to find a large, real-world dataset that is demonstrably Normal. Almost always, there is some slight deviation from normality and, with a lot of data (hence a lot of information), this slight discrepancy becomes significant and spoils the goodness-of-fit test. For this reason, and also to show that it can be done, we shall synthesize a dataset by drawing 1,000 points from a standard Normal distribution. The data histogram is shown in Figure 4.9, superimposed upon the theoretical PDF (red). The fit is not perfect; as noted in the last chapter, even random data contain error.

Figure 4.9: Synthetic Normal(0, 1) Data, N = 1,000

The preceding examples are all very simple, and the real world usually is not. Consequently, we often need a more elaborate model. As just one example of this, we shall use some data well described by a mixture model, in this case a weighted combination of two Normal distributions, with two means, two standard deviations, plus one parameter giving the weigh