        The winGamma- User Guide - Pages supplied by users
         Contents
1.
... 56
3.6.1.1 Using the WhatIf and Query options on the LLR model  58
3.6.1.2 A histogram of prediction errors for LLR model  59
3.6.2 Building and testing a neural model  59
3.7 Example model construction and testing for DH 34 5000.asc  60
3.7.1 How the prediction quality degrades into the future  60
3.8 Using a prediction file  61
3.8.1 Using a prediction file on Input/Output data  61
3.8.2 Using a prediction file on Time Series data  61
3.9 Using the neural networks outside of winGamma  61
3.9.1 The activation function and the sigmoidal  61
3.9.2 NEREIST ...  62
3.9.3 Exporting and using Neural network models in Excel  62
APPENDIX I General Information  64
Shipping list  64
Hardware requirements  64
Installation  64
List of files and directory structure after installation  65
Problems reported  66
APPENDIX II Data file formats  67
Time series data  67
Input/Output data
2.
[Table residue: Unique Points 10578, 10578; the Zero Near Neighbours and Upper 95% Confidence values are illegible]

note the SE first increases and then plateaus. Along the plateau a minimum SE occurs at around pmax = 20, which from now on we take as the best pmax for further analysis of this data.

Figure 2.21 shows the M test and we can see that for M = 9000 we are beginning to get a stable asymptote. From this we infer that around 9000 data points will be required to build a model which will predict with an accuracy about equal to the noise level. The result of a Gamma test on pmax = 20 near neighbours using the full data set is shown in Table 2.3.

The winGamma User Guide PERFORMING AN ANALYSIS Version 18 Jan 2002

Figure 2.22 shows the scatter plot and regression line (zoomed). This is typical of slightly noisy data, but the regression line fit is good. The desirable empty wedge in the top left corner is not obviously present, another indication that the data is somewhat noisy.

The 3D Histogram in Figure 2.23 shows similar features except that we can see the actual frequency distribution of points in the scatter plot, and this shows that strong outliers or "empty wedge" points, although present, are relatively infrequent. The Angle histogram of Figure 2.24 shows a roll off in frequency as we approach angles close to π/2.

The final Moving Window Gamma Test
3.
3.6.1.1 Using the WhatIf and Query options on the LLR model

The WhatIf option allows us to see what happens if we set values for all of the inputs except one and vary the remaining input over some range. This is a very useful tool in a variety of contexts.

For example, in a sales and marketing campaign we may be able to answer the question "If I spend X on advertising on TV and Y on advertising in newspapers, how will the sales of the soft drink vary with the mean daytime temperature?"

Similarly the Query option allows a particular selection of all inputs to be queried. The use of Predict is discussed in section 3.8.

Having analysed the data, and built and tested a model, we can now ask some interesting questions regarding the solar.csv data. For example, using the WhatIf option we can answer the question

"How does the power output vary when the temperature is fixed at 7 degrees and the Irradiance varies from 0 to 30?"

The answer is given in Figure 3.6. As expected, at a fixed temperature the power output is almost linear with the Irradiance.

[Figures 3.6 and 3.7: WhatIf plots of the output against the varied input]

Figure 3.6 The variation of output power as Irradiance varies from 0 to 30 and temperature is 7 degrees.
Figure 3.7 The variation of output power as Temperature varies from 1 to 15 and ...
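The WhatIf sweep described in section 3.6.1.1 can be sketched outside winGamma as follows. This is only an illustration of the idea (hold every input fixed except one and step the remaining input over a range). The function name `what_if`, the input ordering (temperature, irradiance) and the stand-in `toy_model` are all assumptions made for the example; they are not winGamma's own code or the fitted LLR model.

```python
def what_if(model, fixed_inputs, vary_index, start, stop, steps):
    """Sweep one input over [start, stop] while holding the others fixed.

    model        : any callable mapping a list of inputs to a prediction
    fixed_inputs : values for every input (the entry at vary_index is overwritten)
    vary_index   : position of the input to vary
    steps        : number of evenly spaced values to try (at least 2)
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    results = []
    for i in range(steps):
        value = start + (stop - start) * i / (steps - 1)
        inputs = list(fixed_inputs)      # copy, so the caller's list is untouched
        inputs[vary_index] = value
        results.append((value, model(inputs)))
    return results


def toy_model(inputs):
    """Hypothetical stand-in for a trained model: NOT the solar.csv model."""
    temperature, irradiance = inputs
    return 0.05 * irradiance - 0.001 * temperature


if __name__ == "__main__":
    # Temperature fixed at 7, irradiance varied from 0 to 30.
    for value, prediction in what_if(toy_model, [7.0, 0.0], 1, 0.0, 30.0, 7):
        print(value, prediction)
```

A Query in this picture is simply a single call of the model on one fully specified input vector.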
4.
[Screenshot residue: Results Visualiser and Analysis Manager windows for Sin500.asc, showing Gamma and Standard Error against Near Neighbours, Gamma against Unique Data Points, the Gamma scatter plot against delta, the 3D histogram, the angle histogram, and Gamma against Position in List]

Figure 2.13 Increasing near neighbours (3-30) on Sin500.asc, Gamma and SE
Figure 2.14 M test (pmax = 17) on Sin500.asc
Figure 2.15 Scatterplot and regression line (pmax = 17) for Sin500.asc
Figure 2.16 3D Histogram (pmax = 17) for Sin500.asc
Figure 2.17 Angle histogram for Sin500.asc (pmax = 17)
Figure 2.18 Moving window Gamma test (pmax = 17) on 300 p
5.
"File" and then "Open Analysis Data Set".

1.2.1 Comma separated variable (.csv) files from spreadsheets

If the file data is in the .csv format (e.g. as exported from Excel), on loading the file you will be asked to specify which of the columns are outputs. Because a .csv file does not indicate which columns are inputs and which are outputs, if the file is an Input/Output file it is necessary to give this information to winGamma. Each column has to be tagged as an input or output column. This is done as indicated in Figure 1.1. To change an input (default) to an output, select it with the mouse or cursor keys and press the "Enter" (or "Return") key, or toggle with a double click on the left mouse button.

The winGamma User Guide GETTING STARTED Version 18 Jan 2002

[Screenshot residue: the column tagging dialog ("Use the cursors or mouse to select a row. Press return or double click to toggle the highlighted row between input and output.") and the Time Series Options dialog (Number of inputs per series: 5; Number of outputs per series: 2; Moving average width: 0; Differences)]

Figure 1.1 Toggle inputs to outputs as required when loading a .csv file as Input/Output data.
Figure 1.2 Selecting the number of inputs and outputs per time series.

For Time Series data specify all columns as inputs. As in Figure 1.2, winGamma will then ask you to specify the number of inputs and outputs per series. A
6.
List of Tables
Table 1.1 Gamma test results with pmax = 10 for unscaled and scaled solar.csv data  15
Table 2.1 The results of a simple Gamma test on the file Ran500.asc for unscaled and scaled data  35
Table 2.2 The Gamma test result (pmax = 10) for unscaled and scaled data on the file Sin500.asc  37
Table 2.3 The results of the Gamma test (pmax = 20) for unscaled and scaled data from solar.csv  41
Table 2.4 Excel file for multiple time series  48

CHAPTER I Getting Started

1.1 Introduction

Data or observations can be considered as a spreadsheet of numbers in which the columns are divided into two types: input columns and output columns. In any row we might wish to determine the values of the outputs when these are not known but the corresponding inputs are known.

A data model is an algorithm constructed from a set of observations (for which all inputs and outputs are known) which enables us to predict the outputs from a given set of inputs. This software is concerned with constructing data models of a particular type.

1.1.1 The Purpose of the Software

winGamma is a software package which in the first instance estimates the least Mean Squared Error (MSError) that any smooth data model (e.g. a trained feed forward neural network) can achieve on the given data without over training.

winGamma can be used with multiple column Input/Output data files and single or mult
7.
This test can also tell us how much data we are likely to need to obtain a model of a given quality.

Moving Window Gamma test: Shows how the estimate for the Gamma statistic using a fixed number of data points varies as we move a fixed length window along the data file. This is used to check the stability of the Gamma statistic as we move along a large file.

Model Identification: These options are used to select those inputs which can best be used to predict a selected output (some inputs may be noisy or irrelevant). The use of model identification techniques is discussed in Chapter II.

Full Embedding
Genetic Algorithm
Hill Climbing
Sequential Embedding
Increasing Embedding

Other features: Are captioned in Figure 1.7.

[Figure 1.7 screenshot residue: the winGamma main window with the Data Set Manager and Analysis Manager, and callouts "Scale, unscale or partition the data", "Current data file name", "Delete selected experiment", "Select Analyse for graphical analyses", "Current Experiment type", "Current Experiment number" and "Current Experiment results"]
8.
... 19
1.5.1 An Input/Output file  19
1.5.1.1 The basic steps  19
1.5.2 A chaotic Time Series  20
1.5.2.1 The basic steps  21
1.6 Linear models  23
1.7 Exporting results for use by other software  23
1.8 Customising the file and project directories  23
CHAPTER II Performing an analysis  25
2.1 Introduction  25
2.1.1 The user cycle  25
2.2 The Gamma test  26
2.3 The Gamma Test analysis graphs  27
2.3.1 The scatter plot and regression line  27
2.3.2 The 3D histogram  29
2.3.3 The angle histogram  29
2.4 Increasing near neighbours  30
2.5 M test  31
2.6 Moving Window Gamma test  32
2.7 Model identification  32
2.7.1 Full embedding  32
2.7.2 Genetic Algorithm  33
2.7.3 Hill Climbing  33
2.7.4 Sequential Embedding
9.
Date      Time    Size       Name
      99  02:31p    107,677
03/02/99  04:08p  1,228,288
03/02/99  04:20p    968,704
02/17/99  03:32p     35,328

Real data files (Data)

Date      Time    Size       Name
11/21/98  11:51a  <DIR>      Solar
02/04/99  02:35p    430,515  Solar.csv
12/11/98  03:37p  <DIR>      Sunspot
03/19/98  07:52p      2,240  Sun280.asc
04/20/98  02:15p     24,543  SunPairs.asc

Artificially generated test data files (TestFiles)

Date      Time    Size       Name
01/04/99  02:18p  <DIR>      Noise
04/16/98  12:21p     50,183  Ran500.asc
03/27/08  03:04p     20,539  Sin500.asc
03/02/99  12:43p  <DIR>      NoNoise
09/15/98  04:58p      1,958  Hen100.asc
09/15/98  04:59p      9,830  Hen500.asc
09/15/98  05:00p     19,666  Hen1000.asc
10/29/98  01:26p    983,909  Hen50000.asc
04/22/98  12:51p      9,617  MGIls500.asc

The winGamma User Guide APPENDIX I General Information Version 18 January 2002

10/22/98  05:09p     96,097  MGIls5000.asc
02/24/99  02:44p    205,286  ModSin5000.asc
12/09/98  03:02p     98,966  DH 34 5000.asc

Mathematica 3.01 files

Date      Time    Size       Name
12/19/98  04:25p      2,317  DataAnaly.m
12/19/98  04:25p  2,812,869  DataAnaly.nb
03/02/99  07:13p      8,831  DataGen.m
03/02/99  07:13p    946,827  DataGen.nb
10/01/98   2:07p     38,062  mathlinkGamma.nb
01/28/99  04:36p    253,440  GammatTestProject.exe
10/29/98   4:18p     50,773  NetReader.nb

Problems reported

Graphics files saved are in billions of colours and cause "out of memory" errors when attempts are made to load them into some software including WPCorelV8 and Graphics Workshop.

APPENDIX II Data file formats

All data files are in plain ASCII and have the file name
10.
Input/Output file for which you have to specify the inputs and outputs. Because the data is not now recognised as multiple time series data, the Iterate Model option will not be available (at present this can only be used with a single time series).

2.15 To scale or not to scale

If two input variables are incompatible (e.g. temperature in degrees K and altitude in metres) both in semantics and in range, then the effects of a change in one of them can completely outweigh the effects of a change in the other. To ensure that all variables at least start with an equal chance to contribute to an output prediction it is often helpful to apply a standard normalization.

In this software the standard normalization is that the mean of each input variable is mapped to zero and the standard deviation to 0.5. In later versions we may include an option for the user to select the standard deviation (this can have some advantages in model building).

The effect of normalizing the data is twofold. First, since the output is also rescaled this will affect the Gamma statistic in a trivial way (it will divide it by the square of the new output range). The Vratio however will not change due to this effect. Second, rescaling the inputs can change the near neighbourhood relationships and hence possibly change the associated Gamma value. We can detect if this happens as it will also cause Vratio t
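The standard normalization described in section 2.15 (mean mapped to zero, standard deviation mapped to 0.5) can be sketched for a single column as below. The function name `normalise` is hypothetical, and the use of the population standard deviation is an assumption: the guide does not say whether winGamma divides by N or N - 1.

```python
from statistics import mean, pstdev


def normalise(column, target_sd=0.5):
    """Shift a column to zero mean and rescale it to the target standard deviation.

    Assumption: the population standard deviation (divide by N) is used.
    A constant column has no spread to rescale, so it is mapped to zeros.
    """
    centre = mean(column)
    spread = pstdev(column)
    if spread == 0:
        return [0.0 for _ in column]
    return [(value - centre) * target_sd / spread for value in column]


if __name__ == "__main__":
    altitude_m = [120.0, 450.0, 980.0, 1500.0, 2300.0]
    scaled = normalise(altitude_m)
    print(scaled)
    print(mean(scaled), pstdev(scaled))
```

Applying the same function to every input column puts variables such as temperature and altitude on a comparable footing before near neighbours are computed.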
11.
also sometimes interesting to observe when the Gamma statistic is at a local minimum and the Gradient is at a local maximum as the number of near neighbours varies. This criterion seems to be sensitive to noise on different scales of distance in input space.

2.5 M test

This test is used to show how the estimate of the Gamma statistic (and the other results returned by the Gamma test) varies as more data is used to compute it. Eventually, if enough data is used, the Gamma statistic should asymptote to the true noise variance on the output for which it has been computed.

The M test can also tell us how much data we are likely to need to obtain a model of a given quality, in the sense of predicting with a MSError around the noise level. In Figure 2.5 we see that in this sense a perfectly adequate model can be built using anywhere from 150-200 data points, since the variance of the Gamma statistic after this stage is relatively small compared with its actual value.

[Screenshot residue: Results Visualiser, Gamma against Unique Data Points]

Figure 2.5 M test graph for Sin500.asc. Note the relatively stable asymptote.

We call these the "Terry points" after John Terry who first observed the phenomenon.

Of course, using more
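To make the M test of section 2.5 concrete, here is a minimal brute-force sketch. It follows the published description of the Gamma test rather than winGamma's own implementation, which is an assumption on our part: for each p up to pmax it averages the squared distance to the p-th nearest neighbour (delta) and half the squared difference of the corresponding outputs (gamma), then fits a straight line whose intercept is the Gamma statistic and whose slope is the Gradient. The function names `gamma_test` and `m_test` are hypothetical.

```python
import math
import random


def gamma_test(inputs, outputs, pmax=10):
    """Return (gamma_statistic, gradient) using an O(N^2) neighbour search."""
    n = len(inputs)
    deltas = [0.0] * pmax
    gammas = [0.0] * pmax
    for i in range(n):
        # Squared distances from point i to every other point, nearest first.
        neighbours = sorted(
            (sum((a - b) ** 2 for a, b in zip(inputs[i], inputs[j])), j)
            for j in range(n) if j != i
        )
        for p in range(pmax):
            dist2, j = neighbours[p]
            deltas[p] += dist2
            gammas[p] += 0.5 * (outputs[j] - outputs[i]) ** 2
    deltas = [d / n for d in deltas]
    gammas = [g / n for g in gammas]
    # Least squares line through the pmax points (delta, gamma).
    mean_d = sum(deltas) / pmax
    mean_g = sum(gammas) / pmax
    sxx = sum((d - mean_d) ** 2 for d in deltas)
    sxy = sum((d - mean_d) * (g - mean_g) for d, g in zip(deltas, gammas))
    gradient = sxy / sxx
    return mean_g - gradient * mean_d, gradient


def m_test(inputs, outputs, sizes, pmax=10):
    """Gamma statistic computed on the first M points, for each M in sizes."""
    return [(m, gamma_test(inputs[:m], outputs[:m], pmax)[0]) for m in sizes]


if __name__ == "__main__":
    # Noisy sine data in the spirit of Sin500.asc: uniform noise, variance 0.075.
    random.seed(1)
    half_width = math.sqrt(3 * 0.075)
    xs = [[random.uniform(0.0, 2.0 * math.pi)] for _ in range(500)]
    ys = [math.sin(x[0]) + random.uniform(-half_width, half_width) for x in xs]
    for m, gamma in m_test(xs, ys, range(100, 501, 100)):
        print(m, gamma)
```

The brute-force search is adequate for a few hundred points; for files of the size of solar.csv a spatial index such as a kd-tree would be the usual replacement.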
12.
and that we can make accurate short term predictions but that long term prediction becomes exponentially more difficult.

Figure 3.11 How the Gamma statistic varies against the number of steps ahead for DH 34 5000.asc.

3.8 Using a prediction file

Building and testing models when you know the outputs for a corresponding set of inputs is quite interesting, but it is purely an academic exercise. Sooner or later you will want to make predictions that matter and where the outcomes are not known, perhaps from some large quantity of input data.

To accomplish this it is first necessary to have a "prediction" file, i.e. the input data is placed in a file without the corresponding outputs.

3.8.1 Using a prediction file on Input/Output data

Load the data file EXAMPLE NEEDED HERE

3.8.2 Using a prediction file on Time Series data

Load the data file EXAMPLE NEEDED HERE

3.9 Using the neural networks outside of winGamma

If the neural models are used outside of winGamma, i.e. in other software, it is necessary to know some technical details of the implementation.

3.9.1 The activation function and the sigmoidal

The activation function used by the neural networks is

The winGamma User Guide BUILDING AND TESTING A MODEL Version 18 Jan 2002

\mathrm{act}_i(x) = \sum_{j=1}^{n} w_{ij} x_j

where w_{ij} is the weight of the connection from unit j to unit i and x_j is the output of unit j.

The sigmoidal used by each neural node as an output function i
13.
data we can actually often progressively improve the model (this can easily be checked by building a local linear regression model and using the WhatIf option to recover a quite good approximation of the original sine curve), but it is not necessarily helpful to have an extremely accurate model if the output data we are comparing it with is subject to large amounts of noise.

2.6 Moving Window Gamma test

The Moving Window Gamma test shows how the estimate for the Gamma statistic (and other relevant results returned by the Gamma test) using a fixed number of data points varies as we move along the data file. It gives some indication of how stable the Gamma statistic is when estimated for different subsets of the data all having the same size.

The remaining sections deal with model identification, i.e. in this context, the best choice of inputs for predicting a given output.

2.7 Model identification

2.7.1 Full embedding

An embedding is a selection of inputs chosen from all the possible inputs. In winGamma an embedding is designated by a string of "1"s and "0"s called a mask. Thus if there are five inputs the mask 10111 indicates that all inputs are to be used in the embedding except the second.

A full embedding tries every combination of inputs to determine which combination yields the smallest absolute Gamma value. It returns the number of results requested. If there are m scalar inputs then there are 2^m - 1 possible embeddings, th
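The mask notation of section 2.7.1 is easy to illustrate. The sketch below enumerates every non-empty mask for m inputs and applies a mask to one row of inputs; the function names `all_masks` and `apply_mask` are hypothetical and this is not how winGamma itself stores embeddings.

```python
from itertools import product


def all_masks(m):
    """Yield every mask of length m that selects at least one input."""
    for bits in product("01", repeat=m):
        mask = "".join(bits)
        if "1" in mask:          # skip the all-zero mask, which selects nothing
            yield mask


def apply_mask(mask, row):
    """Keep only the inputs whose mask character is '1'."""
    return [value for bit, value in zip(mask, row) if bit == "1"]


if __name__ == "__main__":
    masks = list(all_masks(5))
    print(len(masks))                                  # 2**5 - 1 masks
    print(apply_mask("10111", [11, 22, 33, 44, 55]))   # drops the second input
```

A full embedding search would run a Gamma test on the masked inputs for each of these masks and keep the masks with the smallest absolute Gamma value, which is why the cost doubles with every extra input.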
14.
progressively increase the numbers of hidden units in the two layers.

[Screenshot residue: the winGamma Analysis Manager showing the model list (Local Linear Regression Models: Local linear regression, Dynamic local linear regression; Neural Networks: Two layer backpropagation neural network, Conjugate gradient neural network, BFGS neural network) and a plot of Training Mean Squared Error against Cycles, with MSE and Target MSE]

Figure 3.3 The Analysis Manager during backpropagation training.

Two layer backpropagation also requires:

The initial learning rate (must be positive). This controls the initial step size in weight adjustment.

Momentum constant (must be positive). This controls the extent to which the size and direction of the current step in weight space is influenced by the size and direction of the previous step. Setting this parameter to zero means there is no momentum term in the weight adjustment at each step.

Regularisation constant (must be positive). This limits the size of weights. A zero here corresponds to no restriction on weight magnitude.

These options are configured using the set up menu shown in Figure 3.1. There is a second tab which allows the user to specify the maxim
15.
sunspot data  47
Figure 2.37 A test of the LLR model on the data set SunPairs.asc (blue predicted, green actual, red error)  47
Figure 3.1 The first set up menu for two layer backpropagation  54
Figure 3.2 The second set up menu for two layer backpropagation  54
Figure 3.3 The Analysis Manager during backpropagation training  55
Figure 3.4 Selecting a proportion of the data for testing  57
Figure 3.5 Result of LLR test for solar.csv  57
Figure 3.6 The variation of output power as Irradiance varies from 0 to 30 and temperature is 7 degrees  58
Figure 3.7 The variation of output power as Temperature varies from 1 to 15 and Irradiance is ...  58
Figure 3.8 An error histogram for the LLR test  59
Figure 3.9 A topographic plot of the solar.csv data  59
Figure 3.10 A test of the LLR model on the data set DH 34 5000.asc (blue predicted, green actual, red error)  60
Figure 3.11 How the Gamma statistic varies against the number of steps ahead for DH 34 5000.asc  61
16.
target MSError. This is useful in the event that the partition of the data for training and testing has been altered. Clicking "Recalc" will cause a new Gamma statistic for that part of the data selected as training data to be calculated, and hence set a new target MSError for training.

Alternatively the user can set any target MSError. However, if the target MSError is much less than the Gamma statistic on the training data then (i) the network may end up being overtrained, resulting in poor predictions, or (ii) the training algorithm may never be able to reach the (possibly unrealistic) target MSError.

User settable options

For each of the neural training algorithms we shall need to specify the number of hidden units. Thus each neural network option needs:

The number of units in the two hidden layers (default 5, 5).

The number of units required to achieve a good model will depend on the complexity of the unknown function we are trying to approximate. Unfortunately here there are few general rules to guide us. One useful guide is that if the Gradient value returned by the Gamma test is large then the unknown function has regions of high curvature and we shall require more hidden units to approximate it accurately. The best approach is to try to train using relatively few units (the defaults are set quite low) and if training fails to converge to the target MSError
17.
the initial analysis and click OK.

5. Select "Gamma test" from the Experiments Manager and then click on "New".

[Screenshot residue: two Results Visualiser windows, Gamma against Position in List and Gamma against Unique Data Points]

Figure 1.13 The result of an Increasing Embedding on Hen500.asc with a maximum of 10 inputs, using 10 nearest neighbours.

Figure 1.14 The result of an M test on Hen500.asc with 2 inputs, using 10 nearest neighbours.

The fact that this is so is by no means obvious. It is a consequence of a fairly deep theorem due originally to Takens (1981).

Initial results: The initial result gives a Gamma statistic of 0.117143 and a Vratio of 0.185337, which is not very encouraging. However, the real reason for this is that most of the inputs we have selected for the model are irrelevant or not very helpful.

6. Next in the Experiments Manager under Model Identification highlight Increasing Embedding and then click on "New". Leave the number of near neighbours set at 10 and click on "Execute". What this experiment does is to compute the Gamma statistic for a succ
18.
the angle histogram in Figure 2.11.

Finally the Moving Window Gamma test using a window size of 300 in steps of 10 in Figure 2.12 consistently shows a Gamma statistic between 0.29 and 0.38.

These results together indicate that there is no point in going on and trying to produce a smooth model for this data.

[Screenshot residue: Results Visualiser and Analysis Manager windows for Ran500.asc, showing Gamma and Standard Error against Near Neighbours, Gamma against Unique Data Points, the Gamma scatter plot against delta, and the 3D histogram]

Figure 2.7 Increasing near neighbours (3-30) on Ran500.asc, Gamma and SE
Figure 2.8 M test (pmax = 10) on Ran500.asc
Figure 2.9 Scatterplot and regression line
Figure 2.10 3D Histogram (pmax = 10) fo
19.
thumb for this situation but in practice it is not a major problem to optimise the choice of k using a test set.

There is also an option whether or not to include a constant term in the locally linear model. In general it is better to include this term.

The final user settable parameter is equivalent to a local principal components threshold filter on the eigenvectors of the local linear model. We are trying to predict along the tangent plane of the local flow, and eigenvectors corresponding to relatively small eigenvalues probably represent noise and lie outside the tangent plane. The threshold decides which eigenvectors we should ignore. Setting it low or zero will essentially include all eigenvectors in the local model (the default is around 10). Raising this threshold will filter out more and more eigenvectors. For noisy time series one often finds that 0.001 gives quite good results. Again the best approach is to experiment on a test set.

3.3 Dynamic local linear regression

This option is mainly designed for time series analysis. It is basically identical to LLR with the additional feature that as new data is seen for the first time it is incorporated into the model. You can see the effect of this by starting the model with very little training data and running a test on a large amount of data. As new test data is encountered (but after the attempt at prediction, of course) dynamic LLR will make steadily better predictions. This is interesting to obs
20.
... 33
2.7.5 Increasing Embedding  34
2.12 Analysing Input/Output data  34
2.12.1 The Ran500.asc data  34
2.12.2 The Sin500.asc data  37
2.12.3 The solar.csv data  39
2.13 Analysing Time Series data  42
2.13.1 The DH 34 5000.asc data (Delayed Henon Map)  42
2.13.2 The FTSE weekly closing price data  44
2.13.3 The sunspot data  46
2.14 Handling multiple time series  48
2.15 To scale or not to scale  50
2.16 ...  50
CHAPTER III Building and testing a model  52
3.1 Introduction  52
3.2 Local linear regression  52
3.3 Dynamic local linear regression  33
3.3 Two layer back propagation  54
3.4 Conjugate gradient descent  56
3.5 BFGS neural network  56
3.6 Example model construction and testing for solar.asc  56
3.6.1 Building and testing a LLR model
21.
[Figure 1.7 screenshot residue: Moving window gamma test; Model Identification: Full embedding, Genetic algorithm, Hill climbing]

Figure 1.7 The Analysis and Data Set Manager windows after performing the initial experiment.

1.5 Two simple examples

In this section we further illustrate the use of winGamma using two test files provided with the software.

1.5.1 An Input/Output file

The data for the file Sin500.asc was created (via the Mathematica file DataGen.nb) using the function y = Sin(x) and then adding uniformly distributed noise with a theoretical variance of 0.075 to the y values. A point plot of the data is shown in Figure 1.8.

Figure 1.8 The noisy sine data.

1.5.1.1 The basic steps

Load the data file and run a simple Gamma test with the number of near neighbours set at 10 (default), as described in section 1.3. Do not scale the data. Note that we do not need to specify the number of inputs and outputs because this file is in standard format.

The Gamma statistic in the Results window is 0.07355, which is quite close to the theoretical noise variance. The Vratio of 0.12762 suggests that we will not be able to predict the value of an output very accurately, which in view of the data plot in Figure 1.8 is not too surprising. The SE is 0.0037651, which indicates a fair degree of reliability in t
22.
Normal distribution (see Figure 2.33). If on the other hand there are clear underlying dynamics in the data then the histogram often shows a bimodal or multimodal distribution (see Figure 2.36).

2.7.2 Genetic Algorithm

This option searches the space of all masks using a Genetic Algorithm (GA) to find good embeddings. The parameters which can be used to control this search are (default values of parameters are given in brackets):

Population Size (100): The size of the population of masks being used throughout the search.

Mutation Rate (0.01): The probability that an individual bit will be mutated during the reproduction process.

Crossover Rate (0.5): The chance of inserting a random length run of bits from a parent mask to a child mask (i.e. the probability that a crossover event occurs during the reproduction process).

Gradient Fitness (0.1): The weighting in the GA fitness function for masks giving a low gradient in the Gamma Test. Increasing this weighting will place more emphasis on the relative simplicity of the modelling function.

Intercept Fitness (0.8): The weighting in the GA fitness function for masks with a low absolute value of the Gamma statistic. Increasing this weighting will place more emphasis on the model accuracy.

Length Fitness (0.1): The weighting in the GA fitness function for masks with a given number of "1"s. Increasing this weighting will encourage the selection of masks with f
23. SE = 0.00236 represents a good fit for the regression line.

Vratio

The Vratio is defined as Gamma/Var(output). It thus represents a standardised measure of the Gamma statistic and enables a judgement to be formed, independently of the output range, as to how well the output can be modelled by a smooth function. In comparing different outputs, or outputs from different data sets, the Vratio is a good number to study because it is independent of the output range. A Vratio close to zero indicates a high degree of predictability (by a smooth model) of the particular output. If the Vratio is close to one the output is equivalent to random noise as far as a smooth model is concerned. In this case Vratio = 0.00076 indicates low noise data which we should be able to model quite accurately.

Near Neighbours (number of pmax)

This is the one user settable parameter in the Gamma test. When estimating the Gamma statistic, pmax should be selected in relation to the size of the data set. For large data sets, in the interests of getting a more accurate Gamma statistic, we can afford to take the number of near neighbours somewhat larger (this depends on a number of factors discussed in Chapter II). In general in a Gamma test experiment we should keep the number of near neighbours less than 30. Usually 10 to 20 is a good choice.

Start

This indicates the row identifier for the first vector selected.

Unique Points
24. Set the number of inputs to 20 and the number of outputs initially to one. Now perform a simple Gamma test on the full data set (which gives 4985 I/O pairs with 20 inputs) to get an initial idea. If the data set is very large use a subset of the data for initial experiments.

Results: This gives an initial Gamma statistic of 0.0042614 and a Vratio of 0.007481. The SE for this result is 0.00125. These initial results are encouraging.

2. Next run an Increasing Embedding test to determine a likely embedding dimension.

Results: If we zoom in on the resulting graph we see Figure 2.28 and infer that a good model is likely to be obtained with 4 or 5 previous values.

The Gamma statistic for 4 is 0.00019635, for 5 it is 0.0002997, but the lowest value of all is for 7 past values, which gives 1.5E-7.

Figure 2.28 The result of an Increasing Embedding for the delayed Henon map.

These very low values suggest that the time series consists of very low noise (or noise free) data. Examination of the scatter plot and associated graphics supports this view.

3. We next Transform the data set to reset the maximum number of inputs to 8.

4. We next run an M test to check the stability of the Gamma statistic. If the M test produces a stable asymptote we can decide if we really have enough data to support these conclusions. A reasonab
25. The winGamma User Guide

Abstract: This document is the user guide for the winGamma software and is updated to version 1.97. winGamma is a state of the art non-linear analysis and modelling tool.

Keywords: Smooth model, data analysis, prediction, Gamma test, Feature selection, Noise estimation.

Communication regarding this document should be directed to:

Antonia J. Jones
DEPARTMENT OF COMPUTER SCIENCE
UNIVERSITY OF WALES, CARDIFF
PO BOX 916
Cardiff CF24 3XF

Telephone: +44 292 087 4812
Telefax: +44 292 087 4598

Or by email to: Peter J. Durrant at p.j.durrant@cs.cf.ac.uk

Copyright © University of Wales, Cardiff 1998-2001.

NOTICE

This program is experimental and should be used with caution. All such use is at your own risk. To the extent permitted by applicable laws, all warranties, including any express or implied warranties of merchantability or fitness for a particular purpose, are hereby excluded. The authors and distributors of this software disclaim all liability for direct, indirect, consequential, or other damages in any way resulting from this software.

This program is protected by copyright. You may not copy this program or accompanying documentation without the express written permission of the copyright holder. You may not modify this program.

Copyright © University of Wales, Cardiff 1998-2001.

Acknowledgements

Thanks are due to many people who have contribute
26. aders. Save the new file as an ASCII DOS text file with the filename suffix .csv. Edit the file name after saving if the text editor insists on putting an extra .txt on the suffix.

Step 4. Decide the maximum lag you are likely to need in analysis and modelling. For example, if you think the maximum lag likely to be needed is two months and the data is sampled weekly then the Number of Inputs to be set in winGamma will be 8. Load the .csv file into winGamma, which, because the numeric entries are separated by spaces, will treat the file as a multiple time series. Select the Number of Inputs as 8 and, assuming that you wish to produce a one week ahead forecast, select the Number of Outputs as 1. Do not at this stage normalise/scale the data: this can easily be done later but cannot be undone if it is done at this stage. You also have to decide at this stage whether you want to include a running window average for each time series as an input and whether you wish to include successive differences as possible inputs.

Without the running window average or successive differences, the data loaded into winGamma will now be ordered into the following columns:

TS1(t-8) ... TS1(t-1), TS2(t-8) ... TS2(t-1), ..., TSm(t-8) ... TSm(t-1), Target(t-8) ... Target(t-1), TS1(t) ... TSm(t), Target(t)

where the outputs are the entries at time t (underlined in the original layout). This is a rather tricky mani
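The column ordering described in Step 4 can be sketched in Python as follows. This is only an illustration of the layout, not what winGamma does internally; the function name lagged_table and the convention that the target is simply the last column of the input array are assumptions made for the example.

```python
import numpy as np

def lagged_table(series, max_lag=8):
    """Build rows of [lagged inputs | values at time t].

    series: array of shape (T, m), one column per time series
            (assumption: the target series is the last column).
    """
    series = np.asarray(series, dtype=float)
    T, m = series.shape
    rows = []
    for t in range(max_lag, T):
        # Inputs: for each series in turn, its values at t-max_lag, ..., t-1.
        inputs = series[t - max_lag:t, :].T.ravel()
        outputs = series[t, :]                 # every series at time t
        rows.append(np.concatenate([inputs, outputs]))
    return np.array(rows)

# Two toy series of 12 weekly samples each.
demo = np.column_stack([np.arange(12), 100 + np.arange(12)])
table = lagged_table(demo, max_lag=8)
print(table.shape)   # (4, 18): 8 lags x 2 series as inputs, plus 2 outputs
```

Each row then holds the eight previous values of every series, oldest first, followed by the values at time t.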
27. al scale.

This can illustrate more clearly the "wedge shaped" area. It can also be used to quickly ascertain the distribution of outliers. We shall call point pairs with large δ (each is a long way from its nearest neighbour) and large γ (the y values of close inputs are far apart) strong outliers, and techniques for identifying and eliminating such points will be discussed in a later version of this document.

2.3.3 The angle histogram

To help to further analyse the situation the software also produces an "angle histogram", as for example in Figure 2.11. For each point in the scatter plot we imagine joining the gamma intercept on the vertical axis of the regression line plot to the scatter point. The angle the resulting line makes with the positive horizontal axis is then computed. This angle lies in the interval (-π/2, π/2). A histogram of the resulting angles is then displayed. The feature to look for in this histogram is the frequency of angles close to the right hand end, i.e. close to π/2. If there are no points close to π/2 (90 degrees) this is a good indicator for smooth modelling. If there are many points close to π/2 this is a very bad sign. The importance of the distribution close to π/2 in the angle histogram is another way to visualise the upper left hand wedge of the scatter plot.

The remaining types of Experiment which can be performed are described in the follo
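The angle construction described for the angle histogram can be written down in a few lines of Python. This is a sketch under assumptions: the function name scatter_angles and the toy scatter values are invented for illustration, and the arrays delta and gamma stand for the horizontal and vertical coordinates of the scatter plot points.

```python
import numpy as np

def scatter_angles(delta, gamma, intercept):
    """Angle (radians) between the positive horizontal axis and the line
    joining the Gamma intercept (0, intercept) to each scatter point."""
    delta = np.asarray(delta, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    return np.arctan2(gamma - intercept, delta)

# Toy scatter points (hypothetical values) and an intercept of 0.5.
angles = scatter_angles([1.0, 1.0, 0.01], [1.5, 0.5, 2.5], intercept=0.5)
counts, edges = np.histogram(np.degrees(angles), bins=18, range=(-90, 90))
print(np.degrees(angles))
```

Points with small delta but a large vertical offset from the intercept produce angles near 90 degrees, which is the region of the histogram the text says to watch.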
28. ar embedding is sufficiently high to make the outcome of a full embedding search itself extremely unreliable. The resulting very low Gamma statistic of around 0.007 is an artifact of the statistics of the situation (with over a million embeddings to search we are quite likely to find one with a very small Gamma). The associated SE is 517. This clearly illustrates that a low Gamma statistic on a single data set is not enough to ensure a good model: we need to be sure that the SE is acceptable and that an M test illustrates the estimate has stabilised.

In reality, using the time series alone we are lucky to predict the weekly closing FTSE price to within a standard deviation of 80 (i.e. the true Gamma statistic is around 6400).

There is a further complication in that we have no real reason to suppose either that the underlying system is describable by a smooth dynamic model, or that if so the dynamical system is constant. Indeed towards the end of the 10 year period it is noticeable that the local variance of both the system behaviour and (as we shall see in Chapter III) of the errors of predictions increases. From this we conclude that either the dynamical system is itself varying or at the very least the constant noise variance assumption is suspect.

2.13.3 The sunspot data

Figure 2.34 Plot of the sunspots data file Sun280.asc.

The data used i
29. are empty because it is assumed that the outputs are unknown. The use of prediction files is discussed in section 3.10.

In general data files can be divided into two main categories: input/output files and time series files.

Creating data files using Excel

If data is prepared in a spreadsheet it can be exported to winGamma in the .csv format. Make sure that the numbers exported are in pure decimal format. At present winGamma may read numbers in the xEy format incorrectly.

When a .csv file is loaded the user will be automatically prompted to nominate particular columns as inputs or outputs by selecting with the mouse or using up/down cursor keys and the Enter (or Return) key. The mouse may also be used to select; double clicking will then change an input to an output and vice versa.

APPENDIX V Definitions

Model: A smooth data model is a differentiable function from inputs $(x_1, x_2, \dots, x_m)$ to each output $y$. It is assumed that the data can be represented by an unknown model $f$ so that

$y = f(x_1, \dots, x_m) + r$

where $r$ is a stochastic variable which represents noise.

Gamma test: An algorithm to estimate the variance of the noise Var(r) associated with a particular output. (Not to be confused with the variance of the output.)

Gamma statistic: Often referred to as a "Gamma value". It is the vertical intercept of the regression line plot and represents our best estimate for Var(r).

Embedding: A selection of past values of a time series used to predict the curren
30. art of a project with funding from the European Commission (THERMIE Programme) and the UK Department of Trade and Industry.

Figure 2.20 Increasing near neighbours (3 to 50) on solar.csv: Gamma and SE.

Figure 2.22 Scatter plot and regression line (zoomed, pmax = 20) for solar.csv.

Figure 2.24 Angle histogram for solar.csv (pmax = 20).

Figure 2.21 M test (pmax = 20, Randomise
31. art of the input space. By splitting up the input space and building a different model for each part a vast improvement in modelling capability was obtained.

It is interesting to note that by taking the number of near neighbours pmax much larger than is necessary (or desirable) for the Gamma test, the scatter plot can also reveal periodicities on different scales present in the data (although for large pmax the resulting Gamma statistic estimate will be essentially meaningless). Consider for example the data provided in ModSin5000.asc. This is a 1-Input/1-Output file derived from sampling the graph in Figure 2.2.

Figure 2.2 Modulated sine curve used to generate the Input/Output file ModSin5000.asc.

Figure 2.3 Scatter plot with pmax = 100 for ModSin5000.asc.

The scatter plot with pmax = 100 is shown in Figure 2.3. This illustrates both levels of periodicity and also shows why, to get an accurate Gamma statistic, we should take pmax fairly small.

2.3.2 The 3D histogram

This is just another way of viewing the scatter plot. The software can also display the scatter plot as a 3D histogram, as for example in Figure 2.10, which can be rotated and examined from different viewpoints. Click the left and right pointing red arrows to rotate the viewpoint. Default is to display frequency values linearly on the vertical axis but there is also an option for a logarithmic vertic
32. ax) and the length of the data (M). If run times are just too long then the Genetic Algorithm (GA) can be used, with Hill climbing and a Sequential embedding to do a small search around the candidates offered by the GA.

How should I choose the optimal number of near neighbours (pmax) in the Gamma test?

See section 2.4 of the manual.

How should I choose the optimal number of nearest neighbours (k) in Local linear regression?

By experiment with a test set.

The winGamma User Guide APPENDIX VI Frequently asked questions Version: 18 January 2002

What is "the best Gamma" and what does it mean?

The "best Gamma" in the context of a Gamma test is the closest approximation to the asymptotic Gamma statistic, which should approach the true noise variance.

The "best Gamma" in the context where we have a number of Gamma estimates for different selections of inputs (essentially different models), assuming these estimates are accurate, is the Gamma statistic closest to zero, because that suggests the model which should have smallest MSError when predicting outputs from inputs not used in the model construction process.

Note that if the true noise variance is actually zero, and the data is of arbitrarily high precision, there is no limit to how accurately we can model the unknown function, provided only that we have more and more data.

In most real life situations there is a positive noise variance remaining even after optimising the s
33. b

This shows how to load and use the file GammaTestProject.exe and gives examples of each function that can be called.

NetReader.nb

This notebook can read in any neural network created and saved by winGamma. The program identifies the network type and can then run the network. There may be very small differences in the results owing to the fact that this notebook uses a pure form of the sigmoidal function whereas winGamma uses a fine grained discrete lookup table for speed in training.

APPENDIX IV Generating test files

Generating your own data files

Data files may be generated using a wide variety of software tools. All data files used by winGamma are in plain ASCII format. One convenient method of generating data files is to use Excel to manipulate your data into the required rows and columns and then save the data in .csv format. Another convenient method for creating data files is to use Mathematica. winGamma is supplied along with a number of useful Mathematica programs for generating, manipulating and saving data in the correct formats. These are described in Appendix III.

Data is generally divided into four types: analysis, training, testing, and prediction. Prediction files are different in that they contain no output values but otherwise use the same formatting conventions. We use prediction files when we genuinely do not know the corresponding output values and want to generate predictions. For the prediction file the output fields 
34. d, 2 repetitions) on solar.csv.

Figure 2.23 3D Histogram (pmax = 20) for solar.csv.

Figure 2.25 Moving window Gamma test (pmax = 20) on 8400 points in steps of 100 from solar.csv.

There are some other factors of interest in this particular situation. As it happens the sensor which measures Irradiance is on the roof of the building, whereas the solar array is in a nearby location. The solar array is shaded at certain times of day by a chimney and a nearby building and this shading is not measured by the Irradiance sensor. Since the shading is obviously a function of the time of day and the time of year this has the effect of introducing smooth non-linearities into the situation, which would be extremely hard to model analytically. One could imagine including the time of day and year into the data set and then building a different and much more accurate model. By examining the difference between the two models we could actually quantify the effect of shading without having an analytic model. This would be a good example of one type of application of 
35. d probably build a pretty good model using only around 100 points. However, if we want to be sure then we should choose around 280 points, because from this point onwards the variations in the M test graph are very small. 280 points gives a Gamma statistic of 0.001054 and a Vratio of 0.001017.

8. Highlight the result in question in row 28 and click Analyse. The scatter plot and regression line is shown in Figure 1.15.

Handy Tip: By left clicking and dragging the mouse down and to the right we can zoom in on any selected part of these graphs, as shown in Figure 1.15. We can also move the contents up/down and left/right by right clicking and dragging. To restore the original view simply left click and drag the mouse up and to the right.

Figure 1.15 The scatter plot for 280 test points on Hen500.asc with 2 inputs (using 10 nearest neighbours).

It is interesting to see the remarkable difference between this scatter plot and the corresponding scatter plot for the noisy sine data we used in section 1.5. Here in Figure 1.15 we cannot fail to observe the almost empty wedge in the top left hand corner of the plot. We shall see in Chapter II that such a feature in the scatter plot is strongly indicative of a noise free smoothly determined process. This observation is reinforced by the very small Vratio.

9. Finally examine and compare the other graphs produced by the Analy
36. d to the development of winGamma. In particular we should like to mention:

Aðalbjörn Stefánsson, whose conviction that MSError could be determined without trial and error sparked the whole idea in 1995.

Antonia J. Jones and Nenad Končar, who slaved away to turn an idea into a useable prototype.

Alban Tsui, who explored the outer reaches of the technique using incredibly complicated surfaces and a long double precision version of the Gamma test.

Steve Margetts, who generated the original version of the amazingly tricky tree code in a matter of days and who put an unreasonable amount of effort into the algorithms that are the heart of winGamma.

Peter Durrant, who put far more into the front end development than any of us had a right to expect and who gave the product the look and feel it now has. The collaboration between Steve and Pete has been critical.

McCann Erickson (UK), who generously funded a Research Studentship without which winGamma would not now be available.

Nick Fiddean, the HoD of Computer Science at UWC, whose unfailing support made the job possible.

Tina Thomas, who put up with an invasion of computer scientists at the research farm over several years and cared for us all so well.

Howard James, who saw the potential and used his business expertise to help turn winGamma into a product.

Nick Bourne of the RACD division at UWC, who has steered us enthusiastically through the legalities that computer scientists blithely tend to ign
37. e embedding where no inputs are chosen can obviously be omitted). If m = 20 this is around one million. To do a full embedding we therefore have to perform one million or so Gamma tests, which is fairly time consuming (although it can be done in about a week on a fast PC).

Even if m is sufficiently small to make this practical (say m < 20), before we perform a full embedding (assuming say m > 10) we should ask if we have sufficient data to justify it. Looking at around one million Gamma values, the differences between many of them will probably be quite small, and so we should ask if our estimates of the Gamma values are accurate enough to be able to make these distinctions. Whether or not the estimates are sufficiently accurate to choose the absolutely best embedding will mainly depend on how much data is available. In practice the best few embeddings will usually have little to choose between them.

Because a full embedding on a large number of inputs is often pointless or impractical, winGamma offers a number of excellent heuristic methods to find a good embedding and these are described in the following sections.

A useful feature associated with a full embedding or GA search is the Embedding Histogram, which shows the frequency of embeddings with a specific Gamma statistic. If the choice of embedding is largely determined by statistical variations in the data this histogram tends to have a Gaussian or
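The size of a full embedding search follows directly from counting non-empty masks: with m candidate inputs there are 2^m - 1 of them. The short Python sketch below checks the "around one million" figure for m = 20 and derives the per-test time implied by the "about a week" estimate; the function names are invented for illustration.

```python
from itertools import product

def count_embeddings(m):
    """Number of non-empty input masks of length m."""
    return 2 ** m - 1

def all_masks(m):
    """Yield every mask of length m except the all-zero mask."""
    for bits in product("01", repeat=m):
        mask = "".join(bits)
        if "1" in mask:
            yield mask

n_masks = count_embeddings(20)
print(n_masks)                                # 1048575
# Time per Gamma test implied by finishing the search in one week.
seconds_per_test = 7 * 24 * 3600 / n_masks
print(round(seconds_per_test, 2))
```

Each additional candidate input doubles the number of masks, which is why the heuristic searches become attractive well before m reaches 20.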
38. e model type set at Local Linear Regression, set the number of nearest neighbours at 20, leave the Add constant box checked, and leave the Define local flow threshold option at 1E-6. Then click on Build.

5. When the Test, Query, WhatIf and Predict buttons become active in the Analysis Manager the model is built and ready to be used. Click on Test.

6. In the Select proportion of data set for model testing window set the range of test data to 8400 to 10578, as shown in Figure 3.4.

Figure 3.4 Selecting a proportion of the data for testing.

Figure 3.5 Result of LLR test for solar.csv.

The winGamma User Guide BUILDING AND TESTING A MODEL Version: 18 Jan 2002

We have used the points from 8400 to 10578 for testing. A sequence of such experiments for the number of near neighbours k = 10, 15, 20, 25 shows that k = 20 seems to give the smallest MSError = 0.011959 on the test set. This is somewhat better than the Gamma test result led us to believe but very much in the same ball park. The results of the test with k = 20 are shown in Figure 3.5. Here we can see that the agreement between the predicted (blue trace) and the actual test data (green trace) is very close. The red trace indicates the error.
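To make the build-and-test procedure concrete, here is a minimal Python sketch of local linear regression with k nearest neighbours and a constant term, evaluated by MSError on held-out points. It is not winGamma's implementation: the function name llr_predict, the brute-force neighbour search (winGamma is described as using a kd-tree), the synthetic data, and the omission of the local flow threshold option are all assumptions made for illustration.

```python
import numpy as np

def llr_predict(x_train, y_train, x_query, k=20):
    """Predict each query point from a linear fit to its k nearest
    training points (with a constant term, as in 'Add constant')."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    x_query = np.asarray(x_query, dtype=float)
    preds = np.empty(len(x_query))
    for i, q in enumerate(x_query):
        d2 = ((x_train - q) ** 2).sum(axis=1)       # squared distances to q
        idx = np.argpartition(d2, k - 1)[:k]        # k nearest neighbours
        A = np.hstack([x_train[idx], np.ones((k, 1))])
        coef, *_ = np.linalg.lstsq(A, y_train[idx], rcond=None)
        preds[i] = np.append(q, 1.0) @ coef
    return preds

rng = np.random.default_rng(1)                       # arbitrary seed
x_all = rng.uniform(-1.0, 1.0, size=(300, 2))
y_all = 2.0 * x_all[:, 0] - x_all[:, 1] + 3.0        # noise-free plane (toy data)
x_train, y_train = x_all[:200], y_all[:200]
x_test, y_test = x_all[200:], y_all[200:]
mse = np.mean((llr_predict(x_train, y_train, x_test, k=20) - y_test) ** 2)
print(mse)
```

On real data one would repeat the test for several values of k, as the text does for k = 10, 15, 20, 25, and keep the value giving the smallest MSError on the test range.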
39. e particular output.

The goal of model identification for a particular output is to choose a selection of inputs that minimises the asymptotic value of the modulus of the Gamma statistic. All things being equal this should result in a model which has minimal MSError when used to predict the output using input data not seen in the model construction process.

What happens if the final conclusion is that the noise variance on the output we are trying to predict is unsatisfactory? We can attempt one or all of the following:

• Increase the accuracy of measurements of both the inputs and the outputs. The effective noise variance on the output may be the result of measurement error on the inputs.

• Ask if we have included all the principal causative input variables liable to affect the output. If some obviously important factor has been missed then this may well explain why we are currently unsuccessful in predicting the output variable.

• For a time series prediction we could increase the rate of sampling or consider if there are other time series which may have predictive value for the time series we are interested in predicting (such time series are often called leading indicators).

One reason the Gamma test is so useful is that it can immediately tell us, directly from the data, whether or not we have sufficient data to form a smooth non-linear model and how good that model is liable to be. If the result is that the error of prediction is too h
40. e searched over the previous 15 years.

Figure 2.35 The M test graph for the sunspot data using the best embedding of length 15.

Figure 2.36 The frequency histogram of all embeddings of length 15 using the sunspot data.

The best embedding found was 001001000010111. Here the most recent data comes last. So this embedding says that to predict this year's sunspot activity x(t) we should use the data x(t-1), x(t-2), x(t-3), x(t-5), x(t-10) and x(t-13): an embedding of dimension six. It is interesting to note the bimodal distribution of the Full Embedding Histogram of Figure 2.36. The bimodal distribution is partly explained by the observation that only 2.38% of the embeddings with a Gamma statistic > 0.008 include x(t-1), as compared with 99.8% of those having a Gamma statistic < 0.008. Put plainly this says that the most important predictive factor for the sunspot activity this year is the value for last year. It is also interesting to see which variables appear in the best few embeddings. These indicate that the last few years, plus the value approximately one 11 year cycle back, plus a value about half way through the
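Reading a mask of this kind is easy to get wrong by one position, so a small Python sketch may help. It assumes only what the text states, namely that the most recent value comes last in the mask; the function name mask_to_lags is invented for illustration.

```python
def mask_to_lags(mask):
    """Return the lags selected by a mask whose last bit is x(t-1)."""
    n = len(mask)
    # Bit i (0-based, from the left) corresponds to lag n - i.
    return sorted(n - i for i, bit in enumerate(mask) if bit == "1")

lags = mask_to_lags("001001000010111")
print(lags)   # [1, 2, 3, 5, 10, 13]
```

The result agrees with the six inputs x(t-1), x(t-2), x(t-3), x(t-5), x(t-10) and x(t-13) listed for the best sunspot embedding.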
41. ear data.

The Increasing Near Neighbours plot for pmax = 3 to 30 is given in Figure 2.13. This suggests the best estimate for the Gamma statistic is obtained at around pmax = 17. The M test result of Figure 2.14 was obtained starting at M = 50 and increasing M to 500 in steps of 10. This consistently gives a Gamma statistic of around 0.07, but ideally, as the graph has not yet settled to an asymptote, we should need more points to obtain an accurate estimate for this noisy 1-dimensional data.

The scatter plot in Figure 2.15 contains points with small δ but large γ, which also supports the conclusion. At the same time the regression line fit is rather poor. The 3D histogram in Figure 2.16 shows partial indicators of an "empty wedge" and supports the general conclusions that the data is noisy. The same is true of the angle histogram in Figure 2.17. Finally the Moving Window Gamma test (using a window size of 300 in steps of 10) in Figure 2.18 consistently shows a Gamma statistic between 0.072 and 0.076.

These results together indicate that we have noisy non-linear data but that model construction is quite feasible.
42. Figure 1.7 The Analysis and Data Set Manager windows after performing the initial experiment
Figure 1.8 The noisy sine data
Figure 1.9 An M test on the noisy sine data
Figure 1.10 The first 100 points of the Hen500.asc time series
Figure 1.11 The surface which defines x in the Henon map as a function of the two previous ...
Figure 1.12 The distribution of points in the input space for the Henon map
Figure 1.13 The result of an Increasing Embedding on Hen500.asc with a maximum of 10 inputs (using 10 nearest neighbours)
Figure 1.14 The result of an M test on Hen500.asc with 2 inputs (using 10 nearest neighbours)
Figure 1.15 The scatter plot for 280 test points on Hen500.asc with 2 inputs (using 10 nearest neighbours)
Figure 2.1 Main Features of the scatter plot and regression line
Figure 2.2 Modulated sine curve used to generate the Input/Output file ModSin5000.asc
Figure 2.3 Scatter plot with pmax = 100 for ModSin5000.asc
Figure 2.4 The variation of Gamma and SE as the number of near neighbours increases
Figure 2.5 M tes
43. election of inputs, because real measurements are subject to error, and there is no point in building more and more accurate models (for example by using the noise cancelling features of local linear regression) because the predictions of the model will never agree with our measured data unless, for example, the measurement error is decreased.

An exception to this might be if we are trying to get some idea about an underlying theoretical model, and winGamma can help in this respect, but determining a theoretical model (as opposed to an accurate numerical model) lies outside the competence of winGamma.

How should I choose between a local linear regression (LLR) method and a neural net method of model building?

Nets take a long time to train but may generalise better than LLR in regions of the input space where data is sparse. A high Gamma statistic on the training data may make neural network training even more difficult. If data is densely distributed over the input space then LLR may be a better choice in this situation.

The particular application also has an influence on which may be the best modelling tool to select. For example, to learn new data it may be necessary to retrain a neural network from scratch, which is time consuming, whereas dynamic LLR can easily accommodate new training data.

Local linear regression models are very fast to build, but take relatively longer to query because a kd-tree is used to find the near neighbours of the query poi
44. ending up with a long GA run.

6. If a better embedding is found then repeat steps 2, 3 and 4 to refine those conclusions.

Time series files

Time series analysis is complicated by the fact that we probably do not know how far back in time we should look to build our prediction model. This initial decision is not irrevocable and should be guided by some degree of commonsense analysis on what is likely to be the case for the given data set and how much data is available.

E.g. for a single time series with annual periodicity, where the samples are weekly, we might set the number of inputs to 104, the equivalent of two years. However, 104-dimensional space has "a lot of room up there" and we should need a data set going back many years to make this worthwhile. If only a few years data is available then perhaps we should first consider a model over the last several months or weeks.

1. Load the data and do not initially normalise if it is a single time series. Set the number of inputs to a reasonable maximum in the light of the data and the number of outputs initially to one. Now perform a simple Gamma test on the full data set (if not exceedingly large), with the default number of near neighbours set to 10, to get an initial idea. If the data set is very large use a subset of the data for initial experiments.

2. Run an Increasing Embedding test to determine a likely embedding dimen
45. ents are:

Pentium processor 133 MHz or preferably faster.

RAM 32-64 Mbytes. The amount of memory you will need to run winGamma is constrained less by the program than by the size of the data sets that you wish to analyse. With the possible exception of the neural network training algorithms, the theoretical average case computation times of the main algorithms in winGamma scale like O(M log M), where M is the number of rows in the data file. However, under some conditions some algorithms in winGamma may require quite a lot of memory to achieve the theoretical scaling.

An example is Increasing Near Neighbours when pmax is large. Suppose we consider solar.csv with 10578 rows of three numbers each and set pmax = 100. This will require approximately 0.25 Mbytes for the data, 0.25 Mbytes for the kd tree, but more than 4 Mbytes for the roughly 10^6 numbers which constitute the list of 100 nearest neighbours for each of the 10578 input vectors in the data file. To perform a Gamma test each of the near neighbour indices must be instantly available, and they could be anywhere in the range 1 to 10578. If the system has less than 4 Mbytes of available RAM then it will have to keep paging data in and out from the hard disk. This will dramatically slow the algorithm and may in fact render the entire computation infeasible. If you observe a large amount of continuous paging disk activity then (a) close down all other applications, (b) consider if it is feasible to
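The memory figures in the example above can be checked with a short calculation. The sketch below assumes 8 bytes per stored data value and 4 bytes per neighbour index; the guide does not state the storage sizes winGamma actually uses, so both constants are assumptions chosen because they reproduce the quoted 0.25 Mbytes and "more than 4 Mbytes".

```python
# Back-of-envelope memory estimate for the Increasing Near Neighbours example.
# ASSUMPTIONS (not stated in the guide): 8-byte data values, 4-byte indices.
BYTES_PER_VALUE = 8
BYTES_PER_INDEX = 4


def memory_estimate(rows, columns, pmax):
    """Return (data_bytes, neighbour_list_bytes) for a data file."""
    data_bytes = rows * columns * BYTES_PER_VALUE
    # One index per near neighbour, pmax neighbours for every row.
    neighbour_bytes = rows * pmax * BYTES_PER_INDEX
    return data_bytes, neighbour_bytes


data_bytes, neighbour_bytes = memory_estimate(rows=10578, columns=3, pmax=100)
print(f"data:                {data_bytes / 1e6:.2f} Mbytes")
print(f"near neighbour list: {neighbour_bytes / 1e6:.2f} Mbytes")
```

The neighbour list grows in proportion to pmax, so reducing pmax is the most direct way to bring the requirement down.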
46. er previous cycle give the best results. This is rather impressive since the software has no way of knowing about sunspot cycles.

If we run the Gamma test on the six inputs, one output I/O data file constructed using this mask we get Gamma = 0.0015 and Vratio = 0.036, SE = 0.00093. Note the M test of Figure 2.35 indicates that there is not really enough data (the graph has not stabilized). Therefore if we construct a model and test on unseen data we might expect to get a higher MSError than the estimated Gamma value. If we now predict the last 59 years of data, using local linear regression with k = 60 near neighbours and a threshold of 0.0001 (we shall see how to do this in the next chapter), on the basis of all the previous years, we obtain Figure 2.37, which gives a MSError around 0.007. In cases such as this, where there is insufficient data, it is not uncommon to see a MSError on unseen data around an order of magnitude greater than the Gamma statistic.

[Figure 2.37: A test of the LLR model on the data set SunPairs.asc (blue predicted, green actual, red [illegible]).]

2.14 Handling multiple time series

In later releases we plan to include a TimeSeries Editor to facilitate the direct manipulation of multiple time s
47. eries. However, using a combination of winGamma and Excel it is at present possible to accomplish multiple time series manipulation relatively easily.

Suppose we have several time series TS1, ..., TSm which we wish to use to predict a target time series Target.

Step 1. Suppose the time series are in an Excel file which is structured as follows.

Table 2.4 Excel file for multiple time series

Date          TS1     TS2     ...   TSm      Target
10/07/1981            10.39   ...   132.06   606.8
10/14/1981    15.5    10.34   ...   132.92   606.8
10/21/1981    15.5    10.38   ...   130.43   626
...

In Excel delete the date column, because it is not numerical data that can be used as input for winGamma. Notice that now the only factor which preserves the time relationship is the order of the rows.

Hint: It is important when dealing with multiple time series that all the data in a row is sampled at the same time. If one measurement is sampled weekly and another monthly then we can use linear interpolation to construct weekly data samples for the monthly sampled data.

Step 2. Save the file from Excel in CSV format. You should include the first row of text descriptors of the time series, as winGamma can handle these and they will be useful later.

Step 3. Load the file just saved into a text editor which can SaveAs ASC DOS text files, and search and replace the commas separating the numeric data only. Do not replace the commas separating the text he
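The hint about bringing a monthly series onto a weekly grid can be sketched as follows. This is only an illustration of linear interpolation, not a winGamma or Excel feature; the function name `interpolate`, the day-number time axis and the sample values are all hypothetical.

```python
def interpolate(sample_times, sample_values, query_times):
    """Linearly interpolate a coarsely sampled series at the query times.

    sample_times must be strictly increasing, and query_times must be
    sorted and lie within [sample_times[0], sample_times[-1]].
    """
    result = []
    j = 0  # index of the left end of the current sample interval
    for t in query_times:
        # Advance to the interval [sample_times[j], sample_times[j + 1]]
        # that contains t.
        while j < len(sample_times) - 2 and sample_times[j + 1] < t:
            j += 1
        t0, t1 = sample_times[j], sample_times[j + 1]
        v0, v1 = sample_values[j], sample_values[j + 1]
        result.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return result


# Hypothetical example: readings every 28 days, wanted every 7 days.
monthly_days = [0, 28, 56]
monthly_values = [10.0, 14.0, 12.0]
weekly_days = list(range(0, 57, 7))
weekly_values = interpolate(monthly_days, monthly_values, weekly_days)
for day, value in zip(weekly_days, weekly_values):
    print(day, value)
```

Each weekly row can then be placed alongside the genuinely weekly series so that every row of the file refers to the same sampling time.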
48. erve but is not actually the best way to use dynamic LLR. It is better to start with a reasonable training set size, because then the initial kd tree (a data structure used extensively by winGamma) will be more balanced and query times will be reduced.

The winGamma User Guide BUILDING AND TESTING A MODEL Version 18 Jan 2002

Under the circumstances it is not surprising that if the same test data is presented to the model a second time the MSError will reduce dramatically.

3.3 Two layer back propagation

[Figure 3.1: The first set up menu for two layer backpropagation (Modelling Editor, Network Parameters tab: number of nodes in first layer, number of nodes in second layer, network initialisation time). Figure 3.2: The second set up menu for two layer backpropagation (Modelling Editor, Training Parameters tab: initial learning rate, momentum, regularisation, train to target MSE, network training time).]

This option uses the standard backpropagation algorithm to produce a two layer feedforward neural network.

With all the neural network training algorithms one should note the option to recalculate the
49. ese experiments and the interpretation of their results fully in Chapter II. For the present we shall simply illustrate the basic Gamma test experiment.

The Data Set Manager shows the data that has been loaded, as in Figure 1.5 (where the windows have been tiled). Because data files may be very large the data rows are divided into "pages" of 100 rows each. In Figure 1.5 the first page has been selected. Each column represents a column of inputs or outputs and is labelled as such. The first four rows give the Mean, Standard Deviation, Minimum and Maximum of each column for the whole of the data selected for analysis. The name of the current data file is also displayed at the top of this window.

Handy Tip: Note that most of the windows and sub windows, including the column separators in the Data Set Manager data display, can be resized using click and drag.

To perform a Gamma test select the Analysis Manager and then Experiments. Highlight Gamma test and select "New". We can now toggle between the Experiment tab and the Mask tab. The only option to be set from the Experiment tab in this experiment is the number of near neighbours. For the present leave this set to 10. The Mask tab is used to select which inputs to include. Leave this set to "11", i.e. both inputs are included.

The winGamma User Guide GETTING STARTED Version 18 Jan 2002
50. ession of models based on 1 Input (the previous value), 2 Inputs (the previous two values), etc., up to the maximum number of inputs we have set (which is 10), where in each case the output is the current value.

Results: This gives us a succession of Gamma values which we can graph by clicking on "Graph". The result is shown in Figure 1.13. Here it becomes clear that the best of these models (i.e. the one having a Gamma statistic closest to zero) is the one which uses just the two previous values. The Gamma statistic for this model is approximately 0.000161, which is very close to zero. The Vratio is 0.0001648, which again is close to zero.

7. Now that we have identified the relevant inputs pull down the "Transform" menu and click on "Transform the data set". Under the Time Series Options select 2 inputs and 1 output and then leave the proportion of data set for analysis set to 1-498.

8. Next in the Analysis Manager under Training Set Analysis select M test and then click on "New". In the Experiment Editor click on the M test tab and set the initial sample size to 10, leave the final sample size set to 498, and set the step size to 10. Now click on "Execute". We should like to see how stable the Gamma statistic is and how much data we are likely to need to get a good quality model. Finally, when the results window comes up, click on "Graph" to see the result of the experiment. This is shown in Figure 1.14.

Results: We see from the graph that we coul
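The row counts quoted in the steps above (498 rows for 2 inputs and 1 output from a 500 point series) follow from how a time series is turned into input/output rows. The sketch below illustrates that construction; `embed` and `henon_series` are hypothetical helper names, and the map parameters a = 1.4 and b = 0.3 are the textbook Henon values, assumed here rather than taken from the guide.

```python
def henon_series(length, a=1.4, b=0.3):
    """Generate a delayed Henon series x(t) = 1 - a*x(t-1)^2 + b*x(t-2).

    The parameter values are the standard textbook ones (an assumption;
    the guide does not say how its example file was generated).
    """
    series = [0.0, 0.0]
    while len(series) < length:
        series.append(1.0 - a * series[-1] ** 2 + b * series[-2])
    return series[:length]


def embed(series, n_inputs):
    """Build (inputs, output) rows: n_inputs previous values -> current value."""
    rows = []
    for t in range(n_inputs, len(series)):
        rows.append((series[t - n_inputs:t], series[t]))
    return rows


series = henon_series(500)
for n_inputs in (2, 10):
    print(n_inputs, "inputs ->", len(embed(series, n_inputs)), "rows")
```

With 10 inputs the same 500 points yield 490 rows, since the first 10 values have no complete history to predict from.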
51. [Figure 1.5: The Analysis and Data Set Managers after loading a data file. The Analysis Manager lists the experiments under Training Set Analysis (Increasing near neighbours, M Test, Moving window gamma test) and Model Identification; the Data Set Manager shows the file solar.csv.]

When these steps are complete click on "Execute". Under the Analysis Manager the Settings window will now show the settings for the current experiment. This is shortly augmented by a Results window which shows the results of this experiment. We can switch between the Settings and Results windows using the appropriate tabs. These results for the single output are presented in a Results Settings window along a single row (because there is only one output) and are shown here in the first column of Table 1.1. If there is more than one output the software generates a similar set of results for each output.

Finally "Transform" the data and repeat the experiment to obtain the scaled results in the last column of Table 1.1.

1.3.1 Interpreting the results

To interpret these results it helps to have some idea of how the Gamma statistic is calculated. We shall describe this more fully in Chapter II, but for now it is enough to know that the Gamma statistic is calculated by determining a regression line based on near neighbour statist
52. ewer "1"s and thereby place more emphasis on simpler models.

Note: the three weightings selected for GA fitness should sum to 1.

Run Time: 5 minutes. The (approximate) maximum time selected to perform the GA.

Setting the population larger may improve the final fitness of the best mask found, but only if a large run time is permitted. For long masks (i.e. a large number of inputs) and large data sets the GA will require runs of several hours.

2.7.3 Hill Climbing

In hill climbing a mask is taken (default is all ones for the current number of inputs) and each bit is flipped in turn, calculating the Gamma, until the end of the mask is reached. This is repeated until no single bit flip gives an improvement on the Gamma. This is a relatively fast heuristic but takes longer than a sequential embedding.

2.7.4 Sequential Embedding

Here a single pass through the current mask is made, flipping each bit only if there is an improvement over the Gamma statistic obtained with the original mask. Again the default is a mask of all ones equal to the length of the current input vector, though as in the previous method an initial mask of any kind can be used provided its length does not exceed the current number of inputs. This is very fast.

2.7.5 Increasing Embedding

The Increasing Embedding algorithm starts with the mask obtained by taking only the rightmost input (in the case of a time series
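The two bit-flipping heuristics described in 2.7.3 and 2.7.4 can be sketched as below. This is one reading of the description, not winGamma's code: a flip is kept when it improves on the best score found so far, and the score function is a stand-in for running a Gamma test on the masked data. `toy_score` and its hidden target mask are purely hypothetical, included so the sketch runs on its own.

```python
def flip(mask, i):
    """Return the mask string with bit i inverted."""
    return mask[:i] + ("0" if mask[i] == "1" else "1") + mask[i + 1:]


def single_pass(mask, score):
    """Sequential embedding style pass: try each bit once, keep improvements."""
    best, best_score = mask, score(mask)
    improved = False
    for i in range(len(mask)):
        candidate = flip(best, i)
        candidate_score = score(candidate)
        if candidate_score < best_score:
            best, best_score, improved = candidate, candidate_score, True
    return best, improved


def hill_climb(mask, score):
    """Repeat passes until no single bit flip gives an improvement."""
    improved = True
    while improved:
        mask, improved = single_pass(mask, score)
    return mask


# Stand-in for |Gamma|: distance from a hidden "ideal" mask (hypothetical).
TARGET = "0001101"


def toy_score(mask):
    return sum(a != b for a, b in zip(mask, TARGET))


print(hill_climb("1111111", toy_score))
```

In real use each call to the score function costs a full Gamma test, which is why the single pass is the cheaper of the two heuristics.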
53. files of data for testing?

When the analysis data set doesn't include the test data, or further tests need to be made on a model.

When should I use the moving average option?

Usually when you have plenty of data and want to determine if the number of data samples used to estimate the Gamma statistic gives a stable value over a range of different sample sets of the same size.

This test is also useful to investigate if the underlying dynamics is itself varying.

When should I use the differential option?

It may improve the MSError for difficult time series.

Which input is the differential (or moving average) input?

When these options are activated the new data column is placed in the highest numbered positions, with differential first and moving average last. This can be confirmed by placing the cursor over the vertical column and dragging it wider, thus revealing the applicable legend.

INDEX

Analysis Manager window; angle histogram; Data; data model; Data Set Manager; embedding; embedding dimension; Embedding Histogram; Excel Macro; Experiments window; full embedding; GA Fitness; Gamma statistic; Gradient; heuristic search techniques; input columns; mask; Normalization (standard normalization); observations; output columns; over training (definition); Project (definition); Query; Results window; Standard Error; Vratio; WhatIf
54. function generating the data, provided this is itself a smooth function. You can easily see this by using the winGamma WhatIf facility on LLR models built from increasing sized data sets, for example on the data from Sin500.asc.

LLR can produce very accurate predictions in regions of high data density in input space, but it is liable to produce unreliable results for non-linear functions in regions of low data density. In other words LLR does not generalise well, but is a very good interpolative tool if we have large amounts of data.

There are three user settable parameters in LLR: the number of near neighbours, whether or not to include a constant term in the linear model, and a threshold value for filtering the local eigenvectors.

The choice of the number of near neighbours k in LLR is quite critical. If the noise level on the output (i.e. the asymptotic Gamma statistic) is low then some small multiple of the number of inputs should suffice. If the noise level on the output is high then k needs to be larger to obtain better noise cancellation. Unfortunately, if the unknown function f we seek to model is highly non-linear (has regions of high curvature) then, unless we have a very large amount of data, setting k large may mean that in the region of these k points the assumption that the unknown function can be locally approximated by a linear model is false. In this case the resulting predictions would be inaccurate. We have not yet developed rules of
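The role of k can be illustrated with a deliberately simplified local linear regression for a single input. This is a sketch under stated assumptions, not winGamma's implementation: it finds neighbours by brute force rather than with a kd tree, always includes the constant term, and omits the eigenvector threshold. The function name `llr_predict` is hypothetical.

```python
import math


def llr_predict(xs, ys, query, k):
    """Predict y at `query` from a least squares line through the k nearest points."""
    # Brute-force search for the k near neighbours (a kd tree would be
    # used for speed in practice).
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - query))[:k]
    nx = [xs[i] for i in nearest]
    ny = [ys[i] for i in nearest]
    mean_x = sum(nx) / k
    mean_y = sum(ny) / k
    sxx = sum((x - mean_x) ** 2 for x in nx)
    if sxx == 0.0:
        return mean_y  # all neighbours coincide: fall back to their mean
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(nx, ny)) / sxx
    return mean_y + slope * (query - mean_x)


# Noise-free sine data on 500 evenly spaced points.
xs = [2.0 * math.pi * i / 499 for i in range(500)]
ys = [math.sin(x) for x in xs]
for k in (5, 50, 250):
    error = abs(llr_predict(xs, ys, 1.5, k) - math.sin(1.5))
    print(f"k = {k:3d}  absolute error = {error:.6f}")
```

On this noise-free curve the error grows as k increases, because a straight line is asked to cover a wider, more curved stretch of the function; with noisy outputs a larger k would instead help by averaging the noise.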
55. grees, Irradiance is 10.

Similarly we can ask: "How does the power output vary when the Irradiance is fixed at 10 and the Temperature varies from 1 to 15?"

The answer is given in Figure 3.7. Here the result is somewhat different. One interesting feature is the slight rolloff of power output with increasing temperature. This is a real effect and is a consequence of the physics of solar cells.

3.6.1.2 A histogram of prediction errors for LLR model

If we save the data produced by the results of the LLR test then we can examine an error histogram for the predictions. This is shown in Figure 3.8. The vertical gridlines are one standard deviation either side of the mean, which is close to zero. This is the final test of our model.

[Figure 3.8: An error histogram for the LLR test.]

3.6.2 Building and testing a neural model

We can now repeat the model building process using a neural model.

3.6.3 Visualising the data

For a 2-Input 1-Output data set we can visualise the model as a 2-dimensional surface and, using suitable software, plot this surface directly from the data. Of course in higher dimensional spaces such graphical realisations are not possible. Moreover, if the data is very noisy such a surface will be very jagged and not much use as a model.

Nevertheless, now that we have finished studying solar.csv it would be interesting to see what
56. he structure of the data (as in multiple Input Output data sets) the only parameter to optimise is the number of near neighbours (often denoted by pmax). It is a remarkable fact that for many data sets the default of pmax = 10 nearest neighbours is nearly optimal.

A suitable size for pmax in the Gamma test principally depends on two factors:

• The number of data samples M: if M is large the local number of data points close to a given point can be expected to be high.

• The local curvature of the surface described by the unknown function f: other things being equal, for a surface with high curvature we cannot afford to take neighbours too far away, so that pmax will need to be smaller.

Systematic ways to determine the best choice for the number of near neighbours are described later in section 2.4.

Note that the size of pmax in modelling the unknown function f using local linear regression is determined by other factors, described in section 3.2. Whilst for the Gamma Test it is usually the case that we want to take pmax small, for local linear regression at high noise levels we will need to take pmax much larger.

2.3 The Gamma Test analysis graphs

After performing an experiment highlight the row containing the Gamma result to be scrutinised and click "Analyse".

Clicking on the tabs will provide the other plots that are discussed below. In an experiment where there are multiple Gamma results the graphs and plots will relate to the highligh
57. his assessment.

Now click on Analyse. This enables us to see three analytical graphical displays which are described more fully in Chapter II. The first of these displays is the Gamma scatter plot and regression line of Figure 1.6. The other two tabs give a 3D Histogram and an Angle histogram. These are different ways of viewing the data in the scatter plot.

How stable is the Gamma statistic (with 10 near neighbours) as the number of data points varies? We can answer this question by clicking on the Experiments tab and then highlighting M test. This will run the Gamma test for an increasing number M of data points. Now click on "New" to begin setting up the M test: leave the number of near neighbours set to 10 and click on the M test tab. Set the initial sample size to 10, the final sample size to 500, and the step size to 10. Now click on Execute to begin the Experiment. After the Results window comes up click on "Graph" to obtain a graph of the Gamma statistic values against the number of data points. This is shown in Figure 1.9 and we can see that after around 425 points the graph is fairly stable.

[Figure 1.9: An M test on the noisy sine data (Gamma against Unique Data Points).]

The fact that the data is rather noisy
58. hniques:

• Local linear regression
• Dynamic local linear regression
• Neural networks using various types of learning algorithm

Local linear regression models are fast to construct and quite fast to execute a query. Local linear regression models can also be easily updated as new training data becomes available, which is not the case with neural networks (where a prolonged extra period of training, or starting training all over again, may be required to modify the model on the basis of new data). Indeed winGamma also offers a dynamic local linear regression option, which is exactly local linear regression with dynamic updating (this option is quite useful for time series prediction).

Usually local linear regression is extremely accurate in parts of the input space where the training data density is high. However, local linear regression will not generalise well to parts of the input space for which training data is sparse.

Neural network models take time to construct, but in parts of the input space where data is sparse they tend to generalise better than local linear regression. It is often quite hard to get a neural network to train down to a very small Gamma statistic (say 10 or 10 [exponents illegible], which can easily happen with zero noise dynamical system time series), i.e. it may take several attempts, each of which takes a long time. However, neural networks can make predictions at blinding speeds compared with local linear regression based algorithms, so for some applicatio
59. Prediction file data
APPENDIX III Using the Mathematica 4.0 files
    DataGen.nb
    DataAnal.nb
    mathlinkGamma.nb
    NetReader.nb
APPENDIX IV Generating test files
    Generating your own data files
    Creating data files using Excel
APPENDIX V Definitions
APPENDIX VI Frequently asked questions
INDEX

List of Figures

Figure 1.1 Toggle inputs to outputs as required when loading a .csv file as Input Output data
Figure 1.2 Selecting the number of inputs and outputs per time series
Figure 1.3 The "Normalise" check box
Figure 1.4 Selecting a proportion of the data for initial analysis
Figure 1.5 The Analysis and Data Set Managers after loading a data file
Figure 1.6 The Gamma statistic and the Gradient Slope
60. ics derived from the data (see Figure 1.6).

Gamma

The first row of Table 1.1 gives the Gamma statistic (pmax = 10) for the output as evaluated over the data used for analysis (in this case the whole data set). As one can see from Figure 1.6, the Gamma statistic is actually the vertical intercept of the regression line in the figure. This is the estimated variance of the errors for any smooth model built on the data. Since the output variable range is approximately [0, 30] this is a relatively small error variance. It suggests that a model built on this data will have a standard deviation of the prediction error of about sqrt(0.020761) = 0.144 on the unscaled data, which is about 0.5% of the range.

[Table 1.1: Gamma test results with pmax = 10 for unscaled and scaled solar.csv data. Columns: Unscaled, Scaled. The only surviving row reads 0.000760 and 0.000785; its label is illegible.]

[Figure 1.6: The Gamma statistic and the Gradient Slope.]

In general it is helpful to distinguish two cases.

• First, where the true noise variance is zero. In this case the asymptotic Gamma statistic should approach zero and there is no limit to how good a model we can build, provided only that we have more and more data of arbitrarily high precision. For example, this can happen with artificially generated data for chaotic time series.

• Second, and more realistically, where the true noise variance is positive. In this case the asymptotic Gamma statistic should a
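To make the "vertical intercept of the regression line" concrete, here is a brute-force sketch of the Gamma test as it is usually formulated in the published literature: for each k up to pmax, delta(k) is the mean squared distance to the k-th nearest neighbour in input space, gamma(k) is half the mean squared difference of the corresponding outputs, and a least squares line is fitted through the pmax points. It is not winGamma's code (which uses a kd tree), and the sine test data and noise level are assumptions made for illustration.

```python
import math
import random


def gamma_test(inputs, outputs, pmax=10):
    """Return (Gamma statistic, gradient) for one output.

    inputs: list of equal-length tuples, outputs: list of floats.
    Requires more than pmax data points. Brute-force neighbour search.
    """
    m = len(inputs)
    delta = [0.0] * pmax
    gamma = [0.0] * pmax
    for i in range(m):
        # (squared distance, index) for every other point, nearest first.
        neighbours = sorted(
            (sum((a - b) ** 2 for a, b in zip(inputs[i], inputs[j])), j)
            for j in range(m) if j != i
        )[:pmax]
        for k, (dist2, j) in enumerate(neighbours):
            delta[k] += dist2 / m
            gamma[k] += (outputs[j] - outputs[i]) ** 2 / (2 * m)
    # Least squares line gamma = Gamma + gradient * delta.
    mean_d = sum(delta) / pmax
    mean_g = sum(gamma) / pmax
    sdd = sum((d - mean_d) ** 2 for d in delta)
    gradient = sum((d - mean_d) * (g - mean_g) for d, g in zip(delta, gamma)) / sdd
    return mean_g - gradient * mean_d, gradient


# Illustrative data (assumed): a sine curve with added Gaussian noise.
rng = random.Random(0)
xs = [(rng.uniform(0.0, 2.0 * math.pi),) for _ in range(300)]
noise_sd = 0.2
ys = [math.sin(x[0]) + rng.gauss(0.0, noise_sd) for x in xs]
gamma_stat, gradient = gamma_test(xs, ys)
print("true noise variance:", noise_sd ** 2)
print("Gamma statistic:    ", gamma_stat)
print("implied error sd:   ", math.sqrt(abs(gamma_stat)))
```

The square root in the last line is the same step the text applies to 0.020761 to arrive at a prediction error standard deviation of about 0.144.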
61. igh, no matter how much data we are given, then we must address the above issues.

For each choice of inputs investigated, as the number of data points increases we attempt to establish the asymptotic Gamma statistic for each output. We then choose the set of inputs for a particular output that has the minimum asymptotic Gamma statistic; this is known as model identification. Having established the best selection of inputs for each output using the winGamma software, models may be built by:

• Static local linear regression (fixed model).
• Dynamic local linear regression (model updated as new data becomes available).

or by using one of four different types of neural network training algorithm:

• Two layer back propagation.
• Meta backpropagation (not included in the Beta release).
• Conjugate gradient descent.
• BFGS neural network.

Predictions on new input data for which the outputs are unknown can also be made using one or more of the models.

Footnotes: Convergence in probability. Because of sampling error, if the variance of the noise level on an output is very small the Gamma statistic may sometimes be negative, even though a variance can never be negative. If this occurs we use the absolute value, or modulus, of the Gamma statistic.

1.1.2 The range of applicability

The software is designed to analyse data with the goal of producing a near optimal smooth function fr
62. iginal data set which have |x(i) - x(j)| small (i.e. x(i) and x(j) are close together) but whose corresponding output values y have |y(i) - y(j)| large (i.e. y(i) and y(j) are far apart). This is very bad from the viewpoint of constructing a smooth model. It may be a reflection of a high intrinsic noise level on y (a high Gamma), or it may just be that there is no smooth underlying model.

[Figure 2.1: Main features of the scatter plot and regression line (Gamma Scatter Plot, horizontal axis delta).]

An example where the underlying model is not intrinsically smooth might be a logic function of the input variables, e.g. XOR or m-bit parity. In m-bit parity the inputs are the vertices of the m-dimensional unit hypercube and the outputs are 1 or 0. In fact one can put a smooth surface through these points, but this is a rather meaningless exercise. Problems with a large number of discrete input or output variables are best tackled via a decision tree approach rather than by trying to use smooth modelling techniques.

The scatter plot can also give important clues on the nature of the data. For example, it can happen in some control applications that the system being modelled goes through two or more different dynamical regimes. In one instance the scatter plot revealed that there were really two different regression lines, each corresponding to a different dynamical regime. Moreover, each regime corresponded to a distinct p
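The parity example shows the "close inputs, distant outputs" problem in its purest form, and a few lines of code make this visible. The sketch below builds the m-bit parity data set and checks that every pair of nearest neighbours on the hypercube (vertices differing in one bit) has different outputs. The function names are hypothetical.

```python
from itertools import product


def parity_data(m):
    """All vertices of the m-dimensional unit hypercube with their parity."""
    return [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=m)]


def nearest_neighbours_disagree(m):
    """True if every vertex differs in output from all m adjacent vertices."""
    table = dict(parity_data(m))
    for bits, output in table.items():
        for i in range(m):
            # The adjacent vertex obtained by inverting coordinate i.
            neighbour = bits[:i] + (1 - bits[i],) + bits[i + 1:]
            if table[neighbour] == output:
                return False
    return True


for row in parity_data(3):
    print(row)
print("all nearest neighbours disagree:", nearest_neighbours_disagree(3))
```

Because the nearest points in input space always carry the opposite output, a near neighbour based estimate sees the data as maximally noisy even though the function is fully deterministic.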
63. iple Time Series.

winGamma assumes that non-determinism in a smooth model from inputs to outputs is due to the presence of statistical noise on the outputs. Not all phenomena that one might seek to model fall into this category. For example, if the outcome that one is trying to predict from observations is highly probabilistic then the model produced by winGamma will not be satisfactory as a prediction tool. However, the software is able to detect this situation: it will be reflected in a high Gamma statistic or a Vratio close to 1.

The models that winGamma is designed to produce are of phenomena (more exactly, outputs) that are smoothly determined by the input variables. Mostly the limiting factor on the predictive accuracy of the model will be measurement noise or insufficient data.

For a given data set the winGamma software executes the Gamma Test, which estimates the variance of the noise on each output. This will be an estimate of the best MSError that a smooth model can achieve for the corresponding output.

Footnotes: Inputs and outputs should be continuous variables. See Appendix V for definitions.

The estimate of that part of the variance of an output that cannot be accounted for by a smooth data model is called the Gamma statistic. As the number of data samples increases the Gamma statistic (invariably) approaches an asymptotic value which is the variance of the noise on th
64. istributions of data like this can be very helpful in high dimensional input spaces, as it often means that we need less data to build a good model than would be the case if the data were uniformly distributed over the whole space.

It is precisely the surface of Figure 1.11 which is the model that we can seek to construct using winGamma. We could take the time series and create a 2-Input 1-Output data structure (x(t-2), x(t-1)) -> x(t). In fact any time series that evolves according to some smooth iterative or dynamic process can be treated this way, provided only that we can determine the number of previous values of the time series required to predict the next value (this is called the embedding dimension). In this example we shall pretend that we do not know the embedding dimension and show how winGamma can be used to get some idea of which previous inputs are likely to produce a good model.

Note that the data in the file Hen500.asc is high precision and not subject to noise.

1.5.2.1 The basic steps

1. Load Hen500.asc with "Open Analysis Data Set".

2. Set the number of inputs to 10 in the Time Series tab.

3. Do not enable "Normalisation" in the check box. Since the data is a single time series and each sample is comparable we should not expect much gain from scaling.

4. When prompted to select a proportion of the data set for analysis use all the data (1-490) for
65. izontal.

With pure random non-smooth data the slope of the regression line will gradually increase as the number of data points M is increased; this is because the continuity condition is not satisfied.

Taken together, particularly with Vratio so close to one, these are clear indicators that it is pointless to try to model the data with a smooth function. Next we examine the standard analysis tests in turn.

[Table 2.1: The results of a simple Gamma test on the file Ran500.asc for unscaled and scaled data. Columns: Unscaled, Scaled. Only the mask 1111 (both columns) survives in the source.]

The Increasing Near Neighbours plot for pmax = 3 to 30 is given in Figure 2.7. This suggests the best estimate for the Gamma statistic is obtained at around pmax = 10. The M test result of Figure 2.8 was obtained starting at M = 50 and increasing M to 500 in steps of 10. This consistently gives a Gamma statistic of around 0.3 but, as the graph has not yet settled to an asymptote, ideally we should need more points to obtain an accurate estimate for this 4-dimensional data.

The scatter plot in Figure 2.9 contains points with small delta but large gamma, which also supports the conclusion. At the same time the regression line fit is rather poor. The 3D histogram in Figure 2.10 shows no real indicators of an "empty wedge" and supports the general conclusion that the data is extremely noisy. The same is true of
66. le choice is to start at 100 in steps of 100 until the end of the data.

Results: The M test graphs of the Gamma statistic together with the Gradient are shown in Figure 2.29. From this we see a good asymptote and conclude that with 8 inputs a good model can be obtained using around 3000 points. It also looks likely that we have an essentially zero noise time series.

[Figure 2.29: The M test graph (pmax = 10, number of inputs = 8) for the delayed Henon map, showing Gamma and Gradient against Unique Data Points.]

5. Can we get a better Gamma statistic by discarding still more of the inputs? To answer this question we run a Full Embedding on 7 inputs. (To do this we have to transform the data again.)

Results: If we make a small table of the best 5 masks found we obtain:

Gamma   2.3228E-6   1.3276E-5   1.888E-5   2.176E-5   2.438E-5
Mask    0001101     1101100     1111111    0001100    1011101

From this we infer that lags 3 and 4 (remember we have to count from the right) are very important, but that the marginally best model should be obtained using lags 1, 3 and 4.

It is worth examining the Embedding Histogram associated wi
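Reading lags off a mask by counting from the right is easy to get wrong by hand, so a small helper is sketched below. It assumes, as the text indicates for time series, that the rightmost bit is the most recent previous value (lag 1); `mask_to_lags` is a hypothetical name.

```python
def mask_to_lags(mask):
    """Return the lags selected by a mask string, counting from the right."""
    return [lag for lag, bit in enumerate(reversed(mask), start=1) if bit == "1"]


for mask in ("0001101", "1101100", "1111111", "0001100", "1011101"):
    print(mask, "->", mask_to_lags(mask))
```

Applied to the five masks in the table, lags 3 and 4 appear in every one, which is the observation the text draws on.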
67. lowed by one or more spaces. The end of the input vector is signified by a comma. What follows the comma is one or more outputs separated by one or more spaces. The last number on a line should be followed by a carriage return/linefeed. Each row must have the same number of data fields before the comma as every other row, and likewise after the comma.

Prediction file data

APPENDIX III Using the Mathematica 4.0 files

A number of Mathematica files for data generation/manipulation, data analysis, and model testing are supplied with winGamma. There is also a C code executable MathLink file which can be used to execute the Gamma test from within Mathematica. To use these files you will need to have Mathematica installed and be familiar with Mathematica notebooks.

At a later stage it is hoped to supply equivalent files in Matlab.

DataGen.nb (Data Generator)

This file enables the creation of Input/Output files and Time Series files of data with or without added noise. It includes a large number of examples and shows how every test file used in this manual was created.

DataAnal.nb (Data Analyser)

This file is useful for producing graphics such as histograms and performing various types of supplementary data analysis.

GammaTestProject.exe

This is a C code executable which communicates with Mathematica via MathLink and enables a variety of Gamma test computations to be called directly from Mathematica. It cannot be executed as a standalone program.

mathlinkGamma n
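For readers preparing data outside winGamma, the Input/Output row layout described in Appendix II above (inputs separated by spaces, a comma, then outputs separated by spaces) can be checked with a few lines of Python. This is a minimal sketch under that description only; the function name parse_io_line is hypothetical.

```python
def parse_io_line(line):
    """Split one Input/Output row into (inputs, outputs) lists of floats.

    Inputs come before the comma and outputs after it; fields on either
    side are separated by one or more spaces.
    """
    before, comma, after = line.partition(",")
    if not comma:
        raise ValueError("missing comma between inputs and outputs")
    inputs = [float(field) for field in before.split()]
    outputs = [float(field) for field in after.split()]
    return inputs, outputs


# A row with three inputs and one output, ending in carriage return/linefeed.
print(parse_io_line("0.1 0.2  0.3, 1.5\r\n"))
```

Checking that every row yields the same number of inputs and the same number of outputs is then a matter of comparing list lengths across rows.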
68. lso be positive and there will come a point where using more data to build our model will not actually improve the quality of the predictions when compared with the measured values of the output.

In the case of a positive asymptotic Gamma statistic we can determine the minimum amount of training data required to build a smooth model with this MSError using the M-test described in section 2.5.

Gradient

The Slope or Gradient is the slope of the regression line in Figure 1.6 used to calculate the Gamma statistic. It is actually a rough measure of the complexity of the smooth function we are seeking to construct. In this case the gradient of A = 0.244 indicates that the output is a rather simple function of the two inputs. It is generally best to look at the Gradient for the scaled data since this refers to a standardized output range.

Like the Gamma statistic the Gradient will eventually asymptote to a fixed value. However, the number of data samples required to get a stable asymptote for the Gradient will usually be much larger than the number required to get a stable asymptote for the Gamma statistic.

Standard Error (SE)

This is the usual goodness of fit applied to the regression line in Figure 1.6. If this number is close to zero we have more confidence in the value of the Gamma statistic as an estimate for the noise variance on the given output. In this case an
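To make the relationship between the regression line, the Gamma statistic, the Gradient and the Standard Error concrete, the following Python sketch computes them by brute force. It is not winGamma's implementation: the function name gamma_test, the exhaustive neighbour search, and the use of the residual standard error of the fitted line as SE are assumptions made for illustration. It requires more than pmax data points and pmax of at least 3.

```python
import math


def gamma_test(inputs, outputs, pmax=10):
    """Brute-force Gamma test sketch.

    inputs:  list of equal-length tuples of floats
    outputs: list of floats, one per input vector
    Returns (gamma, gradient, standard_error, vratio).
    """
    m = len(inputs)
    deltas = [0.0] * pmax  # mean squared distance to the k-th near neighbour
    gammas = [0.0] * pmax  # half the mean squared output difference to it
    for i in range(m):
        # Squared distances from point i to every other point, nearest first.
        near = sorted(
            (sum((a - b) ** 2 for a, b in zip(inputs[i], inputs[j])), j)
            for j in range(m)
            if j != i
        )[:pmax]
        for k, (dist2, j) in enumerate(near):
            deltas[k] += dist2 / m
            gammas[k] += (outputs[j] - outputs[i]) ** 2 / (2 * m)
    # Least-squares line gamma = Gamma + A * delta through the pmax points.
    mean_d = sum(deltas) / pmax
    mean_g = sum(gammas) / pmax
    sxx = sum((d - mean_d) ** 2 for d in deltas)
    sxy = sum((d - mean_d) * (g - mean_g) for d, g in zip(deltas, gammas))
    gradient = sxy / sxx
    gamma = mean_g - gradient * mean_d  # intercept = noise variance estimate
    residual = sum((g - gamma - gradient * d) ** 2 for d, g in zip(deltas, gammas))
    standard_error = math.sqrt(residual / (pmax - 2))  # assumed SE definition
    mean_y = sum(outputs) / m
    var_y = sum((y - mean_y) ** 2 for y in outputs) / m
    return gamma, gradient, standard_error, gamma / var_y


# Noise-free linear data: the intercept (Gamma statistic) should be zero.
xs = [i / 200 for i in range(200)]
print(gamma_test([(x,) for x in xs], xs, pmax=10))
```

For noise-free data such as this example the intercept is essentially zero, and the last returned value (the Gamma statistic divided by the output variance) plays the role of the Vratio.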
69. means we should try to optimise the number of near neighbours for the Gamma test if we wish to obtain a more accurate Gamma statistic and we shall see how to do this in Chapter II.

1.5.2 A chaotic Time Series

Figure 1.10 The first 100 points of the Hen500.asc time series.

Figure 1.11 The surface which defines x_n in the Henon map as a function of the two previous values.

Here we use the file Hen500.asc. This file contains time series data generated by iterating the Henon map. It is described in more detail in The Gamma test and how to use it: a practitioners guide.

To get some idea of what the time series data looks like we graph the first 100 points of the time series using any convenient software as in Figure 1.10. Although this time series looks quite unpredictable, nevertheless the underlying model which takes us from two successive values to the next is a smooth function of the two successive inputs and therefore does not violate the requirement of the Gamma Test (see Figure 1.11).

Figure 1.12 The distribution of points in the input space for the Henon map.

A very important factor to consider when building a non-linear model is the distribution of sample points in the input space. In some cases these points will be uniformly distributed but in many cases this will not be the situation. If we plot the distribution of the points (x_{n-1}, x_n) for the Henon map data from the file we obtain Figure 1.12 Peculiar d
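A series of this kind is easy to generate for experimentation. The sketch below iterates the Henon map in its usual single-variable form, x[n+1] = 1 - a*x[n]^2 + b*x[n-1], with the classical parameter values a = 1.4 and b = 0.3. The parameters and starting values actually used to create Hen500.asc are not stated here, so the values below, and the function name henon_series, are assumptions.

```python
def henon_series(length, a=1.4, b=0.3, x0=0.0, x1=0.0):
    """Iterate x[n+1] = 1 - a*x[n]**2 + b*x[n-1] and return `length` values.

    a, b, x0 and x1 are assumed values (the classical Henon parameters),
    not necessarily those used to create Hen500.asc.
    """
    series = [x0, x1]
    while len(series) < length:
        series.append(1.0 - a * series[-1] ** 2 + b * series[-2])
    return series[:length]


series = henon_series(500)
# Pairs of successive values, i.e. the points plotted in the input space.
pairs = list(zip(series, series[1:]))
print(series[:5])
```

Each new value is a smooth (quadratic) function of the two previous values, which is why the series satisfies the smoothness requirement despite looking unpredictable.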
70. In some data sets the same input vector may occur several or many times. This indicates how many distinct input vectors are present in the data (see the discussion on zeroth near neighbours below).

Evaluated Output

Indicates to which output the results relate. In a file with multiple outputs all these results are calculated for each output.

Zero(th) Near Neighbours

In some data sets the same input vector may occur several or many times. If an input vector appears multiple times then, if it has the same output value(s), it might be construed as a repetition or it may be a separate independent observation. In the first case there is no extra information and the data vector should be deleted. In the second case there is useful information in the two vectors because they are telling us that for these inputs the outputs are identical, and so presumably subject to low or zero noise variance. If one or more outputs are different for the same input vector then again there is useful information, because enough vectors of this type could give us an immediate grip on the noise variance.

Therefore, because it is important for an analyst to know if the same input vector occurs multiple times, winGamma provides this information by stating the maximum number of non-unique input vectors. If this number is small in relation to the size of the data set it can safely be ignored on a first pass. If it is large then the data should be subjected to some a
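A quick external check for repeated input vectors can be made with a few lines of Python. The sketch below groups rows by input vector, counts the distinct vectors and the vectors that occur more than once, and, where repeats have differing outputs, forms a pooled within-group variance as a rough direct noise estimate. The pooled variance is our own illustrative choice and is not claimed to be the calculation winGamma performs; the function name repeated_input_summary is hypothetical.

```python
from collections import defaultdict


def repeated_input_summary(inputs, outputs):
    """Summarise repeated input vectors.

    Returns (unique_points, repeated_vectors, noise_estimate), where
    noise_estimate is the pooled within-group variance of the outputs of
    repeated input vectors, or None if no input vector is repeated.
    """
    groups = defaultdict(list)
    for x, y in zip(inputs, outputs):
        groups[tuple(x)].append(y)
    repeated = [ys for ys in groups.values() if len(ys) > 1]
    squares = 0.0
    degrees = 0
    for ys in repeated:
        mean = sum(ys) / len(ys)
        squares += sum((y - mean) ** 2 for y in ys)
        degrees += len(ys) - 1
    noise = squares / degrees if degrees else None
    return len(groups), len(repeated), noise


rows = [(0, 0), (0, 0), (1, 1), (1, 1), (2, 2)]
values = [1.0, 3.0, 2.0, 2.0, 5.0]
print(repeated_input_summary(rows, values))
```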
71. n this experiment was FTP'ed from the ftp address ftp.santafe.edu, directory pub/Time-Series/data. Its origin, normalization and training/test regions are described in [Weigend 1990]. The data consists of 280 points representing sunspot activity over the period 1700-1979 and was used in [Weigend 1991]. The range of the data has been scaled to [0, 1] and we found the variance to be 0.0410558. Figure 2.34 shows the variation of sunspot activity over the full range of the data.

It is known that the primary sunspot cycle is approximately periodic over 11 years. Other shorter and longer cycles are also known. For radio propagation the short period cycle of 28 days is particularly significant. The data used here is collected from telescopic observations projected onto a white paper card. The sunspots are counted and classified by size and a correction factor applied depending on the magnification of the telescope. The virtue of this data is that it has been regularly collected since 1700. Of course, if one were really interested in predicting sunspot activity, much more accurate data is available. The data provided is often used as a test of prediction techniques and can give a reasonable model of gross sunspot activity.

Selecting a best embedding: If we are prepared for a several-day run we can use the Full Embedding option of the software to search for a good embedding. In this example w
72. nalysis outside of winGamma.

Upper 95% Confidence, Lower 95% Confidence

In the case where zeroth near neighbours are present these results are the lower and upper bounds at the 95% confidence levels for the Gamma statistic estimated directly from the zeroth near neighbours. Unless the data file has many repeated input rows these values can be ignored. If the file has many repeated inputs then these values can be compared with the normal Gamma statistic (which is computed in an entirely different way).

1.4 The basic controls of winGamma

The use of these options will be discussed fully in Chapter II.

The Analysis Manager / Experiments: These are options used to determine the Gamma statistic and to investigate how reliable this statistic is, i.e. to determine the quality of a model which might be built using the data and a given selection of inputs. To invoke any of these options after loading a data set simply select the Analysis Manager and highlight the option required. Then click on "New". For any particular option there are probably other parameters which need to be set before invoking "Execute".

Gamma test: Finds the Gamma statistic and other relevant measures.

Increasing near neighbours: Finds how the Gamma statistic varies with the number of near neighbours used to compute it.

M-test: Shows how the Gamma statistic estimate varies as more data is used to compute it
73. nces of the error surface in weight space which in most cases gives faster convergence at the expense of a more complicated algorithm and more memory.

What do all the fields associated with a Gamma Result mean?

See section 1.3.1 of this manual.

What does a high gradient suggest?

If there is enough data to give a stable Gradient asymptote then a high Gradient (computed values on artificial test sets can come out as high as 20,000) suggests a complicated unknown function with, on average, regions of high curvature.

Why is the Vratio useful?

It provides a standardised estimate of the noise which is independent of the output variable range.

What is the use of the Standard Error?

It tells us how reliable the Gamma statistic is as an estimate of the variance of the noise on the output.

What file formats are permitted for data to be analysed by winGamma?

See Appendix II.

How much data should I use for training?

If the Gamma statistic is asymptoting to zero you can use as much data as is practical and models with MSErrors of order 10   are quite feasible.

If the Gamma statistic is asymptoting to a positive value a good rule of thumb is to use as much data as will give a standard deviation (the square root of the variance) of the Gamma values about the asymptote of around 10% of the asymptote value on the last 10% of the data.

When should I use external
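The 10% rule of thumb given above for a positive asymptote can be applied mechanically to the Gamma values exported from an M-test. The sketch below is one possible reading of the rule: it takes the mean of the last 10% of the values as the asymptote (an assumption, since the rule does not say how the asymptote is estimated) and compares their standard deviation with 10% of that value. The function name asymptote_is_stable is hypothetical.

```python
import statistics


def asymptote_is_stable(gamma_values, tail_fraction=0.1, tolerance=0.1):
    """Apply the 10% rule of thumb to a sequence of M-test Gamma values.

    The asymptote is taken to be the mean of the last `tail_fraction` of
    the values (an assumption); the rule is met when the standard deviation
    of those values is at most `tolerance` times the asymptote.
    """
    tail_length = max(2, int(len(gamma_values) * tail_fraction))
    tail = gamma_values[-tail_length:]
    asymptote = statistics.mean(tail)
    spread = statistics.pstdev(tail)
    return spread <= tolerance * abs(asymptote)


settled = [1.0] * 90 + [0.30, 0.31, 0.29, 0.30, 0.31, 0.29, 0.30, 0.30, 0.31, 0.29]
unsettled = [1.0] * 90 + [0.1, 0.5] * 5
print(asymptote_is_stable(settled), asymptote_is_stable(unsettled))
```

If the rule is not met, the usual remedy is to collect or include more data and repeat the M-test.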
74. ns it is well worth the time and effort to construct a neural model.

3.2 Local linear regression

To make a prediction for a given query point in input space local linear regression (LLR) first finds the k nearest neighbours of the query point from the given data set (where the number k is supplied by the user) and then builds a linear model using these k data points. Finally the model is applied to the query point, thus producing a predicted output. Because of the way winGamma analyses the data to compute the Gamma statistic, the k nearest neighbours of any point in input space can be found very rapidly. Consequently local linear regression using the k nearest neighbours (in the training data) of the query point can be accomplished quickly. Thus local linear regression is a very fast and capable predictive tool.

LLR is most effective in regions of the input space with a high density of data points. If data points are few and far between in the vicinity of the query point then LLR will not be very effective if the underlying function we are trying to model is truly non-linear.

It may seem odd that although winGamma is all about constructing smooth models the global function produced by patching together many LLR predictions in general is not even continuous. However, as the number of data points increases, the global function produced by LLR will converge rapidly to the unknown
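The three stages of an LLR prediction described above (find the k nearest neighbours, fit a linear model to them, apply it to the query point) can be sketched in plain Python as follows. This is an illustration only: it searches for neighbours by brute force rather than with the fast neighbour search winGamma uses, it fits the linear model by solving the normal equations, and the function name llr_predict is hypothetical.

```python
def llr_predict(train_x, train_y, query, k):
    """Predict the output at `query` from a linear fit to its k nearest neighbours."""
    # Stage 1: find the k nearest training points (squared Euclidean distance).
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], query)),
    )[:k]
    rows = [[1.0, *train_x[i]] for i in order]  # leading 1.0 is the intercept term
    targets = [train_y[i] for i in order]
    n = len(rows[0])
    # Stage 2: normal equations (X^T X) w = X^T y as an augmented matrix.
    aug = [
        [sum(r[p] * r[q] for r in rows) for q in range(n)]
        + [sum(r[p] * t for r, t in zip(rows, targets))]
        for p in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("neighbours are degenerate; increase k")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    weights = [aug[p][n] / aug[p][p] for p in range(n)]
    # Stage 3: apply the local linear model to the query point.
    return weights[0] + sum(w * q for w, q in zip(weights[1:], query))


# Data from an exactly linear function y = 2*x1 - x2 + 3 on a 5 x 5 grid.
grid = [(i / 4, j / 4) for i in range(5) for j in range(5)]
values = [2 * x1 - x2 + 3 for x1, x2 in grid]
print(llr_predict(grid, values, (0.25, 0.4), k=6))
```

With k smaller than the number of inputs plus one, or with neighbours that all lie on a line, the local fit is not determined, which is one reason LLR needs a reasonable density of data near the query point.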
75. nt. If the final target application is a real-time system neural networks offer the advantage that they can be implemented in hardware.

How should I choose between local linear regression and dynamic local linear regression?

For a model to adapt it must be dynamic. Every data row (vector) "seen" by a dynamic LLR model will be added to the model, but of course eventually the model becomes memory hungry and starts to slow down. At this point the model will have to be pruned. If the phenomenon that you are trying to model is likely to be fixed then a static model is best. If the underlying dynamics themselves might be changing (e.g. the stock market) then a dynamic model is more sensible.

How should I choose between the Backpropagation, Conjugate Gradient Descent and BFGS neural net algorithms?

Backpropagation is the original feedforward neural network training algorithm. It is reasonably effective on simple problems but only makes use of the first differentials of the error surface in weight space. Therefore backpropagation can take longer to train than other more sophisticated neural training algorithms and may fail to converge to the target MSError derived by the Gamma test at all. But compared to more recent algorithms backpropagation is inexpensive on memory.

CGD offers some improvements over BP at the cost of extra memory.

BFGS uses the second differe
76. o change.

Whether normalization is a good or bad idea depends largely on the circumstances. If input variables are incompatible then it is probably a good idea to normalize.

Normalization of just the input values will not change the asymptotic Gamma statistic or Vratio (provided we imagine that as the number of data points becomes large we also increase pmax by a suitable constant factor), but a good scaling will cause the M-test to converge more rapidly to the asymptote, so improving the accuracy of the noise estimate for a given amount of data. A good scaling can also improve the accuracy of a model constructed using a fixed amount of data.

The effect of masking is "all or none" and it may be better to apply a suitable weight to each input variable. For example, it is a general observation regarding near neighbour classifiers that they perform well given the right weighting of inputs but that at present there are no general techniques for finding such weightings. However, if weights are applied then of course the data must NOT then be renormalised.

2.16 Projects

A Project is the collection of all Experiments performed on a given data set. A given Project is determined initially once the data set for analysis is defined. At Project creation time the number of inputs and outputs to a time series has to be set, along with options to normalise or scale the data. There is also an option to generate a parallel moving average and/or difference series along
77. of Figure 2.25 is intended to show the relative stability of the Gamma test result. In this case it really fails to do so because the order of the data should really be randomised (since it is very time periodic). Even so, when we examine the vertical scale of Figure 2.25 we see that the relative variation is not very large.

We shall see later in Chapter III how to take the results of this analysis and build and test models using the solar.csv file.

2.13 Analysing Time Series data

2.13.1 The DH 34 5000.asc data (Delayed Henon Map)

Figure 2.26 The first 100 points of the delayed Henon map time series.

Figure 2.27 The return map (x_{n-1}, x_n) for the delayed Henon map.

This Time Series data was generated by a process very similar to the Henon map, except that where the current value of the Henon map time series depends on the last two values of the series, for the Delayed Henon map the current value is determined by the values three and four steps in the past. This changes things in a number of respects.

The plot of the time series is given in Figure 2.26 and Figure 2.27 shows the return map for (x_{n-1}, x_n), which is analogous to Figure 1.12. We observe that this distribution looks quite different.

We now proceed through the steps outlined in section 2.1.1 for Time Series.

1. Load the data and do not initially normalise (it is a single time series).
78. oints in steps of 10 from Sin500.asc.

2.12.3 The solar.csv data

The data considered in this 2-input 1-output file relates to the generation of electrical power by an array of solar cells. The inputs are a measure of light intensity (South Plane Irradiance) to a precision of ±0.01 kW/m², and the current temperature in degrees C to a precision of ±0.5 °C.

Figure 2.19 Plot against time of the irradiance and temperature from the training data file solar.csv.

The output is the voltage inverter AC power output measured to a precision of ±0.01 kW. The file consists of these values sampled every minute.

Figure 2.19 illustrates the graphs of the two inputs and the output against time (position in the file). We note that at low Irradiance the recorded power output values are irregular and sometimes negative. This is a result of the fact that intelligent circuits are attempting to determine whether or not to initialise the system as the sun rises or sets. The effect is to produce noise on the output power at low Irradiance levels. We are just using the data as an example, but if one wanted to use the data to build a really accurate model obviously one should filter out the data having low or zero Irradiance.

7 Data for this example were provided by Newcastle Photovoltaics Applications Centre at the University of Northumbria at Newcastle UK. These data were collected as p
79. om inputs to outputs using only the data provided. Both the inputs and outputs should be continuous real variables from some bounded range. The software will be much less effective if some of the input or output variables take only categorical values (e.g. 0 or 1). The underlying function is presumed smooth and this means bounded first and second derivatives. If the unknown function has regions of very high curvature it will be much harder to produce an accurate predictive model.

It is also assumed that the noise variance on each of the outputs is bounded and independent of the input values. If the independence condition is false this is not necessarily fatal: the Gamma test will return an average noise variance over the whole input space.

Subject to these conditions winGamma can be applied to a wide variety of non-linear modelling problems. It is particularly useful in the research and design of non-linear control systems.

1.2 Loading data files

winGamma can analyse two basic types of numeric data files: Input/Output data, where each column corresponds to an input or an output, and Time Series data, where each column corresponds to a particular time series and successive rows represent successive values in time for each series.

Note all data files must contain only numerical data arranged in one of the allowed formats. (For more details of data file formats see Appendix II.) To load a data file launch the application from the Start menu. Click on
80. ore but which are essential if the product is ever to enjoy widespread use.

Dr. Nicola M. Pearsall who kindly gave permission for us to use the example data in solar.csv. Data for this example were provided by Newcastle Photovoltaics Applications Centre at the University of Northumbria at Newcastle, UK. These data were collected as part of a project with funding from the European Commission (THERMIE Programme) and the UK Department of Trade and Industry.

To all of you we express our thanks and hopes that winGamma will make a contribution worthy of your efforts.

CONTENTS

CHAPTER I Getting Started
1.1 Introduction
1.1.1 The Purpose of the Software
1.1.2 The range of applicability
1.2 Loading data files
1.2.1 Comma separated variable (.csv) files from spreadsheets
1.2.2 Input/Output data in standard format (.asc) files
1.2.3 Time Series data in standard format (.asc) files
1.2.4 Partitioning the data
1.3 A first experiment
1.3.1 Interpreting the results
1.4 The basic controls of winGamma
1.5 Two simple examples
81. ot of the FTSE close data.

If we perform a full embedding on the last 20 weeks (on a convenient Unix station), using pmax = 10 we obtain the histogram in Figure 2.33. This has the characteristic Gaussian shape of mainly statistically determined data.

From the table of the best embeddings we select the embedding which gives the smallest positive gamma. This is 11011100110000111101.

Handy Tip: Typing in long masks can be error prone and tedious. There is no need to do this. A mask can be copied to the clipboard and pasted in whenever required using a right click on the mouse.

Figure 2.32 The M-test graph for the FTSE data using the best embedding of length 20.

Figure 2.33 The frequency histogram for embeddings of length 20 using the FTSE data.

This choice of embedding should be treated with some caution. The M-test graph of Figure 2.32 (pmax = 10, Randomised, 3 repetitions) shows that the estimated gamma values have not yet stabilised (M is not sufficiently large) so the error in estimating the Gamma statistic for any particul
82. perform the analysis on a subset of the data. If the problems continue you need more RAM. In most cases 64 Mbytes is sufficient for any reasonable data set.

At least 50 Mbytes remaining hard disk space.

Operating system: Windows 95 or 98, or Windows NT 4.0 were the original development targets but we have so far observed no problems with later versions of Windows operating systems. Licenses for a script file driven UNIX version of the Gamma Test software may be available by special arrangement.

Installation

Beta release: At present simply copy all files in the winGamma directory into a convenient directory on your hard disk. If you experience problems getting the help system to work you may have an older version of Explorer. To update run the file hhupd.exe.

V1 release: Place CD in drive. Follow install instructions from screen.

List of files and directory structure after installation

Program and associated files:

BORLNDMM.DLL
CP3240MT.DLL
hhupd.exe (Run to update HTML if problems with help occur)
Tee4C.bpl
vcl35.bpl
winGamma.chm
winGamma.exe
winGammaBaseComponents.bpl
winGammaComponents.bpl

Directory of C:\WinGamma

11/20/98  12:28p  <DIR>  Data
10/30/98  02:37p  <DIR>  TestFiles
02/09/98  02:00a  29,952
02/09/98  02:00a  996,872
11/05/98  06:28p  471,840
10/24/98  04:01a  420,864
02/09/98  02:00a  1,455,736
02/22
83. port".
Choose "Save as type".
Set to "Excel Macro (*.mac)".
Enter directory and file name.
Export model.

Step 3. Setting up the data and model in Excel.

Start up Excel.
Load data from test.csv.
Save the file as an Excel Workbook.
Right click on worksheet tab "test" at the bottom left corner.
Select "Insert".
Select "MS Excel 4.0 Macro".
Hit OK.
This now opens a macro sheet.
Load "test.mac" into Notepad, select all text, and copy.
Paste text into macro sheet in Excel in cell A1.
Highlight column A.
Do Insert | Name | Define.
In the macro box set to "Function".
Set name to "model".
Hit OK.
Now when in the macro sheet with column A highlighted you should see "model" in the top left name box.
Switch back (using the tabs at the bottom) to the "test" worksheet.
Enter heading "model" in cell F1.
In cell F2 type "=model(A2:C2)" (no quotes) and press "Return".
You should now see the model output value in cell F2 as compared to the actual output in cell D2.
Select cell F2 and copy.
Highlight the range of cells from F3 to F278 and paste.

You should now have all the model predicted values for each row.

APPENDIX I General Information

Shipping list

1. Compact disc
2. This manual
3. The gamma test and how to use it: a practitioners guide.

Hardware requirements

This software is PC based and normal minimum requirem
84. pulation to do from scratch in Excel).

Notice that although we set out with the intention of trying to model the time series Target, we have created a file in which every time series has an output that we can model. You can delete these extra outputs in Step 6 if you wish.

You can begin immediately with experiments on this file but, since not all these inputs may be needed for the model, you can also proceed as follows.

Step 5. Use some other software such as Mathematica to perform data analysis on the last file using tools not yet provided by winGamma. For example, one useful analysis tool is to take the average lagged correlations of successive differences of the target time series with the successive differences of all the time series (e.g. for lags from 1 to 8 in the above example). (This tool is available as DeltaCorrelation in the Mathematica suite provided with winGamma.) We may choose to take only those lags which have the largest absolute lagged delta correlation. This may suggest that some columns could be deleted from the *.csv file we have produced.

Step 6. Load the *.csv file into Excel and delete the columns which have been selected as unlikely to be useful. Re-save the result as a *.csv file and proceed (as if from the end of Step 4) with winGamma analysis on the resulting file. Further inputs may be set to zero in the mask as a result of winGamma analysis.

Notice that when you reload this file into winGamma it will be treated as an
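For readers without Mathematica, the idea behind the lagged delta correlation of Step 5 can be sketched in Python. The code below correlates the successive differences of the target series with the successive differences of another series taken `lag` steps earlier. It is our own reading of the description in Step 5, not a port of the DeltaCorrelation tool, and the function names are hypothetical.

```python
import math


def differences(series):
    """Successive differences x[t+1] - x[t]."""
    return [b - a for a, b in zip(series, series[1:])]


def lagged_delta_correlation(target, series, lag):
    """Pearson correlation between the differences of `target` and the
    differences of `series` taken `lag` steps earlier."""
    d_target = differences(target)[lag:]
    d_series = differences(series)
    d_series = d_series[: len(d_series) - lag]
    mean_t = sum(d_target) / len(d_target)
    mean_s = sum(d_series) / len(d_series)
    cov = sum((t - mean_t) * (s - mean_s) for t, s in zip(d_target, d_series))
    var_t = sum((t - mean_t) ** 2 for t in d_target)
    var_s = sum((s - mean_s) ** 2 for s in d_series)
    return cov / math.sqrt(var_t * var_s)


# A target that simply repeats another series two steps later.
driver = [(i * i * 7) % 13 for i in range(40)]
target = [0, 0] + driver[:-2]
for lag in range(1, 5):
    print(lag, round(lagged_delta_correlation(target, driver, lag), 3))
```

In this constructed example the correlation at lag 2 is 1, so lag 2 of the driving series would be retained as an input and weakly correlated lags would be candidates for deletion.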
85. r (pmax = 10) for Ran500.asc Ran500.asc

Figure 2.11 Angle histogram for Ran500.asc (pmax = 10).

Figure 2.12 Moving window Gamma test (pmax = 10) on 300 points in steps of 10 from Ran500.asc.

2.12.2 The Sin500.asc data

If we run a simple Gamma test with pmax = 10 near neighbours on the file Sin500.asc we obtain the results in Table 2.2.

Table 2.2 The Gamma test result (pmax = 10) for unscaled and scaled data.

            Unscaled   Scaled
Gamma       0.07335    0.03190
Gradient    0.71122    4.0386

The estimated Gamma statistic 0.07335 indicates a moderate noise level, as does Vratio = 0.12762.

The regression line on the scaled data with slope A = 4.0386 indicates definitely non-linear data. However, if we pull up the Gradient plot we see that it is still highly variable, so one should not have too much confidence in this observation.

Taken together, these results indicate noisy but manageable non lin
86. rner of many of the graphics windows.

1.8 Customising the file and project directories

To customise the locations of data files and project files (discussed in Chapter II) pull down the "Options" menu and click on "Customize". You can modify the number of data files and project files kept in the history (it is usually best to set these to their maximum of 9). Now under data files click on "Modify" and select the directory that should first appear when the process of loading a data file is initiated. When the desired location has been selected click on "OK". Go through a similar procedure to locate the project directory. If you wish the windows settings to be saved each time winGamma is closed down then check the appropriate box. Finally click on "Apply" and exit the program.

CHAPTER II Performing an analysis

2.1 Introduction

An Experiment is a particular type of calculation performed on the analysis data. A new experiment is started by highlighting the type of experiment required and then selecting "New" in the Analysis Manager window. If we want to perform the same calculation but with different parameters (e.g. the number of nearest neighbours) or a new method (e.g. M-test) then a new experiment is started.

In this chapter we discuss each type of Experiment, how to set the parameters and how to interpret the results. Each Experiment is discussed using an example and illustra
87. rom a continuous range. If many inputs are categorical it is also possible to get a negative Gamma statistic.

How should I choose the right number of inputs for a Time Series?

Initially set the number of inputs large (but reasonable in the context of the data). Then do an "Increasing embedding". This will compute successive Gamma statistics based on one input (the historically most recent sample of the time series, rightmost on the mask), then on two inputs (the two most recent samples), and so on up to the maximum number of inputs you have selected.

The minimum Gamma statistic obtained will determine an upper bound for the maximum number of inputs it is useful to consider.

An optimum for the number of near neighbours used in the Gamma test should now be obtained. Then the maximum number of inputs can be checked again using that number of near neighbours in the Gamma test. (If the maximum number of inputs changes then the optimum number of near neighbours should be checked again.) Finally, using the best maximal number of inputs, a check for the best embedding can be run; this may cause some inputs to be discarded.

How should I choose a method for establishing an optimal embedding (mask)?

The best method for choosing a mask on the inputs is "Full embedding". The problems come with this method when the run times required become too long. Runtime is a function of the input dimensionality (the number of inputs, m), the number of nearest neighbours (pm
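To see what "one input", "two inputs" and so on mean for a time series, it helps to write out the transformation from a series into input/output rows. The sketch below builds rows whose inputs are the previous n_inputs samples, ordered oldest first so that the rightmost input is the most recent sample (matching the mask convention above), and whose output is the next sample. It is an illustration only; the function name embed is hypothetical and the row layout is assumed from the description of masks in this guide.

```python
def embed(series, n_inputs):
    """Turn a time series into (inputs, output) rows.

    Each row's inputs are the n_inputs samples preceding the output,
    oldest first, so the last input is the most recent sample.
    """
    rows = []
    for t in range(n_inputs, len(series)):
        rows.append((series[t - n_inputs:t], series[t]))
    return rows


for inputs, output in embed([1, 2, 3, 4, 5], 3):
    print(inputs, "->", output)
```

An Increasing embedding then amounts to repeating the Gamma test while using only the last input of each row, then the last two, and so on up to all n_inputs.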
88. s        2  sigmoidal act    scaleFactor                                       1  l e  act Temperature    where act is the activation  weighted sum of inputs   scaleFactor   1 5  and Temperature   0 8333    To speed up neural computations this function is implemented in winGamma as a fine grained look   up table  whereas for feedforward computations when the weights are loaded into other software it  can be implemented directly as a function  This may cause very small differences in neural output  calculations using the same weights outside winGamma     3 9 2 NetReader     NetReader nb is a Mathematica program supplied with winGamma which can read the neural  network weights saved from winGamma and implement the neural network for feedforward testing   Which type of network training was used in the creation of the weights is automatically identified  from the weights file     3 9 3 Exporting and using Neural network models in Excel     After winGamma has built a neural network model it may be exported as an Excel Macro and used  directly in Excel  This facility is not currently available for LLR models    We illustrate this process  using an example     Step 1  Build a model     In winGamma load data file Sun280 asc this is a single time series file   Transform the data to 3 inputs 1 output   Export transformed data as test csv   Perform Gamma analysis    Train neural network model on the transformed data    Step 2  Export the model  Right click on    Model       Select    Ex
89. se tabs with those produced for the noisy Sin data in section 1.5.1.1. We shall examine these tools more fully in Chapter II.

1.6 Linear models

winGamma is a non-linear modelling tool and makes very few assumptions about the nature of the model. Because of this it generally needs far more data than parametric analysis, where the model is presumed to have a particular form. If it is safe to assume that the model is linear then a simple linear regression model should be built and tested using some other standard software (e.g. Mathematica has very good linear regression facilities).

If you know nothing at all about the data being analysed it is always a good idea to check the linear regression model before using winGamma.

If the data is fundamentally linear then winGamma will perform quite well using local linear regression. However, winGamma will make less efficient use of the data available than global linear regression.

1.7 Exporting results for use by other software

Data produced by winGamma is either Graphics or data such as predictions. Data Files can be exported in:

1. Mathematica compatible format, e.g. {}s are embedded to format lists and arrays.
2. Excel and spreadsheet compatible comma separated variable (.csv) format.

These Export functions are available as an option under 'Edit' in the main winGamma parent window, by a right click on the mouse button in the appropriate context, or by clicking the '>' tab in the top left co
90. side a time series.

As successive Experiments are completed the parameter settings for each Experiment and the results are added to the Project, which can be saved as a file and reloaded at a later date. Thus there is no need to repeat the same experiment (which may have taken a while to compute).

All distance functions are equivalent to within a constant. But rescaling changes specific near neighbour relationships.

The winGamma User Guide PERFORMING AN ANALYSIS Version 18 Jan 2002

Handy Tip: The Project file (.gpr) contains the path of the data file used in the project; in fact the data file name and path are the first item in the Project file. Project files are in DOS ASCII format and can be edited (as long as the file is saved in the same format after editing). This can be useful to know if the data has been moved or renamed, or you have moved both the Project file and the data file to another system where the path is different. One way to handle this is to just edit the path in the Project file. However, if winGamma cannot find the data file associated with a project it will ask you to Browse for the file and you can indicate the new location. (It is important to select the right file, which must not have been altered.)

CHAPTER III Building and testing a model

3.1 Introduction

We now assume that you have analysed the data and decided which inputs and how much data to use. To actually build the model winGamma offers several tec
91. sion

3. Transform the data set to reset the maximum number of inputs to the largest number from the Increasing Embedding Experiment which still gives a comparatively small Gamma statistic.

4. Run an M test to check the stability of the Gamma statistic. If the M test produces a stable asymptote decide if the noise variance is likely to be

   Zero (arbitrarily good models possible with enough high precision data)

   Or

   Non-zero (not much point in using more data than necessary to give a model which predicts at the Gamma statistic level).

   On this basis decide how much data is likely to be needed to build a model.

5. Can we get a better Gamma statistic by discarding some of the inputs? To answer this question run a Full Embedding if the number of inputs is small enough to allow this (say 10 to 15). Otherwise try the heuristic search techniques, ending up with a long GA run.

6. If a better embedding is found then repeat steps 4, 5 and 6 to refine those conclusions.

7. Refine the number of near neighbours for the final estimate of the Gamma statistic using an Increasing Near Neighbours test.

2.2 The Gamma test

This finds the Gamma statistic and other relevant measures. These are principally the Gradient, the Vratio and the Standard Error as described in Chapter I.

Once the inputs have been determined, either with preliminary Gamma tests or because these are set by t
92. suffix .asc. Data files may be created using 'Excel' as .csv files and imported into winGamma. Data files for winGamma are in two basic formats.

• Time series data

Example: a single time series.

0.0262
0.0575
0.0837
0.1203
0.1883
0.3033
0.1517
etc.

Each number is followed by a carriage return linefeed.

Example: multiple time series (it is the responsibility of the user to prepare the data so that fields referring to the same time are on the correct line; most recent data is last).

0.0262 1000.26
0.0575 1031.78
0.0837 1037.86
0.1203 1038.567
0.1883 1040.810
0.3033 1100.721
0.1517 1027.851

Each number is followed by one or more spaces. The last number on a line is followed by a carriage return linefeed. There must be the same number of data fields on each row.

• Input/Output data

Example: a 4-input 1-output file.

0.36368593157164 0.3304959949667 0.21811098544356 0.20933961443087, 0.0220710621963
0.00591105325917 0.9085902611647 0.19548859472561 0.34015487882487, 0.0064356217878
0.86221883819100 0.5929180658183 0.36843151702318 0.89277930056707, 0.6617039028787
0.59877814813365 0.9562473549851 0.25582643936911 0.97996127233012, 0.4810764303063
0.13712162278232 0.9035299186427 0.29916358157799 0.22014139763247, 0.7734356912106
0.42696607632396 0.4827254329784 0.98919821679839 0.20449324659299, 0.5789449769352
etc.

The winGamma User Guide APPENDIX II Data file formats Version 18 January 2002

Each number fol
93. t graph for Sin500.asc. Note the relatively stable asymptote.
Figure 2.6 The Ran500.asc output plotted against the position in the file.
Figure 2.7 Increasing near neighbours (3-30) on Ran500.asc: Gamma, SE.
Figure 2.8 M test (pmax = 10) on Ran500.asc.
Figure 2.9 Scatterplot and regression line (pmax = 10) for Ran500.asc.
Figure 2.10 3D Histogram (pmax = 10) for Ran500.asc.
Figure 2.11 Angle histogram for Ran500.asc (pmax = 10).
Figure 2.12 Moving window Gamma test (pmax = 10) on 300 points in steps of 10 from Ran500.asc.
Figure 2.13 Increasing near neighbours (3-30) on Sin500.asc: Gamma, SE.
Figure 2.14 M test (pmax = 17) on Sin500.asc.
Figure 2.15 Scatterplot and regression line (pmax = 17) for Sin500.asc.
Figure 2.16 3D Histogram (pmax = 17) for Sin500.asc.
Figure 2.17 Angle histogram for Sin500.asc (pmax = 17).
Figure 2.18 Moving window Gamma test (pmax = 17) on 300 points in steps of 10 from Sin500.asc.
Figure 2.19 Plot against time of the irradiance and temperature from the training data file.
Figure 2.20 Increasing near neighbours (3-50) on solar.csv: Gamma, SE.
Figure 2.21 M test (pmax = 20) Randomised (2 repe
94. t might in model construction to choose a mask with a low Gradient which will correspond to a simpler model), and the fitness due to the number of 1's in the mask (because shorter masks also mean simpler models). The contribution of each of these terms is controlled by three weights $W_{intercept}$, $W_{gradient}$ and $W_{length}$ according to the formula

$$\mathrm{fitness}(mask) = W_{intercept}\,\mathrm{interceptFitness}(mask) + W_{gradient}\,\mathrm{gradientFitness}(mask) + W_{length}\,\mathrm{lengthFitness}(mask)$$

The component fitness calculations are described below, where Vratio(mask) and Gradient(mask) return the Vratio and Gradient as calculated by the Gamma test on the data set for mask, outputRange is the range of the output and $|\cdot|$ denotes absolute value.

$$\mathrm{interceptFitness}(mask) = \begin{cases} 1 - \dfrac{1}{1 - 10\,\mathrm{Vratio}(mask)} & \text{if } \mathrm{Vratio}(mask) < 0 \\[2ex] 2 - \dfrac{2}{1 + \mathrm{Vratio}(mask)} & \text{otherwise} \end{cases}$$

$$\mathrm{gradientFitness}(mask) = 1 - \frac{1}{1 + |\mathrm{Gradient}(mask)| / outputRange}$$

$$\mathrm{lengthFitness}(mask) = \frac{\mathrm{numofones}(mask)}{\mathrm{length}(mask)}$$

APPENDIX VI Frequently asked questions

Why is the Gamma statistic sometimes negative?

Sometimes the Standard Error (the error obtained from the (δ, γ) regression, which is always stated when a Gamma result is obtained) is large enough to account for a negative intercept by the regression line. This is most likely to occur when the true asymptotic Gamma statistic is close to zero. It can also happen when the data fails to fulfill the basic requirement that inputs and outputs are drawn f
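The three component fitness terms from Appendix V can be written out directly, which makes it easy to see how each one behaves. The sketch below is an illustration only: the function names are hypothetical, the component formulas follow one reading of the definitions in this guide, and combining them as a plain weighted sum is an assumption rather than a confirmed description of how winGamma computes the overall GA fitness. The weights and the Vratio, Gradient and output range values in the usage lines are made up.

```python
def intercept_fitness(vratio):
    """Term driven by the Gamma statistic, expressed through the Vratio."""
    if vratio < 0:
        return 1.0 - 1.0 / (1.0 - 10.0 * vratio)
    return 2.0 - 2.0 / (1.0 + vratio)


def gradient_fitness(gradient, output_range):
    """Term driven by the Gradient, relative to the range of the output."""
    return 1.0 - 1.0 / (1.0 + abs(gradient) / output_range)


def length_fitness(mask):
    """Fraction of mask bits that are set."""
    return mask.count("1") / len(mask)


def weighted_fitness(mask, vratio, gradient, output_range,
                     w_intercept, w_gradient, w_length):
    """Assumed combination: a weighted sum of the three component terms."""
    return (w_intercept * intercept_fitness(vratio)
            + w_gradient * gradient_fitness(gradient, output_range)
            + w_length * length_fitness(mask))


# Made-up values, used only to exercise the functions.
print(weighted_fitness("1101", vratio=0.05, gradient=0.4, output_range=2.0,
                       w_intercept=0.8, w_gradient=0.1, w_length=0.1))
```

Each component is zero in the ideal case (Vratio of zero, zero Gradient, no bits set) and grows as the mask becomes less attractive; a slightly negative Vratio is penalised much more steeply than a slightly positive one.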
95. t present these are the same for all series. In Figure 1.2 we are choosing to use 5 previous values of every time series to predict the next 2 values for each of the time series. Choosing more outputs will produce predictions further into the future. The nature of things is such that the further we try to predict into the future the less accurate these predictions will be. This is reflected in a higher Gamma statistic for more distant future predictions.

1.2.2 Input/Output data in standard format (.asc) files

Standard format for an Input/Output file is DOS ASCII in the following form. In each row the inputs are separated by spaces and the list of inputs is terminated by a comma. The list of outputs then follows, each separated by spaces. The end of a row is signified by CR LF. File data in standard Input/Output form will be automatically recognised as such. At present the numbers in the file must be in simple decimal format.

The first decision to be made after specifying the file name is whether or not to 'Transform' (i.e. to scale or normalise) the data. To normalise check the appropriate box as indicated in Figure 1.3.

Figure 1.3 The 'Normalise' check box (Data Transformation dialog).

For a full discussion of the effects of scaling and whether or not to scale in any particular case see section 2.14. In an initial investigation it is usually a good idea to scale Input Ou
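The standard Input/Output row layout described in section 1.2.2 (inputs separated by spaces, a comma, then the outputs, with CR LF ending the row) is simple enough to read with a few lines of Python. This is a minimal sketch for use outside winGamma; the function names and the sample values are hypothetical.

```python
def parse_io_line(line):
    """Split 'in1 in2 ..., out1 out2 ...' into (inputs, outputs) lists of floats."""
    left, sep, right = line.strip().partition(",")
    if not sep:
        raise ValueError("no comma separating inputs from outputs")
    inputs = [float(tok) for tok in left.split()]
    outputs = [float(tok) for tok in right.split()]
    return inputs, outputs


def parse_io_text(text):
    """Parse every non-empty row and check that all rows have the same shape."""
    rows = [parse_io_line(ln) for ln in text.splitlines() if ln.strip()]
    shapes = {(len(ins), len(outs)) for ins, outs in rows}
    if len(shapes) > 1:
        raise ValueError("rows have differing numbers of fields")
    return rows


# Made-up 4-input 1-output rows terminated by CR LF.
sample = "0.1 0.2 0.3 0.4, 0.5\r\n0.6 0.7 0.8 0.9, 1.0\r\n"
rows = parse_io_text(sample)
print(rows)
```

The shape check mirrors the requirement that every row carries the same number of data fields.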
96. t value.

Mean squared error (MSError): If $y(i)$ ($1 \le i \le M$) is a set of values of an output and $\hat{y}(i)$ is a set of predictions for $y(i)$ then the MSError of the predictions is given by

$$\mathrm{MSError} = \frac{1}{M} \sum_{i=1}^{M} \left( y(i) - \hat{y}(i) \right)^2$$

Standard Error (SE): This is the standard error about a regression line and is calculated as

$$\mathrm{SE} = \sqrt{ \frac{1}{n - 2} \sum_{i=1}^{pmax} \left( \Gamma(i) - \bar{\Gamma} \right)^2 }$$

where $\Gamma(i)$ is the ith Gamma regression point value and $\bar{\Gamma}$ is their mean.

Over training describes the effect when we attempt to produce a model by exactly following the training data. Consider the effect of trying to produce a model by drawing a line through every point in the noisy sine data in Figure 1.8. It would look nothing like a sine curve, and if we asked this model to predict y for a particular value of x we should have little faith in the prediction. One of the main advantages of winGamma is that it gives us the necessary information to prevent over training before we begin to build a smooth model such as a neural network.

The winGamma User Guide APPENDIX V Definitions Version 18 January 2002

GA Fitness: In order to better control the GA search it is useful to know how the GA fitness is calculated. The overall fitness of a mask is composed of three parts, corresponding to the fitness due to the intercept, i.e. the actual Gamma statistic (because mainly we want masks with small Gamma), the fitness due to the Gradient (because if we have enough data to estimate the Gradient accurately i
97. ted Gamma result. Therefore it is important to highlight the required result in the Results window.

2.3 The scatter plot and regression line

The critical graph to look at first is the scatter plot and (δ(p), γ(p)) regression line, see Figure 2.1. The scatter plot shows point pairs (δ, γ), where δ is the squared distance of an input (x) from one of its near neighbours and γ is one half of the squared distance between the two corresponding scalar output (y) values. The points to which the regression line is fitted are calculated by finding the mean δ(p) of δ and γ(p) of γ, where p refers to the first nearest neighbour (p = 1), the second nearest neighbour (p = 2), and so on up to the maximum number of near neighbours (pmax) which has been set by the user.

A good regression line with points (δ(p), γ(p)) approaching (δ, γ) = (0, 0) indicates that the scalar output values of input near neighbours are close. If the regression line has a steep slope this indicates that the modelling function f that we seek to approximate is liable to be quite difficult to construct and a large number of data points M will be required. If the line is almost horizontal the function is quite simple.

A particular feature to look for here is an empty 'wedge' in the top left corner of the scatter plot. If there are points in the top left corner it means that there are input points in the or
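The construction of the regression points in section 2.3 can be sketched in plain Python. This is a brute-force illustration and not winGamma's implementation (winGamma uses a kd tree for the neighbour search); the function name `gamma_test` and the toy data are hypothetical. The intercept of the fitted line is the Gamma statistic and its slope is the Gradient.

```python
def gamma_test(inputs, outputs, pmax=10):
    """Brute-force sketch: returns (gamma, gradient, deltas, gammas)."""
    m = len(inputs)
    if pmax < 2 or m <= pmax:
        raise ValueError("need pmax >= 2 and more data points than near neighbours")
    delta_sum = [0.0] * pmax
    gamma_sum = [0.0] * pmax
    for i in range(m):
        # Squared input distance from point i to every other point, nearest first.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(inputs[i], inputs[j])), j)
            for j in range(m) if j != i
        )
        for p in range(pmax):
            d2, j = dists[p]
            delta_sum[p] += d2
            gamma_sum[p] += 0.5 * (outputs[j] - outputs[i]) ** 2
    deltas = [s / m for s in delta_sum]   # mean delta for p = 1..pmax
    gammas = [s / m for s in gamma_sum]   # mean gamma for p = 1..pmax
    # Least-squares line: gamma = intercept + gradient * delta.
    mean_d = sum(deltas) / pmax
    mean_g = sum(gammas) / pmax
    sxx = sum((d - mean_d) ** 2 for d in deltas)
    sxy = sum((d - mean_d) * (g - mean_g) for d, g in zip(deltas, gammas))
    gradient = sxy / sxx
    intercept = mean_g - gradient * mean_d
    return intercept, gradient, deltas, gammas


# Noise-free toy data y = x on a regular grid: every pair satisfies gamma = delta / 2.
xs = [[i / 100.0] for i in range(100)]
ys = [i / 100.0 for i in range(100)]
gamma, gradient, _, _ = gamma_test(xs, ys, pmax=5)
print(gamma, gradient)
```

For this noise-free linear example the regression points lie exactly on a line through the origin, so the intercept is zero up to rounding error, which is the behaviour the text describes for outputs that are close whenever the inputs are close.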
98. ted with screen shots.

2.1.1 The user cycle

The user cycle for a full analysis is not completely fixed and can be varied according to circumstances. However the general approach can be summarised by the following steps.

Input/Output files

1. Load the data and on the full data set (if not exceedingly large) do a simple Gamma test, scaled and unscaled, with the number of nearest neighbours set to the default of 10. If the data set is very large use a subset of the data for initial experiments.

2. Run an Increasing Near Neighbours test and use the minimum SE between, say, pmax = 5 and pmax = 50 to determine the most accurate Gamma statistic.

3. Using the value for pmax determined in Step 2 run an M test to determine how stable the Gamma statistic is with increasing data set size.

4. If the M test produces a stable asymptote decide if the noise variance is likely to be

   Zero (arbitrarily good models possible with enough high precision data)

   Or

   Non-zero (not much point in using more data than necessary to give a model which predicts at the Gamma statistic level).

   On this basis decide how much data is likely to be needed to build a model.

5. Can we get a better Gamma statistic by discarding some of the inputs? To answer this question run a Full Embedding if the number of inputs is small enough to allow this (say 10 to 15). Otherwise try the heuristic search techniques, such as Hill climbing or Sequential embedding (see 2.7.2 to 2.7.4
99. th the embedding search. In this case we see a somewhat irregular multimodal histogram (there are only 15 possible embeddings).

6. If we now fix on the embedding 1101, having a Gamma statistic of around 2.3228E-6, then we might next do an Increasing Near Neighbours Experiment to optimise the choice of near neighbours in estimating the Gamma statistic.

Results: The result is shown in Figure 2.30. The minimum SE is obtained using pmax = 7 nearest neighbours and corresponds to a final Gamma statistic of 2.3228E-6, so that optimising the number of near neighbours hardly changed the Gamma statistic at all in this case.

Figure 2.30 An Increasing Near Neighbours test on the embedding 1101 for the delayed Henon map (Gamma and Standard Error plotted against the number of near neighbours).

Thus the final analysis of DH 34 5000.asc is that it is a noise free time series. Using a few thousand points we should be able to construct a model capable of one step prediction with an estimated MSError of around 2.23E-6.

2.13.2 The FTSE weekly closing price data

The file FTSEcls.asc contains the FTSE weekly closing price from 9 May 1988 to 26 January 1998, which gives 508 samples. Figure 2.31 shows the time series over the full run of the data.

Figure 2.31 A plot
100. the surface constructed from the data actually looks like (Figure 3.9). This is a topographic plot of the surface in which lower power outputs are blue and higher power outputs are red. We could regard our WhatIf graphs as cross sections (using the model) of a surface which is very similar to this plot.

Figure 3.9 A topographic plot of the solar.csv data (axes: Temperature, Irradiance).

The winGamma User Guide BUILDING AND TESTING A MODEL Version 18 Jan 2002

3.7 Example model construction and testing for DH 34 5000.asc

In Chapter II we analysed this data and concluded that it represented a low or zero noise time series for which the current sample could be predicted accurately on the basis of between 4 and 7 previous samples. The mask 1101 was identified as a good mask.

1. Load the file DH 34 5000.asc.

2. In the Time Series options specify 4 inputs and 50 outputs. This gives 4952 samples for analysis.

We are going to build a local linear regression model and this needs a kd tree, so it is necessary first to run a Gamma test.

3. In the Experiments tab highlight 'Gamma test'. Leave the number of near neighbours set at the default of 10. In the Mask tab enter the mask as 1101. Now click 'Execute'.

Result: We specified 50 outputs and so are trying to predict a maximum of 50 steps into the future. If we just look at the one step prediction then the res
101. this is the most recent) and obtains a Gamma statistic for this mask. It progressively increases the number of bits set in the mask, working from right to left, performing a Gamma test for each new mask. It runs to the maximum number of inputs and stops. We can then examine the Gamma statistic for each mask. The best embedding found will be the one whose Gamma statistic is closest to zero. This is useful in a time series to discover the underlying embedding dimension, as we saw in section 1.5.2.

In the next sections we shall give example analyses using these various options.

2.12 Analysing Input/Output data

2.12.1 The Ran500.asc data

We begin with a data set which is a type of 'worst case' in the sense that there is no smooth model for this example. The file Ran500.asc is a 4-Input 1-Output file containing 500 I/O pairs of completely random data generated using a uniform distribution in [-1, 1] via the Mathematica test file DataGen.nb. The output is actually pure noise having a true variance of 0.333333. A point plot of the output is given in Figure 2.6.

Figure 2.6 The Ran500.asc output plotted against the position in the file.

If we run a simple Gamma test with pmax = 10 near neighbours we obtain the results in Table 2.1.

The estimated Gamma statistic Γ = 0.31793 indicates a high noise level, as does Vratio = 0.97821, which is very close to one. The regression line with slope A = 0.0575 on scaled data is close to hor
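The sequence of masks tried by an Increasing Embedding, and the rule for choosing the best one, can be sketched as follows. The function names are hypothetical, and the Gamma values in the usage lines are made up purely to show the selection rule; in practice each value would come from running a Gamma test with the corresponding mask.

```python
def increasing_embedding_masks(n_inputs):
    """Masks with 1, 2, ..., n_inputs bits set, filled from the right (most recent first)."""
    return ["0" * (n_inputs - k) + "1" * k for k in range(1, n_inputs + 1)]


def best_mask(masks, gamma_for_mask):
    """Pick the mask whose Gamma statistic is closest to zero."""
    return min(masks, key=lambda mask: abs(gamma_for_mask(mask)))


masks = increasing_embedding_masks(4)
print(masks)

# Made-up Gamma statistics for each mask, standing in for real Gamma test results.
fake_results = {"0001": 0.30, "0011": 0.05, "0111": -0.01, "1111": 0.02}
print(best_mask(masks, fake_results.get))
```

Using the absolute value in the selection reflects the statement that the best embedding is the one whose Gamma statistic is closest to zero, so a small negative result can still win.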
102. titions) on solar.csv.
Figure 2.22 Scatterplot and regression line zoomed (pmax = 20) for solar.csv.
Figure 2.23 3D Histogram (pmax = 20) for solar.csv.
Figure 2.24 Angle histogram for solar.csv (pmax = 20).
Figure 2.25 Moving window Gamma test (pmax = 20) on 8400 points in steps of 100 from solar.csv.
Figure 2.26 The first 100 points of the delayed Henon map time series.
Figure 2.27 The return map for the delayed Henon map.
Figure 2.28 The result of an Increasing Embedding for the delayed Henon map.
Figure 2.29 The M test graph (pmax = 10, number of inputs = 8) for the delayed Henon map.
Figure 2.30 An Increasing Near Neighbours test on the embedding 1101 for the delayed Henon map.
Figure 2.31 A plot of the FTSE close data.
Figure 2.32 The M test graph for the FTSE data using the best embedding of length 20.
Figure 2.33 The frequency histogram for embeddings of length 20 using the FTSE data.
Figure 2.34 Plot of the sunspots data file Sun280.asc.
Figure 2.35 The M test graph for the sunspot data using the best embedding of length 15.
Figure 2.36 The frequency histogram of all embeddings of length 15 using the
103. tput or multiple Time Series data.

The winGamma User Guide GETTING STARTED Version 18 Jan 2002

1.2.3 Time Series data in standard format (.asc) files

Standard format for a Time Series file is DOS ASCII in the following form. Each column represents an individual time series. The rows represent values for each of the time series, successive rows being successive values in time. Within a row each numeric value is separated by spaces. The end of a row is signified by CR LF.

1.2.4 Partitioning the data

Figure 1.4 Selecting a proportion of the data for initial analysis.

It is sometimes convenient to perform the initial analysis on a subset of the whole data file. This could happen, for example, where the data set was very large. Therefore winGamma will next ask the user to select the proportion of the data which should currently be used for analysis, see Figure 1.4. We can later separate training and test data.

1.3 A first experiment

Load the 2-input 1-output data file solar.csv and select column 3 as output. Initially do not normalise. Select all the data for analysis; there are 10578 data points in the file. After the data has been successfully loaded winGamma displays the main screens, as in Figure 1.5.

The Experiments window in the Analysis Manager shows the different kinds of data analysis that can be performed. We shall discuss the meaning of th
104. ult for the first output is a Gamma statistic of 6.089E-5 with a SE of 4.3118E-5.

4. For output 1 select 'Model'. Our previous experiments suggest that about 3000 data points are needed to obtain a good model, a fact confirmed by an M test for this embedding (which curiously gives small negative results increasing towards zero). Select a local linear regression model with 10 nearest neighbours and set the Mask to 1101.

Figure 3.10 A test of the LLR model on the data set DH 34 5000.asc (blue predicted, green actual, red error).

Results: The MSError of the test set 3000-3200 is 8.2303E-6. The graph of predictions, actual values and errors is shown in Figure 3.10.

3.7.1 How the prediction quality degrades into the future

Results: From the result of Step 3 in the last experiment we actually got 50 output Gamma statistic results. The graph of Gamma against the number of steps ahead is shown in Figure 3.11. Here we see an exponential rise in the error of prediction, which is typical of a chaotic process.

We conclude that the Time Series data is of a low/zero noise smooth process which is chaotic
105. um amount of time to spend on trying to attain the MSError goal. This is shown in Figure 3.2.

Handy Tip: Backpropagation, along with most other processes in winGamma, can be paused, resumed or terminated using the buttons on the top level menu. Terminating an operation does not necessarily lose everything. Any results already calculated will be displayed and in the case of neural net training the model created so far will be retained.

Figure 3.3 shows the Analysis Manager during backpropagation training. Note that the graphical window can be zoomed and moved using the left and right mouse buttons.

Because the number of layers, the number of hidden units, and the slope of the sigmoidal are fixed, limiting the size of the weights also limits the magnitudes of the partial derivatives of the neural network as a function of its inputs. Thus if the unknown function to be approximated has regions of high curvature the training algorithm with regularisation may find it difficult to obtain the desired approximation. We can get some idea of whether this is likely to be the case by examining whether the Gradient returned by the Gamma test is unusually large.

These parameters may be left at default until fine tuning of the model is required.

3.4 Conjugate gradient descent

This is a variation and improvement on two layer vanilla backpropagation; it is generally more effective but requires more memor
106. winGamma to scientific data. However, since the data in this file only runs over about one week we do not consider this extra complication here.

If we run a quick Gamma test on the full data set with pmax = 10 near neighbours we get the results of Table 1.1 in Chapter I.

The unscaled Gamma statistic of 0.020761 seems high but in view of the output range (approx. 0 to 30) is actually quite good. A better measure is the Vratio = 0.000760 (defined as the ratio Gamma/Var(output)), which is low and shows that the output is highly predictable from the inputs.

Because the data clearly falls into two distinct classes (day and night) we should be aware that representative training and test data should include both types. The point to grasp here is that although the time series data varies from moment to moment (as clouds obscure the sun) the relationship between sunlight input at a given temperature and power output is a smooth (almost linear) model.

Table 2.3 The results of the Gamma test (pmax = 20) for unscaled and scaled data from solar.csv.

                   Unscaled    Scaled
Gamma              0.020328    0.000221
Gradient           0.250261    0.230184
Standard Error     0.002051    3.095267E-5
Vratio             0.000744    0.000884
Near Neighbours    20          20

The next step in a more careful analysis is to run an Increasing Near Neighbours test. This will give us some idea of the best pmax to choose to give the most accurate Gamma statistic. Figure 2.20 shows the increasing near neighbours test run for pmax = 3 to 50. We
107. wing sections.

2.4 Increasing near neighbours

This experiment shows how the Gamma statistic (and the other results returned by the Gamma test) varies with the number of near neighbours used to compute it. It is used to get some idea of how accurate the Gamma statistic is liable to be.

If we perform this experiment and use the graphing facility to plot the Gamma statistic and the SE against the number of near neighbours, by examining the graphs together we can usually see which choice for the number of near neighbours is likely to produce the most accurate estimate.

For example in Figure 2.4, produced from Sin500.asc, we see that the SE first increases and then for a while plateaus before, eventually, beginning to steadily increase. The range of the plateau is roughly between 7 and 27 near neighbours and it minimises at around pmax = 17 with a Gamma statistic slightly larger than 0.074 (which we know, from the way the data was constructed, is close to the correct value).

Figure 2.4 The variation of Gamma and SE as the number of near neighbours increases.

We also note that the Gamma statistic is reasonably stable in the same range.

It is
108. y. The procedures for set up are very similar.

3.5 BFGS neural network

Probably the fastest and most efficient neural network training algorithm offered by winGamma is a modified version of the Broyden Fletcher Goldfarb Shanno learning algorithm. This algorithm uses second differences and is sometimes degraded by very noisy data, but generally it is probably best to use this option first when trying to produce a neural model.

3.6 Example model construction and testing for solar.asc

We return to the example solar panel data we analysed in 2.12.3. Using the first 8400 data points and scaling we initially build a local linear regression model with k = 20. We then test this model on the remaining points in the data file.

3.6.1 Building and testing a LLR model

We initially build a LLR model using the first 8400 points (the results of our earlier analysis suggest that slightly more points are required for a really good model).

1. Load solar.csv. Do not normalise and use all the data for analysis. Execute a simple Gamma test (these steps are described in 1.3 and 2.12.3).

Handy Tip: winGamma requires that at least a simple Gamma test Experiment be conducted before any attempt to build a LLR model (a kd tree is required).

2. After the Gamma test results appear click on 'Model' in the Analysis Manager.

3. Select the training set as 1 to 8400.

4. In the Modelling Editor leave th
    