
Gaussian processes in Bayesian modeling: User guide for Matlab


Contents

1. In all of the previous examples we have had inputs only for the covariance function. However, if there are inputs for the likelihood, they should be given with an optional parameter-value pair whose indicator is 'z'. The model is now constructed, and we can optimize the parameters and evaluate the posterior predictive distribution of the latent variables:

  opt = optimset('TolFun', 1e-3, 'TolX', 1e-3, 'Display', 'iter');
  gp = gp_optim(gp, x, y, 'z', ye, 'opt', opt);
  [Ef, Varf] = gp_pred(gp, x, y, x, 'z', ye, 'tstind', 1:n);

Here we predicted to the same locations that were used for training. Thus Ef and Varf contain the posterior mean and variance, E[f | D, θ] and Var[f | D, θ]. In this case the prediction functions, such as gpla_pred for example, require the test index set for FIC as well. This is given with the parameter-value pair 'tstind', 1:n. These have previously been used with PIC (see Section 4.2). FIC is a limiting case of PIC where each data point forms one block. Whenever we predict to new locations that have not been in the training set, we do not have to worry about the test index set, since all the test inputs define their own block. However, whenever we predict for exactly the same locations that are in the training set, we should appoint the test inputs into the same block with the respective training input. This is done with FIC by giving gp_pred a vector with indices telling which of the test inputs are in the training set.
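For context, a minimal sketch of how a model with likelihood inputs might be set up before the lines above — here a Poisson disease-mapping model in which the expected counts ye are the likelihood inputs passed through 'z'. The FIC type, the inducing inputs X_u, the Laplace latent method and all parameter values are illustrative assumptions, not part of the original example.

  % Poisson observation model whose mean is ye(i)*exp(f(x_i));
  % the expected counts ye are the likelihood inputs given via 'z'
  lik  = lik_poisson();
  gpcf = gpcf_sexp('lengthScale', 5, 'magnSigma2', 0.05);
  gp   = gp_set('type', 'FIC', 'lik', lik, 'cf', gpcf, ...
                'X_u', X_u, 'jitterSigma2', 1e-4, ...
                'latent_method', 'Laplace');
  opt = optimset('TolFun', 1e-3, 'TolX', 1e-3);
  gp  = gp_optim(gp, x, y, 'z', ye, 'opt', opt);
  [Ef, Varf] = gp_pred(gp, x, y, x, 'z', ye, 'tstind', 1:n);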
2. … (2.2), and the normalization is computed from the Gaussian approximation itself (Laplace, EP) or the normalization is not needed at all (MCMC). Section 2.3 treats the problem of marginalizing over the parameters to obtain the marginal posterior distribution for the latent variables,

  p(f | D) = ∫∫ p(f | D, θ, φ) p(θ, φ | D) dθ dφ.   (2.3)

The posterior predictive distributions can be obtained similarly by first evaluating the conditional posterior predictive distribution, for example p(f̃ | D, θ, x̃), and then marginalizing over the parameters. The joint predictive distribution p(ỹ | D, θ, x̃) would require integration over a possibly high-dimensional distribution p(f̃ | D, θ, x̃), but usually we are interested only in the predictive distribution for each ỹ_i separately. Since the observation model is factorizable, this requires only one-dimensional integrals (except for multi-output models). For many observation models, which do not have free parameters φ, this integral reduces to marginalization over f̃ only.

2.1 Conditional posterior of the latent function

2.1.1 The posterior mean and covariance

If the parameters are considered fixed, the GP's marginalization and conditionalization properties can be exploited in the prediction. Assume that we have found the conditional posterior distribution p(f | D, θ, φ), which in general is not Gaussian. We can then evaluate the posterior predictive mean …
3. …, 'cf', gpcf, 'jitterSigma2', 1e-8);
  gp = gp_optim(gp, x, y, 'opt', opt);
  Eft = gp_pred(gp, x, y, xt);

With this data the covariance matrix is rather sparse, since only about 5% of its elements are non-zero. The following lines show how to evaluate the sparsity of the covariance matrix and how to plot the non-zero structure of the matrix. The structure of the covariance matrix is plotted after the AMD permutation (Davis, 2006) in Figure 4.1.

  K = gp_trcov(gp, x);
  nnz(K) / prod(size(K))
  p = amd(K);
  spy(K(p,p), 'k')

In Section 7.2.2 we discuss a demo with a non-Gaussian likelihood and a compactly supported covariance function.

4.2 FIC and PIC sparse approximations

Snelson and Ghahramani (2006) proposed the sparse pseudo-input Gaussian process (SPGP), which Quiñonero-Candela and Rasmussen (2005) later named the fully independent (training) conditional, FI(T)C. The original idea in SPGP was that the sparse approximation is used only in the training phase and predictions are conducted using the exact covariance matrix, which is where the word "training" in the name comes from. If the approximation is used also for the predictions, the word "training" drops out, leading to FIC. In this case FIC can be seen as a non-stationary covariance function on its own (Snelson, 2007). The partially independent conditional (PIC) sparse approximation is an extension of FIC (Quiñonero-Candela and Rasmussen, 2005; Snelson and Ghahramani, 2007), and they are b…
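A minimal sketch of how the compactly supported covariance model used above might be constructed before these lines. The gpcf_ppcs2 covariance function is taken from the toolbox's own list; its 'nin' input-dimension argument and the parameter values are illustrative assumptions.

  % Gaussian likelihood with a compactly supported piecewise polynomial covariance
  lik  = lik_gaussian('sigma2', 0.2^2);
  gpcf = gpcf_ppcs2('nin', size(x,2), 'lengthScale', [1 1], 'magnSigma2', 1);
  gp   = gp_set('lik', lik, 'cf', gpcf, 'jitterSigma2', 1e-8);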
4. Gaspari, G. and Cohn, S. (1999). Construction of correlation functions in two and three dimensions. Quarterly Journal of the Royal Meteorological Society, 125(554):723-757.
Gelfand, A. E., Diggle, P. J., Fuentes, M., and Guttorp, P. (2010). Handbook of Spatial Statistics. CRC Press.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1(3):515-533.
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004). Bayesian Data Analysis. Chapman & Hall/CRC, second edition.
Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57(6):721-741.
Gibbs, M. N. and MacKay, D. J. C. (2000). Variational Gaussian process classifiers. IEEE Transactions on Neural Networks, 11(6):1458-1464.
Gilks, W., Richardson, S., and Spiegelhalter, D. (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall.
Gneiting, T. (1999). Correlation functions for atmospheric data analysis. Quarterly Journal of the Royal Meteorological Society, 125:2449-2464.
Gneiting, T. (2002). Compactly supported correlation functions. Journal of Multivariate Analysis, 83:493-508.
Goel, P. K. and DeGroot, M. H. (1981). Information about hyperparameters in hierarchical models. Journal of the American Statistical Association, 76(373):140-147.
Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society. Series B (Methodological), 14(1):107-114.
5. …'prior_cov', 1);

  % Initialize the GP structure
  gp = gp_set('lik', lik, 'cf', gpcf, 'meanf', {gpmf1, gpmf2, gpmf3});

Figure 8.1 presents the underlying process, the GP prediction, the used mean function, the 95% confidence interval and the observations.

[Figure 8.1: GP prediction for the process f with a mean function. The panel "GP regression with a mean function" shows the prediction, the 95% interval, the mean function and the observations (output y against input x).]

Appendix A  Function list

A.1 GP
THE GP TOOLS (in the gp folder)

Gaussian process utilities:
  GP_SET    Create and modify a Gaussian process structure
  GP_PAK    Combine GP parameters into one vector
  GP_UNPAK  Set GP parameters from vector to structure
  GP_COV    Evaluate covariance matrix between two input vectors
  GP_TRCOV  Evaluate training covariance matrix (gp_cov + noise covariance)
  GP_TRVAR  Evaluate training variance vector
  GP_RND    Random draws from the posterior Gaussian process

Covariance functions:
  GPCF_CAT            Create a categorical covariance function
  GPCF_CONSTANT       Create a constant covariance function
  GPCF_EXP            Create an exponential covariance function
  GPCF_LINEAR         Create a linear covariance function
  GPCF_MATERN32       Create a Matérn nu=3/2 covariance function
  GPCF_MATERN52       Create a Matérn nu=5/2 covariance function
  GPCF_NEURALNETWORK  Create a neural network covariance function
  GPCF_PERIODIC       Create a periodic covariance function
  GPCF_PPCS0          Create a piecewise polynomial (compactly supported) covariance function
  GPCF_PPCS1          Create a piecewise polynomial (compactly supported) covariance function
  GPCF_PPCS2          Create a piecewise polynomial (compactly supported) covariance function
  GPCF_PPCS3          Create a piecewise polynomial (compactly supported) covariance function
  GPCF_PROD           Create a product covariance function
  GPCF_RQ             Create a rational quadratic covariance function
  GPCF_SEXP           Create a squared exponential covariance function
  …
6. …482.
Moreaux, G. (2008). Compactly supported radial covariance functions. Journal of Geodesy, 82(7):431-443.
Neal, R. (1998). Regression and classification using Gaussian process priors. In Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M., editors, Bayesian Statistics 6, pages 475-501. Oxford University Press.
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer.
Neal, R. M. (1997). Monte Carlo Implementation of Gaussian Process Models for Bayesian Regression and Classification. Technical Report 9702, Dept. of Statistics and Dept. of Computer Science, University of Toronto.
Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31(3):705-767.
Nickisch, H. and Rasmussen, C. E. (2008). Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035-2078.
O'Hagan, A. (1978). Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society. Series B, 40(1):1-42.
O'Hagan, A. (1979). On outlier rejection phenomena in Bayes inference. Journal of the Royal Statistical Society. Series B, 41(3):358-367.
O'Hagan, A. (2004). Dicing with the unknown. Significance, 1:132-133.
Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(3):1939-1959.
Rasmussen, C. E. (1996). Evaluations of Gaussian Processes an…
7. Grewal, M. S. and Andrews, A. P. (2001). Kalman Filtering: Theory and Practice Using MATLAB. Wiley-Interscience, second edition.
Harville, D. A. (1997). Matrix Algebra From a Statistician's Perspective. Springer-Verlag.
Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773-795.
Kaufman, C. G., Schervish, M. J., and Nychka, D. W. (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association, 103(484):1545-1555.
Kuss, M. and Rasmussen, C. E. (2005). Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679-1704.
Lawrence, N. (2007). Learning for larger datasets with the Gaussian process latent variable model. In Meila, M. and Shen, X., editors, Proceedings of the Eleventh International Workshop on Artificial Intelligence and Statistics. Omnipress.
Lawson, A. B. (2001). Statistical Methods in Spatial Epidemiology. John Wiley & Sons, Ltd.
Matheron, G. (1973). The intrinsic random functions and their applications. Advances in Applied Probability, 5(3):439-468.
Minka, T. (2001). A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology.
Møller, J., Syversveen, A. R., and Waagepetersen, R. P. (1998). Log Gaussian Cox processes. Scandinavian Journal of Statistics, 25:451-482.
8. The HMC options are set into the hmc_opt structure in a similar manner as in the regression example. Since we are also sampling the latent variables, we need to give options for their sampler as well; these options are set into the latent_opt structure. The options specific to gp_mc are given with the parameter-value pairs 'nsamples', 1 and 'repeat', 15. The above lines also demonstrate how the sampling can be continued from an old sample chain. The first call to gp_mc returns a record structure with only one sample. This record is then given as an optional parameter for gp_mc in the second call, and the sampling is continued from the previously sampled parameter values. The sampling options are also modified between the two successive sampling phases, and the modified options are then given to gp_mc. The line

  hmc2('state', sum(100*clock));

resets the state of the random number generators in the HMC sampler. The last two lines evaluate the predictive statistics similarly to the EP and Laplace approximations. However, now the statistics are matrices whose columns contain the result for one MCMC sample each. The gp_mc function handles the sampling so that it first samples the latent variables from p(f | θ, D) using the scaled Metropolis-Hastings algorithm, after which it samples the hyperparameters from p(θ | D, f). This is repeated until nsamples samples are drawn. The MCMC solution…
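A minimal sketch of how these option structures and the two successive gp_mc calls might look, assuming a classification model gp with an MCMC latent method already set up. The specific option fields, their values, and the 'record' option name used to continue an old chain are illustrative assumptions and should be checked against hmc2_opt and the gp_mc help text.

  % options for the hyperparameter HMC sampler and for the latent-variable sampler
  hmc_opt = hmc2_opt();
  hmc_opt.steps   = 10;
  hmc_opt.stepadj = 0.05;
  latent_opt = hmc2_opt();
  latent_opt.steps   = 20;
  latent_opt.stepadj = 0.15;
  hmc2('state', sum(100*clock));        % reset the random number generator state
  % first call: one sample, returned in the record structure rgp
  [rgp, gp] = gp_mc(gp, x, y, 'hmc_opt', hmc_opt, 'latent_opt', latent_opt, ...
                    'nsamples', 1, 'repeat', 15);
  % continue sampling from the old chain with modified options
  hmc_opt.stepadj = 0.08;
  rgp = gp_mc(gp, x, y, 'hmc_opt', hmc_opt, 'latent_opt', latent_opt, ...
              'nsamples', 400, 'record', rgp);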
9. …niteness. A globally supported covariance function cannot be cut arbitrarily to obtain a compact support, since the resulting function would not in general be positive definite. Sansò and Schuh (1987) provide one of the early implementations of spatial prediction with CS covariance functions. Their functions are built by self-convoluting finite-support symmetric kernels, such as a linear spline. These are, however, special functions for one or two dimensions. Wu (1995) introduced radial basis functions with a compact support and Wendland (1995) developed them further. Later, for example, Gaspari and Cohn (1999), Gneiting (1999, 2002) and Buhmann (2001) have worked more on the subject. The CS functions implemented in GPstuff are Wendland's piecewise polynomials k_pp (Wendland, 2005), such as

  k_pp2(x_i, x_j) = (σ²/3) (1 - r)_+^(j+2) [ (j² + 4j + 3) r² + (3j + 6) r + 3 ],   (4.1)

where j = ⌊d/2⌋ + 3 and r² = Σ_{k=1}^{d} (x_{i,k} - x_{j,k})² / l_k². These functions correspond to processes that are 2q times mean square differentiable and are positive definite up to an input dimension d. Thus, the degree of the polynomial has to be increased alongside the input dimension. The dependence of CS covariance functions on the input dimension is very fundamental: there are no radial compactly supported functions that are positive definite on every R^d, but they are always restricted to a finite number of dimensions (see, e.g., Wendland, 1995, theorem 9.2). The key idea of using CS covariance function…
10. 5.3.1 Leave-one-out cross-validation
For a GP with a Gaussian noise model and given covariance parameters, the LOO-CV predictions can be computed using an analytical solution (Sundararajan and Keerthi, 2001), which is implemented in gp_loopred.

5.3.2 k-fold cross-validation
The LOO-CV is computationally feasible only for the Gaussian likelihood and fixed parameters (with gp_looe). To reduce the computation time, in k-fold CV we use only k (e.g., k = 10) k-fold-CV distributions p(θ | D^(s\i), M) and get a collection of predictive densities

  p(y_i | x_i, D^(s\i), M),  i = 1, 2, …, n,   (5.5)

where s\i is a set of data points defined as follows: the data are divided into k groups so that their sizes are as nearly equal as possible, and s\i is the set of data points in the group to which the i-th data point belongs. The expected utility estimated by the k-fold CV is then

  ū_CV = E_i[ u(y_i, x_i, D^(s\i), M) ].   (5.6)

Since the k-fold-CV predictive densities are based on smaller training data sets D^(s\i) than the full data set D, the expected utility estimate is slightly biased. This bias can be corrected using a first order bias correction (Burman, 1989):

  ū_tr   = E_i[ u(y_i, x_i, D, M) ],   (5.7)
  ū_cvtr = E_j[ E_i[ u(y_i, x_i, D^(s\j), M) ] ],  j = 1, …, k,   (5.8)
  ū_CCV  = ū_CV + ū_tr - ū_cvtr,   (5.9)

where ū_tr is the expected utility evaluated with the training data given the training data, and ū_cvtr is the average of the expected utilities evaluated with the training data given…
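A minimal sketch of the analytical LOO-CV mentioned above, assuming a full GP with a Gaussian likelihood that has already been fitted to x and y; the output argument names are illustrative and should be checked from the gp_loopred help.

  % analytical leave-one-out predictive mean, variance and log predictive density
  [Eft_loo, Varft_loo, lpyt_loo] = gp_loopred(gp, x, y);
  mlpd_loo = mean(lpyt_loo);     % mean log predictive density, a common utility u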
11. Figure 4.2 shows the predicted surface. One should notice that the PIC prediction is discontinuous, whereas the predictions with FIC and the full GP are continuous. The discontinuities take place at the block boundaries and are a result of the discontinuous covariance function that PIC resembles. This issue is discussed in more detail by Vanhatalo et al. (2010).

4.3 Deterministic training conditional, subset of regressors and variational sparse approximation

The deterministic training conditional (DTC) is based on the works by Csató and Opper (2002) and Seeger et al. (2003), and was earlier called Projected Latent Variables (see Quiñonero-Candela and Rasmussen, 2005, for more details). The approximation can be constructed similarly to FIC and PIC by defining the inducing conditional, which in the case of DTC is

  q(f | X, X_u, u, θ) = N(f | K_{f,u} K_{u,u}^{-1} u, 0).   (4.7)

This implies that the approximate prior over the latent variables is

  q(f | X, X_u, θ) = N(f | 0, K_{f,u} K_{u,u}^{-1} K_{u,f}).   (4.8)

The deterministic training conditional is not, strictly speaking, a proper GP, since it uses a different covariance function for the latent variables appointed to the training inputs and for the latent variables at the prediction sites f̃. The prior covariance for f̃ is the true covariance K_{f̃,f̃} instead of K_{f̃,u} K_{u,u}^{-1} K_{u,f̃}. This does not affect the predictive mean, since the cross-covariance Cov[f̃, f] = K_{f̃,u} K_{u,u}^{-1} K_{u,f}, but it gives a larger predictive va…
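In GPstuff these approximations are selected with the 'type' option of gp_set, in the same way as FIC earlier; a minimal sketch, assuming lik, gpcf and the inducing inputs X_u already exist and that 'DTC', 'VAR' and 'SOR' are the type strings used by the toolbox (an assumption to verify against the gp_set help).

  % deterministic training conditional; replace 'DTC' with 'VAR' or 'SOR'
  % for the variational or subset-of-regressors approximation
  gp_dtc = gp_set('type', 'DTC', 'lik', lik, 'cf', gpcf, ...
                  'X_u', X_u, 'jitterSigma2', 1e-4);
  gp_dtc = gp_optim(gp_dtc, x, y, 'opt', opt);
  [Ef_dtc, Varf_dtc] = gp_pred(gp_dtc, x, y, xt);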
12. [continuation of the random number generator list in Appendix A]
  NORMLTRAND    Random draws from a left-truncated normal distribution with mean mu, variance sigma2
  NORMRTRAND    Random draws from a right-truncated normal distribution with mean mu, variance sigma2
  NORMTRAND     Random draws from a normal distribution truncated to an interval
  NORMTZRAND    Random draws from a normal distribution truncated by zero
  SINVCHI2RAND  Random matrices from the scaled inverse-chi-squared distribution
  TRAND         Random numbers from Student's t distribution
  UNIFRAND      Generate uniform random numbers from an interval
  WISHRND       Random matrices from the Wishart distribution
  KERNELP       Kernel density estimator for a one-dimensional distribution
  HAMMERSLEY    Hammersley quasi-random sequence

A.4 Monte Carlo
MONTE CARLO FUNCTIONS (in the mc folder)
  BBMEAN       Bayesian bootstrap mean
  GIBBS        Gibbs sampling
  HMC2         Hybrid Monte Carlo sampling
  HMC2_OPT     Default options for Hybrid Monte Carlo sampling
  HMEAN        Harmonic mean
  METROP2      Markov chain Monte Carlo sampling with the Metropolis algorithm
  METROP2_OPT  Default options for Metropolis sampling
  RESAMPDET    Deterministic resampling
  RESAMPRES    Residual resampling
  RESAMPSIM    Simple random resampling
  RESAMPSTR    Stratified resampling
  SLS          Markov chain Monte Carlo sampling using slice sampling
  SLS_OPT      Default options for slice sampling
  SLS1MM       One-dimensional fast min/max slice sampling
  SLS1MM_OPT   Default options for SLS1MM
  SOFTMAX2     Softmax transfer function

A.5 Miscellaneous
MISCELLANEOUS FUNCTIONS (in the misc folder)
  CVIT   Create itr and itst indices for k-fold CV
  CVITR  Create itr …
13. …in the case of EP. If expectation propagation is used for inference, the model assessment is conducted similarly as with the Laplace approximation. Also the MCMC and IA solutions are evaluated identically to the Gaussian case. For this reason the code is not repeated here.

Chapter 6  Covariance functions
In the previous chapters we have not paid much attention to the choice of the covariance function. However, GPstuff has a rather versatile collection of covariance functions, which can be combined in numerous ways. The different functions are collected in Appendix B. This chapter demonstrates some of the functions and ways to combine them.

6.1 Neural network covariance function
A good example of a covariance function with very different properties than the standard stationary covariance functions, such as the squared exponential or Matérn covariance functions, is the neural network covariance function. In this section we demonstrate its use in two simple regression problems. The squared exponential covariance function is taken as a reference, and the code is found in demo_neuralnetCov.

6.2 Additive models
In many practical situations a GP prior with only one covariance function may be too restrictive, since such a construction can effectively model only one phenomenon. For example, the latent function may vary rather smoothly across the whole area of interest, but at the same time it can have fast local variations. In this case a…
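A minimal sketch of how the two covariance functions compared in that demo might be set up, assuming 1D training data x, y and test inputs xt; the parameter names of gpcf_neuralnetwork and all values are illustrative assumptions.

  % neural network covariance function versus squared exponential
  lik    = lik_gaussian('sigma2', 0.1^2);
  gpcfnn = gpcf_neuralnetwork('weightSigma2', 1, 'biasSigma2', 1);
  gpcfse = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);
  gp_nn  = gp_set('lik', lik, 'cf', gpcfnn);
  gp_se  = gp_set('lik', lik, 'cf', gpcfse);
  gp_nn  = gp_optim(gp_nn, x, y, 'opt', opt);
  gp_se  = gp_optim(gp_se, x, y, 'opt', opt);
  [Ef_nn, Varf_nn] = gp_pred(gp_nn, x, y, xt);
  [Ef_se, Varf_se] = gp_pred(gp_se, x, y, xt);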
14. …the observation model is

  p(y | f, z) = Π_{i=1}^{n} (z_i choose y_i) p_i^{y_i} (1 - p_i)^{z_i - y_i},  where p_i = 1 / (1 + exp(-f_i)).

Here z_i denotes the number of trials and y_i is the number of successes.

Appendix D  Priors
This chapter lists all the priors implemented in the GPstuff package.

Gaussian prior (prior_gaussian). The Gaussian distribution is parametrized as
  p(θ) = (1/√(2πσ²)) exp( -(θ - μ)² / (2σ²) ),   (D.1)
where μ is a location parameter and σ² is a scale parameter.

Log-Gaussian prior (prior_loggaussian). The log-Gaussian distribution is parametrized as
  p(θ) = (1/(θ√(2πσ²))) exp( -(log θ - μ)² / (2σ²) ),   (D.2)
where μ is a location parameter and σ² is a scale parameter.

Laplace prior (prior_laplace). The Laplace distribution is parametrized as
  p(θ) = (1/(2σ)) exp( -|θ - μ| / σ ),   (D.3)
where μ is a location parameter and σ > 0 is a scale parameter.

Student-t prior (prior_t). The Student-t distribution is parametrized as
  p(θ) = Γ((ν+1)/2) / ( Γ(ν/2) √(νπ) σ ) [ 1 + (θ - μ)² / (νσ²) ]^(-(ν+1)/2),   (D.4)
where μ is a location parameter, σ is a scale parameter and ν > 0 is the degrees of freedom.

Square root Student-t prior (prior_sqrtt). In the square root Student-t prior the Student-t distribution (D.4) is placed on the square root of the parameter, i.e. √θ follows the Student-t distribution with location μ, scale σ and degrees of freedom ν > 0.   (D.5)

Scaled inverse-χ² prior (prior_sinvchi2). The scaled inverse-χ² distribution is parametrized as
  p(θ) ∝ …
15. …respect to the parameters, which allows gradient-based optimization. The gradients of the Laplace-approximated log marginal likelihood (2.20) can be solved analytically too. In EP, the site parameters can be considered constants when differentiating the approximate log marginal likelihood with respect to the parameters (Seeger, 2005), for which reason gradient-based optimization is possible also with EP. The advantage of the MAP estimate is that it is relatively easy and fast to evaluate; according to our experience, good optimization algorithms usually need at most tens of optimization steps to find the mode. However, it may underestimate the uncertainty in the parameters.

2.3.2 Grid integration
Grid integration is based on a weighted sum of points evaluated on a grid,

  p(f | D, X) ≈ Σ_{i=1}^{M} p(f | D, ϑ_i) p(ϑ_i | D) Δ_i,   (2.24)

where Δ_i denotes the area weight appointed to an evaluation point ϑ_i (the possibly transformed parameter vector). The first step in exploring log p(ϑ | D) is to find its posterior mode, as described in the previous section. After this we evaluate the negative Hessian of log p(ϑ | D) at the mode, which would be the inverse covariance matrix for ϑ if the density were Gaussian. If P, the inverse Hessian, is the approximate covariance matrix with eigendecomposition P = U C Uᵀ, then ϑ can be defined as

  ϑ(z) = ϑ̂ + U C^(1/2) z,   (2.25)

where z is a standardized variable. If p(ϑ | D) were a Gaussian density, then z would be zero-mean normally distributed. This re-para…
16. …the k-fold CV training sets. GPstuff provides gp_kfcv, which computes the k-fold CV and bias-corrected k-fold CV with the log score and the root mean squared error (RMSE). The function gp_kfcv also provides basic variance estimates for the predictive performance estimates. First the mean expected utility ū_j for each of the k folds is computed (the ū_j's tend to be closer to Gaussian due to the central limit theorem), and then the variance of the expected utility is computed (see, e.g., Dietterich, 1998). Although some information is lost by first taking the sub-expectations, the estimate is a useful indicator of the related uncertainty. See Vehtari and Lampinen (2002) for more details on estimating the uncertainty in performance estimates.

5.4 DIC
The deviance information criterion (DIC) is another very popular model selection criterion (Spiegelhalter et al., 2002). DIC also estimates the predictive performance, but replaces the predictive distribution with a plug-in predictive distribution, where a plug-in estimate is used, and uses the deviance as the loss function. With parametric models without any hierarchy it is usually written as

  p_eff = E_{θ|D}[ D(y, θ) ] - D(y, E_{θ|D}[θ]),   (5.11)
  DIC   = E_{θ|D}[ D(y, θ) ] + p_eff,   (5.12)

where p_eff is the effective number of parameters and D(y, θ) = -2 log p(y | θ) is the deviance. Since our models are hierarchical, we need to decide the parameters in focus (see Spiegelhalter et al., 2002, for discussion on this). The parameters in focus…
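A minimal sketch of how gp_kfcv might be called for a constructed model, assuming a GP structure gp and data x, y; the option names 'k' and 'inf_method' and their values are assumptions to be checked against the gp_kfcv help.

  % 10-fold cross-validation using MAP estimates for the parameters;
  % the returned criteria include the mean log predictive density and RMSE
  cvres = gp_kfcv(gp, x, y, 'k', 10, 'inf_method', 'MAP');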
17. …throughout this paper. In Section 8.1.1 we describe how to use non-zero mean functions in GPstuff. As will be seen, the class of models described by the equations (1.3)-(1.5) is rather rich. Even though the observation model is assumed to factorize given the latent variables f, the correlations between the observations are incorporated into the model via the GP prior, and the marginalized observation model

  p(y | X, θ) = ∫ p(f | X, θ) Π_{i=1}^{n} p(y_i | f_i) df

is no longer factorizable. Gaussian processes are certainly not a new invention. Early examples of their usage can be found, for example, in time series analysis and filtering (Wiener, 1949) and in geostatistics (e.g., Matheron, 1973). GPs are still widely and actively used in these fields, and useful overviews are provided by Cressie (1993), Grewal and Andrews (2001), Diggle and Ribeiro (2007) and Gelfand et al. (2010). O'Hagan (1978) was one of the first to consider Gaussian processes in a general probabilistic modeling context. He provided a general theory of Gaussian process prediction and utilized it for a number of regression problems. This general regression framework was later rediscovered as an alternative to neural network models (Williams and Rasmussen, 1996; Rasmussen, 1996) and extended to other problems than regression (Neal, 1997; Williams and Barber, 1998). This machine learning perspective is comprehensively summarized by Rasmussen and Williams (2006).

1.2.1 Gaussian process prior
Th…
18. [Figure 1.2: A conditional posterior GP, p(f̃ | f, θ). The observations f (at the inputs 0.7, 1.3, 2.4 and 3.9) are plotted with circles in the upper left subfigure, and the prior GP is illustrated in Figure 1.1.]

When comparing the subfigures to the equivalent ones in Figure 1.1, we can see a clear distinction between the marginal and the conditional GP. Here all the function samples travel through the observations, the mean is no longer zero, and the covariance is non-stationary.

Chapter 2  Inference and prediction
As was stated in the previous section, the essential task in Bayesian inference is to solve the posterior distribution of the quantities of interest. In an ideal situation the posterior distribution could be solved analytically, but most of the time we need to approximate it. GPstuff is built so that the first inference step is to form, either analytically or approximately, the conditional posterior of the latent variables given the parameters,

  p(f | D, θ) = p(y | f) p(f | X, θ) / ∫ p(y | f) p(f | X, θ) df,   (2.1)

where the integral over f in the denominator is difficult for non-Gaussian likelihoods. Section 2.1 discusses the approximation of the posterior of f, first generally, and then presents the exact posterior in the case of a Gaussian likelihood, and the Laplace, expectation propagation (EP) and Markov chain Monte Carlo (MCMC) approximations in the case of a non-Gaussian likelihood. The approximations use the unnormalized posterior p(f | D, θ) ∝ p(y | f) p(f | X, θ)…
19. …1:n here. The posterior mean and variance of the latent variables are shown in Figure 7.2.

[Figure 7.2: The posterior predictive mean and variance of the latent function in the demo_spatial1 data set, obtained with FIC. (a) The posterior mean. (b) The posterior variance.]

The demo also contains an MCMC implementation for the model, but it is not discussed here. Using a Markov chain sampler for the Poisson likelihood is a very straightforward extension of its usage in the classification model. The only difference is that we have to carry along the extra input e.

7.2.2 Disease mapping with negative binomial likelihood
The negative binomial distribution is a robust version of the Poisson distribution, similarly as the Student-t distribution can be considered a robustified Gaussian distribution (Gelman et al., 2004). In GPstuff it is parametrized as

  y | f, e, r ~ Π_{i=1}^{n} [ Γ(r + y_i) / (y_i! Γ(r)) ] (r / (r + μ_i))^r (μ_i / (r + μ_i))^{y_i},   (7.7)
  f(x) | θ ~ GP( 0, k(x, x' | θ) ),   (7.8)
  θ ~ half-Student-t(ν, σ²),   (7.9)

where μ_i = e_i exp(f(x_i)) and r is the dispersion parameter governing the variance. The model is demonstrated in demo_spatial2, where the data are simulated so that the latent function is drawn randomly from a GP with a piecewise polynomial covariance function, and the observed death cases are sampled from a negative binomial distribution. This is done in order to demonstrate the use of CS covariance functions with non-Gaussian…
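A minimal sketch of how this model might be constructed, assuming spatial inputs x, death counts y and expected counts ye; the Laplace latent method, the 'nin' argument of the CS covariance and all parameter values are illustrative assumptions.

  % negative binomial observation model with a compactly supported covariance
  lik  = lik_negbin();
  gpcf = gpcf_ppcs2('nin', size(x,2), 'lengthScale', 5, 'magnSigma2', 0.05);
  gp   = gp_set('lik', lik, 'cf', gpcf, 'jitterSigma2', 1e-4, ...
                'latent_method', 'Laplace');
  gp   = gp_optim(gp, x, y, 'z', ye, 'opt', opt);
  [Ef, Varf] = gp_pred(gp, x, y, x, 'z', ye);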
20. …are those over which the expectations are taken when evaluating the effective number of parameters and DIC. In the above equations the focus is on the parameters, and in the case of a hierarchical GP model in GPstuff the latent variables would be integrated out before evaluating DIC. If we have a MAP estimate for the parameters, we may be interested in evaluating the DIC statistics with the focus on the latent variables. In this case the above formulation would be

  p_D(θ) = E_{f|D,θ}[ D(y, f | θ) ] - D(y, E_{f|D,θ}[f] | θ),   (5.13)
  DIC    = E_{f|D,θ}[ D(y, f | θ) ] + p_D(θ).   (5.14)

Here the effective number of parameters is denoted differently, with p_D(θ), since now we are approximating the effective number of parameters in f conditionally on θ, which is different from p_eff. p_D(θ) is a function of the parameters, and it measures to what extent the prior correlations are preserved in the posterior of the latent variables given θ. For non-informative data p_D(θ) = 0 and the posterior is the same as the prior. The greater p_D is, the more the model is fitted to the data, and large values compared to n indicate a potential overfit. A large p_D also indicates that we cannot assume that the conditional posterior approaches normality by the central limit theorem. Thus p_D can be used for assessing the goodness of the Laplace or EP approximation for the conditional posterior of the latent variables, as discussed by Rue et al. (2009) and Vanhatalo e…
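A minimal sketch of how these statistics might be computed in GPstuff for a fitted model gp; the 'focus' option string and the output arguments are assumptions and should be verified against the gp_dic and gp_peff help texts.

  % DIC and effective number of parameters with the focus on the latent variables
  [dic_latent, p_eff_latent] = gp_dic(gp, x, y, 'focus', 'latent');
  % effective number of parameters used to assess the Laplace/EP approximation
  p_eff = gp_peff(gp, x, y);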
21. Chapter 1  Introduction

1.1 Bayesian modeling
Building a Bayesian model, denoted by M, starts with an observation model, which contains the description of the data generating process or its approximation. The observation model is denoted by p(D | θ, M), where θ stands for the parameters and D for the observations. The observation model quantifies a conditional probability for the data given the parameters and, when viewed as a function of the parameters, it is called the likelihood. If the parameter values were known, the observation model would contain all the knowledge of the phenomenon and could be used as such. If the observations contain randomness, sometimes called noise, one would still be uncertain about future observations but could not reduce this uncertainty, since everything that can be known exactly would be encoded in the observation model. Usually the parameter values are not known exactly, but there is only limited knowledge of their possible values. This prior information is formulated mathematically by the prior probability p(θ | M), which reflects our beliefs and knowledge about the parameter values before observing data. Opposed to the aleatory uncertainty encoded in the observation model, the epistemic uncertainty present in the prior can be reduced by gathering more information on the phenomenon (for a more illustrative discussion on the differences between these two sources of uncertainty, see O'Hagan, 2004). Bayesian inference…
22.   [Ey_full, Vary_full] = gp_pred(gp, x, y, xt);
  [Ef_full1, Varf_full1] = gp_pred(gp, x, y, x, 'predcf', 1);
  [Ef_full2, Varf_full2] = gp_pred(gp, x, y, x, 'predcf', [2 3]);

The two components Ef_full1 and Ef_full2 above are basically identical to the components shown in Figure 6.3, which shows that there is no practical difference in the interpolation performance between the two models considered in this demo. However, the additive model with the periodic component has much more predictive power.

[Figure 6.4: The Mauna Loa CO2 data; prediction with two different models. (a) Two additive components: on the left a model with covariance function k_sexp(x, x') + k_pp2(x, x'). (b) Three additive components with a periodic covariance function: on the right a model with covariance function k_sexp(x, x' | θ1) + k_sexp(x, x' | θ2) + k_periodic(x, x'). It can be seen that the latter has more predictive power into the future.]

This is illustrated in Figure 6.4, where one can see that the prediction with the non-periodic model starts to decrease towards the prior mean very quickly and does not extrapolate the period, whereas the periodic model extrapolates the almost linear increase and the periodic behaviour. The MCMC or grid integration approach for the additive model is identical to regression with only one covariance function and is not repeated here.

6.2.2 Additive models with sparse…
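For reference, a minimal sketch of how the periodic additive model of Figure 6.4(b) might be constructed before the prediction lines above, assuming monthly CO2 data in x, y; the covariance parameter values and the 'period' option value are illustrative assumptions.

  % long-term trend + shorter-scale variation + yearly periodic component
  gpcf1 = gpcf_sexp('lengthScale', 67, 'magnSigma2', 66);
  gpcf2 = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);
  gpcfp = gpcf_periodic('lengthScale', 1.3, 'magnSigma2', 2.4, 'period', 12);
  lik   = lik_gaussian('sigma2', 0.3);
  gp    = gp_set('lik', lik, 'cf', {gpcf1, gpcf2, gpcfp});
  gp    = gp_optim(gp, x, y, 'opt', opt);
  % contribution of the periodic component alone
  [Ef_per, Varf_per] = gp_pred(gp, x, y, xt, 'predcf', 3);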
23. [Figure 6.5: GP latent mean predictions using a MAP estimate with different additive and non-additive covariance functions, e.g. k_m(x1, x'1 | θ1) + k_m(x2, x'2 | θ2) and k_m(x, x' | θ). The 2D toy data are generated from an additive process.]

  gpcf1 = gpcf_exp();
  gpcf2 = gpcf_matern32();
  gpcf  = gpcf_prod('cf', {gpcf1, gpcf2});

Above, we first initialized the two functions to be multiplied, and on the third line we constructed a covariance function structure which handles the actual multiplication. The product covariance can be combined with the metric structures as well. For example, if we want to model a temporal component with one covariance function and the spatial components with another, we can construct the covariance function k(x, x') = k1(x1, x'1) k2([x2, x3], [x'2, x'3]) as follows:

  metric1 = metric_euclidean([1]);
  metric2 = metric_euclidean([2 3]);
  gpcf1 = gpcf_exp('metric', metric1);
  gpcf2 = gpcf_matern32('metric', metric2);
  gpcf  = gpcf_prod('cf', {gpcf1, gpcf2});

The product covariance gpcf_prod can be used to combine the categorical covariance gpcf_cat with other covariance functions to build hierarchical linear and non-linear models.
24. …and itst indices for k-fold CV with random permutation.
  MAPCOLOR      Returns a colormap ranging from blue through gray to red
  MAPCOLOR2     Create a blue-gray-red colormap
  M2KML         Converts GP prediction results to a KML file
  QUAD_MOMENTS  Calculate the 0th, 1st and 2nd moments of a given unnormalized probability distribution
  RANDPICK      Pick an element from x randomly; if x is a matrix, pick a row from x randomly
  STR2FUN       Compatibility wrapper to str2func
  SET_PIC       Set the inducing inputs and blocks for two-dimensional input data
  WMEAN         Weighted mean

A.6 Optimization
OPTIMIZATION FUNCTIONS (in the optim folder)
  BSEARCH      Finds the minimum of a combinatorial function using backward search
  BSEARCH_OPT  Default options for backward search
  FSEARCH      Finds the minimum of a combinatorial function using forward search
  FSEARCH_OPT  Default options for forward search
  SCGES        Scaled conjugate gradient optimization with early stopping
  SCGES_OPT    Default options for scaled conjugate gradient optimization
  SCGES        Scaled conjugate gradient optimization with early stopping (new options structure)
  SCG2         Scaled conjugate gradient optimization
  SCG2_OPT     Default options for scaled conjugate gradient optimization (scg2; new options structure)
  FMINSCG      Scaled conjugate gradient optimization
  FMINLBFGS    Limited-memory quasi-Newton (Broyden-Fletcher-Goldfarb-Shanno, BFGS) optimization

Appendix B  Covariance functions
In this section we summarize all the covariance functions in the GPstuff package.

Squar…
25. …as usual, but with derivative observations the joint covariance matrices are to be used instead of the normal ones. Using derivative observations in GPstuff requires two steps. When initializing the GP structure, one must set the option derivobs to 'on'. The second step is to form a right-sized observation vector: with input size n x m, the observation vector with derivatives should be of size n + m*n. The observation vector is constructed by adding the partial derivative observations after the function value observations,

  ỹ = [ y ; ∂y/∂x_1 ; … ; ∂y/∂x_m ].   (7.13)

A different noise level could be assumed for the function values and the derivative observations, but at the moment the implementation allows only the same noise for all the observations.

7.5.1 GP regression with derivatives (demo_derivativeobs)
In this section we go through the demonstration demo_derivativeobs. The demo presents the main differences between GP regression with and without derivative observations and how to use them with GPstuff. The demo is divided in two parts: in the first part the GP regression is done without derivative observations, and in the second with them. Here we present the lines of the second part, because the first part is almost identical with just a few differences. First we create the artificial data. Notice how the observation vector is defined differently for GP models with and without derivative observations. With derivative observations the obse…
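A minimal sketch of the model construction described above, assuming a 1D input x, noisy function values y and noisy derivative observations dy; the variable names and parameter values are illustrative.

  % stack function values and partial derivative observations
  y_tilde = [y; dy];
  lik  = lik_gaussian('sigma2', 0.1^2);
  gpcf = gpcf_sexp('lengthScale', 0.5, 'magnSigma2', 1);
  % turn on derivative observations in the GP structure
  gp = gp_set('lik', lik, 'cf', gpcf, 'derivobs', 'on');
  gp = gp_optim(gp, x, y_tilde, 'opt', opt);
  [Ef, Varf] = gp_pred(gp, x, y_tilde, xt);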
26. …comparison. Usually, since future observations are not yet available, we need to estimate the expected utility by taking the expectation over the future data distribution,

  ū = E_{(x_{n+1}, y_{n+1})}[ u(y_{n+1}, x_{n+1}, D, M) ].   (5.2)

There are several methods for estimating (5.2). GPstuff provides two commonly used approaches: cross-validation and the deviance information criterion.

5.3 Cross-validation
Cross-validation (CV) is an approach to compute a predictive performance estimate by re-using the observed data. As the distribution of (x_{n+1}, y_{n+1}) is unknown, we assume that it can be reasonably well approximated using the training data (x_i, y_i), i = 1, 2, …, n. To avoid the double conditioning on the training data, and to simulate the fact that the future observations are not in the training data, the i-th observation (x_i, y_i) in the training data is left out, and then the predictive distribution for y_i is computed with a model that is fitted to all of the observations except (x_i, y_i). By repeating this for every point in the training data, we get a collection of leave-one-out cross-validation (LOO-CV) predictive densities

  p(y_i | x_i, D^(\i), M),  i = 1, 2, …, n,   (5.3)

where D^(\i) denotes all the elements of D except (x_i, y_i). To get the expected utility estimate, these predictive densities are compared to the actual y_i's using the utility u, and the expectation is taken over i,

  ū_LOO = E_i[ u(y_i, x_i, D^(\i), M) ].   (5.4)

The right-hand-side terms are conditioned on n - 1 data points, making the estimate almost unbiased.
27. …density of ω is given by

  p_ω(ω) = |J| p_θ(f^{-1}(ω)),   (E.1)

where J is the Jacobian of the transformation θ = f^{-1}(ω). The parameter transformations are discussed shortly, for example, in Gelman et al. (2004, p. 24). Due to the log transformation ω = log(θ), the probability densities p_θ are changed to the densities

  p_ω(ω) = |J| p_θ(exp(ω)) = |J| p_θ(θ),   (E.2)

where the Jacobian is J = d exp(ω)/dω = exp(ω) = θ. Now, given the Gaussian observation model (see Section 2.1.2), the posterior of ω can be written as

  p_ω(ω | D) ∝ p(y | X, θ) p(θ) |J|,   (E.3)

which leads to the energy function E(ω) = -log p(y | X, θ) - log p(θ) - log θ = E(θ) - log θ, where the absolute value signs are not shown explicitly around θ because it is strictly positive. Thus the log transformation just adds the term -log θ to the energy function. The inference on ω also requires the gradients of the energy function E(ω). These can be obtained easily with the chain rule,

  ∂E(ω)/∂ω = (∂E(θ)/∂θ) J - (1/J) ∂J/∂ω.   (E.4)

Here we have used the fact that the derivative of θ with respect to ω is the same as the Jacobian, J = ∂θ/∂ω. Now, in the case of the log transformation the Jacobian can be replaced by θ, and the gradient gets an easy expression,

  ∂E(ω)/∂ω = (∂E(θ)/∂θ) θ - 1.   (E.5)

Bibliography
Abramowitz, M. and St…
28. …does not utilize inducing inputs but evaluates the correlations exactly. This enables the GP model to capture both the long and short length-scale phenomena. The GPstuff package is coded so that if the GP structure is defined to be CS+FIC, all the CS functions are treated outside the FIC approximation. Thus the CS+FIC model requires that there is at least one CS covariance function and one globally supported function, such as the squared exponential. If there are more than two covariance functions in the GP structure, all the globally supported functions utilize inducing inputs and all the CS functions are added to Λ.

6.3 Additive covariance functions with selected variables
In the demo demo_regression_additive we demonstrate how covariance functions can be modified so that they are functions of only a subset of the inputs. We consider modelling an artificial 2D regression data set with additive covariance functions, where the individual covariance functions use only either the first or the second input variable. That is, the covariance is k_1(x_1, x'_1 | θ_1) + k_2(x_2, x'_2 | θ_2), where the covariance functions are of the type k_i(x_i, x'_i | θ_i): R x R -> R instead of k(x, x' | θ): R² x R² -> R, which has been the usual case in the previous demos. Remember the notation x = [x_1, x_2]ᵀ and x' = [x'_1, x'_2]ᵀ. Also solutions from a covariance function that uses both input variables are shown for comparison. In the regression we assume a Gaussian no…
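A minimal sketch of how such additive single-input covariance functions might be constructed; the 'selectedVariables' option is used here under the assumption that it restricts a covariance function to the named input dimensions, and all parameter values are illustrative.

  % additive model: one covariance function per input dimension
  gpcf1 = gpcf_sexp('selectedVariables', 1, 'lengthScale', 1, 'magnSigma2', 1);
  gpcf2 = gpcf_sexp('selectedVariables', 2, 'lengthScale', 1, 'magnSigma2', 1);
  lik   = lik_gaussian('sigma2', 0.1);
  gp    = gp_set('lik', lik, 'cf', {gpcf1, gpcf2});
  gp    = gp_optim(gp, x, y, 'opt', opt);
  % contribution of the first additive component only
  [Ef1, Varf1] = gp_pred(gp, x, y, xt, 'predcf', 1);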
29. …from spatial epidemiology. The log Gaussian Cox process, with the same techniques, is implemented in the function lgcp for one- or two-dimensional input data. The usage of the function is demonstrated in demo_lgcp. This demo analyzes two data sets. The first one is a one-dimensional case with the coal mine disaster data from the R distribution. The data contain the dates of 191 coal mine explosions that killed ten or more men in Britain between 15 March 1851 and 22 March 1962. The analysis is conducted using expectation propagation and CCD integration over the parameters, and the results are shown in Figure 7.3. The second data set is the redwood data from the R distribution; the data contain 195 locations of redwood trees in a two-dimensional lattice. The smoothed intensity surface is shown in Figure 7.3.

[Figure 7.3: Two intensity surfaces estimated with a log Gaussian Cox process. (a) Coal mine disasters (intensity against year, 1860-1960); (b) redwood data. The figures are from demo_lgcp, where the aim is to study an underlying intensity surface of a point process: on the left a temporal and on the right a spatial point process.]

7.4 Binomial observation model
In the demo demo_binomial1 we show how the binomial likelihood is used in the GPstuff toolbox. The inference is done in this example with the Laplace approximation an…
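A minimal sketch of the kind of call made in that demo, assuming the disaster dates are collected in a vector called dates; the output argument names and the exact call signature are hypothetical placeholders and should be checked from the lgcp help text.

  % log Gaussian Cox process intensity estimate for the 1D point pattern;
  % lgcp builds the GP model internally and returns the estimated intensity
  % over a discretized grid (output names here are placeholders)
  [intensity_mean, grid_points] = lgcp(dates);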
30. …full GP. The Laplace approximation, EP and the integration methods can be used with the same commands as with the full GP. This is demonstrated in demo_spatial1.

Chapter 5  Model assessment and comparison
There are various means to assess the goodness of the model and its predictive performance, and GPstuff provides built-in functionality for many common test statistics. In this chapter we briefly discuss model comparison and assessment in general and introduce a few basic methods that can be conducted routinely with GPstuff's tools.

5.1 Marginal likelihood
The marginal likelihood is often used for model selection (see, e.g., Kass and Raftery, 1995). It corresponds to the ML-II, or with model priors to the MAP-II, estimate in the model space: selecting the model with the highest marginal likelihood or highest marginal posterior probability. Its use in model selection is as justified as using a MAP point estimate for the parameters. It works well if the posterior is concentrated on a single model, that is, if a single model produces similar predictions as the Bayesian model average obtained by integrating over the model space. In GPstuff, if the MAP estimate and the integration approximation (IA) are almost the same, the marginal likelihood can be used as a quick estimate to compare models, but we recommend using cross-validation for more thorough model assessment and selection.

5.2 Predictive performance estimates
In prediction problems it is natural to a…
31. …in the model. That is, we sample both the parameters and the latent variables and estimate the needed posterior statistics by sample estimates or by histograms (Neal, 1997; Diggle et al., 1998). The full MCMC is performed by alternate sampling from the conditional posteriors p(f | θ, D) and p(θ | D, f). Sampling both the parameters and the latent variables is usually awfully slow, since there is a strong correlation between them (Vanhatalo and Vehtari, 2007; Vanhatalo et al., 2010). Sampling from the approximate marginal p(θ | D) is a much easier task, since the parameter space is smaller. The parameters can be sampled from their marginal posterior, or its approximation, either with HMC, slice sampling (SLS) (Neal, 2003), or via importance sampling (Geweke, 1989). In importance sampling we use a normal or Student-t proposal distribution g with mean and covariance P obtained from the negative Hessian of the log posterior, and approximate the integral with

  p(f | D) ≈ Σ_{i=1}^{M} w_i p(f | D, θ_i) / Σ_{i=1}^{M} w_i,   (2.26)

where w_i = p(θ_i | D) / g(θ_i) are the importance weights. Importance sampling is adequate only if the importance weights do not vary substantially. Thus the goodness of the Monte Carlo integration can be monitored using the importance weights. The worst scenario occurs when the importance weights are small with high probability and with small probability get very large values, that is, when the tails of p(θ | D) are much wider than those…
32. …is the process of updating our prior knowledge based on new observations; in other words, it is the process of reducing the epistemic uncertainty. The cornerstone of Bayesian inference is Bayes' theorem, which defines the conditional probability of the parameters after observing the data,

  p(θ | D, M) = p(D | θ, M) p(θ | M) / p(D | M).   (1.1)

This is called the posterior distribution, and it contains all the information about the parameters θ conveyed from the data D by the model. The normalization constant

  p(D | M) = ∫ p(D | θ, M) p(θ | M) dθ   (1.2)

is equal to the conditional probability that the data came from the model M, given our model assumptions. It is also called the marginal likelihood for the model. The model M stands for all the hypotheses and assumptions that are made about the phenomenon. It embodies the functional forms of the observation model and the prior, which are always tied together, as well as our subjective assumptions used to define these mathematical abstractions. Because everything is conditioned on M, it is a redundant symbol and as such is omitted from here on. Usually we are not able to define the correct model, and most of the time we have only a limited ability to encode our prior beliefs in the mathematical formulation. Still, many models turn out to be useful in practice. The true power of the Bayesian approach comes from the possibility to construct and analyze hierarchical models. In hierarchical…
33. …is the rank of H. Its derivative with respect to a covariance function parameter can also be evaluated analytically; the derivation proceeds through ∂K_y/∂θ and the matrices P and A, together with the vector G = Hᵀ A^{-1} H K_y^{-1} y, and uses the fact that the matrices K_y and A are symmetric. The resulting expression (8.5) can be simplified a little further because yᵀ P G + Gᵀ P y = 2 yᵀ P G.

8.1.1 Mean functions in GPstuff (demo_regression_meanf)
Here we demonstrate with demo_regression_meanf how to use mean functions with GPstuff. GP regression is done for artificial one-dimensional data with a full GP model and a Gaussian likelihood. Only the lines of code that are relevant in the mean function context are presented; otherwise this demonstration follows closely demo_regression1. After creating the data and initializing the covariance and likelihood functions, we are ready to initialize the GP structure. The basis functions for the mean are given in a cell array of mean function structures (note the similarity to the use of covariance function structures).

  % Create covariance and likelihood structures
  gpcf = gpcf_sexp('lengthScale', 0.5, 'magnSigma2', 5);
  lik  = lik_gaussian('sigma2', 0.4^2);

  % Initialize base functions for GP's mean function
  gpmf1 = gpmf_constant('prior_mean', 3, 'prior_cov', 1);
  gpmf2 = gpmf_linear('prior_mean', 3, 'prior_cov', 1);
  gpmf3 = gpmf_squared('prior_mean', 3, …
34. …'magnSigma2_prior', pm);
  gp = gp_set('type', 'FULL', 'lik', lik, 'cf', gpcf);

Here lik_gaussian initializes the likelihood function corresponding to the Gaussian observation model (lik_gaussian provides code for both the likelihood function and the observation model, but for naming simplicity the lik prefix is used) and its parameter values. The function gpcf_sexp initializes the covariance function and its parameter values. lik_gaussian returns the structure lik and gpcf_sexp returns gpcf, which contain all the information needed in the evaluations (function handles, parameter values, etc.). The next five lines create the prior structures for the parameters of the likelihood and the covariance function, which are then set into the likelihood and covariance function structures. The prior for the noise variance is the usual uninformative log-uniform prior. The priors for the length-scale and the magnitude are Student-t distributions, which work as weakly informative priors (Gelman, 2006); note that the prior is set for the square root of magnSigma2, i.e. the magnitude. The last line creates the GP structure by giving it the likelihood and covariance function and setting the type to 'FULL', which means that the model is a regular GP without sparse approximations (see Chapter 4 for other types). Type 'FULL' is the default and could be left out in this case.

In the GPstuff toolbox the Gaussian process structure is the fundamental unit around which all the other blocks…
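For completeness, a minimal sketch of the construction that the paragraph above describes, roughly as it might appear in demo_regression1; the exact prior parameter values are illustrative assumptions.

  % likelihood, covariance function and weakly informative priors
  lik  = lik_gaussian('sigma2', 0.2^2);
  gpcf = gpcf_sexp('lengthScale', [1.1 1.2], 'magnSigma2', 0.2^2);
  pn   = prior_logunif();                % log-uniform prior for the noise variance
  lik  = lik_gaussian(lik, 'sigma2_prior', pn);
  pl   = prior_t();                      % Student-t prior for the length-scales
  pm   = prior_sqrtt('s2', 0.3);         % Student-t prior on the magnitude (sqrt scale)
  gpcf = gpcf_sexp(gpcf, 'lengthScale_prior', pl, 'magnSigma2_prior', pm);
  gp   = gp_set('type', 'FULL', 'lik', lik, 'cf', gpcf);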
35. …of the model are collected. This is illustrated in Figure 3.1. In the gp structure, type defines the type of the model, lik defines the likelihood structure, cf contains the covariance function structure(s), infer_params defines which parameters are inferred (covariance, likelihood, etc.; this will be discussed in detail in Chapter 4), and jitterSigma2 is a small constant which is added to the diagonal of the covariance matrix to make it numerically more stable. If there were more than one covariance function, they would be handled additively (see Section 6.2 for details). The likelihood and covariance structures are similar to the GP structure. The likelihood structure contains the type of the likelihood, its parameters (here sigma2), the prior structure p, and a structure fh holding function handles for internal use. The covariance structure contains the type of the covariance, its parameters (here lengthScale and magnSigma2), the prior structure p, and a structure fh holding function handles for internal use.

Using the constructed GP structure we can evaluate basic summaries such as covariance matrices, make predictions with the present parameter values, etc. For example, the covariance matrices K_{f,f} and C = K_{f,f} + σ²I for three two-dimensional input vectors can be evaluated as follows:

  example_x = [-1 -1; 0 0; 1 1];
  [K, C] = gp_trcov(gp, example_x)

  K =
      0.0400    0.0054    0.0000
      0.0054    0.0400    0.0054
      0.0000    0.0054    0.0400
36. …the GP structure, and we can construct the model similarly to the full GP models:

  lik  = lik_gaussian('sigma2', 0.2^2);
  gpcf = gpcf_sexp('lengthScale', [1 1], 'magnSigma2', 0.2^2);

[Figure 4.2: The posterior predictive mean of the latent function in the demo_sparseApprox data set obtained with (a) FIC, (b) PIC, (c) variational and (d) DTC/SOR sparse approximations. The red crosses show the optimized inducing inputs, and the block areas for PIC are colored underneath the latent surface.]

  [u1, u2] = meshgrid(linspace(-1.8, 1.8, 6), linspace(-1.8, 1.8, 6));
  X_u = [u1(:) u2(:)];
  gp_fic = gp_set('type', 'FIC', 'lik', lik, 'cf', gpcf, ...
                  'X_u', X_u, 'jitterSigma2', 1e-4);

The difference is that we have to define the type of the sparse approximation (here 'FIC') and set the inducing inputs X_u in the GP structure. The posterior predictive mean and the inducing inputs are shown in Figure 4.2. Since the inducing inputs are considered extra parameters common to all of the covariance functions (there may be more than one covariance function in additive models), they are set in the GP structure instead of the covariance function structure. If we want to optimize the inducing inputs alongside the parameters, we need to…
37. …use gp_set to set the option infer_params to include 'inducing'. Sometimes, for example in spatial problems, it is better to fix the inducing inputs (see Vanhatalo et al., 2010), or it may be more efficient to optimize the parameters and the inducing inputs separately, so that we iterate the separate optimization steps until convergence.

PIC sparse approximation in GPstuff
In PIC, in addition to defining the inducing inputs, we need to appoint every data point into a block. The block structure is common to all covariance functions, similarly to the inducing inputs, for which reason the block information is stored in the GP structure. With this data set we divide the two-dimensional input space into 16 equally sized square blocks and appoint the training data into these according to the input coordinates. This and the initialization of the GP structure are done as follows:

  % Initialize the inducing inputs in a regular grid over the input space
  [u1, u2] = meshgrid(linspace(-1.8, 1.8, 6), linspace(-1.8, 1.8, 6));
  X_u = [u1(:) u2(:)];

  % Initialize test points
  [xt1, xt2] = meshgrid(-1.8:0.1:1.8, -1.8:0.1:1.8);
  xt = [xt1(:) xt2(:)];

  % Set the data points into clusters. Here we construct two cell arrays:
  % trindex contains the block index vectors for the training data, that is,
  % x(trindex{i},:) and y(trindex{i},:) belong to the i-th block;
  % tstindex contains the block index vectors for the test data, that is,
  % the test inputs xt(tstindex{i},:) …
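As a sketch of how the block indices could be formed and the PIC structure created, shown here with the indices built by hand over a 4x4 grid of blocks; the 'tr_index' option name for passing the training blocks to gp_set is an assumption to check against the gp_set help (the helper set_pic in the misc folder provides a ready-made alternative).

  % appoint training and test points into 4x4 = 16 square blocks by coordinates
  b = [-inf -0.9 0 0.9 inf];                    % block edges in each dimension
  trindex = {}; tstindex = {};
  for i = 1:4
    for j = 1:4
      trindex{end+1}  = find(x(:,1) >= b(i) & x(:,1) < b(i+1) & ...
                             x(:,2) >= b(j) & x(:,2) < b(j+1));
      tstindex{end+1} = find(xt(:,1) >= b(i) & xt(:,1) < b(i+1) & ...
                             xt(:,2) >= b(j) & xt(:,2) < b(j+1));
    end
  end
  gp_pic = gp_set('type', 'PIC', 'lik', lik, 'cf', gpcf, 'X_u', X_u, ...
                  'tr_index', trindex, 'jitterSigma2', 1e-4);
  [Ef_pic, Varf_pic] = gp_pred(gp_pic, x, y, xt, 'tstind', tstindex);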
38. …value.
  PRIOR_GAMMA        Gamma prior structure
  PRIOR_INVGAMMA     Inverse-gamma prior structure
  PRIOR_LAPLACE      Laplace (double exponential) prior structure
  PRIOR_LOGLOGUNIF   Uniform prior structure for the log-log parameter
  PRIOR_LOGUNIF      Uniform prior structure for the logarithm of the parameter
  PRIOR_GAUSSIAN     Gaussian prior structure
  PRIOR_LOGGAUSSIAN  Log-Gaussian prior structure
  PRIOR_SINVCHI2     Scaled inverse-chi-square prior structure
  PRIOR_T            Student-t prior structure
  PRIOR_SQRTT        Student-t prior structure for the square root of the parameter
  PRIOR_UNIF         Uniform prior structure
  PRIOR_SQRTUNIF     Uniform prior structure for the square root of the parameter

Probability density functions:
  BETA_LPDF  Beta log probability density function (lpdf)
  BETA_PDF   Beta probability density function (pdf)
  DIR_LPDF   Log probability density function of the uniform Dirichlet distribution
  DIR_PDF    Prob…
  GAM_CDF, GAM_LPDF, GAM_PDF, GEO_LPDF, INVGAM_LPDF, INVGAM_PDF, LAPLACE_LPDF, LAPLACE_PDF, LOGN_LPDF, LOGT_LPDF, MNORM_LPDF, MNORM_PDF, NORM_LPDF, NORM_PDF, POISS_LPDF, POISS_PDF, SINVCHI2_LPDF, SINVCHI2_PDF, T_LPDF, T_PDF

Random number generators:
  CATRAND, DIRRAND, EXPRAND, GAMRAND, INTRAND, INVGAMRAND, INVGAMRAND1, INVWISHRND, NORMLTRAND, NORMRTRAND, NORMTRAND, NORMTZRAND, SINVCHI2RAND, TRAND, UNIFRAND, WISHRND
  Others: KERNELP, HAMMERSLEY

Monte Carlo functions:
  BBMEAN, GIBBS, HMC2, HMC2_OPT, HMEAN, METROP2, METROP2_OPT, RESAMPDET, RESAMPRES, RESAMPSIM, RESAMPSTR, SLS
39. …we can form a covariance function (Rasmussen and Williams, 2006) which equals a dot-product covariance function,

  k(x_i, x_j) = σ² x_iᵀ x_j.   (B.9)

Piecewise polynomial functions (gpcf_ppcsX)
The piecewise polynomial functions are the only compactly supported covariance functions (see Chapter 4) in GPstuff. There are four of them, with the following forms:

  k_pp0(x_i, x_j) = σ² (1 - r)_+^j,   (B.10)
  k_pp1(x_i, x_j) = σ² (1 - r)_+^(j+1) [ (j + 1) r + 1 ],   (B.11)
  k_pp2(x_i, x_j) = (σ²/3) (1 - r)_+^(j+2) [ (j² + 4j + 3) r² + (3j + 6) r + 3 ],   (B.12)
  k_pp3(x_i, x_j) = (σ²/15) (1 - r)_+^(j+3) [ (j³ + 9j² + 23j + 15) r³ + (6j² + 36j + 45) r² + (15j + 45) r + 15 ],   (B.13)

where j = ⌊d/2⌋ + q + 1. These functions correspond to processes that are 2q times mean square differentiable at zero and are positive definite up to the dimension d (Wendland, 2005). The covariance functions are named gpcf_ppcs0, gpcf_ppcs1, gpcf_ppcs2 and gpcf_ppcs3.

Rational quadratic covariance function (gpcf_rq)
The rational quadratic (RQ) covariance function (Rasmussen and Williams, 2006),

  k_rq(x_i, x_j) = σ² [ 1 + Σ_{k=1}^{d} (x_{i,k} - x_{j,k})² / (2α l_k²) ]^(-α),   (B.14)

can be seen as a scale mixture of squared exponential covariance functions with different length-scales. The smaller the parameter α > 0 is, the more diffuse the length-scales of the mixing components are. The parameter l_k > 0 characterizes the typical length-scale of the individual components in input dimension k.

Periodic covariance function (gpcf_periodic)
Many real-
40. world systems exhibit periodic phenomena, which can be modelled with a periodic covariance function. One possible construction (Rasmussen and Williams, 2006) is

  k(x_i, x_j) = exp( - Σ_{k=1}^{d} 2 sin²( π γ (x_{i,k} - x_{j,k}) ) / l_k² ),   (B.15)

where the parameter γ controls the inverse length of the periodicity and l_k the smoothness of the process in dimension k.

Product covariance function (gpcf_prod)
A product of two or more covariance functions, k_1(x, x') k_2(x, x'), is a valid covariance function as well. Combining covariance functions in a product form can be done with gpcf_prod, for which the user can freely specify the covariance functions to be multiplied with each other from the collection of covariance functions implemented in GPstuff.

Categorical covariance function (gpcf_cat)
The categorical covariance function gpcf_cat returns correlation 1 if the input values are equal and 0 otherwise,

  k(x_i, x_j) = 1 if x_i - x_j = 0, and 0 otherwise.   (B.16)

The categorical covariance function can be combined with other covariance functions using gpcf_prod, for example to produce hierarchical models.

Appendix C  Observation models
Here we summarize all the observation models in GPstuff. Most of them are implemented in files lik_*, which reminds us that at the inference step they are considered likelihood functions.

Gaussian (lik_gaussian)
The i.i.d. Gaussian noise with variance σ² gives the observation model y = f + ε, ε ~ N(0, σ²…
41. …

  C =
      0.0800    0.0054    0.0000
      0.0054    0.0800    0.0054
      0.0000    0.0054    0.0800

Here K is K_{f,f} and C is K_{f,f} + σ²_noise I.

3.1.2 MAP estimate for the hyperparameters
A MAP estimate for the parameters can be obtained with the function gp_optim, which works as a wrapper for the usual gradient-based optimization functions. It can be used as follows:

  opt = optimset('TolFun', 1e-3, 'TolX', 1e-3, 'Display', 'iter');
  gp  = gp_optim(gp, x, y, 'optimf', @fminscg, 'opt', opt);

gp_optim takes a GP structure, the training input x and training target y (which are defined in demo_regression1) and options, and returns a GP structure with optimized hyperparameter values. By default gp_optim uses the fminscg function, but it can also use, for example, fminlbfgs or fminunc. Optimization options are set with the optimset function. All the estimated parameter values can be easily checked using the function gp_pak, which packs all the parameter values from all the covariance function structures into one vector, usually using a log transformation (other transformations are also possible). The second output argument of gp_pak lists the labels for the parameters:

  [w, s] = gp_pak(gp);
  disp(exp(w)), disp(s)

It is also possible to set the parameter vector of the model to desired values using gp_unpak:

  gp = gp_unpak(gp, w);

It is possible to control which parameters are optimized in two ways: 1) if any parameter is given an empty prior, i…
42.   GPLA_E      Construct Laplace approximation and return negative marginal log posterior estimate
  GPLA_G      Evaluate gradient of Laplace approximation's marginal log posterior estimate
  GPLA_PRED   Predictions with Gaussian process Laplace approximation
  GP_MC       Markov chain sampling for Gaussian process models
  GPMC_PRED   Predictions with Gaussian process MCMC approximation
  GPMC_PREDS  Conditional predictions with Gaussian process MCMC approximation
  GP_IA       Integration approximation with grid, Monte Carlo or CCD integration
  GP_IA_PRED  Prediction with Gaussian process GP_IA solution
  LGCP        Log Gaussian Cox process intensity estimate for 1D and 2D data

Model assessment and comparison:
  GP_DIC      The DIC statistics and effective number of parameters in a GP model
  GP_KFCV     K-fold cross-validation for a GP model
  GP_LOOE     Evaluate the leave-one-out predictive density in case of Gaussian observation model
  GP_LOOG     Evaluate the gradient of the leave-one-out predictive density GP_LOOE in case of Gaussian observation model
  GP_EPLOOE   Evaluate the leave-one-out predictive density in case of non-Gaussian observation model and EP
  EP_LOOPRED  Leave-one-out predictions with Gaussian process EP approximation
  GP_PEFF     The effective number of parameters in a GP model with focus…

Metrics:
  METRIC_EUCLIDEAN

Misc:
  LDLROWMODIFY, LDLROWUPDATE, SPINV, SCALED_HMC, SCALED_MH, GP_INSTALL

Demonstration programs:
  DEMO_BINOMIAL1
43. $p(\mathbf{y}\,|\,X,\theta,\sigma^2) = \mathrm{N}(\mathbf{y}\,|\,\mathbf{0},\, K_{\mathrm{f,f}} + \sigma^2\mathbf{I}).$   (2.8)

Setting this in the denominator of the equation (2.1) gives a Gaussian distribution also for the conditional posterior of the latent variables,

$\mathbf{f}\,|\,D,\theta,\sigma^2 \sim \mathrm{N}\big(K_{\mathrm{f,f}}(K_{\mathrm{f,f}}+\sigma^2\mathbf{I})^{-1}\mathbf{y},\; K_{\mathrm{f,f}} - K_{\mathrm{f,f}}(K_{\mathrm{f,f}}+\sigma^2\mathbf{I})^{-1}K_{\mathrm{f,f}}\big).$   (2.9)

Since the conditional posterior of f is Gaussian, the posterior process, or distribution $p(\tilde{f}\,|\,D,\theta)$, is also Gaussian. By placing the mean and covariance from (2.9) in the equations (2.5) and (2.7) we obtain the predictive distribution

$\tilde{f}\,|\,D,\theta,\sigma^2 \sim \mathcal{GP}\big(m_p(\tilde{\mathbf{x}}),\, k_p(\tilde{\mathbf{x}},\tilde{\mathbf{x}}')\big),$   (2.10)

where the mean is $m_p(\tilde{\mathbf{x}}) = k(\tilde{\mathbf{x}},X)(K_{\mathrm{f,f}}+\sigma^2\mathbf{I})^{-1}\mathbf{y}$ and the covariance $k_p(\tilde{\mathbf{x}},\tilde{\mathbf{x}}') = k(\tilde{\mathbf{x}},\tilde{\mathbf{x}}') - k(\tilde{\mathbf{x}},X)(K_{\mathrm{f,f}}+\sigma^2\mathbf{I})^{-1}k(X,\tilde{\mathbf{x}}')$. The predictive distribution for new observations $\tilde{\mathbf{y}}$ can be obtained by integrating $p(\tilde{\mathbf{y}}\,|\,D,\theta,\sigma^2) = \int p(\tilde{\mathbf{y}}\,|\,\tilde{\mathbf{f}},\sigma^2)\,p(\tilde{\mathbf{f}}\,|\,D,\theta,\sigma^2)\,d\tilde{\mathbf{f}}$. The result is again Gaussian with mean $\mathrm{E}[\tilde{\mathbf{f}}]$ and covariance $\mathrm{Cov}[\tilde{\mathbf{f}}] + \sigma^2\mathbf{I}$.

2.1.3 Laplace approximation

In the Laplace approximation the mean is approximated by the posterior mode of f and the covariance by the curvature of the log posterior at the mode. The approximation is constructed from the second order Taylor expansion of $\log p(\mathbf{f}\,|\,\mathbf{y},\theta)$ around the mode $\hat{\mathbf{f}}$, which gives a Gaussian approximation to the conditional posterior,

$p(\mathbf{f}\,|\,D,\theta,\phi) \approx q(\mathbf{f}\,|\,D,\theta,\phi) = \mathrm{N}(\mathbf{f}\,|\,\hat{\mathbf{f}},\,\Sigma),$   (2.11)

where $\hat{\mathbf{f}} = \arg\max_{\mathbf{f}} p(\mathbf{f}\,|\,D,\theta,\phi)$ and $\Sigma^{-1}$ is the Hessian of the negative log conditional posterior at the mode (Gelman et al., 2004; Rasmussen and Williams, 2006),
44. Aalto University School of Science and Technology Department of Biomedical Engineering and Computational Science BECS Espoo Finland July 6 2011 Gaussian processes for Bayesian analysis User guide for Matlab toolbox GPstuff Version 3 1 Jarno Vanhatalo Jaakko Riihimaki Jouni Hartikainen and Aki Vehtari If you use GPstuff please use reference Jarno Vanhatalo Jaakko Riihimaki Jouni Hartikainen and Aki Vehtari 2011 Bayesian Modeling with Gaussian Processes using the MATLAB Toolbox GP stuff submitted This manual is subject to some modifications contact Aki Vehtari tkk fi ii iii Revision history July 2010 First printing For GPstuff 2 0 September 2010 New functionalities mean functions for For GPstuff 2 1 FULL GP derivative observations for squared exponential covariance in FULL GP April 2011 New syntax For GPstuff 3 1 iv Contents 2 1 2 2 2 3 3 1 3 2 4 1 4 2 4 3 1 Introduction 1 1 Bayesian modeling ooa a 1 2 Gaussian process models o oo 1 2 1 Gaussian process prior o oo 2 Inference and prediction Conditional posterior of the latent function 2 1 1 The posterior mean and covariance 2 1 2 Gaussian observation model ooa a 2 1 3 Laplace approximation 2 1 4 Expectation propagation algorithm 2 1 5 Markov chain Monte Carlo oaa Marginal likelihood given parameters Margi
45. G O and Sk ld M 2006 Robust Markov chain Monte Carlo methods for spatial generalized linear mixed models Journal of Computa tional and Graphical Statistics 15 1 17 Cressie N A C 1993 Statistics for Spatial Data John Wiley amp Sons Inc Csat L and Opper M 2002 Sparse online Gaussian processes Neural Computa tion 14 3 641 669 Davis T A 2006 Direct Methods for Sparse Linear Systems SIAM 85 86 BIBLIOGRAPHY Dietterich T G 1998 Approximate statistical tests for comparing supervised classi fication learning algorithms Neural Computation 10 7 1895 1924 Diggle P J and Ribeiro P J 2007 Model based Geostatistics Springer Sci ence Business Media LLC Diggle P J Tawn J A and Moyeed R A 1998 Model based geostatistics Jour nal of the Royal Statistical Society Series C Applied Statistics 47 3 299 350 Duane S Kennedy A Pendleton B J and Roweth D 1987 Hybrid Monte Carlo Physics Letters B 195 2 216 222 Elliot P Wakefield J Best N and Briggs D editors 2001 Spatial Epidemiology Methods and Applications Oxford University Press Finkenstadt B Held L and Isham V 2007 Statistical Methods for Spatio Temporal Systems Chapman amp Hall CRC Furrer R Genton M G and Nychka D 2006 Covariance tapering for interpo lation of large spatial datasets Journal of Computational and Graphical Statistics 15 3 502 523
46. 7.2.1 Disease mapping with Poisson likelihood (demo_spatial1)

The model constructed in this section is the following:

$\mathbf{y}\,|\,\mathbf{f},\mathbf{e} \sim \prod_{i=1}^{n}\text{Poisson}\big(y_i\,|\,\exp(f_i)\,e_i\big)$   (7.4)
$f(\mathbf{x})\,|\,\theta \sim \mathcal{GP}\big(0,\,k(\mathbf{x},\mathbf{x}'\,|\,\theta)\big)$   (7.5)
$\theta \sim \text{half-Student-}t(\nu, A).$   (7.6)

The vector y collects the numbers of deaths for each area. The coordinates of the areas are in the input vectors $\mathbf{x}_i$ and $\theta$ contains the covariance function parameters. The coordinates are defined from the lower left corner of the area in 20 km steps. The model is constructed with the following lines:

    gpcf1 = gpcf_matern32('lengthScale', 5, 'magnSigma2', 0.05);
    pl = prior_t('s2', 10);
    pm = prior_t();
    gpcf1 = gpcf_matern32(gpcf1, 'lengthScale_prior', pl, 'magnSigma2_prior', pm);
    lik = lik_poisson();
    gp = gp_set('type', 'FIC', 'lik', lik, 'cf', gpcf1, 'X_u', Xu, ...
                'jitterSigma2', 1e-4, 'infer_params', 'covariance');
    gp = gp_set(gp, 'latent_method', 'Laplace');

The FIC sparse approximation is used since the data set is rather large and inferring a full GP would be too slow. The inducing inputs Xu are set to a regular grid (not shown here) in the two-dimensional lattice, and they will be considered fixed. The extra parameter z tells the Laplace algorithm that now there is an input which affects only the likelihood. This input is stored in ye and it is the vector of expected numbers of deaths, $\mathbf{e} = [e_1,\ldots,e_n]$.
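The regular grid of inducing inputs is not shown in the demo excerpt; a minimal sketch of how such a grid could be formed over the range of the area coordinates follows. The grid size (10 x 10) and variable names are illustrative assumptions, not the values used in demo_spatial1.

    % Sketch: place inducing inputs on a regular 2D grid over the data range.
    [u1, u2] = meshgrid(linspace(min(x(:,1)), max(x(:,1)), 10), ...
                        linspace(min(x(:,2)), max(x(:,2)), 10));
    Xu = [u1(:) u2(:)];   % 100 inducing inputs, kept fixed during inference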
47. $\sim \mathrm{N}(\mathbf{f},\,\sigma^2\mathbf{I}).$   (C.1)

Student-t (lik_t, lik_gaussiansmt)

The Student-t observation model implemented in lik_t is

$p(\mathbf{y}\,|\,\mathbf{f}) = \prod_{i=1}^{n}\frac{\Gamma\!\big((\nu+1)/2\big)}{\Gamma(\nu/2)\sqrt{\nu\pi}\,\sigma}\left(1 + \frac{(y_i - f_i)^2}{\nu\sigma^2}\right)^{-(\nu+1)/2},$   (C.2)

where $\nu$ is the degrees of freedom and $\sigma$ the scale parameter. The scale mixture version of the Student-t distribution is implemented in lik_gaussiansmt and it is parametrized as

$y_i\,|\,f_i,\alpha,U_i \sim \mathrm{N}(f_i,\,\alpha U_i)$   (C.3)
$U_i \sim \text{Inv-}\chi^2(\nu,\tau^2),$   (C.4)

where each observation has its own noise variance $\alpha U_i$ (Neal, 1997; Gelman et al., 2004).

Logit (lik_logit)

The logit transformation gives the probability for $y_i$ of being 1 or -1 as

$p_{\text{logit}}(y_i\,|\,f_i) = \frac{1}{1+\exp(-y_i f_i)}.$   (C.5)

Probit (lik_probit)

The probit transformation gives the probability for $y_i$ of being 1 or -1 as

$p_{\text{probit}}\big(y_i\,|\,f(\mathbf{x}_i)\big) = \Phi\big(y_i f(\mathbf{x}_i)\big) = \int_{-\infty}^{y_i f(\mathbf{x}_i)} \mathrm{N}(z\,|\,0,1)\,dz.$   (C.6)

Poisson (lik_poisson)

The Poisson observation model with expected number of cases e is

$p(\mathbf{y}\,|\,\mathbf{f},\mathbf{e}) = \prod_{i=1}^{n}\text{Poisson}\big(y_i\,|\,\exp(f_i)\,e_i\big).$   (C.7)

Negative binomial (lik_negbin)

The negative binomial is a robustified version of the Poisson distribution. It is parametrized as

$p(\mathbf{y}\,|\,\mathbf{f},\mathbf{e},r) = \prod_{i=1}^{n}\frac{\Gamma(r+y_i)}{y_i!\,\Gamma(r)}\left(\frac{r}{r+\mu_i}\right)^{r}\left(\frac{\mu_i}{r+\mu_i}\right)^{y_i},$

where $\mu_i = e_i\exp(f_i)$, r is the dispersion parameter governing the variance, $e_i$ is the expected number of cases and $y_i$ is a positive integer telling the observed count.

Binomial (lik_binomial)

The binomial observation model with the probability of success $p_i = \exp(f_i)/(1+\exp(f_i))$
48. [Figure 2.2: three panels, (a) grid-based, (b) Monte Carlo, (c) central composite design; horizontal axis: log length-scale.]

Figure 2.2: Illustration of the grid-based, Monte Carlo and central composite design integration. Contours show the posterior density $q(\log(\theta)\,|\,D)$ and the integration points are marked with dots. The left figure shows also the vectors z along which the points are searched in the grid integration and central composite design. The integration is conducted over $q(\log(\theta)\,|\,D)$ rather than $q(\theta\,|\,D)$ since the former is closer to Gaussian. Reproduced from Vanhatalo et al. (2010).

The number of the design points grows very moderately, and, for example, for d = 6 one needs only 45 points. The accuracy of the CCD is between the MAP estimate and the full integration with the grid search or Monte Carlo. Rue et al. (2009) report good results with this integration scheme, and it has worked well in our experiments as well. CCD tries to incorporate the posterior variance of the parameters in the inference and seems to give a good approximation in high dimensions. Since CCD is based on the assumption that the posterior of the parameters is close to Gaussian, the densities $p(\theta\,|\,D)$ at the points on the circumference should be monitored in order to detect serious discrepancies from this assumption. These densities are identical if the posterior is Gaussian
49. The iterations are continued until convergence. The predictive mean and covariance are again obtained from equations (2.5) and (2.7), analogically to the Laplace approximation. From the equations (2.9), (2.11) and (2.16) it can be seen that the Laplace and EP approximations are similar to the exact solution with the Gaussian observation model. The diagonal matrices W and $\tilde{\Sigma}^{-1}$ correspond to the noise term $\sigma^{-2}\mathbf{I}$, and thus the two approximations can be seen as Gaussian approximations to the likelihood (Nickisch and Rasmussen, 2008).

2.1.5 Markov chain Monte Carlo

The accuracy of the approximations considered this far is limited by the Gaussian form of the approximating function. An approach which gives the exact solution in the limit of infinite computation time is Monte Carlo integration (Robert and Casella, 2004). This is based on sampling from the unnormalized posterior $p(\mathbf{f}\,|\,D,\theta,\phi) \propto p(\mathbf{y}\,|\,\mathbf{f},\phi)\,p(\mathbf{f}\,|\,X,\theta)$ and using the samples to represent the posterior distribution. The posterior marginals can be visualized with histograms and posterior statistics approximated with sample means. For example, the posterior expectation of $\tilde{f}$ using M samples is

$\mathrm{E}[\tilde{f}\,|\,D,\theta] \approx \frac{1}{M}\sum_{i=1}^{M}\mathrm{E}_{p(\tilde{f}\,|\,\mathbf{f}^{(i)},\theta)}\big[\tilde{f}\big],$   (2.17)

[Figure 2.1: two panels, (a) disease mapping, (b) classification.] Figure 2.1: Illustration of the Laplace approximation (solid line), EP (d
50. $\Sigma^{-1} = -\nabla\nabla\log p(\mathbf{f}\,|\,D,\theta,\phi)\big|_{\mathbf{f}=\hat{\mathbf{f}}} = K_{\mathrm{f,f}}^{-1} + W,$   (2.12)

where W is a diagonal matrix with entries $W_{ii} = -\nabla_{f_i}\nabla_{f_i}\log p(y_i\,|\,f_i,\phi)\big|_{f_i=\hat{f}_i}$. Note that actually the Taylor expansion is made for the unnormalized posterior and the normalization term is computed from the Gaussian approximation.

Here the approximation scheme is called the Laplace method following Williams and Barber (1998), but essentially the same approximation is named Gaussian approximation by Rue et al. (2009) in their integrated nested Laplace approximation (INLA) scheme for Gaussian Markov random field models.

The posterior mean of $f(\tilde{\mathbf{x}})$ can be approximated from the equation (2.5) by replacing the posterior mean $\mathrm{E}[\mathbf{f}\,|\,D,\theta,\phi]$ by $\hat{\mathbf{f}}$. The posterior covariance is approximated similarly by using $(K_{\mathrm{f,f}}^{-1}+W)^{-1}$ in the place of $\mathrm{Cov}[\mathbf{f}\,|\,D,\theta,\phi]$. After some rearrangements and using $K_{\mathrm{f,f}}^{-1}\hat{\mathbf{f}} = \nabla\log p(\mathbf{y}\,|\,\mathbf{f})\big|_{\mathbf{f}=\hat{\mathbf{f}}}$, the approximate posterior predictive distribution is

$\tilde{f}\,|\,D,\theta,\phi \sim \mathcal{GP}\big(m_p(\tilde{\mathbf{x}}),\,k_p(\tilde{\mathbf{x}},\tilde{\mathbf{x}}')\big),$   (2.13)

where the mean and covariance are $m_p(\tilde{\mathbf{x}}) = k(\tilde{\mathbf{x}},X)\nabla\log p(\mathbf{y}\,|\,\mathbf{f})\big|_{\mathbf{f}=\hat{\mathbf{f}}}$ and $k_p(\tilde{\mathbf{x}},\tilde{\mathbf{x}}') = k(\tilde{\mathbf{x}},\tilde{\mathbf{x}}') - k(\tilde{\mathbf{x}},X)\big(K_{\mathrm{f,f}}+W^{-1}\big)^{-1}k(X,\tilde{\mathbf{x}}')$. The approximate conditional predictive densities of new observations $\tilde{y}_i$ can now be evaluated, for example, with quadrature integration over each $\tilde{f}_i$ separately,

$p(\tilde{y}_i\,|\,D,\theta,\phi) \approx \int p(\tilde{y}_i\,|\,\tilde{f}_i,\phi)\,q(\tilde{f}_i\,|\,D,\theta,\phi)\,d\tilde{f}_i.$   (2.14)

2.1.4 Expectation propagation algorithm

The Laplace method constructs a Gaussian approximation at the posterior mode and a
51. ability density function of uniform Dirichlet distribution Cumulative of Gamma probability density function Log of Gamma probability density function lpdf Gamma probability density function pdf Geometric log probability density function lpdf Inverse Gamma log probability density function Inverse Gamma probability density function Laplace log probability density function lpdf Laplace probability density function pdf Log normal log probability density function Log probability density function lpdf for Multivariate Normal log probability density function lpdf Multivariate Normal log probability density function lpdf Normal log probability density function lpdf Normal probability density function pdf Poisson log probability density function Poisson probability density function Scaled inverse chi log probability density function Scaled inverse chi probability density function Student s T log probability density function lpdf Student s T probability density function pdf cdf lpdf log Student s T Random number generators Random matrices from categorical distribution Uniform dirichlet random vectors Random matrices from exponential distribution Random matrices from gamma distribution Random matrices from uniform integer distribution Random matrices from inverse gamma distribution Random matrices from inverse gamma distribution Random matrices from inverse Wishart distribution
52. ady made model structure gp and the training data x and y The function divides the data into k groups conducts inference separately for each of the training groups and evaluates the expected utilities with the test groups Since no optional parameters are given the inference is conducted using MAP estimate for the parameters The default division of the data is into 10 groups The expected utilities and their variance estimates are stored in the structure cvres as follows cvres mlpd_cv 0 0500 Var_lpd_cv 0 0014 mrmse_cv 0 2361 Var_rmse_cv 1 4766e 04 mabs_cv 0 1922 Var_abs_cv 8 3551le 05 gp_kfcv returns also other statistics if more information is needed and the function can be used to save the results automatically However these functionalities are not considered here Readers interested on detailed analysis should read the help text for gp_kfcv Now we will turn our attention to other inference methods than MAP estimate for the parameters Assume we have a record structure from gp_mc function with Markov chain samples of the parameters stored in it In this case we have two options how to evaluate the DIC statistics We can set the focus on the parameters or all the parameters that is parameters and latent variables The two versions of DIC and effective number of parameters are evaluated as follows rgp gp_mc gp xX yY opt DIC p eff gp_dic rgp x y focus param DIC2 p_eff2 gp_dic rgp x y fo
53. GPstuff provides a probit and logit transformation, which lead to the observation models

$p_{\text{probit}}\big(y_i\,|\,f(\mathbf{x}_i)\big) = \Phi\big(y_i f(\mathbf{x}_i)\big) = \int_{-\infty}^{y_i f(\mathbf{x}_i)} \mathrm{N}(z\,|\,0,1)\,dz$   (3.4)
$p_{\text{logit}}\big(y_i\,|\,f(\mathbf{x}_i)\big) = \frac{1}{1+\exp\!\big(-y_i f(\mathbf{x}_i)\big)}.$   (3.5)

Since the likelihood is not Gaussian, we need to use the approximate inference methods discussed in section 2.1. We will conduct the inference with the Laplace approximation, EP and full MCMC. With the two analytic approximations we optimize the parameters to their MAP estimates and use those point estimates for prediction. In the MCMC approach we sample both the parameters and the latent variables. For a detailed discussion on the differences between the MCMC, Laplace and EP approaches in the classification setting, see Kuss and Rasmussen (2005) and Nickisch and Rasmussen (2008).

3.2.1 Constructing the model

The model construction for the classification follows closely the steps presented in the previous section. The model is constructed as follows:

    lik = lik_probit();
    gpcf = gpcf_sexp('lengthScale', [0.9 0.9], 'magnSigma2', 10);
    gp = gp_set('lik', lik, 'cf', gpcf, 'jitterSigma2', 1e-9);

The above lines first initialize the likelihood function, the covariance function and the GP structure. The default uniform priors for the length-scale and magnitude are used. The model construction and the inference with the logit likelihood would be similar with lik_logit. A small jitter value is added to the diagonal of the training covariance to make certain matrix operations numerically more stable.
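The Laplace approximation step itself is not reached in this excerpt; a minimal sketch of how it would typically be run in GPstuff follows. The optimizer options opt are assumed to have been set with optimset as in the regression examples, and xt denotes the test inputs; the calling pattern mirrors the EP code shown later in section 3.2.3.

    % Sketch: Laplace approximation for the classification model above.
    gp = gp_set(gp, 'latent_method', 'Laplace');
    gp = gp_optim(gp, x, y, 'opt', opt);                  % MAP estimate of the parameters
    [Eft_la, Varft_la, Eyt_la, Varyt_la, pyt_la] = ...
        gp_pred(gp, x, y, xt, 'yt', ones(size(xt,1),1));  % predictive class probabilities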
54. multiplied by a constant $\prod_i \tilde{Z}_i$, giving

$\log Z_{\mathrm{EP}} = -\frac{1}{2}\log\big|K_{\mathrm{f,f}}+\tilde{\Sigma}\big| - \frac{1}{2}\tilde{\boldsymbol{\mu}}^{\mathsf{T}}\big(K_{\mathrm{f,f}}+\tilde{\Sigma}\big)^{-1}\tilde{\boldsymbol{\mu}} + C_{\mathrm{EP}},$   (2.22)

where $C_{\mathrm{EP}}$ collects the terms that are not explicit functions of $\theta$ (there is an implicit dependence through the iterative algorithm, though). Note that in the actual implementation the Laplace and EP marginal likelihood approximations are obtained as a by-product when computing the conditional posterior of f.

2.3 Marginalization over parameters

The previous section treated methods to evaluate, exactly (the Gaussian case) or approximately (Laplace and EP approximations), the log marginal likelihood given the parameters. In this section we describe approaches for estimating the parameters or integrating numerically over them.

2.3.1 Maximum a posteriori estimate of parameters

In the full Bayesian approach we should integrate over all unknowns. Given that we have integrated over the latent variables, it often happens that the posterior of the parameters is peaked or the predictions are insensitive to small changes in the parameter values. In such a case we can approximate the integral over $p(\theta\,|\,D)$ with the maximum a posteriori (MAP) estimate

$\hat{\theta} = \arg\max_{\theta}\big[\log p(\mathbf{y}\,|\,X,\theta) + \log p(\theta)\big].$   (2.23)

In this approximation the parameter values are given a point mass one at the posterior mode, and the latent function marginal is approximated as $p(\mathbf{f}\,|\,D) \approx p(\mathbf{f}\,|\,D,\hat{\theta})$. The log marginal likelihood, and thus also the log posterior, is differentiable with
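The gradient-based search for the mode in (2.23) uses the energy (negative log marginal posterior) and its gradient, which in GPstuff are available through gp_e and gp_g. The following is a minimal sketch of evaluating them at the current parameter values; the exact calling sequence is an assumption that should be checked against the function help texts.

    % Sketch: evaluate the negative log marginal posterior and its gradient
    % at the current (packed) parameter values.
    w = gp_pak(gp);          % parameters on the (log-)transformed scale
    e = gp_e(w, gp, x, y);   % energy = -(log marginal likelihood + log prior)
    g = gp_g(w, gp, x, y);   % its gradient with respect to w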
55. approximations The Mauna Loa CO2 data set is studied with sparse additive Gaussian processes in the demo demo_regression2 There the covariance is kse x x kipp 2 x x since periodic covariance does not work well with sparse approximations The model construction and inference is conducted similarly as in the previous section so we will not repeat it here However it is worth mentioning few things that should be noticed when running the demo PIC works rather well for this data set whereas FIC fails to recover the fast varying phenomenon The reason for this is that the inducing inputs are too sparsely located so that FIC can not reveal the short length scale phenomenon In general FIC is able to model only phenomena whose length scale is long enough compared to the average distance between adjacent inducing inputs see Vanhatalo et al 2010 for details PIC on the other hand is able to model also fast varying phenomena inside the blocks Its drawback however is that the correlation structure is discontinuous which may result in discontinuous predictions The CS FIC model corrects these deficiencies In FIC and PIC the inducing inputs are parameters of every covariance function which means that all the correlations are circulated through the inducing inputs and the shortest length scale the GP is able to model is defined by the locations of the inducing inputs The CS FIC sparse GP is build differently There the CS covariance function
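Although the excerpt breaks off here, a hedged sketch of how a CS+FIC model is typically set up in GPstuff may help make the construction concrete. The global (squared exponential) and compactly supported (piecewise polynomial) covariance structures and the inducing inputs Xu are assumed to have been created beforehand; the option names follow the gp_set calls used elsewhere in this guide.

    % Sketch: CS+FIC sparse GP with a global sexp term and a CS (ppcs2) term.
    gp = gp_set('type', 'CS+FIC', 'lik', lik_gaussian(), ...
                'cf', {gpcf_sexp_struct, gpcf_ppcs2_struct}, ...
                'X_u', Xu, 'jitterSigma2', 1e-6);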
56. as illustrated in demo_regression_hier. The above construction represents the covariance function $k(\mathbf{x},\mathbf{x}') = k_{\mathrm{exp}}(x_1,x_1')\,k_{\mathrm{matern32}}(x_2,x_2')$.   (6.6)

Chapter 7: Special observation models

In this chapter we will introduce a few more models that we are able to infer with GPstuff. These models utilize different observation models than what has been considered this far.

7.1 Robust regression with Student-t likelihood

A commonly used observation model in GP regression is the Gaussian distribution. This is convenient since the inference is analytically tractable up to the covariance function parameters. However, a known limitation of the Gaussian observation model is its non-robustness, due to which outlying observations may significantly reduce the accuracy of the inference. A formal definition of robustness is given, for example, in terms of an outlier-prone observation model. The observation model is outlier prone of order n if $p(f\,|\,y_1,\ldots,y_{n+1}) \rightarrow p(f\,|\,y_1,\ldots,y_n)$ as $y_{n+1} \rightarrow \infty$ (O'Hagan, 1979; West, 1984). That is, the effect of a single conflicting observation on the posterior becomes asymptotically negligible as the observation approaches infinity. This contrasts heavily with the Gaussian observation model, where each observation influences the posterior no matter how far it is from the others. A well-known robust observation model is the Student-t distribution,
57. ashed line and MCMC histogram for the conditional posterior of a latent variable p f D 0 in two applications On the left a disease mapping problem with Poisson observation model used in Vanhatalo et al 2010 where the Gaussian approximation works well On the right a classification problem with probit likelihood used in Vanhatalo and Vehtari 2010 where the posterior is skewed and the Gaussian approximation is quite poor where is the i th sample from the conditional posterior The problem with Monte Carlo methods is how to draw samples from arbitrary distributions The challenge can be overcome with Markov chain Monte Carlo methods Gilks et al 1996 where one constructs a Markov chain whose stationary distribution is the posterior distribution p f D 6 d and uses the Markov chain samples to obtain Monte Carlo estimates A rather efficient sampling algorithm is hybrid Monte Carlo HMC Duane et al 1987 Neal 1996 which utilizes the gradient information of the posterior distribution to di rect the sampling in interesting regions Significant improvement in mixing of the sam ple chain of the latent variables can be obtained by using the variable transformation discussed in Christensen et al 2006 Vanhatalo and Vehtari 2007 After having the posterior sample of latent variables we can sample from the posterior predictive distri bution of any set of f simply by sampling with each f one F from p t 6 6 which i
58. (b) Expectation propagation. (c) Markov chain Monte Carlo.

Figure 3.4: The class probability contours for the Laplace approximation, EP and MCMC solutions. The strongest line in the middle is the 50% probability (the decision boundary) and the thinner lines are the 2.5%, 25%, 75% and 97.5% probability contours. It is seen that the decision boundary is approximated similarly with all the methods, but there is a larger difference elsewhere.

Figure 3.5: The marginal posterior for two latent variables in demo_classific1. The histogram is the MCMC solution, the dashed line is the Laplace approximation and the full line the EP estimate. The left figure is for the latent variable at location (1.0, 0.2) and the right figure at (0.9, 0.9).

3.2.3 Inference with expectation propagation

The EP approximation is used similarly to the Laplace approximation. We only need to set the latent method to EP:

    gp = gp_set(gp, 'latent_method', 'EP');
    gp = gp_optim(gp, x, y, 'opt', opt);
    [Eft_ep, Varft_ep, Eyt_ep, Varyt_ep, pyt_ep] = ...
        gp_pred(gp, x, y, xt, 'yt', ones(size(xt,1),1));

The EP solution for the predicted class probabilities is visualized in Figure 3.4(b), and Figure 3.5 shows the marginal predictive posterior for two latent variables.

3.2.4 Inference with MCMC

The MCMC solution with a non-Gaussian likelihood is fo
59. can be defined by the covariance function itself The metric structures can be used with any stationary covariance function that is with functions of type k x x k r The reason why this property is implemented by using one extra structure is that this way user does not need to modify the covariance function when redefining the distance Only a new metric file needs to be created It should be noticed though that not all metrics lead to positive definite covariances with all covariance functions For exam ple the squared exponential o exp r is not positive definite with metric induced by L norm r ae x x 1 whereas the exponential o exp r is A metric structure cannot be used with the linear or neural network covariance functions since they are not stationary but a smaller set of inputs can be chosen by using the field selectedVariables selectedVariables can be used also for sta tionary covariance functions as a shorthand to set a metric structure to use the selected variables With metric is also possible to do more elaborate input models as discussed in next section s In this demo we select only the second input variable as gpcf_12 gpcf_12 gpcf_linear selectedVariables 2 gpcf_linear gpcf_12 coeffSigma2_prior pt The result with this model is shown in Figure 6 5 b The neural network covariance function is another non stationary covariance func tion with which the metric st
60. piecewise polynomial, $k_{\mathrm{se}}(\mathbf{x},\mathbf{x}') + k_{\mathrm{pp},2}(\mathbf{x},\mathbf{x}')$. The solution of this model, which shows the long and short length-scale phenomena separately, is visualized in Figure 6.3 together with the original data. This model interpolates the underlying function well, but as will be demonstrated later, its predictive properties into the future are not so good. Better predictive performance is obtained by adding up two squared exponential and one periodic covariance function, $k_{\mathrm{se}}(\mathbf{x},\mathbf{x}'\,|\,\theta_1) + k_{\mathrm{se}}(\mathbf{x},\mathbf{x}'\,|\,\theta_2) + k_{\mathrm{periodic}}(\mathbf{x},\mathbf{x}')$, which is built as follows.

[Figure 6.3: two panels with year on the horizontal axis (1958-2004); (a) the Mauna Loa CO2 data, (b) the long-term (left scale) and short-term (right scale) components.]

Figure 6.3: The Mauna Loa CO2 data and prediction for the long-term and short-term component (demo_regression2).

    gpcf1 = gpcf_sexp('lengthScale', 67*12, 'magnSigma2', 66*66);
    gpcfp = gpcf_periodic('lengthScale', 1.3, 'magnSigma2', 2.4*2.4);
    gpcfp = gpcf_periodic(gpcfp, 'period', 12, 'optimPeriod', 1, ...
                          'lengthScale_sexp', 90*12, 'decay', 1);
    lik = lik_gaussian('sigma2', 0.3);
    gpcf2 = gpcf_sexp('lengthScale', 2, 'magnSigma2', 2);
    pl = prior_t('s2', 10, 'nu', 3);
    pn = prior_t('s2', 10, 'nu', 4);
    gpcf1 = gpcf_sexp(gpcf1, 'lengthScale_prior', pl, 'magnSigma2_prior', pl);
    gp
61. gpcf2 = gpcf_sexp(gpcf2, 'lengthScale_prior', pl, 'magnSigma2_prior', pl);
    gpcfp = gpcf_periodic(gpcfp, 'lengthScale_prior', pl, 'magnSigma2_prior', pl);
    gpcfp = gpcf_periodic(gpcfp, 'lengthScale_sexp_prior', pl, 'period_prior', pn);
    lik = lik_gaussian(lik, 'sigma2_prior', pn);
    gp = gp_set('lik', lik, 'cf', {gpcf1, gpcfp, gpcf2});

An additive model is constructed similarly to a model with just one covariance function. The only difference is that now we give more than one covariance function structure to gp_set. The inference with the additive model is conducted exactly the same way as in demo_regression1. The lines below summarize the hyperparameter optimization and conduct the prediction for the whole process f and two components, g and h, whose covariance functions are $k_{\mathrm{se}}(\mathbf{x},\mathbf{x}'\,|\,\theta_1)$ and $k_{\mathrm{se}}(\mathbf{x},\mathbf{x}'\,|\,\theta_2) + k_{\mathrm{periodic}}(\mathbf{x},\mathbf{x}')$, respectively. The prediction for latent functions that are related to only a subset of the covariance functions used for training is done by setting the optional argument predcf. This argument tells which covariance functions are used for the prediction; if we want to use more than one covariance function for prediction and there are more than two covariance functions in the model, the value for predcf is a vector.

    opt = optimset('TolFun',1e-3,'TolX',1e-3,'Display','iter');
    gp = gp_optim(gp, x, y, 'opt', opt);
    xt = [1:800]';
    [Ef_full, Varf_full
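The prediction calls are cut off above; a hedged sketch of how the component-wise predictions with the predcf argument typically look is given below. The covariance-function indices assume the ordering {gpcf1, gpcfp, gpcf2} used in the gp_set call above.

    % Sketch: predict the full process f and the two components g and h.
    [Ef_full, Varf_full] = gp_pred(gp, x, y, xt);            % all covariance functions
    [Ef_g, Varf_g] = gp_pred(gp, x, y, xt, 'predcf', 1);     % k_se(x,x'|th1) component
    [Ef_h, Varf_h] = gp_pred(gp, x, y, xt, 'predcf', [2 3]); % periodic + second sexp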
62. cus all Here the first line performs the MCMC sampling with options opt The last two lines evaluate the DIC statistics With Markov chain samples we cannot use the go_peff function to evaluate pp 0 since that is a special function for models with fixed param eters The k fold CV is conducted with MCMC methods as easily as with the MAP estimate The only difference is that we have to define that we want to use MCMC and to give the sampling options for gp_kfcv These steps are done as follows opt nsamples 100 opt repeat 4 opt hmc_opt hmc2_opt opt hmc_opt steps 4 opt hmc_opt stepadj 0 05 opt hmc_opt persistence 0 opt hmc_opt decay 0 6 opt hmc_opt nsamples 1 hmc2 state sum 100 clock cvres gp_kfcv gp x y inf_method MCMC opt opt With integration approximation evaluating the DIC and k fold CV statistics are similar to the MCMC approach The steps required are opt int_method grid opt step_size 2 opt optimf fminscg gp_array gp_ia gp x y opt models 3 full_IA DIC 3 p_eff 3 gp_dic gp_array x y focus param DIC2 3 p_eff2 3 gp_dic gp_array x y focus all Then the 10 fold cross validation cvres gp_kfcv gp x y inf_method IA opt opt This far we have demonstrated how to use DIC and k fold CV functions with full GP The function can be used with sparse ap
63. d Other Methods for Non linear Regression PhD thesis University of Toronto Rasmussen C E and Williams C K I 2006 Gaussian Processes for Machine Learning The MIT Press Rathbun S L and Cressie N 1994 Asymptotic properties of estimators for the parameters of spatial inhomogeneous poisson point processes Advances in Applied Probability 26 1 122 154 Richardson S 2003 Spatial models in epidemiological applications In Green P J Hjort N L and Richardson S editors Highly Structured Stochastic Systems pages 237 259 Oxford University Press Robert C P and Casella G 2004 Monte Carlo Statistical Methods Springer second edition Rue H Martino S and Chopin N 2009 Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations Journal of the Royal statistical Society B 71 2 1 35 Sanchez S M and Sanchez P J 2005 Very large fractional factorials and cen tral composite designs ACM Transactions on Modeling and Computer Simulation 15 362 377 Sanso F and Schuh W D 1987 Finite covariance functions Journal of Geodesy 61 4 331 347 Seeger M 2005 Expectation propagation for exponential families Technical report Max Planck Institute for Biological Cybernetics Tiibingen Germany Seeger M 2008 Bayesian inference and optimal design for the sparse linear model Journal of Machine Learning Research 9 759 8 13 S
64. and squared exponential covariance function. The binomial likelihood is defined as follows:

$p(\mathbf{y}\,|\,\mathbf{f},\mathbf{z}) = \prod_{i=1}^{N}\frac{z_i!}{y_i!\,(z_i-y_i)!}\,p_i^{y_i}(1-p_i)^{z_i-y_i},$   (7.12)

where $p_i = \exp(f_i)/(1+\exp(f_i))$ is the probability of success and the vector z denotes the numbers of trials. In this demo a Gaussian process prior is assumed for the latent variables f. The binomial likelihood is initialised in GPstuff as

    % Create the likelihood structure
    lik = lik_binomial();
    % Create the GP data structure
    gp = gp_set('lik', lik, 'cf', gpcf1, 'jitterSigma2', 1e-8);

To use the binomial model, an extra parameter, the number of trials, needs to be set as a parameter for each function that requires the data y. For example, the model is initialized and optimized as

[Figure 7.4 legend: GP 95% CI, GP mean, observations, true latent function.]

Figure 7.4: GP solutions: (a) MAP estimate with squared exponential covariance function and binomial likelihood in a toy example.

    % Set the approximate inference method
    gp = gp_set(gp, 'latent_method', 'Laplace');
    gp = gp_optim(gp, x, y, 'z', N, 'opt', opt);

To make predictions with the binomial likelihood model without computing the predictive density, the total number of trials Nt at the test points needs to be provided in addition to N, that is, the total number o
65. The advantage of using GPs is that we can conduct the inference directly in the function space. The GP prior implies that any set of function values f, indexed by the input coordinates X, has a multivariate Gaussian prior distribution

$p(\mathbf{f}\,|\,X,\theta) = \mathrm{N}(\mathbf{f}\,|\,\mathbf{0},\,K_{\mathrm{f,f}}),$   (1.6)

where $K_{\mathrm{f,f}}$ is the covariance matrix. Notice that the distribution over functions will be denoted by $\mathcal{GP}$, whereas the distribution over a finite set of latent variables will be denoted by $\mathrm{N}$. The covariance matrix is constructed from a covariance function, $[K_{\mathrm{f,f}}]_{ij} = k(\mathbf{x}_i,\mathbf{x}_j\,|\,\theta)$, which characterizes the correlation between different points in the process, $\mathrm{E}[f(\mathbf{x}_i)f(\mathbf{x}_j)] = k(\mathbf{x}_i,\mathbf{x}_j)$. The covariance function encodes the prior assumptions on the latent function, such as the smoothness and scale of the variation, and can be chosen freely as long as the covariance matrices produced are symmetric and positive semi-definite, $\mathbf{v}^{\mathsf{T}}K_{\mathrm{f,f}}\mathbf{v} \ge 0$ for all $\mathbf{v}\in\mathbb{R}^n$. An example of a stationary covariance function is the squared exponential

$k_{\mathrm{se}}(\mathbf{x}_i,\mathbf{x}_j\,|\,\theta) = \sigma_{\mathrm{se}}^2\exp\!\left(-\sum_{k=1}^{d}\frac{(x_{i,k}-x_{j,k})^2}{2l_k^2}\right),$   (1.7)

where $\theta = (\sigma_{\mathrm{se}}^2, l_1, \ldots, l_d)$ is a vector of parameters. Here $\sigma_{\mathrm{se}}^2$ is the scaling parameter and $l_k$ is the length-scale, which governs how fast the correlation decreases as the distance increases in the direction k. The squared exponential is only one example of possible covariance functions. Discussion on other common covariance functions is given, for example, by Diggle and Ribeiro (2007) and Finkenstädt et al. (2007).
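As a small illustration of (1.7), the following sketch evaluates the squared exponential covariance matrix directly for a toy input set; the parameter values are arbitrary and the loop-based implementation is for clarity only, not how GPstuff computes it internally.

    % Sketch: evaluate k_se (1.7) for a small set of 2-dimensional inputs.
    x = rand(5, 2);                      % five 2D inputs
    sigma2 = 1;  l = [0.5 2];            % magnitude and per-dimension length-scales
    K = zeros(5);
    for i = 1:5
      for j = 1:5
        K(i,j) = sigma2 * exp(-sum((x(i,:) - x(j,:)).^2 ./ (2*l.^2)));
      end
    end
    % K is symmetric and positive semi-definite; compare with GPstuff's gp_trcov.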
66. e assume a Gaussian form in which case it is natural to approximate the predictive distribution with a Gaussian as well In this case the equations 2 5 and 2 7 give its mean and covariance The Gaussian approximation can be justified if the conditional posterior is unimodal which it is for example if the likelihood is log concave this can eas ily be seen by evaluating the Hessian of the posterior p f D 0 and there is enough data so that the posterior will be close to Gaussian A pragmatic justification for using Gaussian approximation is that many times it suffices to approximate well the mean and variance of the latent variables These on the other hand fully define Gaussian distribution and one can approximate the integrals over f by using the Gaussian form for their conditional posterior Detailed considerations on the approximation error and the asymptotic properties of the Gaussian approximation are presented for example by Rue et al 2009 and Vanhatalo et al 2010 2 1 2 Gaussian observation model A special case of an observation model for which the conditional posterior of the latent variables can be evaluated analytically is the Gaussian distribution y N fi Gh 2 1 CONDITIONAL POSTERIOR OF THE LATENT FUNCTION 9 where the parameter represents the noise variance a Since both the likelihood and the prior are Gaussian functions of the latent variable we can integrate analytically over f to obtain p y X 0 0
67. [Figure residue removed: surface plots of the predicted underlying functions for the models of this demo. Recoverable panel labels: "sexp for 1. input, linear for 2. input"; (a) constant + linear covariance; (b) k_se(x1|th1) + k_linear(x2|th2); (c) k_se(x1|th1) + k_se(x2|th2) ("additive sexp"); (d) k_se(x,x'|th) ("sexp"); "additive neural network"; "neural network".]
68. Squared exponential covariance function (gpcf_sexp)

Probably the most widely used covariance function is the squared exponential (SE),

$k_{\mathrm{se}}(\mathbf{x}_i,\mathbf{x}_j) = \sigma_{\mathrm{se}}^2\exp\!\left(-\sum_{k=1}^{d}\frac{(x_{i,k}-x_{j,k})^2}{2l_k^2}\right).$   (B.1)

The length-scale $l_k$ governs the correlation scale in input dimension k and the magnitude $\sigma_{\mathrm{se}}^2$ the overall variability of the process. The squared exponential covariance function leads to very smooth GPs that are infinitely many times mean square differentiable.

Exponential covariance function (gpcf_exp)

The exponential covariance function is defined as

$k_{\mathrm{exp}}(\mathbf{x}_i,\mathbf{x}_j) = \sigma_{\mathrm{exp}}^2\exp\!\left(-\sqrt{\sum_{k=1}^{d}\frac{(x_{i,k}-x_{j,k})^2}{l_k^2}}\right).$   (B.2)

The parameters $l_k$ and $\sigma_{\mathrm{exp}}^2$ have a similar role as with the SE covariance function. The exponential covariance function leads to very rough GPs that are not mean square differentiable.

Matérn class of covariance functions (gpcf_maternXX)

The Matérn class of covariance functions is given by

$k_{\nu}(\mathbf{x}_i,\mathbf{x}_j) = \sigma^2\frac{2^{1-\nu}}{\Gamma(\nu)}\big(\sqrt{2\nu}\,r\big)^{\nu}K_{\nu}\big(\sqrt{2\nu}\,r\big),$   (B.3)

where $r = \sqrt{\sum_{k=1}^{d}(x_{i,k}-x_{j,k})^2/l_k^2}$. The parameter $\nu$ governs the smoothness of the process and $K_{\nu}$ is a modified Bessel function (Abramowitz and Stegun, 1970, sec. 9.6). The Matérn covariance functions can be represented in a simpler form when $\nu$ is a half-integer. The Matérn covariance functions with $\nu = 3/2$ (gpcf_matern32) and $\nu = 5/2$ (gpcf_matern52) are

$k_{\nu=3/2}(\mathbf{x}_i,\mathbf{x}_j) = \sigma^2\big(1+\sqrt{3}\,r\big)\exp\!\big(-\sqrt{3}\,r\big)$   (B.4)
$k_{\nu=5/2}(\mathbf{x}_i,\mathbf{x}_j) = \sigma^2\Big(1+\sqrt{5}\,r+\tfrac{5r^2}{3}\Big)\exp\!\big(-\sqrt{5}\,r\big)$
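For completeness, a minimal sketch of creating one of these covariance structures in GPstuff follows; the parameter values and the Student-t prior are illustrative assumptions, following the same construction pattern used in the demos earlier in this guide.

    % Sketch: a Matern nu=3/2 covariance function with Student-t priors
    % on the length-scale and magnitude (prior choice is illustrative).
    pl = prior_t('s2', 10);
    gpcf = gpcf_matern32('lengthScale', 1, 'magnSigma2', 0.5, ...
                         'lengthScale_prior', pl, 'magnSigma2_prior', pl);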
69. eeger M Williams C K I and Lawrence N 2003 Fast forward selection to speed up sparse Gaussian process regression In Bishop C M and Frey B J ed itors Ninth International Workshop on Artificial Intelligence and Statistics Society for Artificial Intelligence and Statistics Snelson E 2007 Flexible and Efficient Gaussian Process Models for Machine Learning PhD thesis University College London Snelson E and Ghahramani Z 2006 Sparse Gaussian process using pseudo inputs In Weiss Y Scholkopf B and Platt J editors Advances in Neural Information Processing Systems 18 The MIT Press Snelson E and Ghahramani Z 2007 Local and global sparse Gaussian process ap proximations In Meila M and Shen X editors Artificial Intelligence and Statis tics 11 Omnipress BIBLIOGRAPHY 89 Spiegelhalter D J Best N G Carlin B P and van der Linde A 2002 Bayesian measures of model complexity and fit Journal of the Royal Statistical Society B 64 4 583 639 Storkey A 1999 Efficient Covariance Matrix Methods for Bayesian Gaussian Pro cesses and Hopfield Neural Networks PhD thesis University of London Sundararajan S and Keerthi S S 2001 Predictive approaches for choosing hyper parameters in gaussian processes Neural Computation 13 5 1103 1118 Takahashi K Fagan J and Chen M S 1973 Formation of a sparse bus impedance matrix and its application to short circuit
70. egun I A 1970 Handbook of mathematical functions Dover Publications Inc Ahmad O B Boschi Pinto C Lopez A D Murray C J Lozano R and Inoue M 2000 Age standardization of rates A new WHO standard GPE Discussion Paper Series 31 Alvarez M A Luengo D Titsias M K and Lawrence N D 2010 Efficient mul tioutput Gaussian processes through variational inducing kernels JMLR Workshop and conference proceedings 9 25 32 Banerjee S Carlin B P and Gelfand A E 2004 Hierarchical Modelling and Analysis for Spatial Data Chapman Hall CRC Banerjee S Gelfand A E Finley A O and Sang H 2008 Gaussian predictive process models for large spatial data sets Journal of the Royal Statistical Society B 70 4 825 848 Bernardo J M 1979 Expected information as expected utility Annals of Statistics 7 3 686 690 Bernardo J M and Smith A F M 2000 Bayesian Theory John Wiley amp Sons Ltd Best N Richardson S and Thomson A 2005 A comparison of Bayesian spatial models for disease mapping Statistical Methods in Medical Research 14 35 59 Buhmann M D 2001 A new class of radial basis functions with compact support Mathematics of Computation 70 233 307 3 18 Burman P 1989 A comparative study of ordinary cross validation v fold cross validation and the repeated learning testing methods Biometrika 76 3 503 514 Christensen O F Roberts
71. measurements corrupted with i.i.d. Gaussian noise, $y_i = f(\mathbf{x}_i) + \epsilon_i$, where $\epsilon_i \sim \mathrm{N}(0,\sigma^2)$. In this particular example the input vectors $\mathbf{x}_i$ are two dimensional. This results in the overall model

$\mathbf{y}\,|\,\mathbf{f},\sigma^2 \sim \mathrm{N}(\mathbf{f},\,\sigma^2\mathbf{I})$   (3.1)
$f(\mathbf{x})\,|\,\theta \sim \mathcal{GP}\big(0,\,k(\mathbf{x},\mathbf{x}'\,|\,\theta)\big)$   (3.2)
$\theta,\sigma^2 \sim p(\theta)\,p(\sigma^2).$   (3.3)

We will show how to construct the model with a squared exponential covariance function and how to conduct the inference.

3.1.1 Constructing the model

In GPstuff the model is constructed in the following three steps:
- create structures that define the likelihood and covariance function
- define priors for the parameters
- create a GP structure which stores all the above

[Figure 3.1 shows the resulting structures in demo_regression1: a gp struct with fields type ('FULL'), lik (1x1 struct), cf (1x1 struct), infer_params ('covariance+likelihood') and jitterSigma2 (0); a lik struct of type 'Gaussian' with sigma2 = 0.0400; and a gpcf struct of type 'gpcf_sexp' with lengthScale = [1.1000 1.2000] and magnSigma2 = 0.0400.]

Figure 3.1: The GP, likelihood and covariance function structures in demo_regression1.

These three steps are done as follows:

    lik = lik_gaussian('sigma2', 0.2^2);
    gpcf = gpcf_sexp('lengthScale', [1.1 1.2], 'magnSigma2', 0.2^2);
    pn = prior_logunif();
    lik = lik_gaussian(lik, 'sigma2_prior', pn);
    pl = prior_t();
    pm = prior_sqrtunif();
    gpcf = gpcf_sexp(gpcf, 'lengthScale_prior', pl, 'magnSigma2_prior', pm);
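The third step (creating the GP structure itself) falls outside this excerpt; following the gp_set calls used throughout this guide, it would typically be

    % Sketch: create the GP structure that stores the likelihood and covariance.
    gp = gp_set('lik', lik, 'cf', gpcf);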
72. the gradient of the log posterior of the parameters needs some extra sparse-matrix tools. The problematic part is the trace in the derivative of the log marginal likelihood, for example

$\frac{\partial}{\partial\theta}\log p(\mathbf{y}\,|\,X,\theta,\sigma^2) = \frac{1}{2}\mathbf{y}^{\mathsf{T}}(K_{\mathrm{f,f}}+\sigma^2\mathbf{I})^{-1}\frac{\partial K_{\mathrm{f,f}}}{\partial\theta}(K_{\mathrm{f,f}}+\sigma^2\mathbf{I})^{-1}\mathbf{y} - \frac{1}{2}\mathrm{tr}\!\left((K_{\mathrm{f,f}}+\sigma^2\mathbf{I})^{-1}\frac{\partial K_{\mathrm{f,f}}}{\partial\theta}\right).$   (4.2)

The trace would require the full inverse of the covariance matrix if evaluated naively. Luckily, Takahashi et al. (1973) introduced an algorithm whereby we can evaluate a sparsified version of the inverse of a sparse matrix. This can be utilized in the gradient evaluations as described by Vanhatalo and Vehtari (2008). The same problem was considered by Storkey (1999), who used covariance matrices of Toeplitz form, which are fast to handle due to their banded structure. However, constructing Toeplitz covariance matrices is not possible in two or higher dimensions without approximations. The EP algorithm also requires special considerations with CS covariance functions: the posterior covariance in EP (2.16) does not remain sparse, and thereby it has to be expressed implicitly during the updates. This issue is discussed in Vanhatalo et al. (2010) and Vanhatalo and Vehtari (2010).

Compactly supported covariance functions in GPstuff

The demo demo_regression_ppcs contains a regression example with fairly large data, the US annual precipitation summaries from year 1995 for 5776 observation stations. The data was previously used by Vanhatalo and Vehtari
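To make the role of the trace term concrete, the following toy sketch evaluates (4.2) naively with a dense inverse for a single length-scale parameter of a 1D squared exponential covariance; it is for illustration only and exhibits exactly the dense-inverse cost that the sparse tools above are designed to avoid.

    % Sketch: naive evaluation of the marginal-likelihood gradient (4.2)
    % with respect to the length-scale l of a 1D squared exponential covariance.
    n = 50; x = linspace(0,10,n)'; y = sin(x) + 0.1*randn(n,1);
    s2 = 0.01; sf2 = 1; l = 2;
    D2 = (x - x').^2;                         % squared pairwise distances
    K  = sf2 * exp(-D2/(2*l^2));
    dK = K .* D2 / l^3;                       % dK/dl for the squared exponential
    C  = K + s2*eye(n);
    a  = C \ y;
    grad_l = 0.5*(a'*dK*a) - 0.5*trace(C \ dK);   % equation (4.2)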
73. (Møller et al., 1998). If the data are points $X = \{\mathbf{x}_i\}_{i=1}^{n}$ on a finite region V in $\mathbb{R}^d$, then the likelihood of the unknown function f is

$L(X\,|\,f) = \exp\!\left(-\int_{V}\exp\!\big(f(\mathbf{x})\big)\,d\mathbf{x}\right)\prod_{i=1}^{n}\exp\!\big(f(\mathbf{x}_i)\big).$   (7.10)

Evaluation of the likelihood would require nontrivial integration over the exponential of the Gaussian process. Møller et al. (1998) propose to discretise the region V and assume locally constant intensity in subregions. This transforms the problem to a form equivalent to having a Poisson model for each subregion. The likelihood after the discretisation is

$L(X\,|\,\mathbf{f}) = \prod_{k=1}^{K}\text{Poisson}\big(y_k\,|\,\exp(f(\tilde{\mathbf{x}}_k))\big),$   (7.11)

where $\tilde{\mathbf{x}}_k$ is the coordinate of the k-th sub-region and $y_k$ is the number of data points in it. Tokdar and Ghosh (2007) proved the posterior consistency in the limit when the sizes of the subregions go to zero.

In the 1D case the data are the coal mine disaster data from the R distribution (coal.rda), which contains the dates of 191 coal mine explosions that killed ten or more men in Britain between 15 March 1851 and 22 March 1962. Computation with expectation propagation and CCD integration over the parameters took 20 s. In the 2D case the data are the redwood data from the R distribution (redwoodfull.rda), which contains 195 locations of redwood trees. Computation with the Laplace approximation and MAP estimation of the parameters took 3 s.

In section 7.2.1 we demonstrated how fast non-MCMC inference for this model can be made using the Laplace method or expectation propagation for integrating over the latent variables in application
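The discretisation in (7.11) simply bins the observed points; a toy sketch for the 1D case could look like the following. The number of bins and the variable t (a vector of event times) are illustrative assumptions, not the values used in the demo.

    % Sketch: discretise 1D point data into K bins to get the counts y_k of (7.11).
    K = 100;                                   % number of sub-regions (illustrative)
    edges = linspace(min(t), max(t), K+1);     % t = vector of event times
    yk = histc(t, edges); yk = yk(1:K);        % counts per sub-region
    xk = (edges(1:K) + edges(2:K+1))'/2;       % sub-region centre coordinates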
74. set the sampling options. The structure gibbs_opt contains the options for the slice sampling used for $\nu$. The parameters of the squared exponential covariance function are sampled with HMC, and the options for this sampler are set in the structure hmc_opt.

    gpcf1 = gpcf_sexp('lengthScale', 1, 'magnSigma2', 0.2^2);
    gpcf1 = gpcf_sexp(gpcf1, 'lengthScale_prior', pl, 'magnSigma2_prior', pm);
    lik = lik_gaussiansmt('ndata', n, 'sigma2', repmat(1,n,1), ...
                          'nu_prior', prior_logunif());
    gp = gp_init('lik', lik, 'cf', gpcf1, 'jitterSigma2', 1e-9);

    hmc_opt.steps = 10;
    hmc_opt.stepadj = 0.06;
    hmc_opt.nsamples = 1;
    hmc2('state', sum(100*clock));
    hmc_opt.persistence = 1;
    hmc_opt.decay = 0.6;

    % Sample
    [r, g, opt] = gp_mc(gp, x, y, 'nsamples', 300, 'hmc_opt', hmc_opt);

The Student-t observation model with Laplace approximation

The Student-t observation model is implemented in likelih_t. This is used similarly to the observation models in the classification setting. The difference is that now the likelihood also has hyperparameters. These parameters can be optimized alongside the covariance function parameters with the Laplace approximation. We just need to give them a prior (which is by default log-uniform) and include 'likelihood' in the parameter string which defines the optimized parameters. All this is done with the following lines:

    pl = prior_t();
    pm = p
75. of trials in the training points. In GPstuff this is done as follows:

    % Set the total number of trials Ntgrid at the grid points xgrid
    [Eft_la, Varft_la, Eyt_la, Varyt_la] = ...
        gp_pred(gp, x, y, xgrid, 'z', N, 'zt', Ntgrid);

To compute predictive densities with the binomial likelihood at the test points xt, the total number of trials Nt must additionally be set:

    [Eft_la, Varft_la, Eyt_la, Varyt_la, pyt_la] = ...
        gp_pred(gp, x, y, xt, 'z', N, 'yt', yt, 'zt', Nt);

7.5 Derivative observations in GP regression

Incorporating derivative observations in GP regression is fairly straightforward, because a derivative of a Gaussian process is a Gaussian process. In short, derivative observations are taken into account by extending the covariance matrices to include derivative observations. This is done by forming joint covariance matrices of function values and derivatives. The following equations (Rasmussen and Williams, 2006) state how the covariances between function values and derivatives, and between derivatives, are calculated:

$\mathrm{Cov}\!\left(f_i,\frac{\partial f_j}{\partial x_{d,j}}\right) = \frac{\partial k(\mathbf{x}_i,\mathbf{x}_j)}{\partial x_{d,j}}, \qquad \mathrm{Cov}\!\left(\frac{\partial f_i}{\partial x_{d,i}},\frac{\partial f_j}{\partial x_{e,j}}\right) = \frac{\partial^2 k(\mathbf{x}_i,\mathbf{x}_j)}{\partial x_{d,i}\,\partial x_{e,j}}.$

The joint covariance matrix for function values and derivatives is of the following form:

$K_{\text{joint}} = \begin{bmatrix} K_{\mathrm{ff}} & K_{\mathrm{fD}} \\ K_{\mathrm{Df}} & K_{\mathrm{DD}} \end{bmatrix},\quad [K_{\mathrm{ff}}]_{ij} = k(\mathbf{x}_i,\mathbf{x}_j),\quad [K_{\mathrm{fD}}]_{ij} = \frac{\partial k(\mathbf{x}_i,\mathbf{x}_j)}{\partial x_{d,j}},\quad [K_{\mathrm{DD}}]_{ij} = \frac{\partial^2 k(\mathbf{x}_i,\mathbf{x}_j)}{\partial x_{d,i}\,\partial x_{e,j}},$

with $K_{\mathrm{Df}} = K_{\mathrm{fD}}^{\mathsf{T}}$. Prediction is done
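As a concrete illustration of these formulas (not of the GPstuff implementation itself), the following sketch builds the joint covariance matrix for a 1D squared exponential covariance, for which the cross-covariance and derivative-derivative blocks have simple closed forms.

    % Sketch: joint covariance of function values and derivatives for a
    % 1D squared exponential covariance k(x,x') = sf2*exp(-(x-x')^2/(2*l^2)).
    x = linspace(0, 5, 8)';  sf2 = 1;  l = 1.5;
    D   = x - x';                            % pairwise differences x_i - x_j
    Kff = sf2 * exp(-D.^2 / (2*l^2));        % Cov(f_i, f_j)
    KfD = Kff .* D / l^2;                    % Cov(f_i, df_j/dx) = dk/dx_j
    KDD = Kff .* (1/l^2 - D.^2/l^4);         % Cov(df_i/dx, df_j/dx)
    Kjoint = [Kff, KfD; KfD', KDD];          % joint covariance matrix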
76. for the predicted class probabilities is visualized in Figure 3 4 c and Figure 3 5 shows the marginal predictive posterior for two latent variables It can be seen that there are little differences between the three different approxima tions Here MCMC is the most accurate then comes EP and Laplace is the worst However the inference times line up in the opposite order The difference between the approximations is not always this large For example in spatial epidemiology with Poisson observation model see Section 7 2 Laplace and EP approximations work in our experience practically as well as MCMC 26 CHAPTER 3 GPSTUFF BASICS REGRESSION AND CLASSIFICATION Chapter 4 Sparse Gaussian processes The challenges with using Gaussian process models are the fast increasing computa tional time and memory requirements The evaluation of the inverse and determinant of the covariance matrix in the log marginal likelihood or its approximation and its gradient scale as O n in time which restricts the implementation of GP models to moderate size data sets For this reason there are number of sparse Gaussian processes introduced in the literature 4 1 Compactly supported covariance functions A compactly supported covariance function gives zero correlation between data points whose distance exceeds a certain threshold leading to a sparse covariance matrix The challenge in constructing CS covariance functions is to guarantee their positive defi
77. gp_set lik lik cf gpcf jitterSigma2 le 4 latent_method MCMC 54 CHAPTER 7 SPECIAL OBSERVATION MODELS f gp_pred gp x y X gp gp_set gp latent_opt struct f f Set the parameters for MCMC Covariance parameter options opt hmc_opt steps 5 opt hmc_opt stepadj 0 05 opt hmc_opt nsamples 1 Latent options opt latent_opt display 0 opt latent_opt repeat 10 opt latent_opt sample_latent_scale 0 05 Sample rgp g opt gp_mc gp x Yy nsamples 400 opt 7 2 Models for spatial epidemiology Spatial epidemiology concerns both describing and understanding the spatial variation in the disease risk in geographically referenced health data One of the main classes of spatial epidemiological studies is disease mapping where the aim is to describe the overall disease distribution on a map and for example highlight areas of elevated or lowered mortality or morbidity risk e g Lawson 2001 Richardson 2003 Elliot et al 2001 The spatially referenced health data may be point level appointing to con tinuously varying co ordinates and showing for example home residence of diseased people More commonly however the data are an areal level referring to a finite sub region of a space as for example county or country and telling the counts of diseased people in the area e g Banerjee et al 2004 In this section we will consider two disease mapping mode
78. arguments, which contain the array of GPs (gp_array) for parameter values $\theta^{(s)}$, $s = 1,\ldots,M$, with weights P_TH $\propto p(\theta^{(s)}\,|\,D)\,\Delta^{(s)}$. Since we use the grid method, the weights are proportional to the marginal posterior and $\Delta^{(s)} = 1$ for all s (see section 2.3.2). Ef_ia and Varf_ia contain the predictive mean and variance at the prediction locations. The last two output arguments can be used to plot the predictive distribution $p(f_i\,|\,D)$ at an input location $\mathbf{x}_i$ (the i-th row of the prediction inputs), as demonstrated in Figure 3.2: one contains a regular grid of values $f_i$ and the other contains $p(f_i\,|\,D)$ at those values.

3.1.4 Marginalization over parameters with MCMC

The main function for conducting Markov chain sampling is gp_mc, which loops through all the specified samplers in turn and saves the sampled parameters in a record structure. In later sections we will discuss models where also the latent variables are sampled, but now we concentrate on the covariance function parameters. Sampling and predictions can be made as follows:

    hmc_opt = hmc2_opt;
    [rfull, g, opt] = gp_mc(gp, x, y, 'nsamples', 400, 'repeat', 5, 'hmc_opt', hmc_opt);
    rfull = thin(rfull, 50, 2);
    [Eft_mc, Varft_mc] = gpmc_preds(rfull, x, y, xt);

The gp_mc function generates nsamples Markov chain samples. Between consecutive samples, repeat = 5 sampler iterations are done. At each iteration gp_mc runs the actual samplers. For example, giving the option hmc_opt tells that gp_mc should
79. The hyperparameter values are set to their MAP estimate. The models considered in this demo utilize the following six covariance constructions:
- constant and linear
- constant, squared exponential for the first input and linear for the second input
- squared exponential for the first input and squared exponential for the second input
- squared exponential
- neural network for the first input and neural network for the second input
- neural network

We will demonstrate how to construct the first, second and fifth model. A linear covariance function with a constant term can be constructed in GPstuff as

    % constant covariance function
    gpcf_c = gpcf_constant('constSigma2', 1);
    gpcf_c = gpcf_constant(gpcf_c, 'constSigma2_prior', pt);
    % linear covariance function
    gpcf_l = gpcf_linear('coeffSigma2_prior', pt);

A Gaussian process using this linear covariance function is constructed as previously:

    gp = gp_set('lik', lik, 'cf', {gpcf_c, gpcf_l}, 'jitterSigma2', jitter);

In this model the covariance function is $c + k_{\text{linear}}(\mathbf{x},\mathbf{x}'\,|\,\theta)$, which means that the components of x are coupled in the covariance function $k_{\text{linear}}$. The constant term (gpcf_constant) is denoted by c. The second model is more flexible: it contains a squared exponential which is a function of the first input dimension, $x_1$, and a linear covariance function which is a function of the second input dimension, $x_2$. The additive cova
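The fifth model uses neural network covariance functions for each input separately; the excerpt does not reach its construction, but a hedged sketch following the same pattern, with per-input selection via the selectedVariables field discussed earlier, could be as follows. The gpcf_neuralnetwork option names (weightSigma2, biasSigma2) are assumptions to be checked against its help text.

    % Sketch: additive neural network covariance, one term per input dimension.
    gpcf_nn1 = gpcf_neuralnetwork('selectedVariables', 1, ...
                                  'weightSigma2', 1, 'biasSigma2', 1);
    gpcf_nn2 = gpcf_neuralnetwork('selectedVariables', 2, ...
                                  'weightSigma2', 1, 'biasSigma2', 1);
    gp = gp_set('lik', lik, 'cf', {gpcf_nn1, gpcf_nn2}, 'jitterSigma2', jitter);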
80. late score function convergence diagnostic GBINIT Initial iterations for Gibbs iteration diagnostic GBITER Estimate number of additional Gibbs iterations Time series analysis ACORR Estimate autocorrelation function of time series ACORRTIME Estimate autocorrelation evolution of time series simple GEYER_ICSE Compute autocorrelation time tau using Geyer s initial convex sequence estimator requires Optimization toolbox Compute autocorrelation time tau using Geyer s initial monotone sequence estimator GEYER_IMSE Kernel density estimation etc KERNEL1 1D Kernel density estimation of data KERNELS Kernel density estimation of independent components of data KERNELP 1D Kernel density estimation with automatic kernel width NDHIST Normalized histogram of N dimensional data HPDI Estimates the Bayesian HPD intervals Manipulation of MCMC chains THIN Delete burn in and thin MCMC chains JOIN Join similar structures of arrays to one structure of arrays BATCH Batch MCMC sample chain and evaluate mean median of batches Misc CUSTATS Calculate cumulative statistics of data BBPRCTILE Bayesian bootstrap percentile GRADCHEK Checks a user defined gradient function using finite differences DERIVATIVECHECK Compare user supplied derivatives to finite differencing derivatives A 3 Distributions PROBABILITY DISTRIBUTION FUNCTIONS in the dist folder Priors PRIOR_FIXED Fix parameter to its current
81. models: one that utilizes a Poisson observation model and the other a negative binomial observation model. The models follow the general approach discussed, for example, by Best et al. (2005). The data are aggregated into areas $A_i$ with coordinates $\mathbf{x}_i = (x_{i,1}, x_{i,2})$. The mortality/morbidity count in an area $A_i$ is modeled as a Poisson or negative binomial with mean $e_i\mu_i$, where $e_i$ is the standardized expected number of cases in the area $A_i$ and $\mu_i$ is the relative risk, which is given a Gaussian process prior.

The standardized expected number of cases $e_i$ can be any positive real number that defines the expected mortality/morbidity count for the i-th area. Common practice is to evaluate it following the idea of the directly standardized rate (e.g. Ahmad et al., 2000), where the rate in an area is standardized according to the age distribution of the population in that area. The expected value in the area $A_i$ is obtained by summing the products of the rate and population over the age groups in the area,

$e_i = \sum_{r}\frac{Y_r}{N_r}N_{i,r},$

where $Y_r$ and $N_r$ are the total number of deaths and people in the whole area of study in the age group r, and $N_{i,r}$ is the number of people in the age group r in the area $A_i$. In the following demos, e and y are calculated from real data that contain deaths due to either alcohol-related diseases or cerebrovascular diseases in Finland. The examples here are based on the works by Vanhatalo and Vehtari (2007) and Vanhatalo et al. (2010).
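As a small illustration of the directly standardized expectation above (with hypothetical variable names; the demos compute e from the real data), one could write:

    % Sketch: standardized expected number of cases per area.
    % Yr, Nr : 1 x R totals of deaths and population per age group (whole study area)
    % Nir    : nAreas x R population per area and age group
    e = Nir * (Yr ./ Nr)';    % e(i) = sum_r (Yr(r)/Nr(r)) * Nir(i,r)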
82. metrization corrects for scale and rotation and simplifies the integration Rue et al 2009 The exploration of log p v D is started from the mode and continued so that the bulk of the posterior mass is included in the integration The grid points are set evenly along the directions z and thus the area weights A are equal This is illustrated in Figure 2 2 a see also Rue et al 2009 Vanhatalo et al 2010 The grid search is feasible only for a small number of parameters since the number of grid points grows exponentially with the dimension of the parameter space d For example the number of the nearest neighbors of the mode increases as O 37 which results in 728 grid points already for d 6 If also the second neighbors are included the number of grid points increases as O 5 resulting in 15624 grid points for six parameters 2 3 3 Monte Carlo integration Monte Carlo integration works better than the grid integration in large parameter spaces since its error decreases with a rate that is independent of the dimension Robert and Casella 2004 There are two options to find a Monte Carlo estimate for the marginal posterior p f D The first option is to sample only the parameters from their marginal posterior p 6 D or from its approximation see Figure 2 2 b In this case the pos terior marginals are approximated with mixture distributions as in the grid integration The other option is to run a full MCMC for all the parameters
83. model assessment and comparison using cross validation predictive densities Neural Computation 14 10 2439 2468 Wendland H 1995 Piecewise polynomial positive definite and compactly sup ported radial functions of minimal degree Advances in Computational Mathematics 4 1 389 396 Wendland H 2005 Scattered Data Approximation Cambridge University Press West M 1984 Outlier models and prior distributions in Bayesian linear regression Journal of the Royal Statistical Society Series B 46 3 431 439 Wiener N 1949 Extrapolation Interpolation and Smoothing of Stationary Time Series MIT Press 90 BIBLIOGRAPHY Williams C K I 1996 Computing with infinite networks In Mozer M C Jordan M I and Petsche T editors Advances in Neural Information Processing Systems volume 9 The MIT Press Williams C K I and Barber D 1998 Bayesian classification with Gaussian processes IEEE Transactions on Pattern Analysis and Machine Intelligence 20 12 1342 1351 Williams C K I and Rasmussen C E 1996 Gaussian processes for regression In Touretzky D S Mozer M C and Hasselmo M E editors Advances in Neural Information Processing Systems 8 pages 514 520 MIT Press Wu Z 1995 Compactly supported positive definite radial functions Advances in Computational Mathematics 4 1 283 292
84. models prior probabilities are appointed also for the parameters of the prior Let us write the prior as p 6 77 where n denotes the parameters of the prior distribution hyperparameters By setting a hyperprior p n for the hyperparameters we obtain a hierarchical model structure where the fixed parameter values move further away from the data This allows more flexible models and leads to more vague prior which is beneficial if the modeler is unsure of the specific form of the prior In theory the hierarchy could be extended as far as one desires but there are practical limits on how many levels of hierarchy are reasonable Goel and Degroot 1981 The models implemented in GPstuff are hierarchical models where the parameter 0 is replaced by a latent function f x The observation model is build such that an indi vidual observation depends on a function value at a certain input location x The latent function f x is given a Gaussian process GP prior Rasmussen and Williams 2006 whose properties are defined by a mean and covariance function and the hyperparam eters related to them The hierarchy is continued to third level by giving a hyperprior for the covariance function parameters The assumption is that there is a functional de scription for the studied phenomenon which we are not aware of and the observations are noisy realizations of this underlying function The power of this construction lies in the flexibility and non parametric f
85. more reasonable model would be

$f(\mathbf{x}) = g(\mathbf{x}) + h(\mathbf{x}),$   (6.1)

where the latent value function is a sum of two functions, a slowly and a fast varying one. By placing a separate GP prior on both of the functions g and h, we obtain an additive prior

$f(\mathbf{x})\,|\,\theta \sim \mathcal{GP}\big(0,\,k_g(\mathbf{x},\mathbf{x}') + k_h(\mathbf{x},\mathbf{x}')\big).$   (6.2)

The marginal likelihood and posterior distribution of the latent variables are as before, with $K_{\mathrm{f,f}} = K_{g} + K_{h}$. However, if we are interested in only, say, the phenomenon g, we can consider the h part of the latent function as correlated noise and evaluate the predictive distribution for g, which with the Gaussian observation model would be

$g(\tilde{X})\,|\,D,\theta \sim \mathcal{GP}\big(k_g(\tilde{X},X)(K_{\mathrm{f,f}}+\sigma^2\mathbf{I})^{-1}\mathbf{y},\; k_g(\tilde{X},\tilde{X}) - k_g(\tilde{X},X)(K_{\mathrm{f,f}}+\sigma^2\mathbf{I})^{-1}k_g(X,\tilde{X})\big).$   (6.3)

(a) Neural network solution. (b) Squared exponential solution.

Figure 6.1: GP latent mean predictions using a MAP estimate with neural network or squared exponential covariance functions. The 2D data is generated from a step function.

[Figure 6.2 legend: GP 95% CI, GP mean, observations, true latent function.] (a) Neural network solution. (b) Squared exponential solution.

Figure 6.2: GP solutions: a MAP estimate with neural network or squared exponential covariance functions.

With a non-Gaussian likelihood, the Laplace and EP approximations for this
86. n for Student t likeli hood optimizing all parameters 4 MCMC approach with Student t likelihood so that v 4 and 5 Laplace approximation for Student t observation model so that v 4 We will demonstrate the steps for parts 2 3 and 5 The scale mixture model The scale mixture representation of Student t observation model is implemented as a special kind of covariance function gocf_noiset Itis very similar to the gocf_noise covariance function in that it returns diagonal covariance matrix diag aU The scale mixture model is efficient to handle with Gibbs sampling since we are able to sample all the parameters U a T efficiently from their full conditionals with regular built in samplers For the degrees of freedom v we use slice sampling All the sampling steps are stored in the gocf_noiset structure and gp_mc sampling function knows to use them if we add gibbs_opt field in its options structure Below we show the lines needed to perform the MCMC for the scale mixture model The third line constructs the noise covariance function and the fourth line sets the option ix_nu to zero which means that we are going to sample also the degrees of freedom The degrees of freedom is many times poorly identifiable for which rea son fixing its value to for example four is reasonable This is the reason why it is fixed by default With this data set its sampling is safe though The next line initial izes the GP structure and the lines after that s
87. n with Gaussian process N_PPCS Regression problem demonstration for 2 input function with Gaussian process using CS covariance N_ADDITIVE1 Regression problem demonstration with additive model N_ADDITIVE2 Regression demonstration with additive Gaussian process using linear squared exponential and neural network covariance fucntions N_HIER Hierarchical regression demonstration N_ROBUST A regression demo with Student t distribution as a residual model N_SPARSE1 Regression problem demonstration for 2 input function with sparse Gaussian processes N_SPARSE2 Regression demo comparing different sparse approximations Demonstration for a disease mapping problem with Gaussian process prior and Poisson likelihood Demonstration for a disease mapping problem with A 2 DIAGNOSTIC TOOLS 69 Gaussian process prior and negative binomial observation model A 2 Diagnostic tools THE DIAGNOSTIC TOOLS in the diag folder Covergence diagnostics PSRF Potential Scale Reduction Factor CPSRF Cumulative Potential Scale Reduction Factor MPSRF Multivariate Potential Scale Reduction Factor CMP SRF Cumulative Multivariate Potential Scale Reduction Factor IPSRF Interval based Potential Scale Reduction Factor CIPSRF Cumulative Interval based Potential Scale Reduction Factor KSSTAT Kolmogorov Smirnov goodness of fit hypothesis test HAIR Brooks hairiness convergence diagnostic CUSUM Yu Mykland convergence diagnostic for MCMC SCORE Calcu
88. nalization over parameters
  2.3.1 Maximum a posterior estimate of parameters
  2.3.2 Grid integration
  2.3.3 Monte Carlo integration
  2.3.4 Central composite design
3 GPstuff basics: regression and classification
  3.1 Gaussian process regression demo_regression1
    3.1.1 Constructing the model
    3.1.2 MAP estimate for the hyperparameters
    3.1.3 Marginalization over parameters with grid integration
    3.1.4 Marginalization over parameters with MCMC
  3.2 Gaussian process classification
    3.2.1 Constructing the model
    3.2.2 Inference with Laplace approximation
    3.2.3 Inference with expectation propagation
    3.2.4 Inference with MCMC
4 Sparse Gaussian processes
  4.1 Compactly supported covariance functions
  4.2 FIC and PIC sparse approximations
  4.3 DTC, SOR and VAR approximations
    4.3.1 Comparison of sparse GP models
    4.3.2 Sparse GP models with non-Gaussian likelihoods
5 Model assessment and comparison
  5.1 Marginal likelihood
  5.2 Predictive performance estimates
  5.3 Cross-validation
    5.3.1 Leave-one-o
89. nction f(x̃) on an arbitrary input vector x̃. The posterior GP is illustrated in Figure 1.2.

[Figure 1.1: An illustration of a Gaussian process. The upper left subfigure presents three functions drawn randomly from a zero-mean GP with a squared exponential covariance function; the length-scale and magnitude hyperparameters are both set to 1, and the grey shading represents the central 95% probability interval. The upper right subfigure presents the marginal distribution p(f(x)|θ) for a single function value. The lower subfigures present three marginal distributions between two function values at distinct input locations, shown in the upper left subfigure by dashed lines. It can be seen that the correlation between the function values f(x_i) and f(x_j) is the greater the closer x_i and x_j are to each other. Panel titles: "Random draws from GP", "Marginal density", "Marginal of f(x_i), f(x_j)"; axes: input x, p(f(x)|θ), f(x).]

[Panels of Figure 1.2: "Random draws from GP", "Marginal density", "Marginal of f(x_i), f(x_j)".]
90. nki University of Technology Aalto University the Academy of Finland Finnish Funding Agency for Technology and Innovation TEKES and the Graduate School in Electronics and Telecommunica tions and Automation GETA Jarno Vanhatalo also thanks the Finnish Foundation for Economic and Technology Sciences KAUTE Finnish Cultural Foundation and Emil Aaltonen Foundation for supporting his post graduate studies during which the code package was prepared Vil viii CONTENTS Preface This is a manual for software package GPstuff which is a collection of Matlab func tions to build and analyse Bayesian models build over Gaussian processes The pur pose of the manual is to help people to use the software in their own work and possibly modify and extend the features The manual consist of two short introductory sections about Bayesian inference and Gaussian processes which introduce the topic and the notation used throughout the book The theory is not extensively covered and readers not familiar with it are suggested to see the references for a more complete discussion After the introductory part the Chapter 2 is devoted for inference techniques Gaus sian process models lead to analytically unsolvable models for which reason efficient approximative methods are essential The techniques implemented in GPstuff are in troduced in general level and references are given to direct the reader for more detailed discussion on the methods The rest of
91. ns
  A.4 Monte Carlo
  A.5 Miscellaneous
  A.6 Optimization
B Covariance functions
C Observation models
D Priors
E Transformation of hyperparameters

Acknowledgments

The coding of the GPstuff toolbox started in 2006, based on the MCMCStuff toolbox (1998-2006, http://www.lce.hut.fi/research/mm/mcmcstuff/), which was based on the Netlab toolbox (1996-2001, http://www.mathworks.com/matlabcentral/fileexchange/2654-netlab). The GPstuff toolbox has been developed by the BECS Bayes group at Aalto University. The main authors of GPstuff have been Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen and Aki Vehtari, but the package contains code written by many more people. In the Bayesian methodology group at the Department of Biomedical Engineering and Computational Science (BECS) at Aalto University these persons are, in alphabetical order, Toni Auranen, Roberto Calandra, Pasi Jylänki, Tuomas Nikoskinen, Eero Pennala, Heikki Peura, Ville Pietiläinen, Markus Siivola, Simo Särkkä and Ville Tolvanen. People outside Aalto University are, in alphabetical order, Christopher M. Bishop, Timothy A. Davis, Dirk-Jan Kroon, Ian T. Nabney, Radford M. Neal and Carl E. Rasmussen. We want to thank them all for sharing their code under a free software license. The research leading to GPstuff was funded by Helsi
92. observation model The CS covariance functions are used just as glob ally supported covariance functions but are much faster The inference in the demo is conducted with Laplace approximation and EP The code for Laplace approximation looks the following gpcfl gpcf_ppces2 nin 2 lengthScale 5 magnSigma2 0 05 pl prior_t pm prior_sqrtt s2 0 3 gpcfl gpcf_ppcs2 gpcfl lengthScale_prior pl magnSigma2_prior pm 7 3 LOG GAUSSIAN COX PROCESS 57 o Create the likelihood structure lik lik_negbin Create the GP data structure p gp_set lik lik teft gpcfl jitterSigma2 le 4 Q oe Set the approximate inference method to Laplace p gp_set gp latent_method Laplace Q o Set the options for the scaled conjugate optimization pt optimset TolFun 1le 2 TolX 1le 2 Display iter O o Optimize with the scaled conjugate gradient method gp gp_optim gp X Yy Zz ye opt opt C gp_trcov gp xx nnz C prod size C p amd C figure spy C p p make prediction to the data points Ef Varf gp_pred gp x Y X 2 ye 7 3 Log Gaussian Cox process Log Gaussian Cox process is an inhomogeneous Poisson process model used for point data with unknown intensity function A x modeled with log Gaussian process so that f x log A x see Rathbun and Cressie 1994 M ller
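As a rough, hedged sketch of how such a model could be set up by hand in GPstuff, one can bin the point data, model the bin counts with a Poisson likelihood and pass the bin widths as the expected number of cases through the 'z' argument, so that the latent function is the log intensity. The variable names (t for the event locations, nbins, xgrid) are illustrative only.

  % Bin one-dimensional point data t and fit a log-intensity GP (a sketch).
  nbins = 50;
  edges = linspace(min(t), max(t), nbins+1);
  yc = histc(t, edges);                 % counts per bin (last entry: exact hits on max(t))
  yc(end-1) = yc(end-1) + yc(end);      % fold the last entry into the final bin
  yc = yc(1:end-1);  yc = yc(:);
  xgrid = (edges(1:end-1) + edges(2:end))'/2;     % bin centres
  ze = ones(nbins,1) * (edges(2) - edges(1));     % bin widths as exposure
  gpcf = gpcf_sexp('lengthScale', 1, 'magnSigma2', 1);
  gp = gp_set('lik', lik_poisson(), 'cf', gpcf, ...
              'latent_method', 'Laplace', 'jitterSigma2', 1e-6);
  gp = gp_optim(gp, xgrid, yc, 'z', ze, 'opt', optimset('TolFun', 1e-3, 'TolX', 1e-3));
  [Ef, Varf] = gp_pred(gp, xgrid, yc, xgrid, 'z', ze);
  lambda_hat = exp(Ef);                 % plug-in estimate of the intensity per unit length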
93. on latent variables AN An Euclidean distance for Gaussian process models Function to modify the sparse cholesky factorization L D L C when a row and column k of C have changed Multiple rank update or downdate of a sparse LDL factorization Evaluate the sparsified inverse matrix A scaled hybric Monte Carlo samping for latent values A scaled Metropolis Hastings samping for latent values Matlab function to compile all the c files to mex in the GPstuff gp folder rograms Demonstration of Gaussian process model with binomial likelihood DEMO_BINOMIAL_APC Demonstration for modeling age period cohort data DEMO_CLASSIFIC DEMO_LGCP DEMO_MODELASSE DEMO_MODELASSE DEMO_NEURALNET DEMO_PERIODIC DEMO_REGRESSIO DEMO_REGRESSIO DEMO_REGRESSIO DEMO_REGRESSIO DEMO_REGRESSIO DEMO_REGRESSIO DEMO_REGRESSIO DEMO_REGRESSIO DEMO_SPATIAL1 DEMO_SPATIAL2 by a binomial model combined with GP prior Classification problem demonstration for 2 classes Demonstration for a log Gaussian Cox process with inference via EP or Laplace approximation SMENT1 Demonstration for model assesment with DIC number of effective parameters and ten fold cross validation SMENT2 Demonstration for model assesment when the observation model is non Gaussian WORKCOV Demonstration of Gaussian process with a neural network covariance function Regression problem demonstration for periodic data N1 Regression problem demonstration for 2 input functio
94. or for g is another GP prior g SP h x b K h x Bh x The predictive equations are obtained by using the mean and covariance of the prior 8 1 in the zero mean GP predictive equations 2 9 i g E f R Z 8 1 Cov g Cov f R Bo a HK H R 8 2 5 1 sirry fasi i B B HK H B b HK y R H HK K hi x h2 x H x is row vector hy x 63 64 CHAPTER 8 MEAN FUNCTIONS If the prior assumptions about the weights are vague then B7 O O is a zero matrix and the predictive equations 8 1 and 8 2 don t depend on b or B Ug E f R B 8 3 Cov g Cov f R HK H R 8 4 B HK H HK7 y Corresponding the exact and vague prior for the basis functions weights there are two versions of the marginal likelihood With exact prior the marginal likelihood is 1 1 I 1 log p y X b B 3M N M 5 log K 5 log B 5 log A 5 log 2m M H b y N K H BH A B HK H where n is the amount of observations Its derivative with respect to hyperparameters are dS ori 10K oc I X b B M L YNTIM aa og p y X b B 3M N 20 1 _ OK 1 _ 0A Ze x mt ge a 80 hai HK a0 Ku H With a vague prior the marginal likelihood is 1 z 1 log p y X T A 1 1 n m er log K 5 log A 5 log 27 Ay HK H HiT gla C K H A HK where m
95. orm of the GP prior. We can use simple parametric observation models that describe the assumed observation noise. The assumptions about the functional form of the phenomenon are encoded in the GP prior. Since a GP is a non-parametric model, we do not need to fix the functional form of the latent function, but we can give implicit statements about it. These statements are encoded in the mean and covariance function, which determine, for example, the smoothness and variability of the function.

1.2 Gaussian process models

The general form of the models in GPstuff can be written as follows:

observation model:  y | f, φ ~ ∏_{i=1}^{n} p(y_i | f_i, φ)   (1.3)
GP prior:           f(x) | θ ~ GP( m(x), k(x, x' | θ) )      (1.4)
hyperprior:         θ, φ ~ p(θ) p(φ)                          (1.5)

Here y = [y_1, ..., y_n]^T is a vector of observations (target values) at the input locations X = {x_i = [x_i,1, ..., x_i,d]^T}_{i=1}^{n}, and f(x): R^d -> R is a latent function with value f_i = f(x_i) at input location x_i. A boldface notation will denote a set of latent variables in a vector, f = [f_1, ..., f_n]^T. Here the inputs are real-valued vectors, but in general other inputs, such as strings or graphs, are possible as well. θ denotes the parameters of the covariance function k(x, x' | θ) and φ denotes the parameters of the observation model p(y | f, φ). The notation will be slightly abused since p(y | f, φ) is also used for the likelihood function. For simplicity of presentation the mean function is considered zero, m(x) ≡ 0.
96. oth treated here following Quifionero Candela and Rasmussen 30 CHAPTER 4 SPARSE GAUSSIAN PROCESSES 2005 The approximations are based on introducing an additional set of latent vari ables u u called inducing variables These correspond to a set of input locations X called inducing inputs The latent function prior is approximated as X X u 0 p u X 0 du 4 3 where q f X X u 0 is an inducing conditional The above decomposition leads to the exact prior if the true conditional f X X u N Keu Kilu u Kef Keu Kilu Kus is used However in FIC framework the latent variables are assumed to be condition ally independent given u in which case the inducing conditional factorizes q f X X u 0 Il ai fi X Xu u 0 In PIC latent variables are set in blocks which are conditionally independent of each others given u but the latent variables within a block have a multi variate normal distribution with the original covariance The approximate conditionals of FIC and PIC can be summarized as q f X Xu u 0 M N f Ke Kilu u mask Ke Ktu Kilu Kus M 4 4 where the function A mask M with matrix M of ones and zeros returns a ma trix A of size M and elements A if M 1 and A 0 otherwise An approximation with M I corresponds to FIC and an approximation where M is block diagonal corresponds to PIC The inducing inputs are given a zero mean Gaus sian prior u 0 X N 0 K
97. p(-√5 r)   (B.5)

Neural network covariance function gpcf_neuralnetwork

A neural network with a suitable transfer function and prior distribution converges to a GP as the number of hidden units in the network approaches infinity (Neal, 1996; Williams, 1996; Rasmussen and Williams, 2006). A nonstationary neural network covariance function is

k(x_i, x_j) = (2/π) sin^(-1)( 2 x̃_i^T Σ x̃_j / sqrt( (1 + 2 x̃_i^T Σ x̃_i)(1 + 2 x̃_j^T Σ x̃_j) ) ),   (B.6)

where x̃ = [1, x_1, ..., x_d]^T is an input vector augmented with 1, and Σ = diag(σ_0², σ_1², ..., σ_d²) is a diagonal weight prior, where σ_0² is the variance of the bias parameter controlling the function's offset from the origin. The variances for the weight parameters are σ_1², ..., σ_d², and with small values for the weights the neural network covariance function produces smooth and rigid looking functions. Larger values for the weight variances produce more flexible solutions.

Constant covariance function gpcf_constant

Perhaps the simplest covariance function is the constant covariance function

k(x_i, x_j) = σ²,   (B.7)

with variance parameter σ². This function can be used to implement the constant term in the dot-product covariance function (Rasmussen and Williams, 2006) reviewed below.

Linear covariance function gpcf_linear

The linear covariance function is

k(x_i, x_j) = x_i^T Σ x_j,   (B.8)

where the diagonal matrix Σ = diag(σ_1², ..., σ_d²) contains the prior variances of the linear model coefficients. Combining this with the constant function above
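As an illustrative, hedged sketch of the dot-product construction mentioned above, the constant and linear covariance functions can simply be summed in the cf cell array; the option names 'constSigma2' and 'coeffSigma2' are assumptions about the exact field names and may differ between GPstuff versions.

  % Sketch: first-order polynomial (dot product) covariance as constant + linear.
  gpcf_c = gpcf_constant('constSigma2', 1);     % assumed option name
  gpcf_l = gpcf_linear('coeffSigma2', 1);       % assumed option name
  lik = lik_gaussian('sigma2', 0.1);
  gp = gp_set('lik', lik, 'cf', {gpcf_c gpcf_l});
  gp = gp_optim(gp, x, y, 'opt', optimset('TolFun', 1e-3, 'TolX', 1e-3));
  [Ef, Varf] = gp_pred(gp, x, y, xt);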
98. p vart gp_set type VAR lik lik efr gpcf X_u X_u gp_dtc gp_set type DTC lik lik cf gpcf X_u X_u gp_sor gp_set type SOR lik lik cf gpcf X_u X_u 4 3 1 Comparison of sparse GP models Figure 4 2 shows the predictive mean of all the sparse approximations the mean of SOR is the same as that of DTC It should be noticed that variational approximation is closest to the full GP solution in Figure 3 2 The FIC and PIC approximations are most close to the full solution FIC works rather differently on one corner of the region whereas the latent surface predicted by PIC contains discontinuities DTC suffers most on the borders of the region 34 CHAPTER 4 SPARSE GAUSSIAN PROCESSES The sparse GP approximations are compared also in the demo demo_regression_sparse2 which demonstrates the differences between FIC variational and DTC in other context See also the discussions on the differences between these sparse approximations given by Quifionero Candela and Rasmussen 2005 Snelson 2007 Titsias 2009 Alvarez et al 2010 4 3 2 Sparse GP models with non Gaussian likelihoods The extension of sparse GP models to non Gaussian likelihoods is very straightforward in GPstuff User can define the sparse GP just as described in the previous two sections and then continue with the construction of likelihood exactly the same way as with a
99. parameters when the dimensionality of the parameters d is moderate or high In this setting the integration is considered as a quadratic design problem in a d dimensional space with the aim at finding points that allow to estimate the curvature of the posterior distribution around the mode The de sign used by Rue et al 2009 and Vanhatalo et al 2010 is the fractional factorial design Sanchez and Sanchez 2005 augmented with a center point and a group of 2d star points In this setting the design points are all on the surface of a d dimensional sphere and the star points consist of 2d points along each axis which is illustrated in Figure 2 2 c The integration is then a finite sum with special weights summarized by the equation 2 24 The design points are searched after transforming V into z space which is assumed to be a standard Gaussian variable The integration weights can then be determined from the statistics of a standard Gaussian variable E z z d E z 0 and E 1 1 which result in A e 1 exp fo D 7 2 27 for the points on the sphere and Ap 1 for the central point see appendix Vanhatalo et al 2010 for a more detailed derivation The CCD integration speeds up the com putations considerably compared to the grid search or Monte Carlo integration since 2 3 MARGINALIZATION OVER PARAMETERS 15 0 5 0 5 0 5 gl g 4 es 4 ne xe To 2 2 2 Z 15 1 5 Z 15
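In GPstuff the CCD integration is accessed through the same gp_ia interface that is used for grid integration elsewhere in this manual; the sketch below assumes that 'CCD' is an accepted value for the 'int_method' option.

  % CCD integration over the parameters (a sketch; gp, x, y, xt as before).
  [gp_array, P_TH, th, Eft_ia, Varft_ia] = gp_ia(gp, x, y, xt, 'int_method', 'CCD');
  % P_TH contains the weights of the design points th; a quick sanity check
  % is that no single weight dominates the sum.
  bar(P_TH)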
100. perations more stable 3 2 2 Inference with Laplace approximation The MAP estimate for the parameters can be found using go_optim as gp_set gp latent_method Laplace gp_optim gp x y opt opt gp gp The first line defines which inference method is used for the latent variables It initializes the Laplace algorithm and sets needed fields in the GP structure The default method for latent variables is Laplace so this line could be omitted The next line uses default fminscg optimization function with same the options as above gp_pred can be used to obtain the mean and variance for the latent variables and new observations together with the predictive probability for a test observation Eft_la Varft_la Eyt_la Varyt_la pyt_la gp_pred gp x y xt yt ones size xt 1 1 The first four input arguments are the same as in the section 3 the fifth and sixth ar guments are a parameter value pair Parameter yt tells that we give test observations related to xt as optional inputs for gp_pred Here we want to evaluate the probability to observe class 1 and thus we give a vector of ones as test observations The proba bility densities for test observations are returned in the fifth output argument pyt The training data together with predicted class probability contours is visualized in Figure 3 4 a 3 2 GAUSSIAN PROCESS CLASSIFICATION 23 1 0 5 0 05 a Laplace approximation
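For comparison, the same classification model can be refitted with expectation propagation simply by changing the latent method; the rest of the workflow is identical (a minimal sketch, reusing the optimiser options opt from above).

  % Switch the latent method to EP and redo the MAP estimation and prediction.
  gp = gp_set(gp, 'latent_method', 'EP');
  gp = gp_optim(gp, x, y, 'opt', opt);
  [Eft_ep, Varft_ep, Eyt_ep, Varyt_ep, pyt_ep] = ...
      gp_pred(gp, x, y, xt, 'yt', ones(size(xt,1),1));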
101. ply by using the expression of the con ditional mean Ej 9 4 f X k x X K7 f see equation 1 9 and the text below it Since this holds for any set of latent variables f we obtain a parametric posterior mean function mp KID 0 6 Eien olf D 0 6 df K X X10 Kil Be a lE 2 The posterior predictive covariance between any set of latent variables f can be eval uated from Covip o lf Erino Cova elfl Coven Eqelfl C6 where the first term simplifies to the conditional covariance in equation 1 9 and the second term can be written as k x X Ky Cove n o f Kp k X x Plugging these in the equation and simplifying gives us the posterior covariance function kp x X D 0 k x X 0 k x XO Ki K Cove iD o f Kj k X x 0 2 7 From this on the posterior predictive mean and covariance will be denoted m X and kp X Even if the exact posterior p f D 0 is not available in closed form we can still approximate its posterior mean and covariance functions if we can approximate E D 0 and Cove p 9 4 f A common practice to approximate the posterior p f D 0 o is either with MCMC e g Neal 1997 1998 Diggle et al 1998 Kuss and Ras mussen 2005 Christensen et al 2006 or by giving an analytic approximation to it e g Williams and Barber 1998 Gibbs and Mackay 2000 Minka 2001 Csat6 and Opper 2002 Rue et al 2009 The analytic approximations considered her
102. pproximates the posterior covariance via the curvature of the log density at that point The expectation propagation EP algorithm Minka 2001 for its part tries to mini mize the Kullback Leibler divergence from the true posterior to its approximation 1 i 5 alf D 0 6 7p 10 J ti Fil Zs fi 37 2 15 w 1 where the likelihood terms have been replaced by site functions t f Z i 52 i N fi jii 42 and the normalizing constant by Zgp EP algorithm updates the site pa rameters Z fi and a sequentially At each iteration first the 2 th site is removed from the i th marginal posterior to obtain a cavity distribution q fi fi D 0 ti fi Second step is to find a Gaussian q f to which the Kullback Leibler divergence from the cavity distribution multiplied by the exact likelihood for that site is minimized that is fi arg min KL q_ fi p yil fi la fi This is equivalent to matching the first and second moment of the two distributions Seeger 2008 The site terms Z are scaling parameters which ensure that Zgp p D 6 6 After the moments are solved the parameters of the local approximation t are updated so that the new marginal pos terior q fi ti f matches with the moments of f For last the parameters of the approximate posterior 2 15 are updated to give aE D 0 9 N f Ke Kee 2 tA Kee Kee Kep Ker 2 16 where diag 6 2 and ji ji1 fin
103. proximations exactly the same way as with full GP and this is demonstrated in demo_modelassessment1 for FIC and PIC 40 CHAPTER 5 MODEL ASSESSMENT AND COMPARISON 5 5 2 demo_modelassessment2 The functions gop_peff gp_dic and gp_kfcv work similarly for non Gaussian likelihoods as for a Gaussian one The only difference is that the integration over the latent variables is done approximately The way the latent variables are treated is defined in the field lLatent_method of the GP structure and this is initialized when constructing the model as discussed in the section 3 2 If we have conducted the analysis with Laplace approximation and MAP estimate for the parameters we can evaluate the DIC and k fold CV statistics as follows p_eff_latent gp_peff gp x y DIC_latent p_eff_latent2 gp_dic gp x y focus latent Evaluate the 10 fold cross validation results cvres gp_kfcv gp x y These are exactly the same lines as presented earlier with the Gaussian likelihood The difference is that the GP structure gp and the data x and y are different and the inte grations over latent variables in gp_dic and gp_kfcv are done with respect to the approximate conditional posterior q f l D which was assumed to be Laplace ap proximation The effective number of parameters returned by gp_peff is evaluated as in the equation 5 15 with the modification that 77 71 is replaced by W in the case of Laplace approximation and gt
104. regressionl xt1 xt2 meshgrid 1 8 0 1 1 8 1 8 0 1 1 8 xt sch 1089 KE2 3 9 Eft_map Varft_map gp_pred gp x y xt 3 1 3 Marginalization over parameters with grid integration To integrate over the parameters we can use any method described in the section 2 3 The grid integration is performed with the following line gp_array P_TH th Eft_ia Varft_ia fx_ia x_ia gp_ia gp x y xt int method grid gp_ia returns an array of GPs gp_array for parameter values th J M with weights P_TH p v D A Since we use the grid method the weights are propor 3 2 GAUSSIAN PROCESS CLASSIFICATION 21 tional to the marginal posterior and A 1Vi see section 2 3 2 The last four outputs are returned only if the prediction locations xt representing X are given Ef_ia and Varf_ia contain the predictive mean and variance at the prediction locations The last two output arguments can be used to plot the predictive distribution p fi ID as demonstrated in Figure 3 2 x_ia contains a regular grid of values fi and fx_ia contains p f D at those values The gp_ia optimizes the hyperparameters to their posterior mode evaluates the Hessian P with finite differences and makes a grid search starting from the mode Since we give the prediction inputs p representing X the integration method returns also the predictive distributions for those locations Otherwise gp_ia would return only the first three ar
105. ri 2008 and can be downloaded from http www image ucar edu Data The model is the same as described in the section 3 1 but now we use the piecewise polynomial covariance func tion kpp 2 gecf_ppcs2 GPstuff utilizes the sparse matrix routines from SuiteS parse written by Tim Davis http www cise ufl edu research sparse SuiteSparse and this package should be installed before using the compactly supported covariance func tions The user interface of GPstuff makes no difference between globally and compactly supported covariance functions but the code is optimized so that it uses sparse matrix routines whenever the covariance matrix is sparse Thus we can construct the model 4 2 FIC AND PIC SPARSE APPROXIMATIONS 29 2500 2000 1500 1000 500 a The nonzero elements of b The posterior predictive mean surface Krs Figure 4 1 The nonzero elements of Ke with kpp 2 function and the posterior predic tive mean of the latent function in the US precipitation data set find the MAP estimate for the parameters and predict to new input locations in a famil iar way pn priort tnu 4 s2 023 lik lik_gaussian sigma2 1 sigma2_prior pn pl2 prior gamma sh 5 is 1 pm2 prior_sgrtt nu 1 s2 150 gpcf2 gpcf_ppces2 nin nin lengthScale 1 2 magnSigma2 3 lengthScale_prior pl2 magnSigma2_prior pm2 gp gp_set lik lik
106. riance An older version of DTC is the subset of regressors SOR sparse approx imation which utilizes K a K u K However this resembles a singular Gaussian distribution and thus the PEIEE Vanai may be negative DTC tries to fix this prob lem by using K see Qui onero Candela and Rasmussen 2005 DTC and SOR are identical in other respects than in the predictive variance evaluation In spatial statis tics SOR has been used by Banerjee et al 2008 with a name Gaussian predictive process model The approximate prior of the variational approximation by Titsias 2009 is ex actly the same as that of DTC The difference between the two approximations is that in variational setting the inducing inputs and covariance function parameters are op timized differently The inducing inputs and parameters can be seen as variational parameters that should be chosen to maximize the variational lower bound between the true GP posterior and the sparse approximation This leads to optimization of modified log marginal likelihood V 0 Xu log N y 0 071 Qe str Kee Keru Ky Kus 4 9 with Gaussian observation model With non Gaussian observation model the varia tional lower bound is similar but o71 is replaced by W Laplace approximation or X EP Variational DTC and SOR sparse approximation in GPstuff The variational DTC and SOR sparse approximations are constructed similarly to FIC Only the type of the GP changes g
107. riance function Create a Matern nu 5 2 covariance function a neural network covariance function Create a periodic covariance function Create a piece wise polynomial q 0 covariance function Create a piece wise polynomial q 1 covariance function Create a piece wise polynomial q 2 covariance function Create a piece wise polynomial q 3 covariance function Create a product form covariance function Create a rational quadratic covariance function Create a squared exponential covariance function Likelihood functions LIK_GAUSSIAN LIK_GAUSSIANSMT Create LIK_BINOMIAL LIK_LOGIT LIK_NEGBIN LIK_POISSON LIK_PROBIT LIK_T Create a Gaussian likelihood structure a Gaussian scale mixture approximating t Create a binomial likelihood structure Create a Logit likelihood structure Create a Negbin likelihood structure Create a Poisson likelihood structure Create a Probit likelihood structure Create a Student t likelihood structure Inference utilities GP_E GP_G GP_EG GP_PRED GPEP_E GPEP_G GPEP_PRED Evaluate energy function un normalized negative marginal log posterior Evaluate gradient of energy GP_E for Gaussian Process Evaluate both GP_E and GP_G Useful in optimisation Make predictions with Gaussian process Conduct Expectation propagation and return negative marginal log posterior estimate Evaluate gradient of EP s negative marginal log posterior estimate Predictions with Gaussian Process EP approximation
108. riance function is kse 1 1101 kiinear 2 02 which is a mapping from R to R With the squared exponential covariance function the inputs to be used can be selected using a metric structure as follows Covariance function for the first input variable gpcf_sl gpcf_sexp magnSigma2 0 15 magnSigma2_prior pt create metric structure metricl metric_euclidean 1 lengthScales 0 5 lengthScales_prior pt set the metric to the covariance function structure gpcf_sl gpcf_sexp gpcf_sl metric metricl 6 4 PRODUCT OF COVARIANCE FUNCTIONS 47 Here we construct the covariance function structure just as before We also set a prior structure for the magnitude pt To modify squared exponential covariance function so that it depends only on a subset of inputs is done using a metric structure In this example we use met ric_euclidean which allows user to group the inputs so that all the inputs in one group are appointed to the same length scale The metric structure has function handles which then evaluate for example the distance with this modified euclidean metric For example for inputs x x Rt the modified distance could be t Vn 04 If 2 x3 x3 2 13 wa 24 13 6 5 where the second and third input dimension are given the same length scale This is dif ferent from the previously used r Saa x andr yoi x xi P that
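A hedged sketch of the grouped metric in equation (6.5) is given below: inputs 2 and 3 share one length-scale while inputs 1 and 4 get their own. The way the input groups are passed to metric_euclidean (here as a cell array in the first argument) is an assumption and may differ between GPstuff versions.

  % Sketch: grouped Euclidean metric with a shared length-scale for inputs 2 and 3.
  pt = prior_t();
  metric124 = metric_euclidean({[1], [2 3], [4]}, ...
                               'lengthScales', [0.5 0.5 0.5], ...
                               'lengthScales_prior', pt);
  gpcf_grouped = gpcf_sexp('magnSigma2', 0.15, 'magnSigma2_prior', pt);
  gpcf_grouped = gpcf_sexp(gpcf_grouped, 'metric', metric124);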
109. rior_t gpcfl gpcf_sexp lengthScale 1 magnSigma2 0 2 2 gpcfl gpcf_sexp gpcfl lengthScale_prior pl magnSigma2_prior pm Create the likelihood structure pn prior_logunif lik lik_t nu 4 nu_prior prior_logunif sigma2 10 sigma2_prior pn Finally create the GP data structure gp gp_set lik lik cf gpcfl jitterSigma2 le 6 latent_method Laplace Set the options for the scaled conjugate optimization opt optimset TolFun 1le 4 TolX 1le 4 Display iter Maxiter 20 Optimize with the scaled conjugate gradient method gp gp_optim gp x y opt opt Predictions to test points Eft Varft gp_pred gp x y xt The Student t observation model with MCMC When using MCMC for the Student t observation model we need to define sampling options for covariance function parameters latent variables and likelihood parameters After this we can run gp_mc and predict as before All theses steps are shown below pl prior_t pm prior_sqrtunif gpcfl gpcf_sexp lengthScale 1 magnSigma2 0 2 2 gpcfl gpcf_sexp gpcfl lengthScale_prior pl magnSigma2_prior pm Create the likelihood structure pn prior_logunif lik lik_t nu 4 nu_prior sigma2 10 sigma2_prior pn Finally create the GP data structure gp
110. ructure can not be used However a smaller set of input variables can be chosen similarly as with the linear covariance function using the field selectedVariables In this demo we consider additive neural network covariance func tions each having one input variable kyn 21 1101 knmn 2 25 02 In GPstuff this can be done as gpcf_nnl gpcf_neuralnetwork weightSigma2 1 biasSigma2 1 selectedVariables 1 gpcf_nnl gpcf_neuralnetwork gpcf_nnl weightSigma2_prior pt biasSigma2_prior pt gpcf_nn2 gpcf_neuralnetwork weightSigma2 1 biasSigma2 1 selectedVariables 2 gpcf_nn2 gpcf_neuralnetwork gpcf_nn2 weightSigma2_prior pt biasSigma2_prior pt gp gp_init lik lik ef gpcf_nnl gpcf_nn2 jitterSigma2 jitter The result from this and other six models are shown in Figure 6 5 6 4 Product of covariance functions A product of two or more covariance functions is a valid covariance function as well Such constructions may be useful in situations where the phenomenon is known to be separable Combining covariance functions into product form k x x ko x x is straightforward in GPstuff There is a special covariance function gpcf_prod for this purpose For example multiplying exponential and M ern covariance functions is done as follows 48 CHAPTER 6 COVARIANCE FUNCTIONS The predicted underlying function constant linear Th
111. run the hybrid Monte Carlo sampler with sampling options stored in the struc ture hmc_opt The default sampling options for HMC are set by hmc2_opt func tion The function thin removes the burn in from the sample chain here 50 and thins the chain with user defined parameter here 2 This way we can decrease the autocorrelation between the remaining samples GPstuff provides also other diagnostic tools for Markov chains and in detailed MCMC analysis we should use several start ing points and monitor the convergence carefully Gelman et al 2004 Robert and Casella 2004 Figure 3 3 shows an example of approximating a posterior distribution of parameters with Monte Carlo sampling The function gomc_preds returns the conditional predictive mean and variance for each sampled parameter value These are E ce x 0 0 s 1 M and Var x p 9 9 f s 1 M where M is the number of samples Marginal predictive mean and variance can be computed directly with gp_pred 3 2 Gaussian process classification We will now consider a binary GP classification see demo_classific with ob servations y 1 1 i 1 n associated with inputs X x _ The 22 CHAPTER 3 GPSTUFF BASICS REGRESSION AND CLASSIFICATION observations are considered to be drawn from a Bernoulli distribution with a success probability p y 1 x The probability is related to the latent function via a sig moid function that transforms it to a unit interv
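Returning to the sampled chains discussed at the beginning of this item, the convergence monitoring mentioned there can be sketched with the diagnostics in the diag folder. The record field used to pick the length-scale samples (r.cf{1}.lengthScale) and the ability of psrf to take the chains as separate arguments are assumptions about the exact interfaces.

  % Sketch: burn-in removal, thinning and a PSRF convergence check for two
  % chains r1 and r2 started from different initial values.
  rr1 = thin(r1, 50, 2);
  rr2 = thin(r2, 50, 2);
  R = psrf(rr1.cf{1}.lengthScale, rr2.cf{1}.lengthScale);  % values close to 1 indicate convergence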
112. rvation vector includes the partial derivative observations which are set as a column vector after function value observations Create the data tp 9 snumber of training points 1 K 2 5 tp 2 y sin x cos x 2 dy cos x 3 2xsin x 2 cos x koh 0 06 oe The underlying process f Derivative of the process noise standard deviation oe oe Add noise y y kohxrandn size y dy dy kohxrandn size dy derivative obs are also noisy x x dy dy 7 5 DERIVATIVE OBSERVATIONS IN GP REGRESSION 61 ee observation vector without derivative observations Ly dy observation vector with derivative observations The model constructed for regression is a full GP with a Gaussian likelihood The covariance function is squared exponential which is the only covariance function that is compatible with derivative observations at the moment The field derivobs is added into gp_set so that the inference is done with derivative observations Field DerivativeCheck should also be added to optimset when taking the predictions so that derivative observations can be used gpcfl gpcf_sexp lengthScale 0 5 magnSigma2 5 pl prior_t a prior structure pm prior_sqrtt a prior structure gpcfl gpcf_sexp gpcfl lengthScale_prior pl magnSigma2_prior pm Ae gp gp_set cf gpcfl derivobs on GP without derivative observations GP with derivati
113. s are similar since only o7I and Kg o71 y change in the approximations The multiple length scale model can be formed also using specific covariance func tions For example A rational quadratic covariance function gpcf_rq can be seen as a scale mixture of squared exponential covariance functions Rasmussen and Williams 2006 and could be useful for data that contain both local and global phenomena However using sparse approximations with the rational quadratic would prevent it from modeling local phenomena The additive model 6 2 suits better for sparse GP formalism since it enables to combine FIC with CS covariance functions As discussed in section 4 2 FIC can be interpreted as a realization of a special kind of covariance function By adding FIC with CS covariance function for example 4 1 one can construct a sparse additive GP prior which implies the latent variable prior f X Xu 0 N 0 Keu Kuh Ku A 6 4 This prior will be referred as CS FIC Here the matrix A kpp a X X is sparse with the same sparsity structure as in kpp q X X and it is fast to use in computa tions and cheap to store CS FIC can be extended to have more than one component However it should be remembered that FIC works well only for long length scale phe nomena and the computational benefits of CS functions are lost if their length scale gets too large Vanhatalo et al 2010 For this reason the CS FIC should be con structed so tha
114. s given in the equation 1 9 Similarly we can obtain a sample of y by drawing one y for each pO from pole 0 Q 2 2 Marginal likelihood given parameters In Bayesian inference we integrate over the unknown latent values to get the marginal likelihood In Gaussian case this is straightforward since it has an analytic solution see equation 2 8 and log marginal likelihood is 1 1 log p D 6 o 5 log 2r z log Ke 0 I 5 y Kes 0 I ty 2 18 If the observation model is not Gaussian the marginal likelihood needs to be ap proximated The Laplace approximation to the marginal likelihood denominator in the equation 2 1 is constructed for example by writing p D 6 p yl d p E X 0 df J exp g f df 19 after which a second order Taylor expansion around f is done for g This gives a 12 CHAPTER 2 INFERENCE AND PREDICTION Gaussian integral over f multiplied by a constant and results in the approximation 1 T A A 1 log p D 0 log a D 0 6 x 3f Kerf logp ylf 5 log B 2 20 where B I W Ks 5 W1 Rue et al 2009 use the same approximation when they define p 0 D p y 0 a D 0 _ where the denominator is the Laplace approximation in equation 2 11 see also Tierney and Kadane 1986 EP s marginal likelihood approximation is the normalization constant Zep J PEO J ZNG 82At 2 21 i 1 in equation 2 15 This is a Gaussian integr
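As a small practical check, the log marginal likelihood of equation (2.18) can be evaluated for the current parameter values through the energy function gp_e listed in Appendix A; gp_pak, which packs the parameters into a vector, is assumed to be available alongside it, and the returned energy also contains the log prior terms.

  % Sketch: evaluate the (negative) log marginal posterior at the current parameters.
  w = gp_pak(gp);            % parameters packed into a vector (assumed helper)
  e = gp_e(w, gp, x, y);     % unnormalized negative log marginal posterior
  fprintf('log marginal posterior (unnormalized): %.3f\n', -e);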
115. s is that roughly speaking one uses only the nonzero elements of the covariance matrix in the calculations This may speed 27 28 CHAPTER 4 SPARSE GAUSSIAN PROCESSES up the calculations substantially since in some situations only a fraction of the elements of the covariance matrix are non zero In practice efficient sparse matrix routines are needed Davis 2006 These are nowadays a standard utility in many statistical com puting packages such as MATLAB or R or available as an additional package for them The CS covariance functions have been rather widely studied in the geostatistics applications The early works concentrated on their theoretical properties and aimed to approximate the known globally supported covariance functions Gneiting 2002 Fur rer et al 2006 Moreaux 2008 There the computational speed up is obtained using efficient linear solvers for the prediction equation f K Krr o7 ly The pa rameters are fitted to either the empirical covariance or a globally supported covariance function Kaufman et al 2008 study the maximum likelihood estimates for tapered covariance functions i e products of globally supported and CS covariance functions where the magnitude can be solved analytically and the length scale is optimized using a line search in one dimension The benefits from a sparse covariance matrix have been immediate since the problems collapse to solving sparse linear systems However uti lizing the gradi
116. se of g Problems can be detected by monitoring the cumulative normalized weights and the estimate of the effective sample size Mett 1 So w where wi gt gt wi Geweke 1989 Gelman et al 2004 Vehtari and Lampinen 2002 In some situa tions the naive Gaussian or Student t proposal distribution is not adequate since the posterior distribution q V D may be non symmetric or the covariance estimate P is poor In these situations we use the scaled Student t proposal distribution proposed by Geweke 1989 In this approach the scale of the proposal distribution is adjusted along each main direction defined by P so that the importance weights are maximized Although Monte Carlo integration is more efficient than grid integration for high dimensional problems it also has its downside For most examples few hundred inde pendent samples are enough for reasonable posterior summaries Gelman et al 2004 which seems achievable The problem however is that we are not able to draw inde pendent samples from the posterior Even with a careful tuning of Markov chain sam plers the autocorrelation is usually so large that the required sample size is in thousands which makes Monte Carlo integration computationally very demanding compared to for example the MAP estimate p D 2 3 4 Central composite design Rue et al 2009 suggest a central composite design CCD for choosing the repre sentative points from the posterior of the
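The monitoring quantities for the importance sampling described above reduce to a few lines of plain Matlab; the variable names for the target and proposal log densities are illustrative only.

  % Sketch: normalized importance weights and effective sample size.
  logw = log_target - log_proposal;       % log of unnormalized weights (illustrative names)
  w = exp(logw - max(logw));              % stabilise before normalisation
  w = w / sum(w);
  m_eff = 1 / sum(w.^2);                  % effective sample size estimate
  cumw = cumsum(sort(w, 'descend'));      % cumulative normalized weights for monitoring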
117. ssess the predictive performance of the model by focusing on the model s predictive distribution Good 1952 Bernardo and Smith 2000 The posterior predictive distribution of an output yn 1 given the new input Xn 1 and the training data D x yi i 1 2 is obtained by marginalizing over the unknown latent variable and parameters 0 given the model M P Yn 1 Xn41 D M f rlineal ni 8 6 D M p xn41 D M dbd6 5 1 where p yn 1 Xn 1 0 D M J p ynsilfntis P fn4il Xn41 9 D M dfn4i In the following we will assume that knowing x does not give more information about or Q that is p0 ol Xn 1 D M p 0 D M 35 36 CHAPTER 5 MODEL ASSESSMENT AND COMPARISON To estimate the predictive performance of the model we would like to compare the posterior predictive distribution to future observations from the same process that generated the given set of training data D Agreement or discrepancy between the pre dictive distribution and the observations can be measured with a utility or loss function U Yn 1 Xn 1 D M Preferably the utility u would be application specific measur ing the expected benefit or cost of using the model Good generic utility function is the log score which is called log predictive likelihood log p yn 41 Xn41 D M when used for predictive density It measures how well the model estimates the whole predic tive distribution Bernardo 1979 and is thus especially useful in model
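In GPstuff the log score is easy to estimate for held-out data, since gp_pred returns the predictive densities of given test observations when they are passed with the 'yt' option (a minimal sketch for test data (xt, yt)).

  % Sketch: mean log predictive density over a held-out test set.
  [Eft, Varft, Eyt, Varyt, pyt] = gp_pred(gp, x, y, xt, 'yt', yt);
  mlpd = mean(log(pyt));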
118. ssian and thereby great variability on their values indicates that CCD has failed The posterior of the parameters may be far from a Gaussian distribution but for a suitable transformation the approximation may work well For example the covari ance function parameters should be transformed through logarithm as discussed in the section 3 1 CHAPTER 2 INFERENCE AND PREDICTION Chapter 3 GPstuff basics regression and classification In this chapter we illustrate the use of GPstuff package with two classical examples regression and classification The regression task has Gaussian observation model and forms an important special case of the GP models since it is the only model where we are able to marginalize over the latent variables analytically Thus it serves as a good starting point for demonstrating the use of the software package The classification problem is a usual text book example of tasks with non Gaussian observation model where we have to utilize the approximate methods discussed in the previous chapter This chapter serves also as a general introduction to the GPstuff software package In the next few sections we will introduce and discuss many of the functionalities of the package that will be present in more advanced models as well 3 1 Gaussian process regression demo_regression1 The demonstration program demo_regression1 considers a simple regression prob lem where we wish to infer a latent function f x given measur
119. study. In Power Industry Computer Application Conference Proceedings. IEEE Power Engineering Society.
Tierney, L. and Kadane, J. B. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82-86.
Titsias, M. K. (2009). Variational learning of inducing variables in sparse Gaussian processes. JMLR Workshop and Conference Proceedings, 5:567-574.
Tokdar, S. T. and Ghosh, J. K. (2007). Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137:34-42.
Vanhatalo, J., Pietiläinen, V., and Vehtari, A. (2010). Approximate inference for disease mapping with sparse Gaussian processes. Statistics in Medicine, 29(15):1580-1607.
Vanhatalo, J. and Vehtari, A. (2007). Sparse log Gaussian processes via MCMC for spatial epidemiology. JMLR Workshop and Conference Proceedings, 1:73-89.
Vanhatalo, J. and Vehtari, A. (2008). Modelling local and global phenomena with sparse Gaussian processes. In McAllester, D. A. and Myllymäki, P., editors, Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, pages 571-578.
Vanhatalo, J. and Vehtari, A. (2010). Speeding up the binary Gaussian process classification. In Grünwald, P. and Spirtes, P., editors, Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, pages 1-9.
Vehtari, A. and Lampinen, J. (2002). Bayesian
120. t al 2010 The third option is to evaluate DIC with focus on all the variables f 6 In this case the expectations are over p f 6 D 5 5 Model assessment demos The model assessment methods are demonstrated with the functions demo_modelassessmenttl and demo_modelassessment2 The former compares the sparse GP approxima tions to the full GP with regression data and the latter compares the logit and probit likelihoods in GP classification 5 5 1 demo_modelassessment1 Assume that we have built our regression model with a Gaussian noise and used opti mization method to find the MAP estimate for the parameters We evaluate the effective number of latent variables and DIC statistics p_eff_latent gp_peff gp x y DIC_latent p_eff_latent2 gp_dic gp x y focus latent where peg is evaluated with two different approximations Since we have the MAP estimate for the parameters the focus is on the latent variables In this case we can also use gp_peff which returns the effective number of parameters approximated as pol n K Ki 0 D7 5 15 Spiegelhalter et al 2002 When the focus is on the latent variables the function gp_dic evaluates the DIC statistics and the effective number of parameters as de scribed by the equations 5 13 and 5 14 The k fold CV expected utility estimate can be evaluated as follows cvres gp_kfcv gp x y 5 5 MODEL ASSESSMENT DEMOS 39 The gp_kfcv takes the re
121. t is considered fixed. (Footnote: It is possible to use gp_set to set the option infer_params to control which groups of parameters (covariance, likelihood or inducing) are meant to be inferred.) To make predictions for new locations X̃ given the training data (x, y), we can use the gp_pred function, which returns the posterior predictive mean and variance for each f(X̃); see equation (2.10). This is illustrated below, where we create a regular grid on which the posterior mean and variance are computed. The posterior mean m_p(X̃) and the training data points are shown in Figure 3.2.

[Figure 3.2: (a) The predictive mean surface and the training data points. (b) The marginal posterior predictive distributions p(f_i|D) for two latent variables in demo_regression1; histograms show the MCMC solution and the grid integration solution is drawn with a line.]

[Figure 3.3: The posterior distribution of the hyperparameters (length-scale 1, length-scale 2, magnitude, noise variance) together with the MAP estimate (black cross).] The results are from demo_
122. t possible long length scale phenomena are handled with FIC part and the short length scale phenomena with CS part The implementation of the CS FIC model follows closely the implementation of FIC and PIC for details see Vanhatalo and Vehtari 2008 2010 In the following sections we will demonstrate the additive models with two prob lems First we will consider full GP with covariance function that is a sum of periodic and squared exponential covariance function This GP prior is demonstrated for a Gaussian and non Gaussian likelihood The second demo concentrates on sparse GPs in additive models The FIC PIC and CS FIC sparse models are demonstrated with data set that contains both long and short length scale phenomena 6 2 1 Additive models demo demo_periodic In this section we will discuss the demonstration program demo_periodic This demonstrates the use of a periodic covariance function gocf_periodic with two data sets the Mauna Loa CO2 data see for example Rasmussen and Williams 2006 and the monthly Finnish drowning statistics 2002 2008 The first data is a regression problem with Gaussian noise whereas the second consist of count data that is modeled with Poisson observation model Here we will describe only the regression problem the other data can be examined by running the demo We will analyze the Mauna Loa CO2 data with two additive models The first one utilizes covariance function that is a sum of squared exponential and pie
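A condensed, hedged sketch of such a trend-plus-seasonal construction is shown below: a squared exponential component for the smooth trend and a periodic component for the seasonal variation, summed in the cf cell array. The period value 12 assumes monthly data, and the option name 'period' for gpcf_periodic is an assumption about the exact field name.

  % Sketch: additive GP with a smooth trend and a periodic seasonal component.
  pl = prior_t();  pm = prior_sqrtt();
  gpcf_trend = gpcf_sexp('lengthScale', 10, 'magnSigma2', 1, ...
                         'lengthScale_prior', pl, 'magnSigma2_prior', pm);
  gpcf_seas  = gpcf_periodic('lengthScale', 1, 'magnSigma2', 1, 'period', 12, ...
                             'lengthScale_prior', pl, 'magnSigma2_prior', pm);
  lik = lik_gaussian('sigma2', 0.1);
  gp = gp_set('lik', lik, 'cf', {gpcf_trend gpcf_seas});
  gp = gp_optim(gp, x, y, 'opt', optimset('TolFun', 1e-3, 'TolX', 1e-3));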
123. tadt et al 2007 Rasmussen and Williams 2006 All the covariance functions implemented in GPstuff are summa rized in the Chapter 8 By definition the marginal distribution of any subset of latent variables can be constructed by simply taking the appropriate submatrix of the covariance and subvector 4 CHAPTER 1 INTRODUCTION of the mean Imagine that we want to predict the values f at new input locations X The joint prior for latent variables f and f is f Krs a X X 6 N 0 fy 1 8 El Pe Ki where Ky k X X Kp 7 k X X 0 and K k X X Here the co variance function k denotes also vector and matrix valued functions k x X RI x RUD IX and k X X RIX x RIX RX The marginal distribu tion of f is p f X 0 N f O0 K like the marginal distribution of f given in 1 6 This is illustrated in Figure 1 1 The conditional distribution of a set of latent variables given another set of latent variables is Gaussian as well For example the distribution of given f is f X X 0 N K e Kyi f Kz Kz Kj K 5 1 9 which can be interpreted as the posterior predictive distribution for f after observ ing the function values at locations X The posterior distribution 1 9 generalizes to a Gaussian process with mean function m X 0 k x X 0 K f and covariance k X X 0 k X 0 k amp X 0 Kj K X X which define the posterior dis tribution of the latent fu
124. the manual discusses the basic models implemented in the package and demonstrates their usage This discussion begins from the Chapter 3 which considers simple Gaussian process regression and classification problems These examples serve also as examples how to use the basic functions in the GPstuff and all the rest of the examples build over the considerations in this chapter The rest of the chapters concen trate on more special model constructions such as sparse Gaussian processes additive models and various observations models Also functions for model assessment and comparison are discussed All these considerations are more or less individual exam ples that can be read if needed The essential parts that everyone should know are covered by the end of the chapter 3 Chapter 4 discusses how various sparse approximations can be used with GPstuff The Chapter 5 discusses methods provided for model assessment and comparison The Chapter 6 tells more about some covariance functions and how to combine them The Chapter 7 reviews some special observation models implemented in GPstuff The Chapter 8 reviews how non zero mean functios can be used in GPstuff The Appendix A lists the functions in the toolbox Contents m The Appendices B D collects techni cal details and lists of covariance functions observation models and prior distributions implemented The Appendix E discusses the reparameterization used and its effect to the implementation ix
125. u u so that the approximate prior over latent variables is q f X Xu 0 M N f 0 Key Ky Kus A 4 5 The matrix Kr Ky Kus is of rank m and A is a rank n block diagonal matrix The prior covariance above can be seen as a non stationary covariance function of its own where the inducing inputs X and the matrix M are free parameters similar to parameters which can be optimized alongside 0 Snelson and Ghahramani 2006 Lawrence 2007 The computational savings are obtained by using the Woodbury Sherman Morrison lemma e g Harville 1997 to invert the covariance matrix in 4 5 as Kru Kila Kup A A VVT 4 6 where V AT Ktu chol Ku u Kus AT Ktu 1 There is a similar result also for the determinant With FIC the computational time is dominated by the matrix multiplications which need time O m n With PIC the cost depends also on the sizes of the blocks in A If the blocks were of equal size b x b the time for inversion of A would be O n b x b O nb With blocks at most the size of the number of inducing inputs that is b m the computational cost in PIC and FIC are similar PIC approaches FIC in the limit of a block size one and the exact GP in the limit of a block size n Snelson 2007 FIC sparse approximation in GPstuff The same data that were discussed in the section 3 1 is analyzed with sparse approxi mations in the demo demo_regression_sparsel The sparse approximation is always a property of
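A minimal sketch of the FIC construction, patterned after the VAR, DTC and SOR calls shown elsewhere in this chapter, is given below for two-dimensional inputs: the inducing inputs X_u are placed on a regular grid over the training inputs and passed to gp_set through the 'X_u' option, and predictions at the training locations use the 'tstind' argument as explained earlier.

  % Sketch: FIC sparse GP with inducing inputs on a 5-by-5 grid.
  [u1, u2] = meshgrid(linspace(min(x(:,1)), max(x(:,1)), 5), ...
                      linspace(min(x(:,2)), max(x(:,2)), 5));
  X_u = [u1(:) u2(:)];
  gpcf = gpcf_sexp('lengthScale', [1 1], 'magnSigma2', 0.2^2);
  lik = lik_gaussian('sigma2', 0.2^2);
  gp_fic = gp_set('type', 'FIC', 'lik', lik, 'cf', gpcf, ...
                  'X_u', X_u, 'jitterSigma2', 1e-4);
  gp_fic = gp_optim(gp_fic, x, y, 'opt', optimset('TolFun', 1e-3, 'TolX', 1e-3));
  [Eft_fic, Varft_fic] = gp_pred(gp_fic, x, y, x, 'tstind', 1:size(x,1));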
126. und similarly as with the Gaus sian model discussed earlier The major difference is that now we need to sample also the latent variables We need to set the latent method to MCMC gp gp_set gp latent_method MCMC JjitterSigma2 le 4 For MCMC we need to add a larger jitter value to the diagonal to avoid numerical prob lems By default sampling from the latent value distribution p f 0 D is done using scaled_mh which implements scaled Metropolis Hastings algorithm Neal 1998 Other sampler provided by the GPstuff is a scaled HMC scaled_hmc Vanhatalo and Vehtari 2007 Below the actual sampling is performed similarly as with the Gaussian likelihood Set the parameters for MCMC hmc_opt hmc2_opt hmc_opt steps 10 hmc_opt stepadj 0 05 hmc_opt nsamples 1 latent_opt display 0 latent_opt repeat 20 latent_opt sample_latent_scale 0 5 hmc2 state sum 100 clock r g opt gp_mc gp x y hmc_opt hmc_opt latent_opt latent_opt nsamples 1 o re set some of the sampling options hmc_opt steps 4 hmc_opt stepadj 0 05 latent_opt repeat 5 hmc2 state sum 100 clock Sample rgp g opt gp_mc gp x Yy nsamples 400 hmc_opt hmc_opt latent_opt latent_opt record r Make predictions Efs_mc Varfs_mc Eys_mc Varys_mc Pys_mc gpmc_preds rgp x Yr xt yt ones size xt 1 1 Pyt_mc mean Pys_mc 2
127. ut cross-validation
    5.3.2 k-fold cross-validation
  5.4 DIC
  5.5 Model assessment demos
    5.5.1 demo_modelassessment1
    5.5.2 demo_modelassessment2
6 Covariance functions
  6.1 Neural network covariance function
  6.2 Additive models
    6.2.1 Additive models demo demo_periodic
    6.2.2 Additive models with sparse approximations
  6.3 Additive covariance functions with selected variables
  6.4 Product of covariance functions
7 Special observation models
  7.1 Robust regression with Student-t likelihood
    7.1.1 Regression with Student-t distribution
  7.2 Models for spatial epidemiology
    7.2.1 Disease mapping with Poisson likelihood demo_spatial1
    7.2.2 Disease mapping with negative binomial likelihood
  7.3 Log Gaussian Cox process
  7.4 Binomial observation model
  7.5 Derivative observations in GP regression
    7.5.1 GP regression with derivatives demo_derivativeobs
8 Mean functions
  8.1 Explicit basis functions
    8.1.1 Mean functions in GPstuff demo_regression_meanf
A Function list
  A.1 GP functions
  A.2 Diagnostic tools
  A.3 Distributio
128. p(y | f, σ², ν) = ∏_{i=1}^{n} Γ((ν+1)/2) / ( Γ(ν/2) √(νπ) σ ) ( 1 + (y_i - f_i)² / (νσ²) )^(-(ν+1)/2),   (7.1)

where ν is the degrees of freedom and σ the scale parameter (Gelman et al., 2004). The Student-t distribution is outlier prone of order 1, and it can reject up to m outliers if there are at least 2m observations in all (O'Hagan, 1979). The Student-t distribution can be utilized as such, or it can be written via the scale mixture representation

y_i | f_i, α, U_i ~ N(f_i, αU_i),   (7.2)
U_i ~ Inv-χ²(ν, τ²),   (7.3)

where each observation has its own noise variance αU_i that is Inv-χ² distributed (Neal, 1997; Gelman et al., 2004). The degrees of freedom ν corresponds to the degrees of freedom in the Student-t distribution, and ατ² corresponds to σ².

[Figure 7.1: An example of regression with outliers: (a) Gaussian observation model, (b) Student-t observation model. The real function is plotted with a black line.]

In GPstuff both of the representations are implemented. The scale mixture representation can be inferred only with MCMC, and the Student-t observation model with the Laplace approximation and MCMC.

7.1.1 Regression with Student-t distribution

Here we will discuss demo_regression_robust. The demo contains five parts: 1) optimization approach with Gaussian noise, 2) MCMC approach with the scale mixture noise model and all parameters sampled, 3) Laplace approximatio
129. [Figure 7.5: GP predictions of the process f with (panel b) and without (panel a) derivative observations. Panel (a): standard GP; panel (b): GP with derivative observations. Each panel shows the prediction with its 95% interval and the observations; axes are input x versus output y.]

Figure 7.5 shows the predictions, the observations, the underlying process and the 95% intervals for the GP model with and the GP model without derivative observations.

Chapter 8 Mean functions

In standard GP regression a zero mean function is assumed for the prior process. This is convenient, but there are nonetheless some advantages in using a specified mean function. The basic principle in doing GP regression with a mean function is to apply a zero mean GP to the difference between the observations and the mean function.

8.1 Explicit basis functions

Here we follow closely the presentation of Rasmussen and Williams (2006) on the subject and briefly present the main results. A mean function can be specified as a weighted sum of basis functions h, that is m(x) = h(x)^T β, with weights β. The target of modeling, the underlying process g, is assumed to be a sum of the mean function and a zero mean GP,

g(x) = h(x)^{\mathsf T} \beta + \mathcal{GP}(0, K).

By assuming a Gaussian prior for the weights, β ~ N(b, B), the weight parameters can be integrated out and the pri…
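The sentence above is cut off before the resulting prior is stated. The standard result from Rasmussen and Williams (2006) that it leads to, written here in generic notation that may differ slightly from the manual's, is that after integrating out β the process g is again a GP with a modified mean and covariance:

g(x) \sim \mathcal{GP}\left( h(x)^{\mathsf T} b,\; \kappa(x, x') + h(x)^{\mathsf T} B\, h(x') \right),

where κ is the covariance function of the zero mean GP. The prior uncertainty in the weights thus enters the model as an additional finite-rank term in the covariance.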
130. … x(trindex{i},:) belong to the i-th block.

    % Divide the training inputs x and the prediction inputs p into 4 x 4
    % blocks; b1 contains the block boundary coordinates in each input
    % dimension.
    mask = zeros(size(x,1), size(x,1));
    trindex = {}; tstindex = {};
    for i1 = 1:4
        for i2 = 1:4
            ind = 1:size(x,1);
            ind = ind(b1(i1) < x(ind,1) & x(ind,1) < b1(i1+1));
            ind = ind(b1(i2) < x(ind,2) & x(ind,2) < b1(i2+1));
            trindex{4*(i1-1)+i2} = ind';
            ind2 = 1:size(p,1);
            ind2 = ind2(b1(i1) < p(ind2,1) & p(ind2,1) < b1(i1+1));
            ind2 = ind2(b1(i2) < p(ind2,2) & p(ind2,2) < b1(i2+1));
            tstindex{4*(i1-1)+i2} = ind2';
        end
    end

Create the PIC GP data structure and set the inducing inputs and block indices:

    gpcf = gpcf_sexp('lengthScale', [1 1], 'magnSigma2', 0.2^2);
    lik  = lik_gaussian('sigma2', 0.2^2);
    gp_pic = gp_set('type', 'PIC', 'lik', lik, 'cf', gpcf, 'X_u', X_u, ...
                    'tr_index', trindex, 'jitterSigma2', 1e-4);

Now the cell array trindex contains the block index vectors for the training data. It means that, for example, the inputs and outputs x(trindex{i},:) and y(trindex{i}) belong to the i-th block. The optimization of the parameters and the inducing inputs is done in the same way as with FIC or a full GP model. In prediction, however, we have to give one extra input, tstind, to gp_pred. This defines how the prediction inputs are appointed to the blocks, in the same manner as trindex appoints the training inputs:

    Eft_pic = gp_pred(gp_pic, x, y, xt, 'tstind', tstindex);
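As a minimal sketch of the optimization step mentioned above, assuming the optimizer options are set as in the earlier examples, this could look as follows:

    % Hedged sketch: optimize the PIC model parameters (and inducing inputs)
    % and predict again at the test inputs xt.
    opt = optimset('TolFun', 1e-3, 'TolX', 1e-3, 'Display', 'iter');
    gp_pic = gp_optim(gp_pic, x, y, 'opt', opt);
    [Eft_pic, Varft_pic] = gp_pred(gp_pic, x, y, xt, 'tstind', tstindex);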
131. … (D.6) where σ² is a scale parameter and ν > 0 is the degrees of freedom parameter.

Gamma prior, prior_gamma. The gamma distribution is parametrized as

p(\theta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \theta^{\alpha-1} e^{-\beta\theta},   (D.7)

where α > 0 is a shape parameter and β > 0 is an inverse scale parameter.

Inverse gamma prior, prior_invgamma. The inverse gamma distribution is parametrized as

p(\theta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \theta^{-\alpha-1} e^{-\beta/\theta},   (D.8)

where α > 0 is a shape parameter and β > 0 is a scale parameter.

Uniform prior, prior_unif. The uniform prior is parametrized as

p(\theta) \propto 1.   (D.9)

Square root uniform prior, prior_sqrtunif. The square root uniform prior is parametrized as

p(\sqrt{\theta}) \propto 1.   (D.10)

Log uniform prior, prior_logunif. The log uniform prior is parametrized as

p(\log \theta) \propto 1.   (D.11)

Log-log uniform prior, prior_loglogunif. The log-log uniform prior is parametrized as

p(\log(\log \theta)) \propto 1.   (D.12)

Appendix E Transformation of hyperparameters

The inference on the parameters of the covariance functions is conducted mainly in a transformed space. The most often used transformation is the log transformation, which has the advantage that the parameter space (0, ∞) is mapped to (−∞, ∞). The change of parametrization has to be taken into account in the evaluation of the probability densities of the model. If a parameter θ with probability density p_θ is transformed into the parameter w = f(θ) with an equal number of components, the probability …
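The sentence above is cut off before the change-of-variables formula is given. For completeness, the standard result it leads to is (in generic notation, not necessarily the manual's):

p_w(w) = p_\theta\!\left(f^{-1}(w)\right) \left| \det \frac{\partial f^{-1}(w)}{\partial w} \right|,

and in particular for the log transformation w = \log\theta, so that \theta = e^{w},

p_w(w) = p_\theta(e^{w})\, e^{w},

i.e. the log density in the transformed space picks up an additive Jacobian term \log\theta, which must be included when evaluating log prior and log posterior densities in the transformed space.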
