Practical Guide to Gaussian Processes

Preface
This book gives an introduction to Gaussian processes and shows a selection of their many applications. The introduction is aimed at users who want to apply the technique to solve practical engineering problems. Using application examples, it shows how Gaussian processes can be used in machine learning to infer from known to unknown situations. The book also serves as a reference for common analytical representations of Gaussian processes and for mathematical operations and methods in specific use cases.

Introduction
A Gaussian process is a stochastic process with the property that every finite subset of its values follows a multivariate normal (Gaussian) distribution. A stochastic process is a function whose values are random variables that follow a given probability distribution. This makes it possible to model, in terms of probabilities, functions whose values cannot be completely determined due to a lack of information. A Gaussian process is constructed from functions of mean values, variances and covariances and thus describes the function values as a continuum of correlated random variables in the form of an infinite-dimensional normal distribution. The distribution of a Gaussian process can be imagined as a probability distribution of functions. A sample drawn from it yields a random function with certain preferred properties of its curve shape.

Applications
Gaussian processes are used for the mathematical modeling of non-deterministic systems on the basis of stochastic quantities or observations. They are suitable for signal analysis and synthesis, form a powerful tool for the interpolation, extrapolation, or smoothing of discrete measurement points in any number of dimensions (Gaussian process regression or kriging), and find application in classification problems. Gaussian processes, which are related to kernel methods, can be used as a supervised machine learning technique for abstract modeling based on training examples. This Bayesian approach to machine learning has the advantage that it often does not require iterative training as neural networks do. Instead, Gaussian processes can be derived efficiently with linear algebra from statistical quantities of the examples, and they remain mathematically interpretable and well controllable. Moreover, for interpolations and predictions, an associated confidence interval is computed for each individual output value, which estimates the prediction error and accounts for error propagation when the variance of the input values is known.

Definition
A Gaussian process is a stochastic process $$(X_t)_{t \in T}$$ on an arbitrary index set $$T$$ whose finite-dimensional distributions are multivariate normal (Gaussian) distributions. That is, for all $$t_1, t_2, \dotsc, t_n \in T$$, the joint distribution of $$(X_{t_1}, X_{t_2}, \dotsc, X_{t_n})$$ is an n-dimensional normal distribution.

Term: Although the term Gaussian process may suggest temporal or sequential processes, no such restriction exists. In a generalized sense, process can be understood as a continuum.

Notation
In analogy to the one- and multidimensional Gaussian distribution, a Gaussian process is completely and uniquely determined by its first two moments. In the multidimensional Gaussian distribution, these are the expected value vector or mean vector $$\vec\mu$$ and the covariance matrix $$\Sigma$$. For the description of a Gaussian process, these are replaced by an expected value function or mean function
 * $$m(t) := \mathbb{E}(X_t),\quad t \in T$$

and a covariance function
 * $$k(t,t') := \operatorname{Cov}(X_t, X_{t'}):= \mathbb{E}\left[(X_t- m(t))\cdot(X_{t'}- m(t'))\right], \quad t,t' \in T$$.

These functions can be understood in the simplest one-dimensional case as a vector with continuous rows and as a matrix with continuous rows and columns. The following table compares one-dimensional and multidimensional Gaussian distributions with Gaussian processes. The tilde symbol $$\sim$$ can be read as "is distributed as".

The probability density function of a Gaussian process cannot be represented analytically because there is no corresponding notation for operations with continuous matrices. This gives the impression that one cannot perform computations with Gaussian processes in the same way as with finite-dimensional normal distributions. However, the essential property of the Gaussian process is not the infinity of the dimensions, but rather the assignment of the dimensions to the coordinates of a function. In practical applications, one always has to deal with a finite number of interpolation points and can therefore perform all calculations as in the finite-dimensional case. The limit to infinitely many dimensions is only needed in an intermediate step, namely if values are to be read out at new interpolated grid points. In this intermediate step, the Gaussian process, i.e. the mean function and covariance function, is represented or approximated by suitable analytical expressions. In this case the assignment to the grid points is done via the parameterized coordinates $$t$$ in the analytical expression. In the finite-dimensional case with discrete grid points the associated coordinates $$t_i$$ are assigned to the dimensions by their indices.

Example of a Gaussian process
As a simple real-world example, consider a Gaussian process


 * $$(X_t)_{t \in T} \sim \mathcal {GP}(m(t),k(t,t'))$$

with a scalar variable $$t$$ (time), given by the mean function


 * $$m(t) = 5 \, \text{Volt}$$

and covariance function


 * $$k(t,t')=\begin{cases} (1\,\text{Volt})^2 & t=t'\\ 0 & t \ne t' \end{cases}$$

This Gaussian process describes an endless temporal electrical signal with Gaussian white noise with a standard deviation of one volt centered around a mean voltage of 5 volts.
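The example can be reproduced numerically on a finite set of time points, where the process reduces to an ordinary multivariate normal distribution. The following is a minimal sketch in Python with NumPy; the number of sample times, the seed and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)            # arbitrary fixed seed
t = np.linspace(0.0, 1.0, 1000)           # assumed finite set of sample times

mean = np.full(t.size, 5.0)               # m(t) = 5 V at every sample time
cov = 1.0**2 * np.eye(t.size)             # k(t,t') = (1 V)^2 if t == t', else 0

# One realization of the process restricted to the sample times, in volts.
signal = rng.multivariate_normal(mean, cov)

# Sample mean and standard deviation: near 5 and 1, up to sampling error.
print(signal.mean(), signal.std())
```

Because the covariance matrix is diagonal, the same realization could be drawn point by point from independent normal distributions; the matrix form is used here only to match the notation of the text.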

Definitions of special properties
A Gaussian process is called centered if its expected value or mean is constantly 0, that is, if $$m(t) := \mathbb{E}(X_t) = 0$$ for all $$t \in T$$.

A covariance function $$k(t,t') := \operatorname{Cov}(X_t, X_{t'})$$ is called stationary when it is translation invariant, that is, it can be described by a relative function $$k(t,t') = k(t-t')$$.

A Gaussian process is called stationary (or translation invariant) if its covariance function is stationary and its mean is constant.

A covariance function is called radial when the function $$k(t,t') = k(|t-t'|)$$ is radially symmetric, that is, when it depends only on the one-dimensional distance given by the Euclidean norm $$|\cdot|$$. It is used to describe systems with isotropic model properties.

List of Common Gaussian Processes and Covariance Functions

 * Constant:  $$m(t) = 0$$ and $$k(t,t') = \sigma^2$$
 * Corresponds to a constant value from a Gaussian distribution with standard deviation $$\sigma$$.


 * Offset:  $$m(t) = c$$ and $$k(t,t') = 0$$
 * Corresponds to a constant value given by $$c$$.


 * Gaussian White noise: $$k(t,t')=\sigma^2\delta_{t,t'}$$
 * ($$\sigma$$: standard deviation, $$\delta$$: Kronecker delta)


 * Rational quadratic:  $$ k(r) = (1+r^2)^{-\alpha}, \quad \alpha \geq 0$$
 * Gamma-exponential:  $$ k(r) = \exp \left(-\left(\frac{r} \ell \right)^\gamma\right)$$
 * Ornstein-Uhlenbeck: $$ k(r) = \exp \left(-\frac{r} \ell \right)$$
 * Corresponds to a simple Gauss-Markov process and describes continuous, non-differentiable functions, as well as white noise after passing through an RC low-pass filter.


 * Squared exponential:  $$ k(r) = \exp \Big(-\frac{r^2}{2\ell^2} \Big)$$
 * Describes infinitely smooth differentiable functions.


 * Matérn: 
 * $$ k_{\nu=p+1/2}(r) = \exp\left(-\frac{\sqrt{2\nu}r}{\ell}\right)\frac{\Gamma(p+1)}{\Gamma(2p+1)}\sum_{i=0}^p\frac{(p+i)!}{i!(p-i)!}\left(\frac{\sqrt{8\nu}r}{\ell}\right)^{p-i}$$
 * A highly versatile Gaussian process used to describe most typical measurement curves. The functions of the Gaussian process are $$n$$ times continuously differentiable if $$\nu>n$$. Covariance functions with $$\nu=1/2$$, $$3/2$$, $$5/2$$, etc. correspond to white noise that has passed through 1, 2, or 3 RC low-pass filters or has been convolved with the function $$\exp \left(-|x| \right)$$. Common special cases include:
 * $$k_{\nu=3/2}(r) = \left(1+\frac{\sqrt{3}r}\ell\right)\exp\left(-\frac{\sqrt{3}r}\ell\right)$$
 * $$k_{\nu=5/2}(r) =\left(1+\frac{\sqrt{5}r}\ell+\frac{5r^2}{3\ell^2}\right)\exp\left(-\frac{\sqrt{5}r}\ell\right) $$
 * $$k_{\nu=1/2}(r)$$ corresponds to the Ornstein-Uhlenbeck covariance function, and $$k_{\nu \rightarrow \infty}(r)$$ corresponds to the squared exponential function.


 * Periodic:  $$ k(r) = \exp\left(-\frac{ 2\sin^2\left(\pi\frac r {T} \right)}{\ell^2} \right)$$
 * Functions from this Gaussian process are both periodic with period $$T$$ and smooth (squared exponential). If the square around the sine is replaced by the absolute value, non-smooth periodic functions result.


 * Polynomial:  $$k(t,t') = \left(t^\top t'+\sigma_0^2\right)^p$$
 * Grows rapidly outward and is usually a poor choice for regression problems, but can be useful in high-dimensional classification problems. It is positive semidefinite and does not necessarily generate invertible covariance matrices.


 * Brownian bridge: $$m(t)=0$$ and $$k(t,t')=\min(t,t') - t t'$$
 * Wiener process: $$m(t)=0$$ and $$k(t,t')=\min(t,t')$$
 * Corresponds to the Brownian motion or integral over Gaussian white noise.


 * Itō process: If $$T=\mathbb{R}_{+} $$ and $$f$$, $$g$$ are two integrable real-valued functions and $$(W_t)$$ is a Wiener process, then the Itō process
 * $$X_t= \int_0^t f(s) \, \mathrm ds + \int_0^t g(s) \, \mathrm dW_s $$
 * is a Gaussian process with $$m(t) = \int_0^t f(s)\, \mathrm ds $$ and $$ k(t,t') = \int_0^{\min(t,{t'})} g^2(s) \, \mathrm ds $$.

Remarks:
 * $$r := \|t-t'\|$$ is the distance for stationary and radial covariance functions $$k(t,t')=k(r)$$.
 * $$\ell$$ is the characteristic length scale of the covariance function where the correlation has decayed to about $$e^{-1}$$.
 * Most stationary covariance functions $$k(r)$$ are normalized to $$k(0)=1$$ and are therefore equivalent to correlation functions. For use as covariance functions, they are multiplied by a variance $$\sigma^2$$, which assigns the variables a scaling and/or physical unit.
 * Covariance functions cannot be arbitrary functions $$k(r)$$ or $$k(t,t')$$, as it must be ensured that they are positive definite. Positive semidefinite functions are also valid covariance functions, but it should be noted that these do not necessarily result in invertible covariance matrices and are therefore usually combined with a positive definite function.
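A few of the stationary covariance functions listed above can be written down directly as functions of the distance $$r$$. The sketch below (Python with NumPy) is only an illustration: the function names, the grid of points and the parameter values are assumptions, and the eigenvalue check is a numerical spot check on one grid, not a proof of positive definiteness.

```python
import numpy as np

def squared_exponential(r, ell=1.0):
    return np.exp(-r**2 / (2.0 * ell**2))

def ornstein_uhlenbeck(r, ell=1.0):
    return np.exp(-r / ell)

def matern32(r, ell=1.0):
    a = np.sqrt(3.0) * r / ell
    return (1.0 + a) * np.exp(-a)

def matern52(r, ell=1.0):
    a = np.sqrt(5.0) * r / ell
    return (1.0 + a + a**2 / 3.0) * np.exp(-a)

def periodic(r, ell=1.0, period=1.0):
    return np.exp(-2.0 * np.sin(np.pi * r / period)**2 / ell**2)

def covariance_matrix(kernel, t, sigma=1.0, **params):
    """Sigma_ij = sigma^2 * k(|t_i - t_j|) for one-dimensional points t."""
    r = np.abs(t[:, None] - t[None, :])
    return sigma**2 * kernel(r, **params)

t = np.linspace(0.0, 5.0, 30)             # assumed grid of support points
min_eigenvalue = {}
for kernel in (squared_exponential, ornstein_uhlenbeck,
               matern32, matern52, periodic):
    K = covariance_matrix(kernel, t, sigma=2.0)
    # A valid covariance matrix has no negative eigenvalues
    # (tiny negative values can appear from rounding).
    min_eigenvalue[kernel.__name__] = np.linalg.eigvalsh(K).min()

print(min_eigenvalue)
```

Each kernel is normalized to $$k(0)=1$$, so the factor `sigma**2` sets the variance on the diagonal of the matrix, as described in the remarks.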

Mathematical operations with Gaussian processes
Gaussian processes (or normal distributions) can be used to perform various stochastic operations that allow different functions with normally distributed errors to be joined or extracted from each other. If there are cross-correlations between the functions, it is assumed that they follow a joint normal distribution. In signal processing, for example, the operations are used to handle temporal signals and their measurement uncertainties. The distributions of these functions are described in the following operations in vector and matrix notation for finitely many interpolation points $$y \sim \mathcal N\left(\mu, \Sigma \right)$$, which analogously applies to arbitrary mean functions $$m(t)$$ and covariance functions $$k(t,t')$$. The normally distributed vectors ($$y_1$$, $$y_2$$ etc.) are described as functions accordingly.

Addition: uncorrelated functions
If the sum of two independent (and therefore uncorrelated) functions is formed, their mean functions and their covariance functions add up:
 * $$y_1+y_2 \sim \mathcal N\left(\mu_1, \Sigma_1 \right) + \mathcal N\left(\mu_2, \Sigma_2 \right) = \mathcal N\left(\mu_1+\mu_2, \Sigma_1+\Sigma_2 \right)$$.

The associated probability density functions thereby undergo a convolution.

Addition: correlated functions
Correlated functions can in an extreme case be identical or differ only by constant factors. The sum then corresponds to a multiplication with the added factors. If both functions are identical, the result is $$y+y=2y \sim \mathcal N\left(2\mu, 4\Sigma \right)$$.

Difference: uncorrelated functions
If the difference of two independent (and therefore uncorrelated) functions is formed, their mean functions are subtracted while their covariance functions are added:
 * $$y_1-y_2 \sim \mathcal N\left(\mu_1, \Sigma_1 \right) - \mathcal N\left(\mu_2, \Sigma_2 \right) = \mathcal N\left(\mu_1-\mu_2, \Sigma_1+\Sigma_2 \right)$$.

Subtraction of a Correlated Component
If the function $$y_2$$ of one Gaussian process describes an additive component of the function $$y_1$$ of another Gaussian process, then removing this component subtracts both the mean function and the covariance function:
 * $$y_1-y_2 \sim \mathcal N\left(\mu_1, \Sigma_1 \right) \setminus \mathcal N\left(\mu_2, \Sigma_2 \right) = \mathcal N\left(\mu_1-\mu_2, \Sigma_1-\Sigma_2 \right)$$

The backslash operator $$\setminus$$ was symbolically used here in the sense of "without the contained component".

Multiplication
The following multiplication with an arbitrary matrix $$F$$ also includes the special cases of the product with a function (diagonal matrix $$F$$) or with a scalar ($$F=c\cdot\mathbb{I}$$):
 * $$Fy \sim F\cdot\mathcal N\left(\mu, \Sigma \right) = \mathcal N\left(F\mu, F\Sigma F^\top \right)$$

It should be noted here that the product of the functions of two Gaussian processes with each other would not result in another Gaussian process, since the resulting probability distribution would have lost the property of being Gaussian or normal.

General linear transformation
All previously shown operations are special cases of the general linear transformation:
 * $$A\cdot\mathcal N\left(\mu_1, \Sigma_1 \right) + B\cdot\mathcal N\left(\mu_2, \Sigma_2 \right) = \mathcal N\left(A\mu_1+B\mu_2, A\Sigma_1 A^\top+B\Sigma_2 B^\top + A\Sigma_{12} B^\top+B\Sigma_{12}^\top A^\top \right)$$

This relation describes the sum $$A\cdot y_1 + B\cdot y_2$$ with constant matrices $$A$$ and $$B$$ and the support point vectors $$y_1$$ and $$y_2$$ of the functions of two Gaussian processes with $$y_1 \sim \mathcal N\left(\mu_1, \Sigma_1 \right)$$ and $$y_2 \sim \mathcal N\left(\mu_2, \Sigma_2 \right)$$. For partially correlated functions $$y_1$$ and $$y_2$$, the cross-covariance matrix $$\Sigma_{12} = \operatorname{Cov}(y_1, y_2)$$ must be given and, as a precondition, all variables must be jointly normal (i.e. they must follow a common multivariate normal distribution). In this case the sum $$A\cdot y_1 + B\cdot y_2$$ is correlated with $$y_1$$ by the cross-covariance matrix $$A\Sigma_{1}+B\Sigma_{12}^\top$$ and with $$y_2$$ by $$A\Sigma_{12}+B\Sigma_{2}$$. A cross-covariance matrix $$\Sigma_{XY}$$ between two functions $$X$$ and $$Y$$ can be converted into a cross-correlation matrix $$C_{XY}$$ using their covariance matrices $$\Sigma_{X}$$ and $$\Sigma_{Y}$$ through the relation $$\left[C_{XY}\right]_{ij} = \left[\Sigma_{XY}\right]_{ij}/\sqrt{\left[\Sigma_X\right]_{ii}\left[\Sigma_Y\right]_{jj}}$$. In the case of two partially correlated Gaussian processes, it should be noted that special dependencies may exist in which the two are individually but not jointly normal. The sum then does not result in a normal distribution and the equation accordingly loses its validity, although both input quantities are normally distributed.
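The general linear transformation can be checked numerically by stacking $$y_1$$ and $$y_2$$ into one joint vector and applying the multiplication rule with the block matrix $$F = (A\ B)$$. The sketch below (Python with NumPy) uses randomly generated matrices, which are assumptions for illustration only; $$\Sigma_{12}$$ is taken as $$\operatorname{Cov}(y_1, y_2)$$.

```python
import numpy as np

rng = np.random.default_rng(1)            # arbitrary fixed seed
n = 3

# Random but valid joint covariance of the stacked vector (y1, y2).
G = rng.normal(size=(2 * n, 2 * n))
joint = G @ G.T
S1, S12, S2 = joint[:n, :n], joint[:n, n:], joint[n:, n:]
mu1, mu2 = rng.normal(size=n), rng.normal(size=n)
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))

# Mean and covariance of A*y1 + B*y2 according to the formula in the text.
mean_sum = A @ mu1 + B @ mu2
cov_sum = (A @ S1 @ A.T + B @ S2 @ B.T
           + A @ S12 @ B.T + B @ S12.T @ A.T)

# Same result from the multiplication rule applied to the stacked vector.
F = np.hstack([A, B])
mean_check = F @ np.concatenate([mu1, mu2])
cov_check = F @ joint @ F.T

# Cross-covariances of the sum with y1 and with y2.
cross_y1 = A @ S1 + B @ S12.T
cross_y2 = A @ S12 + B @ S2
cross_check = F @ joint                   # first n columns: y1, last n: y2

print(np.allclose(cov_sum, cov_check), np.allclose(mean_sum, mean_check))
```

Setting $$A = B = \mathbb{I}$$ and $$\Sigma_{12} = 0$$ in this sketch recovers the addition rule for uncorrelated functions.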

Fusion
If the same unknown function is described by two different Gaussian processes whose errors are uncorrelated with each other, the two pieces of partial information can be combined by a union or fusion (also sensor fusion) to reduce the error or variance. For example, in signal processing, the same waveform may be measured by two different sensors (such as the trajectory of an aircraft by an inertial sensor and independently by GNSS positioning), each of which adds its own independent noise or error signal. The joint distribution
 * $$\Sigma_\text{Fusion} = \left(\Sigma_1^{-1} + \Sigma_2^{-1}\right)^{-1}$$
 * $$\mu_\text{Fusion} = \Sigma_\text{Fusion}\Sigma_1^{-1}\mu_1 + \Sigma_\text{Fusion}\Sigma_2^{-1}\mu_2$$

corresponds to the overlap or the normalized product of the two probability density functions and describes the most likely Gaussian process taking into account both parts of information (see also Inverse-variance weighting). The expressions can also be rearranged, such that only one matrix inversion needs to be performed:
 * $$\mu_\text{Fusion} = \mu_1 - \Sigma_1\left(\Sigma_1 + \Sigma_2\right)^{-1}\left(\mu_1 - \mu_2\right) = \Sigma_2\left(\Sigma_1 + \Sigma_2\right)^{-1}\mu_1 + \Sigma_1\left(\Sigma_1 + \Sigma_2\right)^{-1}\mu_2$$
 * $$\Sigma_\text{Fusion} = \Sigma_1 - \Sigma_1\left(\Sigma_1 + \Sigma_2\right)^{-1}\Sigma_1 = \Sigma_1\left(\Sigma_1 + \Sigma_2\right)^{-1}\Sigma_2$$

The validity of the formula requires function pairs with entirely uncorrelated errors. However, if there is partial correlation with cross-covariance $$\Sigma_{12}$$, then the extended and generalized formula, the so-called Bar-Shalom-Campo fusion, applies, where the correlated part is temporarily subtracted and then added back after fusion:
 * $$\mu_\text{Fusion} = \mu_1 - (\Sigma_1 - \Sigma_{12})(\Sigma_1 + \Sigma_2 - \Sigma_{12} - \Sigma_{21})^{-1}(\mu_1-\mu_2)$$
 * $$\Sigma_\text{Fusion} = \Sigma_1 - (\Sigma_1 - \Sigma_{12})(\Sigma_1 + \Sigma_2 - \Sigma_{12} - \Sigma_{21})^{-1}(\Sigma_1 - \Sigma_{21})$$
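The fusion formulas translate almost literally into code. The following sketch (Python with NumPy) implements the single-inverse form including the optional cross-covariance $$\Sigma_{12}$$ and compares it, for the uncorrelated case, with the form based on inverse covariances. The function name `fuse` and the numerical values are assumptions chosen for illustration.

```python
import numpy as np

def fuse(mu1, S1, mu2, S2, S12=None):
    """Fuse two estimates of the same quantity.

    S12 is the cross-covariance of the two errors; None means uncorrelated.
    """
    if S12 is None:
        S12 = np.zeros_like(S1)
    D = S1 + S2 - S12 - S12.T                 # symmetric matrix to invert
    W = np.linalg.solve(D, (S1 - S12).T).T    # W = (S1 - S12) D^{-1}
    mu = mu1 - W @ (mu1 - mu2)
    S = S1 - W @ (S1 - S12.T)
    return mu, S

# Two sensors measuring the same two-component quantity (assumed values).
mu1, S1 = np.array([1.0, 2.0]), np.diag([4.0, 1.0])
mu2, S2 = np.array([3.0, 2.5]), np.diag([1.0, 4.0])

mu_f, S_f = fuse(mu1, S1, mu2, S2)

# Reference: form with inverse covariances (inverse-variance weighting).
S_ref = np.linalg.inv(np.linalg.inv(S1) + np.linalg.inv(S2))
mu_ref = S_ref @ (np.linalg.inv(S1) @ mu1 + np.linalg.inv(S2) @ mu2)

print(mu_f, np.diag(S_f))
print(np.allclose(mu_f, mu_ref), np.allclose(S_f, S_ref))
```

In this example each fused variance is smaller than both input variances, and each fused mean lies closer to the more precise of the two measurements.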

Decomposition
A given function $$y_\text{sum}$$ can be approximately decomposed into its additive components when the prior distributions of the entire function and the components are given. According to the addition rule, the Gaussian process of the entire function
 * $$\mu_\text{sum} = \mu_1 + \ldots + \mu_n$$
 * $$\Sigma_\text{sum}=\Sigma_1 + \ldots + \Sigma_n$$

is composed of the prior distributions of the components. The individual components $$y_i$$ can then be estimated by the posterior Gaussian processes
 * $$\mu_{\text{post,}i} = \mu_i + \Sigma_i\Sigma_\text{sum}^{-1} \left(y_\text{sum} - \mu_\text{sum}\right)$$
 * $$\Sigma_{\text{post,}i} = \Sigma_i - \Sigma_i \Sigma_\text{sum}^{-1} \Sigma_i^\top$$

which are correlated to each other by the cross covariances
 * $$\Sigma_{\text{post,}i,j} = -\Sigma_i \Sigma_\text{sum}^{-1} \Sigma_j^\top$$.

Apart from very specific cases, this decomposition is ambiguous. The components are therefore coupled probability distributions of possible solutions around the most likely components (see also Example: Signal Decomposition).

The decomposition is based on the equations for fusion in the previous section, which are applied to the specific distributions $$\mathcal N\left(\mu_\text{sum}, \Sigma_\text{sum}\right)$$ and $$\mathcal N\left(\mu_i, \Sigma_i\right)$$. The density product or overlap extracts the corresponding component in this case.
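As an illustration of the decomposition, the sketch below (Python with NumPy) splits a noisy curve into a smooth component and a white-noise component. The synthetic data (a sine plus noise), the squared exponential prior for the smooth part and all parameter values are assumptions, not part of the method itself.

```python
import numpy as np

rng = np.random.default_rng(2)            # arbitrary fixed seed
t = np.linspace(0.0, 10.0, 50)
r = np.abs(t[:, None] - t[None, :])

# Priors of the two additive components (both centered).
S_smooth = np.exp(-r**2 / (2.0 * 1.5**2))     # squared exponential
S_noise = 0.3**2 * np.eye(t.size)             # white noise
S_sum = S_smooth + S_noise
mu_smooth = np.zeros(t.size)
mu_noise = np.zeros(t.size)
mu_sum = mu_smooth + mu_noise

# Synthetic observed sum (assumed test signal).
y_sum = np.sin(t) + 0.3 * rng.standard_normal(t.size)

def posterior_component(mu_i, S_i):
    """Posterior mean and covariance of one component given y_sum."""
    mean = mu_i + S_i @ np.linalg.solve(S_sum, y_sum - mu_sum)
    cov = S_i - S_i @ np.linalg.solve(S_sum, S_i.T)
    return mean, cov

mean_smooth, cov_smooth = posterior_component(mu_smooth, S_smooth)
mean_noise, cov_noise = posterior_component(mu_noise, S_noise)

# Posterior cross-covariance between the two components.
cross = -S_smooth @ np.linalg.solve(S_sum, S_noise.T)

# The estimated components add up to the observed sum again.
print(np.allclose(mean_smooth + mean_noise, y_sum))
```

The negative cross-covariance expresses the coupling mentioned above: whatever is attributed to one component must be taken away from the other, since their sum is known exactly.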

Introduction
Gaussian processes can be used to interpolate, extrapolate, or smooth discrete measurement data of a mapping $$\mathbb{R}^n \to \mathbb{R}$$. This application of Gaussian processes is called Gaussian process regression. The method is often called kriging for historical reasons, especially in the spatial domain. It is particularly suitable for problems for which no specific model function is known. Its property as a machine learning method allows automatic model building based on observations. In this application, a Gaussian process captures the typical behavior of the system, which can be used to derive the optimal interpolation for the problem. The result is a probability distribution of possible interpolation functions and the solution with the highest probability.

Overview of the individual steps
The calculation of a Gaussian process regression can be performed by the following steps:


 * 1) Prior mean function: If there is a consistent trend in the measured values, a prior mean function is constructed to compensate for it.
 * 2) Prior covariance function: The covariance function is selected according to certain qualitative properties of the system or composed from covariance functions of different properties according to certain rules.
 * 3) Fine-tuning of parameters: To obtain quantitatively correct covariances, the parameters of the selected covariance function are adjusted to the available measured values, either manually or by an optimization procedure, until the covariance function reflects the empirical covariances.
 * 4) Conditional distribution: By considering known measured values, the conditional posterior Gaussian process is calculated from the prior Gaussian process for new support points with still unknown values.
 * 5) Interpretation: Finally, from the posterior Gaussian process, the mean function is taken as the best possible interpolation and, if required, the diagonal of the covariance function is taken as the location-dependent variance.

Step 2: Prior covariance function
In practical applications, a Gaussian process must be determined from finitely many discrete measured values or finitely many sample curves. In analogy to the one-dimensional Gaussian distribution, which is completely determined by the mean and standard deviation estimated from discrete measured values, one would need several individual but complete functions $$f_i(t)$$, $$i = 1, \dotsc, N$$, in order to calculate the mean function

 * $$m(t) = \frac{1}{N} \sum_{i=1}^{N} f_i(t)$$

and the (empirical) covariance function

 * $$k(t,t') = \frac{1}{N - 1} \sum_{i=1}^{N} \left[f_i(t) - m(t)\right] \cdot \left[f_i(t') - m(t')\right]$$.
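On a discrete grid, the two estimators become a mean vector and a covariance matrix computed from the sample curves. The sketch below (Python with NumPy) uses synthetic curves, namely sines with random amplitude around an offset of 2, purely as an assumed example; each row of the array `F` holds one sampled function $$f_i$$.

```python
import numpy as np

rng = np.random.default_rng(3)            # arbitrary fixed seed
t = np.linspace(0.0, 1.0, 20)             # grid on which curves are sampled
N = 500                                   # number of sample curves

# Assumed synthetic ensemble: offset 2 plus a sine with random amplitude.
amplitude = rng.normal(size=(N, 1))
F = 2.0 + amplitude * np.sin(2.0 * np.pi * t)   # shape (N, len(t))

# Empirical mean function m(t) on the grid.
m = F.mean(axis=0)

# Empirical covariance function k(t, t') on the grid, with factor 1/(N-1).
D = F - m
k = D.T @ D / (N - 1)

# NumPy's built-in estimator gives the same matrix.
print(np.allclose(k, np.cov(F, rowvar=False)))
```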

Regression problem and stationary covariance
Often, however, no such distribution of exemplary functions is available. In the regression problem, only discrete interpolation points of a single function are known instead, which are to be interpolated or smoothed. A Gaussian process can be determined in this case as well. For this purpose, instead of this single function, a set of many copies of the function shifted relative to each other is considered. This distribution can now be described with the help of a covariance function. Usually it can be expressed as a relative function of this shift by $$k(t,t') = k(t'-t)$$. It is then called a stationary covariance function. It applies equally to all locations of the function and describes the correlation of each point to its neighborhood, which is the same everywhere (thus stationary), as well as the correlation of neighboring points among each other.

The covariance function is represented analytically and determined heuristically or looked up in the literature. The free parameters of the analytical covariance functions are fitted to the measured values. Many physical systems have a similar form of the stationary covariance function, so that with a few tabulated analytical covariance functions most applications can be described. For example, there are covariance functions for abstract properties such as smoothness, roughness (lack of differentiability), periodicity or noise, which can be combined and fitted according to certain rules to reproduce the properties of the measured values.

Examples of stationary covariance
The following table shows examples of covariance functions with such abstract properties. The example curves are random samples of the respective Gaussian process and represent typical function shapes. They were generated with the corresponding covariance matrix $$\Sigma_{ij}=k(t_i, t_j)$$ and a random generator for multidimensional normal distributions as correlated random vector. The stationary covariance functions $$k(t,t')$$ are abbreviated here as one-dimensional functions $$k(r)$$ with $$r := |t-t'| $$.
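Random sample curves of this kind can be generated by building the covariance matrix $$\Sigma_{ij}=k(t_i, t_j)$$ and transforming independent standard normal numbers with its Cholesky factor. The following sketch (Python with NumPy) shows one way to do this; the helper name `sample_gp`, the grid, the two kernels and the small diagonal "jitter" added for numerical stability are assumptions.

```python
import numpy as np

def sample_gp(kernel, t, n_samples, rng, jitter=1e-6):
    """Draw sample curves of a centered Gaussian process on the grid t."""
    r = np.abs(t[:, None] - t[None, :])
    K = kernel(r) + jitter * np.eye(t.size)   # jitter keeps Cholesky stable
    L = np.linalg.cholesky(K)                 # K = L L^T
    z = rng.standard_normal((t.size, n_samples))
    return (L @ z).T                          # each row has covariance K

rng = np.random.default_rng(6)                # arbitrary fixed seed
t = np.linspace(0.0, 10.0, 100)

kernels = {
    "squared exponential": lambda r: np.exp(-r**2 / 2.0),
    "Ornstein-Uhlenbeck": lambda r: np.exp(-r),
}
samples = {name: sample_gp(k, t, 3, rng) for name, k in kernels.items()}

for name, curves in samples.items():
    print(name, curves.shape)
```

Plotting the rows of each array against `t` shows the qualitative difference between the kernels: smooth curves for the squared exponential and rough, non-differentiable curves for the Ornstein-Uhlenbeck covariance.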

Construction of new covariance functions
The properties can be combined according to certain computational rules. The basic goal in constructing a covariance function is to reproduce the true covariances as precisely as possible, while at the same time satisfying the condition of positive definiteness. The examples shown, except for the constant, have the latter property, and the additions and multiplications of such functions also remain positive definite. The constant covariance function is only positive semidefinite and must be combined with at least one positive definite function. The lowest covariance function in the table shows a possible mixture of different properties. The functions in this example are periodic over a certain distance, have a relatively smooth behavior and are overlaid with a certain measurement noise.

For mixed properties, the following rules apply:
 * In the case of additive effects, the covariances are added, as for example in the superposition of measurement noise.
 * For effects that reinforce or attenuate each other, the covariances are multiplied, as for example in the case of a slow decay of periodicity.
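The two rules can be sketched as follows (Python with NumPy): a periodic covariance is multiplied by a slowly decaying squared exponential, and white measurement noise is added. All parameter values are assumptions for illustration, and the eigenvalue check is a numerical spot check on one grid only.

```python
import numpy as np

t = np.linspace(0.0, 10.0, 80)
r = np.abs(t[:, None] - t[None, :])

# Building blocks (assumed parameters).
k_periodic = np.exp(-2.0 * np.sin(np.pi * r / 2.0)**2 / 1.0**2)   # period 2
k_decay = np.exp(-r**2 / (2.0 * 4.0**2))                          # slow decay
k_noise = 0.1**2 * (r == 0.0)                                     # white noise

# Multiplication: the periodicity fades with distance.
# Addition: measurement noise is superimposed on the signal.
K = k_periodic * k_decay + k_noise

min_eig = np.linalg.eigvalsh(K).min()
print(min_eig > 0.0)
```

The element-wise product of two covariance matrices is again a valid covariance matrix, and the added noise term keeps the result well conditioned.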

Multidimensional functions
What is shown here with one-dimensional functions can be transferred analogously also to multi-dimensional systems, by simply replacing the distance $$r$$ by a corresponding n-dimensional distance norm. The support points in the higher dimensions are unrolled in an arbitrary order and represented by vectors, so that they can be processed in the same way as in the one-dimensional case. The following two figures show two examples with two-dimensional Gaussian processes and different stationary and radial covariance functions. In the respective right figure a random draw of the Gaussian process is shown.



Non-stationary covariance functions
Gaussian processes can also have non-stationary properties of the covariance function, that is, relative covariance functions that change as a function of location. The literature describes how nonstationary covariance functions can be constructed so that positive definiteness is ensured here as well. A simple possibility is, for example, an interpolation of different covariance functions over the location with the inverse distance weighting.

Step 3: Fine tuning of parameters
The qualitatively constructed covariance functions contain parameters, called hyperparameters, which must be tuned (or calibrated) to the system in order to obtain quantitatively correct results. This can be done by direct knowledge about the system, e.g., the known value of the standard deviation of the measurement noise or the prior standard deviation of the overall system (sigma prior, the square corresponds to the diagonal elements of the covariance matrix).

However, the parameters can also be adjusted automatically. For this purpose, one uses the marginal likelihood, i.e., the probability density of the given measured curve under the assumed Gaussian process, as a metric for the agreement between the two. The parameters are then optimized to maximize this agreement. Since the logarithm is strictly monotone, it is sufficient to maximize the logarithm of the probability density, the so-called log marginal likelihood function


 * $$\log p(\mathbf{y}) = - \frac12 \mathbf{y}^\top \Sigma^{-1}\mathbf{y}-\frac12 \log|\Sigma|-\frac n2 \log(2\pi)$$

with the measurement vector $$\mathbf{y}$$ of length $$n$$ (for a centered process; otherwise the prior mean vector is subtracted from $$\mathbf{y}$$ first) and the hyperparameter-dependent covariance matrix $$\Sigma$$. Mathematically, maximizing the marginal likelihood yields a tradeoff between accuracy (minimizing the residuals) and simplicity of the theory. A simple theory is characterized by large off-diagonal elements, describing a high correlation in the system. This means that there are few degrees of freedom in the system and thus, in some sense, the theory needs only a few rules to explain all correlations. If these rules are chosen too simple, the measurements are not reproduced sufficiently well and the residual errors grow too large. At the maximum of the marginal likelihood, the balance of an optimal theory is found, provided that sufficient measurement data are available for good conditioning. This implicit property of maximum likelihood estimation can also be understood as Ockham's principle of parsimony (Occam's razor).
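The log marginal likelihood can be evaluated in a numerically stable way with a Cholesky factorization of $$\Sigma$$. The sketch below (Python with NumPy) does this for a centered process and scans a small grid of length scales for a squared exponential covariance with added noise. The synthetic data, the noise level and the candidate length scales are assumptions; in practice a continuous optimizer would usually replace the grid.

```python
import numpy as np

def log_marginal_likelihood(y, K):
    """log p(y) for a centered Gaussian with covariance matrix K."""
    n = y.size
    L = np.linalg.cholesky(K)                           # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y)) # K^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))          # log |K|
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)

rng = np.random.default_rng(5)                # arbitrary fixed seed
t = np.linspace(0.0, 10.0, 40)
y = np.sin(t) + 0.1 * rng.standard_normal(t.size)   # assumed measurements
r = np.abs(t[:, None] - t[None, :])

length_scales = [0.1, 0.3, 1.0, 3.0, 10.0]    # candidate hyperparameters
scores = {}
for ell in length_scales:
    K = np.exp(-r**2 / (2.0 * ell**2)) + 0.1**2 * np.eye(t.size)
    scores[ell] = log_marginal_likelihood(y, K)

best_ell = max(scores, key=scores.get)
print(best_ell, scores)
```

Very short length scales explain the data only as uncorrelated scatter, while very long ones cannot follow the curve; the maximum lies between these extremes.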

Step 4: Conditional Gaussian process with known support points
If the Gaussian process of a system has been determined as described above, i.e. if the prior mean function and covariance function are known, a prediction of arbitrary interpolated intermediate values can be computed with the Gaussian process when only a few support points of the desired function are known from measurements. The prediction is obtained from the conditional probability of a multidimensional Gaussian distribution given partial information. The dimensions of the multidimensional Gaussian distribution


 * $${X} = \binom{{X}_\text{U}}{{X}_\text{K}} \sim \mathcal N\left(\binom{{\mu}_\text{U}}{{\mu}_\text{K}}, \begin{pmatrix}{\Sigma}_\text{UU} & {\Sigma}_\text{UK} \\ {\Sigma}_\text{KU} & {\Sigma}_\text{KK}\end{pmatrix}\right)$$

are divided into unknown values to be predicted (index U for unknown) and known measured values (index K for known). Vectors thereby decompose into two parts. The covariance matrix is accordingly divided into four blocks: Covariances within the unknown values (UU), within the known measured values (KK) and covariances between the unknown and known values (UK and KU). The values of the covariance matrix are taken at discrete points of the covariance function and the mean vector at corresponding points of the mean function: $$\Sigma_{ij}=k(t_i, t_j)$$ or $$\mu_{i}=m(t_i)$$.

By considering the known measured values $$X_\text{K}$$, the distribution changes to the conditional or posterior normal distribution
 * $$X_\text{U} \mid X_\text{K} \sim \mathcal N \left(\mu_\text{U} + \Sigma_\text{UK}\Sigma_\text{KK}^{-1} (X_\text{K} - \mu_\text{K}), \Sigma_\text{UU} - \Sigma_\text{UK}\Sigma_\text{KK}^{-1}\Sigma_\text{KU}\right) $$,

where $$X_\text{U}$$ are the unknown variables to be determined. The notation $$\mid X_\text{K}$$ reads as "given $$X_\text{K}$$", which means under the condition that $$X_\text{K}$$ is given.

The first parameter of the resulting Gaussian distribution describes the new mean vector we are looking for, which now corresponds to the most likely function values of the interpolation. In addition, the entire predicted new covariance matrix is given in the second parameter. In particular, this contains the standard deviations of the predicted values, given by the square roots of the main diagonal elements, from which confidence intervals are derived.
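The conditional distribution can be computed with a few lines of linear algebra. The following sketch (Python with NumPy) assumes a squared exponential covariance, a zero prior mean and four noise-free support points taken from a sine; these choices and the function names are illustrative assumptions.

```python
import numpy as np

def kernel(a, b, ell=1.0, sigma=1.0):
    """Squared exponential covariance between point sets a and b."""
    r = np.abs(a[:, None] - b[None, :])
    return sigma**2 * np.exp(-r**2 / (2.0 * ell**2))

def condition(t_u, t_k, x_k, m=lambda t: np.zeros_like(t)):
    """Posterior mean and covariance at t_u given values x_k at t_k."""
    S_uu = kernel(t_u, t_u)
    S_uk = kernel(t_u, t_k)
    S_kk = kernel(t_k, t_k)
    mean = m(t_u) + S_uk @ np.linalg.solve(S_kk, x_k - m(t_k))
    cov = S_uu - S_uk @ np.linalg.solve(S_kk, S_uk.T)
    return mean, cov

t_k = np.array([0.0, 1.0, 2.5, 4.0])      # known support points
x_k = np.sin(t_k)                         # known (noise-free) values
t_u = np.array([0.5, 1.0, 3.0])           # points to be predicted

mean, cov = condition(t_u, t_k, x_k)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))   # guard against rounding

print(mean)
print(std)
```

At the prediction point that coincides with a known support point, the posterior mean reproduces the measured value and the standard deviation drops to zero up to rounding.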

Measurement noise and other interfering signals
White measurement noise of variance $$\sigma_\text{noise}^2$$ can be modeled as part of the prior covariance model by adding appropriate terms to the diagonal of $$\Sigma_\text{KK}$$. If the same covariance function were also used to form the matrix $$\Sigma_\text{UU}$$, the predicted distribution would also describe white noise of variance $$\sigma_\text{noise}^2$$. To obtain a prediction of a noise-free signal, in the posterior distribution


 * $$X_\text{U} \mid X_\text{K} \sim \mathcal N \left(\mu_\text{U} + \Sigma_\text{UK}\left[\Sigma_\text{KK} + \mathbb{I}\sigma_\text{noise}^2\right]^{-1} (X_\text{K} - \mu_\text{K}), \Sigma_\text{UU} - \Sigma_\text{UK}\left[\Sigma_\text{KK} + \mathbb{I}\sigma_\text{noise}^2\right]^{-1}\Sigma_\text{KU}\right)$$

the corresponding terms are omitted in $$\Sigma_\text{UU}$$ and, if applicable, in $$\Sigma_\text{UK}$$ and $$\Sigma_\text{KU}$$. This averages out the measurement noise as well as possible, which is also correctly accounted for in the predicted confidence interval. In the same way, any unwanted additive noise signal can be removed from the measurement data (see also the arithmetic operation Decomposition), provided that it can be described by a covariance function and is sufficiently well distinguishable from the desired signal component. For this purpose, instead of the diagonal matrix $$\mathbb{I}\sigma_\text{noise}^2$$, the corresponding covariance matrix of the interference $$\Sigma_\text{noise}$$ is used. Measurements with noisy signals thus require two covariance models: $$k(t,t')$$ for the desired signal component to be estimated and $$k(t,t')+k_\text{noise}(t,t')$$ for the raw signal.
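The use of two covariance models can be sketched as follows (Python with NumPy): the noise variance is added only to the block of the known, noisy measurements, while the blocks involving the unknown values use the signal covariance alone. The synthetic data, the squared exponential signal model and the noise level are assumptions for illustration.

```python
import numpy as np

def k_signal(a, b, ell=1.0):
    """Covariance model of the desired, noise-free signal (assumed)."""
    r = np.abs(a[:, None] - b[None, :])
    return np.exp(-r**2 / (2.0 * ell**2))

rng = np.random.default_rng(4)            # arbitrary fixed seed
sigma_noise = 0.2

t_k = np.linspace(0.0, 6.0, 25)           # measurement points
x_k = np.sin(t_k) + sigma_noise * rng.standard_normal(t_k.size)
t_u = np.linspace(0.0, 6.0, 100)          # prediction points

# Raw-signal model for the measurements: signal plus white noise.
S_kk_raw = k_signal(t_k, t_k) + sigma_noise**2 * np.eye(t_k.size)
# Signal-only model for everything involving the unknown values.
S_uk = k_signal(t_u, t_k)
S_uu = k_signal(t_u, t_u)

# Posterior of the noise-free signal (zero prior mean assumed).
mean = S_uk @ np.linalg.solve(S_kk_raw, x_k)
cov = S_uu - S_uk @ np.linalg.solve(S_kk_raw, S_uk.T)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))

print(mean[:5])
print(std.min(), std.max())
```

Unlike in the noise-free case, the posterior mean no longer passes exactly through the measured values, and the standard deviation stays above zero even at the measurement points.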

Derivation of the conditional distribution
The derivation can be done via the Bayes formula by substituting the two probability densities for known and unknown support points and the composite probability density. The resulting conditional posterior normal distribution corresponds to the overlap or intersection of the Gaussian distribution with the subvector space spanned by the known values.

For noisy measurements that are themselves a multidimensional normal distribution, the overlap to the prior distribution is obtained by multiplying the two probability densities. The product of the probability densities of two multidimensional normal distributions corresponds to the arithmetic operation Fusion, which can be used to derive the distribution where the noise is suppressed.

Posterior Gaussian process
In the full notation as a Gaussian process, the prior Gaussian process
 * $$(X_t) \sim \mathcal {GP}(m,k)$$

and the $$n$$ known measurements $$\mathbf{x}=(x_1,x_2,\ldots,x_n)$$ at the coordinates $$\mathbf{t}=(t_1,t_2,\ldots,t_n)$$ yield a new distribution, given by the conditional posterior Gaussian process
 * $$(X_t\mid \mathbf{t},\mathbf{x}) \sim \mathcal {GP}\left(m_{\mathrm{post}}, k_{\mathrm{post}}\right)$$

with
 * $$m_{\mathrm{post}}(t) = m(t) + \mathbf{k}^\top(t,\mathbf{t})K(\mathbf{t},\mathbf{t})^{-1} (\mathbf{x} - m(\mathbf{t}))$$
 * $$k_{\mathrm{post}}(t,t') = k(t,t') - \mathbf{k}^\top(t,\mathbf{t})K(\mathbf{t}, \mathbf{t})^{-1}\mathbf{k}(\mathbf{t},t')$$

Here, $$K$$ is a covariance matrix obtained by evaluating the covariance function $$k$$ at discrete rows $$t_i$$ and columns $$t_j$$. Moreover, $$\mathbf{k}$$ is formed as a vector of functions by evaluating $$k$$ only at discrete rows or only at discrete columns.

In practical numerical calculations with a finite number of support points, only the equation of the conditional multivariate normal distribution is used. The notation of the posterior Gaussian process serves only for theoretical understanding here: it describes the limit towards the continuum in the form of functions and thus makes the assignment of the values to the coordinates explicit.

Step 5: Interpretation
From the prior Gaussian process, the measured values are used to obtain a posterior Gaussian process, which takes into account the known partial information. This result of the Gaussian process regression represents not only one solution, but the entirety of all possible solution functions of the interpolation weighted with different probabilities. The indecision expressed in this way is not a weakness of the method. It does perfect justice to the problem, since in the case of a theory which is not completely known or in the case of noisy measurements, the solution, in principle, cannot be determined unambiguously. Mostly, however, we are specifically interested in at least the solution with the highest probability. This is given by the mean function $$m_{\mathrm{post}}(t)$$ in the first parameter of the posterior Gaussian process. From the conditional covariance function in the second parameter, it is possible to obtain the scatter around this solution. The diagonal $$k_{\mathrm{post}}(t,t)$$ of the covariance function gives a function with the variances of the predicted most likely function. The confidence interval is then given by the bounds $$m_{\mathrm{post}}(t) \pm \sqrt{k_{\mathrm{post}}(t,t)}$$.
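As an illustration of this interpretation, the sketch below builds the posterior mean and covariance as Python functions and evaluates the band $$m_{\mathrm{post}}(t) \pm \sqrt{k_{\mathrm{post}}(t,t)}$$ on a grid. The helper `condition_gp`, the zero prior mean, the squared-exponential covariance function, the three measurements and the small `jitter` added for numerical stability are all assumptions made for this example.

```python
import numpy as np

def condition_gp(m, k, t_obs, x_obs, jitter=1e-10):
    """Return m_post(t) and k_post(t, t') for noise-free measurements."""
    K = k(t_obs[:, None], t_obs[None, :]) + jitter * np.eye(t_obs.size)
    weights = np.linalg.solve(K, x_obs - m(t_obs))

    def m_post(t):
        t = np.atleast_1d(t)
        return m(t) + k(t[:, None], t_obs[None, :]) @ weights

    def k_post(t, t2):
        t, t2 = np.atleast_1d(t), np.atleast_1d(t2)
        k_left = k(t[:, None], t_obs[None, :])
        k_right = k(t_obs[:, None], t2[None, :])
        return k(t[:, None], t2[None, :]) - k_left @ np.linalg.solve(K, k_right)

    return m_post, k_post

def m(t):
    """Prior mean function (assumed to be zero)."""
    return np.zeros_like(t, dtype=float)

def k(a, b):
    """Prior covariance function (assumed squared exponential)."""
    return np.exp(-0.5 * (a - b) ** 2)

t_obs = np.array([1.0, 3.0, 4.0])
x_obs = np.array([0.5, -0.2, 0.8])
m_post, k_post = condition_gp(m, k, t_obs, x_obs)

t = np.linspace(0.0, 5.0, 101)
mean = m_post(t)                                        # most likely function
std = np.sqrt(np.clip(np.diag(k_post(t, t)), 0.0, None))
lower, upper = mean - std, mean + std                   # confidence band
```

At the measured coordinates the band collapses to almost zero width, while it widens towards the prior standard deviation far away from the measurements.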

The Python code for the examples can be found on the respective image description page.

Underdetermined measurements
In some cases of conditional Gaussian processes, groups of linearly related measured values are completely indeterminate. E.g., this is the case for indirect measurements following from underdetermined equations, such as with a noninvertible positive semidefinite matrix of the form $$A^\top\Sigma^{-1}A$$. The grid points then cannot be easily partitioned into known and unknown values, and the associated covariance matrix would be singular due to infinite uncertainties. This would correspond to a normal distribution that is infinitely stretched in certain spatial directions transverse to the coordinate axes. To account for the relationships between the undetermined variables, in such a case, the inverse matrix $$\Sigma_\text{2}^{-1}$$, called the precision matrix, must be used. This can describe completely undetermined measurements, which is expressed by zeros in the diagonal. For such a singular distribution $$\mathcal N\left(\mu_\text{2}, \Sigma_\text{2}\right)$$ with partially unknown measurements $$\mu_\text{2}$$ and singular measurement uncertainties $$\Sigma_\text{2}$$, the wanted posterior distribution is obtained by the overlap to the prior Gaussian process model $$\mathcal N\left(\mu_\text{1}, \Sigma_\text{1}\right)$$ calculated by multiplying the probability densities. The union of the two normal distributions


 * $$\Sigma_\text{Fusion} = \left(\mathbb{I} + \Sigma_\text{1}\Sigma_\text{2}^{-1}\right)^{-1}\Sigma_\text{1}$$
 * $$\mu_\text{Fusion} = \left(\mathbb{I} + \Sigma_\text{1}\Sigma_\text{2}^{-1}\right)^{-1}\mu_\text{1} + \Sigma_\text{Fusion}\Sigma_\text{2}^{-1}\mu_\text{2}$$

is obtained by the Fusion operation after appropriate transformation, so that the singular one of the two matrices appears only in its inverse form. The result is always a finite distribution, since the finite matrix dominates. If both are finite, the equation can be put into the form of the posterior Gaussian process as in the section on the conditional distribution.
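A small sketch of the two Fusion equations with NumPy is given below. The function name `fuse`, the three support points, the squared-exponential prior and the single indirect measurement of the sum of the first and third value are hypothetical choices. The measurement is underdetermined: its precision matrix $$A^\top\Sigma^{-1}A$$ has rank one and cannot be inverted.

```python
import numpy as np

def fuse(mu1, cov1, mu2, prec2):
    """Fuse N(mu1, cov1) with a Gaussian given by its precision matrix prec2.

    prec2 is the inverse of Sigma_2 and may be singular.
    """
    n = mu1.size
    M = np.eye(n) + cov1 @ prec2              # I + Sigma_1 Sigma_2^{-1}
    cov_f = np.linalg.solve(M, cov1)          # Sigma_Fusion
    mu_f = np.linalg.solve(M, mu1) + cov_f @ prec2 @ mu2
    return mu_f, cov_f

# Prior Gaussian process model on three support points.
t = np.array([0.0, 1.0, 2.0])
cov1 = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2)
mu1 = np.zeros(3)

# Indirect measurement y of the sum x_0 + x_2 with variance var_y.
A = np.array([[1.0, 0.0, 1.0]])
var_y = 0.1
y = 1.5
prec2 = A.T @ A / var_y                  # rank 1, hence singular
mu2 = np.array([y / 2, 0.0, y / 2])      # any vector with A @ mu2 == y works

mu_f, cov_f = fuse(mu1, cov1, mu2, prec2)
```

Only the product `prec2 @ mu2` enters the result, so the undetermined directions of `mu2` have no influence on the fused distribution.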

Linear combination to a Gaussian process
From given basis functions $$\varphi_j(t)$$ a linear combination is to be formed, which has maximum overlap with the distribution $$\mathcal N(\mu, \Sigma)$$ of an associated Gaussian process $$\mathcal {GP}(m,k)$$. Or measured values $$\mu$$ are to be approximated, while the interfering signal $$\mathcal N(0, \Sigma)$$, contained within, is ignored as far as possible. In both cases, the wanted coefficients can be calculated using generalized least squares estimation


 * $$c = \left( A^\top \Sigma^{-1} A \right)^{-1} A^\top \Sigma^{-1}\mu$$
 * $$\Sigma_c = \left( A^\top \Sigma^{-1} A \right)^{-1} $$

The matrix $$A_{ij}=\varphi_j(t_i)$$ contains the function values of the basis functions $$\varphi_j(t)$$ at the interpolation points $$t_i$$. The resulting coefficients c with the associated covariance matrix $$\Sigma_c$$ describe the linear combination with the largest possible probability density in the distribution $$\mathcal N(\mu, \Sigma)$$. The linear combination thereby approximates the mean function or the measured values $$\mu$$ in such a way that the residuals are best described by the covariance matrix $$\Sigma$$. The method is used, for example, in the program library Scikit-learn to empirically estimate a polynomial mean function of a Gaussian process.
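The two equations can be sketched with NumPy as follows. The function name `gls_fit`, the polynomial basis, the example coefficients and the covariance matrix are assumptions made only for this illustration.

```python
import numpy as np

def gls_fit(A, cov, mu):
    """Generalized least squares: coefficients c and their covariance."""
    cov_inv_A = np.linalg.solve(cov, A)       # Sigma^{-1} A
    cov_c = np.linalg.inv(A.T @ cov_inv_A)    # (A^T Sigma^{-1} A)^{-1}
    c = cov_c @ (cov_inv_A.T @ mu)            # cov_c A^T Sigma^{-1} mu
    return c, cov_c

t = np.linspace(0.0, 1.0, 50)
# Basis functions phi_0 = 1, phi_1 = t, phi_2 = t^2 evaluated at the points t_i.
A = np.stack([np.ones_like(t), t, t**2], axis=1)

# Covariance of the interfering signal: correlated part plus white noise.
d = t[:, None] - t[None, :]
cov = np.exp(-0.5 * (d / 0.2) ** 2) + 0.01 * np.eye(t.size)

# Values to approximate; here they lie exactly in the span of the basis.
mu = 2.0 - 3.0 * t + 0.5 * t**2

c, cov_c = gls_fit(A, cov, mu)
```

Since `mu` is an exact linear combination of the basis functions in this example, `c` recovers the coefficients 2, -3 and 0.5 up to rounding, while `cov_c` states how strongly the assumed interference would scatter them.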

Approximation of an empirical Gaussian process
An empirically determined Gaussian process


 * $$m(t) = \frac{1}{N} \sum_{p=1}^{N} f_p(t)$$
 * $$k(t, t') = \frac{1}{N - 1} \sum_{p=1}^{N} \left[f_p(t) - m(t)\right] \cdot \left[f_p(t') - m(t')\right]$$

from example functions $$f_p(t)$$ with few distinct degrees of freedom can be approximated and simplified by means of the eigenvalue decomposition or singular value decomposition


 * $$\Sigma = VSV^\top$$

of the covariance matrix $$\Sigma_{ij} = k(t_i, t_j)$$. This is done by choosing the $$n$$ largest eigenvalues or singular values $$\lambda_p=\sigma_p^2$$ from the diagonal matrix $$S$$. The corresponding columns $$v_p$$ of $$V$$ are the principal components of the Gaussian process (see Principal Component Analysis). If the columns are represented as functions $$v_p(t)$$, then the original Gaussian process is represented by the mean function $$m(t)$$ and the covariance function


 * $$k(t, t') \approx \sum_{p=1}^{n} \sigma_p^2 v_p(t) v_p(t')$$

This Gaussian process describes exclusively functions of the linear combination


 * $$f(t) = m(t) + \sum_p c_p v_p(t)$$,

where each coefficient $$c_p$$ is scattered around zero mean as an independent random variable of variance $$\sigma_p^2 = \lambda_p$$.

Such a simplification is positive semidefinite, and it usually lacks the properties needed to describe small-scale variations. These properties can be added to the covariance function in the form of a stationary covariance function fitted to the residuals:

 * $$k(t, t') \approx \sum_{p=1}^{n} \sigma_p^2 v_p(t) v_p(t') + k_\text{stat}(t'-t)$$
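The truncated eigenvalue decomposition described in this section can be sketched as follows. The example functions are hypothetical: they are generated from three basis shapes with random coefficients, so that the empirical covariance matrix has only three relevant eigenvalues. The stationary residual term $$k_\text{stat}$$ is not included in the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 100)

# Hypothetical example functions f_p(t) with three degrees of freedom.
N = 200
coeffs = rng.standard_normal((N, 3)) * np.array([1.0, 0.5, 0.1])
basis = np.stack([np.ones_like(t),
                  np.sin(2 * np.pi * t),
                  np.cos(2 * np.pi * t)])
F = coeffs @ basis                       # one example function per row

# Empirical mean function and covariance matrix (normalized by N - 1).
m = F.mean(axis=0)
Sigma = np.cov(F, rowvar=False)

# Eigenvalue decomposition, sorted by decreasing eigenvalue.
eigvals, V = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

# Keep the n largest eigenvalues and their principal components.
n = 3
Sigma_approx = (V[:, :n] * eigvals[:n]) @ V[:, :n].T

# Draw one random function f(t) = m(t) + sum_p c_p v_p(t).
c = rng.standard_normal(n) * np.sqrt(eigvals[:n])
f_sample = m + V[:, :n] @ c
```

With `n = 3` the approximation reproduces `Sigma` up to rounding in this example. A smaller `n` drops the components with the smallest variance first.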

Example: Trend prediction
In a hypothetical application example from market research, the future demand for the topic "snowboard" is to be predicted. For this purpose, an extrapolation of the number of Google searches on this term is to be calculated.

In the past data, one can see a periodic, but non-sinusoidal seasonal dependence, which can be explained by the winter in the northern hemisphere. Moreover, the trend decreased continuously over the last decade. In addition, one recognizes a recurring increase in search queries during the Olympic Games every four years. The covariance function was therefore modeled with a slow trend and a one- and four-year period:

The trend also appears to have a significant asymmetry. This can be the case if the underlying random effects do not add up but reinforce each other, resulting in a Log-Normal Distribution. However, the logarithm of such values describes a normal distribution, to which Gaussian process regression can be applied.



The figure shows an extrapolation of the curve (to the right of the dashed line). Since the results here were transformed back from the logarithmic plot using an exponential function, the predicted confidence intervals are correspondingly asymmetrical (gray area). The extrapolation plausibly shows the seasonal patterns and also the increase in searches for the Olympic Games every four years. The example with mixed properties demonstrates very well the versatile modeling possibilities of the Gaussian process regression, which are unified in a single interpolation method.
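The asymmetry of the back-transformed confidence intervals can be reproduced with a few lines of NumPy. The values of `log_mean` and `log_std` are hypothetical stand-ins for the posterior mean and standard deviation obtained in logarithmic space, not results from the example above.

```python
import numpy as np

# Hypothetical posterior mean and standard deviation in logarithmic space.
log_mean = np.array([3.0, 3.2, 2.9])
log_std = np.array([0.2, 0.3, 0.5])

# Back-transformation with the exponential function.
center = np.exp(log_mean)             # median of the log-normal distribution
lower = np.exp(log_mean - log_std)
upper = np.exp(log_mean + log_std)

# The interval is wider above the central curve than below it.
width_above = upper - center
width_below = center - lower
```

The larger `log_std` becomes, the more the upper half of the interval exceeds the lower half.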



Example: Sensor calibration
In an application example from industry, sensors are to be calibrated using Gaussian processes. Due to tolerances during manufacturing, the characteristic curves $$f(x)$$ of the sensors show large individual differences. This causes high costs in calibration, since a complete characteristic curve would have to be measured for each sensor. However, the effort can be minimized by learning the exact behavior of the scattering by a Gaussian process. For this purpose, the complete characteristic curves $$f_i(x)$$ of $$N$$ randomly selected representative sensors are measured and thus the Gaussian process $$\mathcal{GP}(m,k)$$ of the scattering is calculated by



 * $$m(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x)$$
 * $$k(x,x') = \frac{1}{N - 1} \sum_{i=1}^{N} \left[f_i(x) - m(x)\right] \cdot \left[f_i(x') - m(x')\right]$$

In the example shown, 15 representative characteristic curves are given. The resulting Gaussian process is represented by the mean function $$m(x)$$ and the confidence interval $$m(x) \pm \sqrt{k(x,x)}$$.



With the conditional Gaussian process $$\mathcal{GP}(m_\text{post},k_\text{post})$$ with


 * $$m_{\mathrm{post}}(x) = m(x) + \mathbf{k}^\top(x,\mathbf{x})K(\mathbf{x},\mathbf{x})^{-1} (\mathbf{y} - m(\mathbf{x}))$$



 * $$k_{\mathrm{post}}(x,x') = k(x,x') - \mathbf{k}^\top(x,\mathbf{x})K(\mathbf{x}, \mathbf{x})^{-1}\mathbf{k}(\mathbf{x},x')$$

the complete characteristic map can now be reconstructed for each new sensor with a few individual measured values $$\mathbf{y}$$ at the coordinates $$\mathbf{x}$$. The number of measured values must correspond at least to the number of degrees of freedom of the tolerances, which have an independent linear influence on the shape of the characteristic curve.

In the example shown, a single measured value is not yet sufficient to determine the characteristic curve unambiguously and precisely. The confidence interval shows the region of the curve which is not yet sufficiently accurate. With another measured value in this range, the remaining uncertainty can finally be completely eliminated. The exemplary fluctuations of the very differently acting sensors in this example thus seem to be caused by the tolerances of only two relevant inner degrees of freedom.
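The calibration procedure can be sketched with synthetic data. The curve model `sensor_curve` with two tolerance parameters, the parameter spreads and the measurement positions are hypothetical and only mimic the situation described, they are not the data behind the figures.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 101)

def sensor_curve(a, b):
    """Hypothetical characteristic curve with two tolerance parameters."""
    return 1.0 + 2.0 * x + a * x + b * x**2

# Complete characteristic curves of N = 15 representative sensors.
N = 15
params = rng.normal(0.0, [0.5, 0.3], size=(N, 2))
F = np.array([sensor_curve(a, b) for a, b in params])

# Empirical Gaussian process of the scattering.
m = F.mean(axis=0)
K = np.cov(F, rowvar=False)              # k(x_i, x_j), normalized by N - 1

def reconstruct(idx, y):
    """Condition the empirical GP on values y measured at grid indices idx."""
    K_kk = K[np.ix_(idx, idx)]
    K_uk = K[:, idx]
    gain = np.linalg.solve(K_kk, K_uk.T).T   # K_uk K_kk^{-1}, K_kk symmetric
    m_post = m + gain @ (y - m[idx])
    k_post = K - gain @ K_uk.T
    std = np.sqrt(np.clip(np.diag(k_post), 0.0, None))
    return m_post, std

# A new sensor, of which only a few points are measured.
f_new = sensor_curve(0.4, -0.2)
m1, s1 = reconstruct([30], f_new[[30]])            # one measured value
m2, s2 = reconstruct([30, 80], f_new[[30, 80]])    # two measured values
```

With one measured value the band `s1` stays clearly above zero away from the measured point. With two values `s2` vanishes up to rounding and `m2` coincides with the full curve, because the synthetic tolerances have exactly two degrees of freedom.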





Example: Signal decomposition
In a signal processing application example, a temporal signal is to be decomposed into its components. Let it be known about the system that the signal consists of three components following the three covariance functions
 * $$k_1(r) = 2.7^2 \exp(-r^2)$$
 * $$k_2(r) = 2.7^2 \exp(-0.4|\sin(r/2.5)|)$$
 * $$k_3(r) = 0.6^2 \delta_r$$

The sum signal then follows the addition rule of the covariance function
 * $$k_\text{sum}(r) = k_1(r) + k_2(r) + k_3(r)$$.

The following two figures show three random signals which were generated and added for demonstration with these covariance functions. In the sum of the signals one can hardly recognize the periodic signal hidden in it with the naked eye, since its spectral range overlaps with that of the two other components.

The example shows how this method can be used to separate very different signals in one step. In contrast, other filtering methods such as moving averaging, Fourier filtering, polynomial regression, or spline approximation are optimized for specific signal characteristics and provide neither accurate error estimates nor cross-correlations.
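One way to carry out such a decomposition is to apply the conditional distribution to each component in turn, using the covariance matrix of that component as the cross-covariance with the sum signal. The sketch below does this with the three covariance functions given above. The time grid, the random seed and the zero mean are assumptions, and the sum signal is drawn directly from $$k_\text{sum}$$ rather than taken from the figures.

```python
import numpy as np

t = np.arange(0.0, 30.0, 0.1)
r = t[:, None] - t[None, :]

# Covariance matrices of the three components on the time grid.
K1 = 2.7**2 * np.exp(-r**2)
K2 = 2.7**2 * np.exp(-0.4 * np.abs(np.sin(r / 2.5)))
K3 = 0.6**2 * np.eye(t.size)             # white noise: delta on the grid
K_sum = K1 + K2 + K3

# Draw a random sum signal with covariance K_sum (zero mean assumed).
rng = np.random.default_rng(3)
y = np.linalg.cholesky(K_sum) @ rng.standard_normal(t.size)

# Most likely value of each component given the sum signal.
weights = np.linalg.solve(K_sum, y)
parts = [K @ weights for K in (K1, K2, K3)]

# Remaining uncertainty of the first component.
cov1 = K1 - K1 @ np.linalg.solve(K_sum, K1)
std1 = np.sqrt(np.clip(np.diag(cov1), 0.0, None))
```

The three estimated components add up to the measured sum signal again, and `std1` indicates how well the first component can be distinguished from the other two.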

If the Gaussian processes of the individual components for a given signal are not precisely known, then in some cases hypothesis testing can be performed using the log-marginal likelihood function, provided sufficient data are available to well-condition the function. Via its maximization, the parameters of the conjectured covariance functions can be fitted to the measured data.
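A minimal sketch of such a fit is shown below. It evaluates the log-marginal likelihood of a zero-mean Gaussian process and selects one parameter by a simple grid search. The squared-exponential kernel with a single length scale, the noise level, the candidate grid and the synthetic data are assumptions for illustration, and a gradient-based optimizer could be used in place of the grid search.

```python
import numpy as np

def log_marginal_likelihood(y, K):
    """log N(y | 0, K) for a zero-mean Gaussian process."""
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))         # equals -0.5 * log det K
            - 0.5 * y.size * np.log(2.0 * np.pi))

def kernel_matrix(t, length, sigma_noise=0.1):
    """Conjectured covariance model: squared exponential plus white noise."""
    d = t[:, None] - t[None, :]
    return np.exp(-0.5 * (d / length) ** 2) + sigma_noise**2 * np.eye(t.size)

# Synthetic measured data drawn from the model with length scale 1.0.
rng = np.random.default_rng(4)
t = np.linspace(0.0, 10.0, 60)
y = np.linalg.cholesky(kernel_matrix(t, 1.0)) @ rng.standard_normal(t.size)

# Grid search: keep the length scale with the highest likelihood.
candidates = np.linspace(0.2, 3.0, 29)
scores = np.array([log_marginal_likelihood(y, kernel_matrix(t, l))
                   for l in candidates])
best_length = candidates[np.argmax(scores)]
```

Comparing `scores` across different conjectured covariance functions, and not only across parameter values, gives the kind of hypothesis test mentioned above.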



Literature

 * In: Olivier Bousquet (ed.): Advanced Lectures on Machine Learning. ML 2003 (= Lecture Notes in Computer Science, vol. 3176). Springer, Berlin/Heidelberg 2004.
 * C. E. Rasmussen, C. K. I. Williams: Gaussian Processes for Machine Learning. MIT Press, 2006, ISBN 0-262-18253-X (gaussianprocess.org).
 * R. M. Dudley: Real Analysis and Probability. Wadsworth and Brooks/Cole, 1989.
 * B. Simon: Functional Integration and Quantum Physics. Academic Press, 1979.
 * M. L. Stein: Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999.

Educational material

 * Gaussian Processes Web Site (Textbook, tutorials, code, etc.)
 * Interactive demo on Gaussian process regression
 * The Kernel Cookbook (Guide to the construction of covariance functions)

Software

 * GPy – A Gaussian Process-Framework in Python
 * Scikit Learn Gaussian Process – Gaussian process module of the machine learning library Scikit-learn for Python
 * Gaussian process library written in C++11