Understanding Gaussian Processes with Stan in Python

A Gaussian process is a distribution over functions. Instead of assuming a fixed parametric form such as

y = a + bx

we define a prior over all plausible functions and let the observed data update that prior. The key idea is simple: any finite collection of function values has a multivariate normal distribution.

If a latent function is written as

f(x) ~ GP(m(x), k(x, x'))

then for observed input points x[1], ..., x[N], the vector

[f(x[1]), ..., f(x[N])]

follows a multivariate normal distribution with mean vector determined by m(x) and covariance matrix determined by the kernel k(x, x').

The covariance kernel

The covariance kernel controls what kinds of functions are plausible before seeing the data. A common choice is the exponentiated quadratic kernel, also called the squared exponential or radial basis function kernel:

k(x_i, x_j) = alpha^2 exp(-(x_i - x_j)^2 / (2 rho^2))

The parameters have intuitive roles:

  • alpha sets the marginal standard deviation of the function, i.e. its vertical scale.
  • rho is the length scale: how quickly the function can vary along the input axis.
  • sigma is the observation noise standard deviation, added separately to the diagonal rather than being part of the kernel.
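
As a concrete illustration, here is a minimal NumPy sketch (the helper name exp_quad_kernel is mine, not a library function) that builds this covariance matrix on a grid of inputs and draws a few functions from the GP prior. It makes the earlier point tangible: the function values at those points are just one sample from a multivariate normal.

import numpy as np

def exp_quad_kernel(x1, x2, alpha, rho):
    # k(x_i, x_j) = alpha^2 * exp(-(x_i - x_j)^2 / (2 * rho^2))
    diff = np.subtract.outer(x1, x2)
    return alpha**2 * np.exp(-0.5 * (diff / rho) ** 2)

rng = np.random.default_rng(0)
x_grid = np.linspace(-3, 3, 50)

# Tiny jitter on the diagonal keeps the matrix numerically positive definite.
K = exp_quad_kernel(x_grid, x_grid, alpha=1.0, rho=1.0) + 1e-9 * np.eye(len(x_grid))

# Three prior function draws: samples from MultiNormal(0, K) at the grid points.
prior_draws = rng.multivariate_normal(np.zeros(len(x_grid)), K, size=3)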

For noisy observations, the model is

y_i = f(x_i) + epsilon_i
epsilon_i ~ Normal(0, sigma)

Equivalently,

y ~ MultiNormal(0, K + sigma^2 I)

where K is the covariance matrix built from the kernel.
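
To make the observation model concrete, the following sketch simulates one noisy dataset directly from that marginal distribution (the values of alpha, rho, and sigma are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 40)
alpha, rho, sigma = 1.0, 1.0, 0.2

# Exponentiated quadratic covariance, then observation noise on the diagonal.
diff = np.subtract.outer(x, x)
K = alpha**2 * np.exp(-0.5 * (diff / rho) ** 2)
K_y = K + sigma**2 * np.eye(len(x))

# One synthetic dataset: y ~ MultiNormal(0, K + sigma^2 I).
y_sim = rng.multivariate_normal(np.zeros(len(x)), K_y)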

Why Stan works well for Gaussian processes

Stan is useful here because Gaussian process regression is a Bayesian model with unknown hyperparameters. We usually do not know the correct values of alpha, rho, and sigma, so we place priors on them and infer them from the data.

The tradeoff is computational cost. A full Gaussian process requires Cholesky decomposition of an N x N covariance matrix, which scales roughly as O(N^3). This is fine for small and medium examples, but for large datasets one should consider sparse or approximate Gaussian process methods.

Stan model

The following Stan program uses the exponentiated quadratic covariance function. It adds a small jitter term to the diagonal for numerical stability.

data {
  int<lower=1> N;
  array[N] real x;
  vector[N] y;
}

transformed data {
  real delta = 1e-9;  // jitter added to the covariance diagonal for numerical stability
}

parameters {
  real<lower=0> alpha;
  real<lower=0> rho;
  real<lower=0> sigma;
}

model {
  matrix[N, N] K;
  matrix[N, N] L_K;

  // Weakly informative priors on the kernel and noise hyperparameters.
  alpha ~ normal(0, 1);
  rho ~ inv_gamma(5, 5);
  sigma ~ normal(0, 0.5);

  // Exponentiated quadratic covariance over the observed inputs.
  K = gp_exp_quad_cov(x, alpha, rho);

  // Add observation noise and jitter to the diagonal.
  for (n in 1:N) {
    K[n, n] = K[n, n] + square(sigma) + delta;
  }

  // Cholesky factor makes the multivariate normal likelihood cheaper and more stable.
  L_K = cholesky_decompose(K);
  y ~ multi_normal_cholesky(rep_vector(0, N), L_K);
}

Save it as gp_regression.stan.

Python example with CmdStanPy

The Python side only needs to prepare the data, compile the Stan program, and sample from the posterior. Check the current CmdStanPy installation instructions for your environment, because installation details can change. A local sanity check is:

python -c "import cmdstanpy; print(cmdstanpy.__version__)"

A minimal reproducible example:

import numpy as np
import matplotlib.pyplot as plt
from cmdstanpy import CmdStanModel

rng = np.random.default_rng(20211202)

# Simulate noisy observations of a smooth underlying function.
N = 40
x = np.linspace(-3, 3, N)
y_true = np.sin(x)
y = y_true + rng.normal(0, 0.15, size=N)

# Compile the Stan program and draw posterior samples.
model = CmdStanModel(stan_file="gp_regression.stan")
fit = model.sample(
    data={"N": N, "x": x.tolist(), "y": y.tolist()},
    chains=4,
    iter_warmup=1000,
    iter_sampling=1000,
    seed=20211202,
)

print(fit.summary().loc[["alpha", "rho", "sigma"]])

posterior = fit.draws_pd(vars=["alpha", "rho", "sigma"])
print(posterior.describe())

plt.scatter(x, y, label="observed", color="black")
plt.plot(x, y_true, label="true function", color="tab:blue")
plt.legend()
plt.show()

This model estimates the covariance hyperparameters, but it does not yet generate predictions at new input points. For prediction, add test inputs to the Stan data block and use the conditional multivariate normal formula, or generate posterior predictions in Python using the sampled hyperparameters.

Posterior prediction formula

Let x_train be the observed inputs and x_test be new inputs. Define:

K      = k(x_train, x_train) + sigma^2 I
K_star = k(x_train, x_test)
K_test = k(x_test, x_test)

The conditional distribution of the latent function at test points is

f_test | y ~ Normal(K_star^T K^{-1} y,
                   K_test - K_star^T K^{-1} K_star)

In implementation, avoid directly computing K^{-1}. Use a Cholesky factor and triangular solves instead. That is both faster and more numerically stable.
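
A minimal Python sketch of that approach, assuming the exponentiated quadratic kernel from earlier and fixed hyperparameter values (in practice you would repeat this for many posterior draws and average), might look like this:

import numpy as np
from scipy.linalg import solve_triangular

def exp_quad_kernel(x1, x2, alpha, rho):
    diff = np.subtract.outer(x1, x2)
    return alpha**2 * np.exp(-0.5 * (diff / rho) ** 2)

def gp_predict(x_train, y_train, x_test, alpha, rho, sigma):
    # K = k(x_train, x_train) + sigma^2 I, factored once with a Cholesky decomposition.
    K = exp_quad_kernel(x_train, x_train, alpha, rho) + sigma**2 * np.eye(len(x_train))
    L = np.linalg.cholesky(K)

    K_star = exp_quad_kernel(x_train, x_test, alpha, rho)
    K_test = exp_quad_kernel(x_test, x_test, alpha, rho)

    # Mean: K_star^T K^{-1} y computed with two triangular solves, no explicit inverse.
    tmp = solve_triangular(L, y_train, lower=True)
    mean = K_star.T @ solve_triangular(L.T, tmp, lower=False)

    # Covariance: K_test - K_star^T K^{-1} K_star via one more triangular solve.
    v = solve_triangular(L, K_star, lower=True)
    cov = K_test - v.T @ v
    return mean, cov

For each posterior draw of (alpha, rho, sigma), calling gp_predict with the training data gives the conditional mean and covariance at the test points; averaging over draws propagates the hyperparameter uncertainty.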

Practical checks

For a Gaussian process fitted with Stan, I usually check the following:

  • Rhat should be close to 1.0 for alpha, rho, and sigma.
  • Effective sample size should not be extremely small.
  • Posterior predictions should be smooth at a scale consistent with the learned rho.
  • sigma should be plausible relative to the observed noise.
  • If the sampler reports divergences, inspect priors and rescale the input data.

Input scaling matters. A length scale prior that works for x in [-3, 3] may be inappropriate for x in [0, 100000]. A stable workflow is to standardize or rescale inputs first, fit the model, and then interpret the length scale on that transformed scale.
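
Continuing the CmdStanPy example above, a sketch of that workflow could be:

# Standardize the inputs; the learned rho is then in units of standard deviations of x.
x_mean, x_sd = x.mean(), x.std()
x_std = (x - x_mean) / x_sd

data = {"N": N, "x": x_std.tolist(), "y": y.tolist()}
fit = model.sample(data=data, chains=4, seed=20211202)

# New inputs for prediction must be transformed with the same x_mean and x_sd.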

Summary

A Gaussian process defines a prior over functions through a covariance kernel. Stan lets us infer the kernel hyperparameters instead of fixing them by hand. The basic implementation is compact: build the covariance matrix, add observation noise and jitter to the diagonal, use a Cholesky decomposition, and sample the hyperparameters. For small datasets this gives a clear and flexible Bayesian regression model; for large datasets, the same ideas remain useful, but approximate methods become necessary.
