Bayesian linear regression: basics

The standard linear regression model with Gaussian noise is compactly summarized as

$$ f(\mathbf{x}) = \mathbf{w}^\mathsf{T} \mathbf{x}\,,\qquad y = f(\mathbf{x}) + \varepsilon\,, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2). $$

Given the data \(\mathbf{X} = [\mathbf{x}_1\,, \dots\,, \mathbf{x}_n]^\mathsf{T}\) and the weights \(\mathbf{w}\), the above assumptions imply that the distribution \(p(\mathbf{y}\mid \mathbf{X}, \mathbf{w})\) is normal: to be specific, we have that $$\mathbf{y}\mid \mathbf{X}, \mathbf{w} \,\sim\,\mathcal{N}(\mathbf{X}\mathbf{w}, \sigma^2\mathbf{I}).$$
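As a quick illustration, here is a minimal NumPy sketch of drawing synthetic data from this model; the dimensions, the "true" weights and the noise level \(\sigma\) are arbitrary placeholder values, not anything prescribed by the model.

```python
import numpy as np

# Minimal sketch of sampling from the generative model above.
# n, d, sigma and w_true are arbitrary illustrative choices.
rng = np.random.default_rng(0)

n, d = 50, 3                          # number of observations, input dimension
sigma = 0.5                           # noise standard deviation (variance sigma**2)
w_true = np.array([1.0, -2.0, 0.5])   # "true" weights, unknown in practice

X = rng.normal(size=(n, d))                    # rows of X are the inputs x_i
y = X @ w_true + sigma * rng.normal(size=n)    # y = Xw + eps, eps ~ N(0, sigma^2 I)
```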

From the Bayesian point of view, we encode our belief about the weights in a prior \(p(\mathbf{w})\) and update it with the data to perform inference about the underlying model. We pick the conjugate prior for our Gaussian likelihood, setting \(p(\mathbf{w}) = \mathcal{N}\left(\mathbf{0}, \pmb{\Sigma}\right)\). We then apply Bayes' rule to obtain the posterior on the parameters \(\mathbf{w}\):

$$ p(\mathbf{w}\mid \mathbf{X}, \mathbf{y}) = \dfrac{p(\mathbf{y}\mid \mathbf{X}, \mathbf{w})\,p(\mathbf{w})}{p(\mathbf{y}\mid \mathbf{X})}$$

Ignoring the denominator (which is independent of the weights), we obtain that \( p(\mathbf{w}\mid \mathbf{X}, \mathbf{y}) \propto \exp(-\frac12 \mathrm{T}(\mathbf{w}, \mathbf{X}, \mathbf{y})\,)\) where the exponent \(\mathrm{T}(\mathbf{w}, \mathbf{X}, \mathbf{y})\) is given by

$$ \mathrm{T}(\mathbf{w}, \mathbf{X}, \mathbf{y}) = \mathbf{w}^\mathsf{T}\left(\pmb{\Sigma}^{-1} + \sigma^{-2}\mathbf{X}^\mathsf{T}\mathbf{X}\right)\mathbf{w} + \sigma^{-2}\mathbf{y}^\mathsf{T}\mathbf{y} - 2\sigma^{-2}\mathbf{w}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y} $$
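This expression is just the sum of the exponents of the likelihood and of the prior, expanded and grouped in powers of \(\mathbf{w}\):

$$ \sigma^{-2}(\mathbf{y} - \mathbf{X}\mathbf{w})^\mathsf{T}(\mathbf{y} - \mathbf{X}\mathbf{w}) + \mathbf{w}^\mathsf{T}\pmb{\Sigma}^{-1}\mathbf{w} = \mathbf{w}^\mathsf{T}\left(\pmb{\Sigma}^{-1} + \sigma^{-2}\mathbf{X}^\mathsf{T}\mathbf{X}\right)\mathbf{w} + \sigma^{-2}\mathbf{y}^\mathsf{T}\mathbf{y} - 2\sigma^{-2}\mathbf{w}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y}. $$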

Completing the square

We wish to complete the square to find an expression akin to $$ (\mathbf{w} - \pmb{\mu})^\mathsf{T}\mathbf{Q}^{-1}(\mathbf{w} - \pmb{\mu}) = \mathbf{w}^\mathsf{T}\mathbf{Q}^{-1}\mathbf{w} + \pmb{\mu}^\mathsf{T}\mathbf{Q}^{-1}\pmb{\mu} - 2\pmb{\mu}^\mathsf{T}\mathbf{Q}^{-1}\mathbf{w}$$ For this, we need to have \(\pmb{\mu}^\mathsf{T}\mathbf{Q}^{-1} = \sigma^{-2}(\mathbf{X}^\mathsf{T}\mathbf{y})^\mathsf{T}\) while matching the quadratic term. We can easily obtain this by setting

$$ \begin{cases} \mathbf{Q}^{-1} = \pmb{\Sigma}^{-1} + \sigma^{-2}\mathbf{X}^\mathsf{T}\mathbf{X} \\ \pmb{\mu} = \sigma^{-2}\mathbf{Q}\mathbf{X}^\mathsf{T}\mathbf{y} \end{cases} $$

We therefore obtain that the posterior \(\mathbf{w}\mid \mathbf{X}, \mathbf{y}\) is distributed as the Gaussian \(\mathcal{N}(\pmb{\mu}, \mathbf{Q})\). To be rigorous, we would need to verify that the leftover terms \(\sigma^{-2}\mathbf{y}^\mathsf{T}\mathbf{y} - \sigma^{-4}\mathbf{y}^\mathsf{T}\mathbf{X}\mathbf{Q}\mathbf{X}^\mathsf{T}\mathbf{y}\), which do not depend on \(\mathbf{w}\), are indeed absorbed by the normalizing constant. This will be the focus of another post.
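A minimal NumPy sketch of this posterior update follows; the isotropic prior covariance and the toy data are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

def posterior(X, y, sigma, Sigma_prior):
    """Posterior N(mu, Q) over the weights, with
    Q^{-1} = Sigma_prior^{-1} + sigma^{-2} X^T X and mu = sigma^{-2} Q X^T y."""
    Q_inv = np.linalg.inv(Sigma_prior) + (X.T @ X) / sigma**2
    Q = np.linalg.inv(Q_inv)   # explicit inverse for clarity; a Cholesky solve is preferable numerically
    mu = Q @ X.T @ y / sigma**2
    return mu, Q

# Toy usage with an isotropic prior; all values are illustrative placeholders.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=50)
mu, Q = posterior(X, y, sigma=0.5, Sigma_prior=np.eye(3))
```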

Posterior predictive

The predictive distribution is obtained by integrating the distribution \(p(f_\star \mid \mathbf{x}_\star, \mathbf{w})\) for a new function value \(f_\star = f(\mathbf{x}_\star)\) against the posterior:

$$ p(f_\star \mid \mathbf{x}_\star, \mathbf{X}, \mathbf{y}) = \int p(f_\star \mid \mathbf{x}_\star, \mathbf{w})\,p(\mathbf{w}\mid \mathbf{X}, \mathbf{y})\,\mathrm{d}\mathbf{w}$$

Because the term \(p(f_\star \mid \mathbf{x}_\star, \mathbf{w})\) is a Dirac delta \(\delta_{f_\star = \mathbf{w}^\mathsf{T}\mathbf{x}_\star}\), the distribution of \(f_\star \mid \mathbf{x}_\star, \mathbf{X}, \mathbf{y}\) is that of \(\mathbf{w}^\mathsf{T}\mathbf{x}_\star\) under the posterior. Using basic properties of multivariate Gaussians, we get the posterior predictive \(\mathcal{N}(\pmb{\mu}^\mathsf{T}\mathbf{x}_\star, \mathbf{x}_\star^\mathsf{T}\mathbf{Q}\mathbf{x}_\star)\).
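Translating this directly into code, here is a minimal sketch of the predictive mean and variance; the particular \(\pmb{\mu}\), \(\mathbf{Q}\) and \(\mathbf{x}_\star\) below are placeholders, and in practice \(\pmb{\mu}\) and \(\mathbf{Q}\) would come from the posterior update sketched earlier.

```python
import numpy as np

def posterior_predictive(x_star, mu, Q):
    """Mean and variance of f_star = w^T x_star when w ~ N(mu, Q)."""
    return mu @ x_star, x_star @ Q @ x_star

# Toy usage with placeholder posterior parameters.
mu = np.array([1.0, -2.0, 0.5])
Q = 0.01 * np.eye(3)
x_star = np.array([0.2, -0.1, 1.0])
f_mean, f_var = posterior_predictive(x_star, mu, Q)
```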