Bayesian linear regression: basics

The standard linear regression model with Gaussian noise is compactly summarized as

$$ f(\mathbf{x}) = \mathbf{w}^\mathsf{T} \mathbf{x}\,,\qquad y = f(\mathbf{x}) + \varepsilon\,, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2). $$

Given the data \(\mathbf{X} = [\mathbf{x}_1\,, \dots\,, \mathbf{x}_n]^\mathsf{T}\) and the weights \(\mathbf{w}\), the above assumptions imply that the distribution \(p(\mathbf{y}\mid \mathbf{X}, \mathbf{w})\) is normal: to be specific, we have that $$\mathbf{y}\mid \mathbf{X}, \mathbf{w} \,\sim\,\mathcal{N}(\mathbf{X}\mathbf{w}, \sigma^2\mathbf{I}).$$
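As a quick illustration, here is a minimal NumPy sketch of drawing synthetic data from this model; the dimensions, the "true" weights and the noise level \(\sigma\) are arbitrary placeholder values, not anything prescribed by the model.

```python
import numpy as np

# Minimal sketch of sampling from the generative model above.
# n, d, sigma and w_true are arbitrary illustrative choices.
rng = np.random.default_rng(0)

n, d = 50, 3                          # number of observations, input dimension
sigma = 0.5                           # noise standard deviation (variance sigma**2)
w_true = np.array([1.0, -2.0, 0.5])   # "true" weights, unknown in practice

X = rng.normal(size=(n, d))                    # rows of X are the inputs x_i
y = X @ w_true + sigma * rng.normal(size=n)    # y = Xw + eps, eps ~ N(0, sigma^2 I)
```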

From the Bayesian point of view, we encode our belief about the weights in a prior \(p(\mathbf{w})\) and update it with the data to perform inference about the underlying model. We pick the conjugate prior for our Gaussian likelihood, setting \(p(\mathbf{w}) = \mathcal{N}\left(\mathbf{0}, \pmb{\Sigma}\right)\). We then apply Bayes' rule to obtain the posterior on the parameters \(\mathbf{w}\):

$$ p(\mathbf{w}\mid \mathbf{X}, \mathbf{y}) = \dfrac{p(\mathbf{y}\mid \mathbf{X}, \mathbf{w})\,p(\mathbf{w})}{p(\mathbf{y}\mid \mathbf{X})}$$

Ignoring the denominator (which is independent of the weights), we obtain that \( p(\mathbf{w}\mid \mathbf{X}, \mathbf{y}) \propto \exp(-\frac12 \mathrm{T}(\mathbf{w}, \mathbf{X}, \mathbf{y})\,)\) where the exponent \(\mathrm{T}(\mathbf{w}, \mathbf{X}, \mathbf{y})\) is given by

$$ \mathrm{T}(\mathbf{w}, \mathbf{X}, \mathbf{y}) = \mathbf{w}^\mathsf{T}\left(\pmb{\Sigma}^{-1} + \sigma^{-2}\mathbf{X}^\mathsf{T}\mathbf{X}\right)\mathbf{w} + \sigma^{-2}\mathbf{y}^\mathsf{T}\mathbf{y} - 2\sigma^{-2}\mathbf{w}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y} $$
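This expression is just the sum of the exponents of the likelihood and of the prior, expanded and grouped in powers of \(\mathbf{w}\):

$$ \sigma^{-2}(\mathbf{y} - \mathbf{X}\mathbf{w})^\mathsf{T}(\mathbf{y} - \mathbf{X}\mathbf{w}) + \mathbf{w}^\mathsf{T}\pmb{\Sigma}^{-1}\mathbf{w} = \mathbf{w}^\mathsf{T}\left(\pmb{\Sigma}^{-1} + \sigma^{-2}\mathbf{X}^\mathsf{T}\mathbf{X}\right)\mathbf{w} + \sigma^{-2}\mathbf{y}^\mathsf{T}\mathbf{y} - 2\sigma^{-2}\mathbf{w}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y}. $$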

Completing the square

We wish to complete the square to find an expression akin to $$ (\mathbf{w} - \pmb{\mu})^\mathsf{T}\mathbf{Q}^{-1}(\mathbf{w} - \pmb{\mu}) = \mathbf{w}^\mathsf{T}\mathbf{Q}^{-1}\mathbf{w} + \pmb{\mu}^\mathsf{T}\mathbf{Q}^{-1}\pmb{\mu} - 2\pmb{\mu}^\mathsf{T}\mathbf{Q}^{-1}\mathbf{w}$$ For this, we need to have \(\pmb{\mu}^\mathsf{T}\mathbf{Q}^{-1} = \sigma^{-2}(\mathbf{X}^\mathsf{T}\mathbf{y})^\mathsf{T}\) while matching the quadratic term. We can easily obtain this by setting

$$ \begin{cases} \mathbf{Q}^{-1} = \pmb{\Sigma}^{-1} + \sigma^{-2}\mathbf{X}^\mathsf{T}\mathbf{X} \\ \pmb{\mu} = \sigma^{-2}\mathbf{Q}\mathbf{X}^\mathsf{T}\mathbf{y} \end{cases} $$

We therefore obtain that the posterior \(\mathbf{w}\mid \mathbf{X}, \mathbf{y}\) is distributed as the Gaussian \(\mathcal{N}(\pmb{\mu}, \mathbf{Q})\). To be rigorous, we would need to verify that the leftover terms \(\sigma^{-2}\mathbf{y}^\mathsf{T}\mathbf{y} - \sigma^{-4}\mathbf{y}^\mathsf{T}\mathbf{X}\mathbf{Q}\mathbf{X}^\mathsf{T}\mathbf{y}\), which do not depend on \(\mathbf{w}\), are indeed absorbed by the normalizing constant. This will be the focus of another post.
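A minimal NumPy sketch of this posterior update follows; the isotropic prior covariance and the toy data are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

def posterior(X, y, sigma, Sigma_prior):
    """Posterior N(mu, Q) over the weights, with
    Q^{-1} = Sigma_prior^{-1} + sigma^{-2} X^T X and mu = sigma^{-2} Q X^T y."""
    Q_inv = np.linalg.inv(Sigma_prior) + (X.T @ X) / sigma**2
    Q = np.linalg.inv(Q_inv)   # explicit inverse for clarity; a Cholesky solve is preferable numerically
    mu = Q @ X.T @ y / sigma**2
    return mu, Q

# Toy usage with an isotropic prior; all values are illustrative placeholders.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=50)
mu, Q = posterior(X, y, sigma=0.5, Sigma_prior=np.eye(3))
```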

Posterior predictive

The predictive distribution is obtained by integrating the distribution \(p(f_\star \mid \mathbf{x}_\star, \mathbf{w})\) for a new function value \(f_\star = f(\mathbf{x}_\star)\) against the posterior:

$$ p(f_\star \mid \mathbf{x}_\star, \mathbf{X}, \mathbf{y}) = \int p(f_\star \mid \mathbf{x}_\star, \mathbf{w})\,p(\mathbf{w}\mid \mathbf{X}, \mathbf{y})\,\mathrm{d}\mathbf{w}$$

Because the term \(p(f_\star \mid \mathbf{x}_\star, \mathbf{w})\) is a Dirac delta \(\delta_{f_\star = \mathbf{w}^\mathsf{T}\mathbf{x}_\star}\), the distribution of \(f_\star \mid \mathbf{x}_\star, \mathbf{X}, \mathbf{y}\) is that of \(\mathbf{w}^\mathsf{T}\mathbf{x}_\star\) under the posterior. Using basic properties of multivariate Gaussians, we get the posterior predictive \(\mathcal{N}(\pmb{\mu}^\mathsf{T}\mathbf{x}_\star, \mathbf{x}_\star^\mathsf{T}\mathbf{Q}\mathbf{x}_\star)\).
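Translating this directly into code, here is a minimal sketch of the predictive mean and variance; the particular \(\pmb{\mu}\), \(\mathbf{Q}\) and \(\mathbf{x}_\star\) below are placeholders, and in practice \(\pmb{\mu}\) and \(\mathbf{Q}\) would come from the posterior update sketched earlier.

```python
import numpy as np

def posterior_predictive(x_star, mu, Q):
    """Mean and variance of f_star = w^T x_star when w ~ N(mu, Q)."""
    return mu @ x_star, x_star @ Q @ x_star

# Toy usage with placeholder posterior parameters.
mu = np.array([1.0, -2.0, 0.5])
Q = 0.01 * np.eye(3)
x_star = np.array([0.2, -0.1, 1.0])
f_mean, f_var = posterior_predictive(x_star, mu, Q)
```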