Generalized Linear Models in Python

Given inputs $\mathbf{X} = [\mathbf{x}_1, \dots , \mathbf{x}_n]^\mathsf{T} \in \mathbb{R}^{n\times p}$ and outputs $\mathbf{y} = [y_1, \dots, y_n]^\mathsf{T} \in \mathbb{R}^n$, a linear model assumes the approximate relationship $y \approx \mathbf{x}^\mathsf{T}\beta$. Specifically, the assumptions of Ordinary Least Squares allow one to estimate a conditional mean, that is, one assumes $\mathbf{E}[y\mid\mathbf{x}] = \mathbf{x}^\mathsf{T}\beta$. These assumptions still allow for fixed transformations of the inputs $\tilde{\mathbf{x}} = \phi(\mathbf{x})$, so that the more complex model

$$\mathbf{E}[y\mid\mathbf{x}] = \tilde{\mathbf{x}}^\mathsf{T}\beta = \sum_{j=1}^p\beta_j\,\phi(\mathbf{x}^{(j)})$$

is still linear in the parameters $\beta$. However, in many situations one wishes to model a more complex relationship, one in which the conditional mean $\mathbf{E}[y\mid\mathbf{x}]$ is not linear in $\beta$. In this case, Generalized Linear Models (GLMs) are a good pick. These models are still relatively simple, in the sense that they only assume an invertible link function $g$ relating the conditional mean to the linear predictor, so that the relationship becomes

$$\mathbf{E}[y\mid\mathbf{x}] = g^{-1}(\tilde{\mathbf{x}}^\mathsf{T}\beta) = g^{-1}\left(\sum_{j=1}^p\beta_j\,\phi(\mathbf{x}^{(j)})\right)$$

In this post, we'll use the statsmodels Python API to fit GLMs.
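As a quick preview of the interface, here is a minimal sketch of fitting a GLM with a log link on synthetic data; the data, variable names, and the Gaussian-family choice are illustrative, not taken from the sections below.

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data: the conditional mean is exp(1.0 + 0.5 x), i.e. a log link.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.exp(1.0 + 0.5 * x) + rng.normal(scale=0.1, size=200)

X = sm.add_constant(x)                                   # intercept + single feature
family = sm.families.Gaussian(sm.families.links.Log())   # E[y|x] = exp(x^T beta)
results = sm.GLM(y, X, family=family).fit()
print(results.summary())
```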

Data

First, let's generate three bivariate datasets with different nonlinear relationships. This means that $y = \psi(\phi(x) + \varepsilon)$, where $\varepsilon$ is some zero-centered, symmetric noise. To make the results easy to visualize, we set $p=1$ at first.
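The exact transforms are not important for what follows; as a rough sketch of the generating process (with stand-in nonlinearities in place of the actual splines), the data could be produced like this:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Stand-ins for the transforms phi and psi (illustrative, not the actual splines).
phi = lambda x: np.sin(2.0 * x) + 0.5 * x
psi = lambda t: t + 0.3 * t ** 2

x = rng.uniform(-3.0, 3.0, size=n)
eps = rng.normal(scale=0.3, size=n)   # zero-centered, symmetric noise
y = psi(phi(x) + eps)                 # y = psi(phi(x) + eps), here with p = 1
```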

Double Spline

When both $\phi$ and $\psi$ are splines, we obtain the following scatterplot:

[Figure: scatter plot of the double-spline data]

The red curve shows a simple polynomial fit of degree $5$.
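For reference, an overlay like the red curve can be produced with an ordinary degree-5 polynomial least-squares fit; this sketch reuses the `x` and `y` arrays from the data-generation snippet above.

```python
import matplotlib.pyplot as plt
import numpy as np

# Degree-5 polynomial least-squares fit, evaluated on a fine grid for plotting.
coefs = np.polynomial.polynomial.polyfit(x, y, deg=5)
grid = np.linspace(x.min(), x.max(), 200)

plt.scatter(x, y, s=8, alpha=0.4)
plt.plot(grid, np.polynomial.polynomial.polyval(grid, coefs), color="red")
plt.show()
```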

Positive Spline

When we set $\phi$ as a spline, but take $\psi$ to be a positive function such as $\psi(x) = a\,x^2 + b\,\exp(x)$ with $a, b > 0$, we obtain the scatterplot

[Figure: scatter plot of the positive-spline data]
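A sketch of this positive construction, again with a stand-in for the spline $\phi$ and illustrative values of $a$ and $b$:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500

a, b = 0.5, 1.0                                   # illustrative positive coefficients
phi = lambda x: np.sin(2.0 * x) + 0.5 * x         # stand-in for the spline phi
psi = lambda t: a * t ** 2 + b * np.exp(t)        # positive outer transform

x = rng.uniform(-3.0, 3.0, size=n)
y = psi(phi(x) + rng.normal(scale=0.3, size=n))   # strictly positive outputs
```

Since the outputs are strictly positive, a log link is a natural candidate when we later fit a GLM to this dataset.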

Discrete output

To be added.