Small Deviations

Broadcasting my objective subjectivity

Suppose you have a quadratic form \(f\) defined on \(\mathbb{R}^p\):

$$ f(\mathbf{x}) \triangleq \mathbf{x}^\mathsf{T}\mathbf{A}\mathbf{x} + \mathbf{b}^\mathsf{T}\mathbf{x} + c $$

“Completing the square” means rewriting this expression in a simpler, factorized form:

$$ f(\mathbf{x}) \triangleq (\mathbf{x} - \bar{\mathbf{x}})^\mathsf{T}\mathbf{Q}\,(\mathbf{x} - \bar{\mathbf{x}}) + d $$

where \(\bar{\mathbf{x}}\) is the center of the ellipsoid whose shape is characterized by the eigenvalues of \(\mathbf{Q}\).

Algebraic manipulations

Completing the square is analogous to the scalar case (we assume \(\mathbf{A}\) is symmetric, which is without loss of generality for a quadratic form): first, expand the candidate expression:

$$ f(\mathbf{x}) = \mathbf{x}^\mathsf{T}\mathbf{Q}\mathbf{x} + \bar{\mathbf{x}}^\mathsf{T}\mathbf{Q}\bar{\mathbf{x}} - 2 \bar{\mathbf{x}}^\mathsf{T} \mathbf{Q}^\mathsf{T}\mathbf{x} + d $$

Then, proceed by identifying terms with the original expression, yielding the system

$$ \mathbf{Q} = \mathbf{A} \qquad \text{and} \qquad - 2\mathbf{Q}\bar{\mathbf{x}} = \mathbf{b} $$ Assuming \(\mathbf{A}\) is invertible, we get \(\bar{\mathbf{x}} = -\frac12\mathbf{A}^{-1}\mathbf{b}\). Thus, we can proceed by adding and subtracting the necessary terms:

\begin{align} f(\mathbf{x}) &= \mathbf{x}^\mathsf{T}\mathbf{A}\mathbf{x} + \mathbf{b}^\mathsf{T}\mathbf{x} + c \\ &= \mathbf{x}^\mathsf{T}\mathbf{Q}\mathbf{x} - 2 \bar{\mathbf{x}}^\mathsf{T} \mathbf{Q}^\mathsf{T}\mathbf{x} + c \\ &= \mathbf{x}^\mathsf{T}\mathbf{Q}\mathbf{x} - 2 \bar{\mathbf{x}}^\mathsf{T} \mathbf{Q}^\mathsf{T}\mathbf{x} + \bar{\mathbf{x}}^\mathsf{T}\mathbf{Q}\bar{\mathbf{x}} - \bar{\mathbf{x}}^\mathsf{T}\mathbf{Q}\bar{\mathbf{x}} + c \\ &= \mathbf{x}^\mathsf{T}\mathbf{Q}\mathbf{x} + \bar{\mathbf{x}}^\mathsf{T}\mathbf{Q}\bar{\mathbf{x}} - 2 \bar{\mathbf{x}}^\mathsf{T} \mathbf{Q}^\mathsf{T}\mathbf{x} + d \\ &= (\mathbf{x} - \bar{\mathbf{x}})^\mathsf{T}\mathbf{Q}\,(\mathbf{x} - \bar{\mathbf{x}}) + d \end{align}

finally yielding \(d = c - \bar{\mathbf{x}}^\mathsf{T}\mathbf{Q}\bar{\mathbf{x}} = c - \frac14\mathbf{b}^\mathsf{T}\mathbf{A}^{-1}\mathbf{b}\).

Writing this in a single equation, we get:

\begin{align} \mathbf{x}^\mathsf{T}\mathbf{A}\mathbf{x} + \mathbf{b}^\mathsf{T}\mathbf{x} + c = &\left(\mathbf{x}+\frac12\mathbf{A}^{-1}\mathbf{b}\right)^\mathsf{T}\mathbf{A}\left(\mathbf{x}+\frac12\mathbf{A}^{-1}\mathbf{b}\right)\\ &+ c - \frac14\mathbf{b}^\mathsf{T}\mathbf{A}^{-1}\mathbf{b} \end{align}

An example

Let's take an example of a quadratic form: we randomly generate the parameters \(\mathbf{A}, \mathbf{b}, c\) and obtain:

$$\mathbf{A} = \begin{bmatrix} 7.722 & 2.774 \\ 2.774 & 1.078 \end{bmatrix} \quad \mathbf{b} = \begin{bmatrix} 0.651 \\ -0.319 \end{bmatrix} \quad c = -0.848 $$

The snippet to generate this quadratic form in Python is:

import jax.numpy as jnp
import numpy as np
from jax import vmap

_seed = 11
np.random.seed(_seed)
num_points = 2500
xgrid = ygrid = np.linspace(-10,10, num_points)
meshx, meshy = np.meshgrid(xgrid, ygrid)
points = np.vstack([meshx.ravel(), meshy.ravel()]).T
# Draw a random symmetric positive semi-definite matrix A
A = np.random.randn(2,2)
A = A @ A.T
A = np.round(A,3)
b = np.random.randn(2,1)
b = np.round(b, 3)
c = np.round(np.random.randn(1), 3)

A,b,c = jnp.array(A), jnp.array(b), jnp.array(c)

def quadratic_form(x):
    return x.T @ A @ x + b.T @ x + c

qfvals = vmap(quadratic_form)(points)

The contours of this quadratic form are plotted below:

contours example

Now let us plot the contours of the factorized version of \(f\):

factorized quadform

They are the same, as expected. The snippet to reproduce this second experiment is:

def factorized_form(x):
    x = x.reshape(-1,1)
    xbar = -0.5 * jnp.linalg.inv(A) @ b           # xbar = -(1/2) A^{-1} b
    d = c - 0.25 * b.T @ jnp.linalg.inv(A) @ b    # d = c - (1/4) b^T A^{-1} b
    return (x - xbar).T @ A @ (x - xbar) + d

qfvals = vmap(factorized_form)(points)

The standard linear regression model with Gaussian noise is compactly summarized as

$$ f(\mathbf{x}) = \mathbf{w}^\mathsf{T} \mathbf{x}\,,\qquad y = f(\mathbf{x}) + \varepsilon\,, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2). $$

Given the data \(\mathbf{X} = [\mathbf{x}_1\,, \dots\,, \mathbf{x}_n]^\mathsf{T}\) and the weights \(\mathbf{w}\), the above assumptions imply that the distribution \(p(\mathbf{y}\mid \mathbf{X}, \mathbf{w})\) is normal: to be specific, we have that $$\mathbf{y}\mid \mathbf{X}, \mathbf{w} \,\sim\,\mathcal{N}(\mathbf{X}\mathbf{w}, \sigma^2\mathbf{I})$$

From the Bayesian point of view, placing a prior belief \(p(\mathbf{w})\) on the weights is a natural way to perform inference about the underlying model. We pick our prior as the conjugate of our likelihood: we thus set \(p(\mathbf{w}) = \mathcal{N}\left(\mathbf{0}, \pmb{\Sigma}\right)\). We apply Bayes' rule to obtain the posterior on the parameters \(\mathbf{w}\):

$$ p(\mathbf{w}\mid \mathbf{X}, \mathbf{y}) = \dfrac{p(\mathbf{y}\mid \mathbf{X}, \mathbf{w})\,p(\mathbf{w})}{p(\mathbf{y}\mid \mathbf{X})}$$

Ignoring the denominator (which is independent of the weights), we obtain that \( p(\mathbf{w}\mid \mathbf{X}, \mathbf{y}) \propto \exp(-\frac12 \mathrm{T}(\mathbf{w}, \mathbf{X}, \mathbf{y})\,)\) where the exponent \(\mathrm{T}(\mathbf{w}, \mathbf{X}, \mathbf{y})\) is given by

$$ \mathrm{T}(\mathbf{w}, \mathbf{X}, \mathbf{y}) = \mathbf{w}^\mathsf{T}\left(\pmb{\Sigma}^{-1} + \sigma^{-2}\mathbf{X}^\mathsf{T}\mathbf{X}\right)\mathbf{w} + \sigma^{-2}\mathbf{y}^\mathsf{T}\mathbf{y} - 2\sigma^{-2}\mathbf{w}^\mathsf{T}\mathbf{X}^\mathsf{T}\mathbf{y} $$

Completing the square

We wish to complete the square to find an expression akin to $$ (\mathbf{w} - \pmb{\mu})^\mathsf{T}\mathbf{Q}^{-1}(\mathbf{w} - \pmb{\mu}) = \mathbf{w}^\mathsf{T}\mathbf{Q}^{-1}\mathbf{w} + \pmb{\mu}^\mathsf{T}\mathbf{Q}^{-1}\pmb{\mu} - 2\pmb{\mu}^\mathsf{T}\mathbf{Q}^{-1}\mathbf{w}$$ For this, we need to have \(\pmb{\mu}^\mathsf{T}\mathbf{Q}^{-1} = \sigma^{-2}(\mathbf{X}^\mathsf{T}\mathbf{y})^\mathsf{T}\). We can easily obtain this by setting

$$ \begin{cases} &\mathbf{Q}^{-1} = \pmb{\Sigma}^{-1} + \sigma^{-2}\mathbf{X}^\mathsf{T}\mathbf{X} \\ &\pmb{\mu} = \sigma^{-2}\mathbf{Q}\mathbf{X}^\mathsf{T}\mathbf{y} \end{cases} $$

We therefore obtain that the posterior \(\mathbf{w}\mid \mathbf{X}, \mathbf{y}\) is distributed as the Gaussian \(\mathcal{N}(\pmb{\mu}, \mathbf{Q})\). To be rigorous, we would need to verify that the leftover terms \(\sigma^{-2}\mathbf{y}^\mathsf{T}\mathbf{y} - \sigma^{-4}\mathbf{y}^\mathsf{T}\mathbf{X}\mathbf{Q}\mathbf{X}^\mathsf{T}\mathbf{y}\) are indeed appropriately absorbed by the normalizing constant. This will be the focus of another post.

Posterior predictive

The predictive distribution is obtained by integrating the distribution \(p(f_\star \mid \mathbf{x}_\star, \mathbf{w})\) for a new function value \(f_\star = f(\mathbf{x}_\star)\) against the posterior:

$$ p(f_\star \mid \mathbf{x}_\star, \mathbf{X}, \mathbf{y}) = \int p(f_\star \mid \mathbf{x}_\star, \mathbf{w})\,p(\mathbf{w}\mid \mathbf{X}, \mathbf{y})\,\mathrm{d}\mathbf{w}$$

Because the term \(p(f_\star \mid \mathbf{x}_\star, \mathbf{w})\) is a Dirac delta \(\delta_{f_\star = \mathbf{w}^\mathsf{T}\mathbf{x}_\star}\), the distribution of \(f_\star \mid \mathbf{x}_\star, \mathbf{X}, \mathbf{y}\) is equal to the distribution of \(\mathbf{w}^\mathsf{T}\mathbf{x}_\star\) under the posterior. Using basic properties of multivariate Gaussians, we get the posterior predictive \(\mathcal{N}(\pmb{\mu}^\mathsf{T}\mathbf{x}_\star, \mathbf{x}_\star^\mathsf{T}\mathbf{Q}\mathbf{x}_\star)\).
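As a sanity check, here is a minimal sketch of the posterior and posterior-predictive computations on synthetic data (the dimensions, noise level and prior covariance below are arbitrary placeholders):

import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
sigma = 0.5                              # noise standard deviation (assumed known)
Sigma = np.eye(p)                        # prior covariance of the weights
X = rng.normal(size=(n, p))
w_true = rng.normal(size=p)
y = X @ w_true + sigma * rng.normal(size=n)

# Posterior: Q^{-1} = Sigma^{-1} + sigma^{-2} X^T X,  mu = sigma^{-2} Q X^T y
Q = np.linalg.inv(np.linalg.inv(Sigma) + X.T @ X / sigma**2)
mu = Q @ X.T @ y / sigma**2

# Posterior predictive at a new input x_star: N(mu^T x_star, x_star^T Q x_star)
x_star = rng.normal(size=p)
pred_mean = mu @ x_star
pred_var = x_star @ Q @ x_star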

Following up on a previous post on block-LDU decompositions, we now have enough knowledge to derive the famous Woodbury identity.

Block-LDU decomposition

Recall the block-LDU decomposition of a matrix \(M\):

$$ \begin{bmatrix} A & B\\ C & D \end{bmatrix} = \begin{bmatrix} \mathrm{I}_{n} & 0 \\ CA^{-1} & \mathrm{I}_{m} \end{bmatrix} \begin{bmatrix} A & 0 \\ 0 & D - CA^{-1}B \end{bmatrix} \begin{bmatrix} \mathrm{I}_{n} & \, A^{-1}B\\ 0 & \mathrm{I}_{m} \end{bmatrix} $$

If \(D\) is invertible, we can also decompose \(M\) into \(\tilde{U}\tilde{D}\tilde{L}\) blocks:

$$ \begin{bmatrix} A & B\\ C & D \end{bmatrix} = \begin{bmatrix} \mathrm{I}_{n} & BD^{-1} \\ 0 & \mathrm{I}_{m} \end{bmatrix} \begin{bmatrix} A - BD^{-1}C & 0 \\ 0 & D \end{bmatrix} \begin{bmatrix} \mathrm{I}_{n} & 0 \\ D^{-1}C & \mathrm{I}_{m} \end{bmatrix} $$

These two block factorizations can be inverted, and comparing their entries yields the formula.

A tale of two inverses

Let's compute the inverse \(M^{-1}\) in terms of \(M=LDU\) and \(M = \tilde{U}\tilde{D}\tilde{L}\). Computing the inverses of block triangular matrices is easy. The first block-LDU decomposition yields the inverse

$$ M^{-1} = \begin{bmatrix} \mathrm{I}_{n} & \, -A^{-1}B\\ 0 & \mathrm{I}_{m} \end{bmatrix} \begin{bmatrix} A^{-1} & 0 \\ 0 & \left(D - CA^{-1}B\right)^{-1} \end{bmatrix} \begin{bmatrix} \mathrm{I}_{n} & 0 \\ - CA^{-1} & \mathrm{I}_{m} \end{bmatrix} $$

which finally yields the blocks

$$ M^{-1} = \begin{bmatrix} A^{-1} +A^{-1}B \left(D - CA^{-1}B\right)^{-1}CA^{-1} & \, -A^{-1}B \left(D - CA^{-1}B\right)^{-1}\\ - \left(D - CA^{-1}B\right)^{-1}CA^{-1} & \left(D - CA^{-1}B\right)^{-1} \end{bmatrix} $$

The second (block-UDL) decomposition implies that

$$ M^{-1} = \begin{bmatrix} \mathrm{I}_{n} & 0 \\ -D^{-1}C & \mathrm{I}_{m} \end{bmatrix} \begin{bmatrix} \left(A - BD^{-1}C\right)^{-1} & 0 \\ 0 & D^{-1} \end{bmatrix} \begin{bmatrix} \mathrm{I}_{n} & -BD^{-1} \\ 0 & \mathrm{I}_{m} \end{bmatrix} $$

block-multiplying the three matrices yields the expression

$$ M^{-1} = \begin{bmatrix} \left(A - BD^{-1}C\right)^{-1} & -\left(A - BD^{-1}C\right)^{-1}BD^{-1} \\ -D^{-1}C\left(A - BD^{-1}C\right)^{-1} & D^{-1} + D^{-1}C\left(A - BD^{-1}C\right)^{-1}BD^{-1} \end{bmatrix} $$

The final result

By identifying the two expressions for \(M^{-1}\) we obtain that, in particular, the upper-left blocks are equal. This means that

$$ \left(A - BD^{-1}C\right)^{-1} = A^{-1} +A^{-1}B \left(D - CA^{-1}B\right)^{-1}CA^{-1} $$ which is the final result! The result is often written differently, e.g. by substituting \(B = U\), \(C = V\), \(D = -W^{-1}\):

$$ \left(A + UWV\right)^{-1} = A^{-1} - A^{-1}U \left(W^{-1} + VA^{-1}U\right)^{-1}VA^{-1} $$
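Before moving on, here is a quick numerical sanity check of the identity, as a sketch with arbitrary random (conformable) matrices:

import numpy as np

rng = np.random.default_rng(1)
n, k = 5, 2
A = rng.normal(size=(n, n)) + n * np.eye(n)   # shift the diagonal to keep A invertible
U = rng.normal(size=(n, k))
W = rng.normal(size=(k, k)) + k * np.eye(k)
V = rng.normal(size=(k, n))

inv = np.linalg.inv
lhs = inv(A + U @ W @ V)
rhs = inv(A) - inv(A) @ U @ inv(inv(W) + V @ inv(A) @ U) @ V @ inv(A)
print(np.allclose(lhs, rhs))                  # expected: True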

Corollaries

As the Woodbury lemma holds for any conformable matrices (that is, \(A, U, W, V\) are respectively \(n\times n, n\times k, k \times k, k \times n\)), we can obtain several special cases.

Inverse of a sum

By setting \(U \triangleq V \triangleq \mathrm{I}_n\) (and \( W \triangleq B\)) we obtain that

$$ \left(A + B\right)^{-1} = A^{-1} - A^{-1}\left(A^{-1} + B^{-1}\right)^{-1}A^{-1}$$

Inverse of a sum of products

By setting \(W \triangleq \mathrm{I}\) we obtain that

$$ \left(A + BC\right)^{-1} = A^{-1} - A^{-1}B\left(\mathrm{I} + CA^{-1}B\right)^{-1}CA^{-1}$$

If the first matrix is itself a product \(AB\) of invertible matrices, we have that

$$ \left(AB + CD\right)^{-1} = B^{-1}A^{-1} - B^{-1}A^{-1}C\left(\mathrm{I} + DB^{-1}A^{-1}C\right)^{-1}DB^{-1}A^{-1}$$

This post mirrors the background section of the Schur complement Wikipedia entry. It re-creates the block-LDU decomposition of a square matrix in terms of its blocks.

Assume we have a square matrix \({M}\) of size \((m+n)\times (m+n)\) with blocks \(A, B, C, D\). Assume \(A\) is square \(n \times n\) and invertible, and \(D\) is square \(m \times m\). Our matrix is:

$$ {M} = \begin{bmatrix} A& B\\ C& D \end{bmatrix} $$

and we wish to perform block Gaussian elimination to remove the lower-left block \(C\).

Finding the correct transformation

We proceed by left-multiplying by a lower-block-triangular matrix \(L\):

$$ LM = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} A& B\\ C& D \end{bmatrix} = \begin{bmatrix} A'& B'\\ 0 & D' \end{bmatrix} $$

The most important equation is the equality implied by the lower-left block: we must pick \(L\) so that \(L_{21}A + L_{22}C = 0\). We do not need to find every possible solution; one is enough. Let's pick \(L_{22} = \mathrm{I}_m\) the identity and \(L_{21} = -CA^{-1}\). To simplify things we also set \(L_{11} = \mathrm{I}_n\).

We finally have that

$$ LM = \begin{bmatrix} \mathrm{I}_{n} & 0 \\ -CA^{-1} & \mathrm{I}_{m} \end{bmatrix} \begin{bmatrix} A& B\\ C& D \end{bmatrix} = \begin{bmatrix} A& B\\ 0 & D -CA^{-1}B \end{bmatrix} $$

The matrix \(D -CA^{-1}B\) is called the Schur complement of \(A\) in \(M\). One can find the right order (\(D\) minus \(C, A, B\)) visually by using the following rule:

schur rule

Block diagonalization

Can we further simplify the matrix \(LM\)? It turns out we can reduce it to a block-diagonal matrix by multiplying on the right by an upper-block-triangular matrix. To find the correct transformation, we repeat the process to find \(U\) so that

$$ LMU = \begin{bmatrix} A& B\\ 0 & D -CA^{-1}B \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix} = \begin{bmatrix} D_{11} & 0 \\ 0 & D_{22} \end{bmatrix} $$

Again, we only need to find one solution, and it's easy to identify the critical equation, which is the one involving the \(B\) and \(U_{12}\) blocks: \( A U_{12} + BU_{22} = 0\). We pick (as we did before) \( U_{22} = \mathrm{I}_m\) and \(U_{12} = -A^{-1}B\). The rest is not important, so we pick the simplest solution \(U_{11} = \mathrm{I}_n\).

Finally, the decomposition is

$$ LMU = \begin{bmatrix} \mathrm{I}_{n} & 0 \\ -CA^{-1} & \mathrm{I}_{m} \end{bmatrix} \begin{bmatrix} A& B\\ C & D \end{bmatrix} \begin{bmatrix} \mathrm{I}_{n} & \, - A^{-1}B\\ 0 & \mathrm{I}_{m} \end{bmatrix} = \begin{bmatrix} A & 0 \\ 0 & D - CA^{-1}B \end{bmatrix} $$

The LDU decomposition

The above equality can be modified by multiplying by the inverses \(\tilde{L}, \tilde{U}\) of \(L,U\). These matrices are also lower and upper triangular, so that

$$ M = \tilde{L}D\tilde{U}$$

This decomposition into simple block-matrices (block-triangular and block-diagonal) is called the block-LDU decomposition. Its final form is

$$ \begin{bmatrix} A& B\\ C & D \end{bmatrix} = \begin{bmatrix} \mathrm{I}_{n} & 0 \\ CA^{-1} & \mathrm{I}_{m} \end{bmatrix} \begin{bmatrix} A & 0 \\ 0 & D - CA^{-1}B \end{bmatrix} \begin{bmatrix} \mathrm{I}_{n} & \, A^{-1}B\\ 0 & \mathrm{I}_{m} \end{bmatrix} $$

A good way of quickly recalling the decomposition (assuming one knows the expression for the Schur complement) is:

block ldu memo
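To convince oneself numerically, here is a small sketch that rebuilds a random matrix from its block-LDU factors (the block sizes below are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 2
A = rng.normal(size=(n, n)) + n * np.eye(n)    # keep the upper-left block invertible
B = rng.normal(size=(n, m))
C = rng.normal(size=(m, n))
D = rng.normal(size=(m, m))
M = np.block([[A, B], [C, D]])

Ainv = np.linalg.inv(A)
S = D - C @ Ainv @ B                           # Schur complement
L = np.block([[np.eye(n), np.zeros((n, m))], [C @ Ainv, np.eye(m)]])
Dblk = np.block([[A, np.zeros((n, m))], [np.zeros((m, n)), S]])
U = np.block([[np.eye(n), Ainv @ B], [np.zeros((m, n)), np.eye(m)]])
print(np.allclose(L @ Dblk @ U, M))            # expected: True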

Suppose you have a block lower-triangular matrix \(\mathrm{L}\) of size \((m+n)\times (m+n)\) so that

$$ \mathrm{L} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} $$

where \(L_{11}\) is square \(n\times n\) and \(L_{22}\) is square \(m\times m\). To find the inverse, we simply solve a system in the block-unknowns \(A,B,C,D\):

$$ \mathrm{L}\mathrm{L}^{-1} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix} \begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} \mathrm{I}_n & 0 \\ 0 & \mathrm{I}_m \end{bmatrix} = \mathrm{I}_{n+m} $$

using block-wise matrix multiplication, we obtain the system

$$ \begin{cases} &L_{11} A = \mathrm{I}_n\\ &L_{11} B = 0 \\ &L_{21}A = -L_{22}C \\ &L_{21}B + L_{22}D = \mathrm{I}_m \end{cases} $$

Therefore, necessary and sufficient conditions for the existence of \(\mathrm{L}^{-1}\) are:

  • \(L_{11}\) is invertible and thus \(A = L_{11}^{-1}\)
  • \(B = 0\)
  • \(L_{22}\) is invertible and \(D = L_{22}^{-1}\)
  • We have that \(C = -L_{22}^{-1} L_{21} L_{11}^{-1}\)

These conditions can be summarized into the equality

$$ \mathrm{L}^{-1} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix}^{-1} = \begin{bmatrix} L_{11}^{-1} & 0 \\ - L_{22}^{-1} L_{21} L_{11}^{-1} & L_{22}^{-1} \end{bmatrix} $$
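A minimal numerical check of this block formula (block sizes chosen arbitrarily):

import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 2
L11 = rng.normal(size=(n, n)) + n * np.eye(n)  # invertible diagonal blocks
L22 = rng.normal(size=(m, m)) + m * np.eye(m)
L21 = rng.normal(size=(m, n))
L = np.block([[L11, np.zeros((n, m))], [L21, L22]])

inv = np.linalg.inv
Linv = np.block([[inv(L11), np.zeros((n, m))],
                 [-inv(L22) @ L21 @ inv(L11), inv(L22)]])
print(np.allclose(Linv, inv(L)))               # expected: True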

Note that upper-triangular matrices possess the same property, meaning that

$$ \mathrm{U}^{-1} = \begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix}^{-1} = \begin{bmatrix} U_{11}^{-1} & -U_{11}^{-1}U_{12}U_{22}^{-1}\\ 0 & U_{22}^{-1} \end{bmatrix} $$

Memorizing the two formulas is not necessary, as one can transform \(U\) into a lower-block-triangular matrix \(\mathrm{L} = \mathrm{U}^\mathsf{T}\), apply the \(L\)-inverse formula to obtain the lower-left inverse block $$\left[\mathrm{L}^{-1}\right]_{21} = - L_{22}^{-1} L_{21} L_{11}^{-1}$$ and then swap the blocks again: the \(1\)'s become \(2\)'s and vice-versa, giving $$\left[\mathrm{U}^{-1}\right]_{12} = - U_{11}^{-1} U_{12} U_{22}^{-1}$$ as expected.

This short blog post is about a formula which is useful in statistics, machine learning and control science. In these fields, one often has a matrix $\mathbf{A}$ which is element-wise parametrized by a scalar $\theta$, so that the matrix entries are each scalar functions of the parameter:

$$ \mathbf{A}(\theta) = \begin{bmatrix} a_{11}(\theta) & a_{12}(\theta) &\ldots & a_{1j}(\theta) & \ldots & a_{1n}(\theta) \\ a_{21}(\theta) & a_{22}(\theta) &\ldots & a_{2j}(\theta) & \ldots & a_{2n}(\theta) \\ \vdots & \vdots & \ddots & \vdots & & \vdots \\ a_{i1}(\theta) & a_{i2}(\theta) &\ldots & a_{ij}(\theta) & \ldots & a_{in}(\theta) \\ \vdots & \vdots & & \vdots & \ddots & \vdots \\ a_{n1}(\theta) & a_{n2}(\theta) &\ldots & a_{nj}(\theta) & \ldots & a_{nn}(\theta) \\ \end{bmatrix} $$

We then naturally define the derivative $\dfrac{\partial \mathbf{A}}{\partial \theta} $ as the matrix with entries $\left[\dfrac{\partial a_{ij}}{\partial \theta} \right]_{ij}$. Assume that $\mathbf{A}^{-1}$ exists for a specific value of $\theta$. In order to find the derivative $\dfrac{\partial \mathbf{A}^{-1}}{\partial \theta} $ of the inverse, we first differentiate each side of the equality

\begin{align} \mathbf{A}^{-1}\mathbf{A} = \mathbf{A}\mathbf{A}^{-1} = \mathbf{I}_n \end{align}

However, we first need to figure out the derivative of a product $\mathbf{A}\mathbf{B}$.

Derivative of a product of matrices

Suppose both $\mathbf{A}, \mathbf{B}$ depend on the parameter $\theta$. The $ij$-th entry of the product $\mathbf{A}\mathbf{B}$ is:

$$ \left[\mathbf{A}\mathbf{B} \right]_{ij} = \sum_{k=1}^n a_{ik}b_{kj} $$

We therefore only need the derivative of this expression for every entry of the product $\mathbf{A}\mathbf{B}$. This is a derivative of scalar functions, which is easy to evaluate:

\begin{align} \left[\dfrac{\partial}{\partial \theta}\mathbf{A}\mathbf{B} \right]_{ij} &= \dfrac{\partial}{\partial \theta}\sum_{k=1}^n a_{ik}b_{kj} \\ &= \sum_{k=1}^n \dfrac{\partial}{\partial \theta}(a_{ik}b_{kj}) \\ &= \sum_{k=1}^n \left(\dfrac{\partial a_{ik}}{\partial \theta}b_{kj} + a_{ik}\dfrac{\partial b_{kj}}{\partial \theta}\right) \\ & = \sum_{k=1}^n \dfrac{\partial a_{ik}}{\partial \theta}b_{kj} + \sum_{p=1}^n a_{ip}\dfrac{\partial b_{pj}}{\partial \theta} \\ & = \left[ \dfrac{\partial \mathbf{A}}{\partial \theta}\mathbf{B} + \mathbf{A} \dfrac{\partial \mathbf{B}}{\partial \theta}\right]_{ij} \end{align}

Last derivations

We can apply this identity to obtain the equality:

$$ \dfrac{\partial \mathbf{A}}{\partial \theta}\mathbf{A}^{-1} + \mathbf{A}\dfrac{\partial \mathbf{A}^{-1}}{\partial \theta} = \mathbf{0}_n \iff \dfrac{\partial \mathbf{A}^{-1}}{\partial \theta} = \,-\mathbf{A}^{-1}\dfrac{\partial \mathbf{A}}{\partial \theta}\mathbf{A}^{-1} $$
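We can check this identity numerically with JAX's automatic differentiation; the matrix-valued function A(theta) below is just an arbitrary smooth example.

import jax.numpy as jnp
from jax import jacfwd

def A(theta):
    # An arbitrary smooth parametrization of a 2x2 matrix.
    return jnp.array([[3.0, 1.0], [0.5, 2.0]]) + theta * jnp.array([[0.0, 1.0], [1.0, 0.0]])

def A_inv(theta):
    return jnp.linalg.inv(A(theta))

theta0 = 0.3
lhs = jacfwd(A_inv)(theta0)                    # dA^{-1}/dtheta via forward-mode autodiff
rhs = -jnp.linalg.inv(A(theta0)) @ jacfwd(A)(theta0) @ jnp.linalg.inv(A(theta0))
print(jnp.allclose(lhs, rhs, atol=1e-5))       # expected: True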

This post is about using Python's Pillow library together with the Python OpenCV bindings (cv2) to perform basic image processing tasks. To be specific, the goal is to transform an object's image into a scatterplot representing its boundaries.

Loading an image

We'll first work with this Kirby image, which we can obtain from the URL using the Python requests package as follows:

from PIL import Image
import numpy as np
import requests

url = "https://upload.wikimedia.org/wikipedia/en/2/2d/SSU_Kirby_artwork.png"

url_data = requests.get(url, stream=True)

img = Image.open(url_data.raw)
img = np.asarray(img)

We can then visualize the image using matplotlib's imshow routine

import matplotlib.pyplot as plt

plt.imshow(img)
plt.axis('off')
plt.show()

The result is as expected:

image base

Edge detection

We'll extract the image's edges using the Canny algorithm. We'll skip the details and jump to the results. Using OpenCV's Canny implementation with both thresholds set to $90$, we obtain the edges:

image edges
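For reference, here is a sketch of how such an edge map can be obtained; the grayscale conversion is an assumption about the image's channels (the artwork is RGBA):

import cv2 as ocv

# Canny expects a single-channel 8-bit image, so drop the alpha channel
# and convert to grayscale first.
gray = ocv.cvtColor(img[..., :3], ocv.COLOR_RGB2GRAY)
edges = ocv.Canny(gray, 90, 90)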

The image is a rectangular $363\times 275$ array, which we can pad to a $363\times 363$ square using the snippet

import cv2 as ocv

size = 363  # target square side: pad the 363x275 edge map to 363x363
pad_edges = ocv.copyMakeBorder(edges,
                           (size - edges.shape[0])//2,
                           (size - edges.shape[0])//2,
                           (size - edges.shape[1])//2,
                           (size - edges.shape[1])//2,
                           ocv.BORDER_CONSTANT,
                           value=[0,0,0, 0])

This yields the image

edges square

Edges as a scatter

To turn our square image array $I$ of size $363\times 363$ into a set of points $P$ in the plane, we start from a square grid $G$ of the same size with values uniformly spaced on the square $[-5,+5]^2$. We then map the pixel $I_{ij}$ to the point $G_{ij}$ only if the pixel's intensity is nonzero. This is simply done through a for loop:

xgrid = ygrid = np.linspace(-5, 5, pad_edges.shape[0])   # uniform grid on [-5, +5]
points = []
for i,x in enumerate(xgrid):
    for j,y in enumerate(ygrid):
        if pad_edges[i,j] > 0:
            points.append([x,y])

The result is the following scatter:

scatter

Given inputs $\mathbf{X} = [\mathbf{x}_1, \dots , \mathbf{x}_n]^\mathsf{T} \in \mathbb{R}^{n\times p}$ and outputs $\mathbf{y} = [y_1, \dots, y_n]$, a linear model assumes the approximate relationship $ y \approx \mathbf{x}^\mathsf{T}\beta $. Specifically, the assumptions of Ordinary Least Squares allow one to estimate a conditional mean, that is, one assumes $\mathbf{E}[y\mid\mathbf{x}] = \mathbf{x}^\mathsf{T}\beta$. These assumptions allow for fixed transformations of the inputs $\tilde{\mathbf{x}} = \phi(\mathbf{x})$ so that the more complex model

$$\mathbf{E}[y\mid\mathbf{x}] = \tilde{\mathbf{x}}^\mathsf{T}\beta = \sum_{j=1}^p\beta_j\phi(\mathbf{x}^{(j)})$$

is still linear in the parameters $\beta$. However, in many situations, one wishes to construct a more complex relationship, one in which the conditional mean $\mathbf{E}[y\mid\mathbf{x}] $ is not linear in $\beta$. In this case, Generalized Linear Models (GLMs) are a good pick. These models are still somewhat simple, in the sense that they only assume an invertible link function $g$, so that the relationship becomes

$$\mathbf{E}[y\mid\mathbf{x}] = g^{-1}(\tilde{\mathbf{x}}^\mathsf{T}\beta) = g^{-1}\left(\sum_{j=1}^p\beta_j\phi(\mathbf{x}^{(j)})\right)$$

In this post, we'll use the Statsmodels Python API to fit GLMs.
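As a preview of the API, here is a minimal, self-contained sketch fitting a Poisson GLM (with its canonical log link) on synthetic data; it is not one of the post's actual datasets.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
rate = np.exp(0.5 + 1.5 * x)             # true mean, passed through the inverse link
y = rng.poisson(rate)

X = sm.add_constant(x)                   # design matrix with an intercept column
model = sm.GLM(y, X, family=sm.families.Poisson())
result = model.fit()
print(result.params)                     # estimates close to (0.5, 1.5)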

Data

First, let's generate three bivariate datasets with different nonlinear relationships. This means that $ y = \psi ( \phi(x) + \varepsilon)$ where $\varepsilon$ is some zero-centered, symmetric noise. In order to visualize our results we pick $p=1$ at first.
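As a sketch of this generating process (the actual splines used for the figures are not reproduced here; the functions below are arbitrary smooth stand-ins for $\phi$ and $\psi$):

import numpy as np

rng = np.random.default_rng(0)

phi = lambda x: np.sin(2 * x) + 0.3 * x**2    # stand-in for the inner nonlinearity
psi = lambda t: np.tanh(t)                    # stand-in for the outer transformation

x = rng.uniform(-3, 3, size=500)
eps = 0.2 * rng.normal(size=500)              # zero-centered, symmetric noise
y = psi(phi(x) + eps)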

Double Spline

When both $\phi, \psi$ are splines, we obtain the following scatter:

scatter spl

with a simple polynomial fit of degree $5$ in red.

Positive Spline

When we set $\phi$ as a spline, but $\psi$ as a positive function such as $ \psi(x) = a\,x^2 + b\,\exp(x) $, we obtain the scatterplot

scatter spl pos

Discrete output

To be added.

JAX is the fusion of XLA (short for Accelerated Linear Algebra) and Autograd, a library which allows for automatic computation of derivatives. A very popular automatic-differentiation framework as of today is PyTorch's, but JAX's is gaining popularity every day. This post is an introduction to JAX through simple examples.

A rational function

Assume you have two polynomials $P,Q$ of degree at most $d = 5$. Denote by $\vec{\alpha}, \vec{\beta}$ the coefficients of $P,Q$. In order to avoid singularities at the roots of $Q$, we will rather study the function

$$ f(x) = \dfrac{P(x)}{1 + Q(x)^2} = \dfrac{\alpha_0 + \alpha_1x + \alpha_2x^2 + \alpha_3 x^3 + \alpha_4x^4 + \alpha_5x^5}{1 + (\beta_0 + \beta_1x + \beta_2x^2 + \beta_3 x^3 + \beta_4x^4 + \beta_5x^5)^2} $$

The function is computed in JAX code using the following snippet:

import jax.numpy as jnp
from jax import jit

@jit
def rational_function(x, num_par, denom_par):
    acc_num = 0.0
    acc_denom = 0.0
    for i,par in enumerate(num_par):
        acc_num += par*jnp.float_power(x, i)

    for j,par in enumerate(denom_par):
        acc_denom += par*jnp.float_power(x, j)

    return acc_num / (1.0 + acc_denom**2)

We can plot this function on the interval $I_{10} = (-10, +10)$, giving us the figure below:

rational function
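The grid evaluation behind such a plot can be sketched as follows; the coefficient arrays alpha and beta below are arbitrary choices, not the ones used for the figure:

import jax.numpy as jnp
from jax import vmap

alpha = jnp.array([0.3, -1.2, 0.5, 0.1, -0.02, 0.004])   # numerator coefficients (arbitrary)
beta = jnp.array([0.1, 0.4, -0.3, 0.05, 0.01, -0.002])   # denominator coefficients (arbitrary)

xs = jnp.linspace(-10.0, 10.0, 1000)
fvals = vmap(lambda x: rational_function(x, alpha, beta))(xs)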

Gradient and Hessian basics

Gradient

If one does not specify anything to JAX's grad routine, it takes a function func(arg1, arg2, arg3, ...) and returns a function with the same arguments, whose values are the gradient of func w.r.t. arg1. Let's try this functionality by calling grad(rational_function) and plotting the result on our $I_{10}$ grid:

rat func and its gradient wrt x

We obtain $f'$ as expected; notice that the points $x$ for which $f'(x)$ vanishes are local extrema of $f$. We can count 5 of them in total if we ignore regions where $f(x) = 0$.

Hessian

As the grad(...) call returns a function with $f$'s arguments, we can again call grad on the output of the first gradient call:

y_g = grad(rational_function)
y_gg = grad(y_g)

Do not forget to call jit to pre-compile the gradient functions and avoid performing unnecessary computations at each evaluation (y_g = jit(y_g) and same thing for y_gg). We can then plot the functions $f, f', f''$ on $I_{10}$ after normalizing the $y$-axis (because $f''$ exhibits high variability).

jax grad hessian wrt x
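Concretely, the jit-compiled derivatives can be evaluated on the same grid as follows (reusing the arbitrary alpha, beta and xs from the earlier sketch):

from jax import grad, jit, vmap

y_g = jit(grad(rational_function))            # f'  (derivative w.r.t. x)
y_gg = jit(grad(grad(rational_function)))     # f'' (second derivative w.r.t. x)

fp_vals = vmap(lambda x: y_g(x, alpha, beta))(xs)
fpp_vals = vmap(lambda x: y_gg(x, alpha, beta))(xs)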

Gradient and arguments

As mentioned earlier, calling grad(...) without specifying the argument w.r.t. which the gradient is computed automatically computes the gradient w.r.t. the first argument. To change that behavior, one should use the argnums parameter. For example, the snippet

num_par_grad = jit(grad(rational_function, argnums=1))

computes the gradient $\partial f /\partial \vec{\alpha}$. As this gradient at each $x$ is an array with $6$ entries, it is impractical to visualize it on the full grid $I_{10}$. To simplify the problem, we set the degree $\deg(P) = 2$ and visualize $\nabla_{\vec{\alpha}}f$ on our grid. The result can be seen below:

grad wrt numerator coeffs

The same simplifications and computations can be used to compute $\partial f /\partial \vec{\beta}$ (by setting argnums=2), which yields the plot

grad wrt denominator coeffs

Contours

The above curves were the result of plotting the values of $\partial f /\partial \vec{\alpha}$ as $x$ changes on a uniform grid $I_{10}$, having fixed both $\vec{\alpha}, \vec{\beta}$. We can proceed differently, setting $x = x_0$ and varying the coefficients $ \vec{\alpha}, \vec{\beta}$ on a uniform grid. (TO DO).

Tags

#autodiff #autograd #jax #python #code #math #applied #data #plots #gradient #hessian

Given a random variable (RV) $X\sim p_X$ and a function $f$ whose inverse $f^{-1}$ is differentiable, we know that the pdf of the random variable $Y = f(X)$ is given by

$$ p_Y(y) = \left\lvert\dfrac{\partial }{\partial y} f^{-1}\right\rvert \, p_X \circ f^{-1} (y) $$

This post reviews some applications of this formula.

Scaling and Shifting

Suppose you have scalars $\sigma > 0$ and $\mu \neq 0$. You may want to know the pdf of the shifted variable $X + \mu$ or the scaled variable $\sigma X$. Applying the formula yields that $X+\mu$ is distributed as $p_X(x - \mu)$ and $\sigma X$ as $p_X(x/\sigma)/\sigma$. Therefore, the normalisation $Y = (X - \mu)/\sigma$ yields a pdf of $p_Y(y) = \sigma\, p_X(\sigma y+\mu)$.

Reciprocal

Assuming that $X$ is strictly positive, what is the pdf of $1/X$? By applying our formula, one finds that $$ Y = X^{-1} \sim p_Y(y) = y^{-2}\,p_X(y^{-1}) $$

Logarithm

Suppose that $X > 0$, so that $\log X$ is defined. The pdf of this new variable is then $p_Y(y) = \exp(y)\,p_X(\exp(y))$.

Exponentiation

Suppose that $X \geq 0$, so that for a scalar $\gamma > 0$ the variable $X^\gamma$ is defined. The pdf of $Y = X^\gamma$ is then $$p_Y(y) = \frac{1}{\gamma}\, y^{\frac{1 - \gamma}{\gamma}}\,p_X(y^{1/\gamma}). $$
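As a check of this last formula, take $X \sim \mathrm{Exp}(1)$: then $X^\gamma$ is known to follow a Weibull distribution with shape parameter $1/\gamma$, and the formula indeed reproduces that density (a sketch using scipy):

import numpy as np
from scipy import stats

gamma = 2.0
ys = np.linspace(0.1, 4.0, 50)

# Change-of-variables formula with p_X the Exp(1) density.
pdf_formula = (1.0 / gamma) * ys**((1.0 - gamma) / gamma) * stats.expon.pdf(ys**(1.0 / gamma))

# Known result: X**gamma is Weibull-distributed with shape 1/gamma.
pdf_weibull = stats.weibull_min.pdf(ys, c=1.0 / gamma)
print(np.allclose(pdf_formula, pdf_weibull))  # expected: True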
