Small Deviations

Broadcasting my objective subjectivity

In this post, we aim to prove that any continuous density \(x\mapsto p(x)\) over a bounded subset \(\mathcal{X} \subset \mathbb{R}\) can be approximated, in the weak sense defined below, by a convex combination of Dirac deltas \(\delta_{x_i}\): for any accuracy \(\varepsilon > 0\) there is an \(N\) such that \(p(x) \approx p_n(x) = \sum_{i=1}^n\gamma_i \delta_{x_i}(x)\) to within \(\varepsilon\) for all \(n > N\).

Because any bounded subset of \(\mathbb{R}\) can be covered by finitely many closed intervals \([a_i, b_i]\), we can prove the result on each interval and then combine them, using the fact that when \([c,d]\) is partitioned into such intervals the integral splits as

$$ \int_c^dp(x)\mathrm{d}x = \sum_{i=1}^n\int_{a_i}^{b_i}p(x)\mathrm{d}x $$

Thus, to simplify the problem, we can assume that \(\mathcal{X} = [a,b]\) for some interval \([a,b]\).

Weak convergence

For continuous distributions \(f,g\) on \(\mathcal{X}\) define the inner product

$$ \langle f,g\rangle \triangleq \int_\mathcal{X}f(x)g(x)\,\mathrm{d}x $$ We say a sequence \(f_n\) converges weakly to a limit \(f_\star\) if for every bounded continuous test function \(g\) on \(\mathcal{X}\) we have that \(\langle f_n, g\rangle \to \langle f_\star, g\rangle\) as \(n\to\infty\).

The proof

We can use Riemann sums to prove the statement. First, notice that because \(\int p(x)\mathrm{d}x = 1\), we can for any \(n\) find a partition \(a = x_0 < x_1 < \dots < x_n = b\) with gaps \(\Delta x_i = x_{i+1} - x_i\) so that

$$ \sum_{i=0}^{n-1}p(x_i^\star)\Delta x_i = 1 + u_n $$ with \(u_n \to 0\) as \(n\) tends to infinity (provided the mesh \(\max_i \Delta x_i\) tends to \(0\)). We can assume that \(u_n \leq 0\) by taking \(p(x_i^\star) = \inf\{p(x)\mid x_i \leq x \leq x_{i+1}\}\), which is attained because \(p\) is continuous on the closed interval \([x_i, x_{i+1}]\). We can therefore for any \(n\) construct the following \(\gamma_i\):

$$ \gamma_i \triangleq p(x_i^\star)\Delta x_i - u_n/n $$ so that we always have \(\gamma_i \geq 0\) and \(\sum_{i=0}^{n-1}\gamma_i = 1 + u_n - u_n = 1\).

Finally, consider the convex sum \(p_n = \sum_{i=0}^{n-1} \gamma_i \delta_{x_i^\star}\); we have

$$ \langle p_n , f\rangle = \sum_{i=0}^{n-1} \gamma_i f(x_i^\star) = \sum_{i=0}^{n-1}\left( p(x_i^\star)\Delta x_i - u_n/n \right) f(x_i^\star) $$ which expands into

$$ \langle p_n , f\rangle = \sum_{i=0}^{n-1}p(x_i^\star)f(x_i^\star)\Delta x_i - \dfrac{u_n}{n}\sum_{i=0}^{n-1}f(x_i^\star) $$ which converges to \(\langle p, f \rangle\): the first term is a Riemann sum for \(\int_\mathcal{X} p(x)f(x)\,\mathrm{d}x\), and \(\dfrac{u_n}{n}\sum_{i=0}^{n-1} f(x_i^\star) \to 0\) because \(f\) is bounded.
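As a quick numerical illustration of the construction (a minimal sketch; the density \(p\) and test function \(f\) below are arbitrary choices of mine), we can build the weights \(\gamma_i\) from lower Riemann sums on \([0,1]\) and watch \(\langle p_n, f\rangle\) approach \(\langle p, f\rangle\):

```python
import numpy as np

# Arbitrary continuous density and bounded continuous test function on [0, 1].
p = lambda x: 12.0 * x * (1.0 - x) ** 2          # integrates to 1
f = lambda x: np.cos(3.0 * x)

def weak_gap(n):
    """|<p_n, f> - <p, f>| for the lower-Riemann-sum construction above."""
    edges = np.linspace(0.0, 1.0, n + 1)
    dx = np.diff(edges)
    # Approximate the infimum of p on each interval with a fine grid.
    x_star = np.empty(n)
    for k in range(n):
        grid = np.linspace(edges[k], edges[k + 1], 51)
        x_star[k] = grid[np.argmin(p(grid))]
    u_n = np.sum(p(x_star) * dx) - 1.0           # Riemann-sum defect, <= 0
    gamma = p(x_star) * dx - u_n / n             # convex weights, they sum to 1
    approx = np.sum(gamma * f(x_star))           # <p_n, f>
    xq = (np.arange(100_000) + 0.5) / 100_000
    exact = np.mean(p(xq) * f(xq))               # <p, f> by midpoint quadrature
    return abs(approx - exact)

for n in (10, 100, 1000):
    print(n, weak_gap(n))
```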

This post is about a proof of Jensen's Inequality, and more specifically a way to go from finite spaces to countable and uncountable spaces.

Finite case

Recall that a function \(f\) is convex on \([a,b]\) iff the inequality

$$ f(ta + (1-t)b) \leq tf(a) + (1-t)f(b) $$

holds for any real \(t\) between \(0\) and \(1\). We can easily generalize this relationship to any convex sum of points (a convex sum is a weighted sum \(\sum_{i=1}^n \gamma_ia_i\) for which \(\gamma_i \geq 0\) and \(\sum_{i=1}^n \gamma_i = 1\)) using induction on \(n\). The above inequality serves as the base case (\(n\) equal to \(1\) and \(2\)). Assuming the inequality holds for \(n-1\) points, we can notice that

$$ f\left(\sum_{i=1}^n \gamma_ia_i\right) = f\left(\gamma_na_n + \sum_{i=1}^{n-1} \gamma_ia_i\right) = f( ta + (1-t)b) $$

if we set \(t = \gamma_n\), \(a = a_n\), and \(b = \sum_{i=1}^{n-1} \gamma_i'a_i = \sum_{i=1}^{n-1} \frac{\gamma_i}{1-\gamma_n} a_i \) (if \(\gamma_n = 1\) the inequality is trivial, so assume \(\gamma_n < 1\));

thus, by the convexity of \(f\), we have

$$ f\left(\sum_{i=1}^n \gamma_ia_i\right) \leq \gamma_n f(a_n) + (1 - \gamma_n)f\left(\sum_{i=1}^{n-1} \gamma_i'a_i\right) $$

however, since the \(\gamma_i'\) are nonnegative and sum to \(1\), they form a convex combination of the \(n-1\) points \(a_1, \dots, a_{n-1}\), so we can apply the induction hypothesis: \(f\left(\sum_{i=1}^{n-1} \gamma_i'a_i\right) \leq \sum_{i=1}^{n-1} \gamma_i'f(a_i)\). Thus we obtain that \(f\left(\sum_{i=1}^n \gamma_ia_i\right) \leq \sum_{i=1}^n \gamma_if(a_i)\) for any finite convex combination \(\{ (\gamma_i,a_i) \}_{i=1}^n\).
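A quick numerical sanity check of the finite inequality (a sketch with an arbitrary convex function and random convex weights):

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.exp                              # a convex function

for _ in range(5):
    a = rng.normal(size=10)             # arbitrary points
    gamma = rng.random(10)
    gamma /= gamma.sum()                # convex weights: nonnegative, sum to 1
    lhs = f(np.dot(gamma, a))           # f(sum_i gamma_i a_i)
    rhs = np.dot(gamma, f(a))           # sum_i gamma_i f(a_i)
    print(lhs <= rhs, lhs, rhs)
```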

Countable case

A countable collection of points \((a_n)_{n\in\mathbb{N}}\) can be combined in a convex fashion using any positive sequence \((\gamma_n)_{n\in\mathbb{N}}\) such that \(\sum_{i=1}^\infty \gamma_i= \lim_{n\to\infty}\sum_{i=1}^n \gamma_i = 1\). This implies that the coefficients vanish at infinity: \(\gamma_i \to 0\) as \(i \to \infty\).

Assuming both \(\sum_{i=1}^n \gamma_i a_i\) and \(\sum_{i=1}^n \gamma_i f(a_i)\) converge, and assuming \(f\) is continuous, we can take limits on both sides of our inequality to yield:

$$ f\left(\sum_{i=1}^\infty \gamma_ia_i\right) \leq \sum_{i=1}^\infty \gamma_if(a_i) $$

Continuous case

Assuming we have a distribution \(p\) defined on a subset \(\mathcal{X}\subseteq\mathbb{R}\), the integral $$\mathbb{E}_p[f(x)] = \int_\mathcal{X}f(x)p(x)\,\mathrm{d}x$$ can be approximated using countable sums. Indeed, the function \(p(x)\) can be approximated using a convex combination of Dirac deltas: \(p(x) \approx p_n(x) = \sum_{i=1}^n\gamma_i \delta_{x_i}(x)\) (formally, convex combinations of Dirac deltas are weakly dense in the space of distributions). We have that \(\mathbb{E}_{p_n}[f(x)] \to \mathbb{E}_{p}[f(x)]\) as \(n \to \infty\) and

$$ f\left(\mathbb{E}_{p_n}[x]\right) = f\left(\sum_{i=1}^n\gamma_i x_i\right) \leq \sum_{i=1}^n\gamma_i f(x_i) = \mathbb{E}_{p_n}[f(x)] $$ and we can let \(n\) tend to \(\infty\) on each side of the inequality to conclude.

This post is about a fundamental inequality in variational inference.

Intuition

In order to fit a model \(p_\theta(\cdot)\) to datapoints \(\mathrm{x}_1, \dots, \mathrm{x}_n\), one has to compute the marginal likelihood \(p_\theta(\mathrm{x})\) of the data (also called the evidence) under the parameter \(\theta\). In variational methods, one usually models a generative process \(p_\theta(\mathrm{x}\mid \mathrm{z})\) which maps a latent \(\mathrm{z}\) into an estimate of the data \(\mathrm{x}\). If one has a simple distribution \(p(\mathrm{z})\) over the latents, then one has the equality:

$$ p_\theta(\mathrm{x}) = \int_\mathcal{Z}p_\theta(\mathrm{x}\mid \mathrm{z})p(\mathrm{z})\,\mathrm{dz} $$

therefore, one can approximate the marginal likelihood by sampling points \(\mathrm{z}_1, \dots, \mathrm{z}_m\) from \(p(\mathrm{z})\) and using Monte Carlo integration:

$$p_\theta(\mathrm{x}) \approx \dfrac1m \sum_{i=1}^mp_\theta(\mathrm{x}\mid \mathrm{z}_i)$$

However, picking the \(\mathrm{z}_i\) at random from the prior results in a high-variance estimate, as a random \(\mathrm{z}_i\) is unlikely to have generated \(\mathrm{x}\) (that is, the value \(p_\theta(\mathrm{x}\mid \mathrm{z}_i)\) will be low for most samples). A much better approximation is obtained by concentrating the samples on latents for which \(p_\theta(\mathrm{x}\mid \mathrm{z}_i)\) is large.
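To make the variance problem concrete, here is a minimal sketch in a toy model of my choosing (scalar latent, linear-Gaussian likelihood) where the exact evidence is known in closed form; the naive prior-sampling estimator is unbiased but noisy:

```python
import numpy as np

def gauss_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
w, sigma = 2.0, 0.5          # toy model: z ~ N(0, 1), x | z ~ N(w z, sigma^2)
x = 3.0                      # an observed datapoint

exact = gauss_pdf(x, 0.0, np.sqrt(w**2 + sigma**2))   # closed-form evidence p(x)

m = 1000
z = rng.normal(size=m)                                # naive samples from the prior p(z)
naive = gauss_pdf(x, w * z, sigma)                    # p(x | z_i) for each sample
print("exact evidence:", exact)
print("naive estimate:", naive.mean(), "+/-", naive.std() / np.sqrt(m))
```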

Thus, instead of sampling from \(p(\mathrm{z})\) which is a fixed, simple and suboptimal distribution, we wish to sample from a parametric distribution \(\mathrm{z} \sim q_\phi(\mathrm{z}\mid \mathrm{x})\) called an encoder. This name comes from the fact that the datapoints \(\mathrm{x}\) are often high-dimensional, and the latent variables (called codes) are much simpler: \(\dim(\mathrm{x}) \gg \dim(\mathrm{z})\). The reason we wish to sample from \(q_\phi\) is that by optimizing over the parameters \(\phi\), we can put more mass on codes \(\mathrm{z}^\star\) that result in high likelihoods \(p_\theta(\mathrm{x}\mid \mathrm{z}^\star)\).

The variational equality

The Kullback-Leibler divergence \(\mathrm{KL}(q \parallel p) \triangleq \mathbb{E}_{x\sim q}[\log q(x) - \log p(x)]\) measures the discrepancy between two distributions \(q\) and \(p\); it is always nonnegative and vanishes exactly when \(q = p\).

Since we sample from \(q_\phi\) instead of the true posterior \(p_\theta(\mathrm{z}\mid\mathrm{x})\), we should compute the divergence between the two:

\begin{align} \mathrm{KL}(q_\phi(\mathrm{z}\mid \mathrm{x})\parallel p_\theta(\mathrm{z}\mid \mathrm{x})) &= \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log q_\phi(\mathrm{z}\mid \mathrm{x}) - \log p_\theta(\mathrm{z}\mid \mathrm{x})] \\ & = \log p_\theta(\mathrm{x}) + \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log q_\phi(\mathrm{z}\mid \mathrm{x}) - \log p_\theta(\mathrm{x}\mid \mathrm{z}) - \log p(\mathrm{z})] \\ & = \log p_\theta(\mathrm{x}) + \mathrm{KL}(q_\phi(\mathrm{z}\mid \mathrm{x})\parallel p(\mathrm{z})) \; - \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ] \end{align}

Therefore, denoting

\begin{align} E(\theta, \phi, \mathrm{x}) &\triangleq \mathrm{KL}(q_\phi(\mathrm{z}\mid \mathrm{x})\parallel p_\theta(\mathrm{z}\mid \mathrm{x})) \\ \Omega(\phi) &\triangleq \mathrm{KL}(q_\phi(\mathrm{z}\mid \mathrm{x})\parallel p(\mathrm{z})) \end{align}

we have the equality \begin{align} \log p_\theta(\mathrm{x})\; - E(\theta, \phi, \mathrm{x}) = \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ] - \Omega(\phi) \end{align}

which reads: “evidence \(-\) error \(=\) approximation \(-\) regularizer”. The regularizer term \(\Omega(\phi)\) makes sure our new distribution \(q_\phi\) is not too far from the simple distribution \(p(\mathrm{z})\), and the \(E(\cdot)\) term quantifies the error we incur by sampling from \(q_\phi\) instead of the true posterior to estimate the likelihood \(\log p_\theta(\mathrm{x}\mid \mathrm{z})\). The regularizer enters with a minus sign because we wish to minimize \(\Omega(\phi)\) while maximizing the approximate likelihood.

A first derivation of the inequality

Because the KL divergence is always nonnegative, the term \(E(\cdot)\) is nonnegative, and thus

$$\log p_\theta(\mathrm{x}) \geq \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ] - \Omega(\phi) \triangleq \mathcal{L}(\theta, \phi, \mathrm{x}) $$

Thus, to estimate an intractable evidence \(\log p_\theta(\mathrm{x})\), we can optimize the evidence lower bound \(\mathcal{L}(\theta, \phi, \mathrm{x})\) over \(\phi\) to tighten the bound so that \(\mathcal{L}(\theta, \phi, \mathrm{x}) \approx \log p_\theta(\mathrm{x})\).

The distributions \(p(\mathrm{z})\) and \(q_\phi(\mathrm{z}\mid\mathrm{x})\) are often chosen to be multivariate Gaussians to obtain a closed form for \(\Omega(\phi)\). The expectation term \(\mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ]\) is often estimated using Monte Carlo methods.
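As an illustration of the previous paragraph, here is a minimal numerical sketch (the decoder, prior, encoder parameters and datapoint are hypothetical choices of mine): a linear-Gaussian decoder with standard normal prior, a diagonal-Gaussian encoder, the closed-form KL regularizer, and a Monte Carlo estimate of the reconstruction term. For this toy model the exact evidence is available, so we can check that the lower bound indeed lies below it.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, m = 5, 2, 10_000

# Decoder p_theta(x | z) = N(W z + b, sigma^2 I), prior p(z) = N(0, I).
W = rng.normal(size=(d_x, d_z))
b = rng.normal(size=d_x)
sigma = 0.3
x = rng.normal(size=d_x)                       # an (arbitrary) observed datapoint

# Encoder q_phi(z | x) = N(mu, diag(s^2)) with fixed, hypothetical parameters.
mu = rng.normal(size=d_z)
s = np.exp(0.1 * rng.normal(size=d_z))

# Closed-form KL(q_phi(z|x) || p(z)) between diagonal Gaussians and N(0, I).
kl = 0.5 * np.sum(s**2 + mu**2 - 1.0 - 2.0 * np.log(s))

# Monte Carlo estimate of E_q[log p_theta(x | z)].
z = mu + s * rng.normal(size=(m, d_z))
resid = x - (z @ W.T + b)
log_lik = -0.5 * (d_x * np.log(2 * np.pi * sigma**2) + (resid**2).sum(axis=1) / sigma**2)
elbo = log_lik.mean() - kl

# Exact evidence for this linear-Gaussian model: x ~ N(b, W W^T + sigma^2 I).
cov = W @ W.T + sigma**2 * np.eye(d_x)
diff = x - b
log_evidence = -0.5 * (d_x * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1]
                       + diff @ np.linalg.solve(cov, diff))

print("ELBO:        ", elbo)
print("log evidence:", log_evidence)           # ELBO <= log evidence
```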

A second derivation

A second derivation, found in this famous paper on normalizing flows, uses Jensen's inequality. Indeed, by the concavity of the logarithm, we have for any distribution \(p\) and positive random variable \(Y\) the inequality \(\log(\mathbb{E}_p[Y]) \geq \mathbb{E}_p[\log(Y)]\). In particular, we can apply it to derive the bound through the following steps

\begin{align} \log p_\theta(\mathrm{x}) &= \log\int_\mathcal{Z} p_\theta(\mathrm{x}\mid \mathrm{z})p(\mathrm{z})\,\mathrm{dz} \\ &= \log\int_\mathcal{Z} \dfrac{q_\phi(\mathrm{z}\mid \mathrm{x})}{q_\phi(\mathrm{z}\mid \mathrm{x})}\, p_\theta(\mathrm{x}\mid \mathrm{z})p(\mathrm{z})\,\mathrm{dz} \\ &\geq \int_\mathcal{Z} q_\phi(\mathrm{z}\mid \mathrm{x})\log\dfrac{p_\theta(\mathrm{x}\mid \mathrm{z}) p(\mathrm{z})}{q_\phi(\mathrm{z}\mid \mathrm{x})}\, \,\mathrm{dz} \\ & = \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ]\; - \mathrm{KL}(q_\phi(\mathrm{z}\mid \mathrm{x})\parallel p(\mathrm{z})) \end{align}

This proof is elegant but does not give an analytic expression for the Jensen gap \(\log p_\theta(\mathrm{x}) - \mathbb{E}_{\mathrm{z}\sim q_\phi}[\log p_\theta(\mathrm{x}\mid \mathrm{z}) ] + \Omega(\phi)\). This gap is exactly the term \(E(\cdot)\) derived in the previous section.

Suppose one wants to compute the expected value \(\mathbb{E}_X[f(X)]\) of a quadratic form \(f(x) \triangleq x^\mathsf{T}\mathrm{A} x\). Here \(x\) is an \(n\)-dimensional vector and \(\mathrm{A}\) an \(n\times n\) matrix. This post introduces a popular trick to compute the expectation of the quadratic form \(f\).

Trace definition and property

Recall the definition of the trace for a matrix \(\mathrm{X} = [\mathrm{X}_{ij}]_{i,j = 1}^n\): the trace is the sum of the diagonal elements $$ \mathrm{Tr}[\mathrm{X}] = \sum_{i=1}^n \mathrm{X}_{ii} $$ from which it is easy to derive that \(\mathrm{Tr}[\cdot]\) is linear. The trace of a \(1\times 1\) matrix is simply the value of its only element \(\mathrm{X}_{11}\). Because the trace is a linear map, it commutes with integration (and hence with expectation):

$$\int \mathrm{Tr}[g(\mathrm{X})]\,\mathrm{dX} = \mathrm{Tr}\left[ \int g(\mathrm{X})\,\mathrm{dX}\right] $$

Another important property is the cyclic nature of the trace: \(\mathrm{Tr}[\mathrm{XY}] = \mathrm{Tr}[\mathrm{YX}]\) which can be generalized to three matrices $$ \mathrm{Tr}[\mathrm{XYZ}] = \mathrm{Tr}[\mathrm{ZXY}] = \mathrm{Tr}[\mathrm{YZX}] $$ or any finite product of \(m\) matrices.

The trace trick

Suppose that \(X\) is a random variable distributed according to the density \(p\), and that the second-moment matrix \(\Sigma = \mathbb{E}[XX^\mathsf{T}]\) exists (when \(X\) has zero mean, this is the covariance \(\mathrm{Cov}[X,X]\)). The expectation of the quadratic \(f\) is then equal to \(\mathrm{Tr}[\mathrm{A}\Sigma]\).

To prove this, we first use the fact that \(f(x)\) is a scalar, and therefore is equal to \(\mathrm{Tr}[f(x)]\). Then, by the cyclic property of the trace, \(f(x) = \mathrm{Tr}[x^\mathsf{T}\mathrm{A}x] = \mathrm{Tr}[\mathrm{A}xx^\mathsf{T}]\). Finally, as the trace is linear, the expectation of the trace is equal to the trace of the expectation: $$ \mathbb{E}_X[\mathrm{Tr}[\mathrm{A}XX^\mathsf{T}]] = \mathrm{Tr}[\mathrm{A}\mathbb{E}_X[XX^\mathsf{T}]] = \mathrm{Tr}[\mathrm{A}\Sigma] $$

which concludes the proof.
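A quick numerical sanity check of the identity (a sketch using an arbitrary zero-mean Gaussian distribution and quadratic form):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 200_000

A = rng.normal(size=(n, n))                       # arbitrary quadratic form
L = rng.normal(size=(n, n))
Sigma = L @ L.T                                   # E[X X^T] for X ~ N(0, Sigma)

X = rng.multivariate_normal(np.zeros(n), Sigma, size=m)
mc = np.mean(np.einsum('bi,ij,bj->b', X, A, X))   # Monte Carlo estimate of E[x^T A x]
exact = np.trace(A @ Sigma)                       # trace trick
print(mc, exact)
```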

This post is about a derivation of the Singular Value Decomposition of a matrix \(\mathrm{A}\) when several conditions are met. These conditions are quite restrictive, but make derivations very easy.

Symmetric PD matrices

Let us begin with a symmetric matrix \(\mathrm{S}\in\mathbb{R}^{n\times n}\) which is assumed positive-definite with \(n\) distinct eigenvalues \(\{\lambda_i\}_i\). It is easy to prove that the eigenvectors \(v_i\) are orthogonal:

$$0 < v_i^\mathsf{T}\mathrm{S}\,v_i = \lambda_i\lVert v_i \rVert^2 $$

gives that all eigenvalues are strictly positive, and, for \(i \neq j\), we have

$$ \lambda_i\,v_i^\mathsf{T}v_j = (\mathrm{S}v_i)^\mathsf{T}v_j = v_i^\mathsf{T}\mathrm{S}^\mathsf{T}v_j = v_i^\mathsf{T}\mathrm{S}v_j = \lambda_j\,v_i^\mathsf{T}v_j$$

because \(\mathrm{S} = \mathrm{S}^\mathsf{T}\). As \(\lambda_i \neq \lambda_j\) we must have \(v_i^\mathsf{T}v_j = 0\). A symmetric PD matrix \(\mathrm{S}\) can therefore be decomposed as \(\mathrm{S} = \mathrm{Q}\mathrm{\Lambda}\mathrm{Q}^\mathsf{T}\) with \(\mathrm{Q}\) orthogonal and \(\mathrm{\Lambda}\) diagonal with positive entries.

Eigendecomposition of a product

Suppose \(\mathrm{A}\) is square, and further that the matrices \(\mathrm{A}^\mathsf{T} \mathrm{A}\) and \(\mathrm{A}\mathrm{A}^\mathsf{T}\) have distinct eigenvalues. Then, by our derivation in the previous section, we have that each of the two matrices can be decomposed as

$$ \mathrm{A}^\mathsf{T} \mathrm{A} = \mathrm{U}\mathrm{\Sigma}_1\mathrm{U}^\mathsf{T} \qquad\mathrm{and}\qquad \mathrm{A}\mathrm{A}^\mathsf{T} = \mathrm{V}\mathrm{\Sigma}_2\mathrm{V}^\mathsf{T} $$

because these two matrices are always symmetric and, for invertible \(\mathrm{A}\), positive-definite. Further, the eigenvalues \(\sigma^1_i\) and \(\sigma^2_i\) stored in \(\mathrm{\Sigma}_1\) and \(\mathrm{\Sigma}_2\) are equal: if \(\mathrm{A}^\mathsf{T} \mathrm{A}u_i = \sigma^1_i u_i\) then

$$\mathrm{A}\mathrm{A}^\mathsf{T} (\mathrm{A}u_i) = \mathrm{A}\left(\mathrm{A}^\mathsf{T}\mathrm{A}\,u_i\right) = \sigma^1_i (\mathrm{A}u_i) $$ so \(\mathrm{A}u_i\) is an eigenvector of \(\mathrm{A}\mathrm{A}^\mathsf{T}\), and therefore \(\sigma^1_i = \sigma^2_j\) for some \(j\).

We can therefore always rearrange \(\mathrm{\Sigma}_1\) into \(\mathrm{\Sigma}_2\) by pre and post multiplying by a permutation: \(\mathrm{\Sigma}_1 = \mathrm{P}\mathrm{\Sigma}_2\mathrm{P}^\mathsf{T}\). By applying the same permutation to the matrix of eigenvectors we keep the decomposition

$$ \mathrm{A}\mathrm{A}^\mathsf{T} = \mathrm{V}\mathrm{\Sigma}_2\mathrm{V}^\mathsf{T} = (\mathrm{V}\mathrm{P}^\mathsf{T})\mathrm{P}\mathrm{\Sigma}_2\mathrm{P}^\mathsf{T}(\mathrm{V}\mathrm{P}^\mathsf{T})^\mathsf{T} = \tilde{\mathrm{V}}\mathrm{\Sigma}_1\tilde{\mathrm{V}}^\mathsf{T}$$

we can therefore assume \(\mathrm{\Sigma}_1 = \mathrm{\Sigma}_2 = \mathrm{\Sigma}\) for some appropriate orthogonal matrices \(\mathrm{U}, \mathrm{V}\).

Singular Value Decomposition

Because the eigenvalues stored in \(\mathrm{\Sigma}\) are strictly positive, we can take their square root \(\sqrt{\sigma_i}\) and store them in a new matrix \(\tilde{\mathrm{\Sigma}} = \mathrm{\Sigma}^{1 / 2}\) such that $$\tilde{\mathrm{\Sigma}}\tilde{\mathrm{\Sigma}} = \tilde{\mathrm{\Sigma}}^\mathsf{T}\tilde{\mathrm{\Sigma}} = \tilde{\mathrm{\Sigma}}\tilde{\mathrm{\Sigma}}^\mathsf{T} = \mathrm{\Sigma} $$

The trick behind the SVD is to construct a specific \(\mathrm{V}\) from the \(\mathrm{U}\)-decomposition. Consider the equalities

\begin{align} \mathrm{A}\mathrm{A}^\mathsf{T} &= \mathrm{V}\mathrm{\Sigma}\mathrm{V}^\mathsf{T} \\ & = \left(\mathrm{A}\mathrm{U}\tilde{\mathrm{\Sigma}}^{-1}\right)\mathrm{\Sigma}\left(\tilde{\mathrm{\Sigma}}^{-1}\mathrm{U}^\mathsf{T} \mathrm{A}^\mathsf{T}\right) \\ & = \left(\mathrm{A}\mathrm{U}\tilde{\mathrm{\Sigma}}^{-1}\right)\mathrm{\Sigma}\left(\mathrm{A}\mathrm{U}\tilde{\mathrm{\Sigma}}^{-1}\right)^\mathsf{T} \end{align}

therefore, by setting \(\mathrm{V} \triangleq \mathrm{A}\mathrm{U}\tilde{\mathrm{\Sigma}}^{-1}\) we achieve two things: we orthogonally diagonalize the matrix \(\mathrm{A}\mathrm{A}^\mathsf{T}\) and we ensure that \(\mathrm{A} = \mathrm{V}\tilde{\mathrm{\Sigma}}\mathrm{U}^\mathsf{T}\). After relabelling the factors (and writing \(\mathrm{\Sigma}\) for the diagonal matrix of singular values \(\sqrt{\sigma_i}\)), the SVD decomposition of \(\mathrm{A}\) is

$$ \mathrm{A} = \mathrm{U}\mathrm{\Sigma}\mathrm{V}^\mathsf{T} $$ for some unitary matrices \(\mathrm{U}, \mathrm{V}\) and a positive diagonal matrix \(\mathrm{\Sigma}\).
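A short numerical sketch of this construction (assuming a random, almost surely invertible square \(\mathrm{A}\)): diagonalize \(\mathrm{A}^\mathsf{T}\mathrm{A}\), build \(\mathrm{V} = \mathrm{A}\mathrm{U}\tilde{\mathrm{\Sigma}}^{-1}\), and check that \(\mathrm{V}\) is orthogonal and that \(\mathrm{A} = \mathrm{V}\tilde{\mathrm{\Sigma}}\mathrm{U}^\mathsf{T}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n))                   # random square matrix (invertible w.p. 1)

evals, U = np.linalg.eigh(A.T @ A)            # A^T A = U Sigma U^T with U orthogonal
Sigma_half = np.diag(np.sqrt(evals))          # tilde-Sigma: square roots of eigenvalues
V = A @ U @ np.linalg.inv(Sigma_half)         # the constructed left factor

print(np.allclose(V.T @ V, np.eye(n)))        # V is orthogonal
print(np.allclose(A, V @ Sigma_half @ U.T))   # A = V tilde-Sigma U^T
```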

A special kind of norm for \(n\times n\) square matrices \(\mathrm{A}\) is called the subordinate (or induced) norm. If we have a norm \(\lVert\cdot\rVert_\alpha\) on the input space, and a second norm \(\lVert\cdot\rVert_\beta\) on the output space, the subordinate norm is

$$ \lVert\mathrm{A}\rVert_{\alpha, \beta} = \sup_{x \neq 0}\dfrac{\;\lVert\mathrm{A}x\rVert_{\beta}}{\;\lVert x\rVert_\alpha} $$

Submultiplicative property

An important property of these induced norms is that for any third norm \(\lVert\cdot\rVert_{\gamma}\), we have:

$$ \lVert\mathrm{AB}\rVert_{\alpha, \beta} \leq \lVert\mathrm{A}\rVert_{\gamma, \beta} \lVert\mathrm{B}\rVert_{\alpha, \gamma} $$

This is easily proven via the computations

$$ \dfrac{\;\lVert\mathrm{AB}x\rVert_{\beta}}{\;\lVert x\rVert_\alpha} = \dfrac{\;\lVert\mathrm{A}y\rVert_{\beta}}{\;\lVert y\rVert_\gamma} \dfrac{\;\lVert\mathrm{B}x\rVert_{\gamma}}{\;\lVert x\rVert_\alpha} $$ for \(y = \mathrm{B}x\) (if \(\mathrm{B}x = 0\) the bound is trivial). As each factor on the right is bounded by the corresponding supremum, we get the inequality

$$ \dfrac{\;\lVert\mathrm{AB}x\rVert_{\beta}}{\;\lVert x\rVert_\alpha} \leq \lVert\mathrm{A}\rVert_{\gamma, \beta} \lVert\mathrm{B}\rVert_{\alpha, \gamma} $$

taking the supremum over \(x\) on the left (the supremum is the least upper bound) gives the stated inequality. A special case is obtained by taking \(\alpha = \beta = \gamma\), which yields $$\lVert\mathrm{AB}\rVert_{\alpha, \alpha} \leq \lVert\mathrm{A}\rVert_{\alpha, \alpha} \lVert\mathrm{B}\rVert_{\alpha, \alpha} $$ we denote this norm \(\lVert\cdot\rVert_\alpha\).

Condition number

The condition number \(\kappa_\alpha(\mathrm{A})\) of an invertible matrix is the quantity $$\kappa_\alpha(\mathrm{A}) = \lVert\mathrm{A}\rVert_\alpha\lVert\mathrm{A}^{-1}\rVert_\alpha $$ which is at least \(1\) because $$ 1 = \lVert\mathrm{I}\rVert_\alpha = \lVert\mathrm{A}\mathrm{A}^{-1}\rVert_\alpha \leq \lVert\mathrm{A}\rVert_\alpha\lVert\mathrm{A}^{-1}\rVert_\alpha $$ this number is large when either \(\mathrm{A}\) or its inverse creates large distortions as the input varies over the unit sphere.

Stability of a similarity

In linear algebra, a matrix \(\mathrm{B}\) is said to be similar to \(\mathrm{A}\) when there is an invertible transformation \(\mathrm{X}\) such that \(\mathrm{B} = \mathrm{X}\mathrm{A}\mathrm{X}^{-1}\). In numerical analysis, we often wish to know the robustness of an operation \(f\colon x \mapsto f(x)\) with respect to a small perturbation of its input: if \(y = f(x)\), what is \(f(x+\varepsilon)\)? For real functions, continuity tells us that as long as \(\lvert \varepsilon \rvert < \delta\), the perturbed output \(f(x+\varepsilon)\) is \(\tilde{\varepsilon}\)-close to \(y\) for some small \(\tilde{\varepsilon} > 0\). What is the equivalent bound on the error of the matrix function \(f_{\mathrm{X}}(\mathrm{A}) = \mathrm{X}\mathrm{A}\mathrm{X}^{-1}\)? Suppose \(\mathrm{A}\) is perturbed by a matrix \(\mathrm{E}\) with small magnitude: \(\lVert\mathrm{E}\rVert_\alpha < \delta\). Then, the value of the function at this perturbed input is

$$ f_{\mathrm{X}}(\mathrm{A}+ \mathrm{E}) = \mathrm{X}(\mathrm{A}+ \mathrm{E})\mathrm{X}^{-1} = f_{\mathrm{X}}(\mathrm{A}) + f_{\mathrm{X}}(\mathrm{E}) $$

thus, the norm of the error \(\mathrm{B}\, - f_{\mathrm{X}}(\mathrm{A}+ \mathrm{E})\) is given by

$$ \lVert\mathrm{B}\, - f_{\mathrm{X}}(\mathrm{A}+ \mathrm{E})\rVert_\alpha = \lVert\mathrm{X}\mathrm{E}\mathrm{X}^{-1}\rVert_\alpha \leq \kappa_\alpha(\mathrm{X})\,\lVert\mathrm{E}\rVert_\alpha $$

therefore, the condition number \(\kappa_\alpha\) serves as a measure of continuity for the similarity operation \(f_\mathrm{X}(\cdot)\). If the condition number is small, the matrix is said to be “well-behaved”: for small input errors \(\mathrm{E}\) the output is perturbed by at most the small amount \(\kappa_\alpha(\mathrm{X})\,\lVert\mathrm{E}\rVert_\alpha\). If the condition number is large, even small input errors can cause large deviations in the output.

Orthogonal matrices are stable

The norm induced by setting \(\alpha = 2\) reveals that orthogonal matrices (the matrices \(\mathrm{Q}\) for which \(\mathrm{Q}^\mathsf{T}\mathrm{Q} = \mathrm{I}\)) are computationally very stable. Indeed, the perturbation created by the similarity \(\mathrm{A}\mapsto \mathrm{Q}\mathrm{A}\mathrm{Q}^\mathsf{T}\) is equal to \(\lVert\mathrm{Q}\mathrm{E}\mathrm{Q}^\mathsf{T}\rVert_2 = \lVert\mathrm{E}\rVert_2\), because the spectral norm is invariant under orthogonal transformations.

These transformations introduce no extra perturbations!
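A small numerical sketch contrasting the two cases (an ill-conditioned \(\mathrm{X}\) versus an orthogonal \(\mathrm{Q}\), both arbitrary): the output perturbation of the similarity stays below \(\kappa_2\,\lVert\mathrm{E}\rVert_2\), and for \(\mathrm{Q}\) it equals \(\lVert\mathrm{E}\rVert_2\).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, n))
E = 1e-6 * rng.normal(size=(n, n))            # small perturbation of A

# An ill-conditioned similarity...
X = rng.normal(size=(n, n)) @ np.diag(10.0 ** np.arange(n))
# ...and an orthogonal one (from the QR factorization of a random matrix).
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))

def output_error(T, A, E):
    """|| T (A + E) T^-1 - T A T^-1 ||_2 = || T E T^-1 ||_2."""
    Tinv = np.linalg.inv(T)
    return np.linalg.norm(T @ (A + E) @ Tinv - T @ A @ Tinv, 2)

print("kappa_2(X):", np.linalg.cond(X, 2), " error:", output_error(X, A, E))
print("kappa_2(Q):", np.linalg.cond(Q, 2), " error:", output_error(Q, A, E))
print("||E||_2:   ", np.linalg.norm(E, 2))
```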

Stability of a time-varying linear system

Consider the time-varying \(n\times n\) system \(\mathrm{A}(t)\, x(t) = b(t)\). Here, the notion of “stable” is different: a time-varying linear system is unstable if short durations cause large variations in the state \(x\). A stable system therefore has a small relative variation \(\lVert \dot{x} \rVert / \lVert x \rVert\).

Differentiating this equality yields

$$ \dot{x}(t) =\mathrm{A}(t)^{-1}\dot{b}(t)\, - \mathrm{A}(t)^{-1}\dot{\mathrm{A}}(t)\,x(t) $$

Therefore the stability of the system can be upper-bounded (dropping the \(t\) for simplicity)

$$ \dfrac{\lVert\dot{x}\rVert}{\lVert x \rVert} \leq \kappa(\mathrm{A}) \left(\dfrac{\lVert \dot{b} \rVert}{\lVert \mathrm{A}\rVert \lVert x\rVert } + \dfrac{\lVert \dot{\mathrm{A}}\rVert}{\lVert \mathrm{A}\rVert}\right) $$

therefore, a system whose matrix has a small condition number at time \(t\) will stay relatively close to its current position in the near future.

This post is about the Gram-Schmidt procedure, which transforms a set of vectors \(u_1, \dots, u_n\) into a new set of vectors \(v_1, \dots, v_n\) that are orthonormal (ON) for some inner product \(\langle\cdot, \cdot \rangle\). This process is purely algebraic and can be performed over any Hilbert space, no matter its complexity: vectors, matrices, and functions can all be orthonormalized.

Deriving the algorithm

The first step

Suppose we want to achieve two things: \(v_i \perp v_j\) for any \(i \neq j\), and \(\lVert v_i\rVert = 1\) for every vector in the new collection. Set \(v_1 = u_1 / \lVert u_1\rVert\) so that the second criterion is met. The simplest and most intuitive way to transform a set of vectors \(\{u_i\}_i\) into another \(\{v_i\}_i\) is by a linear map. Let us try to construct \(v_2\) as a linear combination of \(u_2\) and \(v_1\): \(v_2 = a\,u_2 + bv_1\). We must have

\begin{align} \langle v_1, v_2\rangle & = a\langle v_1, u_2\rangle + b\langle v_1, v_1\rangle \\ & = a\langle v_1, u_2\rangle + b = 0 \end{align}

So \(b = -a\langle v_1, u_2\rangle\) must be proportional to \(-\langle v_1, u_2\rangle\), the projection of the new vector \(u_2\) onto \(v_1\). Further, if we want \(\langle v_2, v_2\rangle = 1\), expanding gives \(a^2\lVert u_2\rVert^2 + 2ab\langle v_1, u_2\rangle + b^2 = a^2\left(\lVert u_2\rVert^2 - \langle v_1, u_2\rangle^2\right) = 1\), so we must set $$a = \dfrac{1}{\sqrt{\lVert u_2 \rVert^2 - \langle v_1, u_2\rangle^2}} \qquad\text{and}\qquad b = -a\,\langle v_1, u_2\rangle$$

Iterating the process

Suppose we have applied the process \(i-1\) times, so that we have an orthonormal family of \(i-1\) vectors \(\{v_j\}_j\) and a new vector \(u_i\). We wish to find a linear combination again:

$$ v_i = \gamma_0u_i + \sum_{j=1}^{i-1}\gamma_{j}v_j $$

If we solve for \(1 \leq k \leq i-1\) the equation \(\langle v_i, v_k\rangle = 0\), because we have \(\langle v_j, v_k\rangle = 0\) for any \(j\neq k\), we obtain that

$$ \langle v_i, v_k\rangle = \gamma_0\langle u_i, v_k\rangle + \gamma_k\langle v_k, v_k\rangle = \gamma_0\langle u_i, v_k\rangle + \gamma_k = 0 $$

this condition, together with the unit norm constraint, gives solutions

$$ \begin{cases} & \gamma_0 = \left(\lVert u_i \rVert^2 - \sum_{j=1}^{i-1}\langle u_i, v_j\rangle^2\right)^{-1/2} \\ & \\ & \gamma_k =\; -\gamma_0\langle u_i, v_k\rangle \end{cases} $$ (the expression for \(\gamma_0\) comes from expanding \(\langle v_i, v_i\rangle = \gamma_0^2\left(\lVert u_i\rVert^2 - \sum_{j=1}^{i-1}\langle u_i, v_j\rangle^2\right) = 1\)). We thus obtain the Gram-Schmidt algorithm at step \(i\):

$$ w_i \gets u_i -\,\sum_{j=1}^{i-1}\langle u_i, v_j\rangle\,v_j \,,\qquad v_i \gets \dfrac{w_i}{\lVert w_i \rVert} $$ since \(\lVert w_i \rVert^2 = \lVert u_i\rVert^2 - \sum_{j=1}^{i-1}\langle u_i, v_j\rangle^2 = 1/\gamma_0^2\).
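Here is a minimal sketch of the procedure for the Euclidean inner product (any other inner product could be swapped in), following the update above:

```python
import numpy as np

def gram_schmidt(U, eps=1e-12):
    """Orthonormalize the columns of U (classical Gram-Schmidt)."""
    V = np.zeros_like(U, dtype=float)
    for i in range(U.shape[1]):
        w = U[:, i].copy()
        for j in range(i):
            w -= (U[:, i] @ V[:, j]) * V[:, j]    # subtract the projection <u_i, v_j> v_j
        norm = np.linalg.norm(w)
        if norm > eps:                            # w = 0 when u_i is linearly dependent
            V[:, i] = w / norm
    return V

U = np.random.default_rng(0).normal(size=(5, 4))
V = gram_schmidt(U)
print(np.allclose(V.T @ V, np.eye(4)))            # columns are orthonormal
```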

The QR decomposition

Assume the vectors \(u_1, \dots, u_n\) are the columns \(a_j\in\mathbb{R}^n\) of a matrix \(\mathrm{A} = \left[ a_{ij} \right]_{i,j = 1}^n\). We can apply the algorithm to the vectors \(a_j\) with the inner product \(\langle x, y \rangle \triangleq x^\mathsf{T} y\). Denote by \(q_1, \dots, q_n\) the ON family obtained. We can then invert the Gram-Schmidt equations to get

\begin{align} a_i = \sum_{j=1}^i \langle a_i, q_j \rangle q_j = \sum_{j=1}^i r_{ji}q_j = \left[\mathrm{QR}\right]_i \end{align} with \(r_{ji} \triangleq \langle a_i, q_j \rangle = q_j^\mathsf{T}a_i\) (each \(a_i\) lies in the span of \(q_1, \dots, q_i\)), \(\mathrm{Q}\) an orthogonal matrix, and \(\mathrm{R}\) upper triangular, as \(r_{ji} = 0\) for \(j > i\).
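Continuing in the same spirit (a sketch with a random square matrix), we can orthonormalize the columns of \(\mathrm{A}\), build \(\mathrm{R}\) from the inner products \(r_{ji} = q_j^\mathsf{T} a_i\), and check that \(\mathrm{Q}\mathrm{R}\) recovers \(\mathrm{A}\); the result matches np.linalg.qr up to signs, as expected.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))

# Orthonormalize the columns of A (classical Gram-Schmidt, as above).
Q = np.zeros_like(A)
for i in range(A.shape[1]):
    w = A[:, i] - Q[:, :i] @ (Q[:, :i].T @ A[:, i])
    Q[:, i] = w / np.linalg.norm(w)

R = np.triu(Q.T @ A)                    # r_{ji} = <a_i, q_j>; zero below the diagonal
print(np.allclose(Q @ R, A))            # A = Q R
print(np.allclose(np.abs(R), np.abs(np.linalg.qr(A)[1])))   # matches NumPy up to signs
```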

Linear dependence

Finally, notice that if the original family \(\{u_i\}_i\) contains linearly dependent vectors, the (unnormalized) Gram-Schmidt update \(w_i\) becomes \(0\) whenever \(u_i\) already lies in the span of the previous vectors. If the first \(p\) vectors of \(\{u_i\}_{i=1}^n\) are linearly independent and the family has rank \(p\), the resulting family is

$$\{v_j\}_{j=1}^n = \{ v_1, \dots, v_p, 0, \dots, 0 \}$$

Suppose \(\langle\cdot, \cdot\rangle\) is any real-valued inner product, be it on a space \(V\) of vectors, matrices, continuous functions, or even Sobolev spaces. We can prove using simple algebra that we always have

$$ \lvert \langle\phi, \psi\rangle \vert \leq \lVert \phi \rVert\,\lVert \psi \rVert$$

for any two elements \(\phi, \psi\) in \(V\). The norm \(\lVert \cdot \rVert\) is the usual norm associated with the inner product: \(\lVert \phi \rVert = {\langle\phi,\phi\rangle}^{1 / 2}\).

The algebraïc lemma

It is well known that the quadratic \(q(x) \triangleq ax^2 + bx + c\) with \(a \neq 0\) can be factorized as

$$ q(x) = a\left(x + \dfrac{b}{2a}\right)^2 - \dfrac{b^2 - 4ac}{4a}$$

therefore, the equation \(q(x) = 0\) has no real solution if and only if \(b^2 - 4ac < 0\).

The proof

Let us assume \(\psi \neq 0\) and \(\phi \neq \lambda\psi\) for every real \(\lambda\), and compute \(\langle\phi - \lambda\psi, \phi - \lambda\psi\rangle > 0\); expanding, we get

$$ \lambda^2\langle\psi, \psi\rangle - 2\lambda\langle\phi, \psi\rangle + \langle\phi, \phi\rangle $$ which is a strictly positive quadratic in the variable \(\lambda\). Using our lemma, we therefore must have $$4\langle\phi, \psi\rangle^2 < 4\langle\psi, \psi\rangle\langle\phi, \phi\rangle$$ this yields the Cauchy-Schwarz inequality, which turns into an equality when \(\phi = \lambda\psi\).
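A quick numerical check of the inequality for the \(L^2\) inner product \(\langle\phi,\psi\rangle = \int_0^1 \phi\psi\), approximated by quadrature (the two functions are arbitrary choices):

```python
import numpy as np

# Midpoint grid on [0, 1]; the discrete average below is itself a valid inner product.
x = (np.arange(10_000) + 0.5) / 10_000
phi = np.sin(2 * np.pi * x)              # arbitrary functions on [0, 1]
psi = np.exp(-x)

inner = lambda f, g: np.mean(f * g)      # approximates <f, g> = int_0^1 f(x) g(x) dx
lhs = abs(inner(phi, psi))
rhs = np.sqrt(inner(phi, phi) * inner(psi, psi))
print(lhs <= rhs, lhs, rhs)
```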

This post aims at gathering different inequalities pertaining to the product \(\mathrm{A}x\), where \(\mathrm{A}\) is an \(m\times n\) matrix and \(x\in\mathbb{R}^n\) is a vector. We will decompose the matrix \(\mathrm{A}\) in terms of its column vectors \(a_j\) and its row vectors \(\tilde{a}_i^\mathsf{T}\):

$$ \mathrm{A} = \begin{bmatrix} a_1 & \dots & a_n \end{bmatrix} = \begin{bmatrix} \tilde{a}_1^\mathsf{T} \\ \vdots \\ \tilde{a}_m^\mathsf{T} \end{bmatrix} $$

Euclidean norm

The quantity $$\lVert\mathrm{A}x \rVert_2^2 = \sum_{i=1}^m \left(\tilde{a}_i^\mathsf{T}x\right)^2$$ can be bounded using the Cauchy-Schwarz inequality to obtain

$$\lVert\mathrm{A}x \rVert_2^2 \leq \left(\sum_{i=1}^m \lVert\tilde{a}_i\rVert_2^2\right)\, \lVert x\rVert_2^2 = \lVert \mathrm{A} \rVert_{\mathrm{F}}^2 \, \lVert x\rVert_2^2 $$

thus there is an \(M = \lVert \mathrm{A} \rVert_{\mathrm{F}}\) such that \( \lVert\mathrm{A}x \rVert_2 \leq M\lVert x\rVert_2\).

This means all matrices are continuous and bounded linear maps for the \(l_2\) norm.

Manhattan norm

Suppose we now measure both the input and the output space with the \(l_1\) norm. The quantity $$\lVert\mathrm{A}x \rVert_1 = \sum_{i=1}^m \lvert\tilde{a}_i^\mathsf{T}x\rvert = \sum_{i=1}^m \lvert\sum_{j=1}^n a_{ij}x_j\rvert $$ can be upper-bounded using the finite-sum generalization of the scalar triangle inequality \(\lvert a + b \rvert \leq \lvert a\rvert + \lvert b \rvert\). Denoting \(\lVert \mathrm{A} \rVert_1\) the sum of the unsigned entries \(\lvert a_{ij} \rvert\) (an entrywise norm, not the induced \(l_1\) norm), we have the inequality:

$$ \lVert\mathrm{A}x \rVert_1 \leq \lVert\mathrm{A} \rVert_1 \lVert x \rVert_1$$

we can conclude similarly to the first section.

Infinity norm

Assume we now use the infinity norm \(\lVert \cdot \rVert_\infty\). We can upper-bound the output \(\mathrm{A}x\) as follows:

\begin{align} \lVert \mathrm{A}x \rVert_\infty = \max_i\lvert \tilde{a}_i^\mathsf{T} x\rvert &= \max_i \left\lvert \sum_{j=1}^n a_{ij}x_j\right\rvert \\ & \leq \max_i \sum_{j=1}^n \left\lvert a_{ij} \right\rvert \left\lvert x_j\right\rvert \\ & \leq \lVert x\rVert_\infty \max_i \sum_{j=1}^n \left\lvert a_{ij} \right\rvert = M\,\lVert x\rVert_\infty \end{align}

we thus obtain that any linear operator is bounded and continuous for \(l_\infty\) in finite dimensions.
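A short sketch checking the three bounds above on a random matrix and vector; the constants are, respectively, the Frobenius norm, the entrywise \(l_1\) sum, and the maximum absolute row sum:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))
x = rng.normal(size=5)

print(np.linalg.norm(A @ x, 2) <= np.linalg.norm(A, 'fro') * np.linalg.norm(x, 2))
print(np.linalg.norm(A @ x, 1) <= np.abs(A).sum() * np.linalg.norm(x, 1))
print(np.linalg.norm(A @ x, np.inf) <= np.abs(A).sum(axis=1).max() * np.linalg.norm(x, np.inf))
```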

Duality

To be added.

A linear map and its matrix

Suppose we have a linear map \(T\colon V \to W\) where \(V,W\) are two \(\mathbb{R}\)-vector spaces with bases \(\{v_i\}_{i=1}^n\) and \(\{w_j\}_{j=1}^m\). The matrix representation \(\mathrm{Mat}(T) = \mathrm{A}\in\mathbb{R}^{m\times n}\) of the map \(T\) is easy to find: first expand any \(v\in V\) in terms of the basis vectors:

$$ T( v ) = T \left(\sum_{i=1}^n \alpha_i\,v_i\right) $$

Then use the linearity of the map \(T\) to obtain:

$$ T( v ) = \sum_{i=1}^n \alpha_i\,T(v_i) $$

finally, expand each vector \(T(v_i)\) in terms of the basis vectors of \(W\):

$$ T(v_i) = \sum_{j=1}^m a_{ji} w_j $$

This gives us the equality for any \(v\in V\): $$ T(v) = \sum_{j=1}^m \sum_{i=1}^n \alpha_i a_{ji} w_j $$

The coefficients \(\vec{\alpha} = [\alpha_1\,,\, \dots \,,\, \alpha_n]\) uniquely characterize the vector \(v\), and similarly the coefficients \(a_{ij}\) uniquely characterize the map \(T\) given the bases of \(V,W\). Because the expansion of \(T(v)\) in the basis \(\{w_j\}_j\) is unique, we obtain that \(T(v)\) has the coefficients \(\beta_j = \sum_{k=1}^n a_{jk}\alpha_k\). We can therefore think of linear maps purely as multiplications and additions of the coefficients \(\alpha_i\) with the entries \(a_{ij}\).

We can then define the array \(\mathrm{A} = \left[ a_{ij} \right]_{ij}\) and matrix multiplication \(\mathrm{A} \vec{\alpha}\) as expected.
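As a concrete illustration (an example of my choosing, not from the post), take \(T = \mathrm{d}/\mathrm{d}x\) acting on polynomials of degree at most \(3\), with the monomial basis \(\{1, x, x^2, x^3\}\) for both \(V\) and \(W\); the columns of \(\mathrm{A}\) are the coefficient vectors of the images \(T(v_i)\).

```python
import numpy as np

# Basis v_i = x^i for i = 0..3; T(x^i) = i x^(i-1), expanded in the same basis.
n = 4
A = np.zeros((n, n))
for i in range(1, n):
    A[i - 1, i] = i                # column i holds the coefficients of T(v_i)

# The polynomial p(x) = 2 + 3x - x^3 has coefficient vector alpha:
alpha = np.array([2.0, 3.0, 0.0, -1.0])
print(A @ alpha)                   # coefficients of p'(x) = 3 - 3x^2  ->  [3, 0, -3, 0]
```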

Composition as matrix multiplication

Let \(T,S,R\) be linear maps with matrices \(\mathrm{A},\mathrm{B},\mathrm{C}\). The third map \(R\) is the composition \(S\circ T\). Given bases for the spaces \(V\xrightarrow{T} W \xrightarrow{S} Z\), let us denote

  • \(\vec{\alpha} = [\alpha_1\,,\, \dots \,,\, \alpha_n]\) the unique coefficients representing an arbitrary \(v\in V\) using the basis for \(V\),
  • \(\vec{\beta} = [\beta_1\,,\, \dots \,,\, \beta_m]\) the unique coefficients representing \(w = T(v)\in W\)
  • \(\vec{\gamma} = [\gamma_1\,,\, \dots \,,\, \gamma_p]\) the unique coefficients representing \(z = S(w)\in Z\).

From the first section, we know that \(T\) has matrix representation \(\mathrm{A} = \left[a_{ij}\right]_{ij}\) so that the output coefficients are

$$ \beta_j = \sum_{k=1}^n a_{jk} \alpha_k $$

and that \(S\) has representation \(\mathrm{B} = \left[b_{ij}\right]_{ij}\) so that

$$ \gamma_q = \sum_{j=1}^m b_{qj} \beta_j $$

therefore, we have the equalities

\begin{align} \gamma_q &= \sum_{j=1}^m b_{qj} \sum_{k=1}^n a_{jk} \alpha_k \\ &= \sum_{k=1}^n \left(\sum_{j=1}^m b_{qj}a_{jk}\right) \alpha_k \\ &= \sum_{k=1}^n c_{qk}\alpha_k \end{align}

thus we obtain that the matrix \(\mathrm{C}\) has coefficients \(c_{qk} = \sum_{j=1}^m b_{qj}a_{jk} = \left[ \mathrm{BA} \right]_{qk}\). We have thus proved that \(\mathrm{C} = \mathrm{Mat}(S\circ T) = \mathrm{BA} = \mathrm{Mat}(S)\mathrm{Mat}(T)\).
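A quick check that composition corresponds to matrix multiplication, with random matrices standing in for \(\mathrm{Mat}(T)\) and \(\mathrm{Mat}(S)\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 4, 3, 2
A = rng.normal(size=(m, n))    # Mat(T): V -> W
B = rng.normal(size=(p, m))    # Mat(S): W -> Z

alpha = rng.normal(size=n)     # coefficients of some v in V
beta = A @ alpha               # coefficients of T(v)
gamma = B @ beta               # coefficients of S(T(v))
print(np.allclose(gamma, (B @ A) @ alpha))   # Mat(S o T) = B A
```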
