$$ \def\*#1{\mathbf{#1}} \def\+#1{\mathcal{#1}} \def\-#1{\mathrm{#1}} \def\!#1{\mathsf{#1}} \def\@#1{\mathscr{#1}} \newcommand{\mr}[1]{\mbox{\scriptsize \color{RedViolet}$\triangleright\;$#1}\quad\quad} $$

Lecture 8: Tensorization of Variance, the Poincaré inequality

Author

instructed by Chihao Zhang, scribed by Xuanxuan Liu and Yuchen He

We have previously studied concentration inequalities, discrete-time and continuous-time Markov processes. In this lecture, we will explore the deep connections between these topics. Most of the material in this note is adapted from [Van16].

Tensorization of variance

Let \(X=(X_1,\dots,X_n)\in \bb R^n\) be a random vector of length \(n\). Consider a function \(f:\bb R^n\to \bb R\).

Recall that when analyzing the concentration of \(f(X)\), we aim to show that \(f(X)\) concentrates tightly around its mean, \(\E{f(X)}\). Intuitively, what properties of \(f\) and \(X\) facilitate this concentration? On one hand, consider the linear function \(f(X)=\frac{1}{n}\sum_{i=1}^n X_i\). If the \(X_i\)’s are independent, the variance scales as \(O(1/n)\), leading to strong concentration. On the other hand, recall McDiarmid’s inequality. If \(f\) is Lipschitz (i.e., changing one coordinate does not change the function value significantly), then \(f(X)\) tends to concentrate.

Based on this, we identify two intuitive requirements for concentration:

  • The correlations between \(X_i\)’s are weak.
  • The function \(f\) is not sensitive to any single coordinate.

Recall Chebyshev’s inequality: for any \(t>0\), \[ \Pr{\abs{f-\E{f}}\geq t} \leq \frac{\Var{f}}{t^2}. \] To use Chebyshev’s inequality to prove concentration results, we need to bound \(\Var{f}\). In this lecture, we will demonstrate that if \(X\) has independent components, we can obtain an upper bound on \(\Var{f}\) by summing the variances with respect to each coordinate individually. This property is called the tensorization of variance.

Theorem 1 (Tensorization of Variance) Suppose \(X_1,\dots,X_n\) are mutually independent. Then \[ \Var{f(X)} \leq \sum_{i=1}^n \E{\Var{f(X)\mid X_{-i}}}, \] where \(X_{-i} = (X_1,\dots,X_{i-1},X_{i+1},\dots,X_n)\).

The term “tensorization” refers to the ability to decompose a high-dimensional value into a sum of one-dimensional values.

Example 1 Consider \(f(X)=\frac{1}{n}\sum_{i=1}^n X_i\). Since the \(X_i\)’s are independent, fixing all variables except \(X_i\) leaves us with \(\Var{\sum_{k=1}^n X_k\mid X_{-i}} = \Var{X_i}\). Therefore, the tensorization theorem yields: \[ \Var{\frac{1}{n}\sum_{i=1}^n X_i} \leq \frac{1}{n^2} \sum_{i=1}^n\Var{X_i}. \] This recovers the standard variance formula for the sum of independent variables.
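To make the theorem concrete, here is a minimal numerical sketch (not from the lecture): it enumerates a product distribution on \(\{0,1\}^4\) with an arbitrarily chosen bias vector `p` and an arbitrary nonlinear test function `f`, and compares both sides of the tensorization bound exactly.

```python
import itertools
import numpy as np

# Exact check of the tensorization bound on {0,1}^4 with independent bits.
n = 4
p = np.array([0.2, 0.5, 0.7, 0.9])               # P[X_i = 1] (arbitrary biases)

def f(x):                                        # an arbitrary nonlinear test function
    return max(x) + x[0] * x[1] - sum(x) ** 2

def var(pairs):                                  # variance of a finite (value, weight) list
    m = sum(w * v for v, w in pairs)
    return sum(w * (v - m) ** 2 for v, w in pairs)

points = list(itertools.product([0, 1], repeat=n))
prob = {x: np.prod([p[i] if x[i] else 1 - p[i] for i in range(n)]) for x in points}

lhs = var([(f(x), prob[x]) for x in points])     # Var[f(X)]

rhs = 0.0                                        # sum_i E[ Var(f(X) | X_{-i}) ]
for i in range(n):
    for x in points:
        if x[i] == 0:                            # visit each {x_i=0, x_i=1} pair once
            x1 = x[:i] + (1,) + x[i + 1:]
            w_rest = prob[x] + prob[x1]          # P[X_{-i} = x_{-i}]
            rhs += w_rest * var([(f(x), 1 - p[i]), (f(x1), p[i])])

print(lhs, rhs)                                  # lhs <= rhs
```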

Now, we provide a proof of the theorem using martingales. For convenience, we denote the conditional variance with respect to the \(i\)-th coordinate as \(\Var[i]{f(X)} := \Var{f(X)\mid X_{-i}}\).

Proof (Proof of the tensorization of variance). Consider the Doob martingale \(\set{Z_k}_{k=0}^n\) where \(Z_k\defeq \E{f\mid \+F_k}\), with \(\+F_k=\sigma\tp{X_1,\dots,X_k}\). Define \(\Delta_k=Z_k-Z_{k-1}\). Recall that we have \(Z_n=f(X)\) and \(Z_0=\E{f}\). Therefore, we can decompose \(f-\E{f} = \sum_{k=1}^n \Delta_k\).

By the martingale property, for any \(\ell<k\), \[ \E{\Delta_k\cdot \Delta_{\ell}} = \E{\E{\Delta_k\cdot \Delta_{\ell}\mid \+F_{k-1}}} = \E{\Delta_{\ell}\cdot \E{\Delta_k\mid \+F_{k-1}}}=0. \] Consequently, the variance is the sum of the expected squared increments: \[ \Var{f} = \E{\tp{\sum_{k=1}^n \Delta_k}^2} = \E{\sum_{k=1}^n \Delta_k^2}. \] To prove the theorem, it suffices to show that for each \(k \in [n]\), \(\E{\Delta_k^2}\leq \E{\Var[k]{f}}\). Let \(\+F_{-k}=\sigma(X_{-k})\). Recall that \(Z_{k-1} = \E{f \mid \+{F}_{k-1}}\). Because \(X_k\) is independent of \(X_{-k}\), conditioning on \(\+{F}_k\) (which adds knowledge of \(X_k\)) provides no new information about \(\E{f \mid \mathcal{F}_{-k}}\). Thus, we can write: \[ \begin{align*} \Delta_k &= \E{f\mid \+F_k} - \E{f\mid \+F_{k-1}}\\ \mr{tower rule}&= \E{f\mid \+F_k} - \E{\E{f\mid \+F_{-k}}\mid \+F_{k-1}}\\ \mr{mutual independence}&= \E{f\mid \+F_k} - \E{\E{f\mid \+F_{-k}}\mid \+F_{k}}. \end{align*} \] Then \[ \begin{align*} \E{\Delta_k^2} &= \E{\E{f - \E{f\mid \+F_{-k}}\mid \+F_{k}}^2}\\ \mr{Jensen's inequality} &\leq \E{\E{\tp{f - \E{f\mid \+F_{-k}}}^2\mid \+F_{k}}}\\ &= \E{\tp{f - \E{f\mid \+F_{-k}}}^2}\\ &= \E{\E{\tp{f - \E{f\mid \+F_{-k}}}^2\mid \+F_{-k}}}\\ &= \E{\Var[k]{f}}. \end{align*} \] Summing over \(k\) completes the proof.


The term \(\E{\Var[i]{f(X)}}\) on the RHS of the above theorem represents the average sensitivity of the function to the \(i\)-th coordinate. We can further bound this using the maximum possible variation. Define the derivative: \[ D_if(x) = \sup_{z} f(x_1,\dots,x_{i-1},z,x_{i+1},\dots,x_n) - \inf_{z} f(x_1,\dots,x_{i-1},z,x_{i+1},\dots,x_n). \]
Since the variance of a random variable bounded in an interval of length \(L\) is at most \(L^2/4\), we obtain the following corollary.

Corollary 1 For independent \(X_1,\dots,X_n\), \(\Var{f(X)}\leq \frac{1}{4}\sum_{i=1}^n \E{\tp{D_if(X)}^2}\).
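As an illustration of the corollary (my own sketch, not part of the lecture), take independent \(\!{Unif}[0,1]\) coordinates and \(f(x)=\max_i x_i\). Taking the sup and inf over the support \([0,1]\), \(D_if(x)=1-\max_{j\neq i}x_j\), and a quick Monte Carlo estimate shows the bound comfortably dominates the true variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 500_000
X = rng.uniform(size=(m, n))                   # independent Uniform[0,1] coordinates

f_vals = X.max(axis=1)                         # f(x) = max_i x_i
lhs = np.var(f_vals)                           # Var[f(X)]

# D_i f(x) = sup_z f(...,z,...) - inf_z f(...,z,...) = 1 - max_{j != i} x_j
rhs = 0.0
for i in range(n):
    others = np.delete(X, i, axis=1).max(axis=1)
    rhs += 0.25 * np.mean((1.0 - others) ** 2)

print(lhs, rhs)                                # lhs <= rhs (Corollary 1)
```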

A major limitation of the tensorization result above is the requirement that the \(X_i\)’s must be independent. An important question arises: Can we generalize this to dependent random variables? In the remainder of this lecture, we aim to answer this question. Specifically, we want to establish Poincaré-type inequalities for variables that are not necessarily independent. We seek an inequality of the form: \[ \Var{f}\leq C\sum_{i=1}^n \mbox{Expected sensitivity of the }i\mbox{-th coordinate}. \]

Poincaré inequality for reversible Markov chains

Recall the properties of reversible Markov chains from previous lectures. Let \(P\) be the transition matrix of a reversible chain with stationary distribution \(\pi\). We assume its eigenvalues are ordered as \(1 = \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq -1\).

We previously defined the Laplacian operator \(L = I - P\). Its eigenvalues, denoted by \(\gamma_i = 1 - \lambda_i\), satisfy \(0 = \gamma_1 \leq \gamma_2 \leq \cdots \leq \gamma_n \leq 2\). The second smallest eigenvalue, \(\gamma_2\), is known as the spectral gap. Using the Rayleigh quotient, we have: \[ \gamma_2 = \min_{v\perp \bb 1} \frac{\inner{v}{Lv}_{\pi}}{\inner{v}{v}_\pi} = \min_{v\perp \bb 1} \frac{\inner{v}{Lv}_{\pi}}{\inner{v-\E[\pi]{v}}{v-\E[\pi]{v}}_\pi} = \min_{v\perp \bb 1} \frac{\inner{v}{Lv}_{\pi}}{\Var[\pi]{v}}. \] The second equality is due to \(\E[\pi]{v}=\inner{v}{\bb 1}_{\pi}=0\). For functions \(f,g\), define the Dirichlet form associated with the chain \(\+E_{L}(f,g)\) as \(\inner{f}{Lg}_{\pi}\). Using this notation, the variational characterization of the spectral gap implies: \[ \gamma_2 = \min_{v\perp \bb 1} \frac{\+E_{L}(v,v)}{\Var[\pi]{v}}. \]
Rearranging this yields the Poincaré inequality for the Markov chain \(P\).

Definition 1 (Poincaré Inequality) We say a Markov chain with stationary distribution \(\pi\) satisfies the Poincaré inequality with constant \(C\) if for all functions \(f\): \[ \Var[\pi]{f} \leq C\cdot \mathcal{E}_L(f,f). \]

The above calculations show that a reversible Markov chain \(P\) always satisfies the Poincaré inequality with constant \(\frac{1}{\gamma_2}\) whenever \(\gamma_2\neq 0\).
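This can be verified directly on a small reversible chain. The following sketch (an illustration with arbitrary parameters, not part of the lecture) builds a random birth-death chain, computes \(\gamma_2\) from the spectrum of \(L=I-P\) in the \(\pi\)-weighted inner product, and checks \(\Var[\pi]{f}\leq \frac{1}{\gamma_2}\+E_{L}(f,f)\) for a random \(f\).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6

# A random birth-death chain on {0,...,N-1}: reversible by construction.
P = np.zeros((N, N))
for x in range(N):
    up = rng.uniform(0.1, 0.4) if x < N - 1 else 0.0
    down = rng.uniform(0.1, 0.4) if x > 0 else 0.0
    if x < N - 1:
        P[x, x + 1] = up
    if x > 0:
        P[x, x - 1] = down
    P[x, x] = 1.0 - up - down                      # lazy self-loop keeps rows stochastic

# Stationary distribution: the left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

L = np.eye(N) - P
# Eigenvalues of L in the pi-weighted inner product equal those of the
# symmetric matrix  S = Pi^{1/2} L Pi^{-1/2}  (symmetry uses reversibility).
S = np.diag(np.sqrt(pi)) @ L @ np.diag(1.0 / np.sqrt(pi))
gamma2 = np.sort(np.linalg.eigvalsh(S))[1]         # spectral gap

f = rng.normal(size=N)                             # a random test function
var_f = pi @ (f - pi @ f) ** 2                     # Var_pi[f]
dirichlet = pi @ (f * (L @ f))                     # E_L(f,f) = <f, Lf>_pi

print(var_f, dirichlet / gamma2)                   # Var_pi[f] <= E_L(f,f) / gamma2
```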

Let’s look at an example to see how the Poincaré inequality is related to the tensorization of variance. When the chain is clear from the context, we abbreviate the Dirichlet form \(\+E_{L}(f,g)\) as \(\+E(f,g)\).

Example 2 (Random walk on the hypercube) Consider the random walk on the hypercube \(\set{\pm 1}^n\). The Markov chain is as follows: at each step \(t\),

  • Pick \(i\in [n]\), \(c\in \set{\pm 1}\) uniformly at random.
  • Update \(X_{t+1}=X_t(i\leftarrow c)\) (i.e., replace the \(i\)-th bit of \(X_t\) with \(c\)).

The stationary distribution \(\pi\) is the uniform distribution. We see how the Poincaré inequality for this Markov chain implies the tensorization of variance for \(\pi\).

Define \(d_t \defeq \max_{x,y\in \set{\pm 1}^n} D_{\!{TV}}\tp{P^t(x,\cdot),P^t(y,\cdot)}\). For each \(x,y\in \set{\pm 1}^n\), consider the coupling in which the two chains, started from \(X_0=x\) and \(Y_0=y\), use the same pair \((i,c)\) at every step; once a coordinate has been selected, the two chains agree on it forever. Then \[\begin{align*} D_{\!{TV}}\tp{P^t(x,\cdot),P^t(y,\cdot)}&\leq \Pr{X_t\neq Y_t}\\ \mr{union bound}&\leq n\cdot \tp{1-\frac{1}{n}}^t\\ &\leq n\cdot e^{-\frac{t}{n}}. \end{align*}\] Using the spectral bound from [Che98], we know that \(\abs{\lambda_2}^t\leq d_t\) (the proof of this result is provided at the end of this note). Taking \(t\)-th roots and letting \(t\to\infty\), this implies \[ \abs{\lambda_2}\leq n^{\frac{1}{t}}e^{-\frac{1}{n}} \xrightarrow{t\to\infty} e^{-\frac{1}{n}} \approx 1-\frac{1}{n}. \] Then we obtain the lower bound \(\gamma_2\geq 1-e^{-\frac{1}{n}}\approx \frac{1}{n}\) (for this chain one can check that \(\gamma_2=\frac{1}{n}\) exactly), and thus establish the Poincaré inequality \[ \forall f,\ \Var[\pi]{f}\leq n\cdot \+E(f,f). \]

It remains to bound \(\+E(f,f)\). Let \(x^{\oplus i}\) denote the vector \(x\) with the \(i\)-th bit flipped, and let \(x^{i\gets c}\) denote \(x\) with its \(i\)-th bit set to \(c\). For this Markov chain, \[ \begin{align*} \+E(f,f) &=\frac{1}{2}\cdot \sum_{x,y\in \set{\pm 1}^n} \pi(x)P(x,y)\tp{f(x)-f(y)}^2\\ &=\frac{1}{2}\cdot \sum_{i=1}^n \sum_{x\in \set{\pm 1}^n} 2^{-n}\cdot \frac{1}{n}\cdot \frac{1}{2}\cdot \tp{f(x)-f(x^{\oplus i})}^2\\ &=\frac{1}{n}\cdot \sum_{i=1}^n \sum_{x_{-i}\in \set{\pm 1}^{n-1}} 2^{-\tp{n-1}}\cdot \frac{1}{2}\E[c_1,c_2\sim \!{Unif(\pm 1)}]{\tp{f\tp{x^{i\gets c_1}} - f\tp{x^{i\gets c_2}}}^2}\\ &= \frac{1}{n}\sum_{i=1}^n \E{\Var[i]{f}}. \end{align*} \]

Substituting this into the Poincaré inequality, we get \(\Var[\pi]{f}\leq \sum_{i=1}^n \E{\Var[i]{f}}\). This exactly recovers the tensorization of variance theorem for \(\pi\)!
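The identity \(\+E(f,f)=\frac{1}{n}\sum_{i=1}^n \E{\Var[i]{f}}\) derived above can be confirmed by brute-force enumeration for small \(n\); the sketch below (my own illustration, with an arbitrary test function) does exactly that.

```python
import itertools

n = 4
cube = list(itertools.product([-1, 1], repeat=n))

def f(x):                                    # an arbitrary test function on {-1,+1}^n
    return x[0] * x[1] + sum(x) ** 2 - x[2]

# Dirichlet form of the walk:  E(f,f) = 1/2 sum_{x,y} pi(x) P(x,y) (f(x)-f(y))^2,
# where pi is uniform and P(x, x with bit i flipped) = 1/(2n).
dirichlet = 0.0
for x in cube:
    for i in range(n):
        y = x[:i] + (-x[i],) + x[i + 1:]     # flip the i-th bit
        dirichlet += 0.5 * 2.0 ** -n * (1.0 / (2 * n)) * (f(x) - f(y)) ** 2

# (1/n) * sum_i E[ Var_i f ]  under the uniform distribution.
rhs = 0.0
for i in range(n):
    for x in cube:
        if x[i] == 1:                        # visit each {+1,-1} pair once
            a = f(x)
            b = f(x[:i] + (-1,) + x[i + 1:])
            rhs += 2.0 ** -(n - 1) * (a - b) ** 2 / 4.0   # variance of a 2-point uniform
rhs /= n

print(dirichlet, rhs)                        # the two quantities coincide
```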

This example illustrates a profound insight: The Poincaré inequality can be viewed as a generalization of the tensorization of variance to dependent variables. The Dirichlet form \(\mathcal{E}(f,f)\) captures the sum of local sensitivities. In fact, for most Markov processes studied in this course, the Dirichlet form explicitly manifests as a summation of the sensitivities of each coordinate.

The Poincaré inequality for Markov semigroups

We now generalize our results from discrete-time to continuous-time Markov processes. We will see how the rapid mixing of a Markov process is related to concentration.

Consider a Markov semigroup \(\set{P_t}_{t\geq 0}\) with generator \(\+L\) and a unique stationary distribution \(\pi\). When the process is reversible, we can verify that the adjoint operator \(\+L^*=\+L\). Recall that \(\+L\) is analogous to \(P-I\) or \(-L\) in the discrete-time case. We can similarly define the Dirichlet form \(\+E_{\+L}(f,g)=-\inner{f}{\+Lg}_{\pi}\) for functions \(f\) and \(g\). We denote the \(L^2(\pi)\) norm as \(\|f\|_{L^2(\pi)} = \sqrt{\langle f, f \rangle_\pi}\).

Theorem 2 For a reversible Markov semigroup, the following statements are equivalent.

  1. \(\forall f,\ \Var{f}\leq c\cdot \+E_{\+L}(f,f)\);
  2. \(\forall f,\ \norm{P_t f - \E{f}}_{L^2(\pi)}\leq e^{-\frac{t}{c}}\cdot \norm{f - \E{f}}_{L^2(\pi)}\);
  3. \(\forall f,\ \+E_{\+L}(P_t f,P_t f)\leq e^{-\frac{2t}{c}}\cdot \+E_{\+L}(f,f)\);
  4. For any \(f\), there exists a constant \(k(f)\) such that \(\norm{P_t f - \E{f}}_{L^2(\pi)}\leq k(f) e^{-\frac{t}{c}}\);
  5. For any \(f\), there exists a constant \(k(f)\) such that \(\+E_{\+L}(P_t f,P_t f)\leq k(f) e^{-\frac{2t}{c}}\).

Note that the first point is the standard definition of the Poincaré inequality. Points 2 and 4 state that the function \(P_t f\) converges exponentially fast to the constant function \(\E{f}\) in the \(L^2\) norm. Points 3 and 5 provide a powerful tool for proving rapid mixing: we only need to establish a contraction of the Dirichlet form.

In this lecture, we will focus on proving the implication (3) \(\Rightarrow\) (1). Interested readers can refer to [Van16] for the complete proof of the theorem. Before delving into the proof, let’s see an example.

The Poincaré inequality of the standard Gaussian distribution

To establish the Poincaré inequality for the standard Gaussian distribution, we consider the Ornstein–Uhlenbeck (OU) process: \[ \d X_t = -X_t \d t + \sqrt{2}\d B_t. \] Recall the key properties of this process:

  • The stationary distribution \(\pi\) is \(\+N(0,1)\);
  • The generator \(\+L\) satisfies \(\+L f(x)=-x f'(x) + f''(x)\);
  • The distribution of \(X_t\) is the same as that of \(e^{-t}X_0 + \sqrt{1-e^{-2t}}\xi\), where \(\xi\) is a standard Gaussian independent of \(X_0\). Therefore, \(P_tf(x) = \E{f\tp{e^{-t}x + \sqrt{1-e^{-2t}}\xi}}\).

First, let’s derive the useful tool of Gaussian integration by parts. Assume \(X\sim \+N(0,1)\). Then \[ \begin{align*} \E{Xf(X)} &= \frac{1}{\sqrt{2\pi}}\int_{\bb R} x f(x)\cdot e^{-\frac{x^2}{2}}\d x\\ &= \frac{1}{\sqrt{2\pi}}\int_{\bb R} -f(x)\cdot \d e^{-\frac{x^2}{2}}\\ \mr{integration by parts}&= \frac{1}{\sqrt{2\pi}}\int_{\bb R} e^{-\frac{x^2}{2}} f'(x)\d x\\ &= \E{f'(X)}. \end{align*} \]
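A quick Monte Carlo check of this identity (my own sketch, with an arbitrarily chosen smooth \(f\)):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(2_000_000)

f = lambda x: np.sin(x) + x ** 3            # an arbitrary smooth test function
fp = lambda x: np.cos(x) + 3 * x ** 2       # its derivative

print(np.mean(X * f(X)), np.mean(fp(X)))    # E[X f(X)] should match E[f'(X)]
```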

Then using Gaussian integration by parts, we can simplify the Dirichlet form for the OU generator. Let \(\gamma\) be the probability density function of \(\+N(0,1)\). By definition, for functions \(f,g\), \[ \begin{align*} \+E_{\+L}\tp{f,g} &=-\inner{f}{\+L g}_{\pi}\\ &=\int_{\bb R} f(x)\tp{xg'(x) - g''(x)} \gamma(x)\d x\\ \mr{Gaussian integration by parts}&=\int_{\bb R} \tp{ \tp{f(x)g'(x)}' - f(x)g''(x)} \gamma(x)\d x\\ &=\int_{\bb R} f'(x)g'(x) \gamma(x)\d x. \end{align*} \] Therefore, \(\+E_{\+L}\tp{f,f} = \E[\pi]{(f'(X))^2}\).

Then we compute the Dirichlet form of the evolved function \(P_t f\): \[ \begin{align*} \+E_{\+L}\tp{P_tf, P_tf} &= \norm{\tp{P_t f}'}^2_{L^2(\pi)}\\ \mr{differentiate under the expectation}&=e^{-2t} \norm{P_tf'}^2_{L^2(\pi)}\\ &= e^{-2t} \E[X_0\sim \pi]{\E{f'(X_t)\mid X_0}^2}\\ \mr{Jensen's inequality}&\leq e^{-2t}\E[X_0\sim \pi]{\tp{f'(X_t)}^2} \\ \mr{stationarity of $\pi$}&=e^{-2t}\+E_{\+L}(f,f). \end{align*} \] This proves statement 3 in the above theorem with constant \(c=1\), which further implies that \[ \forall f, \ \Var[\pi]{f}\leq \+E_{\+L}(f,f). \] This is the Gaussian Poincaré Inequality. For the high-dimensional case, we can similarly obtain \[ \Var[\pi]{f}\leq \E[\pi]{\norm{\nabla f}^2} = \sum_{i=1}^n \E{\tp{\frac{\partial}{\partial x_i} f(X)}^2}, \] which again recovers the tensorization result.
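The Gaussian Poincaré inequality itself is easy to probe numerically; the sketch below (an illustration, with an arbitrary test function) compares \(\Var{f(X)}\) with \(\E{\tp{f'(X)}^2}\) for \(X\sim\+N(0,1)\).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(2_000_000)

f = lambda x: np.tanh(2 * x) + 0.3 * x ** 2        # an arbitrary smooth test function
fp = lambda x: 2 / np.cosh(2 * x) ** 2 + 0.6 * x   # its derivative

print(np.var(f(X)), np.mean(fp(X) ** 2))           # Var[f(X)] <= E[f'(X)^2]
```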

Proof of (3) \(\Rightarrow\) (1)

We see how the Poincaré inequality is related to the mixing of the Markov process. Note that when we say a Markov process mixes, we mean that for any function \(f\), \(P_t f\) converges to a constant function. Therefore, we examine the derivative of \(\Var{P_t f}\).

Lemma 1 \(\frac{\d}{\d t} \Var{P_t f} = -2\+E\tp{P_t f,P_t f}\)

Proof. Since \(\E{P_t f} = \E{f}\) does not depend on \(t\), we have \[ \begin{align*} \frac{\d}{\d t} \Var{P_t f} &= \frac{\d}{\d t} \E{\tp{P_t f}^2}\\ &=\E{2 P_tf\cdot \frac{\d}{\d t} P_t f}\\ \mr{$\frac{\d}{\d t} P_t=\+L P_t$}&= \E{2 P_tf\cdot \+L P_t f}\\ &= -2\+E\tp{P_t f,P_t f}. \end{align*} \]
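For a finite-state chain run in continuous time with generator \(\+L = P-I\), this identity can be checked numerically with a matrix exponential. The sketch below is my own illustration (it assumes scipy is available); the chain and the function \(f\) are arbitrary choices.

```python
import numpy as np
from scipy.linalg import expm

# Lazy random walk on a cycle, run in continuous time with generator G = P - I.
N = 6
P = np.zeros((N, N))
for x in range(N):
    P[x, x] = 0.5
    P[x, (x + 1) % N] += 0.25
    P[x, (x - 1) % N] += 0.25
G = P - np.eye(N)
pi = np.full(N, 1.0 / N)                        # uniform stationary distribution

rng = np.random.default_rng(0)
f = rng.normal(size=N)                          # an arbitrary test function

def var_pi(g):
    return pi @ (g - pi @ g) ** 2

t, h = 0.7, 1e-5
Ptf = expm(t * G) @ f
lhs = (var_pi(expm((t + h) * G) @ f) - var_pi(expm((t - h) * G) @ f)) / (2 * h)
rhs = 2 * (pi @ (Ptf * (G @ Ptf)))              # -2 E(P_t f, P_t f) = 2 <P_t f, G P_t f>_pi
print(lhs, rhs)                                 # the two agree up to O(h^2)
```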

Corollary 2 \(\Var{f} = 2\int_0^{\infty} \+E\tp{P_tf, P_tf}\d t\).

Proof. From the above lemma, and since \(P_tf\) converges to the constant function \(\E{f}\) (so \(\Var{P_{\infty}f}=0\)), \[\begin{align*} \Var{f} &= \Var{P_0f} - \Var{P_{\infty}f}\\ &=- \int_0^{\infty} \frac{\d}{\d t} \Var{P_t f}\d t\\ &=2 \int_0^{\infty} \+E\tp{P_t f,P_t f}\d t. \end{align*}\]

Assume the third statement holds: \[ \+E_{\+L}(P_t f,P_t f)\leq e^{-\frac{2t}{c}}\cdot \+E_{\+L}(f,f), \] for some constant \(c\). We then have \[ \Var{f} = 2\int_0^{\infty} \+E\tp{P_tf, P_tf}\d t \leq 2 \int_0^{\infty} e^{-\frac{2t}{c}}\cdot \+E_{\+L}(f,f)\d t = c\cdot \+E(f,f). \] This proves the Poincaré inequality.

Proof of \(|\lambda_2|^t \leq d_t\) in [Che98]

Finally, we justify the bound \(|\lambda_2|^t \leq d_t\) used earlier. Consider a discrete-time Markov chain with transition matrix \(P\). Recall that \(d_t \defeq \max_{x,y\in \Omega} D_{\text{TV}}(P^t(x,\cdot), P^t(y,\cdot))\).

For a function \(f: \Omega \to \mathbb{R}\), define its Lipschitz parameter (with respect to the discrete metric \(\mathbb{1}[x \neq y]\)) as \(\!{Lip}(f)\defeq \max_{x\neq y} \abs{f(x)-f(y)}\). We claim that the Lipschitz constant contracts under the Markov operator. Let \(\omega\) denote the optimal coupling of \(P^t(x,\cdot)\) and \(P^t(y,\cdot)\). Then for any \(x,y\): \[ \begin{align*} \abs{P^tf(x)-P^tf(y)} &= \abs{\E{f(X_t)\mid X_0=x} - \E{f(X_t)\mid X_0=y}}\\ &= \abs{\E[(X,Y)\sim \omega]{f(X)-f(Y)}}\\ \mr{Jensen's inequality}&\leq \E[(X,Y)\sim \omega]{\abs{f(X)-f(Y)}}\\ &\leq \!{Lip}(f)\cdot \Pr[(X,Y)\sim \omega]{X\neq Y}\\ &\leq \!{Lip}(f)\cdot d_t. \end{align*} \] Let \(f_2\) be an eigenvector of \(P\) corresponding to the eigenvalue \(\lambda_2\). We have \(P^t f_2 = \lambda_2^t f_2\). Applying the inequality above to \(f_2\): \[ \abs{\lambda_2}^t \cdot \!{Lip}(f_2) = \!{Lip}(\lambda_2^t f_2) = \!{Lip}(P^t f_2) \leq \!{Lip}(f_2) \cdot d_t. \] Since \(f_2\) is not constant, i.e., \(\!{Lip}(f_2) > 0\), this implies \(\abs{\lambda_2}^t\leq d_t\).
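As a final sanity check (my own sketch, not part of the note), one can compute both \(\abs{\lambda_2}^t\) and \(d_t\) for a small reversible chain, for instance the lazy random walk on a cycle, and observe the inequality at every \(t\):

```python
import numpy as np

# Lazy random walk on a cycle of length N (reversible w.r.t. the uniform distribution).
N = 7
P = np.zeros((N, N))
for x in range(N):
    P[x, x] = 0.5
    P[x, (x + 1) % N] += 0.25
    P[x, (x - 1) % N] += 0.25

lam = np.sort(np.linalg.eigvalsh(P))[::-1]      # P is symmetric here, so eigvalsh applies
lam2 = abs(lam[1])                              # second largest eigenvalue

Pt = np.eye(N)
for t in range(1, 16):
    Pt = Pt @ P
    d_t = max(0.5 * np.abs(Pt[x] - Pt[y]).sum()
              for x in range(N) for y in range(N))
    print(t, lam2 ** t, d_t)                    # lam2**t <= d_t at every t
```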