$$ \def\*#1{\mathbf{#1}} \def\+#1{\mathcal{#1}} \def\-#1{\mathrm{#1}} \def\!#1{\mathsf{#1}} \def\@#1{\mathscr{#1}} \newcommand{\mr}[1]{\mbox{\scriptsize \color{RedViolet}$\triangleright\;$#1}\quad\quad} $$

Lecture 10: Pathwise Method, Johnson-Lindenstrauss Lemma

Author

instructed by Chihao Zhang, scribed by Rui Huang and Yuchen He

In our previous lectures, we established the concentration bounds and the Poincaré inequality for the Gaussian distribution using functional analytic tools.

In this lecture, we will revisit these fundamental properties using a distinct and powerful approach: Eldan’s pathwise method. This method uses stochastic calculus to track the evolution of the function value along a path, providing a dynamic perspective on how fluctuations accumulate. The exposition in this note will follow the presentation in (Eldan 2022).

Eldan, Ronen. 2022. “Analysis of High-Dimensional Distributions Using Pathwise Methods.” In Proc. Int. Cong. Math, 6:4246–70.

Pathwise method

Consider a standard \(d\)-dimensional Gaussian vector \(X=(X_1,\dots,X_d)\sim \mathcal{N}(0,I_d)\). Our goal is to study the behavior of the centered random variable \(f(X) - \E{f(X)}\) for a \(1\)-Lipschitz function \(f\). Here we say a differentiable function \(f\) is \(c\)-Lipschitz if \(\|\nabla f(x)\|\leq c\) for all \(x \in \mathbb{R}^d\).

Recall our proof of McDiarmid’s inequality. There, we constructed a discrete Doob martingale \(Z_k = \E{f(X) \mid \mathcal{F}_k}\) by revealing the information coordinate-by-coordinate (i.e., \(\mathcal{F}_k = \sigma(X_1, \dots, X_k)\)) and bounding the discrete differences \(|Z_k - Z_{k-1}|\). In this lecture, we retain the Doob martingale framework, but we change the filtration. Instead of revealing coordinates one by one, we will reveal the information pathwise over continuous time.

Consider a standard \(d\)-dimensional Brownian motion \(\{B_t\}_{t\geq 0}\). We know that \(B_1 \sim \mathcal{N}(0, I_d)\). Therefore, studying the random variable \(f(X)\) is equivalent to studying \(f(B_1)\). We define the continuous Doob martingale \[ M_t \defeq \E{f(B_1)\mid \+F_t},\ \mbox{ where } \+F_t=\sigma\tp{\set{B_s}_{s\leq t}}. \] Observe that \(M_1 = f(B_1)\) and \(M_0 = \E{f(B_1)}\). Thus, our goal transforms into bounding the difference \(M_1 - M_0\).
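To make the construction concrete, here is a small numerical sketch (illustrative only; the test function and all parameters are our own choices): it simulates one Brownian path on a time grid and estimates \(M_t\) by an inner Monte Carlo average over the remaining increment \(B_1 - B_t \sim \sqrt{1-t}\cdot\+N(0,I_d)\).

```python
# Illustrative sketch: estimate the Doob martingale M_t = E[f(B_1) | F_t]
# along one Brownian path by inner Monte Carlo over the remaining increment.
import numpy as np

rng = np.random.default_rng(0)
d, n_steps, n_inner = 5, 100, 20_000
f = lambda x: np.abs(x).max(axis=-1)       # an example 1-Lipschitz function (sup-norm)

dt = 1.0 / n_steps
increments = rng.normal(scale=np.sqrt(dt), size=(n_steps, d))
B = np.concatenate([np.zeros((1, d)), np.cumsum(increments, axis=0)])   # B on the grid
ts = np.linspace(0.0, 1.0, n_steps + 1)

M = np.empty(n_steps + 1)
for i, t in enumerate(ts):
    xi = rng.normal(size=(n_inner, d))                     # B_1 - B_t = sqrt(1-t) * xi
    M[i] = f(B[i] + np.sqrt(1.0 - t) * xi).mean()          # Monte Carlo estimate of M_t

print("M_0  (≈ E[f(B_1)]):", M[0])
print("M_1  (= f(B_1))   :", M[-1], "vs", f(B[-1]))
```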

Since \(\{M_t\}_{0 \leq t \leq 1}\) is a continuous martingale, a key quantity for our analysis is its quadratic variation. Recall that for a stochastic process \(\{X_t\}_{t \geq 0}\), the quadratic variation \([X]_t\) is defined as: \[ [X]_t = \lim_{0= t_0\leq \cdots\leq t_n=t,\atop \max \tp{t_i-t_{i-1}}\to 0} \sum_{i=1}^n \tp{X_{t_i} - X_{t_{i-1}}}^2. \] In stochastic calculus notation, we often write this as \([X]_t = \int_0^t \tp{\d X_s}^2\).
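As a quick sanity check of this definition, the following sketch (illustrative only) discretizes a one-dimensional Brownian path and shows that the sum of squared increments concentrates around \(t\) as the mesh shrinks, i.e. \([W]_t = t\).

```python
# Illustrative sketch: the quadratic variation of a 1-dimensional Brownian
# path over [0, t] approaches t as the partition mesh shrinks.
import numpy as np

rng = np.random.default_rng(1)
t = 1.0
for n in (10, 100, 1_000, 10_000, 100_000):
    increments = rng.normal(scale=np.sqrt(t / n), size=n)   # W_{t_i} - W_{t_{i-1}}
    print(n, "partition points, sum of squared increments ≈", np.sum(increments**2))
# The printed values concentrate around t = 1 as n grows.
```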

We now invoke a result that connects continuous martingales to Brownian motion. To avoid confusion regarding dimensions, let \(\{B_t\}\) denote our \(d\)-dimensional process and let \(\{W_t\}\) denote a standard \(1\)-dimensional Brownian motion.

Theorem 1 (Dambis / Dubins-Schwarz theorem) For any continuous martingale \(\set{M_t}_{t\geq 0}\) with \(M_0=0\), there exists a standard 1-dimensional Brownian motion \(\set{W_t}_{t\geq 0}\) such that \(M_t = W_{[M]_t}\) for any \(t\geq 0\).

This theorem states that any continuous martingale can be regarded as a time-scaled Brownian motion. The “clock” for this Brownian motion is determined by the quadratic variation \([M]_t\). As a sanity check, if \(M_t\) is a standard Brownian motion, then \([M]_t = t\), and the theorem yields \(M_t = W_t\) as expected.

We will not prove this theorem formally in this lecture, but we can provide an intuition based on the martingale representation theorem. If \(\{M_t\}_{t \geq 0}\) is a square-integrable martingale adapted to a Brownian filtration, we can express it as a stochastic integral: \[ M_t-M_0 = \int_0^t H_s \d W_s, \] for some adapted process \(\set{H_s}_{s\geq 0}\). Consequently, the quadratic variation is \([M]_t = \int_0^t H_s^2 \d s\). Intuitively, the term \(H_s\) acts as the “velocity” of the variance accumulation. If we speed up or slow down time according to \(H_s^2\), the process looks exactly like a standard Brownian motion.

Evolution of \(M_t\)

We now focus on the evolution of the martingale \(\set{M_t}_{t\geq 0}\). A natural first approach might be to use the explicit Gaussian density: given \(\mathcal{F}_t\), we know \(B_1 \sim \mathcal{N}(B_t, (1-t)I_d)\). We could write the integral formula for \(M_t\) and differentiate it directly.

However, we will use a more elegant and powerful approach via semigroup calculus. Consider the semigroup \(\{P_t\}_{t\geq 0}\) associated with the standard Brownian motion \(\d X_t = \d B_t\). By the Markov property, we can express the martingale as: \[ M_t = \E{f(B_1)\mid \+F_t} = \E{f(X_{1})\mid X_t=B_t} = P_{1-t}f(B_t). \] To analyse the quadratic variation \([M]_t\), we first need to compute the stochastic differential \(\d M_t\). Applying Itô’s Lemma, \[ \begin{align*} \d M_t &= \d \tp{P_{1-t}f(B_t)}\\ &= \partial_t \tp{P_{1-t}f}(B_t) \d t + \inner{\nabla P_{1-t}f (B_t)}{\d B_t} + \frac{1}{2} \Delta P_{1-t}f (B_t) \d t. \end{align*} \] Using the fact that the generator of Brownian motion is half the Laplacian, \(\mathcal{L} = \frac{1}{2} \Delta\), the semigroup satisfies the equation \(\partial_s P_s = \frac{1}{2}\Delta P_s\). Letting \(s = 1-t\), \[ \begin{align*} \partial_t \tp{P_{1-t}f}(B_t) \d t &= -\partial_s \tp{P_{s}f}(B_t) \d t\\ &= -\+L P_{1-t} f (B_t) \d t\\ &= -\frac{1}{2}\Delta P_{1-t} f (B_t) \d t. \end{align*} \] Therefore, the drift terms (the \(\d t\) terms) cancel out and we have \[ \d M_t = \inner{\nabla P_{1-t}f (B_t)}{\d B_t}. \] From this, the squared differential is \[ \tp{\d M_t}^2 = \norm{\nabla P_{1-t}f (B_t)}^2 \d t. \] Consequently, the quadratic variation is given by the integral \[ [M]_t = \int_0^t \tp{\d M_s}^2 = \int_0^t \norm{\nabla P_{1-s}f (B_s)}^2 \d s. \]

Let \(v_s = \nabla P_{1-s}f (B_s)\). We then bound this gradient using the Lipschitz property of \(f\). Recall that in the process \(\d X_t = \d B_t\), the state at time \(1\) given time \(s\) is distributed as \(X_s + \sqrt{1-s}\cdot \xi\), where \(\xi\sim \+N(0,I_d)\). Therefore, we have \[ \begin{align*} v_s &= \nabla \E{f(X_1)\mid X_s=B_s} \\ &= \nabla \E{f(B_s + \sqrt{1-s}\cdot \xi)} \\ &= \E{\nabla f(B_s + \sqrt{1-s}\cdot \xi)}. \end{align*} \] Now applying Jensen’s inequality and the assumption that \(f\) is \(1\)-Lipschitz, \[ \norm{v_s}^2 \leq \E{\norm{\nabla f(B_s + \sqrt{1-s}\cdot \xi)}^2} \leq 1. \] Substituting this bound into our integral for the quadratic variation, we get \[ [M]_t = \int_0^t \norm{\nabla P_{1-s}f (B_s)}^2 \d s\leq t. \]
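The bound \(\norm{v_s}\leq 1\) can also be checked numerically. In the sketch below (illustrative only; the function and parameters are our own choices), we take the 1-Lipschitz function \(f(x)=\norm{x}\), estimate \(v_s = \E{\nabla f(B_s + \sqrt{1-s}\cdot \xi)}\) along a single Brownian path, and verify that \(\norm{v_s}^2\leq 1\) and hence \([M]_1\leq 1\).

```python
# Illustrative sketch: for the 1-Lipschitz function f(x) = ||x||, estimate
# v_s = E[grad f(B_s + sqrt(1-s) xi)] along one Brownian path and check that
# ||v_s||^2 <= 1, hence [M]_1 = ∫_0^1 ||v_s||^2 ds <= 1.
import numpy as np

rng = np.random.default_rng(2)
d, n_steps, n_inner = 5, 100, 20_000

grad_f = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)   # gradient of ||x||

dt = 1.0 / n_steps
increments = rng.normal(scale=np.sqrt(dt), size=(n_steps, d))
B = np.concatenate([np.zeros((1, d)), np.cumsum(increments, axis=0)])
ts = np.linspace(0.0, 1.0, n_steps + 1)

v_norm_sq = np.empty(n_steps + 1)
for i, s in enumerate(ts):
    xi = rng.normal(size=(n_inner, d))
    v_s = grad_f(B[i] + np.sqrt(1.0 - s) * xi).mean(axis=0)        # ≈ v_s
    v_norm_sq[i] = np.dot(v_s, v_s)

print("max ||v_s||^2:", v_norm_sq.max())          # should be <= 1
print("[M]_1 ≈", np.trapz(v_norm_sq, ts))         # should be <= 1
```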

Proof of Gaussian concentration bounds via the pathwise method

With the bound on the quadratic variation established, we can now derive the concentration bound for \(M_1 - M_0\).

Using the Dambis / Dubins-Schwarz theorem, we can treat \(M_1 - M_0\) as a time-changed Brownian motion \(W_{[M]_1}\). Since \([M]_1 \leq 1\), the event \(\{M_1 - M_0 \geq \alpha\}\) implies that the Brownian motion \(\set{W_t}_{t\geq 0}\) must have hit the level \(\alpha\) at some time \(t \leq 1\). Therefore, for any \(\alpha > 0\): \[ \begin{align*} \Pr{M_1-M_0\geq \alpha} &= \Pr{W_{[M]_1}-W_{[M]_0}\geq \alpha}\\ \mr{$[M]_1\leq 1$}&\leq \Pr{\exists t\in[0,1], W_t\geq \alpha}\\ \mr{reflection principle}&= 2\Pr{W_1\geq \alpha}\\ &= 2\tp{1-\Phi(\alpha)}, \end{align*} \] where \(\Phi\) denotes the CDF of the standard Gaussian distribution.

Now consider the general case where \(f\) is \(c\)-Lipschitz. The bound on the quadratic variation scales accordingly: \([M]_t \leq c^2 t\). The probability bound becomes \[ \begin{align*} \Pr{M_1-M_0\geq \alpha} \mr{$[M]_1\leq c^2$}&\leq \Pr{\exists t\in[0,c^2], W_t\geq \alpha} \\ \mr{reflection principle}&= 2\Pr{W_{c^2}\geq \alpha} \\ &= 2\Pr{W_1\geq \frac{\alpha}{c}} \\ \mr{Gaussian tail bound}&\leq 2e^{-\frac{\alpha^2}{2c^2}}. \end{align*} \]
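The sketch below (illustrative only; the test function and parameters are our own choices) compares this bound with the empirical upper tail of \(f(X)-\E{f(X)}\) for the 1-Lipschitz function \(f(x)=\max_i x_i\) and \(X\sim\+N(0,I_d)\).

```python
# Illustrative sketch: compare the empirical tail of f(X) - E[f(X)] for a
# 1-Lipschitz f and X ~ N(0, I_d) against the bound 2*exp(-alpha^2 / 2).
import numpy as np

rng = np.random.default_rng(3)
d, n_samples = 20, 200_000
f = lambda x: np.max(x, axis=-1)            # max coordinate, a 1-Lipschitz function

X = rng.normal(size=(n_samples, d))
centered = f(X) - f(X).mean()

for a in (0.5, 1.0, 1.5, 2.0):
    empirical = np.mean(centered >= a)
    print(f"alpha={a}: empirical tail {empirical:.2e}  <=  bound {2*np.exp(-a*a/2):.2e}")
```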

This result has a similar form to McDiarmid’s inequality. Recall that McDiarmid’s inequality states that for independent random variables \(X_1, \dots, X_d\): \[ \Pr{\abs{f -\E{f}}\geq \alpha}\leq 2\exp\set{-\frac{2\alpha^2}{\sum_{i=1}^d c_i^2}}, \] where \(c_i = \sup_{x\in \bb R^d}\sup_{y\in \bb R}\abs{f(x)- f(x_{-i},y)}\) represents the coordinate-wise sensitivity. The sum of squared coordinate sensitivities \(\sum c_i^2\) plays a role analogous to that of the squared Lipschitz constant \(c^2\) here.

Proof of the Gaussian Poincaré Inequality via the pathwise method

We can now utilize the pathwise framework to prove the Poincaré inequality for the high-dimensional Gaussian. Recall that \(M_1 = f(B_1)\) and \(M_0 = \E{f(B_1)}\). Therefore, the variance of \(f\) is simply the variance of the martingale increment \(M_1 - M_0\). Using the Itô isometry, we can express this variance as the expected quadratic variation: \[ \begin{align*} \Var{f} &= \Var{M_1} = \Var{\int_0^1 \d M_t}\\ &=\int_0^1 \Var{\d M_t} = \int_0^1 \E{\tp{\d M_t}^2} \\ &=\E{[M]_1}. \end{align*} \] Recall from our previous derivation that the quadratic variation is the integral of the squared gradient of the semigroup: \[ \E{[M]_1} = \E{\int_0^1 \norm{v_s}^2 \d s}. \] Here, as computed in the previous section, \[ v_s = \nabla P_{1-s}f(B_s) = P_{1-s}\nabla f(B_s). \]

By Jensen’s inequality and the tower rule, \[ \begin{align*} \E{\norm{v_{s+t}}^2 \mid \+F_s} \geq \norm{\E{v_{s+t}\mid \+F_s}}^2 = \norm{\E{\nabla f(B_1)\mid \+F_s}}^2 = \norm{v_s}^2. \end{align*} \] This inequality implies that the process \(\set{\norm{v_s}^2}_{s\geq 0}\) is a sub-martingale. Therefore, since \(v_1 = \nabla f(B_1)\), \[ \E{[M]_1} \leq \E{\int_0^1 \norm{v_1}^2 \d s} = \E{\norm{\nabla f(B_1)}^2}. \] This establishes the Poincaré inequality for the standard Gaussian in \(\bb R^d\) with constant \(c=1\): \[ \Var{f(B_1)}\leq \E{\norm{\nabla f(B_1)}^2}. \]
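The inequality can again be checked by simulation. The sketch below (illustrative only; the test function \(f(x)=\sin(\inner{a}{x})\) and the parameters are our own choices) compares \(\Var{f(G)}\) with \(\E{\norm{\nabla f(G)}^2}\) for \(G\sim\+N(0,I_d)\).

```python
# Illustrative sketch: check Var[f(G)] <= E[||grad f(G)||^2] for G ~ N(0, I_d)
# on a smooth test function chosen only for illustration.
import numpy as np

rng = np.random.default_rng(4)
d, n_samples = 10, 500_000
a = rng.normal(size=d)                          # a fixed direction for the test function

f      = lambda x: np.sin(x @ a)                # f(x) = sin(<a, x>)
grad_f = lambda x: np.cos(x @ a)[:, None] * a   # grad f(x) = cos(<a, x>) * a

G = rng.normal(size=(n_samples, d))
lhs = np.var(f(G))
rhs = np.mean(np.sum(grad_f(G)**2, axis=1))
print("Var[f(G)] =", lhs, "  E[||grad f||^2] =", rhs, "  Poincaré holds:", lhs <= rhs)
```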

Applications

We now look at some applications of the above results.

Johnson-Lindenstrauss lemma

Assume we have \(n\) vectors \(x_1,\dots,x_n\in \bb R^m\) for some large dimension \(m\). We wish to reduce the dimension of the problem while preserving the geometry of the data. Specifically, we want to construct a linear mapping \(T:\mathbb{R}^m \to \mathbb{R}^k\) with \(k \ll m\) such that for all \(i,j\in [n]\), the pairwise distances are preserved up to a multiplicative factor of \((1 \pm \epsilon)\): \[ (1-\eps)\norm{x_i-x_j} \leq \norm{Tx_i - Tx_j} \leq (1+\eps)\norm{x_i-x_j}. \]

Without loss of generality, we can assume \(m \le n\). If \(m > n\), we can consider the subspace \(V = \text{span}(x_1, \dots, x_n)\). The dimension of \(V\) is at most \(n\). By projecting the vectors onto \(V\) and choosing an orthonormal basis for \(V\), we can represent these vectors in \(\mathbb{R}^n\) without changing their pairwise distances. Thus, we assume \(x_i \in \mathbb{R}^n\) for the remainder of the proof.

Our goal is to construct a matrix \(T\in \mathbb{R}^{k\times n}\) satisfying the above requirements. We define \(T\) as a random matrix where each entry \(T_{i,j}\) is drawn independently from \(\+N(0,1/k)\). Let’s analyze the norm of a single projected vector. Fix a vector \(z \in \mathbb{R}^n\). Let \(Y=Tz\). Then the \(i\)-th component of \(Y\) is \[ Y_i=\sum_{j=1}^n T_{i,j} z_j \sim \+N\tp{0, \frac{\norm{z}^2}{k}}. \] Consequently, the vector \(Y\) has the same distribution as a scaled Gaussian \(\frac{\norm{z}}{\sqrt{k}}\cdot G\), where \(G\sim \+N(0, I_k)\) is a standard Gaussian in \(\bb R^k\).
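The following sketch (illustrative only; all parameters are arbitrary) samples such a matrix \(T\) repeatedly and confirms that \(\norm{Tz}/\norm{z}\) concentrates around \(1\) with fluctuations of order \(1/\sqrt{k}\).

```python
# Illustrative sketch: a random matrix T with i.i.d. N(0, 1/k) entries
# approximately preserves the norm of a fixed vector z.
import numpy as np

rng = np.random.default_rng(5)
n, k, n_trials = 500, 100, 1_000
z = rng.normal(size=n)                     # an arbitrary fixed vector in R^n

ratios = np.empty(n_trials)
for i in range(n_trials):
    T = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n))   # i.i.d. N(0, 1/k) entries
    ratios[i] = np.linalg.norm(T @ z) / np.linalg.norm(z)

print("mean of ||Tz||/||z|| :", ratios.mean())   # close to 1
print("std  of ||Tz||/||z|| :", ratios.std())    # of order 1/sqrt(2k)
```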

Consider a function \(F(x)\defeq \frac{\norm{z}}{\sqrt{k}}\norm{x}\). Note that this function is \(\frac{\norm{z}}{\sqrt{k}}\)-Lipschitz. By the Gaussian concentration inequality for an \(L\)-Lipschitz function \(F\), for any \(t>0\), \[ \Pr{\abs{F(G) - \E{F(G)}}\geq t}\leq 2\exp\set{-\frac{t^2}{2L^2}}. \] Setting \(t=\eps\norm{z}\) and \(L=\frac{\norm{z}}{\sqrt{k}}\), we have \[ \Pr{\abs{\norm{T z} - \E{\norm{Tz}}} \geq \eps \norm{z}} \leq 2\exp\set{-\frac{k\eps^2}{2}}. \]

This concentration bound controls the deviation from the mean \(\E{\|Tz\|}\). To show that \(\|Tz\| \approx \|z\|\), we must show that the mean is close to the true norm \(\|z\|\). Specifically, we claim \[ \sqrt{1-1/k} \norm{z} \leq \E{\norm{Tz}} \leq \norm{z}. \]

For the upper bound, by Jensen’s inequality, \[ \begin{align*} \E{\norm{Tz}}^2 &\leq \E{\norm{Tz}^2}\\ &=\E{\frac{\norm{z}^2}{k}\cdot \norm{G}^2}\\ &=\frac{\norm{z}^2}{k} \cdot \sum_{i=1}^k \E{G_i^2} \\ &= \norm{z}^2. \end{align*} \]

For the lower bound, we have \[ \begin{align*} \E{\norm{Tz}}^2 = \E{\norm{Tz}^2} - \Var{\norm{Tz}}= \norm{z}^2 - \Var{\norm{Tz}}. \end{align*} \] To bound the variance, we use the Gaussian Poincaré inequality: \[ \Var{\norm{Tz}} = \Var{F(G)} \leq \E{\norm{\nabla F(G)}^2}. \] Note that for \(x\neq 0\), \[ \nabla F(x) = \nabla\tp{ \frac{\norm{z}}{\sqrt{k}}\norm{x}} = \frac{\norm{z}}{\sqrt{k}}\frac{x}{\norm{x}}. \] Therefore, \(\norm{\nabla F(G)}^2 = \frac{\norm{z}^2}{k}\) almost surely. Substituting this back, we have \(\E{\norm{Tz}}^2 \geq \tp{1-1/k}\norm{z}^2\).

Combining all of the above, for any fixed vector \(z\in \bb R^n\), \[ \Pr{\norm{Tz}\geq (1+\eps)\norm{z} \mbox{ or }\norm{Tz}\leq \tp{\sqrt{1-1/k}-\eps}\norm{z}} \leq 2\exp\set{-\frac{k\eps^2}{2}}. \] By choosing \(k=\+O\tp{\frac{\log n}{\eps^2}}\), this probability can be bounded by \(\frac{1}{5n^2}\). Applying the union bound over all \(\binom{n}{2}\) difference vectors \(x_i - x_j\), the total failure probability is bounded by \(\binom{n}{2} \cdot \frac{1}{5n^2} < \frac{1}{10}\). Thus, with constant probability, for all \(i,j\in[n]\), \[ (1-\+O(\eps))\norm{x_i-x_j} \leq \norm{Tx_i - Tx_j} \leq (1+\eps)\norm{x_i-x_j}. \]
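Finally, here is an end-to-end sketch of the Johnson-Lindenstrauss experiment (illustrative only; the data points are random and the constant in the choice of \(k\) is arbitrary): project \(n\) points from \(\bb R^m\) to \(\bb R^k\) with a Gaussian matrix and check that all pairwise distances are preserved up to a \((1\pm\eps)\) factor.

```python
# Illustrative sketch: the full Johnson-Lindenstrauss experiment, checking
# that all pairwise distances among n random points are preserved up to a
# (1 ± eps) factor after projection.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
n_points, m, eps = 50, 2_000, 0.25
k = int(np.ceil(8 * np.log(n_points) / eps**2))   # k = O(log n / eps^2); the constant 8 is arbitrary

X = rng.normal(size=(n_points, m))                # data points in R^m (random, for illustration)
T = rng.normal(scale=1.0 / np.sqrt(k), size=(k, m))
Y = X @ T.T                                       # projected points in R^k

lo, hi = 1.0, 1.0
for i, j in combinations(range(n_points), 2):
    r = np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
    lo, hi = min(lo, r), max(hi, r)

print(f"k = {k}; all distance ratios lie in [{lo:.3f}, {hi:.3f}]"
      f" (target: [{1 - eps:.2f}, {1 + eps:.2f}])")
```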