Lecture 7: Markov Semigroup, Langevin Dynamics and DDPMs
Markov semigroup
We generally have two methods to describe a stochastic process: one is to describe the local dynamics of the random variable at each step (e.g., using stochastic differential equations), and the other is to define the process globally using a transition matrix or an analogous concept. In this section, we will develop the continuous-time analog of a transition matrix. This will allow us to analyze Itô processes, which we studied in the last class: \[ \d X(t)=\mu(t,X(t))\d t+\sigma(t,X(t))\d B(t). \]
Markov transition kernel, semigroup and operator
To build intuition, let’s first recall the discrete-time, finite-state case. For a finite Markov chain, the one-step transition matrix \(P=(P_{ij})_{i,j\in[n]}\) is defined as the probability of moving from state \(i\) to state \(j\) in one step: \[ P(i,j)=\Pr{X_{t+1}=j|X_{t}=i}. \] The \(t\)-step transition matrix, \(P^t\), is then: \[ P^{t}(i,j)=\Pr{X_{s+t}=j|X_{s}=i} \] where \(P^{t}(i,j)\) denotes the probability that the process moves from state \(i\) to state \(j\) in exactly \(t\) steps. By the law of total probability (or the definition of matrix multiplication), we have: \[ P^{s+t}(i,j)=(P^{s}\cdot P^{t})(i,j)=\sum_{k\in[n]}P^{s}(i,k)\cdot P^{t}(k,j). \] We also know that if \(\pi_t\) is the probability distribution of \(X_t\) (as a row vector), the distribution at the next step is given by \(\pi_{t+1}^{\top} = \pi_{t}^{\top}P\).
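The composition rule and the distribution update above are easy to check numerically. Below is a minimal sketch in Python (assuming NumPy); the 3-state transition matrix is a hypothetical example chosen for illustration:

```python
import numpy as np

# A hypothetical 3-state transition matrix (rows sum to 1).
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Chapman-Kolmogorov: P^(s+t) = P^s P^t.
s, t = 2, 3
lhs = np.linalg.matrix_power(P, s + t)
rhs = np.linalg.matrix_power(P, s) @ np.linalg.matrix_power(P, t)
assert np.allclose(lhs, rhs)

# Distribution update: pi_{t+1}^T = pi_t^T P.
pi0 = np.array([1.0, 0.0, 0.0])   # start in state 0
pi1 = pi0 @ P                     # row-vector convention
assert np.isclose(pi1.sum(), 1.0)
print(pi1)                        # [0.5 0.3 0.2], the first row of P
```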
Now, let’s extend this to the continuous-time setting. Consider a time-homogeneous Markov process \(\set{X_{t}}_{t\ge0}\). We define its Markov semigroup \(\set{P_t}_{t\geq 0}\) with each \(P_t\), called the Markov transition kernel, being a function that gives the probability of the process moving from a point \(x\) into a set \(A\) in time \(t\): \[ P_{t}(x,A)=\Pr{X_{t}\in A|X_{0}=x}. \] In other words, for each fixed \(x\), \(P_{t}(x,\cdot)\) is a probability measure. It represents the distribution of the process at time \(t\), given it started at \(x\). Just as in the discrete case, these kernels obey a composition rule.
Proposition 1 (Chapman-Kolmogorov equation) For any \(s, t\ge0\), \(P_{s+t}(x,A)=\int P_{s}(x,\d y)P_{t}(y,A)\).
The perspective of operators
An alternative way to understand these transitions is to view them as operators that act on functions. Let’s return to the discrete-time case. Let \(f:[n]\rightarrow\mathbb{R}\) be an arbitrary function, which we can represent as a vector \(f=(f(1),...,f(n))^{\top}\in\mathbb{R}^{n}\). If we multiply the matrix \(P\) by the vector \(f\), we get a new vector \(Pf\): \[ [Pf](i)=\sum_{j\in[n]}P(i,j)f(j)=\E{f(X_{t+1})|X_{t}=i}. \] The result, \(Pf\), is another function defined on \([n]\). Its value at \(i\), \([Pf](i)\), is the expected value of the function \(f\) at the next step, given that we are currently at state \(i\).
Similarly, applying the \(t\)-step transition matrix \(P^t\) gives the expected value after \(t\) steps: \[ [P^t f](i)=\E{f(X_{s+t})|X_{s}=i}. \]
We can now apply this same operator perspective to the Markov semigroup \(\set{P_{t}}_{t\ge0}\). For a suitable test function \(f:\mathbb{R}\rightarrow\mathbb{R}\), we define the action of the operator \(P_t\) on \(f\) as: \[ P_{t}f(x):=\int_{\mathbb{R}}f(y)P_{t}(x,\d y)=\E{f(X_{t})|X_{0}=x}. \] From this operator viewpoint, the Chapman-Kolmogorov equation takes on a much simpler and more elegant form: \[ P_{s+t}=P_{s}P_{t}. \]
For a more rigorous treatment of these technical details, please refer to the standard monograph Analysis and Geometry of Markov Diffusion Operators. 
The Markov semigroups and test functions we consider usually need to satisfy some technical conditions. For example, we will require that \(||P_{t}f-f||_{\infty}\rightarrow 0\) as \(t\rightarrow0\). In other words, \(P_t\) converges to the identity operator \(I\). This is analogous to the transition matrix \(P^t\) for a discrete-time Markov chain being the identity matrix \(I\) when \(t=0\).
Generator of the Markov semigroup
We now define the generator, which serves as the continuous-time analogue to the \(P-I\) matrix in the discrete-time case.
Definition 1 (Generator of a Markov semigroup.) Let \(\mathcal{P}=\set{P_{t}}_{t\ge0}\) be a Markov semigroup. An operator \(\mathcal{L}\) is the generator of \(\mathcal{P}\) if and only if: \[ \mathcal{L}f=\lim_{t\downarrow 0}\frac{P_{t}f-f}{t} \] holds for those \(f\in D(\mathcal{L}):=\set{f\ |\ \mbox{the above limit exists}}\).
Intuitively, you can see this as the derivative of the transition kernel \(P_t\) evaluated at \(t=0\), i.e., \(\mathcal{L}=\frac{\d P_{t}}{\d t}\big|_{t=0}\). This means the generator \(\mathcal{L}\) represents the instantaneous rate of change of the process at its very beginning.
Solving this simple differential equation yields the following relationship.
Proposition 2 For any \(t\ge0\), \(P_{t}=e^{t\mathcal{L}}\) holds. Here, for an operator (or matrix) \(A\), its exponential is defined as \(e^A \defeq \sum_{k=0}^{\infty} \frac{A^k}{k!}\).
Proof. By definition, we have: \[ \begin{align*} \mathcal{L}(P_{t}f)& =\lim_{s\downarrow 0}\frac{P_{s}(P_{t}f)-P_{t}f}{s} \\ \mr{Chapman-Kolmogorov}&= \lim_{s\downarrow 0}\frac{P_{t+s}f-P_{t}f}{s} \\ &= \tp{\frac{\d}{\d t}P_{t}}f. \end{align*} \] The solution to the differential equation \(\mathcal{L}P_{t}=\frac{\d}{\d t}P_{t}\) with initial condition \(P_0=I\) is \(P_{t}=e^{t\mathcal{L}}\).
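Proposition 2 can be verified numerically in the finite-state, continuous-time setting, where the generator is a rate matrix (nonnegative off-diagonal rates, rows summing to zero). The sketch below assumes NumPy and SciPy (`scipy.linalg.expm`); the rate matrix is a hypothetical example:

```python
import numpy as np
from scipy.linalg import expm

# A hypothetical generator (rate matrix) of a 3-state continuous-time chain:
# nonnegative off-diagonal rates, each row sums to zero.
L = np.array([[-1.0,  0.7,  0.3],
              [ 0.4, -0.9,  0.5],
              [ 0.2,  0.8, -1.0]])

P = lambda t: expm(t * L)          # P_t = e^{tL}

# Semigroup property: P_{s+t} = P_s P_t.
s, t = 0.3, 1.1
assert np.allclose(P(s + t), P(s) @ P(t))

# Each P_t is a stochastic matrix (rows sum to 1), and P_0 = I.
assert np.allclose(P(t).sum(axis=1), 1.0)
assert np.allclose(P(0.0), np.eye(3))

# The generator is the derivative at t = 0: (P_h - I)/h -> L.
h = 1e-6
assert np.allclose((P(h) - np.eye(3)) / h, L, atol=1e-4)
```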
A key property that enables many calculations is that the generator and the semigroup operators commute.
Proposition 3 For any \(t\ge0\), \(\mathcal{L}\) and \(P_{t}\) are commutative.
Proof. By definition, we have: \[\begin{align*} \mathcal{L}(P_{t}f)&=\lim_{s\downarrow 0}\frac{P_{s}P_{t}f-P_{t}f}{s} \\ \mr{Chapman-Kolmogorov} &= \lim_{s\downarrow 0}\frac{P_{t}P_{s}f-P_{t}f}{s} \\ &= \lim_{s\downarrow 0}P_{t}\frac{P_{s}f-f}{s}\\ &= P_{t}\lim_{s\downarrow 0}\frac{P_{s}f-f}{s}\\ &=P_{t}\mathcal{L}f. \end{align*}\] The third and fourth equalities hold because \(P_t\) is a continuous linear operator.
We can now find a concrete formula for the generator of the time-homogeneous Itô process.
Lemma 1 Assume \(\set{X_{t}}_{t\ge0}\) satisfies \[ \d X_{t}=\mu(X_{t})\d t+\sigma(X_{t})\d B_{t}, \] Then its generator \(\+L\) satisfies \[ \mathcal{L}f=f^{\prime}\mu+\frac{1}{2}f^{\prime\prime}\sigma^{2}, \] for any suitable test function \(f\).
Proof. By the definition of the generator, we have: \[\begin{align*} \mathcal{L}f(x)&=\lim_{h\downarrow 0}\frac{\E{f(X_{h})|X_0=x}-f(x)}{h}\\ &= \lim_{h\downarrow 0}\frac{\E{f(X_{h})-f(X_{0})|X_{0}=x}}{h} \\ &=\E{\frac{\d f(X_{t})}{\d t}\Bigg|_{t=0}\mid X_{0}=x}. \end{align*}\] We apply Itô’s formula to \(\d f(X_t)\): \[\begin{align*} \mathcal{L}f(x)&=\frac{1}{\d t}\E{\tp{f^{\prime}(x)\mu(x)+\frac{1}{2}f^{\prime\prime}(x)(\sigma(x))^{2}}\d t+f^{\prime}(x)\sigma(x)\d B_{t} \mid X_0=x}\\ & = f^{\prime}(x)\mu(x)+\frac{1}{2}f^{\prime\prime}(x)(\sigma(x))^{2}+\frac{f^{\prime}(x)\sigma(x)}{\d t}\E{\d B_{t}}. \end{align*}\] Since \(\{B_{t}\}_{t\ge0}\) is a standard Brownian motion, \(\E{\d B_{t}}=0\). This shows \(\mathcal{L}f(x)=f^{\prime}(x)\mu(x)+\frac{1}{2}f^{\prime\prime}(x)(\sigma(x))^{2}\).
Example 1 (Generator of the standard Brownian motion) The standard Brownian motion satisfies \(\d X_{t}=\d B_{t}\). Therefore, \(\mu=0\) and \(\sigma=1\). Directly applying the above lemma, we get: \[ \mathcal{L}f(x)=\frac{1}{2}f^{\prime\prime}(x). \]
Note that the generator of standard Brownian motion is half the Laplacian. This hints at why the letter \(\+L\) is often used to denote the generator.
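A quick Monte Carlo sanity check of this formula: for standard Brownian motion and the test function \(f=\sin\), the difference quotient \((\E{f(X_h)|X_0=x}-f(x))/h\) should approach \(\frac12 f''(x)\). A minimal sketch assuming NumPy; the evaluation point \(x=1\) and step \(h\) are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo check that the generator of standard Brownian motion
# acts as L f = (1/2) f'' (here on the test function f = sin).
f = np.sin
x, h, n = 1.0, 0.01, 1_000_000

# One step of Brownian motion: X_h = x + sqrt(h) * xi, xi ~ N(0, 1).
X_h = x + np.sqrt(h) * rng.standard_normal(n)

# (E[f(X_h)] - f(x)) / h  should approximate (1/2) f''(x) = -sin(x)/2.
estimate = (f(X_h).mean() - f(x)) / h
exact = -0.5 * np.sin(x)
print(estimate, exact)   # both close to -0.42
assert abs(estimate - exact) < 0.05
```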
Kolmogorov forward / backward equations
We already derived the formula \[ \frac{\d}{\d t}P_{t}=\mathcal{L}P_{t}, \] in the previous subsection. This is called the Kolmogorov backward equation (KBE). We know that \(\mathcal{L}\) and \(P_t\) are commutative, and therefore we have the Kolmogorov forward equation (KFE) \[ \frac{\d}{\d t}P_{t}=P_{t}\mathcal{L}, \] which is also known as the Fokker-Planck equation. If we draw an analogy to the finite-state, discrete-time case where \(P\in\mathbb{R}^{n\times n}\) is a matrix, the KBE and KFE correspond to the following trivial equalities, respectively: \[ \mr{KBE}P^{t}-P^{t-1}=(P-I)P^{t-1} \] and \[ \mr{KFE}P^{t}-P^{t-1}=P^{t-1}(P-I). \]
The adjoint operator
We will now explain the intuition behind the names “forward” and “backward.”
Let’s first consider the KFE. The “forward” in its name refers to propagating an initial probability density forward in time. To see this, we introduce the adjoint operator, \(P_t^*\). On the function space \(L^2(\mathbb{R})\), with the inner product \(\langle f,g\rangle=\int_{\mathbb{R}}f(x)g(x) \d x\), the adjoint operator \(P_t^*\) is defined as the unique operator that satisfies the following relation for any functions \(f\) and \(g\): \[ \langle P_t f,g\rangle=\langle f,P_t^{*}g\rangle \] Now, let \(\pi_{t}(x)\) be the probability density of \(X_t\). The expectation of any test function \(f \in L^2(\mathbb{R})\) can be written as: \[ \E{f(X_{t})} = \int_{\mathbb{R}}f(x)\pi_{t}(x)\d x =\langle f,\pi_{t}\rangle. \] We can also compute this same expectation by starting from the initial distribution \(\pi_0\) and using the operator \(P_t\): \[\begin{align*} \E{f(X_{t})} &= \E{\E{f(X_{t})|X_{0}}} = \int_{\mathbb{R}} \E{f(X_{t})|X_{0}=x} \pi_{0}(x)\d x\\ &= \int_{\mathbb{R}} (P_{t}f)(x) \pi_{0}(x)\d x = \langle P_{t}f,\pi_{0}\rangle. \end{align*}\] By definition of the adjoint operator, \(\langle P_{t}f,\pi_{0}\rangle = \langle f, P_t^* \pi_0 \rangle\). Comparing the two expressions, we have \(\langle f,\pi_{t}\rangle = \langle f, P_t^* \pi_0 \rangle\). Since this holds for all test functions \(f\), we have: \[ \pi_{t} = P_t^* \pi_0. \] This shows that \(P_t^*\) is the operator that evolves probability densities forward in time. It plays the same role as the transpose of the transition matrix, \((P^t)^\top\) in the finite-state case, which satisfies \(\pi_t =(P^t)^\top \pi_0\).
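In the finite-state case this adjoint relation can be verified directly: under the inner product \(\langle f,g\rangle=\sum_i f(i)g(i)\), the adjoint of \(P\) is \(P^\top\), and \(P^\top\) pushes distributions forward. A small sketch assuming NumPy, with hypothetical numbers:

```python
import numpy as np

# Discrete-time sketch of the adjoint relation: the adjoint of P under the
# standard inner product is P^T, and P^T evolves distributions forward.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])
f = np.array([1.0, -2.0, 0.5])        # a test function on the state space
pi0 = np.array([0.2, 0.5, 0.3])       # an initial distribution

# <P f, pi0> = <f, P^T pi0>: both compute E[f(X_1)] when X_0 ~ pi0.
assert np.isclose((P @ f) @ pi0, f @ (P.T @ pi0))

# Evolving the density: pi_t = (P^T)^t pi0 remains a distribution.
pi3 = np.linalg.matrix_power(P.T, 3) @ pi0
assert np.isclose(pi3.sum(), 1.0)
```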
If we take the adjoint of the KFE (\(\frac{\d}{\d t}P_{t}=P_{t}\mathcal{L}\)), we get \(\frac{\d}{ \d t}P_{t}^{*}=\mathcal{L}^{*}P_{t}^{*}\). Applying this to \(\pi_0\): \[ \frac{\d}{\d t}(P_{t}^{*}\pi_{0}) = \mathcal{L}^{*}(P_{t}^{*}\pi_{0}) \iff \frac{\d}{\d t}\pi_{t}=\mathcal{L}^{*}\pi_{t}. \] This is the most common form of the Fokker-Planck equation. It describes the forward evolution of the probability density \(\pi_t\).
Now, let’s explain the backward equation, \(\frac{\d}{\d t}P_{t}=\mathcal{L}P_{t}\). Let’s apply this to a test function \(f(x) = \bb 1[x \in A]\) for some set \(A\). Then we have for any \(s\geq 0\), \[ \frac{\d}{\d t} \E{\bb 1[X_{s+t}\in A]\mid X_s=x} = \mathcal{L} \E{\bb 1[X_{s+t}\in A]\mid X_s=x}. \] Equivalently, \[ \frac{\d}{\d t} \Pr{X_{s+t}\in A|X_{s}=x} = \mathcal{L} \Pr{X_{s+t}\in A|X_{s}=x}. \] This equation describes the evolution of the probability of landing in a future set \(A\) after time \(t\). Notice that the generator \(\mathcal{L}\) acts on the initial state variable \(x\). It is called “backward” because it describes how the probability of a future event changes with respect to the starting (or backward) variables.
Using KBE/KFE to study the evolution of the measures
Let’s return to the KFE, \(\frac{\d}{\d t}\pi_{t}=\mathcal{L}^{*}\pi_{t}\). We now find the explicit formula for the adjoint generator \(\mathcal{L}^*\) corresponding to the Itô process: \[ \d X_{t}=\mu(X_{t})\d t+\sigma(X_{t})\d B_{t}. \] We know the generator is: \[ \mathcal{L}f(x)=\mu(x)\frac{\d}{\d x}f(x)+\frac{1}{2}\sigma^{2}(x)\frac{\d^{2}}{\d x^{2}}f(x). \] We use the definition to calculate \(\mathcal{L}^{*}\). Fix twice-differentiable, compactly supported functions \(f\) and \(g\). By the definition of the adjoint operator, \[ \inner{\+L f}{g}=\inner{f}{\+L^{*}g}. \] The LHS of the above is \[ \inner{\+L f}{g}=\int_{\bb R} \tp{\mu(x)f^{\prime}(x)+\frac{1}{2}\sigma^{2}(x)f^{\prime\prime}(x)}g(x)\d x. \]
Using the integration by parts formula, we get \[ \begin{align*} \int_{\mathbb{R}}f^{\prime}(x)\mu(x)g(x)\d x & =\mu(x)g(x)f(x)|_{-\infty}^{\infty}-\int_{\bb R}f(x)\pdv{}{x}(\mu(x)g(x))\d x\\ & =-\Big\langle f,\frac{\partial}{\partial x}(\mu\cdot g)\Big\rangle . \end{align*} \] The last equality holds because \(f(x)\) and \(g(x)\) are compactly supported. Similarly, using integration by parts twice, we get \[ \begin{align*} \int_{\mathbb{R}}f^{\prime\prime}(x)\sigma^{2}(x)g(x)\d x &=-\int_{\mathbb{R}}f^{\prime}(x)\pdv{x}(\sigma^{2}(x)g(x))\d x\\ &=\int_{\mathbb{R}}f(x)\pdv[2]{x}\;(\sigma^{2}(x)g(x))\d x\\ &=\Big\langle f,\frac{\partial^{2}}{\partial x^{2}}(\sigma^{2}g)\Big\rangle . \end{align*} \] Therefore \(\inner{f}{\+L^*g}=\inner{f}{-\frac{\partial}{\partial x}(\mu g)+\frac{1}{2}\frac{\partial^{2}}{\partial x^{2}}(\sigma^{2}g)}\). Hence, \[ \mathcal{L}^{*}g=-\frac{\partial}{\partial x}(\mu g)+\frac{1}{2}\frac{\partial^{2}}{\partial x^{2}}(\sigma^{2}g). \] Thus, the evolution equation for the probability distribution \(\pi_{t}\) of this Itô process is \[ \frac{\partial}{\partial t}\pi_{t}(x)=-\frac{\partial}{\partial x}(\mu(x)\pi_{t}(x))+\frac{1}{2}\frac{\partial^{2}}{\partial x^{2}}(\sigma^{2}(x)\pi_{t}(x)). \] As a powerful application, it can be used to find the stationary distribution of the process. Assume \(\pi\) is the stationary distribution of some Itô process. Substituting it into the above equation, we get \[ \frac{\partial}{\partial x}(\mu(x)\pi(x))=\frac{1}{2}\frac{\partial^{2}}{\partial x^{2}}(\sigma^{2}(x)\pi(x)). \tag{1}\]
Example 2 (Ornstein-Uhlenbeck process) The stochastic differential equation of the OU process is \[ \d X_{t}=-X_{t}\d t+\sqrt{2}\d B_{t}. \] Substituting \(\mu(x)=-x\) and \(\sigma^2(x)=2\) into Equation 1 and solving the equation, we get \[ \pi(x)\propto e^{-\frac{x^{2}}{2}}. \]
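We can confirm this stationary distribution by simulating the OU process with the Euler-Maruyama scheme and checking that the empirical mean and variance approach those of \(\mathcal{N}(0,1)\). A minimal sketch assuming NumPy; the step size, horizon, and starting point are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Euler-Maruyama simulation of the OU process dX = -X dt + sqrt(2) dB.
n_paths, dt, T = 50_000, 0.01, 8.0
x = np.full(n_paths, 5.0)            # start far from equilibrium
for _ in range(int(T / dt)):
    x += -x * dt + np.sqrt(2 * dt) * rng.standard_normal(n_paths)

# The stationary law is N(0, 1), matching pi(x) ∝ exp(-x^2/2).
print(x.mean(), x.var())             # near 0 and 1, respectively
assert abs(x.mean()) < 0.05 and abs(x.var() - 1.0) < 0.05
```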
Example 3 (Langevin dynamics) For the more general Langevin dynamics, its stochastic differential equation is \[ \d X_{t}=-f^{\prime}(X_{t})\d t+\sqrt{2}\d B_{t}. \] Substituting the \(\mu = -f'(x)\) and \(\sigma^2 = 2\) from the above equation into Equation 1 yields \[ \frac{\partial}{\partial x}(-f^{\prime}(x)\pi(x))=\frac{\partial^{2}}{\partial x^{2}}\pi(x). \] We can integrate both sides once and get \[ -f^{\prime}(x)\pi(x)=\pi^{\prime}(x), \] Its solution is \[ \pi(x)\propto e^{-f(x)}. \]
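The stationary condition can also be checked without any simulation: plug \(\pi\propto e^{-f}\) into Equation 1 and compare both sides with finite differences. A sketch for the hypothetical potential \(f(x)=x^4/4\), assuming NumPy; since Equation 1 is linear in \(\pi\), normalization is unnecessary:

```python
import numpy as np

# Check that pi(x) ∝ exp(-f(x)) solves the stationary Fokker-Planck
# condition  d/dx(-f'(x) pi(x)) = d^2/dx^2 pi(x)  for f(x) = x^4 / 4.
x = np.linspace(-3, 3, 6001)
dx = x[1] - x[0]
pi = np.exp(-x**4 / 4)                # unnormalized stationary density

lhs = np.gradient(-x**3 * pi, dx)     # d/dx(-f'(x) pi(x)), with f'(x) = x^3
rhs = np.gradient(np.gradient(pi, dx), dx)

# Compare away from the boundary, where np.gradient is only one-sided.
assert np.allclose(lhs[10:-10], rhs[10:-10], atol=1e-3)
```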
The convergence of Langevin diffusion
In this section, we will analyze the convergence rate of Langevin dynamics in \(\bb R^d\). Let \(f: \mathbb{R}^d \to \mathbb{R}\) be a \(\mu\)-strongly convex function. Recall that a differentiable function is \(\mu\)-strongly convex if for any \(x, y\in \bb R^d\): \[ f(y)\ge f(x)+\langle\grad f(x),y-x\rangle+\frac{\mu}{2}||x-y||_{2}^{2}. \] This analysis strategy is directly analogous to the one used for gradient flow (the continuous version of the gradient descent algorithm). In the analysis of gradient flow, one computes the evolution of \(||X_{t}-Y_{t}||^{2}\), where \(X_t\) and \(Y_t\) are two paths following the same ODE, \(\d X_{t}=-\grad f(X_{t})\d t\), to show that they contract. For Langevin dynamics, our approach is similar, but we must also handle the stochastic term. We achieve this using a powerful coupling technique.
Let’s consider two processes, \(\{X_t\}\) and \(\{Y_t\}\), governed by the Langevin SDE: \[ \begin{cases}\d X_{t}=-\grad f(X_{t})\d t+\sqrt{2}\d B_{t} \\ \d Y_{t}=-\grad f(Y_{t})\d t+\sqrt{2}\d B_{t}\end{cases} . \tag{2}\] The key trick is that we couple \(X_t\) and \(Y_t\) by driving them with the same Brownian motion, \(\d B_t\). Our goal is to show that the distribution of \(X_t\) (call it \(\pi_t\)) converges to the stationary distribution \(\pi \propto e^{-f}\). To do this, we let \(X_0\) start from an arbitrary distribution, while \(Y_0\) is drawn from the stationary distribution \(\pi\). By analyzing the expected squared distance \(\E{||X_t - Y_t||^2}\), we can bound the convergence in the \(W_2\) distance.
For two distributions \(\mu\) and \(\nu\), let \(\zeta\) be the set of all couplings of \(\mu\) and \(\nu\). The \(W_2\) distance between \(\mu\) and \(\nu\) is defined as \[ W_2(\mu,\nu) = \inf_{\omega\in \zeta} \E[(X,Y)\sim \omega]{\|X-Y\|^2}^{\frac{1}{2}}. \] Since \(Y_t\sim\pi\) for every \(t\) by stationarity, the pair \((X_t, Y_t)\) is a coupling of \(\pi_t\) and \(\pi\), and therefore \(\E{||X_t - Y_t||^2}\) is indeed an upper bound of \(W_2(\pi_t,\pi)^2\).
From the coupled SDEs in Equation 2, we can find the dynamics of the difference \(X_t - Y_t\): \[ \frac{\d}{\d t}(X_{t}-Y_{t})=\grad f(Y_{t})-\grad f(X_{t}). \] Because we used the same noise \(\d B_t\) for both processes, the stochastic terms have canceled out completely. Now, let’s look at the derivative of the expected squared distance: \[\begin{align*} \frac{\d \E{||X_{t}-Y_{t}||^{2}}}{\d t} &=2\E{\Big\langle\frac{\d(X_{t}-Y_{t})}{\d t},X_{t}-Y_{t}\Big\rangle}\\ & =2\E{\langle\grad f(Y_{t})-\grad f(X_{t}),X_{t}-Y_{t}\rangle}. \end{align*}\] From the \(\mu\)-strong convexity, we know that \(\langle \grad f(x) - \grad f(y), x - y \rangle \ge \mu ||x-y||_2^2\). Applying this to the above equation gives: \[ \frac{\d \E{||X_{t}-Y_{t}||^{2}}}{\d t} \le-2\mu\cdot \E{||X_{t}-Y_{t}||^{2}}. \] This is a Grönwall-type differential inequality. Its solution is: \[ \E{||X_{t}-Y_{t}||^{2}} \le e^{-2\mu t} \cdot \E{||X_{0}-Y_{0}||^{2}}. \] Since \(W_2(\pi_t, \pi)^2 \le \E{||X_t - Y_t||^2}\), this analysis proves that the distribution \(\pi_t\) converges to the stationary distribution \(\pi\) at an exponential rate.
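The coupling argument translates directly into a simulation: drive two discretized Langevin paths with the same Gaussian increments and watch the squared distance contract at rate \(e^{-2\mu t}\). A minimal sketch assuming NumPy, for the hypothetical potential \(f(x)=\|x\|^2/2\), which is \(1\)-strongly convex:

```python
import numpy as np

rng = np.random.default_rng(2)

# Coupled Langevin dynamics: both paths driven by the SAME Brownian increments.
# Here f(x) = ||x||^2 / 2, which is 1-strongly convex (so mu = 1).
grad_f = lambda x: x
d, dt, T = 10, 0.001, 2.0
x = rng.standard_normal(d) + 5.0     # arbitrary start
y = rng.standard_normal(d)           # stand-in for a stationary start
d0 = np.sum((x - y) ** 2)

for _ in range(int(T / dt)):
    noise = np.sqrt(2 * dt) * rng.standard_normal(d)  # shared noise
    x += -grad_f(x) * dt + noise
    y += -grad_f(y) * dt + noise

# Contraction: ||X_T - Y_T||^2 <= e^{-2 mu T} ||X_0 - Y_0||^2 (up to discretization).
print(np.sum((x - y) ** 2), np.exp(-2 * T) * d0)
assert np.sum((x - y) ** 2) <= 1.05 * np.exp(-2 * T) * d0
```

Because the noise cancels exactly in the difference, the contraction here is deterministic, just as in the continuous-time argument.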
Note that this analysis applies to the continuous-time process. When implementing this in reality, we must use a discretized process (e.g., the Euler-Maruyama scheme), much like how the continuous gradient flow is discretized to create the practical gradient descent algorithm.
Denoising diffusion probabilistic model (DDPM)
It is worth noting that the convergence analysis for Langevin dynamics relies heavily on the strong log-concavity of \(\pi\) (or equivalently, the strong convexity of \(f\)). This condition does not hold for most distributions encountered in practice. For example, \(\pi\) can be the uniform distribution over all images of white cats. A recently popularized method for sampling from general distributions is the denoising diffusion probabilistic model (DDPM).
The core idea of DDPM is a two-stage process:
Forward process: Systematically destroy a complex data sample \(X_0 \sim \pi\) by adding Gaussian noise over time, until its distribution converges to a standard Gaussian distribution.
Reverse process: Learn a process that can reverse this noising, starting from a simple sample \(Y_0 \sim \mathcal{N}(0, I_d)\) and running the denoising process backward in time to generate a sample \(Y_T\) that resembles the original data distribution \(\pi\).
Specifically, in the forward process, we start from \(X_0 \sim \pi\) and run an Ornstein-Uhlenbeck (OU) process \(\set{X_t}_{t \ge 0}\). Recall that the OU process follows the stochastic differential equation \(\d X_{t}=-X_{t}\d t+\sqrt{2}\d B_{t}\). We know that \(X_{t}\) is equal in distribution to \[ e^{-t}X_{0}+\sqrt{1-e^{-2t}} \cdot \xi, \] where \(\xi \sim \mathcal{N}(0, I_d)\) is a standard Gaussian random variable. Therefore, when \(T\) is sufficiently large, \(X_T\) approaches a Gaussian random variable. Let \(\pi_t\) be the distribution of \(X_t\). In the reverse process, we first sample \(Y_0\) from a Gaussian distribution, and then compute \(\set{Y_t}_{t \in [0, T]}\) according to the reverse SDE: \[ \d Y_{t}=( Y_{t}+2\grad \log \pi_{T-t}(Y_{t}))\d t+\sqrt{2}\d {B}_{t}. \] The derivation of this reverse process can be found in Section 3.1 of this note. If \(Y_0 \sim \mathcal{N}(0, I_d)\), then for a sufficiently large \(T\), the distribution of \(Y_T\) will be close to \(\pi\).
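To make the pipeline concrete, here is a toy sketch in one dimension where the data distribution is a hypothetical Gaussian \(\pi_0 = \mathcal{N}(m, s^2)\). In this case \(\pi_t\) is Gaussian with known mean and variance, so the score \(\grad\log\pi_t\) is available in closed form and the reverse SDE can be simulated directly (assuming NumPy; no neural network is needed for this check):

```python
import numpy as np

rng = np.random.default_rng(3)

# DDPM toy check in 1D. Under the OU forward process, pi_t is
# N(e^{-t} m, e^{-2t} s^2 + 1 - e^{-2t}), so the score is explicit.
m, s = 2.0, 0.5

def score(x, t):                      # grad log pi_t(x)
    mean = np.exp(-t) * m
    var = np.exp(-2 * t) * s**2 + 1 - np.exp(-2 * t)
    return -(x - mean) / var

# Reverse process: dY = (Y + 2 * score(Y, T - t)) dt + sqrt(2) dB.
T, dt, n = 5.0, 0.002, 20_000
y = rng.standard_normal(n)            # Y_0 ~ N(0, 1), close to pi_T
for k in range(int(T / dt)):
    t = k * dt
    y += (y + 2 * score(y, T - t)) * dt + np.sqrt(2 * dt) * rng.standard_normal(n)

# Y_T should be approximately distributed as pi_0 = N(2, 0.25).
print(y.mean(), y.var())
assert abs(y.mean() - m) < 0.05 and abs(y.var() - s**2) < 0.05
```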
Score matching
To implement this, we only need to discretize the reverse SDE and simulate it. The most critical problem in executing this reverse process is to estimate \(\grad \log \pi_t(x)\) for every \(t\). This quantity is known as the score function.
This function is generally intractable to compute directly. However, in practice, we have access to samples from the distribution \(\pi_t\). (For example, we have samples from \(\pi_0\). By simulating the OU process starting from these samples, we can obtain samples of \(\pi_t\)). Therefore, we hope to use these samples to train a neural network to estimate \(\grad \log \pi_t(x)\).
The following magical formula makes it convenient to train such a neural network.
Theorem 1 (Tweedie’s formula) Let \(Y=X+\sqrt{t}\cdot\xi\), where \(\xi\sim\mathcal{N}(0, I_d)\) is a \(d\)-dimensional standard Gaussian distribution. Then \[ \E{X|Y=y}=y+t\grad \log p_{Y}(y), \] where \(p_Y(y)\) is the probability density function of \(Y\).
We can obtain a set of pairs \((X_i, Y_i)\), where \(X_i\) is a (scaled) sample from \(\pi_0\) and \(Y_i\), distributed as \(\pi_t\), is obtained by adding Gaussian noise to \(X_i\). Our goal is now to find a function \(F_\theta\) such that \(F_\theta(y) \approx \E{X|Y=y}\). This then allows us to use Tweedie’s formula to estimate the value of \(\grad \log p_Y(y)\). This is a regression problem, and we can minimize the mean squared error \(\E{||X-F_\theta(Y)||^2}\) using the known samples. It is easy to verify that the optimal parameter \(\theta^*\) satisfies \(F_{\theta^*}(y) = \E{X|Y=y}\). You can see more details in Section 2.2 of this note.
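Tweedie’s formula and the regression view can be checked together in a Gaussian toy model, where the optimal denoiser \(\E{X|Y=y}\) is linear and can be recovered by least squares. A sketch assuming NumPy, with hypothetical parameters \(m_0\), \(s_0\), and \(t\):

```python
import numpy as np

rng = np.random.default_rng(4)

# Tweedie check for a Gaussian toy model: X ~ N(m0, s0^2) and
# Y = X + sqrt(t) * xi, so p_Y = N(m0, s0^2 + t) and the score
# grad log p_Y(y) = -(y - m0) / (s0^2 + t) is known in closed form.
m0, s0, t = 1.0, 2.0, 0.5
n = 1_000_000
X = m0 + s0 * rng.standard_normal(n)
Y = X + np.sqrt(t) * rng.standard_normal(n)

# "Train" the denoiser: least-squares regression of X on Y. For Gaussians
# the optimal denoiser E[X|Y=y] is linear, so linear regression recovers it.
A = np.vstack([Y, np.ones(n)]).T
(a, b), *_ = np.linalg.lstsq(A, X, rcond=None)
denoiser = lambda y: a * y + b

# Tweedie: E[X|Y=y] = y + t * grad log p_Y(y).
score = lambda y: -(y - m0) / (s0**2 + t)
ys = np.linspace(-3, 5, 9)
assert np.allclose(denoiser(ys), ys + t * score(ys), atol=0.01)
```

In practice \(F_\theta\) is a neural network rather than a linear model, but the training objective is exactly this mean-squared-error regression.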
Correspondence between continuous-time and discrete-time Markov processes
Continuous-time and discrete-time Markov processes are closely related. At the end of this lecture, we will place the corresponding concepts for both side-by-side for comparison. We believe that familiarity with this comparison provides better intuition for the subject.
| Concepts | Continuous-time Markov Process | Discrete-time Markov Chain |
|---|---|---|
| Transition probability | \(P_t\) | \(P^t\) |
| Generator | \(\+L\) | \(P-I\) |
| KBE | \(\dv{}{t} P_{t}=\mathcal{L}P_{t}\) | \(P^{t}-P^{t-1}=(P-I)P^{t-1}\) |
| KFE | \(\dv{}{t} P_{t}=P_{t}\mathcal{L}\) | \(P^{t}-P^{t-1}=P^{t-1}(P-I)\) |
| Adjoint operator | \(P_t^*\) | \((P^t)^{\top}\) |
The high-dimensional setting
Most of the discussions above are focused on the one-dimensional setting for simplicity. In this section, we generalize them to processes in \(\mathbb{R}^d\).
We start with the Itô formula in \(d\)-dimensions. Let \(\set{\*X_{t}}_{t\ge0}\) be a \(d\)-dimensional Itô process satisfying \[ \d \*X_{t}=\*\mu_t\d t+\*\sigma_t\d B_{t}, \] where \(\*\mu_t:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}\) is the drift vector, \(\*\sigma_t:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d\times m}\) is the diffusion matrix, and \(\set{B_{t}}_{t\ge0}\) is an \(m\)-dimensional standard Brownian motion.
Theorem 2 For any twice-differentiable function \(f:\mathbb{R}^{d}\rightarrow\mathbb{R}\), the Itô formula states that \[ \d f(\*X_{t})=\inner{\grad f(\*X_t)}{\*\mu_t}\dd t + \inner{\grad f(\*X_t)}{\*\sigma_t \dd B_t} + \frac12\!{Tr}\tp{\*\sigma_t^{\top}\grad^2 f(\*X_t)\*\sigma_t}\dd t, \] where \(\!{Tr}(\cdot)\) is the trace operator.
Proof. As usual, we only provide an informal justification of the formula by assuming the heuristic identity \[ \d B_{t}^{(i)}\cdot \d B_{t}^{(j)}=\begin{cases}\dd t & i=j \\ 0 & i\neq j\end{cases}. \tag{3}\] Then by the multi-dimensional Taylor formula, we obtain \[ \begin{align*} \dd f(X_t) &=f(X_{t+\dd t})-f(X_t)\\ &=\inner{\grad f(X_t)}{\dd X_t}+\frac{1}{2}\dd X_t^{\top}\grad^2 f(X_t)\dd X_t + o(\dd t)\\ &=\inner{\grad f(X_t)}{\mu_t \dd t + \sigma_t \dd B_t} + \frac{1}{2}(\sigma_t\dd B_t)^{\top}\grad^2 f(X_t)(\sigma_t \dd B_t) + o(\dd t) \end{align*} \] Finally, by Equation 3, the last line equals \[ \inner{\grad f(X_t)}{\mu_t}\dd t + \inner{\grad f(X_t)}{\sigma_t \dd B_t} + \frac12\!{Tr}\tp{\sigma_t^{\top}\grad^2 f(X_t)\sigma_t}\dd t + o(\dd t). \]
We slightly abuse notation here and use \(\dd t\) to denote either a differential operator or an infinitesimal quantity.
Let’s calculate the generator \(\+L\) and its adjoint \(\+L^*\) for this \(d\)-dimensional time-homogeneous Itô processes. The following lemma is a generalization of Lemma 1.
Lemma 2 Assume \(\set{\*X_t}_{t\ge 0}\) satisfies \[ \dd \*X_t = \*\mu(\*X_t)\dd t + \*\sigma(\*X_t)\dd B_t, \] where \(\*X_t\in \bb R^d\), \(\*\mu(\*X_t)\in \bb R^d\) and \(\*\sigma(\*X_t)\in \bb R^{d\times m}\). Then its generator \(\+L\) satisfies \[ \+L f = \*\mu^\top \grad f + \frac{1}{2}\!{Tr}(\*\sigma^\top \grad^2 f \*\sigma). \]
We remark that the proof of Lemma 2 follows the same steps as that of Lemma 1, using the multi-dimensional Itô formula.
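As in the one-dimensional case, Lemma 2 can be sanity-checked by Monte Carlo: for small \(h\), the difference quotient \((\E{f(\*X_h)|\*X_0=x}-f(x))/h\) should match \(\*\mu^\top \grad f + \frac{1}{2}\!{Tr}(\*\sigma^\top \grad^2 f \*\sigma)\). A sketch in \(d=2\) with hypothetical constant coefficients, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(5)

# Monte Carlo check of the d-dimensional generator formula
#   L f = mu . grad f + (1/2) Tr(sigma^T Hess(f) sigma)
# for a hypothetical 2D Ito process with constant coefficients.
mu = np.array([0.5, -1.0])
sigma = np.array([[1.0, 0.2],
                  [0.3, 0.8]])

f      = lambda x: x[0]**2 + x[0] * x[1]
grad_f = lambda x: np.array([2 * x[0] + x[1], x[0]])
hess_f = np.array([[2.0, 1.0],
                   [1.0, 0.0]])

x0, h, n = np.array([1.0, -0.5]), 0.02, 2_000_000

# One Euler step is exact in law here since mu and sigma are constant.
xi = rng.standard_normal((n, 2))
X_h = x0 + mu * h + np.sqrt(h) * xi @ sigma.T

estimate = (np.mean(X_h[:, 0]**2 + X_h[:, 0] * X_h[:, 1]) - f(x0)) / h
exact = grad_f(x0) @ mu + 0.5 * np.trace(sigma.T @ hess_f @ sigma)
print(estimate, exact)
assert abs(estimate - exact) < 0.05
```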
Now the adjoint operator \(\+L^*\):
Lemma 3 We have that \[ \begin{align*} \+L^* g &= -\div(\*\mu g)+\frac{1}{2} \grad^2:(\*\Sigma g) \\ &= -\sum_{i=1}^d \frac{\partial}{\partial x_i}(\mu_i\cdot g) + \frac{1}{2}\sum_{i=1}^d \sum_{j=1}^d \frac{\partial^2}{\partial x_i \partial x_j}(\*\Sigma_{ij}\cdot g), \end{align*} \] where \(\*\Sigma = \*\sigma\*\sigma^\top\). In particular when \(\*\Sigma = D\cdot \*I_d\) for some constant \(D>0\), we have \[ \+L^* g = -\div(\*\mu g) + \frac{D}{2} \Delta g. \]
We will repeat the proof in the one-dimensional case. Recall the main tool we used there is the integration by parts formula. Now we need its multi-dimensional version.
Proposition 4 For any compactly supported scalar field \(f\colon \bb R^d\to\bb R\) and vector field \(\*g\colon \bb R^d\to\bb R^d\), we have the integration by parts formula: \[ \int_{\bb R^d} \grad f\cdot \*g \d x = -\int_{\bb R^d} f \cdot \tp{\div \*g} \d x. \tag{4}\] For a compactly supported rank-2 tensor (matrix) \(\*T\), we have \[ \int_{\bb R^d} \grad \*g : \*T \d x = -\int_{\bb R^d} \*g \cdot (\div \*T) \d x. \tag{5}\]
For two matrices (rank-2 tensors) \(\*A, \*B \in \bb R^{d\times d}\), their double inner product \(\*A:\*B\) is defined as \(\sum_{i=1}^d \sum_{j=1}^d A_{ij} B_{ij}\).
Proof. From the product rule of divergence, we have \[ \div(f\*g) = \grad f \cdot \*g + f \cdot \div \*g. \] Integrating both sides over \(\bb R^d\) gives \[ \int_{\bb R^d} \div(f\*g) \d x = \int_{\bb R^d} \grad f \cdot \*g \d x + \int_{\bb R^d} f \cdot \div \*g \d x. \] Applying the divergence theorem, the left-hand side equals \[ \lim_{R\to\infty} \int_{\partial B(0,R)} f\*g \cdot \*n \d S =0. \] To prove Equation 5, we use the product rule of divergence for tensors: \[ \div(\*T \*g) = (\div\*T)\cdot \*g + \*T : \grad \*g. \] Integrating both sides over \(\bb R^d\) and applying the divergence theorem, we get \[ 0 = \int_{\bb R^d} (\div\*T)\cdot \*g \d x + \int_{\bb R^d} \*T : \grad \*g \d x. \]
The divergence theorem states that for any compactly supported vector field \(\*F:\Omega\to \bb R^d\), \[ \int_{\Omega} \div \*F \d x = \int_{\partial \Omega} \*F \cdot \*n \d S, \] where \(\Omega \subset \bb R^d\) is a bounded domain with smooth boundary \(\partial \Omega\), and \(\*n\) is the unit outer normal vector on \(\partial \Omega\).
Proposition 5 We can also state the integration by parts formula in a coordinate-wise manner: \[ \int_{\mathbb{R}^{d}} f(x)\frac{\partial}{\partial x_i}g(x)\d x = -\int_{\mathbb{R}^{d}} g(x)\frac{\partial}{\partial x_i}f(x)\d x, \] for any \(i\in\{1,2,\ldots,d\}\).
Proof (of Lemma 3). For compactly supported test functions \(f\) and \(g\), we have \[ \inner{\+L f}{g} = \inner{f}{\+L^* g}. \] From Lemma 2, we have \[ \inner{\+L f}{g} = \int_{\bb R^d} \tp{\inner{\grad f(x)}{\mu(x)} + \frac{1}{2}\!{Tr}(\sigma(x)^\top \grad^2 f(x) \sigma(x))} g(x) \d x. \]
Let us first handle the term \(\int_{\bb R^d} \inner{\grad f(x)}{\mu(x)} g(x) \d x\): \[ \begin{align*} \int_{\bb R^d} \inner{\grad f(x)}{\mu(x)} g(x) \d x &= \int_{\bb R^d}\grad f(x)\cdot (\mu(x) g(x)) \d x \\ \mr{integration by parts} & = -\int_{\bb R^d} f(x) \div(\mu(x)g(x)) \d x \\ & = -\inner{f}{\div(\mu g)}. \end{align*} \] Next, we handle the term \(\int_{\bb R^d} \!{Tr}(\sigma(x)^\top \grad^2 f(x) \sigma(x)) g(x) \d x\). By the cyclic property of trace, we have \[ \!{Tr}(\*\sigma^\top \grad^2 f \*\sigma) = \grad^2 f : (\*\sigma \*\sigma^\top) = \grad^2 f : \*\Sigma. \] Therefore, \[ \begin{align*} \int_{\bb R^d} \!{Tr}(\sigma(x)^\top \grad^2 f(x) \sigma(x)) g(x) \d x &= \int_{\bb R^d} \grad^2 f(x) : \*\Sigma(x) g(x) \d x \\ \mr{integration by parts} & = -\int_{\bb R^d} \grad f(x) \cdot \div(\*\Sigma(x) g(x)) \d x \\ \mr{integration by parts} & = \int_{\bb R^d} f(x) \cdot \grad^2 : (\*\Sigma(x) g(x)) \d x \\ & = \inner{f}{\grad^2 : (\*\Sigma g)}. \end{align*} \]
This immediately gives the KFE / Fokker-Planck equation for the \(d\)-dimensional Itô process: \[ \pdv{}{t}\pi_t = -\div(\*\mu \pi_t) + \frac{1}{2} \grad^2 : (\*\Sigma \pi_t), \] where \(\pi_t\) is the law of \(\*X_t\).