$$ \def\*#1{\mathbf{#1}} \def\+#1{\mathcal{#1}} \def\-#1{\mathrm{#1}} \def\!#1{\mathsf{#1}} \def\@#1{\mathscr{#1}} \newcommand{\mr}[1]{\mbox{\color{RedViolet}$\triangleright\;$#1}\quad\quad} $$

Lecture 3: Concentration Inequalities via Martingales

Author

Instructed by Chihao Zhang, scribed by Fangke Li and Yuchen He

Our study of concentration inequalities so far has relied on the strong assumption that the random variables are mutually independent. A natural question arises: can we derive similar bounds when this independence is absent? To do so, we will turn to the theory of martingales, a concept we learnt in our probability theory course.

Martingale

The concept of a martingale originates from the idea of a fair game.

Imagine you are playing a game where you can bet on a “high” or “low” outcome in each round. You are free to bet any amount of money. The game is considered fair if, regardless of your betting strategy, your expected gain in any given round is zero. Consequently, your expected total wealth remains constant over time.

To express this mathematically, let \(X_t\) denote your winnings in round \(t\), and let \(Z_t\) be your total fortune after round \(t\). Your fortune at time \(T>0\) is naturally the sum of your initial fortune \(Z_0\) and all subsequent winnings: \(Z_T=Z_0+\sum_{t=1}^T X_t\). The condition for a fair game is that \[ \forall t\ge 0,\;\E{X_{t+1}\mid X_1,\dots,X_t} = 0, \] or equivalently, that the expected fortune for the next round, given all past outcomes, is simply the current fortune: \[      \forall t\ge 0,\;\E{Z_{t+1} \mid X_1,\dots,X_t} = Z_t. \] Crucially, we do not require the outcome \(X_{t+1}\) to be independent of the past outcomes \(X_1,\dots,X_t\). In the context of our game, this means your betting strategy for the next round can depend on all your previous wins and losses, as long as the game itself remains fair. Abstracting this property gives us the general definition of a martingale.
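Before turning to the formal definition, here is a minimal simulation sketch of the fair game above (the betting rule, stake sizes, and all parameters below are hypothetical choices made only for illustration): even though the bet placed in each round depends on the whole history, the average final fortune stays at the initial fortune \(Z_0\).

```python
import numpy as np

rng = np.random.default_rng(0)

def play(T=50, z0=100.0):
    """One play of a fair game whose bet size depends on the history."""
    z = z0
    history = []                                # past winnings X_1, ..., X_t
    for _ in range(T):
        # hypothetical strategy: bet one unit plus the number of past winning rounds
        bet = 1 + sum(1 for x in history if x > 0)
        outcome = rng.choice([-1, 1])           # fair coin, so E[X_{t+1} | history] = 0
        x = bet * outcome                       # winnings in this round
        history.append(x)
        z += x
    return z

# average final fortune over many plays is close to Z_0 = 100
print(np.mean([play() for _ in range(20000)]))
```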

Definition 1 (Filtration) Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space. A filtration is a sequence of \(\sigma\)-algebras \(\{\mathcal{F}_n\}_{n \geq 0}\) such that \[\begin{align*}\mathcal{F}_0 \subseteq \mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots \subseteq \mathcal{F}. \end{align*}\]

With this, we can define a martingale.

Definition 2 (Discrete Martingale) A sequence of integrable random variables \(\{Z_n\}_{n \geq 0}\) is called a martingale with respect to \(\{\mathcal{F}_n\}_{n\geq 0}\) if for any \(n\), \(Z_n\) is \(\mathcal{F}_n\)-measurable and \[\begin{align*} \E {Z_{n+1} \mid \mathcal{F}_n} = Z_n \end{align*}\] almost surely.

Often, the filtration is generated by another sequence of random variables, \(\set{X_n}_{n \geq 0}\), in which case we write \(\mathcal{F}_n=\sigma(X_0,X_1,\dots,X_n)\) and say \(\{Z_n\}_{n \geq 0}\) is a martingale with respect to \(\set{X_n}_{n \geq 0}\). If the filtration is generated by the process \(\{Z_n\}_{n \geq 0}\) itself, we simply call \(\{Z_n\}_{n \geq 0}\) a martingale.

If the equality \(\E {Z_{n+1} \mid \mathcal{F}_n} = Z_n\) in the definition is replaced by \(\geq\) or \(\leq\), the process is called a submartingale or a supermartingale, respectively.

We first recall some examples from our probability theory course.

Example 1 (Random walk) The classic example is a simple random walk. Consider the sum \(S_{n} = \sum_{i=1}^n X_i\), where \(X_i\in \set{-1, +1}\) is the \(i\)-th random step and \(S_n\) represents the current position. The variables \(\set{X_i}_{i\geq 1}\) are i.i.d. Assume \(\E{X_i}=\mu\). Obviously, when \(\mu\neq 0\), \(\set{S_n}_{n\geq 0}\) itself is not a martingale. We therefore consider the centered process.

Define \(Y_n=X_n-\mu\) and \(S'_n=\sum_{i=1}^n Y_i\). Then \[\begin{align*} \E{S'_{n+1}\mid Y_1,\dots,Y_n}&=\E{S'_n+Y_{n+1}\mid Y_1,\dots,Y_n}\\&=S'_n+\E{X_{n+1}-\mu\mid X_1,\dots,X_n}\\&=S'_n. \end{align*}\] Therefore, \(\{S'_n\}_{n\geq 0}\) (with \(S'_0=0\)) is a martingale with respect to \(\{Y_n\}_{n\geq 1}\).
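A quick numerical check of this example (the bias \(p=0.7\) and all parameters below are arbitrary illustrative choices): the raw walk drifts linearly, while the centered walk keeps a constant mean, as the martingale property predicts.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, trials = 0.7, 30, 100000                 # hypothetical bias: P[X_i = +1] = 0.7
mu = 2 * p - 1                                 # E[X_i]

steps = np.where(rng.random((trials, n)) < p, 1, -1)    # X_i in {-1, +1}
S = steps.cumsum(axis=1)                       # S_n: drifts, not a martingale
S_centered = (steps - mu).cumsum(axis=1)       # S'_n = sum_i (X_i - mu): a martingale

print(S[:, -1].mean(), n * mu)                 # mean of S_n grows like n * mu
print(S_centered[:, -1].mean())                # mean of S'_n stays ~ 0 = S'_0
```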

The above example indicates that any sum of independent random variables can be transformed into a martingale by centering it. The following example shows that martingales are not limited to additive structures.

Example 2 Consider a sequence of random variables \(\{X_n\}_{n \ge 1}\), where for any \(n \ge 1\), \(\E{X_n \mid X_1, \dots, X_{n-1}} = 1\).
Define \(P_n = \prod_{k=1}^n X_k\) (with \(P_0=1\)), then \(\{P_n\}_{n \ge 0}\) is a martingale with respect to \(\{X_n\}_{n \ge 1}\):

\[\begin{align*} \E{P_{n+1} \mid X_1, \dots, X_n} &= \E{P_n \cdot X_{n+1} \mid X_1, \dots, X_n} \\ &= P_n \cdot \E{X_{n+1} \mid X_1, \dots, X_n} \\ &= P_n. \end{align*}\]
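As a sanity check, take a concrete conditional-mean-one sequence, say i.i.d. factors uniform on \(\{0.5, 1.5\}\) (an illustrative choice): the product keeps mean \(1\), even though a typical trajectory decays, since \(\E{\log X_k}<0\).

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 20, 200000
# hypothetical factors: X_k uniform on {0.5, 1.5}, so E[X_{k+1} | X_1..X_k] = 1
X = rng.choice([0.5, 1.5], size=(trials, n))
P = X.prod(axis=1)                             # P_n = X_1 * ... * X_n

print(P.mean())                                # ~= 1 = P_0 (martingale in the mean)
print(np.median(P))                            # << 1: typical products decay to 0
```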

Example 3 (Galton-Watson Process) The Galton-Watson process is a classic stochastic process that models population growth.

Let \(G_t\) denote the number of individuals in the \(t\)-th generation. Let \(X_{t,i}\) denote the number of offspring of the \(i\)-th individual in the \(t\)-th generation. Assume those \(X_{t,i}\)’s are i.i.d. Let \(\mu = \E {X_{t,i}}\). The next generation’s size is \(G_{t+1} = \sum_{i=1}^{G_t} X_{t,i}.\)

Let \(\mathcal{F}_t\) denote the information of the first \(t\) generations, i.e.,

\[\begin{align*} \forall t \ge 1, \quad \mathcal{F}_t = \sigma(X_{1,1}, X_{1,2}, \dots, X_{t-1,1}, X_{t-1,2}, \dots, X_{t-1,G_{t-1}}). \end{align*}\]

Then \(G_t\) is \(\mathcal{F}_t\)-measurable, and the \(X_{t,i}\)’s are independent of \(\mathcal{F}_t\). By Wald’s equation,

\[\begin{align*} \E{G_{t+1} \mid \mathcal{F}_t} &=\E{\sum_{i=1}^{G_t} X_{t,i} \,\Big|\, \mathcal{F}_t } \\ &=\E{X_{t,1} \mid \mathcal{F}_t} \cdot G_t \\ &=\mu \cdot G_t. \end{align*}\]

This is not a martingale unless \(\mu=1\). However, assuming \(\mu>0\), we can normalize the process. Define \(M_t := \mu^{-t} \cdot G_t\). Then \[\begin{align*} \E{M_{t+1} \mid \mathcal{F}_t} &= \mu^{-(t+1)} \E{G_{t+1} \mid \mathcal{F}_t} \\ &= \mu^{-(t+1)} (\mu \cdot G_t) \\ &= \mu^{-t} \cdot G_t \\ &= M_t. \end{align*}\] That is, \(\set{M_t}_{t\geq 1}\) is a martingale with respect to the filtration \(\set{\mathcal{F}_t}_{t\geq 1}\).
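A small simulation illustrates the normalization (the Poisson(\(1.5\)) offspring law, the horizon \(T\), the number of trials, and starting from a single individual are hypothetical choices): across many runs the empirical mean of \(M_T=\mu^{-T}G_T\) stays close to \(1\), even though individual populations either die out or explode.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, T, trials = 1.5, 10, 50000                 # hypothetical offspring law: Poisson(1.5)

M_T = np.empty(trials)
for j in range(trials):
    g = 1                                      # start from a single individual
    for _ in range(T):
        g = rng.poisson(mu, size=g).sum()      # G_{t+1} = sum of offspring counts
    M_T[j] = g / mu**T                         # M_T = mu^{-T} * G_T

print(M_T.mean())                              # ~= 1 = M_0, as the martingale property predicts
```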

Example 4 (Pólya’s Urn) Pólya’s Urn illustrates a process with reinforcement. An urn starts with one white and one black ball. At each step, we draw a ball from the urn, note its color, and return it to the urn along with another ball of the same color. For convenience, we index rounds by the number of balls in the urn, so that after round \(n\) the urn contains exactly \(n\) balls; the initial state corresponds to round \(2\).

Let \(X_n\) denote the number of black balls when the urn contains \(n\) balls, and let \(Z_n = \frac{X_n}{n}\) denote the proportion of black balls. Clearly \(Z_2 = \frac{1}{2}\). For any \(n \ge 2\), \[\begin{align*} \E{Z_{n+1} \mid X_{2}, \dots, X_n} &= \frac{1}{n+1} \E{X_{n+1} \mid X_{2}, \dots, X_n} \\ &= \frac{1}{n+1} \big(Z_n (X_n+1) + (1-Z_n) X_n \big) \\ &= Z_n. \end{align*}\]

Thus, \(\{Z_n\}_{n \ge 2}\) is a martingale with respect to \(\{X_n\}_{n \ge 2}\).
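The sketch below (trial count and horizon chosen arbitrarily) simulates the urn and confirms that the proportion of black balls has constant mean \(\frac{1}{2}\), consistent with the martingale property; at the same time the sample values spread out over \((0,1)\) rather than concentrating.

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials = 500, 5000                          # N additional draws per run

Z = np.empty(trials)
for j in range(trials):
    black, total = 1, 2                        # start: one black, one white ball
    for _ in range(N):
        if rng.random() < black / total:       # a black ball is drawn ...
            black += 1                         # ... and returned with one extra black
        total += 1                             # one ball is added in every round
    Z[j] = black / total

print(Z.mean())                                # ~= 1/2 = Z_2
print(Z.std())                                 # large: individual runs do not concentrate
```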

The Doob Martingale

We now introduce an important and general construction that turns (almost) any function of random variables into a martingale. Given \(n\) random variables \(X_1, \dots, X_n\) and an integrable function \(f(X_1, \dots, X_n)\), we can reveal the variables one by one and track the conditional expectation of \(f\).

For \(k=0,1,\dots,n\), define \[\begin{align*} Z_k = \E{f(X_1, \dots, X_n) \mid X_1, \dots, X_k}. \end{align*}\]

Then \(\{Z_k\}_{0\le k \le n}\) is a martingale with respect to \(\{X_k\}_{1\le k \le n}\) (here \(Z_0 = \E{f(X_1,\dots,X_n)}\) is a constant). The sequence \(\{Z_k\}_{0\le k \le n}\) is called a Doob martingale or a Doob sequence.

We can verify this by applying the tower rule of conditional expectation. For \(k<n\), \[\begin{align*} \E{Z_{k+1} \mid X_1, \dots, X_k} &= \E{\E{f(X_1, \dots, X_n) \mid X_1, \dots, X_{k+1}}\mid X_1, \dots, X_k} \\ \mr{the tower rule}&= \E{f(X_1, \dots, X_n) \mid X_1, \dots, X_k} \\ &= Z_k. \end{align*}\]
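For a concrete picture, take the special case \(f(X_1,\dots,X_n)=\sum_{i=1}^n X_i\) with i.i.d. Bernoulli(\(p\)) variables (the choice of \(f\), \(p\) and \(n\) below is purely illustrative). Then \(Z_k=\sum_{i\le k}X_i+(n-k)p\) in closed form, \(Z_0=\E{f}\) and \(Z_n=f\), and conditioning on any fixed prefix leaves the next value unchanged on average.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, trials = 10, 0.3, 200000
X = (rng.random((trials, n)) < p).astype(float)          # X_1, ..., X_n ~ Ber(p)

# Doob martingale for f = X_1 + ... + X_n:
#   Z_k = E[f | X_1, ..., X_k] = X_1 + ... + X_k + (n - k) * p
k = np.arange(n + 1)
Z = np.concatenate([np.zeros((trials, 1)), X.cumsum(axis=1)], axis=1) + (n - k) * p

print(Z[:, 0].mean(), n * p)                   # Z_0 equals E[f] exactly
# E[Z_4 | X_1, X_2, X_3] = Z_3, checked on the event {X_1=1, X_2=0, X_3=1}
mask = (X[:, 0] == 1) & (X[:, 1] == 0) & (X[:, 2] == 1)
print(Z[mask, 4].mean(), Z[mask, 3].mean())    # nearly equal
```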

Azuma-Hoeffding inequality

Now we use the above tools to derive stronger concentration inequalities. In our first class, we studied the classic Hoeffding’s inequality for sums of independent random variables.

Theorem 1 (Hoeffding’s inequality) Let \(X_1, \ldots, X_n\) be mutually independent random variables where each \(X_i\in [a_i, b_i]\) for certain \(a_i\le b_i\) with probability \(1\). Assume \(\E {X_i} = p_i\) for every \(1\le i\le n\). Let \(X = \sum_{i=1}^{n} X_i\) and \(\mu\defeq \E{X}=\sum_{i=1}^n p_i\), then \[ \Pr {| X - \mu | \geq t} \leq 2 \exp\left(-\frac{2t^2}{\sum_{i=1}^n (b_i - a_i)^2}\right) \] for all \(t\geq 0\).

A key question in modern probability and data science is how functions of many random variables, \(f(X_1,\dots,X_n)\), concentrate around their mean, especially when the variables are not independent. Hoeffding’s inequality only covers the simple case where \(f\) is a sum and the random variables are independent. This motivates a generalization that can handle more complex, structured dependencies.

Theorem (Azuma-Hoeffding inequality). Assume \(\{Z_k\}_{0\leq k\leq n}\) is a martingale with respect to a filtration \(\{\mathcal{F}_k\}_{0\leq k\leq n}\) and assume that \(Z_k - Z_{k-1} \in [a_k,b_k]\) for every \(1\le k\le n\). Then for any \(t > 0\), \[\begin{align*} \Pr{\abs{Z_n - Z_0} \ge t} \le 2\exp\set{-\frac{2 t^2}{\sum_{k=1}^n (b_k - a_k)^2}}. \end{align*}\]

If we set \(Z_k = \sum_{i=1}^k X_i + \sum_{i=k+1}^n p_i\) for mutually independent random variables \(\set{X_k}_{1\leq k\leq n}\), so that \(Z_0=\mu\), \(Z_n = X\), and \(Z_k - Z_{k-1} = X_k - p_k \in [a_k - p_k,\, b_k - p_k]\), then the above theorem recovers the original Hoeffding’s inequality.

The proof strategy is quite similar to that of the original Hoeffding’s inequality. Recall that in the original proof, an important step is to use independence to derive \[ \E{e^{\beta\tp{\sum_{i=1}^n X_i - \E{X_i}}}} = \prod_{i=1}^n \E{e^{\beta\tp{ X_i - \E{X_i}}}} \] for any \(\beta>0\) and then use Hoeffding’s lemma to bound each factor \(\E{e^{\beta\tp{ X_i - \E{X_i}}}}\). For the Azuma-Hoeffding inequality, the key difficulty is how to split the expectation into such a product without independence.

Proof (Proof of the Azuma-Hoeffding inequality). We only prove one side; the other side follows by applying the same argument to the martingale \(\set{-Z_k}_{0\le k\le n}\) and taking a union bound. By Markov’s inequality, for any \(\beta > 0\), \[\begin{align*} \Pr{Z_n - Z_0 \ge t} &= \Pr{e^{\beta (Z_n - Z_0)} \ge e^{\beta t}} \\ &\le e^{-\beta t} \E{e^{\beta (Z_n - Z_0)}}. \end{align*}\]

Then we use the tower rule of conditional expectations: \[\begin{align*} \E{e^{\beta \sum_{k=1}^n (Z_k - Z_{k-1})}} &= \E{\E{e^{\beta \sum_{k=1}^n (Z_k - Z_{k-1})} \mid \mathcal{F}_{n-1}}} \\ &= \E{e^{\beta \sum_{k=1}^{n-1} (Z_k - Z_{k-1})} \E{e^{\beta (Z_n - Z_{n-1})} \mid \mathcal{F}_{n-1}}}. \end{align*}\]

Since \(\set{Z_k}_{0\leq k\leq n}\) is a martingale, \(\E{Z_n-Z_{n-1}\mid\mathcal{F}_{n-1}}=0\). Since moreover \(Z_n - Z_{n-1}\in[a_n,b_n]\), Hoeffding’s lemma (applied conditionally on \(\mathcal{F}_{n-1}\)) gives \[\begin{align*} \E{e^{\beta (Z_n - Z_{n-1})} \mid \mathcal{F}_{n-1}} \le \exp\set{\frac{\beta^2 (b_n - a_n)^2}{8}}. \end{align*}\]

Iterating this argument over \(k = n, n-1, \dots, 1\) gives \[\begin{align*} \E{e^{\beta (Z_n - Z_0)}} \le \exp\set{\sum_{k=1}^n \frac{\beta^2 (b_k - a_k)^2}{8}}. \end{align*}\] Plugging this into the Markov bound and choosing the optimal \(\beta = \frac{4t}{\sum_{k=1}^n (b_k - a_k)^2}\) yields the claimed bound \(\Pr{Z_n - Z_0 \ge t} \le \exp\set{-\frac{2t^2}{\sum_{k=1}^n (b_k - a_k)^2}}\).
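As a numerical sanity check (all parameters below are arbitrary), take the simple \(\pm 1\) random walk: its increments lie in \([-1,1]\), so \(b_k-a_k=2\) for every \(k\) and the theorem gives \(\Pr{\abs{Z_n-Z_0}\ge t}\le 2\exp\set{-\frac{t^2}{2n}}\). The empirical tail indeed sits below this bound.

```python
import numpy as np

rng = np.random.default_rng(6)
n, trials, t = 100, 200000, 25
steps = rng.choice([-1, 1], size=(trials, n))  # martingale increments, each in [-1, 1]
Z_n = steps.sum(axis=1)                        # Z_0 = 0

empirical = np.mean(np.abs(Z_n) >= t)
azuma = 2 * np.exp(-2 * t**2 / (n * 2**2))     # b_k - a_k = 2 for every k
print(empirical, azuma)                        # empirical tail <= Azuma-Hoeffding bound
```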

Application: the balls-in-a-bag problem

Suppose there is a bag containing \(g\) green balls and \(r\) red balls, where all balls are identical except for their colors. We draw \(n\) balls from it with replacement. We want to estimate the proportion of red balls \(\frac{r}{r+g}\) through the observations.

Let \(X_i = \mathbf{1}[\text{the }i\text{-th draw is red}]\), and let \(X = \sum_{i=1}^{n} X_i\). The draws are i.i.d. and each \(X_i \sim \!{Ber}\left(\frac{r}{r+g}\right)\). We have \(\mathbb{E}[X] = \frac{nr}{r+g}\).

We can directly apply Hoeffding’s inequality (with \(a_i = 0\) and \(b_i = 1\)) to obtain

\[\begin{align*} \mathbb{P}\left[|X - \mathbb{E}[X]| \ge t\right] \le 2 \exp\set{-\frac{2t^2}{n}}. \end{align*}\]

Then we slightly modify the setting and consider sampling without replacement. Now let \(Y_i = \mathbf{1}[\text{the }i\text{-th draw is red}]\). Suppose \(r+g\geq n\). Let \(Y = \sum_{i=1}^{n} Y_i\). A simple combinatorial argument shows that \(Y\) follows the hypergeometric distribution: \[\begin{align*} \Pr {Y=k} = \frac{{r\choose k}{g\choose n-k}}{{r+g\choose n}}. \end{align*}\]

As a result, we still have \[\begin{align*} \E Y = n \cdot \frac{r}{r+g}. \end{align*}\]

Since the random variables \(\{Y_i\}_{1\leq i\leq n}\) are dependent, Hoeffding’s inequality no longer applies. However, intuition suggests the number of red balls should still be concentrated around its mean.

Define \(f(Y_1,\dots,Y_n) = \sum_{i=1}^n Y_i\). Let \(Z_k = \E{f(Y_1,\dots,Y_n)\mid Y_1,\dots , Y_k}\). Then we know that \(\set{Z_k}_{0\leq k\leq n}\) is a Doob martingale and we have \(\E{f} = Z_0\), \(f=Z_n\). To apply the Azuma-Hoeffding inequality, we need to bound the difference \(\abs{Z_i-Z_{i-1}}\). We can prove that \(\abs{Z_i-Z_{i-1}}\leq 1\) via direct calculation (see the details in Sec 2.1 of this note). Then applying the Azuma-Hoeffding inequality, we get the same concentration bound: \[\begin{align*} \Pr{\abs{f-\E f}\ge t}=\Pr{\abs{Z_n-Z_0}\ge t}\le 2e^{-\frac{2t^2}{n}}. \end{align*}\]
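This can be checked directly against the exact hypergeometric law (the bag sizes and parameters below are hypothetical; NumPy's hypergeometric sampler plays the role of drawing without replacement).

```python
import numpy as np

rng = np.random.default_rng(7)
r, g, n, t, trials = 300, 700, 100, 10, 200000    # hypothetical bag: 300 red, 700 green

# Y = number of red balls among n draws without replacement
Y = rng.hypergeometric(ngood=r, nbad=g, nsample=n, size=trials)
EY = n * r / (r + g)

empirical = np.mean(np.abs(Y - EY) >= t)
bound = 2 * np.exp(-2 * t**2 / n)
print(empirical, bound)                        # the bound 2*exp(-2t^2/n) holds (and is loose)
```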

McDiarmid’s Inequality

We’ve seen how the Azuma-Hoeffding inequality provides concentration for martingales. As shown in the previous example, the quality of the bound depends on the width of the martingale differences, that is, the magnitude of \(\abs{Z_i - Z_{i-1}}\). Bounding each \(\abs{Z_i - Z_{i-1}}\) is relatively easy when the function \(f\) and the variables \(\set{X_i}_{1\leq i\leq n}\) enjoy certain nice properties.

Definition 3 (Lipschitz function) For a function \(f:\mathbb{R}^n\to\mathbb{R}\), we say \(f\) is \((c_1,\dots,c_n)\)-Lipschitz if for any \(x_1,\dots,x_n\in \bb R\), \(i\in[n]\) and \(y\in \bb R\), \[\begin{align*} |f(x_1,\dots,x_n)-f(x_1,\dots,x_{i-1},y,x_{i+1},\dots,x_n)|\le c_i. \end{align*}\]

When \(c_1=c_2=\dots=c_n=L\), we also call it \(L\)-Lipschitz.

Theorem 2 (McDiarmid’s inequality) Let \(f:\bb R^n\to \bb R\) be a \((c_1,\dots,c_n)\)-Lipschitz function, and let \(X_1, \ldots, X_n\) be \(n\) mutually independent random variables. Then we have \[\begin{align*} \Pr{ \left| f(X_1, \ldots, X_n) - \E{f(X_1, \ldots, X_n)} \right| \ge t } \le 2 \exp\left( -\frac{2t^2}{\sum_{i=1}^n c_i^2} \right). \end{align*}\]

Proof. The proof strategy is to construct a Doob martingale and then apply the Azuma-Hoeffding inequality. Define \(\{Z_k\}_{0\leq k\leq n}\) as follows:

\[\begin{align*} Z_k := \mathbb{E}\left[ f\tp{\overline{X}_{1,n}} \mid \overline{X}_{1,k} \right], \end{align*}\] where \(\overline{X}_{i,j}\) denotes \(X_i,\dots, X_j\) and \(Z_0 = \E{f\tp{\overline{X}_{1,n}}}\). Since \(Z_k-Z_{k-1} = \mathbb{E}\left[ f(\overline{X}_{1,n}) \mid \overline{X}_{1,k-1}, X_k\right] - \mathbb{E}\left[ f(\overline{X}_{1,n}) \mid \overline{X}_{1,k-1} \right]\), we have \[ Z_k-Z_{k-1}\le \sup_{y} \left\{ \mathbb{E}\left[ f(\overline{X}_{1,n}) \mid \overline{X}_{1,k-1}, X_k=y\right] - \mathbb{E}\left[ f(\overline{X}_{1,n}) \mid \overline{X}_{1,k-1} \right]\right\}=:b_k, \] and \[ Z_k-Z_{k-1}\ge \inf_{x} \left\{ \mathbb{E}\left[ f(\overline{X}_{1,n}) \mid \overline{X}_{1,k-1}, X_k = x\right] - \mathbb{E}\left[ f(\overline{X}_{1,n}) \mid \overline{X}_{1,k-1}\right]\right\}=:a_k. \]

As a result, \[ \begin{align*} b_k-a_k &\leq \sup_{x, y} \left\{ \mathbb{E}\left[ f(\overline{X}_{1,n}) \mid \overline{X}_{1,k-1}, X_k = y \right] - \mathbb{E}\left[ f(\overline{X}_{1,n}) \mid \overline{X}_{1,k-1}, X_k = x \right] \right\}\\ &= \sup_{x, y}\left\{ \sum_{\sigma_{k+1},\dots, \sigma_n} \Big( f\tp{\ol{X}_{1,k-1},y,\sigma_{k+1},\dots, \sigma_n} \cdot \Pr{\ol{X}_{k+1,n}=(\sigma_{k+1},\dots,\sigma_n)\mid \ol{X}_{1,k-1},X_k=y} \right.\\ &\qquad\qquad - \left. f\tp{\ol{X}_{1,k-1},x,\sigma_{k+1},\dots, \sigma_n} \cdot \Pr{\ol{X}_{k+1,n}=(\sigma_{k+1},\dots,\sigma_n)\mid \ol{X}_{1,k-1},X_k=x}\Big)\right\}. \end{align*} \]

Since \(X_1,\dots,X_n\) are mutually independent, \[ \Pr{\ol{X}_{k+1,n}=(\sigma_{k+1},\dots,\sigma_n)\mid \ol{X}_{1,k-1},X_k=x} = \Pr{\ol{X}_{k+1,n}=(\sigma_{k+1},\dots,\sigma_n)\mid \ol{X}_{1,k-1},X_k=y} . \] Hence the two sums above share the same weights, and each difference \(f\tp{\ol{X}_{1,k-1},y,\sigma_{k+1},\dots,\sigma_n} - f\tp{\ol{X}_{1,k-1},x,\sigma_{k+1},\dots,\sigma_n}\) is at most \(c_k\) by the Lipschitz property of \(f\). Therefore, \[ b_k-a_k \leq c_k. \]

Applying the Azuma–Hoeffding inequality, we obtain

\[\begin{align*} \mathbb{P}\left( \left| f(X_1, \ldots, X_n) - \mathbb{E}[f(X_1, \ldots, X_n)] \right| \ge t \right) &= \mathbb{P}\left( |Z_n - Z_0| \ge t \right) \\ &\le 2 \exp\left( -\frac{2t^2}{\sum_{i=1}^n c_i^2} \right). \end{align*}\]

Let’s see some applications of McDiarmid’s inequality.

Example 5 (Balls into bins) Suppose we throw \(m\) balls independently and uniformly into \(n\) bins. Let \(Y_i=\*1[\text{bin }i\text{ is empty}]\) and \(Y=\sum_{i=1}^n Y_i\) be the total number of empty bins. We want to show \(Y\) is concentrated around its mean \(\E{Y} = n\cdot (1-1/n)^m\). Since \(\set{Y_i}_{1\leq i\leq n}\) are dependent, we cannot apply Hoeffding’s inequality directly.

Let \(X_i\) denote the position of the \(i\)-th ball. Then \(Y\) can be written as a function of \(X_1,\dots,X_m\); denote this function by \(f\). Note that if we change the placement of a single ball, the number of empty bins can change by at most \(1\), which indicates that \(f\) is \(1\)-Lipschitz. Applying McDiarmid’s inequality to the \(m\) variables \(X_1,\dots,X_m\), we have \[ \Pr{\abs{Y-\E{Y}}\geq t} \leq 2e^{-\frac{2t^2}{m}}. \]
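A quick simulation (the numbers of balls and bins and the threshold \(t\) are arbitrary illustrative choices) confirms the bound, although it is quite loose for these parameters.

```python
import numpy as np

rng = np.random.default_rng(8)
m, n, t, trials = 200, 100, 10, 20000          # m balls, n bins

empty = np.empty(trials)
for j in range(trials):
    bins = rng.integers(0, n, size=m)          # X_i = bin receiving the i-th ball
    empty[j] = n - np.unique(bins).size        # Y = number of empty bins

EY = n * (1 - 1 / n) ** m
empirical = np.mean(np.abs(empty - EY) >= t)
bound = 2 * np.exp(-2 * t**2 / m)              # m Lipschitz coordinates, each c_i = 1
print(empirical, bound)                        # empirical tail is (far) below the bound
```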

Example 6 (Pattern matching) Let \(P \in \{0,1\}^m\) be a fixed binary string of length \(m\). What is the number of times \(P\) appears as a substring in a random binary string \(X\) of length \(n\)?

Let \(Y_i = \*1[(X_i,\dots, X_{i+m-1})=P]\) and \(Y =\sum_{i=1}^{n-m+1} Y_i\) denote the number of occurrences of \(P\) in \(X\). The random variables \(\set{Y_i}_{1\leq i\leq n-m+1}\) are not independent.

We can easily compute the expectation of \(Y\) via the linearity of expectation. In fact, we have \(\E{Y} = (n-m+1)\E{Y_1} = (n-m+1)\cdot 2^{-m}\).

Define \(n\) independent random variables \(X_1, \dots, X_n\), where \(X_i\) represents the \(i\)-th character of \(X\). Then we can represent \(Y = f(X_1, \dots, X_n)\) for some function \(f\). Changing a single character \(X_i\) affects at most \(m\) of the length-\(m\) windows, so \(f\) is \(m\)-Lipschitz.

We can then use McDiarmid’s inequality to show that \(f\) is well concentrated: \[\begin{align*} \Pr{|f - \E{f}| \ge t} \le 2 e^{-\frac{2t^2}{nm^2}}. \end{align*}\]
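The sketch below checks this bound for one concrete pattern (the pattern \(P=101\), the string length, and the threshold are illustrative choices); because of the \(m^2\) factor the bound is rather loose here, but it does hold.

```python
import numpy as np

rng = np.random.default_rng(9)
n, m, trials, t = 200, 3, 20000, 40
P = '101'                                      # a fixed pattern of length m = 3

counts = np.empty(trials)
for j in range(trials):
    X = ''.join(map(str, rng.integers(0, 2, size=n)))    # random binary string
    counts[j] = sum(X[i:i + m] == P for i in range(n - m + 1))

EY = (n - m + 1) * 2 ** (-m)
empirical = np.mean(np.abs(counts - EY) >= t)
bound = 2 * np.exp(-2 * t**2 / (n * m**2))     # f is m-Lipschitz in each of the n characters
print(empirical, bound)
```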

Example 7 (Chromatic number in Erdős-Rényi graph) Consider the Erdős-Rényi graph \(\mathcal{G}(n,p)\). For a graph \(G \sim \mathcal{G}(n,p)\), let \(\chi_G\) denote its chromatic number, i.e., the minimum number of colors needed to color the vertices so no two adjacent vertices share the same color. We want to analyse the concentration of \(\chi_G\).

Define the random variable \(X_e\) for each possible edge \(e = \{u,v\} \subseteq V\) as \(X_e = \*1[e\in G]\), which is an indicator of the existence of this edge.

Note that \(\set{X_e}_{e\in \binom{[n]}{2}}\) are independent, and the chromatic number can be written as a function \(\chi_G = f\tp{\set{X_e}_{e\in \binom{[n]}{2}}}\).
It is easy to see that \(f\) is 1-Lipschitz, since adding or removing a single edge can change the chromatic number by at most \(1\). By McDiarmid’s inequality, we have \[\begin{align*} \Pr{|\chi_G - \E{\chi_G}| \ge t} \le 2 \exp\set{-\frac{2t^2}{{n\choose2}}}. \end{align*}\]

This bound is very loose. To get a non-trivial probability, we need \(t\) to be on the order of \(n\), but we already know \(\chi_G\leq n\).

The weakness of the first bound comes from the large number of variables. Let’s redefine them. Assume the vertex set of \(G\) is \(\{v_1, \dots, v_n\}\). Define \(n\) random variables \(X_1, \dots, X_n\), where each \(X_i\) encodes the edges between \(v_i\) and \(\{v_1, \dots, v_{i-1}\}\). Once \(X_1, \dots, X_n\) are given, the entire graph is determined, and thus the chromatic number can be written as a function \(g(X_1, \dots, X_n)\). Since each \(X_i\) only involves the connections between \(v_i\) and previous vertices, the \(n\) variables are independent.

If we change \(X_i\), we only alter the edges incident to the single vertex \(v_i\). Deleting \(v_i\) from the two graphs leaves the same graph \(H\), and \(\chi(H) \le \chi_G \le \chi(H)+1\) for both graphs, so the two chromatic numbers differ by at most \(1\). Therefore, \(g\) is also \(1\)-Lipschitz. Applying McDiarmid’s inequality, we obtain a better concentration bound. \[\begin{align*} \Pr{|\chi_G - \E{\chi_G}| \ge t} \le 2 e^{-2t^2 / n}. \end{align*}\]
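The crux of the improvement is that \(g\) is \(1\)-Lipschitz in the vertex-exposure variables. The brute-force sketch below checks this numerically: it resamples a single \(X_i\) and verifies that the exact chromatic number changes by at most \(1\) (exact chromatic numbers are only feasible for tiny graphs, so \(n=8\) and \(p=1/2\) are illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(10)

def chromatic_number(adj):
    """Exact chromatic number by backtracking (tiny graphs only)."""
    n = len(adj)
    def colorable(k):
        colors = [-1] * n
        def place(v):
            if v == n:
                return True
            for c in range(k):
                if all(colors[u] != c for u in range(v) if adj[v][u]):
                    colors[v] = c
                    if place(v + 1):
                        return True
            colors[v] = -1
            return False
        return place(0)
    k = 1
    while not colorable(k):
        k += 1
    return k

def to_adj(X, n):
    """Adjacency matrix from vertex-exposure variables X[i] = edges to v_1, ..., v_{i-1}."""
    return [[(X[max(i, j)][min(i, j)] if i != j else False) for j in range(n)]
            for i in range(n)]

n, p = 8, 0.5
for _ in range(200):
    X = [list(rng.random(i) < p) for i in range(n)]      # vertex-exposure variables
    chi = chromatic_number(to_adj(X, n))
    i = int(rng.integers(1, n))                          # resample a single X_i ...
    X[i] = list(rng.random(i) < p)
    chi2 = chromatic_number(to_adj(X, n))
    assert abs(chi2 - chi) <= 1                          # ... chi changes by at most 1
```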

Example 8 (Concentration on hypercube) Let \(A\subseteq\{0,1\}^n\) be a set containing an \(\eps\) fraction of all points, i.e., \(|A|=\varepsilon\cdot 2^n\) for some \(\eps\in (0,1)\). Let \(A_r\) be the set of all points within a Hamming distance \(r\) of \(A\): \[\begin{align*} A_r:=\{x\in\{0,1\}^n:\inf_{a\in A}\|x-a\|_1\le r\}. \end{align*}\]

We want to show that for a suitably chosen \(r\), \(A_r\) contains almost all the points of the hypercube.

Let \(X=(X_1,\dots,X_n)\) be a random vertex chosen uniformly at random from \(\{0,1\}^n\). Define the function \(f(x)\) as the Hamming distance from \(x\) to \(A\): \[\begin{align*} f(x)=\inf_{a\in A}\|x-a\|_1. \end{align*}\]

By the triangle inequality, if two points \(x,y\) differ in only one coordinate, then \(|f(x)-f(y)|\le\|x-y\|_1=1\). Therefore, the function \(f\) is \(1\)-Lipschitz. Note that \(\set{X_i}_{1\leq i\leq n}\) are mutually independent. From the one-sided version of McDiarmid’s inequality, \[\begin{align*} \Pr{f(X)\le\E{f(X)}-t}\le e^{-\frac{2t^2}{n}}. \end{align*}\]

If we take \(t=\E{f(X)}\), we have \[\begin{align*} \varepsilon=\Pr{X\in A}=\Pr{f(X)=0}\leq \Pr{f(X)\le 0}\le e^{-\frac{2\E{f(X)}^2}{n}}. \end{align*}\] As a result, \(\E{f(X)}\le\sqrt{\frac{n}{2}\log\frac{1}{\varepsilon}}\), which means the average distance of \(X\) to \(A\) is small.

Using the other side of McDiarmid’s inequality, \[\begin{align*} \Pr{f(X)\ge\E{f(X)}+t}\le e^{-\frac{2t^2}{n}}, \end{align*}\]

and taking \(t=\sqrt{\frac{n}{2}\log\frac{1}{\varepsilon}}\), we have \[\begin{align*} \Pr{f(X)\ge\E{f(X)}+\sqrt{\frac{n}{2}\log\frac{1}{\varepsilon}}}\le \varepsilon. \end{align*}\]

Since \(\E{f(X)}\le\sqrt{\frac{n}{2}\log\frac{1}{\varepsilon}}\), choosing \(r = \sqrt{2n\log\frac{1}{\varepsilon}}\) gives \[\begin{align*} \Pr{X\notin A_r}&= \Pr{f(X)> \sqrt{2n\log\frac{1}{\varepsilon}}} \le \Pr{f(X)\ge \E{f(X)}+\sqrt{\frac{n}{2}\log\frac{1}{\varepsilon}}}\le \varepsilon. \end{align*}\] Therefore, for every \(\varepsilon\in(0,1)\) and every \(A\) with \(|A| =\varepsilon\cdot 2^n\), taking \(r=\sqrt{2n\log\frac{1}{\varepsilon}}\) yields \(|A_r|\ge(1-\varepsilon)2^n\).
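For small \(n\) this can be verified exhaustively (the sketch below uses \(n=10\), \(\varepsilon=0.05\), and a randomly chosen \(A\); the inequality of course holds for every \(A\) of that size).

```python
import itertools, math
import numpy as np

rng = np.random.default_rng(11)
n, eps = 10, 0.05
cube = np.array(list(itertools.product([0, 1], repeat=n)))           # all 2^n points
A = cube[rng.choice(len(cube), size=int(eps * 2**n), replace=False)]  # |A| ~ eps * 2^n

# f(x) = Hamming distance from x to A, computed for every point of the cube
dist = np.abs(cube[:, None, :] - A[None, :, :]).sum(axis=2).min(axis=1)

r = math.sqrt(2 * n * math.log(1 / eps))
print(dist.mean(), math.sqrt(n / 2 * math.log(1 / eps)))  # E[f(X)] <= sqrt((n/2) log(1/eps))
print(np.mean(dist > r), eps)                             # P[X not in A_r] <= eps
```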