$$ \def\*#1{\mathbf{#1}} \def\+#1{\mathcal{#1}} \def\-#1{\mathrm{#1}} \def\!#1{\mathsf{#1}} \def\@#1{\mathscr{#1}} \newcommand{\mr}[1]{\mbox{\scriptsize \color{RedViolet}$\triangleright\;$#1}\quad\quad} $$

Lecture 14: Sparsification

Authors

instructed by Chihao Zhang

scribed by Yuchen He

Part of this note follows the course material of James R. Lee.

The motivation for sparsification

Consider a matrix \(A \in \bb R^{m \times n}\) where \(m \gg n\). We can view \(A\) as a collection of \(m\) row vectors: \[ A=\left[\begin{matrix}a_1^{\top}\\\vdots \\ a_m^{\top} \end{matrix}\right], \] where each \(a_i\in \bb R^n\). Let \(b \in \mathbb{R}^m\) be a target vector. A fundamental problem is to solve the least squares regression: \[ \min_{x\in \bb R^n} \norm{Ax-b}_2. \] When \(m\) is extremely large, computing the exact solution becomes computationally prohibitive. This motivates the idea of sparsification.

We want to select a small subset of \(M\) rows from \(A\) (where \(n \le M \ll m\)) to form a smaller matrix \(\tilde{A} \in \mathbb{R}^{M \times n}\) that preserves the geometric structure of \(A\). Specifically, we desire that for all \(x \in \mathbb{R}^n\), \(\norm{Ax}_2\approx \norm{\tilde A x}_2\).

If such an \(\tilde{A}\) exists, we can solve the reduced problem: \[ \min_{x \in \bb{R}^n} \|\tilde{A}x - \tilde{b}\|_2, \] where \(\tilde{b} \in \mathbb{R}^M\) consists of the entries of \(b\) corresponding to the selected rows.
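To make the reduction concrete, here is a minimal NumPy sketch (sizes are illustrative; the rows are chosen uniformly at random purely as a placeholder, since how to choose them well is the subject of the rest of this lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, M = 10_000, 20, 200
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# Full problem: uses all m rows.
x_full, *_ = np.linalg.lstsq(A, b, rcond=None)

# Reduced problem: keep only M selected rows of A and the matching entries of b.
# (Uniform selection is only a placeholder here.)
S = rng.choice(m, size=M, replace=False)
x_red, *_ = np.linalg.lstsq(A[S], b[S], rcond=None)

print(np.linalg.norm(A @ x_full - b), np.linalg.norm(A @ x_red - b))
```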

We can abstract the linear regression example into a broader function approximation problem. Assume we have a global objective function \(F: \mathbb{R}^n \to \mathbb{R}\) that decomposes into a sum of \(m\) simpler functions: \[ F(x)=\sum_{i=1}^m f_i(x). \] Here \(m\) is very large, making the evaluation of \(F(x)\) expensive.

Our goal is to find a sparse approximation \(\tilde{F}\). Instead of summing all \(m\) terms, we assign a non-negative weight \(c_i \geq 0\) to each function \(f_i\) to construct: \[ \tilde F(x)=\sum_{i=1}^m c_if_i(x). \] We say the weight vector \(c = (c_1, \dots, c_m)\) is \(M\)-sparse if the number of non-zero entries is small, i.e., \(\|c\|_0 \le M\).

This framework unifies many important problems. For example, consider an undirected graph \(G=(V,E)\) with non-negative edge weights \(w_{i,j}\) for each edge \((i,j) \in E\). Without loss of generality, assume \(V=[n]\). Define the component function for each edge as: \[ f_{i,j}(x)=w_{i,j}\abs{x_i-x_j},\quad \mbox{for } x\in \set{0,1}^n \] Then, the global function \(F(x)\) represents the cut function: \[ F(x) = \sum_{(i,j) \in E} w_{i,j} |x_i - x_j| = \text{weight of the cut defined by } x. \] Minimizing \(F(x)\) over non-trivial binary vectors corresponds to finding the global minimum cut of the graph. If we can sparsify \(F(x)\), we effectively construct a sparse graph \(H\) (with fewer edges \(M \ll |E|\)) such that the value of every cut in \(H\) is approximately the same as in \(G\). This allows us to run expensive graph algorithms (like max-flow or min-cut) on the sparse graph \(H\), significantly reducing computational complexity from depending on \(|E|\) to depending on \(M\).
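As a tiny illustration of this identity, the following sketch evaluates the cut function on a made-up weighted graph (the edges and weights are hypothetical):

```python
# A toy check that F(x) = sum_{(i,j) in E} w_ij * |x_i - x_j| is the weight of the
# cut induced by the indicator vector x (the graph and weights below are made up).
edges = {(0, 1): 2.0, (1, 2): 1.5, (0, 3): 0.5, (2, 3): 3.0}   # w_{i,j} for (i,j) in E

def cut_weight(x):
    return sum(w * abs(x[i] - x[j]) for (i, j), w in edges.items())

x = [1, 1, 0, 0]          # S = {0, 1}: edges (1,2) and (0,3) cross the cut
print(cut_weight(x))      # 1.5 + 0.5 = 2.0
```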

In this lecture, we consider the linear model with \(\ell^2\)-loss. That is, we assume each \(f_i(x) = \inner{a_i}{x}^2\) for some vector \(a_i\in \bb R^n\).

Importance sampling

A natural first attempt at sparsification is to select \(M\) rows of \(A\) uniformly at random. However, this naive approach often fails.

Consider a matrix \(A\) where one row \(a_1\) has a very large norm, while all other rows \(a_i\) are negligible. The function \(F(x) = \|Ax\|^2\) is dominated by the term \(\langle a_1, x \rangle^2\). If we sample uniformly, the probability of picking \(a_1\) is small. Most of the time, we will miss \(a_1\) entirely, resulting in an approximation \(\tilde{F}(x) \approx 0\), which is a huge relative error. To capture the geometry of \(A\), we must bias our sampling towards important rows.

We assign a probability \(p_i\) to each index \(i \in [m]\) such that \(\sum p_i = 1\). Denote this distribution as \(\mu\). We select indices \(i_1, \dots, i_M\) independently according to \(\mu\). To ensure the approximation is correct in expectation, we must reweight the selected terms.

Define the single-sample estimator \[ \hat{F}_I(x) = \frac{f_I(x)}{p_I}, \quad \text{where } I \sim \mu. \]

It is easy to verify this is unbiased: \[ \E{\hat{F}_I(x)} = \sum_{i=1}^m p_i \frac{f_i(x)}{p_i} = \sum_{i=1}^m f_i(x) = F(x). \]

To reduce variance, we average \(M\) independent copies. Our final estimator is \[ \hat F(x) = \frac{1}{M}\sum_{k=1}^M \frac{f_{i_k}(x)}{p_{i_k}}. \]
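The following toy sketch illustrates the estimator for \(f_i(x)=\inner{a_i}{x}^2\). The matrix with one heavy row mirrors the example above, and the norm-proportional distribution is only one illustrative way to bias the sampling; the principled choice is derived in the next section:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, M = 5_000, 10, 400

A = 0.01 * rng.standard_normal((m, n))
A[0] = 10.0 * rng.standard_normal(n)     # one dominant row, as in the example above

x = rng.standard_normal(n)
f = (A @ x) ** 2                         # f_i(x) = <a_i, x>^2,  i = 1..m
F = f.sum()                              # F(x)

def estimate(p):
    """(1/M) * sum_k f_{i_k}(x) / p_{i_k}  with  i_1, ..., i_M ~ p  i.i.d."""
    idx = rng.choice(m, size=M, p=p)
    return np.mean(f[idx] / p[idx])

p_unif = np.full(m, 1.0 / m)                  # uniform sampling
p_heavy = np.linalg.norm(A, axis=1) ** 2      # bias towards rows with large norm
p_heavy /= p_heavy.sum()                      # (an illustrative heuristic only)

print(F, estimate(p_unif), estimate(p_heavy))  # both unbiased; the second has far smaller variance
```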

Statistical leverage scores

We choose \(p_i\propto \alpha_i\), where \[ \alpha_i\defeq \sup_{x: F(x)\neq 0} \frac{f_i(x)}{F(x)} = \sup_{x:F(x)\leq 1} f_i(x); \] the second equality holds in our \(\ell^2\)-loss case because both \(f_i\) and \(F\) are homogeneous of degree two. Intuitively, \(\alpha_i\) (often called the sensitivity of the \(i\)-th term) measures the worst-case importance of row \(i\). If \(\alpha_i\) is close to 1, there exists a direction \(x\) in which row \(i\) is the dominant contributor to the norm, so we must sample it.

For the matrix sparsification case where \(f_i(x) = \langle a_i, x \rangle^2\), these sensitivities have a closed-form solution known as leverage scores.

Assume without loss of generality that \(A\) has full rank \(n\) (otherwise, replace the inverse with the pseudoinverse in the following calculations). We have \[ \begin{align*} \alpha_i &=\sup_{x: \norm{Ax}^2\leq 1} \inner{a_i}{x}^2\\ &= \sup_{x: \inner{(A^{\top}A)^{\frac{1}{2}}x}{(A^{\top}A)^{\frac{1}{2}}x}\leq 1} \inner{a_i}{x}^2\\ \mr{$y=(A^{\top}A)^{\frac{1}{2}}x$} &= \sup_{y:\norm{y}\leq 1} \inner{a_i}{(A^{\top}A)^{-\frac{1}{2}}y}^2 \\ &=\norm{(A^{\top}A)^{-\frac{1}{2}} a_i}_2^2. \end{align*} \] This quantity, denoted as \(\sigma_i(A)\), is the \(i\)-th statistical leverage score of \(A\).

To determine the exact value of \(p_i\), we need to calculate the normalizing factor. We have \[ \begin{align*} \sum_{i=1}^m \sigma_i(A) &= \sum_{i=1}^m \inner{a_i}{(A^{\top}A)^{-1} a_i} \\ &=\!{Tr}\tp{(A^{\top}A)^{-1}\cdot \sum_{i=1}^m a_i a_i^{\top}}\\ \mr{$\sum_{i=1}^m a_ia_i^{\top}=A^{\top}A$} &= \!{Tr}\tp{(A^{\top}A)^{-1} \cdot A^{\top}A}\\ &=\!{Tr}(I_n)\\ &=n. \end{align*} \] This suggests \(p_i=\frac{a_i^{\top} \tp{A^{\top}A}^{-1}a_i}{n}=\frac{\sigma_i(A)}{n}\).
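As a quick numerical sanity check of this calculation (sizes are illustrative), one can compute the leverage scores directly and verify that they sum to \(n\):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 2_000, 8
A = rng.standard_normal((m, n))

# Leverage scores sigma_i(A) = a_i^T (A^T A)^{-1} a_i for every row a_i.
G_inv = np.linalg.inv(A.T @ A)
sigma = np.einsum('ij,jk,ik->i', A, G_inv, A)

print(sigma.sum())    # = n (here A has full column rank), matching the calculation above
p = sigma / n         # sampling probabilities p_i
```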

Using these probabilities, our sparse approximation \(\hat{F}(x)\) corresponds to a reweighted random matrix. Recall that
\[ F(x) = \sum_{i=1}^m \inner{a_i}{x}^2 = x^{\top} \tp{\sum_{i=1}^m a_i a_i^{\top}} x = x^\top (A^\top A) x. \] The estimator is: \[ \hat F(x) = \frac{1}{M}\sum_{k=1}^M \frac{\inner{a_{i_k}}{x}^2}{p_{i_k}} = x^{\top} \tp{\frac{1}{M}\sum_{k=1}^M \frac{a_{i_k} a_{i_k}^{\top}}{p_{i_k}}} x. \] Thus, constructing the sparse function \(\hat{F}\) is equivalent to constructing the random matrix \[ \hat{A}^\top \hat{A} \defeq \frac{1}{M} \sum_{k=1}^M \frac{a_{i_k} a_{i_k}^\top}{p_{i_k}}. \]
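A minimal NumPy sketch of this construction, assuming the leverage-score probabilities derived above (sizes and the number of test points are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, M = 5_000, 10, 600
A = rng.standard_normal((m, n))

# Leverage-score sampling probabilities p_i = sigma_i(A) / n, as derived above.
G = A.T @ A
p = np.einsum('ij,jk,ik->i', A, np.linalg.inv(G), A) / n

# hat{A}^T hat{A} = (1/M) sum_k a_{i_k} a_{i_k}^T / p_{i_k}.
idx = rng.choice(m, size=M, p=p)
gram_hat = (A[idx].T * (1.0 / (M * p[idx]))) @ A[idx]

# Compare the quadratic forms F(x) = x^T (A^T A) x and hat{F}(x) = x^T (hat{A}^T hat{A}) x.
for _ in range(3):
    x = rng.standard_normal(n)
    print(x @ G @ x, x @ gram_hat @ x)   # typically close, up to a small relative error
```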

There are two main approaches to analyse the approximation error:

  • Using matrix concentration bounds to analyse \(\hat{A}^\top \hat{A}\),
  • Applying Dudley’s theorem to bound \(\E{\sup_{x: F(x)\leq 1} |\hat{F}(x) - F(x)|}\).

To align with the topic of this course, we focus on the second approach.

Setting up Dudley’s theorem

We now use Dudley’s theorem to bound the expected supremum of the error.

Let \(\eps_1, \dots, \eps_M\) be independent Rademacher variables, uniformly distributed in \(\{\pm 1\}\). By the symmetrization result from the previous lecture, we can bound the deviation by the magnitude of the symmetrized (Rademacher) process: \[ \E{\sup_{x\in \Omega} \abs{F(x) - \hat F(x)}} \leq 2\E{\sup_{x\in \Omega} \abs{ x^{\top}\tp{\frac{1}{M}\sum_{k=1}^M \eps_k \frac{a_{i_k} a_{i_k}^{\top}}{p_{i_k}}} x }}, \] where \(\Omega=\set{x\in \bb R^n: F(x)\leq 1}\) is the unit ellipsoid defined by \(A^{\top}A\).

We fix the sampled indices \(\vec{i} = (i_1, \dots, i_M)\) and consider the randomness only over the signs \(\eps_1,\dots,\eps_M\). Define the process indexed by \(x\): \[ Z_x = \abs{ x^{\top}\tp{\frac{1}{M}\sum_{k=1}^M \eps_k \frac{a_{i_k} a_{i_k}^{\top}}{p_{i_k}}} x }. \] Consider the increment between two points \(x,y\): \[\begin{align*} Z_x-Z_y &\leq \abs{\frac{1}{M}\sum_{k=1}^M \eps_k \tp{\frac{\inner{a_{i_k}}{x}^2 - \inner{a_{i_k}}{y}^2}{p_{i_k}}}}\\ &= \abs{\frac{1}{M}\sum_{k=1}^M \eps_k \cdot \frac{\inner{a_{i_k}}{x+y}\cdot\inner{a_{i_k}}{x-y}}{p_{i_k}}}. \end{align*}\] Conditioned on \(\vec{i}\), this is the absolute value of a Rademacher sum with fixed coefficients, so by Hoeffding's lemma the increment \(Z_x - Z_y\) is sub-Gaussian with variance proxy \[ \sigma_{x,y}^2\leq \frac{1}{M^2}\sum_{k=1}^M \frac{\inner{a_{i_k}}{x+y}^2\inner{a_{i_k}}{x-y}^2}{p_{i_k}^2}. \]

Furthermore, \[ \begin{align*} \sigma^2_{x,y} &\leq \max_{k\in[M]} \frac{\inner{a_{i_k}}{x-y}^2}{p_{i_k}} \cdot \frac{1}{M^2}\cdot \sum_{k=1}^M \frac{\inner{a_{i_k}}{x+y}^2}{p_{i_k}} \\ \mr{$\frac{x+y}{2}\in \Omega$}&\leq \max_{k\in[M]} \frac{\inner{a_{i_k}}{x-y}^2}{p_{i_k}} \cdot \frac{4}{M^2}\cdot \sup_{z\in \Omega} \sum_{k=1}^M \frac{\inner{a_{i_k}}{z}^2}{p_{i_k}}\\ &= \max_{k\in[M]} \frac{\inner{a_{i_k}}{x-y}^2}{p_{i_k}} \cdot \frac{4}{M}\cdot \sup_{z\in \Omega} \hat F(z). \end{align*} \]

The bound on \(\sigma^2_{x,y}\) naturally suggests a specific distance metric on \(\Omega\). For any \(x,y\in \Omega\), define the distance \[ d_{\infty}^{\vec{i}} (x,y) = \max_{k\in[M]} \abs{\frac{\inner{a_{i_k}}{x-y}}{\sqrt{p_{i_k}}}}. \] Similarly, we define the scaling factor \(\sigma\) corresponding to the “radius” of the process: \[ \sigma^2 = \frac{4}{M}\cdot \sup_{x\in \Omega} \hat F(x). \] With these definitions, we have shown that \(\sigma^2_{x,y}\leq \sigma^2\cdot d_{\infty}^{\vec{i}}(x,y)^2\), i.e., conditioned on \(\vec i\), the process \((Z_x)\) has sub-Gaussian increments with respect to the metric \(\sigma\cdot d_{\infty}^{\vec{i}}\). Therefore, by Dudley’s theorem, the expected supremum (over the signs, for fixed \(\vec i\)) is bounded by: \[ \E{\sup_{x \in \Omega} Z_x} \lesssim \sigma \int_0^{\infty} \sqrt{\log N(\Omega, d_{\infty}^{\vec{i}}, \varepsilon)} \, \mathrm{d}\varepsilon. \]

The problem is now reduced to calculating the covering number \(N\tp{\Omega, d_{\infty}^{\vec{i}}, \eps}\).

The covering number

We rely on the following lemma, which bounds the complexity of covering a Euclidean ball with polytopes defined by linear constraints. We will prove this lemma in the next lecture.

Lemma 1 (Covering lemma) Let \(U \in \mathbb{R}^{M \times n}\) be a matrix with rows \(u_1^\top, \dots, u_M^\top\). Define the distance \[ d_U(x,y) = \norm{U(x-y)}_{\infty} = \max_{k \in [M]} |\langle u_k, x-y \rangle|. \] Then the covering number of the Euclidean unit ball \(B_2^n\) satisfies \[ \log N\tp{B_2^n, d_U, \eps}\leq \frac{\log M}{\eps^2}\cdot \max_{k\in[M]}\norm{u_k}^2_2. \]

Our domain \(\Omega = \{x : \|Ax\|_2 \le 1\}\) is an ellipsoid. To apply the lemma, we apply a change of variables to map \(\Omega\) to the standard unit ball \(B_2^n\). Let \(y = (A^{\top}A)^{1/2} x\). Then \[ F(x) = x^\top A^\top A x = y^\top y = \|y\|_2^2. \] Thus, \(x \in \Omega \iff y \in B_2^n\). Substituting \(x = (A^{\top}A)^{-1/2} y_x\) and \(z = (A^{\top}A)^{-1/2} y_z\) into our distance metric \(d_{\infty}^{\vec{i}}\), \[ \begin{align*} d_{\infty}^{\vec{i}}(x,z) &= \max_{k\in[M]} \left| \frac{\langle a_{i_k}, x-z \rangle}{\sqrt{p_{i_k}}} \right| \\ &= \max_{k\in[M]} \left| \left\langle \frac{a_{i_k}}{\sqrt{p_{i_k}}}, (A^{\top}A)^{-1/2} (y_x - y_z) \right\rangle \right| \\ &= \max_{k\in[M]} \left| \left\langle (A^{\top}A)^{-1/2} \frac{a_{i_k}}{\sqrt{p_{i_k}}}, y_x - y_z \right\rangle \right|. \end{align*} \] This is exactly the form \(d_U(y_x, y_z)\) if we define the matrix \(U\) with rows \[ u_k = (A^{\top}A)^{-1/2} \frac{a_{i_k}}{\sqrt{p_{i_k}}}. \] Crucially, the squared norm of these rows is determined by our leverage score sampling probabilities. Recall that \(p_{i_k} = \frac{\|(A^\top A)^{-1/2} a_{i_k}\|^2}{n}\). Therefore, \[ \|u_k\|_2^2 = \frac{1}{p_{i_k}} \|(A^{\top}A)^{-1/2} a_{i_k}\|_2^2 = n. \]

Applying the covering lemma with \(\max \|u_k\|^2 = n\), we obtain our first bound \[ \log N\tp{\Omega, d_{\infty}^{\vec{i}}, \eps} = \log N(B_2^n, d_U,\eps)\lesssim \frac{n\log M}{\eps^2}. \]
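As a numerical sanity check of the identity \(\norm{u_k}_2^2=n\) used above (sizes are illustrative; \((A^{\top}A)^{-1/2}\) is computed via an eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(4)
m, n = 1_000, 6
A = rng.standard_normal((m, n))

# (A^T A)^{-1/2} via an eigendecomposition of the Gram matrix.
w, V = np.linalg.eigh(A.T @ A)
G_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T

sigma = np.einsum('ij,jk,ik->i', A, V @ np.diag(1.0 / w) @ V.T, A)   # leverage scores
p = sigma / n

# Rows u_k = (A^T A)^{-1/2} a_{i_k} / sqrt(p_{i_k}) all have squared norm n.
idx = rng.choice(m, size=50, p=p)
U = (G_inv_sqrt @ A[idx].T).T / np.sqrt(p[idx])[:, None]
print(np.allclose(np.sum(U ** 2, axis=1), n))   # True
```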

A bound for small \(\eps\)

While the bound above is effective at large scales, substituting it into Dudley’s integral \(\int \sqrt{\log N}\ \d\eps\) yields an integrand of order \(\frac{\sqrt{n\log M}}{\eps}\), which diverges near zero. We need a tighter bound for very small \(\eps\).

For small scales, we use a simple volumetric argument. We observe that our random metric is dominated by \(\sqrt{n}\cdot\norm{A(x-y)}_2\). For any \(x, y \in \Omega\): \[ \begin{align*} d_{\infty}^{\vec{i}}(x,y) &= \max_{k\in[M]} \frac{\abs{\inner{a_{i_k}}{x-y}}}{\sqrt{p_{i_k}}}\\ &= \max_{k\in[M]} \frac{\abs{\inner{\tp{A^{\top}A}^{\frac{1}{2}}(x-y)}{\tp{A^{\top}A}^{-\frac{1}{2}}a_{i_k}}}}{\sqrt{p_{i_k}}} \\ \mr{Cauchy-Schwarz}&\leq \max_{k\in[M]} \frac{\sqrt{\inner{x-y}{A^{\top}A(x-y)}}\cdot \sqrt{\inner{a_{i_k}}{\tp{A^{\top}A}^{-1}a_{i_k}}}}{\sqrt{p_{i_k}}} \\ \mr{$p_{i_k} = \frac{\|(A^\top A)^{-1/2} a_{i_k}\|^2}{n}$}&= \sqrt{n}\cdot \norm{A(x-y)}_2 \\ &\leq 2\sqrt{n}. \end{align*} \] Thus, the diameter of \(\Omega\) in this metric is at most \(2\sqrt{n}\). Therefore, \[ \Omega \subseteq 2\sqrt{n}\cdot B_{\infty}^{\vec i}, \] where \(B_{\infty}^{\vec i}=\set{x: d_{\infty}^{\vec{i}}(x,0)\leq 1}\).

In the coordinates \(y=(A^{\top}A)^{1/2}x\), the bound above reads \(d_{\infty}^{\vec{i}}(x,z)\leq \sqrt{n}\cdot\norm{y_x-y_z}_2\). Hence, to cover \(\Omega\) with \(d_{\infty}^{\vec{i}}\)-balls of radius \(\eps\), it suffices to cover the Euclidean ball \(\set{y: \norm{y}_2\leq \sqrt{n}}\) with Euclidean balls of radius \(\eps\) (equivalently, \(B_2^n\) with balls of radius \(\eps/\sqrt{n}\)).

Therefore, \[ N\tp{\Omega, d_{\infty}^{\vec{i}}, \eps} \leq \tp{\frac{c\sqrt{n}}{\eps}}^n \] for some universal constant \(c\).

Applying Dudley’s theorem

We are now ready to combine our covering number bounds and apply Dudley’s theorem to derive the final error bound. Recall Dudley’s bound: \[ \begin{align*} \E{\sup_{x \in \Omega} Z_x} &\lesssim \E{\sigma} \int_0^{\infty} \sqrt{\log N(\Omega, d_{\infty}^{\vec{i}}, \varepsilon)} \ \dd \varepsilon\\ &\lesssim \frac{1}{\sqrt{M}} \cdot \E{\sqrt{\sup_{x\in \Omega} \hat F(x)}} \cdot \int_0^{\infty} \sqrt{\log N(\Omega, d_{\infty}^{\vec{i}}, \varepsilon)} \ \dd \varepsilon. \end{align*} \] Note that the diameter of \(\Omega\) is at most \(2\sqrt{n}\); therefore, for \(\eps>2\sqrt{n}\), \(N(\Omega, d_{\infty}^{\vec{i}}, \varepsilon)=1\). Without loss of generality, assume the choice of \(M\) satisfies \(\log M<2n\), so that the split point \(r=\frac{\log M}{\sqrt{n}}\) used below lies inside the integration range. Therefore, \[ \begin{align*} \E{\sup_{x\in \Omega} Z_x} &\lesssim \frac{\E{\sqrt{\sup_{x\in \Omega} \hat F(x)}}}{\sqrt{M}} \cdot \sqrt{n}\int_{0}^{2\sqrt{n}} \sqrt{\log \frac{\sqrt{n}}{\eps} \wedge \frac{\log M}{\eps^2}}\ \dd \eps\\ \mr{$\log t\leq t$}&\lesssim \frac{\E{\sqrt{\sup_{x\in \Omega} \hat F(x)}}}{\sqrt{M}} \cdot \sqrt{n}\int_{0}^{2\sqrt{n}} \sqrt{\frac{\sqrt{n}}{\eps} \wedge \frac{\log M}{\eps^2}}\ \dd \eps\\ \mr{$r=\frac{\log M}{\sqrt{n}}$}&= \E{\sqrt{\sup_{x\in \Omega} \hat F(x)}} \cdot \sqrt{\frac{n}{M}}\cdot \tp{ \int_0^{r}\frac{n^{\frac{1}{4}}}{\sqrt{\eps}}\ \dd \eps + \int_r^{2\sqrt{n}} \frac{\sqrt{\log M}}{\eps} \ \dd \eps}\\ &\lesssim \E{\sqrt{\sup_{x\in \Omega} \hat F(x)}} \cdot \underbrace{\sqrt{\frac{n}{M}}\cdot \sqrt{\log M}\cdot \log n}_{\theta}. \end{align*} \] Let \(S=\E{\sup_{x\in \Omega} \abs{F(x)-\hat F(x)}}\). We have established that \[ \begin{align*} S&\lesssim \theta \cdot \E{\sqrt{\sup_{x\in \Omega} \hat F(x)}} \\ \mr{$\sup_{\Omega}\hat F\leq 1+\sup_{\Omega}\abs{F-\hat F}$}&\leq \theta \cdot\E{\sqrt{1+ \sup_{x\in \Omega} \abs{F(x)-\hat F(x)}}}\\ \mr{Jensen's inequality} &\leq \theta \cdot \tp{1+S}^{\frac{1}{2}}. \end{align*} \] Solving this inequality (spelled out below), we conclude that, provided \(\theta\lesssim 1\), \[ S= \E{\sup_{x\in \Omega} \abs{F(x)-\hat F(x)}} \lesssim \theta =\sqrt{\frac{n}{M}}\cdot \sqrt{\log M}\cdot \log n. \]
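For completeness, the quadratic inequality in \(S\) can be solved explicitly; this is a routine computation, valid under the assumption \(\theta\lesssim 1\) (the regime in which the bound is meaningful): \[ S \le \theta\tp{1+S}^{\frac{1}{2}} \;\Longrightarrow\; S^2-\theta^2 S-\theta^2\le 0 \;\Longrightarrow\; S \le \frac{\theta^2+\theta\sqrt{\theta^2+4}}{2} \lesssim \theta \quad \text{whenever } \theta\lesssim 1. \]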