
# Theoretical Analysis and Practice of Training GANs (Wasserstein GAN)

2020-12-28 · Jarvis

## A. (ICLR 2017) Towards Principled Methods for Training Generative Adversarial Networks

The overall objective function of a GAN is:

$\min_G\max_D V(D,G)=\mathbb{E}_{\mathbf{x}\sim p_{\text{data}}(\mathbf{x})}[\log D(\mathbf{x})]+\mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}(\mathbf{z})}[\log(1-D(G(\mathbf{z})))].$

In practice, the minimax game is split into two alternating steps, one for the discriminator and one for the generator:

$\max_{D} \mathbb{E}_{\mathbf{x}\sim p_{\text{data}}(\mathbf{x})}[\log D(\mathbf{x})]+\mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}(\mathbf{z})}[\log (1-D(G(\mathbf{z})))]$

$\min_{G} \mathbb{E}_{\mathbf{z}\sim p_{\mathbf{z}}(\mathbf{z})}[\log (1-D(G(\mathbf{z})))]$
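The two steps above are implemented as alternating gradient updates. Below is a minimal PyTorch sketch under illustrative assumptions (2-D toy data, small MLPs, Adam with arbitrary learning rates; none of these choices are prescribed by the papers):

```python
import torch
import torch.nn as nn

# Toy 2-D data and small MLPs; all sizes and learning rates are
# illustrative assumptions, not values from the papers.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
D = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
bce = nn.BCELoss()

def gan_step(real: torch.Tensor):
    """One alternating update; `real` is a (batch, 2) tensor of data samples."""
    batch = real.size(0)

    # Discriminator step: maximize E[log D(x)] + E[log(1 - D(G(z)))],
    # i.e. minimize the binary cross-entropy with labels real=1, fake=0.
    fake = G(torch.randn(batch, 64)).detach()
    loss_D = bce(D(real), torch.ones(batch, 1)) + bce(D(fake), torch.zeros(batch, 1))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # Generator step: minimize E[log(1 - D(G(z)))] (the saturating loss above).
    loss_G = -bce(D(G(torch.randn(batch, 64))), torch.zeros(batch, 1))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```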

### 1. Sources of Instability

#### 1.1 The Perfect Discriminator

If two distributions $\mathbb{P}_r$ and $\mathbb{P}_g$ have support contained on two disjoint compact subsets $\mathcal{M}$ and $\mathcal{P}$ respectively, then there is a smooth optimal discriminator $D^*: \mathcal{X} \rightarrow [0,1]$ that has accuracy 1 and $\nabla_xD^*(x)=0$ for all $x\in \mathcal{M}\cup\mathcal{P}.$
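This is easy to reproduce empirically: a small classifier fit on two disjoint clusters reaches accuracy 1 while its input gradient on the data goes to 0. A hedged sketch (the 1-D clusters, architecture, and step count are arbitrary choices):

```python
import torch
import torch.nn as nn

# Real data near -2, generated data near +2: two disjoint 1-D "manifolds".
real = torch.randn(512, 1) * 0.1 - 2.0
fake = torch.randn(512, 1) * 0.1 + 2.0
x = torch.cat([real, fake])
y = torch.cat([torch.ones(512, 1), torch.zeros(512, 1)])

D = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for _ in range(2000):  # train D close to the optimal D*
    opt.zero_grad()
    bce(D(x), y).backward()
    opt.step()

acc = ((D(x) > 0.5).float() == y).float().mean().item()
xg = x.clone().requires_grad_(True)
D(xg).sum().backward()
# As D -> D*: accuracy -> 1 while the input gradient vanishes on both supports.
print(f"accuracy={acc:.3f}, mean |grad_x D(x)|={xg.grad.abs().mean().item():.2e}")
```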

Let $\mathbb{P}_r$ and $\mathbb{P}_g$ be two distributions whose support lies in two manifolds $\mathcal{M}$ and $\mathcal{P}$ that don't have full dimension and don't perfectly align. We further assume that $\mathbb{P}_r$ and $\mathbb{P}_g$ are continuous in their respective manifolds. Then, \begin{align} JSD(\mathbb{P}_r\Vert\mathbb{P}_g) &= \log2 \nonumber \\ KL(\mathbb{P}_r\Vert\mathbb{P}_g) &= +\infty \nonumber \\ KL(\mathbb{P}_g\Vert\mathbb{P}_r) &= +\infty \nonumber \end{align}

#### 1.2 Conclusions: Problems with the Loss Function

Let $g_{\theta}:\mathcal{Z}\rightarrow\mathcal{X}$ be a differentiable function that induces a distribution $\mathbb{P}_g$. Let $\mathbb{P}_r$ be the real data distribution. Let $D$ be a differentiable discriminator. If the conditions of Theorems 1 or 2 are satisfied, $\Vert D-D^*\Vert<\epsilon$, and $\mathbb{E}_{z\sim p(z)}[\Vert J_{\theta}g_{\theta}(z)\Vert_2^2]\leq M^2$, then $\Vert \nabla_{\theta}\mathbb{E}_{z\sim p(z)}[\log(1-D(g_{\theta}(z)))]\Vert_2<M\frac{\epsilon}{1-\epsilon}$

Under the same assumptions as the theorem above, $\lim_{\Vert D-D^*\Vert\rightarrow0}\nabla_{\theta}\mathbb{E}_{z\sim p(z)}[\log(1-D(g_{\theta}(z)))]=0$

Let $\mathbb{P}_r$ and $\mathbb{P}_g$ be two continuous distributions, with densities $P_r$ and $P_{g_{\theta}}$ respectively. Let $D^*=\frac{P_r}{P_{g_{\theta_0}}+P_r}$ be the optimal discriminator, fixed for a value $\theta_0$. Therefore, $\mathbb{E}_{z\sim p(z)}[-\nabla_{\theta}\log D^*(g_{\theta}(z))\vert_{\theta=\theta_0}]=\nabla_{\theta}[KL(\mathbb{P}_{g_{\theta}}\Vert\mathbb{P}_r)-2JSD(\mathbb{P}_{g_{\theta}}\Vert\mathbb{P}_r)]\vert_{\theta=\theta_0}$

This theorem says that minimizing the popular $-\log D$ generator loss amounts to minimizing $KL(\mathbb{P}_{g_{\theta}}\Vert\mathbb{P}_r)-2JSD(\mathbb{P}_{g_{\theta}}\Vert\mathbb{P}_r)$, which is problematic in two ways:

1. Minimizing this generator loss pushes the JSD between the two distributions up (note its negative sign), which runs against our optimization goal.
2. It minimizes the KL divergence $KL(\mathbb{P}_{g_{\theta}}\Vert\mathbb{P}_r)$. As is well known, this KL assigns a very high cost to generating implausible samples, but a very low cost to mode dropping (both generator losses are sketched in the code below).
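For reference, here is a minimal, hedged sketch of the two generator-loss variants discussed above (the `eps` term is a numerical-safety detail, not part of the papers):

```python
import torch

def generator_loss(d_fake: torch.Tensor, variant: str = "non_saturating",
                   eps: float = 1e-8) -> torch.Tensor:
    """d_fake holds discriminator outputs D(G(z)) in (0, 1).

    'saturating'     minimizes E[log(1 - D(G(z)))]; its gradient vanishes
                     as D approaches the optimal D* (theorems above).
    'non_saturating' minimizes E[-log D(G(z))]; its gradient equals that of
                     KL(P_g || P_r) - 2 JSD(P_g || P_r), hence the JSD-sign
                     and mode-dropping issues just listed.
    """
    if variant == "saturating":
        return torch.log1p(-d_fake + eps).mean()
    return -torch.log(d_fake + eps).mean()
```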

Let $g_{\theta}:\mathcal{Z}\rightarrow\mathcal{X}$ be a differentiable function that induces a distribution $\mathbb{P}_g$. Let $\mathbb{P}_r$ be the real data distribution, with either conditions of Theorems 1 or 2 satisfied. Let $D$ be a discriminator such that $D^* - D=\epsilon$ is a centered Gaussian process indexed by $x$ and independent for every $x$ (popularly known as white noise) and $\nabla_xD^*-\nabla_xD=r$ another independent centered Gaussian process indexed by $x$ and independent for every $x$. Then, each coordinate of $\mathbb{E}_{z\sim p(z)}[-\nabla_{\theta}\log D(g_{\theta}(z))]$ is a centered Cauchy distribution with infinite expectation and variance.

### 2. Softer Metrics and Distributions

If $X$ has distribution $\mathbb{P}_X$ with support on $\mathcal{M}$ and $\epsilon$ is an absolutely continuous distribution with density $P_{\epsilon}$, then $\mathbb{P}_{X+\epsilon}$ is absolutely continuous with density \begin{align} P_{X+\epsilon}(x) &=\mathbb{E}_{y\sim \mathbb{P}_X}[P_{\epsilon}(x-y)] \\ &=\int_{\mathcal{M}}P_{\epsilon}(x-y) d\mathbb{P}_X(y) \end{align}
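The identity is straightforward to sanity-check by Monte Carlo. In the sketch below, $\mathbb{P}_X$ is supported on the two points $\{-1,+1\}$ and $\epsilon$ is Gaussian; both are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 0.5

# P_X supported on the two-point "manifold" {-1, +1} with equal mass.
y = rng.choice(np.array([-1.0, 1.0]), size=100_000)  # y ~ P_X

x = 0.3  # an arbitrary query point
mc = norm.pdf(x - y, scale=sigma).mean()             # E_{y~P_X}[P_eps(x - y)]
exact = 0.5 * (norm.pdf(x - 1.0, scale=sigma) + norm.pdf(x + 1.0, scale=sigma))
print(f"Monte Carlo: {mc:.4f}   exact mixture density: {exact:.4f}")
# P_{X+eps} is positive everywhere, so the disjoint-support pathologies vanish.
```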

Let $\mathbb{P}_r$ and $\mathbb{P}_g$ be two distributions with support on $\mathcal{M}$ and $\mathcal{P}$ respectively, with $\epsilon\sim\mathcal{N}(0, \sigma^2I)$. Then, the gradient passed to the generator has the form
\begin{align} \nabla_{\theta}\mathbb{E}_{z\sim p(z)}&[\log(1-D^*(g_{\theta}(z)))] \nonumber \\ =\mathbb{E}_{z\sim p(z)}\Big[&a(z)\int_{\mathcal{M}}P_{\epsilon}(g_{\theta}(z)-y)\nabla_{\theta}\Vert g_{\theta}(z)-y\Vert^2 \,\mbox{d}\mathbb{P}_r(y) \nonumber \\ &-b(z)\int_{\mathcal{P}}P_{\epsilon}(g_{\theta}(z)-y)\nabla_{\theta}\Vert g_{\theta}(z)-y\Vert^2 \,\mbox{d}\mathbb{P}_g(y)\Big] \nonumber \end{align} where $a(z)$ and $b(z)$ are positive functions. Furthermore, $b>a$ if and only if $P_{r+\epsilon}>P_{g+\epsilon}$, and $b<a$ if and only if $P_{r+\epsilon}<P_{g+\epsilon}$.

If $\epsilon$ is a random vector with mean 0, then we have $W(\mathbb{P}_X,\mathbb{P}_{X+\epsilon})\leq V^{1/2}$ where $V=\mathbb{E}[\Vert\epsilon\Vert_2^2]$ is the variance of $\epsilon$ .

We recall the definition of the Wasserstein metric $W(P,Q)$ for $P$ and $Q$ two distributions over $\mathcal{X}$. Namely, $W(P,Q)=\inf_{\gamma\in\Gamma}\int_{\mathcal{X}\times\mathcal{X}}\Vert x-y\Vert_2d\gamma(x,y)$ where $\Gamma$ is the set of all possible joints on $\mathcal{X} \times \mathcal{X}$ that have marginals $P$ and $Q$ .
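In 1-D, both the definition and the $V^{1/2}$ bound above can be checked empirically with `scipy.stats.wasserstein_distance` (the sample distribution and noise scale are arbitrary choices):

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
x = rng.choice(np.array([-1.0, 1.0]), size=50_000)  # samples from P_X
eps = rng.normal(0.0, 0.3, size=x.shape)            # zero-mean noise

w = wasserstein_distance(x, x + eps)                # empirical W(P_X, P_{X+eps})
bound = np.sqrt(np.mean(eps ** 2))                  # V^{1/2}, V = E[||eps||^2]
print(f"W = {w:.4f}  <=  V^(1/2) = {bound:.4f}")
```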

## B. (ICML 2017) Wasserstein generative adversarial networks

### 1. Several Distance Functions

• Total Variation distance

$\delta(\mathbb{P}_r,\mathbb{P}_g)=\sup_{A\in\Sigma}\vert\mathbb{P}_r(A)-\mathbb{P}_g(A)\vert.$

• Kullback-Leibler (KL) divergence

$KL(\mathbb{P}_r,\mathbb{P}_g)=\int\log\left(\frac{P_r(x)}{P_g(x)}\right)P_r(x)d\mu(x),$

where $\mathbb{P}_r$ and $\mathbb{P}_g$ are both assumed to admit densities with respect to a measure $\mu$ on the space $\mathcal{X}$. The KL divergence is asymmetric and can take the value $+\infty$.

• Jensen-Shannon (JS) divergence

$JS(\mathbb{P}_r,\mathbb{P}_g)=\frac{1}{2}KL(\mathbb{P}_r,\mathbb{P}_m)+\frac{1}{2}KL(\mathbb{P}_g,\mathbb{P}_m),$

where $\mathbb{P}_m=(\mathbb{P}_r+\mathbb{P}_g)/2$; this mixture measure always exists, and the JS divergence is symmetric.

• Earth-Mover (EM) distance

$W(\mathbb{P}_r,\mathbb{P}_g)=\inf_{\gamma\in\Pi(\mathbb{P}_r,\mathbb{P}_g)}\mathbb{E}_{(x,y)\sim\gamma}[\Vert x-y\Vert],$

where $\Pi(\mathbb{P}_r,\mathbb{P}_g)$ denotes the set of all joint distributions $\gamma(x,y)$ whose marginals are $\mathbb{P}_r$ and $\mathbb{P}_g$, respectively.

Let $Z\sim U[0, 1]$ be the uniform distribution on the unit interval. Let $\mathbb{P}_0$ be the distribution of $(0,Z)\in\mathbb{R}^2$ (a 0 on the x-axis and the random variable $Z$ on the y-axis), uniform on a straight vertical line passing through the origin. Now let $g_{\theta}(z)=(\theta,z)$ with $\theta$ a single real parameter. It is easy to see that in this case (a short derivation of the EM value follows the list),

• $W(\mathbb{P}_0,\mathbb{P}_{\theta})=\vert\theta\vert$
• $JS(\mathbb{P}_0,\mathbb{P}_{\theta})=\begin{cases} \log2\qquad \mbox{if } \theta\neq0, \\ 0 \qquad \mbox{if } \theta=0, \end{cases}$
• $KL(\mathbb{P}_0\Vert\mathbb{P}_{\theta})=KL(\mathbb{P}_{\theta}\Vert\mathbb{P}_0)=\begin{cases} +\infty\qquad \mbox{if } \theta\neq0, \\ 0\qquad \mbox{if } \theta=0, \end{cases}$
• and $\delta(\mathbb{P}_0,\mathbb{P}_{\theta})=\begin{cases} 1\qquad \mbox{if } \theta\neq0, \\ 0\qquad \mbox{if } \theta=0, \end{cases}$
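To see why only the EM distance behaves well here, note that the coupling which keeps the $z$-coordinate fixed transports $\mathbb{P}_0$ onto $\mathbb{P}_{\theta}$, so

$W(\mathbb{P}_0,\mathbb{P}_{\theta})\leq\mathbb{E}_{z\sim U[0,1]}[\Vert(0,z)-(\theta,z)\Vert]=\vert\theta\vert,$

and since every coupling must move each unit of mass a horizontal distance of at least $\vert\theta\vert$ (the two lines are parallel), the bound is tight. Unlike JS, KL, and $\delta$, which jump discontinuously at $\theta=0$, the EM distance decreases smoothly as $\theta\rightarrow0$ and therefore provides a usable learning signal for gradient descent on $\theta$.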

Let $\mathbb{P}_r$ be a fixed distribution over $\mathcal{X}$. Let $Z$ be a random variable (e.g. Gaussian) over another space $\mathcal{Z}$. Let $\mathbb{P}_{\theta}$ denote the distribution of $g_{\theta}(Z)$ where $g:(z,\theta)\in\mathcal{Z}\times\mathbb{R}^d\rightarrow g_{\theta}(z)\in\mathcal{X}$. Then,

1. If $g$ is continuous in $\theta$, so is $W(\mathbb{P}_r,\mathbb{P}_{\theta})$.
2. If $g$ is locally Lipschitz and satisfies regularity assumption 1, then $W(\mathbb{P}_r,\mathbb{P}_{\theta})$ is continuous everywhere, and differentiable almost everywhere.
3. Statements 1-2 are false for the Jensen-Shannon divergence $JS(\mathbb{P}_r,\mathbb{P}_{\theta})$ and all the KLs.

Let $\mathbb{P}$ be a distribution on a compact space $\mathcal{X}$ and $(\mathbb{P}_n)_{n\in\mathbb{N}}$ be a sequence of distributions on $\mathcal{X}$. Then, considering all limits as $n\rightarrow\infty$,

1. The following statements are equivalent
• $\delta(\mathbb{P}_n,\mathbb{P})\rightarrow0$ with $\delta$ the total variation distance.
• $JS(\mathbb{P}_n,\mathbb{P})\rightarrow0$ with JS the Jensen-Shannon divergence.
2. The following statements are equivalent
• $W(\mathbb{P}_n,\mathbb{P})\rightarrow0$
• $\mathbb{P}_n\overset{\mathcal{D}}{\rightarrow}\mathbb{P}$ where $\overset{\mathcal{D}}{\rightarrow}$ represents convergence in distribution for random variables.
3. $KL(\mathbb{P}_n\Vert\mathbb{P})\rightarrow 0$ or $KL(\mathbb{P}\Vert\mathbb{P}_n)\rightarrow 0$ imply the statements in (1).
4. The statements in (1) imply the statements in (2).

### 2. Wasserstein GAN

Let $\mathbb{P}_r$ be any distribution. Let $\mathbb{P}_{\theta}$ be the distribution of $g_{\theta}(Z)$ with $Z$ a random variable with density $p$ and $g_{\theta}$ a function satisfying assumption 1. Then, there is a solution $f:\mathcal{X}\rightarrow\mathbb{R}$ to the problem $\max_{\Vert f\Vert_L\leq1}\mathbb{E}_{x\sim\mathbb{P}_r}[f(x)]-\mathbb{E}_{x\sim\mathbb{P}_{\theta}}[f(x)]$ and we have $\nabla_{\theta}W(\mathbb{P}_r,\mathbb{P}_{\theta}) =-\mathbb{E}_{z\sim p(z)}[\nabla_{\theta}f(g_{\theta}(z))]$ when both terms are well-defined.
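This theorem is what turns the EM distance into a trainable objective: clip the critic's weights to a compact set so that $f_w$ is $K$-Lipschitz for some $K$, train the critic close to optimality, then follow $-\nabla_{\theta}f_w(g_{\theta}(z))$. A minimal PyTorch sketch of this procedure (RMSProp with learning rate $5\times10^{-5}$, $n_{\text{critic}}=5$, and clipping constant $0.01$ follow the paper's defaults; the toy 2-D networks and batch size are assumptions):

```python
import torch
import torch.nn as nn

# Critic f_w has no sigmoid: it outputs a real-valued score.
critic = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 1))
gen = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))

opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)  # paper's default
opt_g = torch.optim.RMSprop(gen.parameters(), lr=5e-5)
n_critic, clip = 5, 0.01                                   # paper's defaults

def wgan_step(sample_real):  # sample_real() -> (batch, 2) data tensor
    # 1) Train the critic n_critic times: maximize E[f(x)] - E[f(g(z))].
    for _ in range(n_critic):
        real, z = sample_real(), torch.randn(64, 64)
        loss_c = critic(gen(z).detach()).mean() - critic(real).mean()
        opt_c.zero_grad(); loss_c.backward(); opt_c.step()
        for p in critic.parameters():            # enforce a compact weight set
            p.data.clamp_(-clip, clip)

    # 2) Generator step: minimize -E[f(g(z))], i.e. descend the W estimate.
    z = torch.randn(64, 64)
    loss_g = -critic(gen(z)).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return -loss_c.item()  # rough estimate of W(P_r, P_theta), up to the factor K
```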

## References

1. Martin Arjovsky, Léon Bottou. Towards Principled Methods for Training Generative Adversarial Networks. In ICLR 2017.

2. Martin Arjovsky, Soumith Chintala, Léon Bottou. Wasserstein Generative Adversarial Networks. In ICML 2017.
