Mathematical background of GANs


Let’s take a look at the math behind this process to get a better understanding of the mechanism. Let G and D represent the generator and discriminator networks, respectively, and let V represent the performance criterion of the system. The optimization objective is described as follows:
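$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$

This is the minimax objective introduced in the original GAN paper by Goodfellow et al., where the expectations are taken over the real data distribution and the noise prior, respectively.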


x: real data (e.g., real images)

x* = G(z): fake data generated by G from random noise z

D(x): probability that the discriminator thinks x is real

D(x*): probability that the discriminator thinks the fake data is real


D(x) gives a number between 0 and 1: the probability that x is real. Likewise, D(x*) is the probability that a fake sample is mistaken for real.

Consequently, 1 - D(x*) is the probability that the fake sample is correctly detected as fake.

Global maxima and minima


As mentioned before, the goal of the discriminator, D, is to maximize its prediction confidence on real samples. Therefore, D needs to be trained with gradient ascent (the max operator in the objective). The update rule for D is as follows:
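$$\theta_d \leftarrow \theta_d + \alpha \, \nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \Big[\log D\big(x^{(i)}\big) + \log\big(1 - D\big(x^{*(i)}\big)\big)\Big]$$

Here, α is the learning rate (step size), and x*(i) = G(z(i)) is the fake sample generated from the i-th noise vector in the mini-batch.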


In this formula, θd denotes the parameters of D (such as convolution kernels and weights in fully connected layers), m is the size of the mini-batch (or batch size for short), and i is the index of a sample within the mini-batch. Here, we assume that we are using mini-batches to feed the training data, which is fairly reasonable since it’s the most commonly used and empirically effective strategy. Therefore, the gradients are averaged over the m samples.

There are three different ways to feed training data into models:

  1. One sample at a time, which is often referred to as stochastic (for example, Stochastic Gradient Descent or SGD).

  2. A handful of samples at a time, which is called mini-batch.

  3. All samples at one time, which is, in fact, called batch.

The stochastic way introduces too much randomness, so a single bad sample can undo the good work of several previous training steps. The full batch requires too much memory to compute. Therefore, we feed data to all models by mini-batch in this module, even though we may loosely refer to it as just batch.
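For concreteness, here is a small sketch (assuming PyTorch and a made-up tensor dataset; the names are illustrative, not from this module) showing that the three strategies differ only in the batch_size argument:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset of 1,000 flattened 28x28 samples.
data = TensorDataset(torch.randn(1000, 784))

stochastic = DataLoader(data, batch_size=1, shuffle=True)    # one sample per step
mini_batch = DataLoader(data, batch_size=32, shuffle=True)   # a handful per step
full_batch = DataLoader(data, batch_size=len(data))          # all samples at once
```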

The goal of the generator network, G, is to fool the discriminator, D, into believing that the generated samples are real. Therefore, training G amounts to maximizing D(G(z)), or equivalently minimizing 1 - D(G(z)). Hence, G needs to be trained with gradient descent (the min operator in the objective). The update rule for G is as follows:
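$$\theta_g \leftarrow \theta_g - \alpha \, \nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\Big(1 - D\big(G\big(z^{(i)}\big)\big)\Big)$$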


In this formula, θg denotes the parameters of G, m is the size of the mini-batch, and i is the index of a sample within the mini-batch.
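To make the two update rules concrete, here is a minimal PyTorch-style sketch of a single training step. The architectures, data, and hyperparameters below are placeholder assumptions for illustration, not the implementation used later in this module:

```python
import torch
import torch.nn as nn

# Placeholder networks; any generator/discriminator pair with
# compatible shapes would work here.
G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())

opt_d = torch.optim.SGD(D.parameters(), lr=0.01)
opt_g = torch.optim.SGD(G.parameters(), lr=0.01)
bce = nn.BCELoss()  # binary cross-entropy supplies the log terms

m = 32                        # mini-batch size
real_x = torch.randn(m, 784)  # stand-in for a mini-batch of real samples
z = torch.randn(m, 64)        # random noise z fed to the generator

# Discriminator step: ascend log D(x) + log(1 - D(x*)).
# Minimizing BCE against labels 1 (real) and 0 (fake) is equivalent to
# gradient ascent on the objective, averaged over the m samples.
opt_d.zero_grad()
loss_d = bce(D(real_x), torch.ones(m, 1)) + bce(D(G(z).detach()), torch.zeros(m, 1))
loss_d.backward()
opt_d.step()

# Generator step: descend log(1 - D(G(z))), following the formula above.
# (Many implementations instead maximize log D(G(z)) for stronger early
# gradients; this sketch keeps the literal form from the text.)
opt_g.zero_grad()
loss_g = torch.log(1.0 - D(G(z)) + 1e-8).mean()
loss_g.backward()
opt_g.step()
```

Note the detach() call in the discriminator step: it stops gradients from flowing into G while D is being updated, so each network is trained only on its own turn.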

If you are unfamiliar with the concept of gradient descent (GD), think of it as a little boy kicking a sticky ball across bumpy terrain. The boy wants the ball to rest at the bottom of the lowest pit so that he can call it a day and go home. The ball is sticky, so it doesn’t roll after it hits the ground, even on a slope. Therefore, where the ball lands is determined by the direction and force of each kick. The force of the kick corresponds to the step size (or the learning rate), and the direction of the kick is determined by the shape of the terrain under his feet.

An efficient choice is the downhill direction, which is the negative gradient of the loss function with respect to the parameters. This is why we often use gradient descent to minimize an objective function. However, the boy is so obsessed with the ball that he only stares at it and never looks up to find the lowest pit over a wider range. Therefore, the GD method is sometimes inefficient, because it can take a very long time to reach the bottom. Gradient ascent is the opposite of gradient descent: it climbs toward the highest peak.
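To tie the analogy back to the math, here is a minimal sketch of plain gradient descent on a toy one-dimensional loss f(θ) = (θ − 3)²; the learning rate plays the role of how hard the boy kicks the ball:

```python
def grad(theta):
    return 2 * (theta - 3)     # derivative of f(theta) = (theta - 3) ** 2

theta, lr = 0.0, 0.1           # starting point and step size (learning rate)
for _ in range(50):
    theta -= lr * grad(theta)  # step in the negative gradient (downhill) direction

print(round(theta, 4))         # converges toward the minimizer theta = 3
```

Gradient ascent simply flips the sign of the update, stepping along the positive gradient instead.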



