Mathematical background of GANs
Let’s take a look at the math behind this process to get a better understanding of the mechanism. Let
• x: real data (e.g., real images)
• x* = G(z): fake data generated by G from random noise z
• D(x): probability that the discriminator thinks x is real
• D(x*): probability that the discriminator thinks the fake data is real
Global maxima and minima
As mentioned before, the goal of the discriminator, D, is to correctly tell real data from fake samples. In other words, D wants D(x) to be as large as possible and D(x*) to be as small as possible, which is the same as making 1 - D(x*) as large as possible. Over a mini-batch of samples, the discriminator therefore tries to maximize

$$\frac{1}{m}\sum_{i=1}^{m}\left[\log D\left(x^{(i)}\right)+\log\left(1-D\left(x^{*(i)}\right)\right)\right]$$

In this formula, m is the number of samples fed to the models at each training step, and x^{*(i)} = G(z^{(i)}) is the i-th fake sample in the batch.
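To make this concrete, here is a rough PyTorch-style sketch of how this mini-batch objective could be computed. The tiny networks, layer sizes, and batch size below are placeholders invented for illustration, not the models built later in this module.

```python
import torch
import torch.nn as nn

# Placeholder networks and sizes, used only for this sketch.
m, noise_dim, data_dim = 64, 10, 784
G = nn.Sequential(nn.Linear(noise_dim, data_dim), nn.Tanh())    # stand-in generator
D = nn.Sequential(nn.Linear(data_dim, 1), nn.Sigmoid())         # stand-in discriminator

x = torch.randn(m, data_dim)     # stand-in for a mini-batch of real data
z = torch.randn(m, noise_dim)    # random noise fed to the generator
x_fake = G(z)                    # x* = G(z)

# (1/m) * sum( log D(x) + log(1 - D(x*)) ), which D tries to maximize.
eps = 1e-8                       # keeps log() away from zero
J_D = (torch.log(D(x) + eps) + torch.log(1 - D(x_fake) + eps)).mean()
```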
There are three different ways to feed training data into models:
• One sample at a time, which is often referred to as stochastic (for example, Stochastic Gradient Descent, or SGD).
• A handful of samples at a time, which is called mini-batch.
• All samples at one time, which is, in fact, called batch.
The stochastic approach introduces too much randomness: a single bad sample can undo the good work of several previous training steps. The full batch requires too much memory to compute. Therefore, we feed data to all models in mini-batches in this module, even though we may loosely refer to them simply as batches.
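In PyTorch, these three options usually just correspond to different batch_size values passed to a DataLoader; the toy dataset below is made up purely for illustration.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# A toy dataset of 1,000 random "samples", invented for this example.
dataset = TensorDataset(torch.randn(1000, 784))

stochastic_loader = DataLoader(dataset, batch_size=1, shuffle=True)             # one sample per step
mini_batch_loader = DataLoader(dataset, batch_size=64, shuffle=True)            # a handful of samples per step
full_batch_loader = DataLoader(dataset, batch_size=len(dataset), shuffle=True)  # all samples at once
```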
The goal of the generator network, G, is the opposite: to fool the discriminator into believing the fake samples are real, which means making D(x*) as close to 1 as possible, or equivalently making 1 - D(x*) as small as possible. Over the same mini-batch, the generator therefore tries to minimize

$$\frac{1}{m}\sum_{i=1}^{m}\log\left(1-D\left(G\left(z^{(i)}\right)\right)\right)$$

To pull this objective down, the parameters of G are updated with gradient descent (GD); the discriminator's objective above, by contrast, is pushed up with gradient ascent.
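Continuing the earlier sketch (again with placeholder networks and sizes rather than the module's actual models), the generator's mini-batch objective and a single plain gradient-descent step on its parameters might look like this:

```python
import torch
import torch.nn as nn

# The same kind of placeholder networks as in the discriminator sketch above.
m, noise_dim, data_dim = 64, 10, 784
G = nn.Sequential(nn.Linear(noise_dim, data_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(data_dim, 1), nn.Sigmoid())

# (1/m) * sum( log(1 - D(G(z))) ), which G tries to minimize.
z = torch.randn(m, noise_dim)
J_G = torch.log(1 - D(G(z)) + 1e-8).mean()

# One plain gradient-descent step on G's parameters (the learning rate is arbitrary here).
J_G.backward()
with torch.no_grad():
    for p in G.parameters():
        p -= 0.001 * p.grad
        p.grad = None
```

In practice, the update is usually delegated to an optimizer such as torch.optim.Adam, but the hand-written step above stays closer to the formula.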
If you are unfamiliar with the concept of GD, think of it as a little boy kicking a sticky ball across bumpy terrain. The boy wants the ball to end up at the bottom of the lowest pit so that he can call it a day and go home. The ball is sticky, so it doesn't roll after it hits the ground, even on a slope. Where the ball lands is therefore determined entirely by the direction and strength of each kick. How hard the boy kicks corresponds to the step size (or learning rate), and the direction of the kick is determined by the shape of the terrain under his feet.
An efficient choice is the downhill direction, which is the negative gradient of the loss function with respect to the parameters; this is why we often use gradient descent to minimize an objective function. However, the boy is so fixated on the ball that he only stares at it and never looks up to search for the lowest pit over a wider area. As a result, gradient descent can be inefficient and may take a very long time to reach the bottom. Gradient ascent is the opposite of gradient descent: it follows the positive gradient to find the highest peak.
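To tie the analogy back to code, here is a minimal gradient-descent loop on a made-up one-dimensional "terrain" whose lowest pit sits at w = 3; the step size plays the role of how hard the boy kicks, and the negative gradient tells him which way is downhill.

```python
import torch

w = torch.tensor(10.0, requires_grad=True)   # the ball's starting position
step_size = 0.1                              # how hard the boy kicks (the learning rate)

for _ in range(100):
    loss = (w - 3.0) ** 2          # a stand-in "terrain" with its lowest pit at w = 3
    loss.backward()
    with torch.no_grad():
        w -= step_size * w.grad    # kick downhill: move along the negative gradient
        w.grad.zero_()

print(w.item())                    # ends up very close to 3.0, the bottom of the pit
```

Flipping the sign of the update (adding the gradient instead of subtracting it) would turn this into gradient ascent, climbing toward a peak instead of descending into a pit.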