High-quality data can be a goldmine, opening new possibilities for insights, more accurate predictions, and smarter algorithms. However, acquiring, collecting, and labelling data can be lengthy, tedious, and expensive. But while the synthesis of gold is not feasible, the creation of data is. The secret recipe: take a few GPUs, add a GAN, and train it for a few weeks.
The Need for Augmented Data
With the rise of artificial intelligence, the requirements for data increased in quality as well as in quantity. Large amounts of high-quality data improve the performance of machine learning models. Increasing the size of a dataset can effectively enhance model performance, particularly for cases when models only work well on data they were trained with, but fail to accurately predict future observations (i.e., overfitting).
However, quantity is not all that matters. The quality of data is essential, too: a machine learning algorithm can only perform as well as its training data allows. This data-centric aspect of machine learning is particularly significant when relevant data is scarce. Consider a medical company that wants to build an AI tool to help diagnose cancer from X-ray scans. For the underlying machine learning model to perform well, well-labelled and well-distributed training images are essential. However, cancerous samples in training data are rare, often protected by privacy laws, and less available than healthy samples.
Hence, when building such a diagnostic product, it would be beneficial to create fake data to supplement the real data. Generating realistic fake data can be advantageous in many business areas. For defect detection in particular, it is often the only feasible way to obtain better-balanced training datasets, since examples of the broken state are naturally scarce.
Data augmentation is the way to go when real data is rare or privacy needs to be protected. In those cases, generating data is usually less expensive than collecting and labelling more of it. In this article, we will focus on the advantages of applying data augmentation to image data.
Traditional Data Augmentation
Artificially increasing the variance in training data is called data augmentation, and the traditional way to augment visual data is to transform the input images. Examples of such transformations are positional changes such as rotating, mirroring, and cropping, or colour alterations such as changing the brightness, hue, or saturation of an image. These standard data augmentation techniques significantly reduce overfitting in machine learning models.
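To make these transformations concrete, here is a minimal sketch using NumPy. The function name `augment`, the brightness range of 0.8 to 1.2, and the random stand-in image are assumptions chosen for illustration. Cropping and hue or saturation changes are left out to keep the example short.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly transformed copy of an H x W x 3 float image in [0, 1]."""
    out = image.copy()
    # Positional transformations: mirror and rotate.
    if rng.random() < 0.5:
        out = np.fliplr(out)                      # horizontal mirror
    quarter_turns = int(rng.integers(0, 4))       # 0, 90, 180 or 270 degrees
    out = np.rot90(out, k=quarter_turns)          # swaps H and W for odd k
    # Colour transformation: scale the brightness, then clip to the valid range.
    factor = rng.uniform(0.8, 1.2)
    return np.clip(out * factor, 0.0, 1.0)

rng = np.random.default_rng(seed=0)
image = rng.random((4, 4, 3))                     # stand-in for a real training image
augmented = [augment(image, rng) for _ in range(5)]
```

Each call draws new random parameters, so the same source image yields a different variant every time it is loaded.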
Basic image augmentation techniques are already implemented in most computer vision packages and can be applied while loading the training data, so the augmented images do not need extra storage space. Although cheap and simple, these techniques suffer from low diversity in their results. A model learns better from data with higher variance than those standard techniques can provide. Generative Adversarial Networks (GANs) tackle this variance issue and can scale up the amount of high-quality training data tremendously.
What is a GAN?
A GAN is an architecture with two adversarial neural networks competing against one another. The two networks have different jobs. The Discriminator network’s job is to evaluate a set of images and assess whether they are real or artificially generated. Put simply, the Discriminator is a classifier that differentiates between real and fake data. The second network’s purpose, on the other hand, is to generate fake images and try to fool the Discriminator into believing that those generated images are real. That is why it is referred to as the Generator network.
Both the Generator and the Discriminator are neural networks, and both are trained in the process. Since GANs are typically used for visual data, the networks are usually convolutional in nature. The Generator gets feedback from the Discriminator, learning how well its fake images perform. Over many training epochs, the Generator learns which outputs do the trick and which images are easily detectable as fake. The Discriminator, on the other hand, improves its classification skills. Both networks need to learn in a balanced way to optimize the generated output.
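The feedback loop between the two networks can be illustrated with the losses each one minimizes. The sketch below uses one common formulation, binary cross-entropy with a non-saturating Generator loss. The article does not state which loss its models use, so this is an assumption. The function names and the example scores are also made up for illustration. The scores stand for the Discriminator's estimated probability that an image is real.

```python
import math

EPS = 1e-12  # avoids log(0)

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real images should score near 1, fakes near 0."""
    real_term = -sum(math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating loss: low when the Discriminator scores fakes near 1."""
    return -sum(math.log(p + EPS) for p in d_fake) / len(d_fake)

# Case 1: the Discriminator confidently spots the fakes.
spotted_d = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.1, 0.05])
spotted_g = generator_loss(d_fake=[0.1, 0.05])

# Case 2: the Generator fools the Discriminator.
fooled_d = discriminator_loss(d_real=[0.9, 0.95], d_fake=[0.8, 0.9])
fooled_g = generator_loss(d_fake=[0.8, 0.9])
```

When the fakes are spotted, the Discriminator's loss is low and the Generator's loss is high. When the fakes pass as real, the situation is reversed, which is what drives the two networks to improve against each other.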
Even with the architecture and training data set up, a successful model is not guaranteed. The two most common challenges faced by GANs during training are mode collapse and non-convergence. The latter leads to unsatisfying outcomes, in which the generated images fail to mimic realistic pictures and the model does not improve with more training time. Mode collapse, on the other hand, often results in perfectly realistic outputs that all look the same. In a naive GAN model, a Generator that learns to produce one image (or a handful of images) that fools the Discriminator counts as successful. For data augmentation, however, this does not do the job: we want a wide variety of new images for high-quality data generation.
How do we solve these challenges? The answer lies in choosing an appropriate loss. The loss function determines the direction in which the model is optimized. A loss that leads to efficient convergence and, where necessary, penalizes a lack of variance in the output distribution greatly increases the chances of success.
The quality of the GAN output can be assessed not only visually but also in an objective way. One such method is the Fréchet Inception Distance (FID). Here, the real as well as the fake images are sent through a pre-trained neural network (an Inception network, hence the name). The output for each image is a feature vector. To compare two sets of images, we evaluate the feature vectors of each whole set, which captures the differences between the fake data and the original distribution. The FID metric measures the distance between those two distributions. A small value means that the pre-trained network sees little difference between them. Because it compares whole distributions rather than single images, the metric also takes variety into account.
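The distance itself can be sketched in a few lines of NumPy. The standard FID fits a Gaussian (mean and covariance) to each set of feature vectors and computes the Fréchet distance between the two Gaussians. In the sketch below, the function name `frechet_distance` and the random arrays are assumptions for illustration. The arrays merely stand in for feature vectors that a pre-trained network would produce.

```python
import numpy as np

def frechet_distance(feats_real, feats_fake):
    """Fréchet distance between Gaussians fitted to two sets of feature vectors.

    Each argument is an array of shape (num_images, feature_dim).
    """
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    # Tr((cov_r cov_f)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_f, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_r @ cov_f)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_r - mu_f) ** 2)
    return float(mean_term + np.trace(cov_r) + np.trace(cov_f) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 8))        # stand-in for real-image features
similar = rng.normal(0.0, 1.0, size=(500, 8))     # fakes with matching statistics
collapsed = rng.normal(0.0, 0.05, size=(500, 8))  # fakes with almost no variety

fid_similar = frechet_distance(real, similar)
fid_collapsed = frechet_distance(real, collapsed)
```

The `collapsed` set mimics mode collapse: its mean matches the real data, but its spread is tiny. The covariance terms penalize this, so its distance to the real set comes out clearly larger than that of the `similar` set.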
Adversarial architectures have come a long way since their first emergence in 2014. There is still some distance to cover before reaching the final goal of full output control. The generator network presented in this article uses random noise vectors as input. If each dimension of such a vector could be disentangled and mapped to a specific output feature, we’d have completely controlled image generation.
So far, however, those features are entangled and cannot be naively mapped to features that humans perceive as independent. New architectures appear regularly, tackling the imperfections of their predecessors and constraining the output space a bit further with each step. The complexity of today’s deep learning methods leads to increasingly sophisticated architectures.
Approaches using sub-spaces, each trained on a different feature, could be one solution, but there are many more. To get an idea of how complex the field of GANs has become, let us look at one of its most prominent architectures: StyleGAN. StyleGAN uses a mapping network to map the random input to an intermediate latent vector, which is then fed into the generator. This mapping yields a latent space that is better disentangled than the input space and whose dimensions carry more meaning. Furthermore, instead of being fed in once, the latent vector is an input to each generator layer.
This property allows the GAN to mix styles of different images, by inputting different latent vectors into different generator layers. These are only some of the specific tricks this GAN architecture uses. The example boxes in this article were built with its successor StyleGAN2, which uses slight variations of those features. The number of different architectures is immense and fast-growing, with increasing complexity, making the upcoming years exciting.
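The idea of style mixing can be sketched without any deep learning library. The snippet below is a conceptual illustration only, not StyleGAN's actual implementation. The function name `mix_styles`, the layer count, the crossover point, and the tiny latent size are all hypothetical.

```python
import random

def mix_styles(w_a, w_b, num_layers, crossover):
    """Return one latent vector per generator layer.

    Layers before `crossover` receive w_a, the remaining layers receive w_b.
    """
    return [w_a if layer < crossover else w_b for layer in range(num_layers)]

random.seed(0)
latent_dim = 4  # real models use far larger latent vectors
w_a = [random.gauss(0.0, 1.0) for _ in range(latent_dim)]  # latent of image A
w_b = [random.gauss(0.0, 1.0) for _ in range(latent_dim)]  # latent of image B

# The first two layers are styled by image A, the other four by image B.
styles = mix_styles(w_a, w_b, num_layers=6, crossover=2)
```

Moving the crossover point changes how many layers each source image influences, and therefore how the two styles are blended in the output.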
The contribution of GANs in many areas, particularly medicine, is stunning and valuable. They are one of the most important breakthroughs in computer vision and the generation of visual data. Nevertheless, there are dangers associated with using them. The ability to generate images and videos can (and will) be abused. Deep fakes are generated images or videos matching the look and/or voice of real, usually famous, people. As a form of identity theft, they can easily enable unethical behaviour: real-looking visual content can be used to misinform and deceive people, and it offers the tools to manipulate elections and support ill-intended propaganda. Distinguishing between fake and real media requires excellent discriminators, but excellent discriminators might just help to build excellent generators. Hence, tackling this problem by purely technical means is unrealistic: the better we get at distinguishing between real and fake, the better the generators trained against those detectors become, and the harder the distinction gets. Unfortunately, if regulations are slow to tackle the misuse of AI, those powerful deep learning tools are going to be misused.
Generative Adversarial Networks are a powerful tool for high-quality image augmentation and synthesis. After outcompeting most other approaches to creating new visual data for years, e.g., Variational Autoencoders, GANs have recently found a serious competitor in diffusion models. However, with the attention-based mechanisms first introduced in transformers and the 3D progress of NeRFs entering the GAN universe, progress shows no sign of cooling down. The future of artificial data generation is bright and multifarious.